Presenters: Nguyen Cam Tu (lecturer, MSc., PhD student) and Tran Mai Vu (MSc. student)
1. Vietnamese multi-document summarization
For a user query, the search engine VNSEN returns a set A of Vietnamese web pages.
The group is considering the following tasks:
- To cluster the set A into groups of Vietnamese web pages A1, A2, …, Ak. We have integrated a clustering component into our Vietnamese search engine VNSEN [4] using the HTC algorithm [CTT08]. We plan to upgrade this component by using a hidden topic model [Tu08].
- For each subset Ai, to apply multi-document summarization to produce a label and a summary [CTT08, VUH08]. We also compare our solution in VNSEN with the corresponding component of the search engine Vivisimo.
The group is also considering Text Segmentation and Title Generation techniques [BDB07, DZS03] for generating a table of contents [Cuo07].
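The cluster-then-label flow described in this section can be illustrated with a minimal sketch. It is not the HTC algorithm or the hidden topic model used in VNSEN: it assumes a simple greedy single-pass clustering over bag-of-words cosine similarity, whitespace tokenization in place of a Vietnamese word segmenter, and labels made of the most frequent terms in each cluster. The function names, the threshold value, and the sample pages are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a real system would use a word segmenter)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_pages(pages, threshold=0.3):
    """Greedy single-pass clustering.

    Each page joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [page index, ...]}
    for i, page in enumerate(pages):
        vec = Counter(tokenize(page))
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # accumulate term counts of the new member
            best["members"].append(i)
    return clusters


def label(cluster, n_terms=2):
    """Label a cluster with its most frequent terms (a stand-in for summarization)."""
    return " ".join(term for term, _ in cluster["centroid"].most_common(n_terms))


# Hypothetical result set A for one query.
pages = [
    "hanoi weather forecast rain",
    "weather forecast hanoi sunny",
    "football match result vietnam",
    "vietnam football match schedule",
]
for c in cluster_pages(pages):
    print(label(c), c["members"])
```

A single pass keeps the sketch short and order-dependent; a production component would more likely work on segmented Vietnamese text and use a sentence-level summarizer to produce the label.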
2. Semantic relation extraction
Building on the research results of Corina Roxana Girju [Rox02], we investigated several types of cause-and-effect relations: adverbial causal links, prepositional causal links, subordination causal links, and clause-integrated links [Han05]. These relations are useful for building a Vietnamese ontology for semantic search in the medical health care domain [TNT08]. To upgrade the Vietnamese search engine VNSEN into a Vietnamese entity search engine [CC07, Cha08], we will study semantic relation extraction and its applications [Rox08].
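As a rough illustration of how the four causal link types can drive extraction, the sketch below matches one cue phrase per type and splits the sentence into cause and effect. The cue phrases are English stand-ins chosen for readability; the Vietnamese cues studied in [Han05] are not reproduced here, and the names CAUSAL_CUES and extract_causal are hypothetical.

```python
import re

# Illustrative English cue phrases, one per link type (assumed, not taken from the cited work).
# Each entry: (link type, cue pattern, True if the cause precedes the cue).
# "because of" is listed before "because" so the longer cue is tried first.
CAUSAL_CUES = [
    ("preposition", r"\bbecause of\b", False),
    ("subordination", r"\bbecause\b", False),
    ("adverbial", r"\btherefore\b", True),
    ("clause-integrated", r"\bthat is why\b", True),
]


def extract_causal(sentence):
    """Return (link_type, cause, effect) for the first matching cue, or None."""
    text = sentence.strip().rstrip(".")
    for link_type, pattern, cause_first in CAUSAL_CUES:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if not match:
            continue
        left = text[:match.start()].strip(" ,;")
        right = text[match.end():].strip(" ,;")
        if not left or not right:
            continue  # a cue with an empty side gives no usable pair
        cause, effect = (left, right) if cause_first else (right, left)
        return link_type, cause, effect
    return None


print(extract_causal("The road was closed because of flooding."))
print(extract_causal("It rained heavily; therefore, the match was cancelled."))
```

Surface cues alone are ambiguous (many connectives also have non-causal readings), so a real extractor would add syntactic or semantic checks on top of this kind of matching.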
References
[BDB07] Branavan S.R.K., Deshpande P., Barzilay R. (2007). Generating a Table-of-Contents, In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics: 544-551, Prague, Czech Republic.
[CC07] Tao Cheng, Kevin Chen-Chuan Chang (2007). Entity Search Engine: Towards Agile Best-Effort Information Integration over the Web, CIDR 2007: 108-113.
[Cha08] Kevin C. Chang (2008). Data-Aware Search on the Web, Act. 2: Entity Search, Technical Report, The Database and Information Systems Laboratory, University of Illinois at Urbana-Champaign (a talk given at a seminar at College of Technology, Vietnam National University, Hanoi, July 08, 2008).
[CTT08] Nguyen Thi Thu Chung, Nguyen Thu Trang, Nguyen Cam Tu, Ha Quang Thuy (2008). An evaluation on clustering component of Vietnamese search engine, The 11th National Conference on Information Technology of Vietnam, Hue, June 12-13, 2008 (in Vietnamese; submitted and presented).
[Cuo07] Nguyen Viet Cuong (2007). Automatically Constructing a Table-of-Contents for long text. Master Thesis, College of Technology, Vietnam National University, Hanoi, November, 2007 (in Vietnamese).
[DZS03] Dorr B., Zajic D., Schwartz R. (2003). Hedge Trimmer: A parse-and-trim approach to headline generation, Proceedings of the HLT-NAACL 2003 Workshop on Text Summarization: 1-8, Edmonton, Canada.
[Han05] Vu Boi Hang (2005). Extracting cause-and-effect relations from Vietnamese documents, Master Thesis, College of Technology, Vietnam National University, Hanoi, June 2005 (in Vietnamese).
[Rox02] Corina Roxana Girju (2002). Text mining for semantic relations, PhD. Thesis, The University of Texas at Dallas, 2002
[Rox08] Corina Roxana Girju (2008). Semantic Relation Extraction and its Applications, Invited tutorial at the European Summer School in Logic, Language and Information (ESSLLI 2008), Hamburg, Germany, August 2008.
[TNT08] Le Dieu Thu, Tran Thi Ngan, Nguyen Cam Tu, Nguyen Thu Trang (2008). A Vietnamese Ontology for semantic searching in the field of Medical Health Care, The 11th National Conference on Information Technology of Vietnam, Hue, June 12-13, 2008 (in Vietnamese; submitted and presented).
[Tu08] Nguyen Cam Tu (2008). Hidden Topic Discovery Towards Classification and Clustering in Vietnamese Web Documents, Master Thesis, College of Technology, Vietnam National University, Hanoi, May, 2008.
[VUH08] Tran Mai Vu, Pham Thi Thu Uyen, Hoang Minh Hien, Ha Quang Thuy (2008). Semantic Similarity of sentences and application for multi-document summarization to evaluate the clustering component of Vietnamese search engine, Workshop on Information Communication Technology (ICTFIT08), College of Science, Vietnam National University, Ho Chi Minh City, November 14, 2008 (in Vietnamese, accepted).
Some Vietnamese language processing utilities
- Nguyen Cam Tu, Phan Xuan Hieu. JvnSegmenter. http://jvnsegmenter.sourceforge.net
- Nguyen Cam Tu. JVnTextpro: A Java-based Vietnamese Text Processing Toolkit, SISLab Software Utility, College of Technology, Vietnam National University, Hanoi.
- Nguyen Cam Tu. JGibbsLDA: A Java and Gibbs Sampling based Implementation of Latent Dirichlet Allocation (LDA), SISLab Software Utility, College of Technology, Vietnam National University, Hanoi.
- http://203.113.130.205:8080/sise: VNSEN Search Engine, SISLab Software, College of Technology, Vietnam National University, Hanoi.