Text Mining in Keyword Extraction

From REU@MU
Revision as of 18:33, 3 June 2016 by Phucnguyen

Project Description and Goal

Student: Phuc Nguyen

Mentor: Dr. Thomas Kaczmarek

Description: Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant terms from unstructured texts. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate by hand, since the process is time-consuming for humans given the massive amount of information now available. Many methods have therefore been developed over the years, and new solutions are constantly being proposed. Prevalent examples include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches include lexical chains and Bayes classifiers.
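To make the TF-IDF idea concrete, the following is a minimal sketch rather than the project's implementation. It assumes a simple lowercase alphabetic tokenizer, term frequency normalized by document length, and the unsmoothed inverse document frequency log(N / df); the function names tokenize and tfidf_keywords are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_k=3):
    """Return the top_k terms of docs[doc_index], ranked by TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    target = tokenized[doc_index]
    tf = Counter(target)

    # Score = (relative term frequency) * log(N / document frequency).
    scores = {
        term: (count / len(target)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }

    # Sort by descending score; break ties alphabetically so results are deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang a song",
]
print(tfidf_keywords(docs, 0))
```

In this toy corpus the word "the" appears in every document, so its inverse document frequency is log(1) = 0 and it scores zero, while words unique to the first document rank highest. Real corpora would normally also need stop-word removal and stemming, which this sketch omits.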

This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing their results with sample keyword lists generated by humans and, building on the existing methods, proposes a new approach. First, we obtain a corpus and select sample documents from it. For each document, we then generate a keyword list with each algorithm under test and obtain a keyword list written by the document's author. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and use the outcomes to gauge the potential of each method. To avoid bias on the part of the expert, the lists should not be labelled, and the order in which they are presented should be shuffled randomly each time.
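The blinding step described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the project's actual tooling: the names blind_lists and mean_ranks are hypothetical, the neutral labels "List 1", "List 2", and so on are an arbitrary choice, and averaging the expert's ranks across documents is only one possible way to gauge each method, since the description does not specify how the outcomes are combined.

```python
import random


def blind_lists(keyword_lists, rng=None):
    """Shuffle the keyword lists and hide their sources behind neutral labels.

    keyword_lists maps a source name (an algorithm, or the author) to its keywords.
    Returns (presented, key): presented maps neutral labels to keyword lists,
    and key maps each neutral label back to the hidden source name.
    """
    if rng is None:
        rng = random.Random()
    sources = list(keyword_lists)
    rng.shuffle(sources)  # a fresh random order for every document

    presented = {}
    key = {}
    for i, source in enumerate(sources, start=1):
        label = f"List {i}"
        presented[label] = list(keyword_lists[source])
        key[label] = source
    return presented, key


def mean_ranks(rankings, keys):
    """Average the expert's rank per source across documents (assumed aggregation).

    rankings[i] maps neutral labels to the rank the expert gave for document i;
    keys[i] is the matching label-to-source key returned by blind_lists.
    """
    totals = {}
    counts = {}
    for ranking, key in zip(rankings, keys):
        for label, rank in ranking.items():
            source = key[label]
            totals[source] = totals.get(source, 0) + rank
            counts[source] = counts.get(source, 0) + 1
    return {source: totals[source] / counts[source] for source in totals}
```

A usage pattern would be to call blind_lists once per document, show only the presented dictionary to the expert, keep the key private, and pass the collected rankings and keys to mean_ranks once all documents have been ranked. A lower mean rank then indicates a list source the expert tended to prefer.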

References

1) TextRank: Bringing Order into Texts by Rada Mihalcea and Paul Tarau.

2) Survey of Keyword Extraction Techniques by Brian Lott.