Text Mining in Keyword Extraction
Project Description and Goal
Student: Phuc Nguyen
Mentor: Dr. Thomas Kaczmarek
Description: Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate by hand, since the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches include lexical chains and Bayes classifiers.
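To make the first of these methods concrete, the following is a minimal sketch of TF-IDF keyword scoring. It is an illustration only, not the implementation used in this project: the function names (tokenize, tfidf_keywords), the simple lowercase word tokenizer, and the particular weighting (relative term frequency times the natural log of N/df) are all assumptions, and many other TF-IDF variants exist.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_k=3):
    """Return the top_k terms of docs[doc_index] ranked by TF-IDF score.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents in which each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tokens = tokenized[doc_index]
    tf = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }

    # Sort by descending score; break ties alphabetically so results are stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]
```

Under this weighting, a term that appears in every document receives a score of zero, which is why very common words rarely surface as keywords.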
This project attempts to analyze several of the methods described above and to determine the strengths and weaknesses of each by comparing its results with a sample keyword list generated by humans. Based on the existing methods, the project then aims to propose a new approach.
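One common way to carry out such a comparison is to compute precision, recall, and F1 between the extracted keywords and the human-generated list. The sketch below assumes exact matching after lowercasing and trimming whitespace; the function name evaluate_keywords and this matching rule are illustrative assumptions, not the evaluation protocol defined by the project.

```python
def evaluate_keywords(extracted, reference):
    """Compare extracted keywords with a human-generated reference list.

    Returns (precision, recall, f1). Matching is exact after lowercasing
    and stripping whitespace, which is an assumed simplification.
    """
    extracted_set = {k.strip().lower() for k in extracted}
    reference_set = {k.strip().lower() for k in reference}
    matched = extracted_set & reference_set

    # Guard against empty lists to avoid division by zero.
    precision = len(matched) / len(extracted_set) if extracted_set else 0.0
    recall = len(matched) / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if a method returns four keywords and two of them appear in a human list of three, precision is 2/4 and recall is 2/3. Exact matching is strict: it gives no credit for near matches such as "keyword" versus "keywords", so a stemming step could be added if partial credit is wanted.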
References
1) TextRank: Bringing Order into Texts, by Rada Mihalcea and Paul Tarau.
2) Survey of Keyword Extraction Techniques, by Brian Lott.