Text Mining in Keywords Extraction

Project Description and Goal

Student: Phuc Nguyen

Mentor: Dr. Thomas Kaczmarek

Description: Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant terms from unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate by hand, since the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches include lexical chains and Bayes classifiers.
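As a rough illustration of the TF-IDF model mentioned above, the following minimal sketch scores each term in one document by its relative frequency in that document multiplied by the logarithm of the inverse document frequency across a small corpus. The function names (tokenize, tfidf_keywords), the simple regular-expression tokenizer, and the toy corpus are assumptions made for illustration only; they are not the implementation used in this project, and practical TF-IDF variants often add smoothing or stop-word removal.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_k=3):
    """Return the top_k terms of docs[doc_index] ranked by a basic TF-IDF score.

    Hypothetical helper: tf is the term count divided by the document length,
    idf is log(N / df) with no smoothing.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(tfidf_keywords(docs, 0, top_k=1))
```

In this toy corpus, a word that appears in every document (such as "the") receives a score of zero, while a word unique to one document ranks highest for that document.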

This project analyzes several of the methods described above to determine the strengths and weaknesses of each by comparing their results with a sample set of keywords generated by humans. Based on the existing methods, it then proposes a new approach.
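One common way to compare extracted keywords against a human-generated set is to compute precision, recall, and F1 over exact matches. The sketch below assumes that style of evaluation; the project description does not state which metric is used, and the function name evaluate_keywords and the sample keyword lists are hypothetical.

```python
def evaluate_keywords(extracted, reference):
    """Compare extracted keywords with human-assigned ones using exact matching.

    Returns (precision, recall, f1). Matching is case-insensitive and ignores
    surrounding whitespace; this is an illustrative assumption, not the
    project's stated evaluation procedure.
    """
    extracted_set = {k.strip().lower() for k in extracted}
    reference_set = {k.strip().lower() for k in reference}
    matched = extracted_set & reference_set

    precision = len(matched) / len(extracted_set) if extracted_set else 0.0
    recall = len(matched) / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sample data for illustration only.
extracted = ["text mining", "keywords", "search engine", "TF-IDF"]
reference = ["Text Mining", "keyword extraction", "TF-IDF"]
precision, recall, f1 = evaluate_keywords(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: "keywords" and "keyword extraction" count as a miss here, so a comparison of this kind may understate how close a method comes to the human-chosen keywords.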

References: 1) "TextRank: Bringing Order into Texts" by Rada Mihalcea and Paul Tarau. 2) "Survey of Keyword Extraction Techniques" by Brian Lott.

Weekly Log

Week 1 (5/31 to 6/3)

  • Attend REU orientation activities, fill out forms and paperwork
  • Meet with Dr. Kaczmarek to discuss the goals and scope of the project
  • Read articles related to text mining to find a potential research topic
  • Come up with a research topic and discuss it with Dr. Kaczmarek