Difference between revisions of "Text Mining in Keywords Extraction"

Latest revision as of 18:20, 3 June 2016

Project Description and Goal

Student: Phuc Nguyen

Mentor: Dr. Thomas Kaczmarek

Description: Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured text. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used by search engines to locate information, appropriate keywords are difficult to produce by hand, because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed. Prevalent examples include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches include lexical chains and Bayes classifiers.
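To make the TF-IDF idea concrete, here is a minimal sketch that ranks the words of one document by term frequency multiplied by inverse document frequency. The function names (tokenize, tfidf_keywords), the toy corpus, and the plain log(N/df) weighting are assumptions chosen for illustration, not the project's actual implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the words of docs[doc_index] by TF-IDF against the whole corpus.

    Illustrative sketch: uses tf = count / document length and
    idf = log(N / df), with no stop-word removal or smoothing.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    tokens = tokenized[doc_index]
    tf = Counter(tokens)
    scores = {
        word: (count / len(tokens)) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


# Hypothetical three-document corpus, for illustration only.
docs = [
    "text mining discovers information from text",
    "search engines locate information",
    "keyword extraction supports search",
]
print(tfidf_keywords(docs, 0))
```

In this toy corpus, "information" scores low for the first document because it also appears in the second one. Function words such as "from" can still rank highly, since the sketch does no stop-word filtering.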

This project analyzes several of the methods described above to determine the strengths and weaknesses of each by comparing its results with a sample of human-generated keywords. Building on the existing methods, it then aims to propose a new approach.
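One possible way to compare a method's output with human-generated keywords is exact-match precision, recall, and F1. The page does not say which measure the project uses, so the metric, the function name evaluate_keywords, and the sample keyword lists below are all assumptions made for illustration.

```python
def evaluate_keywords(extracted, reference):
    """Return (precision, recall, F1) of extracted keywords against a human-made list.

    Illustrative sketch: matching is exact after lowercasing and trimming
    whitespace; no stemming or partial-phrase matching is attempted.
    """
    extracted_set = {k.strip().lower() for k in extracted}
    reference_set = {k.strip().lower() for k in reference}
    matches = len(extracted_set & reference_set)
    precision = matches / len(extracted_set) if extracted_set else 0.0
    recall = matches / len(reference_set) if reference_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1


# Hypothetical keyword lists, for illustration only.
extracted = ["text mining", "keyword extraction", "search engines", "information"]
reference = ["Text Mining", "keyword extraction", "TF-IDF"]
precision, recall, f1 = evaluate_keywords(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: "keyword" and "keywords" count as different entries, so a real comparison would likely normalize terms first.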

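References

1) TextRank: Bringing Order into Texts by Rada Mihalcea and Paul Tarau. https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
2) Survey of Keyword Extraction Techniques by Brian Lott. http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf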

Weekly Log

Week 1 (5/31 to 6/3)

  • Attend REU orientation activities and fill out forms and paperwork
  • Meet with Dr. Kaczmarek to discuss the goals and scope of the project
  • Read articles related to text mining to find a potential research topic
  • Come up with a research topic and discuss it with Dr. Kaczmarek