Text Mining in Keyword Extraction

From REU@MU
Revision as of 01:59, 5 July 2016 by Phucnguyen (Talk | contribs)

Project Description and Goal

Students: Phuc Nguyen and Aditya Subramanian

Mentor: Dr. Thomas Kaczmarek

Description: Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies the terms that best represent the content of an unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used by search engines to locate information, appropriate keywords are difficult to produce manually, since the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly proposed to tackle this problem. Prevalent algorithms and models include TF-IDF (Term Frequency-Inverse Document Frequency), RAKE (Rapid Automatic Keyword Extraction), and TextRank; less popular approaches include lexical chains and Bayes classifiers.
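To make the TF-IDF model mentioned above concrete, here is a minimal sketch in Python. It assumes documents are already tokenized into lists of words, and it uses one common weighting (term frequency normalized by document length, idf = log(N / df)); other variants exist. The function names `tf_idf` and `top_keywords` are illustrative and are not taken from the project's code.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists. Assumed weighting:
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(doc_scores, k=3):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus, invented for illustration only.
docs = [
    ["text", "mining", "keyword", "keyword"],
    ["text", "graph", "rank"],
    ["text", "matrix"],
]
scores = tf_idf(docs)
print(top_keywords(scores[0], k=2))
```

A term that occurs in every document (here "text") receives a score of zero, which is why TF-IDF tends to favor words that are frequent in one document but rare across the corpus.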

This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing its results with a sample keyword list generated by humans, and proposes a new approach based on the existing methods. First, we obtain a corpus and select sample documents from it. We then generate a keyword list for each document with every algorithm under test, and also collect the keyword list provided by the document's author. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and use the outcomes to gauge the potential of each method. To avoid bias from the expert, the lists should be unlabelled and the order in which they are presented should be shuffled randomly each time.
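The blinding step in this protocol could be sketched as follows. This is only an illustration of the idea, assuming each method's output is stored in a dictionary keyed by method name; the helper name `blind_lists` and the sample keywords are hypothetical.

```python
import random


def blind_lists(keyword_lists, rng=None):
    """Shuffle keyword lists and replace method names with neutral ids.

    keyword_lists maps a method name to its list of keywords.
    Returns (anonymous, key): `anonymous` is the shuffled list of
    (neutral_id, keywords) pairs shown to the expert, and `key` maps each
    neutral id back to the method name for scoring afterwards.
    """
    rng = rng or random.Random()
    methods = list(keyword_lists)
    rng.shuffle(methods)  # new random order for every evaluation round

    anonymous = []
    key = {}
    for i, method in enumerate(methods, start=1):
        neutral_id = f"List {i}"
        anonymous.append((neutral_id, list(keyword_lists[method])))
        key[neutral_id] = method
    return anonymous, key


# Placeholder keyword lists, invented for illustration only.
lists = {
    "TF-IDF": ["corpus", "frequency"],
    "RAKE": ["keyword extraction", "stop words"],
    "TextRank": ["graph", "ranking"],
    "Author": ["text mining", "keywords"],
}
anonymous, key = blind_lists(lists, random.Random(0))
for neutral_id, keywords in anonymous:
    print(neutral_id, keywords)
```

Keeping the `key` mapping separate from what the expert sees lets the rankings be matched back to methods only after the evaluation is complete.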

Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done [4], we are considering many more factors than earlier work did. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them, given the number of instances of the words, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current problem is to refine the new approach further by using tools from numerical analysis to reduce runtime complexity.
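For context, the baseline that Phuc's work starts from can be sketched as below: a word co-occurrence graph scored with the PageRank-style update from the TextRank paper [1], iterated until the scores stop changing. This is a simplified illustration of the standard formulation, not the improved method developed in this project; the window size, damping factor of 0.85, tolerance, and function names are assumptions chosen for the example.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected graph linking words that appear within `window` tokens."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        neighbours = graph[word]  # ensures every word becomes a node
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbours.add(other)
                graph[other].add(word)
    return graph


def textrank(graph, damping=0.85, tol=1e-6, max_iter=100):
    """Iterate S(v) = (1 - d) + d * sum(S(u) / degree(u)) over neighbours u."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new_scores = {}
        for node in graph:
            rank_sum = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) + damping * rank_sum
        # Largest change in any score during this sweep.
        delta = max(
            (abs(new_scores[n] - scores[n]) for n in graph), default=0.0
        )
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy sentence, invented for illustration only.
tokens = "keyword extraction ranks keyword candidates using keyword graph".split()
graph = cooccurrence_graph(tokens)
scores = textrank(graph)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 3))
```

Each sweep touches every edge once, so the cost is proportional to the number of edges times the number of iterations; the number of iterations needed for convergence is one place where numerical-analysis tools could plausibly help.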

Background References

1) TextRank: Bringing Order into Texts by Rada Mihalcea and Paul Tarau.

2) Survey of Keyword Extraction Techniques by Brian Lott.

3) Automatic Keyword Extraction from Individual Documents by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.

4) Improved Automatic Keyword Extraction Given More Linguistic Knowledge by Anette Hulth.

5) Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information by Yutaka Matsuo and Mitsuru Ishizuka.

Weekly Log

Week 1 (5/31 to 6/3)

  • Attend REU orientation activities, fill out forms and paperwork
  • Meet with Dr. Kaczmarek to discuss the goals and scope of the project
  • Read articles related to text mining to find a potential research topic
  • Come up with a research topic and discuss it with Dr. Kaczmarek

Week 2 (6/6 to 6/10)

  • Continue reading about research already done in the field
  • Finalize research topic
  • Create list of factors for machine learning algorithm
  • Outline proposed algorithm for new co-occurrence matrix
  • Start looking at RAKE, TextRank, and TF-IDF

Week 3 (6/6 to 6/10)

  • Complete code for machine learning algorithm
  • Complete code to test proposed algorithm with other algorithms
  • Begin studying the TextRank algorithm in detail and obtain Python code that implements it
  • Review tools from linear algebra and mathematical analysis

Week 4 (6/13 to 6/17)

  • Meet with Dr. Kaczmarek to discuss obtaining a corpus of documents
  • Start looking at the PageRank algorithm
  • Attempt to prove convergence and find runtime complexity for the original method

Week 5 (6/20 to 6/24)

  • Meet with Dr. Kaczmarek to explain the new approach to TextRank
  • Continue to review tools from linear algebra and mathematical analysis
  • Study the proofs of the Perron-Frobenius Theorem and the Power Method Convergence Theorem

Week 6 (6/27 to 7/1)

  • Complete code for co-occurrence matrix creator
  • Integrate PageRank to improve TextRank performance
  • Begin writing a formal report
  • Give a presentation on our progress so far