Text Mining in Keyword Extraction

Revision as of 09:17, 20 July 2016 by Phucnguyen

Project Description and Goal

Students: Phuc Nguyen and Aditya Subramanian

Mentor: Dr. Thomas Kaczmarek

Description: Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured text. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies the terms that best represent the content of a document. The extracted keywords can then be used to classify and cluster documents. Although keywords are widely used by search engines to locate information, producing appropriate keywords by hand is time-consuming given the massive amount of text available today. Many methods have therefore been developed over the years, and new ones are constantly being proposed. Prevalent algorithms and models include TF-IDF (Term Frequency-Inverse Document Frequency), RAKE (Rapid Automatic Keyword Extraction), and TextRank; less common approaches use lexical chains or a Bayes classifier.
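As an illustration of the simplest of these models, the following is a minimal sketch of TF-IDF scoring, assuming documents are already tokenized into lists of words. The function name `tf_idf` and the sample documents are hypothetical, and the unsmoothed formula tf × log(N / df) is only one of several common variants; it is not taken from the project's own code.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Uses relative term frequency and the unsmoothed idf log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, for illustration only.
docs = [
    "text mining finds keywords in text".split(),
    "search engines rank documents".split(),
    "keywords help search engines".split(),
]
scores = tf_idf(docs)
# Candidate keywords for the first document, highest score first.
top = sorted(scores[0], key=scores[0].get, reverse=True)[:3]
```

A term that appears in every document receives a score of zero under this variant, which is why TF-IDF needs a surrounding corpus, whereas RAKE and TextRank can work on a single document.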

This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each by comparing their results with keyword lists produced by humans, and then proposes a new approach based on the existing methods. First, we obtain a corpus and select sample documents from it. For each document, we generate a keyword list with each algorithm under test and also take the keyword list supplied by the document's author. Finally, we ask an expert in the field to rank all the lists. We repeat the experiment with several documents and use the outcomes to gauge the potential of each method. To avoid biasing the expert, the lists are not labelled and the order in which they are presented is shuffled randomly each time.
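The blinding step can be sketched as follows. This is only an illustration of the idea, assuming the keyword lists are held in a dictionary keyed by method name; the function `blind_lists`, the neutral labels, and the sample keywords are hypothetical.

```python
import random


def blind_lists(keyword_lists, rng=None):
    """Shuffle keyword lists and replace method names with neutral labels.

    keyword_lists: dict mapping method name -> list of keywords.
    Returns (anonymous, key), where `anonymous` maps labels such as
    "List A" to keyword lists in shuffled order, and `key` maps each
    label back to the method name (kept away from the expert).
    """
    rng = rng or random.Random()
    methods = list(keyword_lists)
    rng.shuffle(methods)
    anonymous = {}
    key = {}
    for i, method in enumerate(methods):
        label = f"List {chr(ord('A') + i)}"
        anonymous[label] = keyword_lists[method]
        key[label] = method
    return anonymous, key


# Hypothetical keyword lists for one document.
lists = {
    "TF-IDF": ["text", "mining", "keyword"],
    "RAKE": ["keyword extraction", "text mining"],
    "TextRank": ["keyword", "graph", "rank"],
    "Author": ["keyword extraction", "TextRank"],
}
anonymous, key = blind_lists(lists)
```

Calling `blind_lists` again for each document gives a fresh ordering, so the expert cannot learn which position corresponds to which method.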

Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done [4], we consider many more factors than previous work has. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of each word, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current problem is to refine the new approach further by using tools from numerical analysis to reduce runtime complexity.
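For context, the baseline that Phuc's work starts from can be sketched as below: a plain TextRank-style ranking that links words co-occurring within a fixed window and scores them with a PageRank iteration. This is not the improved method or the distance-based co-occurrence model described above. The function names, the window size, the damping factor of 0.85, and the normalized form of the update (scores summing to 1) are assumptions made for illustration.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected graph linking words at most `window` positions apart."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(graph, damping=0.85, tol=1e-6, max_iter=200):
    """Score nodes by repeated PageRank updates until the scores settle."""
    nodes = list(graph)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new = {}
        for v in nodes:
            # Each neighbour shares its score equally among its own links.
            incoming = sum(score[u] / len(graph[u]) for u in graph[v])
            new[v] = (1 - damping) / n + damping * incoming
        delta = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if delta < tol:
            break
    return score


# Hypothetical token sequence, for illustration only.
tokens = "keyword extraction ranks words keyword graphs link words".split()
scores = textrank(cooccurrence_graph(tokens))
ranked = sorted(scores, key=scores.get, reverse=True)
```

Each pass over the graph costs time proportional to the number of edges, so the total runtime depends on how many iterations are needed, which is where the numerical-analysis question of convergence rate enters.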

Background References

1) TextRank: Bringing Order into Texts by Rada Mihalcea and Paul Tarau.

2) Survey of Keyword Extraction Techniques by Brian Lott.

3) Automatic Keyword Extraction from Individual Documents by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.

4) Improved Automatic Keyword Extraction Given More Linguistic Knowledge by Anette Hulth.

5) Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information by Yutaka Matsuo and Mitsuru Ishizuka.

Weekly Log

Week 1 (5/31 to 6/3)

  • Attend REU orientation activities, fill out forms and paperwork
  • Meet with Dr. Kaczmarek to discuss the goals and scope of the project
  • Read articles related to text mining to find a potential research topic
  • Come up with a research topic and discuss it with Dr. Kaczmarek

Week 2 (6/6 to 6/10)

  • Continue reading about research already done in the field
  • Finalize research topic
  • Create list of factors for machine learning algorithm
  • Outline proposed algorithm for new co-occurrence matrix
  • Start looking at RAKE, TextRank, and TF-IDF

Week 3 (6/13 to 6/17)

  • Complete code for machine learning algorithm
  • Complete code to test proposed algorithm with other algorithms
  • Review tools from linear algebra and mathematical analysis
  • Attempt to prove convergence and find runtime complexity for the original TextRank method

Week 4 (6/20 to 6/24)

  • Meet with Dr. Kaczmarek to explain the new approach to perform TextRank
  • Meet with Aditya to talk about collaboration between the two projects
  • Begin to look at the proofs of the Perron-Frobenius Theorem and the Power Method Convergence Theorem
  • Review graph theory and theory of probability
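
For reference, the two theorems named in this week's list are usually applied to ranking problems as summarized below. The notation (M, P, x_k, d, n, λ) is ours and is not taken from the project report; the statement assumes a column-stochastic link matrix P and a dominant eigenvalue that is strictly largest in modulus.

```latex
% PageRank-style matrix with damping factor d on n nodes
M = d\,P + \frac{1 - d}{n}\,\mathbf{1}\mathbf{1}^{\mathsf{T}}, \qquad 0 < d < 1

% Power method: repeated multiplication and normalization
x_{k+1} = \frac{M x_k}{\lVert M x_k \rVert_1}

% Convergence rate toward the dominant eigenvector x^{*}
\lVert x_k - x^{*} \rVert = O\!\left( \left| \frac{\lambda_2}{\lambda_1} \right|^{k} \right),
\qquad \lambda_1 = 1, \quad \lvert \lambda_2 \rvert \le d
```

Because M is positive, the Perron-Frobenius Theorem guarantees that the eigenvalue 1 has a unique positive eigenvector, and the bound on the second eigenvalue means the error shrinks by roughly a factor of d per iteration.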

Week 5 (6/27 to 7/1)

  • Complete code for co-occurrence matrix creator
  • Integrate PageRank to improve TextRank performance
  • Begin to write a formal report paper
  • Give a presentation on what we have come up with so far

Week 6 (7/4 to 7/8)

  • Begin to look at the rate of convergence for the new approach
  • Review theory of convergence from numerical analysis

Week 7 (7/11 to 7/15)

  • Start to look at some numerical techniques used to approximate eigenvectors
  • Review complexity theory involving big-O notation and the theory of Markov chains
  • Begin to write code to test the new approach; conclude that the theory is valid and that the new approach can reasonably be implemented in practice

Week 8 (7/18 to 7/22)

  • Prepare poster to be submitted by the end of the week
  • Meet with Dr. Kaczmarek to discuss the plan for the remaining two weeks and begin to wrap everything up

Week 9 (7/25 to 7/29)

  • Prepare for poster presentation
  • Prepare for final presentation next week
  • Meet with Dr. Kaczmarek to finalize report paper

Week 10 (8/1 to 8/5)

  • Submit report paper
  • Deliver final presentation
  • Complete remaining paperwork and surveys