Text Mining in Keyword Extraction

Project Description and Goal

Students: Phuc Nguyen and Aditya Subramanian

Mentor: Dr. Thomas Kaczmarek

Description: Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant terms from unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because assigning them by hand is time-consuming given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches include lexical chains and Bayes classifiers.
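As a rough illustration of the TF-IDF idea mentioned above, the following is a minimal sketch, not the implementation used in this project. It assumes whitespace tokenization, relative term frequency, and an unsmoothed logarithmic inverse document frequency; the function name tf_idf_keywords and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf_keywords(docs, doc_index, top_k=3):
    """Rank the terms of docs[doc_index] by TF-IDF against the whole corpus.

    Assumptions (illustrative only): lowercase whitespace tokenization,
    tf = count / document length, idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tokens = tokenized[doc_index]
    tf = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang a song",
]
print(tf_idf_keywords(corpus, 0))
```

In this toy corpus, "the" appears in every document, so its inverse document frequency is zero and it falls to the bottom of the ranking, while words unique to the first document rise to the top.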

This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing their results with sample keyword lists generated by humans, and then proposes a new approach based on the existing methods. First, we obtain a corpus and select sample documents from it. We then generate a keyword list for each document with every algorithm under test, and also collect the keyword list supplied by the document's author. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and use the outcomes to gauge the potential of each method. To avoid bias from the expert, the lists should not be labelled, and the order in which they are presented should be shuffled randomly each time.
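The blinding step in the evaluation protocol above could be sketched as follows. This is only an illustration under stated assumptions: the function name blind_lists, the neutral labels "List A", "List B", and so on, and the example method names and keywords are all hypothetical, and the sketch supports at most 26 lists.

```python
import random

def blind_lists(keyword_lists, seed=None):
    """Shuffle keyword lists and replace method names with neutral labels.

    keyword_lists maps a method name (e.g. "RAKE") to its list of keywords.
    Returns (blinded, key): blinded maps neutral labels to keyword lists and
    is what the expert sees; key maps each label back to the method name and
    is kept hidden until the ranking is complete.
    """
    rng = random.Random(seed)
    items = list(keyword_lists.items())
    rng.shuffle(items)  # a fresh random order for every evaluation round
    blinded, key = {}, {}
    for i, (method, keywords) in enumerate(items):
        label = f"List {chr(ord('A') + i)}"
        blinded[label] = list(keywords)
        key[label] = method
    return blinded, key

# Hypothetical keyword lists for a single document.
lists = {
    "TF-IDF": ["mining", "keyword", "corpus"],
    "RAKE": ["keyword extraction", "text mining"],
    "TextRank": ["extraction", "graph", "keyword"],
    "Author": ["text mining", "keywords"],
}
blinded, key = blind_lists(lists)
for label, keywords in blinded.items():
    print(label, keywords)
```

Calling blind_lists again for each document gives a new random order, so the expert cannot learn which position corresponds to which method.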

Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done [4], we are considering many more factors than previous work has. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of the words, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current problem is to refine the new approach further by using tools from numerical analysis to reduce runtime complexity.
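For context, the baseline that Phuc's work starts from can be sketched as a PageRank-style iteration over a word co-occurrence graph. The code below is a simplified sketch of standard unweighted TextRank, not the new approach described above. The window size, damping factor, tolerance, and the function name textrank are assumptions chosen for illustration, and the usual part-of-speech filtering is omitted.

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, tol=1e-6, max_iter=100):
    """Score words with a PageRank-style iteration on a co-occurrence graph.

    Two distinct words are linked (undirected, unweighted) if they occur
    within `window` positions of each other in the token sequence.
    """
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window + 1]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    words = sorted(neighbors)
    if not words:
        return {}
    scores = {w: 1.0 for w in words}
    for _ in range(max_iter):
        # Each word receives an equal share of the score of every neighbor.
        new_scores = {
            w: (1 - damping)
            + damping * sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in words
        }
        delta = max(abs(new_scores[w] - scores[w]) for w in words)
        scores = new_scores
        if delta < tol:  # stop once the scores have stabilized
            break
    return scores

tokens = "a b a c a d".split()
scores = textrank(tokens, window=1)
print(sorted(scores, key=scores.get, reverse=True))
```

Each pass of the loop is one step of the power method on the damped transition matrix of the graph, which is why the convergence and runtime questions studied in this project reduce to questions about that matrix.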

Background References

1) TextRank: Bringing Order into Texts by Rada Mihalcea and Paul Tarau.

2) Survey of Keyword Extraction Techniques by Brian Lott.

3) Automatic Keyword Extraction from Individual Documents by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.

4) Improved Automatic Keyword Extraction Given More Linguistic Knowledge by Anette Hulth.

5) Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information by Yutaka Matsuo and Mitsuru Ishizuka.

Weekly Log

Week 1 (5/31 to 6/3)

Attend REU orientation activities, fill out forms and paperwork
Meet with Dr. Kaczmarek to discuss the goals and scope of the project
Read articles related to text mining to find the potential research topic
Come up with a research topic and discuss with Dr. Kaczmarek

Week 2 (6/6 to 6/10)

Continue reading about research already done in the field
Finalize research topic
Create list of factors for machine learning algorithm
Outline proposed algorithm for new co-occurrence matrix
Start looking at RAKE, TextRank, and TF-IDF

Week 3 (6/6 to 6/10)

Complete code for machine learning algorithm
Complete code to test proposed algorithm with other algorithms
Begin studying the TextRank algorithm in detail and obtain Python code that implements it
Review tools from linear algebra and mathematical analysis

Week 4 (6/13 to 6/17)

Meet with Dr. Kaczmarek to discuss obtaining a corpus of documents
Start looking at the PageRank algorithm
Attempt to prove convergence and find runtime complexity for the original method

Week 5 (6/20 to 6/24)

Meet with Dr. Kaczmarek to explain the new approach to perform TextRank
Continue to review tools from linear algebra and mathematical analysis
Study the proofs of the Perron-Frobenius Theorem and the Power Method Convergence Theorem

Week 6 (6/27 to 7/1)

Complete code for co-occurrence matrix creator
Integrate PageRank to improve TextRank performance
Begin writing a formal report
Give a presentation on our progress so far
