Text Mining in Keyword Extraction (https://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_Extraction), REU@MU wiki, revision of 2016-07-20T09:17:17Z<p>Phucnguyen: /* Weekly Log */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Students:''' Phuc Nguyen and Aditya Subramanian<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies the terms that best capture the content of an unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because assigning them by hand is time-consuming given the massive amount of information now available. Many methods have therefore been used over the years, and new solutions are constantly being proposed. Prevalent algorithms and models include TF-IDF (Term Frequency-Inverse Document Frequency), RAKE (Rapid Automatic Keyword Extraction), and TextRank; less popular approaches include lexical chains and Bayes classifiers. <br />
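As a point of reference for the TF-IDF model named above, here is a minimal sketch in Python. It assumes documents are already tokenized into word lists and uses the plain textbook weighting (relative term frequency times the log of the inverse document frequency); the function name `tf_idf` and the three toy documents are illustrative assumptions, not code or data from this project.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document (docs are token lists)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Relative term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy corpus, for illustration only.
docs = [
    "keyword extraction finds relevant terms".split(),
    "text mining finds unknown information".split(),
    "keyword lists help cluster documents".split(),
]
scores = tf_idf(docs)
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:3]
print(top_terms)
```

Terms that occur in every document receive a score of zero under this weighting, which is why TF-IDF needs a corpus and cannot rank keywords from a single document in isolation.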
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing its results with a sample keyword list generated by humans and, building on the existing methods, proposes a new approach. First, we obtain a corpus and select sample documents from it. For each document, we then generate a keyword list with each algorithm under test and also obtain the list written by the document's author. Finally, we ask an expert in the field to rank all the lists. We repeat the experiment with several documents and gauge the potential of each method from the outcomes. To avoid bias from the expert, the lists should not be labelled, and the order in which they are presented should be shuffled randomly each time. <br />
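The blinding step in this protocol could be sketched as follows. This is only an illustration under stated assumptions: the helper name `blind_lists`, the neutral "List A" style labels, and the sample keyword lists are hypothetical and do not come from the project.

```python
import random

def blind_lists(lists_by_method, rng=None):
    """Shuffle keyword lists and hide their sources behind neutral labels.

    Returns (blinded, key): `blinded` maps labels such as "List A" to keyword
    lists for the expert, and `key` maps the same labels back to method names.
    """
    rng = rng or random.Random()
    methods = list(lists_by_method)
    rng.shuffle(methods)  # fresh random order for every document
    blinded, key = {}, {}
    for i, method in enumerate(methods):
        label = f"List {chr(ord('A') + i)}"
        blinded[label] = lists_by_method[method]
        key[label] = method
    return blinded, key

# Hypothetical keyword lists for one document.
lists_by_method = {
    "TF-IDF": ["keyword", "corpus", "frequency"],
    "RAKE": ["keyword extraction", "stop words"],
    "TextRank": ["graph", "keyword", "rank"],
    "author": ["text mining", "keyword extraction"],
}
blinded, key = blind_lists(lists_by_method)
for label, keywords in blinded.items():
    print(label, keywords)
```

Only `blinded` would be shown to the expert; `key` stays with the experimenters so the rankings can be mapped back to methods afterwards.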
<br />
Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done [4], we are considering many more factors than earlier work did. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of the words, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current problem is to refine the new approach further by using tools from numerical analysis to reduce runtime complexity. <br />
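For orientation, the baseline that both projects build on can be sketched as a word co-occurrence graph scored with the PageRank update. This is a generic TextRank-style illustration, not Phuc's refined method or Aditya's distance-based matrix; the window size, the damping factor of 0.85, the tolerance, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Link words that appear within `window` positions of each other."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window + 1]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def textrank(graph, damping=0.85, tol=1e-6, max_iter=100):
    """Score nodes by repeating the PageRank update until scores settle."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new_scores = {
            # Each neighbour passes on its score split evenly over its edges.
            node: (1 - damping)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores

# Hypothetical token sequence, for illustration only.
tokens = "keyword extraction ranks keyword candidates by keyword graph".split()
scores = textrank(cooccurrence_graph(tokens))
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

Each pass of the loop is one step of the power method on the damped transition matrix, so the number of passes needed to reach `tol` is governed by the damping factor; this is the runtime cost that convergence analysis of the kind described above would aim to bound.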
<br />
'''Background References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
4) [http://dl.acm.org/citation.cfm?id=1119383 Improved Automatic Keyword Extraction Given More Linguistic Knowledge] by Anette Hulth.<br />
<br />
5) [http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information] by Yutaka Matsuo and Mitsuru Ishizuka.<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find a potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek<br />
<br />
'''Week 2 (6/6 to 6/10)'''<br />
* Continue reading about research already done in the field<br />
* Finalize research topic<br />
* Create list of factors for machine learning algorithm<br />
* Outline proposed algorithm for new co-occurrence matrix <br />
* Start looking at RAKE, TextRank, and TF-IDF<br />
<br />
'''Week 3 (6/13 to 6/17)'''<br />
* Complete code for machine learning algorithm<br />
* Complete code to test proposed algorithm with other algorithms<br />
* Review tools from linear algebra and mathematical analysis<br />
* Attempt to prove convergence and find runtime complexity for the original TextRank method<br />
<br />
'''Week 4 (6/20 to 6/24)'''<br />
* Meet with Dr. Kaczmarek to explain the new approach to perform TextRank<br />
* Meet with Aditya to talk about collaboration between the two projects<br />
* Begin to look at the proofs of the Perron-Frobenius Theorem and the Power Method Convergence Theorem<br />
* Review graph theory and theory of probability<br />
<br />
'''Week 5 (6/27 to 7/1)'''<br />
* Complete code for co-occurrence matrix creator<br />
* Integrate PageRank to improve TextRank performance<br />
* Begin to write a formal report paper<br />
* Give a presentation on what we have come up with so far<br />
<br />
'''Week 6 (7/4 to 7/8)'''<br />
* Begin to look at the rate of convergence for the new approach<br />
* Review theory of convergence from numerical analysis<br />
<br />
'''Week 7 (7/11 to 7/15)'''<br />
* Start to look at some numerical techniques used to approximate eigenvectors<br />
* Review complexity theory involving big-O notation and the theory of Markov chains<br />
* Begin to write code to test the new approach; conclude that the theory is valid and that the new approach can reasonably be implemented and run in practice<br />
<br />
'''Week 8 (7/18 to 7/22)'''<br />
* Prepare poster to be submitted by the end of the week<br />
* Meet with Dr. Kaczmarek to discuss the plan for the remaining two weeks and begin to wrap everything up<br />
<br />
'''Week 9 (7/25 to 7/29)'''<br />
* Prepare for poster presentation<br />
* Prepare for final presentation next week<br />
* Meet with Dr. Kaczmarek to finalize report paper<br />
<br />
'''Week 10 (8/1 to 8/5)'''<br />
* Submit report paper<br />
* Deliver final presentation<br />
* Complete remaining paperwork and survey</div>
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen and Aditya Subramanian<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. Keyword extraction can then be further utilized to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate by hand, since the process is time-consuming for humans given the massive amount of information available today. Thus, many traditional methods have been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, along with less popular ones such as lexical chains or a Bayes classifier. <br />
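<br />
As a rough illustration of the TF-IDF model named above, the following Python sketch scores each term by its relative frequency in a document multiplied by the logarithm of the inverse document frequency. The function name <code>tf_idf</code>, the toy documents, and the whitespace tokenization are assumptions made for this example only; they are not taken from the project's code, and real implementations often use smoothed variants of the formula.<br />

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    `documents` is a list of non-empty token lists (hypothetical input format).
    score = (count of term in doc / doc length) * ln(N / docs containing term)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, invented for illustration.
docs = [
    "text mining finds keywords in text".split(),
    "graph ranking finds important nodes".split(),
    "keywords summarize a document".split(),
]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
```

In this toy corpus, a term that is frequent in one document and absent from the others receives the highest score for that document, while a term shared by several documents is penalized.<br />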
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing its results with a sample keyword list generated by humans and, based on the existing methods, proposes a new approach. First, we obtain a corpus and sample documents from that corpus. We then generate keyword lists using all the algorithms under test, alongside a list generated by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and, based on the outcomes, gauge the potential of each method. Note that, to avoid bias from the expert, the lists should not be labelled, and the order in which they are presented should be shuffled randomly each time. <br />
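<br />
The blinding step described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the project: the function name <code>blind_lists</code>, the "List n" labels, and the sample keyword lists are all hypothetical.<br />

```python
import random


def blind_lists(keyword_lists, rng=None):
    """Hide method names and shuffle presentation order.

    `keyword_lists` maps a method name to its list of keywords.
    Returns (anonymous, key): `anonymous` is what the expert sees,
    `key` maps each anonymous label back to the method for later scoring.
    """
    rng = rng or random.Random()
    methods = list(keyword_lists)
    rng.shuffle(methods)  # new random order on every call
    anonymous = {}
    key = {}
    for i, method in enumerate(methods, start=1):
        label = f"List {i}"
        anonymous[label] = list(keyword_lists[method])
        key[label] = method
    return anonymous, key


# Invented sample output of each method for a single document.
sample = {
    "TF-IDF": ["mining", "keyword", "corpus"],
    "RAKE": ["keyword extraction", "text mining"],
    "TextRank": ["keyword", "graph", "ranking"],
    "Author": ["text mining", "keyword extraction"],
}
anonymous, key = blind_lists(sample, random.Random(0))
```

Keeping the <code>key</code> mapping separate from what the expert sees lets the rankings be attributed to each method only after the expert has finished.<br />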
<br />
Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done[4], we are considering many more factors than previous work has. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of the words, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current problem is to refine the new approach even further by using tools from numerical analysis to reduce runtime complexity. <br />
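<br />
To make the connection between TextRank and PageRank concrete, here is a minimal Python sketch that builds an unweighted word co-occurrence graph and ranks its nodes by power iteration. It is a generic textbook-style illustration, not the new approach developed in this project: the function names, the <code>window</code> parameter, and the sample tokens are assumptions, and the update uses the normalized PageRank form (1 - d)/N + d * sum, whereas the TextRank paper writes the constant term as (1 - d).<br />

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=1):
    """Link two different words if they occur within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(graph, damping=0.85, tol=1e-8, max_iter=200):
    """PageRank-style power iteration on an undirected graph."""
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        new = {}
        for n in nodes:
            # Each neighbour m shares its score equally among its own links.
            incoming = sum(score[m] / len(graph[m]) for m in graph[n])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        delta = sum(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < tol:  # stop once the scores no longer change
            break
    return score


# Invented token sequence in which "graph" co-occurs with every other word.
tokens = "keyword graph ranking graph model graph score".split()
ranks = textrank(cooccurrence_graph(tokens))
best = max(ranks, key=ranks.get)
```

With the damping factor below 1, each iteration shrinks the difference between successive score vectors by at least that factor, which is why the loop can stop on a small tolerance; the scores also sum to 1 at every step in this normalized form.<br />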
<br />
'''Background References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
4) [http://dl.acm.org/citation.cfm?id=1119383 Improved Automatic Keyword Extraction Given More Linguistic Knowledge] by Annette Hulth<br />
<br />
5) [http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information] by Yutaka Matsuo and Mitsuru Ishizuka<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find the potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek<br />
<br />
'''Week 2 (6/6 to 6/10)'''<br />
* Continue reading about research already done in the field<br />
* Finalize research topic<br />
* Create list of factors for machine learning algorithm<br />
* Outline proposed algorithm for new co-occurrence matrix <br />
* Start looking at RAKE, TextRank, and TF-IDF<br />
<br />
'''Week 3 (6/13 to 6/17)'''<br />
* Complete code for machine learning algorithm<br />
* Complete code to test proposed algorithm with other algorithms<br />
* Review tools from linear algebra and mathematical analysis<br />
* Attempt to prove convergence and find runtime complexity for the original TextRank method<br />
<br />
'''Week 4 (6/20 to 6/24)'''<br />
* Meet with Dr. Kaczmarek to explain the new approach to perform TextRank<br />
* Meet with Aditya to talk about collaboration between the two projects<br />
* Begin to look at the proofs of the Perron-Frobenius Theorem and the Power Method Convergence Theorem<br />
* Review graph theory and theory of probability<br />
<br />
'''Week 5 (6/27 to 7/1)'''<br />
* Complete code for co-occurrence matrix creator<br />
* Integrate PageRank to improve TextRank performance<br />
* Begin to write a formal report paper<br />
* Presentation about what we have come up with so far</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-07-05T02:03:29Z<p>Phucnguyen: /* Weekly Log */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen and Aditya Subramanian<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. Keyword extraction can then be further utilized to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate by hand, since the process is time-consuming for humans given the massive amount of information available today. Thus, many traditional methods have been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, along with less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing its results with a sample keyword list generated by humans and, based on the existing methods, proposes a new approach. First, we obtain a corpus and sample documents from that corpus. We then generate keyword lists using all the algorithms under test, alongside a list generated by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and, based on the outcomes, gauge the potential of each method. Note that, to avoid bias from the expert, the lists should not be labelled, and the order in which they are presented should be shuffled randomly each time. <br />
<br />
Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done[4], we are considering many more factors than previous work has. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of the words, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current problem is to refine the new approach even further by using tools from numerical analysis to reduce runtime complexity. <br />
<br />
'''Background References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
4) [http://dl.acm.org/citation.cfm?id=1119383 Improved Automatic Keyword Extraction Given More Linguistic Knowledge] by Annette Hulth<br />
<br />
5) [http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information] by Yutaka Matsuo and Mitsuru Ishizuka<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find the potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek<br />
<br />
'''Week 2 (6/6 to 6/10)'''<br />
* Continue reading about research already done in the field<br />
* Finalize research topic<br />
* Create list of factors for machine learning algorithm<br />
* Outline proposed algorithm for new co-occurrence matrix <br />
* Start looking at RAKE, TextRank, and TF-IDF<br />
<br />
'''Week 3 (6/13 to 6/17)'''<br />
* Complete code for machine learning algorithm<br />
* Complete code to test proposed algorithm with other algorithms<br />
* Review tools from linear algebra and mathematical analysis<br />
* Attempt to prove convergence and find runtime complexity for the original method<br />
<br />
'''Week 4 (6/20 to 6/24)'''<br />
* Meet with Dr. Kaczmarek to explain the new approach to perform TextRank<br />
* Meet with Aditya to talk about collaboration between the two projects<br />
* Begin to look at the proofs of the Perron-Frobenius Theorem and the Power Method Convergence Theorem<br />
<br />
'''Week 5 (6/27 to 7/1)'''<br />
* Complete code for co-occurrence matrix creator<br />
* Integrate PageRank to improve TextRank performance<br />
* Begin to write a formal report paper<br />
* Presentation about what we have come up with so far</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-07-05T02:02:00Z<p>Phucnguyen: /* Weekly Log */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen and Aditya Subramanian<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. Keyword extraction can then be further utilized to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate by hand, since the process is time-consuming for humans given the massive amount of information available today. Thus, many traditional methods have been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, along with less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing its results with a sample keyword list generated by humans and, based on the existing methods, proposes a new approach. First, we obtain a corpus and sample documents from that corpus. We then generate keyword lists using all the algorithms under test, alongside a list generated by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and, based on the outcomes, gauge the potential of each method. Note that, to avoid bias from the expert, the lists should not be labelled, and the order in which they are presented should be shuffled randomly each time. <br />
<br />
Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done[4], we are considering many more factors than previous work has. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of the words, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current problem is to refine the new approach even further by using tools from numerical analysis to reduce runtime complexity. <br />
<br />
'''Background References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
4) [http://dl.acm.org/citation.cfm?id=1119383 Improved Automatic Keyword Extraction Given More Linguistic Knowledge] by Annette Hulth<br />
<br />
5) [http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information] by Yutaka Matsuo and Mitsuru Ishizuka<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find the potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek<br />
<br />
'''Week 2 (6/6 to 6/10)'''<br />
* Continue reading about research already done in the field<br />
* Finalize research topic<br />
* Create list of factors for machine learning algorithm<br />
* Outline proposed algorithm for new co-occurrence matrix <br />
* Start looking at RAKE, TextRank, and TF-IDF<br />
<br />
'''Week 3 (6/13 to 6/17)'''<br />
* Complete code for machine learning algorithm<br />
* Complete code to test proposed algorithm with other algorithms<br />
* Review tools from linear algebra and mathematical analysis<br />
* Attempt to prove convergence and find runtime complexity for the original method<br />
<br />
'''Week 4 (6/20 to 6/24)'''<br />
* Meet with Dr. Kaczmarek to explain the new approach to perform TextRank<br />
* Meet with Ad<br />
<br />
'''Week 5 (6/27 to 7/1)'''<br />
* Complete code for co-occurrence matrix creator<br />
* Integrate PageRank to improve TextRank performance<br />
* Begin to write a formal report paper<br />
* Presentation about what we have come up with so far</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-07-05T01:59:12Z<p>Phucnguyen: /* Weekly Log */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen and Aditya Subramanian<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. Keyword extraction can then be further utilized to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate by hand, since the process is time-consuming for humans given the massive amount of information available today. Thus, many traditional methods have been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, along with less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing its results with a sample keyword list generated by humans and, based on the existing methods, proposes a new approach. First, we obtain a corpus and sample documents from that corpus. We then generate keyword lists using all the algorithms under test, alongside a list generated by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and, based on the outcomes, gauge the potential of each method. Note that, to avoid bias from the expert, the lists should not be labelled, and the order in which they are presented should be shuffled randomly each time. <br />
<br />
Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done[4], we are considering many more factors than previous work has. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of the words, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current problem is to refine the new approach even further by using tools from numerical analysis to reduce runtime complexity. <br />
<br />
'''Background References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
4) [http://dl.acm.org/citation.cfm?id=1119383 Improved Automatic Keyword Extraction Given More Linguistic Knowledge] by Annette Hulth<br />
<br />
5) [http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information] by Yutaka Matsuo and Mitsuru Ishizuka<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find the potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek<br />
<br />
'''Week 2 (6/6 to 6/10)'''<br />
* Continue reading about research already done in the field<br />
* Finalize research topic<br />
* Create list of factors for machine learning algorithm<br />
* Outline proposed algorithm for new co-occurrence matrix <br />
* Start looking at RAKE, TextRank, and TF-IDF<br />
<br />
'''Week 3 (6/6 to 6/10)'''<br />
* Complete code for machine learning algorithm<br />
* Complete code to test proposed algorithm with other algorithms<br />
* Begin to scrutinize the TextRank algorithm and obtain the Python code to implement this algorithm<br />
* Review tools from linear algebra and mathematical analysis<br />
<br />
'''Week 4 (6/13 to 6/17)'''<br />
* Meet with Dr. Kaczmarek to discuss obtaining a corpus of documents<br />
* Start looking at the PageRank algorithm<br />
* Attempt to prove convergence and find runtime complexity for the original method<br />
<br />
'''Week 5 (6/20 to 6/24)'''<br />
* Meet with Dr. Kaczmarek to explain the new approach to perform TextRank<br />
* Continue to review tools from linear algebra and mathematical analysis<br />
* Study the proofs of the Perron-Frobenius Theorem and the Power Method Convergence Theorem<br />
<br />
'''Week 6 (6/27 to 7/1)'''<br />
* Complete code for co-occurrence matrix creator<br />
* Integrate PageRank to improve TextRank performance<br />
* Begin to write a formal report paper<br />
* Presentation about what we have come up with so far</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-07-05T01:54:49Z<p>Phucnguyen: /* Weekly Log */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen and Aditya Subramanian<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. Keyword extraction can then be further utilized to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate by hand, since the process is time-consuming for humans given the massive amount of information available today. Thus, many traditional methods have been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, along with less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing its results with a sample keyword list generated by humans and, based on the existing methods, proposes a new approach. First, we obtain a corpus and sample documents from that corpus. We then generate keyword lists using all the algorithms under test, alongside a list generated by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and, based on the outcomes, gauge the potential of each method. Note that, to avoid bias from the expert, the lists should not be labelled, and the order in which they are presented should be shuffled randomly each time. <br />
<br />
Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done[4], we are considering many more factors than previous work has. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of the words, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current problem is to refine the new approach even further by using tools from numerical analysis to reduce runtime complexity. <br />
<br />
'''Background References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
4) [http://dl.acm.org/citation.cfm?id=1119383 Improved Automatic Keyword Extraction Given More Linguistic Knowledge] by Annette Hulth<br />
<br />
5) [http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information] by Yutaka Matsuo and Mitsuru Ishizuka<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find the potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek<br />
<br />
'''Week 2 (6/6 to 6/10)'''<br />
* Continue reading about research already done in the field<br />
* Finalize research topic<br />
* Create list of factors for machine learning algorithm<br />
* Outline proposed algorithm for new co-occurrence matrix <br />
* Start looking at RAKE, TextRank, and TF-IDF<br />
<br />
'''Week 3 (6/6 to 6/10)'''<br />
* Complete code for machine learning algorithm<br />
* Complete code to test proposed algorithm with other algorithms<br />
* Start looking at the PageRank algorithm<br />
* Review tools from linear algebra and mathematical analysis<br />
<br />
'''Week 4 (6/13 to 6/17)'''<br />
* Meet with Dr. Kaczmarek to<br />
<br />
'''Week 5 (6/20 to 6/24)'''<br />
* Complete code for co-occurrence matrix creator<br />
* Integrate PageRank to improve TextRank performance<br />
* Begin to write a formal report paper<br />
* Presentation about what we have come up with so far<br />
<br />
'''Week 6 (6/27 to 7/1)'''<br />
* Complete code for co-occurrence matrix creator<br />
* Integrate PageRank to improve TextRank performance<br />
* Begin to write a formal report paper<br />
* Presentation about what we have come up with so far</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-07-05T01:50:44Z<p>Phucnguyen: /* Project Description and Goal */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen and Aditya Subramanian<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. Keyword extraction can then be further utilized to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate by hand, since the process is time-consuming for humans given the massive amount of information available today. Thus, many traditional methods have been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, along with less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing its results with a sample keyword list generated by humans and, based on the existing methods, proposes a new approach. First, we obtain a corpus and sample documents from that corpus. We then generate keyword lists using all the algorithms under test, alongside a list generated by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and, based on the outcomes, gauge the potential of each method. Note that, to avoid bias from the expert, the lists should not be labelled, and the order in which they are presented should be shuffled randomly each time. <br />
<br />
Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done[4], we are considering many more factors than previous work has. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of the words, the positions of the first word, and the length of the document. Phuc's approach, on the other hand, attempts to improve TextRank's performance (reduce runtime complexity, improve keyword scores, etc.) by integrating the model from the PageRank algorithm used by Google. His current problem is to refine the new approach even further by using tools from numerical analysis to reduce runtime complexity. <br />
<br />
'''Background References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
4) [http://dl.acm.org/citation.cfm?id=1119383 Improved Automatic Keyword Extraction Given More Linguistic Knowledge] by Anette Hulth.<br />
<br />
5) [http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information] by Yutaka Matsuo and Mitsuru Ishizuka.<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find a potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek<br />
<br />
'''Week 2'''<br />
* Continue reading about research already done in the field<br />
* Finalize research topic<br />
* Create list of factors for machine learning algorithm<br />
* Outline proposed algorithm for new co-occurrence matrix <br />
* Start looking at RAKE, TextRank, and TF-IDF<br />
<br />
'''Week 3'''<br />
* Complete code for machine learning algorithm<br />
* Complete code to test proposed algorithm with other algorithms<br />
* Start looking at the PageRank algorithm<br />
* Review tools from linear algebra and mathematical analysis<br />
<br />
'''Week 4'''<br />
* Complete code for co-occurrence matrix creator<br />
* Integrate PageRank to improve TextRank performance<br />
* Begin writing a formal report<br />
* Presentation about what we have come up with so far</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-07-05T01:46:36Z<p>Phucnguyen: /* Weekly Log */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen and Aditya Subramanian<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, as well as less popular ones such as lexical chains or a Bayes classifier. <br />
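As a rough illustration of the first model named above, TF-IDF can be computed in a few lines. This sketch uses one common variant (term frequency normalized by document length, inverse document frequency as the logarithm of N divided by the document frequency). The project does not state which weighting it uses, so the variant and the function name <code>tf_idf</code> are assumptions.<br />

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents is a list of token lists. A term scores highly when it is
    frequent in one document but rare across the collection.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))        # count each term once per document

    all_scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        all_scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return all_scores
```

Under this variant, a term that appears in every document receives a score of zero, which is why very common words rarely surface as keywords.<br />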
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing their results with a sample keyword list generated by humans, and proposes a new approach based on the existing methods. First, we obtain the corpus and sample documents within that corpus. We then generate keyword lists using all of the algorithms under test, along with a list provided by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and gauge the potential of each method based on the outcomes. To avoid bias from the expert, the lists should not be labelled, and the order in which they appear should be shuffled randomly each time. <br />
<br />
Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done [4], we consider many more factors than previous work has. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of each word, the positions of the first word, and the length of the document.<br />
<br />
'''References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
4) [http://dl.acm.org/citation.cfm?id=1119383 Improved Automatic Keyword Extraction Given More Linguistic Knowledge] by Anette Hulth.<br />
<br />
5) [http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information] by Yutaka Matsuo and Mitsuru Ishizuka.<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find a potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek<br />
<br />
'''Week 2'''<br />
* Continue reading about research already done in the field<br />
* Finalize research topic<br />
* Create list of factors for machine learning algorithm<br />
* Outline proposed algorithm for new co-occurrence matrix <br />
* Start looking at RAKE, TextRank, and TF-IDF<br />
<br />
'''Week 3'''<br />
* Complete code for machine learning algorithm<br />
* Complete code to test proposed algorithm with other algorithms<br />
* Start looking at the PageRank algorithm<br />
* Review tools from linear algebra and mathematical analysis<br />
<br />
'''Week 4'''<br />
* Complete code for co-occurrence matrix creator<br />
* Integrate PageRank to improve TextRank performance<br />
* Begin writing a formal report<br />
* Presentation about what we have come up with so far</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-07-05T01:43:10Z<p>Phucnguyen: /* Project Description and Goal */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen and Aditya Subramanian<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, as well as less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing their results with a sample keyword list generated by humans, and proposes a new approach based on the existing methods. First, we obtain the corpus and sample documents within that corpus. We then generate keyword lists using all of the algorithms under test, along with a list provided by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and gauge the potential of each method based on the outcomes. To avoid bias from the expert, the lists should not be labelled, and the order in which they appear should be shuffled randomly each time. <br />
<br />
Phuc and Aditya are working on two separate projects within this field. Aditya's project has two aspects. The first is creating a machine learning algorithm to identify keywords. While this has already been done [4], we consider many more factors than previous work has. The second aspect is to create a better co-occurrence matrix. The model considers the distance between two words and compares it to the expected distance between them given the number of instances of each word, the positions of the first word, and the length of the document.<br />
<br />
'''References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
4) [http://dl.acm.org/citation.cfm?id=1119383 Improved Automatic Keyword Extraction Given More Linguistic Knowledge] by Anette Hulth.<br />
<br />
5) [http://www.aaai.org/Papers/FLAIRS/2003/Flairs03-076.pdf Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information] by Yutaka Matsuo and Mitsuru Ishizuka.<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find a potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek<br />
<br />
'''Week 2'''<br />
* Continue reading about research already done in the field<br />
* Finalize research topic<br />
* Create list of factors for machine learning algorithm<br />
* Outline proposed algorithm for new co-occurrence matrix <br />
<br />
'''Week 3'''<br />
* Complete code for machine learning algorithm<br />
* Complete code to test proposed algorithm with other algorithms<br />
<br />
'''Week 4'''<br />
*Complete code for co-occurrence matrix creator</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-06-03T18:43:43Z<p>Phucnguyen: </p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, as well as less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing their results with a sample keyword list generated by humans, and proposes a new approach based on the existing methods. First, we obtain the corpus and sample documents within that corpus. We then generate keyword lists using all of the algorithms under test, along with a list provided by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and gauge the potential of each method based on the outcomes. To avoid bias from the expert, the lists should not be labelled, and the order in which they appear should be shuffled randomly each time. <br />
<br />
'''References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find a potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-06-03T18:38:21Z<p>Phucnguyen: /* Project Description and Goal */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, as well as less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing their results with a sample keyword list generated by humans, and proposes a new approach based on the existing methods. First, we obtain the corpus and sample documents within that corpus. We then generate keyword lists using all of the algorithms under test, along with a list provided by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and gauge the potential of each method based on the outcomes. To avoid bias from the expert, the lists should not be labelled, and the order in which they appear should be shuffled randomly each time. <br />
<br />
'''References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
3) [https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents Automatic Keyword Extraction from Individual Documents] by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley.</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Phuc_NguyenPhuc Nguyen2016-06-03T18:34:55Z<p>Phucnguyen: </p>
<hr />
<div><br />
== Information ==<br />
I am an incoming junior at Marquette University double majoring in Mathematics and Computer Science.<br />
<br />
<br />
== Research Project Summer 2016 ==<br />
My research project and activities for summer 2016 can be found [http://reu.mscs.mu.edu/index.php/Text_Mining_in_Keyword_Extraction here]</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-06-03T18:33:01Z<p>Phucnguyen: /* Project Description and Goal */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, as well as less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the algorithms described above to determine the strengths and weaknesses of each method by comparing their results with a sample keyword list generated by humans, and proposes a new approach based on the existing methods. First, we obtain the corpus and sample documents within that corpus. We then generate keyword lists using all of the algorithms under test, along with a list provided by the author of each document. Finally, we consult an expert in the field to rank all the lists. We repeat the experiment with several documents and gauge the potential of each method based on the outcomes. To avoid bias from the expert, the lists should not be labelled, and the order in which they appear should be shuffled randomly each time. <br />
<br />
'''References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-06-03T18:23:20Z<p>Phucnguyen: /* Project Description and Goal */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, as well as less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the methods described above to determine the strengths and weaknesses of each by comparing their results with a sample keyword list generated by humans, and proposes a new approach based on the existing methods.<br />
<br />
'''References'''<br />
<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-06-03T18:23:10Z<p>Phucnguyen: /* Project Description and Goal */</p>
<hr />
<div><br />
== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, as well as less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the methods described above to determine the strengths and weaknesses of each by comparing their results with a sample keyword list generated by humans, and proposes a new approach based on the existing methods.<br />
<br />
'''References'''<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T18:20:47Z<p>Phucnguyen: </p>
<hr />
<div>[[Game Engine for Serious Educational Games]]. Student Researchers: <br />
[[User:Dcronce|Daniel Cronce]] and [[User:Mjbaker4|Michael Baker]]. Mentors [http://www.marquette.edu/ctl/about/staff.shtml Dr. Shaun Longstreet], [http://www.utdallas.edu/~kcooper/ Dr. Kendra Cooper], and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[User:Jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keyword Extraction | Text Mining in Keyword Extraction]]. Student: [[Phuc Nguyen | Phuc Nguyen]]. Mentor: [http://www.marquette.edu/mscs/facstaff-kaczmarek.shtml Dr. Thomas Kaczmarek].<br />
<br />
[[Applied Probabilistic Forecasting Methods in Energy Consumption]]. Dr. George Corliss, students [[User:Scloew|Stephen Loew]] and [[User:ARuiz|Alberto Ruiz]]<br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
[[Development of Authentication and Management Systems for Systems Administration Offices]]. [[User:Cmorley|Charlie Morley]]. Mentors: [http://www.marquette.edu/mscs/facstaff-staff.shtml Steve Goodman] and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
== Mathematics and Computer Science Education ==<br />
* [[MUzECS:Chrome|A browser-based IDE for the MUzECS platform.]] David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: [[User:Brylow|Dr. Dennis Brylow]].</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keywords_ExtractionText Mining in Keywords Extraction2016-06-03T18:20:21Z<p>Phucnguyen: /* Project Description and Goal */</p>
<hr />
<div>== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new and previously unknown information from unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured texts. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model, as well as less popular ones such as lexical chains or a Bayes classifier. <br />
<br />
This project analyzes several of the methods described above to determine the strengths and weaknesses of each by comparing their results with a sample keyword list generated by humans, and proposes a new approach based on the existing methods.<br />
<br />
'''References'''<br />
1) [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf TextRank: Bringing Order into Texts] by Rada Mihalcea and Paul Tarau.<br />
2) [http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf Survey of Keyword Extraction Techniques] by Brian Lott.<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find a potential research topic<br />
* Come up with a research topic and discuss with Dr. Kaczmarek</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/User:PhucnguyenUser:Phucnguyen2016-06-03T16:59:55Z<p>Phucnguyen: </p>
<hr />
<div>Information can be found [http://reu.mscs.mu.edu/index.php/Phuc_Nguyen here]</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Phuc_NguyenPhuc Nguyen2016-06-03T16:58:45Z<p>Phucnguyen: /* Research Project Summer 2016 */</p>
<hr />
<div><br />
== Information ==<br />
I am an incoming junior at Marquette University double majoring in Mathematics and Computer Science.<br />
<br />
<br />
== Research Project Summer 2016 ==<br />
My research project and activities for summer 2016 can be found [http://reu.mscs.mu.edu/index.php/Text_Mining_in_Keywords_Extraction here]</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Phuc_NguyenPhuc Nguyen2016-06-03T16:57:54Z<p>Phucnguyen: </p>
<hr />
<div><br />
== Information ==<br />
I am an incoming junior at Marquette University double majoring in Mathematics and Computer Science.<br />
<br />
<br />
== Research Project Summer 2016 ==<br />
My research project for summer 2016 can be found [http://reu.mscs.mu.edu/index.php/Text_Mining_in_Keywords_Extraction here]</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/User:PhucnguyenUser:Phucnguyen2016-06-03T16:56:10Z<p>Phucnguyen: </p>
<hr />
<div>Information can be found</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Phuc_NguyenPhuc Nguyen2016-06-03T16:55:39Z<p>Phucnguyen: Created page with " == Information == I am an incoming junior at Marquette University double majoring in Mathematics and Computer Science. == Research Project Summer 2016 =="</p>
<hr />
<div><br />
== Information ==<br />
I am an incoming junior at Marquette University double majoring in Mathematics and Computer Science.<br />
<br />
<br />
== Research Project Summer 2016 ==</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T16:53:57Z<p>Phucnguyen: </p>
<hr />
<div>[[Game Engine for Serious Educational Games]]. Student Researchers: <br />
[[User:Dcronce|Daniel Cronce]] and [[User:Mjbaker4|Michael Baker]]. Mentors [http://www.marquette.edu/ctl/about/staff.shtml Dr. Shaun Longstreet], [http://www.utdallas.edu/~kcooper/ Dr. Kendra Cooper], and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[User:Jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keywords Extraction | Text Mining in Keywords Extraction]]. Student: [[Phuc Nguyen | Phuc Nguyen]]. Mentor: [http://www.marquette.edu/mscs/facstaff-kaczmarek.shtml Dr. Thomas Kaczmarek].<br />
<br />
[[Applied Probabilistic Forecasting Methods in Energy Consumption]]. Dr. George Corliss, students [[User:Scloew|Stephen Loew]] and [[User:ARuiz|Alberto Ruiz]]<br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
[[Development of Authentication and Management Systems for Systems Administration Offices]]. [[User:Cmorley|Charlie Morley]]. Mentors: [http://www.marquette.edu/mscs/facstaff-staff.shtml Steve Goodman] and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: [[User:Brylow|Dr. Dennis Brylow]].</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T16:53:37Z<p>Phucnguyen: </p>
<hr />
<div>[[Game Engine for Serious Educational Games]]. Student Researchers: <br />
[[User:Dcronce|Daniel Cronce]] and [[User:Mjbaker4|Michael Baker]]. Mentors [http://www.marquette.edu/ctl/about/staff.shtml Dr. Shaun Longstreet], [http://www.utdallas.edu/~kcooper/ Dr. Kendra Cooper], and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[User:Jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keywords Extraction | Text Mining in Keywords Extraction]]. Student: [[Phuc Nguyen | Phuc Nguyen]]. Mentor: [http://www.marquette.edu/mscs/facstaff-kaczmarek.shtml Dr. Thomas Kaczmarek].<br />
<br />
[[Applied Probabilistic Forecasting Methods in Energy Consumption]]. Dr. George Corliss, students [[User:Scloew|Stephen Loew]] and [[User:ARuiz|Alberto Ruiz]]<br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
[[Development of Authentication and Management Systems for Systems Administration Offices]]. [[User:Cmorley|Charlie Morley]]. Mentors: [http://www.marquette.edu/mscs/facstaff-staff.shtml Steve Goodman] and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: [[User:Brylow|Dr. Dennis Brylow]].</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keywords_ExtractionText Mining in Keywords Extraction2016-06-03T16:48:43Z<p>Phucnguyen: /* Weekly Log */</p>
<hr />
<div>== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches use lexical chains or a Bayes classifier. This project attempts to analyze several of the methods described above, determining the strengths and weaknesses of each by comparing its results with a sample keyword list generated by humans, and to propose a new approach based on the existing methods.<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find potential research topics<br />
* Come up with a research topic and discuss with Dr. Kaczmarek</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T16:45:12Z<p>Phucnguyen: </p>
<hr />
<div>[[Game Engine for Serious Educational Games]]. Student Researchers: <br />
[[User:Dcronce|Daniel Cronce]] and [[User:Mjbaker4|Michael Baker]]. Mentors: [http://www.marquette.edu/ctl/about/staff.shtml Dr. Shaun Longstreet], [http://www.utdallas.edu/~kcooper/ Dr. Kendra Cooper], and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[User:Jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keywords Extraction | Text Mining in Keywords Extraction]]. Student: [[User:Phucnguyen | Phuc Nguyen]]. Mentor: [http://www.marquette.edu/mscs/facstaff-kaczmarek.shtml Dr. Thomas Kaczmarek].<br />
<br />
[[Applied Probabilistic Forecasting Methods in Energy Consumption]]. Dr. George Corliss, students [[User:Scloew|Stephen Loew]] and [[User:ARuiz|Alberto Ruiz]]<br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
[[Development of Authentication and Management Systems for Systems Administration Offices]]. [[User:Cmorley|Charlie Morley]]. Mentors: [http://www.marquette.edu/mscs/facstaff-staff.shtml Steve Goodman] and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: [[User:Brylow|Dr. Dennis Brylow]].</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T16:44:30Z<p>Phucnguyen: </p>
<hr />
<div>[[Game Engine for Serious Educational Games]]. Student Researchers: <br />
[[User:Dcronce|Daniel Cronce]] and [[User:Mjbaker4|Michael Baker]]. Mentors: [http://www.marquette.edu/ctl/about/staff.shtml Dr. Shaun Longstreet], [http://www.utdallas.edu/~kcooper/ Dr. Kendra Cooper], and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[User:Jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keywords Extraction | Text Mining in Keywords Extraction]]. Student: [[User:Phucnguyen | Phuc Nguyen]]. Mentor: [http://www.marquette.edu/mscs/facstaff-kaczmarek.shtml Dr. Thomas Kaczmarek].<br />
<br />
[[Applied Probabilistic Forecasting Methods in Energy Consumption]]. Dr. George Corliss, students [[User:Scloew|Stephen Loew]] and [[User:ARuiz|Alberto Ruiz]]<br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
[[Development of Authentication and Management Systems for Systems Administration Offices]]. [[User:Cmorley|Charlie Morley]]. Mentors: [http://www.marquette.edu/mscs/facstaff-staff.shtml Steve Goodman] and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: [[User:Brylow|Dr. Dennis Brylow]].</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T16:43:31Z<p>Phucnguyen: </p>
<hr />
<div>[[Game Engine for Serious Educational Games]]. Student Researchers: <br />
[[User:Dcronce|Daniel Cronce]] and [[User:Mjbaker4|Michael Baker]]. Mentors: [http://www.marquette.edu/ctl/about/staff.shtml Dr. Shaun Longstreet], [http://www.utdallas.edu/~kcooper/ Dr. Kendra Cooper], and [[User:Brylow|Dr. Dennis Brylow]].<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[User:Jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keywords Extraction | Text Mining in Keywords Extraction]]. Student: [[User:Phucnguyen | Phuc Nguyen]]. Mentor: [http://www.marquette.edu/mscs/facstaff-kaczmarek.shtml Dr. Thomas Kaczmarek].<br />
<br />
[[Applied Probabilistic Forecasting Methods in Energy Consumption]]. Dr. George Corliss, students [[User:Scloew|Stephen Loew]] and [[User:ARuiz|Alberto Ruiz]]<br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
[[Development of Authentication and Management Systems for Systems Administration Offices]]. [[User:Cmorley|Charlie Morley]]. Mentors: [http://www.marquette.edu/mscs/facstaff-staff.shtml Steve Goodman] and [http://www.mscs.mu.edu/~brylow/ Dr. Dennis Brylow].<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: [[User:Brylow|Dr. Dennis Brylow]].</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keywords_ExtractionText Mining in Keywords Extraction2016-06-03T16:42:45Z<p>Phucnguyen: </p>
<hr />
<div>== Project Description and Goal ==<br />
'''Student:''' Phuc Nguyen<br />
<br />
'''Mentor:''' Dr. Thomas Kaczmarek<br />
<br />
'''Description:''' Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches use lexical chains or a Bayes classifier. This project attempts to analyze several of the methods described above, determining the strengths and weaknesses of each by comparing its results with a sample keyword list generated by humans, and to propose a new approach based on the existing methods.<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find potential research topics<br />
* Come up with a research topic and discuss with Dr. Kaczmarek</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keywords_ExtractionText Mining in Keywords Extraction2016-06-03T16:41:58Z<p>Phucnguyen: /* Project Description and Goal */</p>
<hr />
<div>== Project Description and Goal ==<br />
Student: Phuc Nguyen<br />
<br />
Mentor: Dr. Thomas Kaczmarek<br />
<br />
Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches use lexical chains or a Bayes classifier. This project attempts to analyze several of the methods described above, determining the strengths and weaknesses of each by comparing its results with a sample keyword list generated by humans, and to propose a new approach based on the existing methods.<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find potential research topics<br />
* Come up with a research topic and discuss with Dr. Kaczmarek</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keywords_ExtractionText Mining in Keywords Extraction2016-06-03T16:39:57Z<p>Phucnguyen: </p>
<hr />
<div>== Project Description and Goal ==<br />
Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches use lexical chains or a Bayes classifier. This project attempts to analyze several of the methods described above, determining the strengths and weaknesses of each by comparing its results with a sample keyword list generated by humans, and to propose a new approach based on the existing methods.<br />
<br />
<br />
== Weekly Log ==<br />
'''Week 1 (5/31 to 6/3)'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find potential research topics<br />
* Come up with a research topic and discuss with Dr. Kaczmarek</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keywords_ExtractionText Mining in Keywords Extraction2016-06-03T16:39:08Z<p>Phucnguyen: </p>
<hr />
<div>== Project Description and Goal ==<br />
Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches use lexical chains or a Bayes classifier. This project attempts to analyze several of the methods described above, determining the strengths and weaknesses of each by comparing its results with a sample keyword list generated by humans, and to propose a new approach based on the existing methods.<br />
<br />
<br />
== Weekly Log ==<br />
'''Week 1'''<br />
* Attend REU orientation activities, fill out forms and paperwork<br />
* Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
* Read articles related to text mining to find potential research topics<br />
* Come up with a research topic and discuss with Dr. Kaczmarek</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keywords_ExtractionText Mining in Keywords Extraction2016-06-03T16:38:09Z<p>Phucnguyen: /* Project Description and Goal */</p>
<hr />
<div>== Project Description and Goal ==<br />
Text mining, or text analytics, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured text. The extracted keywords can then be used to classify and cluster documents. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include the TF-IDF (Term Frequency-Inverse Document Frequency) model, the RAKE (Rapid Automatic Keyword Extraction) algorithm, and the TextRank model; less popular approaches use lexical chains or a Bayes classifier. This project attempts to analyze several of the methods described above, determining the strengths and weaknesses of each by comparing its results with a sample keyword list generated by humans, and to propose a new approach based on the existing methods.<br />
<br />
<br />
== Weekly Log ==<br />
'''Week 1'''<br />
Attend REU orientation activities, fill out forms and paperwork<br />
Meet with Dr. Kaczmarek to discuss the goals and scope of the project<br />
Read articles related to text mining to find potential research topics<br />
Come up with a research topic and discuss with Dr. Kaczmarek</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T16:23:10Z<p>Phucnguyen: </p>
<hr />
<div>SimSys Project. [[Dcronce|Daniel Cronce]] and [[Mjbaker4|Michael Baker]]. Mentors:<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keywords Extraction | Text Mining in Keywords Extraction]]. Student: [[User:Phucnguyen | Phuc Nguyen]]. Mentor: Dr. Thomas Kaczmarek.<br />
<br />
Applied Probabilistic Forecasting Methods in Energy Consumption. Dr. George Corliss, students Stephen Loew and Alberto Ruiz<br />
<br />
Abstract: In this research, we try to find applications of probabilistic forecasting to GasDay energy consumption models. Additionally, we will survey the literature to find applications of probabilistic forecasting in industry. Furthermore, we will examine forecast accuracy metrics, including the Brier Score, evaluate their strengths and weaknesses, and consider possible improvements. <br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keywords_ExtractionText Mining in Keywords Extraction2016-06-03T16:22:40Z<p>Phucnguyen: Created page with "== Project Description and Goal == Text mining or text analysis refers to the use of computational techniques to discover new and unknown information from unstructured textual..."</p>
<hr />
<div>== Project Description and Goal ==<br />
Text mining, or text analysis, refers to the use of computational techniques to discover new, previously unknown information in unstructured textual resources. Within text mining, keyword extraction is one of the most important tasks: it automatically identifies and retrieves the most relevant information from unstructured text. Although keywords are commonly used in search engines to locate information, appropriate keywords are difficult to generate because the process is time-consuming for humans given the massive amount of information now available. Many traditional methods have therefore been used over the years, and new solutions are constantly being proposed to tackle this problem. Prevalent algorithms and models include TF-IDF (Term Frequency-Inverse Document Frequency), RAKE (Rapid Automatic Keyword Extraction), and TextRank; less popular approaches use lexical chains or a Bayes classifier.</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T15:59:55Z<p>Phucnguyen: </p>
<hr />
<div>SimSys Project. [[Dcronce|Daniel Cronce]] and [[Mjbaker4|Michael Baker]]. Mentors:<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keywords Extraction | Text Mining in Keywords Extraction]]. [[User:Phucnguyen | Phuc Nguyen]]. Mentor: Dr. Thomas Kaczmarek.<br />
<br />
Applied Probabilistic Forecasting Methods in Energy Consumption. Dr. George Corliss, students Stephen Loew and Alberto Ruiz<br />
<br />
Abstract: In this research, we try to find applications of probabilistic forecasting to GasDay energy consumption models. Additionally, we will survey the literature to find applications of probabilistic forecasting in industry. Furthermore, we will examine forecast accuracy metrics, including the Brier Score, evaluate their strengths and weaknesses, and consider possible improvements. <br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T15:59:14Z<p>Phucnguyen: </p>
<hr />
<div>SimSys Project. [[Dcronce|Daniel Cronce]] and [[Mjbaker4|Michael Baker]]. Mentors:<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keyword Extraction | Text Mining in Keywords Extraction]]. [[User:Phucnguyen | Phuc Nguyen]]. Mentor: Dr. Thomas Kaczmarek.<br />
<br />
Applied Probabilistic Forecasting Methods in Energy Consumption. Dr. George Corliss, students Stephen Loew and Alberto Ruiz<br />
<br />
Abstract: In this research, we try to find applications of probabilistic forecasting to GasDay energy consumption models. Additionally, we will survey the literature to find applications of probabilistic forecasting in industry. Furthermore, we will examine forecast accuracy metrics, including the Brier Score, evaluate their strengths and weaknesses, and consider possible improvements. <br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T15:58:38Z<p>Phucnguyen: </p>
<hr />
<div>SimSys Project. [[Dcronce|Daniel Cronce]] and [[Mjbaker4|Michael Baker]]. Mentors:<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keywords Extraction | Text Mining in Keywords Extraction]]. [[User:Phucnguyen | Phuc Nguyen]]. Mentor: Dr. Thomas Kaczmarek.<br />
<br />
Applied Probabilistic Forecasting Methods in Energy Consumption. Dr. George Corliss, students Stephen Loew and Alberto Ruiz<br />
<br />
Abstract: In this research, we try to find applications of probabilistic forecasting to GasDay energy consumption models. Additionally, we will survey the literature to find applications of probabilistic forecasting in industry. Furthermore, we will examine forecast accuracy metrics, including the Brier Score, evaluate their strengths and weaknesses, and consider possible improvements. <br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Text_Mining_in_Keyword_ExtractionText Mining in Keyword Extraction2016-06-03T15:58:15Z<p>Phucnguyen: Created page with " == Project Description and Goal =="</p>
<hr />
<div><br />
== Project Description and Goal ==</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T15:57:40Z<p>Phucnguyen: </p>
<hr />
<div>SimSys Project. [[Dcronce|Daniel Cronce]] and [[Mjbaker4|Michael Baker]]. Mentors:<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[Text Mining in Keyword Extraction | Text Mining in Keyword Extraction]]. [[User:Phucnguyen | Phuc Nguyen]]. Mentor: Dr. Thomas Kaczmarek.<br />
<br />
Applied Probabilistic Forecasting Methods in Energy Consumption. Dr. George Corliss, students Stephen Loew and Alberto Ruiz<br />
<br />
Abstract: In this research, we try to find applications of probabilistic forecasting to GasDay energy consumption models. Additionally, we will survey the literature to find applications of probabilistic forecasting in industry. Furthermore, we will examine forecast accuracy metrics, including the Brier Score, evaluate their strengths and weaknesses, and consider possible improvements. <br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/TextmingTextming2016-06-03T15:56:34Z<p>Phucnguyen: /* Text Mining in Keywords Extraction */</p>
<hr />
<div></div>Phucnguyenhttps://reu.cs.mu.edu/index.php/TextmingTextming2016-06-03T15:55:29Z<p>Phucnguyen: /* Text Mining in Keywords Extraction */</p>
<hr />
<div><br />
== '''<nowiki>Text Mining in Keywords Extraction</nowiki>''' ==</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/TextmingTextming2016-06-03T15:53:31Z<p>Phucnguyen: /* Text Mining in Keywords Extraction */</p>
<hr />
<div><br />
== '''Text Mining in Keywords Extraction''' ==</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/TextmingTextming2016-06-03T15:52:08Z<p>Phucnguyen: </p>
<hr />
<div><br />
== Text Mining in Keywords Extraction ==</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/TextmingTextming2016-06-03T15:51:27Z<p>Phucnguyen: Created page with "'''Text Mining in Keywords Extraction'''"</p>
<hr />
<div>'''Text Mining in Keywords Extraction'''</div>Phucnguyenhttps://reu.cs.mu.edu/index.php/Summer_2016_ProjectsSummer 2016 Projects2016-06-03T15:50:06Z<p>Phucnguyen: </p>
<hr />
<div>SimSys Project. [[Dcronce|Daniel Cronce]] and [[Mjbaker4|Michael Baker]]. Mentors:<br />
<br />
Predicting Relative 'Cleanability' from Geometry. [[User:Asisk|Anna Sisk]]. Mentors: Dr. Stephen Merrill and Casey O'Brien <br />
<br />
[[Sudoku Distances]]. [[User:Jbeilke|Julia Beilke]] and [[jmiller|Joel Miller]]. Mentor: Dr. Kim Factor.<br />
<br />
[[Algorithms of CT and SPECT Scans]]. [[kskamp | Kim Sommerkamp]]. Mentor: Dr. Anne Clough<br />
<br />
[[textming | Text Mining in Keyword Extraction]]. [[User:Phucnguyen | Phuc Nguyen]]. Mentor: Dr. Thomas Kaczmarek.<br />
<br />
Applied Probabilistic Forecasting Methods in Energy Consumption. Dr. George Corliss, students Stephen Loew and Alberto Ruiz<br />
<br />
Abstract: In this research, we try to find applications of probabilistic forecasting to GasDay energy consumption models. Additionally, we will survey the literature to find applications of probabilistic forecasting in industry. Furthermore, we will examine forecast accuracy metrics, including the Brier Score, evaluate their strengths and weaknesses, and consider possible improvements. <br />
<br />
[[Analyzing and Mapping out data of Milwaukee]] . [[ghong|Gina Hong]]. Mentor: Dr. Gary Krenz.<br />
<br />
== Mathematics and Computer Science Education ==<br />
* MUzECS : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.<br />
<br />
* Embedded Xinu : Really cool things and stuff. David Hunpatin and [[User:Rthomas|Ryan Thomas]]. Mentor: Dr. Dennis Brylow.</div>Phucnguyen