Michal
Personal Information
My name is Michal (pronounced me-how). I am 100% Polish, bilingual, and very proud of my heritage. I am currently a sophomore at Marquette University, majoring in Electrical Engineering. For the 2010 REU program I will be working with Dr. Craig Struble in the Bistro Lab on the Intelligent Discovery of Acronyms and Abbreviations (IDA2) project, started by Adam Mallen last year. The general topic of this research is Natural Language Processing (NLP).
Final results
Paper: Media:Comparison of abbreviation recognition algorithms.pdf
Poster: Media:Poster- Comparison of algorithms.pdf
Presentation: Media:Presentation- Comparison of algorithms.pdf
Week 1: May 31 - June 4
May 31
- Memorial Day- no work
- Took train to Milwaukee, settled in at the Men's Catholic House
June 1
- Attended introductory meeting for REU program.
- Browsed wiki from last year's IDA2 project, including Adam Mallen's work log.
- Attended REU talk about research practices.
- Practiced using LaTeX (for typesetting), Subversion (for version control), and Make (for building automation). These tools were discussed at last week's Bistro Lab meeting.
- Prepared for weekly lab meeting by reading Ashelford et al., "At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies." While not directly related to my research project, it gave me a chance to practice reading research papers.
- Time= 4 hours
June 2
- Worked through some LingPipe tutorials (Spelling Correction, Text Classification). LingPipe is a set of Java classes that are used for linguistic analysis.
- Met with Dr. Struble. Discussed my current understanding of NLP, worked some examples, and planned next few weeks.
- Attended weekly Bistro Lab meeting. Praful Aggarwal (a graduate student working with Dr. Struble) led presentation/discussion on software called Pintail (used to detect chimeric sequences in a public genomic database).
- Went to the library and checked out two books (recommended by Dr. Struble).
- "Foundations of Statistical Natural Language Processing" by Manning, Schutze
- "Programming for Corpus Linguistics: How to Do Text Analysis with Java" by Oliver Mason
- Time= 5.5 hours
June 3
- Set up my wiki, updated my work log.
- Will be leaving Milwaukee for long weekend (housesitting while family goes on vacation). Nevertheless, will continue to work diligently.
- Create list of basic NLP terms/definitions
- Time= 2.5 hours
Week 2: June 7 - 11
June 7
- More basic research/learning (added to list of terms)
- Read Schwartz, Ariel S., and Marti A. Hearst, "A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text," Pacific Symposium on Biocomputing 2003.
- Paper outlines the algorithm currently used by IDA2 (a sketch of its core matching step follows this entry)
- Came up with ideas/questions to consider for improving elements of the IDA2 acronym finder
- Time= 6 hours
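For context, the heart of the S&H method is a single right-to-left scan: every alphanumeric character of the short form must be matched, in order, somewhere in the candidate long form to its left, and the first character must begin a word. A minimal Java sketch paraphrasing the matching step described in the paper (method and variable names are mine):

<syntaxhighlight lang="java">
// Sketch of the Schwartz-Hearst matching step: walk the short form
// right-to-left and greedily match each alphanumeric character against
// the candidate long form, also right-to-left.
static String findBestLongForm(String shortForm, String longForm) {
    int lIndex = longForm.length() - 1;                  // position in candidate long form
    for (int sIndex = shortForm.length() - 1; sIndex >= 0; sIndex--) {
        char currChar = Character.toLowerCase(shortForm.charAt(sIndex));
        if (!Character.isLetterOrDigit(currChar)) continue;  // skip punctuation in short form
        // Scan left until this character matches; the first character of the
        // short form must additionally start a word in the long form.
        while ((lIndex >= 0 && Character.toLowerCase(longForm.charAt(lIndex)) != currChar)
                || (sIndex == 0 && lIndex > 0
                    && Character.isLetterOrDigit(longForm.charAt(lIndex - 1)))) {
            lIndex--;
        }
        if (lIndex < 0) return null;                     // no consistent long form exists
        lIndex--;
    }
    // Keep everything from the word where matching stopped to the end.
    return longForm.substring(longForm.lastIndexOf(' ', lIndex) + 1);
}
</syntaxhighlight>

Per the paper, the candidate passed in is limited to at most min(|A|+5, 2|A|) words immediately before the parentheses, where |A| is the number of characters in the short form.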
June 8
- Read introduction to Manning and Schütze's "Foundations of Statistical Natural Language Processing"
- Sections 1.1-1.4= Introduction
- 2.1.1-2.1.10= Mathematics Essentials
- 4.1-4.3= Corpus-Based Work
- 6.1-6.3= "Statistical Inference" n-gram Models over Sparse Data"
- Began writing basic n-gram model of my own. Corpus used is Mark Twain's "Tom Sawyer" (from Project Gutenberg)
- Time= 7 hours
June 9
- Continue writing n-gram code
- Resolved all problems with I/O
- Some improvements in tokenization
- Attend weekly lab meeting (read, discuss article on Genotype-Imputation Accuracy)
- Reread Schwartz, Hearst paper (jot down more notes, ideas)
- Time= 6 hours
June 10
- Reread Manning and Schütze 6.1-6.3 (dealing with n-gram models and estimators)
- Continue working on n-gram model code
- Had an epiphany about how to calculate probabilities
- Stored trigrams in HashMap (key= word, value= frequency); a sketch of the idea follows this entry
- Meet with Dr. Struble to discuss progress
- Looked at my code together, discussed it
- After meeting, improved/cleaned up code
- Got rid of unnecessary I/O
- Overall, shortened code from 140 to 90 lines
- Time= 10 hours
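A minimal sketch of the counting-and-estimating idea; note that I key the maps by the space-joined n-gram here, and all names are illustrative rather than my actual lab code:

<syntaxhighlight lang="java">
import java.util.HashMap;
import java.util.Map;

// Count trigrams and their bigram prefixes, then estimate
// P(w3 | w1 w2) by maximum likelihood: count(w1 w2 w3) / count(w1 w2).
public class TrigramModel {
    private final Map<String, Integer> trigrams = new HashMap<>();
    private final Map<String, Integer> bigrams = new HashMap<>();

    public void train(String[] tokens) {
        for (int i = 0; i + 2 < tokens.length; i++) {
            String bi = tokens[i] + " " + tokens[i + 1];
            bigrams.merge(bi, 1, Integer::sum);
            trigrams.merge(bi + " " + tokens[i + 2], 1, Integer::sum);
        }
    }

    public double probability(String w1, String w2, String w3) {
        int biCount = bigrams.getOrDefault(w1 + " " + w2, 0);
        if (biCount == 0) return 0.0;                    // unseen history
        return trigrams.getOrDefault(w1 + " " + w2 + " " + w3, 0) / (double) biCount;
    }
}
</syntaxhighlight>

The raw maximum-likelihood estimate above is exactly what Manning and Schütze's chapter 6 goes on to smooth for sparse data.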
June 11
- Final edits to n-gram program
- Fix how probability is calculated
- Play with different corpora/texts (Shakespeare, etc.)
- Meet with Dr. Struble
- Get IDA2 code from lab repository
- Discuss how it works and possible shortfalls
- Time= 3 hours
Week 3: June 14 - 18
June 14
- Become familiar with IDA2 code
- Run it with some small input (varied success)
- Find Schwartz/Hearst's original code here
- Short meeting with Dr. Struble (discuss progress, future work)
- Test IDA2 algorithm with data used by Schwartz, Hearst (1000 MEDLINE abstracts, found here)
- Find that our version of the algorithm finds 190 fewer short-form/long-form pairs than the original
- Clean up/fix labels in data (will use these to determine precision/recall and to pinpoint differences in the code)
- Time= 9.5 hours
June 15
- Continue struggling with poorly labeled data
- Create list of all abbreviations found by algorithm
- Organized into categories: Matches, Partials, Wrong, and Missing
- Calculate precision and recall based on this (definitions spelled out below)
- Precision: 90.16% (577 pairs correct/ 640 pairs found)
- Recall: 60.48% (577 pairs found/ 954 total pairs)
- Time= 6.5 hours
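For reference, the standard definitions behind these numbers:
- Precision = correct pairs / pairs found by the algorithm = 577/640 ≈ 90.16%
- Recall = correct pairs / total labeled pairs = 577/954 ≈ 60.48%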
June 16
- Double-check the generated list of pairs found by the algorithm
- Get Schwartz/Hearst's original algorithm and run it on same data
- Precision: 94.44% (Paper claims 95%)
- Recall: 80.18% (Paper claims 82%)
- Start comparing pairs found from both algorithms (looking for differences that suggest error in code)
- Meeting with Dr. Struble
- Resolve some small problems with program
- Learn basics of database management in 30 min.
- Attend weekly lab meeting (another of Dr. Struble's students gave his thesis defense for practice)
- Time= 5.5 hours
Week 4: June 21 - 25
June 21
- Clean up process of categorizing pairs (match, miss, etc.)
- Write new method to do it automatically, using the generated list of pairs from the algorithm and the actual labeled list (a sketch follows this entry)
- Fix some more errors in data (labeling mistakes; did not correct grammatical/spelling errors in the abstracts)
- Make a key realization: part of the reason the IDA2 code has much lower precision/recall is how text is passed in/parsed
- It goes line by line, but about 160 pairs span two lines
- Time= 10 hours
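A hypothetical sketch of that automatic categorization; the four buckets follow the June 15 list, but the matching rules (exact vs. substring overlap) and all names are my illustration of the idea, not the actual method:

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Compare the algorithm's (short form -> long form) pairs against the
// labeled gold list. Assumes one pair per short form for simplicity.
public class PairCategorizer {
    public static void categorize(Map<String, String> found, Map<String, String> gold) {
        List<String> matches = new ArrayList<>(), partials = new ArrayList<>(),
                     wrong = new ArrayList<>(), missing = new ArrayList<>();
        for (Map.Entry<String, String> e : found.entrySet()) {
            String goldLong = gold.get(e.getKey());
            if (goldLong == null) {
                wrong.add(e.getKey());                    // short form not in gold list
            } else if (goldLong.equalsIgnoreCase(e.getValue())) {
                matches.add(e.getKey());                  // exact long-form match
            } else if (goldLong.toLowerCase().contains(e.getValue().toLowerCase())
                    || e.getValue().toLowerCase().contains(goldLong.toLowerCase())) {
                partials.add(e.getKey());                 // overlapping but inexact
            } else {
                wrong.add(e.getKey());                    // found, but wrong long form
            }
        }
        for (String sf : gold.keySet()) {
            if (!found.containsKey(sf)) missing.add(sf);  // in gold list, never found
        }
        System.out.printf("match=%d partial=%d wrong=%d missing=%d%n",
                matches.size(), partials.size(), wrong.size(), missing.size());
    }
}
</syntaxhighlight>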
June 22
- Fix error in I/O causing IDA2 program to miss pairs (now finding 802 instead of 637 pairs)
- Read in all lines and separate into sentences, saved temporarily to an ArrayList (sketch below)
- Generate new list of pairs from IDA2
- Start categorizing and comparing (again)
- IDA2 algorithm is a lot better than initially predicted:
- Precision= 92.52% (742 pairs correct/ 802 pairs found)
- Recall= 76.18% (742 pairs found/ 974 total pairs)
- Time= 6.5 hours
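A rough sketch of the I/O fix, assuming plain-text input; the sentence-splitting regex is a naive stand-in for whatever rule IDA2 actually applies:

<syntaxhighlight lang="java">
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Join all of an abstract's lines before sentence-splitting, so that
// pairs broken across two lines are no longer lost.
public class SentenceReader {
    static List<String> readSentences(Path file) throws IOException {
        String text = String.join(" ", Files.readAllLines(file));
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {   // split after ., !, or ?
            if (!s.trim().isEmpty()) sentences.add(s.trim());
        }
        return sentences;
    }
}
</syntaxhighlight>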
June 23
- Attend REU talk- "Expressing yourself verbally and in writing"
- Weekly lab meeting- just talk about everyone's recent progress
- Find differences between pairs found by each algorithm
- Original algorithm finds pairs with no capital letters
- Also handles pairs containing parentheses and punctuation better
- Collect information about context of each pair that is not found by both algorithms
- Time= 6 hours
June 24
- Continue working with pairs found by algorithms
- Get ideas on how to fix IDA2
- One recurring problem is nested parentheses
- Another problem is whether or not the short form has a capital letter
- Time= 2 hours
Week 5: June 28 - July 2
June 28
- Create table/graph to compare performance of algorithms
- Search for new research papers to read (used Google Scholar, looked for papers that cited Schwartz/Hearst)
- Think of some more possible improvements to algorithm (still cannot explain some discrepancies)
- Time= 7 hours
June 29
- Continue working with pairs found by algorithms
- Think of other possible improvements to IDA2 (new algorithm, etc.)
- Learn some basics about databases and SQL, which IDA2 uses to store all acronym/abbreviation pairs found (toy example after this entry)
- Read from Raghu Ramakrishnan's "Database Management Systems" (old edition [1997])
- Section 1= Introduction to Database Systems
- 2= The Relational Model
- 4= File Organizations and Indexes
- 9= SQL: The Query Language
- Time= 8 hours
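Since I am still learning the specifics, here is only a toy example of the kind of table and insert involved; the schema, table name, and the SQLite/JDBC choice are my illustration, not IDA2's actual setup:

<syntaxhighlight lang="java">
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Create a hypothetical table of abbreviation pairs and insert one row.
// Requires the sqlite-jdbc driver on the classpath.
public class PairStore {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:pairs.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS abbreviation ("
                        + "short_form TEXT, long_form TEXT, abstract_id INTEGER)");
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO abbreviation VALUES (?, ?, ?)")) {
                ps.setString(1, "NLP");
                ps.setString(2, "Natural Language Processing");
                ps.setInt(3, 1);
                ps.executeUpdate();
            }
        }
    }
}
</syntaxhighlight>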
June 30
- Continue learning/practicing DBMS/SQL
- Take break from algorithm (to clear head)
- Time= 3.5 hours
July 1
- Make improvements to both algorithms
- Reconcile differences in matches
- Fix problem with certain partials (nested parentheses) and wrong pairs (no space before parentheses); see the sketch after this entry
- Recalculate precision and recall of both algorithms
- SH*- Precision= 94.36% (786 pairs correct/ 833 pairs found), Recall= 80.70% (786 pairs found/ 974 total pairs)
- IDA2*- Precision= 94.42% (779 pairs correct/ 825 pairs found), Recall= 79.98% (779 pairs found/ 974 total pairs)
- Time= 7 hours
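One way the nested-parentheses fix could look: track nesting depth instead of stopping at the first closing parenthesis. A sketch under that assumption, not the exact patch applied:

<syntaxhighlight lang="java">
// Return the contents of the balanced parenthesized group that opens
// at openIndex, or null if the parentheses never balance.
public class ParenExtractor {
    static String extractParenthesized(String sentence, int openIndex) {
        int depth = 0;
        for (int i = openIndex; i < sentence.length(); i++) {
            char c = sentence.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return sentence.substring(openIndex + 1, i);  // whole balanced group
            }
        }
        return null;                                          // unbalanced; no candidate
    }
}
</syntaxhighlight>

On "intelligent discovery of acronyms (IDA2 (version 2))", for example, this keeps "IDA2 (version 2)" together instead of cutting at the first ')'.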
July 2
- Document changes in algorithms (and results)
- Find, print several papers to read (6-8)
- Looked for papers cited by Schwartz & Hearst, or that cited their paper
- Read:
- Larkey et al., "Acrophile: An Automated Acronym Extractor and Server"
- Paper discusses a project very similar to IDA2. I was specifically interested in the 4 algorithms they used to find abbreviation/acronym pairs
- Some of their algorithms could detect and define short forms that were not inside parentheses (using stop words).
- However, the precision and recall of these algorithms are very poor compared to S&H (at most, about 20%).
- Park and Byrd, "Hybrid text mining for finding abbreviations and their definitions"
- Cited by S&H (just like the previous one). It introduces an algorithm that uses a simple alignment scheme like S&H, but also builds a "RuleBase" of different patterns.
- S&H can only define abbreviations/acronyms when the long and short forms are right next to each other, while this one can define them even if the long/short form pairs are offset from each other.
- Time= 8.5 hours
Week 6: July 5 - 9
July 5
- Read another paper that S&H had cited: "Using Compression to Identify Acronyms in Text."
- Not as helpful as the other papers- published over 20 years ago, and its algorithm only dealt with acronyms
- Clever use of a threshold based on ratio of acronym to definition length.
- Read two other papers that cited S&H in their references (written in 2005/2006, 2-3 years after S&H).
- Torii et al., [http://www.biomedcentral.com/content/pdf/1471-2105-8-S9-S5.pdf A comparison study on algorithms of detecting long forms for short forms in biomedical text]