Michal
Personal Information
My name is Michal (pronounced me-how). I am 100% Polish, bilingual, and very proud of my heritage. I am currently a sophomore at Marquette University, majoring in Electrical Engineering. For the 2010 REU program I will be working with Dr. Craig Struble in the Bistro Lab on the Intelligent Discovery of Acronyms and Abbreviations (IDA2) project, started by Adam Mallen last year. The general topic of this research is Natural Language Processing (NLP).
Final results
Paper: Media:Comparison of abbreviation recognition algorithms.pdf
Poster: Media:Poster- Comparison of algorithms.pdf
Presentation: Media:Presentation- Comparison of algorithms.pdf
Week 1: May 31 - June 4
May 31
- Memorial Day- no work
- Took train to Milwaukee, settled in at the Men's Catholic House
June 1
- Attended introductory meeting for REU program.
- Browsed wiki from last year's IDA2 project, including Adam Mallen's work log.
- Attended REU talk about research practices.
- Practiced using LaTeX (for typesetting), Subversion (for version control), and Make (for building automation). These tools were discussed at last week's Bistro Lab meeting.
- Prepared for weekly lab meeting by reading Ashelford et al., "At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies." While not directly related to my research project, it gave me a chance to practice reading research papers.
- Time= 4 hours
June 2
- Worked through some LingPipe tutorials (Spelling Correction, Text Classification). LingPipe is a set of Java classes that are used for linguistic analysis.
- Met with Dr. Struble. Discussed my current understanding of NLP, worked some examples, and planned next few weeks.
- Attended weekly Bistro Lab meeting. Praful Aggarwal (a graduate student working with Dr. Struble) led presentation/discussion on software called Pintail (used to detect chimeric sequences in a public genomic database).
- Went to the library and checked out two books (recommended by Dr. Struble).
- "Foundations of Statistical Natural Language Processing" by Manning, Schutze
- "Programming for Corpus Linguistics: How to Do Text Analysis with Java" by Oliver Mason
- Time= 5.5 hours
June 3
- Set up my wiki, updated my work log.
- Will be leaving Milwaukee for long weekend (housesitting while family goes on vacation). Nevertheless, will continue to work diligently.
- Create list of basic NLP terms/definitions
- Time= 2.5 hours
Week 2: June 7 - 11
June 7
- More basic research/learning (added to list of terms)
- Read Schwartz, Ariel S., and Marti A. Hearst, "A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text," Pacific Symposium on Biocomputing 2003.
- Paper outlines the algorithm currently used by IDA2 (a sketch of its core matching step follows this entry)
- Came up with ideas/questions to consider for improving elements of the IDA2 acronym finder
- Time= 6 hours
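For context, the heart of the S&H method is a single right-to-left scan: every alphanumeric character of the short form must be matched, in order, somewhere in the candidate long form to its left, and the first character must begin a word. A minimal Java sketch paraphrasing the matching step described in the paper (method and variable names are mine):

<syntaxhighlight lang="java">
// Sketch of the Schwartz-Hearst matching step: walk the short form
// right-to-left and greedily match each alphanumeric character against
// the candidate long form, also right-to-left.
static String findBestLongForm(String shortForm, String longForm) {
    int lIndex = longForm.length() - 1;                  // position in candidate long form
    for (int sIndex = shortForm.length() - 1; sIndex >= 0; sIndex--) {
        char currChar = Character.toLowerCase(shortForm.charAt(sIndex));
        if (!Character.isLetterOrDigit(currChar)) continue;  // skip punctuation in short form
        // Scan left until this character matches; the first character of the
        // short form must additionally start a word in the long form.
        while ((lIndex >= 0 && Character.toLowerCase(longForm.charAt(lIndex)) != currChar)
                || (sIndex == 0 && lIndex > 0
                    && Character.isLetterOrDigit(longForm.charAt(lIndex - 1)))) {
            lIndex--;
        }
        if (lIndex < 0) return null;                     // no consistent long form exists
        lIndex--;
    }
    // Keep everything from the word where matching stopped to the end.
    return longForm.substring(longForm.lastIndexOf(' ', lIndex) + 1);
}
</syntaxhighlight>

Per the paper, the candidate passed in is limited to at most min(|A|+5, 2|A|) words immediately before the parentheses, where |A| is the number of characters in the short form.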
June 8
- Read introduction to Manning and Schütze's "Foundations of Statistical Natural Language Processing"
- Sections 1.1-1.4= Introduction
- 2.1.1-2.1.10= Mathematics Essentials
- 4.1-4.3= Corpus-Based Work
- 6.1-6.3= "Statistical Inference" n-gram Models over Sparse Data"
- Began writing basic n-gram model of my own. Corpus used is Mark Twain's "Tom Sawyer" (from Project Gutenberg)
- Time= 7 hours
June 9
- Continue writing n-gram code
- Resolved all problems with I/O
- Some improvements in tokenization
- Attend weekly lab meeting (read, discuss article on Genotype-Imputation Accuracy)
- Reread Schwartz, Hearst paper (jot down more notes, ideas)
- Time= 6 hours
June 10
- Reread Manning and Schütze 6.1-6.3 (dealing with n-gram models and estimators)
- Continue working on n-gram model code
- Had an epiphany about how to calculate probabilities
- Stored trigrams in HashMap (key= word, value= frequency); a sketch of the idea follows this entry
- Meet with Dr. Struble to discuss progress
- Looked at my code together, discussed it
- After meeting, improved/cleaned up code
- Got rid of unnecessary I/O
- Overall, shortened code from 140 to 90 lines
- Time= 10 hours
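A minimal sketch of the counting-and-estimating idea; note that I key the maps by the space-joined n-gram here, and all names are illustrative rather than my actual lab code:

<syntaxhighlight lang="java">
import java.util.HashMap;
import java.util.Map;

// Count trigrams and their bigram prefixes, then estimate
// P(w3 | w1 w2) by maximum likelihood: count(w1 w2 w3) / count(w1 w2).
public class TrigramModel {
    private final Map<String, Integer> trigrams = new HashMap<>();
    private final Map<String, Integer> bigrams = new HashMap<>();

    public void train(String[] tokens) {
        for (int i = 0; i + 2 < tokens.length; i++) {
            String bi = tokens[i] + " " + tokens[i + 1];
            bigrams.merge(bi, 1, Integer::sum);
            trigrams.merge(bi + " " + tokens[i + 2], 1, Integer::sum);
        }
    }

    public double probability(String w1, String w2, String w3) {
        int biCount = bigrams.getOrDefault(w1 + " " + w2, 0);
        if (biCount == 0) return 0.0;                    // unseen history
        return trigrams.getOrDefault(w1 + " " + w2 + " " + w3, 0) / (double) biCount;
    }
}
</syntaxhighlight>

The raw maximum-likelihood estimate above is exactly what Manning and Schütze's chapter 6 goes on to smooth for sparse data.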
June 11
- Final edits to n-gram program
- Fix how probability is calculated
- Play with different corpora/texts (Shakespeare, etc.)
- Meet with Dr. Struble
- Get IDA2 code from lab repository
- Discuss how it works and possible shortfalls
- Time= 3 hours
Week 3: June 14 - 18
June 14
- Become familiar with IDA2 code
- Run it with some small input (varied success)
- Find Schwartz/Hearst's original code here
- Short meeting with Dr. Struble (discuss progress, future work)
- Test IDA2 algorithm with data used by Schwartz, Hearst (1000 MEDLINE abstracts, found here)
- Find that our version of the algorithm finds 190 fewer short-form/long-form pairs than the original
- Clean up/fix labels in data (will use these to determine precision/recall and to pinpoint differences in the code)
- Time= 9.5 hours
June 15
- Continue struggling with poorly labeled data
- Create list of all abbreviations found by algorithm
- Organized into categories: Matches, Partials, Wrong, and Missing
- Calculate precision and recall based on this (definitions spelled out below)
- Precision: 90.16% (577 pairs correct/ 640 pairs found)
- Recall: 60.48% (577 pairs found/ 954 total pairs)
- Time= 6.5 hours
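For reference, the standard definitions behind these numbers:
- Precision = correct pairs / pairs found by the algorithm = 577/640 ≈ 90.16%
- Recall = correct pairs / total labeled pairs = 577/954 ≈ 60.48%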
June 16
- Double-check the generated list of pairs found by the algorithm
- Get Schwartz/Hearst's original algorithm and run it on same data
- Precision: 94.44% (Paper claims 95%)
- Recall: 80.18% (Paper claims 82%)
- Start comparing pairs found from both algorithms (looking for differences that suggest error in code)
- Meeting with Dr. Struble
- Resolve some small problems with program
- Learn basics of database management in 30 min.
- Attend weekly lab meeting (another of Dr. Struble's students gave his thesis defense for practice)
- Time= 5.5 hours
Week 4: June 21 - 25
June 21
- Clean up process of categorizing pairs (match, miss, etc.)
- Write new method to do it automatically, using the generated list of pairs from the algorithm and the actual labeled list (a sketch follows this entry)
- Fix some more errors in data (labeling mistakes; did not correct grammatical/spelling errors in the abstracts)
- Make a key realization: part of the reason the IDA2 code has much lower precision/recall is how text is passed in/parsed
- It goes line by line, but about 160 pairs span two lines
- Time= 10 hours
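A hypothetical sketch of that automatic categorization; the four buckets follow the June 15 list, but the matching rules (exact vs. substring overlap) and all names are my illustration of the idea, not the actual method:

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Compare the algorithm's (short form -> long form) pairs against the
// labeled gold list. Assumes one pair per short form for simplicity.
public class PairCategorizer {
    public static void categorize(Map<String, String> found, Map<String, String> gold) {
        List<String> matches = new ArrayList<>(), partials = new ArrayList<>(),
                     wrong = new ArrayList<>(), missing = new ArrayList<>();
        for (Map.Entry<String, String> e : found.entrySet()) {
            String goldLong = gold.get(e.getKey());
            if (goldLong == null) {
                wrong.add(e.getKey());                    // short form not in gold list
            } else if (goldLong.equalsIgnoreCase(e.getValue())) {
                matches.add(e.getKey());                  // exact long-form match
            } else if (goldLong.toLowerCase().contains(e.getValue().toLowerCase())
                    || e.getValue().toLowerCase().contains(goldLong.toLowerCase())) {
                partials.add(e.getKey());                 // overlapping but inexact
            } else {
                wrong.add(e.getKey());                    // found, but wrong long form
            }
        }
        for (String sf : gold.keySet()) {
            if (!found.containsKey(sf)) missing.add(sf);  // in gold list, never found
        }
        System.out.printf("match=%d partial=%d wrong=%d missing=%d%n",
                matches.size(), partials.size(), wrong.size(), missing.size());
    }
}
</syntaxhighlight>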
June 22
- Fix error in I/O causing IDA2 program to miss pairs (now finding 802 instead of 637 pairs)
- Read in all lines and separate into sentences, saved temporarily to an ArrayList (sketch below)
- Generate new list of pairs from IDA2
- Start categorizing and comparing (again)
- IDA2 algorithm is a lot better than initially predicted:
- Precision= 92.52% (742 pairs correct/ 802 pairs found)
- Recall= 76.18% (742 pairs found/ 974 total pairs)
- Time= 6.5 hours
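A rough sketch of the I/O fix, assuming plain-text input; the sentence-splitting regex is a naive stand-in for whatever rule IDA2 actually applies:

<syntaxhighlight lang="java">
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Join all of an abstract's lines before sentence-splitting, so that
// pairs broken across two lines are no longer lost.
public class SentenceReader {
    static List<String> readSentences(Path file) throws IOException {
        String text = String.join(" ", Files.readAllLines(file));
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {   // split after ., !, or ?
            if (!s.trim().isEmpty()) sentences.add(s.trim());
        }
        return sentences;
    }
}
</syntaxhighlight>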
June 23
- Attend REU talk- "Expressing yourself verbally and in writing"
- Weekly lab meeting- just talk about everyone's recent progress
- Find differences between pairs found by each algorithm
- Original algorithm finds pairs with no capital letters
- Also handles pairs containing parentheses and punctuation better
- Collect information about context of each pair that is not found by both algorithms
- Time= 6 hours
June 24
- Continue working with pairs found by algorithms
- Get ideas on how to fix IDA2
- One recurring problem is nested parentheses
- Another problem is whether or not the short form has a capital letter
- Time= 2 hours
Week 5: June 28 - July 2
June 28
- Create table/graph to compare performance of algorithms
- Search for new research papers to read (used Google Scholar, looked for papers that cited Schwartz/Hearst)
- Think of some more possible improvements to algorithm (still cannot explain some discrepancies)
- Time= 7 hours
June 29
- Continue working with pairs found by algorithms
- Think of other possible improvements to IDA2 (new algorithm, etc.)
- Learn some basics about databases and SQL, which IDA2 uses to store all acronym/abbreviation pairs found (toy example after this entry)
- Read from Raghu Ramakrishnan's "Database Management Systems" (old edition [1997])
- Section 1= Introduction to Database Systems
- 2= The Relational Model
- 4= File Organizations and Indexes
- 9= SQL: The Query Language
- Time= 8 hours
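Since I am still learning the specifics, here is only a toy example of the kind of table and insert involved; the schema, table name, and the SQLite/JDBC choice are my illustration, not IDA2's actual setup:

<syntaxhighlight lang="java">
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Create a hypothetical table of abbreviation pairs and insert one row.
// Requires the sqlite-jdbc driver on the classpath.
public class PairStore {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:pairs.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS abbreviation ("
                        + "short_form TEXT, long_form TEXT, abstract_id INTEGER)");
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO abbreviation VALUES (?, ?, ?)")) {
                ps.setString(1, "NLP");
                ps.setString(2, "Natural Language Processing");
                ps.setInt(3, 1);
                ps.executeUpdate();
            }
        }
    }
}
</syntaxhighlight>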
June 30
- Continue learning/practicing DBMS/SQL
- Take break from algorithm (to clear head)
- Time= 3.5 hours
July 1
- Make improvements to both algorithms
- Reconcile differences in matches
- Fix problem with certain partials (nested parentheses) and wrong pairs (no space before parentheses); see the sketch after this entry
- Recalculate precision and recall of both algorithms
- SH*- Precision= 94.36% (786 pairs correct/ 833 pairs found), Recall= 80.70% (786 pairs found/ 974 total pairs)
- IDA2*- Precision= 94.42% (779 pairs correct/ 825 pairs found), Recall= 79.98% (779 pairs found/ 974 total pairs)
- Time= 7 hours
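One way the nested-parentheses fix could look: track nesting depth instead of stopping at the first closing parenthesis. A sketch under that assumption, not the exact patch applied:

<syntaxhighlight lang="java">
// Return the contents of the balanced parenthesized group that opens
// at openIndex, or null if the parentheses never balance.
public class ParenExtractor {
    static String extractParenthesized(String sentence, int openIndex) {
        int depth = 0;
        for (int i = openIndex; i < sentence.length(); i++) {
            char c = sentence.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return sentence.substring(openIndex + 1, i);  // whole balanced group
            }
        }
        return null;                                          // unbalanced; no candidate
    }
}
</syntaxhighlight>

On "intelligent discovery of acronyms (IDA2 (version 2))", for example, this keeps "IDA2 (version 2)" together instead of cutting at the first ')'.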
July 2
- Document changes in algorithms (and results)
- Find, print several papers to read (6-8)
- Looked for papers cited by Schwartz & Hearst, or that cited their paper
- Read:
- Larkey et al., "Acrophile: An Automated Acronym Extractor and Server"
- Paper discusses a project very similar to IDA2. I was specifically interested in the 4 algorithms they used to find abbreviation/acronym pairs
- Some of their algorithms could detect and define short forms that were not inside parentheses (using stop words).
- However, the precision and recall of these algorithms are very poor compared to S&H (at most, about 20%).
- Park and Byrd, "Hybrid text mining for finding abbreviations and their definitions"
- Cited by S&H (just like the previous one). It introduces an algorithm that uses a simple alignment scheme like S&H, but also builds a "RuleBase" of different patterns.
- S&H can only define abbreviations/acronyms when the long and short forms are right next to each other, while this one can define them even if the long/short form pairs are offset from each other.
- Time= 8.5 hours
Week 6: July 5 - 9
July 5
- Read another paper that S&H had cited: "Using Compression to Identify Acronyms in Text."
- Not as helpful as the other papers- published over 20 years ago, and its algorithm only dealt with acronyms
- Clever use of a threshold based on ratio of acronym to definition length.
- Read two other papers that cited S&H in their references (written in 2005/2006, 2-3 years after S&H).
- Torii et al., [http://www.biomedcentral.com/content/pdf/1471-2105-8-S9-S5.pdf A comparison study on algorithms of detecting long forms for short forms in biomedical text]