Michal

Personal Information

My name is Michal (pronounced me-how). I am 100% Polish, bilingual, and very proud of my heritage. I am currently a sophomore at Marquette University, working on an Electrical Engineering major. For the 2010 REU program I will be working with Dr. Craig Struble in the Bistro Lab on the Intelligent Discovery of Acronyms and Abbreviations (IDA2) project, started by Adam Mallen last year. The general topic of this research is Natural Language Processing (NLP).

Final result

Paper: Comparison of abbreviation recognition algorithms.pdf

Poster: Media:Poster- Comparison of algorithms.pdf

Presentation: Media:Presentation- Comparison of algorithms.pdf

Week 1: May 31- June 4

May 31

  • Memorial Day- no work
  • Took train to Milwaukee, settled in at the Men's Catholic House

June 1

  • Attended introductory meeting for REU program.
  • Browsed wiki from last year's IDA2 project, including Adam Mallen's work log.
  • Attended REU talk about research practices.
  • Practiced using LaTeX (for typesetting), Subversion (for version control), and Make (for building automation). These tools were discussed at last week's Bistro Lab meeting.
  • Prepared for weekly lab meeting by reading Ashelford et al., "At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies." While not directly related to my research project, it gave me a chance to practice reading research papers.
  • Time= 4 hours

June 2

  • Worked through some LingPipe tutorials (Spelling Correction, Text Classification). LingPipe is a set of Java classes that are used for linguistic analysis.
  • Met with Dr. Struble. Discussed my current understanding of NLP, worked some examples, and planned next few weeks.
  • Attended weekly Bistro Lab meeting. Praful Aggarwal (a graduate student working with Dr. Struble) led presentation/discussion on software called Pintail (used to detect chimeric sequences in a public genomic database).
  • Went to the library and checked out two books (recommended by Dr. Struble).
  • Time= 5.5 hours

June 3

  • Set up my wiki, updated my work log.
  • Will be leaving Milwaukee for long weekend (housesitting while family goes on vacation). Nevertheless, will continue to work diligently.
  • Create list of basic NLP terms/definitions
  • Time= 2.5 hours

Week 2: June 7- 11

June 7

June 8

  • Read introduction to Manning and Schütze's "Foundations of Statistical Natural Language Processing"
    • Sections 1.1-1.4= Introduction
    • 2.1.1-2.1.10= Mathematics Essentials
    • 4.1-4.3= Corpus-Based Work
    • 6.1-6.3= "Statistical Inference: n-gram Models over Sparse Data"
  • Began writing basic n-gram model of my own. Corpus used is Mark Twain's "Tom Sawyer" (from Project Gutenberg)
  • Time= 7 hours

June 9

  • Continue writing n-gram code
    • Resolved all problems with I/O
    • Some improvements in tokenization
  • Attend weekly lab meeting (read, discuss article on Genotype-Imputation Accuracy)
  • Reread Schwartz, Hearst paper (jot down more notes, ideas)
  • Time= 6 hours

June 10

  • Reread Manning and Schütze 6.1-6.3 (dealing with n-gram models and estimators)
  • Continue working on n-gram model code
    • Had epiphany on how to calculate probabilities
    • Stored trigrams in HashMap (key= word, value= frequency)
  • Meet with Dr. Struble to discuss progress
    • Looked at my code together, discussed it
  • After meeting, improved/cleaned up code
    • Got rid of unnecessary I/O
    • Overall, shorten code from 140 to 90 lines
  • Time= 10 hours
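
As an illustration of the trigram counting described above (a sketch, not the program from this log): the class below keeps trigram frequencies in a HashMap and estimates P(w3 | w1 w2) as count(w1 w2 w3) / count(w1 w2). The class name, the space-joined string keys, and the letters-only tokenizer are all assumptions made for the example. The estimate is the plain unsmoothed maximum-likelihood one, so any unseen trigram gets probability 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative trigram counter; names and tokenization are assumptions. */
class TrigramSketch {
    // key = three words joined by spaces, value = frequency
    private final Map<String, Integer> trigramCounts = new HashMap<>();
    // key = the two-word context of each trigram, value = frequency
    private final Map<String, Integer> contextCounts = new HashMap<>();

    /** Lowercases the text, keeps runs of letters/apostrophes as tokens, counts trigrams. */
    void train(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z']+")) {
            if (!t.isEmpty()) {
                tokens.add(t); // split() can yield a leading empty string
            }
        }
        for (int i = 0; i + 2 < tokens.size(); i++) {
            String context = tokens.get(i) + " " + tokens.get(i + 1);
            String trigram = context + " " + tokens.get(i + 2);
            contextCounts.merge(context, 1, Integer::sum);
            trigramCounts.merge(trigram, 1, Integer::sum);
        }
    }

    /** Unsmoothed maximum-likelihood estimate of P(w3 | w1 w2); 0 if the context is unseen. */
    double probability(String w1, String w2, String w3) {
        String context = w1 + " " + w2;
        int contextCount = contextCounts.getOrDefault(context, 0);
        if (contextCount == 0) {
            return 0.0;
        }
        int trigramCount = trigramCounts.getOrDefault(context + " " + w3, 0);
        return (double) trigramCount / contextCount;
    }

    public static void main(String[] args) {
        TrigramSketch model = new TrigramSketch();
        model.train("The cat sat on the mat and the cat ran.");
        System.out.println(model.probability("the", "cat", "sat"));
    }
}
```

Counting the two-word context only where a third word follows keeps the estimates for a given context summing to 1.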

June 11

  • Final edits to n-gram program
    • Fix how probability is calculated
    • Play with different corpora/texts (Shakespeare, etc.)
  • Meet with Dr. Struble
    • Get IDA2 code from lab repository
    • Discuss how it works and possible shortfalls
  • Time= 3 hours

Week 3: June 14- 18

June 14

  • Become familiar with IDA2 code
    • Run it with some small input (varied success)
    • Find Schwartz, Hearst original code here
  • Short meeting with Dr. Struble (discuss progress, future work)
  • Test IDA2 algorithm with data used by Schwartz, Hearst (1000 MEDLINE abstracts, found here)
    • Find that our version of the algorithm finds 190 fewer short form/long form pairs than the original
    • Clean up/fix labels in data (will use to determine precision/recall, determine differences in code)
  • Time= 9.5 hours

June 15

  • Continue struggling with poorly labeled data
  • Create list of all abbreviations found by algorithm
    • Organized into categories: Matches, Partials, Wrong, and Missing
    • Calculate precision and recall based on this
      • Precision: 90.16% (577 pairs correct/ 640 pairs found)
      • Recall: 60.48% (577 pairs found/ 954 total pairs)
  • Time= 6.5 hours
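
For reference, the precision and recall figures above follow from two ratios: precision = correct pairs / pairs found, and recall = correct pairs / total pairs in the labeled data. A small sketch of that arithmetic (the class and method names are made up for this example):

```java
/** Illustrative precision/recall arithmetic; names are assumptions. */
class PrecisionRecallSketch {
    /** Percentage of the pairs the algorithm reported that are correct. */
    static double precisionPercent(int correct, int found) {
        return found == 0 ? 0.0 : 100.0 * correct / found;
    }

    /** Percentage of the labeled pairs that the algorithm found correctly. */
    static double recallPercent(int correct, int total) {
        return total == 0 ? 0.0 : 100.0 * correct / total;
    }

    public static void main(String[] args) {
        // Counts taken from the June 15 entry: 577 correct, 640 found, 954 labeled.
        System.out.printf("Precision: %.2f%%%n", precisionPercent(577, 640));
        System.out.printf("Recall: %.2f%%%n", recallPercent(577, 954));
    }
}
```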

June 16

  • Double-check generated list of abbreviations found by algorithm
  • Get Schwartz/Hearst's original algorithm and run it on same data
    • Precision: 94.44% (Paper claims 95%)
    • Recall: 80.18% (Paper claims 82%)
    • Start comparing pairs found from both algorithms (looking for differences that suggest error in code)
  • Meeting with Dr. Struble
    • Resolve some small problems with program
    • Learn basics of database management in 30 min.
  • Attend weekly lab meeting (another of Dr. Struble's students gave his thesis defense for practice)
  • Time= 5.5 hours

Week 4: June 21-25

June 21

  • Clean up process of categorizing pairs (match, miss, etc.)
    • Write new method to do it automatically (using generated list of pairs from algorithm and actual list)
    • Fix some more errors in data (like labeling, but did not correct grammatical/spelling errors in abstracts)
  • Realization: part of the reason the IDA2 code has much lower precision/recall is how text is passed in/parsed
    • Goes line by line, but about 160 pairs span two lines
  • Time= 10 hours
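
A rough sketch of how the automatic categorization could work, given the pairs an algorithm found and the labeled pairs. The category definitions here are assumptions, since the log does not spell them out: a Match has the same short and long form as the label, a Partial has the right short form but a different long form, a Wrong pair has a short form that is not labeled at all, and a Missing pair is labeled but never found. The sketch also assumes one long form per short form, which real abstracts may violate.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative pair categorizer; category definitions are assumptions. */
class PairCategorizerSketch {
    enum Category { MATCH, PARTIAL, WRONG, MISSING }

    /** Both maps go from short form to long form; returns the short forms in each category. */
    static Map<Category, List<String>> categorize(Map<String, String> found,
                                                  Map<String, String> labeled) {
        Map<Category, List<String>> result = new EnumMap<>(Category.class);
        for (Category c : Category.values()) {
            result.put(c, new ArrayList<>());
        }
        for (Map.Entry<String, String> pair : found.entrySet()) {
            String labeledLongForm = labeled.get(pair.getKey());
            Category category;
            if (labeledLongForm == null) {
                category = Category.WRONG;      // short form is not in the labels
            } else if (labeledLongForm.equalsIgnoreCase(pair.getValue())) {
                category = Category.MATCH;      // short and long form both agree
            } else {
                category = Category.PARTIAL;    // right short form, different long form
            }
            result.get(category).add(pair.getKey());
        }
        for (String shortForm : labeled.keySet()) {
            if (!found.containsKey(shortForm)) {
                result.get(Category.MISSING).add(shortForm); // labeled but never found
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> found = new LinkedHashMap<>();
        found.put("PCR", "polymerase chain reaction");
        found.put("DNA", "acid");
        Map<String, String> labeled = new LinkedHashMap<>();
        labeled.put("PCR", "polymerase chain reaction");
        labeled.put("DNA", "deoxyribonucleic acid");
        labeled.put("RNA", "ribonucleic acid");
        System.out.println(categorize(found, labeled));
    }
}
```

With these definitions, precision is Matches / pairs found and recall is Matches / labeled pairs.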

June 22

  • Fix error in I/O causing IDA2 program to miss pairs (now finding 802 instead of 637 pairs)
    • Read in all lines and separate into sentences (save temporarily to ArrayList)
    • Generate new list of pairs from IDA2
    • Start categorizing and comparing (again)
  • IDA2 algorithm is a lot better than initially predicted:
    • Precision= 92.52% (742 pairs correct/ 802 pairs found)
    • Recall= 76.18% (742 pairs found/ 974 total pairs)
  • Time= 6.5 hours
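
A sketch of the kind of fix described above: join all input lines first, then split the joined text into sentences, so a pair that wraps across a line break lands in one sentence. This is not the IDA2 code. It uses the JDK's java.text.BreakIterator as a stand-in sentence splitter, and the class and method names are invented for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative line-joining sentence splitter; not the IDA2 implementation. */
class SentenceSplitSketch {
    /** Joins the lines with single spaces, then splits the result into trimmed sentences. */
    static List<String> toSentences(List<String> lines) {
        List<String> trimmedLines = new ArrayList<>();
        for (String line : lines) {
            trimmedLines.add(line.trim());
        }
        String text = String.join(" ", trimmedLines);

        List<String> sentences = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // The long form and its short form sit on different input lines.
        List<String> lines = Arrays.asList(
                "The polymerase chain",
                "reaction (PCR) was used. Results were clear.");
        for (String sentence : toSentences(lines)) {
            System.out.println(sentence);
        }
    }
}
```

A general-purpose splitter like this can still break at periods inside abbreviations such as "e.g.", so biomedical abstracts may need extra handling.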

June 23

  • Attend REU talk- "Expressing yourself verbally and in writing"
  • Weekly lab meeting- just talk about everyone's recent progress
  • Find differences between pairs found by each algorithm
    • Original algorithm finds pairs with no capital letters
    • Also handles pairs containing parentheses or punctuation better
  • Collect information about context of each pair that is not found by both algorithms
  • Time= 6 hours

June 24

  • Continue working with pairs found by algorithms
  • Get ideas on how to fix IDA2
    • One recurring problem is nested parentheses
    • Another problem is whether or not the short form has a capital letter
  • Time= 2 hours
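
One possible way to cope with nested parentheses, sketched under my own assumptions (the log does not say how the problem was eventually handled): track nesting depth while scanning, and return only the outermost parenthesized spans as short form candidates. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative extraction of outermost parenthesized spans; an assumed approach. */
class ParenthesesSketch {
    /** Returns the text inside each outermost pair of parentheses, in order of appearance. */
    static List<String> topLevelGroups(String text) {
        List<String> groups = new ArrayList<>();
        int depth = 0;   // current nesting level
        int start = -1;  // index of the '(' that opened the current outermost group
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == ')' && depth > 0) { // a stray ')' at depth 0 is ignored
                depth--;
                if (depth == 0) {
                    groups.add(text.substring(start + 1, i));
                }
            }
        }
        return groups; // an unclosed '(' contributes nothing
    }

    public static void main(String[] args) {
        String sentence = "interleukin 2 (IL-2 (human)) and tumor necrosis factor(TNF)";
        System.out.println(topLevelGroups(sentence));
    }
}
```

Because the scan keys on the parenthesis characters alone, it does not care whether a space precedes the opening parenthesis.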

Week 5: June 28-July 2

June 28

  • Create table/graph to compare performance of algorithms
  • Search for new research papers to read (used Google Scholar, looked for papers that cited Schwartz/Hearst)
  • Think of some more possible improvements to algorithm (still cannot explain some discrepancies)
  • Time= 7 hours

June 29

  • Continue working with pairs found by algorithms
  • Think of other possible improvements to IDA2 (new algorithm, etc.)
  • Learn some basics about databases and SQL (used by IDA2 to store all acronyms/abbreviations found)
    • Read from Raghu Ramakrishnan's "Database Management Systems" (old edition [1997])
      • Section 1= Introduction to Database Systems
      • 2= The Relational Model
      • 4= File Organizations and Indexes
      • 9= SQL: The Query Language
  • Time= 8 hours

June 30

  • Continue learning/practicing DBMS/SQL
  • Take break from algorithm (to clear head)
  • Time= 3.5 hours

July 1

  • Make improvements to both algorithms
    • Reconcile differences in matches
    • Fix problem with certain partials (nested parentheses) and wrong pairs (with no space before parentheses)
  • Recalculate precision and recall of both algorithms
    • SH*- Precision= 94.36% (786 pairs correct/ 833 pairs found), Recall= 80.70% (786 pairs found/ 974 total pairs)
    • IDA2*- Precision= 94.42% (779 pairs correct/ 825 pairs found), Recall= 79.98% (779 pairs found/ 974 total pairs)
  • Time= 7 hours

July 2

  • Document changes in algorithms (and results)
  • Find, print several papers to read (6-8)
    • Looked for papers cited by Schwartz & Hearst, or that cited their paper
  • Read:
    • Larkey et al., "Acrophile: An Automated Acronym Extractor and Server"
      • Paper discusses a project very similar to IDA2. I was specifically interested in the 4 algorithms they used to find abbreviation/acronym pairs
      • Some of their algorithms could detect and define short forms that were not inside parentheses (using stop words).
      • However, the precision and recall of these algorithms are very poor compared to S&H (at most, about 20%).
    • Park, Byrd, "Hybrid text mining for finding abbreviations and their definitions"
      • Cited by S&H (just like the previous one). It introduces an algorithm that uses a simple alignment scheme like S&H, but also creates a "RuleBase" of different patterns.
      • S&H can only define abbreviations/acronyms if they are right next to each other, while this one can define them even if the long/short form pairs are offset from each other.
  • Time= 8.5 hours

Week 6: July 5-9

July 5