User:Amallen
Adam Mallen
Hello everyone and welcome to my page! My name is Adam Mallen. I'm an alumnus of Marquette University with degrees in Mathematics and Computer Science, and I'm a brand-new graduate student in Marquette's Computational Sciences program.
This page's primary purpose is to chronicle my efforts on the IDA2 (Intelligent Discovery of Acronyms and Abbreviations) summer REU project, advised by Dr. Craig Struble and Dr. Lenwood Heath.
This is a list of Abbreviation Issues I have found during my research.
Weekly Work Log
This section describes the weekly meetings with Dr. Struble and the work I accomplished between meetings.
August 18th
- Plan on creating a poster for the REU poster session
- Plan on looking into how to start the next two big steps:
- Clustering similar (basically the same) long forms and treating them as the same "sense" (a first-cut sketch follows this list).
- Finding global abbreviations and disambiguating them.
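As a starting point for the clustering step, here is a minimal sketch of grouping near-identical long forms by string normalization. The normalization rules (lowercasing, splitting on hyphens, crude de-pluralization) are my own placeholders, not a settled design:

```python
from collections import defaultdict
import re

def normalize(long_form):
    """Collapse near-identical long forms onto one canonical key.
    These rules are placeholders; real clustering may need edit
    distance or abbreviation-aware matching."""
    s = re.sub(r"[-/]", " ", long_form.lower())  # hyphens/slashes -> spaces
    s = re.sub(r"[^a-z0-9 ]", "", s)             # drop punctuation
    words = [w.rstrip("s") for w in s.split()]   # crude de-pluralization
    return " ".join(words)

def cluster_long_forms(long_forms):
    """Group long forms whose normalized keys coincide; each group
    would then be treated as one "sense"."""
    clusters = defaultdict(list)
    for lf in long_forms:
        clusters[normalize(lf)].append(lf)
    return clusters

if __name__ == "__main__":
    forms = ["estrogen receptor", "Estrogen Receptors", "estrogen-receptor"]
    print(dict(cluster_long_forms(forms)))
```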
August 11th
- Created a web interface for the database using Django (a rough sketch of what the model layer can look like follows this list).
- Have a prototype of the web interface running on the Bistro website.
- Cleaned up, organized, and commented programs, scripts, data-files, etc. associated with this project.
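For the record, a minimal sketch of the model layer for this kind of interface. Model and field names here are illustrative, not the actual schema behind the Bistro prototype (and on_delete is only required by modern Django, not the 1.x-era version used at the time):

```python
# models.py -- hypothetical models for browsing the abbreviation database
from django.db import models

class Abbreviation(models.Model):
    short_form = models.CharField(max_length=80)
    long_form = models.TextField()

    class Meta:
        unique_together = ("short_form", "long_form")

class Occurrence(models.Model):
    """One sighting of a short form/long form pair in a Medline abstract."""
    abbreviation = models.ForeignKey(Abbreviation, on_delete=models.CASCADE)
    pmid = models.BigIntegerField()  # PubMed ID of the citation

# views.py -- a bare-bones search view over short forms
from django.http import HttpResponse

def search(request):
    sf = request.GET.get("sf", "")
    hits = Abbreviation.objects.filter(short_form__iexact=sf)
    body = "\n".join(f"{a.short_form}\t{a.long_form}" for a in hits)
    return HttpResponse(body, content_type="text/plain")
```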
July 7th
- Ran through the LingPipe word sense disambiguation tutorial to learn how to use LingPipe to help train a model for abbreviation disambiguation (a LingPipe-independent sketch of the underlying idea follows this list).
- Worked through Django tutorials so I can use Django to build the web-interface front end of the database.
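The core idea from the tutorial, independent of LingPipe's actual API: treat each long form as a category and classify the words around an ambiguous short form. Here is a toy naive Bayes version of that idea; the training sentences and all names are invented for illustration:

```python
import math
from collections import Counter, defaultdict

class SenseClassifier:
    """Toy naive Bayes over context words -- the same general shape of
    model the LingPipe WSD tutorial trains, minus its refinements."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # sense -> word counts
        self.sense_counts = Counter()

    def train(self, sense, context):
        self.sense_counts[sense] += 1
        self.word_counts[sense].update(context.lower().split())

    def classify(self, context):
        words = context.lower().split()
        vocab = {w for c in self.word_counts.values() for w in c}
        def log_score(sense):
            total = sum(self.word_counts[sense].values())
            score = math.log(self.sense_counts[sense])
            for w in words:  # add-one smoothing over the vocabulary
                score += math.log((self.word_counts[sense][w] + 1)
                                  / (total + len(vocab) + 1))
            return score
        return max(self.sense_counts, key=log_score)

# "CSF" is a classic ambiguous short form with these two senses:
clf = SenseClassifier()
clf.train("cerebrospinal fluid", "lumbar puncture protein glucose fluid")
clf.train("colony stimulating factor", "cytokine bone marrow granulocyte")
print(clf.classify("glucose levels in lumbar fluid"))
```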
June 30th
- Populated the database with all abbreviations found in the Medline 2009 baseline files using the Schwartz and Hearst abbreviation finding algorithm. I used the Condor pool to distribute this task to many computers each processing its own baseline file in parallel.
- Found abbreviations/acronyms in 4,126,655 Medline abstracts.
- Found 1,497,702 distinct short form/long form pairs.
- Found 365,792 distinct short forms.
- Issues while populating the database with Condor:
- The Medline baseline files and the necessary Java jar files for LingPipe and JDBC need to be copied over for each Condor job because the shared file system on some machines can't find these files.
- Two baseline files had to be re-run because of a Java connection exception. However, both files were processed without error on their second run.
- Read Schwartz and Hearst, A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Pacific Symposium on Biocomputing, 2003.
- This paper outlines the abbreviation finding algorithm I used for populating the database.
- It is a very simple and straightforward algorithm that first finds short form/long form candidate pairs of the form i) long form (short form) or ii) short form (long form). It then scans the long-form candidate from right to left, matching each character of the short form in order; the first character of the short form must match the first character of a word in the long form. A sketch of this scan appears below.
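For reference, a Python transcription of that right-to-left scan, based on my reading of the paper's findBestLongForm pseudocode (so treat it as a sketch, not the Biocreative implementation):

```python
def find_best_long_form(short_form, long_form):
    """Scan the long-form candidate right to left, matching each character
    of the short form in order. The first character of the short form must
    match the first character of some word in the long form."""
    s_index = len(short_form) - 1
    l_index = len(long_form) - 1
    while s_index >= 0:
        curr = short_form[s_index].lower()
        if not curr.isalnum():              # skip punctuation in short form
            s_index -= 1
            continue
        # Walk left until curr matches; the short form's first character
        # must additionally land on the start of a word.
        while l_index >= 0 and (long_form[l_index].lower() != curr or
                (s_index == 0 and l_index > 0 and
                 long_form[l_index - 1].isalnum())):
            l_index -= 1
        if l_index < 0:
            return None                     # no valid long form here
        l_index -= 1
        s_index -= 1
    # Trim to the start of the word where the first character matched.
    return long_form[long_form.rfind(" ", 0, l_index + 1) + 1:]

print(find_best_long_form("HMM", "taken from a hidden Markov model"))
# -> "hidden Markov model"
```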
June 23rd
- Ran tests to see if multiple jobs can write to the database in parallel without causing problems.
- After running 10 threads in parallel, each with a different Medline baseline file, the only difference from the serial implementation was 3 duplicate abbreviations entered into the dictionary. I believe this is a small race condition: my program first queries the database to see whether an abbreviation already exists in the dictionary and only inserts it if it is not a duplicate, so if another thread tries to add the exact same abbreviation in the window between those two actions, both inserts go through (see the sketch after this list).
- Began populating the database with all abbreviations from the Medline baseline files. This will be done in parallel, with post-processing afterwards to eliminate duplicate abbreviation entries in the dictionary.
- Because populating is taking much too long, I am looking into adding indexes to the database to speed up queries (also folded into the sketch below).
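A sketch of both fixes, using SQLite syntax for brevity since our actual database is reached through JDBC (MySQL's equivalent of INSERT OR IGNORE is INSERT IGNORE). The point is to let a UNIQUE constraint make insert-if-absent atomic, and to let indexes carry the lookups:

```python
import sqlite3

conn = sqlite3.connect("abbrev.db")

# The UNIQUE constraint makes the database enforce deduplication, closing
# the check-then-insert race between parallel jobs. Its underlying index
# also speeds up lookups by short form.
conn.execute("""CREATE TABLE IF NOT EXISTS dictionary (
                    short_form TEXT NOT NULL,
                    long_form  TEXT NOT NULL,
                    UNIQUE (short_form, long_form))""")

# Extra index for queries that come in from the long-form side.
conn.execute("CREATE INDEX IF NOT EXISTS idx_lf ON dictionary (long_form)")

def add_abbreviation(short_form, long_form):
    # Atomic insert-if-absent: a duplicate is silently skipped instead of
    # racing a separate SELECT-then-INSERT.
    with conn:
        conn.execute("INSERT OR IGNORE INTO dictionary VALUES (?, ?)",
                     (short_form, long_form))

add_abbreviation("HMM", "hidden Markov model")
add_abbreviation("HMM", "hidden Markov model")  # duplicate: ignored
```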
June 16th
- Created (but not yet populated) a database for storing short form/long form pairs and the PMIDs of the abstracts which contain them (a sketch of a possible schema follows this list).
- Wrote script to parse Medline baseline files, find abbreviations, and store them in the database.
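One possible shape for that schema, again in SQLite syntax for brevity; the table and column names are my own placeholders, not the actual schema:

```python
import sqlite3

conn = sqlite3.connect("abbrev.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS pair (        -- one row per distinct pair
    id         INTEGER PRIMARY KEY,
    short_form TEXT NOT NULL,
    long_form  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS occurrence (  -- which abstracts contain which pair
    pair_id INTEGER NOT NULL REFERENCES pair(id),
    pmid    INTEGER NOT NULL             -- PubMed ID of the citation
);
""")
```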
June 9th
- Got the abbreviation finder working on the parsed Medline abstracts. The script now automatically reads in Medline citations, prints the abstracts to a log file, and uses that file as input for the abbreviation finder. The end result is a list of unambiguous short form/long form pairs found in the abstracts of all Medline citations in the given input file.
- During some trial and error debugging, I noticed that this abbreviation finding algorithm does not recognize the relationship between Roman numerals and their corresponding Arabic numerals. This may or may not be a relevant issue. Also, this may be fixable once the parser translates the XML formatted input into tokens instead of just plain text.
- The following error-checking changes had to be made to AbbreviationFinder.java to fix the problems I was seeing before:
- Line 221: added a check to see if findBestLongForm returned null.
- Line 110: added a check to see if lastIndexOf returned -1.
- Revised the program to find abbreviations in each abstract individually as it is parsed. Each abbreviation found is printed to the log file along with the pmID of the citation and the citation's associated MeSH headings, tab delimited. So now our output file is a data file in which each line represents an unambiguous abbreviation as a data point. The attributes of each data instance are the abbreviation's short and long forms, the pmID of the citation in which it was found, and a list of MeSH topics associated with that citation.
- Next, I need to add other features to each data instance, namely the "Unigram and Linguistic features" described in Alamri (2008). Also, I should look into adding MeSH tags that are 'parents' of the tags explicitly given to the citation.
- I read Stevenson et al., Disambiguation of Biomedical Abbreviations. This paper follows up on the Alamri (2008) thesis. Some issues I thought of while reading:
- What method should we use to find global abbreviations? That is, what distinguishes an abbreviation from regular text when there is no associated long form in parenthesis right next to it?
- The Schwartz and Hearst method finds the shortest candidate which contains all the characters in the abbreviation in the correct order. What about long forms whose last word is not represented in the abbreviation? Do such abbreviations exist?
- They use the entire abstract as the context of the abbreviation, not just the sentence containing it.
- I read Gaudan, Kirsch, and Rebholz-Schuhmann, Resolving Abbreviations to Their Senses in Medline, Bioinformatics, 2005. This paper outlined a system nearly identical to the proposed IDA2 system.
- There were a couple of important ideas I took away from this paper:
- Combining closely related long forms into one common sense. Multiple long forms may not use exactly the same words but still share the same sense, and this paper discussed a method to find and combine such long forms. We may also want to look for abbreviations within long forms when searching for different long forms that share the same sense.
- The context extraction and the feature vector representing the context of a given long form are also different from those described in Alamri (2008). This paper uses the C-value algorithm described in [Frantzi and Ananiadou, The C-value Domain-Independent Method for Multiword Term Extraction, 1999] for scoring terms in an abstract; terms with higher C-value scores are kept as representing the context of an abbreviation found in that abstract. (A sketch of the C-value computation appears after the reading list below.)
- Should we remove 'rare' long forms from the training set? The method outlined in this paper did so to help training. Their reasoning was that global (ambiguous) abbreviations would be common because they are expected to be known by the reader, so it's safe to remove rare abbreviation/long form pairs from the dictionary since they would most likely never show up as global abbreviations.
- The following are references that I plan on reading to shed light on some of the previously mentioned issues/ideas:
- Wren et al., Biomedical term mapping databases
- Liu et al., Resolution of Ambiguous Terms Based on Machine Learning and Conceptual Relations in the UMLS
- Yu et al., Automatic resolution of ambiguous abbreviations in biomedical texts using support vector machines and one sense per discourse hypothesis
- Tsuruoka and Tsujii, Probabilistic term variant generator for biomedical terms
- Pakhomov, Maximum Entropy based approach to acronym and abbreviation normalization in medical texts
- Adar, SaRAD: a Simple and Robust Abbreviation Dictionary
- Frantzi and Ananiadou, The C-value Domain-Independent Method for Multiword Term Extraction, JNLP 6(3), 145-179, 1999.
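Since the C-value comes up twice above, here is my reading of the formula as a small sketch: term frequency weighted by log term length, discounted when the term mostly occurs nested inside longer candidate terms (multiword terms only, since log2 of 1 is zero):

```python
import math

def c_value(term, freq):
    """C-value of a candidate term (a tuple of words), given `freq`,
    a dict mapping every candidate term to its corpus frequency."""
    longer = [t for t in freq
              if len(t) > len(term)
              and any(t[i:i + len(term)] == term
                      for i in range(len(t) - len(term) + 1))]
    weight = math.log2(len(term))
    if not longer:
        return weight * freq[term]
    # Discount by the average frequency of the terms that contain it.
    nested = sum(freq[t] for t in longer) / len(longer)
    return weight * (freq[term] - nested)

freq = {("hidden", "markov", "model"): 5,
        ("markov", "model"): 8}
for t in freq:
    print(" ".join(t), round(c_value(t, freq), 2))
```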
June 2nd
- Went through LingPipe tutorial on parsing Medline abstracts.
- Looked through the Biocreative implementation of the Schwartz and Hearst algorithm for abbreviation recognition.
- Made changes to the LingPipe tutorial's word count program to run through the XML Medline files and find just the text we're interested in. Right now that is only the abstracts, but it can easily be changed to read in the MeSH tags, authors, titles, and other information stored in the Medline citations (a rough Python sketch of this extraction appears after this list).
- Got the Biocreative AbbreviationFinder.java program to compile and run on a simple self-created test case.
- Wrote a small script to automate reading in Medline abstracts (as text, not tokens yet) and writing just the abstract in plain text to a file. This script also automatically compiles the necessary Java files.
- I tried to run the abbreviation finder on the Medline abstract text, but kept running into null pointer exceptions. Fixing this is next on my list.
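A rough Python equivalent of that extraction step, in case it helps later. The element names follow the Medline citation DTD as I remember it (MedlineCitation, PMID, Abstract/AbstractText), so the paths may need adjusting:

```python
import sys
import xml.etree.ElementTree as ET

def extract_abstracts(path):
    """Yield (pmid, abstract text) pairs from one Medline XML file."""
    for _, elem in ET.iterparse(path):
        if elem.tag == "MedlineCitation":
            pmid = elem.findtext("PMID")
            abstract = elem.findtext(".//Abstract/AbstractText")
            if pmid and abstract:
                yield pmid, abstract
            elem.clear()  # keep memory flat on large baseline files

if __name__ == "__main__":
    for pmid, abstract in extract_abstracts(sys.argv[1]):
        print(abstract)  # plain-text abstracts for the abbreviation finder
```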
May 26th
- Created overall system diagram using xfig
- Resolved my confusion/issue with the front-end interface. We decided that the front-end should just be a search mechanism to explore the database of disambiguated abstracts. A more sophisticated version can be built on top of this later if need be.
- Ran the LingPipe Medline word count tutorial with Dr. Struble using their prepared files and with our own Medline abstracts.
- Read Yu, Hripcsak, and Friedman, Mapping Abbreviations to Full Forms in Biomedical Articles, Journal of the American Medical Informatics Association, 2002. This paper outlined a program called AbbRE, which recognizes unambiguous abbreviations by searching parenthetical expressions for paired abbreviations and full forms (a toy sketch of this kind of candidate extraction follows the list below). This method could be used instead of the Schwartz and Hearst algorithm if it improves performance. The paper also references several existing biomedical acronym and abbreviation databases set up to resolve specific types of short forms. These may be useful later, so I have included them in the following list:
- Genbank LocusLink: abbreviations and full forms of 54,719 genes.
- A note about LocusLink mentioned in Yu's work: this resource has since been replaced by EntrezGene.
- SWISSPROT: protein-sequence database. Contains 88,800 protein abbreviations and full forms.
- LRABR: 10,000 abbreviations.
- BioABACUS: a database of 6,000 common abbreviations in biotechnology and computer science.
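To make the parenthetical idea concrete, a toy candidate extractor in that spirit. This is my own simplification: the real AbbRE applies hand-built pattern-matching rules, and the window size below is the Schwartz and Hearst heuristic rather than AbbRE's:

```python
import re

PAREN = re.compile(r"\(([^()]+)\)")  # any parenthesized expression

def candidates(sentence):
    """Yield (long-form window, short form) candidate pairs."""
    for m in PAREN.finditer(sentence):
        sf = m.group(1).strip()
        # Plausible short form: 2-10 characters, at least one letter.
        if not (2 <= len(sf) <= 10 and any(c.isalpha() for c in sf)):
            continue
        # Window of words before the parenthesis: at most
        # min(|sf| + 5, 2 * |sf|) words (Schwartz & Hearst heuristic).
        words = sentence[:m.start()].split()
        k = min(len(sf) + 5, 2 * len(sf))
        yield " ".join(words[-k:]), sf

for lf, sf in candidates("We measured cerebrospinal fluid (CSF) protein."):
    print(sf, "->", lf)
```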
May 19th
- Read Alamri, Abdulaziz Dhafer, Word Sense Disambiguation in the Biomedical Domain: Disambiguation of Biomedical Abbreviations, Master's thesis, University of Sheffield, England, 2008. We hope to use some of the approaches described in this thesis for building our corpus, building a short form/long form dictionary, learning a model for long-form prediction of ambiguous short forms, and evaluating that model.
- Discussed the major pieces of the problem and the overall system design. The rough picture we came up with for the final summer deliverable has two parts: a back-end system that works as a pipeline, disambiguating short forms from Medline abstracts as they're released and storing the results in a database, and a front-end web interface through which users query that database.
- The back-end system will work as follows:
- parses Medline abstracts,
- creates a dictionary on an appropriate subset of these abstracts,
- learns a model for predicting ambiguous short forms using the previously mentioned subset of appropriate Medline abstracts as a training set,
- runs the model on the remaining Medline abstracts to predict ambiguous short forms,
- and stores the results of these predictions in a database for user retrieval.
- The front-end system will be a web interface to the database that lets users retrieve short form/long form pairs and the abstracts containing short forms predicted to correspond to a given long form.
Note: "appropriate" in this case means abstracts with unambiguous short forms whose corresponding long form is defined within the abstract.
- Discussed some other open design ideas/issues:
- What should the front-end really look like? Should the web-interface allow users to submit abstracts/text for disambiguation? or just allow them to search for short form or long form terms and get back a list of abstracts containing these terms?
- Our predictive model should be fine-tuned over time as updates and new abstracts are released. A simple way to do this is to add only the abstracts with short form/long form definitions within the abstract (i.e., those in which we know the long form associated with a short form) to the new training set, then use the new (and slightly better) model to re-predict all ambiguous short forms, including those in the new abstracts. This may take too much time, because every retraining forces us to re-run the model on every ambiguous short form in every Medline abstract. Another idea is to keep old predictions and run the model only on the new ambiguous short forms. In that case we could even consider retraining the model using our past predictions in addition to the unambiguous short forms, but this would have to be tested extensively, because taking our own predictions as fact when retraining is dangerous.