Difference between revisions of "User:Carolinearnold"
From REU@MU
Revision as of 04:02, 16 June 2023
Week 1: 5/30/23 - 6/2/23
Tuesday:
- Attended REU orientation.
- Toured Dr. Xu's lab and discussed his intentions for the project.
- Downloaded Windows Terminal and Python 3.11 to the lab computer.
- Installed OpenAI's command-line interface (CLI) for use in fine-tuning a model.
- Completed Unit 1 and the first half of Unit 2 of RCR/RECR training as mandated by the NSF.
Wednesday:
- Attended Dr. Brylow's presentation on good research practices and the importance of keeping logs.
- Completed Units 2 and 3 of RCR/RECR training.
- Researched factors that influence perceived credibility in human and AI-generated communication.
- Examined methodologies used by other researchers to evaluate the degree of trust in unknown authors.
Thursday:
- Narrowed research to large language models (LLMs) and their capacity for social awareness.
- Attempted to connect with Alex Fischmann, who completed a related project under Dr. Xu's mentorship.
- Familiarized myself with PyCharm and began this tutorial for fine-tuning a model.
- Examined deliverables produced by Fischmann during the course of her project.
Friday:
- Explored case studies of fine-tuned LLMs and their applications.
- Began troubleshooting the fine-tuning process using this guide as a reference.
- Met with Dr. Xu to discuss a rough timeline of the project.
- Drafted a summary of research goals according to our proposed timeline.
Week 2: 6/5/23 - 6/9/23
Monday:
- Attended RCR training with Dr. Brylow.
Tuesday:
- Installed PyCharm 2022.1.4 and required packages for web scraping.
- Fixed warnings in Fischmann's web scraper.
- Adapted the aforementioned script to GoFundMe's medical crowdfunding homepage.
- Created a .csv file in which to store campaign data.
- Retrieved the following data: title, URL, description, organizer(s), and launch date.
Wednesday:
- Attended Dr. Brylow's presentation on technical writing and effective research talks.
- Modified my script to retrieve the following data: amount raised, goal, beneficiary, and number of donations.
- Added code to store campaign data in the aforementioned .csv file.
- Implemented try/except statements to record "NA" values in the event that data isn't found.
- To store campaign descriptions in a .csv file, I had to remove their commas. The missing commas will interfere with fine-tuning later in the project. Possible solutions include using a different file format (e.g., .tsv) or replacing commas with punctuation unlikely to appear elsewhere in the description.
- Discovered the following bugs:
  - If a campaign was launched within the last week, the launch date retrieved by the web scraper is expressed relative to the current time. For example, a campaign published yesterday will return "[launched] one day ago" as opposed to "[launched] June 7 2023." Work is needed to express all dates in the latter form OR throw out campaigns launched within the week.
  - Long descriptions are hidden by a "Read more" button on GoFundMe. My script appends "Read more" to the visible description as opposed to retrieving hidden text. Code is needed to click the "Read more" button and retrieve the full description.
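A minimal sketch of the "NA" fallback and the comma problem noted above. It assumes Python's standard csv module, which the log does not confirm was used. The helper safe_extract and the sample page dictionary are hypothetical. The sketch also shows that csv.writer quotes fields containing commas, which may be another way to keep descriptions intact.

```python
import csv
import io


def safe_extract(extract, default="NA"):
    """Run a zero-argument extractor; return `default` if it fails or finds nothing."""
    try:
        value = extract()
    except (AttributeError, KeyError, IndexError, ValueError):
        return default
    return value if value not in (None, "") else default


# Hypothetical stand-in for one parsed campaign page (not real GoFundMe data).
page = {
    "title": "Help Sam, our friend",
    "description": "Sam needs surgery, rehab, and rest.",
}

row = [
    safe_extract(lambda: page["title"]),
    safe_extract(lambda: page["description"]),
    safe_extract(lambda: page["beneficiary"]),  # missing key, so "NA" is recorded
]

# csv.writer wraps any field that contains a comma in double quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_text = buffer.getvalue()

# Reading the text back recovers the original fields, commas included.
round_trip = next(csv.reader(io.StringIO(csv_text)))
```

If quoting works for the downstream fine-tuning format, the commas would not need to be removed at all. That is an assumption to check against the actual pipeline.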
Thursday:
- Fixed bugs in description and organizer retrieval.
- Attempted to reformat launch dates. I might have to throw out campaigns without proper dates attached.
- Replaced commas in campaign descriptions with semicolons. I should be able to revert to commas before fine-tuning my LLM. If I can't, semicolons are a decent substitute because they preserve tone and readability, even though the result is grammatically incorrect.
- Fixed formatting issues in output.csv.
- Copied data from 504 campaigns to output.csv.
  - The number of donations is not included due to a bug.
Friday:
- Called Dr. Xu to discuss my progress and appropriate next steps.
Week 3: 6/12/23 - 6/16/23
Monday:
- Reformatted launch dates for campaigns older than 24 hours.
  - I think we could reasonably discard newer campaigns due to their relative infrequency.
- Fixed bug in number of donations.
- Scraped data from 1000 campaigns.
  - n = 1000 is the maximum number of campaigns I can retrieve data from at once.
- Removed non-English campaigns and campaigns launched within 24 hours from the dataset.
  - n = 952
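A rough sketch of the launch-date handling described above. The input formats are assumptions: relative strings such as "one day ago" or "3 hours ago", and absolute strings such as "June 7 2023" (parsed with English month names). The function name normalize_launch_date is hypothetical, and the text GoFundMe actually returns may differ.

```python
import re
from datetime import date, datetime, timedelta

# Assumed spellings for small counts; digits are also accepted.
WORD_NUMBERS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3,
                "four": 4, "five": 5, "six": 6, "seven": 7}
RELATIVE = re.compile(r"^(\w+) (hour|day)s? ago$")


def normalize_launch_date(text, today):
    """Return a date, or None for campaigns launched less than a day ago."""
    text = text.strip()
    match = RELATIVE.match(text.lower())
    if match is None:
        # Not relative, so treat it as an absolute date like "June 7 2023".
        return datetime.strptime(text, "%B %d %Y").date()
    count_text, unit = match.groups()
    count = WORD_NUMBERS.get(count_text)
    if count is None:
        count = int(count_text)  # raises ValueError on an unexpected word
    if unit == "hour":
        return None  # newer than 24 hours; the caller can discard it
    return today - timedelta(days=count)


today = date(2023, 6, 8)
yesterday = normalize_launch_date("one day ago", today)
absolute = normalize_launch_date("June 7 2023", today)
too_new = normalize_launch_date("3 hours ago", today)
```

Returning None for hour-based strings mirrors the decision to drop campaigns launched within 24 hours, so a filter on None would remove them from the dataset.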
Tuesday:
- Stored data from the first ten campaigns in a separate .csv file.
- Wrote a script to rewrite campaign descriptions using GPT-3.
  - Reached the maximum number of tokens provided by OpenAI.
  - Each rewrite was appended to the output file as a new row rather than written to the same row as its corresponding campaign. I attempted to fix this before I ran out of tokens.
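One possible way to keep each rewrite in the same row as its campaign is sketched below. It assumes the standard csv module and a "description" column. The function rewrite_description is a hypothetical placeholder for the GPT-3 call, which is not shown in this log, and the sample rows are made up.

```python
import csv
import io


def rewrite_description(text):
    """Hypothetical stand-in for the GPT-3 rewrite; it only upper-cases the text."""
    return text.upper()


def add_rewrites(source, destination, rewrite=rewrite_description):
    """Copy every row from source to destination with an extra "rewrite" column."""
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames) + ["rewrite"]
    writer = csv.DictWriter(destination, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # The rewrite is attached to the row it came from, not appended later.
        row["rewrite"] = rewrite(row["description"])
        writer.writerow(row)


# In-memory files keep the sketch self-contained.
source = io.StringIO("title,description\r\nA,first story\r\nB,second story\r\n")
destination = io.StringIO()
add_rewrites(source, destination)

rows = list(csv.DictReader(io.StringIO(destination.getvalue())))
```

Writing to a second file, rather than appending to the input file, keeps the original data untouched if the run stops partway through.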
Wednesday:
- Adapted Fischmann's script to analyze campaign descriptions according to the NRC Emotion Lexicon.
- Evaluated the relative strengths of NRC and LIWC using this article (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9565755/) as an example.
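A small sketch of lexicon-based emotion counting in the spirit of the NRC analysis above. It is not Fischmann's script, which is not shown in this log. TOY_LEXICON is a hypothetical three-word stand-in whose word-to-emotion pairs are illustrative only. The real NRC Emotion Lexicon would need to be loaded from its own file.

```python
import re
from collections import Counter

# Hypothetical miniature lexicon: word -> set of associated emotions.
TOY_LEXICON = {
    "surgery": {"fear", "sadness"},
    "hope": {"anticipation", "joy", "trust"},
    "grateful": {"joy", "trust"},
}


def emotion_counts(text, lexicon):
    """Count how many words in the text are associated with each emotion."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        # Words missing from the lexicon contribute nothing.
        counts.update(lexicon.get(word, ()))
    return counts


counts = emotion_counts("We hope the surgery goes well. We are grateful.", TOY_LEXICON)
```

Dividing each count by the number of words in the description would give a rate that is easier to compare across campaigns of different lengths.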