Difference between revisions of "User:Grberlstein"

From REU@MU
(Clustering and Data Science)
Line 14: Line 14:
 
*[http://homepages.inf.ed.ac.uk/rbf/BOOKS/JAIN/Clustering_Jain_Dubes.pdf Algorithms for Clustering Data]
 
 
*[https://datasciencelab.wordpress.com/tag/k-means/ K-means Clustering in Python]
 
 +
*[http://online.liebertpub.com/doi/abs/10.1089/big.2016.0050 Critique and Contribute: A Practice-Based Framework for Improving Critical Data Studies and Data Science]
  
 
= Project Log For Summer 2017 =
 
Line 90: Line 91:
 
*Tweaked basic settings
 
 
*Added ability to display different variations of K-Means on the map
 
 +
 +
=='''Week Four (6/19 - 6/23)'''==
 +
==='''Day 1 (6/19)'''===
 +
*Tweaked the map visuals
 +
*Added a convex hull to display the cluster borders
 +
*Made the convex hull creation dynamic and attached to the data, rather than precomputed in the data frame
 +
==='''Day 2 (6/20)'''===
 +
*Added more visual tweaks to the map
 +
*Added a grid to the display and fixed inaccurate axis labels
 +
*Evaluated the relevance and accuracy of the different clusters produced
 +
==='''Day 3 (6/21)'''===
 +
*Compared Milwaukee crime reports against produced clusters to gauge accuracy
 +
*Read [http://online.liebertpub.com/doi/abs/10.1089/big.2016.0050 Critique and Contribute: A Practice-Based Framework for Improving Critical Data Studies and Data Science]
 +
*Met with Dr. Guha and discussed the next step in the project
 +
==='''Day 4 (6/22)'''===
 +
*Started programming a (mostly) vectorized implementation of K-Means to later modify
 +
*Continued reading ''Weapons of Math Destruction''
 +
==='''Day 5 (6/23)'''===
 +
*Fixed vectorized implementation of K-Means
 +
*Tested the implementation on random datasets and compared the results with those of scikit-learn's implementation
 +
*Implemented a geodesic distance metric using the Haversine great circle distance formula
 +
*Modified my implementation of K-Means to use the new distance metric and built a test framework to compare clustering with the geodesic distance and Euclidean distance from the same set of starting points
 +
 +
=='''Week Five (6/26 - 6/30)'''==

Revision as of 14:48, 26 June 2017

Griffin Berlstein

Nominally a person.

Readings

Background

Algorithmic Ethics

Clustering and Data Science

Project Log For Summer 2017

Week One (5/30 - 6/2)

Day 1 (5/30)

  • Attended REU orientation
  • Obtained ID card and computer access
  • Met with Dr. Guha and discussed broad ideas surrounding the project

Day 2 (5/31)

  • Attended Library orientation
  • Finished reading Ethics of Algorithms by Thijs Slot. This was the last of the pre-REU reading.
  • Started reviewing the basics of Python
  • Received crime data sets from Dr. Guha to review

Day 3 (6/1)

  • Attended a meeting on proper research practices by Dr. Factor
  • Set up direct deposit
  • Reviewed the basics of GitHub
  • Continued to review Python
  • Examined crime data and the various ways it was made publicly available

Day 4 (6/2)

  • Moved mentor meeting to Wednesday due to scheduling issue
  • Started reading background information provided by Dr. Guha
  • Set up Jupyter notebook and the various dependent libraries
  • Created rough implementation of K-means clustering on random data
  • Obtained card access to Dr. Guha's lab
  • Posted rough, pre-discussion milestones

Week Two (6/5 - 6/9)

Day 1 (6/5)

  • Refined K-means implementation with the K-means++ seeding described in the Data Science Lab article
  • Tested the algorithm on random Gaussian distributions, rather than random points
  • Experimented with visual plotting of the algorithm using Seaborn and Matplotlib
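
The K-means++ seeding mentioned above can be sketched roughly as follows. This is not the code from the Data Science Lab article or from this project; it is a minimal standard-library illustration, and the names `squared_distance` and `kmeans_pp_seeds` are hypothetical. The idea is that each new seed is drawn with probability proportional to its squared distance from the nearest seed already chosen.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k starting centroids using K-means++ style seeding."""
    rng = rng or random.Random()
    # The first seed is chosen uniformly at random.
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Weight each point by its squared distance to the nearest seed,
        # so far-away points are more likely to become the next seed.
        weights = [min(squared_distance(p, s) for s in seeds) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a seed; fall back to a uniform pick.
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds


if __name__ == "__main__":
    # Two Gaussian blobs, similar in spirit to the test data described above.
    rng = random.Random(42)
    blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
    blob_b = [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(50)]
    print(kmeans_pp_seeds(blob_a + blob_b, 2, rng))
```

Because a point that is already a seed has weight zero, it cannot be drawn again unless all points coincide, which is why the uniform fallback is there.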

Day 2 (6/6)

  • Attended RCR training
  • Finished reading the relevant sections of Algorithms for Clustering Data
  • Experimented with Scikit-learn's implementation of K-means

Day 3 (6/7)

  • Met with Dr. Guha and discussed the immediate future
  • Set the goal to produce an interactive crime map by next Wednesday
  • Gathered data from website and began sorting

Day 4 (6/8)

  • Created a script to aggregate the data from multiple spreadsheets into a single usable file
  • Looked into potential libraries needed to create the interactive map
  • Ran into issues with the format of the location data
  • Converted the addresses in the data into latitude/longitude coordinates

Day 5 (6/9)

  • Found a publicly available shape file of the city
  • Set up the necessary scripts to display the file
  • Ran into an issue with the points not being in the same coordinate system as the shape file

Week Three (6/12 - 6/16)

Day 1 (6/12)

Day 2 (6/13)

Day 3 (6/14)

  • Finished website framework
  • Uploaded initial map version
  • Started on the second version of the map

Day 4 (6/15)

  • Split the data into multiple sets
  • Used K-Means to sort in a variety of ways
  • Wrote a Python script to run K-Means multiple times and output results to be fed into D3

Day 5 (6/16)

  • Put modified data into D3 setup for the new map
  • Tweaked basic settings
  • Added ability to display different variations of K-Means on the map

Week Four (6/19 - 6/23)

Day 1 (6/19)

  • Tweaked the map visuals
  • Added a convex hull to display the cluster borders
  • Made the convex hull creation dynamic and attached to the data, rather than precomputed in the data frame
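
The log does not say how the hull was computed (the map itself was built with D3), so the following is only a Python sketch of one common approach, Andrew's monotone chain algorithm. It shows how a cluster border can be derived from the cluster's points on demand rather than stored alongside them. The function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB.

    Positive when the turn o -> a -> b is counter-clockwise.
    """
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    # Build the lower hull from left to right.
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    # Build the upper hull from right to left.
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    cluster = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
    print(convex_hull(cluster))  # the interior point (1, 1) is dropped
```

Calling such a function each time the cluster assignment changes keeps the border in sync with the data, which is the point of computing it dynamically.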

Day 2 (6/20)

  • Added more visual tweaks to the map
  • Added a grid to the display and fixed inaccurate axis labels
  • Evaluated the relevance and accuracy of the different clusters produced

Day 3 (6/21)

  • Compared Milwaukee crime reports against produced clusters to gauge accuracy
  • Read Critique and Contribute: A Practice-Based Framework for Improving Critical Data Studies and Data Science
  • Met with Dr. Guha and discussed the next step in the project

Day 4 (6/22)

  • Started programming a (mostly) vectorized implementation of K-Means to later modify
  • Continued reading Weapons of Math Destruction

Day 5 (6/23)

  • Fixed vectorized implementation of K-Means
  • Tested the implementation on random datasets and compared the results with those of scikit-learn's implementation
  • Implemented a geodesic distance metric using the Haversine great circle distance formula
  • Modified my implementation of K-Means to use the new distance metric and built a test framework to compare clustering with the geodesic distance and Euclidean distance from the same set of starting points
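
As a rough illustration of the Day 5 work, the sketch below pairs a Haversine distance function with a small K-Means loop that accepts any distance function, so both metrics can be run from the same starting points. It is not the project's vectorized implementation: it uses plain Python loops, the Earth radius value and all names (`haversine_km`, `euclidean`, `kmeans`) are assumptions, and centroids are updated with a simple arithmetic mean of latitude and longitude.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def euclidean(p, q):
    """Straight-line distance, treating (lat, lon) as planar coordinates."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def kmeans(points, seeds, distance, max_iter=100):
    """Lloyd-style K-Means with a pluggable distance function."""
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid under the chosen metric.
        new_labels = [
            min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: arithmetic mean of each cluster's members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical coordinates, used only to exercise both metrics.
    pts = [(43.00, -87.90), (43.01, -87.91), (43.50, -88.50), (43.51, -88.49)]
    seeds = [pts[0], pts[2]]  # same starting points for both runs
    print(kmeans(pts, seeds, haversine_km)[0])
    print(kmeans(pts, seeds, euclidean)[0])
```

Averaging latitude and longitude directly is only an approximation of a geodesic centroid, though it is usually reasonable at the scale of a single city.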

Week Five (6/26 - 6/30)