== '''Griffin Berlstein''' ==
Griffin is an incoming junior majoring in Mathematics and Computer Science at Vassar College in Poughkeepsie, New York.

= Readings =
== Background ==
=== Algorithmic Ethics ===
*[http://essay.utwente.nl/70934/1/Slot_MA_BMS.pdf Ethics of Algorithms]
*[https://link.springer.com/article/10.1007/s10676-010-9233-7 Is There an Ethics of Algorithms?]
*[http://journals.sagepub.com/doi/abs/10.1177/0162243915606523 Toward an Ethics of Algorithms]
*[https://arxiv.org/pdf/1704.01347.pdf Quantifying Search Bias]
*[https://pdfs.semanticscholar.org/e092/65ed8eee4c7b35e3ebe53b5d75492b4628a2.pdf Understanding and Designing around Users' Interaction with Hidden Algorithms in Sociotechnical Systems]
*[http://journals.sagepub.com/doi/full/10.1177/2053951716679679 The ethics of algorithms: Mapping the debate]
=== Clustering and Data Science ===
*[http://homepages.inf.ed.ac.uk/rbf/BOOKS/JAIN/Clustering_Jain_Dubes.pdf Algorithms for Clustering Data]
*[https://datasciencelab.wordpress.com/tag/k-means/ K-means Clustering in Python]
*[http://online.liebertpub.com/doi/abs/10.1089/big.2016.0050 Critique and Contribute: A Practice-Based Framework for Improving Critical Data Studies and Data Science]

= Project Log For Summer 2017 =
=='''Week One (5/30 - 6/2)'''==
==='''Day 1 (5/30)'''===
*Attended REU orientation
*Obtained ID card and computer access
*Met with Dr. Guha and discussed broad ideas surrounding the project
==='''Day 2 (5/31)'''===
*Attended the library orientation
*Finished reading [http://essay.utwente.nl/70934/1/Slot_MA_BMS.pdf Ethics of Algorithms] by Thijs Slot, the last of the pre-REU reading
*Started reviewing the basics of Python
*Received crime data sets to review from Dr. Guha
==='''Day 3 (6/1)'''===
*Attended a meeting on proper research practices given by Dr. Factor
*Set up direct deposit
*Reviewed the basics of GitHub
*Continued to review Python
*Examined the crime data and the various ways it was made publicly available
==='''Day 4 (6/2)'''===
*Moved the mentor meeting to Wednesday due to a scheduling issue
*Started reading background information provided by Dr. Guha
*Set up a Jupyter notebook and the libraries it depends on
*Created a rough implementation of K-means clustering on random data (see the sketch after this list)
*Obtained card access to Dr. Guha's lab
*Posted rough, pre-discussion milestones
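A minimal NumPy sketch of the kind of K-means prototype described above, run on random 2-D points; the function and variable names are illustrative, not the actual notebook's.
<pre>
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Naive seeding: k distinct points chosen uniformly at random.
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        centers_new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(centers_new, centers):
            break
        centers = centers_new
    return centers, labels

pts = np.random.default_rng(1).random((500, 2))   # random test data
centers, labels = kmeans(pts, k=4)
print(centers)
</pre>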
=='''Week Two (6/5 - 6/9)'''==
==='''Day 1 (6/5)'''===
*Refined the K-means implementation with the K-means++ seeding described in the [https://datasciencelab.wordpress.com/2014/01/15/improved-seeding-for-clustering-with-k-means/ Data Science Lab] article (sketch below)
*Tested the algorithm on random Gaussian distributions, rather than uniformly random points
*Experimented with visual plotting of the clustering results using Seaborn and Matplotlib
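A sketch of the K-means++ seeding step from that article, in the same NumPy style as the earlier sketch (names are illustrative): each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far.
<pre>
import numpy as np

def kmeans_pp_seed(points, k, seed=0):
    """K-means++ seeding: spread the initial centers out before running Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    while len(centers) < k:
        # Squared distance from every point to its nearest already-chosen center.
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)
</pre>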
+ | |||
+ | ==='''Day 2 (6/6)'''=== | ||
+ | *Attended RCR training | ||
+ | *Finished reading the relevant sections of [http://homepages.inf.ed.ac.uk/rbf/BOOKS/JAIN/Clustering_Jain_Dubes.pdf Algorithms for Clustering Data] | ||
+ | *Experimented with Scikit-learn's implementation of K-means | ||
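For comparison, the scikit-learn version of the same experiment is only a few lines; the cluster count and data below are placeholders.
<pre>
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 2))           # stand-in for the test data
km = KMeans(n_clusters=4, init="k-means++", n_init=10).fit(X)
print(km.cluster_centers_)
print(km.inertia_)   # within-cluster sum of squared distances
</pre>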
+ | |||
+ | ==='''Day 3 (6/7)'''=== | ||
+ | *Met with Dr. Guha and discussed the immediate future | ||
+ | *Set the goal to produce an interactive crime map by next Wednesday | ||
+ | *Gathered data from website and began sorting | ||
+ | |||
+ | ==='''Day 4 (6/8)'''=== | ||
+ | *Created a script to aggregate the data from multiple spreadsheets into a single usable file | ||
+ | *Looked into potential libraries needed to create the interactive map | ||
+ | *Ran into issues with the format of the data location | ||
+ | *Converted the addresses in the data into latitude/longitude coordinates | ||
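Neither script is reproduced in this log. A hedged sketch of the general approach, using pandas to combine the spreadsheets and geopy's Nominatim geocoder to turn addresses into coordinates; the file paths, column names, and choice of geocoding service are assumptions.
<pre>
import glob
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# Combine every downloaded spreadsheet into one data frame (paths are hypothetical).
frames = [pd.read_csv(path) for path in glob.glob("data/crime_*.csv")]
crime = pd.concat(frames, ignore_index=True)

# Geocode street addresses to latitude/longitude, throttled to be polite to the service.
geolocator = Nominatim(user_agent="crime-map-example")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def to_coords(address):
    location = geocode(address + ", Milwaukee, WI")
    return (location.latitude, location.longitude) if location else (None, None)

crime["lat"], crime["lon"] = zip(*crime["address"].map(to_coords))
crime.to_csv("data/crime_geocoded.csv", index=False)
</pre>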
+ | |||
+ | ==='''Day 5 (6/9)'''=== | ||
+ | *Found a publically available shape file of the city | ||
+ | *Set up the necessary scripts to display the file | ||
+ | *Ran into an issue with the points not being in the same coordinate system as the shape file | ||
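The mismatch is the usual one between geographic coordinates (latitude/longitude, EPSG:4326) and the projected coordinate system the city shapefile ships in. One way to reconcile them, sketched here with geopandas (file paths are hypothetical), is to declare the points as WGS84 and reproject them into the shapefile's CRS.
<pre>
import pandas as pd
import geopandas as gpd

# City boundary shapefile; its coordinate system is read from the accompanying .prj file.
city = gpd.read_file("data/milwaukee_boundary.shp")

# Build point geometries from the geocoded lat/lon columns, declared as WGS84 (EPSG:4326)...
crime = pd.read_csv("data/crime_geocoded.csv").dropna(subset=["lat", "lon"])
points = gpd.GeoDataFrame(
    crime,
    geometry=gpd.points_from_xy(crime["lon"], crime["lat"]),
    crs="EPSG:4326",
)

# ...then reproject them into the shapefile's coordinate system so the layers line up.
points = points.to_crs(city.crs)
</pre>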
+ | |||
+ | =='''Week Three (6/12 - 6/16)'''== | ||
+ | ==='''Day 1 (6/12)'''=== | ||
+ | *Fixed point plotting to align with shapefile | ||
+ | *Added choropleth coloring by neighborhood | ||
+ | *Started reading [http://journals.sagepub.com/doi/full/10.1177/2053951716679679 The ethics of algorithms: Mapping the debate] | ||
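A minimal geopandas/Matplotlib version of a neighborhood choropleth; the neighborhood shapefile, the spatial join, and the count column are assumptions, and the project's actual map was built for the web rather than with Matplotlib.
<pre>
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

# Hypothetical inputs: neighborhood polygons and the geocoded incident file.
hoods = gpd.read_file("data/neighborhoods.shp")
crime = pd.read_csv("data/crime_geocoded.csv").dropna(subset=["lat", "lon"])
points = gpd.GeoDataFrame(
    crime,
    geometry=gpd.points_from_xy(crime["lon"], crime["lat"]),
    crs="EPSG:4326",
).to_crs(hoods.crs)

# Count incidents per neighborhood with a spatial join, then color each polygon by its count.
joined = gpd.sjoin(points, hoods, how="inner", predicate="within")
counts = joined.groupby("index_right").size().rename("incidents")
hoods = hoods.join(counts).fillna({"incidents": 0})

ax = hoods.plot(column="incidents", cmap="OrRd", legend=True, edgecolor="black")
points.plot(ax=ax, markersize=1, color="blue")   # overlay the reprojected points
ax.set_axis_off()
plt.show()
</pre>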
==='''Day 2 (6/13)'''===
*Finished [http://journals.sagepub.com/doi/full/10.1177/2053951716679679 The ethics of algorithms: Mapping the debate]
*Started reading ''Weapons of Math Destruction''
*Started implementation of the website from GitHub
*Established the dependencies needed to run a local instance of Jekyll
==='''Day 3 (6/14)'''===
*Finished the website framework
*Uploaded the initial version of the map
*Started on the second version of the map
==='''Day 4 (6/15)'''===
*Split the data into multiple sets
*Used K-means to cluster the data in a variety of ways
*Wrote a Python script to run K-means multiple times and output the results to be fed into D3 (sketch below)
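A sketch of the kind of script described above: run K-means for several cluster counts on the point coordinates and dump the results as JSON for the D3 front end to read. The input file, the set of k values, and the output format are assumptions.
<pre>
import json
import pandas as pd
from sklearn.cluster import KMeans

crime = pd.read_csv("data/crime_geocoded.csv").dropna(subset=["lat", "lon"])
coords = crime[["lat", "lon"]].to_numpy()

runs = {}
for k in (4, 6, 8, 10):                        # cluster counts to expose in the map
    km = KMeans(n_clusters=k, n_init=10).fit(coords)
    runs["k%d" % k] = {
        "labels": km.labels_.tolist(),
        "centers": km.cluster_centers_.tolist(),
    }

with open("clusters.json", "w") as f:          # consumed by the D3 map
    json.dump(runs, f)
</pre>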
+ | |||
+ | ==='''Day 5 (6/15)'''=== | ||
+ | *Put modified data into D3 setup for the new map | ||
+ | *Tweaked basic settings | ||
+ | *Added ability to display different variations of K-Means on the map | ||
+ | |||
+ | =='''Week Four (6/19 - 6/23)'''== | ||
+ | ==='''Day 1 (6/19)'''=== | ||
+ | *Tweaked the map visuals | ||
+ | *Added a convex hull to display the cluster borders | ||
+ | *Made the convex hull creation dynamic and attached to the data, rather than precomputed in the data frame | ||
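The map computes the hulls dynamically from the cluster points; the same idea in Python, using SciPy's ConvexHull on each cluster's coordinates, is shown below purely as an illustration (on the D3 side, d3.polygonHull provides the equivalent operation).
<pre>
import numpy as np
from scipy.spatial import ConvexHull

def cluster_hulls(coords, labels):
    """Return the hull vertices, in drawing order, for each cluster label."""
    hulls = {}
    for lab in np.unique(labels):
        pts = coords[labels == lab]
        if len(pts) >= 3:                     # a 2-D hull needs at least three points
            hulls[int(lab)] = pts[ConvexHull(pts).vertices]
    return hulls
</pre>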
==='''Day 2 (6/20)'''===
*Made more visual tweaks to the map
*Added a grid to the display and fixed inaccurate axis labels
*Evaluated the relevance and accuracy of the different clusters produced
==='''Day 3 (6/21)'''===
*Compared Milwaukee crime reports against the produced clusters to gauge accuracy
*Read [http://online.liebertpub.com/doi/abs/10.1089/big.2016.0050 Critique and Contribute: A Practice-Based Framework for Improving Critical Data Studies and Data Science]
*Met with Dr. Guha and discussed the next step in the project
==='''Day 4 (6/22)'''===
*Started programming a (mostly) vectorized implementation of K-means to modify later
*Continued reading ''Weapons of Math Destruction''
*Had the weekly working lunch and began early outlines of the mini-presentations

==='''Day 5 (6/23)'''===
*Fixed the vectorized implementation of K-means
*Tested the implementation on random datasets and compared the results with scikit-learn's implementation
*Implemented a geodesic distance metric using the Haversine great-circle distance formula
*Modified the K-means implementation to use the new distance metric and built a test framework to compare clustering under the geodesic and Euclidean distances from the same set of starting points (sketch below)
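A sketch of the Haversine metric and the comparison harness described above, assuming a NumPy K-means with a pluggable distance function like the earlier sketches; the real implementation may differ, and for simplicity the centroid update here remains the plain coordinate mean under both metrics.
<pre>
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine(points, center):
    """Great-circle distance in km from each (lat, lon) row of `points` to `center`."""
    lat1, lon1 = np.radians(points[:, 0]), np.radians(points[:, 1])
    lat2, lon2 = np.radians(center[0]), np.radians(center[1])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def euclidean(points, center):
    return np.linalg.norm(points - center, axis=1)

def kmeans(points, init_centers, dist, iters=100):
    """Lloyd's algorithm with a pluggable distance function."""
    centers = init_centers.copy()
    for _ in range(iters):
        dists = np.stack([dist(points, c) for c in centers], axis=1)
        labels = dists.argmin(axis=1)
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Compare both metrics from the same starting centers; the data here is a made-up stand-in.
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(42.9, 43.2, 1000),      # rough Milwaukee-area latitudes
                       rng.uniform(-88.1, -87.8, 1000)])   # and longitudes
init = pts[rng.choice(len(pts), 6, replace=False)]
_, labels_geo = kmeans(pts, init, haversine)
_, labels_euc = kmeans(pts, init, euclidean)
print("points assigned differently:", int(np.sum(labels_geo != labels_euc)))
</pre>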
+ | |||
+ | =='''Week Five (6/26 - 6/30)'''== | ||
+ | *Tested side-by-side visualizations for geodesic vs Euclidean clusterings | ||
+ | *Experimented with multiple methods of visualizations | ||
+ | *Finished reading ''Weapons of Math Destruction'' | ||
+ | *Gave mini-presentation | ||
+ | *Ran into difficulties with the public datasets on the Milwaukee website | ||
+ | |||
+ | =='''Week Six (7/3 - 7/7)'''== | ||
+ | *Got better datasets and resolved issues with publically available census data | ||
+ | *Overlayed demographic information on the maps | ||
+ | *Generated a demographic breakdown for each cluster and compared geodesic vs euclidean | ||
+ | *Merged functionality from different versions of the map | ||
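The exact breakdown isn't recorded in this log. As a hedged sketch, with census demographics already joined onto each geocoded point, a per-cluster summary could be computed along these lines; the file and column names are invented.
<pre>
import pandas as pd

# Hypothetical file: one row per incident, with its cluster label under each metric and
# the demographic shares of the census area containing the incident.
df = pd.read_csv("data/points_with_demographics.csv")
demo_cols = ["pct_white", "pct_black", "pct_hispanic"]

def breakdown(df, label_col):
    """Average demographic shares per cluster for one clustering."""
    return df.groupby(label_col)[demo_cols].mean()

print(breakdown(df, "cluster_geodesic"))
print(breakdown(df, "cluster_euclidean"))
</pre>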
+ | |||
+ | =='''Weeks Seven to Nine (7/10 - 7/28)'''== | ||
+ | *Failed to do logs consistently | ||
+ | *Finalized demographic overlay | ||
+ | *Implemented multiple version of a potential bias index | ||
+ | *Ran experiments on the data to get trends about the potential bias index | ||
+ | *Expanded data set to include all available years worth of data | ||
+ | *Geocoded all of the new data | ||
+ | *Expanded map functionality to include potential bias index and cluster similarity | ||
+ | *Added interactive graphs for demographics and potential bias | ||
+ | *Moved potential bias calculations to Python to allow for faster web access | ||
+ | *Read lots of papers for the literature review | ||
+ | *Wrote a rough draft of the literature review | ||
+ | *Created the poster for the poster session | ||
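The definition of the potential bias index is not recorded in this log, so the following is only an illustration of the general shape such an index might take, comparing a cluster's demographic composition against a citywide baseline. It is not the project's actual formula, and the numbers are made-up example values.
<pre>
import numpy as np
import pandas as pd

def divergence_index(cluster_shares, baseline_shares):
    """Illustrative only: total absolute deviation of a cluster's demographic shares
    from the citywide shares (0 = identical composition, larger = more skewed)."""
    return float(np.abs(np.asarray(cluster_shares) - np.asarray(baseline_shares)).sum())

# Made-up example values, not project data.
clusters = pd.DataFrame(
    {"pct_white": [0.30, 0.70], "pct_black": [0.55, 0.15], "pct_hispanic": [0.15, 0.15]},
    index=["cluster_0", "cluster_1"],
)
citywide = [0.45, 0.38, 0.17]
for name, row in clusters.iterrows():
    print(name, round(divergence_index(row.to_numpy(), citywide), 3))
</pre>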
+ | |||
+ | =='''Week Ten (7/31 - 8/4)'''== | ||
+ | *Gave poster presentation | ||
+ | *Gave REU project presentation | ||
+ | *Reconfigured the maps to work with the other half of the data set | ||
+ | *Created more graphics for the paper | ||
+ | *Wrote a (very) rough draft of the discussion section | ||
+ | *Read a few more sources for the paper | ||
+ | *Minor tweaks to the map's color palette | ||
+ | *Departed for home |