Finding hotspots in geo-spatial data using spatial statistics

From REU@MU

Title: Finding hotspots in geo-spatial data using spatial statistics

Mentor: Dr. Satish Puri

Approach: This project deals with compute-intensive spatial data mining algorithms and uses parallel computing to speed up analytics tasks. It uses spatial correlation methods to mine interesting patterns in geo-spatial data.

Summary: Spatial data mining is the process of discovering interesting and potentially useful patterns in spatial data sources. The complexity of spatial data and its implicit spatial relationships limit the usefulness of standard data mining algorithms for extracting spatial patterns. Standard data mining algorithms typically assume that samples are independent and identically distributed, so they often perform poorly on geo-spatial data, which is spatially autocorrelated: nearby locations tend to have similar values. Here we focus on one spatial data mining task, namely hotspot detection. Hotspots are statistically significant clusters. In other words, given a set of weighted data points, the hotspots are those clusters of points whose values are higher than would be expected by random chance. For example, the Centers for Disease Control and Prevention (CDC) uses hotspot analysis to find disease outbreaks; another example is finding traffic accident hotspots in a region. A compute-intensive algorithm known as Getis-Ord is used to find such hotspots in data. The output of the algorithm is a Z score for each location, which represents the statistical significance of clustering at a specified distance. P-values are calculated from the Z scores to test the null hypothesis of no spatial clustering. New York taxi trip data sets containing about 100 million records of pick-up and drop-off locations and times will be used in the project. The size of the data and the computational complexity of the analysis motivate exploring parallel computing methods in this project.
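
To make the Z score and p-value steps concrete, here is a minimal Python sketch of the Getis-Ord Gi* statistic, assuming that is the variant of Getis-Ord intended here. It also assumes binary distance-band weights (a neighbor counts as 1 if it lies within a fixed radius, and each point is its own neighbor) and a normal approximation for the two-sided p-value. The function names and the brute-force neighbor search are illustrative choices, not the project's implementation.

```python
import math


def getis_ord_gi_star(points, values, radius):
    """Return one Gi* z-score per location (illustrative sketch).

    points: list of (x, y) coordinates; values: one weight per point.
    Assumed weights: w_ij = 1 if point j is within `radius` of point i
    (including j == i), otherwise 0.
    """
    n = len(values)
    mean = sum(values) / n
    # Global standard deviation S over all n values (population form).
    variance = sum(v * v for v in values) / n - mean * mean
    s = math.sqrt(max(0.0, variance))
    z_scores = []
    for xi, yi in points:
        # Values of all points inside the distance band around location i.
        neighbours = [values[j] for j, (xj, yj) in enumerate(points)
                      if math.hypot(xi - xj, yi - yj) <= radius]
        # For 0/1 weights, sum(w_ij) and sum(w_ij^2) both equal the count.
        w_sum = len(neighbours)
        numerator = sum(neighbours) - mean * w_sum
        denominator = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        # A zero denominator means no variation to test; report z = 0.
        z_scores.append(numerator / denominator if denominator > 0 else 0.0)
    return z_scores


def two_sided_p_value(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Toy data: three high-valued points close together, three low-valued far away.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
values = [10, 10, 10, 1, 1, 1]
z = getis_ord_gi_star(points, values, radius=1.5)
p = [two_sided_p_value(score) for score in z]
```

On this toy input the middle point of the high-valued group gets a positive Z score (a hotspot) and the middle point of the low-valued group gets the mirror-image negative score (a coldspot). The double loop over all pairs of points costs O(n^2) distance checks, which is one reason the computation becomes expensive on a data set with millions of records.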

Student Research Activities: The REU fellows will perform the following major tasks:

  • Perform a systematic literature review of data mining techniques for geo-spatial data.
  • Understand hotspot detection and the associated data mining algorithms.
  • Implement and evaluate spatial data mining algorithms and apply parallel computing methods to speed up analytic tasks on the publicly available New York taxi trips data set.
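
As one illustration of the kind of data-parallel decomposition the last task involves, the Python sketch below splits a list of pick-up coordinates into chunks, counts the points per grid cell in each chunk independently, and merges the partial counts. The grid binning step, the cell size, and the function names are assumptions made for this example and are not taken from the project description. A thread pool is used only because it runs anywhere; in CPython, threads do not give a multi-core speedup for pure-Python arithmetic, so the sketch shows the structure of the decomposition rather than a performance result.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def count_chunk(chunk, cell_size):
    """Count the points of one chunk per grid cell, keyed by (column, row)."""
    counts = Counter()
    for x, y in chunk:
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts


def parallel_cell_counts(points, cell_size, workers=4,
                         executor_cls=ThreadPoolExecutor):
    """Split points into chunks, count each chunk in a worker, merge results."""
    chunk_len = max(1, -(-len(points) // workers))  # ceiling division
    chunks = [points[i:i + chunk_len]
              for i in range(0, len(points), chunk_len)]
    total = Counter()
    with executor_cls(max_workers=workers) as pool:
        # Each chunk is independent, so the workers share no mutable state.
        for partial_counts in pool.map(
                partial(count_chunk, cell_size=cell_size), chunks):
            total.update(partial_counts)  # Counter.update adds the counts
    return total


# Hypothetical coordinates in arbitrary units, binned into 1 x 1 cells.
sample = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (2.5, 2.5), (0.1, 0.9)]
cell_counts = parallel_cell_counts(sample, cell_size=1.0, workers=2)
```

The resulting per-cell counts are one possible form of the "weighted data points" that a hotspot statistic takes as input. Passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` is one way to use several cores from Python, since the worker function is defined at module level and can be pickled.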

Student Background: Students need basic computing knowledge and introductory programming skills in Python or C/C++. Students will be introduced to compiler pragma-based methods for quickly parallelizing sequential code on multi-core CPUs and many-core GPUs.