Difference between revisions of "Approximate Nearest Neighbor Searches of Real World Geometry"

From REU@MU
Jump to: navigation, search
(Created page with "This is currently a placeholder page until more project details are established.")
 
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
This is currently a placeholder page until more project details are established.
+
'''Student Researcher:''' [[User:Schwennesen|Matt Schwennesen]]
 +
 
 +
'''Facility Mentor:''' [https://www.marquette.edu/computer-science/directory/satish-puri.php Dr. Satish Puri]
 +
 
 +
== Project Description ==
 +
 
 +
Nearest neighbor searches take a query data point and search the data space for the closest other data point in the space. This task is extremely information in all manner of data processing fields such an information retrieval, text mining and recommender systems. However, since the cost of exact nearest neighbor searches is computationally expensive, most of the recent work in this field has been done for approximate nearest neighbor searches.
 +
 
 +
Methods such as [https://en.wikipedia.org/wiki/Locality-sensitive_hashing locality sensitive hashing] have proven very useful when the data is text documents and recently graph based methods have been shown to be powerful and performant for vectors in high dimensional Euclidean space.
 +
 
 +
However, little to no work has been done on approximate nearest neighbor searches for polygons. There are many potential applications of this work, from trying to estimate physical parameters of say bodies of water by finding other bodies with more known features or potential applications in geo-locating images which have been stripped of metadata and passed through a feature extractor.
 +
 
 +
Unlike most data science problems, this problem is being investigated by efficient data structures rather then machine learning techniques.
 +
 
 +
== Project Goal ==
 +
 
 +
Develop a system which can search real world geometric datasets, such as the ones available [https://star.cs.ucr.edu/ here], and find approximate nearest neighbors of those shapes in an accurate and performant way. The system will be written in C++ using multithreading.
 +
 
 +
== Project Milestones ==
 +
 
 +
* Read / Write WKT.
 +
* Calculate Jaccard Distance and fill ratios.
 +
* Create grid over global minimum bounding rectangle.
 +
* Create LSH hashes.
 +
* Custom Hash Map implementation to allow more concurrency.
 +
* Query shapes while avoiding reporting the same neighbor multiple times.

Latest revision as of 22:31, 5 August 2022

Student Researcher: Matt Schwennesen

Facility Mentor: Dr. Satish Puri

Project Description

Nearest neighbor searches take a query data point and search the data space for the closest other data point in the space. This task is extremely information in all manner of data processing fields such an information retrieval, text mining and recommender systems. However, since the cost of exact nearest neighbor searches is computationally expensive, most of the recent work in this field has been done for approximate nearest neighbor searches.

Methods such as locality sensitive hashing have proven very useful when the data is text documents and recently graph based methods have been shown to be powerful and performant for vectors in high dimensional Euclidean space.

However, little to no work has been done on approximate nearest neighbor searches for polygons. There are many potential applications of this work, from trying to estimate physical parameters of say bodies of water by finding other bodies with more known features or potential applications in geo-locating images which have been stripped of metadata and passed through a feature extractor.

Unlike most data science problems, this problem is being investigated by efficient data structures rather then machine learning techniques.

Project Goal

Develop a system which can search real world geometric datasets, such as the ones available here, and find approximate nearest neighbors of those shapes in an accurate and performant way. The system will be written in C++ using multithreading.

Project Milestones

  • Read / Write WKT.
  • Calculate Jaccard Distance and fill ratios.
  • Create grid over global minimum bounding rectangle.
  • Create LSH hashes.
  • Custom Hash Map implementation to allow more concurrency.
  • Query shapes while avoiding reporting the same neighbor multiple times.