Difference between revisions of "User:Schwennesen"

From REU@MU
Jump to: navigation, search
(Created page with "= Week 1 = == 01 June 2022 == This morning started with a meeting hosted by Dr. Brylow about good research practices and the importance of keeping a daily log. Both and afte...")
 
Line 41: Line 41:
  
 
I an tentatively calling this project “Approximate Nearest Neighbor Searches of Real World Geometry”, but this title is subject to change once the problem becomes well defined.
 
I an tentatively calling this project “Approximate Nearest Neighbor Searches of Real World Geometry”, but this title is subject to change once the problem becomes well defined.
 +
 +
== Daily Log, 02 June 2022 ==
 +
 +
Today I continued the literature review regarding approximate nearest neighbor searches. I read three articles which I had identified yesterday:
 +
 +
<ol style="list-style-type: decimal;">
 +
<li><p>C. Fu and D. Cai, “EFANNA : An Extremely Fast Approximate Nearest Neighbor Search Algorithm Based on kNN Graph.” arXiv, Dec. 03, 2016. Accessed: Jun. 01, 2022. [Online]. Available: http://arxiv.org/abs/1609.07228</p>
 +
<p>While the graph based approaches to this ''do'' provide excellent results, the memory constraints are higher than the LSH methods used in the comparison. While the datasets used here are not as large as the ones mentioned in Dr. Puri’s LSH Applications paper, if we assume a linear memory growth, I’m not convinced that an approach like EFANNA is off the table.</p></li>
 +
<li><p>C. Cai and D. Lin, “Find Another Me Across the World – Large-scale Semantic Trajectory Analysis Using Spark.” arXiv, Apr. 02, 2022. Accessed: Jun. 01, 2022. [Online]. Available: http://arxiv.org/abs/2204.00878</p>
 +
<p>This paper uses minhashing, a type of LSH to find similar ''semantic trajectories'', basically people with similar life styles based off of the places that they visit, the order of places and the amount of time spent there. They developed a hashing method called Sequence-Sensitive Hashing (SSH) which was designed to respect the sequential nature of this data. It was interesting to see LSH outside of the textbook chapter that I read about it, but the data here is still rather similar to the document structure that LSH is frequently practiced on.</p></li>
 +
<li><p>T. Ge, K. He, Q. Ke, and J. Sun, “Optimized Product Quantization for Approximate Nearest Neighbor Search,” in ''2013 IEEE Conference on Computer Vision and Pattern Recognition'', Jun. 2013, pp. 2946–2953. doi: [https://doi.org/10.1109/CVPR.2013.379 10.1109/CVPR.2013.379].</p>
 +
<p>This paper was somewhat of a break from the other two. Product Quantization is an approach which is adjacent to LSH and serves, at least to my current understanding, as a “different backend” for this whole process. It involves breaking the data space into a number of subspaces such that the original space is the Cartesian product of the subspaces. This produces distortion in the quantization process, which must be minimized to push for better results. The authors were able to find a non-parametric solution to this issues which can optimize the distortion in an iterative manner and it performed very well. Additional specialization is discussed for data which follows a Gaussian or normal distribution.</p></li></ol>
 +
 +
I additionally identified six libraries which may be relevant to this project once programming begins.
 +
 +
# [https://github.com/spotify/annoy annoy] (C++ / Python) Since when did Spotify develop things like this? Thanks to the NetworkX Core Development Team for this recommendation.
 +
# [https://github.com/ZJULearning/efanna EFANNA] (C++) EFANNA implementation.
 +
# [https://github.com/ZJULearning/nsg NSG] (C++) NSG implementation by the EFANNA authors. This is their improvements on EFANNA I believe, but while I have the research paper for NSG I have not had the chance to read it.
 +
# [https://github.com/aaalgo/kgraph kGraph] (C++ with limited Python support). ANN graph approach using a technique called NN-descent. Used as a benchmark for EFANNA.
 +
# [https://github.com/flann-lib/flann FLANN] (C++ / Python) KD-tree based algorithms.
 +
# [https://github.com/cynthia/yael yael] (C++ / Python) Not sure what algorithms this uses but it was linked off the website for the datasets used in multiple of the papers I read today.
 +
 +
The datasets of [http://corpus-texmex.irisa.fr/ this website] where used in the Fu and Cai paper and Ge et al. papers, which may or may not be interesting resources to look at. I believe that these data sets are just numeric vectors in 128 or 960 dimensions which is idea for research on approximate nearest neighbor search (ANNS).
 +
 +
Looking at the programming resources that I have found, C++ is starting to pull ahead of my innate desire to program everything in python. I do not believe that this decision needs to be made this week.
 +
 +
One of the most important things for me to do this week is to find a narrow and tailored research topic. It looks like there as been a lot of research, which I am currently crawling through, on the general ANNS problem. However, I have not seen anything about using these techniques on shapes or other geometries. The Cai and Lin paper is probably the closest when it was discussing coordinate based similarity in their trajectories. I do need to investigate this specific sub-problem further, but I suspect that finding a suitable hashing or quantization function for the input geometries will be the heart of this problem. Once the LSH hashes or PQ codebooks are produces, I don’t see why existing methods like EFANNA or the algorithm described by Ge et al wouldn’t be applicable. However, if my experience with Google’s Summer of Code last year taught me anything it is that my last statement is almost certainly wrong and I just don’t have the knowledge yet see the problem and find a solution.

Revision as of 21:52, 2 June 2022

Week 1

01 June 2022

This morning started with a meeting hosted by Dr. Brylow about good research practices and the importance of keeping a daily log. Both and after that, I was reading some literature that my mentor Dr. Puri had sent over. The original project that I was assigned was to use ICESAT-2 data to build a model for predicting sea ice thickness. This project was already started by a different REU student from a previous year [1] and Dr. Puri mentioned a different project, using either locality sensitive hashing and / or quantization methods to approximate nearest neighbor searches for geometry and shapes.

This new project is more appealing to me, in part because it is based on topics that I am already familiar with such as Jaccard Similarity and many notions from set theory and text mining that I have learned about at Michigan Tech. Dr. Puri sent me some preliminary literature including a chapter from what I believe is a textbook on using locality sensitive hashing to compute approximately nearest neighbors for text documents, which is a well studied problem and not the focus of this research project.

The Jaccard Similarity is a measure of the likeness between sets, but it can also be defined for areas such as polygons. For a set it is defined as

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

And for polygons, we do the analogous steps and take the area of the intersection or overlap and divide it by the area of the union or the total footprint of the two polygons put together.

For large datasets, we may be interested in similar objects, but for a dataset with n objects where would be C(n,2) pairs to evaluate with is at least an O(n^2) operation before adding the time needed to compute the Jaccard coefficient. The idea behind locality sensitive hashing (LSH) is to create some hash function h which is more likely to place similar items together and less likely to place dissimilar items together. Most of this information is in [2], and I should find more information about the source of this PDF that Dr. Puri sent me. He also send [3], a shorter paper that he wrote about the principles of using LSH on GIS data and some comparison to other approximate nearest neighbor (ANN) approaches.

Using the references on [3], I have found PDF copies of the remaining references below and will be reading and annotating them in the coming days. In addition to the literature review, I also need to establish the actual research question of interest here and set my sights on a realizable goal for the end of the REU. Currently, we would like to produce a small model as a proof of concept which can be demonstrated at the end of the REU.

  1. T. Chen, “Using Machine Learning Techniques to Analyze Sea Ice Features Using ICESAT-2 Data”.
  2. “Finding Similar Items”. From a currently unknown textbook.
  3. S. Puri, “Jaccard Similarity and LSH Applications,” ACM, Jun. 2022, doi: https://doi.org/10.1145/nnnnnnn.nnnnnnn.
  4. C. Fu and D. Cai, “EFANNA : An Extremely Fast Approximate Nearest Neighbor Search Algorithm Based on kNN Graph.” arXiv, Dec. 03, 2016. Accessed: Jun. 01, 2022. [Online]. Available: http://arxiv.org/abs/1609.07228
  5. C. Fu, C. Xiang, C. Wang, and D. Cai, “Fast approximate nearest neighbor search with the navigating spreading-out graph,” Proc. VLDB Endow., vol. 12, no. 5, pp. 461–474, Jan. 2019, doi: 10.14778/3303753.3303754.
  6. C. Cai and D. Lin, “Find Another Me Across the World – Large-scale Semantic Trajectory Analysis Using Spark.” arXiv, Apr. 02, 2022. Accessed: Jun. 01, 2022. [Online]. Available: http://arxiv.org/abs/2204.00878
  7. Y. Kalantidis and Y. Avrithis, “Locally Optimized Product Quantization for Approximate Nearest Neighbor Search,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2014, pp. 2329–2336. doi: 10.1109/CVPR.2014.298.
  8. T. Ge, K. He, Q. Ke, and J. Sun, “Optimized Product Quantization for Approximate Nearest Neighbor Search,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2013, pp. 2946–2953. doi: 10.1109/CVPR.2013.379.
  9. Y. Matsui, Y. Uchida, H. Jegou, and S. Satoh, “A Survey of Product Quantization,” vol. 6, no. 1, p. 9, 2018.
  10. H. Jégou, M. Douze, and C. Schmid, “Product Quantization for Nearest Neighbor Search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, Jan. 2011, doi: 10.1109/TPAMI.2010.57.

This project has the opportunity for me to learn GPU programming which is something that I’d be interested in however Dr. Puri thinks that it is not needed and most of the GPU accelerated libraries use CUDA which I cannot use since my laptop does not have an NVIDIA GPU. I could investigate HIP which is the AMD competitor to CUDA but does not have the same level of support that CUDA does. I am presently unsure if the programming portion of this project will be conducted in python or C++. These are the libraries that Dr. Puri sent me:

  • Clipper (C++) A computational geometry library specializing in shape intersections.
  • Shapely (Python) A computational geometry library.
  • E2LSH (C++) A locality sensitive hashing library which can compute exact nearest neighbors for polygons. This would be useful as a base line to compare any approximate approaches against.

I have also identified these libraries:

  • FALCONN (C++) A more recent update to E2LSH apparently.
  • Scipy Spatial (Python) Part of the greater scipy package and does have limited nearest neighbor capabilities using KD-trees.

As for other pros and cons, I am more familiar with python at a whole then C++, but I am proficient in C++. Additionally, using C++ would be faster and I am be any to leverage an intermediate stage of parallelism: CPU multi-threading. This is possible in python but certainly not something that python is good at and would probably be easier then having to learn GPU programming from the ground up since have conceptually background on CPU threading. Unfortunately MTU uses a library called ThreadMentor which is not available outside of MTU department servers in any practical manner so I would have to learn the C++ standard library interface. It does look like semaphores would require C++ 20, which may or may not be supported on the department server here at Marquette that I should be getting access to in the coming days.

I an tentatively calling this project “Approximate Nearest Neighbor Searches of Real World Geometry”, but this title is subject to change once the problem becomes well defined.

Daily Log, 02 June 2022

Today I continued the literature review regarding approximate nearest neighbor searches. I read three articles which I had identified yesterday:

  1. C. Fu and D. Cai, “EFANNA : An Extremely Fast Approximate Nearest Neighbor Search Algorithm Based on kNN Graph.” arXiv, Dec. 03, 2016. Accessed: Jun. 01, 2022. [Online]. Available: http://arxiv.org/abs/1609.07228

    While the graph based approaches to this do provide excellent results, the memory constraints are higher than the LSH methods used in the comparison. While the datasets used here are not as large as the ones mentioned in Dr. Puri’s LSH Applications paper, if we assume a linear memory growth, I’m not convinced that an approach like EFANNA is off the table.

  2. C. Cai and D. Lin, “Find Another Me Across the World – Large-scale Semantic Trajectory Analysis Using Spark.” arXiv, Apr. 02, 2022. Accessed: Jun. 01, 2022. [Online]. Available: http://arxiv.org/abs/2204.00878

    This paper uses minhashing, a type of LSH to find similar semantic trajectories, basically people with similar life styles based off of the places that they visit, the order of places and the amount of time spent there. They developed a hashing method called Sequence-Sensitive Hashing (SSH) which was designed to respect the sequential nature of this data. It was interesting to see LSH outside of the textbook chapter that I read about it, but the data here is still rather similar to the document structure that LSH is frequently practiced on.

  3. T. Ge, K. He, Q. Ke, and J. Sun, “Optimized Product Quantization for Approximate Nearest Neighbor Search,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2013, pp. 2946–2953. doi: 10.1109/CVPR.2013.379.

    This paper was somewhat of a break from the other two. Product Quantization is an approach which is adjacent to LSH and serves, at least to my current understanding, as a “different backend” for this whole process. It involves breaking the data space into a number of subspaces such that the original space is the Cartesian product of the subspaces. This produces distortion in the quantization process, which must be minimized to push for better results. The authors were able to find a non-parametric solution to this issues which can optimize the distortion in an iterative manner and it performed very well. Additional specialization is discussed for data which follows a Gaussian or normal distribution.

I additionally identified six libraries which may be relevant to this project once programming begins.

  1. annoy (C++ / Python) Since when did Spotify develop things like this? Thanks to the NetworkX Core Development Team for this recommendation.
  2. EFANNA (C++) EFANNA implementation.
  3. NSG (C++) NSG implementation by the EFANNA authors. This is their improvements on EFANNA I believe, but while I have the research paper for NSG I have not had the chance to read it.
  4. kGraph (C++ with limited Python support). ANN graph approach using a technique called NN-descent. Used as a benchmark for EFANNA.
  5. FLANN (C++ / Python) KD-tree based algorithms.
  6. yael (C++ / Python) Not sure what algorithms this uses but it was linked off the website for the datasets used in multiple of the papers I read today.

The datasets of this website where used in the Fu and Cai paper and Ge et al. papers, which may or may not be interesting resources to look at. I believe that these data sets are just numeric vectors in 128 or 960 dimensions which is idea for research on approximate nearest neighbor search (ANNS).

Looking at the programming resources that I have found, C++ is starting to pull ahead of my innate desire to program everything in python. I do not believe that this decision needs to be made this week.

One of the most important things for me to do this week is to find a narrow and tailored research topic. It looks like there as been a lot of research, which I am currently crawling through, on the general ANNS problem. However, I have not seen anything about using these techniques on shapes or other geometries. The Cai and Lin paper is probably the closest when it was discussing coordinate based similarity in their trajectories. I do need to investigate this specific sub-problem further, but I suspect that finding a suitable hashing or quantization function for the input geometries will be the heart of this problem. Once the LSH hashes or PQ codebooks are produces, I don’t see why existing methods like EFANNA or the algorithm described by Ge et al wouldn’t be applicable. However, if my experience with Google’s Summer of Code last year taught me anything it is that my last statement is almost certainly wrong and I just don’t have the knowledge yet see the problem and find a solution.