Efficient Similarity Detection Program for Text Files Using Shingling and Jaccard Methods with CUDA Implementation

ECE 4110 Final Project

The project aims to identify similar text files from a dataset of 250,000 files by utilizing shingling, Jaccard similarity, and MinHashing. The program processes data by creating shingles and comparing hash values to determine similarity scores, ultimately outputting the most similar files above a 0.8 threshold. Despite facing challenges like slow processing speed and resource-intensive GPU setup, the project showcases an approach to efficiently handling vast amounts of textual data.

  • Text Analysis
  • Shingling
  • Jaccard Similarity
  • CUDA Implementation
  • Data Processing




Presentation Transcript


  1. ECE 4110 Final Project SHMYRDE JEAN-PAUL

  2. Abstract The goal of this project was to write a program that efficiently finds the most similar files in a dataset. The dataset contains 250,000 text files, each a relatively small report. Two approaches were used: shingling with Jaccard similarity, and MinHashing (a naive implementation). The first approach splits each document's data into substrings called shingles; once every document's shingles have been created, the documents are compared to one another. The second approach is almost the same, but instead of comparing shingles directly, they are converted to hashes and the hashes are compared. This reduces the number of comparisons needed, at the cost of up-front overhead from creating the numerous hashes for each text file. At the end, the program outputs which files are the most similar, using a threshold of 0.8. The project was written in C++ with CUDA and Thrust.

  3. Approach 1 (Shingles) First, the dataset is read in one file at a time; using OpenMP, I parsed the data into a user-defined struct containing the ID of the file, its path, and a char * to the contents of the file. Once all the data has been parsed, the shingles are created by building a vector for each file in the struct. The length of each shingle is defined as a preprocessor directive, which is used in numerous for-loops. Each shingle is then converted to a hash value using crc32 hashing. This value is stored in an array, since it takes less memory and integers are easier to compare than strings. Jaccard similarity is then run on every possible pair of files, and a similarity score is computed for each one. Each score is stored in a map keyed by the pair of file IDs. Finally, the last entry of the map is displayed, which is naturally the pair with the highest similarity score.

  4. Speed To be frank with you, this ran incredibly slowly. Using OpenMP, I was able to parse the 250k files rather quickly (about a minute), but that's not where the problem began. Creating the shingles for every single text file caused unwanted slowness, because I had to allocate new space in a vector for each file, which would later be used with the GPU. The shingling process therefore took a while to complete. However, even this was not the main cause of the slowness. Allocating memory on the GPU was.

  5. Timing Here are the required timings for this method:

     List Name               Host  Proc. Type  Time (secs)
     00 / 100 Shingles       007   GPU         N/A
     01 / 1,000 Shingles     007   GPU         6.00
     02 / 10,000 Shingles    007   GPU         90.00
     03 / 250,000 Shingles   007   GPU         5940.00

  6. Issues
  • Setting up the data for the GPU takes more resources than just writing the program for a CPU.
  • cudaMalloc and cudaFree are very resource-heavy because they do background checks to ensure everything was created or removed properly. The main issue is that I dynamically created space, which necessitated thousands of cudaMalloc calls.
  • 2D arrays are hard to manage in the kernel, so everything must be flattened to 1D for ease.
  • You can't pass an object (such as a struct) by value to the GPU, so you have to copy the entire struct into GPU memory, which leads to more cudaMalloc calls.
  • There are very limited places where the GPU can be used (at least in this implementation).
  • There are about three different locations where the time complexity is at least O(n^2).

  7. Approach 2 (Hash) It starts the same way as Approach 1, with the parsing of the information. Once everything is parsed, shingles are calculated for every document, but now the shingles are not immediately stored. Each shingle is passed through a random hash function; if the resulting hash value is the smallest of all the hashes calculated so far, it is stored in the vector, and the loop repeats for N hash functions. The first thing to notice is that this step is slower than in the previous approach; the hope is that we gain a performance boost in the next step (Jaccard similarity). Before I continue: the main mistake I made in this approach was using Jaccard similarity, which I will explain later.

  8. Timing (Approach 2) The required timings for this program:

     List Name               Host  Proc. Type  Time (secs)
     00 / 100 Shingles       007   GPU         N/A
     01 / 1,000 Shingles     007   GPU         6.00
     02 / 10,000 Shingles    007   GPU         225.00
     03 / 250,000 Shingles   007   GPU         25170.00

  9. What went wrong?
  • The first thing I noticed is the number of loops I used. I could have sped this code up by vectorizing everything, but since CUDA does not support the STL, I was stuck with what Thrust provided and with plain C.
  • The point of using MinHashing is to reduce the number of comparisons needed, but my implementation did not take advantage of that.
  • Increasing the load (accidentally).
  • Not optimizing the struct for GPU consumption.

  10. Conclusion
  • Don't ever procrastinate when your assignment involves C++. I spent approximately 25 hours, from this Wednesday to today, creating this.
  • Despite this being a CUDA programming course, I've come to appreciate the STL. Iterators are so useful, and Thrust and iterators work well together, so I used them to the utmost.
  • Don't dynamically allocate memory on the GPU. This is a performance destroyer.
  • Can't live without C++ functions? Just create __device__ functions and reimplement them for the GPU.
  • Despite being slow, the program does in fact work.

  11. Some Files that it Produced
  100.list
    /data/isip/data/ece_3822/eeg_reports/set_02/data_03/02/0239/023907.txt
    /data/isip/data/ece_3822/eeg_reports/set_02/data_03/02/0239/023945.txt
  1000.list
    /data/isip/data/ece_3822/eeg_reports/set_02/data_03/02/0276/027693.txt
    /data/isip/data/ece_3822/eeg_reports/set_02/data_03/02/0276/027643.txt
