
Tools for Efficient Genomic Data Compression and Querying
Discover the power of compressed suffix arrays and FM-index in efficiently compressing and querying large genomic data. Learn from the advancements in data structures like the ?-index and Phi-1 for optimal space management and fast substring queries. Stay informed about the latest research on improving the FM-index for repetitive datasets and enabling random access prediction. Explore the innovative approaches presented by researchers from the University of Florida and beyond.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Speeding Up Compressed Suffix Array Queries HERMAN PERERA, CHRISTINA BOUCHER, AND MASSIMILIANO ROSSI UNIVERSITY OF FLORIDA 1
Compressed Suffix Arrays - Invaluable tool in compressing large genomic data which has been rapidly growing over the last decade. - Problems of optimally querying these data structures in limited amounts of space is important. 2
FM-index Ferragina, Manzini, Opportunistic data structures with applications [FOCS 2000] [FOCS 2000] ? =ACAACAC$ - Using the Burrows Wheeler Transform (BWT) BWT SA - An input text may be compressed and used for fast substring queries. 7 $ACAACAC 2 AACAC$AC 5 AC$ACAAC 0 ACAACAC$ 3 ACAC$ACA 6 C$ACAACA 1 CAACAC$A 4 CAC$ACAA - Suffix Array (SA) samples at constant intervals. Runs (?) 3
?-index Gagie, Navarro, Prezza, Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space [JACM 2020] [JACM 2020] - Improvement on the FM-index for repetitive datasets. ? =ACAACAC$ BWT SA - Stores the same data in fewer runs and smaller space. 7 $ACAACAC 2 AACAC$AC 5 AC$ACAAC 0 ACAACAC$ 3 ACAC$ACA 6 C$ACAACA 1 CAACAC$A 4 CAC$ACAA - With the added help of an additional data structure to reconstruct the suffix array. Runs (?) 4
? =ACAACACAACAACACAACAC$ Samples ? FM 20 $ACAACACAACAACACAACAC 7 AACAACACAACAC$ACAACAC 15 AACAC$ACAACACAACAACAC 2 AACACAACAACACAACAC$AC 10 AACACAACAC$ACAACACAAC 18 AC$ACAACACAACAACACAAC 5 ACAACAACACAACAC$ACAAC 13 ACAACAC$ACAACACAACAAC 0 ACAACACAACAACACAACAC$ 8 ACAACACAACAC$ACAACACA 16 ACAC$ACAACACAACAACACA 3 ACACAACAACACAACAC$ACA 11 ACACAACAC$ACAACACAACA 19 C$ACAACACAACAACACAACA 6 CAACAACACAACAC$ACAACA 14 CAACAC$ACAACACAACAACA 1 CAACACAACAACACAACAC$A 9 CAACACAACAC$ACAACACAA 17 CAC$ACAACACAACAACACAA 4 CACAACAACACAACAC$ACAA 12 CACAACAC$ACAACACAACAA 20 7 15 2 10 18 5 13 0 8 16 3 11 19 6 14 1 9 17 4 12 FM-index: 11 SA samples ?-index: 5 SA samples Runs (?) 5
Phi-1 - How does Phi-1 work? - By using Phi-1 we can recover these intermediate samples. 6
Phi-1 & Random Access Pred 26 3 14 24 3 11 21 . . . - By using Phi-1 we can recover these intermediate samples. ? ? 1(??[?]) = ??[? + 1] - The predecessor of a sample is the first end of a run that precedes it. E.g., Find SA[3]: E.g., Find SA[3]: Preceding end: 26 Predecessor: 26 Distance: 27 Next sample: 8 (8 + 27) % 27 = 8 (11 + 5) % 27 = 16 (23 + 2) % 27 = 25 Next sample: 8 Predecessor: 3 Distance: 5 Next sample: 11 Next sample: 23 Next sample: 16 Predecessor: 14 Distance: 2 7
Pred S E Graph Definition 26 21 14 18 22 9 0 17 24 3 11 20 19 26 3 3 22 3 9 0 17 3 3 11 20 0 26 8 6 23 5 9 0 17 7 3 11 20 2 - Given a compressed suffix array, we build a directed graph, ? = (?,?) such that all nodes correspond to samples at the end of each BWT run. - Let ? be the list of SA samples at the end of each run & ? be the list of SA samples at the start of each run. - There exists an edge between two samples ? & ?, where ? is ?[?] and ? is ????(?[? + 1]) 8
Pred S E Graph Construction 26 21 14 18 22 9 0 17 24 3 11 20 19 26 3 3 22 3 9 0 17 3 3 11 20 0 26 8 6 23 5 9 0 17 7 3 11 20 2 - Given a compressed suffix array, we build a directed graph, ? = (?,?) such that all nodes correspond to samples at the end of each BWT run. - There exists an edge between two samples ? & ?, where ? is ?[?] and ? is ????(?[? + 1]) 24 3 0 17 26 20 11 21 9 14 22 18 9
Cost & Limits Pred S E 26 21 14 18 22 9 0 17 24 3 11 20 19 26 3 3 22 3 9 0 17 3 3 11 20 0 26 8 6 23 5 9 0 17 7 3 11 20 2 - While constructing the graph, we calculate two values. - ????(?) = ? ? 1(?) ????(? ? 1(?)) e.g., ? ? 126 ???? ? ? 126 (8 3) = 5 - ????? ? = ????(?) ????(?) e.g., ???? 3 ???? 3 = 6 10
Pred S E Cost & Limits 26 21 14 18 22 9 0 17 24 3 11 20 19 26 3 3 22 3 9 0 17 3 3 11 20 0 26 8 6 23 5 9 0 17 7 3 11 20 2 24 (4,1) (0,2) (0,3) (0,6) (0,3) (2,1) (5,1) 17 11 20 0 26 3 (2,1) (0,2) (3,1) (1,3) (0,2) 21 18 9 14 22 11
Graph To Tree - Build a binary tree whose leaves correspond to the samples contained within the path. - Parent nodes store cumulative costs and their respective limit. ??????.???? = ?1 + ?2 ??????.????? = min(?1,?2 ?1) (?1 + ?2,min ?1,?2 ?1 ) (?1,?1) (?2,?2) 12
Tree Construction (4,1) (0,6) (0,3) (0,3) (2,1) (5,1) - Isolate the cycles so that we can have a long, linear path to traverse. 26 17 3 11 20 0 - Recursively build and compute the parents from their leaves. 11,-6 5,-3 6,-1 (?1 + ?2,min ?1,?2 ?1 ) 2,1 4,1 5,1 0,2 17 11 (?1,?1) (?2,?2) 5,1 0,6 2,1 0,3 20 0 26 3 13
Steps to Query Tree - Climbing until our cost exceeds the limit of the tree. - This expands our search. - Descend until we reach a leaf node to exit on. - This refines our search. 14
BV BV New Query Process 1 - - - - - 0 - 0 - - - 0 - - 0 0 1 1 - - 0 1 1 1 - 0 11,-6 E.g., Find SA[3]: (25) E.g., Find SA[3]: (25) Previous end: 26 Predecessor: 26 Distance: 3 Cost: 0 In cycle? True. 5,-3 6,-1 2,1 4,1 5,1 0,2 17 11 Next sample: 16 Predecessor: 14 In cycle? No. Use Phi-1. 2,1 5,1 0,6 0,3 3 0 20 26 15
(Experiments That Will Be Conducted) Future Work - Comparing against recent implementations of different compressed suffix arrays. Such as the r-index, sr-index, RLZ-CSA, BT-CSA, Dynamic r- index, LCSA, and CDAWG. - Datasets will include Chromosome 19 and SARS. 17
Thank You! Thanks for your time and attention. 18