Compacting de Bruijn Graphs in Deep Sequencing Data Analysis

Compacting de Bruijn Graphs in Deep Sequencing Data Analysis
Slide Note
Embed
Share

Challenges and importance of compaction in de Bruijn graph algorithms for deep sequencing data, focusing on improving efficiency and reducing memory usage to enhance assembly pipelines."

  • Data analysis
  • Sequencing
  • De Bruijn graphs
  • Algorithms
  • Compaction

Uploaded on Mar 07, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Compacting de Bruijn graphs (BCALM2) Seminar in Algorithms for Deep Sequencing Data, 2018 Yonatan Vernik Ofir Cohen 1

  2. Outline Motivation Related Work Terminology Algorithm Proofs Results Summary 2

  3. Motivation As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. Compaction is an important data reduction step in most de Bruijn graph based algorithms where long simple paths are compacted into single vertices. 3

  4. Motivation Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem. 4

  5. Example Before After 5

  6. Intuition Before 6

  7. Intuition Before - CCCT has an out- degree of 1 CCTC has as in- degree of 1 CCCT->CCTC in the de Bruijn graph - - -> we will combine them 7

  8. Intuition Before - CCCT has an out- degree of 1 CCTC has as in- degree of 1 CCCT->CCTC in the de Bruijn graph - - -> we will combine them 8

  9. Outline Motivation Related Work Terminology Algorithm Proofs Results Summary 9

  10. de Bruijn graph in fragment assembly The use of the de Bruijn graph in fragment assembly consists of a multi-step pipeline, however, the most data intensive steps are usually the first three: Nodes enumeration/distinct k-mer counting Compaction Graph cleaning (of sequencing errors) 10

  11. Related Work k-mer counting has seen orders of magnitude improvements in memory usage and speed. As a result graph compaction is becoming the new bottleneck and it has received little attention. Other parallel approaches have been proposed (for genome assemblers) So why bother? Because most are only implemented within the context of a specific assembler, and cannot be used as modules for the construction of other fragment assemblers or for other applications of de Bruijn graphs (e.g. metagenomics). 11

  12. Outline Motivation Related Work Terminology Algorithm Proofs Results Summary 12

  13. Terminology Multi-set k-spectrum Glue operation Binary relation Compactable strings (u, v) Unitig (and contig) l-minimizer, rmm, lmm m-compactable 13

  14. Terminology -Multi-set, Suffix, Prefix Multi-set: a set which may can contain multiple copies of the same element. Ex. {1, 1, 2, 3} 14

  15. Terminology -Multi-set, Suffix, Prefix Multi-set: a set which may can contain many copies of the same item. Ex. {1,1,2,3} : the first k characters of string v, v[1,2,..,k] : the last k characters of string v, v[|v|-k+1,..,|v|] 15

  16. Terminology -k-spectrum : for string s, |s|>=k, = the multi-set of all k-mer substrings in s. Ex. k-spectrum(S): for a set of strings S, Ex. 16

  17. Terminology -Glue-k operation The glue operation: For two strings u,v such that , u k v = u v[k+1.. v ] Ex. abcde 3 cdef = abcdef 17

  18. Terminology -Compactability Two strings (u,v) will fulfill the binary relation u v if: Those are the edges of the de Bruijn graph Two strings (u,v) are called compactable if: u v Out-degree(u)=1 In-degree(v)=1 18

  19. Terminology -Compactability If u,v are compactable, their compaction is u k-1 v Ex. k=4, u=ACCG, v=CCGT u k-1 v = ACCGT 19

  20. Terminology -Unitig A path p = (v1,v2,..,vm) is called a unitig if: If p=(v), then p is defined as a unitig. - or: i 1,m , out-degree(vi)=in-degree(vi)=1 in -degree(vm)=1 out-degree(v1)=1 Note: a unitig is always a simple path 20

  21. Terminology -Unitig Quick proof of the simpleness of unitigs: Note: we ignore paths which are cycles 21

  22. Terminology -Unitig A unitig is called maximal if it cannot be extended on either side 22

  23. Terminology -l-minimizer, lmm, rmm Let there be some strong ordering of l-mers. For a string s with at least l characters it s l-minimizer is: The smallest l-mer substring in s 23

  24. Terminology -l-minimizer, lmm, rmm Let there be some strong ordering of l-mers. For a string s with at least l characters it s L-minimizer is: The smallest l-mer substring in s For k-mer s: Its lmm is the l-minimizer of it s first (k-1) characters Its rmm is the l-minimizer of it s last (k-1) characters 24

  25. Terminology -l-minimizer, lmm, rmm Example (k=4, l=2) Let s=AGTC, and assume lexicographic ordering: lmm(s) = min{AG, GT} = AG rmm(s) = min{GT, TC} = GT 25

  26. Terminology -m-compactable For a set of strings S, two strings (u,v) S are m- compactable in S if: They are compactable in S lmm(u)=rmm(v)=m Their compaction is called an m-compaction For a set of strings S, it s m-compaction is obtained by applying m-compactions as much as possible in any order26

  27. Terminology Multi-set k-spectrum Glue operation Binary relation Compactable strings (u, v) Unitig (and contig) l-minimizer, rmm, lmm m-compactable We can now redefine the de Bruijn compaction problem under these new definitions: Given a set of k-mers K, find the set of all maximal unitigs U 27

  28. The problem of compacting a de Bruijn graph is to report the set of all maximal unitigs U. 28

  29. Outline Motivation Related Work Terminology Algorithm Proofs Results Summary 29

  30. Algorithm BCALM2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for good balance of memory usage throughout its execution. U K BCALM2 30

  31. Algorithm Step 1: Distribute k-mers into buckets (doubled k-mer) Step 2: Compact within every bucket (non-lonely finished nodes, lonely nodes -opportunity for further compaction) Step 3: Reunite across buckets (union-find) 31

  32. Algorithm Step 1: Distribute the k-mers into buckets - - G<- For each k-mer s in K: - Find the l length minimizer for s[1,..,k-1], denoted by lmm(s) - Put s in the bucket F(lmm(s)) - Find the l length minimizer for s[2,..,k], denoted by rmm(s) - If rmm(s) lmm(s), also put s in F(rmm(s)), we will refer to these k- mers as doubled k-mers 32

  33. Algorithm Example: s1=CCCA, s2=CCCC, k=4, l=2 rmm(s1)=CA F(CA) CCCA CCCC F(CC) rmm(CCCC)=lmm(CCCC)=CC F(CC) lrmm(s1)=CC 33

  34. Algorithm Step 2:Apply compaction (for each bucket F(i)): /* in parallel */ - - U <-all i-compactions in F(i) For each string u in U: - Mark u s prefix as lonely if i lmm(u). - Mark u s suffix as lonely if i rmm(u). - If u s prefix and suffix are not lonely: Put u in G - Otherwise: /* lonely nodes */ - Put u in R (the Reunite file) 34

  35. Algorithm Example for i-compaction: CCCC CCCT CCCC F(CC) CCCC CCCT CCTC CCCA CCCA* CCCA CC-compaction CCTC induces CCCTC* Here k=4, l=2 35

  36. Algorithm Example for lonely node: s=CCCA, k=4, l=2 R rmm(s1)=CA F(CA) *CCCA CCCA *CCCA CCCA* CCCA* F(CC) lrmm(s1)=CC 36

  37. Algorithm Step 3.1: Reunite (for each string in R (reunite file)): /* in parallel */ - UF Union find data structure whose elements are the distinct lonely k-mer extremities For all (parallel) u in R - If both ends of u are lonely then - UF.union(suf_k(u), pre_k(u)) - 37

  38. Algorithm Step 3.2: Reunite (for each string in R (reunite file)): /* in parallel */ - For all (parallel) classes C of UF do - P = all u in R that have a lonely extremity in C - While exists u in P that does not have a lonely prefix, do - Remove u from P - Let s=u - While exists v in P such that suf_k(s) = pre_k(v) do - S glue (s, v) - Remove v from p - Output s 38

  39. Algorithm Example R = {CCCTC*, CCCA*, *CCTCTA, CTAA*, CTAC*, *CTAA, *CTAC, *CCCA} UF = {CCTC, CCCA, CTAA, CTAC} Parition and glue within the classes {CCTC} {CCCTC*, *CCTCTA} CCCTCTA /* in parallel */

  40. Algorithm Example for UF.union: F(CCC) ACCC CCCC CCCT F(ACC) ACCC F(CCT) CCCT ACCC Distribute i-compaction CCCC ACCC* *ACCCCT* *CCCT R = { } CCCT 40 UF same bucket {ACCC, CCCT}

  41. Algorithm Putting it all together: (k=4, l=2) 41

  42. Outline Motivation Related Work Terminology Algorithm Proofs Results Summary 42

  43. Proof of correctness Lemma 1: Let S and T be two sets of strings of length at least k such that spk+1(S)=spk+1(T)and spk(S)=spk(T) and all these spectrums are without duplicates. Then, S = T 43

  44. First, for every string sS, we will show that there is tT such that s is a substring of t. Proof of correctness Note that any k and k+1 substring of s is in some t T due to the spectrum requirements Lemma 1: If |s|=k, s spk(S) -> s spk(T) -> there s a t T s.t s is a substring of t. Let S and T be two sets of strings of length at least k such that spk+1(S)=spk+1(T)and spk(S)=spk(T) and all these spectrums are without duplicates. Then, S = T 44

  45. First, for every string sS, we will show that there is tT such that s is a substring of t. Proof of correctness Note that any k and k+1 substring of s is in some t T due to the spectrum requirements Lemma 1: If |s|=k, s spk(S) -> s spk(T) -> there s a t T s.t s is a substring of t. Let S and T be two sets of strings of length at least k such that spk+1(S)=spk+1(T)and spk(S)=spk(T) and all these spectrums are without duplicates. Then, S = T If |s|>k, let p=max{i N|s[1...i] is a substring of some t T}. We ll denote t as that string. Note p>=k+1. We will now prove p=|s|. 45

  46. First, for every string sS, we will show that there is tT such that s is a substring of t. Proof of correctness Note that any k and k+1 substring of s is in some t T due to the spectrum requirements Lemma 1: If |s|=k, s spk(S) -> s spk(T) -> there s a t T s.t s is a substring of t, similarly for |s|=k+1. Let S and T be two sets of strings of length at least k such that spk+1(S)=spk+1(T)and spk(S)=spk(T) and all these spectrums are without duplicates. Then, S = T If |s|>k, let p=max{i N|s[1...i] is a substring of some t T}. We ll denote t as that string. Note p>=k+1. We will now prove p=|s|. Assume by contradiction p<|s|. Observe the (k+1)-mer s[p+1-k, p+1]. 46

  47. First, for every string sS, we will show that there is tT such that s is a substring of t. Proof of correctness Note that any k and k+1 substring of s is in some t T due to the spectrum requirements Lemma 1: If |s|=k, s spk(S) -> s spk(T) -> there s a t T s.t s is a substring of t. Let S and T be two sets of strings of length at least k such that spk+1(S)=spk+1(T)and spk(S)=spk(T) and all these spectrums are without duplicates. Then, S = T If |s|>k, let p=max{i N|s[1...i] is a substring of some t T}. We ll denote t as that string. Note p>=k+1. We will now prove p=|s|. Assume by contradiction p<|s|. Observe the (k+1)-mer s[p+1-k, p+1]. Assume by contradiction t s. Repeat the above argument. ->s=t. So s T. So S T. symmetrically, T S. -> S=T 47

  48. Proof of correctness Lemma 2: spk(U)=K Reminder: U is the set all maximal unitigs, and K is the set of reads we receive as input 48

  49. Every k-mer in K is in a maximal unitig -> spk(U)K K is a set and therefore unique Proof of correctness Lemma 2: Maximal unitigs are disjoint -> spk(U) is unique (k-1)-Compactions preserve k-mers -> spk(U) K. spk(U)=K -> spk(U)=K Reminder: U is the set all maximal unitigs, and K is the set of reads we receive as input 49

  50. Proof of correctness Lemma 3: Define A to be the multiset of k+1-mers which can be formed by compacting 2 k-mers in K Ex. k=3 K={CAA,TAA,AAG,AGC,TTT} A:={TAAG,AAGC} Note: while A is a multiset, it cannot have duplicates since K cannot have duplicates 50

More Related Content