Small Memory Footprint URL Matching

scalable url matching with small memory footprint n.w
1 / 31
Embed
Share

Explore efficient URL matching techniques with small memory footprint for scalable networking solutions. Learn about URL compression, security device implementations, and ICN/NDN routing challenges in this research overview.

  • Networking
  • Security Devices
  • URL Matching
  • Compression
  • Scalability

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Scalable URL Matching with Small Memory Footprint David Hay With Anat Bremler-Barr, Daniel Krauthgamer, Shimrit Tzur David This research was supported by ERC starting grant 259805

  2. URL Matching URL Action work.com A1 net.orgnet.org A3 mrg.network.com A2 net.orgnet.org A3 work.org A1 Database of millions of URLs Table often stored within a data structure (trie, hash) 2

  3. URL Matching in ICN/NDN URL Action work.com Fwd #5 net.orgnet.org Fwd #7 mrg.network.com Drop net.orgnet.org Fwd #7 work.org Fwd #9 ICN/NDN routers use URLs in order to make forwarding decisions URL matching works on the datapath: Bump on the wire must work in the line rate Must use high performance memory (SRAM) Database footprint is the major bottleneck Misclassification implies forwarding to the wrong direction Mild case: routing over longer paths (and wasting network resources) Worst case: creating loops Two flavors of URL matching: Exact match on domain names: work.com matches only work.com Component-by-component longest prefix match: research.work.com matches research.work.com (if there is such entry) or work.com (otherwise)

  4. URL Matching in Security Devices URL Action work.com Alert net.orgnet.org Permit mrg.network.com Deny net.orgnet.org Permit work.org Permit Some NIDS have blacklists: Drop a connection with a URL in the list All others are permitted Some NIDS have whitelists: Permit a connection with a URL in the list All others are dropped Modern NIDS attach categories to each URL, to allow fine-grained and configurable protection Tens of categories URL can have more than one category Misclassification may block legitimate connections or allow illegitimate ones.

  5. Key idea: URL Compression Compressed URL URL Action Action 0100 work.com Fwd #5 Fwd #5 1110011100 net.orgnet.org Fwd #7 Fwd #7 101110100 mrg.network.com Drop Drop 10 bits 14 bytes 1110011100 net.orgnet.org Fwd #7 Fwd #7 01100 work.org Fwd #9 Fwd #9 Design Requirements: Accurate Lossless compression per URL Works with any data structure Compresses only the information Each URL is compressed by itself (stand-alone compression) Supports any URL matching flavors both exact match and longest prefix match High datapath performance 5

  6. Our URL Compression Framework Gzip (and similar compression techniques) are not suitable Each URL should be compressed by itself to allow fast datapath. Huffman coding works But we can do better. Observation: URLs contain repeated substrings of variable length (a.k.a. anchors): Exact match (domains): Anchor network.com appears in Cartoonnetwork.com, Moneynetwork.com Component by component (longest prefix match): Typical anchors are com , network , uk The data structure (e.g. tree) should maintain the semantics of / , . Key Idea: Encode (also) frequent anchors using Huffman code 6

  7. Agenda Offline Phase Experimental Results Online Phase (Datapath) 7

  8. Online Phase (Datapath) Compressed URL Action 0100 Fwd #5 net.orgnet.org 1110011100 Fwd #7 101110100 Drop Our Framework 1110011100 Fwd #7 01100 Fwd #9 8

  9. Online Phase: Zoom in comrgnetwork.com Huffman codes table com net org mrg network g o c m r n e t w k 00 11 010 101 0110 1001 01110 100011 100010 100001 100000 0111111 0111110 0111101 0111100 Deterministic Anchors selection and Huffman encoding 00 100001 1001 0101 Pattern Matching DFA Dictionary

  10. Dictionary: DFA n o r g t m c e Using DFA store the anchors with pointers to the Huffman coding Finding all possible anchors with one pass on the URL One memory reference per char Datapath should be fast! 10

  11. Anchor selection: multiple choices Huffman codes table c o m r g n e t com net org mrg network g o c m r n e t w k 00 11 010 101 0110 1001 01110 100011 100010 100001 100000 0111111 0111110 0111101 0111100 c - o - m r g - n e t com - r - g - n e t Different anchor selection Different code length 11

  12. Optimal Anchor Selection Algorithm One pass on the URL When processing the ith character of the URL, we select an anchor that: Is a suffix of the URL processed so far (the prefix of the URL of length i) If selected, minimizes the encoding of that prefix. Store the selected anchor A[i] and the resulting encoding length L[i], and move to the next character. Eligible anchors Encoding length if anchor ai is selected A[i]=argminai Si ui { }L[i-l(ai)]+h(ai) 12

  13. Optimal Anchor Selection: Example A[i]=argminai Si ui { }L[i-l(ai)]+h(ai) Huffman code table c o m r g n e t com net org mrg network g o c m r n e t w k 00 11 010 101 0110 1001 01110 100011 100010 100001 100000 0111111 0111110 0111101 0111100 L A 12 g 8 r 18 n 25 e 14 net 6 c 11 o 2 com com - r - g - n e t 13

  14. Agenda Offline Phase Experimental Results Online Phase (Datapath) 14

  15. Offline phase Huffman code table com net org mrg network g o c m r n e t w k 00 11 010 101 0110 1001 01110 100011 100010 100001 100000 0111111 0111110 0111101 0111100 URL list network.com 4 orgnet.org 5 work.org 2 gon.org/ctem 22 comrgnetwork.com 11 mrg.net/com 8 mrg.com.net 7 Offline algorithm Pattern Matching DFA 0110 00 4 01011 010 5 0111101011101000010111100 010 2 100101110100000 010 22 00100000110010110 00 1 1 101 11 00 8 101 00 11 7 Dictionary Compressed database 15

  16. Zoom in Offline algorithm Deterministic URL database compression and Huffman creation Find Heavy Hitters Anchors Optimal Anchors Selection 16

  17. Find Heavy Hitters Anchors Find substrings of variable length that appear frequently Two possible implementations: For n URLs with average URL size L Accurate Heavy Hitters: femtoZip library count all possible substrings and choose the heavy hitters: Time ?(? ? ???(? ?), Space ?(? ?) Approximate Heavy Hitters: space efficient! Geared Heavy Hitters algorithm to finding substrings of variable length [ANCS2014] The algorithm receives parameter k and returns all the anchors with frequency above ? ? : Time ?(? ?) , Space ?(? ?) 3? ? - irrelevant since recalculate use the real estimation. Estimation error of 17

  18. Optimal Anchor Selection Longer list: all possible frequent substrings, some would not be chosen A[i]=argminai Si ui { }L[i-l(ai)]+h(ai) Huffman code table com net org mrg network g o c m r n e t w k c o m e r g n 00 11 010 101 0110 1001 01110 100011 100010 100001 100000 0111111 0111110 0111101 0111100 t L A 12 g 8 r 18 n 25 e 14 net 6 c 11 o 2 com com - r - g - n e t 18

  19. Anchor Selection Challenge Longer list: all possible frequent substrings, some would not be chosen A[i]=argminai Si ui { }L[i-l(ai)]+h(ai) Huffman codes table com net org mrg network g o c m r n e t w k c o m e r g n ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? t L A Anchor frequency Huffman length com - r - g - n e t Anchor selection 19

  20. Anchor Selection Challenge A[i]=argminai Si ui Re-estimating the anchor frequency by { }L[i-l(ai)]+h(ai) Huffman codes table the anchor selection (new values for h(a)) com net org mrg network g o c m r n e t w k c o m e r g n ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? t L A Estimated Estimate of Huffman length according to current frequency h(a)=1 8log2 Anchor frequency Huffman length f(b) b A S f(a) Selecting an anchor a only if the gain in compression f(a)(l(a)-h(a)) > its footprint in the dictionary Anchor selection 20

  21. Deterministic URL Database Compression Huffman code table 1. Create Huffman code according to the re- estimated frequency and the selected anchors com net org mrg network g o c m r n e t w k 00 11 010 101 0110 1001 01110 100011 100010 100001 100000 0111111 0111110 0111101 0111100 Pattern Matching DFA 2. Create the DFA of selected anchors 3. Encode the compressed URL database using the optimal anchor selection algorithm with the Huffman code and selected anchors 0110 00 4 01011 010 5 0111101011101000010111100 010 2 100101110100000 010 22 00100000110010110 00 1 1 101 11 00 8 101 00 11 7 Compressed database

  22. Agenda Offline Phase Experimental Results Online Phase (Datapath) 22

  23. Implementation Github project: https://github.com/deepnesslab/urlmatching Usage: UrlMatchingTester [CMD] [-f urls_path] <options> <compression params> 1. testing CMD: test, ,testhash, article -f [String] urls filepath - REQUIRED -o [String] ouput filepath -a add header line to output_filepath -p [String] file path for dictionary , default: None, doesn't print dictionary -v Verify encoding by Decode - takes longer , default: no -b [Int] Take break time to measure program memory, Seconds, default: no -u [String] custom urls for lookup in testhash , default: use the original urls file , default: None , default: None 2. Building dictionary and encoding input file using existing dictionary CMD: build, encode All of the above flags and: -d [String] dicionary filepath -x 0 To skip Offline stage, this yield wrong compression ratio, default: 1 , default: "[urlsfile].dict" 3. compress file CMD: compress, extract -f [input file] -o [output file] <compression params> Compression params: ------------------ -l [char] Longest Prefix Match - split dictionary by /, default: false -k [Int] k-gram size default: 3 -r [Fload] consecutive k-gram ratio , default: 0.5 -n [Int] number of pattern anchors , default: 100 Debugging flags: --------------- -s dump Aho Corasik state machine -c [String] logger config file , , default: No , default: None 23

  24. Implementation Issues It is hard to allocate single bits A na ve programming yields x2 the amount of theoretical memory calculation: Slabs: Allocation of x bytes in the heap will get the smallest slab that contains x. Slabs are at fixed size of 2 magnitude (32,64,128 .. bytes) Memory alignment: Addresses of variables or records are either 4 or 8 bytes (for 32/64bit OS) padding padding padding 1 char Object 12 bytes Int Int 4 byte 4 byte 24

  25. Memory Improvement Design a linear allocator Pre-calculate total size of objects Allocate big chunks of memory Break it down to parts: anchors, buffers , etc. The difference from the real memory footprint to theoretical footprint is reduced from 100% to 20% 25

  26. Experiment URLBlackList.com daily generated list of around 2.2M URLs (57.3 MB) Memory of auxiliary data structure (DFA,Huffman encoding) is around 1%-3% Implementation in C Running on Intel Xenon E52690V3 2.6Ghz 16GB memory per core, L2 256 KB, L3 30 MB 26

  27. Compression of Methods Compression Method Compression Ratio femtoZip 0.57 Huffman Encoding only 0.59 Our Framework: Accurate Heavy Hitters 0.44 Our Framework: Approximate Heavy Hitters 0.43 27

  28. Evaluation: Datapath Throughput Vs. Compression ratio 0.52 350 URLs URLs Components Components 0.50 300 0.48 250 Compression Ratio Throuput (Mbps ) 0.46 200 0.44 150 0.42 100 0.40 50 0 5 10 15 0 5 10 15 Anchors (Thousands) Anchors (Thousands) ~5000 anchors are enough

  29. Datapath breakdown Analyze the time using two tools: Callgraph and Calgrind. Dominate by the DFA (finding the anchors) DFA 87% DFA 87% Deterministic URL compression 5%

  30. Conclusion URLs can be compressed well. 42% compression ratio Fast datapath of 200 Mbs with 1 ms latency Without any loss of accuracy Opens the door to work only with the compressed form Moving the compression to the end-hosts. Compressed URL is stored in the packet (instead of plain-text URL). Require sharing the dictionary (DFA+Huffman) between all the end-hosts Example: cloud solution for security gateway The client has a cache of recent queries. Upon a cache miss it goes to the cloud database. Client can send the query in its compressed form. With 42% compression ratio, we can store more than 2.5 the numbers of the URLs in the same space. 31

  31. Thank You 32

More Related Content