Aho-Corasick String Matching on Shared and Distributed-Memory Parallel Architectures

String matching algorithms are key components of DNA and protein sequence analysis, data mining, security systems, and more. The Aho-Corasick algorithm has been implemented efficiently on a variety of architectures, including FPGAs, the Cray XMT, multicore and heterogeneous processors, and GPUs. Hardware and software challenges, performance variability, and unknown inputs are discussed, with a focus on high-performance systems.

  • String matching algorithm
  • Parallel architectures
  • High performance systems
  • Aho-Corasick
  • Distributed-memory


Presentation Transcript


  1. Aho-Corasick String Matching on Shared and Distributed-Memory Parallel Architectures

  2. Authors. Antonino Tumeo: research scientist at Pacific Northwest National Laboratory (PNNL); research interests: modeling and simulation of high-performance architectures, hardware-software codesign, FPGA prototyping, GPGPU computing. Daniel G. Chavarría-Miranda: Pacific Northwest National Laboratory; programming models, compilers, and languages for large-scale HPC systems; a member of the ACM since 1997.

  3. INTRODUCTION. String matching algorithms are used as key components of DNA and protein sequence analysis, data mining, security systems (network intrusion detection systems (NIDS), application (APIDS) and protocol (PIDS) intrusion detection systems, antivirus software), and machine learning problems.

  4. INTRODUCTION. Large quantities of textual data demand high performance. The Aho-Corasick algorithm performs multi-pattern matching in time linear in the input size. Efficient implementations exist for FPGAs, the Cray XMT (highly multithreaded solutions), multicore processors, heterogeneous processors, and Graphics Processing Units (GPUs). Previous work focused on speed; now the size of the input and the number of patterns also matter.

  5. INTRODUCTION. Hardware approaches support only small dictionaries, lack memory, and are difficult to customize. Software approaches require significant effort to divide the input data set and to map the algorithm efficiently; performance variability depends on the input and pattern sizes and on whether the architecture is cache-based. Unknown inputs cannot be preprocessed, which leads to highly variable performance.

  6. INTRODUCTION. We present and compare several software-based implementations of the AC algorithm for high-performance systems, focusing on: unknown inputs, support for large dictionaries (up to 190,000 patterns), high performance, flexibility and customization, and limited performance variability.

  7. INTRODUCTION. Implementations. Shared memory: a Cray XMT with up to 128 processors (128 threads per processor) and 1 TB of shared memory; a dual-socket Niagara 2 (eight cores per processor, eight threads per core); a dual-socket Intel Xeon 5560 (Nehalem architecture, four cores per processor, two threads per core). Distributed memory: a homogeneous cluster of Xeon 5560 processors (10 nodes, two processors per node) interconnected through Infiniband QDR, and a heterogeneous cluster with NVIDIA Tesla C1060 GPUs (10 nodes, two GPUs per node). Nodes may thus contain homogeneous or heterogeneous processing elements.

  8. Cray XMT. A shared-memory multithreaded machine developed by Cray around the Cray Seastar-2.2 high-speed interconnect. It can scale up to 8,192 Threadstorm processors and 128 TB of shared memory. Peak memory bandwidth tops out at around 100M requests per second, and network bandwidth ranges from 90M down to 30M requests per second depending on the number of nodes. The system runs a custom multithreaded operating system (MTX) and provides a parallelizing C/C++ cross-compiler.

  9. Hardware. Niagara 2: processors at 1,165 MHz (128 threads in total) sharing 32 GB of memory. Nehalem: x86. CUDA-based GPUs: in CUDA, a kernel is launched as a grid of up to 65,535 thread blocks, and each thread block is composed of up to 512 threads. Tesla T10 GPU: 30 streaming multiprocessors (240 streaming processors) clocked at 1.3 GHz, with 4 GB of GDDR3 memory at 800 MHz.

  10. INTRODUCTION. Outline. Section 2: algorithm design on the various machines. Section 3: experimental results and a comparison among all the machines. Section 4: conclusions.

  11. 2 ALGORITHM DESIGN and OPTIMIZATION. Design cornerstones: minimize the number of memory references and reduce memory contention. For each possible input symbol there is always a valid transition to another node in the automaton, so every input symbol requires the same amount of work. For a given dictionary, the data structures in main memory (DFA and input symbols) are read-only. All implementations use multiple threads or processes.

  12. 2 ALGORITHM DESIGN and OPTIMIZATION. Each thread or process has its own current_node and operates on a distinct section of the input. Shared-memory threads and threads running on a GPU access the same DFA; different MPI (Message Passing Interface) processes or different GPUs access their own copy of the DFA. The input stream is buffered, split, and assigned.
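
A minimal sketch (not the paper's exact code) of the per-thread inner loop, assuming one 256-entry STT line per DFA node and a separate is_final flag array as an illustrative way to detect matches (the real engines encode this differently, e.g., a flag bit inside the cell):

    // One table lookup per input symbol; current_node is private to the thread.
    #include <stdint.h>
    #include <stddef.h>

    // stt[node * 256 + symbol] holds the next node for that symbol.
    static size_t match_chunk(const uint64_t *stt, const uint8_t *is_final,
                              const uint8_t *chunk, size_t len) {
        uint64_t current_node = 0;                 // start at the DFA root
        size_t matches = 0;
        for (size_t i = 0; i < len; ++i) {
            // always a valid transition: the same amount of work for every symbol
            current_node = stt[current_node * 256 + chunk[i]];
            matches += is_final[current_node];
        }
        return matches;
    }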

  13. 2 ALGORITHM DESIGN and OPTIMIZATION. For example, with MPI + CUDA each GPU receives a buffer and partitions it among its own threads; with MPI + pthreads each process receives a buffer that is partitioned among the threads executed by the different cores.

  14. 2 ALGORITHM DESIGN and OPTIMIZATION. Overlap: a pattern may cross a chunk boundary, so adjacent chunks overlap by (length of the longest pattern - 1) symbols. Inefficiency = (longest pattern - 1) / (size of the chunk).
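
For instance, with the 2 KB chunks and 16-byte longest pattern used later in the experiments, each chunk carries 16 - 1 = 15 extra symbols, so the overhead is 15 / 2,048, roughly 0.7% of redundant work per chunk.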

  15. State Transition Table (STT). Rows correspond to nodes in the DFA and columns to symbols of the alphabet: one STT line represents a node of the DFA, and each cell in a line stores the next_node. On the Cray XMT and with pthreads, STT lines are 256-byte aligned. With 90,000 text patterns of average length 16 bytes and the 256-character ASCII alphabet, the STT size is 9.8 GB.
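
A sketch of how such a table could be allocated and indexed on the CPU platforms, assuming the 8-byte cells and 256-byte line alignment described on these slides (the helper name is illustrative):

    // One 256-cell line per DFA node, 8 bytes per cell (2,048 bytes per line),
    // allocated on a 256-byte boundary.
    #include <stdint.h>
    #include <stdlib.h>

    static uint64_t *alloc_stt(size_t num_nodes) {
        void *p = NULL;
        if (posix_memalign(&p, 256, num_nodes * 256 * sizeof(uint64_t)) != 0)
            return NULL;
        return (uint64_t *)p;
    }
    // Lookup during matching: next_node = stt[node * 256 + symbol];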

  16. State Transition Table (STT) (figure).

  17. 2.1 Cray XMT Implementation. Reducing latency variability: the XMT can schedule enough threads to hide constant or slowly varying latencies, so the remaining variability is due to hotspots, i.e., memory regions frequently accessed by multiple threads simultaneously. The machine employs a hardware hashing mechanism, but the pressure on memory is still not balanced: blocks residing on different memory banks have different access ratios, which leads to variability in access time.

  18. 2.1 Cray XMT Implementation. Reasons: (1) threads access only a subset of each line. Each cell is 8 bytes, so for the 256-symbol ASCII alphabet each STT line needs 8 x 256 = 2,048 bytes, or 32 blocks (64 bytes per block). When the input contains only English words or decimal numbers, the input symbols form a subset of the ASCII alphabet, so concurrent threads access only that subset. (2) The first levels of the DFA are responsible for the majority of the accesses and form hotspots; only inputs similar to the dictionary avoid this problem.

  19. 2.1 Cray XMT Implementation. Solutions: (1) Alphabet shuffling ensures that contiguous symbols of the alphabet are spread out over multiple memory blocks: symbol = (symbol x fixed_offset) >> 8. (2) State replication: when the STT is created, the addresses of the different replicas of the same logical state (STT line) are stored at random in the STT cells pointing to that state, which keeps the memory pressure balanced. These techniques do not bring significant benefits on cache-based systems.
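
As a rough illustration of alphabet shuffling (not the authors' exact code), the sketch below remaps each symbol with a multiplicative permutation so that formerly adjacent symbols land in different 64-byte blocks of an STT line; the slide's (symbol x fixed_offset) >> 8 formula is approximated here by a modulo-256 multiplication with an odd constant, which guarantees a one-to-one mapping:

    // FIXED_OFFSET is an arbitrary odd constant; (symbol * odd) mod 256 is a
    // bijection on 0..255, so no two symbols collide, and neighboring symbols
    // end up 167 columns apart, i.e., in different 64-byte blocks.
    #include <stdint.h>

    #define FIXED_OFFSET 167u

    static inline uint8_t shuffle_symbol(uint8_t symbol) {
        return (uint8_t)((symbol * FIXED_OFFSET) & 0xFFu);
    }
    // The STT columns are built in shuffled order, and every lookup uses
    // stt[node * 256 + shuffle_symbol(input[i])].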

  20. Some GPU concepts. Hardware: SP, streaming processor (or CUDA core); SM, streaming multiprocessor. CUDA (software): a CUDA parallel program is executed by many threads; several threads are grouped into a block, and threads in the same block can synchronize or communicate via shared memory; multiple blocks make up the grid; the warp size is currently 32, and the threads of a warp execute the same instruction on different data. Multiple SMs are combined in a Texture Processor Cluster (TPC), which provides a Texture Unit and a Texture Cache.

  21. 2.2 GPU Implementation. Each CUDA thread processes a chunk of the input text. Point 1: the size of the chunk determines how well each thread is utilized. STT: each cell (32 bits) stores the next state in its first 31 bits and a final-state flag in the remaining bit, which removes the requirement for 256-byte alignment of the lines. Hotspots may still occur, caused by partition camping (saturation of a single memory controller).
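
A small sketch of this 32-bit cell packing (helper names are illustrative):

    // Bits 0..30 hold the next-state index; bit 31 marks a final (matching)
    // state, so a single 32-bit load retrieves both pieces of information.
    #include <stdint.h>

    static inline uint32_t pack_cell(uint32_t next_state, int is_final) {
        return (next_state & 0x7FFFFFFFu) | (is_final ? 0x80000000u : 0u);
    }
    static inline uint32_t cell_next_state(uint32_t cell) { return cell & 0x7FFFFFFFu; }
    static inline uint32_t cell_is_final(uint32_t cell)   { return cell >> 31; }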

  22. 2.2 GPU Implementation. We added padding to each line of the STT corresponding to the size of a partition, generating a more even access pattern, and we bind the STT to texture memory (the texture cache has an active set of around 8 KB). There is a limit on the size of textures in CUDA: when bound to linear memory, they can address up to 2^27 elements, so when big STTs are used we bind them to texture only partially.
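
A sketch of the partial texture binding, using the legacy CUDA 3.x-era texture reference API (names are illustrative; this API has since been removed from CUDA):

    // Bind at most 2^27 STT cells to a 1D linear texture; cells beyond that
    // bound are read directly from global memory in the kernel.
    #include <stdint.h>
    #include <cuda_runtime.h>

    texture<uint32_t, 1, cudaReadModeElementType> tex_stt;   // file-scope texture reference

    static size_t bind_stt(const uint32_t *dev_stt, size_t num_cells) {
        size_t bound = num_cells < ((size_t)1 << 27) ? num_cells : ((size_t)1 << 27);
        cudaBindTexture(NULL, tex_stt, dev_stt, bound * sizeof(uint32_t));
        return bound;
    }
    // In the kernel: cell = (idx < bound) ? tex1Dfetch(tex_stt, (int)idx) : stt[idx];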

  23. 2.2 GPU Implementation. Using shared memory is not effective for large dictionaries: the maximum block size is 512 threads, and only 8 STT lines (4 bytes x 256 symbols = 1,024 bytes each) can be stored in the 16 KB of shared memory, while for a 20,000-pattern dictionary the first two DFA levels alone already contain 54 lines. Point 2: the input text is buffered in host memory and organized as a matrix in which each chunk corresponds to a row.

  24. 2.2 GPU Implementation (figure).

  25. 2.2 GPU Implementation. For the Tesla C1060 GPU (T10 architecture), the optimal load size is 4 symbols per thread: a half-warp is 16 threads (minimum transaction 16 x 16 bits / 8 = 32 bytes), and the optimal transaction size is 64 bytes. The input is therefore transposed in blocks of 4 symbols (see Figure 1(b)). When the number of chunks is not an integer multiple of the number of threads in a half-warp (16), padding is added, transparently to the rest of the system.
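
A minimal kernel sketch (not the authors' exact code) combining the transposed input layout with the packed STT cells: thread t reads word w at input_t[w * num_chunks + t], so a half-warp of 16 threads touches 16 consecutive 32-bit words (64 bytes). For brevity it only counts matches per chunk instead of recording them:

    #include <stdint.h>

    __global__ void ac_match_chunks(const uint32_t *stt,      // packed STT cells
                                    const uint32_t *input_t,  // input transposed in 4-byte groups
                                    uint32_t words_per_chunk,
                                    uint32_t num_chunks,
                                    uint32_t *match_counts) {
        uint32_t chunk = blockIdx.x * blockDim.x + threadIdx.x;
        if (chunk >= num_chunks) return;
        uint32_t state = 0, matches = 0;
        for (uint32_t w = 0; w < words_per_chunk; ++w) {
            uint32_t word = input_t[w * num_chunks + chunk];    // coalesced across the half-warp
            for (int b = 0; b < 4; ++b) {                       // 4 symbols per 32-bit word
                uint32_t sym  = (word >> (8 * b)) & 0xFFu;
                uint32_t cell = stt[(size_t)state * 256 + sym]; // tex1Dfetch in the texture-bound version
                state    = cell & 0x7FFFFFFFu;                  // low 31 bits: next state
                matches += cell >> 31;                          // top bit: final-state flag
            }
        }
        match_counts[chunk] = matches;
    }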

  26. 2.2 GPU Implementation. Match results are collected for each chunk and sent back from the GPU to the host.

  27. 2.3 Distributed-Memory Implementation. The matching engine is wrapped in an MPI load balancer with a master/slave scheduler: the master process distributes the work, and the slave processes perform the effective computation. On top of the same scheduler there are 3 different implementations for clustered architectures.

  28. 2.3 Distributed-Memory Implementation. MPI-only: for the homogeneous x86 cluster, each slave process is mapped to a core and wraps a single-threaded version of the string matching engine. MPI with pthreads: each slave process is mapped to an entire node of the cluster and wraps a multithreaded version of the string matching engine. MPI with CUDA: each slave process is mapped to a GPU and wraps a CUDA kernel.

  29. MPI configuration. Homogeneous configuration: 10 nodes connected through 4x Infiniband links with a peak rate of 40 Gbps. Heterogeneous configuration: 5 NVIDIA Tesla S1070 boxes (a Tesla S1070 is equivalent to 4 Tesla C1060s with 4 GB of GDDR3 memory each); 2 GPUs are connected to each node through PCI Express with a peak bandwidth of 8 GB/s, for a total of 20 GPUs.

  30. 2.3 Distributed-Memory Implementation. Load balancer: a multibuffering scheme. Inputs with few matches require few memory accesses; inputs with many matches cause many cache misses and run slowly. The input should therefore not be divided statically into chunks, because the slowest process would determine the overall performance. Instead, we set a relatively small fixed buffer size for the slave processes and let the master send new data to a slave as soon as the previous data have been consumed.

  31. 2.3 Distributed-Memory Implementation. Communication bandwidth may become a bottleneck, so communication is overlapped with computation using nonblocking communication and multiple buffers.
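
A minimal sketch (not the authors' exact protocol) of the slave side of such a scheme, with two buffers and nonblocking receives so that the transfer of the next block overlaps with matching on the current one; the names, tags, and zero-byte termination message are illustrative assumptions:

    #include <mpi.h>
    #include <stdint.h>
    #include <stdlib.h>

    static void slave_loop(int master, int block_size) {
        uint8_t *buf[2] = { (uint8_t *)malloc(block_size), (uint8_t *)malloc(block_size) };
        MPI_Request req[2];
        MPI_Irecv(buf[0], block_size, MPI_BYTE, master, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(buf[1], block_size, MPI_BYTE, master, 0, MPI_COMM_WORLD, &req[1]);
        int cur = 0;
        for (;; cur ^= 1) {
            MPI_Status st;
            int bytes;
            MPI_Wait(&req[cur], &st);                 // wait only for the current block
            MPI_Get_count(&st, MPI_BYTE, &bytes);
            if (bytes == 0) break;                    // assumed convention: empty message ends the stream
            // ... run the string matching engine on buf[cur] (pthreads or CUDA) ...
            // ... send the match results back to the master ...
            // re-post this buffer so the master can already stream the block after next
            MPI_Irecv(buf[cur], block_size, MPI_BYTE, master, 0, MPI_COMM_WORLD, &req[cur]);
        }
        MPI_Cancel(&req[cur ^ 1]);                    // the other receive is still pending
        MPI_Wait(&req[cur ^ 1], MPI_STATUS_IGNORE);
        free(buf[0]); free(buf[1]);
    }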

  32. 3 EXPERIMENTAL RESULTS. Platforms: Cray XMT, x86 SMP, Niagara 2, GPU cluster, x86 cluster. There are two phases (steps): the STT build phase, which is performed offline, and the string matching phase; we focus on the string matching phase.

  33. 3 EXPERIMENTAL RESULTS. Four dictionaries. Dictionary 1: a data set of about 190,000 patterns with mostly text entries and an average length of 16 bytes. Dictionary 2: a data set of about 190,000 patterns with mixed text and binary entries and an average length of 16 bytes. English: a 20,000-pattern data set with the most common words of the English language and an average length of 8.5 bytes. Random: a 50,000-pattern data set with entries generated at random from the ASCII alphabet with a uniform distribution and an average length of 8 bytes.

  34. 3 EXPERIMENTAL RESULTS. Four kinds of input. Text: the English text of the King James Bible. TCP: captured TCP/IP traffic. Random: a random sample of characters from the ASCII alphabet. Itself: the dictionary itself, fed as the input stream for string matching.

  35. 3 EXPERIMENTAL RESULTS. Machines. Niagara 2: two Niagara 2 processors (1.165 GHz), 32 GB of memory. XMT: 128 nodes with a total of 1 TB of memory, Seastar2 interconnect. x86 SMP: two Xeon 5560 processors (2.8 GHz), 24 GB of memory. x86 cluster: 10 nodes, each configured like the x86 SMP, interconnected with Infiniband QDR (24 Gbps). GPU cluster: 10 nodes with two Tesla C1060 (4 GB of memory) each.

  36. 3 EXPERIMENTAL RESULTS. Input: streamed from a single source and buffered in memory before being processed; we use a 100 MB input buffer, kept in shared memory on the master node for MPI. For pthreads the input is divided equally among the threads; for the GPU and the XMT each thread processes a fixed-size chunk and large numbers of threads are created. A chunk size of 2 KB works best: for this size the overlap is limited to 0.7% when the longest pattern is 16 bytes.

  37. 3 EXPERIMENTAL RESULTS (figure).

  38. 3 EXPERIMENTAL RESULTS. Scaling of the optimized implementation on the Cray XMT.

  39. 3 EXPERIMENTAL RESULTS. Comparison with the other shared-memory architectures: x86 SMP and Niagara 2.

  40. 3 EXPERIMENTAL RESULTS (figure).

  41. 3 EXPERIMENTAL RESULTS. Performance is stable for the light- and medium-matching cases.

  42. 3 EXPERIMENTAL RESULTS. There is high variability in performance between the low- and heavy-matching cases, and when the number of threads outgrows the number of cores the scaling is progressively reduced.

  43. 3 EXPERIMENTAL RESULTS. We compare the results of the XMT, the x86 SMP, and the Niagara 2 (all shared-memory platforms) with a single Tesla C1060 running CUDA 3.0.

  44. 3 EXPERIMENTAL RESULTS (figure).

  45. 3 EXPERIMENTAL RESULTS. The GPU performs well when the dictionary is small. The GPU implementation's worst performance is 95 percent slower than its best, the x86 SMP's is 92.2 percent slower than its best, and the Niagara 2's is 90.3 percent slower than its best. The XMT is stable and faster than the others, with much less variability (about 12%).

  46. 3 EXPERIMENTAL RESULTS. Homogeneous cluster: for Dictionary 1 and Dictionary 2 the 24 GB of memory on each node are consumed quickly; for the small dictionaries, even though the number of threads can be increased, the replication of the STT increases the memory traffic.

  47. 3 EXPERIMENTAL RESULTS. The MPI-only implementation on the x86 SMP is equivalent to a single node of the homogeneous cluster.

  48. 3 EXPERIMENTAL RESULTS. On the GPU cluster, each node (2 GPUs) runs two MPI processes.

  49. 3 EXPERIMENTAL RESULTS. On the MPI clusters, the master MPI process streams the input text to the other nodes in blocks of 3 MB each. The maximum transfer performance of 3 GB/s (24 Gbps) is reached with buffers over 1 MB, and the PCI-Express 2.0 host-to-device and device-to-host bandwidth is adequate. With an adequate number of nodes, even the heavy-matching benchmarks reach the same performance as the light-matching benchmarks.

  50. 3 EXPERIMENTAL RESULTS. The x86 cluster performs better than the GPU cluster, because data blocks delivered to the GPUs go through an additional hop on the PCI-Express bus.
