Context-Aware Time Series Anomaly Detection in Complex Systems
This presentation motivates a switch to proactive maintenance and presents a framework for combining logs and time series data for anomaly detection in complex systems. It argues that jointly mining log and time series data yields more accurate and robust anomaly detection.
Presentation Transcript
Context-Aware Time Series Anomaly Detection for Complex Systems
Manish Gupta (UIUC); Abhishek B. Sharma, Haifeng Chen, and Guofei (Geoff) Jiang (NEC Labs America)
SDM4Service, 5/4/2013
Focus on Complex Systems
Many components interacting to accomplish challenging tasks:
- Data centers
- Power plants
- Manufacturing plants
Switch to Proactive Maintenance
1. Continuous monitoring from multiple vantage points.
2. Replace calendar-based or reactive maintenance with early detection, localization, and remediation.
Enabling technologies: low-cost, ubiquitous sensing and communication.
Challenge: how do we combine heterogeneous data?
- Unstructured or semi-structured log data
- Multivariate time series data
Importance of Collating Information
Only time series data:
- No context or global semantic view
- Many false positives
- Multiple alarms related to a single event
Only system logs:
- High-level application/workflow view
- Incomplete coverage
- Cost
- Lack of root-cause visibility
- Absence of observed system behavior
[Figure: normalized CPU and memory utilization versus time (s) during a task execution]
Our Vision
- Logs capture the context of a system's operations.
- Time series monitoring data record the state of different components.
- Hypothesis: jointly mining log and time series data for anomaly detection is more accurate and robust.
- Goal: context-aware time series anomaly detection.
Outline
1. Introduction and motivation
2. Framework for combining logs and time series data
3. Proposed solution
4. Instantiation details for Hadoop
5. Evaluation
6. Conclusion
Framework for Combining Logs and Time Series Data
[Figure: framework diagram combining log data and time series data]
What Is an Instance?
- An instance spans the interval between two consecutive context-changing events on a component.
- Assumption: we can identify context-changing events.
- Example: t1 = task execution starts; t2 = task execution finishes.
- Instance I = (C, M), where C holds the content (context) features and M the metrics/time series. A minimal data model appears below.
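As a concrete illustration, here is a minimal Python sketch of this data model; the dataclass layout and field names are illustrative assumptions, not from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Instance:
    """One instance: the interval between two consecutive context-changing events."""
    t_start: float       # e.g., task execution starts (t1)
    t_end: float         # e.g., task execution finishes (t2)
    context: np.ndarray  # C: context/content feature vector extracted from logs
    metrics: np.ndarray  # M: multivariate time series, shape (T, num_metrics)
```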
Problem Statement and Solution Approach
Given: instances I1, I2, ..., IL.
Find: the top-K anomalous instances.
Two-stage solution:
1. Find patterns: context patterns and metric patterns.
2. Find anomalies.
Two notions of similarity:
- Peer similarity: similarity in context variables across instances.
- Temporal similarity: similarity in time series data for similar contexts.
Proposed Solution
- Extraction of context patterns: normalize the data, then use K-means clustering to group instances into context clusters (C1, C2, C3, ...).
- Extraction of metric patterns: cluster the instances' time series into metric patterns (M1, M2, M3, M4, ...), using a similarity measure that decreases with the distance between two metric matrices.
- Anomaly detection: compare an instance's time series with the metric patterns of peer instances that share its context; instances that match are not anomalies, while deviating instances are flagged as anomalies (a code sketch follows below).
- Post-processing: remove an instance if its nearest context cluster is far away.
[Figure: worked example with per-instance counters and metrics (CPU, memory, disk read/write, eth0 TX/RX); matching instances are marked "Not an anomaly", deviating ones "Anomaly"]
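A minimal sketch of this two-stage pipeline in Python with scikit-learn. The function name, the choice to learn metric patterns separately within each context cluster, and the distance-to-nearest-pattern anomaly score are illustrative assumptions about how the stages fit together, not the paper's exact algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def detect_anomalies(contexts, metrics, n_ctx_clusters=3, n_metric_clusters=2, top_k=10):
    """Two-stage detection: context clusters first, then metric patterns.

    contexts: (N, d_c) array of per-instance context features from logs.
    metrics:  (N, d_m) array of per-instance metric features (e.g., a
              flattened or PCA-reduced summary of each time series).
    Returns the indices of the top_k most anomalous instances.
    """
    # Stage 1: context patterns via K-means on normalized context features.
    ctx_norm = StandardScaler().fit_transform(contexts)
    ctx_labels = KMeans(n_clusters=n_ctx_clusters, n_init=10).fit_predict(ctx_norm)

    # Stage 2: within each context cluster, learn metric patterns and score
    # each instance by its distance to the nearest pattern (peers with a
    # similar context should show similar time series behavior).
    scores = np.zeros(len(metrics))
    for c in range(n_ctx_clusters):
        idx = np.where(ctx_labels == c)[0]
        km = KMeans(n_clusters=min(n_metric_clusters, len(idx)), n_init=10)
        km.fit(metrics[idx])
        scores[idx] = np.min(km.transform(metrics[idx]), axis=1)

    return np.argsort(scores)[::-1][:top_k]
```

The slide's post-processing step could be layered on top by also thresholding each instance's distance to its nearest context-cluster center in `ctx_norm` space.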
Instantiating the Framework for MapReduce (Hadoop)
- MapReduce programming model; distributed block storage (e.g., HDFS).
- Two phases of computation: Map and Reduce.
- Example: count the frequency of all words appearing in a document. Four Map tasks process the blocks "A B C", "B C D", "E F G", and "A B D", each emitting intermediate counts (e.g., A: 1, B: 1, C: 1); the Reduce phase merges them into the final output A: 2, B: 3, C: 2, D: 2, E: 1, F: 1, G: 1. A sketch of this word count appears below.
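The slide's example is easy to mirror in a single-process Python sketch; real Hadoop jobs implement Mapper and Reducer classes (typically in Java), so this only illustrates the data flow:

```python
from collections import Counter
from itertools import chain

def map_phase(block):
    """Map task: emit a (word, 1) pair for every word in one input block."""
    return [(word, 1) for word in block.split()]

def reduce_phase(pairs):
    """Reduce task: sum the intermediate counts for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

blocks = ["A B C", "B C D", "E F G", "A B D"]  # the slide's example input
intermediate = chain.from_iterable(map_phase(b) for b in blocks)
print(reduce_phase(intermediate))
# {'A': 2, 'B': 3, 'C': 2, 'D': 2, 'E': 1, 'F': 1, 'G': 1}
```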
Hadoop is an open-source implementation of the MapReduce runtime. Its Map and Reduce phases exhibit both peer and temporal similarity.
Discussion
- Selecting the number of principal components: capture >95% of the variance for both time series (see the helper below).
- Selecting the number of context/metric clusters: use the knee point of the within-cluster sum of squares versus number-of-clusters curve.
- Richer context for MapReduce: job configuration parameters, plus events extracted from logs using regex pattern matches.
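A small Python helper showing the first selection rule; the function name is hypothetical, and it simply returns the smallest number of components whose cumulative explained variance exceeds the slide's 95% threshold:

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio exceeds the threshold (>95% on the slide)."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)
```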
Evaluation
1. Synthetic datasets
- The context part comes from real Hadoop runs; the metrics part is synthetically generated.
- Hadoop cluster: one master + 5 slaves.
- Workload: standard Hadoop examples (sorting, counting word frequencies, etc.); 3 context clusters.
2. Real Hadoop runs with injected faults: CPU hog and disk hog.
Synthetic Data: Context Clusters for Hadoop Examples
[Figure: normalized Hadoop job counter values (#Maps, #Reduces, HDFS bytes read, records written, reduce shuffle bytes, CPU milliseconds, memory bytes, etc.) for the three context clusters]
- Cluster 1: a large number of Map tasks and high values for Map counters.
- Cluster 2: instances with a few Map and a few Reduce tasks.
- Cluster 3: instances with a large number of Reduce tasks and high values for Reduce counters.
Injecting Anomalies in the Synthetic Dataset
- Fix an anomaly factor and randomly select a corresponding number of instances into a set R.
- For each instance in R, add either a swap anomaly or a new anomaly (a code sketch follows below).
- Swap anomaly: swap the metrics part with that of another randomly chosen instance.
- New anomaly: replace the metrics time series part with a new random matrix.
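A hedged Python sketch of the injection procedure; the fraction-based choice of |R| and the 50/50 split between the two anomaly types are assumptions, since the slide's exact selection expression is garbled:

```python
import numpy as np

def inject_anomalies(metrics, anomaly_fraction=0.1, rng=None):
    """Corrupt a random subset R of instances with swap- or new-anomalies.

    metrics: list of (T, d) arrays, one time series matrix per instance.
    """
    rng = rng or np.random.default_rng()
    n = len(metrics)
    R = rng.choice(n, size=max(1, int(anomaly_fraction * n)), replace=False)
    for i in R:
        if rng.random() < 0.5:
            # Swap anomaly: exchange metrics with another random instance.
            j = i
            while j == i:
                j = int(rng.integers(n))
            metrics[i], metrics[j] = metrics[j], metrics[i]
        else:
            # New anomaly: replace the time series with a random matrix.
            metrics[i] = rng.standard_normal(metrics[i].shape)
    return metrics, R
```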
Synthetic Dataset Results
- 20 experiments per setting.
- Average standard deviations: 3.34% for CA, 7.06% for SI, and 4.58% for NC.
[Figure: detection accuracy for CA, SI (1%), and NC (28%)]
Results on Real Hadoop Runs with Injected Faults
[Figures: CPU utilization over time (s) marking the disk-hog and CPU-hog fault windows, and per-instance anomaly scores by instance number, shaded by metric cluster (0, 1, 2)]
- Original number of anomalies: 7 (disk hog), 4 (CPU hog).
- Detected anomalies: disk hog, 4 in the top 5 and all 7 in the top 10; CPU hog, 3 in the top 5 and all 4 in the top 10.
Conclusion and Future Work
- Proactive maintenance is more effective when we combine information from heterogeneous sources: system logs and time series measurements.
- We proposed a clustering-based approach for finding context patterns in log data and metric patterns in time series data, and used these patterns for anomaly detection.
Future directions:
- How should context and instances be defined in other settings?
- Define anomalies based on transitions in context and the expected change in metrics.
Appendix
Running Time
[Figure: execution time (sec) for metric pattern discovery versus the number of instances N (500-5000), for 5, 10, and 20 metrics]
- The algorithm is linear in the number of instances.
- Time spent in anomaly detection: ~188 ms.
Real Datasets
- Workload: multiple runs of RandomWriter and Sort. RandomWriter (16 Maps) writes 1 GB of data in 64 MB chunks, and Sort (16 Maps and 16 Reduces) sorts that data.
- Anomalies are injected on one machine (sketched below):
- CPU hog: an infinite loop.
- Disk hog: sequential writes to a file on disk.
- Total instances: 134 (disk hog) and 121 (CPU hog).
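The two fault injectors can be approximated in a few lines of Python; the file path, duration, and chunk size below are illustrative assumptions (the slide's CPU hog is an infinite loop; this sketch bounds it so it terminates):

```python
import os
import time

def cpu_hog(duration_s=60):
    """CPU hog fault: busy-loop for duration_s seconds."""
    end = time.time() + duration_s
    while time.time() < end:
        pass  # burn CPU cycles

def disk_hog(path="/tmp/hog.dat", duration_s=60, chunk_mb=64):
    """Disk hog fault: sequentially write chunks to a file on disk."""
    end = time.time() + duration_s
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    with open(path, "wb") as f:
        while time.time() < end:
            f.write(chunk)
            f.flush()
```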
Context Clusters for the RandomWriter+Sort Dataset
[Figure: normalized Hadoop counter values (#Maps, #Reduces, HDFS bytes written, reduce shuffle bytes, map output records, CPU milliseconds, memory bytes, etc.) for the three context clusters]
- Cluster 1 consists of a mix of Maps and Reduces and shows a distinctly high number of HDFS bytes written.
- Cluster 2 is Map-heavy and shows a large number of Map output records.
- Cluster 3 is Reduce-heavy and hence shows large activity in the Reduce counters.
Metric Patterns
[Figure: discovered metric patterns]