
Detection of Association-Based Clique Outliers in Heterogeneous Networks
Learn about association-based clique outliers in heterogeneous information networks and how they are detected using a conjunctive select query. Discover the applications and examples of ABC outliers, as well as the concept definitions and algorithms used in this detection process.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
On Detecting Association-Based Clique Outliers in Heterogeneous Information Networks Manish Gupta1, Jing Gao2, Xifeng Yan3, Hasan Cam4, Jiawei Han5 1Microsoft 2SUNY, Buffalo 3UCSB 4US-ARL 5UIUC ASONAM 2013 4/13/2025
Association-Based Clique Outliers Given a network, a conjunctive select query consists of (type, predicate) pairs Expected result is a clique with unranked results ABCOutliers: cliques ranked with respect to the unusual associations among entities Applications Discovering interesting relationships Data de-noising (removing incorrect data attributes or entity associations) Explaining the future behavior of objects participating in such associations 4/13/2025
ABCOutlier Examples CS Research (DBLP Network) Example Authors usually publish in conferences of their research area, and use terms related to their research area Query Outlier energy and Sustain -ability Research Area data computer networking author Author engineering conference Conference 3
Abstract ABCOutliers: Cliques containing rare and interesting associations between constituent entities Introduce a low-cost shared neighbors index to assist clique matching Outlierness of a clique is computed based on association outlierness of the attributes of nodes within the clique TopK algorithm to rank cliques Experiments on several synthetic datasets and a 10-type Wikipedia dataset 4/13/2025
Concept Definitions: A Network ? = ?,?,?1,?2 ?1:? where is set of entity types ?2 maps every node to attribute vector Each node has multiple attributes each with a different data type (??= (??1,??2, ,????) such that |??| = ??) Attributes could be numeric, categorical, sets of strings, time durations Edges are not typed for our work Person and movie; Edge could be actor, producer, director 4/13/2025
Concept Definitions Conjunctive Select Query (Q) of length L on a network ?1,?1, ?2,?2, , ??,?? ?? is defined only for type ?? Match C for set query Q ?1 1, ,??? Contains the same number and type of nodes as mentioned in the query All nodes in C are connected to each other in G and satisfy the predicates in Q Association-Based Clique Outlier Outlierness of a clique =f(outlierness of associations between the attributes of these nodes) Outlierness of association (?1,?2) for attributes ?1 and ?2 is high if ?1 and ?2 are frequent Co-occurrence of ?1 and ?2 is rare High correlation between ?1 and ?2 Association-Based Clique Outlier Detection Problem Given: a network G and a query Q Find: top K cliques that satisfy the query with highest outlier scores. Mahatma Gandhi Civil Rights Movement India South Africa ?? 1 ??= ??1 4/13/2025
Q=<(T1,P1), (T2,P2), , (TL,PL)> Q1=<(T1,P1)> Q2=<(T2,P2)> QL=<(TL,PL)> Matching Network G L L1 1 T1 L L2 2 T2 Candidate Computation by Matching TT T3 L LL L ?? ?, ,??? ?? ?, ,??? ?? ?> Cluster Computation for an Attribute ??=< ??? Score Computation for a Query Edge ?? ?> ??=< ??? TopK Quit? TopK Outlier Detection ABCOutliers
Candidate Computation by Matching Graph Indexing T1 Relational database: Attribute information associated with each of the vertices (entities) in G Memory: Connectivity information of the graph Shared neighbors index: For each entity, store the number of shared neighbors of each type, shared between the entity and its neighbors of a particular type TT T2 A B C A B C A B C A B C 1 0 0 1 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 2 5 8 A B C Network G 3 0 0 0 0 0 1 0 1 0 11 A 4 0 1 0 1 0 0 0 0 0 9 3 6 5 0 0 0 0 0 0 0 0 0 A B C 1 6 0 0 0 0 0 0 0 0 0 B B 7 0 0 1 0 0 0 1 0 0 10 4 7 C B A 8 0 0 1 0 0 0 1 0 0 9 0 2 1 2 0 0 1 0 0 10 0 0 0 0 0 1 0 2 0 ?(??2) 11 0 0 0 0 0 1 0 1 1 12 0 0 1 0 0 0 2 0 0 4/13/2025
Candidate Computation by Matching Candidate Filtering Given: L lists ?1,?2, ,?? Find: Cliques of size L such that each clique has a node from each list Start with size 1 cliques and grow them ???? is list of min size and has type ???? Prune ???? Prune the node if its typed neighbors cannot satisfy the requirements of the query Prune the node if its typed neighbors do not have enough shared neighbors 4/13/2025
Candidate Computation by Matching Generating Candidates Size 1 cliques: Elements of list ???? Grow length-l cliques to length-l+1 cliques Randomly choose next type t A node n of type t is added to length-l clique if it is connected to all nodes in clique Length-l clique is pruned off if it cannot grow Algorithm terminates when l=L 4/13/2025
Outlier Score Computation Scoring Attribute Value Pairs (1) Outlier score between values ?? and ?? should be high if Values ?? and ?? co-occur rarely Values ?? and ?? are individually frequent ??|co-occur freq(??,??) > ? freq(??) and ?? ?? ??|co-occur freq(??,??) > ? freq(??) and ?? ?? Computation for individual values may be noisy Compute clusters for every attribute KMeans for numbers, time durations Category label for categorical attributes Sets of strings: create network and then partition (METIS) Pakistan 0 ? 1 Hindi China Mandarin Mongolian Southern India 4/13/2025
Outlier Score Computation Scoring Attribute Value Pairs, Edges, Cliques Peakedness of Cluster Co-occurrence Curves ??????????(???,???) = max(0,1 ? (|??| 1)) Outlier Score of an Association ???? ??? ???? ??? ???? ???,??? Country ?????(??,??) = Hindi Peaked ??????????(???,???)+??????????(???,???) 2 1983 Hindi Speaking Countries Latitude Non-Peaked Languages in China ?? ?? ?=1 ?=1 ?????(??,??) ???? ????? ???? ? = ????? ?????? ? = ? ??????(???? ?) Mandarin Southern Others Mongolian India Nepal Others Pakistan 4/13/2025
Baselines EBC (Entity Based Clique Outlier Detection) Cluster attributes and find outliers ??????????? ?????? = ?????? ?????????????????(??????) CBA (Community Based Association Outlier Detection) Use community distributions as attributes Outlierness(edge)=KL-divergence between community distributions of end points 4/13/2025
Synthetic Dataset Results N #Attributes=4 #Attributes=6 #Attributes=10 (%) ABC EBC ABC 10000 2 95.4 75.4 5 96.7 72.3 97.6 10 95.6 73 97.8 20000 2 90.4 71.8 95.9 5 93.8 64 94.8 10 94.4 73.3 50000 2 91.5 71.2 96.2 5 94.1 69.2 97.5 10 96.4 72.6 97.7 N #Attributes=4 #Attributes=6 #Attributes=10 (%) ABC EBC ABC 10000 2 86.6 75.8 91.6 5 93.7 77.6 10 93.4 73.8 93.1 20000 2 96.9 72.7 94.6 5 97.3 75.1 94.4 10 97.4 74.8 96.7 50000 2 90.3 69.5 95.8 5 92.9 68.8 94.5 10 90.8 79.3 97.5 EBC 71.2 72.4 72.8 73.8 71.4 74.5 73.3 73.2 73.7 ABC 95 95.2 95.7 97.3 95.2 95.6 94.8 95.2 95.4 EBC 69.1 69.7 73.2 68.8 75.2 73.6 70.8 67.7 75.6 EBC 74.7 79.9 76.5 73 78.6 76.7 76.9 73.1 78 ABC 95.5 94.2 96 92.3 90.9 94.5 95.5 95.5 94.9 EBC 68 72 72.3 64.4 75.1 74.7 65.8 77.6 66.5 97 93 97 #Types = 5 #Types = 10 Min support = 1% ABC=Association Based Clique Outlier Detection EBC=Entity Based Clique Outlier Detection Variances: 2% and 3% for ABC and EBC resp Average #matches: 2136, 4252 and 10621 for N=10000, 20000 and 50000 resp 4/13/2025
Wikipedia Dataset Details Jan 2012 Wikipedia dump Entity e and e are connected if e appears on Wikipedia page of e and Both e and e have Wikipedia Infoboxes 10 entity types 45% of Wikipedia entities: 760K entities and 4.1M edges 28 attributes per entity Min support=1% 4/13/2025
Case Studies Query: (film, country= us ), (person, true), (settlement, true) (film="the road to el dorado", person="hernan cortes", settlement="seville") No. Type1 Attribute1 Type2 Attribute2 Value1 Value2 1 settlement subdivision_type3 film screenplay comarca ted elliott, terry rossio 2 settlement subdivision_type3 person birth_place comarca Castile 3 settlement coordinates_region film screenplay es ted elliott, terry rossio 4 settlement subdivision_type3 person death_date comarca 1485 5 settlement subdivision_type1 film studio autonomous community dreamworks animation, stardust pictures Query: (company, country= us"), (film, lang="english"), (person, birthplace= us"), (tv, true) (company="viacom", film="mission:impossible iii", person="tom cruise", tv="south park") No. Type1 Attribute1 Type2 Attribute2 Value1 Value2 alex kurtzman, roberto orci, j. j. abrams mtv networks, bet networks, paramount pictures corporation 1 film writers company divisions 2 television creator company #employees trey parker, matt stone 10900 mtv networks, bet networks, paramount pictures corporation mtv networks, bet networks, paramount pictures corporation 3 television #episodes company divisions 223 4 television network company divisions comedy central 5 person birth date company foundation 1962 1971 4/13/2025
Experiments ABCOutlier Method person CBA Method person haruko sugimura film settlement film settlement the road to el dorado drums along the mohawk what the bleep do we know!? hernan cortes seville, spain red beard tokyo, japan adam helmer german flatts, ny, us passenger 57 wesley snipes alfred hitchcock orlando, fl, us j. z. knight zacharias kunuk stephen fry yelm, wa, us psycho london, uk atanarjuat stardust igloolik, canada norwich, england skidoo pete the pup richmond, ny, us joie lee brooklyn, ny, us she s gotta have it #Nodes #Edges #Types Index Size (MB) Time (sec) 10K 100K 5 0.1 1 10K 100K 10 0.4 1.5 20K 200K 5 0.2 1.8 20K 200K 10 0.7 2.3 50K 500K 5 0.5 4.1 50K 500K 10 1.8 5.5 760K 4.1M 10 22 96.7 4/13/2025
Experiments 4/13/2025
Conclusions Motivated the idea of query-based outlier detection for heterogeneous information networks: ABCOutliers Proposed a methodology to compute outlierness of a clique based on association outlierness of the properties of nodes within the clique Experiments on several synthetic datasets and on Wikipedia entity network 4/13/2025
Thanks! 4/13/2025