Repairing Vertex Labels under Neighborhood Constraints

Repairing Vertex Labels under Neighborhood Constraints
Slide Note
Embed
Share

Repairing Vertex Labels under Neighborhood Constraints is a study focusing on addressing inconsistencies in data by enforcing constraints on vertex labels within a network. The approach combines a simple greedy heuristic with a contraction-based technique to enhance data accuracy and integrity, leading to improved data quality and reliability.

  • Data Quality
  • Vertex Labels
  • Neighborhood Constraints
  • Data Integrity
  • Data Accuracy

Uploaded on Mar 09, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Repairing Vertex Labels under Neighborhood Constraints Shaoxu Song (Tsinghua), Hong Cheng (CUHK), Jeffrey XuYu(CUHK), Lei Chen (HKUST)

  2. Outline Motivation examples A simple greedy heuristic How it fails Contraction-based technique Pros and cons A hybrid approach Put the beauty of greedy and contraction together Experimental results

  3. Example: Similarity Networks A constraint states that if two tuples have similar Address values (with distance in the range of [0, 6]) and the same City values (with distance in [0, 0]) then they should have similarTel values (distance in [0, 5]) by sharing the same area code and another two digits for the same region ID Address City Tel 1 No.721, West Lake St. HZ 0571-624-8209 2 No.735, West Lake St. HZ 0571-625-7241 3 No.735, West Lake Street HZ ****-***-7241 4 640, West Lake Street HZ 0571-624-6317

  4. Example: Similarity Networks Constraint graph: similarity on Tel ( then part of the constraint) each node is a distinct value of Tel domain an edge between two nodes if their distance is within [0, 5] Instance graph: similarities on Address and City ( if part in the constraint) of tuples each vertex is a tuple, with its Tel value as the label an edge indicates that the Address and City values of these two tuples have distance within [0, 6] and [0, 0] 1 a a :0571-624-8209 :0571-624-8209 violation relabel by b to repair 2 3 :0571-625-7241 b b d :****-***-7241 :0571-625-7241 d Model the similarity relationships by graphs :****-***-7241 violation 4 c c :0571-624-6317 :0571-624-6317 (a) constraint (b) instance

  5. 1 a a :0571-624-8209 :0571-624-8209 violation Inconsistencies relabel by b to repair 2 3 :0571-625-7241 b b d :****-***-7241 :0571-625-7241 d :****-***-7241 violation 4 c c :0571-624-6317 :0571-624-6317 (a) constraint (b) instance To validate whether a relation satisfies the constraint equivalent to investigate whether the corresponding instance graph satisfies the constraint graph for each tuple pair with similar Address and same City (having an edge in the instance graph) their Tel values must be similar as well (the pair of labels belong to an edge in the constraint) Violations exist Since data are collected from sources with various representation formats and dirty information

  6. Minimum Change Principle The repairing target is to return a modified result that minimally differs from the original data widely adopted in improving data quality, under the rationale that people try to make as few mistakes as possible. Relabeling cost in graph is evaluated by the modification on vertex labels Problem Statement For a constraint graph S and an instance graph G, it is to find a relabeled G of G such that G satisfiesS and the relabeling cost (G , G) is minimized

  7. Hardness Generally, it is NP-hard to find the minimum repair w.r.t. neighborhood constraints In some special cases, can efficiently solve or approximate Efficient computation by grouping tuples with equal values Rely on the transitivity of equality a=b and b=c implies a=c Does not apply to neighborhood in general cases With a band b c, it is not necessary to have a c Table1: Major theoretical results Constraints Complexity Hardnessof approximation General NP-complete (Theorem 1) Star NP-complete NP-hardtowithin afactor 1.36 (Theorem 2) (Theorem 3) Clique PTIME Greedy method noguaranteeof termination (ExampleinSection 5) (ln n + 1)-approximation (Proposition 4) Contraction method guaranteetoterminate (Proposition 5) factor 2-approximation (Proposition 6) Theorem 1. For aconstraint graph S, an instancegraph Gand a constant c, theproblemof determining whether thereexistsa rela- beledG of Gsuchthat G S andtherelabelingcost (G ,G) cis NP-complete. Table2: Notations Description constraint graph S withlabel set L,neighborhood N instancegraph Gwithvertex set V , edgeset E labeling of vertex labelsmatch constraint graph satisfiesconstraint cost of relabeling avertex cost of relabeling agraph set of vertexeswithviolationstovertex v having (v) anodeof contracted vertexes Symbol S(L,N ) G(V ,E) Proof idea. To show the hardness, we give a reduction from the set cover problem, which isknown to be NP-hard [2]. Due to the limitationof space, pleaserefer tothefull versiontechniquereport of thispaper [13] for proof details. TractableSpecial Case. Recognizing thehardness, weiden- tifyspecial casesof therelabelingproblemthatturnouttobetractable. T (v, (v)) R Definition1. W ecall S(L,N) acliqueconstraint,if transitivityon neighborhood issatisfied, i.e., for any ( 1, 2) N and ( 2, 3) N , it implies( 1, 3) N. Transitivityof neighborhoodonlabelscanbeappliedinsideeach clique. Such cliqueconstraintswith transitivity featurearepracti- cal. For example, theTel numbers in the same region are similar witheachother by sharingthesameareacodeandanother twodig- itsdenoting theregion, i.e., acliqueinS inFigure1. Indeed, for a DD withequality constraints[0,0] ontheright-hand-sideattributes, thecorresponding constraint graph consistsof cliqueswith sizes1 (seetheexampleof real dataset RSNT inSection8). Consequently, the relabeling process is to find connected com- ponents in the instance graph G(V ,E). To repair a connected component by a clique, it is to substitute all thevertex labels not belonging to the clique by a label from the clique that minimally differsfrom theoriginal label. For each connected component, we select acliquewiththeminimumtotal cost torelabel. Theinstance graph Gcan berelabeled efficiently inpolynomial time. AninstancegraphG(V ,E) satisfiesaconstraint graphS(L,N), denotedbyG S,if (v,u) E implies( (v), (u)) S, (v,u) E. That is, for any edge (v,u) E, their labels (v), (u) must match the constraint graph S with either (v) = (u) or ( (v), (u)) N. Wecall (v,u) aviolationtotheconstraint graphS,if (v,u) E and( (v), (u)) S. For example, inFigure1(b), theedge(1,3) indicatesaviolationto S, astheir labels(a,d) S areneither the samenor adjacent inFigure1(a) of constraints. Cost Function. The relabeling cost is evaluated by the differ- ence between the original and repaired data. More precisely, the repairing target is to return a modified result that minimally dif- fersfrom theoriginal data[3]. Thisminimum changeprincipleis widely adopted in improving dataquality, under therationale that peopletry tomakeasfew mistakesaspossible. Therefore, following thesamelineof repairing in databases[3], we formalize the relabeling cost in graph by evaluating the mod- ification on vertex labels. Let G be a relabeled graph of G. The relabeling cost isgiven by, Special Caseof Star Constraints. Motivated by thecenter rolesexisting in someconstraint graphs, such asthecellular com- ponent correlated to all the other GO terms (as mentioned in the introduction and observed in thereal dataset HPRD in Section 8), weconsider aspecial typeof constraintswithastar shape. (G ,G) = ( (v), (v)), (1) Definition 2. W e call S(L,N ) a star constraint, if there exists a center label 0 L thatisadjacent toall theother labels( 0, i) N , and ( 0, i) < ( j, i),i = j, i = 1,...,|L| 1;j = 1,...,|L| 1. v V where (v) is the new label of v V in the repaired G , and ( (v), (v)) denotes the (distance) cost of relabeling vertex v fromlabel (v) to (v). Thedistancemetric couldbeany string distancefunction or simply thecount of modifications[19]. That is, thelabel 0 isadjacent to all theother labelsin thecon- straint graph and hastherelabeling cost ( 0, i) lessthan others. It serves as a center of the constraint graph, thus we call such a structurestar constraint. Unfortunately, theproblem isstill hardinthisspecial case. Problem Statement. We are now ready to formalize the rela- beling problem: For aconstraint graph S and an instancegraph G, it istofind arelabeled G of Gsuch that G cost (G ,G) isminimized. S and therelabeling Theorem 2. For a star constraint S, the problem of relabeling a graph withtheminimumcost is NP-hard. PROBLEM ANALYSIS Beforediscussingtechnical details,wefirst analyzethehardness of therelabeling problem, and consequently identify special cases that may possibly beaddressed efficiently. 4. Proof idea. ToprovetheNP-hardness, weshow reductionfromthe vertex cover problem, whichisoneof Karp s21 NP-completeprob- lems[18]. Pleasealso refer to thefull version techniquereport of thispaper [13] for proof details. 990

  8. Greedy Heuristics Each iteration chooses a vertex relabeling that can eliminate the most violations by the least relabeling cost Unfortunately, the greedy approach is not guaranteed to terminate It may be trapped in local optima E.g., repeatedly relabeling vertex 5 between d and c In practice (experiments), Greedy can show high accuracy and low time cost, if it can terminate a f a f 1 7 violation c5 b g a b g f 8 2 4 6 c d a f 3 9 (a) constraint (b) instance

  9. Contraction-based Technique Intuitively, consider a group of vertexes as a whole, called a node Node contraction operation Merging all the vertexes of a node R1 into the other R2 All the vertexes in the contracted node R1 are enforced to assign the same label avoid introducing violations in the new node R2 b g f g ... ... ... ... R1 R2 R3 R1 R2 R3 violation b a t+2 g 0 f g f f t+2 g 0 f g t+1 t+1 ... ... ... ... ... ... b g 1 g 2 g t-1 g g f g 1 g 2 g t-1 g g t t (a) before (b) after

  10. Analysis on Contraction Guarantee to terminate by simply enforcing all vertexes to the same labels to stop violation spread Result could be arbitrarily badin terms of relabeling cost Example By keeping on contracting nodes Riwith host ito R1, i= 2,...,n, and assigning a new label b, the total relabeling cost is n 1 However, the relabeling with the minimum cost is to relabel vertex 1 to g a R1 R2 R3 R4 Rn violation f2 f f4 fn ... 1 3 (a) before ... R1 R2 R3 R4 Rn violation b2 a f f5 fn 3 1 (b) after

  11. A Hybrid Approach Greedy method is goodfor eliminating violations at most Even the maximum violation elimination could be negative leads to the non-termination case, local optima Contraction operation is instrumental in carrying on the relabeling Enforcing the same label may hurt the relabeling cost avoid contraction when unnecessary To incorporate the advantages of both greedy and contraction we can eliminate violations by the greedy method first If noviolations could be further reduced, contraction is applied The greedy operation has a higher priority

  12. (b) accuracy performance 1 Experiments 0.8 F-measure Hybrid Contract Greedy 0.6 0.4 On Protein Networks, HPRD Greedy terminates and shows goodaccuracy Contract fails with low accuracy On Workflow Networks, WFNT Hybrid approach can still show good performance, both greedy and contraction fail Greedy fails to terminate Contract has low accuracy 0.2 0 10 20 30 40 50 60 70 80 90100 Number frauds (b) accuracy performance 1 0.8 F-measure Hybrid Contract Greedy 0.6 0.4 0.2 0 10 20 30 40 50 60 70 80 90100 Number frauds

  13. Conclusion The relabeling problem is generally hard Spreads of violations during relabeling prevent the approximation methods performing Greedy heuristics cannot guarantee termination Contraction-based relabeling method which can always terminate but may have bad results in terms of relabeling cost Hybrid approach by cooperating contraction and greedy relabeling put together the beauty of violation elimination heuristics and termination

  14. Thanks

Related


More Related Content