Improved Approximations for Ultrametric Violation Distance
This presentation discusses ultrametrics and their practical applications, focusing on the problem of Ultrametric Violation Distance (UMVD). It covers the definition, hierarchical clustering, fitting ultrametrics with different objectives, weighted settings, related works, and techniques used in addressing this problem.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Improved Approximations for Improved Approximations for Ultrametric Ultrametric Violation Distance Violation Distance Ruiquan Gao Stanford University based on joint work with Moses Charikar 1
Ultrametrics Ultrametrics root = 100 Definition: Metric s.t. ??? max ???,??? Equivalently, distances between leaves in a tree in which all leaves have equal distances to the root. 5 5 = 60 20 30 3 3 = 40 25 25 = 20 5 3 15 15 1 4 3 2 Hierarchical Clustering: For any > 0, { ?,? :??? } forms a clustering (set of disjoint cliques) For example, 0 , 30 , 1 , 2 , 3 ,{4} [30,50), 1,4 , 2 , 3 [50,90), 1,4 , 2,3 [90,+ ), 1,2,3,4 1 2 3 4 1 90 90 30 2 90 50 90 3 90 50 90 4 30 90 90 2
Ultrametrics Ultrametrics in Practice in Practice phylogenetic trees How to adjust imperfect inputs to satisfy ultrametric conditions? 3
Ultrametric Ultrametric Violation Distance (UMVD) Violation Distance (UMVD) Problem definition: Given pairwise distances {???} Find ultrametric {? ??} that minimizes #pairs that disagree ||? ? ||0= 1 ??? ? ?? ?,? Fitting ultrametrics with 0objective Previous work: ?(1)-approx. [Cohen-Addad, Fan, Lee, and Mesmay, 2022] This talk: ?-approx. 4
Ultrametric Ultrametric Violation Distance (UMVD) Violation Distance (UMVD) Weighted settings: Objective: ?,?? ?,? 1(??? ? ??) General [Cohen-Addad, Fan, Lee, and Mesmay, 2022]: ? log?loglog? -approx. NP-hard for ?(1)-approx. under Unique Games Conjecture With triangle inequalities ? ?,? ? ?,? + ?(?,?): ? min ?,log? -approx. where ? is #distinct values in the input 16-approx. if ? ?,? {0,1} (a.k.a., unweighted k-partite cases). Unified Algorithmic Framework 5
Related Works Related Works Fitting ultrametrics with other objectives: : ? [FKW93] ?(1 ? < ): ? (log?loglog?1/?)-approx. [AC11] 1: ?(1)-approx. [CDK+21] Fitting tree metrics: reduction to fitting ultrametrics With loss of an ?(1) factor [ABF+98, CDK+21, Kip23] Fitting metrics: ?(? 1): ? by linear/convex program [BDST08] 0: ?(log?)-approx. [GJ17,FRVB18,FGR+20,CFLM22] 6
Techniques Techniques 7
Correlation Clustering Correlation Clustering Definition: Given a complete graph with +/ labels (similar/dissimilar) Find a clustering minimizing #inter-cluster edges, plus #intra-cluster +edges Special case of UMVD with 2 distinct input values +edges edges w/ smaller input values edges edges w/ larger input values History: Introduced by [Bansal, Blum, and Chawla, 2004] Pivot algorithm: 2.5-approx. [Ailon, Charikar, and Newman, 2008] Later, improved to 2.06 (on standard LP, [CMSY15]), 1.94 (on Sherali- Adams, [CLN22]), 1.73 (with set-based rounding, [CLLN23]) 8
Standard Standard LP Relaxations LP Relaxations Correlation Clustering: UMVD: Ideal solution: ??? If ???= ???= ?, ???? ???= ?. = ? if and only if ?,?+ ??? = 0 1 ?,? 1 min ???+ 1 ??? min 1 ??? ??? if < if ?? ?+ ?? ? ?? 20 30 Equivalently, ??? + ??? ??? ??? 1 ??? 1, 15 , 0 if = if ?.?. ???+??? ???, ??? 0,1 , ?,?,? ?,? ?.?. ??? ?,?,?, ?,?, ?,?, 2 1= 1 ??? 25 25 ??? 0,1 , 15 4 3 Ideal solution: ???= 1 if ?,? in different clusters ???= 0 if ?,? in the same cluster ?1> ?2> > ??: distinct values in the inputs ?,? : satisfy ? ?,? = ??? 9
Pivot Algorithms for Correlation Clustering Pivot Algorithms for Correlation Clustering Algorithm: 1. Choose a random pivot 2. Decide the cluster of the pivot 3. Delete all clustered vertices and repeat +edges edges ?1 ?4 Two natural instantiations for step 2 Combinatorial: include all + neighbors 3-approximation [ACN08] LP based: independently include each vertex ? w.p. 1 ??? 2.5-approximation [ACN08] ?2 ?3 4 1 2 3 10
Pivot Algorithms for UMVD Pivot Algorithms for UMVD Algorithm: 1. Choose a random pivot 2. Decide the distances to the pivot and minimally fix the remaining distances 3. Group non-pivots according to their distances to the pivot, and recursively solve each group with a cap Two natural instantiations for step 2 Combinatorial: use the input (log ?)-approx. [CFLM22] LP based? 5-approx. 5 3 5 3 1 4 2 3 4 4 3 4 5 Cap: ? 5 Cap: ? 3 Ultrametric inequality: ??? max ???,??? 11
LP LP- -based Pivot Algorithm: Candidate 1 based Pivot Algorithm: Candidate 1 Facts about the LP solutions Recall in definition: ??? Observation: ??? ??? 1for any ?,?, ?= 1in some optimal solutions 0= 0 and ??? Interpreting the LP as distributions (for each edge) ??? 1 Pr ? ?,? = ? = ??? For the cap, truncate the LP solution: ??? (to maintain the LP feasibility) We assume ??? ? = 0 for any ? above the cap ?,? = ?????? ??? 12
Triangle Triangle- -based Analysis based Analysis - - Preliminaries Preliminaries Goal: ???[??? ?,? ] ?(1) ????(?,?) ???(?,?): whether (?,?) is modified ??(?,?): LP contribution of (?,?) (i.e., 1 ??? ??? (?,?): #modifications on (?,?) (??? ?,? ??? (?,?)) ?,?+ ??? ?,? 1) Two types of modifications On pivot edges (by sampling): easy to bound for each edge On non-pivot edges (by fixing): bound for each triangle ? = (?,?,?) ??? ??,? : whether (?,?) is modified when ? chosen as pivot ???,? : the probability ? and ? are divided to different groups (i.e., appear in a subsequent recursive call) when ? chosen as pivot Goal: ??? ?? ??? ??,? ? 1 ??? ????,? ??(?,?) 13
Triangle Triangle- -based Analysis for UMVD based Analysis for UMVD (?,?) ?? ?? ???? ?? ????,? ?? ?,? can be unbounded. Issue for UMVD: Possible bad triangles Low-cost/high-cost edges: LP ?,? < ? or LP ?,? ? ??(?,?) can be too low: only when 2 low-cost edges share the same input key observation: when ? 1/3, the inputs satisfy ultrametric inequality for these triangles. ? ? ? ? The third one is low-cost The third one is high-cost 14
LP LP- -based Pivot Algorithm: Candidate 2 based Pivot Algorithm: Candidate 2 Combining two rounding schemes A threshold ? (e.g., 1/3) If ?? ?,? ?, sample the distance If ?? ?,? < ?, fix the distance to the input Property: no modifications in triangles w/ 3 low-cost edges This property is extremely crucial in the weighted cases we considered. 1 2 ??? ??? 2 ??? 1> 1 ? ? ?,? ?2 ??? 1 2 ??? ??? 2 ??? 1 1 ? ? ?,? ? ?,? ??? 15
Triangle Triangle- -based Analysis for UMVD (cont d) based Analysis for UMVD (cont d) (?,?) ?? ?? ???? ?? ????,? ?? ?,? can be unbounded. Issue for UMVD: Possible bad triangles Low-cost/high-cost edges: LP ?,? < ? or LP ?,? ? ??(?,?) can be too low: only when 2 low-cost edges share the same input key observation: though ??? ?,? can be big, ??? ?,? 1 ? ? The third one is high-cost 16
Triangle Triangle- -based based Analysis for UMVD (cont d) Analysis for UMVD (cont d) Another simple bound For high-cost edges, we can upper bound by 1 ?? ?,? 1 ??? ?,? ? ?? ?,? ?? ?,? In triangle-based analysis Only consider modifications on low-cost edges. Bound them by the LP contributions of all edges. Optimize the combination of these two bounds 5-approx. 17
Open Problems Open Problems ?(1)-approx. for ?ultrametric fitting (2 ? < ) ? ?(1)-approx. for Metric Violation Distance? Small constant approx. for 1 ultrametric fitting? Incorporate recent techniques of Correlation Clustering? E.g., Sherali-Adams [CLN22], set-based rounding [CLLN23] 18
Thank you! Thank you! 19