Causal Discovery with Mixed Data Using Constraint-Based Methods

Explore causal discovery with mixed data through constraint-based approaches: learning causal networks, the PC algorithm for Bayesian network learning, regression-based conditional independence tests, and methods for combining dependent p-values to identify causal relationships and orient the edges of a Bayesian network.

  • Causal Discovery
  • Mixed Data
  • Bayesian Networks
  • Constraint-Based Methods
  • PC Algorithm


Presentation Transcript


  1. Constraint-based Causal Discovery with Mixed Data
     Michail Tsagris, Giorgos Borboudakis, Vincenzo Lagani, Ioannis Tsamardinos
     Ioannis Tsamardinos: Associate Professor, Computer Science Department, University of Crete; CEO and Co-founder, Gnosis Data Analysis PC; Visiting Professor, University of Huddersfield

  2. The Causal Discovery Problem
     [Figure: an example data table with variables Age, SocioEcon, RiskAversion, VehicleYear and ThisCarDam, illustrating the pipeline: data generation from a causal network, followed by causal discovery to reconstruct that network from the data.]

  3. Learning Causal Networks
     Causal networks:
     • Bayesian networks (DAGs, no latent confounders)
     • Maximal ancestral graphs (mixed graphs, admit latent variables)
     Approaches:
     • Score-based: measure the goodness of fit of a causal model and search over the space of graphs
     • Constraint-based: use conditional independence tests to exclude edges and constrain the search space
     • Hybrid: combine the above

  4. The PC Algorithm for Bayesian Network Learning
     Phase A: Skeleton Identification
     1. Form the complete undirected graph C on the vertex set V (the variables).
     2. k = 0.
     3. Repeat
     4.   Repeat
     5.     Select an ordered pair of variables X and Y that are adjacent in C such that Adjacencies(C, X)\{Y} has cardinality at least k, and a subset S of Adjacencies(C, X)\{Y} of cardinality k; if p-value(X, Y | S) > α, delete edge X - Y from C.
     6.   Until all ordered pairs of adjacent variables X and Y such that Adjacencies(C, X)\{Y} has cardinality at least k, and all subsets S of Adjacencies(C, X)\{Y} of cardinality k, have been tested for independence.
     7.   k = k + 1.
     8. Until, for each ordered pair of adjacent vertices X, Y, Adjacencies(C, X)\{Y} has cardinality less than k.
     The p-value comes from a conditional independence test. The same algorithm can be used for different data types by plugging in an appropriate conditional independence test (see the sketch below).
     Phase B: Edge Orientation
     Apply the PC orientation rules to determine the direction of the edges.
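
     A minimal Python sketch of Phase A, assuming a generic ci_pvalue(x, y, s, data) function that returns the p-value of some conditional independence test; the function name and the data container are illustrative, not part of the original presentation:

     ```python
     from itertools import combinations

     def pc_skeleton(variables, data, ci_pvalue, alpha=0.05):
         # Phase A: start from the complete undirected graph C on V.
         adj = {v: set(variables) - {v} for v in variables}
         k = 0
         while True:
             any_pair_large_enough = False
             for x in variables:
                 for y in list(adj[x]):
                     s_pool = adj[x] - {y}
                     if len(s_pool) < k:
                         continue
                     any_pair_large_enough = True
                     for s in combinations(sorted(s_pool), k):
                         # Delete edge X - Y when the test does not reject
                         # independence of X and Y given S (p-value > alpha).
                         if ci_pvalue(x, y, set(s), data) > alpha:
                             adj[x].discard(y)
                             adj[y].discard(x)
                             break
             if not any_pair_large_enough:
                 # Every Adjacencies(C, X)\{Y} now has fewer than k members.
                 return adj
             k += 1
     ```

     Because the test is passed in as a parameter, the same skeleton code works unchanged for Gaussian, discrete, or mixed data.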

  5. Conditional Independence Tests
     • For multivariate Gaussian data: the partial correlation test
     • For discrete data: the G² test, the χ² test
     What about other data types?

  6. Types of Data
     Continuous:
     I. on the real line (temperature, profit)
     II. positive valued (time, weight)
     III. percentages, including or excluding 0 and/or 1 (glycated haemoglobin, HbA1c levels)
     IV. clustered (measurements on school units)
     Categorical:
     I. binary (sick, healthy)
     II. nominal (blood type, gender)
     III. ordinal (satisfaction level in a video call)
     Counts (number of hospitalisation days)

  7. Prior Work
     • Heckerman and Geiger (1995) were the first to use a scoring algorithm with mixed continuous and categorical variables. Continuous variables are not allowed to have discrete parents.
     • Discretization of the continuous variables (Monti and Cooper, 1998): computationally expensive, with a possible loss of information.
     • Bach and Jordan (2002) used a kernel-based approach for learning graphical models with mixed continuous and discrete variables. It also identifies non-linear relations, but is hard to tune (2 hyper-parameters) and computationally demanding.
     • Copula-based PC algorithm with continuous and ordinal data (Cui et al., 2016): constructs a correlation matrix using MCMC sampling. Not general; applicable only to continuous, ordinal and binary data.
     • Conditional independence tests using likelihood-ratio tests (Sedgewick et al., 2017): considers continuous, binary and nominal data only. Its limitations are presented later.

  8. Conditional Independence Tests Using Likelihood-Ratio Tests
     The null hypothesis H0 of conditional independence of X and Y given a set of variables Z:
     P(X, Y | Z) = P(X | Z) P(Y | Z)
     Equivalently: P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z).
     This can be tested using regression models and likelihood-ratio tests (or Wald tests, score tests).
     To test P(X | Y, Z) = P(X | Z):
     1. Fit models M1 = X ~ f(Y, Z) and M0 = X ~ f(Z).
     2. Test whether the models fit equally well:
        Statistic = 2 [loglik(M1) - loglik(M0)]
        Dof = parameters(M1) - parameters(M0)
        p-value = 1 - χ²cdf(Statistic, Dof)
     A minimal sketch of these steps follows below.
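
     As an illustration, a sketch of the likelihood-ratio test for a binary X using logistic regression with statsmodels; the simulated variables x, y and z1 are invented for this example (here X truly depends only on Z, so the test should not reject):

     ```python
     import numpy as np
     import statsmodels.api as sm
     from scipy.stats import chi2

     rng = np.random.default_rng(0)
     n = 500
     z1 = rng.normal(size=n)
     y = rng.normal(size=n)
     # Binary X generated from Z only, via a logistic link.
     x = (rng.random(n) < 1 / (1 + np.exp(-0.5 * z1))).astype(int)

     # M1: X ~ f(Y, Z)   versus   M0: X ~ f(Z)
     m1 = sm.Logit(x, sm.add_constant(np.column_stack([y, z1]))).fit(disp=0)
     m0 = sm.Logit(x, sm.add_constant(z1)).fit(disp=0)

     stat = 2 * (m1.llf - m0.llf)            # 2 [loglik(M1) - loglik(M0)]
     dof = int(m1.df_model - m0.df_model)    # difference in parameter counts
     pvalue = chi2.sf(stat, dof)             # 1 - chi^2 cdf(stat, dof)
     print(f"LRT statistic = {stat:.3f}, dof = {dof}, p-value = {pvalue:.3f}")
     ```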

  9. Types of Variables and Regression Models

     Variable type          Appropriate model
     Continuous             Linear regression
     Binary                 Binary logistic regression
     Nominal                Multinomial regression
     Ordinal                (Generalized) ordinal regression
     Counts                 Negative binomial regression
     Positive               Weibull, gamma regression
     Proportions in (0, 1)  Beta regression

     Example: X binary, Y continuous, Z a set of mixed variables; test the conditional independence of X and Y given Z.
     • One way: test P(X | Y, Z) = P(X | Z) by fitting 2 binary logistic regression models for X.
     • Alternatively: test P(Y | X, Z) = P(Y | Z) by fitting 2 linear regression models for Y.
     Are both tests equivalent?
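
     To make the plug-in nature concrete, a hypothetical dispatch from variable type to a model class, using statsmodels names; the dictionary and its keys are an assumption of this sketch, not code from the presentation:

     ```python
     import statsmodels.api as sm
     from statsmodels.miscmodels.ordinal_model import OrderedModel

     # Map each outcome type to a regression family, mirroring the table above.
     # Beta regression for proportions is available separately in recent
     # statsmodels versions; Weibull regression is outside statsmodels core.
     MODEL_FOR_TYPE = {
         "continuous": sm.OLS,            # linear regression
         "binary": sm.Logit,              # binary logistic regression
         "nominal": sm.MNLogit,           # multinomial regression
         "ordinal": OrderedModel,         # ordinal regression
         "counts": sm.NegativeBinomial,   # negative binomial regression
         "positive": sm.GLM,              # e.g. with a Gamma family
     }
     ```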

  10. The Asymmetry Problem
     It is not guaranteed that the two tests will give the same p-value.
     • Approach 1: ignore the problem; perform only one test, chosen arbitrarily.
     • Approach 2: perform only one test, using rules for prioritizing data types and tests. Sedgewick et al. (2017) consider continuous, binary and nominal data, and prioritize continuous data types (using linear regression) over the others.
     • Approach 3: perform both tests and combine them appropriately. The p-values are dependent! This needs to be taken into account.

  11. Combining Dependent p-values
     Naïve approaches:
     1. minimum p-value: pmin = min(p1, p2)
     2. maximum p-value: pmax = max(p1, p2)
     Combining the 2 dependent p-values (Benjamini and Heller, 2008):
     pmm = min(2 min(p1, p2), max(p1, p2))
     These rules are transcribed into code below.
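
     The three combination rules as plain Python functions, a direct transcription of the formulas above:

     ```python
     def p_min(p1, p2):
         # Naive: take the smaller p-value (anti-conservative under dependence).
         return min(p1, p2)

     def p_max(p1, p2):
         # Naive: take the larger p-value (conservative).
         return max(p1, p2)

     def p_mm(p1, p2):
         # Benjamini & Heller (2008): valid for two dependent p-values.
         return min(2 * min(p1, p2), max(p1, p2))
     ```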

  12. Simulation Studies Assessing the Combination of p-values
     We performed simulation studies on mixed data types.
     • Variable types: continuous, binary, nominal, ordinal
     • Regression models: linear, binary logistic, multinomial logistic, generalized ordinal logistic
     For all combinations of the above we tested:
     • Unconditional independence
     • Unconditional dependence
     • Conditional independence given one continuous variable (fork)
     • Conditional dependence given one continuous variable (collider)

  13. Proportion of Decision Agreements
     Performing both asymmetric tests. Decision agreement: the proportion of times both tests lead to the same decision (threshold α = 0.05). The conditioning is on one continuous variable.
     Abbreviations: L-B: Linear-Binary; L-M: Linear-Multinomial; L-O: Linear-Ordinal; B-O: Binary-Ordinal; M-O: Multinomial-Ordinal.
     In all cases, the decision tends to be the same with large sample sizes. It is important to account for the asymmetry in low-sample settings.

  14. Estimated Type I Error for Conditional Independence
     Estimated Type I error rates (α = 0.05) for the asymmetric tests and the 3 methods for combining p-values, over the pairs Linear & Binary, Linear & Multinomial, Linear & Ordinal, Binary & Ordinal, and Multinomial & Ordinal. Ideally, a method should have a Type I error of 0.05.
     The proposed MM method outperforms the rest (especially in the Multinomial & Ordinal case).

  15. Learning Bayesian Networks with Mixed Data
     We generated BNs with continuous, binary and ordinal data only (no multinomial), and compared the MM method with both asymmetric tests against:
     1. the copula PC algorithm (Cui et al., 2016), and
     2. the PC with mixed data of Sedgewick et al. (2017).
     We extend the work of Sedgewick et al. (2017) by also handling ordinal data. Since only one test is performed, we have termed this approach "Fast". We prioritize data types as: continuous > binary > multinomial > ordinal.

  16. Orientation Precision

     3 neighbors    50 variables                 100 variables
     Method         n=200   n=500   n=1000      n=200   n=500   n=1000
     MM             0.686   0.979   0.988       0.943   0.965   0.974
     Fast           0.608*  0.969*  0.978*      0.928*  0.942*  0.948*
     Copula         0.812   0.940*  0.928*      0.976   0.932*  0.913*

     5 neighbors    50 variables                 100 variables
     Method         n=200   n=500   n=1000      n=200   n=500   n=1000
     MM             0.987   0.992   0.993       0.986   0.992   0.989
     Fast           0.984   0.989   0.992       0.982   0.985*  0.984*
     Copula         0.975*  0.970*  0.950*      0.984   0.959*  0.949*

     Bold: best method. Asterisk: MM statistically significantly better than the competitor. Italic: competitor better than MM.

  17. Orientation Recall

     3 neighbors    50 variables                 100 variables
     Method         n=200   n=500   n=1000      n=200   n=500   n=1000
     MM             0.118   0.692   0.806       0.504   0.668   0.790
     Fast           0.108*  0.621*  0.698*      0.476*  0.600*  0.669*
     Copula         0.092*  0.666   0.793       0.342*  0.625*  0.791

     5 neighbors    50 variables                 100 variables
     Method         n=200   n=500   n=1000      n=200   n=500   n=1000
     MM             0.413   0.606   0.711       0.430   0.613   0.719
     Fast           0.406*  0.561*  0.638*      0.428   0.569*  0.649*
     Copula         0.327*  0.591   0.696       0.289*  0.583*  0.722

     Bold: best method. Asterisk: MM statistically significantly better than the competitor. Italic: competitor better than MM.

  18. Structural Hamming Distance
     [Figure: SHD comparison of the MM, Fast and Copula methods.]
     MM has the lowest SHD on average in all experiments. In most cases, the SHD of MM is statistically significantly lower.

  19. Limitations
     • Performing both tests is not easy or possible for all data types, e.g. time-to-event outcomes or longitudinal data. These can still be handled by performing one test.
     • The tests assume correct model specification. They often work under mild model misspecification; this is a common problem of all parametric models.
     • Even if the network is correctly specified locally (e.g. variables are parametric functions of their parents), there are cases where dependencies may be missed by the tests! The tests still work with algorithms that only assume adjacency faithfulness (e.g. Conservative PC; Ramsey et al., 2006). Thanks to the anonymous reviewer.

  20. Limitations - Example
     Y: uniformly distributed nominal variable with 3 values; Yi = 1 if Y = i, 0 otherwise; ε ~ N(0, 1).
     X ~ -Y1 + Y2 + 0.1 ε
     Z ~  Y1 + Y2 + 0.1 ε
     A correlation test finds X and Z unconditionally independent (correlation = 0.008, p-value = 0.795), even though both are functions of Y. A quick check appears below.
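
     A quick simulation of this counterexample; the seed and sample size are arbitrary choices, so the exact numbers will differ from the slide's:

     ```python
     import numpy as np
     from scipy.stats import pearsonr

     rng = np.random.default_rng(1)
     n = 10_000
     y = rng.integers(1, 4, size=n)                 # nominal Y, uniform on {1, 2, 3}
     y1, y2 = (y == 1).astype(float), (y == 2).astype(float)

     # Both X and Z are functions of Y, yet Cov(X, Z) = -Var(Y1) + Var(Y2) = 0.
     x = -y1 + y2 + 0.1 * rng.normal(size=n)
     z = y1 + y2 + 0.1 * rng.normal(size=n)

     r, p = pearsonr(x, z)
     print(f"correlation = {r:.3f}, p-value = {p:.3f}")   # near-zero correlation
     ```

     The correlation test sees no (linear) dependence, so a PC run based on it would miss edges; this is the case the adjacency-faithfulness assumption guards against.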

  21. Key Things to Remember
     • The approach is not limited to learning Bayesian networks; it can be applied to learning other graphical models (e.g. maximal ancestral graphs, Markov networks) or to feature selection.
     • It is not limited to the data types considered here; it can handle many more data types, as long as an appropriate regression model is available.
     • It is important to perform both tests and combine the p-values.
     • Recommendation: use methods that only assume adjacency faithfulness to avoid mistakes.

  22. Acknowledgements
     This project has received funding from the European Research Council (ERC) under the European Union's Seventh Framework Programme (FP/2007-2013) (grant agreement No 617393).

  23. References
     • Bach FR, Jordan MI (2002). Learning graphical models with Mercer kernels. In: NIPS, vol 15, pp 1009-1016.
     • Benjamini Y, Heller R (2008). Screening for partial conjunction hypotheses. Biometrics 64(4):1215-1222.
     • Cui R, Groot P, Heskes T (2016). Copula PC algorithm for causal discovery from mixed data. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 377-392.
     • Heckerman D, Geiger D (1995). Learning Bayesian networks: a unification for discrete and Gaussian domains. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., pp 274-284.
     • Monti S, Cooper GF (1998). A multivariate discretization method for learning Bayesian networks from mixed data. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., pp 404-413.
     • Ramsey J, Spirtes P, Zhang J (2006). Adjacency-faithfulness and conservative causal inference. In: Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp 401-408.
     • Sedgewick AJ, Ramsey JD, Spirtes P, Glymour C, Benos PV (2017). Mixed graphical models for causal analysis of multi-modal variables. arXiv preprint arXiv:1704.02621.
