
Uncovering Correlated Patterns in Databases - Effective Strategies
Explore insightful approaches for identifying statistically significant patterns in databases. Learn about frequent itemset mining, evaluating itemset correlations, applying statistical testing, and discovering association rules to enhance pattern analysis in a database setting.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Finding Correlated and Statistically Significant Patterns Philippe Fournier-Viger http://www.philippe-Fournier-viger.com Source code and datasets available in the SPMF library 1
Todays topic Finding patterns in database that indicate a strong correlation or are statistically significant. Frequent itemset mining (FIM) 2
Frequent itemset mining Let there be a positive number minsup, set by the user. Frequent itemset mining (FIM) consists of enumerating all frequent itemsets, that is sets of values having a support greater or equal to minsup in a database D. Transaction Items appearing in the transaction T1 T2 T3 {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} T4 {pasta, lemon, orange, cake} 3
Example Transaction Items appearing in the transaction T1 {pasta, lemon, bread, orange} T2 {pasta, lemon} T3 {pasta, orange, cake} T4 {pasta, lemon, orange cake} For minsup= 2, the frequent itemsets are: {lemon}, {pasta}, {orange}, {cake}, {lemon, pasta}, {lemon, orange}, {pasta, orange}, {pasta, cake}, {orange, cake}, {lemon, pasta, orange} These itemsets appear many times but should we believe them? 4
No. Many itemsets are spurious! Transaction Items appearing in the transaction T1 {pasta, lemon, bread, orange} T2 {pasta, lemon} T3 {pasta, orange, cake} T4 {pasta, lemon, orange cake} A frequent itemset may contain items that are weakly correlated (appear together more or less by chance). e.g {pasta, cake} appears in 50% of the transactions. But it does not mean much because pasta appears in all transactions! 5
How to solve this problem? Solution 1: Evaluate the correlation between items in an itemset using functions such as: Bond All-confidence Solution 2: Apply statistical testing Solution 3: Find other types of patterns, e.g., association rules. 6
THE BOND MEASURE Omiecinski, E.R.: Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering 15(1), pp. 57-69 (2003) 7
(Conjunctive) support The (conjunctive)support of an itemset Xis the number of transactions that contains X. sup(?) = |{? | ? ? ? ?}| Example: sup({pasta, orange}) = 3 Transaction Items appearing in the transaction T1 {pasta, lemon, bread, orange} T2 {pasta, lemon} T3 {pasta, orange, cake} T4 {pasta, lemon, orange, cake} 8
Disjunctive support The disjunctivesupport of an itemset Xis the number of transactions that contains at least an item from X. ????(?) = |{? | ? ? ? ?}| Example: dsup({pasta, orange}) = 4 Transaction Items appearing in the transaction T1 {pasta, lemon, bread, orange} T2 {pasta, lemon} T3 {pasta, orange, cake} T4 {pasta, lemon, orange, cake} 9
Bond The bond of an itemset X is its conjunctive support divided by its disjunctive support. ????(?) = sup(?) / ????(?) Example: bond({pasta, orange}) = 3 / 4 = 0.75 Transaction Items appearing in the transaction T1 {pasta, lemon, bread, orange} T2 {pasta, lemon} T3 {pasta, orange, cake} T4 {pasta, lemon, orange, cake} 10
Transaction Items appearing in the transaction T1 {pasta, lemon, bread, orange} T2 {pasta, lemon} T3 {pasta, orange, cake} T4 {pasta, lemon, orange cake} The bond is a value in the [0,1]interval 0 = no correlation 1 = maximum correlation Examples: bond({cake,bread}) = 0 / 3 = 0 bond({lemon,bread}) = 1 / 3 = 0.33 bond({lemon,orange}) = 2 / 4 = 0.5 bond({pasta, lemon}) = 3 / 4 = 0.75 11
Property 1 of the bond: The bond of an itemset ? containing a single item is ????(?) = 1. Transaction Items appearing in the transaction T1 T2 T3 {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} T4 {pasta, lemon, orange cake} Example: bond(bread) = 1 / 1 = 1 This can be proven easily. 12
Property 2 of the bond (Apriori property): Let there be two itemsets ? and ?. If ? Y, the bond of ? is less than or equal to the bond of ?. Example: The bond of {pasta} is 1 The bond of {pasta, lemon} is 3 / 4 = 0.75 The bond of {pasta, lemon, orange} is 2 / 4 = 0.5 Transaction Items appearing in the transaction T1 T2 T3 {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} T4 {pasta, lemon, orange, cake} (bond is anti-monotonic) 13
Proof sketch By the definition of the bond: ????(?) = sup(?) / ????(?) ????(?) = sup(?) / ????(?) Since ? ?, we have sup(?) sup(?) by the Apriori property of the support. We could show that dsup(?) dsup(?) Hence, bond(X) bond (Y) 14
Finding correlated frequent itemsets (using the bond) Input: a transaction database a minsup threshold a minbond threshold [0,1] Output: Each frequent itemset X such that: sup(?) ?????? ????(?) ??????? 15
Example Transaction T1 T2 T3 T4 Items appearing in the transaction {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} {pasta, lemon, orange cake} For minsup= 2 and minbond = 0.75, thecorrelated frequent itemsets are: {lemon}, {pasta}, {orange}, {cake} bond = 1 {lemon, pasta}, {lemon, orange}, {pasta, orange}, {pasta, cake}, {orange, cake}, {lemon, pasta, orange} bond = 0.75 bond = 0.5 bond = 0.75 bond = 0.5 bond = 0.66 bond = 0.5 16
Finding rare correlated frequent itemsets using the bond Input: a transaction database a maxsup threshold a minbond threshold [0,1] Output: Each rare itemset X such that: sup(X) < maxsup bond(X) ??????? 17
Example Transaction T1 T2 T3 T4 Items appearing in the transaction {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} {pasta, lemon, orange cake} For maxsup= 3 and minbond = 0.6, therarecorrelated itemsets are: {bread} support = 1 bond = 1 {cake} support = 2 bond = 1 {orange, cake} support = 2 bond = 0.66 18
The CORI algorithm Based on Eclat The bond of all single items is 1. To find larger itemsets, Eclat combines pairs of itemsets ? and ? to obtain a new itemset ? = ? ? To calculate the bond of Z, we need to find: its conjunctive support its disjunctive support. CORI: Bouasker, S., Ben Yahia, S.: Key correlation mining by simultaneous monotone and anti-monotone constraints checking. In: Proc. 30th Symp. on Applied Computing, pp. 851-856, 2015. 19
Two structures Each itemset X has two structures TID-List (Transaction-ID list): list of transactions containing X. DTID-List (Disjunctive Transaction-ID list): list of transactions containing at least one item from X. Example 20
TID-List and DTID-List Transaction T1 T2 T3 T4 Items appearing in the transaction {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} {pasta, lemon, orange cake} TIDLIST({pasta}) = {T1, T2, T3, T4} TIDLIST({pasta,lemon}) = {T1, T2, T4} TIDLIST({bread,orange}) = {T1}
TID-List and DTID-List Transaction T1 T2 T3 T4 Items appearing in the transaction {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} {pasta, lemon, orange cake} TIDLIST({pasta}) = {T1, T2, T3, T4} TIDLIST({pasta,lemon}) = {T1, T2, T4} TIDLIST({bread,orange}) = {T1} DTIDLIST({pasta}) = {T1, T2, T3, T4} DTIDLIST({pasta,lemon}) = {T1, T2, T3,T4} DTIDLIST({bread,orange}) = {T1,T3, T4}
TID-List and DTID-List Transaction T1 T2 T3 T4 Items appearing in the transaction {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} {pasta, lemon, orange cake} TIDLIST({pasta}) = {T1, T2, T3, T4} TIDLIST({pasta,lemon}) = {T1, T2, T4} TIDLIST({bread,orange}) = {T1} sup( ) = 4 sup( ) = 3 sup( ) = 1 DTIDLIST({pasta}) = {T1, T2, T3, T4} DTIDLIST({pasta,lemon}) = {T1, T2, T3,T4} DTIDLIST({bread,orange}) = {T1,T3, T4}
TID-List and DTID-List Transaction T1 T2 T3 T4 Items appearing in the transaction {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} {pasta, lemon, orange cake} TIDLIST({pasta}) = {T1, T2, T3, T4} TIDLIST({pasta,lemon}) = {T1, T2, T4} TIDLIST({bread,orange}) = {T1} sup( ) = 4 sup( ) = 3 sup( ) = 1 DTIDLIST({pasta}) = {T1, T2, T3, T4} DTIDLIST({pasta,lemon}) = {T1, T2, T3,T4} dsup( ) = 4 DTIDLIST({bread,orange}) = {T1,T3, T4} dsup( ) = 4 dsup( ) = 3 If we have these lists, we can calculate the bond!
The CORI algorithm The TID-Lists and DTID-Lists of single items are calculated by reading the database. For any itemset Z = X U Y that is obtained by joining two itemsets X and Y: TIDLIST(Z) = TIDLIST(X) TIDLIST(Y) DTIDLIST(Z) = DTIDLIST(X) DTIDLIST(Y) bond(Z) = | TIDLIST(Z) | / | DTIDLIST(Z) | If bond(Z) < minbond, we don t need to check supersets of Z. 25
THE ALL-CONFIDENCE MEASURE Omiecinski, E.R.: Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering 15(1), pp. 57-69 (2003) 26
All-confidence The all-confidence of an itemset Xis : sup(?) ??????? ? = ?????? ??????? ? ? ?} Values are in [0,1] allconf({pasta, orange}) = 3 4= 0.75 Example: Transaction Items appearing in the transaction T1 {pasta, lemon, bread, orange} T2 {pasta, lemon} T3 {pasta, orange, cake} T4 {pasta, lemon, orange, cake} 27
All-confidence The all-confidence of an itemset Xis : sup(?) ??????? ? = ?????? ??????? ? ? ?} Values are in [0,1] Example: allconf({lemon, orange,cake}) = 1 3= 0.33 Transaction Items appearing in the transaction T1 {pasta, lemon, bread, orange} T2 {pasta, lemon} T3 {pasta, orange, cake} T4 {pasta, lemon, orange, cake} 28
Property 1 of the all-confidence: The all-confidence of an itemset ? containing a single item is ???????(?) = 1. Transaction Items appearing in the transaction T1 T2 T3 {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} T4 {pasta, lemon, orange cake} Example: allconf(cake) = 2 / 2 = 1 This can be proven easily. 29
Property 2 of the all-confidence (Apriori property): Let there be two itemsets ? and ?. If ? Y, the all-confidence of ? is less than or equal to the all-confidence of ?. Example: The all-confidence of {pasta} is 1 The all-confidence of {pasta, lemon} is 3 / 4 = 0.75 The all-confidence of {pasta, lemon, orange} is 2 / 4 = 0.5 Transaction Items appearing in the transaction T1 T2 T3 {pasta, lemon, bread, orange} {pasta, lemon} {pasta, orange, cake} T4 {pasta, lemon, orange, cake} (all-confidence is anti-monotonic) 30
Algorithm It is easy to modify a frequent itemset mining algorithm such as Apriori and Eclat to calculate the all-confidence. We just need to pre-calculate the support of each single item. This is all we need to calculate the all- confidence. 31
Two interesting observations Relationship between bond and all- confidence: It can be proven that for any itemset X: ???? ? ???????(?) The bond and all-confidence are null-invariant. If we add transactions that does not contain an itemset X, the bond and all-confidence of X will not change. 32
OTHER CORRELATION MEASURES 33
Other correlation measures Frequence affinity. Ahmed, C. F., Tanbeer, S. K., Jeong, B. S., Choi, H. J.: A framework for mining interesting high utility patterns with a strong frequency affinity. Information Sciences. 181(21), pp. 4878 4894 (2011) Coherence Barsky, M., Kim, S., Weninger, T., Han, J.: Mining flipping correlations from large datasets with taxonomies. In: Proc. 38th Int. Conf. on Very Large Databases, pp. 370 381 (2012) ??, lift, cosine, kulc, maxconf Wu et al.: Re-examination of interestingness measures in pattern mining: a unified framework. DMKD 21:371-397 (2010) 34
Other correlation measures Note: the definitions are given the case of two items a and b. Wu et al.: Re-examination of interestingness measures in pattern mining: a unified framework. DMKD 21:371-397 (2010) 35
STATISTICALLY SIGNIFICANT ITEMSETS Webb, G.I. and Vreeken, J., 2013. Efficient discovery of the most interesting associations.ACM Transactions on Knowledge Discovery from Data (TKDD),8(3), pp.1-31. 36
Introduction The various correlation measures help to ensure that items are correlated. However, they do not ensure statistical significance. We can raise the bar higher to select itemsets and use statistical tests. This can be useful to analyze medical data and draw conclusions that are more likely to be valid 37
Non-redundant and productive itemsets Goal: Reduce the number of patterns found by identifying only those that are statistically significant. Statistical significance is measured with Fisher Exact test. Non-redundant Itemsets Productive itemsets Webb, G. I., Vreeken, J.: Efficient discovery of the most interesting associations. ACM Transactions on Knowledge Discovery from Data. 8(3),15 (2014)
Non-redundant itemsets An itemset is non-redundant (i.e., is a generator) if it has no proper subset having the same support (occurrence frequency) Example: sup({pregnant,heart disease,woman}) = sup({pregnant,heart disease}) Thus, {pregnant,heart disease,woman} is redundant.
Productive itemsets Idea: Check if we can predict the support of an itemset by assuming that some of its partitions are independent. An itemset is productive if all its partitions into two subsets are positively correlated with each other. {a,b,c} two partitions: {a,b} and {c} another possibility: {a,c} and {b} another possibility: {b,c} and {a} Example: {alchool, liver_cancer} is productive because alchool is positively correlated withliver_cancer
Productive itemsets Example 2: {alchool, liver_cancer, black_hair} {alchool, liver_cancer} is not positively correlated with {black_hair}. Thus, {alchool, liver_cancer, black_hair} is not productive.
Fisher Exact Test to evaluate statistical significance =10 =9 =1 =3 =11 =14 =12 =12 {studying ,Men} is positively correlated? 0.001380 <0.05 YES! Online calculator: www.statology.org/fishers-exact-test-calculator/
How to find non-redundant productive itemsets? First algorithm: Opus-Miner (Webb, 2014) It performs a depth-first search. The user must set a parameter k. The algorithm returns the top k non-redundant productive itemsets having the highest lift or leverage. Other algorithms: IDPI+: Allows to interactively search for productive itemsets (Fournier-Viger, et a., 2018, Interactive Discovery of Statistically Significant Itemsets ) Query = X= {male, tobacco, Alzheimer} is productive?
Conclusion We have discussed: correlated patterns bond all-confidence other correlation measures the CORI algorithm statistically significant patterns Open source Java data mining software, 240 algorithms http://www.phillippe-fournier-viger.com/spmf/ 44