×

Flexible constrained sampling with guarantees for pattern mining. (English) Zbl 1411.68097

Summary: Pattern sampling has been proposed as a potential solution to the infamous pattern explosion. Instead of enumerating all patterns that satisfy the constraints, individual patterns are sampled proportional to a given quality measure. Several sampling algorithms have been proposed, but each of them has its limitations when it comes to (1) flexibility in terms of quality measures and constraints that can be used, and/or (2) guarantees with respect to sampling accuracy. We therefore present Flexics, the first flexible pattern sampler that supports a broad class of quality measures and constraints, while providing strong guarantees regarding sampling accuracy. To achieve this, we leverage the perspective on pattern mining as a constraint satisfaction problem and build upon the latest advances in sampling solutions in SAT as well as existing pattern mining algorithms. Furthermore, the proposed algorithm is applicable to a variety of pattern languages, which allows us to introduce and tackle the novel task of sampling sets of patterns. We introduce and empirically evaluate two variants of Flexics: (1) a generic variant that addresses the well-known itemset sampling task and the novel pattern set sampling task as well as a wide range of expressive constraints within these tasks, and (2) a specialized variant that exploits existing frequent itemset techniques to achieve substantial speed-ups. Experiments show that Flexics is both accurate and efficient, making it a useful tool for pattern-based data exploration.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62D05 Sampling theory, sample surveys
68T10 Pattern recognition, speech recognition

Software:

LCM; M4RI
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] Aggarwal CC, Han J (eds) (2014) Frequent pattern mining. Springer International Publishing, New York · Zbl 1297.68010
[2] Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996) Fast discovery of association rules. In: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park, pp 307-328
[3] Albrecht M, Bard G (2012) The M4RI Library. The M4RI Team. https://bitbucket.org/malb/m4ri
[4] Berlingerio M, Pinelli F, Calabrese F (2013) ABACUS: frequent pattern mining-based community discovery in multidimensional networks. Data Min Knowl Discov 27(3):294-320 · Zbl 1281.68173 · doi:10.1007/s10618-013-0331-0
[5] Boley M, Grosskreutz H (2009) Approximating the number of frequent sets in dense data. Knowl Inf Syst 21(1):65-89 · doi:10.1007/s10115-009-0212-4
[6] Boley M, Gärtner T, Grosskreutz H (2010) Formal concept sampling for counting and threshold-free local pattern mining. In: Proceedings of the 10th SIAM international conference on data mining (SDM ’10), pp 177-188
[7] Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: Proceedings of the 17th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’11), pp 582-590
[8] Boley M, Moens S, Gärtner T (2012) Linear space direct pattern sampling using coupling from the past. In: Proceedings of the 18th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’12), pp 69-77
[9] Boley M, Mampaey M, Kang B, Tokmakov P, Wrobel S (2013) One click mining—interactive local pattern discovery through implicit preference and performance learning. In: Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics (IDEA ’13), pp 28-36
[10] Bonchi F, Giannotti F, Lucchese C, Orlando S, Perego R, Trasarti R (2009) A constraint-based querying system for exploratory pattern discovery. Inf Syst 34(1):3-27 · doi:10.1016/j.is.2008.02.007
[11] Bouillaguet C, Delaplace C (2016) Sparse Gaussian elimination modulo \[p\] p: an update. In: Proceedings of the 18th international workshop on computer algebra in scientific computing (CASC ’16), pp 101-116 · Zbl 1453.65086
[12] Bringmann B, Nijssen S, Tatti N, Vreeken J, Zimmermann A (2010) Mining sets of patterns. In: Tutorial at the European conference on machine learning and principles and practice of knowledge discovery (ECML/PKDD ’10)
[13] Bucilă C, Gehrke J, Kifer D, White W (2003) Dualminer: a dual-pruning algorithm for itemsets with constraints. Data Min Knowl Discov 7(3):241-272 · doi:10.1023/A:1024076020895
[14] Calders, T.; Rigotti, C.; Boulicaut, JF; Boulicaut, JF (ed.); Raedt, L. (ed.); Mannila, H. (ed.), A survey on condensed representations for frequent sets, 64-80 (2006), Berlin · doi:10.1007/11615576_4
[15] Carvalho DR, Freitas AA, Ebecken N (2005) Evaluating the correlation between objective rule interestingness measures and real human interest. In: Proceedings of the 9th European conference on principles of data mining and knowledge discovery (PKDD ’05), pp 453-461
[16] Chakraborty S, Meel KS, Vardi MY (2013) A scalable and nearly uniform generator of SAT witnesses. In: Proceedings of the 25th international conference on computer-aided verification (CAV ’13), pp 608-623
[17] Chakraborty S, Fremont DJ, Meel KS, Vardi MY (2014) Distribution-aware sampling and weighted model counting for SAT. In: Proceedings of the 28th AAAI conference on artificial intelligence (AAAI ’14), pp 1722-1730
[18] Chakraborty S, Fremont DJ, Meel KS, Seshia SA, Vardi MY (2015) On parallel scalable uniform SAT witness generation. In: Proceedings of the 21st international conference on tools and algorithms for the construction and analysis of systems (TACAS ’15), vol 9035, pp 304-319
[19] De Raedt L, Zimmermann A (2007) Constraint-based pattern set mining. In: Proceedings of the 7th SIAM international conference on data mining (SDM ’07), pp 237-248
[20] Dzyuba V, van Leeuwen M (2017) Learning what matters—sampling interesting patterns. In: Proceedings of the 21st Pacific-Asia conference on knowledge discovery and data mining (PAKDD ’17) (in press)
[21] Ermon S, Gomes CP, Sabharwal A, Selman B (2013a) Embed and project: discrete sampling with universal hashing. Adv Neural Inf Process Syst 26:2085-2093
[22] Ermon S, Gomes CP, Sabharwal A, Selman B (2013b) Taming the curse of dimensionality: discrete integration by hashing and optimization. In: Proceedings of the 30th international conference on machine learning (ICML ’13), pp 334-342
[23] Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of the 7th international conference on discovery science (DS ’04), pp 278-289 · Zbl 1110.68373
[24] Giacometti A, Soulet A (2016) Anytime algorithm for frequent pattern outlier detection. Int J Data Sci Anal 2(3):119-130 · doi:10.1007/s41060-016-0019-9
[25] Gomes CP, van Hoeve Wj, Sabharwal A, Selman B (2007a) Counting CSP solutions using generalized XOR constraints. In: Proceedings of the 22nd AAAI conference on artificial intelligence (AAAI ’07), pp 204-209
[26] Gomes CP, Sabharwal A, Selman B (2007b) Near-uniform sampling of combinatorial spaces using XOR constraints. Adv Neural Inf Process Syst 19:481-488
[27] Guns T, Nijssen S, De Raedt L (2011) Itemset mining: a constraint programming perspective. Artif Intell 175(12-13):1951-1983 · Zbl 1353.68233 · doi:10.1016/j.artint.2011.05.002
[28] Guns T, Nijssen S, De Raedt L \[(2013) k\] k-Pattern set mining under constraints. IEEE Trans Knowl Data Eng 25(2):402-418 · doi:10.1109/TKDE.2011.204
[29] Hasan MA, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB Endow 2(1):730-741 · doi:10.14778/1687627.1687710
[30] Kemmar A, Ugarte W, Loudni S, Charnois T, Lebbah Y, Boizumault P, Crémilleux B (2014) Mining relevant sequence patterns with CP-based framework. In: Proceedings of the 26th IEEE international conference on tools with artificial intelligence (ICTAI ’14), pp 552-559
[31] Khiari M, Boizumault P, Crémilleux B (2010) Constraint programming for mining n-ary patterns. In: Proceedings of the 16th international conference on principles and practice of constraint programming (CP ’10), pp 552-567
[32] Knobbe A, Ho E (2006) Pattern teams. In: Proceedings of the 10th European conference on principles of data mining and knowledge discovery (PKDD ’06), pp 577-584
[33] Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery (ECML/PKDD ’13), pp 288-303
[34] Meel K, Vardi M, Chakraborty S, Fremont D, Seshia S, Fried D, Ivrii A, Malik S (2016) Constrained sampling and counting: universal hashing meets SAT solving. In: Proceedings of the beyond NP AAAI workshop
[35] Nijssen, S.; Zimmermann, A.; Aggarwal, CC (ed.); Han, J. (ed.), Constraint-based pattern mining, 147-163 (2014), New York
[36] Nijssen S, Guns T, De Raedt L (2009) Correlated itemset mining in ROC space: a constraint programming approach. In: Proceedings of the 15th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’09), pp 647-655
[37] Paramonov S, van Leeuwen M, Denecker M, De Raedt L (2015) An exercise in declarative modeling for relational query mining. In: Proceedings of the 25th international conference on inductive logic programming (ILP ’15)
[38] Pei J, Han J (2000) Can we push more constraints into frequent pattern mining? In: Proceedings of the 6th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’00), pp 350-354
[39] Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm R (2004) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM SIGKDD conference on knowledge discovery and data mining (KDD ’04), pp 266-275
[40] Shervashidze N, Vishwanathan S, Petri T, Mehlhorn K, Borgwardt KM (2009) Efficient graphlet kernels for large graph comparison. In: Proceedings of the 12th international conference on artificial intelligence and statistics (AISTATS ’09), pp 488-495
[41] Soos M (2010) Enhanced Gaussian elimination in DPLL-based SAT solvers. In: Proceedings of the pragmatics of SAT workshop (POS ’10), pp 2-14
[42] Uno T, Kiyomi M, Arimura H (2005) LCM ver. 3: collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations (OSDM ’05), pp 77-86
[43] Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) New algorithms for fast discovery of association rules. In: Proceedings of the 3rd ACM SIGKDD conference on knowledge discovery and data mining (KDD ’97), pp 283-296
[44] Zimmermann, A.; Nijssen, S.; Aggarwal, CC (ed.); Han, J. (ed.), Supervised pattern mining and applications to classification, 425-442 (2014), New York · Zbl 1298.68249
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.