×

Skypattern mining: from pattern condensed representations to dynamic constraint satisfaction problems. (English) Zbl 1404.68132

Summary: Data mining is the study of how to extract information from data and express it as useful knowledge. One of its most important subfields, pattern mining, involves searching and enumerating interesting patterns in data. Various aspects of pattern mining are studied in the theory of computation and statistics. In the last decade, the pattern mining community has witnessed a sharp shift from efficiency-based approaches to methods which can extract more meaningful patterns. Recently, new methods adapting results from studies of economic efficiency and multi-criteria decision analyses such as Pareto efficiency, or skylines, have been studied. Within pattern mining, this novel line of research allows the easy expression of preferences according to a dominance relation. This approach is useful from a user-preference point of view and tends to promote the use of pattern mining algorithms for non-experts. We present a significant extension of our previous work on the discovery of skyline patterns (or “skypatterns”) based on the theoretical relationships with condensed representations of patterns. We show how these relationships facilitate the computation of skypatterns and we exploit them to propose a flexible and efficient approach to mine skypatterns using a dynamic constraint satisfaction problems (CSP) framework.
We present a unified methodology of our different approaches towards the same goal. This work is supported by an extensive experimental study allowing us to illustrate the strengths and weaknesses of each approach.

MSC:

68T10 Pattern recognition, speech recognition
68T05 Learning and adaptive systems in artificial intelligence
68T20 Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)
PDFBibTeX XMLCite
Full Text: DOI HAL

References:

[1] Soulet, A.; Raïssi, C.; Plantevit, M.; Crémilleux, B., Mining dominant patterns in the sky, (ICDM (2011), IEEE Computer Society), 655-664
[2] Ugarte, W.; Boizumault, P.; Loudni, S.; Crémilleux, B.; Lepailleur, A., Mining (soft-) skypatterns using dynamic CSP, (CPAIOR. CPAIOR, LNCS, vol. 8451 (2014), Springer), 71-87 · Zbl 1407.68416
[3] Zaki, M. J.; Sequeira, K., Data mining in computational biology, (Aluru, S., Handbook of Computational Molecular Biology. Handbook of Computational Molecular Biology, Computer and Information Science Series (2006), Chapman & Hall/CRC Press), 1-26, Ch. 38 · Zbl 1115.68387
[4] Gasteiger, J.; Engel, T., Chemoinformatics: A Textbook (2003), Wiley VCH: Wiley VCH Weinheim
[5] Backstrom, L.; Huttenlocher, D. P.; Kleinberg, J. M.; Lan, X., Group formation in large social networks: membership, growth, and evolution, (SIGKDD (2006), ACM), 44-54
[6] Backstrom, L.; Kleinberg, J. M.; Kumar, R., Optimizing web traffic via the media scheduling problem, (SIGKDD (2009), ACM), 89-98
[7] Fan, W.; Miller, M.; Stolfo, S. J.; Lee, W.; Chan, P. K., Using artificial anomalies to detect unknown and known network intrusions, (ICDM (2001), IEEE Computer Society), 123-130
[8] Agrawal, R.; Imielinski, T.; Swami, A. N., Mining association rules between sets of items in large databases, (SIGMOD (1993), ACM Press), 207-216
[9] Mannila, H.; Toivonen, H., Levelwise search and borders of theories in knowledge discovery, Data Min. Knowl. Discov., 1, 3, 241-258 (1997)
[10] Wrobel, S., An algorithm for multi-relational discovery of subgroups, (PKDD. PKDD, LNCS, vol. 1263 (1997), Springer), 78-87
[11] Novak, P. K.; Lavrac, N.; Webb, G. I., Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining, J. Mach. Learn. Res., 10, 377-403 (2009) · Zbl 1235.68178
[12] Geerts, F.; Goethals, B.; Mielikäinen, T., Tiling databases, (DS. DS, LNCS, vol. 3245 (2004), Springer), 278-289 · Zbl 1110.68373
[13] Bonchi, F.; Giannotti, F.; Lucchese, C.; Orlando, S.; Perego, R.; Trasarti, R., A constraint-based querying system for exploratory pattern discovery, Inf. Syst., 34, 1, 3-27 (2009)
[14] Pasquier, N.; Bastide, Y.; Taouil, R.; Lakhal, L., Efficient mining of association rules using closed itemset lattices, Inf. Syst., 24, 1, 25-46 (1999)
[15] Calders, T.; Goethals, B., Non-derivable itemset mining, Data Min. Knowl. Discov., 14, 1, 171-206 (2007)
[16] Boulicaut, J.; Bykowski, A.; Rigotti, C., Free-sets: a condensed representation of boolean data for the approximation of frequency queries, Data Min. Knowl. Discov., 7, 1, 5-22 (2003)
[17] Bistarelli, S.; Bonchi, F., Soft constraint based pattern mining, Data Knowl. Eng., 62, 1, 118-137 (2007)
[18] Ugarte, W.; Boizumault, P.; Loudni, S.; Crémilleux, B.; Lepailleur, A., Extracting and summarizing the frequent emerging graph patterns from a dataset of graphs, J. Intell. Inf. Syst., 1-29 (2013)
[19] Ke, Y.; Cheng, J.; Yu, J. X., Top-k correlative graph mining, (SDM (2009), SIAM), 1038-1049
[20] Wang, J.; Han, J.; Lu, Y.; Tzvetkov, P., TFP: an efficient algorithm for mining top-k frequent closed itemsets, IEEE Trans. Knowl. Data Eng., 17, 5, 652-664 (2005)
[21] Börzsönyi, S.; Kossmann, D.; Stocker, K., The skyline operator, (ICDE (2001), IEEE Computer Society), 421-430
[22] Bentley, J. L.; Kung, H. T.; Schkolnick, M.; Thompson, C. D., On the average number of maxima in a set of vectors and applications, J. ACM, 25, 4, 536-543 (1978) · Zbl 0388.68056
[23] Calders, T.; Rigotti, C.; Boulicaut, J., A survey on condensed representations for frequent sets, (Constraint-Based Mining and Inductive Databases. Constraint-Based Mining and Inductive Databases, LNCS, vol. 3848 (2005), Springer), 64-80 · Zbl 1172.68446
[24] Ng, R. T.; Lakshmanan, L. V.S.; Han, J.; Pang, A., Exploratory mining and pruning optimizations of constrained association rules, (SIGMOD (1998)), 13-24
[25] Siebes, A.; Vreeken, J.; van Leeuwen, M., Item sets that compress, (SDM (2006), SIAM), 395-406
[26] Knobbe, A. J.; Ho, E. K.Y., Pattern teams, (ECML/PKDD. ECML/PKDD, LNCS, vol. 4213 (2006), Springer), 577-584
[27] Raedt, L. D.; Zimmermann, A., Constraint-based pattern set mining, (SDM (2007), SIAM), 237-248
[28] Bringmann, B.; Zimmermann, A., The chosen few: on identifying valuable patterns, (ICDM (2007), IEEE Computer Society), 63-72
[29] Garriga, G. C.; Kralj, P.; Lavrac, N., Closed sets for labeled data, J. Mach. Learn. Res., 9, 559-580 (2008) · Zbl 1225.68179
[30] Kontonasios, K.; Bie, T. D., An information-theoretic approach to finding informative noisy tiles in binary databases, (SDM (2010), SIAM), 153-164
[31] Hämäläinen, W.; Nykänen, M., Efficient discovery of statistically significant association rules, (ICDM (2008), IEEE Computer Society), 203-212
[32] Gallo, A.; Bie, T. D.; Cristianini, N., MINI: mining informative non-redundant itemsets, (PKDD. PKDD, LNCS, vol. 4702 (2007), Springer), 438-445
[33] Gionis, A.; Mannila, H.; Mielikäinen, T.; Tsaparas, P., Assessing data mining results via swap randomization, (SIGKDD (2006), ACM), 167-176
[34] Mampaey, M.; Tatti, N.; Vreeken, J., Tell me what I need to know: succinctly summarizing data with itemsets, (SIGKDD (2011), ACM), 573-581
[35] Guns, T.; Nijssen, S.; Raedt, L. D., Itemset mining: a constraint programming perspective, Artif. Intell., 175, 12-13, 1951-1983 (2011) · Zbl 1353.68233
[36] Raedt, L. D.; Guns, T.; Nijssen, S., Constraint programming for itemset mining, (SIGKDD (2008), ACM), 204-212
[37] Khiari, M.; Boizumault, P.; Crémilleux, B., Constraint programming for mining n-ary patterns, (CP. CP, LNCS, vol. 6308 (2010), Springer), 552-567
[38] Kung, H. T.; Luccio, F.; Preparata, F. P., On finding the maxima of a set of vectors, J. ACM, 22, 4, 469-476 (1975) · Zbl 0316.68030
[39] Matousek, J., Computing dominances in ên, Inf. Process. Lett., 38, 5, 277-278 (1991) · Zbl 0738.68080
[40] Steuer, R. E., Multiple Criteria Optimization: Theory, Computation and Application (1986), John Wiley, 546 pp. · Zbl 0663.90085
[41] Papadias, D.; Tao, Y.; Fu, G.; Seeger, B., Progressive skyline computation in database systems, ACM Trans. Database Syst., 30, 1, 41-82 (2005)
[42] Tan, K.; Eng, P.; Ooi, B. C., Efficient progressive skyline computation, (VLDB (2001), Morgan Kaufmann), 301-310
[43] Papadopoulos, A. N.; Lyritsis, A.; Manolopoulos, Y., Skygraph: an algorithm for important subgraph discovery in relational graphs, Data Min. Knowl. Discov., 17, 1, 57-76 (2008)
[44] Shelokar, P.; Quirin, A.; Cordón, O., Mosubdue: a Pareto dominance-based multiobjective subdue algorithm for frequent subgraph mining, Knowl. Inf. Syst., 34, 1, 75-108 (2013)
[45] Cook, D. J.; Holder, L. B., Graph-based data mining, IEEE Intell. Syst., 15, 2, 32-41 (2000)
[46] van Leeuwen, M.; Ukkonen, A., Discovering skylines of subgroup sets, (ECML/PKDD. ECML/PKDD, LNCS, vol. 8190 (2013), Springer), 272-287
[47] Négrevergne, B.; Dries, A.; Guns, T.; Nijssen, S., Dominance programming for itemset mining, (ICDM (2013), IEEE Computer Society), 557-566
[48] Pennerath, F.; Napoli, A., The model of most informative patterns and its application to knowledge extraction from graph databases, (ECML/PKDD. ECML/PKDD, LNCS, vol. 5782 (2009), Springer), 205-220
[49] Soulet, A.; Crémilleux, B., Adequate condensed representations of patterns, Data Min. Knowl. Discov., 17, 1, 94-110 (2008)
[50] Soulet, A.; Crémilleux, B., Mining constraint-based patterns using automatic relaxation, Intell. Data Anal., 13, 1, 109-133 (2009)
[51] Omiecinski, E., Alternative interest measures for mining associations in databases, IEEE Trans. Knowl. Data Eng., 15, 1, 57-69 (2003)
[52] Dechter, R.; Dechter, A., Belief maintenance in dynamic constraint networks, (AAAI (1988), AAAI Press/The MIT Press), 37-42
[53] Verfaillie, G.; Jussien, N., Constraint solving in uncertain and dynamic environments: a survey, Constraints, 10, 3, 253-281 (2005) · Zbl 1086.68595
[54] Lecoutre, C., Constraint Networks: Targeting Simplicity for Techniques and Algorithms (2009), Wiley-ISTE
[55] (Rossi, F.; van Beek, P.; Walsh, T., Handbook of Constraint Programming (2006), Elsevier) · Zbl 1175.90011
[56] Lepailleur, A.; Poezevara, G.; Bureau, R., Automated detection of structural alerts (chemical fragments) in (eco)toxicology, Comput. Struct. Biotech. J., 5, e201302013 (2013)
[57] Cuissart, B.; Poezevara, G.; Crémilleux, B.; Lepailleur, A.; Bureau, R., Emerging patterns as structural alerts for computational toxicology, (Contrast Data Mining: Concepts, Algorithms, and Applications (2013), CRC Press), 269-282
[58] Sushko, I.; Salmina, E.; Potemkin, V.; Poda, G.; Tetko, I. V., Toxalerts: a web server of structural alerts for toxic chemicals and compounds with potential adverse reactions, J. Chem. Inf. Model., 52, 8, 2310-2316 (2012)
[59] Hansch, C.; McCarns, S.; Smith, C.; Dodittle, D., Comparative QSAR evidence for a free-radical mechanism of phenol-induced toxicity, Chem.-Biol. Interact., 127, 61-72 (2000)
[60] Lozano, S.; Poezevara, G.; Halm-Lemeille, M.; Lescot-Fontaine, E.; Lepailleur, A.; Bissell-Siders, R.; Crémilleux, B.; Rault, S.; Cuissart, B.; Bureau, R., Introduction of jumping fragments in combination with QSARs for the assessment of classification in ecotoxicology, J. Chem. Inf. Model., 50, 8, 1330-1339 (2010)
[61] Lo, D.; Khoo, S.; Li, J., Mining and ranking generators of sequential patterns, (SDM (2008), SIAM), 553-564
[62] Coquery, E.; Jabbour, S.; Saïs, L.; Salhi, Y., A SAT-based approach for discovering frequent, closed and maximal patterns in a sequence, (ECAI. ECAI, Frontiers in Artificial Intelligence and Applications, vol. 242 (2012), IOS Press), 258-263 · Zbl 1327.68213
[63] Kemmar, A.; Ugarte, W.; Loudni, S.; Charnois, T.; Lebbah, Y.; Boizumault, P.; Crémilleux, B., Mining relevant sequence patterns with CP-based framework, (ICTAI (2014), IEEE Computer Society), 552-559
[64] Ugarte, W.; Boizumault, P.; Loudni, S.; Crémilleux, B., Computing skypattern cubes, (ECAI. ECAI, Frontiers in Artificial Intelligence and Applications, vol. 263 (2014), IOS Press), 903-908 · Zbl 1366.68248
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.