2018-02-16: bachelor's thesis

Branch: master
Author: Simon Lackerbauer (1 year ago)
Commit: d888ff24cb
Signed by: Simon Lackerbauer <simon@lackerbauer.com>, GPG Key ID: 2B27C889039C0125
15 changed files with 3319 additions and 0 deletions
  1. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/bibliography.bib (+941, -0)
  2. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/dbstmpl.sty (+145, -0)
  3. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/dummy_pattern.png (binary)
  4. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/dummy_random_walk.png (binary)
  5. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/example_pattern.png (binary)
  6. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/fullgraph.png (binary)
  7. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/partial_graph.png (binary)
  8. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/small_pattern.png (binary)
  9. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/sphx_glr_plot_separating_hyperplane_0011.png (binary)
  10. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/kopf.pdf (binary)
  11. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/presentation.pdf (binary)
  12. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/presentation.tex (+285, -0)
  13. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/thesis.pdf (binary)
  14. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/thesis.tex (+657, -0)
  15. 2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/utcaps.bst (+1291, -0)

2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/bibliography.bib (+941, -0)

@@ -0,0 +1,941 @@
% Generated by Paperpile. Check out http://paperpile.com for more information.
% BibTeX export options can be customized via Settings -> BibTeX.

@TECHREPORT{Saarinen2015-vh,
title = "The {BLAKE2} cryptographic hash and message authentication code
({MAC})",
author = "Saarinen, Markku Juhani and Aumasson, Jean-Philippe",
number = "RFC 7693",
year = 2015,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@TECHREPORT{Yan2002-hg,
title = "gSpan: {Graph-Based} Substructure Pattern Mining, Expanded
Version",
author = "Yan, Xifeng and Han, Jiawei",
number = "UIUCDCS-R-2002-2296",
institution = "UIUC Technical Report",
year = 2002,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Chang2011-wa,
title = "{LIBSVM}: A Library for Support Vector Machines",
author = "Chang, Chih-Chung and Lin, Chih-Jen",
journal = "ACM Trans. Intell. Syst. Technol.",
publisher = "ACM",
volume = 2,
number = 3,
pages = "27:1--27:27",
month = may,
year = 2011,
address = "New York, NY, USA",
keywords = "Classification LIBSVM optimization regression support vector
machines SVM;compsci;Efficient Event Classification through
Constrained Subgraph Mining"
}

@MISC{noauthor_undated-io,
title = "1.4. Support Vector Machines --- scikit-learn 0.19.1
documentation",
howpublished = "\url{http://scikit-learn.org/stable/modules/svm.html}",
note = "Accessed: 2018-2-4",
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Pedregosa2011-ld,
title = "Scikit-learn: Machine Learning in Python",
author = "Pedregosa, Fabian and Varoquaux, Ga{\"e}l and Gramfort, Alexandre
and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and
Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and
Dubourg, Vincent and Vanderplas, Jake and Passos, Alexandre and
Cournapeau, David and Brucher, Matthieu and Perrot, Matthieu and
Duchesnay, {\'E}douard",
journal = "J. Mach. Learn. Res.",
volume = 12,
number = "Oct",
pages = "2825--2830",
year = 2011,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@INPROCEEDINGS{Han1999-bj,
title = "Efficient mining of partial periodic patterns in time series
database",
booktitle = "Proceedings 15th International Conference on Data Engineering
(Cat. {No.99CB36337})",
author = "Han, Jiawei and Dong, Guozhu and Yin, Yiwen",
pages = "106--115",
month = mar,
year = 1999,
keywords = "data mining;statistical databases;time series;data mining;hit
set property;partial periodicity search;periodicity
search;time-series databases;Algorithm design and
analysis;Cities and towns;Computer science;Councils;Data
analysis;Data mining;Databases;Read only
memory;Sun;compsci;Efficient Event Classification through
Constrained Subgraph Mining"
}

@ARTICLE{Zaki2001-jy,
title = "{SPADE}: An Efficient Algorithm for Mining Frequent Sequences",
author = "Zaki, Mohammed J",
journal = "Mach. Learn.",
publisher = "Kluwer Academic Publishers",
volume = 42,
number = "1-2",
pages = "31--60",
month = jan,
year = 2001,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@ARTICLE{Han2004-qs,
title = "From sequential pattern mining to structured pattern mining: A
pattern-growth approach",
author = "Han, Jia-Wei and Pei, Jian and Yan, Xi-Feng",
journal = "J. Comput. Sci. \& Technol.",
publisher = "Science Press",
volume = 19,
number = 3,
pages = "257--279",
month = may,
year = 2004,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@ARTICLE{Pei2002-ud,
title = "Constrained Frequent Pattern Mining: A Pattern-growth View",
author = "Pei, Jian and Han, Jiawei",
journal = "SIGKDD Explor. Newsl.",
publisher = "ACM",
volume = 4,
number = 1,
pages = "31--39",
month = jun,
year = 2002,
address = "New York, NY, USA",
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Han2000-ut,
title = "Mining frequent patterns by pattern-growth: methodology and
implications",
author = "Han, Jiawei and Pei, Jian",
journal = "ACM SIGKDD Explorations Newsletter",
publisher = "ACM",
volume = 2,
number = 2,
pages = "14--20",
month = dec,
year = 2000,
keywords = "associations; constraint-based mining; frequent patterns;
scalable data mining methods and algorithms; sequential
patterns;compsci;Efficient Event Classification through
Constrained Subgraph Mining"
}

@INPROCEEDINGS{Srikant1996-dy,
title = "Mining Quantitative Association Rules in Large Relational Tables",
booktitle = "Proceedings of the 1996 {ACM} {SIGMOD} International Conference
on Management of Data",
author = "Srikant, Ramakrishnan and Agrawal, Rakesh",
publisher = "ACM",
pages = "1--12",
series = "SIGMOD '96",
year = 1996,
address = "New York, NY, USA",
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Kossinets2006-rw,
title = "Empirical analysis of an evolving social network",
author = "Kossinets, Gueorgi and Watts, Duncan J",
affiliation = "Department of Sociology and Institute for Social and Economic
Research and Policy, Columbia University, 420 West 118th
Street, MC 3355, New York, NY 10027, USA. gk297@columbia.edu",
journal = "Science",
publisher = "science.sciencemag.org",
volume = 311,
number = 5757,
pages = "88--90",
month = jan,
year = 2006,
keywords = "socsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@ARTICLE{Cortes1995-ix,
title = "Support-vector networks",
author = "Cortes, Corinna and Vapnik, Vladimir",
journal = "Mach. Learn.",
publisher = "Kluwer Academic Publishers",
volume = 20,
number = 3,
pages = "273--297",
month = sep,
year = 1995,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@MISC{noauthor_undated-bv,
title = "Apache Hadoop - Open source software for reliable, scalable,
distributed computing",
howpublished = "\url{http://hadoop.apache.org/}",
note = "Accessed: 2018-2-2",
keywords = "misc;Efficient Event Classification through Constrained
Subgraph Mining"
}

@MISC{noauthor_undated-zu,
title = "Elasticsearch - Open Source Search \& Analytics",
howpublished = "\url{https://www.elastic.co/}",
note = "Accessed: 2018-2-2",
keywords = "misc;Efficient Event Classification through Constrained
Subgraph Mining"
}

@MISC{noauthor_undated-vl,
title = "Grafana - The open platform for analytics and monitoring",
booktitle = "Grafana Labs",
howpublished = "\url{https://grafana.com/}",
note = "Accessed: 2018-2-2",
keywords = "misc;Efficient Event Classification through Constrained
Subgraph Mining"
}

@MISC{noauthor_undated-xi,
title = "Graylog - Open Source Log Management",
howpublished = "\url{https://www.graylog.org/}",
note = "Accessed: 2018-2-2",
keywords = "misc;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Travers1967-cn,
title = "The small world problem",
author = "Travers, Jeffrey and Milgram, Stanley",
journal = "Psychology Today",
publisher = "JSTOR",
volume = 1,
number = 1,
pages = "61--67",
year = 1967,
keywords = "socsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@BOOK{Newman2010-ac,
title = "Networks: An Introduction",
author = "Newman, Mark",
publisher = "Oxford University Press",
month = mar,
year = 2010,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@INPROCEEDINGS{Cook1971-fj,
title = "The complexity of theorem-proving procedures",
booktitle = "Proceedings of the third annual {ACM} symposium on Theory of
computing",
author = "Cook, Stephen A",
publisher = "ACM",
pages = "151--158",
month = may,
year = 1971,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@MISC{Rahtz_undated-bv,
title = "{TeX} Live - {TeX} Users Group",
author = "Rahtz, Sebastian and Kakuto, Akira and Berry, Karl and
Scarso, Luigi and Miklavec, Mojca and Preining, Norbert and
Kotucha, Reinhard and Kroonenberg, Siep and Wawrykiewicz,
Staszek",
howpublished = "\url{https://www.tug.org/texlive/}",
note = "Accessed: 2018-2-2",
keywords = "misc;Efficient Event Classification through Constrained
Subgraph Mining"
}

@MISC{Van_der_Zander_undated-kf,
title = "{TeXstudio}",
author = "van der Zander, Benito",
howpublished = "\url{http://www.texstudio.org/}",
note = "Accessed: 2018-2-2",
keywords = "misc;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Shannon2003-gg,
title = "Cytoscape: a software environment for integrated models of
biomolecular interaction networks",
author = "Shannon, Paul and Markiel, Andrew and Ozier, Owen and Baliga,
Nitin S and Wang, Jonathan T and Ramage, Daniel and Amin, Nada
and Schwikowski, Benno and Ideker, Trey",
affiliation = "Institute for Systems Biology, Seattle, Washington 98103, USA.",
journal = "Genome Res.",
volume = 13,
number = 11,
pages = "2498--2504",
month = nov,
year = 2003,
keywords = "misc;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@ARTICLE{Garey1976-po,
title = "Some simplified {NP-complete} graph problems",
author = "Garey, M R and Johnson, D S and Stockmeyer, L",
journal = "Theor. Comput. Sci.",
volume = 1,
number = 3,
pages = "237--267",
month = feb,
year = 1976,
keywords = "maths;Efficient Event Classification through Constrained Subgraph
Mining"
}

@TECHREPORT{Fortin1996-la,
title = "The graph isomorphism problem",
author = "Fortin, Scott",
number = "96-20",
institution = "University of Alberta",
year = 1996,
keywords = "maths;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Dragoni2016-fh,
title = "Microservices: yesterday, today, and tomorrow",
author = "Dragoni, Nicola and Giallorenzo, Saverio and Lafuente,
Alberto Lluch and Mazzara, Manuel and Montesi, Fabrizio and
Mustafin, Ruslan and Safina, Larisa",
month = jun,
year = 2016,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
archivePrefix = "arXiv",
primaryClass = "cs.SE",
eprint = "1606.04036"
}

@ARTICLE{Peters1993-fw,
title = "The history and development of transaction log analysis",
author = "Peters, Thomas A",
journal = "Library Hi Tech",
volume = 11,
number = 2,
pages = "41--66",
year = 1993,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Agrawal1993-nc,
title = "Mining Association Rules Between Sets of Items in Large
Databases",
author = "Agrawal, Rakesh and Imieli{\'n}ski, Tomasz and Swami, Arun",
journal = "SIGMOD Rec.",
publisher = "ACM",
volume = 22,
number = 2,
pages = "207--216",
month = jun,
year = 1993,
address = "New York, NY, USA",
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@INPROCEEDINGS{Agrawal1994-ca,
title = "Fast algorithms for mining association rules",
booktitle = "Proc. 20th int. conf. very large data bases, {VLDB}",
author = "Agrawal, Rakesh and Srikant, Ramakrishnan and others",
volume = 1215,
pages = "487--499",
year = 1994,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@INPROCEEDINGS{Pei2000-rz,
title = "Mining Access Patterns Efficiently from Web Logs",
booktitle = "Knowledge Discovery and Data Mining. Current Issues and New
Applications",
author = "Pei, Jian and Han, Jiawei and Mortazavi-Asl, Behzad and Zhu,
Hua",
publisher = "Springer, Berlin, Heidelberg",
pages = "396--407",
month = apr,
year = 2000,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en",
conference = "Pacific-Asia Conference on Knowledge Discovery and Data Mining"
}

@ARTICLE{Jung2017-zs,
title = "When is Network Lasso Accurate?",
author = "Jung, Alexander",
month = apr,
year = 2017,
keywords = "stats;Efficient Event Classification through Constrained
Subgraph Mining",
archivePrefix = "arXiv",
primaryClass = "stat.ML",
eprint = "1704.02107"
}

@ARTICLE{Haghiri2017-mk,
title = "Comparison Based Nearest Neighbor Search",
author = "Haghiri, Siavash and Ghoshdastidar, Debarghya and von
Luxburg, Ulrike",
month = apr,
year = 2017,
keywords = "stats;Efficient Event Classification through Constrained
Subgraph Mining",
archivePrefix = "arXiv",
primaryClass = "stat.ML",
eprint = "1704.01460"
}

@ARTICLE{Lei2017-ip,
title = "{Cross-Validation} with Confidence",
author = "Lei, Jing",
month = mar,
year = 2017,
keywords = "stats;Efficient Event Classification through Constrained
Subgraph Mining",
archivePrefix = "arXiv",
primaryClass = "stat.ME",
eprint = "1703.07904"
}

@INPROCEEDINGS{Ringsquandl2016-en,
title = "Knowledge Graph Constraints for Multi-label Graph
Classification",
booktitle = "2016 {IEEE} 16th International Conference on Data Mining
Workshops ({ICDMW})",
author = "Ringsquandl, Martin and Lamparter, Steffen and Thon, Ingo
and Lepratti, Raffaello and Kr{\"o}ger, Peer",
publisher = "IEEE",
pages = "121--127",
year = 2016,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
conference = "2016 IEEE 16th International Conference on Data Mining
Workshops (ICDMW)"
}

@ARTICLE{De_Graaf2007-hh,
title = "Clustering with Lattices in the Analysis of Graph Patterns",
author = "de Graaf, Edgar H and Kok, Joost N and Kosters, Walter A",
month = may,
year = 2007,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
archivePrefix = "arXiv",
primaryClass = "cs.AI",
eprint = "0705.0593"
}

@ARTICLE{Hallac2015-uk,
title = "Network Lasso: Clustering and Optimization in Large Graphs",
author = "Hallac, David and Leskovec, Jure and Boyd, Stephen",
affiliation = "Stanford University. Stanford University. Stanford University.",
journal = "KDD",
volume = 2015,
pages = "387--396",
month = aug,
year = 2015,
keywords = "ADMM; Convex Optimization; Network Lasso;compsci;Efficient
Event Classification through Constrained Subgraph Mining",
language = "en"
}

@ARTICLE{Wei2017-ak,
title = "A Joint Framework for Argumentative Text Analysis
Incorporating Domain Knowledge",
author = "Wei, Zhongyu and Li, Chen and Liu, Yang",
month = jan,
year = 2017,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
archivePrefix = "arXiv",
primaryClass = "cs.CL",
eprint = "1701.05343"
}

@ARTICLE{Bayer2017-af,
title = "Graph Based Relational Features for Collective
Classification",
author = "Bayer, Immanuel and Nagel, Uwe and Rendle, Steffen",
month = feb,
year = 2017,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
archivePrefix = "arXiv",
primaryClass = "cs.IR",
eprint = "1702.02817"
}

@ARTICLE{Dhiman2016-wo,
title = "Optimizing Frequent Subgraph Mining for Single Large Graph",
author = "Dhiman, Aarzoo and Jain, S K",
journal = "Procedia Comput. Sci.",
volume = 89,
pages = "378--385",
year = 2016,
keywords = "Frequent Subgraph Mining; Graph; Optimization; Single Graph;
Subgraph Isomorphism;compsci;Efficient Event Classification
through Constrained Subgraph Mining"
}

@INPROCEEDINGS{Dhiman2016-jq,
title = "Frequent subgraph mining algorithms for single large
graphs --- A brief survey",
booktitle = "2016 International Conference on Advances in Computing,
Communication, \& Automation ({ICACCA}) (Spring)",
author = "Dhiman, Aarzoo and Jain, S K",
publisher = "IEEE",
pages = "1--6",
month = apr,
year = 2016,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
conference = "2016 International Conference on Advances in Computing,
Communication, \& Automation (ICACCA) (Spring)"
}

@INPROCEEDINGS{Zou2010-ze,
title = "Frequent subgraph mining on a single large graph using sampling
techniques",
booktitle = "Proceedings of the Eighth Workshop on Mining and Learning with
Graphs",
author = "Zou, Ruoyu and Holder, Lawrence B",
publisher = "ACM",
pages = "171--178",
month = jul,
year = 2010,
keywords = "graph mining; large graph; sampling;compsci;Efficient Event
Classification through Constrained Subgraph Mining"
}

@INCOLLECTION{Moussaoui2016-ng,
title = "{POSGRAMI}: Possibilistic Frequent Subgraph Mining in a Single
Large Graph",
booktitle = "Information Processing and Management of Uncertainty in
{Knowledge-Based} Systems",
author = "Moussaoui, Mohamed and Zaghdoud, Montaceur and Akaichi, Jalel",
editor = "Carvalho, Joao Paulo and Lesot, Marie-Jeanne and Kaymak, Uzay
and Vieira, Susana and Bouchon-Meunier, Bernadette and Yager,
Ronald R",
publisher = "Springer International Publishing",
volume = 610,
pages = "549--561",
series = "Communications in Computer and Information Science",
year = 2016,
address = "Cham",
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@INPROCEEDINGS{Elseidy2014-fz,
title = "{GRAMI}: Frequent Subgraph and Pattern Mining in a Single Large
Graph",
booktitle = "Proceedings of the {VLDB} Endowment",
author = "Elseidy, Mohammed and Abdelhamid, Ehab and Skiadopoulos, Spiros
and Kalnis, Panos",
month = mar,
year = 2014,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@INPROCEEDINGS{Yan2003-dl,
title = "{CloseGraph}: Mining Closed Frequent Graph Patterns",
booktitle = "Proceedings of the ninth {ACM} {SIGKDD} international
conference on Knowledge discovery and data mining",
author = "Yan, Xifeng and Han, Jiawei",
year = 2003,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
conference = "KDD '03"
}

@ARTICLE{Han2007-qx,
title = "Frequent pattern mining: current status and future directions",
journal = "Data Min. Knowl. Discov.",
author = "Han, Jiawei and Cheng, Hong and Xin, Dong and Yan, Xifeng",
publisher = "Kluwer Academic Publishers-Plenum Publishers",
volume = 15,
pages = "55--86",
month = aug,
year = 2007,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@INPROCEEDINGS{Yan2002-sj,
title = "gSpan: graph-based substructure pattern mining",
booktitle = "2002 {IEEE} International Conference on Data Mining, 2002.
Proceedings.",
author = "Yan, Xifeng and Han, Jiawei",
publisher = "IEEE Comput. Soc",
pages = "721--724",
year = 2002,
keywords = "data mining;tree searching;algorithm;canonical
label;depth-first search strategy;frequent connected subgraph
mining;frequent graph-based pattern mining;frequent
substructure discovery;gSpan;graph datasets;graph-based
substructure pattern mining;lexicographic order;performance
study;unique minimum DFS code;Chemical compounds;Computer
science;Costs;Data mining;Data
structures;Graphics;Itemsets;Kernel;Testing;Tree
graphs;compsci;Efficient Event Classification through
Constrained Subgraph Mining",
conference = "2002 IEEE International Conference on Data Mining. ICDM 2002"
}

@BOOK{Ojeda2014-lq,
title = "Practical Data Science Cookbook",
author = "Ojeda, Tony and Murphy, Sean Patrick and Bengfort, Benjamin and
Dasgupta, Abhijit",
publisher = "Packt Publishing Ltd",
month = sep,
year = 2014,
keywords = "stats;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@BOOK{Cuesta2013-ha,
title = "Practical Data Analysis",
author = "Cuesta, Hector",
publisher = "Packt Publishing",
year = 2013,
keywords = "stats;Efficient Event Classification through Constrained
Subgraph Mining"
}

@BOOK{Coelho2015-wo,
title = "Building Machine Learning Systems with Python - Second Edition",
author = "Coelho, Luis Pedro and Richert, Willi",
publisher = "Packt Publishing Ltd",
month = mar,
year = 2015,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@INPROCEEDINGS{Crouch2013-qe,
title = "Dynamic Graphs in the {Sliding-Window} Model",
booktitle = "Algorithms -- {ESA} 2013 Proceedings",
author = "Crouch, Michael S and McGregor, Andrew and Stubbs, Daniel",
year = 2013,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
conference = "Algorithms -- ESA 2013"
}

@INPROCEEDINGS{Kyrola2012-wx,
title = "{GraphChi}: {Large-Scale} Graph Computation on Just a {PC}",
booktitle = "{OSDI}",
author = "Kyrola, Aapo and Blelloch, Guy E and Guestrin, Carlos and
others",
volume = 12,
pages = "31--46",
year = 2012,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Grone1994-qc,
title = "The Laplacian spectrum of a graph {II}",
author = "Grone, Robert and Merris, Russell",
journal = "SIAM J. Discrete Math.",
publisher = "SIAM",
volume = 7,
number = 2,
pages = "221--229",
year = 1994,
keywords = "maths;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Grone1990-fa,
title = "The Laplacian Spectrum of a Graph",
author = "Grone, Robert and Merris, Russell and Sunder, V S",
journal = "SIAM J. Matrix Anal. Appl.",
publisher = "SIAM",
volume = 11,
number = 2,
pages = "218--238",
year = 1990,
keywords = "maths;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Mohar1991-lx,
title = "The Laplacian spectrum of graphs",
author = "Mohar, Bojan",
editor = "Alavi, Y and Chartrand, G and Oellermann, O R",
journal = "Graph theory, combinatorics, and applications",
publisher = "academia.edu",
volume = 2,
pages = "871--898",
year = 1991,
keywords = "maths;Efficient Event Classification through Constrained
Subgraph Mining"
}

@BOOK{Han2006-on,
title = "Data Mining: Concepts and Techniques",
author = "Han, Jiawei and Kamber, Micheline",
publisher = "Morgan Kaufmann",
edition = 2,
year = 2006,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Yan2005-nz,
title = "Mining closed relational graphs with connectivity constraints",
author = "Yan, X and Zhou, X and Han, J",
journal = "Proceedings of the eleventh ACM SIGKDD",
publisher = "dl.acm.org",
year = 2005,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@INPROCEEDINGS{Kong2013-ug,
title = "Multi-label Classification by Mining Label and Instance
Correlations from Heterogeneous Information Networks",
booktitle = "Proceedings of the 19th {ACM} {SIGKDD} International Conference
on Knowledge Discovery and Data Mining",
author = "Kong, Xiangnan and Cao, Bokai and Yu, Philip S",
publisher = "ACM",
pages = "614--622",
series = "KDD '13",
year = 2013,
address = "New York, NY, USA",
keywords = "data mining, heterogeneous information network, label
correlation, multi-label classification;compsci;Efficient Event
Classification through Constrained Subgraph Mining"
}

@INPROCEEDINGS{Kong2012-fj,
title = "Meta Path-based Collective Classification in Heterogeneous
Information Networks",
booktitle = "Proceedings of the 21st {ACM} International Conference on
Information and Knowledge Management",
author = "Kong, Xiangnan and Yu, Philip S and Ding, Ying and Wild, David J",
publisher = "ACM",
pages = "1567--1571",
series = "CIKM '12",
year = 2012,
address = "New York, NY, USA",
keywords = "heterogeneous information networks, meta path;compsci;Efficient
Event Classification through Constrained Subgraph Mining"
}

@BOOK{Han2011-ms,
title = "Data Mining: Concepts and Techniques",
author = "Han, Jiawei and Pei, Jian and Kamber, Micheline",
publisher = "Elsevier",
edition = 3,
month = jun,
year = 2011,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@INPROCEEDINGS{Deville2016-pp,
title = "{GriMa}: A Grid Mining Algorithm for {Bag-of-Grid-Based}
Classification",
booktitle = "Structural, Syntactic, and Statistical Pattern Recognition",
author = "Deville, Romain and Fromont, Elisa and Jeudy, Baptiste and
Solnon, Christine",
editor = "Robles-Kelly, Antonio and Loog, Marco and Biggio, Battista and
Escolano, Francisco and Wilson, Richard",
publisher = "Springer International Publishing",
pages = "132--142",
series = "Lecture Notes in Computer Science",
month = nov,
year = 2016,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en",
conference = "Joint IAPR International Workshops on Statistical Techniques in
Pattern Recognition (SPR) and Structural and Syntactic Pattern
Recognition (SSPR)"
}

@ARTICLE{Bianchi2015-wu,
title = "Granular Computing Techniques for Classification and Semantic
Characterization of Structured Data",
author = "Bianchi, Filippo Maria and Scardapane, Simone and Rizzi,
Antonello and Uncini, Aurelio and Sadeghian, Alireza",
journal = "Cognit. Comput.",
publisher = "Springer US",
volume = 8,
number = 3,
pages = "442--461",
month = dec,
year = 2015,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining",
language = "en"
}

@ARTICLE{Hu2016-uv,
title = "An aerial image recognition framework using discrimination and
redundancy quality measure",
author = "Hu, Yuxing and Nie, Liqiang",
journal = "J. Vis. Commun. Image Represent.",
publisher = "Elsevier",
volume = 37,
pages = "53--62",
year = 2016,
keywords = "Aerial image; Categorization; Discriminative; Subgraph; Data
mining; Image recognition; Framework; Quality
measure;compsci;Efficient Event Classification through
Constrained Subgraph Mining"
}

@INPROCEEDINGS{Arora2010-bc,
title = "Sentiment Classification Using Automatically Extracted Subgraph
Features",
booktitle = "Proceedings of the {NAACL} {HLT} 2010 Workshop on Computational
Approaches to Analysis and Generation of Emotion in Text",
author = "Arora, Shilpa and Mayfield, Elijah and Penstein-Ros{\'e},
Carolyn and Nyberg, Eric",
publisher = "Association for Computational Linguistics",
pages = "131--139",
series = "CAAGET '10",
year = 2010,
address = "Stroudsburg, PA, USA",
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@ARTICLE{Conte2004-ki,
title = "Thirty Years of Graph Matching in Pattern Recognition",
author = "Conte, D and Foggia, P and Sansone, C and Vento, M",
journal = "Int. J. Pattern Recognit Artif Intell.",
publisher = "World Scientific",
volume = 18,
number = 03,
pages = "265--298",
year = 2004,
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@INPROCEEDINGS{Nguyen2009-ts,
title = "Graph-based Mining of Multiple Object Usage Patterns",
booktitle = "Proceedings of the 7th Joint Meeting of the European
Software Engineering Conference and the {ACM} {SIGSOFT}
Symposium on The Foundations of Software Engineering",
author = "Nguyen, Tung Thanh and Nguyen, Hoan Anh and Pham, Nam H and
Al-Kofahi, Jafar M and Nguyen, Tien N",
publisher = "ACM",
pages = "383--392",
series = "ESEC/FSE '09",
year = 2009,
address = "New York, NY, USA",
keywords = "anomaly, api usage, clone, graph mining, groum, object usage,
pattern;compsci;Efficient Event Classification through
Constrained Subgraph Mining"
}

@ARTICLE{Washio2003-fc,
title = "State of the Art of Graph-based Data Mining",
author = "Washio, Takashi and Motoda, Hiroshi",
journal = "SIGKDD Explor. Newsl.",
publisher = "ACM",
volume = 5,
number = 1,
pages = "59--68",
month = jul,
year = 2003,
address = "New York, NY, USA",
keywords = "data mining, graph, graph-based data mining, path, structured
data, tree;compsci;Efficient Event Classification through
Constrained Subgraph Mining"
}

@INPROCEEDINGS{Cheng2009-td,
title = "Identifying Bug Signatures Using Discriminative Graph Mining",
booktitle = "Proceedings of the Eighteenth International Symposium on
Software Testing and Analysis",
author = "Cheng, Hong and Lo, David and Zhou, Yang and Wang, Xiaoyin and
Yan, Xifeng",
publisher = "ACM",
pages = "141--152",
series = "ISSTA '09",
year = 2009,
address = "New York, NY, USA",
keywords = "bug signature, discriminative subgraph mining;compsci;Efficient
Event Classification through Constrained Subgraph Mining"
}

@INPROCEEDINGS{Zhang2009-wg,
title = "{GADDI}: Distance Index Based Subgraph Matching in Biological
Networks",
booktitle = "Proceedings of the 12th International Conference on Extending
Database Technology: Advances in Database Technology",
author = "Zhang, Shijie and Li, Shirong and Yang, Jiong",
publisher = "ACM",
pages = "192--203",
series = "EDBT '09",
year = 2009,
address = "New York, NY, USA",
keywords = "compsci;Efficient Event Classification through Constrained
Subgraph Mining"
}

@INPROCEEDINGS{Jin2011-vz,
title = "{LTS}: Discriminative subgraph mining by learning from search
history",
booktitle = "2011 {IEEE} 27th International Conference on Data Engineering",
author = "Jin, N and Wang, W",
publisher = "ieeexplore.ieee.org",
pages = "207--218",
month = apr,
year = 2011,
keywords = "data mining;graph theory;greedy algorithms;learning (artificial
intelligence);pattern classification;branch and bound
algorithm;discriminative subgraph mining method;graph
classifier;graph indices;greedy algorithm;learning to
search;Accuracy;Algorithm design and analysis;Chemical
compounds;Classification algorithms;Frequency
estimation;History;Kernel;compsci;Efficient Event Classification
through Constrained Subgraph Mining"
}

2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/dbstmpl.sty (+145, -0)

@@ -0,0 +1,145 @@
%%
%% dbstmpl.sty
%% by Stefan Brecheisen, Database Systems Group (LFE fuer Datenbanksysteme),
%% Institute for Informatics, LMU Munich
%%
%% minor adjustments to the current corporate design by Marisa Thoma
%%
%% This LaTeX package defines sensible settings and helpful commands
%% for writing up a diploma thesis or project thesis
%% at the Database Systems Group.
%%
\def\fileversion{v1.1}
\def\filedate{2010/06/21}
\NeedsTeXFormat{LaTeX2e}
\ProvidesPackage{dbstmpl}[\filedate\space\fileversion]
%%
%% required packages
%%
%% For English documents:
%\RequirePackage{lmodern,polyglossia}
\RequirePackage[german,english]{babel}
%% otherwise, for German theses:
%\RequirePackage[english,german]{babel}
\RequirePackage{german}
\RequirePackage{amsmath}
\RequirePackage{amssymb}
\RequirePackage{geometry}
\RequirePackage{fancyhdr}
\RequirePackage[nottoc]{tocbibind}
\RequirePackage{graphicx}
%%
%% frequently used symbols
%%
\newcommand{\N}{\mathbb{N}} % set of natural numbers
\newcommand{\Z}{\mathbb{Z}} % set of integers
\newcommand{\Q}{\mathbb{Q}} % set of rational numbers
\newcommand{\R}{\mathbb{R}} % set of real numbers
\newcommand{\C}{\mathbb{C}} % set of complex numbers
%%
%% Settings
%%
% page margins
\geometry{body={140mm,210mm},footskip=12mm}
% layout of headers and footers
\pagestyle{fancy}
\headheight 14pt
\fancyhf{}
\fancyhead[L]{\small\slshape\leftmark}
\fancyfoot[C]{\thepage}
% number subsubsections and include them in the table of contents
\setcounter{secnumdepth}{3}
\setcounter{tocdepth}{3}
%%
%% Global variables
%%
\newtoks\arbeit % type of thesis
\newtoks\fach % degree program
\newtoks\titel % title of the thesis
\newtoks\bearbeiter % name of the author
\newtoks\betreuer % name of the advisor
\newtoks\aufgabensteller % name of the supervisor who set the task
\newtoks\abgabetermin % submission date
\newtoks\ort % author's place of residence
%%
%% Output of the title page for a Diplom or project thesis.
%%
\newcommand{\deckblatt}{
\begin{titlepage}
~
\vspace{-2cm}
\begin{center}
\parbox[t]{145.5mm}{ \includegraphics[width=145.5mm]{kopf} }
\end{center}
\begin{center}
\vspace{2.5cm}\Large
\the\arbeit
{\large in \the\fach}
\vspace{1cm}\huge
\the\titel
\vspace{1cm}\large
\the\bearbeiter
\vspace{\fill}\normalsize
\begin{tabular}{ll}
Aufgabensteller: & \the\aufgabensteller\\
Betreuer: & \the\betreuer\\
Abgabedatum: & \the\abgabetermin
\end{tabular}
\end{center}
\end{titlepage}
}
%%
%% Output of the declaration of independent authorship
%% of a Diplom thesis
%%
\newcommand{\erklaerung}{
\begin{titlepage}
\vspace*{\fill}
\parindent 0cm
\begin{center}
\textbf{Erkl"arung}
\vspace{1cm}
\begin{minipage}{9.8cm}
Hiermit versichere ich, dass ich diese \the\arbeit\ selbst\"andig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.
\vspace{1cm}
\the\ort, den \the\abgabetermin
\vspace{1.5cm}
\makebox[9.8cm]{\dotfill}\\
\the\bearbeiter
\end{minipage}
\end{center}
\vspace*{\fill}
\end{titlepage}
}
\newcommand{\emptypage}{
\begin{titlepage}
\vspace*{\fill}
\end{titlepage}
}

BIN
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/dummy_pattern.png

BIN
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/dummy_random_walk.png

BIN
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/example_pattern.png

BIN
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/fullgraph.png

BIN
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/partial_graph.png

BIN
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/small_pattern.png

BIN
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/images/sphx_glr_plot_separating_hyperplane_0011.png

BIN
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/kopf.pdf

BIN
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/presentation.pdf


+ 285
- 0
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/presentation.tex

@@ -0,0 +1,285 @@
\documentclass{beamer}
%\setbeameroption{show only notes}
\usepackage{polyglossia}
\usetheme{default} %minimal
\setbeamercovered{transparent}
\setbeamertemplate{bibliography item}{}
\setbeamertemplate{caption}[numbered]
\setbeamercolor*{bibliography entry title}{fg=black}
\setbeamercolor*{bibliography entry author}{fg=black}
\setbeamercolor*{bibliography entry location}{fg=black}
\setbeamercolor*{bibliography entry note}{fg=black}
\usepackage{natbib}
\usepackage{tikz}
\bibliographystyle{plain}
\renewcommand\bibfont{\scriptsize}
\beamertemplatenavigationsymbolsempty

\AtBeginSection[]
{
\begin{frame}<beamer>
\frametitle{Outline}
\tableofcontents[currentsection]
\end{frame}
}


\title{Efficient Event Classification through Constrained Subgraph Mining}
\subtitle{Bachelor's Thesis Final Presentation}

\author{Simon Lackerbauer}

\institute[Ludwig-Maximilians-Universität München]
{
}

\date{2018-04-23}

\subject{}

\AtBeginSubsection[]
{
\begin{frame}<beamer>{Outline}
\tableofcontents[currentsection,currentsubsection]
\end{frame}
}


\begin{document}
\begin{frame}
\titlepage
\end{frame}
\begin{frame}{Outline}
\tableofcontents
\end{frame}
\section{Problem Statement}
\begin{frame}{Problem Statement}{}
\begin{itemize}
\item Data source: an industrial production line at Siemens
\item Data type: error messages from the various production modules
\item The data are proprietary, so an additional synthetic data set was constructed, which is used on most slides
\end{itemize}
\end{frame}

\begin{frame}{Problem Statement}
\begin{itemize}
\item The facility suffers many outages
\item The goal was to generate patterns from the data that allow conclusions about the origins of the problems in the process
\item With these patterns, the facility's technicians should be able to identify the causes of the frequent outages and mitigate them accordingly
\end{itemize}
\end{frame}

\begin{frame}{Example Data}
\begin{table}[]
\centering
\caption{Synthetic data set (excerpt)}
\label{table:dummy_messages}
\footnotesize
\begin{tabular}{l|l|l|l}
time stamp & log message & module id & part id \\ \hline
2017-04-05 11:01:05 & Laser überhitzt & Module 1 & 88495775TEST \\
2017-04-05 11:01:05 & Laser überhitzt & Module 1 & 88495776TEST \\
2017-04-05 11:01:06 & Teil verkantet & Module 2 & 88495776TEST \\
2017-04-05 11:01:06 & Laser überhitzt & Module 1 & 88495776TEST \\
2017-04-05 11:01:10 & Laser überhitzt & Module 1 & 88495776TEST \\
2017-04-05 11:01:12 & Auffangbehälter leeren & Module 2 & 88495775TEST \\
2017-04-05 11:01:17 & Unbekannter Ausnahmefehler & Module 0 & 88495775TEST \\
2017-04-05 11:01:17 & Auffangbehälter leeren & Module 2 & 88495775TEST \\
2017-04-05 11:01:19 & Unbekannter Ausnahmefehler & Module 0 & 88495775TEST \\
2017-04-05 11:05:22 & Laser überhitzt & Module 1 & 88495775TEST \\
\multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots}
\end{tabular}
\end{table}
\end{frame}

\begin{frame}{Problem Statement}
Error messages are
\begin{itemize}
\item completely unstructured
\item entirely in German
\item very short, i.e. not complete sentences
\item in part comprehensible only to experts
\end{itemize}
\end{frame}

\begin{frame}{Evaluation}
\begin{itemize}
\item As a further metric for the facility, the \textit{Overall Equipment Efficiency} (OEE) was provided
\item The OEE score is a value between 0 and 1, calculated as follows: $OEE = \frac{POK \cdot CT}{OT}$
\item Anomaly detection was performed on the OEE time series
\item The discovered patterns were then supposed to predict these anomalies
\end{itemize}
\end{frame}

\section{Previous Approaches and Idea}
\begin{frame}{Sequential Pattern Mining}
\begin{itemize}
\item Generating sequences of \textit{frequent patterns} already led to minor successes
\item Unfortunately, however, the patterns found were already known to the technicians with expert knowledge
\end{itemize}
\end{frame}

\begin{frame}{First Approach: A Single Large Graph}
\begin{itemize}
\item There are existing approaches for mining patterns on large graphs (cf. \textit{GRAMI}, Elseidy et al., 2014, and \textit{POSGRAMI}, Moussaoui et al., 2016)
\item An original idea was to use a shortest-path search (Dijkstra) to find longer paths that build on one another and are therefore presumably causally related
\end{itemize}
\end{frame}

\begin{frame}{Depiction of the Large Graph}
\begin{figure}
\centering
\noindent\includegraphics[width=\linewidth]{images/fullgraph}
%\caption{Naive single graph encoding all available information}
\label{fig:fullgraph}
\end{figure}
\end{frame}

\section{gSpan and SVM}
\begin{frame}{Graph Construction}
\begin{itemize}
\item The data were converted into graph form, using knowledge of the facility layout as \textit{constraints}
\item Each graph encodes 5 minutes of information
\item The \textit{gSpan} algorithm is then run on the set of generated graphs to search for patterns
\end{itemize}
\end{frame}

\begin{frame}{Graph Isomorphism}
\begin{itemize}
\item The fundamental problem in graph mining is determining whether two (sub)graphs are isomorphic to each other
\item Def.: Let $G$ and $H$ be graphs. Then $G \simeq H$ iff there exists a bijection $f: V(G) \rightarrow V(H)$ such that for all $u, v \in V(G)$: $(u,v) \in E(G) \Leftrightarrow (f(u), f(v)) \in E(H)$.
\item The subgraph isomorphism problem is NP-complete
\end{itemize}
\end{frame}

\begin{frame}{gSpan}
\begin{itemize}
\item \textit{gSpan} is a pattern-growth algorithm by \textit{Yan and Han} from 2002
\item \textit{gSpan} assigns each graph a canonical label based on DFS traversal (DFS codes)
\item Two graphs with the same label are isomorphic
\item \textit{gSpan} then finds all subgraphs of the elements of a set of graphs that reach a \textit{minimum support threshold} (\textit{min\_sup}).
\end{itemize}
\end{frame}

\begin{frame}{Modification of gSpan}
\begin{itemize}
\item While implementing \textit{gSpan} in Python, it became apparent that the DFS codes work similarly to hashes, but that the data structure used does not make comparison operations very efficient
\item Unfortunately, \textit{gSpan} cannot be switched entirely to comparing plain hashes, because a strict total order must be defined on the set of DFS codes
\end{itemize}
\end{frame}

\begin{frame}{Example DFS Code}
\begin{figure}
\centering
\begin{tikzpicture}[node distance = 2cm]
\tikzset{VertexStyle/.style = {
shape=circle,
draw=black
}}
\node[VertexStyle, label={[label distance=-.2cm]45:\small $v_1$}] (1){X};
\node[VertexStyle, right of= 1, label={[label distance=-.2cm]45:\small $v_2$}] (2){Y};
\node[VertexStyle, right of= 2, label={[label distance=-.2cm]45:\small $v_3$}] (3){Z};
\node[VertexStyle, below of= 2, right of=2, label={[label distance=-.2cm]45:\small $v_4$}] (4){U};
\path [-] (1) edge node[above] {a} (2);
\path [-] (2) edge node[above] {b} (3);
\path [-] (3) edge node[left] {c} (4);
\path [-] (2) edge node[left] {d} (4);
\end{tikzpicture}
%\caption[Graph $G$ from chapter \ref{chapter:theoretical_basis} with labels]{Graph $G$ from chapter \ref{chapter:theoretical_basis} with labels}
\label{fig:example_graph_dfs}
\end{figure}
\begin{table}[h]
\centering
%\caption{Minimum DFS code of graph $G$}
\label{table:dummy_min_dfs_codes}
\begin{tabular}{l|l}
edge no. & DFS code \\ \hline
0 & $(0,2,U,d,Y)$ \\
1 & $(1, 2, X, a, Y)$ \\
2 & $(0, 3, U, c,Z)$ \\
3 & $(2, 3, Y, b, Z)$
\end{tabular}
\end{table}
\end{frame}

\begin{frame}{Pattern-Growth Aspect}
\begin{itemize}
\item When searching for new patterns, \textit{gSpan} reuses the patterns already found
\item Pattern candidates can only attach new edges along the \textit{rightmost path}, which narrows the search space
\end{itemize}
\end{frame}

\begin{frame}{Support Vector Machine}
\begin{itemize}
\item An SVM was used to classify the patterns with respect to the detected anomalies
\item An SVM is a supervised learning model that can assign high-dimensional data points to two classes relatively efficiently
\end{itemize}
\end{frame}

\section{Results}

\begin{frame}{Example Pattern}
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{images/dummy_pattern}
%\caption{8-edge pattern from the synthetic data set, min\_sup = .4}
\label{fig:dumm_pattern}
\end{figure}
\end{frame}

\begin{frame}{Synthetic OEE Curve}

\begin{figure}
\centering
\includegraphics[width=1\linewidth]{images/dummy_random_walk}
%\caption{Synthetic OEE values}
\label{fig:dummyrandomwalk}
\end{figure}
\end{frame}

\begin{frame}{Run Times on Synthetic Data}
\begin{table}[]
\centering
%\caption{Run times and patterns found (synthetic data set)}
\label{table:runtimes_syn}
\begin{tabular}{l|l|l}
data set & \textit{t} & patterns \\ \hline
import errors and graph generation & 1s & \\
import and anomalies detection on OEE scores & 8s & \\ \hline
\textit{gSpan} (min\_sup = .7) & 2s & 40 \\
\textit{gSpan} (min\_sup = .6) & 8s & 106 \\
\textit{gSpan} (min\_sup = .5) & 19s & 241 \\
\textit{gSpan} (min\_sup = .4) & 74s & 1056 \\ \hline
SVM training and validation (min\_sup = .7) & 4s & \\
SVM training and validation (min\_sup = .6) & 8s & \\
SVM training and validation (min\_sup = .5) & 35s & \\
SVM training and validation (min\_sup = .4) & 13m 14s &
\end{tabular}
\end{table}
The validation data set consisted of 49 time windows, 33 of which were deemed as a noticeable drop by the OEE evaluation algorithm. Of these 33, the SVM correctly identified 28 as drops, for a sensitivity score of 84.85\%. Of the remaining 19 non-drops, 5 were falsely identified as positives, for a specificity score of 73.68\%.
\end{frame}

\begin{frame}{Run Times on Real Facility Data}
\begin{table}
\centering
%\caption{Run times and patterns found (facility data set)}
\label{table:runtimes_real}
\begin{tabular}{l|l|l}
data set & \textit{t} & patterns \\ \hline
import errors and graph generation & 50s & \\
import and anomalies detection on OEE & 2m 27s & \\ \hline
\textit{gSpan} (min\_sup = .9) & 2m 20s & 12 \\
\textit{gSpan} (min\_sup = .7) & 6h 27m 12s & 846 \\
\textit{gSpan} (min\_sup = .5) & \textit{OOM killed} & -- \\ \hline
SVM training and validation (min\_sup = .7) & 27s &
\end{tabular}
\end{table}
The validation data set consisted of 486 time windows, 64 of which were deemed as a noticeable drop by the OEE evaluation algorithm. Of these 64, the SVM trained on patterns with a min\_sup of .7 correctly identified 60 as drops, for a sensitivity score of 93.75\%. Of the remaining 422 non-drops, 18 were identified as false positives, for a specificity score of 95.73\%.
\end{frame}

\end{document}

BIN
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/thesis.pdf View File


+ 657
- 0
2018-02-16_efficient_event_classification_through_constrained_subgraph_mining/thesis.tex

@@ -0,0 +1,657 @@
% warning: use pdflatex to compile!
\documentclass[pdftex,12pt,a4paper]{report}
\usepackage{dbstmpl}
\usepackage{subfigure}
\usepackage{graphicx}
\usepackage{url}
\usepackage{tikz}
\usetikzlibrary{arrows,shapes,positioning}
\usepackage{algorithm}
\usepackage{algpseudocode}
\algnewcommand\algorithmicforeach{\textbf{for each}}
\algdef{S}[FOR]{ForEach}[1]{\algorithmicforeach\ #1\ \algorithmicdo}
\usepackage{tabularx}
\global\arbeit{Bachelorarbeit}
\global\titel{Efficient Event Classification through Constrained Subgraph Mining}
\global\bearbeiter{Simon Lackerbauer}
\global\betreuer{Martin Ringsquandl}
\global\aufgabensteller{Prof. Dr. Peer Kr"oger}
\global\abgabetermin{16. Februar 2018}
\global\ort{M"unchen}
\global\fach{Informatik}
\begin{document}
\deckblatt
\erklaerung
\begin{abstract}
In this work, I consider the problem of discovering patterns in error log data that predict common failure modes within a nearly fully automated assembly line. I present a novel approach of encoding error log events in a graph structure and leverage a constraint-based mining method to efficiently discover and score sophisticated patterns in these data, using a self-implemented version of the pattern-growth algorithm \textit{gSpan}. As the algorithm as implemented does not scale quite as well as many traditional sequential pattern mining approaches, outside expert knowledge should be used to keep the input data to a manageable size and help with graph construction.
\end{abstract}
\tableofcontents
\chapter{Introduction}
The ability to analyze log files is a crucial part of the work of systems administrators and software developers. Developers often enough outright provoke the generation of detailed logs while debugging a piece of software, while system and network operators are routinely pulled from their efforts to fine-tune services by incessant alerts from their monitoring systems.
Indeed, transaction log files, a common subgroup of the more general event logs, have been used to diagnose errors in the workings of data processing systems nearly since their inception, and dedicated research efforts into the analysis of transaction log files can be traced back to at least the mid-1960s.\cite{Peters1993-fw}
With the decline of monolithic service architectures, and the recent rise of complex systems of interdependent microservices,\cite{Dragoni2016-fh} the accurate reading of log files and recognition of the patterns within has become more important than ever. Nowadays, a plethora of log file management and search tools exist as both open source and commercially licensed tooling.\cite{noauthor_undated-bv}\cite{noauthor_undated-zu}\cite{noauthor_undated-vl}\cite{noauthor_undated-xi} More sophisticated log file analysis that goes beyond simple exploratory data analysis and simple monitoring and alerting systems is, however, not usually the focus of these tools.
Modern microservice architectures are not the only systems that need powerful log file analysis tools to understand where bottlenecks and emergent properties of the architecture might stem from. With this bachelor's thesis, I therefore want to apply log file analysis techniques to a slightly more traditional, yet similarly highly modularized, architecture: an automated production line.
Modern automated production lines usually consist of highly sophisticated robotic modules that assemble each produced unit with an efficiency and consistency that would be virtually impossible for human workers to achieve. However, at the same time, such automated systems are less tolerant of errors accumulating along the way and might stop working for relatively mundane reasons like a part being slightly out of position on the assembly line. Unlike a human worker, a specialized robot cannot solve most of these problems by itself. Depending on the problems encountered, the amount of time and (human) effort needed to deal with them can drive up the cost of running the line enormously, maybe even up to the point of operating at a loss.
If problems like these crop up consistently, it is only natural to assume a common cause behind failures propagated along the assembly line,\cite{Ringsquandl2016-en} and it can be assumed that, by identifying such causes early on and mitigating them, the propagation of failures might be reduced or entirely averted. The operators of the production line analyzed in this thesis likewise aimed to optimize its efficiency.
Thus, the aim of this thesis will be to work on this problem using a graph mining based approach, starting with an overview of the literature (chapter \ref{chapter:related_work}) and the theoretical basis of graph mining (chapter \ref{chapter:theoretical_basis}), reflecting on how to extract a graph data structure from the data available as a log table (chapter \ref{section:data_set_splicing}), how best to mine the resulting graph or graphs for patterns (chapter \ref{section:gspan}), and eventually how to assess the resulting patterns (chapter \ref{subsection:evaluation}).
\chapter{Related work}
\label{chapter:related_work}
Leveraging a graph mining based approach for mining log data constitutes a relatively novel use for most of the algorithms implemented in this work. Graph mining as a concept has, up to this point, mostly been used on data that naturally lend themselves to a graph or network based data structure, such as social networks, human relationship networks in general, chemical compound analysis, or link networks.\cite{Washio2003-fc}\cite{Han2007-qx} Meanwhile, mostly sequential pattern mining approaches have been leveraged against access or error log based data, such as the \textit{basket data} collected by most large retailers nowadays\cite{Agrawal1993-nc} or \textit{web access logs}\cite{Pei2000-rz}. The \textit{gSpan} algorithm was originally tested on synthetic graph data as well as for mining chemical compound data.\cite{Yan2002-sj}
Most of the above mentioned naturally occurring graph data don't include time as a feature having measurable impact on the depicted relations. Chemical compounds may change and degrade over time, but their graph representations usually portray their idealized form. Social networks have also been of interest as dynamic processes themselves,\cite{Kossinets2006-rw} but these analyses tend to focus on snapshots of the full network after longer periods of time, whereas the data set examined in this work is a classical time series, with each row explicitly time stamped down to the second.
\section{Sequential pattern mining}
Sequential pattern analysis is a staple approach in time series mining. The term was first introduced and defined by Agrawal and Srikant\cite{Agrawal1994-ca} as follows:
\begin{quotation}
[G]iven a sequence database where each sequence is a list of transactions ordered by transaction time and each transaction consists of a set of items, find all sequential patterns with a user-specified minimum support, where the support is the number of data sequences that contain the pattern.
\end{quotation}
More formally, let $I = \{i_0, i_1, \ldots, i_n\}$ be the set of all items. Then a $k$-itemset $I^*$, which consists of $k$ items from $I$, is said to be \textit{frequent} if it occurs in a transaction database $D$ no less than $\theta|D|$ times, where $\theta$ is a user-specified \textit{minimum support threshold} (often abbreviated \textit{min\_sup}).
Mining sequential patterns can take the form of a-priori algorithms (see subsection \ref{subsection:apriori}), like Srikant and Agrawal's GSP,\cite{Srikant1996-dy} or of pattern-growth algorithms, like Yan and Han's \textit{gSpan}\cite{Yan2002-sj} used in this work.
\section{Classical graph mining approaches}
The classical graph mining approach often focuses on obtaining general structural information about a network. As an example, the mining of social networks often reveals an overall structure of almost-leaves (e.g. friend- or kinship groups) being connected via so-called ``multiplicators'' or ``influencers'' -- nodes with a high degree centrality\cite{Newman2010-ac} that are also interconnected with each other, together acting as the central cluster in a basically star-shaped network, which offers an explanation for the small world problem\cite{Travers1967-cn}.
Meanwhile, for this work, the overall structure of the graph (the dependency network between modules and inputs for the assembly line) is known beforehand, and even formally defined. Analysis of the design of the found substructures -- viz. \textit{what} the pattern symbolizes as opposed to \textit{finding} it in the first place -- is indeed only of interest after most of the work is already done.
\subsection{A-priori based approach}
\label{subsection:apriori}
The a-priori based paradigm of frequent pattern mining is a heuristic that generates a reduced set of candidate patterns in each iteration.\cite{Pei2000-rz} The a-priori principle, on which these approaches are based, states that \textit{any super-pattern of an infrequent pattern cannot be frequent}.\cite{Han2004-qs} Thus, these algorithms first generate the set of all frequent 1-element sequences. From that, they generate new candidate sequences step by step. For example, if the patterns \textit{A} and \textit{B} are each frequent according to a specific \textit{min\_sup}, then the pattern \textit{AB} might be frequent as well. These generated candidates are then tested and discarded if and only if they don't reach the given \textit{min\_sup} (cf. algorithm \ref{alg:apriori}, where $T$ is the transaction database and $C_k$ is the candidate set for sequence length $k$).
\begin{algorithm}
\caption[APriori($T,min\_sup$)]{APriori($T,min\_sup$)\cite{Agrawal1994-ca}}\label{alg:apriori}
\begin{algorithmic}[1]
\State {$L_1 \gets$ \{large 1-itemsets\};}
\State {$k \gets 2$}
\While {$L_{k-1} \neq \emptyset$}
\State {$C_k \gets \{ a \cup \{b\} \mid a \in L_{k-1} \land b \not \in a \} - \{ c \mid \{ s \mid s \subseteq c \land |s| = k-1 \} \nsubseteq L_{k-1} \}$}
\ForEach {$t \in T$}
\State {$C_t \gets \{ c \mid c \in C_k \land c \subseteq t \}$}
\ForEach {$c \in C_t$}
\State {$count[c] \gets count[c] + 1$}
\EndFor
\EndFor
\State {$L_k \gets \{c|c \in C_k \land count[c] \geq min\_sup\}$}
\State {$k \gets k+1$}
\EndWhile \\
\Return {$\bigcup_k L_k$;}
\end{algorithmic}
\end{algorithm}
Well-studied a-priori-based algorithms include the already mentioned GSP\cite{Srikant1996-dy}, SPADE\cite{Zaki2001-jy}, or HitSet\cite{Han1999-bj}. As a-priori-based approaches have to search through the input data at least once for every candidate pattern, the best a-priori algorithms achieve runtime efficiencies of $O(n^2)$.
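The candidate generation, pruning, and counting steps of algorithm \ref{alg:apriori} can be sketched in a few lines of Python. This is a minimal illustration for itemsets, not the implementation used in this work; the function name \texttt{apriori} and the parameter \texttt{min\_count} (an absolute support count, as in line 9 of the pseudocode) are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return every itemset contained in at least `min_count` transactions.

    Hypothetical helper for illustration: `transactions` is an iterable of
    item collections, `min_count` an absolute support threshold.
    """
    transactions = [frozenset(t) for t in transactions]

    # L_1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {c for c, n in counts.items() if n >= min_count}
    frequent = set(current)

    k = 2
    while current:
        # Join step: extend each frequent (k-1)-itemset by one further item.
        items = set().union(*current)
        candidates = {a | {b} for a in current for b in items if b not in a}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Counting step: one pass over the transaction database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c for c, n in counts.items() if n >= min_count}
        frequent |= current
        k += 1
    return frequent
```

Calling \texttt{apriori} on a handful of small transactions returns the frequent itemsets of all sizes; the pruning step is what keeps a candidate such as $\{A, B, C\}$ from being counted at all unless all of its 2-item subsets are frequent.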
\subsection{Pattern-growth based approach}
In contrast, the pattern-growth approach adopts a divide-and-conquer principle as follows: \textit{sequence databases are recursively projected into a set of smaller projected databases based on the current sequential pattern(s), and sequential patterns are grown in each projected database by exploring only locally frequent fragments}.\cite{Han2004-qs} The \textit{gSpan}-algorithm used in this work is, like its predecessors \textit{FreeSpan} and \textit{PrefixSpan}, such a pattern-growth based approach.
\chapter{Theoretical basis}
\label{chapter:theoretical_basis}
The proposed event classification method mines a graph set $\mathcal{D} = \{G_0, \ldots, G_n\}$ for substructures of interest, often called ``patterns'' in the literature. To establish a firm theoretical understanding of what these structures are, some definitions of graph theory are in order.
\section{Graph}
Traditionally, a graph is represented as a set of vertices $V$ and a set of edges $E$, with a mapping $f : E \rightarrow V \times V$ that defines each edge as a pair of vertices. In directed graphs, this pair is ordered. In undirected graphs, it is unordered, so that $\forall v_i, v_j: (v_i, v_j) \in E \rightarrow (v_j, v_i) \in E$. In this work, the vertex set $V$ of a graph $G$ may also be denoted $V(G)$. Likewise, the edge set may be denoted $E(G)$. To encode these structures in software, adjacency matrices and edge lists are commonly used. The following example adjacency matrix (table \ref{tab:example_adj_mat}) and visual representation (figure \ref{fig:example_graph}) both encode the same undirected graph $G = \left(\{v_1, v_2, v_3, v_4\}, \{(v_1, v_2), (v_2,v_3), (v_3, v_4), (v_4, v_2)\}\right)$.
\begin{table*}
\caption{Adjacency matrix for graph $G$}
\label{tab:example_adj_mat}
\centering
\begin{tabular}{l|cccc}
& $v_1$ & $v_2$ & $v_3$ & $v_4$ \\ \hline
$v_1$ & 0 & 1 & 0 & 0 \\
$v_2$ & 1 & 0 & 1 & 1 \\
$v_3$ & 0 & 1 & 0 & 1 \\
$v_4$ & 0 & 1 & 1 & 0
\end{tabular}
\end{table*}
\begin{figure}
\centering
\begin{tikzpicture}[node distance = 2cm]
\tikzset{VertexStyle/.style = {
shape=circle,
draw=black
}}
\node[VertexStyle] (1){$v_1$};
\node[VertexStyle, right of= 1] (2){$v_2$};
\node[VertexStyle, right of= 2] (3){$v_3$};
\node[VertexStyle, below of= 2, right of=2] (4){$v_4$};
\path [-] (1) edge node[above] {} (2);
\path [-] (2) edge node[above] {} (3);
\path [-] (3) edge node[left] {} (4);
\path [-] (2) edge node[left] {} (4);
\end{tikzpicture}
\caption[Visual representation of graph $G$]{Visual representation of graph $G$}
\label{fig:example_graph}
\end{figure}
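As a small illustration of the two software encodings mentioned above, the following Python sketch stores graph $G$ as an edge list and derives its adjacency matrix from it. The variable and function names are chosen for this example only.

```python
# Edge-list encoding of the undirected example graph G.
vertices = ["v1", "v2", "v3", "v4"]
edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v4", "v2")]


def adjacency_matrix(vertices, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    index = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for u, v in edges:
        # Undirected graph: set both (u, v) and (v, u).
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return matrix


for row in adjacency_matrix(vertices, edges):
    print(row)
```

The printed rows correspond to the rows of table \ref{tab:example_adj_mat}. An edge list is usually the more compact choice for sparse graphs, while the matrix allows constant-time adjacency lookups.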
\section{Subgraph}
Let $G_s = (V_s, E_s)$ and $G$ be two graphs with $V_s \subseteq V(G)$ and $E_s \subseteq E(G)$. Then, if the following holds:
\[ \forall (v_i, v_j) \in E_s : v_i, v_j \in V_s, \]
$G_s$ is said to be a subgraph of $G$. Figure \ref{fig:example_subgraph} illustrates an example subgraph of $G$ with three of $G$'s four vertices and two of its four edges. Note that the graphical representation need not be drawn in the same way to depict a subgraph.
\begin{figure}[h]
\centering
\begin{tikzpicture}[node distance = 2cm]
\tikzset{VertexStyle/.style = {
shape=circle,
draw=black
}}
\node[VertexStyle] (2){$v_2$};
\node[VertexStyle, right of= 2, below of=2] (3){$v_3$};
\node[VertexStyle, right of=3, above of=3] (4){$v_4$};
\path [-] (2) edge node[above] {} (3);
\path [-] (3) edge node[left] {} (4);
\end{tikzpicture}
\caption{Example subgraph $G_s$ of $G$}
\label{fig:example_subgraph}
\end{figure}
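The subgraph condition can be checked mechanically. The following Python sketch is an illustration under the assumption that undirected edges are given as vertex pairs; the function name \texttt{is\_subgraph} is hypothetical.

```python
def is_subgraph(sub_vertices, sub_edges, vertices, edges):
    """Check whether (sub_vertices, sub_edges) is a subgraph of (vertices, edges).

    Edges are treated as undirected, so (u, v) and (v, u) are the same edge.
    """
    def undirected(edge_list):
        return {frozenset(e) for e in edge_list}

    sub_v = set(sub_vertices)
    sub_e = undirected(sub_edges)
    return (
        sub_v <= set(vertices)              # V_s is a subset of V(G)
        and sub_e <= undirected(edges)      # E_s is a subset of E(G)
        # every edge of the subgraph connects vertices of the subgraph
        and all(endpoint in sub_v for e in sub_e for endpoint in e)
    )


g_vertices = ["v1", "v2", "v3", "v4"]
g_edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v4", "v2")]
print(is_subgraph(["v2", "v3", "v4"], [("v2", "v3"), ("v3", "v4")], g_vertices, g_edges))
```

The call in the last line uses the subgraph $G_s$ from figure \ref{fig:example_subgraph}.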
\section{(Sub-)Graph isomorphism}
Two graphs $G$ and $H$ are said to be isomorphic, written $G \simeq H$, if there exists a bijection $f: V(G) \rightarrow V(H)$ such that any two vertices $u$ and $v$ are adjacent in $G$ if and only if $f(u)$ and $f(v)$ are adjacent in $H$. The graph isomorphism problem is in NP, but it is currently unknown whether it can be solved in polynomial time or is NP-complete.\cite{Fortin1996-la}
The subgraph isomorphism problem is the decision problem of whether, when given two graphs $G$ and $H$, there exists a subgraph in $G$ that is isomorphic to $H$. The subgraph isomorphism problem is known to be NP-complete.\cite{Cook1971-fj}
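To make the isomorphism definition concrete, the following Python sketch tests two small undirected graphs by brute force, trying every bijection between the vertex sets. It is meant purely as an illustration: its running time grows factorially with the number of vertices, and it is not how \textit{gSpan} compares graphs. The function name and the assumption of graphs without self-loops are choices made for this example.

```python
from itertools import permutations


def are_isomorphic(vertices_g, edges_g, vertices_h, edges_h):
    """Brute-force isomorphism test for small undirected graphs without self-loops."""
    vg, vh = list(vertices_g), list(vertices_h)
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    if len(vg) != len(vh) or len(eg) != len(eh):
        return False
    for image in permutations(vh):
        f = dict(zip(vg, image))  # candidate bijection V(G) -> V(H)
        # f preserves adjacency in both directions iff the mapped edge set equals E(H)
        if {frozenset(f[x] for x in e) for e in eg} == eh:
            return True
    return False


g = (["v1", "v2", "v3", "v4"], [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v4", "v2")])
h = (["a", "b", "c", "d"], [("a", "b"), ("a", "c"), ("a", "d"), ("c", "d")])
print(are_isomorphic(*g, *h))
```

Here the hypothetical graph \texttt{h} is the example graph $G$ with renamed vertices, so a suitable bijection exists.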
\chapter{Methodology}
Considering time both as an explicit attribute of nodes and edges and as implicitly encoded in the graph structure, and as such still present in subsequently mined subgraphs, constitutes an approach to exploring log data through graph mining that has not been examined extensively before.
\section{Disregarded approaches}
Before homing in on analyzing the given data set with \textit{gSpan} and a \textit{Support Vector Machine}, some approaches that yielded lower-quality results were tried. Even though these approaches didn't prove fruitful in the long run, they helped me understand the data set and its underlying structure better and as such are, with their basic premises, included here to give a full account of all measures taken.
\subsection{Shortest path algorithms on a single graph}
\begin{figure}
\centering
\noindent\includegraphics[width=\linewidth]{images/fullgraph}
\caption{Naive single graph encoding all available information}
\label{fig:fullgraph}
\end{figure}
As the manufacturing process consists of a circuit of interconnected modules, the first approach to translating the given data into graph form was to build one single large graph containing all available information around this circuit of module nodes, with the intention of later mining frequent patterns from this single large graph, using algorithms such as Elseidy et al.'s \textit{GRAMI}\cite{Elseidy2014-fz} or Moussaoui et al.'s \textit{POSGRAMI}\cite{Moussaoui2016-ng}.
To encode the log error data on top of this framework, the error messages were split into terms (essentially words) and each term made a node. Edges were drawn between all term-nodes in a given message, as well as between all term-nodes and the module-node they occurred in, as well as between all term-nodes and the production-unit-ID-node. This first naive approach of visualizing the available data produced a graph consisting of 523 nodes connected by 2,182 edges (illustrated in figure \ref{fig:fullgraph}). Edges were weighted simply by a count of how often they appeared throughout the whole data set, with some edges only appearing once and a maximum weight of over 10,000 appearances.
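The edge construction described above can be sketched as follows. This is an illustration only: the row layout (message, module ID, part ID), the whitespace-based term splitting, and the function name \texttt{build\_weighted\_graph} are assumptions, and the module circuit itself is omitted.

```python
from collections import Counter
from itertools import combinations


def build_weighted_graph(rows):
    """Count undirected edges between term, module, and part-ID nodes.

    `rows` is assumed to be an iterable of (message, module_id, part_id)
    tuples. Nodes are (kind, value) pairs so that a term can never collide
    with a module or part ID of the same name.
    """
    weights = Counter()
    for message, module_id, part_id in rows:
        terms = sorted(set(message.lower().split()))
        # Edges between all term nodes of one message.
        for a, b in combinations(terms, 2):
            weights[frozenset((("term", a), ("term", b)))] += 1
        # Edges from every term to the module and to the production unit.
        for term in terms:
            weights[frozenset((("term", term), ("module", module_id)))] += 1
            weights[frozenset((("term", term), ("part", part_id)))] += 1
    return weights


rows = [
    ("laser overheated", "Module 1", "88495775TEST"),
    ("laser overheated", "Module 1", "88495776TEST"),
    ("part jammed", "Module 2", "88495776TEST"),
]
graph = build_weighted_graph(rows)
print(len(graph), "edges")
```

The three hypothetical rows mimic the structure of the log table; the resulting counter maps each edge to the naive weight, i.e. the number of times the edge appeared.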
This first graph did not, to a first approximation, retain its expected circuit-like appearance, instead clustering heavily around a few modules and production unit IDs that produced the most errors, meaning there were parts in the system vastly more error prone than others.
In a second step, the Dijkstra shortest paths algorithm was used to find all paths between nodes that had a full path length of less than a specific constant; these paths were afterwards ordered by path length. For example, a short 4-edge path could connect two modules and the part ID via two error terms, indicating some kind of correlation, much like one that a sequential pattern analysis could find.
For Dijkstra to work, the naive edge weight for edge $i$ ($w_{i_{naive}}$) had to be transformed from simple counts to a normalized form that would retain comparison operations between edges, but invert path lengths (as the Dijkstra algorithm is used, as its name suggests, to find the \emph{shortest paths}). This was achieved by the following transformation being calculated after graph generation, with~$w_{all}~=~\sum_{j=0}^{n} w_{j_{naive}}$:
\[w_{i_{normalized}} = -\ln \left( \frac{w_{i_{naive}}}{w_{all}} \right) \]
This weight was later penalized further by adding time constraints, so that
\[w_{i_{penalized}} = w_{i_{normalized}} + P(i)\]
where
\[P(i) = \begin{cases}
\log_{\tilde{\Delta t}} \frac{\Delta t}{C(\Delta t) \cdot \tilde{\Delta t}}, & \text{if }C(\Delta t), \tilde{\Delta t} > 0 \\
0, & \text{otherwise}
\end{cases}\]
Here, $\Delta t$ is the amount of time between the rows being connected by the edge, $C(\Delta t)$ is the number of connections between these nodes, and $\tilde{\Delta t}$ is the average time differential between rows overall. This penalized long times between events (indicating that the events weren't correlated), while rewarding events that happened together more often.
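As an illustration, the two weight transformations can be written down directly. The function names are hypothetical, and the extra guards for $\Delta t \leq 0$ and $\tilde{\Delta t} = 1$ (where the logarithm is undefined) are assumptions of this sketch, not part of the definition above.

```python
import math


def normalized_weight(count, total):
    """-ln(count / total): frequent edges receive small (short) weights."""
    return -math.log(count / total)


def time_penalty(delta_t, connections, mean_delta_t):
    """P(i) from the case distinction above.

    Returns 0 when C(delta_t) or the mean time differential is not positive.
    Assumption of this sketch: also return 0 when delta_t <= 0 or the mean
    equals 1, because the logarithm is undefined in those cases.
    """
    if connections <= 0 or mean_delta_t <= 0:
        return 0.0
    if delta_t <= 0 or mean_delta_t == 1:
        return 0.0
    # Logarithm to base mean_delta_t of delta_t / (C * mean_delta_t).
    return math.log(delta_t / (connections * mean_delta_t), mean_delta_t)


def penalized_weight(count, total, delta_t, connections, mean_delta_t):
    """Sum of the normalized weight and the time penalty."""
    return normalized_weight(count, total) + time_penalty(
        delta_t, connections, mean_delta_t
    )
```

With a mean differential of 10 seconds, a single connection 100 seconds apart is penalized by about 1, while ten connections 10 seconds apart are rewarded by about $-1$. A sufficiently large reward can push the summed weight below zero, which Dijkstra's algorithm does not tolerate; the sketch does not address this case.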
\begin{figure}
\centering
\noindent\includegraphics[width=\linewidth]{images/example_pattern}
\caption{Large example pattern}
\label{fig:example_patterns}
\end{figure}
\begin{figure}
\centering
\noindent\includegraphics[width=\linewidth]{images/small_pattern}
\caption{Smaller example pattern}
\label{fig:small_pattern}
\end{figure}
\subsection{Digraph}
Using a directed graph was briefly considered, but in some first tests it did not lead to patterns noticeably different from those of the undirected graph. This was almost expected: because the graphs were constructed much like dependency graphs, their connections were highly predictable.
If, for example, in one slice, specific terms had a directed edge \textit{towards} a specific module, then this basic structure would be repeated in other slices as well, by virtue of a construction that always pointed terms towards modules. Given this background knowledge used during graph construction, the simple connectedness expressed by an edge in an undirected graph already served this purpose implicitly.
\subsection{Natural language processing}
The idea of combining both techniques of graph mining and a natural language processing unit to make sense of the actual contents of the error logs was briefly entertained as well. Good, easily available NLP software for languages other than English is still quite hard to come by, however. In addition, the analyzed error logs were often very short and technical, and thus didn't lend themselves to an actual content based analysis. Even to native speakers, many of the messages would've been quite cryptic and so it was decided that the processing power that would've been needed for an NLP based approach would be better used elsewhere.
\section{Data set splicing}
\label{section:data_set_splicing}
The investigated data set consisted of about 57,000 messages logged over the course of five consecutive days in October 2016. Available columns were a time stamp (precision: 1s), an unstructured log message in German, the module ID where the message originated and the part ID of the produced item in that run. Messages sometimes included additional partly structured data, like a more detailed report of the location where the error occurred included in the German log message. See table \ref{table:dummy_messages} for an example of the data structure and a part of the synthetic data set used for the testing setup in section \ref{section:synthetic_data_results}.
\begin{table}[]
\centering
\caption{Synthetic data set (excerpt)}
\label{table:dummy_messages}
\footnotesize
\begin{tabular}{l|l|l|l}
time stamp & log message & module id & part id \\ \hline
2017-04-05 11:01:05 & Laser "uberhitzt & Module 1 & 88495775TEST \\
2017-04-05 11:01:05 & Laser "uberhitzt & Module 1 & 88495776TEST \\
2017-04-05 11:01:06 & Teil verkantet & Module 2 & 88495776TEST \\
2017-04-05 11:01:06 & Laser "uberhitzt & Module 1 & 88495776TEST \\
2017-04-05 11:01:10 & Laser "uberhitzt & Module 1 & 88495776TEST \\
2017-04-05 11:01:12 & Auffangbeh"alter leeren & Module 2 & 88495775TEST \\
2017-04-05 11:01:17 & Unbekannter Ausnahmefehler & Module 0 & 88495775TEST \\
2017-04-05 11:01:17 & Auffangbeh"alter leeren & Module 2 & 88495775TEST \\
2017-04-05 11:01:19 & Unbekannter Ausnahmefehler & Module 0 & 88495775TEST \\
2017-04-05 11:05:22 & Laser "uberhitzt & Module 1 & 88495775TEST \\
\multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots}
\end{tabular}
\end{table}
The data had to be cleaned up slightly prior to a first cursory visual analysis. A large share of the total message count consisted of general error messages that, without expert knowledge, seemed meaningless: they contained only an error code and no further explanation. These messages existed in 4 slightly different formats and arrived in bursts of about 30 messages with the same time stamp, with the error code incremented by one with each message. Each such burst was replaced with a single ``general error'' message with the same time stamp. After this preprocessing step, the number of messages to be considered had roughly halved.
Further, if the module ID and manufacturing unit ID were included in the message in one of several standardized ways, they were extracted and given their own column in the data set, to prevent random integers cropping up in the messages from being mistaken for, e.g., a module ID.
To generate the graph set to be mined, the week's log data was spliced along 5-minute time frames, producing graphs of a size like the one shown in figure \ref{fig:partial_graph}. Two different modules throwing errors in this specific time window can be easily distinguished. A 3-minute window was briefly considered, but the resulting graphs were, almost surprisingly, much smaller overall, so that the mined patterns were fewer in number and not open to any meaningful interpretation. In general, windows shorter than 5 minutes mostly led to many more small graphs, so that no long pattern could conceivably ever reach any useful \textit{min\_sup} threshold.
\begin{figure}
\centering
\noindent\includegraphics[width=\linewidth]{images/partial_graph}
\caption{Graph of a 5-minute log data window}
\label{fig:partial_graph}
\end{figure}
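The splicing step can be sketched in a few lines of standard-library Python. The row layout (time stamp, message, module ID, part ID) follows table \ref{table:dummy_messages}; the function name \texttt{slice\_rows} and the alignment of windows to full 5-minute clock boundaries are assumptions of this sketch.

```python
from collections import defaultdict
from datetime import datetime

WINDOW_SECONDS = 5 * 60
EPOCH = datetime(1970, 1, 1)


def slice_rows(rows, window_seconds=WINDOW_SECONDS):
    """Group (timestamp, message, module, part) rows into fixed time windows.

    Returns a list of row lists, ordered by window start time.
    """
    slices = defaultdict(list)
    for row in rows:
        ts = datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
        # Integer index of the window this time stamp falls into.
        key = int((ts - EPOCH).total_seconds()) // window_seconds
        slices[key].append(row)
    return [slices[key] for key in sorted(slices)]


rows = [
    ("2017-04-05 11:01:05", "Laser überhitzt", "Module 1", "88495775TEST"),
    ("2017-04-05 11:01:06", "Teil verkantet", "Module 2", "88495776TEST"),
    ("2017-04-05 11:05:22", "Laser überhitzt", "Module 1", "88495775TEST"),
]
windows = slice_rows(rows)
```

Each resulting row list would then be turned into one graph of the graph set.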
\section{Basic graph construction}
Extracting a basic graph structure from the input is both one of the least computationally intensive steps in the proposed methodology and the most important one. The analyzed data sets don't lend themselves to a natural scheme; background knowledge about the facility had to play an important part in determining the basic structure of the graphs.
As previously mentioned, the constructed graphs heavily relied on known dependencies between error messages, modules and part IDs. Figure \ref{fig:example_slice} illustrates the common connections between the full error message \textit{Z}, its terms \textit{T}, the error logging module \textit{M} and the specific part ID \textit{S}. Error messages could be connected by using common words between them, through, e.g., a further narrowed down standardized localization term, while modules and part ids would be connected to all error messages and their terms produced with their involvement.
\begin{figure}
\centering
\begin{tikzpicture}[node distance = 2cm]
\tikzset{VertexStyle/.style = {
shape=circle,
draw=black
}}
\node[VertexStyle] (1) {Z};
\node[VertexStyle, left of=1, above of=1] (2) {T};
\node[VertexStyle, right of=1, above of=1] (3) {T};
\node[VertexStyle, left of=1, below of=1] (4) {T};
\node[VertexStyle, right of=1, below of=1] (5) {T};
\node[VertexStyle, left of=2, below of=2] (6) {M};
\node[VertexStyle, right of=5, above of=5] (7) {S};
\node[right of=7, above of=7] (8) {$\dots$};
\node[left of=6, below of=6] (9) {$\dots$};
\path [-] (1) edge node[left] {} (2);
\path [-] (1) edge node[left] {} (3);
\path [-] (1) edge node[left] {} (4);
\path [-] (1) edge node[left] {} (5);
\path [-] (3) edge node[left] {} (2);
\path [-] (4) edge node[left] {} (2);
\path [-] (5) edge node[left] {} (3);
\path [-] (5) edge node[left] {} (4);
\path [-] (2) edge node[left] {} (6);
\path [-] (3) edge node[left] {} (6);
\path [-] (4) edge node[left] {} (6);
\path [-] (5) edge node[left] {} (6);
\path [-] (2) edge node[left] {} (7);
\path [-] (3) edge node[left] {} (7);
\path [-] (4) edge node[left] {} (7);
\path [-] (5) edge node[left] {} (7);
\path [-] (7) edge node[left] {} (8);
\path [-] (6) edge node[left] {} (9);
\end{tikzpicture}
\caption[Example slice]{Example slice}
\label{fig:example_slice}
\end{figure}
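The connection scheme of figure \ref{fig:example_slice} can be sketched as a small edge-counting routine. The node tagging with the letters Z, T, M and S mirrors the figure; whitespace tokenization and the function name \texttt{build\_slice\_graph} are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def build_slice_graph(rows):
    """Return a Counter mapping undirected edges to occurrence counts.

    rows: iterable of (timestamp, message, module, part) tuples.
    Nodes are (kind, value) pairs with kind in {"Z", "T", "M", "S"}.
    """
    edges = Counter()
    for _timestamp, message, module, part in rows:
        terms = sorted(set(message.split()))  # assumption: split on whitespace
        message_node = ("Z", message)
        module_node = ("M", module)
        part_node = ("S", part)
        for term in terms:
            term_node = ("T", term)
            edges[frozenset((message_node, term_node))] += 1
            edges[frozenset((term_node, module_node))] += 1
            edges[frozenset((term_node, part_node))] += 1
        # Terms occurring in the same message are connected to each other.
        for a, b in combinations(terms, 2):
            edges[frozenset((("T", a), ("T", b)))] += 1
    return edges


graph = build_slice_graph(
    [("2017-04-05 11:01:06", "Teil verkantet", "Module 2", "88495776TEST")]
)
```

Storing edges as \texttt{frozenset} pairs keeps the graph undirected, in line with the decision against a digraph.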
\subsection{Using background knowledge during model construction}
In the case of production facilities, the basic structure of the facility itself, even more so in a modularized system like the one on hand, has to be integrated into the basic graph scheme. Considering inherent parallelisms in the system (like two modules working in parallel), already places a few constraints on the resulting input graphs. Constraints were mostly input manually into the system prior to graph construction.
\section{Features of interest}
As mentioned before, the production facility doesn't run at peak performance most of the time. As such, the main interest of this analysis was whether specific events or sequences of events would be able to predict a decrease in the OEE figure. A sequence analysis performed by my advisor some time before this thesis had already yielded some preliminary results in this direction, which were considered known problems by the experts. Stumbling upon a pattern which would be considered a novel mechanical problem to solve would be the prime result of this work.
\section{Modified gSpan}
\label{section:gspan}
\textit{gSpan} was first introduced by Yan and Han in 2002.\cite{Yan2002-sj}\cite{Yan2002-hg} \textit{gSpan} leverages depth-first search (DFS) to map graphs to minimum DFS codes, which are a canonical lexicographic graph labeling method. As all isomorphic graphs have the same canonical label, once computation of the labels is completed, it's trivially easy to solve the isomorphism question for any two graphs by comparing their canonical labels. If those labels are available in a lexicographic format, their comparison in itself is also trivially achievable by simple string comparison. The modification of \textit{gSpan} in this work is based on adding a hash function to make the lexicographic comparison faster.
Introducing a hashing algorithm obviously introduces the risk of collisions between hashes, so that the mapping from graphs to (hashed) canonical labels is no longer injective: two non-isomorphic graphs could, in principle, receive the same hash. In the case of structured or semi-structured operator-controlled inputs, however, the theoretical possibility of collisions because of the hashing shouldn't be a huge concern.
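A minimal sketch of such a hashed comparison is given below. The concrete hash function is not fixed by the description above; SHA-1 from the standard library and JSON serialization are assumptions of this example, as is the function name \texttt{dfs\_code\_hash}.

```python
import hashlib
import json


def dfs_code_hash(dfs_code):
    """Map a DFS code (sequence of 5-tuples) to a fixed-length hex digest.

    JSON is used so that the serialization is unambiguous even if labels
    contain separator characters (assumption of this sketch).
    """
    serialized = json.dumps([list(edge) for edge in dfs_code], separators=(",", ":"))
    return hashlib.sha1(serialized.encode("utf-8")).hexdigest()


code_a = [(0, 2, "U", "d", "Y"), (1, 2, "X", "a", "Y")]
code_b = [(0, 2, "U", "d", "Y"), (1, 2, "X", "a", "Y")]
code_c = [(0, 2, "U", "d", "Y"), (1, 2, "X", "b", "Y")]

# Equal codes always hash equally; unequal codes differ unless they collide.
same_label = dfs_code_hash(code_a) == dfs_code_hash(code_b)
different_label = dfs_code_hash(code_a) != dfs_code_hash(code_c)
```

Comparing two digests of constant length replaces the tuple-by-tuple comparison of two DFS codes of arbitrary length.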
\subsection{DFS Lexicographic Order}
To demonstrate the construction of a minimum DFS code, we're again using the example graph from chapter \ref{chapter:theoretical_basis}, this time with additional labels for nodes and edges as visualized in figure \ref{fig:example_graph_dfs}.
\begin{figure}
\centering
\begin{tikzpicture}[node distance = 2cm]
\tikzset{VertexStyle/.style = {
shape=circle,
draw=black
}}
\node[VertexStyle, label={[label distance=-.2cm]45:\small $v_1$}] (1){X};
\node[VertexStyle, right of= 1, label={[label distance=-.2cm]45:\small $v_2$}] (2){Y};
\node[VertexStyle, right of= 2, label={[label distance=-.2cm]45:\small $v_3$}] (3){Z};
\node[VertexStyle, below of= 2, right of=2, label={[label distance=-.2cm]45:\small $v_4$}] (4){U};
\path [-] (1) edge node[above] {a} (2);
\path [-] (2) edge node[above] {b} (3);
\path [-] (3) edge node[left] {c} (4);
\path [-] (2) edge node[left] {d} (4);
\end{tikzpicture}
\caption[Graph $G$ from chapter \ref{chapter:theoretical_basis} with labels]{Graph $G$ from chapter \ref{chapter:theoretical_basis} with labels}
\label{fig:example_graph_dfs}
\end{figure}
Mapping this graph to a minimum DFS code using algorithm \ref{alg:mindfscode} yields the minimum DFS code in table \ref{table:dummy_min_dfs_codes}.
\begin{table}[h]
\centering
\caption{Minimum DFS code of graph $G$}
\label{table:dummy_min_dfs_codes}
\begin{tabular}{l|l}
edge no. & DFS code \\ \hline
0 & $(0,2,U,d,Y)$ \\
1 & $(1, 2, X, a, Y)$ \\
2 & $(0, 3, U, c,Z)$ \\
3 & $(2, 3, Y, b, Z)$
\end{tabular}
\end{table}
\begin{algorithm}
\caption{MinDFSCode($G$)}\label{alg:mindfscode}
\begin{algorithmic}[1]
\State {initiate $S \gets \emptyset$}
\ForEach {vertex $v \in V(G)$}
\State {perform a depth-first search with $v$ as a starting point;}
\State {transform the resulting DFS tree $t$ into a DFS code tuple;}
\EndFor
\State {sort DFS code tuples by comparing their length according to DFS lexicographic order and choose the smallest one as the canonical label}
\end{algorithmic}
\end{algorithm}
The DFS lexicographic order is a linear order defined by the less or equal function in algorithm \ref{alg:dfs_lexicographic_order}. For the neighborhood restrictions and the comparison function between two DFS code tuples $a = (i_a, j_a, l_{i_a}, l_{(i_a, j_a)}, l_{j_a})$ and $b = (i_b, j_b, l_{i_b}, l_{(i_b, j_b)}, l_{j_b})$, please see algorithm \ref{alg:dfs_lexicographic_order_tuples}. In both algorithms, the following definitions apply: $\alpha = (a_0, a_1, ..., a_m)$ and $\beta = (b_0, b_1, ..., b_n)$, where each $a_t, b_t$ is a DFS code tuple of the form $x_t = (i_x, j_x, l_{i_x}, l_{(i_x, j_x)}, l_{j_x})$. $i_x, j_x$ are vertices, $l_{i_x}, l_{j_x}$ are their labels, and $l_{(i_x, j_x)}$ is the edge label.
\begin{algorithm}
\caption{DFSLexicographicLE($\alpha, \beta$)}\label{alg:dfs_lexicographic_order}
\begin{algorithmic}[1]
\If {$n \geq m$ and $a_m = b_m$}
\State\Return {True, ie $\alpha \leq \beta$}
\Else
\State {$a_{\text{forward}} \gets \text{Bool}(j_a > i_a)$}
\State {$b_{\text{forward}} \gets \text{Bool}(j_b > i_b)$}
\State {$a_{\text{backward}} = \neg a_{\text{forward}}$}
\State {$b_{\text{backward}} = \neg b_{\text{forward}}$}
\If{$a_{\text{forward}} \land b_{\text{forward}}$}
\State\Return {True, ie $\alpha \leq \beta$}
\EndIf
\If{$a_{\text{backward}} \land b_{\text{backward}} \land j_a < j_b$}
\State\Return {True, ie $\alpha \leq \beta$}
\EndIf
\If{$a_{\text{backward}} \land b_{\text{backward}} \land j_a = j_b \land l_{(i_a, j_a)} < l_{(i_b, j_b)}$}
\State\Return {True, ie $\alpha \leq \beta$}
\EndIf
\If{$a_{\text{forward}} \land b_{\text{forward}} \land i_b < i_a$}
\State\Return {True, ie $\alpha \leq \beta$}
\EndIf
\If{$a_{\text{forward}} \land b_{\text{forward}} \land i_b = i_a \land l_{i_a} < l_{i_b}$}
\State\Return {True, ie $\alpha \leq \beta$}
\EndIf
\If{$a_{\text{forward}} \land b_{\text{forward}} \land i_b = i_a \land l_{i_a} = l_{i_b} \land l_{(i_a, j_a)} < l_{(i_b, j_b)}$}
\State\Return {True, ie $\alpha \leq \beta$}
\EndIf
\If{$a_{\text{forward}} \land b_{\text{forward}} \land i_b = i_a \land l_{i_a} = l_{i_b} \land l_{(i_a, j_a)} = l_{(i_b, j_b)} \land l_{j_a} < l_{j_b}$}
\State\Return {True, ie $\alpha \leq \beta$}
\EndIf
\State\Return {False, ie $\alpha > \beta$}
\EndIf
\end{algorithmic}
\end{algorithm}
\begin{algorithm}
\caption{DFSTuplesLexicographicLE($a, b$)}\label{alg:dfs_lexicographic_order_tuples}
\begin{algorithmic}[1]
\State {$a_{\text{forward}} \gets \text{Bool}(j_a > i_a)$}
\State {$b_{\text{forward}} \gets \text{Bool}(j_b > i_b)$}
\State {$a_{\text{backward}} = \neg a_{\text{forward}}$}
\State {$b_{\text{backward}} = \neg b_{\text{forward}}$}
\If{$a_{\text{forward}} \land b_{\text{forward}} \land a_j < b_j$}
\State\Return {True, ie $a \leq b$}
\EndIf
\If{$a_{\text{backward}} \land b_{\text{backward}} \land (a_i < b_i \lor (a_i = b_i \land a_j < b_j))$}
\State\Return {True, ie $a \leq b$}
\EndIf
\If{$a_{\text{backward}} \land b_{\text{forward}} \land a_i < b_j$}
\State\Return {True, ie $a \leq b$}
\EndIf
\If{$a_{\text{forward}} \land b_{\text{backward}} \land b_j \leq a_i$}
\State\Return {True, ie $a \leq b$}
\EndIf
\If{$a_{\text{backward}}$}
\If{$b_{\text{forward}} \land b_i \leq a_i \land b_j = a_i + 1$}
\State\Return {True, ie $a \leq b$}
\EndIf
\If{$b_{\text{backward}} \land b_i = a_i \land a_j < b_j$}
\State\Return {True, ie $a \leq b$}
\EndIf
\EndIf
\If{$a_{\text{forward}}$}
\If{$b_{\text{forward}} \land b_i \leq a_j \land b_j = a_j + 1$}
\State\Return {True, ie $a \leq b$}
\EndIf
\If{$b_{\text{backward}} \land b_i = a_j \land b_j < a_i$}
\State\Return {True, ie $a \leq b$}
\EndIf
\EndIf
\State\Return {False, ie $a > b$}
\end{algorithmic}
\end{algorithm}
With these comparison algorithms in place, a \textit{DFS Code Tree} can be constructed. In a \textit{DFS Code Tree}, each node represents one graph via its DFS code. Obviously, in such a tree, the DFS code for a graph can turn up more than once, depending on node addition order. Thus, the first code that turns up on a pre-order depth-first search of the \textit{DFS Code Tree} is what we previously called the minimum DFS code.
This results in the more formal definition given by \cite{Yan2002-hg}:
\begin{quotation}
Given a graph $G$, $Z(G) = \{code(G, T) | \forall T, T \text{ is a DFS tree}\}$, based on DFS lexicographic order, the minimum one, $\min(Z(G))$, is called \textbf{Minimum DFS Code} of $G$. It is also the canonical label of $G$.
\end{quotation}
\subsection{Graphset projection and subgraph mining}
The pattern growth approach now becomes clearer when we're trying to construct a new pattern from an already found one: to construct a valid DFS code for the new pattern, the new edge cannot be added at an arbitrary position, but can only be added to vertices on the ``rightmost path.'' This is further limited: forward edges can grow from any vertex on the rightmost path, whereas backward edges can only be grown from the rightmost vertex.
With these definitions in place, the \textit{gSpan} algorithm works as follows (see algorithm \ref{alg:graph_set_projection} for the pseudocode):
\begin{algorithm}
\caption[GraphSet\_Projection($\mathcal{D,S}$)]{GraphSet\_Projection($\mathcal{D,S}$)\cite{Yan2002-hg}}
\label{alg:graph_set_projection}
\begin{algorithmic}[1]
\State sort labels of the vertices and edges in $\mathcal D$ by their frequency;
\State remove infrequent vertices and edges;
\State relabel the remaining vertices and edges in descending frequency;
\State $\mathcal S^1 \gets \text{all frequent 1-edge graphs in } \mathcal D$;
\State sort $\mathcal S^1$ in DFS lexicographic order;
\State $\mathcal S \gets \mathcal S^1$;
\ForEach {$\text{edge }e \in \mathcal S^1$}
\State $\text{initialize } s \text{ with } e, \text{set } s.GS = \{g | \forall g \in \mathcal D, e \in E(g)\}$;
\State Subgraph\_Mining($\mathcal{D, S}, s$);
\State $\mathcal D \gets \mathcal D - e$;
\If {$|\mathcal D| < \textit{min\_sup}$}
\State \textbf{break};
\EndIf
\EndFor
\end{algorithmic}
\end{algorithm}
In a first step, infrequent single nodes and edges are removed from the search space, as no frequent pattern can contain an infrequent substructure. The frequent one-edge subgraphs are stored in $\mathcal S^1$ and are used as the seeds from which longer patterns are grown by calling algorithm \ref{alg:subgraph_mining} on all such one-edge patterns. Along the way, the graph set $\mathcal{D}$ is shrunk during each iteration, as previously searched patterns cannot turn up again later on. After finding all one-edge patterns and all their descendants, the algorithm terminates. For a definition of Enumerate(), see \cite{Yan2002-hg}.
\begin{algorithm}
\caption[Subgraph\_Mining($\mathcal{D, S}, s$)]{Subgraph\_Mining($\mathcal{D, S}, s$)\cite{Yan2002-hg}}
\label{alg:subgraph_mining}
\begin{algorithmic}[1]
\If {$\textit{s} \neq \textit{min}(s)$}
\State \textbf{return};
\EndIf
\State $\mathcal S \gets \mathcal S \cup \{s\}$
\State generate all \textit{s'} potential children with one edge growth;
\State Enumerate(\textit{s});
\ForEach {$c, c \text{ is } s' \text{ child}$}
\If {$\textit{support}(c) \geq \textit{min\_sup}$}
\State $s \gets c;$
\State Subgraph\_Mining($\mathcal{D, S}, s$);
\EndIf
\EndFor
\end{algorithmic}
\end{algorithm}
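The first pruning step of GraphSet\_Projection, finding the frequent one-edge patterns $\mathcal S^1$, can be illustrated with a short sketch. The representation of a graph as a list of labelled edges and the interpretation of \textit{min\_sup} as a fraction of the graph set are assumptions of this example; the pattern growth itself is not shown.

```python
from collections import Counter


def frequent_one_edge_patterns(graphs, min_sup):
    """Return labelled edges whose relative support is at least min_sup.

    graphs: list of edge lists, each edge being a triple
    (vertex_label_u, edge_label, vertex_label_v).
    """
    support = Counter()
    for edges in graphs:
        # Count each labelled edge once per graph; orientation is ignored.
        seen = {(min(u, v), label, max(u, v)) for u, label, v in edges}
        support.update(seen)
    threshold = min_sup * len(graphs)
    return {edge: count for edge, count in support.items() if count >= threshold}


graph_set = [
    [("X", "a", "Y"), ("Y", "b", "Z")],
    [("Y", "a", "X")],
    [("Y", "b", "Z"), ("X", "a", "Y")],
    [("U", "c", "Z")],
]
seeds = frequent_one_edge_patterns(graph_set, 0.5)
```

Only the surviving edges would be passed on as seeds; every graph that contains none of them can be dropped from the search entirely.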
\section{Support Vector Machine}
An SVM\cite{Cortes1995-ix} is a supervised machine learning model that can classify a data set into distinct groups by constructing a hyperplane between their feature vectors that maximizes the distance between the nearest data points and the hyperplane for any class (the functional margin). Figure \ref{fig:hyperplane} illustrates this for a simple, two dimensional example with two intuitively distinct classes. The samples on the margin are called the support vectors. The SVM module from the Python package scikit-learn\cite{Pedregosa2011-ld} was used.
\begin{figure}
\centering
\noindent\includegraphics[width=\linewidth]{images/sphx_glr_plot_separating_hyperplane_0011}
\caption[Functional margin between two classes of data points]{Functional margin between two classes of data points\cite{noauthor_undated-io}}
\label{fig:hyperplane}
\end{figure}
The original problem formulation for support vector classification is as follows:\cite{Chang2011-wa}
Let $\boldsymbol{x}_i \in \mathbb{R}^p$ with $i=1,...,n$ and let $\boldsymbol{y} \in \mathbb{R}^n$ be an indicator vector, such that $\boldsymbol{y}_i \in \{1, -1\}$. Then SVC solves the following primal optimization problem:
\[\min_ {w, b, \zeta} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \]
subject to
\[ y_i (w^T \phi (x_i) + b) \geq 1 - \zeta_i,
\zeta_i \geq 0, i=1, ..., n\]
with its dual being
\[\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha\]
subject to
\[y^T \alpha = 0,
0 \leq \alpha_i \leq C, i=1, ..., n\]
where $\phi(\boldsymbol{x}_i)$ maps $\boldsymbol{x}_i$ into a higher-dimensional space, $e$ is the vector of all ones, $Q$ is an $n \times n$ positive semidefinite matrix with $Q_{ij} = y_i y_j K(x_i, x_j)$, with $K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$ being the kernel, and $C > 0$ is the regularization parameter (upper bound).
The decision function is given by:
\[\operatorname{sgn}\left(\sum_{i=1}^n y_i \alpha_i K(x_i, x) + \rho\right).\]
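To make the decision function concrete, it can be evaluated by hand for a toy model. The following sketch is not the scikit-learn implementation; the linear kernel, the hand-picked support vectors and the convention of mapping a score of exactly zero to $+1$ are assumptions chosen for illustration.

```python
def linear_kernel(u, v):
    """K(u, v) = u^T v, i.e. phi is the identity mapping."""
    return sum(a * b for a, b in zip(u, v))


def svc_decision(x, support_vectors, labels, alphas, rho, kernel=linear_kernel):
    """Evaluate sgn(sum_i y_i * alpha_i * K(x_i, x) + rho) for one sample."""
    score = rho + sum(
        y * alpha * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1  # assumption: a score of 0 maps to +1


# Toy model with one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.5, 0.5]

positive = svc_decision((2.0, 0.0), support_vectors, labels, alphas, rho=0.0)
negative = svc_decision((-3.0, -1.0), support_vectors, labels, alphas, rho=0.0)
```

Only samples with $\alpha_i > 0$ contribute to the sum, which is why the trained model needs to store nothing but the support vectors.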
\section{OEE anomaly detection}
Site performance is measured through OEE (Overall Equipment Effectiveness) scoring. OEE calculation yields a scoring between 0 and 1 according to the following formula:
\[OEE = \frac{POK \cdot CT}{OT},\]
where \textit{POK} is the number of \textbf{p}arts that came out of quality control \textbf{OK}, \textit{CT} is the \textbf{c}ycle \textbf{t}ime in seconds per part and \textit{OT} is the \textbf{o}perational \textbf{t}ime of the assembly line in seconds. All values are reset during shift changes, resulting in a short period of 0\% OEE before the first part of a new shift is produced. As an example, if the line ran for 3600 seconds, needed 10 seconds to produce a part and produced 300, the resulting OEE score would be
\[OEE = \frac{300 \cdot 10}{3600}\approx .83. \]
This result would indicate an assembly line running with about 83\% effectiveness. It should've produced 60 parts more in the given time, and, thus, was held up for some reason or another about 17\% of the time.
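The worked example can be reproduced with a one-line function. The function and parameter names are hypothetical, and returning 0 for a non-positive operational time is an assumption of this sketch.

```python
def oee(parts_ok, cycle_time_s, operational_time_s):
    """OEE = POK * CT / OT, all times in seconds."""
    if operational_time_s <= 0:
        return 0.0  # assumption: no operational time yet means 0% OEE
    return parts_ok * cycle_time_s / operational_time_s


def missing_parts(parts_ok, cycle_time_s, operational_time_s):
    """Parts the line could have produced in addition at 100% OEE."""
    return operational_time_s / cycle_time_s - parts_ok


score = oee(300, 10, 3600)
shortfall = missing_parts(300, 10, 3600)
```

For the values above, \texttt{score} is about .83 and \texttt{shortfall} is 60 parts, matching the figures in the text.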
OEE scores were available as a time series for the same time frame as the factory data set. As OEE scores were calculated every second, the resulting data set was considerably larger than the error logs, consisting of more than 800,000 rows of 26 columns, most of which weren't used. Anomaly detection consisted mainly of identifying more or less sudden drops in OEE scoring, indicating times when no parts were produced. The first few anomaly detection systems proved very capable of detecting shift changes and not much else, while later iterations did indeed pick up on most of the intuitively obvious drops.
Algorithm \ref{alg:oeedetectanomalies} provides an efficient anomaly detection algorithm with $O(n)$ complexity, with $S$ being a set of slope indicators, $\tilde{S}$ the mean slope indicator and $c$ a manually set parameter for how many standard deviations below the mean slope an anomaly should be assumed.
\begin{algorithm}
\caption{OEEDetectAnomalies(OEE\_data)}\label{alg:oeedetectanomalies}
\begin{algorithmic}[1]
\State {$S \gets \emptyset$;}
\State {$R \gets \emptyset$;}
\ForEach {5 minute slice $\boldsymbol{s}$;}
\If {$\min_{OEE} \boldsymbol{s} \neq 0$;}
\State {$x \gets \frac{\max_{t} \boldsymbol{s}}{\min_{t} \boldsymbol{s}} -1 $;}
\State {$S \gets S \cup x$;}
\Else
\State {$S \gets S \cup 0$;}
\EndIf
\EndFor
\State {$l \gets \tilde{S} - c \cdot SD(S)$;}
\ForEach {$\hat{\boldsymbol{s}} = (x, y, z)$ in $S, z<l$;}
\State {$R \gets R \cup \hat{\boldsymbol{s}}$;}
\EndFor \\
\Return {$R$;}
\end{algorithmic}
\end{algorithm}
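One possible reading of algorithm \ref{alg:oeedetectanomalies} is sketched below. It assumes that the slope indicator of a slice is the relative change between its first and its last OEE value, that OEE values are non-negative, and that the population standard deviation is meant; none of these details are fixed by the pseudocode, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def detect_anomalies(slices, c):
    """Return indices of slices whose slope indicator lies below mean - c * SD.

    slices: list of non-empty lists of OEE values, one list per 5-minute
    window, ordered by time.
    """
    indicators = []
    for values in slices:
        if min(values) > 0:
            # Relative change over the window; negative values mean a drop.
            indicators.append(values[-1] / values[0] - 1)
        else:
            # Windows touching 0% OEE (e.g. shift changes) are neutralized.
            indicators.append(0.0)
    limit = mean(indicators) - c * pstdev(indicators)
    return [index for index, x in enumerate(indicators) if x < limit]


windows = [
    [0.80, 0.80],
    [0.80, 0.82],
    [0.80, 0.40],  # sudden drop
    [0.80, 0.79],
    [0.00, 0.10],  # start of a shift
]
drops = detect_anomalies(windows, c=1)
```

Both passes over the data are linear in the number of slices, in line with the stated $O(n)$ complexity.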
\chapter{Experiments and performance study}
\section{Test setup}
All tests were performed on a 2015 Lenovo Thinkpad T450s, with 12GB of RAM and an Intel Core i7-5600U clocked at 2.6 GHz, running NixOS 17.09 with Python 3.6.4 built with GCC 6.4.0.
\section{Results for a synthetic data set}
\label{section:synthetic_data_results}
The synthetic data set was generated by first simulating a random walk of OEE values from 11 am until about 9 pm, mimicking about one shift (figure \ref{fig:dummyrandomwalk}). Later, equivalent error logs were created, with some messages more likely to turn up at times when the generated random walk resulted in an OEE drop as recognized by the OEE anomalies detection.
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{images/dummy_random_walk}
\caption{Synthetic OEE values}
\label{fig:dummyrandomwalk}
\end{figure}
These generated data look similar to the real facility set, with various drops and ascents. Resulting patterns like figure \ref{fig:dumm_pattern} also look very similar to real patterns like figure \ref{fig:small_pattern} above.
\begin{figure}
\centering
\includegraphics[width=1\linewidth]{images/dummy_pattern}
\caption{8-edge pattern from the synthetic data set, min\_sup = .4}
\label{fig:dumm_pattern}
\end{figure}
The synthetic error log consisted of 1000 rows of errors, the OEE set of 35,919 rows (roughly 10 hours of second-by-second logs). Run times were very manageable for higher min\_sup values, but soon reached exponential (and thus, unsustainable) growth for min\_sup values much lower than .2 (cf. table \ref{table:runtimes_syn}).
\begin{table}[]
\centering
\caption{Run times and patterns found (synthetic data set)}
\label{table:runtimes_syn}
\begin{tabular}{l|l|l}
data set & \textit{t} & patterns \\ \hline
import errors and graph generation & 1s & \\
import and anomalies detection on OEE & 8s & \\ \hline
\textit{gSpan} (min\_sup = .7) & 2s & 40 \\
\textit{gSpan} (min\_sup = .6) & 8s & 106 \\
\textit{gSpan} (min\_sup = .5) & 19s & 241 \\
\textit{gSpan} (min\_sup = .4) & 74s & 1056 \\ \hline
SVM training and validation (min\_sup = .7) & 4s & \\
SVM training and validation (min\_sup = .6) & 8s & \\
SVM training and validation (min\_sup = .5) & 35s & \\
SVM training and validation (min\_sup = .4) & 13m 14s &
\end{tabular}
\end{table}
\subsection{Evaluation with OEE data set}
OEE anomalies were split into a training data set and a validation data set, with 80\% of the data being used for training and the remaining 20\% used for validation. For the min\_sup = .5 run, experimental mean slope $\tilde{S}$ (cf. algorithm \ref{alg:oeedetectanomalies}) was .5 with a standard deviation of .21 and a $c$-value of .1. The validation data set consisted of 49 time windows, 33 of which were deemed as a noticeable drop by the OEE evaluation algorithm. Of these 33, the SVM correctly identified 28 as drops, for a sensitivity score of 85\%. Of the remaining 19 non-drops, 5 were falsely identified as positives, for a specificity score of 74\%.
For the min\_sup = .4 run, experimental mean slope $\tilde{S}$ was .5 with a standard deviation of .21 and a $c$-value of .1. The validation data set consisted of 49 time windows, 33 of which were deemed as a noticeable drop by the OEE evaluation algorithm. Of these 33, the SVM correctly identified 28 as drops, for a sensitivity score of 84.85\%. Of the remaining 19 non-drops, 5 were falsely identified as positives, for a specificity score of 73.68\%.
\section{Results for facility data set}
The facility data set consisted of 57,171 rows of error logs and 802,800 rows of OEE evaluation data. Results with min\_sup values of less than .5 could not be achieved. The algorithm consistently used up so much memory that it was OOM killed by the operating system after about a day of run time.
\begin{table}
\centering
\caption{Run times and patterns found (facility data set)}
\label{table:runtimes_real}
\begin{tabular}{l|l|l}
data set & \textit{t} & patterns \\ \hline
import errors and graph generation & 50s & \\
import and anomalies detection on OEE & 2m 27s & \\ \hline
\textit{gSpan} (min\_sup = .9) & 2m 20s & 12 \\
\textit{gSpan} (min\_sup = .7) & 6h 27m 12s & 846 \\
\textit{gSpan} (min\_sup = .5) & \textit{OOM killed} & -- \\ \hline
SVM training and validation (min\_sup = .7) & 27s &
\end{tabular}
\end{table}
\subsection{Evaluation with OEE data set}
\label{subsection:evaluation}
OEE anomalies were split as above. Experimental mean slope $\tilde{S}$ (cf. algorithm \ref{alg:oeedetectanomalies}) was 1.1 with a standard deviation of .16 and a $c$-value of .1. The validation data set consisted of 486 time windows, 64 of which were deemed as a noticeable drop by the OEE evaluation algorithm. Of these 64, the SVM trained on patterns with a min\_sup of .7 correctly identified 60 as drops, for a sensitivity score of 93.75\%. Of the remaining 422 non-drops, 18 were falsely identified as drops, for a specificity score of 95.73\%.
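The two scores follow directly from the confusion counts reported above. The helper names in this short sketch are hypothetical; the counts are taken from the text (60 of 64 drops recognized, 18 of 422 non-drops misclassified).

```python
def sensitivity(true_positives, false_negatives):
    """Share of actual drops that were recognized as drops."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Share of actual non-drops that were recognized as non-drops."""
    return true_negatives / (true_negatives + false_positives)


drops, non_drops = 64, 422
true_positives, false_positives = 60, 18

sens = sensitivity(true_positives, drops - true_positives)
spec = specificity(non_drops - false_positives, false_positives)
```

Rounded to two decimal places, \texttt{sens} and \texttt{spec} correspond to 93.75\% and 95.73\%.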
\chapter{Summary and Discussion}
To conclude this bachelor's thesis, the following will summarize my findings with the acknowledgment that, although the essential research in this work was of my own design and execution, a project such as this is virtually impossible without guidance by an advisor and support by friends and family.
With this work, I've introduced a method and provided a Python program to mine error log data for useful patterns, using a graph representation to take advantage of structural information and incorporate outside expert knowledge. I touched upon the most important concepts on which my model assumptions rest and expounded on some approaches that did not yield usable results.
The proposed algorithm has been shown to produce patterns with adequate experimental time complexity, with synthetic data and proprietary Siemens facility data. The found patterns, to a first approximation, seem to provide real informational value and seem able to predict facility downtimes, as measured by a drop in OEE, all to a reasonable degree. A possible next step would be to show the patterns and the thoughts that went into the OEE anomaly detection to an expert with domain knowledge and then refine the proposed approaches through a few more iterations.
The proposed approach has been shown to be somewhat fragile, in that at least some implementation details of \textit{gSpan}, the overall data structure used in this work, and maybe even the included libraries should be reevaluated at a later time, to hammer out possible errors and improve on the interaction between parts.
Further improvements to the algorithm, especially to improve on average-case time and memory performance, and allow it to directly process data streams instead of stale data would be much appreciated, but are sadly out of scope for this bachelor's thesis.
The results that \textit{could} be reached, however, point in a promising direction. The overall approach -- leveraging a graph-based mining algorithm against a time series of event logs -- seems to have merit, not least of all because the resulting patterns can be visualized in a way that immediately makes a lot of sense to both the casual observer as well as the expert with intimate domain knowledge.
This remains true even though event logs don't immediately spring to mind as being structurally similar to networks. The approach therefore merits further research and should at least be tried again with similarly non-obvious graph data in the future.
\chapter*{Acknowledgments}
\addcontentsline{toc}{chapter}{Acknowledgments}
I want to thank first and foremost my advisor, Martin Ringsquandl, as without his ever intelligent and on-point criticisms and ideas this bachelor's thesis wouldn't have been possible. Second, Prof. Dr. Kröger, for allowing this thesis as an external bachelor's thesis at the Munich Siemens AG headquarters. Third, Siemens Corporate Technology, and all the intelligent and lovely folks at the Research, Development and Automation/Business Analytics and Monitoring unit, who provided valuable input not only during lunch hours. I also wish to thank all my friends and family, who constantly bugged me about my progress especially during the later stages, especially Irina, Christina and my mother. Last, but certainly not least, I also want to thank my cat Tigris, who bugged me as well while I was writing, although he was mostly out for food.
\sloppy{Furthermore, I am very thankful to live in a time with tools such as \mbox{CytoScape}\cite{Shannon2003-gg}, \mbox{TeXStudio}\cite{Van_der_Zander_undated-kf}, \mbox{TeXLive}\cite{Rahtz_undated-bv}, \mbox{IntelliJ PyCharm}, and \mbox{PaperPile} for making the development of my analytics software and the later write-up much easier and more efficient.}
\listoffigures
\listoftables
\listof{algorithm}{List of Algorithms}
\bibliographystyle{utcaps}
\bibliography{bibliography}
\end{document}
