1 Introduction
The demand for explainable machine learning is increasing, driven by the spread of machine learning techniques to sensitive domains like cybersecurity, medicine, and smart infrastructure among others. Often, the need is abstract but nevertheless a requirement, e.g., in the recent EU regulation
(Goodman and Flaxman, 2016).Approaches to explanations range from posthoc explanation systems like Turner (2015), which provide explanations of decisions taken by blackbox systems to the use of linear or specialized whitebox systems (Fiterau et al., 2012) that generate models seen as simple enough for nonexpert humans to interpret and understand. Lipton (2016)
outlines how different motivations and requirements for interpretations lead different notions of interpretable models in supervised learning.
Automata models, such as (probabilistic) deterministic finite state automata ((P)DFA) and timed automata (RA) have long been studied (Hopcroft et al. (2013)) and are often seen as interpretable models. Moreover, they are learnable from data samples, both in supervised and unsupervised (see Higuera (2010)
) fashion. But which properties make these models interpretable, and how can we get the most benefit from them? We argue that—especially in the case of unsupervised learning—automata models have a number of properties that make it easy for humans to understand the learned model and project knowledge into it: a graphical representation, transparent computation, generative nature, and our good understanding of their theory.
2 Preliminaries
Finite automata. The models we consider are variants of deterministic finite state automata, or finite state machines. These have long been key models for the design and analysis of computer systems (Lee and Yannakakis, 1996). We provide a conceptual introduction here, and refer to Section B
in the appendix for details. An automaton consists of a set of states, connected by transitions labeled over an alphabet. It is said to accept a word (string) over the alphabet in a computation if there exists a path of transitions from a predefined start state to one of the predefined final states, using transitions labeled with the letters of the word. Automata are called deterministic when there exists exactly one such path for every possible string. Probabilistic automata include probability values on transitions and compute word probabilities using the product of these values along a path, similar to hidden Markov models (HMMs).
Learning approaches.
As learning finite state machines has long been of interest in the field of grammar induction, different approaches ranging from active learning algorithms
(Angluin, 1987)to algorithms based on the method of moments
(Balle et al., 2014) have been proposed. Process mining (Van Der Aalst et al., 2012) can also be seen as a type of automaton learning, focusing on systems that display a large amount of concurrency, such as business processes, represented as interpretable Petrinets. We are particularly interested in statemerging approaches, based on Oncina and Garcia (1992). While Section C in the appendix provides formal details, we provide a conceptual introduction here.The starting point for statemerging algorithms is the construction of a treeshaped automaton from the input sample, called augmented prefix tree acceptor (APTA). It contains all sequences from the input sample, with each sequence element as a directed labeled transition. Two samples share a path if they share a prefix. The statemerging algorithm reduces the size of the automaton iteratively by reducing the tree through merging pairs of states in the model, and forcing the result to be deterministic. The choice of the pairs, and the evaluation of a merge is made heuristically: Each possible merge is evaluated and scored, and the highest scoring merge is executed. This process is repeated and stops if no merges with high scores are possible. These merges generalize the model beyond the samples from the training set: the starting prefix tree is already an acyclic finite automaton. It has a finite set of computations, accepting all words from the training set. Merges can make the resulting automaton cyclic. Automata with cycles accept an infinite set of words. Statemerging algorithms generalize by identifying repetitive patterns in the input sample and creating appropriate cycles via merges. Intuitively, the heuristic tries to accomplish this generalization by identifying pairs of states that have similar future behaviors. In a probabilistic setting, this similarity might be measured by similarity of the empirical probability distributions over the outgoing transition labels. In grammar inference, heuristics rely on occurrence information and the identity of symbols, or use global model selection criteria to calculate merge scores.
3 Flexible StateMerging
The key step in statemerging algorithms is the identification of good merges. A merge of two states is considered good if the possible futures of the two states are very similar. By focusing on applicationdriven notions of similarity of sequences and sequence elements, we modify the statemerging algorithm as follows: For each possible merge, a heuristic can evaluate an applicationdriven similarity measure to obtain the merge score. Optionally, each symbol of the words is enriched with addition data, e.g. real values. This information can be aggregated in each state, e.g. by averaging. It is used to guide the heuristic in reasoning about the similarity of transitions and therefore the inferred latent states, or guide a model selection criterion, e.g. by using meansquarederror minimization as an objective function. It effectively separates the symbolic description connecting the latent variables from the objective (function) used to reason about the similarity. An implementation in C++ is available^{1}^{1}1https://bitbucket.org/chrshmmmr/dfasat. The importance of combining data with symbolic data is getting renewed attention, c.f. recent works such as Garnelo et al. (2016). In Section 5, we outline regression automata as a use case of our approach.
4 Aspects of Interpretability in Automata
To understand how a particular model works as well as how to go beyond the scope of the model and combing foreground knowledge about the application with the model itself, we now focus on which aspects of automata enable interpretations:

Automata have an easy graphical representation as cyclic, directed, labeled graphs, offering a hierarchical view of sequential data.

Computation is transparent. Each step of the computation is the same for each symbol of an input sample . It can be verified manually (e.g. visually), and compared to other computation paths through the latent state space. This makes it possible to analyze training samples and their contribution to the final model.

Automata are generative. Sampling from the model helps to understand what it describes. Tools like model checkers to query properties in a formal way, e.g. using temporal logic, can help to analyze the properties of the model.

Automata are well studied in theory and practice, including composition and closure properties, subclasses and related equally expressive formalisms. This makes it easy for humans to transfer their knowledge onto it: The model is frequently used in system design as a way to describe system logic, and are accessible to a wide audience.
The intention behind using either of these aspects depends on the purpose of the interpretation, e.g. trust, transparency, or generalizing beyond the input sample. Especially for unsupervised learning, we believe that knowledge transfer and exploratory knowledge discovery are common motivations, e.g. in (software) process discovery.
5 Application Case Studies
In the following we present some use cases of automata models and how they are interpreted and how the properties identified in Section 4 contribute to it. While this is by no means an exhaustive literature study, we hope that it helps to illustrate how the different aspects of interpretability are used in practice. In unsupervised learning, the data is observations without labels or counterexamples to the observed events. Often, there is no groundtruth to be used in an objective function. Methods for learning such systems typically use statistical assumptions to compute state similarity.
Software systems. Automata models are often used to infer models of software
in an unsupervised fashion, e.g. Walkinshaw et al. (2007). In these cases, the generative property of automaton models is see as interpretable: It is possible to ask queries using a temporal logic like LTL (Clarke et al., 2005) to answer questions regarding the model, e.g. whether a certain condition will eventually be true, or analyze at what point a computation path deviated from the expected outcome. In Smetsers et al. (2016), the authors use this property to test and validate properties of code by first fuzzing the code to obtain execution paths and the associated inputs and then learn a model to check LTL queries on.
Additionally, we can transfer human expert knowledge on system design (Wagner et al., 2006) to inspect the model, e.g. to identify the function of substructures identified. An example can be found in Smeenk et al. (2015), where the authors infer a state machine for a printer via active learning. Through visual inspection alone it is possible to identify deadlocks that are otherwise hard to see in the raw data. The visual analysis helped to identify bugs in the software the developers were unaware of. The appendix shows the final model in Figure 3.
Biological systems. In Schmidt and Kramer (2014), timed automata are used to infer the cell cycle of yeast based on sequences of gene expression activity. The graphical model obtained (c.f. Figure 4) can be visually compared to existing models derived using other methods and combined with apriori knowledge in biology.
Driver behavior. In our ongoing work using the RTI+ statemerging algorithm for timed automata (Verwer et al., 2008), we analyze car following behavior of human drivers. The task is to relate driver actions like changes of vehicle speed (and thus distance and relative speed to a lead vehicle) to a response action, e.g. acceleration. The inferred automaton model is inspected visually like a software controller. Different driving behaviors are distinguished by clustering the states of the automaton, i.e. the latent state space. The discovered distinct behaviors form control loops within the model. Figure 2 in the appendix shows an example with the discovered clusters highlighted.
Wind speed prediction. In our own work, we applied automata learning in different ways to a problem not related to grammar inference, predicting shortterm changes in wind speeds. We take two different approaches to obtain models that tell us more about the data than just a minimizer of the objective function: In one approach, (Pellegrino et al., 2016), we discover structure
in the data by using by inferring transition guards over a potentially infinite alphabet, effectively discovering a clustering as transition labels from the sequences automatically. The only constraint and objective used here is the similarity of future behaviors. The learned model can be seen as a structure like a decision tree built in a bottomup fashion, but allowing loops for repetitive patterns. Figure
1 in the appendix shows an example of such an automaton. In another approach (Lin et al., 2016), we use our flexible statemerging framework to impose structure through parameter choices. We derive discrete events from the physical wind speed observations by using clustering approaches to obtain a small alphabet of discrete events as a symbolic representation that fits the underlying data well. Using a heuristic that scores merges based on minimizing a mean squared error measure, the automata model has a good objective function for regression, as well as discovers latent variables in terms of the given discretization. In practice, other choices of discretization can be used. By using thresholds of the turbine used in wind mills, e.g. the activation point, one could infer a model whose latent states relate to the physical limitations of the application. We are planning to analyze this approach in future work. As with the previous example, the learned model can be seen as a description of decision paths taken in the latent statespace. If the model makes predictions that are difficult to believe for human experts, the computation and the model prediction can be visually analyzed to see which factors contributed to it, and how the situation relates to similar sets of features.6 Discussion
The applications surveyed in Section 5 show that interpreting finite state automata as models takes many forms and serves different goals. As such, this interpretation is not a feature inherent in the models or the algorithms themselves. Rather, interpretations are defined by the need and intention of the user. But yet, interpretations of automata models draw from a core set of properties as identified in Section 4: graphical representations, transparent computation, generative nature, and our understanding of their theory.
We note that automata models are particularly useful in unsupervised learning:
Applications of automata models often aim at easing the transfer of knowledge about the subject, or related subjects, to the data generating process. In this case, machine learning serves as a tool for exploration, to deal with epistemic uncertainty in observed systems. The goal is not only to obtain a more compact view of the data, but learn how to generalize from the observed data. Often, it is often unclear what the apriori knowledge is as users rely on experience. This makes it very difficult to formalize a clear objective function. A visual model with a traceable computation helps to guide the users, and helps to iterate over multiple models.
Flexible statemerging allows to obtain automata models in new domains: in our flexible statemerging framework presented in Section 3, we try to separate the symbolic representation from the objective function and heuristic. We hope that this will help to guide discovery by stating the model parameters, e.g. the symbols, independently form the heuristic that guides the discovery of latent variables. In this fashion, it is possible to learn models with interpretable aspects without having to sacrifice the model performance on the intended task.
Future work. We hope that this discussion will help to build a bridge between practitioners and experts in applied fields on one side, and the grammar inference and machine learning community on other side. As probabilistic deterministic automata models are almost as expressive as HMMs, the models and techniques described here are applicable to a wide range of problems with decent performance metrics. We see a lot of promise in combining symbolic data with numeric or other data via a flexible statemerging approach to bring automata learning to fields beyond grammatical inference.
Acknowledgments
I would like to thank, in no particular order, my colleagues, Joshua Moermann, Rick Smeters, Nino Pellegrino, Sara Messelaar, Corina Grigore, and Mijung Park for their time and feedback on this work. This work is partially funded by the FNR AFR grant PAULINE and Technologiestichting STW VENI project 13136 (MANTA) and NWO project 62001628 (LEMMA).
plus 0ex
References
 Angluin [1987] Dana Angluin. Learning Regular Sets from Queries and Counterexamples. Information and Computation, 75(2):87–106, 1987. URL http://dx.doi.org/10.1016/08905401(87)900526.
 Balle et al. [2014] Borja Balle, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. Spectral Learning of Weighted Automata. Machine Learning, 96(12):33–63, 2014. URL http://link.springer.com/article/10.1007/s109940135416x.
 Clarke et al. [2005] Edmund Clarke, Ansgar Fehnker, Sumit Kumar Jha, and Helmut Veith. Temporal Logic Model Checking. In Handbook of Networked and Embedded Control Systems, Control Engineering, pages 539–558. Birkhäuser Boston, 2005. ISBN 9780817632397 9780817644048. URL http://link.springer.com/chapter/10.1007/0817644040_23.
 Fiterau et al. [2012] Madalina Fiterau, Artur Dubrawski, Jeff Schneider, and Geoff Gordon. Tradeoffs in Explanatory Model Learning. 2012. URL http://www.ml.cmu.edu/research/dappapers/dap_fiterau.pdf.
 Garnelo et al. [2016] Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards Deep Symbolic Reinforcement Learning. arXiv:1609.05518 [cs], September 2016. URL http://arxiv.org/abs/1609.05518. arXiv: 1609.05518.
 Goodman and Flaxman [2016] Bryce Goodman and Seth Flaxman. European Union Regulations on Algorithmic Decisionmaking and a "Right to Explanation". arXiv:1606.08813 [cs, stat], June 2016. URL http://arxiv.org/abs/1606.08813. arXiv: 1606.08813.
 Higuera [2010] Colin de la Higuera. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, April 2010. ISBN 9780521763165.
 Hopcroft et al. [2013] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Pearson, Harlow, Essex, Pearson New International Edition edition, November 2013. ISBN 9781292039053.
 Lee and Yannakakis [1996] D. Lee and M. Yannakakis. Principles and Methods of Testing Finite State Machinesa Survey. Proceedings of the IEEE, 84(8):1090–1123, August 1996. ISSN 00189219. doi: 10.1109/5.533956.
 Lin et al. [2016] Qin Lin, Christian Hammerschmidt, Gaetano Pellegrino, and Sicco Verwer. Shortterm Time Series Forecasting with Regression Automata. In MiLeTS Workshop at KDD 2016, 2016. URL http://wwwbcf.usc.edu/~liu32/milets16/paper/MiLeTS_2016_paper_17.pdf.
 Lipton [2016] Zachary C. Lipton. The Mythos of Model Interpretability. arXiv:1606.03490 [cs, stat], June 2016. URL http://arxiv.org/abs/1606.03490. arXiv: 1606.03490.

Oncina and Garcia [1992]
Jose Oncina and Pedro Garcia.
Identifying Regular Languages In Polynomial Time.
In
Advances in Structural and Syntactic Pattern Recognition
, pages 99–108. World Scientific, 1992.  Pellegrino et al. [2016] Gaetano Pellegrino, Christian Albert Hammerschmidt, Qin Lin, and Sicco Verwer. Learning Deterministic Finite Automata from Infinite Alphabets. In Proceedings of The 13th International Conference on Grammatical Inference, Delft, October 2016.
 Schmidt and Kramer [2014] Jana Schmidt and Stefan Kramer. Online Induction of Probabilistic RealTime Automata. Journal of Computer Science and Technology, 29(3):345–360, May 2014. ISSN 10009000, 18604749. doi: 10.1007/s1139001414358. URL http://link.springer.com/article/10.1007/s1139001414358.
 Smeenk et al. [2015] Wouter Smeenk, Joshua Moerman, Frits Vaandrager, and David N. Jansen. Applying Automata Learning to Embedded Control Software. In Formal Methods and Software Engineering, number 9407 in Lecture Notes in Computer Science, pages 67–83. November 2015. URL http://link.springer.com/chapter/10.1007/9783319254234_5.
 Smetsers et al. [2016] Rick Smetsers, Joshua Moerman, Mark Janssen, and Sicco Verwer. Complementing Model Learning with MutationBased Fuzzing. arXiv:1611.02429 [cs], 2016. URL http://arxiv.org/abs/1611.02429. arXiv: 1611.02429.
 Turner [2015] Ryan Turner. A Model Explanation System. In Black Box Learning and Inference NIPS Workshop, 2015. URL http://www.blackboxworkshop.org/pdf/Turner2015_MES.pdf.
 Van Der Aalst et al. [2012] Wil Van Der Aalst, Arya Adriansyah, Ana Karla Alves de Medeiros, Franco Arcieri, Thomas Baier, Tobias Blickle, Jagadeesh Chandra Bose, Peter van den Brand, Ronald Brandtjen, Joos Buijs, and others. Process Mining Manifesto. In Business Process Management Workshops, pages 169–194. Springer, 2012. URL http://link.springer.com/chapter/10.1007/9783642281082_19.
 Verwer et al. [2008] Sicco Verwer, Mathijs de Weerdt, and Cees Witteveen. Efficiently Learning Simple Timed Automata. Induction of Process Models, pages 61–68, 2008. URL http://www.cs.ru.nl/~sicco/papers/ipm08.pdf.
 Wagner et al. [2006] Ferdinand Wagner, Ruedi Schmuki, Thomas Wagner, and Peter Wolstenholme. Modeling Software with Finite State Machines: A Practical Approach. CRC Press, May 2006. ISBN 9781420013641.
 Walkinshaw et al. [2007] N. Walkinshaw, K. Bogdanov, M. Holcombe, and S. Salahuddin. Reverse Engineering State Machines by Interactive Grammar Inference. In 14th Working Conference on Reverse Engineering (WCRE 2007), 2007.
Appendix A Visualizations of Automata
Appendix B Finite State Automata
In this section, we give a formal introduction to deterministic finite state automata as the most basic model considered in the related work. For other variants like probabilistic automata, timed and real time automata, and regression automata, we refer to the cited papers for formal introductions.
A deterministic finite state automaton (DFA) is one of the basic and most commonly used finite state machines. Below, we provide a concise description of DFAs, the reader is referred to Hopcroft et al. [2013] for a more elaborate overview. A DFA is a directed graph consisting of a set of states (nodes) and labeled transitions (directed edges). An example is shown in Figure 6. The start state is a specific state of the DFA and any state can be an accepting state (final state) in . The labels of transitions are all members of a given alphabet . A DFA can be used to generate or accept sequences of symbols (strings) using a process called DFA computation. This process begins in , and iteratively activates (or fires) an outgoing transition with label from the source state it is in , moving the process to the target state pointed to by . A computation is accepting if the state it ends in (its last state) is an accepting state , otherwise it is rejecting. The labels of the activated transitions form a string . A DFA accepts exactly those strings formed by the labels of accepting computations, it rejects all others. Since a DFA is deterministic there exists exactly one computation for every string, implying that for every state and every label there exists at most one outgoing transition from with label . A string is said to reach all the states contained in the computation that forms , is said to end in the last state of such a computation. The set of all strings accepted by a DFA is called the language of .
Appendix C StateMerging Algorithms
The idea of a statemerging algorithm is to first construct a treeshaped DFA from the input sample , and then to merge the states of . This DFA is called an augmented prefix tree acceptor (APTA). An example is shown in Figure 6. For every state of , there exists exactly one computation that ends in . This implies that the computations of two strings and reach the same state if and only if and share the same prefix until they reach . Furthermore, an APTA is constructed to be consistent with the input sample , i.e., and . Thus a state is accepting only if there exists a string such that the computation of ends in . Similarly, it is rejecting only if the computation of a string ends in . As a consequence, can contain states that are neither accepting nor rejecting. No computation of any string from ends in such a state. Therefore, the rejecting states are maintained in a separate set , with . Whether a state should be accepting or rejecting is determined by merging the states of the APTA and trying to find a DFA that is as small as possible.
A merge of two states and combines the states into one: it creates a new state that has the incoming and outgoing transitions of both and , i.e., replace all by and all by . Such a merge is only allowed if the states are consistent, i.e., it is not the case that is accepting while is rejecting or vice versa. When a merge introduces a nondeterministic choice, i.e., is now the source of two transitions and in with the same label , the target states of these transitions and are merged as well. This is called the determinization process (c.f. the whileloop in Algorithm 2), and is continued until there are no nondeterministic choices left. However, if this process at some point merges two inconsistent states, the original states and are also considered inconsistent and the merge will fail. The result of a successful merge is a new DFA that is smaller than before, and still consistent with the input sample . A statemerging algorithm iteratively applies this state merging process until no more consistent merges are possible. The general algorithm is outlined in Algorithm 1. Figure 6 shows an automaton obtained from the input given in Figure 7, which is also depicted as an APTA in Figure 6.
Comments
There are no comments yet.