
Computational Learning Theory

Sample Complexity for Finite Hypothesis Spaces

  • The growth in the number of required training examples with problem size is called the sample complexity of the learning problem.
  • We will consider only consistent learners, i.e., learners that maintain a training error of 0.
  • We can derive a bound on the number of training examples required by any consistent learner!
  • Fact: Every consistent learner outputs a hypothesis belonging to the version space.
  • Therefore, we need to bound the number of examples needed to assure that the version space contains no unacceptable hypothesis.
  • The version space $VS_{H,D}$ is said to be ε-exhausted with respect to $c$ and $\cal{D}$, if every hypothesis $h$ in $VS_{H,D}$ has error less than ε with respect to $c$ and $\cal{D}$. \[(\forall h \in VS_{H,D}) error_{\cal{D}}(h) < \epsilon \]
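
For a consistent learner and a finite hypothesis space, these observations lead to the standard result (see, e.g., Mitchell's Chapter 7) that $m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln(1/\delta)\right)$ examples suffice to ε-exhaust the version space with probability at least $1 - \delta$. A minimal Python sketch of this bound, with an illustrative $|H|$ for conjunctions over 10 boolean attributes:

```python
import math

def consistent_learner_bound(h_size: int, epsilon: float, delta: float) -> int:
    """Examples sufficient to epsilon-exhaust the version space:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Illustrative example: conjunctions of boolean literals over n = 10 attributes
# (|H| = 3^10 + 1), target error 0.1, confidence 95%.
print(consistent_learner_bound(3 ** 10 + 1, epsilon=0.1, delta=0.05))
```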


Sample Complexity for Infinite Hypothesis Spaces

  • The Vapnik-Chervonenkis dimension of H, VC(H), typically allows us to place a fairly tight bound on the sample complexity.
  • A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S ($2^{|S|}$ in total), there exists some hypothesis in H consistent with that dichotomy. Take a look at Figure 7.3.
  • VC(H) of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H
  • Note: if arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞
  • Note: for finite H, $VC(H) \le \log_2 |H|$
  • If X = {real numbers} and H = {intervals on the real line} then VC(H) = 2
  • If X = {points on the x,y plane} and H = {linear decision surfaces} then VC(H) = 3. See Figure 7.4.
  • If X = {conjunctions of exactly 3 boolean literals} and H = {conjunctions of up to 3 boolean literals} then VC(H) = 3
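
The shattering definition can be checked by brute force for small point sets. The sketch below (an illustration, not part of the slides) tests whether intervals on the real line shatter a given set of points, matching the VC(H) = 2 example above:

```python
from itertools import product

def interval_consistent(points, labels):
    """Is there an interval [a, b] that labels exactly the positive points as 1?"""
    pos = [x for x, y in zip(points, labels) if y == 1]
    neg = [x for x, y in zip(points, labels) if y == 0]
    if not pos:
        return True  # an interval containing none of the points handles the all-0 dichotomy
    lo, hi = min(pos), max(pos)
    return all(not (lo <= x <= hi) for x in neg)

def shattered_by_intervals(points):
    """True iff every dichotomy of the points is realized by some interval."""
    return all(interval_consistent(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered_by_intervals([1.0, 2.0]))       # True: a 2-point set is shattered
print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False: the + - + dichotomy is impossible
```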

Sample Complexity Bounds

  • Upper bound: $m \ge \frac{1}{\epsilon}\left(4\log_2(2/\delta) + 8\,VC(H)\log_2(13/\epsilon)\right)$
  • Lower bound: consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. Then there exists a distribution $\cal{D}$ and a target concept in C such that if L observes fewer examples than $\max\left[\frac{1}{\epsilon}\log(1/\delta),\; \frac{VC(C)-1}{32\epsilon}\right]$, then with probability at least δ, L outputs a hypothesis h having $error_{\cal{D}}(h) > \epsilon$
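
Both bounds are closed-form expressions in ε, δ and the VC dimension, so they are easy to evaluate; the sketch below plugs in the constants exactly as stated above (the example values VC = 3, ε = 0.1, δ = 0.05 are illustrative):

```python
import math

def vc_upper_bound(vc_dim: int, epsilon: float, delta: float) -> int:
    """Sufficient number of examples: (1/eps) * (4 log2(2/delta) + 8 VC log2(13/eps))."""
    return math.ceil((4 * math.log2(2.0 / delta) +
                      8 * vc_dim * math.log2(13.0 / epsilon)) / epsilon)

def vc_lower_bound(vc_dim: int, epsilon: float, delta: float) -> int:
    """Below this many examples, some distribution and concept defeat any learner."""
    return math.ceil(max(math.log(1.0 / delta) / epsilon,
                         (vc_dim - 1) / (32.0 * epsilon)))

# Illustrative example: linear decision surfaces in the plane, VC(H) = 3.
print("lower:", vc_lower_bound(3, 0.1, 0.05), "upper:", vc_upper_bound(3, 0.1, 0.05))
```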

The Mistake Bound Model of Learning

  • How many mistakes will the learner make in its predictions before it learns the target concept?
  • The learner must predict c(x) before being told the answer
  • This is useful in applications such as predicting fraudulent credit card purchases
  • In this section, we are interested in learning the target concept exactly
  • Example (Find-S for conjunctions of n boolean literals): initialize h to $a_1 \wedge \neg a_1 \wedge \cdots \wedge a_n \wedge \neg a_n$
  • For each positive training instance x, remove from h any literal that is not satisfied by x

The largest number of mistakes this algorithm can make before exactly learning the target concept (its mistake bound) is n + 1.
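
A minimal sketch of this learner, assuming instances are tuples of n booleans and hypotheses are represented as sets of (attribute index, required value) pairs; the representation and the toy data are choices of this sketch, not of the slides:

```python
def find_s_mistake_bound(stream):
    """Online Find-S for conjunctions of boolean literals.

    `stream` yields (x, y) pairs where x is a tuple of n booleans and y is the
    label assigned by the target conjunction.  Returns the final hypothesis
    (a set of (index, value) pairs, e.g. {(0, True), (1, False)} meaning
    a1 AND NOT a2) and the number of prediction mistakes.
    """
    mistakes = 0
    h = None  # None encodes the initial a1 ^ ~a1 ^ ... ^ an ^ ~an (rejects everything)

    for x, y in stream:
        satisfied = h is not None and all(x[i] == val for i, val in h)
        if satisfied != y:          # the learner predicts before seeing the label
            mistakes += 1
        if y:                       # update only on positive examples, as on the slide
            if h is None:
                h = {(i, x[i]) for i in range(len(x))}
            else:
                h = {(i, val) for i, val in h if x[i] == val}
    return h, mistakes

# Toy target concept: a1 AND NOT a2 over n = 3 attributes.
data = [((True, False, True), True),
        ((True, False, False), True),
        ((False, False, True), False)]
print(find_s_mistake_bound(data))   # at most n + 1 = 4 mistakes
```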

Halving Algorithm

  • Use a version space
  • The goal is to reduce the number of viable hypotheses to 1
  • The classification is determined using a majority vote
  • The mistake bound is $\log_2 |H|$
  • Again, this is an upper bound - it is possible to learn without making any mistakes
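
A sketch of the Halving algorithm, assuming the finite hypothesis space is supplied as a list of boolean-valued functions (a convenience of this illustration):

```python
def halving_algorithm(hypotheses, stream):
    """Online Halving algorithm over a finite hypothesis space.

    `hypotheses` is a list of callables X -> bool; `stream` yields (x, y).
    Returns the surviving version space and the number of mistakes, which is
    at most log2(|H|) when the target concept is in H.
    """
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes = [h(x) for h in version_space]
        prediction = votes.count(True) >= votes.count(False)  # majority vote
        if prediction != y:
            mistakes += 1
        # keep only the hypotheses consistent with the revealed label
        version_space = [h for h, v in zip(version_space, votes) if v == y]
    return version_space, mistakes

# Illustrative example: H = threshold concepts on {0, ..., 7}, target threshold 5.
H = [lambda x, t=t: x >= t for t in range(8)]
data = [(3, False), (6, True), (5, True), (4, False)]
print(halving_algorithm(H, data))   # at most log2(8) = 3 mistakes
```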

Optimal Mistake Bounds

  • Let $M_A(c)$ denote the maximum, over all possible sequences of training examples, of the number of mistakes made by A to exactly learn c
  • Let $M_A(C) \equiv \max_{c \in C} M_A(c)$
  • Let C be an arbitrary non-empty concept class
  • The optimal mistake bound for C, $Opt(C)$, is the minimum over all possible learning algorithms A of $M_A(C)$
  • $VC(C) \le Opt(C) \le M_{Halving}(C) \le \log_2 |C|$

Weighted Majority Algorithm

  • Table 7.1 shows the algorithm
  • Theorem: Relative mistake bound for Weighted-Majority. Let D be any sequence of training examples, let A be any set of n prediction algorithms, and let k be the minimum number of mistakes made by any algorithm in A for the training sequence D. Then the number of mistakes over D made by the Weighted-Majority algorithm using β = 1/2 is at most 2.4(k + log₂ n)
  • Note: the weighted majority idea is the basis behind many ensemble learners that use boosting (such as AdaBoost ) in order to obtain better classification accuracy
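
A sketch of Weighted-Majority matching the theorem's setting, with the pool of prediction algorithms again represented as boolean-valued functions (an illustrative choice, not Table 7.1 verbatim):

```python
def weighted_majority(predictors, stream, beta=0.5):
    """Weighted-Majority over a pool of prediction algorithms.

    Each predictor is a callable X -> bool.  The learner predicts by weighted
    vote; the weight of every predictor that erred is multiplied by beta.
    With beta = 1/2, the number of mistakes is at most 2.4 * (k + log2 n),
    where k is the best predictor's mistake count.
    """
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, y in stream:
        votes = [p(x) for p in predictors]
        weight_true = sum(w for w, v in zip(weights, votes) if v)
        weight_false = sum(w for w, v in zip(weights, votes) if not v)
        prediction = weight_true >= weight_false
        if prediction != y:
            mistakes += 1
        weights = [w * beta if v != y else w for w, v in zip(weights, votes)]
    return weights, mistakes

# Toy example: three experts trying to predict whether an integer is >= 5.
experts = [lambda x: x >= 3, lambda x: x >= 5, lambda x: x % 2 == 0]
data = [(2, False), (7, True), (5, True), (4, False)]
print(weighted_majority(experts, data))
```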


Sample complexity of linear regression.

Here, we’ll look at linear regression from a statistical learning theory perspective. In particular, we’ll derive the number of samples necessary in order to achieve a certain level of regression error. We’ll also see a technique called “discretization” that allows for proving things about infinite sets by relying on results in finite sets.

Problem setup

Consider a very standard linear regression setup. We’re given a set of $d$-dimensional data and 1-dimensional labels (or “response variables”), ${(x_i, y_i)}$, and we’re tasked with finding a linear map $w \in \mathbb{R}^d$ that minimizes the squared error between the predictions and the truth. In particular, we’d like to minimize $\sum\limits_{i = 1}^n (w^\top x_i - y_i)^2$.

Without loss of generality, assume for all $i$ that $\|x_i\|_\infty \le 1$ and $|y_i| \le 1$, and restrict attention to weight vectors with $\|w\|_\infty \le 1$.

(The following results will hold true for any constant bound, and we can always rescale as needed.)

Learnability for finite hypothesis classes

Recall that a hypothesis class is essentially the set of functions over which we’re searching in any given statistical learning problem. In our current example, the hypothesis class is the set of all linear functions with bounded norm on their weights. Said another way, each hypothesis in our hypothesis class corresponds to a weight vector $w$ with $\|w\|_\infty \le 1$, acting on an input as $h_w(x) = w^\top x$.

In the PAC learning setting, a hypothesis class is considered “learnable” if there exists an algorithm that, for any $\epsilon$ and $\delta$, can return a hypothesis with error at most $\epsilon$ with probability $1 - \delta$ if it observes enough samples.

An important result in PAC learning is that all finite hypothesis classes are (agnostically) PAC learnable with sample complexity \[ m_{\mathcal{H}}(\epsilon, \delta) = O\!\left(\frac{\log(|\mathcal{H}|/\delta)}{\epsilon^2}\right), \]

where $|\mathcal{H}|$ is the cardinality of the hypothesis class.

This can be shown by using a Hoeffding bound. Note that the “agnostic” part of learnability means that the algorithm will return a hypothesis that has error $\epsilon$ as compared to the best hypothesis in the class $\mathcal{H}$, rather than absolute error.

However, notice that in the linear regression setting, the hypothesis class is infinite: even though the weight vector’s norm is bounded, it can still take an infinite number of values. Can we somehow leverage the result for finite classes here? This brings us to an important point: we can discretize the hypothesis class and bound how much error we incur by doing so.

After discretizing, we need to account for error arising from two places:

  • Error from not finding the best hypothesis originally.
  • Error from discretizing the set of hypotheses.

Discretizing hypothesis classes

We now set out to “approximate” our infinite hypothesis class with a finite one. We’ll do this in the most straightforward way: simply choose a resolution at which to discretize, and split up each dimension into equally spaced bins. Mathematically, if we choose any $\epsilon'$, then we can represent the discretized hypothesis class as $\mathcal{H}' = \{h_w \in \mathcal{H} : w \in \epsilon' \mathbb{Z}^d\}$.

Recall that we constrained $w$ such that $\|w\|_\infty \le 1$, so each dimension of $w$ lives in the interval $[-1, 1]$. Thus, there are $\left(\frac{2}{\epsilon'}\right)^d$ hypotheses in $\mathcal{H}'$. Using the generic sample complexity for finite hypothesis classes, this means that the sample complexity to learn this class is \[ m = O\!\left(\frac{\log(|\mathcal{H}'|/\delta)}{\epsilon^2}\right) = O\!\left(\frac{d\log(2/\epsilon') + \log(1/\delta)}{\epsilon^2}\right). \]
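
As a quick sanity check on these quantities (with absolute constants dropped, matching the $O(\cdot)$ form above), a short illustrative sketch:

```python
import math

def discretized_class_size(d: int, eps_prime: float) -> float:
    """|H'| when each of the d coordinates of w is gridded at resolution eps' on [-1, 1]."""
    return (2.0 / eps_prime) ** d

def sample_complexity_up_to_constants(d: int, eps: float, delta: float, eps_prime: float) -> float:
    """(d * log(2/eps') + log(1/delta)) / eps^2, absolute constants dropped."""
    return (d * math.log(2.0 / eps_prime) + math.log(1.0 / delta)) / eps ** 2

print(discretized_class_size(d=3, eps_prime=0.1))                        # 20^3 = 8000 hypotheses
print(sample_complexity_up_to_constants(d=3, eps=0.1, delta=0.05, eps_prime=0.1))
```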

Quantifying error

After discretizing the set of hypotheses, we won’t be able to find the optimal hypothesis in the original continuous set. Let’s now quantify how much error we incur by doing this. If $\tilde{w}$ is the learned weight vector after discretizing, then $\tilde{w}^\top x$ are the “predictions”. Quantifying the error here, we have for any $x$ in the training set:

\begin{align} |\tilde{w}^\top x - w^\top x| &= \left| \sum\limits_j (\tilde{w}^{(j)} - w^{(j)})\, x^{(j)} \right| && (x^{(j)} \text{ is the } j\text{th coordinate}) \\ &\leq \epsilon' \sum\limits_j |x^{(j)}| = \epsilon' \|x\|_1 \\ &\leq d\epsilon' \end{align}

Using the Cauchy-Schwarz inequality, we could equivalently bound $|\tilde{w}^\top x - w^\top x| \le \|\tilde{w} - w\|_2 \, \|x\|_2 \le (\epsilon'\sqrt{d})(\sqrt{d}) = d\epsilon'$.
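
A quick numerical spot-check of the $d\epsilon'$ bound; the rounding-to-grid step and the uniform draws below are illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps_prime = 5, 0.01

w = rng.uniform(-1, 1, size=d)                   # a hypothesis in the continuous class
w_tilde = np.round(w / eps_prime) * eps_prime    # its nearest neighbour on the eps'-grid
x = rng.uniform(-1, 1, size=d)                   # an input with coordinates in [-1, 1]

gap = abs(w_tilde @ x - w @ x)
print(gap, "<=", d * eps_prime, gap <= d * eps_prime)
```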

Recall that our original goal was to minimize the squared error. The error in the discretized setting will be $(\tilde{w}^\top x - y)^2$, and in the original continuous setting, it will be $(w^\top x - y)^2$. We’d like to quantify the difference between these two errors. An important note is that the function $f(z) = z^2$ is 4-Lipschitz on the bounded interval where these residuals live, i.e., for any $z_1, z_2$ with $|z_1|, |z_2| \le 2$, we have $|z_1^2 - z_2^2| \leq 4|z_1 - z_2|$. Thus, we have

\begin{align} |(\tilde{w}^\top x - y)^2 - (w^\top x - y)^2| &\leq 4|(\tilde{w}^\top x - y) - (w^\top x - y)| \\ &= 4|\tilde{w}^\top x - w^\top x| \\ &\leq 4 d\epsilon'. \end{align}

If we now set $\epsilon' = \frac{\epsilon}{4d}$ (remember, we can choose $\epsilon'$ to be any level of discretization we like), we obtain \[ |(\tilde{w}^\top x - y)^2 - (w^\top x - y)^2| \le \epsilon. \]

To sum up, we incur $\epsilon$ error when we discretize the hypothesis class, on top of the $\epsilon$ error already incurred in the original setting. This means the total error will be $2\epsilon$. Using our sample complexity computed above, and plugging in $\epsilon' = \frac{\epsilon}{4d}$, we have \[ m = O\!\left(\frac{d\log(8d/\epsilon) + \log(1/\delta)}{\epsilon^2}\right). \]

Here, we used the discretization trick to compute the sample complexity of linear regression. Interestingly, the sample complexity increases proportionally to $d\log d$ (ignoring the $\epsilon$ terms). This was a surprising result to me, as this is relatively slow growth.
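
To see this growth concretely, one can evaluate the final bound (with absolute constants dropped, as above) for a few dimensions; a small illustrative sketch:

```python
import math

def regression_sample_complexity(d: int, eps: float, delta: float) -> float:
    """(d * log(8d/eps) + log(1/delta)) / eps^2, i.e. the bound with eps' = eps / (4d)."""
    return (d * math.log(8.0 * d / eps) + math.log(1.0 / delta)) / eps ** 2

# For fixed eps and delta, the bound grows roughly like d * log d in the dimension.
for d in (1, 10, 100, 1000):
    print(d, round(regression_sample_complexity(d, eps=0.1, delta=0.05)))
```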

  • Prof. Elad Hazan’s Theoretical Machine Learning course
  • Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David


Promising directions of machine learning for partial differential equations

Steven L. Brunton & J. Nathan Kutz

Nature Computational Science (2024). Review article, published 28 June 2024.


Subjects: Applied mathematics, Computational science

Partial differential equations (PDEs) are among the most universal and parsimonious descriptions of natural physical laws, capturing a rich variety of phenomenology and multiscale physics in a compact and symbolic representation. Here, we examine several promising avenues of PDE research that are being advanced by machine learning, including (1) discovering new governing PDEs and coarse-grained approximations for complex natural and engineered systems, (2) learning effective coordinate systems and reduced-order models to make PDEs more amenable to analysis, and (3) representing solution operators and improving traditional numerical algorithms. In each of these fields, we summarize key advances, ongoing challenges, and opportunities for further development.





What exactly is a hypothesis space in machine learning?

Whilst I understand the term conceptually, I'm struggling to understand it operationally. Could anyone help me out by providing an example?

  • machine-learning
  • terminology


  • A space where we can predict the output by a set of some legal hypotheses (or functions), where each function is represented in terms of features. – Abhishek Kumar, Aug 9, 2019

3 Answers

Let's say you have an unknown target function $f:X \rightarrow Y$ that you are trying to capture by learning. In order to capture the target function you have to come up with some hypotheses, or candidate models, denoted by $h_1,\dots,h_n$ where $h_i \in H$. Here, the set $H$ of all candidate models is called the hypothesis class, hypothesis space, or hypothesis set.

For more information, browse Abu-Mostafa's presentation slides: https://work.caltech.edu/textbook.html


  • This answer conveys absolutely no information! What is the intended relationship between $f$, $h$, and $H$? What is meant by "hypothesis set"? – whuber, Nov 28, 2015
  • Please take a few minutes with our help center to learn about this site and its standards, JimBoy. – whuber, Nov 28, 2015
  • The answer says very clear, h learns to capture target function f. H is the space where h1, h2,..hn got defined. – Logan, Nov 29, 2018
  • @whuber I hope this is clearer. – pentanol, Aug 6, 2021
  • @pentanol You have succeeded in providing a different name for "hypothesis space," but without a definition or description of "candidate model," it doesn't seem to add any information to the post. What would be useful is information relevant to the questions that were posed, which concern "understand[ing] operationally" and a request for an example. – whuber, Aug 6, 2021

Suppose an example with four binary features and one binary output variable, together with a set of observed input-output pairs for these variables.

This set of observations can be used by a machine learning (ML) algorithm to learn a function f that is able to predict a value y for any input from the input space .

We are searching for the ground truth f(x) = y that explains the relation between x and y for all possible inputs in the correct way.

The function f has to be chosen from the hypothesis space .

To get a better idea: in the example given above, the input space has size $2^4 = 16$; that is the number of possible inputs. The hypothesis space has size $2^{2^4} = 65536$, because for each of the $2^4$ possible inputs two outcomes ($0$ and $1$) are possible.

The ML algorithm helps us to find one function, sometimes also referred to as a hypothesis, from the relatively large hypothesis space; a small counting sketch follows below.

  • A Few Useful Things to Know About ML
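
To make the counting above concrete, here is a small, purely illustrative sketch that enumerates the $2^{2^4} = 65536$ boolean functions over four binary features:

```python
from itertools import product

inputs = list(product([0, 1], repeat=4))          # the 2^4 = 16 possible inputs
print(len(inputs))                                 # 16

# Every hypothesis assigns 0 or 1 to each of the 16 inputs,
# so the full hypothesis space contains 2^16 boolean functions.
hypothesis_space = list(product([0, 1], repeat=len(inputs)))
print(len(hypothesis_space))                       # 65536

# One particular hypothesis, viewed as a lookup table from input to label:
h = dict(zip(inputs, hypothesis_space[12345]))
print(h[(1, 0, 1, 1)])
```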


  • Just a small note on your answer: the size of the hypothesis space is indeed 65,536, but a more easily explained expression for it would be $2^{(2^4)}$, since there are $2^4$ possible unique samples, and thus $2^{(2^4)}$ possible label assignments for the entire input space. – engelen, Jan 10, 2018
  • @engelen Thanks for your advice, I've edited the answer. – So S, Jan 10, 2018
  • @SoS That one function is called classifier?? – user125163, Aug 22, 2018
  • @Arjun Hedge: Not the one, but one function that you learned is the classifier. The classifier could be (and that's your aim) the one function. – So S, Aug 22, 2018

The hypothesis space is very relevant to the topic of the so-called bias-variance tradeoff in maximum likelihood estimation. If the number of parameters in the model (the hypothesis function) is too small for the model to fit the data (indicating underfitting, and that the hypothesis space is too limited), the bias is high; if the model contains more parameters than needed to fit the data, the variance is high (indicating overfitting, and that the hypothesis space is too expressive).

As stated in So S' answer, if the parameters are discrete we can easily and concretely calculate how many possibilities are in the hypothesis space (i.e., how large it is), but normally, under real-life circumstances, the parameters are continuous. Therefore the hypothesis space is generally uncountable.

Here is an example I borrowed and modified from the related part in the classical machine learning textbook: Pattern Recognition And Machine Learning to fit this question:

We are selecting a hypothesis function for an unknown function hiding in the training data given by a third person named CoolGuy living on an extragalactic planet. Let's say CoolGuy knows what the function is, because he provided the data cases and generated them using that function. Let's call it the ground-truth function (we only have the limited data, while CoolGuy has both the unlimited data and the function generating them) and denote it by $y(x, w)$.

[Figure: the ground-truth curve $y(x, w)$ together with the noisy training points.]

The green curve is $y(x,w)$, and the little blue circles are the cases we have (they are not exactly the true data cases transmitted by CoolGuy, because they have been contaminated by some transmission noise along the way).

We thought that the hidden function would be very simple, so we make an attempt with a linear model (a hypothesis with a very limited space): $g_1(x, w)=w_0 + w_1 x$ with only two parameters, $w_0$ and $w_1$. We train the model using our data and obtain this:

[Figure: the linear fit $g_1$ plotted against the training points; it misses the shape of the curve.]

We can see that no matter how much data we use to fit the hypothesis, it just doesn't work, because it is not expressive enough.

So we try a much more expressive hypothesis: $g_9(x, w)=\sum_{j=0}^{9} w_j x^j$ with ten adaptive parameters $w_0, w_1, \cdots, w_9$. We also train the model, and then we get:

[Figure: the ninth-order fit $g_9$ passing through every training point.]

We can see that it is just too expressive and fits all the data cases. We see that a much larger hypothesis space (since $g_1$ can be expressed by $g_9$ by setting $w_2, w_3, \cdots, w_9$ all to 0) is more powerful than a simple hypothesis. But the generalization is also bad: that is, if we receive more data from CoolGuy and do inference, the trained model will most likely fail on those unseen cases.

Then how large a hypothesis space is large enough for the training dataset? We can find an answer in the textbook mentioned above:

One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model.

And you'll see from the textbook that if we try to use 4 parameters, $g_3=w_0+w_1 x + w_2 x^2 + w_3 x^3$, the trained function is expressive enough for the underlying function $y=\sin(2\pi x)$. It's kind of a black art to find the number 3 (the appropriate hypothesis space) in this case.

Then we can roughly say that the hypothesis space is a measure of how expressive your model is for fitting the training data. A hypothesis that is expressive enough for the training data is a good hypothesis drawn from an expressive hypothesis space. To test whether a hypothesis is good or bad, we do cross-validation to see whether it performs well on the validation dataset. If the model is neither underfitting (too limited) nor overfitting (too expressive), the hypothesis space is adequate (according to Occam's razor a simpler one is preferable, but I digress).
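
To reproduce the flavour of this polynomial example numerically, here is a small illustrative sketch using NumPy's polyfit on noisy samples of $\sin(2\pi x)$; the noise level, random seed, and grids are arbitrary choices of the sketch, not values taken from the textbook:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy training cases

x_test = np.linspace(0, 1, 200)
for degree in (1, 3, 9):                 # g_1, g_3 and g_9 from the answer
    coeffs = np.polyfit(x, y, degree)    # (degree 9 may trigger a conditioning warning)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - np.sin(2 * np.pi * x_test)) ** 2)
    print(degree, round(float(train_mse), 4), round(float(test_mse), 4))
```

The degree-9 fit drives the training error to nearly zero while its error against the underlying curve blows up, which is exactly the overfitting behaviour described above.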

  • This approach looks relevant, but your explanation does not agree with that on p. 5 of your first reference: "A function $h:X\to\{0,1\}$ is called [an] hypothesis. A set $H$ of hypotheses among which the approximation function $y$ is searched is called [the] hypothesis space." (I would agree the slide is confusing, because its explanation implicitly requires that $C=\{0,1\}$, whereas that is generically labeled "classes" in the diagram. But let's not pass along that confusion: let's rectify it.) – whuber, Sep 24, 2016
  • @whuber I updated my answer just now, more than two years later, after I have learned more on the topic. Please help check whether I have rectified it in a better way. Thanks. – Lerner Zhang, Feb 5, 2019


SAT-Based Learning of Computation Tree Logic

Conference paper (open access). First online: 01 July 2024.


Adrien Pommellet, Daniel Stan & Simon Scatton

Part of the book series: Lecture Notes in Computer Science (volume 14739)

Included in the following conference series:

  • International Joint Conference on Automated Reasoning

The CTL learning problem consists in finding for a given sample of positive and negative Kripke structures a distinguishing CTL formula that is verified by the former but not by the latter. Further constraints may bound the size and shape of the desired formula or even ask for its minimality in terms of syntactic size. This synthesis problem is motivated by explanation generation for dissimilar models, e.g. comparing a faulty implementation with the original protocol. We devise a SAT-based encoding for a fixed-size CTL formula, then provide an incremental approach that guarantees minimality. We further report on a prototype implementation whose contribution is twofold: first, it allows us to assess the efficiency of various output fragments and optimizations. Secondly, we can experimentally evaluate this tool by randomly mutating Kripke structures or syntactically introducing errors in higher-level models, then learning CTL distinguishing formulas.


  • Computation Tree Logic
  • Passive learning
  • SAT solving

1 Introduction

Passive learning is the act of computing a theoretical model of a system from a given set of data, without being able to acquire further information by actively querying said system. The input data may have been gathered through monitoring, collecting executions and outputs of systems. Automata and logic formulas tend to be the most common models, as they allow one to better explain systems of complex or even entirely opaque design.

Linear-time Temporal Logic LTL [19] remains one of the most widely used formalisms for specifying temporal properties of reactive systems. It applies to finite or infinite execution traces, and for that reason fits the passive learning framework very well: an LTL formula is a concise way to distinguish between correct and incorrect executions. The LTL learning problem, however, is anything but trivial: even simple fragments on finite traces are NP-complete [10], and consequently recent algorithms tend to leverage SAT solvers [16].

Computation Tree Logic CTL [8] is another relevant formalism that applies to execution trees instead of isolated linear traces. It is well-known [1, Thm. 6.21] that LTL and CTL are incomparable: the former is solely defined on the resulting runs of a system, whereas the latter depends on its branching structure.

However, the CTL passive learning problem has seldom been studied in as much detail as LTL. In this article, we formalize it on Kripke structures (KSs): finite graph-like representations of programs. Our goal is to find a CTL formula (said to be separating) that is verified by every state in a positive set \(S^+\) yet rejected by every state in a negative set \(S^-\).

We first prove that an explicit formula can always be computed and we bound its size, assuming the sample is devoid of contradictions. However, said formula may not be minimal. The next step is therefore to solve the bounded learning problem: finding a separating CTL formula of size smaller than a given bound n. We reduce it to an instance \(\varPhi_n\) of the Boolean satisfiability problem whose answer can be computed by a SAT solver; to do so, we encode CTL’s bounded semantics, as the usual semantics on infinite execution trees can lead to spurious results. Finally, we use a bottom-up approach to pinpoint the minimal answer by solving a series of satisfiability problems. We show that a variety of optimizations can be applied to this iterative algorithm. These various approaches have been implemented in a C++ tool and benchmarked on a test sample.
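
The bottom-up search for a minimal formula can be summarized abstractly by the sketch below; here `solve_bounded` is a hypothetical stand-in for deciding the SAT instance \(\varPhi_n\), not the paper's actual API:

```python
def learn_minimal_ctl(sample, max_size, solve_bounded):
    """Bottom-up search for a minimal separating CTL formula.

    `solve_bounded(sample, n)` is a caller-supplied stand-in for the SAT
    encoding of the bounded learning problem: it should return a separating
    formula of size at most n, or None when the corresponding instance
    Phi_n is unsatisfiable.
    """
    for n in range(1, max_size + 1):
        formula = solve_bounded(sample, n)
        if formula is not None:
            return formula        # the first size that succeeds is minimal by construction
    return None
```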

Related Work. Bounded model checking harnesses the efficiency of modern SAT solvers to iteratively look for a witness of bounded size that would contradict a given logic formula, knowing that there exists a completeness threshold after which we can guarantee no counter-example exists. First introduced by Biere et al. [3] for LTL formulas, it was later applied to CTL formulas [17, 24, 25].

This approach inspired Neider et al. [ 16 ], who designed a SAT -based algorithm that can learn a LTL formula consistent with a sample of ultimately periodic words by computing propositional Boolean formulas that encode both the semantics of LTL on the input sample and the syntax of its Directed Acyclic Graph (DAG) representation. This work spurred further SAT -based developments such as learning formulas in the property specification language PSL  [ 21 ] or \(\textsf {LTL}_\textsf {f}\)  [ 7 ], applying MaxSAT solving to noisy datasets [ 11 ], or constraining the shape of the formula to be learnt [ 15 ]. Our article extends this method to CTL formulas and Kripke structures. It subsumes the original LTL learning problem: one can trivially prove that it is equivalent to learning CTL formulas consistent with a sample of lasso-shaped KSs that consist of a single linear sequence of states followed by a single loop.

Fijalkow et al. [ 10 ] have studied the complexity of learning \(\textsf {LTL}_\textsf {f}\) formulas of size smaller than a given bound and consistent with a sample of finite words: it is already \(\textsf {NP}\) -complete for fragments as simple as \(\textsf {LTL}_\textsf {f}(\wedge , {{\,\mathrm{\textsf{X}}\,}})\) , \(\textsf {LTL}_\textsf {f}(\wedge , {{\,\mathrm{\textsf{F}}\,}})\) , or \(\textsf {LTL}_\textsf {f}({{\,\mathrm{\textsf{F}}\,}}, {{\,\mathrm{\textsf{X}}\,}}, \wedge , \vee )\) . However, their proofs cannot be directly extended to samples of infinite but ultimately periodic words.

Browne et al. [ 6 ] proved that KSs could be characterized by CTL formulas and that conversely bisimilar KSs verified the same set of CTL formulas. As we will show in Sect.  3 , this result guarantees that a solution to the CTL learning problem actually exists if the input sample is consistent.

Wasylkowski et al. [ 23 ] mined CTL specifications in order to explain preconditions of Java functions beyond pure state reachability. However, their learning algorithm consists in enumerating CTL templates of the form \({{\,\mathrm{{\forall \textsf{F}}}\,}}a\) , \({{\,\mathrm{{\exists \textsf{F}}}\,}}a\) , \({{\,\mathrm{{\forall \textsf{G}}}\,}}(a \implies {{\,\mathrm{{\forall \textsf{X}}}\,}}{{\,\mathrm{{\forall \textsf{F}}}\,}}b)\) and \({{\,\mathrm{{\forall \textsf{G}}}\,}}(a \implies {{\,\mathrm{{\exists \textsf{X}}}\,}}{{\,\mathrm{{\exists \textsf{F}}}\,}}b)\) where \(a,b \in \textsf{AP}\) for each function, using model checking to select one that is verified by the Kripke structure representing the aforementioned function.

Two very recent articles, yet to be published, have addressed the CTL learning problem as well. Bordais et al. [ 5 ] proved that the passive learning problem for LTL formulas on ultimately periodic words is NP-hard, assuming the size of the alphabet is given as an input; they then extend this result to CTL passive learning, using a straightforward reduction of ultimately periodic words to lasso-shaped Kripke structures. Roy et al. [ 22 ] used a SAT-based algorithm, arriving independently of our own research at an encoding similar to the one outlined in Sect.  4 . However, our explicit solution to the learning problem, the embedding of negations in the syntactic DAG, the approximation of the recurrence diameter as a semantic bound, our implementation of this algorithm, its test suite, and the experimental results are entirely novel contributions.

2 Preliminary Definitions

2.1 Kripke Structures

Let \(\textsf{AP}\) be a finite set of atomic propositions. A Kripke structure is a finite directed graph whose vertices (called states ) are labelled by subsets of \(\textsf{AP}\) .

Definition 1

(Kripke Structure). A Kripke structure (KS) \(\mathcal {K}\) on \(\textsf{AP}\) is a tuple \(\mathcal {K}= (Q, \delta , \lambda )\) such that:

Q is a finite set of states; the integer | Q | is known as the size of \(\mathcal {K}\) ;

\(\delta : Q \rightarrow (2^Q \setminus \{\emptyset \})\) is a transition function; the integer \(\underset{q \in Q}{\max }\ \, |\delta (q)|\) is known as the degree of \(\mathcal {K}\) ;

\(\lambda : Q \rightarrow 2^\textsf{AP}\) is a labelling function.

An infinite run r of \(\mathcal {K}\) starting from a state \(q \in Q\) is an infinite sequence \(r = (s_i) \in Q^\omega \) of consecutive states such that \(s_0 = q\) and \(\forall \,i \ge 0\) , \(s_{i+1} \in \delta (s_i)\) . \(\mathcal {R}_\mathcal {K}(q)\) is the set of all infinite runs of \(\mathcal {K}\) starting from q .

The recurrence diameter \(\alpha _\mathcal {K}(q)\) of state q in \(\mathcal {K}\) is the length of the longest finite run \((s_i)_{i = 0, \ldots , \alpha _\mathcal {K}(q)}\) starting from q such that \(\forall \,i, j \in [0 \mathrel {{.}\,{.}}\alpha _\mathcal {K}(q)]\) , if \(i \ne j\) then \(s_i \ne s_j\) (i.e. the longest simple path in the underlying graph structure). We may omit the index \(\mathcal {K}\) whenever contextually obvious.
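For illustration purposes, the following C++ sketch (our own types and names, not part of the paper's tool) captures Definition 1 together with a naive computation of the recurrence diameter by enumerating simple paths; it is exponential in the worst case and only meant to make the definition concrete.

```cpp
// Minimal sketch (our own types and names, not the paper's tool): a Kripke
// structure over AP and a naive computation of the recurrence diameter
// alpha(q), i.e. the length of the longest simple path starting from q.
#include <algorithm>
#include <set>
#include <string>
#include <vector>

struct KripkeStructure {
  std::vector<std::vector<int>> delta;        // delta[q]: non-empty successor list
  std::vector<std::set<std::string>> lambda;  // lambda[q]: atomic propositions of q
};

// Number of transitions of the longest run without repeated states from q.
// Exponential in the worst case: only meant to illustrate the definition.
int recurrence_diameter(const KripkeStructure& ks, int q, std::vector<bool>& seen) {
  seen[q] = true;
  int best = 0;
  for (int s : ks.delta[q])
    if (!seen[s])
      best = std::max(best, 1 + recurrence_diameter(ks, s, seen));
  seen[q] = false;  // backtrack so that q may appear on other simple paths
  return best;
}

int recurrence_diameter(const KripkeStructure& ks, int q) {
  std::vector<bool> seen(ks.delta.size(), false);
  return recurrence_diameter(ks, q, seen);
}
```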

Note that two states may generate the same runs despite their computation trees not corresponding. It is therefore necessary to define an equivalence relation on states of KSs that goes further than mere run equality.

Definition 2

(Bisimulation Relation). Let \(\mathcal {K}= (Q, \delta , \lambda )\) be a KS on \(\textsf{AP}\) . The canonical bisimulation relation \({\sim } \subseteq Q \times Q\) is the coarsest (i.e. the most general) equivalence relation such that for any \(q_1 \sim q_2\) , \(\lambda (q_1) = \lambda (q_2)\) and \(\forall \,q'_1 \in \delta (q_1), \exists \,q'_2 \in \delta (q_2)\) such that \(q'_1 \sim q'_2\) .

Bisimilarity not only entails equality of runs, but also similarity of shape: two bisimilar states have corresponding computation trees at any depth. A partition refinement algorithm allows one to compute \(\sim \) by refining a sequence of equivalence relations \((\sim _i)_{i \ge 0}\) on \(Q \times Q\) inductively, where for every \(q_1, q_2 \in Q\) :

\[q_1 \sim _0 q_2 \iff \lambda (q_1) = \lambda (q_2)\]

\[q_1 \sim _{i+1} q_2 \iff q_1 \sim _i q_2 \;\wedge \; \{[q'_1]_{\sim _i} \mid q'_1 \in \delta (q_1)\} = \{[q'_2]_{\sim _i} \mid q'_2 \in \delta (q_2)\}\]

where \([q]_{\sim _i}\) stands for the equivalence class of \(q \in Q\) according to the equivalence relation \(\sim _i\) . Intuitively, \(q_1 \sim _i q_2\) if their computation trees are corresponding up to depth i . The next theorem is a well-known result [ 1 , Alg. 31]:

Theorem 1

(Characteristic Number). Given a KS \(\mathcal {K}\) , there exists \(i_0 \in \mathbb {N}\) such that \(\forall \,i \ge i_0\) , \({\sim } = {\sim _i}\) . The smallest integer \(i_0\) verifying that property is known as the characteristic number \(\mathcal {C}_\mathcal {K}\) of \(\mathcal {K}\) .

Note that Browne et al. [ 6 ] introduced an equivalent definition: the characteristic number of a KS is also the smallest integer \(\mathcal {C}_\mathcal {K}\in \mathbb {N}\) such that any two states are not bisimilar if and only if their labelled computation trees of depth \(\mathcal {C}_\mathcal {K}\) are not corresponding.
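A deliberately naive C++ rendering of the refinement scheme above is sketched below (our own code, not the efficient algorithm of [ 1 , Alg. 31]): it simply iterates the refinement condition until the relation stabilizes, which yields \(\sim \) on a finite KS.

```cpp
// Naive sketch of the refinement scheme of Sect. 2.1 (our own code, not the
// efficient algorithm of [1, Alg. 31]): start from label equality and keep
// refining until the relation stabilizes; the fixpoint is the bisimulation ~.
#include <vector>

std::vector<std::vector<bool>> bisimulation(
    const std::vector<std::vector<int>>& delta,        // successors
    const std::vector<std::vector<bool>>& same_label)  // same_label[p][q]: lambda(p) == lambda(q)
{
  const int n = static_cast<int>(delta.size());
  std::vector<std::vector<bool>> sim = same_label;     // ~_0: equal labels
  bool changed = true;
  while (changed) {
    changed = false;
    for (int q1 = 0; q1 < n; ++q1)
      for (int q2 = 0; q2 < n; ++q2) {
        if (!sim[q1][q2]) continue;
        // Every successor of `from` must be matched by a currently
        // equivalent successor of `to`, and conversely.
        auto matched = [&](int from, int to) {
          for (int s1 : delta[from]) {
            bool ok = false;
            for (int s2 : delta[to]) ok = ok || sim[s1][s2];
            if (!ok) return false;
          }
          return true;
        };
        if (!matched(q1, q2) || !matched(q2, q1)) {
          sim[q1][q2] = false;
          changed = true;
        }
      }
  }
  return sim;
}
```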

2.2 Computation Tree Logic

Definition 3

(Computation Tree Logic). Computation Tree Logic ( CTL ) is the set of formulas defined by the following grammar, where \(a \in \textsf{AP}\) is any atomic proposition and \(\dagger \in \{\forall \,, \exists \,\}\) a quantifier:

\[\varphi \,{:}{:}{=}\, \top \mid a \mid \lnot \varphi \mid \varphi \wedge \varphi \mid \varphi \vee \varphi \mid \dagger {{\,\mathrm{\textsf{X}}\,}}\varphi \mid \dagger {{\,\mathrm{\textsf{F}}\,}}\varphi \mid \dagger {{\,\mathrm{\textsf{G}}\,}}\varphi \mid \dagger \, \varphi {{\,\mathrm{\textsf{U}}\,}}\varphi \]

Given \(E \subseteq \{\lnot , \wedge , \vee , {{\,\mathrm{{\forall \textsf{X}}}\,}}, {{\,\mathrm{{\exists \textsf{X}}}\,}}, {{\,\mathrm{{\forall \textsf{F}}}\,}}, {{\,\mathrm{{\exists \textsf{F}}}\,}}, {{\,\mathrm{{\forall \textsf{G}}}\,}}, {{\,\mathrm{{\exists \textsf{G}}}\,}}, {{\,\mathrm{{\forall \textsf{U}}}\,}}, {{\,\mathrm{{\exists \textsf{U}}}\,}}\}\) , we define the (syntactic) fragment \(\textsf {CTL}(E)\) as the subset of CTL formulas featuring only operators in E .

CTL formulas are verified against states of KSs (a process known as model checking ). Intuitively, \({\forall }\) ( all ) means that all runs starting from state q must verify the property that follows, \({\exists }\) ( exists ), that at least one run starting from q must verify the property that follows, \({{\,\mathrm{\textsf{X}}\,}}\varphi \) ( next ), that the next state of the run must verify \(\varphi \) , \({{\,\mathrm{\textsf{F}}\,}}\varphi \) ( finally ), that there exists a state of the run verifying \(\varphi \) , \({{\,\mathrm{\textsf{G}}\,}}\varphi \) ( globally ), that each state of the run must verify \(\varphi \) , and \(\varphi {{\,\mathrm{\textsf{U}}\,}}\psi \) ( until ), that the run must keep verifying \(\varphi \) at least until \(\psi \) is eventually verified.

More formally, for a state \(q\in Q\) of a KS \(\mathcal {K}= (Q, \delta , \lambda )\) and a CTL formula \(\varphi \) , we write \((q \models _\mathcal {K}\varphi )\) when state q satisfies \(\varphi \) in \(\mathcal {K}\) . CTL 's semantics are defined inductively on \(\varphi \) (see [ 1 , Def. 6.4] for a complete definition); we recall below the until case:

Definition 4

(Semantics of \({{{\,\mathrm{{\forall \textsf{U}}}\,}},{{\,\mathrm{{\exists \textsf{U}}}\,}}}\) ). Let \(\mathcal {K}= (Q, \delta , \lambda )\) be a KS, \(\varphi \) and \(\psi \) two CTL formulas, \(q \in Q\) , and \(\dagger \in \{\forall \,, \exists \,\}\) . Then:

\[(q \models _\mathcal {K}\dagger \, \varphi {{\,\mathrm{\textsf{U}}\,}}\psi ) \iff \text{for all (if } \dagger = \forall \text{) or for some (if } \dagger = \exists \text{) runs } (s_i) \in \mathcal {R}_\mathcal {K}(q),\ \exists \,j \ge 0 \text{ such that } s_j \models _\mathcal {K}\psi \text{ and } \forall \,i < j,\ s_i \models _\mathcal {K}\varphi \]

Bisimilarity and CTL equivalence coincide [ 1 , Thm. 7.20] on finite KSs. The proof relies on the following concept:

Theorem 2

(Browne et al.  [ 6 ] ). Given a KS \(\mathcal {K}= (Q, \delta , \lambda )\) and a state \(q \in Q\) , there exists a CTL formula \(\varphi _q \in \textsf {CTL}(\{\lnot , \wedge , \vee , {{\,\mathrm{{\forall \textsf{X}}}\,}}, {{\,\mathrm{{\exists \textsf{X}}}\,}}\})\) known as the master formula of state q such that, for any \(q' \in Q\) , \(q' \models _\mathcal {K}\varphi _q\) if and only if \(q \sim q'\) .

Fig. 1. The syntactic tree and indexed DAG of the CTL formula \(\lnot a \wedge {{\,\mathrm{{\forall \textsf{X}}}\,}}a\) .

To each CTL formula \(\varphi \) , we associate a syntactic tree \(\mathcal {T}\) . For brevity’s sake, we consider a syntactic directed acyclic graph (DAG) \(\mathcal {D}\) by coalescing identical subtrees in the original syntactic tree \(\mathcal {T}\) , as shown in Fig.  1 . The size \(|\varphi |\) of a CTL formula \(\varphi \) is then defined as the number of nodes of its smallest syntactic DAG. As an example, \(|\lnot a \wedge {{\,\mathrm{{\forall \textsf{X}}}\,}}a| = 4\) .

2.3 Bounded Semantics

We introduce the bounded temporal operators \({{\,\mathrm{{\forall \textsf{F}}}\,}}^u\) , \({{\,\mathrm{{\exists \textsf{F}}}\,}}^u\) , \({{\,\mathrm{{\forall \textsf{G}}}\,}}^u\) , \({{\,\mathrm{{\exists \textsf{G}}}\,}}^u\) , \({{\,\mathrm{{\forall \textsf{U}}}\,}}^u\) , and \({{\,\mathrm{{\exists \textsf{U}}}\,}}^u\) , whose semantics only applies to the first u steps of a run. Formally:

Definition 5

(Bounded Semantics of CTL ). Let \(\mathcal {K}= (Q, \delta , \lambda )\) be a KS, \(\varphi \) and \(\psi \) two CTL formulas, \(u \in \mathbb {N}\) and \(q \in Q\) . The bounded semantics of CTL of rank u with regards to \(\mathcal {K}\) are defined as follows for the quantifier \(\dagger \in \{\forall \,, \exists \,\}\) , where the run \((s_i)\) ranges over \(\mathcal {R}_\mathcal {K}(q)\) and is quantified universally if \(\dagger = \forall \) and existentially if \(\dagger = \exists \) :

\[(q \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{F}}\,}}^u \varphi ) \iff \exists \,j \le u,\ s_j \models _\mathcal {K}\varphi \qquad (q \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{G}}\,}}^u \varphi ) \iff \forall \,j \le u,\ s_j \models _\mathcal {K}\varphi \]

\[(q \models _\mathcal {K}\dagger \, \varphi {{\,\mathrm{\textsf{U}}\,}}^u \psi ) \iff \exists \,j \le u,\ s_j \models _\mathcal {K}\psi \text{ and } \forall \,i < j,\ s_i \models _\mathcal {K}\varphi \]

Intuitively, the rank u of the bounded semantics acts as a timer: \((q \models _\mathcal {K}{{\,\mathrm{{\forall \textsf{G}}}\,}}^u \varphi )\) means that \(\varphi \) must hold for the next u computation steps; \((q \models _\mathcal {K}{{\,\mathrm{{\forall \textsf{F}}}\,}}^u \varphi )\) , that q must always be able to reach a state verifying \(\varphi \) within u steps; \((q \models _\mathcal {K}\forall \,\varphi {{\,\mathrm{\textsf{U}}\,}}^u \psi )\) , that q must always be able to reach a state verifying \(\psi \) within u steps, and that \(\varphi \) must hold until it does; etc. This intuition results in the following properties:

Property 1

(Base case). \((q \models _\mathcal {K}\psi ) \iff (q \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{F}}\,}}^0 \psi ) \iff (q \models _\mathcal {K}\dagger \, \varphi {{\,\mathrm{\textsf{U}}\,}}^0 \psi ) \iff (q \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{G}}\,}}^0 \psi )\) .

Property 2

(Induction). \((q \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{F}}\,}}^{u+1} \varphi ) \iff (q \models _\mathcal {K}\varphi ) \vee \underset{q' \in \delta (q)}{\bigtriangleup }\ (q' \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{F}}\,}}^u \varphi )\) , \((q \models _\mathcal {K}\dagger \, \varphi {{\,\mathrm{\textsf{U}}\,}}^{u+1} \psi ) \iff (q \models _\mathcal {K}\psi ) \vee \biggl [(q \models _\mathcal {K}\varphi ) \wedge \underset{q' \in \delta (q)}{\bigtriangleup }\ (q' \models _\mathcal {K}\dagger \, \varphi {{\,\mathrm{\textsf{U}}\,}}^u \psi )\biggr ]\) , and \((q \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{G}}\,}}^{u+1} \varphi ) \iff (q \models _\mathcal {K}\varphi ) \wedge \underset{q' \in \delta (q)}{\bigtriangleup }\ (q' \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{G}}\,}}^u \varphi )\) , where \({\bigtriangleup } = \wedge \) if \(\dagger = \forall \) and \({\bigtriangleup } = \vee \) if \(\dagger = \exists \) .

Property 3

(Spread). \((q \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{F}}\,}}^u \varphi ) \implies (q \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{F}}\,}}^{u+1} \varphi )\) , \((q \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{G}}\,}}^{u+1} \varphi ) \implies (q \models _\mathcal {K}\dagger {{\,\mathrm{\textsf{G}}\,}}^u \varphi )\) , and \((q \models _\mathcal {K}\dagger \, \varphi {{\,\mathrm{\textsf{U}}\,}}^u \psi ) \implies (q \models _\mathcal {K}\dagger \, \varphi {{\,\mathrm{\textsf{U}}\,}}^{u+1} \psi )\) .

Bounded model checking algorithms [ 25 ] rely on the following result, as one can then restrict the study of CTL semantics to finite and fixed length paths.

Theorem 3

Given \(q \in Q\) , for \(\dagger \in \{\forall \,, \exists \,\}\) and \({{\,\mathrm{{\triangleright }}\,}}\in \{{{\,\mathrm{\textsf{F}}\,}}, {{\,\mathrm{\textsf{G}}\,}}\}\) , \(q \models _\mathcal {K}\dagger {{\,\mathrm{{\triangleright }}\,}}\varphi \) (resp. \(q \models _\mathcal {K}\dagger \, \varphi {{\,\mathrm{\textsf{U}}\,}}\psi \) ) if and only if \(q \models _\mathcal {K}\dagger \triangleright ^{\alpha (q)} \varphi \) (resp. \(q \models _\mathcal {K}\dagger \, \varphi {{\,\mathrm{\textsf{U}}\,}}^{\alpha (q)} \psi \) ).

A full proof of this result is available in Appendix A .
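As an illustration of how Properties 1 and 2 combine with Theorem 3 for model checking (in the spirit of the sanity check mentioned in Sect. 6), here is a small dynamic-programming sketch for the operator \({{\,\mathrm{{\forall \textsf{F}}}\,}}\) ; the function name and data layout are ours.

```cpp
// Small dynamic-programming sketch (names and layout are ours): evaluate
// q |= AF^u phi on a KS using the inductive characterization of
// Properties 1 and 2, assuming the truth value of phi is known on every state.
#include <vector>

// delta[q]  : successors of state q (non-empty)
// holds_phi : holds_phi[q] == true iff q |= phi
// Returns sat[u][q] == true iff q |= AF^u phi, for u = 0..bound.
std::vector<std::vector<bool>> all_finally_bounded(
    const std::vector<std::vector<int>>& delta,
    const std::vector<bool>& holds_phi,
    int bound)
{
  const int n = static_cast<int>(delta.size());
  std::vector<std::vector<bool>> sat(bound + 1, std::vector<bool>(n, false));
  sat[0] = holds_phi;                        // Property 1: AF^0 phi is phi
  for (int u = 1; u <= bound; ++u)
    for (int q = 0; q < n; ++q) {
      bool all_succ = true;                  // conjunction over delta(q) (dagger = forall)
      for (int s : delta[q]) all_succ = all_succ && sat[u - 1][s];
      sat[u][q] = holds_phi[q] || all_succ;  // Property 2: induction step
    }
  return sat;
}
// By Theorem 3, calling this with bound = alpha(q) gives q |= AF phi
// iff sat[alpha(q)][q].
```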

3 The Learning Problem

We consider the synthesis problem of a distinguishing CTL formula from a sample of positive and negative states of a given KS.

3.1 Introducing the Problem

First and foremost, the sample must be self-consistent: a state in the positive sample cannot verify a CTL formula while another bisimilar state in the negative sample does not.

Definition 6

(Sample). Given a KS \(\mathcal {K}= (Q, \delta , \lambda )\) , a sample of \(\mathcal {K}\) is a pair \((S^+, S^-) \in 2^Q \times 2^Q\) such that \(\forall \,q^+ \in S^+\) , \(\forall \,q^- \in S^-\) , \(q^+ \not \sim q^-\) .

We define the characteristic number \(\mathcal {C}_\mathcal {K}(S^+, S^-)\) of a sample as the smallest integer \(c \in \mathbb {N}\) such that for every \(q^+ \in S^+\) , \(q^- \in S^-\) , \(q^+ \not \sim _c q^-\) .

Definition 7

(Consistent Formula). A CTL formula \(\varphi \) is said to be consistent with a sample \((S^+, S^-)\) of \(\mathcal {K}\) if \(\forall \,q^+ \in S^+\) , \(q^+ \models _\mathcal {K}\varphi \) and \(\forall \,q^- \in S^-\) , \(q^- \not \models _\mathcal {K}\varphi \) .

The rest of our article focuses on the following passive learning problems:

Definition 8

(Learning Problem). Given a sample \((S^+, S^-)\) of a KS \(\mathcal {K}\) and \(n \in \mathbb {N}^*\) , we introduce the following instances of the CTL learning problem:

\(\textsf {L}_{\textsf {CTL}(E)}(\mathcal {K}, S^+, S^-)\) . Is there \(\varphi \in \textsf {CTL}(E)\) consistent with \((S^+, S^-)\) ?

\(\textsf {L}_{\textsf {CTL}(E)}^{\le n}(\mathcal {K}, S^+, S^-)\) . Is there \(\varphi \in \textsf {CTL}(E)\) , \(|\varphi | \le n\) , consistent with \((S^+, S^-)\) ?

\(\textsf {ML}_{\textsf {CTL}(E)}(\mathcal {K}, S^+, S^-)\) . Find the smallest \(\varphi \in \textsf {CTL}(E)\) consistent with \((S^+, S^-)\) .

Theorem 4

\(\textsf {L}_{\textsf {CTL}}(\mathcal {K}, S^+, S^-)\) and \(\textsf {ML}_{\textsf {CTL}}(\mathcal {K}, S^+, S^-)\) always admit a solution.

Consider \(\psi = \underset{q^+ \in S^+}{\bigvee }\ \varphi _{q^+}\) , where \(\varphi _{q^+}\) is the master formula of Theorem 2. Each \(q^+ \in S^+\) trivially satisfies its own master formula, hence \(\psi \) ; conversely, no \(q^- \in S^-\) can satisfy any \(\varphi _{q^+}\) , as the sample guarantees \(q^+ \not \sim q^-\) . This formula \(\psi \) is therefore consistent with \((S^+, S^-)\) by design. Thus \(\textsf {L}_{\textsf {CTL}}(\mathcal {K}, S^+, S^-)\) always admits a solution, and so does the problem \(\textsf {ML}_{\textsf {CTL}}(\mathcal {K}, S^+, S^-)\) , although \(\psi \) is unlikely to be the minimal solution.     \(\square \)

Bordais et al. [ 5 ] proved that \(\textsf {L}_{\textsf {CTL}}^{\le n}(\mathcal {K}, S^+, S^-)\) is NP -hard, assuming the set of atomic propositions \(\textsf{AP}\) is given as an input as well.

3.2 An Explicit Solution

We must find a formula consistent with the sample \((S^+, S^-)\) ; this is an easier problem than the one answered by Browne et al. [ 6 ]'s Theorem 2, whose master formulas characterize bisimilarity with a given state of the KS. As we know that every state in \(S^-\) is dissimilar to every state in \(S^+\) , we will encode this fact in \(\textsf {CTL}\) form, then use said encoding to design a formula consistent with the sample.

Definition 9

(Separating Formula). Let \((S^+, S^-)\) be a sample of a KS \(\mathcal {K}= (Q, \delta , \lambda )\) . Assuming that \(\textsf{AP}\) and Q are ordered, and given \(q_1, q_2 \in Q\) such that \(q_1 \not \sim q_2\) , formula \(D_{q_1, q_2}\) is defined inductively w.r.t. \(c = \mathcal {C}_\mathcal {K}(\{q_1\}, \{q_2\})\) as follows:

if \(c = 0\) and \(\lambda (q_1) \setminus \lambda (q_2) \ne \emptyset \) has minimal element a , then \(D_{q_1, q_2} = a\) ;

else if \(c = 0\) and \(\lambda (q_2) \setminus \lambda (q_1) \ne \emptyset \) has minimal element a , then \(D_{q_1, q_2} = \lnot a\) ;

else if \(c \ne 0\) and \(\exists \,q_1' \in \delta (q_1)\) , \(\forall \,q_2' \in \delta (q_2)\) , \(q_1' \not \sim _{c-1} q_2'\) , then \(D_{q_1, q_2} = {{\,\mathrm{{\exists \textsf{X}}}\,}}\, \biggl ( \underset{q_2' \in \delta (q_2)}{\bigwedge }\ D_{q_1', q'_2} \biggr )\) , picking the smallest \(q_1'\) verifying this property;

else if \(c \ne 0\) and \(\exists \,q_2' \in \delta (q_2)\) , \(\forall \,q_1' \in \delta (q_1)\) , \(q_1' \not \sim _{c-1} q_2'\) , then \(D_{q_1, q_2} = {{\,\mathrm{{\forall \textsf{X}}}\,}}\, \lnot \biggl ( \underset{q_1' \in \delta (q_1)}{\bigwedge }\ D_{q_2', q_1'} \biggr )\) , picking the smallest \(q_2'\) verifying this property.

The formula \(\mathcal {S}_\mathcal {K}(S^+, S^-) = \underset{q^+ \in S^+}{\bigvee }\ \underset{q^- \in S^-}{\bigwedge }\ D_{q^+, q^-} \in \textsf {CTL}(\{\lnot , \wedge , \vee , {{\,\mathrm{{\forall \textsf{X}}}\,}}, {{\,\mathrm{{\exists \textsf{X}}}\,}}\})\) is then called the separating formula of sample \((S^+, S^-)\) .

Intuitively, the \(\textsf {CTL}\) formula \(D_{q_1, q_2}\) merely expresses that states \(q_1\) and \(q_2\) are dissimilar by negating Definition 2 ; it is such that \(q_1 \models _\mathcal {K}D_{q_1, q_2}\) but \(q_2 \not \models _\mathcal {K}D_{q_1, q_2}\) . Either \(q_1\) and \(q_2\) have different labels, \(q_1\) admits a successor that is dissimilar to \(q_2\) ’s successors, or \(q_2\) admits a successor that is dissimilar to \(q_1\) ’s. The following result is proven in Appendix B :

Theorem 5

The separating formula \(\mathcal {S}_\mathcal {K}(S^+, S^-)\) is consistent with \((S^+, S^-)\) .

As proven in Appendix C , we can bound the size of \(\mathcal {S}_\mathcal {K}(S^+, S^-)\) :

Corollary 1

Assume the KS \(\mathcal {K}\) has degree k and \(c = \mathcal {C}_\mathcal {K}(S^+, S^-)\) , then:

if \(k \ge 2\) , then \(|\mathcal {S}_\mathcal {K}(S^+, S^-)| \le (5 \cdot k^c + 1) \cdot |S^+| \cdot |S^-|\) ;

if \(k = 1\) , then \(|\mathcal {S}_\mathcal {K}(S^+, S^-)| \le (2 \cdot c + 3) \cdot |S^+| \cdot |S^-|\) .
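To make Definition 9 concrete, the following recursive C++ sketch (our own rendering, not the paper's code) builds a textual formula \(D_{q_1, q_2}\) for two dissimilar states, assuming the relations \(\sim _i\) have been precomputed, e.g. by partition refinement.

```cpp
// Recursive sketch of Definition 9 (our own rendering, not the paper's code):
// builds a textual formula D_{q1,q2} distinguishing two dissimilar states.
// sim[i][p][q] == true iff p ~_i q (precomputed), lambda[q] is q's label set.
#include <set>
#include <string>
#include <vector>

std::string distinguish(int q1, int q2,
                        const std::vector<std::vector<std::vector<bool>>>& sim,
                        const std::vector<std::set<std::string>>& lambda,
                        const std::vector<std::vector<int>>& delta)
{
  // c = characteristic number of ({q1}, {q2}): smallest i such that q1 !~_i q2.
  std::size_t c = 0;
  while (sim[c][q1][q2]) ++c;

  if (c == 0) {                                  // the labels differ
    for (const auto& a : lambda[q1])
      if (!lambda[q2].count(a)) return a;        // D = a
    for (const auto& a : lambda[q2])
      if (!lambda[q1].count(a)) return "!" + a;  // D = !a
  }
  // Case EX: some successor of q1 is ~_{c-1}-dissimilar to all of q2's.
  for (int s1 : delta[q1]) {
    bool witness = true;
    for (int s2 : delta[q2]) witness = witness && !sim[c - 1][s1][s2];
    if (!witness) continue;
    std::string conj;
    for (int s2 : delta[q2])
      conj += (conj.empty() ? "" : " & ") + distinguish(s1, s2, sim, lambda, delta);
    return "EX(" + conj + ")";
  }
  // Symmetric case AX: some successor of q2 is dissimilar to all of q1's.
  for (int s2 : delta[q2]) {
    bool witness = true;
    for (int s1 : delta[q1]) witness = witness && !sim[c - 1][s1][s2];
    if (!witness) continue;
    std::string conj;
    for (int s1 : delta[q1])
      conj += (conj.empty() ? "" : " & ") + distinguish(s2, s1, sim, lambda, delta);
    return "AX(!(" + conj + "))";
  }
  return "";  // unreachable when q1 and q2 are indeed dissimilar
}
```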

4 SAT-Based Learning

The universal fragment \({\textsf {CTL}_{\forall }}= \textsf {CTL}(\{\lnot , \wedge , \vee , {{\,\mathrm{{\forall \textsf{X}}}\,}}, {{\,\mathrm{{\forall \textsf{F}}}\,}}, {{\,\mathrm{{\forall \textsf{G}}}\,}}, {{\,\mathrm{{\forall \textsf{U}}}\,}}\})\) of CTL happens to be as expressive as the full logic [ 1 , Def. 6.13]. For that reason, we will reduce a learning instance of \({\textsf {CTL}_{\forall }}\) of rank n to an instance of the SAT problem. A similar reduction has been independently found by Roy et al. [ 22 ].

Lemma 1

There exists a Boolean propositional formula \(\varPhi _n\) such that the instance \(\textsf {L}_{\textsf {CTL}_{\forall }}^{\le n}(\mathcal {K}, S^+, S^-)\) of the learning problem admits a solution \(\varphi \) if and only if the formula \(\varPhi _n\) is satisfiable.

4.1 Modelling the Formula

Assume that there exists a syntactic DAG \(\mathcal {D}\) of size smaller than or equal to n representing the desired \(\textsf {CTL}\) formula \(\varphi \) . Let us index \(\mathcal {D}\) ’s nodes in \([1 \mathrel {{.}\,{.}}n]\) in such a fashion that each node has a higher index than its children, as shown in Fig.  1 . Hence, n always labels a root and 1 always labels a leaf.

Let \(\mathcal {L}= \textsf{AP}\cup \{\top , \lnot , \wedge , \vee , {{\,\mathrm{{\forall \textsf{X}}}\,}}, {{\,\mathrm{{\forall \textsf{F}}}\,}}, {{\,\mathrm{{\forall \textsf{G}}}\,}}, {{\,\mathrm{{\forall \textsf{U}}}\,}}\}\) be the set of labels that decorates the DAG’s nodes. For each \(i\ \in [1 \mathrel {{.}\,{.}}n]\) and \(o \in \mathcal {L}\) , we introduce a Boolean variable \(\tau _i^o\) such that \(\tau _i^o = 1\) if and only if the node of index i is labelled by o .

For all \(i \in [1 \mathrel {{.}\,{.}}n]\) and \(j \in [0 \mathrel {{.}\,{.}}i-1]\) , we also introduce a Boolean variable \(\textsf{l}_{i, j}\) (resp. \(\textsf{r}_{i, j}\) ) such that \(\textsf{l}_{i, j} = 1\) (resp. \(\textsf{r}_{i, j} = 1\) ) if and only if j is the left (resp. right) child of i . Having a child of index 0 stands for having no child at all in the actual syntactic DAG \(\mathcal {D}\) .

Three mutual exclusion clauses guarantee that each node of the syntactic DAG has exactly one label and at most one left child and one right child. Moreover, three other clauses ensure that a node labelled by an operator of arity x has exactly x actual children (by convention, if \(x = 1\) then its child is to the right). These simple clauses are similar to Neider et al.’s encoding [ 16 ] and for that reason are not detailed here.
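As a minimal illustration of these structural constraints, the following Z3 (C++ API) sketch declares the labelling variables \(\tau _i^o\) and asserts that each node carries exactly one label; the variable names and the chosen label set are ours, and the child variables \(\textsf{l}_{i, j}\) , \(\textsf{r}_{i, j}\) and the arity clauses would be declared and constrained in the same fashion.

```cpp
// Minimal Z3 (C++ API) sketch of the labelling variables tau_i^o and of the
// "exactly one label per node" constraint. Variable names and the label set
// are ours; the child variables l_{i,j}, r_{i,j} and the arity clauses would
// be declared and constrained in the same fashion.
#include <z3++.h>
#include <string>
#include <vector>

int main() {
  const int n = 4;                                   // bound on the formula size
  const std::vector<std::string> labels =
      {"top", "a", "b", "and", "or", "AX", "AF", "AG", "AU"};

  z3::context ctx;
  z3::solver solver(ctx);

  // tau[i][o] == true iff node i+1 is labelled by labels[o].
  std::vector<std::vector<z3::expr>> tau;
  for (int i = 1; i <= n; ++i) {
    std::vector<z3::expr> row;
    for (const auto& o : labels)
      row.push_back(ctx.bool_const(("tau_" + std::to_string(i) + "_" + o).c_str()));
    tau.push_back(row);
  }

  for (const auto& row : tau) {
    z3::expr_vector any(ctx);                        // at least one label
    for (const auto& v : row) any.push_back(v);
    solver.add(z3::mk_or(any));
    for (std::size_t o1 = 0; o1 < row.size(); ++o1)  // at most one label
      for (std::size_t o2 = o1 + 1; o2 < row.size(); ++o2)
        solver.add(!row[o1] || !row[o2]);
  }

  return solver.check() == z3::sat ? 0 : 1;
}
```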

4.2 Applying the Formula to the Sample

For all \(i \in [1 \mathrel {{.}\,{.}}n]\) and \(q \in Q\) , we introduce a Boolean variable \(\varphi _i^q\) such that \(\varphi _i^q = 1\) if and only if state q verifies the sub-formula \(\varphi _i\) rooted in node i . The next clauses implement the semantics of the true symbol \(\top \) , the atomic propositions, and the CTL operator \({{\,\mathrm{{\forall \textsf{X}}}\,}}\) .

(Clauses \(\texttt{sem}_\top \) , \(\texttt{sem}_a\) , and \(\texttt{sem}_{{{\,\mathrm{{\forall \textsf{X}}}\,}}}\) , not reproduced here.)

Semantic clauses are structured as follows: an antecedent stating node i ’s label and its possible children implies a consequent expressing \(\varphi _i^{q}\) ’s semantics for each \(q \in Q\) . Clause \(\texttt{sem}_\top \) states that \(q \models \top \) is always true; clause \(\texttt{sem}_a\) , that \(q \models a\) if and only if q is labelled by a ; and clause \(\texttt{sem}_{{{\,\mathrm{{\forall \textsf{X}}}\,}}}\) , that \(q \models {{\,\mathrm{{\forall \textsf{X}}}\,}}\psi \) if and only if all of q ’s successors verify \(\psi \) . Similar straightforward clauses encode the semantics of the Boolean connectors \(\lnot \) , \(\wedge \) and \(\vee \) .
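To illustrate the shape of these semantic clauses, a plausible rendering of \(\texttt{sem}_{{{\,\mathrm{{\forall \textsf{X}}}\,}}}\) consistent with the description above (the exact clause in the original figure may differ in presentation) is:

\[\texttt{sem}_{{{\,\mathrm{{\forall \textsf{X}}}\,}}}:\quad \bigwedge _{i = 1}^{n}\ \bigwedge _{j = 1}^{i-1} \Bigl [ \bigl (\tau _i^{{{\,\mathrm{{\forall \textsf{X}}}\,}}} \wedge \textsf{r}_{i, j}\bigr ) \implies \bigwedge _{q \in Q} \Bigl ( \varphi _i^q \iff \bigwedge _{q' \in \delta (q)} \varphi _j^{q'} \Bigr ) \Bigr ]\]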

CTL semantics are also characterized by fixed points, and a naive encoding might capture a spurious (least instead of greatest, or conversely) fixed point: we therefore resort to the bounded semantics \({{\,\mathrm{{\forall \textsf{F}}}\,}}^u\) , \({{\,\mathrm{{\forall \textsf{U}}}\,}}^u\) and \({{\,\mathrm{{\forall \textsf{G}}}\,}}^u\) . For all \(i \in [1 \mathrel {{.}\,{.}}n]\) , \(q \in Q\) , and \(u \in [0 \mathrel {{.}\,{.}}\alpha (q)]\) , we introduce a Boolean variable \(\rho _{i,q}^u\) such that \(\rho _{i,q}^u = 1\) if and only if q verifies the sub-formula rooted in i according to the CTL bounded semantics of rank u (e.g. \(q \models {{\,\mathrm{{\forall \textsf{F}}}\,}}^u \psi \) , assuming sub-formula \({{\,\mathrm{{\forall \textsf{F}}}\,}}\psi \) is rooted in node i ).

Thanks to Theorem 3 we can introduce the following equivalence clause:

(Equivalence clause, not reproduced here.)

Property 3 yields two other clauses, whose inclusion is not mandatory (they were left out by Roy et al. [ 22 ]), that further constrain the bounded semantics:

(Spread clauses, not reproduced here.)

The next clause enables variable \(\rho _{i,q}^u\) for temporal operators only:

(Activation clause, not reproduced here.)

Properties 1 and 2 yield an inductive definition of the bounded semantics. We only make explicit the base case \(\texttt{base}_{\rho }\) and the semantics \(\texttt{sem}_{{{\,\mathrm{{\forall \textsf{U}}}\,}}}\) of \({{\,\mathrm{{\forall \textsf{U}}}\,}}^u\) , but we also implement semantic clauses for the temporal operators \({{\,\mathrm{{\forall \textsf{F}}}\,}}\) and \({{\,\mathrm{{\forall \textsf{G}}}\,}}\) .

(Clauses \(\texttt{base}_{\rho }\) and \(\texttt{sem}_{{{\,\mathrm{{\forall \textsf{U}}}\,}}}\) , not reproduced here.)

Finally, the last clause ensures that the full formula \(\varphi \) (rooted in node n ) is verified by the positive sample but not by the negative sample.

(Clause \(\texttt{sem}_\varphi \) , not reproduced here.)
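This final constraint can for instance be rendered as follows (our notation, possibly differing cosmetically from the original figure):

\[\texttt{sem}_\varphi :\quad \bigwedge _{q^+ \in S^+} \varphi _n^{q^+} \ \wedge \ \bigwedge _{q^- \in S^-} \lnot \varphi _n^{q^-}\]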

4.3 Solving the SAT Instance

We finally define the formula \(\varPhi _n\) as the conjunction of all the aforementioned clauses. Assuming an upper bound d on the KS’s recurrence diameter, this encoding requires \(\mathcal {O}(n^2 + n \cdot |\textsf{AP}| + n \cdot |Q| \cdot d)\) variables and \(\mathcal {O}(n \cdot |\textsf{AP}| + n^3 \cdot |Q| \cdot d + n \cdot |\textsf{AP}| \cdot |Q|)\) clauses, not taking transformation to conjunctive normal form into account. By design, Lemma 1 holds.

The syntactic clauses allow one to infer the DAG of a formula \(\varphi \in \textsf {CTL}\) of size smaller than or equal to n from the valuations taken by the variables \((\tau _i^o)\) , \((\textsf{l}_{i, j})\) , and \((\textsf{r}_{i, j})\) . Clauses \(\texttt{sem}_a\) to \(\texttt{sem}_\varphi \) guarantee that the sample is consistent with said formula \(\varphi \) , thanks to Theorem 3 and Properties 1 , 2 , and 3 .     \(\square \)

5 Algorithms for the Minimal Learning Problem

We introduce in this section an algorithm to solve the minimal learning problem \(\textsf {ML}_{\textsf {CTL}_{\forall }}(\mathcal {K}, S^+, S^-)\) . Remember that, by Theorem 4, it always admits a solution as long as the state sample is consistent.

(Algorithm 1, not reproduced here.)

5.1 A Bottom-Up Algorithm

By Theorem 4 , there exists a rank \(n_0\) such that the problem \(\textsf {L}_{{\textsf {CTL}_{\forall }}}^{\le n_0}(\mathcal {K}, S)\) admits a solution. Starting from \(n = 0\) , we can therefore try to solve \(\textsf {L}_{\textsf {CTL}_{\forall }}^{\le n}(\mathcal {K}, S)\) incrementally until a (minimal) solution is found, in a similar manner to Neider and Gavran [ 16 ]. Algorithm 1 therefore terminates, \(n_0\) being an upper bound on the number of iterations required.
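The following C++ sketch shows the corresponding bottom-up loop over Z3; the helpers build_phi and extract_formula are hypothetical placeholders (standing for the encoding of Sect.  4 and the decoding of a model), not functions exposed by LearnCTL.

```cpp
// Sketch of the bottom-up search (Algorithm 1) over Z3. The helpers build_phi
// and extract_formula are hypothetical placeholders: the former would assert
// all the clauses of Phi_n for the input KS and sample (Sect. 4), the latter
// would decode the syntactic DAG from the variables tau, l and r of a model.
#include <z3++.h>
#include <string>

void build_phi(z3::solver& solver, z3::context& ctx, int n);   // hypothetical helper
std::string extract_formula(const z3::model& model, int n);    // hypothetical helper

std::string learn_minimal_formula() {
  z3::context ctx;
  for (int n = 1; ; ++n) {           // terminates: Phi_{n_0} is satisfiable by Theorem 4
    z3::solver solver(ctx);
    build_phi(solver, ctx, n);       // encode the bounded learning instance of rank n
    if (solver.check() == z3::sat)
      return extract_formula(solver.get_model(), n);  // first hit, hence minimal
  }
}
```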

5.2 Embedding Negations

The CTL formula \({{\,\mathrm{{\exists \textsf{F}}}\,}}a\) is equivalent to the \({\textsf {CTL}_{\forall }}\) formula \(\lnot {{\,\mathrm{{\forall \textsf{G}}}\,}}\lnot a\) , yet the former remains more succinct, being of size 2 instead of 4. While \({\textsf {CTL}_{\forall }}\) has been proven to be as expressive as CTL , the sheer number of negations needed to express an equivalent formula can significantly burden the syntactic DAG. A possible optimization is to no longer consider the negation \(\lnot \) as an independent operator but instead embed it in the nodes of the syntactic DAG, as shown in Fig.  2 .

Fig. 2. The syntactic DAG of \(\lnot \exists \,\top {{\,\mathrm{\textsf{U}}\,}}\lnot {{\,\mathrm{{\forall \textsf{X}}}\,}}\lnot a\) , before and after embedding negations.

Note that such a definition of the syntactic DAG alters one’s interpretation of a CTL formula’s size: as a consequence, under this optimization, Algorithm 1 may yield a formula with many negations that is no longer minimal under the original definition of size outlined in Sect.  2.2 .

Formally, for each \(i \in [1 \mathrel {{.}\,{.}}n]\) , we introduce a new variable \(\nu _i\) such that \(\nu _i = 0\) if and only if the node of index i is negated. As an example, in Fig.  2 , \(\nu _1 = \nu _3 = \nu _4 = 0\) , but \(\nu _2 = 1\) and the sub-formula rooted in node 3 is \(\lnot {{\,\mathrm{{\forall \textsf{X}}}\,}}\lnot a\) .

We then change the SAT encoding of \({\textsf {CTL}_{\forall }}\) ’s semantics accordingly. We remove the \(\lnot \) operator from the syntactic DAG clauses and the set \(\mathcal {L}\) of labels. We delete its semantics and update the semantic clauses of the other operators. Indeed, the right side of each equivalence expresses the semantics of the operator rooted in node i before applying the embedded negation; we must therefore change the left side of the semantic equivalence accordingly, replacing the Boolean variable \(\varphi _i^q\) with the formula \(\tilde{\varphi }_i^q = (\lnot \nu _i \wedge \lnot \varphi _i^q) \vee (\nu _i \wedge \varphi _i^q)\) that is equivalent to \(\varphi _i^q\) if \(\nu _i = 1\) and \(\lnot \varphi _i^q\) if \(\nu _i = 0\) .
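Under this convention, the semantic clause for \({{\,\mathrm{{\forall \textsf{X}}}\,}}\) given earlier would for instance become (our rendering, under the same assumptions as before):

\[\bigl (\tau _i^{{{\,\mathrm{{\forall \textsf{X}}}\,}}} \wedge \textsf{r}_{i, j}\bigr ) \implies \bigwedge _{q \in Q} \Bigl ( \tilde{\varphi }_i^q \iff \bigwedge _{q' \in \delta (q)} \varphi _j^{q'} \Bigr )\]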

5.3 Optimizations and Alternatives

Minimizing the Input KS. In order to guarantee that an input \(S\) is indeed a valid sample, one has to ensure no state in the positive sample is bisimilar to a state in the negative sample. To do so, one has to at least partially compute the bisimilarity relation \(\sim \) on \(\mathcal {K}= (Q, \delta , \lambda )\) . But refining it to completion can be efficiently performed in \(\mathcal {O}(|Q| \cdot |\textsf{AP}| + |\delta | \cdot \log (|Q|))\) operations [ 1 , Thm. 7.41], yielding a bisimilar KS \(\mathcal {K}_{\min }\) of minimal size.

Minimizing the input KS is advantageous as the size of the semantic clauses depends on the size of \(\mathcal {K}\) , and the SAT solving step is likely to be the computational bottleneck. As a consequence, we always fully compute the bisimulation relation \(\sim \) on \(\mathcal {K}\) and minimize said KS.

Approximating the Recurrence Diameter. Computing the recurrence diameter of a state q is unfortunately an NP-hard problem that is known to be hard to approximate [ 4 ]. A coarse upper bound is \(\alpha (q) \le |Q|-1\) ; it may however result in a significant number of unnecessary variables and clauses. Fortunately, the decomposition of a KS \(\mathcal {K}\) into strongly connected components (SCCs) yields the finer over-approximation shown in Fig.  3 , which relies on the ease of computing \(\alpha \) on a DAG. It is also more generic and better suited to CTL than existing approximations dedicated to LTL bounded model checking [ 14 ].

Contracting each SCC to a single vertex yields a DAG known as the condensation of \(\mathcal {K}\) . We weight each vertex of this DAG \(\mathcal {G}\) with the number of vertices in the matching SCC. Then, to each state q in the original KS \(\mathcal {K}\) , we associate the weight \(\beta (q)\) of the longest path in the DAG \(\mathcal {G}\) starting from q 's SCC, minus one (in order not to count q ). Intuitively, our approximation assumes that a simple path entering an SCC can always visit every single one of its states once before exiting, a property that obviously does not hold for two of the SCCs shown in Fig.  3 .
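A short C++ sketch of this over-approximation is given below (our own code and names), assuming the SCC decomposition and the condensation DAG have already been computed, e.g. with Tarjan's algorithm.

```cpp
// Sketch of the over-approximation beta (our own code and names), assuming the
// SCC decomposition and the condensation DAG are already available, e.g. from
// Tarjan's algorithm.
#include <algorithm>
#include <cstddef>
#include <vector>

// Heaviest-path weight in the condensation DAG starting from component c,
// memoized; the condensation is acyclic, so the recursion terminates.
std::size_t heaviest(std::size_t c,
                     const std::vector<std::size_t>& size,              // size[c]: states in SCC c
                     const std::vector<std::vector<std::size_t>>& dag,  // condensation edges
                     std::vector<std::size_t>& memo)
{
  if (memo[c] != 0) return memo[c];
  std::size_t below = 0;
  for (std::size_t d : dag[c]) below = std::max(below, heaviest(d, size, dag, memo));
  return memo[c] = size[c] + below;   // size[c] >= 1, so a computed entry is never 0
}

// beta[q]: over-approximation of the recurrence diameter alpha(q).
std::vector<std::size_t> approximate_diameter(
    const std::vector<std::size_t>& scc,   // scc[q]: index of the SCC containing q
    const std::vector<std::size_t>& size,
    const std::vector<std::vector<std::size_t>>& dag)
{
  std::vector<std::size_t> memo(size.size(), 0);
  std::vector<std::size_t> beta(scc.size());
  for (std::size_t q = 0; q < scc.size(); ++q)
    beta[q] = heaviest(scc[q], size, dag, memo) - 1;   // minus one: do not count q itself
  return beta;
}
```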

Encoding the Full Logic. \({\textsf {CTL}_{\forall }}\) is semantically exhaustive, but the existential temporal operators commonly appear in the literature; we can therefore consider the learning problem on the full CTL logic by integrating the operators \({{\,\mathrm{{\exists \textsf{X}}}\,}}\) , \({{\,\mathrm{{\exists \textsf{F}}}\,}}\) , \({{\,\mathrm{{\exists \textsf{G}}}\,}}\) , and \({{\,\mathrm{{\exists \textsf{U}}}\,}}\) and provide a Boolean encoding of their semantics. We also consider the fragment \({\textsf {CTL}_{{{\,\mathrm{\textsf{U}}\,}}}}= \textsf {CTL}(\{\lnot , \vee , {{\,\mathrm{{\exists \textsf{X}}}\,}}, {{\,\mathrm{{\exists \textsf{G}}}\,}}, {{\,\mathrm{{\exists \textsf{U}}}\,}}\})\) used by Roy et al. [ 22 ].

Fig. 3. An approximation \(\beta \) of the recurrence diameter \(\alpha \) relying on SCC decomposition that improves upon the coarse upper bound \(\alpha (q) \le |Q|-1 = 6\) .

6 Experimental Implementation

We implement our learning algorithm in a C++ prototype tool LearnCTL (Footnote 1) relying on Microsoft's Z3 due to its convenient C++ API. It takes as an input a sample of positive and negative KSs with initial states, which are then coalesced into a single KS and a sample of states compatible with the theoretical framework we described. It finally returns a separating \({\textsf {CTL}_{\forall }}\) , CTL , or \({\textsf {CTL}_{{{\,\mathrm{\textsf{U}}\,}}}}\) formula, after a sanity check performed by model-checking the input KSs against the learnt formula using a simple algorithm based on Theorem 3.

6.1 Benchmark Collection

We intend to use our tool to generate formulas that can explain flaws in faulty implementations of known protocols. To do so, we consider structures generated by higher-level formalisms such as program graphs: a single mutation in the program graph results in several changes in the resulting KS. This process has been carried out manually according to the following criteria:

The mutations only consist in deleting lines.

The resulting KS should be small, with fewer than \(\sim 1000\) states.

Any mutation should result in a syntactically correct model.

We collected program graphs in a toy specification language for a CTL model checker class implemented in Java. Furthermore, we also considered PROMELA models from the Spin model-checker [ 12 ] repository. Translations were then performed through the Python interface of Spot/LTSmin [ 9 , 13 ].

Fig. 4. Peterson's mutual exclusion protocol in PROMELA and learnt formulas for each deleted instruction.

Consider the mutual exclusion protocol proposed by Peterson [ 18 ] and specified in PROMELA in Fig.  4 , which generates a KS with 55 states. We generate mutants by deleting no more than one line of code at a time, ignoring variable and process declarations (as they are necessary for the model to be compiled) and the two assertion lines that are discarded by our KS generator, our reasoning being that subtle changes yield richer distinguishing formulas.

Furthermore, removing the instruction ncrit-- alone would lead to an infinite state space; thus, its deletion is only considered together with the instruction ncrit++ . Finally, we set some atomic propositions of interest: c stands for at least one process being in the critical section ( ncrit>0 ), m for both processes ( ncrit>1 ), and t for process 0's turn. An extra \(\textit{dead}\) atomic proposition is added by Spot/LTSmin to represent deadlocked states.

As summarized in Fig.  4 , every mutated model, once compared with the original KS, leads to a distinguishing formula characterizing Peterson's protocol: mutations m1 , m2 , and m3 yield a mutual exclusion property, m4 yields a liveness property, m5 yields a fairness property, and m6 yields a global liveness formula.

6.2 Quantitative Evaluation

We quantitatively assess the performance of the various optimizations and CTL fragments discussed previously. To do so, we further enrich the benchmark series through the use of random mutations of hard-coded KSs: these mutations may alter some states, re-route some edges, and spawn new states. We consider a total of 234 test samples, ranging from size 11 to 698 after minimization. We perform the benchmarks on a GNU/Linux Debian machine ( bullseye ) with 24 cores (Intel(R) Xeon(R) CPU E5-2620 @ 2.00 GHz) and 256 GB of RAM, using version 4.8.10 of libz3 and 1.0 of LearnCTL .

Table  1 displays a summary of these benchmarks: \(\beta \) stands for the refined approximation of the recurrence diameter described in Sect.  5.3 ; \(\lnot \) , for the embedding of negations in the syntactic tree introduced in Sect.  5.2 . The average size of the syntactic DAGs learnt is 4.14.

Option \(\beta \) yields the greatest improvement, being on average at least 6 times faster than the default configuration; option \(\lnot \) further divides the average runtime by at least 2. These two optimizations alone speed up the average runtime by a factor of 12 to 20. The CTL fragment used, all other options being equal, does not influence the average runtime as much (less than twofold in the worst case scenario); \(({\textsf {CTL}_{{{\,\mathrm{\textsf{U}}\,}}}}, \beta , \lnot )\) is the fastest option, closely followed by \(({\textsf {CTL}_{\forall }}, \beta , \lnot )\) .

Intuitively, approximating the recurrence diameter aggressively cuts down the number of SAT variables needed: assuming that \(\alpha \) has upper bound d , we only need \(n \cdot |Q| \cdot d\) Boolean variables \((\rho _{i,q}^u)\) instead of \(n \cdot |Q|^2\) . Moreover, embedding negations, despite requiring more complex clauses, results in smaller syntactic DAGs with “free” negations, hence faster computations, keeping in mind that the last SAT instances are the most expensive to solve, being the largest.

Figure  5 further displays a log-log plot comparing the runtime of the most relevant fragments and options to \(({\textsf {CTL}_{{{\,\mathrm{\textsf{U}}\,}}}}, \beta , \lnot )\) . For a given set of parameters, each point stands for one of the 234 test samples. Points above the black diagonal favour \(({\textsf {CTL}_{{{\,\mathrm{\textsf{U}}\,}}}}, \beta , \lnot )\) ; points below, the aforementioned option. Points on the second dotted lines at the edges of the figure represent timeouts.

Unsurprisingly, \(({\textsf {CTL}_{\forall }}, \beta , \lnot )\) and \((\textsf {CTL}, \beta , \lnot )\) outperform \(({\textsf {CTL}_{{{\,\mathrm{\textsf{U}}\,}}}}, \beta , \lnot )\) when a minimal distinguishing formula using the operator \({{\,\mathrm{{\forall \textsf{U}}}\,}}\) exists: the duality between \({{\,\mathrm{{\forall \textsf{U}}}\,}}\) and \({{\,\mathrm{{\exists \textsf{U}}}\,}}\) is complex and, unlike the other operators, cannot be handled at no cost by the embedded negation as it depends on the release operator.

Fig. 5. Comparing \(({\textsf {CTL}_{{{\,\mathrm{\textsf{U}}\,}}}}, \beta , \lnot )\) to other options on every sample.

7 Conclusion and Further Developments

We explored in this article the CTL learning problem: we first provided a direct explicit construction before relying on a SAT encoding inspired by bounded model-checking to iteratively find a minimal answer. We also introduced in Sect.  3 an explicit answer to the learning problem that belongs to the fragment \(\textsf {CTL}(\lnot , \wedge , \vee , {{\,\mathrm{{\forall \textsf{X}}}\,}}, {{\,\mathrm{{\exists \textsf{X}}}\,}})\) . It remains to be seen if a smaller formula can be computed using a more exhaustive selection of \(\textsf {CTL}\) operators. A finer grained explicit solution could allow one to experiment with a top-down approach as well.

Moreover, we provided a dedicated C++ implementation, and evaluated it on models of higher-level formalisms such as PROMELA. Since the resulting KSs have large state spaces, further symbolic approaches are to be considered for future work, when dealing with program graphs instead of Kripke structures. In this setting, one might also consider the synthesis problem of the relevant atomic propositions from the exposed program variables. Nevertheless, the experiments on Kripke structures already showcase the benefits of the approximated recurrence diameter computation and of our extension of the syntactic DAG definition, as well as the limited relevance of the target CTL fragment.

Another avenue for optimizations can be inferred from the existing SAT-based LTL learning literature: in particular, Riener [ 20 ] relied on a topology-guided approach by explicitly enumerating the possible shapes of the syntactic DAG and solving the associated SAT instances in parallel. Given the small size on average of the formulas learnt so far and the quadratic factor impacting the number of semantic clauses such as \(\texttt{sem}_{{{\,\mathrm{{\forall \textsf{U}}}\,}}}\) due to the structural variables \(\textsf{l}_{i, j}\) and \(\textsf{r}_{i, k}\) , this approach could yield huge performance gains in CTL 's case as well.

We relied on Z3 ’s convenient C++ API, but intuit that we would achieve better performance with state-of-the-art SAT solvers such as the winners of the yearly SAT competition [ 2 ]. We plan on converting our Boolean encoding to the DIMACS CNF format in order to interface our tool with modern SAT solvers.

Finally, it is known that the bounded learning problem is NP-complete, but we would also like to find the exact complexity class of the minimal CTL learning problem. We intuit that it does not belong to NP, Kripke structures being a denser encoding in terms of information than lists of linear traces: as an example, one can trivially compute an LTL formula (resp. a CTL formula) of polynomial size that distinguishes a sample of ultimately periodic words (resp. of finite computation trees with lasso-shaped leaves), but the same cannot be said of a sample of Kripke structures. It remains to be seen if this intuition can be confirmed or infirmed by a formal proof.

Footnote 1: LearnCTL is publicly available at https://gitlab.lre.epita.fr/adrien/learnctl .

Baier, C., Katoen, J.: Principles of Model Checking. MIT Press (2008)


Balyo, T., Heule, M., Iser, M., Järvisalo, M., Suda, M. (eds.): Proceedings of SAT Competition 2023: Solver, Benchmark and Proof Checker Descriptions. Department of Computer Science Series of Publications B, Department of Computer Science, University of Helsinki, Finland (2023)

Biere, A., Cimatti, A., Clarke, E., Zhu, Y.: Symbolic model checking without BDDs. In: Cleaveland, W.R. (ed.) Tools and Algorithms for the Construction and Analysis of Systems, pp. 193–207. Springer, Berlin Heidelberg, Berlin, Heidelberg (1999). https://doi.org/10.1007/3-540-49059-0_14


Björklund, A., Husfeldt, T., Khanna, S.: Approximating longest directed paths and cycles. In: Díaz, J., Karhumäki, J., Lepistö, A., Sannella, D. (eds.) ICALP 2004. LNCS, vol. 3142, pp. 222–233. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-27836-8_21

Bordais, B., Neider, D., Roy, R.: Learning temporal properties is NP-hard (2023)

Browne, M., Clarke, E., Grümberg, O.: Characterizing finite Kripke structures in propositional temporal logic. Theor. Comput. Sci. 59 (1), 115–131 (1988). https://doi.org/10.1016/0304-3975(88)90098-9

Camacho, A., McIlraith, S.A.: Learning interpretable models expressed in linear temporal logic. In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 29, no. 1, pp. 621–630 (May 2021). https://doi.org/10.1609/icaps.v29i1.3529 , https://ojs.aaai.org/index.php/ICAPS/article/view/3529

Clarke, E.M., Emerson, E.A.: Design and synthesis of synchronization skeletons using branching time temporal logic. In: Kozen, D. (ed.) Logic of Programs 1981. LNCS, vol. 131, pp. 52–71. Springer, Heidelberg (1982). https://doi.org/10.1007/BFb0025774

Duret-Lutz, A., et al.: From Spot 2.0 to Spot 2.10: What’s new? In: Proceedings of the 34th International Conference on Computer Aided Verification (CAV’22). LNCS, vol. 13372, pp. 174–187. Springer (Aug 2022). https://doi.org/10.1007/978-3-031-13188-2_9

Fijalkow, N., Lagarde, G.: The complexity of learning linear temporal formulas from examples. In: Chandlee, J., Eyraud, R., Heinz, J., Jardine, A., van Zaanen, M. (eds.) Proceedings of the 15th International Conference on Grammatical Inference, 23-27 August 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 153, pp. 237–250. PMLR (2021). https://proceedings.mlr.press/v153/fijalkow21a.html

Gaglione, J.R., Neider, D., Roy, R., Topcu, U., Xu, Z.: Learning linear temporal properties from noisy data: a MaxSAT-based approach. In: Hou, Z., Ganesh, V. (eds.) Automated Technology for Verification and Analysis, pp. 74–90. Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-88885-5_6

Holzmann, G.J.: The SPIN Model Checker - Primer and Reference Manual. Addison-Wesley (2004)

Kant, G., Laarman, A., Meijer, J., van de Pol, J., Blom, S., van Dijk, T.: LTSmin: high-performance language-independent model checking. In: Baier, C., Tinelli, C. (eds.) TACAS 2015. LNCS, vol. 9035, pp. 692–707. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46681-0_61

Kroening, D., Strichman, O.: Efficient computation of recurrence diameters. In: Zuck, L.D., Attie, P.C., Cortesi, A., Mukhopadhyay, S. (eds.) Verification, Model Checking, and Abstract Interpretation, pp. 298–309. Springer, Berlin Heidelberg, Berlin, Heidelberg (2003). https://doi.org/10.1007/3-540-36384-X_24

Lutz, S., Neider, D., Roy, R.: Specification sketching for linear temporal logic. In: André, É., Sun, J. (eds.) Automated Technology for Verification and Analysis, pp. 26–48. Springer Nature Switzerland, Cham (2023). https://doi.org/10.1007/978-3-031-45332-8_2

Neider, D., Gavran, I.: Learning linear temporal properties. In: Bjørner, N.S., Gurfinkel, A. (eds.) 2018 Formal Methods in Computer Aided Design, FMCAD 2018, Austin, TX, USA, October 30 - November 2, 2018, pp. 1–10. IEEE (2018). https://doi.org/10.23919/FMCAD.2018.8603016

Penczek, W., Woźna, B., Zbrzezny, A.: Bounded model checking for the universal fragment of CTL. Fundam. Inf. 51 (1–2), 135–156 (2002)

Peterson, G.L.: Myths about the mutual exclusion problem. Inf. Process. Lett. 12 (3), 115–116 (1981). https://doi.org/10.1016/0020-0190(81)90106-X

Pnueli, A.: The temporal logic of programs. In: 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pp. 46–57 (1977). https://doi.org/10.1109/SFCS.1977.32

Riener, H.: Exact synthesis of LTL properties from traces. In: 2019 Forum for Specification and Design Languages (FDL), pp. 1–6 (2019). https://doi.org/10.1109/FDL.2019.8876900

Roy, R., Fisman, D., Neider, D.: Learning interpretable models in the property specification language. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. IJCAI 2020 (2021)

Roy, R., Neider, D.: Inferring properties in computation tree logic (2023)

Wasylkowski, A., Zeller, A.: Mining temporal specifications from object usage. In: 2009 IEEE/ACM International Conference on Automated Software Engineering, pp. 295–306 (Nov 2009). https://doi.org/10.1109/ASE.2009.30

Xu, L., Chen, W., Xu, Y.Y., Zhang, W.H.: Improved bounded model checking for the universal fragment of CTL. J. Comput. Sci. Technol. 24 (1), 96–109 (2009). https://doi.org/10.1007/s11390-009-9208-5

Zhang, W.: Bounded semantics of CTL and sat-based verification. In: Breitman, K., Cavalcanti, A. (eds.) Formal Methods and Software Engineering, pp. 286–305. Springer, Berlin Heidelberg, Berlin, Heidelberg (2009). https://doi.org/10.1007/978-3-642-10373-5_15


Author information

Authors and affiliations.

LRE, EPITA, Le Kremlin-Bicêtre, France

Adrien Pommellet, Daniel Stan & Simon Scatton


Corresponding author

Correspondence to Adrien Pommellet .

Editor information

Editors and affiliations.

Otto-Friedrich-Universität Bamberg, Bamberg, Germany

Christoph Benzmüller

Carnegie Mellon University, Pittsburgh, PA, USA

Marijn J.H. Heule

The University of Manchester, Manchester, UK

Renate A. Schmidt

A Proof of Theorem 3

\({{\,\mathrm{{\forall \textsf{F}}}\,}}\) . Assume that \(q \models {{\,\mathrm{{\forall \textsf{F}}}\,}}\varphi \) . Let us prove that \(q \models {{\,\mathrm{{\forall \textsf{F}}}\,}}^{\alpha (q)} \varphi \) . Consider a run \(r = (s_i) \in \mathcal {R}(q)\) . By hypothesis, we can define the index \(j = \min \{i \in \mathbb {N}\mid s_i \models \varphi \}\) .

Now, assume that \(j > \alpha (q)\) . By definition of the recurrence diameter \(\alpha \) , \(\exists \,k_1, k_2 \in [0 \mathrel {{.}\,{.}}j-1]\) such that \(k_1 < k_2\) and \(s_{k_1} = s_{k_2}\) . Consider the finite runs \(u = (s_i)_{i \in [0 \mathrel {{.}\,{.}}k_1]}\) and \(v = (s_i)_{i \in [k_1+1 \mathrel {{.}\,{.}}k_2]}\) . We define the infinite, ultimately periodic run \(r' = u \cdot v^\omega = (s'_i)\) . Every state of \(r'\) appears at a position smaller than j in r , hence by minimality of j , \(\forall \,i \in \mathbb {N}\) , \(s'_i \not \models \varphi \) . Yet \(r' \in \mathcal {R}(q)\) and \(q \models {{\,\mathrm{{\forall \textsf{F}}}\,}}\varphi \) . By contradiction, \(j \le \alpha (q)\) . As a consequence, \((q \models {{\,\mathrm{{\forall \textsf{F}}}\,}}\varphi ) \implies (q \models {{\,\mathrm{{\forall \textsf{F}}}\,}}^{\alpha (q)} \varphi )\) holds.

Trivially, \((q \models {{\,\mathrm{{\forall \textsf{F}}}\,}}^{\alpha (q)} \varphi ) \implies (q \models {{\,\mathrm{{\forall \textsf{F}}}\,}}\varphi )\) holds. Hence, we have proven both sides of the desired equivalence for \({{\,\mathrm{{\forall \textsf{F}}}\,}}\) .

\({{\,\mathrm{{\forall \textsf{G}}}\,}}\) . Assume that \(q \models {{\,\mathrm{{\forall \textsf{G}}}\,}}^{\alpha (q)} \varphi \) . Let us prove that \(q \models {{\,\mathrm{{\forall \textsf{G}}}\,}}\varphi \) . Consider a run \(r = (s_i) \in \mathcal {R}(q)\) and \(j \in \mathbb {N}\) . Let us prove that \(s_j \models \varphi \) .

State \(s_j\) is obviously reachable from q . Let us consider a finite run without repetition \(u = (s'_i)_{i \in [0 \mathrel {{.}\,{.}}k]}\) such that \(s'_0 = q\) and \(s'_k = s_j\) . By definition of the recurrence diameter, \(k \le \alpha (q)\) . We define the infinite runs \(v = (s_i)_{i > j}\) and \(r' = u \cdot v\) . Since \(r' \in \mathcal {R}(q)\) and \(q \models {{\,\mathrm{{\forall \textsf{G}}}\,}}^{\alpha (q)} \varphi \) , \(s'_k \models \varphi \) , hence \(s_j \models \varphi \) . As a consequence, \((q \models {{\,\mathrm{{\forall \textsf{G}}}\,}}^{\alpha (q)} \varphi ) \implies (q \models {{\,\mathrm{{\forall \textsf{G}}}\,}}\varphi )\) .

Trivially, \((q \models {{\,\mathrm{{\forall \textsf{G}}}\,}}\varphi ) \implies (q \models {{\,\mathrm{{\forall \textsf{G}}}\,}}^{\alpha (q)} \varphi )\) holds. Hence, we have proven both sides of the desired equivalence for \({{\,\mathrm{{\forall \textsf{G}}}\,}}\) .

\({{\,\mathrm{{\exists \textsf{F}}}\,}}\) and \({{\,\mathrm{{\exists \textsf{G}}}\,}}\) . Formula \({{\,\mathrm{{\exists \textsf{F}}}\,}}\varphi \) (resp. \({{\,\mathrm{{\exists \textsf{G}}}\,}}\varphi \) ) being equivalent to the dual formula \(\lnot {{\,\mathrm{{\forall \textsf{G}}}\,}}\lnot \varphi \) (resp. \(\lnot {{\,\mathrm{{\forall \textsf{F}}}\,}}\lnot \varphi \) ), the previous proofs immediately yield the desired equivalences.

\({{\,\mathrm{{\forall \textsf{U}}}\,}}\) and \({{\,\mathrm{{\exists \textsf{U}}}\,}}\) . We can handle the case of \(\forall \,\varphi {{\,\mathrm{\textsf{U}}\,}}\psi \) in a manner similar to \({{\,\mathrm{{\forall \textsf{F}}}\,}}\) : we prove by contradiction that the first occurrence of \(\psi \) always happens within \(\alpha (q)\) steps. The semantic equivalence for \(\exists \,\varphi {{\,\mathrm{\textsf{U}}\,}}\psi \) can be handled in a fashion similar to \({{\,\mathrm{{\forall \textsf{G}}}\,}}\) : an existing infinite run yields a conforming finite prefix without repetition of length less than or equal to \(\alpha (q)\) .

B Proof of Theorem 5

Given two dissimilar states \(q_1, q_2 \in Q\) , let us prove by induction on the characteristic number \(c_{q_1, q_2} = \mathcal {C}_\mathcal {K}(\{q_1\}, \{q_2\})\) that \(q_1 \models D_{q_1, q_2}\) and \(q_2 \not \models D_{q_1, q_2}\) .

Base Case. If \(c_{q_1, q_2} = 0\) , then by definition \(\lambda (q_1) \ne \lambda (q_2)\) and by design \(D_{q_1, q_2}\) is a literal (i.e. an atomic proposition or the negation thereof) verified by \(q_1\) but not by \(q_2\) .

Inductive Case. Assume that the property holds for all characteristic numbers smaller than or equal to \(c \in \mathbb {N}\) . Consider two states \(q_1, q_2 \in Q\) such that \(c_{q_1, q_2} = c+1\) . By definition of the refined equivalence relation \(\sim _{c+1}\) , either \(\exists \,q_1' \in \delta (q_1)\) , \(\forall \,q_2' \in \delta (q_2)\) , \(q_1' \not \sim _c q_2'\) , or the symmetric condition with the roles of \(q_1\) and \(q_2\) swapped holds. Assume the former; then \(c_{q'_1, q'_2} \le c\) for every such pair of successors.

By induction hypothesis, \(D_{q'_1, q'_2}\) is well-defined, \(q'_1 \models D_{q'_1, q'_2}\) and \(q'_2 \not \models D_{q'_1, q'_2}\) . As a consequence, \(D_{q_1, q_2}\) is well-defined, \(q_1 \models {{\,\mathrm{{\exists \textsf{X}}}\,}}\, \biggl ( \underset{q_2' \in \delta (q_2)}{\bigwedge }\ D_{q_1', q'_2} \biggr )\) , and \(q_2 \not \models {{\,\mathrm{{\exists \textsf{X}}}\,}}\, \biggl ( \underset{q_2' \in \delta (q_2)}{\bigwedge }\ D_{q_1', q'_2} \biggr )\) .

We handle the case where \(\exists \,q_2' \in \delta (q_2)\) , \(\forall \,q_1' \in \delta (q_1)\) , \(q_1' \not \sim _c q_2'\) in a similar fashion. As a consequence, the property holds at rank \(c+1\) .

Therefore, for each \(q^+ \in S^+\) and each \(q^- \in S^-\) , \(q^+ \models D_{q^+, q^-}\) and \(q^- \not \models D_{q^+, q^-}\) . Hence, \(q^+ \models \mathcal {S}_\mathcal {K}(S^+, S^-)\) and \(q^- \not \models \mathcal {S}_\mathcal {K}(S^+, S^-)\) .     \(\square \)

C Proof of Corollary 1

First, given \(q^+ \in S^+\) and \(q^- \in S^-\) , let us bound the size of \(D_{q^+, q^-}\) based on the characteristic number \(c_{q^+, q^-} = \mathcal {C}_\mathcal {K}(\{q^+\}, \{q^-\})\) .

We are looking for an upper bound \((U_n)_{n \ge 0}\) such that \(\forall \,n \in \mathbb {N}\) , \(\forall \,q^+ \in S^+\) , \(\forall \,q^- \in S^-\) , if \(c_{q^+, q^-} \le n\) , then \(|D_{q^+, q^-}| \le U_n\) . We define it inductively:

\[U_0 = 2 \qquad U_{n+1} = k \cdot U_n + k + 1\]

since for \(c = 0\) the formula \(D_{q^+, q^-}\) is a literal (at most 2 nodes), and for \(c = n + 1\) it consists of at most one \({{\,\mathrm{{\exists \textsf{X}}}\,}}\) or \({{\,\mathrm{{\forall \textsf{X}}}\,}}\) node, one possible negation node, at most \(k - 1\) conjunction nodes, and at most k sub-formulas of size at most \(U_n\) .

Assuming \(k \ge 2\) , we can make the bound explicit: \(U_n = (2 + \frac{k+1}{k-1}) \cdot k^n - \frac{k+1}{k-1} \le 5 \cdot k^n\) . As \((\{q^+\}, \{q^-\})\) is a sub-sample of S , \(c_{q^+, q^-}\le c\) and \(|D_{q^+, q^-}| \le U_c \le 5 \cdot k^c\) . We can finally bound the size of \(\mathcal {S}_\mathcal {K}(S^+, S^-)\) :

\[|\mathcal {S}_\mathcal {K}(S^+, S^-)| \le |S^+| \cdot |S^-| \cdot U_c + |S^+| \cdot (|S^-| - 1) + (|S^+| - 1) \le (5 \cdot k^c + 1) \cdot |S^+| \cdot |S^-|\]

yielding the aforementioned upper bound. If \(k=1\) , then \(U_n = 2 \cdot n + 2\) and the rest of the proof is similar to the previous case.     \(\square \)

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


Copyright information

© 2024 The Author(s)

About this paper

Cite this paper.

Pommellet, A., Stan, D., Scatton, S. (2024). SAT-Based Learning of Computation Tree Logic. In: Benzmüller, C., Heule, M.J., Schmidt, R.A. (eds) Automated Reasoning. IJCAR 2024. Lecture Notes in Computer Science(), vol 14739. Springer, Cham. https://doi.org/10.1007/978-3-031-63498-7_22


DOI : https://doi.org/10.1007/978-3-031-63498-7_22

Published : 01 July 2024

Publisher Name : Springer, Cham

Print ISBN : 978-3-031-63497-0

Online ISBN : 978-3-031-63498-7





  12. PDF ml learning with infinite hypothesis sets

    Mehryar Mohri - Foundations of Machine Learning page Rademacher Complexity Bound Theorem: Let be a family of functions mapping from to . Then, for any , with probability at least , the following holds for all : Proof: Apply McDiarmid's inequality to 6 G Z > 0 1 g G [0, 1] E[g(z)] 1 m m i=1 g(z i)+2R m(G)+ ⇥ log 1 2m. E[g(z)] 1 m m i=1

  13. Infinite Hypothesis Spaces

    Infinite Hypothesis Spaces. Sample complexity = ln(d /|H|) / ln(1-e) Assumes |H| is finite Consider Hypothesis represented as a rectangle |H| is infinite, but expressiveness is not!

  14. PDF CSC 411 Lecture 23-24: Learning theory

    Finite hypothesis space A rst simple example of PAC learnable spaces - nite hypothesis spaces. Theorem (uniform convergence for nite H) Let Hbe a nite hypothesis space and ': YY! [0;1] be a bounded loss function, then Hhas the uniform convergence property with M( ; ) = ln(2jHj ) 2 2 and is therefore PAC learnable by the ERM algorithm. Proof .

  15. Sample complexity of linear regression

    Sample complexity of linear regression ... However, notice that in the linear regression setting, the hypothesis class is infinite: even though the weight vector's norm is bounded, it can still take an infinite number of values. Can we somehow leverage the result for finite classes here? ... Understanding Machine Learning: ...

  16. PDF Lecture 5: PAC Sample Complexity Proof

    nction, VC dimension, the relationshipb. tween them, and the following theorem. In this lec. ally prove these results.1Theorem 1.1(PAC Lea. lgorithm that learns aco. cept. + log2() :2 Proof of Theorem 1.1In this lecture, we defin. and work with three "bad" events. F. win. ev.

  17. PDF CS 391L: Machine Learning: Computational Learning Theory

    specific algorithms in terms of computational complexity or sample complexity , i.e. the number of training examples necessary or sufficient to learn hypotheses of a given accuracy. • Complexity of a learning problem depends on: - Size or expressiveness of the hypothesis space. - Accuracy to which target concept must be approximated.

  18. PDF Lecture 8: Sample Complexity of Agnostic Learning

    Remark 3.2. This sample complexity in the agnostic setting has an additional factor of 1 compared to the realizable setting. Proof sketch. The proof is very similar to the sample complexity upper bound for realizable PAC. Therefore, we only highlight some of the major steps. Take S ˘D m, and for the sake of analysis imagine an independent ...

  19. PDF Machine Learning, Chapter 7 CSE 574, Spring 2004

    Two frameworks for analyzing learning algorithms. 1. Probably Approximately Correct (PAC) framework. Identify classes of hypotheses that can/cannot be learned from a polynomial number of training samples. Finite hypothesis space. Infinite hypotheses (VC dimension) Define natural measure of complexity for hypothesis spaces (VC dimension) that ...

  20. PDF ml learning with finite hypothesis sets

    Learning Bound for Finite H - Consistent Case. Theorem: let H be a finite set of functions from X to 1} and L an algorithm that for any target concept c H and sample S returns a consistent hypothesis hS : . Then, for any 0 , with probability at least. (hS ) = 0 >.

  21. PDF Optimization for deep learning: an overview

    theoretical machine learning area. Classical theory of vanilla SGD [30] often uses diminishing stepsize and thus does not enjoy the same bene t as SVRG and SAGA; however, as mentioned before, constant step-size SGD works quite well in many machine learning problems, and in these problems SGD may have inherited the same advantage of SVRG and SAGA.

  22. PDF Sponsored Search Acution Design Via Machine Learning

    Sample Complexity: Infinite Hypothesis Spaces Realizable Case •Not too easy to interpret sometimes hard to calculate exactly, but can get a good bound using "VC-dimension •VC-dimension is roughly the point at which H stops looking like it contains all functions, so hope for solving for m. If Hm=2m, then m≥m ϵ (….)

  23. Peirce in the Machine: How Mixture of Experts Models Perform Hypothesis

    larization is a technique common in machine learning to combat overfitting. Overfitting is where a machine learning algorithm essentially does well on training data while failing to predict hold-out test data; the model "mem-orizes" the training data instead of truly learning the relevant "inductive patterns in the data".

  24. PDF Promising directions of machine learning for partial ...

    More recently, machine learning algorithms have provided an entirely new approach for solving PDEs, based on the increasing wealth of high-quality data generated from both simulations and

  25. What exactly is a hypothesis space in machine learning?

    To get a better idea: The input space is in the above given example 24 2 4, its the number of possible inputs. The hypothesis space is 224 = 65536 2 2 4 = 65536 because for each set of features of the input space two outcomes ( 0 and 1) are possible. The ML algorithm helps us to find one function, sometimes also referred as hypothesis, from the ...

  26. SAT-Based Learning of Computation Tree Logic

    The CTL learning problem consists in finding for a given sample of positive and negative Kripke structures a distinguishing CTL formula that is verified by the former but not by the latter. Further constraints may bound the size and shape of the desired formula or even ask for its minimality in terms of syntactic size. This synthesis problem is motivated by explanation generation for ...

  27. Single-cell topographical profiling of the immune synapse ...

    For statistical analysis of spectral complexity values across samples, we used permutation testing to assess statistical significance, which obviated the need to assume a uniform distribution for the null hypothesis. Bootstrap resampling was used to generate 10 5 random permutations of the samples in each group. Then, the absolute difference in ...