
Worrisome Properties of Neural Network Controllers and Their Symbolic Representations

Jacek Cyranka^a,*, Kevin E. M. Church^b and Jean-Philippe Lessard^c

^a Institute of Informatics, University of Warsaw
^b Centre de Recherches Mathématiques, Université de Montréal
^c Department of Mathematics and Statistics, McGill University

Abstract. We raise concerns about controllers' robustness in simple reinforcement learning benchmark problems. We focus on neural network controllers and their low-neuron and symbolic abstractions. A typical controller reaching high mean return values still generates an abundance of persistent low-return solutions, which is a highly undesirable property, easily exploitable by an adversary. We find that the simpler controllers admit more persistent bad solutions. We provide an algorithm for a systematic robustness study and prove existence of persistent solutions and, in some cases, periodic orbits, using a computer-assisted proof methodology.

1 Introduction

The study of neural network (NN) robustness properties has a long history in the research on artificial intelligence (AI). Since the existence of so-called adversarial examples in deep NNs was established in [14], it is well known that NNs can output unexpected results when their inputs are slightly perturbed, and hence can be exploited by an adversary. Since then, the robustness of other NN architectures has been studied [44]. In the context of control design using reinforcement learning (RL), the robustness of NN controllers has been studied from the adversarial viewpoint [29, 42]. Due to limited interpretability and transparency, deep NN controllers are not suitable for deployment in critical applications. Practitioners prefer abstractions of deep NN controllers that are simpler and human-interpretable. Several classes of deep NN abstractions exist, including single-layer or linear nets, programs, tree-like structures, and symbolic formulas. It is hoped that such abstractions maintain or improve a few key features: generalizability – the ability of the controller to achieve high performance in similar setups (e.g., a slightly modified version of the native simulator used in training); deployability – the controller can be deployed in the physical world on a machine, where, e.g., an exact dynamical model is not specified and the time horizon becomes undefined; verifiability – one can verify a purported controller behavior (e.g., asymptotic stability) in a strict sense; performance – the controller reaches a level of average return very close to that of a deep NN controller.

In this work, we study the robustness properties of some symbolic controllers derived in [24], as well as deep NNs with their few-neuron and symbolic abstractions derived using our methods. By robustness, we mean that a controller maintains its average return values when the simulator configuration (scheme/time-step) is changed at test time, while the controller was trained on some specific configuration. Moreover, a robust controller does not admit open sets of simulator solutions with extremely poor return relative to the average. In this regard, we found that NNs are more robust than simple symbolic abstractions, while still achieving comparable average return values. To confirm our findings, we implement a workflow of symbolic controller derivation: regression of a trained deep NN followed by further fine-tuning. For the simplest benchmark problems, we find that despite the controllers reaching the performance of deep NNs measured in terms of mean return, there exist singular solutions that behave unexpectedly and are persistent for a long time. In some cases, the singular solutions are persistent forever (periodic orbits). The solutions we found are stable, and an adversary with access to the simulation setup who knows of the existence of persistent solutions and POs for specific setups and initial conditions may reconfigure the controlled system and bias it towards the bad persistent solutions, resulting in a significant performance drop; if the controller is deployed in practice, this may even lead to damage of the robot or machine. This concern is critical in the context of symbolic controllers, which are simple abstractions more likely to be deployed on hardware than deep NNs. Two systems support the observed issues: the standard pendulum benchmark from OpenAI Gym [4] and the cartpole swing-up problem.

* Corresponding Author. Email: jcyranka@gmail.com

Each instance of a persistent solution we identify is verified mathematically using computer-assisted proof (CAP) techniques based on interval arithmetic [27, 38] implemented in Julia [3]. Doing so, we verify that the solution truly exists and is not some spurious object resulting from, e.g., finite arithmetic precision. Moreover, we prove the adversarial exploitability of a wide class of controllers. The existence of persistent solutions is most visible in the case of symbolic controllers. For deep NNs, persistent solutions are less prevalent, and we checked that deep NN controllers' small NN abstractions (involving few neurons) somewhat alleviate the issue seen with symbolic controllers, strongly suggesting that robustness grows with the number of parameters, in stark contrast with common beliefs and examples in other domains.

Main Contributions. Let us summarize the main novel contributions of our work to the AI community below.

Systematic controller robustness study. In light of the average return metric being sometimes deceptive, we introduce a method for investigating controller robustness by designing a persistent solutions search and the penalty metric.

Identification and proofs of abundant persistent solutions. We systematically find and prove existence of a concerning number of persistent orbits for symbolic controllers in simple benchmark problems. Moreover, we carried out a proof of a periodic orbit for a deep NN controller, which is of independent interest. To our knowledge, this is the first instance of such a proof in the literature.

NN controllers are more robust than symbolic ones. We find that the symbolic controllers admit significantly more bad persistent solutions than the deep NN and small distilled NN controllers.

ECAI 2023, K. Gal et al. (Eds.). © 2023 The Authors. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0). doi:10.3233/FAIA230311

1.1 Related Work

(Continuous) RL. A review of RL literature is beyond the scope of this paper (see [34] for an overview). In this work we use the state-of-the-art TD3 algorithm, dedicated to continuous state/action spaces [12] and based on DDPG [25]. Another related algorithm is SAC [16].

Symbolic Controllers. Symbolic regression as a way of obtaining explainable controllers appeared in [22, 20, 24]. Other representations include programs [39, 37] or decision trees [26]. For a broad review of explainable RL see [41].

Falsification of Cyber-Physical Systems (CPS). The research on falsification [2, 10, 40, 43] utilizes similar techniques for demonstrating the violation of a temporal logic formula, e.g., for finding solutions that never approach the desired equilibrium. We are interested in solutions that do not reach the equilibrium but also, in particular, in solutions that reach minimal returns.

Verification of NN robustness using SMT. Work on SMT solvers like ReLUplex [5, 11, 21] constructs interval robustness bounds for NNs only. In our approach, we construct interval bounds for solutions of a controller (a NN) coupled with a dynamical system, and also provide existence proofs.

Controller Robustness. Design of robust NN controllers has focused on adversarial defence methods [29, 42].

CAPs. Computer-assisted proofs for ordinary differential equations (ODEs) in AI are not yet common. Examples include validation of NN dynamics [23] and proofs of spurious local minima [32].

1.2 Structure of the Paper

Section 2 provides background on the numerical schemes and RL framework used in this paper. Section 3 describes the training workflow for the neural network and symbolic controllers. The class of problems we consider is presented in Section 4. We describe the computer-assisted proof methodology in Section 5. Results on persistent periodic orbits appear in Section 6, and we describe the process by which we search for these and related singular solutions in Section 7.

2 Preliminaries

2.1 Continuous Dynamics Simulators for AI

Usually, there is an underlying continuous dynamical system with control input that models the studied problem, ṡ(t) = f(s(t), a(t)), where s(t) is the state, a(t) is the control input at time t, and f is a vector field. For instance, the rigid-body general equations of motion in continuous time implemented in robotic simulators like MuJoCo [36] are M v̇ + c = τ + Jᵀf, where J and f are the constraint Jacobian and force, τ is the applied force, M the inertia matrix, and c the bias forces. For training RL algorithms, episodes of simulated rollouts (s_0, a_0, r_1, s_1, ...) are generated; the continuous dynamical system needs to be discretized using one of the available numerical schemes like the Euler or Runge-Kutta schemes [17]. After generating a state rollout, rewards are computed: r_{k+1} = r(s_k, a_k). The numerical schemes are characterized by the approximation order, time-step, and explicit/implicit update. In this work, we consider the explicit Euler (E) scheme s_{k+1} = s_k + h f(s_k, a_k); this is a first-order scheme with the quality of approximation proportional to the time-step h (a hyperparameter). Another related scheme is the so-called semi-implicit Euler (SI) scheme, a two-step scheme in which the velocities are updated first; then the positions are updated using the computed velocities. Refer to the appendix for the exact form of the schemes.
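The two schemes can be sketched for the pendulum as follows. This is a minimal sketch assuming the standard Gym-style pendulum vector field and default constants (g = 10, m = 1, l = 1); the exact form used in the paper is in the appendix.

```python
import math

# Pendulum constants: Gym-style defaults (an assumption for illustration).
G_ACC, MASS, LENGTH = 10.0, 1.0, 1.0

def theta_ddot(theta, a):
    # Angular acceleration under gravity and control torque a
    # (Gym-style pendulum vector field; an assumption, see Appendix A).
    return 3.0 * G_ACC / (2.0 * LENGTH) * math.sin(theta) \
        + 3.0 / (MASS * LENGTH ** 2) * a

def euler_step(theta, omega, a, h):
    # Explicit Euler (E): both components advanced from the current state.
    return theta + h * omega, omega + h * theta_ddot(theta, a)

def semi_implicit_step(theta, omega, a, h):
    # Semi-implicit Euler (SI): velocity updated first, then the position
    # is advanced with the *updated* velocity; the schemes differ at O(h^2).
    omega_new = omega + h * theta_ddot(theta, a)
    return theta + h * omega_new, omega_new
```

The single-step difference between the two updates is of size h², which is exactly why a controller trained against one scheme can behave differently under the other over long horizons.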

In the research on AI for control, the numerical scheme and the time-resolution¹ of observations h are usually fixed while simulating episodes. Assume we are given a controller that was trained on simulated data generated by a particular scheme and h; we are interested in studying the controller's robustness and properties after the zero-shot transfer to a simulator utilizing a different scheme or h, e.g., explicit to semi-implicit, or using smaller h's.

2.2 Reinforcement Learning Framework

Following the standard setting used in RL, we work with the Markov decision process (MDP) formalism (S, A, F, r, ρ_0, γ), where S is a state space, A is an action space, F: S × A → S is a deterministic transition function, r: S × A → R is a reward function, ρ_0 is an initial state distribution, and γ ∈ (0, 1) is a discount factor used in training. S may be equipped with an equivalence relation, e.g., for an angle variable θ we have θ ≡ θ + 2kπ for all k ∈ Z. In RL, the agent (policy) interacts with the environment in discrete steps by selecting an action a_t for the state s_t at time t, causing the state transition s_{t+1} = F(s_t, a_t); as a result, the agent collects a scalar reward r_{t+1}(s_t, a_t). The (undiscounted) return is defined as the sum of future rewards R_t = Σ_{i=t}^{T} r(s_i, a_i), with T > 0 being the fixed episode length of the environment. RL aims to learn a policy that maximizes the expected return over the starting state distribution.

In this work, we consider the family of MDPs in which the transition function is a particular numerical scheme. We study robustness w.r.t. the scheme; to distinguish the transition function used for training (also called native) from the transition function used for testing, we introduce the notation F_train and F_test, resp., e.g., explicit Euler with time-step h is denoted F_*(E, h), where * ∈ {test, train}.
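The rollout and return computation with a swappable transition function can be sketched as below. The toy one-dimensional plant, policy, and step sizes are illustrative assumptions, not the paper's environments.

```python
def rollout_return(policy, transition, reward, s0, T):
    # Simulate one episode of length T, accumulating the undiscounted
    # return R_0 = sum_{t=0}^{T-1} r(s_t, a_t).
    s, ret = s0, 0.0
    for _ in range(T):
        a = policy(s)
        ret += reward(s, a)
        s = transition(s, a)  # F_test may differ from the F_train used in training
    return ret

# Toy 1d example (illustrative assumption, not one of the paper's systems):
# plant s' = a, quadratic cost, explicit Euler transitions at two step sizes.
policy = lambda s: -0.5 * s
reward = lambda s, a: -(s * s)
F_train = lambda s, a: s + 0.05 * a    # native scheme: (E) with h = 0.05
F_test = lambda s, a: s + 0.025 * a    # zero-shot transfer: halved h

R_train = rollout_return(policy, F_train, reward, 1.0, 200)
R_test = rollout_return(policy, F_test, reward, 1.0, 200)
```

In this toy case the halved h lowers the return simply because the state decays more slowly per step; in the paper, the degradation is instead caused by persistent solutions.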

3 Algorithm for Training of Symbolic Controllers and Small NNs

Carrying out the robustness study of symbolic and small NN controllers requires that the controllers first be constructed (trained). We designed a three-step deep learning algorithm for constructing symbolic and small NN controllers. Inspired by the preceding work in this area, the controllers are derived from a deep RL NN controller. The overall algorithm is summarized in Alg. 1.

3.1 RL Training

First, we train a deep NN controller using the state-of-the-art model-free RL algorithm TD3 [25, 12] – the SB3 implementation [30]. We choose TD3 as it utilizes a replay buffer and constructs deterministic policies (NNs). Plots with the evaluation along the training procedure for the studied systems can be found in Appendix C of the extended version of the paper [8].

¹ While in general the time-resolution may not be equal to the time step, in this work we set them to be equal.


Algorithm 1 Symbolic/Small NN Controllers Construction
input: MDP determining the studied problem; RL training h-params; symbolic & small NN regression h-params; fine-tuner h-params;
output: deep NN policy π_deep; small NN policy π_small; family of symbolic policies {π_symb,k} (k the complexity);
1: Apply an off-policy RL algorithm to construct a deterministic deep NN policy π_deep;
2: Using the replay buffer data, apply symbolic regression to compute symbolic abstractions {π_symb,k} (having complexity k) of the deep NN controller, and MSE regression for small NN π_small policy distillation;
3: Fine-tune the constructed controllers' parameters to maximize the average return using CMA-ES and/or the analytic gradient.

3.2 Symbolic Regression

A random sample of states is selected from the TD3 training replay buffer. Symbolic abstractions of the deep NN deterministic policies are constructed using symbolic regression over the replay buffer samples. Following earlier work [22, 20, 24], the search is performed by an evolutionary algorithm. For this purpose, we employ the PySR Python library [6, 7]. The main hyperparameter of this step is the complexity limit (number of unary/binary operators) of the formulas (k in Alg. 1). This procedure outputs a collection of symbolic representations of varying complexity. Another important hyperparameter is the list of operators used to define the basis for the formulas. We use only the basic algebraic operators (addition, multiplication, division, and multiplication by a scalar). We also tried a search involving nonlinear functions like tanh, but the returns were comparable at larger complexity.
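The symbolic-regression step could be configured roughly as below; a configuration sketch assuming PySR's scikit-learn-style interface, with illustrative hyperparameter values rather than the ones used in the paper.

```python
from pysr import PySRRegressor  # external dependency; interface per PySR docs

# X: states sampled from the replay buffer; y = pi_deep(X): the deep
# policy's actions on those states. maxsize caps formula complexity
# (the role played by k in Alg. 1).
model = PySRRegressor(
    binary_operators=["+", "-", "*", "/"],  # basic algebraic basis only
    maxsize=9,          # complexity limit (illustrative value)
    niterations=100,    # evolutionary search budget (illustrative value)
)
# model.fit(X, y)
# model.sympy()  # inspect the best symbolic formula found
```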

3.3 Distilling Simple Neural Nets

Using a random sample of states from the TD3 training replay buffer, we find the parameters of the small NN representation using mean-squared error (MSE) regression.
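A minimal sketch of the distillation step, with a stand-in "deep" policy and a linear student (both assumptions for illustration): the paper distills to a small NN, which is replaced here by the simplest possible student so the example stays self-contained.

```python
import random

random.seed(0)

# Stand-in for the trained deep policy (an assumption for illustration).
def pi_deep(s):
    return -1.5 * s[0] - 0.4 * s[1]

# "Replay buffer" sample of states and the teacher's actions on them.
states = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]
actions = [pi_deep(s) for s in states]

# Fit pi_small(s) = w0*s0 + w1*s1 by gradient descent on the MSE loss.
w = [0.0, 0.0]
lr = 0.1
for _ in range(200):
    g0 = g1 = 0.0
    for s, a in zip(states, actions):
        err = w[0] * s[0] + w[1] * s[1] - a
        g0 += 2.0 * err * s[0] / len(states)
        g1 += 2.0 * err * s[1] / len(states)
    w = [w[0] - lr * g0, w[1] - lr * g1]

mse = sum((w[0] * s[0] + w[1] * s[1] - a) ** 2
          for s, a in zip(states, actions)) / len(states)
```

As Sec. 3.4 notes, matching the teacher's actions in MSE alone does not yield a good controller; the distilled parameters are only a starting point for fine-tuning.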

3.4 Controller Parameter Fine-tuning

Regression over the replay buffer alone is insufficient to construct controllers that achieve expected returns comparable with deep NN controllers, as noted in previous works. The regressed symbolic controllers should be subject to further parameter fine-tuning to maximize the rewards. Various strategies for fine-tuning exist. In this work, we use the non-gradient stochastic optimization covariance matrix adaptation evolution strategy (CMA-ES) algorithm [19, 18]. We also implemented analytic-gradient optimization, which takes advantage of the simple environment implementation and performs parameter optimization directly using gradient descent on model rollouts from the differentiable environment time-stepping implementation in PyTorch.
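A minimal stand-in for the fine-tuning step: the paper uses CMA-ES (and analytic gradients), while the sketch below uses a plain elitist (1+λ) evolution strategy on a one-parameter controller and a toy plant — all of these simplifications are illustrative assumptions.

```python
import random

random.seed(1)

def mean_return(k):
    # Mean return of the controller a = -k*s on a toy plant s' = s + a,
    # integrated with explicit Euler (h = 0.05); a stand-in for the real
    # environment rollouts used during fine-tuning.
    total = 0.0
    for s0 in (-1.0, -0.5, 0.5, 1.0):
        s, ret = s0, 0.0
        for _ in range(100):
            a = -k * s
            ret -= s * s + 0.001 * a * a
            s = s + 0.05 * (s + a)
        total += ret
    return total / 4.0

# Elitist (1+lambda) evolution strategy: keep the best parameter so far.
k_best = 0.5                        # un-tuned parameter from the regression step
f_init = mean_return(k_best)
f_best = f_init
sigma = 0.3                         # mutation step size
for _ in range(60):
    for _ in range(8):              # lambda = 8 offspring per generation
        k_try = k_best + sigma * random.gauss(0.0, 1.0)
        f_try = mean_return(k_try)
        if f_try > f_best:
            k_best, f_best = k_try, f_try
    sigma *= 0.95                   # simple step-size decay
```

CMA-ES additionally adapts the full covariance of the mutation distribution, which matters once the controller has more than a handful of parameters.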

4 Studied Problems

We perform our experimental investigation and CAP support in the setting of two control problems belonging to the set of standard benchmarks for continuous control. First, the pendulum problem is part of the most commonly used benchmark suite for RL – OpenAI Gym [4]. Second, the cartpole swing-up problem is part of the DeepMind Control Suite [35]. Following earlier work [13], we used a closed-form implementation of the cartpole swing-up problem. While these problems are of relatively modest dimension compared to problems in the MuJoCo suite, we find them most suitable to convey our message. The low system dimension makes a self-contained cross-platform implementation easier and eventually provides certificates for our claims using interval arithmetic and CAPs.

4.1 Pendulum

The pendulum dynamics is described by a 1d second-order nonlinear ODE. We followed the implementation in OpenAI Gym, where the ODE is discretized with a semi-implicit (SI) Euler method with h = 0.05. For training we use F_train(SI, 0.05). The velocity ω is clipped to the range [−8, 8], and the control input a to [−2, 2]. There are several constants: gravity, pendulum length and mass (g, l, m), which we set to defaults. See Appendix A ([8]) for the details. The goal of the control is to stabilize the up position θ = 0 mod 2π with zero angular velocity ω. The problem uses a quadratic reward for training and evaluation, r = −θ² − 0.1ω² − 0.001a², where θ = arccos(cos(θ)) at given time t and action a. The episode length is 200 steps. The max reward is 0, and large negative rewards might indicate long-term simulated dynamics that are not controlled.
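The reward computation can be sketched as follows; the arccos(cos θ) normalization folds any angle to its distance from the upright position.

```python
import math

def pendulum_reward(theta, omega, a):
    # arccos(cos(theta)) maps any angle to its distance from the upright
    # position theta = 0 mod 2*pi, giving a normalized angle in [0, pi].
    th = math.acos(math.cos(theta))
    return -th ** 2 - 0.1 * omega ** 2 - 0.001 * a ** 2
```

The maximum reward 0 is attained only at the upright equilibrium with zero velocity and zero torque; every other state-action pair is penalized.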

4.2 Cartpole Swing-up

The cartpole dynamics is described by a 2d second-order nonlinear system of ODEs with two variables: the position of the cart along a line (x, ẋ) and the angle of a pole attached to the cart (θ, θ̇). We followed the implementation given in [15]. The ODEs are discretized by the explicit Euler (E) scheme with h = 0.01. As with the pendulum, we use clipping on some system states, and several constants are involved, which we set to defaults. See Appendix B ([8]) for details. The goal of the control is to stabilize the pole upwards, θ = 0 mod 2π, while keeping the cart position x within fixed boundaries. The problem uses a simple reward formula, r = cos θ, plus an episode termination condition if |x| exceeds a threshold. The episode length is set to 500; hence the return is within [−500, 500]. A large negative return is usually indicative of undesirable behaviour, with the pole continuously oscillating, the cart constantly moving, and the cart escaping the boundaries fairly quickly.

5 Rigorous Proof Methodology

All of our theorems presented in the sequel are supported by a computer-assisted proof, guaranteeing that they are fully rigorous in a mathematical sense. Based on the existing body of results and the algorithm we developed in Julia, we can carry out the proofs for different abstractions and problems as long as the set of points of non-differentiability is small; e.g., it works for almost all practical applications: ReLU nets, decision trees, and all sorts of problems involving dynamical systems in closed form. The input to our persistent-solutions prover is a function in Julia defining the controlled problem, the only requirement being that the function can be automatically differentiated. To constitute a proof, this part needs to be carried out rigorously with interval arithmetic. Our CAPs are automatic; once our searcher finds a candidate for a persistent solution/PO, a CAP program attempts to verify the existence of the solution/PO by verifying the assumptions of Theorem 1. If the prover succeeds, this concludes the proof.

5.1 Interval Arithmetic

Interval arithmetic is a method of tracking rounding error in numerical computation. Operations on floating-point numbers are instead done on intervals whose endpoints are floating-point numbers. A function f of real numbers is extended to a function f̄ defined on intervals, with the property that f̄(X) necessarily contains {f(x) : x ∈ X}. The result is that if y is a real number and Y is a thin interval containing y, then f(y) ∈ f̄(Y). For background, the reader may consult the books [27, 38]. Function iteration on intervals leads to the wrapping effect, where the radius of an interval increases with composition depth. See Figure 1 for a visual.
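A toy illustration of interval enclosure and the wrapping effect (without the outward rounding of endpoints that a real CAP library also performs):

```python
class Interval:
    # Toy interval type; a real CAP library (such as the interval
    # arithmetic used in our Julia code) also rounds endpoints outward.
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __mul__(self, other):
        # Enclosure of the product: min/max over all endpoint products.
        ps = (self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi)
        return Interval(min(ps), max(ps))

    def rad(self):
        return (self.hi - self.lo) / 2.0

    def __contains__(self, x):
        return self.lo <= x <= self.hi

# Wrapping effect: iterating x -> x*x on a thin interval around 2 makes
# the radius grow with each composition, even though 2^16 stays enclosed.
X = Interval(1.999, 2.001)
radii = []
for _ in range(4):
    X = X * X
    radii.append(X.rad())
```

The enclosure property is what makes the proofs rigorous; the growing radius is what forces the high working precision reported in Figure 1.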

(Figure 1: two panels plotting values against episode 0–1000; right panel y-axis log(r) from −100 to 0.)
Figure 1: Left: midpoint of interval enclosure of a proven persistent solution (see Appendix Tab. 23 [8]). Right: log-scale of radius of the interval enclosure. Calculations done at 163-bit precision, the minimum possible for this solution at episode length 1000.

5.2 Computer-assisted Proofs of Periodic Orbits

For x = (x_1, ..., x_n), let ||x|| = max{|x_1|, ..., |x_n|}. The following is the core of our CAPs.

Theorem 1 Let G: U → R^n be continuously differentiable, for U an open subset of R^n. Let x̄ ∈ R^n and r* ≥ 0. Let A be an n × n matrix² of full rank. Suppose there exist real numbers Y, Z_0 and Z_2 such that

||A G(x̄)|| ≤ Y, (1)
||I − A DG(x̄)|| ≤ Z_0, (2)
sup_{||δ|| ≤ r*} ||A (DG(x̄ + δ) − DG(x̄))|| ≤ Z_2, (3)

where DG(x) denotes the Jacobian of G at x, and the norm on matrices is the induced matrix norm. If Z_0 + Z_2 < 1 and Y/(1 − Z_0 − Z_2) ≤ r*, the map G has a unique zero x̃ satisfying ||x̃ − x̄|| ≤ r for any r ∈ (Y/(1 − Z_0 − Z_2), r*].

A proof can be completed by following Thm. 2.1 in [9]. In Sec. 5.3, we identify G whose zeroes correspond to POs. Conditions (1)–(3) imply that the Newton-like operator T(x) = x − A G(x) is a contraction on the closed ball centered at the approximate zero x̄ with radius r > 0. Being a contraction, it has a unique fixed point (x̃ such that x̃ = T(x̃)) by the Banach fixed-point theorem. As A is full rank, G(x̃) = 0, hence an orbit exists. The radius r measures how close the approximate orbit x̄ is to the exact orbit x̃. The contraction is rigorously verified by performing all necessary numerical computations using interval arithmetic. The technical details appear in Appendix D ([8]).
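The bounds of Theorem 1 can be checked numerically on a scalar toy map, G(x) = x² − 2 near x̄ ≈ √2, where norms reduce to absolute values. Plain floats stand in here for the interval arithmetic used in the actual proofs; the specific numbers are illustrative assumptions.

```python
import math

# Scalar instance of the Theorem 1 bounds for G(x) = x^2 - 2, DG(x) = 2x.
xbar = 1.4142                 # approximate zero of G
A = 1.0 / (2.0 * xbar)        # A ~ DG(xbar)^(-1)
rstar = 1e-2

Y = abs(A * (xbar ** 2 - 2.0))        # bound (1): ||A G(xbar)||
Z0 = abs(1.0 - A * (2.0 * xbar))      # bound (2): ||I - A DG(xbar)||
Z2 = abs(A) * 2.0 * rstar             # bound (3): |A(2(xbar+d) - 2 xbar)|, |d| <= rstar

# If Z0 + Z2 < 1 and r <= rstar, Theorem 1 gives a unique true zero of G
# within distance r of xbar.
r = Y / (1.0 - Z0 - Z2)
```

In the actual proofs, the same three bounds are evaluated with interval arithmetic for the orbit map G of Sec. 5.3, so the checks themselves carry rigorous error control.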

5.3 Set-up of the Nonlinear Map

A PO is a finite MDP trajectory. Let the step size be h, and let the period of the orbit be m. We present a nonlinear map that encodes POs (as zeroes of the map) when h is fixed. However, for technical reasons (see Appendix E [8]), it is possible for such a proof to fail. If Alg. 2 fails to prove the existence of an orbit with a fixed step size h, we fall back to a formulation where the step size is not fixed, which is more likely to yield a successful proof. This alternative encoding map G_2 is presented in Appendix D ([8]). Given h, pick g(h, ·) ∈ {g_E, g_SI}, one of the discrete dynamical systems used for numerically integrating the ODE. Let p be the dimension of the state space, so g(h, ·): R^p → R^p. We interpret the first dimension of R^p as the angular component, so that a periodic orbit requires a shift by a multiple of 2π in this variable. Given h, the number of steps m (i.e., the period of the orbit) and the number of signed rotations j in the angular variable, POs are precisely (if and only if) the zeroes of the map G_1: R^{pm} → R^{pm}, defined by

G_1(X) = ( x_0 − g(h, x_m) + (j2π, 0);
           x_1 − g(h, x_0);
           x_2 − g(h, x_1);
           ...;
           x_m − g(h, x_{m−1}) ),

where 0 is the zero vector in R^{p−1}, X = (x_1, ..., x_m) for x_i ∈ R^p, and x_1, ..., x_m are the time-ordered states.

² In practice, a numerical approximation A ≈ DG(x̄)^{−1}.
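A toy check of the encoding: with p = 1 (angle only) and a stand-in one-step map that advances the angle by 2π/m, the uniform rotation is a period-m orbit with j = 1 signed rotation, and G_1 evaluates to (numerically) zero on it. The convention that x_0 is identified with x_m by periodicity is an assumption of this sketch.

```python
import math

def G1(X, g, j):
    # Periodic-orbit encoding for p = 1: the first row closes the loop up
    # to j full rotations (identifying x_0 with x_m by periodicity, an
    # assumed convention); the remaining rows chain the one-step map.
    m = len(X)
    rows = [X[0] - g(X[m - 1]) + j * 2.0 * math.pi]
    for i in range(1, m):
        rows.append(X[i] - g(X[i - 1]))
    return rows

m, j = 8, 1
g = lambda x: x + 2.0 * math.pi / m                  # stand-in one-step map
X = [2.0 * math.pi * (i + 1) / m for i in range(m)]  # x_1, ..., x_m
residual = G1(X, g, j)
```

In the actual proofs, g is the controlled (E) or (SI) step of Sec. 2.1 and the residual is driven to zero by Newton's method before applying Theorem 1.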

6 Persistent Orbits in Controlled Pendulum

When constructing controllers using machine learning or statistical methods, the criterion most often used to measure their quality is the mean return over many test episodes. The mean return may be a deceptive metric for constructing robust controllers. More strongly, our findings suggest that mean return is not correlated with the presence of periodic orbits or robustness. One would typically expect a policy with high mean return to promote convergence toward return-maximizing states for any initial condition (IC), and also under other numerical schemes. Our experiments revealed reasons to believe this may be true for deep NN controllers. However, in the case of simple symbolic controllers, singular persistent solutions exist that accumulate large negative returns at a fast pace. By persistent solutions we mean solutions that remain ε away from the desired equilibrium; we formalize this notion in Sec. 7.1. We emphasize that all of the periodic orbits that we prove are necessarily stable in the usual Lyapunov sense, i.e., solutions that start out near the orbit stay near it forever, and hence the orbits are feasible in numerical simulations. We find such solutions for controllers as provided in the literature and as constructed by ourselves employing Alg. 1. We emphasize that our findings are not only numerical: we support them with (computer-assisted) mathematical proofs of existence.

6.1 Landajuela et. al [24] Controller

First, we consider the symbolic low-complexity controller for the pendulum, a = −7.08 s_2 − (13.39 s_2 + 3.12 s_3)/s_1 + 0.27, derived in [24] (with model given in Appendix A [8]), where s_1 = cos θ, s_2 = sin θ, s_3 = ω = θ̇, and a is the control input. While this controller looks more desirable than a deep NN with hundreds of thousands of parameters, its performance changes dramatically when using a slightly different transition function at test time, i.e., halved h (F_test(SI, 0.025)) or the explicit Euler scheme (F_test(E, 0.05)). Trajectories in Fig. 2 illustrate that some orbits oscillate instead of stabilizing at the equilibrium ŝ: θ = 0 mod 2π. The average return significantly deteriorates for the modified schemes and the same ICs compared to F_train(SI, 0.05); see Tab. 1. Such issues are present in deep NN controllers and small distilled NNs to a significantly lower extent. We associate the cause of the return deterioration with the existence of 'bad' solutions – persistent periodic orbits (POs) (formal Def. 1). Using CAPs (cf. Sec. 5) we obtain:

Theorem 2 For h ∈ H = {0.01, 0.005, 0.0025, 0.001}, the nonlinear pendulum system with controller a from [24], described in the opening paragraph of Section 6.1, has a periodic orbit (PO) under the following numerical schemes:
1) (SI) with step size h ∈ H;
2) (E) at h = 0.05 (native), and for all h ∈ H.
The identified periodic orbits are persistent (see Def. 2) and generate minus-infinity return for infinite episode length, with each episode decreasing the reward by at least 0.198.

(Figure 2 panels: (a) (SI), h = 0.05 (native); (b) (E), h = 0.05; (c) (SI), h = 0.025.)
Figure 2: 100 numerical simulations with IC ω = 0 and θ sampled uniformly, time horizon set to T = 6; the x-axis shows the (unnormalized) ω, and the y-axis θ. In (a), all ICs are attracted by an equilibrium at θ = 0 mod 2π, ω = 0. When applying a different F_test, (b) and (c) show the existence of attracting periodic solutions (they can be continued infinitely, as our theorems demonstrate).

6.2 Our Controllers

The issues with robustness and performance of the controller of Sec. 6.1 may be an artefact of a particular controller construction rather than a general property; indeed, that controller had a division by s_1. To investigate this further, we apply Alg. 1 to construct symbolic controllers of various complexities (without divisions). Using Alg. 1, we also distill a small NN (a single hidden layer with 10 neurons) for comparison. In step 3 we use fine-tuning based on either the analytic gradient or CMA-ES, each leading to different controllers. The studied controllers were trained using the default transition F_train(SI, 0.05), and tested using F_test(E, 0.05), F_test(E, 0.025), F_test(SI, 0.05), and F_test(SI, 0.025).

Tab. 1 reveals that the average returns deteriorate when using other numerical schemes for the symbolic controllers obtained using Alg. 1, analogous to the controller from [24]. The average return discrepancies are very large as well. We emphasize that all of the studied metrics for the symbolic controllers are far from the metrics achieved by the deep NN controller. Terminating Alg. 1 at step 2 results in a very bad controller achieving a mean return of only −1061; i.e., as observed in previous works, symbolic regression over a dataset sampled from a trained NN is not enough to construct a good controller. Analogous to Theorem 2, we are able to prove the following theorems on persistent periodic orbits (Def. 1) for the controllers displayed in Table 1.

Theorem 3 For h ∈ H = {0.025, 0.0125}, the nonlinear pendulum system with the controller generated by analytic-gradient refinement in Tab. 1 has POs under
1) (SI) with h ∈ H and at the native step size h = 0.05;
2) (E) with h ∈ H.
The identified periodic orbits are persistent (see Def. 2) and generate minus-infinity return for infinite episode length, with each episode decreasing the reward by at least 0.18.

Theorem 4 For h = 0.0125 and h = 0.05 (native) with scheme (E), the nonlinear pendulum system with the controller generated by CMA-ES refinement in Tab. 1 has POs which generate minus-infinity return for infinite episode length, with each episode decreasing the reward by at least 0.20.

7 Systematic Robustness Study

We consider a controller to be robust when it has "good" return statistics at the native simulator and step size, which persist when we change the simulator and/or decrease the step size. If a degradation of return statistics on varying the integrator or step size is identified, we wish to identify its source.

7.1 Background on Persistent Solutions and Orbits

Consider an MDP tuple (S, A, F, r, ρ_0, γ), a precision parameter ε > 0, a policy π: S → A (trained using F_train and tested using F_test), a desired equilibrium ŝ (corresponding to the maximized reward r), and an episode length N.

Definition 1 We call a persistent periodic orbit (PO) (of period n) an infinite MDP trajectory (s_0, a_0, r_1, s_1, a_1, ...) such that s_{kn} = s_0 for some n > 1 and all k ∈ N, and such that ||ŝ − s_j|| > ε for all j ≥ 0.

Definition 2 A finite MDP trajectory of episode length N, (s_0, a_0, r_1, s_1, a_1, ..., s_N), such that ||ŝ − s_j|| > ε for all 0 ≤ j ≤ N, is called a persistent solution.

Locating the objects in the dynamics responsible for degradation of the reward is not an easy task, as they may be singular or local minima of a non-convex landscape. For locating such objects we experimented with different strategies, and found most suitable an evolutionary search for penalty-maximizing solutions. The solutions identified by such a procedure are necessarily stable. We introduce a measure of 'badness' of persistent solutions and use it as a search criterion.

Definition 3 We call a penalty value a function p: S × A → R_+ such that, for a persistent solution/orbit, the accumulated penalty value is bounded from below by a set threshold M > 0, that is, Σ_{i=0}^{N−1} p(s_i, a_i) ≥ M.

Remark 4 The choice of penalty in Def. 3 depends on the particular studied example. We choose the following penalties for the studied problems:
1. p(s, a) = −r(s, a) for the pendulum.
2. p(s, a) = −r(s) + 0.5 θ̇² + 0.5 ẋ² for the cartpole swing-up, i.e., we subtract from the native reward value r(s) = cos θ the scaled sum of squared velocities (of the cart and the pole) and turn off the episode termination condition. This allows capturing orbits that manage to stabilize the pole but are unstable and keep the cart moving. The threshold M in Def. 3 can be set by propagating a number of trajectories with random ICs and taking the maximal penalty as M.
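The threshold-setting procedure above can be sketched as follows; the simplified pendulum-like dynamics and the hand-made controller are illustrative stand-ins, not the paper's trained policies.

```python
import math
import random

random.seed(2)

def accumulated_penalty(theta0, omega0, n_steps=200):
    # Rollout accumulating p(s, a) = -r(s, a) for a pendulum-like system
    # with clipping; dynamics and controller are simplified stand-ins.
    th, om, total = theta0, omega0, 0.0
    for _ in range(n_steps):
        a = max(-2.0, min(2.0, -2.0 * math.sin(th) - 0.5 * om))
        total += math.acos(math.cos(th)) ** 2 + 0.1 * om ** 2 + 0.001 * a ** 2
        om = max(-8.0, min(8.0, om + 0.05 * (15.0 * math.sin(th) + 3.0 * a)))
        th = th + 0.05 * om
    return total

# Threshold M: maximal accumulated penalty over random initial conditions;
# a search candidate whose accumulated penalty exceeds M is then flagged
# as a potential persistent solution.
M = max(accumulated_penalty(random.uniform(-math.pi, math.pi),
                            random.uniform(-1.0, 1.0))
        for _ in range(50))
```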

Remark 5 For a PO, the accumulated penalty admits a linear lower bound, i.e., Σ_{m=0}^{n−1} p(s_m, a_m) ≥ Cn for some C > 0. Thm. 2 implies C = 0.14 for the POs in Tab. 6 in the Appendix [8].


Table 1: Comparison of different controllers for the pendulum. Mean ± std. dev., rounded to a decimal digit; returns over 100 episodes reported for different F_test (larger is better). The F_test = F_train column is SI, h = 0.05; in this case the mean return equals the negative accumulated penalty. Absolute return discrepancies measure the discrepancy in episodic return between different schemes (E/SI) for the same IC (smaller is better). Observation vector at a given time t: x0 = cos θ(t), x1 = sin θ(t), x2 = ω(t) = θ̇(t).

| ORIGIN | FORMULA | SI, h=0.05 | E, h=0.05 | SI, h=0.025 | E, h=0.025 | DISCREPANCY E/SI |
| Alg. 1, 3. analytic (symb. k=9) | ((((1.30·x2 + 4.18·x1)x0) + 0.36x1)/−0.52) | −207 ± 183 | −604 ± 490 | −431 ± 396 | −910 ± 853 | 479 ± 416 |
| Alg. 1, 3. CMA-ES (symb. k=9) | ((((−10.59x2 + −42.47x1)x0) + 1.2x1)/5.06) | −165 ± 113 | −659 ± 461 | −331 ± 225 | −1020 ± 801 | 538 ± 401 |
| Alg. 1, small NN (10 neurons) | distilled small NN | −157 ± 99 | −304 ± 308 | −311 ± 196 | −290 ± 169 | 188 ± 285 |
| [24] (a1) | −7.08x1 − (13.39x1 + 3.12x2)/x0 + 0.27 | −150 ± 87 | −703 ± 445 | −318 ± 190 | −994 ± 777 | 577 ± 401 |
| TD3 training | deep NN | −149 ± 86 | −138 ± 77 | −298 ± 171 | −278 ± 156 | 18 ± 38 |

7.2 Searching for and Proving Persistent Orbits

We designed a pipeline for automated search of persistent/periodic orbits together with interval proof certificates. By an interval proof certificate of a PO we mean interval bounds within which a CAP of the orbit's existence was carried out using the Newton scheme (see Sec. 5.2), whereas by a proof certificate of a persistent solution (which may or may not be a PO) we mean interval bounds for the solution at each step, together with a bound on the reward value, showing that the solution does not stabilize by verifying the lower bound ‖ŝ − s_t‖ > ε. The search procedure is implemented in Python, while the CAP part is in Julia; refer to Sec. 5 for further details.
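The final check encoded by such a certificate can be sketched as follows (an illustrative Python fragment, not our Julia CAP code; it assumes the per-step enclosures have already been rigorously computed and only verifies the non-stabilization condition in the max norm):

```python
def certifies_persistence(boxes, s_hat, eps):
    """Check the certificate condition ||s_hat - s_t|| > eps for every step t.

    boxes: one interval enclosure per step, each a list of (lo, hi) pairs,
           one pair per state coordinate.
    s_hat: the stable equilibrium the solution must stay away from.
    Returns True iff every box lies at max-norm distance > eps from s_hat.
    """
    for box in boxes:
        # Minimal max-norm distance from s_hat to the box: per coordinate,
        # the distance to the nearest interval endpoint (0 if inside).
        dist = max(max(lo - c, c - hi, 0.0) for (lo, hi), c in zip(box, s_hat))
        if dist <= eps:
            return False
    return True
```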

Algorithm 2 Persistent Solutions/Orbits Search & Prove

input: Ftest; control policy π; hyperparameters of the evolutionary search; penalty function p; trajectory length; search domain;
output: interval certificates of persistent/periodic orbits;

1: for each MDP do
2:     for number of searches do
3:         initialize CMA-ES search within specified bounds;
4:         search for a candidate maximizing penalty p during the fixed episode length;
5:     end for
6:     order found candidates w.r.t. their p value;
7: end for
8: for each candidate do
9:     search for a nearby periodic orbit with Newton's method correction applied to a suitable sub-trajectory;
10:    if potential periodic orbit found then
11:        attempt to prove existence of the orbit with Thm. 1;
12:        if proof successful then
13:            return an interval certificate of the orbit;
14:        else
15:            return proof failure;
16:        end if
17:    else
18:        return periodic orbit not found;
19:    end if
20:    produce and return an interval certificate of the persistent solution;
21: end for
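The search phase of Alg. 2 (steps 1-7) can be sketched as follows. This is a simplified Python stand-in: the paper uses CMA-ES via pycma, replaced here by plain random restarts within the search domain, and `rollout` is a placeholder for the MDP simulator:

```python
import random

def search_candidates(rollout, penalty, bounds, n_searches, n_samples,
                      episode_len, seed=0):
    """Search for initial conditions maximizing the accumulated penalty.

    bounds: per-coordinate (lo, hi) box for the initial condition.
    rollout(ic, n): returns the list of (state, action) pairs of an n-step
    episode started at ic (placeholder for the simulator + controller).
    Returns candidates sorted by decreasing accumulated penalty (step 6).
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_searches):
        best_ic, best_p = None, float("-inf")
        # Stand-in for one CMA-ES run (steps 3-4): uniform sampling in the box.
        for _ in range(n_samples):
            ic = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
            total = sum(penalty(s, a) for s, a in rollout(ic, episode_len))
            if total > best_p:
                best_ic, best_p = ic, total
        candidates.append((best_p, best_ic))
    candidates.sort(reverse=True)  # step 6: order by penalty value
    return candidates
```

Each candidate is then handed to the Newton correction and, when a periodic orbit is found, to the CAP of Thm. 1 (steps 8-21).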

7.3 Findings: Pendulum

Changing the simulator or the step size resulted in a substantial mean return loss (see Tab. 1), and simulation revealed stable POs (see Fig. 2). We proved existence of POs using the methods of Sections 5.2-5.3. The proven POs are presented in tables in Appendix F ([8]). See also Fig. 3, where a persistent solution shadows an unstable PO before converging to the stable equilibrium. We present proven persistent solutions in the tables in Appendix F ([8]).

Comparing the mean returns in Tab. 1, we immediately see that the deep NN controller's performance does not deteriorate as much as the symbolic controller's, whereas the small net is located between the two extremes. This observation is confirmed after we run Alg. 2 for the symbolic controllers and the NNs; in particular, we did not identify any stable periodic orbits or especially long persistent solutions. However, the deep NN controller is not entirely robust, admitting singular persistent solutions achieving returns far from the mean; refer to Tab. 4. On the other hand, the small 10-neuron NN also seems to be considerably more robust than the symbolic controllers. For the case Ftest(E, 0.05), its average returns are two times larger than for the symbolic controllers, but still two times smaller than for the deep NN. Moreover, in the case Ftest(E, 0.025), the average returns are close to those of the deep NN, contrary to the symbolic controllers. The small NN also compares favorably to the symbolic controllers in terms of the E/SI return discrepancy metric, while still not reaching the level of the deep NN. This supports our earlier conjecture (Sec. 1) that controller robustness is proportional to parametric complexity.

Table 2: Examples of persistent solutions found by the persistent solutions Search & Prove Alg. 2 for the pendulum, maximizing the accumulated penalty, with episodes of fixed length N = 1000. The found persistent solutions were the basis for the persistent orbit/solution proofs presented in Appendix F ([8]).

CONTROLLER | MDP | r(s, a)
ALG. 1 (k=9) | (SI) h=0.05 | −9869.6
ALG. 1 (k=9) | (SI) h=0.025 | −1995.7
ALG. 1 SMALL NN | (SI) h=0.05 | −926.8
ALG. 1 SMALL NN | (SI) h=0.025 | −1578.4
ALG. 1 SMALL NN | (E) h=0.05 | −747.3
[24] (a1) | (SI) h=0.05 | −873.8
[24] (a1) | (SI) h=0.025 | −1667.6
[24] (a1) | (E) h=0.05 | −5391.1
DEEP NN | (SI) h=0.05 | −426.4
DEEP NN | (SI) h=0.025 | −818.6
DEEP NN | (E) h=0.05 | −401.4

7.4 Findings: Cartpole Swing-Up

We computed the mean return metrics for a representative symbolic controller, a distilled small NN controller, and the deep NN; see Tab. 3. For the symbolic controller, the average return deteriorates more when changing the simulator's numerical scheme away from the native one (Ftrain(E, 0.01)). Notably, the E/SI discrepancy is an order of magnitude larger than in the case of the deep NN.

Table 3: Mean ± std. dev. (rounded to a single decimal digit) of returns over 100 episodes, reported for different Ftest (larger the better); Ftest = Ftrain marked in bold. Return discrepancies measure the discrepancy in episodic return between different schemes (E/SI) for the same IC (smaller the better). The formula for the symbolic controller with k=21 appears in Appendix Tab. 27 [8].

ORIGIN | SI (h=0.01) | E (h=0.01) | SI (h=0.005) | E (h=0.005) | RETURN DISCREPANCY E/SI
ALG. 1, 3. CMA-ES (SYMB. k=21) | 220.2 ± 96.7 | 334.3 ± 37 | 474.6 ± 194.3 | 632.2 ± 119.3 | 121.9 ± 88.9
ALG. 1, SMALL NN (25 NEURONS) | 273.3 ± 128.7 | 332.9 ± 79.2 | 585.1 ± 229.1 | 683.7 ± 103.3 | 86.6 ± 135.1
TD3 TRAINING | 381.2 ± 9.1 | 382.9 ± 9 | 760.9 ± 18.4 | 764.0 ± 18.1 | 1.7 ± 0.9

Figure 3: A persistent solution with poor reward ≈ −7527 over episode length 1000 with step size h = 0.0125, plotted until near-stabilization at t = 17.825. Left: plot in phase space. Right: time series of θ. Other data for this solution is in Appendix Tab. 22 [8].

As for the pendulum, the small NN sits between the symbolic and deep NN in terms

of the studied metrics. We computed the mean accumulated shaped penalty p(s, a) = −r(s) + 0.5 θ̇² + 0.5 ẋ² for the selected controllers in Tab. 5. The contrast between the deep NN and the symbolic controller is clear, with the small NN being in between those two extremes. The mean penalty is a measure of the prevalence of persistent solutions. However, we emphasize that the deep NN controller is not entirely robust and also admits singular persistent solutions with bad returns; refer to Tab. 4. Rigorously proving the returns for the deep NN was not possible in this case; see Rem. 6.

Investigating the persistent solutions found with Alg. 2 in Fig. 4, we see that in the case Ftest(SI, 0.01) the symbolic controller admits bad persistent solutions with x_t decreasing super-linearly, whereas θ stabilizes at θ ∼ 0.01. In contrast, the deep NN exhibits fairly stable control with small-magnitude oscillations. This example emphasizes the shaped penalty's usefulness in detecting such bad persistent solutions: the accumulated penalty value differs by several orders of magnitude between the deep NN controller and the symbolic controller. We identify and rigorously prove an abundance of persistent solutions for each of the studied symbolic controllers. For example, we can prove:

Theorem 5 For the symbolic controller with complexity k = 21 and native step size h = 0.01, there are 2000-episode persistent solutions of the cartpole swing-up model with accumulated penalty ≥ 2.66 × 10⁵ for the explicit scheme, and ≥ 3.77 × 10⁵ for the semi-implicit scheme. With the small NN controller, the conclusions hold with accumulated penalties ≥ 6263 and ≥ 2.68 × 10⁶, respectively.

We demonstrate persistent solutions for each considered controller in Tab. 4. The found persistent solutions were the basis for the persistent orbit/solution proofs presented in Appendix G ([8]). The symbolic and small NN controllers admit much worse solutions with increasing velocity, as illustrated in Fig. 4b. Deep NN controllers admit such bad solutions when tested using smaller time steps ((E, 0.005), (SI, 0.005)); see examples in Tab. 4. They also exhibit persistent periodic solutions, albeit with a small ε; see Fig. 4a. We have proven the following.

Table 4: Examples of persistent solutions found by the persistent solutions Search & Prove Alg. 2 for the cartpole swing-up, maximizing the accumulated penalty, with episodes of fixed length N = 2000 without taking into account the termination condition. The found persistent solutions were the basis for the persistent orbit/solution proofs presented in Appendix G ([8]).

CONTROLLER | MDP | r(s, a)
ALG. 1 (k=21) | (SI) h=0.01 | −41447.2
ALG. 1 (k=21) | (SI) h=0.005 | −11204.3
ALG. 1 (k=21) | (E) h=0.01 | −29878.0
ALG. 1 (k=21) | (E) h=0.005 | −8694.3
ALG. 1 SMALL NN | (SI) h=0.01 | −2684696.8
ALG. 1 SMALL NN | (SI) h=0.005 | −798442.3
ALG. 1 SMALL NN | (E) h=0.01 | −520.9
ALG. 1 SMALL NN | (E) h=0.005 | −2343.8
DEEP NN | (SI) h=0.01 | 306.6
DEEP NN | (SI) h=0.005 | −396074.9
DEEP NN | (E) h=0.01 | 226.5
DEEP NN | (E) h=0.005 | −1181.7

Theorem 6 For h close to 0.005³ and h = 0.01 (native), the cartpole swing-up model has POs for (E) and (SI) with the deep NN controller. The mean penalties along the orbits are greater than −0.914, and the orbits are persistent⁴ with ε ≥ 0.036.

Remark 6 We were not able to rigorously compute the penalty values of the persistent solutions for the deep NN controller due to the wrapping effect of interval arithmetic calculations [38], which is made much worse by the width of the network (400, 300) and the long episode length (which introduces further composition). However, this is not a problem for the periodic orbits: we enclose them using Theorem 1, which reduces the wrapping effect.
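The wrapping effect can be illustrated with a toy computation (a self-contained Python sketch with naive endpoint intervals, unrelated to the IntervalArithmetic.jl enclosures used in our CAPs): re-wrapping the image of a box into an axis-aligned box after every step of a rotation inflates the enclosure exponentially, while a single enclosure of the composed map does not.

```python
import math

def scale(iv, c):
    """Multiply the interval iv = (lo, hi) by the scalar c."""
    lo, hi = iv[0] * c, iv[1] * c
    return (min(lo, hi), max(lo, hi))

def add(a, b):
    """Sum of two intervals."""
    return (a[0] + b[0], a[1] + b[1])

def rotate_box(box, angle):
    """Image of an axis-aligned box under a rotation, re-wrapped axis-aligned."""
    (x, y), c, s = box, math.cos(angle), math.sin(angle)
    return (add(scale(x, c), scale(y, -s)), add(scale(x, s), scale(y, c)))

def width(box):
    return max(hi - lo for lo, hi in box)

box = ((-0.01, 0.01), (-0.01, 0.01))
stepwise = box
for _ in range(20):                           # twenty rotations by 45 degrees
    stepwise = rotate_box(stepwise, math.pi / 4)
one_shot = rotate_box(box, 20 * math.pi / 4)  # the same composed map, enclosed once
# Re-wrapping grows the width by a factor sqrt(2) per step (about 1024x here),
# while the one-shot enclosure keeps the original width.
```

Enclosing a whole PO at once via Theorem 1 plays the role of the one-shot enclosure in this toy picture, which is why the periodic orbits remain tractable.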

Table 5: Comparison of different controllers for the cartpole swing-up for h = 0.01. Mean and std. dev. (after ±) of accumulated penalties p(s_k, a_k) = −r(s_k) + 0.5 θ̇_k² + 0.5 ẋ_k² (larger the worse) over 100 episodes, reported for different Ftest; Ftest = Ftrain marked in bold. Controllers same as in Tab. 3.

ORIGIN | SI | E
ALG. 1, 3. CMA-ES (SYMB. k=21) | 3123.0 ± 719.9 | 2257.2 ± 234.1
ALG. 1, SMALL NN (25 NEURONS) | 1413.4 ± 9670.1 | 404.2 ± 148.4
TD3 TRAINING | 335.7 ± 64.7 | 425.6 ± 72.1

³ The exact step size is smaller than h, with relative error up to 2%. See Appendix G ([8]) for precise values and detailed data for the POs.
⁴ With respect to the translation-invariant seminorm ‖(x, ẋ, θ, θ̇)‖ = max{|ẋ|, |θ|, |θ̇|}.


Figure 4: The persistent solutions (evolution of (θ, x); Def. 2) for the cartpole swing-up problem found with Alg. 2 that maximize the accumulated penalty p(s, a) = −r(s) + 0.5 θ̇² + 0.5 ẋ² over episodes of length 2000 without terminations, using SI with h = 0.01. (a) Deep NN controller, p(s, a) = −306; (b) a symbolic (k=21) controller, p(s, a) = 37746.

8 Codebase

Our full codebase is written in Python and Julia and is shared in a GitHub repository [1]. The second part of our codebase is written in Julia because of the lack of a suitable interval arithmetic library in Python. The Python part of the codebase consists of four independent scripts: deep NN policy training, symbolic/small NN controller regression, regressed controller fine-tuning, and the periodic orbit/persistent solution searcher. All controllers that we use are implemented in PyTorch [28]. For the deep NN policy training we use the Stable-Baselines3 library [30], which outputs a trained policy (the one achieving the best return during training) and the training replay buffer of data. For the symbolic regression we employ the PySR library [6]. For the regressed controller fine-tuning we employ the pycma CMA-ES implementation [18]. Our implementation in Julia uses two external packages: IntervalArithmetic.jl [33] (for interval arithmetic) and ForwardDiff.jl [31] (for forward-mode automatic differentiation). These packages are used together to perform the necessary calculations for the CAPs.

9 Conclusion and Future Work

Our work is a first step towards a comprehensive robustness study of deep NN controllers and their symbolic abstractions, which are desirable for deployment and trustworthiness reasons. Studying the controllers' performance on simple benchmarks, we identify and prove the existence of an abundance of persistent solutions and periodic orbits. Persistent solutions are undesirable and can be exploited by an adversary. Future work will apply the developed methods to study higher-dimensional problems often used as benchmarks for continuous control.

10 Acknowledgements

The project is financed by the Polish National Agency for Academic Exchange. The first author has been supported by the Polish National Agency for Academic Exchange Polish Returns grant no. PPN/PPO/2018/1/00029 and the University of Warsaw IDUB New Ideas grant. This research was supported in part by PL-Grid Infrastructure.

References

[1] Code repository. https://github.com/MIMUW-RL/worrisome-nn. Accessed: 2023-07-27.
[2] Houssam Abbas, Georgios Fainekos, Sriram Sankaranarayanan, Franjo Ivančić, and Aarti Gupta, 'Probabilistic temporal logic falsification of cyber-physical systems', ACM Trans. Embed. Comput. Syst., 12(2s), (May 2013).
[3] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah, 'Julia: A fresh approach to numerical computing', SIAM Review, 59(1), 65–98, (2017).
[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
[5] Rudy Bunel, Ilker Turkaslan, Philip H.S. Torr, Pushmeet Kohli, and M. Pawan Kumar, 'A unified view of piecewise linear neural network verification', in Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pp. 4795–4804, Red Hook, NY, USA, (2018). Curran Associates Inc.
[6] Miles Cranmer. PySR: Fast & parallelized symbolic regression in Python/Julia, September 2020.
[7] Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel, and Shirley Ho, 'Discovering symbolic models from deep learning with inductive biases', NeurIPS 2020, (2020).
[8] Jacek Cyranka, Kevin E M Church, and Jean-Philippe Lessard, 'Worrisome properties of neural network controllers and their symbolic representation — extended version', arXiv preprint arXiv:2307.15456, (2023). https://arxiv.org/abs/2307.15456.
[9] Sarah Day, Jean-Philippe Lessard, and Konstantin Mischaikow, 'Validated continuation for equilibria of PDEs', SIAM Journal on Numerical Analysis, 45(4), 1398–1424, (January 2007).
[10] Tommaso Dreossi, Alexandre Donzé, and Sanjit A. Seshia, 'Compositional falsification of cyber-physical systems with machine learning components', J. Autom. Reason., 63(4), 1031–1053, (December 2019).
[11] Rüdiger Ehlers, 'Formal verification of piece-wise linear feed-forward neural networks', in Automated Technology for Verification and Analysis, eds., Deepak D'Souza and K. Narayan Kumar, pp. 269–286, Cham, (2017). Springer International Publishing.
[12] Scott Fujimoto, Herke van Hoof, and David Meger, 'Addressing function approximation error in actor-critic methods', arXiv e-prints, arXiv:1802.09477, (February 2018).
[13] Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen, 'Improving PILCO with Bayesian neural network dynamics models', in Data-Efficient Machine Learning workshop, International Conference on Machine Learning, (2016).
[14] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, 'Explaining and harnessing adversarial examples', arXiv preprint arXiv:1412.6572, (2014).
[15] David Ha, 'Evolving stable strategies', blog.otoro.net, (2017).
[16] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al., 'Soft actor-critic algorithms and applications', arXiv preprint arXiv:1812.05905, (2018).
[17] E. Hairer, S. P. Nørsett, and G. Wanner, Solving Ordinary Differential Equations I (2nd Revised Ed.): Nonstiff Problems, Springer-Verlag, Berlin, Heidelberg, 1993.
[18] Nikolaus Hansen, Youhei Akimoto, and Petr Baudis. CMA-ES/pycma on GitHub. Zenodo, DOI:10.5281/zenodo.2559634, February 2019.
[19] Nikolaus Hansen, Sibylle D. Müller, and Petros Koumoutsakos, 'Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES)', Evolutionary Computation, 11(1), 1–18, (2003).
[20] Daniel Hein, Steffen Udluft, and Thomas A. Runkler, 'Interpretable policies for reinforcement learning by genetic programming', Engineering Applications of Artificial Intelligence, 76, 158–169, (2018).
[21] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer, 'Reluplex: An efficient SMT solver for verifying deep neural networks', in Computer Aided Verification, eds., Rupak Majumdar and Viktor Kunčak, pp. 97–117, Cham, (2017). Springer International Publishing.
[22] Jiří Kubalík, Eduard Alibekov, and Robert Babuška, 'Optimal control via reinforcement learning with symbolic policy approximation', IFAC-PapersOnLine, 50(1), 4162–4167, (2017). 20th IFAC World Congress.
[23] Christian Kuehn and Elena Queirolo. Computer validation of neural network dynamics: A first case study, 2022.
[24] Mikel Landajuela, Brenden K Petersen, Sookyung Kim, Claudio P Santiago, Ruben Glatt, Nathan Mundhenk, Jacob F Pettit, and Daniel Faissol, 'Discovering symbolic policies with deep reinforcement learning', in Proceedings of the 38th International Conference on Machine Learning, eds., Marina Meila and Tong Zhang, volume 139 of Proceedings of Machine Learning Research, pp. 5979–5989. PMLR, (18–24 Jul 2021).
[25] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra, 'Continuous control with deep reinforcement learning', in ICLR, eds., Yoshua Bengio and Yann LeCun, (2016).
[26] Guiliang Liu, Oliver Schulte, Wang Zhu, and Qingcan Li, 'Toward interpretable deep reinforcement learning with linear model U-trees', in ECML/PKDD, (2018).
[27] Ramon E. Moore, Interval Analysis, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1966.
[28] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala, 'PyTorch: An imperative style, high-performance deep learning library', in Advances in Neural Information Processing Systems 32, 8024–8035, Curran Associates, Inc., (2019).
[29] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta, 'Robust adversarial reinforcement learning', in Proceedings of the 34th International Conference on Machine Learning, eds., Doina Precup and Yee Whye Teh, volume 70 of Proceedings of Machine Learning Research, pp. 2817–2826. PMLR, (06–11 Aug 2017).
[30] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann, 'Stable-Baselines3: Reliable reinforcement learning implementations', Journal of Machine Learning Research, 22(268), 1–8, (2021).
[31] Jarrett Revels, Miles Lubin, and Theodore Papamarkou. Forward-mode automatic differentiation in Julia, 2016.
[32] Itay Safran and Ohad Shamir, 'Spurious local minima are common in two-layer ReLU neural networks', in Proceedings of the 35th International Conference on Machine Learning, eds., Jennifer Dy and Andreas Krause, volume 80 of Proceedings of Machine Learning Research, pp. 4433–4441. PMLR, (10–15 Jul 2018).
[33] David P. Sanders, Luis Benet, Luca Ferranti, Krish Agarwal, Benoît Richard, Josua Grawitter, Eeshan Gupta, Marcelo Forets, Michael F. Herbst, yashrajgupta, Eric Hanson, Braam van Dyk, Christopher Rackauckas, Rushabh Vasani, Sebastian Micluța-Câmpeanu, Sheehan Olver, Twan Koolen, Caroline Wormell, Daniel Karrasch, Favio André Vázquez, Guillaume Dalle, Jeffrey Sarnoff, Julia TagBot, Kevin O'Bryant, Kristoffer Carlsson, Morten Piibeleht, Mosè Giordano, Ryan, Robin Deits, and Tim Holy. JuliaIntervals/IntervalArithmetic.jl: v0.20.8, October 2022.
[34] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, The MIT Press, second edn., 2018.
[35] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller, 'DeepMind Control Suite', arXiv e-prints, arXiv:1801.00690, (January 2018).
[36] Emanuel Todorov, Tom Erez, and Yuval Tassa, 'MuJoCo: A physics engine for model-based control', in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, (2012).
[37] Dweep Trivedi, Jesse Zhang, Shao-Hua Sun, and Joseph J Lim, 'Learning to synthesize programs as interpretable and generalizable policies', in Advances in Neural Information Processing Systems, eds., M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, volume 34, pp. 25146–25163. Curran Associates, Inc., (2021).
[38] Warwick Tucker, Validated Numerics, Princeton University Press, July 2011.
[39] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri, 'Programmatically interpretable reinforcement learning', in Proceedings of the 35th International Conference on Machine Learning, eds., Jennifer Dy and Andreas Krause, volume 80 of Proceedings of Machine Learning Research, pp. 5045–5054. PMLR, (10–15 Jul 2018).
[40] Masaki Waga, 'Falsification of cyber-physical systems with robustness-guided black-box checking', in Proceedings of the 23rd International Conference on Hybrid Systems: Computation and Control, HSCC '20, New York, NY, USA, (2020). Association for Computing Machinery.
[41] Lindsay Wells and Tomasz Bednarz, 'Explainable AI and reinforcement learning—a systematic review of current approaches and trends', Frontiers in Artificial Intelligence, 4, (2021).
[42] Tsui-Wei Weng, Krishnamurthy (Dj) Dvijotham*, Jonathan Uesato*, Kai Xiao*, Sven Gowal*, Robert Stanforth*, and Pushmeet Kohli, 'Toward evaluating robustness of deep reinforcement learning with continuous control', in International Conference on Learning Representations, (2020).
[43] Yoriyuki Yamagata, Shuang Liu, Takumi Akazaki, Yihai Duan, and Jianye Hao, 'Falsification of cyber-physical systems using deep reinforcement learning', IEEE Transactions on Software Engineering, 47(12), 2823–2840, (2021).
[44] Hanshu Yan, Jiawei Du, Vincent Tan, and Jiashi Feng, 'On robustness of neural ordinary differential equations', in International Conference on Learning Representations, (2020).
