### Encoding causal structures into neural networks

The methods developed in this work are in principle applicable to any causal structure. Here, we demonstrate how to encode a network nonlocality configuration into a neural network using the highly nontrivial example of the triangle network with quaternary outputs and no inputs. In this scenario three sources, *α*, *β*, *γ*, send information through either a classical or a quantum channel to three parties, Alice, Bob, and Charlie. The flow of information is constrained such that the sources are independent of each other, and each one sends information to only two of the three parties, as depicted in Fig. 1. Alice, Bob, and Charlie process their inputs with arbitrary local response functions, and they output numbers *a*, *b*, *c* ∈ {0, 1, 2, 3}, respectively. Under the assumption that each source is independent and identically distributed from round to round, and that the local response functions are fixed (though possibly stochastic), such a scenario is well characterized by the probability distribution *p*(*abc*) over the random variables of the outputs.

If quantum channels are permitted from the sources to the parties then the set of distributions is larger than that achievable classically. Due to the nonlocal nature of quantum theory, these correlations are often referred to as nonlocal ones, as opposed to local behaviors arising from only using classical channels. In the classical case, the scenario is equivalent to a causal structure, otherwise known as a Bayesian network^{32,33}.

For the classical setup we can assume without loss of generality that the sources each send a random variable drawn from a uniform distribution on the continuous interval between 0 and 1 (any other distribution can be reabsorbed by the parties’ response functions, e.g., via the inverse transform sampling method). Given the network constraint, the probability distribution over the parties’ outputs can be written as

$$p(abc)=\int_{0}^{1}d\alpha\,d\beta\,d\gamma\;p_{\text{A}}(a|\beta\gamma)\,p_{\text{B}}(b|\gamma\alpha)\,p_{\text{C}}(c|\alpha\beta),\tag{1}$$

where the conditional probability *p*_{X}(*x*∣ ⋅ , ⋅) is the response function of party X.

We now construct a neural network which is able to approximate a distribution of the form (1). We use a feedforward neural network, since it is described by a directed acyclic graph, similarly to a causal structure^{32,33,34}. This allows for a seamless transfer from the causal structure to the neural network model. On a practical level, we represent each party’s response function by a fully connected multilayer perceptron, one of the simplest artificial neural network architectures^{34}. In our case, the inputs to the three perceptrons are the hidden variables, i.e., uniformly drawn random numbers *α*, *β*, *γ*. So as to respect the communication constraints of the triangle, inputs are routed to the three perceptrons in a restricted manner, as shown in Fig. 1. The outputs are the probabilities conditioned on the respective inputs, *p*_{A}(*a*∣*βγ*), *p*_{B}(*b*∣*γα*), and *p*_{C}(*c*∣*αβ*), i.e., three normalized vectors, each of length 4. This structure can also be viewed as one large, not fully connected multilayer perceptron that outputs the three probability vectors *p*_{A}(*a*∣*βγ*), *p*_{B}(*b*∣*γα*), *p*_{C}(*c*∣*αβ*) for a given input *α*, *β*, *γ*. Due to the restricted architecture, the output conditional probabilities obey the causal network constraints, i.e., by construction only local models can be generated by such a neural network.
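The routing of the sources into three separate perceptrons can be sketched in a few lines of NumPy. This is only an illustrative forward pass under our own assumptions, not the authors' implementation: the layer widths, the random initialization, and the helper names `init_mlp` and `response` are hypothetical, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes, rng):
    """Random weights and zero biases for a fully connected perceptron (hypothetical init)."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def response(params, x):
    """Map a batch of source pairs, shape (N, 2), to probability vectors, shape (N, 4)."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # rectified linear hidden layers
    w, b = params[-1]
    return softmax(h @ w + b)           # softmax enforces normalization

sizes = [2, 16, 16, 4]  # assumed widths: 2 inputs, 4 outputs per party
params_a, params_b, params_c = (init_mlp(sizes, rng) for _ in range(3))

n_batch = 1000
alpha, beta, gamma = rng.uniform(size=(3, n_batch))

# Each party only receives the two sources it is connected to in the triangle.
p_a = response(params_a, np.stack([beta, gamma], axis=1))
p_b = response(params_b, np.stack([gamma, alpha], axis=1))
p_c = response(params_c, np.stack([alpha, beta], axis=1))
```

Because Alice's perceptron never receives *α* (and similarly for the other parties), any weights plugged into this structure yield a model of the form (1).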

We evaluate the neural network for *N*_{batch} values of *α*, *β*, *γ* in order to approximate the joint probability distribution (1) with a Monte Carlo approximation,

$$p_{\text{M}}(abc)=\frac{1}{N_{\text{batch}}}\sum_{i=1}^{N_{\text{batch}}}p_{\text{A}}(a|\beta_{i}\gamma_{i})\,p_{\text{B}}(b|\gamma_{i}\alpha_{i})\,p_{\text{C}}(c|\alpha_{i}\beta_{i}).\tag{2}$$
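As a minimal sketch of Eq. (2), the batch average of per-sample products can be written as a single `einsum`. The response functions used here are a toy deterministic local model of our own choosing (each party thresholds its two inputs at 0.5); they stand in for the trained perceptrons and are not taken from the paper.

```python
import numpy as np

def one_hot_response(first, second):
    """Toy deterministic response: encode two thresholded inputs as an output in {0, 1, 2, 3}."""
    out = 2 * (first > 0.5).astype(int) + (second > 0.5).astype(int)
    return np.eye(4)[out]  # shape (N, 4); each row is a normalized probability vector

def monte_carlo_joint(p_a, p_b, p_c):
    """Eq. (2): product of the three vectors for each sample, averaged over the batch."""
    return np.einsum("ia,ib,ic->abc", p_a, p_b, p_c) / p_a.shape[0]

rng = np.random.default_rng(1)
n_batch = 20000
alpha, beta, gamma = rng.uniform(size=(3, n_batch))

p_m = monte_carlo_joint(
    one_hot_response(beta, gamma),   # Alice sees beta, gamma
    one_hot_response(gamma, alpha),  # Bob sees gamma, alpha
    one_hot_response(alpha, beta),   # Charlie sees alpha, beta
)
```

The result `p_m` is a 4 × 4 × 4 array that sums to one; its statistical error shrinks as the batch size grows.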

Note that before summing over the batch, we take the Cartesian (outer) product of the three conditional probability vectors for each sample, which yields a 4 × 4 × 4 array. In our implementation each of these three conditional probability functions is modeled by a multilayer perceptron with rectified linear or hyperbolic tangent activations, except at the last layer, where a softmax layer imposes normalization. Note, however, that any feedforward network can be used to model these conditional probabilities. The loss function can be any differentiable measure of discrepancy between the target distribution *p*_{t} and the neural network’s output *p*_{M}, such as the Kullback–Leibler divergence of one relative to the other, namely

$$L(p_{\text{M}})=\sum_{abc}p_{\text{t}}(abc)\log\left(\frac{p_{\text{t}}(abc)}{p_{\text{M}}(abc)}\right).\tag{3}$$
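A direct NumPy transcription of Eq. (3) might look as follows. The small floor `eps` on *p*_{M} and the convention of skipping terms with *p*_{t} = 0 are our own numerical safeguards, not something specified in the text.

```python
import numpy as np

def kl_divergence(p_t, p_m, eps=1e-12):
    """Kullback-Leibler divergence of p_t relative to p_m, as in Eq. (3)."""
    p_t = np.asarray(p_t, dtype=float)
    p_m = np.asarray(p_m, dtype=float)
    mask = p_t > 0  # terms with p_t = 0 contribute nothing to the sum
    # eps is an assumed floor that avoids division by zero when p_m vanishes
    return float(np.sum(p_t[mask] * np.log(p_t[mask] / np.maximum(p_m[mask], eps))))

# Usage on a small two-outcome example
loss = kl_divergence([0.5, 0.5], [0.25, 0.75])
```

The function accepts arrays of any shape, so the 4 × 4 × 4 arrays *p*_{t}(*abc*) and *p*_{M}(*abc*) can be passed in directly.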

In order to train the neural network we synthetically generate uniform random numbers for the hidden variables, which are the inputs. We then adjust the weights of the network after evaluating the loss function on a minibatch of size *N*_{batch}, using conventional neural network optimization methods^{34}. The minibatch size is a free parameter and can be increased in order to increase the neural network’s precision. For the triangle with quaternary outputs an *N*_{batch} of several thousand is typically satisfactory.

By encoding the causal structure in a neural network like this, we can train the neural network to try to reproduce a given target distribution. The procedure generalizes in a straightforward manner to any causal structure, and is thus in principle applicable to any quantum nonlocality network problem. We provide specific code online for the triangle configuration, as well as for the standard Bell scenario, which has inputs as well (see Section “Code availability”). After finishing this work we realized that related ideas have been investigated in causal inference, though in a different context, where network architectures and weights are simultaneously optimized to reproduce a given target distribution over continuous outputs, as opposed to the discrete ones examined here^{35}. In addition, due to the strict constraint of having a single fixed causal structure, we evaluate results differently, by examining transitions in compatibility with the causal structure at hand, as we will soon demonstrate.

### Evaluating the output of the neural network

Given a target distribution *p*_{t}, the neural network provides an explicit model for a distribution *p*_{M}, which is, according to the machine, the closest local distribution to *p*_{t}. The distribution *p*_{M} is guaranteed to be from the local set by construction. When can we confidently deduce that the target distribution is local (i.e., if we see *p*_{t} ≈ *p*_{M}) or nonlocal (*p*_{t} ≠ *p*_{M})? At first sight the question is difficult: the neural network will almost never reproduce the target distribution exactly, because *p*_{M} is evaluated by sampling the model a finite number of times and because the learning techniques do not guarantee convergence to the global optimum. A first approach could be to define some confidence level for the similarity between *p*_{M} and *p*_{t}. This would, however, be somewhat arbitrary and would give only limited insight into the problem. A central notion in this work is instead to search for qualitative changes in the machine’s behavior when transitioning from the local set to the nonlocal one. We believe this to be much more robust and informative for deciding nonlocality than a confidence-level approach.

In order to find such a “phase transition”, we typically define a family of target distributions *p*_{t}(*v*) by taking a distribution which is believed to be nonlocal and by adding some noise controlled by the parameter *v*, with *p*_{t}(*v* = 0) being the completely noisy (local) distribution and *p*_{t}(*v* = 1) being the noiseless, “most nonlocal” one. By adding noise in a physically meaningful way we guarantee that at some parameter value, \(v^{\ast}\), we will enter the local set and stay in it for *v* < \(v^{\ast}\). For each noisy target distribution we retrain the neural network and obtain a family of learned distributions *p*_{M}(*v*) (see Fig. 2 for an illustration). Observing a qualitative change in the machine’s performance at some point is an indication of traversing the local set’s boundary. In this work we extract information from the learned model through

1. the distance between the target and the learned distribution,

   $$d(p_{\text{t}},p_{\text{M}})=\sqrt{\sum_{abc}\left[p_{\text{t}}(abc)-p_{\text{M}}(abc)\right]^{2}},$$

2. the learned distributions *p*_{M}(*v*), in particular by examining the local response functions of Alice, Bob, and Charlie.

Observing a clear liftoff of the distance *d*_{M}(*v*) ≔ *d*(*p*_{t}(*v*), *p*_{M}(*v*)) at some point is a signal that we are leaving the local set. Somewhat surprisingly, we can deduce even more from the distance *d*_{M}(*v*). Though the shape of the local set and the threshold value \(v^{\ast}\) are unknown, in some cases, under mild assumptions, we can extract from *d*_{M}(*v*) not only \(v^{\ast}\), but also the angle at which the curve *p*_{t}(*v*) exits the local set, and in addition gain confidence in the proper functioning of the algorithm. To do this, let us first assume that the local set is flat near *p*_{t}(\(v^{\ast}\)) and that *p*_{t}(*v*) is a straight curve. Then the true distance from the local set is

$$d(v)=\left\{\begin{array}{ll}0 & \text{if } v\le v^{\ast},\\ d\left(p_{\text{t}}(v),p_{\text{t}}(v^{\ast})\right)\sin(\theta) & \text{if } v>v^{\ast},\end{array}\right.\tag{4}$$

where *θ* is the angle between the curve *p*_{t}(*v*) and the local set’s hyperplane (see Fig. 2 for an illustration). In the more general setting Eq. (4) is still approximately correct even for *v* > \(v^{\ast}\), if *p*_{t}(*v*) is almost straight and the local set is almost flat near \(v^{\ast}\). We denote this analytic approximation of the true distance from the local set as \(\hat{d}(v)\). We use Eq. (4) to calculate it, but keep in mind that it is only an approximation. After having trained the machine, we fit \(\hat{d}(v)\) to *d*_{M}(*v*) by adjusting \(v^{\ast}\) and *θ*. Finding a good fit of the two distance functions gives us strong evidence that the curve *p*_{t}(*v*) indeed exits the local set at \(\hat{v}^{\ast}\) at an angle \(\hat{\theta}\), where the hat signifies the obtained estimates. Acquiring such a fit gives us more confidence in the machine, since now we do not just observe a qualitative phase transition, but can also model it quantitatively with just two free parameters, \(v^{\ast}\) and *θ*.
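The two-parameter fit can be illustrated with a brute-force least-squares search. This sketch makes several assumptions that are ours rather than the paper's: the target family is taken to be exactly straight, so that *d*(*p*_{t}(*v*), *p*_{t}(\(v^{\ast}\))) = `speed` · (*v* − \(v^{\ast}\)) with a known constant `speed`; the fit is a simple grid search; and the "learned" distances are synthetic data generated from Eq. (4) itself.

```python
import numpy as np

def euclidean_distance(p, q):
    """Distance d(p, q) between two distributions given as arrays of equal shape."""
    return float(np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2)))

def d_hat(v, v_star, theta, speed):
    """Eq. (4) for a straight curve, where d(p_t(v), p_t(v*)) = speed * (v - v*)."""
    return np.where(v > v_star, speed * (v - v_star) * np.sin(theta), 0.0)

def fit_exit_point(v, d_m, speed, n_grid=101):
    """Grid-search least squares over (v*, theta); an assumed, deliberately simple fit."""
    best = (np.inf, None, None)
    for v_star in np.linspace(v.min(), v.max(), n_grid):
        for theta in np.linspace(0.0, np.pi / 2, n_grid):
            err = np.sum((d_hat(v, v_star, theta, speed) - d_m) ** 2)
            if err < best[0]:
                best = (err, v_star, theta)
    return best[1], best[2]

# Synthetic stand-in for the learned distances d_M(v)
v = np.linspace(0.0, 1.0, 41)
speed = 0.5  # hypothetical value of ||p_t(1) - p_t(0)||
d_m = d_hat(v, 0.7, np.deg2rad(45.0), speed)

v_star_hat, theta_hat = fit_exit_point(v, d_m, speed)
```

On real data one would replace `d_m` by the distances measured after training and inspect the residual of the fit, since a poor fit indicates that the flatness or straightness assumptions do not hold.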

In addition, we get information out of the learned model by looking at the local responses of Alice, Bob and Charlie. Recall that the shared random variables, the sources, are uniformly distributed, hence the response functions encode the whole problem. We can visualize, for example, Bob’s response function *p*_{B}(*b*∣*α*, *γ*) by sampling several thousand points (*α*, *γ*) ∈ [0, 1]^{2}. In order to capture the stochastic nature of the responses, for each pair *α*, *γ* we sample from *p*_{B}(*b*∣*α*, *γ*) 30 times and color-code the four possible results *b* as red, blue, green, and yellow. By scatter plotting these points with a finite opacity we gain an impression of the response function, such as in Fig. 3b.
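The sampling step behind such a plot can be sketched as follows. The response function `toy_response` is a hypothetical stand-in for a trained perceptron, and the function name and defaults are our own; only the 30 repetitions per input pair and the four colors follow the description above.

```python
import numpy as np

def sample_response_points(response_fn, n_points=2000, n_repeats=30, seed=0):
    """Draw input pairs uniformly and sample each response n_repeats times."""
    rng = np.random.default_rng(seed)
    inputs = rng.uniform(size=(n_points, 2))
    probs = response_fn(inputs)                   # shape (n_points, 4)
    cdf = np.cumsum(probs, axis=1)
    u = rng.uniform(size=(n_points, n_repeats, 1))
    outputs = (u > cdf[:, None, :]).sum(axis=2)   # inverse-CDF sampling of the outcome
    outputs = np.minimum(outputs, 3)              # guard against rounding in the cumulative sum
    xy = np.repeat(inputs, n_repeats, axis=0)     # one row per sampled outcome
    return xy, outputs.reshape(-1)

def toy_response(x):
    """Hypothetical response: mostly deterministic in each quadrant, with 10% white noise."""
    idx = 2 * (x[:, 0] > 0.5) + (x[:, 1] > 0.5)
    return 0.9 * np.eye(4)[idx] + 0.1 / 4

xy, labels = sample_response_points(toy_response)
colors = np.array(["red", "blue", "green", "yellow"])[labels]
# With matplotlib available, a call such as
# plt.scatter(xy[:, 0], xy[:, 1], c=colors, alpha=0.1, s=4)
# would then give a figure of the kind described above.
```

Deterministic regions appear as solid color patches, whereas stochastic regions show up as overlapping points of several colors.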

These figures are already interesting in themselves and can guide us towards analytic guesses of the ideal response functions. However, they can also be used to verify our results in some special cases. For example, if *θ* = 90^{∘} and the local set is sufficiently flat, then the response functions should be the same for all *v* ≥ \(v^{\ast}\), as it is in Fig. 3b. On the other hand, if *θ* < 90^{∘} then we are in a scenario similar to that of panel (a) in Fig. 2 and the response functions should differ for different values of *v*. Finally, note that for any target distribution there is no unique closest local response function, so the visualized response functions could vary greatly. As a result, in order to have visually more similar response functions and to smooth the results, after running the algorithm for the full range of *v*, for each *v* we check whether the models at other \(v^{\prime}\) values perform better for *p*_{t}(*v*) (after allowing for small adjustments) and update the model for *v* accordingly.
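The model-swapping step can be sketched as a simple nearest-model lookup. This is a simplification of the procedure described above: it only compares the already learned distributions against every target and omits the "small adjustments" (i.e., any further retraining). The function name and the toy data are hypothetical.

```python
import numpy as np

def reassign_models(models, targets):
    """For each target p_t(v_j), return the index of the closest learned p_M(v_i) and its distance."""
    models = np.asarray(models, dtype=float)    # shape (K, 4, 4, 4)
    targets = np.asarray(targets, dtype=float)  # shape (K, 4, 4, 4)
    diff = models[:, None] - targets[None, :]
    dist = np.sqrt((diff ** 2).reshape(len(models), len(targets), -1).sum(axis=2))
    best = dist.argmin(axis=0)                  # best model index for every target
    return best, dist[best, np.arange(len(targets))]

# Toy family: noisy mixtures of a point distribution with the uniform one
uniform = np.full((4, 4, 4), 1.0 / 64)
delta = np.zeros((4, 4, 4))
delta[0, 0, 0] = 1.0
vs = [0.0, 0.2, 0.5, 0.7, 1.0]
targets = np.array([(1 - v) * uniform + v * delta for v in vs])

models = targets.copy()
models[2] = uniform  # pretend the run at v = 0.5 converged poorly

best, best_dist = reassign_models(models, targets)
```

In this toy example the poorly converged model at *v* = 0.5 is replaced by the neighboring model learned at *v* = 0.7, which is closer to the target.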

### Fritz distribution

In order to benchmark the method, we first consider the quantum distribution proposed by Fritz^{5}, which can be viewed as a Bell scenario wrapped into the triangle topology, and whose nonlocality is thus well understood. Alice and Bob share a singlet, i.e., \(\left|\psi\right\rangle_{\text{AB}}=\left|\psi^{-}\right\rangle=\frac{1}{\sqrt{2}}\left(\left|01\right\rangle-\left|10\right\rangle\right)\), while Bob and Charlie share either a maximally entangled or a classically correlated state, such as \(\rho_{\text{BC}}=\frac{1}{2}\left(\left|00\right\rangle\left\langle 00\right|+\left|11\right\rangle\left\langle 11\right|\right)\), and similarly for *ρ*_{AC}. Alice measures the state she shares with Charlie in the computational basis and, depending on this random bit, she measures either the Pauli *X* or *Z* observable on her half of the singlet. Bob does the same with the state he shares with Charlie and measures either \(\frac{X+Z}{\sqrt{2}}\) or \(\frac{X-Z}{\sqrt{2}}\). They then both output the measurement result and the bit which they used to decide the measurement. Charlie measures both sources in the computational basis and announces the two bits. As a noise model we introduce a finite visibility for the singlet shared by Alice and Bob, thus we examine a Werner state,

$$\rho(v)=v\left|\psi^{-}\right\rangle\left\langle\psi^{-}\right|+(1-v)\frac{\mathbb{I}}{4},\tag{5}$$

where \(\mathbb{I}/4\) denotes the maximally mixed state of two qubits. For such a state we expect to find a local model below the threshold of \(v^{\ast}=\frac{1}{\sqrt{2}}\).
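One way to see where the value 1/√2 comes from is to evaluate the CHSH expression for the Werner state (5) with the measurement settings listed above. The following check is our own illustration and not part of the paper's method. Note that a CHSH value above 2 only certifies nonlocality for *v* > 1/√2; it does not by itself prove that a local model exists below that threshold.

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])   # Pauli X
Z = np.array([[1.0, 0.0], [0.0, -1.0]])  # Pauli Z

def werner_state(v):
    """Eq. (5): singlet with visibility v mixed with white noise."""
    psi_minus = np.array([0.0, 1.0, -1.0, 0.0]) / np.sqrt(2)  # (|01> - |10>)/sqrt(2)
    return v * np.outer(psi_minus, psi_minus) + (1 - v) * np.eye(4) / 4

def chsh_value(v):
    """Absolute CHSH value for Alice measuring X or Z and Bob measuring (X +/- Z)/sqrt(2)."""
    rho = werner_state(v)
    b0, b1 = (X + Z) / np.sqrt(2), (X - Z) / np.sqrt(2)

    def corr(a, b):
        return np.trace(rho @ np.kron(a, b))

    return abs(corr(X, b0) + corr(X, b1) + corr(Z, b0) - corr(Z, b1))

s_noiseless = chsh_value(1.0)
s_threshold = chsh_value(1 / np.sqrt(2))
```

The CHSH value scales linearly with the visibility, so it crosses the local bound of 2 exactly at *v* = 1/√2.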

In Fig. 3a we plot the learned *d*_{M}(*v*) and analytic \(\hat{d}(v)\) distances discussed previously, for \(\hat{\theta}=90^{\circ}\) and \(\hat{v}^{\ast}=\frac{1}{\sqrt{2}}\). The coincidence of the two curves is already good evidence that the machine finds the closest local distributions to the target distributions. Upon examining the response functions of Alice, Bob and Charlie, in Fig. 3b, we see that they do not change above \(\hat{v}^{\ast}\), which means that the machine finds the same distributions for target distributions outside the local set. This is in line with our expectations. Due to the connection with the standard Bell scenario (where the local set is actually a polytope), we believe the curve *p*_{t}(*v*) exits the local set perpendicularly, as depicted in panel (b) of Fig. 2. These results confirm that our algorithm functions well.

### Elegant distribution

Next we turn our attention to a more demanding distribution: neither its locality nor its nonlocality has been proven to date, and it lacks a proper numerical analysis due to the intractability of conventional optimization over local models (see the “Discussion” section). Compared to the Fritz distribution, it is also more native to the triangle structure, as it combines entangled states and entangled measurements. We examine the Elegant distribution, which is conjectured in ref. ^{31} to be outside the local set. The three parties share singlets and each performs a measurement on their two qubits, the eigenstates of which are

$$\left|\Phi_{j}\right\rangle=\sqrt{\frac{3}{2}}\left|m_{j},-m_{j}\right\rangle+i\frac{\sqrt{3}-1}{2}\left|\psi^{-}\right\rangle,\tag{6}$$

where the \(\left|m_{j}\right\rangle\) are the pure qubit states with unit-length Bloch vectors pointing at the four vertices of the tetrahedron for *j* = 1, 2, 3, 4, and \(\left|-m_{j}\right\rangle\) are the same for the inverted tetrahedron.

We examine two noise models: one at the sources and one at the detectors. First we introduce a visibility to the singlets such that all three shared quantum states have the form (5). Second, we examine detector noise, in which each detector fails independently with probability 1 − *v* and gives a random output as a result. This is equivalent to adding white noise to the quantum measurements performed by the parties, i.e., the positive operator-valued measure elements are \(\mathcal{M}_{j}=v\left|\Phi_{j}\right\rangle\left\langle\Phi_{j}\right|+(1-v)\frac{\mathbb{I}}{4}\).
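At the level of the output distribution, the detector noise model acts independently on each party's outcome: with probability 1 − *v* the outcome is replaced by a uniformly random one. A minimal sketch of this map, applied to an arbitrary 4 × 4 × 4 distribution, is given below; the function name is ours, and the point-like input distribution is only a toy example.

```python
import numpy as np

def add_detector_noise(p, v):
    """Apply independent detector noise with efficiency v to each party's output."""
    p = np.asarray(p, dtype=float)
    for axis in range(p.ndim):
        # With probability 1 - v this party's output is uniform and independent of the rest:
        # the mean over the axis equals (1/4) times the marginal over that party's output.
        p = v * p + (1 - v) * p.mean(axis=axis, keepdims=True)
    return p

# Toy example: a distribution concentrated on a = b = c = 0
p_clean = np.zeros((4, 4, 4))
p_clean[0, 0, 0] = 1.0
p_noisy = add_detector_noise(p_clean, 0.8)
```

For *v* = 1 the distribution is unchanged, while for *v* = 0 every output is uniformly random, which matches the two ends of the noisy family *p*_{t}(*v*).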

For both noise models we see a transition in the distance *d*_{M}(*v*), depicted in Fig. 4a, giving us strong evidence that the conjectured distribution is indeed nonlocal. Through this examination we gain insight into the noise robustness of the Elegant distribution as well. It seems that for visibilities above \(\hat{v}^{\ast}\approx 0.80\), or for detector efficiency above \(\hat{v}^{\ast}\approx 0.86\), the distribution is still nonlocal. The curves exit the local set at approximately \(\hat{\theta}\approx 50^{\circ}\) and \(\hat{\theta}\approx 60^{\circ}\), respectively. Note that for both distribution families, by looking at the unit tangent vector, one can analytically verify that the curves are almost straight for values of *v* above the observed threshold. This gives us even more confidence that it is legitimate to use the analytic distance \(\hat{d}(v)\) as a reference (see Eq. (4)). In Fig. 4b, we illustrate how the response function of Charlie changes when adding detector noise. It is peculiar how the machine often prefers horizontal and vertical separations of the latent variable space, with very clean, deterministic responses, similarly to how we would do it intuitively, especially for noiseless target distributions.

### Renou et al. distribution

The authors of ref. ^{20} recently introduced the first distribution in the triangle scenario which is not directly inspired by the Bell scenario and is proven to be nonlocal. To generate the distribution, take all three shared states to be the entangled states \(\left|\phi^{+}\right\rangle=\frac{1}{\sqrt{2}}\left(\left|00\right\rangle+\left|11\right\rangle\right)\). Each party performs the same measurement, characterized by a single parameter \(u\in[\frac{1}{\sqrt{2}},1]\), with eigenstates \(\left|01\right\rangle\), \(\left|10\right\rangle\), \(u\left|00\right\rangle+\sqrt{1-u^{2}}\left|11\right\rangle\), and \(\sqrt{1-u^{2}}\left|00\right\rangle-u\left|11\right\rangle\). The authors prove that for \(u_{\max}^{2}<u^{2}<1\) this distribution is nonlocal, where \(u_{\max}^{2}\approx 0.785\), and also show that there exist local models for \(u^{2}\in\{0.5,u_{\max}^{2},1\}\). Though they argue that there must be some noise tolerance of the distribution, they lack a proper estimation of it.
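For concreteness, the noiseless target distribution of this family can be computed by contracting the three shared states with the measurement eigenstates. The sketch below is our own and rests on an assumed qubit ordering: each party's first qubit comes from one neighboring source and its second qubit from the other, arranged cyclically around the triangle. Swapping this convention exchanges the roles of the \(\left|01\right\rangle\) and \(\left|10\right\rangle\) outcomes.

```python
import numpy as np

def renou_distribution(u_sq):
    """Triangle distribution p(abc) for three |phi+> states and the u-parametrized measurement."""
    u = np.sqrt(u_sq)
    w = np.sqrt(1.0 - u_sq)
    # Eigenstates as real amplitude tensors basis[x, i, j] = <ij|x>
    basis = np.zeros((4, 2, 2))
    basis[0, 0, 1] = 1.0                    # |01>
    basis[1, 1, 0] = 1.0                    # |10>
    basis[2, 0, 0], basis[2, 1, 1] = u, w   # u|00> + sqrt(1 - u^2)|11>
    basis[3, 0, 0], basis[3, 1, 1] = w, -u  # sqrt(1 - u^2)|00> - u|11>
    phi = np.eye(2) / np.sqrt(2)            # amplitudes of |phi+>
    # Sources emit qubit pairs (a, b), (c, d), (e, f); the parties hold (b, c), (d, e), (f, a).
    amp = np.einsum("xbc,yde,zfa,ab,cd,ef->xyz", basis, basis, basis, phi, phi, phi)
    return amp ** 2  # all amplitudes are real, so no complex conjugation is needed

p_target = renou_distribution(0.85)
```

The resulting array `p_target` can serve as the target *p*_{t} for a given value of *u*^{2}.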

First we examine these distributions as a function of *u*^{2}, without any added noise. The results are depicted in Fig. 5a. To start with, note how the distances are numerically much smaller than in the previous examples, i.e., the machine finds distributions which are extremely close to the targets. See the inset in Fig. 5a for examples which exhibit how close the learned distributions are to the targets even for the points which have large distances (*u*^{2} = 0.63, 0.85). We observe, consistently with analytic findings, that for \(u_{\max}^{2}<u^{2}<1\), the machine finds a nonzero distance from the local set. It also recovers the local models at \(u^{2}\in\{0.5,u_{\max}^{2},1\}\), with minor difficulties around \(u_{\max}^{2}\). Astonishingly, the machine finds that for some values of \(0.5<u^{2}<u_{\max}^{2}\), the distance from the local set is even larger than in the provably nonlocal regime. This is a surprising finding, as one might naively assume that between 0.5 and \(u_{\max}^{2}\) distributions are local, especially when one looks at the nonlocality proof used in the other regime. However, this is not what the machine finds. Instead it gives us a nontrivial conjecture about nonlocality in a new range of parameters *u*^{2}. Though extracting precise boundaries in terms of *u*^{2} for the new nonlocal regime would be difficult from the results in Fig. 5a alone, they strongly suggest that there is some nonlocality in this regime.

Finally, we have a look at the noise robustness of the distribution with *u*^{2} = 0.85, which is approximately the most distant distribution from the local set within the provably nonlocal regime. For the detector efficiency and visibility noise models we recover \(\hat{v}^{\ast}\approx 0.91\) and \(\hat{v}^{\ast}\approx 0.89\), respectively, and \(\hat{\theta}\approx 6^{\circ}\) for both. Note that these estimates are much cruder than those obtained for the Elegant distributions, primarily due to the target distributions being so much closer to the local set and the neural network getting stuck in local optima. This increases the variations in independent runs of the learning algorithm. For example, in Fig. 5a, at *u*^{2} = 0.85 the distance is about 0.0034, whereas in Fig. 5b, in an independent run, the distance for this same point (*v* = 1) is around 0.0055. The absolute difference is small; however, the relative changes can have an impact on extracting noise thresholds. Given that the local set is so close to the target distributions (exemplified in the inset in Fig. 5a), it is easily possible that the noise tolerance is smaller than that obtained here.