PDEs: Would you take a cup of tea with your Plateau?
Explanation of the post title
In French, the word plateau has three different meanings:
- an area of high flat land; a mountain with a wide, flat top. (same definition as in English, from the Cambridge dictionary),
- the equivalent of the English words stage or set, somewhere where a show happens,
- the equivalent of the English words tray or platter, such as in “cheese platter”.
The title of this blog post uses the third definition above, using the fact that tea is often served on a plateau (on a platter or a tray in English).
This is the seventh post of the series about partial differential equations (PDEs). In the previous post, I wrote about one of the issues with variational algorithms: the fact that they rely on global optimisation problems we cannot solve efficiently. In this post, I want to introduce you to another major issue of variational quantum algorithms: Barren Plateaus.
A few reminders
Before actually speaking about Barren Plateaus, let’s do a quick recap of what a variational quantum algorithm (VQA) is.
A VQA is a framework anyone can use to rephrase a very generic problem as an optimisation problem. Just like the problem
is there an input $x$ such that the function $f$ outputs $f(x) = 3$?
can be rephrased as
is the minimum of $\vert f(x) - 3\vert$ close to $0$?
VQAs are simply another way of looking at the same problem, rephrasing it within an optimisation framework. Reformulating your problem as a VQA lets you focus on two steps:
- correctly rephrasing your original problem within an optimisation framework,
- efficiently optimising the resulting optimisation problem to find the solution to your original problem.
It turns out that the first point listed above is crucial: some theoretical results show that a reformulation that fails to satisfy certain conditions will eventually be impossible to optimise.
The last sentence should not surprise you: a bad modelling will always impact your ability to numerically solve the problem. See the graph from my IEEE P3329 Quantum Energy Initiative talk, re-included below.
Because the “Modelling” step comes before the “Algorithm”, “Implementation” and “Result” steps, it has an impact on all of them.
Even with the best algorithms and implementations, a bad modelling will lead to a suboptimal result.
In today’s blog post, I am writing about the theoretical results that condition the “optimisability” of a VQA.
What are Barren Plateaus?
A visual example
Let’s take the simple setup that has been used in Cost function dependent barren plateaus in shallow parametrized quantum circuits:
- Our ansatz is composed of $R_x$ gates: $$ V(\vec{\theta}) = \bigotimes_{j=1}^n e^{-i \theta_j \sigma_x^j / 2}, $$
- We want to find the set of angles $\vec{\theta}$ that makes $V\left(\vec{\theta}\right)$ as close to the identity as possible, which translates into minimising the cost $$C_G(\vec{\theta}) = 1 - \prod_{j=1}^n \cos^2\left(\frac{\theta_j}{2}\right).$$
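To make this concrete, here is a minimal NumPy sketch of the cost $C_G$ above, using only the closed-form expression (no quantum simulation). The function name `cost_global` is mine, not from the paper.

```python
import numpy as np

def cost_global(theta: np.ndarray) -> float:
    """Closed-form global cost C_G(theta) = 1 - prod_j cos^2(theta_j / 2)."""
    return 1.0 - np.prod(np.cos(theta / 2.0) ** 2)

n = 4
print(cost_global(np.zeros(n)))                          # 0.0: V(theta) is exactly the identity
print(cost_global(np.full(n, np.pi)))                    # 1.0: worst possible value
print(cost_global(np.random.uniform(-np.pi, np.pi, n)))  # something in between
```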
Now, before talking about Barren Plateaus, let’s plot the cost function $C_G$ and see if we can understand a few things. The two plots below show 2-dimensional cross-sections of the cost function $C_G$ for a number of qubits $n\in [2, 60]$.
Visualisation 1
For this first visualisation, the 2-dimensional cross-section is obtained by setting each parameter with an odd index (i.e., $\theta_j$ for $j$ odd) to a given value $\alpha_\text{odd}$ and each even-indexed parameter to $\alpha_\text{even}$. This gives the 2-dimensional cost function $$ C_G^1(\alpha_\text{odd}, \alpha_\text{even}) = 1 - \cos^{2k}\left(\frac{\alpha_\text{even}}{2}\right) \cos^{2(n - k)}\left(\frac{\alpha_\text{odd}}{2}\right) $$ with $k = \left\lfloor \frac{n}{2} \right\rfloor$.
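Here is a small sketch of how the values behind this first visualisation can be generated (the grid is ready to be plotted with e.g. matplotlib); the helper name `cross_section_1` is mine.

```python
import numpy as np

def cross_section_1(alpha_odd, alpha_even, n):
    """C_G^1(alpha_odd, alpha_even) as defined above, with k = floor(n / 2)."""
    k = n // 2
    return 1.0 - np.cos(alpha_even / 2) ** (2 * k) * np.cos(alpha_odd / 2) ** (2 * (n - k))

angles = np.linspace(-np.pi, np.pi, 101)
A_odd, A_even = np.meshgrid(angles, angles)
for n in (2, 10, 30, 60):
    Z = cross_section_1(A_odd, A_even, n)
    # The average value of the landscape rushes towards 1 as n grows:
    # only a tiny region around (0, 0) stays visibly below 1.
    print(n, f"min={Z.min():.3f}", f"mean={Z.mean():.3f}")
```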
Visualisation 2
Another way to visualise $C_G$ as a 2-dimensional cross-section is to fix all its parameters but 2 to random values. That gives the cost function $$ C_G^2(\vec{\theta}) = 1 - \gamma_{n-2} \cos^{2}\left(\frac{\beta_0}{2}\right) \cos^{2}\left(\frac{\beta_1}{2}\right) $$ with:
- $\beta_0$ and $\beta_1$ the two varying parameters of the cost function,
- $\gamma_{n-2}$ the constant resulting from the product of the $n - 2$ cosine terms that are all constant because their $\theta_i$ parameters are fixed.
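The second cross-section can be sketched the same way; note how the random constant $\gamma_{n-2}$ alone is enough to squash the whole slice towards $1$ (function and variable names are mine).

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_section_2(beta_0, beta_1, fixed_thetas):
    """C_G^2(beta_0, beta_1) with the other n - 2 parameters frozen at `fixed_thetas`."""
    gamma = np.prod(np.cos(fixed_thetas / 2) ** 2)  # the constant gamma_{n-2}
    return 1.0 - gamma * np.cos(beta_0 / 2) ** 2 * np.cos(beta_1 / 2) ** 2

for n in (2, 10, 30, 60):
    fixed = rng.uniform(-np.pi, np.pi, n - 2)
    # The best reachable value on this slice is 1 - gamma_{n-2},
    # which tends to 1 exponentially fast with n.
    print(n, f"best value on the slice: {cross_section_2(0.0, 0.0, fixed):.6f}")
```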
Analysis of the visualisations
Let’s take a moment to understand the two visualisations above.
The first visualisation shows that, when the number of qubits grows, the cost landscape becomes flat nearly everywhere, except very close to the global minimum. In fact, what looks flat on the plot above is not exactly flat: the gradient is not exactly $0$, but it is exponentially close to $0$ in the number of qubits $n$.
The second visualisation shows the same exponential flattening towards $1$, but this time the global minimum is also flattened.
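To quantify the “exponentially close to $0$” claim, we can write down the gradient of $C_G$ analytically, $\frac{\partial C_G}{\partial \theta_k} = \frac{1}{2}\sin(\theta_k)\prod_{j \neq k}\cos^2\left(\frac{\theta_j}{2}\right)$, and look at its typical magnitude at random parameters. A rough sketch (function names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_cost_global(theta):
    """Analytic gradient: dC_G/dtheta_k = 0.5 * sin(theta_k) * prod_{j != k} cos^2(theta_j / 2)."""
    c2 = np.cos(theta / 2) ** 2
    leave_one_out = np.array([np.prod(np.delete(c2, k)) for k in range(theta.size)])
    return 0.5 * np.sin(theta) * leave_one_out

for n in (2, 10, 30, 60):
    thetas = rng.uniform(-np.pi, np.pi, (200, n))
    typical = np.mean([np.abs(grad_cost_global(t)).max() for t in thetas])
    # The typical gradient magnitude shrinks roughly like 2^-(n-1):
    # "flat" everywhere except very close to the global minimum.
    print(n, typical)
```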
You might think this is fine. In both situations, let’s just use gradient descent with a large enough step size, and everything will be alright, right?
No, you cannot use gradient descent! To use gradient descent, you need a good-enough estimate of the actual gradient. And guess how much that would cost?
It would cost an exponential number of shots! Approximating the gradient numerically requires you to estimate a quantity that is exponentially small, and to do so with a precision that must itself be exponentially small if you want anything but noise out of it.
If the validity of the previous sentence is not obvious to you, imagine that the quantity you want to estimate is close to $2^{-n}$. Any precision $\epsilon$ that does not also decrease exponentially with $n$ will eventually become larger than the quantity you are trying to estimate.
Let’s take a (very) small but constant precision of $\epsilon = 10^{-39}$ (see the bottom of this note for explanations on why this particular number has been picked).
For $n = \left\lceil - \log_2\left( 10^{-39} \right) \right\rceil = 130$, the quantity you want to estimate is about the same size as your precision. For any $n > 130$, the precision $\epsilon$ is not sufficient to obtain a meaningful estimate.
This argument can be translated to any precision $\epsilon (n)$ that scales sub-exponentially with $n$.
And because estimating these exponentially vanishing quantities is done by sampling from the output distribution of a quantum circuit, which has a precision in $O\left(N^{-1/2}\right)$, your number of shots $N$ has to grow exponentially.
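Putting the two ingredients together (a quantity of size $\approx 2^{-n}$ and a shot-noise precision in $O\left(N^{-1/2}\right)$), a back-of-the-envelope sketch of the required shot count looks like this; the factor of $10$ is an arbitrary safety margin I picked for illustration:

```python
def shots_needed(n_qubits: int, safety_factor: float = 10.0) -> float:
    """Rough number of shots needed to resolve a quantity of size ~2^-n from shot noise."""
    target_precision = 2.0 ** (-n_qubits) / safety_factor  # must beat the quantity we estimate
    return 1.0 / target_precision ** 2                     # precision ~ 1/sqrt(N)  =>  N ~ 1/precision^2

for n in (10, 30, 60, 130):
    print(n, f"{shots_needed(n):.2e} shots")
# n = 130 already asks for ~1e80 shots, a number comparable to the
# estimated number of atoms in the observable universe.
```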
More in-depth definition
What you witnessed in the last few sections is… the Barren Plateau phenomenon!
I will quickly describe Barren Plateaus here, but I highly encourage you to go and read the beginning of Cost function dependent barren plateaus in shallow parametrized quantum circuits. It is a research paper, but:
- it is freely accessible,
- the beginning should be easier to read than what you already processed here,
- its authors are real experts on Barren Plateaus, way more competent than me on that subject.
The Barren Plateau is a phenomenon that arises in some variational quantum algorithms when the number of qubits grows. It is an asymptotic, theoretical result. These two words are really important here:
- asymptotic means that the phenomenon will eventually appear, but there is no guarantee on when. It might start to appear on $3$-qubit problems, or on $10^{10^{10^{10}}}$-qubit problems (yes, that’s a lot of qubits).
- theoretical means that, in the absence of errors in the proof, the result will hold until the end of time. It is not an observation made from numerical experiments; it is much stronger than that.
The phenomenon affects the cost function used to rephrase your original problem as a variational algorithm. When some conditions are met, this cost function is guaranteed to eventually concentrate exponentially fast around its expectation value. In more mathematical terms, the cost function will eventually have an exponentially small variance: $$ \mathrm{Var}_{\vec{\theta}}\left[C\left(\vec{\theta}\right)\right] \in O\left(b^{-n}\right) $$ for some constant $b > 1$.
The above definition comes from another research paper: Showcasing a Barren Plateau Theory Beyond the Dynamical Lie Algebra. This paper is, to my knowledge, the latest core result on Barren Plateaus.
I will come back to that paper later; for now, I just wanted to give credit for the definition.
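For the toy cost $C_G$ used earlier, this concentration can be checked directly. A quick calculation (assuming each $\theta_j$ is drawn independently and uniformly in $[-\pi, \pi]$) gives $\mathrm{Var}\left[C_G\right] = \left(\frac{3}{8}\right)^n - \left(\frac{1}{4}\right)^n$, which indeed decays exponentially. A Monte-Carlo sketch (names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def cost_global(theta):
    """C_G(theta) = 1 - prod_j cos^2(theta_j / 2), vectorised over a batch of parameter vectors."""
    return 1.0 - np.prod(np.cos(theta / 2) ** 2, axis=-1)

for n in (2, 4, 8, 12, 16):
    samples = cost_global(rng.uniform(-np.pi, np.pi, (200_000, n)))
    exact = (3 / 8) ** n - (1 / 4) ** n   # closed-form variance for this particular C_G
    print(n, f"empirical={samples.var():.2e}", f"exact={exact:.2e}")
# The Monte-Carlo estimate gets noisier as n grows, which is itself a symptom
# of the concentration: almost every sample is extremely close to 1.
```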
Now you know what Barren Plateaus are, but you are still missing a core piece of the equation: under which conditions does the Barren Plateau phenomenon appear, and what can you do to avoid it?
Answering the second question is trivial: just make sure that the conditions for Barren Plateau to appear are not met, and you will be fine. But then, what are these conditions?
The conditions to have (or avoid) Barren Plateaus
It turns out that having a complete characterisation of the conditions that lead to the Barren Plateau phenomenon is quite difficult. The quantum computing community started by providing theoretical results on specific cases, for example in the seminal paper Barren plateaus in quantum neural network training landscapes. Then, researchers identified some of the causes of Barren Plateaus, for example in Cost function dependent barren plateaus in shallow parametrized quantum circuits, Entanglement Induced Barren Plateaus or Noise-induced barren plateaus in variational quantum algorithms.
But up until September 2023, a unified theory of Barren Plateaus was still missing. Then, in September, within a 3-day window, two research papers introducing a new framework based on Lie algebras (The Adjoint Is All You Need: Characterizing Barren Plateaus in Quantum Ansätze and A Unified Theory of Barren Plateaus for Deep Parametrized Quantum Circuits) were published.
Finally, the latest results I am aware of are laid out in the very nice paper Showcasing a Barren Plateau Theory Beyond the Dynamical Lie Algebra. The main author, Marco Cerezo, also made a Twitter (now X) thread discussing its results. The full thread can be accessed by following this link.
In a few years, research went from
Any of the following may eventually lead to the Barren Plateau phenomenon: hardware noise, cost functions obtained from global observables, polynomial-depth ansätze, highly entangled trial states.
to the question
Does variational quantum algorithm trainability imply classical simulability?
that can be found in the abstract of the very last paper cited:
Finally, while parameterized matchgate circuits are not efficiently simulable in general, our results suggest that the structure allowing for trainability may also lead to classical simulability.
From the abstract of Showcasing a Barren Plateau Theory Beyond the Dynamical Lie Algebra
To rephrase the above statements: current research is questioning whether variational quantum algorithms can ever exhibit any kind of quantum advantage.
If it turns out that any variational quantum algorithm that can be trained at a reasonable cost (not exhibiting Barren Plateau is a necessary condition for that) is classically simulable, then variational quantum algorithms will be theoretically proven useless.
Barren Plateau and solvers of partial differential equations
Let’s come back to the main subject of this series of blog posts: partial differential equations! Is there any hope of solving a PDE using a variational quantum algorithm?
My personal point of view is no, for the reasons presented in this blog post, previous ones such as PDEs: A global optimum? Best I can do is local!, and ones to come in the future months.
But:
- as a human being I may be wrong,
- as a researcher, my personal point of view does not matter, only facts do.
For the above reasons, I also want to talk about a few features that could potentially-maybe-perhaps help variational quantum algorithms be meaningful to solve PDEs.
You need more restrictions, it’s for your own good, trust me
First and foremost, unlike the generic task of solving a system of linear equations, PDEs are very constrained problems. They provide a lot of a priori information that could be used to reduce the search space for variational quantum algorithms (i.e., smaller ansätze, which is a good first step to avoid Barren Plateaus):
- most of the time, the solution of a PDE has to satisfy some continuity conditions,
- as shown in PDEs: what a nice hypercube you have here!, PDEs: these (physical) bounds are made for solvin’ and PDEs: these (time) bounds are made for solvin’, PDEs need well-defined values at specific time-steps or physical locations, further restricting the potentially valid solution space,
- the solution of a given PDE is governed by the physical laws encoded in that PDE. In particular, this means that there are a lot of “physically invalid” solutions that could potentially-maybe-perhaps be avoided. This is what the UCCSD ansatz tries to achieve for quantum chemistry problems.
There is no evidence (again, that I am aware of) that the above restrictions are enough to avoid the Barren Plateau phenomenon, but restricting the size of the ansatz is a good first step towards potentially-maybe-perhaps avoiding Barren Plateaus.
Compute, measure, measure, measure, …
One of the strengths of variational quantum algorithms shows up once the algorithm is over: it is trivial to produce (quantum) copies of your computed solution! You only need to apply your ansatz, with the optimised weights, and the solution is encoded in the resulting quantum state!
This means that, unlike LSQ (Large-Scale Quantum) algorithms, variational quantum algorithms may eventually be cheaper for computing the value of an observable. Again, note the emphasised words:
- may, because we have no convergence proof for variational quantum algorithms, and if we cannot compute the solution even once, then the “advantage” does not exist,
- eventually, because the above situation might only happen at shot numbers so large that they do not even make sense on the time scale of the universe, e.g., $$N = 10\uparrow\uparrow 10 = 10^{10^{10^{10^{10^{10^{10^{10^{10^{10}}}}}}}}}$$ shots (see Knuth’s up-arrow notation for the notation).
Before ending this blog post, the above points made me think about one of my favourite notions in algorithmic complexity: galactic algorithms. It is also my favourite way to explain the limitations of the big-O notation for asymptotic complexity.
One example: the asymptotically fastest known algorithm for integer multiplication (as in $3 \times 5 = 15$) is a galactic algorithm. It uses $1729$-dimensional Fast Fourier Transforms to reach its asymptotic complexity. Have a look at this blog post from one of its authors.
The new algorithm is not really practical in its current form, because the proof given in our paper only works for ludicrously large numbers. Even if each digit was written on a hydrogen atom, there would not be nearly enough room available in the observable universe to write them down.
David Harvey, one of the authors of Integer multiplication in time $O\left(n \log n\right)$.
What’s next
I do not know yet what the next blog post will be about.
There are still at least $2$ posts left in the PDE series, but I also want to talk about implementing quantum error correction codes, and about software abstractions and their benefit/cost ratio.
I might also talk about my company and what I have been doing since its creation, but I am not ready to make that fully public yet so it will wait at least a few more months.
If you prefer any of the above subjects (my company excluded), please let me know in the comments!
If you want to be notified when each post is out, you can add me on LinkedIn, I will do a LinkedIn post for each new blog post.
Also, if this post piqued your interest or raised some questions in your mind, please feel free to send me a message 😉