Simulations are cheap and model the underlying state of the physical world, but they incur modeling errors.
Experiments provide ground truth, but they are expensive and yield only lossy, incomplete measurements.
We want to bridge the gap between the two to build better computational models of the physical world.
We propose ADAPO, an adversarial algorithm that aligns generative models of underlying states to datasets of experimental observables.
To our knowledge, this is the first algorithm that aligns a generative model to the full distributions of multiple noisy experimental observables, without requiring explicit knowledge of their joint structure.
Experimental measurements such as cryo-EM images are inherently noisy and capture only a partial view of the full molecular state. Below, cryo-EM micrographs of Trp-Cage (PDB ID: 2JOF) illustrate the challenge: each image is a noisy 2-D projection of the underlying 3-D atomic coordinates.
ADAPO takes these noisy, high-dimensional images and uses them to improve a generative model of the underlying atom positions. The base model, trained on classical MD simulations, oversamples the unfolded state. After alignment, ADAPO concentrates the distribution around the folded state reported experimentally in the PDB — having only seen the noisy cryo-EM images.
The generated structures correspondingly shift toward the folded conformations observed in the PDB.
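The forward model described above can be sketched in a few lines: each observation rotates the 3-D atomic coordinates, projects them to 2-D, and adds noise. This is a deliberately simplified illustration (orthographic projection of point coordinates, Gaussian pixel-free noise), not the actual cryo-EM image-formation model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(coords, rotation, noise_std=1.0):
    """Rotate 3-D atom positions, drop the z-axis, and add Gaussian noise.

    A toy stand-in for cryo-EM image formation: the measurement is a
    noisy 2-D projection of the underlying 3-D state.
    """
    rotated = coords @ rotation.T       # (n_atoms, 3)
    image_points = rotated[:, :2]       # orthographic projection to 2-D
    return image_points + noise_std * rng.standard_normal(image_points.shape)

# A random 3-D "structure" and a rotation about the z-axis.
coords = rng.standard_normal((20, 3))
angle = np.pi / 4
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle),  np.cos(angle), 0.0],
                [0.0,            0.0,           1.0]])

obs = project(coords, rot)
print(obs.shape)  # (20, 2): a lossy, noisy view of the 3-D state
```

The z-coordinate is discarded entirely, which is what makes each measurement an incomplete view: no single projection determines the 3-D structure.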
We begin with a base distribution $\mu_{\text{base}}$, obtained from simulated data (e.g., classical molecular dynamics). Our goal is to construct a distribution $\mu_\theta$ that remains close to $\mu_{\text{base}}$ while matching experimentally observed statistics.
Let $\nu(x)$ denote the unknown full-state distribution of the system, and let $o^{(i)}$ be the $i$-th experimental observable. Crucially, we do not observe or sample from the full distribution $\nu(x)$. Instead, we only have access to samples from the induced experimental distributions $o^{(i)}_\# \nu$: the pushforwards of $\nu$ through the observables $o^{(i)}$. These correspond to the distributions of measured quantities obtained experimentally, without access to the underlying full system states.
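In code, this data-access model looks like the following sketch: the learner can draw samples of each pushforward $o^{(i)}_\# \nu$, but the states $x \sim \nu$ stay hidden inside the sampling routine. The specific distributions and observables here are hypothetical placeholders, chosen only to make the access pattern concrete.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_observable(i, n):
    """Return n samples from the i-th pushforward o^{(i)}_# nu.

    The full state x ~ nu is drawn internally and never returned:
    only the measured quantity o^{(i)}(x) is exposed, mirroring how
    experiments give observables but not underlying states.
    """
    x = 2.0 + rng.standard_normal((n, 3))  # x ~ nu (hidden from the learner)
    observables = [
        lambda x: x[:, 0] + 0.1 * rng.standard_normal(n),  # noisy coordinate
        lambda x: np.linalg.norm(x, axis=1),               # scalar summary
    ]
    return observables[i](x)

y0 = sample_observable(0, 1000)  # samples of o^{(1)}_# nu
y1 = sample_observable(1, 1000)  # samples of o^{(2)}_# nu
print(y0.shape, y1.shape)
```

Note that the two observables are evaluated on independent draws of $x$, so their joint structure under $\nu$ is never observed, which is exactly the setting the constraint set $\mathcal{M}_o(\mathcal{X})$ formalizes.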
We formulate this task as a constrained optimization problem (visualized above):
$$ \begin{aligned} \arg \min_{\mu_\theta} \quad & D_{\mathrm{KL}}(\mu_\theta \,\|\, \mu_{\text{base}}) \\ \text{s.t.} \quad & \mu_\theta \in \mathcal{M}_o(\mathcal{X}), \end{aligned} $$ where the observable constraint set is $$ \mathcal{M}_o(\mathcal{X}) = \left\{ \mu \;:\; o^{(i)}_\# \mu = o^{(i)}_\# \nu \;\; \forall i \in I \right\}. $$ That is, $\mu_\theta$ must reproduce the experimental distributions for all observed quantities.
We solve this constrained problem by reformulating it as the following min–max optimization:
$$ \begin{aligned} \max_{\mu_\theta} \; \min_{(f^{(i)}) \in \mathrm{Lip}_1^{|I|}} \; \mathcal{L}\bigl(\mu_\theta, (f^{(i)}), \beta \bigr), \end{aligned} $$ $$ \begin{aligned} \mathcal{L}\bigl(\mu_\theta, (f^{(i)}), \beta \bigr) = {} & -D_{\mathrm{KL}}\bigl(\mu_\theta \,\|\, \mu_{\text{base}}\bigr) \\ & + \beta \sum_{i \in I} \left( \mathbb{E}_{o^{(i)}_\# \mu_\theta}[f^{(i)}] - \mathbb{E}_{o^{(i)}_\# \nu}[f^{(i)}] \right). \end{aligned} $$ Analogous to GAN training, we alternate between updating the critic functions $f^{(i)}$ to distinguish generated from experimental observables, and updating $\mu_\theta$ to satisfy these learned constraints while remaining close to the base distribution. We prove in the full paper that under this procedure the observable distributions of $\mu_\theta$ converge to the experimental targets.
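The alternation above can be illustrated with a minimal 1-D sketch. Everything here is a toy assumption, not the paper's implementation: the hidden distribution is $\nu = \mathcal{N}(2, 1)$, the base model is $\mathcal{N}(0, 1)$, the generative model is $\mu_\theta = \mathcal{N}(\theta, 1)$ so that $D_{\mathrm{KL}}(\mu_\theta \| \mu_{\text{base}}) = \theta^2/2$, the single observable is a noisy identity, and the 1-Lipschitz critic is a linear function $f(y) = w y$ kept Lipschitz by weight clipping (as in WGAN training).

```python
import numpy as np

rng = np.random.default_rng(0)

beta = 5.0            # constraint weight on the observable term
theta, w = 0.0, 0.0   # generator mean, critic slope (toy parameters)
lr_g, lr_c = 0.05, 0.01
batch = 256

def observe(x):
    """Noisy identity observable: o(x) = x + eps."""
    return x + 0.5 * rng.standard_normal(x.shape)

for _ in range(3000):
    obs_gen = observe(theta + rng.standard_normal(batch))  # samples of o_# mu_theta
    obs_exp = observe(2.0 + rng.standard_normal(batch))    # samples of o_# nu
    gap = obs_gen.mean() - obs_exp.mean()

    # Critic step: minimize L w.r.t. w (dL/dw = beta * gap),
    # then clip to enforce the 1-Lipschitz constraint.
    w = np.clip(w - lr_c * beta * gap, -1.0, 1.0)

    # Generator step: maximize L w.r.t. theta. Here
    # dL/dtheta = -theta + beta * w, since KL = theta^2 / 2
    # and d/dtheta E[o] = 1 for this observable.
    theta += lr_g * (-theta + beta * w)

print(round(theta, 2))  # theta moves from 0 toward the experimental mean 2
```

Even in this toy, the generator never sees states drawn from $\nu$, only observable samples; the KL term anchors $\mu_\theta$ to the base model while the critic pulls its observable distribution toward the experimental one. In practice the paper replaces these scalar parameters with a generative model and neural-network critics.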
Obtaining large amounts of fully observed experimental data is expensive and simulations can incur modeling errors. ADAPO provides a general framework that leverages prior knowledge through simulation data while aligning to experimental data to learn accurate models of the physical world. For example, ADAPO could be scaled up to learn a generative model of the underlying atom positions of a protein from a large dataset of noisy cryo-EM images. We hope that frameworks like ADAPO that use experimental data also motivate the development of larger experimental datasets.
For more details, see our paper.