WO2024118915A1 - Thermodynamic artificial intelligence for generative diffusion models and bayesian deep learning - Google Patents
- Publication number: WO2024118915A1 (PCT application PCT/US2023/081816)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- analog
- network
- voltage
- score
- processor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- All classifications fall under G (Physics); G06 (Computing or Calculating; Counting); chiefly G06N3 (computing arrangements based on biological models: neural networks) and G06F17 (digital computing adapted for specific functions):
- G06N3/0475—Generative networks
- G06F17/13—Differential equations (under G06F17/11, complex mathematical operations for solving equations)
- G06N3/042—Knowledge-based neural networks; logical representations of neural networks
- G06N3/0455—Auto-encoder networks; encoder-decoder networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/0499—Feedforward networks
- G06N3/065—Analogue means (physical realisation, i.e. hardware implementation of neural networks, using electronic means)
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/048—Activation functions
Definitions
- GM generative modeling
- ML machine learning
- AI artificial intelligence
- text-to-image application of GM has captured the imagination of users, allowing them to generate their own artwork simply by typing in words.
- generating seemingly realistic (but ultimately fake) images of human faces demonstrates the power of GM.
- Generation of text, audio, computer code, and molecular structures are additional applications of GM.
- a generative model uses a probabilistic framework and describes how a dataset is generated in terms of a probabilistic model. One can then sample from this model in order to generate new data.
- Generative Adversarial Networks (GANs) were popular in the early days of GM. GANs employ two neural networks acting in an adversarial setting, where one network tries to discriminate the output of the other from real data samples. More recently, Diffusion Models have been introduced for GM and typically have superior performance over GANs.
2 Diffusion Models
- A class of physics-inspired models, known as Diffusion Models (DMs), has recently revolutionized the field of GM.
- DMs Diffusion Models
- Score SDE A unified framework for diffusion models based on stochastic differential equations (SDEs), called the Score SDE approach, follows four steps:
1. Generate training data by evolving under the Forward SDE, which adds noise to data from the dataset of interest.
2. Use this data to train a neural network (the “score network”) to match the score (the gradient of the logarithm of the probability) associated with the distribution at each noise level.
3. Produce a sample from the noisy distribution, i.e., the distribution associated with the final time point of Step 1.
4. Evolve this sample under the Reverse SDE (which is defined using the trained score network) to generate a novel datapoint.
- Steps 3 and 4 can be repeated many times to generate many novel datapoints.
- FIG. 1 shows how Steps 3 and 4 can be repeated in the context of image processing.
- the forward SDE in Step 1 above
- dx = f(x, t) dt + G(t) dw (1)
- x is the data vector
- dt is a positive timestep
- dw denotes a Gaussian-distributed noise term associated with Brownian motion
- f(x, t) is a function that determines the drift
- G(t) acts as the diffusion tensor.
- the Variance Preserving (VP) process is the continuous version of the discrete Markov chain often used in Denoising Diffusion Probabilistic Models (DDPMs).
- the Variance Exploding (VE) process is the continuous version of the discrete Markov chain used in Score Matching with Langevin Dynamics (SMLD).
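The forward process above can be simulated with a simple Euler–Maruyama discretization. The sketch below (plain NumPy) evolves samples under the VP forward SDE, dx = −½β(t)x dt + √β(t) dw; the linear β(t) schedule and all parameter values are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def forward_vp_sde(x0, beta_min=0.1, beta_max=20.0, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of the Variance Preserving forward SDE:
        dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw,  t in [0, 1].
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        beta = beta_min + t * (beta_max - beta_min)  # linear noise schedule
        drift = -0.5 * beta * x
        diffusion = np.sqrt(beta)
        x = x + drift * dt + diffusion * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# After the forward process, the samples are approximately standard normal,
# regardless of the starting data point.
noised = forward_vp_sde(np.full(10_000, 3.0))
print(float(noised.mean()), float(noised.std()))
```

This illustrates Step 1 of the Score SDE protocol: the same routine, run at intermediate times, produces the noised training data for the score network.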
- dx = f(t)x dt + g(t) dw (Forward)
- dx = [f(t)x − g(t)² ∇_x log p_t(x)] dt + g(t) dw (Reverse) (8), for some time-dependent functions f(t) and g(t).
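For intuition about the Reverse SDE, the following sketch integrates it backwards in time for the special case of Gaussian data, where the score ∇_x log p_t(x) is known in closed form, so no trained score network is needed. The schedule and step counts are illustrative assumptions:

```python
import numpy as np

def alpha(t, bmin=0.1, bmax=20.0):
    # alpha_t = exp(-0.5 * integral_0^t beta(s) ds) for a linear schedule
    return np.exp(-0.5 * (bmin * t + 0.5 * (bmax - bmin) * t ** 2))

def reverse_vp_sde(n_samples=20000, mu0=2.0, sig0=0.5,
                   bmin=0.1, bmax=20.0, n_steps=1000, seed=1):
    """Integrate dx = [f(t)x - g(t)^2 grad_x log p_t(x)] dt + g(t) dw
    backwards in time, using the exact score of the Gaussian marginal
    p_t = N(mu0 * alpha_t, sig0^2 * alpha_t^2 + 1 - alpha_t^2)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = rng.standard_normal(n_samples)          # sample from the noisy prior
    for i in range(n_steps, 0, -1):
        t = i * dt
        beta = bmin + t * (bmax - bmin)
        a = alpha(t, bmin, bmax)
        var = sig0 ** 2 * a ** 2 + 1.0 - a ** 2
        score = -(x - mu0 * a) / var            # exact Gaussian score
        drift = -0.5 * beta * x - beta * score  # f(t)x - g(t)^2 * score
        x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n_samples)
    return x

samples = reverse_vp_sde()
print(float(samples.mean()), float(samples.std()))
```

Starting from pure noise, the reverse dynamics recover (approximately) the original data distribution N(2.0, 0.5²), which is exactly the generative step (Step 4) of the protocol.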
- This loss function is a weighted average over input data x(0) and noised input data x(t)
- This loss function uses knowledge of the form of the conditional distributions p_0t(x(t) | x(0)).
- the distributions p_t(x) can be highly complex (e.g., multimodal) and hence so can their associated score functions. Therefore, typically a score network should have many parameters to be expressive enough.
- the output of the score network s_θ(x, t) can be a composition of affine transformations and non-linear activation functions. This is similar to how standard neural networks are constructed. Neural networks are good at approximating a wide variety of functions.
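A minimal illustration of such a composition of affine maps and nonlinearities: a toy, untrained s_θ(x, t) taking the state and time as input and returning a vector of the same dimension as the data. Layer sizes and the tanh activation are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_score_net(dim, hidden=64):
    """A minimal score network s_theta(x, t): affine layers composed with
    tanh nonlinearities, mapping (x, t) to a dim-dimensional score."""
    sizes = [dim + 1, hidden, hidden, dim]   # +1 input for the time t
    return [(rng.normal(0, np.sqrt(2 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def score_net(params, x, t):
    h = np.concatenate([x, [t]])             # condition on the noise level t
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)               # affine transformation + nonlinearity
    W, b = params[-1]
    return h @ W + b                         # final affine layer

params = init_score_net(dim=3)
s = score_net(params, np.array([0.1, -0.2, 0.3]), t=0.5)
print(s.shape)
```

In practice the architecture would encode the problem geometry (e.g., a U-Net for images), as the surrounding text notes; this fully connected sketch only shows the affine-plus-nonlinearity composition.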
- inductive bias can improve the performance of the score network, including its trainability and generalization performance. In practice, inductive bias often corresponds to accounting for the problem geometry when constructing the neural network.
- the problem geometry may correspond to a grid (e.g., in 1D, 2D, or 3D) or graph.
- a grid e.g., in 1D, 2D, or 3D
- Implementing such geometrical inductive biases is useful in fields such as molecule synthesis and material design, where including symmetries as inductive biases substantially improves the performance of diffusion models. Therefore, the construction of the score network can depend on the nature of the problem. For example, when considering 2D images, it is common to employ a so-called U-Net for the score network.
- the U-Net was originally introduced for medical image segmentation, and it has a structure similar to a convolutional neural network.
- Such networks account for spatial locality and problem geometry.
- score networks should be constructed in such a way as to be both expressive and to account for problem geometry.
- Latent and Spectral Diffusion Models Some extensions of DMs allow for some pre- and post-processing of the data. For example, Latent Diffusion Models (LDMs) first pass the data through a trained autoencoder that transforms the data from data space to latent space, then act with the DM in the latent space, and then reverse the autoencoder to go back to the data space. This reduces the computational difficulty for the DM since the latent space is lower dimensional. Similarly, Spectral Diffusion Models (SDMs) first transform spatial data into the spectral domain prior to acting with the DM and then transform back in the end, which allows the noise process to be correlated in the spatial domain.
- LDMs Latent Diffusion Models
- SDMs Spectral Diffusion Models
- Score SDE can be supplemented by additional steps: (0) Pre-process the data with an encoding process, (1) - (4) Perform the Score SDE protocol given above on the encoded data, (5) Map the encoded space back to the data space.
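The encode-diffuse-decode pattern can be sketched for the spectral case: adding independent Gaussian noise in the Fourier domain and transforming back yields spatially correlated noise in real space. The FFT-based encoder and the noise scale below are illustrative assumptions, not the patent's specific construction:

```python
import numpy as np

def spectral_noising(signal, t, sigma=1.0, seed=0):
    """Sketch of the Spectral DM idea: (0) encode into the spectral domain,
    add Gaussian noise there (the diffusion step at noise level t), and
    (5) decode back to the data space."""
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(signal)                       # encode: real FFT
    noise = rng.normal(size=spec.shape) + 1j * rng.normal(size=spec.shape)
    noisy_spec = spec + sigma * np.sqrt(t) * noise   # diffuse in spectral domain
    return np.fft.irfft(noisy_spec, n=len(signal))   # decode: inverse FFT

x = np.sin(np.linspace(0, 2 * np.pi, 128))
y = spectral_noising(x, t=0.1)
print(y.shape)
```

A latent diffusion model follows the same pattern with a trained autoencoder in place of the FFT pair.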
- Inductive bias refers to prior knowledge being inputted into the structure of the model, for example, knowledge of symmetries in the data. Having a strong inductive bias can reduce the training data requirements as well as improve the speed of training of the model.
- the third difficulty refers to the challenge of using digital hardware to simulate time dynamics. This includes both the challenge of digitally generating Gaussian randomness (i.e., the dw term in the SDE) and the challenge of numerical time integration via discretizing the time dynamics.
- the fourth difficulty refers to the fact that there are many cases in which numerically integrating SDEs and ODEs may lead to instabilities due to the structure of the equations, particularly for SDEs when the number of dimensions is large.
- thermodynamic involving randomness
- a nature-based computer for this application can be thermodynamic, involving physical randomness (e.g., due to heat fluctuations).
- a thermodynamic system can address the computational bottlenecks encountered when trying to solve diffusion models with standard digital computers.
- a physical system can address the inductive bias issue as well as the time dynamics simulation issue raised above.
- Examples of the inventive technology aim to remove the computational bottlenecks encountered by generative diffusion models run on purely digital hardware. These examples exploit the insight that diffusion is a natural process that occurs in many physical systems, including electrical circuits and continuous-variable optical systems.
- One example of the inventive technology can be implemented as a system for executing a generative diffusion model on a dataset. This system includes a physical system with several degrees of freedom, each of which is associated with a corresponding feature of the dataset. Each degree of freedom has a continuous state variable that evolves according to a corresponding differential equation having a tunable diffusion term and a tunable drift term.
- the degrees of freedom can be physically coupled to each other according to a problem geometry associated with the dataset.
- the system may also include a processor, operably coupled to the physical system, to reduce the entropy of the continuous state variables so as to produce an output of the generative diffusion model.
- the processor can reduce the entropy of the continuous state variables by reading off the continuous state variables and altering at least one of the tunable drift terms and/or by executing a score network.
- the processor may be configured to be trained by an optimization routine that reduces a loss function.
- the physical system may be configured to assist with estimating the loss function, and/or the processor may include analog circuitry configured to assist with estimating the loss function.
- the loss function may involve simultaneous time evolution of the physical system and of the analog circuitry in the processor.
- the loss function may quantify a degree of score matching.
- the system can also include a function generator that is operably coupled to the physical system. In operation, the function generator multiplies the tunable drift terms and tunable diffusion terms by arbitrary time-dependent functions so as to modify the differential equations governing evolution of the continuous state variables.
- the system can include a digital device that is operably coupled to the physical system. In operation, the digital device uploads the dataset to the degrees of freedom and downloads new data corresponding to measurements of values of the continuous state variables after evolution of the continuous state variables.
- the physical system can include a network of electrical circuits, each of which provides a corresponding degree of freedom and includes a capacitor having a charge encoding the continuous state variable; a stochastic noise source, in series with the capacitor, to generate the tunable diffusion term; and a resistor, in series with the capacitor, to generate the tunable drift term.
- Each electrical circuit can also include first and second tunable voltage sources operably coupled to the stochastic noise source and the resistor, respectively.
- the first tunable voltage source tunes the tunable diffusion term of the differential equation
- the second tunable voltage source tunes the tunable drift term of the differential equation.
- the stochastic noise source may be a thermal noise source and/or a shot noise source.
- each electrical circuit in the network of electrical circuits could include a variable resistor, in parallel with the capacitor and having a variable resistance controlled by a tunable voltage source, to tune the tunable drift term of the differential equation; and an amplifier, operably coupled to the stochastic noise source, to amplify an output of the stochastic noise source so as to tune the tunable diffusion term of the differential equation.
- the amplifier may have a variable gain determined by an additional variable resistor whose resistance is controlled by an additional tunable voltage source.
- the physical system can also include switches coupling the electrical circuits in the network of electri- cal circuits according to a problem geometry associated with the dataset.
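In the ideal case, the unit cell described above behaves as an Ornstein–Uhlenbeck process: the series resistor provides the drift −V/(RC) and its Johnson (thermal) noise provides the diffusion, with stationary variance kT/C by equipartition. A numerical sketch with arbitrary component values:

```python
import numpy as np

def unit_cell_voltage(R=1e3, C=1e-6, T=300.0, n_steps=100_000, dt=1e-5, seed=0):
    """Simulate the capacitor voltage of one RC unit cell with a thermal
    noise source in series, i.e. the Ornstein-Uhlenbeck process
        dV = -V/(RC) dt + sqrt(2 k T / (R C^2)) dw,
    whose stationary variance is kT/C (equipartition)."""
    k = 1.380649e-23                          # Boltzmann constant
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_steps) * np.sqrt(2 * k * T / (R * C ** 2) * dt)
    v = 0.0
    out = np.empty(n_steps)
    for i in range(n_steps):
        v += -v / (R * C) * dt + noise[i]     # drift from R, diffusion from noise
        out[i] = v
    return out

v = unit_cell_voltage()
print(f"{v.var():.3e}", "vs kT/C =", f"{1.380649e-23 * 300 / 1e-6:.3e}")
```

Tuning the drift and diffusion terms, as the claims describe, corresponds to scaling the two terms of this equation via the tunable voltage sources or variable resistors.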
- the physical system may also include switches configured to be actuated between settings for a forward diffusion process and settings for a reverse diffusion process.
- the overall system can include a processor, operably coupled to the network of electrical circuits, to reduce entropy of the continuous state variables so as to produce an output of the generative diffusion model. Again, the processor can reduce the entropy of the continuous state variables by reading off the continuous state variables and altering at least one of the tunable drift terms.
- the processor may include analog circuitry configured to evolve over time simultaneously with the physical system.
- the analog circuitry may be configured to continuously output predictions for score values as analog signals to be used as inputs to the physical system.
- the score values can be produced by physically modeling, with trainable physical devices, partial derivatives of the score with respect to time and with respect to the continuous state variables.
- the trainable physical devices used to model the partial derivatives may be artificial neural networks.
- the processor may include an integrator circuit, operably coupled to the trainable physical devices, to produce the score values by integrating, over time, outputs of the trainable physical devices.
- the processor may further include a field-programmable gate array (FPGA).
- the analog circuitry in the processor may include a network of electrical circuits, each electrical circuit in the network of electrical circuits providing a corresponding degree of freedom and comprising: a capacitor; a resistor in series with the capacitor; and a voltage source in series with the capacitor. At least one electrical circuit in the network of electrical circuits may be capacitively coupled to at least one other electrical circuit in the network of electrical circuits according to a problem geometry associated with the dataset.
- the voltage source may be a multi-layer neural network configured to take the continuous state variable as input and to output a voltage value for each electrical circuit in the network of electrical circuits.
- FIG. 1 illustrates a diffusion model applied to an image.
- the forward diffusion process, or forward process adds noise to each pixel, while the reverse diffusion process, or reverse process, removes noise from the image to produce a new data point.
- FIG. 2 is a schematic diagram of our thermodynamic AI device for simulating the forward process of generative diffusion models.
- the physical system is composed of multiple degrees of freedom (DOFs).
- DOFs degrees of freedom
- Each DOF has a continuous state variable, and that variable evolves according to a differential equation that, in general, could have both a diffusion and drift term.
- a function generator is capable of multiplying these diffusion and drift terms by arbitrary time-dependent functions, respectively h_j(t) and k_j(t), for the jth DOF.
- the problem geometry associated with a given dataset can be uploaded onto the device by selectively connecting the various DOFs, which mathematically couples the differential equations of the various DOFs.
- a datapoint from the dataset of interest can be uploaded to the device by initializing the values of the continuous state variables to be the corresponding feature values of the datapoint.
- data can be downloaded (and decoded) from the device by measuring the values of the continuous state variables after some time evolution.
- FIG. 3 is a schematic diagram of our thermodynamic AI device for simulating the reverse process of generative diffusion models. In addition to all of the device components that are present in the forward process, the reverse process uses a trained score network.
- the inputs to the score network are the values of the continuous state variables at some time t, and the output is the value of the score.
- the jth component, s_j(t), of the score gets added as a drift term in the evolution of the jth DOF.
- the score network acts as a Maxwell’s demon, which continuously monitors the physical system and appropriately adapts the drift term in order to reduce the physical system’s entropy.
- FIG. 4 shows a circuit diagram for the unit cell, the building block of our analog device.
- the voltage functions k(t) and h(t) are chosen by the user to allow for different drift and diffusion coefficients and are multiplied by the intrinsic circuit voltages using voltage mixers (circle with cross).
- FIG. 5 is a circuit diagram of one unit cell of the forward process using variable resistors to control the drift and diffusion terms.
- FIG. 6 shows two unit cells coupled to each other via a coupling capacitor C_12. This coupling forms the basis for connecting unit cells in our analog device, in general.
- FIG. 7 is a circuit diagram of two capacitively coupled unit cells using variable resistors. The images above the main circuit diagram are definitions of shorthand circuit symbols used in the main diagram.
- FIG. 8 shows a simplified version of the circuit used in our two unit cells, experimentally implemented as a heat engine.
- the plot shows that the voltages across the capacitors in the unit cells were highly correlated with a large coupling capacitance. This suggests that a coupling capacitor effectively correlates the random walks (i.e., the noise processes) in the unit cells. This spatial correlation is desirable to engineer the inductive bias of the model.
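The correlating effect of a coupling element can be illustrated with a toy model: two noisy, relaxing voltages with a linear mutual coupling standing in for the coupling capacitor. This is a deliberate simplification of the actual circuit equations, with dimensionless units chosen for the sketch:

```python
import numpy as np

def coupled_cells(coupling, n_steps=100_000, dt=0.02, seed=0):
    """Toy model of two unit cells whose noisy voltages relax with unit rate
    and are linearly coupled with strength `coupling` (standing in for the
    coupling capacitor). Returns the empirical correlation coefficient."""
    rng = np.random.default_rng(seed)
    v = np.zeros(2)
    trace = np.empty((n_steps, 2))
    for i in range(n_steps):
        drift = -v + coupling * (v[::-1] - v)   # relaxation + mutual pull
        v = v + drift * dt + np.sqrt(2 * dt) * rng.standard_normal(2)
        trace[i] = v
    return float(np.corrcoef(trace.T)[0, 1])

c_uncoupled = coupled_cells(0.0)
c_coupled = coupled_cells(5.0)
print(c_uncoupled, c_coupled)
```

With no coupling the two random walks are uncorrelated; with strong coupling their correlation approaches coupling/(1 + coupling) in this model, mirroring the experimentally observed correlation of the capacitor voltages.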
- FIG. 9 illustrates four possible problem geometries, although there are other possibilities.
- DNA sequences have a 1D geometry
- images have a 2D geometry
- solutions to partial differential equations (PDEs) in real space such as fluid flow
- molecular structures have some graph connectivity and hence follow the geometry of a graph.
- FIG. 10 illustrates mapping the connectivity matrix onto the hardware.
- Each off-diagonal element of the connectivity matrix is translated into the state of a switch in a wire connecting two unit cells.
- the capacitors in series with the switches are not shown.
- the switches are placed in series with the resistive bridges that bridge the unit cells.
- the switches are placed in series with the capacitive bridges that bridge the unit cells.
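The mapping from connectivity matrix to switch states can be sketched in a few lines; the dictionary representation of switch states is our own illustrative choice:

```python
import numpy as np

def switch_states(connectivity):
    """Translate the off-diagonal elements of a symmetric connectivity
    matrix into open/closed states for the switches bridging pairs of
    unit cells: switch (i, j) is closed iff cells i and j are coupled."""
    n = connectivity.shape[0]
    return {(i, j): bool(connectivity[i, j])
            for i in range(n) for j in range(i + 1, n)}

# A 1D chain geometry (e.g., a DNA sequence): nearest-neighbour coupling
# among four unit cells.
chain = np.diag(np.ones(3), k=1) + np.diag(np.ones(3), k=-1)
states = switch_states(chain)
print(states)
```

A 2D grid, or an arbitrary molecular graph, is handled identically: only the connectivity matrix changes, which is how the problem geometry is tailored to the hardware.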
- FIG. 11 shows a comparison of the typical scenario for Maxwell’s Demon (left panel) with our scenario (right panel). In our scenario, the voltage across a capacitor in an electrical circuit plays the role of the dynamical variable, analogous to the positions of the gas particles in the left panel.
- FIG. 12 is a schematic illustration of how the score network, which is stored and executed on a digital device such as a central processing unit (CPU) or a field-programmable gate array (FPGA), interacts with the entire set of analog unit cells via Analog-to-Digital converters (ADCs) and Digital-to-Analog converters (DACs).
- ADCs Analog-to-Digital converters
- DACs Digital-to-Analog converters
- FIG. 13A shows coupling the digital score network (SN) to the analog unit cell, for solving the Reverse SDE, where the output of the SN is multiplied by a function g(t)² in an analog fashion.
- FIG. 13B shows coupling the digital SN to the analog unit cell, for solving the Reverse SDE, where the output of the score network is multiplied on a digital device.
- FIG. 14 is a flow chart for our Thermodynamic AI system whenever the score network has already been pre-trained.
- FIG. 15 is a flow chart for our Thermodynamic AI system when the score network is untrained.
- FIG. 16 is a schematic circuit diagram for the score device. This diagram represents the process used to obtain score values during the evolution of the reverse process.
- FIG. 17 shows circuit diagrams for a voltage adder (left) and a voltage integrator (right).
- FIG. 19 is a circuit diagram for a root-mean-square (RMS) converter based on the thermal method.
- FIG. 20 is a schematic illustration of a training process for the analog score device.
- FIG. 21 is a flow chart for our thermodynamic AI system when the score network is an analog device, and this analog device is used for evaluating the loss function during training and for interfacing with the reverse process after training.
- FIG. 22A shows one unit cell of an analog score device.
- FIG. 22B shows two unit cells of the analog score device, capacitively coupled together.
- FIG. 23 shows a layered analog neural network to evaluate functions r (i) (also applies to evaluating q).
- the (N + 1)-dimensional input (v(t), t) is fed into a parametrized layer A_1, that is detailed in Fig. 24.
- the output is then fed into a diode followed by a resistor whose current output is a nonlinear function f d of the voltage.
- B_K shows the last nonlinear layer.
- FIG.24 shows the first resistive layer of the analog neural network.
- FIG. 25 is a schematic of a basic voltage follower.
- FIG. 26 is a schematic of the circuit to compute the dot product between the input voltages and the j th row of A 1 .
- FIG. 28 shows a circuit diagram for a universal unit cell that can be used for both the forward and reverse processes.
- FIG. 29 illustrates how a Bayesian neural network (BNN) differs from a standard neural network, showing that the weights in a BNN have some probability distribution.
- FIG. 30 illustrates a multimodal distribution. The posterior distribution of the weights in a BNN is typically multimodal.
- FIG. 31 shows how the four components or subsystems of a BNN interact and feed signals to each other.
- FIG. 32 shows a unit cell for the hidden layer network (HLN), including a capacitor C_j, a resistor R_j, a non-linear element (NLE) in parallel with the capacitor, and a voltage source.
- FIG. 33 shows two unit cells for the HLN coupled together via a resistive bridge.
- FIG. 34 shows an analog augmented neural ODE, which can be used for the HLN.
- FIG. 35 shows a unit cell for the prior weight diffuser.
- FIG. 36 shows two unit cells, for the prior weight diffuser, coupled together via a capacitive bridge.
- FIG. 37A shows a unit cell for the posterior weight diffuser for the cases where the posterior drift network (PDN) is an analog device.
- PDN posterior drift network
- FIG. 37B shows a unit cell for the posterior weight diffuser for the cases where the PDN is a digital device, such as an FPGA.
- FIG. 38 illustrates outputting weights from the weight diffuser device to the HLN device for two unit cells of the weight diffuser. This concept applies to an arbitrary number of unit cells.
- FIG.39 illustrates inputting weights from the weight diffuser device into the HLN device for two unit cells of the HLN. This concept applies to an arbitrary number of unit cells.
- FIG. 40 is a circuit diagram illustrating that the same device can be used for the prior and posterior weight diffusers.
- the thick lines represent switches that allow one to toggle between the two different diffusers. When the switches are open (closed), the device corresponds to the prior (posterior) weight diffuser.
- FIG. 41 illustrates a feedback process between the posterior weight diffuser and the posterior drift network for the case where the posterior drift network is a digital device, such as an FPGA or CPU.
- FIG. 42 depicts a layered analog neural network configured to evaluate functions , in the context of the PDN.
- the W + 1 dimensional input (w(t), t) is fed into a parameterized layer, corresponding to an affine matrix that is detailed in Fig. 43.
- FIG. 43 shows the first resistive layer of the analog neural network.
- the resistive layers take the same generic form, possibly with different hyperparameters.
- Each input voltage is copied M_1 times, so that the copies can be fed into M_1 × (W + 1) resistors whose inverse resistances are the entries (A_1)_{j,k} of the A_1 matrix.
- This circuit produces a weighted average of the input voltages, stored in each A_1[w(t), t]_j, which is then fed to the nonlinear layer.
- FIG. 44 is a schematic diagram of the circuit used to integrate the total derivative for the drift, in order to output a drift value from the PDN.
- FIG. 45 is a circuit diagram for the estimation of a term in the loss function, employing digital time integration.
- FIG. 46 is a circuit diagram for the estimation of the same loss-function term, employing analog time integration.
- FIG. 47 shows a unit cell for an analog neural ODE employing a voltage mixer. This unit cell can allow for reversing the direction of time based on the choice of function k(t), e.g., by choosing k(t) → k(1 − t).
- FIG. 48 shows two unit cells for the adjoint device. Each cell has a resistor and a capacitor, and the cells are coupled via a resistive bridge.
- the adjoint device evolves the adjoint variable a over time, where this variable can be encoded in the voltages across the capacitors in the device. This is in the context of the adjoint sensitivity method for computing gradients of neural ODEs.
- FIG. 49 shows a circuit diagram for an integrator circuit.
- FIG. 50 illustrates an analog latent ODE for fitting and extrapolating time-series data.
- FIG. 51 shows a unit cell for an analog neural SDE processor.
- FIG. 52 illustrates various algorithms unified under a single mathematical framework of thermody- namic AI algorithms.
- FIGS. 53A and 53B are circuit diagrams of a physical realization of an s-mode comprising a noisy resistor and a capacitor.
- FIGS. 54A and 54B are circuit diagrams of physical realizations of coupling between s-modes using a coupling resistor and a coupling capacitor, respectively.
- FIG. 55 illustrates a force-based approach to constructing a Maxwell’s Demon device.
Attorney Docket No. NORM-002WO01
Detailed Description
3 Generic Physical Thermodynamic Artificial Intelligence (AI) Devices
- FIGS. 2 and 3 are schematic diagrams showing a generic physical architecture of a thermodynamic AI device.
- FIG. 2 illustrates the thermodynamic AI device 200 performing the forward process
- FIG. 3 illustrates the thermodynamic AI device 300 performing the reverse process.
- This architecture can be implemented in electrical circuits, as described below, or in other physical systems, such as continuous- variable optics systems.
- the physical system 200 has multiple degrees of freedom (DOFs) 210-1 through 210-4.
- DOFs degrees of freedom
- the number of DOFs 210 matches the dimensionality of the data, i.e., the number of features in the input dataset 201.
- Each DOF 210 has a continuous state variable, and that variable evolves according to a differential equation that, in general, could have both a diffusion and drift term.
- Function generators 220-1 and 220-2 can multiply these diffusion and drift terms by arbitrary time-dependent functions.
- the problem geometry 203 associated with a given dataset can be uploaded onto the device by selectively connecting the various DOFs 210 with switches 230.
- a datapoint from the dataset 201 of interest can be uploaded to the device by initializing the values of the continuous state variables to be the corresponding feature values of the datapoint.
- new data 209 can be downloaded (and decoded) from the device by measuring the values of the continuous state variables after some time evolution.
- the reverse process uses a trained score network 340, as in FIG. 3.
- the inputs to the score network are the values of the continuous state variables at some time t, and the output is the value of the score.
- the jth component of the score gets added as a drift term in the evolution of the jth DOF.
- the thermodynamic AI device can be implemented as a hybrid analog-digital system that is thermodynamic in nature. The analog system generates diffusion via a thermal noise source, such as an electrical resistor. As noted above, Gaussian randomness is costly to generate digitally and hence is better generated with an analog system. Moreover, time dynamics are performed on the analog device via natural time evolution of, say, the voltage on an electrical capacitor. This addresses the computational bottleneck of numerical integration of dynamics with digital solvers, since the analog system naturally performs this integration.
- the analog system is composed of repeated subunits, or unit cells, where the number of unit cells is equal to the dimensionality of the problem that the analog system is configured to solve.
- Each unit cell is composed of a thermal noise source and resistive and capacitive circuit elements.
- each unit cell can, in principle, be coupled (via a capacitive bridge) to every other unit cell.
- An arbitrary connectivity matrix (analogous to the adjacency matrix in graph theory) describes how the unit cells are coupled to each other. This allows the connectivity to be tailored to the geometry of the problem at hand.
- inductive bias reduces training data requirements and training complexity. Our system enables the user to incorporate inductive bias into the model.
- the connectivity matrix should be closely connected to the geometry of the problem to maintain a strong inductive bias.
- the second law of thermodynamics says that global entropy does not decrease over time, but a Maxwell’s Demon can locally reduce the entropy of a system by making observations on that system and adaptively interacting with the system.
- An electrical version of Maxwell’s Demon can reduce the entropy of the set of unit cells in our analog device, specifically by observing the capacitors’ charges in the unit cells and adjusting an applied voltage in the cells.
- our system can physically implement the dynamics in the Reverse SDE, which tend to reduce entropy and thus at first glance appear to contradict the second law of thermodynamics.
- Maxwell’s Demon can be implemented as a digital device, such as a central processing unit (CPU) or a field-programmable gate array (FPGA), that stores the trained score network and interacts continuously with the analog device.
- the Maxwell’s Demon (or digital score network) acts as an AI agent who intelligently interacts with a thermodynamic physical system.
- the analog system and the digital AI agent/Maxwell’s Demon form a thermodynamic AI system for generative modeling.
- Fig. 4 shows the RC circuit with two voltage mixers, which introduce time dependence into the drift and diffusion terms. The two voltage mixers multiply the circuit voltages before and after the resistor R by time-dependent voltages k(t) and h(t), respectively.
- w is an arbitrary voltage noise source with noise voltage v_w.
- h(t) and k(t) are time-dependent voltage sources with voltages v_h and v_k, respectively.
- the probabilistic flow ODE can also be implemented in an RC circuit corresponding to Eq. (16) by adding a voltage mixer to the voltage at a point between the resistor and the capacitor. Adding this voltage mixer yields the corresponding ODE. By adding in a second voltage equal to (1/2)G(t)² s_θ (related to the second term in Eq. (12)), it is possible to simulate the behavior of the probabilistic flow ODE given in Eq. (13) to generate new samples. This is explained in more detail in section 11.2 for N unit cells.

5.3 Unit cell involving time-varying resistors

- An alternative version of our unit cell does not use voltage mixers. In practice, voltage mixers involve multiple components, and hence avoiding voltage mixers reduces circuit complexity.
- time-varying resistors can be used to introduce time dependence into the drift and diffusion terms.
- One way of constructing a time-varying resistor is with a field effect transistor (FET) or with a network of FETs.
- By applying a time-varying voltage to the gate of the FET, within its linear operation range, we obtain a time-varying resistor.
- a voltage-controlled resistor, such as a FET, can also be used to manipulate the gain of a simple non-inverting amplifier.
- Thermal noise, also called Johnson-Nyquist noise, comes from the random thermal agitation of the charge carriers in a conductor, resulting in fluctuations in voltage or current inside the conductor.
- the amplitude of the voltage fluctuations can be controlled by changing the temperature or the resistance.
- a thermal noise source can be implemented using a large resistor in series with a voltage amplifier.
- a thermal noise source can be implemented using a tunable resistor (transistor operated in the linear regime) where the noise amplitude scales with the value of the resistance.
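This scaling follows the standard Johnson-Nyquist relation v_rms = sqrt(4 k_B T R B); a minimal sketch showing how temperature and resistance set the noise amplitude:

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def johnson_noise_vrms(resistance_ohm, temperature_k, bandwidth_hz):
    """RMS of the open-circuit Johnson-Nyquist voltage noise:
    v_rms = sqrt(4 * k_B * T * R * B)."""
    return math.sqrt(4.0 * K_B * temperature_k * resistance_ohm * bandwidth_hz)

# A 1 MOhm resistor at room temperature over a 10 kHz bandwidth
v = johnson_noise_vrms(1e6, 300.0, 1e4)  # roughly 13 microvolts RMS
```

Doubling the resistance increases the RMS amplitude by a factor of sqrt(2), which is the tuning knob described above.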
- shot noise arises from the discrete nature of charge carriers and from the fact that the probability of a charge carrier crossing a point in a conductor at any time is random. This effect is particularly notable in semiconductor junctions, where the charge carriers must overcome a potential barrier to conduct a current. The probability of a charge carrier passing over the potential barrier is an independent random event. This induces fluctuations in the current through the junction.
- the amplitude of the current fluctuations can be controlled by changing the magnitude of the DC current passing through the junction.
- a source of shot noise can be implemented using a pn diode (for example, a Zener diode in a reverse-bias configuration) in series with a controllable current source.

6 Two Coupled Unit Cells

- As mentioned above, incorporating the problem geometry into the noise process contributes to reducing the strain on the score network, as this can be viewed as providing an inductive bias for the model.
- the 2 × 2 capacitance matrix defined in the two-cell case generalizes to the N × N matrix C, with C_i the capacitance of the capacitor in cell i and C_ij the capacitance of the branch coupling cells i and j.
- the matrix C is a fully dense symmetric matrix.
- Arbitrary symmetric matrices can be constructed by specifying an arbitrary adjacency matrix A.
- the circuit geometry is specified by a sparse adjacency matrix. For example, if N cells are arranged on a d-dimensional lattice, each cell is coupled only to its nearest neighbors, and there are 2d nearest neighbors for each interior unit cell.
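The lattice connectivity just described can be sketched as follows (a hypothetical helper, not from the disclosure):

```python
import numpy as np

def lattice_adjacency(shape):
    """Adjacency matrix for cells on a d-dimensional grid with
    nearest-neighbor coupling: an interior cell has 2d neighbors,
    so most entries of the matrix are zero (sparse geometry)."""
    n = int(np.prod(shape))
    A = np.zeros((n, n), dtype=int)
    for idx in range(n):
        coords = np.array(np.unravel_index(idx, shape))
        for axis in range(len(shape)):
            for step in (-1, 1):
                nbr = coords.copy()
                nbr[axis] += step
                if 0 <= nbr[axis] < shape[axis]:
                    j = int(np.ravel_multi_index(nbr, shape))
                    A[idx, j] = 1
    return A

A = lattice_adjacency((3, 3))  # 2-D lattice: the center cell has 4 neighbors
```

The resulting matrix is symmetric, matching the undirected capacitive couplings between unit cells.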
- the nature of this mapping from the data feature value to a voltage depends on the dataset; for images, the values of the pixels can be converted to voltages in a certain voltage window.
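A minimal sketch of this encoding for 8-bit image data, assuming an illustrative voltage window of [-1 V, 1 V] (the window itself is a design choice, not specified here):

```python
import numpy as np

def pixels_to_voltages(pixels, v_min=-1.0, v_max=1.0):
    """Linearly map 8-bit pixel values into a voltage window [v_min, v_max]."""
    p = np.asarray(pixels, dtype=float)
    return v_min + (p / 255.0) * (v_max - v_min)

def voltages_to_pixels(voltages, v_min=-1.0, v_max=1.0):
    """Inverse map: recover pixel values from measured capacitor voltages."""
    v = np.asarray(voltages, dtype=float)
    scaled = (v - v_min) / (v_max - v_min) * 255.0
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)

v = pixels_to_voltages([0, 128, 255])  # endpoints map to -1.0 V and 1.0 V
```

The inverse map is what the download step uses after the time evolution completes.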
- the forward process evolves the vector v(t) over time, and eventually at a later time t this vector is downloaded and converted back to a vector of feature values.
- the unit cell is constructed with voltage mixers and fixed resistors.
- the treatment in this section extends straightforwardly to other variants of the unit cell, such as the unit cell based on variable resistors.
- Each unit cell provides one Brownian source voltage in series with its resistor, and the N-tuple of all such signals is an N-dimensional Brownian process w(t).
- the following construction generalizes the two-cell system: the N × N matrix C defined in Eq.
- the diffusion term in Eq. (1) has no dependence on the voltage, and the drift term f in Eq. (1) depends linearly on the voltage.
- This therefore includes state-of-the-art diffusion models, such as those represented by Eq. (7), where G(t) reduces to a scalar g(t) and the drift term is linear. It also includes diffusion models with inductive biases, since correlations can be introduced by the C matrix, which depends on the connectivity of the full circuit.
- the primary role for the forward process in diffusion models is to generate sample trajectories, and these trajectories serve as training data used to evaluate the loss function associated with training the score network.
- the forward process possesses a unique stationary distribution p_noise(x); any initial distribution p_data(x) of the forward process converges towards p_noise as t → ∞.
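This convergence can be checked numerically; the sketch below assumes an illustrative Ornstein-Uhlenbeck forward process dx = -x dt + sqrt(2) dw, whose stationary distribution p_noise is a unit Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)

# Start far from stationarity: a point mass at x = 5
x = np.full(20000, 5.0)
dt, n_steps = 0.01, 1000  # evolve to t = 10, far beyond the relaxation time

for _ in range(n_steps):
    x += -x * dt + np.sqrt(2 * dt) * rng.normal(size=x.shape)

# Empirical moments approach those of p_noise = N(0, 1)
mean, var = x.mean(), x.var()
```

Regardless of the initial point mass chosen, the final moments are those of the same unit Gaussian, illustrating the independence from p_data.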
- the observer watches the gas particles to see which component is approaching the barrier and then removes the barrier when it would help the sorting process.
- Our circuit for solving the reverse process in diffusion models is a physical example of the Maxwell’s Demon scenario.
- the dynamical variables are the voltages across the capacitors in each of the unit cells. Even if these voltages start in a high entropy state (e.g., being uncorrelated), they can evolve over time towards a low entropy state (e.g., being strongly correlated). This entropy reduction is facilitated by a neural network, called the score network.
- the score network can take various forms.
- the score network could even be an analog device, as discussed below in Sec. 15.
- This neural network / processing device combination acts as the demon in the Maxwell's Demon scenario in that it continuously observes the state of the analog system and adapts the applied voltage in each circuit appropriately. This is illustrated in the right panel of Fig. 11. It therefore makes sense to refer to our system as a thermodynamic AI system. After all, our system exploits an artificial intelligence unit, the neural network on the CPU or FPGA. (Attorney Docket No. NORM-002WO01)
- Fig.12 shows details of how the score network interacts with the analog system.
- a voltmeter reads off the voltage across the capacitor in each unit cell, and the set of voltages forms the state vector, denoted x.
- Each of these voltages is converted into a digital signal via an analog-to-digital converter (ADC) and these digital signals are sent to the CPU or FPGA.
- the score network is evaluated at the point associated with the inputted signal.
- the output of the score network is the predicted score, which is a vector (the score vector).
- This score vector is converted from a digital signal into an analog signal via a digital-to-analog converter (DAC), and then it becomes a vector of voltages.
- Each element of this voltage vector corresponds to a voltage that is applied inside of the corresponding unit cell.
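The read-evaluate-apply loop just described can be sketched as follows (all hardware interfaces here are hypothetical stand-ins, not APIs from the disclosure):

```python
import numpy as np

def demon_step(read_voltages, score_network, write_voltages, g, t):
    """One iteration of the hybrid feedback loop: read the capacitor
    voltages via the ADC, evaluate the trained score network digitally,
    scale by g(t)^2, and apply the result back through the DAC."""
    x = read_voltages()            # ADC: analog state -> digital vector
    s = score_network(x, t)        # forward pass of the trained score network
    write_voltages(g(t) ** 2 * s)  # DAC: score vector -> applied voltages

# Toy stand-ins for the analog hardware and a trivial "trained" score
state = {"x": np.array([0.5, -0.2])}
applied = {}
demon_step(read_voltages=lambda: state["x"],
           score_network=lambda x, t: -x,   # score of a unit Gaussian
           write_voltages=lambda v: applied.update(v=v),
           g=lambda t: 1.0, t=0.0)
```

In the actual system this loop runs continuously while the analog device evolves, which is exactly where the latency concern discussed later arises.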
- Figures 13A and 13B delve deeper into how the output of the score network gets incorporated into each unit cell as an applied voltage. The main issue here is how to multiply the output of the score network by the function g(t)² that appears in the Reverse SDE or Reverse ODE.
- Figures 13A and 13B depict two alternative methods for coupling the score network to the analog unit cell, for solving the reverse process.
- One method, shown in Fig. 13A, is analog in spirit.
- the multiplication by g(t) 2 is done physically in the circuit with analog components.
- The other method, shown in Fig. 13B, is digital in spirit.
- the multiplication by g(t)² is done digitally on the CPU or FPGA, and the resulting digital signal is fed into the DAC and directly applied as a voltage in the unit cell.
- FIGS. 13A and 13B provide two alternative methods to program the approximated non-linear drift piece e(v, T − τ) into the pre-existing circuit model simulating the forward SDE. Both of these methods involve digital evaluations of s_θ(v, T − τ) passed into the circuit unit cells as a voltage signal, via DACs.
- Figure 13A shows an example of how to do this multiplication with an analog circuit, in the special case where M(T − τ) is diagonal and hence can be thought of as a set of scalar quantities (each of the form g(T − τ)). For each i, we port e(v, T − τ)_i to the ith unit cell. Hence, we encode the integral equation in the stochastic dynamics of the circuit. Let us now consider the integral equation. Suppose the score network s_θ(v, T − τ) is trained on data from a forward process that is stopped after time T.
- the probabilistic flow ODE can be used for sample generation.
- New data samples x can therefore be generated with the probabilistic flow ODE in exactly the same way as when using the reverse SDE. The performance of this approach depends on the noise level of the system, since it assumes that there is no noise stemming from the RC circuit.
- Ideally, the same physical device (e.g., the same electrical chip) would perform both the forward and reverse processes. However, comparing the unit cell circuits in Fig. 4 and Fig. 13, which pertain to the forward and reverse processes, respectively, shows that the unit cells are not exactly the same.
- the unit cell for the reverse process has additional circuit elements that are not used in the forward process.
- the circuit in Fig. 28 addresses this issue.
- Fig. 28 shows the unit cell in Fig. 13A, but with switches added in various locations. These switches can make it possible to toggle a unit cell between the forward process and the reverse process.
- Fig. 28 illustrates this concept for a unit cell construction based on voltage mixers. However, this concept applies more broadly to other unit cell constructions (e.g., the construction based on variable resistors). Indeed, switches can be added to other unit cell constructions for toggling between the forward and reverse processes.

13 Flowchart of the entire system (with digital score network)

- Next we present our entire thermodynamic AI system for generative modeling.
- FIG. 14 shows a flowchart for the scenario in which the score network has already been trained. In this case, we do not need to run the forward process. We simply run the reverse process to generate new datapoints.
- the digital device stores prior information in the form of a connectivity matrix (1402). This information is uploaded to the analog device by opening or closing the relevant switches (1406), e.g., as shown in Fig. 10. Similarly, the digital device stores a sample from the noise distribution (1404), and this sample is uploaded to the analog device by charging the capacitors appropriately (1408).
- With the connectivity chosen and the capacitors charged, the analog device is ready to evolve in time under the reverse process (either the Reverse SDE or Reverse ODE) (1410).
- This evolution involves continuous communication with the pre-trained score network (1412), which is stored on the digital device as shown in Fig. 12.
- the data is downloaded onto the digital device by reading off the voltages across the capacitors on the analog device and then passing the resulting voltages through an ADC (1414). Finally, if appropriate, the data is decoded (e.g., mapped from a latent space to the original feature space) (1416).
- Fig. 15 shows a flowchart for the scenario where the score network is trained by the user.
- the user generates (noisy) training data by evolving under the forward process and stores the training data on the digital device (1502).
- the data may be encoded on the digital device (e.g., into a latent space or a spectral domain) (1504).
- the data is then uploaded to the analog device by appropriately charging the capacitors of the unit cells (1510).
- the problem geometry is stored on the digital device (1402) and uploaded to the analog device by choosing the connectivity of the unit cells (1406), as in the pre-trained case. With the connectivity chosen and the capacitors charged, the analog device then evolves over time (1518) according to the so-called forward process (forward SDE or forward ODE), which adds noise to the data.
- the noisy data is then downloaded to the digital device by reading off the capacitors’ voltages and passing them through an ADC (1520).
- This data acts as training data for the score matching optimization, with the loss function given in Eq. (9).
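As an illustration of this optimization, here is a sketch of denoising score matching under an assumed Ornstein-Uhlenbeck forward process, for which the conditional score is analytic (this is a generic construction standing in for Eq. (9), not the exact loss from the disclosure):

```python
import numpy as np

def dsm_loss(score_net, x0, t, rng):
    """Monte-Carlo denoising score-matching loss, assuming an OU forward
    process with kernel x(t) | x(0) ~ N(x0 * exp(-t), (1 - exp(-2t)) I).
    The conditional score is (mean - x(t)) / var, and the network is
    trained so that its output matches this target on average."""
    mean = x0 * np.exp(-t)
    var = 1.0 - np.exp(-2.0 * t)
    xt = mean + np.sqrt(var) * rng.normal(size=x0.shape)
    target = (mean - xt) / var
    diff = score_net(xt, t) - target
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(256, 4))

# Sanity check: the exact conditional score achieves zero loss
exact_score = lambda xt, t: (x0 * np.exp(-t) - xt) / (1.0 - np.exp(-2.0 * t))
loss = dsm_loss(exact_score, x0, t=0.5, rng=np.random.default_rng(0))
```

In the hybrid system, the noisy samples x(t) come from the analog forward evolution rather than from the Gaussian kernel sampled here.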
- a trained score network, which means that we are at the same point as the starting point of the algorithm in Fig. 14. The rest of the protocol therefore follows the same steps as the protocol in Fig. 14.

14 Computational advantages (assuming digital score network)

- There are advantages to performing the forward and reverse processes in analog hardware rather than digitally. These are: 1.
- SDEs are in general numerically unstable in large dimensions, requiring sophisticated numerical integrators and fine-tuned time-stepping schedules. With an analog system, these difficulties disappear, as there is no time step to be chosen. 2.
- the total physical time can be tuned and depends on the operating decay rate of the system, proportional to 1/(RC) for a single unit cell. This means the physical time of integration can be made short, which is an advantage of analog computation. 3.
- for the forward process, the voltages should be initialized to match the input data x, and the voltages of each unit cell are then measured at each instant of a chosen discretization for training the score network. For the reverse process, no intermediate measurements are needed, and the initial voltages are set to match the noisy data x(T).
- Querying a digital score network takes time, because it involves a forward pass through a deep neural network.
- This forward pass involves multiple matrix-vector multiplications and non-linear activation functions. Hence, it takes some non-trivial amount of time to perform this forward pass and to receive the predicted value of the score s ⁇ (x, t).
- we refer to the time delay associated with this process as the latency of the score network.
- the latency of the score network implies that the numerical SDE solver should be slowed down. In other words, each time step of the numerical integration may take longer, due to this latency, since every single time step involves a forward pass through the score network.
- when the reverse process is performed on an analog device, such as the device described above, the latency of the score network is also an issue. In this case, the reverse process evolves continuously on an analog device. Because this evolution takes place continuously in a physical system, it cannot be interrupted or slowed down once it starts. This implies that the score values received by the reverse process may not be fully accurate; they may be time delayed.
- the latency of the score network translates into a slowing down of the sample generation process.
- for analog diffusion processes, it translates into inaccuracies in the time evolution, a slowing down of the sample generation process, or both. This motivates removing the latency issue with the analog version of the score network disclosed below.

15.2 General strategy for removing latency
- Our general strategy for removing latency is to have an analog device that acts as the score network. We refer to this as the score device. The score device evolves in real time, simultaneously with the time evolution of the reverse process.
- the score device outputs a prediction for the score.
- t is the abstract time associated with the reverse diffusion process.
- the fact that the score device and the reverse process evolve simultaneously with each other allows us to address the latency issue, since the two devices can communicate with each other in real time, without relying on the intervention of a digital device.
- the two devices evolve together according to a system of differential equations, where the two differential equations are coupled. The simultaneous time evolution concept is useful even if one of the devices is digital.
- the reverse diffusion process is digitally solved with a numerical SDE solver while the analog score network evolves in time. It is also beneficial if the reverse diffusion process is analog but the score network is digitally integrated with an ODE solver. However, the benefit is likely more pronounced if both devices are analog. In this case, the two devices can be physically coupled with an analog link, and all signals can remain analog (i.e., no analog-to-digital conversion is necessary). Hence, in what follows, we focus mostly on the case where both the score device and the reverse process correspond to analog devices.
- Equations (63) and (64) form a system of coupled differential equations that simultaneously evolve forward in real time τ.
15.4 General integral equations for the score device

- This section pertains to the score predicted during the reverse diffusion process.
- the actual values of the score predicted by the score network depend both on the differential equation as well as the initial condition.
- the predicted score values have the form:
- the initial condition s_0 is related to the score of p_noise.
- p_noise is simple enough that we may have an analytical description of the score function.
- the final distribution as t → ∞ is unique and independent of the initial data distribution p_data.
- This means that we can choose an arbitrary initial distribution at t = 0 (i.e., an arbitrary data distribution p_data) for the forward process and still end up at the same final distribution as t → ∞.
- the stationary distribution is a multivariate Gaussian distribution (for the forward processes that we consider, which have an affine drift term), since in this case the stationary distribution corresponds to a conditional distribution of the form p_0t(x(t) | x(0)).
Example differential equations for the score device based on total derivative

- Consider two different examples for how one can construct the score device.
- Example 1: In the first example, we consider constructing the score device based on the theoretical concept of the total derivative. The score s(v(τ), τ) is a function of both τ and v(τ). Because of this, we can write the total derivative with respect to τ as ds/dτ = ∂s/∂τ + ∇_v s · dv/dτ. Here ∇_v denotes the gradient with respect to the v vector. Comparing the form of Equation (69) with the form of Equation (64) shows that the expression provides a blueprint for how to construct the function h_θ(τ, v, s).
- the form of this static function is determined by: (1) the data distribution p data and (2) the forward diffusion process (i.e., the forward SDE).
- Equation (69) can thus be modeled as ds/dτ = q_θ(v, τ) + r_θ(v, τ) · dv/dτ (70), with
- the trainable function q_θ(v, τ) provides a model for the partial derivative ∂s/∂τ.
- the trainable function r_θ(v, τ) provides a model for the gradient ∇_v s.
- Equation (72) represents the differential equation for how the score evolves over time during the reverse process, assuming the total-derivative model presented in this section.
- Eq. (72) is a formal mathematical statement.
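As a numerical illustration of the total-derivative model, here is a sketch (with toy stand-ins for q_θ, r_θ, and the reverse drift, none of which are from the disclosure) of evolving the score device alongside the reverse process:

```python
import numpy as np

def evolve_coupled(v0, s0, q, r, drift, n_steps=1000, dt=1e-3):
    """Euler integration of the coupled pair: the reverse process
    dv/dtau = drift(v, s) together with the score device
    ds/dtau = q(v, tau) + r(v, tau) * dv/dtau (total-derivative model),
    stepped simultaneously, as the two analog devices would evolve."""
    v, s = float(v0), float(s0)
    for k in range(n_steps):
        tau = k * dt
        dv = drift(v, s) * dt
        s += q(v, tau) * dt + r(v, tau) * dv
        v += dv
    return v, s

# Toy stand-in: score of a unit Gaussian, s(v) = -v, so q = 0 and r = -1.
# Starting consistent (s0 = -v0), the score device tracks -v exactly.
v, s = evolve_coupled(v0=2.0, s0=-2.0,
                      q=lambda v, t: 0.0, r=lambda v, t: -1.0,
                      drift=lambda v, s: v + s + 1.0)
```

The point of the construction is visible here: the score value is updated in lockstep with v, so no forward-pass latency enters the loop.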
- each component of dv/dτ may physically correspond to (or be proportional to) the voltage across a resistor that is in series with the capacitor in each unit cell of the reverse diffusion process.
- the voltage across each of these resistors is an instantaneous measurement of dv/dτ and can be fed to the score device.
- FIG. 16 is an overall schematic diagram for an example score device 340, which includes six (sets of) components for the total derivative approach: 1. voltage sources 341 whose outputs are r_θ(v, τ); 2. voltage mixers 342 that multiply the components of r_θ(v, τ) with the corresponding components of dv/dτ, which are physically encoded in voltage vectors; 3. a voltage source 343 whose output is q_θ(v, τ); 4.
- FIG. 16 depicts several circuits as black boxes. This includes the adder circuits 344 and 347, the integrator circuits 345, and the circuits 341 and 343 associated with the voltage sources q_θ(v, τ) and r_θ(v, τ). We elaborate on the form of these voltage sources in the next subsection.
- Figure 17 shows examples of an adder circuit (left) and an integrator circuit (right) suitable for use in the score device 340.
- FIG. 23 shows a general analog system for evaluating q or any of the components of r at positions v and times t > 0. As the score device is coupled to a trajectory v(t) of the forward process, we evaluate q and r at all space-time points (v(t), t) along the trajectory.
- Each layer is denoted by the pair {A, B}_k;
- A_k is a block implementing a parameterized affine transformation, and it is followed by an element-wise nonlinear transformation B_k.
- the number of layers K and the width of each layer M_k are left as hyperparameters.
- Figure 23 shows the electrical system that performs these layered transformations.
- The first affine sublayer A_1 is implemented by passing the voltage signals (v(t), t) through a network of resistors; for now, we limit ourselves to the case where the affine transformation is just a matrix multiplication (no added bias term). This is shown in Figure 24.
- the voltage at the node is the dot product of (v(t), t) and the vector of inverse resistance values (conductances) attached to each wire. If each wire carrying a voltage component signal (v, t)^(i) is placed in series with a resistor whose inverse resistance equals the (i, j)th entry of A_1, the voltage reading at the node where all wires meet will be the dot product between (v, t) and the jth row of A_1. If we have M_1 copies of (v(t), t), we can repeat this construction for each row of A_1 and output each component of the vector A_1(v(t), t) as M_1 voltage signals. The same idea applies to each layer A_k.
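A numeric sketch of this idealized construction (ignoring loading effects at the summing node, which in a passive resistor network would also rescale the output by the total conductance):

```python
import numpy as np

def resistor_layer(voltages, conductance_matrix):
    """Idealized resistor-network affine layer: entry (i, j) of the
    conductance matrix is the inverse resistance (in siemens) on the
    wire carrying input component i into summing node j, so each node
    voltage is a dot product of the input vector with a set of
    conductances -- an analog matrix-vector multiplication."""
    return np.asarray(voltages) @ np.asarray(conductance_matrix)

A1 = np.array([[2.0, 0.0],
               [1.0, 3.0]])           # conductances in siemens
out = resistor_layer([0.5, 1.0], A1)  # two summing-node voltages
```

Sparsifying the conductance matrix (omitting wires) is exactly the mechanism used later to encode problem geometry into the affine layers.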
- the nonlinear layer B_k can be constructed by attaching a diode followed by a resistor after each linear layer A_k, as shown in Fig. 23. Measuring the voltage at each resistor yields a nonlinearly transformed voltage, thanks to the nonlinearity of the characteristic function of the diode (or any other nonlinear electrical element).
- the current-voltage characteristic of a diode under forward bias resembles the activation function associated with a rectified linear unit (ReLU). This provides motivation for using a forward-biased diode, since ReLU is one of the most common activation functions used in neural networks.
- the output of this analog neural network is an N-dimensional voltage vector, corresponding to either r_θ^(i)(v, t) or q_θ, that is fed into voltage mixers or adders, as explained above and depicted in Fig. 16.
- ideally, the outputs r_θ^(i)(v, t) and q_θ(v, t) reach the mixers and adders at exactly time t, although a pass through the analog neural network could take some non-zero time.
- the electric fields generated by the input voltage signals propagate through the analog components of the neural network at the speed of light, so the output is computed almost instantaneously.
- the parameters θ are the values of tunable resistors used throughout the affine layers of the network. In the spirit of digital neural networks, the nonlinear layers with diodes have no tunable components.
- the resistors can be tuned to reduce or minimize a cost function L, whose evaluation can be done in the analog domain as explained in the next subsection.
- Another possibility is to consider an analytical form for q and r, whose parameters would be coefficients of a chosen series expansion. These would have a general form in which J_i are basis functions, the truncation orders are hyperparameters, and θ^(1) and θ^(2) are the sets of parameters to learn.
- Example 2: In the second example, we consider constructing the score device based on a simple circuit.
- the score device is composed of the following elements.
- Figure 22A shows a circuit diagram for a single unit cell. Here we break up the resistor into two separate resistors, since the resistors become inequivalent when we introduce the coupling between the unit cells.
- Figure 22B shows a circuit diagram for two unit cells coupled together, via a capacitor. This is essentially the same kind of coupling we introduced for the forward diffusion process. 1.
- the score device is composed of N unit cells. 2.
- Each unit cell contains three components in series: (1) a capacitor with capacitance C_θ,i, (2) a resistor with resistance R_θ,i, (3) a voltage source V_θ,i(x) whose output depends on the input vector x. 3. All three of these components are parameterized by parameters (hence the subscript θ) that can be trained during the training process. 4.
- the N unit cells are capacitively coupled to each other according to some connectivity matrix. This coupling can follow the same geometry as that used by the forward diffusion process, as described in Sec. 7 and Fig. 10. 5.
- the voltage source V_θ,i(x) is essentially an analog neural network, involving multiple layers, including affine layers and non-linear activation functions.
- For the connectivity between the unit cells we can use the same strategy as that employed in Fig.10, which involves switches that can be toggled off or on based on the connectivity matrix. In fact, we can choose the exact same connectivity matrix for the score device as that employed in the forward diffusion process.
- the voltage source V_θ,i(x) can have a very similar structure to that of the q and r sources discussed in Sec. 15.7. The main difference is that V_θ,i(x) does not explicitly depend on the time parameter t, and hence the trainable weights are time independent.
- V_θ(v) can be represented by a digital neural network.
- a DAC can convert the output of this neural network to an analog voltage, which is applied to the analog unit cells in the score device, and the analog unit cells produce the analog score used by the reverse diffusion process.
15.10 Incorporating problem geometry and inductive bias into the analog score device

- We have presented two approaches to constructing an analog score network (or hybrid digital-analog score network). In practice, both of these approaches benefit from accounting for problem geometry. Sec. 7 and Fig. 9 show that different problems have different geometries. Accounting for this geometry improves performance in many machine learning tasks, including generative modeling. In the context of analog score devices, problem geometry can be accounted for in the circuits used to implement the parameterized voltage sources. Specifically, this refers to the sources q_θ(v, τ) and r_θ(v, τ) in the device based on the total derivative, and the source V_θ(v) in the alternative score device based on the simple circuit.
- These voltage sources can be modeled as analog neural networks where each layer performs an affine transformation followed by a non-linear activation function.
- the affine transformation is represented by a matrix A, as discussed in Sec.15.7 and depicted in Fig.24.
- the problem geometry can be incorporated into each of the A matrices associated with these affine transformations.
- these A matrices can be sparse, with non-zero elements only corresponding to the connectivity of the problem.
- these A matrices can be chosen to have a similar structure to the adjacency matrix A from Sec.7.
- this provides a recipe for incorporating problem geometry into the analog voltage sources q_θ(v, τ), r_θ(v, τ), and V_θ(v), and hence into the analog score devices. Physically speaking, this corresponds to limiting the number of wires used in the circuit structure in Fig. 24. In other words, wires in Fig. 24 are included only where the adjacency matrix A has non-zero elements. Hence, the adjacency matrix can be used to guide the construction of the circuit in Fig. 24, to incorporate problem geometry.
- the conditional distributions may also be simple.
- the scores of the conditional distributions are affine functions whenever the drift term in the SDE is affine.
- the form of this affine function might not be known.
- One particular case where we can analytically solve for the score of the conditional distribution is when the matrix L(t), which is defined in Eq. (51), is a diagonal matrix.
- v(t) can be stored on the analog device.
- the matrix A(t) and vector b(t) can be stored digitally, e.g., on memory of a CPU.
- This provides two options for how to compute the SCD value: s_{x(t)|x(0)} = A(t)v(t) + b(t) (82)
- a first option is to send the vector v(t) through an analog-to-digital converter, and then compute Eq. (82) on a digital device (the same device that stores A(t) and b(t)).
- the resulting vector can then be fed back to the analog device with a digital-to-analog converter.
- a second option is to compute Eq. (82) on the analog device.
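A sketch of the first (digital) option for Eq. (82), with an illustrative ADC quantization model (the resolution and reference voltage are assumptions, not values from the disclosure):

```python
import numpy as np

def conditional_score_digital(v_analog, A_t, b_t, adc_bits=12, v_ref=1.0):
    """First option for Eq. (82): model the ADC by quantizing the analog
    voltages onto a uniform grid, then compute s = A(t) v(t) + b(t) on
    the digital device. The result would be sent back through a DAC."""
    levels = 2 ** adc_bits
    step = 2 * v_ref / levels                       # quantization step
    v_digital = np.round(np.asarray(v_analog) / step) * step
    return A_t @ v_digital + b_t

# Illustrative affine score parameters stored on the digital device
A_t = np.array([[-2.0, 0.0], [0.0, -2.0]])
b_t = np.array([0.1, -0.1])
s = conditional_score_digital([0.25, -0.5], A_t, b_t)
```

The second option avoids both conversions by performing the same matrix-vector operation with analog components.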
- the final step is to compute the three expectation values appearing in Eq. (9).
- the expectation over x(t) given x(0) involves sampling different x(t) from a fixed starting point x(0). Operationally, this involves running the forward process (with a fixed starting point) multiple times, say, K times, and then averaging over all runs.
- Let N_kl(t_m) ≡ ‖d_kl(t_m)‖² denote the norm squared of the score difference in Eq. (90) for the kth run of the forward process, with a fixed starting point x(0)_l, where the forward process is run up to a time t_m.
- the average in Eq. (92) can be computed digitally. This involves taking the voltage N_kl(t_m), passing it through an analog-to-digital converter, and then computing the average in Eq. (92) on a digital device, such as a CPU. Hence, we assume that E_lm is stored on a digital device. The next step is to compute the expectation value E_x(0).
- the digital device produces samples of x(0) from the data distribution p_data. Hence, we estimate this expectation value with an estimator. This involves producing L samples of x(0) on the digital device, and computing the average in Eq.
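The nested averaging described above (an inner average over K forward-process runs, an outer average over L samples of x(0)) can be sketched as a plain Monte Carlo estimator; `sample_x0` and `run_forward_norm` are hypothetical stand-ins for the data sampler and for one forward-process run returning a norm-squared score difference.

```python
def estimate_loss(sample_x0, run_forward_norm, K, L):
    """Nested Monte Carlo estimate: outer average over L samples of x(0),
    inner average over K runs of the forward process from each x(0)."""
    total = 0.0
    for _ in range(L):
        x0 = sample_x0()
        inner = sum(run_forward_norm(x0) for _ in range(K)) / K
        total += inner
    return total / L
```

In the hybrid scheme above, the inner quantity would come from the analog device (via an analog-to-digital converter) while the averaging runs on the digital device.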
- analog devices are capable of computing the time integral of the square of a voltage.
- this is known as a root-mean-square (RMS) voltage measurement.
- an unknown, time-dependent voltage heats a resistor R 1 .
- a digital device (e.g., a CPU) applies a DC voltage to an equal resistance R 2 until both resistors reach the same temperature.
- the DC voltage is then equal to the RMS voltage of the unknown source.
- the temperature sensing can be carried out by two semiconductor diodes, which can be viewed as thermistors.
- the overall circuit diagram is shown in Fig. 19, and also includes an op amp. The idea is to apply this subroutine to the voltage source which is the ith component of the voltage vector d ⁇ (t).
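The quantity the thermal-balance circuit of Fig. 19 extracts can be stated numerically: the RMS value of a time-dependent voltage is the DC voltage with the same heating power. A simple numerical sketch (function name and step count are illustrative):

```python
import math

def rms_voltage(v, T, n=10000):
    """Numerically approximate the RMS value of a time-dependent voltage v(t)
    over [0, T]: sqrt of the time-averaged squared voltage. The analog circuit
    of Fig. 19 finds the DC voltage with equal heating power, which equals this."""
    dt = T / n
    mean_sq = sum(v(i * dt) ** 2 for i in range(n)) * dt / T
    return math.sqrt(mean_sq)

# A sine wave of amplitude 1 has RMS 1/sqrt(2)
rms = rms_voltage(lambda t: math.sin(2 * math.pi * t), 1.0)
```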
- Fig. 20 illustrates a hybrid analog-digital feedback loop 2000 for training the parameters of the score device 340.
- the analog device evaluates the loss function (or its gradient) for a fixed value of the parameters ⁇ (2210). The result of this evaluation is sent to the digital device. The digital device chooses a new value for the parameters ⁇ ′, based on some optimization routine (2220).
- Examples of such optimization routines include gradient descent, stochastic gradient descent, and gradient-free methods like Nelder-Mead.
- the new values ⁇ ′ for the parameters are then programmed into the analog device in order to evaluate the loss function (or its gradient) again.
- This feedback loop 2000 can be iterated multiple times until some convergence criterion is reached.
- the convergence criterion could say that the optimization terminates when the loss function fails to decrease substantially in value for several iterations in a row.
- denote the final value of the parameters, after the convergence criterion is met, as ⁇ ⁇.
- This final set of parameters ⁇ ⁇ can be uploaded onto the score device for use in the reverse process.
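The digital side of this feedback loop can be sketched as follows; `evaluate_loss` stands in for a query to the analog loss evaluator, `propose` for the optimization routine, and the convergence criterion is the one described above (terminate when the loss fails to decrease substantially for several iterations in a row). All names and defaults are illustrative.

```python
def train_loop(evaluate_loss, propose, theta0, patience=5, max_iters=1000, tol=1e-6):
    """Hypothetical digital side of feedback loop 2000: query the (analog)
    loss evaluator, propose new parameters, stop after `patience` iterations
    without substantial improvement."""
    theta, best = theta0, evaluate_loss(theta0)
    stall = 0
    for _ in range(max_iters):
        cand = propose(theta)
        loss = evaluate_loss(cand)
        if loss < best - tol:
            theta, best, stall = cand, loss, 0
        else:
            stall += 1
            if stall >= patience:
                break
    return theta, best
```

The returned parameters play the role of the final parameter set uploaded onto the score device for use in the reverse process.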
- Figure 27 shows two possible unit cell constructions: a unit cell based on voltage mixers and a unit cell based on variable resistors. The reason for the relatively complicated circuits is that the score value should be multiplied by a prefactor of g(t)², where g(t) is the prefactor on the diffusion term.
- FIG. 16 Flowchart of entire system with analog score network
- Figure 21 shows a flowchart of the entire thermodynamic AI system, whenever both the diffusion process (forward and reverse process) as well as the score network all correspond to analog devices.
- the analog score device interfaces with the reverse process, providing score values that are used by the reverse process.
- This flowchart can be compared to the flowchart in Fig. 15, which illustrates the case of a digital score network. Overall the two flowcharts are similar, although more steps are performed by analog components in Fig. 21.
- an analog SDE integrator can solve the reverse SDE over the same interval in exactly T seconds. Moreover, the computation time should remain T seconds regardless of the SDE's dimension.
- for analog SDE integration, we show how an analog score network ensures that the solutions of the reverse process provided by an analog integrator are accurate.
- the analog score network addresses the issue of latency in continuous trajectories of the reverse process. To be clear, the latency issue encountered during the analog reverse process is not an issue for a digital device simulating the same process because a digital system solves the reverse SDE with discrete time steps, and the time to query the score network does not affect the discrete solver’s solution. Solving the latency issue ensures that the reverse process can be solved accurately with our analog SDE integrator.
- the reverse process can be solved much faster on an analog device, and the analog score network can ensure such solutions rival the accuracy of digital solutions that take much longer to obtain.
- the analog system solving the reverse process gains the following advantages when it includes an analog score network: •
- the drift term of the reverse process need not be computed with a digital score network. Hence, there is no latency from the digital evaluation of the score network with a general neural network. (see Fig.3). As such, the drift term in the reverse process can more accurately follow its true shape along a trajectory. • No latency from the DACs to be used at each time step of the forward and reverse processes.
- the only DACs needed are those used for the initialization, where a data vector x is converted to a voltage vector v, and for the collection of v at the end of the time evolution.
- the size of the analog score device may be small (it may have fewer variational parameters), since this approach is based on splitting the terms that contribute to the derivative of the score function. Splitting terms results in tracking simpler contributions, which may use fewer resources.
- with an analog score network we can solve for the score along a continuous trajectory of points, where these continuous evaluations may be ported into another analog device which evaluates loss functions against it (depending on its exact form; see section 15.13).
- Deep learning
- A deep learning (DL) system uses neural networks (NNs) to extract high-level features about a dataset that are useful for classifying the data in the dataset.
- Prototypical example applications of deep learning are classifying images of handwritten digits or classifying images of cats and dogs.
- uncertainty quantification (UQ) aims to quantify the uncertainty of the predictions made by the neural network.
- UQ is useful for high-stakes applications (e.g., cancer detection in medicine) because it provides guidance for when the user should defer to human judgement over the machine’s predictions.
- UQ is widely recognized as making machine learning more reliable and trustworthy.
- Several different methods exist for UQ in machine learning. A simple example of UQ is adding confidence intervals to the predictions made by the neural network.
- a more sophisticated and rigorous approach to UQ is the Bayesian framework.
- the Bayesian framework quantifies uncertainty by accounting for prior knowledge (often called the prior distribution) and updates that knowledge due to data or observations (often called the posterior distribution). Bayesian methods aim to quantitatively capture knowledge in the form of probability distributions.
- 21 Bayesian neural networks (BNNs)
- Neural networks are machine learning models that typically have multiple layers of linear and non- linear transformations, allowing them to express a wide variety of potential functions.
- Bayesian neural networks allow for uncertainty quantification on the predicted outputs of the neural network. This improves the reliability and trustworthiness of the model.
- Figure 29 illustrates differences between a Bayesian neural network (right) and a standard neural network (left). In the Bayesian case, the weights do not have definite values.
- D is the training data
- w are the weights (i.e., parameters) of the BNN
- x represents a test input data point
- y represents the output.
- p(y | x, w) is the predictive distribution for a given value of the weights w
- p(w | D) is the posterior distribution on the weights after training with data D.
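These distributions combine into the Bayesian predictive output p(y | x, D) = ∫ p(y | x, w) p(w | D) dw, which is typically approximated by Monte Carlo averaging over weight samples. A minimal sketch, where `predict` and `sample_weights` are hypothetical stand-ins for the network's forward pass and for a sampler of the posterior p(w | D):

```python
def predictive_mean(predict, sample_weights, x, K=1000):
    """Monte Carlo estimate of the Bayesian predictive output:
    E[y | x, D] ≈ (1/K) Σ_k predict(x, w_k), with w_k ~ p(w | D)."""
    return sum(predict(x, sample_weights()) for _ in range(K)) / K
```

The spread of the individual predictions around this mean is what supplies the uncertainty quantification absent from a standard (non-Bayesian) network.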
- Variational inference is a form of Bayesian inference that involves postulating an ansatz Q (i.e., a family of possible solutions) for the posterior distribution. Optimizing over the distributions q(w) in the ansatz Q yields an approximate posterior distribution.
- the Evidence Lower Bound (ELBO): minimizing or reducing the KL divergence is equivalent to maximizing or increasing the so-called evidence, which is given by log p(D) = log ∫ p(D, w) dw.
- the integral in the evidence formula is typically intractable to compute, in some cases taking exponential time to compute. Therefore, it is common to maximize or increase a more tractable quantity, which lower bounds the evidence.
- Stochastic Variational Inference combines natural gradients with stochastic optimization.
- the idea behind SVI is to use a cheaply computed, noisy, unbiased estimate of the natural gradient inside of a gradient ascent optimization.
- the natural gradient is different from the standard, Euclidean gradient, as it accounts for the geometry of the problem.
- SVI works with an unbiased estimator of the natural gradient, rather than the natural gradient itself.
- Neural Ordinary Differential Equations (Neural ODEs)
- Neural Stochastic Differential Equations (Neural SDEs) represent a continuous-depth version of Bayesian neural networks. Once again, a continuous-depth version of a BNN evolves according to a system of coupled differential equations. However, the system is stochastic in nature: dw_t = f_w(t, w_t) dt + g_w(t, w_t) dB, where dB is a Brownian motion term.
- SDE for Prior Distribution The prior distribution can be formulated over the weights of a BNN as an SDE.
- the SDE can take the forms of the function f w (t, w t ) and g w (t, w t ) that appear in Eq. (112).
- f_w(t, w_t) ∝ w_t
- g_w(t, w_t) ∝ I_d.
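Under these choices the prior weight SDE can be simulated by Euler-Maruyama integration. The sketch below assumes the common Ornstein-Uhlenbeck form dw = −w dt + σ dB (taking the proportionality constant for the drift to be −1, an assumption for illustration; names and values are not from the patent):

```python
import math
import random

def diffuse_prior(w0, sigma=1.0, T=1.0, n=1000, seed=0):
    """Euler-Maruyama simulation of a simple prior weight SDE,
    dw = -w dt + sigma dB. A numerical sketch, not the patented circuit."""
    rng = random.Random(seed)
    dt = T / n
    w = list(w0)
    for _ in range(n):
        for i in range(len(w)):
            w[i] += -w[i] * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return w
```

Setting sigma to zero recovers pure exponential decay of the weights, which is a quick sanity check on the integrator.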
- the SDE system is then obtained by inserting these choices into Eq. (112).
- 23.7 SDE for Posterior Distribution
- the posterior distribution can also be formulated over the weights of a BNN as an SDE.
- the posterior distribution should be highly expressive.
- the SDE can be more complicated than the SDE of a prior distribution.
- the trainable parameters in this model include both the parameters ⁇ appearing in the drift term as well as the initial condition w 0 on the weights.
- the neural network NN ⁇ can be referred to as the Posterior Drift Network (PDN), since it determines the drift associated with the posterior distribution.
- Inductive bias refers to prior knowledge being inputted into the structure of the model, for example knowledge of symmetries in the data. Having a strong inductive bias can reduce the training data requirements as well as improve the speed of training of the model.
- the third difficulty refers to the challenge of using digital hardware to simulate time dynamics.
- Thermodynamic AI System for Bayesian Deep Learning. Figure 31 illustrates a thermodynamic AI system for Bayesian deep learning with subsystems that can perform four subroutines.
- Each subsystem can be implemented as a physical analog device (subsys- tem/component) that performs a corresponding subroutine. Some subset of the subroutines may also be stored in and processed on a digital device.
- These four subsystems include: 1. a Weight Diffuser (WD) 3310; 2. a Hidden Layer Network (HLN) 3320; 3. a Posterior Drift Network (PDN) 3330; and 4. a Loss Evaluator (LE) 3340.
- the WD also communicates back-and-forth with the Posterior Drift Network (PDN): the WD feeds weight values to the PDN and the PDN feeds drift values to the WD.
- the Loss Evaluator (LE) takes in signals from all three of the other subsystems—the HLN, the WD, and the PDN—in order to evaluate the loss function.
- the Hidden Layer Network. 25.1 Overview of HLN: The Hidden Layer Network (HLN) is represented by a differential equation, and the output of the HLN is given by an integral equation.
- Attorney Docket No. NORM-002WO01 The HLN can be viewed as a neural ordinary differential equation (neural ODE).
- This neural ODE can be stored and processed on a digital device or can be implemented on an analog device.
- 25.2 Digital Hidden Layer Network One possible setup for the overall system is for the HLN to be stored and processed on a digital device, while the other devices (WD, PDN, and LE) have analog components.
- the digital device that stores and processes the HLN could be a central processing unit (CPU) or field programmable gate array (FPGA).
- This setup involves conversion between digital and analog signals.
- Drop-in uncertainty quantification (drop-in UQ)
- the name drop-in UQ is inspired by the idea of providing a non-invasive service whereby uncertainty quantification is added as a feature on top of an existing (e.g., digital) architecture.
- a physical device can offer a drop-in UQ service where uncertainty quantification is provided to the user’s application. This would involve interfacing the user’s HLN and dataset with our physical device (which includes the WD, PDN, and LE).
- each input vector x (m) is N -dimensional, meaning that there are N features associated with each datapoint.
- the Unit Cell: An analog HLN can have N unit cells, corresponding to one unit cell for each data feature.
- Figure 32 shows a possible architecture for a unit cell in the HLN. The voltage across the capacitor represents the state variable h t .
- the resistor in the unit cell provides the time derivative term; the capacitor and voltage source provide the linear drift term (linear in h_t); and the non-linear element (NLE), such as a diode or transistor, provides a non-linear drift term.
- Figure 33 shows two unit cells for the HLN that are coupled together via a resistive bridge.
- v̇ = C⁻¹ [J_s − J v − I_NL],   (122)
- A is an adjacency matrix that represents the problem geometry.
- the matrices A and J allow the problem-specific geometry to be built into the circuit. This can improve performance of the model, since accounting for problem geometry leads to an inductive bias for the model, which often improves the trainability and generalization of the model.
- switches e.g., voltage-gated transistors
- switches can upload the problem geometry onto the circuit connectivity, as illustrated in Figure 10.
- the differential equations in Eq. (122) are similar to the equation for a neural network in part because neural networks typically involve alternating layers of affine transformations followed by non- linear transformations. Taking the limit where the layers are infinitesimally small, it appears as if the affine transformation and non-linear transformation happen simultaneously. This limit of infinitesimally small layers is the limit associated with neural ODEs, and hence is the limit that our HLN corresponds to. Therefore, the right-hand-side of Eq. (122) involves affine and non-linear transformations acting at the same time.
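The dynamics of Eq. (122) can be sketched with a simple Euler integrator; the diode-style exponential current used for the NLE, and all parameter values, are illustrative assumptions rather than the patent's circuit values.

```python
import math

def simulate_hln(v0, J, Js, C=1.0, T=1.0, n=1000):
    """Euler integration of the unit-cell dynamics of Eq. (122),
    v' = C^-1 [J_s - J v - I_NL(v)], with a Shockley-style diode
    standing in for the non-linear element (illustrative)."""
    def i_nl(v):  # diode current with a small saturation current
        return 1e-3 * (math.exp(v) - 1.0)
    dt = T / n
    v = list(v0)
    for _ in range(n):
        v = [vi + dt * (Js[i] - sum(J[i][j] * v[j] for j in range(len(v)))
                        - i_nl(vi)) / C
             for i, vi in enumerate(v)]
    return v
```

With a single cell, no source current, and unit self-coupling, the voltage relaxes toward zero with a small extra decay contributed by the non-linear element, illustrating the simultaneous affine and non-linear action described above.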
- choosing the non-linear element (NLE) to be a diode mimics these kinds of activation functions.
- the current-voltage characteristic for a transistor can be non-linear, with the form of the function depending on whether one is looking at the input or output characteristic.
- the input characteristic is convex with the current rising sharply with the applied voltage.
- the corresponding activation function shows a saturation effect, like the sigmoid function and the hyperbolic tangent functions.
- x could be a feature vector x (m) from the training dataset.
- This encoding may involve a digital-to-analog converter, which converts a digital signal associated with x into an analog signal.
- x can be stored on the analog HLN by appropriately charging the capacitors in the HLN.
- the jth feature of x is mapped to the capacitor in the jth unit cell, possibly with some permutation of the indices.
- the notation v represents the physical vector of voltages across the capacitors in the unit cells and also corresponds mathematically to the vector h associated with the hidden layer values.
- v and h are essentially interchangeable from a notation perspective.
- 25.6 Analog Augmented HLN
- a feature vector x can be encoded into the initial hidden layer value h 0 .
- x can be encoded into a larger dimensional space, by padding the vector with zeros.
- This concept leads to what is known as augmented neural ODE.
- Augmented neural networks aim to expand the dimensionality of the feature space in order to more easily separate the data. Neural networks (without augmentation) may not be able to implement functions where trajectories intersect and hence neural ODEs may not be able to represent all functions. Augmented neural networks represented by augmented neural ODEs are useful for addressing this issue.
- Figure 34 illustrates an analog version of an augmented neural network that executes an augmented neural ODE.
- the concept illustrated in Figure 34 can be used as the basis for constructing the HLN, which can be thought of as an augmented HLN.
- the augmented HLN could have some non-trivial connectivity between the N unit cells in data space and the N a unit cells in the additional space. This is illustrated in Figure 34.
- the rest of the protocol can proceed similarly to the case without augmentation. 25.7
- the goal is to convert h 1 into a prediction for the output y.
- a map which could be a probabilistic map, that maps h 1 to y.
- This map can be described via the conditional distribution p(y | h_1).
- the quantity appearing in Eq. (128) is the same quantity appearing in Eq. (108).
- the subsystem or device associated with this process is called the Output Predictor (OP).
- the OP takes h 1 as its input and outputs a y value, via either a deterministic or probabilistic function.
- the OP could be implemented with an analog device or a digital device. If h 1 has an analog form and y has a digital form, then the OP can use analog-to-digital converters and digital-to-analog converters to convert between the different types of signals. Alternatively, it is possible for h 1 and y to either both be analog or both be digital, in which case no conversion (between different signal types) is necessary.
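As one hypothetical realization of the OP's probabilistic map p(y | h_1), a digital implementation for classification could apply an affine map followed by a softmax; the weights W and biases b below are illustrative placeholders, not components described in the patent.

```python
import math

def output_predictor(h1, W, b):
    """Softmax map giving a conditional distribution p(y | h_1) over labels."""
    logits = [sum(W_row[j] * h1[j] for j in range(len(h1))) + b_i
              for W_row, b_i in zip(W, b)]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]
```

Sampling a label from the returned distribution gives a probabilistic OP; taking the argmax gives a deterministic one.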
- the weights can be modeled as undergoing a diffusion process, which adds uncertainty to the weights.
- a physical thermodynamic device called a Weight Diffuser (WD) diffuses the weights over time.
- the thermodynamic aspect of the WD device is supplied by an analog stochastic noise source, which leads to physical diffusion of the system’s state variable.
- WD device for the prior distribution.
- WD device for the posterior distribution.
- the two devices can be represented by a single device that is equipped with switches, to toggle between the prior and posterior cases. 26.2 Prior Weight Diffuser
- the prior distribution is typically assumed to be relatively simple, and consequently the WD device for the prior distribution is also relatively simple.
- the prior weight diffuser may not be necessary, since the simplified loss function in Eq. (115) does not involve the prior distribution.
- the loss function in Eq. (110) does involve the prior distribution, and hence the prior weight diffuser would be useful in this case.
- the prior weight diffuser is an optional component in the overall system for Bayesian deep learning.
- the Unit Cell of a Prior Weight Diffuser. Figure 35 provides a circuit diagram for the unit cell of the prior weight diffuser.
- the unit cell includes a capacitor C_i, a resistor R_i, and a stochastic noise source B_i, all of which are in series.
- the voltage across the capacitor in the unit cell is denoted as v ⁇ i .
- the abstract weight w_i is encoded in the physical voltage v̄_i, which means that the dynamical state variable for the WD device is the voltage vector v̄. Because there is a stochastic noise source in the circuit, the state variable evolves according to a stochastic differential equation (SDE).
- Figure 36 shows the case of capacitive coupling between two unit cells.
- Each cell may or may not be coupled to other cells, where the choice of whether or not to couple two cells can be based on problem geometry.
- the various objects are defined as: R ≡ diag(R_1, R_2, ..., R_W), σ ≡ diag(σ_1, σ_2, ..., σ_W), v̄ ≡ (v̄_1, v̄_2, ..., v̄_W), dB ≡ (dB_1, dB_2, ..., dB_W).
- the matrix C has elements given by where the adjacency matrix A appears here again.
- the adjacency matrix for the WD can be different than the adjacency matrix for the HLN. For simplicity, we present the case where they are the same, although they could be different.
- the architecture for the Posterior Weight Diffuser is similar to that of the Prior Weight Diffuser.
- the unit cell for the posterior weight diffuser is illustrated in Figs. 37A and 37B. This includes the case where the posterior drift network (PDN) is analog (Fig. 37A), and also includes the case where the PDN is digital (Fig. 37B).
- for the digital PDN, an analog-to-digital converter converts the measured voltage on the capacitor to a digital signal, and a digital-to-analog converter converts the output of the PDN to an analog voltage that is applied inside the unit cell.
- the analog PDN operates without converters.
- the unit cell is similar to that of the prior weight diffuser.
- capacitive bridges can connect the unit cells, and the connectivity can be chosen according to an adjacency matrix A and hence according to the problem geometry. Again, this is illustrated by Figure 10. It is reasonable to choose the same connectivity for the prior weight diffuser and the posterior weight diffuser.
- the overall SDE for a state vector associated with the W unit cells is given by:
- a difference here, relative to the prior weight diffuser, is the addition of the drift term s^(θ)(v̄, t), which can be a complicated function of the time t and the state vector v̄.
- the Posterior Weight Diffuser is parameterized with parameters ⁇ that appear in the drift term.
- the initial condition also has parameters.
- ⁇ represents the parameters
- v ⁇ ( ⁇ ) (0) and w ( ⁇ ) (0) are, respectively, the physical version and abstract version of the initial condition.
- the Posterior Weight Diffuser can be written as an integral equation:
- During the training process, a digital processor stores the values of the parameters and proposes updates to these parameters during an optimization routine. This is discussed more below.
- the initial condition v ⁇ ( ⁇ ) (0) can be set by the digital processor.
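The posterior weight diffuser dynamics (diffusion plus a parameterized drift supplied by the PDN) can be sketched with Euler-Maruyama integration; here `drift` is a hypothetical stand-in for the PDN output s^(θ)(v̄, t), and the parameter values are illustrative.

```python
import math
import random

def diffuse_posterior(w0, drift, sigma=0.1, T=1.0, n=1000, seed=0):
    """Euler-Maruyama sketch of the posterior weight diffuser:
    dw = drift(w, t) dt + sigma dB, with `drift` playing the role
    of the PDN's parameterized drift term."""
    rng = random.Random(seed)
    dt = T / n
    w = list(w0)
    for step in range(n):
        t = step * dt
        d = drift(w, t)
        for i in range(len(w)):
            w[i] += d[i] * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return w
```

The initial condition w0 corresponds to the trainable initial condition set by the digital processor, and the drift parameters are the ones updated during training.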
- consider the stochastic noise source in the Weight Diffuser as an abstract device.
- This abstract circuit element is illustrated as B i in Figures 35 and 37.
- the stochastic noise source can be a thermal noise source, a shot noise source, or a source of both thermal noise and shot noise.
- a variable amplifier can amplify the output of the stochastic noise source, making it possible to tune the standard deviation ⁇ appearing in the differential equation (e.g., as in Eqs. (132) and (134)).
- One type of possible stochastic noise is thermal noise.
- Thermal noise also called Johnson-Nyquist noise, comes from the random thermal agitation of the charge carriers in a conductor, resulting in fluctuations in voltage or current inside the conductor.
- the amplitude of the voltage fluctuations can be controlled by changing the temperature or the resistance.
- a thermal noise source can be implemented using a large resistor in series with a voltage amplifier.
- a thermal noise source can be implemented using a tunable resistor (e.g., a transistor operated in the linear regime) where the noise amplitude scales with the value of the resistance.
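The dependence of the noise amplitude on temperature and resistance follows the standard Johnson-Nyquist formula, v_rms = sqrt(4 k_B T R Δf), which can be evaluated directly (the example component values are illustrative):

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def johnson_noise_vrms(R, T, bandwidth):
    """RMS thermal-noise voltage across a resistor:
    v_rms = sqrt(4 k_B T R Δf)."""
    return math.sqrt(4 * K_B * T * R * bandwidth)

# A 1 MOhm resistor at 300 K over a 10 kHz bandwidth gives roughly 13 µV RMS
v = johnson_noise_vrms(1e6, 300.0, 1e4)
```

This makes concrete how raising either the temperature or the resistance raises the noise amplitude, which is the tuning knob described above.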
- shot noise arises from the discrete nature of charge carriers and from the fact that the probability of a charge carrier crossing a point in a conductor at any time is random. This effect is particularly important in semiconductor junctions where the charge carriers should overcome a potential barrier to conduct a current. The probability of a charge carrier passing over the potential barrier is an independent random event. This induces fluctuations in the current through the junction.
- the amplitude of the current fluctuations can be controlled by changing the magnitude of the DC current passing through the junction.
- a source of shot noise can be implemented using a pn diode (for example, a Zener diode in reverse bias configuration) in series with a controllable current source.
- FIG. 39 illustrates how to input weights from the WD to the HLN.
- the first step is to parameterize the HLN in terms of voltages.
- the HLN involves variable resistors that can be implemented by voltage-controlled circuit elements.
- Figure 39 shows the variable resistors as voltage-gated transistors, where the gate voltage controls the value of the resistance.
- the variable resistances are:
- the functions f j and f jj are device-dependent functions that translate the gate voltage into a resistance.
- the voltage vectors and the voltage matrix represent the set of parameters used by the HLN device
- this set of parameters, given in Eq. (142), depends on time t
- the notation for this set of parameters corresponds precisely to the output vector of the WD device, given in Eq. (140).
- a permutation function is applied in Eq. (140) to account for a routing scheme that maps the outputs of the WD device to the inputs of the HLN device.
- the Posterior Drift Network plays a central role in generating the posterior distribution for the weights.
- the PDN takes in real-time measurements of the weights and outputs drift values as voltages to be applied in the WD unit cells.
- 28.2 PDN as a digital neural network
- One approach is to use a digital device to store and process the PDN. This digital device could be a CPU or an FPGA.
- FIG 41 shows how the PDN interacts with the unit cells of the WD, in the case that the PDN is stored on a digital device.
- the PDN takes in the entire weight vector as input to the neural network.
- Analog-to-digital and digital-to-analog converters are used to appropriately convert the signals between the analog and digital domains.
- a digital PDN has the advantage of being flexible in its design, since the construction can be modified in software (rather than in hardware).
- digital neural networks often allow for more parameters and hence more expressibility than their analog counterparts.
- FIG. 42 shows a possible architecture for an analog PDN.
- This analog neural network has alternating layers of affine transformation followed by non-linear transformations.
- the non-linear transformation is illustrated in Figure 42. This can involve a non-linear element (NLE) that is in series with a resistor.
- the NLE could correspond to a diode or a transistor, and Sec. 25.4 discusses these NLEs in more detail.
- the resistor in series with the NLE makes it possible to read off the current through the NLE as an output voltage.
- the output voltage is a non-linear function of the input voltage, which is the desired feature of activation functions that are commonly used in neural networks.
- Figure 43 elaborates on possible affine transformations. This can involve a layer of resistors in series with the inputs, followed by a wire combining the outputs. This functions to produce an output voltage that is a weighted average of the input voltages. This produces a particular affine transformation.
- the resistances in the affine layer are free parameters that get trained during the optimization process. Moreover, these resistances can be time dependent. Each resistance can be expressed as a time-dependent function that is parameterized in some way. For example, the resistance can be expressed as a linear function of time, in which case the slope and intercept of that linear function would be trainable parameters. Physically speaking, the resistors can be transistors, and the time-dependent resistances could be implemented with time-dependent voltages applied to the gates of the transistors. Overall, this PDN construction has the benefit of essentially no latency.
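The weighted-average behavior of the resistive affine layer described above can be sketched in idealized form (ignoring loading and output impedance; function name is illustrative): joining N inputs through series resistors yields a conductance-weighted average of the input voltages.

```python
def affine_layer(v_in, R):
    """Idealized output of the resistor-averaging stage: a
    conductance-weighted average of the input voltages."""
    g = [1.0 / r for r in R]  # conductances
    return sum(gi * vi for gi, vi in zip(g, v_in)) / sum(g)

# Equal resistors give the plain average; unequal resistors weight
# the lower-resistance (higher-conductance) inputs more heavily.
out = affine_layer([1.0, 3.0], [1.0, 1.0])  # → 2.0
```

Making each R a trainable (and possibly time-dependent) value is what turns this passive stage into a parameterized affine transformation.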
- the total derivative formula can be written as ds/dt = ∂s/∂t + (∇_w s) · (dw/dt).
- ⁇ w denotes the gradient with respect to w.
- the total derivative formula captures the dependence on both t and w.
- it provides a recipe for how to design an analog PDN.
- Equation (145) yields a model for the drift function, with:
- the trainable function q^(θ)(t, w) provides a model for the partial derivative ∂s/∂t, and the trainable function r^(θ)(t, w) provides a model for the gradient ∇_w s.
- Figure 44 provides a schematic diagram of a circuit 4440 that can output drift values from the PDN, using the total derivative approach. This circuit has several components: 1. voltage sources 4441 whose outputs are r^(θ)(t, w); 2.
- integrator circuits 4445 whose outputs i^(θ)(t′) are the time integrals of the signals e^(θ)(t, w, dw/dt).
- Figure 44 depicts several circuits as black boxes.
- FIG. 17 shows examples of an adder circuit (left) and an integrator circuit (right) suitable for use in the score device 340.
- the voltage vector input to each voltage mixer can be obtained as follows.
- the unit cell has a resistor that is in series with the capacitor, as shown in Figure 37. The voltage across this resistor is proportional to the time derivative of the capacitor's voltage, and hence is proportional to dw/dt.
- a circuit can provide the voltage sources q^(θ)(t, w) and r^(θ)(t, w).
- q^(θ)(t, w) and r^(θ)(t, w) can be constructed based on the circuit structure shown in Fig. 42, and the affine layer can be constructed in a manner similar to that shown in Fig. 43.
- 28.5 Incorporating problem geometry into the PDN. There are several different approaches to constructing the PDN. These approaches can benefit from accounting for problem geometry.
- the problem geometry can be incorporated into the circuit structures shown in Figures 42 and 43.
- Each affine transformation in Figures 42 and 43 is mathematically represented by some matrix A.
- the problem geometry can be incorporated into each of the A matrices associated with these affine transformations.
- these A matrices can be sparse, with non-zero elements only corresponding to the connectivity of the problem. In other words, one can choose these A matrices to have a similar structure to the adjacency matrix A that has been discussed previously.
- Matching the zero elements of the A matrices to the zero elements of the adjacency matrix A provides a recipe for incorporating problem geometry into the analog PDN. Physically speaking, this corresponds to limiting the number of wires used in the circuit structure in Fig. 43. In other words, the wires in Fig. 43 can be bundled together if the adjacency matrix A has non-zero elements associated with those wires. Hence, the adjacency matrix can be used to guide incorporation of problem geometry in the construction of the circuit in Fig. 43.
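The matching of zero patterns described above has a direct software analogue: zero out weight-matrix entries wherever the adjacency matrix is zero, so each affine layer only uses connections permitted by the problem geometry. A minimal sketch (function name is illustrative):

```python
def mask_with_adjacency(W, A):
    """Zero out entries of the weight matrix W wherever the adjacency
    matrix A is zero, restricting the affine layer to the problem's
    connectivity (a software analogue of bundling wires)."""
    return [[w if a else 0.0 for w, a in zip(W_row, A_row)]
            for W_row, A_row in zip(W, A)]
```

Sparse masks of this kind are what give the model its inductive bias, mirroring the reduced wiring of the analog circuit.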
- An unbiased estimator can compute the loss function (or its gradient). This estimator can be low precision, which conserves resources. Next, consider some estimators for the loss function. Consider sampling from q(w): suppose we draw K samples, w^(k) ∼ q(w), with k = 1, ..., K.
- An unbiased estimator for the log likelihood can be based on inverse binomial sampling (IBS).
- L (m,k) is a random variable.
- the estimator is unbiased: E(L̂^(m,k)) = L^(m,k).
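Inverse binomial sampling gives an unbiased estimate of a log-probability by repeated simulation: run the model until its output first matches the observation; if the first "hit" occurs on trial K, then −Σ_{k=1}^{K−1} 1/k is an unbiased estimate of log p(hit). A sketch, where `simulate_hit` is a hypothetical stand-in returning True with the (unknown) hit probability:

```python
import random

def ibs_log_likelihood(simulate_hit):
    """Inverse binomial sampling (IBS) estimate of log p, where p is the
    probability that simulate_hit() returns True. Unbiased: the estimate
    is -(1/1 + 1/2 + ... + 1/(K-1)) with K the trial of the first hit."""
    k = 1
    while not simulate_hit():
        k += 1
    return -sum(1.0 / i for i in range(1, k))
```

Averaging many independent IBS estimates converges to the true log-likelihood, which is what makes it suitable as an unbiased ingredient in the loss estimator.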
- Another suitable estimator is based on fixed sampling. Although it is biased, it may be easier to estimate with analog hardware, compared to the previous estimator.
- ⁇ (m,k) is an estimator for ⁇ (m,k) , although it can have a bias that becomes more pronounced when is small.
- let L̂ be an estimator for L, defined as follows:
- computing this estimator can involve the following protocol: 1. choose a datapoint x^(m) from the dataset D; 2. use this datapoint to initialize the input h(0) of the HLN device; 3.
- An amplifier 4510 coupled to the weight diffuser 3310 amplifies the weights and provides them to a componentwise adder 4520, which adds the amplified weights to drift terms from the posterior drift network 3330.
- The digital processor 4540 multiplies d(t) by g⁻¹, then takes the norm squared of the resulting vector, then integrates over time, and finally sums over all samples (sum over k). The result of this computation is L₁.
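A toy software analogue of this digital post-processing step (the scalar g⁻¹ and the sampled voltage traces below are illustrative assumptions, not the disclosed hardware; in general g may be matrix-valued):

```python
def loss_term_ell1(d_samples, g_inv, dt):
    """Digitally post-process sampled analog voltages d(t) as described:
    scale by g^{-1}, take the squared norm, integrate over time, sum over k.
    d_samples[k][t] is the voltage vector for sample k at time step t."""
    total = 0.0
    for d_t in d_samples:                      # sum over samples k
        integral = 0.0
        for d in d_t:                          # integrate over time
            scaled = [g_inv * x for x in d]    # multiply d(t) by g^{-1}
            integral += sum(x * x for x in scaled) * dt   # squared norm
        total += integral
    return total

# toy check: one sample, constant d(t) = (1, 1), g = 2, over 10 steps of dt = 0.1
ell1 = loss_term_ell1([[[1.0, 1.0]] * 10], g_inv=0.5, dt=0.1)
# (0.5^2 + 0.5^2) * (10 * 0.1) = 0.5
```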
- Analog time integration for the L₁ term involves performing more of the computation on the analog device. After all, analog systems are efficient at computing time integrals. Hence we can compute the time integral in Eq. (160) on the analog device.
- Figure 46 shows a system that uses analog time integration to compute L₁. Like the system shown in Figure 45, the system in Figure 46 includes the amplifier 4510 and adder 4520 for generating the analog voltage vector d(t) as described above.
- analog circuitry e.g., mixers 4610
- a voltage source generates an unknown time-dependent voltage.
- This voltage source heats a resistor R₁.
- a digital device e.g., a CPU
- the DC voltage is then equal to the RMS voltage of the voltage source.
- the temperature sensing can be carried out by two semiconductor diodes, which can be viewed as thermistors.
- the overall circuit diagram is shown in Fig. 19, and also includes an op amp.
- the Loss Evaluator discussed above can be used as a subroutine in the training process for a Bayesian neural network.
- This training process takes the form of a hybrid analog-digital process involving back-and-forth communication between the analog device (including the HLN, WD, PDN, and LE) and a digital processor.
- the analog device evaluates the loss function for a fixed value of the parameters, and then sends this loss function value to the digital processor.
- the digital processor then updates the parameters to new values, based on an optimization routine.
- This optimization routine could be chosen in a variety of ways, including gradient free methods and other standard methods.
- the new values of the parameters are fed back to the analog device, which then determines the new value of the loss function, and the process repeats.
- DACs and ADCs can be used to convert the parameter values to analog voltages and to convert the loss function values to digital signals, respectively.
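The hybrid loop can be sketched in software as follows; the quadratic stand-in loss and the random-search update are illustrative placeholders for the analog loss evaluator and for whichever gradient-free optimizer is chosen:

```python
import random

def hybrid_training_loop(evaluate_loss_analog, theta0, steps, rng):
    """Gradient-free hybrid loop sketch: the 'analog device' returns a loss
    value for fixed parameters; the digital processor proposes new parameters
    (here, simple random-search hill climbing stands in for any gradient-free
    optimizer) and the process repeats."""
    theta = list(theta0)
    best = evaluate_loss_analog(theta)               # analog -> digital (ADC)
    for _ in range(steps):
        candidate = [t + 0.1 * rng.gauss(0, 1) for t in theta]
        loss = evaluate_loss_analog(candidate)       # digital -> analog (DAC)
        if loss < best:
            theta, best = candidate, loss
    return theta, best

# stand-in for the analog loss evaluator: minimum at theta = (1, -2)
loss_fn = lambda th: (th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2
theta, best = hybrid_training_loop(loss_fn, [0.0, 0.0], 500, random.Random(1))
```

In the hardware version, `evaluate_loss_analog` would be replaced by a DAC write of the parameters, an analog evaluation, and an ADC read of the loss value.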
- An alternative approach is to use the analog device to compute gradients, and then feed these gradients to the digital processor, which implements a gradient descent optimization routine.
- Computing gradients with the analog device could be done by adopting the adjoint sensitivity method.
- the adjoint sensitivity method is discussed in detail below.
- this adjoint sensitivity method can be carried out on an analog device in the context of a neural ODE. Therefore, the electrical circuits presented below provide inspiration for how to perform the adjoint sensitivity method on our Bayesian Neural Network. Specifically, the methods below can be extended to the case of neural SDEs.
- Analog Neural ODE 31.1 Overview A subroutine of the thermodynamic AI system for Bayesian deep learning is an analog neural ODE.
- the analog neural ODE corresponds to/represents the Hidden Layer Network discussed in Section 25.3.
- Neural networks represented by or that implement neural ODEs have a wide range of applications, both in supervised machine learning and in fitting time-series data. In this sense, our architecture for an analog neural ODE has applications beyond Bayesian deep learning and is relevant to deep learning in general as well as for fitting time-series data.
- Section 32 elaborates on the application to fitting time-series data. Recall that an architecture for executing an analog neural ODE is shown in Figures 32, 33, 10, and 34. Figure 32 shows the unit cell. Figure 33 shows how the unit cells may be coupled.
- Figure 10 shows the connectivity structure.
- Figure 34 shows how to make an augmented version of the analog neural network for implementing an augmented neural ODE by expanding the space, i.e., increasing the number of unit cells.
- the free parameters in the analog neural network for a neural ODE are voltages ⁇ , ⁇ , and ⁇ , as illustrated in Figure 39.
- these free parameters do not have to come from a weight diffuser (as Fig. 39 shows), but instead can be supplied by a digital processor such as a CPU or FPGA.
- these parameters could be supplied by an analog device that evolves the weights over time but does not add stochastic noise, in which case the device acts as a weight evolver (instead of a weight diffuser).
- the adjoint sensitivity method for computing gradients provides one way to compute the gradient of neural ODEs.
- The gradient refers to the derivative of the loss function L with respect to the parameters θ.
- The variables h(t), a(t), and g(t) evolve together according to a system of coupled differential equations.
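As a self-contained numerical illustration of the adjoint sensitivity method (a scalar toy ODE of our own choosing, not the disclosed circuit), take dh/dt = θh with loss L = h(T); the forward pass stores h(t), the backward pass evolves the adjoint a and accumulates the gradient g:

```python
import math

def adjoint_gradient(theta, h0, T, n):
    """Adjoint sensitivity sketch for the scalar ODE dh/dt = theta*h with
    loss L = h(T). Forward Euler stores h(t); the backward pass evolves the
    adjoint da/dt = -a * df/dh and accumulates g = int a * (df/dtheta) dt."""
    dt = T / n
    h = [h0]
    for _ in range(n):                       # forward pass
        h.append(h[-1] + dt * theta * h[-1])
    a, g = 1.0, 0.0                          # a(T) = dL/dh(T) = 1
    for i in range(n, 0, -1):                # backward pass
        g += dt * a * h[i]                   # df/dtheta = h
        a += dt * a * theta                  # da/dt = -a*theta, run backwards
    return h[-1], g

hT, grad = adjoint_gradient(theta=0.5, h0=1.0, T=1.0, n=20000)
# analytic check: h(T) = e^{0.5} and dL/dtheta = T * h0 * e^{theta*T} = e^{0.5}
```

The same bookkeeping carries over when h, a, and g are vectors and the Jacobians ∂f/∂h and ∂f/∂θ are matrices, which is the setting of the analog circuits described below.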
- The Adjoint Device Next, consider introducing circuits for the time evolution of a.
- the circuit for evolving a is called the adjoint device.
- Figure 48 shows a possible architecture for the adjoint device.
- Each unit cell is composed of a capacitor C_j^a and a resistor in series.
- each unit cell can be connected to every other unit cell (full connectivity). It is also possible to restrict this connectivity such that the unit cells are connected to a lesser degree.
- The voltages across the capacitors, denoted by the vector v_a, represent the state variable. In other words, we encode the adjoint variable a in v_a.
- the matrix J a has elements given by where we assume full connectivity between the unit cells (although this can be extended to partial connectivity).
- Eq. (167) has the correct form provided that the A matrix can be related to the matrix . This can be accomplished by encoding in the matrix A.
- One approach would be to compute digitally. Consider computing digitally at each time step (e.g., with automatic differentiation).
- a DAC can convert the digital signals associated with this matrix into analog voltages.
- the resulting analog voltages can be used to determine the various resistances and R j a j of the resistors shown in Figure 48.
- These resistors can be physically implemented as voltage-controlled elements, such as transistors whose gate voltages control the resistance. Therefore, we can take the analog voltages associated with the matrix elements of and apply them as gate voltages to the transistors that determine the resistances and R_jj.
- the vector a can be read by picking off the voltages across the capacitors in the adjoint device.
- This vector can be fed into an analog circuit that performs matrix-vector multiplication (e.g., using a layer of resistors).
- This matrix-vector multiplication can involve multiplying a by an analog version of the matrix , which was obtained in the previous step.
- the matrix-vector product a ⁇ ⁇ f ⁇ has been obtained as an analog voltage, it can be integrated over time.
- This can involve a set of integrator circuits.
- Figure 49 shows a circuit diagram for an integrator, i.e., a circuit that computes the time integral of a voltage signal.
- The output of this set of integrator circuits is the gradient g.
- some operations can be performed digitally.
- The matrix-vector multiplication can be performed digitally by first passing a through an ADC and then digitally integrating this matrix-vector product.
- 31.7 Training the analog neural network The method above can be used for computing gradients with respect to the parameters θ of the analog neural ODE that represents the analog neural network. In practice, these parameter values can be stored digitally in memory of or coupled to a CPU or FPGA, and then supplied to the analog circuit as needed. The training process can therefore correspond to an optimization routine that involves a hybrid analog-digital feedback loop.
- The digital device supplies the values of the parameters θ, then the analog device computes the gradient, then the digital device provides new values for the parameters after doing a gradient descent step, and the process repeats.
- DACs and ADCs can be used to convert the parameter values to analog voltages and to convert the gradient values to digital signals, respectively.
- this corresponds to using a gradient-descent approach to training the analog neural network.
- Adjoint sensitivity methods are not needed in a gradient-free approach to training the analog neural network.
- an analog neural network can be trained using a loss function based on the outputs of the analog neural network. Then, this loss function value can be used in the context of a gradient-free optimization routine that is facilitated by a digital device.
- Time-series data provide an important application relevant to financial analysis, market prediction, epidemiology, and medical data analysis. In many cases, the data may be collected at irregular time intervals. In these cases, it can be helpful to have a model that makes predictions at all times and hence interpolates between the datapoints and extrapolates beyond the data, e.g., to make predictions about the future where no data is available.
- Discrete neural networks, such as recurrent neural networks, have been used in the past for interpolating and extrapolating time-series data. However, neural networks that implement or obey latent ODEs have been shown to outperform recurrent neural networks at this task.
- Latent ODEs are continuous time models that are essentially the same as neural ODEs, although the different names are used to distinguish the application. Namely, a neural ODE (latent ODE) represents a neural network used for supervised machine learning (fitting time-series data). A latent ODE can be viewed as a parameterized ODE, where the corresponding neural network has parameters trained to fit the time-series data (according to some loss function).
- 32.2 Analog Latent ODEs Section 31 presents an analog architecture for implementing neural ODEs. We can employ this analog architecture to carry out a subroutine of a latent ODE model for fitting time-series data.
- Figure 50 provides a schematic diagram for our analog implementation of a latent ODE. The analog latent ODE implementation in Fig. 50 has three components or subsystems: 1. an encoder; 2. an analog neural ODE processor; and 3. a decoder.
- the training data are provided as observations from some time series. These time-series observations are fed into an encoder.
- the encoder has free parameters that can be trained.
- the encoder could be a recurrent neural network.
- the output of the encoder can be the initial vector h(0) of the hidden layer values, or the output could be a probability distribution from which h(0) is sampled. If the encoder is stored on a digital device, its output can pass through a DAC, after which it can be fed as an analog signal to the analog neural ODE.
- the analog neural ODE processor which is presented in Sec. 31, provides the latent space for the latent ODE.
- This latent space is initialized to h(0) by the encoder. Then the hidden layer values evolve over time according to the differential equation that describes the analog neural ODE. Recall that Figures 32, 33, 10, and 34 provide details for how the analog neural ODE processor can be constructed.
- the hidden layer values h(t k ) can be read off at a set ⁇ t k ⁇ of various times, by measuring the voltages on the capacitors in the analog neural ODE processor. This set ⁇ h(t k ) ⁇ of values can be fed to a decoder.
- the decoder could be a neural network that is stored on a digital device.
- ADCs can digitize the analog signals associated with ⁇ h(t k ) ⁇ and feed the resulting digital signals to the decoder.
- The decoder can have free parameters that are to be trained.
- the decoder can be probabilistic, such that the outputs predicted by the decoder are a probabilistic function of the hidden layer values ⁇ h(t k ) ⁇ .
- a benefit of making the decoder probabilistic is that it can be used to simulate noisy processes or stochastic processes.
- The outputs of the decoder correspond to predictions that the latent ODE model makes for the true time series. These predictions can go beyond the time interval associated with the observations, in which case the predictions correspond to extrapolated values.
- a training process occurs where the parameters of the encoder, the decoder, and the analog neural ODE processor are optimized in order to minimize or maximize a loss function.
- This essentially corre- sponds to fitting the time-series data.
- A loss function based on the evidence lower bound (ELBO) can be employed in this training process; other loss functions are also possible.
- a gradient based approach can be taken, and the adjoint sensitivity method discussed in Sec. 31 can be employed for computing gradients.
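A toy end-to-end sketch of the encoder / latent-ODE / decoder pipeline of Figure 50, with trivial stand-ins for each stage (the mean encoder, the scalar linear ODE, and the scaling decoder are our illustrative assumptions, not the disclosed components):

```python
def latent_ode_forward(observations, theta, read_times, dt=0.001):
    """Sketch of the three-part latent ODE pipeline: a toy encoder maps
    time-series observations to h(0); the latent state evolves under
    dh/dt = theta*h (standing in for the analog neural ODE processor);
    a toy decoder reads out predictions at the chosen times."""
    h = sum(observations) / len(observations)        # toy "encoder": mean
    t, out = 0.0, {}
    for t_k in sorted(read_times):
        while t < t_k:                               # evolve the latent state
            h += dt * theta * h
            t += dt
        out[t_k] = 2.0 * h                           # toy "decoder": scaling
    return out

preds = latent_ode_forward([0.9, 1.0, 1.1], theta=0.0, read_times=[0.5, 1.0])
# with theta = 0 the latent state is constant, so both predictions equal 2.0
```

In the hardware version, the evolution loop is replaced by the physical dynamics of the analog neural ODE processor, with DACs feeding h(0) in and ADCs reading h(t_k) out.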
- 32.3 Extension to analog latent SDEs The framework and architecture presented in Figure 50 can be extended as follows.
- An analog neural SDE processor can replace the analog neural ODE processor for the latent space used in Figure 50. Replacing the analog neural ODE with an analog neural SDE results in an analog latent SDE.
- the analog latent SDE can be used to model time-series data that are generated by stochastic processes. This has useful applications in financial and market analysis.
- Fig. 51 illustrates an analog neural SDE processor with a stochastic noise source in each unit cell of the architecture for the analog neural ODE processor.
- the voltage source B j denotes a stochastic noise source.
- the rest of the architecture of the analog neural SDE processor could be the same as that of the analog neural ODE processor.
- An analog neural SDE processor inserted into the latent space shown in Fig. 50 in place of the analog neural ODE processor could be employed to fit and extrapolate time-series data from sources that are stochastic in nature.
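A software sketch of what the analog neural SDE processor computes, using Euler-Maruyama integration with an independent noise source per unit cell (the linear drift term is an illustrative assumption):

```python
import math, random

def neural_sde_trajectory(h0, theta, noise_amp, T, n, rng):
    """Euler-Maruyama sketch of unit-cell dynamics with a stochastic noise
    source B_j added per cell: dh_j = theta * h_j dt + noise_amp dW_j."""
    dt = T / n
    h = list(h0)
    for _ in range(n):
        for j in range(len(h)):
            dW = rng.gauss(0.0, math.sqrt(dt))   # Wiener increment per cell
            h[j] += dt * theta * h[j] + noise_amp * dW
    return h

rng = random.Random(7)
# with noise_amp = 0 the SDE reduces to the deterministic neural ODE
h_det = neural_sde_trajectory([1.0, 2.0], theta=-1.0, noise_amp=0.0,
                              T=1.0, n=10000, rng=rng)
# h_det[j] approaches h0[j] * e^{-1}
```

Setting `noise_amp` to zero recovers the analog neural ODE processor; a non-zero value models the stochastic voltage sources B_j of Fig. 51.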
- 33 General Framework Here we provide a general framework that encompasses multiple applications.
- Thermodynamic AI algorithms can be defined as algorithms consisting of at least two subroutines: 1.
- p, x, and f respectively are the momentum, position, and force.
- the matrices M , D, and B are hyperparameters, with M being the mass matrix and D being the diffusion matrix.
- the dw term is a Wiener process.
- U_θ is a (trainable) potential energy function.
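A numerical sketch of one Euler-Maruyama step of this coupled SDE, treating M, D, and B as scalars for simplicity (an assumption made purely for illustration):

```python
import math, random

def langevin_step(x, p, grad_U, M, D, B, dt, rng):
    """One Euler-Maruyama step of the framework's coupled SDE:
    dx = (p/M) dt,  dp = -grad_U(x) dt - (D/M) p dt + B dw."""
    dW = rng.gauss(0.0, math.sqrt(dt))
    x_new = x + dt * p / M
    p_new = p + dt * (-grad_U(x) - (D / M) * p) + B * dW
    return x_new, p_new

rng = random.Random(3)
grad_U = lambda x: x            # harmonic potential U(x) = x^2 / 2
x, p = 2.0, 0.0
for _ in range(40000):          # damped, noiseless: settles at the minimum
    x, p = langevin_step(x, p, grad_U, M=1.0, D=1.0, B=0.0, dt=0.005, rng=rng)
# x and p decay toward 0, the minimum of U
```

With B = 0 and D > 0 the dynamics relax to the minimum of U_θ; with B > 0 the Wiener term keeps the system fluctuating around it, which is the regime exploited by the thermodynamic hardware.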
- Thermodynamic AI hardware As the name “thermodynamic” suggests, a thermodynamic system is inherently dynamic in nature. Therefore, the fundamental building blocks should also be dynamic. This is in contrast to classical bits or qubits, where the state of the system ideally remains fixed unless it is actively changed by gates.
- A thermodynamic building block should passively and naturally evolve over time, even without the application of gates. But what dynamical process should it follow? A reasonable proposal is a stochastic Markov process. Naturally, this should be continuous in time, since no time point is more special than any other time point.
- The discrete building block, which we call an s-bit, would follow a continuous-time Markov chain (CTMC).
- the “s” in s-bit stands for stochastic.
- For the continuous building block, which we call an s-mode, the natural analog would be Brownian motion (also known as a Wiener process).
- Brownian motion is typically assumed to have the martingale property.
- We use s-unit as a generic term to encompass both s-bits and s-modes.
- p-bits and p-modes are the fundamental building blocks of a probabilistic system.
- the p-bit can be thought of as a random number generator, which either generates 0 or 1 at random.
- the analog of this in the continuous case, which we call a p-mode could be a random number generator that generates a real number according to a Gaussian distribution with zero mean and some variance.
- A natural starting point for implementing thermodynamic AI hardware is analog electrical circuits, as these circuits have inherent fluctuations that could be harnessed for computation.
- the most ubiquitous source of noise in electrical circuits is thermal noise.
- Thermal noise also called Johnson-Nyquist noise, comes from the random thermal agitation of the charge carriers in a conductor, resulting in fluctuations in voltage or current inside the conductor.
- the amplitude of the voltage fluctuations can be controlled by changing the temperature or the resistance.
- a thermal noise source can be implemented using a large resistor in series with a voltage amplifier.
- Another type of electrical noise is shot noise. Shot noise arises from the discrete nature of charge carriers and from the fact that the probability of a charge carrier crossing a point in a conductor at any time is random. This effect is particularly important in semiconductor junctions, where the charge carriers must overcome a potential barrier to conduct a current. The probability of a charge carrier passing over the potential barrier is an independent random event. This induces fluctuations in the current through the junction.
- the amplitude of the current fluctuations can be controlled by changing the magnitude of the DC current passing through the junction.
- A source of shot noise can be implemented using a pn diode (for example, a Zener diode in reverse bias configuration) in series with a controllable current source.
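A toy numerical model of shot noise, using the standard picture of independent (Poisson-distributed) charge-carrier crossings; the parameter values here are illustrative, not the disclosed circuit values:

```python
import random

def shot_noise_current(i_dc, q, dt, n, rng):
    """Shot-noise sketch: charge carriers cross the junction as independent
    random events, so the count per window dt is Poisson with mean i_dc*dt/q,
    and the instantaneous current is q*count/dt."""
    mean_count = i_dc * dt / q
    currents = []
    for _ in range(n):
        # Poisson sample via summed exponential (rate-1) waiting times
        count, t = 0, rng.expovariate(1.0)
        while t < mean_count:
            count += 1
            t += rng.expovariate(1.0)
        currents.append(q * count / dt)
    return currents

rng = random.Random(5)
samples = shot_noise_current(i_dc=1.0, q=0.01, dt=1.0, n=5000, rng=rng)
mean_i = sum(samples) / len(samples)
# the sample mean recovers the DC current; fluctuations grow with q and i_dc
```

This reflects the tuning knob noted below: scaling the DC current scales the amplitude of the current fluctuations.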
- any physical implementation of s-modes should have the amplitude of its stochasticity be independently controllable with respect to the other system parameters.
- thermal and shot noise both have tuning knobs to control the amplitude of the noise to some extent.
- Thermal and shot noise sources typically have voltage fluctuations on the order of a few μV or less.
- amplification will be necessary. This amplification can be done using single- or multi-stage voltage amplifiers.
- Variable-gain amplifiers can also let one independently control the amplitude of the fluctuations.
- The s-mode can be represented through the dynamics of any degree of freedom of an electrical circuit.
- a simple stochastic voltage noise source plays the role of the s-mode. This can be realized by using a noisy resistor at non-zero temperature.
- The circuit schematic in Fig. 53 shows the typical equivalent noise model for a noisy resistor, composed of a stochastic voltage noise source, δv(t), in series with an ideal (non-noisy) resistor of resistance R.
- the inherent terminal capacitance, C, of the resistor is also added to the equivalent resistor model.
- the dynamics of the s-mode in this case obeys the following SDE model:
- The form of the SDE comprises a drift term proportional to v(t) and a diffusion or stochastic term proportional to δv(t).
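A software sketch of this s-mode SDE, an Ornstein-Uhlenbeck-type process whose drift is set by the RC time constant τ (the parameter values are illustrative):

```python
import math, random

def simulate_s_mode(tau, sigma, T, n, rng):
    """Euler-Maruyama sketch of the s-mode SDE from the text: a drift term
    proportional to -v(t), with time constant tau = RC, plus a diffusion
    term driven by the thermal noise source delta-v(t)."""
    dt = T / n
    v, trace = 0.0, []
    for _ in range(n):
        v += -(v / tau) * dt + sigma * rng.gauss(0.0, math.sqrt(dt))
        trace.append(v)
    return trace

rng = random.Random(42)
trace = simulate_s_mode(tau=1.0, sigma=1.0, T=2000.0, n=200000, rng=rng)
tail = trace[len(trace) // 2:]
var = sum(v * v for v in tail) / len(tail)
# the Ornstein-Uhlenbeck stationary variance is sigma^2 * tau / 2 = 0.5
```

The stationary variance σ²τ/2 shows how the resistance, capacitance, and noise amplitude together set the scale of the voltage fluctuations.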
- Their inherent stochastic dynamics must be constrained to match the requirements of the algorithm.
- 33.3.2 Coupling s-modes When building systems of many s-modes, one will most likely wish to introduce some form of coupling between them to express correlations and geometric constraints. Again, the medium of analog electrical circuits presents a natural option for the coupling of s-modes. As a first example, two circuits of the type described in section 33.3.1 could be coupled through a resistor, as pictured in the upper panel of Fig. 54.
- The coupled s-modes, represented by the voltages on nodes 1 and 2, are then coupled through their drift terms as in Eqs. (175) and (176) (we omit the time dependencies for readability).
- A second method of coupling two s-modes together is by using a capacitor as the coupling element, as pictured in the lower panel of Fig. 54.
- The two coupled s-modes, represented by the voltages on nodes 1 and 2, have drift and diffusion coupling as in Eqs. (178) and (179) (we omit the time dependencies for readability).
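A noiseless numerical sketch of the drift coupling between two s-modes, in the resistor-coupled form of the upper panel (capacitive coupling would additionally couple the diffusion terms); the coupling strength κ and the leak time constant τ are illustrative stand-ins for the circuit values:

```python
def coupled_s_modes(v1, v2, tau, kappa, dt, n):
    """Noiseless sketch of two s-modes coupled through a resistor: each node
    has a leak drift -v/tau plus a coupling drift kappa*(other - self)."""
    for _ in range(n):
        d1 = -v1 / tau + kappa * (v2 - v1)
        d2 = -v2 / tau + kappa * (v1 - v2)
        v1, v2 = v1 + dt * d1, v2 + dt * d2
    return v1, v2

# opposite initial voltages: the coupling pulls the two nodes together
# while the leak pulls both toward zero
v1, v2 = coupled_s_modes(1.0, -1.0, tau=1.0, kappa=5.0, dt=0.001, n=3000)
```

The voltage difference decays at the enhanced rate 1/τ + 2κ, illustrating how the coupling element expresses correlations between s-modes.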
- the demon acts as an intelligent observer who regularly gathers data from (i.e., measures) the system, and based on the gathered information, the demon performs some action on the system.
- The classic example involves a gaseous mixture and a physical barrier as illustrated in Figure 11, although it can be implemented by various physical means, including with electrical circuits.
- a Maxwell Demon is both: (1) A key component of Thermodynamic AI systems due to the complex entropy dynamics required for AI applications, and (2) Straightforward to implement in practice for several different hardware architectures.
- AI applications like Bayesian inference aim to approximate a posterior distribution, and it is known that such posteriors can be extremely complicated and multi-modal.
- In this case, one would need to communicate the state vector v to the digital processor (to be the input to the neural network), and then communicate the proposed action of the Maxwell’s Demon (i.e., the output of the neural network) back to the thermodynamic hardware. Hence, one simply needs a means to interconvert signals between the thermodynamic hardware and the digital processor. This is illustrated in diagrams such as Fig. 12 or Fig. 41, with the interconversion shown as analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). 33.4.2 Analog Maxwell’s Demon An analog MD device could allow one to integrate it more closely with the rest of the thermodynamic hardware. Moreover, this could allow one to avoid interconverting signals.
- Equation (181) provides a recipe for how to construct an MD device. The idea would be to view the momentum p as the state vector associated with the s-mode system. Hence, the momentum will evolve over time according to the SDE equation associated with the s-mode system, such as Eq. (177).
- the MD system will take the momentum vector in as an input. Then the MD system will output a force that is a function of both the time t and the momentum p(t), Input from s-modes: p(t), Output to s-modes: f(t,p(t)) (182)
- the MD device performs the mapping from input to output in Eq. (182).
- the MD device has a latent variable or hidden variable, which corresponds to the position vector x(t).
- the latent variable x(t) is stored inside the MD device’s memory, and it evolves over time. Specifically, it evolves over time according to the differential equation in Eq. (181).
- Here x(0) is an initial starting point for the latent variable.
- The MD device also stores a potential energy function U_θ(t, x(t)). For generality, we allow this potential energy function to be time-dependent. This time dependence is important for certain applications such as annealing, where one wishes to vary the potential energy function over time.
- the MD device combines (183) and (184) to produce a force.
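A minimal sketch of one step of such a force-based MD device, assuming the latent update dx/dt = p/M (our reading of Eq. (181)) and a toy quadratic potential; the specific functions and values are illustrative:

```python
def md_device_step(p, x, grad_U, M, dt):
    """Sketch of a force-based Maxwell's demon device: the latent variable x
    evolves from the momentum input (here dx/dt = p/M, an assumed form of
    Eq. (181)), and the output force is the negative gradient of the stored
    potential energy function U_theta."""
    x_new = [xi + dt * pi / M for xi, pi in zip(x, p)]   # latent update
    force = [-g for g in grad_U(x_new)]                  # output force
    return x_new, force

grad_U = lambda x: [2.0 * xi for xi in x]    # toy potential U(x) = |x|^2
x, force = md_device_step(p=[1.0, -1.0], x=[0.0, 0.0], grad_U=grad_U,
                          M=1.0, dt=0.1)
# x becomes (0.1, -0.1); the force pushes back toward the origin: (-0.2, 0.2)
```

This realizes the input/output mapping of Eq. (182): momentum in, force out, with the latent position x retained inside the device between steps.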
- Diffusion models can fit into our framework using the following mapping: (diffusion process) ↔ (s-mode device) (188); (score network) ↔ (Maxwell’s demon device) (189).
- the mathematical diffusion process in diffusion models can be mapped to the physical diffusion process in the s-mode device.
- the score vector outputted by the score network corresponds to the vector d(t,v(t)) outputted by the MD device.
- Equations (192) and (193) describe a simulated annealing process. Equations (192) and (193) are special cases of our general framework for Thermodynamic AI hardware, given in Eqs. (169) and (170). Specifically, we have the following mapping to our hardware: (auxiliary SDE) ↔ (s-mode device) (194); (optimization ODE) ↔ (latent variable evolution in Maxwell’s demon device) (195). The idea is that the auxiliary SDE describing the evolution of p can be performed on the s-mode device.
- ∇L(x) corresponds to the vector d output by the Maxwell’s demon in our hardware.
- The optimization ODE maps onto the evolution of the latent variable in the Maxwell’s demon device. This employs a force-based Maxwell’s demon, as discussed above and shown in Figure 55.
- Any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the inventive scope of the present disclosure.
- various inventive concepts may be embodied as one or more methods, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments. All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
- A reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising,” can refer, in one embodiment, to A only (optionally including components other than B); in another embodiment, to B only (optionally including components other than A); in yet another embodiment, to both A and B (optionally including other components); etc.
- “or” should be understood to have the same meaning as “and/or” as defined above.
- the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e., “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.
- the phrase “at least one,” in reference to a list of one or more components should be understood to mean at least one component selected from any one or more of the components in the list of components, but not necessarily including at least one of each and every component specifically listed within the list of components and not excluding any combinations of components in the list of components.
- “at least one of A and B” can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including components other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including components other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other components); etc.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020257022173A KR20250131778A (en) | 2022-12-02 | 2023-11-30 | Thermodynamic AI for Generative Diffusion Models and Bayesian Deep Learning |
| EP23898893.5A EP4627417A1 (en) | 2022-12-02 | 2023-11-30 | Thermodynamic artificial intelligence for generative diffusion models and bayesian deep learning |
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263385891P | 2022-12-02 | 2022-12-02 | |
| US63/385,891 | 2022-12-02 | ||
| US202363478710P | 2023-01-06 | 2023-01-06 | |
| US63/478,710 | 2023-01-06 | ||
| US202363483856P | 2023-02-08 | 2023-02-08 | |
| US63/483,856 | 2023-02-08 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024118915A1 true WO2024118915A1 (en) | 2024-06-06 |
Family
ID=91325002
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/081816 Ceased WO2024118915A1 (en) | 2022-12-02 | 2023-11-30 | Thermodynamic artificial intelligence for generative diffusion models and bayesian deep learning |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4627417A1 (en) |
| KR (1) | KR20250131778A (en) |
| WO (1) | WO2024118915A1 (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118518982A (en) * | 2024-07-19 | 2024-08-20 | 江西师范大学 | FDIA detection method and FDIA detection device based on diffusion model |
| CN120046045A (en) * | 2025-04-23 | 2025-05-27 | 国网浙江省电力有限公司金华供电公司 | Novel energy power system line blocking early warning method based on martingale model |
| CN120123137A (en) * | 2025-05-14 | 2025-06-10 | 深圳市康莱米电子股份有限公司 | Temperature control device and temperature control method for tablet computer |
| CN120197518A (en) * | 2025-05-26 | 2025-06-24 | 青岛理工大学 | A method for end-to-end dynamic modeling of reservoirs based on diffusion model |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170169276A1 (en) * | 2011-09-27 | 2017-06-15 | The Board Of Regents Of The University Of Texas System | Systems and methods for automated screening and prognosis of cancer from whole-slide biopsy images |
| WO2018235004A1 (en) * | 2017-06-22 | 2018-12-27 | Sendyne Corporation | RESOLVER OF DIFFERENTIAL STOCHASTIC EQUATIONS |
| US20190113438A1 (en) * | 2016-04-07 | 2019-04-18 | The General Hospital Corporation | White Blood Cell Population Dynamics |
| WO2021035038A1 (en) * | 2019-08-20 | 2021-02-25 | The General Hospital Corporation | Single-cell modeling of clinical data to determine red blood cell regulation |
| US20210397955A1 (en) * | 2020-06-16 | 2021-12-23 | Robert Bosch Gmbh | Making time-series predictions of a computer-controlled system |
2023
- 2023-11-30 KR KR1020257022173A patent/KR20250131778A/en active Pending
- 2023-11-30 EP EP23898893.5A patent/EP4627417A1/en active Pending
- 2023-11-30 WO PCT/US2023/081816 patent/WO2024118915A1/en not_active Ceased
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170169276A1 (en) * | 2011-09-27 | 2017-06-15 | The Board Of Regents Of The University Of Texas System | Systems and methods for automated screening and prognosis of cancer from whole-slide biopsy images |
| US20190113438A1 (en) * | 2016-04-07 | 2019-04-18 | The General Hospital Corporation | White Blood Cell Population Dynamics |
| WO2018235004A1 (en) * | 2017-06-22 | 2018-12-27 | Sendyne Corporation | RESOLVER OF DIFFERENTIAL STOCHASTIC EQUATIONS |
| WO2021035038A1 (en) * | 2019-08-20 | 2021-02-25 | The General Hospital Corporation | Single-cell modeling of clinical data to determine red blood cell regulation |
| US20210397955A1 (en) * | 2020-06-16 | 2021-12-23 | Robert Bosch Gmbh | Making time-series predictions of a computer-controlled system |
Non-Patent Citations (2)
| Title |
|---|
| Look, Andreas; Kandemir, Melih: "Differential Bayesian Neural Nets", arXiv.org, 2 December 2019 (2019-12-02), Olin Library, Cornell University, Ithaca, NY 14853, XP081543381 * |
| Song, Lu-Kai; Bai, Guang-Chen; Li, Xue-Qin; Wen, Jie: "A unified fatigue reliability-based design optimization framework for aircraft turbine disk", International Journal of Fatigue, vol. 152, 19 July 2021 (2021-07-19), Amsterdam, NL, XP086753946, ISSN: 0142-1123, DOI: 10.1016/j.ijfatigue.2021.106422 * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118518982A (en) * | 2024-07-19 | 2024-08-20 | 江西师范大学 | FDIA detection method and FDIA detection device based on diffusion model |
| CN120046045A (en) * | 2025-04-23 | 2025-05-27 | 国网浙江省电力有限公司金华供电公司 | Early-warning method for line congestion in new-energy power systems based on a martingale model |
| CN120123137A (en) * | 2025-05-14 | 2025-06-10 | 深圳市康莱米电子股份有限公司 | Temperature control device and temperature control method for tablet computer |
| CN120197518A (en) * | 2025-05-26 | 2025-06-24 | 青岛理工大学 | A method for end-to-end dynamic modeling of reservoirs based on diffusion model |
| CN120197518B (en) * | 2025-05-26 | 2025-08-08 | 青岛理工大学 | Oil reservoir end-to-end dynamic modeling method based on diffusion model |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4627417A1 (en) | 2025-10-08 |
| KR20250131778A (en) | 2025-09-03 |
Similar Documents
| Publication | Title |
|---|---|
| WO2024118915A1 (en) | Thermodynamic artificial intelligence for generative diffusion models and bayesian deep learning |
| Foster et al. | Deep adaptive design: Amortizing sequential bayesian experimental design | |
| US20230419075A1 (en) | Automated Variational Inference using Stochastic Models with Irregular Beliefs | |
| Ritter et al. | Online structured laplace approximations for overcoming catastrophic forgetting | |
| Bartunov et al. | Few-shot generative modelling with generative matching networks | |
| Guan et al. | Direct and indirect reinforcement learning | |
| JP7020547B2 (en) | Information processing equipment, control methods, and programs | |
| CN112818658B (en) | Training method, classifying method, device and storage medium for text classification model | |
| Bartunov et al. | Fast adaptation in generative models with generative matching networks | |
| Liu et al. | An experimental study on symbolic extreme learning machine | |
| Zhang et al. | Improved GAP-RBF network for classification problems | |
| EP4396730B1 (en) | Automated variational inference using stochastic models with irregular beliefs | |
| Mills et al. | L2nas: Learning to optimize neural architectures via continuous-action reinforcement learning | |
| Salman et al. | Nifty method for prediction dynamic features of online social networks from users’ activity based on machine learning | |
| EP4602521A1 (en) | Thermodynamic computing system for sampling high-dimensional probability distributions | |
| Champion | From data to dynamics: discovering governing equations from data | |
| Sentz et al. | Reduced basis approximations of parameterized dynamical partial differential equations via neural networks | |
| Dewulf et al. | The hyperdimensional transform: a holographic representation of functions | |
| Thaler et al. | JaxSGMC: Modular stochastic gradient MCMC in JAX | |
| Yeganeh et al. | Deep Active Inference Agents for Delayed and Long-Horizon Environments | |
| Kandola | Interpretable modelling with sparse kernels | |
| Carbone | Generative Models as Out-of-equilibrium particle systems: the case of Energy-Based Models | |
| Stinson | Generative Modeling and Inference in Directed and Undirected Neural Networks | |
| Iollo | Inference driven Bayesian Experimental Design | |
| Joachims | Uncertainty Quantification with Bayesian Neural Networks |
Legal Events
| Code | Title | Description |
|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23898893; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2025532040; Country of ref document: JP; Kind code of ref document: A |
| WWE | Wipo information: entry into national phase | Ref document number: 2025532040; Country of ref document: JP |
| WWE | Wipo information: entry into national phase | Ref document number: 2023898893; Country of ref document: EP |
| NENP | Non-entry into the national phase | Ref country code: DE |
| ENP | Entry into the national phase | Ref document number: 2023898893; Country of ref document: EP; Effective date: 20250702 |
| WWP | Wipo information: published in national office | Ref document number: 2023898893; Country of ref document: EP |