US20220076131A1 - Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers - Google Patents

Info

Publication number
US20220076131A1
Authority
US
United States
Prior art keywords
circuitry
distribution
approximating posterior
random variables
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/481,568
Inventor
Jason Rolfe
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
D Wave Systems Inc
Original Assignee
D Wave Systems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by D Wave Systems Inc
Priority to US17/481,568
Assigned to PSPIB UNITAS INVESTMENTS II INC.: security interest (see document for details). Assignors: D-WAVE SYSTEMS INC.
Publication of US20220076131A1
Assigned to D-WAVE SYSTEMS INC.: release by secured party (see document for details). Assignors: PSPIB UNITAS INVESTMENTS II INC., IN ITS CAPACITY AS COLLATERAL AGENT
Assigned to PSPIB UNITAS INVESTMENTS II INC., AS COLLATERAL AGENT: intellectual property security agreement. Assignors: 1372934 B.C. LTD., D-WAVE SYSTEMS INC.

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/044 Recurrent networks, e.g. Hopfield networks
                                • G06N3/0445
                            • G06N3/045 Combinations of networks
                                • G06N3/0454
                            • G06N3/047 Probabilistic or stochastic networks
                                • G06N3/0472
                        • G06N3/08 Learning methods
                            • G06N3/084 Backpropagation, e.g. using gradient descent
                            • G06N3/086 Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
                            • G06N3/088 Non-supervised learning, e.g. competitive learning
                • G06N10/00 Quantum computing, i.e. information processing based on quantum-mechanical phenomena

Definitions

  • The present disclosure generally relates to machine learning.
  • Machine learning relates to methods and circuitry that can learn from data and make predictions based on data.
  • Machine learning methods and circuitry can include deriving a model from example inputs (such as a training set) and then making data-driven predictions.
  • Machine learning is related to optimization. Some problems can be expressed in terms of minimizing a loss function on a training set, where the loss function describes the disparity between the predictions of the model being trained and observable data.
  • Machine learning tasks can include unsupervised learning, supervised learning, and reinforcement learning.
  • Approaches to machine learning include, but are not limited to, decision trees, linear and quadratic classifiers, case-based reasoning, Bayesian statistics, and artificial neural networks.
  • Machine learning can be used in situations where explicit approaches are considered infeasible.
  • Example application areas include optical character recognition, search engine optimization, and computer vision.
  • A quantum processor is a computing device that can harness quantum physical phenomena (such as superposition, entanglement, and quantum tunneling) unavailable to non-quantum devices.
  • A quantum processor may take the form of a superconducting quantum processor.
  • A superconducting quantum processor may include a number of qubits and associated local bias devices, for instance two or more superconducting qubits.
  • An example of a qubit is a flux qubit.
  • A superconducting quantum processor may also employ coupling devices (i.e., “couplers”) providing communicative coupling between qubits. Further details and embodiments of exemplary quantum processors that may be used in conjunction with the present systems and devices are described in, for example, U.S. Pat. Nos. 7,533,068; 8,008,942; 8,195,596; 8,190,548; and 8,421,053.
  • Adiabatic quantum computation typically involves evolving a system from a known initial Hamiltonian (the Hamiltonian being an operator whose eigenvalues are the allowed energies of the system) to a final Hamiltonian by gradually changing the Hamiltonian.
  • A simple example of an adiabatic evolution is a linear interpolation between the initial Hamiltonian and the final Hamiltonian. An example is given by:
  • $H_e = (1 - s)\,H_i + s\,H_f$
  • where $H_i$ is the initial Hamiltonian, $H_f$ is the final Hamiltonian, $H_e$ is the evolution or instantaneous Hamiltonian, and $s$ is an evolution coefficient which controls the rate of evolution (i.e., the rate at which the Hamiltonian changes).
  • The system is typically initialized in a ground state of the initial Hamiltonian $H_i$, and the goal is to evolve the system in such a way that the system ends up in a ground state of the final Hamiltonian $H_f$ at the end of the evolution. If the evolution is too fast, then the system can transition to a higher energy state, such as the first excited state.
  • An “adiabatic” evolution is an evolution that satisfies the adiabatic condition:
  • $\dot{s}\,\left|\langle 1 |\, dH_e/ds \,| 0 \rangle\right| = \delta\, g^2(s)$
  • where $\dot{s}$ is the time derivative of $s$, $g(s)$ is the difference in energy between the ground state and first excited state of the system (also referred to herein as the “gap size”) as a function of $s$, and $\delta$ is a coefficient much less than 1.
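As a minimal numerical sketch of the quantities above, the following computes the gap size g(s) along a linear interpolation H_e = (1 - s) H_i + s H_f for a two-qubit toy system. The specific Hamiltonians (a transverse field for H_i and a small diagonal Ising problem for H_f) are assumptions chosen for illustration and are not taken from the disclosure.

```python
import numpy as np

# Single-qubit operators (illustrative two-qubit toy system).
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def kron(*ops):
    """Tensor product of a sequence of single-qubit operators."""
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

# Assumed initial Hamiltonian: transverse field on both qubits.
H_i = -(kron(X, I2) + kron(I2, X))
# Assumed final Hamiltonian: a diagonal two-qubit Ising problem.
H_f = -kron(Z, Z) - 0.5 * kron(Z, I2)

def evolution_hamiltonian(s):
    """Linear interpolation H_e = (1 - s) H_i + s H_f."""
    return (1.0 - s) * H_i + s * H_f

def gap(s):
    """Energy difference between the first excited state and the ground state."""
    energies = np.linalg.eigvalsh(evolution_hamiltonian(s))  # ascending order
    return energies[1] - energies[0]

s_grid = np.linspace(0.0, 1.0, 101)
gaps = np.array([gap(s) for s in s_grid])
min_gap = gaps.min()
s_at_min = s_grid[gaps.argmin()]
print(f"minimum gap {min_gap:.4f} at s = {s_at_min:.2f}")
```

Under the adiabatic condition, the permissible rate of change of s scales with the square of the gap, so the smallest value of g(s) along the path is what limits how fast the evolution can run.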
  • Quantum annealing is a computation method that may be used to find a low-energy state, preferably the ground state, of a system. Similar in concept to classical simulated annealing, the method relies on the underlying principle that natural systems tend towards lower energy states because lower energy states are more stable. While classical annealing uses classical thermal fluctuations to guide a system to a low-energy state and ideally its global energy minimum, quantum annealing may use quantum effects, such as quantum tunneling, as a source of disordering to reach a global energy minimum more accurately and/or more quickly than classical annealing. In quantum annealing, thermal effects and other noise may be present during annealing. The final low-energy state may not be the global energy minimum.
  • Adiabatic quantum computation may be considered a special case of quantum annealing for which the system, ideally, begins and remains in its ground state throughout an adiabatic evolution.
  • Quantum annealing systems and methods may generally be implemented on an adiabatic quantum computer.
  • Any reference to quantum annealing is intended to encompass adiabatic quantum computation unless the context requires otherwise.
  • Quantum annealing uses quantum mechanics as a source of disorder during the annealing process.
  • An objective function, such as an optimization problem, is encoded in a Hamiltonian $H_P$, and the algorithm introduces quantum effects by adding a disordering Hamiltonian $H_D$ that does not commute with $H_P$.
  • An example case is:
  • $H_E \propto A(t)\,H_D + B(t)\,H_P$
  • where $A(t)$ and $B(t)$ are time-dependent envelope functions.
  • $A(t)$ can change from a large value to substantially zero during the evolution, and $H_E$ can be thought of as an evolution Hamiltonian similar to $H_e$ described in the context of adiabatic quantum computation above.
  • The disorder is slowly removed by removing $H_D$ (i.e., by reducing $A(t)$).
  • Quantum annealing is similar to adiabatic quantum computation in that the system starts with an initial Hamiltonian and evolves through an evolution Hamiltonian to a final “problem” Hamiltonian $H_P$ whose ground state encodes a solution to the problem. If the evolution is slow enough, the system may settle in the global minimum (i.e., the exact solution), or in a local minimum close in energy to the exact solution. The performance of the computation may be assessed via the residual energy (difference from exact solution using the objective function) versus evolution time. The computation time is the time required to generate a residual energy below some acceptable threshold value.
  • $H_P$ may encode an optimization problem and therefore $H_P$ may be diagonal in the subspace of the qubits that encode the solution, but the system does not necessarily stay in the ground state at all times.
  • The energy landscape of $H_P$ may be crafted so that its global minimum is the answer to the problem to be solved, and low-lying local minima are good approximations.
  • The gradual reduction of disordering Hamiltonian $H_D$ (i.e., reducing $A(t)$) in quantum annealing may follow a defined schedule known as an annealing schedule.
  • Unlike adiabatic quantum computation, where the system begins and remains in its ground state throughout the evolution, in quantum annealing the system may not remain in its ground state throughout the entire annealing schedule.
  • quantum annealing may be implemented as a heuristic technique, where low-energy states with energy near that of the ground state may provide approximate solutions to the problem.
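The heuristic use of annealing and the residual-energy measure described above can be illustrated with a purely classical stand-in. The sketch below runs classical simulated annealing (not quantum annealing) on a small Ising problem and reports the residual energy relative to the brute-force ground state. The problem coefficients, the geometric temperature schedule, and all function names are assumptions made for illustration.

```python
import itertools
import math
import random

# Hypothetical 4-spin Ising problem:
# E(s) = sum_i h[i] * s[i] + sum_{(i, j)} J[i, j] * s[i] * s[j], with s[i] in {-1, +1}.
h = [0.1, -0.2, 0.0, 0.3]
J = {(0, 1): -1.0, (1, 2): 0.5, (2, 3): -1.0, (0, 3): 0.7}

def energy(spins):
    """Objective function encoded by the (diagonal) problem Hamiltonian."""
    e = sum(h[i] * spins[i] for i in range(len(h)))
    e += sum(w * spins[i] * spins[j] for (i, j), w in J.items())
    return e

def simulated_annealing(num_sweeps, rng, t_start=5.0, t_end=0.05):
    """Classical annealing with a geometric temperature schedule.

    Returns the lowest energy visited during the run.
    """
    n = len(h)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    current = energy(spins)
    best = current
    for sweep in range(num_sweeps):
        frac = sweep / max(num_sweeps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac
        for i in range(n):
            spins[i] = -spins[i]            # propose a single spin flip
            proposed = energy(spins)
            delta = proposed - current
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current = proposed          # accept (Metropolis rule)
                best = min(best, current)
            else:
                spins[i] = -spins[i]        # reject and undo the flip
    return best

# Exact ground-state energy by enumeration (feasible only for tiny problems).
ground_energy = min(energy(s) for s in itertools.product((-1, 1), repeat=len(h)))

rng = random.Random(0)
residual = simulated_annealing(200, rng) - ground_energy
print(f"residual energy: {residual:.4f}")
```

Running the function for several values of `num_sweeps` and plotting the residual gives a classical analogue of the residual-energy-versus-evolution-time assessment.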
  • a method for unsupervised learning over an input space comprising discrete or continuous variables, and at least a subset of a training dataset of samples of the respective variables, to attempt to identify the value of at least one parameter that increases the log-likelihood of the at least a subset of a training dataset with respect to a model, the model expressible as a function of the at least one parameter, the method executed by circuitry including at least one processor, may be summarized as including forming a first latent space comprising a plurality of random variables, the plurality of random variables comprising one or more discrete random variables; forming a second latent space comprising the first latent space and a set of supplementary continuous random variables; forming a first transforming distribution comprising a conditional distribution over the set of supplementary continuous random variables, conditioned on the one or more discrete random variables of the first latent space; forming an encoding distribution comprising an approximating posterior distribution over the first latent space, conditioned on the input space; forming a prior distribution over the first latent
  • Increasing the lower bound on the log-likelihood of the at least a subset of a training dataset based at least in part on the gradient of the lower bound on the log-likelihood of the at least a subset of a training dataset may include increasing the lower bound on the log-likelihood of the at least a subset of a training dataset using a method of gradient descent.
  • Increasing the lower bound on the log-likelihood of the at least a subset of a training dataset using a method of gradient descent may include attempting to maximize the lower bound on the log-likelihood of the at least a subset of a training dataset using a method of gradient descent.
  • the lower bound may be an evidence lower bound.
  • Constructing a first stochastic approximation to the lower bound of the log-likelihood of the at least a subset of a training dataset may include decomposing the first stochastic approximation to the lower bound into at least a first part comprising negative KL-divergence between the approximating posterior and the prior distribution over the first latent space, and a second part comprising an expectation, or at least a stochastic approximation to an expectation, with respect to the approximating posterior over the second latent space of the conditional log-likelihood of the at least a subset of a training dataset under the decoding distribution.
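The two-part decomposition described above can be written compactly as follows. The notation is assumed for illustration and does not appear in the text: $x$ is an input, $z$ the discrete random variables of the first latent space, $\zeta$ the supplementary continuous random variables, $q_\phi$ the approximating posterior, $p_\theta(z)$ the prior, and $p_\theta(x \mid \zeta, z)$ the decoding distribution.

```latex
\log p_\theta(x) \;\ge\; \mathcal{L}(x; \theta, \phi)
  = \underbrace{-\,\mathrm{KL}\!\left[\, q_\phi(z \mid x) \,\middle\|\, p_\theta(z) \,\right]}_{\text{first part}}
  \;+\; \underbrace{\mathbb{E}_{q_\phi(\zeta, z \mid x)}\!\left[ \log p_\theta(x \mid \zeta, z) \right]}_{\text{second part}}
```

In this form the first part depends only on distributions over the first latent space, while the second part is an expectation over the second latent space and can be approximated stochastically from samples.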
  • Constructing a second stochastic approximation to the gradient of the lower bound may include determining the gradient of the second part of the first stochastic approximation by backpropagation; approximating the gradient of the first part of the first stochastic approximation with respect to one or more parameters of the prior distribution over the first latent space using samples from the prior distribution; and determining a gradient of the first part of the first stochastic approximation with respect to parameters of the encoding distribution by backpropagation.
  • Approximating the gradient of the first part of the first stochastic approximation with respect to one or more parameters of the prior distribution over the first latent space using samples from the prior distribution may include at least one of generating samples or causing samples to be generated by a quantum processor.
  • a logarithm of the prior distribution may be, to within a constant, a problem Hamiltonian of a quantum processor.
  • the method may further include generating samples or causing samples to be generated by a quantum processor; and determining an expectation with respect to the prior distribution from the samples.
  • Generating samples or causing samples to be generated by at least one quantum processor may include performing at least one post-processing operation on the samples.
  • Generating samples or causing samples to be generated by at least one quantum processor may include operating the at least one quantum processor as a sample generator to provide the samples from a probability distribution, wherein a shape of the probability distribution depends on a configuration of a number of programmable parameters for the at least one quantum processor, and wherein operating the at least one quantum processor as a sample generator comprises: programming the at least one quantum processor with a configuration of the number of programmable parameters for the at least one quantum processor, wherein the configuration of a number of programmable parameters corresponds to the probability distribution over the plurality of qubits of the at least one quantum processor; evolving the quantum processor; and reading out states for the qubits in the plurality of qubits of the at least one quantum processor, wherein the states for the qubits in the plurality of qubits correspond to a sample from the probability distribution.
  • The method may further include at least one of generating, or at least approximating, samples or causing samples to be generated, or at least approximated, by a restricted Boltzmann machine; and determining the expectation with respect to the prior distribution from the samples.
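A minimal sketch of how samples can be used to estimate the gradient of the KL term with respect to the prior parameters, assuming a Boltzmann-form prior p(z) proportional to exp(-E(z)) with E(z) = h·z + Σ_{i<j} J_ij z_i z_j over binary z. For such a prior, the gradient is the difference between first and second moments under the approximating posterior and under the prior. The energy form, the function names, and the Gibbs sampler (a classical stand-in for a quantum processor or a restricted Boltzmann machine sampler) are assumptions for illustration.

```python
import numpy as np

def kl_gradient_wrt_prior(posterior_samples, prior_samples):
    """Estimate d KL(q || p) / d(h, J) from samples.

    For p(z) ~ exp(-(h.z + sum_{i<j} J_ij z_i z_j)):
      dKL/dh_i  = E_q[z_i]     - E_p[z_i]
      dKL/dJ_ij = E_q[z_i z_j] - E_p[z_i z_j]
    """
    q = np.asarray(posterior_samples, dtype=float)
    p = np.asarray(prior_samples, dtype=float)
    grad_h = q.mean(axis=0) - p.mean(axis=0)
    second_q = q.T @ q / len(q)
    second_p = p.T @ p / len(p)
    grad_J = np.triu(second_q - second_p, k=1)  # keep only i < j couplings
    return grad_h, grad_J

def gibbs_samples(h, J, num_samples, rng, burn_in=100):
    """Classical stand-in sampler for the assumed Boltzmann prior."""
    n = len(h)
    J_sym = J + J.T  # J is strictly upper triangular, so the diagonal stays zero
    z = rng.integers(0, 2, size=n).astype(float)
    out = []
    for step in range(burn_in + num_samples):
        for i in range(n):
            # Energy difference E(z_i = 1) - E(z_i = 0) given the other units.
            field = h[i] + J_sym[i] @ z
            p_one = 1.0 / (1.0 + np.exp(field))
            z[i] = float(rng.random() < p_one)
        if step >= burn_in:
            out.append(z.copy())
    return np.array(out)

rng = np.random.default_rng(0)
h = np.array([0.5, -0.5, 0.0])
J = np.triu(np.array([[0.0, -1.0, 0.0],
                      [0.0, 0.0, 0.5],
                      [0.0, 0.0, 0.0]]), k=1)
prior_samples = gibbs_samples(h, J, 500, rng)
# Placeholder posterior samples; in training these would come from the encoder.
posterior_samples = rng.integers(0, 2, size=(500, 3))
grad_h, grad_J = kl_gradient_wrt_prior(posterior_samples, prior_samples)
```

Only the source of `prior_samples` would change if a different sample generator were used; the moment-difference estimate itself is independent of how the samples were produced.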
  • the set of supplementary continuous random variables may include a plurality of continuous variables, and each one of the plurality of continuous variables may be conditioned on a different respective one of the plurality of random variables.
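One way to realize a per-variable conditional of this kind is sketched below, assuming a "spike-and-exponential" transforming distribution: when a discrete variable is 0 the continuous variable is exactly 0, and when it is 1 the continuous variable follows a density proportional to exp(beta * zeta) on [0, 1]. This particular form, the parameter `beta`, and the function name are assumptions; the passage above does not specify the form of the conditional distribution. Sampling uses the inverse CDF of the marginal over zeta, so the sample is a deterministic function of the posterior probability and a uniform random number.

```python
import numpy as np

def sample_zeta(q_one, rho, beta=3.0):
    """Inverse-CDF sample of a continuous variable zeta in [0, 1].

    q_one : probability that the paired discrete variable equals 1
    rho   : uniform random number(s) in [0, 1]
    beta  : sharpness of the assumed exponential density for z = 1
    """
    q_one = np.asarray(q_one, dtype=float)
    rho = np.asarray(rho, dtype=float)
    spike_mass = 1.0 - q_one                    # probability mass at zeta = 0
    safe_q = np.maximum(q_one, 1e-12)           # avoid division by zero
    # Position of rho within the continuous part of the CDF, rescaled to [0, 1].
    u = np.clip((rho - spike_mass) / safe_q, 0.0, 1.0)
    # Invert F(zeta) = (exp(beta * zeta) - 1) / (exp(beta) - 1).
    zeta = np.log1p(u * np.expm1(beta)) / beta
    return np.where(rho <= spike_mass, 0.0, zeta)

rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.9])   # one probability per discrete variable
zeta = sample_zeta(q, rng.random(3))
```

Because each zeta depends on its own q value only, every continuous variable is conditioned on a different respective discrete variable, and the mapping is differentiable in q wherever rho exceeds the spike mass.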
  • the method may further include forming a second transforming distribution, wherein the input space comprises a plurality of input variables, and the second transforming distribution is conditioned on one or more of the plurality of input variables and at least one of the one or more discrete random variables.
  • A computational system may be summarized as including hardware or circuitry, for example including at least one processor; and at least one nontransitory processor-readable storage medium that stores at least one of processor-executable instructions or data which, when executed by the at least one processor, cause the at least one processor to execute any of the above described acts or any of the methods of claims 1 through 16.
  • a method for unsupervised learning by a computational system may be summarized as including forming a model, the model comprising one or more model parameters; initializing the model parameters; receiving a training dataset comprising a plurality of subsets of the training dataset; testing to determine if a stopping criterion has been met; in response to determining the stopping criterion has not been met: fetching a mini-batch comprising one of the plurality of subsets of the training dataset, the mini-batch comprising input data; performing propagation through an encoder that computes an approximating posterior distribution over a discrete space; sampling from the approximating posterior distribution over a set of continuous random variables via a sampler; performing propagation through a decoder that computes an auto-encoded distribution over the input data; performing backpropagation through the decoder of a log-likelihood of the input data with respect to the auto-encoded distribution over the input data; performing backpropagation through the
  • Initializing the model parameters may include initializing the model parameters using random variables. Initializing the model parameters may include initializing the model parameters based at least in part on a pre-training procedure. Testing to determine if a stopping criterion has been met may include testing to determine if a threshold number N of passes through the training dataset have been run.
  • the method may further include receiving at least a subset of a validation dataset, wherein testing to determine if a stopping criterion has been met includes determining a measure of validation loss on the at least a subset of a validation dataset computed on two or more successive passes, and testing to determine if the measure of validation loss meets a predetermined criterion.
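The two stopping tests described above (a threshold number N of passes, and a criterion on validation loss over successive passes) could be combined as in the following sketch. The specific criterion used here, stopping when the validation loss has not improved over the last `patience` passes, is an assumption, as are the function and parameter names; the text only requires that the measure meet a predetermined criterion.

```python
def should_stop(passes_run, max_passes, validation_losses,
                patience=2, min_improvement=0.0):
    """Return True when training should stop.

    passes_run        : completed passes through the training dataset
    max_passes        : threshold number N of passes
    validation_losses : validation loss recorded after each pass so far
    patience          : number of most recent passes allowed without improvement
    min_improvement   : smallest decrease in loss that counts as improvement
    """
    if passes_run >= max_passes:
        return True
    if len(validation_losses) <= patience:
        return False  # not enough history to compare successive passes
    best_before = min(validation_losses[:-patience])
    recent_best = min(validation_losses[-patience:])
    return recent_best >= best_before - min_improvement

# Example: loss improves for two passes, then degrades for two passes.
history = [1.00, 0.80, 0.85, 0.90]
print(should_stop(passes_run=4, max_passes=10, validation_losses=history))
```

A training loop would call this test once per pass, before fetching the next mini-batch.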
  • Determining a second gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space may include determining a second gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space by generating samples or causing samples to be generated by a quantum processor.
  • Generating samples or causing samples to be generated by a quantum processor may include operating the at least one quantum processor as a sample generator to provide the samples from a probability distribution, wherein a shape of the probability distribution depends on a configuration of a number of programmable parameters for the at least one quantum processor, and wherein operating the at least one quantum processor as a sample generator comprises programming the at least one quantum processor with a configuration of the number of programmable parameters for the at least one quantum processor, wherein the configuration of a number of programmable parameters corresponds to the probability distribution over the plurality of qubits of the at least one quantum processor; evolving the at least one quantum processor; and reading out states for the qubits in the plurality of qubits of the at least one quantum processor, wherein the states for the qubits in the plurality of qubits correspond to a sample from the probability distribution.
  • Operating the at least one quantum processor as a sample generator to provide the samples from a probability distribution may include operating the at least one quantum processor to perform at least one post-processing operation on the samples.
  • Sampling from the approximating posterior distribution over a set of continuous random variables may include generating samples or causing samples to be generated by a digital processor.
  • the method for unsupervised learning may further include dividing the discrete space into a first plurality of disjoint groups; and dividing the set of supplementary continuous random variables into a second plurality of disjoint groups, wherein performing propagation through an encoder that computes an approximating posterior over a discrete space includes: determining a processing sequence for the first and the second plurality of disjoint groups; and for each of the first plurality of disjoint groups in an order determined by the processing sequence, performing propagation through an encoder that computes an approximating posterior, the approximating posterior conditioned on at least one of the previous ones in the processing sequence of the second plurality of disjoint groups and at least one of the plurality of input variables.
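The group-by-group propagation described above can be sketched as a loop in which each group's approximating posterior is computed from the input together with the continuous samples already drawn for earlier groups in the processing sequence. The linear-plus-sigmoid encoder, the weight shapes, and the spike-and-exponential sampling step are all assumptions made for illustration; the text does not prescribe these forms.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hierarchical_encode(x, weights, rng, beta=3.0):
    """Process groups of latent variables in sequence.

    weights[k] maps concat(x, zeta_0, ..., zeta_{k-1}) to the logits of the
    discrete variables in group k (a hypothetical one-layer encoder).
    Returns the per-group posterior probabilities and continuous samples.
    """
    probs, zetas = [], []
    for W in weights:
        # Condition on the input and on all previously sampled continuous groups.
        features = np.concatenate([x] + zetas)
        q_one = sigmoid(W @ features)
        # Sample the paired continuous variables by inverse CDF (assumed form).
        rho = rng.random(q_one.shape)
        spike_mass = 1.0 - q_one
        u = np.clip((rho - spike_mass) / np.maximum(q_one, 1e-12), 0.0, 1.0)
        zeta = np.where(rho <= spike_mass, 0.0,
                        np.log1p(u * np.expm1(beta)) / beta)
        probs.append(q_one)
        zetas.append(zeta)
    return probs, zetas

rng = np.random.default_rng(0)
x = rng.random(4)                              # four input variables
weights = [rng.normal(size=(3, 4)),            # group 0: 3 variables, sees x
           rng.normal(size=(2, 4 + 3))]        # group 1: 2 variables, sees x and zeta_0
probs, zetas = hierarchical_encode(x, weights, rng)
```

The order of `weights` plays the role of the processing sequence: reordering the list changes which groups are conditioned on which.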
  • Generating samples or causing samples to be generated by a quantum processor may include operating the at least one quantum processor as a sample generator to provide the samples from a probability distribution, wherein a shape of the probability distribution depends on a configuration of a number of programmable parameters for the analog processor, and wherein operating the at least one quantum processor as a sample generator comprises: programming the at least one quantum processor with a configuration of the number of programmable parameters for the at least one quantum processor, wherein the configuration of a number of programmable parameters corresponds to the probability distribution over the plurality of qubits of the at least one quantum processor; evolving the at least one quantum processor; and reading out states for the qubits in the plurality of qubits of the at least one quantum processor, wherein the states for the qubits in the plurality of qubits correspond to a sample from the probability distribution.
  • Sampling from the approximating posterior over a set of continuous random variables may include generating samples or causing samples to be generated by a digital processor.
  • A computational system may be summarized as including hardware or circuitry, for example including at least one processor; and at least one nontransitory processor-readable storage medium that stores at least one of processor-executable instructions or data which, when executed by the at least one processor, cause the at least one processor to execute any of the above described acts or any of the methods of claims 18 through 37.
  • a method of unsupervised learning by a computational system may be summarized as including determining a first approximating posterior distribution over at least one group of a set of discrete random variables; sampling from at least one group of a set of supplementary continuous random variables using the first approximating posterior distribution over the at least one group of the set of discrete random variables to generate one or more samples, wherein a transforming distribution comprises a conditional distribution over the set of supplementary continuous random variables, conditioned on the one or more discrete random variables; determining a second approximating posterior distribution and a first prior distribution, the first prior distribution over at least one layer of a set of continuous variables; sampling from the second approximating posterior distribution; determining an auto-encoding loss on an input space comprising discrete or continuous variables, the auto-encoding loss conditioned on the one or more samples; determining a first KL-divergence, or at least an approximation thereof, between the second posterior distribution and the first prior distribution; determining
  • A computational system may be summarized as including hardware or circuitry, for example including at least one processor; and at least one nontransitory processor-readable storage medium that stores at least one of processor-executable instructions or data which, when executed by the at least one processor, cause the at least one processor to execute any of the immediately above described acts or any of the methods of claims 39 through 40.
  • a method of unsupervised learning by a computational system may be summarized as including determining a first approximating posterior distribution over a first group of discrete random variables conditioned on an input space comprising discrete or continuous variables; sampling from a first group of supplementary continuous variables based on the first approximating posterior distribution; determining a second approximating posterior distribution over a second group of discrete random variables conditioned on the input space and samples from the first group of supplementary continuous random variables; sampling from a second group of supplementary continuous variables based on the second approximating posterior distribution; determining a third approximating posterior distribution and a first prior distribution over a first layer of additional continuous random variables, the third approximating distribution conditioned on the input space, samples from at least one of the first and the second group of supplementary continuous random variables, and the first prior distribution conditioned on samples from at least one of the first and the second group of supplementary continuous random variables; sampling from the first layer of additional continuous random variables based on the third
  • A computational system may be summarized as including hardware or circuitry, for example including at least one processor and at least one nontransitory processor-readable storage medium that stores at least one of processor-executable instructions or data which, when executed by the at least one processor, cause the at least one processor to execute any of the immediately above described acts or any of the methods of claims 41 through 42.
  • FIG. 1 is a schematic diagram of an exemplary hybrid computer including a digital computer and an analog computer in accordance with the present systems, devices, methods, and articles.
  • FIG. 2A is a schematic diagram of an exemplary topology for a quantum processor.
  • FIG. 2B is a schematic diagram showing a close-up of the exemplary topology for a quantum processor.
  • FIG. 3 is a schematic diagram illustrating an example implementation of a variational auto-encoder (VAE).
  • FIG. 4 is a flow chart illustrating a method for unsupervised learning, in accordance with the presently described systems, devices, articles, and methods.
  • FIG. 5 is a schematic diagram illustrating an example implementation of a hierarchical variational auto-encoder (VAE).
  • FIG. 6 is a schematic diagram illustrating an example implementation of a variational auto-encoder (VAE) with a hierarchy of continuous latent variables.
  • FIG. 7 is a flow chart illustrating a method for unsupervised learning via a hierarchical variational auto-encoder (VAE), in accordance with the present systems, devices, articles and methods.
  • References to a processor or at least one processor refer to hardware or circuitry, whether discrete or integrated, for example single or multi-core microprocessors, microcontrollers, central processor units, digital signal processors, graphical processing units, programmable gate arrays, programmed logic controllers, and analog processors, for instance quantum processors.
  • Various algorithms and methods and specific acts are executable via one or more processors.
  • FIG. 1 illustrates a hybrid computing system 100 including a digital computer 105 coupled to an analog computer 150.
  • Analog computer 150 is a quantum processor.
  • The exemplary digital computer 105 includes a digital processor (CPU) 110 that may be used to perform classical digital processing tasks.
  • Digital computer 105 may include at least one digital processor (such as central processor unit 110 with one or more cores), at least one system memory 120, and at least one system bus 117 that couples various system components, including system memory 120, to central processor unit 110.
  • The digital processor may be any logic processing unit, such as one or more central processing units (“CPUs”), graphics processing units (“GPUs”), digital signal processors (“DSPs”), application-specific integrated circuits (“ASICs”), field-programmable gate arrays (“FPGAs”), programmable logic controllers (“PLCs”), etc., and/or combinations of the same.
  • Digital computer 105 may include a user input/output subsystem 111 .
  • the user input/output subsystem includes one or more user input/output components such as a display 112 , mouse 113 , and/or keyboard 114 .
  • System bus 117 can employ any known bus structures or architectures, including a memory bus with a memory controller, a peripheral bus, and a local bus.
  • System memory 120 may include non-volatile memory, such as read-only memory (“ROM”), static random access memory (“SRAM”), Flash NAND; and volatile memory such as random access memory (“RAM”) (not shown).
  • Digital computer 105 may also include other non-transitory computer- or processor-readable storage media or non-volatile memory 115 .
  • Non-volatile memory 115 may take a variety of forms, including: a hard disk drive for reading from and writing to a hard disk, an optical disk drive for reading from and writing to removable optical disks, and/or a magnetic disk drive for reading from and writing to magnetic disks.
  • the optical disk can be a CD-ROM or DVD, while the magnetic disk can be a magnetic floppy disk or diskette.
  • Non-volatile memory 115 may communicate with digital processor via system bus 117 and may include appropriate interfaces or controllers 116 coupled to system bus 117 .
  • Non-volatile memory 115 may serve as long-term storage for processor- or computer-readable instructions, data structures, or other data (sometimes called program modules) for digital computer 105 .
  • Although digital computer 105 has been described as employing hard disks, optical disks and/or magnetic disks, those skilled in the relevant art will appreciate that other types of non-volatile computer-readable media may be employed, such as magnetic cassettes, flash memory cards, Flash, ROMs, smart cards, etc.
  • some computer architectures employ both volatile memory and non-volatile memory. For example, data in volatile memory can be cached to non-volatile memory, or a solid-state disk can employ integrated circuits to provide non-volatile memory.
  • system memory 120 may store instructions for communicating with remote clients and scheduling use of resources including resources on the digital computer 105 and analog computer 150 .
  • system memory 120 may store at least one of processor executable instructions or data that, when executed by at least one processor, cause the at least one processor to execute the various algorithms described elsewhere herein, including machine learning related algorithms.
  • system memory 120 may store processor- or computer-readable calculation instructions to perform pre-processing, co-processing, and post-processing to analog computer 150 .
  • System memory 120 may store a set of analog computer interface instructions to interact with analog computer 150 .
  • Analog computer 150 may include at least one analog processor such as quantum processor 140 .
  • Analog computer 150 can be provided in an isolated environment, for example, in an isolated environment that shields the internal elements of the quantum computer from heat, magnetic field, and other external noise (not shown).
  • the isolated environment may include a refrigerator, for instance a dilution refrigerator, operable to cryogenically cool the analog processor, for example to temperatures below approximately 1 kelvin.
  • FIG. 2A shows an exemplary topology 200 a for a quantum processor, in accordance with the presently described systems, devices, articles, and methods.
  • Topology 200 a may be used to implement quantum processor 140 of FIG. 1 , however other topologies can also be used for the systems and methods of the present disclosure.
  • Topology 200 a comprises a grid of 2×2 cells 210 a - 210 d , each cell comprising 8 qubits such as qubit 220 (only one called out in FIG. 2A ).
  • Within each cell 210 a - 210 d there are eight qubits 220 (only one called out for drawing clarity), the qubits 220 in each cell 210 a - 210 d arranged in four rows (extending horizontally in drawing sheet) and four columns (extending vertically in drawing sheet). Pairs of qubits 220 from the rows and columns can be communicatively coupled to one another by a respective coupler such as coupler 230 (illustrated by bold cross shapes, only one called out in FIG. 2A ).
  • a respective coupler 230 is positioned and operable to communicatively couple the qubit in each column (vertically-oriented qubit in drawing sheet) in each cell to the qubits in each row (horizontally-oriented qubit in drawing sheet) in the same cell.
  • a respective coupler such as coupler 240 (only one called out in FIG. 2A ) is positioned and operable to communicatively couple the qubit in each column (vertically-oriented qubit in drawing sheet) in each cell with a corresponding qubit in each column (vertically-oriented qubit in drawing sheet) in a nearest neighboring cell in a same direction as the orientation of the columns.
  • a respective coupler such as coupler 250 (only one called out in FIG.
  • Since the couplers 240 , 250 couple qubits 220 between cells 210 , such couplers 240 , 250 may at times be denominated as inter-cell couplers. Since the couplers 230 couple qubits within a cell 210 , such couplers 230 may at times be denominated as intra-cell couplers.
  • FIG. 2B shows an exemplary topology 200 b for a quantum processor, in accordance with the presently described systems, devices, articles, and methods.
  • Topology 200 b shows nine cells, such as cell 210 b (only one called out in FIG. 2B ), each cell comprising eight qubits, for 72 qubits in total, labeled q 1 through q 72 .
  • FIG. 2B illustrates the intra-coupling, such as coupler 230 b (only one called out in FIG. 2B ), and inter-coupling, such as coupler 260 (only one called out in FIG. 2B ), for the cell 210 b.
  • quantum processor 140 with the topology illustrated in FIGS. 2A and 2B is not limited only to problems that fit the native topology. For example, it is possible to embed a complete graph of size N on a quantum processor of size $O(N^2)$ by chaining qubits together.
  • a computational system 100 comprising a quantum processor 140 with topology 200 a of FIG. 2A or topology 200 b of FIG. 2B can specify an energy function over spin variables +1/−1, and receive from the quantum processor with topology 200 a or topology 200 b samples of lower energy spin configurations in an approximately Boltzmann distribution according to the Ising model as follows:
  • the spin variables can be mapped to binary variables 0/1. Higher-order energy functions can be expressed by introducing additional constraints over auxiliary variables.
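  • As a minimal sketch of the Ising energy function and the spin-to-binary mapping described above, the following enumerates a three-spin system. The sign convention $H(s) = \sum_i h_i s_i + \sum_{i<j} J_{ij} s_i s_j$, the inverse temperature `beta`, and the numerical values of `h` and `J` are assumptions chosen for illustration; a hardware sampler would replace the exhaustive enumeration.

```python
import itertools
import numpy as np

def ising_energy(s, h, J):
    """Energy of a spin configuration s in {-1,+1}^n (assumed sign convention).

    J is upper-triangular with a zero diagonal, so s @ J @ s sums over i < j.
    """
    return float(h @ s + s @ J @ s)

def boltzmann_distribution(h, J, beta=1.0):
    """Exact Boltzmann probabilities, found by enumerating all 2^n spin states."""
    n = len(h)
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    energies = np.array([ising_energy(s, h, J) for s in states])
    weights = np.exp(-beta * (energies - energies.min()))
    return states, energies, weights / weights.sum()

def spins_to_bits(s):
    """Map spins +1/-1 to binary variables 1/0 via b = (s + 1) / 2."""
    return (np.asarray(s) + 1) // 2

# Hypothetical biases and couplings for a three-spin example.
h = np.array([0.5, -0.2, 0.1])
J = np.array([[0.0, -1.0, 0.3],
              [0.0, 0.0, 0.7],
              [0.0, 0.0, 0.0]])
states, energies, probs = boltzmann_distribution(h, J)
```

  • Lower-energy configurations receive higher probability, which is the property the sampling-based training described later relies on.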
  • Quantum hardware typically includes one or more quantum processors or quantum processing units (QPUs).
  • the systems and methods described herein adapt machine learning architectures and methods to exploit QPUs to advantageously achieve improved machine performance. Improved machine performance typically includes reduced training time and/or increased generalization accuracy.
  • optimization and sampling can be computational bottlenecks in machine learning systems and methods.
  • the systems and methods described herein integrate the QPU into the machine learning pipeline (including the architecture and methods) to perform optimization and/or sampling with improved performance over classical hardware.
  • the machine learning pipeline can be modified to suit QPUs that can be realized in practice.
  • Boltzmann machines including restricted Boltzmann machines (RBMs) can be used in deep learning systems.
  • Boltzmann machines are particularly suitable for unsupervised learning and probabilistic modeling such as in-painting and classification.
  • a QPU can be integrated into machine learning systems and methods to reduce the time taken to perform training.
  • the QPU can be used as a physical Boltzmann sampler.
  • the approach involves programming the QPU (which is an Ising system) such that the spin configurations realize a user-defined Boltzmann distribution natively. The approach can then draw samples directly from the QPU.
  • the restricted Boltzmann machine is a probabilistic graphical model that represents a joint probability distribution p(x,z) over binary visible units x and binary hidden units z.
  • the restricted Boltzmann machine can be used as an element in a deep learning network.
  • the RBM network has the topology of a bipartite graph with biases on each visible unit and on each hidden unit, and weights (couplings) on each edge.
  • An energy E(x,z) can be associated with the joint probability distribution p(x,z) over the visible and the hidden units, as follows:
  • conditional probabilities can be computed:
  • Here $\sigma$ denotes the sigmoid function, used to ensure the values of the conditional probabilities lie in the range [0,1].
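  • The RBM energy and conditional probabilities can be sketched as follows. The energy convention $E(x,z) = -b^\top x - c^\top z - z^\top W x$ and the names `W`, `b`, `c` are assumptions (a common textbook form), not necessarily the exact expression intended here. The brute-force check illustrates why the bipartite structure makes each conditional a simple sigmoid.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rbm_energy(x, z, W, b, c):
    """Assumed convention: E(x, z) = -b.x - c.z - z^T W x."""
    return -(b @ x) - (c @ z) - (z @ W @ x)

def p_hidden_given_visible(x, W, c):
    """p(z_j = 1 | x) for each hidden unit j."""
    return sigmoid(c + W @ x)

def p_visible_given_hidden(z, W, b):
    """p(x_i = 1 | z) for each visible unit i."""
    return sigmoid(b + W.T @ z)

rng = np.random.default_rng(0)
n_vis, n_hid = 3, 2
W = rng.normal(size=(n_hid, n_vis))   # couplings on the bipartite edges
b = rng.normal(size=n_vis)            # visible biases
c = rng.normal(size=n_hid)            # hidden biases

# Brute-force p(z_0 = 1 | x) by enumerating every hidden configuration.
x = np.array([1.0, 0.0, 1.0])
hidden_states = [np.array(z, dtype=float)
                 for z in itertools.product([0, 1], repeat=n_hid)]
weights = np.array([np.exp(-rbm_energy(x, z, W, b, c)) for z in hidden_states])
brute_force = sum(w for w, z in zip(weights, hidden_states) if z[0] == 1) / weights.sum()
```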
  • Training is the process by which the parameters of the model are adjusted to favor producing the desired training distribution. Typically, this is done by maximizing the log-likelihood of the observed data distribution with respect to the model parameters.
  • One part of the process involves sampling over the given data distribution, and this part is generally straightforward.
  • Another part of the process involves sampling over the predicted model distribution, and this is generally intractable, in the sense that it would use unmanageable amounts of computational resources.
  • Markov Chain Monte Carlo (MCMC) methods can be used to approximate this sampling.
  • Contrastive Divergence-k (CD-k) can be used, in which the method only takes k steps of the MCMC process.
  • Another way to speed up the process is to use Persistent Contrastive Divergence (PCD), in which a Markov Chain is initialized in the state where it ended from the previous model.
  • CD-k and PCD methods tend to perform poorly when the distribution is multi-modal and the modes are separated by regions of low probability.
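  • A minimal sketch of a CD-k weight-gradient estimate for a binary RBM is given below, assuming the energy convention $E(x,z) = -b^\top x - c^\top z - z^\top W x$. The function names and the optional `chain_start` argument (which gives PCD-style behaviour by resuming from a previous chain state) are illustrative choices, not an API from the source.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(x, W, b, c, rng):
    """One block-Gibbs sweep: sample z given x, then x given z."""
    z = (rng.random(c.shape) < sigmoid(c + W @ x)).astype(float)
    return (rng.random(b.shape) < sigmoid(b + W.T @ z)).astype(float)

def cd_k_weight_gradient(x_data, W, b, c, k, rng, chain_start=None):
    """CD-k estimate of d log p(x) / dW for a single data vector.

    With chain_start=None the chain starts at the data (plain CD-k);
    passing the state returned by a previous call mimics PCD.
    """
    x_model = x_data.copy() if chain_start is None else chain_start.copy()
    for _ in range(k):                      # only k MCMC steps are taken
        x_model = gibbs_step(x_model, W, b, c, rng)
    positive = np.outer(sigmoid(c + W @ x_data), x_data)    # data-driven phase
    negative = np.outer(sigmoid(c + W @ x_model), x_model)  # model-driven phase
    return positive - negative, x_model
```

  • Because the chain takes only k steps from its starting point, it tends to stay near one mode, which is consistent with the poor behaviour on multi-modal distributions noted above.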
  • the effects can be mitigated by sampling from the QPU and using the samples as starting points for non-quantum post-processing e.g., to initialize MCMC, CD, and PCD.
  • the QPU is performing the hard part of the sampling process.
  • the QPU finds a diverse set of valleys, and the post-processing operation samples within the valleys.
  • Post-processing can be implemented in a GPU and can be at least partially overlapped with sampling in the quantum processor to reduce the impact of post-processing on the overall timing.
  • a training data set can comprise a set of visible vectors. Training comprises adjusting the model parameters such that the model is most likely to reproduce the distribution of the training set. Typically, training comprises maximizing the log-likelihood of the observed data distribution with respect to the model parameters $\theta$:
  • the first term on the right-hand side (RHS) in the above equation is related to the positive phase and computes an expected value of energy E over p(z
  • the term involves sampling over the given data distribution.
  • the second term on the RHS is related to the negative phase, and computes an expected value of energy, over p(x
  • the term involves sampling over the predicted model distribution.
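  • The two phases can be illustrated on an RBM small enough to enumerate exactly. The sketch below assumes the energy $E(x,z) = -b^\top x - c^\top z - z^\top W x$, for which the weight gradient of $\log p(x)$ is the positive-phase expectation of $z x^\top$ minus the negative-phase expectation of $z x^\top$. In practice the negative phase is the intractable part and would be estimated from samples.

```python
import itertools
import numpy as np

def energy(x, z, W, b, c):
    """Assumed convention: E(x, z) = -b.x - c.z - z^T W x."""
    return -(b @ x) - (c @ z) - (z @ W @ x)

def all_states(n):
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

def log_likelihood(x, W, b, c):
    """log p(x) = log sum_z exp(-E(x, z)) - log Z, by exhaustive enumeration."""
    n_hid, n_vis = W.shape
    free = np.log(sum(np.exp(-energy(x, z, W, b, c)) for z in all_states(n_hid)))
    log_z = np.log(sum(np.exp(-energy(v, z, W, b, c))
                       for v in all_states(n_vis) for z in all_states(n_hid)))
    return free - log_z

def exact_weight_gradient(x, W, b, c):
    """Positive phase E_{p(z|x)}[z x^T] minus negative phase E_{p(x,z)}[z x^T]."""
    n_hid, n_vis = W.shape
    zs, vs = all_states(n_hid), all_states(n_vis)
    w_pos = np.array([np.exp(-energy(x, z, W, b, c)) for z in zs])
    positive = sum(w * np.outer(z, x) for w, z in zip(w_pos, zs)) / w_pos.sum()
    joint = [(np.exp(-energy(v, z, W, b, c)), v, z) for v in vs for z in zs]
    total = sum(w for w, _, _ in joint)
    negative = sum(w * np.outer(z, v) for w, v, z in joint) / total
    return positive - negative
```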
  • Unsupervised learning of probabilistic models is a technique for machine learning. It can facilitate tasks such as denoising to extract a signal from a mixture of signal and noise, and inpainting to reconstruct lost or corrupted parts of an image. It can also regularize supervised tasks such as classification.
  • unsupervised learning can include attempting to maximize the log-likelihood of an observed dataset under a probabilistic model. Equivalently, unsupervised learning can include attempting to minimize the KL-divergence from the data distribution to that of the model. While the exact gradient of the log-likelihood function is frequently intractable, stochastic approximations can be computed, provided samples can be drawn from the probabilistic model and its posterior distribution given the observed data.
  • Sampling can be efficient in directed graphical models comprising a directed acyclic graph since sampling can be performed by an ancestral pass. Even so, it can be inefficient to compute the posterior distributions over the hidden causes of observed data in such models, and samples from the posterior distributions are required to compute the gradient of the log-likelihood function.
  • Another approach to unsupervised learning is to optimize a lower bound on the log-likelihood function. This approach can be more computationally efficient.
  • An example of a lower bound is the evidence lower bound (ELBO), which differs from the true log-likelihood by the KL-divergence between an approximating posterior distribution, $q(z|x,\phi)$, and the true posterior distribution, $p(z|x,\theta)$.
  • the approximating posterior distribution can be designed to be computationally tractable even though the true posterior distribution is not computationally tractable.
  • the ELBO can be expressed as follows:
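  • $$\mathcal{L}(x,\theta,\phi) = \log p(x|\theta) - \mathrm{KL}\left[q(z|x,\phi)\,\|\,p(z|x,\theta)\right] = \int_z q(z|x,\phi)\,\log\left[\frac{p(x,z|\theta)}{q(z|x,\phi)}\right]$$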
  • the variational auto-encoder can regroup the ELBO as:
  • the KL-divergence between the approximating posterior and the true prior is analytically simple and computationally efficient for commonly chosen distributions, such as Gaussians.
  • a low-variance stochastic approximation to the gradient of the auto-encoding term q can be backpropagated efficiently, so long as samples from the approximating posterior q(z
  • samples can be drawn using a Gaussian distribution with mean $m(x,\phi)$ and variance $v(x,\phi)$ determined by the input, $\mathcal{N}(m(x,\phi), v(x,\phi))$.
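  • The Gaussian case can be sketched as a deterministic, differentiable function of the mean, the variance, and parameter-independent noise. The function name and the numerical values below are illustrative assumptions; in a VAE the mean and variance would be outputs of the encoder network.

```python
import numpy as np

def sample_gaussian_reparameterized(mean, variance, rho):
    """Differentiable, deterministic map from noise rho ~ N(0, 1) to a sample.

    The derivatives are d(sample)/d(mean) = 1 and
    d(sample)/d(variance) = rho / (2 * sqrt(variance)), so gradients can
    flow back to whatever produced the mean and variance.
    """
    return mean + np.sqrt(variance) * rho

rng = np.random.default_rng(0)
rho = rng.standard_normal(100_000)      # independent of the input and parameters
samples = sample_gaussian_reparameterized(mean=1.5, variance=0.25, rho=rho)
```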
  • the conditional-marginal cumulative distribution function (CDF) is defined by
  • Since the approximating posterior $q(z|x,\phi)$ maps each input to a distribution over the latent space, it is called the “encoder”.
  • Since the conditional likelihood $p(x|z,\theta)$ maps each configuration of the latent variables to a distribution over the input space, it is called the “decoder”.
  • the CDF $F_i(x)$ is the CDF of $x_i$ conditioned on all $x_j$ where $j < i$, and marginalized over all $x_k$ where $i < k$.
  • Such inverses generally exist provided the conditional-marginal probabilities are everywhere non-zero.
  • the approach can run into challenges with discrete distributions, such as, for example, Restricted Boltzmann Machines (RBMs).
  • An approximating posterior that only assigns non-zero probability to a discrete domain corresponds to a CDF that is piecewise-constant. That is, the range of the CDF is a proper subset of the interval [0, 1].
  • the domain of the inverse CDF is thus also a proper subset of the interval [0, 1] and its derivative is generally not defined.
  • the difficulty can remain even if a quantile function as follows is used:
  • the derivative of the quantile function is either zero or infinite for a discrete distribution.
  • One method for discrete distributions is to use a reinforcement learning method such as REINFORCE (Williams, http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf).
  • REINFORCE adjusts weights following receipt of a reinforcement value by an amount proportional to the difference between a reinforcement baseline and the reinforcement value.
  • the gradient of the log of the conditional likelihood distribution is estimated, in effect, by a finite difference approximation.
  • z, ⁇ ) is evaluated at many different points z ⁇ q(z
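  • A minimal sketch of a REINFORCE-style (score-function) gradient estimate for a single Bernoulli latent variable follows. The function `f` stands in for the log conditional likelihood evaluated at a sampled z, and the parameterization directly by the probability `q` is an assumption made to keep the example small.

```python
import numpy as np

def reinforce_gradient(q, f, n_samples, rng, baseline=0.0):
    """Estimate d/dq E_{z ~ Bernoulli(q)}[f(z)] with the score-function estimator.

    f is evaluated only at sampled points; its own gradient is never used.
    """
    z = (rng.random(n_samples) < q).astype(float)
    score = z / q - (1.0 - z) / (1.0 - q)      # d log q(z) / dq
    return float(np.mean((f(z) - baseline) * score))

def f(z):
    """Hypothetical stand-in for the log conditional likelihood at z."""
    return (z - 0.3) ** 2

rng = np.random.default_rng(0)
estimate = reinforce_gradient(q=0.4, f=f, n_samples=200_000, rng=rng)
exact = f(1.0) - f(0.0)    # d/dq [q f(1) + (1 - q) f(0)]
```

  • Many samples are needed for the estimate to settle, which reflects the finite-difference character of the method described above.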
  • a discrete variational auto-encoder is a hierarchical probabilistic model consisting of an RBM, followed by multiple layers of continuous latent variables, allowing the binary variables to be marginalized out, and the gradient to backpropagate smoothly through the auto-encoding component of the ELBO.
  • the generative model is redefined so that the conditional distribution of the observed variables given the latent variables only depends on the new continuous latent space.
  • VAEs break the encoder distribution into “packets” of probability, each packet having infinitesimal but equal probability mass.
  • the values of the latent variables are approximately constant.
  • the packets correspond to a region in the latent space, and the expectation value is taken over the packets. There are generally more packets in regions of high probability, so more probable values are more likely to be selected.
  • the location of each packet can move, while its probability mass stays constant. So long as F q(z
  • REINFORCE works by breaking the latent representation into segments of infinitesimal but equal volume, within which the latent variables are also approximately constant, while the probability mass varies between segments. Once a segment is selected in the latent space, its location is independent of the parameters of the encoder. As a result, the contribution of the selected location to the loss function is not dependent on the gradient of the decoder. On the other hand, the probability mass assigned to the region in the latent space around the selected location is relevant.
  • the gradient estimate is generally only low-variance provided the motion of most probability packets has a similar effect on the loss function. This is likely to be the case when the packets are tightly clustered (e.g., if the encoder produces a Gaussian distribution with low variance) or if the movements of well-separated packets have a similar effect on the loss function (e.g., if the decoder is roughly linear).
  • VAEs cannot generally be used directly with discrete latent representations because changing the parameters of a discrete encoder moves probability mass between the allowed discrete values, and the allowed discrete values are generally far apart. As the encoder parameters change, a selected packet either remains in place or jumps more than an infinitesimal distance to an allowed discrete value. Consequently, small changes to the parameters of the encoder do not affect most of the probability packets. Even when a packet jumps between discrete values of the latent representation, the gradient of the decoder generally cannot be used to estimate the change in loss function accurately, because the gradient generally captures only the effects of very small movements of the probability packet.
  • the method described herein for unsupervised learning transforms the distributions to a continuous latent space within which the probability packets move smoothly.
  • the encoder $q(z|x,\phi)$ and the prior distribution $p(z|\theta)$ are extended by a transformation to a continuous, auxiliary latent representation $\zeta$, and the decoder is correspondingly transformed to be a function of the continuous representation.
  • one approach maps each point in the discrete latent space to a non-zero probability over the entire auxiliary continuous space. In so doing, if the probability at a point in the discrete latent space increases from zero to a non-zero value, a probability packet does not have to jump a large distance to cover the resulting region in the auxiliary continuous space. Moreover, it ensures that the CDFs $F_i(x)$ are strictly increasing as a function of their main argument, and thus are invertible.
  • the method described herein for unsupervised learning smooths the conditional-marginal CDF $F_i(x)$ of an approximating posterior distribution, and renders the distribution invertible, and its inverse differentiable, by augmenting the latent discrete representation with a set of continuous random variables.
  • the generative model is redefined so that the conditional distribution of the observed variables given the latent variables only depends on the new continuous latent space.
  • the discrete distribution is thereby transformed into a mixture distribution over the continuous latent space, each value of each discrete random variable associated with a distinct mixture component on the continuous expansion. This does not alter the fundamental form of the model, nor the KL-divergence term of the ELBO; rather it adds a stochastic component to the approximating posterior and the prior.
  • the method augments the latent representation with continuous random variables $\zeta$, conditioned on z, as follows:
  • FIG. 3 shows an example implementation of a VAE.
  • the variable z is a latent variable.
  • the variable x is a visible variable (for example, pixels in an image data set).
  • the variable $\zeta$ is a continuous variable conditioned on a discrete z as described above in the present disclosure.
  • the variable $\zeta$ can serve to smooth out the discrete random variables in the auto-encoder term. As described above, the variable $\zeta$ generally does not directly affect the KL-divergence between the approximating posterior and the true prior.
  • the variables z 1 , z 2 , and z 3 are disjoint subsets of qubits in the quantum processor.
  • the computational system samples from the RBM using the quantum processor.
  • the computational system generates the hierarchical approximating posteriors using a digital (classical) computer.
  • the computational system uses priors 310 and 330 , and hierarchical approximating posteriors 320 and 340 .
  • the system adds continuous variables $\zeta_1$, $\zeta_2$, $\zeta_3$ below the latent variables $z_1$, $z_2$, $z_3$.
  • FIG. 3 also shows the auto-encoding loop 350 of the VAE.
  • Its output q, along with independent random variable $\rho$, is passed into the deterministic function $F^{-1}_{q(\zeta|x,\phi)}(\rho)$ to produce a sample of $\zeta$.
  • This $\zeta$, along with the original input x, is finally passed to $\log p(x|\zeta,\theta)$.
  • the expectation of this log probability with respect to $\rho$ is the auto-encoding term of the VAE.
  • This auto-encoder, conditioned on the input and the independent $\rho$, is deterministic and differentiable, so backpropagation can be used to produce a low-variance, computationally efficient approximation to the gradient.
  • the distribution remains continuous as $q(z|x,\phi)$ changes.
  • the distribution is also everywhere non-zero in the approach that maps each point in the discrete latent space to a non-zero probability over the entire auxiliary continuous space.
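  • The smoothing step can be sketched for one binary unit with a specific, assumed choice of continuous expansion: a ramp $2(1-\zeta)$ for z = 0 and a ramp $2\zeta$ for z = 1 on [0, 1], so that each discrete value has non-zero density over the whole interval. This choice is illustrative only; the source does not fix it here. Given the encoder output q (the probability that z = 1) and uniform noise rho, the inverse CDF returns a sample of the continuous variable that varies smoothly with q.

```python
import numpy as np

def cdf_zeta(zeta, q):
    """CDF on [0, 1] of the mixture density (1 - q) * 2(1 - zeta) + q * 2 zeta."""
    return (1.0 - q) * (2.0 * zeta - zeta ** 2) + q * zeta ** 2

def inverse_cdf_zeta(rho, q):
    """Solve cdf_zeta(zeta, q) = rho for zeta, for 0 < q < 1.

    Written in a form that stays well-defined at q = 0.5, where the
    quadratic term of the CDF vanishes and zeta = rho.
    """
    a = 1.0 - q
    return rho / (a + np.sqrt(a ** 2 + (2.0 * q - 1.0) * rho))

rng = np.random.default_rng(0)
rho = rng.random(5)        # parameter-independent noise, rho ~ U[0, 1]
q = 0.7                    # hypothetical encoder output q(z = 1 | x)
zeta = inverse_cdf_zeta(rho, q)

# The sample moves smoothly with q: a finite-difference derivative is finite.
eps = 1e-6
dzeta_dq = (inverse_cdf_zeta(rho, q + eps) - inverse_cdf_zeta(rho, q - eps)) / (2 * eps)
```

  • Contrast this with a purely discrete unit, where the inverse CDF is a step function of rho and its derivative with respect to q is zero almost everywhere.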
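  • the prior distribution $p(z|\theta)$ is correspondingly extended and is defined as $p(\zeta,z|\theta) = r(\zeta|z)\cdot p(z|\theta)$, and the conditional distribution of the observed variables satisfies $p(x|\zeta,z,\theta) = p(x|\zeta,\theta)$.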
  • the method described herein can generate low-variance stochastic approximations to the gradient.
  • the KL-divergence between the approximating posterior and the true prior distribution is unaffected by the introduction of auxiliary continuous latent variables, provided the same expansion is used for both.
  • the auto-encoder portion of the loss function is evaluated in the space of continuous random variables, and the KL-divergence portion of the loss function is evaluated in the discrete space.
  • the KL-divergence portion of the loss function is as follows:
  • the gradient of the KL-divergence portion of the loss function in the above equation with respect to $\theta$ can be estimated stochastically using samples from the true prior distribution $p(z|\theta)$.
  • the gradient of the KL-divergence portion of the loss function can be expressed as follows:
  • the method computes the gradients of the KL-divergence portion of the loss function analytically, for example by first directly parameterizing a factorial q(z
  • Equation 1 can therefore be simplified by dropping the dependence of p on z and then marginalizing z out of q, as follows:
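  • As a hedged illustration of the KL-divergence term, the sketch below computes the KL-divergence between a factorial Bernoulli approximating posterior and a small Boltzmann prior by enumeration, together with its gradient with respect to the prior biases. The prior energy $E(z) = -h^\top z - z^\top J z$ and the names `h`, `J`, `q` are assumptions. The gradient takes the form $E_p[z] - E_q[z]$; the expectation under the prior is the quantity that prior samples (for example from a quantum processor) would estimate at realistic sizes.

```python
import itertools
import numpy as np

def prior_log_probs(states, h, J):
    """log p(z) for a Boltzmann prior with assumed energy E(z) = -h.z - z^T J z."""
    neg_energy = states @ h + np.einsum("si,ij,sj->s", states, J, states)
    return neg_energy - np.log(np.exp(neg_energy).sum())

def factorial_log_probs(states, q):
    """log q(z) for independent Bernoulli units with q_i = q(z_i = 1 | x)."""
    return (states * np.log(q) + (1 - states) * np.log(1 - q)).sum(axis=1)

def kl_divergence(states, q, h, J):
    """KL[q || p], summed exactly over every discrete configuration."""
    log_q = factorial_log_probs(states, q)
    return float(np.sum(np.exp(log_q) * (log_q - prior_log_probs(states, h, J))))

def kl_gradient_wrt_bias(states, q, h, J):
    """dKL/dh = E_p[z] - E_q[z]; E_q[z] = q for a factorial posterior."""
    p = np.exp(prior_log_probs(states, h, J))
    return p @ states - q

n = 3
states = np.array(list(itertools.product([0.0, 1.0], repeat=n)))
q = np.array([0.2, 0.6, 0.9])             # hypothetical encoder outputs
h = np.array([0.1, -0.4, 0.3])            # hypothetical prior biases
J = np.triu(np.full((n, n), 0.5), k=1)    # hypothetical prior couplings
```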
  • FIG. 4 shows a method 400 of unsupervised learning using a discrete variational auto-encoder. Execution of the method 400 by one or more processor-based devices may occur in accordance with the present system, devices, articles, and methods. Method 400 , like other methods herein may be implemented by a series or set of processor-readable instructions executed by one or more processors (i.e., hardware circuitry).
  • Method 400 starts at 405 , for example in response to a call from another routine or other invocation.
  • the system initializes the model parameters with random values. Alternatively, the system can initialize the model parameters based on a pre-training procedure.
  • the system tests to determine if a stopping criterion has been reached.
  • the stopping criterion can, for example, be related to the number of epochs (i.e., passes through the dataset) or a measurement of performance between successive passes through a validation dataset. In the latter case, when performance begins to degrade, it is an indication that the system is over-fitting and should stop.
  • the system ends method 400 at 475 , until invoked again, for example, by a request to repeat the learning.
  • the system fetches a mini-batch of the training data set at 420 .
  • the system propagates the training data set through the encoder to compute the full approximating posterior over discrete space z.
  • the system generates or causes generation of samples from the approximating posterior over $\zeta$, given the full distribution over z.
  • this is performed by a non-quantum processor, and uses the inverse of the CDF $F_i(x)$ described above.
  • the non-quantum processor can, for example, take the form of one or more digital microprocessors, digital signal processors, graphical processing units, central processing units, digital application specific integrated circuits, digital field programmable gate arrays, digital microcontrollers, and/or any associated memories, registers or other nontransitory computer- or processor-readable media, communicatively coupled to the non-quantum processor.
  • the system propagates the samples through the decoder to compute the distribution over the input.
  • the system performs backpropagation through the decoder.
  • the system performs backpropagation through the sampler over the approximating posterior over $\zeta$.
  • backpropagation is an efficient computational approach to determining the gradient.
  • the system computes the gradient of the KL-divergence between the approximating posterior and the true prior over z.
  • the system performs backpropagation through the encoder.
  • the system determines a gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space.
  • the system determines at least one of a gradient or at least a stochastic approximation of a gradient, of a bound on the log-likelihood of the input data.
  • the system generates samples or causes samples to be generated by a quantum processor.
  • the system updates the model parameters based at least in part on the gradient.
  • the system tests to determine if the current mini-batch is the last mini-batch to be processed. In response to determining that the current mini-batch is the last mini-batch to be processed, the system returns control to 415 . In response to determining that the current mini-batch is not the last mini-batch to be processed, the system returns control to 420 .
  • act 470 is omitted, and control passes directly to 415 from 465 .
  • the decision whether to fetch another mini-batch can be incorporated in 415 .
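  • The control flow of method 400 (stopping test at 415, mini-batch fetch at 420, last-mini-batch test at 470, end at 475) can be sketched as a skeleton. The callables `update_step` and `validation_score`, and the choice to stop as soon as the validation score drops, are hypothetical placeholders; the per-mini-batch gradient acts are assumed to live inside `update_step`.

```python
def train(dataset, batch_size, max_epochs, update_step, validation_score):
    """Skeleton of the epoch / mini-batch loop; returns the number of epochs run.

    update_step(batch) is a hypothetical callable covering the encoder pass,
    sampling, decoder pass, backpropagation, and parameter update.
    validation_score() is a hypothetical callable returning a number where
    higher means better performance on a validation set.
    """
    best_score = float("-inf")
    for epoch in range(max_epochs):                        # stopping test (415)
        for start in range(0, len(dataset), batch_size):   # fetch mini-batch (420)
            update_step(dataset[start:start + batch_size])
        # The inner loop ending corresponds to the last-mini-batch test (470).
        score = validation_score()
        if score < best_score:       # performance began to degrade: stop (475)
            return epoch + 1
        best_score = score
    return max_epochs
```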
  • the discrete VAE method extends the encoder and the prior with a transformation to a continuous, auxiliary latent representation, and correspondingly makes the decoder a function of the same continuous representation.
  • the method evaluates the auto-encoder portion of the loss function in the continuous representation while evaluating the KL-divergence portion of the loss function in the z space.
  • a probabilistic model is defined in terms of a prior distribution p(z) over latent variables z and a conditional distribution p(x
  • the observation of x often induces strong correlations of the z, given x, in the posterior p(z
  • an RBM used as the prior distribution may have strong correlations between the units of the RBM.
  • hierarchy can be introduced into the approximating posterior q(z
  • Although the variables of each hierarchical layer are independent given the previous layers, the total distribution can capture strong correlations, especially as the size of each hierarchical layer shrinks towards a single variable.
  • the latent variables z of the RBM are divided into disjoint groups, z 1 , . . . , z k .
  • the continuous latent variables $\zeta$ are divided into complementary disjoint groups $\zeta_1, \ldots, \zeta_k$.
  • the groups may be chosen at random, while in other implementations the groups may be defined so as to be of equal size.
  • the hierarchical variational auto-encoder defines the approximating posterior via a directed acyclic graphical model over these groups.
  • $z_j \in \{0,1\}$ and $g_j(\zeta_{i<j}, x, \phi)$ is a parameterized function of the input and preceding $\zeta_i$, such as a neural network.
  • the corresponding graphical model is shown in FIG. 5 .
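  • Ancestral sampling through such a hierarchical approximating posterior can be sketched as follows. Each group's probability is computed from the input and all previously sampled continuous variables, and the group's sample is then drawn by an inverse CDF. The linear form of each $g_j$ and the ramp-based smoothing used for the inverse CDF are illustrative assumptions standing in for a neural network and for whichever continuous expansion is chosen.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def inverse_cdf_zeta(rho, q):
    """Inverse CDF for an assumed smoothing: density 2(1 - zeta) if z = 0, 2 zeta if z = 1."""
    a = 1.0 - q
    return rho / (a + np.sqrt(a ** 2 + (2.0 * q - 1.0) * rho))

def sample_hierarchical_posterior(x, layers, rng):
    """Sample zeta_1 .. zeta_k in order; group j sees x and every earlier zeta_i.

    Each entry of `layers` is a (weights, bias) pair for a linear g_j, a
    hypothetical stand-in for a neural network.
    """
    zetas = []
    for weights, bias in layers:
        inputs = np.concatenate([x] + zetas)
        q_j = sigmoid(weights @ inputs + bias)    # q(z_j = 1 | zeta_{i<j}, x)
        rho_j = rng.random(q_j.shape)             # independent uniform noise
        zetas.append(inverse_cdf_zeta(rho_j, q_j))
    return zetas
```

  • Units within a group are sampled independently given the earlier groups, while dependence across groups enters through the growing `inputs` vector.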
  • FIG. 5 is a schematic diagram illustrating an example implementation of a hierarchical variational auto-encoder (VAE).
  • This hierarchical approximating posterior does not affect the form of the auto-encoding term 520 of FIG. 5 , except to increase the depth of the auto-encoder.
  • Each can be computed via the stochastic nonlinearity F q j ( ⁇ j
  • the deterministic probability value $q(z = 1 | \zeta_{i<j}, x, \phi)$ is parameterized, for example by a neural network.
  • ⁇ ) can be estimated stochastically using samples from the approximating posterior q( ⁇ ,z
  • the prior can be, for example, an RBM.
  • Samples from the same prior distribution are required for an entire mini-batch, independent from the samples chosen from the training dataset.
  • Convolutional architectures are an essential component of state-of-the-art approaches to visual object classification, speech recognition, and numerous other tasks. In particular, they have been successfully applied to generative modeling, such as in deconvolutional networks and LAPGAN. There is, therefore, technical benefit in incorporating convolutional architectures into variational auto-encoders, as such can provide a technical solution to a technical problem, and thereby achieve a technical result.
  • Convolutional architectures are necessarily hierarchical. In the feedforward direction, they build from local, high-resolution features to global, low-resolution features through the application of successive layers of convolution, point-wise nonlinear transformations, and pooling. When used generatively, this process is reversed, with global, low-resolution features building towards local, high-resolution features through successive layers of deconvolution, point-wise nonlinear transformations, and unpooling.
  • ancillary random variables can be defined at each layer of the deconvolutional decoder network.
  • Ancillary random variables can be discrete random variables or continuous random variables.
  • the ancillary random variables of layer n are used in conjunction with the signal from layer n+1 to determine the signal to layer n−1.
  • the approximating posterior over the ancillary random variables of layer n is defined to be a function of the convolutional encoder, generally restricted to layer n of the convolutional encoder.
  • To compute a stochastic approximation to the gradient of the evidence lower bound, the approach can perform a single pass up the convolutional encoder network, followed by a single pass down the deconvolutional decoder network. In the pass down the deconvolutional decoder network, the ancillary random variables are sampled from the approximating posteriors computed in the pass up the convolutional encoder network.
  • a traditional approach can result in approximating posteriors that poorly match the true posterior, and consequently can result in poor samples in the auto-encoding loop.
  • the approximating posterior defines independent distributions over each layer. This product of independent distributions ignores the strong correlations between adjacent layers in the true posterior, conditioned on the underlying data.
  • the representation throughout layer n should be mutually consistent, and consistent with the representation in layers n−1 and n+1.
  • the approximating posterior over every random variable is independent.
  • the variability in the higher (more abstract) layers is uncorrelated with that in the lower layers, and consistency cannot be enforced across layers unless the approximating posterior collapses to a single point.
  • the true posterior has many modes, constrained by long-range correlations within each layer. For instance, if a line in an input image is decomposed into a succession of short line segments (e.g., Gabor filters), it is essential that the end of one segment line up with the beginning of the next segment. With a sufficiently overcomplete dictionary, there may be many sets of segments that cover the line, but differ by a small offset along the line. A factorial posterior can reliably represent one such mode.
  • the computational system conditions the approximating posterior for the nth layer on the sample from the approximating posterior of the higher layers preceding it in the downward pass through the deconvolutional decoder.
  • the computational system conditions the approximating posterior for the nth layer on the sample from the (n−1)th layer. This corresponds to a directed graphical model, flowing from the higher, more abstract layers to the lower, more concrete layers. Consistency between the approximating posterior distributions over each pair of layers is ensured directly.
  • the system can use a parameterized distribution for the deconvolutional component of the approximating posterior that shares structure and parameters with the generative model.
  • the system can continue to use a separately parameterized directed model.
  • a stochastic approximation to the gradient of the evidence lower bound can be computed via one pass up the convolutional encoder, one pass down the deconvolutional decoder of the approximating posterior, and another pass down the deconvolutional decoder of the prior, conditioned on the sample from the approximating posterior.
  • the approximating posterior is defined directly over the primary units of the deconvolutional generative model, as opposed to ancillary random variables, the final pass down the deconvolutional decoder of the prior does not actually pass signals from layer to layer. Rather, the input to each layer is determined by the approximating posterior.
  • the system propagates up the convolutional encoder and down the deconvolutional decoder of the approximating posterior, to compute the parameters of the approximating posterior.
  • this can compute the conditional approximating posterior of the nth layer based on both the nth layer of the convolutional encoder, and the preceding (n−1)th layer of the deconvolutional decoder of the approximating posterior.
  • the approximating posterior of the nth layer may be based upon the input, the entire convolutional encoder, and layers i ⁇ n of the deconvolutional decoder of the approximating posterior (or a subset thereof).
  • the configuration sampled from the approximating posterior is then used in a pass down the deconvolutional decoder of the prior. If the approximating posterior is defined over the primary units of the deconvolutional network, then the signal from the (n−1)th layer to the nth layer is determined by the approximating posterior for the (n−1)th layer, independent of the preceding layers of the prior. If the approach uses auxiliary random variables, the sample from the nth layer depends on the (n−1)th layer of the deconvolutional decoder of the prior, and the nth layer of the approximating posterior.
  • This approach can be extended to arbitrary numbers of layers, and to posteriors and priors that condition on more than one preceding layer, e.g., where layer n is conditioned on all layers m<n preceding it.
  • the approximating posterior and the prior can be defined to be fully autoregressive directed graphical models.
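  • The conditioning of each layer on the sample from the layer preceding it in the downward pass can be sketched as ancestral sampling. The following is a minimal illustration only: the scalar layers, the linear form of the mean, and the names `sample_hierarchical_posterior`, `w_enc`, `w_prev`, and `std` are assumptions made for the example and do not come from this disclosure.

```python
import random


def sample_hierarchical_posterior(encoder_feats, weights, rng):
    """Ancestral sampling through a directed, layer-by-layer posterior.

    encoder_feats[n] stands in for the n-th layer of the encoder (a scalar here).
    weights[n] = (w_enc, w_prev, std) are hypothetical parameters of layer n.
    """
    samples = []
    prev = 0.0  # the first (most abstract) layer has no preceding sample
    for feat, (w_enc, w_prev, std) in zip(encoder_feats, weights):
        # The mean depends on the encoder feature AND on the previous sample,
        # so the layers are no longer independent of one another.
        mean = w_enc * feat + w_prev * prev
        eps = rng.gauss(0.0, 1.0)  # reparameterization noise
        prev = mean + std * eps
        samples.append(prev)
    return samples
```

  • Setting `w_prev` to zero in every layer recovers the product of independent per-layer distributions discussed earlier, which makes the difference between the two designs easy to inspect.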
  • FIG. 6 is a schematic diagram illustrating an example implementation of a variational auto-encoder (VAE) with a hierarchy of continuous latent variables with an approximating posterior 610 and a prior 620 .
  • Each m>1 in approximating posterior 610 and prior 620 denotes a layer of continuous latent variables and is conditioned on the layers preceding it.
  • the approximating posterior can be made hierarchical, as follows:
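  • the ELBO decomposes as
$$
\begin{aligned}
\mathcal{L}(x,\theta,\phi) &= \log p(x|\theta) - \mathrm{KL}\Big[\prod_m q(z_m | z_{l<m}, x, \phi) \,\Big\|\, \prod_m p(z_m | z_{l<m}, x, \theta)\Big] \\
&= \int_{z_1}\int_{z_2}\cdots\int_{z_n} \prod_m q(z_m | z_{l<m}, x, \phi) \cdot \log\left[\frac{p(x|z,\theta)\,\prod_m p(z_m | z_{l<m}, \theta)}{\prod_m q(z_m | z_{l<m}, x, \phi)}\right] \\
&= \mathbb{E}_{q(z|x,\phi)}\big[\log p(x|z,\theta)\big] - \sum_m \mathbb{E}_{\prod_{l<m} q(z_l | z_{k<l}, x, \phi)}\Big[\mathrm{KL}\big[q(z_m | z_{l<m}, x, \phi) \,\big\|\, p(z_m | z_{l<m}, \theta)\big]\Big]
\end{aligned}
$$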
  • Equation 4 the gradient of the last term in Equation 4 with respect to q( n ⁇ 1
  • a stochastic approximation to the gradient of the ELBO can be computed via one pass down approximating posterior 610 , sampling from each continuous latent ⁇ i and m>1 in turn, and another pass down prior 620 , conditioned on the samples from the approximating posterior.
  • samples at each layer n may be based upon both the input and all the preceding layers m<n.
  • ) can be applied from the prior to the sample from the approximating posterior.
  • the pass down the prior need not pass signal from layer to layer. Rather, the input to each layer can be determined by the approximating posterior using equation 4.
  • the KL-divergence is then taken between the approximating posterior and true prior at each layer, conditioned on the layers above.
  • Re-parametrization can be used to include parameter-dependent terms into the KL-divergence term.
  • Both the approximating posterior and the prior distribution of each layer m>1 are defined by neural networks, the inputs of which are ⁇ , 1>l>m and x in the case of the approximating posterior.
  • the outputs of these networks are the mean and variance of a diagonal-covariance Gaussian distribution.
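  • Because each layer of both the approximating posterior and the prior is a diagonal-covariance Gaussian, the per-layer KL-divergence has a closed form. The sketch below shows that standard formula and its sum over layers; the function names and the list-based representation are illustrative assumptions, and in practice each mean and variance would be produced by the networks described above, conditioned on samples from the preceding layers.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL[N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))], summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


def layerwise_kl(layers):
    """Sum the per-layer KL terms; each entry is (mu_q, var_q, mu_p, var_p)."""
    return sum(kl_diag_gaussians(*layer) for layer in layers)
```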
  • the system bases the batch normalization on the L1 norm.
  • the system may base the batch normalization on the L2 norm.
  • the system may use:
  • the training of variational auto-encoders is typically limited by the form of the approximating posterior.
  • there can be challenges using an approximating posterior other than a factorial posterior.
  • the entropy of the approximating posterior which constitutes one of the components of the KL-divergence between the approximating and true posterior (or true prior), can be trivial if the approximating posterior is factorial, and close to intractable if it is a mixture of factorial distributions. While one might consider using normalizing flows, importance weighting, or other methods to allow non-factorial approximating posteriors, it may be easier to change the model to make the true posterior more factorial.
  • ISTA and LISTA address this by (approximately) following the gradient (with proximal descent) of the L1-regularized reconstruction error.
  • the resulting transformation of the hidden representation is mostly linear in the input and the hidden representation:
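  • For orientation, a generic ISTA iteration can be sketched as follows. This is the standard textbook form for the objective 0.5·||x − Dz||² + λ·||z||₁, assumed here rather than taken from this disclosure; the dictionary `D`, the step size `step`, and the function names are hypothetical. Each update is linear in the input and in the current hidden representation, apart from the final soft-threshold.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(D, x, lam, step, n_iters):
    """Run n_iters ISTA steps; D is a list of rows, x a list of floats."""
    n_rows, n_cols = len(D), len(D[0])
    z = [0.0] * n_cols
    for _ in range(n_iters):
        # Residual of the current reconstruction: D z - x
        residual = [sum(D[i][j] * z[j] for j in range(n_cols)) - x[i]
                    for i in range(n_rows)]
        # Gradient of the squared reconstruction error: D^T (D z - x)
        grad = [sum(D[i][j] * residual[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Gradient step followed by the L1 proximal (soft-threshold) step
        z = [soft_threshold(z[j] - step * grad[j], step * lam)
             for j in range(n_cols)]
    return z
```

  • LISTA replaces the fixed matrices implied by `D` and `step` with learned ones and truncates the loop to a small, fixed number of iterations.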
  • a somewhat similar approach can be employed in the deconvolutional decoder of the approximating posterior.
  • the conditional approximating posterior of layer z_n given layer z_{n−1} is computed by a multi-layer deterministic network.
  • the system can instead provide the deterministic transformation of the input to the internal layers, or any subset of the internal layers.
  • the approximating posterior over the final Gaussian units may then employ sparse coding via LISTA, suppressing redundant higher-level units, and thus allowing factorial posteriors where more than one unit coding for a given feature may be active.
  • there is no input to govern the disambiguation between redundant features so the winner-take-all selection must be achieved via other means, and a more conventional deep network may be sufficient.
  • the discrete variational auto-encoder can also be incorporated into a convolutional auto-encoder. It is possible to put a discrete VAE on the very top of the prior, where it can generate multi-modal distributions that then propagate down the deconvolutional decoder, readily allowing the production of more sophisticated multi-modal distributions. If using ancillary random variables, it would also be straightforward to include discrete random variables at every layer.
  • a quantum processor can employ a Chimera topology.
  • a Chimera topology can be defined as a tiled topology with intra-cell couplings at crossings between qubits within the cell and inter-cell couplings between respective qubits in adjacent cells.
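  • A tiled topology of this kind can be enumerated programmatically. The sketch below builds the coupling list for a grid of unit cells under assumptions that are not stated in this section: each cell holds two sets of `shore` qubits, every qubit in one set crosses every qubit in the other set of the same cell, one set couples to the corresponding qubit in the cell below, and the other set couples to the corresponding qubit in the cell to the right. The node labelling `(row, col, side, k)` is likewise hypothetical.

```python
def chimera_edges(rows, cols, shore=4):
    """Return the couplings of a rows x cols Chimera-style grid of unit cells."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            # Intra-cell couplings: one per crossing between the two qubit sets
            for i in range(shore):
                for j in range(shore):
                    edges.append(((r, c, 0, i), (r, c, 1, j)))
            # Inter-cell couplings: side-0 qubits to the same index in the cell below
            if r + 1 < rows:
                for k in range(shore):
                    edges.append(((r, c, 0, k), (r + 1, c, 0, k)))
            # Inter-cell couplings: side-1 qubits to the same index in the cell to the right
            if c + 1 < cols:
                for k in range(shore):
                    edges.append(((r, c, 1, k), (r, c + 1, 1, k)))
    return edges
```

  • Under these assumptions a single cell is a complete bipartite graph, so an RBM-like model with bipartite structure can be laid out directly within a cell.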
  • Traditional VAEs typically use a factorial approximating posterior. As a result, traditional VAEs have difficulty capturing correlations between latent variables.
  • One approach is to refine the approximating posterior automatically. This approach can be complex. Another, generally simpler, approach is to make the approximating posterior hierarchical. A benefit of this approach is that it can capture any distribution, or at least a wider range of distributions.
  • FIG. 7 shows a method 700 for unsupervised learning via a hierarchical variational auto-encoder (VAE), in accordance with the present systems, devices, articles and methods.
  • Method 700 may be implemented as an extension of method 400 employing a hierarchy of random variables.
  • Method 700 starts at 705 , for example in response to a call from another routine or other invocation.
  • the system initializes the model parameters with random values, as described above with reference to 410 of method 400 .
  • the system tests to determine if a stopping criterion has been reached, as described above with reference to 415 of method 400 .
  • the system ends method 700 at 775, until invoked again, for example, by a request to repeat the learning.
  • the system fetches a mini-batch of the training data set.
  • the system divides the latent variables z into disjoint groups z_1, ..., z_k and the corresponding continuous latent variables into disjoint groups ζ_1, ..., ζ_k.
  • the system propagates the training data set through the encoder to compute the full approximating posterior over discrete z j .
  • this hierarchical approximation does not alter the form of the gradient of the auto-encoding term IE.
  • the system generates or causes generation of samples from the approximating posterior over n layers of continuous variables given the full distribution over z.
  • the number of layers n may be 1 or more.
  • the system propagates the samples through the decoder to compute the distribution over the input, as described above with reference to 435 of method 400.
  • the system performs backpropagation through the decoder, as described above with reference to 440 of method 400.
  • the system performs backpropagation through the sampler over the approximating posterior, as described above with reference to 445 of method 400.
  • the system computes the gradient of the KL-divergence between the approximating posterior and the true prior over z, as described above with reference to 450 of method 400.
  • the system performs backpropagation through the encoder, as described above with reference to 455 of method 400.
  • the system determines a gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space.
  • the system determines at least one of a gradient or at least a stochastic approximation of a gradient, of a bound on the log-likelihood of the input data.
  • the system generates samples or causes samples to be generated by a quantum processor, as described above with reference to 460 of method 400 .
  • the system updates the model parameters based at least in part on the gradient, as described above with reference to 465 of method 400 .
  • the system tests to determine if the current mini-batch is the last mini-batch to be processed, as described above with reference to 470 of method 400 .
  • act 770 is omitted, and control passes directly to 715 from 765 . The decision whether to fetch another mini-batch can be incorporated in 715 .
  • In response to determining that the current mini-batch is the last mini-batch to be processed, the system returns control to 715. In response to determining that the current mini-batch is not the last mini-batch to be processed, the system returns control to 720.
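  • The control flow of method 700 (test a stopping criterion, process mini-batches one at a time, return to the test after the last mini-batch) can be sketched as a nested loop. This is only a skeleton under simplifying assumptions: the stopping criterion is reduced to a cap on the number of passes, and `step_fn` is a hypothetical callable standing in for the encoder pass, sampling, decoder pass, backpropagation, and parameter update.

```python
def run_training(minibatches, max_passes, step_fn):
    """Skeleton of the loop: outer stopping test, inner mini-batch loop."""
    updates = 0
    passes = 0
    while passes < max_passes:       # stopping criterion (here: a simple pass cap)
        for batch in minibatches:    # fetch the next mini-batch
            step_fn(batch)           # forward passes, gradients, parameter update
            updates += 1
        passes += 1                  # last mini-batch done: return to the stopping test
    return updates
```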
  • method 700 renders the approximating posterior hierarchical over the discrete latent variables.
  • method 700 also adds a hierarchy of continuous latent variables below them.
  • the remaining component of the loss function can be expressed as follows:
  • the prior distribution is a Restricted Boltzmann Machine (RBM), as follows:
  • the present method divides the latent variables into two groups and defines the approximating posterior via a directed acyclic graphical model over the two groups z a and z b , as follows:
  • ⁇ )] with respect to the parameters ⁇ of the prior can be estimated stochastically using samples from the approximating posterior q(z
  • x) q a (z a
  • z a ,x, ⁇ ) can be performed analytically; the expectation with respect to q a (z a
  • sampling is from the native distribution of the quantum processor. Rao-Blackwellization can be used to marginalize half of the units. Samples from the same prior distribution are used for a mini-batch, independent of the samples chosen from the training dataset.
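  • Rao-Blackwellization over half of the units can be illustrated for a bipartite model. The sketch assumes binary units in {0, 1} and a prior of the form p(z_a, z_b) ∝ exp(a·z_a + b·z_b + z_aᵀ J z_b); this sign convention and the names `rao_blackwell_pair_stats` and `bias_b` are assumptions for the example. Given samples of one half only, the other half is replaced by its exact conditional mean, which typically lowers the variance of the pairwise statistics used in the gradient with respect to J.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def rao_blackwell_pair_stats(samples_a, J, bias_b):
    """Estimate E[z_a[i] * z_b[j]] from samples of z_a alone."""
    n_a, n_b = len(J), len(J[0])
    stats = [[0.0] * n_b for _ in range(n_a)]
    for z_a in samples_a:
        # Closed-form conditional mean of each unit in the marginalized half
        cond_b = [sigmoid(bias_b[j] + sum(z_a[i] * J[i][j] for i in range(n_a)))
                  for j in range(n_b)]
        for i in range(n_a):
            for j in range(n_b):
                stats[i][j] += z_a[i] * cond_b[j] / len(samples_a)
    return stats
```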
  • indices i, j, and k denote hierarchical groups of variables.
  • ⁇ ⁇ ⁇ H ⁇ ( q ) ⁇ i ⁇ ⁇ k ⁇ i ⁇ [ ⁇ z i ⁇ ( ⁇ ⁇ ⁇ ⁇ q i
  • the re-parameterization technique initially makes z_i a function of ρ and φ. However, it is possible to marginalize over values of the re-parameterization variables ρ for which z is consistent, thereby rendering z_i a constant. Assuming, without loss of generality, that i<j, E_ρ[J_ij z_i z_j] can be expressed as follows:
  • sampling from ⁇ i is equivalent to sampling from ⁇ i
  • ρ_i is not a function of q_{k<i}, or parameters from previous layers. Combining this with the chain rule, ρ_i can be held fixed when differentiating q_j, with gradients not backpropagating from q_j through ρ_i.
  • the gradient of E_ρ(J_ij z_i z_j) can be decomposed using the chain rule.
  • z has been considered to be a function of ρ and φ.
  • the variance of the estimate is proportional to the number of terms, and the number of terms contributing to each gradient can grow quadratically with the number of units in a bipartite model, and linearly in a chimera-structured model.
  • the number of terms contributing to each gradient can grow linearly with the number of units in a bipartite model, and be constant in a chimera-structured model.
  • a factorial distribution over discrete random variables can be retained, and made conditional on a separate set of ancillary random variables.
  • the KL-divergence between the approximating posterior and the true prior of the ancillary variables can be subtracted.
  • the rest of the prior is unaltered, since the ancillary random variables ⁇ govern the approximating posterior, rather than the generative model.
  • Each layer i of the neural network g(x) consists of a linear transformation, parameterized by weight matrix W i and bias vector b i , followed by a pointwise nonlinearity. While intermediate layers can consist of ReLU or soft-plus units, with nonlinearity denoted by ⁇ , the logistic function ⁇ can be used as the nonlinearity in the top layer of the encoder to ensure the requisite range [0,1]. Parameters for each q i (z i
  • , ⁇ ) ⁇ ( ⁇ i ( ⁇ )) can again be used. If x is real, an additional neural network ⁇ ′( ⁇ ) can be introduced to calculate the variance of each variable, and take an approach analogous to traditional variational auto-encoders by using p i (x i
  • ⁇ , ⁇ ) ( ⁇ i ( ⁇ ), ⁇ ′ i ( ⁇ )). The final nonlinearity of the network ⁇ ( ⁇ ) should be linear, and the final nonlinearity of ⁇ ( ⁇ ) should be non-negative.
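  • The layer structure described above (a linear transformation followed by a pointwise nonlinearity, with a logistic top layer so that outputs lie in [0, 1]) can be sketched as a plain forward pass. The list-of-rows weight format and the function names are assumptions for illustration; ReLU is used for the intermediate layers, although soft-plus units would fit the description equally well.

```python
import math


def relu(v):
    return max(0.0, v)


def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))


def linear(W, b, h):
    """Compute W h + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def encoder_forward(layers, x):
    """layers is a list of (W_i, b_i); the last layer uses the logistic function."""
    h = x
    for idx, (W, b) in enumerate(layers):
        pre = linear(W, b, h)
        is_top = idx == len(layers) - 1
        h = [logistic(v) if is_top else relu(v) for v in pre]
    return h
```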
  • Algorithm 1 (shown below) illustrates an example implementation of training a network expressed as pseudocode. Algorithm 1 describes training a generic network with gradient descent. In other implementations, other methods could be used to train the network without loss of generality with respect to the approach.
  • Algorithm 1 establishes the input and output, and initializes the model parameters; it then determines whether a stopping criterion has been met. In addition, Algorithm 1 defines the processing of each mini-batch or subset.
  • Algorithms 1 and 2 (shown below) comprise pseudocode for binary visible units. Since J is bipartite, J q can be used to denote the upper-right quadrant of J, where the non-zero values reside. Gradient descent is one approach that can be used. In other implementations, gradient descent can be replaced by another technique, such as RMSprop, adagrad, or ADAM.
  • Algorithm 1: Train generic network with simple gradient descent. def train()
  • Input: A data set X, where X[:, i] is the ith element, and a learning rate parameter
  • foreach minibatch X_pos ← getMinibatch(X, ) of the training dataset do
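  • The interchangeability of the update rule mentioned above can be made concrete with two small step functions. Both are generic textbook forms written for a loss to be minimized (for example, the negative of the lower bound); the hyperparameter defaults and the function names are assumptions, not values taken from this disclosure.

```python
import math


def sgd_step(params, grads, lr):
    """Plain gradient descent: move against the gradient."""
    return [p - lr * g for p, g in zip(params, grads)]


def rmsprop_step(params, grads, cache, lr, decay=0.9, eps=1e-8):
    """RMSprop: scale each step by a running root-mean-square of past gradients."""
    new_cache = [decay * c + (1.0 - decay) * g * g for c, g in zip(cache, grads)]
    new_params = [p - lr * g / (math.sqrt(c) + eps)
                  for p, g, c in zip(params, grads, new_cache)]
    return new_params, new_cache
```

  • Either function can be called once per mini-batch inside the training loop; RMSprop additionally carries its `cache` from one call to the next.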
  • the presently disclosed systems and methods avoid these problems by symmetrically projecting the approximating posterior and the prior into a continuous space.
  • the computational system evaluates the auto-encoder portion of the loss function in the continuous space, marginalizing out the original discrete latent representation.
  • the computational system evaluates the KL-divergence between the approximating posterior and the true prior in the original discrete space, and, owing to the symmetry of the projection into the continuous space, it does not contribute to this term.
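  • One way to picture the projection into a continuous space is through an inverse cumulative distribution function applied to uniform noise. The sketch below assumes a particular smoothing transformation that is not specified in this section: a point mass at ζ = 0 when z = 0, and a density proportional to exp(βζ) on [0, 1] when z = 1. Under that assumption the inverse CDF of the marginal over ζ is available in closed form and is a differentiable function of q = q(z = 1 | x) wherever ζ > 0. The function name and the default `beta` are hypothetical.

```python
import math


def inverse_cdf_spike_exp(rho, q, beta=3.0):
    """Map uniform noise rho in [0, 1] to zeta, given q = q(z = 1 | x)."""
    if q <= 0.0 or rho <= 1.0 - q:
        return 0.0  # the probability mass 1 - q sits exactly at zeta = 0
    scaled = (rho + q - 1.0) / q  # position within the continuous part, in (0, 1]
    return math.log(scaled * (math.exp(beta) - 1.0) + 1.0) / beta
```

  • Because the discrete variable z never appears explicitly, it is effectively marginalized out of the auto-encoding path, consistent with the description above.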
  • Algorithm 2 Helper functions for discrete VAE L ⁇ L up + L down def getMinibatch (X, )
  • q′ ( ⁇ 1
  • x, ⁇ ) Z pos and ⁇ U (0,1) n ⁇ m


Abstract

A computational system can include digital circuitry and analog circuitry, for instance a digital processor and a quantum processor. The quantum processor can operate as a sample generator providing samples. Samples can be employed by the digital processor in implementing various machine learning techniques. For example, the computational system can perform unsupervised learning over an input space, for example via a discrete variational auto-encoder, attempting to maximize the log-likelihood of an observed dataset. Maximizing the log-likelihood of the observed dataset can include generating a hierarchical approximating posterior.

Description

    BACKGROUND
  • Field
  • The present disclosure generally relates to machine learning.
  • Machine Learning
  • Machine learning relates to methods and circuitry that can learn from data and make predictions based on data. In contrast to methods or circuitry that follow static program instructions, machine learning methods and circuitry can include deriving a model from example inputs (such as a training set) and then making data-driven predictions.
  • Machine learning is related to optimization. Some problems can be expressed in terms of minimizing a loss function on a training set, where the loss function describes the disparity between the predictions of the model being trained and observable data.
  • Machine learning tasks can include unsupervised learning, supervised learning, and reinforcement learning. Approaches to machine learning include, but are not limited to, decision trees, linear and quadratic classifiers, case-based reasoning, Bayesian statistics, and artificial neural networks.
  • Machine learning can be used in situations where explicit approaches are considered infeasible. Example application areas include optical character recognition, search engine optimization, and computer vision.
  • Quantum Processor
  • A quantum processor is a computing device that can harness quantum physical phenomena (such as superposition, entanglement, and quantum tunneling) unavailable to non-quantum devices. A quantum processor may take the form of a superconducting quantum processor. A superconducting quantum processor may include a number of qubits and associated local bias devices, for instance two or more superconducting qubits. An example of a qubit is a flux qubit. A superconducting quantum processor may also employ coupling devices (i.e., “couplers”) providing communicative coupling between qubits. Further details and embodiments of exemplary quantum processors that may be used in conjunction with the present systems and devices are described in, for example, U.S. Pat. Nos. 7,533,068; 8,008,942; 8,195,596; 8,190,548; and 8,421,053.
  • Adiabatic Quantum Computation
  • Adiabatic quantum computation typically involves evolving a system from a known initial Hamiltonian (the Hamiltonian being an operator whose eigenvalues are the allowed energies of the system) to a final Hamiltonian by gradually changing the Hamiltonian. A simple example of an adiabatic evolution is a linear interpolation between initial Hamiltonian and final Hamiltonian. An example is given by:

$$H_e = (1 - s) H_i + s H_f$$
  • where Hi is the initial Hamiltonian, Hf is the final Hamiltonian, He is the evolution or instantaneous Hamiltonian, and s is an evolution coefficient which controls the rate of evolution (i.e., the rate at which the Hamiltonian changes).
  • As the system evolves, the evolution coefficient s goes from 0 to 1 such that at the beginning (i.e., s=0) the evolution Hamiltonian He is equal to the initial Hamiltonian Hi and at the end (i.e., s=1) the evolution Hamiltonian He is equal to the final Hamiltonian Hf. Before the evolution begins, the system is typically initialized in a ground state of the initial Hamiltonian Hi and the goal is to evolve the system in such a way that the system ends up in a ground state of the final Hamiltonian Hf at the end of the evolution. If the evolution is too fast, then the system can transition to a higher energy state, such as the first excited state. As used herein an “adiabatic” evolution is an evolution that satisfies the adiabatic condition:

$$\dot{s}\,\left|\langle 1 | dH_e/ds | 0 \rangle\right| = \delta g^2(s)$$
  • where $\dot{s}$ is the time derivative of s, g(s) is the difference in energy between the ground state and first excited state of the system (also referred to herein as the “gap size”) as a function of s, and δ is a coefficient much less than 1.
  • If the evolution is slow enough that the system is always in the instantaneous ground state of the evolution Hamiltonian, then transitions at anti-crossings (when the gap size is smallest) are avoided. Other evolution schedules, besides the linear evolution described above, are possible including non-linear evolution, parametric evolution, and the like. Further details on adiabatic quantum computing systems, methods, and apparatus are described in, for example, U.S. Pat. Nos. 7,135,701; and 7,418,283.
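  • The linear interpolation between Hamiltonians and the resulting gap size can be illustrated on the smallest possible case. The sketch assumes a single qubit with initial Hamiltonian −σ_x and final Hamiltonian −σ_z, chosen only for the example; the function names are hypothetical. For a real symmetric 2×2 matrix the gap between the two eigenvalues has a closed form, so no linear-algebra library is needed.

```python
import math

H_I = [[0.0, -1.0], [-1.0, 0.0]]  # assumed initial Hamiltonian: -sigma_x
H_F = [[-1.0, 0.0], [0.0, 1.0]]   # assumed final Hamiltonian: -sigma_z


def interpolate(h_i, h_f, s):
    """Evolution Hamiltonian (1 - s) * h_i + s * h_f."""
    return [[(1.0 - s) * a + s * b for a, b in zip(row_i, row_f)]
            for row_i, row_f in zip(h_i, h_f)]


def gap_2x2(h):
    """Difference between the two eigenvalues of a real symmetric 2x2 matrix."""
    (a, b), (_, d) = h
    return 2.0 * math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
```

  • In this toy case the gap is smallest at the midpoint of the evolution, which is where the adiabatic condition above calls for the slowest change in s.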
  • Quantum Annealing
  • Quantum annealing is a computation method that may be used to find a low-energy state, preferably the ground state, of a system. Similar in concept to classical simulated annealing, the method relies on the underlying principle that natural systems tend towards lower energy states because lower energy states are more stable. While classical annealing uses classical thermal fluctuations to guide a system to a low-energy state and ideally its global energy minimum, quantum annealing may use quantum effects, such as quantum tunneling, as a source of disordering to reach a global energy minimum more accurately and/or more quickly than classical annealing. In quantum annealing, thermal effects and other noise may be present during annealing. The final low-energy state may not be the global energy minimum. Adiabatic quantum computation may be considered a special case of quantum annealing for which the system, ideally, begins and remains in its ground state throughout an adiabatic evolution. Thus, those of skill in the art will appreciate that quantum annealing systems and methods may generally be implemented on an adiabatic quantum computer. Throughout this specification and the appended claims, any reference to quantum annealing is intended to encompass adiabatic quantum computation unless the context requires otherwise.
  • Quantum annealing uses quantum mechanics as a source of disorder during the annealing process. An objective function, such as an optimization problem, is encoded in a Hamiltonian HP, and the algorithm introduces quantum effects by adding a disordering Hamiltonian HD that does not commute with HP. An example case is:

$$H_E \propto A(t) H_D + B(t) H_P,$$
  • where A(t) and B(t) are time dependent envelope functions. For example, A(t) can change from a large value to substantially zero during the evolution and HE can be thought of as an evolution Hamiltonian similar to He described in the context of adiabatic quantum computation above. The disorder is slowly removed by removing HD (i.e., by reducing A(t)).
  • Thus, quantum annealing is similar to adiabatic quantum computation in that the system starts with an initial Hamiltonian and evolves through an evolution Hamiltonian to a final “problem” Hamiltonian HP whose ground state encodes a solution to the problem. If the evolution is slow enough, the system may settle in the global minimum (i.e., the exact solution), or in a local minimum close in energy to the exact solution. The performance of the computation may be assessed via the residual energy (difference from exact solution using the objective function) versus evolution time. The computation time is the time required to generate a residual energy below some acceptable threshold value. In quantum annealing, HP may encode an optimization problem and therefore HP may be diagonal in the subspace of the qubits that encode the solution, but the system does not necessarily stay in the ground state at all times. The energy landscape of HP may be crafted so that its global minimum is the answer to the problem to be solved, and low-lying local minima are good approximations.
  • The gradual reduction of disordering Hamiltonian HD (i.e., reducing A(t)) in quantum annealing may follow a defined schedule known as an annealing schedule. Unlike adiabatic quantum computation where the system begins and remains in its ground state throughout the evolution, in quantum annealing the system may not remain in its ground state throughout the entire annealing schedule. As such, quantum annealing may be implemented as a heuristic technique, where low-energy states with energy near that of the ground state may provide approximate solutions to the problem.
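  • The residual energy used to assess an annealing run can be computed classically for small problems. The sketch assumes an Ising-form problem Hamiltonian with spins in {−1, +1}, local fields `h`, and couplings `J` given as a dictionary keyed by index pairs; this encoding and the function names are illustrative assumptions. The ground-state energy is found by exhaustive enumeration, which is feasible only for a handful of spins.

```python
from itertools import product


def ising_energy(spins, h, J):
    """Energy of one spin configuration under the diagonal problem Hamiltonian."""
    energy = sum(h[i] * spins[i] for i in range(len(spins)))
    energy += sum(c * spins[i] * spins[j] for (i, j), c in J.items())
    return energy


def ground_energy(h, J):
    """Exact global minimum by brute force over all 2**n configurations."""
    return min(ising_energy(s, h, J) for s in product((-1, 1), repeat=len(h)))


def residual_energy(sample, h, J):
    """How far a returned sample lies above the exact ground-state energy."""
    return ising_energy(sample, h, J) - ground_energy(h, J)
```

  • A residual energy of zero indicates that the sample is an exact solution; small positive values correspond to the low-lying local minima that serve as approximate solutions.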
  • BRIEF SUMMARY
  • A method for unsupervised learning over an input space comprising discrete or continuous variables, and at least a subset of a training dataset of samples of the respective variables, to attempt to identify the value of at least one parameter that increases the log-likelihood of the at least a subset of a training dataset with respect to a model, the model expressible as a function of the at least one parameter, the method executed by circuitry including at least one processor, may be summarized as including forming a first latent space comprising a plurality of random variables, the plurality of random variables comprising one or more discrete random variables; forming a second latent space comprising the first latent space and a set of supplementary continuous random variables; forming a first transforming distribution comprising a conditional distribution over the set of supplementary continuous random variables, conditioned on the one or more discrete random variables of the first latent space; forming an encoding distribution comprising an approximating posterior distribution over the first latent space, conditioned on the input space; forming a prior distribution over the first latent space; forming a decoding distribution comprising a conditional distribution over the input space conditioned on the set of supplementary continuous random variables; determining an ordered set of conditional cumulative distribution functions of the supplementary continuous random variables, each cumulative distribution function comprising functions of a full distribution of at least one of the one or more discrete random variables of the first latent space; determining an inversion of the ordered set of conditional cumulative distribution functions of the supplementary continuous random variables; constructing a first stochastic approximation to a lower bound on the log-likelihood of the at least a subset of a training dataset; constructing a second stochastic approximation 
to a gradient of the lower bound on the log-likelihood of the at least a subset of a training dataset; and increasing the lower bound on the log-likelihood of the at least a subset of a training dataset based at least in part on the gradient of the lower bound on the log-likelihood of the at least a subset of a training dataset.
  • Increasing the lower bound on the log-likelihood of the at least a subset of a training dataset based at least in part on the gradient of the lower bound on the log-likelihood of the at least a subset of a training dataset may include increasing the lower bound on the log-likelihood of the at least a subset of a training dataset using a method of gradient descent. Increasing the lower bound on the log-likelihood of the at least a subset of a training dataset using a method of gradient descent may include attempting to maximize the lower bound on the log-likelihood of the at least a subset of a training dataset using a method of gradient descent. The encoding distribution and decoding distribution may be parameterized by deep neural networks. Determining an ordered set of conditional cumulative distribution functions of the supplementary continuous random variables may include analytically determining an ordered set of conditional cumulative distribution functions of the supplementary continuous random variables. The lower bound may be an evidence lower bound.
  • Constructing a first stochastic approximation to the lower bound of the log-likelihood of the at least a subset of a training dataset may include decomposing the first stochastic approximation to the lower bound into at least a first part comprising negative KL-divergence between the approximating posterior and the prior distribution over the first latent space, and a second part comprising an expectation, or at least a stochastic approximation to an expectation, with respect to the approximating posterior over the second latent space of the conditional log-likelihood of the at least a subset of a training dataset under the decoding distribution.
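  • In symbols, the two-part decomposition described above can be written roughly as follows. The notation is an assumption made for this illustration: q denotes the approximating posterior with parameters φ, p denotes the prior and the decoding distribution with parameters θ, z denotes the discrete random variables of the first latent space, and ζ denotes the supplementary continuous random variables.

```latex
\mathcal{L}(x, \theta, \phi)
  = -\,\mathrm{KL}\big[\, q(z \mid x, \phi) \,\big\|\, p(z \mid \theta) \,\big]
  + \mathbb{E}_{q(\zeta, z \mid x, \phi)}\big[ \log p(x \mid \zeta, \theta) \big]
```

  • The first term is evaluated over the discrete latent space and the second over the continuous variables, matching the first part and second part named above.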
  • Constructing a second stochastic approximation to the gradient of the lower bound may include determining the gradient of the second part of the first stochastic approximation by backpropagation; approximating the gradient of the first part of the first stochastic approximation with respect to one or more parameters of the prior distribution over the first latent space using samples from the prior distribution; and determining a gradient of the first part of the first stochastic approximation with respect to parameters of the encoding distribution by backpropagation. Approximating the gradient of the first part of the first stochastic approximation with respect to one or more parameters of the prior distribution over the first latent space using samples from the prior distribution may include at least one of generating samples or causing samples to be generated by a quantum processor. A logarithm of the prior distribution may be, to within a constant, a problem Hamiltonian of a quantum processor.
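  • The sample-based approximation of the gradient with respect to the prior parameters can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the prior is taken to be a Boltzmann distribution over binary variables with bias vector b and coupling matrix W, the approximating posterior is taken to be factorial with means q_mean, and exact enumeration is used as a classical stand-in for a quantum processor. All function names are hypothetical.

```python
import itertools
import math
import random


def energy(z, b, W):
    """Assumed convention: E(z) = -sum_i b_i z_i - sum_{i<j} W[i][j] z_i z_j."""
    n = len(z)
    e = -sum(b[i] * z[i] for i in range(n))
    e -= sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(i + 1, n))
    return e


def exact_prior_samples(b, W, num_samples, rng):
    """Classical stand-in for a hardware sampler: exact Boltzmann samples by enumeration."""
    states = list(itertools.product([0, 1], repeat=len(b)))
    weights = [math.exp(-energy(z, b, W)) for z in states]
    return rng.choices(states, weights=weights, k=num_samples)


def kl_grad_wrt_prior_bias(q_mean, prior_samples):
    """Estimate d KL(q || p) / d b_i = E_p[z_i] - E_q[z_i].

    E_p[z_i] is approximated by the sample mean over prior_samples;
    E_q[z_i] equals q_mean[i] for a factorial approximating posterior.
    """
    n = len(q_mean)
    m = len(prior_samples)
    p_mean = [sum(z[i] for z in prior_samples) / m for i in range(n)]
    return [p_mean[i] - q_mean[i] for i in range(n)]
```

  • In this sketch the only quantity that requires samples from the prior is the expectation E_p[z_i]; replacing exact_prior_samples with samples returned by a physical sampler would leave kl_grad_wrt_prior_bias unchanged.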
  • The method may further include generating samples or causing samples to be generated by a quantum processor; and determining an expectation with respect to the prior distribution from the samples. Generating samples or causing samples to be generated by at least one quantum processor may include performing at least one post-processing operation on the samples. Generating samples or causing samples to be generated by at least one quantum processor may include operating the at least one quantum processor as a sample generator to provide the samples from a probability distribution, wherein a shape of the probability distribution depends on a configuration of a number of programmable parameters for the at least one quantum processor, and wherein operating the at least one quantum processor as a sample generator comprises: programming the at least one quantum processor with a configuration of the number of programmable parameters for the at least one quantum processor, wherein the configuration of a number of programmable parameters corresponds to the probability distribution over the plurality of qubits of the at least one quantum processor; evolving the at least one quantum processor; and reading out states for the qubits in the plurality of qubits of the at least one quantum processor, wherein the states for the qubits in the plurality of qubits correspond to a sample from the probability distribution.
  • The method may further include at least one of generating, or at least approximating, samples or causing samples to be generated, or at least approximated, by a restricted Boltzmann machine; and determining the expectation with respect to the prior distribution from the samples. The set of supplementary continuous random variables may include a plurality of continuous variables, and each one of the plurality of continuous variables may be conditioned on a different respective one of the plurality of random variables.
  • The method may further include forming a second transforming distribution, wherein the input space comprises a plurality of input variables, and the second transforming distribution is conditioned on one or more of the plurality of input variables and at least one of the one or more discrete random variables.
  • A computational system may be summarized as including hardware or circuitry, for example including at least one processor; and at least one nontransitory processor-readable storage medium that stores at least one of processor-executable instructions or data which, when executed by the at least one processor cause the at least one processor to execute any of the above described acts or any of the methods of claims 1 through 16.
  • A method for unsupervised learning by a computational system, the method executable by circuitry including at least one processor, may be summarized as including forming a model, the model comprising one or more model parameters; initializing the model parameters; receiving a training dataset comprising a plurality of subsets of the training dataset; testing to determine if a stopping criterion has been met; in response to determining the stopping criterion has not been met: fetching a mini-batch comprising one of the plurality of subsets of the training dataset, the mini-batch comprising input data; performing propagation through an encoder that computes an approximating posterior distribution over a discrete space; sampling from the approximating posterior distribution over a set of continuous random variables via a sampler; performing propagation through a decoder that computes an auto-encoded distribution over the input data; performing backpropagation through the decoder of a log-likelihood of the input data with respect to the auto-encoded distribution over the input data; performing backpropagation through the sampler that samples from the approximating posterior distribution over the set of continuous random variables to generate an auto-encoded gradient; determining a first gradient of a KL-divergence, with respect to the approximating posterior, between the approximating posterior distribution and a true prior distribution over the discrete space; performing backpropagation through the encoder of a sum of the auto-encoding gradient and the first gradient of the KL-divergence with respect to the approximating posterior; determining a second gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space; determining at least one of a gradient or at least a stochastic approximation of a gradient, of a bound on the log-likelihood of the 
input data; updating the model parameters based at least in part on the determined at least one of the gradient or at least a stochastic approximation of the gradient, of the bound on the log-likelihood of the input data. Initializing the model parameters may include initializing the model parameters using random variables. Initializing the model parameters may include initializing the model parameters based at least in part on a pre-training procedure. Testing to determine if a stopping criterion has been met may include testing to determine if a threshold number N of passes through the training dataset have been run.
  • The method may further include receiving at least a subset of a validation dataset, wherein testing to determine if a stopping criterion has been met includes determining a measure of validation loss on the at least a subset of a validation dataset computed on two or more successive passes, and testing to determine if the measure of validation loss meets a predetermined criterion. Determining a second gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space may include determining a second gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space by generating samples or causing samples to be generated by a quantum processor.
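  • The two stopping tests described above (a threshold number N of passes, and a validation-loss criterion evaluated over successive passes) can be sketched as a single predicate. The specific criterion used here, stopping when the latest pass fails to improve on the previous pass by more than min_improvement, is an assumption chosen for illustration; the description leaves the predetermined criterion open. The function and parameter names are hypothetical.

```python
def stopping_criterion_met(passes_run, max_passes, validation_losses, min_improvement=0.0):
    """Return True if training should stop.

    passes_run: number of completed passes through the training dataset.
    max_passes: threshold number N of passes.
    validation_losses: validation loss recorded after each completed pass.
    min_improvement: assumed criterion; stop when the loss decrease between the
        two most recent passes is not greater than this amount.
    """
    if passes_run >= max_passes:
        return True
    if len(validation_losses) >= 2:
        previous, latest = validation_losses[-2], validation_losses[-1]
        return previous - latest <= min_improvement
    return False
```

  • A training loop would call this predicate before fetching each mini-batch pass, appending the newly computed validation loss to validation_losses after every pass.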
  • Generating samples or causing samples to be generated by a quantum processor may include operating the at least one quantum processor as a sample generator to provide the samples from a probability distribution, wherein a shape of the probability distribution depends on a configuration of a number of programmable parameters for the at least one quantum processor, and wherein operating the at least one quantum processor as a sample generator comprises programming the at least one quantum processor with a configuration of the number of programmable parameters for the at least one quantum processor, wherein the configuration of a number of programmable parameters corresponds to the probability distribution over the plurality of qubits of the at least one quantum processor; evolving the at least one quantum processor; and reading out states for the qubits in the plurality of qubits of the at least one quantum processor, wherein the states for the qubits in the plurality of qubits correspond to a sample from the probability distribution. Operating the at least one quantum processor as a sample generator to provide the samples from a probability distribution may include operating the at least one quantum processor to perform at least one post-processing operation on the samples. Sampling from the approximating posterior distribution over a set of continuous random variables may include generating samples or causing samples to be generated by a digital processor.
  • The method for unsupervised learning may further include dividing the discrete space into a first plurality of disjoint groups; and dividing the set of supplementary continuous random variables into a second plurality of disjoint groups, wherein performing propagation through an encoder that computes an approximating posterior over a discrete space includes: determining a processing sequence for the first and the second plurality of disjoint groups; and for each of the first plurality of disjoint groups in an order determined by the processing sequence, performing propagation through an encoder that computes an approximating posterior, the approximating posterior conditioned on at least one of the previous ones in the processing sequence of the second plurality of disjoint groups and at least one of the plurality of input variables. Dividing the discrete space into a first plurality of disjoint groups may include dividing the discrete space into a first plurality of disjoint groups by random assignment of discrete variables in the discrete space. Dividing the discrete space into a first plurality of disjoint groups may include dividing the discrete space into a first plurality of disjoint groups to generate even-sized groups in the first plurality of disjoint groups. Initializing the model parameters may include initializing the model parameters using random variables. Initializing the model parameters may include initializing the model parameters based at least in part on a pre-training procedure. Testing to determine if a stopping criterion has been met may include testing to determine if a threshold number N of passes through the training dataset have been run.
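  • The division of the discrete variables into disjoint, even-sized groups by random assignment, as described above, can be sketched as follows. This is one possible realization under the assumption that variables are identified by integer indices; the function name and the round-robin strategy are hypothetical choices made for illustration.

```python
import random


def split_into_groups(num_variables, num_groups, seed=None):
    """Randomly assign variable indices 0..num_variables-1 to disjoint groups.

    Indices are shuffled and then dealt round-robin, so group sizes differ
    by at most one and every index appears in exactly one group.
    """
    rng = random.Random(seed)
    indices = list(range(num_variables))
    rng.shuffle(indices)
    return [sorted(indices[g::num_groups]) for g in range(num_groups)]
```

  • The order of the returned list can serve as the processing sequence, with the approximating posterior for each group conditioned on samples associated with the groups that precede it.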
  • The method may further include receiving at least a subset of a validation dataset, wherein testing to determine if a stopping criterion has been met includes determining a measure of validation loss on the at least a subset of a validation dataset computed on two or more successive passes, and testing to determine if the measure of validation loss meets a predetermined criterion. Determining a second gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space may include determining a second gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space by generating samples or causing samples to be generated by a quantum processor.
  • Generating samples or causing samples to be generated by a quantum processor may include operating the at least one quantum processor as a sample generator to provide the samples from a probability distribution, wherein a shape of the probability distribution depends on a configuration of a number of programmable parameters for the analog processor, and wherein operating the at least one quantum processor as a sample generator comprises: programming the at least one quantum processor with a configuration of the number of programmable parameters for the at least one quantum processor, wherein the configuration of a number of programmable parameters corresponds to the probability distribution over the plurality of qubits of the at least one quantum processor, evolving the at least one quantum processor, and reading out states for the qubits in the plurality of qubits of the at least one quantum processor, wherein the states for the qubits in the plurality of qubits correspond to a sample from the probability distribution. Operating the at least one quantum processor as a sample generator to provide the samples from a probability distribution may include operating the at least one quantum processor to perform at least one post-processing operation on the samples. Sampling from the approximating posterior over a set of continuous random variables may include generating samples or causing samples to be generated by a digital processor.
  • A computational system may be summarized as including hardware or circuitry, for example including at least one processor; and at least one nontransitory processor-readable storage medium that stores at least one of processor-executable instructions or data which, when executed by the at least one processor, cause the at least one processor to execute any of the above described acts or any of the methods of claims 18 through 37.
  • A method of unsupervised learning by a computational system, the method executable by circuitry including at least one processor, may be summarized as including determining a first approximating posterior distribution over at least one group of a set of discrete random variables; sampling from at least one group of a set of supplementary continuous random variables using the first approximating posterior distribution over the at least one group of the set of discrete random variables to generate one or more samples, wherein a transforming distribution comprises a conditional distribution over the set of supplementary continuous random variables, conditioned on the one or more discrete random variables; determining a second approximating posterior distribution and a first prior distribution, the first prior distribution over at least one layer of a set of continuous variables; sampling from the second approximating posterior distribution; determining an auto-encoding loss on an input space comprising discrete or continuous variables, the auto-encoding loss conditioned on the one or more samples; determining a first KL-divergence, or at least an approximation thereof, between the second posterior distribution and the first prior distribution; determining a second KL-divergence, or at least an approximation thereof, between the first posterior distribution and a second prior distribution, the second prior distribution over the set of discrete random variables; and backpropagating the sum of the first and the second KL-divergence and the auto-encoding loss on the input space conditioned on the one or more samples. The auto-encoding loss may be a log-likelihood.
  • A computational system may be summarized as including hardware or circuitry, for example including at least one processor; and at least one nontransitory processor-readable storage medium that stores at least one of processor-executable instructions or data which, when executed by the at least one processor, cause the at least one processor to execute any of the immediately above described acts or any of the methods of claims 39 through 40.
  • A method of unsupervised learning by a computational system, the method executable by circuitry including at least one processor, may be summarized as including determining a first approximating posterior distribution over a first group of discrete random variables conditioned on an input space comprising discrete or continuous variables; sampling from a first group of supplementary continuous variables based on the first approximating posterior distribution; determining a second approximating posterior distribution over a second group of discrete random variables conditioned on the input space and samples from the first group of supplementary continuous random variables; sampling from a second group of supplementary continuous variables based on the second approximating posterior distribution; determining a third approximating posterior distribution and a first prior distribution over a first layer of additional continuous random variables, the third approximating distribution conditioned on the input space, samples from at least one of the first and the second group of supplementary continuous random variables, and the first prior distribution conditioned on samples from at least one of the first and the second group of supplementary continuous random variables; sampling from the first layer of additional continuous random variables based on the third approximating posterior distribution; determining a fourth approximating posterior distribution and a second prior distribution over a second layer of additional continuous random variables, the fourth approximating distribution conditioned on the input space, samples from at least one of the first and the second group of supplementary continuous random variables, samples from the first layer of additional continuous random variables, and the second prior distribution conditioned on at least one of samples from at least one of the first and the second group of supplementary continuous random variables, and 
samples from the first layer of additional continuous random variables; determining a first gradient of a KL-divergence, or at least a stochastic approximation thereof, between the third approximating posterior distribution and the first prior distribution with respect to the third approximating posterior distribution and the first prior distribution; determining a second gradient of a KL-divergence, or at least a stochastic approximation thereof, between the fourth approximating posterior distribution and the second prior distribution with respect to the fourth approximating posterior distribution and the second prior distribution; determining a third gradient of a KL-divergence, or at least a stochastic approximation thereof, between an approximating posterior distribution over the discrete random variables and a third prior distribution with respect to the approximating posterior distribution over the discrete random variables and the third prior distribution, wherein the approximating posterior distribution over the discrete random variables is a combination of the first approximating posterior distribution over the first group of discrete random variables, and the second approximating posterior distribution over the second group of discrete random variables; backpropagating the first, the second and the third gradients of the KL-divergence to the input space. The third prior distribution may be a restricted Boltzmann machine.
  • A computational system may be summarized as including hardware or circuitry, for example including at least one processor and at least one nontransitory processor-readable storage medium that stores at least one of processor-executable instructions or data which, when executed by the at least one processor, cause the at least one processor to execute any of the immediately above described acts or any of the methods of claims 41 through 42.
  • BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
  • In the drawings, identical reference numbers identify similar elements or acts. The sizes and relative positions of elements in the drawings are not necessarily drawn to scale. For example, the shapes of various elements and angles are not necessarily drawn to scale, and some of these elements may be arbitrarily enlarged and positioned to improve drawing legibility. Further, the particular shapes of the elements as drawn, are not necessarily intended to convey any information regarding the actual shape of the particular elements, and may have been solely selected for ease of recognition in the drawings.
  • FIG. 1 is a schematic diagram of an exemplary hybrid computer including a digital computer and an analog computer in accordance with the present systems, devices, methods, and articles.
  • FIG. 2A is a schematic diagram of an exemplary topology for a quantum processor.
  • FIG. 2B is a schematic diagram showing a close-up of the exemplary topology for a quantum processor.
  • FIG. 3 is a schematic diagram illustrating an example implementation of a variational auto-encoder (VAE).
  • FIG. 4 is a flow chart illustrating a method for unsupervised learning, in accordance with the presently described systems, devices, articles, and methods.
  • FIG. 5 is a schematic diagram illustrating an example implementation of a hierarchical variational auto-encoder (VAE).
  • FIG. 6 is a schematic diagram illustrating an example implementation of a variational auto-encoder (VAE) with a hierarchy of continuous latent variables.
  • FIG. 7 is a flow chart illustrating a method for unsupervised learning via a hierarchical variational auto-encoder (VAE), in accordance with the present systems, devices, articles and methods.
  • DETAILED DESCRIPTION
  • Generalities
  • In the following description, some specific details are included to provide a thorough understanding of various disclosed embodiments. One skilled in the relevant art, however, will recognize that embodiments may be practiced without one or more of these specific details, or with other methods, components, materials, etc. In other instances, well-known structures associated with quantum processors, such as quantum devices, coupling devices, and control systems including microprocessors and drive circuitry have not been shown or described in detail to avoid unnecessarily obscuring descriptions of the embodiments of the present methods. Throughout this specification and the appended claims, the words “element” and “elements” are used to encompass, but are not limited to, all such structures, systems, and devices associated with quantum processors, as well as their related programmable parameters.
  • Unless the context requires otherwise, throughout the specification and claims that follow, the word “comprising” is synonymous with “including,” and is inclusive or open-ended (i.e., does not exclude additional, unrecited elements or method acts).
  • Reference throughout this specification to “one embodiment”, “an embodiment”, “another embodiment”, “one example”, “an example”, or “another example” means that a particular referent feature, structure, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example. Thus, the appearances of the phrases “in one embodiment”, “in an embodiment”, “another embodiment” or the like in various places throughout this specification are not necessarily all referring to the same embodiment or example. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments or examples.
  • It should be noted that, as used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the content clearly dictates otherwise. Thus, for example, reference to a problem-solving system including “a quantum processor” includes a single quantum processor, or two or more quantum processors. It should also be noted that the term “or” is generally employed in its sense including “and/or” unless the content clearly dictates otherwise.
  • References to a processor or at least one processor refer to hardware or circuitry, whether discrete or integrated, for example single or multi-core microprocessors, microcontrollers, central processor units, digital signal processors, graphical processing units, programmable gate arrays, programmed logic controllers, and analog processors, for instance quantum processors. Various algorithms and methods and specific acts are executable via one or more processors.
  • The headings provided herein are for convenience only and do not interpret the scope or meaning of the embodiments.
  • Quantum Hardware
  • FIG. 1 illustrates a hybrid computing system 100 including a digital computer 105 coupled to an analog computer 150. In some implementations analog computer 150 is a quantum processor. The exemplary digital computer 105 includes a digital processor (CPU) 110 that may be used to perform classical digital processing tasks.
  • Digital computer 105 may include at least one digital processor (such as central processor unit 110 with one or more cores), at least one system memory 120, and at least one system bus 117 that couples various system components, including system memory 120 to central processor unit 110.
  • The digital processor may be any logic processing unit, such as one or more central processing units (“CPUs”), graphics processing units (“GPUs”), digital signal processors (“DSPs”), application-specific integrated circuits (“ASICs”), field-programmable gate arrays (“FPGAs”), programmable logic controllers (“PLCs”), etc., and/or combinations of the same.
  • Unless described otherwise, the construction and operation of the various blocks shown in FIG. 1 are of conventional design. As a result, such blocks need not be described in further detail herein, as they will be understood by those skilled in the relevant art.
  • Digital computer 105 may include a user input/output subsystem 111. In some implementations, the user input/output subsystem includes one or more user input/output components such as a display 112, mouse 113, and/or keyboard 114.
  • System bus 117 can employ any known bus structures or architectures, including a memory bus with a memory controller, a peripheral bus, and a local bus. System memory 120 may include non-volatile memory, such as read-only memory (“ROM”), static random access memory (“SRAM”), Flash NAND; and volatile memory such as random access memory (“RAM”) (not shown).
  • Digital computer 105 may also include other non-transitory computer- or processor-readable storage media or non-volatile memory 115. Non-volatile memory 115 may take a variety of forms, including: a hard disk drive for reading from and writing to a hard disk, an optical disk drive for reading from and writing to removable optical disks, and/or a magnetic disk drive for reading from and writing to magnetic disks. The optical disk can be a CD-ROM or DVD, while the magnetic disk can be a magnetic floppy disk or diskette. Non-volatile memory 115 may communicate with digital processor via system bus 117 and may include appropriate interfaces or controllers 116 coupled to system bus 117. Non-volatile memory 115 may serve as long-term storage for processor- or computer-readable instructions, data structures, or other data (sometimes called program modules) for digital computer 105.
  • Although digital computer 105 has been described as employing hard disks, optical disks and/or magnetic disks, those skilled in the relevant art will appreciate that other types of non-volatile computer-readable media may be employed, such as magnetic cassettes, flash memory cards, Flash, ROMs, smart cards, etc. Those skilled in the relevant art will appreciate that some computer architectures employ both volatile memory and non-volatile memory. For example, data in volatile memory can be cached to non-volatile memory. As another example, a solid-state disk employs integrated circuits to provide non-volatile memory.
  • Various processor- or computer-readable instructions, data structures, or other data can be stored in system memory 120. For example, system memory 120 may store instructions for communicating with remote clients and scheduling use of resources including resources on the digital computer 105 and analog computer 150. Also for example, system memory 120 may store at least one of processor-executable instructions or data that, when executed by at least one processor, causes the at least one processor to execute the various algorithms described elsewhere herein, including machine learning related algorithms.
  • In some implementations system memory 120 may store processor- or computer-readable calculation instructions to perform pre-processing, co-processing, and post-processing to analog computer 150. System memory 120 may store a set of analog computer interface instructions to interact with analog computer 150.
  • Analog computer 150 may include at least one analog processor such as quantum processor 140. Analog computer 150 can be provided in an isolated environment, for example, in an isolated environment that shields the internal elements of the quantum computer from heat, magnetic fields, and other external noise (not shown). The isolated environment may include a refrigerator, for instance a dilution refrigerator, operable to cryogenically cool the analog processor, for example to a temperature below approximately 1 kelvin.
  • FIG. 2A shows an exemplary topology 200a for a quantum processor, in accordance with the presently described systems, devices, articles, and methods. Topology 200a may be used to implement quantum processor 140 of FIG. 1; however, other topologies can also be used for the systems and methods of the present disclosure. Topology 200a comprises a 2×2 grid of cells 210a-210d, each cell comprising 8 qubits such as qubit 220 (only one called out in FIG. 2A).
  • Within each cell 210a-210d, there are eight qubits 220 (only one called out for drawing clarity), the qubits 220 in each cell 210a-210d arranged in four rows (extending horizontally in drawing sheet) and four columns (extending vertically in drawing sheet). Pairs of qubits 220 from the rows and columns can be communicatively coupled to one another by a respective coupler such as coupler 230 (illustrated by bold cross shapes, only one called out in FIG. 2A). A respective coupler 230 is positioned and operable to communicatively couple the qubit in each column (vertically-oriented qubit in drawing sheet) in each cell to the qubits in each row (horizontally-oriented qubit in drawing sheet) in the same cell. Additionally, a respective coupler, such as coupler 240 (only one called out in FIG. 2A) is positioned and operable to communicatively couple the qubit in each column (vertically-oriented qubit in drawing sheet) in each cell with a corresponding qubit in each column (vertically-oriented qubit in drawing sheet) in a nearest neighboring cell in a same direction as the orientation of the columns. Similarly, a respective coupler, such as coupler 250 (only one called out in FIG. 2A) is positioned and operable to communicatively couple the qubit in each row (horizontally-oriented qubit in drawing sheet) in each cell with a corresponding qubit in each row (horizontally-oriented qubit in drawing sheet) in each nearest neighboring cell in a same direction as the orientation of the rows. Since the couplers 240, 250 couple qubits 220 between cells 210, such couplers 240, 250 may at times be denominated as inter-cell couplers. Since the couplers 230 couple qubits within a cell 210, such couplers 230 may at times be denominated as intra-cell couplers.
  • FIG. 2B shows an exemplary topology 200b for a quantum processor, in accordance with the presently described systems, devices, articles, and methods. Topology 200b shows nine cells, such as cell 210b (only one called out in FIG. 2B), each cell comprising eight of the qubits q1 through q72. FIG. 2B illustrates the intra-coupling, such as coupler 230b (only one called out in FIG. 2B), and inter-coupling, such as coupler 260 (only one called out in FIG. 2B), for the cell 210b.
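  • The cell structure described for FIGS. 2A and 2B can be made concrete by enumerating the couplers of a grid of cells. The sketch below is illustrative only: the qubit labels (row of cell, column of cell, orientation, index within the orientation) are hypothetical and do not correspond to the reference numerals or to the numbering q1 through q72 in the drawings.

```python
def chimera_couplers(rows=2, cols=2, per_orientation=4):
    """Enumerate intra-cell and inter-cell couplers for a grid of cells.

    Each cell holds per_orientation vertically-oriented ("v") qubits and
    per_orientation horizontally-oriented ("h") qubits.
    Returns (intra, inter), each a list of (qubit, qubit) pairs.
    """
    intra, inter = [], []
    for r in range(rows):
        for c in range(cols):
            # Intra-cell: every vertical qubit couples to every horizontal qubit.
            for i in range(per_orientation):
                for j in range(per_orientation):
                    intra.append(((r, c, "v", i), (r, c, "h", j)))
            # Inter-cell: vertical qubits couple to the matching vertical qubit
            # in the neighboring cell along the column direction.
            if r + 1 < rows:
                for k in range(per_orientation):
                    inter.append(((r, c, "v", k), (r + 1, c, "v", k)))
            # Inter-cell: horizontal qubits couple to the matching horizontal
            # qubit in the neighboring cell along the row direction.
            if c + 1 < cols:
                for k in range(per_orientation):
                    inter.append(((r, c, "h", k), (r, c + 1, "h", k)))
    return intra, inter
```

  • With the default arguments this yields the 2×2 arrangement of FIG. 2A (32 qubits); calling chimera_couplers(3, 3) yields a nine-cell arrangement with 72 qubits, matching the qubit count of FIG. 2B.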
  • The non-planarity of the connections between qubits q1-q72 makes the problem of finding the lowest energy state of the qubits q1-q72 an NP-hard problem, which means that it is possible to map many practical problems to the topology illustrated in FIGS. 2A and 2B, and described above.
  • Use of the quantum processor 140 with the topology illustrated in FIGS. 2A and 2B is not limited only to problems that fit the native topology. For example, it is possible to embed a complete graph of size N on a quantum processor of size $O(N^2)$ by chaining qubits together.
  • A computational system 100 (FIG. 1) comprising a quantum processor 140 with topology 200 a of FIG. 2A or topology 200 b of FIG. 2B can specify an energy function over spin variables +1/−1, and receive from the quantum processor with topology 200 a or topology 200 b samples of lower energy spin configurations in an approximately Boltzmann distribution according to the Ising model as follows:
  • $$E(s) = \sum_{i} h_i s_i + \sum_{i,j} J_{i,j} s_i s_j$$
  • where $h_i$ are local biases and $J_{i,j}$ are coupling terms.
  • The spin variables can be mapped to binary variables 0/1. Higher-order energy functions can be expressed by introducing additional constraints over auxiliary variables.
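  • As an illustration of the mapping between spin and binary variables, the following Python sketch evaluates the Ising energy and applies the substitution s_i = 2x_i − 1. The function names, the dictionary layout for J (each coupled pair listed once), and the example values are assumptions made for illustration and are not taken from the described system.

```python
import itertools

def ising_energy(s, h, J):
    """E(s) = sum_i h_i s_i + sum_(i,j) J_ij s_i s_j for spins s_i in {-1, +1}.
    J is a dict {(i, j): J_ij} with each coupled pair listed once (assumption)."""
    e = sum(h[i] * s[i] for i in range(len(s)))
    e += sum(Jij * s[i] * s[j] for (i, j), Jij in J.items())
    return e

def ising_to_binary(h, J):
    """Substitute s_i = 2*x_i - 1 with x_i in {0, 1}.
    Returns (linear, quadratic, offset) giving the same energy over x."""
    linear = [2.0 * hi for hi in h]
    quadratic = {}
    offset = -sum(h)
    for (i, j), Jij in J.items():
        # J s_i s_j = 4J x_i x_j - 2J x_i - 2J x_j + J
        quadratic[(i, j)] = 4.0 * Jij
        linear[i] -= 2.0 * Jij
        linear[j] -= 2.0 * Jij
        offset += Jij
    return linear, quadratic, offset

def binary_energy(x, linear, quadratic, offset):
    """Energy of a 0/1 configuration under the converted coefficients."""
    e = offset + sum(linear[i] * x[i] for i in range(len(x)))
    e += sum(q * x[i] * x[j] for (i, j), q in quadratic.items())
    return e

# Illustrative values only.
h = [0.5, -1.0, 0.25]
J = {(0, 1): -1.0, (1, 2): 0.75}
linear, quadratic, offset = ising_to_binary(h, J)
for s in itertools.product((-1, 1), repeat=len(h)):
    x = [(si + 1) // 2 for si in s]
    print(s, ising_energy(s, h, J), binary_energy(x, linear, quadratic, offset))
```

  • Both energies printed on each row should agree, since the conversion only re-expresses the same function in different variables.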
  • Machine Learning
  • Various systems and methods for augmenting conventional machine learning hardware such as Graphics Processing Units (GPUs) and Central Processing Units (CPUs) with quantum hardware are described herein. Quantum hardware typically includes one or more quantum processors or quantum processing units (QPUs). The systems and methods described herein adapt machine learning architectures and methods to exploit QPUs to advantageously achieve improved machine performance. Improved machine performance typically includes reduced training time and/or increased generalization accuracy.
  • Optimization and sampling can be computational bottlenecks in machine learning systems and methods. The systems and methods described herein integrate the QPU into the machine learning pipeline (including the architecture and methods) to perform optimization and/or sampling with improved performance over classical hardware. The machine learning pipeline can be modified to suit QPUs that can be realized in practice.
  • Sampling in Training Probabilistic Models
  • Boltzmann machines including restricted Boltzmann machines (RBMs) can be used in deep learning systems. Boltzmann machines are particularly suitable for unsupervised learning and probabilistic modeling such as in-painting and classification.
  • A shortcoming of existing approaches is that Boltzmann machines typically use costly Markov Chain Monte Carlo (MCMC) techniques to approximate samples drawn from an empirical distribution. The existing approaches serve as a proxy for a physical Boltzmann sampler.
  • A QPU can be integrated into machine learning systems and methods to reduce the time taken to perform training. For example, the QPU can be used as a physical Boltzmann sampler. The approach involves programming the QPU (which is an Ising system) such that the spin configurations realize a user-defined Boltzmann distribution natively. The approach can then draw samples directly from the QPU.
  • Restricted Boltzmann Machine (RBM)
  • The restricted Boltzmann machine (RBM) is a probabilistic graphical model that represents a joint probability distribution p(x,z) over binary visible units x and binary hidden units z. The restricted Boltzmann machine can be used as an element in a deep learning network.
  • The RBM network has the topology of a bipartite graph with biases on each visible unit and on each hidden unit, and weights (couplings) on each edge. An energy E(x,z) can be associated with the joint probability distribution p(x,z) over the visible and the hidden units, as follows:

  • $$p(x,z) = \frac{e^{-E(x,z)}}{Z}$$
  • where Z is the partition function.
  • For a restricted Boltzmann machine, the energy is:

  • $$E(x,z) = -b^{T} \cdot x - c^{T} \cdot z - z^{T} \cdot W \cdot x$$
  • where b and c are bias terms expressed as matrices, W is a coupling term expressed as a matrix, and T denotes the transpose of a matrix. The conditional probabilities can be computed:

  • $$p(x|z) = \sigma(b + W^{T} \cdot z)$$

  • $$p(z|x) = \sigma(c + W \cdot x)$$
  • where σ is the sigmoid function, used to ensure the values of the conditional probabilities lie in the range [0,1].
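  • The energy and conditional probabilities above can be sketched in plain Python as follows. This is a minimal illustration that assumes W is stored as a list of rows with one row per hidden unit (so that z^T·W·x is well defined); the function names and that storage layout are assumptions rather than part of the described system.

```python
import math

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def rbm_energy(x, z, b, c, W):
    """E(x, z) = -b^T x - c^T z - z^T W x, with W[j][i] coupling hidden j to visible i."""
    e = -sum(bi * xi for bi, xi in zip(b, x))
    e -= sum(cj * zj for cj, zj in zip(c, z))
    e -= sum(z[j] * W[j][i] * x[i] for j in range(len(z)) for i in range(len(x)))
    return e

def p_hidden_given_visible(x, c, W):
    """p(z_j = 1 | x) = sigmoid(c_j + sum_i W[j][i] * x_i)."""
    return [sigmoid(c[j] + sum(W[j][i] * x[i] for i in range(len(x))))
            for j in range(len(c))]

def p_visible_given_hidden(z, b, W):
    """p(x_i = 1 | z) = sigmoid(b_i + sum_j W[j][i] * z_j)."""
    return [sigmoid(b[i] + sum(W[j][i] * z[j] for j in range(len(z))))
            for i in range(len(b))]
```

  • Because the graph is bipartite, each conditional factorizes over units, which is why a single sigmoid per unit suffices.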
  • Training RBMs
  • Training is the process by which the parameters of the model are adjusted to favor producing the desired training distribution. Typically, this is done by maximizing the log-likelihood of the observed data distribution with respect to the model parameters. One part of the process involves sampling over the given data distribution, and this part is generally straightforward. Another part of the process involves sampling over the predicted model distribution, and this is generally intractable, in the sense that it would use unmanageable amounts of computational resources.
  • Some existing approaches use a Markov Chain Monte Carlo (MCMC) method to perform sampling. MCMC constructs a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after k>>1 steps is used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps which means that MCMC makes training a slow process.
  • To speed up the MCMC process, Contrastive Divergence-k (CD-k) can be used, in which the method only takes k steps of the MCMC process. Another way to speed up the process is to use Persistent Contrastive Divergence (PCD), in which a Markov Chain is initialized in the state where it ended from the previous model. CD-k and PCD methods tend to perform poorly when the distribution is multi-modal and the modes are separated by regions of low probability.
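  • A minimal sketch of the k-step chain used by CD-k is shown below, assuming block Gibbs sampling that alternates between the hidden and visible conditionals of an RBM. The helper names, the row-per-hidden-unit layout of W, and the use of Python's `random.Random` as the source of randomness are illustrative assumptions.

```python
import math
import random

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def hidden_probs(x, c, W):
    return [sigmoid(c[j] + sum(W[j][i] * x[i] for i in range(len(x))))
            for j in range(len(c))]

def visible_probs(z, b, W):
    return [sigmoid(b[i] + sum(W[j][i] * z[j] for j in range(len(z))))
            for i in range(len(b))]

def sample_bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def cd_k_sample(x0, b, c, W, k, rng):
    """Run k steps of block Gibbs sampling starting from the visible vector x0.
    Returns the visible and hidden states (x_k, z_k) at the end of the chain."""
    x = list(x0)
    z = sample_bernoulli(hidden_probs(x, c, W), rng)
    for _ in range(k):
        x = sample_bernoulli(visible_probs(z, b, W), rng)
        z = sample_bernoulli(hidden_probs(x, c, W), rng)
    return x, z
```

  • Under the same assumptions, a PCD-style variant would pass the x returned by the previous call as x0 for the next call, rather than restarting the chain from a training vector.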
  • Even approximate sampling is NP-hard. The cost of sampling grows exponentially with problem size. Samples drawn from a native QPU network (as described above) are close to a Boltzmann distribution. It is possible to quantify the rate of convergence to a true Boltzmann distribution by evaluating the KL-divergence between the empirical distribution and the true distribution as a function of the number of samples.
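  • The convergence check described above can be sketched for a system small enough to enumerate. In the following illustration, samples drawn with Python's `random.choices` from the exact Boltzmann distribution stand in for hardware samples; the function names, the inverse temperature parameter, and the example biases and couplings are assumptions.

```python
import itertools
import math
import random

def boltzmann_distribution(h, J, beta=1.0):
    """Exact p(s) proportional to exp(-beta * E(s)) for a small Ising model."""
    states = list(itertools.product((-1, 1), repeat=len(h)))

    def energy(s):
        return (sum(h[i] * s[i] for i in range(len(s)))
                + sum(Jij * s[i] * s[j] for (i, j), Jij in J.items()))

    weights = [math.exp(-beta * energy(s)) for s in states]
    total = sum(weights)
    return states, [w / total for w in weights]

def empirical_kl(samples, states, probs):
    """KL(empirical || true), using the convention 0 * log 0 = 0."""
    counts = {s: 0 for s in states}
    for s in samples:
        counts[s] += 1
    n = len(samples)
    return sum((counts[s] / n) * math.log((counts[s] / n) / p)
               for s, p in zip(states, probs) if counts[s] > 0)

rng = random.Random(0)
states, probs = boltzmann_distribution([0.2, -0.1, 0.3], {(0, 1): -0.5, (1, 2): 0.4})
for n in (10, 100, 1000, 10000):
    samples = rng.choices(states, weights=probs, k=n)
    print(n, empirical_kl(samples, states, probs))
```

  • For an unbiased sampler the printed divergence should shrink as the number of samples grows; a sampler drawing from a perturbed energy function would instead level off at a non-zero value.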
  • Noise limits the precision with which the parameters of the model can be set in the quantum hardware. In practice, this means that the QPU is sampling from a slightly different energy function. The effects can be mitigated by sampling from the QPU and using the samples as starting points for non-quantum post-processing, e.g., to initialize MCMC, CD, and PCD. The QPU is performing the hard part of the sampling process. The QPU finds a diverse set of valleys, and the post-processing operation samples within the valleys. Post-processing can be implemented in a GPU and can be at least partially overlapped with sampling in the quantum processor to reduce the impact of post-processing on the overall timing.
  • Sampling to Train RBMs
  • A training data set can comprise a set of visible vectors. Training comprises adjusting the model parameters such that the model is most likely to reproduce the distribution of the training set. Typically, training comprises maximizing the log-likelihood of the observed data distribution with respect to the model parameters θ:
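  • $$\frac{\partial \log\left(\sum_{z} p(x,z)\right)}{\partial\theta} = -\left\langle \frac{\partial E(x,z)}{\partial\theta} \right\rangle_{p(z|x)} + \left\langle \frac{\partial E(x,z)}{\partial\theta} \right\rangle_{p(x|z)}$$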
  • The first term on the right-hand side (RHS) in the above equation is related to the positive phase and computes an expected value of energy E over p(z|x). The term involves sampling over the given data distribution.
  • The second term on the RHS is related to the negative phase, and computes an expected value of energy, over p(x|z). The term involves sampling over the predicted model distribution.
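  • As a worked instance of the two phases, assuming the RBM energy E(x,z) = −b^T·x − c^T·z − z^T·W·x given earlier, the derivative with respect to a single coupling reduces to a difference of correlations. The subscript "neg" below is shorthand introduced here for the distribution used in the negative phase.

```latex
\frac{\partial E(x,z)}{\partial W_{ji}} = -z_j x_i
\quad\Longrightarrow\quad
\frac{\partial \log\left(\sum_{z} p(x,z)\right)}{\partial W_{ji}}
  = \left\langle z_j x_i \right\rangle_{p(z|x)}
  - \left\langle z_j x_i \right\rangle_{\mathrm{neg}}
```

  • The first correlation is computed with the visible units clamped to a training vector, while the second requires samples from the model, which is the part a physical sampler is intended to supply.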
  • Variational Auto-Encoder
  • Unsupervised learning of probabilistic models is a technique for machine learning. It can facilitate tasks such as denoising to extract a signal from a mixture of signal and noise, and inpainting to reconstruct lost or corrupted parts of an image. It can also regularize supervised tasks such as classification.
  • One approach to unsupervised learning can include attempting to maximize the log-likelihood of an observed dataset under a probabilistic model. Equivalently, unsupervised learning can include attempting to minimize the KL-divergence from the data distribution to that of the model. While the exact gradient of the log-likelihood function is frequently intractable, stochastic approximations can be computed, provided samples can be drawn from the probabilistic model and its posterior distribution given the observed data.
  • The efficiency of using stochastic approximations to arrive at a maximum of the log-likelihood function can be limited by the poor availability of desirable distributions for which the requisite sampling operations are computationally efficient. Hence, applicability of the techniques can be similarly limited.
  • Although sampling can be efficient in undirected graphical models provided there are no loops present among the connections, the range of representable relationships can be limited. Boltzmann machines (including restricted Boltzmann machines) can generate approximate samples using generally costly and inexact Markov Chain Monte Carlo (MCMC) techniques.
  • Sampling can be efficient in directed graphical models comprising a directed acyclic graph since sampling can be performed by an ancestral pass. Even so, it can be inefficient to compute the posterior distributions over the hidden causes of observed data in such models, and samples from the posterior distributions are required to compute the gradient of the log-likelihood function.
  • Another approach to unsupervised learning is to optimize a lower bound on the log-likelihood function. This approach can be more computationally efficient. An example of a lower bound is the evidence lower bound (ELBO) which differs from the true log-likelihood by the KL-divergence between an approximating posterior distribution, q(z|x,ϕ), and the true posterior distribution, p(z|x,θ). The approximating posterior distribution can be designed to be computationally tractable even though the true posterior distribution is not computationally tractable. The ELBO can be expressed as follows:
  • $$\mathcal{L}(x,\theta,\phi) = \log p(x|\theta) - \mathrm{KL}\left[q(z|x,\phi)\,\|\,p(z|x,\theta)\right] = \int_{z} q(z|x,\phi) \log\left[\frac{p(x,z|\theta)}{q(z|x,\phi)}\right]$$
  • where x denotes the observed random variables, z the latent random variables, θ the parameters of the generative model and ϕ the parameters of the approximating posterior.
  • Successive optimization of the ELBO with respect to ϕ and θ is analogous to variational expectation-maximization (EM). It is generally possible to construct a stochastic approximation to gradient descent on the ELBO that only requires exact, computationally tractable samples. A drawback of this approach is that it can lead to high variance in the gradient estimate, and can result in slow training and poor performance.
  • The variational auto-encoder can regroup the ELBO as:

  • $$\mathcal{L}(x,\theta,\phi) = -\mathrm{KL}\left[q(z|x,\phi)\,\|\,p(z|\theta)\right] + \mathbb{E}_{q}\left[\log p(x|z,\theta)\right].$$
  • The KL-divergence between the approximating posterior and the true prior is analytically simple and computationally efficient for commonly chosen distributions, such as Gaussians.
  • A low-variance stochastic approximation to the gradient of the auto-encoding term $\mathbb{E}_{q}$ can be backpropagated efficiently, so long as samples from the approximating posterior q(z|x) can be drawn using a differentiable, deterministic function ƒ(x,ϕ,ρ) of the combination of the inputs x, the parameters ϕ, and a set of input- and parameter-independent random variables ρ∼D. For instance, given a Gaussian distribution with mean m(x,ϕ) and variance v(x,ϕ) determined by the input, $\mathcal{N}(m(x,\phi), v(x,\phi))$, samples can be drawn using

  • $$f(x,\phi,\rho) = m(x,\phi) + \sqrt{v(x,\phi)} \cdot \rho, \quad \text{where } \rho \sim \mathcal{N}(0,1).$$
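  • The Gaussian case can be sketched as follows. The example treats the mean m and variance v as the quantities being differentiated, and uses the test function h(z) = z², for which the exact derivatives of the expectation are 2m with respect to m and 1 with respect to v. The function names and the choice of h are assumptions made for illustration.

```python
import math
import random

def gaussian_reparam(m, v, rho):
    """f = m + sqrt(v) * rho, with rho drawn from N(0, 1)."""
    return m + math.sqrt(v) * rho

def reparam_partials(v, rho):
    """Partial derivatives (df/dm, df/dv) for a fixed noise value rho."""
    return 1.0, rho / (2.0 * math.sqrt(v))

def estimate_gradient(m, v, dh_dz, n_samples, rng):
    """Monte Carlo estimate of d/dm and d/dv of E[h(z)], z ~ N(m, v),
    obtained by differentiating through the sampling function."""
    grad_m = 0.0
    grad_v = 0.0
    for _ in range(n_samples):
        rho = rng.gauss(0.0, 1.0)
        z = gaussian_reparam(m, v, rho)
        df_dm, df_dv = reparam_partials(v, rho)
        grad_m += dh_dz(z) * df_dm
        grad_v += dh_dz(z) * df_dv
    return grad_m / n_samples, grad_v / n_samples

rng = random.Random(0)
print(estimate_gradient(0.5, 2.0, lambda z: 2.0 * z, 20000, rng))
```

  • The noise ρ is drawn independently of m and v, so the derivative can be moved inside the average, which is the property the reparameterization relies on.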
  • When such an ƒ(x,ϕ,ρ) exists,
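  • $$\mathbb{E}_{q(z|x,\phi)}\left[\log p(x|z,\theta)\right] = \mathbb{E}_{\rho}\left[\log p(x|f(x,\rho,\phi),\theta)\right]$$
  • $$\frac{\partial}{\partial\phi}\mathbb{E}_{q(z|x,\phi)}\left[\log p(x|z,\theta)\right] = \mathbb{E}_{\rho}\left[\frac{\partial}{\partial\phi}\log p(x|f(x,\rho,\phi),\theta)\right] \approx \frac{1}{N}\sum_{\rho\sim D}\frac{\partial}{\partial\phi}\log p(x|f(x,\rho,\phi),\theta), \qquad (1)$$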
  • and the stochastic approximation to the derivative in equation 1 is analytically tractable so long as p(x|z,θ) and ƒ(x,ρ,ϕ) are defined so as to have tractable derivatives.
  • This approach is possible whenever the approximating posteriors for each hidden variable, $q_i(z_i|x,\phi)$, are independent given x and ϕ; the cumulative distribution function (CDF) of each $q_i$ is invertible; and the inverse CDF of each $q_i$ is differentiable. Specifically, choose D to be the uniform distribution between 0 and 1, and $f_i$ to be the inverse CDF of $q_i$.
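  • A minimal sketch of inverse-CDF sampling with a differentiable inverse is given below. The exponential distribution with rate λ is chosen here purely for illustration because its CDF, F(z) = 1 − e^{−λz}, has a closed-form inverse; it is not a distribution named in the description, and the function names are assumptions.

```python
import math
import random

def exponential_cdf(z, lam):
    """F(z) = 1 - exp(-lam * z) for z >= 0."""
    return -math.expm1(-lam * z)

def exponential_inverse_cdf(rho, lam):
    """F^{-1}(rho) = -log(1 - rho) / lam for rho in [0, 1)."""
    return -math.log1p(-rho) / lam

def d_inverse_cdf_d_lam(rho, lam):
    """Derivative of F^{-1}(rho) with respect to the parameter lam."""
    return math.log1p(-rho) / (lam * lam)

rng = random.Random(0)
rho = rng.random()                       # rho ~ Uniform(0, 1)
z = exponential_inverse_cdf(rho, 1.5)    # sample from the target distribution
print(z, d_inverse_cdf_d_lam(rho, 1.5))  # sample and its sensitivity to lam
```

  • For a fixed ρ the sample moves smoothly as the parameter changes, which is the behavior the three conditions above are meant to guarantee.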
  • The conditional-marginal cumulative distribution function (CDF) is defined by

  • $$F_i(x) = \int_{x_i' = -\infty}^{x} p(x_i' | x_1, \ldots, x_{i-1})$$
  • Since the approximating posterior distribution q(z|x,ϕ) maps each input to a distribution over the latent space, it is called the “encoder”. Correspondingly, since the conditional likelihood distribution p(x|z,θ) maps each configuration of the latent variables to a distribution over the input space, it is called the “decoder”.
  • Unfortunately, a multivariate CDF is generally not invertible. One way to deal with this is to define a set of CDFs as follows:

  • $$F_i(x) = \int_{x_i' = -\infty}^{x} p(x_i' | x_1, \ldots, x_{i-1})$$
  • and invert each conditional CDF in turn. The CDF $F_i(x)$ is the CDF of $x_i$ conditioned on all $x_j$ where j<i, and marginalized over all $x_k$ where i<k. Such inverses generally exist provided the conditional-marginal probabilities are everywhere non-zero.
  • Discrete Variational Auto-Encoders
  • The approach can run into challenges with discrete distributions, such as, for example, Restricted Boltzmann Machines (RBMs). An approximating posterior that only assigns non-zero probability to a discrete domain corresponds to a CDF that is piecewise-constant. That is, the range of the CDF is a proper subset of the interval [0, 1]. The domain of the inverse CDF is thus also a proper subset of the interval [0, 1] and its derivative is generally not defined.
  • The difficulty can remain even if a quantile function as follows is used:
  • $$F_p^{-1}(\rho) = \inf\left\{ z : \int_{z'=-\infty}^{z} p(z') \geq \rho \right\}$$
  • The derivative of the quantile function is either zero or infinite for a discrete distribution.
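  • The point can be illustrated with a single binary variable. The sketch below assumes a Bernoulli distribution with P(z=1) = q and ρ strictly between 0 and 1; the function name is hypothetical.

```python
def bernoulli_quantile(rho, q):
    """Smallest z in {0, 1} whose CDF value is at least rho, where P(z = 1) = q.
    The CDF is F(0) = 1 - q and F(1) = 1."""
    return 0 if rho <= 1.0 - q else 1

rho = 0.3
for q in (0.50, 0.60, 0.69, 0.71, 0.80):
    print(q, bernoulli_quantile(rho, q))
```

  • As q sweeps past 1 − ρ the output jumps from 0 to 1 and is flat everywhere else, so a gradient taken through this function carries no usable information about q.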
  • One method for discrete distributions is to use a reinforcement learning method such as REINFORCE (Williams, http://www-anw.cs.umass.edu/˜barto/courses/cs687/williams92simple.pdf). The REINFORCE method adjusts weights following receipt of a reinforcement value by an amount proportional to the difference between a reinforcement baseline and the reinforcement value. Rather than differentiating the conditional log-likelihood directly in REINFORCE, the gradient of the log of the conditional likelihood distribution is estimated, in effect, by a finite difference approximation. The conditional log-likelihood log p(x|z,θ) is evaluated at many different points z∼q(z|x,ϕ), and the gradient
  • $$\frac{\partial}{\partial\phi} \log q(z|x,\phi)$$
  • is weighted more strongly when p(x|z,θ) differs more greatly from the baseline.
  • One disadvantage is that the change of p(x|z,θ) in a given direction can only affect the REINFORCE gradient estimate if a sample is taken with a component in the same direction. In a D-dimensional latent space, at least D samples are required to capture the variation of the conditional distribution p(x|z,θ) in all directions. Since the latent representation can typically consist of hundreds of variables, the REINFORCE gradient estimate can be much less efficient than one that makes more direct use of the gradient of the conditional distribution p(x|z,θ).
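  • A minimal sketch of a REINFORCE-style estimate for a single binary latent variable is shown below. It assumes q(z=1) = sigmoid(ϕ), for which the derivative of log q(z) with respect to ϕ is z − q, and uses an arbitrary function f(z) in place of log p(x|z,θ). The function names and example values are assumptions.

```python
import math
import random

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def reinforce_gradient(phi, f, n_samples, rng, baseline=0.0):
    """Score-function estimate of d/dphi E[f(z)], z ~ Bernoulli(sigmoid(phi))."""
    q = sigmoid(phi)
    total = 0.0
    for _ in range(n_samples):
        z = 1 if rng.random() < q else 0
        score = z - q                         # d/dphi log q(z | phi)
        total += (f(z) - baseline) * score    # weight by distance from baseline
    return total / n_samples

def exact_gradient(phi, f):
    """Closed form for one binary variable: q * (1 - q) * (f(1) - f(0))."""
    q = sigmoid(phi)
    return q * (1.0 - q) * (f(1) - f(0))

f = lambda z: 2.0 if z else -1.0   # stand-in for log p(x | z, theta)
rng = random.Random(0)
print(reinforce_gradient(0.3, f, 20000, rng), exact_gradient(0.3, f))
```

  • The estimate only ever evaluates f at sampled values of z and never uses the derivative of f, which is the inefficiency described above when the latent space has many dimensions.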
  • A discrete variational auto-encoder (DVAE) is a hierarchical probabilistic model consisting of an RBM, followed by multiple layers of continuous latent variables, allowing the binary variables to be marginalized out, and the gradient to backpropagate smoothly through the auto-encoding component of the ELBO.
  • The generative model is redefined so that the conditional distribution of the observed variables given the latent variables only depends on the new continuous latent space.
  • A discrete distribution is thereby transformed into a mixture distribution over this new continuous latent space. This does not alter the fundamental form of the model, nor the KL-divergence term of the ELBO; rather it adds a stochastic component to the approximating posterior and the prior.
  • One interpretation of the way that VAEs work is that they break the encoder distribution into “packets” of probability, each packet having infinitesimal but equal probability mass. Within each packet, the values of the latent variables are approximately constant. The packets correspond to a region in the latent space, and the expectation value is taken over the packets. There are generally more packets in regions of high probability, so more probable values are more likely to be selected.
  • As the parameters of the encoder are changed, the location of each packet can move, while its probability mass stays constant. So long as $F_{q(z|x,\phi)}^{-1}$ exists and is differentiable, a small change in ϕ will correspond to a small change in the location of each packet. This allows the use of the gradient of the decoder to estimate the change in the loss function, since the gradient of the decoder captures the effect of small changes in the location of a selected packet in the latent space.
  • In contrast, REINFORCE works by breaking the latent representation into segments of infinitesimal but equal volume, within which the latent variables are also approximately constant, while the probability mass varies between segments. Once a segment is selected in the latent space, its location is independent of the parameters of the encoder. As a result, the contribution of the selected location to the loss function is not dependent on the gradient of the decoder. On the other hand, the probability mass assigned to the region in the latent space around the selected location is relevant.
  • Though VAEs can make use of gradient information from the decoder, the gradient estimate is generally only low-variance provided the motion of most probability packets has a similar effect on the loss function. This is likely to be the case when the packets are tightly clustered (e.g., if the encoder produces a Gaussian distribution with low variance) or if the movements of well-separated packets have a similar effect on the loss function (e.g., if the decoder is roughly linear).
  • One difficulty is that VAEs cannot generally be used directly with discrete latent representations because changing the parameters of a discrete encoder moves probability mass between the allowed discrete values, and the allowed discrete values are generally far apart. As the encoder parameters change, a selected packet either remains in place or jumps more than an infinitesimal distance to an allowed discrete value. Consequently, small changes to the parameters of the encoder do not affect most of the probability packets. Even when a packet jumps between discrete values of the latent representation, the gradient of the decoder generally cannot be used to estimate the change in loss function accurately, because the gradient generally captures only the effects of very small movements of the probability packet.
  • Therefore, to use discrete latent representations in the VAE framework, the method described herein for unsupervised learning transforms the distributions to a continuous latent space within which the probability packets move smoothly. The encoder q(z|x,ϕ) and prior distribution p(z|θ) are extended by a transformation to a continuous, auxiliary latent representation ζ, and the decoder is correspondingly transformed to be a function of the continuous representation. By extending the encoder and the prior distribution in the same way, the remaining KL-divergence (referred to above) is unaffected.
  • In the transformation, one approach maps each point in the discrete latent space to a non-zero probability over the entire auxiliary continuous space. In so doing, if the probability at a point in the discrete latent space increases from zero to a non-zero value, a probability packet does not have to jump a large distance to cover the resulting region in the auxiliary continuous space. Moreover, it ensures that the CDFs $F_i(x)$ are strictly increasing as a function of their main argument, and thus are invertible. The method described herein for unsupervised learning smooths the conditional-marginal CDF $F_i(x)$ of an approximating posterior distribution, and renders the distribution invertible, and its inverse differentiable, by augmenting the latent discrete representation with a set of continuous random variables. The generative model is redefined so that the conditional distribution of the observed variables given the latent variables only depends on the new continuous latent space.
  • The discrete distribution is thereby transformed into a mixture distribution over the continuous latent space, each value of each discrete random variable associated with a distinct mixture component on the continuous expansion. This does not alter the fundamental form of the model, nor the KL-divergence term of the ELBO; rather it adds a stochastic component to the approximating posterior and the prior.
  • The method augments the latent representation with continuous random variables ζ, conditioned on z, as follows:

  • $$q(\zeta, z|x,\phi) = r(\zeta|z) \cdot q(z|x,\phi)$$
  • where the support of r(ζ|z) for all values of z is connected, so the marginal distribution $q(\zeta|x,\phi) = \sum_{z} r(\zeta|z) \cdot q(z|x,\phi)$ has a constant, connected support so long as 0<q(z|x,ϕ)<1. The distribution r(ζ|z) is continuous and differentiable except at the end points of its support so that the inverse conditional-marginal CDF is differentiable.
  • FIG. 3 shows an example implementation of a VAE. The variable z is a latent variable. The variable x is a visible variable (for example, pixels in an image data set). The variable ζ is a continuous variable conditioned on a discrete z as described above in the present disclosure. The variable ζ can serve to smooth out the discrete random variables in the auto-encoder term. As described above, the variable ζ generally does not directly affect the KL-divergence between the approximating posterior and the true prior.
  • In the example, the variables z1, z2, and z3 are disjoint subsets of qubits in the quantum processor. The computational system samples from the RBM using the quantum processor. The computational system generates the hierarchical approximating posteriors using a digital (classical) computer. The computational system uses priors 310 and 330, and hierarchical approximating posteriors 320 and 340.
  • For the prior 330 and the approximating posterior 340, the system adds continuous variables ζ1, ζ2, ζ3 below the latent variables z1, z2, z3.
  • FIG. 3 also shows the auto-encoding loop 350 of the VAE. Initially, input x is passed into a deterministic feedforward network q(z=1|x,ϕ), for which the final non-linearity is the logistic function. Its output q, along with independent random variable ρ, is passed into the deterministic function $F_{q(\zeta|x,\phi)}^{-1}$ to produce a sample of ζ. This ζ, along with the original input x, is finally passed to log p(x|ζ,θ). The expectation of this log probability with respect to ρ is the auto-encoding term of the VAE. This auto-encoder, conditioned on the input and the independent ρ, is deterministic and differentiable, so backpropagation can be used to produce a low-variance, computationally efficient approximation to the gradient.
  • The distribution remains continuous as q(z|x,ϕ) changes. The distribution is also everywhere non-zero in the approach that maps each point in the discrete latent space to a non-zero probability over the entire auxiliary continuous space. Correspondingly, p(ζ,z|θ) is defined as p(ζ,z|θ)=r(ζ|z)·p(z|θ), where r(ζ|z) is the same as for the approximating posterior, and p(x|ζ,z,θ)=p(x|ζ,θ). This transformation renders the model a continuous distribution over z.
  • The method described herein can generate low-variance stochastic approximations to the gradient. The KL-divergence between the approximating posterior and the true prior distribution is unaffected by the introduction of auxiliary continuous latent variables, provided the same expansion is used for both.
  • The auto-encoder portion of the loss function is evaluated in the space of continuous random variables, and the KL-divergence portion of the loss function is evaluated in the discrete space.
  • The KL-divergence portion of the loss function is as follows:
  • $$-\mathrm{KL}\left[q(z|x,\phi)\,\|\,p(z|\theta)\right] = \sum_{z} q(z|x,\phi) \cdot \left[\log p(z|\theta) - \log q(z|x,\phi)\right]$$
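  • For a handful of binary variables the sum above can be evaluated by direct enumeration, which is useful as a check on stochastic or analytic estimates. The sketch below assumes a factorial approximating posterior given by per-unit probabilities q(z_i=1|x) and a prior of the form p(z) ∝ exp(−E_p(z)); the function name and the sign convention for E_p are assumptions.

```python
import itertools
import math

def neg_kl_bruteforce(q1, energy_p):
    """Return -KL[q || p] by summing over all binary configurations.
    q1[i] is q(z_i = 1 | x); p(z) is proportional to exp(-energy_p(z))."""
    n = len(q1)
    states = list(itertools.product((0, 1), repeat=n))
    log_partition = math.log(sum(math.exp(-energy_p(z)) for z in states))
    total = 0.0
    for z in states:
        qz = 1.0
        for qi, zi in zip(q1, z):
            qz *= qi if zi else (1.0 - qi)
        if qz > 0.0:  # convention: 0 * log 0 = 0
            log_p = -energy_p(z) - log_partition
            total += qz * (log_p - math.log(qz))
    return total

# Illustrative values only: a prior with one pairwise coupling.
q1 = [0.6, 0.2, 0.7]
energy = lambda z: -(0.3 * z[0] - 1.2 * z[1] + 0.8 * z[2]) - 0.9 * z[0] * z[1]
print(neg_kl_bruteforce(q1, energy))
```

  • The cost grows as 2^n, so this is only a reference computation; for realistic sizes the prior expectations are the part that would be estimated from samples.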
  • The gradient of the KL-divergence portion of the loss function in the above equation with respect to θ can be estimated stochastically using samples from the true prior distribution p(z|θ). The gradient of the KL-divergence portion of the loss function can be expressed as follows:
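  • $$\frac{\partial\,\mathrm{KL}(q\,\|\,p)}{\partial\theta} = -\left\langle \frac{\partial E_p(z|\theta)}{\partial\theta} \right\rangle_{q(z|x,\phi)} + \left\langle \frac{\partial E_p(z|\theta)}{\partial\theta} \right\rangle_{p(z|\theta)}$$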
  • In one approach, the method computes the gradients of the KL-divergence portion of the loss function analytically, for example by first directly parameterizing a factorial q(z|x,ϕ) with a deep network g(x):
  • $$q(z|x,\phi) = \frac{e^{-E_q(z|x,\phi)}}{\sum_{z} e^{-E_q(z|x,\phi)}} \quad\text{where}\quad E_q(z|x) = -g(x)^{T} \cdot z$$
  • and then using the following expression:
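  • $$\frac{\partial\,\mathrm{KL}(q\,\|\,p)}{\partial\phi} = \left(\left(g(x) - h - (J^{T} + J)\cdot\langle z\rangle_q\right)^{T} \odot \left(\langle z\rangle_q - \langle z\rangle_q^{2}\right)^{T}\right) \cdot \frac{\partial g(x)}{\partial\phi}$$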
  • Equation 1 can therefore be simplified by dropping the dependence of p on z and then marginalizing z out of q, as follows:
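  • $$\frac{\partial}{\partial\phi}\mathbb{E}_{q(\zeta,z|x,\phi)}\left[\log p(x|\zeta,z,\theta)\right] \approx \frac{1}{N}\sum_{\rho\sim U(0,1)^{n}} \frac{\partial}{\partial\phi}\log p(x|\zeta(\rho),\theta)\Big|_{\zeta=\zeta(\rho)} \qquad (2)$$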
  • An example of a transformation from the discrete latent space to a continuous latent space is the spike-and-slab transformation:
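  • $$r(\zeta_i|z_i=0) = \begin{cases} \infty, & \text{if } \zeta_i = 0 \\ 0, & \text{otherwise} \end{cases} \qquad r(\zeta_i|z_i=1) = \begin{cases} 1, & \text{if } 0 \leq \zeta_i \leq 1 \\ 0, & \text{otherwise} \end{cases}$$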
  • This transformation is consistent with sparse coding.
  • Other expansions to the continuous space are also possible. As an example, a combination of a delta spike and an exponential function can be used:
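  • $$r(\zeta_i|z_i=0) = \begin{cases} \infty, & \text{if } \zeta_i = 0 \\ 0, & \text{otherwise} \end{cases} \qquad r(\zeta_i|z_i=1) = \begin{cases} \dfrac{\beta e^{\beta\zeta_i}}{e^{\beta}-1}, & \text{if } 0 \leq \zeta_i \leq 1 \\ 0, & \text{otherwise} \end{cases}$$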
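  • For both transformations the marginal CDF of ζ_i is a mixture of a step at zero, carrying mass 1 − q, and a continuous part on [0, 1] carrying mass q, where q denotes q(z_i=1|x,ϕ). Integrating the densities above gives F(ζ) = (1 − q) + q·ζ for the spike-and-slab case and F(ζ) = (1 − q) + q·(e^{βζ} − 1)/(e^{β} − 1) for the spike-and-exponential case. The sketch below inverts these two expressions; the closed forms follow from that integration, and the function names are assumptions.

```python
import math

def spike_and_slab_inverse_cdf(rho, q):
    """Invert F(zeta) = (1 - q) + q * zeta on [0, 1]; the spike at zero
    absorbs every rho <= 1 - q."""
    if rho <= 1.0 - q:
        return 0.0
    return (rho - (1.0 - q)) / q

def d_spike_and_slab_dq(rho, q):
    """Derivative of the inverse CDF with respect to q for a fixed rho."""
    if rho <= 1.0 - q:
        return 0.0
    return (1.0 - rho) / (q * q)

def spike_and_exp_inverse_cdf(rho, q, beta):
    """Invert F(zeta) = (1 - q) + q * (exp(beta * zeta) - 1) / (exp(beta) - 1)."""
    if rho <= 1.0 - q:
        return 0.0
    u = (rho - (1.0 - q)) / q
    return math.log(u * math.expm1(beta) + 1.0) / beta

for rho in (0.1, 0.5, 0.9):
    print(rho,
          spike_and_slab_inverse_cdf(rho, 0.6),
          spike_and_exp_inverse_cdf(rho, 0.6, 3.0))
```

  • Unlike the quantile function of the underlying binary variable, these inverses vary continuously with q wherever ρ exceeds 1 − q, so a decoder gradient with respect to ζ can be propagated back to the encoder output.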
  • Alternatively, it is possible to define a transformation from discrete to continuous variables in the approximating posterior, r(ζ|z), where the transformation is not independent of the input x. In the true posterior distribution, p(ζ|z,x)≈p(ζ|z) only if z already captures most of the information about x and p(ζ|z,x) changes little as a function of x. In a case where it may be desirable for $q(\zeta_i|z_i,x,\phi)$ to be a separate Gaussian for both values of the binary $z_i$, it is possible to use a mixture of a delta spike and a Gaussian to define a transformation from the discrete to the continuous space for which the CDF can be inverted piecewise.
  • FIG. 4 shows a method 400 of unsupervised learning using a discrete variational auto-encoder. Execution of the method 400 by one or more processor-based devices may occur in accordance with the present system, devices, articles, and methods. Method 400, like other methods herein, may be implemented by a series or set of processor-readable instructions executed by one or more processors (i.e., hardware circuitry).
  • Method 400 starts at 405, for example in response to a call from another routine or other invocation.
  • At 410, the system initializes the model parameters with random values. Alternatively, the system can initialize the model parameters based on a pre-training procedure. At 415, the system tests to determine if a stopping criterion has been reached. The stopping criterion can, for example, be related to the number of epochs (i.e., passes through the dataset) or a measurement of performance between successive passes through a validation dataset. In the latter case, when performance begins to degrade, it is an indication that the system is over-fitting and should stop.
  • In response to determining the stopping criterion has been reached, the system ends method 400 at 475, until invoked again, for example, by a request to repeat the learning.
  • In response to determining the stopping criterion has not been reached, the system fetches a mini-batch of the training data set at 420. At 425, the system propagates the training data set through the encoder to compute the full approximating posterior over discrete space z.
  • At 430, the system generates or causes generation of samples from the approximating posterior over ζ, given the full distribution over z. Typically, this is performed by a non-quantum processor, and uses the inverse of the CDF $F_i(x)$ described above. The non-quantum processor can, for example, take the form of one or more digital microprocessors, digital signal processors, graphical processing units, central processing units, digital application specific integrated circuits, digital field programmable gate arrays, digital microcontrollers, and/or any associated memories, registers or other nontransitory computer- or processor-readable media, communicatively coupled to the non-quantum processor.
  • At 435, the system propagates the samples through the decoder to compute the distribution over the input.
  • At 440, the system performs backpropagation through the decoder.
  • At 445, the system performs backpropagation through the sampler over the approximating posterior over ζ. In this context, backpropagation is an efficient computational approach to determining the gradient.
  • At 450, the system computes the gradient of the KL-divergence between the approximating posterior and the true prior over z. At 455, the system performs backpropagation through the encoder.
  • At 457, the system determines a gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space.
  • At 460, the system determines at least one of a gradient or at least a stochastic approximation of a gradient, of a bound on the log-likelihood of the input data.
  • In some embodiments, the system generates samples or causes samples to be generated by a quantum processor. At 465, the system updates the model parameters based at least in part on the gradient.
  • At 470, the system tests to determine if the current mini-batch is the last mini-batch to be processed. In response to determining that the current mini-batch is the last mini-batch to be processed, the system returns control to 415. In response to determining that the current mini-batch is not the last mini-batch to be processed, the system returns control to 420.
  • In some implementations, act 470 is omitted, and control passes directly to 415 from 465. The decision whether to fetch another mini-batch can be incorporated in 415.
  • In summary, as described in more detail above, the discrete VAE method extends the encoder and the prior with a transformation to a continuous, auxiliary latent representation, and correspondingly makes the decoder a function of the same continuous representation. The method evaluates the auto-encoder portion of the loss function in the continuous representation while evaluating the KL-divergence portion of the loss function in the z space.
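  • The control flow of method 400 can be sketched as a generic training loop. In the following illustration, `gradient_fn` stands in for acts 425 through 460 and `validation_fn` for the performance measurement used at 415; both are hypothetical callables, as are the learning rate, the batch size, and the gradient-ascent update rule.

```python
def train(dataset, params, gradient_fn, validation_fn,
          batch_size=2, learning_rate=0.1, max_epochs=100):
    """Skeleton of the loop in method 400 (acts 410-475).
    gradient_fn(params, batch) -> list of gradients of the bound (hypothetical).
    validation_fn(params) -> score, where higher is better (hypothetical)."""
    best_score = float("-inf")
    for _ in range(max_epochs):                            # 415: epoch limit
        for start in range(0, len(dataset), batch_size):   # 420 and 470
            batch = dataset[start:start + batch_size]
            grads = gradient_fn(params, batch)              # 425-460
            params = [p + learning_rate * g                 # 465: update
                      for p, g in zip(params, grads)]
        score = validation_fn(params)                       # 415: validation
        if score < best_score:                              # performance degraded
            break
        best_score = score
    return params                                           # 475: end

# Toy usage: fit a single parameter to the mean of each mini-batch.
toy_gradient = lambda p, batch: [2.0 * (sum(batch) / len(batch) - p[0])]
toy_validation = lambda p: -(p[0] - 2.5) ** 2
print(train([1.0, 2.0, 3.0, 4.0], [0.0], toy_gradient, toy_validation))
```

  • The sketch folds the mini-batch test of act 470 into the inner loop, in the spirit of the implementations in which the decision to fetch another mini-batch is incorporated elsewhere in the flow.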
  • Accommodating Explaining-Away with a Hierarchical Approximating Posterior
  • When a probabilistic model is defined in terms of a prior distribution p(z) over latent variables z and a conditional distribution p(x|z) over observed variables x given the latent variables, the observation of x often induces strong correlations of the z, given x, in the posterior p(z|x) due to phenomena such as explaining-away, a pattern of reasoning where the confirmation of one cause reduces the need to search for alternative causes. Moreover, an RBM used as the prior distribution may have strong correlations between the units of the RBM.
  • To accommodate the strong correlations expected in the posterior distribution while maintaining tractability, hierarchy can be introduced into the approximating posterior q(z|x). Although the variables of each hierarchical layer are independent given the previous layers, the total distribution can capture strong correlations, especially as the size of each hierarchical layer shrinks towards a single variable.
  • The latent variables z of the RBM are divided into disjoint groups, z1, . . . , zk. The continuous latent variables ζ are divided into complementary disjoint groups ζ1, . . . , ζk. In one implementation, the groups may be chosen at random, while in other implementations the groups may be defined so as to be of equal size. The hierarchical variational auto-encoder defines the approximating posterior via a directed acyclic graphical model over these groups.
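  • $$q(z_1, \zeta_1, \ldots, z_k, \zeta_k \mid x, \phi) = \prod_{1 \le j \le k} r(\zeta_j \mid z_j) \cdot q(z_j \mid \zeta_{i<j}, x, \phi) \quad \text{where} \quad q(z_j \mid \zeta_{i<j}, x, \phi) = \frac{e^{g_j(\zeta_{i<j}, x)^T \cdot z_j}}{\prod_{z_\iota \in z_j} \left(1 + e^{g_{z_\iota}(\zeta_{i<j}, x)}\right)}$$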
  • z_j∈{0,1} and g_j(ζ_{i<j}, x, ϕ) is a parameterized function of the input and preceding ζ_i, such as a neural network. The corresponding graphical model is shown in FIG. 5.
  • FIG. 5 is a schematic diagram illustrating an example implementation of a hierarchical variational auto-encoder (VAE). The model uses approximating posterior 510, where latent variable z3 is conditioned on the continuous variables ζ2 and ζ1 while z2 is conditioned on ζ1.
  • The dependence of zj on the discrete variables zi<j is mediated by the continuous variables ζi<j.
  • This hierarchical approximating posterior does not affect the form of the auto-encoding term 520 of FIG. 5, except to increase the depth of the auto-encoder. Each ζ_j can be computed via the stochastic nonlinearity F^{-1}_{q_j(ζ_j | ζ_{i<j}, x, ϕ)}(ρ), where the function q_j can take the previous ζ_{i<j} as input.
  • The deterministic probability value q(z=1 | ζ_{i<j}, x, ϕ) is parameterized, for example by a neural network.
  • For each successive layer j of the auto-encoder, input x and all previous ζ_{i<j} are passed into the network computing q(z=1 | ζ_{i<j}, x, ϕ). Its output q_j, along with an independent random variable ρ, is passed into the deterministic function F^{-1}_{q(ζ_{i<j}, x, ϕ)}(ρ) to produce a sample of ζ_j. Once all ζ_j have been recursively computed, the full ζ along with the original input x is finally passed to log p(x | ζ, θ).
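  • The layer-by-layer sampling just described can be sketched as follows. This is a minimal illustration, not the implementation of the present systems: it assumes a spike-and-exponential transformation r(ζ|z) (a point mass at 0 for z=0 and a density proportional to exp(β·ζ) on [0, 1] for z=1), assumes a single linear layer followed by a logistic function for each q_j, and uses the hypothetical names `inverse_cdf_spike_exp`, `sample_zeta_hierarchy`, and `beta`.

```python
import numpy as np

def inverse_cdf_spike_exp(q, rho, beta=3.0):
    """Map uniform noise rho to zeta, given q = q(z=1 | ...).

    Assumption: r(zeta|z=0) is a point mass at 0 and r(zeta|z=1) is
    proportional to exp(beta * zeta) on [0, 1].
    """
    # Noise below 1 - q lands on the spike at zeta = 0 (arg is clamped to 1).
    arg = (rho + q - 1.0) / q * np.expm1(beta) + 1.0
    return np.log(np.maximum(arg, 1.0)) / beta

def sample_zeta_hierarchy(x, layers, rng, beta=3.0):
    """Sample zeta_1..zeta_k; group j sees x and all previously sampled zeta_i."""
    zetas = []
    for W, b in layers:                                  # one (W, b) pair per group j
        inputs = np.concatenate([x] + zetas)
        q = 1.0 / (1.0 + np.exp(-(W @ inputs + b)))      # q(z_j = 1 | zeta_{i<j}, x)
        rho = rng.uniform(size=q.shape)                  # independent uniform noise
        zetas.append(inverse_cdf_spike_exp(q, rho, beta))
    return zetas

# Toy usage: 4 inputs and three groups of sizes 3, 2, and 2.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
layers, n_in = [], x.size
for n_out in [3, 2, 2]:
    layers.append((rng.normal(size=(n_out, n_in)), np.zeros(n_out)))
    n_in += n_out
zetas = sample_zeta_hierarchy(x, layers, rng)
```

  • Because each ζ_j is a deterministic function of q_j and the noise ρ, gradients can flow from the decoder back into the networks that compute q_j.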
  • The KL-divergence between the approximating posterior and the true prior is also not significantly affected by the introduction of additional continuous latent variables ζ, so long as the approach uses the same expansion r(ζ|z) for both the approximating posterior and the prior, as follows:
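  • $$\begin{aligned} KL[q \| p] &= \sum_z \int_\zeta \left( \prod_{1 \le j \le k} r(\zeta_j \mid z_j) \cdot q(z_j \mid \zeta_{i<j}, x) \right) \cdot \log \left[ \frac{\prod_{1 \le j \le k} r(\zeta_j \mid z_j) \cdot q(z_j \mid \zeta_{i<j}, x)}{p(z) \cdot \prod_{1 \le j \le k} r(\zeta_j \mid z_j)} \right] \\ &= \sum_z \int_\zeta \left( \prod_{1 \le j \le k} r(\zeta_j \mid z_j) \cdot q(z_j \mid \zeta_{i<j}, x) \right) \cdot \log \left[ \frac{\prod_{1 \le j \le k} q(z_j \mid \zeta_{i<j}, x)}{p(z)} \right] \end{aligned}$$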
  • The gradient of the KL-divergence with respect to the parameter of the prior p(z|θ) can be estimated stochastically using samples from the approximating posterior q(ζ,z|x,ϕ) and the true prior p(z|θ). The prior can be, for example, an RBM.
  • The final expectation with respect to q(z_k | ζ_{i<k}, x, ϕ) can be performed analytically; all other expectations require samples from the approximating posterior. Similarly, the prior requires samples from, for example, an RBM.
  • Samples from the same prior distribution are required for an entire mini-batch, independent from the samples chosen from the training dataset.
  • Hierarchical Variational Auto-Encoders
  • Convolutional architectures are an essential component of state-of-the-art approaches to visual object classification, speech recognition, and numerous other tasks. In particular, they have been successfully applied to generative modeling, such as in deconvolutional networks and LAPGAN. There is, therefore, technical benefit in incorporating convolutional architectures into variational auto-encoders, as such can provide a technical solution to a technical problem, and thereby achieve a technical result.
  • Convolutional architectures are necessarily hierarchical. In the feedforward direction, they build from local, high-resolution features to global, low-resolution features through the application of successive layers of convolution, point-wise nonlinear transformations, and pooling. When used generatively, this process is reversed, with global, low-resolution features building towards local, high-resolution features through successive layers of deconvolution, point-wise nonlinear transformations, and unpooling.
  • Incorporating this architecture into the variational auto-encoder framework, it is natural to associate the upward pathway (from local to global) with the approximating posterior, and the downward pathway (from global to local) with the generative model. However, if the random variables of the generative model are defined to be the units of the deconvolutional network itself, then samples from the approximating posterior of the last hidden layer of the deconvolutional decoder can be determined directly by the convolutional encoder. In particular, it can be natural to define the samples from the last layer of the deconvolutional decoder to be a function solely of the first layer of the convolutional encoder. As a result, the auto-encoding component of the VAE parameter update depends on the bottom-most layer of random variables. This seems contradictory to the intuitive structure of a convolutional auto-encoder.
  • Instead, ancillary random variables can be defined at each layer of the deconvolutional decoder network. Ancillary random variables can be discrete random variables or continuous random variables.
  • In the deconvolutional decoder, the ancillary random variables of layer n are used in conjunction with the signal from layer n+1 to determine the signal to layer n−1. The approximating posterior over the ancillary random variables of layer n is defined to be a function of the convolutional encoder, generally restricted to layer n of the convolutional encoder. To compute a stochastic approximation to the gradient of the evidence lower bound, the approach can perform a single pass up the convolutional encoder network, followed by a single pass down the deconvolutional decoder network. In the pass down the deconvolutional decoder network, the ancillary random variables are sampled from the approximating posteriors computed in the pass up the convolutional encoder network.
  • A Problem with the Traditional Approach
  • A traditional approach can result in approximating posteriors that poorly match the true posterior, and consequently can result in poor samples in the auto-encoding loop. In particular, the approximating posterior defines independent distributions over each layer. This product of independent distributions ignores the strong correlations between adjacent layers in the true posterior, conditioned on the underlying data.
  • The representation throughout layer n should be mutually consistent, and consistent with the representation in layer n−1 and n+1. However, in the architecture described above, the approximating posterior over every random variable is independent. In particular, the variability in the higher (more abstract) layers is uncorrelated with that in the lower layers, and consistency cannot be enforced across layers unless the approximating posterior collapses to a single point.
  • This problem is apparent in the case of (hierarchical) sparse coding. At every layer, the true posterior has many modes, constrained by long-range correlations within each layer. For instance, if a line in an input image is decomposed into a succession of short line segments (e.g., Gabor filters), it is essential that the end of one segment line up with the beginning of the next segment. With a sufficiently overcomplete dictionary, there may be many sets of segments that cover the line, but differ by a small offset along the line. A factorial posterior can reliably represent one such mode.
  • These equivalent representations can be disambiguated by the successive layers of the representation. For instance, a single random variable at a higher layer may specify the offset of all the line segments in the previous example. In the traditional approach, the approximating posteriors of the (potentially disambiguating) higher layers are computed after approximating posteriors of the lower layers have been computed. In contrast, an efficient hierarchical variational auto-encoder could infer the approximating posterior over the top-most layer first, potentially using a deep, convolutional computation. It would then compute the conditional approximating posteriors of lower layers given a sample from the approximating posterior of the higher layers.
  • A Proposed Approach: Hierarchical Priors and Approximating Posteriors
  • In the present approach, rather than defining the approximating posterior to be fully factorial, the computational system conditions the approximating posterior for the nth layer on the sample from the approximating posterior of the higher layers preceding it in the downward pass through the deconvolutional decoder. In an example case, the computational system conditions the approximating posterior for the nth layer on the sample from the (n−1)th layer. This corresponds to a directed graphical model, flowing from the higher, more abstract layers to the lower, more concrete layers. Consistency between the approximating posterior distributions over each pair of layers is ensured directly.
  • With such a directed approximating posterior, it is possible to do away with ancillary random variables, and define the distribution directly over the primary units of the deconvolutional network. In this case, the system can use a parameterized distribution for the deconvolutional component of the approximating posterior that shares structure and parameters with the generative model. Alternatively, the system can continue to use a separately parameterized directed model.
  • In the example case and other cases, a stochastic approximation to the gradient of the evidence lower bound can be computed via one pass up the convolutional encoder, one pass down the deconvolutional decoder of the approximating posterior, and another pass down the deconvolutional decoder of the prior, conditioned on the sample from the approximating posterior. Note that if the approximating posterior is defined directly over the primary units of the deconvolutional generative model, as opposed to ancillary random variables, the final pass down the deconvolutional decoder of the prior does not actually pass signals from layer to layer. Rather, the input to each layer is determined by the approximating posterior.
  • Below is an outline of the computations for two adjacent hidden layers, highlighting the hierarchical components and ignoring the details of convolution and deconvolution. If the approximating posterior is defined directly over the primary units of the deconvolutional generative model, then it is natural to use a structure such as:

  • $$q(z_{n-1}, z_n \mid x, \phi) = q(z_{n-1} \mid x, \phi) \cdot q(z_n \mid z_{n-1}, x, \phi)$$

  • $$p(z_{n-1}, z_n \mid \theta) = p(z_n \mid z_{n-1}, \theta) \cdot p(z_{n-1} \mid \theta)$$
  • This builds the prior by conditioning the more local variables of the (n−1)th layer on the more global variables of the nth layer. With ancillary random variables, we might choose to use a simpler prior structure:

  • $$p(z_{n-1}, z_n \mid \theta) = p(z_{n-1} \mid \theta) \cdot p(z_n \mid \theta)$$
  • The evidence lower bound decomposes as:
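  • $$\begin{aligned} \mathcal{L}_{VAE}(x, \theta, \phi) &= \log p(x \mid \theta) - KL\left[ q(z_n, z_{n-1} \mid x, \phi) \,\|\, p(z_n, z_{n-1} \mid x, \theta) \right] \\ &= \log p(x \mid \theta) - KL\left[ q(z_{n-1} \mid z_n, x, \phi) \cdot q(z_n \mid x, \phi) \,\|\, p(z_{n-1} \mid z_n, x, \theta) \cdot p(z_n \mid x, \theta) \right] \\ &= \sum_{z_n} \sum_{z_{n-1}} q(z_{n-1} \mid z_n, x, \phi) \cdot q(z_n \mid x, \phi) \cdot \log \left[ \frac{p(x \mid z_n, z_{n-1}, \theta) \cdot p(z_{n-1} \mid z_n, \theta) \cdot p(z_n \mid \theta)}{q(z_{n-1} \mid z_n, x, \phi) \cdot q(z_n \mid x, \phi)} \right] \\ &= \mathbb{E}_{q(z_{n-1} \mid z_n, x, \phi) \cdot q(z_n \mid x, \phi)} \left[ \log p(x \mid z_n, z_{n-1}, \theta) \right] - KL\left[ q(z_n \mid x, \phi) \,\|\, p(z_n \mid \theta) \right] \\ &\quad - \sum_{z_n} q(z_n \mid x, \phi) \cdot KL\left[ q(z_{n-1} \mid z_n, x, \phi) \,\|\, p(z_{n-1} \mid z_n, \theta) \right] \qquad (3) \end{aligned}$$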
  • If the approximating posterior is defined directly over the primary units of the deconvolutional generative model, then it may be the case that p(x | z_n, z_{n−1}, θ) = p(x | z_{n−1}, θ).
  • If both q(z_{n−1} | z_n, x, ϕ) and p(z_{n−1} | z_n, θ) are Gaussian, then their KL-divergence has a simple closed form, which can be computationally efficient if the covariance matrices are diagonal. The gradients with respect to q(z_n | x, ϕ) in the last term of Equation 3 can be obtained using the same reparameterization method used in a standard VAE.
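  • For reference, the closed form mentioned above for two Gaussians with diagonal covariance can be sketched as follows. The function name `kl_diag_gaussians` and its argument names are illustrative only; the formula itself is the standard KL-divergence between diagonal Gaussians.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL[N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))], summed over dimensions."""
    return 0.5 * np.sum(
        np.log(var_p / var_q)                     # log-ratio of variances
        + (var_q + (mu_q - mu_p) ** 2) / var_p    # scaled variance plus mean shift
        - 1.0
    )

# Toy usage: two 3-dimensional Gaussians with unit variance, means one apart
# in the first dimension only.
mu_q = np.array([0.0, 0.0, 0.0])
mu_p = np.array([1.0, 0.0, 0.0])
kl = kl_diag_gaussians(mu_q, np.ones(3), mu_p, np.ones(3))
```

  • The cost is linear in the number of units, which is why diagonal covariance matrices keep this term inexpensive.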
  • To compute the auto-encoding portion of the ELBO, the system propagates up the convolutional encoder and down the deconvolutional decoder of the approximating posterior, to compute the parameters of the approximating posterior. In an example parameterization, this can compute the conditional approximating posterior of the nth layer based on both the nth layer of the convolutional encoder, and the preceding (n−1)th layer of the deconvolutional decoder of the approximating posterior. In principle, the approximating posterior of the nth layer may be based upon the input, the entire convolutional encoder, and layers i≤n of the deconvolutional decoder of the approximating posterior (or a subset thereof).
  • The configuration sampled from the approximating posterior is then used in a pass down the deconvolutional decoder of the prior. If the approximating posterior is defined over the primary units of the deconvolutional network, then the signal from the (n−1)th layer to the nth layer is determined by the approximating posterior for the (n−1)th layer, independent of the preceding layers of the prior. If the approach uses ancillary random variables, the sample from the nth layer depends on the (n−1)th layer of the deconvolutional decoder of the prior, and the nth layer of the approximating posterior.
  • This approach can be extended to arbitrary numbers of layers, and to posteriors and priors that condition on more than one preceding layer, e.g. where layer n is conditioned on all layers m<n preceding it.
  • The approximating posterior and the prior can be defined to be fully autoregressive directed graphical models.
  • The directed graphical models of the approximating posterior and prior can be defined as follows:
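  • $$q(\mathfrak{z}_1, \ldots, \mathfrak{z}_n \mid x, \phi) = \prod_{1 \le m \le n} q(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, x, \phi) \qquad p(\mathfrak{z}_1, \ldots, \mathfrak{z}_n \mid \theta) = \prod_{1 \le m \le n} p(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, \theta)$$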
  • where the entire RBM and its associated continuous latent variables are now denoted by 𝔷_1 = {z1, ζ1, . . . , zk, ζk}. This builds an approximating posterior and prior by conditioning the more local variables of layer m on the more global variables of layers m−1, . . . , 1. However, the conditional distribution in p(𝔷_1, . . . , 𝔷_n | θ) only depends on the continuous latent variables ζ.
  • FIG. 6 is a schematic diagram illustrating an example implementation of a variational auto-encoder (VAE) with a hierarchy of continuous latent variables with an approximating posterior 610 and a prior 620.
  • Each 𝔷_{m>1} in approximating posterior 610 and prior 620, respectively, denotes a layer of continuous latent variables and is conditioned on the layers preceding it. In the example implementation of FIG. 6, there are three levels of hierarchy.
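  • Alternatively, only the approximating posterior can be made hierarchical, with the prior factorizing across layers, as follows:
  • $$p(\mathfrak{z}_1, \ldots, \mathfrak{z}_n \mid \theta) = \prod_{1 \le m \le n} p(\mathfrak{z}_m \mid \theta)$$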
  • The ELBO decomposes as
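  • $$\begin{aligned} \mathcal{L}_{VAE}(x, \theta, \phi) &= \log p(x \mid \theta) - KL\left[ \prod_m q(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, x, \phi) \,\Big\|\, \prod_m p(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, x, \theta) \right] \\ &= \int_{\mathfrak{z}_1, \ldots, \mathfrak{z}_n} \prod_m q(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, x, \phi) \cdot \log \left[ \frac{p(x \mid \mathfrak{z}, \theta) \cdot \prod_m p(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, \theta)}{\prod_m q(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, x, \phi)} \right] \\ &= \mathbb{E}_{\prod_m q(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, x, \phi)} \left[ \log p(x \mid \mathfrak{z}, \theta) \right] - \sum_m \int_{\mathfrak{z}_{l<m}} \left( \prod_{l<m} q(\mathfrak{z}_l \mid \mathfrak{z}_{k<l}, x, \phi) \right) \cdot KL\left[ q(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, x, \phi) \,\|\, p(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, \theta) \right] \\ &= \mathbb{E}_{q(\mathfrak{z} \mid x, \phi)} \left[ \log p(x \mid \mathfrak{z}, \theta) \right] - \sum_m \mathbb{E}_{q(\mathfrak{z}_{l<m} \mid x, \phi)} \left[ KL\left[ q(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, x, \phi) \,\|\, p(\mathfrak{z}_m \mid \mathfrak{z}_{l<m}, \theta) \right] \right] \qquad (4) \end{aligned}$$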
  • In the case where both q(𝔷_m | 𝔷_{l<m}, x, ϕ) and p(𝔷_m | 𝔷_{l<m}, θ) are Gaussian distributions, the KL-divergence can be computationally efficient, and the gradient of the last term in Equation 4 with respect to q(𝔷_{n−1} | x, ϕ) can be obtained by reparametrizing, as commonly done in a traditional VAE. In all cases, a stochastic approximation to the gradient of the ELBO can be computed via one pass down approximating posterior 610, sampling from each continuous latent ζ_i and 𝔷_{m>1} in turn, and another pass down prior 620, conditioned on the samples from the approximating posterior. In the pass down the approximating posterior, samples at each layer n may be based upon both the input and all the preceding layers m<n. To compute the auto-encoding portion of the ELBO, p(x | 𝔷) can be applied from the prior to the sample from the approximating posterior.
  • The pass down the prior need not pass signal from layer to layer. Rather, the input to each layer can be determined by the approximating posterior using equation 4.
  • The KL-divergence is then taken between the approximating posterior and true prior at each layer, conditioned on the layers above. Re-parametrization can be used to include parameter-dependent terms into the KL-divergence term.
  • Both the approximating posterior and the prior distribution of each layer 𝔷_{m>1} are defined by neural networks, the inputs of which are ζ, 𝔷_{1<l<m}, and x in the case of the approximating posterior. The outputs of these networks are the mean and variance of a diagonal-covariance Gaussian distribution.
  • To ensure that each unit in the RBM is sometimes active and sometimes inactive, and thus that all units in the RBM are used, the system bases the batch normalization on the L1 norm, rather than using traditional batch normalization, when calculating the approximating posterior over the RBM units. In an alternative approach, the system may base the batch normalization on the L2 norm.
  • Specifically, the system may use:

  • $$y = x - \bar{x}$$

  • $$x_{bn} = y / \left( \overline{|y|} + \epsilon \right) \odot s + o$$
  • and bound 2≤s≤3 and −s≤o≤s.
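  • A minimal sketch of this L1-based batch normalization is given below. It assumes that the mean of x and the mean of |y| are taken over the mini-batch dimension (the overbars are not legible in the text as filed), and that the bounds on s and o are enforced by clipping; the function name `l1_batch_norm` and the value of `eps` are illustrative.

```python
import numpy as np

def l1_batch_norm(x, s, o, eps=1e-5):
    """L1-norm batch normalization of a (batch, units) array.

    Assumption: means are taken over the batch axis, and the bounds
    2 <= s <= 3 and -s <= o <= s are enforced by clipping.
    """
    s = np.clip(s, 2.0, 3.0)
    o = np.clip(o, -s, s)
    y = x - x.mean(axis=0)                           # y = x - mean(x)
    return y / (np.abs(y).mean(axis=0) + eps) * s + o

# Toy usage: 64 examples, 5 units.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
x_bn = l1_batch_norm(x, s=np.full(5, 2.0), o=np.zeros(5))
```

  • With the scale s bounded away from zero, the normalized inputs to the logistic units span both signs, which is one way to keep each unit from being permanently on or off.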
  • ISTA-Like Generative Model
  • The training of variational auto-encoders is typically limited by the form of the approximating posterior. However, there can be challenges using an approximating posterior other than a factorial posterior. The entropy of the approximating posterior, which constitutes one of the components of the KL-divergence between the approximating and true posterior (or true prior), can be trivial if the approximating posterior is factorial, and close to intractable if it is a mixture of factorial distributions. While one might consider using normalizing flows, importance weighting, or other methods to allow non-factorial approximating posteriors, it may be easier to change the model to make the true posterior more factorial.
  • In particular, with large numbers of latent variables, it may be desirable to use a sparse, overcomplete representation. In such a representation, there are many ways of representing a given input, although some will be more probable than others. At the same time, the model is sensitive to duplicate representations. Using two latent variables that represent similar features is not equivalent to using just one.
  • A similar problem arises in models with linear decoders and a sparsity prior; i.e., sparse coding. ISTA (and LISTA) address this by (approximately) following the gradient (with proximal descent) of the L1-regularized reconstruction error. The resulting transformation of the hidden representation is mostly linear in the input and the hidden representation:

  • $$z \leftarrow (I - \epsilon \cdot W^T \cdot W) \cdot z - \epsilon \cdot \lambda \cdot \mathrm{sign}(z) + \epsilon \cdot W^T \cdot x$$
  • Note, though, that the input must be provided to every layer.
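  • The update above can be written directly in code. The sketch below implements exactly the displayed (sub)gradient step, not the soft-thresholding form of ISTA; the function name `ista_like_step`, the dictionary shape, and the step-size choice are assumptions made for illustration.

```python
import numpy as np

def ista_like_step(z, x, W, eps, lam):
    """One update z <- (I - eps W^T W) z - eps lam sign(z) + eps W^T x."""
    n = W.shape[1]
    return (np.eye(n) - eps * W.T @ W) @ z - eps * lam * np.sign(z) + eps * W.T @ x

# Toy usage: an overcomplete dictionary with 12 code units and 8 inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 12))
x = rng.normal(size=8)
eps = 1.0 / np.linalg.norm(W, 2) ** 2    # 1 / largest eigenvalue of W^T W
z = np.zeros(12)
for _ in range(50):                       # x enters every iteration ("layer")
    z = ista_like_step(z, x, W, eps, lam=0.1)
```

  • Each iteration is equivalent to z − ε·(Wᵀ·(W·z − x) + λ·sign(z)), which makes explicit why the input x must be supplied at every layer.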
  • A somewhat similar approach can be employed in the deconvolutional decoder of the approximating posterior. Consider the case where the conditional approximating posterior of layer zn given layer zn−1 is computed by a multi-layer deterministic network. Rather than making a deterministic transformation of the input available to the first layer of this network, the system can instead provide the deterministic transformation of the input to the internal layers, or any subset of the internal layers. The approximating posterior over the final Gaussian units may then employ sparse coding via LISTA, suppressing redundant higher-level units, and thus allowing factorial posteriors where more than one unit coding for a given feature may be active. In the prior pathway, there is no input to govern the disambiguation between redundant features, so the winner-take-all selection must be achieved via other means, and a more conventional deep network may be sufficient.
  • Combination With Discrete Variational Auto-Encoder
  • The discrete variational auto-encoder can also be incorporated into a convolutional auto-encoder. It is possible to put a discrete VAE on the very top of the prior, where it can generate multi-modal distributions that then propagate down the deconvolutional decoder, readily allowing the production of more sophisticated multi-modal distributions. If using ancillary random variables, it would also be straightforward to include discrete random variables at every layer.
  • Hierarchical Approximating Posteriors
  • True posteriors can be multi-modal. Multiple plausible explanations for an observation can lead to a multi-modal posterior. In one implementation, a quantum processor can employ a Chimera topology. A Chimera topology can be defined as a tiled topology with intra-cell couplings at crossings between qubits within the cell and inter-cell couplings between respective qubits in adjacent cells. Traditional VAEs typically use a factorial approximating posterior. As a result, traditional VAEs have difficulty capturing correlations between latent variables.
  • One approach is to refine the approximating posterior automatically. This approach can be complex. Another, generally simpler, approach is to make the approximating posterior hierarchical. A benefit of this approach is that it can capture any distribution, or at least a wider range of distributions.
  • FIG. 7 shows a method 700 for unsupervised learning via a hierarchical variational auto-encoder (VAE), in accordance with the present systems, devices, articles and methods. Method 700 may be implemented as an extension of method 400 employing a hierarchy of random variables.
  • Method 700 starts at 705, for example in response to a call from another routine or other invocation.
  • At 710, the system initializes the model parameters with random values, as described above with reference to 410 of method 400.
  • At 715, the system tests to determine if a stopping criterion has been reached, as described above with reference to 415 of method 400.
  • In response to determining the stopping criterion has been reached, the system ends method 700 at 775, until invoked again, for example, by a request to repeat the learning.
  • In response to determining the stopping criterion has not been reached, the system, at 720, fetches a mini-batch of the training data set.
  • At 722, the system divides the latent variables z into disjoint groups z1, . . . , zk and the corresponding continuous latent variables into disjoint groups ζ1, . . . ζk.
  • At 725, the system propagates the training data set through the encoder to compute the full approximating posterior over discrete zj. As mentioned before, this hierarchical approximation does not alter the form of the gradient of the auto-encoding term.
  • At 730, the system generates or causes generation of samples from the approximating posterior over n layers of continuous variables given the full distribution over z. The number of layers n may be 1 or more.
  • At 735, the system propagates the samples through the decoder to compute the distribution over the input, as described above with reference to 435 of method 400.
  • At 740, the system performs backpropagation through the decoder, as described above with reference to 440 of method 400.
  • At 745, the system performs backpropagation through the sampler over the approximating posterior over the continuous variables, as described above with reference to 445 of method 400.
  • At 750, the system computes the gradient of the KL-divergence between the approximating posterior and the true prior over z, as described above with reference to 450 of method 400.
  • At 755, the system performs backpropagation through the encoder, as described above with reference to 455 of method 400.
  • At 757, the system determines a gradient of a KL-divergence, with respect to parameters of the true prior distribution, between the approximating posterior and the true prior distribution over the discrete space.
  • At 760, the system determines at least one of a gradient or at least a stochastic approximation of a gradient, of a bound on the log-likelihood of the input data.
  • In some embodiments, the system generates samples or causes samples to be generated by a quantum processor, as described above with reference to 460 of method 400.
  • At 765, the system updates the model parameters based at least in part on the gradient, as described above with reference to 465 of method 400.
  • At 770, the system tests to determine if the current mini-batch is the last mini-batch to be processed, as described above with reference to 470 of method 400. In some implementations, act 770 is omitted, and control passes directly to 715 from 765. The decision whether to fetch another mini-batch can be incorporated in 715.
  • In response to determining that the current mini-batch is the last mini-batch to be processed, the system returns control to 715. In response to determining that the current mini-batch is not the last mini-batch to be processed, the system returns control to 720.
  • In summary and as described in more detail above, method 700 renders the approximating posterior hierarchical over the discrete latent variables. In addition, method 700 also adds a hierarchy of continuous latent variables below them.
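  • The control flow of method 700 can be summarized in a short skeleton. This is only a sketch of the looping structure: the callbacks `step` (standing in for acts 722 through 765) and `stopping_criterion` (act 715) are hypothetical placeholders, and the sketch follows the variant in which act 770 is folded into the mini-batch loop.

```python
def train(mini_batches, step, stopping_criterion):
    """Outer loop tests the stopping criterion; inner loop visits each mini-batch."""
    n_updates = 0
    while not stopping_criterion(n_updates):    # act 715
        for batch in mini_batches:              # acts 720 and 770
            step(batch)                         # acts 722 through 765 (placeholder)
            n_updates += 1
    return n_updates                            # method ends (act 775)

# Toy usage: three mini-batches, stop once at least six updates have been made.
seen = []
n = train(
    mini_batches=[[0, 1], [2, 3], [4, 5]],
    step=seen.append,
    stopping_criterion=lambda n_updates: n_updates >= 6,
)
```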
  • Computing the Gradients of the KL Divergence
  • The remaining component of the loss function can be expressed as follows:
  • $$-KL\left[ q(z \mid x, \phi) \,\|\, p(z \mid \theta) \right] = \sum_z q(z \mid x, \phi) \cdot \left[ \log p(z \mid \theta) - \log q(z \mid x, \phi) \right]$$
  • In some implementations, such as when the samples are generated using an example embodiment of a quantum processor, the prior distribution is a Restricted Boltzmann Machine (RBM), as follows:
  • $$p(z \mid \theta) = \frac{e^{-E_p(z, \theta)}}{\mathcal{Z}_p} \quad \text{where} \quad E_p(z) = -z^T \cdot J \cdot z - h^T \cdot z \quad \text{and} \quad \mathcal{Z}_p = \sum_z e^{-E_p(z, \theta)}$$
  • where z∈{0,1}^n, 𝒵_p is the partition function, and the lateral connection matrix J is bipartite and very sparse. The prior distribution described by the above equation contains strong correlations, and the present computational system can use a hierarchical approximating posterior.
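  • For a very small number of units, the RBM prior defined above can be evaluated exactly by enumeration, which is useful for checking sampled estimates. The sketch below is illustrative only; the helper names `rbm_energy` and `rbm_distribution` and the toy couplings are assumptions, and enumeration is infeasible at realistic sizes (hence the use of sampling).

```python
import itertools
import numpy as np

def rbm_energy(z, J, h):
    """E_p(z) = -z^T J z - h^T z for a binary vector z."""
    return -(z @ J @ z) - h @ z

def rbm_distribution(J, h):
    """Enumerate all 2^n states; return states, probabilities, partition function."""
    n = h.size
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    weights = np.exp(-np.array([rbm_energy(z, J, h) for z in states]))
    Z = weights.sum()
    return states, weights / Z, Z

# Toy usage: 4 units, bipartite couplings between units {0, 1} and {2, 3}.
J = np.zeros((4, 4))
J[0, 2], J[1, 3] = 0.5, -0.3
h = np.array([0.1, -0.2, 0.0, 0.3])
states, probs, Z = rbm_distribution(J, h)
```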
  • The present method divides the latent variables into two groups and defines the approximating posterior via a directed acyclic graphical model over the two groups za and zb, as follows:
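  • $$\begin{aligned} q(z \mid x, \phi) &= \frac{e^{-E_a(z_a \mid x, \phi)}}{\mathcal{Z}_a(x)} \cdot \frac{e^{-E_{b|a}(z_b \mid z_a, x, \phi)}}{\mathcal{Z}_{b|a}(z_a, x)} \quad \text{where} \\ E_a(z_a \mid x) &= -g_a(x)^T \cdot z_a \\ E_{b|a}(z_b \mid z_a, x) &= -g_{b|a}(x, z_a)^T \cdot z_b \\ \mathcal{Z}_a(x) &= \sum_{z_a} e^{-E_a(z_a \mid x, \phi)} = \prod_{a_i \in a} \left( 1 + e^{g_{a_i}(x)} \right) \\ \mathcal{Z}_{b|a}(x, z_a) &= \sum_{z_b} e^{-E_{b|a}(z_b \mid z_a, x, \phi)} = \prod_{b_i \in b} \left( 1 + e^{g_{b_i|a}(x, z_a)} \right) \end{aligned}$$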
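  • The factorized form of the two partition functions is what makes this two-group posterior tractable. The sketch below evaluates log q(z_a, z_b | x) for given logits; the names `log_partition_factorial` and `log_q`, the representation of g_{b|a} as a callable of z_a, and the toy linear dependence on z_a are assumptions made for illustration.

```python
import itertools
import numpy as np

def log_partition_factorial(g):
    """log prod_i (1 + exp(g_i)), computed stably."""
    return np.sum(np.logaddexp(0.0, g))

def log_q(z_a, z_b, g_a, g_b_given_a):
    """log q(z_a, z_b | x); g_a holds g_a(x), g_b_given_a maps z_a to g_{b|a}(x, z_a)."""
    g_b = g_b_given_a(z_a)
    return (g_a @ z_a - log_partition_factorial(g_a)
            + g_b @ z_b - log_partition_factorial(g_b))

# Toy usage: 2 units in group a, 3 units in group b.
g_a = np.array([0.4, -1.0])
M = np.array([[1.0, -2.0], [0.5, 0.3], [0.0, 1.5]])
g_b_given_a = lambda z_a: M @ z_a + 0.2
total = sum(
    np.exp(log_q(np.array(za), np.array(zb), g_a, g_b_given_a))
    for za in itertools.product([0.0, 1.0], repeat=2)
    for zb in itertools.product([0.0, 1.0], repeat=3)
)
```

  • Summing exp(log q) over all joint states gives one, confirming that the product form of each partition function normalizes the corresponding factor.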
  • The gradient of −KL[q(z|x,ϕ)∥p(z|θ)] with respect to the parameters θ of the prior can be estimated stochastically using samples from the approximating posterior q(z|x) = q_a(z_a|x)·q_{b|a}(z_b|z_a,x) and the true prior, as follows:
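  • $$\begin{aligned} -\frac{\partial}{\partial \theta} KL\left[ q(z \mid x, \phi) \,\|\, p(z \mid \theta) \right] &= -\sum_z q(z \mid x, \phi) \cdot \frac{\partial E_p(z, \theta)}{\partial \theta} + \sum_z p(z \mid \theta) \cdot \frac{\partial E_p(z, \theta)}{\partial \theta} \\ &= -\mathbb{E}_{q_a(z_a \mid x, \phi)} \left[ \mathbb{E}_{q_{b|a}(z_b \mid z_a, x, \phi)} \left[ \frac{\partial E_p(z, \theta)}{\partial \theta} \right] \right] + \mathbb{E}_{p(z \mid \theta)} \left[ \frac{\partial E_p(z, \theta)}{\partial \theta} \right] \end{aligned}$$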
  • The expectation with respect to qb|a(zb|za,x,ϕ) can be performed analytically; the expectation with respect to qa(za|x,ϕ) requires samples from the approximating posterior. Similarly, for the prior, sampling is from the native distribution of the quantum processor. Rao-Blackwellization can be used to marginalize half of the units. Samples from the same prior distribution are used for a mini-batch, independent of the samples chosen from the training dataset.
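  • With E_p(z) = −zᵀ·J·z − hᵀ·z, the derivative of the energy with respect to J_ij is −z_i·z_j and with respect to h_i is −z_i, so the gradient above reduces to a difference of sample statistics. The sketch below uses plain Monte Carlo averages for both terms; it does not perform the analytic expectation over q_{b|a} or the Rao-Blackwellization mentioned above, it does not mask J to the bipartite sparsity pattern, and the name `neg_kl_grad_prior` is hypothetical.

```python
import numpy as np

def neg_kl_grad_prior(z_q, z_p):
    """Estimate the gradient of -KL[q || p] with respect to J and h.

    z_q: (N, n) binary samples from the approximating posterior.
    z_p: (M, n) binary samples from the prior (for example, an RBM sampler).
    """
    grad_J = z_q.T @ z_q / len(z_q) - z_p.T @ z_p / len(z_p)   # E_q[z z^T] - E_p[z z^T]
    grad_h = z_q.mean(axis=0) - z_p.mean(axis=0)               # E_q[z] - E_p[z]
    return grad_J, grad_h

# Toy usage with random binary samples standing in for both samplers.
rng = np.random.default_rng(0)
z_q = (rng.uniform(size=(100, 4)) < 0.7).astype(float)
z_p = (rng.uniform(size=(200, 4)) < 0.3).astype(float)
grad_J, grad_h = neg_kl_grad_prior(z_q, z_p)
```

  • The same set of prior samples `z_p` can be reused across a mini-batch, since it does not depend on the training examples.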
  • The gradient of −KL[q(z|x,ϕ)∥p(z|θ)] with respect to the parameters ϕ of the approximating posterior does not depend on the partition function of the prior 𝒵_p, since:
  • $$KL(q \| p) = \sum_z \left( q \log q - q \log p \right) = \sum_z \left( q \log q + q \cdot E_p + q \log \mathcal{Z}_p \right) = \sum_z \left( q \log q + q \cdot E_p \right) + \log \mathcal{Z}_p$$
  • Consider a case where q is hierarchical with q = q_a·q_{b|a}· . . . . Because the random variables are fundamentally continuous after marginalizing out the discrete random variables, the re-parameterization technique can be used to backpropagate through ∏_{j<i} q_{j|k<j}.
  • The entropy term of the KL divergence is then:
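  • $$\begin{aligned} H(q) &= \sum_z q \cdot \log q = \sum_z \left( \prod_i q_{i|k<i} \right) \cdot \left( \sum_i \log q_{i|k<i} \right) = \sum_i \sum_z \left( \prod_{j \le i} q_{j|k<j} \right) \cdot \log q_{i|k<i} \\ &= \sum_i \mathbb{E}_{\prod_{j<i} q_{j|k<j}} \left[ \sum_{z_i} q_{i|k<i} \cdot \log q_{i|k<i} \right] = \sum_i \mathbb{E}_{\rho_{k<i}} \left[ \sum_{z_i} q_{i|\rho_{k<i}} \cdot \log q_{i|\rho_{k<i}} \right] \end{aligned}$$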
  • where indices i, j, and k denote hierarchical groups of variables. The probability q_{i|ρ_{k<i}}(z_i) is evaluated analytically, whereas all variables k<i are sampled stochastically via ρ_{k<i}. Taking the gradient of H(q) in the above equation and using the identity:
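  • $$\mathbb{E}_q \left[ \frac{\partial}{\partial \phi} \left( c \cdot \log q \right) \right] = c \cdot \sum_z q \cdot \left( \frac{\partial q / \partial \phi}{q} \right) = c \cdot \frac{\partial}{\partial \phi} \left( \sum_z q \right) = 0$$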
  • for a constant c, allows elimination of the gradient of log q_{i|ρ_{k<i}} in the earlier equation, to obtain:
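  • $$\frac{\partial}{\partial \phi} H(q) = \sum_i \mathbb{E}_{\rho_{k<i}} \left[ \sum_{z_i} \left( \frac{\partial}{\partial \phi} q_{i|\rho_{k<i}} \right) \cdot \log q_{i|\rho_{k<i}} \right]$$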
  • Moreover, elimination of a log-partition function in log q_{i|ρ_{k<i}} is achieved by an analogous argument. By repeating this argument one more time, ∂(q_{i|ρ_{k<i}})/∂ϕ can be broken into its factorial components. If q_{i|ρ_{k<i}} is a logistic function of the input and z_i∈{0,1}, the gradient of the entropy reduces to:
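  • $$\begin{aligned} \frac{\partial}{\partial \phi} H(q) &= \sum_i \mathbb{E}_{\rho_{k<i}} \left[ \sum_{l \in i} \sum_{z_l} q_l(z_l) \cdot \left( z_l \cdot \frac{\partial g_l}{\partial \phi} - \sum_{z_l} \left( q_l(z_l) \cdot z_l \cdot \frac{\partial g_l}{\partial \phi} \right) \right) \cdot \left( g_l \cdot z_l \right) \right] \\ &= \sum_i \mathbb{E}_{\rho_{k<i}} \left[ \frac{\partial g_i^T}{\partial \phi} \cdot \left( g_i \odot \left[ q_i(z_i = 1) - q_i^2(z_i = 1) \right] \right) \right] \end{aligned}$$
  • where l and z_l correspond to single variables within the hierarchical groups denoted by i. In TensorFlow, it might be simpler to write: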
  • $$\frac{\partial}{\partial \phi} H(q) = \sum_i \mathbb{E}_{\rho_{k<i}} \left[ \frac{\partial q_i^T(z_i = 1)}{\partial \phi} \cdot g_i \right]$$
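  • The reduction above can be checked numerically for independent logistic units. The sketch below compares a finite-difference derivative of Σ q·log q with respect to the logits g against the closed form g ⊙ [q(z=1) − q²(z=1)]; the function names and the choice of logits are illustrative, and the derivative is taken with respect to g rather than ϕ (the remaining factor ∂g/∂ϕ follows from the chain rule).

```python
import numpy as np

def neg_entropy(g):
    """Sum over units of q log q + (1 - q) log(1 - q), with q = sigmoid(g)."""
    q = 1.0 / (1.0 + np.exp(-g))
    # log q = -log(1 + e^{-g}) and log(1 - q) = -log(1 + e^{g})
    return np.sum(-q * np.logaddexp(0.0, -g) - (1.0 - q) * np.logaddexp(0.0, g))

def neg_entropy_grad(g):
    """Closed form from the text: g * (q - q^2), elementwise."""
    q = 1.0 / (1.0 + np.exp(-g))
    return g * (q - q ** 2)

g = np.array([-1.5, 0.3, 2.0])
step = 1e-6
numeric = np.array([
    (neg_entropy(g + step * e) - neg_entropy(g - step * e)) / (2 * step)
    for e in np.eye(g.size)
])
analytic = neg_entropy_grad(g)
```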
  • The remaining cross-entropy term is:
  • $$\sum_z q \cdot E_p = -\mathbb{E}_\rho \left[ z^T \cdot J \cdot z + h^T \cdot z \right]$$
  • The term h^T·z can be handled analytically, since z_i∈{0,1}, and
  • $$\mathbb{E}_\rho \left[ h^T \cdot z \right] = h^T \cdot \mathbb{E}_\rho \left[ q(z = 1) \right]$$
  • The approximating posterior q is continuous in this case, with non-zero derivative, so the re-parameterization technique can be applied to backpropagate gradients:
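  • $$\frac{\partial}{\partial \phi} \mathbb{E}_\rho \left[ h^T \cdot z \right] = h^T \cdot \mathbb{E}_\rho \left[ \frac{\partial q(z = 1)}{\partial \phi} \right]$$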
  • In contrast, each element of the sum:
  • $$z^T \cdot J \cdot z = \sum_{i,j} J_{ij} \cdot z_i \cdot z_j$$
  • depends upon variables which are not usually in the same hierarchical level, so, in general:
  • $$\mathbb{E}_\rho \left[ J_{ij} z_i z_j \right] \ne J_{ij} \cdot \mathbb{E}_\rho \left[ z_i \right] \cdot \mathbb{E}_\rho \left[ z_j \right]$$
  • This term can be decomposed into:
  • $$\mathbb{E}_\rho \left[ J_{ij} z_i z_j \right] = J_{ij} \cdot \mathbb{E}_{\rho_{k \le i}} \left[ z_i \cdot \mathbb{E}_{\rho_{k > i}} \left[ z_j \right] \right]$$
  • where, without loss of generality, zi is in a higher hierarchical layer than zj. It can be challenging to take the derivative of zj because it is a discontinuous function of ρk<i.
  • Direct Decomposition of ∂(J_{ij} z_i z_j)/∂ϕ
  • The re-parameterization technique initially makes z_i a function of ρ and ϕ. However, it is possible to marginalize over values of the re-parameterization variables ρ for which z is consistent, thereby rendering z_i a constant. Assuming, without loss of generality, that i<j, 𝔼_ρ[J_{ij} z_i z_j] can be expressed as follows:
  • ρ [ J i j z i z j ] = J i j · ρ k < i [ z i ~ q i | ρ k < i , [ z i ( ρ , ) · ρ i | z i [ ρ k < i [ z j ( ρ z i , ) ] ] ] ] = J i j · ρ k < i [ z i q i ( z 1 = 1 | ρ k < i , ) · z i · ρ i | z i [ ρ i < k < j [ z j q j ( z j = 1 | ρ z i , k < j , ) · z j ] ] ] = J i j · ρ k < i [ z i q i ( z i = 1 | ρ k < i , ) · z i · ρ i | z i [ ρ i < k < j [ q i ( z j = 1 | ρ z i , k < i , ) ] ] ]
  • The quantity $q_j(z_j=1 \mid \rho_{z_i,k<j},\phi)$ is not directly a function of the original ρ, since ρi is sampled from the distribution conditioned on the value of zi. It is this conditioned quantity, $q_j(z_j=1 \mid \rho_{z_i,k<j},\phi)$, that should be differentiated.
  • With zi fixed, sampling from ρi is equivalent to sampling from ζi|zi. In particular, ρi is not a function of qk<i, or parameters from previous layers. Combining this with the chain rule, ζi can be held fixed when differentiating qj, with gradients not backpropagating from qj through ζi.
  • Using the chain rule, the term due to the gradient of $q_i(z_i \mid \rho_{k<i},\phi)$ is:
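  • $$\begin{aligned}
    \frac{\partial}{\partial\phi}\mathbb{E}_\rho\left[J_{ij} z_i z_j\right] &= J_{ij} \cdot \mathbb{E}_{\rho_{k<i}}\left[\sum_{z_i} \frac{\partial q_i(z_i = 1)}{\partial\phi} \cdot z_i \cdot \mathbb{E}_{\rho_i \mid z_i}\left[\mathbb{E}_{\rho_{i<k<j}}\left[\sum_{z_j} q_j(z_j = 1 \mid \rho_{z_i, k<j}, \phi) \cdot z_j\right]\right]\right] \\
    &= J_{ij} \cdot \mathbb{E}_{\rho_{k<i}}\left[\mathbb{E}_{q_i \mid \rho_{k<i}, \phi}\left[\frac{\partial q_i(z_i = 1)}{\partial\phi} \cdot \frac{z_i}{q_i(z_i = 1 \mid \rho_{k<i}, \phi)} \cdot \mathbb{E}_{\rho_i \mid z_i}\left[\mathbb{E}_{\rho_{k>i}}\left[z_j(\rho, \phi)\right]\right]\right]\right] \\
    &= \mathbb{E}_\rho\left[J_{ij} \cdot \frac{\partial q_i(z_i = 1)}{\partial\phi} \cdot \frac{z_i(\rho, \phi)}{q_i(z_i = 1 \mid \rho_{k<i}, \phi)} \cdot z_j(\rho, \phi)\right] \\
    &= \mathbb{E}_\rho\left[J_{ij} \cdot \frac{z_i(\rho, \phi)}{q_i(z_i = 1)} \cdot q_j(z_j = 1) \cdot \frac{\partial q_i(z_i = 1)}{\partial\phi}\right]
    \end{aligned}$$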
  • where, in the second line, we reintroduce sampling over zi, but reweight the samples so the expectation is unchanged.
  • The term due to the gradient of qj(zj|ρ,ϕ) is:
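  • $$\begin{aligned}
    \frac{\partial}{\partial\phi}\mathbb{E}_\rho\left[J_{ij} z_i z_j\right] &= J_{ij} \cdot \mathbb{E}_{\rho_{k<i}}\left[\sum_{z_i} q_i(z_i \mid \rho_{k<i}, \phi) \cdot z_i \cdot \mathbb{E}_{\rho_i \mid z_i}\left[\mathbb{E}_{\rho_{i<k<j}}\left[\sum_{z_j} \frac{\partial q_j}{\partial\phi} \cdot z_j\right]\right]\right] \\
    &= J_{ij} \cdot \mathbb{E}_{\rho_{k<j}}\left[\mathbb{E}_{q_j \mid \rho_{k<j}, \phi}\left[z_i(\rho, \phi) \cdot \frac{z_j(\rho, \phi)}{q_j(z_j \mid \rho_{k<j}, \phi)} \cdot \frac{\partial q_j}{\partial\phi}\right]\right] \\
    &= \mathbb{E}_\rho\left[J_{ij} \cdot z_i(\rho, \phi) \cdot \frac{z_j(\rho, \phi)}{q_j(z_j = 1)} \cdot \frac{\partial q_j}{\partial\phi}\right]
    \end{aligned}$$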
  • For both zi and zj, the derivative with respect to q(z=0) can be ignored, since it is scaled by z=0. Once again, gradients can be prevented from backpropagating through ζi. The derivation sums over zi and then takes the expectation of ρi conditioned on the chosen value of zi. As a result, $q_j(z_j=1 \mid \rho_{z_i,k<j},\phi)$ depends upon ζi being fixed, independent of the preceding ρ and ϕ in the hierarchy.
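  • Further marginalizing over zj yields:
  • $$\frac{\partial}{\partial\phi}\mathbb{E}_\rho\left[J_{ij} z_i z_j\right] = \mathbb{E}_\rho\left[J_{ij} \cdot z_i \cdot \frac{\partial q_j(z_j = 1)}{\partial\phi}\right]$$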
  • Decomposition of ∂(Ji,jzizj)/∂ϕ Via the Chain Rule
  • In another approach, the gradient of Ep(Ji,jzizj) can be decomposed using the chain rule. Previously, z has been considered to be a function of ρ and ϕ. Instead z can be formulated as a function of q(z=1) and ρ, where q(z=1) is itself a function of ρ and ϕ. Specifically,
  • $$z_i\left(q_i(z_i=1), \rho_i\right) = \begin{cases} 0 & \text{if } \rho_i < 1 - q_i(z_i=1) = q_i(z_i=0) \\ 1 & \text{otherwise} \end{cases}$$
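  • The step-function re-parameterization above can be sketched in NumPy as follows. This is a minimal illustration assuming ρ is drawn uniformly on [0, 1); the function name `sample_z` and the probabilities in `q1` are hypothetical. The empirical mean of z should approach q(z=1).

```python
import numpy as np

def sample_z(q1, rho):
    """z = 0 if rho < 1 - q(z=1) = q(z=0), and 1 otherwise."""
    return (rho >= 1.0 - q1).astype(np.float64)

rng = np.random.default_rng(0)
q1 = np.array([0.1, 0.5, 0.9])            # hypothetical q(z=1) per unit
rho = rng.uniform(size=(200_000, 3))      # re-parameterization noise
z = sample_z(q1, rho)
empirical = z.mean(axis=0)                # Monte Carlo estimate of q(z=1)
print(empirical)
```

  • Because z changes only when ρ crosses the threshold 1 − q(z=1), the mapping is piecewise constant in q, which is why its derivative has to be handled through the expectation rather than sample by sample.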
  • The chain rule can be used to differentiate with respect to q(z=1) since it allows pulling part of the integral over ρ inside the derivative with respect to ϕ.
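  • Expanding the desired gradient using the re-parameterization technique and the chain rule yields:
  • $$\frac{\partial}{\partial\phi}\mathbb{E}_q\left[J_{ij} z_i z_j\right] = \frac{\partial}{\partial\phi}\mathbb{E}_\rho\left[J_{ij} z_i z_j\right] = \mathbb{E}_\rho\left[\sum_k \frac{\partial\left(J_{ij} z_i z_j\right)}{\partial q_k(z_k = 1)} \cdot \frac{\partial q_k(z_k = 1)}{\partial\phi}\right]$$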
  • The order of integration (via the expectation) and differentiation can be changed. Although z(q,ρ) is a step function, and its derivative is a delta function, the integral of its derivative is finite. Rather than dealing with generalized functions directly, the definition of the derivative can be applied and pushed through the matching integral to recover a finite quantity. For simplicity, the sum over k can be pulled out of the expectation in the above equation, and each summand considered independently.
  • Since zi is only a function of qi, terms in the sum over k in the above equation vanish except k=i and k=j. Without loss of generality, consider the term k=i; the term k=j is symmetric. Applying the definition of the gradient to one of the summands, and then analytically taking the expectation with respect to ρi, yields:
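  • $$\begin{aligned}
    &\mathbb{E}_\rho\left[\frac{\partial\left(J_{ij} \cdot z_i(q, \rho) \cdot z_j(q, \rho)\right)}{\partial q_i(z_i = 1)} \cdot \frac{\partial q_i(z_i = 1)}{\partial\phi}\right] \\
    &\quad= \mathbb{E}_\rho\left[\lim_{\delta q_i(z_i = 1) \to 0} \frac{J_{ij} \cdot z_i(q + \delta q_i, \rho) \cdot z_j(q + \delta q_i, \rho) - J_{ij} \cdot z_i(q, \rho) \cdot z_j(q, \rho)}{\delta q_i(z_i = 1)} \cdot \frac{\partial q_i(z_i = 1)}{\partial\phi}\right] \\
    &\quad= \mathbb{E}_{\rho_{k \neq i}}\left[\lim_{\delta q_i(z_i = 1) \to 0} \delta q_i \cdot \frac{J_{ij} \cdot 1 \cdot z_j(q, \rho) - J_{ij} \cdot 0 \cdot z_j(q, \rho)}{\delta q_i(z_i = 1)} \cdot \frac{\partial q_i(z_i = 1)}{\partial\phi} \,\Bigg|_{\rho_i = q_i(z_i = 0)}\right] \\
    &\quad= \mathbb{E}_{\rho_{k \neq i}}\left[J_{ij} \cdot z_j(q, \rho) \cdot \frac{\partial q_i(z_i = 1)}{\partial\phi} \,\Bigg|_{\rho_i = q_i(z_i = 0)}\right]
    \end{aligned}$$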
  • Since ρi is fixed such that ζi=0, units further down the hierarchy can be sampled in a manner consistent with this restriction. The gradient is computed with a stochastic approximation by multiplying each sample by 1−zi, so that terms with ζi≠0 can be ignored, and scaling up the gradient when zi=0 by 1/qi(zi=0), as follows:
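  • $$\frac{\partial}{\partial\phi}\mathbb{E}\left[J_{ij} z_i z_j\right] = \mathbb{E}_\rho\left[J_{ij} \cdot \frac{1 - z_i}{1 - q_i(z_i = 1)} \cdot z_j \cdot \frac{\partial q_i(z_i = 1)}{\partial\phi}\right]$$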
  • While this corresponds to taking the expectation of the gradient of the log-probability, it is done for each unit independently, so the total increase in variance can be modest.
  • Alternative Approach
  • An alternative approach is to take the gradient of the expectation using the gradient of log-probabilities over all variables:
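  • $$\frac{\partial}{\partial\phi}\mathbb{E}\left[J_{ij} z_i z_j\right] = \mathbb{E}_{q_1, q_{2|1}, \ldots}\left[J_{ij} z_i z_j \cdot \sum_k \frac{\partial}{\partial\phi} \log q_{k \mid \kappa<k}\right] = \mathbb{E}_{q_1, q_{2|1}, \ldots}\left[J_{ij} z_i z_j \cdot \sum_k \frac{1}{q_{k \mid \kappa<k}} \cdot \frac{\partial q_{k \mid \kappa<k}}{\partial\phi}\right]$$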
  • For the gradient term on the right-hand side, terms involving only zκ<k that occur hierarchically before k can be dropped out, since those terms can be pulled out of the expectation over qk. However, for terms involving zκ>k that occur hierarchically after k, the expected value of zκ depends upon the chosen value of zk.
  • Generally, no single term in the sum is expected to have a particularly high variance. However, the variance of the estimate is proportional to the number of terms, and the number of terms contributing to each gradient can grow quadratically with the number of units in a bipartite model, and linearly in a chimera-structured model. In contrast, in the previously described approach, the number of terms contributing to each gradient can grow linearly with the number of units in a bipartite model, and be constant in a chimera-structured model.
  • Introducing a baseline c(x) gives:
  • $$\mathbb{E}_q\left[\left(J_{ij} z_i z_j - c(x)\right) \cdot \frac{\partial}{\partial\phi} \log q\right]$$
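  • The log-probability estimator and the effect of a baseline can be illustrated on a toy two-unit hierarchy. The sketch below is an assumption-laden example, not the disclosed implementation: z1 ~ Bernoulli(σ(g1)), z2 | z1 ~ Bernoulli(σ(a + w·z1)), the only parameter differentiated is the logit g1, and the baseline is a constant. It enumerates all four states to compute the exact mean and variance of the estimator; all names (`joint_prob`, `estimator_moments`, `g1`, `a`, `w`) are hypothetical.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def joint_prob(z1, z2, g1, a, w):
    """q(z1, z2) with z1 ~ Bern(sigmoid(g1)) and z2 | z1 ~ Bern(sigmoid(a + w*z1))."""
    q1 = sigmoid(g1)
    q2 = sigmoid(a + w * z1)
    p1 = q1 if z1 == 1 else 1.0 - q1
    p2 = q2 if z2 == 1 else 1.0 - q2
    return p1 * p2

def estimator_moments(J, g1, a, w, baseline):
    """Exact mean and variance of (J*z1*z2 - c) * d log q / d g1 over all states."""
    q1 = sigmoid(g1)
    mean, second = 0.0, 0.0
    for z1, z2 in itertools.product((0, 1), repeat=2):
        score = z1 - q1  # d/dg1 log q(z1); q(z2 | z1) does not depend on g1
        sample = (J * z1 * z2 - baseline) * score
        p = joint_prob(z1, z2, g1, a, w)
        mean += p * sample
        second += p * sample ** 2
    return mean, second - mean ** 2

J, g1, a, w = 1.7, 0.4, -0.2, 0.9   # hypothetical values
q1 = sigmoid(g1)
# Analytic gradient of E[J z1 z2] = J * q1 * sigmoid(a + w) with respect to g1.
analytic = J * q1 * (1.0 - q1) * sigmoid(a + w)
mean_plain, var_plain = estimator_moments(J, g1, a, w, baseline=0.0)
mean_base, var_base = estimator_moments(
    J, g1, a, w, baseline=J * q1 * sigmoid(a + w))
print(analytic, mean_plain, mean_base, var_plain, var_base)
```

  • Both means equal the analytic gradient, because the expected score is zero and the baseline therefore leaves the expectation unchanged; only the variance differs between the two estimators.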
  • Non-Factorial Approximating Posteriors Via Ancillary Variables
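  • Alternatively, or in addition, a factorial distribution over discrete random variables can be retained, and made conditional on a separate set of ancillary random variables α, in which case:
  • $$\frac{\partial}{\partial\phi}\left(\sum_z q(z \mid \alpha) \cdot \left(z^T \cdot J \cdot z\right)\right) = \frac{\partial}{\partial\phi}\left(q^T(z = 1 \mid \alpha) \cdot J \cdot q(z = 1 \mid \alpha)\right)$$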
  • so long as J is bipartite. The full gradient of the KL-divergence with respect to the parameters of the approximating posterior is then as follows:
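  • $$\frac{\partial}{\partial\phi} \mathrm{KL}(q \,\|\, p) = \mathbb{E}_\rho\left[\left(g - h - \left(J^T + J\right) \cdot q(z = 1)\right) \cdot \frac{\partial}{\partial\phi} q(z = 1)\right]$$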
  • Other than making the distributions conditioned on the ancillary random variables α of the approximating posterior, the KL-divergence between the approximating posterior and the true prior of the ancillary variables can be subtracted. The rest of the prior is unaltered, since the ancillary random variables α govern the approximating posterior, rather than the generative model.
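  • The prefactor g − h − (Jᵀ + J)·q(z=1) in the KL gradient can be checked numerically for a factorial q and a bipartite J. The sketch below is illustrative only: it assumes the energy E_p(z) = −(zᵀ·J·z + hᵀ·z), omits log Z_p (which does not depend on q), differentiates with respect to q(z=1) rather than ϕ, and uses hypothetical names (`kl_without_log_z`, `kl_grad_wrt_q`) and random parameter values.

```python
import numpy as np

def kl_without_log_z(q, J, h):
    """-H(q) + E_q[E_p] for factorial q, with E_p(z) = -(z^T J z + h^T z)."""
    neg_entropy = np.sum(q * np.log(q) + (1.0 - q) * np.log(1.0 - q))
    # For factorial q and bipartite (zero-diagonal) J, E[z^T J z] = q^T J q.
    cross_entropy = -(q @ J @ q + h @ q)
    return neg_entropy + cross_entropy

def kl_grad_wrt_q(q, J, h):
    """Closed form: g - h - (J^T + J) q, where g = logit(q)."""
    g = np.log(q) - np.log(1.0 - q)
    return g - h - (J.T + J) @ q

rng = np.random.default_rng(1)
d = 6
Jq = rng.normal(size=(d // 2, d // 2))   # upper-right quadrant of J
J = np.zeros((d, d))
J[: d // 2, d // 2:] = Jq
h = rng.normal(size=d)
q = rng.uniform(0.1, 0.9, size=d)

eps = 1e-6
basis = np.eye(d)
numeric = np.array([
    (kl_without_log_z(q + eps * basis[k], J, h)
     - kl_without_log_z(q - eps * basis[k], J, h)) / (2.0 * eps)
    for k in range(d)
])
analytic = kl_grad_wrt_q(q, J, h)
print(np.max(np.abs(numeric - analytic)))
```

  • Multiplying this prefactor by ∂q(z=1)/∂ϕ, for example through automatic differentiation of the encoder, would give the gradient with respect to the parameters of the approximating posterior.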
  • Implementation
  • The following can be parameterized:

  • $$q(z \mid x, \phi) = \prod_i q_i(z_i \mid x, \phi)$$
  • using a feedforward neural network g(x). Each layer i of the neural network g(x) consists of a linear transformation, parameterized by weight matrix Wi and bias vector bi, followed by a pointwise nonlinearity. While intermediate layers can consist of ReLU or soft-plus units, with nonlinearity denoted by τ, the logistic function σ can be used as the nonlinearity in the top layer of the encoder to ensure the requisite range [0,1]. Parameters for each qi(zi|x,ϕ) are shared across inputs x, and 0≤gi(x)≤1.
  • Similarly, p(x|ζ,θ) can be parameterized using another feedforward neural network ƒ(ζ), with complementary parameterization. If x is binary, pi(xi=1|ζ,θ)=σ(ƒi(ζ)) can again be used. If x is real, an additional neural network ƒ′(ζ) can be introduced to calculate the variance of each variable, taking an approach analogous to traditional variational auto-encoders by using $p_i(x_i \mid \zeta,\theta) = \mathcal{N}\left(f_i(\zeta), f'_i(\zeta)\right)$. The final nonlinearity of the network ƒ(ζ) should be linear, and the final nonlinearity of ƒ′(ζ) should be non-negative.
  • Algorithm 1 (shown below) illustrates an example implementation of training a network expressed as pseudocode. Algorithm 1 describes training a generic network with gradient descent. In other implementations, other methods could be used to train the network without loss of generality with respect to the approach.
  • Algorithm 1 establishes the input and output, initializes the model parameters, and then determines whether a stopping criterion has been met. In addition, Algorithm 1 defines the processing of each mini-batch or subset.
  • Algorithms 1 and 2 (shown below) comprise pseudocode for binary visible units. Since J is bipartite, Jq can be used to denote the upper-right quadrant of J, where the non-zero values reside. Gradient descent is one approach that can be used. In other implementations, gradient descent can be replaced by another technique, such as RMSprop, adagrad, or ADAM.
  • Algorithm 1: Train generic network with simple gradient descent
    def train ( )
    | Input : A data set X, where X [:, i] is the ith element, and a learning rate parameter ε
    | Output: Model parameters: {W, b, Jq, h}
    | Initialize model parameters with random values
    | while Stopping criteria is not met do
    | | foreach minibatch Xpos = getMinibatch (X, m) of the training dataset do
    | | | Draw a sample from the approx posterior Z?, Zpos, Xout ← posSamples (Xpos)
    | | | Draw a sample from the prior Zneg ← negSamples (Zneg?)
    | | | Estimate ∂ℒ/∂θ using calcGradients (Xpos, Z?, Zpos, Zneg, Xout)
    | | | Update parameters according to θt+1 ← θt + ε · ∂ℒ/∂θ
    | | end
    | end
    ? indicates data missing or illegible when filed
  • At first, this approach appears to be caught between two conflicting constraints when trying to apply the variational auto-encoder technique to discrete latent representations. On the one hand, a discrete latent representation does not allow use of the gradient of the decoder, since the reparametrized latent representation jumps discontinuously or remains constant as the parameters of the approximating posterior are changed. On the other hand, KL[q(z|x,ϕ)∥p(z|θ)] is only easy to evaluate by remaining in the original discrete space.
  • The presently disclosed systems and methods avoid these problems by symmetrically projecting the approximating posterior and the prior into a continuous space. The computational system evaluates the auto-encoder portion of the loss function in the continuous space, marginalizing out the original discrete latent representation. At the same time, the computational system evaluates the KL-divergence between the approximating posterior and the true prior in the original discrete space, and, owing to the symmetry of the projection into the continuous space, it does not contribute to this term.
  • Algorithm 2: Helper functions for discrete VAE
    Llast ← Lup + Ldown
    def getMinibatch (X, m)
    | k ← k + 1
    | Xpos ← X [:, k · m : (k + 1) · m]
    def posSamples (Xpos)
    | Z0 ← Xpos
    | for i ← 1 to Lup − 1 do
    | | Zi ← τ (Wi−1 · Zi−1 + bi−1)
    | end
    | Z? ← WLup−1 · ZLup−1 + bLup−1
    | Zpos ← σ (Z?)
    | ZLup ← G?−1 (ρ) where q′ (ζ = 1|x, ϕ) = Zpos and ρ ~ U (0,1)^(n×m)
    | for i ← Lup + 1 to Llast − 1 do
    | | Zi ← τ (Wi−1 · Zi−1 + bi−1)
    | end
    | Xout ← σ (WLlast−1 · ZLlast−1 + bLlast−1)
    def negSamples (Zpos)
    | if using D-Wave then
    | | sample Zneg from D-Wave using h and Jq
    | | post-process samples
    | else
    | | if using CD then
    | | | Zneg ← sample (Zpos)
    | | else if using PCD then
    | | | Zneg initialized to result of last call to negSamples ( )
    | | end
    | | for i ← 1 to n do
    | | | sample “left” half from p (Zneg [:d/2, :] = 1) = σ (Jq · Zneg [d/2:, :] + h [:d/2])
    | | | sample “right” half from p (Zneg [d/2:, :] = 1) = σ (Jq^T · Zneg [:d/2, :] + h [d/2:])
    | | end
    | end
    def calcGradients (Xpos, Z?, Zpos, Zneg, Xout)
    | BLlast ← σ′ (WLlast−1 · ZLlast−1 + bLlast−1) · (Xpos / Xout − (1 − Xpos) / (1 − Xout))
    | for i ← Llast − 1 to Lup do
    | | ∂ℒ/∂Wi ← Bi+1 · Zi^T
    | | ∂ℒ/∂bi ← Bi+1 · 1
    | | Bi ← τ′ (Wi−1 · Zi−1 + bi−1) · Wi^T · Bi+1
    | end
    | Bpos ← ?q · W?^T · BLup+1
    | BKL ← (Z? − h − vstack (Jq · Zpos [d/2:, :], Jq^T · Zpos [:d/2, :])) ⊙ (Zpos − Zpos²)
    | BLup ← σ′ (WLup−1 · ZLup−1 + bLup−1) · Bpos − BKL
    | for i ← Lup − 1 to 0 do
    | | ∂ℒ/∂Wi ← Bi+1 · Zi^T
    | | ∂ℒ/∂bi ← Bi+1 · 1
    | | Bi ← τ′ (Wi−1 · Zi−1 + bi−1) · Wi^T · Bi+1
    | end
    | ∂ℒ/∂Jq ← Zpos [:d/2, :] · Zpos [d/2:, :]^T − Zneg [:d/2, :] · Zneg [d/2:, :]^T
    | ∂ℒ/∂h ← Zpos · 1 − Zneg · 1
    ? indicates data missing or illegible when filed
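  • The classical (non-quantum) branch of negSamples in Algorithm 2 alternates between sampling the "left" and "right" halves of a bipartite Boltzmann machine. A minimal NumPy sketch of that block Gibbs loop follows. It assumes units are stored as rows and minibatch elements as columns, as in the pseudocode; the function name `neg_samples`, the argument `n_sweeps`, and the random parameter values are hypothetical, and the D-Wave sampling branch is not covered.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neg_samples(z_init, Jq, h, n_sweeps, rng):
    """Block Gibbs sampling for a bipartite Boltzmann machine.

    z_init: (d, m) array of 0/1 states (e.g. Zpos for CD, or the previous
            chain state for PCD); Jq: (d/2, d/2) couplings; h: (d,) biases.
    """
    z = np.array(z_init, dtype=np.float64)
    half = z.shape[0] // 2
    for _ in range(n_sweeps):
        # "Left" half conditioned on the current "right" half.
        p_left = sigmoid(Jq @ z[half:, :] + h[:half, None])
        z[:half, :] = rng.uniform(size=p_left.shape) < p_left
        # "Right" half conditioned on the freshly sampled "left" half.
        p_right = sigmoid(Jq.T @ z[:half, :] + h[half:, None])
        z[half:, :] = rng.uniform(size=p_right.shape) < p_right
    return z

rng = np.random.default_rng(0)
d, m = 8, 5
Jq = 0.1 * rng.normal(size=(d // 2, d // 2))
h = np.zeros(d)
z0 = (rng.uniform(size=(d, m)) < 0.5).astype(np.float64)
z_neg = neg_samples(z0, Jq, h, n_sweeps=10, rng=rng)
print(z_neg)
```

  • Passing Zpos as `z_init` corresponds to the CD branch, while passing the result of the previous call corresponds to the PCD branch.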
  • The above description of illustrated embodiments, including what is described in the Abstract, is not intended to be exhaustive or to limit the embodiments to the precise forms disclosed. Although specific embodiments of and examples are described herein for illustrative purposes, various equivalent modifications can be made without departing from the spirit and scope of the disclosure, as will be recognized by those skilled in the relevant art. The teachings provided herein of the various embodiments can be applied to other methods of quantum computation, not necessarily the exemplary methods for quantum computation generally described above.
  • The various embodiments described above can be combined to provide further embodiments. All of the U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification and/or listed in the Application Data Sheet including: U.S. patent application publication 2015/0006443 published Jan. 1, 2015; U.S. patent application publication 2015/0161524 published Jun. 11, 2015; U.S. provisional patent application Ser. No. 62/207,057, filed Aug. 19, 2015, entitled “SYSTEMS AND METHODS FOR MACHINE LEARNING USING ADIABATIC QUANTUM COMPUTERS”; U.S. provisional patent application Ser. No. 62/206,974, filed Aug. 19, 2015, entitled “DISCRETE VARIATIONAL AUTO-ENCODER SYSTEMS AND METHODS FOR MACHINE LEARNING USING ADIABATIC QUANTUM COMPUTERS”; U.S. provisional patent application Ser. No. 62/268,321, filed Dec. 16, 2015, entitled “DISCRETE VARIATIONAL AUTO-ENCODER SYSTEMS AND METHODS FOR MACHINE LEARNING USING ADIABATIC QUANTUM COMPUTERS”; and U.S. provisional patent application Ser. No. 62/307,929, filed Mar. 14, 2016, entitled “DISCRETE VARIATIONAL AUTO-ENCODER SYSTEMS AND METHODS FOR MACHINE LEARNING USING ADIABATIC QUANTUM COMPUTERS”, each of which is incorporated herein by reference in its entirety. Aspects of the embodiments can be modified, if necessary, to employ systems, circuits, and concepts of the various patents, applications, and publications to provide yet further embodiments.

Claims (18)

1-38. (canceled)
39. A method of unsupervised learning by a computational system, the method executed by circuitry including at least one processor and comprising:
determining by the circuitry a first approximating posterior distribution over at least one group of a set of discrete random variables;
sampling by the circuitry from at least one group of a set of supplementary continuous random variables using the first approximating posterior distribution over the at least one group of the set of discrete random variables to generate one or more samples, wherein a transforming distribution comprises a conditional distribution over the set of supplementary continuous random variables, conditioned on the at least one group of a set of discrete random variables;
determining by the circuitry a second approximating posterior distribution and a first prior distribution, the first prior distribution over at least one layer of a set of continuous variables;
sampling by the circuitry from the second approximating posterior distribution;
determining by the circuitry an auto-encoding loss on an input space comprising discrete or continuous variables, the auto-encoding loss conditioned on the one or more samples;
determining by the circuitry a first KL-divergence, or at least an approximation thereof, between the second approximating posterior distribution and the first prior distribution;
determining by the circuitry a second KL-divergence, or at least an approximation thereof, between the first approximating posterior distribution and a second prior distribution, the second prior distribution over the set of discrete random variables; and
backpropagating by the circuitry a sum of the first and the second KL-divergence and the auto-encoding loss on the input space conditioned on the one or more samples.
40. The method of claim 39 wherein the auto-encoding loss is a log-likelihood.
41. A method of unsupervised learning by a computational system, the method executed by circuitry including at least one processor and comprising:
determining by the circuitry a first approximating posterior distribution over a first group of discrete random variables conditioned on an input space comprising discrete or continuous variables;
sampling by the circuitry from a first group of supplementary continuous random variables based on the first approximating posterior distribution;
determining by the circuitry a second approximating posterior distribution over a second group of discrete random variables conditioned on the input space and samples from the first group of supplementary continuous random variables;
sampling by the circuitry from a second group of supplementary continuous random variables based on the second approximating posterior distribution;
determining by the circuitry a third approximating posterior distribution and a first prior distribution over a first layer of additional continuous random variables, the third approximating posterior distribution conditioned on the input space, samples from at least one of the first and the second group of supplementary continuous random variables, and the first prior distribution conditioned on samples from at least one of the first and the second group of supplementary continuous random variables;
sampling by the circuitry from the first layer of additional continuous random variables based on the third approximating posterior distribution;
determining by the circuitry a fourth approximating posterior distribution and a second prior distribution over a second layer of additional continuous random variables, the fourth approximating posterior distribution conditioned on the input space, samples from at least one of the first and the second group of supplementary continuous random variables, samples from the first layer of additional continuous random variables, and the second prior distribution conditioned on at least one of samples from at least one of the first and the second group of supplementary continuous random variables, and samples from the first layer of additional continuous random variables;
determining by the circuitry a first gradient of a KL-divergence, or at least a stochastic approximation thereof, between the third approximating posterior distribution and the first prior distribution with respect to the third approximating posterior distribution and the first prior distribution;
determining by the circuitry a second gradient of a KL-divergence, or at least a stochastic approximation thereof, between the fourth approximating posterior distribution and the second prior distribution with respect to the fourth approximating posterior distribution and the second prior distribution;
determining by the circuitry a third gradient of a KL-divergence, or at least a stochastic approximation thereof, between an approximating posterior distribution over a third group of discrete random variables and a third prior distribution with respect to the approximating posterior distribution over the third group of discrete random variables and the third prior distribution, wherein the approximating posterior distribution over the third group of discrete random variables is a combination of the first approximating posterior distribution over the first group of discrete random variables, and the second approximating posterior distribution over the second group of discrete random variables; and
backpropagating by the circuitry the first, the second and the third gradients of the KL-divergence to the input space.
42. The method of claim 41 wherein determining by the circuitry a third gradient of a KL-divergence, or at least a stochastic approximation thereof, between an approximating posterior distribution over the third group of discrete random variables and a third prior distribution with respect to the approximating posterior distribution over the third group of discrete random variables and the third prior distribution comprises determining by the circuitry a third gradient of a KL-divergence, or at least a stochastic approximation thereof, between an approximating posterior distribution over the third group of discrete random variables and a third prior distribution with respect to the approximating posterior distribution over the third group of discrete random variables and the third prior distribution, the third prior distribution is comprising a restricted Boltzmann machine.
43. The method of claim 39 wherein determining by the circuitry a first KL-divergence comprises computing by the circuitry a loss function analytically.
44. The method of claim 39 wherein determining by the circuitry a first KL-divergence comprises estimating by the circuitry a loss function stochastically.
45. The method of claim 39 wherein determining by the circuitry a second KL-divergence comprises computing by the circuitry a loss function analytically.
46. The method of claim 39 wherein determining by the circuitry a second KL-divergence comprises estimating by the circuitry a loss function stochastically.
47. The method of claim 39 wherein determining by the circuitry a second approximating posterior distribution and a first prior distribution, the first prior distribution over at least one layer of a set of continuous variables comprises determining by the circuitry a second approximating posterior distribution and a first prior distribution, the first prior distribution comprising a restricted Boltzmann machine.
48. The method of claim 39 wherein determining by the circuitry a second KL-divergence, or at least an approximation thereof, between the first approximating posterior distribution and a second prior distribution, the second prior distribution over the second group of discrete random variables comprises determining by the circuitry a second KL-divergence, or at least an approximation thereof, between the first approximating posterior distribution and a second prior distribution, the second prior comprising a restricted Boltzmann machine.
49. The method of claim 39 wherein sampling by the circuitry from the second approximating posterior distribution includes at least one of generating samples by the circuitry or causing samples to be generated by a digital processor.
50. The method of claim 39 wherein sampling by the circuitry from the second approximating posterior distribution includes at least one of generating samples by the circuitry or causing samples to be generated by a quantum processor.
51. The method of claim 41 wherein sampling by the circuitry from a first group of supplementary continuous variables based on the first approximating posterior distribution includes at least one of generating samples by the circuitry or causing samples to be generated by one of a digital processor and a quantum processor.
52. The method of claim 41 wherein sampling by the circuitry from a second group of supplementary continuous variables based on the second approximating posterior distribution includes at least one of generating samples by the circuitry or causing samples to be generated by one of a digital processor and a quantum processor.
53. The method of claim 41 wherein sampling by the circuitry from the first layer of additional continuous random variables based on third first approximating posterior distribution includes at least one of generating samples by the circuitry or causing samples to be generated by one of a digital processor and a quantum processor.
54. The method of claim 41 wherein determining by the circuitry a third approximating posterior distribution and a first prior distribution over a first layer of additional continuous random variables comprises determining by the circuitry a third approximating posterior distribution and a first prior distribution over a first layer of additional continuous random variables, the first prior distribution comprising a restricted Boltzmann machine.
55. The method of claim 41 wherein determining by the circuitry a fourth approximating posterior distribution and a second prior distribution over a second layer of additional continuous random variables comprises determining by the circuitry a fourth approximating posterior distribution and a second prior distribution over a second layer of additional continuous random variables, the second prior comprising a restricted Boltzmann machine.
US17/481,568 2015-08-19 2021-09-22 Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers Pending US20220076131A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/481,568 US20220076131A1 (en) 2015-08-19 2021-09-22 Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201562206974P 2015-08-19 2015-08-19
US201562268321P 2015-12-16 2015-12-16
US201662307929P 2016-03-14 2016-03-14
PCT/US2016/047627 WO2017031356A1 (en) 2015-08-19 2016-08-18 Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers
US201815753666A 2018-02-20 2018-02-20
US17/481,568 US20220076131A1 (en) 2015-08-19 2021-09-22 Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers

Related Parent Applications (2)

Application Number Title Priority Date Filing Date
US15/753,666 Continuation US11157817B2 (en) 2015-08-19 2016-08-18 Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers
PCT/US2016/047627 Continuation WO2017031356A1 (en) 2015-08-19 2016-08-18 Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers

Publications (1)

Publication Number Publication Date
US20220076131A1 true US20220076131A1 (en) 2022-03-10

Family

ID=58050832

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/753,666 Active 2038-10-29 US11157817B2 (en) 2015-08-19 2016-08-18 Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers
US17/481,568 Pending US20220076131A1 (en) 2015-08-19 2021-09-22 Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US15/753,666 Active 2038-10-29 US11157817B2 (en) 2015-08-19 2016-08-18 Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers

Country Status (4)

Country Link
US (2) US11157817B2 (en)
EP (1) EP3338221A4 (en)
CN (1) CN108140146B (en)
WO (1) WO2017031356A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210089968A1 (en) * 2017-02-06 2021-03-25 Deepmind Technologies Limited Memory augmented generative temporal models
US11537881B2 (en) * 2019-10-21 2022-12-27 The Boeing Company Machine learning model development
WO2023204836A1 (en) * 2022-04-19 2023-10-26 Tencent America LLC Variational graph autoencoding for abstract meaning representation coreference resolution

Families Citing this family (75)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292326A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 The training method and device of a kind of model
US10373055B1 (en) * 2016-05-20 2019-08-06 Deepmind Technologies Limited Training variational autoencoders to generate disentangled latent factors
KR102593690B1 (en) 2016-09-26 2023-10-26 디-웨이브 시스템즈, 인코포레이티드 Systems, methods and apparatus for sampling from a sampling server
US11042811B2 (en) 2016-10-05 2021-06-22 D-Wave Systems Inc. Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers
CN108021549B (en) * 2016-11-04 2019-08-13 华为技术有限公司 Sequence conversion method and device
US11531852B2 (en) * 2016-11-28 2022-12-20 D-Wave Systems Inc. Machine learning systems and methods for training with noisy labels
US10621586B2 (en) * 2017-01-31 2020-04-14 Paypal, Inc. Fraud prediction based on partial usage data
CN117709426A (en) * 2017-02-24 2024-03-15 渊慧科技有限公司 Method, system and computer storage medium for training machine learning model
US10249289B2 (en) 2017-03-14 2019-04-02 Google Llc Text-to-speech synthesis using an autoencoder
EP3612981B1 (en) * 2017-04-19 2024-05-29 Siemens Healthineers AG Target detection in latent space
US11948075B2 (en) * 2017-06-09 2024-04-02 Deepmind Technologies Limited Generating discrete latent representations of input data items
CN109685087B9 (en) * 2017-10-18 2023-02-03 富士通株式会社 Information processing method and device and information detection method
US10360285B2 (en) * 2017-11-02 2019-07-23 Fujitsu Limited Computing using unknown values
US10977546B2 (en) * 2017-11-29 2021-04-13 International Business Machines Corporation Short depth circuits as quantum classifiers
US11586915B2 (en) 2017-12-14 2023-02-21 D-Wave Systems Inc. Systems and methods for collaborative filtering with variational autoencoders
US11577145B2 (en) 2018-01-21 2023-02-14 Stats Llc Method and system for interactive, interpretable, and improved match and player performance predictions in team sports
US11645546B2 (en) 2018-01-21 2023-05-09 Stats Llc System and method for predicting fine-grained adversarial multi-agent motion
US20200401916A1 (en) * 2018-02-09 2020-12-24 D-Wave Systems Inc. Systems and methods for training generative machine learning models
US20210034969A1 (en) * 2018-03-09 2021-02-04 Deepmind Technologies Limited Training an unsupervised memory-based prediction system to learn compressed representations of an environment
RU2716322C2 (en) 2018-03-23 2020-03-11 Общество с ограниченной ответственностью "Аби Продакшн" Reproducing augmentation of image data
US11551127B1 (en) 2018-05-09 2023-01-10 Rigetti & Co, Llc Using a quantum processor unit to preprocess data
GB201810636D0 (en) 2018-06-28 2018-08-15 Microsoft Technology Licensing Llc Dynamic characterisation of synthetic genetic circuits in living cells
TW202007091A (en) 2018-07-02 2020-02-01 美商札帕塔運算股份有限公司 Compressed unsupervised quantum state preparation with quantum autoencoders
US11386346B2 (en) 2018-07-10 2022-07-12 D-Wave Systems Inc. Systems and methods for quantum bayesian networks
WO2020033807A1 (en) 2018-08-09 2020-02-13 Rigetti & Co, Inc. Quantum streaming kernel
KR20200023664A (en) * 2018-08-14 2020-03-06 삼성전자주식회사 Response inference method and apparatus
GB2576500A (en) * 2018-08-15 2020-02-26 Imperial College Sci Tech & Medicine Joint source channel coding based on channel capacity using neural networks
US11663513B2 (en) 2018-08-17 2023-05-30 Zapata Computing, Inc. Quantum computer with exact compression of quantum states
US11372651B2 (en) 2018-09-10 2022-06-28 International Business Machines Corporation Bootstrapping a variational algorithm for quantum computing
US11593660B2 (en) * 2018-09-18 2023-02-28 Insilico Medicine Ip Limited Subset conditioning using variational autoencoder with a learnable tensor train induced prior
WO2020064990A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Committed information rate variational autoencoders
US11636370B2 (en) 2018-10-12 2023-04-25 Zapata Computing, Inc. Quantum computer with improved continuous quantum generator
JP2022511331A (en) 2018-10-24 2022-01-31 ザパタ コンピューティング,インコーポレイテッド Hybrid quantum classical computer system for implementing and optimizing quantum Boltzmann machines
CN109543838B (en) * 2018-11-01 2021-06-18 浙江工业大学 Image increment learning method based on variational self-encoder
US11461644B2 (en) 2018-11-15 2022-10-04 D-Wave Systems Inc. Systems and methods for semantic segmentation
US11468357B2 (en) 2018-11-21 2022-10-11 Zapata Computing, Inc. Hybrid quantum-classical computer for packing bits into qubits for quantum optimization algorithms
JP7108186B2 (en) * 2018-11-27 2022-07-28 富士通株式会社 Optimization device and control method for optimization device
US11468293B2 (en) 2018-12-14 2022-10-11 D-Wave Systems Inc. Simulating and post-processing using a generative adversarial network
CN109886388B (en) * 2019-01-09 2024-03-22 平安科技(深圳)有限公司 Training sample data expansion method and device based on variation self-encoder
GB201900742D0 (en) * 2019-01-18 2019-03-06 Microsoft Technology Licensing Llc Modelling ordinary differential equations using a variational auto encoder
CN111464154B (en) * 2019-01-22 2022-04-22 华为技术有限公司 Control pulse calculation method and device
US10740571B1 (en) * 2019-01-23 2020-08-11 Google Llc Generating neural network outputs using insertion operations
US11900264B2 (en) 2019-02-08 2024-02-13 D-Wave Systems Inc. Systems and methods for hybrid quantum-classical computing
US11625612B2 (en) 2019-02-12 2023-04-11 D-Wave Systems Inc. Systems and methods for domain adaptation
EP3931803A1 (en) * 2019-02-27 2022-01-05 3Shape A/S Method for generating objects using an hourglass predictor
EP3912090A4 (en) 2019-03-01 2022-11-09 Stats Llc Personalizing prediction of performance using data and body-pose for analysis of sporting performance
EP3716150A1 (en) * 2019-03-27 2020-09-30 Nvidia Corporation Improved image segmentation using a neural network translation model
US11922301B2 (en) * 2019-04-05 2024-03-05 Samsung Display Co., Ltd. System and method for data augmentation for trace dataset
WO2020210536A1 (en) 2019-04-10 2020-10-15 D-Wave Systems Inc. Systems and methods for improving the performance of non-stoquastic quantum devices
US11554292B2 (en) 2019-05-08 2023-01-17 Stats Llc System and method for content and style predictions in sports
US11443137B2 (en) 2019-07-31 2022-09-13 Rohde & Schwarz Gmbh & Co. Kg Method and apparatus for detecting signal features
US11594006B2 (en) * 2019-08-27 2023-02-28 Nvidia Corporation Self-supervised hierarchical motion learning for video action recognition
US11494695B2 (en) 2019-09-27 2022-11-08 Google Llc Training neural networks to generate structured embeddings
CN111127346B (en) * 2019-12-08 2023-09-05 复旦大学 Multi-level image restoration method based on part-to-whole attention mechanism
US11657312B2 (en) * 2020-01-31 2023-05-23 International Business Machines Corporation Short-depth active learning quantum amplitude estimation without eigenstate collapse
CN111174905B (en) * 2020-02-13 2023-10-31 欧朗电子科技有限公司 Low-power consumption device and method for detecting vibration abnormality of Internet of things
CA3167402A1 (en) 2020-02-13 2021-08-19 Yudong CAO Hybrid quantum-classical adversarial generator
WO2021243107A1 (en) * 2020-05-27 2021-12-02 The Regents Of The University Of California Methods and systems for rapid antimicrobial susceptibility tests
US11935298B2 (en) 2020-06-05 2024-03-19 Stats Llc System and method for predicting formation in sports
US20230275686A1 (en) * 2020-07-13 2023-08-31 Lg Electronics Inc. Method and apparatus for performing channel coding by user equipment and base station in wireless communication system
KR20220019560A (en) 2020-08-10 2022-02-17 삼성전자주식회사 Apparatus and method for monitoring network
US20220101121A1 (en) * 2020-09-25 2022-03-31 Nvidia Corporation Latent-variable generative model with a noise contrastive prior
CN116324668A (en) 2020-10-01 2023-06-23 斯塔特斯公司 Predicting NBA talent and quality from non-professional tracking data
WO2022077345A1 (en) * 2020-10-15 2022-04-21 Robert Bosch Gmbh Method and apparatus for neural network based on energy-based latent variable models
CN112633511B (en) * 2020-12-24 2021-11-30 北京百度网讯科技有限公司 Method for calculating a quantum partition function, related apparatus and program product
US11966707B2 (en) 2021-01-13 2024-04-23 Zapata Computing, Inc. Quantum enhanced word embedding for natural language processing
CN117222455A (en) 2021-04-27 2023-12-12 斯塔特斯公司 System and method for single athlete and team simulation
US11347997B1 (en) * 2021-06-08 2022-05-31 The Florida International University Board Of Trustees Systems and methods using angle-based stochastic gradient descent
US11983720B2 (en) 2021-10-21 2024-05-14 International Business Machines Corporation Mixed quantum-classical method for fraud detection with quantum feature selection
US11640163B1 (en) * 2021-11-30 2023-05-02 International Business Machines Corporation Event time characterization and prediction in multivariate event sequence domains to support improved process reliability
US20230195056A1 (en) * 2021-12-16 2023-06-22 Paypal, Inc. Automatic Control Group Generation
US11809839B2 (en) 2022-01-18 2023-11-07 Robert Lyden Computer language and code for application development and electronic and optical communication
US20240094997A1 (en) * 2022-06-02 2024-03-21 ColdQuanta, Inc. Compiling quantum computing program specifications based on quantum operations
CN115577776B (en) * 2022-09-28 2024-07-23 北京百度网讯科技有限公司 Method, device, equipment and storage medium for determining ground state energy
CN117236198B (en) * 2023-11-14 2024-02-27 中国石油大学(华东) Machine learning solving method of flame propagation model of blasting under sparse barrier

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6671661B1 (en) 1999-05-19 2003-12-30 Microsoft Corporation Bayesian principal component analysis
US7636651B2 (en) 2003-11-28 2009-12-22 Microsoft Corporation Robust Bayesian mixture modeling
US7135701B2 (en) 2004-03-29 2006-11-14 D-Wave Systems Inc. Adiabatic quantum computation with superconducting qubits
US20060115145A1 (en) 2004-11-30 2006-06-01 Microsoft Corporation Bayesian conditional random fields
US7533068B2 (en) 2004-12-23 2009-05-12 D-Wave Systems, Inc. Analog processor comprising quantum devices
CN100585629C (en) * 2004-12-23 2010-01-27 D-波系统公司 Analog processor comprising quantum devices
WO2008083498A1 (en) 2007-01-12 2008-07-17 D-Wave Systems, Inc. Systems, devices and methods for interconnected processor topology
US8190548B2 (en) 2007-11-08 2012-05-29 D-Wave Systems Inc. Systems, devices, and methods for analog processing
WO2009120638A2 (en) 2008-03-24 2009-10-01 D-Wave Systems Inc. Systems, devices, and methods for analog processing
WO2009152180A2 (en) 2008-06-10 2009-12-17 D-Wave Systems Inc. Parameter learning system for solvers
US8095345B2 (en) 2009-01-20 2012-01-10 Chevron U.S.A. Inc Stochastic inversion of geophysical data for estimating earth model parameters
US8239336B2 (en) 2009-03-09 2012-08-07 Microsoft Corporation Data processing using restricted boltzmann machines
US10223632B2 (en) 2009-07-27 2019-03-05 International Business Machines Corporation Modeling states of an entity
US8589319B2 (en) * 2010-12-02 2013-11-19 At&T Intellectual Property I, L.P. Adaptive pairwise preferences in recommenders
CA2840958C (en) 2011-07-06 2018-03-27 D-Wave Systems Inc. Quantum processor based systems and methods that minimize an objective function
WO2014121147A1 (en) * 2013-01-31 2014-08-07 Betazi, Llc Production analysis and/or forecasting methods, apparatus, and systems
US9727824B2 (en) 2013-06-28 2017-08-08 D-Wave Systems Inc. Systems and methods for quantum processing of data
GB2511370B (en) * 2013-08-29 2015-07-08 Imagination Tech Ltd Low complexity soft output MIMO decoder
US20150161524A1 (en) 2013-12-05 2015-06-11 D-Wave Systems Inc. Sampling from a set spins with clamping

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Brochu, et al., A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning, arXiv:1012.2599v1 [cs.LG], 12 Dec 2010, pp. 1-49 (Year: 2010) *
Guo, et al., Variational Autoencoder With Optimizing Gaussian Mixture Model Priors, IEEE Access, March 2020, pp. 43992-44005 (Year: 2020) *
Hernandez-Lobato, et al., Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks, School of Engineering and Applied Sciences, Harvard University, 06 Jul 2015, pp. 1-9 (Year: 2015) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210089968A1 (en) * 2017-02-06 2021-03-25 Deepmind Technologies Limited Memory augmented generative temporal models
US11977967B2 (en) * 2017-02-06 2024-05-07 Deepmind Technologies Limited Memory augmented generative temporal models
US11537881B2 (en) * 2019-10-21 2022-12-27 The Boeing Company Machine learning model development
WO2023204836A1 (en) * 2022-04-19 2023-10-26 Tencent America LLC Variational graph autoencoding for abstract meaning representation coreference resolution

Also Published As

Publication number Publication date
CN108140146A (en) 2018-06-08
CN108140146B (en) 2022-04-08
US20180247200A1 (en) 2018-08-30
EP3338221A1 (en) 2018-06-27
EP3338221A4 (en) 2019-05-01
US11157817B2 (en) 2021-10-26
WO2017031356A1 (en) 2017-02-23

Similar Documents

Publication Publication Date Title
US20220076131A1 (en) Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers
US20210365826A1 (en) Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers
US11410067B2 (en) Systems and methods for machine learning using adiabatic quantum computers
US11481669B2 (en) Systems, methods and apparatus for sampling from a sampling server
Zhang et al. Sequential three-way decision based on multi-granular autoencoder features
US20210256392A1 (en) Automating the design of neural networks for anomaly detection
US20200401916A1 (en) Systems and methods for training generative machine learning models
Georgiopoulos et al. Learning in the feed-forward random neural network: A critical review
US20230044102A1 (en) Ensemble machine learning models incorporating a model trust factor
US20210089867A1 (en) Dual recurrent neural network architecture for modeling long-term dependencies in sequential data
Zhao et al. Stein variational gradient descent with learned direction
Rafati et al. Trust-region minimization algorithm for training responses (TRMinATR): The rise of machine learning techniques
Pooladzandi Fast Training of Generalizable Deep Neural Networks
Liu Sparse Representation Neural Networks for Online Reinforcement Learning
Kong et al. DF2: Distribution-Free Decision-Focused Learning
Probst Generative adversarial networks in estimation of distribution algorithms for combinatorial optimization
Shao et al. Nonparametric Automatic Differentiation Variational Inference with Spline Approximation
US20220189154A1 (en) Connection weight learning for guided architecture evolution
Salmenperä et al. Software techniques for training restricted Boltzmann machines on size-constrained quantum annealing hardware
US20240355109A1 (en) Connection weight learning for guided architecture evolution
Sharma Gradient-based Adversarial Attacks to Deep Neural Networks in Limited Access Settings
Flugsrud Solving Quantum Mechanical Problems with Machine Learning
Pereira et al. Wasserstein generative adversarial networks for topology optimization
Li Noise Injection and Noise Augmentation for Model Regularization, Differential Privacy and Statistical Learning
Sun Sparse Deep Learning and Stochastic Neural Network

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment
Owner name: PSPIB UNITAS INVESTMENTS II INC., CANADA
Free format text: SECURITY INTEREST;ASSIGNOR:D-WAVE SYSTEMS INC.;REEL/FRAME:059317/0871
Effective date: 20220303

AS Assignment
Owner name: D-WAVE SYSTEMS INC., CANADA
Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:PSPIB UNITAS INVESTMENTS II INC., IN ITS CAPACITY AS COLLATERAL AGENT;REEL/FRAME:061493/0694
Effective date: 20220915

AS Assignment
Owner name: PSPIB UNITAS INVESTMENTS II INC., AS COLLATERAL AGENT, CANADA
Free format text: INTELLECTUAL PROPERTY SECURITY AGREEMENT;ASSIGNORS:D-WAVE SYSTEMS INC.;1372934 B.C. LTD.;REEL/FRAME:063340/0888
Effective date: 20230413

STPP Information on status: patent application and granting procedure in general
Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general
Free format text: FINAL REJECTION MAILED