WO2023217027A1 - Policy optimization method and apparatus using environment model based on memristor array - Google Patents

Policy optimization method and apparatus using environment model based on memristor array

Info

Publication number
WO2023217027A1
Authority
WO
WIPO (PCT)
Prior art keywords
time
strategy
environment model
policy
cost
Prior art date
Application number
PCT/CN2023/092475
Other languages
French (fr)
Chinese (zh)
Inventor
高滨
林钰登
唐建石
吴华强
张清天
钱鹤
Original Assignee
清华大学
Application filed by 清华大学

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation of neural networks using electronic means
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06N 7/00: Computing arrangements based on specific mathematical models
    • G06N 7/01: Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • Embodiments of the present disclosure relate to a strategy optimization method and a strategy optimization device utilizing a dynamic environment model based on a memristor array.
  • Artificial neural networks (ANNs) are widely used in the modeling of dynamic systems. However, long-term task planning with traditional artificial neural networks remains a challenge because of their lack of ability to model uncertainty.
  • The randomness (uncertainty) inherent in real systems, process noise, and the approximation errors introduced by data-driven modeling can cause the long-term estimates of an artificial neural network to deviate from the actual behavior of the system.
  • Probabilistic models provide a way to address uncertainty. These models enable people to make informed decisions using the model's predictions, while being cautious about the uncertainty of those predictions.
  • At least one embodiment of the present disclosure provides a strategy optimization method using a dynamic environment model based on a memristor array, including: obtaining a dynamic environment model based on a memristor array; performing multiple predictions at multiple times according to the dynamic environment model and an object policy to obtain a data sample set including optimization costs of the object policy corresponding to the multiple times; and, based on the data sample set, performing a policy search using a policy gradient optimization algorithm to optimize the object policy.
  • Obtaining the dynamic environment model includes: obtaining a Bayesian neural network that has a weight matrix obtained by training; obtaining corresponding multiple target conductance values according to the weight matrix of the Bayesian neural network, and mapping the multiple target conductance values to the memristor array; inputting the state and hidden input variables corresponding to time t of the dynamic system as input signals to the weight-mapped memristor array; processing the state and hidden input variables at time t according to the Bayesian neural network through the memristor array; and obtaining from the memristor array an output signal corresponding to the processing result, the output signal being used to obtain the prediction result of the dynamic system at time t+1.
  • In the dynamic environment model s_{t+1} = f(s_t, a_t; W, ε), the action of the object policy at time t is a_t = π(s_t; W_π), where π denotes the function of the object policy and W_π denotes the policy parameters; the weight matrix W of the Bayesian neural network follows the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise, ε ~ N(0, σ²).
  • The multiple times include time 1 to time T, arranged in order from earliest to latest.
  • The expected value of the cost c_t at time t is E[c_t]; the optimization cost at time t is then obtained from these expected values.
  • The cost also includes cost changes caused by accidental (aleatoric) uncertainty and cost changes caused by cognitive (epistemic) uncertainty. Accidental uncertainty is caused by the hidden input variables, and cognitive uncertainty is caused by the intrinsic noise of the memristor array.
  • The optimization cost at time t is obtained using a function σ(η, θ) of the accidental uncertainty and the epistemic uncertainty, where η denotes the accidental uncertainty and θ denotes the epistemic uncertainty.
  • Calculating the state s_t at time t according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining the cost c_t of the state s_t at time t includes: sampling the hidden input variable z from the p(z) distribution to obtain a sample; inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array to obtain the predicted state s_t; and, for the predicted state s_t, obtaining the cost c_t = c(s_t).
  • The policy gradient optimization algorithm includes the REINFORCE algorithm, the PPO algorithm, or the TRPO algorithm.
  • At least one embodiment of the present disclosure also provides a strategy optimization device utilizing a dynamic environment model based on a memristor array, including: an acquisition unit configured to acquire the dynamic environment model based on the memristor array; a computing unit configured to perform multiple predictions at multiple times according to the dynamic environment model and the object policy to obtain a data sample set including the optimization costs of the object policy corresponding to the multiple times; and a policy search unit configured to perform a policy search using a policy gradient optimization algorithm based on the data sample set to optimize the object policy.
  • FIG. 1A shows a schematic flow chart of a strategy optimization method based on a dynamic environment model of a memristor array provided by at least one embodiment of the present disclosure
  • Figure 1B shows a schematic flow chart of step S101 in Figure 1A;
  • Figure 2A shows a schematic structure of a memristor array
  • Figure 2B is a schematic diagram of a memristor device
  • Figure 2C is a schematic diagram of another memristor device
  • Figure 2D shows a schematic diagram of mapping the weight matrix of a Bayesian neural network to a memristor array
  • FIG. 3 shows a schematic flow chart of step S102 in Figure 1A
  • Figure 4 shows a schematic diagram of an example of a strategy optimization method provided by at least one embodiment of the present disclosure
  • FIG. 5 shows a schematic block diagram of a strategy optimization device using a dynamic environment model based on a memristor array provided by at least one embodiment of the present disclosure.
  • In model-free deep reinforcement learning, the agent usually needs to perform a large number of trial-and-error interactions with the real environment; the data efficiency is low, so such methods cannot be applied to real tasks in which the cost of trial and error is relatively high.
  • Model-based deep reinforcement learning can utilize data more efficiently.
  • In model-based deep reinforcement learning, the agent first learns a dynamic environment model from the historical experience of interacting with the real dynamic environment (such as state transition data collected in advance), and then interacts with the dynamic environment model to obtain a sub-optimal policy.
  • When the model-based reinforcement learning method learns an accurate dynamic environment model, that model is used when training the agent, so the agent does not need to interact with the real environment many times; the agent can instead "imagine" what interacting with the real environment would feel like. This greatly improves data efficiency and suits actual physical scenarios where the cost of obtaining data is high. At the same time, the dynamic environment model can predict unknown states of the environment and generalize the agent's cognition; it can also serve as a new data source that provides contextual information to help decision-making, which can alleviate the exploration-exploitation dilemma.
  • A Bayesian neural network (BNN) is a probabilistic model that places a neural network in a Bayesian framework and can describe complex stochastic patterns; moreover, a Bayesian neural network with latent input variables (BNN+LV) can describe complex stochastic patterns through the distribution over the latent input variables (accidental, i.e. aleatoric, uncertainty), while accounting for model uncertainty through the distribution over the weights (epistemic uncertainty).
  • Hidden input variables refer to variables that cannot be directly observed, but have an impact on the state and output of the probability model.
  • the inventor described a method and device for implementing a Bayesian neural network using memristor intrinsic noise in Chinese invention patent application publication CN110956256A, which is hereby cited in its entirety as part of this application.
  • Bayesian neural networks include but are not limited to fully connected structures, convolutional neural network (CNN) structures, etc.
  • The network weights W are random variables following a certain distribution (W ~ q(W)).
  • After training is completed, each weight of the Bayesian neural network is a distribution; for example, the weights are mutually independent distributions.
  • At least one embodiment of the present disclosure provides a strategy optimization method using a dynamic environment model based on a memristor array, including: obtaining a dynamic environment model based on a memristor array; performing multiple predictions at multiple times according to the dynamic environment model and an object policy to obtain a data sample set including optimization costs of the object policy corresponding to the multiple times; and, based on the data sample set, performing a policy search using a policy gradient optimization algorithm to optimize the object policy.
  • The policy optimization method uses a dynamic environment model based on a memristor array to generate a data sample set and thereby implements long-term dynamic planning based on the dynamic environment model; it then uses a relatively stable algorithm, such as a policy gradient optimization algorithm, to conduct the policy search, so there are no gradient vanishing or explosion problems and the object policy can be optimized effectively.
  • At least one embodiment of the present disclosure also provides a strategy optimization device corresponding to the above strategy optimization method.
  • FIG. 1A shows a schematic flow chart of a strategy optimization method based on a dynamic environment model of a memristor array provided by at least one embodiment of the present disclosure.
  • the strategy optimization method includes the following steps S101 to S103.
  • Step S101 Obtain a dynamic environment model based on the memristor array.
  • BNN+LV based on a memristor array can be used to model a dynamic system to obtain a dynamic environment model.
  • The specific steps for this are shown in Figure 1B and will not be repeated here.
  • Step S102 Perform multiple predictions at multiple times based on the dynamic environment model and the object strategy, and obtain a data sample set including the optimization costs of the object strategy corresponding to multiple times.
  • the object policy involved is used in deep reinforcement learning, which can be, for example, a policy for an agent to maximize rewards or achieve a specific goal during its interaction with the environment.
  • Step S103 Based on the data sample set, use the policy gradient optimization algorithm to perform policy search to optimize the object policy.
  • For example, in different examples of embodiments of the present disclosure, the policy gradient optimization algorithm may include the REINFORCE algorithm, the PPO (Proximal Policy Optimization) algorithm, or the TRPO (Trust Region Policy Optimization) algorithm.
  • In embodiments of the present disclosure, these policy gradient optimization methods are relatively stable and can effectively optimize the object policy; a minimal REINFORCE-style sketch is shown below for illustration.
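  • Purely for illustration (not part of the patent disclosure), the sketch below shows a minimal REINFORCE-style update for a linear-Gaussian policy using a collected data sample set; the policy form, the assumption that states are recorded alongside the actions and optimization costs, the learning rate, and the array shapes are all assumptions made for the example.

```python
import numpy as np

def reinforce_update(W_pi, samples, lr=1e-2, sigma=0.1):
    """One REINFORCE-style update of the parameters W_pi of a linear-Gaussian
    policy a ~ N(W_pi @ s, sigma^2 I), using recorded (state, action, J) samples,
    where J is the optimization cost for that action (lower is better here)."""
    grad = np.zeros_like(W_pi)
    for s, a, J in samples:
        mean = W_pi @ s                                  # policy mean action
        grad_logp = np.outer((a - mean) / sigma**2, s)   # d log pi / d W_pi
        grad += J * grad_logp                            # score-function estimator
    grad /= len(samples)
    return W_pi - lr * grad                              # descend, since J is a cost

# toy usage: 2-dimensional state and action
rng = np.random.default_rng(0)
W_pi = rng.normal(size=(2, 2))
samples = [(rng.normal(size=2), rng.normal(size=2), rng.random()) for _ in range(32)]
W_pi = reinforce_update(W_pi, samples)
```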
  • FIG. 1B shows a schematic flowchart of an example of step S101 in FIG. 1A.
  • step S101 may include the following steps S111 to S113.
  • Step S111 Obtain a Bayesian neural network, where the Bayesian neural network has a trained weight matrix.
  • The structure of the Bayesian neural network may be a fully connected structure, a convolutional neural network structure, or the like.
  • Each network weight of this Bayesian neural network is a random variable.
  • After training, each weight is a distribution, such as a Gaussian distribution or a Laplace distribution.
  • The Bayesian neural network can be trained offline to obtain the weight matrix; conventional training methods can be used, for example on a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processing unit (NPU).
  • Step S112 Obtain corresponding multiple target conductance values according to the weight matrix of the Bayesian neural network, and map the multiple target conductance values to the memristor array.
  • the weight matrix is processed to obtain corresponding multiple target conductance values.
  • the weight matrix can be biased and scaled until the weight matrix meets the appropriate conductance window for the memristor array being used.
  • the target conductance value is calculated based on the processed weight matrix and the conductance value of the memristor.
  • For the specific process of calculating the target conductance values, please refer to the relevant description of memristor-based Bayesian neural networks, which is not repeated here.
  • FIG. 2A shows a schematic structure of a memristor array.
  • the memristor array is composed of, for example, multiple memristor units.
  • The multiple memristor units form an array of M rows and N columns, where M and N are both positive integers.
  • Each memristor cell includes a switching element and one or more memristors.
  • WL<1>, WL<2>, ..., WL<M> respectively represent the word lines of the first row, the second row, ..., the Mth row; the control electrode of the switching element in the memristor unit circuit of each row (for example, the gate of a transistor) is connected to the word line corresponding to that row. BL<1>, BL<2>, ..., BL<N> respectively represent the bit lines of the first column, the second column, ..., the Nth column; the memristor in the memristor unit circuit of each column is connected to the bit line corresponding to that column. SL<1>, SL<2>, ..., SL<M> respectively represent the source lines of the first row, the second row, ..., the Mth row; the source of the transistor in the memristor unit circuit of each row is connected to the source line corresponding to that row. According to Kirchhoff's law, by setting the states (such as resistances) of the memristor units and applying corresponding word line signals and bit line signals to the word lines and bit lines, the memristor array can complete multiply-accumulate calculations in parallel; an illustrative software sketch of this multiply-accumulate behavior is given below.
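  • As a purely illustrative sketch (not taken from the patent), the multiply-accumulate behavior described above can be emulated in software: with a conductance matrix G whose rows correspond to source lines and whose columns correspond to bit lines, applying read voltages to the bit lines yields, by Ohm's law and Kirchhoff's current law, the output currents I = G · V on the source lines. The variable names and value ranges below are assumptions.

```python
import numpy as np

def crossbar_mac(G, v_in):
    """Ideal multiply-accumulate of a memristor crossbar.

    G:    (M, N) conductance matrix; rows correspond to source lines,
          columns to bit lines.
    v_in: (N,) read voltages applied to the bit line (column) inputs.
    Returns the (M,) output currents on the source line (row) outputs,
    following Ohm's law per device and Kirchhoff's current law per row."""
    return G @ v_in

# toy usage (assumed conductance and voltage ranges)
rng = np.random.default_rng(1)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))   # conductances in siemens
v_in = np.array([0.1, 0.2, 0.15])          # read voltages in volts
i_out = crossbar_mac(G, v_in)              # output currents in amperes
```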
  • FIG. 2B is a schematic diagram of a memristor device, which includes a memristor array and its peripheral driving circuit.
  • the memristor device includes a signal acquisition device, a word line driving circuit, a bit line driving circuit, a source line driving circuit, a memristor array, and a data output circuit.
  • the signal acquisition device is configured to convert a digital signal into a plurality of analog signals through a digital to analog converter (DAC), so as to be input to a plurality of column signal input terminals of the memristor array.
  • a memristor array includes M source lines, M word lines, and N bit lines, and a plurality of memristor cells arranged in M rows and N columns.
  • the operation of the memristor array is implemented through a word line driving circuit, a bit line driving circuit and a source line driving circuit.
  • The word line driving circuit includes multiple multiplexers (Mux) for switching the word line input voltages; the bit line driving circuit includes multiple multiplexers for switching the bit line input voltages; and the source line driving circuit includes multiple multiplexers for switching the source line input voltages.
  • the source line driver circuit also includes multiple ADCs for converting analog signals into digital signals.
  • The memristor array has an operating mode and a computation mode.
  • When the memristor array is in the operating mode, the memristor units are in an initialization state, and the values of the parameter elements in the parameter matrix can be written into the memristor array.
  • the source line input voltage, bit line input voltage and word line input voltage of the memristor are switched to corresponding preset voltage ranges through a multiplexer.
  • the word line input voltage is switched to the corresponding voltage range through the control signal WL_sw[1:M] of the multiplexer in the word line driving circuit in FIG. 2B.
  • For example, when performing a set operation on a memristor, the word line input voltage is set to, for example, 2 V (volts); when performing a reset operation on the memristor, the word line input voltage is set to, for example, 5 V. The word line input voltage can be obtained from the voltage signal V_WL[1:M] in FIG. 2B.
  • the source line input voltage is switched to the corresponding voltage range through the control signal SL_sw[1:M] of the multiplexer in the source line driving circuit in FIG. 2B.
  • For example, the source line input voltage is set to 0 V in one case and to 2 V in the other, depending on the operation being performed; the source line input voltage can be obtained from the voltage signal V_SL[1:M] in FIG. 2B.
  • the bit line input voltage is switched to the corresponding voltage range through the control signal BL_sw[1:N] of the multiplexer in the bit line driving circuit in FIG. 2B.
  • For example, the bit line input voltage is set to 2 V in one case and to 0 V in the other, depending on the operation being performed; the bit line input voltage can be obtained from the DAC in FIG. 2B.
  • When the memristor array is in the computation mode, the memristors in the memristor array are in a conductance state that can be used for computing, and the bit line input voltages applied to the column signal input terminals do not change the conductance values of the memristors; for example, the calculation can be completed by performing multiply-accumulate operations with the memristor array.
  • the word line input voltage is switched to the corresponding voltage range through the control signal WL_sw[1:M] of the multiplexer in the word line driving circuit in Figure 2B.
  • For example, when a turn-on signal is applied, the word line input voltage of the corresponding row is set to, for example, 5 V; when no turn-on signal is applied, the word line input voltage of the corresponding row is set to, for example, 0 V (the GND signal). Through the control signal SL_sw[1:M] of the multiplexers in the source line driving circuit in Figure 2B, the source line input voltage is switched to the corresponding voltage range, for example 0 V, so that the current signals from the multiple row signal output terminals can flow into the data output circuit; through the control signal BL_sw[1:N] of the multiplexers in the bit line driving circuit in Figure 2B, the bit line input voltage is switched to the corresponding voltage range, for example 0.1 V to 0.3 V, so that the memristor array can be used to perform multiply-accumulate operations.
  • The data output circuit may include multiple trans-impedance amplifiers (TIAs) and analog-to-digital converters (ADCs), and may convert the current signals at the multiple row signal output terminals into voltage signals and then into digital signals for subsequent processing.
  • Figure 2C is a schematic diagram of another memristor device.
  • the structure of the memristor device shown in FIG. 2C is basically the same as that of the memristor device shown in FIG. 2B, and also includes a memristor array and its peripheral driving circuit.
  • The memristor device includes a signal acquisition device, a word line driving circuit, a bit line driving circuit, a source line driving circuit, a memristor array, and a data output circuit.
  • a memristor array includes M source lines, 2M word lines, and 2N bit lines, and a plurality of memristor cells arranged in M rows and N columns.
  • each memristor unit has a 2T2R structure, and the operation of mapping the parameter matrix used for transformation processing to multiple different memristor units in the memristor array will not be described again here.
  • the memristor array may also include M source lines, M word lines and 2N bit lines, and a plurality of memristor units arranged in M rows and N columns.
  • Figure 2D shows the process of mapping the weight matrix of the Bayesian neural network to the memristor array.
  • Memristor arrays are used to implement the weight matrix between layers in the Bayesian neural network.
  • N memristors are used for each weight to implement the distribution corresponding to the weight.
  • N is an integer greater than or equal to 2.
  • N target conductance values are calculated, and these N conductance values, which follow the weight's distribution, are mapped to the N memristors. In this way, the weight matrix in the Bayesian neural network is converted into target conductance values and mapped onto the crossbar of the memristor array.
  • the left side of the figure is a three-layer Bayesian neural network, which includes three neuron layers connected one by one.
  • The input layer includes the first-layer neurons,
  • the hidden layer includes the second-layer neurons,
  • and the output layer includes the third-layer neurons.
  • The input layer passes the received input data to the hidden layer; the hidden layer calculates and transforms the input data and sends the result to the output layer.
  • The output layer outputs the output result of the Bayesian neural network.
  • the input layer, hidden layer and output layer all include multiple neuron nodes, and the number of neuron nodes in each layer can be set according to different application situations.
  • The number of neurons in the input layer is 2 (N1 and N2),
  • the number of neurons in the middle hidden layer is 3 (N3, N4, and N5),
  • and the number of neurons in the output layer is 1 (N6).
  • the weight matrix is implemented by a memristor array as shown on the right side of Figure 2D.
  • the weight parameters can be programmed directly to the conductance of the memristor array.
  • the weight parameters can also be mapped to the conductance of the memristor array according to a certain rule.
  • the difference in conductance of two memristors can also be used to represent a weight parameter.
  • the structure of the memristor array on the right side in FIG. 2D is, for example, as shown in FIG. 2A .
  • the memristor array may include a plurality of memristors arranged in an array.
  • The weight connecting the input N1 and the output N3 is implemented by three memristors (G11, G12, G13), and the other weights in the weight matrix can be implemented in the same way.
  • Source line SL1 corresponds to neuron N3,
  • source line SL2 corresponds to neuron N4,
  • and source line SL3 corresponds to neuron N5;
  • bit lines BL1, BL2, and BL3 correspond to neuron N1.
  • A weight between the input layer and the hidden layer is converted into three target conductance values according to its distribution and mapped into the crossbar of the memristor array; here the target conductance values are G11, G12, and G13, outlined with dashed lines in the memristor array. An illustrative software sketch of such a weight-to-conductance mapping is given below.
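  • The sketch below is only an illustration of the kind of weight-to-conductance mapping described above, not the patent's exact procedure: each trained weight distribution (assumed Gaussian here) is rescaled into an assumed conductance window and represented by N sampled target conductance values; the window limits, the value of N, and the rescaling rule are assumptions.

```python
import numpy as np

def weights_to_target_conductances(mu, sigma, n_devices=3,
                                   g_min=1e-6, g_max=1e-4, seed=0):
    """Map Gaussian weight distributions N(mu, sigma^2) to groups of n_devices
    target conductance values per weight, inside an assumed window [g_min, g_max].

    mu, sigma: arrays of shape (out_dim, in_dim) describing each weight's distribution.
    Returns an array of shape (out_dim, in_dim, n_devices) of target conductances."""
    rng = np.random.default_rng(seed)
    # draw n_devices samples per weight from its distribution
    w = rng.normal(mu[..., None], sigma[..., None], size=mu.shape + (n_devices,))
    # assumed affine rescaling of the sampled weights onto the conductance window
    w_lo, w_hi = w.min(), w.max()
    return g_min + (w - w_lo) / (w_hi - w_lo) * (g_max - g_min)

# toy usage: a 3x2 weight matrix, each weight represented by 3 memristors
mu = np.zeros((3, 2))
sigma = 0.1 * np.ones((3, 2))
targets = weights_to_target_conductances(mu, sigma)   # shape (3, 2, 3)
```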
  • Step S113: Input the state and hidden input variables corresponding to time t of the dynamic system as input signals to the weight-mapped memristor array; the memristor array processes the state and hidden input variables at time t according to the Bayesian neural network, and the output signal corresponding to the processing result is obtained from the memristor array. The output signal is used to obtain the prediction result of the dynamic system at time t+1.
  • The dynamic environment model is expressed as s_{t+1} = f(s_t, a_t; W, ε), where s_t is the state of the dynamic system at time t, a_t is the action of the object policy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, and s_{t+1} is the prediction result of the dynamic system at time t+1.
  • The action of the object policy at time t is a_t = π(s_t; W_π), where π denotes the function of the object policy and W_π denotes the policy parameters; the weight matrix W of the Bayesian neural network follows the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise, ε ~ N(0, σ²). An illustrative software sketch of one such prediction step is given below.
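  • For illustration only, one prediction step of such a model can be written in software as below; the single linear layer with sampled weights is an assumption standing in for the memristor-based Bayesian neural network, and the latent-variable distribution p(z) and the noise level σ are assumed to be a standard normal and a small constant, respectively.

```python
import numpy as np

rng = np.random.default_rng(2)

def predict_next_state(s_t, a_t, W_mu, W_std, sigma=0.01):
    """One prediction s_{t+1} = f(s_t, a_t; W, eps) of the dynamic environment model.

    The weight distribution W ~ q(W) is emulated by sampling from N(W_mu, W_std^2),
    z ~ p(z) is a latent input drawn from a standard normal, and eps ~ N(0, sigma^2)
    is the additive readout noise."""
    W = rng.normal(W_mu, W_std)                       # sample weights W ~ q(W)
    z = rng.normal()                                  # latent input z ~ p(z)
    x = np.concatenate([s_t, a_t, [z]])               # model input at time t
    eps = rng.normal(0.0, sigma, size=W_mu.shape[0])  # additive noise eps
    return W @ x + eps                                # predicted state s_{t+1}

# toy usage: 2-D state and 2-D action -> input dimension 5, output dimension 2
s_t, a_t = np.zeros(2), np.array([0.1, -0.2])
W_mu, W_std = np.zeros((2, 5)), 0.05 * np.ones((2, 5))
s_next = predict_next_state(s_t, a_t, W_mu, W_std)
```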
  • the input signal is a voltage signal and the output signal is a current signal.
  • the output signal is read and analog-to-digital converted for subsequent processing.
  • The input sequence is applied to the bit lines (BL) in the form of voltage pulses, and the output currents flowing out of the source lines (SL) are then collected for further calculation and processing.
  • the input sequence can be converted by a DAC into an analog voltage signal, which is applied to BL through a multiplexer.
  • the output current is obtained from SL, which can be converted into a voltage signal through a transimpedance amplifier, and converted into a digital signal through the ADC, and the digital signal can be used for subsequent processing.
  • When the read currents of the N memristors are summed and N is relatively large, the total output current follows a certain distribution, for example a distribution similar to a Gaussian distribution or a Laplace distribution.
  • The total output current produced by all the voltage pulses is the result of multiplying the input vector by the weight matrix.
  • For a memristor crossbar array, such a parallel read operation is therefore equivalent to implementing the two operations of sampling and vector-matrix multiplication at once; a minimal simulation of this idea is sketched below.
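  • A minimal simulation of this idea is given below, under assumptions: each weight is stored on N memristors whose read currents fluctuate with intrinsic noise, so a single parallel read both samples an effective weight matrix and performs the vector-matrix multiplication. The multiplicative Gaussian read-noise model and all magnitudes are assumptions, not measured device behavior.

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_parallel_read(G_groups, v_in, read_noise=0.05):
    """Emulate one parallel read of a crossbar that stores each weight on N devices.

    G_groups: (rows, cols, N) target conductances (N devices per weight).
    v_in:     (cols,) read voltages applied to the bit lines.
    Each device current fluctuates with assumed multiplicative Gaussian noise, so
    summing the N devices of each weight and then the columns of each row yields
    one sample of the weight distribution times the input: sampling and
    vector-matrix multiplication in a single read."""
    noise = rng.normal(1.0, read_noise, size=G_groups.shape)
    G_eff = (G_groups * noise).sum(axis=2)   # effective (rows, cols) conductances
    return G_eff @ v_in                      # summed output currents per source line

# toy usage
G_groups = rng.uniform(1e-6, 1e-4, size=(3, 2, 3))
i_out = noisy_parallel_read(G_groups, np.array([0.1, 0.2]))
```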
  • The multiple times include time 1 to time T, arranged in order from earliest to latest.
  • FIG. 3 shows a schematic flowchart of an example of step S102 in FIG. 1A.
  • step S102 may include the following steps S301 to S303.
  • For any time t-1 from time 1 to time T, the execution action a_{t-1} is obtained from the object policy, i.e., a_{t-1} = π(s_{t-1}; W_π) is the action selected by the object policy at time t-1.
  • According to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε), the state s_t at the next time t is calculated, and the cost c_t of s_t is obtained, from which the cost sequence {c_1, c_2, ..., c_t} from time 1 to time t is obtained.
  • Based on the cost sequence, the optimization cost J_{t-1} for time t is obtained, where 1 ≤ t ≤ T.
  • Examples of step S302 may include: sampling the latent input variable z from the p(z) distribution to obtain a sample; inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array to obtain the predicted state s_t; and, for the predicted state s_t, obtaining the cost c_t = c(s_t).
  • In other words, the latent input variable z is first sampled from the p(z) distribution, and then the state s_{t-1} at time t-1 and the sample of the latent input variable are applied to the bit lines (BL) of the memristor array as read (READ) voltage pulses; the output currents flowing from the source lines (SL) are then collected for further calculation and processing to obtain the values corresponding to time t.
  • By performing the above operations on the state at each time from time 1 to time t, the cost sequence {c_1, c_2, ..., c_t} can be obtained.
  • Step S303: Obtain the data sample set {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]} from time 1 to time T.
  • The expected value of the cost c_t at time t is E[c_t], and the optimization cost at time t is obtained from these expected values.
  • The cost also includes cost changes caused by accidental (aleatoric) uncertainty and cost changes caused by cognitive (epistemic) uncertainty.
  • Accidental uncertainty is caused by the hidden input variables,
  • and cognitive uncertainty is caused by the intrinsic noise of the memristor array.
  • The optimization cost at time t is obtained using a function σ(η, θ) of the accidental uncertainty and the epistemic uncertainty, where η denotes the accidental uncertainty and θ denotes the epistemic uncertainty.
  • The set of data samples from time 1 to time T is {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]}; an illustrative rollout sketch for collecting such a set is given below.
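  • As an illustrative sketch only, the collection of the data sample set can be organized as a rollout loop such as the following; the per-state cost function, the policy form, and the aggregation of the optimization cost (taken here as a running sum of Monte-Carlo expected costs) are assumptions standing in for details the excerpt above does not fully specify.

```python
import numpy as np

def collect_samples(s0, policy, env_model, cost_fn, T, n_mc=10):
    """Roll out the policy through the learned environment model for T steps.

    policy(s) -> action; env_model(s, a) -> one stochastic next-state sample;
    cost_fn(s) -> scalar cost c(s). The optimization cost J is taken here as a
    running sum of Monte-Carlo expected costs, an assumed stand-in for the
    aggregation formula. Returns [[a_0, J_0], ..., [a_{T-1}, J_{T-1}]]."""
    samples, s, running_j = [], np.asarray(s0, dtype=float), 0.0
    for _ in range(T):
        a = policy(s)
        # Monte-Carlo estimate of E[c_t] under the model's stochasticity (z, W, eps)
        next_states = [env_model(s, a) for _ in range(n_mc)]
        running_j += float(np.mean([cost_fn(sn) for sn in next_states]))
        samples.append([a, running_j])
        s = next_states[0]          # continue the rollout from one sampled state
    return samples
```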
  • the exemplary process of the above-mentioned strategy optimization method based on the dynamic environment model of the memristor array is as follows:
  • Figure 4 shows a schematic diagram of an example of a strategy optimization method provided by at least one embodiment of the present disclosure.
  • In this example, a ship is driven against the waves so as to get as close as possible to a target location on the coastline, and a control model for driving the ship needs to be trained.
  • A ship at position (x, y) can choose an action (a_x, a_y) that represents the direction and magnitude of the drive.
  • the subsequent position of the ship exhibits drift and disturbance.
  • the closer the location is to the coast the greater the interference.
  • The ship is given only a limited batch data set of spatial position transitions and cannot directly interact with the ocean environment to optimize its action policy, in order to ensure safety. It is therefore necessary to rely on empirical data to learn a marine environment model (a dynamic environment model) that can predict the next state. Epistemic and accidental uncertainties will arise from the missing information at unvisited locations and from the randomness of the marine environment, respectively.
  • the sea surface is a dynamic environment
  • The object policy refers to the method by which the ship travels from its current position to the target position.
  • First, the dynamic environment model and an initial object policy for the dynamic environment are obtained; the initialization state of the ship is its current position.
  • the execution action at the current moment is obtained from the object strategy.
  • the dynamic environment model is used to predict the state (the position of the ship) at the next moment.
  • The cost and the optimization cost corresponding to the object policy are calculated, and the action and the optimization cost are recorded to form a data sample. Assume the current time is time 1; from time 1 to the subsequent time T, a data sample set is obtained, and the policy gradient optimization algorithm is used to perform a policy search on the data sample set to obtain an optimized object policy. A purely illustrative cost function for this example is sketched below.
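  • Purely as an illustration of what the per-state cost in this ship example could look like (the patent excerpt does not give the cost formula), one might penalize the distance between the ship's predicted position and the target position on the coastline:

```python
import numpy as np

def ship_cost(state, target=(10.0, 0.0)):
    """Assumed per-state cost for the ship example: the Euclidean distance
    between the ship's predicted (x, y) position and the target position."""
    x, y = state[0], state[1]
    return float(np.hypot(x - target[0], y - target[1]))
```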
  • Figure 5 shows a schematic block diagram of a strategy optimization device 500 using a dynamic environment model based on a memristor array provided by at least one embodiment of the present disclosure.
  • The strategy optimization device can be used to execute the strategy optimization method shown in Figure 1A.
  • the strategy optimization device 500 includes an acquisition unit 501 , a calculation unit 502 and a strategy search unit 503 .
  • the acquisition unit 501 is configured to acquire a dynamic environment model based on the memristor array.
  • the computing unit 502 is configured to perform multiple predictions at multiple times according to the dynamic environment model and the object policy, and obtain a data sample set including the optimization cost of the object policy corresponding to multiple times.
  • the policy search unit 503 is configured to use a policy gradient optimization algorithm to perform a policy search based on a data sample set to optimize the object policy.
  • The strategy optimization device 500 can be implemented using hardware, software, firmware, or any feasible combination thereof; the present disclosure is not limited in this respect.

Abstract

A policy optimization method and apparatus using a dynamic environment model based on a memristor array. The method comprises: acquiring a dynamic environment model based on a memristor array; performing prediction multiple times at a plurality of moments according to the dynamic environment model and an object policy, so as to obtain a data sample set, which comprises optimization costs of the object policy corresponding to the plurality of moments; and on the basis of the data sample set, performing policy searching by using a policy gradient optimization algorithm, so as to optimize the object policy. In the method, a data sample set is generated by using a dynamic environment model based on a memristor array, long-term dynamic planning based on the dynamic environment model is realized, and policy searching is then performed by using a more stable algorithm such as a policy gradient optimization algorithm, such that an object policy can be effectively optimized.

Description

Policy optimization method and apparatus using an environment model based on a memristor array
This application claims priority from Chinese Patent Application No. 202210497721.2, filed on May 9, 2022; the disclosure of that Chinese patent application is hereby incorporated by reference in its entirety as part of this application.
Technical Field
Embodiments of the present disclosure relate to a policy optimization method and a policy optimization apparatus using a dynamic environment model based on a memristor array.
Background Art
Artificial neural networks (ANNs) are widely used in the modeling of dynamic systems. However, long-term task planning with traditional artificial neural networks remains a challenge because of their lack of ability to model uncertainty. The randomness (uncertainty) inherent in real systems, process noise, and the approximation errors introduced by data-driven modeling can cause the long-term estimates of an artificial neural network to deviate from the actual behavior of the system. Probabilistic models provide a way to address uncertainty; these models enable people to make informed decisions using the model's predictions while remaining cautious about the uncertainty of those predictions.
Summary of the Invention
At least one embodiment of the present disclosure provides a policy optimization method using a dynamic environment model based on a memristor array, including: obtaining a dynamic environment model based on a memristor array; performing multiple predictions at multiple times according to the dynamic environment model and an object policy to obtain a data sample set including optimization costs of the object policy corresponding to the multiple times; and, based on the data sample set, performing a policy search using a policy gradient optimization algorithm to optimize the object policy.
For example, in the policy optimization method provided by an embodiment of the present disclosure, obtaining the dynamic environment model includes: obtaining a Bayesian neural network that has a weight matrix obtained by training; obtaining corresponding multiple target conductance values according to the weight matrix of the Bayesian neural network, and mapping the multiple target conductance values to the memristor array; and inputting the state and hidden input variables corresponding to time t of the dynamic system as input signals to the weight-mapped memristor array, processing the state and hidden input variables at time t according to the Bayesian neural network through the memristor array, and obtaining from the memristor array an output signal corresponding to the processing result, the output signal being used to obtain the prediction result of the dynamic system at time t+1.
For example, in the policy optimization method provided by an embodiment of the present disclosure, the dynamic environment model is expressed as s_{t+1} = f(s_t, a_t; W, ε), where s_t is the state of the dynamic system at time t, a_t is the action of the object policy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, and s_{t+1} is the prediction result of the dynamic system at time t+1; the action of the object policy at time t is a_t = π(s_t; W_π), where π denotes the function of the object policy and W_π denotes the policy parameters; the weight matrix W of the Bayesian neural network follows the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise, ε ~ N(0, σ²).
For example, in the policy optimization method provided by an embodiment of the present disclosure, the multiple times include time 1 to time T arranged in order from earliest to latest, and performing multiple predictions at multiple times according to the dynamic environment model and the object policy to obtain the data sample set including the optimization costs of the object policy corresponding to the multiple times includes: for any time t-1 from time 1 to time T, obtaining the execution action a_{t-1} from the object policy, with a_{t-1} = π(s_{t-1}; W_π) being the action of the object policy at time t-1; calculating the state s_t at the next time t after time t-1 according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining the cost c_t corresponding to the state s_t at time t, thereby obtaining the cost sequence {c_1, c_2, ..., c_t} from time 1 to time t; obtaining the optimization cost J_{t-1} for time t based on the cost sequence, where 1 ≤ t ≤ T; and obtaining the data sample set {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]} from time 1 to time T.
For example, in the policy optimization method provided by an embodiment of the present disclosure, the expected value of the cost c_t at time t is E[c_t], and the optimization cost at time t can be obtained from these expected values.
For example, in the policy optimization method provided by an embodiment of the present disclosure, the cost also includes cost changes caused by accidental (aleatoric) uncertainty and cost changes caused by cognitive (epistemic) uncertainty; the accidental uncertainty is caused by the hidden input variables, and the cognitive uncertainty is caused by the intrinsic noise of the memristor array.
For example, in the policy optimization method provided by an embodiment of the present disclosure, the optimization cost at time t is obtained using a function σ(η, θ) of the accidental uncertainty and the epistemic uncertainty, where η denotes the accidental uncertainty and θ denotes the epistemic uncertainty.
For example, in the policy optimization method provided by an embodiment of the present disclosure, calculating the state s_t at time t according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining the cost c_t corresponding to the state s_t at time t includes: sampling the hidden input variable z from the p(z) distribution to obtain a sample; inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array to obtain the predicted state s_t; and, for the predicted state s_t, obtaining the cost c_t = c(s_t).
For example, in the policy optimization method provided by an embodiment of the present disclosure, the policy gradient optimization algorithm includes the REINFORCE algorithm, the PPO algorithm, or the TRPO algorithm.
At least one embodiment of the present disclosure also provides a policy optimization apparatus using a dynamic environment model based on a memristor array, including: an acquisition unit configured to acquire the dynamic environment model based on the memristor array; a computing unit configured to perform multiple predictions at multiple times according to the dynamic environment model and the object policy to obtain a data sample set including the optimization costs of the object policy corresponding to the multiple times; and a policy search unit configured to perform a policy search using a policy gradient optimization algorithm based on the data sample set to optimize the object policy.
Description of the Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments are briefly introduced below. Obviously, the drawings in the following description relate only to some embodiments of the present disclosure and do not limit the present disclosure.
FIG. 1A shows a schematic flow chart of a policy optimization method based on a dynamic environment model of a memristor array provided by at least one embodiment of the present disclosure;
FIG. 1B shows a schematic flow chart of step S101 in FIG. 1A;
FIG. 2A shows a schematic structure of a memristor array;
FIG. 2B is a schematic diagram of a memristor device;
FIG. 2C is a schematic diagram of another memristor device;
FIG. 2D shows a schematic diagram of mapping the weight matrix of a Bayesian neural network to a memristor array;
FIG. 3 shows a schematic flow chart of step S102 in FIG. 1A;
FIG. 4 shows a schematic diagram of an example of a policy optimization method provided by at least one embodiment of the present disclosure;
FIG. 5 shows a schematic block diagram of a policy optimization apparatus using a dynamic environment model based on a memristor array provided by at least one embodiment of the present disclosure.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure are described clearly and completely below in conjunction with the accompanying drawings of the embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the described embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure.
Unless otherwise defined, the technical or scientific terms used in the present disclosure shall have the ordinary meaning understood by a person with ordinary skill in the art to which the present disclosure belongs. The words "first", "second", and the like used in the present disclosure do not indicate any order, quantity, or importance, but are only used to distinguish different components. Likewise, words such as "a", "an", or "the" do not indicate a limitation in quantity, but rather indicate the presence of at least one. Words such as "include" or "comprise" mean that the elements or items appearing before the word cover the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like are only used to express relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.
In model-free deep reinforcement learning, the agent usually needs to perform a large number of trial-and-error interactions with the real environment; the data efficiency is low, so such methods cannot be applied to real tasks in which the cost of trial and error is relatively high. Model-based deep reinforcement learning can use data more efficiently. In model-based deep reinforcement learning, the agent first learns a dynamic environment model from the historical experience of interacting with the real dynamic environment (such as state transition data collected in advance), and then interacts with the dynamic environment model to obtain a sub-optimal policy.
When the model-based reinforcement learning method learns an accurate dynamic environment model, that model is used when training the agent, so the agent does not need to interact with the real environment many times; the agent can instead "imagine" what interacting with the real environment would feel like. This greatly improves data efficiency and suits actual physical scenarios where the cost of obtaining data is high. At the same time, the dynamic environment model can predict unknown states of the environment and generalize the agent's cognition; it can also serve as a new data source that provides contextual information to help decision-making, which can alleviate the exploration-exploitation dilemma. When modeling a real environment, the randomness (uncertainty) inherent in the environment, process noise, and the approximation errors introduced by data-driven modeling can cause the long-term estimates of an artificial neural network to deviate from the actual behavior of the system. Probabilistic models provide a way to address uncertainty; these models make it possible to use the model's predictions to make informed decisions while remaining cautious about the uncertainty of those predictions.
The inventors found that a Bayesian neural network (BNN) is a probabilistic model that places a neural network in a Bayesian framework and can describe complex stochastic patterns; moreover, a Bayesian neural network with latent input variables (BNN+LV) can describe complex stochastic patterns through the distribution over the latent input variables (accidental, i.e. aleatoric, uncertainty), while accounting for model uncertainty through the distribution over the weights (epistemic uncertainty). Hidden input variables are variables that cannot be directly observed but that affect the state and output of the probabilistic model. The inventors described a method and device for implementing a Bayesian neural network using the intrinsic noise of memristors in Chinese invention patent application publication CN110956256A, which is hereby incorporated by reference in its entirety as part of this application.
The structure of a Bayesian neural network includes, but is not limited to, a fully connected structure, a convolutional neural network (CNN) structure, and the like; its network weights W are random variables following a certain distribution (W ~ q(W)).
The inventors further found that, assuming there is a data set D = {X, Y} of the dynamic system for the Bayesian neural network, where X is the state feature vector of the dynamic system and Y is the next state of the dynamic system, the input of the Bayesian neural network is the state feature vector X of the dynamic system and the hidden input variable z (z ~ p(z)); the parameters of the Bayesian neural network can be trained; and the output of the Bayesian neural network, superimposed with independent additive Gaussian noise ε (ε ~ N(0, σ²)), is the prediction y of the next state of the dynamic system, i.e., y = f(X, z, W, ε). Thus, after training is completed, each weight of the Bayesian neural network is a distribution; for example, the weights are mutually independent distributions.
In long-term planning tasks, gradients are backpropagated through many steps, so gradient vanishing and explosion problems exist; at the same time, when a neural network implemented directly on a memristor array performs a policy search, the intrinsic stochasticity of the memristors introduces additional noise as gradients are backpropagated through the memristor array, and these noisy gradients cannot effectively optimize the policy search.
At least one embodiment of the present disclosure provides a policy optimization method using a dynamic environment model based on a memristor array, including: obtaining a dynamic environment model based on a memristor array; performing multiple predictions at multiple times according to the dynamic environment model and an object policy to obtain a data sample set including optimization costs of the object policy corresponding to the multiple times; and, based on the data sample set, performing a policy search using a policy gradient optimization algorithm to optimize the object policy.
The policy optimization method provided by the above embodiments of the present disclosure uses a dynamic environment model based on a memristor array to generate a data sample set and thereby implements long-term dynamic planning based on the dynamic environment model; it then uses a relatively stable algorithm, such as a policy gradient optimization algorithm, to conduct the policy search, so there are no gradient vanishing or explosion problems and the object policy can be optimized effectively.
本公开至少一实施例还提供对应于上述策略优化方法的策略优化装置。At least one embodiment of the present disclosure also provides a strategy optimization device corresponding to the above strategy optimization method.
下面结合附图对本公开的实施例进行详细说明,但是本公开并不限于这些具体的实施例。The embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.
图1A示出了本公开至少一实施例提供的一种基于忆阻器阵列的动态环境模型的策略优化方法的示意性流程图。1A shows a schematic flow chart of a strategy optimization method based on a dynamic environment model of a memristor array provided by at least one embodiment of the present disclosure.
如图1A所示,该策略优化方法包括如下的步骤S101~S103。As shown in Figure 1A, the strategy optimization method includes the following steps S101 to S103.
步骤S101:获取基于忆阻器阵列的动态环境模型。Step S101: Obtain a dynamic environment model based on the memristor array.
在本公开的实施例中,例如,可以利用基于忆阻器阵列的BNN+LV对动态系统进行建模得到动态环境模型,对此具体的步骤将在图1B中示出,在此不再赘述。In embodiments of the present disclosure, for example, BNN+LV based on a memristor array can be used to model a dynamic system to obtain a dynamic environment model. The specific steps for this will be shown in Figure 1B and will not be described again here. .
步骤S102:根据动态环境模型以及对象策略进行多个时刻的多次预测,得到包括对象策略对应于多个时刻的优化代价的数据样本集合。Step S102: Perform multiple predictions at multiple times based on the dynamic environment model and the object strategy, and obtain a data sample set including the optimization costs of the object strategy corresponding to multiple times.
例如,所涉及的对象策略用于深度强化学习,例如,可以是智能体在与环境的交互过程中达成回报最大化或实现特定目标的策略。For example, the object policy involved is used in deep reinforcement learning, which can be, for example, a policy for an agent to maximize rewards or achieve a specific goal during its interaction with the environment.
步骤S103:基于数据样本集合,使用策略梯度优化算法进行策略搜索以对对象策略进行优化。Step S103: Based on the data sample set, use the policy gradient optimization algorithm to perform policy search to optimize the object policy.
例如,在本公开的实施例的不同示例中,策略梯度优化算法可以包括REINFORCE算法、PRO(Proximal Policy Optimization)算法或TPRO(Trust Region Policy Optimization)算法。在本公开的实施例中,这些策略梯度优化方法更加稳定,可以有效地优化对象策略。For example, in different examples of embodiments of the present disclosure, the policy gradient optimization algorithm may include the REINFORCE algorithm, the PRO (Proximal Policy Optimization) algorithm, or the TPRO (Trust Region Policy Optimization) algorithm. In embodiments of the present disclosure, these policy gradient optimization methods are more stable and can effectively optimize object policies.
图1B示出了图1A中步骤S101的示例的示意性流程图。FIG. 1B shows a schematic flowchart of an example of step S101 in FIG. 1A.
如图1B所示,步骤S101的示例可以包括如下的步骤S111~S113。As shown in FIG. 1B , an example of step S101 may include the following steps S111 to S113.
步骤S111:获取贝叶斯神经网络,其中,贝叶斯神经网络具有经训练得到的权重矩阵。Step S111: Obtain a Bayesian neural network, where the Bayesian neural network has a trained weight matrix.
例如,贝叶斯神经网络的结构包括全连接结构或卷积神经网络结构等。该贝叶斯神经网络的每个网络权重是随机变量。例如,在该贝叶斯神经网络经训练完成后,每一个权重都是一个分布,例如高斯分布或者拉普拉斯分布。For example, the structure of Bayesian neural network includes fully connected structure or convolutional neural network structure. Each network weight of this Bayesian neural network is a random variable. For example, after the Bayesian neural network is trained, each weight is a distribution, such as Gaussian distribution or Laplace distribution.
例如,可以对贝叶斯神经网络进行离线(offline)训练得到权重矩阵,对贝叶斯神经网络进行训练的方法可以参考常规方法,例如可以采用中央处理单元(CPU)、图像处理单元(GPU)、神经网络处理单元(NPU)等进行训 练,在此不再赘述。For example, the Bayesian neural network can be trained offline to obtain the weight matrix. The method of training the Bayesian neural network can refer to conventional methods. For example, a central processing unit (CPU) or an image processing unit (GPU) can be used. , neural network processing unit (NPU), etc. for training Practice, I won’t go into details here.
Step S112: obtaining a plurality of corresponding target conductance values according to the weight matrix of the Bayesian neural network, and mapping the plurality of target conductance values to the memristor array.
After the training of the Bayesian neural network is completed and the weight matrix is obtained, the weight matrix is processed to obtain the corresponding plurality of target conductance values. For example, in this process, the weight matrix can be biased and scaled until it fits the conductance window of the memristor array being used. After biasing and scaling, the target conductance values are calculated according to the processed weight matrix and the conductance values of the memristors. For the specific process of calculating the target conductance values, reference may be made to the related description of memristor-based Bayesian neural networks, which is not repeated here.
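A minimal sketch of the biasing and scaling step is given below, assuming a linear mapping and an illustrative conductance window of 2-20 µS; the actual window and mapping rule depend on the devices used.

```python
import numpy as np

def weights_to_conductance(W, g_min=2e-6, g_max=20e-6):
    """Linearly bias and scale a trained weight matrix into a conductance window.

    W     : trained weight matrix (any real values)
    g_min : lowest programmable device conductance (assumed, in siemens)
    g_max : highest programmable device conductance (assumed, in siemens)
    Returns the target conductance matrix G with the same shape as W.
    """
    w_min, w_max = W.min(), W.max()
    scale = (g_max - g_min) / (w_max - w_min + 1e-12)   # scaling factor
    G = g_min + (W - w_min) * scale                     # bias + scale into the window
    return G

# toy usage
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))          # e.g. mean weights of one BNN layer
G_target = weights_to_conductance(W)
assert G_target.min() >= 2e-6 and G_target.max() <= 20e-6
```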
FIG. 2A shows a schematic structure of a memristor array. The memristor array is composed of, for example, a plurality of memristor cells arranged in an array of M rows and N columns, where M and N are positive integers. Each memristor cell includes a switching element and one or more memristors. In FIG. 2A, WL<1>, WL<2>, ..., WL<M> denote the word lines of the first row, the second row, ..., the M-th row, respectively; the control electrode of the switching element in each row of memristor cell circuits (for example, the gate of a transistor) is connected to the word line of that row. BL<1>, BL<2>, ..., BL<N> denote the bit lines of the first column, the second column, ..., the N-th column, respectively; the memristor in each column of memristor cell circuits is connected to the bit line of that column. SL<1>, SL<2>, ..., SL<M> denote the source lines of the first row, the second row, ..., the M-th row, respectively; the source of the transistor in each row of memristor cell circuits is connected to the source line of that row. According to Kirchhoff's law, by setting the states (for example, the resistance values) of the memristor cells and applying corresponding word line signals and bit line signals to the word lines and bit lines, the memristor array can perform multiply-accumulate computations in parallel.
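The parallel multiply-accumulate behaviour follows from Ohm's law and Kirchhoff's current law: each output line current is the conductance-weighted sum of the applied input voltages. The short behavioural sketch below checks this relation in software; the array size and voltage range are illustrative assumptions.

```python
import numpy as np

# Behavioural model of an M x N crossbar: cell conductance G[i, j] links input
# line i (voltage V[i]) to output line j.  By Ohm's law the cell current is
# G[i, j] * V[i]; by Kirchhoff's current law the output line collects the sum.
M, N = 4, 3
rng = np.random.default_rng(2)
G = rng.uniform(2e-6, 20e-6, size=(M, N))   # programmed conductances (S)
V = rng.uniform(0.1, 0.3, size=M)           # read voltages on the input lines (V)

I_out = G.T @ V                              # N output currents, obtained in parallel in hardware
assert np.allclose(I_out, [sum(G[i, j] * V[i] for i in range(M)) for j in range(N)])
```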
FIG. 2B is a schematic diagram of a memristor device, which includes a memristor array and its peripheral driving circuits. For example, as shown in FIG. 2B, the memristor device includes a signal acquisition device, a word line driving circuit, a bit line driving circuit, a source line driving circuit, a memristor array, and a data output circuit.
For example, the signal acquisition device is configured to convert a digital signal into a plurality of analog signals through a digital-to-analog converter (DAC), to be input to a plurality of column signal input terminals of the memristor array.
For example, the memristor array includes M source lines, M word lines, N bit lines, and a plurality of memristor cells arranged in an array of M rows and N columns.
For example, operations on the memristor array are implemented through the word line driving circuit, the bit line driving circuit, and the source line driving circuit.
For example, the word line driving circuit includes a plurality of multiplexers (Mux) for switching the word line input voltages; the bit line driving circuit includes a plurality of multiplexers for switching the bit line input voltages; and the source line driving circuit also includes a plurality of multiplexers for switching the source line input voltages. For example, the source line driving circuit further includes a plurality of ADCs for converting analog signals into digital signals. In addition, a trans-impedance amplifier (TIA, not shown in the figure) can be further arranged between the multiplexers and the ADCs in the source line driving circuit to convert current into voltage for ADC processing.
For example, the memristor array has an operation mode and a computation mode. When the memristor array is in the operation mode, the memristor cells are in an initialization state, and the values of the parameter elements of a parameter matrix can be written into the memristor array. For example, the source line input voltage, the bit line input voltage, and the word line input voltage of the memristors are switched to corresponding preset voltage ranges through the multiplexers.
For example, the word line input voltage is switched to the corresponding voltage range through the control signals WL_sw[1:M] of the multiplexers in the word line driving circuit in FIG. 2B. For example, when a set operation is performed on a memristor, the word line input voltage is set to 2 V (volts); when a reset operation is performed on a memristor, the word line input voltage is set to 5 V. For example, the word line input voltage can be obtained from the voltage signals V_WL[1:M] in FIG. 2B.
For example, the source line input voltage is switched to the corresponding voltage range through the control signals SL_sw[1:M] of the multiplexers in the source line driving circuit in FIG. 2B. For example, when a set operation is performed on a memristor, the source line input voltage is set to 0 V; when a reset operation is performed on a memristor, the source line input voltage is set to 2 V. For example, the source line input voltage can be obtained from the voltage signals V_SL[1:M] in FIG. 2B.
For example, the bit line input voltage is switched to the corresponding voltage range through the control signals BL_sw[1:N] of the multiplexers in the bit line driving circuit in FIG. 2B. For example, when a set operation is performed on a memristor, the bit line input voltage is set to 2 V; when a reset operation is performed on a memristor, the bit line input voltage is set to 0 V. For example, the bit line input voltage can be obtained from the DAC in FIG. 2B.
For example, when the memristor array is in the computation mode, the memristors in the memristor array are in a conductive state usable for computation, and the bit line input voltages applied at the column signal input terminals do not change the conductance values of the memristors; for example, the computation can be completed by performing multiply-accumulate operations with the memristor array. For example, the word line input voltage is switched to the corresponding voltage range through the control signals WL_sw[1:M] of the multiplexers in the word line driving circuit in FIG. 2B: when a turn-on signal is applied, the word line input voltage of the corresponding row is set to 5 V; when no turn-on signal is applied, the word line input voltage of the corresponding row is set to 0 V, for example, connected to the GND signal. The source line input voltage is switched to the corresponding voltage range (for example, set to 0 V) through the control signals SL_sw[1:M] of the multiplexers in the source line driving circuit in FIG. 2B, so that the current signals at the plurality of row signal output terminals can flow into the data output circuit. The bit line input voltage is switched to the corresponding voltage range (for example, set to 0.1 V-0.3 V) through the control signals BL_sw[1:N] of the multiplexers in the bit line driving circuit in FIG. 2B, so that multiply-accumulate operations are performed with the memristor array.
For example, the data output circuit may include a plurality of trans-impedance amplifiers (TIAs) and ADCs, and may convert the current signals at the plurality of row signal output terminals into voltage signals and then into digital signals for subsequent processing.
FIG. 2C is a schematic diagram of another memristor device. The memristor device shown in FIG. 2C has basically the same structure as that shown in FIG. 2B and also includes a memristor array and its peripheral driving circuits. For example, as shown in FIG. 2C, the memristor device includes a signal acquisition device, a word line driving circuit, a bit line driving circuit, a source line driving circuit, a memristor array, and a data output circuit.
For example, the memristor array includes M source lines, 2M word lines, 2N bit lines, and a plurality of memristor cells arranged in an array of M rows and N columns. For example, each memristor cell has a 2T2R structure; the operation of mapping the parameter matrix used for transformation processing to a plurality of different memristor cells in the memristor array is not repeated here. It should be noted that the memristor array may also include M source lines, M word lines, 2N bit lines, and a plurality of memristor cells arranged in M rows and N columns.
For descriptions of the signal acquisition device, the control and driving circuits, and the data output circuit, reference may be made to the preceding description, which is not repeated here.
FIG. 2D shows the process of mapping the weight matrix of a Bayesian neural network to a memristor array. The memristor array is used to implement the weight matrix between layers of the Bayesian neural network. For each weight, N memristors are used to implement the distribution corresponding to that weight, where N is an integer greater than or equal to 2; N conductance values are calculated according to the random probability distribution corresponding to the weight, and the N conductance values are mapped onto the N memristors. In this way, the weight matrix of the Bayesian neural network is converted into target conductance values and mapped into the crossbar array of memristors.
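One plausible software view of this per-weight mapping is sketched below, assuming Gaussian weight posteriors, a sampling-based assignment of the N conductance values, and the same illustrative conductance window as above; these are assumptions for illustration, not the only way to realize the distribution on N devices.

```python
import numpy as np

def bayesian_weight_to_conductances(mu, std, n_dev=3, g_min=2e-6, g_max=20e-6,
                                    w_min=-1.0, w_max=1.0, rng=None):
    """Map one Bayesian weight (Gaussian with mean mu, std) onto n_dev memristors.

    Each device receives one sample of the weight distribution, linearly mapped
    into the assumed conductance window [g_min, g_max] over the assumed weight
    range [w_min, w_max].  Reading the n_dev devices together then realizes the
    weight distribution in hardware.
    """
    rng = rng or np.random.default_rng()
    samples = rng.normal(mu, std, size=n_dev)     # draw from the weight posterior
    samples = np.clip(samples, w_min, w_max)      # keep within the mappable range
    return g_min + (samples - w_min) / (w_max - w_min) * (g_max - g_min)

# toy usage: the weight between neurons N1 and N3 becomes G11, G12, G13
G11, G12, G13 = bayesian_weight_to_conductances(mu=0.2, std=0.05, n_dev=3,
                                                rng=np.random.default_rng(3))
```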
As shown in FIG. 2D, the left side of the figure is a three-layer Bayesian neural network, which includes three neuron layers connected one after another. For example, the input layer is the first neuron layer, the hidden layer is the second neuron layer, and the output layer is the third neuron layer. For example, the input layer passes the received input data to the hidden layer; the hidden layer computes and transforms the input data and sends the result to the output layer; and the output layer outputs the output result of the Bayesian neural network.
As shown in FIG. 2D, the input layer, the hidden layer, and the output layer all include a plurality of neuron nodes, and the number of neuron nodes in each layer can be set according to different applications. For example, the number of neurons in the input layer is 2 (N1 and N2), the number of neurons in the middle hidden layer is 3 (N3, N4, and N5), and the number of neurons in the output layer is 1 (N6).
As shown in FIG. 2D, two adjacent neuron layers of the Bayesian neural network are connected by a weight matrix. For example, the weight matrix is implemented by the memristor array shown on the right side of FIG. 2D. For example, the weight parameters can be programmed directly as the conductances of the memristor array. For example, the weight parameters can also be mapped to the conductances of the memristor array according to a certain rule. For example, the difference between the conductances of two memristors can also be used to represent one weight parameter. Although the present disclosure describes its technical solution in terms of programming the weight parameters directly as the conductances of the memristor array or mapping the weight parameters to the conductances according to a certain rule, this is merely exemplary and does not limit the present disclosure.
The structure of the memristor array on the right side of FIG. 2D is, for example, as shown in FIG. 2A, and the memristor array may include a plurality of memristors arranged in an array. In the example shown in FIG. 2D, the weight connecting the input N1 and the output N3 is implemented by three memristors (G11, G12, G13), and the other weights in the weight matrix can be implemented in the same way. More specifically, source line SL1 corresponds to neuron N3, source line SL2 corresponds to neuron N4, source line SL3 corresponds to neuron N5, and bit lines BL1, BL2, and BL3 correspond to neuron N1. One weight between the input layer and the hidden layer (the weight between neuron N1 and neuron N3) is converted into three target conductance values according to its distribution, and these values are mapped into the crossbar array of memristors; here the target conductance values are G11, G12, and G13, which are outlined with a dashed box in the memristor array.
Returning to FIG. 1B, step S113: inputting the state of the dynamic system at time t and a latent input variable, as input signals, to the weight-mapped memristor array; processing, by the memristor array, the state at time t and the latent input variable according to the Bayesian neural network; and obtaining, from the memristor array, an output signal corresponding to the processing result, where the output signal is used to obtain a prediction result of the dynamic system at time t+1.
For example, in some embodiments of the present disclosure, the dynamic environment model is expressed as s_{t+1} = f(s_t, a_t; W, ε), where s_t is the state of the dynamic system at time t, a_t is the action of the object strategy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, and s_{t+1} is the prediction result of the dynamic system at time t+1. The action of the object strategy at time t is a_t = π(s_t; W_π), where π denotes the function of the object strategy and W_π denotes the strategy parameters; the weight matrix W of the Bayesian neural network follows the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise ε ~ N(0, σ²).
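A software sketch of one prediction step of this model is given below; the two-layer network, the dimensions, and the noise levels are illustrative assumptions, and in the disclosed hardware the weight sampling and matrix multiplications would be carried out by the memristor array itself.

```python
import numpy as np

rng = np.random.default_rng(4)
state_dim, action_dim, latent_dim, hidden = 3, 1, 1, 16

# Variational posterior q(W): a mean and a standard deviation per weight.
W1_mu = rng.normal(scale=0.3, size=(hidden, state_dim + action_dim + latent_dim))
W1_sd = 0.05 * np.ones_like(W1_mu)
W2_mu = rng.normal(scale=0.3, size=(state_dim, hidden))
W2_sd = 0.05 * np.ones_like(W2_mu)
noise_sd = 0.01                                   # additive noise eps ~ N(0, sigma^2)

def env_model_step(s_t, a_t, rng):
    """One draw of s_{t+1} = f(s_t, a_t; W, eps) with W ~ q(W) and latent z ~ p(z)."""
    z = rng.normal(size=latent_dim)               # latent input variable
    x = np.concatenate([s_t, a_t, z])
    W1 = rng.normal(W1_mu, W1_sd)                 # sample the Bayesian weights
    W2 = rng.normal(W2_mu, W2_sd)
    h = np.tanh(W1 @ x)
    eps = rng.normal(scale=noise_sd, size=state_dim)
    return W2 @ h + eps                           # predicted next state

s_next = env_model_step(np.zeros(state_dim), np.zeros(action_dim), rng)
```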
For the memristor array, the input signal is a voltage signal and the output signal is a current signal; the output signal is read and analog-to-digital converted for subsequent processing. For example, the input sequence is applied to the BL (bit lines) in the form of voltage pulses, and the output current flowing out of the SL (source lines) is then collected for further computation. For example, for a memristor device as shown in FIG. 2B or FIG. 2C, the input sequence can be converted by the DAC into analog voltage signals, which are applied to the BL through the multiplexers. Correspondingly, the output current obtained from the SL can be converted into a voltage signal by a trans-impedance amplifier and into a digital signal by the ADC, and the digital signal can be used for subsequent processing. When N memristors are read and N is relatively large, the total output current follows a certain distribution, for example, a distribution similar to a Gaussian or Laplace distribution. The total output current over all voltage pulses is the result of multiplying the input vector by the weight matrix. In a memristor crossbar array, such a single parallel read operation is therefore equivalent to performing both sampling and vector-matrix multiplication.
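A behavioural sketch of this property, in which one parallel read simultaneously draws a weight sample and performs the vector-matrix multiplication, is given below; the Gaussian per-device read-noise model and its magnitude are assumptions for illustration.

```python
import numpy as np

def stochastic_crossbar_read(G_target, v_in, read_noise=0.03, rng=None):
    """One parallel read of a crossbar whose devices fluctuate around G_target.

    G_target   : programmed conductance matrix (inputs x outputs)
    v_in       : read-voltage vector applied to the bit lines
    read_noise : assumed relative std of the per-device conductance fluctuation
    Each call draws a fresh conductance sample, so repeated reads of the same
    input realize sampling and vector-matrix multiplication at once.
    """
    rng = rng or np.random.default_rng()
    G_sample = G_target * (1.0 + read_noise * rng.normal(size=G_target.shape))
    return G_sample.T @ v_in                      # output currents on the source lines

rng = np.random.default_rng(5)
G = rng.uniform(2e-6, 20e-6, size=(4, 3))
v = np.full(4, 0.2)
currents = np.stack([stochastic_crossbar_read(G, v, rng=rng) for _ in range(100)])
# The spread of `currents` over repeated reads reflects the device-level distribution.
```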
The following describes, with reference to FIG. 3, how to perform multiple predictions at multiple times according to the dynamic environment model and the object strategy to obtain a data sample set including the optimization costs of the object strategy corresponding to the multiple times.
For example, the multiple times include time 1 to time T arranged in order from earliest to latest.
FIG. 3 shows a schematic flowchart of an example of step S102 in FIG. 1A.
As shown in FIG. 3, step S102 may include the following steps S301 to S303.
Step S301: for any time t-1 from time 1 to time T, obtaining an execution action a_{t-1} from the object strategy, where the action of the object strategy at time t-1 is obtained by a_{t-1} = π(s_{t-1}; W_π).
For example, the action a_{t-1} is the optimal action selected by the object strategy in the state at time t-1.
Step S302: calculating the state s_t at the next time t after time t-1 according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε), obtaining the cost c_t corresponding to the state s_t at time t, thereby obtaining the cost sequence {c_1, c_2, ..., c_t} from time 1 to time t, and obtaining the optimization cost J_{t-1} at time t based on the cost sequence, where 1 ≤ t ≤ T.
For example, in some embodiments of the present disclosure, an example of step S302 may include: sampling the latent input variable z from the distribution p(z) to obtain a sample; inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array to obtain the predicted state s_t; and, for the predicted state s_t, obtaining the cost c_t = c(s_t).
For example, the latent input variable z is first sampled from the distribution p(z); the state s_{t-1} at time t-1 and the sample of the latent input variable are then applied to the BL as read (READ) voltage pulses of the memristor array, and the output current flowing out of the SL is collected and further processed to obtain the cost c_t corresponding to time t. Performing the above operations for every time from time 1 to time t yields the cost sequence {c_1, c_2, ..., c_t}.
Step S303: obtaining the data sample set {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]} from time 1 to time T.
For example, if the expected value of the cost c_t at time t is E[c_t], the optimization cost at time t can be obtained as J_{t-1} = Σ_{τ=1}^{t} E[c_τ], that is, the accumulated expected cost over the cost sequence from time 1 to time t.
For example, in some embodiments of the present disclosure, the cost further includes a cost variation caused by aleatoric uncertainty and a cost variation caused by epistemic uncertainty, where the aleatoric uncertainty is caused by the latent input variable and the epistemic uncertainty is caused by the intrinsic noise of the memristor array.
For example, if the cost variations caused by aleatoric uncertainty and epistemic uncertainty are further taken into account, the optimization cost at time t can be obtained as J_{t-1} = Σ_{τ=1}^{t} E[c_τ] + σ(η, θ), where σ(η, θ) is a function of the aleatoric uncertainty and the epistemic uncertainty, η denotes the aleatoric uncertainty, and θ denotes the epistemic uncertainty.
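The exact form of σ(η, θ) is left open here; purely as an illustration, the sketch below estimates a per-time-step optimization cost term from repeated model rollouts, taking the within-rollout spread as a stand-in for the aleatoric contribution and the across-weight-sample spread as a stand-in for the epistemic contribution. The decomposition and the weighting factor are assumptions, not the claimed formula.

```python
import numpy as np

def optimization_cost_term(cost_samples, kappa=1.0):
    """Estimate one time step's contribution to J_{t-1} from Monte-Carlo samples.

    cost_samples : array of shape (n_weight_samples, n_latent_samples) holding
                   c_t obtained with different weight draws (rows) and latent
                   draws (columns) of the environment model.
    kappa        : assumed weighting of the uncertainty penalty sigma(eta, theta).
    """
    expected_cost = cost_samples.mean()                 # E[c_t]
    aleatoric = cost_samples.var(axis=1).mean()          # spread caused by the latent input
    epistemic = cost_samples.mean(axis=1).var()          # spread caused by weight uncertainty
    return expected_cost + kappa * np.sqrt(aleatoric + epistemic)

rng = np.random.default_rng(6)
samples = 1.0 + 0.1 * rng.normal(size=(8, 16))   # toy cost samples for one time step
term = optimization_cost_term(samples)            # summing such terms over tau gives J
```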
For any time between time 1 and time T, the data sample corresponding to that time can be obtained; the data sample set from time 1 to time T is {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]}.
For example, an exemplary flow of the above strategy optimization method using the dynamic environment model based on the memristor array is as follows:
Input: the dynamic environment model based on the memristor array and an initial object strategy
For n = 1 to N:
    Initialize the state s_0
    For t = 1 to T:
        Obtain the execution action a_{t-1} from the object strategy π
        Predict the state s_t at time t using the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε), obtaining the data sample [s_t]
        Compute the cost and the optimization cost of the object strategy at this time:
            c_t = c(s_t)
            J_{t-1} = Σ_{τ=1}^{t} E[c_τ]
        Record [a_{t-1}, J_{t-1}]
    Obtain the data sample set {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]}, and on this data sample set perform a strategy search using the policy gradient optimization algorithm
End the loop over n
Output: the optimized strategy π
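For readers who prefer executable code to pseudocode, the compact, self-contained sketch below mirrors this flow in software; the small Gaussian-weight network standing in for the memristor-based environment model, the quadratic cost c(s), and the plain REINFORCE update are all illustrative assumptions rather than the claimed hardware implementation.

```python
import numpy as np

rng = np.random.default_rng(7)
state_dim, action_dim, latent_dim, hidden = 2, 2, 1, 16
T, N_iter, n_rollouts = 10, 50, 8

# Illustrative Bayesian environment model (software stand-in for the memristor array).
Wm_mu = rng.normal(scale=0.2, size=(state_dim, state_dim + action_dim + latent_dim))
Wm_sd = 0.05 * np.ones_like(Wm_mu)

def env_step(s, a):
    z = rng.normal(size=latent_dim)                    # latent input variable z ~ p(z)
    Wm = rng.normal(Wm_mu, Wm_sd)                      # one weight sample W ~ q(W)
    return s + 0.1 * (Wm @ np.concatenate([s, a, z])) + 0.01 * rng.normal(size=state_dim)

def cost(s):
    return float(s @ s)                                # assumed quadratic cost c(s)

# Gaussian policy a = W_pi @ s + noise, optimized by a REINFORCE-style step.
W_pi = rng.normal(scale=0.1, size=(action_dim, state_dim))
sigma, lr = 0.1, 1e-2

for n in range(N_iter):
    samples = []                                       # data sample set {[a_{t-1}, J_{t-1}]}
    for _ in range(n_rollouts):
        s = np.ones(state_dim)                         # initialize the state s_0
        costs = []
        for t in range(1, T + 1):
            a = W_pi @ s + sigma * rng.normal(size=action_dim)
            s_prev = s
            s = env_step(s, a)                         # predict s_t with the model
            costs.append(cost(s))
            samples.append((s_prev, a, float(np.sum(costs))))  # J_{t-1}: accumulated cost so far
    # Policy search: descend the expected optimization cost on the data sample set.
    J = np.array([j for _, _, j in samples])
    b = J.mean()                                       # baseline for variance reduction
    grad = sum(np.outer((a - W_pi @ s) / sigma**2, s) * (j - b) for s, a, j in samples)
    W_pi -= lr * grad / len(samples)
```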
FIG. 4 shows a schematic diagram of an example of the strategy optimization method provided by at least one embodiment of the present disclosure.
As shown in FIG. 4, in an exemplary application, a boat is driven against ocean waves so as to get as close as possible to a target position on the coastline, and a control model for driving the boat therefore needs to be trained. The boat at position (x, y) can choose an action (a_x, a_y) that represents the direction and magnitude of the drive. However, due to the dynamic environment of a sea surface with waves, the subsequent position of the boat exhibits drift and disturbance, and the closer the position is to the coast, the larger the disturbance. The boat is given only a limited batch data set of spatial position transitions and, for safety, cannot optimize its action strategy by interacting directly with the ocean environment. In this case, empirical data must be relied on to learn an ocean environment model (a dynamic environment model) that can predict the next state. Epistemic uncertainty and aleatoric uncertainty arise, respectively, from the missing information at unvisited positions and from the randomness of the ocean environment.
In this embodiment, for the control model, the sea surface is a dynamic environment, and the object strategy refers to the method used by the boat to move from its current position to the target position. First, the dynamic environment model for this dynamic environment and an initial object strategy are obtained. The initial state of the boat is its current position; the execution action at the current time is obtained from the object strategy; the dynamic environment model is used to predict the state (the position of the boat) at the next time; the cost and the optimization cost corresponding to the object strategy are computed; and the data sample composed of the action and the optimization cost is recorded. Assuming the current time is time 1, a data sample set is obtained from time 1 to a subsequent time T, and a strategy search is performed on this data sample set using the policy gradient optimization algorithm, thereby obtaining an optimized object strategy.
FIG. 5 shows a schematic block diagram of a strategy optimization device 500 using a dynamic environment model based on a memristor array provided by at least one embodiment of the present disclosure; the strategy optimization device can be used to perform the strategy optimization method shown in FIG. 1A.
As shown in FIG. 5, the strategy optimization device 500 includes an acquisition unit 501, a computing unit 502, and a strategy search unit 503.
The acquisition unit 501 is configured to acquire the dynamic environment model based on the memristor array.
The computing unit 502 is configured to perform multiple predictions at multiple times according to the dynamic environment model and the object strategy to obtain a data sample set including the optimization costs of the object strategy corresponding to the multiple times.
The strategy search unit 503 is configured to, based on the data sample set, perform a strategy search using a policy gradient optimization algorithm to optimize the object strategy.
For example, the strategy optimization device 500 can be implemented by hardware, software, firmware, or any feasible combination thereof, which is not limited by the present disclosure.
The technical effects of the above strategy optimization device are the same as those of the strategy optimization method shown in FIG. 1A and are not repeated here.
The following points need to be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in these embodiments; for other structures, reference may be made to common designs.
(2) Without conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto; the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

  1. A strategy optimization method using a dynamic environment model based on a memristor array, comprising:
    obtaining the dynamic environment model based on the memristor array;
    performing multiple predictions at multiple times according to the dynamic environment model and an object strategy to obtain a data sample set including optimization costs of the object strategy corresponding to the multiple times; and
    based on the data sample set, performing a strategy search using a policy gradient optimization algorithm to optimize the object strategy.
  2. The strategy optimization method according to claim 1, wherein obtaining the dynamic environment model comprises:
    obtaining a Bayesian neural network, wherein the Bayesian neural network has a trained weight matrix;
    obtaining a plurality of corresponding target conductance values according to the weight matrix of the Bayesian neural network, and mapping the plurality of target conductance values to the memristor array; and
    inputting a state of a dynamic system at time t and a latent input variable, as input signals, to the weight-mapped memristor array, processing, by the memristor array, the state at time t and the latent input variable according to the Bayesian neural network, and obtaining, from the memristor array, an output signal corresponding to a processing result, wherein the output signal is used to obtain a prediction result of the dynamic system at time t+1.
  3. The strategy optimization method according to claim 2, wherein the dynamic environment model is expressed as s_{t+1} = f(s_t, a_t; W, ε),
    wherein s_t is the state of the dynamic system at time t, a_t is an action of the object strategy at time t, W is the weight matrix of the Bayesian neural network, ε is additive noise corresponding to the memristor array, and s_{t+1} is the prediction result of the dynamic system at time t+1;
    wherein the action of the object strategy at time t is a_t = π(s_t; W_π), π denotes a function of the object strategy, W_π denotes strategy parameters, the weight matrix W of the Bayesian neural network follows a distribution W ~ q(W), and the additive noise ε is additive Gaussian noise ε ~ N(0, σ²).
  4. The strategy optimization method according to claim 3, wherein the multiple times include time 1 to time T arranged in order from earliest to latest,
    and performing the multiple predictions at the multiple times according to the dynamic environment model and the object strategy to obtain the data sample set including the optimization costs of the object strategy corresponding to the multiple times comprises:
    for any time t-1 from time 1 to time T, obtaining an execution action a_{t-1} from the object strategy,
    wherein the action a_{t-1} of the object strategy at time t-1 is obtained by a_{t-1} = π(s_{t-1}; W_π);
    calculating a state s_t at a next time t after time t-1 according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining a cost c_t corresponding to the state s_t at time t, thereby obtaining a cost sequence {c_1, c_2, ..., c_t} from time 1 to time t;
    obtaining an optimization cost J_{t-1} at time t based on the cost sequence, wherein 1 ≤ t ≤ T; and
    obtaining the data sample set {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]} from time 1 to time T.
  5. The strategy optimization method according to claim 4, wherein, if an expected value of the cost c_t at time t is E[c_t], the optimization cost at time t is obtained as J_{t-1} = Σ_{τ=1}^{t} E[c_τ].
  6. The strategy optimization method according to claim 4 or 5, wherein the cost further includes a cost variation caused by aleatoric uncertainty and a cost variation caused by epistemic uncertainty,
    wherein the aleatoric uncertainty is caused by the latent input variable, and the epistemic uncertainty is caused by intrinsic noise of the memristor array.
  7. The strategy optimization method according to claim 6, wherein the optimization cost at time t is obtained as J_{t-1} = Σ_{τ=1}^{t} E[c_τ] + σ(η, θ), wherein σ(η, θ) is a function of the aleatoric uncertainty and the epistemic uncertainty, η denotes the aleatoric uncertainty, and θ denotes the epistemic uncertainty.
  8. The strategy optimization method according to any one of claims 4-7, wherein calculating the state s_t at time t according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining the cost c_t corresponding to the state s_t at time t comprises:
    sampling the latent input variable z from a distribution p(z) to obtain a sample;
    inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array to obtain a predicted state s_t; and
    for the predicted state s_t, obtaining the cost c_t = c(s_t).
  9. The strategy optimization method according to any one of claims 1-8, wherein the policy gradient optimization algorithm includes a REINFORCE algorithm, a PPO algorithm, or a TRPO algorithm.
  10. A strategy optimization device using a dynamic environment model based on a memristor array, comprising:
    an acquisition unit configured to acquire the dynamic environment model based on the memristor array;
    a computing unit configured to perform multiple predictions at multiple times according to the dynamic environment model and an object strategy to obtain a data sample set including optimization costs of the object strategy corresponding to the multiple times; and
    a strategy search unit configured to, based on the data sample set, perform a strategy search using a policy gradient optimization algorithm to optimize the object strategy.
PCT/CN2023/092475 2022-05-09 2023-05-06 Policy optimization method and apparatus using environment model based on memristor array WO2023217027A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210497721.2 2022-05-09
CN202210497721.2A CN114819093A (en) 2022-05-09 2022-05-09 Strategy optimization method and device by utilizing environment model based on memristor array

Publications (1)

Publication Number Publication Date
WO2023217027A1 true WO2023217027A1 (en) 2023-11-16

Family

ID=82512800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092475 WO2023217027A1 (en) 2022-05-09 2023-05-06 Policy optimization method and apparatus using environment model based on memristor array

Country Status (2)

Country Link
CN (1) CN114819093A (en)
WO (1) WO2023217027A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819093A (en) * 2022-05-09 2022-07-29 清华大学 Strategy optimization method and device by utilizing environment model based on memristor array
CN116300477A (en) * 2023-05-19 2023-06-23 江西金域医学检验实验室有限公司 Method, system, electronic equipment and storage medium for regulating and controlling environment of enclosed space


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543827A (en) * 2018-12-02 2019-03-29 清华大学 Production fights network equipment and training method
US20210133541A1 (en) * 2019-10-31 2021-05-06 Micron Technology, Inc. Spike Detection in Memristor Crossbar Array Implementations of Spiking Neural Networks
CN110956256A (en) * 2019-12-09 2020-04-03 清华大学 Method and device for realizing Bayes neural network by using memristor intrinsic noise
CN113505887A (en) * 2021-09-12 2021-10-15 浙江大学 Memristor memory neural network training method aiming at memristor errors
CN114067157A (en) * 2021-11-17 2022-02-18 中国人民解放军国防科技大学 Memristor-based neural network optimization method and device and memristor array
CN114819093A (en) * 2022-05-09 2022-07-29 清华大学 Strategy optimization method and device by utilizing environment model based on memristor array

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
林钰登 (LIN, YUDENG): "基于忆阻器阵列的神经网络系统的研究 (The Research on Resistive Random-access Memory Array Based Neural Network)", 中国优秀硕士学位论文全文数据库 (CHINESE MASTER’S THESES FULL-TEXT DATABASE), no. 202007, 15 July 2020 (2020-07-15), ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN114819093A (en) 2022-07-29


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23802799

Country of ref document: EP

Kind code of ref document: A1