WO2023217027A1

WO2023217027A1 - Policy optimization method and apparatus using environment model based on memristor array

Info

Publication number: WO2023217027A1
Application number: PCT/CN2023/092475
Authority: WO
Inventors: 高滨; 林钰登; 唐建石; 吴华强; 张清天; 钱鹤
Original assignee: 清华大学
Priority date: 2022-05-09
Filing date: 2023-05-06
Publication date: 2023-11-16
Also published as: CN114819093A

Abstract

A policy optimization method and apparatus using a dynamic environment model based on a memristor array. The method comprises: acquiring a dynamic environment model based on a memristor array; performing prediction multiple times at a plurality of moments according to the dynamic environment model and an object policy, so as to obtain a data sample set, which comprises optimization costs of the object policy corresponding to the plurality of moments; and on the basis of the data sample set, performing policy searching by using a policy gradient optimization algorithm, so as to optimize the object policy. In the method, a data sample set is generated by using a dynamic environment model based on a memristor array, long-term dynamic planning based on the dynamic environment model is realized, and policy searching is then performed by using a more stable algorithm such as a policy gradient optimization algorithm, such that an object policy can be effectively optimized.

Description

Strategy optimization method and device using memristor array-based environment model

This application claims priority from Chinese Patent Application No. 202210497721.2 submitted on May 9, 2022. The disclosure of the above Chinese patent application is hereby cited in its entirety as part of this application.

Technical field

Embodiments of the present disclosure relate to a strategy optimization method and a strategy optimization device utilizing a dynamic environment model based on a memristor array.

Background technique

Artificial Neural Network (ANN) is widely used in the modeling of dynamic systems. However, long-term mission planning using traditional artificial neural networks remains a challenge because of the lack of ability to model uncertainty. The inherent randomness (uncertainty) of real systems - process noise and approximation errors introduced by data-driven modeling can cause long-term estimates of artificial neural networks to deviate from the actual behavior of the system. Probabilistic models provide a way to address uncertainty. These models enable people to make informed decisions using the model's predictions, while being cautious about the uncertainty of those predictions.

Contents of the invention

At least one embodiment of the present disclosure provides a strategy optimization method using a dynamic environment model based on a memristor array, including: obtaining a dynamic environment model based on a memristor array; performing multi-time optimization at multiple times according to the dynamic environment model and the object policy. Prediction, obtain a data sample set including the optimization cost of the object strategy corresponding to multiple moments; based on the data sample set, use the policy gradient optimization algorithm to perform strategy search to optimize the object strategy.

For example, in the strategy optimization method provided by an embodiment of the present disclosure, obtaining a dynamic environment model includes: obtaining a Bayesian neural network, which has a weight matrix obtained by training; according to the Bayesian neural network The weight matrix obtains the corresponding multiple target conductance values, and maps the multiple target conductance values to the memristor array; the state and hidden input variables corresponding to the time t of the dynamic system are As an input signal, it is input to the memristor array after weight mapping. The state and hidden input variables at time t are processed through the memristor array according to the Bayesian neural network, and the output signal corresponding to the processing result is obtained from the memristor array. , the output signal is used to obtain the prediction result of the dynamic system at time t+1.

For example, in the strategy optimization method provided by an embodiment of the present disclosure, the dynamic environment model is expressed as s _t+1 =f (s _t , _at ; W, ε), s _t is the state of the dynamic system at time t, a _t is the action of the object strategy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, s _t+1 is the prediction result of the dynamic system at time t+1 ;The action of the object strategy at time t a _t =π (s _t ; Wπ), π represents the function of the object strategy, Wπ represents the strategy parameters, the weight matrix W of the Bayesian neural network satisfies the distribution W ~ q (W), plus The linear noise ε is additive Gaussian noise ε~N(0,σ ² ).

For example, in the strategy optimization method provided by an embodiment of the present disclosure, multiple times include time 1 to time T arranged in order from morning to night. Multiple predictions at multiple times are performed based on the dynamic environment model and the object policy, and we obtain A collection of data samples including the optimization cost of the object strategy corresponding to multiple moments, including: for any time t-1 from time 1 to time T, the execution action a _t-1 is obtained from the object strategy, and is obtained by a _t-1 = π(s _t-1 ; Wπ) obtains the action a _t -1 of the object strategy at time t-1; the time is calculated according to the dynamic environment model s _t = f(s _t-1 , a _t-1 ; W, ε) The state s _t at the next time t after t-1 is obtained and the cost c _t corresponding to the state s _t at time t is obtained, thereby obtaining the cost sequence {c ₁ , c ₂ ,…, c from time 1 to time t _t }, obtain the optimization cost J _t-1 at time t based on the cost sequence, where 1≤t≤T; obtain the data sample set from time 1 to time T {[a ₀ ,J ₀ ],...,[a _{T- 1} ,J _T-1 ]}.

For example, in the strategy optimization method provided by an embodiment of the present disclosure, the expected value of the cost c _t at time t is E[c _t ], then the optimization cost at time t can be obtained by to get.

For example, in the strategy optimization method provided by an embodiment of the present disclosure, the cost also includes cost changes caused by accidental uncertainty and cost changes caused by cognitive uncertainty. Accidental uncertainty is caused by hidden input variables. , the cognitive uncertainty is caused by the intrinsic noise of the memristor array.

For example, in the strategy optimization method provided by an embodiment of the present disclosure, the optimization cost at time t is calculated by To obtain, σ(η,θ) is a function of accidental uncertainty and epistemic uncertainty, eta represents accidental uncertainty, and θ represents epistemic uncertainty.

For example, in the strategy optimization method provided by an embodiment of the present disclosure, the state s t at time t is calculated according to the dynamic environment model s _t =f (s _t-1 , a _t-1 ; W, ε) and the corresponding state s _t is obtained. The cost c _t of the state s t at time _t includes: sampling the hidden input variable z from the p(z) distribution; inputting the sample and the state s t-1 at time _t-1 into the memristor after weight mapping The predicted state s _t is obtained from the detector array; for the predicted state s _t , Obtain cost c _t =c(s _t ).

For example, in the policy optimization method provided by an embodiment of the present disclosure, the policy gradient optimization algorithm includes the REINFORCE algorithm, PRO algorithm or TPRO algorithm.

At least one embodiment of the present disclosure also provides a strategy optimization device utilizing a dynamic environment model based on a memristor array, including: an acquisition unit configured to acquire a dynamic environment model based on the memristor array; a computing unit configured to calculate the dynamic environment model based on the memristor array. The environment model and the object strategy perform multiple predictions at multiple times to obtain a data sample set including the optimization cost of the object strategy corresponding to multiple times; the strategy search unit is configured to use the policy gradient optimization algorithm to perform strategy search based on the data sample set. to optimize object strategies.

Description of the drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below. Obviously, the drawings in the following description only relate to some embodiments of the present disclosure and do not limit the present disclosure. .

1A shows a schematic flow chart of a strategy optimization method based on a dynamic environment model of a memristor array provided by at least one embodiment of the present disclosure;

Figure 1B shows a schematic flow chart of step S101 in Figure 1A;

Figure 2A shows a schematic structure of a memristor array;

Figure 2B is a schematic diagram of a memristor device;

Figure 2C is a schematic diagram of another memristor device;

Figure 2D shows a schematic diagram of mapping the weight matrix of a Bayesian neural network to a memristor array;

Figure 3 shows a schematic flow chart of step S102 in Figure 1A;

Figure 4 shows a schematic diagram of an example of a strategy optimization method provided by at least one embodiment of the present disclosure;

FIG. 5 shows a schematic block diagram of a strategy optimization device using a dynamic environment model based on a memristor array provided by at least one embodiment of the present disclosure.

Detailed ways

In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below in conjunction with the drawings of the embodiments of the present disclosure. Obviously, The described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the described embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of the present disclosure.

Unless otherwise defined, technical terms or scientific terms used in this disclosure shall have the usual meaning understood by a person with ordinary skill in the art to which this disclosure belongs. "First", "second" and similar words used in this disclosure do not indicate any order, quantity or importance, but are only used to distinguish different components. Likewise, similar words such as "a", "an" or "the" do not indicate a quantitative limitation but rather indicate the presence of at least one. Words such as "include" or "comprising" mean that the elements or things appearing before the word include the elements or things listed after the word and their equivalents, without excluding other elements or things. Words such as "connected" or "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", etc. are only used to express relative positional relationships. When the absolute position of the described object changes, the relative positional relationship may also change accordingly.

In model-free deep reinforcement learning (Deep Reinforcement Learning), the agent usually needs to conduct a large number of interactive trials and errors with the real environment, and the data efficiency is not high, so it cannot be applied to real tasks where trial and error costs are relatively high. Model-based deep reinforcement learning can utilize data more efficiently. In model-based deep reinforcement learning, the agent first learns the dynamic environment model from the historical experience of interacting with the real dynamic environment (such as state transition data collected in advance), and then interacts with the dynamic environment model to obtain a sub-optimal strategy. .

The model-based reinforcement learning method learns an accurate dynamic environment model. This model is used when training the agent. It does not need to interact with the real environment too many times. The agent can "imagine" the feeling of interacting in the real environment. It can greatly improve data efficiency and is suitable for actual physical scenarios where the cost of obtaining data is high; at the same time, the dynamic environment model can predict the unknown state of the environment, generalize the cognition of the agent, and can also be used as a new data source to provide Contextual information can be used to help decision-making, which can alleviate the exploration-exploitation dilemma. When modeling real environments, the inherent stochasticity (uncertainty) of the environment - process noise and approximation errors introduced by data-driven modeling can cause long-term estimates of artificial neural networks to deviate from the actual behavior of the system. Probabilistic models provide a way to address uncertainty. These models enable informed decisions to be made using the model's predictions, while being cautious about the uncertainty of those predictions.

The inventor discovered that Bayesian Neural Network (BNN) is a kind of The probabilistic model of the neural network placed in the Bayesian framework can describe complex random patterns; and the Bayesian neural network with latent input variables (BNN with latent input variables, BNN+LV) can distribution (accidental uncertainty) to describe complex random patterns, while considering model uncertainty through distribution on weights (epistemic uncertainty). Hidden input variables refer to variables that cannot be directly observed, but have an impact on the state and output of the probability model. The inventor described a method and device for implementing a Bayesian neural network using memristor intrinsic noise in Chinese invention patent application publication CN110956256A, which is hereby cited in its entirety as part of this application.

The structures of Bayesian neural networks include but are not limited to fully connected structures, convolutional neural network (CNN) structures, etc. The network weight W is a random variable (W~q(W)) based on a certain distribution.

The inventor further discovered that it is assumed that there is a data set D={X, Y} of the dynamic system used in the Bayesian neural network, where X is the state feature vector of the dynamic system and Y is the next state of the dynamic system. The input of the Bayesian neural network is the state feature vector X of the dynamic system and the hidden input variable z(z～p(z)); the parameters of the Bayesian neural network can be trained; and, The output is superimposed with independent additive Gaussian noise ε (ε ~ N (0, σ ² )) to predict the next state of the dynamic system y, that is, y = f (X, z, W, ε). Therefore, each weight of the Bayesian neural network is a distribution after training is completed. For example, each weight is an independent distribution from each other.

In long-term planning tasks, gradients will undergo multi-step backpropagation, causing gradient disappearance and explosion problems. At the same time, when conducting strategy search for neural networks implemented directly based on memristor arrays, due to the inherent randomness of memristors, Characteristics, additional noise will be introduced when the gradient is backpropagated through the memristor array, and these noisy gradients cannot effectively optimize the policy search.

The policy optimization method provided by the above embodiments of the present disclosure uses a dynamic environment model based on a memristor array to generate a data sample set, implements long-term dynamic planning based on the dynamic environment model, and then uses a more stable algorithm such as the policy gradient optimization algorithm to conduct policy search , no gradient disappearance and explosion problems, and can effectively optimize the object strategy.

At least one embodiment of the present disclosure also provides a strategy optimization device corresponding to the above strategy optimization method.

The embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.

1A shows a schematic flow chart of a strategy optimization method based on a dynamic environment model of a memristor array provided by at least one embodiment of the present disclosure.

As shown in Figure 1A, the strategy optimization method includes the following steps S101 to S103.

Step S101: Obtain a dynamic environment model based on the memristor array.

In embodiments of the present disclosure, for example, BNN+LV based on a memristor array can be used to model a dynamic system to obtain a dynamic environment model. The specific steps for this will be shown in Figure 1B and will not be described again here. .

Step S102: Perform multiple predictions at multiple times based on the dynamic environment model and the object strategy, and obtain a data sample set including the optimization costs of the object strategy corresponding to multiple times.

For example, the object policy involved is used in deep reinforcement learning, which can be, for example, a policy for an agent to maximize rewards or achieve a specific goal during its interaction with the environment.

Step S103: Based on the data sample set, use the policy gradient optimization algorithm to perform policy search to optimize the object policy.

For example, in different examples of embodiments of the present disclosure, the policy gradient optimization algorithm may include the REINFORCE algorithm, the PRO (Proximal Policy Optimization) algorithm, or the TPRO (Trust Region Policy Optimization) algorithm. In embodiments of the present disclosure, these policy gradient optimization methods are more stable and can effectively optimize object policies.

FIG. 1B shows a schematic flowchart of an example of step S101 in FIG. 1A.

As shown in FIG. 1B , an example of step S101 may include the following steps S111 to S113.

Step S111: Obtain a Bayesian neural network, where the Bayesian neural network has a trained weight matrix.

For example, the structure of Bayesian neural network includes fully connected structure or convolutional neural network structure. Each network weight of this Bayesian neural network is a random variable. For example, after the Bayesian neural network is trained, each weight is a distribution, such as Gaussian distribution or Laplace distribution.

For example, the Bayesian neural network can be trained offline to obtain the weight matrix. The method of training the Bayesian neural network can refer to conventional methods. For example, a central processing unit (CPU) or an image processing unit (GPU) can be used. , neural network processing unit (NPU), etc. for training Practice, I won’t go into details here.

Step S112: Obtain corresponding multiple target conductance values according to the weight matrix of the Bayesian neural network, and map the multiple target conductance values to the memristor array.

After the Bayesian neural network training is completed and the weight matrix is obtained, the weight matrix is processed to obtain corresponding multiple target conductance values. For example, during this process, the weight matrix can be biased and scaled until the weight matrix meets the appropriate conductance window for the memristor array being used. After biasing and scaling the weight matrix, the target conductance value is calculated based on the processed weight matrix and the conductance value of the memristor. For the specific process of calculating the target conductance value, please refer to the relevant description of the memristor-based Bayesian neural network, which will not be described again here.

FIG. 2A shows a schematic structure of a memristor array. The memristor array is composed of, for example, multiple memristor units. The multiple memristor units form an array of M rows and N columns. M and N are all positive integers. Each memristor cell includes a switching element and one or more memristors. In Figure 2A, WL<1>, WL<2>...WL<M> respectively represent the word lines of the first row, the second row...the Mth row, and the switching elements in the memristor unit circuit of each row. The control electrode (such as the gate of a transistor) is connected to the corresponding word line of the row; BL<1>, BL<2>...BL<N> respectively represent the bits of the first column, the second column...the Nth column Line, the memristor in the memristor unit circuit of each column is connected to the bit line corresponding to the column; SL<1>, SL<2>...SL<M> respectively represent the first row, the second row... The source line of the Mth row, the source of the transistor in the memristor unit circuit of each row is connected to the corresponding source line of the row. According to Kirchhoff's law, by setting the state (such as resistance) of the memristor unit and applying corresponding word line signals and bit line signals to the word line and bit line, the above memristor array can complete the multiply-accumulate calculation in parallel. .

FIG. 2B is a schematic diagram of a memristor device, which includes a memristor array and its peripheral driving circuit. For example, as shown in FIG. 2B , the memristor device includes a signal acquisition device, a word line driving circuit, a bit line driving circuit, a source line driving circuit, a memristor array, and a data output circuit.

For example, the signal acquisition device is configured to convert a digital signal into a plurality of analog signals through a digital to analog converter (DAC), so as to be input to a plurality of column signal input terminals of the memristor array.

For example, a memristor array includes M source lines, M word lines, and N bit lines, and a plurality of memristor cells arranged in M rows and N columns.

For example, the operation of the memristor array is implemented through a word line driving circuit, a bit line driving circuit and a source line driving circuit.

For example, the word line driving circuit includes multiple multiplexers (Mux) for switching the word line input voltage; the bit line driving circuit includes multiple multiplexers for switching the bit line input voltage; the source line The driver circuit also includes multiple multiplexers (Mux) for switching the source line input voltage. For example, the source line driver circuit also includes multiple ADCs for converting analog signals into digital signals. In addition, a Trans-Impedance Amplifier (TIA) (not shown in the figure) can be further set between the Mux and the ADC in the source line driver circuit to complete the conversion of current to voltage to facilitate ADC processing.

For example, a memristor array includes operating modes and computational modes. When the memristor array is in the operating mode, the memristor unit is in an initialization state, and the values of the parameter elements in the parameter matrix can be written into the memristor array. For example, the source line input voltage, bit line input voltage and word line input voltage of the memristor are switched to corresponding preset voltage ranges through a multiplexer.

For example, the word line input voltage is switched to the corresponding voltage range through the control signal WL_sw[1:M] of the multiplexer in the word line driving circuit in FIG. 2B. For example, when performing a set operation on the memristor, the word line input voltage is set to 2V (volts), for example, when performing a reset operation on the memristor, the word line input voltage is set to 5V, for example, the word line input voltage It can be obtained from the voltage signal V_WL[1:M] in Figure 2B.

For example, the source line input voltage is switched to the corresponding voltage range through the control signal SL_sw[1:M] of the multiplexer in the source line driving circuit in FIG. 2B. For example, when performing a set operation on the memristor, the source line input voltage is set to 0V. For example, when performing a reset operation on the memristor, the source line input voltage is set to 2V. For example, the source line input voltage can be determined by the figure. The voltage signal V_SL[1:M] in 2B is obtained.

For example, the bit line input voltage is switched to the corresponding voltage range through the control signal BL_sw[1:N] of the multiplexer in the bit line driving circuit in FIG. 2B. For example, when performing a set operation on the memristor, the bit line input voltage is set to 2V. For example, when performing a reset operation on the memristor, the bit line input voltage is set to 0V. For example, the bit line input voltage can be determined by the figure. DAC obtained in 2B.

For example, when the memristor array is in computing mode, the memristors in the memristor array are in a conductive state that can be used for computing, and the bitline input voltage input to the column signal input does not change the conductance value of the memristor, e.g. , the calculation can be completed by performing multiplication and addition operations on the memristor array. For example, the word line input voltage is switched to the corresponding voltage range through the control signal WL_sw[1:M] of the multiplexer in the word line driving circuit in Figure 2B. For example, when a turn-on signal is applied, the word line input of the corresponding row Voltage setting is 5V, for example, when no turn-on signal is applied, the word line input voltage of the corresponding row is set to 0V, for example, the GND signal is turned on; through the control signal SL_sw[1:M of the multiplexer in the source line driver circuit in Figure 2B ] Switch the source line input voltage to the corresponding voltage range, for example, set the source line input voltage to 0V, so that the current signals from multiple row signal output terminals can flow into the data output circuit through the bit line drive circuit in Figure 2B The control signal BL_sw[1:N] of the multiplexer switches the bit line input voltage to the corresponding voltage range, for example, setting the bit line input voltage to 0.1V-0.3V, thereby using the memristor array to perform multiplication and addition operations.

For example, the data output circuit may include multiple transimpedance amplifiers (TIAs) and ADCs, and may convert current signals at multiple row signal output terminals into voltage signals and then into digital signals for subsequent processing.

Figure 2C is a schematic diagram of another memristor device. The structure of the memristor device shown in FIG. 2C is basically the same as that of the memristor device shown in FIG. 2B, and also includes a memristor array and its peripheral driving circuit. For example, as shown in FIG. 2C , the memristor device signal acquisition device, word line driving circuit, bit line driving circuit, source line driving circuit, memristor array and data output circuit.

For example, a memristor array includes M source lines, 2M word lines, and 2N bit lines, and a plurality of memristor cells arranged in M rows and N columns. For example, each memristor unit has a 2T2R structure, and the operation of mapping the parameter matrix used for transformation processing to multiple different memristor units in the memristor array will not be described again here. It should be noted that the memristor array may also include M source lines, M word lines and 2N bit lines, and a plurality of memristor units arranged in M rows and N columns.

For descriptions of the signal acquisition device, control drive circuit and data output circuit, please refer to the previous description and will not be repeated here.

Figure 2D shows the process of mapping the weight matrix of the Bayesian neural network to the memristor array. Memristor arrays are used to implement the weight matrix between layers in the Bayesian neural network. N memristors are used for each weight to implement the distribution corresponding to the weight. N is an integer greater than or equal to 2. For this weight According to the corresponding random probability distribution, N conductance values are calculated, and the N conductance value distributions are mapped to the N memristors. In this way, the weight matrix in the Bayesian neural network is converted into the target conductance value and mapped into the intersection sequence of the memristor array.

As shown in Figure 2D, the left side of the figure is a three-layer Bayesian neural network, which includes three neuron layers connected one by one. For example, the input layer includes layer 1 neurons, the hidden layer includes layer 2 neurons, and the output layer includes layer 3 neurons. For example, the input layer would receive The received input data is passed to the hidden layer, and the hidden layer calculates and converts the input data and sends it to the output layer. The output layer outputs the output structure of the Bayesian neural network.

As shown in Figure 2D, the input layer, hidden layer and output layer all include multiple neuron nodes, and the number of neuron nodes in each layer can be set according to different application situations. For example, the number of neurons in the input layer is 2 (including N ₁ and N ₂ ), the number of neurons in the middle hidden layer is 3 (including N ₃ , N ₄ and N ₅ ), and the number of neurons in the output layer is 1 (including N ₆ ).

As shown in Figure 2D, two adjacent neuron layers of the Bayesian neural network are connected by a weight matrix. For example, the weight matrix is implemented by a memristor array as shown on the right side of Figure 2D. For example, the weight parameters can be programmed directly to the conductance of the memristor array. For example, the weight parameters can also be mapped to the conductance of the memristor array according to a certain rule. For example, the difference in conductance of two memristors can also be used to represent a weight parameter. Although the present disclosure describes the technical solution of the present disclosure by directly programming the weight parameters to the conductance of the memristor array or mapping the weight parameters to the conductance of the memristor array according to a certain rule, this is only exemplary. , rather than limiting the disclosure.

The structure of the memristor array on the right side in FIG. 2D is, for example, as shown in FIG. 2A . The memristor array may include a plurality of memristors arranged in an array. In the example shown in Figure 2D, the weight connecting the input N1 and the output N3 is implemented by three memristors (G ₁₁ , G ₁₂ , G ₁₃ ), and other weights in the weight matrix can be implemented in the same way. More specifically, source line SL ₁ corresponds to neuron N ₃ , source line SL ₂ corresponds to neuron N ₄ , source line SL ₅ corresponds to neuron N ₅ , bit lines BL ₁ , BL ₂ and BL ₃ correspond to neuron N 1 , A weight between the input layer and the hidden layer (the weight between neuron N ₁ and neuron N ₃ ) is converted into three target conductance values according to the distribution, and the distribution is mapped into the cross sequence of the memristor array, here The target conductance values are G ₁₁ , G ₁₂ , and G ₁₃ , respectively, and are outlined with dashed lines in the memristor array.

Returning to Figure 1B, step S113: Input the state and hidden input variables corresponding to time t of the dynamic system as input signals to the weight-mapped memristor array, and use the memristor array to calculate the state and hidden input variables at time t. Processing is performed according to the Bayesian neural network, and the output signal corresponding to the processing result is obtained from the memristor array. The output signal is used to obtain the prediction result of the dynamic system at time t+1.

For example, in some embodiments of the present disclosure, the dynamic environment model is expressed as s _t+1 =f (s _t , _at ; W, ε), s _t is the state of the dynamic system at time t, and a _t is the object The action of the strategy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, s _t+1 is the prediction result of the dynamic system at time t+1; the object strategy is at The action at time t is a _t = π (s _t ; Wπ), π represents the function of the object strategy, Wπ represents the strategy parameters, and the weight matrix W of the Bayesian neural network satisfies the distribution W~q(W), the additive noise ε is the additive Gaussian noise ε~N(0,σ ² ).

For the memristor array, the input signal is a voltage signal and the output signal is a current signal. The output signal is read and analog-to-digital converted for subsequent processing. For example, the input sequence is applied to BL (Bit-line, bit line) in the form of voltage pulses, and then the output current flowing out of SL (Source-line, source line) is collected for further calculation and processing. For example, for a memristor device as shown in Figure 2B or 2C, the input sequence can be converted by a DAC into an analog voltage signal, which is applied to BL through a multiplexer. Correspondingly, the output current is obtained from SL, which can be converted into a voltage signal through a transimpedance amplifier, and converted into a digital signal through the ADC, and the digital signal can be used for subsequent processing. When N memristors read current and N is relatively large, the total output current shows a certain distribution, such as a distribution similar to Gaussian distribution or Laplace distribution. The total output current of all voltage pulses is the result of multiplying the input vector and the weight matrix. In a memristor crossbar array, such a parallel read operation is equivalent to implementing two operations of sampling and vector matrix multiplication.

The following uses Figure 3 to illustrate how to perform multiple predictions at multiple times based on the dynamic environment model and the object strategy to obtain a data sample set including the optimization cost of the object strategy corresponding to multiple times.

For example, the multiple times include time 1 to time T arranged in order from morning to night.

FIG. 3 shows a schematic flowchart of an example of step S102 in FIG. 1A.

As shown in Figure 3, step S102 may include the following steps S301 to S303.

Step S301: For any time t-1 from time 1 to time T, obtain the execution action a _t-1 from the object policy, and obtain the object policy at time t from a _t-1 = π(s _t-1 ; Wπ) -1 action a _t-1 .

For example, action a _t-1 is the optimal action selected by the object policy at time t-1.

Step S302: Calculate the state s t at the next time t after time t-1 according to the dynamic environment model s _t =f (s _t-1 , a _t-1 ; W, ε) and obtain the state corresponding to time _t The cost c _t of s _t , from which the cost sequence {c ₁ ,c ₂ ,...,c _t } from time 1 to time t is obtained. Based on the cost sequence, the optimization cost J _t-1 of time t is obtained, where 1≤ t≤T.

For example, in some embodiments of the present disclosure, examples of step S302 may include: sampling the latent input variable z from the p(z) distribution to obtain a sample; inputting the sample and the state s t-1 at time _t-1 to the weight The mapped memristor array obtains the predicted state s _t ; for the predicted state s _t , the cost c _t =c(s _t ) is obtained.

For example, the latent input variable z is first sampled from the p(z) distribution, and then the state s _t- 1 at time t- ₁ and the sample of the latent input variable are applied to the memristor array as a read (READ) voltage pulse. on BL, and then collect the output current flowing from SL for further calculation and processing to obtain the value corresponding to time t. Cost c _t . By performing the above operations on the state at any time from time 1 to time t, the cost sequence {c ₁ , c ₂ ,..., c _t } can be obtained.

Step S303: Obtain the data sample set {[a ₀ ,J ₀ ],...,[a _T-1 ,J _T-1 ]} from time 1 to time T.

For example, the expected value of the cost c _t at time t is E[c _t ], then the optimization cost at time t can be passed to get.

For example, in some embodiments of the present disclosure, the cost also includes cost changes caused by accidental uncertainty and cost changes caused by cognitive uncertainty. Accidental uncertainty is caused by hidden input variables, and cognitive uncertainty is caused by hidden input variables. The determinism is caused by the intrinsic noise of the memristor array.

For example, if the cost changes caused by accidental uncertainty and cognitive uncertainty are further considered, the optimization cost at time t can be calculated by To obtain, σ(η,θ) is a function of accidental uncertainty and epistemic uncertainty, eta represents accidental uncertainty, and θ represents epistemic uncertainty.

For any time between time 1 and time T, the data sample corresponding to that time can be obtained. The set of data samples from time 1 to time T is {[a ₀ ,J ₀ ],...,[a _T-1 ,J _T-1 ]}.

For example, the exemplary process of the above-mentioned strategy optimization method based on the dynamic environment model of the memristor array is as follows:

Input: Memristor array-based dynamic environment model and initial object strategy

n=1 to N, loop

Initialization state s ₀

t=1 to T, loop

Obtain the execution action a _t-1 from the object policy π

Use the dynamic environment model s _t =f (s _t-1 ,a _t-1 ; W, ε) based on the memristor array to predict the state s t at time _t and obtain the data sample [s _t ]

Calculate the cost and optimization cost corresponding to the object strategy at that moment
c _t =c(s _t )

Record[a _t-1 ,J _t-1 ]

Obtain the data sample set {[a ₀ ,J ₀ ],...,[a _T-1 ,J _T-1 ]}, and on the data sample set,

Policy search using the policy gradient optimization algorithm

End the loop of n

Output: optimized policy π

Figure 4 shows a schematic diagram of an example of a strategy optimization method provided by at least one embodiment of the present disclosure.

As shown in Figure 4, in an exemplary application, a ship is driven and thereby fights waves to get as close as possible to a target location on the coastline, whereby a control model for driving the ship needs to be trained. A ship at position (x, y) can choose an action (a _x , a _y ) that represents the direction and magnitude of the drive. However, due to the dynamic environment of the sea surface with waves, the subsequent position of the ship exhibits drift and disturbance. Also, the closer the location is to the coast, the greater the interference. Ships are given only a limited batch data set of spatial position transformations and cannot directly interact with the ocean environment to optimize action strategies to ensure safety. At this time, it is necessary to rely on empirical data to learn a marine environment model (dynamic environment model) that can predict the next state. Epistemic and accidental uncertainties will arise from missing information at unvisited locations and the randomness of the marine environment, respectively.

In this embodiment, for the control model, the sea surface is a dynamic environment, and the object strategy refers to the method used in the solution process of the ship from the current position to the target position. First, obtain the dynamic environment model and initial object policy for the dynamic environment. The initialization state of the ship is the current position. The execution action at the current moment is obtained from the object strategy. The dynamic environment model is used to predict the state (the position of the ship) at the next moment. The cost and optimization cost corresponding to the object strategy are calculated, and the action and optimization cost are recorded. composed of data samples. Assume that the current time is time 1. From time 1 to the subsequent time T, a data sample set is obtained, and the policy gradient optimization algorithm is used to perform a policy search on the data sample set to obtain an optimized object policy.

Figure 5 shows a schematic block diagram of a strategy optimization device 500 using a dynamic environment model based on a memristor array provided by at least one embodiment of the present disclosure. The strategy optimization device can be used to execute the data processing method shown in Figure 1A .

As shown in FIG. 5 , the strategy optimization device 500 includes an acquisition unit 501 , a calculation unit 502 and a strategy search unit 503 .

The acquisition unit 501 is configured to acquire a dynamic environment model based on the memristor array.

The computing unit 502 is configured to perform multiple predictions at multiple times according to the dynamic environment model and the object policy, and obtain a data sample set including the optimization cost of the object policy corresponding to multiple times.

The policy search unit 503 is configured to use a policy gradient optimization algorithm to perform a policy search based on a data sample set to optimize the object policy.

For example, the strategy optimization device 500 can be implemented using hardware, software, firmware, or any feasible combination thereof, and this disclosure is not limiting.

The technical effects of the above strategy optimization device are the same as those of the strategy optimization method shown in Figure 1A, and will not be described again here.

The following points need to be explained:

(1) The drawings of the embodiments of this disclosure only relate to the structures involved in the embodiments of this disclosure, and other structures can refer to common designs.

(2) Without conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other to obtain new embodiments.

The above are only specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. The protection scope of the present disclosure should be subject to the protection scope of the claims.

Claims

A strategy optimization method utilizing a dynamic environment model based on memristor arrays, including:

Obtain the dynamic environment model based on the memristor array;

Perform multiple predictions at multiple times based on the dynamic environment model and the object strategy to obtain a data sample set including the optimization costs of the object strategy corresponding to the multiple times;

Based on the data sample set, a policy gradient optimization algorithm is used to perform a policy search to optimize the object policy.
The strategy optimization method according to claim 1, wherein obtaining the dynamic environment model includes:

Obtaining a Bayesian neural network, wherein the Bayesian neural network has a trained weight matrix;

Obtain corresponding multiple target conductance values according to the weight matrix of the Bayesian neural network, and map the multiple target conductance values into the memristor array;

The state corresponding to the time t of the dynamic system and the hidden input variables are input into the weight-mapped memristor array as input signals, and the state of the time t and the hidden input variables are processed by the memristor array. Processing is performed according to the Bayesian neural network, and an output signal corresponding to the processing result is obtained from the memristor array, wherein the output signal is used to obtain the prediction result of the dynamic system at time t+1.
The strategy optimization method according to claim 2, wherein the expression of the dynamic environment model is s t+1 =f (s t , a t ; W, ε),

Where, s t is the state of the dynamic system at the time t, a t is the action of the object strategy at the time t, W is the weight matrix of the Bayesian neural network, and ε is the weight matrix corresponding to the memory. The additive noise of the resistor array, s t+1 is the predicted result of the dynamic system at the time t+1;

Among them, the action of the object strategy at time t is a t =π (s t ; Wπ), π represents the function of the object strategy, Wπ represents the strategy parameters, and the weight matrix W of the Bayesian neural network satisfies the distribution W ~q(W), the additive noise ε is additive Gaussian noise ε~N(0,σ 2 ).
The strategy optimization method according to claim 3, wherein the plurality of time moments include time 1 to time T arranged in order from early to late,

Perform multiple predictions at multiple times according to the dynamic environment model and the object policy. Measure and obtain the data sample set including the optimization cost of the object strategy corresponding to the multiple moments, including:

For any time t-1 from time 1 to time T, the execution action at -1 is obtained from the object policy,

The action at- 1 of the target strategy at the time t- 1 is obtained from a t -1 =π(s t-1; Wπ);

According to the dynamic environment model s t =f (s t-1 , a t-1 ; W, ε), the state s t of the next time t after the time t-1 is calculated and the state corresponding to the time t is obtained. The cost c t of the state s t of t, thus obtaining the cost sequence {c 1 , c 2 ,..., c t } from the time 1 to the time t,

Obtain the optimization cost J t-1 at the time t based on the cost sequence, where 1≤t≤T;

The data sample set {[a 0 ,J 0 ],...,[a T- 1 ,J T-1 ]} from the time 1 to the time T is obtained.
The strategy optimization method according to claim 4, wherein the expected value of the cost c t at the time t is E[c t ], then the optimization cost at the time t can be obtained by to get.
The strategy optimization method according to claim 4 or 5, wherein the cost also includes cost changes brought by accidental uncertainty and cost changes brought by cognitive uncertainty,

Wherein, the accidental uncertainty is caused by the implicit input variable, and the cognitive uncertainty is caused by the intrinsic noise of the memristor array.
The strategy optimization method according to claim 6, wherein the optimization cost at time t is calculated by To obtain, where σ(η,θ) is a function of the accidental uncertainty and the epistemic uncertainty, eta represents the accidental uncertainty, and θ represents the epistemic uncertainty.
The strategy optimization method according to any one of claims 4 to 7, wherein the value of the time t is calculated according to the dynamic environment model s t =f (s t-1 , at -1 ; W, ε) State s t and obtain the cost c t corresponding to the state s t at the time t, including:

Sample the latent input variable z from the p(z) distribution to obtain a sample;

Input the sample and the state s t-1 at time t -1 into the weight-mapped memristor array to obtain the predicted state s t ;

For the predicted state s t , the cost c t =c(s t ) is obtained.
The strategy optimization method according to any one of claims 1-8, wherein the strategy gradient Optimization algorithms include REINFORCE algorithm, PRO algorithm or TPRO algorithm.
A strategy optimization device utilizing a dynamic environment model based on a memristor array, including:

An acquisition unit configured to acquire the dynamic environment model based on the memristor array;

A computing unit configured to perform multiple predictions at multiple moments according to the dynamic environment model and the object strategy, and obtain a data sample set including the optimization cost of the object strategy corresponding to the multiple moments;

A policy search unit configured to use a policy gradient optimization algorithm to perform a policy search based on the data sample set to optimize the object policy.