WO2023217027A1 - Policy optimization method and apparatus using an environment model based on a memristor array - Google Patents

Policy optimization method and apparatus using an environment model based on a memristor array

Info

Publication number
WO2023217027A1
Authority
WO
WIPO (PCT)
Prior art keywords
time
strategy
environment model
policy
cost
Prior art date
Application number
PCT/CN2023/092475
Other languages
English (en)
Chinese (zh)
Inventor
高滨
林钰登
唐建石
吴华强
张清天
钱鹤
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学 filed Critical 清华大学
Publication of WO2023217027A1 publication Critical patent/WO2023217027A1/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • Embodiments of the present disclosure relate to a strategy optimization method and a strategy optimization device utilizing a dynamic environment model based on a memristor array.
  • ANN Artificial Neural Network
  • long-term mission planning using traditional artificial neural networks remains a challenge because they lack the ability to model uncertainty.
  • the inherent randomness (uncertainty) of real systems, that is, process noise, together with the approximation errors introduced by data-driven modeling, can cause the long-term estimates of artificial neural networks to deviate from the actual behavior of the system.
  • Probabilistic models provide a way to address uncertainty: they enable informed decisions based on the model's predictions while remaining cautious about the uncertainty of those predictions.
  • At least one embodiment of the present disclosure provides a strategy optimization method using a dynamic environment model based on a memristor array, including: obtaining a dynamic environment model based on a memristor array; performing predictions at multiple times according to the dynamic environment model and the object policy to obtain a data sample set including the optimization costs of the object policy corresponding to the multiple times; and, based on the data sample set, using a policy gradient optimization algorithm to perform a policy search to optimize the object policy.
  • obtaining a dynamic environment model includes: obtaining a Bayesian neural network that has a weight matrix obtained by training; obtaining multiple corresponding target conductance values according to the weight matrix of the Bayesian neural network and mapping the multiple target conductance values to the memristor array; and inputting the state and hidden input variables corresponding to time t of the dynamic system as an input signal to the weight-mapped memristor array.
  • the state and hidden input variables at time t are processed by the memristor array according to the Bayesian neural network, and an output signal corresponding to the processing result is obtained from the memristor array; the output signal is used to obtain the prediction result of the dynamic system at time t+1.
  • the action of the object strategy at time t is a_t = π(s_t; W_π), where π represents the function of the object strategy and W_π represents the strategy parameters; the weight matrix W of the Bayesian neural network satisfies the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise ε ~ N(0, σ²).
  • multiple times include time 1 to time T, arranged in chronological order.
  • the expected value of the cost c_t at time t is E[c_t]; the optimization cost at time t can then be obtained as J_t = E[c_1] + E[c_2] + … + E[c_t].
  • the cost also includes cost changes caused by aleatoric (accidental) uncertainty and cost changes caused by epistemic (cognitive) uncertainty.
  • Aleatoric uncertainty is caused by the hidden input variables.
  • Epistemic uncertainty is caused by the intrinsic noise of the memristor array.
  • the optimization cost at time t is obtained as J_t = Σ_{τ=1}^{t} (E[c_τ] + γ(η, ψ)), where γ(η, ψ) is a function of the aleatoric and epistemic uncertainty, η represents the aleatoric uncertainty, and ψ represents the epistemic uncertainty.
  • computing the cost c_t of the state s_t at time t includes: sampling the hidden input variable z from the p(z) distribution; and inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array.
  • the policy gradient optimization algorithm includes the REINFORCE algorithm, the PPO algorithm, or the TRPO algorithm.
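As an illustrative sketch (not part of the patent text), a minimal REINFORCE-style update over a recorded set of states, actions, and costs could look like the following; the linear-Gaussian policy, learning rate, and all names are assumptions made for illustration only:

```python
import numpy as np

def reinforce_update(theta, actions, costs, states, lr=0.01):
    """One REINFORCE step for a toy linear-Gaussian policy a = theta @ s + noise.

    actions, costs, and states are aligned lists collected from rollouts
    against the environment model; lower cost is better, so we step
    against the cost-weighted score function.
    """
    grad = np.zeros_like(theta)
    baseline = np.mean(costs)  # simple variance-reduction baseline
    for a, c, s in zip(actions, costs, states):
        mean = theta @ s
        # gradient of log N(a; mean, 1) w.r.t. theta is outer(a - mean, s)
        grad += (c - baseline) * np.outer(a - mean, s)
    # minimize expected cost: move against the averaged gradient
    return theta - lr * grad / len(actions)
```

Because the data sample set pairs actions with optimization costs, the update weights each log-probability gradient by the baseline-subtracted cost, which is the core of any policy gradient search over such a sample set.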
  • At least one embodiment of the present disclosure also provides a strategy optimization device using a dynamic environment model based on a memristor array, including: an acquisition unit configured to acquire a dynamic environment model based on the memristor array; a computing unit configured to perform predictions at multiple times according to the dynamic environment model and the object strategy to obtain a data sample set including the optimization costs of the object strategy corresponding to the multiple times; and a strategy search unit configured to use a policy gradient optimization algorithm to perform a strategy search based on the data sample set, so as to optimize the object strategy.
  • FIG. 1A shows a schematic flow chart of a strategy optimization method based on a dynamic environment model of a memristor array provided by at least one embodiment of the present disclosure
  • Figure 1B shows a schematic flow chart of step S101 in Figure 1A;
  • Figure 2A shows a schematic structure of a memristor array
  • Figure 2B is a schematic diagram of a memristor device
  • Figure 2C is a schematic diagram of another memristor device
  • Figure 2D shows a schematic diagram of mapping the weight matrix of a Bayesian neural network to a memristor array
  • FIG. 3 shows a schematic flow chart of step S102 in Figure 1A
  • Figure 4 shows a schematic diagram of an example of a strategy optimization method provided by at least one embodiment of the present disclosure
  • FIG. 5 shows a schematic block diagram of a strategy optimization device using a dynamic environment model based on a memristor array provided by at least one embodiment of the present disclosure.
  • model-free deep reinforcement learning (Deep Reinforcement Learning)
  • the agent usually needs to perform a large number of trial-and-error interactions with the real environment, and its data efficiency is low, so it cannot be applied to real tasks where the cost of trial and error is relatively high.
  • Model-based deep reinforcement learning can utilize data more efficiently.
  • the agent first learns the dynamic environment model from the historical experience of interacting with the real dynamic environment (such as state transition data collected in advance), and then interacts with the dynamic environment model to obtain a sub-optimal strategy.
  • the model-based reinforcement learning method learns an accurate dynamic environment model.
  • This model is used when training the agent, so the agent does not need to interact with the real environment many times.
  • the agent can "imagine" the experience of interacting with the real environment, which greatly improves data efficiency and suits real physical scenarios where the cost of obtaining data is high; at the same time, the dynamic environment model can predict unknown states of the environment, generalize the agent's cognition, serve as a new data source, and provide contextual information to help decision-making, which can alleviate the exploration-exploitation dilemma.
  • A Bayesian neural network is a probabilistic model that places a neural network in the Bayesian framework and can describe complex stochastic patterns; a Bayesian neural network with latent input variables (BNN with latent input variables, BNN+LV) can describe complex stochastic patterns through a distribution over latent inputs (aleatoric uncertainty), while accounting for model uncertainty through a distribution over the weights (epistemic uncertainty).
  • Hidden input variables refer to variables that cannot be directly observed, but have an impact on the state and output of the probability model.
  • the inventor described a method and device for implementing a Bayesian neural network using the intrinsic noise of memristors in Chinese invention patent application publication CN110956256A, which is hereby incorporated in its entirety as part of this application.
  • Bayesian neural networks include but are not limited to fully connected structures, convolutional neural network (CNN) structures, etc.
  • the network weight W is a random variable (W ⁇ q(W)) based on a certain distribution.
  • after training is completed, each weight of the Bayesian neural network is a distribution; for example, the weights are distributions independent of each other.
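To illustrate the idea that every weight is a distribution rather than a point value, the following minimal toy sketch (an assumption for illustration: an elementwise Gaussian q(W) and a single fully connected tanh layer) draws a fresh weight sample on every forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

def bnn_forward(x, w_mean, w_std):
    """One stochastic forward pass of a one-layer toy Bayesian network:
    draw W ~ q(W) = N(w_mean, w_std^2) elementwise, then apply the layer."""
    w = rng.normal(w_mean, w_std)  # fresh weight sample on each call
    return np.tanh(w @ x)

# Repeated calls yield a distribution over outputs, reflecting the
# epistemic uncertainty carried by the weight distribution.
```

With `w_std = 0` the network collapses to an ordinary deterministic layer, which makes the role of the weight distribution easy to see.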
  • At least one embodiment of the present disclosure provides a strategy optimization method using a dynamic environment model based on a memristor array, including: obtaining a dynamic environment model based on a memristor array; performing predictions at multiple times according to the dynamic environment model and the object policy to obtain a data sample set including the optimization costs of the object policy corresponding to the multiple times; and, based on the data sample set, using a policy gradient optimization algorithm to perform a policy search to optimize the object policy.
  • the policy optimization method uses a dynamic environment model based on a memristor array to generate a data sample set, implements long-term dynamic planning based on that model, and then uses a relatively stable algorithm, such as a policy gradient optimization algorithm, to conduct the policy search; it therefore avoids vanishing- and exploding-gradient problems and can effectively optimize the object strategy.
  • At least one embodiment of the present disclosure also provides a strategy optimization device corresponding to the above strategy optimization method.
  • FIG. 1A shows a schematic flow chart of a strategy optimization method based on a dynamic environment model of a memristor array provided by at least one embodiment of the present disclosure.
  • the strategy optimization method includes the following steps S101 to S103.
  • Step S101 Obtain a dynamic environment model based on the memristor array.
  • BNN+LV based on a memristor array can be used to model a dynamic system to obtain a dynamic environment model.
  • the specific steps are shown in Figure 1B and are not repeated here.
  • Step S102 Perform multiple predictions at multiple times based on the dynamic environment model and the object strategy, and obtain a data sample set including the optimization costs of the object strategy corresponding to multiple times.
  • the object policy involved is used in deep reinforcement learning, which can be, for example, a policy for an agent to maximize rewards or achieve a specific goal during its interaction with the environment.
  • Step S103 Based on the data sample set, use the policy gradient optimization algorithm to perform policy search to optimize the object policy.
  • the policy gradient optimization algorithm may include the REINFORCE algorithm, the PPO (Proximal Policy Optimization) algorithm, or the TRPO (Trust Region Policy Optimization) algorithm.
  • these policy gradient optimization methods are more stable and can effectively optimize object policies.
  • FIG. 1B shows a schematic flowchart of an example of step S101 in FIG. 1A.
  • step S101 may include the following steps S111 to S113.
  • Step S111 Obtain a Bayesian neural network, where the Bayesian neural network has a trained weight matrix.
  • the structure of the Bayesian neural network includes a fully connected structure or a convolutional neural network structure.
  • Each network weight of this Bayesian neural network is a random variable.
  • each weight is a distribution, such as Gaussian distribution or Laplace distribution.
  • the Bayesian neural network can be trained offline to obtain the weight matrix.
  • the method of training the Bayesian neural network can refer to conventional methods.
  • a central processing unit (CPU) or a graphics processing unit (GPU) can be used.
  • Step S112 Obtain corresponding multiple target conductance values according to the weight matrix of the Bayesian neural network, and map the multiple target conductance values to the memristor array.
  • the weight matrix is processed to obtain corresponding multiple target conductance values.
  • the weight matrix can be biased and scaled until it fits the conductance window of the memristor array being used.
  • the target conductance value is calculated based on the processed weight matrix and the conductance value of the memristor.
  • for details of the target conductance values, please refer to the relevant description of the memristor-based Bayesian neural network, which is not repeated here.
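The biasing-and-scaling step mentioned above can be sketched as a simple affine mapping of weights into an assumed conductance window; the window bounds `g_min` and `g_max` and the function name are illustrative values, not figures from this disclosure:

```python
import numpy as np

def weights_to_conductance(w, g_min=1e-6, g_max=10e-6):
    """Affinely shift and scale a weight matrix into the usable
    conductance window [g_min, g_max] (siemens) of a memristor array."""
    w = np.asarray(w, dtype=float)
    lo, hi = w.min(), w.max()
    if hi == lo:  # degenerate case: all weights equal
        return np.full_like(w, (g_min + g_max) / 2)
    scale = (g_max - g_min) / (hi - lo)
    return g_min + (w - lo) * scale
```

The inverse of the same affine map would be applied to the output currents so that the computation still realizes the original weights.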
  • FIG. 2A shows a schematic structure of a memristor array.
  • the memristor array is composed of, for example, multiple memristor units.
  • the multiple memristor units form an array of M rows and N columns, where M and N are both positive integers.
  • Each memristor cell includes a switching element and one or more memristors.
  • WL<1>, WL<2>…WL<M> respectively represent the word lines of the first row, the second row, …, the M-th row; the control electrode of the switching element (such as the gate of a transistor) in each row of memristor unit circuits is connected to the word line of that row. BL<1>, BL<2>…BL<N> respectively represent the bit lines of the first column, the second column, …, the N-th column; the memristor in each column of memristor unit circuits is connected to the bit line of that column.
  • SL<1>, SL<2>…SL<M> respectively represent the source lines of the first row, the second row, …, the M-th row; the source of the transistor in each row of memristor unit circuits is connected to the source line of that row. According to Kirchhoff's law, by setting the states (such as resistances) of the memristor units and applying corresponding word line and bit line signals, the memristor array can complete multiply-accumulate calculations in parallel.
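The parallel multiply-accumulate described above follows from Ohm's and Kirchhoff's laws: each output-line current is the sum over rows of voltage times conductance. A minimal idealized sketch (ignoring wire resistance and device nonlinearity; all numbers are made up for illustration):

```python
import numpy as np

def crossbar_mac(voltages, conductance):
    """Ideal M-row x N-column crossbar: row voltages applied in parallel,
    each column current is I_j = sum_i V_i * G_ij (multiply-accumulate)."""
    return voltages @ conductance

V = np.array([0.1, 0.2, 0.3])        # read voltages on 3 rows (volts)
G = np.array([[1e-6, 2e-6],
              [3e-6, 4e-6],
              [5e-6, 6e-6]])         # device conductances (siemens)
I = crossbar_mac(V, G)               # column currents (amperes)
```

A single read thus computes an entire vector-matrix product in one step, which is the physical basis of the in-memory computation used throughout this disclosure.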
  • FIG. 2B is a schematic diagram of a memristor device, which includes a memristor array and its peripheral driving circuit.
  • the memristor device includes a signal acquisition device, a word line driving circuit, a bit line driving circuit, a source line driving circuit, a memristor array, and a data output circuit.
  • the signal acquisition device is configured to convert a digital signal into a plurality of analog signals through a digital to analog converter (DAC), so as to be input to a plurality of column signal input terminals of the memristor array.
  • DAC digital to analog converter
  • a memristor array includes M source lines, M word lines, and N bit lines, and a plurality of memristor cells arranged in M rows and N columns.
  • the operation of the memristor array is implemented through a word line driving circuit, a bit line driving circuit and a source line driving circuit.
  • the word line driving circuit includes multiple multiplexers (Mux) for switching the word line input voltage; the bit line driving circuit includes multiple multiplexers for switching the bit line input voltage; the source line The driver circuit also includes multiple multiplexers (Mux) for switching the source line input voltage.
  • the source line driver circuit also includes multiple ADCs for converting analog signals into digital signals.
  • TIA Trans-Impedance Amplifier
  • a memristor array has an operation mode and a computation mode.
  • When the memristor array is in the operation mode, the memristor units are in an initialization state, and the values of the parameter elements in the parameter matrix can be written into the memristor array.
  • the source line input voltage, bit line input voltage and word line input voltage of the memristor are switched to corresponding preset voltage ranges through a multiplexer.
  • the word line input voltage is switched to the corresponding voltage range through the control signal WL_sw[1:M] of the multiplexer in the word line driving circuit in FIG. 2B.
  • for example, when performing a set operation on the memristor, the word line input voltage is set to 2V (volts); when performing a reset operation on the memristor, the word line input voltage is set to, for example, 5V. The word line input voltage can be obtained from the voltage signal V_WL[1:M] in Figure 2B.
  • the source line input voltage is switched to the corresponding voltage range through the control signal SL_sw[1:M] of the multiplexer in the source line driving circuit in FIG. 2B.
  • the source line input voltage is set to 0V.
  • the source line input voltage is set to 2V.
  • the source line input voltage can be obtained from the voltage signal V_SL[1:M] in Figure 2B.
  • the bit line input voltage is switched to the corresponding voltage range through the control signal BL_sw[1:N] of the multiplexer in the bit line driving circuit in FIG. 2B.
  • the bit line input voltage is set to 2V.
  • the bit line input voltage is set to 0V.
  • the bit line input voltage can be obtained from the DAC in Figure 2B.
  • in the computation mode, the memristors in the memristor array are in a conductive state usable for computing, and the bit line input voltage applied to the column signal input terminals does not change the conductance values of the memristors; for example, the calculation can be completed by performing multiply-add operations on the memristor array.
  • the word line input voltage is switched to the corresponding voltage range through the control signal WL_sw[1:M] of the multiplexer in the word line driving circuit in Figure 2B.
  • for example, when a turn-on signal is applied, the word line input voltage of the corresponding row is set to 5V; when no turn-on signal is applied, the word line input voltage of the corresponding row is set to 0V (the GND signal is connected). Through the control signal SL_sw[1:M] of the multiplexer in the source line driver circuit in Figure 2B, the source line input voltage is switched to the corresponding voltage range, for example 0V, so that the current signals from multiple row signal output terminals can flow into the data output circuit through the bit line drive circuit in Figure 2B.
  • through the control signal BL_sw[1:N] of the multiplexer in the bit line driving circuit, the bit line input voltage is switched to the corresponding voltage range, for example 0.1V-0.3V, thereby using the memristor array to perform multiply-add operations.
  • the data output circuit may include multiple transimpedance amplifiers (TIAs) and ADCs, and may convert current signals at multiple row signal output terminals into voltage signals and then into digital signals for subsequent processing.
  • TIAs transimpedance amplifiers
  • ADCs analog to digital converters
  • Figure 2C is a schematic diagram of another memristor device.
  • the structure of the memristor device shown in FIG. 2C is basically the same as that of the memristor device shown in FIG. 2B, and also includes a memristor array and its peripheral driving circuit.
  • the memristor device includes a signal acquisition device, a word line driving circuit, a bit line driving circuit, a source line driving circuit, a memristor array and a data output circuit.
  • a memristor array includes M source lines, 2M word lines, and 2N bit lines, and a plurality of memristor cells arranged in M rows and N columns.
  • each memristor unit has a 2T2R structure, and the operation of mapping the parameter matrix used for transformation processing to multiple different memristor units in the memristor array will not be described again here.
  • the memristor array may also include M source lines, M word lines and 2N bit lines, and a plurality of memristor units arranged in M rows and N columns.
  • Figure 2D shows the process of mapping the weight matrix of the Bayesian neural network to the memristor array.
  • Memristor arrays are used to implement the weight matrix between layers in the Bayesian neural network.
  • N memristors are used for each weight to implement the distribution corresponding to the weight.
  • N is an integer greater than or equal to 2.
  • N conductance values are calculated for each weight, and these N conductance values are mapped to the N memristors; in this way, the weight matrix in the Bayesian neural network is converted into target conductance values and mapped into the crossbar array of the memristor array.
  • the left side of the figure is a three-layer Bayesian neural network, which includes three neuron layers connected one by one.
  • the input layer includes layer 1 neurons
  • the hidden layer includes layer 2 neurons
  • the output layer includes layer 3 neurons.
  • the input layer passes the received input data to the hidden layer; the hidden layer calculates and transforms the input data and sends it to the output layer.
  • the output layer outputs the output result of the Bayesian neural network.
  • the input layer, hidden layer and output layer all include multiple neuron nodes, and the number of neuron nodes in each layer can be set according to different application situations.
  • the number of neurons in the input layer is 2 (including N 1 and N 2 )
  • the number of neurons in the middle hidden layer is 3 (including N 3 , N 4 and N 5 )
  • the number of neurons in the output layer is 1 (including N 6 ).
  • the weight matrix is implemented by a memristor array as shown on the right side of Figure 2D.
  • the weight parameters can be programmed directly to the conductance of the memristor array.
  • the weight parameters can also be mapped to the conductance of the memristor array according to a certain rule.
  • the difference in conductance of two memristors can also be used to represent a weight parameter.
  • the structure of the memristor array on the right side in FIG. 2D is, for example, as shown in FIG. 2A .
  • the memristor array may include a plurality of memristors arranged in an array.
  • the weight connecting the input N1 and the output N3 is implemented by three memristors (G 11 , G 12 , G 13 ), and other weights in the weight matrix can be implemented in the same way.
  • source line SL 1 corresponds to neuron N 3
  • source line SL 2 corresponds to neuron N 4
  • source line SL 3 corresponds to neuron N 5
  • bit lines BL 1 , BL 2 and BL 3 correspond to neuron N 1
  • a weight between the input layer and the hidden layer is converted, according to its distribution, into three target conductance values, which are mapped into the crossbar array of the memristor array.
  • here the target conductance values are G 11 , G 12 and G 13 , respectively, outlined with dashed lines in the memristor array.
  • Step S113: Input the state and hidden input variables corresponding to time t of the dynamic system as input signals to the weight-mapped memristor array; the memristor array processes the state and hidden input variables at time t according to the Bayesian neural network, and the output signal corresponding to the processing result is obtained from the memristor array. The output signal is used to obtain the prediction result of the dynamic system at time t+1.
  • the prediction result of the dynamic system satisfies s_{t+1} = f(s_t, a_t, z; W) + ε, where f is the function implemented by the Bayesian neural network, W is the weight matrix of the Bayesian neural network, z is the hidden input variable, ε is the additive noise corresponding to the memristor array, and s_{t+1} is the prediction result of the dynamic system at time t+1
  • the action of the object strategy at time t is a_t = π(s_t; W_π), where π represents the function of the object strategy and W_π represents the strategy parameters
  • the weight matrix W of the Bayesian neural network satisfies the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise ε ~ N(0, σ²).
  • the input signal is a voltage signal and the output signal is a current signal.
  • the output signal is read and analog-to-digital converted for subsequent processing.
  • the input sequence is applied to BL (Bit-line, bit line) in the form of voltage pulses, and then the output current flowing out of SL (Source-line, source line) is collected for further calculation and processing.
  • BL Bit-line, bit line
  • SL Source-line, source line
  • the input sequence can be converted by a DAC into an analog voltage signal, which is applied to BL through a multiplexer.
  • the output current is obtained from SL, which can be converted into a voltage signal through a transimpedance amplifier, and converted into a digital signal through the ADC, and the digital signal can be used for subsequent processing.
  • when the read currents of N memristors are summed and N is relatively large, the total output current shows a certain distribution, such as one similar to a Gaussian distribution or a Laplace distribution.
  • the total output current of all voltage pulses is the result of multiplying the input vector and the weight matrix.
  • for a memristor crossbar array, such a parallel read operation is equivalent to implementing the two operations of sampling and vector-matrix multiplication.
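The equivalence of a parallel read to "sampling plus vector-matrix multiplication" can be sketched as follows, modeling the intrinsic device noise as a relative Gaussian fluctuation around each programmed conductance; the noise model and its magnitude are assumptions for illustration, not device data from this disclosure:

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_read(voltages, g_target, rel_noise=0.05):
    """One parallel read of a crossbar: each device conducts around its
    programmed value with intrinsic fluctuation, so the returned currents
    are themselves samples of a stochastic vector-matrix product."""
    g = g_target * (1 + rel_noise * rng.standard_normal(g_target.shape))
    return voltages @ g

# Repeating the read gives a distribution of output currents; with many
# devices per weight, the summed current approaches a bell-shaped law.
```

Setting `rel_noise=0` recovers the ideal deterministic multiply-accumulate, making clear that the stochasticity (and hence the sampling operation) comes entirely from the device noise.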
  • the multiple times include time 1 to time T, arranged in chronological order.
  • FIG. 3 shows a schematic flowchart of an example of step S102 in FIG. 1A.
  • step S102 may include the following steps S301 to S303.
  • action a t-1 is the optimal action selected by the object policy at time t-1.
  • the cost c_t of the state s_t is computed, from which the cost sequence {c_1, c_2, …, c_t} from time 1 to time t is obtained.
  • the optimization cost J_{t-1} corresponding to time t is obtained, where 1 ≤ t ≤ T.
  • examples of step S302 may include: sampling the latent input variable z from the p(z) distribution to obtain a sample; and inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array.
  • that is, the latent input variable z is first sampled from the p(z) distribution; then the state s_{t-1} at time t-1 and the sample of the latent input variable are applied as read (READ) voltage pulses to the BL of the memristor array, and the output current flowing from the SL is collected for further calculation and processing to obtain the value corresponding to time t.
  • by performing the above operations on the state at each time from time 1 to time t, the cost at each time and thus the cost sequence {c_1, c_2, …, c_t} can be obtained.
  • Step S303 Obtain the data sample set ⁇ [a 0 ,J 0 ],...,[a T-1 ,J T-1 ] ⁇ from time 1 to time T.
  • the expected value of the cost c_t at time t is E[c_t]; the optimization cost at time t can then be obtained as J_t = E[c_1] + E[c_2] + … + E[c_t].
  • the cost also includes cost changes caused by aleatoric (accidental) uncertainty and cost changes caused by epistemic (cognitive) uncertainty.
  • Aleatoric uncertainty is caused by the hidden input variables.
  • Epistemic uncertainty is caused by the intrinsic noise of the memristor array.
  • the optimization cost at time t can be obtained as J_t = Σ_{τ=1}^{t} (E[c_τ] + γ(η, ψ)), where γ(η, ψ) is a function of the aleatoric and epistemic uncertainty, η represents the aleatoric uncertainty, and ψ represents the epistemic uncertainty.
  • the set of data samples from time 1 to time T is ⁇ [a 0 ,J 0 ],...,[a T-1 ,J T-1 ] ⁇ .
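Assuming the optimization cost J_t is the running sum of expected per-step costs, with the expectation estimated from K model rollouts (a sketch of one plausible reading; the disclosure's exact formula images are not reproduced in this text), the computation could look like:

```python
import numpy as np

def optimization_costs(cost_samples):
    """cost_samples: array of shape (K, T) holding K rollouts of T step
    costs. Returns J_1..J_T where J_t = sum over tau <= t of E[c_tau],
    with each expectation estimated by the mean across rollouts."""
    cost_samples = np.asarray(cost_samples, dtype=float)
    expected = cost_samples.mean(axis=0)  # estimate of E[c_t] per step
    return np.cumsum(expected)            # running sums J_1..J_T
```

Pairing each J_{t-1} with the action a_{t-1} then yields the data sample set {[a_0, J_0], …, [a_{T-1}, J_{T-1}]} used for the policy search.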
  • the exemplary process of the above-mentioned strategy optimization method based on the dynamic environment model of the memristor array is as follows:
  • Figure 4 shows a schematic diagram of an example of a strategy optimization method provided by at least one embodiment of the present disclosure.
  • a ship is driven against the waves to get as close as possible to a target location on the coastline, so a control model for driving the ship needs to be trained.
  • a ship at position (x, y) can choose an action (a x , a y ) that represents the direction and magnitude of the drive.
  • the subsequent position of the ship exhibits drift and disturbance.
  • the closer the location is to the coast the greater the interference.
  • The ship is given only a limited batch data set of spatial position transitions and, for safety, cannot directly interact with the ocean environment to optimize its action strategy. In this case, it is necessary to rely on empirical data to learn a marine environment model (dynamic environment model) that can predict the next state. Epistemic and aleatoric uncertainties arise from missing information at unvisited locations and from the randomness of the marine environment, respectively.
  • the sea surface is a dynamic environment
  • the object strategy refers to the method used by the ship to travel from the current position to the target position.
  • to optimize the strategy, the dynamic environment model and the initial object policy for the dynamic environment are first obtained; the initialization state of the ship is its current position.
  • the execution action at the current moment is obtained from the object strategy.
  • the dynamic environment model is used to predict the state (the position of the ship) at the next moment.
  • the cost and optimization cost corresponding to the object strategy are calculated, and the action and optimization cost are recorded as a data sample. Assume that the current time is time 1; from time 1 to the subsequent time T, a data sample set is obtained, and the policy gradient optimization algorithm is used to perform a policy search on the data sample set to obtain an optimized object policy.
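The ship example above can be sketched as a rollout loop against a learned environment model; here the model, policy, and cost function are simple stand-ins (deterministic drift toward the action, no wave disturbance), so every name and dynamic below is an assumption for illustration only:

```python
import numpy as np

def rollout(policy, model, cost_fn, s0, T):
    """Roll a policy through a learned environment model for T steps,
    recording (action, cumulative cost) pairs as the data sample set."""
    s, J, samples = np.asarray(s0, dtype=float), 0.0, []
    for _ in range(T):
        a = policy(s)       # action from the object policy
        s = model(s, a)     # predicted next state (ship position)
        J += cost_fn(s)     # e.g. distance to the target location
        samples.append((a.copy(), J))
    return samples

# Placeholder dynamics: the ship drifts half-way toward the target each step.
target = np.array([1.0, 1.0])
policy = lambda s: 0.5 * (target - s)
model = lambda s, a: s + a
cost_fn = lambda s: float(np.linalg.norm(target - s))
data = rollout(policy, model, cost_fn, np.zeros(2), T=3)
```

The resulting `data` plays the role of the data sample set {[a_0, J_0], …, [a_{T-1}, J_{T-1}]} on which a policy gradient search would then be run.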
  • Figure 5 shows a schematic block diagram of a strategy optimization device 500 using a dynamic environment model based on a memristor array provided by at least one embodiment of the present disclosure.
  • the strategy optimization device can be used to execute the strategy optimization method shown in Figure 1A.
  • the strategy optimization device 500 includes an acquisition unit 501 , a calculation unit 502 and a strategy search unit 503 .
  • the acquisition unit 501 is configured to acquire a dynamic environment model based on the memristor array.
  • the computing unit 502 is configured to perform multiple predictions at multiple times according to the dynamic environment model and the object policy, and obtain a data sample set including the optimization cost of the object policy corresponding to multiple times.
  • the policy search unit 503 is configured to use a policy gradient optimization algorithm to perform a policy search based on a data sample set to optimize the object policy.
  • the strategy optimization device 500 can be implemented using hardware, software, firmware, or any feasible combination thereof, and the present disclosure is not limited in this respect.
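As a minimal software mock only (the disclosure leaves the implementation open to hardware, software, firmware, or any combination), the three units of device 500 and their call flow might be sketched as:

```python
class StrategyOptimizationDevice:
    """Hypothetical software mock of device 500: acquisition unit 501,
    computing unit 502, and policy search unit 503."""

    def __init__(self, acquire_model, run_predictions, search_policy):
        self.acquisition_unit = acquire_model      # unit 501
        self.computing_unit = run_predictions      # unit 502
        self.policy_search_unit = search_policy    # unit 503

    def optimize(self, policy, num_times):
        model = self.acquisition_unit()                          # get env model
        samples = self.computing_unit(model, policy, num_times)  # T predictions
        return self.policy_search_unit(policy, samples)          # policy search

# Toy plumbing, purely to show the call flow between the three units:
device = StrategyOptimizationDevice(
    acquire_model=lambda: (lambda s, a: s + a),                # stand-in model
    run_predictions=lambda m, p, T: [(t, p(t)) for t in range(T)],
    search_policy=lambda p, samples: p,                        # no-op search
)
optimized = device.optimize(policy=lambda t: -t, num_times=5)
```

The unit boundaries mirror the description: unit 501 supplies the memristor-array-based model, unit 502 rolls it out over multiple times to build the data sample set, and unit 503 runs the policy-gradient search over that set.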


Abstract

Disclosed are a policy optimization method and apparatus using a dynamic environment model based on a memristor array. The method comprises: acquiring a dynamic environment model based on a memristor array; performing multiple predictions at a plurality of moments according to the dynamic environment model and an object policy, so as to obtain a data sample set comprising optimization costs of the object policy corresponding to the plurality of moments; and, on the basis of the data sample set, performing a policy search using a policy gradient optimization algorithm so as to optimize the object policy. In this method, a data sample set is generated using a dynamic environment model based on a memristor array, long-term dynamic planning based on the dynamic environment model is carried out, and a policy search is then performed using a more stable algorithm, such as a policy gradient optimization algorithm, so that the object policy can be effectively optimized.
PCT/CN2023/092475 2022-05-09 2023-05-06 Policy optimization method and apparatus using a memristor-array-based environment model WO2023217027A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210497721.2 2022-05-09
CN202210497721.2A CN114819093A (zh) Policy optimization method and apparatus using a memristor-array-based environment model

Publications (1)

Publication Number Publication Date
WO2023217027A1 true WO2023217027A1 (fr) 2023-11-16

Family

ID=82512800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092475 WO2023217027A1 (fr) 2022-05-09 2023-05-06 Procédé et appareil d'optimisation de politique utilisant un modèle d'environnement basé sur un réseau de memristances

Country Status (2)

Country Link
CN (1) CN114819093A (fr)
WO (1) WO2023217027A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819093A (zh) * 2022-05-09 2022-07-29 清华大学 Policy optimization method and apparatus using a memristor-array-based environment model
CN116300477A (zh) * 2023-05-19 2023-06-23 江西金域医学检验实验室有限公司 Closed-space environment regulation method and system, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543827A (zh) * 2018-12-02 2019-03-29 清华大学 Generative adversarial network device and training method
CN110956256A (zh) * 2019-12-09 2020-04-03 清华大学 Method and device for implementing a Bayesian neural network using the intrinsic noise of memristors
US20210133541A1 (en) * 2019-10-31 2021-05-06 Micron Technology, Inc. Spike Detection in Memristor Crossbar Array Implementations of Spiking Neural Networks
CN113505887A (zh) * 2021-09-12 2021-10-15 浙江大学 Memristor memory neural network training method targeting memristor errors
CN114067157A (zh) * 2021-11-17 2022-02-18 中国人民解放军国防科技大学 Memristor-based neural network optimization method and device, and memristor array
CN114819093A (zh) * 2022-05-09 2022-07-29 清华大学 Policy optimization method and apparatus using a memristor-array-based environment model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN, YUDENG: "The Research on Resistive Random-access Memory Array Based Neural Network", MASTER’S THESIS, no. 202007, 15 July 2020 (2020-07-15), China, pages 1 - 65, XP009550291, DOI: 10.27135/d.cnki.ghudu.2019.002032 *

Also Published As

Publication number Publication date
CN114819093A (zh) 2022-07-29

Similar Documents

Publication Publication Date Title
US11348002B2 (en) Training of artificial neural networks
WO2023217027A1 (fr) Policy optimization method and apparatus using a memristor-array-based environment model
CN112183739B (zh) 基于忆阻器的低功耗脉冲卷积神经网络的硬件架构
US10708522B2 (en) Image sensor with analog sample and hold circuit control for analog neural networks
US10740671B2 (en) Convolutional neural networks using resistive processing unit array
JP7427030B2 (ja) 人工ニューラル・ネットワークのトレーニング方法、装置、プログラム
US11188825B2 (en) Mixed-precision deep-learning with multi-memristive devices
US11620505B2 (en) Neuromorphic package devices and neuromorphic computing systems
JP2022554371A (ja) メモリスタに基づくニューラルネットワークの並列加速方法およびプロセッサ、装置
US11087204B2 (en) Resistive processing unit with multiple weight readers
WO2023217021A1 (fr) Procédé de traitement de données basé sur un réseau de memristances, et appareil de traitement de données
US20210319293A1 (en) Neuromorphic device and operating method of the same
WO2023217017A1 (fr) Procédé et dispositif d'inférence variationnelle pour réseau neuronal bayésien basé sur un réseau de memristances
US11301752B2 (en) Memory configuration for implementing a neural network
US20210374546A1 (en) Row-by-row convolutional neural network mapping for analog artificial intelligence network training
US20230113627A1 (en) Electronic device and method of operating the same
KR20210143614A (ko) 뉴럴 네트워크를 구현하는 뉴로모픽 장치 및 그 동작 방법
Fang et al. Neuromorphic algorithm-hardware codesign for temporal pattern learning
Spoon et al. Accelerating deep neural networks with analog memory devices
CN115796252A (zh) 权重写入方法及装置、电子设备和存储介质
CN114861902A (zh) 处理单元及其操作方法、计算芯片
Qiu et al. Neuromorphic acceleration for context aware text image recognition
Irmanova et al. Discrete‐level memristive circuits for HTM‐based spatiotemporal data classification system
CN117808062A (zh) 计算装置、电子装置以及用于计算装置的操作方法
US20240021242A1 (en) Memory-based neuromorphic device and operating method thereof

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23802799

Country of ref document: EP

Kind code of ref document: A1