CN109190760A - Neural network training method and device and environment processing method and device - Google Patents

Neural network training method and device and environment processing method and device

Info

Publication number
CN109190760A
Authority
CN
China
Prior art keywords
current
neural network
reward
training period
period
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810885459.2A
Other languages
Chinese (zh)
Other versions
CN109190760B (en)
Inventor
邓煜彬
余可
吕健勤
林达华
汤晓鸥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201810885459.2A, patent CN109190760B
Publication of CN109190760A
Application granted
Publication of CN109190760B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The present disclosure relates to a neural network training method and apparatus and an environment processing method and apparatus. The method includes: inputting an environment state vector of a current training period into a neural network to obtain an action output and a measurement output; determining a first reward and punishment feedback according to the environment state vector and the action output; determining a model loss of the neural network according to the first reward and punishment feedback, a first reward and punishment feedback of a historical training period and the measurement output; adjusting network parameter values of the neural network according to the model loss; and obtaining the trained neural network when the neural network satisfies a training condition. According to the neural network training method of the embodiments of the present disclosure, the model loss is determined from a plurality of reward and punishment feedbacks of the current training period and the historical training period, so that the training of the neural network is less likely to fall into a locally optimal solution and a neural network with a higher goodness of fit can be obtained.

Description

Neural network training method and device and environment processing method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a neural network training method and apparatus and an environment processing method and apparatus.
Background
In the related art, an environment state vector describing the environment of the current period can be input into a neural network to obtain action output and measurement output, the action output acts on the environment, the environment can be changed due to the action output to obtain the environment of the next period, and meanwhile reward and punishment feedback to the action output is obtained. And determining a loss function of the neural network according to the reward and punishment feedback so as to train the neural network. However, the loss function is determined by the training method according to single reward and punishment feedback, so that the training process is easy to fall into a local optimal solution, and a neural network with high fitting goodness is difficult to obtain.
Disclosure of Invention
The disclosure provides a neural network training method and device and an environment processing method and device.
According to an aspect of the present disclosure, there is provided a neural network training method, including:
inputting the environmental state vector of the current training period into a neural network for processing to obtain the action output of the current training period and the measurement output of the current training period;
determining a first reward and punishment feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
determining a model loss of the neural network according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of a historical training period and the measurement output of the current training period, wherein the historical training period comprises one or more training periods before the current training period;
adjusting network parameter values of the neural network according to the model loss;
and when the neural network meets the training condition, obtaining the trained neural network.
According to the neural network training method of the embodiments of the present disclosure, the action output of the current training period and the measurement output of the current training period can be obtained by inputting the environment state vector into the neural network, and the model loss is determined according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period and the measurement output. Because the model loss is determined through a plurality of reward and punishment feedbacks, the process of training the neural network is less likely to fall into a locally optimal solution, and a neural network with a higher goodness of fit can be obtained.
In one possible implementation, the measurement output of the current period includes a first measurement output of the current training period and a second measurement output of the current training period.
In one possible implementation, the model loss includes a first model loss corresponding to the first metric output, a second model loss corresponding to the second metric output, and a third model loss corresponding to the action output.
In one possible implementation manner, determining the model loss of the neural network according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period, and the measurement output of the current training period includes:
determining the first model loss according to the first reward and punishment feedback of the current training period and the first measurement output of the current training period;
determining the second model loss according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period, and the second measurement output of the current training period;
and determining the third model loss according to the first model loss and the second model loss.
In one possible implementation, determining the first model loss according to the first reward punishment feedback of the current training period and the first measurement output of the current training period includes:
determining a first accumulated discount reward and punishment feedback according to the first reward and punishment feedback of the current training period, a preset discount rate and a first expected reward and punishment function;
and determining the first model loss according to the first accumulated discount reward and punishment feedback and the first measurement output.
In one possible implementation, determining the second model loss according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period, and the second measurement output of the current training period includes:
determining a second reward and punishment feedback of the current training period according to the first reward and punishment feedback of the current training period and the first reward and punishment feedback of the historical training period;
determining a second accumulated discount reward and punishment feedback according to a second reward and punishment feedback of the current training period, a preset discount rate and a second expected reward and punishment function;
and determining the second model loss according to the second accumulated discount reward and punishment feedback and the second measurement output.
In one possible implementation, determining a second reward and punishment feedback of the current training period according to the first reward and punishment feedback of the current training period and the first reward and punishment feedback of the historical training period includes:
determining a reward and punishment change vector according to the first reward and punishment feedback of the historical training period and the first reward and punishment feedback of the current training period;
obtaining a variation accumulated vector according to the reward and punishment variation vector;
obtaining a zero fluctuation reward and punishment vector according to the change accumulated vector;
and determining the second reward punishment feedback according to the change accumulated vector and the zero fluctuation reward punishment vector.
In this way, the second reward and punishment feedback can be obtained from the first reward and punishment feedback of a plurality of historical training periods, so that more effective reward and punishment feedback can be obtained in the training process, the neural network can be applied to more complex environments, and the performance of the neural network is improved.
In one possible implementation, the method further includes:
and determining the environmental state vector of the current training period according to the environmental state vector of the previous training period of the current training period and the action output of the previous training period.
In one possible implementation, the method further includes:
and when the current training period is the first training period for training the neural network, obtaining the environmental state vector of the current training period according to a preset initial environmental state vector and random action output.
By this method, the initial environment can quickly become an environment suitable for training the neural network, so that the neural network can obtain effective reward and punishment feedback, the efficiency of training the neural network is improved, and the training process is less likely to fall into a locally optimal solution.
According to another aspect of the present disclosure, there is provided an environmental processing method, the method including:
inputting the environmental state vector of the current period into the neural network for processing to obtain the action output of the current period;
and determining the environment state vector of the next period of the current period and the first reward and punishment feedback of the current period according to the environment state vector of the current period and the action output of the current period.
According to another aspect of the present disclosure, there is provided a neural network training device including:
the input module is used for inputting the environmental state vector of the current training period into the neural network for processing to obtain the action output of the current training period and the measurement output of the current training period;
the feedback determining module is used for determining first reward punishment feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
a model loss determination module, configured to determine a model loss of the neural network according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of a historical training period, and a measurement output of the current training period, where the historical training period includes one or more training periods before the current training period;
the adjusting module is used for adjusting the network parameter value of the neural network according to the model loss;
and the neural network obtaining module is used for obtaining the trained neural network when the neural network meets the training condition.
In one possible implementation, the measurement output of the current period includes a first measurement output of the current training period and a second measurement output of the current training period.
In one possible implementation, the model loss includes a first model loss corresponding to the first metric output, a second model loss corresponding to the second metric output, and a third model loss corresponding to the action output.
In one possible implementation, the model loss determination module is further configured to:
determining the first model loss according to the first reward and punishment feedback of the current training period and the first measurement output of the current training period;
determining the second model loss according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period and the second measurement output of the current training period;
and determining the third model loss according to the first model loss and the second model loss.
In one possible implementation, the model loss determination module is further configured to:
determining a first accumulated discount reward and punishment feedback according to the first reward and punishment feedback, a preset discount rate and a first expected reward and punishment function of the current training period;
and determining the first model loss according to the first accumulated discount reward and punishment feedback and the first measurement output.
In one possible implementation, the model loss determination module is further configured to:
determining a second reward and punishment feedback of the current training period according to the first reward and punishment feedback of the current training period and the first reward and punishment feedback of the historical training period;
determining a second accumulated discount reward and punishment feedback according to a second reward and punishment feedback of the current training period, a preset discount rate and a second expected reward and punishment function;
and determining the second model loss according to the second accumulated discount reward and punishment feedback and the second measurement output.
In one possible implementation, the model loss determination module is further configured to:
determining a reward and punishment change vector according to the first reward and punishment feedback of the historical training period and the first reward and punishment feedback of the current training period;
obtaining a variation accumulated vector according to the reward and punishment variation vector;
obtaining a zero fluctuation reward and punishment vector according to the change accumulated vector;
and determining the second reward punishment feedback according to the change accumulated vector and the zero fluctuation reward punishment vector.
In one possible implementation, the apparatus further includes:
and the first determining module is used for determining the environmental state vector of the current training period according to the environmental state vector of the previous training period of the current training period and the action output of the previous training period.
In one possible implementation, the apparatus further includes:
and the second determining module is used for obtaining the environmental state vector of the current training period according to a preset initial environmental state vector and random action output when the current training period is the first training period for training the neural network.
According to another aspect of the present disclosure, there is provided an environment processing apparatus, the apparatus including:
an action output obtaining module, configured to input the environment state vector of the current period into the neural network trained by the neural network training method described above for processing, so as to obtain an action output of the current period;
and the third determining module is used for determining the environmental state vector of the next period of the current period and the first reward and punishment feedback of the current period according to the environmental state vector of the current period and the action output of the current period.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the neural network training method described above.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the environment processing method described above.
According to another aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the neural network training method described above.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described environment processing method.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a flow diagram of a neural network training method in accordance with an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of an environment processing method according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of an application of a neural network training method according to an embodiment of the present disclosure;
FIG. 4 shows a block diagram of a neural network training device, in accordance with an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a neural network training device, in accordance with an embodiment of the present disclosure;
FIG. 6 shows a block diagram of an environment processing device according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device shown in accordance with an exemplary embodiment;
FIG. 8 is a block diagram illustrating an electronic device according to an exemplary embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flow diagram of a neural network training method according to an embodiment of the present disclosure. As shown in fig. 1, the method includes:
in step S11, the environmental state vector of the current training period is input into the neural network for processing, and the action output of the current training period and the measurement output of the current training period are obtained;
in step S12, determining a first reward and punishment feedback of the current training period according to the environmental state vector of the current training period and the action output of the current training period;
in step S13, determining a model loss of the neural network according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of a historical training period, and a measurement output of the current training period, wherein the historical training period includes one or more training periods before the current training period;
in step S14, adjusting network parameter values of the neural network according to the model loss;
in step S15, when the neural network satisfies the training condition, a trained neural network is obtained.
According to the neural network training method of the embodiments of the present disclosure, the action output of the current training period and the measurement output of the current training period can be obtained by inputting the environment state vector into the neural network, and the model loss is determined according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period and the measurement output. Because the model loss is determined through a plurality of reward and punishment feedbacks, the process of training the neural network is less likely to fall into a locally optimal solution, and a neural network with a higher goodness of fit can be obtained.
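Purely for illustration, the following Python sketch outlines one training period corresponding to steps S11 to S15. The network is assumed to return the action output and the two measurement outputs, `env_step` and `compute_model_loss` are hypothetical helpers not defined by the patent, and PyTorch-style optimizer and tensor objects are assumed.

```python
def train_one_period(net, optimizer, s_t, history_rewards, env_step, compute_model_loss):
    """Hypothetical sketch of steps S11-S15 for one training period (PyTorch-style objects assumed)."""
    # Step S11: input the environment state vector of the current training period into the
    # neural network to obtain the action output and the two measurement outputs.
    action_out, v1_out, v2_out = net(s_t)

    # Step S12: apply the action output to the environment of the current training period
    # to obtain the first reward and punishment feedback (and the next environment state).
    r_t, s_next = env_step(s_t, action_out)

    # Step S13: determine the model loss from the first reward and punishment feedback of the
    # current period, those of the historical periods, and the measurement outputs.
    loss = compute_model_loss(r_t, history_rewards, v1_out, v2_out, action_out)

    # Step S14: adjust the network parameter values according to the model loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    history_rewards.append(r_t)
    return s_next, loss.item()
```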
In one possible implementation, the context of the current training period may represent the current state, e.g., a game, a traffic environment in which the autonomous vehicle is located, or a transaction situation in a financial market, etc. The action output of the neural network can be acted on the environment of the current period, so that the environment is changed. In an example, the action output is an in-game action instruction, e.g., the game is a shooting game, and the action instruction may be an instruction to command a character in the game to shoot a basket to change a current scoring situation. In an example, the action output may be an operation instruction in automatic driving, and may instruct a driven vehicle to perform an operation to change a current traffic environment. In an example, the action output may be a trading instruction in a financial market that may perform a trading operation to change a current market price. The present disclosure is not limited as to the type of environment and action output.
In one possible implementation, in step S11, the environment state vector describing the environment of the current training period may be input into a neural network for processing, and the neural network may be the neural network obtained by training in the previous training period.
In one possible implementation, the neural network may process the environment state vector of the current training period to obtain an action output of the current training period and a measurement output of the current training period. The action output of the current training period may act on the environment of the current period, and the measurement output of the current training period may be used as a parameter when the neural network is adjusted by back propagation.
In a possible implementation manner, when the current training period is the first training period for training the neural network, the environment state vector of the current training period may be obtained according to a preset initial environment state vector and random action outputs. In an example, before training of the neural network is started, random action outputs may first be applied to the initial environment, and this may be repeated multiple times. In an example, a random action output may be a vector, each element of which is chosen according to a certain probability distribution function, for example a softmax function. In the process of repeated execution, a random action output acts on the initial environment to change it and obtain a first initial environment, then another random action output acts on the first initial environment to change it and obtain a second initial environment, and so on. The execution may be repeated multiple times; for example, the number of repetitions may be set to M (M is a positive integer greater than or equal to 1). After the M repetitions, the resulting environment may be determined as the environment of the first training period, and an environment state vector describing the environment of the first training period may then be input to the neural network for processing so as to train the neural network.
By this method, the initial environment can quickly become an environment suitable for training the neural network, so that the neural network can obtain effective reward and punishment feedback, the efficiency of training the neural network is improved, and the training process is less likely to fall into a locally optimal solution.
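As an illustration only, the following Python sketch implements such a warm-up under stated assumptions: `env_step` is a hypothetical environment-transition helper returning a reward and the next state, the softmax over random logits is one possible realization of the probability-distribution-based random action output, and the value of M is a placeholder.

```python
import numpy as np

def warm_up(initial_state, env_step, action_dim, M=10, rng=None):
    """Apply random action outputs to the initial environment M times; the resulting
    environment state is used as the environment of the first training period."""
    rng = np.random.default_rng() if rng is None else rng
    state = initial_state
    for _ in range(M):
        # Random action output: each element drawn under a probability distribution,
        # here a softmax over random logits (illustrative choice).
        logits = rng.normal(size=action_dim)
        action = np.exp(logits) / np.exp(logits).sum()
        _, state = env_step(state, action)
    return state  # environment state vector of the first training period
```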
In one possible implementation manner, in step S12, the action output of the current training period may be applied to the environment of the current training period, that is, the environment state vector of the current training period and the action output of the current training period are subjected to vector operation or matrix operation (for example, operations such as addition, multiplication, or convolution may be included, and the present disclosure does not limit the type of the vector operation or the matrix operation), and a first reward and penalty feedback of the current training period may be obtained, where the first reward and penalty feedback may represent a change in a gap between the environment and an ideal state after the action output of the current training period is applied to the environment of the current training period.
In an example, the environment is a basketball shooting game, and the ideal state is that the difference between our side's score and the opponent's score is maximized. After the action output of the current training period acts on the environment, a first reward and punishment feedback can be obtained: for example, when our side scores a basket, a positive first reward and punishment feedback can be obtained; when the opponent scores a basket, a negative first reward and punishment feedback can be obtained; otherwise, the first reward and punishment feedback is 0.
In an example, the environment is a traffic environment in driving, after the action output of the current training cycle is applied to the environment, a first reward and punishment feedback can be obtained, for example, when the traffic environment is improved (for example, a road is clear) after the vehicle performs an operation associated with the action output, a positive first reward and punishment feedback can be obtained, when the traffic environment is deteriorated (for example, the vehicle drives into a crowded road) after the vehicle performs the operation associated with the action output, a negative first reward and punishment feedback can be obtained, and when the traffic state is unchanged after the vehicle performs the operation associated with the action output, the first reward and punishment feedback is 0.
In an example, the environment is a trading situation in a financial market, after the action output of the current training period is applied to the environment, the first reward and punishment feedback may be obtained, for example, after a trading operation associated with the action output is performed, if the benefit is positive, the positive first reward and punishment feedback may be obtained, after a trading operation associated with the action output is performed, if the benefit is negative, the negative first reward and punishment feedback may be obtained, and after a trading operation associated with the action output is performed, if there is no benefit (i.e., no trading operation is performed, or the trading profit-loss balance is performed, etc.), the first reward and punishment feedback is 0.
In a possible implementation manner, the environment state vector of the next training period may also be determined according to the environment state vector of the current training period and the action output of the current training period. In an example, the environment state vector of the next training period may be obtained from the environment state vector of the current training period and the action output of the current training period through a vector operation or a matrix operation (for example, operations such as addition, multiplication, or convolution may be included, and the present disclosure does not limit the type of the vector operation or the matrix operation). Specifically, the action output of the current training period may act on the environment and change the environment to obtain the environment of the next training period, where the environment state vector of the next training period is the state vector describing the environment of the next training period.
In a possible implementation manner, the environmental state vector of the current training period may be determined according to an environmental state vector of a previous training period of the current training period and an action output of the previous training period. That is, the environmental state vector of the current training cycle is determined by a vector operation or a matrix operation from the environmental state vector of the previous training cycle of the current training cycle and the action output of the previous training cycle. The action output of the previous training period can be acted on the environment of the previous training period to obtain the environment of the current training period.
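The patent only states that the next environment state is obtained through a vector or matrix operation (e.g., addition, multiplication, convolution) and that the first reward and punishment feedback reflects the change of the gap between the environment and an ideal state. The toy `env_step` below is an assumed affine instance of that description, for illustration only; it also assumes the action output has the same dimension as the state vector.

```python
import numpy as np

def env_step(state_vec, action_out, A=None, b=None, ideal_state=None):
    """Toy environment transition: the next state is an assumed affine combination of the
    current environment state vector and the action output; the first reward and punishment
    feedback is positive when the gap to the ideal state shrinks, negative when it grows."""
    dim = state_vec.shape[0]
    A = np.eye(dim) if A is None else A
    b = np.zeros(dim) if b is None else b
    ideal_state = np.ones(dim) if ideal_state is None else ideal_state

    next_state = A @ state_vec + action_out + b           # vector/matrix operation
    gap_before = np.linalg.norm(ideal_state - state_vec)
    gap_after = np.linalg.norm(ideal_state - next_state)
    reward = gap_before - gap_after                        # change of the gap to the ideal state
    return reward, next_state
```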
In one possible implementation manner, in step S13, the model loss of the neural network may be determined according to a measurement output of a current training period, the first reward and punishment feedback of the current training period, and the first reward and punishment feedback of one or more historical training periods. The measurement output of the current period comprises a first measurement output of the current training period and a second measurement output of the current training period, and the model loss comprises a first model loss corresponding to the first measurement output, a second model loss corresponding to the second measurement output and a third model loss corresponding to the action output.
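The patent does not prescribe a network architecture. For concreteness only, the PyTorch sketch below shows one minimal arrangement with a shared trunk, an action-output head and two scalar measurement-output heads; the class name, layer sizes and layer types are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Illustrative network with an action output and first/second measurement outputs."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.action_head = nn.Linear(hidden, action_dim)  # action output (logits)
        self.v1_head = nn.Linear(hidden, 1)               # first measurement output
        self.v2_head = nn.Linear(hidden, 1)               # second measurement output

    def forward(self, state_vec):
        h = self.trunk(state_vec)
        return self.action_head(h), self.v1_head(h), self.v2_head(h)
```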
In one possible implementation, step S13 may include:
determining the first model loss according to the first reward and punishment feedback of the current training period and the first measurement output of the current training period; determining the second model loss according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period and the second measurement output of the current training period; and determining the third model loss according to the first model loss and the second model loss.
In one possible implementation, the first model loss corresponds to the first measurement output. The first model loss can be propagated backwards through the first measurement output of the neural network to adjust parameters such as the weights of the neural network. In an example, the first model loss is related to the first reward and punishment feedback of the current training period and the first measurement output of the current training period.
In one possible implementation, determining the first model loss according to the first reward punishment feedback of the current training period and the first measurement output of the current training period may include:
determining a first accumulated discount reward and punishment feedback according to the first reward and punishment feedback of the current training period, a preset discount rate and a first expected reward and punishment function; and determining the first model loss according to the first accumulated discount reward and punishment feedback and the first measurement output.
In one possible implementation, a first cumulative discounted reward and punishment feedback R_t may be determined from the first reward and punishment feedback r_t of the current training period, a preset discount rate γ and a first expected reward and punishment function V, according to the following formula (1):

R_t = r_t + γ·r_{t+1} + ... + γ^(N-1)·r_{t+N-1} + γ^N·V(s_{t+N})    (1)

The first cumulative discounted reward and punishment feedback R_t does not involve the first reward and punishment feedback of the historical training periods and can therefore be regarded as a short-period cumulative discounted reward and punishment feedback. The first expected reward and punishment function V is an evaluation function of the first cumulative discounted reward and punishment feedback for the environment of the current training period, and the input of the function may be an environment state vector. The preset discount rate γ is a constant, for example 0.9. N is a positive integer greater than or equal to 1, t is the current training period, and r_{t+n} (0 ≤ n ≤ N-1) is the first reward and punishment feedback of the (t+n)-th period when the action output of the current training period acts on the environment. For example, the action output of the current training period acts on the current environment, and the first reward and punishment feedback r_t of the current environment and the environment state vector s_{t+1} of the next training period can be obtained; the action output of the current training period acts on the environment of the next training period, and the reward and punishment feedback r_{t+1} of the next training period and the environment state vector s_{t+2} after two training periods can be obtained. In this way, the first reward and punishment feedback of each training period, under the condition that the action output is the action output of the current training period, and the environment state vector s_{t+N} after N training periods can be obtained. V(s_{t+N}) is the expected value of the first cumulative discounted reward and punishment feedback for the environment after the N training periods, and the present disclosure does not limit the value of N. For example, where the environment is a trading situation in a financial market, the first expected reward and punishment function V may be used to estimate the expected revenue.
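For illustration, the Python sketch below computes such an N-step accumulation; it assumes formula (1) is exactly the discounted sum read off from the description above (the rewards r_t, ..., r_{t+N-1} plus the bootstrapped value V(s_{t+N})), and the function name and signature are hypothetical.

```python
def first_cumulative_discounted_feedback(rewards, v_bootstrap, gamma=0.9):
    """Sketch of formula (1) as read from the description: rewards holds
    [r_t, r_{t+1}, ..., r_{t+N-1}] and v_bootstrap is V(s_{t+N}); the result is
    sum_n gamma**n * r_{t+n} + gamma**N * V(s_{t+N})."""
    total = 0.0
    for n, r in enumerate(rewards):
        total += (gamma ** n) * r
    return total + (gamma ** len(rewards)) * v_bootstrap
```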
In one possible implementation, the first model loss L1 may be determined from the first cumulative discounted reward and punishment feedback R_t and the first measurement output according to formula (2), where V(s_t) is the expected value of the first cumulative discounted reward and punishment feedback for the environment of the current training period, i.e. the expected value obtained by inputting the environment state vector s_t of the current training period into the first expected reward and punishment function V; in the example, the value of V(s_t) is the first measurement output.
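Formula (2) itself is not reproduced in this text. As an illustrative assumption only, the sketch below uses the squared difference between the first cumulative discounted reward and punishment feedback R_t and the first measurement output V(s_t), a common choice for a value-style loss that is consistent with the quantities named above; it is not claimed to be the patent's exact formula.

```python
import torch

def first_model_loss(R_t, v1_out):
    """Assumed form of the first model loss L1: squared difference between the first
    cumulative discounted reward and punishment feedback R_t and the first measurement
    output V(s_t). The squared-error form is an assumption, not the patent's formula (2)."""
    return (R_t - v1_out).pow(2).mean()
```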
In one possible implementation, the second model loss corresponds to the second measurement output. The second model loss can be propagated backwards through the second measurement output of the neural network to adjust parameters such as the weights of the neural network. In an example, the second model loss is related to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period, and the second measurement output of the current training period. Determining the second model loss according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period and the second measurement output of the current training period includes: determining a second reward and punishment feedback of the current training period according to the first reward and punishment feedback of the current training period and the first reward and punishment feedback of the historical training period; determining a second accumulated discount reward and punishment feedback according to the second reward and punishment feedback of the current training period, a preset discount rate and a second expected reward and punishment function; and determining the second model loss according to the second accumulated discount reward and punishment feedback and the second measurement output.
In one possible implementation, determining a second reward and punishment feedback of the current training period according to the first reward and punishment feedback of the current training period and the first reward and punishment feedback of the historical training period may include:
determining a reward and punishment change vector according to the first reward and punishment feedback of the historical training period and the first reward and punishment feedback of the current training period; obtaining a variation accumulated vector according to the reward and punishment variation vector; obtaining a zero fluctuation reward and punishment vector according to the change accumulated vector; and determining the second reward punishment feedback according to the change accumulated vector and the zero fluctuation reward punishment vector.
In one possible implementation, a reward and punishment change vector d = [d_{t-(T-1)}, d_{t-(T-2)}, ..., d_t] may be determined from the first reward and punishment feedback of the historical training periods, r_{t-1}, r_{t-2}, r_{t-3}, etc., and the first reward and punishment feedback r_t of the current training period. In an example, the first reward and punishment feedback of T training periods may be used, where T is a positive integer greater than or equal to 2; in an example T may be 20, and the disclosure does not limit the value of T. In an example, the first reward and punishment feedback of the T training periods may be arranged into a vector [r_{t-(T-1)}, ..., r_{t-2}, r_{t-1}, r_t]. Further, the change of the reward and punishment feedback of each training period with respect to that of the previous training period may be determined from this vector, i.e. d = [r_{t-(T-1)}, r_{t-(T-2)} - r_{t-(T-1)}, ..., r_{t-1} - r_{t-2}, r_t - r_{t-1}], where in the reward and punishment change vector d, d_{t-(T-1)} = r_{t-(T-1)}, d_{t-(T-2)} = r_{t-(T-2)} - r_{t-(T-1)}, ..., d_t = r_t - r_{t-1}.
In a possible implementation manner, a change accumulation vector may be obtained from the reward and punishment change vector d. In an example, the order of the reward and punishment change vector d may be reversed to obtain a vector f = [f_1, f_2, ..., f_T] = [d_t, d_{t-1}, ..., d_{t-(T-1)}], and the elements of the vector f may be accumulated to obtain the change accumulation vector, whose elements are the accumulated sums of the elements of f, with the index i satisfying 0 ≤ i ≤ T+1.
In one possible implementation, a zero fluctuation reward and punishment vector may be determined from the change accumulation vector.
in one possible implementation, vectors may be accumulated according to changesAnd the zero fluctuation reward and punishment vectorDetermining the second reward punishment feedbackIn an example, vectors can be accumulated according to changesElement (1) ofAnddetermining a reward and penalty estimation parameter RHAs in the following equation (3):
can accumulate vectors according to changesAnd the zero fluctuation reward and punishment vectorDetermining a change accumulation vectorAnd the zero fluctuation reward and punishment vectorDifference value and zero fluctuation reward and punishment vectorThe ratio therebetween, as in the following formula (4):
wherein k is more than or equal to 1 and less than or equal to T +1, determining T +1 ratios according to formula (4), and determining the variance of the T +1 ratiosFurther, the parameter R can be estimated according to the reward and punishmentHSum varianceDetermining second reward punishment feedbackAs in the following equation (5):
wherein, tau and sigmamaxFor a set parameter, τ can be used to controlAmplitude of (a)maxCan be used for controllingIn the example, σmax1, τ -2, the present disclosure pairs τ and σmaxThe value of (A) is not limiting.
In this way, the second reward and punishment feedback can be obtained from the first reward and punishment feedback of a plurality of historical training periods, so that more effective reward and punishment feedback can be obtained in the training process, the neural network can be applied to more complex environments, and the performance of the neural network is improved.
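Formulas (3) to (5) are not reproduced in this text, so the Python sketch below only follows the steps described above (change vector, reversed and accumulated vector, zero fluctuation counterpart, ratios and their variance) and fills in the unspecified pieces with assumptions: the leading element of the accumulation vector, the construction of the zero fluctuation vector, the definition of R_H and the final combination with τ and σ_max are all illustrative guesses in a variability-weighted style, not the patent's exact formulas.

```python
import numpy as np

def second_reward_feedback(rewards, tau=2.0, sigma_max=1.0):
    """Sketch: rewards = [r_{t-(T-1)}, ..., r_{t-1}, r_t] are the first reward and
    punishment feedbacks of the last T training periods. The change vector d, the
    reversed vector f and the accumulation follow the description; the zero fluctuation
    vector, R_H and the final combination are assumptions (variability-weighted style)."""
    r = np.asarray(rewards, dtype=float)
    T = len(r)
    # Reward and punishment change vector d: first element r_{t-(T-1)}, then differences.
    d = np.concatenate(([r[0]], np.diff(r)))
    # Reverse the order and accumulate to obtain the change accumulation vector.
    f = d[::-1]
    c = np.concatenate(([0.0], np.cumsum(f)))            # assumed leading zero element
    # Zero fluctuation counterpart: straight line between the end points (assumption).
    c_bar = np.linspace(c[0], c[-1], num=len(c))
    # Reward and punishment estimation parameter R_H (formula (3), assumed here to be the
    # average accumulated change per period).
    R_H = (c[-1] - c[0]) / T
    # Formula (4): ratios between (c - c_bar) and c_bar; a small offset avoids division by zero.
    ratios = (c - c_bar) / (np.abs(c_bar) + 1e-8)
    sigma = np.var(ratios)
    # Formula (5), assumed variability-weighted combination of R_H, sigma, tau and sigma_max.
    return R_H * (1.0 - (sigma / sigma_max) ** tau)
```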
In one possible implementation, a second cumulative discounted reward and punishment feedback R'_t may be determined from the second reward and punishment feedback r'_t of the current training period, the preset discount rate γ and a second expected reward and punishment function V_vwr, according to the following formula (6):

R'_t = r'_t + γ·r'_{t+1} + ... + γ^(N-1)·r'_{t+N-1} + γ^N·V_vwr(s_{t+N})    (6)

The second cumulative discounted reward and punishment feedback R'_t is related to the first reward and punishment feedback of the historical training periods and can therefore be regarded as a long-period cumulative discounted reward and punishment feedback. The second expected reward and punishment function V_vwr is an evaluation function of the second cumulative discounted reward and punishment feedback for the environment of the current training period, and the input of the function may be an environment state vector. For example, when the action output of the current training period is applied to the current environment, the second reward and punishment feedback r'_t of the current environment and the environment state vector s_{t+1} of the next training period can be obtained; the action output of the current training period acts on the environment of the next training period, and the reward and punishment feedback r'_{t+1} of the next training period and the environment state vector s_{t+2} after two training periods can be obtained. In this way, the second reward and punishment feedback of each training period, under the condition that the action output is the action output of the current training period, and the environment state vector s_{t+N} after N training periods can be obtained. V_vwr(s_{t+N}) is the expected value of the second cumulative discounted reward and punishment feedback for the environment after the N training periods under the condition that the action output is the action output of the current training period; the present disclosure does not limit the value of N.
In one possible implementation, the second model loss L2 may be determined from the second cumulative discounted reward and punishment feedback R'_t and the second measurement output according to formula (7), where V_vwr(s_t) is the expected value of the second cumulative discounted reward and punishment feedback for the environment of the current training period, i.e. the expected value obtained by inputting the environment state vector s_t of the current training period into the second expected reward and punishment function V_vwr; in the example, the value of V_vwr(s_t) is the second measurement output.
In one possible implementation, the third model loss L3 corresponding to the action output a_t may be determined from the first model loss L1 and the second model loss L2 according to formula (8), where π(a_t|s_t; θ) is the expression of the neural network of the current training period, θ is the network parameter value of the neural network of the current training period, and the other quantities appearing in formula (8) are the first model loss L1 and the second model loss L2.
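Formula (8) is likewise not reproduced here. As an illustrative assumption, the sketch below combines the policy term π(a_t|s_t; θ) with the two losses in an actor-critic style, weighting the log-probability of the chosen action by the (detached) first and second model losses; the precise combination used in the patent may differ, and the function name, signature and weighting are hypothetical.

```python
import torch
import torch.nn.functional as F

def third_model_loss(action_logits, action_index, loss1, loss2):
    """Assumed actor-critic-style stand-in for formula (8): the log-probability of the
    selected action under pi(a_t | s_t; theta) is weighted by the first and second model
    losses (tensors), treated as constants for the policy head. This combination is an
    assumption, not the patent's exact formula."""
    log_prob = F.log_softmax(action_logits, dim=-1)[action_index]
    return -log_prob * (loss1.detach() + loss2.detach())
```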
In one possible implementation, in step S14, the network parameter values of the neural network may be adjusted according to the first model loss L1, the second model loss L2 and the third model loss L3. In an example, the network parameter values of the neural network may be adjusted in a direction that minimizes the model loss, or in a direction that minimizes a regularized model loss, so that the adjusted neural network has a higher goodness of fit while avoiding overfitting.
In one possible implementation, step S14 may be performed in a loop multiple times: for example, the adjustment may be performed a predetermined number of times in the direction that minimizes the model loss; alternatively, the number of adjustments may be unlimited, and the loop is stopped after multiple adjustments once the model loss decreases to a certain degree or converges within a certain threshold.
In one possible implementation, in step S15, when the training condition is satisfied, the trained neural network may be obtained. In an example, the neural network that has been adjusted in the loop a predetermined number of times may be determined as the trained neural network, or the neural network whose model loss has decreased to a certain degree or converged within a certain threshold may be determined as the trained neural network. The present disclosure does not limit the training conditions.
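Purely as an illustration of the loop structure and the two stopping rules mentioned above, the sketch below wraps the per-period step from the earlier hypothetical `train_one_period` sketch in an outer loop; the limits and threshold values are placeholders, and all identifiers are assumptions rather than part of the patent.

```python
def train(net, optimizer, s0, env_step, compute_model_loss,
          max_periods=10_000, loss_threshold=1e-3):
    """Outer loop around the hypothetical train_one_period sketch above: stop after a
    predetermined number of training periods, or once the model loss has fallen below a
    threshold (the concrete limits here are placeholders)."""
    state, history_rewards, last_loss = s0, [], float("inf")
    for _ in range(max_periods):
        state, last_loss = train_one_period(net, optimizer, state, history_rewards,
                                            env_step, compute_model_loss)
        if last_loss < loss_threshold:  # step S15: training condition satisfied
            break
    return net  # trained neural network
```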
According to the neural network training method of the embodiments of the present disclosure, the environment of the first training period is quickly obtained by using random action outputs, so that the neural network can obtain effective reward and punishment feedback, the efficiency of training the neural network is improved, and the training process is less likely to fall into a locally optimal solution. Furthermore, the model loss is determined according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period, the first measurement output and the second measurement output, so that more effective reward and punishment feedback can be obtained in the training process and the neural network can be applied to more complex environments, improving the performance of the neural network. Because the model loss is determined through a plurality of reward and punishment feedbacks, the process of training the neural network is less likely to fall into a locally optimal solution, and a neural network with a higher goodness of fit can be obtained.
FIG. 2 shows a flow diagram of an environment processing method according to an embodiment of the present disclosure. As shown in fig. 2, the method includes:
in step S21, the environmental state vector of the current cycle is input into the neural network for processing, and the action output of the current cycle is obtained;
in step S22, an environmental state vector of a next cycle of the current cycle and a first reward and punishment feedback of the current cycle are determined according to the environmental state vector of the current cycle and the action output of the current cycle.
In a possible implementation manner, in step S21, the neural network is the neural network trained by the training method of steps S11 to S15, i.e. trained with the model loss determined from the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period, the first measurement output and the second measurement output. Because more effective reward and punishment feedback can be obtained in the training process, the trained neural network can be applied to a more complex environment and more accurate action output can be obtained.
In one possible implementation manner, in step S22, the action output of the current period and the environment state vector of the current period may be subjected to a vector operation or a matrix operation (for example, operations such as addition, multiplication, or convolution may be included, and the present disclosure does not limit the type of the vector operation or the matrix operation) to obtain the environment state vector of the next period. Specifically, the action output of the current period may act on the environment and change the environment to obtain the environment of the next period, where the environment state vector of the next period is the state vector describing the environment of the next period. Furthermore, when the environment of the current period changes into the environment of the next period, a first reward and punishment feedback of the current period can be obtained, which may represent the revenue generated when the action output of the current period acts on the environment of the current period. For example, if the environment is a trading situation in a financial market, after the action output of the current period (e.g., a trading operation) is applied to the environment, the trading situation in the financial market can be changed (e.g., a stock price or trading volume changes after a trade occurs) and a first reward and punishment feedback (e.g., the profit obtained by the trade) is obtained. In the training process of the neural network, the neural network is trained according to the first reward and punishment feedback of the current training period and the first reward and punishment feedback of the historical training period, so the trained neural network can adapt to a more complex environment based on the historical training periods; for example, it can perform more accurate trading operations based on more historical profit conditions, obtain better prices and lower trading costs, and obtain larger profits in backtesting on historical data.
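A minimal sketch of steps S21 and S22, assuming the trained network returns the action output together with the two measurement outputs and that `env_step` is the hypothetical environment-transition helper used in the earlier sketches; normalizing the action logits with a softmax is an illustrative choice.

```python
import torch

def process_environment(trained_net, state, env_step):
    """Step S21: obtain the action output of the current period from the trained network.
    Step S22: apply it to the environment to obtain the environment state vector of the
    next period and the first reward and punishment feedback of the current period."""
    with torch.no_grad():
        action_logits, _, _ = trained_net(state)
        action_out = torch.softmax(action_logits, dim=-1)  # action output of the current period
    r_t, next_state = env_step(state, action_out)
    return next_state, r_t
```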
Fig. 3 shows an application schematic diagram of a neural network training method according to an embodiment of the present disclosure. As shown in fig. 3, random action outputs may be applied to the initial environment repeatedly a number of times to obtain the environment of the first training period, with which training of the neural network begins.
In one possible implementation, an environment state vector s_t describing the environment of the current training period may be obtained based on the environment of the current training period, and the environment state vector s_t may be input into the neural network to obtain the action output a_t of the current training period, the first measurement output and the second measurement output.
In one possible implementation, the action output a_t may act on the environment of the current training period to obtain a first reward and punishment feedback r_t and the environment state vector s_{t+1} of the next training period. A second reward and punishment feedback can be obtained according to the first reward and punishment feedback r_t and the first reward and punishment feedback of the historical training periods.
In one possible implementation, a first model loss L1 corresponding to the first measurement output is obtained according to the first reward and punishment feedback r_t and the first measurement output. A second model loss L2 corresponding to the second measurement output is obtained according to the second reward and punishment feedback and the second measurement output. A third model loss L3 corresponding to the action output is obtained according to the first model loss L1 and the second model loss L2.
In one possible implementation, the neural network may be adjusted by back propagation through the first measurement output according to the first model loss L1, adjusted by back propagation through the second measurement output according to the second model loss L2, and adjusted by back propagation through the action output a_t according to the third model loss L3. When the training condition is satisfied, the trained neural network can be obtained. The trained neural network may be used to receive the environment state vector s_{t+1} of the next training period.
In one possible implementation, the environment may include a traffic environment in which an autonomous vehicle is located. Through the first reward and punishment feedback r_t of the current training period and the second reward and punishment feedback determined from the first reward and punishment feedback r_t and the first reward and punishment feedback of the historical training periods, a plurality of effective reward and punishment feedbacks can be obtained, so that the method can be used in a complex traffic environment and output more accurate action outputs in that environment.
In one possible implementation, the environment is a trading situation in a financial market. Through the first reward and punishment feedback r_t of the current training period and the second reward and punishment feedback determined from the first reward and punishment feedback r_t and the first reward and punishment feedback of the historical training periods, a plurality of effective reward and punishment feedbacks can be obtained, so that the method can be used in a complex trading environment to output more accurate action outputs, perform more accurate trading operations based on more historical profit conditions, obtain better prices and lower trading costs, and obtain larger profits in backtesting on historical data.
In a possible implementation manner, the neural network trained by the method can be used in environment processing, that is, the environment state vector of the current period can be input into the trained neural network for processing, and the action output of the current period is obtained; and determining the environment state vector of the next period of the current period and the first reward and punishment feedback of the current period according to the environment state vector of the current period and the action output of the current period. The environment is processed by using the trained neural network, so that the trained neural network can adapt to a more complex environment based on a historical training period, more accurate action output is obtained, and higher first reward and punishment feedback, namely higher profit, is obtained.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principles and logic described herein; for brevity, the details are not repeated in the present disclosure.
In addition, the present disclosure also provides a neural network training apparatus, an environment processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any one of the methods provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here for brevity.
Fig. 4 shows a block diagram of a neural network training device according to an embodiment of the present disclosure. As shown in Fig. 4, the neural network training device includes:
the input module 11 is configured to input the environmental state vector of the current training period into a neural network for processing, and obtain an action output of the current training period and a measurement output of the current training period;
a feedback determining module 12, configured to determine a first reward punishment feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
a model loss determining module 13, configured to determine a model loss of the neural network according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of a historical training period, and a measurement output of the current training period, where the historical training period includes one or more training periods before the current training period;
an adjusting module 14, configured to adjust a network parameter value of the neural network according to the model loss;
a neural network obtaining module 15, configured to obtain a trained neural network when the neural network meets a training condition.
Fig. 5 shows a block diagram of a neural network training device according to an embodiment of the present disclosure. As shown in Fig. 5, the device further includes:
and a second determining module 16, configured to, when the current training period is the first training period for training the neural network, obtain the environmental state vector of the current training period according to a preset initial environmental state vector and a random action output.
In one possible implementation, the apparatus further includes:
a first determining module 17, configured to determine the environmental state vector of the current training period according to the environmental state vector of the previous training period of the current training period and the action output of the previous training period.
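Putting the two determining modules side by side, a possible helper could look like the following; the environment interface step_from(state, action) -> (feedback, next_state) and all names are assumptions made only for this sketch.

```python
import torch

def current_state(env, prev_state=None, prev_action=None,
                  initial_state=None, action_dim=4):
    """Environmental state vector of the current training period."""
    if prev_state is None:
        # First training period: preset initial state vector and a random action output.
        random_action = torch.rand(action_dim)
        _, state = env.step_from(initial_state, random_action)
    else:
        # Later periods: previous period's state vector and previous period's action output.
        _, state = env.step_from(prev_state, prev_action)
    return state
```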
In one possible implementation, the measurement output of the current period includes a first measurement output of the current training period and a second measurement output of the current training period.
In one possible implementation, the model loss includes a first model loss corresponding to the first metric output, a second model loss corresponding to the second metric output, and a third model loss corresponding to the action output.
In a possible implementation, the model loss determination module 13 is further configured to:
determining the first model loss according to the first reward and punishment feedback of the current training period and the first measurement output of the current training period;
determining the second model loss according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of the historical training period and the second measurement output of the current training period;
and determining the third model loss according to the first model loss and the second model loss.
In a possible implementation, the model loss determination module 13 is further configured to:
determining a first accumulated discount reward and punishment feedback according to the first reward and punishment feedback, a preset discount rate and a first expected reward and punishment function of the current training period;
and determining the first model loss according to the first cumulative discount reward and punishment feedback and the first measurement output.
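One common way to realize a cumulative discounted feedback built from a preset discount rate and an expected reward-and-punishment function is a bootstrapped one-step return regressed against the measurement output; the sketch below assumes that form, and the second model loss would be built analogously from the second feedback and the second measurement output. It is an illustration, not the patented formula.

```python
import torch
import torch.nn.functional as F

def first_model_loss(r_t, v1_t, v1_next, discount=0.99):
    """First cumulative discount feedback and the resulting first model loss.

    r_t     -- first reward-and-punishment feedback of the current period
    v1_t    -- first measurement output of the current period
    v1_next -- first expected reward-and-punishment value of the next period
    """
    cumulative = r_t + discount * v1_next.detach()  # first cumulative discount feedback
    return F.mse_loss(v1_t, cumulative)
```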
In a possible implementation, the model loss determination module 13 is further configured to:
determining a second reward and punishment feedback of the current training period according to the first reward and punishment feedback of the current training period and the first reward and punishment feedback of the historical training period;
determining a second accumulated discount reward and punishment feedback according to a second reward and punishment feedback of the current training period, a preset discount rate and a second expected reward and punishment function;
and determining the second model loss according to the second cumulative discount reward and punishment feedback and the second measurement output.
In a possible implementation, the model loss determination module 13 is further configured to:
determining a reward and punishment change vector according to the first reward and punishment feedback of the historical training period and the first reward and punishment feedback of the current training period;
obtaining a variation accumulated vector according to the reward and punishment variation vector;
obtaining a zero fluctuation reward and punishment vector according to the change accumulated vector;
and determining the second reward punishment feedback according to the change accumulated vector and the zero fluctuation reward punishment vector.
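The four steps just listed (change vector, accumulated vector, zero-fluctuation vector, second feedback) can be sketched as follows. Since the exact construction is defined elsewhere in the disclosure, every formula here (successive differences, cumulative sum, straight-line zero-fluctuation reference, deviation penalty) is a stand-in chosen only to show the data flow.

```python
import numpy as np

def second_reward_feedback(first_feedbacks):
    """Second reward-and-punishment feedback of the current period.

    first_feedbacks holds the first feedbacks of the historical training
    periods followed by that of the current period.
    """
    r = np.asarray(first_feedbacks, dtype=np.float64)
    change = np.diff(r, prepend=r[0])      # reward-and-punishment change vector
    cumulative = np.cumsum(change)         # change accumulated vector
    # Zero-fluctuation reference: the same net change, but with no fluctuation.
    zero_fluct = np.linspace(cumulative[0], cumulative[-1], num=len(cumulative))
    deviation = np.mean(np.abs(cumulative - zero_fluct))
    return float(cumulative[-1] - deviation)
```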
Fig. 6 shows a block diagram of an environment processing apparatus according to an embodiment of the present disclosure. As shown in Fig. 6, the environment processing apparatus includes:
an action output obtaining module 21, configured to input the environmental state vector of the current period into the neural network according to any one of claims 1 to 9 for processing, so as to obtain the action output of the current period;
the third determining module 22 is configured to determine, according to the environment state vector of the current period and the action output of the current period, an environment state vector of a next period of the current period and the first reward and punishment feedback of the current period.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 7 is a block diagram illustrating an electronic device 800 according to an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or other such terminal.
Referring to fig. 7, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 8 is a block diagram illustrating an electronic device 1900 according to an example embodiment. For example, the electronic device 1900 may be provided as a server. Referring to fig. 8, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA) can execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A neural network training method, the method comprising:
inputting the environmental state vector of the current training period into a neural network for processing to obtain the action output of the current training period and the measurement output of the current training period;
determining a first reward and punishment feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
determining a model loss of the neural network according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of a historical training period and the measurement output of the current training period, wherein the historical training period comprises one or more training periods before the current training period;
adjusting network parameter values of the neural network according to the model loss;
and when the neural network meets the training condition, obtaining the trained neural network.
2. The method of claim 1, wherein the measurement output of the current period comprises a first measurement output of the current training period and a second measurement output of the current training period.
3. The method of claim 2, wherein the model loss comprises a first model loss corresponding to the first measurement output, a second model loss corresponding to the second measurement output, and a third model loss corresponding to the action output.
4. An environmental processing method, comprising:
inputting the environmental state vector of the current period into the neural network of any one of claims 1 to 3 for processing to obtain the action output of the current period;
and determining the environment state vector of the next period of the current period and the first reward and punishment feedback of the current period according to the environment state vector of the current period and the action output of the current period.
5. A neural network training device, comprising:
the input module is used for inputting the environmental state vector of the current training period into the neural network for processing to obtain the action output of the current training period and the measurement output of the current training period;
the feedback determining module is used for determining first reward punishment feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
a model loss determination module, configured to determine a model loss of the neural network according to the first reward and punishment feedback of the current training period, the first reward and punishment feedback of a historical training period, and a measurement output of the current training period, where the historical training period includes one or more training periods before the current training period;
the adjusting module is used for adjusting the network parameter value of the neural network according to the model loss;
and the neural network obtaining module is used for obtaining the trained neural network when the neural network meets the training condition.
6. An environmental processing apparatus, the apparatus comprising:
an action output obtaining module, configured to input the environmental state vector of the current period into the neural network according to any one of claims 1 to 3 for processing, so as to obtain the action output of the current period;
and the third determining module is used for determining the environmental state vector of the next period of the current period and the first reward and punishment feedback of the current period according to the environmental state vector of the current period and the action output of the current period.
7. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: performing the method of any one of claims 1 to 3.
8. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: performing the method of claim 4.
9. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 3.
10. A computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of claim 4.
CN201810885459.2A 2018-08-06 2018-08-06 Neural network training method and device and environment processing method and device Active CN109190760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810885459.2A CN109190760B (en) 2018-08-06 2018-08-06 Neural network training method and device and environment processing method and device

Publications (2)

Publication Number Publication Date
CN109190760A (en) 2019-01-11
CN109190760B CN109190760B (en) 2021-11-30

Family

ID=64920260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810885459.2A Active CN109190760B (en) 2018-08-06 2018-08-06 Neural network training method and device and environment processing method and device

Country Status (1)

Country Link
CN (1) CN109190760B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180084279A1 (en) * 2016-03-02 2018-03-22 MatrixView, Inc. Video encoding by injecting lower-quality quantized transform matrix values into a higher-quality quantized transform matrix
CN108009638A (en) * 2017-11-23 2018-05-08 深圳市深网视界科技有限公司 A kind of training method of neural network model, electronic equipment and storage medium
CN108257116A (en) * 2017-12-30 2018-07-06 清华大学 A kind of method for generating confrontation image
CN108197652A (en) * 2018-01-02 2018-06-22 百度在线网络技术(北京)有限公司 For generating the method and apparatus of information

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191769A (en) * 2019-12-25 2020-05-22 中国科学院苏州纳米技术与纳米仿生研究所 Self-adaptive neural network training and reasoning device
CN111191769B (en) * 2019-12-25 2024-03-05 中国科学院苏州纳米技术与纳米仿生研究所 Self-adaptive neural network training and reasoning device
CN111191722A (en) * 2019-12-30 2020-05-22 支付宝(杭州)信息技术有限公司 Method and device for training prediction model through computer
CN111191722B (en) * 2019-12-30 2022-08-09 支付宝(杭州)信息技术有限公司 Method and device for training prediction model through computer
CN113211441A (en) * 2020-11-30 2021-08-06 湖南太观科技有限公司 Neural network training and robot control method and device
CN113211441B (en) * 2020-11-30 2022-09-09 湖南太观科技有限公司 Neural network training and robot control method and device
CN114793305A (en) * 2021-01-25 2022-07-26 上海诺基亚贝尔股份有限公司 Method, apparatus, device and medium for optical communication

Also Published As

Publication number Publication date
CN109190760B (en) 2021-11-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant