CN109190760A - Neural network training method and device and environment processing method and device - Google Patents
Neural network training method and device and environment processing method and device
- Publication number
- CN109190760A (application number CN201810885459.2A)
- Authority
- CN
- China
- Prior art keywords
- training
- reward
- neural network
- feedback
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Feedback Control In General (AREA)
- Manipulator (AREA)
Abstract
This disclosure relates to a neural network training method and device and an environment processing method and device. The method includes: inputting the environment state vector of the current training period into a neural network to obtain an action output and an evaluation output; determining a first reward feedback according to the environment state vector and the action output; determining the model loss of the neural network according to the first reward feedback, the first reward feedbacks of historical training periods, and the evaluation output; adjusting the network parameter values of the neural network according to the model loss; and obtaining the trained neural network when the neural network satisfies the training condition. With the neural network training method according to embodiments of the present disclosure, the model loss is determined from multiple reward feedbacks of the current and historical training periods, so the training process is less likely to fall into a local optimum and a neural network with better fitting performance can be obtained.
Description
Technical field
The present disclosure relates to the field of computer technology, and in particular to a neural network training method and device and an environment processing method and device.
Background art
In the related art, the environment state vector describing the environment of the current period can be input into a neural network to obtain an action output and an evaluation output. The action output acts on the environment, the environment changes as a result, and the environment of the next period is obtained together with a reward feedback for the action output. A loss function of the neural network is determined from this reward feedback in order to train the network. However, because this training method determines the loss function from a single reward feedback, the training process easily falls into a local optimum, and it is difficult to obtain a neural network with good fitting performance.
Summary of the invention
The present disclosure proposes a neural network training method and device and an environment processing method and device.
According to one aspect of the present disclosure, a neural network training method is provided, including:
inputting the environment state vector of the current training period into a neural network for processing, to obtain the action output of the current training period and the evaluation output of the current training period;
determining the first reward feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
determining the model loss of the neural network according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods, and the evaluation output of the current training period, where the historical training periods include one or more training periods before the current training period;
adjusting the network parameter values of the neural network according to the model loss;
obtaining the trained neural network when the neural network satisfies the training condition.
With the neural network training method according to embodiments of the present disclosure, the environment state vector is input into the neural network to obtain the action output and the evaluation output of the current training period, and the model loss is determined from the first reward feedback of the current training period, the first reward feedbacks of historical training periods, and the evaluation output. Since the model loss is determined from multiple reward feedbacks, the training process is less likely to fall into a local optimum, and a neural network with better fitting performance can be obtained.
In a possible implementation, the evaluation output of the current period includes a first evaluation output of the current training period and a second evaluation output of the current training period.
In a possible implementation, the model loss includes a first model loss corresponding to the first evaluation output, a second model loss corresponding to the second evaluation output, and a third model loss corresponding to the action output.
In a possible implementation, determining the model loss of the neural network according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods, and the evaluation output of the current training period includes:
determining the first model loss according to the first reward feedback of the current training period and the first evaluation output of the current training period;
determining the second model loss according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods, and the second evaluation output of the current training period;
determining the third model loss according to the first model loss and the second model loss.
In a possible implementation, determining the first model loss according to the first reward feedback of the current training period and the first evaluation output of the current training period includes:
determining a first cumulative discounted reward feedback according to the first reward feedback of the current training period, a preset discount rate, and a first expected reward function;
determining the first model loss according to the first cumulative discounted reward feedback and the first evaluation output.
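The cumulative discounted reward feedback described above can be sketched with the standard discounted-return recursion. This is a minimal illustration only: the function names, the squared-error form of the first model loss, and the use of the recursion G_t = r_t + gamma * G_{t+1} are assumptions, since the patent text does not give the exact formulas.

```python
def cumulative_discounted_reward(rewards, discount_rate):
    """Fold a sequence of per-period reward feedbacks into a single
    cumulative discounted value via G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + discount_rate * g
    return g


def first_model_loss(cumulative_reward, first_evaluation_output):
    """One common choice (an assumption here): squared error between the
    cumulative discounted reward and the network's first evaluation output."""
    return (cumulative_reward - first_evaluation_output) ** 2


# Example: three periods of reward feedback with a discount rate of 0.9.
g = cumulative_discounted_reward([1.0, 0.0, -1.0], 0.9)  # 1.0 + 0.9*0.0 + 0.81*(-1.0)
loss = first_model_loss(g, 0.5)
```

A loss of this shape pushes the first evaluation output toward the cumulative discounted reward, which matches the patent's statement that the first model loss is determined from both quantities.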
In a possible implementation, determining the second model loss according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods, and the second evaluation output of the current training period includes:
determining the second reward feedback of the current training period according to the first reward feedback of the current training period and the first reward feedbacks of historical training periods;
determining a second cumulative discounted reward feedback according to the second reward feedback of the current training period, a preset discount rate, and a second expected reward function;
determining the second model loss according to the second cumulative discounted reward feedback and the second evaluation output.
In a possible implementation, determining the second reward feedback of the current training period according to the first reward feedback of the current training period and the first reward feedbacks of historical training periods includes:
determining a reward change vector according to the first reward feedbacks of historical training periods and the first reward feedback of the current training period;
obtaining a change accumulation vector according to the reward change vector;
obtaining a zero-fluctuation reward vector according to the change accumulation vector;
determining the second reward feedback according to the change accumulation vector and the zero-fluctuation reward vector.
In this way, the second reward feedback can be obtained from the first reward feedbacks of multiple historical training periods, so the training process obtains more informative reward feedback, the neural network can adapt to more complex environments, and the performance of the neural network is improved.
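As an illustration of how such a second reward feedback might be computed, the sketch below treats the reward change vector as period-to-period differences, the change accumulation vector as their running sum, and the zero-fluctuation vector as that accumulation with its mean removed. All of these concrete choices, and the final combination of the two vectors, are assumptions for illustration only; the patent text does not specify the formulas.

```python
def second_reward_feedback(history_rewards, current_reward):
    """Derive a second reward feedback from the first reward feedbacks of
    several historical periods plus the current one (illustrative sketch)."""
    seq = list(history_rewards) + [current_reward]
    # Reward change vector: period-to-period differences (assumed form).
    change = [b - a for a, b in zip(seq, seq[1:])]
    if not change:
        return current_reward  # no history: fall back to the first feedback
    # Change accumulation vector: running sum of the changes.
    acc = []
    total = 0.0
    for c in change:
        total += c
        acc.append(total)
    # Zero-fluctuation vector: the accumulation with its mean removed, so
    # short-term fluctuation around the trend is centred at zero.
    mean = sum(acc) / len(acc)
    zero_fluct = [a - mean for a in acc]
    # Combine both vectors (assumed rule): subtract the fluctuation
    # component from the final accumulated change to keep the trend.
    return acc[-1] - zero_fluct[-1]
```

With this assumed combination the result is the average accumulated change, i.e. a smoothed trend of the reward rather than a single noisy value, which matches the stated aim of giving the training process more informative feedback.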
In a possible implementation, the method further includes:
determining the environment state vector of the current training period according to the environment state vector of the previous training period and the action output of the previous training period.
In a possible implementation, the method further includes:
when the current training period is the first training period for training the neural network, obtaining the environment state vector of the current training period according to a preset initial environment state vector and a random action output.
In this way, the initial environment can quickly become an environment usable for training the neural network, the neural network can obtain effective reward feedback, the efficiency of training the neural network is improved, and the training process is less likely to fall into a local optimum.
According to another aspect of the present disclosure, an environment processing method is provided, the method including:
inputting the environment state vector of the current period into the neural network for processing, to obtain the action output of the current period;
determining the environment state vector of the next period and the first reward feedback of the current period according to the environment state vector of the current period and the action output of the current period.
According to another aspect of the present disclosure, a neural network training device is provided, including:
an input module, configured to input the environment state vector of the current training period into a neural network for processing, to obtain the action output of the current training period and the evaluation output of the current training period;
a feedback determining module, configured to determine the first reward feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
a model loss determining module, configured to determine the model loss of the neural network according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods, and the evaluation output of the current training period, where the historical training periods include one or more training periods before the current training period;
an adjusting module, configured to adjust the network parameter values of the neural network according to the model loss;
a neural network obtaining module, configured to obtain the trained neural network when the neural network satisfies the training condition.
In a possible implementation, the evaluation output of the current period includes a first evaluation output of the current training period and a second evaluation output of the current training period.
In a possible implementation, the model loss includes a first model loss corresponding to the first evaluation output, a second model loss corresponding to the second evaluation output, and a third model loss corresponding to the action output.
In a possible implementation, the model loss determining module is further configured to:
determine the first model loss according to the first reward feedback of the current training period and the first evaluation output of the current training period;
determine the second model loss according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods, and the second evaluation output of the current training period;
determine the third model loss according to the first model loss and the second model loss.
In a possible implementation, the model loss determining module is further configured to:
determine a first cumulative discounted reward feedback according to the first reward feedback of the current training period, a preset discount rate, and a first expected reward function;
determine the first model loss according to the first cumulative discounted reward feedback and the first evaluation output.
In a possible implementation, the model loss determining module is further configured to:
determine the second reward feedback of the current training period according to the first reward feedback of the current training period and the first reward feedbacks of historical training periods;
determine a second cumulative discounted reward feedback according to the second reward feedback of the current training period, a preset discount rate, and a second expected reward function;
determine the second model loss according to the second cumulative discounted reward feedback and the second evaluation output.
In a possible implementation, the model loss determining module is further configured to:
determine a reward change vector according to the first reward feedbacks of historical training periods and the first reward feedback of the current training period;
obtain a change accumulation vector according to the reward change vector;
obtain a zero-fluctuation reward vector according to the change accumulation vector;
determine the second reward feedback according to the change accumulation vector and the zero-fluctuation reward vector.
In a possible implementation, the device further includes:
a first determining module, configured to determine the environment state vector of the current training period according to the environment state vector of the previous training period and the action output of the previous training period.
In a possible implementation, the device further includes:
a second determining module, configured to, when the current training period is the first training period for training the neural network, obtain the environment state vector of the current training period according to a preset initial environment state vector and a random action output.
In a possible implementation, the device includes:
an action output obtaining module, configured to input the environment state vector of the current period into the neural network according to any one of claims 1 to 9 for processing, to obtain the action output of the current period;
a third determining module, configured to determine the environment state vector of the next period of the current period and the first reward feedback of the current period according to the environment state vector of the current period and the action output of the current period.
According to another aspect of the present disclosure, an electronic device is provided, including:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the above neural network training method.
According to another aspect of the present disclosure, an electronic device is provided, including:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the above environment processing method.
According to another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the above neural network training method.
According to another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the above environment processing method.
Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute part of this specification, illustrate exemplary embodiments, features, and aspects of the present disclosure together with the specification, and serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of a neural network training method according to an embodiment of the present disclosure;
Fig. 2 shows a flowchart of an environment processing method according to an embodiment of the present disclosure;
Fig. 3 shows an application schematic diagram of a neural network training method according to an embodiment of the present disclosure;
Fig. 4 shows a block diagram of a neural network training device according to an embodiment of the present disclosure;
Fig. 5 shows a block diagram of a neural network training device according to an embodiment of the present disclosure;
Fig. 6 shows a block diagram of an environment processing device according to an embodiment of the present disclosure;
Fig. 7 is a block diagram of an electronic device according to an exemplary embodiment;
Fig. 8 is a block diagram of an electronic device according to an exemplary embodiment.
Detailed description
Various exemplary embodiments, features, and aspects of the present disclosure are described in detail below with reference to the accompanying drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.
The word "exemplary" here means "serving as an example, embodiment, or illustration". Any embodiment described here as "exemplary" should not be construed as preferred over or advantageous compared with other embodiments.
In addition, numerous specific details are given in the following detailed description to better illustrate the present disclosure. Those skilled in the art will understand that the present disclosure can equally be implemented without certain of these details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, in order to highlight the gist of the present disclosure.
Fig. 1 shows a flowchart of a neural network training method according to an embodiment of the present disclosure. As shown in Fig. 1, the method includes:
In step S11, inputting the environment state vector of the current training period into a neural network for processing, to obtain the action output of the current training period and the evaluation output of the current training period;
In step S12, determining the first reward feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
In step S13, determining the model loss of the neural network according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods, and the evaluation output of the current training period, where the historical training periods include one or more training periods before the current training period;
In step S14, adjusting the network parameter values of the neural network according to the model loss;
In step S15, obtaining the trained neural network when the neural network satisfies the training condition.
With the neural network training method according to embodiments of the present disclosure, the environment state vector is input into the neural network to obtain the action output and the evaluation output of the current training period, and the model loss is determined from the first reward feedback of the current training period, the first reward feedbacks of historical training periods, and the evaluation output. Since the model loss is determined from multiple reward feedbacks, the training process is less likely to fall into a local optimum, and a neural network with better fitting performance can be obtained.
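Steps S11 to S15 can be sketched as the following training loop. The ToyAgent class is a minimal stand-in for the neural network and its environment: every formula in it (the reward, the loss, the parameter update, the state transition) is an illustrative placeholder rather than the patent's actual model, and the training-condition check of S15 is reduced to a fixed number of periods.

```python
class ToyAgent:
    """Minimal stand-in for the neural network and its environment.
    All formulas below are illustrative assumptions, not the patent's model."""

    def __init__(self):
        self.weight = 0.0  # single scalar "network parameter"

    def forward(self, state):
        action = self.weight * state            # action output
        evaluation = 0.5 * self.weight * state  # evaluation output
        return action, evaluation

    def reward(self, state, action):
        # First reward feedback: less negative as state + action nears 1.0.
        return -abs(1.0 - (state + action))

    def loss(self, feedback, history, evaluation):
        # Model loss from the current feedback, the historical feedbacks,
        # and the evaluation output (simple squared-error placeholder).
        target = feedback + sum(history) / (len(history) + 1)
        return (target - evaluation) ** 2

    def update(self, loss, lr):
        self.weight += lr * loss  # crude stand-in for backpropagation

    def next_state(self, state, action):
        return state + 0.1 * action  # the action output changes the environment


def train(agent, initial_state, num_periods, lr=0.01):
    """Sketch of the S11-S15 loop."""
    state = initial_state
    history = []  # first reward feedbacks of historical training periods
    for _ in range(num_periods):
        action, evaluation = agent.forward(state)         # S11: forward pass
        feedback = agent.reward(state, action)            # S12: reward feedback
        loss = agent.loss(feedback, history, evaluation)  # S13: model loss
        agent.update(loss, lr)                            # S14: adjust parameters
        history.append(feedback)
        state = agent.next_state(state, action)
    return agent  # S15: trained network (fixed period count as the condition)


agent = train(ToyAgent(), initial_state=0.5, num_periods=10)
```

The point of the sketch is the data flow of S11 to S15: the loss at each period depends on the current reward feedback, the accumulated history of feedbacks, and the evaluation output, exactly the three ingredients the method claims.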
In a possible implementation, the environment of the current training period can represent a current state, for example, a game, the traffic environment of an autonomous vehicle, or the trading situation in a financial market. The action output of the neural network can be applied to the environment of the current period and cause the environment to change. In one example, the action output is an action instruction in a game; for instance, in a shooting game, the action instruction can command a role in the game to shoot, thereby changing the current score. In another example, the action output can be an operation instruction in autonomous driving, commanding the driven vehicle to perform an operation and thereby changing the current traffic environment. In another example, the action output can be a trading instruction in a financial market, and the corresponding trading operation can be executed, thereby changing the current market price. The present disclosure places no restriction on the types of environments and action outputs.
In a possible implementation, in step S11, the environment state vector describing the environment of the current training period can be input into the neural network for processing, where the neural network can be the neural network obtained after training in the previous training period.
In a possible implementation, the neural network can process the environment state vector of the current training period to obtain the action output of the current training period and the evaluation output of the current training period. The action output of the current training period can act on the environment of the current period, and the evaluation output of the current training period can be used as a parameter when making reverse adjustments to the neural network.
In a possible implementation, when the current training period is the first training period for training the neural network, the environment state vector of the current training period can be obtained according to a preset initial environment state vector and a random action output. In one example, before training of the neural network starts, a random action output first acts on the initial environment, and this process can be repeated multiple times. In one example, the random action output can be a vector, each element of which is drawn according to some probability distribution function, for example a softmax function. During the repetitions, the random action output is applied to the environment: applying it to the initial environment changes the environment and yields a first initial environment; applying it to the first initial environment changes the environment again and yields a second initial environment; and so on. The number of repetitions can be set to M (M is a positive integer greater than or equal to 1). After M repetitions, the resulting environment is determined to be the environment of the first training period; the environment state vector describing the environment of the first training period can then be input into the neural network for processing, and the neural network can be trained.
In this way, the initial environment can quickly become an environment usable for training the neural network, the neural network can obtain effective reward feedback, the efficiency of training the neural network is improved, and the training process is less likely to fall into a local optimum.
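The warm-up procedure above can be sketched as follows. The softmax draw over Gaussian logits and the additive state-transition rule are assumptions for illustration; the description only requires that a random action output, drawn from some probability distribution, act on the initial environment M times.

```python
import math
import random


def warm_up(initial_state, repetitions):
    """Apply a random action output to the initial environment `repetitions`
    (i.e. M) times and return the environment of the first training period."""

    def softmax(xs):
        m = max(xs)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]

    state = list(initial_state)
    for _ in range(repetitions):
        # Random action output: each element drawn from a probability
        # distribution (softmax over Gaussian logits, as one possible choice).
        logits = [random.gauss(0.0, 1.0) for _ in state]
        action = softmax(logits)
        # Applying the action output changes the environment (placeholder rule).
        state = [s + a for s, a in zip(state, action)]
    return state
```

Because each softmax action vector sums to 1, every repetition shifts the environment by a unit of total "activity" regardless of the random draw, which is one way to see how repeated random actions move the environment away from its preset initial state.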
In a possible implementation, in step S12, the action output of the current training period can act on the environment of the current training period; that is, the environment state vector of the current training period and the action output of the current training period can be combined by a vector operation or matrix operation (for example, addition, multiplication, or convolution; the present disclosure places no restriction on the type of vector or matrix operation) to obtain the first reward feedback of the current training period. The first reward feedback can represent the change in the gap between the environment and an ideal state after the action output of the current training period acts on the environment of the current period.
In one example, the environment is a shooting game, and the ideal state is maximizing the score difference between our side and the opposing player. After the action output of the current training period acts on the environment, the first reward feedback can be obtained: for example, if our side scores a hit, a positive first reward feedback is obtained; if the opposing side scores a hit, a negative first reward feedback is obtained; if neither side scores, the reward feedback is 0.
In another example, the environment is the traffic environment during driving. After the action output of the current training period acts on the environment, the first reward feedback can be obtained: for example, if the traffic environment improves after the vehicle performs the operation associated with the action output (for example, the road is clear), a positive first reward feedback is obtained; if the traffic environment deteriorates after the vehicle performs the operation (for example, the vehicle drives into a crowded road), a negative first reward feedback is obtained; if the traffic condition is unchanged, the first reward feedback is 0.
In another example, the environment is the trading situation in a financial market. After the action output of the current training period acts on the environment, the first reward feedback can be obtained: for example, if the return is positive after the trading operation associated with the action output is executed, a positive first reward feedback is obtained; if the return is negative, a negative first reward feedback is obtained; if there is no return (for example, no trading operation is executed, or the executed trade breaks even), the first reward feedback is 0.
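The three examples share one pattern: the first reward feedback is positive when the action moves the environment toward the ideal state, negative when it moves away, and 0 when nothing changes. A minimal sketch of that sign convention, assuming the environment's quality can be summarized by a single scalar (score margin, road clearance, or trading return) — the scalar framing and the unit magnitudes are illustrative assumptions:

```python
def first_reward_feedback(value_before, value_after):
    """Sign of the change in a scalar 'ideal-state measure', e.g. our score
    minus the opponent's, traffic flow, or trading return."""
    delta = value_after - value_before
    if delta > 0:
        return 1.0   # environment moved toward the ideal state
    if delta < 0:
        return -1.0  # environment moved away from the ideal state
    return 0.0       # no change: no hit, unchanged traffic, no trade
```

In practice the magnitude could also scale with `delta` rather than being fixed at 1; the patent text only fixes the sign convention.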
In a possible implementation, the environment state vector of the next training period can further be determined according to the environment state vector of the current training period and the action output of the current training period. In one example, the environment state vector of the current training period and the action output of the current training period can be combined by a vector operation or matrix operation (for example, addition, multiplication, or convolution; the present disclosure places no restriction on the type of vector or matrix operation) to obtain the environment state vector of the next training period. Specifically, the action output of the current training period can act on the environment and cause it to change, yielding the environment of the next training period; the environment state vector of the next training period is the state vector describing the environment of the next training period.
In a possible implementation, the environment state vector of the current training period can be determined according to the environment state vector of the previous training period and the action output of the previous training period. That is, the environment state vector of the current training period is determined from the environment state vector of the previous training period and the action output of the previous training period by a vector operation or matrix operation. The action output of the previous training period can act on the environment of the previous training period to obtain the environment of the current training period.
In a possible implementation, in step S13, the model loss of the neural network can be determined according to the evaluation output of the current training period, the first reward feedback of the current training period, and the first reward feedbacks of one or more historical training periods. The evaluation output of the current period includes the first evaluation output and the second evaluation output of the current training period, and the model loss includes the first model loss corresponding to the first evaluation output, the second model loss corresponding to the second evaluation output, and the third model loss corresponding to the action output.
In one possible implementation, step S13 may include: determining the first model loss according to the first reward-penalty feedback of the current training period and the first measurement output of the current training period; determining the second model loss according to the first reward-penalty feedback of the current training period, the first reward-penalty feedback of the history training periods, and the second measurement output of the current training period; and determining the third model loss according to the first model loss and the second model loss.
In one possible implementation, the first model loss corresponds to the first measurement output. The first model loss may be backpropagated through the first measurement output of the neural network so as to adjust parameters of the neural network such as its weights. In this example, the first model loss is related to the first reward-penalty feedback of the current training period and the first measurement output of the current training period.
In one possible implementation, determining the first model loss according to the first reward-penalty feedback of the current training period and the first measurement output of the current training period may include: determining a first cumulative discounted reward-penalty feedback according to the first reward-penalty feedback of the current training period, a preset discount rate, and a first expected reward-penalty function; and determining the first model loss according to the first cumulative discounted reward-penalty feedback and the first measurement output.
In one possible implementation, the first cumulative discounted reward-penalty feedback R̂_t may be determined according to the following formula (1), from the first reward-penalty feedback r_t of the current training period, the preset discount rate γ, and the first expected reward-penalty function V:

R̂_t = Σ_{n=0}^{N−1} γ^n·r_{t+n} + γ^N·V(s_{t+N})   (1)

The first cumulative discounted reward-penalty feedback R̂_t does not involve the first reward-penalty feedback of the history training periods, and may therefore be regarded as a short-horizon cumulative discounted reward-penalty feedback. The first expected reward-penalty function V is an evaluation function of the first cumulative discounted reward-penalty feedback for the environment of the current training period; its input may be an environment state vector. The preset discount rate γ is a constant, for example 0.9. N is a positive integer greater than or equal to 1, t denotes the current training period, and r_{t+n} (0 ≤ n ≤ N−1) is the first reward-penalty feedback of period t+n obtained when the action output acts on the environment. For example, applying the action output of the current training period to the current environment yields the first reward-penalty feedback r_t of the current environment and the environment state vector s_{t+1} of the next training period; applying the action output of the next training period to its environment yields the reward-penalty feedback r_{t+1} of the next training period and the environment state vector s_{t+2} after two training periods. Proceeding in this way yields the first reward-penalty feedback of each of the N training periods and the environment state vector s_{t+N} after N training periods. V(s_{t+N}) is the estimated expected value of the first cumulative discounted reward-penalty feedback for the environment after N training periods; the present disclosure does not limit the value of N. For example, where the environment is the trading situation of a financial market, the first expected reward-penalty function V may be used to estimate expected revenue.
In one possible implementation, the first model loss L1 may be determined according to the following formula (2), from the first cumulative discounted reward-penalty feedback R̂_t and the first measurement output:

L1 = (R̂_t − V(s_t))²   (2)

Here V(s_t) is the expected value of the first cumulative discounted reward-penalty feedback for the environment of the current training period, that is, the expected value obtained by inputting the environment state vector s_t of the current training period into the first expected reward-penalty function V. In this example, the value V(s_t) is the first measurement output.
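As a hedged sketch of formulas (1) and (2) as reconstructed above (an N-step discounted return bootstrapped with the value estimate V(s_{t+N}), and a squared-error loss against V(s_t), the first measurement output):

```python
# Sketch of formulas (1) and (2) under the stated assumptions. The value
# estimates v_final = V(s_{t+N}) and v_current = V(s_t) stand in for the
# neural network's first measurement output.
def discounted_return(rewards, v_final, gamma=0.9):
    # rewards: [r_t, r_{t+1}, ..., r_{t+N-1}]
    n = len(rewards)
    ret = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return ret + (gamma ** n) * v_final       # + gamma^N * V(s_{t+N})

def first_model_loss(rewards, v_final, v_current, gamma=0.9):
    r_hat = discounted_return(rewards, v_final, gamma)
    return (r_hat - v_current) ** 2           # formula (2)

r_hat = discounted_return([1.0, 2.0, 3.0], v_final=10.0)
loss = first_model_loss([1.0, 2.0, 3.0], v_final=10.0, v_current=12.0)
```

With γ = 0.9 and N = 3 this gives R̂_t = 1 + 0.9·2 + 0.81·3 + 0.729·10 = 12.52, so the loss is (12.52 − 12)².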
In one possible implementation, the second model loss corresponds to the second measurement output. The second model loss may be backpropagated through the second measurement output of the neural network so as to adjust parameters of the neural network such as its weights. In this example, the second model loss is related to the first reward-penalty feedback of the current training period, the first reward-penalty feedback of the history training periods, and the second measurement output of the current training period. Determining the second model loss according to the first reward-penalty feedback of the current training period, the first reward-penalty feedback of the history training periods, and the second measurement output of the current training period includes: determining a second reward-penalty feedback of the current training period according to the first reward-penalty feedback of the current training period and the first reward-penalty feedback of the history training periods; determining a second cumulative discounted reward-penalty feedback according to the second reward-penalty feedback of the current training period, the preset discount rate, and a second expected reward-penalty function; and determining the second model loss according to the second cumulative discounted reward-penalty feedback and the second measurement output.
In one possible implementation, determining the second reward-penalty feedback of the current training period according to the first reward-penalty feedback of the current training period and the first reward-penalty feedback of the history training periods may include: determining a reward-penalty change vector according to the first reward-penalty feedback of the history training periods and the first reward-penalty feedback of the current training period; obtaining a change accumulation vector according to the reward-penalty change vector; obtaining a zero-fluctuation reward-penalty vector according to the change accumulation vector; and determining the second reward-penalty feedback according to the change accumulation vector and the zero-fluctuation reward-penalty vector.
In one possible implementation, the reward-penalty change vector d = [d_{t−(T−1)}, d_{t−(T−2)}, …, d_t] may be determined according to the first reward-penalty feedbacks r_{t−1}, r_{t−2}, r_{t−3}, … of the history training periods and the first reward-penalty feedback r_t of the current training period. In this example, the first reward-penalty feedbacks of T training periods may be used, where T is a positive integer greater than or equal to 2; in this example T may be 20, and the present disclosure places no restriction on the value of T. The first reward-penalty feedbacks of the T training periods may be composed into a vector [r_{t−(T−1)}, …, r_{t−2}, r_{t−1}, r_t]. From this vector, the change of the reward-penalty feedback of each training period relative to that of the preceding training period may be determined, that is, d = [r_{t−(T−1)}, r_{t−(T−2)} − r_{t−(T−1)}, …, r_{t−1} − r_{t−2}, r_t − r_{t−1}], so that in the reward-penalty change vector d, d_{t−(T−1)} = r_{t−(T−1)}, d_{t−(T−2)} = r_{t−(T−2)} − r_{t−(T−1)}, …, d_t = r_t − r_{t−1}.
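A minimal sketch of this change vector (the first element keeps the oldest first reward-penalty feedback, each later element is the change relative to the previous period):

```python
# Reward-penalty change vector d from the first reward-penalty feedbacks
# of the last T training periods.
def change_vector(rewards):
    # rewards: [r_{t-(T-1)}, ..., r_{t-1}, r_t], T >= 2
    return [rewards[0]] + [rewards[i] - rewards[i - 1]
                           for i in range(1, len(rewards))]

d = change_vector([1.0, 3.0, 2.0, 5.0])
```

For the example inputs, d = [1.0, 2.0, -1.0, 3.0].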
In one possible implementation, the change accumulation vector f̂ may be obtained from the reward-penalty change vector d. In this example, reversing the order of the reward-penalty change vector d yields the vector f = [f_1, f_2, …, f_T] = [d_t, d_{t−1}, …, d_{t−(T−1)}], and accumulating the elements of f yields the change accumulation vector f̂ = [f̂_1, f̂_2, …, f̂_T], where f̂_i = Σ_{j=1}^{i} f_j, 1 ≤ i ≤ T.
In one possible implementation, the zero-fluctuation reward-penalty vector f̄ may be determined from the change accumulation vector f̂. In this example, f̄ = [f̄_1, f̄_2, …, f̄_T], where f̄_k = f̂_1·(1 + R_H)^{k−1} and R_H is the reward-penalty estimation parameter determined according to formula (3) below.
In one possible implementation, the second reward-penalty feedback r̂_t may be determined according to the change accumulation vector f̂ and the zero-fluctuation reward-penalty vector f̄. In this example, the reward-penalty estimation parameter R_H may be determined from the elements f̂_1 and f̂_T of the change accumulation vector, as in the following formula (3):

R_H = (f̂_T / f̂_1)^{1/(T−1)} − 1   (3)

The ratio between the difference of the change accumulation vector f̂ and the zero-fluctuation reward-penalty vector f̄, and the zero-fluctuation reward-penalty vector f̄, may be determined element by element, as in the following formula (4):

δ_k = (f̂_k − f̄_k) / f̄_k   (4)

where 1 ≤ k ≤ T, so that T ratios are determined according to formula (4), with variance σ² = Var(δ_1, …, δ_T). Further, the second reward-penalty feedback r̂_t may be determined from the reward-penalty estimation parameter R_H and the variance σ², as in the following formula (5):

r̂_t = R_H·(1 − (σ / σ_max)^τ)   (5)

Here τ and σ_max are preset parameters: τ may be used to control the amplitude of r̂_t, and σ_max may be used to control its fluctuation. In this example, σ_max = 1 and τ = 2; the present disclosure places no restriction on the values of τ and σ_max.
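As a heavily hedged sketch of formulas (3) to (5) as reconstructed above (the original formula images are not reproduced in this text, so the index conventions and the use of f̂_1 as the base level of the zero-fluctuation curve are assumptions), a variability-weighted second reward-penalty feedback can be computed as:

```python
import math

# Hedged sketch of formulas (3)-(5) as reconstructed above: growth of the
# change accumulation vector, penalized by the fluctuation of that vector
# around a zero-fluctuation (constant-growth) curve. Assumes f_hat stays
# positive so the fractional power in formula (3) is defined.
def second_reward(rewards, sigma_max=1.0, tau=2.0):
    # reward-penalty change vector d, reversed to f, accumulated to f_hat
    d = [rewards[0]] + [rewards[i] - rewards[i - 1]
                        for i in range(1, len(rewards))]
    f = d[::-1]
    f_hat, total = [], 0.0
    for x in f:
        total += x
        f_hat.append(total)
    T = len(f_hat)
    # formula (3): mean compounded growth of the change accumulation vector
    r_h = (f_hat[-1] / f_hat[0]) ** (1.0 / (T - 1)) - 1.0
    # zero-fluctuation vector and formula (4): relative deviations
    f_bar = [f_hat[0] * (1.0 + r_h) ** k for k in range(T)]
    delta = [(fh - fb) / fb for fh, fb in zip(f_hat, f_bar)]
    mean = sum(delta) / T
    sigma = math.sqrt(sum((x - mean) ** 2 for x in delta) / T)
    # formula (5): growth discounted by the fluctuation penalty
    return r_h * (1.0 - (sigma / sigma_max) ** tau)

r2 = second_reward([1.0, 2.0, 3.0, 4.0])
```

For steadily rising rewards the deviations δ_k are small, so r̂_t stays close to R_H; volatile reward histories are penalized toward zero.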
In this way, the second reward-penalty feedback may be obtained from the first reward-penalty feedbacks of multiple history training periods, so that the training process obtains more effective reward-penalty feedback; this can make the neural network suitable for more complex environments and improve its performance.
In one possible implementation, the second cumulative discounted reward-penalty feedback R̂_t^vwr may be determined according to the following formula (6), from the second reward-penalty feedback r̂_t of the current training period, the preset discount rate γ, and the second expected reward-penalty function V_vwr:

R̂_t^vwr = Σ_{n=0}^{N−1} γ^n·r̂_{t+n} + γ^N·V_vwr(s_{t+N})   (6)

The second cumulative discounted reward-penalty feedback R̂_t^vwr involves the first reward-penalty feedback of the history training periods, and may therefore be regarded as a long-horizon cumulative discounted reward-penalty feedback. The second expected reward-penalty function V_vwr is an evaluation function of the second cumulative discounted reward-penalty feedback for the environment of the current training period; its input may be an environment state vector. r̂_{t+n} is the second reward-penalty feedback of period t+n obtained when the action output acts on the environment. For example, applying the action output of the current training period to the current environment yields the second reward-penalty feedback r̂_t of the current environment and the environment state vector s_{t+1} of the next training period; applying the action output of the next training period to its environment yields the reward-penalty feedback r̂_{t+1} of the next training period and the environment state vector s_{t+2} after two training periods. In this way, the second reward-penalty feedback of each training period and the environment state vector s_{t+N} after N training periods may be obtained. V_vwr(s_{t+N}) is the estimated expected value of the second cumulative discounted reward-penalty feedback for the environment after N training periods; the present disclosure does not limit the value of N.
In one possible implementation, the second model loss L2 may be determined according to the following formula (7), from the second cumulative discounted reward-penalty feedback R̂_t^vwr and the second measurement output:

L2 = (R̂_t^vwr − V_vwr(s_t))²   (7)

Here V_vwr(s_t) is the expected value of the second cumulative discounted reward-penalty feedback for the environment of the current training period, that is, the expected value obtained by inputting the environment state vector s_t of the current training period into the second expected reward-penalty function V_vwr. In this example, the value V_vwr(s_t) is the second measurement output.
In one possible implementation, the third model loss L3 corresponding to the action output a_t may be determined according to the following formula (8), from the first model loss L1 and the second model loss L2:

L3 = −log π(a_t | s_t; θ)·(√L1 + √L2)   (8)

Here π(a_t | s_t; θ) is the expression of the neural network of the current training period, θ denotes the network parameter values of the neural network of the current training period, √L1 is the square root of the first model loss L1, and √L2 is the square root of the second model loss L2.
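A hedged sketch of formula (8) as reconstructed above (combining the two value losses through their square roots is an assumption; the original formula image is not reproduced in this text):

```python
import math

# Policy-gradient-style third model loss built from the probability the
# network assigned to the action actually taken and the square roots of
# the first and second model losses.
def third_model_loss(action_prob, l1, l2):
    # action_prob: pi(a_t | s_t; theta), with 0 < action_prob <= 1
    return -math.log(action_prob) * (math.sqrt(l1) + math.sqrt(l2))

l3 = third_model_loss(0.5, l1=0.04, l2=0.09)
```

For example, with π(a_t|s_t; θ) = 0.5, √L1 = 0.2 and √L2 = 0.3, the loss is −ln(0.5)·0.5 ≈ 0.3466.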
In one possible implementation, in step S14, the network parameter values of the neural network may be adjusted according to the first model loss L1, the second model loss L2, and the third model loss L3. In this example, the network parameter values may be adjusted in the direction that minimizes the model loss, or in the direction that minimizes a regularized model loss, so that the adjusted neural network has a higher goodness of fit while avoiding overfitting.
In one possible implementation, step S14 may be executed repeatedly in a loop. For example, the parameters may be adjusted a predetermined number of times in the direction that minimizes the model loss; alternatively, the number of adjustments may be left unlimited, and the loop stopped when the model loss has decreased to a certain degree or has converged within a certain threshold.
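A minimal sketch of this looped adjustment on a toy stand-in loss (the real model losses L1, L2, L3 and network parameters are assumed, not reproduced), showing both stopping rules from the text:

```python
# Gradient steps on a toy quadratic loss until the loss converges within a
# threshold or a maximum number of adjustments is reached.
def minimize(theta, lr=0.1, threshold=1e-6, max_steps=1000):
    for step in range(max_steps):
        loss = (theta - 3.0) ** 2       # stand-in model loss
        if loss < threshold:            # converged within the threshold
            return theta, step
        grad = 2.0 * (theta - 3.0)
        theta -= lr * grad              # adjust in the loss-minimizing direction
    return theta, max_steps             # predetermined number of adjustments

theta, steps = minimize(0.0)
```

Either rule alone suffices; using both guards against non-converging runs.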
In one possible implementation, in step S15, the trained neural network may be obtained when a training condition is met. In this example, the neural network after a predetermined number of adjustment loops may be taken as the trained neural network, or the neural network whose model loss has decreased to a certain degree or converged within a certain threshold may be taken as the trained neural network. The present disclosure places no restriction on the training condition.
With the neural network training method according to the embodiments of the present disclosure, random action outputs are used to quickly obtain the environment of the first training period, which allows the neural network to obtain effective reward-penalty feedback, improves the efficiency of training, and makes the training process less likely to fall into a locally optimal solution. Further, the model loss is determined from the first reward-penalty feedback of the current training period, the first reward-penalty feedback of the history training periods, and the first and second measurement outputs, so that the training process obtains more effective reward-penalty feedback; the neural network can thus be made suitable for more complex environments and its performance improved. Moreover, a model loss determined from multiple reward-penalty feedbacks is less likely to fall into a locally optimal solution during training, and a neural network with a higher goodness of fit can be obtained.
Fig. 2 shows a flowchart of an environment processing method according to an embodiment of the present disclosure. As shown in Fig. 2, the method includes:
In step S21, inputting the environment state vector of the current period into the neural network for processing, to obtain the action output of the current period;
In step S22, determining the environment state vector of the next period of the current period and the first reward-penalty feedback of the current period according to the environment state vector of the current period and the action output of the current period.
In one possible implementation, in step S21, the neural network is a neural network trained by the training method of steps S11 to S15, that is, a neural network trained with the model loss determined from the first reward-penalty feedback of the current training period, the first reward-penalty feedback of the history training periods, and the first and second measurement outputs. Since more effective reward-penalty feedback can be obtained during training, the trained neural network can be suitable for more complex environments and can obtain more accurate action outputs.
In one possible implementation, in step S22, the action output of the current period and the environment state vector of the current period may be combined by a vector operation or a matrix operation (for example, addition, multiplication, convolution, or the like; the present disclosure places no restriction on the type of vector or matrix operation) to obtain the environment state vector of the next period. Specifically, the action of the current period may act on the environment and cause it to change, yielding the environment of the next period; the environment state vector of the next period is the state vector describing that environment. Further, when the environment of the current period changes into the environment of the next period, the first reward-penalty feedback of the current period may be obtained; the first reward-penalty feedback may be used to characterize the revenue generated when the action output of the current period acts on the environment of the current period. For example, where the environment is the trading situation of a financial market, applying the action output of the current period (for example, a trading operation) to the environment changes the trading situation in the financial market (for example, after the trade, the share price or turnover changes) and yields the first reward-penalty feedback (for example, the revenue obtained from the trade). Since, during training, the neural network is trained with both the first reward-penalty feedback of the current training period and the first reward-penalty feedbacks of the history training periods, the trained neural network can adapt to more complex environments on the basis of the history periods; for example, it can make more accurate trading operations according to more historical revenue situations, thereby obtaining better prices and lower transaction costs, and obtaining larger revenue in backtests on historical data.
Fig. 3 shows an application schematic diagram of a neural network training method according to an embodiment of the present disclosure. As shown in Fig. 3, a random action output may be applied to the initial environment, repeatedly, to obtain the environment of the first training period and thereby start training the neural network.
In one possible implementation, according to the environment of the current training period, the environment state vector s_t describing that environment may be obtained; inputting s_t into the neural network yields the action output a_t, the first measurement output, and the second measurement output of the current training period.
In one possible implementation, the action output a_t may be applied to the environment of the current training period, yielding the first reward-penalty feedback r_t and the environment state vector s_{t+1} of the next training period. According to the first reward-penalty feedback r_t and the first reward-penalty feedbacks of the history training periods, the second reward-penalty feedback r̂_t may be obtained.
In one possible implementation, according to the first reward-penalty feedback r_t and the first measurement output, the first model loss L1 corresponding to the first measurement output may be obtained. According to the second reward-penalty feedback r̂_t and the second measurement output, the second model loss L2 corresponding to the second measurement output may be obtained. According to the first model loss L1 and the second model loss L2, the third model loss L3 corresponding to the action output may be obtained.
In one possible implementation, the neural network may be adjusted in reverse through the first measurement output according to the first model loss L1, adjusted in reverse through the second measurement output according to the second model loss L2, and adjusted in reverse through the action output a_t according to the third model loss L3. When the training condition is met, the trained neural network may be obtained. The trained neural network may then receive the environment state vector s_{t+1} of the next training period.
In one possible implementation, the environment may include the traffic environment in which an autonomous vehicle is located. Through the first reward-penalty feedback r_t of the current training period, and the second reward-penalty feedback r̂_t determined from r_t and the first reward-penalty feedbacks of the history training periods, multiple effective reward-penalty feedbacks can be obtained; these can be used in complex traffic environments to output more accurate action outputs.
In one possible implementation, the environment is the trading situation of a financial market. Through the first reward-penalty feedback r_t of the current training period, and the second reward-penalty feedback r̂_t determined from r_t and the first reward-penalty feedbacks of the history training periods, multiple effective reward-penalty feedbacks can be obtained; these can be used in complex trading environments to output more accurate action outputs, so as to make more accurate trading operations according to more historical revenue situations, thereby obtaining better prices and lower transaction costs, and obtaining larger revenue in backtests on historical data.
In one possible implementation, the neural network trained by the above method may be used for environment processing. That is, the environment state vector of the current period may be input into the trained neural network for processing to obtain the action output of the current period; and the environment state vector of the next period of the current period and the first reward-penalty feedback of the current period may be determined according to the environment state vector of the current period and the action output of the current period. By processing the environment with the trained neural network, the neural network can adapt to more complex environments on the basis of the history periods, obtaining more accurate action outputs and higher first reward-penalty feedback, that is, higher revenue.
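A minimal sketch of this environment-processing loop of steps S21 and S22, with a toy stand-in policy and environment (the trained network and the real environment are assumptions, not reproduced):

```python
# Toy environment-processing loop: the policy stands in for the trained
# neural network, the env_step for the environment's response.
def policy(state):
    # stand-in for the trained network: act toward a fixed target value 2.0
    return [0.1 * (2.0 - s) for s in state]

def env_step(state, action):
    # next state via a simple additive operation; the reward-penalty
    # feedback is the negative distance from the target value 2.0
    new_state = [s + a for s, a in zip(state, action)]
    reward = -sum(abs(2.0 - s) for s in new_state)
    return new_state, reward

state = [0.0, 4.0]
for _ in range(50):                          # repeated periods
    action = policy(state)                   # step S21: action output
    state, reward = env_step(state, action)  # step S22: next state + feedback
```

Over the periods the state approaches the target and the first reward-penalty feedback rises toward zero, mirroring how a trained policy accumulates revenue.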
It can be understood that the method embodiments mentioned above in the present disclosure may, without violating their principles and logic, be combined with one another to form combined embodiments, which, due to space limitations, are not described again in the present disclosure.
In addition, the present disclosure further provides a neural network training apparatus, an environment processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any of the methods provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding records in the method sections, which are not repeated here.
Fig. 4 shows a block diagram of a neural network training apparatus according to an embodiment of the present disclosure. As shown in Fig. 4, the neural network training apparatus includes:
an input module 11, configured to input the environment state vector of the current training period into the neural network for processing, to obtain the action output of the current training period and the measurement output of the current training period;
a feedback determining module 12, configured to determine the first reward-penalty feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
a model loss determining module 13, configured to determine the model loss of the neural network according to the first reward-penalty feedback of the current training period, the first reward-penalty feedback of the history training periods, and the measurement output of the current training period, where the history training periods include one or more training periods before the current training period;
an adjusting module 14, configured to adjust the network parameter values of the neural network according to the model loss; and
a neural network obtaining module 15, configured to obtain the trained neural network when the neural network meets the training condition.
Fig. 5 shows a block diagram of a neural network training apparatus according to an embodiment of the present disclosure. As shown in Fig. 5, the apparatus further includes:
a second determining module 16, configured to, when the current training period is the first training period for training the neural network, obtain the environment state vector of the current training period according to a preset initial environment state vector and a random action output.
In one possible implementation, the apparatus further includes:
a first determining module 17, configured to determine the environment state vector of the current training period according to the environment state vector of the previous training period and the action output of the previous training period.
In one possible implementation, the measurement output of the current period includes the first measurement output of the current training period and the second measurement output of the current training period.
In one possible implementation, the model loss includes the first model loss corresponding to the first measurement output, the second model loss corresponding to the second measurement output, and the third model loss corresponding to the action output.
In one possible implementation, the model loss determining module 13 is further configured to:
determine the first model loss according to the first reward-penalty feedback of the current training period and the first measurement output of the current training period;
determine the second model loss according to the first reward-penalty feedback of the current training period, the first reward-penalty feedback of the history training periods, and the second measurement output of the current training period; and
determine the third model loss according to the first model loss and the second model loss.
In one possible implementation, the model loss determining module 13 is further configured to:
determine the first cumulative discounted reward-penalty feedback according to the first reward-penalty feedback of the current training period, the preset discount rate, and the first expected reward-penalty function; and
determine the first model loss according to the first cumulative discounted reward-penalty feedback and the first measurement output.
In one possible implementation, the model loss determining module 13 is further configured to:
determine the second reward-penalty feedback of the current training period according to the first reward-penalty feedback of the current training period and the first reward-penalty feedback of the history training periods;
determine the second cumulative discounted reward-penalty feedback according to the second reward-penalty feedback of the current training period, the preset discount rate, and the second expected reward-penalty function; and
determine the second model loss according to the second cumulative discounted reward-penalty feedback and the second measurement output.
In one possible implementation, the model loss determining module 13 is further configured to:
determine the reward-penalty change vector according to the first reward-penalty feedback of the history training periods and the first reward-penalty feedback of the current training period;
obtain the change accumulation vector according to the reward-penalty change vector;
obtain the zero-fluctuation reward-penalty vector according to the change accumulation vector; and
determine the second reward-penalty feedback according to the change accumulation vector and the zero-fluctuation reward-penalty vector.
Fig. 6 shows a block diagram of an environment processing apparatus according to an embodiment of the present disclosure. As shown in Fig. 6, the environment processing apparatus includes:
an action output obtaining module 21, configured to input the environment state vector of the current period into the neural network described in any one of claims 1 to 9 for processing, to obtain the action output of the current period; and
a third determining module 22, configured to determine the environment state vector of the next period of the current period and the first reward-penalty feedback of the current period according to the environment state vector of the current period and the action output of the current period.
An embodiment of the present disclosure further proposes a computer-readable storage medium having computer program instructions stored thereon, where the computer program instructions, when executed by a processor, implement the above method. The computer-readable storage medium may be a non-volatile computer-readable storage medium.
An embodiment of the present disclosure further proposes an electronic device, including: a processor; and a memory for storing processor-executable instructions; where the processor is configured to perform the above method.
The electronic device may be provided as a terminal, a server, or a device in another form.
Fig. 7 is a block diagram of an electronic device 800 according to an exemplary embodiment. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
Referring to Fig. 7, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions, so as to perform all or part of the steps of the methods described above. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation on the electronic device 800. Examples of such data include instructions for any application or method operated on the electronic device 800, contact data, phonebook data, messages, pictures, video, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power supply component 806 provides power to the various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC); when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing state assessments of various aspects of the electronic device 800. For example, the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components (for example, the display and keypad of the electronic device 800); the sensor component 814 can also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to complete the above method.
Fig. 8 is a block diagram of an electronic device 1900 according to an exemplary embodiment. For example, the electronic device 1900 may be provided as a server. Referring to Fig. 8, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as applications. An application stored in the memory 1932 may include one or more modules, each of which corresponds to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions, so as to perform the above method.
The electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to complete the above method.
The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium, or downloaded to an external computer or external storage device via a network, for example the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and a conventional procedural programming language such as the "C" language or a similar programming language. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In a scenario involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), is personalized by utilizing state information of the computer-readable program instructions; the electronic circuit can execute the computer-readable program instructions, thereby implementing aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium having the instructions stored therein comprises an article of manufacture including instructions which implement aspects of the functions/acts specified in one or more blocks of the flowchart and/or block diagram.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device, so as to produce a computer-implemented process, such that the instructions executed on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowchart and/or block diagram.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their technological improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A neural network training method, characterized in that the method comprises:
inputting an environment state vector of a current training cycle into a neural network for processing, to obtain an action output of the current training cycle and a measurement output of the current training cycle;
determining a first reward-and-punishment feedback of the current training cycle according to the environment state vector of the current training cycle and the action output of the current training cycle;
determining a model loss of the neural network according to the first reward-and-punishment feedback of the current training cycle, the first reward-and-punishment feedback of historical training cycles, and the measurement output of the current training cycle, wherein the historical training cycles comprise one or more training cycles before the current training cycle;
adjusting network parameter values of the neural network according to the model loss; and
obtaining a trained neural network when the neural network meets a training condition.
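The training loop described above can be illustrated with a minimal runnable sketch. Everything concrete here is a hypothetical assumption, not the patented implementation: the toy environment, the linear action and measurement heads, the discounted combination of historical reward-and-punishment feedback, and the analytic update rules are all chosen only so the loop executes end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment: the reward-and-punishment feedback is
# higher when the action output tracks the first state component.
def env_step(state, action):
    reward = -float((action - state[0]) ** 2)        # reward feedback, current cycle
    next_state = np.array([state[1], rng.normal()])  # state vector, next cycle
    return next_state, reward

# Hypothetical network: a linear action head and a linear measurement
# (value) head, both reading the same environment state vector.
w_action = 0.1 * rng.normal(size=2)
w_measure = 0.1 * rng.normal(size=2)

gamma, lr = 0.9, 0.05
state = np.array([1.0, 0.0])
history = []  # reward feedback of historical training cycles

for cycle in range(200):
    action = float(w_action @ state)    # action output of the current cycle
    measure = float(w_measure @ state)  # measurement output of the current cycle
    state_next, reward = env_step(state, action)

    # The loss target combines the current reward feedback with a
    # discounted sum over historical reward feedback (illustrative choice).
    target = reward + sum(gamma ** (i + 1) * r
                          for i, r in enumerate(reversed(history[-10:])))

    # Adjust network parameter values (analytic gradients for this
    # linear sketch, standing in for generic backpropagation).
    w_measure += lr * (target - measure) * state        # fit measurement head
    w_action += lr * 2.0 * (state[0] - action) * state  # improve action head

    history.append(reward)
    state = state_next
```

Under this setup the action head converges toward weights that reproduce the first state component, which is the behavior the toy reward feedback favors.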
2. The method according to claim 1, characterized in that the measurement output of the current cycle comprises a first measurement output of the current training cycle and a second measurement output of the current training cycle.
3. The method according to claim 2, characterized in that the model loss comprises a first model loss corresponding to the first measurement output, a second model loss corresponding to the second measurement output, and a third model loss corresponding to the action output.
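The three-part decomposition above can be sketched as a weighted sum of three terms, one per network output. The squared-error form for the two measurement losses, the advantage-weighted log-probability form for the action loss, and the unit weights are illustrative assumptions; the claim does not fix these functional forms.

```python
def model_loss(first_measure, second_measure, action_logprob,
               first_target, second_target, advantage,
               weights=(1.0, 1.0, 1.0)):
    """Illustrative composite loss: three terms, one per network output."""
    first_loss = (first_measure - first_target) ** 2     # first model loss
    second_loss = (second_measure - second_target) ** 2  # second model loss
    third_loss = -advantage * action_logprob             # third (action) loss
    w1, w2, w3 = weights
    return w1 * first_loss + w2 * second_loss + w3 * third_loss
```

Minimizing the first two terms fits the two measurement outputs to their targets; minimizing the third raises the log-probability of actions associated with positive advantage.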
4. An environment processing method, characterized in that the method comprises:
inputting an environment state vector of a current cycle into the neural network of any one of claims 1 to 3 for processing, to obtain an action output of the current cycle; and
determining an environment state vector of the next cycle and a first reward-and-punishment feedback of the current cycle according to the environment state vector of the current cycle and the action output of the current cycle.
5. A neural network training device, characterized by comprising:
an input module, configured to input an environment state vector of a current training cycle into a neural network for processing, to obtain an action output of the current training cycle and a measurement output of the current training cycle;
a feedback determining module, configured to determine a first reward-and-punishment feedback of the current training cycle according to the environment state vector of the current training cycle and the action output of the current training cycle;
a model loss determining module, configured to determine a model loss of the neural network according to the first reward-and-punishment feedback of the current training cycle, the first reward-and-punishment feedback of historical training cycles, and the measurement output of the current training cycle, wherein the historical training cycles comprise one or more training cycles before the current training cycle;
an adjusting module, configured to adjust network parameter values of the neural network according to the model loss; and
a neural network obtaining module, configured to obtain a trained neural network when the neural network meets a training condition.
6. An environment processing device, characterized in that the device comprises:
an action output obtaining module, configured to input an environment state vector of a current cycle into the neural network of any one of claims 1 to 3 for processing, to obtain an action output of the current cycle; and
a third determining module, configured to determine an environment state vector of the next cycle and a first reward-and-punishment feedback of the current cycle according to the environment state vector of the current cycle and the action output of the current cycle.
7. An electronic device, characterized by comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of claims 1 to 3.
8. An electronic device, characterized by comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of claim 4.
9. A computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 3.
10. A computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method of claim 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810885459.2A CN109190760B (en) | 2018-08-06 | 2018-08-06 | Neural network training method and device and environment processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109190760A true CN109190760A (en) | 2019-01-11 |
CN109190760B CN109190760B (en) | 2021-11-30 |
Family
ID=64920260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810885459.2A Active CN109190760B (en) | 2018-08-06 | 2018-08-06 | Neural network training method and device and environment processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109190760B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180084279A1 (en) * | 2016-03-02 | 2018-03-22 | MatrixView, Inc. | Video encoding by injecting lower-quality quantized transform matrix values into a higher-quality quantized transform matrix |
CN108009638A (en) * | 2017-11-23 | 2018-05-08 | 深圳市深网视界科技有限公司 | A kind of training method of neural network model, electronic equipment and storage medium |
CN108197652A (en) * | 2018-01-02 | 2018-06-22 | 百度在线网络技术(北京)有限公司 | For generating the method and apparatus of information |
CN108257116A (en) * | 2017-12-30 | 2018-07-06 | 清华大学 | A kind of method for generating confrontation image |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111191769A (en) * | 2019-12-25 | 2020-05-22 | 中国科学院苏州纳米技术与纳米仿生研究所 | Self-adaptive neural network training and reasoning device |
CN111191769B (en) * | 2019-12-25 | 2024-03-05 | 中国科学院苏州纳米技术与纳米仿生研究所 | Self-adaptive neural network training and reasoning device |
CN111191722A (en) * | 2019-12-30 | 2020-05-22 | 支付宝(杭州)信息技术有限公司 | Method and device for training prediction model through computer |
CN111191722B (en) * | 2019-12-30 | 2022-08-09 | 支付宝(杭州)信息技术有限公司 | Method and device for training prediction model through computer |
CN113211441A (en) * | 2020-11-30 | 2021-08-06 | 湖南太观科技有限公司 | Neural network training and robot control method and device |
CN113211441B (en) * | 2020-11-30 | 2022-09-09 | 湖南太观科技有限公司 | Neural network training and robot control method and device |
CN114793305A (en) * | 2021-01-25 | 2022-07-26 | 上海诺基亚贝尔股份有限公司 | Method, apparatus, device and medium for optical communication |
Also Published As
Publication number | Publication date |
---|---|
CN109190760B (en) | 2021-11-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109190760A (en) | Neural network training method and device and environmental treatment method and device | |
CN109859096A (en) | Image Style Transfer method, apparatus, electronic equipment and storage medium | |
CN109829501A (en) | Image processing method and device, electronic equipment and storage medium | |
CN109089133A (en) | Method for processing video frequency and device, electronic equipment and storage medium | |
CN109919300A (en) | Neural network training method and device and image processing method and device | |
CN108256555A (en) | Picture material recognition methods, device and terminal | |
JP7165818B2 (en) | Neural network training method and device, and image generation method and device | |
CN110009090A (en) | Neural metwork training and image processing method and device | |
CN109801270A (en) | Anchor point determines method and device, electronic equipment and storage medium | |
CN109635920A (en) | Neural network optimization and device, electronic equipment and storage medium | |
CN109978891A (en) | Image processing method and device, electronic equipment and storage medium | |
CN109543537A (en) | Weight identification model increment training method and device, electronic equipment and storage medium | |
CN103886284B (en) | Character attribute information identifying method, device and electronic equipment | |
CN109819229A (en) | Image processing method and device, electronic equipment and storage medium | |
CN110348418A (en) | Method for tracking target and device, Intelligent mobile equipment and storage medium | |
CN110210619A (en) | The training method and device of neural network, electronic equipment and storage medium | |
CN110532956A (en) | Image processing method and device, electronic equipment and storage medium | |
CN104077563A (en) | Human face recognition method and device | |
CN109670077A (en) | Video recommendation method, device and computer readable storage medium | |
CN109145970A (en) | Question and answer treating method and apparatus, electronic equipment and storage medium based on image | |
CN108960283A (en) | Classification task incremental processing method and device, electronic equipment and storage medium | |
CN108596093A (en) | The localization method and device of human face characteristic point | |
CN109670632A (en) | The predictor method of ad click rate, the estimating device of ad click rate, electronic equipment and storage medium | |
CN110889489A (en) | Neural network training method, image recognition method and device | |
CN106599191A (en) | User attribute analysis method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||