CN109190760A - Neural network training method and device and environment processing method and device - Google Patents

Neural network training method and device and environment processing method and device

Info

Publication number
CN109190760A
CN109190760A (application CN201810885459.2A; granted publication CN109190760B)
Authority
CN
China
Prior art keywords
training
punishments
rewards
neural network
feedback
Prior art date
Legal status
Granted
Application number
CN201810885459.2A
Other languages
Chinese (zh)
Other versions
CN109190760B (en)
Inventor
邓煜彬
余可
吕健勤
林达华
汤晓鸥
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201810885459.2A priority Critical patent/CN109190760B/en
Publication of CN109190760A publication Critical patent/CN109190760A/en
Application granted granted Critical
Publication of CN109190760B publication Critical patent/CN109190760B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

This disclosure relates to a neural network training method and device and an environment processing method and device. The method includes: inputting the environment state vector of the current training period into a neural network to obtain an action output and a measurement output; determining a first reward feedback according to the environment state vector and the action output; determining the model loss of the neural network according to the first reward feedback, the first reward feedbacks of historical training periods and the measurement output; adjusting the network parameter values of the neural network according to the model loss; and obtaining the trained neural network when the neural network satisfies a training condition. According to the neural network training method of the embodiments of the present disclosure, because the model loss is determined from the reward feedbacks of the current training period and of multiple historical training periods, the training process is less likely to fall into a locally optimal solution and a better-fitting neural network can be obtained.

Description

Neural network training method and device and environment processing method and device
Technical field
This disclosure relates to the field of computer technology, and in particular to a neural network training method and device and an environment processing method and device.
Background
In the related art, the environment state vector describing the environment of the current period can be input into a neural network to obtain an action output and a measurement output. The action output acts on the environment; the environment changes as a result, producing the environment of the next period together with a reward feedback for the action output. The loss function of the neural network is determined from this reward feedback, and the neural network is trained accordingly. However, because this training method determines the loss function from a single reward feedback, the training process easily falls into a locally optimal solution and it is difficult to obtain a well-fitting neural network.
Summary of the invention
The present disclosure proposes a neural network training method and device and an environment processing method and device.
According to one aspect of the present disclosure, a neural network training method is provided, comprising:
inputting the environment state vector of the current training period into a neural network for processing, to obtain the action output of the current training period and the measurement output of the current training period;
determining the first reward feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
determining the model loss of the neural network according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the measurement output of the current training period, where the historical training periods include one or more training periods before the current training period;
adjusting the network parameter values of the neural network according to the model loss;
obtaining the trained neural network when the neural network satisfies a training condition.
According to the neural network training method of the embodiments of the present disclosure, the environment state vector is input into the neural network to obtain the action output of the current training period and the measurement output of the current training period, and the model loss is determined from the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the measurement output. Because the model loss is determined from the reward feedbacks of multiple training periods, the training process is less likely to fall into a locally optimal solution, and a better-fitting neural network can be obtained.
In one possible implementation, the measurement output of the current period includes a first measurement output of the current training period and a second measurement output of the current training period.
In one possible implementation, the model loss includes a first model loss corresponding to the first measurement output, a second model loss corresponding to the second measurement output, and a third model loss corresponding to the action output.
In one possible implementation, determining the model loss of the neural network according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the measurement output of the current training period comprises:
determining the first model loss according to the first reward feedback of the current training period and the first measurement output of the current training period;
determining the second model loss according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the second measurement output of the current training period;
determining the third model loss according to the first model loss and the second model loss.
In one possible implementation, determining the first model loss according to the first reward feedback of the current training period and the first measurement output of the current training period comprises:
determining a first accumulated discounted reward feedback according to the first reward feedback of the current training period, a preset discount rate and a first expected reward function;
determining the first model loss according to the first accumulated discounted reward feedback and the first measurement output.
In one possible implementation, determining the second model loss according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the second measurement output of the current training period comprises:
determining a second reward feedback of the current training period according to the first reward feedback of the current training period and the first reward feedbacks of historical training periods;
determining a second accumulated discounted reward feedback according to the second reward feedback of the current training period, a preset discount rate and a second expected reward function;
determining the second model loss according to the second accumulated discounted reward feedback and the second measurement output.
In one possible implementation, determining the second reward feedback of the current training period according to the first reward feedback of the current training period and the first reward feedbacks of historical training periods comprises:
determining a reward change vector according to the first reward feedbacks of historical training periods and the first reward feedback of the current training period;
obtaining a variation accumulation vector according to the reward change vector;
obtaining a zero-fluctuation reward vector according to the variation accumulation vector;
determining the second reward feedback according to the variation accumulation vector and the zero-fluctuation reward vector.
In this way, the second reward feedback is obtained from the first reward feedbacks of multiple historical training periods, so that the training process obtains more effective reward feedbacks, the neural network can be made suitable for more complex environments, and the performance of the neural network is improved.
In one possible implementation, the method further includes:
determining the environment state vector of the current training period according to the environment state vector of the previous training period and the action output of the previous training period.
In one possible implementation, the method further includes:
when the current training period is the first training period for training the neural network, obtaining the environment state vector of the current training period according to a preset initial environment state vector and random action outputs.
In this way, the initial environment can quickly become an environment usable for training the neural network, the neural network can obtain effective reward feedbacks, the efficiency of training the neural network is improved, and the training process is less likely to fall into a locally optimal solution.
According to another aspect of the present disclosure, an environment processing method is provided, the method comprising:
inputting the environment state vector of the current period into the neural network for processing, to obtain the action output of the current period;
determining the environment state vector of the next period and the first reward feedback of the current period according to the environment state vector of the current period and the action output of the current period.
According to another aspect of the present disclosure, a neural network training device is provided, comprising:
an input module, configured to input the environment state vector of the current training period into a neural network for processing, to obtain the action output of the current training period and the measurement output of the current training period;
a feedback determining module, configured to determine the first reward feedback of the current training period according to the environment state vector of the current training period and the action output of the current training period;
a model loss determining module, configured to determine the model loss of the neural network according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the measurement output of the current training period, where the historical training periods include one or more training periods before the current training period;
an adjusting module, configured to adjust the network parameter values of the neural network according to the model loss;
a neural network obtaining module, configured to obtain the trained neural network when the neural network satisfies a training condition.
In one possible implementation, the measurement output of the current period includes a first measurement output of the current training period and a second measurement output of the current training period.
In one possible implementation, the model loss includes a first model loss corresponding to the first measurement output, a second model loss corresponding to the second measurement output, and a third model loss corresponding to the action output.
In one possible implementation, the model loss determining module is further configured to:
determine the first model loss according to the first reward feedback of the current training period and the first measurement output of the current training period;
determine the second model loss according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the second measurement output of the current training period;
determine the third model loss according to the first model loss and the second model loss.
In one possible implementation, the model loss determining module is further configured to:
determine a first accumulated discounted reward feedback according to the first reward feedback of the current training period, a preset discount rate and a first expected reward function;
determine the first model loss according to the first accumulated discounted reward feedback and the first measurement output.
In one possible implementation, the model loss determining module is further configured to:
determine the second reward feedback of the current training period according to the first reward feedback of the current training period and the first reward feedbacks of historical training periods;
determine a second accumulated discounted reward feedback according to the second reward feedback of the current training period, a preset discount rate and a second expected reward function;
determine the second model loss according to the second accumulated discounted reward feedback and the second measurement output.
In one possible implementation, the model loss determining module is further configured to:
determine a reward change vector according to the first reward feedbacks of historical training periods and the first reward feedback of the current training period;
obtain a variation accumulation vector according to the reward change vector;
obtain a zero-fluctuation reward vector according to the variation accumulation vector;
determine the second reward feedback according to the variation accumulation vector and the zero-fluctuation reward vector.
In one possible implementation, the device further includes:
a first determining module, configured to determine the environment state vector of the current training period according to the environment state vector of the previous training period and the action output of the previous training period.
In one possible implementation, the device further includes:
a second determining module, configured to, when the current training period is the first training period for training the neural network, obtain the environment state vector of the current training period according to a preset initial environment state vector and random action outputs.
According to another aspect of the present disclosure, an environment processing device is provided, comprising:
an action output obtaining module, configured to input the environment state vector of the current period into the neural network according to any one of the above aspects for processing, to obtain the action output of the current period;
a third determining module, configured to determine the environment state vector of the next period and the first reward feedback of the current period according to the environment state vector of the current period and the action output of the current period.
According to another aspect of the present disclosure, an electronic device is provided, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the above neural network training method.
According to another aspect of the present disclosure, an electronic device is provided, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the above environment processing method.
According to another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the above neural network training method.
According to another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the above environment processing method.
Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and, together with the specification, serve to explain the principles of the disclosure.
Fig. 1 shows a flow chart of a neural network training method according to an embodiment of the present disclosure;
Fig. 2 shows a flow chart of an environment processing method according to an embodiment of the present disclosure;
Fig. 3 shows an application schematic diagram of a neural network training method according to an embodiment of the present disclosure;
Fig. 4 shows a block diagram of a neural network training device according to an embodiment of the present disclosure;
Fig. 5 shows a block diagram of a neural network training device according to an embodiment of the present disclosure;
Fig. 6 shows a block diagram of an environment processing device according to an embodiment of the present disclosure;
Fig. 7 is a block diagram of an electronic device according to an exemplary embodiment;
Fig. 8 is a block diagram of an electronic device according to an exemplary embodiment.
Detailed description of embodiments
Various exemplary embodiments, features and aspects of the present disclosure are described in detail below with reference to the accompanying drawings. Identical reference numerals in the drawings denote elements having identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment or illustration". Any embodiment described herein as "exemplary" should not be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are given in the following detailed description in order to better illustrate the present disclosure. Those skilled in the art will understand that the present disclosure can be practiced without certain of these details. In some instances, methods, means, elements and circuits well known to those skilled in the art are not described in detail, in order to highlight the gist of the present disclosure.
Fig. 1 shows a flow chart of a neural network training method according to an embodiment of the present disclosure. As shown in Fig. 1, the method includes:
In step S11, the environment state vector of the current training period is input into a neural network for processing, to obtain the action output of the current training period and the measurement output of the current training period;
In step S12, the first reward feedback of the current training period is determined according to the environment state vector of the current training period and the action output of the current training period;
In step S13, the model loss of the neural network is determined according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the measurement output of the current training period, where the historical training periods include one or more training periods before the current training period;
In step S14, the network parameter values of the neural network are adjusted according to the model loss;
In step S15, when the neural network satisfies a training condition, the trained neural network is obtained.
According to the neural network training method of the embodiments of the present disclosure, the environment state vector is input into the neural network to obtain the action output of the current training period and the measurement output of the current training period, and the model loss is determined from the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the measurement output. Because the model loss is determined from the reward feedbacks of multiple training periods, the training process is less likely to fall into a locally optimal solution, and a better-fitting neural network can be obtained.
In one possible implementation, the environment of the current training period can represent a current state, for example a game, the traffic environment of an autonomous vehicle, or the trading situation of a financial market. The action output of the neural network can be applied to the environment of the current period, causing the environment to change. In an example, the action output is an action instruction in a game; for example, in a shooting game, the action instruction may command a character to shoot, thereby changing the current score. In an example, the action output may be an operation instruction in autonomous driving, commanding the driven vehicle to perform an operation and thereby changing the current traffic environment. In an example, the action output may be a trading instruction in a financial market, and a trading operation may be executed to change the current market price. The present disclosure places no restriction on the types of environments and action outputs.
In one possible implementation, in step S11, the environment state vector describing the environment of the current training period may be input into the neural network for processing, where the neural network may be the neural network trained in the previous training period.
In one possible implementation, the neural network processes the environment state vector of the current training period to obtain the action output of the current training period and the measurement output of the current training period. The action output of the current training period acts on the environment of the current period, and the measurement output of the current training period is used as a parameter when the neural network is adjusted in the backward pass.
In one possible implementation, when the current training period is the first training period for training the neural network, the environment state vector of the current training period may be obtained according to a preset initial environment state vector and random action outputs. In an example, before training of the neural network starts, random action outputs are first applied to the initial environment, and this process may be repeated multiple times. In an example, a random action output may be a vector whose elements are drawn according to some probability distribution function, for example a softmax function. During the repetition, a random action output is applied to the initial environment, the initial environment changes and a first initial environment is obtained; a random action output is then applied to the first initial environment, which changes in turn and yields a second initial environment, and so on. The process may be repeated several times; for example, the number of repetitions may be set to M (M being a positive integer greater than or equal to 1). After the M repetitions, the resulting environment is taken as the environment of the first training period; the environment state vector describing the environment of the first training period is then input into the neural network, and training of the neural network begins.
In this way, the initial environment can quickly become an environment usable for training the neural network, the neural network can obtain effective reward feedbacks, the efficiency of training the neural network is improved, and the training process is less likely to fall into a locally optimal solution.
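A minimal sketch of the warm-up procedure described above. The environment interface (a `transition_fn` that applies an action output to an environment state) and the parameter names are illustrative assumptions, not identifiers from the patent.

```python
import numpy as np

def warm_up(initial_state, transition_fn, num_actions, M=10, rng=None):
    """Apply M random action outputs to the initial environment state so that the
    resulting state can serve as the environment of the first training period.
    `transition_fn(state, action) -> next_state` stands in for the vector or matrix
    operation that applies an action output to an environment state."""
    rng = np.random.default_rng() if rng is None else rng
    state = initial_state
    for _ in range(M):
        # Random action output: a vector whose entries follow a softmax distribution.
        logits = rng.normal(size=num_actions)
        action = np.exp(logits) / np.exp(logits).sum()
        state = transition_fn(state, action)
    return state
```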
In one possible implementation, in step S12, the action output of the current training period may be applied to the environment of the current training period. That is, the environment state vector of the current training period and the action output of the current training period are combined by a vector operation or a matrix operation (which may include, for example, addition, multiplication or convolution; the present disclosure places no restriction on the type of vector or matrix operation) to obtain the first reward feedback of the current training period. The first reward feedback can represent the change of the gap between the environment and an ideal state after the action output of the current training period has acted on the environment of the current period.
In an example, the environment is a shooting game, and the ideal state is maximizing the score difference between our side and the opposing player. After the action output of the current training period acts on the environment, the first reward feedback is obtained: for example, if our side scores a hit, a positive first reward feedback is obtained; if the opponent scores a hit, a negative first reward feedback is obtained; and if neither side scores, the reward feedback is 0.
In an example, the environment is the traffic environment in which the vehicle is travelling. After the action output of the current training period acts on the environment, the first reward feedback is obtained: for example, if the traffic environment improves after the vehicle performs the operation associated with the action output (for example, the road is clear), a positive first reward feedback is obtained; if the traffic environment deteriorates after the vehicle performs the operation associated with the action output (for example, the vehicle drives into a congested road), a negative first reward feedback is obtained; and if the traffic condition is unchanged after the vehicle performs the operation associated with the action output, the first reward feedback is 0.
In an example, the environment is the trading situation of a financial market. After the action output of the current training period acts on the environment, the first reward feedback is obtained: for example, if the profit is positive after the trading operation associated with the action output is executed, a positive first reward feedback is obtained; if the profit is negative after the trading operation associated with the action output is executed, a negative first reward feedback is obtained; and if there is no profit after the trading operation associated with the action output is executed (that is, no trading operation is performed, or the executed trade breaks even), the first reward feedback is 0.
In one possible implementation, the environment state vector of the next training period may further be determined according to the environment state vector of the current training period and the action output of the current training period. In an example, the environment state vector of the current training period and the action output of the current training period may be combined by a vector operation or a matrix operation (which may include, for example, addition, multiplication or convolution; the present disclosure places no restriction on the type of vector or matrix operation) to obtain the environment state vector of the next training period. Specifically, the action output of the current training period acts on the environment and causes it to change, yielding the environment of the next training period; the environment state vector of the next training period is the state vector describing the environment of the next training period.
In one possible implementation, the environment state vector of the current training period may be determined according to the environment state vector of the previous training period and the action output of the previous training period. That is, the environment state vector of the current training period is determined from the environment state vector of the previous training period and the action output of the previous training period by a vector operation or a matrix operation: the action output of the previous training period acts on the environment of the previous training period, yielding the environment of the current training period.
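A toy illustration of one environment transition and the associated first reward feedback. The additive transition and the reward defined as the change of the gap to an ideal state are assumptions for illustration only; the real dynamics depend on the concrete environment.

```python
import numpy as np

def apply_action(state, action, ideal_state):
    """Toy transition: the action output acts on the environment state through a
    vector operation (here, simple addition), yielding the next period's state.
    The first reward feedback is the change of the gap to an ideal state."""
    next_state = state + action          # placeholder for the real dynamics
    gap_before = np.linalg.norm(ideal_state - state)
    gap_after = np.linalg.norm(ideal_state - next_state)
    reward = gap_before - gap_after      # positive if the gap shrank
    return next_state, reward
```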
In one possible implementation, in step S13, the model loss of the neural network may be determined according to the measurement output of the current training period, the first reward feedback of the current training period and the first reward feedbacks of one or more historical training periods. The measurement output of the current period includes the first measurement output of the current training period and the second measurement output of the current training period, and the model loss includes the first model loss corresponding to the first measurement output, the second model loss corresponding to the second measurement output, and the third model loss corresponding to the action output.
In one possible implementation, step S13 may include:
determining the first model loss according to the first reward feedback of the current training period and the first measurement output of the current training period; determining the second model loss according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the second measurement output of the current training period; and determining the third model loss according to the first model loss and the second model loss.
In one possible implementation, the first model loss corresponds to the first measurement output. The first model loss can be back-propagated through the first measurement output of the neural network to adjust parameters of the neural network such as its weights. In an example, the first model loss is related to the first reward feedback of the current training period and the first measurement output of the current training period.
In one possible implementation, determining the first model loss according to the first reward feedback of the current training period and the first measurement output of the current training period may include:
determining a first accumulated discounted reward feedback according to the first reward feedback of the current training period, a preset discount rate and a first expected reward function; and determining the first model loss according to the first accumulated discounted reward feedback and the first measurement output.
In one possible implementation, the first accumulated discounted reward feedback may be determined from the first reward feedback r_t of the current training period, the preset discount rate γ and the first expected reward function V according to the following formula (1):

$$\hat{R}_t = \sum_{n=0}^{N-1} \gamma^{n}\, r_{t+n} + \gamma^{N}\, V(s_{t+N}) \qquad (1)$$

Here the first accumulated discounted reward feedback does not involve the first reward feedbacks of historical training periods, so it can be regarded as a short-horizon accumulated discounted reward feedback. The first expected reward function V is an evaluation function that estimates the first accumulated discounted reward feedback of the environment of a training period; its input is an environment state vector. The preset discount rate γ is a constant, for example 0.9. N is a positive integer greater than or equal to 1, t is the current training period, and r_{t+n} (0 ≤ n ≤ N-1) is the first reward feedback obtained in period t+n when the action output of the current training period is applied to the environment. For example, applying the action output of the current training period to the current environment yields the first reward feedback r_t of the current environment and the environment state vector s_{t+1} of the next training period; applying the action output of the current training period to the environment of the next training period yields the reward feedback r_{t+1} of the next training period and the environment state vector s_{t+2} two training periods later. In this way, given that the action output is the action output of the current training period, the first reward feedback of each of the N training periods and the environment state vector s_{t+N} after the N training periods are obtained. V(s_{t+N}) is the expected value of the first accumulated discounted reward feedback of the environment after the N training periods; the present disclosure does not limit the value of N. For example, when the environment is the trading situation of a financial market, the first expected reward function V can be used to estimate the expected profit.
In one possible implementation, the first model loss L_1 may be determined from the first accumulated discounted reward feedback and the first measurement output according to the following formula (2):

$$L_1 = \left(\hat{R}_t - V(s_t)\right)^2 \qquad (2)$$

where V(s_t) is the expected value of the first accumulated discounted reward feedback of the environment of the current training period, that is, the expected value obtained by inputting the environment state vector s_t of the current training period into the first expected reward function V. In this example, the value of V(s_t) is the first measurement output.
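A sketch of formulas (1) and (2) as reconstructed above. The helper name, the callable `value_fn` standing in for the first expected reward function V, and the list-based interface are assumptions for illustration.

```python
def first_model_loss(rewards, value_fn, states, gamma=0.9):
    """Formula (1): N-step accumulated discounted reward feedback
        R_hat = sum_{n=0}^{N-1} gamma^n * r_{t+n} + gamma^N * V(s_{t+N})
    Formula (2): first model loss  L1 = (R_hat - V(s_t))^2.
    `rewards` holds r_t ... r_{t+N-1}; `states` holds s_t ... s_{t+N}."""
    N = len(rewards)
    r_hat = sum(gamma ** n * rewards[n] for n in range(N))
    r_hat += gamma ** N * value_fn(states[N])      # bootstrap with V(s_{t+N})
    l1 = (r_hat - value_fn(states[0])) ** 2        # the first measurement output is V(s_t)
    return l1, r_hat
```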
In one possible implementation, the second model loss corresponds to the second measurement output. The second model loss can be back-propagated through the second measurement output of the neural network to adjust parameters of the neural network such as its weights. In an example, the second model loss is related to the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the second measurement output of the current training period. Determining the second model loss according to the first reward feedback of the current training period, the first reward feedbacks of historical training periods and the second measurement output of the current training period includes: determining the second reward feedback of the current training period according to the first reward feedback of the current training period and the first reward feedbacks of historical training periods; determining a second accumulated discounted reward feedback according to the second reward feedback of the current training period, a preset discount rate and a second expected reward function; and determining the second model loss according to the second accumulated discounted reward feedback and the second measurement output.
In one possible implementation, determining the second reward feedback of the current training period according to the first reward feedback of the current training period and the first reward feedbacks of historical training periods may include:
determining a reward change vector according to the first reward feedbacks of historical training periods and the first reward feedback of the current training period; obtaining a variation accumulation vector according to the reward change vector; obtaining a zero-fluctuation reward vector according to the variation accumulation vector; and determining the second reward feedback according to the variation accumulation vector and the zero-fluctuation reward vector.
In one possible implementation, the reward change vector d = [d_{t-(T-1)}, d_{t-(T-2)}, ..., d_t] may be determined from the first reward feedbacks r_{t-1}, r_{t-2}, r_{t-3}, ... of historical training periods and the first reward feedback r_t of the current training period. In this example, the first reward feedbacks of T training periods are used, where T is a positive integer greater than or equal to 2; for example, T may be 20, and the present disclosure places no restriction on the value of T. In this example, the first reward feedbacks of the T training periods may be composed into the vector [r_{t-(T-1)}, ..., r_{t-2}, r_{t-1}, r_t]. From this vector, the change of the reward feedback of each training period relative to that of its preceding training period is determined, that is, d = [r_{t-(T-1)}, r_{t-(T-2)} - r_{t-(T-1)}, ..., r_{t-1} - r_{t-2}, r_t - r_{t-1}], so that in the reward change vector d, d_{t-(T-1)} = r_{t-(T-1)}, d_{t-(T-2)} = r_{t-(T-2)} - r_{t-(T-1)}, ..., d_t = r_t - r_{t-1}.
In one possible implementation, the variation accumulation vector f̂ may be obtained from the reward change vector d. In this example, the reward change vector d is reversed in order to obtain the vector f = [f_1, f_2, ..., f_T] = [d_t, d_{t-1}, ..., d_{t-(T-1)}], and the elements of f are accumulated (with an initial element of 0) to obtain the variation accumulation vector f̂ = [f̂_1, f̂_2, ..., f̂_{T+1}], where f̂_1 = 0 and f̂_k = f̂_{k-1} + f_{k-1} for 2 ≤ k ≤ T+1.
In one possible implementation, the zero-fluctuation reward vector ẑ may be determined from the variation accumulation vector f̂. In this example, the zero-fluctuation reward vector is the ideal accumulation curve that grows evenly, without fluctuation, from the first element of f̂ to its last element, that is, ẑ_k = f̂_1 + (k - 1)(f̂_{T+1} - f̂_1)/T for 1 ≤ k ≤ T+1.
In one possible implementation, the second reward feedback may be determined from the variation accumulation vector f̂ and the zero-fluctuation reward vector ẑ. In this example, a reward estimate parameter R_H may be determined from the elements f̂_{T+1} and f̂_1 of the variation accumulation vector, for example according to the following formula (3):

$$R_H = \frac{\hat{f}_{T+1} - \hat{f}_1}{T} \qquad (3)$$
From the elements of the variation accumulation vector f̂ and the zero-fluctuation reward vector ẑ, the ratio between the difference of the variation accumulation vector and the zero-fluctuation reward vector and the zero-fluctuation reward vector may be determined element-wise, for example according to the following formula (4):

$$p_k = \frac{\hat{f}_k - \hat{z}_k}{\hat{z}_k} \qquad (4)$$

where 1 ≤ k ≤ T+1, so that T+1 ratios are determined according to formula (4), and the variance σ² of these T+1 ratios is computed. Further, the second reward feedback may be determined from the reward estimate parameter R_H and the variance σ², for example according to the following formula (5):

$$\hat{r}_t = R_H\left(1 - \left(\frac{\sigma}{\sigma_{\max}}\right)^{\tau}\right) \qquad (5)$$
where τ and σ_max are preset parameters: τ can be used to control the amplitude of the second reward feedback, and σ_max can be used to control its fluctuation. In this example, σ_max = 1 and τ = 2; the present disclosure places no restriction on the values of τ and σ_max.
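A sketch of the second reward feedback computation following the reading of formulas (3) to (5) given above. The linear zero-fluctuation curve, the mean-growth estimate for R_H and the variable names are reconstructions rather than forms confirmed by the patent text.

```python
import numpy as np

def second_reward_feedback(first_rewards, sigma_max=1.0, tau=2.0, eps=1e-8):
    """Variability-weighted second reward feedback.
    `first_rewards` = [r_{t-(T-1)}, ..., r_{t-1}, r_t] (T first reward feedbacks)."""
    r = np.asarray(first_rewards, dtype=float)
    T = len(r)
    d = np.diff(r, prepend=0.0)                     # reward change vector d
    f = d[::-1]                                     # reversed change vector
    f_hat = np.concatenate(([0.0], np.cumsum(f)))   # variation accumulation vector (T+1 elements)
    z_hat = np.linspace(f_hat[0], f_hat[-1], T + 1) # zero-fluctuation reward vector
    ratios = (f_hat - z_hat) / (np.abs(z_hat) + eps)
    sigma = np.sqrt(np.var(ratios))                 # variance of the T+1 ratios, formula (4)
    r_h = (f_hat[-1] - f_hat[0]) / T                # reward estimate parameter, formula (3)
    return r_h * (1.0 - (sigma / sigma_max) ** tau) # formula (5)
```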
In this way, the second reward feedback is obtained from the first reward feedbacks of multiple historical training periods, so that the training process obtains more effective reward feedbacks, the neural network can be made suitable for more complex environments, and the performance of the neural network is improved.
In one possible implementation, the second accumulated discounted reward feedback may be determined from the second reward feedback of the current training period, the preset discount rate γ and the second expected reward function V_vwr according to the following formula (6):

$$\hat{R}^{vwr}_t = \sum_{n=0}^{N-1} \gamma^{n}\, \hat{r}_{t+n} + \gamma^{N}\, V_{vwr}(s_{t+N}) \qquad (6)$$

Because the second accumulated discounted reward feedback is related to the first reward feedbacks of historical training periods, it can be regarded as a long-horizon accumulated discounted reward feedback. The second expected reward function V_vwr is an evaluation function that estimates the second accumulated discounted reward feedback of the environment of a training period; its input is an environment state vector. Here r̂_{t+n} is the second reward feedback obtained in period t+n when the action output of the current training period is applied to the environment. For example, applying the action output of the current training period to the current environment yields the second reward feedback r̂_t of the current environment and the environment state vector s_{t+1} of the next training period; applying the action output of the current training period to the environment of the next training period yields the reward feedback r̂_{t+1} of the next training period and the environment state vector s_{t+2} two training periods later. In this way, given that the action output is the action output of the current training period, the second reward feedback of each of the N training periods and the environment state vector s_{t+N} after the N training periods are obtained. V_vwr(s_{t+N}) is the expected value of the second accumulated discounted reward feedback of the environment after the N training periods, given that the action output is the action output of the current training period; the present disclosure does not limit the value of N.
In one possible implementation, the second model loss L_2 may be determined from the second accumulated discounted reward feedback and the second measurement output according to the following formula (7):

$$L_2 = \left(\hat{R}^{vwr}_t - V_{vwr}(s_t)\right)^2 \qquad (7)$$

where V_vwr(s_t) is the expected value of the second accumulated discounted reward feedback of the environment of the current training period, that is, the expected value obtained by inputting the environment state vector s_t of the current training period into the second expected reward function V_vwr. In this example, the value of V_vwr(s_t) is the second measurement output.
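Formulas (6) and (7) mirror formulas (1) and (2), with the second reward feedbacks and the second expected reward function in place of the first. A minimal sketch, again with assumed helper and parameter names:

```python
def second_model_loss(second_rewards, vwr_value_fn, states, gamma=0.9):
    """Formula (6): N-step accumulated discounted second reward feedback, bootstrapped
    with V_vwr(s_{t+N}).  Formula (7): L2 = (R_hat_vwr - V_vwr(s_t))^2."""
    N = len(second_rewards)
    r_hat = sum(gamma ** n * second_rewards[n] for n in range(N))
    r_hat += gamma ** N * vwr_value_fn(states[N])
    l2 = (r_hat - vwr_value_fn(states[0])) ** 2   # the second measurement output is V_vwr(s_t)
    return l2, r_hat
```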
In one possible implementation, the third model loss L_3 corresponding to the action output a_t may be determined from the first model loss L_1 and the second model loss L_2, for example according to the following formula (8):

$$L_3 = -\log \pi(a_t \mid s_t; \theta)\left[\left(\hat{R}_t - V(s_t)\right) + \left(\hat{R}^{vwr}_t - V_{vwr}(s_t)\right)\right] \qquad (8)$$

where π(a_t|s_t; θ) is the expression of the neural network of the current training period and θ denotes the network parameter values of the neural network of the current training period. The square of R̂_t - V(s_t) is the first model loss L_1, and the square of R̂_t^vwr - V_vwr(s_t) is the second model loss L_2.
In one possible implementation, in step S14, the network parameter values of the neural network may be adjusted according to the first model loss L_1, the second model loss L_2 and the third model loss L_3. In an example, the network parameter values of the neural network may be adjusted in the direction that minimizes the model loss, or in the direction that minimizes a regularized model loss, so that the adjusted neural network fits well while over-fitting is avoided.
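A sketch of how the three losses could be combined for the backward pass of step S14, using PyTorch as an assumed framework. The advantage terms follow the reconstruction of formula (8) given above; summing the three losses into one objective is an illustrative choice, not a form stated in the patent.

```python
import torch

def total_loss(log_prob_a, v_s, v_vwr_s, r_hat, r_hat_vwr):
    """log_prob_a: log pi(a_t | s_t; theta); v_s, v_vwr_s: the two measurement outputs;
    r_hat, r_hat_vwr: the two accumulated discounted reward feedbacks (targets)."""
    adv1 = r_hat - v_s                          # its square is the first model loss L1
    adv2 = r_hat_vwr - v_vwr_s                  # its square is the second model loss L2
    l1 = adv1.pow(2)
    l2 = adv2.pow(2)
    l3 = -log_prob_a * (adv1 + adv2).detach()   # third model loss, tied to the action output
    return l1 + l2 + l3

# One adjustment of the network parameter values (step S14), assuming `network` and
# `optimizer` were created elsewhere:
#   optimizer.zero_grad()
#   total_loss(...).backward()
#   optimizer.step()
```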
In one possible implementation, step S14 may be executed repeatedly in a loop. For example, the parameters may be adjusted a predetermined number of times in the direction that minimizes the model loss; alternatively, the number of adjustments may be left unrestricted, and the loop stops when the model loss has decreased to a certain extent or has converged within a certain threshold.
In one possible implementation, in step S15, the trained neural network is obtained when the training condition is satisfied. In an example, the neural network obtained after a predetermined number of adjustment loops may be taken as the trained neural network, or the neural network whose model loss has decreased to a certain extent or converged within a certain threshold may be taken as the trained neural network. The present disclosure places no restriction on the training condition.
According to the neural network training method of the embodiments of the present disclosure, random action outputs are used to quickly obtain the environment of the first training period, so that the neural network can obtain effective reward feedbacks, the efficiency of training the neural network is improved, and the training process is less likely to fall into a locally optimal solution. Further, the model loss is determined from the first reward feedback of the current training period, the first reward feedbacks of historical training periods, the first measurement output and the second measurement output, so that the training process obtains more effective reward feedbacks, the neural network can be made suitable for more complex environments, and the performance of the neural network is improved. Moreover, because the model loss is determined from the reward feedbacks of multiple training periods, the training process is less likely to fall into a locally optimal solution, and a better-fitting neural network can be obtained.
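Putting the pieces together, a skeleton of the training procedure of steps S11 to S15 under the same illustrative assumptions as the sketches above: `network` returns the action output and the two measurement outputs, `transition_fn` and `reward_fn` stand in for the environment, and `second_reward_feedback` is the helper sketched earlier. None of these names come from the patent.

```python
def train(network, env_state, transition_fn, reward_fn, num_periods,
          update_parameters, history_len=20):
    """Run the network on the environment state vector, act on the environment,
    collect first reward feedbacks, derive the second reward feedback from the
    history, and adjust the network parameter values from the model losses."""
    history = []                                   # first reward feedbacks of past periods
    state = env_state
    for t in range(num_periods):
        action, v_s, v_vwr_s = network(state)      # action output + two measurement outputs (S11)
        next_state = transition_fn(state, action)  # environment of the next training period
        r_t = reward_fn(state, action)             # first reward feedback (S12)
        history.append(r_t)
        r2_t = second_reward_feedback(history[-history_len:])  # second reward feedback
        # Compute L1, L2 and L3 from the accumulated discounted reward feedbacks and the
        # measurement outputs (S13), then adjust the network parameter values (S14).
        update_parameters(network, r_t, r2_t, v_s, v_vwr_s)
        state = next_state
    return network                                 # trained network once the condition is met (S15)
```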
Fig. 2 shows a flow chart of an environment processing method according to an embodiment of the present disclosure. As shown in Fig. 2, the method includes:
In step S21, the environment state vector of the current period is input into the neural network for processing, to obtain the action output of the current period;
In step S22, the environment state vector of the next period and the first reward feedback of the current period are determined according to the environment state vector of the current period and the action output of the current period.
In one possible implementation, in step S21, the neural network is a neural network trained by the training method of steps S11 to S15, that is, a neural network trained with a model loss determined from the first reward feedback of the current training period, the first reward feedbacks of historical training periods, the first measurement output and the second measurement output. Because it obtained more effective reward feedbacks during training, the trained neural network can be suitable for more complex environments and can produce more accurate action outputs.
In one possible implementation, in step S22, the action output of the current period and the environment state vector of the current period may be combined by a vector operation or a matrix operation (which may include, for example, addition, multiplication or convolution; the present disclosure places no restriction on the type of vector or matrix operation) to obtain the environment state vector of the next period. Specifically, the action of the current period acts on the environment and causes it to change, yielding the environment of the next period; the environment state vector of the next period is the state vector describing the environment of the next period. Further, when the environment of the current period changes into the environment of the next period, the first reward feedback of the current period is obtained; the first reward feedback can characterize the profit generated when the action output of the current period acts on the environment of the current period. For example, when the environment is the trading situation of a financial market, applying the action output (for example, a trading operation) of the current period to the environment changes the trading situation of the financial market (for example, after the trade, the share price or the trading volume changes) and yields the first reward feedback (for example, the profit obtained from the trade). Because the neural network is trained with the first reward feedback of the current training period and the first reward feedbacks of historical training periods, the trained neural network can adapt to more complex environments on the basis of the historical training periods; for example, it can make more accurate trading operations according to more historical profit situations, so as to obtain better prices and lower transaction costs, and to achieve larger profits in back-testing on historical data.
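A minimal sketch of the environment processing method of steps S21 and S22, reusing the illustrative environment interface assumed above; only the action output of the trained network is needed at this stage.

```python
def process_environment(trained_network, state, transition_fn, reward_fn):
    """Step S21: feed the current period's environment state vector into the trained
    network to obtain the action output.  Step S22: apply that action output to the
    environment to obtain the next period's state vector and the first reward feedback."""
    action, _, _ = trained_network(state)   # measurement outputs are unused at inference
    next_state = transition_fn(state, action)
    reward = reward_fn(state, action)
    return next_state, reward
```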
Fig. 3 shows an application schematic diagram of a neural network training method according to an embodiment of the present disclosure. As shown in Fig. 3, random action outputs may be applied to the initial environment repeatedly to obtain the environment of the first training period, so that training of the neural network can begin.
In one possible implementation, the environment state vector s_t describing the environment of the current training period is obtained from the environment of the current training period. Inputting the environment state vector s_t into the neural network yields the action output a_t of the current training period, the first measurement output and the second measurement output.
In one possible implementation, the action output a_t acts on the environment of the current training period, yielding the first reward feedback r_t and the environment state vector s_{t+1} of the next training period. The second reward feedback r̂_t is obtained from the first reward feedback r_t and the first reward feedbacks of historical training periods.
In one possible implementation, the first model loss L_1 corresponding to the first measurement output is obtained from the first reward feedback r_t and the first measurement output. The second model loss L_2 corresponding to the second measurement output is obtained from the second reward feedback r̂_t and the second measurement output. The third model loss L_3 corresponding to the action output is obtained from the first model loss L_1 and the second model loss L_2.
In one possible implementation, the neural network may be adjusted in the backward pass through the first measurement output according to the first model loss L_1, through the second measurement output according to the second model loss L_2, and through the action output a_t according to the third model loss L_3. When the training condition is satisfied, the trained neural network is obtained. The trained neural network can then receive the environment state vector s_{t+1} of the next training period.
In one possible implementation, the environment may include the traffic environment of an autonomous vehicle. Through the first reward feedback r_t of the current training period, and through the second reward feedback r̂_t determined from the first reward feedback r_t and the first reward feedbacks of historical training periods, multiple effective reward feedbacks are obtained, which makes the method applicable to complex traffic environments and allows more accurate action outputs to be produced in complex traffic environments.
In one possible implementation, the environment is the trading situation of a financial market. Through the first reward feedback r_t of the current training period, and through the second reward feedback r̂_t determined from the first reward feedback r_t and the first reward feedbacks of historical training periods, multiple effective reward feedbacks are obtained, which makes the method applicable to complex trading environments and allows more accurate action outputs to be produced in complex trading environments, so that more accurate trading operations can be made according to more historical profit situations, better prices and lower transaction costs can be obtained, and larger profits can be achieved in back-testing on historical data.
In one possible implementation, a neural network trained by the above method may be used for environment processing. That is, the environment state vector of the current period is input into the trained neural network for processing to obtain the action output of the current period, and the environment state vector of the next period and the first reward feedback of the current period are determined according to the environment state vector of the current period and the action output of the current period. Processing the environment with the trained neural network allows the network to adapt to more complex environments on the basis of the historical training periods, to obtain more accurate action outputs and to obtain a higher first reward feedback, that is, a higher profit.
It can be understood that the method embodiments mentioned above in the disclosure can be combined with one another to form combined embodiments without departing from their principles and logic; details are not repeated here for reasons of space.
In addition, the disclosure further provides a neural network training device, an environment processing device, electronic equipment, a computer readable storage medium and a program, all of which can be used to implement any of the methods provided by the disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding records in the method sections, which are not repeated here.
Fig. 4 shows a block diagram of a neural network training device according to an embodiment of the disclosure. As shown in Fig. 4, the neural network training device includes:
an input module 11, configured to input the ambient condition vector of the current cycle of training into the neural network for processing, to obtain the movement output of the current cycle of training and the measurement output of the current cycle of training;
a feedback determining module 12, configured to determine the first rewards and punishments feedback of the current cycle of training according to the ambient condition vector of the current cycle of training and the movement output of the current cycle of training;
a model loss determining module 13, configured to determine the model loss of the neural network according to the first rewards and punishments feedback of the current cycle of training, the first rewards and punishments feedback of the history cycle of training and the measurement output of the current cycle of training, where the history cycle of training includes one or more cycles of training before the current cycle of training;
an adjusting module 14, configured to adjust the network parameter values of the neural network according to the model loss;
a neural network obtaining module 15, configured to obtain the trained neural network when the neural network meets the training condition.
Fig. 5 shows a block diagram of a neural network training device according to an embodiment of the disclosure. As shown in Fig. 5, the device further includes:
a second determining module 16, configured to, when the current cycle of training is the first cycle of training of the neural network, obtain the ambient condition vector of the current cycle of training according to a preset initial ambient condition vector and a random movement output.
In one possible implementation, the device further includes:
a first determining module 17, configured to determine the ambient condition vector of the current cycle of training according to the ambient condition vector of the previous cycle of training of the current cycle of training and the movement output of the previous cycle of training.
In one possible implementation, the measurement output of the current period includes a first measurement output of the current cycle of training and a second measurement output of the current cycle of training.
In one possible implementation, the model loss includes a first model loss corresponding to the first measurement output, a second model loss corresponding to the second measurement output, and a third model loss corresponding to the movement output.
In one possible implementation, the model loss determining module 13 is further configured to:
determine the first model loss according to the first rewards and punishments feedback of the current cycle of training and the first measurement output of the current cycle of training;
determine the second model loss according to the first rewards and punishments feedback of the current cycle of training, the first rewards and punishments feedback of the history cycle of training and the second measurement output of the current cycle of training;
determine the third model loss according to the first model loss and the second model loss.
In one possible implementation, the model loss determining module 13 is further configured to:
determine a first accumulated discounted rewards and punishments feedback according to the first rewards and punishments feedback of the current cycle of training, a preset discount rate and a first expected rewards and punishments function;
determine the first model loss according to the first accumulated discounted rewards and punishments feedback and the first measurement output.
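For illustration, a minimal sketch of the two steps just listed follows. Reading the first accumulated discounted feedback as an n-step discounted sum whose tail is estimated by the first expected rewards and punishments function is an assumption, as are the names gamma and v1_tail.

def first_model_loss(first_feedbacks, m1, v1_tail, gamma=0.99):
    # first_feedbacks: first rewards and punishments feedbacks collected from the
    # current cycle onwards; m1: first measurement output of the current cycle;
    # v1_tail: value of the first expected rewards and punishments function used to
    # estimate the remaining, unobserved feedbacks.
    g = 0.0
    for k, r in enumerate(first_feedbacks):
        g += (gamma ** k) * r                      # accumulate discounted feedback
    g += (gamma ** len(first_feedbacks)) * v1_tail  # tail estimated by the expectation function
    return (m1 - g) ** 2                            # first model loss as a squared error

The second model loss of the implementation described next can be formed in the same way, replacing the first feedbacks with the second rewards and punishments feedback and the first expected rewards and punishments function with the second one.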
In one possible implementation, the model loss determining module 13 is further configured to:
determine the second rewards and punishments feedback of the current cycle of training according to the first rewards and punishments feedback of the current cycle of training and the first rewards and punishments feedback of the history cycle of training;
determine a second accumulated discounted rewards and punishments feedback according to the second rewards and punishments feedback of the current cycle of training, the preset discount rate and a second expected rewards and punishments function;
determine the second model loss according to the second accumulated discounted rewards and punishments feedback and the second measurement output.
In one possible implementation, the model loss determining module 13 is further configured to:
determine a rewards and punishments change vector according to the first rewards and punishments feedback of the history cycle of training and the first rewards and punishments feedback of the current cycle of training;
obtain a change accumulation vector according to the rewards and punishments change vector;
obtain a zero-fluctuation rewards and punishments vector according to the change accumulation vector;
determine the second rewards and punishments feedback according to the change accumulation vector and the zero-fluctuation rewards and punishments vector.
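One plausible reading of these four steps, written as a sketch, treats the change vector as the step-to-step differences of the first feedbacks, the accumulation vector as their cumulative sum, and the zero-fluctuation vector as the accumulation that would result from a constant average change; the second feedback is then the gap between the actual and the zero-fluctuation accumulation at the current cycle. All of the concrete formulas below are illustrative assumptions, not the claimed computation.

import numpy as np

def second_feedback(history_feedbacks, current_feedback):
    # history_feedbacks: first rewards and punishments feedbacks of earlier cycles, oldest first;
    # current_feedback: first rewards and punishments feedback r_t of the current cycle.
    r = np.asarray(list(history_feedbacks) + [current_feedback], dtype=float)
    change = np.diff(r)                        # rewards and punishments change vector
    acc = np.cumsum(change)                    # change accumulation vector
    if len(acc) < 2:
        return float(current_feedback)
    hist_acc = acc[:-1]
    # Zero-fluctuation baseline: extend the history accumulation one cycle ahead at its
    # average step, i.e. as if the feedbacks had changed without any fluctuation.
    avg_step = (hist_acc[-1] - hist_acc[0]) / max(len(hist_acc) - 1, 1)
    zero_fluct_now = hist_acc[-1] + avg_step   # zero-fluctuation value at the current cycle
    return float(acc[-1] - zero_fluct_now)     # stand-in second rewards and punishments feedback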
Fig. 6 shows a block diagram of an environment processing device according to an embodiment of the disclosure. As shown in Fig. 6, the environment processing device includes:
a movement output obtaining module 21, configured to input the ambient condition vector of the current period into the neural network described in any one of claims 1 to 9 for processing, to obtain the movement output of the current period;
a third determining module 22, configured to determine the ambient condition vector of the next period of the current period and the first rewards and punishments feedback of the current period according to the ambient condition vector of the current period and the movement output of the current period.
The embodiment of the disclosure further proposes a computer readable storage medium having computer program instructions stored thereon, where the computer program instructions, when executed by a processor, implement the above method. The computer readable storage medium may be a non-volatile computer readable storage medium.
The embodiment of the disclosure further proposes an electronic device, including: a processor; and a memory for storing processor-executable instructions; where the processor is configured to perform the above method.
The electronic device may be provided as a terminal, a server or a device in another form.
Fig. 7 is a block diagram of an electronic device 800 according to an exemplary embodiment. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device or a personal digital assistant.
Referring to Fig. 7, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814 and a communication component 816.
The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation and recording. The processing component 802 may include one or more processors 820 to execute instructions, so as to perform all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and the other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support the operation of the electronic device 800. Examples of such data include instructions of any application or method operated on the electronic device 800, contact data, phonebook data, messages, pictures, video, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk or an optical disc.
The power supply component 806 supplies power to the various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the electronic device 800.
The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a photographing mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and each rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive an external audio signal when the electronic device 800 is in an operation mode such as a call mode, a recording mode or a voice recognition mode. The received audio signal may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button and a lock button.
The sensor component 814 includes one or more sensors for providing state assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect the open/closed state of the electronic device 800 and the relative positioning of components, for instance the display and the keypad of the electronic device 800; the sensor component 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of contact between the user and the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra wide band (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.
In an exemplary embodiment, a non-volatile computer readable storage medium is further provided, for example a memory 804 including computer program instructions, and the computer program instructions can be executed by the processor 820 of the electronic device 800 to complete the above method.
Fig. 8 is a block diagram of an electronic device 1900 according to an exemplary embodiment. For example, the electronic device 1900 may be provided as a server. Referring to Fig. 8, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as an application program. The application program stored in the memory 1932 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute instructions so as to perform the above method.
The electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM or the like.
In an exemplary embodiment, a non-volatile computer readable storage medium is further provided, for example a memory 1932 including computer program instructions, and the computer program instructions can be executed by the processing component 1922 of the electronic device 1900 to complete the above method.
The disclosure may be a system, a method and/or a computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to carry out aspects of the disclosure.
The computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium, or to an external computer or external storage device via a network, for example the Internet, a local area network, a wide area network and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter or network interface in each computing/processing device receives the computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer program instructions for carrying out operations of the disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++, and a conventional procedural programming language such as the "C" language or a similar programming language. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the scenario involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, for example programmable logic circuitry, a field programmable gate array (FPGA) or a programmable logic array (PLA), may be personalized by utilizing state information of the computer readable program instructions, and the electronic circuitry may execute the computer readable program instructions to implement aspects of the disclosure.
Aspects of the disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, a special purpose computer or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer readable program instructions may also be stored in a computer readable storage medium, and the instructions cause a computer, a programmable data processing apparatus and/or other devices to work in a particular manner, so that the computer readable medium having the instructions stored therein comprises an article of manufacture including instructions which implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer readable program instructions may also be loaded onto a computer, another programmable data processing apparatus or another device, so that a series of operational steps are executed on the computer, the other programmable data processing apparatus or the other device to produce a computer implemented process, such that the instructions executed on the computer, the other programmable data processing apparatus or the other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to multiple embodiments of the disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two successive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special purpose hardware-based system that performs the specified functions or acts, or by a combination of special purpose hardware and computer instructions.
The embodiments of the disclosure have been described above. The foregoing description is exemplary rather than exhaustive, and it is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, the practical application or the technical improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A neural network training method, characterized in that the method comprises:
inputting the ambient condition vector of a current cycle of training into a neural network for processing, to obtain the movement output of the current cycle of training and the measurement output of the current cycle of training;
determining the first rewards and punishments feedback of the current cycle of training according to the ambient condition vector of the current cycle of training and the movement output of the current cycle of training;
determining the model loss of the neural network according to the first rewards and punishments feedback of the current cycle of training, the first rewards and punishments feedback of a history cycle of training and the measurement output of the current cycle of training, wherein the history cycle of training comprises one or more cycles of training before the current cycle of training;
adjusting the network parameter values of the neural network according to the model loss; and
obtaining the trained neural network when the neural network meets a training condition.
2. The method according to claim 1, wherein the measurement output of the current period comprises a first measurement output of the current cycle of training and a second measurement output of the current cycle of training.
3. The method according to claim 2, wherein the model loss comprises a first model loss corresponding to the first measurement output, a second model loss corresponding to the second measurement output, and a third model loss corresponding to the movement output.
4. An environment processing method, characterized in that the method comprises:
inputting the ambient condition vector of a current period into the neural network described in any one of claims 1 to 3 for processing, to obtain the movement output of the current period; and
determining the ambient condition vector of the next period of the current period and the first rewards and punishments feedback of the current period according to the ambient condition vector of the current period and the movement output of the current period.
5. A neural network training device, characterized by comprising:
an input module, configured to input the ambient condition vector of a current cycle of training into a neural network for processing, to obtain the movement output of the current cycle of training and the measurement output of the current cycle of training;
a feedback determining module, configured to determine the first rewards and punishments feedback of the current cycle of training according to the ambient condition vector of the current cycle of training and the movement output of the current cycle of training;
a model loss determining module, configured to determine the model loss of the neural network according to the first rewards and punishments feedback of the current cycle of training, the first rewards and punishments feedback of a history cycle of training and the measurement output of the current cycle of training, wherein the history cycle of training comprises one or more cycles of training before the current cycle of training;
an adjusting module, configured to adjust the network parameter values of the neural network according to the model loss; and
a neural network obtaining module, configured to obtain the trained neural network when the neural network meets a training condition.
6. An environment processing device, characterized in that the device comprises:
a movement output obtaining module, configured to input the ambient condition vector of a current period into the neural network described in any one of claims 1 to 3 for processing, to obtain the movement output of the current period; and
a third determining module, configured to determine the ambient condition vector of the next period of the current period and the first rewards and punishments feedback of the current period according to the ambient condition vector of the current period and the movement output of the current period.
7. An electronic device, characterized by comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method according to any one of claims 1 to 3.
8. An electronic device, characterized by comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method according to claim 4.
9. A computer readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 3.
10. A computer readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to claim 4.
CN201810885459.2A 2018-08-06 2018-08-06 Neural network training method and device and environment processing method and device Active CN109190760B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810885459.2A CN109190760B (en) 2018-08-06 2018-08-06 Neural network training method and device and environment processing method and device

Publications (2)

Publication Number Publication Date
CN109190760A true CN109190760A (en) 2019-01-11
CN109190760B CN109190760B (en) 2021-11-30

Family

ID=64920260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810885459.2A Active CN109190760B (en) 2018-08-06 2018-08-06 Neural network training method and device and environment processing method and device

Country Status (1)

Country Link
CN (1) CN109190760B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180084279A1 (en) * 2016-03-02 2018-03-22 MatrixView, Inc. Video encoding by injecting lower-quality quantized transform matrix values into a higher-quality quantized transform matrix
CN108009638A (en) * 2017-11-23 2018-05-08 深圳市深网视界科技有限公司 A kind of training method of neural network model, electronic equipment and storage medium
CN108257116A (en) * 2017-12-30 2018-07-06 清华大学 A kind of method for generating confrontation image
CN108197652A (en) * 2018-01-02 2018-06-22 百度在线网络技术(北京)有限公司 For generating the method and apparatus of information

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191769A (en) * 2019-12-25 2020-05-22 中国科学院苏州纳米技术与纳米仿生研究所 Self-adaptive neural network training and reasoning device
CN111191769B (en) * 2019-12-25 2024-03-05 中国科学院苏州纳米技术与纳米仿生研究所 Self-adaptive neural network training and reasoning device
CN111191722A (en) * 2019-12-30 2020-05-22 支付宝(杭州)信息技术有限公司 Method and device for training prediction model through computer
CN111191722B (en) * 2019-12-30 2022-08-09 支付宝(杭州)信息技术有限公司 Method and device for training prediction model through computer
CN113211441A (en) * 2020-11-30 2021-08-06 湖南太观科技有限公司 Neural network training and robot control method and device
CN113211441B (en) * 2020-11-30 2022-09-09 湖南太观科技有限公司 Neural network training and robot control method and device
CN114793305A (en) * 2021-01-25 2022-07-26 上海诺基亚贝尔股份有限公司 Method, apparatus, device and medium for optical communication

Also Published As

Publication number Publication date
CN109190760B (en) 2021-11-30

Similar Documents

Publication Publication Date Title
CN109190760A (en) Neural network training method and device and environmental treatment method and device
CN109859096A (en) Image Style Transfer method, apparatus, electronic equipment and storage medium
CN109829501A (en) Image processing method and device, electronic equipment and storage medium
CN109089133A (en) Method for processing video frequency and device, electronic equipment and storage medium
CN109919300A (en) Neural network training method and device and image processing method and device
CN108256555A (en) Picture material recognition methods, device and terminal
JP7165818B2 (en) Neural network training method and device, and image generation method and device
CN110009090A (en) Neural metwork training and image processing method and device
CN109801270A (en) Anchor point determines method and device, electronic equipment and storage medium
CN109635920A (en) Neural network optimization and device, electronic equipment and storage medium
CN109978891A (en) Image processing method and device, electronic equipment and storage medium
CN109543537A (en) Weight identification model increment training method and device, electronic equipment and storage medium
CN103886284B (en) Character attribute information identifying method, device and electronic equipment
CN109819229A (en) Image processing method and device, electronic equipment and storage medium
CN110348418A (en) Method for tracking target and device, Intelligent mobile equipment and storage medium
CN110210619A (en) The training method and device of neural network, electronic equipment and storage medium
CN110532956A (en) Image processing method and device, electronic equipment and storage medium
CN104077563A (en) Human face recognition method and device
CN109670077A (en) Video recommendation method, device and computer readable storage medium
CN109145970A (en) Question and answer treating method and apparatus, electronic equipment and storage medium based on image
CN108960283A (en) Classification task incremental processing method and device, electronic equipment and storage medium
CN108596093A (en) The localization method and device of human face characteristic point
CN109670632A (en) The predictor method of ad click rate, the estimating device of ad click rate, electronic equipment and storage medium
CN110889489A (en) Neural network training method, image recognition method and device
CN106599191A (en) User attribute analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant