CN109242207A

CN109242207A - A financial time series prediction method based on deep reinforcement learning

Info

Publication number: CN109242207A
Application number: CN201811179333.XA
Authority: CN
Inventors: 方锡鑫; 潘炎; 赖韩江; 印鉴; 潘文杰
Original assignee: Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd; National Sun Yat Sen University
Current assignee: Sun Yat Sen University; Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd; National Sun Yat Sen University
Priority date: 2018-10-10
Filing date: 2018-10-10
Publication date: 2019-01-18

Abstract

The invention discloses a financial time series prediction method based on deep reinforcement learning. The method comprises three main subsystems: a data processing subsystem, which processes the raw data obtained from the Wind API; a feature extraction subsystem, which constructs a deep neural network to extract features from the data; and a reinforcement learning subsystem, which, based on the Actor-Critic algorithm, constructs a policy network and an evaluation network that respectively select and evaluate trading actions. Continuous iterative updating then ensures that the whole system captures the latest market dynamics and takes the best trading action for the observed state, ultimately achieving good trading performance. Using basic information from the financial market, the invention can continuously learn this complex market, capture potentially profitable trading actions in time, and thereby realize a profit.

Description

A financial time series prediction method based on deep reinforcement learning

Technical field

The present invention relates to the fields of deep learning and reinforcement learning, and more particularly to a financial time series prediction method based on deep reinforcement learning.

Background art

In today's era of globalization, financial markets have reached an unprecedented scale. The investment field has therefore produced a large number of excellent analysts who predict future asset prices from personal experience and subjective analysis. Manual trading by traditional traders, however, is clearly inefficient, so people have turned to computers for quantitative research to replace repetitive manual work. Algorithmic models for conventional quantitative investment have appeared one after another, including many studies that incorporate machine learning, such as the well-known Bayesian methods, support vector machines, and recurrent neural networks.

However, these traditional research methods often consider only a single indicator, rely on manual operation, and generalize poorly. Applying traditional machine learning algorithms to financial time series also has obvious defects: many models, deep neural networks in particular, overfit when predicting financial time series. Financial markets are highly volatile, and non-stationarity is a pervasive characteristic of financial time series, which makes the predictions of many models unstable.

In recent years, after AlphaGo decisively defeated top Go players, reinforcement learning has entered a new stage of intense exploration, and the financial field is no exception. Different algorithms naturally perform differently. The general framework is shown in Fig. 1, which also represents the reinforcement learning algorithm of the present invention. Reinforcement learning comprises many algorithms, which can be broadly divided into value-based algorithms and policy-based algorithms. A typical value-based algorithm is Q-Learning. Applied to finance, it defines the market state, selects a trading action according to an ε-greedy policy, observes the environment and the reward obtained by the action, and then updates the Q-value table, thereby optimizing the selection of trading actions. The problem with this algorithm is that when the state or action dimension is very large, the Q table is hard to maintain and convergence is very difficult. A typical policy-based algorithm is Policy Gradient. Applied to finance, it likewise defines the market state, selects the best action according to the existing policy, obtains the reward for that action from environmental feedback, and then updates the policy by backpropagation. Repeated updates make the selected actions increasingly likely to be profitable. Although this approach suits high-dimensional action spaces, policy-based algorithms often converge to a local rather than a global optimum, and their per-episode updates make them relatively inefficient.
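The tabular Q-Learning approach described in this background can be sketched as follows. This is a minimal illustration, not part of the patented method; the action set `(-1, 0, 1)`, the learning rate `alpha`, the discount `gamma`, and the string state labels are all assumptions chosen for the example.

```python
import random

ACTIONS = (-1, 0, 1)  # hypothetical action set: short, flat, long

def epsilon_greedy(q_table, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_table[(state, action)]
```

The table holds one entry per (state, action) pair, which is why it becomes hard to maintain once the state is a high-dimensional continuous vector.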

Summary of the invention

To overcome at least one of the above defects of the prior art, the present invention provides a financial time series prediction method based on deep reinforcement learning.

To solve the above technical problem, the technical solution of the present invention is as follows: a financial time series prediction method based on deep reinforcement learning, comprising the following steps:

S1: download the raw financial time series data from the Wind API of an external system;

S2: construct a data processing subsystem, which preprocesses the downloaded raw financial time series data and outputs the preprocessed data;

S3: construct a feature extraction subsystem, which extracts deep features from the preprocessed data and outputs the extracted deep feature information;

S4: construct a reinforcement learning subsystem, which compares the deep feature information with the trading environment, performs reinforcement learning, and outputs a trading action;

S5: adjust the target position in the financial market in real time according to the trading action generated from the data fed into the system, thereby accomplishing the trade.
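Step S5 treats the trading action as a target position rather than as an order. A minimal sketch of that distinction follows, assuming the three targets are encoded as -1, 0, and 1 lots (short one lot, no position, long one lot) and that the action is given as an index into them; the function name is hypothetical.

```python
TARGET_POSITIONS = (-1, 0, 1)  # lots: short 1 lot, no position, long 1 lot

def order_to_reach_target(action_index, current_position):
    """Return the signed number of lots to trade so the position equals the target."""
    target = TARGET_POSITIONS[action_index]
    return target - current_position
```

For example, moving from a short position of one lot to a long target of one lot requires buying two lots, while an action whose target equals the current position produces no order.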

Preferably, the financial time series data in S1 include price-volume information and macro information;

The price-volume information includes: opening price, closing price, highest price, lowest price, and trading volume;

The macro information includes: price-to-earnings ratio, price-to-book ratio, price-to-sales ratio, price-to-cash-flow ratio, free-float market capitalization, total market capitalization, gross profit margin, net assets per share, asset-liability ratio, and exchange rate.

Preferably, the data preprocessing in S2 includes computing technical indicators, checking for and handling missing values, standardizing the data, and handling outliers. The specific steps are as follows:

S21: receive the raw financial time series vector v_raw output by the external system Wind API, check it for missing values, and mark the missing values;

S22: using the raw financial time series vector v_raw, compute the corresponding technical indicator information from the price-volume information and the macro information;

S23: set the sliding time window to window = 600, i.e., each computation covers 600 minutes and the window slides forward by one minute at a time; compute the mean μ_i and the standard deviation σ_i of each dimension of v_raw within the sliding window, and then standardize each dimension as v'_i = (v_i - μ_i) / σ_i, where v_i is the untransformed value of each dimension of v_raw and v'_i is the new value of that dimension after standardization, so that every dimension is standardized to data with mean 0 and variance 1;

S24: identify outliers using the box plot method, remove them, and replace them with the mean μ_i computed in S23;

S25: fill the missing values marked in S21 with the mean μ_i computed in S23;

S26: the preprocessing of the raw vector v_raw in S21 to S25 yields the market state feature vector v_f, which is the output of the data processing subsystem.
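Steps S21 to S25 can be sketched for a single dimension of a single window as follows. This is an illustration under stated assumptions, not the patent's implementation: missing values are marked with `None`, the box plot rule is taken to be the common 1.5 × IQR whisker (the source names only "the box plot method"), and replacement by the mean is applied before the final standardization so that replaced values map to 0.

```python
import statistics

def preprocess_window(values, whisker=1.5):
    """Clean and standardize one dimension of one sliding window.

    values: list of floats, with None marking a missing value (S21).
    Returns the standardized values (S23) after outliers (S24) and
    missing values (S25) have been replaced by the window mean.
    """
    present = [v for v in values if v is not None]
    mu = statistics.fmean(present)        # mean from S23
    sigma = statistics.pstdev(present)    # standard deviation from S23
    q1, _, q3 = statistics.quantiles(present, n=4)
    low = q1 - whisker * (q3 - q1)        # assumed box plot fences
    high = q3 + whisker * (q3 - q1)
    cleaned = [mu if v is None or v < low or v > high else v for v in values]
    if sigma == 0:
        return [0.0 for _ in cleaned]     # constant window: nothing to scale
    return [(v - mu) / sigma for v in cleaned]
```

In use, the function would be applied to each dimension of the most recent 600 minutes and re-evaluated as the window slides forward by one minute.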

Preferably, the technical indicator information includes: moving average convergence divergence MACD, average directional index ADX, commodity channel index CCI, relative strength index RSI, Bollinger Bands BOOL, average true range ATR, 5-day exponential moving average EMA_5, 20-day exponential moving average EMA_20, 60-day exponential moving average EMA_60, 120-day exponential moving average EMA_120, price rate of change ROC, stochastic momentum index SMI, accumulation/distribution AD, momentum MOM, percentage price oscillator PPO, and Williams variable accumulation/distribution WVAD.
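Three of the simpler indicators in this list can be sketched with their conventional textbook definitions. The source does not give formulas, so the smoothing factor 2 / (span + 1), the seeding of the EMA with the first price, and the lag parameters are assumptions; how a "5-day" span maps onto minute bars is likewise left open here.

```python
def ema(prices, span):
    """Exponential moving average with smoothing 2 / (span + 1), seeded with the first price."""
    alpha = 2.0 / (span + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def momentum(prices, lag):
    """MOM: price minus the price `lag` bars earlier (None until enough history)."""
    return [None if i < lag else prices[i] - prices[i - lag]
            for i in range(len(prices))]

def rate_of_change(prices, lag):
    """ROC in percent: 100 * (p_t - p_{t-lag}) / p_{t-lag}."""
    return [None if i < lag else 100.0 * (prices[i] - prices[i - lag]) / prices[i - lag]
            for i in range(len(prices))]
```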

Preferably, the processing steps of the reinforcement learning subsystem in step S4 are:

S41: receive the output v_deep of the feature extraction subsystem and use it as the input of this subsystem, i.e., the state vector v_state;

S42: construct the action network. The state vector v_state is the input vector; it passes through a first fully connected layer with 128 hidden neurons followed by a ReLU transformation, which outputs a 128-dimensional intermediate vector v_am1 to a second fully connected layer with 128 hidden neurons, again followed by a ReLU transformation. The intermediate vector v_am2 output by the second hidden layer then passes through a fully connected layer with 3 neurons, which outputs a 3-dimensional action vector representing the target position. The three dimensions are the probabilities p that the position is short 1 lot, no position, and long 1 lot, respectively. A softmax transformation then yields the final trading action a = A(s), where s is the market state and A(s) is the mapping obtained through the action network, i.e., the trading action;

S43: after executing the trading action a of S42, obtain the market environment's feedback on the current state, i.e., the reward r, and the next state s' of the market environment;

S44: construct the loss function of the action network, loss_actor = -log(p) * td_error, where p is the probability of the action a selected in step S42 and td_error is the temporal-difference error obtained from the evaluation network; the optimization objective of the action network is to minimize loss_actor;

S45: construct the evaluation network. The state vector v_state is again the input vector; it passes through a first fully connected layer with 128 hidden neurons followed by a ReLU transformation, which outputs a 128-dimensional intermediate vector v_cm1 to a second fully connected layer with 128 hidden neurons, again followed by a ReLU transformation. The intermediate vector v_cm2 output by the second hidden layer then passes through a fully connected layer with 1 neuron, which outputs a 1-dimensional value variable v = Q(s), where s is the market state and Q(s) is the mapping obtained through the evaluation network, a value characterizing worth, also called the Q value, which represents the value of the trading action;

S46: compute the temporal-difference error td_error = r + γ * Q(s') - Q(s), where Q(s) is the Q value obtained by feeding the state s into the evaluation network, Q(s') is the Q value obtained by feeding the next state s' into the evaluation network, and γ is the discount weight on future rewards, which characterizes how much importance is given to future rewards;

S47: construct the loss function of the evaluation network, loss_critic = td_error², and the optimization objective of the evaluation network is to minimize loss_critic.
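The quantities defined in S44, S46, and S47 can be written directly as small functions. This is a worked sketch of the formulas as stated in the text; the default `gamma=0.9` is an assumed value, since the source only requires 0 ≤ γ < 1.

```python
import math

def td_error(reward, q_s, q_next, gamma=0.9):
    """Temporal-difference error: r + gamma * Q(s') - Q(s)."""
    return reward + gamma * q_next - q_s

def actor_loss(p_action, td):
    """Action network loss: -log(p) * td_error, with p the chosen action's probability."""
    return -math.log(p_action) * td

def critic_loss(td):
    """Evaluation network loss: td_error squared."""
    return td ** 2
```

With r = 1.0, Q(s) = 0.5, Q(s') = 1.0, and γ = 0.9, the temporal-difference error is 1.0 + 0.9 - 0.5 = 1.4, so the evaluation network's loss is 1.4² = 1.96.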

Preferably, the feature extraction subsystem in step S3 works as follows: it receives the market state feature vector v_f preprocessed by the data processing subsystem as its input vector, extracts deep features through its neural network, and outputs the resulting deep feature v_deep as the output of the feature extraction subsystem.

Compared with the prior art, the technical solution of the present invention has the following beneficial effects. First, the invention uses a higher-dimensional information vector to describe the market state: in addition to traditional price-volume information, it adds macro information and multiple technical indicators, which capture market trends more effectively and improve the detection of profitable trading signals. Second, the feature extraction subsystem uses a deep neural network to extract deep features from the market state information; it can analyze far more feature dimensions than manual processing, and its analytical ability exceeds traditional human judgment. Third, in the reinforcement learning prediction system based on the Actor-Critic algorithm, the Actor iterates continuously to obtain a reasonable probability of selecting each action in each state, and the Critic likewise iterates continuously to refine the reward or penalty value of selecting each action in each state. This approach not only learns more efficiently but also handles very large state spaces well.

Brief description of the drawings

Fig. 1 is the overall system architecture diagram of the present invention.

Fig. 2 is the structure diagram of the deep neural network of the feature extraction subsystem of the present invention.

Fig. 3 is the reinforcement learning framework for financial time series of the present invention.

Fig. 4 is the flow of the Actor-Critic algorithm of the present invention.

Detailed description of the embodiments

The drawings are for illustrative purposes only and shall not be construed as limiting this patent;

The following further describes the technical solution of the present invention with reference to the accompanying drawings and examples.

Embodiment 1

Fig. 1 is the overall framework diagram of the financial time series prediction system based on deep reinforcement learning according to an embodiment of the present invention. As shown in Fig. 1, the system mainly comprises a data processing subsystem, a feature extraction subsystem, and a reinforcement learning subsystem. The figure shows two further parts: the external system Wind API, from which the raw financial time series data are downloaded, and the adjustment of the target position according to the trading action produced by the reinforcement learning subsystem, which accomplishes the trade.

In a specific implementation, downloading the raw financial time series data from the external system Wind API means using the API provided by the Wind software to download price-volume information and macro information. For ease of reference, the technical indicator information computed by the data processing subsystem is also listed in the tables below:

Table 1: Raw financial time series data downloaded via the Wind API: price-volume information

Name	Symbol	Description
Opening price	open	Price of the first tick in each minute
Closing price	close	Price of the last tick in each minute
Highest price	high	Highest price of the instrument in each minute
Lowest price	low	Lowest price of the instrument in each minute
Trading volume	volume	Trading volume of the instrument in each minute

Table 2: Raw financial time series data downloaded via the Wind API: macro information

Table 3: Raw financial time series data downloaded via the Wind API: technical indicator information

The data processing subsystem is responsible for computing the technical indicators, handling abnormal data, and preprocessing. The technical indicators are as listed above.

The input received by the data processing subsystem is the raw data downloaded from the external Wind API. The data at minute t can be represented by a vector v_raw (of dimension (dim_{price-volume} + dim_{macro}) × T, where T is the time span), each dimension of which is a price-volume indicator or a macro indicator of the instrument at minute t. The specific indicators of both kinds are listed in the tables above; dim_{tech}, used below, denotes the number of technical indicators. Outlier handling includes filling missing values and correcting noise points: the invention first checks for missing values and abnormal noise values, then performs standardization, which yields the mean and standard deviation, and fills the flagged outliers and missing values with the mean. When processing is complete, the subsystem outputs the preprocessed vector v_f (of dimension (dim_{price-volume} + dim_{macro} + dim_{tech}) × T).

The feature extraction subsystem receives the preprocessed vector v_f from the data processing subsystem and uses it as the input vector of a neural network, which extracts the deep feature v_deep. The structure of the neural network is shown in Fig. 2. It consists of three fully connected layers, each with 16 hidden neurons; for ease of drawing, some hidden neurons are omitted from the figure. The input vector v_f has dimension dim_input = dim_{price-volume} + dim_{macro} + dim_{tech}, i.e., dim_input = 31 (5 price-volume items, 10 macro items, and 16 technical indicators), and the output vector v_deep has dimension dim_output = 16.
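The forward pass of such a feature extractor can be sketched in plain Python. The layer sizes (31 inputs, three layers of 16 neurons) come from the text; the ReLU activation after each layer and the small random initialization are assumptions, since the source does not state the activation or the initialization of this network.

```python
import random

def dense(x, weights, bias, relu=True):
    """Fully connected layer: y = W x + b, optionally followed by ReLU."""
    y = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in y] if relu else y

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (placeholder initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_feature_extractor(rng, dim_input=31, hidden=16, n_layers=3):
    """Three fully connected layers: 31 -> 16 -> 16 -> 16."""
    sizes = [dim_input] + [hidden] * n_layers
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def extract_features(layers, v_f):
    """Map one preprocessed state vector v_f to the deep feature v_deep."""
    x = v_f
    for weights, bias in layers:
        x = dense(x, weights, bias)
    return x
```

Calling `extract_features(build_feature_extractor(random.Random(0)), [0.5] * 31)` returns a 16-element list that plays the role of v_deep for one time step.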

The reinforcement learning subsystem uses the action-evaluation (Actor-Critic) algorithm. It continuously observes the market state at each minute, selects a trading action through the action network, evaluates the selected action through the evaluation network, and keeps learning how to choose better trading actions. The overall reinforcement learning process of this subsystem is shown in the reinforcement learning framework of Fig. 3.

The system of the present invention is the trading agent in the figure, and the trading environment is the market. The trading agent observes the market state, makes a judgment based on its experience and the dynamic market information it has gathered, and determines the trading action to take in that state. It then executes the action in the trading environment and obtains the environment's feedback on that action (the reward in the figure). The agent performs reinforcement learning on the reward value, the market state, and the trading action. On the one hand, this continually reinforces which trading actions are more profitable under various complex market states; on the other hand, it continually captures the latest market movements, so that the agent can keep responding to the latest price fluctuations and take more suitable trading actions based on the latest information.

Specifically, the action-evaluation algorithm (Actor-Critic) of the reinforcement learning subsystem is described below with reference to Fig. 4. The reinforcement learning subsystem receives the output v_deep of the feature extraction subsystem and uses it as its input, i.e., the state vector v_state.

As shown in Fig. 4, the action network is constructed to learn the policy. The state vector v_state is the input vector; it passes through a first fully connected layer with 128 hidden neurons followed by a ReLU transformation, which outputs a 128-dimensional intermediate vector v_am1 to a second fully connected layer with 128 hidden neurons, again followed by a ReLU transformation. The intermediate vector v_am2 produced by the second fully connected layer is then fed into a final fully connected layer with 3 neurons, which outputs a 3-dimensional action vector representing the target position. The three dimensions are the probabilities that the position is short 1 lot (-1), no position (0), and long 1 lot (1), respectively. A softmax transformation then yields the final trading action; the whole process of obtaining the trading action can be written as a^t = A(s^t), where s^t is the market state at minute t. At the same time, the evaluation network is constructed to learn to evaluate actions. The state vector v_state is again the input vector; it passes through a first fully connected layer with 128 hidden neurons followed by a ReLU transformation, which outputs a 128-dimensional intermediate vector v_cm1 to a second fully connected layer with 128 hidden neurons, again followed by a ReLU transformation. The intermediate vector v_cm2 output by the second hidden layer then passes through a fully connected layer with 1 neuron, which outputs a 1-dimensional value vector; the whole process of obtaining the value vector can be written as v^t = Q(s^t).
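The two networks described above share the same shape apart from the last layer, which the following sketch makes explicit. The layer sizes (16 inputs, two hidden layers of 128, outputs of 3 and 1) follow the text; the random initialization is a placeholder, and drawing the action by sampling from the softmax probabilities is an assumption, since the source does not say whether the action is sampled or taken as the most probable one.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def init_mlp(sizes, rng):
    """One (weights, bias) pair per layer; placeholder random initialization."""
    return [([[rng.uniform(-0.1, 0.1) for _ in range(a)] for _ in range(b)], [0.0] * b)
            for a, b in zip(sizes, sizes[1:])]

def forward(mlp, x):
    """Apply ReLU after every layer except the last one."""
    for i, (w, b) in enumerate(mlp):
        x = dense(x, w, b)
        if i < len(mlp) - 1:
            x = relu(x)
    return x

def actor_probs(actor, v_state):
    """Probabilities of the targets short 1 lot, no position, long 1 lot."""
    return softmax(forward(actor, v_state))

def critic_value(critic, v_state):
    """Scalar value estimate Q(s) for the state."""
    return forward(critic, v_state)[0]

def sample_action(probs, rng):
    """Assumed selection rule: sample a target position from the probabilities."""
    return rng.choices((-1, 0, 1), weights=probs, k=1)[0]

rng = random.Random(0)
actor = init_mlp([16, 128, 128, 3], rng)    # action network
critic = init_mlp([16, 128, 128, 1], rng)   # evaluation network
```

Given a 16-dimensional state, `actor_probs(actor, state)` returns three probabilities that sum to 1, and `critic_value(critic, state)` returns the single number used in the temporal-difference error.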

As can be seen from Fig. 4, after receiving the trading action a^t output by the system, the market environment returns two values to the reinforcement learning subsystem according to that action: the state s^{t+1} that the market moves to in the next minute, and the reward r_{a^t} for selecting the trading action a^t.

As shown in Fig. 4, the evaluation network of the reinforcement learning subsystem evaluates the action selected by the action network according to the information obtained. To do so, it computes the temporal-difference error td_error = r_{a^t} + γ * Q(s^{t+1}) - Q(s^t), where Q(s^t) is the Q value obtained by feeding the state s^t into the evaluation network, Q(s^{t+1}) is the Q value obtained by feeding the next state s^{t+1} into the evaluation network, and γ is the discount factor, with 0 ≤ γ < 1.

Because the whole algorithm is updated continuously, loss functions are constructed separately for the action network and the evaluation network of the subsystem. The loss function of the action network is loss_actor = -log(p) * td_error, where p is the probability of the action a^t selected in step S42 and td_error is the temporal-difference error obtained from the evaluation network; the optimization objective of the action network is to minimize loss_actor. The loss function of the evaluation network is loss_critic = td_error², and the optimization objective of the evaluation network is to minimize loss_critic.

The specific learning process in each minute is as follows:

The action network and the evaluation network observe the market state s^t of the current minute from the environment; the state is represented by the state vector v_state^t;

The action network predicts the trading action a^t = A(s^t) from the market state. The prediction can be broken down as follows: the input state vector v_state^t passes through the first fully connected layer and a ReLU transformation, giving v_am1 = Relu(ω_a1 · v_state^t + b_a1), where ω_ak and b_ak are the weight and bias of the k-th fully connected layer of the action network, and ω_ck and b_ck below are the weight and bias of the k-th fully connected layer of the evaluation network; v_am1 passes through the second fully connected layer and a ReLU transformation, giving v_am2 = Relu(ω_a2 · v_am1 + b_a2); v_am2 passes through the third fully connected layer, giving the probability of each trading action p^t = ω_a3 · v_am2 + b_a3, and a softmax transformation yields the trading action a^t = softmax(p^t);

After receiving the output a^t of the action network, the environment produces two pieces of feedback: the state s^{t+1} that the market moves to in the next minute, which can likewise be represented by a state vector v_state^{t+1}, and the reward r_{a^t} for selecting the trading action a^t;

The evaluation network predicts the evaluation v^t = Q(s^t) from the market state. The prediction can likewise be broken down as follows: the input state vector v_state^t passes through the first fully connected layer and a ReLU transformation, giving v_cm1 = Relu(ω_c1 · v_state^t + b_c1); v_cm1 passes through the second fully connected layer and a ReLU transformation, giving v_cm2 = Relu(ω_c2 · v_cm1 + b_c2); v_cm2 passes through the third fully connected layer, giving the required value vector v^t = ω_c3 · v_cm2 + b_c3;

Compute the temporal-difference error td_error = r_{a^t} + γ * Q(s^{t+1}) - Q(s^t), where 0 ≤ γ < 1;

The parameters of the action network and the evaluation network are updated iteratively by minimizing their respective loss functions. The loss function of the action network is loss_actor = -log(p) * td_error, and the loss function of the evaluation network is loss_critic = td_error²; the minimization is carried out by backpropagation through the neural networks.

The overall process of the reinforcement learning subsystem is as follows. An action network is constructed, which uses the rewards obtained from the trading environment to adjust the probabilities of taking each trading action in different states and thereby selects the most suitable trading action. An evaluation network is also constructed; it is a value-based learning network that computes the reward or penalty of each step through a single-step update algorithm. The two networks are combined: the policy network selects the trading action, the evaluation network judges how good the selected action is, and continuous iterative updating completes the selection and optimization of the action (target position).

In the traditional Q-learning algorithm, a limited number of states is easy to handle and easy to represent, but when the market state space is very large, traditional Q-learning is powerless. In the traditional Policy Gradient algorithm, the reward is computed over a complete episode, which inevitably makes learning very slow, so that it takes a long time to learn anything. The reinforcement learning prediction system of the present invention, based on the Actor-Critic algorithm, solves the problems of both conventional methods well: the Actor iterates continuously to obtain a reasonable probability of selecting each action in each state, and the Critic likewise iterates continuously to refine the reward or penalty value of selecting each action in each state. As a result, the trading action selected each time becomes better and better, and the assessment of each trading action becomes more and more accurate.

The terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent;

Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly and do not limit its implementations. Those of ordinary skill in the art may make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. A financial time series prediction method based on deep reinforcement learning, characterized by comprising the following steps:

S2: construct a data processing subsystem, which preprocesses the downloaded raw financial time series data and outputs the preprocessed data;

S3: construct a feature extraction subsystem, which extracts deep features from the preprocessed data and outputs the extracted deep feature information;

S4: construct a reinforcement learning subsystem, which compares the deep feature information with the trading environment, performs reinforcement learning, and outputs a trading action;

S5: adjust the target position in the financial market in real time according to the trading action generated from the data fed into the system, thereby accomplishing the trade.

2. The financial time series prediction method based on deep reinforcement learning according to claim 1, characterized in that the financial time series data in S1 include price-volume information and macro information;

The macro information includes: price-to-earnings ratio, price-to-book ratio, price-to-sales ratio, price-to-cash-flow ratio, free-float market capitalization, total market capitalization, gross profit margin, net assets per share, asset-liability ratio, and exchange rate.

3. The financial time series prediction method based on deep reinforcement learning according to claim 1, characterized in that the data preprocessing in S2 includes computing technical indicators, checking for and handling missing values, standardizing the data, and handling outliers, with the following specific steps:

S21: receive the raw financial time series vector v_raw output by the external system Wind API, check it for missing values, and mark the missing values;

S22: using the raw financial time series vector v_raw, compute the corresponding technical indicator information from the price-volume information and the macro information;

S23: set the sliding time window to window = 600, i.e., each computation covers 600 minutes and the window slides forward by one minute at a time; compute the mean μ_i and the standard deviation σ_i of each dimension of v_raw within the sliding window, and then standardize each dimension as v'_i = (v_i - μ_i) / σ_i, where v_i is the untransformed value of each dimension of v_raw and v'_i is the new value of that dimension after standardization, so that every dimension is standardized to data with mean 0 and variance 1;

S24: identify outliers using the box plot method, remove them, and replace them with the mean μ_i computed in S23;

S26: the preprocessing of the raw vector v_raw in S21 to S25 yields the market state feature vector v_f, which is the output of the data processing subsystem.

4. The financial time series prediction method based on deep reinforcement learning according to claim 3, characterized in that the technical indicator information includes: moving average convergence divergence MACD, average directional index ADX, commodity channel index CCI, relative strength index RSI, Bollinger Bands BOOL, average true range ATR, 5-day exponential moving average EMA_5, 20-day exponential moving average EMA_20, 60-day exponential moving average EMA_60, 120-day exponential moving average EMA_120, price rate of change ROC, stochastic momentum index SMI, accumulation/distribution AD, momentum MOM, percentage price oscillator PPO, and Williams variable accumulation/distribution WVAD.

5. The financial time series prediction method based on deep reinforcement learning according to claim 1, characterized in that the feature extraction subsystem in step S3 works as follows: it receives the market state feature vector v_f preprocessed by the data processing subsystem as its input vector, extracts deep features through its neural network, and outputs the resulting deep feature v_deep as the output of the feature extraction subsystem.

6. The financial time series prediction method based on deep reinforcement learning according to claim 1, characterized in that the processing steps of the reinforcement learning subsystem in step S4 are:

S42: construct the action network. The state vector v_state is the input vector; it passes through a first fully connected layer with 128 hidden neurons followed by a ReLU transformation, which outputs a 128-dimensional intermediate vector v_am1 to a second fully connected layer with 128 hidden neurons, again followed by a ReLU transformation. The intermediate vector v_am2 output by the second hidden layer then passes through a fully connected layer with 3 neurons, which outputs a 3-dimensional action vector representing the target position. The three dimensions are the probabilities p that the position is short 1 lot, no position, and long 1 lot, respectively. A softmax transformation then yields the final trading action a = A(s), where s is the market state and A(s) is the mapping obtained through the action network, i.e., the trading action;

S43: after executing the trading action a of S42, obtain the market environment's feedback on the current state, i.e., the reward r, and the next state s' of the market environment;

S44: construct the loss function of the action network, loss_actor = -log(p) * td_error, where p is the probability of the action a selected in step S42 and td_error is the temporal-difference error obtained from the evaluation network; the optimization objective of the action network is to minimize loss_actor;

S45: construct the evaluation network. The state vector v_state is again the input vector; it passes through a first fully connected layer with 128 hidden neurons followed by a ReLU transformation, which outputs a 128-dimensional intermediate vector v_cm1 to a second fully connected layer with 128 hidden neurons, again followed by a ReLU transformation. The intermediate vector v_cm2 output by the second hidden layer then passes through a fully connected layer with 1 neuron, which outputs a 1-dimensional value variable v = Q(s), where s is the market state and Q(s) is the mapping obtained through the evaluation network, a value characterizing worth, also called the Q value, which represents the value of the trading action;

S46: compute the temporal-difference error td_error = r + γ * Q(s') - Q(s), where γ is the discount weight on future rewards, which characterizes how much importance is given to future rewards, Q(s) is the Q value obtained by feeding the state s into the evaluation network, and Q(s') is the Q value obtained by feeding the next state s' into the evaluation network;

S47: construct the loss function of the evaluation network, loss_critic = td_error², and the optimization objective of the evaluation network is to minimize loss_critic.

7. The method according to claim 1, characterized in that the trading behavior in step S5 follows the trading action a obtained by the reinforcement learning subsystem: the actual live position is adjusted to the target position indicated by the trading action, finally completing the process of trading through system learning.