WO2023108987A1 - Method, apparatus, device and storage medium for risk prediction based on reinforcement learning - Google Patents

Method, apparatus, device and storage medium for risk prediction based on reinforcement learning

Info

Publication number
WO2023108987A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
risk
training set
critic
actor
Prior art date
Application number
PCT/CN2022/090029
Other languages
English (en)
French (fr)
Inventor
肖京
郭骁
王磊
王媛
刘云风
谭韬
陈又新
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2023108987A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Definitions

  • The present application relates to the field of artificial intelligence, and in particular to a method, apparatus, device, and storage medium for risk prediction based on reinforcement learning.
  • Reinforcement learning has gradually developed into an important branch of artificial intelligence. Reinforcement learning is a machine learning algorithm in which an agent obtains rewards from the environment through its actions relative to the environment, thereby maximizing the reward. Reinforcement learning is favored by many researchers and has been successfully applied in automation, intelligent control, autonomous driving, and other fields.
  • At present, application in the financial field is a hot spot of current reinforcement learning research, mainly for the analysis, study, and decision-making of financial markets.
  • A reinforcement-learning-based model that predicts the risk value of the subject matter in the financial market (such as stocks, funds, futures, bonds, derivatives, etc.) can greatly reduce losses caused by subjective human calculation and emotion-driven operations, as well as the impact of human error.
  • However, the inventors realized that the data of the subject matter are highly time-varying, which leads to poor model performance and poor actual prediction results.
  • Embodiments of the present application provide a method, apparatus, device, and storage medium for risk prediction based on reinforcement learning, which can improve the accuracy of risk prediction and facilitate risk decision-making.
  • In a first aspect, an embodiment of the present application provides a method for risk prediction based on reinforcement learning, wherein the first risk prediction model is a model obtained by optimizing a first Actor model based on a first training set, a first Critic model, or a second Critic model; the first Actor model is obtained by training on a second training set; the first training set and the second training set are historical data extracted from a preset database, and the preset database includes the target historical data.
  • In a second aspect, an embodiment of the present application provides an apparatus for risk prediction based on reinforcement learning, wherein:
  • a receiving unit is configured to receive a risk prediction request for a target object, where the risk prediction request includes a forecast date;
  • a processing unit is configured to obtain the receipt date of the risk prediction request and the target historical data of the target object for the N days before the receipt date, where N is a positive integer greater than or equal to 1;
  • the first risk prediction model is a model obtained by optimizing the first Actor model based on the first training set, the first Critic model, or the second Critic model; the first Actor model is obtained by training on the second training set; the first training set and the second training set are historical data extracted from a preset database, and the preset database includes the target historical data.
  • In a third aspect, an embodiment of the present application provides a computer device, which includes a processor, a memory, and a communication interface, where the memory stores a computer program, the computer program is configured to be executed by the processor, and the computer program includes instructions for performing the following steps:
  • the first risk prediction model is a model obtained by optimizing the first Actor model based on the first training set, the first Critic model, or the second Critic model; the first Actor model is obtained by training on the second training set; the first training set and the second training set are historical data extracted from a preset database, and the preset database includes the target historical data.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program causes a computer to execute instructions for the following steps:
  • the first risk prediction model is a model obtained by optimizing the first Actor model based on the first training set, the first Critic model, or the second Critic model; the first Actor model is obtained by training on the second training set; the first training set and the second training set are historical data extracted from a preset database, and the preset database includes the target historical data.
  • With the method, apparatus, device, and storage medium for reinforcement-learning-based risk prediction, after the risk prediction request for the target object is received, the receipt date of the risk prediction request and the target historical data of the target object for the N days before the receipt date are obtained.
  • Feature extraction is performed on the target historical data to obtain the target state feature corresponding to each of a plurality of preset feature dimensions.
  • The target state features are input into the first risk prediction model to obtain the risk value of the target object on the forecast date.
  • The first risk prediction model is a model obtained by optimizing the first Actor model based on the first training set, the first Critic model, or the second Critic model, and the first Actor model is obtained by training on the second training set.
  • The first training set and the second training set are historical data extracted from a preset database, and the preset database includes the target historical data.
  • Fig. 1 is a schematic diagram of the working principle of an Actor-Critic algorithm provided by the embodiment of the present application;
  • FIG. 2 is a schematic flow diagram of a method for risk prediction based on reinforcement learning provided by an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of an LSTM algorithm provided in an embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of a first Actor model based on the LSTM algorithm provided by the embodiment of the present application;
  • Fig. 5 is a kind of Actor-Critic interactive training flowchart provided by the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a device for risk prediction based on reinforcement learning provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • In the embodiment of the present application, the reinforcement-learning-based risk prediction method can be applied to electronic devices or servers deployed by financial institutions such as banks, securities firms, and insurance companies.
  • Reinforcement learning is a machine learning algorithm in which an agent obtains rewards from the environment through its actions relative to the environment, thereby maximizing the reward. Therefore, a reinforcement learning algorithm has several basic elements: the agent, the environment, the state, the action, and the reward (return). For ease of understanding, these basic concepts are introduced first below.
  • Agent: also known as an "intelligent body" or "intelligent subject".
  • The agent can automatically adjust its behavior and state according to changes in the external environment, rather than just passively accepting external stimuli, and has the ability of self-management and self-regulation.
  • The agent can also accumulate or learn experience and knowledge and modify its behavior to adapt to a new environment.
  • Environment: the part of the system other than the agent, which feeds back states and rewards to the agent and may also change according to certain rules.
  • In this application, the environment can be the financial market.
  • State: the objective condition of the system in each time period. For a certain subject matter in the financial market, there can be three states in a given period of time: rising, falling, and consolidation.
  • Action: also known as a decision. After the time and state are determined, the agent makes different choices according to the state of the environment, so that the current state transitions to the next state, either deterministically or with a certain probability; this process is called an action. For the operation of a certain subject matter, there are three different actions: buying, selling, and holding.
  • Reward: also called return. Returns can be positive or negative.
  • Reinforcement learning algorithms can be roughly divided into value-based algorithms and policy-based algorithms.
  • the typical representative of the value-based algorithm is the Q-Learning algorithm
  • the typical representative of the policy-based algorithm is the policy gradient (PG).
  • In the financial field, the Q-Learning algorithm is mainly used to define the market state, then choose trading actions according to the ε-greedy strategy, and then interact with the environment to obtain rewards.
  • The main idea of the algorithm is to form a Q-value table indexed by state and action to store the Q values, and then update the Q-value table according to the rewards, so as to optimize the trading actions.
  • However, when the dimensionality of the state or action space is too large, it is difficult for the Q-Learning algorithm to converge.
  • The application of the PG algorithm in the financial field also defines the market state and selects the most favorable action according to the existing strategy.
  • The action is rewarded through environmental feedback, and the policy is then updated in reverse.
  • The PG algorithm can be used in high-dimensional action spaces, but it is prone to falling into local optima, and its episode-based update is relatively inefficient.
  • The Actor-Critic algorithm includes two parts: the actor (Actor) and the critic (Critic). This algorithm combines the advantages of the PG algorithm and the Q-Learning algorithm. The Actor, as a policy network, selects a behavior based on probability, while the Critic scores the behavior chosen by the Actor. The Actor then modifies the probability of choosing an action based on the Critic's score. The combination of the two lets the policy network perform gradient updates according to the value function, optimizing the model parameters and obtaining the optimal action selection under different environment states, which is faster than the traditional episode-based update of PG.
  • the working principle of the Actor-Critic algorithm is shown in Figure 1.
  • The Actor model uses a policy network to approximate the policy function. The policy function, which represents the way the agent makes decisions about the environment, can be expressed as the conditional probability of taking action a given the state s and the network weights θ.
  • The Critic model uses a value network to approximate the value function, which can take the form of the state value or the state-action value.
  • The policy network parameters are updated according to the policy gradient, where w is the network parameter of the Critic model, θ and θ′ are the policy network parameters before and after the update respectively, and α is the update step size, which is selected according to the actual situation.
  • The evaluation of the Critic model can be based on a value function that uses E(t), the utility (eligibility) trace of the state. The mean square error loss function is generally used for the iterative update of the Critic network parameter w.
  • The feature vector s_t of the current state is used as the input of the Actor policy network, which outputs the action a_t and interacts with the environment to obtain a new state s_{t+1} and the current reward value r_t.
  • The number of training iterations can be T, the state feature dimension Y, the action space A, the step size α, and the decay factor γ, where γ takes a value between 0.0 and 1.0.
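As a minimal illustration of the Actor-Critic update just described (not part of the original disclosure), the following Python sketch uses PyTorch and the standard textbook forms of the updates: the Critic is fit to the TD target with a mean squared error loss, and the Actor follows the policy gradient weighted by the TD error. The network sizes, learning rates (step sizes), and the decay factor γ = 0.99 are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Placeholder sizes: Y state features, A discrete actions (assumptions, not from the patent).
Y, A = 8, 2
actor = nn.Sequential(nn.Linear(Y, 32), nn.Tanh(), nn.Linear(32, A))   # policy network pi_theta(a|s)
critic = nn.Sequential(nn.Linear(Y, 32), nn.Tanh(), nn.Linear(32, 1))  # value network V_w(s)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)   # step size for theta
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3) # step size for w
gamma = 0.99  # decay factor, between 0.0 and 1.0

def actor_critic_step(s_t, a_t, r_t, s_next):
    """One update: the Critic fits the TD target, the Actor follows the policy gradient."""
    v_t = critic(s_t)
    td_target = r_t + gamma * critic(s_next).detach()     # bootstrapped return
    td_error = (td_target - v_t).detach()                 # weights the policy gradient
    critic_loss = nn.functional.mse_loss(v_t, td_target)  # mean squared error update of w
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    # theta' = theta + alpha * grad_theta log pi_theta(a_t | s_t) * td_error
    log_prob = torch.log_softmax(actor(s_t), dim=-1)[0, a_t]
    actor_loss = (-log_prob * td_error).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# Example transition (s_t, a_t) -> r_t, s_{t+1} with random placeholder tensors.
actor_critic_step(torch.randn(1, Y), 1, torch.tensor([[0.5]]), torch.randn(1, Y))
```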
  • the above is the introduction of the Actor-Critic algorithm.
  • The above methods usually learn the optimal strategy by calculating the cumulative reward.
  • Although this approach is simple and direct, accumulating rewards over multi-step decisions requires a huge amount of data, and the huge search space makes samples in financial-market decision problems very scarce, resulting in sparse rewards, so the decision model parameters cannot be effectively optimized.
  • FIG. 2 is a schematic flowchart of a method for risk prediction based on reinforcement learning provided by an embodiment of the present application. Taking the method applied to an electronic device as an example, the method includes the following steps S201-S204:
  • Step S201 Receive a risk forecast request for a target object, where the risk forecast request includes a forecast date.
  • the target object may be one or more stocks, and may also be financial products such as bonds, funds, and futures.
  • the risk prediction request can be generated according to the user's operation, or it can be automatically triggered when the prediction period arrives, which is not limited here.
  • The forecast period may be every working day. For example, after the market closes at 15:00 on every working day, a risk forecast request is sent to the electronic device. In this way, when the forecast period arrives, risk prediction can be performed on the target object.
  • the predicted date may be the day after the current date, or a certain day one week after the current date, which is not limited here. If the forecast date obtained in the risk forecast request is the closing time of the trading market corresponding to the subject matter, such as weekends or holidays, the forecast date will be postponed to a working day. For example, the current date is October 17, 2021, and the predicted date may be October 18, 2021, or October 19, 2021. If the forecast date obtained in the risk forecast request is October 17, 2021 (Sunday), the forecast date will be postponed to October 18, 2021 (Monday). If the forecast date is not specified in the risk forecast request, the default forecast date is the day after the current date, and it will be postponed in case of weekends or holidays.
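The forecast-date rule above (default to the day after the current date and postpone weekends or closed days) can be sketched as follows; this is an illustrative assumption of how the rule might be coded, and the `HOLIDAYS` set is a placeholder for a real exchange trading calendar.

```python
from datetime import date, timedelta
from typing import Optional

HOLIDAYS: set = set()  # placeholder: a real exchange trading calendar would be loaded here

def resolve_forecast_date(current: date, requested: Optional[date] = None) -> date:
    """Default to the day after the current date, then postpone weekends/closed days."""
    d = requested or (current + timedelta(days=1))
    while d.weekday() >= 5 or d in HOLIDAYS:  # 5 = Saturday, 6 = Sunday
        d += timedelta(days=1)
    return d

# Example from the text: 2021-10-17 (a Sunday) is postponed to 2021-10-18 (Monday).
print(resolve_forecast_date(date(2021, 10, 16), date(2021, 10, 17)))  # 2021-10-18
```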
  • Step S202 Obtain the date of receipt of the risk forecast request and the target historical data of the target object N days before the date of receipt.
  • the receiving date refers to the date when the electronic device receives the risk prediction request.
  • N may be any positive integer greater than or equal to 1, and the specific value of N is not limited.
  • N may be 10, or 30, or 60.
  • N can also be determined based on the time interval between the predicted date and the received date, for example, the larger the time interval, the larger N is.
  • Target historical data can be extracted from preset databases.
  • the preset database may be pre-stored in the electronic device, or stored in the server, and the electronic device obtains the preset database by accessing the server.
  • the preset database may include commonly used price volume indicator data of the subject matter from the historical time to the current time.
  • the historical time may refer to any past time.
  • the historical time may be January 1, 2010, December 31, 2018, or January 1, 2020, etc., which is not limited.
  • This application does not limit the data types of the data in the preset database; please refer to Table 1.
  • For example, the data in the preset database can include the opening price, closing price, highest price, lowest price, trading volume, 5-day moving average, 10-day moving average, 20-day moving average, 60-day moving average, etc. of the subject matter.
  • Table 1 (serial number, indicator code, indicator name): 1 Pop, opening price; 2 Pcl, closing price; 3 Phi, highest price; 4 Plo, lowest price; 5 Volume, trading volume; 6 MA5, 5-day moving average; 7 MA10, 10-day moving average; 8 MA20, 20-day moving average; 9 MA60, 60-day moving average.
  • the preset database may include the data of the expert factor database.
  • Expert factors refer to factors that have a certain qualitative and quantitative relationship with the downside risk of each target.
  • the expert factor library can include the following eight dimensions: macro indicators, industry indicators, characteristic derivative indicators, capital technology indicators, capital flow indicators, derivative market indicators, and public opinion heat indicators.
  • Each preset feature dimension corresponds to a certain amount of target state features, see Table 2 for details.
  • the target state characteristics corresponding to the capital technical indicators can be N-day moving average, N-day volatility, Bollinger Bands, Mike lines, etc.
  • the target state characteristics corresponding to the capital flow indicators can be northbound capital inflow, southbound capital inflow, main force inflow of funds, etc.
  • The preset database may also include the purchasing managers' index (PMI) shown in Table 2, Bollinger Bands, northbound capital inflows, etc.; or it may include broad market indices not shown in Table 2, such as the Shanghai Stock Exchange Index, the Shenzhen Stock Exchange Index, the ChiNext Index, the Hang Seng Index, and the Standard & Poor's 500 Index, which are not limited in this embodiment of the present application.
  • The data sources of the preset database can include finance-related webpages or applications, economic data released by the Bureau of Statistics, corporate financial statements, Shanghai/Shenzhen/overseas market data, social media statistics, and so on, which are not limited here.
  • Step S203 Perform feature extraction on the target historical data to obtain target state features corresponding to each preset feature dimension in multiple preset feature dimensions.
  • In some embodiments, after step S202 is performed, the following step may further be included: preprocessing the abnormal data in the target historical data to obtain the data to be processed.
  • Step S203 may include: performing feature extraction on the data to be processed to obtain target state features corresponding to each preset feature dimension in multiple preset feature dimensions.
  • abnormal data may include missing values and noise values.
  • Noise values refer to noise data, that is, data that describe the scenario inaccurately. For example, the random error or variance of the measured variable can be calculated, and values smaller than the random error or variance can be determined to be noise values.
  • the preprocessing of abnormal data in the target historical data may include: filling of missing values and processing of noise values.
  • The method of filling missing values is not limited in the embodiment of the present application; methods such as mean/mode completion, hot deck imputation, or the K-nearest-neighbor method (e.g., k-means clustering) can be used to fill in missing values.
  • the embodiment of the present application does not limit the method of processing the noise value, and one or more methods of binning, clustering, or regression may be used to process the noise value.
  • After the abnormal data are preprocessed and the data to be processed are obtained, the data to be processed can also be normalized. It can be understood that preprocessing the abnormal data helps improve the efficiency and accuracy of data processing.
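A minimal preprocessing sketch along the lines described above is shown below, assuming the target historical data are held in a pandas DataFrame with Table 1 style columns; mean filling, quantile clipping, and z-score normalization are one possible combination of the methods listed, not the patent's prescribed choice.

```python
import pandas as pd

def preprocess(history: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values, damp noise values, and normalize the target historical data."""
    data = history.copy()
    data = data.fillna(data.mean(numeric_only=True))   # missing-value filling (mean completion)
    low, high = data.quantile(0.01), data.quantile(0.99)
    data = data.clip(lower=low, upper=high, axis=1)    # crude noise handling by clipping outliers
    return (data - data.mean()) / data.std(ddof=0)     # z-score normalization

# Example with two Table-1 style columns (column names are illustrative only).
df = pd.DataFrame({"Pcl": [10.0, None, 10.4, 10.2], "Volume": [1e6, 1.2e6, None, 9e5]})
print(preprocess(df))
```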
  • The preset feature dimensions may be the expert factor library data shown in Table 2, which will not be repeated here. It should be noted that the preset feature dimensions and corresponding target state features shown in Table 2 are only examples; other feature dimensions related to the subject matter and their corresponding target state features may also be included, which is not limited in this embodiment of the present application.
  • The expert factors are features summarized from financial-market theory and practical experience. Therefore, compared with the traditional price-volume and technical indicators shown in Table 1, the target state features obtained based on the expert factors in the embodiment of the present application have more direct guiding significance for the prediction of downside risk, and can provide more advanced state inputs for the reinforcement learning algorithm to help control model overfitting.
  • Step S204 Input the target state characteristics into the first risk prediction model to obtain the risk value of the target object on the forecast date.
  • the risk value is used as the output of the first risk prediction model.
  • The risk value indicates whether the target object has a significant downward trend; its value can be 1 or 0, where 1 means that the subject matter is predicted to have a significant downside risk in the future and 0 means that the subject matter is predicted to have no significant downside risk in the future.
  • The first risk prediction model is a model obtained by optimizing the first Actor model based on the first training set, the first Critic model, or the second Critic model; the first Actor model is obtained by training on the second training set; the first training set and the second training set are historical data extracted from a preset database, and the preset database includes the target historical data.
  • the preset database may be divided according to the time stamp to obtain the first training set and the second training set.
  • Data for a certain period of time can be obtained from the preset database; a certain proportion (for example, the first 1/5) of the data in this period can be divided into the second training set, and the remaining data into the first training set.
  • For example, ten years of data of a certain subject matter, from January 1, 2010 to December 31, 2020, are acquired from the preset database as training data. The data from January 1, 2010 to December 31, 2012 can then be divided into the second training set, and the data from January 1, 2013 to December 31, 2020 into the first training set.
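A simple timestamp-based split matching the example above might look like the following sketch; the cut date and column name are taken from, or assumed for, the example only.

```python
import pandas as pd

def split_by_timestamp(data: pd.DataFrame, cut: str = "2013-01-01"):
    """Earlier slice -> second training set (expert-rule pre-training),
    later slice -> first training set (Actor optimization)."""
    data = data.sort_index()
    second_training_set = data.loc[:pd.Timestamp(cut) - pd.Timedelta(days=1)]
    first_training_set = data.loc[cut:]
    return first_training_set, second_training_set

# Example: daily rows covering 2010-01-01 .. 2020-12-31 (synthetic index for illustration).
idx = pd.date_range("2010-01-01", "2020-12-31", freq="D")
demo = pd.DataFrame({"Pcl": range(len(idx))}, index=idx)
first, second = split_by_timestamp(demo)
print(len(second), len(first))  # roughly the first 3 years vs the remaining 8
```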
  • This application does not limit the acquisition method of the first actor model, the first critic model or the second critic model, and the acquisition method of the first actor model will be introduced first below.
  • In some embodiments, before step S204 is performed, the following steps may also be included: extracting the second training set from the preset database; calculating on the second training set based on preset expert rules to obtain a first action set corresponding to different states; and performing machine learning based on the first action set to obtain the first Actor model.
  • The preset expert rules refer to long-established theoretical understanding and financial risk-control theory concerning the high correlation between certain types of financial indicators and downside risk.
  • The preset expert rules include, but are not limited to, quantitative timing methods such as Bollinger Bands, resistance support relative strength (RSRS), moving average convergence divergence (MACD), and market sentiment.
  • Such methods are often based on heuristic rules or on calculations of the relative relationships of several key indicators when fitting downside-risk signals, and they require fewer samples than purely data-driven algorithms.
  • In the embodiment of the present application, the Bollinger Band signal is taken as an example to introduce the use of preset expert rules.
  • Bollinger Bands, named after their inventor John Bollinger, are used to describe the range of price fluctuations.
  • The basic form of the Bollinger Bands is a band-shaped channel composed of three lines (a middle band and an upper and a lower band), computed over n samples, where the value of n depends on the actual situation and is generally 20.
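The sketch below illustrates how a Bollinger Band expert rule could produce expert action labels, using the standard definition (middle band = n-day moving average, upper/lower bands = middle ± k standard deviations, n = 20). The specific danger-signal rule (the close breaking below the lower band) is an assumption for illustration, consistent with the later description of comparing the day's closing price with the bands.

```python
import pandas as pd

def bollinger_expert_actions(close: pd.Series, n: int = 20, k: float = 2.0) -> pd.Series:
    """Standard Bollinger Bands: middle = n-day moving average, bands = middle +/- k * std.
    Expert action label: 1 (danger/sell signal) when the close breaks below the lower band,
    otherwise 0. The breakout rule is an assumption for illustration."""
    middle = close.rolling(n).mean()
    std = close.rolling(n).std()
    lower = middle - k * std
    return (close < lower).astype(int)

# Example on synthetic closing prices.
close = pd.Series([10 + 0.01 * i for i in range(60)])
print(bollinger_expert_actions(close).tail())
```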
  • The preset expert rules can be used to pre-train the first Actor model so that its decision-making level approaches that of the preset expert rules before alternating training with the Critic model.
  • In this way, high-value rewards can be obtained through fast sampling, which avoids a large amount of inefficient or even invalid sampling in the early stage of the Actor-Critic algorithm.
  • Specifically, imitation learning can be used: the preset expert rules are first applied to fit the state of the financial market and obtain expert action labels, and these labels are then used for supervised pre-training.
  • the specific implementation process is as follows:
  • the second training set is extracted from the preset database as the state input.
  • The preset expert rules are used to calculate on the second training set to obtain the corresponding first action set, that is, the expert actions, so as to obtain the mapping from state to action and form a set of trading strategies.
  • Each piece of expert decision data contains a state and action sequence, where n is the number of samples.
  • All state-action elements are combined to build a new set D = {(s_1, a_1), (s_2, a_2), ...}.
  • The state is used as the feature and the action as the label, and supervised learning by classification (for discrete actions) or regression (for continuous actions) is performed, so that the first Actor model can be obtained.
  • The policy network corresponding to the first Actor model can use a recurrent neural network (RNN), a long short-term memory (LSTM) network, or a gated recurrent unit (GRU), which is not limited in this embodiment of the present application.
  • the LSTM algorithm is used as an example to construct the first Actor model.
  • The LSTM model is a special RNN model that addresses the lack of long-term memory in the RNN model by introducing a gate mechanism.
  • one neuron of the LSTM model contains one cell state (cell) and three gate (gate) mechanisms.
  • the cell state (cell) is the key to the LSTM model, similar to the memory, which is the memory space of the model.
  • the cell state changes over time, and the recorded information is determined and updated by the gating mechanism.
  • the gate mechanism is a method to allow information to pass through selectively, and it is realized by the sigmoid function and the dot product operation.
  • The value of the sigmoid is between 0 and 1, and the dot product determines the amount of information transmitted (how much of each part can pass). When the sigmoid is 0, information is discarded; when it is 1, information is transmitted completely (that is, fully remembered).
  • LSTM maintains information at any time through three gates: forget gate, update gate and output gate.
  • Figure 3 is a schematic structural diagram of an LSTM algorithm provided by the embodiment of the present application. As shown in Figure 3, the input of the first Actor model constructed with the LSTM algorithm may include the current state s_t of the financial market at the current moment t, the hidden state h_{t-1} of the LSTM network, which records the temporal correlation of the historical data, and the decision a_{t-1} of the previous moment t-1; the model outputs the decision a_t of the current moment.
  • the internal state of the network created by the directed loop architecture in the LSTM algorithm can process time-based sequence data and remember temporal connections, so it can solve the long-term dependency problem.
  • the large-scale LSTM algorithm has the ability to highly represent features, that is, it can learn rich time feature representations, and agents based on LSTM algorithms can mine temporal patterns in financial market data and remember historical states and actions.
  • The first Actor model based on the LSTM algorithm consists of an input layer, an LSTM layer, and an output layer, where the input and output layers are fully connected layers whose dimensions match the feature input and the action output respectively.
  • In this way, the market state sequence can be described to form the input features, which helps overcome the defect that machine learning methods using only cross-sectional data cannot capture the pattern of market state changes.
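A PyTorch sketch of the input layer / LSTM layer / output layer structure described above is given below; the feature dimension, hidden size, and the concatenation of the previous action with the state are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class ActorLSTM(nn.Module):
    """Input FC layer -> LSTM layer -> output FC layer, as described above.
    The state s_t is concatenated with the previous action a_{t-1}; sizes are illustrative."""
    def __init__(self, state_dim=30, action_dim=2, hidden=64):
        super().__init__()
        self.input_fc = nn.Linear(state_dim + 1, hidden)   # +1 for the previous (scalar) action
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.output_fc = nn.Linear(hidden, action_dim)

    def forward(self, s_t, a_prev, hidden_state=None):
        x = torch.cat([s_t, a_prev], dim=-1)               # (batch, seq, state_dim + 1)
        x = torch.tanh(self.input_fc(x))
        x, hidden_state = self.lstm(x, hidden_state)       # hidden state carries temporal memory
        return self.output_fc(x), hidden_state             # logits for the action a_t

# Example: a batch of one 10-day state sequence.
model = ActorLSTM()
s = torch.randn(1, 10, 30)
a_prev = torch.zeros(1, 10, 1)
logits, h = model(s, a_prev)
print(logits.shape)  # torch.Size([1, 10, 2])
```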
  • the first Actor model can be pre-trained through the following steps:
  • Step A1: Prepare the set D and randomly shuffle it;
  • Step A2: Randomly initialize the weight values θ of the first Actor model;
  • Step A3: Select samples (s_n, a_n) from D and, using the current network, calculate the output action with the state s_n as input;
  • Step A4: Calculate the loss function value L and its derivative with respect to the weight values of the first Actor model, that is, the gradient of the network parameters of the first Actor model;
  • Step A5: Update the network parameters along the gradient direction of the network parameters of the first Actor model with a given step size;
  • Step A6: Repeat steps A3-A5 until the number of training iterations is reached or the loss function value L converges.
  • The loss function value L of the first Actor model may include, but is not limited to, binary cross entropy, which measures the difference between the action signal calculated by the preset expert rules and the predicted value a_t. In a binary classification problem, the cross entropy over the samples can be expressed as L = -(1/N) Σ_t [a_t·log(p_t) + (1 - a_t)·log(1 - p_t)], where N is the number of samples, p_t = sigmoid(x_t) is the probability that the sample is predicted to be 1, and x_t is the output of the final fully connected layer.
  • The action output is obtained from this probability (for example, 1 when p_t exceeds 0.5 and 0 otherwise).
  • Through pre-training, the policy network corresponding to the first Actor model can be elevated from a randomly initialized, "know-nothing" level to a near-expert level.
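Steps A1-A6 can be sketched as a standard supervised pre-training loop with binary cross entropy, as below; the network, data, batch size, and learning rate are placeholders standing in for the first Actor model and the expert set D.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic expert set D = {(s_i, a_i)}: states from the expert-factor features,
# labels a_i in {0, 1} produced by the preset expert rules (random placeholders here).
states = torch.randn(256, 30)
expert_actions = torch.randint(0, 2, (256,)).float()

policy = nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Linear(64, 1))  # simple stand-in for the Actor
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)              # step size
bce = nn.BCEWithLogitsLoss()                                            # binary cross entropy on logits x_t

for epoch in range(50):                          # A6: repeat until the loss converges
    perm = torch.randperm(len(states))           # A1: random shuffling of D
    for i in range(0, len(states), 32):          # A3: take mini-batches (s_n, a_n)
        idx = perm[i:i + 32]
        logits = policy(states[idx]).squeeze(-1)
        loss = bce(logits, expert_actions[idx])  # A4: loss L and its gradient
        optimizer.zero_grad(); loss.backward()   # backpropagate
        optimizer.step()                         # A5: update weights along the gradient

# After pre-training, the action output is p_t = sigmoid(x_t) thresholded at 0.5.
with torch.no_grad():
    p = torch.sigmoid(policy(states[:4]).squeeze(-1))
    print((p > 0.5).int())
```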
  • the first Actor model can be used to make corresponding predictions in the environment.
  • However, the preset expert rules represented by the Bollinger Band signal only use historical closing prices to calculate the upper, middle, and lower band indicators, compare them with the closing price of the day to compute a danger signal, and thereby obtain the expert action.
  • In this application, the input of the policy network is selected from the expert factor library, whose dimensionality is larger than that of the Bollinger Band signal input; therefore, the pre-trained first Actor model still has limitations.
  • In view of this, the first Actor model can be further optimized using the first training set to obtain the first risk prediction model, so as to obtain better decisions. Specifically, after machine learning is performed based on the first action set to obtain the first Actor model, the following steps are further included: extracting the first training set from the preset database; calculating on the first training set based on the preset expert rules to obtain a second action set corresponding to different states; and optimizing the first Actor model based on the second action set to obtain the first risk prediction model.
  • Specifically, the first Actor model pre-trained with the preset expert rules is applied to the first training set, and downside risk prediction is performed based on the preset expert rules to obtain the second action set a_t.
  • Through supervised learning, the binary cross-entropy function is again calculated and its gradient is back-propagated to the first Actor model to keep optimizing it, so as to obtain the first risk prediction model.
  • It can be seen that this way of optimizing the first Actor model based on the first training set does not directly use the Critic model but is based on real data.
  • In this way, the first Actor model is prevented from being limited by the level of the Critic model, and the prediction accuracy of the model can be improved.
  • the first actor model may be optimized by using the first critic model to obtain the first risk prediction model, so as to improve the accuracy of the risk prediction of the model.
  • In some embodiments, the construction process of the first Critic model may include the following steps: performing feature extraction on the first training set to obtain first state features and second state features; splicing the first state features and the second state features to obtain third state features; performing machine learning based on the third state features to obtain base models; inputting the third state features into the base models to obtain base-model training results; obtaining the ranking of the base models according to the training results; determining the weighting of each base model according to the ranking; and performing model fusion on the base models according to the weighting to obtain the first Critic model.
  • the first training set is extracted from a preset database.
  • The first state features can be traditional price-volume indicator features and the second state features can be expert-factor features, or the first state features can be expert-factor features and the second state features can be traditional price-volume indicator features.
  • In other words, the third state features are obtained by splicing the expert factor library data with the traditional price-volume indicator data.
  • If the feature dimension of the traditional price-volume indicators is P and the feature dimension of the expert factors is Q, the dimension of the third state feature obtained after splicing is P+Q.
  • The third state features are used as the state input and fed to each machine-learning base model for training.
  • The machine learning method can be selected from classification algorithms such as logistic regression, decision tree, random forest, and adaptive boosting (AdaBoost), which is not limited here.
  • For the output, a 0-1 logic variable can be used as the short signal: 1 means the subject matter is predicted to have a significant downside risk in the future, and 0 means it is predicted to have no significant downside risk in the future.
  • The output type remains consistent with that of the first Actor model.
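A sketch of the base-model fusion described above is shown below using scikit-learn classifiers; since the text does not specify how the ranking is turned into weights, rank-proportional weights are an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Third state features: spliced price-volume (P dims) and expert-factor (Q dims) features.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

base_models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=4),
               RandomForestClassifier(n_estimators=50, random_state=0), AdaBoostClassifier()]
for m in base_models:
    m.fit(X, y)

# Rank the base models by their training score and derive fusion weights from the ranking
# (rank-proportional weights are an illustrative assumption).
scores = np.array([m.score(X, y) for m in base_models])
ranks = scores.argsort().argsort() + 1          # 1 = worst, len(base_models) = best
weights = ranks / ranks.sum()

def first_critic_predict(X_new):
    """Weighted fusion of the base models' short-signal probabilities (0/1 downside risk)."""
    probs = np.stack([m.predict_proba(X_new)[:, 1] for m in base_models])
    return (weights @ probs > 0.5).astype(int)

print(first_critic_predict(X[:5]))
```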
  • The first Actor model may then be optimized using the first Critic model to obtain the first risk prediction model, so as to improve the prediction accuracy of the model.
  • Specifically, the state can be used as the feature and the action as the label to perform supervised learning, further optimizing the first Actor model and obtaining the first risk prediction model, thereby improving the convergence of the model.
  • the second critic model may be used to optimize the first actor model to obtain the first risk prediction model, so as to improve the accuracy of the risk prediction of the model.
  • In some embodiments, the construction process of the second Critic model may include the following steps: constructing the value network of the second Critic model, where the network structure of the value network is the same as that of the first Actor model; copying the weight values outside the output layer of the first Actor model to the value network; training the value network based on the first training set to update its weight values; and using the value network obtained after training as the second Critic model.
  • the value network can be used to construct the second critic model.
  • the value network can cope with the time-varying nature of the capital market and improve the generalization ability of the model.
  • the main network structure of the second critic model is the same as that of the first actor model, but the final output of the second critic model is a one-dimensional continuous value, that is, the output is the value of the state or the value of the state-action.
  • Specifically, the weight values of the first Actor model can be copied to the second Critic model from shallow to deep, up to but not including the last output layer. Because the first Actor model has already been trained several times, its weights already have a strong ability to extract deep features of the state variables. Copying the weights outside the output layer to the second Critic model consumes fewer data samples than starting training from randomly initialized weights, so the sampling efficiency of model optimization is further improved.
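The weight-copying step can be sketched in PyTorch as below: every parameter except the final output layer is loaded from the Actor into a value network with the same backbone and a one-dimensional output head. Layer names and sizes are assumptions.

```python
import torch.nn as nn

def build_backbone():
    # Shared backbone: same structure as the first Actor model's layers before its output head.
    return nn.Sequential(nn.Linear(30, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh())

actor = nn.Sequential(build_backbone(), nn.Linear(64, 2))   # policy head: 2 action logits
critic = nn.Sequential(build_backbone(), nn.Linear(64, 1))  # value head: 1-dimensional state value

# Copy every weight except the final output layer (module index "1" here) from Actor to Critic,
# so the Critic starts from the Actor's learned state-feature extractor.
actor_state = {k: v for k, v in actor.state_dict().items() if not k.startswith("1.")}
result = critic.load_state_dict(actor_state, strict=False)
print(result.missing_keys)  # only the Critic's own output layer remains randomly initialized
```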
  • the first actor model and the first critic model can already output downside risk signals in different states, and execute selling actions accordingly.
  • In this way, the trajectory [(s_t, a_t), r_{t+1}, s_{t+1}] can be obtained from any time t.
  • the reward value is set as follows:
  • the interactive training graph includes a Critic network and two Actor networks (new Actor network and old Actor network), and the training process includes the following steps:
  • Step B1 At time t, input the state variable s t of the environment into the new Actor network.
  • The output dimension of the new Actor network is 2, yielding α and β, which parameterize a categorical distribution over the two output classes, 0 and 1. For example, if the probabilities of sampling 0 and 1 from the categorical distribution are 70% and 30% respectively, then α and β can be 7 and 3 respectively.
  • The purpose of constructing the categorical distribution is to sample the action a_t. The action a_t is input into the environment to obtain the reward r_t and the next state s_{t+1}, and the trajectory [(s_t, a_t), r_{t+1}, s_{t+1}] is stored. Then s_{t+1} is input to the new Actor network, and this step is repeated until a certain number of trajectories have been stored. It should be noted that the new Actor network is not updated during this process.
  • Step B2: Starting from any state in the stored trajectories, calculate the cumulative discounted reward R_t as the true value.
  • Step B3: Input all stored states s into the Critic network to obtain the estimated value v_t of the value function for all states.
  • the difference between R t and v t is calculated as the advantage function A t .
  • Step B4: Input all stored states s into the old Actor network and the new Actor network to obtain their respective outputs, <α_1, β_1> and <α_2, β_2>.
  • the old Actor network and the new Actor network have the same network structure, so the probability density functions (probability density functions, PDF) of the old Actor network and the new Actor network can be obtained, which are PDF1 and PDF2 respectively.
  • PDF probability density functions
  • Step B5: Compute the surrogate objective function and the clipped surrogate objective function.
  • For the clipped surrogate objective, let IW be the importance weight (the ratio of the new Actor's probability for the taken action to the old Actor's probability). When A > 0, if IW > 1+ε the objective uses (1+ε)·A, and if IW ≤ 1+ε it uses IW·A; when A < 0, if IW > 1−ε the objective uses IW·A, and if IW ≤ 1−ε it uses (1−ε)·A. ε is the clipping ratio, which can be 0.2. The objective function is computed on the stored trajectories and then back-propagated to update the new Actor network.
  • Step B6 After looping steps B4-B5 for a certain number of steps, the loop ends, and the old Actor network is updated with the weight of the new Actor network;
  • Step B7 Repeat steps B1-B6 until the model converges or reaches the specified number of steps.
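Steps B2-B5 follow the standard clipped surrogate objective (as in PPO); the sketch below computes the discounted returns, the advantage, the importance weight IW as the ratio of new to old action probabilities, and the clipped objective with ε = 0.2. All tensor values are illustrative.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Step B2: cumulative discounted reward R_t for every step of a stored trajectory."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return torch.tensor(list(reversed(returns)))

def clipped_surrogate(new_probs, old_probs, advantages, eps=0.2):
    """Step B5: IW = pi_new(a|s) / pi_old(a|s); objective = min(IW * A, clip(IW, 1-eps, 1+eps) * A)."""
    iw = new_probs / old_probs
    unclipped = iw * advantages
    clipped = torch.clamp(iw, 1 - eps, 1 + eps) * advantages
    return torch.min(unclipped, clipped).mean()   # maximized; negate it for gradient descent

# Example: three stored steps (values are illustrative only).
rewards = [0.0, 1.0, -1.0]
R = discounted_returns(rewards)
v = torch.tensor([0.2, 0.5, -0.4])               # Step B3: Critic estimates v_t
A = R - v                                        # advantage A_t
new_p = torch.tensor([0.30, 0.70, 0.55])         # probability of the taken action, new Actor
old_p = torch.tensor([0.25, 0.70, 0.60])         # same, old Actor
print(clipped_surrogate(new_p, old_p, A))
```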
  • In this way, the effective rewards accumulated by the second Critic model can yield a larger parameter gradient when optimizing the first Actor model, which makes the first risk prediction model converge faster.
  • In some embodiments, the optimization goal may also be set to take into account the performance of the first Critic model, the second Critic model, and real market data. At this point, the optimization problem becomes a multi-task optimization problem.
  • In this case, the optimization process for the first Actor model may also include the following steps: obtaining a risk function based on the preset database; optimizing the risk function based on the first Critic model and the second Critic model to obtain an optimized risk function; and optimizing the first Actor model based on the optimized risk function to obtain the first risk prediction model.
  • The risk function is constructed from real data obtained from the preset database; the real data refer to the financial time series data of the subject matter in the preset database, and these financial time series are time-varying.
  • For example, the real market data may be the opening price, closing price, highest price, lowest price, trading volume, and so on.
  • Based on the first Critic model and the second Critic model, the risk function is optimized to obtain the optimized risk function.
  • The first risk prediction model obtained by optimizing the first Actor model based on the optimized risk function thus takes into account the performance of the first Critic model, the second Critic model, and the real market data; therefore, the first risk prediction model has a certain ability to cope with the time-varying nature of the financial market.
  • In some embodiments, after step S204 is performed, the following steps may also be included: extracting a verification data set from the preset database; verifying the first risk prediction model based on the verification data set to obtain a second risk prediction model; training the second risk prediction model based on the first training set and the second training set to obtain a third risk prediction model; and inputting the target state features into the third risk prediction model to obtain the risk value of the target object on the forecast date.
  • The verification data set is obtained from the preset database. If the preset database is divided according to the timestamp to obtain the first training set, the second training set, and the verification data set, the verification data set consists of the most recent data other than the first training set and the second training set. For example, if the ten years of data from January 1, 2010 to December 31, 2020 are used as the first training set and the second training set, the verification data set can be the data from January 1, 2021 to the current time.
  • It can be understood that one or more of the first training set, the first Critic model, the second Critic model, and the risk function can be used to optimize the first Actor model to obtain the first risk prediction model.
  • Therefore, the first risk prediction models obtained by the different optimization methods can be verified on the verification data set, and the model with the highest prediction accuracy on the verification data set is selected as the second risk prediction model.
  • Then, the training configuration of the second risk prediction model is fixed, where the training configuration may be the model structure, the model parameters, or the training mode, which is not limited here.
  • The first training set and the second training set are combined to train the second risk prediction model, so as to obtain the third risk prediction model.
  • Finally, the target state features are used as the input of the third risk prediction model to predict the risk value of the target object. In this way, the third risk prediction model, obtained through multiple rounds of training, can improve the accuracy of risk prediction and facilitate risk decision-making.
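The validation-based selection and retraining flow described above can be sketched as follows; the candidate models, data, and metric are placeholders standing in for the first risk prediction models produced by the different optimization routes.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(400, 12)), rng.integers(0, 2, 400)  # combined first + second training sets
X_val, y_val = rng.normal(size=(100, 12)), rng.integers(0, 2, 100)      # verification set (most recent period)

# Stand-ins for first risk prediction models produced by different optimization routes.
candidates = {"route_a": LogisticRegression(max_iter=1000),
              "route_b": RandomForestClassifier(n_estimators=100, random_state=0)}
for model in candidates.values():
    model.fit(X_train, y_train)

# Pick the candidate with the highest accuracy on the verification set as the second model.
best_name = max(candidates, key=lambda k: accuracy_score(y_val, candidates[k].predict(X_val)))

# Fix that configuration and retrain on the combined training sets to obtain the third model.
third_model = clone(candidates[best_name]).fit(X_train, y_train)
print(best_name, third_model.predict(X_val[:5]))
```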
  • It can be seen that in the embodiment of the present application, the first Actor model is further optimized in different ways, such as with the first training set, the first Critic model, and the second Critic model.
  • The risk prediction model thus provides an exploration space for the first Actor model to keep improving its performance.
  • In addition, the first risk prediction model has a high degree of convergence, which enables it to be dynamically optimized according to the time-varying characteristics of the market, so that the adaptability, robustness, and anti-interference ability of the risk prediction model can be further improved. In this way, the accuracy of risk prediction can be improved, which is conducive to providing reliable decision-making solutions.
  • FIG. 6 is a schematic structural diagram of an apparatus for risk prediction based on reinforcement learning provided by an embodiment of the present application.
  • the device is applied to electronic equipment.
  • the device 600 for risk prediction based on reinforcement learning includes a receiving unit 601 and a processing unit 602.
  • the detailed description of each unit is as follows:
  • the receiving unit 601 is configured to receive a risk forecast request of the target object, where the risk forecast request includes a forecast date;
  • The processing unit 602 is configured to obtain the receipt date of the risk prediction request and the target historical data of the target object for the N days before the receipt date, where N is a positive integer greater than or equal to 1; perform feature extraction on the target historical data to obtain the target state feature corresponding to each of a plurality of preset feature dimensions; and input the target state features into the first risk prediction model to obtain the risk value of the target object on the forecast date, where the first risk prediction model is a model obtained by optimizing the first Actor model based on the first training set, the first Critic model, or the second Critic model, the first Actor model is obtained by training on the second training set, the first training set and the second training set are historical data extracted from a preset database, and the preset database includes the target historical data.
  • In some embodiments, the processing unit 602 is further configured to extract the second training set from the preset database; calculate on the second training set based on preset expert rules to obtain the corresponding first action set; and perform machine learning based on the first action set to obtain the first Actor model.
  • In some embodiments, the processing unit 602 is further configured to perform feature extraction on the first training set to obtain the first state features and the second state features; splice the first state features and the second state features to obtain the third state features; perform machine learning based on the third state features to obtain base models; input the third state features into the base models to obtain base-model training results; obtain the ranking of the base models according to the training results; determine the weighting of the base models according to the ranking; and perform model fusion on the base models according to the weighting to obtain the first Critic model.
  • In some embodiments, the processing unit 602 is further configured to construct the value network of the second Critic model, where the network structure of the value network is the same as that of the first Actor model; copy the weight values outside the output layer of the first Actor model to the value network; train the value network based on the first training set to update the weight values of the value network; and use the value network obtained after training as the second Critic model.
  • In some embodiments, the processing unit 602 is further configured to extract the first training set from the preset database; calculate on the first training set based on the preset expert rules to obtain the corresponding second action set; optimize the first Actor model based on the second action set to obtain the first risk prediction model; or optimize the first Actor model based on the first Critic model or the second Critic model to obtain the first risk prediction model.
  • In some embodiments, the processing unit 602 is further configured to obtain a risk function based on the preset database; optimize the risk function based on the first Critic model and the second Critic model to obtain an optimized risk function; and optimize the first Actor model based on the optimized risk function to obtain the first risk prediction model.
  • In some embodiments, the processing unit 602 is further configured to extract a verification data set from the preset database; verify the first risk prediction model based on the verification data set to obtain a second risk prediction model; train the second risk prediction model based on the first training set and the second training set to obtain a third risk prediction model; and input the target state features into the third risk prediction model to obtain the risk value of the target object on the forecast date.
  • For details of each unit, reference may also be made to the corresponding description of the method embodiment shown in FIG. 2.
  • FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device 700 includes a processor 701 , a memory 702 and a communication interface 703 , wherein the memory 702 stores a computer program 704 .
  • the processor 701 , the memory 702 , the communication interface 703 and the computer program 704 may be connected through a bus 705 .
  • the computer program 704 is used to perform the following steps:
  • the first risk prediction model is a model obtained by optimizing the first Actor model based on the first training set, the first Critic model, or the second Critic model; the first Actor model is obtained by training on the second training set; the first training set and the second training set are historical data extracted from a preset database, and the preset database includes the target historical data.
  • In some embodiments, in terms of inputting the target state features into the first risk prediction model to obtain the risk value of the target object on the forecast date, the computer program 704 is also used to execute instructions for the following steps:
  • Machine learning is performed based on the first action set to obtain the first Actor model.
  • the computer program 704 is further configured to execute the following steps:
  • the value network obtained after training is used as the second critic model.
  • the computer program 704 is further configured to execute instructions for the following steps:
  • In some embodiments, the computer program 704 is further configured to execute instructions for the following steps:
  • The memory 702 may also be called a storage medium or a storage device, which is not limited in this embodiment of the present application.
  • The processor 701 may be a central processing unit (CPU), and it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the memory 702 mentioned in the embodiment of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • The non-volatile memory can be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
  • Where the processor 701 is a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, the memory (storage module) may be integrated into the processor. The memory 702 described herein is intended to include, but is not limited to, these and any other suitable types of memory.
  • In addition to a data bus, the bus 705 may also include a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are all labeled as the bus in the figure.
  • The embodiment of the present application also provides a computer storage medium. The computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement some or all of the steps of any reinforcement-learning-based risk prediction method described in the above method embodiments.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The embodiment of the present application also provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any reinforcement-learning-based risk prediction method described in the above method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • Evolutionary Computation (AREA)
  • Marketing (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Biomedical Technology (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Technology Law (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

一种基于强化学习的风险预测的方法、装置、设备及存储介质,涉及人工智能领域。其中方法包括:接收目标标的物的风险预测请求,该风险预测请求包括预测日期(S201);获取该风险预测请求的接收日期和接收日期的前N天目标标的物的目标历史数据(S202);对目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征(S203);将目标状态特征输入至第一风险预测模型,得到目标标的物在预测日期的风险值(S204),其中,第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型。该方法可以提高风险预测的准确率,有利于进行风险决策。

Description

基于强化学习的风险预测的方法、装置、设备及存储介质
优先权申明
本申请要求于2021年12月15日提交中国专利局、申请号为202111535520.9,发明名称为“基于强化学习的风险预测的方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种基于强化学习的风险预测的方法、装置、设备及存储介质。
背景技术
近年来,伴随着互联网的飞速发展,人工智能算法领域有了很大的进步。强化学习也逐步发展起来,成为人工智能的一个重要分支。强化学习是一种通过智能体(agent)相对于环境(environment)的运动,从环境中获得回报(reward),从而最大化回报的机器学习算法。强化学习受到广大研究者的青睐,已成功应用在自动化、智能控制、自动驾驶等领域。
目前,在金融领域的应用是强化学习当下研究的热点,主要用于对金融市场进行分析研究与决策。基于强化学习的模型对金融市场中的标的物(例如,股票、基金、期货、债券、衍生品等)的风险值进行预测,可以极大减少人为主观计算、主观情绪化操作带来的一些损失,以及人工操作失误带来的影响。然而,发明人意识到标的物的数据具有高度时变性,导致模型表现不佳,实际预测效果差。
发明内容
本申请实施例提供了一种基于强化学习的风险预测的方法、装置、设备及存储介质,可以提高风险预测的准确率,有利于进行风险决策。
第一方面,本申请实施例提供了一种基于强化学习的风险预测的方法,其中:
接收目标标的物的风险预测请求,所述风险预测请求包括预测日期;
获取所述风险预测请求的接收日期和所述接收日期的前N天所述目标标的物的目标历史数据,所述N为大于或等于1的正整数;
对所述目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征;
将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值,其中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
第二方面,本申请实施例提供了一种基于强化学习的风险预测的装置,其中:
接收单元,用于接收目标标的物的风险预测请求,所述风险预测请求包括预测日期;
处理单元,用于获取所述风险预测请求的接收日期和所述接收日期的前N天所述目标标的物的目标历史数据,所述N为大于或等于1的正整数;
对所述目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征;
将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值,其中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
第三方面,本申请实施例提供了一种计算机设备,其中,包括处理器、存储器和通信接口,其中,所述存储器存储有计算机程序,所述计算机程序被配置由所述处理器执行,所述计算机程序包括用于执行以下步骤的指令:
接收目标标的物的风险预测请求,所述风险预测请求包括预测日期;
获取所述风险预测请求的接收日期和所述接收日期的前N天所述目标标的物的目标历史数据,所述N为大于或等于1的正整数;
对所述目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征;
将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值,其中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
第四方面,本申请实施例提供了一种计算机可读存储介质,其中,所述计算机可读存储介质存储计算机程序,所述计算机程序使得计算机执行以下步骤的指令:
接收目标标的物的风险预测请求,所述风险预测请求包括预测日期;
获取所述风险预测请求的接收日期和所述接收日期的前N天所述目标标的物的目标历史数据,所述N为大于或等于1的正整数;
对所述目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征;
将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值,其中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
实施本申请实施例,将具有如下有益效果:
采用上述的基于强化学习的风险预测的方法、装置、设备及存储介质,在接收目标标的物的风险预测请求之后,获取风险预测请求的接收日期和所述接收日期的前N天目标标的物的目标历史数据,对该目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征。然后将该目标状态特征输入至第一风险预测模型,从而得到目标标的物在预测日期的风险值。其中,第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,第一Actor模型是基于第二训练集进行训练得到的,第一训练集和第二训练集是从预设数据库中提取的历史数据,预设数据库包括所述目标历史数据。如此,经过多次训练和优化得到的模型进行风险预测,可以提升风险预测的准确率,有利于进行风险决策。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以基于这些附图获得其他的附图。其中:
图1为本申请实施例提供的一种Actor-Critic算法的工作原理示意图;
图2为本申请实施例提供的一种基于强化学习的风险预测的方法的流程示意图;
图3为本申请实施例提供的一种LSTM算法的结构示意图;
图4为本申请实施例提供的一种基于LSTM算法的第一Actor模型的结构示意图;
图5为本申请实施例提供的一种Actor-Critic交互训练流程图;
图6为本申请实施例提供的一种基于强化学习的风险预测的装置的结构示意图;
图7为本申请实施例提供的一种计算机设备的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
具体地,该基于强化学习的风险预测的方法可以应用在银行、证券、保险等金融机构配置的电子设备或服务器上。目前,在金融领域的应用是强化学习当下研究的热点,主要用于对金融市场进行分析研究与决策。
强化学习是一种通过智能体(agent)相对于环境(environment)的运动,从环境中获得回报(reward),从而最大化回报的机器学习算法。因此,强化学习算法存在几个基本的要素:智能体,环境,状态,动作,回报(或称为奖励)。为了便于理解,下文首先介绍这几个基本的概念。
(1)智能体,也称为“代理”、“代理者”、“智能主体”等。智能体可以根据外界环境的变化,而自动地对自己的行为和状态进行调整,而不是仅仅被动地接受外界的刺激,具有自我管理自我调节的能力。此外,智能体还可以积累或学习经验和知识,并修改自己的行为以适应新环境。
(2)环境,是系统中除智能体以外的部分,可以向智能体反馈状态和奖励,还可以按照一定的规律发生变化。对于金融领域而言,环境可以是金融市场。
(3)状态,是系统在每一个时间段所处的客观条件。对于金融市场中的某只标的物而言,在某个时间段可以有上涨、下跌和盘整三种状态。
(4)动作,也称决策。在时间和状态确定后,智能体会根据环境所处的状态做出不同的选择,从而使得当前状态可以确定地转移或者以一定概率转移到下一状态,这个过程称为动作。针对于某只标的物的操作可以有买进、卖出和持有三种不同的动作。
(5)回报,也称奖励,可以定义为采取某一动作之后带来的后续收益。回报可以是正的,也可以是负的。
强化学习的算法大致可以分为以值为基础(value-based)的算法和以策略为基础(policy-based)的算法。以值为基础的算法典型代表为Q-Learning算法,以策略为基础的算法典型代表是策略梯度(policy gradient,PG)。
在金融领域,Q-Learning算法主要是通过定义市场状态。然后根据贪心(ε-greedy)策略去选择交易动作,再与环境交互得到回报。该算法的主要思想就是将状态和动作构成一个Q值表来存储Q值,然后根据回报更新Q值表,从而达到优化交易动作的目的。但是,当状态或动作的维度太大时,Q-Learning算法很难收敛。
与Q-Learning算法类似,PG算法在金融领域的应用同样是定义市场状态,根据已有策略选择最有利的动作。通过环境反馈得到该动作的回报,然后反向更新策略算法。PG算法可以用于高维度的动作空间,但是PG算法容易陷入局部最优,而且基于回合更新相对低效。
Actor-Critic算法,顾名思义,包括两部分,分别是演员(Actor)和评委(Critic)。该算法结合了PG算法和Q-Learning算法的优点。Actor作为策略网络基于概率来选行为,而Critic基于Actor的行为评判行为的得分。然后Actor根据Critic的评分修改选择动作的概率。两者结合,可使策略网络根据值函数进行梯度更新,以优化模型参数,获得不同环境状态下的最优动作选择,相较于传统的PG回合更新快。Actor-Critic算法工作原理如图1所示。
Actor模型使用策略网络,对策略函数进行近似估计,策略函数的形式可以表示为:
π_θ(s,a) = P(a|s)
其含义是,π为智能体面对环境做出决策的方式,动作a是基于状态s和网络权重θ的条件概率。当给定t时刻状态s_t,计算可得概率最大的动作a_t,并使之与环境交互,得到实际奖励值r_{t+1}和下一时刻状态s_{t+1}。
Critic模型使用价值网络,对价值函数进行近似估计,价值函数的形式可以包括如下几种:
(1)状态价值函数
v(s,w) ≈ v_π(s)
或者:
(2)状态-动作价值函数
q(s,a,w) ≈ q_π(s,a)
策略网络参数则根据策略梯度进行更新:
θ′ = θ + α·∇_θ log π_θ(s_t, a_t)·Q(s_t, a_t, w)
其中,w是Critic模型的网络参数,θ和θ′分别为更新前后策略网络参数,α为更新步长,α根据实际情况进行选择。
Critic模型的评估方法可以基于以下几种函数:
(1)状态价值函数:
θ′ = θ + α·∇_θ log π_θ(s_t, a_t)·V(s_t, w)
(2)状态-动作价值函数:
θ′ = θ + α·∇_θ log π_θ(s_t, a_t)·Q(s_t, a_t, w)
(3)时间差分函数:
θ′ = θ + α·∇_θ log π_θ(s_t, a_t)·δ(t)
其中,时间差分项的表达式可利用状态价值函数,即δ(t) = r_{t+1} + γV(s_{t+1}, w) − V(s_t, w);或利用状态-动作价值函数,即δ(t) = r_{t+1} + γQ(s_{t+1}, a_{t+1}, w) − Q(s_t, a_t, w)。
(4)优势函数:
θ′ = θ + α·∇_θ log π_θ(s_t, a_t)·A(s_t, a_t, w, β)
其中,优势函数为状态-动作价值函数与状态价值函数的差值,即A(s_t, a_t, w, β) = Q(s_t, a_t, w, ρ) − V(s_t, w, β),β是优势函数的网络参数。
(5)TD(λ)差分:
θ′ = θ + α·∇_θ log π_θ(s_t, a_t)·δ(t)·E(t)
其中,E(t)为状态的效用迹,可以表示为:
E(t) = γλ·E(t−1) + ∇_θ log π_θ(s_t, a_t)
对于Critic本身的模型参数w,一般使用均方误差损失函数进行迭代更新。以基于状态价值函数的时间差分函数为例,Critic网络参数w的更新公式可以表示为:
δ(t) = r_{t+1} + γV(s_{t+1}) − V(s_t)
w = w + β·δ(t)·∇_w V(s_t, w)
如图1所示,Actor-Critic算法中,首先,以当前状态的特征向量s_t作为Actor策略网络的输入,输出动作a_t,并与环境交互得到新状态s_{t+1}、当前奖励值r_t。其次,使用当前状态s_t和新状态s_{t+1}作为Critic价值网络的输入,分别得到价值V(s_t)和V(s_{t+1}),以及当前奖励值r_t。接着,根据价值V(s_t)和V(s_{t+1}),以及当前奖励值r_t,计算时间差分δ,得到该时间差分δ = r_t + γV(s_{t+1}) − V(s_t)。然后,使用均方差损失函数∑(r_t + γV(s_{t+1}) − V(s_t))^2对Critic价值网络参数w进行更新。最后,通过损失函数
∑ −log π_θ(s_t, a_t)·δ
对Actor策略网络参数θ进行更新。输入新的表示当前状态的特征向量s_t,重复执行以上步骤,直至达到训练次数或目标函数收敛。其中,训练次数可以为T,状态特征维度可以为Y,动作空间可以为A,步长可以为α、β,衰减因子为γ,γ的取值可以介于0.0和1.0之间。
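为便于理解上述交互更新流程,下面给出基于时间差分的Actor-Critic单步更新的一个极简示意代码(PyTorch实现,仅为上述公式下的示意性草图;其中网络结构、环境接口env与各超参数均为举例假设,并非本申请的具体实现):

```python
import torch
import torch.nn as nn

# 示例性假设:状态特征维度Y=8,动作空间A=2,步长alpha/beta,衰减因子gamma
Y, A, alpha, beta, gamma = 8, 2, 1e-3, 1e-3, 0.9

actor = nn.Sequential(nn.Linear(Y, 32), nn.ReLU(), nn.Linear(32, A))    # Actor策略网络,输出动作logits
critic = nn.Sequential(nn.Linear(Y, 32), nn.ReLU(), nn.Linear(32, 1))   # Critic价值网络,输出V(s)
opt_actor = torch.optim.SGD(actor.parameters(), lr=alpha)
opt_critic = torch.optim.SGD(critic.parameters(), lr=beta)

def ac_step(s_t, env):
    """执行一次Actor-Critic交互与更新;s_t为当前状态特征向量(形状为[Y]的张量),env为假设的环境接口。"""
    dist = torch.distributions.Categorical(logits=actor(s_t))
    a_t = dist.sample()                                   # 由策略网络依概率选取动作a_t
    s_next, r_t, done = env.step(a_t.item())              # 与环境交互,得到新状态s_{t+1}与当前奖励r_t
    v_t, v_next = critic(s_t), critic(s_next)
    delta = r_t + gamma * v_next.detach() * (0.0 if done else 1.0) - v_t   # 时间差分delta

    critic_loss = delta.pow(2).mean()                     # 均方差损失,更新Critic参数w
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    actor_loss = (-dist.log_prob(a_t) * delta.detach()).mean()   # -log pi_theta(s_t,a_t)·delta,更新Actor参数theta
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    return s_next, done
```

循环调用ac_step并输入新的当前状态特征,即可重复上述步骤直至达到训练次数或目标函数收敛。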
以上,为Actor-Critic算法的介绍,上述方法在传统的强化学习任务中,通常通过计算累积奖励学习最优策略。这种方法虽然简单直接,但是在多步决策中需要巨大的数据量进行奖励累积,而巨大的搜索空间使得金融市场决策问题中的样本十分稀缺,从而造成了奖励的稀疏,因此无法有效优化决策模型参数。
本申请实施例主要针对强化学习中的Actor-Critic算法进行改进,以提高风险预测的准确率,有利于进行风险决策。具体的,请参照图2,图2是本申请实施例提供的一种基于强化学习的风险预测的方法的流程示意图。以该方法应用在电子设备为例进行举例说明,包括以下步骤S201-S204,其中:
步骤S201:接收目标标的物的风险预测请求,风险预测请求包括预测日期。
在本申请实施例中,目标标的物可以是某一只或者多只股票,还可以是债券、基金、期货等金融产品。风险预测请求可以是根据用户的操作生成的,也可以是预测周期到达时自动触发的,在此不做限定。该预测周期可以是每个工作日。例如:每个工作日15点闭市后,向电子设备发送风险预测请求。如此,在预测周期到达时,可以对目标标的物进行风险预测。
在本申请实施例中,预测日期可以是当前日期的后一天,也可以是当前日期后一周的某一天,在此不做限定。如果获取到风险预测请求中的预测日期为标的物对应的交易市场休市时间,如双休日或节假日,则预测日期往后顺延至工作日。示例地,当前日期为2021年10月17日,预测日期可以是2021年10月18日,也可以是2021年10月19日。如果获取到风险预测请求中的预测日期为2021年10月17日(周日),则将预测日期顺延至2021年10月18日(周一)。如果风险预测请求中没有指定预测日期,则默认预测日期为当前日期的后一天,遇双休日或节假日顺延。
步骤S202:获取风险预测请求的接收日期和接收日期的前N天目标标的物的目标历史数据。
在本申请实施例中,接收日期是指电子设备接收到风险预测请求的日期。N可以为大于或等于1的任一正整数,不对N的具体取值做出限定。示例地,N可以取10,也可以取30,还可以取60。N还可以基于预测日期和接收日期之间的时间间隔进行确定,例如,时间间隔越大,N越大。
目标历史数据可以从预设数据库中提取。预设数据库可以预先存储于电子设备中,或者,存储在服务器中,电子设备通过访问服务器获取预设数据库。预设数据库中可以包括从历史时间到当前时间之间的标的物的常用价量指标数据。其中,历史时间可以指已经过去的任一时间。示例地,历史时间可以是2010年1月1日,也可以2018年12月31日,还可以是2020年1月1日等,对此不做限定。且本申请对于预设数据库中数据的数据类型不做限定,请参照表1,预设数据库中的数据可以包括标的物的开盘价、收盘价、最高价、最低价、成交量、5日均线、10日均线、20日均线、60日均线等。
表1标的物的常用价量指标数据
编号 指标代码 指标名称
1 Pop 开盘价
2 Pcl 收盘价
3 Phi 最高价
4 Plo 最低价
5 Volume 成交量
6 MA5 5日均线
7 MA10 10日均线
8 MA20 20日均线
9 MA60 60日均线
预设数据库可以包括专家因子库的数据。专家因子是指与各标的物下行风险有一定定性和定量关系的因子。如表2所示,专家因子库可以包括以下八大维度:宏观大类指标、行业产业指标、特色衍生指标、资本技术指标、资金流向指标、衍生市场指标和舆情热度指标。每一预设特征维度对应一定量的目标状态特征,具体可以参照表2。示例地,资本技术指标对应的目标状态特征可以是N日均线、N日波动率、布林线、麦克线等;资金流向指标对应的目标状态特征可以是北向资金流入、南向资金流入、主力资金流入等。
进一步地,预设数据库还可以包括表2中所示的采购经理人指数(purchasing managers'index,PMI)、布林线、北向资金流入等;或者可以包括表2中未示出的大盘指数,例如,上证指数、深证指数、创业板指数、恒生指数和标普500 指数等,本申请实施例对此不作限定。
表2标的物的专家因子库的数据
(表2以图片形式给出,列出上述各预设特征维度及其对应的目标状态特征示例,此处不再逐项展开。)
预设数据库的数据来源可以是从金融相关的网页或应用程序中导出数据,也可以是统计局发布经济数据、企业财务报表、沪/深/境外市场数据、社交媒体统计数据等,在此不做限定。
步骤S203:对目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征。
在一些可能的实施方式中,在执行步骤S202之后,还可以包括以下步骤:对目标历史数据中的异常数据进行预处理,得到待处理数据。步骤S203可以包括:对待处理数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征。
在本申请实施例中,异常数据可以包括缺失值和噪声值。噪声值是指干扰数据,可以为对场景描述不准确的数据。例如,计算测量变量中的随机误差或方差,确定小于随机误差或方差的数值为噪声值等。对目标历史数据中的异常数据进行预处理可以包括:缺失值的填补和噪声值的处理。对于缺失值的填补的方式,本申请实施例不做出限定,可以采用平均值填充(mean/mode completer)、热卡填充(hot deck imputation)、K最近距离邻法(k-nearest neighbor,KNN)等方式对缺失值进行填补。本申请实施例不对处理噪声值的方法进行限定,可以采用分箱、聚类或回归中的一种或者多种方式来处理噪声值。对异常数据进行预处理,得到待处理数据之后,还可以对待处理数据进行归一化处理。可以理解,对异常数据进行预处理,有利于提高数据处理的效率和准确率。
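上述缺失值填补、噪声值处理与归一化步骤的一个示意性实现如下(仅为假设性草图:此处以平均值填充、3倍标准差截断和最小-最大归一化为例,实际也可按上文选用热卡填充、分箱、聚类或回归等其他方式):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(history: pd.DataFrame) -> pd.DataFrame:
    """对目标历史数据做缺失值填补、噪声值处理与归一化,返回待处理数据(示例)。"""
    df = history.copy()
    df = df.fillna(df.mean(numeric_only=True))                 # 缺失值:以各列平均值填充为例
    num_cols = df.select_dtypes("number").columns
    mean, std = df[num_cols].mean(), df[num_cols].std()
    # 噪声值:以3倍标准差为界截断作为示例(也可用分箱、聚类或回归等方式)
    df[num_cols] = df[num_cols].clip(mean - 3 * std, mean + 3 * std, axis=1)
    df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])  # 归一化到[0, 1]
    return df
```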
在本申请实施例中,预设特征维度可以是表2所示的专家因子库的数据,在此不再赘述。需要说明的是,表2所示的预设特征维度和对应的目标状态特征仅为示例,还可以包括其他与标的物相关的特征维度和对应的目标状态特征,本申请实施例对此不做出限定。
可以理解,专家因子是通过金融市场理论和实践经验总结的特征。因此,本申请实施例基于专家因子得到的目标状态特征,相比于表1所示的传统的价量和技术指标,对于下行风险的预测具有更直接的指导意义,可以为强化学习算法提供更高级的状态输入,从而能够控制模型的过拟合。
步骤S204:将目标状态特征输入至第一风险预测模型,得到目标标的物在预测日期的风险值。
在本申请实施例中,以风险值作为第一风险预测模型的输出。风险值是指目标标的物是否存在显著下跌的趋势,风险值的取值可以是1或0。例如,1代表预测标的物未来将有显著下跌风险,0代表预测标的物未来没有显著下跌的风险。
在本申请实施例中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
在本申请实施例中,可以根据时间戳对预设数据库进行划分,得到第一训练集和第二训练集。例如,可以从预设数据库中获取一段时间的数据,将这段时间中一定比例(例如前1/5)的数据划分为第二训练集,这段时间剩下的数据划分为第一训练集。示例地,从预设数据库中获取某只标的物由2010年1月1日到2020年12月31日十年的数据作为训练数据。那么,可以将2010年1月1日至2012年12月31日的数据划分为第二训练集,2013年1月1日至2020年12月31日的数据划分为第一训练集。
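按时间戳划分第一训练集与第二训练集的一种示意性做法如下(日期列名date与具体划分日期沿用上文示例,均为假设):

```python
import pandas as pd

def split_by_timestamp(db: pd.DataFrame, date_col: str = "date"):
    """按时间先后将预设数据库中的历史数据划分为第二训练集(较早)与第一训练集(较晚)。
    假设date_col为ISO格式日期或datetime类型。"""
    db = db.sort_values(date_col)
    second_train = db[db[date_col] <= "2012-12-31"]                                  # 例如2010-01-01至2012-12-31
    first_train = db[(db[date_col] > "2012-12-31") & (db[date_col] <= "2020-12-31")] # 其余时间段
    return first_train, second_train
```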
本申请对于第一Actor模型、第一Critic模型或第二Critic模型的获取方法不做限定,以下先对第一Actor模型的获取方法进行介绍。
在一些可能的实施方式中,在执行步骤S204之前,还可以包括以下步骤:从所述预设数据库提取所述第二训练集;基于预设专家规则对所述第二训练集进行计算,得到不同状态下对应的第一动作集合;基于所述第一动作集合进行机器学习,得到所述第一Actor模型。
其中,预设专家规则是指长期以来总结的某类金融指标与下行风险之间的高度关联的理论认知和金融风险控制理论。在本申请实施例中,预设专家规则包括但不限于布林带、阻力支撑相对强度(resistance support relative strength,RSRS)、指数平滑移动平均线(moving average convergence and divergence,MACD)、市场情绪等量化择时方法。这类方法在拟合下行风险信号时往往基于启发式规则或对若干关键指标的相对关系进行计算,与单纯基于数据驱动的算法相比所需样本较少。
本申请实施例以布林带信号为例对预设专家规则的使用进行介绍。布林带(bollinger band)以发明者约翰·布林格(John Bollinger)名字命名,用以刻画价格波动的区间。布林带的基本形态是由三条轨道线组成的带状通道(中轨和上、下轨各一条)。
中轨(MA),即t日前n日收盘价的移动平均线:
MA_t = (1/n)·∑_{i=1}^{n} p^{cl}_{t−i}
其中,p^{cl}_t为第t日的收盘价,n为样本数。n的取值根据实际情况而定,一般为20。
上轨(UT),即比中轨高两倍标准差的距离价格:
UT_t = MA_t + 2σ_t
下轨(LT),即比中轨低两倍标准差的距离价格:
LT_t = MA_t − 2σ_t
其中,σ_t为上述n日收盘价的标准差。
布林带卖出信号的计算公式:
a^{BB}_t = 1,当p^{cl}_t ≤ LT_t;a^{BB}_t = 0,当p^{cl}_t > LT_t
如布林带卖出信号a^{BB}_t的公式所示,本问题已转化为一个0-1分类任务。当p^{cl}_t ≤ LT_t,即当日收盘价不高于当日布林带下轨时,a^{BB}_t = 1,预测为风险信号,此时应对目标标的物进行清仓。其他情况,当a^{BB}_t = 0时,满仓买入该目标标的物。当连续两个时刻均出现同样的信号时,例如,连续出现信号为1或0时,前者由于上一时刻已经清仓,因此无法再次执行卖出动作,而后者则因为前一时刻已经全额建仓,无法再次执行买入动作。
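按上述公式计算布林带及卖出信号的示意代码如下(仅为草图;收盘价序列close、窗口n=20以及"t日前n日不含当日"的窗口处理均为示例性假设):

```python
import pandas as pd

def bollinger_sell_signal(close: pd.Series, n: int = 20) -> pd.Series:
    """按上文公式计算布林带中轨/上轨/下轨,并输出0-1卖出(风险)信号。"""
    prev = close.shift(1)                            # t日前n日:此处取不含当日的窗口(一种理解示例)
    ma = prev.rolling(n).mean()                      # 中轨MA_t
    sigma = prev.rolling(n).std()                    # n日收盘价标准差sigma_t
    upper, lower = ma + 2 * sigma, ma - 2 * sigma    # 上轨UT_t、下轨LT_t
    return (close <= lower).astype(int)              # 收盘价不高于下轨 -> 1(风险信号,清仓);否则0(满仓买入)
```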
可以看出,预设专家规则可以对第一Actor模型进行预训练,使其决策水平在与Critic模型进行交替训练前逼近预设专家规则的水平。在样本稀缺的环境中,可以快速采样得到高价值的奖励,避免了Actor-Critic算法初期大量低效甚至无效采样。
本申请实施例可以采用模仿学习的方式,先利用预设专家规则对金融市场状态进行拟合,得到专家动作标签,并用这些标签进行监督式预训练。具体实现过程如下:
首先从预设数据库中提取第二训练集作为状态输入,第二训练集的定义请参考前文的描述,在此不做赘述。然后利用预设专家规则对第二训练集进行计算,得到相应的第一动作集合,即专家动作,从而得到由状态到动作的映射,并形成一套交易策略。此时,可以获得一组专家决策数据τ = {τ_1, τ_2, …, τ_m},其中m为专家模型的数量。每个专家决策数据包含状态和动作序列
τ_i = {s^i_1, a^i_1, s^i_2, a^i_2, …, s^i_n, a^i_n}
其中n为采样数量。将所有状态-动作元组合并,构建新的集合D = {(s_1, a_1), (s_2, a_2), …}。此时,将状态作为特征,将动作作为标签,进行分类(对应离散动作)或回归(对应连续动作)的监督式学习,从而可以得到第一Actor模型。
在本申请实施例中,第一Actor模型对应的策略网络可以采用递归神经网络(recurrent neural networks,RNN),也可以采用长短期记忆网络(long-short time memory,LSTM),还可以采用门控循环单元(gated recurrent unit,GRU),本申请实施例对此不做出限定。本申请实施例以采用LSTM算法为例构建第一Actor模型。
LSTM模型是一种特殊的RNN模型,通过引入门(gate)机制,解决RNN模型不具备长记忆性的问题。具体来说,LSTM模型的1个神经元包含了1个细胞状态(cell)和3个门(gate)机制。细胞状态(cell)是LSTM模型的关键所在,类似于存储器,是模型的记忆空间。细胞状态随着时间而变化,记录的信息由门机制决定和更新。门机制是让信息选择式通过的方法,通过sigmoid函数和点乘操作实现。sigmoid取值介于0-1之间,点乘则决定了传送的信息量(每个部分有多少量可以通过)。当sigmoid取0时表示舍弃信息,取1时表示完全传输(即完全记住)。LSTM通过遗忘门(forget gate)、更新门(update gate)和输出门(output gate)三个门在任意时刻维护信息。
请参照图3,图3是本申请实施例提供的一种LSTM算法的结构示意图,如图3所示,以LSTM算法构建的第一Actor模型的输入可以包括金融市场的当前时刻t的当前状态s_t、记录有历史数据间时间关联的LSTM网络隐藏状态h_{t−1}以及上一时刻t−1的决策a_{t−1},其输出为当前时刻的决策a_t。
可以看出,LSTM算法中有向循环的体系结构创建的网络的内部状态,能够处理基于时间的序列数据并记住时序联系,因此可以解决长期依赖问题。大型LSTM算法具有高度表示特征的能力,即可以学习到丰富的时间特征表示,基于LSTM算法的智能体可以挖掘金融市场数据中的时序模式,并且记忆历史状态和动作。
如图4所示,基于LSTM算法的第一Actor模型由输入层、LSTM层、输出层组成,其中输入和输出层为与特征和动作输出维度相同的全连接层。利用LSTM网络,可对市场状态序列进行刻画形成输入特征,有利于解决机器学习方法由于使用截面数据而无法捕捉市场状态变化规律的缺陷。
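基于上述"输入全连接层+LSTM层+输出全连接层"的结构,第一Actor模型的一个示意性定义如下(PyTorch实现;状态维度、隐藏单元数等均为假设,并非本申请限定的结构):

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """输入层 + LSTM层 + 输出层的策略网络草图:输入状态特征序列,输出各时刻的动作logits。"""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.fc_in = nn.Linear(state_dim, hidden)               # 输入全连接层,维度与特征一致
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)   # LSTM层,记忆时序联系
        self.fc_out = nn.Linear(hidden, action_dim)             # 输出全连接层,维度与动作空间一致

    def forward(self, states, hc=None):
        # states: [batch, seq_len, state_dim];hc为上一时刻的隐藏状态(h, c),可为None
        x = torch.relu(self.fc_in(states))
        x, hc = self.lstm(x, hc)
        return self.fc_out(x), hc                               # 返回动作logits与新的隐藏状态

# 用法示例(维度为假设):actor = LSTMActor(state_dim=40, action_dim=2)
# logits, hc = actor(torch.randn(1, 30, 40))
```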
在本申请实施例中,第一Actor模型可以通过以下步骤预训练得到:
步骤A1:准备集合D = {(s_1, a_1), (s_2, a_2), …},并进行随机重排序;
步骤A2:随机初始化第一Actor模型的权重值θ;
步骤A3:从D中选取样本(s_n, a^*_n),其中a^*_n为专家动作标签,使用当前网络,以状态s_n为输入计算输出动作a_n;
步骤A4:计算损失函数值L,及其对第一Actor模型各权重值的导数∂L/∂θ,即第一Actor模型的网络参数的梯度;
步骤A5:以步长α,沿第一Actor模型的网络参数的梯度方向更新该网络参数;
步骤A6:重复步骤A3-A5,直至达到训练时长或损失函数值L收敛。
在本申请实施例中,第一Actor模型损失函数值L可以包括但不限于二元交叉熵等。下面以二元交叉熵为例,衡量预设专家规则计算的动作信号a^{BB}_t与预测值a_t之间的差异。在一个二分类问题中,每一个样本的交叉熵可以被表示为:
L_t = −[a^{BB}_t·log ŷ_t + (1 − a^{BB}_t)·log(1 − ŷ_t)]
则整个集合的交叉熵为:
L = (1/N)·∑_{t=1}^{N} L_t
其中,N是样本数,ŷ_t = 1/(1 + e^{−x_t})为样本被预测为1的概率,x_t为输出全连接层的输出。动作输出为:
a_t = 1,当ŷ_t ≥ 0.5;a_t = 0,当ŷ_t < 0.5
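结合上述步骤A1-A6与交叉熵损失,监督式预训练的一个示意性实现如下(沿用前文LSTMActor草图;批大小、优化器、训练轮数均为假设,两类情形下的CrossEntropyLoss与上文的二元交叉熵等价):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def pretrain_actor(actor, states, expert_actions, epochs=10, lr=1e-3):
    """步骤A1-A6的草图:用专家动作标签对第一Actor模型做监督式预训练。
    states: [N, seq_len, state_dim];expert_actions: [N, seq_len],取值0/1(专家信号)。"""
    data = DataLoader(TensorDataset(states, expert_actions), batch_size=64, shuffle=True)  # A1:随机重排序
    opt = torch.optim.Adam(actor.parameters(), lr=lr)       # A2:权重初始化已在模型构造时完成
    ce = nn.CrossEntropyLoss()                               # 两类时等价于上文的二元交叉熵
    for _ in range(epochs):                                  # A6:重复直至收敛或达到训练时长
        for s, a_star in data:                               # A3:选取样本(s_n, a*_n)
            logits, _ = actor(s)                             # 以状态为输入计算输出动作分布
            loss = ce(logits.reshape(-1, logits.size(-1)), a_star.reshape(-1).long())  # A4:损失L
            opt.zero_grad(); loss.backward(); opt.step()     # A4-A5:求梯度并沿梯度方向更新权重
    return actor
```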
可以看出,利用预设专家规则,通过以金融市场行情数据为状态输入,以专家决策动作为标签展开监督式学习,可以将第一Actor模型对应的策略网络由初始化的“一无所知”提升至近似专家水平。此时,在金融市场的风险预警中,已经可以使用第一Actor模型在环境中进行相应的预测。
然而,以布林带信号为代表的预设专家规则只使用历史收盘价计算上、中、下轨指标,并与当天收盘价相比较计算危险信号得到专家动作。而策略网络的输入维度则在专家因子库中进行选择,其维度大于布林带信号输入的维度。因此,第一Actor模型具有局限性。
因此,在一些可能的实施方式中,可以利用第一训练集对第一Actor模型进一步优化得到第一风险预测模型,从而取得更优决策。具体地,在所述基于所述第一动作集合进行机器学习,得到所述第一Actor模型之后,还包括以下步骤:从所述预设数据库提取所述第一训练集;基于预设专家规则对所述第一训练集进行计算,得到不同状态下对应的第二动作集合;基于所述第二动作集合对所述第一Actor模型进行优化,得到所述第一风险预测模型。
在本申请实施例中,将利用预设专家规则预训练得到的第一Actor模型在第一训练集中,基于预设专家规则进行下行风险预测,得到第二动作集合a_t。在所设计的量化下行风险指标下,存在一个真实的标签a^*_t。
在本申请实施例中,与第一Actor模型的预训练过程类似,同样以监督学习的方法,对二元交叉熵函数进行计算,并将梯度反传至第一Actor模型,保持对其持续优化,从而得到第一风险预测模型。
可以看出,基于第一训练集对第一Actor模型进行优化的方法,没有直接使用Critic模型,而是以真实数据为依据。可以使第一Actor模型不会受限于Critic模型的水平,提高模型预测准确性。
在一些可能的实施方式中,可以利用第一Critic模型对第一Actor模型进行优化,得到第一风险预测模型,以提高模型的风险预测的准确性。
下面介绍第一Critic模型的构建过程。在一些可能的实施方式中,第一Critic模型的构建过程可以包括以下步骤:对所述第一训练集进行特征提取,得到第一状态特征和第二状态特征;对所述第一状态特征和所述第二状态特征进行拼接,得到第三状态特征;基于所述第三状态特征进行机器学习,得到基模型;将所述第三状态特征输入至所述基模型,得到基模型训练结果;根据所述基模型训练结果获取所述基模型的排序结果;根据所述排序结果确定所述基模型的加权权重;根据所述加权权重对所述基模型进行模型融合,得到所述第一Critic模型。
在本申请实施例中,第一训练集是从预设数据库中提取的,第一训练集的定义请参考前文的描述,在此不做赘述。第一状态特征可以是传统价量指标特征,第二状态特征可以是专家因子的特征;或者第一状态特征可以是专家因子的特征,第二状态特征可以是传统价量指标特征,本申请实施例对此不做出限定。第三状态特征是专家因子库数据与传统的价量指标数据进行拼接得到的。示例地,传统价量指标特征维度为P维,专家因子的特征维度为Q维,则拼接后得到的第三状态特征维度为P+Q维。
具体地,以第三状态特征作为状态输入,分别送入各机器学习基模型进行训练。在本申请实施例中,机器学习方法可在逻辑回归、决策树、随机森林、自适应增强(adaptive boosting,AdaBoost)等分类机器学习算法中进行选择,对此不做出限定。在本申请实施例中,可以采用0-1逻辑变量作为空头信号,1代表预测标的物未来将有显著下跌风险,0代表预测标的物未来没有显著下跌的风险。此时,输出类型与第一Actor模型保持一致。
将得到的基模型进行模型融合,得到第一Critic模型。融合方法可以使用加权平均法,将筛选后的基模型结果在基模型集合中根据预设评测指标进行排序。示例地,将各基模型按其在验证集上的预测准确率由高到低分为5档,准确率越高,档位越高。使用各模型的排序分位数档位作为加权权重,通过加权平均得到综合模型结果,最终使用某固定阈值进行激活转化为逻辑变量,生成可以指导择时交易的空头信号。至此,又可以得到一组专家状态-动作数据D = {(s_1, a′_1), (s_2, a′_2), …}。
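第一Critic模型"训练基模型-按验证准确率排序加权-加权平均融合-阈值激活"的过程可示意如下(基模型种类、以名次直接作为档位的简化做法以及阈值0.5均为示例性假设):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

def build_first_critic(X_train, y_train, X_val, y_val, threshold=0.5):
    """训练若干分类基模型,按验证集准确率的名次作为加权权重,加权平均后阈值化为0-1空头信号。"""
    bases = [LogisticRegression(max_iter=1000),
             RandomForestClassifier(n_estimators=100),
             AdaBoostClassifier()]
    accs = []
    for m in bases:
        m.fit(X_train, y_train)                  # 以第三状态特征为输入训练基模型
        accs.append(m.score(X_val, y_val))       # 预设评测指标:验证集预测准确率
    ranks = np.argsort(np.argsort(accs)) + 1     # 准确率越高,名次(档位)越高
    weights = ranks / ranks.sum()                # 归一化名次作为加权权重

    def critic_predict(X):
        probs = np.stack([m.predict_proba(X)[:, 1] for m in bases], axis=1)
        fused = probs @ weights                  # 加权平均得到综合模型结果
        return (fused >= threshold).astype(int)  # 固定阈值激活,转化为逻辑变量(空头信号)
    return critic_predict
```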
考虑到前期预训练得到的第一Actor模型只使用简单的专家规则进行训练,没有挖掘状态变量与风险信号之间的深层关系。在一些可能的实施方式中,可以利用第一Critic模型对第一Actor模型进行优化,从而得到第一风险预测模型,以提高模型的预测准确性。具体地,可以采用状态作为特征,将动作作为标签,进行监督式学习,对第一Actor模型进一步优化,得到第一风险预测模型,以提高模型的收敛性。
在一些可能的实施方式中,可以利用第二Critic模型对第一Actor模型进行优化,得到第一风险预测模型,以提高模型的风险预测的准确性。
下面介绍第二Critic模型的构建过程。在一些可能的实施方式中,第二Critic模型的构建过程可以包括以下步骤:构建所述第二Critic模型的价值网络,其中,所述价值网络的网络结构与所述第一Actor模型的网络结构相同;将所述第一Actor模型的输出层之外的权重值复制给所述价值网络;基于所述第一训练集,对所述价值网络进行训练,以更新所述价值网络的权重值;将训练完成得到的所述价值网络作为所述第二Critic模型。
在本申请实施例中,可以采用价值网络构建第二Critic模型。价值网络可以应对资本市场的时变性,提高模型的泛化能力。第二Critic模型的主体网络结构与第一Actor模型相同,但第二Critic模型的最终输出为一维连续值,即输出为对状态价值或状态-动作价值的估计。
具体地,可以将第一Actor模型的权重值由浅至深复制给第二Critic模型,直至最后一层输出层之前。由于在之前的训练中,第一Actor模型已经被训练过若干次,其权重值对状态变量深层特征已经有了较强的提取能力,将输出层之外的权重复制到第二Critic模型,相比从随机初始化的权重开始训练,无需额外消耗数据样本。因此,在模型优化的采样效率上进一步提高。
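将第一Actor模型输出层之外的权重复制给第二Critic模型价值网络的一种示意性写法如下(以前文LSTMActor草图为参照,输出层名称fc_out为假设):

```python
import torch
import torch.nn as nn

class LSTMCritic(nn.Module):
    """主体结构与LSTMActor相同、最终输出一维连续值(状态价值)的价值网络草图。"""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.fc_in = nn.Linear(state_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc_out = nn.Linear(hidden, 1)               # 输出层改为一维价值输出

    def forward(self, states, hc=None):
        x = torch.relu(self.fc_in(states))
        x, hc = self.lstm(x, hc)
        return self.fc_out(x).squeeze(-1), hc

def init_critic_from_actor(actor, critic):
    """把Actor中除输出层(fc_out)之外的权重由浅至深复制给Critic,免去从随机初始化开始训练。"""
    shared = {k: v for k, v in actor.state_dict().items() if not k.startswith("fc_out")}
    critic.load_state_dict(shared, strict=False)         # strict=False:跳过形状不同的输出层参数
    return critic
```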
在构建第二Critic模型之前,第一Actor模型和第一Critic模型已经可以在不同状态下输出下行风险信号,并据此执行卖出动作。用第一Actor模型和第一Critic模型分别在交易环境中利用第一训练集,根据状态输入s_t执行相应动作a_t,得到奖励r_{t+1}和下一步的状态s_{t+1}。那么,从任意时刻t开始可以得到轨迹[(s_t, a_t), r_{t+1}, s_{t+1}]。这里奖励值做如下设置:
假设预测目标标的物第二天价格将出现下跌,当a_t = 1时,以当日收盘价p_t将持有的份额为H_t的证券全部委托卖出,可得t时刻资产价值Q_t,其中Q_t = p_t × H_t;当a_t = 0时,则以当日收盘价委托买入。同时,市场后续的真实走势会得到次日收盘价p_{t+1},可得t+1时刻资产价值Q_{t+1},Q_{t+1} = p_{t+1} × H_t。则奖励值r设置为t时刻至t+1时刻资产价值的变化值与t时刻资产价值的比例,即:
r_{t+1} = (Q_{t+1} − Q_t) / Q_t
因此,预测走势与实际走势是否符合决定了奖励值的符号,其奖励符号对应关系如表3所示。
表3预测走势与实际走势和奖励符号之间的对应关系
序号 输出 输出含义 p_{t+1}−p_t条件 实际走势 奖励符号
1 1 下跌风险 >0 价格上涨 -
2 1 下跌风险 <0 价格下跌 +
3 0 无下跌风险 >0 价格上涨 +
4 0 无下跌风险 <0 价格下跌 -
从存储的轨迹中任意状态开始,计算累计折扣奖励作为真实价值:
R_t = r_{t+1} + γ·r_{t+2} + γ^2·r_{t+3} + … + γ^{T−t}·r_T
与此同时,通过轨迹[(s_t, a_t), r_{t+1}, s_{t+1}],将不同时刻的状态s_t作为输入,用第二Critic模型计算其价值v_t,计算R_t与v_t之间的差作为优势函数A_t,并求第二Critic模型的损失函数
L_c = (1/M)·∑_t (R_t − v_t)^2
其中M为采样数量。然后反向传播更新第二Critic模型。
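上述奖励设置、累计折扣奖励与第二Critic模型损失的计算可示意如下(草图;其中奖励的符号按表3"预测与实际走势是否一致"的约定给出,属于对上文设置的一种理解示例):

```python
import torch

def reward(p_t, p_next, action):
    """奖励大小取资产价值变化比例(Q_{t+1}-Q_t)/Q_t,约去持有份额H_t后即价格变化比例;
    符号按表3:预测与实际走势一致为正、不一致为负(对上文的一种理解示例)。"""
    change = (p_next - p_t) / p_t
    return -change if action == 1 else change

def discounted_returns(rewards, gamma=0.9):
    """R_t = r_{t+1} + γ·r_{t+2} + ... + γ^{T-t}·r_T:从轨迹任一时刻起的累计折扣奖励。"""
    R, out = 0.0, []
    for r in reversed(rewards):            # 自后向前累积
        R = r + gamma * R
        out.append(R)
    return torch.tensor(list(reversed(out)))

def critic_update(returns, values):
    """优势函数A_t = R_t - v_t;第二Critic模型损失取(1/M)·∑(R_t - v_t)^2。"""
    advantage = returns - values
    return advantage.detach(), (advantage ** 2).mean()
```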
在本申请实施例中,第二Critic模型和第一Actor模型进行交互训练过程可以参考图5。如图5所示,该交互训练图包括一个Critic网络和两个Actor网络(新Actor网络和旧Actor网络),训练过程包括以下步骤:
步骤B1:在t时刻,将环境的状态变量s_t输入到新Actor网络。
新Actor网络输出维度是2,分别得到μ和σ,作为类别分布函数(categorical distribution)中分别对应0、1两类输出的概率。示例地,如果类别分布函数中抽样得到0和1的概率分别为70%和30%,则μ和σ可以分别等于7和3。构建类别分布的目的是对动作a_t进行抽样。将动作a_t输入到环境中得到奖励r_{t+1}和下一步的状态s_{t+1},然后存储轨迹[(s_t, a_t), r_{t+1}, s_{t+1}]。再将s_{t+1}输入到新Actor网络,循环这一步骤,直至存储一定数量的轨迹。需要说明的是,在此过程中新Actor网络并没有更新。
步骤B2:从存储的轨迹中任意状态开始,计算累计折扣奖励作为真实价值:
R_t = r_{t+1} + γ·r_{t+2} + γ^2·r_{t+3} + … + γ^{T−t}·r_T
步骤B3:将存储的所有s组合输入到Critic网络中,得到所有状态的价值函数估计值v_t。计算R_t与v_t之间的差作为优势函数A_t,并求Critic网络的损失函数
L_c = (1/M)·∑_t (R_t − v_t)^2
其中M为采样数量。然后反向传播更新Critic网络。
步骤B4:将存储的所有s组合输入旧Actor网络和新Actor网络,分别得到各自输出<μ_1, σ_1>和<μ_2, σ_2>。其中,旧Actor网络和新Actor网络两者网络结构一样,因此,可得旧Actor网络和新Actor网络各自的概率密度函数(probability density function,PDF),分别为PDF1和PDF2。从而得到存储的各动作在PDF1和PDF2上对应的概率,即每个动作对应的prob1和prob2,然后用prob2与prob1的比值得到重要性权重(importance weight,IW),即IW = prob2/prob1。
步骤B5:计算替代目标函数
L^{CPI}_t = IW_t·A_t
和裁剪替代目标函数
L^{CLIP}_t = min(IW_t·A_t, clip(IW_t, 1−ξ, 1+ξ)·A_t)
其中,对于裁剪替代目标函数:当A_t > 0时,若IW_t > 1+ξ,则L^{CLIP}_t = (1+ξ)·A_t;若IW_t < 1+ξ,则L^{CLIP}_t = IW_t·A_t。当A_t < 0时,若IW_t > 1−ξ,则L^{CLIP}_t = IW_t·A_t;若IW_t < 1−ξ,则L^{CLIP}_t = (1−ξ)·A_t。ξ为裁剪比例,可以取0.2。在存储的轨迹上计算目标函数
L(θ) = (1/M)·∑_t L^{CLIP}_t
然后反向传播,更新新Actor网络。
步骤B6:循环B4-B5步骤一定步数后,循环结束,用新Actor网络权重来更新旧Actor网络;
步骤B7:循环B1-B6步骤,直至模型收敛或达到指定步数。
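上述步骤B4-B5中重要性权重与裁剪替代目标函数的计算可示意如下(草图;ξ取0.2,新旧Actor网络的输出以两类logits表示,均为示例性假设):

```python
import torch

def ppo_actor_loss(new_logits, old_logits, actions, advantages, xi=0.2):
    """按上文公式计算裁剪替代目标函数在存储轨迹上的均值,并取负号作为新Actor网络的损失。"""
    new_dist = torch.distributions.Categorical(logits=new_logits)
    old_dist = torch.distributions.Categorical(logits=old_logits)
    # 重要性权重IW = prob2 / prob1,用对数概率之差取指数计算,旧网络不参与梯度
    iw = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions).detach())
    surrogate = iw * advantages                              # 替代目标函数 IW·A_t
    clipped = torch.clamp(iw, 1 - xi, 1 + xi) * advantages   # 裁剪替代目标函数
    objective = torch.min(surrogate, clipped).mean()         # (1/M)·∑ min(·,·)
    return -objective                                        # 最大化目标等价于最小化其负值
```

对该损失反向传播即可更新新Actor网络;循环若干步后,再用新Actor网络的权重更新旧Actor网络。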
可以看出,第二Critic模型累积的有效奖励可以在优化第一Actor模型时得到更大的参数梯度,可以使第一风险预测模型收敛更快。
在一些可能的实施方式中,为了使第一Actor模型对金融市场的时变性具备应对能力,还可以将优化目标定为兼顾第一Critic模型、第二Critic模型和实盘表现。此时,该优化问题变为一个多任务优化问题。对第一Actor模型的优化过程还可以包括以下步骤:
基于所述预设数据库获取风险函数;基于所述第一Critic模型和所述第二Critic模型,对所述风险函数进行优化,得到优化风险函数;基于所述优化风险函数对所述第一Actor模型进行优化,得到所述第一风险预测模型。
具体地,从预设数据库中获取实盘数据构建风险函数,实盘数据是指预设数据库中标的物的金融时序数据,这些金融时序数据具有时变性。示例地,实盘数据可以是开盘价、收盘价、最高价、最低价和成交量等。基于所述第一Critic模型和所述第二Critic模型,对所述风险函数进行优化,得到优化风险函数。此时,基于优化风险函数对第一Actor模型优化得到的第一风险预测模型,同时兼顾了第一Critic模型、第二Critic模型和实盘数据的表现。从而使得第一风险预测模型对金融市场的时变性具备一定的应对能力。
在一些可能的实施方式中,在执行步骤S204之后,还可以包括以下步骤:从所述预设数据库中提取验证数据集;基于所述验证数据集对第一风险预测模型进行验证,得到第二风险预测模型;基于所述第一训练集和所述第二训练集对所述第二风险预测模型进行训练,得到第三风险预测模型;将所述目标状态特征输入至所述第三风险预测模型,得到所述目标标的物在所述预测日期的风险值。
在本申请实施例中,验证数据集是从预设数据库中获取的。如果根据时间戳对预设数据库进行划分,得到第一训练集、第二训练集和验证数据集,那么验证数据集是第一训练集和第二训练集以外,时间较近的数据。示例地,若采用2010年1月1日到2020年12月31日十年的数据作为第一训练集和第二训练集,那么验证数据集可以是2021年1月1日到当前时间的数据。
在前文的描述中,可以利用第一训练集、第一Critic模型、第二Critic模型以及风险函数中的一个或者多个对第一Actor模型进行优化,得到第一风险预测模型。在本申请实施例中,可以将不同优化方式得到的第一风险预测模型在验证数据集中进行验证,选择在验证数据集上预测准确率最高的模型作为第二风险预测模型。固定第二风险预测模型的训练设置方式,其中训练设置方式可以是模型结构、模型参数或训练方式的设置,对此不做出限定。将第一训练集和第二训练集合并,对第二风险预测模型进行训练,得到第三风险预测模型。将目标状态作为第三风险预测模型的输入,预测目标标的物的风险值。如此,经过多次训练得到的第三风险预测模型,可以提升风险预测的准确率,有利于进行风险决策。
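在验证数据集上选择第二风险预测模型并在合并训练集上重训得到第三风险预测模型的流程可示意如下(evaluate与retrain为假设的辅助函数,仅用于说明选择逻辑):

```python
def select_and_retrain(candidates, val_states, val_labels, train_all, evaluate, retrain):
    """candidates为按不同优化方式得到的若干第一风险预测模型;
    在验证数据集上按预测准确率选出第二风险预测模型,再在合并后的训练集上重训得到第三风险预测模型。"""
    accs = [evaluate(m, val_states, val_labels) for m in candidates]   # 验证集预测准确率
    second_model = candidates[accs.index(max(accs))]                   # 准确率最高者作为第二风险预测模型
    third_model = retrain(second_model, train_all)                     # 固定训练设置,在合并训练集上训练
    return third_model
```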
可以看出,在本申请实施例中,根据强化学习的在线学习特性,通过第一训练集、第一Critic模型和第二Critic模型等不同形式,对第一Actor模型进行进一步优化得到的第一风险预测模型,为第一Actor模型继续提升表现提供探索空间。此时,第一风险预测模型具有高度收敛性,使其可以根据市场时变特征动态优化,使得风险预测模型的适应性、鲁棒性和抗干扰能力得以进一步提升。如此,可以提高风险预测的准确率,有利于提供可靠的决策方案。
上述详细阐述了本申请实施例的方法,下面提供了本申请实施例的装置。
请参照图6,图6是本申请实施例提供的一种基于强化学习的风险预测的装置的结构示意图。该装置应用于电子设备。如图6所示,该基于强化学习的风险预测的装置600包括接收单元601和处理单元602,各个单元的详细描述如下:
接收单元601用于接收目标标的物的风险预测请求,所述风险预测请求包括预测日期;
处理单元602用于获取所述风险预测请求的接收日期和所述接收日期的前N天所述目标标的物的目标历史数据,所述N为大于或等于1的正整数;对所述目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征;将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值,其中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
在一些可能的实施方式中,所述处理单元602还用于从所述预设数据库提取所述第二训练集;基于预设专家规则对所述第二训练集进行计算,得到不同状态下对应的第一动作集合;基于所述第一动作集合进行机器学习,得到所述第一Actor模型。
在一些可能的实施方式中,所述处理单元602还用于对所述第一训练集进行特征提取,得到第一状态特征和第二状态特征;对所述第一状态特征和所述第二状态特征进行拼接,得到第三状态特征;基于所述第三状态特征进行机器学习,得到基模型;将所述第三状态特征输入至所述基模型,得到基模型训练结果;根据所述基模型训练结果获取所述基模型的排序结果;根据所述排序结果确定所述基模型的加权权重;根据所述加权权重对所述基模型进行模型融合,得到所述第一Critic模型。
在一些可能的实施方式中,所述处理单元602还用于构建所述第二Critic模型的价值网络,其中,所述价值网络的网络结构与所述第一Actor模型的网络结构相同;将所述第一Actor模型的输出层之外的权重值复制给所述价值网络;基于所述第一训练集,对所述价值网络进行训练,以更新所述价值网络的权重值;将训练完成得到的所述价值网络作为所述第二Critic模型。
在一些可能的实施方式中,所述处理单元602还用于从所述预设数据库提取所述第一训练集;基于预设专家规则对所述第一训练集进行计算,得到不同状态下对应的第二动作集合;基于所述第二动作集合对所述第一Actor模型进行优化,得到所述第一风险预测模型;或者基于所述第一Critic模型或所述第二Critic模型对所述第一Actor模型进行优化,得到所述第一风险预测模型。
在一些可能的实施方式中,所述处理单元602还用于基于所述预设数据库获取风险函数;基于所述第一Critic模型和所述第二Critic模型,对所述风险函数进行优化,得到优化风险函数;基于所述优化风险函数对所述第一Actor模型进行优化,得到所述第一风险预测模型。
在一些可能的实施方式中,所述处理单元602还用于从所述预设数据库中提取验证数据集;基于所述验证数据集对第一风险预测模型进行验证,得到第二风险预测模型;基于所述第一训练集和所述第二训练集对所述第二风险预测模型进行训练,得到第三风险预测模型;将所述目标状态特征输入至所述第三风险预测模型,得到所述目标标的物在所述预测日期的风险值。
需要说明的是,各个单元的实现还可以对应参照图2所示的方法实施例的相应描述。
请参照图7,图7是本申请实施例提供的一种计算机设备的结构示意图。如图7所示,该计算机设备700包括处理器701、存储器702和通信接口703,其中存储器702存储有计算机程序704。处理器701、存储器702、通信接口703以及计算机程序704之间可以通过总线705连接。
当计算机设备为电子设备时,上述计算机程序704用于执行以下步骤的指令:
接收目标标的物的风险预测请求,所述风险预测请求包括预测日期;
获取所述风险预测请求的接收日期和所述接收日期的前N天所述目标标的物的目标历史数据,所述N为大于或等于1的正整数;
对所述目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征;
将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值,其中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
在一些可能的实施方式中,将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值之前,所述计算机程序704还用于执行以下步骤的指令:
从所述预设数据库提取所述第二训练集;
基于预设专家规则对所述第二训练集进行计算,得到不同状态下对应的第一动作集合;
基于所述第一动作集合进行机器学习,得到所述第一Actor模型。
在一些可能的实施方式中,在所述从预设数据库提取所述第二训练集之后,所述计算机程序704还用于执行以下步骤的指令:
对所述第一训练集进行特征提取,得到第一状态特征和第二状态特征;
对所述第一状态特征和所述第二状态特征进行拼接,得到第三状态特征;
基于所述第三状态特征进行机器学习,得到基模型;
将所述第三状态特征输入至所述基模型,得到基模型训练结果;
根据所述基模型训练结果获取所述基模型的排序结果;
根据所述排序结果确定所述基模型的加权权重;
根据所述加权权重对所述基模型进行模型融合,得到所述第一Critic模型。
在一些可能的实施方式中,在所述从预设数据库提取所述第二训练集之后,所述计算机程序704还用于执行以下步骤的指令:
构建所述第二Critic模型的价值网络,其中,所述价值网络的网络结构与所述第一Actor模型的网络结构相同;
将所述第一Actor模型的输出层之外的权重值复制给所述价值网络;
基于所述第一训练集,对所述价值网络进行训练,以更新所述价值网络的权重值;
将训练完成得到的所述价值网络作为所述第二Critic模型。
在一些可能的实施方式中,在所述基于所述第一动作集合进行机器学习,得到所述第一Actor模型之后,所述计算机程序704还用于执行以下步骤的指令:
从所述预设数据库提取所述第一训练集;
基于预设专家规则对所述第一训练集进行计算,得到不同状态下对应的第二动作集合;
基于所述第二动作集合对所述第一Actor模型进行优化,得到所述第一风险预测模型;或者
基于所述第一Critic模型或所述第二Critic模型对所述第一Actor模型进行优化,得到所述第一风险预测模型。
在一些可能的实施方式中,所述计算机程序704还用于执行以下步骤的指令:
基于所述预设数据库获取风险函数;
基于所述第一Critic模型和所述第二Critic模型,对所述风险函数进行优化,得到优化风险函数;
基于所述优化风险函数对所述第一Actor模型进行优化,得到所述第一风险预测模型。
在一些可能的实施方式中,在所述将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值之后,所述计算机程序704还用于执行以下步骤的指令:
从所述预设数据库中提取验证数据集;
基于所述验证数据集对第一风险预测模型进行验证,得到第二风险预测模型;
基于所述第一训练集和所述第二训练集对所述第二风险预测模型进行训练,得到第三风险预测模型;
将所述目标状态特征输入至所述第三风险预测模型,得到所述目标标的物在所述预测日期的风险值。
本领域技术人员可以理解,为了便于说明,图7中仅示出了一个存储器和处理器。在实际的终端或服务器中,可以存在多个处理器和存储器。存储器702也可以称为存储介质或者存储设备等,本申请实施例对此不做限定。
应理解,在本申请实施例中,处理器701可以是中央处理单元(central processing unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。
还应理解,本申请实施例中提及的存储器702可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(dynamic RAM,DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
需要说明的是,当处理器701为通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件时,存储器(存储模块)集成在处理器中。
应注意,本文描述的存储器702旨在包括但不限于这些和任意其它适合类型的存储器。
该总线705除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都标为总线。
本申请实施例还提供一种计算机存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行以实现如上述方法实施例中记载的任何一种基于强化学习的风险预测的方法的部分或全部步骤。所述计算机可读存储介质可以是非易失性,也可以是易失性。
本申请实施例还提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如上述方法实施例中记载的任何一种基于强化学习的风险预测的方法的部分或全部步骤。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (20)

  1. 一种基于强化学习的风险预测的方法,其中,包括:
    接收目标标的物的风险预测请求,所述风险预测请求包括预测日期;
    获取所述风险预测请求的接收日期和所述接收日期的前N天所述目标标的物的目标历史数据,所述N为大于或等于1的正整数;
    对所述目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征;
    将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值,其中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
  2. 根据权利要求1所述的方法,其中,在所述将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值之前,所述方法还包括:
    从所述预设数据库提取所述第二训练集;
    基于预设专家规则对所述第二训练集进行计算,得到不同状态下对应的第一动作集合;
    基于所述第一动作集合进行机器学习,得到所述第一Actor模型。
  3. 根据权利要求2所述的方法,其中,在所述从所述预设数据库提取所述第二训练集之后,所述方法还包括:
    对所述第一训练集进行特征提取,得到第一状态特征和第二状态特征;
    对所述第一状态特征和所述第二状态特征进行拼接,得到第三状态特征;
    基于所述第三状态特征进行机器学习,得到基模型;
    将所述第三状态特征输入至所述基模型,得到基模型训练结果;
    根据所述基模型训练结果获取所述基模型的排序结果;
    根据所述排序结果确定所述基模型的加权权重;
    根据所述加权权重对所述基模型进行模型融合,得到所述第一Critic模型。
  4. 根据权利要求2所述的方法,其中,在所述从所述预设数据库提取所述第二训练集之后,所述方法还包括:
    构建所述第二Critic模型的价值网络,其中,所述价值网络的网络结构与所述第一Actor模型的网络结构相同;
    将所述第一Actor模型的输出层之外的权重值复制给所述价值网络;
    基于所述第一训练集,对所述价值网络进行训练,以更新所述价值网络的权重值;
    将训练完成得到的所述价值网络作为所述第二Critic模型。
  5. 根据权利要求2所述的方法,其中,在所述基于所述第一动作集合进行机器学习,得到所述第一Actor模型之后,所述方法还包括:
    从所述预设数据库提取所述第一训练集;
    基于预设专家规则对所述第一训练集进行计算,得到不同状态下对应的第二动作集合;
    基于所述第二动作集合对所述第一Actor模型进行优化,得到所述第一风险预测模型;或者
    基于所述第一Critic模型或所述第二Critic模型对所述第一Actor模型进行优化,得到所述第一风险预测模型。
  6. 根据权利要求1所述的方法,其中,所述方法还包括:
    基于所述预设数据库获取风险函数;
    基于所述第一Critic模型和所述第二Critic模型,对所述风险函数进行优化,得到优化风险函数;
    基于所述优化风险函数对所述第一Actor模型进行优化,得到所述第一风险预测模型。
  7. 根据权利要求6所述的方法,其中,在所述将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值之后,所述方法还包括:
    从预设数据库中提取验证数据集;
    基于所述验证数据集对第一风险预测模型进行验证,得到第二风险预测模型;
    基于所述第一训练集和所述第二训练集对所述第二风险预测模型进行训练,得到第三风险预测模型;
    将所述目标状态特征输入至所述第三风险预测模型,得到所述目标标的物在所述预测日期的风险值。
  8. 一种基于强化学习的风险预测的装置,其中,包括:
    接收单元,用于接收目标标的物的风险预测请求,所述风险预测请求包括预测日期;
    处理单元,用于获取所述风险预测请求的接收日期和所述接收日期的前N天所述目标标的物的目标历史数据,所述N为大于或等于1的正整数;
    对所述目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征;
    将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值,其中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
  9. 一种计算机设备,其中,包括处理器、存储器和通信接口,其中,所述存储器存储有计算机程序,所述计算机程序被配置由所述处理器执行,所述计算机程序包括用于执行以下步骤的指令:
    接收目标标的物的风险预测请求,所述风险预测请求包括预测日期;
    获取所述风险预测请求的接收日期和所述接收日期的前N天所述目标标的物的目标历史数据,所述N为大于或等于1的正整数;
    对所述目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征;
    将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值,其中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
  10. 根据权利要求9所述的计算机设备,其中,将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值之前,所述计算机程序还用于执行以下步骤的指令:
    从所述预设数据库提取所述第二训练集;
    基于预设专家规则对所述第二训练集进行计算,得到不同状态下对应的第一动作集合;
    基于所述第一动作集合进行机器学习,得到所述第一Actor模型。
  11. 根据权利要求10所述的计算机设备,其中,在所述从预设数据库提取所述第二训练集之后,所述计算机程序还用于执行以下步骤的指令:
    对所述第一训练集进行特征提取,得到第一状态特征和第二状态特征;
    对所述第一状态特征和所述第二状态特征进行拼接,得到第三状态特征;
    基于所述第三状态特征进行机器学习,得到基模型;
    将所述第三状态特征输入至所述基模型,得到基模型训练结果;
    根据所述基模型训练结果获取所述基模型的排序结果;
    根据所述排序结果确定所述基模型的加权权重;
    根据所述加权权重对所述基模型进行模型融合,得到所述第一Critic模型。
  12. 根据权利要求10所述的计算机设备,其中,在所述从预设数据库提取所述第二训练集之后,所述计算机程序还用于执行以下步骤的指令:
    构建所述第二Critic模型的价值网络,其中,所述价值网络的网络结构与所述第一Actor模型的网络结构相同;
    将所述第一Actor模型的输出层之外的权重值复制给所述价值网络;
    基于所述第一训练集,对所述价值网络进行训练,以更新所述价值网络的权重值;
    将训练完成得到的所述价值网络作为所述第二Critic模型。
  13. 根据权利要求10所述的计算机设备,其中,在所述基于所述第一动作集合进行机器学习,得到所述第一Actor模型之后,所述计算机程序还用于执行以下步骤的指令:
    从所述预设数据库提取所述第一训练集;
    基于预设专家规则对所述第一训练集进行计算,得到不同状态下对应的第二动作集合;
    基于所述第二动作集合对所述第一Actor模型进行优化,得到所述第一风险预测模型;或者
    基于所述第一Critic模型或所述第二Critic模型对所述第一Actor模型进行优化,得到所述第一风险预测模型。
  14. 根据权利要求9所述的计算机设备,其中,所述计算机程序还用于执行以下步骤的指令:
    基于所述预设数据库获取风险函数;
    基于所述第一Critic模型和所述第二Critic模型,对所述风险函数进行优化,得到优化风险函数;
    基于所述优化风险函数对所述第一Actor模型进行优化,得到所述第一风险预测模型。
  15. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储计算机程序,所述计算机程序使得计算机执行以下步骤的指令:
    接收目标标的物的风险预测请求,所述风险预测请求包括预测日期;
    获取所述风险预测请求的接收日期和所述接收日期的前N天所述目标标的物的目标历史数据,所述N为大于或等于1的正整数;
    对所述目标历史数据进行特征提取,得到多个预设特征维度中每一预设特征维度对应的目标状态特征;
    将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值,其中,所述第一风险预测模型是基于第一训练集、第一Critic模型或第二Critic模型,对第一Actor模型进行优化得到的模型,所述第一Actor模型是基于第二训练集进行训练得到的,所述第一训练集和所述第二训练集是从预设数据库中提取的历史数据,所述预设数据库包括所述目标历史数据。
  16. 根据权利要求15所述的计算机可读存储介质,其中,将所述目标状态特征输入至第一风险预测模型,得到所述目标标的物在所述预测日期的风险值之前,所述计算机程序还用于执行以下步骤的指令:
    从所述预设数据库提取所述第二训练集;
    基于预设专家规则对所述第二训练集进行计算,得到不同状态下对应的第一动作集合;
    基于所述第一动作集合进行机器学习,得到所述第一Actor模型。
  17. 根据权利要求16所述的计算机可读存储介质,其中,在所述从预设数据库提取所述第二训练集之后,所述计算机程序还用于执行以下步骤的指令:
    对所述第一训练集进行特征提取,得到第一状态特征和第二状态特征;
    对所述第一状态特征和所述第二状态特征进行拼接,得到第三状态特征;
    基于所述第三状态特征进行机器学习,得到基模型;
    将所述第三状态特征输入至所述基模型,得到基模型训练结果;
    根据所述基模型训练结果获取所述基模型的排序结果;
    根据所述排序结果确定所述基模型的加权权重;
    根据所述加权权重对所述基模型进行模型融合,得到所述第一Critic模型。
  18. 根据权利要求16所述的计算机可读存储介质,其中,在所述从预设数据库提取所述第二训练集之后,所述计算机程序还用于执行以下步骤的指令:
    构建所述第二Critic模型的价值网络,其中,所述价值网络的网络结构与所述第一Actor模型的网络结构相同;
    将所述第一Actor模型的输出层之外的权重值复制给所述价值网络;
    基于所述第一训练集,对所述价值网络进行训练,以更新所述价值网络的权重值;
    将训练完成得到的所述价值网络作为所述第二Critic模型。
  19. 根据权利要求16所述的计算机可读存储介质,其中,在所述基于所述第一动作集合进行机器学习,得到所述第一Actor模型之后,所述计算机程序还用于执行以下步骤的指令:
    从所述预设数据库提取所述第一训练集;
    基于预设专家规则对所述第一训练集进行计算,得到不同状态下对应的第二动作集合;
    基于所述第二动作集合对所述第一Actor模型进行优化,得到所述第一风险预测模型;或者
    基于所述第一Critic模型或所述第二Critic模型对所述第一Actor模型进行优化,得到所述第一风险预测模型。
  20. 根据权利要求15所述的计算机可读存储介质,其中,所述计算机程序还用于执行以下步骤的指令:
    基于所述预设数据库获取风险函数;
    基于所述第一Critic模型和所述第二Critic模型,对所述风险函数进行优化,得到优化风险函数;
    基于所述优化风险函数对所述第一Actor模型进行优化,得到所述第一风险预测模型。
PCT/CN2022/090029 2021-12-15 2022-04-28 基于强化学习的风险预测的方法、装置、设备及存储介质 WO2023108987A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111535520.9A CN114240656A (zh) 2021-12-15 2021-12-15 基于强化学习的风险预测的方法、装置、设备及存储介质
CN202111535520.9 2021-12-15

Publications (1)

Publication Number Publication Date
WO2023108987A1 true WO2023108987A1 (zh) 2023-06-22

Family

ID=80756457

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090029 WO2023108987A1 (zh) 2021-12-15 2022-04-28 基于强化学习的风险预测的方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN114240656A (zh)
WO (1) WO2023108987A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934486A (zh) * 2023-09-15 2023-10-24 深圳格隆汇信息科技有限公司 一种基于深度学习的决策评估方法及系统

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240656A (zh) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 基于强化学习的风险预测的方法、装置、设备及存储介质
CN115470785A (zh) * 2022-08-25 2022-12-13 深圳市富途网络科技有限公司 基于大数据的债券风险信息处理方法及相关设备
CN115630754B (zh) * 2022-12-19 2023-03-28 北京云驰未来科技有限公司 智能网联汽车信息安全的预测方法、装置、设备及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798608A (zh) * 2017-10-19 2018-03-13 深圳市耐飞科技有限公司 一种投资产品组合推荐方法及系统
CN110059896A (zh) * 2019-05-15 2019-07-26 浙江科技学院 一种基于强化学习的股票预测方法及系统
CN112101520A (zh) * 2020-08-10 2020-12-18 中国平安人寿保险股份有限公司 风险评估模型训练方法、业务风险评估方法及其他设备
CN112488826A (zh) * 2020-12-16 2021-03-12 北京逸风金科软件有限公司 基于深度强化学习对银行风险定价的优化方法和装置
CN112927085A (zh) * 2021-04-14 2021-06-08 刘星 基于区块链、大数据和算法的股票风险预警系统
CN114240656A (zh) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 基于强化学习的风险预测的方法、装置、设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798608A (zh) * 2017-10-19 2018-03-13 深圳市耐飞科技有限公司 一种投资产品组合推荐方法及系统
CN110059896A (zh) * 2019-05-15 2019-07-26 浙江科技学院 一种基于强化学习的股票预测方法及系统
CN112101520A (zh) * 2020-08-10 2020-12-18 中国平安人寿保险股份有限公司 风险评估模型训练方法、业务风险评估方法及其他设备
CN112488826A (zh) * 2020-12-16 2021-03-12 北京逸风金科软件有限公司 基于深度强化学习对银行风险定价的优化方法和装置
CN112927085A (zh) * 2021-04-14 2021-06-08 刘星 基于区块链、大数据和算法的股票风险预警系统
CN114240656A (zh) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 基于强化学习的风险预测的方法、装置、设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934486A (zh) * 2023-09-15 2023-10-24 深圳格隆汇信息科技有限公司 一种基于深度学习的决策评估方法及系统
CN116934486B (zh) * 2023-09-15 2024-01-12 深圳市蓝宇飞扬科技有限公司 一种基于深度学习的决策评估方法及系统

Also Published As

Publication number Publication date
CN114240656A (zh) 2022-03-25

Similar Documents

Publication Publication Date Title
WO2023108987A1 (zh) 基于强化学习的风险预测的方法、装置、设备及存储介质
Baek et al. ModAugNet: A new forecasting framework for stock market index value with an overfitting prevention LSTM module and a prediction LSTM module
Fathali et al. Stock market prediction of Nifty 50 index applying machine learning techniques
Wang et al. Learning nonstationary time-series with dynamic pattern extractions
Zheng et al. RLSTM: a new framework of stock prediction by using random noise for overfitting prevention
Lotfi et al. Artificial intelligence methods: toward a new decision making tool
Qiao et al. Prediction of stock return by LSTM neural network
Rathee et al. Analysis and price prediction of cryptocurrencies for historical and live data using ensemble-based neural networks
Baswaraj et al. An Accurate Stock Prediction Using Ensemble Deep Learning Model
Thesia et al. A dynamic scenario‐driven technique for stock price prediction and trading
Alghamdi et al. A novel hybrid deep learning model for stock price forecasting
Yu et al. Share price trend prediction using CRNN with LSTM structure
Wang et al. Stock price prediction based on chaotic hybrid particle swarm optimisation-RBF neural network
Alalaya et al. Combination method between fuzzy logic and neural network models to predict amman stock exchange
Guo Stock Price Prediction Using Machine Learning
Yadav et al. A Survey on Stock Price Prediction using Machine Learning Techniques
Solanki et al. An Improved Machine Learning Algorithm for Stock Price Prediction
Jiang et al. Time series to imaging-based deep learning model for detecting abnormal fluctuation in agriculture product price
Hamzah et al. Robust Stock Price Prediction using Gated Recurrent Unit (GRU)
Pongsena et al. Deep Learning for Financial Time-Series Data Analytics: An Image Processing Based Approach
Yan Financial Bubble Prediction with Neural Networks
Islam Development and Application of Artificial Neural Networks for Energy Demand Forecasting in Australia
Niszl Stock market forecasting: a cross-domain approach
Buche et al. Exploring Machine Learning Approaches for Analysis and Prediction in the Indian Stock Market.
Deswal et al. Stock Price Prediction of AAPL Stock by Using Machine Learning Techniques: A Comparative Study

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22905755

Country of ref document: EP

Kind code of ref document: A1