CN114240656A - Risk prediction method, device and equipment based on reinforcement learning and storage medium - Google Patents

Risk prediction method, device and equipment based on reinforcement learning and storage medium

Info

Publication number
CN114240656A
Authority
CN
China
Prior art keywords
model
risk
risk prediction
target
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111535520.9A
Other languages
Chinese (zh)
Inventor
肖京
郭骁
王磊
王媛
刘云风
谭韬
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111535520.9A priority Critical patent/CN114240656A/en
Publication of CN114240656A publication Critical patent/CN114240656A/en
Priority to PCT/CN2022/090029 priority patent/WO2023108987A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Abstract

The application relates to the field of artificial intelligence and discloses a risk prediction method, device, equipment and storage medium based on reinforcement learning. The method comprises the following steps: receiving a risk prediction request for a target object, the risk prediction request including a prediction date; acquiring the receiving date of the risk prediction request and the target historical data of the target object for the N days before the receiving date; performing feature extraction on the target historical data to obtain the target state feature corresponding to each of a plurality of preset feature dimensions; and inputting the target state features into a first risk prediction model to obtain the risk value of the target object on the prediction date, wherein the first risk prediction model is obtained by optimizing a first Actor model based on a first training set, a first Critic model or a second Critic model. By implementing the embodiments of the application, the accuracy of risk prediction can be improved, which facilitates risk decision-making.

Description

Risk prediction method, device and equipment based on reinforcement learning and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a risk prediction method, apparatus, device, and storage medium based on reinforcement learning.
Background
In recent years, with the rapid development of the internet, the field of artificial intelligence algorithms has advanced greatly. Reinforcement learning has gradually developed into an important branch of artificial intelligence. Reinforcement learning is a machine learning approach in which an agent acts on an environment, obtains rewards from it, and seeks to maximize the cumulative reward. Reinforcement learning is favored by researchers and has been successfully applied to fields such as automation, intelligent control, and autonomous driving.
At present, application in the financial field is a research hot spot of reinforcement learning, mainly for the analysis and decision making of financial markets. Predicting the risk value of a subject matter in the financial market (such as stocks, funds, futures, bonds, derivatives, and the like) with a reinforcement-learning-based model can greatly reduce the influence of subjective judgment, the losses caused by emotionally driven operations, and manual operation errors. However, the data of the subject matter is highly time-varying, which leads to poor model performance and poor actual prediction results.
Disclosure of Invention
The embodiment of the application provides a risk prediction method, a risk prediction device, risk prediction equipment and a storage medium based on reinforcement learning, which can improve the accuracy of risk prediction and are beneficial to risk decision making.
In a first aspect, an embodiment of the present application provides a risk prediction method based on reinforcement learning, where:
receiving a risk prediction request for a target subject matter, the risk prediction request including a prediction date;
acquiring a receiving date of the risk prediction request and target history data of the target object N days before the receiving date, wherein N is a positive integer greater than or equal to 1;
extracting the characteristics of the target historical data to obtain target state characteristics corresponding to each preset characteristic dimension in a plurality of preset characteristic dimensions;
inputting the target state features into a first risk prediction model to obtain a risk value of the target object on the prediction date, wherein the first risk prediction model is a model obtained by optimizing a first Actor model based on a first training set, a first Critic model or a second Critic model, the first Actor model is obtained by training based on a second training set, the first training set and the second training set are historical data extracted from a preset database, and the preset database comprises the target historical data.
In a second aspect, an embodiment of the present application provides an apparatus for risk prediction based on reinforcement learning, where:
a receiving unit configured to receive a risk prediction request of a target object, the risk prediction request including a prediction date;
a processing unit, configured to obtain a reception date of the risk prediction request and target history data of the target object N days before the reception date, where N is a positive integer greater than or equal to 1;
extracting the characteristics of the target historical data to obtain target state characteristics corresponding to each preset characteristic dimension in a plurality of preset characteristic dimensions;
inputting the target state features into a first risk prediction model to obtain a risk value of the target object on the prediction date, wherein the first risk prediction model is a model obtained by optimizing a first Actor model based on a first training set, a first Critic model or a second Critic model, the first Actor model is obtained by training based on a second training set, the first training set and the second training set are historical data extracted from a preset database, and the preset database comprises the target historical data.
In a third aspect, an embodiment of the present application provides a computer device comprising a processor, a memory and a communication interface, wherein the memory stores a computer program configured to be executed by the processor, and the computer program comprises instructions for performing some or all of the steps described in the first aspect of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, where the computer program causes a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
By adopting the risk prediction method, device, equipment and storage medium based on reinforcement learning, after a risk prediction request for a target object is received, the receiving date of the risk prediction request and the target historical data of the target object for the N days before the receiving date are obtained, and feature extraction is performed on the target historical data to obtain the target state feature corresponding to each of a plurality of preset feature dimensions. The target state features are then input into the first risk prediction model to obtain the risk value of the target object on the prediction date. The first risk prediction model is obtained by optimizing a first Actor model based on a first training set, a first Critic model or a second Critic model; the first Actor model is trained based on a second training set; the first training set and the second training set are historical data extracted from a preset database, and the preset database comprises the target historical data. Because risk prediction is performed with a model obtained through multiple rounds of training and optimization, the accuracy of risk prediction can be improved, which facilitates risk decision-making.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained based on these drawings without creative efforts. Wherein:
fig. 1 is a schematic diagram of an operation principle of an Actor-Critic algorithm according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart illustrating a risk prediction method based on reinforcement learning according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an LSTM algorithm according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a first Actor model based on the LSTM algorithm according to an embodiment of the present application;
fig. 5 is a flowchart of Actor-Critic interaction training provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for risk prediction based on reinforcement learning according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The network architecture applied by the embodiment of the application comprises a server and electronic equipment. Wherein the electronic device may be in network communication with the server. Network communications may be based on any wired and wireless network, including but not limited to the Internet, wide area networks, metropolitan area networks, local area networks, Virtual Private Networks (VPNs), wireless communication networks, and the like.
The number of the electronic devices and the number of the servers are not limited in the embodiment of the application, and the servers can provide services for the electronic devices at the same time. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The server may alternatively be implemented as a server cluster consisting of a plurality of servers.
The electronic device may be a personal computer (PC), a notebook computer, or a smart phone, and may also be an all-in-one machine, a palm computer, a tablet computer (pad), a smart TV terminal, a vehicle-mounted terminal, or a portable device. The operating system of a PC-side electronic device, such as an all-in-one machine, may include, but is not limited to, operating systems such as Linux, Unix, the Windows series (e.g., Windows XP, Windows 7, etc.), and Mac OS X (the operating system of Apple computers). The operating system of a mobile electronic device, such as a smart phone, may include, but is not limited to, operating systems such as Android, iOS (the operating system of Apple mobile phones), and Windows.
In particular, the risk prediction method based on reinforcement learning can be applied to electronic equipment or servers configured by financial institutions such as banks, securities and insurance. At present, the application in the financial field is a hot spot of reinforcement learning in current research, and is mainly used for analysis research and decision making of financial markets.
Reinforcement learning is a machine learning approach in which an agent acts on an environment, obtains rewards from it, and seeks to maximize the cumulative reward. A reinforcement learning algorithm therefore has several basic elements: agent, environment, state, action, and reward (or return). For ease of understanding, several basic concepts are presented first.
(1) Agent, also referred to as an "intelligent agent". The agent can automatically adjust its behavior and state according to changes in the external environment; it does not merely passively receive external stimuli, but has the capability of self-management and self-regulation. In addition, the agent can accumulate or learn experience and knowledge and modify its behavior to adapt to a new environment.
(2) The environment is the part of the system except the intelligent agent, can feed back the state and the reward to the intelligent agent, and can change according to a certain rule. For the financial domain, the environment may be a financial market.
(3) The state is the objective condition that the system is in during each time period. For a subject matter in the financial market, there may be three states over a certain period of time: rising, falling, and moving sideways (consolidating).
(4) Actions, also called decisions. Given a time and a state, the agent can make different choices according to the state of the environment, so that the current state transitions to the next state in a certain manner or with a certain probability; this process is called an action. For a given subject matter there may be three different actions: buying, selling, and holding.
(5) The reward, also called the return, may be defined as the subsequent benefit brought about after some action is taken. The reward may be positive or negative.
Reinforcement learning algorithms can be roughly classified into value-based algorithms and policy-based algorithms. A typical value-based algorithm is Q-Learning, and a typical policy-based algorithm is the Policy Gradient (PG) algorithm.
In the financial field, the Q-Learning algorithm is mainly based on defining market states, then selecting a trading action according to an epsilon-greedy strategy and interacting with the environment to obtain a return. The main idea of the algorithm is that states and actions form a Q-value table in which Q values are stored, and the Q-value table is then updated according to the returns, so as to optimize the trading actions. However, when the dimensionality of the state or action space is too large, the Q-Learning algorithm is difficult to converge.
Similar to the Q-Learning algorithm, applying the PG algorithm to the financial field also means defining market states and selecting the most advantageous action according to the existing policy. The return of the action is obtained through environment feedback and is then used to update the policy in reverse. The PG algorithm can handle high-dimensional action spaces, but it easily falls into local optima and, being based on per-episode updates, is relatively inefficient.
The Actor-Critic algorithm, as the name implies, includes two parts: an actor (Actor) and a critic (Critic). The algorithm combines the advantages of the PG algorithm and the Q-Learning algorithm. The Actor, as a policy network, selects actions based on probabilities, while the Critic scores the actions chosen by the Actor. The Actor then modifies the probability of selecting an action based on the Critic's score. By combining the two, the policy network can perform gradient updates according to the value function so as to optimize the model parameters and obtain the optimal action selection under different environment states, and the update is faster than that of the traditional per-episode PG. The operating principle of the Actor-Critic algorithm is shown in FIG. 1.
The Actor model uses a policy network to approximate the policy function, which can be written as:

π_θ(s, a) = P(a | s)

meaning that π is the way the agent makes decisions when facing the environment: action a is chosen with a conditional probability based on the state s and the network weights θ. Given the state s_t at time t, the action a_t with the highest probability is computed and made to interact with the environment, yielding the actual reward value r_{t+1} and the state s_{t+1} at the next moment.
The Critic model uses a value network to approximate the value function, which may take the following forms:
(1) the state value function
v(s, w) ≈ v_π(s)
or:
(2) the state–action value function
q(s, a, w) ≈ q_π(s, a)
The policy network parameters are then updated along the policy gradient:

θ' = θ + α · ∇_θ log π_θ(s_t, a_t) · q(s_t, a_t, w)

where w is the network parameter of the Critic model, θ and θ' are the policy network parameters before and after the update respectively, and α is the update step size, chosen according to the actual situation.
The Critic model can evaluate the policy based on one of the following functions:
(1) the state value function V(s_t, w);
(2) the state–action value function Q(s_t, a_t, w);
(3) the temporal-difference (TD) error δ(t), where the TD term may use the state value function, i.e. δ(t) = r_{t+1} + γ·V(s_{t+1}, w) − V(s_t, w), or the state–action value function, i.e. δ(t) = r_{t+1} + γ·Q(s_{t+1}, a_{t+1}, w) − Q(s_t, a_t, w);
(4) the advantage function A(s_t, a_t, w, β) = Q(s_t, a_t, w, ρ) − V(s_t, w, β), i.e. the difference between the state–action value function and the state value function, where β is the network parameter of the advantage function;
(5) the TD(λ) error δ(t)·E(t), where E(t) is the eligibility trace of the state, which can be expressed as E(t) = γ·λ·E(t−1) + ∇_θ log π_θ(s_t, a_t).
for the Critic model parameter w itself, the iterative update is typically performed using a mean square error loss function. Taking the time difference function based on the state cost function as an example, the update formula of the Critic network parameter w can be expressed as:
δ(t)=rt+1+γV(st+1)-V(st)
Figure BDA0003413077950000057
as shown in FIG. 1, in the Actor-Critic algorithm, first, a feature vector s of a current state is usedtOutput action a as input to an Actor policy networktAnd interacts with the environment to obtain a new state st+1Current prize value rt. Second, use the current state stAnd new state st+1As input to the Critic value network, values V(s) are obtained separatelyt) And V(s)t+1) And a current prize value rt. Then, according to the value V(s)t) And V(s)t+1) And a current prize value rtCalculating the time difference delta to obtain the time difference delta rt+γV(st+1)-V(st). Then, the mean square error loss function ∑ (r) is usedt+γV(st+1)-V(st))2And updating the Critic value network parameter w. Finally, the loss function is passed
Figure BDA0003413077950000058
And updating the Actor policy network parameter theta. Inputting a new feature vector s representing the current statetAnd repeatedly executing the steps until the training times or the convergence of the target function is reached. The training times can be T, the state feature dimension can be Y, the action space can be A, the step length can be alpha and beta, and the value of the attenuation factor gamma can be between 0.0 and 1.0.
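As a concrete illustration of the loop just described, the following is a minimal sketch in Python (PyTorch) of one TD-error-based Actor-Critic update. It is not the implementation of this application: the network shapes, step sizes, discount factor, and the dummy transition at the end are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS = 8, 2            # assumed feature dimension and action space
GAMMA, ALPHA, BETA = 0.9, 1e-3, 1e-3   # assumed discount factor and step sizes

actor = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.Tanh(), nn.Linear(32, N_ACTIONS))
critic = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.Tanh(), nn.Linear(32, 1))
opt_actor = torch.optim.SGD(actor.parameters(), lr=ALPHA)
opt_critic = torch.optim.SGD(critic.parameters(), lr=BETA)

def actor_critic_step(s_t, a_t, r_t, s_t1):
    """One update: the TD error drives both the Critic (MSE) and the Actor (policy gradient)."""
    s_t = torch.as_tensor(s_t, dtype=torch.float32)
    s_t1 = torch.as_tensor(s_t1, dtype=torch.float32)
    v_t, v_t1 = critic(s_t), critic(s_t1).detach()
    delta = r_t + GAMMA * v_t1 - v_t                 # delta = r_t + gamma*V(s_{t+1}) - V(s_t)
    critic_loss = delta.pow(2).sum()                 # mean-square TD error updates w
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    log_prob = F.log_softmax(actor(s_t), dim=-1)[a_t]
    actor_loss = -log_prob * delta.detach()          # -log pi_theta(s_t, a_t) * delta updates theta
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# Example call with a dummy transition (random state features, action 1, reward 0.02):
actor_critic_step(torch.randn(STATE_DIM), 1, 0.02, torch.randn(STATE_DIM))
```

In practice this step would be repeated over the trajectories collected from the environment, as described in the loop above.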
The above is an introduction to the Actor-Critic algorithm. In a traditional reinforcement learning task, this kind of method generally learns the optimal policy by accumulating rewards. Although simple and direct, in multi-step decision making the accumulation of rewards requires a huge amount of data, and the huge search space makes samples in financial-market decision problems very scarce, so rewards are sparse and the parameters of the decision model cannot be optimized effectively.
The embodiment of the application mainly aims at improving the Actor-Critic algorithm in reinforcement learning so as to improve the accuracy of risk prediction and facilitate risk decision making. Specifically, referring to fig. 2, fig. 2 is a flowchart illustrating a risk prediction method based on reinforcement learning according to an embodiment of the present disclosure. Taking the application of the method to the electronic device as an example for illustration, the method includes the following steps S201 to S204, wherein:
step S201: a risk prediction request for a target subject matter is received, the risk prediction request including a prediction date.
In the embodiment of the present application, the target object may be one or more stocks, and may also be financial products such as bonds, funds, and futures. The risk prediction request may be generated according to an operation of a user, or may be triggered automatically when a prediction period arrives, which is not limited herein. The prediction period may be every working day; for example, after the market closes at 15:00 on each working day, a risk prediction request is sent to the electronic device. In this way, risk prediction can be performed on the target object whenever the prediction period arrives.
In this embodiment of the application, the prediction date may be the day after the current date, or may be a day within the week after the current date, which is not limited herein. If the prediction date obtained from the risk prediction request falls on a closed day of the trading market corresponding to the target object, such as a weekend or a holiday, the prediction date is postponed to the next working day. For example, if the current date is October 17, 2021, the prediction date may be October 18, 2021 or October 19, 2021. If the prediction date in the acquired risk prediction request is October 17, 2021 (a Sunday), the prediction date is postponed to October 18, 2021 (a Monday). If no prediction date is specified in the risk prediction request, the default prediction date is the day after the current date, postponed past weekends or holidays.
Step S202: target history data for a target object is obtained for a date of receipt of the risk prediction request and N days prior to the date of receipt.
In the embodiment of the application, the receiving date refers to the date on which the electronic device receives the risk prediction request. N may be any positive integer greater than or equal to 1, and the specific value of N is not limited. For example, N may be 10, 30, or 60. N may also be determined based on the time interval between the prediction date and the receiving date; for example, the greater the interval, the greater N.
The target historical data may be extracted from a preset database. The preset database may be stored in the electronic device in advance, or may be stored in a server and obtained by the electronic device by accessing the server. The preset database may include common price-indicator data of the subject matter from a historical time to the current time, where the historical time may refer to any time that has passed. For example, the historical time may be January 1, 2010, December 31, 2018, January 1, 2020, and so on, which is not limited. In addition, the data type of the data in the preset database is not limited; referring to Table 1, the data in the preset database may include the opening price, closing price, highest price, lowest price, trading volume, 5-day moving average, 10-day moving average, 20-day moving average, 60-day moving average, and the like of the subject matter.
TABLE 1 Common price-indicator data of the subject matter

Number | Index code | Index name
1 | Pop | Opening price
2 | Pcl | Closing price
3 | Phi | Highest price
4 | Plo | Lowest price
5 | Volume | Trading volume
6 | MA5 | 5-day moving average
7 | MA10 | 10-day moving average
8 | MA20 | 20-day moving average
9 | MA60 | 60-day moving average
The preset database may include data of an expert factor library. An expert factor is a factor that has a certain qualitative or quantitative relation with the downside risk of each target object. As shown in Table 2, the expert factor library may include the following preset feature dimensions: macro indicators, industry indicators, characteristic derivative indicators, capital technical indicators, capital flow indicators, derivative market indicators, and public opinion heat indicators. Each preset feature dimension corresponds to a certain number of target state features; see Table 2 for details. For example, the target state features corresponding to the capital technical indicators may be the N-day moving average, the N-day volatility, the Bollinger band, the MIKE line, and the like; the target state features corresponding to the capital flow indicators may be northbound fund flow, southbound fund flow, main fund flow, and the like.
Further, the preset database may further include the purchasing managers' index (PMI), Bollinger lines, northbound fund inflows, and the like shown in Table 2; or it may include broad-market indexes not shown in Table 2, such as the Shanghai Composite Index, the Shenzhen Component Index, the ChiNext Index, the S&P 500 Index, and other market indexes, which is not limited in this application.
TABLE 2 Data of the expert factor library of the subject matter (provided as an image table in the original publication, listing the target state features under each preset feature dimension)
The data sources of the preset database may be finance-related web pages or applications, or may be economic data published by statistical bureaus, enterprise financial reports, Shanghai/Shenzhen/overseas market data, social media statistics, and the like, which is not limited herein.
Step S203: and extracting the characteristics of the target historical data to obtain the target state characteristics corresponding to each preset characteristic dimension in a plurality of preset characteristic dimensions.
In some possible embodiments, after performing step S202, the following steps may be further included: and preprocessing abnormal data in the target historical data to obtain data to be processed. Step S203 may include: and performing feature extraction on the data to be processed to obtain a target state feature corresponding to each preset feature dimension in a plurality of preset feature dimensions.
In the embodiment of the present application, the abnormal data may include missing values and noise values. A noise value refers to interference data, i.e., data that describes the scene inaccurately. For example, the random error or variance of a measured variable may be computed, and values smaller than that random error or variance may be determined to be noise. Preprocessing the abnormal data in the target historical data may include filling missing values and processing noise values. The way missing values are filled is not limited in this embodiment: missing values may be filled by mean/mode imputation, hot deck imputation, the K-nearest-distance-neighbor method, and the like. The way noise values are processed is also not limited: noise values may be handled by one or more of binning, clustering, and regression. After the abnormal data is preprocessed to obtain the data to be processed, the data to be processed may be normalized. It can be understood that preprocessing the abnormal data helps improve the efficiency and accuracy of data processing.
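As an illustration only, the following sketch shows one possible preprocessing pass in Python with pandas; the mean imputation, the percentile clipping used here in place of binning/clustering/regression for noise, and the min-max normalization are assumed choices, not the specific cleaning steps mandated by this application.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    num_cols = df.select_dtypes("number").columns
    # Fill missing values with the column mean (one of the imputation options above).
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
    # Tame noise values by clipping to the 1st/99th percentiles (a simple stand-in
    # for the binning/clustering/regression options mentioned above).
    low, high = df[num_cols].quantile(0.01), df[num_cols].quantile(0.99)
    df[num_cols] = df[num_cols].clip(lower=low, upper=high, axis=1)
    # Min-max normalization so every feature lies in [0, 1].
    rng = (df[num_cols].max() - df[num_cols].min()).replace(0, 1)
    df[num_cols] = (df[num_cols] - df[num_cols].min()) / rng
    return df
```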
In this embodiment of the application, the preset feature dimension may be data of an expert factor library shown in table 2, which is not described herein again. It should be noted that the preset feature dimensions and the corresponding target state features shown in table 2 are only examples, and other feature dimensions and corresponding target state features related to a target object may also be included, which is not limited in this embodiment of the present application.
It can be understood that the expert factors are features summarized from financial-market theory and practical experience. Therefore, compared with the traditional price and technical indicators shown in Table 1, the target state features obtained from the expert factors in the embodiment of the application have more direct guiding significance for predicting downside risk, and they provide higher-quality state inputs for the reinforcement learning algorithm, so that overfitting of the model can be controlled.
Step S204: and inputting the target state characteristics into the first risk prediction model to obtain the risk value of the target object on the prediction date.
In the embodiments of the present application, the risk value is the output of the first risk prediction model. The risk value indicates whether the target object has a trend of falling significantly, and its value may be 1 or 0. For example, 1 represents that the subject matter is predicted to be at risk of falling significantly in the future, and 0 represents that it is not.
In this embodiment of the application, the first risk prediction model is a model obtained by optimizing a first Actor model based on a first training set, a first Critic model or a second Critic model, the first Actor model is obtained by training based on a second training set, the first training set and the second training set are historical data extracted from a preset database, and the preset database includes the target historical data.
In the embodiment of the application, the preset database can be divided according to timestamps to obtain the first training set and the second training set. For example, data for a period of time may be obtained from the preset database, a certain proportion (e.g., the earliest 1/5) of that data may be divided into the second training set, and the remaining data may be divided into the first training set. Illustratively, data of a certain target object from January 1, 2010 to December 31, 2020 is acquired from the preset database as training data. The data from January 1, 2010 to December 31, 2012 can then be divided into the second training set, and the data from January 1, 2013 to December 31, 2020 into the first training set.
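A small sketch of this timestamp-based split is given below; the column name and the 1/5 ratio are illustrative assumptions.

```python
import pandas as pd

def split_by_time(df: pd.DataFrame, date_col: str = "date", ratio: float = 0.2):
    df = df.sort_values(date_col)
    cut = int(len(df) * ratio)
    second_training_set = df.iloc[:cut]   # earliest slice, e.g. 2010-01-01 .. 2012-12-31
    first_training_set = df.iloc[cut:]    # remainder, e.g. 2013-01-01 .. 2020-12-31
    return first_training_set, second_training_set
```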
The method for acquiring the first Actor model, the first Critic model or the second Critic model is not limited in the present application, and the method for acquiring the first Actor model is described below.
In some possible embodiments, before performing step S204, the following steps may be further included: extracting the second training set from the preset database; calculating the second training set based on a preset expert rule to obtain corresponding first action sets in different states; and performing machine learning based on the first action set to obtain the first Actor model.
The preset expert rules refer to long-accumulated theoretical knowledge and financial risk-control theory about classes of financial indicators that are highly associated with downside risk. In the embodiment of the present application, the preset expert rules include, but are not limited to, quantitative timing methods such as the Bollinger band, Resistance Support Relative Strength (RSRS), Moving Average Convergence Divergence (MACD), and market sentiment. When such methods are used to fit downside-risk signals, the calculation is usually based on heuristic rules or on the relative relation of a few key indicators, and therefore requires fewer samples than purely data-driven algorithms.
The embodiment of the application introduces the use of the preset expert rules by taking the Bollinger band signal as an example. The Bollinger band is named after its inventor, John Bollinger, and is used to delineate the interval of price fluctuation. Its basic form is a band-shaped channel composed of three rails: a middle rail, an upper rail, and a lower rail.
Middle rail (MA), the moving average of the closing prices of the n days before day t:

MA_t = (1/n) · Σ_{i=1}^{n} P_{t−i}^cl

where P_{t−i}^cl is the closing price of day t−i and n is the number of samples; the value of n is chosen according to the actual situation and is typically 20.

Upper rail (UT), i.e. the price two standard deviations above the middle rail:

UT_t = MA_t + 2 · std_n(P^cl)

Lower rail (LT), i.e. the price two standard deviations below the middle rail:

LT_t = MA_t − 2 · std_n(P^cl)

where std_n(P^cl) is the standard deviation of the closing prices over the same n-day window.
the formula for calculating the selling signal of the brink belt is as follows:
Figure BDA0003413077950000093
like a boulin belt selling signal
Figure BDA0003413077950000094
The formula of (a) shows that the problem has been transformed into a 0-1 classification task. When in use
Figure BDA0003413077950000095
Namely, when the closing price of the current day is not higher than the closing price of the current day in the forest belt,
Figure BDA0003413077950000096
and predicting to be a risk signal, and clearing the bin of the target object at the moment. In other cases, when
Figure BDA0003413077950000097
When the bin is full, the target object is bought. When the same signal appears at two consecutive times, for example, when the signal appears as 1 or 0 continuously, the former cannot perform the selling action again because the bin is already cleared at the previous time, and the latter cannot perform the buying action again because the bin is already fully built at the previous time.
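The Bollinger-band expert signal described above can be sketched as follows; the window length n = 20 and the two-standard-deviation width are the assumptions stated earlier, and the function name is illustrative.

```python
import numpy as np

def bollinger_risk_signal(close: np.ndarray, n: int = 20, k: float = 2.0) -> np.ndarray:
    """1 = risk signal (close at or below the lower rail), 0 = no risk signal."""
    signal = np.zeros(len(close), dtype=int)
    for t in range(n, len(close)):
        window = close[t - n:t]           # closing prices of the n days before day t
        ma = window.mean()                # middle rail
        lt = ma - k * window.std()        # lower rail; the upper rail would be ma + k*std
        signal[t] = 1 if close[t] <= lt else 0
    return signal
```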
It can be seen that the preset expert rules can pre-train the first Actor model so that its decision level approaches that of the preset expert rules before the alternating training with the Critic model. In a sample-scarce environment, high-value rewards can thus be obtained with fast sampling, avoiding a large amount of inefficient or even ineffective sampling at the initial stage of the Actor-Critic algorithm.
The embodiment of the application can adopt an imitation-learning approach: the preset expert rules are used to fit the financial-market states and produce expert action labels, and these labels are used for supervised pre-training. The specific implementation process is as follows:
First, a second training set is extracted from the preset database as the state input; the definition of the second training set is described above and not repeated here. Then, the second training set is calculated using the preset expert rules to obtain the corresponding first action set, i.e. the expert actions, so that a mapping from states to actions is obtained, forming a set of trading strategies. At this point, a set of expert decision data τ = {τ_1, τ_2, …, τ_m} can be obtained, where m is the number of expert models. Each expert decision data τ_i contains a sequence of states and actions τ_i = {(s_1, a_1), (s_2, a_2), …, (s_n, a_n)}, where n is the number of samples. All state–action tuples are merged to construct a new set D = {(s_1, a_1), (s_2, a_2), …}. In this case, the first Actor model can be obtained by performing supervised learning of classification (for discrete actions) or regression (for continuous actions), using the states as features and the actions as labels.
In this embodiment of the present application, the policy network corresponding to the first Actor model may use a Recurrent Neural Network (RNN), a long-short time memory (LSTM), or a Gated Recurrent Unit (GRU), which is not limited in this embodiment of the present application. In the embodiment of the application, a first Actor model is constructed by taking an LSTM algorithm as an example.
The LSTM model is a special RNN model that solves, through a gate mechanism, the long-term memory problem that the plain RNN model lacks. Specifically, one neuron of the LSTM model contains one cell state (cell) and three gate mechanisms. The cell state is the key of the LSTM model: it is the memory space of the model, similar to memory. The cell state changes over time, and the recorded information is determined and updated by the gate mechanisms. A gate is a way of selectively passing information, implemented by a sigmoid function and a point-wise multiplication. The sigmoid takes values between 0 and 1, and the multiplication determines how much information is transmitted (how much of each part can pass): a sigmoid value of 0 means that nothing is passed, and 1 means complete transmission (i.e., complete retention). The LSTM maintains information over time through three gates: a forget gate, an update gate, and an output gate.
Referring to fig. 3, fig. 3 is a schematic structural diagram of the LSTM algorithm according to an embodiment of the present application. As shown in fig. 3, the inputs of the first Actor model constructed with the LSTM algorithm may include the current state s_t of the financial market at the current time t, the hidden state of the LSTM network, which records the temporal correlations between the historical data, and the decision a_{t−1} of the previous time t−1; it outputs the decision a_t of the current time.
It can be seen that the internal state created by the recurrent (directed-cycle) architecture of the LSTM algorithm can handle time-series data and remember timing relationships, thus solving the long-term dependency problem. The LSTM algorithm has a strong feature-representation capability, i.e., it can learn rich temporal feature representations, and an agent based on the LSTM algorithm can mine temporal patterns in financial market data and remember historical states and actions.
As shown in fig. 4, the first Actor model based on the LSTM algorithm is composed of an input layer, an LSTM layer, and an output layer, where the input and output layers are fully connected layers whose dimensions match the feature input and the action output respectively. By using the LSTM network, the market-state sequence can be characterized to form the input features, overcoming the shortcoming that machine-learning methods using only cross-sectional data cannot capture the changing pattern of the market state.
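A minimal sketch of such an LSTM-based policy (Actor) network is given below in Python (PyTorch); the layer sizes and the two-class action output are illustrative assumptions consistent with the structure described above.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Input fully connected layer -> LSTM layer -> output fully connected layer."""
    def __init__(self, feature_dim: int = 16, hidden_dim: int = 64, n_actions: int = 2):
        super().__init__()
        self.fc_in = nn.Linear(feature_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, n_actions)

    def forward(self, states, hidden=None):
        # states: (batch, sequence_length, feature_dim) market-state sequence
        x = torch.relu(self.fc_in(states))
        out, hidden = self.lstm(x, hidden)
        logits = self.fc_out(out[:, -1])   # decision logits for the current time step
        return logits, hidden
```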
In this embodiment of the present application, the first Actor model may be obtained by pre-training through the following steps:
step A1: preparing collections
Figure BDA0003413077950000101
And random reordering is performed;
step A2: randomly initializing a weight value theta of a first Actor model;
step A3: select samples from D
Figure BDA0003413077950000102
Using the current network, in a state snComputing output actions a for inputn
Step A4: calculating a loss function value L and a derivative of the loss function value L to each weight value of the first Actor model
Figure BDA0003413077950000103
I.e. the gradient of the network parameter of the first Actor model;
step A5: updating the network parameter along the gradient direction of the network parameter of the first Actor model by a step length alpha;
step A6: repeating steps A3-A5 until the training time period is reached or the loss function value L converges.
In the embodiment of the present application, the loss function value L of the first Actor model may include, but is not limited to, the binary cross entropy. Taking the binary cross entropy as an example, the loss measures the difference between the action signal a_t^expert given by the preset expert rules and the predicted value a_t. In a binary classification problem, the cross entropy of each sample can be expressed as:

L_t = −[a_t^expert · log(p_t) + (1 − a_t^expert) · log(1 − p_t)]

The cross entropy of the entire set is then:

L = (1/N) · Σ_t L_t

where N is the number of samples, p_t is the probability that a sample is predicted to be 1, and x_t is the output of the fully connected output layer, with p_t = sigmoid(x_t) = 1 / (1 + e^(−x_t)). The action output is:

a_t = 1 if p_t ≥ 0.5, otherwise a_t = 0
It can be seen that, by using the preset expert rules, taking the financial market data as the state input and the expert decision actions as the labels for supervised learning, the policy network corresponding to the first Actor model can be raised from its random initialization to an approximately expert level. At this point, in financial-market risk early warning, the first Actor model can already make corresponding predictions in the environment.
However, the preset expert rules represented by the Bollinger-band signal only use the historical closing prices to calculate the upper, middle and lower rails and compare them with the closing price of the current day to compute the danger signal and obtain the expert actions, whereas the input of the policy network is selected from the expert factor library, whose dimensionality is much larger than that of the Bollinger-band signal input. The first Actor model therefore has limitations.
Therefore, in some possible embodiments, the first Actor model may be further optimized by using the first training set to obtain the first risk prediction model, so as to obtain a better decision. Specifically, after the machine learning is performed based on the first action set to obtain the first Actor model, the method further includes the following steps: extracting the first training set from the preset database; calculating the first training set based on a preset expert rule to obtain corresponding second action sets in different states; and optimizing the first Actor model based on the second action set to obtain the first risk prediction model.
In the embodiment of the application, the first Actor model obtained by pre-training with the preset expert rules performs downside-risk prediction, based on the preset expert rules, on the first training set to obtain the second action set a_t, while the designed quantitative downside-risk indicator provides a real label a_t^real.
In the embodiment of the application, similar to the pre-training process of the first Actor model, the binary cross-entropy function is calculated through supervised learning, the gradient is back-propagated to the first Actor model, and the optimization is continued, so that the first risk prediction model is obtained.
It can be seen that this method of optimizing the first Actor model based on the first training set does not directly use the Critic model but instead uses real data as the basis. The first Actor model is therefore not limited by the level of the Critic model, which improves the prediction accuracy of the model.
In some possible embodiments, the first Actor model may be optimized by using the first Critic model to obtain a first risk prediction model, so as to improve accuracy of risk prediction of the model.
The process of constructing the first Critic model is described below. In some possible embodiments, the process of constructing the first Critic model may include the following steps: performing feature extraction on the first training set to obtain a first state feature and a second state feature; splicing the first state characteristic and the second state characteristic to obtain a third state characteristic; performing machine learning based on the third state features to obtain a base model; inputting the third state characteristic into the base model to obtain a base model training result; obtaining a sequencing result of the base model according to the base model training result; determining a weighted weight of the base model according to the sorting result; and carrying out model fusion on the base model according to the weighted weight to obtain the first Critic model.
In the embodiment of the present application, the first training set is extracted from a preset database, and the definition of the first training set refers to the foregoing description, which is not repeated herein. The first status characteristic may be a traditional price indicator characteristic and the second status characteristic may be a characteristic of an expert factor; or the first status characteristic may be a characteristic of an expert factor, and the second status characteristic may be a characteristic of a conventional price indicator, which is not limited in the embodiment of the present application. And the third state characteristic is obtained by splicing the expert factor database data and the traditional price index data. Illustratively, the feature dimension of the traditional price index is P dimension, the feature dimension of the expert factor is Q dimension, and the feature dimension of the third state obtained after splicing is P + Q dimension.
Specifically, the third state features are used as the state input and fed to each machine-learning base model for training. In the embodiment of the present application, the machine learning method may be a classification algorithm such as logistic regression, decision tree, random forest, or adaptive boosting (AdaBoost), which is not limited to these. In the embodiment of the present application, a 0-1 logical variable may be used as the short (bearish) signal, where 1 represents that the subject matter to be predicted is at significant risk of falling in the future, and 0 represents that it is not. The output type is thus kept consistent with the first Actor model.
The obtained base models are then fused to obtain the first Critic model. The fusion method may use weighted averaging: the results of the screened base models in the base-model set are ranked according to preset evaluation indicators. For example, each base model is assigned to one of 5 tiers according to its prediction accuracy on the validation set, from high to low; the higher the accuracy, the higher the tier. The ranking tier of each model is used as its weight, a combined model result is obtained by weighted averaging, and the combined result is finally activated and converted into a logical variable with a fixed threshold, producing a short signal that can guide market-timing trades. At this point, a set of expert state-action data D = {(s_1, a'_1), (s_2, a'_2), …} is obtained.
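The base-model training and rank-weighted fusion described above can be sketched as follows with scikit-learn; the choice of base models, the accuracy metric, and the equally spaced tier weights are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

def build_first_critic(X_train, y_train, X_val, y_val):
    models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(),
              RandomForestClassifier(), AdaBoostClassifier()]
    scored = []
    for m in models:
        m.fit(X_train, y_train)
        scored.append((accuracy_score(y_val, m.predict(X_val)), m))
    scored.sort(key=lambda t: t[0], reverse=True)        # rank by validation accuracy
    # Higher-ranked (more accurate) models get a larger weight.
    weights = np.array([len(scored) - r for r in range(len(scored))], dtype=float)
    weights /= weights.sum()

    def predict_signal(X, threshold: float = 0.5):
        probs = np.stack([m.predict_proba(X)[:, 1] for _, m in scored], axis=1)
        fused = probs @ weights                           # weighted-average fusion
        return (fused >= threshold).astype(int)           # activate into a 0/1 short signal
    return predict_signal
```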
The first Actor model obtained by the earlier pre-training has only been trained with simple expert rules, and the deep relation between the state variables and the risk signal has not been mined. Therefore, in some possible embodiments, the first Actor model may be optimized by using the first Critic model to obtain the first risk prediction model, so as to improve the prediction accuracy of the model. Specifically, the states can be used as features and the actions as labels for supervised learning, further optimizing the first Actor model to obtain the first risk prediction model and improve the convergence of the model.
In some possible embodiments, the first Actor model may be optimized by using the second Critic model to obtain a first risk prediction model, so as to improve accuracy of risk prediction of the model.
The construction of the second Critic model is described below. In some possible embodiments, the construction process of the second Critic model may include the following steps: constructing a value network of the second Critic model, wherein the network structure of the value network is the same as that of the first Actor model; copying weight values outside an output layer of the first Actor model to the value network; training the value network based on the first training set to update the weight values of the value network; and taking the value network obtained after training as the second Critic model.
In an embodiment of the present application, the second Critic model may be constructed using a value network. The value network can cope with the time-varying nature of the capital market and improve the generalization ability of the model. The main network structure of the second Critic model is the same as that of the first Actor model, but its final output is a one-dimensional continuous value, i.e. a state value or a state–action value.
Specifically, the weight values of the first Actor model may be copied layer by layer, from shallow to deep, to the second Critic model, up to but excluding the final output layer. Since the first Actor model has already been trained several times, its weights have a strong ability to extract deep features of the state variables; copying the weights other than the output layer to the second Critic model, compared with starting the training from randomly initialized weights, consumes no additional data samples. The sampling efficiency of model optimization is thus further improved.
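A minimal sketch of this initialization is given below, assuming an Actor like the LSTMActor sketched earlier; replacing only the output head is one way to realize "copy all weights except the output layer".

```python
import copy
import torch.nn as nn

def build_second_critic(actor: nn.Module) -> nn.Module:
    critic = copy.deepcopy(actor)                              # same main network structure and weights
    critic.fc_out = nn.Linear(critic.fc_out.in_features, 1)   # fresh 1-dimensional value output head
    return critic
```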
Before the second Critic model is constructed, the first Actor model and the first Critic model can already output downside-risk signals in different states and execute the corresponding selling actions. Using the first training set in the trading environment, the first Actor model and the first Critic model respectively take the state s_t as input, perform the corresponding action a_t, and receive the reward r_{t+1} and the next state s_{t+1}. A trajectory [(s_t, a_t), r_{t+1}, s_{t+1}] starting at any time t can then be obtained. The reward value is set as follows:

Assume the target object is predicted to fall in price the next day. When a_t = 1, all held shares are sold at the current day's closing price p_t, giving the asset value Q_t = p_t × H_t at time t, where H_t is the number of shares held. When a_t = 0, a buy order is placed at the closing price of the current day. Meanwhile, the subsequent real trend of the market gives the next day's closing price p_{t+1}, so the asset value at time t+1 is Q_{t+1} = p_{t+1} × H_t. The reward value r is set as the ratio of the change in asset value from time t to time t+1 to the asset value at time t, i.e.:

r = (Q_{t+1} − Q_t) / Q_t
Therefore, whether the predicted trend matches the actual trend determines the sign of the reward value; the correspondence between trends and reward signs is shown in Table 3.
TABLE 3 Correspondence between predicted trend, actual trend and reward sign

Serial number | Output | Output meaning     | p_{t+1} − p_t condition | Actual trend | Reward sign
1             | 1      | Risk of falling    | > 0                     | Price rise   | −
2             | 1      | Risk of falling    | < 0                     | Price drop   | +
3             | 0      | No risk of falling | > 0                     | Price rise   | +
4             | 0      | No risk of falling | < 0                     | Price drop   | −
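The sketch below illustrates one possible reading of this reward rule together with Table 3; flipping the sign when a fall is predicted (a_t = 1) is an interpretation chosen so that correct predictions receive a positive reward, as in the table, and all variable names are illustrative:

```python
def step_reward(action: int, p_t: float, p_t1: float, holdings: float) -> float:
    """Reward = relative change of asset value between day t and day t+1 (Q_t = p_t * H_t)."""
    q_t = p_t * holdings
    q_t1 = p_t1 * holdings
    r = (q_t1 - q_t) / q_t          # formula from the text: (Q_{t+1} - Q_t) / Q_t
    # Interpretation only: reverse the sign when a fall was predicted (a_t = 1),
    # so that the reward sign matches Table 3 (correct prediction -> positive reward).
    return -r if action == 1 else r
```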
Starting for any state in the stored trajectory, the cumulative discount reward is calculated as the real value:
R_t = r_{t+1} + γ·r_{t+2} + γ^2·r_{t+3} + … + γ^(T−t)·r_T
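The cumulative discounted reward can be computed backwards over the stored reward sequence; below is a generic sketch (not code from the embodiment):

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """rewards = [r_{t+1}, ..., r_T]; returns[i] is the cumulative discounted reward R_t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running  # R_t = r_{t+1} + gamma * R_{t+1}
        returns[i] = running
    return returns
```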
At the same time, using the trajectory [(s_t, a_t), r_{t+1}, s_{t+1}], the states s_t at the different times are taken as input and the value v_t is computed with the second Critic model. The difference between R_t and v_t is taken as the advantage function A_t = R_t − v_t, and the loss function of the second Critic model is

L_critic = (1/M) Σ_t (R_t − v_t)^2

where M is the number of samples. The second Critic model is then updated by back-propagation.
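A minimal sketch of this advantage and Critic-loss computation is shown below (PyTorch is assumed; writing the loss as a mean-squared error over the M samples is a reconstruction consistent with the text, not a verbatim formula from the embodiment):

```python
import torch

def critic_loss(critic, states: torch.Tensor, returns: torch.Tensor):
    """critic maps a batch of states to one value per state."""
    values = critic(states).squeeze(-1)   # v_t for every stored state
    advantages = returns - values         # A_t = R_t - v_t
    loss = (advantages ** 2).mean()       # (1/M) * sum_t (R_t - v_t)^2
    return loss, advantages.detach()      # detach A_t for the later Actor update
```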
In the embodiment of the present application, reference may be made to fig. 5 for the process of interactive training of the second Critic model and the first Actor model. As shown in fig. 5, the interactive training diagram includes a Critic network and two Actor networks (a new Actor network and an old Actor network), and the training process includes the following steps:

Step B1: at time t, the state variable s_t of the environment is input to the new Actor network.

The output dimension of the new Actor network is 2; the two outputs are used as the probabilities of the two classes 0 and 1 of a categorical distribution, from which μ and σ are obtained. Illustratively, if the probabilities of sampling 0 and 1 from the categorical distribution are 70% and 30% respectively, then μ and σ may be 7 and 3 respectively. The purpose of building the categorical distribution is to sample the action a_t. The action a_t is input into the environment to obtain the reward r_{t+1} and the next state s_{t+1}, and the trajectory [(s_t, a_t), r_{t+1}, s_{t+1}] is stored. Then s_{t+1} is input to the new Actor network, and this step is repeated until a certain number of trajectories have been stored. Note that the new Actor network is not updated during this process.
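A hedged sketch of this collection step is given below (PyTorch is assumed; the trading-environment API with reset/step is hypothetical, and only the categorical sampling and trajectory storage are illustrated):

```python
import torch
from torch.distributions import Categorical

def collect_trajectory(actor, env, steps: int):
    """Roll out the (fixed) new Actor for a number of steps and store transitions."""
    trajectory = []
    state = env.reset()
    for _ in range(steps):
        logits = actor(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(logits=logits)  # two classes: 0 = no fall risk, 1 = fall risk
        action = dist.sample()
        next_state, reward, _ = env.step(action.item())  # hypothetical env API
        trajectory.append((state, action.item(), reward, next_state))
        state = next_state  # the new Actor network is NOT updated during collection
    return trajectory
```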
Step B2: starting for any state in the stored trajectory, the cumulative discount reward is calculated as the real value:
R_t = r_{t+1} + γ·r_{t+2} + γ^2·r_{t+3} + … + γ^(T−t)·r_T

Step B3: all the stored states s are input as a batch into the Critic network to obtain the value estimates v_t of all the states. The difference between R_t and v_t is computed as the advantage function A_t = R_t − v_t, and the loss function of the Critic network is

L_critic = (1/M) Σ_t (R_t − v_t)^2

where M is the number of samples. The Critic network is then updated by back-propagation.

Step B4: all the stored states s are input as a batch into both the old Actor network and the new Actor network to obtain their respective outputs <μ1, σ1> and <μ2, σ2>. Since the old Actor network and the new Actor network have the same network structure, their probability density functions are denoted PDF1 and PDF2, respectively. The probabilities of the stored actions under PDF1 and PDF2 are obtained, giving prob1 and prob2 for each action, and the ratio of prob2 to prob1 is taken as the importance weight (IW), that is, IW = prob2 / prob1.
Step B5: computing a substitute objective function
IW · A_t,

and clipping the surrogate objective function gives

clip(IW, 1 − ξ, 1 + ξ) · A_t.

The clipping works as follows: when A_t > 0, if IW > 1 + ξ the clipped term is (1 + ξ) · A_t, and if IW < 1 + ξ it is IW · A_t; when A_t < 0, if IW > 1 − ξ the clipped term is IW · A_t, and if IW < 1 − ξ it is (1 − ξ) · A_t. Here ξ is the clipping ratio, which may be set to 0.2. The objective function over the stored trajectory is then computed as

J = (1/M) Σ_t min(IW_t · A_t, clip(IW_t, 1 − ξ, 1 + ξ) · A_t),

and back-propagation is performed to update the new Actor network.
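Below are minimal sketches of step B4 (importance weight) and step B5 (clipped surrogate objective). PyTorch is assumed, both Actor networks are taken to output categorical logits, and the standard PPO-style clipping is used as a reconstruction consistent with the piecewise cases above:

```python
import torch
from torch.distributions import Categorical

def importance_weights(old_actor, new_actor, states: torch.Tensor, actions: torch.Tensor):
    """Step B4: IW = prob2 / prob1 for every stored action."""
    with torch.no_grad():
        prob1 = Categorical(logits=old_actor(states)).log_prob(actions).exp()
    prob2 = Categorical(logits=new_actor(states)).log_prob(actions).exp()
    return prob2 / prob1

def clipped_surrogate(iw: torch.Tensor, advantages: torch.Tensor, xi: float = 0.2):
    """Step B5: J = (1/M) * sum_t min(IW_t * A_t, clip(IW_t, 1 - xi, 1 + xi) * A_t)."""
    unclipped = iw * advantages
    clipped = torch.clamp(iw, 1.0 - xi, 1.0 + xi) * advantages
    return torch.min(unclipped, clipped).mean()

# usage with an optimizer over the new Actor's parameters (maximize J = minimize -J):
# iw = importance_weights(old_actor, new_actor, states, actions)
# loss = -clipped_surrogate(iw, advantages)
# loss.backward(); optimizer.step()
```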
Step B6: after the steps of B4-B5 are circulated for a certain number of steps, the circulation is ended, and the old Actor network is updated by the new Actor network weight;
step B7: the steps B1-B6 are cycled until the model converges or a specified number of steps is reached.
It can be seen that the reward accumulated effectively through the second Critic model yields a larger parameter gradient when the first Actor model is optimized, so the first risk prediction model converges faster.
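Purely for orientation, the sketch below shows how steps B1–B7 fit together, reusing the helper sketches above; the optimizer handling, batch construction, and step counts are assumptions rather than details from the embodiment:

```python
import torch

def train(old_actor, new_actor, critic, env, actor_opt, critic_opt,
          outer_iters: int = 100, inner_epochs: int = 10, gamma: float = 0.99):
    for _ in range(outer_iters):                                            # step B7
        traj = collect_trajectory(new_actor, env, steps=256)                # step B1
        states = torch.tensor([s for s, _, _, _ in traj], dtype=torch.float32)
        actions = torch.tensor([a for _, a, _, _ in traj])
        returns = torch.tensor(discounted_returns([r for _, _, r, _ in traj], gamma))  # step B2
        for _ in range(inner_epochs):                                       # steps B3-B5
            loss_c, adv = critic_loss(critic, states, returns)              # step B3: Critic update
            critic_opt.zero_grad(); loss_c.backward(); critic_opt.step()
            iw = importance_weights(old_actor, new_actor, states, actions)  # step B4
            loss_a = -clipped_surrogate(iw, adv)                            # step B5: Actor update
            actor_opt.zero_grad(); loss_a.backward(); actor_opt.step()
        old_actor.load_state_dict(new_actor.state_dict())                   # step B6
```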
In some possible embodiments, in order for the first Actor model to cope with the time-varying nature of the financial market, the optimization goal may also jointly consider the first Critic model, the second Critic model, and the real-market performance. The optimization problem then becomes a multi-task optimization problem. The optimization process for the first Actor model may further include the following steps:
acquiring a risk function based on the preset database; optimizing the risk function based on the first Critic model and the second Critic model to obtain an optimized risk function; and optimizing the first Actor model based on the optimized risk function to obtain the first risk prediction model.
Specifically, real-market data is acquired from the preset database to construct a risk function. Real-market data refers to the labeled financial time-series data of the object in the preset database, and such financial time series are time-varying. The real-market data may include, for example, the opening price, closing price, highest price, lowest price, trading volume, and so on. The risk function is optimized based on the first Critic model and the second Critic model to obtain an optimized risk function. The first risk prediction model obtained by optimizing the first Actor model with this optimized risk function thus takes into account the first Critic model, the second Critic model, and the real-market data, so the first risk prediction model has a certain ability to cope with the time-varying nature of the financial market.
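The embodiment does not specify the exact form of the optimized risk function; the sketch below only illustrates one simple way such a multi-task objective could be combined (a weighted sum of the two Critic scores and a real-market risk term), with hypothetical weights:

```python
def combined_objective(score_critic1: float, score_critic2: float, risk_term: float,
                       w1: float = 1.0, w2: float = 1.0, w3: float = 1.0) -> float:
    """Higher Critic scores are better; a larger real-market risk term is worse."""
    return w1 * score_critic1 + w2 * score_critic2 - w3 * risk_term
```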
In some possible embodiments, after performing step S204, the following steps may be further included: extracting a verification data set from the preset database; verifying the first risk prediction model based on the verification data set to obtain a second risk prediction model; training the second risk prediction model based on the first training set and the second training set to obtain a third risk prediction model; and inputting the target state characteristics into the third risk prediction model to obtain a risk value of the target object on the prediction date.
In the embodiment of the present application, the verification data set is obtained from the preset database. The preset database is divided by timestamp into the first training set, the second training set, and the verification data set, where the verification data set consists of data outside the two training sets that is closer to the present. For example, if ten years of data from January 1, 2010 to December 31, 2020 are used as the first training set and the second training set, the verification data set may be the data from January 1, 2021 to the current time.
As described above, the first Actor model may be optimized using one or more of the first training set, the first Critic model, the second Critic model, and the risk function to obtain the first risk prediction model. In the embodiment of the present application, the first risk prediction models obtained with the different optimization approaches can be verified on the verification data set, and the model with the highest prediction accuracy on the verification data set is selected as the second risk prediction model. The training configuration of the second risk prediction model is then fixed; the training configuration may be the model structure, the model parameters, or the training mode, which is not limited here. The first training set and the second training set are merged, and the second risk prediction model is trained on the merged data to obtain the third risk prediction model. The target state is then taken as the input of the third risk prediction model to predict the risk value of the target object. In this way, the third risk prediction model obtained through multiple rounds of training can improve the accuracy of risk prediction and is beneficial to risk decision-making.
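A sketch of this selection-and-retraining flow is given below; evaluate and retrain are hypothetical helpers standing in for the accuracy measurement and training procedure of the embodiment:

```python
def select_and_retrain(candidates, validation_set, train_set_1, train_set_2,
                       evaluate, retrain):
    """Pick the candidate with the highest validation accuracy, then retrain it on the merged sets."""
    second_model = max(candidates, key=lambda m: evaluate(m, validation_set))
    merged = train_set_1 + train_set_2        # assumes list-like training sets
    third_model = retrain(second_model, merged)
    return third_model
```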
It can be seen that, in the embodiment of the present application, exploiting the online-learning nature of reinforcement learning, the first risk prediction model is obtained by further optimizing the first Actor model in different ways, such as with the first training set, the first Critic model, or the second Critic model, which provides an exploration space for the first Actor model to keep improving its performance. The first risk prediction model then converges well, so it can be dynamically optimized according to the time-varying characteristics of the market, further improving the adaptability, robustness, and anti-interference capability of the risk prediction model. Therefore, the accuracy of risk prediction can be improved and a reliable decision scheme provided.
The method of the embodiments of the present application is set forth above in detail and the apparatus of the embodiments of the present application is provided below.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a risk prediction apparatus based on reinforcement learning according to an embodiment of the present disclosure. The device is applied to electronic equipment. As shown in fig. 6, the apparatus 600 for risk prediction based on reinforcement learning includes a receiving unit 601 and a processing unit 602, which are described in detail as follows:
the receiving unit 601 is configured to receive a risk prediction request of a target object, where the risk prediction request includes a prediction date;
the processing unit 602 is configured to obtain a receiving date of the risk prediction request and target history data of the target object N days before the receiving date, where N is a positive integer greater than or equal to 1; extracting the characteristics of the target historical data to obtain target state characteristics corresponding to each preset characteristic dimension in a plurality of preset characteristic dimensions; inputting the target state features into a first risk prediction model to obtain a risk value of the target object on the prediction date, wherein the first risk prediction model is a model obtained by optimizing a first Actor model based on a first training set, a first Critic model or a second Critic model, the first Actor model is obtained by training based on a second training set, the first training set and the second training set are historical data extracted from a preset database, and the preset database comprises the target historical data.
In some possible embodiments, the processing unit 602 is further configured to extract the second training set from the preset database; calculating the second training set based on a preset expert rule to obtain corresponding first action sets in different states; and performing machine learning based on the first action set to obtain the first Actor model.
In some possible embodiments, the processing unit 602 is further configured to perform feature extraction on the first training set to obtain a first state feature and a second state feature; splicing the first state characteristic and the second state characteristic to obtain a third state characteristic; performing machine learning based on the third state features to obtain a base model; inputting the third state characteristic into the base model to obtain a base model training result; obtaining a sequencing result of the base model according to the base model training result; determining a weighted weight of the base model according to the sorting result; and carrying out model fusion on the base model according to the weighted weight to obtain the first Critic model.
In some possible embodiments, the processing unit 602 is further configured to construct a value network of the second Critic model, where a network structure of the value network is the same as a network structure of the first Actor model; copying weight values outside an output layer of the first Actor model to the value network; training the value network based on the first training set to update the weight values of the value network; and taking the value network obtained after training as the second Critic model.
In some possible embodiments, the processing unit 602 is further configured to extract the first training set from the preset database; calculating the first training set based on a preset expert rule to obtain corresponding second action sets in different states; optimizing the first Actor model based on the second action set to obtain the first risk prediction model; or optimizing the first Actor model based on the first Critic model or the second Critic model to obtain the first risk prediction model.
In some possible embodiments, the processing unit 602 is further configured to obtain a risk function based on the preset database; optimizing the risk function based on the first Critic model and the second Critic model to obtain an optimized risk function; and optimizing the first Actor model based on the optimized risk function to obtain the first risk prediction model.
In some possible embodiments, the processing unit 602 is further configured to extract a verification data set from the preset database; verifying the first risk prediction model based on the verification data set to obtain a second risk prediction model; training the second risk prediction model based on the first training set and the second training set to obtain a third risk prediction model; and inputting the target state characteristics into the third risk prediction model to obtain a risk value of the target object on the prediction date.
It should be noted that the implementation of each unit may also correspond to the corresponding description of the method embodiment shown in fig. 2.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 7, the computer device 700 comprises a processor 701, a memory 702 and a communication interface 703, wherein the memory 702 stores a computer program 704. The processor 701, the memory 702, the communication interface 703 and the computer program 704 may be connected by a bus 705.
When the computer device is an electronic device, the computer program 704 is used for executing the following steps:
receiving a risk prediction request for a target subject matter, the risk prediction request including a prediction date;
acquiring a receiving date of the risk prediction request and target history data of the target object N days before the receiving date, wherein N is a positive integer greater than or equal to 1;
extracting the characteristics of the target historical data to obtain target state characteristics corresponding to each preset characteristic dimension in a plurality of preset characteristic dimensions;
inputting the target state features into a first risk prediction model to obtain a risk value of the target object on the prediction date, wherein the first risk prediction model is a model obtained by optimizing a first Actor model based on a first training set, a first Critic model or a second Critic model, the first Actor model is obtained by training based on a second training set, the first training set and the second training set are historical data extracted from a preset database, and the preset database comprises the target historical data.
In some possible embodiments, before the target state features are input into the first risk prediction model to obtain the risk value of the target object on the prediction date, the computer program 704 is further used to execute the following steps:
extracting the second training set from the preset database;
calculating the second training set based on a preset expert rule to obtain corresponding first action sets in different states;
and performing machine learning based on the first action set to obtain the first Actor model.
In some possible embodiments, after said extracting said second training set from the preset database, said computer program 704 is further for executing the instructions of:
performing feature extraction on the first training set to obtain a first state feature and a second state feature;
splicing the first state characteristic and the second state characteristic to obtain a third state characteristic;
performing machine learning based on the third state features to obtain a base model;
inputting the third state characteristic into the base model to obtain a base model training result;
obtaining a sequencing result of the base model according to the base model training result;
determining a weighted weight of the base model according to the sorting result;
and carrying out model fusion on the base model according to the weighted weight to obtain the first Critic model.
In some possible embodiments, after said extracting said second training set from the preset database, said computer program 704 is further for executing the instructions of:
constructing a value network of the second Critic model, wherein the network structure of the value network is the same as that of the first Actor model;
copying weight values outside an output layer of the first Actor model to the value network;
training the value network based on the first training set to update the weight values of the value network;
and taking the value network obtained after training as the second Critic model.
In some possible embodiments, after the machine learning based on the first set of actions to obtain the first Actor model, the computer program 704 is further configured to execute the following steps:
extracting the first training set from the preset database;
calculating the first training set based on a preset expert rule to obtain corresponding second action sets in different states;
optimizing the first Actor model based on the second action set to obtain the first risk prediction model; or
And optimizing the first Actor model based on the first Critic model or the second Critic model to obtain the first risk prediction model.
In some possible embodiments, the computer program 704 is further for instructions to perform the steps of:
acquiring a risk function based on the preset database;
optimizing the risk function based on the first Critic model and the second Critic model to obtain an optimized risk function;
and optimizing the first Actor model based on the optimized risk function to obtain the first risk prediction model.
In some possible embodiments, after said inputting said target state characteristic to a first risk prediction model resulting in a risk value for said target subject on said prediction date, said computer program 704 is further for executing the instructions of:
extracting a verification data set from the preset database;
verifying the first risk prediction model based on the verification data set to obtain a second risk prediction model;
training the second risk prediction model based on the first training set and the second training set to obtain a third risk prediction model;
and inputting the target state characteristics into the third risk prediction model to obtain a risk value of the target object on the prediction date.
Those skilled in the art will appreciate that only one memory and processor are shown in fig. 7 for ease of illustration. In an actual terminal or server, there may be multiple processors and memories. The memory 702 may also be referred to as a storage medium or a storage device, and the like, which is not limited in this embodiment.
It should be understood that in the embodiments of the present application, the processor 701 may be a Central Processing Unit (CPU), and the processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like.
It will also be appreciated that the memory 702 referred to in the embodiments of the present application may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous dynamic RAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
It should be noted that when the processor 701 is a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, the memory (memory module) is integrated into the processor.
It is to be noted that the memory 702 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The bus 705 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. But for clarity of illustration the various buses are labeled as buses in the figures.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, this is not described in detail here.
In the embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various Illustrative Logical Blocks (ILBs) and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, digital subscriber line) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), among others.
In the above-described embodiments, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like, and the storage data area may store data created according to the use of the blockchain node, and the like. For example, the blockchain may store the common price indices of the target object in the preset database, the expert factor library, and so on; the preset expert rules may also be stored, including quantitative timing methods such as Bollinger Bands, RSRS, and MACD; and classification machine-learning algorithms such as logistic regression, decision tree, random forest, and AdaBoost, or algorithms used in reinforcement learning such as Q-Learning, policy gradient, Actor-Critic, and LSTM algorithms, may also be stored, which is not limited herein.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (blockchain), which is essentially a decentralized database, is a string of data blocks associated by using cryptography, and each data block contains information of a batch of network transactions, which is used to verify the validity (anti-counterfeiting) of the information and generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Embodiments of the present application also provide a computer storage medium, which stores a computer program, where the computer program is executed by a processor to implement part or all of the steps of any one of the methods for risk prediction based on reinforcement learning as described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the methods of reinforcement learning based risk prediction as set forth in the above method embodiments.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for risk prediction based on reinforcement learning, comprising:
receiving a risk prediction request for a target subject matter, the risk prediction request including a prediction date;
acquiring a receiving date of the risk prediction request and target history data of the target object N days before the receiving date, wherein N is a positive integer greater than or equal to 1;
extracting the characteristics of the target historical data to obtain target state characteristics corresponding to each preset characteristic dimension in a plurality of preset characteristic dimensions;
inputting the target state features into a first risk prediction model to obtain a risk value of the target object on the prediction date, wherein the first risk prediction model is a model obtained by optimizing a first Actor model based on a first training set, a first Critic model or a second Critic model, the first Actor model is obtained by training based on a second training set, the first training set and the second training set are historical data extracted from a preset database, and the preset database comprises the target historical data.
2. The method of claim 1, wherein prior to said inputting said target state characteristic into a first risk prediction model resulting in a risk value of said target subject matter on said prediction date, said method further comprises:
extracting the second training set from the preset database;
calculating the second training set based on a preset expert rule to obtain corresponding first action sets in different states;
and performing machine learning based on the first action set to obtain the first Actor model.
3. The method of claim 2, wherein after said extracting the second training set from the preset database, the method further comprises:
performing feature extraction on the first training set to obtain a first state feature and a second state feature;
splicing the first state characteristic and the second state characteristic to obtain a third state characteristic;
performing machine learning based on the third state features to obtain a base model;
inputting the third state characteristic into the base model to obtain a base model training result;
obtaining a sequencing result of the base model according to the base model training result;
determining a weighted weight of the base model according to the sorting result;
and carrying out model fusion on the base model according to the weighted weight to obtain the first Critic model.
4. The method of claim 2, wherein after said extracting the second training set from the preset database, the method further comprises:
constructing a value network of the second Critic model, wherein the network structure of the value network is the same as that of the first Actor model;
copying weight values outside an output layer of the first Actor model to the value network;
training the value network based on the first training set to update the weight values of the value network;
and taking the value network obtained after training as the second Critic model.
5. The method of claim 2, wherein after the machine learning based on the first set of actions, resulting in the first Actor model, the method further comprises:
extracting the first training set from the preset database;
calculating the first training set based on a preset expert rule to obtain corresponding second action sets in different states;
optimizing the first Actor model based on the second action set to obtain the first risk prediction model; or
And optimizing the first Actor model based on the first Critic model or the second Critic model to obtain the first risk prediction model.
6. The method of claim 1, further comprising:
acquiring a risk function based on the preset database;
optimizing the risk function based on the first Critic model and the second Critic model to obtain an optimized risk function;
and optimizing the first Actor model based on the optimized risk function to obtain the first risk prediction model.
7. The method of any one of claims 1-6, wherein after said inputting said target state characteristic to a first risk prediction model results in a risk value for said target subject on said prediction date, said method further comprises:
extracting a verification data set from a preset database;
verifying the first risk prediction model based on the verification data set to obtain a second risk prediction model;
training the second risk prediction model based on the first training set and the second training set to obtain a third risk prediction model;
and inputting the target state characteristics into the third risk prediction model to obtain a risk value of the target object on the prediction date.
8. An apparatus for risk prediction based on reinforcement learning, comprising:
a receiving unit configured to receive a risk prediction request of a target object, the risk prediction request including a prediction date;
a processing unit, configured to obtain a reception date of the risk prediction request and target history data of the target object N days before the reception date, where N is a positive integer greater than or equal to 1;
extracting the characteristics of the target historical data to obtain target state characteristics corresponding to each preset characteristic dimension in a plurality of preset characteristic dimensions;
inputting the target state features into a first risk prediction model to obtain a risk value of the target object on the prediction date, wherein the first risk prediction model is a model obtained by optimizing a first Actor model based on a first training set, a first Critic model or a second Critic model, the first Actor model is obtained by training based on a second training set, the first training set and the second training set are historical data extracted from a preset database, and the preset database comprises the target historical data.
9. A computer device, characterized in that it comprises a processor, a memory and a communication interface, wherein the memory stores a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program causing a computer to execute to implement the method of any one of claims 1-7.
CN202111535520.9A 2021-12-15 2021-12-15 Risk prediction method, device and equipment based on reinforcement learning and storage medium Pending CN114240656A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111535520.9A CN114240656A (en) 2021-12-15 2021-12-15 Risk prediction method, device and equipment based on reinforcement learning and storage medium
PCT/CN2022/090029 WO2023108987A1 (en) 2021-12-15 2022-04-28 Risk prediction method and apparatus based on reinforcement learning, and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111535520.9A CN114240656A (en) 2021-12-15 2021-12-15 Risk prediction method, device and equipment based on reinforcement learning and storage medium

Publications (1)

Publication Number Publication Date
CN114240656A true CN114240656A (en) 2022-03-25

Family

ID=80756457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111535520.9A Pending CN114240656A (en) 2021-12-15 2021-12-15 Risk prediction method, device and equipment based on reinforcement learning and storage medium

Country Status (2)

Country Link
CN (1) CN114240656A (en)
WO (1) WO2023108987A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115630754A (en) * 2022-12-19 2023-01-20 北京云驰未来科技有限公司 Intelligent networking automobile information security prediction method, device, equipment and medium
WO2023108987A1 (en) * 2021-12-15 2023-06-22 平安科技(深圳)有限公司 Risk prediction method and apparatus based on reinforcement learning, and device and storage medium
WO2024040817A1 (en) * 2022-08-25 2024-02-29 深圳市富途网络科技有限公司 Bond risk information processing method based on big data and related device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934486B (en) * 2023-09-15 2024-01-12 深圳市蓝宇飞扬科技有限公司 Decision evaluation method and system based on deep learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798608A (en) * 2017-10-19 2018-03-13 深圳市耐飞科技有限公司 A kind of investment product combined recommendation method and system
CN110059896A (en) * 2019-05-15 2019-07-26 浙江科技学院 A kind of Prediction of Stock Index method and system based on intensified learning
CN112101520A (en) * 2020-08-10 2020-12-18 中国平安人寿保险股份有限公司 Risk assessment model training method, business risk assessment method and other equipment
CN112488826A (en) * 2020-12-16 2021-03-12 北京逸风金科软件有限公司 Method and device for optimizing bank risk pricing based on deep reinforcement learning
CN112927085B (en) * 2021-04-14 2021-10-26 广州经传多赢投资咨询有限公司 Stock risk early warning system based on block chain, big data and algorithm
CN114240656A (en) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 Risk prediction method, device and equipment based on reinforcement learning and storage medium


Also Published As

Publication number Publication date
WO2023108987A1 (en) 2023-06-22

Similar Documents

Publication Publication Date Title
Zhou et al. Stock market prediction on high‐frequency data using generative adversarial nets
Ghasemiyeh et al. A hybrid artificial neural network with metaheuristic algorithms for predicting stock price
Arévalo et al. High-frequency trading strategy based on deep neural networks
US20210004682A1 (en) Adapting a sequence model for use in predicting future device interactions with a computing system
CN114240656A (en) Risk prediction method, device and equipment based on reinforcement learning and storage medium
Pimenta et al. An automated investing method for stock market based on multiobjective genetic programming
Barde et al. An empirical validation protocol for large-scale agent-based models
US11586919B2 (en) Task-oriented machine learning and a configurable tool thereof on a computing environment
Lv et al. [Retracted] An Economic Forecasting Method Based on the LightGBM‐Optimized LSTM and Time‐Series Model
Stergiou et al. Application of deep learning and chaos theory for load forecasting in Greece
Qiao et al. Prediction of stock return by LSTM neural network
Lotfi et al. Artificial intelligence methods: toward a new decision making tool
Khan et al. Anomalous Behavior Detection Framework Using HTM‐Based Semantic Folding Technique
Majidi et al. Algorithmic trading using continuous action space deep reinforcement learning
Li et al. [Retracted] A Study of Different Existing Methods for the Stock Selection in the Field of Quantitative Investment
Patil et al. Wrapper-based feature selection and optimization-enabled hybrid deep learning framework for stock market prediction
Shrivastava et al. Prediction interval estimations for electricity demands and prices: a multi‐objective approach
Feng et al. Railway freight volume forecast using an ensemble model with optimised deep belief network
Alghamdi et al. A novel hybrid deep learning model for stock price forecasting
Alalaya et al. Combination method between fuzzy logic and neural network models to predict amman stock exchange
Wei et al. Stock index trend prediction based on TabNet feature selection and long short-term memory
Sheelapriya et al. Stock price trend prediction using Bayesian regularised radial basis function network model
Kanzari Context-adaptive intelligent agents behaviors: multivariate LSTM-based decision making on the cryptocurrency market
Wang et al. An integrative extraction approach for index-tracking portfolio construction and forecasting under a deep learning framework
Wen et al. Electricity demand forecasting and risk management using Gaussian process model with error propagation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40063334

Country of ref document: HK

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20220325