CN113298279A - Stage-distinguishing crowd funding progress prediction method - Google Patents

Stage-distinguishing crowd funding progress prediction method

Info

Publication number
CN113298279A
CN113298279A
Authority
CN
China
Prior art keywords
progress
crowd funding
time step
state
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010107306.2A
Other languages
Chinese (zh)
Inventor
Qi Liu (刘淇)
Jun Wang (王俊)
Hefu Zhang (章和夫)
Zhen Pan (潘镇)
Kai Zhang (张凯)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202010107306.2A
Publication of CN113298279A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/04 - Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/10 - Office automation; Time management
    • G06Q10/103 - Workflow collaboration or project management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 - Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 - Services
    • G06Q50/26 - Government or public services
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Primary Health Care (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a stage-distinguishing crowd funding progress prediction method, which comprises the following steps: crawling progress data of crowd funding projects, and extracting static features and dynamic features from the progress data; regarding users of a crowd funding platform as an agent and crowd funding projects as the environment, and training the agent by reinforcement learning in combination with the static features and dynamic features of the crowd funding projects; and judging, through the trained agent, the stage in which a crowd funding project to be predicted is located, and predicting the future progress of the crowd funding project with the corresponding strategy. The method can improve the accuracy of progress prediction.

Description

Stage-distinguishing crowd funding progress prediction method
Technical Field
The invention relates to the field of Internet crowd funding, and in particular to a stage-distinguishing crowd funding progress prediction method.
Background
Crowd funding progress prediction means predicting the daily progress (expressed as a percentage) of a given number of future days from the sequence of daily progress changes observed so far for a crowd funding project. The progress sequence carries domain-specific characteristics, and how to better integrate these characteristics into the progress prediction problem is an open and widely studied research question.
In current research work and patents, the following methods are mainly used for progress prediction in the crowd funding field:
1) Decomposition-synthesis prediction methods based on pure time series analysis.
Work based on pure time series analysis aims at mining the potentially different patterns of the sequence itself and synthesizing the final sequence from the predictions made within these patterns. Analysis in previous work verifies that decomposing the sequence and mapping it to different spaces can indeed help improve the accuracy of the predicted progress.
2) Sequence-to-sequence recurrent neural network prediction methods.
Progress prediction methods based on recurrent neural networks use the neural network structure to automatically extract dynamically changing characteristics (including comment information and progress changes) and add prior information specific to the crowd funding field as constraints. Experiments show that such methods are better suited to sequence prediction in the crowd funding field.
Although both kinds of methods consider the characteristics of crowd funding sequences to a certain extent and try to give different prediction results for different patterns, they do not fully consider the interaction process between users and crowd funding projects. For example, the progress at the current time may affect a user's decision, and the user's decision in turn affects the project's progress at the next time; the two influence each other in a dynamic interaction process.
Disclosure of Invention
The invention aims to provide a stage-distinguishing crowd funding progress prediction method which can improve the accuracy of progress prediction.
The purpose of the invention is realized by the following technical scheme:
a stage-differentiated crowd funding progress prediction method comprises the following steps:
crawling progress data of crowd funding projects, and extracting static features and dynamic features of the progress data;
the method comprises the steps that users of a crowd funding platform are regarded as an intelligent agent, crowd funding projects are regarded as environments, and the intelligent agent is trained in a reinforcement learning mode by combining static features and dynamic features of the crowd funding projects;
and judging the stage of the crowd funding project to be predicted through the trained intelligent agent, and predicting future progress of the crowd funding project by using a corresponding strategy.
According to the technical scheme provided by the invention, the interaction process between users and crowd funding projects in the crowd funding field is fully considered, stage-differentiated progress prediction can be realized, and the accuracy of progress prediction is greatly improved compared with the prior art. In addition, the prediction result helps a project publisher decide whether to start production of the crowd funded product in advance and to control the production schedule; for public-welfare crowd funding, the prediction also helps judge whether a project will succeed and roughly how many days it will need, which can ultimately produce social impact. Moreover, it can help a website or crowd funding platform recommend projects to suitable users in a personalized way.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a stage-differentiated crowd funding progress prediction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the actor inside the agent according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the differences between the different sub-strategies after training.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a stage-distinguishing crowd funding progress prediction method, which mainly comprises the following steps:
1. and crawling progress data of crowd funding projects, and extracting static features and dynamic features of the progress data.
In the embodiment of the invention, progress data of crowd funding projects is crawled from a crowd funding platform (e.g., Indiegogo).
Progress data of a crowd funding project includes: the title, labels, project category, financing goal, project introduction, daily comments, daily progress changes, number of financed days, and number of remaining financing days. The title, labels, project category, financing goal and project introduction are static data; the daily comments, daily progress changes, number of financed days and number of remaining financing days are dynamic data.
The text information of the static data and dynamic data is converted into corresponding vectorized representations using an embedding model (Word2Vec). Then, static features are extracted from the vectorized representation of the static data, the daily comments and day information are extracted from the vectorized representation of the dynamic data as dynamic features, and the daily progress changes are extracted as the labels to be predicted (i.e. the progress sequence below). Finally, min-max normalization is applied to all the feature data.
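The pipeline just described could be sketched as follows. This is a minimal illustration assuming gensim's Word2Vec and a simple token-average for text fields; the function names, placeholder corpus and dimensions are illustrative, not from the patent.

import numpy as np
from gensim.models import Word2Vec

def embed_text(model, tokens, dim=100):
    # Average the Word2Vec vectors of known tokens; zero vector if none are known.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def min_max_normalize(x):
    # Min-max normalization onto [0, 1]; constant columns are left unscaled.
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / np.where(hi - lo > 0, hi - lo, 1.0)

# Train the embedding on tokenized project texts (titles, labels,
# introductions, daily comments), then build per-project / per-day features.
corpus = [["smart", "watch", "kickoff"], ["open", "source", "camera"]]  # placeholder
w2v = Word2Vec(sentences=corpus, vector_size=100, min_count=1)

static_x = embed_text(w2v, ["smart", "watch"])      # static features X_i
comment_e = embed_text(w2v, ["open", "camera"])     # comment feature e_i^t
days_d = np.array([3.0, 57.0])                      # (financed days, remaining days)
dynamic_c = np.concatenate([comment_e, days_d])     # one day's dynamic feature c_i^t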
A crowd funding project i is represented as a triple (X_i, C_i, P_i), where X_i represents the static features, and C_i and P_i are dynamic sequences representing the project's dynamic feature sequence and progress sequence, respectively.
Crowd funding progress prediction means that, given the static features X_i of crowd funding project i, the dynamic feature sequence {c_i^t, t = 1, ..., T - τ} and the project progress sequence {p_i^t, t = 1, ..., T - τ} of the first T - τ days, the project progress sequence {p_i^t, t = T - τ + 1, ..., T} of the next τ days is predicted. The dynamic feature vector c_i^t of day t of project i is composed of the feature vector e_i^t of the comment information in the progress data of the crowd funded project and the vector d_i^t formed from the day information (number of financed days, number of remaining financing days), with t = 1, ..., T - τ, where T is the total number of crowd funding days.
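For concreteness, the triple could be held in a small container such as the following sketch; the class and field names are illustrative assumptions, not terms from the patent.

from dataclasses import dataclass
import numpy as np

@dataclass
class CrowdfundingProject:
    static: np.ndarray    # X_i: static feature vector
    dynamic: np.ndarray   # C_i: (T - tau) x d matrix of daily dynamic features
    progress: np.ndarray  # P_i: daily progress percentages observed so far

    def observed_days(self) -> int:
        # Number of days for which progress has been observed (T - tau).
        return len(self.progress)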
2. Users of the crowd funding platform are regarded as an agent, crowd funding projects are regarded as the environment, and the agent is trained by reinforcement learning in combination with the static features and dynamic features of the crowd funding projects.
In the embodiment of the invention, the problem of predicting the future progress of crowd funding projects is modeled as a reinforcement learning problem.
The reinforcement learning quadruple is expressed as < S, A, P, R >; wherein:
S is the state space and comprises the state vector s of every time step. A GRU (gated recurrent unit) network is used to model state changes along the time dimension: after the GRU network is initialized with the static features, the input at each time step is the dynamic feature of that step, and the output hidden vector is shown to the agent as the state vector of the environment, i.e. s_t = h_t, where s denotes a state vector, h denotes the hidden vector output by the GRU network, and t denotes a time step, i.e. a certain day;
A is the action space, which comprises the action a of each time step. Considering that what is predicted is the project progress (as a percentage) at the next time step, and that the progress may exceed 100%, the action space is defined as a continuous space made up of all positive numbers, with a_t = p̂_{t+1}, where p̂_{t+1} denotes the estimated progress at the next time step;
R is the return function. Considering that an appropriate function needs to be selected according to the deviation between the predicted and true project progress, a monotonically decreasing, positive, continuously differentiable function is selected as the return function, and its output after a running average is taken as the final return value r. Since the goal of reinforcement learning is to maximize the cumulative return, an optimal strategy learned from such a return function minimizes the cumulative deviation.
Because the method in embodiments of the present invention is model-free, the state transition probability P is not explicitly defined herein.
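As a rough sketch of the state modeling just described, the following PyTorch module initializes a GRU hidden state from the static features and feeds in the daily dynamic features, so that the hidden vector plays the role of s_t = h_t. The module name, dimensions and the tanh initialization layer are assumptions for illustration.

import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    def __init__(self, static_dim, dyn_dim, hidden_dim):
        super().__init__()
        self.init_h = nn.Linear(static_dim, hidden_dim)  # h_0 from static features
        self.gru = nn.GRUCell(dyn_dim, hidden_dim)

    def forward(self, static_x, dyn_seq):
        # static_x: (B, static_dim); dyn_seq: (T, B, dyn_dim)
        h = torch.tanh(self.init_h(static_x))
        states = []
        for c_t in dyn_seq:          # one iteration per day
            h = self.gru(c_t, h)
            states.append(h)
        return torch.stack(states)   # (T, B, hidden_dim), with s_t = h_t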
In the embodiment of the invention, an actor-critic reinforcement learning framework is adopted, i.e. the agent internally comprises two components, an actor and a critic (also called the evaluator). The actor is used to learn an action strategy μ, i.e. a mapping from the state space to the action space, whose output is the final action a = μ_θ(s), where θ represents the parameters of the strategy μ. The critic is used to evaluate the action taken by the actor and outputs a value function Q^μ(s, a), the value of taking action a in state s under strategy μ (μ is omitted below when no particular strategy is specified), while learning itself by temporal difference.
It should be noted that the output of the critic in the training phase provides the direction of prediction for the actor; in the testing phase, progress can be predicted by the actor alone. In addition, the subscripts τ and t in the formulas of this patent all indicate the corresponding time step (i.e. day); parameters with subscripts τ or t are the parameters of that time step, e.g. a_τ and s_t denote the action at time step τ and the state at time step t. Where a time-step-related parameter carries no time step index, the parameter is used without needing to specify a particular time step; e.g. a and s here denote an action and a state.
The learning objective of the actor is to maximize the sum of expected future returns, namely:
L_fu = E_{s_t ∼ ρ^μ} [ Σ_{τ=t}^{T} γ^{τ-t} r(s_τ, μ_θ(s_τ)) ]
where μ_θ denotes the strategy μ with θ as parameters, E denotes expectation, ρ^μ is the distribution obeyed by the states s_t under strategy μ, γ denotes the discount factor of the cumulative computation, r(s_τ, μ_θ(s_τ)) is the immediate return obtained by taking action μ_θ(s_τ) in state s_τ according to strategy μ_θ, and T denotes the total number of crowd funding days.
The actor update target applicable to a deterministic policy is then:
∇_θ L_fu = E_{s_t ∼ ρ^μ} [ ∇_θ μ_θ(s_t) · ∇_a Q^μ(s_t, a_t) |_{a_t = μ_θ(s_t)} ]
where ∇_θ L_fu denotes the partial differential vector of L_fu with respect to θ, ∇_θ μ_θ(s_t) denotes the partial differential vector of μ_θ(s_t) with respect to θ, and ∇_a Q^μ(s_t, a_t) |_{a_t = μ_θ(s_t)} denotes the partial differential vector of Q^μ(s_t, a_t) with respect to a at a_t = μ_θ(s_t).
As the above equation shows, the actor is updated in the direction that maximizes the value estimated by the critic.
The critic learns itself from the expected deviation between its evaluation Q(s_t, a_t) at the current time step and the value re-estimated after the current immediate return is obtained, namely:
δ_t = r_t + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t),
L_critic = E[ (δ_t)^2 ]
where r_t = r(s_t, a_t) denotes the immediate return at time step t, and δ_t is the deviation computed by the temporal difference method.
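A condensed single-transition sketch of these updates is given below, in the spirit of deterministic actor-critic methods. The network sizes, the Softplus output (which keeps the action positive, matching the action space defined above), the optimizers, and the absence of replay buffers or target networks are all simplifying assumptions.

import torch
import torch.nn as nn

state_dim = 64                       # assumed GRU hidden size
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, 1), nn.Softplus())   # positive action a_t
critic = nn.Sequential(nn.Linear(state_dim + 1, 64), nn.ReLU(),
                       nn.Linear(64, 1))                 # Q(s, a)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def td_update(s, a, r, s_next):
    # Critic: minimize the squared temporal-difference error delta_t.
    with torch.no_grad():
        a_next = actor(s_next)
        target = r + gamma * critic(torch.cat([s_next, a_next], dim=-1))
    delta = target - critic(torch.cat([s, a], dim=-1))
    opt_critic.zero_grad()
    delta.pow(2).mean().backward()
    opt_critic.step()

    # Actor: ascend the critic's estimate at a = mu_theta(s).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()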
The above formulas show how the interaction between users of the crowd funding platform and crowd funding projects is modeled with a reinforcement learning framework, and how a progress prediction method that maximizes future returns is learned with the actor-critic framework.
Furthermore, the embodiment of the present invention requires that historical progress changes be learned accurately in order to predict future progress. For this purpose, the objective function of the actor is improved. Specifically:
At the current time step t, the deviation between the predicted progress p̂_τ and the true progress p_τ of all time steps before the current one is used as a loss function on past predictions:
L_pa = Σ_{τ=1}^{t-1} (p̂_τ - p_τ)^2
Meanwhile, considering that the progress of a crowd funding project increases monotonically, a penalty is imposed on the part of the predicted output at a larger time step (i.e. a later time step) that is smaller than the output at a smaller time step (i.e. an earlier time step), with the loss function:
L_reg = Σ_{τ=1}^{t-1} 1(p̂_{τ+1} < p̂_τ) · (p̂_τ - p̂_{τ+1})
where 1(·) denotes the indicator function;
The objective function of the actor is denoted as the weighted sum of the three parts L_fu, L_pa and L_reg, namely:
L_actor = L_fu - λ_1 L_pa - λ_2 L_reg
where λ_1 and λ_2 denote the weights of L_pa and L_reg, respectively.
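A sketch of how the two auxiliary terms could be computed is shown below. The exact functional forms (squared error for L_pa, a hinge on decreases for L_reg) and the default weights are assumptions consistent with, but not spelled out by, the text above.

import torch

def past_prediction_loss(pred, true):
    # L_pa: squared deviation between predicted and true progress, steps 1..t-1.
    return (pred - true).pow(2).sum()

def monotonicity_penalty(pred):
    # L_reg: penalize only predicted decreases (progress never shrinks).
    decrease = pred[:-1] - pred[1:]
    return torch.clamp(decrease, min=0.0).sum()

def actor_objective(l_fu, pred, true, lam1=0.5, lam2=0.5):
    # L_actor = L_fu - lambda1 * L_pa - lambda2 * L_reg, to be maximized.
    return l_fu - lam1 * past_prediction_loss(pred, true) - lam2 * monotonicity_penalty(pred)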
In addition, research shows that project progress sequences in the crowd funding field present an obvious U shape: investment behavior is denser at the start and end stages of a project, so the progress grows quickly, while investment behavior in the middle stage is sparser, so the progress grows slowly. Fully exploiting this important prior knowledge is also a difficulty addressed by the present invention. To make use of this sequence pattern, an option mechanism is used to improve the internal structure of the actor. Briefly, the actor determines the financing stage of the current project using a high-level strategy, and then selects a different low-level strategy, based on the input state, to give a more aggressive or a relatively conservative progress estimate.
Formally, ω denotes an option (which may be understood as a computation block), i.e. a triple ω = (I_ω, μ_ω, β_ω), where I_ω, μ_ω and β_ω respectively denote the initiation states of the option (a concept similar to the state s above), the low-level policy (a concept similar to the strategy μ above), and the termination probability function. In the embodiment of the present invention, the initiation states of all options are assumed to be the whole state space, i.e. I_ω = S. Here μ_ω denotes the strategy μ contained in ω, now automatically regarded as a low-level strategy, and in addition π(ω | s_t) denotes the high-level strategy, i.e. the probability of each option output for an input state s_t. The termination function β_ω denotes a mapping contained in ω from the state space to the interval [0, 1]. As shown in FIG. 2, when the state s_t is input at time step t, since option ω_{t-1} was selected at the previous time step, it terminates with the probability output by the termination function β(s_t, ω_{t-1}). If it terminates, a new option ω_t is selected according to the probabilities output by π(ω | s_t); otherwise ω_{t-1} is still used. Then an action, i.e. the predicted project progress, is output according to the low-level policy of ω_t, and the probability of termination is determined at the beginning of the next time step. The number of options shown in FIG. 2 is merely an example and not a limitation.
Those skilled in the art will appreciate that "low-level policy" and "high-level policy" are general terms of the option mechanism; they correspond to the different mapping functions described above.
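The option machinery inside the actor can be sketched as follows: at each step the previous option terminates with the probability given by β, and if it does, the high-level strategy π(ω | s_t) picks a new option, whose low-level strategy emits the progress estimate. The module layout, the two-option default, and the greedy re-selection (as used in the prediction phase described later) are illustrative assumptions.

import torch
import torch.nn as nn

class OptionActor(nn.Module):
    def __init__(self, state_dim, n_options=2):
        super().__init__()
        self.pi_high = nn.Sequential(nn.Linear(state_dim, n_options),
                                     nn.Softmax(dim=-1))      # pi(omega | s_t)
        self.beta = nn.Sequential(nn.Linear(state_dim, n_options),
                                  nn.Sigmoid())               # termination probs
        self.mu_low = nn.ModuleList([nn.Sequential(nn.Linear(state_dim, 1),
                                                   nn.Softplus())
                                     for _ in range(n_options)])  # low-level policies

    def step(self, s_t, prev_option):
        # Terminate the previous option with probability beta(s_t, prev_option).
        if torch.bernoulli(self.beta(s_t)[prev_option]).item() == 1:
            option = int(torch.argmax(self.pi_high(s_t)))  # pick the likeliest option
        else:
            option = prev_option
        progress_estimate = self.mu_low[option](s_t)       # action: p-hat_{t+1}
        return progress_estimate, option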
When the objective function L_actor is improved, the core idea is to treat the state-option tuple (s, ω) as an extended state. Accordingly, Q(s, a) (the value of taking action a in state s), μ_ω(s) (the low-level policy in state s) and β_ω(s) (the termination function in state s) become, after extension, Q(s, ω, a), μ(s, ω) and β(s, ω), respectively.
On the one hand, the critic can still update itself by temporal difference. However, since the current option terminates with a certain probability at the next time step, the expected value function of the next time step needs to be computed according to whether the current option terminates:
U(ω_t, s_{t+1}) = (1 - β(s_{t+1}, ω_t)) · Q(s_{t+1}, ω_t, a_{t+1}) + β(s_{t+1}, ω_t) · max_ω Q(s_{t+1}, ω, a_{t+1})
where β(s_{t+1}, ω_t) denotes the probability of terminating at the next time step: if the option does not terminate, the option of the current time step is still selected and the estimate is Q(s_{t+1}, ω_t, a_{t+1}); if it terminates, the maximum over all option estimates at the next time step, max_ω Q(s_{t+1}, ω, a_{t+1}), is selected instead of the current option's.
The critic updates its parameters by minimizing the deviation between the estimates before and after the current return is obtained, which is expressed as:
δ′_t = r_t + γ U(ω_t, s_{t+1}) - Q(s_t, ω_t, a_t),
L_critic′ = E[ (δ′_t)^2 ]
On the other hand, the update of the actor is divided into the update of the low-level strategy and the update of the termination function. The update of the low-level policy is derived directly from the state extension, i.e. the update of the L_fu part:
∇_θ L_fu = E_{(s_t, ω_t) ∼ ρ′^μ} [ ∇_θ μ_θ(s_t, ω_t) · ∇_a Q(s_t, ω_t, a_t) ]
where ρ′^μ denotes the distribution obeyed by the extended states (s_t, ω_t) under strategy μ, ∇_θ μ_θ(s_t, ω_t) denotes the partial differential vector of μ_θ(s_t, ω_t) (i.e. the action corresponding to the extended state) with respect to θ, and ∇_a Q(s_t, ω_t, a_t) denotes the partial differential vector of Q(s_t, ω_t, a_t) with respect to action a.
The update of the termination function is then given by:
∇ L_term = ∇ β(s_{t+1}, ω_t) · A(s_{t+1}, ω_t)
where A(s_{t+1}, ω_t) denotes the advantage of option ω_t in state s_{t+1}. The loss function L_term at the update of the termination function is expressed as:
L_term = β(s_{t+1}, ω_t) A(s_{t+1}, ω_t)
The final objective function of the actor is denoted as the weighted sum of the four parts L_fu, L_pa, L_reg and L_term, namely:
L_actor′ = L_fu - λ_1 L_pa - λ_2 L_reg - λ_3 L_term
where λ_3 denotes the weight of L_term.
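The option-aware critic target U and the termination loss can be sketched as below. Taking the advantage A as the gap between the current option's value and the best option's value is an assumption consistent with the max in the definition of U; critic_q is a placeholder for a critic network evaluated per option.

import torch

def u_target(critic_q, s_next, a_next, omega, beta_prob, n_options):
    # U = (1 - beta) * Q(s', omega, a') + beta * max_w Q(s', w, a')
    q_all = torch.stack([critic_q(s_next, w, a_next) for w in range(n_options)])
    return (1.0 - beta_prob) * q_all[omega] + beta_prob * q_all.max()

def termination_loss(critic_q, s_next, a_next, omega, beta_prob, n_options):
    # L_term = beta(s', omega) * A(s', omega); A is detached so only beta trains.
    q_all = torch.stack([critic_q(s_next, w, a_next) for w in range(n_options)])
    advantage = (q_all[omega] - q_all.max()).detach()
    return beta_prob * advantage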
It should be noted that the embodiment of the present invention does not specify how the high-level strategy is updated. In fact, the update of the high-level strategy may come from many reinforcement learning update methods, including but not limited to stochastic policy gradient updates, temporal difference methods, and planning methods.
3. The stage of the crowd funding project to be predicted is judged by the trained agent, and the future progress of the crowd funding project is predicted with the corresponding strategy.
First, when the number of options is set to 2, experiments verify that the two low-level strategies learned through the options are, respectively, relatively aggressive and relatively conservative. FIG. 3 illustrates the differences between the sub-strategies after training. The specific test process is as follows: all training projects are divided into 8 equal parts according to the financing period, the average value of the termination function over all time steps is computed for each part, and the result is plotted. One sub-strategy tends to be used at the initial and final stages of project financing, while the other tends to be used in the intermediate stage. Computing the average output of the two sub-strategies at the same time step shows that the former is larger, i.e. the sub-strategy used at the initial and final stages is more aggressive and suited to fast growth, while the other, relatively conservative, sub-strategy is suited to slow growth. This is consistent with the crowd funding sequence characteristics revealed by previous research.
Next, in the prediction phase, after the static and dynamic features of the crowd funding project to be predicted are extracted according to the method described in step 1, only the actor shown in FIG. 1 is used to predict progress. The corresponding state vector is determined from the static and dynamic features and input to the actor. During prediction, the option of the previous time step is randomly terminated according to the termination probability output at the previous time step. If it terminates, the current stage of the crowd funding project is judged by the high-level strategy: specifically, the state is input to the high-level strategy, which outputs the probabilities corresponding to the two options, and the option with the higher probability is selected. The stage to which the selected option applies is taken as the stage in which the project is judged to be. If the option corresponding to the fast-growth sub-strategy has the higher probability, this is equivalent to implicitly judging that the project is in its initial or final stage; conversely, if the option corresponding to the slow-growth sub-strategy has the higher probability, the high-level strategy considers the project to be in the intermediate financing stage. Both the fast-growth and slow-growth sub-strategies belong to the low-level strategies μ_ω defined above, and the relationship between the two strategies and the options is calibrated in advance, so the applicable stage can be determined directly once the option with the higher probability is selected. In addition, slow growth and fast growth are preset rates, with an evident fast-slow relationship between them. The specific division of project stages depends on the choice of a boundary value k. In this example, when the training projects are divided into 8 equal parts according to the financing period, the financing progress increases by more than 15% in the first 1/8 and the last 1/8 of the time, which are regarded as the initial and final stages, respectively; in the remaining intermediate parts the financing progress increases by less than 15%, and these are regarded as the intermediate stage. The growth of a project at the beginning and end is accordingly regarded as fast growth, and the growth in the middle as slow growth.
After an appropriate option is selected, the low-level strategy corresponding to the option outputs the estimated progress at the next time step. If the project to be predicted is at a past time step that has already been experienced, the true progress is used as part of the state representation; conversely, if a future progress change needs to be predicted, the estimate output at the previous time step is used as part of the state representation at the next time step (i.e. as part of the progress in the dynamic input of the next time step in FIG. 1). In this way, the change and trend of the crowd funding project over the coming days are predicted.
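The autoregressive rollout just described might look like the following sketch, reusing the encoder and actor modules sketched earlier. The convention that each day's dynamic input carries a progress slot in its last element, and that dynamic rows exist for future days (the day-count features are known in advance, unlike comments), are assumptions for illustration.

import torch

def rollout(encoder, actor, static_x, dyn_seq, true_progress, horizon):
    # dyn_seq: one row per day (observed + future), whose last element is the
    # progress slot; true values fill it for observed days, estimates afterwards.
    h = torch.tanh(encoder.init_h(static_x))        # static_x: (1, static_dim)
    preds, option = [], 0
    n_obs = len(true_progress)
    last_est = true_progress[-1]
    for t in range(n_obs + horizon):
        c_t = dyn_seq[t].clone()
        c_t[-1] = true_progress[t] if t < n_obs else last_est
        h = encoder.gru(c_t.unsqueeze(0), h)        # one GRU step, s_t = h_t
        est, option = actor.step(h.squeeze(0), option)
        last_est = est.squeeze()
        preds.append(last_est)
    return torch.stack(preds[n_obs:])               # estimates for the future days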
After the future trend of a crowd funding project is predicted within a certain error range, users of the crowd funding platform can better judge whether the project is worth investing in, and the project publisher can adjust the funding goal and funding period in time according to the future trend, making the most of time and resources. More importantly, the prediction result helps the project publisher decide whether to start production of the crowd funded product in advance and to control the production schedule; for public-welfare crowd funding, the prediction also helps judge whether a project will succeed and roughly how many days it will need, which can ultimately produce social impact. In addition, it can help a website or crowd funding platform recommend projects to suitable users in a personalized way.
According to the above scheme, a stage-differentiated progress prediction method is realized from the characteristic information of crowd funding projects using a hierarchical reinforcement learning framework and an improved objective function, and compared with the prior art the accuracy of the progress prediction results is greatly improved.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A stage-differentiated crowd funding progress prediction method, characterized by comprising the following steps:
crawling progress data of crowd funding projects, and extracting static features and dynamic features from the progress data;
regarding users of a crowd funding platform as an agent, regarding crowd funding projects as the environment, and training the agent by reinforcement learning in combination with the static features and dynamic features of the crowd funding projects; and
judging, through the trained agent, the stage in which the crowd funding project to be predicted is located, and predicting the future progress of the crowd funding project with the corresponding strategy.
2. The method of claim 1, wherein the crawling progress data of crowd funding projects and extracting static features and dynamic features of the progress data comprises:
the progress data of a crowd funding project includes: the title, labels, project category, financing goal, project introduction, daily comments, daily progress changes, number of financed days, and number of remaining financing days;
wherein the title, labels, project category, financing goal and project introduction are static data, and the daily comments, daily progress changes, number of financed days and number of remaining financing days are dynamic data;
the text information of the static data and dynamic data is converted into corresponding vectorized representations by an embedding model; static features are extracted from the vectorized representation of the static data; the daily comments and day information are extracted from the vectorized representation of the dynamic data as dynamic features; the daily progress changes are extracted as the progress sequence; and min-max normalization is applied to the extracted data.
3. The method of claim 1, wherein a crowd funding project i is represented as a triple (X_i, C_i, P_i), where X_i represents the static features, and C_i and P_i are dynamic sequences representing the project's dynamic feature sequence and progress sequence, respectively;
the crowd funding progress prediction means that, given the static features X_i of crowd funding project i, the dynamic feature sequence {c_i^t, t = 1, ..., T - τ} and the project progress sequence {p_i^t, t = 1, ..., T - τ} of the first T - τ days, the project progress sequence {p_i^t, t = T - τ + 1, ..., T} of the next τ days is predicted; the dynamic feature vector c_i^t of day t of project i is composed of the feature vector e_i^t of the comment information in the progress data of the crowd funded project and the vector d_i^t formed from the day information, with t = 1, ..., T - τ, where T is the total number of crowd funding days.
4. The stage-differentiated crowd funding progress prediction method according to claim 1, wherein the reinforcement learning quadruple is expressed as <S, A, P, R>, wherein:
S is the state space and comprises the state vector s of every time step; a GRU network is used to model state changes along the time dimension; after the GRU network is initialized with the static features, the input at each time step is the dynamic features, and the output hidden vector is shown to the agent as the state vector of the environment, i.e. s_t = h_t, where s denotes a state vector, h denotes the hidden vector output by the GRU network, and t denotes a time step, i.e. a certain day;
A is the action space, which comprises the action a of each time step; the action space is defined as a continuous space made up of all positive numbers, with a_t = p̂_{t+1}, where p̂_{t+1} denotes the estimated progress at the next time step and a denotes an action;
R is the return function, which is a monotonically decreasing, positive, continuously differentiable function, and its output after a running average is taken as the final return value r;
the state transition probability P is not explicitly defined.
5. The stage-differentiated crowd funding progress prediction method as claimed in claim 1 or 4, wherein an actor-critic reinforcement learning framework is adopted, i.e. the agent internally comprises two components, an actor and a critic; the actor is used to learn an action strategy μ, i.e. a mapping from the state space to the action space, whose output is the final action a = μ_θ(s), where θ represents the parameters of the strategy μ; the critic is used to evaluate the action taken by the actor and outputs a value function Q^μ(s, a), the value of taking action a in state s under strategy μ, while learning itself by temporal difference;
the learning objective of the actor is to maximize the sum of expected future returns, namely:
L_fu = E_{s_t ∼ ρ^μ} [ Σ_{τ=t}^{T} γ^{τ-t} r(s_τ, μ_θ(s_τ)) ]
where μ_θ denotes the strategy μ with θ as parameters, E denotes expectation, ρ^μ is the distribution obeyed by the states s_t under strategy μ, γ denotes the discount factor of the cumulative computation, r(s_τ, μ_θ(s_τ)) is the immediate return obtained by taking action μ_θ(s_τ) in state s_τ according to strategy μ_θ, and T denotes the total number of crowd funding days;
the actor update target applicable to a deterministic policy is then:
∇_θ L_fu = E_{s_t ∼ ρ^μ} [ ∇_θ μ_θ(s_t) · ∇_a Q^μ(s_t, a_t) |_{a_t = μ_θ(s_t)} ]
where ∇_θ L_fu denotes the partial differential vector of L_fu with respect to θ, ∇_θ μ_θ(s_t) denotes the partial differential vector of μ_θ(s_t) with respect to θ, and ∇_a Q^μ(s_t, a_t) |_{a_t = μ_θ(s_t)} denotes the partial differential vector of Q^μ(s_t, a_t) with respect to a at a_t = μ_θ(s_t);
the critic learns itself from the expected deviation between its evaluation Q(s_t, a_t) at the current time step and the value re-estimated after the current immediate return is obtained:
δ_t = r_t + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t),
L_critic = E[ (δ_t)^2 ]
where r_t = r(s_t, a_t) denotes the immediate return at time step t, and δ_t is the deviation computed by the temporal difference method.
6. The stage-differentiated crowd funding progress prediction method of claim 5, further comprising improving the objective function of the actor:
at the current time step t, the deviation between the predicted progress p̂_τ and the true progress p_τ of all time steps before the current one is used as a loss function on past predictions:
L_pa = Σ_{τ=1}^{t-1} (p̂_τ - p_τ)^2
meanwhile, a penalty is imposed on the part of the predicted output at a larger time step that is smaller than the output at a smaller time step, with the loss function:
L_reg = Σ_{τ=1}^{t-1} 1(p̂_{τ+1} < p̂_τ) · (p̂_τ - p̂_{τ+1})
where 1(·) denotes the indicator function;
the objective function of the actor is denoted as the weighted sum of the three parts L_fu, L_pa and L_reg, namely:
L_actor = L_fu - λ_1 L_pa - λ_2 L_reg
where λ_1 and λ_2 denote the weights of L_pa and L_reg, respectively.
7. The stage-differentiated crowd funding progress prediction method of claim 6, further comprising improving the internal structure of the actor using an option mechanism:
ω denotes an option, ω = (I_ω, μ_ω, β_ω), where I_ω, μ_ω and β_ω respectively denote the initiation states, low-level policy and termination probability function of the option; the initiation states of all options are assumed to be the whole state space, i.e. I_ω = S; μ_ω denotes the strategy μ contained in ω, now regarded as a low-level strategy, and in addition π(ω | s_t) denotes the high-level strategy, i.e. the probability of each option output for an input state s_t; the termination function β_ω denotes a mapping contained in ω from the state space to the interval [0, 1]; when the state s_t is input at time step t, the option ω_{t-1} selected at the previous time step terminates with the probability output by the termination function β(s_t, ω_{t-1}); if it terminates, a new option ω_t is selected according to the probabilities output by π(ω | s_t), otherwise ω_{t-1} is still used; then an action, i.e. the predicted project progress, is output according to the low-level policy of ω_t, and the termination probability is determined at the beginning of the next time step;
when the state-option tuple (s, ω) is expressed as an extended state, Q(s, a), μ_ω(s) and β_ω(s) become, after extension, Q(s, ω, a), μ(s, ω) and β(s, ω), respectively;
updating the actor comprises updating the low-level strategy and updating the termination function; the update of the low-level policy is derived directly from the state extension, i.e. the update of the L_fu part:
∇_θ L_fu = E_{(s_t, ω_t) ∼ ρ′^μ} [ ∇_θ μ_θ(s_t, ω_t) · ∇_a Q(s_t, ω_t, a_t) ]
where ρ′^μ denotes the distribution obeyed by the extended states (s_t, ω_t) under strategy μ, ∇_θ μ_θ(s_t, ω_t) denotes the partial differential vector of μ_θ(s_t, ω_t) with respect to θ, and ∇_a Q(s_t, ω_t, a_t) denotes the partial differential vector of Q(s_t, ω_t, a_t) with respect to action a;
the update of the termination function is then given by:
∇ L_term = ∇ β(s_{t+1}, ω_t) · A(s_{t+1}, ω_t)
L_term = β(s_{t+1}, ω_t) A(s_{t+1}, ω_t)
where β(s_{t+1}, ω_t) denotes the probability of terminating at the next time step: if the option does not terminate, the option of the current time step is still selected and the estimate is Q(s_{t+1}, ω_t, a_{t+1}); if it terminates, the maximum over all option estimates at the next time step, max_ω Q(s_{t+1}, ω, a_{t+1}), is selected instead of the current option's;
the final objective function of the actor is denoted as the weighted sum of the four parts L_fu, L_pa, L_reg and L_term, namely:
L_actor′ = L_fu - λ_1 L_pa - λ_2 L_reg - λ_3 L_term
where λ_3 denotes the weight of L_term.
8. The method of claim 7, wherein, for the critic, since the current option terminates with a certain probability at the next time step, the critic computes the expected value function of the next time step according to whether the option terminates:
U(ω_t, s_{t+1}) = (1 - β(s_{t+1}, ω_t)) · Q(s_{t+1}, ω_t, a_{t+1}) + β(s_{t+1}, ω_t) · max_ω Q(s_{t+1}, ω, a_{t+1})
the critic still updates itself by temporal difference, which is expressed as:
δ′_t = r_t + γ U(ω_t, s_{t+1}) - Q(s_t, ω_t, a_t),
L_critic′ = E[ (δ′_t)^2 ]
CN202010107306.2A 2020-02-21 2020-02-21 Stage-distinguishing crowd funding progress prediction method Pending CN113298279A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010107306.2A CN113298279A (en) 2020-02-21 2020-02-21 Stage-distinguishing crowd funding progress prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010107306.2A CN113298279A (en) 2020-02-21 2020-02-21 Stage-distinguishing crowd funding progress prediction method

Publications (1)

Publication Number Publication Date
CN113298279A true CN113298279A (en) 2021-08-24

Family

ID=77317424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010107306.2A Pending CN113298279A (en) 2020-02-21 2020-02-21 Stage-distinguishing crowd funding progress prediction method

Country Status (1)

Country Link
CN (1) CN113298279A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030074338A1 (en) * 2001-07-18 2003-04-17 Young Peter M. Control system and technique employing reinforcement learning having stability and learning phases
US20190065978A1 (en) * 2017-08-30 2019-02-28 Facebook, Inc. Determining intent based on user interaction data
CN108846747A (en) * 2018-05-24 2018-11-20 阿里巴巴集团控股有限公司 A kind of virtual resource based on block chain is delivered, crowd raises method and device
CN109191276A (en) * 2018-07-18 2019-01-11 北京邮电大学 A kind of P2P network loan institutional risk appraisal procedure based on intensified learning
CN110097225A (en) * 2019-05-05 2019-08-06 中国科学技术大学 Collaborative forecasting method based on sound state depth characterization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jun Wang, Hefu Zhang, Qi Liu, Zhen Pan, Hanqing Tao: "Crowdfunding Dynamics Tracking: A Reinforcement Learning Approach", https://arxiv.org, pages 1-5 *
Xu Huanyun; Hu Xiaoyong: "Borrowing, Integration and Innovation: Multi-dimensional Directions for the Development of Educational Artificial Intelligence - Insights from AIED (2011-2018)", Open Education Research, no. 06 *

Similar Documents

Publication Publication Date Title
CN110490717B (en) Commodity recommendation method and system based on user session and graph convolution neural network
Vanegas et al. Inverse design of urban procedural models
Guo et al. A reinforcement learning decision model for online process parameters optimization from offline data in injection molding
US10824940B1 (en) Temporal ensemble of machine learning models trained during different time intervals
Behnamian et al. Development of a PSO–SA hybrid metaheuristic for a new comprehensive regression model to time-series forecasting
Rempe et al. Trace and pace: Controllable pedestrian animation via guided trajectory diffusion
CN110659411B (en) Personalized recommendation method based on neural attention self-encoder
CN112597392B (en) Recommendation system based on dynamic attention and hierarchical reinforcement learning
Zhang et al. Proximal policy optimization via enhanced exploration efficiency
CN109657800A (en) Intensified learning model optimization method and device based on parametric noise
CN113330462A (en) Neural network training using soft nearest neighbor loss
CN113449260A (en) Advertisement click rate prediction method, training method and device of model and storage medium
Lu et al. Double-track particle swarm optimizer for nonlinear constrained optimization problems
Zheng et al. Guided flows for generative modeling and decision making
CN113688306A (en) Recommendation strategy generation method and device based on reinforcement learning
CN113298279A (en) Stage-distinguishing crowd funding progress prediction method
Lei et al. A novel time-delay neural grey model and its applications
CN114648178B (en) Operation and maintenance strategy optimization method of electric energy metering device based on DDPG algorithm
CN115600009A (en) Deep reinforcement learning-based recommendation method considering future preference of user
CN113300884B (en) GWO-SVR-based step-by-step network flow prediction method
Hunter et al. Simulated nonlinear genetic and environmental dynamics of complex traits
CN110956528B (en) Recommendation method and system for e-commerce platform
CN113191527A (en) Prediction method and device for population prediction based on prediction model
Zand et al. Diffusion Models with Deterministic Normalizing Flow Priors
Hester Texplore: temporal difference reinforcement learning for robots and time-constrained domains

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination