WO2021169218A1 - Data pushing method and system, electronic device and storage medium - Google Patents

Data pushing method and system, electronic device and storage medium Download PDF

Info

Publication number
WO2021169218A1
WO2021169218A1 (PCT/CN2020/112365; CN 2020112365 W)
Authority
WO
WIPO (PCT)
Prior art keywords
data push
neural network
data
reward
optimal
Prior art date
Application number
PCT/CN2020/112365
Other languages
French (fr)
Chinese (zh)
Inventor
陈娴娴
阮晓雯
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021169218A1 publication Critical patent/WO2021169218A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0251Targeted advertisements
    • G06Q30/0255Targeted advertisements based on user history
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0251Targeted advertisements
    • G06Q30/0269Targeted advertisements based on user profile or attribute
    • G06Q30/0271Personalized advertisement

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a data push method, system, electronic device, and computer-readable storage medium.
  • The classic recommendation system relies only on big data stored in advance, but ignores that the recommended objects and the recommendation environment are constantly changing in practice, and also ignores the new information generated during the interaction between the system and the recommended objects. The inventors realized that this neglected interaction information, and its possible moment-to-moment variability, is precisely what matters most. Traditional recommendation systems are therefore rule-bound to a certain extent and, objectively speaking, do not take environment and interaction factors into account. As a result, this type of traditional method lags noticeably at the interaction level and cannot keep up with the latest needs of the recommended objects. Building a recommendation system framework that fully considers the system's interaction information has therefore become a hot issue in data mining.
  • What a recommendation system fears most is serious lag: the time lag in acquiring and analyzing user information delays the analysis of user needs, so the system recommends things the user no longer likes, no longer needs, or that are simply wrong.
  • Traditional data push is mainly based on a basic machine learning framework using association rules, for example treating a purchased product as the rule head and the recommended object as the rule body; the classic example is that many people who buy milk also buy bread to go with it, and recommendations of this kind are cluttered and inaccurate.
  • This application provides a data push method, system, electronic device, and computer-readable storage medium. The main purpose is to extract personal characteristics related to data push from web browsing information, record and store personal behavior strategies, define a reward function combining the personal characteristics and personal behavior strategies, abstract the actual process of item recommendation into a Markov process based on the reward function, use the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model; millions of records are then input into the data push model as data features for network training, with a given loss function used to back-propagate the error and form the optimal data push model; finally, the personal characteristics of the data push target user are input into the optimal data push model, and the optimal data push model automatically outputs the data push.
  • the data push method provided in this application is applied to an electronic device, and the method includes:
  • S120 Define a reward function in combination with the personal characteristics and personal behavior information
  • S140 Use the Markov property of the Markov process to simplify the Bellman equation to form an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and continuously train the neural network until the neural network converges to obtain a data push model;
  • S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
  • this application also provides a data push system, including: a feature extraction unit, a reward function unit, a network training unit, and an optimization model unit;
  • the feature extraction unit is used to extract personal features related to data push based on web browsing information, record and store personal behavior strategies;
  • the reward function unit is connected to the feature extraction unit, and is used to define the reward function in combination with the personal characteristics extracted by the feature extraction unit and the personal behavior strategy, and based on the reward function, the actual process of item recommendation is abstracted into a Markov process;
  • the network training unit is connected with the reward function unit, and is used to simplify the Bellman equation using the Markov property of the Markov process output by the reward function unit to form an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain the data push model;
  • the optimization model unit is connected to the network training unit, and is used to input the training data as data features into the data push model obtained from the network training unit for network training, and to use the given loss function to back-propagate the error, forming the optimal data push model; once the personal characteristics of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
  • the present application also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the steps in the data push method described above are implemented.
  • this application also provides a computer-readable storage medium in which a data push program is stored; when the data push program is executed by a processor, the steps of the aforementioned data push method are implemented.
  • the data push method, system, electronic device and computer-readable storage medium proposed in this application extract personal characteristics, record and store personal behavior strategies, abstract the actual process of item recommendation into a Markov process based on the reward function, use the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain the data push model; finally, the personal characteristics of the data push target user are input into the optimal data push model, and the optimal data push model automatically outputs the data push. This greatly improves the precision and recall of data push, improves how well the recommended items satisfy user needs, and avoids the lag at the interaction level.
  • Fig. 1 is a schematic diagram of an application environment of a data push method according to an embodiment of the present application
  • Fig. 2 is a flowchart of a data push method according to an embodiment of the present application
  • Fig. 3 is a system framework diagram in a data push electronic device according to an embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the existing data push method is mainly based on a basic machine learning framework using association rules, with the purchased goods as the rule head and the recommended object as the rule body. The time lag in acquiring and analyzing user information delays the analysis of user needs, so the system recommends something the user no longer likes, no longer needs, or that is wrong.
  • To solve the above problems in existing data push methods, this application extracts personal characteristics related to data push from web browsing information, records and stores personal behavior strategies, defines a reward function, abstracts the actual process of item recommendation into a Markov process, obtains the optimal solution, and keeps training the neural network until it converges to obtain the data push model. The personal characteristics of the data push target user only need to be input into the optimal data push model, and the optimal data push model automatically outputs the data push.
  • a data push method is provided, which is applied to the electronic device 40.
  • Fig. 1 is a schematic diagram of an application environment of a data push method according to an embodiment of the present application. As shown in FIG. 1, the implementation environment in this embodiment is a computer device 110.
  • the computer device 110 is a terminal device such as a computer.
  • It should be noted that the computer terminal device 110 may be a tablet computer, a notebook computer, a desktop computer, etc., running a CentOS (Linux) system, but is not limited to this.
  • The terminal device 110, such as a computer device, can be connected via Bluetooth, USB (Universal Serial Bus), or other communication connection methods, which is not limited in this application.
  • Fig. 2 is a flowchart of a data push method according to an embodiment of the present application. As shown in Figure 2, in this embodiment, the data push method includes the following steps:
  • S110 Extract personal characteristics and personal behavior information related to data push based on web browsing information; personal behavior information is a personal behavior strategy;
  • If the subject of the user's web browsing is shopping, the extracted personal characteristics can include height, weight, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy can include usual shopping intentions, usual shopping times, specific reasons for shopping, shopping locations, choice of institutions, and so on.
  • If the subject of the user's web browsing is learning or training, the extracted personal characteristics can include age, education, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy can include usual learning needs, learning times, specific reasons for learning, learning goals, choice of institutions, and so on; if the subject of the user's web browsing is news, the extracted personal characteristics can include gender, age, education, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy includes usual browsing topics, usual browsing times, browsing frequency, and so on.
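  • For illustration only, the features and behavior strategy extracted for a shopping session could be gathered into a simple record before being fed to the model; the field names below are hypothetical examples, not terms from the application.

```python
# Hypothetical sketch of the extracted inputs; all field names are illustrative assumptions.
shopping_profile = {
    "personal_features": {
        "height_cm": 172,
        "weight_kg": 65,
        "health_index": 0.8,
        "economic_level": "medium",
        "region": "Shenzhen",
    },
    "behavior_strategy": {
        "usual_intent": "groceries",
        "usual_shopping_hour": 20,        # usually shops around 8 pm
        "shopping_reason": "weekly restock",
        "preferred_institution": "online supermarket",
    },
}
```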
  • The reward function defined in this embodiment needs to be defined mathematically in advance; its definition and application are indispensable steps in a reinforcement learning algorithm. If the reward function receives positive feedback because of a certain behavior strategy, the tendency toward that behavior strategy is strengthened; based on the reinforcement learning algorithm, the system keeps trying and keeps recommending, accumulating the reward according to user feedback during the trying and recommending process, until the accumulated value of the reward function's environmental feedback is maximized and a local optimal solution is obtained.
  • The reward function is: if only product clicks occur in a PV (page view), the corresponding reward value is the number of product clicks by the user; if the user's purchase of a product occurs in a PV, the corresponding reward is the number of product clicks by the user plus the price of the purchased product; in all other cases the reward is zero.
  • In one embodiment, the data to be pushed is a product recommendation, and the reward function is: if the user clicks on a product on the shopping page, a reward value is added to that product, equal to the number of times the user clicks on the product; if the user purchases a product on the shopping page, a reward value is added to that product, equal to the purchase price of the product; otherwise the reward value is zero.
  • In another embodiment, the data to be pushed is a training recommendation, and the reward function is: if the user clicks to browse a course on the training page, a reward value is added to that course, equal to the number of times the user clicks to browse the course; if the user purchases a course on the training page, a reward value is added to that course, equal to the purchase price of the course; otherwise the reward value is zero.
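  • As a minimal sketch of the page-view reward rule described above (Python; the function name and arguments are assumptions for illustration, not part of the application):

```python
from typing import Optional

def pv_reward(click_count: int, purchase_price: Optional[float] = None) -> float:
    """Reward for one PV (page view): clicks only -> number of clicks;
    a purchase -> clicks plus the purchased item's price; otherwise zero."""
    if purchase_price is not None:
        return click_count + purchase_price
    if click_count > 0:
        return float(click_count)
    return 0.0

# Example: a PV with 3 product clicks and a 49.0 purchase yields a reward of 52.0.
assert pv_reward(3, 49.0) == 52.0
assert pv_reward(2) == 2.0
assert pv_reward(0) == 0.0
```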
  • the MDP is represented by the four-tuple <S, A, R, T>:
  • S (State Space) is the state of the data to be pushed on the page during the actual item recommendation process;
  • A (Action Space) is the set of all actions generated on the item recommendation page;
  • R: S × A × S → R (Reward Function), where R(s, a, s′) is the reward value the Agent obtains from the environment when action a is executed in state s and the process transfers to state s′; when the user shifts from clicking a to clicking b, the reward value obtained for b increases;
  • T: S × A × S → [0, 1] is the state transition function of the environment (State Transition Function), where T(s, a, s′) is the probability of executing action a in state s and transferring to state s′.
  • When abstracted into a Markov process, the Agent perceives the environment state S of the entire data push process and collects the personal behavior strategy.
  • When an action in the action space A of the personal behavior strategy acts on an item (clicking or browsing a certain item), the reward function R increases the reward value of that item; the greater the click probability T for the item, the more the reward value increases.
  • In one embodiment, the data push process is a product recommendation process, and the MDP is represented by the quadruple <S, A, R, T>:
  • S represents the number of times the product has been clicked, or that the product has been purchased;
  • A represents the user browsing or clicking on the product;
  • R: S × A × S → R is the reward function, where R(s, a, s′) is the reward value obtained by the product when action a is executed in state s and the process transfers to state s′; for example, when a product that has already been clicked 5 times is clicked once more, the added reward value is 1;
  • T: S × A × S → [0, 1] is the state transition function, where T(3, 2, s′) is the probability that a product that has already been clicked 3 times is clicked 2 more times and transfers to the state of the product being purchased.
  • In another embodiment, the data push process is a course recommendation process, and the MDP is represented by the quadruple <S, A, R, T>:
  • S represents the number of times the course has been previewed, or that the course has been purchased;
  • A represents the user browsing or previewing the course;
  • R: S × A × S → R is the reward function, where R(s, a, s′) is the reward value obtained by the course when action a is executed in state s and the process transfers to state s′; for example, when a course that has already been browsed 3 times is previewed once, the reward value obtained by the course is 1;
  • T: S × A × S → [0, 1] is the state transition function, where T(3, 2, s′) is the probability that a course that has already been browsed or previewed 3 times is browsed or previewed 2 more times and transfers to the state of the course being purchased.
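  • As a rough sketch of the <S, A, R, T> abstraction for the product-recommendation case (the state encoding, price constant, and class name are illustrative assumptions, not specified by the application):

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed state encoding: ("clicked", n) after n clicks, or ("purchased", 0).
State = Tuple[str, int]
Action = str  # e.g. "browse" or "click"

PRODUCT_PRICE = 99.0  # assumed example price

@dataclass
class RecommendationMDP:
    """Toy container for the <S, A, R, T> quadruple sketched above."""
    transition: Dict[Tuple[State, Action, State], float] = field(default_factory=dict)  # T(s, a, s')

    def reward(self, s: State, a: Action, s_next: State) -> float:
        # R(s, a, s'): one extra click adds 1; reaching the "purchased" state adds the price.
        if s_next[0] == "purchased":
            return PRODUCT_PRICE
        if s_next[0] == "clicked" and s_next[1] == s[1] + 1:
            return 1.0
        return 0.0

mdp = RecommendationMDP()
mdp.transition[(("clicked", 3), "click", ("purchased", 0))] = 0.2  # illustrative T(3, click, purchased)
```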
  • S140 Use the Markov property of the Markov process to simplify the Bellman equation to form an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and continue training the neural network until the neural network converges to obtain a data push model;
  • where γ is the attenuation (discount) coefficient and S, R, and t are as defined above; the iterable equation is used to maximize the accumulated reward;
  • Solving for the optimal solution of the iterable equation means maximizing the objective function Q, i.e. obtaining the largest cumulative reward through the agent's recommendations within a batch; here the batch is a data set, and sampling is chosen when solving for the optimal solution of the iterable equation, that is, the calculation is performed on a small batch data set, looping over batches and looping the calculation until the upper threshold is reached or the results converge (a relatively better result is obtained).
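  • A minimal sketch of iterating the simplified Bellman equation over small batches until the result converges or a round limit is reached (shown in tabular form only for clarity; the application approximates Q with a neural network, and the hyperparameters here are assumptions):

```python
import random
from collections import defaultdict

def iterate_q(batches, gamma=0.9, alpha=0.1, max_rounds=1000, tol=1e-4):
    """Loop over small batches of (s, a, r, s_next, next_actions) samples and apply
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    until the largest update falls below tol or the round limit is reached."""
    q = defaultdict(float)
    for _ in range(max_rounds):
        batch = random.choice(batches)          # sample one small batch data set
        max_delta = 0.0
        for s, a, r, s_next, next_actions in batch:
            best_next = max((q[(s_next, a2)] for a2 in next_actions), default=0.0)
            delta = r + gamma * best_next - q[(s, a)]
            q[(s, a)] += alpha * delta
            max_delta = max(max_delta, abs(delta))
        if max_delta < tol:                     # results have converged
            break
    return q
```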
  • Two neural network architectures N1 and N2 with the same structure but different parameters are built: N1 is used to estimate the evaluation value, N2 is used to calculate the target value, and the network is iteratively updated through back-propagation.
  • After k rounds of iteration, the network parameters of N1 are periodically transplanted into N2; N1 and N2 are both fully connected networks of neurons, the activation function used is ReLU, the input is the feature, and the output is the value corresponding to each action.
  • The neural network is initialized with many parameters, and the machine keeps learning and updating these parameters until the framework of the neural network converges; when the neural network converges, the optimal solution of the above iterable equation is obtained, that is, the optimal parameters for the entire push process are found.
  • the input of the constructed network is a feature map of a certain state St.
  • Stochastic Gradient Descent is used for network iteration.
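  • A minimal sketch of the two-network setup described above, written with PyTorch (the framework, feature and action dimensions, hidden size, and sync interval are assumptions; the application only specifies two identically structured fully connected networks with periodic parameter transplantation from N1 to N2):

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected network: feature vector in, one value per action out (ReLU activation)."""
    def __init__(self, feature_dim: int, num_actions: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# N1 estimates the evaluation value, N2 computes the target value: same structure, separate parameters.
n1 = QNetwork(feature_dim=32, num_actions=10)
n2 = copy.deepcopy(n1)

def sync_target(step: int, k: int = 1000) -> None:
    """Every k iterations, transplant N1's network parameters into N2."""
    if step % k == 0:
        n2.load_state_dict(n1.state_dict())
```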
  • the Experience Replay method is used in the algorithm.
  • In the above formula, E is the expectation; a is the action space (Action Space); r is the reward function (Reward Function); s is the environment state transition function (State Transition Function), where s′ denotes the next state; U(D) denotes uniform random sampling from the memory D; γ is the attenuation (discount) coefficient; and Q is the cumulative reward function. That is, the loss is iterated by subtracting the predicted reward in the Q table from the real reward of the next step.
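  • Continuing the N1/N2 sketch above, one training step with experience replay, uniform sampling U(D), and the fixed Q-target loss could look like this (PyTorch; the replay capacity, learning rate, and batch size are assumed values):

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

memory = deque(maxlen=100_000)                         # replay memory D of experiences e_t = (s, a, r, s')
optimizer = torch.optim.SGD(n1.parameters(), lr=1e-3)  # stochastic gradient descent on the evaluation net N1
GAMMA = 0.9                                            # attenuation (discount) coefficient

def store(s, a, r, s_next):
    """Store one experience e_t = (s_t, a_t, r_t, s_{t+1}) in the memory bank D."""
    memory.append((s, a, r, s_next))

def train_step(batch_size: int = 32):
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)          # uniform sampling U(D)
    states = torch.stack([b[0] for b in batch])        # each state stored as a feature tensor
    actions = torch.tensor([b[1] for b in batch])
    rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    next_states = torch.stack([b[3] for b in batch])

    q_pred = n1(states).gather(1, actions.unsqueeze(1)).squeeze(1)       # predicted value from N1
    with torch.no_grad():
        q_target = rewards + GAMMA * n2(next_states).max(dim=1).values  # fixed Q-target from N2
    loss = F.mse_loss(q_pred, q_target)                # (r + gamma * max_a' Q_target - Q_pred)^2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```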
  • S150 Input millions of records as training data features into the data push model for network training, and use the given loss function to back-propagate the error to form the optimal data push model;
  • S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
  • The pushed items are those for which, through machine learning and repeated training, the neural network in the optimal push model has found the target user's purchase probability to be greatest.
  • The data push method in this embodiment first extracts the personal characteristics related to data push during the user's shopping process, records and stores the personal behavior strategy, then defines the reward function by combining the personal characteristics and the personal behavior strategy, abstracts the actual process of item recommendation into a Markov process, uses the Markov property of the Markov process to simplify the Bellman equation, transforms the push process into an iterable equation, and obtains the optimal solution of the iterable equation to obtain the data push model; once the user's characteristics are input into the data model, it automatically pushes the items that are most suitable for the user and that the user is most likely to purchase.
  • This method not only improves the accuracy of recommended items, but also largely avoids the lag that exists at the interaction level.
  • FIG. 3 is a framework diagram of a data push system according to an embodiment of this application.
  • the system corresponds to the aforementioned data push method and can be installed in a data push electronic device.
  • the data pushing system 300 includes a feature extraction unit 310, a reward function unit 320, a network training unit 330, and an optimization model unit 340.
  • the feature extraction unit 310 is configured to extract personal features related to data push according to web browsing information, record and store personal behavior strategies;
  • the reward function unit 320 is connected to the feature extraction unit 310, and is used to define a reward function in combination with the personal characteristics and personal behavior strategies extracted by the feature extraction unit 310, and abstract the actual process of item recommendation into a Markov process based on the reward function;
  • the network training unit 330 is connected to the reward function unit 320, and is used to simplify the Bellman equation using the Markov property of the Markov process output by the reward function unit 320 to form an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model;
  • the optimization model unit 340 is connected to the network training unit 330, and is used to input millions of records as data features into the data push model obtained through the network training unit 330 for network training, and to use a given loss function to back-propagate the error, forming an optimal data push model. Once the personal characteristics of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
  • FIG. 4 is a schematic diagram of the electronic device of this application.
  • the electronic device 40 may be a terminal device with computing capability, such as a server, a tablet computer, a portable computer, or a desktop computer.
  • the electronic device 40 includes a processor 41, a memory 42, a computer program 43, a network interface, and a communication bus.
  • the electronic device 40 may be a tablet computer, a desktop computer, or a smart phone, but is not limited thereto.
  • the memory 42 includes at least one type of readable storage medium.
  • the at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, hard disk, multimedia card, card-type memory, and the like.
  • the readable storage medium may be an internal storage unit of the electronic device 40, such as a hard disk of the electronic device 40.
  • the readable storage medium may also be an external memory of the electronic device 40, such as a plug-in hard disk equipped on the electronic device 40, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
  • the readable storage medium of the memory 42 is generally used to store the computer program 43 installed in the electronic device 40, the key generation unit, the key management unit, the transmission unit, and the alarm unit.
  • the processor 41 may in some embodiments be a central processing unit (CPU), microprocessor, or other data processing chip, and is used to run program code or process data stored in the memory 42, such as a data push program.
  • the network interface may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 40 and other electronic devices.
  • the communication bus is used to realize the connection and communication between these components.
  • FIG. 4 only shows the electronic device 40 with the components 41-43, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the memory 42 as a computer storage medium may store an operating system and a data push program 43; the processor 41 implements the following steps when executing the data push program stored in the memory 42:
  • S140 Utilize the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and continue training the neural network until the neural network converges to obtain a data push model;
  • S150 Input millions of records as training data features into the data push model for network training, and use the given loss function to back-propagate the error to form the optimal data push model;
  • S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information.
  • the embodiment of the present application also proposes a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium includes a data push program, and the following operations are implemented when the data push program is executed by the processor:
  • S140 Utilize the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and continue training the neural network until the neural network converges to obtain a data push model;
  • S150 Input millions of records as training data features into the data push model for network training, and use the given loss function to back-propagate the error to form the optimal data push model;
  • S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data pushing method and system, an electronic device and a storage medium, which relate to the field of intelligent decision making. Said method comprises: extracting, according to webpage browsing information, personal features related to data pushing (S110), and recording and storing a personal behavior policy; defining, in combination with the personal features and the personal behavior policy, a reward function (S120); abstracting a real process of item recommendation as a Markov process on the basis of the reward function (S130); using the Markov property of the Markov process to simplify a Bellman equation to form an iterative equation, calculating an optimal solution of the iterative equation, and obtaining a data pushing model (S140); inputting million-level data as data features into the data pushing model to perform network training, so as to form an optimal data pushing model; and inputting personal features of a data pushing target user into the optimal data pushing model, and the optimal data pushing model automatically outputting recommendation information to said target user (S160).

Description

Data pushing method, system, electronic device and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 26, 2020, with application number 202010119662.6 and invention title "Data push method, device and computer-readable storage medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence, and in particular to a data push method, system, electronic device, and computer-readable storage medium.
Background art
The classic recommendation system relies only on big data stored in advance, but ignores that the recommended objects and the recommendation environment are constantly changing in practice, and also ignores the new information generated during the interaction between the system and the recommended objects. The inventors realized that this neglected interaction information, and its possible moment-to-moment variability, is precisely what matters most; traditional recommendation systems are therefore rule-bound to a certain extent and, objectively speaking, do not take environment and interaction factors into account. As a result, this type of traditional method lags noticeably at the interaction level and cannot keep up with the latest needs of the recommended objects. Building a recommendation system framework that fully considers the system's interaction information has therefore become a hot issue in data mining.
What a recommendation system fears most is serious lag: the time lag in acquiring and analyzing user information delays the analysis of user needs, so the system recommends things the user no longer likes, no longer needs, or that are simply wrong. Traditional data push is mainly based on a basic machine learning framework using association rules, for example treating a purchased product as the rule head and the recommended object as the rule body; the classic example is that many people who buy milk also buy bread to go with it, and recommendations of this kind are cluttered and inaccurate.
Therefore, there is an urgent need for a data push method with improved accuracy.
Summary of the invention
This application provides a data push method, system, electronic device, and computer-readable storage medium. The main purpose is to extract personal characteristics related to data push from web browsing information, record and store personal behavior strategies, define a reward function combining the personal characteristics and personal behavior strategies, abstract the actual process of item recommendation into a Markov process based on the reward function, use the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model; millions of records are then input into the data push model as data features for network training, with a given loss function used to back-propagate the error and form the optimal data push model; finally, the personal characteristics of the data push target user are input into the optimal data push model, and the optimal data push model automatically outputs the data push.
To achieve the above purpose, the data push method provided in this application is applied to an electronic device, and the method includes:
S110: Extract personal characteristics and personal behavior information related to data push from web browsing information;
S120: Define a reward function in combination with the personal characteristics and personal behavior information;
S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
S140: Use the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model;
S150: Input the training data features into the data push model for network training, and use the given loss function to back-propagate the error to form an optimal data push model;
S160: Input the personal characteristics of the data push target user into the optimal data push model; the optimal data push model automatically outputs recommendation information to the target user.
To achieve the above objective, this application also provides a data push system, including a feature extraction unit, a reward function unit, a network training unit, and an optimization model unit.
The feature extraction unit is used to extract personal features related to data push from web browsing information, and to record and store personal behavior strategies.
The reward function unit is connected to the feature extraction unit, and is used to define the reward function in combination with the personal characteristics and personal behavior strategies extracted by the feature extraction unit, and to abstract the actual process of item recommendation into a Markov process based on the reward function.
The network training unit is connected to the reward function unit, and is used to simplify the Bellman equation using the Markov property of the Markov process output by the reward function unit to form an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model.
The optimization model unit is connected to the network training unit, and is used to input the training data as data features into the data push model obtained through the network training unit for network training, and to use the given loss function to back-propagate the error to form an optimal data push model; once the personal characteristics of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
To achieve the above object, this application also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the steps of the data push method described above are implemented.
In addition, to achieve the above object, this application also provides a computer-readable storage medium in which a data push program is stored; when the data push program is executed by a processor, the steps of the aforementioned data push method are implemented.
The data push method, system, electronic device, and computer-readable storage medium proposed in this application extract personal characteristics, record and store personal behavior strategies, abstract the actual process of item recommendation into a Markov process based on the reward function, use the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain the data push model; finally, the personal characteristics of the data push target user are input into the optimal data push model, and the optimal data push model automatically outputs the data push. This greatly improves the precision and recall of data push, improves how well the recommended items satisfy user needs, and avoids the lag at the interaction level.
Description of the drawings
Fig. 1 is a schematic diagram of an application environment of a data push method according to an embodiment of the present application;
Fig. 2 is a flowchart of a data push method according to an embodiment of the present application;
Fig. 3 is a system framework diagram of a data push electronic device according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
The realization, functional characteristics, and advantages of the purpose of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit the application.
Existing data push methods are mainly based on a basic machine learning framework using association rules, with the purchased goods as the rule head and the recommended object as the rule body; the time lag in acquiring and analyzing user information delays the analysis of user needs, so the system recommends something the user no longer likes, no longer needs, or that is wrong. To solve the above problems in existing data push methods, this application extracts personal characteristics related to data push from web browsing information, records and stores personal behavior strategies, defines a reward function, abstracts the actual process of item recommendation into a Markov process, obtains the optimal solution, and keeps training the neural network until it converges to obtain the data push model; the personal characteristics of the data push target user only need to be input into the optimal data push model, and the optimal data push model automatically outputs the data push.
Specifically, according to an embodiment of the present application, a data push method is provided, which is applied to the electronic device 40.
Fig. 1 is a schematic diagram of an application environment of a data push method according to an embodiment of the present application. As shown in Fig. 1, the implementation environment in this embodiment is a computer device 110.
The computer device 110 is a terminal device such as a computer.
It should be noted that the computer terminal device 110 may be a tablet computer, a notebook computer, a desktop computer, etc., running a CentOS (Linux) system, but is not limited to this. The terminal device 110, such as a computer device, can be connected via Bluetooth, USB (Universal Serial Bus), or other communication connection methods, which is not limited in this application.
Fig. 2 is a flowchart of a data push method according to an embodiment of the present application. As shown in Fig. 2, in this embodiment, the data push method includes the following steps.
S110: Extract personal characteristics and personal behavior information related to data push from web browsing information; the personal behavior information is a personal behavior strategy.
During the user's shopping process, personal characteristics such as height, weight, physical condition indicators, and economic status that can be extracted from the user's browsing information are extracted, and personal behavior strategies such as the usual shopping intention, usual shopping times, specific reasons for shopping, shopping locations, and choice of institutions are recorded and stored. Here, the specific content of the personal characteristics and personal behavior strategies can be determined according to the subject of the user's web browsing and the e-commerce behaviors that occur during web browsing.
If the subject of the user's web browsing is shopping, the extracted personal characteristics can include height, weight, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy can include usual shopping intentions, usual shopping times, specific reasons for shopping, shopping locations, choice of institutions, and so on.
If the subject of the user's web browsing is learning or training, the extracted personal characteristics can include age, education, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy can include usual learning needs, learning times, specific reasons for learning, learning goals, choice of institutions, and so on; if the subject of the user's web browsing is news, the extracted personal characteristics can include gender, age, education, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy includes usual browsing topics, usual browsing times, browsing frequency, and so on.
S120: Define a reward function in combination with the personal characteristics and personal behavior information.
When performing data push (information recommendation), taking shopping as the subject of the user's web browsing as an example, whether the user ultimately purchases or clicks depends on the results of a sequence of search rankings, not just on a single search or recommendation, so the search engine needs to be regarded as the agent and the user as the environment, turning the item recommendation problem into a typical sequential decision-making problem.
The reward function defined in this embodiment needs to be defined mathematically in advance; its definition and application are indispensable steps in a reinforcement learning algorithm. If the reward function receives positive feedback because of a certain behavior strategy, the tendency toward that behavior strategy is strengthened; based on the reinforcement learning algorithm, the system keeps trying and keeps recommending, accumulating the reward according to user feedback during the trying and recommending process, until the accumulated value of the reward function's environmental feedback is maximized and a local optimal solution is obtained.
The reward function is: if only product clicks occur in a PV (page view), the corresponding reward value is the number of product clicks by the user; if the user's purchase of a product occurs in a PV, the corresponding reward is the number of product clicks by the user plus the price of the purchased product; in all other cases the reward is zero.
In one embodiment, the data to be pushed is a product recommendation, and the reward function is: if the user clicks on a product on the shopping page, a reward value is added to that product, equal to the number of times the user clicks on the product; if the user purchases a product on the shopping page, a reward value is added to that product, equal to the purchase price of the product; otherwise the reward value is zero.
In another embodiment, the data to be pushed is a training recommendation, and the reward function is: if the user clicks to browse a course on the training page, a reward value is added to that course, equal to the number of times the user clicks to browse the course; if the user purchases a course on the training page, a reward value is added to that course, equal to the purchase price of the course; otherwise the reward value is zero.
S130: Abstract the actual process of item recommendation into a Markov process based on the reward function.
When a certain behavior strategy of the agent leads to a positive reward from the environment (the reward function value increases), the agent's tendency to produce this behavior strategy is strengthened, and the data push (item recommendation) process is then abstracted as an MDP (Markov Decision Process).
The MDP is represented by the four-tuple <S, A, R, T>:
S (State Space) is the state of the data to be pushed on the page during the actual item recommendation process;
A (Action Space) is the set of all actions generated on the item recommendation page;
R: S × A × S → R (Reward Function), where R(s, a, s′) is the reward value the Agent obtains from the environment when action a is executed in state s and the process transfers to state s′; when the user shifts from clicking a to clicking b, the reward value obtained for b increases;
T: S × A × S → [0, 1] is the state transition function of the environment (State Transition Function), where T(s, a, s′) is the probability of executing action a in state s and transferring to state s′.
In the abstraction into a Markov process, the Agent perceives the environment state S of the entire data push process and collects the personal behavior strategy; when an action in the action space A of the personal behavior strategy acts on an item (clicking or browsing a certain item), the reward function R increases the reward value of that item, and the greater the click probability T for the item, the more the reward value increases.
In one embodiment, the data push process is a product recommendation process, and the MDP is represented by the quadruple <S, A, R, T>:
S represents the number of times the product has been clicked, or that the product has been purchased;
A represents the user browsing or clicking on the product;
R: S × A × S → R is the reward function, where R(s, a, s′) is the reward value obtained by the product when action a is executed in state s and the process transfers to state s′; for example, when a product that has already been clicked 5 times is clicked once more, the added reward value is 1;
T: S × A × S → [0, 1] is the state transition function, where T(3, 2, s′) is the probability that a product that has already been clicked 3 times is clicked 2 more times and transfers to the state of the product being purchased.
In another embodiment, the data push process is a course recommendation process, and the MDP is represented by the quadruple <S, A, R, T>:
S represents the number of times the course has been previewed, or that the course has been purchased;
A represents the user browsing or previewing the course;
R: S × A × S → R is the reward function, where R(s, a, s′) is the reward value obtained by the course when action a is executed in state s and the process transfers to state s′; for example, when a course that has already been browsed 3 times is previewed once, the reward value obtained by the course is 1;
T: S × A × S → [0, 1] is the state transition function, where T(3, 2, s′) is the probability that a course that has already been browsed or previewed 3 times is browsed or previewed 2 more times and transfers to the state of the course being purchased.
S140: Use the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model.
First, the Bellman equation is simplified, the push process is transformed into an iterable equation, and the optimal solution of the iterable equation is obtained.
Simplifying the Bellman equation based on the Markov property turns it into an iterable equation, so that the optimal solution can be found by iteration. The iterable equation, in its standard Q-learning form, is
Q(s_t, a_t) = E[ R_{t+1} + γ · max_a Q(s_{t+1}, a) ],
where γ is the attenuation (discount) coefficient, S, R, and t are as defined above, and the iterable equation is used to maximize the accumulated reward.
Solving for the optimal solution of the iterable equation means maximizing the objective function Q, i.e. obtaining the largest cumulative reward through the agent's recommendations within a batch; here the batch is a data set, and sampling is chosen when solving for the optimal solution of the iterable equation, that is, the calculation is performed on a small batch data set, looping over batches and looping the calculation until the upper threshold is reached or the results converge (a relatively better result is obtained).
Then an approximate representation of the action-value function is introduced, namely
Q(s, a; θ) ≈ Q*(s, a),
where θ denotes the parameters of the approximating network;
Then, once the approximate representation holds mathematically, the neural network (DQN model) is built in combination with the optimal solution;
Two neural network architectures N1 and N2 with the same structure but different parameters are built: N1 estimates the evaluation value and N2 computes the target value; the network is then iteratively updated through back-propagation, and after every k rounds of iteration the network parameters of N1 are periodically copied into N2. Both N1 and N2 are fully connected networks of neurons; the activation function used is ReLU, the input is the feature, and the output is the value corresponding to each action; the number of neurons varies slightly with the scenario;
The neural network is initialized with a large number of parameters, and the machine keeps learning and updating these parameters until the framework of the neural network converges; when the neural network converges, the optimal solution of the above iterable equation has been obtained, i.e., the parameters that make the whole push process optimal have been found.
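A minimal PyTorch sketch of the two same-structure networks N1 (evaluation value) and N2 (target value) with periodic parameter copying may look as follows; the use of PyTorch, the feature dimension, the hidden size and the synchronization period k are assumptions, not details fixed by the application.

```python
import copy
import torch.nn as nn

# Sketch of the evaluation network N1 and target network N2 with identical structure.
# Layer sizes are illustrative; the text mentions both ReLU and tanh activations.

class QNet(nn.Module):
    def __init__(self, feature_dim: int, n_actions: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden),
            nn.ReLU(),                       # activation over the input features
            nn.Linear(hidden, n_actions),    # one value output per action
        )

    def forward(self, x):
        return self.net(x)

n1 = QNet(feature_dim=16, n_actions=4)       # evaluation-value network
n2 = copy.deepcopy(n1)                        # target-value network, same structure

def sync_target(step: int, k: int = 100):
    """Every k iterations, copy N1's parameters into N2."""
    if step % k == 0:
        n2.load_state_dict(n1.state_dict())
```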
Specifically, the input of the constructed network is the feature map of a certain state St, which passes through a fully connected layer of 100 neurons with the tanh activation function, and the output layer finally outputs the action value Vi corresponding to each action ai. Stochastic Gradient Descent is used for network iteration in the neural network. The algorithm applies the Experience Replay method: before the specified t memories to be stored are reached, every current-step state involved, the corresponding action taken, the delayed reward obtained and the corresponding next state s′ are stored, as in the following formulas:
Every stored experience
e_t = (s_t, a_t, r_t, s_{t+1})
is stored in the memory bank
D_t = {e_1, ..., e_t}.
Finally, uniform sampling with replacement is performed.
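The stored experiences and the uniform sampling with replacement can be sketched as a simple replay memory in Python; the capacity value below is an assumption.

```python
import random
from collections import deque

# Illustrative memory bank D_t = {e_1, ..., e_t} for Experience Replay.

class ReplayMemory:
    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped first

    def store(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))  # e_t = (s_t, a_t, r_t, s_{t+1})

    def sample(self, batch_size: int):
        # uniform sampling with replacement, as described above
        return [random.choice(self.memory) for _ in range(batch_size)]
```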
At the same time, the fixed Q-target method is applied, and the loss function is
$L(\theta)=\mathbb{E}_{(s,a,r,s')\sim U(D)}\big[\big(r+\gamma\max_{a'}Q(s',a';\theta^{-})-Q(s,a;\theta)\big)^{2}\big]$
where θ⁻ denotes the parameters of the target network N2.
In the above formula, E is the expectation, a is an action from the action space, r is the reward given by the reward function, s is the state of the environment and s′ is the next state; U(D) denotes uniform random sampling from the memory bank D, γ is the decay coefficient, and Q is the cumulative reward function. In other words, the loss is iterated on the difference between the real reward of the next step and the predicted reward in the Q table.
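A possible PyTorch rendering of this fixed Q-target loss is sketched below; the tensor shapes and the absence of a terminal-state mask are simplifying assumptions.

```python
import torch

# Sketch of the fixed Q-target loss: target computed with the target network N2,
# prediction taken from the evaluation network N1, mean squared difference as loss.

def dqn_loss(n1, n2, batch, gamma: float = 0.9):
    s, a, r, s_next = batch                                   # tensors sampled from U(D)
    with torch.no_grad():                                     # targets are held fixed
        target = r + gamma * n2(s_next).max(dim=1).values     # r + gamma * max_a' Q(s', a'; theta-)
    predicted = n1(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    return torch.mean((target - predicted) ** 2)
```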
S150: Input millions of data records as training data features into the data push model for network training, and use the given loss function to back-propagate the error, forming the optimal data push model;
After the data push model is formed, millions of data records are input as data features into the Deep Q Network for network training; the given loss function is used to back-propagate the error, and training continues until the model converges into a well-formed data push model, so as to obtain the optimal data push model.
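Putting the pieces above together, a training loop over a large data set until convergence could look roughly like this; `to_tensors` is a hypothetical helper that converts sampled experiences into tensors, and the optimizer, batch size, synchronization period and convergence test are all assumptions.

```python
import torch

# Illustrative training loop: sample from replay memory, back-propagate the loss,
# periodically refresh the target network, stop when the loss stops changing.

def train(n1, n2, memory, optimizer, steps=1_000_000, batch_size=64, k=100, tol=1e-5):
    prev_loss = float("inf")
    for step in range(1, steps + 1):
        batch = to_tensors(memory.sample(batch_size))    # hypothetical conversion helper
        loss = dqn_loss(n1, n2, batch)                   # fixed Q-target loss sketched above
        optimizer.zero_grad()
        loss.backward()                                  # back-propagate the error
        optimizer.step()
        if step % k == 0:
            n2.load_state_dict(n1.state_dict())          # copy N1's parameters into N2
        if abs(prev_loss - loss.item()) < tol:           # model has converged
            break
        prev_loss = loss.item()
    return n1
```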
S160:将数据推送目标用户的个人特征输入该最优数据推送模型,该最优数据推送模型自动化地向目标用户输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
在该最优数据推送模型自动化地输出数据推送的过程中,所推送出的物品为该最优推送模型中的神经网络经机器学习及反复训练得出的使目标用户购买几率最大的物品。In the process of automatically outputting the data push by the optimal data push model, the items pushed out are the items that are obtained by the neural network in the optimal push model through machine learning and repeated training to maximize the purchase probability of the target user.
The data push method in this embodiment first extracts, during the user's shopping process, the personal characteristics related to the data push and records and stores the personal behavior strategy; it then defines a reward function by combining the personal characteristics with the personal behavior strategy, abstracts the real-world process of item recommendation into a Markov process, uses the Markov property of that process to simplify the Bellman equation, converts the push process into an iterable equation and obtains its optimal solution, thereby obtaining the data push model. As long as the user's characteristics are input into this model, the model automatically pushes the item that best suits the user and that the user is most likely to purchase. The method not only improves the accuracy of the recommended items but also largely avoids the lag that exists at the interaction level.
On the other hand, the present application further provides a data push system. FIG. 3 is a framework diagram of a data push system according to an embodiment of the present application; the system corresponds to the aforementioned data push method and can be provided in a data push electronic device.
如图3所示,该数据推送系统300包括特征提取单元310、奖励函数单元320、网络训练单元330、优化模型单元340。As shown in FIG. 3, the data pushing system 300 includes a feature extraction unit 310, a reward function unit 320, a network training unit 330, and an optimization model unit 340.
其中,特征提取单元310用于根据网页浏览信息提取与数据推送相关的个人特征,记录并存储个人行为策略;Among them, the feature extraction unit 310 is configured to extract personal features related to data push according to web browsing information, record and store personal behavior strategies;
奖励函数单元320与特征提取单元310相连,用于结合特征提取单元310提取的个人特征及个人行为策略定义奖励函数,并基于该奖励函数将物品推荐的现实过程抽象为马尔 科夫过程;The reward function unit 320 is connected to the feature extraction unit 310, and is used to define a reward function in combination with the personal characteristics and personal behavior strategies extracted by the feature extraction unit 310, and abstract the actual process of item recommendation into a Markov process based on the reward function;
The network training unit 330 is connected to the reward function unit 320 and is configured to use the Markov property of the Markov process output by the reward function unit 320 to simplify the Bellman equation into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network based on the optimal solution, and train the neural network until it converges to obtain the data push model;
The optimization model unit 340 is connected to the network training unit 330 and is configured to input millions of data records as data features into the data push model obtained by the network training unit 330 for network training, and to use the given loss function to back-propagate the error, thereby forming the optimal data push model; once the personal characteristics of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
图4为本申请电子装置示意图,在本实施例中,电子装置40可以是服务器、平板计算机、便携计算机、桌上型计算机等具有运算功能的终端设备。FIG. 4 is a schematic diagram of the electronic device of this application. In this embodiment, the electronic device 40 may be a terminal device with arithmetic function, such as a server, a tablet computer, a portable computer, a desktop computer, and the like.
该电子装置40包括:处理器41、存储器42、计算机程序43、网络接口及通信总线。The electronic device 40 includes a processor 41, a memory 42, a computer program 43, a network interface, and a communication bus.
电子装置40可以是平板电脑、台式电脑、智能手机,但不限于此。The electronic device 40 may be a tablet computer, a desktop computer, or a smart phone, but is not limited thereto.
存储器42包括至少一种类型的可读存储介质。至少一种类型的可读存储介质可为如闪存、硬盘、多媒体卡、卡型存储器等的非易失性存储介质。在一些实施例中,可读存储介质可以是电子装置40的内部存储单元,例如该电子装置40的硬盘。在另一些实施例中,可读存储介质也可以是电子装置40的外部存储器,例如电子装置40上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。The memory 42 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, hard disk, multimedia card, card-type memory, and the like. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 40, such as a hard disk of the electronic device 40. In other embodiments, the readable storage medium may also be an external memory of the electronic device 40, such as a plug-in hard disk equipped on the electronic device 40, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
在本实施例中,存储器42的可读存储介质通常用于存储安装于电子装置40的计算机程序43,密钥生成单元、密钥管理单元、传输单元和告警单元等。In this embodiment, the readable storage medium of the memory 42 is generally used to store the computer program 43 installed in the electronic device 40, the key generation unit, the key management unit, the transmission unit, and the alarm unit.
处理器41在一些实施例中可以是一中央处理器(Central Processing Unit,CPU),微处理器或其他数据处理芯片,用于运行存储器42中存储的程序代码或处理数据,例如数据推送程序等。The processor 41 may be a central processing unit (CPU), microprocessor or other data processing chip in some embodiments, and is used to run program codes or process data stored in the memory 42, such as a data push program, etc. .
网络接口可选地可以包括标准的有线接口、无线接口(如WI-FI接口),通常用于在该电子装置40与其他电子设备之间建立通信连接。The network interface may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 40 and other electronic devices.
通信总线用于实现这些组件之间的连接通信。The communication bus is used to realize the connection and communication between these components.
图4仅示出了具有组件41-43的电子装置40,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。FIG. 4 only shows the electronic device 40 with the components 41-43, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
在图4所示的电子装置实施例中,作为一种计算机存储介质的存储器42中可以存储操作系统以及数据推送程序43;处理器41执行存储器42中存储的数据推送程序时实现如下步骤:In the embodiment of the electronic device shown in FIG. 4, the memory 42 as a computer storage medium may store an operating system and a data push program 43; the processor 41 implements the following steps when executing the data push program stored in the memory 42:
S110:根据网页浏览信息提取与数据推送相关的个人特征及个人行为信息;S110: Extract personal characteristics and personal behavior information related to data push based on web browsing information;
S120:结合该个人特征及个人行为信息定义奖励函数;S120: Combine the personal characteristics and personal behavior information to define a reward function;
S130:基于该奖励函数将物品推荐的现实过程抽象为马尔科夫过程;S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
S140:利用该马尔科夫过程的马尔科夫性简化贝尔曼方程,将该推送过程转化为可迭代方程式,并求得该可迭代方程式的最优解,结合该最优解搭建神经网络,持续训练该神经网络直至该神经网络收敛,获得数据推送模型;S140: Utilize the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, and obtain the optimal solution of the iterable equation, combine the optimal solution to build a neural network, and continue Train the neural network until the neural network converges, and obtain a data push model;
S150:以百万级数据作为训练数据特征输入数据推送模型进行网络训练,并给予给定的Loss function进行误差的回传,形成最优数据推送模型;S150: Use millions of data as the training data feature to input the data push model for network training, and give the given Loss function to return the error to form the optimal data push model;
S160:将数据推送目标用户的个人特征输入该最优数据推送模型,该最优数据推送模型自动化地输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information.
此外,本申请实施例还提出一种计算机可读存储介质,所述计算机可读存储介质可以是非易失性,也可以是易失性,计算机可读存储介质中包括数据推送程序,该数据推送程序被处理器执行时实现如下操作:In addition, the embodiment of the present application also proposes a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium includes a data push program, and the data push The following operations are implemented when the program is executed by the processor:
S110:根据网页浏览信息提取与数据推送相关的个人特征及个人行为信息;S110: Extract personal characteristics and personal behavior information related to data push based on web browsing information;
S120:结合该个人特征及个人行为信息定义奖励函数;S120: Combine the personal characteristics and personal behavior information to define a reward function;
S130:基于该奖励函数将物品推荐的现实过程抽象为马尔科夫过程;S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
S140:利用该马尔科夫过程的马尔科夫性简化贝尔曼方程,将该推送过程转化为可迭代方程式,并求得该可迭代方程式的最优解,结合该最优解搭建神经网络,持续训练该神经网络直至该神经网络收敛,获得数据推送模型;S140: Utilize the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, and obtain the optimal solution of the iterable equation, combine the optimal solution to build a neural network, and continue Train the neural network until the neural network converges, and obtain a data push model;
S150:以百万级数据作为训练数据特征输入数据推送模型进行网络训练,并给予给定的Loss function进行误差的回传,形成最优数据推送模型;S150: Use millions of data as the training data feature to input the data push model for network training, and give the given Loss function to return the error to form the optimal data push model;
S160:将数据推送目标用户的个人特征输入该最优数据推送模型,该最优数据推送模型自动化地向目标用户输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
本申请之计算机可读存储介质的具体实施方式与上述数据推送方法、电子装置的具体实施方式大致相同,在此不再赘述。The specific implementation of the computer-readable storage medium of the present application is substantially the same as the specific implementation of the above-mentioned data push method and electronic device, and will not be repeated here.
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、装置、物品或者方法中还存在另外的相同要素。It should be noted that in this article, the terms "include", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements not only includes those elements, It also includes other elements not explicitly listed, or elements inherent to the process, device, article, or method. If there are no more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.
The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments. Through the description of the above implementation manners, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by means of software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as a ROM/RAM, magnetic disk or optical disk) and includes a number of instructions for causing a terminal device (which may be a computer, a server, a network device, or the like) to execute the method described in each embodiment of the present application.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. 一种数据推送方法,应用于电子装置,所述方法包括:A data push method applied to an electronic device, the method including:
    S110:根据网页浏览信息提取与数据推送相关的个人特征及个人行为信息;S110: Extract personal characteristics and personal behavior information related to data push based on web browsing information;
    S120:结合所述个人特征及个人行为信息定义奖励函数;S120: Define a reward function in combination with the personal characteristics and personal behavior information;
    S130:基于所述奖励函数将物品推荐的现实过程抽象为马尔科夫过程;S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
    S140:利用所述马尔科夫过程的马尔科夫性简化贝尔曼方程形成可迭代方程式,并求得所述可迭代方程式的最优解,结合所述最优解搭建神经网络,持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型;S140: Use the Markov property of the Markov process to simplify the Bellman equation to form an iterable equation, and obtain the optimal solution of the iterable equation, combine the optimal solution to build a neural network, and continuously train the The neural network until the neural network converges to obtain a data push model;
    S150:将训练数据特征输入数据推送模型进行网络训练,并给予给定的Loss function进行误差的回传,形成最优数据推送模型;S150: Input the training data features into the data push model for network training, and give the given Loss function to return the error to form an optimal data push model;
    S160:将数据推送目标用户的个人特征输入所述最优数据推送模型,所述最优数据推送模型向所述目标用户输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model outputs recommendation information to the target user.
  2. 根据权利要求1所述的数据推送方法,其中,所述奖励函数为:The data pushing method according to claim 1, wherein the reward function is:
    If only product clicks occur within a PV, the corresponding reward value is the number of times the user clicks on the product; if the user purchases a product within a PV, the corresponding reward is the number of times the user clicks on the product plus the price of the purchased product; in all other cases the reward is 0.
  3. 根据权利要求1所述的数据推送方法,其中,The data push method according to claim 1, wherein:
    所述马尔科夫过程由四元组<S,A,R,T>表示:The Markov process is represented by the quaternion <S,A,R,T>:
    其中,S为所述物品推荐现实过程中页面上待推送数据的状态;Wherein, S is the status of the data to be pushed on the page during the actual process of the item recommendation;
    A为所述物品推荐页面产生的所有动作;A is all actions generated on the item recommendation page;
    R: S×A×S→R is the reward function: when the user performs action A and the state transitions from S to S′, state S′ obtains a reward value; for example, when the user moves from clicking item a to clicking item b, item b obtains a reward value;
    T: S×A×S→[0,1] is the state transition function of the environment; T(s, a, s′) represents the probability of performing action a in state s and transitioning to state s′.
  4. 根据权利要求1所述的数据推送方法,其中,求得所述可迭代方程式的最优解为在一个batch中,通过智能体推荐得到的最大累积奖励;The data push method according to claim 1, wherein obtaining the optimal solution of the iterable equation is the maximum cumulative reward obtained through agent recommendation in a batch;
    The optimal solution of the iterable equation is obtained by sampling: the computation is first performed on a small batch data set, and then batches are drawn and computed in a loop until the upper threshold is reached or the result converges.
  5. 根据权利要求1所述的数据推送方法,其中,结合所述最优解搭建神经网络的过程包括:The data pushing method according to claim 1, wherein the process of building a neural network in combination with the optimal solution comprises:
    Introduce an approximate representation of the action-value function:
    Q(s, a; θ) ≈ Q*(s, a), where θ denotes the network parameters;
    所述近似表示在数学上成立后,结合所述最优解搭建两个结构相同、参数不同的神经网络架构N1、N2;其中,After the approximate representation is established mathematically, combine the optimal solution to build two neural network architectures N1 and N2 with the same structure and different parameters; where,
    利用N1进行evaluation value的估计,利用N2进行target value的计算,进而对反向传递进行网络迭代更新,并在k轮迭代后定期将N1的网络参数移植到N2中;Use N1 to estimate the evaluation value, use N2 to calculate the target value, and then iteratively update the network in the reverse pass, and periodically transplant the network parameters of N1 to N2 after k rounds of iterations;
    所述N1、N2均为具有神经元的全连接网络,所述神经元个数通过不同的场景发生改变。The N1 and N2 are all fully connected networks with neurons, and the number of neurons changes through different scenarios.
  6. 根据权利要求1所述的数据推送方法,其中,在持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型的过程中,包括:The data push method according to claim 1, wherein the process of continuously training the neural network until the neural network converges to obtain the data push model includes:
    Stochastic Gradient Descent is used for network iteration in the neural network, and the Experience Replay method is applied: before the specified t memories to be stored are reached, all of the current states S involved, the corresponding actions A taken, the delayed rewards R obtained and the corresponding next states S′ are stored.
  7. The data push method according to claim 5, wherein, in the process in which the optimal data push model automatically outputs the data push, the pushed item is the item, derived by the neural network in the optimal push model through machine learning and repeated training, that maximizes the target user's probability of purchase.
  8. 一种数据推送系统,包括:特征提取单元、奖励函数单元、网络训练单元、优化模型单元;A data push system, including: a feature extraction unit, a reward function unit, a network training unit, and an optimization model unit;
    特征提取单元用于根据网页浏览信息提取与数据推送相关的个人特征,记录并存储个人行为策略;The feature extraction unit is used to extract personal features related to data push based on web browsing information, record and store personal behavior strategies;
    奖励函数单元与特征提取单元相连,用于结合特征提取单元提取的个人特征及个人行为策略定义奖励函数,并基于该奖励函数将物品推荐的现实过程抽象为马尔科夫过程;The reward function unit is connected to the feature extraction unit, and is used to define the reward function in combination with the personal characteristics extracted by the feature extraction unit and the personal behavior strategy, and based on the reward function, the actual process of item recommendation is abstracted into a Markov process;
    The network training unit is connected to the reward function unit and is configured to use the Markov property of the Markov process output by the reward function unit to simplify the Bellman equation into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network based on the optimal solution, and train the neural network until it converges to obtain the data push model;
    The optimization model unit is connected to the network training unit and is configured to input the training data as data features into the data push model obtained by the network training unit for network training, and to use the given loss function to back-propagate the error, thereby forming the optimal data push model; once the personal characteristics of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
  9. 一种电子装置,该电子装置包括:存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如下步骤:An electronic device comprising: a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and the processor implements the following steps when the processor executes the computer program:
    S110:根据网页浏览信息提取与数据推送相关的个人特征及个人行为信息;S110: Extract personal characteristics and personal behavior information related to data push based on web browsing information;
    S120:结合所述个人特征及个人行为信息定义奖励函数;S120: Define a reward function in combination with the personal characteristics and personal behavior information;
    S130:基于所述奖励函数将物品推荐的现实过程抽象为马尔科夫过程;S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
    S140:利用所述马尔科夫过程的马尔科夫性简化贝尔曼方程形成可迭代方程式,并求得所述可迭代方程式的最优解,结合所述最优解搭建神经网络,持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型;S140: Use the Markov property of the Markov process to simplify the Bellman equation to form an iterable equation, and obtain the optimal solution of the iterable equation, combine the optimal solution to build a neural network, and continuously train the The neural network until the neural network converges to obtain a data push model;
    S150:将训练数据特征输入数据推送模型进行网络训练,并给予给定的Loss function进行误差的回传,形成最优数据推送模型;S150: Input the training data features into the data push model for network training, and give the given Loss function to return the error to form an optimal data push model;
    S160:将数据推送目标用户的个人特征输入所述最优数据推送模型,所述最优数据推送模型向所述目标用户输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model outputs recommendation information to the target user.
  10. 根据权利要求9所述的电子装置,其中,所述奖励函数为:The electronic device according to claim 9, wherein the reward function is:
    If only product clicks occur within a PV, the corresponding reward value is the number of times the user clicks on the product; if the user purchases a product within a PV, the corresponding reward is the number of times the user clicks on the product plus the price of the purchased product; in all other cases the reward is 0.
  11. 根据权利要求9所述的电子装置,其中,The electronic device according to claim 9, wherein:
    所述马尔科夫过程由四元组<S,A,R,T>表示:The Markov process is represented by the quaternion <S,A,R,T>:
    其中,S为所述物品推荐现实过程中页面上待推送数据的状态;Wherein, S is the status of the data to be pushed on the page during the actual process of the item recommendation;
    A为所述物品推荐页面产生的所有动作;A is all actions generated on the item recommendation page;
    R: S×A×S→R is the reward function: when the user performs action A and the state transitions from S to S′, state S′ obtains a reward value; for example, when the user moves from clicking item a to clicking item b, item b obtains a reward value;
    T: S×A×S→[0,1] is the state transition function of the environment; T(s, a, s′) represents the probability of performing action a in state s and transitioning to state s′.
  12. 根据权利要求9所述的电子装置,其中,求得所述可迭代方程式的最优解为在一个batch中,通过智能体推荐得到的最大累积奖励;9. The electronic device according to claim 9, wherein the optimal solution of the iterable equation is the maximum cumulative reward obtained through agent recommendation in a batch;
    The optimal solution of the iterable equation is obtained by sampling: the computation is first performed on a small batch data set, and then batches are drawn and computed in a loop until the upper threshold is reached or the result converges.
  13. 根据权利要求9所述的电子装置,其中,结合所述最优解搭建神经网络的过程包 括:The electronic device according to claim 9, wherein the process of building a neural network in combination with the optimal solution comprises:
    Introduce an approximate representation of the action-value function:
    Q(s, a; θ) ≈ Q*(s, a), where θ denotes the network parameters;
    所述近似表示在数学上成立后,结合所述最优解搭建两个结构相同、参数不同的神经网络架构N1、N2;其中,After the approximate representation is established mathematically, combine the optimal solution to build two neural network architectures N1 and N2 with the same structure and different parameters; where,
    利用N1进行evaluation value的估计,利用N2进行target value的计算,进而对反向传递进行网络迭代更新,并在k轮迭代后定期将N1的网络参数移植到N2中;Use N1 to estimate the evaluation value, use N2 to calculate the target value, and then iteratively update the network in the reverse pass, and periodically transplant the network parameters of N1 to N2 after k rounds of iterations;
    所述N1、N2均为具有神经元的全连接网络,所述神经元个数通过不同的场景发生改变。The N1 and N2 are all fully connected networks with neurons, and the number of neurons changes through different scenarios.
  14. 根据权利要求9所述的电子装置,其中,在持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型的过程中,包括:The electronic device according to claim 9, wherein the process of continuously training the neural network until the neural network converges to obtain the data push model comprises:
    Stochastic Gradient Descent is used for network iteration in the neural network, and the Experience Replay method is applied: before the specified t memories to be stored are reached, all of the current states S involved, the corresponding actions A taken, the delayed rewards R obtained and the corresponding next states S′ are stored.
  15. The electronic device according to claim 13, wherein, in the process in which the optimal data push model automatically outputs the data push, the pushed item is the item, derived by the neural network in the optimal push model through machine learning and repeated training, that maximizes the target user's probability of purchase.
  16. 一种计算机可读存储介质,所述计算机可读存储介质中存储有数据推送程序,所述数据推送程序被处理器执行时,实现如下步骤:A computer-readable storage medium in which a data push program is stored, and when the data push program is executed by a processor, the following steps are implemented:
    S110:根据网页浏览信息提取与数据推送相关的个人特征及个人行为信息;S110: Extract personal characteristics and personal behavior information related to data push based on web browsing information;
    S120:结合所述个人特征及个人行为信息定义奖励函数;S120: Define a reward function in combination with the personal characteristics and personal behavior information;
    S130:基于所述奖励函数将物品推荐的现实过程抽象为马尔科夫过程;S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
    S140:利用所述马尔科夫过程的马尔科夫性简化贝尔曼方程形成可迭代方程式,并求得所述可迭代方程式的最优解,结合所述最优解搭建神经网络,持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型;S140: Use the Markov property of the Markov process to simplify the Bellman equation to form an iterable equation, and obtain the optimal solution of the iterable equation, combine the optimal solution to build a neural network, and continuously train the The neural network until the neural network converges to obtain a data push model;
    S150:将训练数据特征输入数据推送模型进行网络训练,并给予给定的Loss function进行误差的回传,形成最优数据推送模型;S150: Input the training data features into the data push model for network training, and give the given Loss function to return the error to form an optimal data push model;
    S160:将数据推送目标用户的个人特征输入所述最优数据推送模型,所述最优数据推送模型向所述目标用户输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model outputs recommendation information to the target user.
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述奖励函数为:The computer-readable storage medium of claim 16, wherein the reward function is:
    If only product clicks occur within a PV, the corresponding reward value is the number of times the user clicks on the product; if the user purchases a product within a PV, the corresponding reward is the number of times the user clicks on the product plus the price of the purchased product; in all other cases the reward is 0.
  18. 根据权利要求16所述的计算机可读存储介质,其中,The computer-readable storage medium of claim 16, wherein:
    所述马尔科夫过程由四元组<S,A,R,T>表示:The Markov process is represented by the quaternion <S,A,R,T>:
    其中,S为所述物品推荐现实过程中页面上待推送数据的状态;Wherein, S is the status of the data to be pushed on the page during the actual process of the item recommendation;
    A为所述物品推荐页面产生的所有动作;A is all actions generated on the item recommendation page;
    R: S×A×S→R is the reward function: when the user performs action A and the state transitions from S to S′, state S′ obtains a reward value; for example, when the user moves from clicking item a to clicking item b, item b obtains a reward value;
    T: S×A×S→[0,1] is the state transition function of the environment; T(s, a, s′) represents the probability of performing action a in state s and transitioning to state s′.
  19. 根据权利要求16所述的计算机可读存储介质,其中,结合所述最优解搭建神经网络的过程包括:The computer-readable storage medium according to claim 16, wherein the process of building a neural network in combination with the optimal solution comprises:
    Introduce an approximate representation of the action-value function:
    Q(s, a; θ) ≈ Q*(s, a), where θ denotes the network parameters;
    所述近似表示在数学上成立后,结合所述最优解搭建两个结构相同、参数不同的神经 网络架构N1、N2;其中,After the approximate representation is established mathematically, combine the optimal solution to build two neural network architectures N1 and N2 with the same structure and different parameters; where,
    利用N1进行evaluation value的估计,利用N2进行target value的计算,进而对反向传递进行网络迭代更新,并在k轮迭代后定期将N1的网络参数移植到N2中;Use N1 to estimate the evaluation value, use N2 to calculate the target value, and then iteratively update the network in the reverse pass, and periodically transplant the network parameters of N1 to N2 after k rounds of iterations;
    所述N1、N2均为具有神经元的全连接网络,所述神经元个数通过不同的场景发生改变。The N1 and N2 are all fully connected networks with neurons, and the number of neurons changes through different scenarios.
  20. 根据权利要求16所述的计算机可读存储介质,其中,在持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型的过程中,包括:The computer-readable storage medium according to claim 16, wherein the process of continuously training the neural network until the neural network converges to obtain a data push model comprises:
    Stochastic Gradient Descent is used for network iteration in the neural network, and the Experience Replay method is applied: before the specified t memories to be stored are reached, all of the current states S involved, the corresponding actions A taken, the delayed rewards R obtained and the corresponding next states S′ are stored.
PCT/CN2020/112365 2020-02-26 2020-08-31 Data pushing method and system, electronic device and storage medium WO2021169218A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010119662.6 2020-02-26
CN202010119662.6A CN111401937A (en) 2020-02-26 2020-02-26 Data pushing method and device and storage medium

Publications (1)

Publication Number Publication Date
WO2021169218A1 true WO2021169218A1 (en) 2021-09-02

Family

ID=71413972

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112365 WO2021169218A1 (en) 2020-02-26 2020-08-31 Data pushing method and system, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN111401937A (en)
WO (1) WO2021169218A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401937A (en) * 2020-02-26 2020-07-10 平安科技(深圳)有限公司 Data pushing method and device and storage medium
CN112565904B (en) * 2020-11-30 2023-05-09 北京达佳互联信息技术有限公司 Video clip pushing method, device, server and storage medium


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001078003A1 (en) * 2000-04-10 2001-10-18 University Of Otago Adaptive learning system and method
CN106447463A (en) * 2016-10-21 2017-02-22 南京大学 Commodity recommendation method based on Markov decision-making process model
CN109471963A (en) * 2018-09-13 2019-03-15 广州丰石科技有限公司 A kind of proposed algorithm based on deeply study
CN109451038A (en) * 2018-12-06 2019-03-08 北京达佳互联信息技术有限公司 A kind of information-pushing method, device, server and computer readable storage medium
CN110659947A (en) * 2019-10-11 2020-01-07 沈阳民航东北凯亚有限公司 Commodity recommendation method and device
CN111401937A (en) * 2020-02-26 2020-07-10 平安科技(深圳)有限公司 Data pushing method and device and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761375A (en) * 2021-09-10 2021-12-07 未鲲(上海)科技服务有限公司 Message recommendation method, device, equipment and storage medium based on neural network
CN114139472A (en) * 2021-11-04 2022-03-04 江阴市智行工控科技有限公司 Integrated circuit direct current analysis method and system based on reinforcement learning dual-model structure
CN114218290A (en) * 2021-12-06 2022-03-22 中国航空综合技术研究所 Selection method for equipment human-computer interaction interface usability evaluation
CN114218290B (en) * 2021-12-06 2024-05-03 中国航空综合技术研究所 Selection method for equipment man-machine interaction interface usability evaluation
WO2023142448A1 (en) * 2022-01-26 2023-08-03 北京沃东天骏信息技术有限公司 Hotspot information processing method and apparatus, and server and readable storage medium
CN114943278A (en) * 2022-04-27 2022-08-26 浙江大学 Continuous online group incentive method and device based on reinforcement learning and storage medium
CN114943278B (en) * 2022-04-27 2023-09-12 浙江大学 Continuous online group incentive method and device based on reinforcement learning and storage medium
CN115640933A (en) * 2022-11-03 2023-01-24 昆山润石智能科技有限公司 Method, device and equipment for automatically managing production line defects and storage medium
CN115640933B (en) * 2022-11-03 2023-10-13 昆山润石智能科技有限公司 Automatic defect management method, device and equipment for production line and storage medium

Also Published As

Publication number Publication date
CN111401937A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
WO2021169218A1 (en) Data pushing method and system, electronic device and storage medium
Kim et al. TWILITE: A recommendation system for Twitter using a probabilistic model based on latent Dirichlet allocation
US10938927B2 (en) Machine learning techniques for processing tag-based representations of sequential interaction events
CN110717098B (en) Meta-path-based context-aware user modeling method and sequence recommendation method
Mansour et al. Bayesian incentive-compatible bandit exploration
US8843427B1 (en) Predictive modeling accuracy
Benoit et al. Binary quantile regression: a Bayesian approach based on the asymmetric Laplace distribution
JP5789204B2 (en) System and method for recommending items in a multi-relational environment
Barnes Azure machine learning
Letham et al. Sequential event prediction
Kothiyal et al. An experimental test of prospect theory for predicting choice under ambiguity
US20140046880A1 (en) Dynamic Predictive Modeling Platform
CN109993583B (en) Information pushing method and device, storage medium and electronic device
CN105159910A (en) Information recommendation method and device
US20220245424A1 (en) Microgenre-based hyper-personalization with multi-modal machine learning
EP2668545A1 (en) Dynamic predictive modeling platform
CN106447463A (en) Commodity recommendation method based on Markov decision-making process model
Chen et al. Learning multiple similarities of users and items in recommender systems
Zhang et al. Dynamic scholarly collaborator recommendation via competitive multi-agent reinforcement learning
WO2020221022A1 (en) Service object recommendation method
TW201636930A (en) Providing service based on user operation behavior
Saha et al. Towards integrated dialogue policy learning for multiple domains and intents using hierarchical deep reinforcement learning
Gisselbrecht et al. Whichstreams: A dynamic approach for focused data capture from large social media
Arandjelović Prediction of health outcomes using big (health) data
CN114119123A (en) Information pushing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921352

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921352

Country of ref document: EP

Kind code of ref document: A1