WO2021169218A1 - Data push method, system, electronic device and storage medium - Google Patents
Data push method, system, electronic device and storage medium
- Publication number
- WO2021169218A1 (PCT/CN2020/112365)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data push
- neural network
- data
- reward
- optimal
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0241—Advertisements
- G06Q30/0251—Targeted advertisements
- G06Q30/0255—Targeted advertisements based on user history
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0241—Advertisements
- G06Q30/0251—Targeted advertisements
- G06Q30/0269—Targeted advertisements based on user profile or attribute
- G06Q30/0271—Personalized advertisement
Definitions
- This application relates to the field of artificial intelligence, and in particular to a data push method, system, electronic device, and computer-readable storage medium.
- the classic recommendation system relies only on big data stored in advance, ignoring that in practice the recommended objects and the recommendation environment are constantly changing, and also ignoring the new information generated while the system interacts with the recommended objects. It has been realized that this neglected interaction information, and the instantaneous changes it may undergo, are precisely what matters most. Traditional recommendation systems are therefore rule-bound to a certain extent and, objectively speaking, do not take the environment and interaction factors into account. As a result, such traditional methods show an obvious lag at the interaction level and cannot keep up with the latest needs of the recommended objects. Building a recommendation system framework that fully considers the system's interaction information has therefore become a hot issue in data mining.
- what a recommendation system fears most is severe lag.
- the time lag in acquiring and analysing user information delays the analysis of user needs, so the system recommends things that users no longer like, no longer need, or that are simply wrong.
- Traditional data push is mainly based on a basic machine learning framework using association rules, for example taking the purchased product as the rule head and the recommended object as the rule body.
- the classic example is that many people buy bread to go with milk; recommendations produced this way are cumbersome and inaccurate.
- This application provides a data push method, system, electronic device, and computer-readable storage medium. Its main purpose is to extract personal features related to data push from web browsing information, record and store personal behavior strategies, and define a reward function by combining the personal features and personal behavior strategies.
- the real-world process of item recommendation is then abstracted into a Markov process based on the reward function; the Markov property of the Markov process is used to simplify the Bellman equation, turning the push process into an iterable equation; the optimal solution of the iterable equation is obtained; a neural network is built around the optimal solution and trained continuously until it converges, yielding a data push model; millions of data records are then input as data features into the data push model for network training, and the error is back-propagated with the given loss function to form the optimal data push model. Finally, the personal features of the data push target user are input into the optimal data push model, which automatically outputs the data push.
- the data push method provided in this application is applied to an electronic device, and the method includes:
- S120 Define a reward function in combination with the personal characteristics and personal behavior information
- S140 Use the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network around the optimal solution, and continuously train the neural network until it converges to obtain a data push model;
- S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
- this application also provides a data push system, including: a feature extraction unit, a reward function unit, a network training unit, and an optimization model unit;
- the feature extraction unit is used to extract personal features related to data push based on web browsing information, record and store personal behavior strategies;
- the reward function unit is connected to the feature extraction unit, and is used to define the reward function in combination with the personal characteristics extracted by the feature extraction unit and the personal behavior strategy, and based on the reward function, the actual process of item recommendation is abstracted into a Markov process;
- the network training unit is connected to the reward function unit, and is used to simplify the Bellman equation, using the Markov property of the Markov process output by the reward function unit, into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network around the optimal solution, and continuously train the neural network until it converges to obtain the data push model;
- the optimization model unit is connected to the network training unit, and is used to input the training data as data features into the data push model obtained through the network training unit for network training, and to back-propagate the error with the given loss function to form the optimal data push model; as long as the personal features of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
- the present application also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps in the data push method described above are implemented.
- this application also provides a computer-readable storage medium in which a data push program is stored.
- when the data push program is executed by a processor, the steps of the aforementioned data push method are implemented.
- the data push method, system, electronic device and computer-readable storage medium proposed in this application extract personal features, record and store personal behavior strategies, abstract the real-world process of item recommendation into a Markov process based on the reward function, and then use the Markov property of the Markov process to simplify the Bellman equation, turning the push process into an iterable equation; the optimal solution of the iterable equation is obtained, a neural network is built around the optimal solution and trained continuously until it converges, and a data push model is obtained; finally, the personal features of the data push target user are input into the optimal data push model, which automatically outputs the data push. This greatly improves the precision and recall of data push, improves how well the recommended items satisfy user needs, and avoids the lag that exists at the interaction level.
- Fig. 1 is a schematic diagram of an application environment of a data push method according to an embodiment of the present application
- Fig. 2 is a flowchart of a data push method according to an embodiment of the present application
- Fig. 3 is a system framework diagram in a data push electronic device according to an embodiment of the present application.
- Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
- the existing data push method is mainly based on a basic machine learning framework using association rules, with the purchased goods as the rule head and the recommended object as the rule body.
- the time lag in acquiring and analysing user information delays the analysis of user needs, so the system recommends things that the user no longer likes, no longer needs, or that are wrong.
- this application extracts personal features related to data push from web browsing information, records and stores personal behavior strategies, defines a reward function, abstracts the real-world process of item recommendation into a Markov process, obtains the optimal solution, and continuously trains a neural network until it converges to obtain the data push model. Only the personal features of the data push target user need to be input into the optimal data push model, and the optimal data push model automatically outputs the data push.
- a data push method is provided, which is applied to the electronic device 40.
- Fig. 1 is a schematic diagram of an application environment of a data push method according to an embodiment of the present application. As shown in FIG. 1, the implementation environment in this embodiment is a computer device 110.
- the computer device 110 is, for example, a terminal device such as a computer.
- the computer terminal device 110 may be a tablet computer, a notebook computer, a desktop computer, etc., running a CentOS (Linux) system, but is not limited to this.
- terminal devices 110 such as the computer device can be connected via Bluetooth, USB (Universal Serial Bus) or other communication connection methods, which is not limited in this application.
- Fig. 2 is a flowchart of a data push method according to an embodiment of the present application. As shown in Figure 2, in this embodiment, the data push method includes the following steps:
- S110 Extract personal characteristics and personal behavior information related to data push based on web browsing information; personal behavior information is a personal behavior strategy;
- if the subject of the user's web browsing is shopping, the extracted personal features can include height, weight, physical condition indicators, economic status, location, etc.
- the corresponding personal behavior strategy can include usual shopping intentions, usual shopping times, specific reasons for shopping, shopping locations, choice of merchants, etc.
- if the subject of the user's web browsing is study or training, the extracted personal features can include age, education, physical condition indicators, economic status, location, etc.
- the corresponding personal behavior strategy can include usual learning needs, study times, specific reasons for learning, the purpose of learning, choice of institutions, etc.; if the subject of the user's web browsing is news browsing, the extracted personal features can include gender, age, education, physical condition indicators, economic status, location, etc., and the corresponding personal behavior strategy includes usual browsing topics, usual browsing times, and browsing frequency.
- the reward function defined in this embodiment needs to be defined mathematically in advance, and its definition and application are indispensable steps in the reinforcement learning algorithm; if the reward function receives positive feedback because of a certain behavior strategy, the tendency toward that behavior strategy is strengthened; based on the reinforcement learning algorithm, the system keeps trying and keeps recommending, and in the process of trying and recommending the reward is accumulated according to user feedback, until the accumulated value of the environmental feedback received by the reward function is maximised and a local optimal solution is obtained.
- the reward function is: if only product clicks occur in a PV (page view), the corresponding reward value is the number of times the user clicks on the product; if a user purchase occurs in a PV, the corresponding reward is the number of times the user clicks on the product plus the price of the purchased product; in other cases the reward is zero.
- the data to be pushed is product recommendation
- the reward function is: if the user clicks on a product on the shopping page, a reward value is added for that product, the reward value being the number of times the user clicked the product; if a product is purchased on the shopping page, a reward value is added for that product, the reward value being the purchase price of the product; otherwise, the reward value is zero.
- the data to be pushed is training recommendation
- the reward function is: if the user clicks to browse a certain course on the training page, the reward value is added to the course, and the reward value is the number of times the user clicks to browse the course. If the user purchases a course on the training page, the reward value is added to the course, and the reward value is the purchase price of the course; otherwise, the reward value is zero.
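- As a minimal illustration of the reward definitions above, the following Python sketch computes the reward for a single PV in the product-recommendation case (the function and parameter names are hypothetical, not taken from the application):
```python
def pv_reward(clicks: int, purchased: bool, price: float = 0.0) -> float:
    """Reward for one PV (page view), following the rule described above.

    - Only clicks in the PV: the reward is the number of clicks on the product.
    - A purchase in the PV: the reward is the number of clicks plus the purchase price.
    - Otherwise: the reward is zero.
    """
    if purchased:
        return clicks + price
    if clicks > 0:
        return float(clicks)
    return 0.0


# Example: 3 clicks followed by a purchase at price 59.9 gives a reward of 62.9.
print(pv_reward(clicks=3, purchased=True, price=59.9))
```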
- the MDP is represented by the quadruple <S, A, R, T>:
- S (State Space): the state space;
- S is the state of the data to be pushed on the page during the real item recommendation process;
- A (Action Space): all actions generated on the item recommendation page;
- R: S×A×S→R (Reward Function): R(s, a, s′) represents the reward value obtained by the agent from the environment when action a is performed in state s and the process transfers to state s′; when the user shifts from clicking a to clicking b, the reward value obtained by b increases;
- T: S×A×S→[0,1] is the state transition function of the environment (State Transition Function): T(s, a, s′) represents the probability of performing action a in state s and transferring to state s′.
- the agent perceives the environment state S of the entire data push process and collects the personal behavior strategy through the agent.
- when an action in the action space A of the personal behavior strategy occurs on an item (clicking or browsing an item), the reward function R increases that item's reward value; the greater the probability T of clicking on the item, the more the reward value increases.
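- One way to hold the <S, A, R, T> quadruple in code is a small container such as the sketch below; this is an assumed, simplified representation for illustration only, since the application does not prescribe any particular data structure:
```python
from dataclasses import dataclass
from typing import Callable, Sequence

State = tuple   # e.g. (times_clicked, purchased_flag)
Action = str    # e.g. "click", "purchase"

@dataclass
class MDP:
    states: Sequence[State]                               # S: state space
    actions: Sequence[Action]                             # A: action space
    reward: Callable[[State, Action, State], float]       # R(s, a, s')
    transition: Callable[[State, Action, State], float]   # T(s, a, s') in [0, 1]

# Toy instance: an extra click adds a reward of 1 and advances the click count.
mdp = MDP(
    states=[(n, p) for n in range(7) for p in (False, True)],
    actions=["click", "purchase"],
    reward=lambda s, a, s2: 1.0 if a == "click" else 0.0,
    transition=lambda s, a, s2: 1.0 if s2[0] == s[0] + (1 if a == "click" else 0) else 0.0,
)
print(mdp.reward((5, False), "click", (6, False)))  # -> 1.0
```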
- the data push process is a product recommendation process
- the MDP is represented by the quadruple <S, A, R, T>:
- S represents the number of times the product has been clicked, or that the product has been purchased;
- A represents that the user is browsing or clicking the item;
- R: S×A×S→R is the reward function;
- R(s, a, s′) represents the reward value obtained by the product when action A is performed in state S and the process transfers to state S′; for example, when the product has already been clicked 5 times and the user clicks it once more, the added reward value is 1;
- T: S×A×S→[0,1] is the state transition function;
- T(3, 2, purchased) represents the probability that, when the product has already been clicked 3 times, it is clicked 2 more times and transfers to the state of purchasing the product.
- the data push process is a course recommendation process
- the MDP is represented by the quadruple <S, A, R, T>:
- S represents the number of times the course has been previewed, or that the course has been purchased;
- A represents that the user is browsing or previewing the course;
- R: S×A×S→R is the reward function;
- R(s, a, s′) represents the reward value obtained by the item when action A is executed in state S and the process transfers to state S′; for example, when the course has been browsed 3 times and is then previewed once, the reward value obtained by the course is 1;
- T: S×A×S→[0,1] is the state transition function;
- T(3, 2, s′) represents the probability that, when the course has already been browsed or previewed 3 times, it is browsed or previewed 2 more times and transfers to the state of purchasing the course.
- S140 Use the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network around the optimal solution, and continue training the neural network until the neural network converges to obtain a data push model;
- ⁇ is the attenuation coefficient, S, R, and t are equal, and the iterable equation is used to maximize the accumulation of rewards;
- Solving the optimal solution of the iterable equation is to obtain the maximum objective function Q. It is necessary to obtain the largest cumulative reward through the agent recommendation in a batch; where the batch is a data set and is in the process of solving the optimal solution of the iterable equation Choose to perform sampling solution, that is, perform calculations in a small batch data set, loop batches, loop calculations, until the upper threshold is reached, or the results converge (relatively better results are obtained).
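- The iterable equation itself appears as a figure in the application and is not reproduced in this text. As an assumed stand-in, the sketch below iterates the standard Q-learning form of the Bellman equation, Q(s, a) ← Q(s, a) + α·(r + γ·max_a′ Q(s′, a′) − Q(s, a)), over sampled batches of logged transitions until the values stop changing, mirroring the batch-and-converge loop described above:
```python
import random
from collections import defaultdict

def fit_q(transitions, gamma=0.9, alpha=0.1, batch_size=32, max_loops=10_000, tol=1e-4):
    """Tabular Q iteration over logged (s, a, r, s') tuples.

    Batches are sampled in a loop until an iteration cap is reached or the
    largest update falls below `tol`, i.e. the values have converged.
    """
    q = defaultdict(float)
    actions = {a for _, a, _, _ in transitions}
    for _ in range(max_loops):
        batch = random.sample(transitions, min(batch_size, len(transitions)))
        max_delta = 0.0
        for s, a, r, s2 in batch:
            target = r + gamma * max(q[(s2, a2)] for a2 in actions)
            delta = alpha * (target - q[(s, a)])
            q[(s, a)] += delta
            max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    return q


# Toy click/purchase log: (state, action, reward, next_state).
log = [(0, "click", 1.0, 1), (1, "click", 1.0, 2), (2, "buy", 59.9, 3), (3, "stop", 0.0, 3)]
print(max(fit_q(log).items(), key=lambda kv: kv[1]))
```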
- two neural network architectures N1 and N2 with the same structure but different parameters are built.
- N1 is used to estimate the evaluation value;
- N2 is used to compute the target value;
- the network is iteratively updated through back-propagation.
- after every k iterations, N1's network parameters are periodically copied into N2; N1 and N2 are both fully connected networks of neurons, the activation function used is relu, the input is the feature, and the output is the value corresponding to each action.
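- A minimal sketch of the two-network arrangement described here, written with PyTorch as an assumed implementation framework (the layer sizes, the copy interval K, and the feature/action dimensions are illustrative and not taken from the application):
```python
import copy
import torch.nn as nn

class QNet(nn.Module):
    """Fully connected network: feature vector in, one value per action out."""
    def __init__(self, n_features: int, n_actions: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),                       # the description mentions relu (and tanh) activations
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

n1 = QNet(n_features=8, n_actions=4)   # N1: estimates the evaluation value
n2 = copy.deepcopy(n1)                  # N2: same structure, used for the target value

K = 200                                 # every K updates, copy N1's parameters into N2
for step in range(1, 1001):
    # ... a gradient update on n1 from a sampled batch would happen here ...
    if step % K == 0:
        n2.load_state_dict(n1.state_dict())
```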
- the neural network is initialised with a large number of parameters, and the machine keeps learning and updating these parameters until the framework of the neural network converges; when the neural network converges, the optimal solution of the above iterable equation has been obtained, that is, the parameters that make the entire push process optimal have been found.
- the input of the constructed network is the feature map of a certain state St, which passes through a fully connected layer of 100 neurons with the tanh activation function; the output layer finally outputs the action value Vi corresponding to each action ai.
- Stochastic Gradient Descent is used for network iteration.
- the Experience Replay method is used in the algorithm.
- E is the expectation;
- a is the action space (Action Space);
- r is the reward function (Reward Function);
- s is the environment state, with s′ denoting the next state;
- U(D) denotes uniform random sampling from the replay memory D;
- γ is the decay coefficient;
- Q is the cumulative reward function; that is, the iterative loss is computed by subtracting the predicted reward in the Q table from the real reward of the next step.
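- Putting the replay memory and the loss together, the sketch below (again an assumed PyTorch implementation, not code from the application) stores each experience e_t = (s_t, a_t, r_t, s_{t+1}), samples uniformly from the memory D, and minimises the squared TD error (r + γ·max_a′ Q_target(s′, a′) − Q_eval(s, a))² with a stochastic-gradient step:
```python
import random
from collections import deque

import torch
import torch.nn as nn

def dqn_train_step(q_eval: nn.Module, q_target: nn.Module, memory: deque,
                   optimizer: torch.optim.Optimizer, batch_size: int = 32, gamma: float = 0.9):
    """One update of the evaluation network from uniformly sampled experiences (s, a, r, s')."""
    if len(memory) < batch_size:
        return None
    batch = random.sample(memory, batch_size)                      # U(D): uniform sampling
    s = torch.stack([e[0] for e in batch])
    a = torch.tensor([e[1] for e in batch])
    r = torch.tensor([e[2] for e in batch])
    s2 = torch.stack([e[3] for e in batch])

    q_sa = q_eval(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a) from the evaluation net
    with torch.no_grad():
        target = r + gamma * q_target(s2).max(dim=1).values        # r + γ · max_a' Q_target(s', a')
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                               # stochastic gradient descent step
    return loss.item()


# Usage sketch: the replay memory D holds tuples e_t = (s_t, a_t, r_t, s_{t+1}).
memory = deque(maxlen=10_000)
```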
- S150 Use millions of data records as training data features, input them into the data push model for network training, and back-propagate the error with the given loss function to form the optimal data push model;
- S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
- the pushed items are those that the neural network in the optimal push model, through machine learning and repeated training, has determined to maximise the target user's purchase probability.
- the data push method in this embodiment first extracts the personal features related to data push during the user's shopping process, records and stores the personal behavior strategy, then defines the reward function by combining the personal features and the personal behavior strategy, and abstracts the real-world process of item recommendation into a Markov process; the Markov property of the Markov process is then used to simplify the Bellman equation, the push process is transformed into an iterable equation, and the optimal solution of the iterable equation is obtained to yield the data push model; as long as the user's features are input into the data model, the data model automatically pushes the items best suited to the user and with the greatest probability of being purchased by the user.
- this method not only improves the accuracy of recommended items, but also greatly avoids the lag that exists at the interaction level.
- FIG. 3 is a framework diagram of a data push system according to an embodiment of this application.
- the system corresponds to the aforementioned data push method and can be installed in a data push electronic device.
- the data pushing system 300 includes a feature extraction unit 310, a reward function unit 320, a network training unit 330, and an optimization model unit 340.
- the feature extraction unit 310 is configured to extract personal features related to data push according to web browsing information, record and store personal behavior strategies;
- the reward function unit 320 is connected to the feature extraction unit 310, and is used to define a reward function in combination with the personal characteristics and personal behavior strategies extracted by the feature extraction unit 310, and abstract the actual process of item recommendation into a Markov process based on the reward function;
- the network training unit 330 is connected to the reward function unit 320, and is used to simplify the Bellman equation, using the Markov property of the Markov process output by the reward function unit 320, into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network around the optimal solution, and continuously train the neural network until it converges to obtain a data push model;
- the optimization model unit 340 is connected to the network training unit 330, and is used to input millions of data records as data features into the data push model obtained through the network training unit 330 for network training, and to back-propagate the error with the given loss function to form an optimal data push model; as long as the personal features of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
- FIG. 4 is a schematic diagram of the electronic device of this application.
- the electronic device 40 may be a terminal device with arithmetic function, such as a server, a tablet computer, a portable computer, a desktop computer, and the like.
- the electronic device 40 includes a processor 41, a memory 42, a computer program 43, a network interface, and a communication bus.
- the electronic device 40 may be a tablet computer, a desktop computer, or a smart phone, but is not limited thereto.
- the memory 42 includes at least one type of readable storage medium.
- the at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, hard disk, multimedia card, card-type memory, and the like.
- the readable storage medium may be an internal storage unit of the electronic device 40, such as a hard disk of the electronic device 40.
- the readable storage medium may also be an external memory of the electronic device 40, such as a plug-in hard disk equipped on the electronic device 40, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
- the readable storage medium of the memory 42 is generally used to store the computer program 43 installed in the electronic device 40, the key generation unit, the key management unit, the transmission unit, and the alarm unit.
- in some embodiments, the processor 41 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run program code or process data stored in the memory 42, such as a data push program.
- the network interface may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 40 and other electronic devices.
- the communication bus is used to realize the connection and communication between these components.
- FIG. 4 only shows the electronic device 40 with the components 41-43, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
- the memory 42 as a computer storage medium may store an operating system and a data push program 43; the processor 41 implements the following steps when executing the data push program stored in the memory 42:
- S140 Utilize the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network around the optimal solution, and continue to train the neural network until the neural network converges to obtain a data push model;
- S150 Use millions of data records as training data features, input them into the data push model for network training, and back-propagate the error with the given loss function to form the optimal data push model;
- S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information.
- the embodiment of the present application also proposes a computer-readable storage medium.
- the computer-readable storage medium may be non-volatile or volatile.
- the computer-readable storage medium includes a data push program, and when the data push program is executed by the processor, the following operations are implemented:
- S140 Utilize the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network around the optimal solution, and continue to train the neural network until the neural network converges to obtain a data push model;
- S150 Use millions of data records as training data features, input them into the data push model for network training, and back-propagate the error with the given loss function to form the optimal data push model;
- S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Development Economics (AREA)
- General Engineering & Computer Science (AREA)
- Finance (AREA)
- Data Mining & Analysis (AREA)
- Strategic Management (AREA)
- Accounting & Taxation (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Game Theory and Decision Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Economics (AREA)
- Marketing (AREA)
- General Business, Economics & Management (AREA)
- Probability & Statistics with Applications (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
A data push method, system, electronic device and storage medium, relating to the field of intelligent decision-making. The method includes: extracting personal features related to data push from web browsing information (S110), and recording and storing personal behavior strategies; defining a reward function by combining the personal features and the personal behavior strategies (S120); abstracting the real-world process of item recommendation into a Markov process based on the reward function (S130); using the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, solving the iterable equation for its optimal solution, and obtaining a data push model (S140); inputting millions of data records as data features into the data push model for network training to form an optimal data push model; and inputting the personal features of a data push target user into the optimal data push model, which automatically outputs recommendation information to the target user (S160).
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 26, 2020, with application number 202010119662.6 and invention title "Data push method, device and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
This application relates to the field of artificial intelligence, and in particular to a data push method, system, electronic device, and computer-readable storage medium.
The classic recommendation system relies only on big data stored in advance, ignoring that in practice the recommended objects and the recommendation environment are constantly changing, and also ignoring the new information generated while the system interacts with the recommended objects. The inventors realized that this neglected interaction information, and the instantaneous changes it may undergo, are precisely what matters most; traditional recommendation systems are therefore rule-bound to a certain extent and, objectively speaking, do not take the environment and interaction factors into account. As a result, such traditional methods show an obvious lag at the interaction level and cannot keep up with the latest needs of the recommended objects. Building a recommendation system framework that fully considers the system's interaction information has therefore become a hot issue in data mining.
What a recommendation system fears most is severe lag: the time lag in acquiring and analysing user information delays the analysis of user needs, so the system recommends things the user no longer likes, no longer needs, or that are simply wrong. Traditional data push is mainly based on a basic machine learning framework using association rules, for example taking the purchased product as the rule head and the recommended object as the rule body; the classic example is that many people buy bread to go with milk, and such recommendations are cumbersome and inaccurate.
Therefore, a data push method with improved precision is urgently needed.
Summary of the Invention
This application provides a data push method, system, electronic device, and computer-readable storage medium. Its main purpose is to extract personal features related to data push from web browsing information, record and store personal behavior strategies, define a reward function by combining the personal features and the personal behavior strategies, abstract the real-world process of item recommendation into a Markov process based on the reward function, use the Markov property of the Markov process to simplify the Bellman equation so that the push process becomes an iterable equation, solve the iterable equation for its optimal solution, build a neural network around the optimal solution, keep training the neural network until it converges to obtain a data push model, then input millions of data records as data features into the data push model for network training and back-propagate the error with the given loss function to form an optimal data push model, and finally input the personal features of a data push target user into the optimal data push model, which automatically outputs the data push.
To achieve the above purpose, the data push method provided in this application is applied to an electronic device, and the method includes:
S110: extracting personal features and personal behavior information related to data push from web browsing information;
S120: defining a reward function by combining the personal features and the personal behavior information;
S130: abstracting the real-world process of item recommendation into a Markov process based on the reward function;
S140: using the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, solving the iterable equation for its optimal solution, building a neural network around the optimal solution, and continuously training the neural network until the neural network converges to obtain a data push model;
S150: inputting the training data features into the data push model for network training, and back-propagating the error with the given loss function to form an optimal data push model;
S160: inputting the personal features of the data push target user into the optimal data push model, the optimal data push model automatically outputting recommendation information to the target user.
To achieve the above purpose, this application also provides a data push system, including a feature extraction unit, a reward function unit, a network training unit, and an optimization model unit;
the feature extraction unit is used to extract personal features related to data push from web browsing information, and to record and store personal behavior strategies;
the reward function unit is connected to the feature extraction unit and is used to define a reward function by combining the personal features and personal behavior strategies extracted by the feature extraction unit, and to abstract the real-world process of item recommendation into a Markov process based on the reward function;
the network training unit is connected to the reward function unit and is used to simplify the Bellman equation, using the Markov property of the Markov process output by the reward function unit, into an iterable equation, solve the iterable equation for its optimal solution, build a neural network around the optimal solution, and keep training the neural network until it converges to obtain a data push model;
the optimization model unit is connected to the network training unit and is used to input the training data as data features into the data push model obtained by the network training unit for network training, and to back-propagate the error with the given loss function to form an optimal data push model; as long as the personal features of a data push target user are input into the optimal data push model, the optimal data push model automatically outputs the data push.
To achieve the above purpose, this application also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the above data push method are implemented.
In addition, to achieve the above purpose, this application also provides a computer-readable storage medium in which a data push program is stored; when the data push program is executed by a processor, the steps of the aforementioned data push method are implemented.
The data push method, system, electronic device, and computer-readable storage medium proposed in this application extract personal features, record and store personal behavior strategies, abstract the real-world process of item recommendation into a Markov process based on the reward function, use the Markov property of the Markov process to simplify the Bellman equation so that the push process becomes an iterable equation, solve the iterable equation for its optimal solution, build a neural network around the optimal solution, keep training the neural network until it converges to obtain a data push model, and finally input the personal features of the data push target user into the optimal data push model, which automatically outputs the data push. This greatly improves the precision and recall of data push, improves how well the recommended items satisfy user needs, and avoids the lag that exists at the interaction level.
Fig. 1 is a schematic diagram of the application environment of a data push method according to an embodiment of this application;
Fig. 2 is a flowchart of a data push method according to an embodiment of this application;
Fig. 3 is a system framework diagram within a data push electronic device according to an embodiment of this application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of this application.
The realization of the purpose, functional features and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
Existing data push methods are mainly based on a basic machine learning framework using association rules, taking a purchased product as the rule head and the recommended object as the rule body; the time lag in acquiring and analysing user information delays the analysis of user needs, so the user is recommended things they no longer like, no longer need, or that are wrong. To solve these problems in existing data push methods, this application starts from extracting personal features related to data push from web browsing information and recording and storing personal behavior strategies, defines a reward function, abstracts the real-world process of item recommendation into a Markov process, solves for the optimal solution, and keeps training a neural network until it converges to obtain a data push model; it is then only necessary to input the personal features of the data push target user into the optimal data push model, and the optimal data push model automatically outputs the data push.
Specifically, according to one embodiment of this application, a data push method is provided, which is applied to the electronic device 40.
Fig. 1 is a schematic diagram of the application environment of the data push method according to an embodiment of this application. As shown in Fig. 1, the implementation environment of this embodiment is a computer device 110.
The computer device 110 is, for example, a terminal device such as a computer.
It should be noted that the computer terminal device 110 may be a tablet computer, a notebook computer, a desktop computer, etc., running a CentOS (Linux) system, but is not limited to this. Terminal devices 110 such as the computer device may be connected via Bluetooth, USB (Universal Serial Bus), or other communication connection methods, which is not limited in this application.
Fig. 2 is a flowchart of the data push method according to an embodiment of this application. As shown in Fig. 2, in this embodiment the data push method includes the following steps:
S110: extract personal features and personal behavior information related to data push from web browsing information; the personal behavior information is a personal behavior strategy.
During the user's shopping process, personal features such as height, weight, physical condition indicators, and economic status can be extracted from the user's prior browsing information, and personal behavior strategies such as usual shopping intentions, usual shopping times, specific reasons for shopping, shopping locations, and choice of merchants can be recorded and stored. Here, the specific contents of the personal features and personal behavior strategies can be determined from the subject of the individual's web browsing and the e-commerce behavior that occurs during browsing.
If the subject of the user's web browsing is shopping, the extracted personal features may include height, weight, physical condition indicators, economic status, location, etc., and the corresponding personal behavior strategy may include usual shopping intentions, usual shopping times, specific reasons for shopping, shopping locations, choice of merchants, etc.
If the subject of the user's web browsing is study or training, the extracted personal features may include age, education, physical condition indicators, economic status, location, etc., and the corresponding personal behavior strategy may include usual learning needs, study times, specific reasons for learning, the purpose of learning, choice of institutions, etc. If the subject of the user's web browsing is news browsing, the extracted personal features may include gender, age, education, physical condition indicators, economic status, location, etc., and the corresponding personal behavior strategy includes usual browsing topics, usual browsing times, browsing frequency, and so on.
S120: define a reward function by combining the personal features and the personal behavior information.
When performing data push (information recommendation), taking shopping as the subject of the user's web browsing as an example, whether the user ultimately purchases or clicks depends on the results of a series of search rankings rather than on any single search or recommendation. The search engine therefore needs to be treated as the agent and the user as the environment, turning the item recommendation problem into a typical sequential decision problem.
The reward function defined in this embodiment needs to be defined mathematically in advance, and its definition and application are indispensable steps in a reinforcement learning algorithm. If the reward function receives positive feedback because of a certain behavior strategy, the tendency toward that behavior strategy is strengthened; based on the reinforcement learning algorithm, the system keeps trying and keeps recommending, accumulating the reward according to user feedback during this process, until the accumulated value of the environmental feedback received by the reward function is maximised and a local optimal solution is obtained.
The reward function is: if only product clicks occur in a PV (page view), the corresponding reward value is the number of products clicked by the user; if the user purchases a product in a PV, the corresponding reward is the number of times the user clicked the product plus the price of the purchased product; in other cases the reward is zero.
In one embodiment, the data to be pushed is a product recommendation, and the reward function is: if the user clicks a product on the shopping page, a reward value is added for that product, the reward value being the number of products the user clicked; if the user purchases a product on the shopping page, a reward value is added for that product, the reward value being the purchase price of the product; in other cases the reward value is zero.
In another embodiment, the data to be pushed is a training recommendation, and the reward function is: if the user clicks to browse a course on the training page, a reward value is added for that course, the reward value being the number of times the user clicked to browse the course; if the user purchases a course on the training page, a reward value is added for that course, the reward value being the purchase price of the course; in other cases the reward value is zero.
S130: abstract the real-world process of item recommendation into a Markov process based on the reward function.
When a certain behavior strategy of the agent leads to a positive reward from the environment (the reward function value increases), the agent's tendency to produce that behavior strategy is strengthened; the data push (item recommendation) process is then abstracted into an MDP (Markov Decision Process);
the MDP is represented by the quadruple <S, A, R, T>:
where S (State Space) is the state of the data to be pushed on the page during the real item recommendation process;
A (Action Space) is all actions generated on the item recommendation page;
R: S×A×S→R (Reward Function); R(s, a, s′) represents the reward value the agent obtains from the environment when action a is performed in state s and the process transfers to state s′; when the user shifts from clicking a to clicking b, the reward value obtained by b increases;
T: S×A×S→[0,1] is the state transition function of the environment (State Transition Function); T(s, a, s′) represents the probability of performing action a in state s and transferring to state s′.
In the abstraction as a Markov process, the agent perceives the environment state S of the entire data push process and collects the personal behavior strategy; when an action in the action space A of the personal behavior strategy occurs on an item (clicking or browsing an item), the reward function R increases that item's reward value, and the greater the probability T of clicking the item, the more the reward value increases.
In one embodiment, the data push process is a product recommendation process, and the MDP is represented by the quadruple <S, A, R, T>:
where S represents the number of times the product has been clicked, or that the product has been purchased;
A represents that the user is browsing or clicking the item;
R: S×A×S→R is the reward function; R(s, a, s′) represents the reward value obtained by the product when action A is performed in state S and the process transfers to state S′; for example, when the product has already been clicked 5 times and the user clicks it once more, the added reward value is 1;
T: S×A×S→[0,1] is the state transition function; T(3, 2, purchased) represents the probability that, when the product has already been clicked 3 times, it is clicked 2 more times and transfers to the state of purchasing the product.
In another embodiment, the data push process is a course recommendation process, and the MDP is represented by the quadruple <S, A, R, T>:
where S represents the number of times the course has been previewed, or that the course has been purchased;
A represents that the user is browsing or previewing the course;
R: S×A×S→R is the reward function; R(s, a, s′) represents the reward value obtained by the item when action A is performed in state S and the process transfers to state S′; for example, when the course has been browsed 3 times and is then previewed once, the reward value obtained by the course is 1;
T: S×A×S→[0,1] is the state transition function; T(3, 2, s′) represents the probability that, when the course has already been browsed or previewed 3 times, it is browsed or previewed 2 more times and transfers to the state of purchasing the course.
S140: use the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, solve the iterable equation for its optimal solution, build a neural network around the optimal solution, and keep training the neural network until it converges to obtain a data push model.
First, simplify the Bellman equation and transform the push process into an iterable equation, then solve the iterable equation for its optimal solution;
the Bellman equation is simplified based on the Markov property so that it becomes an iterable equation whose optimal solution can be obtained through iteration; the iterable equation is:
where γ is the decay (discount) coefficient, and S, R, and t are as above; the iterable equation is used to maximise the accumulated reward;
solving the iterable equation for its optimal solution means finding the maximum objective function Q, i.e. obtaining the largest cumulative reward through the agent's recommendations within one batch; the batch is a data set, and a sampling approach is chosen while solving for the optimal solution of the iterable equation, that is, calculations are performed on a small batch data set, and batches are drawn and computed in a loop until the upper threshold is reached or the results converge (a relatively better result is obtained).
Then, once the approximate representation holds mathematically, a DQN model neural network is built around the optimal solution;
two neural network architectures N1 and N2 with the same structure but different parameters are built; N1 estimates the evaluation value and N2 computes the target value, the network is then iteratively updated through back-propagation, and after every k iterations N1's network parameters are periodically copied into N2. N1 and N2 are both fully connected networks of neurons; the activation function used is relu, the input is the feature, and the output is the value corresponding to each action; the number of neurons varies slightly across different scenarios;
the neural network is initialised with a large number of parameters, and the machine keeps learning and updating these parameters until the framework of the neural network converges; when the neural network converges, the optimal solution of the above iterable equation has been obtained, that is, the parameters that make the entire push process optimal have been found.
Specifically, the input of the constructed network is the feature map of a certain state St, which passes through a fully connected layer of 100 neurons with the tanh activation function; the output layer finally outputs the action value Vi corresponding to each action ai. Stochastic Gradient Descent is used in the neural network for network iteration. The algorithm applies the Experience Replay method: before the specified t memories to be stored, every current-step state involved, the corresponding action taken, the delayed reward obtained, and the corresponding next state state′ are stored, as in the following formula:
each stored experience is e_t = (s_t, a_t, r_t, s_{t+1}), and the experiences are kept in the replay memory D_t = {e_1, ..., e_t}.
Finally, uniform sampling with replacement is performed.
In the above formula, E is the expectation function, a is the action space (Action Space), r is the reward function (Reward Function), s is the environment state (with s′ denoting the next state), U(D) denotes uniform random sampling, γ is the decay coefficient, and Q is the cumulative reward function; that is, the iterative loss is computed by subtracting the predicted reward in the Q table from the real reward of the next step.
S150: input millions of data records as training data features into the data push model for network training, and back-propagate the error with the given loss function to form an optimal data push model.
After the data push model is formed, millions of data records are input as data features into the Deep Q Network for network training, the error is back-propagated with the given loss function, and training continues until the model converges into a well-formed data push model, so as to obtain the optimal data push model.
S160: input the personal features of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
When the optimal data push model automatically outputs the data push, the pushed items are those that the neural network in the optimal push model, through machine learning and repeated training, has determined to maximise the target user's probability of purchase.
The data push method in this embodiment first extracts the personal features related to data push during the user's shopping process, records and stores the personal behavior strategy, then defines the reward function by combining the personal features and the personal behavior strategy, abstracts the real-world process of item recommendation into a Markov process, uses the Markov property of the Markov process to simplify the Bellman equation, transforms the push process into an iterable equation, solves the iterable equation for its optimal solution, and obtains the data push model; as long as the user's features are input into the data model, the data model automatically pushes the items best suited to the user and with the greatest probability of being purchased by the user. This method not only improves the accuracy of recommended items, but also greatly avoids the lag that exists at the interaction level.
In another aspect, this application also provides a data push system. Fig. 3 is a framework diagram of a data push system according to an embodiment of this application; the system corresponds to the aforementioned data push method and can be deployed in a data push electronic device.
As shown in Fig. 3, the data push system 300 includes a feature extraction unit 310, a reward function unit 320, a network training unit 330, and an optimization model unit 340.
The feature extraction unit 310 is used to extract personal features related to data push from web browsing information, and to record and store personal behavior strategies;
the reward function unit 320 is connected to the feature extraction unit 310 and is used to define a reward function by combining the personal features and personal behavior strategies extracted by the feature extraction unit 310, and to abstract the real-world process of item recommendation into a Markov process based on the reward function;
the network training unit 330 is connected to the reward function unit 320 and is used to simplify the Bellman equation, using the Markov property of the Markov process output by the reward function unit 320, into an iterable equation, solve the iterable equation for its optimal solution, build a neural network around the optimal solution, and keep training the neural network until it converges to obtain a data push model;
the optimization model unit 340 is connected to the network training unit 330 and is used to input millions of data records as data features into the data push model obtained by the network training unit 330 for network training, and to back-propagate the error with the given loss function to form an optimal data push model; as long as the personal features of a data push target user are input into the optimal data push model, the optimal data push model automatically outputs the data push.
Fig. 4 is a schematic diagram of the electronic device of this application. In this embodiment, the electronic device 40 may be a terminal device with computing capability, such as a server, a tablet computer, a portable computer, or a desktop computer.
The electronic device 40 includes a processor 41, a memory 42, a computer program 43, a network interface, and a communication bus.
The electronic device 40 may be a tablet computer, a desktop computer, or a smartphone, but is not limited to these.
The memory 42 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, or card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 40, such as a hard disk of the electronic device 40. In other embodiments, the readable storage medium may also be an external memory of the electronic device 40, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 40.
In this embodiment, the readable storage medium of the memory 42 is generally used to store the computer program 43 installed in the electronic device 40, as well as the key generation unit, key management unit, transmission unit, alarm unit, and the like.
In some embodiments, the processor 41 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code stored in the memory 42 or to process data, for example the data push program.
The network interface may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 40 and other electronic devices.
The communication bus is used to realise connection and communication between these components.
Fig. 4 only shows the electronic device 40 with components 41-43, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
In the electronic device embodiment shown in Fig. 4, the memory 42, as a computer storage medium, may store an operating system and a data push program 43; when the processor 41 executes the data push program stored in the memory 42, the following steps are implemented:
S110: extract personal features and personal behavior information related to data push from web browsing information;
S120: define a reward function by combining the personal features and the personal behavior information;
S130: abstract the real-world process of item recommendation into a Markov process based on the reward function;
S140: use the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, solve the iterable equation for its optimal solution, build a neural network around the optimal solution, and keep training the neural network until it converges to obtain a data push model;
S150: input millions of data records as training data features into the data push model for network training, and back-propagate the error with the given loss function to form an optimal data push model;
S160: input the personal features of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information.
In addition, an embodiment of this application also proposes a computer-readable storage medium, which may be non-volatile or volatile; the computer-readable storage medium includes a data push program, and when the data push program is executed by a processor, the following operations are implemented:
S110: extract personal features and personal behavior information related to data push from web browsing information;
S120: define a reward function by combining the personal features and the personal behavior information;
S130: abstract the real-world process of item recommendation into a Markov process based on the reward function;
S140: use the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, solve the iterable equation for its optimal solution, build a neural network around the optimal solution, and keep training the neural network until it converges to obtain a data push model;
S150: input millions of data records as training data features into the data push model for network training, and back-propagate the error with the given loss function to form an optimal data push model;
S160: input the personal features of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
The specific implementation of the computer-readable storage medium of this application is substantially the same as that of the above data push method and electronic device, and will not be repeated here.
It should be noted that, in this document, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, apparatus, article, or method that includes that element.
The above serial numbers of the embodiments of this application are for description only and do not represent the superiority or inferiority of the embodiments. Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as ROM/RAM, magnetic disk, or optical disk) and includes several instructions to cause a terminal device (which may be a computer, a server, a network device, etc.) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not therefore limit its patent scope; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.
Claims (20)
- A data push method applied to an electronic device, the method comprising: S110: extracting personal features and personal behavior information related to data push from web browsing information; S120: defining a reward function by combining the personal features and the personal behavior information; S130: abstracting the real-world process of item recommendation into a Markov process based on the reward function; S140: using the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, solving the iterable equation for its optimal solution, building a neural network around the optimal solution, and continuously training the neural network until the neural network converges to obtain a data push model; S150: inputting training data features into the data push model for network training, and back-propagating the error with the given loss function to form an optimal data push model; S160: inputting the personal features of a data push target user into the optimal data push model, the optimal data push model outputting recommendation information to the target user.
- The data push method according to claim 1, wherein the reward function is: if only product clicks occur in a PV, the corresponding reward value is the number of times the user clicks the product; if the user purchases a product in a PV, the corresponding reward is the number of times the user clicks the product plus the price of the purchased product; in other cases the reward is 0.
- The data push method according to claim 1, wherein the Markov process is represented by the quadruple <S, A, R, T>, wherein: S is the state of the data to be pushed on the page during the real item recommendation process; A is all actions generated on the item recommendation page; R: S×A×S→R is the reward function: when the user performs action A and transfers from state S to state S′, state S′ obtains a reward value, and when the user shifts from clicking item a to clicking item b, item b obtains a reward value; T: S×A×S→[0,1] is the state transition function of the environment, and T(s, a, s′) represents the probability of performing action a in state s and transferring to state S′.
- The data push method according to claim 1, wherein the optimal solution of the iterable equation is the maximum cumulative reward obtained through the agent's recommendations within one batch; the optimal solution of the iterable equation is obtained by sampling, the process being: calculations are first performed on a small batch data set, and then batches are drawn and computed in a loop until the upper threshold is reached or the results converge.
- The data push method according to claim 1, wherein continuously training the neural network until the neural network converges to obtain the data push model comprises: using Stochastic Gradient Descent in the neural network for network iteration, and applying the Experience Replay method to store, before the specified t memories to be stored, every current S involved, the corresponding A taken, the delayed R obtained, and the corresponding next S′.
- The data push method according to claim 5, wherein, when the optimal data push model automatically outputs the data push, the pushed items are those that the neural network in the optimal push model, through machine learning and repeated training, has determined to maximise the target user's probability of purchase.
- A data push system comprising a feature extraction unit, a reward function unit, a network training unit, and an optimization model unit; the feature extraction unit is used to extract personal features related to data push from web browsing information, and to record and store personal behavior strategies; the reward function unit is connected to the feature extraction unit and is used to define a reward function by combining the personal features and personal behavior strategies extracted by the feature extraction unit, and to abstract the real-world process of item recommendation into a Markov process based on the reward function; the network training unit is connected to the reward function unit and is used to simplify the Bellman equation, using the Markov property of the Markov process output by the reward function unit, into an iterable equation, solve the iterable equation for its optimal solution, build a neural network around the optimal solution, and continuously train the neural network until it converges to obtain a data push model; the optimization model unit is connected to the network training unit and is used to input the training data as data features into the data push model obtained by the network training unit for network training, and to back-propagate the error with the given loss function to form an optimal data push model; as long as the personal features of a data push target user are input into the optimal data push model, the optimal data push model automatically outputs the data push.
- An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps: S110: extracting personal features and personal behavior information related to data push from web browsing information; S120: defining a reward function by combining the personal features and the personal behavior information; S130: abstracting the real-world process of item recommendation into a Markov process based on the reward function; S140: using the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, solving the iterable equation for its optimal solution, building a neural network around the optimal solution, and continuously training the neural network until the neural network converges to obtain a data push model; S150: inputting training data features into the data push model for network training, and back-propagating the error with the given loss function to form an optimal data push model; S160: inputting the personal features of a data push target user into the optimal data push model, the optimal data push model outputting recommendation information to the target user.
- The electronic device according to claim 9, wherein the reward function is: if only product clicks occur in a PV, the corresponding reward value is the number of times the user clicks the product; if the user purchases a product in a PV, the corresponding reward is the number of times the user clicks the product plus the price of the purchased product; in other cases the reward is 0.
- The electronic device according to claim 9, wherein the Markov process is represented by the quadruple <S, A, R, T>, wherein: S is the state of the data to be pushed on the page during the real item recommendation process; A is all actions generated on the item recommendation page; R: S×A×S→R is the reward function: when the user performs action A and transfers from state S to state S′, state S′ obtains a reward value, and when the user shifts from clicking item a to clicking item b, item b obtains a reward value; T: S×A×S→[0,1] is the state transition function of the environment, and T(s, a, s′) represents the probability of performing action a in state s and transferring to state S′.
- The electronic device according to claim 9, wherein the optimal solution of the iterable equation is the maximum cumulative reward obtained through the agent's recommendations within one batch; the optimal solution of the iterable equation is obtained by sampling, the process being: calculations are first performed on a small batch data set, and then batches are drawn and computed in a loop until the upper threshold is reached or the results converge.
- The electronic device according to claim 9, wherein continuously training the neural network until the neural network converges to obtain the data push model comprises: using Stochastic Gradient Descent in the neural network for network iteration, and applying the Experience Replay method to store, before the specified t memories to be stored, every current S involved, the corresponding A taken, the delayed R obtained, and the corresponding next S′.
- The electronic device according to claim 13, wherein, when the optimal data push model automatically outputs the data push, the pushed items are those that the neural network in the optimal push model, through machine learning and repeated training, has determined to maximise the target user's probability of purchase.
- A computer-readable storage medium storing a data push program, wherein, when the data push program is executed by a processor, the following steps are implemented: S110: extracting personal features and personal behavior information related to data push from web browsing information; S120: defining a reward function by combining the personal features and the personal behavior information; S130: abstracting the real-world process of item recommendation into a Markov process based on the reward function; S140: using the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, solving the iterable equation for its optimal solution, building a neural network around the optimal solution, and continuously training the neural network until the neural network converges to obtain a data push model; S150: inputting training data features into the data push model for network training, and back-propagating the error with the given loss function to form an optimal data push model; S160: inputting the personal features of a data push target user into the optimal data push model, the optimal data push model outputting recommendation information to the target user.
- The computer-readable storage medium according to claim 16, wherein the reward function is: if only product clicks occur in a PV, the corresponding reward value is the number of times the user clicks the product; if the user purchases a product in a PV, the corresponding reward is the number of times the user clicks the product plus the price of the purchased product; in other cases the reward is 0.
- The computer-readable storage medium according to claim 16, wherein the Markov process is represented by the quadruple <S, A, R, T>, wherein: S is the state of the data to be pushed on the page during the real item recommendation process; A is all actions generated on the item recommendation page; R: S×A×S→R is the reward function: when the user performs action A and transfers from state S to state S′, state S′ obtains a reward value, and when the user shifts from clicking item a to clicking item b, item b obtains a reward value; T: S×A×S→[0,1] is the state transition function of the environment, and T(s, a, s′) represents the probability of performing action a in state s and transferring to state S′.
- The computer-readable storage medium according to claim 16, wherein continuously training the neural network until the neural network converges to obtain the data push model comprises: using Stochastic Gradient Descent in the neural network for network iteration, and applying the Experience Replay method to store, before the specified t memories to be stored, every current S involved, the corresponding A taken, the delayed R obtained, and the corresponding next S′.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010119662.6A CN111401937A (zh) | 2020-02-26 | 2020-02-26 | 数据推送方法、装置及存储介质 |
CN202010119662.6 | 2020-02-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021169218A1 true WO2021169218A1 (zh) | 2021-09-02 |
Family
ID=71413972
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/112365 WO2021169218A1 (zh) | 2020-02-26 | 2020-08-31 | 数据推送方法、系统、电子装置及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111401937A (zh) |
WO (1) | WO2021169218A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401937A (zh) * | 2020-02-26 | 2020-07-10 | 平安科技(深圳)有限公司 | 数据推送方法、装置及存储介质 |
CN112565904B (zh) * | 2020-11-30 | 2023-05-09 | 北京达佳互联信息技术有限公司 | 视频片段推送方法、装置、服务器以及存储介质 |
CN118134553B (zh) * | 2024-05-08 | 2024-07-19 | 深圳爱巧网络有限公司 | 一种电商爆款多平台协同推送系统、方法、设备及介质 |
-
2020
- 2020-02-26 CN CN202010119662.6A patent/CN111401937A/zh active Pending
- 2020-08-31 WO PCT/CN2020/112365 patent/WO2021169218A1/zh active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001078003A1 (en) * | 2000-04-10 | 2001-10-18 | University Of Otago | Adaptive learning system and method |
CN106447463A (zh) * | 2016-10-21 | 2017-02-22 | 南京大学 | 一种基于马尔科夫决策过程模型的商品推荐方法 |
CN109471963A (zh) * | 2018-09-13 | 2019-03-15 | 广州丰石科技有限公司 | 一种基于深度强化学习的推荐算法 |
CN109451038A (zh) * | 2018-12-06 | 2019-03-08 | 北京达佳互联信息技术有限公司 | 一种信息推送方法、装置、服务器及计算机可读存储介质 |
CN110659947A (zh) * | 2019-10-11 | 2020-01-07 | 沈阳民航东北凯亚有限公司 | 商品推荐方法及装置 |
CN111401937A (zh) * | 2020-02-26 | 2020-07-10 | 平安科技(深圳)有限公司 | 数据推送方法、装置及存储介质 |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113761375A (zh) * | 2021-09-10 | 2021-12-07 | 未鲲(上海)科技服务有限公司 | 基于神经网络的消息推荐方法、装置、设备及存储介质 |
CN114139472A (zh) * | 2021-11-04 | 2022-03-04 | 江阴市智行工控科技有限公司 | 基于强化学习双模型结构的集成电路直流分析方法及系统 |
CN114218290A (zh) * | 2021-12-06 | 2022-03-22 | 中国航空综合技术研究所 | 装备人机交互界面可用性评估的选择方法 |
CN114218290B (zh) * | 2021-12-06 | 2024-05-03 | 中国航空综合技术研究所 | 装备人机交互界面可用性评估的选择方法 |
WO2023142448A1 (zh) * | 2022-01-26 | 2023-08-03 | 北京沃东天骏信息技术有限公司 | 热点信息的处理方法、装置、服务器和可读存储介质 |
CN114710792A (zh) * | 2022-03-30 | 2022-07-05 | 合肥工业大学 | 基于强化学习的5g配网分布式保护装置的优化布置方法 |
CN114943278A (zh) * | 2022-04-27 | 2022-08-26 | 浙江大学 | 基于强化学习的持续在线群体激励方法、装置及存储介质 |
CN114943278B (zh) * | 2022-04-27 | 2023-09-12 | 浙江大学 | 基于强化学习的持续在线群体激励方法、装置及存储介质 |
CN115640933A (zh) * | 2022-11-03 | 2023-01-24 | 昆山润石智能科技有限公司 | 生产线缺陷自动管理方法、装置、设备及存储介质 |
CN115640933B (zh) * | 2022-11-03 | 2023-10-13 | 昆山润石智能科技有限公司 | 生产线缺陷自动管理方法、装置、设备及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN111401937A (zh) | 2020-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021169218A1 (zh) | 数据推送方法、系统、电子装置及存储介质 | |
Zhao et al. | Deep reinforcement learning for list-wise recommendations | |
US20230153857A1 (en) | Recommendation model training method, recommendation method, apparatus, and computer-readable medium | |
CN110717098B (zh) | 基于元路径的上下文感知用户建模方法、序列推荐方法 | |
Kim et al. | TWILITE: A recommendation system for Twitter using a probabilistic model based on latent Dirichlet allocation | |
US10938927B2 (en) | Machine learning techniques for processing tag-based representations of sequential interaction events | |
Mansour et al. | Bayesian incentive-compatible bandit exploration | |
US8843427B1 (en) | Predictive modeling accuracy | |
Benoit et al. | Binary quantile regression: a Bayesian approach based on the asymmetric Laplace distribution | |
JP5789204B2 (ja) | マルチリレーショナル環境において項目を推薦するためのシステム及び方法 | |
US20140046880A1 (en) | Dynamic Predictive Modeling Platform | |
CN109993583B (zh) | 信息推送方法和装置、存储介质及电子装置 | |
CN105159910A (zh) | 信息推荐方法和装置 | |
WO2012103290A1 (en) | Dynamic predictive modeling platform | |
US20220245424A1 (en) | Microgenre-based hyper-personalization with multi-modal machine learning | |
CN106447463A (zh) | 一种基于马尔科夫决策过程模型的商品推荐方法 | |
Beirlant et al. | Peaks-Over-Threshold modeling under random censoring | |
Zhang et al. | Dynamic scholarly collaborator recommendation via competitive multi-agent reinforcement learning | |
CN110598120A (zh) | 基于行为数据的理财推荐方法及装置、设备 | |
CN109117442B (zh) | 一种应用推荐方法及装置 | |
WO2020221022A1 (zh) | 业务对象推荐方法 | |
TW201636930A (zh) | 基於用戶操作行為的服務提供方法及裝置 | |
Saha et al. | Towards integrated dialogue policy learning for multiple domains and intents using hierarchical deep reinforcement learning | |
KR20210029826A (ko) | 당사자들 사이의 전략적 상호작용에서의 전략 검색을 위한 샘플링 방식들 | |
CN114119123A (zh) | 信息推送的方法和装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20921352 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20921352 Country of ref document: EP Kind code of ref document: A1 |