WO2021169218A1 - Data pushing method and system, electronic device and storage medium - Google Patents

Data pushing method and system, electronic device and storage medium Download PDF

Info

Publication number
WO2021169218A1
WO2021169218A1 (PCT/CN2020/112365; CN 2020112365 W)
Authority
WO
WIPO (PCT)
Prior art keywords
data push
neural network
data
reward
optimal
Prior art date
Application number
PCT/CN2020/112365
Other languages
French (fr)
Chinese (zh)
Inventor
陈娴娴
阮晓雯
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021169218A1 publication Critical patent/WO2021169218A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0251Targeted advertisements
    • G06Q30/0255Targeted advertisements based on user history
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0251Targeted advertisements
    • G06Q30/0269Targeted advertisements based on user profile or attribute
    • G06Q30/0271Personalized advertisement

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a data push method, system, electronic device, and computer-readable storage medium.
  • The classic recommendation system relies only on big data stored in advance, but ignores that the recommended objects and the recommendation environment are constantly changing in practice, and also ignores the new information generated during the interaction between the system and the recommended objects. The inventors realized that this neglected interaction information, and its possible moment-to-moment variability, is precisely what matters most. Traditional recommendation systems are therefore rule-bound to a certain extent and, objectively speaking, do not take environment and interaction factors into account. As a result, this type of traditional method lags noticeably at the interaction level and cannot keep up with the latest needs of the recommended objects. Building a recommendation system framework that fully considers the system's interaction information has therefore become a hot issue in data mining.
  • What a recommendation system fears most is serious lag: the time lag in acquiring and analyzing user information delays the analysis of user needs, so the system recommends things the user no longer likes, no longer needs, or that are simply wrong.
  • Traditional data push is mainly based on a basic machine learning framework using association rules, for example treating a purchased product as the rule head and the recommended object as the rule body; the classic example is that many people who buy milk also buy bread to go with it, and recommendations of this kind are cluttered and inaccurate.
  • This application provides a data push method, system, electronic device, and computer-readable storage medium. The main purpose is to extract personal characteristics related to data push from web browsing information, record and store personal behavior strategies, define a reward function combining the personal characteristics and personal behavior strategies, abstract the actual process of item recommendation into a Markov process based on the reward function, use the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model; millions of records are then input into the data push model as data features for network training, with a given loss function used to back-propagate the error and form the optimal data push model; finally, the personal characteristics of the data push target user are input into the optimal data push model, and the optimal data push model automatically outputs the data push.
  • the data push method provided in this application is applied to an electronic device, and the method includes:
  • S120 Define a reward function in combination with the personal characteristics and personal behavior information
  • S140 Use the Markov property of the Markov process to simplify the Bellman equation to form an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and continuously train the neural network until the neural network converges to obtain a data push model;
  • S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
  • this application also provides a data push system, including: a feature extraction unit, a reward function unit, a network training unit, and an optimization model unit;
  • the feature extraction unit is used to extract personal features related to data push based on web browsing information, record and store personal behavior strategies;
  • the reward function unit is connected to the feature extraction unit, and is used to define the reward function in combination with the personal characteristics extracted by the feature extraction unit and the personal behavior strategy, and based on the reward function, the actual process of item recommendation is abstracted into a Markov process;
  • the network training unit is connected with the reward function unit, and is used to simplify the Bellman equation using the Markov property of the Markov process output by the reward function unit to form an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain the data push model;
  • the optimization model unit is connected to the network training unit, and is used to input the training data as data features into the data push model obtained from the network training unit for network training, and to use the given loss function to back-propagate the error, forming the optimal data push model; once the personal characteristics of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
  • the present application also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the steps in the data push method described above are implemented.
  • this application also provides a computer-readable storage medium in which a data push program is stored; when the data push program is executed by a processor, the steps of the aforementioned data push method are implemented.
  • the data push method, system, electronic device and computer-readable storage medium proposed in this application extract personal characteristics, record and store personal behavior strategies, abstract the actual process of item recommendation into a Markov process based on the reward function, use the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain the data push model; finally, the personal characteristics of the data push target user are input into the optimal data push model, and the optimal data push model automatically outputs the data push. This greatly improves the precision and recall of data push, improves how well the recommended items satisfy user needs, and avoids the lag at the interaction level.
  • Fig. 1 is a schematic diagram of an application environment of a data push method according to an embodiment of the present application
  • Fig. 2 is a flowchart of a data push method according to an embodiment of the present application
  • Fig. 3 is a system framework diagram in a data push electronic device according to an embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the existing data push method is mainly based on a basic machine learning framework using association rules, with the purchased goods as the rule head and the recommended object as the rule body. The time lag in acquiring and analyzing user information delays the analysis of user needs, so the system recommends something the user no longer likes, no longer needs, or that is wrong.
  • To solve the above problems in existing data push methods, this application extracts personal characteristics related to data push from web browsing information, records and stores personal behavior strategies, defines a reward function, abstracts the actual process of item recommendation into a Markov process, obtains the optimal solution, and keeps training the neural network until it converges to obtain the data push model. The personal characteristics of the data push target user only need to be input into the optimal data push model, and the optimal data push model automatically outputs the data push.
  • a data push method is provided, which is applied to the electronic device 40.
  • Fig. 1 is a schematic diagram of an application environment of a data push method according to an embodiment of the present application. As shown in FIG. 1, the implementation environment in this embodiment is a computer device 110.
  • the computer device 110 is a terminal device such as a computer.
  • It should be noted that the computer terminal device 110 may be a tablet computer, a notebook computer, a desktop computer, etc., running a CentOS (Linux) system, but is not limited to this.
  • The terminal device 110, such as a computer device, can be connected via Bluetooth, USB (Universal Serial Bus), or other communication connection methods, which is not limited in this application.
  • Fig. 2 is a flowchart of a data push method according to an embodiment of the present application. As shown in Figure 2, in this embodiment, the data push method includes the following steps:
  • S110 Extract personal characteristics and personal behavior information related to data push based on web browsing information; personal behavior information is a personal behavior strategy;
  • If the subject of the user's web browsing is shopping, the extracted personal characteristics can include height, weight, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy can include usual shopping intentions, usual shopping times, specific reasons for shopping, shopping locations, choice of institutions, and so on.
  • If the subject of the user's web browsing is learning or training, the extracted personal characteristics can include age, education, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy can include usual learning needs, learning times, specific reasons for learning, learning goals, choice of institutions, and so on; if the subject of the user's web browsing is news, the extracted personal characteristics can include gender, age, education, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy includes usual browsing topics, usual browsing times, browsing frequency, and so on.
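  • For illustration only, the features and behavior strategy extracted for a shopping session could be gathered into a simple record before being fed to the model; the field names below are hypothetical examples, not terms from the application.

```python
# Hypothetical sketch of the extracted inputs; all field names are illustrative assumptions.
shopping_profile = {
    "personal_features": {
        "height_cm": 172,
        "weight_kg": 65,
        "health_index": 0.8,
        "economic_level": "medium",
        "region": "Shenzhen",
    },
    "behavior_strategy": {
        "usual_intent": "groceries",
        "usual_shopping_hour": 20,        # usually shops around 8 pm
        "shopping_reason": "weekly restock",
        "preferred_institution": "online supermarket",
    },
}
```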
  • The reward function defined in this embodiment needs to be defined mathematically in advance; its definition and application are indispensable steps in a reinforcement learning algorithm. If the reward function receives positive feedback because of a certain behavior strategy, the tendency toward that behavior strategy is strengthened; based on the reinforcement learning algorithm, the system keeps trying and keeps recommending, accumulating the reward according to user feedback during the trying and recommending process, until the accumulated value of the reward function's environmental feedback is maximized and a local optimal solution is obtained.
  • The reward function is: if only product clicks occur in a PV (page view), the corresponding reward value is the number of product clicks by the user; if the user's purchase of a product occurs in a PV, the corresponding reward is the number of product clicks by the user plus the price of the purchased product; in all other cases the reward is zero.
  • In one embodiment, the data to be pushed is a product recommendation, and the reward function is: if the user clicks on a product on the shopping page, a reward value is added to that product, equal to the number of times the user clicks on the product; if the user purchases a product on the shopping page, a reward value is added to that product, equal to the purchase price of the product; otherwise the reward value is zero.
  • In another embodiment, the data to be pushed is a training recommendation, and the reward function is: if the user clicks to browse a course on the training page, a reward value is added to that course, equal to the number of times the user clicks to browse the course; if the user purchases a course on the training page, a reward value is added to that course, equal to the purchase price of the course; otherwise the reward value is zero.
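  • As a minimal sketch of the page-view reward rule described above (Python; the function name and arguments are assumptions for illustration, not part of the application):

```python
from typing import Optional

def pv_reward(click_count: int, purchase_price: Optional[float] = None) -> float:
    """Reward for one PV (page view): clicks only -> number of clicks;
    a purchase -> clicks plus the purchased item's price; otherwise zero."""
    if purchase_price is not None:
        return click_count + purchase_price
    if click_count > 0:
        return float(click_count)
    return 0.0

# Example: a PV with 3 product clicks and a 49.0 purchase yields a reward of 52.0.
assert pv_reward(3, 49.0) == 52.0
assert pv_reward(2) == 2.0
assert pv_reward(0) == 0.0
```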
  • the MDP is represented by the four-tuple <S, A, R, T>:
  • S (State Space) is the state of the data to be pushed on the page during the actual item recommendation process;
  • A (Action Space) is the set of all actions generated on the item recommendation page;
  • R: S × A × S → R (Reward Function), where R(s, a, s′) is the reward value the Agent obtains from the environment when action a is executed in state s and the process transfers to state s′; when the user shifts from clicking a to clicking b, the reward value obtained for b increases;
  • T: S × A × S → [0, 1] is the state transition function of the environment (State Transition Function), where T(s, a, s′) is the probability of executing action a in state s and transferring to state s′.
  • When abstracted into a Markov process, the Agent perceives the environment state S of the entire data push process and collects the personal behavior strategy.
  • When an action in the action space A of the personal behavior strategy acts on an item (clicking or browsing a certain item), the reward function R increases the reward value of that item; the greater the click probability T for the item, the more the reward value increases.
  • In one embodiment, the data push process is a product recommendation process, and the MDP is represented by the quadruple <S, A, R, T>:
  • S represents the number of times the product has been clicked, or that the product has been purchased;
  • A represents the user browsing or clicking on the product;
  • R: S × A × S → R is the reward function, where R(s, a, s′) is the reward value obtained by the product when action a is executed in state s and the process transfers to state s′; for example, when a product that has already been clicked 5 times is clicked once more, the added reward value is 1;
  • T: S × A × S → [0, 1] is the state transition function, where T(3, 2, s′) is the probability that a product that has already been clicked 3 times is clicked 2 more times and transfers to the state of the product being purchased.
  • In another embodiment, the data push process is a course recommendation process, and the MDP is represented by the quadruple <S, A, R, T>:
  • S represents the number of times the course has been previewed, or that the course has been purchased;
  • A represents the user browsing or previewing the course;
  • R: S × A × S → R is the reward function, where R(s, a, s′) is the reward value obtained by the course when action a is executed in state s and the process transfers to state s′; for example, when a course that has already been browsed 3 times is previewed once, the reward value obtained by the course is 1;
  • T: S × A × S → [0, 1] is the state transition function, where T(3, 2, s′) is the probability that a course that has already been browsed or previewed 3 times is browsed or previewed 2 more times and transfers to the state of the course being purchased.
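  • As a rough sketch of the <S, A, R, T> abstraction for the product-recommendation case (the state encoding, price constant, and class name are illustrative assumptions, not specified by the application):

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed state encoding: ("clicked", n) after n clicks, or ("purchased", 0).
State = Tuple[str, int]
Action = str  # e.g. "browse" or "click"

PRODUCT_PRICE = 99.0  # assumed example price

@dataclass
class RecommendationMDP:
    """Toy container for the <S, A, R, T> quadruple sketched above."""
    transition: Dict[Tuple[State, Action, State], float] = field(default_factory=dict)  # T(s, a, s')

    def reward(self, s: State, a: Action, s_next: State) -> float:
        # R(s, a, s'): one extra click adds 1; reaching the "purchased" state adds the price.
        if s_next[0] == "purchased":
            return PRODUCT_PRICE
        if s_next[0] == "clicked" and s_next[1] == s[1] + 1:
            return 1.0
        return 0.0

mdp = RecommendationMDP()
mdp.transition[(("clicked", 3), "click", ("purchased", 0))] = 0.2  # illustrative T(3, click, purchased)
```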
  • S140 Use the Markov property of the Markov process to simplify the Bellman equation to form an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and continue training the neural network until the neural network converges to obtain a data push model;
  • where γ is the attenuation (discount) coefficient and S, R, and t are as defined above; the iterable equation is used to maximize the accumulated reward;
  • Solving for the optimal solution of the iterable equation means maximizing the objective function Q, i.e. obtaining the largest cumulative reward through the agent's recommendations within a batch; here the batch is a data set, and sampling is chosen when solving for the optimal solution of the iterable equation, that is, the calculation is performed on a small batch data set, looping over batches and looping the calculation until the upper threshold is reached or the results converge (a relatively better result is obtained).
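  • A minimal sketch of iterating the simplified Bellman equation over small batches until the result converges or a round limit is reached (shown in tabular form only for clarity; the application approximates Q with a neural network, and the hyperparameters here are assumptions):

```python
import random
from collections import defaultdict

def iterate_q(batches, gamma=0.9, alpha=0.1, max_rounds=1000, tol=1e-4):
    """Loop over small batches of (s, a, r, s_next, next_actions) samples and apply
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    until the largest update falls below tol or the round limit is reached."""
    q = defaultdict(float)
    for _ in range(max_rounds):
        batch = random.choice(batches)          # sample one small batch data set
        max_delta = 0.0
        for s, a, r, s_next, next_actions in batch:
            best_next = max((q[(s_next, a2)] for a2 in next_actions), default=0.0)
            delta = r + gamma * best_next - q[(s, a)]
            q[(s, a)] += alpha * delta
            max_delta = max(max_delta, abs(delta))
        if max_delta < tol:                     # results have converged
            break
    return q
```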
  • Two neural network architectures N1 and N2 with the same structure but different parameters are built: N1 is used to estimate the evaluation value, N2 is used to calculate the target value, and the network is iteratively updated through back-propagation.
  • After k rounds of iteration, the network parameters of N1 are periodically transplanted into N2; N1 and N2 are both fully connected networks of neurons, the activation function used is ReLU, the input is the feature, and the output is the value corresponding to each action.
  • The neural network is initialized with many parameters, and the machine keeps learning and updating these parameters until the framework of the neural network converges; when the neural network converges, the optimal solution of the above iterable equation is obtained, that is, the optimal parameters for the entire push process are found.
  • the input of the constructed network is a feature map of a certain state St.
  • Stochastic Gradient Descent is used for network iteration.
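  • A minimal sketch of the two-network setup described above, written with PyTorch (the framework, feature and action dimensions, hidden size, and sync interval are assumptions; the application only specifies two identically structured fully connected networks with periodic parameter transplantation from N1 to N2):

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected network: feature vector in, one value per action out (ReLU activation)."""
    def __init__(self, feature_dim: int, num_actions: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# N1 estimates the evaluation value, N2 computes the target value: same structure, separate parameters.
n1 = QNetwork(feature_dim=32, num_actions=10)
n2 = copy.deepcopy(n1)

def sync_target(step: int, k: int = 1000) -> None:
    """Every k iterations, transplant N1's network parameters into N2."""
    if step % k == 0:
        n2.load_state_dict(n1.state_dict())
```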
  • the Experience Replay method is used in the algorithm.
  • In the above formula, E is the expectation; a is the action space (Action Space); r is the reward function (Reward Function); s is the environment state transition function (State Transition Function), where s′ denotes the next state; U(D) denotes uniform random sampling from the memory D; γ is the attenuation (discount) coefficient; and Q is the cumulative reward function. That is, the loss is iterated by subtracting the predicted reward in the Q table from the real reward of the next step.
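  • Continuing the N1/N2 sketch above, one training step with experience replay, uniform sampling U(D), and the fixed Q-target loss could look like this (PyTorch; the replay capacity, learning rate, and batch size are assumed values):

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

memory = deque(maxlen=100_000)                         # replay memory D of experiences e_t = (s, a, r, s')
optimizer = torch.optim.SGD(n1.parameters(), lr=1e-3)  # stochastic gradient descent on the evaluation net N1
GAMMA = 0.9                                            # attenuation (discount) coefficient

def store(s, a, r, s_next):
    """Store one experience e_t = (s_t, a_t, r_t, s_{t+1}) in the memory bank D."""
    memory.append((s, a, r, s_next))

def train_step(batch_size: int = 32):
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)          # uniform sampling U(D)
    states = torch.stack([b[0] for b in batch])        # each state stored as a feature tensor
    actions = torch.tensor([b[1] for b in batch])
    rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    next_states = torch.stack([b[3] for b in batch])

    q_pred = n1(states).gather(1, actions.unsqueeze(1)).squeeze(1)       # predicted value from N1
    with torch.no_grad():
        q_target = rewards + GAMMA * n2(next_states).max(dim=1).values  # fixed Q-target from N2
    loss = F.mse_loss(q_pred, q_target)                # (r + gamma * max_a' Q_target - Q_pred)^2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```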
  • S150 Input millions of records as training data features into the data push model for network training, and use the given loss function to back-propagate the error to form the optimal data push model;
  • S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
  • The pushed items are those for which, through machine learning and repeated training, the neural network in the optimal push model has found the target user's purchase probability to be greatest.
  • The data push method in this embodiment first extracts the personal characteristics related to data push during the user's shopping process, records and stores the personal behavior strategy, then defines the reward function by combining the personal characteristics and the personal behavior strategy, abstracts the actual process of item recommendation into a Markov process, uses the Markov property of the Markov process to simplify the Bellman equation, transforms the push process into an iterable equation, and obtains the optimal solution of the iterable equation to obtain the data push model; once the user's characteristics are input into the data model, it automatically pushes the items that are most suitable for the user and that the user is most likely to purchase.
  • This method not only improves the accuracy of recommended items, but also largely avoids the lag that exists at the interaction level.
  • FIG. 3 is a framework diagram of a data push system according to an embodiment of this application.
  • the system corresponds to the aforementioned data push method and can be installed in a data push electronic device.
  • the data pushing system 300 includes a feature extraction unit 310, a reward function unit 320, a network training unit 330, and an optimization model unit 340.
  • the feature extraction unit 310 is configured to extract personal features related to data push according to web browsing information, record and store personal behavior strategies;
  • the reward function unit 320 is connected to the feature extraction unit 310, and is used to define a reward function in combination with the personal characteristics and personal behavior strategies extracted by the feature extraction unit 310, and abstract the actual process of item recommendation into a Markov process based on the reward function;
  • the network training unit 330 is connected to the reward function unit 320, and is used to simplify the Bellman equation using the Markov property of the Markov process output by the reward function unit 320 to form an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model;
  • the optimization model unit 340 is connected to the network training unit 330, and is used to input millions of records as data features into the data push model obtained through the network training unit 330 for network training, and to use a given loss function to back-propagate the error, forming an optimal data push model. Once the personal characteristics of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
  • FIG. 4 is a schematic diagram of the electronic device of this application.
  • the electronic device 40 may be a terminal device with computing capability, such as a server, a tablet computer, a portable computer, or a desktop computer.
  • the electronic device 40 includes a processor 41, a memory 42, a computer program 43, a network interface, and a communication bus.
  • the electronic device 40 may be a tablet computer, a desktop computer, or a smart phone, but is not limited thereto.
  • the memory 42 includes at least one type of readable storage medium.
  • the at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, hard disk, multimedia card, card-type memory, and the like.
  • the readable storage medium may be an internal storage unit of the electronic device 40, such as a hard disk of the electronic device 40.
  • the readable storage medium may also be an external memory of the electronic device 40, such as a plug-in hard disk equipped on the electronic device 40, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
  • the readable storage medium of the memory 42 is generally used to store the computer program 43 installed in the electronic device 40, the key generation unit, the key management unit, the transmission unit, and the alarm unit.
  • the processor 41 may in some embodiments be a central processing unit (CPU), microprocessor, or other data processing chip, and is used to run program code or process data stored in the memory 42, such as a data push program.
  • the network interface may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 40 and other electronic devices.
  • the communication bus is used to realize the connection and communication between these components.
  • FIG. 4 only shows the electronic device 40 with the components 41-43, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the memory 42 as a computer storage medium may store an operating system and a data push program 43; the processor 41 implements the following steps when executing the data push program stored in the memory 42:
  • S140 Utilize the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and continue training the neural network until the neural network converges to obtain a data push model;
  • S150 Input millions of records as training data features into the data push model for network training, and use the given loss function to back-propagate the error to form the optimal data push model;
  • S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information.
  • the embodiment of the present application also proposes a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium includes a data push program, and the following operations are implemented when the data push program is executed by the processor:
  • S140 Utilize the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and continue training the neural network until the neural network converges to obtain a data push model;
  • S150 Input millions of records as training data features into the data push model for network training, and use the given loss function to back-propagate the error to form the optimal data push model;
  • S160 Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data pushing method and system, an electronic device and a storage medium, which relate to the field of intelligent decision making. Said method comprises: extracting, according to webpage browsing information, personal features related to data pushing (S110), and recording and storing a personal behavior policy; defining, in combination with the personal features and the personal behavior policy, a reward function (S120); abstracting a real process of item recommendation as a Markov process on the basis of the reward function (S130); using the Markov property of the Markov process to simplify a Bellman equation to form an iterative equation, calculating an optimal solution of the iterative equation, and obtaining a data pushing model (S140); inputting million-level data as data features into the data pushing model to perform network training, so as to form an optimal data pushing model; and inputting personal features of a data pushing target user into the optimal data pushing model, and the optimal data pushing model automatically outputting recommendation information to said target user (S160).

Description

Data pushing method, system, electronic device and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 26, 2020, with application number 202010119662.6 and invention title "Data push method, device and computer-readable storage medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence, and in particular to a data push method, system, electronic device, and computer-readable storage medium.
Background art
The classic recommendation system relies only on big data stored in advance, but ignores that the recommended objects and the recommendation environment are constantly changing in practice, and also ignores the new information generated during the interaction between the system and the recommended objects. The inventors realized that this neglected interaction information, and its possible moment-to-moment variability, is precisely what matters most; traditional recommendation systems are therefore rule-bound to a certain extent and, objectively speaking, do not take environment and interaction factors into account. As a result, this type of traditional method lags noticeably at the interaction level and cannot keep up with the latest needs of the recommended objects. Building a recommendation system framework that fully considers the system's interaction information has therefore become a hot issue in data mining.
What a recommendation system fears most is serious lag: the time lag in acquiring and analyzing user information delays the analysis of user needs, so the system recommends things the user no longer likes, no longer needs, or that are simply wrong. Traditional data push is mainly based on a basic machine learning framework using association rules, for example treating a purchased product as the rule head and the recommended object as the rule body; the classic example is that many people who buy milk also buy bread to go with it, and recommendations of this kind are cluttered and inaccurate.
Therefore, there is an urgent need for a data push method with improved accuracy.
Summary of the invention
This application provides a data push method, system, electronic device, and computer-readable storage medium. The main purpose is to extract personal characteristics related to data push from web browsing information, record and store personal behavior strategies, define a reward function combining the personal characteristics and personal behavior strategies, abstract the actual process of item recommendation into a Markov process based on the reward function, use the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model; millions of records are then input into the data push model as data features for network training, with a given loss function used to back-propagate the error and form the optimal data push model; finally, the personal characteristics of the data push target user are input into the optimal data push model, and the optimal data push model automatically outputs the data push.
To achieve the above purpose, the data push method provided in this application is applied to an electronic device, and the method includes:
S110: Extract personal characteristics and personal behavior information related to data push from web browsing information;
S120: Define a reward function in combination with the personal characteristics and personal behavior information;
S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
S140: Use the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model;
S150: Input the training data features into the data push model for network training, and use the given loss function to back-propagate the error to form an optimal data push model;
S160: Input the personal characteristics of the data push target user into the optimal data push model; the optimal data push model automatically outputs recommendation information to the target user.
To achieve the above objective, this application also provides a data push system, including a feature extraction unit, a reward function unit, a network training unit, and an optimization model unit.
The feature extraction unit is used to extract personal features related to data push from web browsing information, and to record and store personal behavior strategies.
The reward function unit is connected to the feature extraction unit, and is used to define the reward function in combination with the personal characteristics and personal behavior strategies extracted by the feature extraction unit, and to abstract the actual process of item recommendation into a Markov process based on the reward function.
The network training unit is connected to the reward function unit, and is used to simplify the Bellman equation using the Markov property of the Markov process output by the reward function unit to form an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model.
The optimization model unit is connected to the network training unit, and is used to input the training data as data features into the data push model obtained through the network training unit for network training, and to use the given loss function to back-propagate the error to form an optimal data push model; once the personal characteristics of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
To achieve the above object, this application also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the steps of the data push method described above are implemented.
In addition, to achieve the above object, this application also provides a computer-readable storage medium in which a data push program is stored; when the data push program is executed by a processor, the steps of the aforementioned data push method are implemented.
The data push method, system, electronic device, and computer-readable storage medium proposed in this application extract personal characteristics, record and store personal behavior strategies, abstract the actual process of item recommendation into a Markov process based on the reward function, use the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain the data push model; finally, the personal characteristics of the data push target user are input into the optimal data push model, and the optimal data push model automatically outputs the data push. This greatly improves the precision and recall of data push, improves how well the recommended items satisfy user needs, and avoids the lag at the interaction level.
Description of the drawings
Fig. 1 is a schematic diagram of an application environment of a data push method according to an embodiment of the present application;
Fig. 2 is a flowchart of a data push method according to an embodiment of the present application;
Fig. 3 is a system framework diagram of a data push electronic device according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
The realization, functional characteristics, and advantages of the purpose of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit the application.
Existing data push methods are mainly based on a basic machine learning framework using association rules, with the purchased goods as the rule head and the recommended object as the rule body; the time lag in acquiring and analyzing user information delays the analysis of user needs, so the system recommends something the user no longer likes, no longer needs, or that is wrong. To solve the above problems in existing data push methods, this application extracts personal characteristics related to data push from web browsing information, records and stores personal behavior strategies, defines a reward function, abstracts the actual process of item recommendation into a Markov process, obtains the optimal solution, and keeps training the neural network until it converges to obtain the data push model; the personal characteristics of the data push target user only need to be input into the optimal data push model, and the optimal data push model automatically outputs the data push.
Specifically, according to an embodiment of the present application, a data push method is provided, which is applied to the electronic device 40.
Fig. 1 is a schematic diagram of an application environment of a data push method according to an embodiment of the present application. As shown in Fig. 1, the implementation environment in this embodiment is a computer device 110.
The computer device 110 is a terminal device such as a computer.
It should be noted that the computer terminal device 110 may be a tablet computer, a notebook computer, a desktop computer, etc., running a CentOS (Linux) system, but is not limited to this. The terminal device 110, such as a computer device, can be connected via Bluetooth, USB (Universal Serial Bus), or other communication connection methods, which is not limited in this application.
Fig. 2 is a flowchart of a data push method according to an embodiment of the present application. As shown in Fig. 2, in this embodiment, the data push method includes the following steps.
S110: Extract personal characteristics and personal behavior information related to data push from web browsing information; the personal behavior information is a personal behavior strategy.
During the user's shopping process, personal characteristics such as height, weight, physical condition indicators, and economic status that can be extracted from the user's browsing information are extracted, and personal behavior strategies such as the usual shopping intention, usual shopping times, specific reasons for shopping, shopping locations, and choice of institutions are recorded and stored. Here, the specific content of the personal characteristics and personal behavior strategies can be determined according to the subject of the user's web browsing and the e-commerce behaviors that occur during web browsing.
If the subject of the user's web browsing is shopping, the extracted personal characteristics can include height, weight, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy can include usual shopping intentions, usual shopping times, specific reasons for shopping, shopping locations, choice of institutions, and so on.
If the subject of the user's web browsing is learning or training, the extracted personal characteristics can include age, education, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy can include usual learning needs, learning times, specific reasons for learning, learning goals, choice of institutions, and so on; if the subject of the user's web browsing is news, the extracted personal characteristics can include gender, age, education, physical condition indicators, economic status, location, and so on, and the corresponding personal behavior strategy includes usual browsing topics, usual browsing times, browsing frequency, and so on.
S120: Define a reward function in combination with the personal characteristics and personal behavior information.
When performing data push (information recommendation), taking shopping as the subject of the user's web browsing as an example, whether the user ultimately purchases or clicks depends on the results of a sequence of search rankings, not just on a single search or recommendation, so the search engine needs to be regarded as the agent and the user as the environment, turning the item recommendation problem into a typical sequential decision-making problem.
The reward function defined in this embodiment needs to be defined mathematically in advance; its definition and application are indispensable steps in a reinforcement learning algorithm. If the reward function receives positive feedback because of a certain behavior strategy, the tendency toward that behavior strategy is strengthened; based on the reinforcement learning algorithm, the system keeps trying and keeps recommending, accumulating the reward according to user feedback during the trying and recommending process, until the accumulated value of the reward function's environmental feedback is maximized and a local optimal solution is obtained.
The reward function is: if only product clicks occur in a PV (page view), the corresponding reward value is the number of product clicks by the user; if the user's purchase of a product occurs in a PV, the corresponding reward is the number of product clicks by the user plus the price of the purchased product; in all other cases the reward is zero.
In one embodiment, the data to be pushed is a product recommendation, and the reward function is: if the user clicks on a product on the shopping page, a reward value is added to that product, equal to the number of times the user clicks on the product; if the user purchases a product on the shopping page, a reward value is added to that product, equal to the purchase price of the product; otherwise the reward value is zero.
In another embodiment, the data to be pushed is a training recommendation, and the reward function is: if the user clicks to browse a course on the training page, a reward value is added to that course, equal to the number of times the user clicks to browse the course; if the user purchases a course on the training page, a reward value is added to that course, equal to the purchase price of the course; otherwise the reward value is zero.
S130: Abstract the actual process of item recommendation into a Markov process based on the reward function.
When a certain behavior strategy of the agent leads to a positive reward from the environment (the reward function value increases), the agent's tendency to produce this behavior strategy is strengthened, and the data push (item recommendation) process is then abstracted as an MDP (Markov Decision Process).
The MDP is represented by the four-tuple <S, A, R, T>:
S (State Space) is the state of the data to be pushed on the page during the actual item recommendation process;
A (Action Space) is the set of all actions generated on the item recommendation page;
R: S × A × S → R (Reward Function), where R(s, a, s′) is the reward value the Agent obtains from the environment when action a is executed in state s and the process transfers to state s′; when the user shifts from clicking a to clicking b, the reward value obtained for b increases;
T: S × A × S → [0, 1] is the state transition function of the environment (State Transition Function), where T(s, a, s′) is the probability of executing action a in state s and transferring to state s′.
In the abstraction into a Markov process, the Agent perceives the environment state S of the entire data push process and collects the personal behavior strategy; when an action in the action space A of the personal behavior strategy acts on an item (clicking or browsing a certain item), the reward function R increases the reward value of that item, and the greater the click probability T for the item, the more the reward value increases.
In one embodiment, the data push process is a product recommendation process, and the MDP is represented by the quadruple <S, A, R, T>:
S represents the number of times the product has been clicked, or that the product has been purchased;
A represents the user browsing or clicking on the product;
R: S × A × S → R is the reward function, where R(s, a, s′) is the reward value obtained by the product when action a is executed in state s and the process transfers to state s′; for example, when a product that has already been clicked 5 times is clicked once more, the added reward value is 1;
T: S × A × S → [0, 1] is the state transition function, where T(3, 2, s′) is the probability that a product that has already been clicked 3 times is clicked 2 more times and transfers to the state of the product being purchased.
In another embodiment, the data push process is a course recommendation process, and the MDP is represented by the quadruple <S, A, R, T>:
S represents the number of times the course has been previewed, or that the course has been purchased;
A represents the user browsing or previewing the course;
R: S × A × S → R is the reward function, where R(s, a, s′) is the reward value obtained by the course when action a is executed in state s and the process transfers to state s′; for example, when a course that has already been browsed 3 times is previewed once, the reward value obtained by the course is 1;
T: S × A × S → [0, 1] is the state transition function, where T(3, 2, s′) is the probability that a course that has already been browsed or previewed 3 times is browsed or previewed 2 more times and transfers to the state of the course being purchased.
S140: Use the Markov property of the Markov process to simplify the Bellman equation into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network with the optimal solution, and keep training the neural network until it converges to obtain a data push model.
First, the Bellman equation is simplified, the push process is transformed into an iterable equation, and the optimal solution of the iterable equation is obtained.
Simplifying the Bellman equation based on the Markov property turns it into an iterable equation, so that the optimal solution can be found by iteration. The iterable equation, in its standard Q-learning form, is
Q(s_t, a_t) = E[ R_{t+1} + γ · max_a Q(s_{t+1}, a) ],
where γ is the attenuation (discount) coefficient, S, R, and t are as defined above, and the iterable equation is used to maximize the accumulated reward.
Solving for the optimal solution of the iterable equation means maximizing the objective function Q, i.e. obtaining the largest cumulative reward through the agent's recommendations within a batch; here the batch is a data set, and sampling is chosen when solving for the optimal solution of the iterable equation, that is, the calculation is performed on a small batch data set, looping over batches and looping the calculation until the upper threshold is reached or the results converge (a relatively better result is obtained).
Then an approximate representation of the action-value function is introduced, namely
Q(s, a; θ) ≈ Q*(s, a),
where θ denotes the parameters of the approximating network;
Then, once the approximate representation holds mathematically, the neural network (DQN model) is built in combination with the optimal solution;
Two neural network architectures N1 and N2 with the same structure but different parameters are built: N1 estimates the evaluation value and N2 computes the target value; the network is then iteratively updated through back-propagation, and after every k rounds of iteration the network parameters of N1 are periodically copied into N2. Both N1 and N2 are fully connected networks of neurons; the activation function used is ReLU, the input is the feature, and the output is the value corresponding to each action; the number of neurons varies slightly with the scenario;
The neural network is initialized with a large number of parameters, and the machine keeps learning and updating these parameters until the framework of the neural network converges; when the neural network converges, the optimal solution of the above iterable equation has been obtained, i.e., the parameters that make the whole push process optimal have been found.
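A minimal PyTorch sketch of the two same-structure networks N1 (evaluation value) and N2 (target value) with periodic parameter copying may look as follows; the use of PyTorch, the feature dimension, the hidden size and the synchronization period k are assumptions, not details fixed by the application.

```python
import copy
import torch.nn as nn

# Sketch of the evaluation network N1 and target network N2 with identical structure.
# Layer sizes are illustrative; the text mentions both ReLU and tanh activations.

class QNet(nn.Module):
    def __init__(self, feature_dim: int, n_actions: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden),
            nn.ReLU(),                       # activation over the input features
            nn.Linear(hidden, n_actions),    # one value output per action
        )

    def forward(self, x):
        return self.net(x)

n1 = QNet(feature_dim=16, n_actions=4)       # evaluation-value network
n2 = copy.deepcopy(n1)                        # target-value network, same structure

def sync_target(step: int, k: int = 100):
    """Every k iterations, copy N1's parameters into N2."""
    if step % k == 0:
        n2.load_state_dict(n1.state_dict())
```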
Specifically, the input of the constructed network is the feature map of a certain state St, which passes through a fully connected layer of 100 neurons with the tanh activation function, and the output layer finally outputs the action value Vi corresponding to each action ai. Stochastic Gradient Descent is used for network iteration in the neural network. The algorithm applies the Experience Replay method: before the specified t memories to be stored are reached, every current-step state involved, the corresponding action taken, the delayed reward obtained and the corresponding next state s′ are stored, as in the following formulas:
Every stored experience
e_t = (s_t, a_t, r_t, s_{t+1})
is stored in the memory bank
D_t = {e_1, ..., e_t}.
Finally, uniform sampling with replacement is performed.
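The stored experiences and the uniform sampling with replacement can be sketched as a simple replay memory in Python; the capacity value below is an assumption.

```python
import random
from collections import deque

# Illustrative memory bank D_t = {e_1, ..., e_t} for Experience Replay.

class ReplayMemory:
    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped first

    def store(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))  # e_t = (s_t, a_t, r_t, s_{t+1})

    def sample(self, batch_size: int):
        # uniform sampling with replacement, as described above
        return [random.choice(self.memory) for _ in range(batch_size)]
```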
At the same time, the fixed Q-target method is applied, and the loss function is
$L(\theta)=\mathbb{E}_{(s,a,r,s')\sim U(D)}\big[\big(r+\gamma\max_{a'}Q(s',a';\theta^{-})-Q(s,a;\theta)\big)^{2}\big]$
where θ⁻ denotes the parameters of the target network N2.
In the above formula, E is the expectation, a is an action from the action space, r is the reward given by the reward function, s is the state of the environment and s′ is the next state; U(D) denotes uniform random sampling from the memory bank D, γ is the decay coefficient, and Q is the cumulative reward function. In other words, the loss is iterated on the difference between the real reward of the next step and the predicted reward in the Q table.
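A possible PyTorch rendering of this fixed Q-target loss is sketched below; the tensor shapes and the absence of a terminal-state mask are simplifying assumptions.

```python
import torch

# Sketch of the fixed Q-target loss: target computed with the target network N2,
# prediction taken from the evaluation network N1, mean squared difference as loss.

def dqn_loss(n1, n2, batch, gamma: float = 0.9):
    s, a, r, s_next = batch                                   # tensors sampled from U(D)
    with torch.no_grad():                                     # targets are held fixed
        target = r + gamma * n2(s_next).max(dim=1).values     # r + gamma * max_a' Q(s', a'; theta-)
    predicted = n1(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    return torch.mean((target - predicted) ** 2)
```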
S150: Input millions of data records as training data features into the data push model for network training, and use the given loss function to back-propagate the error, forming the optimal data push model;
After the data push model is formed, millions of data records are input as data features into the Deep Q Network for network training; the given loss function is used to back-propagate the error, and training continues until the model converges into a well-formed data push model, so as to obtain the optimal data push model.
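Putting the pieces above together, a training loop over a large data set until convergence could look roughly like this; `to_tensors` is a hypothetical helper that converts sampled experiences into tensors, and the optimizer, batch size, synchronization period and convergence test are all assumptions.

```python
import torch

# Illustrative training loop: sample from replay memory, back-propagate the loss,
# periodically refresh the target network, stop when the loss stops changing.

def train(n1, n2, memory, optimizer, steps=1_000_000, batch_size=64, k=100, tol=1e-5):
    prev_loss = float("inf")
    for step in range(1, steps + 1):
        batch = to_tensors(memory.sample(batch_size))    # hypothetical conversion helper
        loss = dqn_loss(n1, n2, batch)                   # fixed Q-target loss sketched above
        optimizer.zero_grad()
        loss.backward()                                  # back-propagate the error
        optimizer.step()
        if step % k == 0:
            n2.load_state_dict(n1.state_dict())          # copy N1's parameters into N2
        if abs(prev_loss - loss.item()) < tol:           # model has converged
            break
        prev_loss = loss.item()
    return n1
```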
S160:将数据推送目标用户的个人特征输入该最优数据推送模型,该最优数据推送模型自动化地向目标用户输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
在该最优数据推送模型自动化地输出数据推送的过程中,所推送出的物品为该最优推送模型中的神经网络经机器学习及反复训练得出的使目标用户购买几率最大的物品。In the process of automatically outputting the data push by the optimal data push model, the items pushed out are the items that are obtained by the neural network in the optimal push model through machine learning and repeated training to maximize the purchase probability of the target user.
The data push method in this embodiment first extracts, during the user's shopping process, the personal characteristics related to the data push and records and stores the personal behavior strategy; it then defines a reward function by combining the personal characteristics with the personal behavior strategy, abstracts the real-world process of item recommendation into a Markov process, uses the Markov property of that process to simplify the Bellman equation, converts the push process into an iterable equation and obtains its optimal solution, thereby obtaining the data push model. As long as the user's characteristics are input into this model, the model automatically pushes the item that best suits the user and that the user is most likely to purchase. The method not only improves the accuracy of the recommended items but also largely avoids the lag that exists at the interaction level.
On the other hand, the present application further provides a data push system. FIG. 3 is a framework diagram of a data push system according to an embodiment of the present application; the system corresponds to the aforementioned data push method and can be provided in a data push electronic device.
如图3所示,该数据推送系统300包括特征提取单元310、奖励函数单元320、网络训练单元330、优化模型单元340。As shown in FIG. 3, the data pushing system 300 includes a feature extraction unit 310, a reward function unit 320, a network training unit 330, and an optimization model unit 340.
其中,特征提取单元310用于根据网页浏览信息提取与数据推送相关的个人特征,记录并存储个人行为策略;Among them, the feature extraction unit 310 is configured to extract personal features related to data push according to web browsing information, record and store personal behavior strategies;
奖励函数单元320与特征提取单元310相连,用于结合特征提取单元310提取的个人特征及个人行为策略定义奖励函数,并基于该奖励函数将物品推荐的现实过程抽象为马尔 科夫过程;The reward function unit 320 is connected to the feature extraction unit 310, and is used to define a reward function in combination with the personal characteristics and personal behavior strategies extracted by the feature extraction unit 310, and abstract the actual process of item recommendation into a Markov process based on the reward function;
The network training unit 330 is connected to the reward function unit 320 and is configured to use the Markov property of the Markov process output by the reward function unit 320 to simplify the Bellman equation into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network based on the optimal solution, and train the neural network until it converges to obtain the data push model;
The optimization model unit 340 is connected to the network training unit 330 and is configured to input millions of data records as data features into the data push model obtained by the network training unit 330 for network training, and to use the given loss function to back-propagate the error, thereby forming the optimal data push model; once the personal characteristics of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
图4为本申请电子装置示意图,在本实施例中,电子装置40可以是服务器、平板计算机、便携计算机、桌上型计算机等具有运算功能的终端设备。FIG. 4 is a schematic diagram of the electronic device of this application. In this embodiment, the electronic device 40 may be a terminal device with arithmetic function, such as a server, a tablet computer, a portable computer, a desktop computer, and the like.
该电子装置40包括:处理器41、存储器42、计算机程序43、网络接口及通信总线。The electronic device 40 includes a processor 41, a memory 42, a computer program 43, a network interface, and a communication bus.
电子装置40可以是平板电脑、台式电脑、智能手机,但不限于此。The electronic device 40 may be a tablet computer, a desktop computer, or a smart phone, but is not limited thereto.
存储器42包括至少一种类型的可读存储介质。至少一种类型的可读存储介质可为如闪存、硬盘、多媒体卡、卡型存储器等的非易失性存储介质。在一些实施例中,可读存储介质可以是电子装置40的内部存储单元,例如该电子装置40的硬盘。在另一些实施例中,可读存储介质也可以是电子装置40的外部存储器,例如电子装置40上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。The memory 42 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, hard disk, multimedia card, card-type memory, and the like. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 40, such as a hard disk of the electronic device 40. In other embodiments, the readable storage medium may also be an external memory of the electronic device 40, such as a plug-in hard disk equipped on the electronic device 40, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
在本实施例中,存储器42的可读存储介质通常用于存储安装于电子装置40的计算机程序43,密钥生成单元、密钥管理单元、传输单元和告警单元等。In this embodiment, the readable storage medium of the memory 42 is generally used to store the computer program 43 installed in the electronic device 40, the key generation unit, the key management unit, the transmission unit, and the alarm unit.
处理器41在一些实施例中可以是一中央处理器(Central Processing Unit,CPU),微处理器或其他数据处理芯片,用于运行存储器42中存储的程序代码或处理数据,例如数据推送程序等。The processor 41 may be a central processing unit (CPU), microprocessor or other data processing chip in some embodiments, and is used to run program codes or process data stored in the memory 42, such as a data push program, etc. .
网络接口可选地可以包括标准的有线接口、无线接口(如WI-FI接口),通常用于在该电子装置40与其他电子设备之间建立通信连接。The network interface may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 40 and other electronic devices.
通信总线用于实现这些组件之间的连接通信。The communication bus is used to realize the connection and communication between these components.
图4仅示出了具有组件41-43的电子装置40,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。FIG. 4 only shows the electronic device 40 with the components 41-43, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
在图4所示的电子装置实施例中,作为一种计算机存储介质的存储器42中可以存储操作系统以及数据推送程序43;处理器41执行存储器42中存储的数据推送程序时实现如下步骤:In the embodiment of the electronic device shown in FIG. 4, the memory 42 as a computer storage medium may store an operating system and a data push program 43; the processor 41 implements the following steps when executing the data push program stored in the memory 42:
S110:根据网页浏览信息提取与数据推送相关的个人特征及个人行为信息;S110: Extract personal characteristics and personal behavior information related to data push based on web browsing information;
S120:结合该个人特征及个人行为信息定义奖励函数;S120: Combine the personal characteristics and personal behavior information to define a reward function;
S130:基于该奖励函数将物品推荐的现实过程抽象为马尔科夫过程;S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
S140:利用该马尔科夫过程的马尔科夫性简化贝尔曼方程,将该推送过程转化为可迭代方程式,并求得该可迭代方程式的最优解,结合该最优解搭建神经网络,持续训练该神经网络直至该神经网络收敛,获得数据推送模型;S140: Utilize the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, and obtain the optimal solution of the iterable equation, combine the optimal solution to build a neural network, and continue Train the neural network until the neural network converges, and obtain a data push model;
S150:以百万级数据作为训练数据特征输入数据推送模型进行网络训练,并给予给定的Loss function进行误差的回传,形成最优数据推送模型;S150: Use millions of data as the training data feature to input the data push model for network training, and give the given Loss function to return the error to form the optimal data push model;
S160:将数据推送目标用户的个人特征输入该最优数据推送模型,该最优数据推送模型自动化地输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information.
此外,本申请实施例还提出一种计算机可读存储介质,所述计算机可读存储介质可以是非易失性,也可以是易失性,计算机可读存储介质中包括数据推送程序,该数据推送程序被处理器执行时实现如下操作:In addition, the embodiment of the present application also proposes a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium includes a data push program, and the data push The following operations are implemented when the program is executed by the processor:
S110:根据网页浏览信息提取与数据推送相关的个人特征及个人行为信息;S110: Extract personal characteristics and personal behavior information related to data push based on web browsing information;
S120:结合该个人特征及个人行为信息定义奖励函数;S120: Combine the personal characteristics and personal behavior information to define a reward function;
S130:基于该奖励函数将物品推荐的现实过程抽象为马尔科夫过程;S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
S140:利用该马尔科夫过程的马尔科夫性简化贝尔曼方程,将该推送过程转化为可迭代方程式,并求得该可迭代方程式的最优解,结合该最优解搭建神经网络,持续训练该神经网络直至该神经网络收敛,获得数据推送模型;S140: Utilize the Markov property of the Markov process to simplify the Bellman equation, transform the push process into an iterable equation, and obtain the optimal solution of the iterable equation, combine the optimal solution to build a neural network, and continue Train the neural network until the neural network converges, and obtain a data push model;
S150:以百万级数据作为训练数据特征输入数据推送模型进行网络训练,并给予给定的Loss function进行误差的回传,形成最优数据推送模型;S150: Use millions of data as the training data feature to input the data push model for network training, and give the given Loss function to return the error to form the optimal data push model;
S160:将数据推送目标用户的个人特征输入该最优数据推送模型,该最优数据推送模型自动化地向目标用户输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model automatically outputs recommendation information to the target user.
本申请之计算机可读存储介质的具体实施方式与上述数据推送方法、电子装置的具体实施方式大致相同,在此不再赘述。The specific implementation of the computer-readable storage medium of the present application is substantially the same as the specific implementation of the above-mentioned data push method and electronic device, and will not be repeated here.
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、装置、物品或者方法中还存在另外的相同要素。It should be noted that in this article, the terms "include", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements not only includes those elements, It also includes other elements not explicitly listed, or elements inherent to the process, device, article, or method. If there are no more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.
The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments. Through the description of the above implementation manners, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by means of software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as a ROM/RAM, magnetic disk or optical disk) and includes a number of instructions for causing a terminal device (which may be a computer, a server, a network device, or the like) to execute the method described in each embodiment of the present application.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. 一种数据推送方法,应用于电子装置,所述方法包括:A data push method applied to an electronic device, the method including:
    S110:根据网页浏览信息提取与数据推送相关的个人特征及个人行为信息;S110: Extract personal characteristics and personal behavior information related to data push based on web browsing information;
    S120:结合所述个人特征及个人行为信息定义奖励函数;S120: Define a reward function in combination with the personal characteristics and personal behavior information;
    S130:基于所述奖励函数将物品推荐的现实过程抽象为马尔科夫过程;S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
    S140:利用所述马尔科夫过程的马尔科夫性简化贝尔曼方程形成可迭代方程式,并求得所述可迭代方程式的最优解,结合所述最优解搭建神经网络,持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型;S140: Use the Markov property of the Markov process to simplify the Bellman equation to form an iterable equation, and obtain the optimal solution of the iterable equation, combine the optimal solution to build a neural network, and continuously train the The neural network until the neural network converges to obtain a data push model;
    S150:将训练数据特征输入数据推送模型进行网络训练,并给予给定的Loss function进行误差的回传,形成最优数据推送模型;S150: Input the training data features into the data push model for network training, and give the given Loss function to return the error to form an optimal data push model;
    S160:将数据推送目标用户的个人特征输入所述最优数据推送模型,所述最优数据推送模型向所述目标用户输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model outputs recommendation information to the target user.
  2. 根据权利要求1所述的数据推送方法,其中,所述奖励函数为:The data pushing method according to claim 1, wherein the reward function is:
    If only product clicks occur within a PV, the corresponding reward value is the number of times the user clicks on the product; if the user purchases a product within a PV, the corresponding reward is the number of times the user clicks on the product plus the price of the purchased product; in all other cases the reward is 0.
  3. 根据权利要求1所述的数据推送方法,其中,The data push method according to claim 1, wherein:
    所述马尔科夫过程由四元组<S,A,R,T>表示:The Markov process is represented by the quaternion <S,A,R,T>:
    其中,S为所述物品推荐现实过程中页面上待推送数据的状态;Wherein, S is the status of the data to be pushed on the page during the actual process of the item recommendation;
    A为所述物品推荐页面产生的所有动作;A is all actions generated on the item recommendation page;
    R: S×A×S→R is the reward function: when the user performs action A and the state transitions from S to S′, state S′ obtains a reward value; for example, when the user moves from clicking item a to clicking item b, item b obtains a reward value;
    T: S×A×S→[0,1] is the state transition function of the environment; T(s, a, s′) represents the probability of performing action a in state s and transitioning to state s′.
  4. 根据权利要求1所述的数据推送方法,其中,求得所述可迭代方程式的最优解为在一个batch中,通过智能体推荐得到的最大累积奖励;The data push method according to claim 1, wherein obtaining the optimal solution of the iterable equation is the maximum cumulative reward obtained through agent recommendation in a batch;
    The optimal solution of the iterable equation is obtained by sampling: the computation is first performed on a small batch data set, and then batches are drawn and computed in a loop until the upper threshold is reached or the result converges.
  5. 根据权利要求1所述的数据推送方法,其中,结合所述最优解搭建神经网络的过程包括:The data pushing method according to claim 1, wherein the process of building a neural network in combination with the optimal solution comprises:
    Introduce an approximate representation of the action-value function:
    Q(s, a; θ) ≈ Q*(s, a), where θ denotes the network parameters;
    所述近似表示在数学上成立后,结合所述最优解搭建两个结构相同、参数不同的神经网络架构N1、N2;其中,After the approximate representation is established mathematically, combine the optimal solution to build two neural network architectures N1 and N2 with the same structure and different parameters; where,
    利用N1进行evaluation value的估计,利用N2进行target value的计算,进而对反向传递进行网络迭代更新,并在k轮迭代后定期将N1的网络参数移植到N2中;Use N1 to estimate the evaluation value, use N2 to calculate the target value, and then iteratively update the network in the reverse pass, and periodically transplant the network parameters of N1 to N2 after k rounds of iterations;
    所述N1、N2均为具有神经元的全连接网络,所述神经元个数通过不同的场景发生改变。The N1 and N2 are all fully connected networks with neurons, and the number of neurons changes through different scenarios.
  6. 根据权利要求1所述的数据推送方法,其中,在持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型的过程中,包括:The data push method according to claim 1, wherein the process of continuously training the neural network until the neural network converges to obtain the data push model includes:
    Stochastic Gradient Descent is used for network iteration in the neural network, and the Experience Replay method is applied: before the specified t memories to be stored are reached, all of the current states S involved, the corresponding actions A taken, the delayed rewards R obtained and the corresponding next states S′ are stored.
  7. The data push method according to claim 5, wherein, in the process in which the optimal data push model automatically outputs the data push, the pushed item is the item, derived by the neural network in the optimal push model through machine learning and repeated training, that maximizes the target user's probability of purchase.
  8. 一种数据推送系统,包括:特征提取单元、奖励函数单元、网络训练单元、优化模型单元;A data push system, including: a feature extraction unit, a reward function unit, a network training unit, and an optimization model unit;
    特征提取单元用于根据网页浏览信息提取与数据推送相关的个人特征,记录并存储个人行为策略;The feature extraction unit is used to extract personal features related to data push based on web browsing information, record and store personal behavior strategies;
    奖励函数单元与特征提取单元相连,用于结合特征提取单元提取的个人特征及个人行为策略定义奖励函数,并基于该奖励函数将物品推荐的现实过程抽象为马尔科夫过程;The reward function unit is connected to the feature extraction unit, and is used to define the reward function in combination with the personal characteristics extracted by the feature extraction unit and the personal behavior strategy, and based on the reward function, the actual process of item recommendation is abstracted into a Markov process;
    The network training unit is connected to the reward function unit and is configured to use the Markov property of the Markov process output by the reward function unit to simplify the Bellman equation into an iterable equation, obtain the optimal solution of the iterable equation, build a neural network based on the optimal solution, and train the neural network until it converges to obtain the data push model;
    The optimization model unit is connected to the network training unit and is configured to input the training data as data features into the data push model obtained by the network training unit for network training, and to use the given loss function to back-propagate the error, thereby forming the optimal data push model; once the personal characteristics of the data push target user are input into the optimal data push model, the optimal data push model can automatically output the data push.
  9. 一种电子装置,该电子装置包括:存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如下步骤:An electronic device comprising: a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and the processor implements the following steps when the processor executes the computer program:
    S110:根据网页浏览信息提取与数据推送相关的个人特征及个人行为信息;S110: Extract personal characteristics and personal behavior information related to data push based on web browsing information;
    S120:结合所述个人特征及个人行为信息定义奖励函数;S120: Define a reward function in combination with the personal characteristics and personal behavior information;
    S130:基于所述奖励函数将物品推荐的现实过程抽象为马尔科夫过程;S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
    S140:利用所述马尔科夫过程的马尔科夫性简化贝尔曼方程形成可迭代方程式,并求得所述可迭代方程式的最优解,结合所述最优解搭建神经网络,持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型;S140: Use the Markov property of the Markov process to simplify the Bellman equation to form an iterable equation, and obtain the optimal solution of the iterable equation, combine the optimal solution to build a neural network, and continuously train the The neural network until the neural network converges to obtain a data push model;
    S150:将训练数据特征输入数据推送模型进行网络训练,并给予给定的Loss function进行误差的回传,形成最优数据推送模型;S150: Input the training data features into the data push model for network training, and give the given Loss function to return the error to form an optimal data push model;
    S160:将数据推送目标用户的个人特征输入所述最优数据推送模型,所述最优数据推送模型向所述目标用户输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model outputs recommendation information to the target user.
  10. 根据权利要求9所述的电子装置,其中,所述奖励函数为:The electronic device according to claim 9, wherein the reward function is:
    If only product clicks occur within a PV, the corresponding reward value is the number of times the user clicks on the product; if the user purchases a product within a PV, the corresponding reward is the number of times the user clicks on the product plus the price of the purchased product; in all other cases the reward is 0.
  11. 根据权利要求9所述的电子装置,其中,The electronic device according to claim 9, wherein:
    所述马尔科夫过程由四元组<S,A,R,T>表示:The Markov process is represented by the quaternion <S,A,R,T>:
    其中,S为所述物品推荐现实过程中页面上待推送数据的状态;Wherein, S is the status of the data to be pushed on the page during the actual process of the item recommendation;
    A为所述物品推荐页面产生的所有动作;A is all actions generated on the item recommendation page;
    R: S×A×S→R is the reward function: when the user performs action A and the state transitions from S to S′, state S′ obtains a reward value; for example, when the user moves from clicking item a to clicking item b, item b obtains a reward value;
    T: S×A×S→[0,1] is the state transition function of the environment; T(s, a, s′) represents the probability of performing action a in state s and transitioning to state s′.
  12. 根据权利要求9所述的电子装置,其中,求得所述可迭代方程式的最优解为在一个batch中,通过智能体推荐得到的最大累积奖励;9. The electronic device according to claim 9, wherein the optimal solution of the iterable equation is the maximum cumulative reward obtained through agent recommendation in a batch;
    The optimal solution of the iterable equation is obtained by sampling: the computation is first performed on a small batch data set, and then batches are drawn and computed in a loop until the upper threshold is reached or the result converges.
  13. 根据权利要求9所述的电子装置,其中,结合所述最优解搭建神经网络的过程包 括:The electronic device according to claim 9, wherein the process of building a neural network in combination with the optimal solution comprises:
    Introduce an approximate representation of the action-value function:
    Q(s, a; θ) ≈ Q*(s, a), where θ denotes the network parameters;
    所述近似表示在数学上成立后,结合所述最优解搭建两个结构相同、参数不同的神经网络架构N1、N2;其中,After the approximate representation is established mathematically, combine the optimal solution to build two neural network architectures N1 and N2 with the same structure and different parameters; where,
    利用N1进行evaluation value的估计,利用N2进行target value的计算,进而对反向传递进行网络迭代更新,并在k轮迭代后定期将N1的网络参数移植到N2中;Use N1 to estimate the evaluation value, use N2 to calculate the target value, and then iteratively update the network in the reverse pass, and periodically transplant the network parameters of N1 to N2 after k rounds of iterations;
    所述N1、N2均为具有神经元的全连接网络,所述神经元个数通过不同的场景发生改变。The N1 and N2 are all fully connected networks with neurons, and the number of neurons changes through different scenarios.
  14. 根据权利要求9所述的电子装置,其中,在持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型的过程中,包括:The electronic device according to claim 9, wherein the process of continuously training the neural network until the neural network converges to obtain the data push model comprises:
    Stochastic Gradient Descent is used for network iteration in the neural network, and the Experience Replay method is applied: before the specified t memories to be stored are reached, all of the current states S involved, the corresponding actions A taken, the delayed rewards R obtained and the corresponding next states S′ are stored.
  15. The electronic device according to claim 13, wherein, in the process in which the optimal data push model automatically outputs the data push, the pushed item is the item, derived by the neural network in the optimal push model through machine learning and repeated training, that maximizes the target user's probability of purchase.
  16. 一种计算机可读存储介质,所述计算机可读存储介质中存储有数据推送程序,所述数据推送程序被处理器执行时,实现如下步骤:A computer-readable storage medium in which a data push program is stored, and when the data push program is executed by a processor, the following steps are implemented:
    S110:根据网页浏览信息提取与数据推送相关的个人特征及个人行为信息;S110: Extract personal characteristics and personal behavior information related to data push based on web browsing information;
    S120:结合所述个人特征及个人行为信息定义奖励函数;S120: Define a reward function in combination with the personal characteristics and personal behavior information;
    S130:基于所述奖励函数将物品推荐的现实过程抽象为马尔科夫过程;S130: Abstract the actual process of item recommendation into a Markov process based on the reward function;
    S140:利用所述马尔科夫过程的马尔科夫性简化贝尔曼方程形成可迭代方程式,并求得所述可迭代方程式的最优解,结合所述最优解搭建神经网络,持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型;S140: Use the Markov property of the Markov process to simplify the Bellman equation to form an iterable equation, and obtain the optimal solution of the iterable equation, combine the optimal solution to build a neural network, and continuously train the The neural network until the neural network converges to obtain a data push model;
    S150:将训练数据特征输入数据推送模型进行网络训练,并给予给定的Loss function进行误差的回传,形成最优数据推送模型;S150: Input the training data features into the data push model for network training, and give the given Loss function to return the error to form an optimal data push model;
    S160:将数据推送目标用户的个人特征输入所述最优数据推送模型,所述最优数据推送模型向所述目标用户输出推荐信息。S160: Input the personal characteristics of the data push target user into the optimal data push model, and the optimal data push model outputs recommendation information to the target user.
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述奖励函数为:The computer-readable storage medium of claim 16, wherein the reward function is:
    If only product clicks occur within a PV, the corresponding reward value is the number of times the user clicks on the product; if the user purchases a product within a PV, the corresponding reward is the number of times the user clicks on the product plus the price of the purchased product; in all other cases the reward is 0.
  18. 根据权利要求16所述的计算机可读存储介质,其中,The computer-readable storage medium of claim 16, wherein:
    所述马尔科夫过程由四元组<S,A,R,T>表示:The Markov process is represented by the quaternion <S,A,R,T>:
    其中,S为所述物品推荐现实过程中页面上待推送数据的状态;Wherein, S is the status of the data to be pushed on the page during the actual process of the item recommendation;
    A为所述物品推荐页面产生的所有动作;A is all actions generated on the item recommendation page;
    R: S×A×S→R is the reward function: when the user performs action A and the state transitions from S to S′, state S′ obtains a reward value; for example, when the user moves from clicking item a to clicking item b, item b obtains a reward value;
    T: S×A×S→[0,1] is the state transition function of the environment; T(s, a, s′) represents the probability of performing action a in state s and transitioning to state s′.
  19. 根据权利要求16所述的计算机可读存储介质,其中,结合所述最优解搭建神经网络的过程包括:The computer-readable storage medium according to claim 16, wherein the process of building a neural network in combination with the optimal solution comprises:
    Introduce an approximate representation of the action-value function:
    Q(s, a; θ) ≈ Q*(s, a), where θ denotes the network parameters;
    所述近似表示在数学上成立后,结合所述最优解搭建两个结构相同、参数不同的神经 网络架构N1、N2;其中,After the approximate representation is established mathematically, combine the optimal solution to build two neural network architectures N1 and N2 with the same structure and different parameters; where,
    利用N1进行evaluation value的估计,利用N2进行target value的计算,进而对反向传递进行网络迭代更新,并在k轮迭代后定期将N1的网络参数移植到N2中;Use N1 to estimate the evaluation value, use N2 to calculate the target value, and then iteratively update the network in the reverse pass, and periodically transplant the network parameters of N1 to N2 after k rounds of iterations;
    所述N1、N2均为具有神经元的全连接网络,所述神经元个数通过不同的场景发生改变。The N1 and N2 are all fully connected networks with neurons, and the number of neurons changes through different scenarios.
  20. 根据权利要求16所述的计算机可读存储介质,其中,在持续训练所述神经网络直至所述神经网络收敛,获得数据推送模型的过程中,包括:The computer-readable storage medium according to claim 16, wherein the process of continuously training the neural network until the neural network converges to obtain a data push model comprises:
    Stochastic Gradient Descent is used for network iteration in the neural network, and the Experience Replay method is applied: before the specified t memories to be stored are reached, all of the current states S involved, the corresponding actions A taken, the delayed rewards R obtained and the corresponding next states S′ are stored.
PCT/CN2020/112365 2020-02-26 2020-08-31 Data pushing method and system, electronic device and storage medium WO2021169218A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010119662.6 2020-02-26
CN202010119662.6A CN111401937A (en) 2020-02-26 2020-02-26 Data pushing method and device and storage medium

Publications (1)

Publication Number Publication Date
WO2021169218A1 true WO2021169218A1 (en) 2021-09-02

Family

ID=71413972

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112365 WO2021169218A1 (en) 2020-02-26 2020-08-31 Data pushing method and system, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN111401937A (en)
WO (1) WO2021169218A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401937A (en) * 2020-02-26 2020-07-10 平安科技(深圳)有限公司 Data pushing method and device and storage medium
CN112565904B (en) * 2020-11-30 2023-05-09 北京达佳互联信息技术有限公司 Video clip pushing method, device, server and storage medium


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001078003A1 (en) * 2000-04-10 2001-10-18 University Of Otago Adaptive learning system and method
CN106447463A (en) * 2016-10-21 2017-02-22 南京大学 Commodity recommendation method based on Markov decision-making process model
CN109471963A (en) * 2018-09-13 2019-03-15 广州丰石科技有限公司 A kind of proposed algorithm based on deeply study
CN109451038A (en) * 2018-12-06 2019-03-08 北京达佳互联信息技术有限公司 A kind of information-pushing method, device, server and computer readable storage medium
CN110659947A (en) * 2019-10-11 2020-01-07 沈阳民航东北凯亚有限公司 Commodity recommendation method and device
CN111401937A (en) * 2020-02-26 2020-07-10 平安科技(深圳)有限公司 Data pushing method and device and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761375A (en) * 2021-09-10 2021-12-07 未鲲(上海)科技服务有限公司 Message recommendation method, device, equipment and storage medium based on neural network
CN114139472A (en) * 2021-11-04 2022-03-04 江阴市智行工控科技有限公司 Integrated circuit direct current analysis method and system based on reinforcement learning dual-model structure
CN114218290A (en) * 2021-12-06 2022-03-22 中国航空综合技术研究所 Selection method for equipment human-computer interaction interface usability evaluation
CN114218290B (en) * 2021-12-06 2024-05-03 中国航空综合技术研究所 Selection method for equipment man-machine interaction interface usability evaluation
WO2023142448A1 (en) * 2022-01-26 2023-08-03 北京沃东天骏信息技术有限公司 Hotspot information processing method and apparatus, and server and readable storage medium
CN114943278A (en) * 2022-04-27 2022-08-26 浙江大学 Continuous online group incentive method and device based on reinforcement learning and storage medium
CN114943278B (en) * 2022-04-27 2023-09-12 浙江大学 Continuous online group incentive method and device based on reinforcement learning and storage medium
CN115640933A (en) * 2022-11-03 2023-01-24 昆山润石智能科技有限公司 Method, device and equipment for automatically managing production line defects and storage medium
CN115640933B (en) * 2022-11-03 2023-10-13 昆山润石智能科技有限公司 Automatic defect management method, device and equipment for production line and storage medium

Also Published As

Publication number Publication date
CN111401937A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
WO2021169218A1 (en) Data pushing method and system, electronic device and storage medium
Kim et al. TWILITE: A recommendation system for Twitter using a probabilistic model based on latent Dirichlet allocation
US10938927B2 (en) Machine learning techniques for processing tag-based representations of sequential interaction events
CN110717098B (en) Meta-path-based context-aware user modeling method and sequence recommendation method
Mansour et al. Bayesian incentive-compatible bandit exploration
US8843427B1 (en) Predictive modeling accuracy
Benoit et al. Binary quantile regression: a Bayesian approach based on the asymmetric Laplace distribution
JP5789204B2 (en) System and method for recommending items in a multi-relational environment
Barnes Azure machine learning
Letham et al. Sequential event prediction
Kothiyal et al. An experimental test of prospect theory for predicting choice under ambiguity
US20140046880A1 (en) Dynamic Predictive Modeling Platform
CN109993583B (en) Information pushing method and device, storage medium and electronic device
CN105159910A (en) Information recommendation method and device
US20220245424A1 (en) Microgenre-based hyper-personalization with multi-modal machine learning
EP2668545A1 (en) Dynamic predictive modeling platform
CN106447463A (en) Commodity recommendation method based on Markov decision-making process model
Chen et al. Learning multiple similarities of users and items in recommender systems
Zhang et al. Dynamic scholarly collaborator recommendation via competitive multi-agent reinforcement learning
WO2020221022A1 (en) Service object recommendation method
TW201636930A (en) Providing service based on user operation behavior
Saha et al. Towards integrated dialogue policy learning for multiple domains and intents using hierarchical deep reinforcement learning
Gisselbrecht et al. Whichstreams: A dynamic approach for focused data capture from large social media
Arandjelović Prediction of health outcomes using big (health) data
CN114119123A (en) Information pushing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921352

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921352

Country of ref document: EP

Kind code of ref document: A1