CN112612951B

CN112612951B - Unbiased learning sorting method for income improvement

Info

Publication number: CN112612951B
Application number: CN202011491942.6A
Authority: CN
Inventors: 张伟楠; 戴心仪; 侯嘉伟; 西云佳; 俞勇
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2020-12-17
Filing date: 2020-12-17
Publication date: 2022-07-01
Anticipated expiration: 2040-12-17
Also published as: CN112612951A

Abstract

The invention discloses a benefit-oriented unbiased learning-to-rank method that directly optimizes an unbiased benefit metric from biased user click logs. First, a position-sensitive click-through rate (CTR) estimation model is learned, which models the CTR of query-document pairs with different features at different positions and yields an unbiased estimate of the user benefit. A LambdaLoss-based learning framework then provides an objective function that directly optimizes this unbiased estimate, and, because a scoring function is learned, the complexity of the testing stage can be reduced to O(N). Theoretical analysis shows that the objective function optimizes an effective upper bound of the target benefit. The effectiveness of the method is demonstrated on three public datasets, and it can be applied to scenarios such as list recommendation, web search and advertising systems.

Description

Unbiased learning sorting method for income improvement

Technical Field

The invention relates to the field of information retrieval, and in particular to an unbiased learning-to-rank method.

Background

Learning to rank is a classic problem in information retrieval and a core task in service scenarios such as internet search and recommendation. Traditional learning-to-rank methods rely on explicit relevance feedback, which typically requires annotation by human experts and is not personalized. In personalized search, recommendation and similar scenarios, manually labeled data is expensive and difficult to obtain. Implicit feedback, such as user click logs, is therefore widely used in search and recommendation as a cheap, immediate and user-centered substitute. However, user click logs are affected by how results are displayed, most importantly by the position at which an item is shown; this position bias means that clicks do not directly and accurately reflect the relevance of an item. Conventional learning-to-rank methods are mostly used in web search. User click models for web search, and the counterfactual learning methods built on them, address the mismatch between click logs and true relevance feedback through different assumptions about user browsing behavior, so that results can still be ranked in descending order of their relevance probability.

A practical recommender system can generally be modeled as a ranking problem with a specified query and context. Taking movie recommendation as an example, the query is the user's past viewing history, and the context is the time, the device used, and so on. Modeled in this way, a list recommendation system can be addressed with learning-to-rank methods that have been well studied in the traditional web search setting. However, such a system exhibits more complex user behavior patterns, which brings new challenges. Three considerations are central in such a system: first, the data bias caused by the display mode must be handled well; second, the learning objective should be close to a benefit metric of the real scenario, such as click-through rate or conversion rate; third, the online model should be efficient and have low latency in actual use. It is therefore desirable to design a system that satisfies all three requirements, being unbiased, objective-oriented and efficient.

Analysis of recent research on unbiased learning to rank:

Recently, researchers have proposed a series of counterfactual-learning approaches for learning an unbiased ranking model. "Learning to Rank with Selection Bias in Personal Search", published by Wang et al. at the international information retrieval conference of the Special Interest Group on Information Retrieval (SIGIR 2016, 39th), and "Unbiased Learning-to-Rank with Biased Feedback", published by Joachims et al. at the International Conference on Web Search and Data Mining (WSDM 2017, 10th), estimate the effect of ranking position from online randomized swap experiments and then use inverse propensity weighting to correct the position bias. However, these methods require online randomized swap experiments, which sacrifice user experience and hurt platform revenue.
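The inverse propensity weighting idea used by these prior methods can be illustrated with a minimal sketch. It is not the cited authors' code: the function name `ipw_weighted_loss`, the per-item losses and the propensity values are assumptions chosen only to show that each clicked item's loss is divided by the probability that its display position was examined.

```python
from typing import Sequence


def ipw_weighted_loss(
    clicks: Sequence[int],
    positions: Sequence[int],
    losses: Sequence[float],
    propensity: Sequence[float],
) -> float:
    """Sum the losses of clicked items, each divided by the examination
    propensity of the (1-based) position at which the item was shown."""
    total = 0.0
    for c, k, loss in zip(clicks, positions, losses):
        if c:  # only clicked items contribute
            total += loss / propensity[k - 1]
    return total


# Hypothetical propensities for positions 1..3 (assumed, not estimated here).
propensity = [1.0, 0.5, 0.25]
value = ipw_weighted_loss([1, 0, 1], [1, 2, 3], [0.2, 0.7, 0.1], propensity)
```

In this toy case the click at position 3 is up-weighted by a factor of four, which compensates for how rarely that position is examined.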

With this in mind, "Estimating Position Bias without Intrusive Interventions", published by Agarwal et al. at the International Conference on Web Search and Data Mining (WSDM 2019, 12th), provides a method that estimates the effect of ranking position directly from user click logs, thereby avoiding the high cost of online randomized experiments. Building on this, Fang et al. further proposed to estimate a query-dependent position bias propensity from user click logs, in work published at the international information retrieval conference of the Special Interest Group on Information Retrieval (SIGIR 2019, 42nd). However, these methods have strong limitations: for example, they require the ranking logs to come from several historical ranking models that place the same item at different positions, which makes them inconvenient to apply in real scenarios.

In addition, a line of work jointly learns a propensity model and a learning-to-rank model from user click logs. "Position Bias Estimation for Unbiased Learning to Rank in Personal Search", published by Wang et al. at the International Conference on Web Search and Data Mining (WSDM 2018, 11th), jointly learns the position bias coefficients and a regression-based ranking model within an EM framework. "Unbiased Learning to Rank with Unbiased Propensity Estimation", published by Ai et al. at the international information retrieval conference of the Special Interest Group on Information Retrieval (SIGIR 2018, 41st), uses the idea of dual learning to jointly learn a propensity model and an unbiased learning-to-rank model. "Unbiased LambdaMART: An Unbiased Pairwise Learning-to-Rank Algorithm", published by Hu et al. at The World Wide Web Conference (WWW 2019), also uses joint learning, and computes a position propensity score not only for clicked items but also for non-clicked items. The problem with this type of approach is that, in a joint learning framework, neither the propensity model nor the relevance model has an exact supervision signal: the propensity model can be learned well only if the relevance estimate is accurate enough, while learning relevance in turn relies on a good propensity model.

The following conclusions can be drawn from the related research: current unbiased learning-to-rank methods based on user interaction rely on expensive online randomized experiments, have strict data requirements, or lack a training framework with guarantees, and are therefore difficult to deploy in real scenarios. Moreover, most of them optimize traditional relevance metrics such as MAP and NDCG rather than a specific benefit-oriented objective, so there is a gap between the offline evaluation metrics and the real online evaluation metrics.

Therefore, those skilled in the art are working to develop an unbiased learning-to-rank method that suits general data scenarios, requires no additional online interaction, and can optimize a real online benefit metric.

Disclosure of Invention

In view of the above drawbacks of the prior art, the invention addresses two problems: obtaining a learning objective that is as unbiased as possible from real user interaction without additional interaction, and making the learning objective as close as possible to the benefit metric of the real scenario. In addition, the learning method should be efficient and have low latency in the testing stage.

To achieve the above purpose, the invention provides a benefit-improvement-oriented unbiased learning-to-rank method, in which an unbiased learning-to-rank model is built from biased user behavior data and is optimized directly for benefit improvement.

Further, the unbiased learning-to-rank model includes a position-sensitive click-through rate estimation model and a ranking scoring function.

Further, learning is carried out in two steps: in the first step, the position-sensitive click-through rate estimation model is learned to obtain an unbiased estimate of the target benefit; in the second step, the ranking scoring function is learned using a pairwise loss function based on the change in the benefit estimate.

Further, the learning goal of the position-sensitive click-through rate estimation model is an unbiased estimate of the target benefit.

Further, the loss function of the ranking scoring function optimizes an effective upper bound of the target benefit.

Further, the method comprises the following steps:

step 1, obtaining a click log of the user from interaction with the user;

step 2, defining the most needed form of target benefit according to the specific application scenario, wherein any benefit that can be expressed as a weighted sum of click-through rates or purchase rates can be defined as the learning target in this step and is directly optimized in the subsequent steps;

step 3, randomly sampling S_c click records from the click log and training the position-sensitive click-through rate estimation model g_θ(f_i, k_i) on them;

step 4, obtaining an unbiased estimate of the target benefit on each query from the click log and the position-sensitive click-through rate estimation model g_θ(f_i, k_i);

step 5, starting the learning of the ranking model by randomly initializing the ranking scoring function s_i = Φ(f_i);

step 6, ranking with the current scoring function to obtain a ranked list under each query;

step 7, randomly selecting S_r document pairs, each pair drawn from the same query, and computing the difference in the target benefit estimate after swapping the two documents of each pair;

step 8, updating the ranking scoring function Φ(f_i) according to the difference in the target benefit estimate of each pair;

step 9, repeating steps 6 to 8 until the ranking scoring function converges.

Further, the click log in step 1 is expressed as a set of records indexed by query q in Q and document i = 1, ..., n_q, where Q represents the set of all queries, n_q represents the number of documents under the current query, and i and q are the subscripts of the document and the query, respectively. Each record in the click log includes the following information: the query feature f_q, the document feature f_d and the context feature f_c (the three are collectively denoted by f_i), the benefit weight b_i, the position k_i, and an indicator of whether the item was clicked.
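One way to hold such a click log in memory is sketched below. The class name `ClickRecord`, its field names and the sample values are hypothetical and are not taken from the patent; only the information carried by each record follows the description above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClickRecord:
    """One logged (query, document) impression."""
    features: List[float]   # f_i: query, document and context features combined
    benefit_weight: float   # b_i
    position: int           # k_i, 1-based position at which the item was shown
    clicked: bool           # whether the item was clicked


# The log maps each query id q in Q to its n_q records.
ClickLog = Dict[str, List[ClickRecord]]

log: ClickLog = {
    "q1": [
        ClickRecord([0.3, 1.2], benefit_weight=2.0, position=1, clicked=True),
        ClickRecord([0.9, 0.1], benefit_weight=0.5, position=2, clicked=False),
    ]
}
n_q = len(log["q1"])  # number of documents under query "q1"
```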

Further, in step 2, the target benefit is defined as the weighted expected sum of item clicks on the ranked lists over a fixed query set. Specifically, the target benefit on each query is the sum, over the items i in the ranked list, of b_i multiplied by the click-through rate of item i at its current position, where b_i is the benefit-related weight of each item. For example, in video recommendation this weight can be defined as the watching duration of the video, while in advertisement search it can be defined as the auction price of the advertisement.
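The per-query target benefit described above is a plain weighted sum, which the following sketch computes. The function name `target_benefit` and the numbers in the example are illustrative assumptions.

```python
from typing import Sequence


def target_benefit(benefit_weights: Sequence[float],
                   click_rates: Sequence[float]) -> float:
    """Weighted expected sum of clicks over one ranked list: sum_i b_i * ctr_i."""
    if len(benefit_weights) != len(click_rates):
        raise ValueError("one click rate is needed per item")
    return sum(b * p for b, p in zip(benefit_weights, click_rates))


# Video example: weights are watching durations in minutes (made-up numbers),
# click rates are those of each item at its current position.
util = target_benefit([10.0, 4.0, 2.0], [0.5, 0.25, 0.125])
```

Setting every weight to 1 reduces the benefit to the expected number of clicks on the list.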

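Further, in step 3, a loss function that accumulates, over the sampled click data, the cross-entropy loss between the logged click and the predicted click-through rate g_θ(f_i, k_i) is optimized until convergence, where l(p, q) = -p log q - (1-p) log(1-q) is the cross-entropy loss.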
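The cross-entropy loss l(p, q) can be written out as follows. Averaging over the sample and clipping q away from 0 and 1 are assumptions made here for numerical convenience, and the predicted click rates stand in for the outputs of g_θ(f_i, k_i).

```python
import math
from typing import Sequence


def cross_entropy(p: float, q: float, eps: float = 1e-12) -> float:
    """l(p, q) = -p*log(q) - (1-p)*log(1-q), with q clipped into (0, 1)."""
    q = min(max(q, eps), 1.0 - eps)
    return -p * math.log(q) - (1.0 - p) * math.log(1.0 - q)


def ctr_loss(clicks: Sequence[int], predicted: Sequence[float]) -> float:
    """Mean cross entropy between logged clicks and predicted click rates."""
    return sum(cross_entropy(c, g) for c, g in zip(clicks, predicted)) / len(clicks)


# Two logged impressions, both predicted at a click rate of 0.5.
loss = ctr_loss([1, 0], [0.5, 0.5])
```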

Further, in step 4, the unbiased estimate is calculated by a formula in which k_i denotes the position of item i under the current ranking and a second position variable denotes its position in the click log.

The unbiasedness of this estimate can be demonstrated by derivation.

Further, in step 7, the difference is calculated as:

ΔUtil(i, j) = u(i, k_j) + u(j, k_i) - u(i, k_i) - u(j, k_j);

where u(i, k_i) denotes the benefit of placing item i at position k_i.

Further, in step 8, the update target is derived from the pairwise loss function weighted by this difference.
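Steps 5 to 9 can be sketched as the loop below. Several parts are assumptions rather than the patented formulas: the scoring function Φ is taken to be linear, the pairwise loss is taken to be a logistic loss weighted by |ΔUtil| in the LambdaLoss style, and the toy `u` simply divides a made-up click rate by the position, because the explicit expression of u(i, k) is not reproduced in this text. Only `delta_util` follows the formula given above.

```python
import math
import random
from typing import Callable, List, Sequence

Utility = Callable[[int, int], float]  # u(i, k): estimated benefit of item i at position k


def delta_util(u: Utility, i: int, j: int, k_i: int, k_j: int) -> float:
    """Change in estimated benefit if items i and j swap positions."""
    return u(i, k_j) + u(j, k_i) - u(i, k_i) - u(j, k_j)


def score(w: Sequence[float], f: Sequence[float]) -> float:
    """Linear scoring function s_i = w . f_i (an assumed form of Phi)."""
    return sum(a * b for a, b in zip(w, f))


def train(feats: List[List[float]], u: Utility, iters: int = 200,
          s_r: int = 2, lr: float = 0.1, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    w = [rng.uniform(-0.01, 0.01) for _ in feats[0]]            # step 5
    n = len(feats)
    for _ in range(iters):                                       # step 9
        scores = [score(w, f) for f in feats]
        order = sorted(range(n), key=lambda d: scores[d], reverse=True)  # step 6
        pos = {d: rank for rank, d in enumerate(order, start=1)}
        for _ in range(s_r):                                     # step 7
            i, j = rng.sample(range(n), 2)
            d = delta_util(u, i, j, pos[i], pos[j])
            if d == 0.0:
                continue
            lower, upper = (i, j) if pos[i] > pos[j] else (j, i)
            # A positive difference means the swap helps, so the item
            # currently lower in the list should outscore the other one.
            win, lose = (lower, upper) if d > 0 else (upper, lower)
            margin = score(w, feats[win]) - score(w, feats[lose])
            grad = -1.0 / (1.0 + math.exp(margin))  # d/dmargin of log(1 + e^-margin)
            w = [wk - lr * abs(d) * grad * (a - b)                # step 8
                 for wk, a, b in zip(w, feats[win], feats[lose])]
    return w


# Toy query with three documents and one feature each (made-up values).
rel = [0.1, 0.9, 0.5]
feats = [[r] for r in rel]
u = lambda i, k: rel[i] / k   # assumed stand-in for the estimated benefit
w = train(feats, u)
ranking = sorted(range(len(feats)), key=lambda d: score(w, feats[d]), reverse=True)
```

On this toy data the learned weight ends up positive, so the documents are ranked in descending order of the made-up click rate. At test time only `score` is evaluated once per document before sorting, with no matching step.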

Compared with the prior art, the invention has the following beneficial effects:

1. The invention provides a general unbiased learning-to-rank method that can be directly optimized for benefit metrics in real scenarios.

2. In modeling the position bias, the invention takes both query-level and item-level features into account, so that the way user attention declines as the position increases is modeled more carefully and accurately.

3. The learning objective of the invention is an unbiased estimate of the target benefit, rather than a relevance-based metric such as MAP or NDCG, which brings the learning objective closer to the real online metric.

4. The learning objective is optimized through LambdaLoss, so that the maximum-matching problem is reduced to computing a scoring function, and the complexity of the testing stage is reduced from O(N³) to O(N).

5. Experiments show that, on datasets generated from public learning-to-rank datasets, the method outperforms the latest unbiased learning-to-rank methods.

Drawings

Fig. 1 is a schematic flow diagram of the ranking method of an embodiment of the present application.

Detailed Description

The preferred embodiments of the present application will be described below with reference to the accompanying drawings so that the technical contents thereof will be more clearly understood. The present application may be embodied in many different forms of embodiments and the scope of the present application is not limited to only the embodiments set forth herein.

The conception, the specific structure and the technical effects of the present invention will be further described below to fully understand the objects, the features and the effects of the present invention, but the present invention is not limited thereto.

Referring to the attached figure 1, the implementation flow is as follows:

step one, generating a click log: in this step, a click model computes the click-through rates for the queries and document lists of a real dataset, and the simulated user's click log is then sampled from these click-through rates. The click log contains the queries, the document features, the benefit weights (such as the bid for an advertisement or the watching duration of a video), the position in the list, and whether the item was clicked. This matches the form of click data collected in real scenarios;

step two, defining the most needed form of target benefit according to the specific application scenario, wherein any benefit that can be expressed as a weighted sum of click-through rates or purchase rates can be defined as the learning target in this step and is directly optimized in the subsequent steps;

step three, randomly sampling S_c click records from the click log and training the position-sensitive click-through rate estimation model g_θ(f_i, k_i) on them, repeating the process until training converges;

step four, obtaining an unbiased estimate of the target benefit on each query from the click log and the position-sensitive click-through rate estimation model g_θ(f_i, k_i);

step five, starting the learning of the ranking model by randomly initializing the ranking scoring function s_i = Φ(f_i);

step six, ranking with the current scoring function to obtain a ranked list under each query;

step seven, randomly selecting S_r document pairs, each pair drawn from the same query, and computing the difference in the target benefit estimate after swapping the two documents of each pair, using the formula

ΔUtil(i, j) = u(i, k_j) + u(j, k_i) - u(i, k_i) - u(j, k_j);

step eight, updating the ranking scoring function Φ(f_i) according to the difference in the target benefit estimate of each pair, the update target being derived from the pairwise loss function weighted by this difference;

step nine, repeating steps six to eight until the ranking scoring function converges;

step ten, for each query in the test set and its document list, computing the score of each document with the scoring function, sorting the documents in descending order of score, feeding the re-ranked document list into the click model again to obtain the user's click-through rates on the new ranked list, and computing the benefit of the ranking function from these click-through rates.
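The simulation in steps one and ten can be sketched with a simple position-based click model. The examination curve 1/k^eta, the attractiveness values and all function names are assumptions for illustration; the patent text does not specify which click model was used.

```python
import random
from typing import List, Sequence


def examination(k: int, eta: float = 1.0) -> float:
    """Assumed probability that position k (1-based) is examined."""
    return 1.0 / (k ** eta)


def rank_by_score(scores: Sequence[float]) -> List[int]:
    """Document indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)


def click_rates(order: Sequence[int], attractiveness: Sequence[float]) -> List[float]:
    """Click rate of each listed document: examination(k) * attractiveness."""
    return [examination(k) * attractiveness[d] for k, d in enumerate(order, start=1)]


def sample_clicks(order: Sequence[int], attractiveness: Sequence[float],
                  rng: random.Random) -> List[bool]:
    """Step one: draw one simulated click per listed document."""
    return [rng.random() < p for p in click_rates(order, attractiveness)]


# Step ten on one test query: score, sort, and re-evaluate with the click model.
scores = [0.2, 1.5, 0.7]            # outputs of a learned scoring function (made up)
attractiveness = [0.1, 0.8, 0.4]    # click-model parameters (made up)
order = rank_by_score(scores)
rates = click_rates(order, attractiveness)
num_click = sum(rates)              # expected clicks on this query
ctr = num_click / len(order)        # average click rate per document
clicks = sample_clicks(order, attractiveness, random.Random(0))
```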

To verify the effectiveness of the invention, we report comparative experimental results on three standard datasets. The baselines are: SVMRank and LambdaRank, two ranking methods that apply no bias correction; two bias-corrected methods that use, respectively, online randomized swap experiments (Randomization) and query-dependent position bias estimation (CPBM), corresponding to "Unbiased Learning-to-Rank with Biased Feedback" and "Intervention Harvesting for Context-Dependent Examination-Bias Estimation"; and DLA, which uses a deep neural network (DNN) and corresponds to "Unbiased Learning to Rank with Unbiased Propensity Estimation". These are state-of-the-art unbiased learning-to-rank methods. In addition, a variant that corrects the bias with the true position bias, and KM (oracle), which performs maximum matching according to the true click-through rates, are reported. Both use information from the user click model that generated the click log, so they serve only as references rather than as comparison schemes: they represent, respectively, the upper limit of counterfactual learning methods and the ideal best ranking result.

We performed comparative experiments on Yahoo LETOR set 1, MSLR-WEB10K and istella, three standard datasets commonly used in the learning-to-rank field. The main evaluation metrics are the relevance-based MAP and NDCG@10, and the click-based #Click and CTR, which denote the average number of clicks per query and the average click-through rate per document, respectively; the click-based metrics are the benefit metrics that a real information system cares about most. The overall comparative results are shown in Tables 1, 2 and 3.
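For reference, NDCG@10 can be computed as in the sketch below. It uses the common exponential-gain convention (2^rel - 1 with a log2 discount); the patent does not state which NDCG variant was used, so this choice is an assumption.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k items of a ranked list."""
    return sum((2 ** r - 1) / math.log2(pos + 1)
               for pos, r in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in ranked order (made-up values).
perfect = ndcg_at_k([3, 2, 1, 0])
reversed_list = ndcg_at_k([0, 1, 2, 3])
```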

Table 1. Comparative experimental results on Yahoo LETOR set 1

Table 2. Comparative experimental results on MSLR-WEB10K

Table 3. Comparative experimental results on istella

It can be observed from the tables that the method of the invention achieves better results on the benefit metrics than common unbiased learning-to-rank methods, which indicates that the technical solution of the invention is effective.

The foregoing describes in detail the preferred embodiments of the present application. It should be understood that those skilled in the art can devise numerous modifications and variations in light of the present teachings without departing from the inventive concept. Therefore, technical solutions that those skilled in the art can obtain through logical analysis, reasoning or limited experiments based on the concept of the present application fall within the scope of protection defined by the claims.

Claims

1. A benefit-improvement-oriented unbiased learning-to-rank method, characterized in that an unbiased learning-to-rank model is built from biased user behavior data and is optimized directly for benefit improvement;

the unbiased learning-to-rank model comprises a position-sensitive click-through rate estimation model and a ranking scoring function;

the learning process is carried out in two steps:

step one, learning the position-sensitive click-through rate estimation model to obtain an unbiased estimate of the target benefit;

step two, learning the ranking scoring function using a pairwise loss function based on the change in the unbiased estimate obtained in the previous step;

the method comprises the following steps:

step 1, obtaining a click log of the user from interaction with the user;

step 2, defining a form of target benefit according to the specific application scenario, the form of the target benefit comprising a weighted sum of click-through rates or purchase rates;

step 3, randomly sampling S_c click records from the click log and training the position-sensitive click-through rate estimation model g_θ(f_i, k_i) on them;

step 4, obtaining an unbiased estimate of the target benefit on each query from the click log and the position-sensitive click-through rate estimation model g_θ(f_i, k_i);

step 5, starting the learning of the ranking scoring function by randomly initializing the ranking scoring function s_i = Φ(f_i);

step 6, ranking with the current ranking scoring function to obtain a ranked list under each query;

step 7, randomly selecting S_r document pairs, each pair drawn from the same query, and computing the difference in the target benefit estimate after swapping the two documents of each pair;

step 8, updating the ranking scoring function Φ(f_i) according to the difference in the target benefit estimate of each pair;

step 9, repeating steps 6 to 8 until the ranking scoring function converges.

2. The method of claim 1, wherein in step 1, the click log is represented as a set of records indexed by query q in Q and document i = 1, ..., n_q, where Q represents the set of all queries, n_q represents the number of documents under the current query, and i and q are the subscripts of the document and the query, respectively, each record including the following information: the benefit weight b_i, the position k_i, an indicator of whether the item was clicked, and the features f_i, f_i including the query features f_q, the document features f_d and the context features f_c.

3. The method according to claim 2, wherein in step 2, the target benefit is defined as the weighted expected sum of item clicks on the ranked list of each query in a fixed set of queries; specifically, the target benefit on each query is the sum, over the items i in the ranked list, of b_i multiplied by the click-through rate of item i at its current position, where b_i is the benefit-related weight of each item.

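4. The method of claim 3, wherein in step 3, a loss function that accumulates, over the sampled click data, the cross-entropy loss between the logged click and the predicted click-through rate g_θ(f_i, k_i) is optimized until convergence, where l(p, q) = -p log q - (1-p) log(1-q) is the cross-entropy loss.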

5. The method of claim 4, wherein in step 4, the unbiased estimate of the target benefit is calculated by a formula in which k_i denotes the position of item i under the current ranking and a second position variable denotes its position in the click log.

6. The method of claim 5, wherein in step 7, the difference in the target benefit estimate is calculated as:

ΔUtil(i, j) = u(i, k_j) + u(j, k_i) - u(i, k_i) - u(j, k_j);

7. The method of claim 6, wherein in step 8, the update target is derived from a pairwise loss function weighted by the difference in the target benefit estimate.