WO2020209966A1

WO2020209966A1 - Training a target model

Info

Publication number: WO2020209966A1
Application number: PCT/US2020/021929
Authority: WO
Inventors: Xue LI; Zhipeng LUO; Hao Sun; Jianjin ZHANG; Weihao HAN; Xianqi CHU; Liangjie Zhang; Qi Zhang
Original assignee: Microsoft Technology Licensing, Llc
Priority date: 2019-04-12
Filing date: 2020-03-11
Publication date: 2020-10-15
Also published as: CN111813888A

Abstract

The present disclosure provides method and apparatus for training a target model. At least one reference model can be trained with a first dataset. A second dataset and a third dataset can be scored through the at least one reference model, respectively. The target model can be trained with the scored second dataset. The target model can be optimized with the scored third dataset.

Description

TRAINING A TARGET MODEL

BACKGROUND

[0001] With the development of technologies such as machine learning, deep learning and neural networks, various models based on these technologies have been continuously developed and applied. Taking search engines as an example, using search engines to find specific content on the web has become a part of the daily lives of computer users. After receiving a user's search query, a search engine first recall, from a pre-established index database, a specific number of documents associated with the query by using a trained matching model, then subsequently processes the documents, such as relevance filtering, sorting, etc., and finally selects the highest ranked series of documents to present to the user. Since all of these subsequent processing is performed only on the recalled documents, the matching model needs to be trained to be able to recall the documents most relevant to the query. In addition, as a preliminary step in the search process, the matching model needs to recall the documents quickly enough to cope with intensive user query requests and the user's immediate demand for response.

SUMMARY

[0002] This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

[0003] Embodiments of the present disclosure provide method and apparatus for training a target model. At least one reference model can be trained with a first dataset. A second dataset and a third dataset can be scored through the at least one reference model, respectively. The target model can be trained with the scored second dataset. The target model can be optimized with the scored third dataset.

[0004] It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.

[0006] FIG. l is a schematic diagram of an exemplary fast matching model.

[0007] FIG.2 illustrates an exemplary process for training a target model through a reference model according to an embodiment of the present disclosure.

[0008] FIG.3 is a schematic diagram of an exemplary bottom crossing matching model.

[0009] FIG.4 illustrates an exemplary process for training a reference model through multi-task learning according to an embodiment of the present disclosure.

[0010] FIG.5 illustrates an exemplary process for optimizing a target model according to an embodiment of the present disclosure.

[0011] FIG.6 is a flowchart of an exemplary method for training a target model according to an embodiment of the present disclosure.

[0012] FIG.7 illustrates an exemplary apparatus for training a target model according to an embodiment of the present disclosure.

[0013] FIG.8 illustrates an exemplary apparatus for training a target model according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

[0014] The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.

[0015] Currently, a fast matching model is commonly used in search engines to recall documents relevant to queries from a pre-established index database. Herein, the fast matching model refers to a lightweight and bottom-separable model that can convert queries and documents individually into representation vectors in a common inner product space without the need to know each other at the bottom of the model. Common fast matching models include, for example, Deep Structure Semantic Model (DSSM), Convolutional Depth Structure Semantic Model (CDSSM), and the like.

[0016] FIG. l is a schematic diagram of an exemplary fast matching model 100. As shown in FIG. 1, inputs to the fast matching model 100 can include a query 110 and a document 120. In an aspect, for the query 110, the fast matching model 100 can include an embedding layer 112 for converting each word in a sequence of input words into a feature vector; a convolution layer 114 for extracting context features based on a sliding window around each word; a pooling layer 116 for selecting the most important context features; and a semantic layer 118 for representing high level semantic feature vectors of the sequence of input words. In another aspect, for the document 120, a feature extraction can be performed first. The extracted features may, for example, include at least one of: keywords 122 characterizing the core topic of the document 120; a document title 124 indicating a title of the document 120; a Uniform Resource Locator (URL) 126 indicating an address of the document 120 on the Internet; a description 128 summarizing a main content of the document 120; and a Landing Page (LP) title 130 indicating the title of the LP corresponding to the document 120. Herein, the LP refers to a page corresponding to a link that a user reaches after clicking on the link on a search result page. For the above-mentioned features extracted from the document 120, the fast matching model 100 can include an embedding layer 132, a convolution layer 134, a pooling layer 136 and a semantic layer 138, the functions of which are similar to the functions of the corresponding layers for the query 110, respectively. Moreover, the fast matching model 100 can also include a scoring layer 180 for determining relevance between a feature vector of the query 110 output from the semantic layer 118 and a feature vector of the document 120 output from the semantic layer 138. It should be appreciated that the fast matching model 100 shown in FIG. 1 is merely one example of existing fast matching models. The fast matching model can have any other structure and can include more or fewer layers depending on the actual application requirements.

[0017] In some cases, the fast matching model may store feature vectors calculated for a large number of documents in an index database in advance in order to reduce the amount of online calculation. For example, for a large number of documents, feature vectors of these documents may be respectively calculated in advance through the embedding layer 132, the convolution layer 134, the pooling layer 136 and the semantic layer 138 shown in FIG. 1, and stored in the index database. When receiving a query request from a user, the fast matching model may perform a calculation of a feature vector only for the query, for example, through the embedded layer 112, the convolution layer 114, the pooling layer 116 and the semantic layer 118, etc. shown in FIG. 1. Then, documents matching the query are retrieved by performing relevance matching between the feature vector of the query and the feature vector of the documents stored in the index database. For example, the relevance matching can be performed efficiently by using a neighbor search algorithm. Thus, with the fast matching model, documents relevant to the entered query can be quickly recalled from the index database.

[0018] As shown in FIG. 1, the fast matching model 100 separates a query side from a document side and performs vector conversion on the query 110 and the document 120, respectively. Since the query and the document do not interact until the scoring layer, the fast matching model often loses information that is important for matching between the query and the document when performing vector conversion, which will greatly limits the performance of the matching model, thus affecting the accuracy of the recalled documents. In addition, the fast matching model is typically trained on training data with human- provided relevance labels. Such training data is expensive and time consuming to collect, and thus its number is limited. However, the performance of the fast matching model depends on whether there is a large amount of training data with relevance labels.

[0019] Therefore, it is desirable to improve the performance of the fast matching model to recall documents relevant to a query more accurately and more quickly. However, there are many challenges to this improvement.

[0020] In an aspect, a matching model with a more complex structure can be tried instead of a fast matching model to achieve better matching accuracy. Such a matching model can be, for example, a bottom crossing matching model. Herein, the bottom crossing matching model refers to a model in which a input query and a document interact immediately after an embedding layer. Common bottom crossing matching models include Deep Crossing Model, Decision Tree Ensemble Model, Deep Cross Network Model, and so on. Since in the bottom crossing matching model, the query and the document interact immediately after the embedding layer, this structure of the bottom crossing matching model can provide better performance than the fast matching model, but at the same time the computational complexity increases greatly. In practical applications, query requests from users are extremely dense. As a preliminary step in performing the search process, recalling documents matching queries from the index database must be performed quickly. Therefore, the bottom crossing matching model cannot be directly deployed to perform a recall for documents.

[0021] In another aspect, it is proposed in some techniques to train a fast matching model using data without relevance labels as an alternative to human labeled training data. For example, it is proposed to use search log data to train the fast matching model. In this case, a user click is often used as an alternative to a human provided relevance label. For example, a query-document pair that has been clicked by a user is considered relevant, and a query-document pair synthesized by a query with other randomly extracted documents is considered irrelevant. However, there are many problems with this way of using search log data. For example, the arbitrariness and subjectivity of user behavior can lead to deviations between user clicks and actual relevance, and the synthesized irrelevant query-document pairs are also likely to contain actually relevant query-document pairs, which will both reduce the accuracy of the training data, resulting in "pollution" of the training data. Furthermore, in order to distinguish relevancies at a finer granularity, human provided labels are generally hierarchical, for example, indicating different levels of relevancies by 6 values of "0" to "5", where a larger value indicates more relevant. Such labels may be difficult to approximate by processing the search log data based on user clicks.

[0022] The embodiments of the present disclosure propose to improve the performance of a target model through an improved training process. For example, the target model can be trained by using a reference model. Herein, the target model refers to a model expected to be trained that is simple and deployable, such as a fast matching model, and the reference model refers to a model with a relatively complex structure that can be used to assist in training the target model and is generally not directly deployable, such as a bottom crossing matching model. It should be appreciated that although the following discussion relates to an example of training a fast matching model using a bottom crossing matching model, the embodiments of the present disclosure are not limited thereto, but other types of reference models may be used to train other types of target models in a similar manner.

[0023] In an aspect, in accordance with the embodiments of the present disclosure, a large number of datasets without labels can be scored using a reference model to obtain a large amount of training data for training a target model. For example, a dataset without labels may include search log data, which may include queries and documents from a large number of search processes of search engines, etc., and thus the number thereof is enormous. Since the reference model can be a model with higher performance, its scoring on samples in the dataset will have high accuracy and can better approximate human labeling. The target model can be trained using the obtained large amount of training data. Since the amount of these training data will greatly exceed the available human labeled training data, and scores in these training data have high accuracy, this will help to train the target model with better performance.

[0024] In another aspect, the embodiments of the present disclosure may also use the reference model to further optimize the target model being trained. Another set of datasets with labels can be scored using the reference model to obtain a scored dataset with labels. Each sample in the scored dataset with labels includes a label and a score provided by the reference model. These samples with both labels and scores can be used to optimize the target model being trained. [0025] In yet another aspect, the embodiments of the present disclosure also propose an effective training approach for the reference model. For example, the reference model can be trained by jointly learning a plurality of relevant tasks so that it can distinguish relevancies at finer granularity with higher accuracy.

[0026] FIG.2 illustrates an exemplary process 200 for training a target model through a reference model according to an embodiment of the present disclosure. As an example, the target model can be a fast matching model.

[0027] Firstly, a first dataset 210 for training a reference model can be obtained. The first dataset 210 can be, for example, a dataset with labels. The first dataset 210 can include a plurality of samples. Each sample can include a query, a document and a label, such as represented as a triplet <query, document, label>, where the label can indicate relevance between the query and the document.

[0028] The labels in the first dataset 210 can be human added or added in any other manner. The relevance of each query-document pair in the first dataset 210 can be scored and a label indicating the relevance of the query-document pair can be given. The human added labels are relatively trustworthy and thus considered to be "strong annotations". In addition, in order to distinguish the relevancies at a fine granularity, the labels in the first dataset 210 are usually hierarchical, enumerated labels, for example, representing different levels of relevancies by a set of relevance values, wherein a greater relevance value indicates more relevant. As an example, the relevance values of labels may be (0, 1, 2, 3, 4, 5}, where "0" indicates irrelevant and "5" indicates the most relevant.

[0029] In one case, there can be two types of labels for the same query-document pair. The first type of label is a document copy label that indicates relevance between a query and a document copy. Herein, the document copy refers to information about the document that a user can see on a search results page. The second type of label is the landing page label, which indicates relevance between the query and a landing page. Herein, the landing page refers to a page that a user reaches after clicking on a link corresponding to a document on a search result page. The "label" in the sample's triplet <query, document, label> may include the document copy label and the landing page label respectively, or may be a comprehensive label obtained based on the document copy label and the landing page label.

[0030] At 220, each enumerated label in the first dataset 210 can be converted into a set of binary labels by constructing a set of tasks to obtain a converted first dataset 210. This conversion makes fuller use of the fine-grained information provided by enumerated labels. Herein, the binary label may include a positive label indicating that the query and the document are relevant, such as " 1 ", and a negative label indicating that the query and the document are irrelevant, such as "0". When training a matching model, converting an enumerated label to a binary label will improve the performance of the matching model.

[0031] Typically, an enumerated label will be uniquely converted into a positive label and a negative label. For example, in the case that a label has six relevance values of "0" to "5", a label with a relevance value of "0" is converted into a negative label, and a label with a relevance values greater than "0" is converted into a positive label. However, this conversion does not take into account the degree of distinction between labels with values greater than "0". For example, a label with a relevance label of "2" and a label with a relevance label of "3 " are all converted into positive labels. In contrast, in an embodiment of the present disclosure, an enumerated label is not uniquely converted into a positive label and a negative label, but is converted into a set of binary labels by a set of tasks to increase the degree of distinction between different relevance values.

[0032] At 230, at least one reference model can be trained with the converted first dataset 210. It is to be noted that since the operation of constructing the tasks at 220 is optional, the first dataset 210 can also be used directly to train the at least one reference model.

[0033] In an implementation, the reference model can be, for example, a bottom crossing matching model. FIG.3 is a schematic diagram of an exemplary bottom crossing matching model 300. As shown in FIG. 3, inputs to the bottom crossing matching model 300 can include a query 310 and a document 320. In an aspect, for the query 310, the bottom crossing matching model 300 can include an embedding layer 340 to convert the query 310. In another aspect, for the document 320, a feature extraction can be performed first. The extracted features may include, for example, at least one of keywords 322, a document title 324, an URL 326, a description 328 and a LP title 330. For the above-mentioned features extracted from the document 320, the bottom crossing matching model 300 can include embedding layers 342, 344, 346, 348, and 350 to convert individual features of the document 320, respectively. Then, the outputs of the embedding layers 340 to 350 may be provided together to a stacking layer 360 to be stacked into one feature vector and provided to a residual layer 370. The residual layer 370 is composed of residual units that can after transforming original input features by, for example, two layers of Rectified Linear Units (ReLUs), add the transformed input features to the original input features by dimension. Finally, the feature vector is scored by a scoring layer 380 to indicate relevance between the query 310 and the document 320. It should be appreciated that the bottom crossing matching model 300 shown in FIG. 3 is merely one example of bottom crossing matching models. The bottom crossing matching model can have any other structure and can include more or fewer layers depending on the actual application requirements.

[0034] It should be appreciated that the embodiments of the present disclosure may employ a single reference model or multiple reference models to train the target model. Thus, a single reference model can be trained or multiple reference models can be trained separately at 230. In the case of multiple reference models, these reference models may have the same model structure, for example, all are Deep Crossing Model, or these reference models may have different model structures, for example, combinations of Deep Crossing Model, Decision Tree Ensemble Model, and so on. In the case that the multiple reference models have different model structures, since each reference model has its own advantages, the larger the difference in model structure, the stronger the performance of the model ensemble obtained by subsequent combination.

[0035] In the process 200, after training at least one reference model at 230, the second dataset 240 can be scored using the at least one reference model, wherein the second dataset 240 will be used to form training data for training the target model. The second dataset 240 can be, for example, a dataset without labels. The second dataset 240 can include a plurality of samples, each sample including at least a query and a document, and having a structure such as <query, documents The samples in the second dataset 240 can be based on, for example, search log data.

[0036] At least one reference model may score each of the samples in the second dataset 240 to obtain a relevance score for the sample. Here, the relevance score obtained through the reference model is also referred to as a target score, which indicates relevance between the query and the document in the sample, and serves as a reference for the subsequent training the target model. The scored second dataset 240 forms a first scored dataset 250. Samples in the first scored dataset 250 may have a structure such as <query, document, target score>. Assuming that the target score of the z-th sample in the first scored dataset 250 is represented

wherein 0>Si>l, the larger s_t indicates that the query is more relevant to the document. Since the target score is given by the reference model, which is slightly less reliable than the human provided labels, it is therefore also referred to as "weak annotation".

[0037] In an embodiment, if the at least one reference model includes more than one reference model, for each sample in the second dataset 240, the relevance between the query and the document in the sample may be scored through the at least one reference model, to obtain at least one initial score of the sample. Subsequently, a target score of the sample can be generated based on the at least one initial score. For example, where the at least one reference model includes two reference models, two initial scores of the sample can be obtained by scoring the sample through each reference model, respectively. The target score of the sample can then be generated based on the two initial scores. In one example, the two initial scores can be arithmetically averaged, and the result obtained is taken as the target score for the sample.

[0038] As previously described, the samples in the second dataset 240 can be based on search log data. Since the amount of search log data is large, a large amount of scored search log data may be obtained by scoring it through a reference model. Thus, a large amount of training data available for training the target model will be included in the first scored dataset 250.

[0039] At 260, the first scored dataset 250 can be used to train the target model. The target model can be, for example, a fast matching model. In an embodiment, for each sample in the first scored dataset 250, a relevance score of the sample may be obtained by scoring the sample through the target model. Here, the relevance score obtained through the target model may also be referred to as a predicted score. A prediction loss of the sample can then be calculated using both the sample's target score provided by the reference model and the predicted score provided by the target model, and the target model can be trained by minimizing the prediction loss.

[0040] In process 200, the third dataset 270 can also be scored using at least one reference model, wherein the third dataset 270 will be used to form optimization training data for optimizing the target model being trained. The third dataset 270 can be, for example, a dataset with labels. The third dataset 270 can include a plurality of samples, each sample including at least a query, a document, and an label human provided or provided in other ways, and having a structure such as <query, document, label>, wherein the label indicates relevance between the query and the document. For each sample in the third dataset 270, the sample can be scored through at least one reference model to obtain a relevance score of the sample. Here, the relevance score obtained through the reference model is also referred to as a target score, which indicates relevance between the query and the document in the sample, and serves as a reference for the subsequent optimizing the target model. The scored third dataset 270 forms a second scored dataset 280. Samples in the second scored dataset 280 may have a structure such as <query, document, label, target score>. The approach in which the third dataset 270 is scored using the at least one reference model may be similar to the approach in which the second dataset 240 is scored using the at least one reference model.

[0041] At 290, the second scored dataset 280 can be used to optimize the target model trained at 260. For each sample in the second scored dataset 280, the sample may be scored through the target model to obtain a relevance score of the sample, which can also be referred to as a predicted score. A prediction loss corresponding to the sample can then be calculated using a combination of the label of the sample, the target score provided by the reference model and the predicted score provided by the target model, and the target model can be optimized by minimizing the prediction loss.

[0042] The target model ultimately obtained by process 200 can be deployed online for performing a recall of documents, while the at least one reference model only runs offline for training the target model. It should be appreciated that herein, the use of the reference model to train the target model may encompass both the initial training operations on the target model at 260 and the optimization operations on the target model being trained at 290.

[0043] The embodiments of the present disclosure propose a method of training a reference model through multi-task learning (MTL). Herein, the MTL refers to the use of enumerated labels to build a plurality of relevant tasks and to train the model by jointly learning the plurality of tasks. When training the reference model, the use of the MTL can make fuller use of fine-grained information provided by the enumerated labels.

[0044] FIG.4 illustrates an exemplary process 400 for training a reference model through MTL according to an embodiment of the present disclosure. The process 400 may correspond to the operations 220 and 230 in FIG. 2.

[0045] At 402, enumerated labels in samples of a dataset used to train the reference model can be converted into a set of binary labels through a set of tasks. The dataset is, for example, the first dataset 210 in FIG 2. The number of the set of binary labels may be equal to or less than the dimension of the value of the enumerated labels. The enumerated labels can include a plurality of relevance values, such as (0, 1, 2, 3, 4, 5}. The binary labels may include a positive label indicating that the query and the document are relevant, such as " 1 ", and a negative label indicating that the query and the document are irrelevant, such as "0". A set of tasks for converting an enumerated label into a set of binary labels can convert the enumerated label to the positive label and the negative label based on their respective cutoff values. In this example, the cutoff value for each task can be taken from, for example, one of (0, 1, 2, 3, 4}. In a particular task, an enumerated label with a relevance value less than or equal to the cutoff value is converted into a negative label, and an enumerated label with a relevance value greater than the cutoff value is converted into a positive label.

[0046] In an implementation, the set of tasks may include a primary task and at least one auxiliary task. The primary task may refer to the task with a cutoff value that is a boundary value in the plurality of relevance values of the enumerated labels that divides relevance between the query and the document into relevant and irrelevant, and the auxiliary task may refer to a task with a cutoff value that is a value of the plurality of relevance values other than the boundary value. A relevance value of the plurality of relevance values that is less than or equal to the boundary value may indicate that the query is irrelevant to the document, and a relevance value of the plurality of relevance values that is greater than the boundary value may indicate that the query is relevant to the document. For example, for a document copy label, the boundary value may be "0" such that the relevance value "0" indicates that the query is irrelevant to the document, and a relevance value " 1 " or a greater value indicates that the query is relevant to the document. Further, for example, for a landing page label, the boundary value may be " 1 " such that the relevance value "0" and " 1 " indicates that the query is irrelevant to the document, and a relevance value "2" or a greater value indicates that the query is relevant to the document. Table 1 shows an exemplary label division based on a primary task and auxiliary tasks 1-4. In this example, an enumerated label has relevance values of (0, 1, 2, 3, 4, 5} and a boundary value of "0". The cutoff value of the primary task is“0” and can distinguish between labels with relevance values of“0” and labels with relevance values greater than“0”, and the cutoff values of auxiliary tasks 1- 4 are“1”,“ 2", "3", and "4" respectively, and can further distinguish between labels with relevance values greater than "0".

[0047] Table 1 shows, through various tasks, which enumerated labels are converted into the negative label and which enumerated labels are converted into the positive label. For example, the auxiliary task 3 has a cutoff value of "3", and through the auxiliary task 3, enumerated labels with relevance values of (0, 1, 2, 3 } are converted into the negative label "0", and enumerated labels with relevance values of {4, 5} are converted into the positive label " 1 ".

[0048] In another example, the boundary value may be " 1. " In this case, the cutoff value of the primary task is " 1 ". In the primary task, enumerated labels with relevance values of (0, 1 } are converted into the negative label "0", and enumerated labels with relevance values of (2, 3, 4, 5} are converted into the positive label " 1 ". The cutoff values of the auxiliary tasks 1-4 are "0", "2", "3", and "4", respectively. For example, the cutoff value of the auxiliary task 1 can be "0". In the auxiliary task 1, an enumerated label with a relevance value of {0} is converted into the negative label "0", and enumerated labels with relevance values of ( 1, 2, 3, 4, 5} are converted into the positive label " 1 ".

[0049] An enumerated label can be converted into a set of binary labels through the above-mentioned set of tasks including a primary task and auxiliary tasks. In addition, the set of tasks can distinguish between enumerated labels with relevance values greater than "0", so that fine-grained hierarchical labels can be utilized. Take a sample <query w, document k , 2> as an example, where "2" is an enumerated label indicating the relevance between "query m " and "document k". Through the above set of tasks as shown in Table 1, the enumerated label "2" can be converted into a set of binary labels " 1 ", " 1 ", "0" , "0" and "0" corresponding to the primary task and the auxiliary tasks 1-4, respectively. Again, take a sample <query m, document k , 3> as an example, where "3" is an enumerated label indicating the relevance between "query m" and "document k" . Through the above set of tasks as shown in Table 1, the enumerated label "3" can be converted into another set of binary labels " 1 ", " 1 ", " 1 " , "0" and "0" corresponding to the primary task and the auxiliary tasks 1-4, respectively. It can be seen that through the above set of tasks, the enumerated label "2" and the enumerated label "3" can be converted into different two sets of binary labels.

[0050] At 404, after converting an enumerated label in a sample into a set of binary labels through a set of tasks, a set of derived samples can be created by combining the query and the document in the sample and the set of binary labels. Herein, the derived sample refers to a sample that includes at least a query, a document, and a binary label, wherein the binary label is converted from an enumerated label through the task being constructed.

[0051] Continue with the sample <query , document k , 2> as an example. Through the above primary task and auxiliary tasks as shown in Table 1, the enumerated label "2" can be converted into a set of binary labels, that is, " 1 ", " 1 ", "0" , "0" and "0" . By combining "query m" and "document k" and the set of binary labels, a set of derived samples can be created , such as <query m, document k , 1>, <query m, document k , 1>, <query m, document k , 0>, <query m, document k , 0> and <query w, document k , 0>.

[0052] After creating a set of derived samples, the process 400 can in turn use the set of derived samples to train the reference model.

[0053] At 406, the set of derived samples can be scored by using the reference model, respectively, to obtain a set of predicted scores respectively corresponding to the set of derived samples. Here, the predicted score refers to the score provided by the reference model after scoring relevance between the query and the document of each derived sample.

[0054] Continuing with the previous example, the reference model can score the set of derived samples <query w, document k , 1>, <query w, document k , 1>, <query w, document k , 0>, <query w, document k , 0> and <query w, document k , 0> for the sample <query w, document k , 2>, to obtain a corresponding set of predicted scores, for example, denoted as

respectively.

[0055] At 408, a set of prediction losses respectively corresponding to the set of derived samples can be calculated based on the set of binary labels and the set of predicted scores. It should be appreciated that the embodiments of the present disclosure are not limited to any particular manner of calculating the prediction loss.

[0056] Continuing with the previous example, a set of prediction losses Z₀, Z₄, l₂, Z₃and Z₄ respectively corresponding to the set of derived samples <query w, document k , 1>, <query w, document k , 1>, <query w, document k , 0>, <query w, document k , 0> and <query w, document k ,0> can be calculated based on the set ofbinary labels " 1", " 1", "0", "0", and "0", as well as the set of predicted scores s₀, s₄, s₂, s₃ and s₄. For example, taking a second derived sample <query m , document k , 1> in the set of derived samples as an example, the binary label " 1 " of the derived sample and the predicted score s₄ of the derived sample can be used to calculate the prediction loss Z₄ of the derived sample.

[0057] At 410, a comprehensive prediction loss can be generated based on the set of prediction losses.

[0058] In an embodiment, the comprehensive prediction loss may be generated by directly summing each prediction loss of the set of prediction losses.

[0059] In another embodiment, weighting coefficients of each prediction loss of the set of prediction losses may be firstly set, and then the comprehensive prediction loss may be generated by weighted summing the set of prediction losses based on the set weighting coefficients. For example, the weighting coefficients can be set based on the task corresponding to the derived sample. For example, for the primary task, the weighting coefficient can be set to 0.5, and for the auxiliary tasks, the weighting coefficient can be evenly set, for example set to (1-0.5) / n, where n is the number of auxiliary tasks.

[0060] At 412, the reference model can be optimized by minimizing the comprehensive prediction loss.

[0061] It should be appreciated that when more than one reference models are used in the embodiments of the present disclosure, different reference models may be separately trained by the process 400 described above.

[0062] According to an embodiment of the present disclosure, after the reference model is trained, it may score a dataset without labels, such as the second dataset 240 in FIG. 2, to obtain a scored dataset without labels for training the target model, such as the first scored dataset 250 in FIG. 2. The dataset for training the target model may include a plurality of samples having a structure of <query, document, target score>, wherein the target score is provided after the reference model scores relevance between the query and the document in the sample. In an implementation, in order to effectively utilize the target scores of the individual samples provided by the reference model when training the target model, the target scores may be first converted to obtain the derived scores. In this context, the derived score refers to a score that is directly used for training the target model and indicates the relevance between the query and the document in respective sample. In the following discussion, the target score of the z-th sample in the dataset that will be used for training the target model is represented as s and the derived score of the sample is represented as y*.

[0063] In an embodiment, the derived score y_L may be the original value of the target score s as shown in equation (1) below:

Vi = ^si (1)

[0064] In another embodiment, the target score s_t may be converted based on the threshold ti to obtain binary derived scores y, " 1 " and "0", as shown in equation (2):

[0065] When training the target model, the relevance between the query and the document in each sample of the dataset for training the target model can be scored through the target model to obtain a predicted score for each sample. The predicted score of the z-th sample can be represented as ^. In an implementation, the loss of the z-th sample can be calculated as a weighted squared loss, as shown in equation (3) below:

h = Wiiyi - yd² (3) where w_L is the set weight corresponding to the i-th sample when calculating the loss of the target model, where 0>W_j>l . The weight w_L can be set, for example, according to one of the following equations (4)-(6):

Wi = \2 s_t - 1\P (5) w_t = 1 (6) where t₂, t₃ and p are parameters set by the system for calculating the weight w_{t .}

[0066] In an implementation, a dataset including a plurality of samples based on search log data may be scored through at least one reference model, and then the scored dataset is used to train the target model. Since the amount of search log data is large, the scored dataset is able to provide a large amount of training data for training the target model. Although the search log data does not have human provided labels, after being scored by the reference model, each sample may have a target score indicating relevance between a query and a document, and thus these target scores can be utilized to effectively train the target model. The dependency on the human labeled training data can be alleviated by scoring the search log data and using the scored search log data to train the target model. In addition, this approach of scoring the search log data through the reference model can be more accurate than the approach of using user clicks as an alternative to relevance labels.

[0067] According to an embodiment of the present disclosure, after the target model is initially trained, the target model may be further optimized using another dataset scored by the reference model. For example, the reference model can score the dataset with labels, such as the third dataset 270 in FIG. 2, to obtain a scored dataset with labels for optimizing the target model, such as the second scored dataset 280 in FIG. 2. The dataset used to optimize the target model may include a plurality of samples with a structure of <query, document, label, target score>, wherein the label may be a relevance value indicating relevance between the query and the document of the sample, which may be provided previously by humans or provided in other ways, and the target score is provided by the reference model after scoring the relevance between the query and the document of the sample.

[0068] FIG.5 illustrates an exemplary process 500 for optimizing a target model according to an embodiment of the present disclosure. The process 500 may correspond to the operation 290 in FIG. 2.

[0069] At 502, relevance between a query and a document in each sample of a dataset for optimizing the target model can be scored through the target model to obtain a predicted score for the sample. The predicted score of the z-th sample can be represented as

[0070] The process 500 can further calculate a prediction loss corresponding to the sample based on a combination of a label and the target score of the sample and the predicted score. The prediction loss of the z-th sample can be represented

[0071] In an embodiment for calculating a prediction loss, at 504, credibility of the sample can be determined based on whether the combination of the label and the target score of the sample and the predicted score meets a predetermined rule.

[0072] The predetermined rule may at least use the label in the sample as a reference. The predetermined rule may include the predicted score being greater than the target score when the label in the sample indicates that the query is relevant to the document. For example, the predetermined rule indicates that, in the case where the label in the sample indicates that the query is relevant to the document, the predicted score obtained by the target model scoring the relevance between the query and the document in the sample should be as large as possible. Preferably, the predicted score should be greater than the target score provided by the reference model. The predetermined rule may also include the predicted score being less than the target score when the label in the sample indicates that the query is irrelevant to the document. For example, the predetermined rule indicates that, in the case where the label in the sample indicates that the query is irrelevant to the document, the predicted score obtained by the target model scoring the relevance between the query and the document in the sample should be as small as possible. Preferably, the predicted score should be less than the target score provided by the reference model. The sample is determined to be credible when the combination of the label and the target score of the sample and the predicted score meets the predetermined rule described above. Otherwise, the sample is determined to be incredible.

[0073] In an implementation, the label in the sample may be converted into a binary label when determining whether the combination of the label and the target score of the sample and the predicted score meets a predetermined rule. The binary label of the i-th sample can be represented as y_t. For example, the label in the sample can be converted into the binary label through any of the primary task and auxiliary tasks mentioned above.

[0074] In an implementation, during the process of optimizing the target model, in order to effectively utilize the target score of each sample provided by the reference model, the target score may be converted to obtain the derived score. The derived score of the z-th sample can be represented as y_{L .} The target score in each sample may be converted in a similar manner to the way for training the target model, for example, converting the target score into the derived score according to the above equation (1) or (2).

[0075] Subsequently, at 506, a weight corresponding to the sample can be set based on the credibility of the sample. The weight corresponding to the z-th sample can be represented as W_j. In an implementation, the weight is set based on a predetermined criterion, and the predetermined criterion may comprise: a weight corresponding to a credible sample indicated by the credibility being less than or equal to a weight corresponding to an incredible sample indicated by the credibility. In an implementation, for a credible sample z, 0<Wi<l; and for an incredible sample z, w_t= 1.

[0076] In order to facilitate the description of the weight

an embodiment of the present disclosure defines a sign function as shown in the following equation (7):

r6, if x < 0

d_q(c) =

1, if x>0 ^ where 0 < Q < 1 is the hyper parameter set by the system.

[0077] According to an embodiment of the present disclosure, the weight w_L may be defined as the following equation (8):

[0078] It should be appreciated that equations (7), (8) are merely an exemplary form of describing weight . Other forms may also be adopted to describe the weights w_L in accordance with the embodiments of the present disclosure.

[0079] It can be seen that, unlike the weight w on which training the target model is based and that is only relevant to the target score, the weight w_L on which optimizing the target model is based is also relevant to the label. Therefore, the weight w_L on which optimizing the target model is based can also be referred to as a label-aware weigh.

[0080] At 508, the prediction loss can be calculated based on the weight w_{t .} As mentioned above, the prediction loss of the z-th sample can be represented as

In an implementation, the prediction loss l_t can be defined as a weighted squared loss, as shown in equation (9) below:

h = Wi(yi - 9i)² (9)

[0081] At 510, the target model can be optimized by minimizing the prediction loss

[0082] Through the process 500 of FIG. 5, the target model can be optimized using at least both the target score provided by the reference model and the relevance label included in the dataset, so that the performance of the target model can be further improved. The method for optimizing the target model according to the embodiments of the present disclosure is intended to calculate a corresponding prediction loss based on both the target score provided by the reference model and the relevance label included in the dataset, and to optimize the target model by minimizing the prediction loss. When calculating the prediction loss, the relevance label can be used as a reference as described above. Only one implementation for calculating the prediction loss is exemplarily shown in the above steps 504 to 508, under the concept of calculating the prediction loss corresponding to the sample based on the combination of the label and the target score of the sample and the prediction score, the embodiments of the present disclosure may also encompass any other implementation of calculating the prediction loss based on the concept.

[0083] FIG.6 is a flowchart of an exemplary method 600 for training a target model according to an embodiment of the present disclosure.

[0084] At 610, at least one reference model can be trained with a first dataset.

[0085] At 620, a second dataset and a third dataset can be scored through the at least one reference model, respectively.

[0086] At 630, the target model can be trained with the scored second dataset.

[0087] At 640, the target model can be trained with the scored third dataset.

[0088] In an implementation, the first dataset comprises a plurality of samples, each sample comprising at least a query, a document, and an enumerated label indicating relevance between the query and the document, and the training the at least one reference model comprises, for each sample: converting the enumerated label in the sample to a set of binary labels through a set of tasks; creating a set of derived samples by combining the query and the document in the sample and the set of binary labels; and training the at least one reference model with the set of derived samples.

[0089] In an implementation, the set of binary labels includes positive labels indicating that the query is relevant to the document and negative labels indicating that the query is irrelevant to the document, and the set of tasks convert the enumerated label to a positive label or a negative label based on respective cutoff values, respectively.

[0090] In an implementation, the value of the enumerated label is selected from a plurality of relevance values, and the set of tasks includes a primary task and at least one auxiliary task, a cutoff value of the primary task is a boundary value in the plurality of relevance values that divides relevance between the query and the document into relevant and irrelevant, and a cutoff value of the at least one auxiliary task is a value in the plurality of relevance values other than the boundary value, respectively.

[0091] In an implementation, the training the at least one reference model with the set of derived samples comprises: scoring the set of derived samples through the at least one reference model, respectively, to obtain a set of predicted scores respectively corresponding to the set of derived samples; calculating a set of prediction losses respectively corresponding to the set of derived samples based on the set of binary labels and the set of predicted scores; generating a comprehensive prediction loss based on the set of prediction losses; and optimizing the at least one reference model by minimizing the comprehensive prediction loss.

[0092] In an implementation, the second dataset and the third dataset each comprises a plurality of samples, each sample comprising at least a query and a document, and the scoring comprises, for each sample: scoring relevance between the query and the document in the sample through the at least one reference model to obtain at least one initial score of the sample; and generating a target score of the sample based on the at least one initial score.

[0093] In an implementation, the scored third dataset comprises a plurality of samples, each sample comprising a query, a document, a label and a target score, and the optimizing comprises, for each sample: scoring relevance between the query and the document in the sample through the target model to obtain a predicted score of the sample; calculating a prediction loss corresponding to the sample based on a combination of the label and the target score in the sample and the predicted score; and optimizing the target model by minimizing the prediction loss.

[0094] In an implementation, the calculating the prediction loss comprises: determining credibility of the sample based on whether the combination satisfies a predetermined rule; setting a weight corresponding to the sample based on the credibility of the sample; and calculating the prediction loss based on the weight.

[0095] In an implementation, the predetermined rule uses at least the label as a reference.

[0096] In an implementation, the predetermined rules comprise: the predicted score being greater than the target score when the label indicates that the query is relevant to the document; and the predicted score being less than the target score when the label indicates that the query is irrelevant to the document.

[0097] In an implementation, the weight is set based on a predetermined criterion, the predetermined criterion comprising: a weight corresponding to a credible sample indicated by the credibility being less than or equal to a weight corresponding to an incredible sample indicated by the credibility. [0098] In an implementation, the target model is a fast matching model and the at least one reference model is a bottom crossing matching model.

[0099] In an implementation, the at least one reference model has the same model structure or has different model structures.

[00100] In an implementation, the second dataset comprises a plurality of samples that are based on search log data.

[00101] It should be appreciated that the method 600 may further comprise any steps/processes for training the target model according to the embodiments of the present disclosure as mentioned above.

[00102] FIG.7 illustrates an exemplary apparatus 700 for training a target model according to an embodiment of the present disclosure.

[00103] The apparatus 700 may comprise a reference model training module 710, for training at least one reference model with a first dataset; a scoring module 720, for scoring a second dataset and a third dataset through the at least one reference model, respectively; a target model training module 730, for training the target model with the scored second dataset; and an optimizing module 740, for optimizing the target model with the scored third dataset.

[00104] In an implementation, the first dataset comprises a plurality of samples, each sample comprising at least a query, a document, and an enumerated label indicating relevance between the query and the document, and the reference model training module 710 is further configured for, for each sample: converting the enumerated label in the sample to a set of binary labels through a set of tasks; creating a set of derived samples by combining the query and the document in the sample and the set of binary labels; and training the at least one reference model with the set of derived samples.

[00105] In an implementation, the second dataset and the third dataset each comprises a plurality of samples, each sample comprising at least a query and a document, and the scoring module 720 is further configured for, for each sample: scoring relevance between the query and the document in the sample through the at least one reference model to obtain at least one initial score of the sample; and generating a target score of the sample based on the at least one initial score.

[00106] In an implementation, the scored third dataset comprises a plurality of samples, each sample comprising a query, a document, a label and a target score, and the optimizing module 740 is further configured for, for each sample: scoring relevance between the query and the document in the sample through the target model to obtain a predicted score of the sample; calculating a prediction loss corresponding to the sample based on a combination of the label and the target score in the sample and the predicted score; and optimizing the target model by minimizing the prediction loss.

[00107] In an implementation, the calculating the prediction loss comprises: determining credibility of the sample based on whether the combination satisfies a predetermined rule; setting a weight corresponding to the sample based on the credibility of the sample; and calculating the prediction loss based on the weight.

[00108] Moreover, the apparatus 700 may further comprise any other modules configured for training the target model according to the embodiments of the present disclosure as mentioned above.

[00109] FIG.8 illustrates an exemplary apparatus 800 for training a target model according to an embodiment of the present disclosure.

[00110] The apparatus 800 may comprise at least one processor 810. The apparatus 800 may further comprise a memory 820 coupled with the processor 810. The memory 820 may store computer executable instructions that, when executed, cause the processor 810 to perform any operations of the methods for training a target model according to the embodiments of the present disclosure as mentioned above.

[00111] The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for training a target model according to the embodiments of the present disclosure as mentioned above.

[00112] It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.

[00113] It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

[00114] Processors are described in connection with various apparatus and methods. These processors can be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), state machine, gate logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or other suitable platforms.

[00115] Software should be considered broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software applications, software packages, routines, subroutines, objects, running threads, processes, functions, and the like. Software can reside on computer readable medium. Computer readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).

[00116] The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skilled in the art are intended to be encompassed by the claims.

Claims

1. A method for training a target model, comprising:

training at least one reference model with a first dataset;

scoring a second dataset and a third dataset through the at least one reference model, respectively;

training the target model with the scored second dataset; and

optimizing the target model with the scored third dataset.

2. The method of claim 1, wherein the first dataset comprises a plurality of samples, each sample comprising at least a query, a document, and an enumerated label indicating relevance between the query and the document, and the training the at least one reference model comprises, for each sample:

converting the enumerated label in the sample to a set of binary labels through a set of tasks;

creating a set of derived samples by combining the query and the document in the sample and the set of binary labels; and

training the at least one reference model with the set of derived samples.

3. The method of claim 2, wherein:

the set of binary labels includes positive labels indicating that the query is relevant to the document and negative labels indicating that the query is irrelevant to the document, and the set of tasks convert the enumerated label to a positive label or a negative label based on respective cutoff values, respectively.

4. The method of claim 3, wherein:

a value of the enumerated label is selected from a plurality of relevance values, and the set of tasks includes a primary task and at least one auxiliary task, a cutoff value of the primary task is a boundary value in the plurality of relevance values that divides relevance between the query and the document into relevant and irrelevant, and a cutoff value of the at least one auxiliary task is a value in the plurality of relevance values other than the boundary value, respectively.

5. The method of claim 2, wherein the training the at least one reference model with the set of derived samples comprises:

scoring the set of derived samples through the at least one reference model, respectively, to obtain a set of predicted scores respectively corresponding to the set of derived samples;

calculating a set of prediction losses respectively corresponding to the set of derived samples based on the set of binary labels and the set of predicted scores;

generating a comprehensive prediction loss based on the set of prediction losses; and optimizing the at least one reference model by minimizing the comprehensive prediction loss.

6. The method of claim 1, wherein the second dataset and the third dataset each comprises a plurality of samples, each sample comprising at least a query and a document, and the scoring comprises, for each sample:

scoring relevance between the query and the document in the sample through the at least one reference model to obtain at least one initial score of the sample; and

generating a target score of the sample based on the at least one initial score.

7. The method of claim 1, wherein the scored third dataset comprises a plurality of samples, each sample comprising a query, a document, a label and a target score, and the optimizing comprises, for each sample:

scoring relevance between the query and the document in the sample through the target model to obtain a predicted score of the sample;

calculating a prediction loss corresponding to the sample based on a combination of the label and the target score in the sample and the predicted score; and

optimizing the target model by minimizing the prediction loss.

8. The method of claim 7, wherein the calculating the prediction loss comprises: determining credibility of the sample based on whether the combination satisfies a predetermined rule;

setting a weight corresponding to the sample based on the credibility of the sample; and

calculating the prediction loss based on the weight.

9. The method of claim 8, wherein the predetermined rule uses at least the label as a reference.

10. The method of claim 8, wherein the predetermined rule comprises:

the predicted score being greater than the target score when the label indicates that the query is relevant to the document; and

the predicted score being less than the target score when the label indicates that the query is irrelevant to the document.

11. The method of claim 8, wherein the weight is set based on a predetermined criterion, the predetermined criterion comprising: a weight corresponding to a credible sample indicated by the credibility being less than or equal to weight corresponding to an incredible sample indicated by the credibility.

12. The method of claim 1, wherein the target model is a fast matching model and the at least one reference model is a bottom crossing matching model.

13. The method of claim 1, wherein the second dataset comprises a plurality of samples that are based on search log data.

14. An apparatus for training a target model, comprising:

a reference model training module, for training at least one reference model with a first dataset;

a scoring module, for scoring a second dataset and a third dataset through the at least one reference model, respectively;

a target model training module, for training the target model with the scored second dataset; and

an optimizing module, for optimizing the target model with the scored third dataset.

15. A device for training a target model, comprising:

at least one processor; and

a memory storing computer executable instructions that, when executed, cause the at least one processor to:

train at least one reference model with a first dataset;

score a second dataset and a third dataset through the at least one reference model, respectively;

train the target model with the scored second dataset; and

optimize the target model with the scored third dataset.