WO2021146003A1 - Providing QA training data and training a QA model based on implicit relevance feedbacks

Info

Publication number
WO2021146003A1
Authority
WO
WIPO (PCT)
Prior art keywords
relevance
question
passage
training data
implicit
Application number
PCT/US2020/064971
Other languages
French (fr)
Inventor
Ming GONG
Linjun SHOU
Feixiang CHENG
Daxin Jiang
Original Assignee
Microsoft Technology Licensing, LLC
Application filed by Microsoft Technology Licensing, LLC
Publication of WO2021146003A1

Classifications

    • G06F16/9536: Search customisation based on social or collaborative filtering
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F16/9532: Query formulation
    • G06F16/9535: Search customisation based on user profiles and personalisation

Abstract

The present disclosure provides methods and apparatuses for providing QA training data and training a QA model based on implicit relevance feedbacks. A question-passage pair and corresponding user behaviors may be obtained from a search log. Behavior features may be extracted from the user behaviors. A relevance score between the question and the passage may be determined, through an implicit relevance feedback model, based on the behavior features. A relevance label may be added to the question-passage pair based on the relevance score. The QA model may be pre-trained with the obtained auto-labelled QA training data, and the pre-trained QA model may be fine-tuned with human-labelled QA training data.

Description

PROVIDING QA TRAINING DATA AND TRAINING A QA MODEL BASED ON
IMPLICIT RELEVANCE FEEDBACKS
BACKGROUND
[0001] A search engine may provide search results for a user query in a search result page (SERP). A traditional search result includes a link to the most relevant web document with respect to the user query. Herein, the web document may also be referred to as, e.g., a web page. The link may refer to a hyperlink, a web address, a URL, etc. In recent years, some web search engines have begun to further provide a question-answering (QA) service in SERPs, which is also referred to as a web QA service. For example, if a query has question intent, a web search engine will extract the most relevant passage from a web document to answer the user's question, and place the passage within an individual QA block in a SERP. The passage may refer to one or more sentences, one or more paragraphs, an abstract, etc., extracted from the corresponding web document. The QA service is becoming more and more popular among search engine users because it may save user operations such as clicking on links to web documents, browsing the web documents and looking for answers.

SUMMARY
[0002] This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0003] Embodiments of the present disclosure provide methods and apparatuses for providing QA training data and training a QA model based on implicit relevance feedbacks. A question-passage pair and corresponding user behaviors may be obtained from a search log. Behavior features may be extracted from the user behaviors. A relevance score between the question and the passage may be determined, through an implicit relevance feedback model, based on the behavior features. A relevance label may be added to the question-passage pair based on the relevance score. The QA model may be pre-trained with the obtained auto-labelled QA training data, and the pre-trained QA model may be fine-tuned with human-labelled QA training data.
[0004] It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
[0006] FIG.1 illustrates an exemplary search result page.
[0007] FIG.2 illustrates an exemplary process for providing QA training data according to an embodiment.
[0008] FIG.3 illustrates an exemplary process for providing QA training data based on a label aggregation strategy according to an embodiment.
[0009] FIG.4 illustrates an exemplary process for providing QA training data based on a score aggregation strategy according to an embodiment.
[0010] FIG.5 illustrates an exemplary process for providing QA training data based on a feature aggregation strategy according to an embodiment.
[0011] FIG.6 illustrates an exemplary process for training a QA model according to an embodiment.
[0012] FIG.7 illustrates a flowchart of an exemplary method for providing QA training data based on implicit relevance feedbacks according to an embodiment.
[0013] FIG.8 illustrates a flowchart of an exemplary method for training a QA model based on implicit relevance feedbacks according to an embodiment.
[0014] FIG.9 illustrates an exemplary apparatus for providing QA training data based on implicit relevance feedbacks according to an embodiment.
[0015] FIG.10 illustrates an exemplary apparatus for training a QA model based on implicit relevance feedbacks according to an embodiment.
[0016] FIG.11 illustrates an exemplary apparatus for providing QA training data and/or training a QA model based on implicit relevance feedbacks according to an embodiment.
DETAILED DESCRIPTION
[0017] The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
[0018] Web QA needs to accurately determine passage relevance, e.g., identifying the relevance between a passage and a question, which decides whether a passage is able to answer a given question. Some traditional approaches adopt linguistic rules or patterns. Such rule-based approaches may handle some known search situations. To achieve web-scale, open-domain question answering, machine learning models may also be used for determining passage relevance. For example, in recent years, deep learning models that are based on deep neural networks have also been used for determining passage relevance. However, a challenge for the machine learning models is the requirement for large amounts of training data, wherein each training data example includes a question-passage pair and a corresponding relevance label. Usually, the model size, in terms of the number of parameters, will increase according to the complexity of the target task, and the size of the required training data will also increase in proportion to the model size. In particular, the number of parameters of a deep learning model for passage relevance determination in web QA is extremely large, and thus more training data is required. Moreover, a search engine often provides services in multiple countries using various languages. It is unrealistic to manually label a large amount of training data for each language. Besides the huge labeling cost, the quality of labels is also a concern. In particular, labels for some professional queries may not be reliable. Therefore, how to collect a large amount of high-quality training data in different languages is a problem to be solved for web QA.
[0019] One way to collect QA training data is asking users to provide explicit feedbacks on the relevance of passages. Explicit feedbacks on the relevance of passages refer to a user taking an action to proactively express his/her satisfaction with the search results, e.g., proactively expressing whether he/she thinks the provided passage is indeed relevant to the question or answers the question. As an example, a feedback link or voting button may be presented at the same time as a passage is provided in a SERP, so that a user may explicitly submit a feedback on the provided passage. Explicit relevance feedbacks may be used for forming labels in the QA training data that indicate relevance between questions and passages. However, in practical applications, only a few search engine users would try to send explicit relevance feedbacks. Moreover, users usually tend to send negative feedbacks, e.g., pointing out that a passage has a low relevance to a question, and rarely send positive feedbacks, e.g., pointing out that a passage has a high relevance to a question, thus resulting in significantly more negative examples than positive examples in the labelled data. To form balanced QA training data, it is required to take almost equal amounts of positive and negative examples from the skewed label distribution, which further reduces the amount of valid training data that may be derived from the explicit relevance feedbacks. Consequently, the explicit relevance feedbacks cannot be used for effectively collecting QA training data. Moreover, the explicit relevance feedbacks may also disturb users in their interactions with search engines.
[0020] In the scenario of ranking web documents or web pages by search engines, it has been proposed to utilize users' implicit relevance feedbacks for web documents to determine document relevance, e.g., identifying relevance between a web document and a query, and thus collect training data. The implicit relevance feedbacks for web documents refer to inference of user satisfaction according to user behaviors on web documents in searching and/or browsing sessions without increasing the burden on search engine users. The collection cost of implicit relevance feedbacks for web documents is relatively low, the quantity is large, and the burden on users is not increased. Various features for mining implicit relevance feedbacks for Web documents from user behaviors have been proposed, e.g., click information, average dwell time, number of page visits, etc.
[0021] However, determining web document relevance is different from determining passage relevance. Document-level user behaviors cannot simply be used for inferring passage relevance. Take the click behavior as an example: if a user clicks on a web page in a SERP, this usually indicates high relevance of the web page to the user query, while if a web page is not clicked by the user, this usually indicates low relevance of the web page to the user query. In other words, for a web document, there is a strong correlation between user click behavior and document relevance. However, for a passage, this may not be the case. For example, assuming that a user question is "What's the normal body temperature for a child?" and a passage about body temperatures of children is provided in a SERP, since the provided passage already contains the information that the user wants to obtain, the user may not perform any further click operation. For example, assuming that a user question is "What's the normal body temperature for an adult?" and a passage about body temperatures of children is provided in a SERP, the information in this passage may not accurately match the user question. Since the user may want to explore more information in a source page from which the passage is extracted, the user may click on a link to the source page and read more content in that page. The above examples reveal a unique characteristic of the QA scenario: the content of a passage is already presented to a user in a QA block. Therefore, in the case that the passage content contains the information that the user requires, the user may not need to click on a link to a source page or other web page links to get an answer, but in the case that the passage content does not contain a satisfactory answer, the user may instead perform further operations.
[0022] In addition, in a SERP, the number of QA blocks also differs from the number of web documents. Given a user question, a search engine can usually return a series of web document links in a SERP, but only return a single QA block. Most existing click models leverage relevance ranking orders of documents to gain more reliable implicit relevance feedbacks. However, this approach cannot be applied to a single QA block.
[0023] Therefore, the existing approach of mining implicit relevance feedbacks for web documents from users cannot be effectively applied to the web QA scenario.
[0024] Embodiments of the present disclosure propose to mine implicit relevance feedbacks for web QA from user behaviors. Different types of behaviors performed by users on SERPs may be considered, e.g., click behavior, re-query behavior, browsing behavior, etc. An implicit relevance feedback model may be used for mining implicit relevance feedbacks for web QA from user behaviors. The implicit relevance feedback model may predict a relevance score of a question-passage pair based on behavior features extracted from user behaviors. In order to reduce influences by the randomness of individual users and individual actions, the implicit relevance feedback model may also adopt different aggregation strategies to aggregate behaviors of a large number of different users. Relevance scores provided by the implicit relevance feedback model may form labels for indicating relevance between questions and passages. A question-passage pair with a relevance label may be used as a QA training data example and added to a QA training data set. Benefiting from the use of the implicit relevance feedback model, the provided relevance labels have high accuracy.
[0025] The embodiments of the present disclosure may obtain a large number of question-passage pairs and corresponding user behaviors from search logs, and automatically add relevance labels to the question-passage pairs based on the user behaviors. Accordingly, a large-scale QA training data set that is formed based on implicit relevance feedbacks may be provided. The formed QA training data set may then be used for training a QA model, which may also be referred to as a QA relevance model. For example, the QA training data set may be used for pre-training the QA model in a weak supervision approach.
[0026] Through the embodiments of the present disclosure, a large amount of auto-labelled QA training data may be provided, and the labels of the QA training data have high accuracy, which may facilitate training a QA model with higher performance. Moreover, since the addition of labels is based on user behaviors and is not restricted by languages, the embodiments of the present disclosure may also easily establish QA training data in different languages.
[0027] FIG.1 illustrates an exemplary search result page (SERP) 100. The SERP 100 may be presented to a user in a user interface by a search engine in response to the user's question. Components in the SERP 100 may be exemplarily divided into a search block 110, a QA block 120, a relevant question block 130, a web page link block 140, etc. Here, the blocks are only different logical divisions of the components in the SERP 100, and in terms of display and function, different blocks and components therein may be independent from or combined with each other.
[0028] In the search block 110, the user may enter a question or query, e.g., "summer flu treatment".
[0029] In response to determining that the user input in the search block 110 has question intent, the search engine may provide the QA block 120 in the SERP 100. The QA block 120 may include, e.g., a passage 122 for answering the user question, an extension option 124 of the passage 122, a source page link 126 of the passage 122, etc. The passage 122 is content that is extracted from a web document and is most relevant to the user question. For example, in FIG.1, the passage 122 may include multiple tips for treating summer cold. Due to the limitation of display size of a page, the passage 122 may only be partially displayed. In this case, the user may click on the extension option 124, e.g., a "More items" link, to view the hidden parts of the passage 122. The source page link 126 is a hyperlink to a source page or a source web document from which the passage 122 is extracted. When the user clicks on the source page link 126, the source page of the passage 122 may be presented in the user interface. Moreover, optionally, the SERP 100 may further include a feedback button or link 128 for collecting explicit relevance feedbacks provided by the user for the passage 122. For example, when the user clicks on the feedback button or link 128, a feedback page or feedback options may be presented, so that the user may provide a feedback as to whether the current passage 122 satisfactorily answers the question. The feedback button or link 128 may be presented within or outside the QA block 120.
[0030] The relevant question block 130 may include questions relevant to or similar to the user question in the search block 110. These relevant questions may include, e.g., questions frequently searched by other users. In FIG.1, multiple questions relevant to the user question "summer flu treatment" are shown in the relevant question block 130, e.g., "What causes summer flu?", "Medicines for summer flu?", etc. When the user clicks on a relevant question, the search engine may initiate a search for the clicked relevant question and present a corresponding SERP in the user interface.
[0031] The web page link block 140 includes hyperlinks to web pages or web documents relevant to the user question in the search block 110. The web page links in the web page link block 140 may be ranked by the search engine based on document relevance. When the user clicks on a web page link, the web page may be presented in the user interface.
[0032] It should be understood that all the blocks and components in the SERP 100 in FIG.1 are exemplary, and according to specific designs and application requirements, the SERP 100 may include more or fewer blocks and components, and these blocks and components may be laid out and presented in any other approaches.
[0033] FIG.2 illustrates an exemplary process 200 for providing QA training data according to an embodiment. The process 200 may be performed to form a training data set for training or improving a QA model, based on implicit relevance feedbacks for web QA mined from user behaviors.
[0034] A QA system 210 may be deployed in a search engine to provide a web QA service. The QA system 210 may obtain questions input by users and answer the questions through the QA model 212. For example, the QA model 212 may provide a passage within a QA block in a SERP in response to a user's question.
[0035] There may be interactions between the QA system 210 and a large number of search engine users, and information relevant to the interactions may be stored in a search log 220. The search log 220 may include a plurality of information items from historical usages of a large number of users. Each information item may correspond to an impression. Herein, an impression may refer to a presentation of search results for a user's question or query, e.g., the SERP 100 in FIG.1. Assuming that the QA system 210 receives a question q from a user u, the QA system 210 may provide the user with an impression i in the user interface in response to the question q, and the impression i includes at least a passage p for answering the question q. The user u may perform some operations or not perform any operation on the impression i. Accordingly, the information item corresponding to the impression i may include the question q, the passage p, a group of user behaviors reflecting the user's operation conditions, etc.
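[Editorial illustration, not part of the original disclosure.] As a minimal Python sketch of how such an information item might be represented, the following record type is introduced; the class and field names are assumptions of this sketch and do not appear in the disclosure:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ImpressionLogItem:
    """One search-log information item, corresponding to one impression i."""
    question: str                      # the user question q
    passage: str                       # the passage p shown in the QA block
    user_id: str                       # the user u who received the impression
    behaviors: List[str] = field(default_factory=list)  # e.g. ["AnswerClick", "Requery"]
    serp_dwell_time: float = 0.0       # seconds the user dwelled on the SERP
    next_page_dwell_time: float = 0.0  # seconds spent on the next presented page
```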
[0036] The process 200 aims to mine implicit relevance feedbacks from user behaviors included in the information items in the search log 220, so as to automatically add relevance labels to question-passage pairs.
[0037] As an example, FIG.2 shows that an exemplary question-passage pair 222 and multiple groups of user behaviors 224 corresponding to multiple impressions of the question-passage pair 222 are extracted from the information items in the search log 220. Here, the multiple impressions of the question-passage pair 222 may refer to multiple impressions respectively presented to different users, which include, e.g., the same question in the question-passage pair 222, the same passage in the question-passage pair 222, etc. Moreover, it should be understood that since the process 200 aims to mine implicit relevance feedbacks, the user behaviors 224 may refer to those behaviors from the users corresponding to implicit relevance feedbacks for web QA. The user behaviors 224 may be classified into different types, e.g., a Click behavior type, a Re-query behavior type, a Browsing behavior type, etc. The Click behavior type may include various behaviors relevant to a "click" operation. In an implementation, user behaviors belonging to the Click behavior type may be further classified into different click behavior subtypes, e.g., an Answer Click subtype, an Answer Expansion Click subtype, an Outside Answer Click subtype, a Related Click subtype, etc. The Answer Click refers to a click on a source page link of a passage, e.g., a click on the source page link 126 in FIG.1. The Answer Expansion Click refers to a click on an extension option of a passage so as to display hidden parts of the passage, e.g., a click on the extension option 124 in FIG.1. The Outside Answer Click refers to a click on a web page link in the SERP other than a source page link of a passage, e.g., a click on a web page link in the web page link block 140 in FIG.1. The Related Click refers to a click on a relevant question, e.g., a click on a relevant question in the relevant question block 130 in FIG.1. The Re-query behavior type may include a behavior that involves query reformulation, e.g., the user may modify the original query or question and issue a new query or question to the search engine. The Browsing behavior type may include a behavior that involves reading a passage or any other content in the SERP by the user without causing any input. It should be understood that the embodiments of the present disclosure are not limited to any of the user behaviors and user behavior types described above, but may include more or fewer user behaviors and user behavior types.
[0038] At 230, behavior feature extraction may be performed on the user behaviors 224, so as to obtain behavior features 232 associated with the question-passage pair 222. For example, the user behaviors 224 may be converted into Boolean features or numerical features.
[0039] In an implementation, the behavior features 232 may include raw behavior features extracted from a group of user behaviors corresponding to each impression, and the group of user behaviors is behaviors generated by a single user for the impression. Names, behavior types, descriptions, etc. of some exemplary raw behavior features are shown in Table 1 below.
[Table 1, presented as an image in the original document, lists the names, behavior types and descriptions of exemplary raw behavior features such as AnswerClick, AnswerSatClick, AnswerClickOnly, OTAnswerClick, SERPDwellTime, NoClick, Abandonment and HasRF.]

[0040] The "Description" part of Table 1 gives the meanings of the corresponding raw behavior features, and gives the values of the raw behavior features in different user behavior situations. It should be understood that the embodiments of the present disclosure may adopt some or all of the raw behavior features listed in Table 1, or adopt any other raw behavior features, and are not limited to the value setting approaches listed in Table 1. Moreover, "satisfactorily clicking" (SatClick) in Table 1 may refer to a click behavior after which the dwell time on the next presented page is greater than or equal to a predetermined threshold. Moreover, the raw behavior feature "Abandonment" may refer to the situation where the user dwells on the SERP for a period of time to browse the SERP, but ends the search without performing a click behavior. Compared with the raw behavior feature "NoClick", the dwell time of "Abandonment" on the SERP is longer than that of "NoClick" on the SERP.
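[Editorial illustration.] To make the behavior feature extraction at 230 concrete, here is a hedged Python sketch that converts one user's behaviors on one impression into Boolean/numeric raw features, reusing the hypothetical ImpressionLogItem above. Since Table 1 is only available as an image, the feature names are taken from the surrounding text and the exact value rules are assumptions:

```python
# Assumed threshold defining a SatClick; the disclosure only says "a predetermined threshold".
SAT_DWELL_THRESHOLD = 30.0  # seconds

CLICK_BEHAVIORS = {"AnswerClick", "AnswerExpansionClick", "OTAnswerClick", "RelatedClick"}

def extract_raw_features(item: ImpressionLogItem) -> dict:
    """Convert one user's behaviors on one impression into raw behavior features."""
    behaviors = set(item.behaviors)
    answer_click = int("AnswerClick" in behaviors)
    ot_answer_click = int("OTAnswerClick" in behaviors)
    no_click = int(not (behaviors & CLICK_BEHAVIORS))
    return {
        "AnswerClick": answer_click,
        "AnswerSatClick": int(answer_click and item.next_page_dwell_time >= SAT_DWELL_THRESHOLD),
        "OTAnswerClick": ot_answer_click,
        "AnswerClickOnly": int(answer_click and not ot_answer_click),
        "NoClick": no_click,
        "Abandonment": int(no_click and item.serp_dwell_time > 0),  # browsed, then left without clicking
        "HasRF": int("Requery" in behaviors),   # re-query (query reformulation)
        "SERPDwellTime": item.serp_dwell_time,  # numeric feature, in seconds
    }
```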
[0041] In an implementation, the behavior features 232 may include aggregated behavior features extracted from a plurality of groups of user behaviors. For example, the N impressions of the question-passage pair 222 may be provided to a plurality of users, and these users may have performed their own group of behaviors on these impressions. By aggregating the behaviors of these users, the influences by the randomness of individual users and individual behaviors may be avoided.
[0042] The embodiments of the present disclosure define a click-through rate (CTR) for a component in a SERP, wherein the component may be a passage, a source page link to a passage, an extension option of a passage, a relevant question, a web page link, etc., or any other part of the SERP. For example, the CTR may be calculated as:

$$\mathrm{CTR} = \frac{N_{click}}{N_{impression}}$$   Equation (1)

wherein $N_{impression}$ denotes the total number of impressions of the component, e.g., how many impressions in total in which the component is presented, and $N_{click}$ denotes the number of clicks on the component in the impressions.
[0043] Moreover, the embodiments of the present disclosure further define a satisfied click-through rate (SatCTR). For example, SatCTR may be calculated as:
$$\mathrm{SatCTR} = \frac{N_{satclick}}{N_{impression}}$$   Equation (2)

wherein $N_{satclick}$ denotes the number of satisfied clicks (SatClick) on the component.
[0044] Names, behavior types, descriptions, etc. of some exemplary aggregated behavior features are shown in Table 2 below.
[Table 2, presented as an image in the original document, lists the names, behavior types and descriptions of exemplary aggregated behavior features, e.g., CTR-, SatCTR- and rate-style features such as RFRate.]
[0045] In Table 2, "rate" may refer to a ratio of the number of occurrences of a corresponding behavior to the total number of impressions. Taking the RFRate as an example, this aggregated behavior feature refers to a ratio of the number of re-queries that are performed to the total number of impressions. The "Description" part of Table 2 gives the meanings of the corresponding aggregated behavior features, and gives the calculation of the aggregated behavior features in different user behavior situations. It should be understood that the embodiments of the present disclosure may adopt some or all of the aggregated behavior features listed in Table 2, or adopt any other aggregated behavior features, and are not limited to the calculation approaches listed in Table 2.
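[Editorial illustration.] As a sketch of how such aggregated behavior features could be computed from the per-impression raw features above, the following Python function implements rates in the spirit of Equations (1) and (2); the output names are illustrative, since Table 2 is only available as an image:

```python
def aggregate_behavior_features(raw_feature_dicts: list) -> dict:
    """Aggregate raw features over all impressions of one question-passage pair."""
    n = len(raw_feature_dicts)  # N_impression: total number of impressions

    def rate(name: str) -> float:
        # ratio of the number of occurrences of a behavior to the total number of impressions
        return sum(f[name] for f in raw_feature_dicts) / n

    return {
        "AnswerCTR": rate("AnswerClick"),          # Equation (1) for the source page link
        "AnswerSatCTR": rate("AnswerSatClick"),    # Equation (2)
        "OTAnswerCTR": rate("OTAnswerClick"),
        "RFRate": rate("HasRF"),                   # re-queries / total impressions
        "AbandonRate": rate("Abandonment"),
        "AvgSERPDwellTime": rate("SERPDwellTime"), # mean SERP dwell time
    }
```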
[0046] In the process 200, the behavior features 232 may be further used for determining the relevance between the question and the passage in the question-passage pair 222. For example, a pre-trained implicit relevance feedback model 240 may be used for determining a relevance score 242 between the question and the passage based on the behavior features 232.
[0047] The implicit relevance feedback model 240 aims to mine users' implicit relevance feedbacks for QA from a set of behavior features. The architecture of the implicit relevance feedback model 240 may be based on various technologies, e.g., logistic regression, decision tree, random forest, gradient boosting decision tree, etc. Training data for training the implicit relevance feedback model 240 may take the form of, e.g., <question, passage, behavior features, label>, wherein the "behavior features" are obtained for a question-passage pair composed of the "question" and the "passage", and the "label" refers to an artificial label of the relevance between the "question" and the "passage". A training objective of the implicit relevance feedback model 240 is to enable a relevance score predicted by the model based on the behavior features to fit the artificial label. The implicit relevance feedback model 240 may adopt different aggregation strategies for aggregating user behaviors. Various aggregation strategies will be discussed in detail later in conjunction with FIG.3 to FIG.5.
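[Editorial illustration.] A minimal sketch of fitting such a model, using the gradient boosting decision tree option named above via scikit-learn; the data interfaces are assumptions of this sketch, not a prescribed implementation:

```python
from sklearn.ensemble import GradientBoostingClassifier

def train_feedback_model(feature_vectors, artificial_labels):
    """Fit the implicit relevance feedback model on behavior feature vectors and
    artificial relevance labels, so that its predicted score fits the labels."""
    model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    model.fit(feature_vectors, artificial_labels)
    return model

# The predicted probability of the positive class can serve as the relevance score:
# score = model.predict_proba([x])[0][1]
```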
[0048] A relevance label 250 may be formed based on the relevance score 242, which indicates the relevance between the question and the passage in the question-passage pair 222.
[0049] In an implementation, the relevance label 250 may be a Boolean label converted from the relevance score 242. The relevance label 250 may be generated according to the following equations:
$$score_{<q,p>} = F_{FeedbackModel}(x_1, x_2, \ldots, x_m)$$   Equation (3)

$$label_{<q,p>} = \begin{cases} 1, & score_{<q,p>} \geq t_1 \\ 0, & score_{<q,p>} \leq t_2 \end{cases}$$   Equation (4)

wherein $F_{FeedbackModel}(\cdot)$ denotes the implicit relevance feedback model 240, $x_i$ denotes a feature in the behavior features 232, $m$ denotes the number of features in the behavior features 232, $score_{<q,p>}$ denotes the relevance score 242, $label_{<q,p>}$ denotes the relevance label 250, and $t_1$ and $t_2$ are preset thresholds.
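[Editorial illustration.] A minimal Python sketch of this thresholding, assuming example values for t1 and t2; how scores falling between the two thresholds are handled is an assumption of the sketch, since the original equations are only available as an image:

```python
T1, T2 = 0.7, 0.3  # assumed example values for the preset thresholds t1 and t2

def score_to_label(score: float):
    """Boolean label in the spirit of Equation (4)."""
    if score >= T1:
        return 1      # high relevance
    if score <= T2:
        return 0      # low relevance
    return None       # ambiguous score; such a pair may simply be skipped (assumption)
```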
[0050] It should be understood that the embodiments of the present disclosure are not limited to the approach of forming the relevance label 250 from the relevance score 242 discussed above, but may adopt any other approaches. For example, instead of only setting the label to the values of 0 and 1, the label may be set to more possible values. For example, instead of converting the relevance score into a discrete integer value, the relevance score may also be directly used as a label.
[0051] The relevance label 250 may be associated with the question-passage pair 222 to form a QA training data example. This QA training data example may be added to an auto-labelled training data set 260 which is used for training the QA model.
[0052] In the process 200, the implicit relevance feedback model 240 is able to mine users' implicit relevance feedbacks for QA from the set of behavior features. Different behavior features and/or their combinations may make different contributions to the mining of the implicit relevance feedbacks. For example, the behavior feature SERPDwellTime indicates the duration of dwelling on a SERP by a user. Since the content of a passage is presented in a QA block in the SERP as an answer to the user's question, the SERPDwellTime may be a good indicator for the relevance between the passage and the question. For example, as discussed above, the behavior features AnswerClick and AnswerSatClick may have lower significance for the determination of passage relevance. Moreover, combinations of values of different behavior features may be more helpful for the determination of passage relevance. For example, when SERPDwellTime is long and NoClick=1, i.e., when the SERP is abandoned, the passage may have high relevance, because the user may have obtained the required information after browsing the passage just for a while. For example, when AnswerClick=0 and OTAnswerClick=1, this is usually a significant indication that the passage has low relevance, because the user may not be satisfied with the answer by the passage and clicks on other web page links. For example, when AnswerClickOnly=1 and SERPDwellTime is long, this is usually a positive sign of the passage relevance, because the displayed passage content may not fully answer the user's question and thus the user clicks on the source page link of the passage for further viewing. For example, if NoClick=1 and HasRF=1, this may indicate that the passage is not relevant to the user's question, and thus the user modifies the question to further express his requirements.
[0053] The process 200 may be performed for each question-passage pair and corresponding user behaviors retrieved from the search log 220. Since the search log 220 may include a large number of question-passage pairs and corresponding user behaviors coming from actual application scenarios, a large amount of auto-labelled QA training data may be provided through the process 200. The QA training data may be used for training or improving a QA model. For example, the training data set 260 may be used for training a QA model which is deployed in the QA system 210. For example, the training data set 260 may be used for improving the QA model 212.
[0054] FIG.3 illustrates an exemplary process 300 for providing QA training data based on a label aggregation strategy according to an embodiment. The process 300 is an exemplary specific implementation of the process 200 in FIG.2, and the same reference numerals in FIG.3 and FIG.2 refer to the same processing steps or information. In the process 300, the implicit relevance feedback model may adopt a label aggregation strategy for aggregating user behaviors. The label aggregation strategy may refer to predicting a corresponding relevance score for each of a plurality of impressions of a question-passage pair and forming a corresponding relevance label, and then combining the relevance labels of these impressions into a final relevance label.
[0055] The raw behavior features corresponding to different impressions may be extracted from the user behaviors 224 through the behavior feature extraction at 230, e.g., raw behavior features 332-1 corresponding to impression 1, raw behavior features 332-2 corresponding to impression 2, ..., raw behavior features 332-n corresponding to impression n, wherein n is the number of the impressions of the question-passage pair 222 recorded in the search log 220.
[0056] The implicit relevance feedback model 240 may be trained for generating an initial relevance score corresponding to each impression based on raw behavior features of the impression. For example, an initial relevance score 342-1 corresponding to the impression 1 is generated based on the raw behavior features 332-1, an initial relevance score 342-2 corresponding to the impression 2 is generated based on the raw behavior features 332-2, ..., an initial relevance score 342-n corresponding to the impression n is generated based on the raw behavior features 332-n.
[0057] In the process 300, an initial relevance label corresponding to each impression may be further formed according to an initial relevance score of the impression. For example, an initial relevance label 344-1 corresponding to the impression 1 is generated based on the initial relevance score 342-1, an initial relevance label 344-2 corresponding to the impression 2 is generated based on the initial relevance score 342-2, ..., an initial relevance label 344-n corresponding to the impression n is generated based on the initial relevance score 342-n. The approach of forming an initial relevance label from an initial relevance score is similar to the approach of forming a relevance label from a relevance score described above in conjunction with FIG.2.
[0058] According to the process 300, a plurality of initial relevance labels corresponding to a plurality of impressions may be combined into a final relevance label. For example, the initial relevance label 344-1, the initial relevance label 344-2, ..., the initial relevance label 344-n may be combined into the final relevance label 250. The labels may be combined in various approaches. For example, voting may be performed among a plurality of initial relevance labels, and a value that gets the most votes is considered as the final relevance label.
[0059] The relevance label 250 and the question-passage pair 222 are added to the auto-labelled training data set 260 as a QA training data example.
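[Editorial illustration.] A minimal sketch of the voting-based combination, treating unlabelled impressions as abstentions (an assumption of this sketch):

```python
from collections import Counter

def aggregate_labels(initial_labels: list):
    """Label aggregation strategy: majority vote over per-impression initial labels."""
    votes = Counter(label for label in initial_labels if label is not None)
    return votes.most_common(1)[0][0] if votes else None
```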
[0060] FIG.4 illustrates an exemplary process 400 for providing QA training data based on a score aggregation strategy according to an embodiment. The process 400 is an exemplary specific implementation of the process 200 in FIG.2, and the same reference numerals in FIG.4 and FIG.2 refer to the same processing steps or information. In the process 400, the implicit relevance feedback model may adopt a score aggregation strategy for aggregating user behaviors. The score aggregation strategy may refer to predicting a corresponding relevance score for each of a plurality of impressions of a question-passage pair, then combining the relevance scores of these impressions into a final relevance score, and finally forming a relevance label based on the final relevance score.
[0061] The raw behavior features corresponding to different impressions may be extracted from the user behaviors 224 through the behavior feature extraction at 230, e.g., raw behavior features 432-1 corresponding to impression 1, raw behavior features 432-2 corresponding to impression 2, ..., raw behavior features 432-n corresponding to impression n, wherein n is the number of the impressions of the question-passage pair 222 recorded in the search log 220.
[0062] The implicit relevance feedback model 240 may be trained for generating an initial relevance score corresponding to each impression based on raw behavior features of the impression. For example, an initial relevance score 442-1 corresponding to the impression 1 is generated based on the raw behavior features 432-1, an initial relevance score 442-2 corresponding to the impression 2 is generated based on the raw behavior features 432-2, ..., an initial relevance score 442-n corresponding to the impression n is generated based on the raw behavior features 432-n.
[0063] In the process 400, a plurality of initial relevance scores corresponding to a plurality of impressions may be combined into a final relevance score. For example, the initial relevance score 442-1, the initial relevance score 442-2, ..., the initial relevance score 442-n may be combined into a final relevance score 242. The scores may be combined in various approaches. For example, an average value of a plurality of initial relevance scores may be considered as a final relevance score.
[0064] The process 400 may further form the relevance label 250 from the relevance score 242. The relevance label 250 and the question-passage pair 222 are added to the auto-labelled training data set 260 as a QA training data example.
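[Editorial illustration.] A corresponding sketch of the score aggregation strategy, averaging the per-impression initial scores and reusing the hypothetical score_to_label helper above:

```python
def aggregate_scores(initial_scores: list):
    """Score aggregation strategy: average the initial relevance scores into a
    final relevance score, then convert it into a relevance label."""
    final_score = sum(initial_scores) / len(initial_scores)
    return final_score, score_to_label(final_score)
```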
[0065] FIG.5 illustrates an exemplary process 500 for providing QA training data based on a feature aggregation strategy according to an embodiment. The process 500 is an exemplary specific implementation of the process 200 in FIG.2, and the same reference numerals in FIG.5 and FIG.2 refer to the same processing steps or information. In the process 500, the implicit relevance feedback model may adopt a feature aggregation strategy for aggregating user behaviors. The feature aggregation strategy may refer to adopting aggregated behavior features for predicting a relevance score of a question-passage pair, and then forming a relevance label based on the relevance score.
[0066] Aggregated behavior features 532 across a plurality of impressions of the question-passage pair 222 may be extracted from the user behaviors 224 through the behavior feature extraction at 230. Assuming that the question-passage pair 222 has n impressions, some or all of the aggregated behavior features shown in Table 2 may be extracted from the n groups of user behaviors corresponding to the n impressions respectively. The implicit relevance feedback model 240 may be trained for generating the relevance score 242 based on the aggregated behavior features 532. The relevance score 242 may further form the relevance label 250. The relevance label 250 and the question-passage pair 222 are added to the auto-labelled training data set 260 as a QA training data example.
[0067] FIG.6 illustrates an exemplary process 600 for training a QA model according to an embodiment. The process 600 may train a QA model based at least on implicit relevance feedbacks.
[0068] The QA model 610 shown in FIG.6 may have an architecture based on various technologies. For example, the QA model 610 may be based on a deep neural network, e.g., bidirectional long short-term memory (BiLSTM), bidirectional encoder representations from transformers (BERT), etc. It should be understood that the embodiments of the present disclosure are intended to train various QA models with the QA training data obtained through, e.g., the process 200 of FIG.2, and are not limited to any specific QA models.
[0069] In the process 600, the QA model 610 is exemplarily trained in two stages. In the first stage, the QA model 610 may be pre-trained at 620. An auto-labelled training data set 622 may be obtained in advance, which corresponds to the training data set 260 in FIG.2, so that the training data set 622 includes a large amount of auto-labelled QA training data obtained based on implicit relevance feedbacks. The QA model 610 may be pre-trained with the training data set 622 in a weak supervision approach. In the second stage, the pre-trained QA model 610 may be fine-tuned at 630 to improve model performance. A human-labelled training data set 632 for performing the fine-tuning may be obtained in advance. As described above, the training data set 632 may only include a relatively small amount of QA training data.
[0070] For example, cross-entropy (CE) may be used as a loss function of the two training stages, which may be defined as:

$$\hat{y} = F_{QAModel}(<q, p>)$$   Equation (5)

$$L_{CE} = -\sum_{i=1}^{k} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$   Equation (6)

wherein $F_{QAModel}(\cdot)$ denotes the QA model, $\hat{y}$ is a relevance value output by the QA model, $k$ denotes the number of training data examples, $\hat{y}_i$ represents the relevance value output by the QA model for the i-th training data example, and $y_i$ denotes the true label regarding relevance value in the i-th training data example. The loss $L_{CE}$ calculated by Equation (6) may be used for updating the QA model in Equation (5) through gradient backward propagation.
[0071] Since the embodiments of the present disclosure provide the training data set 622 including a large amount of QA training data for the training of the QA model 610, the QA model 610 trained through the process 600 will have better performance compared with the existing QA models trained with only a limited amount of training data.
[0072] It should be understood that the process 600 for training the QA model in FIG.6 is exemplary, and the training data set 622 obtained according to the embodiments of the present disclosure may be used in any other approaches. For example, the QA model 610 may be trained with the training data set 622 in a training process that includes only one stage, instead of the process 600 including two training stages. For example, instead of utilizing the human-labelled training data set 632 for fine-tuning a QA model, the training data set 622 obtained according to the embodiments of the present disclosure may be used for fine-tuning and improving an existing QA model.
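[Editorial illustration.] As a hedged PyTorch sketch of the two-stage training with the loss of Equation (6); the model and data loader interfaces, epoch counts and learning rates are assumptions, not values given in the disclosure:

```python
import torch
import torch.nn as nn

def train_stage(qa_model, data_loader, epochs: int, lr: float):
    """One training stage using binary cross-entropy, i.e., Equation (6); the same
    routine serves pre-training on auto-labelled data and fine-tuning on
    human-labelled data. The model is assumed to output a relevance value in [0, 1]."""
    optimizer = torch.optim.Adam(qa_model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for question_passage, label in data_loader:
            y_hat = qa_model(question_passage)    # Equation (5)
            loss = loss_fn(y_hat, label.float())  # Equation (6)
            optimizer.zero_grad()
            loss.backward()                       # gradient backward propagation
            optimizer.step()

# Stage 1: weakly supervised pre-training; stage 2: fine-tuning, typically with
# a smaller learning rate:
# train_stage(qa_model, auto_labelled_loader, epochs=2, lr=1e-4)
# train_stage(qa_model, human_labelled_loader, epochs=3, lr=1e-5)
```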
[0073] FIG.7 illustrates a flowchart of an exemplary method 700 for providing QA training data based on implicit relevance feedbacks according to an embodiment.
[0074] At 710, a question-passage pair and corresponding user behaviors may be obtained from a search log.
[0075] At 720, behavior features may be extracted from the user behaviors.
[0076] At 730, a relevance score between the question and the passage may be determined, through an implicit relevance feedback model, based on the behavior features.
[0077] At 740, a relevance label may be added to the question-passage pair based on the relevance score.
[0078] In an implementation, the user behaviors may comprise at least one type of: click behavior type, re-query behavior type and browsing behavior type.
[0079] In an implementation, the obtaining user behaviors may comprise obtaining a plurality of groups of user behaviors corresponding to a plurality of impressions of the question-passage pair respectively.
[0080] The extracting behavior features may comprise: extracting, from a group of user behaviors corresponding to each impression of the plurality of impressions, raw behavior features corresponding to the impression. The determining a relevance score may comprise: for each impression of the plurality of impressions, determining, through the implicit relevance feedback model, an initial relevance score corresponding to the impression between the question and the passage based on raw behavior features corresponding to the impression. The determining a relevance score may further comprise: combining a plurality of initial relevance scores corresponding to the plurality of impressions into the relevance score. Optionally, the adding a relevance label may comprise: for each impression of the plurality of impressions, determining an initial relevance label corresponding to the impression based on an initial relevance score corresponding to the impression; and combining a plurality of initial relevance labels corresponding to the plurality of impressions into the relevance label.
[0081] The extracting behavior features may comprise: extracting aggregated behavior features from the plurality of groups of user behaviors. The determining a relevance score may comprise: determining, through the implicit relevance feedback model, the relevance score between the question and the passage based on the aggregated behavior features.
[0082] Each impression of the plurality of impressions may comprise at least one of: the passage, a source page link of the passage, an extension option of the passage, relevant questions, and web page links.
[0083] In an implementation, the method 700 may further comprise: adding the question-passage pair and the relevance label into a QA training data set as a QA training data example.
[0084] In an implementation, the relevance label may be a Boolean value generated based on the relevance score.
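[Editorial illustration.] Putting the pieces together, the following sketch composes the hypothetical helpers above into an end-to-end version of method 700 under the feature aggregation strategy; the search_log interface is an assumption:

```python
def provide_qa_training_data(search_log: dict, feedback_model) -> list:
    """End-to-end sketch of method 700: steps 710-740 for each question-passage pair."""
    training_set = []
    # search_log maps a (question, passage) pair to its ImpressionLogItems (step 710)
    for (question, passage), impressions in search_log.items():
        raw = [extract_raw_features(item) for item in impressions]               # step 720
        aggregated = aggregate_behavior_features(raw)
        score = feedback_model.predict_proba([list(aggregated.values())])[0][1]  # step 730
        label = score_to_label(score)                                            # step 740
        if label is not None:
            training_set.append((question, passage, label))
    return training_set
```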
[0085] It should be understood that the method 700 may further comprise any step/process for providing QA training data based on implicit relevance feedbacks according to the embodiments of the present disclosure as described above.
[0086] FIG.8 illustrates a flowchart of an exemplary method 800 for training a QA model based on implicit relevance feedbacks according to an embodiment.
[0087] At 810, an auto-labelled training data set may be obtained. Each training data example in the auto-labelled training data set may comprise a question-passage pair and a relevance label, the relevance label being generated at least based on user behaviors corresponding to the question-passage pair through an implicit relevance feedback model.
[0088] At 820, the QA model may be pre-trained with the auto-labelled training data set in a weak supervision approach.
[0089] At 830, the QA model may be fine-tuned with a human-labelled training data set.
[0090] In an implementation, the implicit relevance feedback model may be for: determining a relevance score between the question and the passage based on behavior features extracted from the user behaviors. The relevance label may be generated based on the relevance score.
[0091] It should be understood that the method 800 may further comprise any step/process for training a QA model based on implicit relevance feedbacks according to the embodiments of the present disclosure as described above.
[0092] FIG.9 illustrates an exemplary apparatus 900 for providing QA training data based on implicit relevance feedbacks according to an embodiment.
[0093] The apparatus 900 may comprise: an information obtaining module 910, for obtaining a question-passage pair and corresponding user behaviors from a search log; a behavior feature extracting module 920, for extracting behavior features from the user behaviors; a relevance score determining module 930, for determining, through an implicit relevance feedback model, a relevance score between the question and the passage based on the behavior features; and a relevance label adding module 940, for adding a relevance label to the question-passage pair based on the relevance score.
[0094] In an implementation, the information obtaining module 910 may be for: obtaining a plurality of groups of user behaviors corresponding to a plurality of impressions of the question-passage pair respectively.
[0095] The behavior feature extracting module 920 may be for: extracting, from a group of user behaviors corresponding to each impression of the plurality of impressions, raw behavior features corresponding to the impression. The relevance score determining module 930 may be for: for each impression of the plurality of impressions, determining, through the implicit relevance feedback model, an initial relevance score corresponding to the impression between the question and the passage based on raw behavior features corresponding to the impression; and combining a plurality of initial relevance scores corresponding to the plurality of impressions into the relevance score.
[0096] The behavior feature extracting module 920 may be for: extracting, from a group of user behaviors corresponding to each impression of the plurality of impressions, raw behavior features corresponding to the impression. The relevance score determining module 930 may be for: for each impression of the plurality of impressions, determining, through the implicit relevance feedback model, an initial relevance score corresponding to the impression between the question and the passage based on raw behavior features corresponding to the impression. The relevance label adding module 940 may be for: for each impression of the plurality of impressions, determining an initial relevance label corresponding to the impression based on an initial relevance score corresponding to the impression; and combining a plurality of initial relevance labels corresponding to the plurality of impressions into the relevance label.
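Paragraphs [0095] and [0096] leave the per-impression combination step open; a mean over initial scores and a majority vote over initial Boolean labels are two simple choices, assumed here purely for illustration.

```python
from statistics import mean

def combine_scores(initial_scores: list[float]) -> float:
    """Combine per-impression relevance scores (paragraph [0095]); mean is one option."""
    return mean(initial_scores)

def combine_labels(initial_labels: list[bool]) -> bool:
    """Combine per-impression Boolean labels (paragraph [0096]) by majority vote."""
    return sum(initial_labels) > len(initial_labels) / 2
```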
[0097] The behavior feature extracting module 920 may be for: extracting aggregated behavior features from the plurality of groups of user behaviors. The relevance score determining module 930 may be for: determining, through the implicit relevance feedback model, the relevance score between the question and the passage based on the aggregated behavior features.
[0098] In addition, the apparatus 900 may further comprise any other module configured for any operation of providing QA training data based on implicit relevance feedbacks.

[0099] FIG. 10 illustrates an exemplary apparatus 1000 for training a QA model based on implicit relevance feedbacks according to an embodiment.
[00100] The apparatus 1000 may comprise: a training data set obtaining module 1010, for obtaining an auto-labelled training data set, each training data example in the auto-labelled training data set comprising a question-passage pair and a relevance label, the relevance label being generated at least based on user behaviors corresponding to the question-passage pair through an implicit relevance feedback model; a pre-training module 1020, for pre-training the QA model with the auto-labelled training data set in a weak supervision approach; and a fine-tuning module 1030, for fine-tuning the QA model with a human-labelled training data set.
[00101] In addition, the apparatus 1000 may further comprise any other module configured for any operation of training a QA model based on implicit relevance feedbacks.

[00102] FIG. 11 illustrates an exemplary apparatus 1100 for providing QA training data and/or training a QA model based on implicit relevance feedbacks according to an embodiment.
[00103] The apparatus 1100 may comprise at least one processor 1110. The apparatus 1100 may further comprise a memory 1120 coupled to the processor 1110. The memory 1120 may store computer-executable instructions that when executed, cause the processor 1110 to perform any operation of the method for providing QA training data based on implicit relevance feedbacks, or any operation of the method for training a QA model based on implicit relevance feedbacks according to the embodiments of the present disclosure described above.
[00104] The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operation of the methods for providing QA training data and/or training a QA model based on implicit relevance feedbacks according to the embodiments of the present disclosure described above.
[00105] It should be understood that all the operations in the methods described above are merely exemplary; the present disclosure is not limited to any operations in the methods or to the order of these operations, and shall cover all other equivalents under the same or similar concepts.
[00106] It should also be understood that all the modules in the apparatuses described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further divided functionally into sub-modules or combined together.
[00107] Processors are described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a micro-controller, a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gate logic, discrete hardware circuitry, or other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a micro-controller, a DSP, or another suitable platform.
[00108] Software should be construed broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software applications, software packages, routines, objects, running threads, processes, functions, etc. Software may reside on a computer-readable medium. A computer-readable medium may include, e.g., a memory, which may be, e.g., a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).
[00109] The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.

Claims

1. A method for providing question-answering (QA) training data based on implicit relevance feedbacks, comprising:
obtaining a question-passage pair and corresponding user behaviors from a search log;
extracting behavior features from the user behaviors;
determining, through an implicit relevance feedback model, a relevance score between the question and the passage based on the behavior features; and
adding a relevance label to the question-passage pair based on the relevance score.
2. The method of claim 1, wherein the user behaviors comprise at least one of: a click behavior type, a re-query behavior type, and a browsing behavior type.
3. The method of claim 1, wherein the obtaining user behaviors comprises: obtaining a plurality of groups of user behaviors corresponding to a plurality of impressions of the question-passage pair respectively.
4. The method of claim 3, wherein the extracting behavior features comprises: extracting, from a group of user behaviors corresponding to each impression of the plurality of impressions, raw behavior features corresponding to the impression.
5. The method of claim 4, wherein the determining a relevance score comprises: for each impression of the plurality of impressions, determining, through the implicit relevance feedback model, an initial relevance score corresponding to the impression between the question and the passage based on raw behavior features corresponding to the impression.
6. The method of claim 5, wherein the determining a relevance score further comprises: combining a plurality of initial relevance scores corresponding to the plurality of impressions into the relevance score.
7. The method of claim 5, wherein the adding a relevance label comprises:
for each impression of the plurality of impressions, determining an initial relevance label corresponding to the impression based on an initial relevance score corresponding to the impression; and
combining a plurality of initial relevance labels corresponding to the plurality of impressions into the relevance label.
8. The method of claim 3, wherein the extracting behavior features comprises: extracting aggregated behavior features from the plurality of groups of user behaviors.
9. The method of claim 8, wherein the determining a relevance score comprises: determining, through the implicit relevance feedback model, the relevance score between the question and the passage based on the aggregated behavior features.
10. The method of claim 3, wherein each impression of the plurality of impressions comprises at least one of: the passage, a source page link of the passage, an extension option of the passage, relevant questions, and web page links.
11. The method of claim 1, further comprising: adding the question-passage pair and the relevance label into a QA training data set as a QA training data example.
12. The method of claim 1, wherein the relevance label is a Boolean value generated based on the relevance score.
13. A method for training a question-answering (QA) model based on implicit relevance feedbacks, comprising:
obtaining an auto-labelled training data set, each training data example in the auto-labelled training data set comprising a question-passage pair and a relevance label, the relevance label being generated at least based on user behaviors corresponding to the question-passage pair through an implicit relevance feedback model;
pre-training the QA model with the auto-labelled training data set in a weak supervision approach; and
fine-tuning the QA model with a human-labelled training data set.
14. An apparatus for providing question-answering (QA) training data based on implicit relevance feedbacks, comprising:
an information obtaining module, for obtaining a question-passage pair and corresponding user behaviors from a search log;
a behavior feature extracting module, for extracting behavior features from the user behaviors;
a relevance score determining module, for determining, through an implicit relevance feedback model, a relevance score between the question and the passage based on the behavior features; and
a relevance label adding module, for adding a relevance label to the question-passage pair based on the relevance score.
15. An apparatus for providing question-answering (QA) training data based on implicit relevance feedbacks, comprising:
at least one processor; and
a memory storing computer-executable instructions that, when executed, cause the at least one processor to:
obtain a question-passage pair and corresponding user behaviors from a search log,
extract behavior features from the user behaviors,
determine, through an implicit relevance feedback model, a relevance score between the question and the passage based on the behavior features, and
add a relevance label to the question-passage pair based on the relevance score.
PCT/US2020/064971 2020-01-16 2020-12-15 Providing qa training data and training a qa model based on implicit relevance feedbacks WO2021146003A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010046175.1 2020-01-16
CN202010046175.1A CN113127614A (en) 2020-01-16 2020-01-16 Providing QA training data and training QA model based on implicit relevance feedback

Publications (1)

Publication Number Publication Date
WO2021146003A1 (en)

Family

ID: 74004177

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/064971 WO2021146003A1 (en) 2020-01-16 2020-12-15 Providing qa training data and training a qa model based on implicit relevance feedbacks

Country Status (2)

Country Link
CN (1) CN113127614A (en)
WO (1) WO2021146003A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070208730A1 (en) * 2006-03-02 2007-09-06 Microsoft Corporation Mining web search user behavior to enhance web search relevance

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8938463B1 (en) * 2007-03-12 2015-01-20 Google Inc. Modifying search result ranking based on implicit user feedback and a model of presentation bias
US20120143789 * 2010-12-01 2012-06-07 Microsoft Corporation Click model that accounts for a user's intent when placing a query in a search engine
CN108959559B (en) * 2018-06-29 2021-02-26 北京百度网讯科技有限公司 Question and answer pair generation method and device
CN109754068A (en) * 2018-12-04 2019-05-14 中科恒运股份有限公司 Transfer learning method and terminal device based on deep learning pre-training model


Also Published As

Publication number Publication date
CN113127614A (en) 2021-07-16

Legal Events

Code 121 (EP): the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 20829274; Country of ref document: EP; Kind code of ref document: A1

Code NENP: non-entry into the national phase
Ref country code: DE

Code 122 (EP): PCT application non-entry into the European phase
Ref document number: 20829274; Country of ref document: EP; Kind code of ref document: A1