WO2021150313A1 - Contrastive learning for question answering (QA) - Google Patents

Contrastive learning for question answering (QA)

Info

Publication number
WO2021150313A1
PCT/US2020/064144
Authority
WO
WIPO (PCT)
Prior art keywords
contrastive
text
query
search
relevant
Prior art date
Application number
PCT/US2020/064144
Other languages
English (en)
Inventor
Ming GONG
Ze YANG
Linjun SHOU
Daxin Jiang
Original Assignee
Microsoft Technology Licensing, LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, LLC
Publication of WO2021150313A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems

Definitions

  • a search engine may provide search results for a user query in a search result page (SERP).
  • SERP search result page
  • Traditional search results include links to the most relevant web documents with respect to the user query.
  • a web document may also be referred to as, e.g., a web page.
  • a link may refer to a hyperlink, a web address, a URL, etc.
  • QA question answering
  • the QA service provides a more efficient information access mechanism, which extracts the most relevant passage from a web document and directly presents the content of the passage to a user. For example, if a user query has question intent, a web search engine will extract the most relevant passage from a web document, and place the passage within an individual QA block in a SERP.
  • the passage may refer to one or more sentences, one or more passages, abstract, etc., extracted from the corresponding web document.
  • the QA service is becoming more and more popular for search engine users and is becoming an important service provided by search engines.
  • Embodiments of the present disclosure propose methods and apparatuses for providing contrastive training data.
  • a positive example may be obtained from a training data set, the positive example comprising a first text and a second text labelled as relevant.
  • Contrastive information may be extracted from a search log.
  • the first text may be amended based at least on the contrastive information.
  • the amended first text and the second text may be combined into a negative example which is contrastive to the positive example, the amended first text and the second text being labelled as irrelevant in the negative example.
  • FIG.1 illustrates an exemplary search result page.
  • FIG.2 illustrates an exemplary process for providing contrastive training data according to an embodiment.
  • FIG.3 illustrates an exemplary process for providing contrastive training data through a Web Knowledge based Method (WKM) according to an embodiment.
  • WKM Web Knowledge based Method
  • FIG.4 illustrates an exemplary process for extracting candidate options according to an embodiment.
  • FIG.5 illustrates exemplary semi-structured data according to an embodiment.
  • FIG.6 illustrates an exemplary process for providing contrastive training data through a User Feedback based Method (UFM) according to an embodiment.
  • UFM User Feedback based Method
  • FIG.7 illustrates a flowchart of an exemplary method for providing contrastive training data according to an embodiment.
  • FIG.8 illustrates an exemplary apparatus for providing contrastive training data according to an embodiment.
  • FIG.9 illustrates an exemplary apparatus for providing contrastive training data according to an embodiment.
  • a QA system for providing the Web QA service may be configured in a search engine, to provide a passage most relevant to a query in a SERP.
  • the QA system may include a QA model, which is also referred to as a QA relevance model.
  • the QA model is used for, for each candidate passage, providing a relevance score between the candidate passage and the query. Therefore, the QA system may select the passage most relevant to the query based on a relevance score of each candidate passage, and present the passage to the user in a QA block in the SERP.
  • Web QA may be understood as a text matching task, and the text matching may broadly refer to techniques used for identifying whether a pair of texts is semantically relevant.
  • Some conventional approaches adopt an information retrieval (IR) model as a QA model to provide the QA service.
  • the IR model may include, e.g., vector space model, BM25 model, language model for IR, etc.
  • a neural network model is adopted as a QA model.
  • the neural network model may also refer to deep learning model, deep neural network model, etc.
  • the neural network model encodes semantic meaning of a query’s text into a vector. By mapping similar expressions to close positions in the vector space, the neural network model may recall a passage relevant to the query more accurately.
  • a deep pre-training approach may also be adopted for further improving the performance of the neural network model through context embedding.
  • the neural network model captures the semantic similarity among texts based on the distributional hypothesis, e.g., it would deem that linguistic items with similar distributions have similar meanings. Consequently, although the neural network model may successfully learn that "kid" is similar to "children", it may, at the same time, also consider that "the elderly" is similar to "kid" and "children", since the set of words that often co-occur with "the elderly" may also likely appear in the context of "kid" or "children". In the case of performing word embedding through word2vec, taking the word "adult" as an example, the closest words in the vector space may include "youth", "children", "the elderly", etc.
  • the word embedding technology not only clusters synonyms, but also easily clusters other words in the same category in the vector space. Even if deep context embedding is applied as in the deep pre-training approach, the neural network model may still consider that, e.g., "children” and “the elderly” are similar, since contexts of these two words usually overlap with each other.
  • the neural network model may not have sufficient sensitivity to distinguish words that have relevant attributes or categories but have dissimilar meanings. For example, the two words “children” and “the elderly” are both related to the category “person” or to the attribute "person”, but the meanings of them are not similar. In the scenario of the web QA, lack of such sensitivity may cause poor user experiences. For example, if a user query is "cold treatment for children", a search engine may provide a passage about how to treat children's cold, which is relevant to the query, i.e., a good answer to the query.
  • however, if a user query is "cold treatment for the elderly", the search engine may still provide the passage about how to treat children's cold, which would be irrelevant to the query, i.e., an inappropriate answer to the query.
  • adversarial training is applied for the neural network model.
  • the adversarial training aims to ensure that small perturbations which do not change the meaning of an input text will not cause significant changes to the output of the model. For example, an adversarial instance may be generated by amending words in the original text.
  • the adversarial training can only enhance the robustness of the model, but cannot be used for improving the sensitivity of the model.
  • Embodiments of the present disclosure propose to enhance sensitivity of a QA model through contrastive learning, e.g., sensitivity of a neural network model used for a web QA task, so that the model can have a capability of effectively distinguishing words that have relevant attributes or categories but dissimilar meanings.
  • the contrastive learning may include, e.g., automatically constructing or providing contrastive training data for enhancing the model’s sensitivity, and training the model with the automatically constructed contrastive training data.
  • the "contrastive" is used for describing relationship between two texts, e.g., two contrastive texts may refer to that these two texts are relevant in attributes or categories but not similar in meaning. For example, "children” and “the elderly” are two contrastive words.
  • a training data set used for training the QA model may include training data in the form of <q, p, label>, wherein q denotes a query or question, p denotes a passage, and "label" indicates the relevance between q and p.
  • when q and p are labelled as relevant, the <q, p> pair may be considered as a positive example, and when q and p are labelled as irrelevant, the <q, p> pair may be considered as a negative example.
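  • As an illustration only (not part of the original disclosure), the <q, p, label> form may be sketched in Python as follows; the class and field names are assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class QAExample:
    q: str      # query or question
    p: str      # passage
    label: str  # "relevant" (positive example) or "irrelevant" (negative example)
```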
  • for a positive example <qi, pi> in the training data set, the embodiments of the present disclosure may automatically generate a contrastive query qi′ of qi, and construct a negative example <qi′, pi> contrastive to the positive example <qi, pi>, wherein qi′ and pi are labelled as irrelevant.
  • since qi′ and qi are contrastive, qi′ deviates from qi in terms of meaning.
  • the constructed negative example may be added to the training data set to be used for training the QA model.
  • continuing the example above, where qi is "cold treatment for children" and pi is the passage about how to treat children's cold, the embodiments of the present disclosure may construct qi′ as, e.g., "cold treatment for the elderly", and accordingly form a negative example composed of the query "cold treatment for the elderly" and the passage about how to treat children's cold.
  • the QA model can not only learn from the positive examples what information should be associated together, but also learn from the constructed contrastive negative examples what information should be distinguished, e.g., distinguishing words that are relevant in attributes or categories but not similar in meaning.
  • the embodiments of the present disclosure may mine contrastive information from a search log, and construct negative examples that are contrastive to positive examples in the training data set with the contrastive information.
  • Two unsupervised methods for automatically constructing contrastive training data are proposed, e.g., Web Knowledge based Method (WKM) and User Feedback based Method (UFM).
  • WKM Web Knowledge based Method
  • UFM User Feedback based Method
  • the WKM may generate contrastive training data with a contrastive word pair set, which is mined from the search log, through word replacement. For example, the WKM may obtain candidate options from the search log, cluster the candidate options into multiple groups at least with a semi-structured data corpus collected on the web, and form contrastive word pairs from candidate options included in each group.
  • the WKM may amend a query in a positive example with the mined contrastive word pair set, e.g., replacing words in the query to form a negative example.
  • the UFM may select a contrastive query that is contrastive to a query in a positive example based at least on search records in the search log, e.g., displayed links, user click behaviors, etc., and form a negative example with the selected contrastive query.
  • the embodiments of the present disclosure are not limited to constructing contrastive <query, passage> pairs in the QA scenario, but may be more broadly used for constructing contrastive text pairs <text 1′, text 2> for various types of text pairs <text 1, text 2> in other application scenarios.
  • a search engine may need to determine whether two queries have the same meaning, or any other model may need to determine whether two texts have the same meaning, therefore, through constructing training text pairs acting as negative examples that are contrastive to training text pairs acting as positive examples according to the embodiments of the present disclosure, the model’s sensitivity in determining textual meaning similarity may be enhanced.
  • contrastive training data constructed according to the embodiments of the present disclosure may facilitate improving the model's ability to recognize changes in meaning caused by word changes in texts, thereby enhancing the model's sensitivity. Therefore, although the construction of contrastive query-passage pair training data is taken as an example in some parts of the following discussion, the same or similar process may also be applied for scenarios of constructing any other contrastive text pair training data.
  • FIG.1 illustrates an exemplary search result page (SERP) 100.
  • the SERP 100 may be presented to a user in a user interface by a search engine in response to the user's query or question.
  • Components in the SERP 100 may be exemplarily divided into a search block 110, a QA block 120, a relevant question block 130, a web page link block 140, etc.
  • the blocks are only different logical divisions of the components in the SERP 100, and in terms of display and function, different blocks and components therein may be independent from or combined with each other.
  • the user may enter a query, e.g., "summer flu treatment".
  • the search engine may provide the QA block 120 in the SERP 100.
  • the QA block 120 may include, e.g., a passage 122 for answering the user query, an extension option 124 of the passage 122, a source page link 126 of the passage 122, etc.
  • the passage 122 is content that is extracted from a web document and is most relevant to the user query. For example, in FIG.1, the passage 122 may include multiple tips for treating summer cold. Due to the limitation of display size of a page, the passage 122 may only be partially displayed.
  • the user may click on the extension option 124, e.g., a "More items" link, to view the hidden parts of the passage 122.
  • the source page link 126 is a hyperlink to a source page or a source web document from which the passage 122 is extracted.
  • the SERP 100 may further include a feedback button or link 128 for collecting satisfaction feedbacks provided by the user for the passage 122.
  • the relevant question block 130 may include questions relevant to or similar to the user query in the search block 110. These relevant questions may include, e.g., queries frequently searched by other users. In FIG.1, multiple questions relevant to the user query "summer flu treatment" are shown in the relevant question block 130, e.g., "What causes summer flu?", "Medicines for summer flu?", etc.
  • the search engine may initiate a search for the clicked relevant question and present a corresponding SERP in the user interface.
  • the web page link block 140 includes hyperlinks to web pages or web documents relevant to the user query in the search block 110.
  • the web page links in the web page link block 140 may be ranked by the search engine based on document relevance.
  • the web page may be presented in the user interface.
  • FIG.2 illustrates an exemplary process 200 for providing contrastive training data according to an embodiment.
  • the process 200 may be performed for automatically generating a contrastive negative example for a text pair acting as a positive example in a training data set 210, so as to expand training data in the training data set 210.
  • the training data set 210 may include training data for training a target model.
  • the target model may be various models related to prediction of text relevance, e.g., a QA model in a QA system, a text meaning comparison model, a text classification model, etc.
  • the training data in the training data set 210 may take the form of <text 1, text 2, label>, wherein the "label" indicates the relevance between text 1 and text 2.
  • when being labelled as "relevant", the text pair may be considered as a positive example, and when being labelled as "irrelevant", the text pair may be considered as a negative example.
  • the search log 220 may include a search record for each query.
  • a search record may include various types of information in a SERP provided in response to a query, e.g., the query, a passage provided for the query, web links provided for the query, etc.
  • the search record may also include various user behaviors on the SERP, e.g., clicking on a web page link by a user, etc.
  • at 230, contrastive information 240 may be extracted from the search log 220.
  • the contrastive information may refer to various types of information that facilitate to generate contrastive training data.
  • the contrastive information 240 may include a contrastive word pair set which may be used for generating contrastive training data in the WKM.
  • the contrastive information 240 may include contrastive queries which may be used for generating contrastive training data in the UFM.
  • text 1 in the positive example 212 may be amended based at least on the contrastive information 240.
  • Text 1 may be amended in different approaches. For example, words in text 1 may be replaced by words in the contrastive word pair set. For example, text 1 may be directly replaced by a contrastive query.
  • the amended text 1 may be used for forming a negative example 252 that is contrastive to the positive example 212.
  • the amended text 1 and text 2 in the positive example 212 may be combined into a text pair <amended text 1, text 2>. Since the amended text 1 is contrastive to text 1, the amended text 1 and text 2 may be labelled as irrelevant. Accordingly, the negative example 252 <amended text 1, text 2, irrelevant> is formed.
  • the negative example 252 may act as contrastive training data that is contrastive to the positive example 212.
  • a contrastive training data set including a plurality of contrastive training data may be automatically obtained.
  • the process 200 may further add the contrastive training data in the contrastive training data set to the training data set 210.
  • the contrastive training data set may be added to the training data set 210 in different approaches.
  • the contrastive training data set may be simply appended to the training data set 210 as additional training data in addition to the original training data in the training data set 210.
  • the training data set 210 may be updated by using at least a portion of the contrastive training data in the contrastive training data set to replace a portion of the original negative examples in the training data set 210, so as to ensure a balance between the number of positive examples and the number of negative examples in the training data set while guiding sensitivity training and ensuring the original accuracy of the model.
  • Negative examples in the updated training data set may be configured according to the following equation:
  • X̃n = sample(Xn, α·K) ∪ sample(X*, (1−α)·K)   Equation (1)
  • wherein Xn is the original negative example set in the training data set, X* is the contrastive training data set generated by the process 200, X̃n is the final negative example set in the updated training data set, sample(L, K) is a sampling function for sampling K instances from the source L, K is the total number of negative examples to be retained, and α is a sampling coefficient which is used for controlling a ratio of instances selected from the original negative examples and from the contrastive training data.
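  • A minimal sketch of this update in Python, assuming that K defaults to the size of the original negative example set so that the positive/negative balance is preserved (the equation itself only fixes the roles of Xn, X* and α):

```python
import random

def build_negative_set(original_negatives, contrastive_negatives, alpha, k=None):
    """Mix original and contrastive negatives per Equation (1): sample
    roughly alpha*K instances from the original negative set and
    (1 - alpha)*K instances from the contrastive training data set."""
    if k is None:
        k = len(original_negatives)  # assumed default, keeps the pos/neg balance
    n_orig = int(alpha * k)
    n_contr = k - n_orig
    sampled = random.sample(original_negatives, min(n_orig, len(original_negatives)))
    sampled += random.sample(contrastive_negatives,
                             min(n_contr, len(contrastive_negatives)))
    return sampled
```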
  • the process 200 may be amended in any approaches.
  • although the process 200 only shows generating one negative example 252 for the positive example 212, it is also possible to perform multiple amendments to text 1 at 250 with different contrastive information and obtain different versions of the amended text 1; thereby, multiple negative examples may be generated for the positive example 212.
  • the text pair <text 1, text 2> directed by the process 200 may be any text pair involving text relevance prediction, e.g., a <query, passage> pair in the QA scenario, a <query, query> pair in the scenario of comparing meanings of queries by a search engine, a <sentence, sentence> pair or a <word, word> pair in the scenario of comparing meanings of texts by a general language model, etc.
  • FIG.3 illustrates an exemplary process 300 for providing contrastive training data through a WKM according to an embodiment.
  • the process 300 is an exemplary specific implementation of the process 200 in FIG.2.
  • the process of mining a contrastive word pair set from a search log in FIG.3 may be regarded as a specific implementation of the process of extracting the contrastive information from the search log in FIG.2.
  • the search log 310 may include records of search sessions of users.
  • a search session may refer to a search process for one or more interrelated queries established between a search engine and a user.
  • a search session involving more than one query may be referred to as a multi-turn search session.
  • the search engine may first receive an initial query entered by a user, i.e., a first-turn query. After presenting a SERP for the first-turn query, the user may desire to amend the first-turn query in order to obtain further information, and thus initiate a second-turn query associated with the first-turn query. After obtaining the second-turn query, the search engine may perform search and return a SERP for the second-turn query.
  • the process 300 may generate a contrastive word pair set based on multi-turn search sessions in the search log.
  • candidate option extraction may be performed in the search log 310.
  • at least one multi-turn search session may be first extracted from the search log.
  • the at least one multi-turn search session may have the same first-turn query.
  • Candidate options may be extracted from the extracted at least one multi-turn search session.
  • a candidate option may refer to a candidate that may act as a word in a contrastive word pair set. It should be understood that, herein, the "word” may broadly refer to character, word, phrase, etc.
  • candidate option 1, candidate option 2, candidate option 3, ..., etc. may be extracted from the search log 310.
  • FIG.4 illustrates an exemplary process 400 for extracting candidate options according to an embodiment.
  • a search log 410 in FIG.4 may correspond to the search log 310 in FIG.3. Assume that a plurality of multi-turn search sessions including the same first-turn query "diabetes" are extracted from the search log 410, e.g., Session 1, Session 2, and Session 3. Session 1 includes multiple turns of query, e.g., a first-turn query 422 "diabetes", a second-turn query 424 "type 1 diabetes", a third-turn query 426 "diabetes symptoms", etc.
  • Session 2 includes multiple turns of query, e.g., a first-turn query 432 "diabetes", a second-turn query 434 "diabetes treatment", a third-turn query 436 "diabetes treatment for female", etc.
  • Session 3 includes multiple turns of query, e.g., a first-turn query 442 "diabetes", a second-turn query 444 "type 1 diabetes symptoms", a third-turn query 446 "type 1 diabetes fatigue”, etc.
  • those words shared between every two adjacent queries may be further extracted as a body, and those words not shared between every two adjacent queries may be extracted as candidate options.
  • a Longest Common Subsequence (LCS) may be used for detecting shared words in two adjacent queries.
  • the query 422 "diabetes" and the query 424 "type 1 diabetes" share a LCS "diabetes", and an entry B1 corresponding to this LCS "diabetes" may be established in a body set B, and other words in the two queries, e.g., "type 1", may be stored as candidate options in a subset O1 corresponding to B1 in a candidate option set O.
  • the query 424 "type 1 diabetes" and the query 426 "diabetes symptoms" share a LCS "diabetes", and there is already an entry B1 corresponding to the LCS "diabetes" in the body set B, therefore, other words in the two queries, e.g., "type 1" and "symptoms", may be stored as candidate options in the subset O1 corresponding to B1, wherein since the candidate option "type 1" already exists in the subset O1, repeated storage of "type 1" may be avoided.
  • body information and candidate option information as shown in the table at the bottom of FIG.4 may be obtained, e.g., the candidate option set O1 = {type 1, symptoms, treatment, female} corresponding to the body entry B1 "diabetes", a candidate option set O2 = {symptoms, fatigue} corresponding to a body entry B2 "type 1 diabetes", a candidate option set O3 = {female} corresponding to a body entry B3 "diabetes treatment", etc.
  • Bi denotes the i-th body
  • M is the number of bodies
  • Oij denotes the j-th candidate option in the candidate option set Oi corresponding to the i-th body
  • Ni is the number of candidate options in the candidate option set Oi corresponding to the i-th body.
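  • The extraction at 320 may be sketched as below; this is a simplified, assumed implementation in which queries are whitespace-tokenized and difflib's matching blocks stand in for a generic LCS routine (it does not, e.g., strip stop words such as "for"):

```python
from difflib import SequenceMatcher

def unshared_spans(words, blocks, key):
    """Contiguous runs of words not covered by the LCS matching blocks."""
    spans, pos = [], 0
    for block in blocks:
        start = getattr(block, key)
        if pos < start:
            spans.append(" ".join(words[pos:start]))
        pos = start + block.size
    return [s for s in spans if s]

def extract_candidates(sessions):
    """For every two adjacent queries in each session, the shared words
    (LCS) form a body entry, and the unshared spans become candidate
    options stored under that body; sets avoid repeated storage."""
    bodies = {}  # body string -> set of candidate options
    for session in sessions:
        for q1, q2 in zip(session, session[1:]):
            w1, w2 = q1.split(), q2.split()
            blocks = SequenceMatcher(a=w1, b=w2).get_matching_blocks()
            body = " ".join(w for b in blocks for w in w1[b.a:b.a + b.size])
            if not body:
                continue
            options = bodies.setdefault(body, set())
            options.update(unshared_spans(w1, blocks, "a"))
            options.update(unshared_spans(w2, blocks, "b"))
    return bodies

sessions = [["diabetes", "type 1 diabetes", "diabetes symptoms"]]
# extract_candidates(sessions) -> {"diabetes": {"type 1", "symptoms"}}
```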
  • the candidate options extracted at 320 may include, e.g., the candidate options in the candidate option sets O1, O2 and O3 in FIG.4.
  • the candidate options extracted at 320 may not be suitable for direct use in forming contrastive word pairs.
  • some candidate options may not belong to the same category or attribute and cannot be used for forming contrastive word pairs.
  • “type 1" and "female” do not belong to the same category or attribute, it will be meaningless to generate a word pair ⁇ type 1, female> which is not a contrastive word pair either.
  • some candidate options may be synonyms and should not be used for forming contrastive word pairs. For example, "woman" and "female" are synonyms and have similar meanings, therefore, it should be avoided to form a word pair <woman, female>.
  • the process 300 may also include data optimization for the extracted candidate options.
  • group clustering may be performed to the candidate options.
  • the candidate options extracted at 320 may be clustered into Group 1, Group 2, ..., etc.
  • Each group may include one or more candidate options with the same category or attribute.
  • a semi-structured data corpus 332 prepared in advance may be used for performing the group clustering.
  • the semi-structured data corpus 332 may include various types of semi-structured data obtained from the web, e.g., web table, web list, web menu, etc. Usually, candidate options belonging to the same category or attribute will appear together in the same semi-structured data.
  • FIG.5 illustrates exemplary semi-structured data according to an embodiment.
  • a web table 512 is displayed on a web page 510. As highlighted by dashed lines, the web table 512 includes words "Stage 1", "Stage 2", “Stage 3", “Stage 4", etc. belonging to the same category or attribute.
  • a web list 522 is displayed on a web page 520. As highlighted by dashed lines, the web list 522 also includes the words "Stage 1", “Stage 2", “Stage 3", “Stage 4", etc. belonging to the same category or attribute.
  • the similarity between two candidate options may be calculated based at least on occurrence information of the two candidate options in the semi-structured data corpus 332. For example, given a candidate option pair (oi, oj) and semi-structured data d ∈ D, wherein oi and oj are two candidate options extracted at 320, and D represents a semi-structured data corpus. According to the inclusion relationship between the candidate options and the semi-structured data, i.e., occurrence of the candidate options in the semi-structured data, the following definitions may be given:
  • P(oj) = (Σt Xt) / |D|   Equation (2), wherein Xt ∈ {1, 0} indicates whether dt includes oj, and |D| is the number of semi-structured data in the corpus D.
  • P(oi, oj) = (Σd X·Y) / |D|   Equation (3), and the similarity may then be calculated as, e.g., sim(oi, oj) = P(oi, oj) / (P(oi)·P(oj))   Equation (4), wherein X ∈ {1, 0} indicates whether d ∈ D includes oi, and Y ∈ {1, 0} indicates whether d ∈ D includes oj.
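  • A sketch of this similarity computation, assuming Python, the ratio form of Equation (4) reconstructed above, and each corpus entry modeled as the set of words it contains:

```python
def option_similarity(o_i, o_j, corpus):
    """Occurrence-based similarity over a semi-structured data corpus.

    corpus: list of sets, each holding the words of one web table,
    web list or web menu (Equations (2)-(4) as reconstructed above)."""
    n = len(corpus)
    p_i = sum(o_i in d for d in corpus) / n                    # Equation (2)
    p_j = sum(o_j in d for d in corpus) / n
    p_ij = sum((o_i in d) and (o_j in d) for d in corpus) / n  # Equation (3)
    if p_i == 0 or p_j == 0:
        return 0.0
    return p_ij / (p_i * p_j)                                  # Equation (4), assumed form
```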
  • groups may be generated through a greedy clustering approach, and a group corresponding to each candidate option may be determined.
  • o1 ∈ O may be first selected as a group.
  • each oi ∈ O (i ≠ 1) is then traversed, and a similarity score between oi and each existing group Gj ∈ C is calculated, wherein the calculation equation is, e.g., Sim(oi, Gj) = (1/|Gj|) · Σn sim(oi, oGj,n)   Equation (6), wherein Gj is an existing group, |Gj| is the number of candidate options in the group Gj, and oGj,n is the n-th candidate option in the group Gj.
  • if the maximum similarity score does not satisfy a predetermined condition, oi may be considered as a new group; otherwise, oi will be added to the group Gj having the maximum similarity score.
  • Table 1 below shows an exemplary process of determining a group corresponding to each candidate option through the greedy clustering approach.
  • a group set C = {G1, G2, ..., G|C|} may be obtained, wherein |C| is the number of groups.
  • the groups in the group set C may correspond to Group 1, Group 2, etc., in FIG.3.
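  • The greedy clustering may be sketched as follows; the function names and the concrete threshold are assumptions for illustration, with sim(a, b) being the pairwise similarity of Equation (4):

```python
def greedy_cluster(options, sim, threshold=0.5):
    """Seed with the first candidate option; for each remaining option,
    join the most similar existing group (Equation (6): average pairwise
    similarity to the group's members) or start a new group."""
    groups = [[options[0]]]
    for o in options[1:]:
        scores = [sum(sim(o, m) for m in g) / len(g) for g in groups]
        best = max(range(len(groups)), key=scores.__getitem__)
        if scores[best] >= threshold:
            groups[best].append(o)
        else:
            groups.append([o])
    return groups
```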
  • inner-group deduplication may be performed to the groups generated at 330, so as to remove synonymic candidate options.
  • a group including two or more synonymic candidate options may be first identified, and then only one of the two or more synonymic candidate options may be retained in the group. For example, if a group includes the synonyms "woman" and "female", one of these two words, e.g., "woman", may be removed, and only the other word "female" is retained in the group.
  • WordNet may be used for removing synonyms from a group.
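  • A sketch of the inner-group deduplication using WordNet through NLTK (an assumed tooling choice; the wordnet corpus must be downloaded once):

```python
from nltk.corpus import wordnet  # requires nltk.download("wordnet")

def are_synonyms(w1, w2):
    """True if w2 appears among the synset lemmas of w1 in WordNet."""
    lemmas = {lemma.name().lower()
              for synset in wordnet.synsets(w1.replace(" ", "_"))
              for lemma in synset.lemmas()}
    return w2.lower().replace(" ", "_") in lemmas

def deduplicate_group(group):
    """Retain only one candidate option out of each set of synonyms."""
    kept = []
    for option in group:
        if not any(are_synonyms(option, k) or are_synonyms(k, option)
                   for k in kept):
            kept.append(option)
    return kept
```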
  • any two candidate options in each group may be combined into a contrastive word pair.
  • multiple contrastive word pairs 352-1 may be generated based on the candidate options in Group 1′,
  • multiple contrastive word pairs 352-2 may be generated based on the candidate options in Group 2′, etc. All the obtained contrastive word pairs may form a contrastive word pair set 352.
  • the contrastive word pair set 352 may be directly generated based on Group 1, Group 2, etc., obtained at 330.
  • the contrastive word pair set 352 is an example of the contrastive information extracted from the search log. Therefore, step 320 to step 350 in the process 300 may be regarded as an exemplary implementation of step 230 in FIG.2.
  • a positive example 370 from the training data set may be amended with the contrastive word pair set 352.
  • a target word in text 1 of the positive example 370 may be identified at 360, and the target word is also included in a contrastive word pair 360-1 in the contrastive word pair set 352.
  • unigram words, bigram words, trigram words, etc. in text 1 may be traversed, and a target word in text 1 that matches a word in a contrastive word pair in the contrastive word pair set 352 may be found.
  • the other word in this contrastive word pair may be regarded as a contrastive word of the target word in text 1.
  • the target word in text 1 may be replaced by the contrastive word in the contrastive word pair 360-1, so that text 1 is amended. Accordingly, a negative example 390 that is contrastive to the positive example 370 may be obtained, which includes a text pair that is labelled as irrelevant, and is composed of the amended text 1 and text 2.
  • the positive example 370 is a text pair composed of a query "cold treatment for children" and a passage about how to treat children’s cold
  • through the process 300, it may be determined that the "children" in the query is a target word and is included in a contrastive word pair such as <children, the elderly>, therefore, the "children" in the query may be replaced by "the elderly", and a negative example composed of a query "cold treatment for the elderly" and a passage about how to treat children's cold may be constructed.
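  • Steps 360 and 380 may be sketched together as below; this is a simplified, assumed implementation that uses plain substring matching, whereas the description above traverses unigram, bigram and trigram words:

```python
def make_wkm_negative(query, passage, contrastive_pairs):
    """Form a negative example by replacing a target word in the query
    with its contrastive word; the passage is kept unchanged and the
    amended pair is labelled as irrelevant."""
    for word, contrastive_word in contrastive_pairs:
        if word in query:  # crude match; see the note above
            amended = query.replace(word, contrastive_word)
            return {"text1": amended, "text2": passage, "label": "irrelevant"}
    return None  # no contrastive pair applies to this positive example

make_wkm_negative("cold treatment for children",
                  "passage about how to treat children's cold",
                  [("children", "the elderly")])
# -> {'text1': 'cold treatment for the elderly', ..., 'label': 'irrelevant'}
```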
  • FIG.6 illustrates an exemplary process 600 for providing contrastive training data through a UFM according to an embodiment.
  • the process 600 is an exemplary specific implementation of the process 200 in FIG.2.
  • the process of mining contrastive queries from a search log in FIG.6 may be regarded as a specific implementation of the process of extracting the contrastive information from the search log in FIG.2.
  • a search log may not only record search results for a user’s query, but also record user behaviors when the user interacts with a SERP, e.g., clicking on a web link, etc.
  • the process 600 may be performed for generating a negative example that is contrastive to a positive example 610 in a training data set.
  • at least one relevant query 624 which is relevant to text 1 in the positive example 610 may be determined from a search log 622.
  • the determined at least one relevant query 624 may be collectively denoted as a relevant query set Qr.
  • text 1 and text 2 in the positive example 610 may be a query and a passage respectively.
  • a query-based inverted index may be constructed for performing fast query retrieval in the search log, and BM25 may be used for ranking.
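  • A sketch of this retrieval step, assuming the third-party rank_bm25 package as the BM25 implementation (any BM25 ranker over a query index would serve):

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25 (assumed tooling choice)

def find_relevant_queries(text1, logged_queries, top_n=20):
    """Return the logged queries most relevant to text 1, ranked by BM25."""
    tokenized = [q.lower().split() for q in logged_queries]
    bm25 = BM25Okapi(tokenized)
    return bm25.get_top_n(text1.lower().split(), logged_queries, n=top_n)
```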
  • search records corresponding to the positive example 610 and search records corresponding to each relevant query in Qr may be extracted from the search log 622.
  • the search records may include, e.g., queries, passages, web page links, click behaviors on web page links, etc.
  • contrastive parameter values between text 1 in the positive example 610 and each relevant query in Qr may be calculated based at least on the search records corresponding to the positive example 610 and the search records corresponding to each relevant query in Qr.
  • the number of co-displayed links and the number of co-clicked links between text 1 and each relevant query may be determined based on the search records, and contrastive parameter values between text 1 and the relevant query may be calculated based at least on the number of co-displayed links and the number of co-clicked links.
  • Text 1 may be denoted as q, and co-displayed link information, co-clicked link information, etc. between q and a relevant query qr in Qr may be calculated by the following equations:
  • CoDisplay(q, qr) = U(q) ∩ U(qr)   Equation (7)
  • UnionDisplay(q, qr) = U(q) ∪ U(qr)   Equation (8)
  • CoClick(q, qr) = Click(q) ∩ Click(qr)   Equation (9)
  • wherein qr ∈ Qr, U(q) denotes a link list provided in a SERP for q, U(qr) denotes a link list provided in a SERP for qr, Click(q) denotes a link list clicked in the SERP for q, Click(qr) denotes a link list clicked in the SERP for qr, and CoDisplay(q, qr) denotes a link list co-displayed in the SERPs for q and qr.
  • the normalization coefficients lc(q, qr) and lr(q, qr), defined based on the co-clicked and co-displayed link information above, may be considered as examples of the contrastive parameters. Accordingly, values of the normalization coefficients calculated according to the above equations may act as the contrastive parameter values between q and qr.
  • At 650, at least one contrastive query may be determined from Qr based on comparison between the calculated contrastive parameter values and predetermined criteria. For example, a relevant query with contrastive parameter values that meet the predetermined criteria may be selected from Qr as a contrastive query.
  • wherein t1 and t2 are predetermined thresholds in Equation (12); e.g., t1 may be set to 0, and t2 may be set to 0.4.
  • the above Equation (12) may be considered as an example of the predetermined criteria. It should be understood that the embodiments of the present disclosure may also adopt any other forms of predetermined criteria.
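  • Steps 630 through 650 may be sketched as below. Since the definitions of lc and lr (Equations (10) and (11)) did not survive extraction here, normalizing the co-clicked and co-displayed link counts by the union of displayed links, and reading Equation (12) as lc ≤ t1 and lr ≥ t2, are assumptions consistent with the surrounding description:

```python
def select_contrastive_queries(q_links, q_clicks, related, t1=0.0, t2=0.4):
    """Select relevant queries whose SERPs largely overlap with that of q
    (high co-display) but share no clicked links (zero co-click).

    q_links / q_clicks: sets of displayed / clicked links for q;
    related: dict mapping each relevant query to its (links, clicks) sets."""
    contrastive = []
    for q_r, (r_links, r_clicks) in related.items():
        union_display = q_links | r_links                    # Equation (8)
        if not union_display:
            continue
        l_r = len(q_links & r_links) / len(union_display)    # from Equation (7); assumed normalization
        l_c = len(q_clicks & r_clicks) / len(union_display)  # from Equation (9); assumed normalization
        if l_c <= t1 and l_r >= t2:                          # assumed reading of Equation (12)
            contrastive.append(q_r)
    return contrastive
```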
  • the contrastive query is an example of the contrastive information extracted from the search log. Therefore, step 620 to step 650 in the process 600 may be regarded as an exemplary implementation of step 230 in FIG.2.
  • the positive example 610 may be amended with the contrastive query.
  • text 1 in the positive example 610 may be directly replaced by the contrastive query.
  • a negative example 670 that is contrastive to the positive example 610 may be obtained, which includes a text pair that is labelled as irrelevant and is composed of the contrastive query and text 2. It should be understood that if multiple contrastive queries are determined at 650, multiple negative examples may be formed with these contrastive queries, respectively.
  • the positive example 610 is a text pair composed of a query "cold treatment for children" and a passage about how to treat children’s cold
  • a contrastive query such as "how to treat cold for the elderly”
  • a negative example composed of the query "how to treat cold for the elderly” and the passage about how to treat children's cold
  • FIG.7 illustrates a flowchart of an exemplary method 700 for providing contrastive training data according to an embodiment.
  • a positive example may be obtained from a training data set, the positive example including a first text and a second text labelled as relevant.
  • contrastive information may be extracted from a search log.
  • the first text may be amended based at least on the contrastive information.
  • the amended first text and the second text may be combined into a negative example which is contrastive to the positive example, the amended first text and the second text being labelled as irrelevant in the negative example.
  • the extracting contrastive information from a search log may comprise: extracting at least one multi-turn search session from the search log; and generating a contrastive word pair set with queries in the at least one multi-turn search session.
  • the at least one multi-turn search session may have the same first-turn query.
  • the generating a contrastive word pair set may comprise: extracting candidate options from the queries in the at least one multi-turn search session; clustering the candidate options into one or more groups with a semi-structured data corpus; and combining any two candidate options in each group into a contrastive word pair.
  • the extracting candidate options may comprise: for each multi-turn search session, extracting words not shared in every two adjacent queries as the candidate options.
  • the clustering may comprise: for two target candidate options in the candidate options, calculating similarity between the two target candidate options based at least on occurrence information of the two target candidate options in the semi-structured data corpus.
  • the clustering may comprise: determining, through a greedy clustering approach, a group to which each candidate option in the candidate options corresponds.
  • the semi-structured data in the semi-structured data corpus may belong to at least one type of: web table, web list and web menu.
  • the method 700 may further comprise: identifying a group including two or more synonymic candidate options; and retaining, in the group, only one candidate option in the two or more synonymic candidate options.
  • the amending the first text may comprise: identifying a target word which is included in the first text and included in a contrastive word pair in the contrastive word pair set; and replacing, in the first text, the target word by another word in the contrastive word pair.
  • the extracting contrastive information from a search log may comprise: determining a contrastive query corresponding to the first text from the search log.
  • the determining a contrastive query may comprise: determining, from the search log, at least one relevant query which is relevant to the first text; for each relevant query, calculating contrastive parameter values between the first text and the relevant query based at least on a search record corresponding to the positive example and a search record corresponding to the relevant query in the search log; and selecting, from the at least one relevant query, a relevant query which has contrastive parameter values conforming to predetermined criteria as the contrastive query.
  • the calculating contrastive parameter values may comprise: determining the number of co-displayed links and the number of co-clicked links between the first text and the relevant query, based on the search record corresponding to the positive example and the search record corresponding to the relevant query; and calculating the contrastive parameter values between the first text and the relevant query based at least on the number of co-displayed links and the number of co-clicked links.
  • the amending the first text may comprise: replacing the first text by the contrastive query.
  • the training data set may be for training a QA model, the first text corresponding to a query, the second text corresponding to a passage.
  • the method 700 may further comprise any step/process for providing contrastive training data according to the embodiments of the present disclosure described above.
  • FIG.8 illustrates an exemplary apparatus 800 for providing contrastive training data according to an embodiment.
  • the apparatus 800 may comprise: a positive example obtaining module 810, for obtaining a positive example from a training data set, the positive example including a first text and a second text labelled as relevant; a contrastive information extracting module 820, for extracting contrastive information from a search log; a text amending module 830, for amending the first text based at least on the contrastive information; and a negative example generating module 840, for combining the amended first text and the second text into a negative example which is contrastive to the positive example, the amended first text and the second text being labelled as irrelevant in the negative example.
  • the contrastive information extracting module 820 may be for: extracting at least one multi-turn search session from the search log; and generating a contrastive word pair set with queries in the at least one multi-turn search session.
  • the generating a contrastive word pair set may comprise: extracting candidate options from the queries in the at least one multi-turn search session; clustering the candidate options into one or more groups with a semi-structured data corpus; and combining any two candidate options in each group into a contrastive word pair.
  • the contrastive information extracting module 820 may be for: determining a contrastive query corresponding to the first text from the search log.
  • the apparatus 800 may further comprise any other module configured for any operation of providing contrastive training data.
  • FIG.9 illustrates an exemplary apparatus 900 for providing contrastive training data according to an embodiment.
  • the apparatus 900 may comprise at least one processor 910.
  • the apparatus 900 may further comprise a memory 920 coupled to the processor 910.
  • the memory 920 may store computer-executable instructions that, when executed, cause the processor 910 to: obtain a positive example from a training data set, the positive example including a first text and a second text labelled as relevant; extract contrastive information from a search log; amend the first text based at least on the contrastive information; and combine the amended first text and the second text into a negative example which is contrastive to the positive example, the amended first text and the second text being labelled as irrelevant in the negative example.
  • the processor 910 may be further configured for performing any other operations of the methods for providing contrastive training data according to the embodiments of the present disclosure described above.
  • the embodiments of the present disclosure may be embodied in a non- transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for providing contrastive training data according to the embodiments of the present disclosure described above.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
  • processors are described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a micro-controller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gate logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described in this disclosure.
  • DSP digital signal processor
  • FPGA field programmable gate array
  • PLD programmable logic device
  • the functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a micro-controller, a DSP, or another suitable platform.
  • Software should be considered broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software applications, software packages, routines, subroutines, objects, running threads, processes, functions, etc. Software may reside on computer readable medium.
  • Computer readable medium may include, e.g., a memory, which may be, e.g., a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk.
  • although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to contrastive learning for question answering (QA), and concerns methods and apparatuses for providing contrastive training data. A positive example may be obtained from a training data set, the positive example comprising a first text and a second text labelled as relevant. Contrastive information may be extracted from a search log. The first text may be amended based at least on the contrastive information. The amended first text and the second text may be combined into a negative example which is contrastive to the positive example, the amended first text and the second text being labelled as irrelevant in the negative example.
PCT/US2020/064144 2020-01-20 2020-12-10 Contrastive learning for question answering (QA) WO2021150313A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010064971.8 2020-01-20
CN202010064971.8A CN113139119A (zh) 2020-01-20 2020-01-20 Contrastive learning for question answering (QA)

Publications (1)

Publication Number Publication Date
WO2021150313A1 true WO2021150313A1 (fr) 2021-07-29

Family

ID=74125696

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/064144 WO2021150313A1 (fr) 2020-01-20 2020-12-10 Contrastive learning for question answering (QA)

Country Status (2)

Country Link
CN (1) CN113139119A (fr)
WO (1) WO2021150313A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673242A (zh) * 2021-08-20 2021-11-19 之江实验室 A text classification method based on the k-nearest-neighbor algorithm and contrastive learning
CN114579606A (zh) * 2022-05-05 2022-06-03 阿里巴巴达摩院(杭州)科技有限公司 Data processing method for a pre-trained model, electronic device, and computer storage medium
CN114880452A (zh) * 2022-05-25 2022-08-09 重庆大学 A text retrieval method based on multi-view contrastive learning
EP4322066A4 (fr) * 2022-06-22 2024-02-14 Jina AI GmbH Method and apparatus for generating training data
CN118673125A (zh) * 2024-08-22 2024-09-20 杭州电子科技大学 A response-aware conversational information retrieval method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190057159A1 (en) * 2017-08-15 2019-02-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, server, and storage medium for recalling for search
US20190294694A1 (en) * 2018-03-21 2019-09-26 International Business Machines Corporation Similarity based negative sampling analysis

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019658B (zh) * 2017-07-31 2023-01-20 腾讯科技(深圳)有限公司 Method for generating search terms and related apparatus
CN108509474B (zh) * 2017-09-15 2022-01-07 腾讯科技(深圳)有限公司 Synonym expansion method and apparatus for search information
CN110633407B (zh) * 2018-06-20 2022-05-24 百度在线网络技术(北京)有限公司 Information retrieval method, apparatus, device and computer-readable medium


Also Published As

Publication number Publication date
CN113139119A (zh) 2021-07-20

Similar Documents

Publication Publication Date Title
US8751218B2 (en) Indexing content at semantic level
CN103838833B (zh) 基于相关词语语义分析的全文检索系统
CN103678576B (zh) 基于动态语义分析的全文检索系统
WO2021150313A1 (fr) Contrastive learning for question answering (QA)
CN110888991B (zh) 一种弱标注环境下的分段式语义标注方法
US20130036076A1 (en) Method for keyword extraction
Wu et al. Identification of web query intent based on query text and web knowledge
Lu et al. A dataset search engine for the research document corpus
Dobson Interpretable Outputs: Criteria for Machine Learning in the Humanities.
Sanchez-Gomez et al. Sentiment-oriented query-focused text summarization addressed with a multi-objective optimization approach
Zulen et al. Study and implementation of monolingual approach on indonesian question answering for factoid and non-factoid question
Park et al. Extracting search intentions from web search logs
Balog et al. The university of amsterdam at weps2
Çelebi et al. Automatic question answering for Turkish with pattern parsing
Saenko et al. Filtering abstract senses from image search results
BAZRFKAN et al. Using machine learning methods to summarize persian texts
Nikolić et al. Modelling the System of Receiving Quick Answers for e-Government Services: Study for the Crime Domain in the Republic of Serbia
Gao et al. Improving medical ontology based on word embedding
Sati et al. Arabic text question answering from an answer retrieval point of view: A survey
Plansangket New weighting schemes for document ranking and ranked query suggestion
Takhirov et al. An evidence-based verification approach to extract entities and relations for knowledge base population
Ağduk et al. Classification of news texts from different languages with machine learning algorithms
US20240281489A1 (en) System, method, and application for embedded internet searching and result display for personalized language and vocabulary learning
Luo et al. Improving keyphrase extraction from web news by exploiting comments information
Stańczyk et al. On employing elements of rough set theory to stylometric analysis of literary texts

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20835956

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20835956

Country of ref document: EP

Kind code of ref document: A1