CN115599904A - Question generation from queries - Google Patents

Question generation from queries

Info

Publication number
CN115599904A
Authority
CN
China
Prior art keywords
search
title
training data
data set
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110771263.2A
Other languages
Chinese (zh)
Inventor
王雪云
寿林钧
公明
姜大昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC filed Critical Microsoft Technology Licensing LLC
Priority to CN202110771263.2A priority Critical patent/CN115599904A/en
Priority to PCT/US2022/031699 priority patent/WO2023282996A1/en
Publication of CN115599904A publication Critical patent/CN115599904A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3338Query expansion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure presents methods, apparatuses, and computer program products for question generation from queries. A search log associated with a search engine may be obtained. A plurality of search queries and a plurality of sets of search results corresponding to the plurality of search queries may be extracted from the search log. A training data set for at least training a question generation model may be obtained from the plurality of search queries and the plurality of sets of search results, each sample in the training data set including a search query and a title of a search result, the title being a question corresponding to the search query.

Description

Question generation from queries
Background
Natural Language Processing (NLP) is a technology for communicating with computers using natural language. It aims to enable computers to understand and use natural language so as to realize communication between humans and machines, thereby performing various natural-language-related tasks in place of humans. Tasks performed with NLP techniques may be referred to as NLP tasks. Examples of NLP tasks include Question Answering (QA) tasks, Machine Reading Comprehension (MRC) tasks, Query-Passage (QP) relevance tasks, Query-List (QList) relevance tasks, and so on. A machine learning model may be trained with a training data set for a particular NLP task. The trained machine learning model may then be deployed to perform that NLP task.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure present methods, apparatuses, and computer program products for question generation from queries. A search log associated with a search engine may be obtained. A plurality of search queries and a plurality of sets of search results corresponding to the plurality of search queries may be extracted from the search log. A training data set for at least training a question generation model may be obtained from the plurality of search queries and the plurality of sets of search results, each sample in the training data set including a search query and a title of a search result, the title being a question corresponding to the search query.
It should be noted that one or more of the above aspects include features that are specifically pointed out in the following detailed description and claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Drawings
The disclosed aspects will hereinafter be described in conjunction with the appended drawings, which are provided to illustrate and not to limit the disclosed aspects.
FIG. 1 is a diagram illustrating an exemplary set of queries and an exemplary set of questions corresponding to the set of queries.
FIG. 2 illustrates an exemplary process of question generation from a query according to an embodiment of the disclosure.
FIG. 3 illustrates an exemplary process for generating an initial training data set in accordance with an embodiment of the present disclosure.
FIG. 4 illustrates an exemplary process for removing a sample in accordance with an embodiment of the disclosure.
FIG. 5 is a flow diagram of an exemplary method for question generation from a query in accordance with an embodiment of the present disclosure.
FIG. 6 illustrates an example apparatus for question generation from a query according to an embodiment of this disclosure.
FIG. 7 illustrates an example apparatus for question generation from a query according to an embodiment of this disclosure.
Detailed Description
The present disclosure will now be discussed with reference to several exemplary embodiments. It is to be understood that the discussion of these embodiments is merely intended to enable those skilled in the art to better understand and thereby practice the embodiments of the present disclosure, and does not suggest any limitation on the scope of the present disclosure.
Some NLP models that perform NLP tasks may perform corresponding processing based on an input query to output a result for the query. As an example, a QA model for performing a QA task may, upon receiving a query from a user, output an answer to the query. As another example, an MRC model for performing an MRC task may, upon receiving a query and a text passage, output a segment of the text passage that answers the query. Queries provided to an NLP model may be irregular, e.g., contain grammatical or spelling errors, have ambiguous intent, etc. For example, a query provided by a user to an NLP model may be "compute density". The query includes only two keywords, rather than a complete sentence, and its intent is thus ambiguous. Such a query may not be easily understood and processed by the NLP model, making it difficult to output an accurate result for the query.
Embodiments of the present disclosure propose question generation from queries. In this context, a query may refer to an original text fragment provided by a user or other entity for obtaining particular information. The query may be informal, e.g., not following correct grammatical structure, having typographical or spelling errors, having ambiguous intent, etc. For example, when a query includes only a few keywords, the topic it indicates is unclear and its intent is thus ambiguous. In this context, a question may refer to a text fragment that indicates a topic requiring an answer or explanation. A question is typically a complete sentence, such as an interrogative sentence. A question may be well-formed, e.g., following correct grammatical structure, having correct spelling, having a definite intent, etc. Since a question has a form that is easy for both humans and machines to understand, it may also be referred to as a natural language question. Questions may be generated from queries by a question generation model. The question generation model may be a machine learning model that is specifically trained to generate questions from queries. The question generation model may generate, from an input query, a question suitable for NLP model processing. For example, the question generation model may obtain a query originally input to an NLP model and generate a question from the obtained query. The generated question may then be provided to the NLP model. Since the question generated by the question generation model is well-formed, e.g., follows correct grammatical structure, has correct spelling, and has a definite intent, the question is easier for the NLP model to understand and process, so that more accurate results can be obtained.
In one aspect, embodiments of the present disclosure propose obtaining a training data set for training the question generation model from a search log associated with a search engine. For example, a plurality of search queries and a plurality of sets of search results corresponding to the plurality of search queries may be extracted from the search log. A search query is a query in the context of a search engine and may be regarded as an instance of a query. For example, a search query may refer to a piece of original text provided to a search engine by a user or other entity to initiate retrieval of particular information. A set of search results corresponding to a search query may include, for example, one or more search results retrieved by the search engine based on the search query. Each search result may include a title. For example, where the search result is a web page, the title of the search result may be the title of the web page. The title of a search result corresponding to a particular search query may be a question corresponding to that search query. For example, the search query may be "compute density", and the title of a web page retrieved based on the search query may be "how to calculate the density?". The title "how to calculate the density?" can be regarded as a question corresponding to the search query "compute density". Thus, a search query and the title of a corresponding search result may be combined into a sample for training the question generation model. A search query may correspond to a set of search results; accordingly, a set of samples corresponding to the search query may be obtained. There may be multiple search queries in the search log; accordingly, multiple sample sets corresponding to the multiple search queries may be combined into an initial training data set.
In another aspect, embodiments of the present disclosure propose filtering the initial training data set based on a plurality of factors to obtain the training data set for training the question generation model. The first factor relates to intent consistency between the title in a sample and the search query in that sample. For example, in filtering the initial training data set, it may be determined whether the intent of the title in each sample is consistent with the intent of the search query in that sample, and samples in which the intent of the title is inconsistent with the intent of the search query are removed. The second factor relates to the quality of the title in a sample. For example, in filtering the initial training data set, for each sample it may be determined whether the title satisfies a predetermined quality requirement based on the grammatical structure, spelling, intent clarity, etc. of the title, and samples whose titles do not satisfy the predetermined quality requirement are removed. The third factor relates to textual similarity between the title in a sample and the search query in that sample. For example, in filtering the initial training data set, it may be determined whether the title in each sample is textually similar to the search query in that sample, and samples whose titles are textually similar to the search query are removed. By taking the above three factors into consideration, samples that do not satisfy the above conditions can be removed from the initial training data set, so that a training data set including high-quality samples that satisfy requirements in terms of, e.g., intent consistency, title quality, and textual similarity can be obtained. Furthermore, the above process is based on a huge number of search logs and is implemented in a fully automatic manner without human intervention, so that a training data set comprising a large number of high-quality samples can be obtained. Such a training data set may help train a question generation model with better performance.
In another aspect, the training data set obtaining process according to embodiments of the present disclosure is multi-language capable. The process fully considers the commonalities and characteristics of different languages and thus can be widely applied to any language. That is, a training data set for question generation based on any language can be obtained through the training data set obtaining process proposed by the embodiments of the present disclosure. This is extremely beneficial for languages in which training data is scarce or uncommon. Further, where different training data sets based on different languages are obtained, these training data sets may be used to train the question generation model to obtain a question generation model with multi-language capability.
In another aspect, embodiments of the present disclosure propose data enhancement of a training data set for an NLP task using the question generation model. For example, samples in the training data set for some NLP tasks may include queries. For each sample, a question may be generated from the query in that sample by the question generation model. The generated question may be combined with the portions of the sample other than the query into a new sample. The new sample may then be added to the training data set for the NLP task. In this way, the number of samples can be increased. In addition, since the question generated by the question generation model is a well-formed question that is easy for humans and machines to understand, the quality of the samples is improved.
In another aspect, embodiments of the present disclosure propose further training the question generation model with a training data set for an NLP task, so as to obtain a model for performing the NLP task. For example, the NLP task may be a Grammatical Error Correction (GEC) task. The GEC task aims to convert an input text segment with grammatical errors into a text segment without grammatical errors. The question generation model may be further trained using a training data set for the GEC task. The training data set for the GEC task may include a plurality of samples, each of which may include a text segment with grammatical errors and a text segment without grammatical errors. The question generation model further trained with the training data set for the GEC task can perform the GEC task when actually deployed, e.g., it can convert an input text segment with grammatical errors into a text segment without grammatical errors.
In another aspect, embodiments of the present disclosure propose training other NLP models with the training data set for the question generation model obtained according to embodiments of the present disclosure. As an example, the training data set used for the question generation model may also be used to train a GEC model. Since the training data set for the question generation model includes a large number of high-quality samples, using it to train another NLP model amounts to data enhancement of the training data set for that NLP model, e.g., improving the number and quality of samples, which helps train an NLP model with better performance.
FIG. 1 is a diagram 100 illustrating an exemplary set of queries 100a and an exemplary set of questions 100b corresponding to the set of queries. The exemplary set of queries 100a may include queries based on various languages. These queries may be informal, e.g., not following correct grammatical structure, having typographical or spelling errors, having ambiguous intent, etc. Each question in the exemplary set of questions 100b may correspond to one query in the exemplary set of queries 100a. Each question may be generated from the respective query, for example, by a question generation model according to embodiments of the present disclosure. These questions may be well-formed, e.g., follow correct grammatical structure, have correct spelling, have a definite intent, etc.
The query 110 "sea blue" may be based on English. The query 110 includes only keywords, not a complete sentence, and its intent is ambiguous. The question 112 corresponding to the query 110 may be "Why is the sea blue?". The question 112 clearly indicates that the topic requiring an answer is "the reason the sea is blue".
Query 120"fair deep" may be based on french. The spelling of "crepe" in the query 120 is wrong, and should be "cr e. In addition, because of the lack of articles before "crepe," query 120 is grammatically incorrect. Further, the query 120 includes only keywords, not complete sentences, the intent of which is ambiguous. The question 122 corresponding to the query 120 may be "Comment false des cr"? (how do you do a cookie) ". Question 122 corrects spelling and grammar errors present in query 120 and clearly indicates that the question asked to answer is "method of making pretzels".
Query 130"
Figure BDA0003153503710000061
Babys (baby body temperature) "may be german based. Because of the lack of articles before "Babys," query 130 is grammatically incorrect. Further, the query 130 includes only keywords, not complete sentences, the intent of which is ambiguous. The question 132 corresponding to the query 130 may be "Wie misst man die
Figure BDA0003153503710000062
eines Babys? (how do you take a body temperature to the baby)'. Question 132 corrects grammatical errors present in query 130 and clearly indicates that the question to be answered is "method of thermometering an infant".
The query 140 "helping the user to help the user in the user's attraction" can be based on japanese. The query 140 includes only keywords, not complete sentences, the intent of which is ambiguous. A question 142 corresponding to the query 140 can be "12399v, 1239312428124281241235612391? (how long the duration of human attention is) ". The question 142 clearly indicates that the topic to be answered is "the length of the duration of the attention of the human being".
Query 150 "ii-p-1099c ci _ ci \ 1102t @ (high-altitude parachuting)" may be based on russian. The query 150 includes only keywords, rather than complete sentences, the intent of which is ambiguous. The problem 152 corresponding to the query 150 may be "r al zha zao boao, boan, r al, 1103aji, a r al, a \ 1078, -ci. (where is it appropriate for high-altitude parachuting)'. The question 152 clearly indicates that the question to be answered is "a place suitable for high-altitude parachuting".
The query 160 "greenhouse May tomato planting" may be Chinese based. The query 160 does not follow the correct syntactic structure and the intent is also ambiguous. A question 162 corresponding to the query 160 may be "may be whether or not tomatoes may be planted in the greenhouse in january? ". Question 162 corrects grammatical errors in query 160 and clearly indicates that the question asked for an answer is "feasibility of growing tomatoes in greenhouse in the month of May".
It should be understood that the queries and questions illustrated in FIG. 1 are merely exemplary. Queries may also be based on other languages. The questions corresponding to each query may also have other forms.
According to embodiments of the present disclosure, a training data set may be obtained through a search log associated with a search engine. The obtained training data set may be used at least for training the question generation model. FIG. 2 illustrates exemplary processes 200a and 200b for question generation from a query according to an embodiment of the disclosure. In process 200a, a training data set may be obtained through a search log associated with a search engine. In process 200b, the question generation model may be trained using the obtained training data set. The trained question generation model can generate, from an input query, a question suitable for NLP model processing.
At 210, a search log associated with a search engine can be obtained. A search log associated with a search engine may include a plurality of search queries. Each search query of the plurality of search queries may be a word, phrase, or sentence, etc. that is entered into a search box of a search engine by a user or other entity to obtain particular information on the network. The search query may be informal, e.g., not following the correct grammatical structure, having typing or spelling errors, having an ambiguous intent, etc. In addition, the search query may be based on any language. For each search query, a set of search results corresponding to the search query may be included in the search logs. Each search result in the set of search results may be retrieved over the network by a search engine based on the search query. The search result set may also be based on any language.
At 220, a plurality of search queries and a plurality of sets of search results corresponding to the plurality of search queries may be extracted from the obtained search logs. One of the plurality of sets of search results may correspond to one of the plurality of search queries.
A training data set for at least training the question generation model may be obtained from the extracted plurality of search queries and the plurality of sets of search results. For example, an initial training data set may be generated based on the extracted plurality of search queries and the plurality of sets of search results, and the initial training data set may be filtered to obtain the training data set.
At 230, an initial training data set 232 may be generated based on the plurality of search queries and the plurality of sets of search results. A search result may include a title. For example, where the search result is a web page, the title of the search result may be the title of the web page. The search result may also include an address. The address may be, for example, a Uniform Resource Locator (URL). The title of a search result corresponding to a particular search query may be a question corresponding to that search query. For example, the search query may be "compute density", and the title of a web page retrieved based on the search query may be "how to calculate the density?". The title "how to calculate the density?" can be regarded as a question corresponding to the search query "compute density". Thus, a search query and the title of a corresponding search result may be combined into a sample used to train the question generation model. That is, the sample may include a search query and the title of a search result. For each search query of the plurality of search queries, the search query may be combined with the title of each search result of the set of search results corresponding to the search query to obtain a set of samples corresponding to the search query. Multiple sample sets corresponding to the multiple search queries may then be combined into the initial training data set 232. An exemplary process for generating the initial training data set will be described later in connection with FIG. 3.
At 240, the initial training data set 232 may be filtered to obtain a training data set 242. In filtering the initial training data set 232, for each sample in the initial training data set 232, it may be determined whether the sample should be removed. An exemplary process for removing a sample will be described later in connection with FIG. 4.
Through the process 200a above, the training data set 242 for training the question generation model is obtained. In process 200b, the question generation model may be trained using the obtained training data set.
The training data set 250 in process 200b may correspond to the training data set 242 in process 200a. The question generation model 260 may be trained using the training data set 250 to obtain a trained question generation model 270. The question generation model may be a sequence-to-sequence model based on an encoder-decoder structure. The encoder and the decoder may each be a Deep Neural Network (DNN) model, such as a Recurrent Neural Network (RNN) model, a Long Short-Term Memory (LSTM) model, a Transformer model, or the like. The encoder may extract semantic and syntactic features from a representation of the input query. The decoder may generate a natural language question based on the extracted semantic and syntactic features.
The trained question generation model 270 is adapted to generate, from an input query, a question suitable for NLP model processing. By way of example, the NLP model may be an MRC model, a QA model, a QP relevance model, a QList relevance model, or the like. The query 272 may be provided to the trained question generation model 270. For example, the query 272 may be a query originally input to the NLP model. The trained question generation model 270 may generate a question 274 from the query 272. The question 274 can then be provided to the NLP model. Since the question 274 generated by the trained question generation model 270 is well-formed, e.g., follows correct grammatical structure, has correct spelling, and has a definite intent, the question 274 is easier for the NLP model to understand and process, and more accurate results can be obtained.
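By way of illustration only, the following minimal Python sketch shows how such an encoder-decoder question generation model might be loaded and used at inference time. It assumes the Hugging Face transformers library and a generic T5 checkpoint as a stand-in for a fine-tuned question generation model; the checkpoint name, decoding parameters, and expected output are assumptions for illustration and are not part of this disclosure.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "t5-small" is only a placeholder; in practice the model would first be
# fine-tuned on the <search query, title> training data set described above.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def generate_question(query: str) -> str:
    # The encoder extracts semantic and syntactic features from the query;
    # the decoder generates a natural language question from those features.
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# After fine-tuning on samples such as
# ("compute density", "how to calculate the density?"), one would expect
# an informal query to yield a well-formed question, e.g.:
print(generate_question("compute density"))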
Because a training data set obtained according to embodiments of the present disclosure, such as the training data set 242 or 250, is obtained in an automated and reliable manner based on a large number of search logs, the training data set may include a large number of high-quality samples. Such a training data set may help train a question generation model with good performance.
It should be appreciated that the process for question generation from a query described above in connection with FIG. 2 is merely exemplary. The steps in the process for question generation from a query may be replaced or modified in any manner, and the process may include more or fewer steps, depending on actual application requirements. For example, in process 200b, prior to training the question generation model 260 with the training data set 250, the question generation model 260 may be pre-trained in some pre-training manner to improve the efficiency of model training. The pre-training manner may include, for example, Masked Language Modeling (MLM), Next Sentence Prediction (NSP), and the like. Further, the specific order or hierarchy of steps in processes 200a and 200b is merely exemplary, and the process for question generation from a query may be performed in an order different from that described.
FIG. 3 illustrates an exemplary process 300 for generating an initial training data set in accordance with an embodiment of the disclosure. Process 300 may, for example, correspond to step 230 of process 200a in FIG. 2. Through process 300, an initial training data set may be generated based on a plurality of search queries extracted from a search log and a plurality of sets of search results corresponding to the plurality of search queries. For example, for each search query of the plurality of search queries, the search query may be combined with the title of each search result of the set of search results corresponding to the search query to obtain a set of samples corresponding to the search query. Subsequently, a plurality of sample sets corresponding to the plurality of search queries may be combined into the initial training data set.
The plurality of search queries extracted from the search log may be search query 310-1 Q1, search query 310-2 Q2, ..., search query 310-M QM, where M ≥ 1 represents the number of search queries. The following uses search query Q1 as an example to illustrate an exemplary process of obtaining a sample set.

The search result set corresponding to search query Q1 extracted from the search log may be search result set 320-1 R1. Search result set R1 may include, for example, search result 320-1-1 R11, search result 320-1-2 R12, ..., search result 320-1-N R1N, where N ≥ 1 represents the number of search results. Each search result may include various types of information, such as a title, an address, and the like. Taking a web page as an example, the search result may include, for example, the title of the web page, the URL of the web page, and the like. Search result R11 may include, for example, title T11, address A11, etc.; search result R12 may include, for example, title T12, address A12, etc.; and search result R1N may include, for example, title T1N, address A1N, and the like.

According to an embodiment of the present disclosure, search query Q1 may be combined with the titles in search result set R1 to obtain a sample set 330-1 S1 corresponding to search query Q1. For example, search query Q1 may be combined with title T11 of search result R11 to obtain sample 330-1-1 S11 <search query Q1, title T11>; search query Q1 may be combined with title T12 of search result R12 to obtain sample 330-1-2 S12 <search query Q1, title T12>; and search query Q1 may be combined with title T1N of search result R1N to obtain sample 330-1-N S1N <search query Q1, title T1N>. Sample S11, sample S12, ..., and sample S1N may form the sample set S1 corresponding to search query Q1.

Operations similar to those performed for search query Q1 may be performed for each of search query Q2 through search query QM, thereby obtaining the sample set corresponding to each search query. As an example, the search result set corresponding to search query Q2 extracted from the search log may be search result set 320-2 R2, and search query Q2 may be combined with the titles in search result set R2 to obtain sample set 330-2 S2 corresponding to search query Q2. As another example, the search result set corresponding to search query QM extracted from the search log may be search result set 320-M RM, and search query QM may be combined with the titles in search result set RM to obtain sample set 330-M SM corresponding to search query QM.

Subsequently, sample set S1, sample set S2, ..., and sample set SM may be combined to obtain an initial training data set 340. The initial training data set 340 may, for example, correspond to the initial training data set 232 in process 200a in FIG. 2.
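A minimal Python sketch of this combination step is given below. The record layout of the extracted search log entries is an assumption chosen for illustration, since the disclosure does not prescribe a concrete data format.

from typing import NamedTuple

class SearchResult(NamedTuple):
    title: str    # e.g., the title of the retrieved web page (T11, T12, ...)
    address: str  # e.g., the URL of the retrieved web page (A11, A12, ...)

def build_initial_training_data(search_log):
    # search_log: iterable of (search query, [SearchResult, ...]) pairs
    # extracted from the search log, i.e., (Q1, R1), ..., (QM, RM).
    initial_training_data_set = []
    for query, result_set in search_log:
        for result in result_set:
            # Combine the search query with the title of each search result
            # to form one sample, e.g., <Q1, T11>.
            initial_training_data_set.append((query, result.title))
    return initial_training_data_set

log = [("compute density",
        [SearchResult("how to calculate the density?", "https://example.com/qa/1")])]
print(build_initial_training_data(log))
# [('compute density', 'how to calculate the density?')]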
It should be appreciated that the search queries Q1, Q2, ..., QM and/or the search result sets R1, R2, ..., RM in FIG. 3 may be based on any language. Further, it should be understood that the process for generating the initial training data set described above in connection with FIG. 3 is merely exemplary. The steps in the process for generating the initial training data set may be replaced or modified in any manner, and the process may include more or fewer steps, depending on actual application requirements. For example, in obtaining the samples corresponding to each search query, the search query may be combined with both the title and the address of the respective search result, in addition to being combined with the title alone. Further, the particular order or hierarchy of steps in process 300 is merely exemplary, and the process for generating the initial training data set may be performed in an order different from that described.
FIG. 4 illustrates an exemplary process 400 for removing a sample according to an embodiment of the disclosure. Process 400 may be performed for each sample in an initial training data set generated based on a plurality of search queries extracted from a search log and a plurality of sets of search results. For example, process 400 may be performed for each sample in the initial training data set 232 in process 200a in FIG. 2 or for each sample in the initial training data set 340 in FIG. 3. An exemplary process for removing a sample is described below using sample 402 S as an example. Through process 400, it may be determined whether sample S should be removed. The sample S may include, for example, a search query Q and a title T.
At 404, it may be determined whether an address corresponding to title T was accessed. The address corresponding to the title T may be the address, such as a URL, of the search result corresponding to the title T. In one embodiment, whether the address corresponding to title T was accessed may be determined by determining whether the address was clicked by a user. A set of search results retrieved based on search query Q may be presented to the user via a search results page. The search results page may include information for each search result, such as a title, an address, a summary, and the like. Typically, the user will read the titles of the search results. If the user believes that the title of a search result has the same intent as the search query the user issued, the user may access the address, for example, by clicking on the address corresponding to the title, to learn more about the search result. Thus, whether there is intent consistency between the title T and the search query Q may be determined by determining whether the address corresponding to the title T was accessed.

If it is determined at 404 that the address corresponding to the title T was not accessed, there may be no intent consistency between the title T and the search query Q. In this case, the process 400 may proceed to 424, i.e., remove the sample S.

If it is determined at 404 that the address corresponding to the title T was accessed, there may be intent consistency between the title T and the search query Q. In this case, process 400 may proceed to 406.
According to an embodiment of the present disclosure, it may be determined whether the title T in the sample S satisfies a predetermined quality requirement, and in case it is determined that the title T does not satisfy the predetermined quality requirement, the sample S is removed. For example, whether the title T satisfies the predetermined quality requirement may be determined based on at least one of a grammatical structure, spelling, and intent clarity of the title T. In one embodiment, it may be determined whether the title T meets a predetermined quality requirement by determining whether an address corresponding to the title T is associated with a predetermined address. In another embodiment, it may be determined whether the title T meets the predetermined quality requirement by determining whether the title T includes a query word.
For example, at 406, it may be determined whether the address corresponding to the title T is associated with a predetermined address. The predetermined address may be, for example, an address of a website providing a Community Question Answering (CQA) service. By way of example, websites providing CQA services may include Quora, Zhihu, Guokr, and the like. Websites providing CQA services generally have good reputations. Generally, the titles of the web pages of these websites are good in terms of grammatical structure, spelling, and intent clarity, and thus can be regarded as high-quality questions that satisfy the predetermined quality requirement.
If it is determined at 406 that the address corresponding to the title T is not associated with a predetermined address, such as the address of a website providing the CQA service, the title T may not be the title of a web page of a website providing the CQA service and thus may not be a high-quality question. In this case, the process 400 may proceed to 424, i.e., remove the sample S.

If it is determined at 406 that the address corresponding to the title T is associated with a predetermined address, such as the address of a website providing the CQA service, the title T may be the title of a web page of a website providing the CQA service and thus may be a high-quality question. In this case, process 400 may proceed to 408.
At 408, it may be determined whether the title T includes a query word. Query words are interrogative words used to construct interrogative sentences. As an example, query words may include why, what, where, how, etc. A question is typically in the form of an interrogative sentence. Query words express the query intent explicitly and are essential components of an interrogative sentence. Therefore, a high-quality question should include a query word.
If it is determined at 408 that the title T does not include a query word, the title T may not be a high-quality question. In this case, the process 400 may proceed to 424, i.e., remove the sample S.

If it is determined at 408 that the title T includes a query word, the title T may be a high-quality question. In this case, process 400 may proceed to 410.
For some languages, such as English, French, German, etc., query words are typically located at the beginning of an interrogative sentence. Preferably, when the title T is based on such a predetermined language, it may be determined whether the beginning of the title T includes a query word. If it is determined that the beginning of the title T does not include a query word, the title T may not be a high-quality question. In this case, the process 400 may proceed to 424, i.e., remove the sample S. If it is determined that the beginning of the title T includes a query word, the title T may be a high-quality question. In this case, process 400 may proceed to 410.
At 410, it may be determined whether the title T includes a predetermined word. The predetermined word may be a word unrelated to the semantics of the title T. For example, the title of a web page from a particular website may contain words related to the name of the website, such as words appended to the beginning or end of the title. Such words are unrelated to the semantics of the title T. Since samples whose titles' corresponding addresses are not associated with the predetermined address have already been removed at step 406, the samples reaching step 410 are samples whose titles' corresponding addresses are associated with the predetermined address. By examining the titles of the web pages of the websites corresponding to the predetermined addresses, it is possible to determine which words are generally included in titles regardless of the titles' semantics, and to treat those words as the predetermined words. According to an embodiment of the present disclosure, where the title T of the sample S includes a predetermined word, the sample S may be updated by deleting the predetermined word. For example, if it is determined at 410 that the title T includes a predetermined word, the process may proceed to 412, i.e., delete the predetermined word.
If it is determined at 410 that the title T does not include a predetermined word, process 400 may proceed to 414. Whether there is intent consistency between the title T and the search query Q has already been determined at 404 above by determining whether the address corresponding to the title T was accessed. According to an embodiment of the present disclosure, whether there is intent consistency between the title T and the search query Q may be further determined by an intent consistency model. For example, at 414, an intent consistency score between the title T and the search query Q may be predicted by an intent consistency model. The intent consistency model may be a machine learning model specifically trained to predict intent consistency scores between titles and search queries. At 416, it may be determined whether the intent consistency score predicted by the intent consistency model is above a consistency threshold.
If it is determined at 416 that the intent-consistency score is not above the consistency threshold, this indicates that there may be no intent consistency between the title T and the search query Q. In this case, the process 400 may proceed to 424, i.e., remove the sample S.
If it is determined at 416 that the intent-consistency score is above the consistency threshold, it indicates that there is likely intent consistency between the title T and the search query Q. In this case, process 400 may proceed to 418.
In accordance with embodiments of the present disclosure, to ensure the effectiveness of the training data set and improve the efficiency of model training, the title T included in each sample should be textually dissimilar from the search query Q. For example, the title T is textually similar to the search query Q when most of the words contained in the title T are the same as words contained in the search query Q. Training a machine learning model, such as a question generation model, with such samples would be inefficient, and the improvement to model performance would be limited. Thus, the sample S may be removed when it is determined that there is textual similarity between the title T in the sample S and the search query Q in the sample S.
For example, at 418, a textual similarity score between the title T and the search query Q may be predicted by a text similarity model. The text similarity model may be a model specifically adapted to predict textual similarity scores between titles and search queries. As an example, the text similarity model may be a Jaccard similarity calculation model. At 420, it may be determined whether the textual similarity score predicted by the text similarity model is below a similarity threshold.
If it is determined at 420 that the text similarity score is not below the similarity threshold, it indicates that the title T and the search query Q may be textually similar. In this case, the process 400 may proceed to 424, i.e., remove the sample S.
If it is determined at 420 that the text similarity score is below the similarity threshold, it indicates that the title T may not be textually similar to the search query Q. In this case, the process 400 may proceed to 422, i.e., retain the sample S.
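As a minimal illustration of steps 418 and 420, a word-level Jaccard similarity between the title T and the search query Q might be computed as follows; the whitespace tokenization, the omission of punctuation handling, and the threshold value are simplifying assumptions for illustration.

def jaccard_similarity(title: str, query: str) -> float:
    # Word-level Jaccard similarity: |intersection| / |union| of word sets.
    # (Punctuation handling is omitted in this simplified sketch.)
    title_words = set(title.lower().split())
    query_words = set(query.lower().split())
    union = title_words | query_words
    return len(title_words & query_words) / len(union) if union else 0.0

SIMILARITY_THRESHOLD = 0.5  # assumed value; the disclosure does not fix one

score = jaccard_similarity("how to calculate the density", "compute density")
print(score < SIMILARITY_THRESHOLD)  # True: the sample would be retained at 422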
In process 400, whether the sample S should be removed may be determined based on a variety of factors. The first factor relates to intent consistency between the title T in the sample S and the search query Q in the sample S. For example, whether there is intent consistency between the title T and the search query Q may be determined through step 404 and/or steps 414 and 416. The second factor relates to the quality of the title T in the sample S. For example, whether the title T satisfies the predetermined quality requirement may be determined through step 406 and/or step 408. The third factor relates to textual similarity between the title T in the sample S and the search query Q in the sample S. For example, whether there is textual similarity between the title T and the search query Q may be determined through steps 418 and 420. By taking the above three factors into consideration, samples that do not satisfy the above conditions can be removed from the initial training data set, so that a training data set including high-quality samples that satisfy requirements in terms of intent consistency, title quality, textual similarity, and the like can be obtained. In addition, since the training data set is based on a huge number of search logs and obtained in a fully automated manner without human intervention, it may include a large number of high-quality samples. Such a training data set may help train a question generation model with better performance.
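The per-sample decision of process 400 can be condensed into a single filtering function, sketched below in Python (3.9+). The CQA domain list, the interrogative word list, the site-name suffixes, the thresholds, and the intent consistency model stub are all assumptions for illustration; jaccard_similarity is the illustrative function sketched above, and the beginning-of-title check follows the variant described for languages such as English, French, and German.

CQA_DOMAINS = ("quora.com", "zhihu.com")   # predetermined addresses (assumed)
INTERROGATIVES = ("why", "what", "where", "when", "how", "who")
SITE_SUFFIXES = (" - Quora",)              # predetermined words (assumed)

def keep_sample(query, title, address, was_clicked, intent_model,
                consistency_threshold=0.8, similarity_threshold=0.5):
    if not was_clicked:                               # step 404: no access ->
        return None                                   # no intent consistency
    if not any(d in address for d in CQA_DOMAINS):    # step 406: quality via
        return None                                   # the predetermined address
    if not title.lower().startswith(INTERROGATIVES):  # step 408: query word at
        return None                                   # the beginning of the title
    for suffix in SITE_SUFFIXES:                      # steps 410/412: delete
        title = title.removesuffix(suffix)            # semantically unrelated words
    if intent_model(query, title) <= consistency_threshold:       # steps 414/416
        return None
    if jaccard_similarity(title, query) >= similarity_threshold:  # steps 418/420
        return None
    return (query, title)                             # step 422: retain the sample

The filtered training data set is then the collection of retained (query, title) pairs across all samples in the initial training data set.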
Through process 400, it may be determined whether the sample S should be removed. Generally, when the search query Q or the title T included in the sample S is based on certain languages, such as English, French, German, etc., the search query Q or the title T may have a particular format, e.g., uppercase and lowercase letters. Such formats often carry specific meaning. Preferably, the format of the search query Q or the title T included in the sample S may be preserved when processing the sample S, such as through the process 400, to enable a more accurate determination of whether the sample S should be removed.
It should be appreciated that process 400 fully considers the commonalities and characteristics of different languages and thus may be applied broadly to any language. Furthermore, it should be understood that the process for removing a sample described above in connection with FIG. 4 is merely exemplary. The steps in the process for removing a sample may be replaced or modified in any manner, and the process may include more or fewer steps, depending on actual application requirements. For example, in process 400, the language on which the search query Q or the title T in the sample S is based may be determined in advance, and one or more of the steps in process 400 may be performed based on the determined language. For example, when the title T is based on a predetermined language such as English, French, German, etc., it may be determined at 408 whether the beginning of the title T includes a query word, rather than whether the title T includes a query word anywhere. Further, the particular order or hierarchy of steps in process 400 is merely exemplary, and the process for removing a sample may be performed in an order different from that described. For example, the predetermined word deletion operations at 410 and 412 may be performed first, and the operation of determining whether the title T includes a query word at 408 may be performed afterwards. Further, it should be understood that the predetermined word deletion operations at 410 and 412 may update the sample S. Where a predetermined word in the title of at least one sample in the initial training data set is deleted, the initial training data set is updated accordingly. Accordingly, filtering operations on the initial training data set, such as the sample removal operations involved in process 400, may be performed on the updated initial training data set.
As described above with reference to FIG. 2, a question generation model trained using a training data set obtained according to an embodiment of the present disclosure is suitable for generating, from an input query, a question suitable for NLP model processing. According to embodiments of the present disclosure, the question generation model is also applicable to data enhancement of a training data set for an NLP task. The samples in the training data set for the NLP task may include queries. For example, samples in training data sets for a QA task, an MRC task, a QP relevance task, a QList relevance task, etc. may include queries. Taking the MRC task as an example, the task is to find an answer segment for a particular query from a given paragraph. Each sample in the training data set for the MRC task may be, for example, <query, paragraph, answer segment start position label, answer segment end position label>. For each sample, a question may be generated from the query in that sample by the question generation model. The generated question may be combined with the parts of the sample other than the query, such as the paragraph, the answer segment start position label, and the answer segment end position label, into a new sample. The new sample may be, for example, <question, paragraph, answer segment start position label, answer segment end position label>. New samples may be added to the training data set for the MRC task. In this way, the number of samples can be increased. In addition, since the question generated by the question generation model is a high-quality question, the quality of the samples is improved.
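A minimal sketch of this data enhancement step for an MRC training data set is shown below; the tuple layout mirrors the sample format described above, and generate_question stands in for the (illustrative) question generation function sketched earlier.

def augment_mrc_dataset(mrc_samples, generate_question):
    # mrc_samples: list of (query, paragraph, answer_start, answer_end) tuples.
    augmented = list(mrc_samples)
    for query, paragraph, answer_start, answer_end in mrc_samples:
        question = generate_question(query)
        # Combine the generated question with the parts of the sample other
        # than the query to form a new sample, then add it to the data set.
        augmented.append((question, paragraph, answer_start, answer_end))
    return augmented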
In addition, according to an embodiment of the present disclosure, a question generation model trained using a training data set obtained according to an embodiment of the present disclosure is adapted to be further trained with a training data set for an NLP task, so as to perform the NLP task. For example, the NLP task may be a GEC task. The GEC task aims to convert an input text segment with grammatical errors into a text segment without grammatical errors. The question generation model may be further trained with a training data set for the GEC task. The training data set for the GEC task may include a plurality of samples, each of which may include a text segment with grammatical errors and a text segment without grammatical errors. The question generation model further trained with the training data set for the GEC task is able to perform the GEC task when actually deployed, e.g., it can convert an input text segment with grammatical errors into a text segment without grammatical errors.
Further, as described above with reference to FIG. 2, a training data set for the question generation task obtained in accordance with embodiments of the present disclosure may be used to train the question generation model. However, embodiments of the present disclosure are not limited thereto. The training data set for the question generation task obtained according to the embodiments of the present disclosure may also be suitable for training NLP models other than the question generation model. Since the training data set for the question generation task is obtained in an automated and reliable manner based on a huge number of search logs, the training data set may include a large number of high-quality samples. Training data sets for other NLP models can be data-enhanced with the training data set for the question generation task to improve the sample quality and sample number of those training data sets, thereby training NLP models with better performance. As an example, the samples in the training data set for the question generation task may include a query and a question, where the query may be a text segment that does not follow correct grammatical structure and the question may be a text segment that follows correct grammatical structure. Thus, the training data set for the question generation task may also be used to train a GEC model. In one embodiment, the GEC model may be trained in two stages. In the first stage, the GEC model may be trained with the training data set for the question generation task. In the second stage, the GEC model may be further trained with the training data set for the GEC task. In another embodiment, the training data set for the question generation task and the training data set for the GEC task may be combined into a comprehensive training data set, and the GEC model may be trained using the comprehensive training data set. Because the training data set for the question generation task includes a large number of high-quality samples, the training data set for the GEC model is enhanced, which helps train a GEC model with better performance.
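The two training strategies just described can be summarized in the following sketch, where train is a stand-in for any seq2seq fine-tuning routine and the data sets are lists of (source text, target text) pairs; all names here are illustrative assumptions.

def train_gec_two_stage(model, qg_data_set, gec_data_set, train):
    # Stage 1: train on the question generation data set (query -> question).
    train(model, qg_data_set)
    # Stage 2: further train on the GEC data set (erroneous -> corrected text).
    train(model, gec_data_set)
    return model

def train_gec_merged(model, qg_data_set, gec_data_set, train):
    # Alternative: combine both data sets into a comprehensive training
    # data set and train on the combination in a single stage.
    train(model, qg_data_set + gec_data_set)
    return model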
FIG. 5 is a flow diagram of an exemplary method 500 for question generation from a query in accordance with an embodiment of the present disclosure.
At 510, a search log associated with a search engine can be obtained.
At 520, a plurality of search queries and a plurality of sets of search results corresponding to the plurality of search queries may be extracted from the search log.
At 530, a training data set for at least training a question generation model may be obtained from the plurality of search queries and the plurality of sets of search results. Each sample in the training data set may include a search query and a title of a search result. The title may be a question corresponding to the search query.
In one embodiment, the obtaining the training data set may include: generating an initial training data set based on the plurality of search queries and the plurality of search result sets; and filtering the initial training data set to obtain the training data set.
The generating an initial training data set may include: for each search query of the plurality of search queries, combining the search query with a title of each search result of a set of search results corresponding to the search query to obtain a set of samples corresponding to the search query; and combining a plurality of sample sets corresponding to the plurality of search queries into the initial training data set.
The filtering of the initial training data set may comprise, for each sample in the initial training data set: determining whether there is intent consistency between a title in the sample and a search query in the sample; and removing the sample in response to determining that there is no intent consistency between the title and the search query.

The determining whether there is intent consistency between the title and the search query may include: determining whether an address corresponding to the title was accessed.

The determining whether there is intent consistency between the title and the search query may include: determining, through an intent consistency model, whether there is intent consistency between the title and the search query.
The filtering the initial training data set may include, for each sample in the initial training data set: determining whether the titles in the sample meet a predetermined quality requirement; and in response to determining that the title does not meet the predetermined quality requirement, removing the sample.
The determining whether the title satisfies the predetermined quality requirement may include: determining whether the title satisfies the predetermined quality requirement based on at least one of a grammatical structure, spelling, and intent clarity of the title.
The determining whether the title satisfies the predetermined quality requirement may include: it is determined whether an address corresponding to the title is associated with a predetermined address.
The determining whether the title satisfies the predetermined quality requirement may include: it is determined whether the title includes a query word.
The title may be based on a predetermined language. The determining whether the title includes a query word may include: it is determined whether the beginning of the title includes a query word.
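By way of a non-limiting illustration, for English as the predetermined language this check may be sketched as follows. The query word list is an illustrative assumption, not a list prescribed by this disclosure:

# Common English query words; the actual predetermined list would be
# configured per language and per deployment.
QUERY_WORDS = {"what", "why", "how", "when", "where", "who", "which",
               "can", "do", "does", "is", "are"}

def title_begins_with_query_word(title):
    """Check whether the beginning of the title includes a query word."""
    words = title.strip().lower().split()
    return bool(words) and words[0] in QUERY_WORDS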
The filtering the initial training data set may include, for each sample in the initial training data set: determining whether there is textual similarity between a title in the sample and a search query in the sample; and removing the sample in response to determining that the title has textual similarity to the search query.
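By way of a non-limiting illustration, word-overlap (Jaccard) similarity is one simple way to detect such near-duplicate pairs; the disclosure does not prescribe a specific similarity measure, so both the metric and the threshold below are illustrative assumptions:

def is_textually_similar(query, title, threshold=0.5):
    """Flag (query, title) pairs whose texts largely overlap.

    A title that merely restates the query adds little as a question,
    so such samples are candidates for removal.
    """
    query_words = set(query.lower().split())
    title_words = set(title.lower().split())
    if not query_words or not title_words:
        return False
    jaccard = len(query_words & title_words) / len(query_words | title_words)
    return jaccard >= threshold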
The method 500 may further include: updating the initial training data set by deleting a predetermined word in a title in at least one sample in the initial training data set. The filtering the initial training data set may comprise: the updated initial training data set is filtered.
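By way of a non-limiting illustration, this updating step may be sketched as follows. The disclosure only refers to "a predetermined word", so the example word set below is an assumption made purely for illustration:

# Illustrative examples of words to strip from titles, such as
# boilerplate site-name suffixes; the actual predetermined words
# would be configured per deployment.
PREDETERMINED_WORDS = {"wikipedia", "forum", "official"}

def delete_predetermined_words(sample, predetermined_words=PREDETERMINED_WORDS):
    """Return the sample with predetermined words removed from its title."""
    kept = [w for w in sample["title"].split()
            if w.lower() not in predetermined_words]
    return {**sample, "title": " ".join(kept)}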
In one embodiment, the plurality of search queries and/or the plurality of sets of search results may be based on any language.
In one embodiment, the question generation model may be adapted to generate, from an input query, a question that is suitable for processing by an NLP model.
In one embodiment, the question generation model may be adapted to be further trained with a training data set for an NLP task, in order to perform the NLP task.
In one embodiment, the training data set may also be adapted to train NLP models other than the question generation model.
It should be understood that the method 500 may also include any steps/processes for question generation from queries according to embodiments of the present disclosure described above.
FIG. 6 illustrates an example apparatus 600 for question generation from a query in accordance with an embodiment of this disclosure.
The apparatus 600 may comprise: a search log obtaining module 610, for obtaining a search log associated with a search engine; a query and result extraction module 620, for extracting a plurality of search queries and a plurality of search result sets corresponding to the plurality of search queries from the search log; and a training data set obtaining module 630, for obtaining, from the plurality of search queries and the plurality of search result sets, a training data set at least for training a question generation model, each sample in the training data set including a search query and a title of a search result, the title being a question corresponding to the search query. Furthermore, the apparatus 600 may also include any other modules configured for question generation from a query according to the embodiments of the present disclosure described above.
Fig. 7 illustrates an example apparatus 700 for question generation from a query in accordance with an embodiment of this disclosure.
The apparatus 700 may include: at least one processor 710; and a memory 720 that stores computer-executable instructions. The computer-executable instructions, when executed, may cause the at least one processor 710 to: obtain a search log associated with a search engine, extract a plurality of search queries and a plurality of search result sets corresponding to the plurality of search queries from the search log, and obtain, from the plurality of search queries and the plurality of search result sets, a training data set at least for training a question generation model, each sample in the training data set including a search query and a title of a search result, the title being a question corresponding to the search query.
In one embodiment, the obtaining the training data set may include: generating an initial training data set based on the plurality of search queries and the plurality of search result sets; and filtering the initial training data set to obtain the training data set.
It should be understood that the processor 710 may also perform any other steps/processes of the method for question generation from queries according to embodiments of the present disclosure described above.
Embodiments of the present disclosure propose computer program products for question generation from a query, comprising a computer program for execution by at least one processor for: obtaining a search log associated with a search engine; extracting a plurality of search queries and a plurality of search result sets corresponding to the plurality of search queries from the search log; and obtaining, from the plurality of search queries and the plurality of search result sets, a training data set at least for training a question generation model, each sample in the training data set comprising a search query and a title of a search result, the title being a question corresponding to the search query. Furthermore, the computer program may also be executed to implement any other steps/processes of the method for question generation from queries according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure may be embodied in non-transitory computer readable media. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any operations of the method for question generation from queries according to embodiments of the present disclosure as described above.
It should be understood that all operations in the methods described above are exemplary only, and the present disclosure is not limited to any operations in the methods or the order of the operations, but rather should encompass all other equivalent variations under the same or similar concepts. In addition, the articles "a" and "an" as used in this specification and the appended claims should generally be construed to mean "one" or "one or more" unless specified otherwise or clear from context to be directed to a singular form.
It should also be understood that all of the modules in the above described apparatus may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. In addition, any of these modules may be further divided functionally into sub-modules or combined together.
The processor has been described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software depends upon the particular application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented with a microprocessor, a microcontroller, a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a state machine, gated logic units, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented using software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
Software shall be construed broadly to mean instructions, instruction sets, code segments, program code, programs, subprograms, software modules, applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, and the like. The software may reside in a computer readable medium. The computer-readable medium may include, for example, memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), a register, or a removable disk. Although the memory is shown separate from the processor in aspects presented in this disclosure, the memory may be located internal to the processor, such as a cache or registers.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described herein that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims (20)

1. A method for question generation from a query, comprising:
obtaining a search log associated with a search engine;
extracting a plurality of search queries and a plurality of search result sets corresponding to the plurality of search queries from the search log; and
obtaining, from the plurality of search queries and the plurality of search result sets, a training data set at least for training a question generation model, each sample in the training data set including a search query and a title of a search result, the title being a question corresponding to the search query.
2. The method of claim 1, wherein the obtaining a set of training data comprises:
generating an initial training data set based on the plurality of search queries and the plurality of search result sets; and
filtering the initial training data set to obtain the training data set.
3. The method of claim 2, wherein the generating an initial training data set comprises:
for each search query of the plurality of search queries, combining the search query with a title of each search result of a set of search results corresponding to the search query to obtain a set of samples corresponding to the search query; and
combining a plurality of sample sets corresponding to the plurality of search queries into the initial training data set.
4. The method of claim 2, wherein the filtering the initial training data set comprises, for each sample in the initial training data set:
determining whether there is an intent correspondence between a title in the sample and a search query in the sample; and
removing the sample in response to determining that there is no intent correspondence between the title and the search query.
5. The method of claim 4, wherein the determining whether there is intent correspondence between the title and the search query comprises:
it is determined whether an address corresponding to the title is accessed.
6. The method of claim 4, wherein the determining whether there is intent correspondence between the title and the search query comprises:
determining whether there is intent correspondence between the title and the search query through an intent correspondence model.
7. The method of claim 2, wherein the filtering the initial training data set comprises, for each sample in the initial training data set:
determining whether the title in the sample meets a predetermined quality requirement; and
in response to determining that the title does not meet the predetermined quality requirement, removing the sample.
8. The method of claim 7, wherein the determining whether the title meets a predetermined quality requirement comprises:
determining whether the title satisfies the predetermined quality requirement based on at least one of a grammatical structure, spelling, and intent clarity of the title.
9. The method of claim 7, wherein the determining whether the title meets a predetermined quality requirement comprises:
it is determined whether an address corresponding to the title is associated with a predetermined address.
10. The method of claim 7, wherein the determining whether the title meets a predetermined quality requirement comprises:
determining whether the title includes a query word.
11. The method of claim 10, wherein the title is based on a predetermined language, and said determining whether the title includes a query word comprises:
it is determined whether the beginning of the title includes a query word.
12. The method of claim 2, wherein the filtering the initial training data set comprises, for each sample in the initial training data set:
determining whether there is a textual similarity between a title in the sample and a search query in the sample; and
removing the sample in response to determining that the title has textual similarity to the search query.
13. The method of claim 2, further comprising:
updating the initial training data set by deleting a predetermined word in a title in at least one sample in the initial training data set, and
wherein the filtering the initial training data set comprises: the updated initial training data set is filtered.
14. The method of claim 1, wherein the plurality of search queries and/or the plurality of sets of search results are based on any language.
15. The method of claim 1, wherein the question generation model is adapted to generate, from an input query, a question suitable for processing by a Natural Language Processing (NLP) model.
16. The method of claim 1, wherein the question generation model is adapted to be further trained with a training data set for an NLP task, in order to perform the NLP task.
17. The method of claim 1, wherein the training data set is further adapted to train an NLP model other than the question generation model.
18. An apparatus for question generation from a query, comprising:
at least one processor; and
a memory storing computer-executable instructions that, when executed, cause the at least one processor to:
obtain a search log associated with a search engine,
extract a plurality of search queries and a plurality of search result sets corresponding to the plurality of search queries from the search log, and
obtain, from the plurality of search queries and the plurality of search result sets, a training data set at least for training a question generation model, each sample in the training data set including a search query and a title of a search result, the title being a question corresponding to the search query.
19. The apparatus of claim 18, wherein the obtaining a set of training data comprises:
generating an initial training data set based on the plurality of search queries and the plurality of search result sets; and
filtering the initial training data set to obtain the training data set.
20. A computer program product for question generation from a query, comprising a computer program for execution by at least one processor to:
obtaining a search log associated with a search engine;
extracting a plurality of search queries and a plurality of search result sets corresponding to the plurality of search queries from the search log; and
obtaining, from the plurality of search queries and the plurality of search result sets, a training data set at least for training a question generation model, each sample in the training data set comprising a search query and a title of a search result, the title being a question corresponding to the search query.