CN111428119A - Query rewriting method and device and electronic equipment - Google Patents

Query rewriting method and device and electronic equipment Download PDF

Info

Publication number
CN111428119A
Authority
CN
China
Prior art keywords
rewriting
model
training
query
query input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010100144.XA
Other languages
Chinese (zh)
Inventor
王宗宇
杨俭
万峻辰
王金刚
谢睿
张富峥
王仲远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202010100144.XA priority Critical patent/CN111428119A/en
Publication of CN111428119A publication Critical patent/CN111428119A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9532Query formulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a query rewriting method, which belongs to the technical field of computers and is beneficial to improving the accuracy of query rewriting based on the prediction results of a rewriting model. The method comprises the following steps: respectively training an admission model and a rewriting model based on a preset first data set, wherein the preset first data set comprises a plurality of query input pairs; performing rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewriting model, and iteratively training the admission model according to the rewrite prediction results; screening the plurality of query input pairs in the first data set through the admission model after iterative training, and iteratively training the rewriting model according to the screening results; and repeating the iterative training process until it meets a preset training condition, then performing query rewrite prediction through the rewriting model after iterative training, so that the accuracy of query rewriting based on the prediction results of the rewriting model can be improved.

Description

Query rewriting method and device and electronic equipment
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a query rewriting method, a query rewriting device, electronic equipment and a computer-readable storage medium.
Background
Query rewriting (also called query expansion) is a method for rewriting the query input a user submits to a search engine, so as to improve the accuracy of the search engine's recall results. For example, in some general-purpose search engines there are usually different expressions for the same business or product; for example, "hot pot" is also known as "rinse pot". Sometimes different words express the same user intent, such as "wedding photos" and "bridal photos", or "fitting glasses" and "glasses shop". Query rewriting can be viewed as a recall optimization for a search engine: without changing the user's intent, it recalls as many search results matching that intent as possible. The query rewriting techniques commonly used in industry at present mainly include the following three types. First, synonym replacement: synonyms of the query input are obtained through a word-vector model or a large synonym forest, and the synonymous fragments in the query input are then directly replaced with the obtained synonyms to form a new query input. Second, data mining: there are generally two mining methods; 1) query entries (queries) that can be rewritten are mined from the user's search behavior over a historical period of time; 2) according to the click relations between query inputs and documents in the logs, rewritable query input pairs are mined through a graph algorithm, the most representative being SimRank++. Third, translation: a rewrite model is trained with statistical machine translation (SMT) or neural machine translation (NMT) methods, using rewritable pairs as training data, and the query input is then rewritten by this model.
The above prior-art methods have at least the following problems: the synonym replacement method has narrow data coverage; the data mining method cannot cover long-tail traffic; the translation method can in theory solve the long-tail traffic problem and has better coverage than synonym replacement, but its rewriting accuracy and coverage depend on the samples used to train the translation model. To improve sample quality and ensure the accuracy of the rewritable query input pairs used to train the translation model, the samples are usually labeled manually, which results in high labor cost.
It can be seen that the query rewrite method in the prior art still needs to be improved.
Disclosure of Invention
The embodiment of the application discloses a query rewriting method which is beneficial to improving the accuracy of query rewriting.
In a first aspect, an embodiment of the present application discloses a query rewrite method, including:
step S1, respectively training an admission model and a rewriting model based on a preset first data set, wherein the preset first data set comprises a plurality of query input pairs;
step S2, performing rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewriting model, and iteratively training the admission model according to the rewrite prediction results;
step S3, screening the plurality of query input pairs in the first data set through the admission model after iterative training, and iteratively training the rewriting model according to the screening result;
step S4, in response to the training process satisfying a preset training condition, performing query rewrite prediction through the rewriting model after iterative training;
and step S5, in response to the training process not satisfying the preset training condition, performing rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewriting model after iterative training, iteratively training the admission model according to the rewrite prediction results, and jumping to step S3.
In a second aspect, an embodiment of the present application discloses a query rewriting apparatus, including:
the system comprises an admission model and rewriting model initial training module, a data processing module and a data processing module, wherein the admission model and rewriting model initial training module is used for respectively training an admission model and a rewriting model based on a preset first data set, and the preset first data set comprises a plurality of query input pairs;
an admission model first iterative training module, configured to perform rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewrite model, and iteratively train the admission model according to a rewrite prediction result;
the rewriting model iterative training module is used for screening the plurality of query input pairs in the first data set through the admission model after iterative training, and iteratively training the rewriting model according to a screening result;
the judging module is used for judging whether the iterative training process meets the preset training condition or not;
the rewrite prediction module is used for, in response to the training process satisfying a preset training condition, performing query rewrite prediction through the rewriting model after iterative training;
and the admission model second iterative training module is used for, in response to the training process not satisfying a preset training condition, performing rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewriting model after iterative training, iteratively training the admission model according to the rewrite prediction result, and then jumping to the rewriting model iterative training module.
In a third aspect, an embodiment of the present application further discloses an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the query rewrite method according to the embodiment of the present application when executing the computer program.
In a fourth aspect, embodiments of the present application disclose a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the steps of the query rewrite method disclosed in embodiments of the present application.
The query rewriting method disclosed in the embodiment of the present application respectively trains an admission model and a rewriting model based on a preset first data set comprising a plurality of query input pairs; then performs rewrite prediction on a part of these query input pairs through the rewriting model and iteratively trains the admission model according to the rewrite prediction results; screens the query input pairs in the first data set through the iteratively trained admission model and iteratively trains the rewriting model according to the screening results; in response to the training process satisfying a preset training condition, performs query rewrite prediction through the iteratively trained rewriting model; and in response to the training process not satisfying the preset training condition, performs rewrite prediction on a part of the query input pairs through the iteratively trained rewriting model, iteratively trains the admission model according to the results, and jumps back to the step of iteratively training the rewriting model, thereby improving the accuracy of query rewriting based on the rewriting model's prediction results.
The foregoing is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be understood more clearly and implemented according to the content of this description, and in order that the above and other objects, features, and advantages of the present application may be more readily apparent, a detailed description of the present application is given below.
Drawings
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
FIG. 1 is a flowchart of a query rewrite method according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a combined model formed by an admission model and a rewrite model in a first embodiment of the present application;
FIG. 3 is a schematic structural diagram of a query rewrite apparatus according to a second embodiment of the present application;
FIG. 4 is a second schematic structural diagram of a query rewrite apparatus according to a second embodiment of the present application;
FIG. 5 schematically shows a block diagram of an electronic device for performing a method according to the present application; and
fig. 6 schematically shows a storage unit for holding or carrying program code implementing a method according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example one
As shown in fig. 1, a query rewriting method disclosed in an embodiment of the present application includes: step S1 to step S6.
And step S1, respectively training the admission model and the rewriting model based on the preset first data set.
The preset first data set comprises a plurality of query input pairs.
The admission model is a network model used to predict, for a query input pair, whether the two query inputs can be rewritten with each other; the rewriting model is a network model used to predict, for an input query, its rewrite candidate words and their corresponding probabilities. The admission model is built on a classification model: for example, it can be trained based on classification networks such as a CNN (Convolutional Neural Network) or XGBoost (eXtreme Gradient Boosting). The rewriting model can be trained as a Seq2Seq model with an Encoder-Decoder structure based on an LSTM (Long Short-Term Memory) network. In this embodiment, the description takes as an example training the admission model based on BERT, which performs well on NLP (Natural Language Processing) tasks, and training the rewriting model as an LSTM-based neural machine translation model.
The admission model and the rewriting model in the embodiment of the present application form a cascade structure, as shown in fig. 2: the admission model performs prediction on the first data set, and the mutually rewritable query input pairs it accepts (i.e., positive examples) serve as training data for the rewriting model; at the same time, the rewriting model performs prediction on the first data set, and the iterative training data set it generates serves as training data for the admission model. In a specific implementation, the admission model and the rewriting model are first trained offline separately, and the cascaded models are then trained jointly.
In some embodiments of the present application, before respectively training the admission model and the rewriting model based on the first data set, the method further includes: mining the first data set based on a query log.
The methods for mining candidate query input (query) pairs from the query log include session-based data mining and similarity-based (SimRank) data mining. In session-based data mining, query input pairs that may be rewritten are mined, for example, from the sequence of a user's search keywords over a period of time (e.g., 30 minutes). For example, if a user first searches for "rinse pot" and then for "hot pot", then "rinse pot" and "hot pot" can be rewritten with each other and can form a query input pair. In similarity-based data mining, rewritable query input pairs are mainly mined from the behavior of clicking the same recall result for different query inputs. For example, if for the two query inputs "rinse pot" and "hot pot" users click the same recall result (e.g., the restaurant "Haidilao Hot Pot"), then "rinse pot" and "hot pot" are considered mutually rewritable and can form a query input pair.
In other embodiments of the present application, the query input pair may also be mined by using other methods based on the query log, which is not illustrated in this embodiment.
Further, each query input pair obtained by mining may be used as sample data, and a rewrite probability tag is set for it to generate a training sample, where the rewrite probability tag indicates whether the query inputs included in the pair can be rewritten with each other. For example, the rewrite probability tag of the query input pair ("rinse pot", "hot pot") is set to 1, identifying that "rinse pot" and "hot pot" can be rewritten with each other.
On the other hand, query input pairs consisting of query inputs that cannot be rewritten with each other may be determined by the same methods. For example, if users never click the same recall result for the two query inputs "rinse pot" and "fried rice", then "rinse pot" and "fried rice" are considered not rewritable with respect to each other and may constitute a negative-example query input pair. For example, the rewrite probability tag of the query input pair ("rinse pot", "fried rice") is set to 0, identifying that "rinse pot" and "fried rice" in the query input pair are not rewritable with each other.
By the above methods, a number of mutually rewritable query input pairs can be mined as positive samples and a number of non-rewritable query input pairs as negative samples; the positive and negative samples form the first data set. Each training sample in the first data set may be represented as (query1, query2, label), where label indicates whether the query inputs query1 and query2 can be rewritten with each other. For example, one positive sample is ("hot pot", "rinse pot", 1), and one negative sample is ("rinse pot", "fried rice", 0).
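To make the mining procedure concrete, the following is a minimal Python sketch (not taken from the patent; the function names, the 30-minute window, and the log formats are illustrative assumptions) of session-based and co-click-based pair mining and of assembling the (query1, query2, label) samples:

```python
from collections import defaultdict
from itertools import combinations

def mine_session_pairs(log, window_secs=30 * 60):
    """log: list of (user_id, timestamp, query). Pairs queries a user issued
    within the same time window, e.g. ("rinse pot", "hot pot")."""
    by_user = defaultdict(list)
    for user, ts, query in log:
        by_user[user].append((ts, query))
    pairs = set()
    for events in by_user.values():
        events.sort()
        for (t1, q1), (t2, q2) in combinations(events, 2):
            if q1 != q2 and t2 - t1 <= window_secs:
                pairs.add((q1, q2))
    return pairs

def mine_coclick_pairs(click_log):
    """click_log: list of (query, clicked_doc_id). Pairs queries whose users
    clicked the same recall result."""
    by_doc = defaultdict(set)
    for query, doc in click_log:
        by_doc[doc].add(query)
    pairs = set()
    for queries in by_doc.values():
        pairs.update(combinations(sorted(queries), 2))
    return pairs

def build_first_data_set(positive_pairs, negative_pairs):
    # Each training sample is (query1, query2, label): label 1 marks a
    # mutually rewritable (positive) pair, label 0 a negative pair.
    return ([(q1, q2, 1) for q1, q2 in positive_pairs]
            + [(q1, q2, 0) for q1, q2 in negative_pairs])
```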
After a first data set comprising a number of query input pairs is obtained, an admission model and a rewrite model are first trained offline based on the first data set.
When training the admission model, mutually rewritable query input pairs are used as positive input examples, and non-rewritable query input pairs as negative input examples. From the input pairs, the BERT model learns which characteristics the positive-example pairs share and which the negative-example pairs share; during online prediction it then predicts, from the learned commonalities, the probability that an input pair is mutually rewritable and the probability that it is not. BERT is a deep neural network model: in training it learns the semantic information of each query input in a pair as features; mutually rewritable pairs can be considered semantically identical, and non-rewritable pairs semantically different. The final purpose of the BERT model is therefore to learn the semantic information of query input pairs, and thereby how to judge whether two query inputs have the same semantics and can serve as a mutually rewritable query input pair.
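As one possible concrete form of this training step, the following hedged sketch fine-tunes BERT as a sentence-pair classifier using the Hugging Face transformers library; the patent does not prescribe a library, and all hyperparameters here are illustrative:

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)  # classes: rewritable / not rewritable

def encode_batch(batch):
    # batch: list of (query1, query2, label) tuples from the first data set.
    q1, q2, labels = zip(*batch)
    enc = tokenizer(list(q1), list(q2), padding=True, truncation=True,
                    max_length=32, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

def train_admission_model(first_data_set, epochs=3):
    loader = DataLoader(first_data_set, batch_size=32, shuffle=True,
                        collate_fn=encode_batch)
    optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # cross-entropy over the two classes
            loss.backward()
            optim.step()
            optim.zero_grad()
    return model
```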
When training the rewriting model, one query input from a mutually rewritable pair is used as the model's input, and the N query inputs that can be rewritten with it (i.e., the other query input from each pair that contains this query input and whose rewrite probability label indicates the two can be rewritten with each other) together with the rewrite probability label (e.g., 1) are used as the model's outputs; N may be set to any integer greater than 1 as desired. The learning target of the rewriting model is to learn rewritable information from the input data (i.e., pairs of mutually rewritable query inputs). Typically a query input pair is entered, such as "Sichuan hot pot" and "Sichuan rinse pot"; from such data the model learns and memorizes the characteristics of mutually rewritable data, in this example that "hot pot" can be rewritten as "rinse pot". The more training samples there are, the more mutually rewritable information the rewriting model can learn and memorize. When the trained rewriting model is applied to rewrite prediction, it outputs, for a query input, N rewrite candidate words and the rewrite probability value corresponding to each.
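The following is a minimal PyTorch sketch of such an LSTM-based Encoder-Decoder (Seq2Seq) rewriting model, assuming token IDs as inputs; vocabulary construction and the beam search that would produce the N rewrite candidates are omitted, and all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class RewriteModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode the source query; the final state initializes the decoder.
        _, state = self.encoder(self.embed(src_ids))
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # logits over the vocabulary per position

def train_step(model, optim, src_ids, tgt_ids):
    # Teacher forcing: predict token t+1 from target tokens up to t.
    logits = model(src_ids, tgt_ids[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt_ids[:, 1:].reshape(-1))
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```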
Since the first data set mined in an unsupervised manner usually has low accuracy but high recall, high-quality rewritable data needs to be screened from it. In the embodiment of the present application, through further data screening and iterative model training, the computer processes external data to generate more accurate rewrite data; the admission model and the rewriting model learn this more accurate rewrite data to extract the commonalities in it, and then compute prediction results for input data based on the extracted commonalities.
And step S2, performing rewriting prediction on a part of the plurality of query input pairs in the first data set through the rewriting model, and iteratively training the admission model according to a rewriting prediction result.
In the embodiment of the present application, each query input pair is provided with a rewrite probability tag indicating whether the query inputs it contains can be rewritten with each other. For example, a rewrite probability tag of 1 indicates that the query inputs included in the pair can be rewritten with each other, i.e., the pair is a rewrite positive example; a rewrite probability tag of 0 indicates that they cannot, i.e., the pair is a rewrite negative example.
In some embodiments of the present application, the part of the plurality of query input pairs used for rewrite prediction consists of the query input pairs whose rewrite probability label indicates that the included query inputs can be rewritten with each other. For example, if the first data set includes 1000 positive examples, 200 positive examples (query input pairs with rewrite probability label 1, such as pairs represented by (query1, query2, 1)) may be randomly selected; one query input (e.g., query1) of each of the 200 pairs is input to the rewriting model trained in step S1, which performs rewrite prediction, so that each query input yields a group of rewrite results and 200 groups of rewrite results are obtained.
In some embodiments of the present application, the rewrite prediction result indicates a preset first number of rewrite candidate words corresponding to the query input fed to the rewriting model, and the probability that each rewrite candidate word and the query input can be rewritten with each other. For example, each group of rewrite results includes N rewrite candidate words, each with a probability indicating its rewritability.
In some embodiments of the present application, iteratively training the admission model according to the rewrite prediction result includes: generating a positive query input pair sample from the rewrite candidate word with the maximum probability and the query input; generating negative query input pair samples with the query input from a preset second number of the rewrite candidate words with the lowest probabilities; generating an iterative training data set from the positive and negative samples produced by the rewrite prediction results of the part of the plurality of query input pairs; and iteratively training the admission model based on the iterative training data set; wherein the preset second number is smaller than the preset first number. For example, for query input query1, the rewrite candidate word with the highest probability is recombined with query1 into a positive sample, and the other rewrite candidate words are each recombined with query1 into N-1 negative samples. Following this method, 200 positive samples and 200 x (N-1) negative samples are generated for the 200 selected query input pairs, forming an iterative training data set.
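A sketch of this sample-generation step is shown below; `predict_top_n` is a hypothetical helper (not defined in the patent) that wraps the rewriting model's top-N candidate generation, and the default numbers follow the text's example of N-1 negatives:

```python
def build_iterative_training_set(rewrite_model, positive_pairs,
                                 first_number=10, second_number=9):
    """For each sampled positive pair, keep the top-1 candidate as a new
    positive sample and the lowest-probability candidates as negatives."""
    iterative_set = []
    for query1, _query2, _label in positive_pairs:
        # Hypothetical helper: returns [(candidate_word, probability), ...]
        # sorted from highest to lowest probability.
        candidates = predict_top_n(rewrite_model, query1, n=first_number)
        best_word, _ = candidates[0]
        iterative_set.append((query1, best_word, 1))       # positive sample
        for word, _prob in candidates[-second_number:]:    # lowest-probability
            iterative_set.append((query1, word, 0))        # negative samples
    return iterative_set
```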
Next, based on the generated iterative training data set, the admission model trained in step S1 is iteratively trained.
And step S3, screening the plurality of query input pairs in the first data set through the admission model after iterative training, and iteratively training the rewriting model according to the screening result.
In some embodiments of the present application, screening the plurality of query input pairs in the first data set through the iteratively trained admission model and iteratively training the rewriting model according to the screening result includes: predicting the rewrite-positive query input pairs in the first data set through the iteratively trained admission model, and iteratively training the rewriting model based on those positive pairs. For example, the admission model obtained after the iterative training in step S2 (or through subsequently performed iterative training) predicts the query input pairs in the first data set, and the pairs it accepts may be used as positive-example query input pairs. In the admission model's prediction process, after a query input pair is input, the prediction result output by the model is the probability that the pair is mutually rewritable; when this probability is greater than 0.5, the two query inputs in the pair can be considered mutually rewritable, i.e., the input pair is a rewrite positive example, and otherwise it is not. The rewriting model described in the embodiments of the present application is then iteratively trained based on the predicted rewrite positive examples.
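A sketch of this screening step follows; `admission_probability` is a hypothetical helper (not defined in the patent) returning the admission model's predicted probability that a pair is mutually rewritable, and the 0.5 threshold matches the text above:

```python
def screen_positive_pairs(admission_model, first_data_set, threshold=0.5):
    """Keep the pairs the admission model judges mutually rewritable."""
    positives = []
    for query1, query2, _label in first_data_set:
        # Hypothetical helper: P(query1 and query2 are mutually rewritable).
        if admission_probability(admission_model, query1, query2) > threshold:
            positives.append((query1, query2, 1))
    return positives
```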
In some embodiments of the present application, the iterative training process may be terminated at this point, and query rewrite prediction may be performed through the iteratively trained rewriting model.
In order to further improve the prediction accuracy of the trained rewriting model, in other embodiments of the present application it is necessary to determine, according to a preset training condition, whether to continue the iterative training.
Step S6, determining whether the iterative training process satisfies a preset training condition, if yes, performing step S4, otherwise, performing step S5.
The preset training conditions in the embodiment of the present application include: the number of iterations reaching a preset count, or the prediction accuracy of the rewriting model reaching a preset accuracy threshold. The preset count and the preset accuracy threshold are determined according to the rewriting model's test results.
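Putting steps S1 to S5 together, the overall joint training loop might look like the following sketch; `train_rewrite_model`, `positives_of`, `sample_positives`, and `evaluate_accuracy` are hypothetical helpers, while `train_admission_model`, `build_iterative_training_set`, and `screen_positive_pairs` refer to the sketches above:

```python
def joint_training(first_data_set, max_iters=10, accuracy_threshold=0.9):
    # Step S1: initial offline training of both models on the first data set.
    admission_model = train_admission_model(first_data_set)
    rewrite_model = train_rewrite_model(positives_of(first_data_set))
    for _ in range(max_iters):
        # Steps S2/S5: the rewriting model's predictions on sampled positive
        # pairs yield fresh training data for the admission model.
        sampled = sample_positives(first_data_set, k=200)
        iter_set = build_iterative_training_set(rewrite_model, sampled)
        admission_model = train_admission_model(iter_set)
        # Step S3: the admission model screens the first data set; accepted
        # positives retrain the rewriting model.
        screened = screen_positive_pairs(admission_model, first_data_set)
        rewrite_model = train_rewrite_model(screened)
        # Step S4: stop once a preset training condition is met.
        if evaluate_accuracy(rewrite_model) >= accuracy_threshold:
            break
    return rewrite_model
```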
In some embodiments of the present application, before performing rewrite prediction on a part of the plurality of query input pairs in the first data set through the iteratively trained rewriting model and iteratively training the admission model according to the rewrite prediction result, it may be determined whether the iterative training process satisfies the preset training condition. If so, the iterative training process ends and the trained rewriting model is used directly for rewrite prediction; if not, rewrite prediction on a part of the query input pairs in the first data set continues through the iteratively trained rewriting model, the admission model is iteratively trained according to the rewrite prediction results, and execution jumps to step S3 to repeat the step of iteratively training the rewriting model.
And step S4, responding to the training process meeting the preset training condition, and performing query rewriting prediction through the rewriting model after iterative training.
Thus, the iterative training process of the rewriting model is completed.
After the iterative training process is completed, online query rewrite prediction can be performed through the iteratively trained rewriting model. For example, when a query input is fed to the rewriting model that has completed iterative training, the model outputs N rewrite candidate words and the rewrite probability corresponding to each.
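For example, serving-time use of the trained model might look like the following, again assuming the hypothetical `predict_top_n` helper; the printed probability is illustrative:

```python
# Hypothetical helper `predict_top_n` returns [(candidate, probability), ...].
candidates = predict_top_n(rewrite_model, "rinse pot", n=10)
for word, prob in candidates:
    print(f"{word}\t{prob:.3f}")  # e.g. "hot pot" with a high probability
```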
And step S5, in response to the training process not satisfying the preset training condition, performing rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewriting model after iterative training, iteratively training the admission model according to the rewrite prediction results, and jumping to step S3.
If the current iterative training process does not satisfy the preset training condition (e.g., the prediction accuracy of the rewriting model has not reached the preset accuracy threshold), rewrite prediction is performed on a part of the plurality of query input pairs in the first data set through the rewriting model iteratively trained in step S3, to screen data for iteratively training the admission model.
In some embodiments of the present application, the part of the plurality of query input pairs used for rewrite prediction consists of the query input pairs whose rewrite probability label indicates that the included query inputs can be rewritten with each other. For example, if the first data set includes 1000 positive examples, 200 positive examples (query input pairs with rewrite probability label 1, such as pairs represented by (query1, query2, 1)) may be randomly selected; one query input (e.g., query1) of each of the 200 pairs is input to the rewriting model trained in step S3, which performs rewrite prediction, so that each query input yields a group of rewrite results and 200 groups of rewrite results are obtained.
In some embodiments of the present application, the rewrite prediction result indicates a preset first number of rewrite candidate words corresponding to the query input fed to the rewriting model, and the probability that each rewrite candidate word and the query input can be rewritten with each other. For example, each group of rewrite results includes N rewrite candidate words, each with a probability indicating its rewritability.
In some embodiments of the present application, iteratively training the admission model according to the rewrite prediction result includes: generating a positive query input pair sample from the rewrite candidate word with the maximum probability and the query input; generating negative query input pair samples with the query input from a preset second number of the rewrite candidate words with the lowest probabilities; generating an iterative training data set from the positive and negative samples produced by the rewrite prediction results of the part of the plurality of query input pairs; and iteratively training the admission model based on the iterative training data set; wherein the preset second number is smaller than the preset first number. For example, for query input query1, the rewrite candidate word with the highest probability is recombined with query1 into a positive sample, and the other rewrite candidate words are each recombined with query1 into N-1 negative samples. Following this method, 200 positive samples and 200 x (N-1) negative samples are generated for the 200 selected query input pairs, forming an iterative training data set.
Then, the previously trained admission model is iteratively trained based on the generated iterative training data set.
Thereafter, the process proceeds to step S3, and the step of iteratively training the rewrite model is repeated.
The query rewriting method disclosed in the embodiment of the present application respectively trains an admission model and a rewriting model based on a preset first data set comprising a plurality of query input pairs; performs rewrite prediction on a part of these query input pairs through the rewriting model and iteratively trains the admission model according to the rewrite prediction results; screens the query input pairs in the first data set through the iteratively trained admission model and iteratively trains the rewriting model according to the screening results; in response to the training process satisfying a preset training condition, performs query rewrite prediction through the iteratively trained rewriting model; and in response to the training process not satisfying the preset training condition, again performs rewrite prediction through the iteratively trained rewriting model, iteratively trains the admission model according to the results, and jumps back to the step of iteratively training the rewriting model, thereby improving the accuracy of query rewriting based on the rewriting model's prediction results. By introducing joint training of the rewriting model and the admission model, the method solves the problem of inaccurate predictions caused by low-quality training data: the admission model improves the quality of the rewriting model's training data and thus optimizes the rewriting model, while the rewriting model improves the quality of the admission model's training data and thus optimizes the admission model. Through this joint training, the performance of both models improves, and therefore so does the accuracy of query rewriting based on the rewriting model.
On the other hand, in the prior art a rewriting model is usually trained on data with high labeling quality, and data labeling consumes a large amount of labor cost; the joint iterative training disclosed herein screens high-quality training data automatically and thus avoids this labeling cost.
Example two
As shown in fig. 3, a query rewriting device disclosed in an embodiment of the present application includes:
an admission model and rewrite model initial training module 310, configured to train an admission model and a rewrite model respectively based on a preset first data set, where the preset first data set includes a plurality of query input pairs;
an admission model first iterative training module 320, configured to perform rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewrite model, and iteratively train the admission model according to a rewrite prediction result;
a rewriting model iterative training module 330, configured to screen the plurality of query input pairs in the first data set through the admission model after iterative training, and iteratively train the rewriting model according to a screening result;
a judging module 340, configured to judge whether the iterative training process meets a preset training condition;
a rewrite prediction module 350, configured to perform, in response to the training process satisfying a preset training condition, query rewrite prediction through the rewriting model after iterative training;
and an admission model second iterative training module 360, configured to, in response to the training process not satisfying a preset training condition, perform rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewriting model after iterative training, iteratively train the admission model according to the rewrite prediction result, and then jump to the rewriting model iterative training module.
In some embodiments of the present application, each query input pair is provided with a rewrite probability tag indicating whether the query inputs it contains can be rewritten with each other, and the part of the query input pairs used for rewrite prediction consists of the pairs whose rewrite probability label indicates that the included query inputs can be rewritten with each other.
In some embodiments of the present application, the rewrite prediction result indicates a preset first number of rewrite candidate words corresponding to the query input fed to the rewriting model, and the probability that each rewrite candidate word and the query input can be rewritten with each other; iteratively training the admission model according to the rewrite prediction result includes:
generating a positive query input pair sample from the rewrite candidate word with the maximum probability and the query input; generating negative query input pair samples with the query input from a preset second number of the rewrite candidate words with the lowest probabilities;
generating an iterative training data set from the positive and negative samples produced by the rewrite prediction results of the part of the plurality of query input pairs;
iteratively training the admission model based on the iterative training data set;
wherein the preset second number is smaller than the preset first number.
In some embodiments of the present application, screening the plurality of query input pairs in the first data set through the iteratively trained admission model and iteratively training the rewriting model according to the screening result includes:
predicting the rewrite-positive query input pairs in the first data set through the iteratively trained admission model, and iteratively training the rewriting model based on those rewrite-positive query input pairs.
In some embodiments of the present application, as shown in fig. 4, the apparatus further comprises:
a first data set mining module 370 for mining a first data set based on the query log.
The query rewriting device disclosed in the embodiment of the present application is used to implement the query rewriting method described in the first embodiment of the present application, and specific implementation manners of each module of the device are not described again, and reference may be made to specific implementation manners of corresponding steps in the method embodiments.
The query rewriting device disclosed in the embodiment of the present application respectively trains an admission model and a rewriting model based on a preset first data set comprising a plurality of query input pairs; performs rewrite prediction on a part of these query input pairs through the rewriting model and iteratively trains the admission model according to the rewrite prediction results; screens the query input pairs in the first data set through the iteratively trained admission model and iteratively trains the rewriting model according to the screening results; in response to the training process satisfying a preset training condition, performs query rewrite prediction through the iteratively trained rewriting model; and in response to the training process not satisfying the preset training condition, again performs rewrite prediction through the iteratively trained rewriting model, iteratively trains the admission model according to the results, and jumps back to iteratively training the rewriting model, thereby improving the accuracy of query rewriting based on the rewriting model's prediction results. By introducing joint training of the rewriting model and the admission model, the device solves the problem of inaccurate predictions caused by low-quality training data: the admission model improves the quality of the rewriting model's training data and thus optimizes the rewriting model, while the rewriting model improves the quality of the admission model's training data and thus optimizes the admission model. Through this joint training, the performance of both models improves, and therefore so does the accuracy of query rewriting based on the rewriting model.
On the other hand, in the prior art a rewriting model is usually trained on data with high labeling quality, and data labeling consumes a large amount of labor cost; the joint iterative training disclosed herein screens high-quality training data automatically and thus avoids this labeling cost.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The query rewriting method and device disclosed in the present application have been introduced in detail above. Specific examples are used herein to explain the principles and implementation of the application, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, for a person skilled in the art there may be variations in the specific embodiments and the application scope according to the idea of the present application; in summary, the content of this specification should not be construed as limiting the present application.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in an electronic device according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, fig. 5 shows an electronic device that may implement a method according to the present application. The electronic device can be a PC, a mobile terminal, a personal digital assistant, a tablet computer and the like. The electronic device conventionally comprises a processor 510 and a memory 520, and program code 530 stored on said memory 520 and executable on the processor 510, said processor 510 implementing the method described in the above embodiments when executing said program code 530. The memory 520 may be a computer program product or a computer readable medium. The memory 520 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 520 has a storage space 5201 for program code 530 of the computer program for performing any of the method steps of the above-described method. For example, the storage space 5201 for the program code 530 may include respective computer programs for implementing the respective steps in the above methods. The program code 530 is computer readable code. The computer programs may be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a Compact Disc (CD), a memory card or a floppy disk. The computer program comprises computer readable code which, when run on an electronic device, causes the electronic device to perform the method according to the above embodiments.
The embodiment of the present application also discloses a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the query rewrite method according to the first embodiment of the present application.
Such a computer program product may be a computer-readable storage medium having memory segments, memory spaces, etc. arranged similarly to the memory 520 in the electronic device shown in fig. 5. The program code may be stored in the computer readable storage medium, for example compressed in a suitable form. The computer readable storage medium is typically a portable or fixed storage unit as described with reference to fig. 6. Typically, the storage unit comprises computer readable code 530', i.e., code readable by a processor which, when executed, performs the steps of the method described above.
Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Moreover, it is noted that instances of the word "in one embodiment" are not necessarily all referring to the same embodiment.
In the description disclosed herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (12)

1. A query rewrite method, comprising:
step S1, respectively training an admission model and a rewriting model based on a preset first data set, wherein the preset first data set comprises a plurality of query input pairs;
step S2, performing rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewriting model, and iteratively training the admission model according to the rewrite prediction results;
step S3, screening the plurality of query input pairs in the first data set through the admission model after iterative training, and iteratively training the rewriting model according to the screening result;
step S4, in response to the training process satisfying a preset training condition, performing query rewrite prediction through the rewriting model after iterative training;
and step S5, in response to the training process not satisfying the preset training condition, performing rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewriting model after iterative training, iteratively training the admission model according to the rewrite prediction results, and jumping to step S3.
2. The method of claim 1, wherein each query input pair is provided with a rewrite probability tag for indicating whether the query inputs included in the query input pair can be rewritten with each other, and wherein the part of the plurality of query input pairs used for rewrite prediction consists of the query input pairs whose rewrite probability label indicates that the included query inputs can be rewritten with each other.
3. The method of claim 2, wherein the rewrite prediction result indicates a preset first number of rewrite candidate words corresponding to the query input fed to the rewriting model and the probability that each rewrite candidate word and the query input can be rewritten with each other, and wherein the step of iteratively training the admission model according to the rewrite prediction result comprises:
generating a positive query input pair sample from the rewrite candidate word with the maximum probability and the query input; generating negative query input pair samples with the query input from a preset second number of the rewrite candidate words with the lowest probabilities;
generating an iterative training data set from the positive and negative samples produced by the rewrite prediction results of the part of the plurality of query input pairs;
iteratively training the admission model based on the iterative training data set;
wherein the preset second number is smaller than the preset first number.
4. The method of claim 2, wherein the step of screening the plurality of query input pairs in the first data set through the iteratively trained admission model and iteratively training the rewriting model according to the screening result comprises:
predicting the rewrite-positive query input pairs in the first data set through the iteratively trained admission model, and iteratively training the rewriting model based on those rewrite-positive query input pairs.
5. The method according to any of claims 1 to 4, wherein, before the step of respectively training the admission model and the rewriting model based on the first data set, the method further comprises:
mining the first data set based on a query log.
6. A query rewriting apparatus, comprising:
the system comprises an admission model and rewriting model initial training module, a data processing module and a data processing module, wherein the admission model and rewriting model initial training module is used for respectively training an admission model and a rewriting model based on a preset first data set, and the preset first data set comprises a plurality of query input pairs;
an admission model first iterative training module, configured to perform rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewrite model, and iteratively train the admission model according to a rewrite prediction result;
the rewriting model iterative training module is used for screening the plurality of query input pairs in the first data set through the admission model after iterative training, and iteratively training the rewriting model according to a screening result;
the judging module is used for judging whether the iterative training process meets the preset training condition or not;
the rewrite prediction module is used for, in response to the training process satisfying a preset training condition, performing query rewrite prediction through the rewriting model after iterative training;
and the admission model second iterative training module is used for, in response to the training process not satisfying a preset training condition, performing rewrite prediction on a part of the plurality of query input pairs in the first data set through the rewriting model after iterative training, iteratively training the admission model according to the rewrite prediction result, and then jumping to the rewriting model iterative training module.
7. The apparatus of claim 6, wherein each query input pair is provided with a rewrite probability tag for indicating whether the query inputs included in the query input pair can be rewritten with each other, and wherein the part of the plurality of query input pairs used for rewrite prediction consists of the query input pairs whose rewrite probability label indicates that the included query inputs can be rewritten with each other.
8. The apparatus of claim 7, wherein the rewrite prediction result indicates a preset first number of rewriting candidate words corresponding to the query input fed into the rewriting model, together with the probability that each rewriting candidate word and that query input can be rewritten into each other, and wherein iteratively training the admission model according to the rewrite prediction result comprises:
generating a query input pair positive sample from the query input and the rewriting candidate word corresponding to the maximum probability; generating query input pair negative samples from the query input and a preset second number of the rewriting candidate words corresponding to the minimum probabilities;
generating an iterative training data set from the query input pair positive samples and query input pair negative samples generated from the rewrite prediction results for the part of the plurality of query input pairs;
iteratively training the admission model based on the iterative training data set;
wherein the preset second number is smaller than the preset first number.
9. The apparatus of claim 7, wherein screening the query input pairs in the first data set through the iteratively trained admission model and iteratively training the rewriting model according to the screening result comprises:
predicting, through the iteratively trained admission model, the query input pairs in the first data set that are positive rewriting examples, and iteratively training the rewriting model based on the query input pairs predicted as positive rewriting examples.
10. The apparatus of any one of claims 6 to 9, further comprising:
a first data set mining module, configured to mine the first data set based on a query log.
11. An electronic device, comprising a memory, a processor, and program code stored on the memory and executable on the processor, wherein the processor, when executing the program code, implements the query rewriting method of any one of claims 1 to 5.
12. A computer-readable storage medium having program code stored thereon, wherein the program code, when executed by a processor, implements the steps of the query rewriting method of any one of claims 1 to 5.
CN202010100144.XA 2020-02-18 2020-02-18 Query rewriting method and device and electronic equipment Withdrawn CN111428119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010100144.XA CN111428119A (en) 2020-02-18 2020-02-18 Query rewriting method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010100144.XA CN111428119A (en) 2020-02-18 2020-02-18 Query rewriting method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111428119A true CN111428119A (en) 2020-07-17

Family

ID=71547836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010100144.XA Withdrawn CN111428119A (en) 2020-02-18 2020-02-18 Query rewriting method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111428119A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
CN106557480A (en) * 2015-09-25 2017-04-05 阿里巴巴集团控股有限公司 Implementation method and device that inquiry is rewritten
CN107491447A (en) * 2016-06-12 2017-12-19 百度在线网络技术(北京)有限公司 Establish inquiry rewriting discrimination model, method for distinguishing and corresponding intrument are sentenced in inquiry rewriting
CN109857845A (en) * 2019-01-03 2019-06-07 北京奇艺世纪科技有限公司 Model training and data retrieval method, device, terminal and computer readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505194A (en) * 2021-06-15 2021-10-15 北京三快在线科技有限公司 Training method and device for rewrite word generation model
CN113505194B (en) * 2021-06-15 2022-09-13 北京三快在线科技有限公司 Training method and device for rewrite word generation model

Similar Documents

Publication Publication Date Title
US10380236B1 (en) Machine learning system for annotating unstructured text
WO2021027362A1 (en) Information pushing method and apparatus based on data analysis, computer device, and storage medium
CN107609185B (en) Method, device, equipment and computer-readable storage medium for similarity calculation of POI
CN111831802B (en) Urban domain knowledge detection system and method based on LDA topic model
US9864951B1 (en) Randomized latent feature learning
CN112084435A (en) Search ranking model training method and device and search ranking method and device
CN108121814B (en) Search result ranking model generation method and device
CN108345419B (en) Information recommendation list generation method and device
CN111079014A (en) Recommendation method, system, medium and electronic device based on tree structure
CN111914561B (en) Entity recognition model training method, entity recognition device and terminal equipment
CN110321437B (en) Corpus data processing method and device, electronic equipment and medium
CN112085541A (en) User demand analysis method and device based on browsing consumption time series data
CN110263127A (en) Text search method and device is carried out based on user query word
CN112818218A (en) Information recommendation method and device, terminal equipment and computer readable storage medium
CN113761219A (en) Knowledge graph-based retrieval method and device, electronic equipment and storage medium
US11403304B2 (en) Automatically curating existing machine learning projects into a corpus adaptable for use in new machine learning projects
CN111831902A (en) Recommendation reason screening method and device and electronic equipment
CN115048505A (en) Corpus screening method and device, electronic equipment and computer readable medium
CN103324641B (en) Information record recommendation method and device
CN114548296A (en) Graph convolution recommendation method based on self-adaptive framework and related device
CN111680218B (en) User interest identification method and device, electronic equipment and storage medium
CN111428119A (en) Query rewriting method and device and electronic equipment
CN115794898B (en) Financial information recommendation method and device, electronic equipment and storage medium
CN111488510A (en) Method and device for determining related words of small program, processing equipment and search system
CN111967941A (en) Method for constructing sequence recommendation model and sequence recommendation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20200717)