CN110968759A - Method and device for training rewriting model - Google Patents

Method and device for training rewriting model

Info

Publication number
CN110968759A
CN110968759A
Authority
CN
China
Prior art keywords
training
model
rewriting
query
data
Prior art date
Legal status
Pending
Application number
CN201811161706.0A
Other languages
Chinese (zh)
Inventor
王浩 (Wang Hao)
庞旭林 (Pang Xulin)
张晨 (Zhang Chen)
Current Assignee
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201811161706.0A
Publication of CN110968759A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for training a rewriting model. The method comprises the following steps: collecting query records of network users based on a search engine, and constructing a training data set based on the query records; acquiring training data in the training data set, and randomly shuffling the training data; dividing the randomly shuffled training data into a plurality of pieces of training sample data; and randomly selecting one piece of training sample data from the plurality of pieces, inputting the selected training sample data into a pre-constructed rewriting model for rewriting query sentences input by users based on a search engine, and training the rewriting model. Because the rewriting model is trained on a training data set aggregated from network users' real search query records, the training efficiency of the rewriting model is improved, and the rewriting model rewrites query sentences more accurately and efficiently.

Description

Method and device for training rewriting model
Technical Field
The invention relates to the technical field of machine learning, in particular to a method and a device for training a rewriting model.
Background
With the continuous development of search engine technology, users increasingly obtain all kinds of network information through search engines. When a user inputs a query statement into a search engine, the query statement may be a natural language query described in spoken language, which can cause the query results returned by the search engine for the natural language query to differ greatly from the user's query intention. Therefore, the user's query sentence can be rewritten by a rewriting model to obtain a more accurate query. The rewriting model is a neural network model in machine learning, and how to train a constructed rewriting model is a problem that urgently needs to be solved.
Disclosure of Invention
The present invention provides a method and apparatus for training a rewriting model to overcome the above problems or at least partially solve the above problems.
According to an aspect of the present invention, there is provided a training method of a rewriting model, including:
collecting query records of network users based on a search engine, and constructing a training data set based on the query records;
acquiring training data in the training data set, and randomly shuffling the training data;
dividing the randomly shuffled training data into a plurality of pieces of training sample data;
and randomly selecting one piece of training sample data from the plurality of pieces of training sample data, inputting the selected training sample data into a pre-constructed rewriting model for rewriting query sentences input by users based on a search engine, and training the rewriting model.
Optionally, the dividing the randomly shuffled training data into a plurality of pieces of training sample data includes:
equally dividing the randomly shuffled training data in the training data set into a number of pieces of training sample data indexed consecutively, and setting an index S with an initial value of 0.
Optionally, the selecting any one of the plurality of pieces of training sample data, inputting the selected training sample data into a pre-constructed rewriting model for rewriting a query statement input by a user based on a search engine, and training the rewriting model includes:
selecting the S-th piece of training sample data;
inputting the S-th piece of training sample data into a pre-constructed rewriting model for rewriting a query sentence input by a user based on a search engine, and training the rewriting model.
Optionally, the inputting the S-th piece of training sample data into a pre-constructed rewriting model for rewriting a query statement input by a user based on a search engine and training the rewriting model includes:
numbering the words in the query sentences of the S-th piece of training sample data according to a preset vocabulary, wherein the preset vocabulary is constructed based on the training data set;
and inputting the numbered words into the rewriting model, and training the rewriting model based on the numbered words.
Optionally, after inputting each numbered word into the rewrite model to train the rewrite model based on each numbered word, the method further includes:
calculating a loss function in the rewrite model training process by the following formula:

loss_t = -log P(w*_t) + Σ_i min(a_i^t, cov_i^t)

where loss_t represents the loss function at time t, w*_t represents the target word, a_i^t represents the attention weight, and cov_i^t represents the coverage vector.
Optionally, the inputting the S-th training sample data into a pre-constructed rewrite model for rewriting a query statement input by a user based on a search engine, and after training the rewrite model, the method further includes:
calculating a loss function of a preset verification set by using the trained rewriting model;
if the loss function increases, the training is finished;
and if the loss function decreases, setting S to S + 1, selecting the S-th piece of training sample data, inputting it into the rewriting model, and continuing to train the rewriting model.
Optionally, the collecting query records of network users based on a search engine, and constructing a training data set based on the query records includes:
collecting query records of each network user based on a search engine, and taking the query records as initial training corpora to construct a corpus;
cleaning noise data in the corpus to obtain a data set;
and respectively segmenting the query sentence and the search result in the data set, and taking the data of the first specified proportion of the data set as training data to construct a training data set of the rewriting model.
Optionally, the collecting query records of each network user based on a search engine, and using the query records as an initial corpus to construct a corpus includes:
collecting query sentences input by network users based on a search engine and search results clicked by the network users in a result page returned by the search engine based on the query sentences;
and forming a sentence pair by the query sentence and a search result clicked by the network user based on the query sentence, and using the sentence pair as an initial training corpus to construct a corpus.
Optionally, the cleaning the noise data in the corpus to obtain a data set, including:
obtaining sentence pairs in the corpus; taking the query statement as the input of the data set, and taking the search result clicked by the user corresponding to the query statement as the output of the data set;
and, based on topic similarity and/or word vector similarity, calculating and filtering out the sentence pairs of the initial training corpus in which the query sentence and the search result are semantically inconsistent.
Optionally, after the segmenting the query statement and the search result in the data set respectively, and taking the data of the first specified proportion of the data set as training data to construct a training data set of the rewriting model, the method further includes:
and acquiring a second specified proportion of data of the data set as verification data, and constructing a preset verification set based on the verification data.
According to another aspect of the present invention, there is also provided a training apparatus for a rewriting model, including:
the system comprises a collecting module, a searching module and a training data set, wherein the collecting module is configured to collect query records of network users based on a search engine and construct a training data set based on the query records;
a data acquisition module configured to acquire training data in the training data set and randomly scramble the training data;
the dividing module is configured to divide the training data after random disorder into a plurality of training sample data;
and the training module is configured to randomly select one piece of training sample data from the plurality of pieces of training sample data, input the selected training sample data into a pre-constructed rewriting model for rewriting a query sentence input by a user based on a search engine, and train the rewriting model.
Optionally, the dividing module is further configured to:
equally divide the randomly shuffled training data in the training data set into a number of pieces of training sample data indexed consecutively, and set an index S with an initial value of 0.
Optionally, the training module comprises:
the selecting unit is configured to select the S-th piece of training sample data;
and the model training unit is configured to input the S-th training sample data into a pre-constructed rewriting model for rewriting a query sentence input by a user based on a search engine, and train the rewriting model.
Optionally, the model training unit is further configured to:
numbering words in the query sentence of the S-th training sample data according to a preset vocabulary table; the preset vocabulary is constructed based on the training data set;
and inputting the numbered words into the rewriting model, and training the rewriting model based on the numbered words.
Optionally, the training module further comprises:
a calculating unit configured to calculate a loss function in the rewrite model training process by the following formula:
loss_t = -log P(w*_t) + Σ_i min(a_i^t, cov_i^t)

where loss_t represents the loss function at time t, w*_t represents the target word, a_i^t represents the attention weight, and cov_i^t represents the coverage vector.
Optionally, the apparatus further comprises:
the loss function calculation module is configured to calculate a loss function of a preset verification set by using the trained rewriting model;
wherein, when the loss function increases, the training is finished;
and when the loss function decreases, S is set to S + 1, and the training module selects the S-th piece of training sample data, inputs it into the rewriting model, and continues to train the rewriting model.
Optionally, the collection module comprises:
the record collection unit is configured to collect query records of each network user based on a search engine, and the query records are used as initial training corpora to construct a corpus;
the noise cleaning unit is configured to clean the noise data in the corpus to obtain a data set;
and the first construction unit is configured to perform word segmentation on the query sentence and the search result in the data set respectively, and use the data of the first specified proportion of the data set as training data to construct a training data set of the rewriting model.
Optionally, the record collection unit is further configured to:
collecting query sentences input by network users based on a search engine and search results clicked by the network users in a result page returned by the search engine based on the query sentences;
and forming a sentence pair by the query sentence and a search result clicked by the network user based on the query sentence, and using the sentence pair as an initial training corpus to construct a corpus.
Optionally, the noise cleaning unit is further configured to:
obtaining sentence pairs in the corpus; taking the query statement as the input of the data set, and taking the search result clicked by the user corresponding to the query statement as the output of the data set;
and, based on topic similarity and/or word vector similarity, calculate and filter out the sentence pairs of the initial training corpus in which the query sentence and the search result are semantically inconsistent.
Optionally, the collection module further comprises:
a second construction unit configured to acquire a second specified proportion of the data set as verification data, and construct a preset verification set based on the verification data.
According to another aspect of the present invention, there is also provided a computer storage medium having computer program code stored thereon, which, when run on a computing device, causes the computing device to perform the method of training a rewrite model according to any of the above.
According to another aspect of the present invention, there is also provided a computing device comprising:
a processor;
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform any of the above-described methods of training a rewrite model.
According to the method and the device for training the rewriting model, query records of network users based on a search engine are collected to construct a training data set, and the rewriting model is trained based on the constructed training data set. Because the rewriting model is trained on a training data set aggregated from the network users' real search query records, which truly and accurately reflect the users' query requirements, the training efficiency of the rewriting model is improved, and the rewriting model rewrites query sentences more accurately and efficiently.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for training a rewrite model according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for building an overwrite model according to an embodiment of the present invention;
FIG. 3 is a flow chart of a rewriting method of a query statement according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a rewrite model architecture, according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of query results before rewriting a query in accordance with an embodiment of the invention;
FIG. 6 is a diagram illustrating query results after rewriting a query, according to an embodiment of the invention;
FIG. 7 is a schematic diagram of a training apparatus for rewriting a model according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a training apparatus for a rewriting model according to a preferred embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Machine learning proceeds roughly as follows: determine a model, train the model, and use the model. After a model is built, it needs to be trained to ensure the accuracy and efficiency with which the rewriting model rewrites query statements. A rewriting model for query sentences rewrites the spoken query sentence input by the user into keywords suitable for a search engine through a series of natural language processing techniques, so that the search can return more accurate results while preserving the user's original semantics.
An embodiment of the present invention provides a method for training a rewriting model. As shown in fig. 1, the method for training a rewriting model provided by an embodiment of the present invention may include:
step S102, collecting query records of network users based on a search engine, and constructing a training data set based on the query records;
step S104, acquiring training data in the training data set, and randomly shuffling the training data;
step S106, dividing the randomly shuffled training data into a plurality of pieces of training sample data;
and S108, randomly selecting one piece of training sample data from a plurality of pieces of training sample data, inputting the selected training sample data into a pre-constructed rewriting model for rewriting the query sentence input by the user based on the search engine, and training the rewriting model.
According to the training method for the rewriting model provided by the embodiment of the invention, query records of network users based on a search engine are collected to construct a training data set, and the rewriting model is trained based on the constructed training data set. The rewriting model in the embodiment of the invention is trained on a training data set aggregated from network users' real search query records, which truly and accurately reflect the users' query requirements; this improves the training efficiency of the rewriting model and makes its rewriting of query sentences more accurate and efficient.
When the training data set is constructed in step S102, query records of each network user based on a search engine may be collected first, and the query records are used as an initial training corpus to construct a corpus; then, noise data in the corpus is cleaned to obtain a data set; word segmentation is then performed on the query sentences and the search results in the data set respectively, and a first specified proportion of the data in the data set is taken as training data to construct the training data set of the rewriting model. The query records of network users are the users' real search click data. When collecting the query records of network users, the query sentences input by the network users based on the search engine, together with the search results clicked by the network users in the result pages returned by the search engine for those query sentences, may be collected; each query sentence and the search result clicked for it form a sentence pair (query-title), and these sentence pairs serve as the initial training corpus from which the corpus is constructed, so that high-quality users' search query records become the initial training corpus.
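For illustration only, the following Python sketch shows one way such (query, clicked-title) sentence pairs could be assembled from a search log; the record format and the field names are assumptions of this sketch, not part of the embodiment.

# Hypothetical sketch: build the initial (query, clicked_title) corpus
# from search-log records; the 'query'/'clicked_title' fields are assumed.
def build_corpus(log_records):
    corpus = []
    for rec in log_records:
        query = rec.get("query", "").strip()
        title = rec.get("clicked_title", "").strip()
        if query and title:  # keep only records where a result was actually clicked
            corpus.append((query, title))
    return corpus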
Many noises exist in the initial training corpus. Data analysis shows that these noises are mainly caused by users' misoperations or by a user happening to be interested in some page, and they manifest as semantic dissimilarity of the training sentence pairs; such noises can seriously affect the training process. Therefore, the sentence pairs in the corpus need to be cleaned to obtain reliable data. When the noise data in the corpus is cleaned to obtain a data set, the sentence pairs in the corpus can be obtained first; the query sentence is taken as the input of the data set, and the search result clicked by the user corresponding to the query sentence is taken as the output of the data set; then, based on topic similarity and/or word vector similarity, the sentence pairs of the initial training corpus in which the query sentence and the search result are semantically inconsistent are calculated and filtered out. The embodiment of the invention mainly measures the quality of sentence pairs from two aspects: topic similarity and semantic similarity. The topic similarity starts from the topic distributions of the sentences and calculates the similarity between the distributions: the sentences are first represented semantically, an LDA model is trained, and the topic distribution of each sentence is computed; the similarity between the two distributions is then calculated using the JS (Jensen-Shannon) divergence. The word vector similarity starts from the word vectors of the words in a sentence: a sentence is represented as the mean of the word vectors of its words, and the cosine similarity is then used to calculate the similarity of the two sentences. Noise is removed by setting a reasonable threshold.
LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, comprising a three-layer structure of words, topics and documents. "Generative model" means that each word of an article is considered to be obtained through a process of "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability". Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution.
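A minimal sketch of the two similarity checks just described, assuming the LDA topic distributions and the mean word vectors of each sentence have already been computed elsewhere; the threshold values are illustrative, not taken from the embodiment.

import numpy as np

def js_divergence(p, q):
    # Jensen-Shannon divergence between two topic distributions p and q
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine_sim(u, v):
    # cosine similarity between two sentence vectors (means of word vectors)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def keep_pair(topic_q, topic_t, vec_q, vec_t, js_max=0.5, cos_min=0.4):
    # keep a (query, title) pair only if both checks pass; thresholds illustrative
    return js_divergence(topic_q, topic_t) <= js_max and cosine_sim(vec_q, vec_t) >= cos_min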
After the data in the corpus is cleaned to obtain the data set, word segmentation can be performed on the data set. In the embodiment of the invention, the open-source jieba tool can be used to segment the sentences into words; a first specified proportion of the data in the data set is then used as training data to construct the training data set of the rewriting model, a second specified proportion of the data in the data set is obtained as verification data, and a preset verification set is constructed based on the verification data. In practical applications, 20% of the data in the data set can be divided into the verification set and the remaining 80% into the training data set, and the vocabulary can be made using the training data set.
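As a sketch of the segmentation and 80%/20% split described above, using the open-source jieba tokenizer; the helper name and the fixed random seed are our own assumptions.

import random
import jieba

def segment_and_split(pairs, val_ratio=0.2, seed=42):
    # pairs: list of (query, title) strings; each sentence is segmented
    # into space-separated words with jieba
    segmented = [(" ".join(jieba.lcut(q)), " ".join(jieba.lcut(t))) for q, t in pairs]
    random.Random(seed).shuffle(segmented)
    n_val = int(len(segmented) * val_ratio)
    return segmented[n_val:], segmented[:n_val]  # 80% training set, 20% verification set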
As stated above, when training the rewriting model, the randomly shuffled training data may be divided into a plurality of pieces of training sample data, and the model is trained based on those pieces. In a preferred embodiment of the present invention, training the rewriting model may include the following steps:
s1-1, randomly shuffling the training data in the training data set;
s1-2, equally dividing the randomly shuffled training data in the training data set into a number of pieces of training sample data indexed consecutively, and setting an index S with an initial value of 0;
s1-3, selecting the S-th piece of training sample data;
and S1-4, inputting the S-th piece of training sample data into a pre-constructed rewriting model for rewriting query sentences input by users based on a search engine, and training the rewriting model.
Optionally, before the S-th piece of training sample data is input into the rewriting model in step S1-4, the words in the query sentences of the S-th piece of training sample data may be numbered according to a preset vocabulary, where the preset vocabulary is constructed based on the training data set; the numbered words are then input into the rewriting model, and the rewriting model is trained based on the numbered words. In the embodiment of the invention, the vocabulary is constructed based on the training data set, so numbering the words in the query sentences according to the preset vocabulary makes the training process of the rewriting model more orderly, thereby improving its training efficiency.
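A minimal sketch of numbering words against a vocabulary built from the training data set; the special tokens and the vocabulary size cut-off are assumptions of this sketch.

from collections import Counter

def build_vocab(train_pairs, max_size=50000):
    # word -> number table made from the training data set
    counts = Counter(w for q, t in train_pairs for w in q.split() + t.split())
    vocab = {"<pad>": 0, "<unk>": 1, "<sos>": 2, "<eos>": 3}  # assumed special tokens
    for word, _ in counts.most_common(max_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab

def number_words(sentence, vocab):
    # map each word of a segmented query sentence to its vocabulary number
    return [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]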
During the training process of the rewrite model, the loss function at each time can be calculated. That is, after the step S1-4, the method may further include:
s1-5, calculating a loss function in the process of rewriting model training by using the following formula:
loss_t = -log P(w*_t) + Σ_i min(a_i^t, cov_i^t)

where loss_t represents the loss function at time t, w*_t represents the target word, a_i^t represents the attention weight, and cov_i^t represents the coverage vector;

the loss function for the entire query statement is then defined as:

loss = (1/T) Σ_t loss_t

where T is the length of the output sequence.
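Read as a sketch, the per-step and whole-query losses above can be computed as follows; the formulation follows the coverage-style loss the symbols suggest, and the small epsilon guard is added by us for numerical safety.

import numpy as np

def step_loss(p_target, attn, cov):
    # loss_t = -log P(w*_t) + sum_i min(a_i^t, cov_i^t)
    # p_target: model probability of the target word at time t
    # attn, cov: attention and coverage vectors over the input positions
    return -np.log(p_target + 1e-12) + float(np.sum(np.minimum(attn, cov)))

def sequence_loss(p_targets, attns, covs):
    # loss = (1/T) * sum_t loss_t over the whole target query
    T = len(p_targets)
    return sum(step_loss(p, a, c) for p, a, c in zip(p_targets, attns, covs)) / T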
after the rewriting model is trained, the rewriting model can be verified through a preset verification set. When the rewrite model is verified, a loss function in a verification set can be calculated. In machine learning, a loss function (Lossfunction) is used for estimating the degree of inconsistency between a predicted value and a true value of a model, and is a non-negative real value function, and the smaller the loss function is, the better the robustness of the model is. Therefore, after the step S1-4, the method may further include:
S1-6, calculating the loss function of the preset verification set by using the trained rewriting model; if the loss function increases, the training is finished; if the loss function decreases, S is set to S + 1 and steps S1-3 to S1-5 are repeated, that is, the S-th piece of training sample data is selected and input into the rewriting model, and training of the rewriting model continues. Based on the embodiment provided by the invention, the loss function on the verification set, calculated with the trained model, is used to decide whether to train the rewriting model further; this further improves the rewriting accuracy of the rewriting model, makes the query keywords output by the rewriting model conform better to the user's search intention, and makes the query results better meet the user's expectations.
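The shard-by-shard loop with validation-based stopping can be skeletonized as below; train_on_shard and validation_loss are hypothetical callables standing in for the model update of step S1-4 and the verification-set evaluation of step S1-6.

def train_with_early_stopping(shards, train_on_shard, validation_loss):
    # shards: the pieces of training sample data, in their shuffled order
    best = float("inf")
    for s, shard in enumerate(shards):
        train_on_shard(shard)        # train the rewriting model on the S-th piece
        loss = validation_loss()     # loss function on the preset verification set
        if loss >= best:             # loss increased: finish training
            break
        best = loss                  # loss decreased: set S = S + 1 and continue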
The method for training the rewriting model has been described above; before training, the rewriting model itself needs to be constructed.
Conventional rewriting techniques fall into two broad categories: extractive and generative. Extractive methods usually use a specific calculation rule to compute the weight of each word in the user's input query sentence and select the words with larger weights as keywords. Such methods are simple and convenient, but all keywords are limited to the input word set, and words with high frequency tend to be extracted, so the effect is poor in some situations. Generative methods can generally "understand" the user's input and then generate keywords based on the user's intent. Such methods can generate new words, but the generation process is often uncontrollable and may produce completely wrong words. Taking a query sentence as an example, if the user inputs "I want to know how much money a mobile phone X costs", an extractive method extracts "mobile phone X" and "how much". Both of these terms come from the user's input query and are not sufficient to summarize the intent. A generative method will produce different results according to the training corpus; for example, "mobile phone 8" and "price" may be generated. Although new words can be generated, the model computes probabilities over the vocabulary made from the training corpus when generating words, and if "mobile phone X" is not in the training corpus, it may be replaced by a wrong near-synonym. Such results can lead to erroneous search pages.
Another embodiment of the present invention provides a method for constructing a rewriting model, as shown in fig. 2, the method for constructing a rewriting model provided in an embodiment of the present invention may include:
step S202, constructing an encoder, after receiving word vectors corresponding to all terms in the initial query sentence, encoding the word vectors corresponding to all terms by the encoder and respectively representing all the word vectors as input hidden vectors;
step S204, a decoder for decoding the input hidden vector is constructed, and a plurality of keywords are obtained by the decoder respectively based on a generating type rewriting mode and an extracting type rewriting mode;
step S206, calculating a first adjusting factor for adjusting the weight proportion of each keyword in the generating rewrite mode and the extracting rewrite mode, so as to calculate the comprehensive weight of each keyword based on the first adjusting factor;
and S208, combining the encoder and the decoder, setting the first adjustment factor in the decoder, and completing the construction of the rewriting model based on the encoder-decoder structure, where the rewriting model selects at least one keyword from the plurality of keywords according to the comprehensive weight of each keyword to serve as a query keyword semantically similar to the initial query sentence and then outputs it.
The embodiment of the invention provides a simple and efficient method and apparatus for constructing a rewriting model. The encoder encodes the input sentence into vectors, and the decoder decodes them to output a sequence. The rewriting model provided by the embodiment of the invention can combine an extractive mode and a generative mode to generate keywords, adjust the proportion of the two modes with an adjustment factor, and finally output at least one keyword with the highest semantic similarity to the initial query statement input by the user.
In a preferred embodiment of the present invention, when the encoder is constructed in step S202, it may be constructed as a bidirectional LSTM (Long Short-Term Memory) network; after receiving the word vectors corresponding to the terms in the initial query sentence, the encoder encodes them and represents each as an input hidden vector. That is, the initial query sentence input by the user based on the search engine is, after embedding, fed into the encoder word by word, generating hidden vectors. The hidden vectors serve as a high-level representation of the input sentence and are used in the decoding phase for the generation of a new sequence. LSTM is a long short-term memory network, a recurrent neural network suitable for processing and predicting important events with relatively long intervals and delays in a time series.
When the decoder is constructed in step S204, the decoder may be constructed as a unidirectional LSTM network, and the input hidden vectors are decoded by the decoder: at least one generative keyword is generated in the generative rewriting mode based on a preset vocabulary, and at least one extractive keyword is extracted from the initial query statement in the extractive rewriting mode.
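A minimal PyTorch sketch of this encoder-decoder skeleton (bidirectional LSTM encoder, unidirectional LSTM decoder); PyTorch and all dimension choices are our own assumptions, not specified by the embodiment.

import torch
import torch.nn as nn

class RewriteEncoderDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # bidirectional LSTM encoder producing the input hidden vectors
        self.encoder = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        # unidirectional LSTM decoder; its hidden size matches the
        # concatenated forward/backward encoder states
        self.decoder = nn.LSTM(emb_dim, 2 * hidden_dim, batch_first=True)

    def encode(self, src_ids):
        # word numbers -> input hidden vectors h_1 .. h_n, shape (batch, n, 2*hidden_dim)
        return self.encoder(self.embedding(src_ids))[0]

    def decode_step(self, prev_ids, state=None):
        # one decoding step: previous word -> target hidden vector h_t and new state
        out, state = self.decoder(self.embedding(prev_ids), state)
        return out, state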
Based on the rewrite model provided in the foregoing embodiment, a rewrite method for a query statement is further provided in a preferred embodiment of the present invention, and as shown in fig. 3, the rewrite method for a query statement provided in an embodiment of the present invention may include:
step S302, obtaining an initial query sentence input by a user based on a search engine, and segmenting the initial query sentence to obtain a term included in the initial query sentence;
step S304, representing each term in the initial query sentence as a term vector corresponding to each term;
step S306, calling a rewriting model, respectively inputting the word vectors corresponding to the words into the rewriting model, and generating and outputting at least one query keyword similar to the semantics of the initial query statement based on the rewriting model; the rewriting model is obtained by training a training data set obtained by summarizing user query records.
The embodiment of the invention provides a more efficient method for rewriting a query statement: after an initial query statement input by the user based on a search engine is received, the terms in the query statement are represented as word vectors, the word vectors are then input into a pre-built rewriting model, and the rewriting model outputs query keywords semantically similar to the current query statement. The rewriting model in the embodiment of the invention is trained on a training data set aggregated from user query records, so a query better suited to a search engine can be returned without changing the user's real intention, and the query results can meet the user's expectations.
Word vector representation is a representation method that can both represent the word itself and take semantic distance into account. In the preferred embodiment of the present invention, words are represented as word vectors by Embedding (word embedding). The neural-network-based distributed representation is also called word vector or word embedding; like other distributed representation methods, the neural network word vector model is based on the distributional hypothesis, and its core is still the representation of the context and the modeling of the relation between the context and the target word. Representing words as word vectors through word embedding can reduce dimensionality and capture the context information of the current word in the text (which can be expressed as a relation of preceding and following distances), improving the accuracy of subsequent rewriting.
As mentioned above, the query rewrite model in the embodiment of the present invention is a network model of an encoder-decoder structure, which is constructed mainly based on a sequence-to-sequence model. When the step S306 calls the rewrite model to rewrite the initial query statement, the rewriting method may specifically include:
step S306-1, calling the rewriting model, inputting the word vectors corresponding to the words into the rewriting model, encoding each received word vector with the encoder in the rewriting model, and representing each word vector as an input hidden vector;
and S306-2, inputting the input hidden vector into a decoder in the rewriting model for decoding, and generating and outputting at least one query keyword similar to the semantics of the initial query statement.
The sequence-to-sequence model (seq2seq for short) is a network with an "Encoder-Decoder" structure: its input is a sequence and its output is a sequence. The Encoder turns a variable-length signal sequence into a fixed-length vector representation, and the Decoder turns the fixed-length vector into a variable-length target signal sequence.
The rewriting model provided by the embodiment of the invention can obtain at least one generative keyword and at least one extractive keyword from the generative rewriting mode and the extractive rewriting mode respectively. In a preferred embodiment of the present invention, the decoder decodes a new query statement in the following two ways, corresponding to the generative mode and the extractive mode respectively:
(1) the decoder receives the input word vector representation and its own hidden vector, calculates the distribution probability of each word in the vocabulary, and selects at least one generative keyword according to the distribution probabilities;
(2) the weight of each term in the initial query statement is calculated through the attention matrix, and at least one extractive keyword is selected according to the weights of the terms.
That is, in step S306-2, the decoder decodes the received hidden vectors and outputs the query keywords, which may include the following steps:
s2-1, inputting the input hidden vector into a decoder in the rewriting model for decoding;
s2-2, selecting at least one generative keyword and at least one extractive keyword from a preset vocabulary and the initial query sentence respectively, wherein the preset vocabulary is constructed from the training data set;
and S2-3, analyzing the generative keywords and the extractive keywords, selecting a plurality of keywords as query keywords semantically similar to the initial query sentence, and outputting the selected keywords.
Based on the method provided by the embodiment of the invention, the initial query statement input by the user is rewritten by combining the extractive and generative modes, which fuses the advantages of both: the initial query statement the user inputs in the search engine is rewritten into a more accurate and concise keyword query, search results meeting the user's search intention are obtained, and user experience is further improved while the user's search time is saved.
In a preferred embodiment of the present invention, when at least one generative keyword is generated in the generative rewriting mode based on a preset vocabulary, the distribution probability of each word in the vocabulary can be calculated through the attention mechanism, and at least one generative keyword is selected according to the distribution probabilities; when at least one extractive keyword is extracted in the extractive rewriting mode based on the initial query statement, the weight of each term in the initial query statement can be calculated through the attention matrix, and at least one extractive keyword is selected according to the weights of the terms.
In the model structure of seq2seq, the degree of attention paid to each input word differs as each output word is generated, and the weight of each word is calculated according to a specific rule. This makes the generated sequence more rational and preserves most of the information in the input. In natural language processing applications, an attention model is generally viewed as an alignment model between a word in the output sentence and each word of the input sentence.
In a preferred embodiment of the present invention, calculating the distribution probability of each word in the vocabulary through the attention mechanism and selecting at least one generative keyword according to the distribution probabilities may include the following steps:
s2-1-1, measuring the weight of each term in the initial query statement through a score function, and calculating the weighted sum to obtain a context vector;
s2-1-2, combining the context vector with the target hidden vector at the current moment, and obtaining the distribution probability of each word in the vocabulary through two fully connected layers, wherein the target hidden vector is the hidden-layer variable of the decoder at time t, and each node of a fully connected layer is connected to all nodes of the previous layer and integrates the extracted features;
s2-1-3, predicting and outputting at least one generative keyword from the vocabulary;
and s2-1-4, using the coverage mechanism to help the decoder output non-repeating generative keywords.
When the preferred embodiment of the invention selects the generative keywords, the method follows the classic seq2seq model and is an "encoder-decoder" structure based on the attention mechanism. When a user enters a query x = {x_1, ..., x_n} (x_i representing the i-th term of the input sentence), the goal is to convert this query into a semantically similar keyword query y = {y_1, ..., y_m} (y_i representing the i-th word of the output). Each word of the query is fed into the "encoder" in turn, and the "decoder" then receives the previously generated words and a context vector to predict the next word y_t.
In step S2-1-1, the weight of each term in the initial query statement is measured by the score function, and the weighted sum is calculated to obtain the context vector. The specific method may be as follows:
(1) maintaining a coverage vector cov^t and initializing cov^0 as an all-zero matrix, where t represents time t;
In the attention-based seq2seq model, the words generated by the decoder sometimes fall into a loop triggered by particular words, so that the generated sentence contains repeated words. A coverage mechanism is therefore required to prevent this: the coverage mechanism makes the model attend more to words that have not been attended to before and ignore words that have already been attended to. The degree to which a word has been attended is measured by the accumulated sum of the attention distributions at previous moments, and previously attended words are ignored to prevent repetition;
(2) calculating the similarity e_i^t between the target hidden vector and the input hidden vector through the function score:

e_i^t = score(h_t, h_i) = v · tanh(W_1 h_t + W_2 h_i + W_c cov_i^t + b_atten)

where v, W_1, W_2, W_c and b_atten are training parameters of the query rewriting model, cov_i^t represents the coverage vector at time t, h_t represents the target hidden vector, and h_i represents the input hidden vector;
(3) normalizing e^t to obtain the attention weight a^t: a^t = softmax(e^t);
(4) at time t, maintaining the coverage matrix cov^t, which records the degree of coverage of the terms in the initial query statement and is the sum of the attention distributions at all previous times: cov^t = Σ_{t'=0}^{t-1} a^{t'};
(5) obtaining the context vector at time t as the sum of the input hidden vectors weighted by the attention weight a^t: C_t = Σ_i a_i^t h_i.
after the context vector is obtained by calculation, at least one generating keyword can be predicted and output by combining the context vector. Optionally, the at least one generated keyword is predicted and output in the vocabulary using the following formula:
Figure BDA0001820160780000139
wherein, ytRepresenting a currently output generated keyword, C representing a context vector;
p(yt|{y1,...,yt-1}, C) represents a previous generated keyword { y1,...,yt-1Y and context vector CtThe conditional probability of (2).
Meanwhile, when the context vector and the target hidden vector at the current time are combined in step S2-1-2 to obtain the distribution probability of each word in the vocabulary through two fully connected layers, the distribution probability can be calculated using the following formula:

P_vocab = f(C_t, h_t) = softmax(V'(V[h_t, C_t] + b) + b')

where V, V', b and b' are training parameters of the query rewriting model, P_vocab represents the distribution probability of the words in the vocabulary, h_t represents the target hidden vector, and C_t represents the context vector at time t. Softmax maps a K-dimensional real vector z to a new K-dimensional real vector in which each element lies between 0 and 1 and all elements sum to 1.
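A PyTorch sketch of steps (1)-(5) and the two fully connected output layers, following the formulas above; the placement of b_atten inside the coverage projection is an implementation convenience of this sketch, and all sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CoverageAttention(nn.Module):
    # e_i^t = v . tanh(W_1 h_t + W_2 h_i + W_c cov_i^t + b_atten), a^t = softmax(e^t)
    def __init__(self, hidden_dim):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.Wc = nn.Linear(1, hidden_dim, bias=True)   # its bias plays the role of b_atten
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h_t, enc_h, cov):
        # h_t: (batch, hidden); enc_h: (batch, n, hidden); cov: (batch, n)
        e = self.v(torch.tanh(self.W1(h_t).unsqueeze(1) + self.W2(enc_h)
                              + self.Wc(cov.unsqueeze(-1)))).squeeze(-1)
        a_t = F.softmax(e, dim=-1)                                # attention weight a^t
        context = torch.bmm(a_t.unsqueeze(1), enc_h).squeeze(1)   # C_t = sum_i a_i^t h_i
        return a_t, context, cov + a_t                            # accumulate coverage

class VocabHead(nn.Module):
    # P_vocab = softmax(V'(V[h_t, C_t] + b) + b'), i.e. two fully connected layers
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.fc1 = nn.Linear(2 * hidden_dim, hidden_dim)   # V, b
        self.fc2 = nn.Linear(hidden_dim, vocab_size)       # V', b'

    def forward(self, h_t, context):
        return F.softmax(self.fc2(self.fc1(torch.cat([h_t, context], dim=-1))), dim=-1)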
As introduced above, when the extractive keywords are selected, the weight of each term in the initial query sentence can be calculated through the attention matrix, and the selection is performed according to the weights of the terms. In a preferred embodiment of the present invention, this may include the following steps:
s2-2-1, calculating the weight of each term in the initial query statement based on TF-IDF (term frequency-inverse document frequency) and the attention weight, wherein the proportions of TF-IDF and the attention weight a^t are adjusted by a second adjustment factor p_w;
and s2-2-2, selecting, according to the weight of each term in the initial query sentence, at least one term from the terms as an extractive keyword, and outputting the selected terms.
TF-IDF is the product of two statistics: the term frequency TF(w) and the inverse document frequency IDF(w). A high TF-IDF value requires a high term frequency in the sentence together with a low frequency across the whole corpus, so the method can be used to exclude common terms. For natural language queries, this approach can effectively remove common spoken-language descriptions, such as "how" and "what", while retaining the important information.
When the weight of each term in the initial query statement is calculated in step S2-2-1, the following formulas may be used:

TFIDF(w) = f_w · log(N / |w|)

P_extract(w) = p_w · TFIDF(w) + (1 - p_w) · a^t(w)

where f_w represents the number of times the term w appears in the initial query sentence, N represents the total number of query sentences in the corpus constructed from the query records, |w| represents the number of query sentences in the corpus containing the term w, and a^t represents the attention weight, obtained by normalizing the similarity between the target hidden vector and the input hidden vector.
The TF-IDF value and the attention weight emphasize different aspects when measuring the importance of a word. The attention weight focuses on the semantic matching of input and output; its similarity values are computed using the hidden states, so it attends to the "meaning" of the word. TF-IDF focuses on the statistical features of a word, counting its importance throughout the corpus. These two values describe the importance of an input word from different perspectives; by combining them with weighting factors, better keywords can be extracted from the input.
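Numerically, the combined extraction weight could look like the sketch below; the linear blend with p_w follows the description above, and the default value of p_w is an assumption of this sketch.

import math

def tfidf(f_w, n_total, n_with_w):
    # TF-IDF(w) = f_w * log(N / |w|), with f_w the in-sentence frequency,
    # N the number of query sentences in the corpus, |w| those containing w
    return f_w * math.log(n_total / max(n_with_w, 1))

def extract_weight(tfidf_w, attn_w, p_w=0.5):
    # blend the statistical (TF-IDF) and semantic (attention) importance of a term
    return p_w * tfidf_w + (1.0 - p_w) * attn_w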
As mentioned above, after selecting the generated keywords and the extracted keywords, the generated keywords and the extracted keywords may be analyzed, and then a plurality of keywords may be selected as query keywords similar to the semantics of the initial query sentence and then output, which may include:
s3-1, acquiring each keyword in the generated keywords and the extracted keywords;
s3-2, calculating the comprehensive weight of each keyword by combining the weight of each word in the initial query sentence and the distribution probability of each word in the vocabulary;
and S3-3, selecting a plurality of keywords from the keywords as query keywords based on the comprehensive weight of each keyword.
In the above embodiment, the calculation process of the weight of each word in the initial query sentence and the distribution probability of each word in the vocabulary table has been described, and since the embodiment of the present invention synthesizes the two words to further select the final query keyword, the distribution probability and the weight ratio of the same keyword can be adjusted by using the preset first adjustment factor to calculate the comprehensive weight of each keyword.
In step S206, after the generated keywords and the extracted keywords are selected, a first adjustment factor for adjusting the weight ratio of each keyword in the generated rewrite mode and the extracted rewrite mode may be calculated, so as to calculate the comprehensive weight of each keyword based on the first adjustment factor. In a preferred embodiment of the present invention, the step S206 may include: calculating a first adjusting factor for adjusting the weight proportion of each keyword in a generating type rewriting mode and an extracting type rewriting mode; acquiring each keyword in the generated keywords and the extracted keywords; and adjusting the distribution probability and the weight proportion of the same keyword through the first adjusting factor, and calculating the comprehensive weight of each keyword.
In a preferred embodiment of the present invention, the calculation formula of the first adjustment factor may be as follows:
p_gen = σ(w_h · C_t + w_s · h_t + w_x · x_t + d)

where w_h, w_s, w_x and d represent training parameters, C_t represents the context vector, h_t represents the target hidden vector, h_s represents the input hidden vector, x_t represents the word input to the decoder at time t, σ represents the sigmoid function, and p_gen represents the first adjustment factor.
When the distribution probability and the weight proportion of the same keyword are adjusted by using a preset first adjusting factor to calculate the comprehensive weight of each keyword, the comprehensive weight of each keyword can be calculated by using the following formula:
P(w) = p_gen · P_vocab(w) + (1 - p_gen) · P_extract(w)

where P(w) represents the comprehensive weight of the keyword, P_vocab(w) represents the distribution probability of the keyword in the vocabulary, and P_extract(w) represents the weight of the keyword in the initial query statement.
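As a sketch, the first adjustment factor and the mixed comprehensive weight can be computed as follows, assuming the pointer-generator-style gate the symbols suggest; vectors are plain NumPy arrays here.

import numpy as np

def p_gen(w_h, c_t, w_s, h_t, w_x, x_t, d):
    # p_gen = sigma(w_h . C_t + w_s . h_t + w_x . x_t + d)
    z = np.dot(w_h, c_t) + np.dot(w_s, h_t) + np.dot(w_x, x_t) + d
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid

def comprehensive_weight(pg, p_vocab_w, p_extract_w):
    # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * P_extract(w)
    return pg * p_vocab_w + (1.0 - pg) * p_extract_w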
Finally, a keyword list is generated by ordering the keywords by their comprehensive weights, and a plurality of keywords are selected from the keyword list as query keywords. When selecting from the keyword list, the keywords with the larger comprehensive weights can be chosen as query keywords and output, so that the search engine can query based on the selected query keywords and the query results better meet the user's expectations.
FIG. 4 is a diagram illustrating the structure of a rewriting model according to an embodiment of the present invention. The rewriting model provided by the embodiment of the invention has the classic attention-based seq2seq structure and consists of an encoder and a decoder. The encoder "understands" the query input by the user, encodes the input sentence and sends it to the decoder for interpretation. In the decoding stage, the decoder generates each word in turn.
For example, in a real search scenario, the initial query statement entered by the user based on the search engine may be "I want to know how much money a mobile phone X costs". If such a query is entered directly into a search engine, the returned results page is often not what the user intended, as shown in FIG. 5.
Based on the method provided by the embodiment of the invention, the rewriting process can be as follows:
1. receiving the initial query sentence "I want to know how much money a mobile phone X costs" input by the user, and first segmenting the query sentence into the terms "I", "want", "know", "one", "mobile phone X" and "how much money";
2. embedding each word and representing it as a vector;
3. inputting each word vector into the encoder in the rewriting model, as in FIG. 4, and representing each word vector as an input hidden vector, i.e. h_1, h_2, ..., h_s, ..., h_{n-1}, h_n in FIG. 4;
4. inputting the input hidden vector of each term into the decoder, which sequentially generates the query keywords semantically similar to the initial query sentence; when generating the next word, the following two factors are considered:
(1) a vocabulary is constructed using the training data set, and the distribution probability of words in the vocabulary is considered;
(2) the weight of each term in the initial query statement is considered according to the extractive method, and the adjustment factor p_gen adjusts the proportion of the two. The initial query statement can thus be rewritten into a target query comprising the two query keywords "mobile phone X, price"; when a search is performed based on the rewritten target query, the results returned by the search engine are more accurate, as shown in fig. 6.
Based on the same inventive concept, an embodiment of the present invention further provides a training apparatus for rewriting a model, as shown in fig. 7, the training apparatus for rewriting a model provided in an embodiment of the present invention may include:
the collection module 710 is configured to collect query records of network users based on a search engine, and construct a training data set based on the query records;
a data obtaining module 720, configured to obtain training data in the training data set and randomly shuffle the training data;
a dividing module 730 configured to divide the randomly shuffled training data into a plurality of pieces of training sample data;
the training module 740 is configured to select any one of the plurality of training sample data, input the selected training sample data into a pre-constructed rewrite model for rewriting a query sentence input by a user based on a search engine, and train the rewrite model.
In a preferred embodiment of the present invention, the dividing module 730 can be further configured to:
the training data in the training data set after random scrambling is averagely divided into S pieces of training sample data, and the initial value of S is set to be 0.
In a preferred embodiment of the present invention, the training module 740 may include:
the selecting unit 741 is configured to select the S-th piece of training sample data;
a model training unit 742 is configured to input the S-th training sample data into a pre-constructed rewrite model for rewriting a query sentence input by a user based on a search engine, and train the rewrite model.
In a preferred embodiment of the present invention, the model training unit 742 may be further configured to:
numbering the words in the query sentences of the S-th piece of training sample data according to a preset vocabulary, wherein the preset vocabulary is constructed based on the training data set;
and inputting the numbered words into the rewriting model, and training the rewriting model based on the numbered words.
In a preferred embodiment of the present invention, as shown in fig. 8, the training module 740 may further include:
a calculating unit 743 configured to calculate a loss function in the rewrite model training process by the following formula:
loss_t = -log P(w*_t) + Σ_i min(a_i^t, cov_i^t)

where loss_t represents the loss function at time t, w*_t represents the target word, a_i^t represents the attention weight, and cov_i^t represents the coverage vector.
In a preferred embodiment of the present invention, the apparatus may further include:
a loss function calculation module 750 configured to calculate a loss function of a preset validation set by using the trained rewrite model;
wherein, when the loss function increases, the training is finished;
and when the loss function decreases, S is set to S + 1, and the training module selects the S-th piece of training sample data, inputs it into the rewriting model, and continues to train the rewriting model.
In a preferred embodiment of the present invention, the collecting module 710 may include:
the record collection unit 711 is configured to collect query records of each network user based on a search engine, and use the query records as an initial training corpus to construct a corpus;
a noise cleaning unit 712 configured to clean noise data in the corpus to obtain a data set;
a first constructing unit 713, configured to perform word segmentation on the query sentence and the search result in the data set, respectively, and use data of a first specified proportion of the data set as training data to construct a training data set of the rewriting model.
In a preferred embodiment of the present invention, the record collection unit 711 may be further configured to:
collecting the query sentences input by the network users into a search engine, together with the search results each user clicked in the result pages the search engine returned for those query sentences;
and forming sentence pairs from each query sentence and the search result clicked on the basis of that query sentence, and using the sentence pairs as the initial training corpus to construct the corpus.
In a preferred embodiment of the present invention, the noise cleaning unit 712 may be further configured to:
obtaining sentence pairs in a corpus; taking the query sentence as the input of the data set, and taking the search result clicked by the user corresponding to the query sentence as the output of the data set;
and, based on topic similarity and/or word vector similarity, computing similarity scores for the sentence pairs of the initial training corpus and filtering out those pairs in which the query sentence and the search result are not semantically consistent.
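A word-vector-similarity variant of this cleaning step might look as follows; embed is an assumed helper returning one vector per word, and the 0.5 threshold is an illustrative choice, not a value fixed by the embodiment.

```python
import numpy as np

def filter_sentence_pairs(pairs, embed, threshold=0.5):
    """Keep only sentence pairs whose query sentence and clicked search
    result are semantically consistent, judged by cosine similarity of
    their averaged word vectors."""
    kept = []
    for query_words, result_words in pairs:
        q = np.mean([embed(w) for w in query_words], axis=0)
        r = np.mean([embed(w) for w in result_words], axis=0)
        sim = q @ r / (np.linalg.norm(q) * np.linalg.norm(r) + 1e-12)
        if sim >= threshold:
            kept.append((query_words, result_words))
    return kept
```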
In a preferred embodiment of the present invention, the collecting module 710 may further include:
a second construction unit 714 configured to obtain a second specified proportion of the data set as verification data, and construct a preset verification set based on the verification data.
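The first and second specified proportions can be realized with a simple split of the cleaned data set; the 90/10 ratios below are illustrative assumptions only.

```python
import random

def split_dataset(data_set, first_ratio=0.9, second_ratio=0.1, seed=7):
    """Use the first specified proportion of the data set as training data
    and the second specified proportion as verification data."""
    data = list(data_set)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * first_ratio)
    n_valid = int(len(data) * second_ratio)
    return data[:n_train], data[n_train:n_train + n_valid]
```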
Based on the same inventive concept, an embodiment of the present invention further provides a computer storage medium, where computer program codes are stored, and when the computer program codes are run on a computing device, the computing device is caused to execute any one of the above-mentioned methods for training a rewrite model.
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including:
a processor;
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform any of the above-described methods of training a rewrite model.
The embodiment of the invention provides a method and a device for training a rewriting model, in which query records of network users based on a search engine are collected to construct a training data set, and the rewriting model is trained on that data set. Because the rewriting model is trained on real search query records aggregated from network users of a search engine, it reflects the users' query requirements more truly and accurately, which further improves training efficiency and allows the rewriting model to rewrite query sentences more accurately and efficiently.
Furthermore, the embodiment of the invention also provides a method for constructing the rewriting model: the constructed rewriting model can generate keywords by combining an extraction mode and a generation mode, with an adjustment factor balancing the proportion of the two, so that, without changing the semantics of the query statement, the query statement is simplified and at least one keyword with the highest semantic similarity to the initial query statement input by the user is finally output. Compared with a traditional rewriting model, the method can return a query better suited to a search engine without changing the user's real intention.
It is clear to those skilled in the art that the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, further description is omitted here.
In addition, the functional units in the embodiments of the present invention may be physically independent of each other, two or more functional units may be integrated together, or all the functional units may be integrated in one processing unit. The integrated functional units may be implemented in the form of hardware, or in the form of software or firmware.
Those of ordinary skill in the art will understand that the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computing device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
Alternatively, all or part of the steps of the foregoing method embodiments may be implemented by program instructions executed by associated hardware (a computing device such as a personal computer, a server, or a network device); the program instructions may be stored in a computer-readable storage medium, and when they are executed by a processor of the computing device, the computing device performs all or part of the steps of the method according to the embodiments of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present invention; such modifications or substitutions do not depart from the scope of the present invention.

Claims (10)

1. A method of training a rewrite model, comprising:
collecting query records of network users based on a search engine, and constructing a training data set based on the query records;
acquiring training data in the training data set, and randomly shuffling the training data;
dividing the randomly shuffled training data into a plurality of pieces of training sample data;
and randomly selecting one piece of training sample data from the plurality of pieces of training sample data, inputting the selected training sample data into a pre-constructed rewriting model for rewriting a query sentence input by a user based on a search engine, and training the rewriting model.
2. The method of claim 1, wherein the dividing the randomly shuffled training data into a plurality of pieces of training sample data comprises:
dividing the randomly shuffled training data in the training data set evenly into S pieces of training sample data, and setting the initial value of S to 0.
3. The method according to claim 1 or 2, wherein the selecting any one of the plurality of training sample data, inputting the selected training sample data into a pre-constructed rewrite model for rewriting a query statement input by a user based on a search engine, and training the rewrite model comprises:
selecting the S-th piece of training sample data;
inputting the S-th piece of training sample data into a pre-constructed rewriting model for rewriting a query sentence input by a user based on a search engine, and training the rewriting model.
4. The method according to any one of claims 1-3, wherein the inputting the S-th training sample data into a pre-constructed rewrite model for rewriting a query sentence input by a user based on a search engine, and the training of the rewrite model comprises:
numbering words in the query sentence of the S-th training sample data according to a preset vocabulary table; the preset vocabulary is constructed based on the training data set;
and inputting the numbered words into the rewriting model, and training the rewriting model based on the numbered words.
5. The method of any one of claims 1-4, further comprising, after inputting the numbered words into the rewriting model and training the rewriting model based on the numbered words:
calculating a loss function in the rewrite model training process by the following formula:
$$\mathrm{loss}_t = -\log P(w_t^*) + \sum_{i} \min\left(a_i^t, c_i^t\right)$$

where $\mathrm{loss}_t$ represents the loss function at decoding step $t$, $w_t^*$ represents the target word, $a_i^t$ represents the attention weight on the $i$-th source word, and $c_i^t$ represents the coverage vector.
6. The method according to any one of claims 1 to 5, further comprising, after inputting the S-th piece of training sample data into the pre-constructed rewriting model and training the rewriting model:
calculating a loss function of a preset verification set by using the trained rewriting model;
if the loss function increases, finishing the training;
and if the loss function decreases, setting S to S+1, selecting the S-th piece of training sample data, inputting it into the rewriting model, and continuing to train the rewriting model.
7. The method of any one of claims 1-6, wherein the collecting query records of network users based on a search engine, and constructing a training data set based on the query records, comprises:
collecting query records of each network user based on a search engine, and taking the query records as initial training corpora to construct a corpus;
cleaning noise data in the corpus to obtain a data set;
and respectively segmenting the query sentence and the search result in the data set, and taking the data of the first specified proportion of the data set as training data to construct a training data set of the rewriting model.
8. A training apparatus for rewriting a model, comprising:
the system comprises a collecting module, a searching module and a training data set, wherein the collecting module is configured to collect query records of network users based on a search engine and construct a training data set based on the query records;
a data acquisition module configured to acquire training data in the training data set and randomly shuffle the training data;
a dividing module configured to divide the randomly shuffled training data into a plurality of pieces of training sample data;
and the training module is configured to randomly select one piece of training sample data from the plurality of pieces of training sample data, input the selected training sample data into a pre-constructed rewriting model for rewriting a query sentence input by a user based on a search engine, and train the rewriting model.
9. A computer storage medium having computer program code stored thereon which, when run on a computing device, causes the computing device to perform a method of training a rewrite model according to any of claims 1 to 7.
10. A computing device, comprising:
a processor;
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform a method of training a rewrite model according to any of claims 1 to 7.
CN201811161706.0A 2018-09-30 2018-09-30 Method and device for training rewriting model Pending CN110968759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811161706.0A CN110968759A (en) 2018-09-30 2018-09-30 Method and device for training rewriting model


Publications (1)

Publication Number Publication Date
CN110968759A true CN110968759A (en) 2020-04-07

Family

ID=70029217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811161706.0A Pending CN110968759A (en) 2018-09-30 2018-09-30 Method and device for training rewriting model

Country Status (1)

Country Link
CN (1) CN110968759A (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106557480A (en) * 2015-09-25 2017-04-05 阿里巴巴集团控股有限公司 Implementation method and device that inquiry is rewritten

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ABIGAIL SEE et al.: "Get To The Point: Summarization with Pointer-Generator Networks", https://arxiv.org/abs/1704.04368, page 2 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468005A (en) * 2023-03-29 2023-07-21 云南大学 Steganography text generation method based on text rewriting model
CN116468005B (en) * 2023-03-29 2024-01-30 云南大学 Steganography text generation method based on text rewriting model

Similar Documents

Publication Publication Date Title
CN106328147B (en) Speech recognition method and device
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
CN110969024A (en) Query statement rewriting method and device
Kampman et al. Investigating audio, visual, and text fusion methods for end-to-end automatic personality prediction
CN108986186B (en) Method and system for converting text into video
CN108959312A (en) A kind of method, apparatus and terminal that multi-document summary generates
CN112464656B (en) Keyword extraction method, keyword extraction device, electronic equipment and storage medium
CN108241613A (en) A kind of method and apparatus for extracting keyword
CN113590970B (en) Personalized digital book recommendation system and method based on reader preference, computer and storage medium
CN110990578A (en) Method and device for constructing rewriting model
CN110162624A (en) A kind of text handling method, device and relevant device
CN113806554A (en) Knowledge graph construction method for massive conference texts
CN113392265A (en) Multimedia processing method, device and equipment
CN114298055B (en) Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN116975615A (en) Task prediction method and device based on video multi-mode information
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
CN110909021A (en) Construction method and device of query rewriting model and application thereof
CN110968759A (en) Method and device for training rewriting model
CN110516086B (en) Method for automatically acquiring movie label based on deep neural network
Labbé et al. Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates
Zhang et al. Extractive Document Summarization based on hierarchical GRU
CN115391522A (en) Text topic modeling method and system based on social platform metadata
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN113935308A (en) Method and system for automatically generating text abstract facing field of geoscience
CN113657116A (en) Social media popularity prediction method and device based on visual semantic relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination