CN110990578A - Method and device for constructing rewriting model

Method and device for constructing rewriting model

Info

Publication number
CN110990578A
CN110990578A
Authority
CN
China
Prior art keywords
keyword
decoder
weight
rewriting
word
Prior art date
Legal status
Pending
Application number
CN201811161824.1A
Other languages
Chinese (zh)
Inventor
王浩
庞旭林
张晨
Current Assignee
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201811161824.1A priority Critical patent/CN110990578A/en
Publication of CN110990578A publication Critical patent/CN110990578A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for constructing a rewriting model. The method comprises the following steps: constructing an encoder and a decoder, and obtaining a plurality of keywords through the decoder based on a generative rewriting mode and an extractive rewriting mode respectively; calculating a first adjustment factor that adjusts the weight proportion of each keyword between the generative and extractive rewriting modes, so as to calculate a comprehensive weight for each keyword based on the first adjustment factor; and combining the encoder and the decoder, setting the first adjustment factor in the decoder, and completing the construction of a rewriting model based on the encoder-decoder structure, wherein the rewriting model selects at least one keyword from the plurality of keywords according to the comprehensive weight of each keyword and outputs it as a query keyword semantically similar to the initial sentence. The rewriting model provided by the embodiments of the invention can simplify a query statement without changing its semantics, so that the obtained query results better meet user expectations.

Description

Method and device for constructing rewriting model
Technical Field
The invention relates to the technical field of deep learning, in particular to a method and a device for constructing a rewriting model.
Background
With the continuous development of search engines and the popularization of smart phones, users usually search for information through search software installed on their smart phones. However, the query a user submits to a search engine may be a colloquially phrased natural-language query, which poses a significant challenge to the search engine: a typical search engine is better suited to queries composed of precise keywords, so natural-language queries can yield poor returned results and reduce query accuracy. Therefore, it is desirable to provide a rewriting model for rewriting the spoken-style queries entered by users.
Disclosure of Invention
The present invention provides a method and apparatus for constructing a rewrite model to overcome the above problems or at least partially solve the above problems.
According to an aspect of the present invention, there is provided a method of constructing a rewrite model, including:
constructing an encoder which, after receiving the word vectors corresponding to the terms in an initial query statement, encodes the word vectors corresponding to the terms and represents each word vector as an input hidden vector;
constructing a decoder for decoding the input hidden vector, and obtaining a plurality of keywords respectively based on a generating type rewriting mode and an extracting type rewriting mode through the decoder;
calculating a first adjustment factor for adjusting the weight proportion of each keyword in the generating rewrite mode and the extracting rewrite mode, so as to calculate the comprehensive weight of each keyword based on the first adjustment factor;
and combining the encoder and the decoder, and setting the first adjustment factor in the decoder, so as to complete the construction of a rewriting model based on an encoder-decoder structure, wherein the rewriting model selects at least one keyword from the plurality of keywords according to the comprehensive weight of each keyword, and outputs it as a query keyword semantically similar to the initial sentence.
Optionally, the constructing a decoder for decoding the input hidden vector, wherein obtaining, by the decoder, a plurality of keywords based on a generative rewrite mode and an extractive rewrite mode respectively, includes:
constructing a decoder according to a unidirectional LSTM long-short term memory network, and decoding the input hidden vector through the decoder;
generating at least one generative keyword by adopting a generative rewrite mode based on a preset vocabulary;
and extracting at least one extraction type keyword based on the initial query statement in an extraction type rewriting mode.
Optionally, the generating at least one generated keyword using a generated rewrite mode based on a preset vocabulary includes:
and calculating the distribution probability of each word in the vocabulary through an attention mechanism, and selecting at least one generating type keyword according to the distribution probability of each word.
Optionally, the calculating, by the attention mechanism, a distribution probability of each word in the vocabulary, and selecting at least one generating-formula keyword according to the distribution probability of each word includes:
measuring the weight of each term in the initial query statement through a score function, and computing the weighted sum to obtain a context vector;
combining the context vector with the target hidden vector at the current moment to obtain the distribution probability of each word in the vocabulary through two fully-connected layers; the target hidden vector is a hidden layer variable of a decoder at the time t;
predicting and generating at least one generated keyword in the vocabulary;
and utilizing a coverage mechanism to assist the decoder to generate non-repeated generating type keywords.
Optionally, the extracting at least one extracted keyword in an extracted rewrite mode based on the initial query statement includes:
and calculating the weight of each term in the initial query statement through an attention matrix, and selecting at least one extractive keyword according to the weight of each term.
Optionally, the calculating the weight of each term in the initial query statement through the attention matrix, and selecting at least one extractive keyword according to the weight of each term, includes:
calculating the weight of each term in the initial query statement based on the TF-IDF (term frequency-inverse document frequency) value and the attention weight, wherein the proportion of the TF-IDF value and the attention weight a_t is adjusted by a second adjustment factor p_w;
and selecting at least one term from the terms as an extraction type keyword according to the weight of each term in the initial query statement.
Optionally, the calculating a first adjustment factor for adjusting a weight ratio of each keyword in the generated rewrite mode and the extracted rewrite mode to calculate an integrated weight of each keyword based on the first adjustment factor includes:
calculating a first adjusting factor for adjusting the weight proportion of each keyword in the generating type rewriting mode and the extracting type rewriting mode;
acquiring each keyword in the generated keywords and the extracted keywords;
and adjusting the distribution probability and the weight proportion of the same keyword through a first adjusting factor, and calculating the comprehensive weight of each keyword.
Optionally, the adjusting, by the first adjustment factor, the distribution probability and the weight ratio of the same keyword to calculate the comprehensive weight of each keyword includes:
the comprehensive weight of each keyword is calculated using the following formula:

P(w) = p_gen · P_vocab(w) + (1 − p_gen) · P_extract(w)

where P(w) denotes the comprehensive weight of the keyword, p_gen denotes the first adjustment factor, P_vocab(w) denotes the distribution probability of the keyword in the vocabulary, and P_extract(w) denotes the weight of the keyword in the initial query statement.
Optionally, after receiving the word vector corresponding to each word in the initial query statement, the constructing an encoder encodes the word vector corresponding to each word and represents each word vector as an input hidden vector, where the constructing an encoder includes:
constructing an encoder according to a bidirectional LSTM long-short term memory network;
after receiving the word vectors corresponding to the words in the initial query sentence, the encoder encodes the word vectors corresponding to the words and respectively represents the word vectors as input hidden vectors.
Optionally, the calculation formula of the first adjustment factor is as follows:

p_gen = σ(w_h · C_t + w_s · h_t + w_x · x_t + d)

where w_h, w_s, w_x and d denote training parameters, C_t denotes the context vector, h_t denotes the target hidden vector, h_s denotes the input hidden vector, x_t denotes the word input at time t, σ denotes the sigmoid function, and p_gen denotes the first adjustment factor.
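For illustration only, the following is a minimal sketch of how the first adjustment factor and the comprehensive weight could be computed; the function names, tensor shapes and the PyTorch framing are assumptions of this sketch, not taken from the patent.

```python
import torch

def first_adjustment_factor(w_h, w_s, w_x, d, C_t, h_t, x_t):
    """p_gen = sigmoid(w_h . C_t + w_s . h_t + w_x . x_t + d).

    w_h, w_s, w_x are learned vectors and d a learned scalar bias;
    C_t is the context vector, h_t the target (decoder) hidden vector,
    x_t the embedding of the decoder input word at time t.
    """
    return torch.sigmoid(w_h @ C_t + w_s @ h_t + w_x @ x_t + d)

def comprehensive_weight(p_gen, p_vocab_w, p_extract_w):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * P_extract(w)."""
    return p_gen * p_vocab_w + (1.0 - p_gen) * p_extract_w
```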
According to another aspect of the present invention, there is also provided a rewriting model constructing apparatus, including:
the first construction module is configured to construct an encoder, and after word vectors corresponding to all terms in an initial query statement are received, the encoder encodes the word vectors corresponding to all terms and respectively represents all the word vectors as input hidden vectors;
a second construction module configured to construct a decoder that decodes the input hidden vector, and to obtain a plurality of keywords through the decoder based on a generative rewrite mode and an extractive rewrite mode, respectively;
the calculation module is configured to calculate a first adjusting factor for adjusting the weight proportion of each keyword in the generating type rewriting mode and the extracting type rewriting mode, so as to calculate the comprehensive weight of each keyword based on the first adjusting factor;
and the setting module is configured to combine the encoder and the decoder, set the first adjustment factor in the decoder, and complete the construction of a rewriting model based on an encoder-decoder structure, wherein the rewriting model selects at least one keyword from the plurality of keywords according to the comprehensive weight of each keyword and outputs it as a query keyword semantically similar to the initial sentence.
Optionally, the second building block comprises:
the decoding unit is configured to construct a decoder according to a unidirectional LSTM long-short term memory network and decode the input implicit vector through the decoder;
the generating unit is configured to generate at least one generative keyword by adopting a generative rewriting mode based on a preset vocabulary;
an extraction unit configured to extract at least one extracted keyword in an extracted rewrite mode based on the initial query statement.
Optionally, the generating unit is further configured to:
and calculating the distribution probability of each word in the vocabulary through an attention mechanism, and selecting at least one generating type keyword according to the distribution probability of each word.
Optionally, the generating unit is further configured to:
measuring the weight of each term in the initial query statement through a score function, and computing the weighted sum to obtain a context vector;
combining the context vector with the target hidden vector at the current moment to obtain the distribution probability of each word in the vocabulary through two fully-connected layers; the target hidden vector is a hidden layer variable of a decoder at the time t;
predicting and generating at least one generated keyword in the vocabulary;
and utilizing a coverage mechanism to assist the decoder to generate non-repeated generating type keywords.
Optionally, the extracting unit is further configured to:
and calculating the weight of each term in the initial query statement through an attention matrix, and selecting at least one extractive keyword according to the weight of each term.
Optionally, the extracting unit is further configured to:
calculating the weight of each term in the initial query statement based on the TF-IDF (term frequency-inverse document frequency) value and the attention weight, wherein the proportion of the TF-IDF value and the attention weight a_t is adjusted by a second adjustment factor p_w;
and selecting at least one term from the terms as an extraction type keyword according to the weight of each term in the initial query statement.
Optionally, the calculation module comprises:
an adjustment factor calculation unit configured to calculate a first adjustment factor that adjusts a ratio of weights of the keywords in the generated rewrite mode and the extracted rewrite mode;
an acquisition unit configured to acquire each of the generated keywords and the extracted keywords;
and the comprehensive calculation unit is configured to adjust the distribution probability and the weight proportion of the same keyword through the first adjusting factor, and calculate the comprehensive weight of each keyword.
Optionally, the comprehensive calculation unit is further configured to calculate a comprehensive weight of each keyword by using the following formula:
P(w) = p_gen · P_vocab(w) + (1 − p_gen) · P_extract(w)

where P(w) denotes the comprehensive weight of the keyword, p_gen denotes the first adjustment factor, P_vocab(w) denotes the distribution probability of the keyword in the vocabulary, and P_extract(w) denotes the weight of the keyword in the initial query statement.
Optionally, the first building module is further configured to:
constructing an encoder according to a bidirectional LSTM long-short term memory network;
after receiving the word vectors corresponding to the words in the initial query sentence, the encoder encodes the word vectors corresponding to the words and respectively represents the word vectors as input hidden vectors.
Optionally, the calculation module is further configured to calculate the first adjustment factor using the following formula:
p_gen = σ(w_h · C_t + w_s · h_t + w_x · x_t + d)

where w_h, w_s, w_x and d denote training parameters, C_t denotes the context vector, h_t denotes the target hidden vector, h_s denotes the input hidden vector, x_t denotes the word input at time t, σ denotes the sigmoid function, and p_gen denotes the first adjustment factor.
According to another aspect of the present invention, there is also provided a computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the method of building a rewrite model according to any of the above.
According to another aspect of the present invention, there is also provided a computing device comprising:
a processor;
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform any of the above-described methods of constructing a rewrite model.
The invention provides a simple and efficient method and device for constructing a rewriting model. The rewriting model is based on an Encoder-Decoder structure and consists of an encoder and a decoder: the encoder encodes the input sentence into a vector, and the decoder decodes this vector to output a sequence. The rewriting model provided by the invention generates keywords by combining an extractive mode and a generative mode, with an adjustment factor balancing the proportion of the two, so that the query statement can be simplified without changing its semantics, and at least one keyword with the highest semantic similarity to the initial query statement input by the user is finally output. Compared with traditional rewriting models, the method returns a query better suited to a search engine without altering the true intent of the query statement, so that the query results better meet user expectations.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for constructing a rewriting model according to an embodiment of the present invention;
FIG. 2 is a flow chart of a rewriting method of a query statement according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a rewrite model architecture according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of query results before rewriting a query in accordance with an embodiment of the invention;
FIG. 5 is a diagram illustrating query results after rewriting a query, according to an embodiment of the invention;
FIG. 6 is a flow chart of a method for training a rewrite model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an apparatus for building a rewrite model according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a rewriting model building apparatus according to a preferred embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Query rewriting means rewriting the spoken-style query sentence input by the user into keywords suitable for a search engine through a series of natural-language-processing techniques, so that the search can return more accurate results while preserving the user's original semantics.
Conventional rewriting techniques fall into two broad categories: extractive and generative. The extractive approach usually uses a specific calculation rule to compute the weight of each word in the user's input query sentence and selects the words with larger weights as keywords. This method is simple and convenient, but all keywords are limited to the input word set, and high-frequency words tend to be extracted, so the effect is poor in some cases. Generative methods can generally "understand" the user's input and then generate keywords based on the user's intent. Such methods can produce new words, but the generation process is often uncontrollable and may produce completely wrong words. Take a query sentence as an example: the user inputs "I want to know how much money is for one mobile phone X", and the extractive method extracts "mobile phone X" and "how much". Both of these terms come from the user's input query and are not sufficient to summarize the intent. The generative method produces different results depending on the training corpus; for example, "mobile phone 8" and "price" may be generated. Although new words can be generated, the model computes probabilities over a vocabulary built from the training corpus, and if "mobile phone X" is not in the training corpus it may be replaced by a wrong near-synonym. Such results lead to erroneous search pages.
An embodiment of the present invention provides a method for constructing a rewriting model, and as shown in fig. 1, the method for constructing a rewriting model provided by an embodiment of the present invention may include:
step S102, constructing an encoder, after receiving word vectors corresponding to all terms in the initial query sentence, encoding the word vectors corresponding to all terms by the encoder and respectively representing all the word vectors as input hidden vectors;
step S104, constructing a decoder for decoding the input hidden vector, and obtaining a plurality of keywords through the decoder respectively based on a generating type rewriting mode and an extracting type rewriting mode;
step S106, calculating a first adjusting factor for adjusting the weight proportion of each keyword in a generating type rewriting mode and an extracting type rewriting mode so as to calculate the comprehensive weight of each keyword based on the first adjusting factor;
and S108, combining the encoder and the decoder, setting a first adjusting factor in the decoder, completing the construction of a rewriting model based on the structure of the encoder and the decoder, and selecting at least one keyword from a plurality of keywords by the rewriting model according to the comprehensive weight of each keyword to serve as a query keyword similar to the semantics of the initial sentence and then outputting the query keyword.
The embodiment of the invention provides a simple and efficient method and device for constructing a rewriting model based on an encoder-decoder structure: the encoder encodes the input sentence into a vector, and the decoder decodes this vector to output a sequence. The rewriting model provided by the embodiment of the invention generates keywords by combining an extractive mode and a generative mode, with an adjustment factor balancing the proportion of the two, and finally outputs at least one keyword with the highest semantic similarity to the initial query statement input by the user.
In a preferred embodiment of the present invention, when the encoder is constructed in step S102, the encoder may be constructed according to a bidirectional LSTM long short-term memory network; after receiving the word vectors corresponding to the terms in the initial query sentence, the encoder encodes the word vectors corresponding to the terms and represents each of them as an input hidden vector. That is, the initial query sentence input by the user into the search engine is, after embedding, fed into the encoder word by word, generating hidden vectors. The hidden vectors serve as a high-level representation of the input sentence and are used, in the decoding phase, for the generation of a new sequence. LSTM (Long Short-Term Memory) is a long short-term memory network, a type of recurrent neural network suitable for processing and predicting important events with relatively long intervals and delays in a time series.
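As an illustrative sketch of such an encoder, the following PyTorch snippet builds a bidirectional LSTM that turns word indices into one input hidden vector per term; the class name, dimensions and batch-first layout are assumptions chosen for the example, not prescribed by the patent.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Bidirectional LSTM encoder: word indices in, input hidden vectors out."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) indices of the terms in the query
        emb = self.embedding(token_ids)       # (batch, seq_len, emb_dim)
        h_s, _ = self.lstm(emb)               # (batch, seq_len, 2 * hidden_dim)
        return h_s                            # one input hidden vector per term
```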
When the decoder is constructed in the step S104, the decoder may be constructed according to the one-way LSTM long-short term memory network, and the input hidden vector may be decoded by the decoder; generating at least one generative keyword by adopting a generative rewrite mode based on a preset vocabulary; and extracting at least one extraction type keyword based on the initial query statement by adopting an extraction type rewriting mode.
Based on the rewrite model provided in the foregoing embodiment, a rewrite method for a query statement is further provided in a preferred embodiment of the present invention, as shown in fig. 2, the rewrite method for a query statement provided in an embodiment of the present invention may include:
step S202, obtaining an initial query statement input by a user based on a search engine, and segmenting the initial query statement to obtain terms included in the initial query statement;
step S204, representing each term in the initial query sentence as a term vector corresponding to each term;
step S206, calling a rewriting model, respectively inputting the word vectors corresponding to the words into the rewriting model, and then generating and outputting at least one query keyword similar to the semantics of the initial query sentence based on the rewriting model; the rewriting model is obtained by training a training data set obtained by summarizing user query records.
The embodiment of the invention provides a method for rewriting a more efficient query statement, wherein after an initial query statement input by a user based on a search engine is received, terms in the query statement are represented as word vectors, then the word vectors are input into a pre-established rewriting model, and query keywords with similar semantics with the current query statement are output by the rewriting model. The rewriting model in the embodiment of the invention is obtained by training the training data set after summarizing based on the user query records, so that the query more suitable for a search engine can be returned on the premise of not changing the real intention of the user, and the query result can meet the user expectation.
Word-vector representation is a representation method that can both represent the word itself and take semantic distance into account. In the preferred embodiment of the present invention, words are represented as word vectors by means of word embedding. The neural-network-based distributed representation is also called word vectors or word embedding; like other distributed representation methods, the neural-network word-vector model is based on the distributional hypothesis, and its core is still the representation of context and the modeling of the relation between context and target word. Representing words as word vectors through word embedding reduces dimensionality and captures the context information of the current word in the text (which can be expressed as a relation of preceding and following distances), thereby improving the accuracy of subsequent rewriting.
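A toy sketch of the embedding lookup described here; the vocabulary, indices and dimension are invented for the example.

```python
import torch
import torch.nn as nn

# Toy vocabulary; the words, indices and embedding dimension are invented.
word2id = {"<pad>": 0, "<unk>": 1, "mobile phone X": 2, "price": 3}
embedding = nn.Embedding(num_embeddings=len(word2id), embedding_dim=8)

terms = ["mobile phone X", "price"]
ids = torch.tensor([[word2id.get(w, word2id["<unk>"]) for w in terms]])
vectors = embedding(ids)   # shape (1, 2, 8): one dense vector per term
```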
As mentioned above, the query rewrite model in the embodiment of the present invention is a network model of an encoder-decoder structure, which is constructed mainly based on a sequence-to-sequence model. When the step S206 calls the rewrite model to rewrite the initial query statement, the method may specifically include:
step S206-1, calling a rewriting model, inputting word vectors corresponding to the words into the rewriting model, coding the received word vectors based on a coder in the rewriting model, and representing the word vectors as input hidden vectors;
and S206-2, inputting the input hidden vector into a decoder in the rewriting model for decoding, and generating and outputting at least one query keyword similar to the semantics of the initial query statement.
The sequence-to-sequence model (seq2seq for short) is a network with an Encoder-Decoder structure: its input is a sequence and its output is a sequence. The encoder converts a variable-length input sequence into a fixed-length vector representation, and the decoder converts that fixed-length vector into a variable-length target sequence.
The rewriting model provided by the embodiment of the invention can obtain at least one generative keyword and at least one extractive keyword from the generative rewriting mode and the extractive rewriting mode, respectively. In a preferred embodiment of the present invention, the decoder decodes a new query statement in the following two ways, corresponding to the generative mode and the extractive mode respectively:
(1) the decoder receives the input word vector representation and the hidden vector of the decoder, calculates the distribution probability of each word in the vocabulary, and selects at least one generating type keyword according to the distribution probability of each word.
(2) calculating the weight of each term in the initial query statement through the attention matrix, and selecting at least one extractive keyword according to the weight of each term.
That is, in step S206-2, the decoder may decode the received input hidden vectors and output the query keywords, including the following steps:
s1, inputting the input hidden vector into a decoder in the rewriting model for decoding;
s2, selecting at least one generating type keyword and extracting type keyword from a preset vocabulary list and an initial query sentence respectively; the preset vocabulary table is constructed by a training data set;
and S3, analyzing the generated keywords and the extracted keywords, and selecting a plurality of keywords as query keywords similar to the semantics of the initial query sentence and outputting the selected keywords.
Based on the method provided by the embodiment of the invention, the initial query statement input by the user is rewritten by combining the extraction formula and the generation formula, the advantages of the extraction formula and the generation formula can be fused, the initial query statement input by the user in the search engine is rewritten into a more accurate and concise keyword query, a search result meeting the search intention of the user is obtained, and the user experience can be further improved while the search time of the user is saved.
In a preferred embodiment of the present invention, when generating at least one generated keyword in a generated rewrite mode based on a preset vocabulary table, the distribution probability of each word in the vocabulary table can be calculated through an attention mechanism, and at least one generated keyword is selected according to the distribution probability of each word; when at least one extraction type keyword is extracted in an extraction type rewriting mode based on the initial query statement, the weight of each term in the initial query statement can be calculated through the attention matrix, and at least one extraction type keyword is selected according to the weight of each term.
In the seq2seq model structure, the degree of attention paid to each input word differs as each output word is produced, and the weight of each word is computed according to a specific rule. This makes the generated sequence more rational and preserves most of the information in the input. In natural-language-processing applications, attention models are generally viewed as an alignment model between a word in the output sentence and each word of the input sentence.
In a preferred embodiment of the present invention, when the distribution probability of each word in the vocabulary is calculated by the attention mechanism and at least one generated keyword is selected according to the distribution probability of each word, the method may include the following steps:
S1-1, measuring the weight of each term in the initial query statement through a score function, and computing the weighted sum to obtain a context vector;
s1-2, combining the context vector with the target hidden vector at the current moment to obtain the distribution probability of each word in the vocabulary through two fully-connected layers; the target hidden vector is a hidden layer variable of a decoder at the time t; each node of the full connection layer is connected with all nodes of the previous layer and used for integrating the extracted features;
S1-3, predicting and outputting at least one generative keyword in the vocabulary;
S1-4, using a coverage mechanism to assist the decoder in outputting non-repeating generative keywords.
When selecting generative keywords, the preferred embodiment of the invention follows the classic seq2seq model, an encoder-decoder structure based on the attention mechanism. When a user inputs a query x = {x_1, ..., x_n} (x_i denotes the i-th term of the input sentence), the goal is to convert this query into a semantically similar keyword query y = {y_1, ..., y_m} (y_i denotes the i-th output word). Each word of the query is fed into the "encoder" in turn, and the "decoder" then receives the previously generated word and a context vector to predict the next word y_t.
In step S1-1, the weight of each term in the initial query statement is measured through a score function, and the weighted sum is computed to obtain the context vector. The specific procedure may be as follows (a code sketch of these steps is given after the list):

(1) Introduce a coverage vector cov^t and set cov^0 to an all-zero matrix, where t denotes time t.

In the attention-based seq2seq model, the words generated by the decoder sometimes fall into a loop triggered by certain special words, so that the generated sentence contains repeated words. A coverage mechanism is therefore required to prevent this. The coverage mechanism pays more attention to words that have not been attended to before and ignores words that have already been attended to: the degree of attention a word has received is measured by the accumulated sum of the attention matrices at previous times, and previously attended words are ignored to prevent repetition.

(2) Compute the similarity e_i^t between the target hidden vector and the input hidden vectors through the score function:

e_i^t = v · tanh(W_1 · h_t + W_2 · h_i + W_c · cov_i^t + b_atten)

where v, W_1, W_2, W_c and b_atten are training parameters of the query rewriting model, cov_i^t denotes the coverage value of the i-th term at time t, h_t denotes the target hidden vector, and h_i denotes the input hidden vector of the i-th term.

(3) Normalize e^t to obtain the attention weight a_t: a_t = softmax(e^t).

(4) At time t, maintain a coverage matrix cov^t that records the degree of coverage of the terms in the initial query statement; cov^t is the sum of the attention distributions at all previous times:

cov^t = Σ_{t'=0}^{t−1} a^{t'}

(5) Obtain the context vector at time t as the attention-weighted sum of the input hidden vectors:

C_t = Σ_i a_i^t · h_i
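The following sketch implements steps (1)-(5) for a single decoding step; the parameter shapes and names are assumptions chosen to match the formulas above, not taken from the patent.

```python
import torch

def attention_step(h_t, h_s, cov_t, v, W1, W2, W_c, b_atten):
    """One decoding-step attention computation with coverage.

    h_t:    (hid,)          target hidden vector at time t
    h_s:    (src_len, hid)  input hidden vectors of the query terms
    cov_t:  (src_len,)      coverage vector (sum of past attention weights)
    v, b_atten: (attn,)     learned vector and bias
    W1, W2: (attn, hid)     learned matrices; W_c: (attn,) learned vector
    """
    # (2) e_i^t = v . tanh(W1 h_t + W2 h_i + W_c cov_i^t + b_atten)
    e_t = torch.tanh(h_t @ W1.T + h_s @ W2.T
                     + cov_t.unsqueeze(-1) * W_c + b_atten) @ v
    # (3) normalize to obtain the attention weights
    a_t = torch.softmax(e_t, dim=0)
    # (5) context vector: attention-weighted sum of the input hidden vectors
    C_t = (a_t.unsqueeze(-1) * h_s).sum(dim=0)
    # (4) accumulate coverage for the next step
    cov_next = cov_t + a_t
    return a_t, C_t, cov_next
```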
After the context vector is obtained by calculation, at least one generative keyword can be predicted and output by combining it. Optionally, the generative keywords are predicted in the vocabulary according to the conditional probability

p(y_t | {y_1, ..., y_{t−1}}, C)

where y_t denotes the currently output generative keyword, C denotes the context vector, and p(y_t | {y_1, ..., y_{t−1}}, C) denotes the conditional probability of y_t given the previously generated keywords {y_1, ..., y_{t−1}} and the context vector.
Meanwhile, when step S1-2 combines the context vector with the target hidden vector at the current time to obtain the distribution probability of each word in the vocabulary through two fully-connected layers, the distribution probability can be calculated using the following formula:

P_vocab = f(C_t, h_t) = softmax(V′(V[h_t, C_t] + b) + b′)

where V, V′, b and b′ are training parameters of the query rewriting model, P_vocab denotes the distribution probability of words in the vocabulary, h_t denotes the target hidden vector, and C_t denotes the context vector at time t. Softmax maps a K-dimensional real vector z to a new K-dimensional real vector such that each element of the vector lies between 0 and 1 and all elements sum to 1.
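As a sketch, the two fully-connected layers could look like this in PyTorch; the class name, layer names and dimensions are assumptions of the example.

```python
import torch
import torch.nn as nn

class VocabDistribution(nn.Module):
    """P_vocab = softmax(V'(V [h_t, C_t] + b) + b') via two linear layers."""

    def __init__(self, hidden_dim, ctx_dim, inner_dim, vocab_size):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim + ctx_dim, inner_dim)  # V, b
        self.fc2 = nn.Linear(inner_dim, vocab_size)            # V', b'

    def forward(self, h_t, C_t):
        hidden = self.fc1(torch.cat([h_t, C_t], dim=-1))
        return torch.softmax(self.fc2(hidden), dim=-1)
```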
As introduced above, when selecting extractive keywords, the weight of each term in the initial query sentence can be calculated through the attention matrix, and the selection is made according to the weight of each term. In a preferred embodiment of the present invention, this may comprise the following steps:

S2-1, calculating the weight of each term in the initial query statement based on the TF-IDF (term frequency-inverse document frequency) value and the attention weight, wherein the proportion of the TF-IDF value and the attention weight a_t is adjusted by a second adjustment factor p_w;

S2-2, selecting at least one term from the terms as an extractive keyword according to the weight of each term in the initial query statement, and outputting it.
TF-IDF is the product of two statistics: the term frequency TF(w) and the inverse document frequency IDF(w). A high TF-IDF value results from a high frequency of the word in the query together with a low frequency of the word across the whole corpus, so this method can be used to exclude common terms. For natural-language queries, this approach can effectively remove common spoken-language descriptions such as "how" and "what" while retaining the important information.
When the weight of each term in the initial query statement is calculated in step S2-1, the following formulas may be used:

TF-IDF(w) = f_w · log(N / |w|)

P_extract(w) = p_w · TF-IDF(w) + (1 − p_w) · a_t

where f_w denotes the number of times the term w appears in the initial query sentence, N denotes the total number of query sentences in the corpus constructed from the query records, |w| denotes the number of query sentences in the corpus containing the term w, and a_t denotes the attention weight, obtained by normalizing the similarity between the target hidden vector and the input hidden vectors.
The TF-IDF value and the attention weight have different emphasis points in measuring the importance of a word. Attention weights focus on semantic matching of inputs and outputs, whose similarity values are computed using hidden states. In this way it focuses on the "meaning" of the word. TF-IDF focuses on the statistical features of a word, which counts the importance of the word throughout the corpus, and these two values describe the importance of the input word from different perspectives. By combining them with weighting factors, better keywords can be extracted from the input.
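A minimal sketch of blending TF-IDF with the attention weight via the second adjustment factor p_w, following the reconstruction above; the +1 smoothing, the placeholder value of p_w, and the absence of any normalization between the two scales are simplifying assumptions of this sketch.

```python
import math

def extractive_weight(term, query_terms, corpus_queries, a_t, p_w=0.5):
    """p_w * TF-IDF(term) + (1 - p_w) * a_t for one query term.

    query_terms:    the segmented initial query sentence
    corpus_queries: list of segmented query sentences (the corpus)
    a_t:            attention weight of this term
    p_w:            second adjustment factor (0.5 is a placeholder)
    """
    f_w = query_terms.count(term)                     # frequency in the query
    n = len(corpus_queries)                           # N: queries in the corpus
    df = sum(1 for q in corpus_queries if term in q)  # |w|: queries with the term
    tfidf = f_w * math.log(n / (1 + df))              # +1 smoothing is an assumption
    return p_w * tfidf + (1.0 - p_w) * a_t
```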
As mentioned above, after selecting the generated keywords and the extracted keywords, the generated keywords and the extracted keywords may be analyzed, and then a plurality of keywords may be selected as query keywords similar to the semantics of the initial query sentence and then output, which may include:
s3-1, acquiring each keyword in the generated keywords and the extracted keywords;
s3-2, calculating the comprehensive weight of each keyword by combining the weight of each word in the initial query sentence and the distribution probability of each word in the vocabulary;
and S3-3, selecting a plurality of keywords from the keywords as query keywords based on the comprehensive weight of each term.
In the above embodiments, the calculation of the weight of each word in the initial query sentence and of the distribution probability of each word in the vocabulary has been described. Since the embodiment of the present invention combines the two to select the final query keywords, the distribution probability and weight proportion of the same keyword can be adjusted by the preset first adjustment factor to calculate the comprehensive weight of each keyword.
In step S106, after the generated keywords and the extracted keywords are selected, a first adjustment factor for adjusting the weight ratio of each keyword in the generated rewrite mode and the extracted rewrite mode may be calculated, so as to calculate the comprehensive weight of each keyword based on the first adjustment factor. In a preferred embodiment of the present invention, the step S106 may include: calculating a first adjusting factor for adjusting the weight proportion of each keyword in a generating type rewriting mode and an extracting type rewriting mode; acquiring each keyword in the generated keywords and the extracted keywords; and adjusting the distribution probability and the weight proportion of the same keyword through a first adjusting factor, and calculating the comprehensive weight of each keyword.
In a preferred embodiment of the present invention, the first adjustment factor may be calculated as follows:

p_gen = σ(w_h · C_t + w_s · h_t + w_x · x_t + d)

where w_h, w_s, w_x and d denote training parameters, C_t denotes the context vector, h_t denotes the target hidden vector, h_s denotes the input hidden vector, x_t denotes the word input at time t, σ denotes the sigmoid function, and p_gen denotes the first adjustment factor.
When the distribution probability and weight proportion of the same keyword are adjusted by the preset first adjustment factor to calculate the comprehensive weight of each keyword, the following formula can be used:

P(w) = p_gen · P_vocab(w) + (1 − p_gen) · P_extract(w)

where P(w) denotes the comprehensive weight of the keyword, p_gen denotes the first adjustment factor, P_vocab(w) denotes the distribution probability of the keyword in the vocabulary, and P_extract(w) denotes the weight of the keyword in the initial query statement.
Finally, a keyword list is generated by ranking the keywords according to their comprehensive weights, and several keywords are selected from the keyword list as query keywords. When selecting from the keyword list, the keywords with the largest comprehensive weights can be chosen as query keywords and output, so that the search engine can query based on the selected query keywords and the query results better meet user expectations.
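A sketch of this final selection step, assuming the comprehensive weights have already been computed into a dictionary; the value of k and the example numbers are invented for illustration.

```python
def select_query_keywords(comprehensive_weights, k=2):
    """Rank candidate keywords by comprehensive weight P(w) and keep the top k."""
    ranked = sorted(comprehensive_weights.items(),
                    key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

# Invented numbers for illustration:
# select_query_keywords({"mobile phone X": 0.41, "price": 0.35, "want": 0.02})
# -> ["mobile phone X", "price"]
```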
FIG. 3 is a diagram illustrating a structure of a rewrite model according to an embodiment of the present invention. As can be seen from fig. 3, the rewrite model provided by the embodiment of the present invention is a classic attention-based seq2seq structure, and is composed of an "encoder" and a "decoder". The encoder can understand the query input by the user, and encodes the input sentence and sends the encoded sentence to the decoder for interpretation. In the decoding stage, the "decoder" generates each word in turn.
For example, in a real search scenario, the initial query statement entered by the user based on the search engine may be "i want to know how much is a cell phone X". If such a query is entered directly in a search engine, the returned results page is often not the result intended by the user, as shown in FIG. 4.
Based on the method provided by the embodiment of the invention, the rewriting process can be as follows:
1. receiving an initial query sentence 'i want to know how much a mobile phone X needs to be money' input by a user, and firstly segmenting the query sentence; the initial query sentence is segmented to obtain ' I ', ' want ', ' know ', ' one ', ' mobile phone X ' and how much money ';
2. embedding each word, and expressing the words by vectors;
3. the word vectors are input to the encoder in the rewriting model, as shown in FIG. 3, and are represented as input hidden vectors, e.g. h_1, h_2, ..., h_s, ..., h_{n−1}, h_n in FIG. 3;
4. Inputting the input hidden vector of each term into a decoder, and sequentially generating each query keyword similar to the semantics of the initial query sentence by the decoder; in generating the next word, the following two factors are considered:
(1) constructing a vocabulary table by utilizing the training data set, and considering the distribution probability of words in the vocabulary table;
(2) the weight of each term in the initial query statement is considered according to the extractive method, and the adjustment factor p_gen adjusts the proportion of the two. The initial query statement can thus be rewritten into a target query statement comprising the two query keywords "mobile phone X, price"; when searching based on the rewritten target query statement, the result returned by the search engine is more accurate, as shown in FIG. 5.
Machine learning proceeds roughly as follows: determine a model, train the model, then use the model. Therefore, after the rewriting model is built, it needs to be trained to ensure the accuracy and efficiency with which it rewrites query statements. As shown in FIG. 6, a preferred embodiment of the present invention further provides a method for training the rewriting model, which may include:
step S602, collecting query records of network users based on a search engine, and constructing a training data set based on the query records;
step S604, acquiring the training data in the training data set, and randomly shuffling it;
step S606, dividing the randomly shuffled training data into a plurality of pieces of training sample data;
step S608, arbitrarily selecting one piece of training sample data from the plurality of pieces of training sample data, inputting the selected training sample data into a pre-constructed rewrite model for rewriting a query sentence input by a user based on a search engine, and training the rewrite model.
According to the training method for the rewriting model provided by the embodiment of the invention, query records of network users in a search engine are collected to construct a training data set, and the rewriting model is trained on this data set. Because the rewriting model is trained on a data set aggregated from real search query records, it reflects users' query needs more truly and accurately, which improves the training efficiency of the rewriting model and enables it to rewrite query sentences more accurately and efficiently.
When the training data set is constructed in step S602, the query records of network users in a search engine may first be collected and used as the initial training corpus to build a corpus; the noise data in the corpus is then cleaned to obtain a data set; word segmentation is then performed on the query sentences and search results in the data set, and a first specified proportion of the data set is used as training data to construct the training data set for the rewriting model. The query records are real search click data of network users. When collecting a user's query records, the query sentences the user entered in the search engine and the search results the user clicked on the result pages returned for those query sentences can both be collected; each query sentence and the search result clicked for it form a sentence pair (query-title), and these sentence pairs serve as the initial training corpus to build the corpus, so that high-quality users' search query records become the initial training corpus.
There is considerable noise in the initial training corpus. Data analysis shows that this noise is mainly caused by user misoperation or coincidental interest in a certain page, and it manifests as semantic dissimilarity within training sentence pairs; such noise can seriously affect the training process. Therefore, the sentence pairs in the corpus need to be cleaned to obtain reliable data. When cleaning the noise data in the corpus to obtain a data set, the sentence pairs in the corpus can be obtained; the query sentence is taken as the input of the data set, and the search result clicked by the user for that query sentence as the output; sentence pairs whose query sentence and search result do not agree semantically are then identified and filtered out based on topic similarity and/or word-vector similarity. The embodiment of the invention measures the quality of sentence pairs mainly from two aspects: topic similarity and semantic similarity. Topic similarity starts from the topic distributions of the sentences and computes the similarity between the distributions: the sentences are first represented semantically, an LDA model is trained, and the topic distribution of each sentence is calculated; the similarity between the two distributions is then computed using the Jensen-Shannon (JS) divergence. Semantic similarity starts from the word vectors of the words in the sentence: a sentence is represented as the mean of its word vectors, and the cosine similarity of the two sentences is then computed. Noise is removed by setting a reasonable threshold.
LDA (Latent Dirichlet Allocation) is a generative topic model for documents, also called a three-layer Bayesian probability model, comprising word, topic and document layers. "Generative model" means that each word of an article is considered to be obtained through a process of "selecting a topic with a certain probability, and then selecting a word from that topic with a certain probability". Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution.
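A sketch of the two similarity measures used for cleaning, assuming the topic distributions from a trained LDA model and the word vectors are already available; the threshold values are placeholders to be tuned, not values from the patent.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two topic distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine_similarity(vecs_a, vecs_b):
    """Cosine similarity of sentences represented as the mean of their word vectors."""
    a = np.mean(vecs_a, axis=0)
    b = np.mean(vecs_b, axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def keep_pair(topics_q, topics_t, vecs_q, vecs_t, js_max=0.4, cos_min=0.5):
    """Keep a (query, title) pair only if both similarity tests pass."""
    return (js_divergence(topics_q, topics_t) < js_max
            and cosine_similarity(vecs_q, vecs_t) > cos_min)
```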
After the data in the corpus is cleaned to obtain the data set, word segmentation can be performed on it. In the embodiment of the invention, the open-source jieba tool can be used to segment sentences into words; then a first specified proportion of the data set is used as training data to construct the training data set for the rewriting model, and a second specified proportion of the data set is used as validation data to construct a preset validation set. In practical applications, 20% of the data in the data set can be assigned to the validation set and the remaining 80% to the training data set, and the vocabulary can be built from the training data set.
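A sketch of the segmentation and 80/20 split described above, using jieba's standard lcut call; the helper name and the fixed seed are assumptions of the example.

```python
import random
import jieba  # the open-source segmentation tool mentioned above

def build_datasets(sentence_pairs, train_ratio=0.8, seed=42):
    """Segment (query, title) pairs and split them 80/20 into train/validation."""
    segmented = [(jieba.lcut(query), jieba.lcut(title))
                 for query, title in sentence_pairs]
    random.Random(seed).shuffle(segmented)
    cut = int(len(segmented) * train_ratio)
    return segmented[:cut], segmented[cut:]  # training set, validation set
```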
As mentioned above, when training the rewriting model, the randomly shuffled training data may be divided into multiple pieces of training sample data, and the model is trained on these pieces. In a preferred embodiment of the present invention, training the rewriting model may include the following steps:
S6-1, randomly shuffling the training data in the training data set;
S6-2, evenly dividing the randomly shuffled training data in the training data set into multiple pieces of training sample data, and initializing the index S to 0;
S6-3, selecting the S-th piece of training sample data;
S6-4, inputting the S-th piece of training sample data into the pre-constructed rewriting model, which rewrites query sentences input by users in a search engine, and training the rewriting model.
Optionally, before the S-th piece of training sample data is input into the rewriting model in step S6-4, the words in its query sentence may be numbered according to a preset vocabulary, where the preset vocabulary is constructed from the training data set; the numbered words are then input into the rewriting model, and the rewriting model is trained based on them. In the embodiment of the invention, the vocabulary can be constructed from the training data set, so that numbering the words in the query sentence according to the preset vocabulary makes the training process of the rewriting model more orderly, thereby improving training efficiency.
During the training of the rewriting model, the loss function at each time step can be calculated. That is, after step S6-4, the method may further include:

S6-5, calculating the loss function during the training of the rewriting model using the following formula:

loss_t = −log P(w*_t) + λ · Σ_i min(a_i^t, cov_i^t)

where loss_t denotes the loss function at time t, w*_t denotes the target word, a_i^t denotes the attention weight, cov_i^t denotes the coverage vector, and λ weights the coverage term.

The loss function for the entire query statement is then defined as:

loss = (1/T) · Σ_{t=0}^{T} loss_t
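A sketch of these loss terms, assuming the reconstruction above; the coverage weight lambda and the epsilon inside the log are assumptions of the example.

```python
import torch

def step_loss(p_target, a_t, cov_t, lam=1.0):
    """loss_t = -log P(w*_t) + lam * sum_i min(a_i^t, cov_i^t).

    p_target: model probability assigned to the target word w*_t at time t
    a_t, cov_t: attention and coverage vectors over the input terms
    lam: coverage-loss weight (the value 1.0 is an assumption)
    """
    coverage_loss = torch.minimum(a_t, cov_t).sum()
    return -torch.log(p_target + 1e-12) + lam * coverage_loss

def sequence_loss(step_losses):
    """Average the per-step losses over the whole query statement."""
    return torch.stack(step_losses).mean()
```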
after the rewriting model is trained, the rewriting model can be verified through a preset verification set. When the rewrite model is verified, a loss function in a verification set can be calculated. In machine learning, a loss function (Lossfunction) is used for estimating the degree of inconsistency between a predicted value and a true value of a model, and is a non-negative real value function, and the smaller the loss function is, the better the robustness of the model is. Therefore, after the step S1-4, the method may further include:
S6-6, calculating the loss function on the preset validation set using the trained rewriting model; if the loss function increases, the training is finished; if the loss function decreases, set S to S+1, repeat the steps from S6-1 to S6-5, select the S-th piece of training sample data, input it into the rewriting model, and continue training the rewriting model. Based on this embodiment of the invention, whether to train the rewriting model again is decided by computing the validation-set loss function with the trained model, which can further improve the rewriting accuracy of the rewriting model, so that the query keywords it outputs better match the user's search intention and the query results better meet user expectations.
Based on the same inventive concept, an embodiment of the present invention further provides a device for constructing a rewriting model, and as shown in fig. 7, the device for constructing a rewriting model according to the embodiment of the present invention may include:
a first constructing module 710 configured to construct an encoder, and after receiving the word vectors corresponding to the terms in the initial query sentence, the encoder encodes the word vectors corresponding to the terms and represents the word vectors as input hidden vectors, respectively;
a second constructing module 720 configured to construct a decoder for decoding the input hidden vector, and obtain a plurality of keywords based on the generating rewrite mode and the extracting rewrite mode by the decoder, respectively;
a calculating module 730 configured to calculate a first adjustment factor for adjusting a weight ratio of each keyword in the generating rewrite mode and the extracting rewrite mode, so as to calculate a comprehensive weight of each keyword based on the first adjustment factor;
and the setting module 740 is configured to combine the encoder and the decoder, set the first adjustment factor in the decoder, and complete the construction of the rewriting model based on the encoder-decoder structure; the rewriting model selects at least one keyword from the plurality of keywords according to the comprehensive weight of each keyword, and outputs it as a query keyword semantically similar to the initial sentence.
In a preferred embodiment of the present invention, as shown in fig. 8, the second construction module 720 may include:
a decoding unit 721 configured to construct a decoder according to the unidirectional LSTM long-short term memory network, and decode the input hidden vector through the decoder;
the generating unit 722 is configured to generate at least one generated keyword in a generated rewrite mode based on a preset vocabulary;
the extraction unit 733 is configured to extract at least one extracted keyword in an extracted rewrite mode based on the initial query sentence.
In a preferred embodiment of the present invention, the generating unit 722 may be further configured to: and calculating the distribution probability of each word in the vocabulary table through an attention mechanism, and selecting at least one generating type keyword according to the distribution probability of each word.
In a preferred embodiment of the present invention, the generating unit 722 may be further configured to: measure the weight of each term in the initial query statement by a score method, and compute a weighted sum to obtain a context vector; combine the context vector with the target hidden vector at the current moment and pass the result through two fully-connected layers to obtain the distribution probability of each word in the vocabulary, where the target hidden vector is the hidden-layer variable of the decoder at time t; predict at least one generated keyword from the vocabulary; and use a coverage mechanism to assist the decoder in generating non-repeated keywords.
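As an illustrative Python (PyTorch) sketch of this generating unit, using a bilinear "general" score as one plausible reading of the score method; the scoring form, the layer sizes, and the names W_score, fc1 and fc2 are assumptions, not elements specified by this embodiment:

```python
import torch
import torch.nn.functional as F

def vocab_distribution(enc_hidden, dec_hidden, W_score, fc1, fc2):
    """Score each source term against the decoder state, build the
    context vector as the attention-weighted sum of encoder states,
    then map [context; decoder state] through two fully-connected
    layers into a distribution over the vocabulary."""
    scores = enc_hidden @ W_score @ dec_hidden   # (src_len,) term scores
    attn = F.softmax(scores, dim=0)              # attention weights a_t
    context = attn @ enc_hidden                  # context vector C_t
    hidden = torch.tanh(fc1(torch.cat([context, dec_hidden])))
    return F.softmax(fc2(hidden), dim=0)         # P_vocab(w)
```

Here enc_hidden would be the (src_len, dim) matrix of input hidden vectors and dec_hidden the decoder state at time t, with fc1 and fc2 being torch.nn.Linear layers of assumed sizes (2·dim → dim) and (dim → vocab_size).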
In a preferred embodiment of the present invention, the extracting unit 723 may be further configured to: and calculating the weight of each term in the initial query statement through the attention matrix, and selecting at least one extraction type keyword according to the weight of each term.
In a preferred embodiment of the present invention, the extracting unit 723 may be further configured to: calculate the weight of each term in the initial query statement based on TF-IDF (term frequency-inverse document frequency) and the attention weight, wherein the TF-IDF and the attention weight a_t are balanced by a second adjustment factor p_w; and select at least one term from the terms as an extraction type keyword according to the weight of each term in the initial query statement.
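A minimal Python sketch of one way this balancing could work follows; the linear-interpolation form and the final normalization are assumptions, since the embodiment only states that the two signals are adjusted by p_w:

```python
import numpy as np

def extract_weights(tfidf, attention, p_w):
    """Blend the TF-IDF weight and the attention weight a_t of each
    source term with the second adjustment factor p_w, then normalize
    so the weights form a distribution P_extract over the terms."""
    blended = p_w * tfidf + (1.0 - p_w) * attention
    return blended / blended.sum()
```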
In a preferred embodiment of the present invention, as shown in fig. 8, the calculating module 730 may include:
an adjustment factor calculation unit 731 configured to calculate a first adjustment factor that adjusts a ratio of weights of the keywords in the generated rewrite mode and the extracted rewrite mode;
an obtaining unit 732 configured to obtain each of the generated keywords and the extracted keywords;
the comprehensive calculation unit 733 is configured to calculate a comprehensive weight of each keyword by adjusting a distribution probability and a weight ratio of the same keyword by a first adjustment factor.
In a preferred embodiment of the present invention, the comprehensive calculation unit 733 may be further configured to calculate a comprehensive weight of each keyword using the following formula:
P(w) = p_gen·P_vocab(w) + (1 − p_gen)·P_extract(w)

where P(w) represents the comprehensive weight of the keyword, p_gen represents the first adjustment factor, P_vocab(w) represents the distribution probability of the keyword in the vocabulary, and P_extract(w) represents the weight of the keyword in the initial query statement.
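For illustration, a short Python sketch of this combination over the union of generated and extracted keywords follows; treating a keyword absent from one side as contributing zero there is an assumption, consistent with the pointer-generator formulation cited below:

```python
def comprehensive_weights(p_gen, p_vocab, p_extract):
    """P(w) = p_gen*P_vocab(w) + (1 - p_gen)*P_extract(w) over the union
    of generated and extracted keywords; a keyword missing from one
    distribution is assumed to contribute zero on that side."""
    keywords = set(p_vocab) | set(p_extract)
    return {w: p_gen * p_vocab.get(w, 0.0)
               + (1.0 - p_gen) * p_extract.get(w, 0.0)
            for w in keywords}
```

The rewrite model can then sort this dictionary by weight and output the top-ranked keywords as the query keywords.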
In a preferred embodiment of the present invention, the first construction module 710 may be further configured to: construct an encoder according to a bidirectional LSTM long-short term memory network; after receiving the word vectors corresponding to the terms in the initial query sentence, the encoder encodes the word vectors corresponding to the terms and represents them respectively as input hidden vectors.
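A minimal Python (PyTorch) sketch of such a bidirectional LSTM encoder is given below; the embedding and hidden dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bidirectional LSTM encoder: turns the word vectors of the initial
    query statement into input hidden vectors h_s, one per term."""
    def __init__(self, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, word_vectors):       # (batch, src_len, embed_dim)
        hidden_states, _ = self.lstm(word_vectors)
        return hidden_states               # (batch, src_len, 2*hidden_dim)
```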
In a preferred embodiment of the present invention, the calculating module 730 can be further configured to calculate the first adjustment factor using the following formula:
p_gen = σ(w_h^T·C_t + w_s^T·h_t + w_x^T·x_t + d)

where w_h, w_s, w_x and d represent training parameters, C_t represents the context vector, h_t represents the target hidden vector, h_s represents the input hidden vector, x_t represents the word in the initial query statement at time t, σ represents the sigmoid function, and p_gen represents the first adjustment factor.
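Numerically, this gate can be sketched in a few lines of Python; the vector shapes (all 1-D arrays of equal length, with d a scalar bias) are assumptions:

```python
import numpy as np

def first_adjustment_factor(w_h, C_t, w_s, h_t, w_x, x_t, d):
    """p_gen = sigmoid(w_h^T C_t + w_s^T h_t + w_x^T x_t + d): squashes
    the context vector, decoder state and current word vector into a
    scalar in (0, 1) that balances the generative and extractive modes."""
    z = w_h @ C_t + w_s @ h_t + w_x @ x_t + d
    return 1.0 / (1.0 + np.exp(-z))
```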
Based on the same inventive concept, an embodiment of the present invention further provides a computer storage medium, where computer program codes are stored, and when the computer program codes run on a computing device, the computing device is caused to execute any one of the above-mentioned methods for constructing a rewrite model.
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including: a processor; a memory storing computer program code; the computer program code, when executed by a processor, causes a computing device to perform any of the above-described methods of constructing a rewrite model.
The embodiment of the invention provides a simple and efficient method and device for constructing a rewrite model. The encoder encodes the input sentence into a vector, and the decoder decodes this vector to output a sequence. The rewrite model provided by the embodiment of the invention combines an extractive mode and a generative mode to generate keywords, and the proportion of the two modes is adjusted by an adjustment factor, so that the query statement can be simplified without changing its semantics, finally outputting at least one keyword with the highest semantic similarity to the initial query statement input by the user; compared with a traditional rewrite model, this can return a query better suited to a search engine without changing the user's real intention. Furthermore, the rewrite model constructed based on the embodiment of the invention can accurately and efficiently rewrite the query sentences that users input to a search engine; the real query records of users are cleaned and used as the data for training the rewrite model, with high-quality user search query records serving as the initial training corpus, so that the query results better meet the user's expectations.
It is clear to those skilled in the art that the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, further description is omitted here.
In addition, the functional units in the embodiments of the present invention may be physically independent of each other, two or more functional units may be integrated together, or all the functional units may be integrated in one processing unit. The integrated functional units may be implemented in the form of hardware, or in the form of software or firmware.
Those of ordinary skill in the art will understand that: the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computing device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention when the instructions are executed. And the aforementioned storage medium includes: u disk, removable hard disk, Read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disk, and other various media capable of storing program code.
Alternatively, all or part of the steps of the foregoing method embodiments may be implemented by hardware (such as a computing device, e.g., a personal computer, a server, or a network device) executing program instructions, and the program instructions may be stored in a computer-readable storage medium; when the program instructions are executed by a processor of the computing device, the computing device performs all or part of the steps of the method according to the embodiments of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present invention; such modifications or substitutions do not depart from the scope of the present invention.

Claims (10)

1. A method for building a rewrite model, comprising:
constructing an encoder, after receiving word vectors corresponding to all terms in an initial query statement, encoding the word vectors corresponding to all terms by the encoder and respectively representing all the word vectors as input hidden vectors;
constructing a decoder for decoding the input hidden vector, and obtaining a plurality of keywords respectively based on a generating type rewriting mode and an extracting type rewriting mode through the decoder;
calculating a first adjustment factor for adjusting the weight proportion of each keyword in the generating rewrite mode and the extracting rewrite mode, so as to calculate the comprehensive weight of each keyword based on the first adjustment factor;
and combining the encoder and the decoder, and setting the first regulating factor in the decoder to complete the construction of a rewriting model based on an encoder-decoder structure, wherein the rewriting model selects at least one keyword from the plurality of keywords as a query keyword similar to the semantics of the initial sentence according to the comprehensive weight of each keyword and then outputs the selected keyword.
2. The method of claim 1, wherein said constructing a decoder that decodes the input hidden vector, obtaining, by the decoder, a plurality of keywords based on a generative rewrite mode and a decimated rewrite mode, respectively, comprises:
constructing a decoder according to a unidirectional LSTM long-short term memory network, and decoding the input hidden vector through the decoder;
generating at least one generative keyword by adopting a generative rewrite mode based on a preset vocabulary;
and extracting at least one extraction type keyword based on the initial query statement in an extraction type rewriting mode.
3. The method according to claim 1 or 2, wherein the generating at least one generated keyword in a generated rewrite mode based on a preset vocabulary comprises:
and calculating the distribution probability of each word in the vocabulary through an attention mechanism, and selecting at least one generating type keyword according to the distribution probability of each word.
4. The method according to any one of claims 1-3, wherein the calculating a distribution probability of each word in the vocabulary by an attention mechanism and selecting at least one generative keyword according to the distribution probability of each word comprises:
measuring the weight of each term in the initial query statement by a score method, and computing a weighted sum to obtain a context vector;
combining the context vector with the target hidden vector at the current moment to obtain the distribution probability of each word in the vocabulary through two fully-connected layers; the target hidden vector is a hidden layer variable of a decoder at the time t;
predicting and generating at least one generated keyword in the vocabulary;
and utilizing a coverage mechanism to assist the decoder to generate non-repeated generating type keywords.
5. The method of any of claims 1-4, wherein said extracting at least one extracted keyword in an extracted rewrite mode based on the initial query statement comprises:
and calculating the weight of each term in the initial query statement through an attention matrix, and selecting at least one extraction type keyword according to the weight of each term.
6. The method of any one of claims 1-5, wherein the calculating a weight of each term in the initial query statement by an attention matrix and selecting at least one extracted keyword according to the weight of each term comprises:
calculating the weight of each term in the initial query statement based on TF-IDF (term frequency-inverse document frequency) and the attention weight, wherein the TF-IDF and the attention weight a_t are balanced by a second adjustment factor p_w;
and selecting at least one term from the terms as an extraction type keyword according to the weight of each term in the initial query statement.
7. The method of any of claims 1-6, wherein the calculating a first adjustment factor that adjusts a weight ratio of each keyword in the generating rewrite mode and the extracting rewrite mode to calculate a composite weight for each keyword based on the first adjustment factor comprises:
calculating a first adjusting factor for adjusting the weight proportion of each keyword in the generating type rewriting mode and the extracting type rewriting mode;
acquiring each keyword in the generated keywords and the extracted keywords;
and adjusting the distribution probability and the weight proportion of the same keyword through the first adjusting factor, and calculating the comprehensive weight of each keyword.
8. An overwrite model construction apparatus comprising:
the first construction module is configured to construct an encoder, and after word vectors corresponding to all terms in an initial query statement are received, the encoder encodes the word vectors corresponding to all terms and respectively represents all the word vectors as input hidden vectors;
a second construction module configured to construct a decoder that decodes the input hidden vector, and obtain a plurality of keywords by the decoder based on a generative rewrite mode and an extractable rewrite mode, respectively;
the calculation module is configured to calculate a first adjusting factor for adjusting the weight proportion of each keyword in the generating type rewriting mode and the extracting type rewriting mode, so as to calculate the comprehensive weight of each keyword based on the first adjusting factor;
and the setting module is configured to combine the encoder and the decoder, set the first regulating factor in the decoder, complete the construction of a rewriting model based on an encoder-decoder structure, and select at least one keyword from the plurality of keywords according to the comprehensive weight of each keyword by the rewriting model to serve as a query keyword similar to the semantics of the initial sentence and output the query keyword.
9. A computer storage medium storing computer program code which, when run on a computing device, causes the computing device to perform a method of building a rewrite model according to any of claims 1 to 7.
10. A computing device, comprising:
a processor;
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform a method of building a rewrite model according to any of claims 1 to 7.
CN201811161824.1A 2018-09-30 2018-09-30 Method and device for constructing rewriting model Pending CN110990578A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811161824.1A CN110990578A (en) 2018-09-30 2018-09-30 Method and device for constructing rewriting model


Publications (1)

Publication Number Publication Date
CN110990578A (en) 2020-04-10

Family

ID=70059789


Country Status (1)

Country Link
CN (1) CN110990578A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN108280112A (en) * 2017-06-22 2018-07-13 腾讯科技(深圳)有限公司 Abstraction generating method, device and computer equipment
CN107783960A (en) * 2017-10-23 2018-03-09 百度在线网络技术(北京)有限公司 Method, apparatus and equipment for Extracting Information
CN107844469A (en) * 2017-10-26 2018-03-27 北京大学 The text method for simplifying of word-based vector query model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ABIGAIL SEE et al.: "Get To The Point: Summarization with Pointer-Generator Networks", HTTPS://ARXIV.ORG/ABS/1704.04368, 25 April 2017 (2017-04-25) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749553A (en) * 2020-06-05 2021-05-04 腾讯科技(深圳)有限公司 Text information processing method and device for video file and server
CN112749553B (en) * 2020-06-05 2023-07-25 腾讯科技(深圳)有限公司 Text information processing method and device for video file and server
US20230123581A1 (en) * 2020-06-28 2023-04-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Query rewriting method and apparatus, device and storage medium
CN112528605A (en) * 2020-11-11 2021-03-19 北京百度网讯科技有限公司 Text style processing method and device, electronic equipment and storage medium
CN112528605B (en) * 2020-11-11 2024-01-16 北京百度网讯科技有限公司 Text style processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110969024A (en) Query statement rewriting method and device
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
CN108986186B (en) Method and system for converting text into video
Cai et al. Audio‐Textual Emotion Recognition Based on Improved Neural Networks
CN111339305B (en) Text classification method and device, electronic equipment and storage medium
CN112464656B (en) Keyword extraction method, keyword extraction device, electronic equipment and storage medium
CN111914062B (en) Long text question-answer pair generation system based on keywords
CN110990578A (en) Method and device for constructing rewriting model
CN111414513B (en) Music genre classification method, device and storage medium
CN112883734B (en) Block chain security event public opinion monitoring method and system
CN113590970A (en) Personalized digital book recommendation system and method based on reader preference, computer and storage medium
CN115048586B (en) Multi-feature-fused news recommendation method and system
CN108228808A (en) Determine the method, apparatus of focus incident and storage medium and electronic equipment
CN113806554A (en) Knowledge graph construction method for massive conference texts
CN116975615A (en) Task prediction method and device based on video multi-mode information
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
CN112487274B (en) Search result recommendation method and system based on text click rate
CN110909021A (en) Construction method and device of query rewriting model and application thereof
CN113688231A (en) Abstract extraction method and device of answer text, electronic equipment and medium
Cui et al. A chinese text classification method based on bert and convolutional neural network
CN110968759A (en) Method and device for training rewriting model
CN112287687A (en) Case tendency extraction type summarization method based on case attribute perception
Wang Neural Network‐Based Dynamic Segmentation and Weighted Integrated Matching of Cross‐Media Piano Performance Audio Recognition and Retrieval Algorithm
CN115391522A (en) Text topic modeling method and system based on social platform metadata
CN111339783B (en) RNTM-based topic mining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination