CN112182155B

CN112182155B - Search result diversification method based on generated type countermeasure network

Info

Publication number: CN112182155B
Application number: CN202011024084.4A
Authority: CN
Inventors: 窦志成; 刘炯楠
Original assignee: Renmin University of China
Current assignee: Renmin University of China
Priority date: 2020-09-25
Filing date: 2020-09-25
Publication date: 2023-08-18
Anticipated expiration: 2040-09-25
Also published as: CN112182155A

Abstract

The application realizes a search result diversified training method based on a generated type countermeasure network by a method in the artificial intelligence field, after query words are given, a corresponding candidate document set is defined, a sampler, a generator and a judging unit are sequentially arranged on a logic path, and a means of diversified scoring functions is arranged in the judging unit and the generator, so that training is performed by a positive feedback process; and introducing a generative countermeasure network, and combining an explicit model and an implicit model through the generative countermeasure network, so that a final generator can generate better diversified document sequences through the means.

Description

Search result diversification method based on generated type countermeasure network

Technical Field

The application relates to the field of artificial intelligence, in particular to a search result diversification method based on a generated type countermeasure network.

Background

Search result diversification is an effective method for solving the problem that users propose fuzzy query, most of the currently mainstream diversification algorithms are supervised methods, and a high-quality data set is required for training a search result diversification model. The primary goal of search result diversification is to have the ranked list returned by the search engine cover as much as possible all of the sub-topics of the user's query. A series of search result diversification algorithms have been proposed by researchers. The main flow of these algorithms is: when a user proposes a query word, according to the diversification scoring function, continuously selecting the best diversification document under the current selected document sequence and adding the selected document sequence, and continuously repeating the process to guide the document sequence to be long enough. The models can be broadly divided into two types: implicit models and explicit models. The implicit model focuses on how novel the document is. Dissimilarity between scored documents and selected documents is considered in the diversified scoring function of the implicit model.

The existing research results show that the supervised learning method is better than the unsupervised method. But supervised learning requires high quality data samples for training. However, since there are a large number of documents in the training set, there are few documents related to each sub-topic, resulting in a high quality data sample that is difficult to obtain. The existing supervision method solves the problem through handwriting rules, but has shortcomings. [2] The first 20 documents in the ideal ranking, while of higher quality, are fewer in number resulting in a model that may be under fitted. [3] The positive example sample and the negative example sample are selected according to the evaluation index adopted in the process, and the super parameter which depends on the evaluation index range is compared. The scarcity of high quality training data samples can lead to insufficient or offset training, affecting the final effect. Meanwhile, the existing models can be basically divided into an explicit model and an implicit model respectively, two ideas are not combined, and all information is not utilized, and if the explicit model and the implicit model are combined in a certain way, the diversified effect of the search result can be possibly improved.

Disclosure of Invention

Therefore, the application provides a training method based on diversification of search results of a generated countermeasure network, which comprises the following steps:

in the training process, after a query word in a training library is given, a corresponding candidate document set is defined, a sampler, a generator and a judging device unit are sequentially arranged through a logic path, a diversified scoring function of the judging device is arranged in the judging device, a means of the diversified scoring function of the generator is arranged in the generator, and the training is carried out through a positive feedback process; and, the diversified scoring function introduces a generative countermeasure network, and simultaneously combines an explicit model and an implicit model through the generative countermeasure network. Finally, in the use process, after the user puts forward the query word, the generator performs diversified rearrangement on the search result and returns the diversified search result, and finally the search result is obtained;

specifically, for query word q in training, its sub-topic { i } is determined ₁ ,i ₂ ,…,i _k -the corresponding candidate document set is d= { D } ₁ ,d ₂ ,…,d _k The sampler firstly selects the document from the document set D, rearranges the document, and inputs the rearranged sequence S into the generator as prefix dataThe generator takes S as a selected document set, several document sets D ' with highest scores selected according to a diversified scoring function are taken as negative example samples to be sent to the determiner, the positive example samples are documents D selected according to the maximized diversified scoring standard, and the determiner classifies the negative example document sets D ' and the positive example documents D and gives feedback to the generator after receiving the negative example document sets D ' and the positive example documents D;

this process is formulated as:

g is a generator, D is a determiner, θ is a generator parameter, φ is a determiner parameter, D _φ Given by a sigmod function, a sample distribution p is generated _θ Given by the softmax function;

wherein f _φ The scoring function is diversified in the determiner. f (f) _θ The formulas of the optimization generator and the determiner are as follows:

log(1+exp(f _φ (d|q, S)) is the feedback of the determiner to the generator.

The implementation mode of the diversified scoring function of the determiner is specifically as follows: defining the scoring document sent by the generator as d _t The query is q, and the sub topics are I respectively _q ＝{i ₁ ,i ₂ ,…,i _K The selected document sequence is s= { d } ₁ ,d ₂ ,…d _t-1 Searching query words and sub topics by using a traditional search model, selecting documents with the front order to link into a pseudo document, embedding the document, the pseudo document corresponding to the query and the pseudo document corresponding to the sub topics by using a doc2vec model, and generating a vector e after embedding the scoring document _d For the vector e after the query is embedded _q A vector e embedded in the sub-topics _i And further modeling a relevance vector x of the scoring document and the query _d,q The relevance vector x of the scoring document and the sub-topics _d,i After feature extraction, a diversified scoring function of the determiner is obtained:

the calculation process of the sub-topic distribution condition A (i|S) under the scoring document is that: firstly, using a recurrent neural network to synthesize the selected scoring document:

LSTM is a neuron function of long-term and short-term memory network, and after passing through a layer of recurrent neural network, a distributed representation h of the selected document is obtained _t-1 If the whole information of the past document is contained, the method for calculating the sub-topic distribution is as follows:

for a pair ofFurther considering information about the relevance between the scoring document and the sub-topics:

and finally obtaining a complete diversified scoring function calculation method of the determiner.

The traditional retrieval model is a BM25 model, and the features extracted by the features comprise a TF-IDF model, a BM25 model, an LMIR model, a PageRank score, a webpage input degree and a webpage output degree.

The diversified scoring function of the generator is the diversified scoring function of the implicit model R-LTR, and the implementation mode is as follows: based on the extraction of the scoring documents and the query consideration features, consider a correlation vector between the scoring documents and the query, and a relationship vector R between the scoring documents _ij Modeling a relationship vector using four dimensions, the dimensions including: sub-topic diversity, text diversity, title diversity, anchor text diversity;

the diversified scoring function of the implicit model R-LTR is specifically as follows:

where R is _ijk For dissimilarity between documents considered from different dimensions, we apply here to d _i The maximum dissimilarity with the previously selected document represents his novelty.And->Are trainable parameters.

The implementation mode of the sampler is as follows: designing random samples to simulate a generator in generating the scoring document d _t The random sampling process is: directly selecting k=10 documents in the scoring document set, and rearranging according to the maximized diversified scoring index alpha-NDCG, wherein the obtained sequence S is used as the input of a generator.

The specific method for the generator to carry out diversified rearrangement on the search results comprises the following steps: firstly, initializing the sequence S of the scoring document selected to be empty; then, selecting and obtaining the highest diversified scoring function f _θ A document d of score; if S is long enough, the process is exited and a diversified search result is returned; otherwise, adding d into S, and returning to the previous step.

The application has the technical effects that:

(1) The generated countermeasure network is utilized, so that the problem that high-quality data samples are difficult to obtain is solved to a certain extent; (2) The explicit model and the implicit model are combined, and the coverage of search results is improved by utilizing information in different dimensions; (3) In order to provide data to the generators in the generative antagonism network, sampling algorithms are designed that compromise both scale and quality.

Drawings

Diversified scoring function model for the determiner of FIG. 1

Detailed Description

The following is a preferred embodiment of the present application and a technical solution of the present application is further described with reference to the accompanying drawings, but the present application is not limited to this embodiment.

In order to achieve the above object, the present application provides a method for diversifying search results based on a generative countermeasure network.

Considering the prior art, the search result diversification is an effective method for solving the problem that users propose fuzzy query, most of the currently mainstream diversification algorithms are supervised methods, and the current diversification algorithms can be divided into an explicit model and an implicit model according to different utilized information and different optimized targets. The main flow of these algorithms is: when a user proposes a query word, selecting the best diversified document under the selected document sequence according to the diversified scoring function, adding the best diversified document into the document sequence, and continuously expanding the sequence until the sequence is long enough. The diversity scoring function in the supervised method requires high quality data set training, but how to select the data set is a challenge for the current diversity algorithm because of the large data volume and the small number of documents related to the sub-topics. The present application therefore introduces a generative countermeasure network to the training process of diversification of search results to generate data to replace handwritten rules. The explicit model and the implicit model are combined through the generated countermeasure network, so that the diversity effect of the search results is improved.

Search result diversification framework based on generated type countermeasure network

Most of the existing models for diversification of search results are based on a supervised method, and the supervised method requires high quality data for training. The model of the application introduces the generated countermeasure network into the diversity of search results, directly optimizes the sub-topic coverage in the negative example sample by using an explicit model in the generator, directly compares dissimilarities among documents by using an implicit model in the determiner, and can provide finer information for the generator for optimization. In addition, since it is difficult for the generator itself to generate negative examples, the present application requires one sampler to generate prefix data for the generator.

Assume that for query q, its sub-topics are { i } ₁ ,i ₂ ,…,i _k The corresponding candidate document set D is { D } ₁ ,d ₂ ,…,d _k }. The sampler firstly selects the document from the document set D, rearranges the document, and inputs the rearranged sequence S into the generator as prefix data. The generator takes S as a selected document set, and several document sets D' with highest scores selected according to a diversified scoring function are taken as negative examples and fed to the determiner. The positive example sample is the document d selected with the maximized diversity score criteria. After receiving the negative example document set D' and the positive example document D, the determiner classifies them and gives feedback to the generator, and trains with a positive feedback process.

The entire training process can be formulated with the following formula:

in the formula, G is a generator, D is a determiner, theta is a generator parameter, phi is a determiner parameter, D _φ Given by a sigmod function, a sample distribution p is generated _θ The softmax function is given.

Wherein f _φ The scoring function is diversified in the determiner. f (f) _θ The scoring function is diversified in the determiner. With the above formulas, the present application can easily give formulas for optimizing the generator and the determiner.

Since the generated challenge network is difficult to calculate gradients in discrete domains, the present application employs a poliygradodient approach in reinforcement learning, by sampling from a negative example sample set, where log (1+exp (f) _φ (d|q, S)) can be considered as feedback from the determiner to the generator, which includes some information not considered by the generator, and helps to promote the diversity effect of the search results of the generator.

Diversified scoring function used by determiner

As an important component of the generative antagonism network, the diversified scoring functions in the decider directly determine whether the decider can effectively classify the positive example document and the negative example document. In the model of the application, a diversified scoring function of the explicit model DSSA is selected. The reason is that the explicit model uses external information, which can provide finer information to the generator than the implicit method, thereby forming positive feedback.

The present application assumes that the current scoring document is d _t The query is q, and the sub topics are I respectively _q ＝{i ₁ ,i ₂ ,…,i _K The selected document sequence is s= { d } ₁ ,d ₂ ,…d _t-1 }。

Considering the extraction of characteristics, firstly, the application needs to embed documents and queries respectively with sub-topics, and the generated vectors are e respectively _d ,e _q ,e _i Considering that the query and the sub-topics are actually shorter and only consist of a few words, the application uses the traditional retrieval model such as BM25 to retrieve the query words and the sub-topics, selects the documents with the front order to link into a pseudo document, then embeds the document, the pseudo document corresponding to the query and the pseudo document corresponding to the sub-topics through the doc2vec model to obtain e _d ,e _q ,e _i . But using only embedded documents may not be accurate, so the present application also models documents and queries, direct relevance vectors for documents and sub-topics, x respectively _d,q ,x _d,i The characteristics employed are shown in the following table:

name of the name	Description of the application	Length of
			TF-IDF	TF-IDF model	5
BM25	BM25 model	5
			LMIR	LMIR model	5
PAGERANK	PageRank score	1
			Degree of penetration	Webpage income degree	1
Degree of delivery	Web page out-degree	1

After feature extraction, the form of the diversified scoring function is given:

it can be seen that S ^rel (d _t Q) and S ^sub (d _t ,i _k ) The relevance of documents to queries and sub-topics is scored separately. The most critical part of the whole model is A (i|S), namely the distribution of sub-topics under the currently selected documents. Firstly, considering that the selected document sequence S contains sequence information, the application firstly utilizes a recurrent neural network to synthesize the selected document, and a specific formula is as follows:

LSTM, a neuron function of a long-term and short-term memory network, has a good effect in all recurrent neural networks, and can obtain a representation h of a selected document after passing through a layer of recurrent neural network _t-1 Then the application has a relatively simple method of calculating the sub-topic distribution:

may be used as part of computing features of the sub-topic distribution, but since the application is used in extracting this part of featuresAt the time of symptom, the embedded vector is mainly used, which may not be accurate enough, so the application considers the related information between the document and the sub-topics:

it can be seen that the final model relates the embedded representation of the document itself to the relevance vector of the document to the query, and the relevance vector of the document to the sub-topic.

Diversified scoring function used by generator

As an important component in the generative antagonism network, the diversity scoring function in the generator directly determines the quality of the documents generated by the generator, and thus the behavior of the model. In the model of the present application, the present application selects a diversified scoring function using an implicit model R-LTR in the generator. The implicit model has the advantages that the implicit model has fewer relative parameters and is easy to train, meanwhile, the implicit model directly uses the document to extract the characteristics, and external sub-topic information is not needed.

The present application assumes that the current scoring document is d _t The query is q, and the selected document sequence is s= { d ₁ ,d ₂ ,…d _t-1 }。

Considering feature extraction, similar to the explicit method, the application also considers the relevance vector between the document and the query. At the same time, the application also relates to the relation vector R between the documents _ij Modeling was performed. Considering that when humans compare documents, multiple pieces of information in the documents are often extracted for comparison, such as topics, first sentence of each segment, etc., the present application employs four dimensions to model a relationship vector. The following table shows:

name of the name	Description of the application
		Sub-topic diversity	Euclidean distance on SVD model
Text diversity	Cosine similarity of text vectors
		Title diversity	Cosine similarity of heading vectors
Anchor text diversity	Cosine similarity of anchor text vectors

After feature extraction, the present application gives a scoring function for the implicit model R-LTR:

in the model of R-LTR, the degree of novelty of the scored document relative to the selected document is ultimately obtained by comparison of the document to a plurality of features between the documents. Because of the novelty of directly considering the document, the generator can give some information which cannot be considered by the model considering the coverage of the sub-topics, and the information can help the generator to further optimize the negative example document generated by the generator, give feedback to the determiner, and continuously give positive feedback to the determiner. Thereby promoting diversification of search results.

Sampling device

The sampler is used as a component for providing the selected document sequence S for the generator, the quality of S plays a certain role on the quality of the positive example document and the negative example document, and the selection of S is a part of the whole sampling, so the application designs an algorithm which has both quality and quantity.

Considering generator non-idealities, the present application designs random sampling to simulate the generator in generating d _t The ideal ordering case has not been generated before. The method has the advantages that the thought of random sampling is relatively simple, k documents in the document set are directly selected, rearranged according to the maximized diversified scoring index, and then used as an S as input of a generator. The sampling method also has some problems, and the quality of S can not be ensured due to direct random selection.

Through the test of the application, the combination of the method can generate a better sampling effect, and finally the model of the application also adopts the method for sampling.

Diversification rearrangement of search results

Firstly, the application firstly introduces the training process of the search result diversification algorithm based on the generated countermeasure network. Considering that if the method of generating an countermeasure network is adopted for direct cold start, offset on training can be caused, the method is simple by firstly pre-training two diversified scoring functions of a generator and a determiner before adopting the training method, the method adopts an optimization method in R-LTR, a sequence composed of the first 20 documents in ideal ordering is selected, and the generator and the determiner are respectively optimized by taking the sequence as input according to a maximum likelihood method. After model optimization, reheat starts training of the generated type countermeasure network proposed by the application.

In both the previous MLE and the generative antagonism network, the application gradually optimizes the model through an Adam Optimezer optimizer, and selects a final generator as a search result diversification model.

The way of using the model to diversify the rearrangement of search results is as follows:

firstly, initializing a selected document sequence S to be empty;

secondly, selecting a document d for obtaining the score of the highest diversified scoring function f;

finally, if S is long enough, the process is exited. Otherwise, adding d into S, and returning to the previous step

Through the mode, the method and the device can return a diversified search result aiming at one query of the user.

Claims

1. A search result diversification method based on a generated type countermeasure network is characterized in that: in the training process, after a query word in a training library is given, a corresponding candidate document set is defined, a sampler, a generator and a judging device unit are sequentially arranged through a logic path, a diversified scoring function of the judging device is arranged in the judging device, a means of the diversified scoring function of the generator is arranged in the generator, and the training is carried out through a positive feedback process; in addition, a generating type countermeasure network is introduced into the diversified scoring function, meanwhile, an explicit model and an implicit model are combined through the generating type countermeasure network, finally, in the use process, after a user puts forward a query word, the generator performs diversified rearrangement on search results and returns diversified search results, and finally, the search results are obtained;

specifically, for query word q in training, its sub-topic { i } is determined ₁ ,i ₂ ,…,i _k -the corresponding candidate document set is d= { D } ₁ ,d ₂ ,…,d _k Firstly, selecting a document from a document set D by the sampler, rearranging the document, inputting a sequence S obtained after rearranging into the generator as prefix data, taking the S as the selected document set, and taking a plurality of document sets D' with highest scores selected according to a diversified scoring function as negative example samples to be fed into the determiner by the generatorThe positive example sample is a document D selected by the maximized diversified scoring standard, and the determiner classifies the negative example document set D' and the positive example document D after receiving the same and gives feedback to the generator;

this process is formulated as:

wherein f _φ For diversifying scoring functions in a decider, f _θ The formulas of the optimization generator and the determiner are as follows:

log(1+exp(f _φ (d|q, S)) is the feedback of the determiner to the generator.

2. The method for diversifying search results based on a generative antagonism network of claim 1, wherein: implementation method of diversified scoring function of determinerThe formula is specifically as follows: defining the scoring document sent by the generator as d _t The query is q, and the sub topics are I respectively _q ＝{i ₁ ,i ₂ ,…,i _K The selected document sequence is s= { d } ₁ ,d ₂ ,…d _t-1 Searching query words and sub topics by using a traditional search model, selecting documents with the front order to link into a pseudo document, embedding the document, the pseudo document corresponding to the query and the pseudo document corresponding to the sub topics by using a doc2vec model, and generating a vector e after embedding the scoring document _d For the vector e after the query is embedded _q A vector e embedded in the sub-topics _i And further modeling a relevance vector x of the scoring document and the query _d,q The relevance vector x of the scoring document and the sub-topics _d,i After feature extraction, a diversified scoring function of the determiner is obtained:

LSTM is a neuron function of the long-term and short-term memory network, and after passing through a layer of recurrent neural network, the distributed representation h of the selected document is obtained _t-1 If the whole information of the past document is contained, the method for calculating the sub-topic distribution is as follows:

3. A method of diversifying search results based on a generative antagonism network as recited in claim 2, wherein: the traditional retrieval model is a BM25 model, and the features extracted by the features comprise a TF-IDF model, a BM25 model, an LMIR model, a PageRank score, a webpage input degree and a webpage output degree.

4. A method of diversifying search results based on a generative antagonism network as claimed in claim 3, wherein: the diversified scoring function of the generator is the diversified scoring function of the implicit model R-LTR, and the implementation mode is as follows: considering the scoring based on the extraction of the scoring document and the query consideration featuresA correlation vector between the scored document and the query, a relation vector R between the scored documents _ij Modeling a relationship vector using four dimensions, the dimensions including: sub-topic diversity, text diversity, title diversity, anchor text diversity;

where R is _ijk For dissimilarity between documents considered from different dimensions, pair d _i The maximum dissimilarity with the previously selected document represents his novelty,and->Are trainable parameters.

5. The method for diversifying search results based on a generative antagonism network of claim 4, wherein: the implementation mode of the sampler is as follows: designing random samples to simulate a generator in generating the scoring document d _t The random sampling process is: directly selecting k=10 documents in the scoring document set, and rearranging according to the maximized diversified scoring index alpha-NDCG, wherein the obtained sequence S is used as the input of a generator.

6. The method for diversifying search results based on a generative antagonism network of claim 5, wherein: the generator performs multiple search resultsThe specific method for the sampling rearrangement comprises the following steps: firstly, initializing the sequence S of the scoring document selected to be empty; then, a diversified scoring function f used for obtaining the highest generator is selected _θ A document d of score; if S is long enough, the process is exited and a diversified search result is returned; otherwise, adding d into S, and returning to the previous step.