CN111444414A - Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task - Google Patents

Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task Download PDF

Info

Publication number
CN111444414A
CN111444414A (application CN201910898272.0A)
Authority
CN
China
Prior art keywords
document
model
query
information
match
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910898272.0A
Other languages
Chinese (zh)
Inventor
胡泽婷 (Hu Zeting)
张鹏 (Zhang Peng)
蒋永余 (Jiang Yongyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910898272.0A priority Critical patent/CN111444414A/en
Publication of CN111444414A publication Critical patent/CN111444414A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information retrieval model, the Match-Transformer model, for modeling various relevant features in an ad-hoc retrieval task. The method comprises the following steps: collecting a corpus according to topics and dividing it into a training set and a test set; preprocessing the queries and documents in the corpus; constructing vector representations of the queries and documents from global information and local information; inputting the vector representations of the training-set queries and documents into the Match-Transformer model to calculate document scores and train the final model; inputting the vector representations of the test-set queries and documents into the Match-Transformer model to calculate the final score of each document; and finally learning the relative position information between documents with a Learning-to-Rank model to obtain a more accurate document ranking result.

Description

Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task
Technical Field
The invention relates to the technical field of text information retrieval, in particular to an information retrieval model for modeling various relevant characteristics in an ad-hoc retrieval task.
Background
With the continuous development of the internet and intelligent technology, information retrieval is no longer confined to personal computer (PC) terminals; users increasingly rely on mobile devices to search for the information and services they need. The quality of the information retrieval model directly influences the retrieval result, so the information retrieval model has not only important theoretical significance but also great social value. The present invention mainly studies document ranking under a given query in the ad-hoc task, i.e., relevance analysis between queries and documents.
The information retrieval model is the main research content of information retrieval. Current information retrieval models include Boolean models, vector space models, probabilistic models, language models, and the like. The main purpose of these models is to abstract, by mathematical or other formal tools, the query and document in information retrieval and their degree of matching. Ad-hoc retrieval is a classical task in which a user specifies an information need through a query, which initiates a search (performed by an information system) for documents likely to be relevant to the user. A central problem in the ad-hoc information retrieval task is how to learn a generic function that evaluates the relevance between queries and documents. In ad-hoc retrieval, the heterogeneity of queries and documents poses challenges: insufficient context information and overly long documents increase the difficulty of document understanding. A consequence of this heterogeneity for relevance determination is that relevance may depend on multiple distinct relevance features.
Recent neural information retrieval models include MatchPyramid (MP), K-NRM, Conv-KNRM and NNQLM-II. However, these models use only a small number of relevant features, or consider multiple relevant features only from the document perspective, and do not consider the relevant features of the query and the mutual information between query features and document features.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide an information retrieval model for modeling various relevant characteristics in an ad-hoc retrieval task. The model constructs vector representations of queries and documents, uses a Match-Transformer model to capture dependency information, context information and interaction information between queries and documents, then uses a multilayer perceptron to obtain document scores and ranks, and uses Learning-to-Rank to learn the relative position information between documents, obtaining the prediction result of the optimal model on the test set and finally a more accurate evaluation result.
The purpose of the invention is realized by the following technical scheme, which comprises the following steps:
an information retrieval model for modeling diverse relevant features in an ad-hoc retrieval task, comprising the steps of:
(1) constructing a corpus organized by topics, wherein the corpus contains N topics in total, and each topic comprises a query and a series of documents;
(2) randomly selecting 80% of the N topics from the corpus of step (1) as a training set and the remaining 20% of the N topics as a test set, and preprocessing the training set and the test set respectively;
(3) constructing a Match-Transformer model for the preprocessed query and document;
(4) inputting the representations of the training-set queries and documents into the Match-Transformer model, and calculating the score of each document with a multilayer perceptron;
(5) updating the parameters of the trained Match-Transformer model through a Learning-to-Rank algorithm;
(6) importing the test-set data into the trained Match-Transformer model to calculate the ranking score of the documents finally returned for each topic;
(7) and outputting the evaluation result of the Match-Transformer model on the test set.
The method for constructing the Match-Transformer model in the step (3) comprises the following steps:
3.1 obtaining a 300-dimensional word vector of each word in each text by using the GloVe tool, initializing the parameter matrix with a uniform distribution in the model initialization stage, and updating and optimizing it in the model training process; the word vectors of the words in the query and the document correspond to W_i^Q and W_j^D respectively, where the query has n words and the document has m words, i = 1, …, n; j = 1, …, m.
3.2 determining whether the query word vector W_i^Q appears in the document T^D, and constructing the following Overlap Embedding function:

O_i^Q = 1 if w_i^Q ∈ T^D, and O_i^Q = 0 otherwise
3.3 combining the above two steps, the global information (word vectors) and the local information (traditional information retrieval features) of the query and the document are obtained, namely:

T_i^Q = [W_i^Q; O_i^Q; tf_i^Q]
T_j^D = [W_j^D; tfidf_j^D]
where tf_i^Q denotes the tf value of the ith word in the query and tfidf_j^D denotes the tf-idf value of the jth word in the document;
3.4 since the above steps do not capture the dependency information between query words and between document words, the query and the document are each represented by a density operator, namely:

ρ^Q = Σ_{i=1..n} p_i |T_i^Q⟩⟨T_i^Q|
ρ^D = Σ_{j=1..m} p_j |T_j^D⟩⟨T_j^D|
In step 3.4, to further obtain the matching feature information between the query and the document, the following is performed:

M = MultiHead(ρ^Q, ρ^Q, ρ^D)

where

MultiHead(P, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(P W_i^P, K W_i^K, V W_i^V)
Attention(P, K, V) = softmax(P K^T / √d) V
drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of the Match-Transformer framework model;
table 3 experimental comparison results of different information retrieval models;
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following. FIG. 1 shows a flow of query-document relevance analysis proposed by the present method; FIG. 2 shows a Match-Transformer model designed according to the present invention; table 3 shows the comparison results between the final different information retrieval models. The method comprises the following specific steps:
(1): from the TREC dataset, 1000 documents relevant to it were found from the dataset according to topic in Web TREC (Robust-04 and ClueWeb-09-CAT-B.).
(2): randomly selecting 80% of the 400 topics from the corpus obtained in step (1) as a training set and the remaining 20% as a test set, preprocessing the training set and the test set respectively, and removing stop words and punctuation marks from each text.
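The split and preprocessing of step (2) can be sketched as follows (the stop-word list, random seed and function names are illustrative assumptions):

```python
import random
import string

def split_topics(topics, train_frac=0.8, seed=42):
    """Randomly split topic IDs into train and test sets (80/20 as in step 2)."""
    rng = random.Random(seed)
    shuffled = topics[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Illustrative subset; a real stop-word list would be far larger.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOPWORDS]

train, test = split_topics(list(range(400)))
print(len(train), len(test))                # 320 80
print(preprocess("The quick, brown fox!"))  # ['quick', 'brown', 'fox']
```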
(3): for the preprocessed queries and documents, vector representations of the queries and documents are constructed according to traditional information retrieval features, word vectors and density operators, and the Match-Transformer model is constructed as shown in FIG. 2.
3.1: a 300-dimensional word vector of each word in each text is obtained using the GloVe tool; the parameter matrix is initialized with a uniform distribution in the model initialization stage and updated and optimized in the model training process. At this step we obtain the word vectors of each word in the query and in the document, corresponding to W_i^Q and W_j^D respectively, where the query has n words and the document has m words, i = 1, …, n; j = 1, …, m.
3.2: according to whether the query word W_i^Q appears in the document T^D, we construct an Overlap Embedding function: if the query term appears in the document, the function takes the value 1, otherwise 0.

O_i^Q = 1 if w_i^Q ∈ T^D, and O_i^Q = 0 otherwise
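The Overlap Embedding of step 3.2 reduces to a membership test; a minimal sketch (the function name is ours):

```python
def overlap_embedding(query_tokens, doc_tokens):
    """O_i = 1 if the i-th query word appears in the document, else 0."""
    doc_vocab = set(doc_tokens)
    return [1.0 if w in doc_vocab else 0.0 for w in query_tokens]

print(overlap_embedding(["neural", "retrieval"], ["deep", "neural", "ranking"]))
# [1.0, 0.0]
```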
3.3: by combining the two previous operations, the global information (word vectors) and the local information (traditional information retrieval features) of the query and the document are obtained, namely:

T_i^Q = [W_i^Q; O_i^Q; tf_i^Q]
T_j^D = [W_j^D; tfidf_j^D]

where tf_i^Q denotes the tf value of the ith word in the query and tfidf_j^D the tf-idf value of the jth word in the document.
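Assuming the global and local information are combined by concatenation (the exact formula is shown only as an image in the original), the feature construction of step 3.3 might look like this; the tf normalization, idf smoothing, and function names are our assumptions:

```python
import math
import numpy as np

def query_features(query, doc, embed, dim=4):
    """Concatenate each query word's vector with its overlap and tf features (step 3.3)."""
    doc_vocab = set(doc)
    feats = []
    for w in query:
        overlap = 1.0 if w in doc_vocab else 0.0
        tf = query.count(w) / len(query)
        feats.append(np.concatenate([embed.get(w, np.zeros(dim)), [overlap, tf]]))
    return np.stack(feats)

def doc_features(doc, corpus, embed, dim=4):
    """Concatenate each document word's vector with a smoothed tf-idf feature."""
    n_docs = len(corpus)
    feats = []
    for w in doc:
        tf = doc.count(w) / len(doc)
        df = sum(1 for d in corpus if w in d)
        idf = math.log((n_docs + 1) / (df + 1)) + 1   # smoothed idf
        feats.append(np.concatenate([embed.get(w, np.zeros(dim)), [tf * idf]]))
    return np.stack(feats)

q = query_features(["neural", "ranking"], ["deep", "ranking", "model"], {}, dim=4)
print(q.shape)  # (2, 6): 4-d word vector + overlap + tf
```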
In order to further obtain the dependency information between query words and between document words, the query and the document are each represented by a density operator, namely:

ρ^Q = Σ_{i=1..n} p_i |T_i^Q⟩⟨T_i^Q|
ρ^D = Σ_{j=1..m} p_j |T_j^D⟩⟨T_j^D|
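A density operator over word representations, as used above, can be sketched as follows, assuming uniform weights p_i = 1/n and L2-normalized vectors (common in quantum-inspired language models, but an assumption here):

```python
import numpy as np

def density_operator(vectors, weights=None):
    """rho = sum_i p_i |v_i><v_i| over L2-normalized word representations."""
    vecs = np.asarray(vectors, dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    n = len(vecs)
    p = np.full(n, 1.0 / n) if weights is None else np.asarray(weights) / np.sum(weights)
    # Weighted sum of outer products of the normalized vectors.
    return sum(pi * np.outer(v, v) for pi, v in zip(p, vecs))

rho = density_operator([[1.0, 0.0], [0.0, 1.0]])
print(np.trace(rho))  # 1.0 (a density matrix has unit trace)
```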
and constructing a Match-Transformer model according to the vector representation obtained above.
Furthermore, the invention uses the attention function of the Transformer model, i.e.

Attention(P, K, V) = softmax(P K^T / √d) V

where d represents the dimension of the columns in P, K and V. The basic framework of the Transformer model is thus as follows:
MultiHead(P, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(P W_i^P, K W_i^K, V W_i^V)
in the present invention, the above P, K and V correspond to ρ of the fourth step, respectivelyQ,ρQAnd ρD. The use of the Multi-head attention mechanism to build the relevance of matching features between a query and a document can also be seen as an interaction between relevance information between two texts. The specific implementation process is as follows:
Figure BDA00022109698200000410
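The interaction M = MultiHead(ρ^Q, ρ^Q, ρ^D) can be sketched in plain NumPy; the random per-head projections stand in for learned weights, and the output projection W^O is omitted for brevity:

```python
import numpy as np

def attention(P, K, V):
    """Scaled dot-product attention: softmax(P K^T / sqrt(d)) V."""
    d = P.shape[-1]
    scores = P @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head(P, K, V, heads, rng):
    """Project inputs per head, attend, then concatenate the head outputs."""
    d = P.shape[-1]
    outs = []
    for _ in range(heads):
        WP, WK, WV = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        outs.append(attention(P @ WP, K @ WK, V @ WV))
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(0)
rho_q = rng.standard_normal((4, 4))   # toy stand-in for the query density operator
rho_d = rng.standard_normal((4, 4))   # toy stand-in for the document density operator
M = multi_head(rho_q, rho_q, rho_d, heads=2, rng=rng)
print(M.shape)  # (4, 8): two heads of width 4, concatenated
```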
More subtle information among the related features is then captured through a convolutional neural network:

G_h = CNN_h(M)
P_h = Pooling(G_h)

where h represents the value of n in the n-gram, taking values in {2, 3, 4, 5}. Through the above steps, the matching feature X of the various related features is obtained:

X = P_2 ⊙ P_3 ⊙ P_4 ⊙ P_5
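Assuming max-pooling and treating CNN_h as an h-row sliding-window convolution (both assumptions; the original shows only G_h = CNN_h(M) as an image), the n-gram feature extraction can be sketched as:

```python
import numpy as np

def ngram_conv(M, h, kernel):
    """1-D convolution of window height h over the rows of the match matrix M (CNN_h)."""
    rows = M.shape[0]
    return np.stack([np.tensordot(M[i:i + h], kernel, axes=2)
                     for i in range(rows - h + 1)])

def ngram_features(M, kernels):
    """G_h = CNN_h(M); P_h = max-pool(G_h); X = P_2 ⊙ ... ⊙ P_5 (element-wise product)."""
    X = None
    for h, kernel in kernels.items():     # h ranges over {2, 3, 4, 5}
        G = ngram_conv(M, h, kernel)      # shape (rows - h + 1, n_filters)
        P = G.max(axis=0)                 # max over window positions
        X = P if X is None else X * P     # Hadamard product across n-gram sizes
    return X

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 8))           # toy match matrix
kernels = {h: rng.standard_normal((h, 8, 3)) for h in (2, 3, 4, 5)}
print(ngram_features(M, kernels).shape)   # (3,): one value per filter
```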
(4) inputting the training set query and the representation of the document into a Match-Transformer model, and calculating the score of the document by using a multilayer perceptron;
using a multi-tier perceptron to compute ranking scores for individual documents, namely:
f(X) = 2 · tanh(W · X^T + b)
where W and b are parameters to be learned in the linear ranking model and tanh(·) is the activation function; the factor 2 makes the ranking score range over (-2, 2).
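The scoring function f(X) = 2 · tanh(W · X^T + b) of step (4) is directly expressible:

```python
import numpy as np

def score(X, W, b):
    """f(X) = 2 * tanh(W . X^T + b), bounding the ranking score in (-2, 2)."""
    return 2.0 * np.tanh(W @ X + b)

X = np.array([0.5, -1.0, 2.0])   # toy matching-feature vector
W = np.array([0.1, 0.2, 0.3])    # toy learned weights
s = float(score(X, W, 0.0))
print(-2.0 < s < 2.0)  # True
```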
(5) Updating the parameters of the trained Match-Transformer model through a Learning-to-Rank algorithm;
To learn the relative position information between documents, a Learning-to-Rank function is used to train the model.
A listwise loss over the top-one ranking probabilities is minimized, where

P_j = exp(f(X_j)) / Σ_k exp(f(X_k))

represents the probability that the jth document is ranked first in the list of documents under a given query.
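Assuming the standard ListNet top-one probability and a cross-entropy loss (the exact loss formula is shown only as an image in the original), the Learning-to-Rank objective of step (5) can be sketched as:

```python
import numpy as np

def top_one_prob(scores):
    """P_j = exp(s_j) / sum_k exp(s_k): probability that document j is ranked first."""
    e = np.exp(scores - scores.max())   # shift for numerical stability
    return e / e.sum()

def listwise_loss(pred_scores, true_scores):
    """Cross-entropy between the top-one distributions of labels and predictions
    (ListNet-style learning-to-rank; an assumed form of the loss)."""
    p_true = top_one_prob(np.asarray(true_scores, dtype=float))
    p_pred = top_one_prob(np.asarray(pred_scores, dtype=float))
    return float(-np.sum(p_true * np.log(p_pred + 1e-12)))

print(abs(top_one_prob(np.array([1.0, 1.0])).sum() - 1.0) < 1e-9)  # True
```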
(6) the test-set data are imported into the trained Match-Transformer model to calculate the ranking scores of the documents finally returned for each topic;
the method comprises the steps of testing each topic by using the optimal model stored in the previous step, comparing the relevance tags of each document, calculating the score of each document, finally obtaining the prediction result of the returned document under each topic, comparing the relevance tags of all documents, calculating the evaluation result of the final model on a test set, comparing a query likelihood model (Q L), a classical information retrieval model BM25 and a neural network information retrieval model (MatchPyramid (MP), DRMM, K-NRM, Conv-KNRM and HiNT), counting a table, and visually observing the effect of obviously improving the relevance of the query and the document by the method, wherein the table comprises the following table of comprehensive utilization of differential retrieval models over the Web-09-Cat-Band Robust-04 collections,
Figure RE-GDA0002406436310000061
and
Figure RE-GDA0002406436310000062
mean a signifcant improvement over BM25*,Conv-KNRM§,
Figure RE-GDA0002406436310000063
and
Figure RE-GDA0002406436310000064
using Wilcoxon signed-rank test p<0.05,respectively.
Figure RE-GDA0002406436310000065
(7) and outputting the evaluation result of the Match-Transformer model on the test set.
The technical means disclosed in the present invention are not limited to those disclosed in the above embodiments, but also include technical solutions formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (3)

1. An information retrieval model for modeling diverse relevant features in an ad-hoc retrieval task, comprising the steps of:
(1) constructing a corpus organized by topics, wherein the corpus contains N topics in total, and each topic comprises a query and a series of documents;
(2) randomly selecting 80% of the N topics from the corpus of step (1) as a training set and the remaining 20% of the N topics as a test set, and preprocessing the training set and the test set respectively;
(3) constructing a Match-Transformer model for the preprocessed query and document;
(4) inputting the representations of the training-set queries and documents into the Match-Transformer model, and calculating the score of each document with a multilayer perceptron;
(5) updating the parameters of the trained Match-Transformer model through a Learning-to-Rank algorithm;
(6) inputting the test-set data into the trained Match-Transformer model to calculate the ranking score of the documents finally returned for each topic;
(7) and outputting the evaluation result of the Match-Transformer model on the test set.
2. The information retrieval model for modeling the diverse correlation features in the ad-hoc retrieval task according to claim 1, wherein the Match-Transformer model construction method in the step (3) comprises the following steps:
3.1 obtaining a 300-dimensional word vector of each word in each text by using the GloVe tool, initializing the parameter matrix with a uniform distribution in the model initialization stage, and updating and optimizing it in the model training process; the word vectors of the words in the query and the document correspond to W_i^Q and W_j^D respectively, where the query has n words and the document has m words, i = 1, …, n; j = 1, …, m.
3.2 determining whether the query word vector W_i^Q appears in the document T^D, and constructing the following Overlap Embedding function:

O_i^Q = 1 if w_i^Q ∈ T^D, and O_i^Q = 0 otherwise
3.3 combining the above two steps, the global information (word vectors) and the local information (traditional information retrieval features) of the query and the document are obtained, namely:

T_i^Q = [W_i^Q; O_i^Q; tf_i^Q]
T_j^D = [W_j^D; tfidf_j^D]

where tf_i^Q denotes the tf value of the ith word in the query and tfidf_j^D the tf-idf value of the jth word in the document;
3.4 since the above steps do not consider the dependency information between query words and between document words, the query and the document are each represented by a density operator, namely:

ρ^Q = Σ_{i=1..n} p_i |T_i^Q⟩⟨T_i^Q|
ρ^D = Σ_{j=1..m} p_j |T_j^D⟩⟨T_j^D|
3. The information retrieval model for modeling diverse relevant features in ad-hoc retrieval tasks as claimed in claim 2, wherein in step 3.4, to further obtain the matching feature information between the query and the document:

M = MultiHead(ρ^Q, ρ^Q, ρ^D)

where

MultiHead(P, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(P W_i^P, K W_i^K, V W_i^V)
Attention(P, K, V) = softmax(P K^T / √d) V
CN201910898272.0A 2019-09-23 2019-09-23 Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task Pending CN111444414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910898272.0A CN111444414A (en) 2019-09-23 2019-09-23 Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910898272.0A CN111444414A (en) 2019-09-23 2019-09-23 Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task

Publications (1)

Publication Number Publication Date
CN111444414A true CN111444414A (en) 2020-07-24

Family

ID=71648622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910898272.0A Pending CN111444414A (en) 2019-09-23 2019-09-23 Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task

Country Status (1)

Country Link
CN (1) CN111444414A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761890A (en) * 2021-08-17 2021-12-07 汕头市同行网络科技有限公司 BERT context sensing-based multi-level semantic information retrieval method
CN116933766A (en) * 2023-06-02 2023-10-24 盐城工学院 Ad-hoc information retrieval model based on triple word frequency scheme

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100179933A1 (en) * 2009-01-12 2010-07-15 Nec Laboratories America, Inc. Supervised semantic indexing and its extensions
US20120158621A1 (en) * 2010-12-16 2012-06-21 Microsoft Corporation Structured cross-lingual relevance feedback for enhancing search results
CN105095271A (en) * 2014-05-12 2015-11-25 北京大学 Microblog retrieval method and microblog retrieval apparatus
CN105164676A * 2013-03-29 2015-12-16 Hewlett-Packard Development Company, L.P. Query features and questions
CN109635083A (en) * 2018-11-27 2019-04-16 北京科技大学 It is a kind of for search for TED speech in topic formula inquiry document retrieval method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, FANGFANG et al.: "Intelligent passage ranking based on literal and semantic relevance matching", Journal of Shandong University (Natural Science) *


Similar Documents

Publication Publication Date Title
CN109271505B (en) Question-answering system implementation method based on question-answer pairs
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN109948165B (en) Fine granularity emotion polarity prediction method based on mixed attention network
CN110046304B (en) User recommendation method and device
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN109284357B (en) Man-machine conversation method, device, electronic equipment and computer readable medium
CN107562792B (en) question-answer matching method based on deep learning
Celikyilmaz et al. LDA based similarity modeling for question answering
CN104199965B (en) Semantic information retrieval method
CN111611361A (en) Intelligent reading, understanding, question answering system of extraction type machine
CN105183833B (en) Microblog text recommendation method and device based on user model
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN111291188B (en) Intelligent information extraction method and system
CN109635083B (en) Document retrieval method for searching topic type query in TED (tele) lecture
CN106021364A (en) Method and device for establishing picture search correlation prediction model, and picture search method and device
CN109408600B (en) Book recommendation method based on data mining
CN112434517B (en) Community question-answering website answer ordering method and system combined with active learning
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
CN111221962A (en) Text emotion analysis method based on new word expansion and complex sentence pattern expansion
CN112966091B (en) Knowledge map recommendation system fusing entity information and heat
CN112632228A (en) Text mining-based auxiliary bid evaluation method and system
CN102955848A (en) Semantic-based three-dimensional model retrieval system and method
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN113343125B (en) Academic accurate recommendation-oriented heterogeneous scientific research information integration method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20231215