CN111444414A - Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task - Google Patents

Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task Download PDF

Info

Publication number
CN111444414A
CN111444414A (application CN201910898272.0A)
Authority
CN
China
Prior art keywords
document
model
query
information
match
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910898272.0A
Other languages
Chinese (zh)
Inventor
胡泽婷 (Hu Zeting)
张鹏 (Zhang Peng)
蒋永余 (Jiang Yongyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910898272.0A priority Critical patent/CN111444414A/en
Publication of CN111444414A publication Critical patent/CN111444414A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information retrieval model, the Match-Transformer model, for modeling various relevant features in an ad-hoc retrieval task. The method comprises the following steps: collecting a corpus according to topics and dividing it into a training set and a test set; preprocessing the queries and documents in the corpus; constructing vector representations of the queries and documents from global information and local information; inputting the vector representations of the training-set queries and documents into the Match-Transformer model to calculate document scores and train the final model; inputting the vector representations of the test-set queries and documents into the Match-Transformer model to calculate the final score of each document; and finally learning the relative position information between documents with a Learning-to-Rank model to obtain a more accurate document ranking result.

Description

Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task
Technical Field
The invention relates to the technical field of text information retrieval, in particular to an information retrieval model for modeling various relevant characteristics in an ad-hoc retrieval task.
Background
With the continuous development of the internet and intelligent technology, information retrieval is no longer confined to personal computer (PC) terminals; users increasingly rely on mobile devices to search for the information and services they need. The quality of the information retrieval model directly influences the retrieval result, so the information retrieval model has not only important theoretical significance but also great social value. The present invention mainly studies document ranking under a given query in the ad-hoc task, i.e., relevance analysis between queries and documents.
The information retrieval model is the main research content of information retrieval. Current information retrieval models include Boolean models, vector space models, probabilistic models, language models, and the like. The main purpose of these models is to abstract, by mathematical or other formal tools, the query and document in information retrieval and their degree of matching. Ad-hoc retrieval is a classical task in which a user specifies an information need through a query, which initiates a search (performed by an information system) for documents likely to be relevant to the user. A central problem in the ad-hoc information retrieval task is how to learn a generic function that evaluates the relevance between queries and documents. In ad-hoc retrieval, the heterogeneity of queries and documents poses challenges: insufficient context information and overly long documents increase the difficulty of document understanding. A consequence of this heterogeneity for relevance determination is that relevance may depend on multiple distinct relevance features.
Recent neural information retrieval models include MatchPyramid (MP), K-NRM, Conv-KNRM and NNQLM-II. However, these models use only a small number of relevant features, or consider multiple relevant features only from the document perspective, and do not consider the relevant features of the query and the mutual information between query features and document features.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide an information retrieval model for modeling various relevant characteristics in an ad-hoc retrieval task. The model constructs vector representations of queries and documents, uses a Match-Transformer model to capture dependency information, context information and interaction information between queries and documents, then uses a multilayer perceptron to obtain document scores and ranks, and uses Learning-to-Rank to learn the relative position information between documents, obtaining the prediction result of the optimal model on the test set and finally a more accurate evaluation result.
The purpose of the invention is realized by the following technical scheme, which comprises the following steps:
an information retrieval model for modeling diverse relevant features in an ad-hoc retrieval task, comprising the steps of:
(1) constructing a corpus organized by topics, wherein the corpus contains N topics in total, and each topic comprises a query and a series of documents;
(2) randomly selecting 80% of the N topics from the corpus of step (1) as a training set and the remaining 20% of the N topics as a test set, and preprocessing the training set and the test set respectively;
(3) constructing a Match-Transformer model for the preprocessed query and document;
(4) inputting the representations of the training-set queries and documents into the Match-Transformer model, and calculating the score of each document with a multilayer perceptron;
(5) updating the parameters of the trained Match-Transformer model through a Learning-to-Rank algorithm;
(6) importing the test-set data into the trained Match-Transformer model to calculate the ranking score of the documents finally returned for each topic;
(7) and outputting the evaluation result of the Match-Transformer model on the test set.
The method for constructing the Match-Transformer model in the step (3) comprises the following steps:
3.1 obtaining a 300-dimensional word vector of each word in each text by using the GloVe tool, initializing the parameter matrix with a uniform distribution in the model initialization stage, and updating and optimizing it in the model training process; the word vectors of the words in the query and the document correspond to W_i^Q and W_j^D respectively, where the query has n words and the document has m words, i = 1, …, n; j = 1, …, m.
3.2 determining whether the query word vector W_i^Q appears in the document T^D, and constructing the following Overlap Embedding function:

O_i^Q = 1 if w_i^Q ∈ T^D, and O_i^Q = 0 otherwise
3.3 combining the above two steps, the global information (word vectors) and the local information (traditional information retrieval features) of the query and the document are obtained, namely:

T_i^Q = [W_i^Q; O_i^Q; tf_i^Q]
T_j^D = [W_j^D; tfidf_j^D]
where tf_i^Q denotes the tf value of the ith word in the query and tfidf_j^D denotes the tf-idf value of the jth word in the document;
3.4 since the above steps do not capture the dependency information between query words and between document words, the query and the document are each represented by a density operator, namely:

ρ^Q = Σ_{i=1..n} p_i |T_i^Q⟩⟨T_i^Q|
ρ^D = Σ_{j=1..m} p_j |T_j^D⟩⟨T_j^D|
In step 3.4, to further obtain the matching feature information between the query and the document, the following is performed:

M = MultiHead(ρ^Q, ρ^Q, ρ^D)

where

MultiHead(P, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(P W_i^P, K W_i^K, V W_i^V)
Attention(P, K, V) = softmax(P K^T / √d) V
drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of the Match-Transformer framework model;
table 3 experimental comparison results of different information retrieval models;
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following. FIG. 1 shows a flow of query-document relevance analysis proposed by the present method; FIG. 2 shows a Match-Transformer model designed according to the present invention; table 3 shows the comparison results between the final different information retrieval models. The method comprises the following specific steps:
(1): from the TREC dataset, 1000 documents relevant to it were found from the dataset according to topic in Web TREC (Robust-04 and ClueWeb-09-CAT-B.).
(2): randomly selecting 80% of the 400 topics from the corpus obtained in step (1) as a training set and the remaining 20% as a test set, preprocessing the training set and the test set respectively, and removing stop words and punctuation marks from each text.
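The split and preprocessing of step (2) can be sketched as follows (the stop-word list, random seed and function names are illustrative assumptions):

```python
import random
import string

def split_topics(topics, train_frac=0.8, seed=42):
    """Randomly split topic IDs into train and test sets (80/20 as in step 2)."""
    rng = random.Random(seed)
    shuffled = topics[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Illustrative subset; a real stop-word list would be far larger.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOPWORDS]

train, test = split_topics(list(range(400)))
print(len(train), len(test))                # 320 80
print(preprocess("The quick, brown fox!"))  # ['quick', 'brown', 'fox']
```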
(3): for the preprocessed queries and documents, vector representations of the queries and documents are constructed according to traditional information retrieval features, word vectors and density operators, and the Match-Transformer model is constructed as shown in FIG. 2.
3.1: a 300-dimensional word vector of each word in each text is obtained using the GloVe tool; the parameter matrix is initialized with a uniform distribution in the model initialization stage and updated and optimized in the model training process. At this step we obtain the word vectors of each word in the query and in the document, corresponding to W_i^Q and W_j^D respectively, where the query has n words and the document has m words, i = 1, …, n; j = 1, …, m.
3.2: according to whether the query word W_i^Q appears in the document T^D, we construct an Overlap Embedding function: if the query term appears in the document, the function takes the value 1, otherwise 0.

O_i^Q = 1 if w_i^Q ∈ T^D, and O_i^Q = 0 otherwise
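The Overlap Embedding of step 3.2 reduces to a membership test; a minimal sketch (the function name is ours):

```python
def overlap_embedding(query_tokens, doc_tokens):
    """O_i = 1 if the i-th query word appears in the document, else 0."""
    doc_vocab = set(doc_tokens)
    return [1.0 if w in doc_vocab else 0.0 for w in query_tokens]

print(overlap_embedding(["neural", "retrieval"], ["deep", "neural", "ranking"]))
# [1.0, 0.0]
```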
3.3: by combining the two previous operations, the global information (word vectors) and the local information (traditional information retrieval features) of the query and the document are obtained, namely:

T_i^Q = [W_i^Q; O_i^Q; tf_i^Q]
T_j^D = [W_j^D; tfidf_j^D]

where tf_i^Q denotes the tf value of the ith word in the query and tfidf_j^D the tf-idf value of the jth word in the document.
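Assuming the global and local information are combined by concatenation (the exact formula is shown only as an image in the original), the feature construction of step 3.3 might look like this; the tf normalization, idf smoothing, and function names are our assumptions:

```python
import math
import numpy as np

def query_features(query, doc, embed, dim=4):
    """Concatenate each query word's vector with its overlap and tf features (step 3.3)."""
    doc_vocab = set(doc)
    feats = []
    for w in query:
        overlap = 1.0 if w in doc_vocab else 0.0
        tf = query.count(w) / len(query)
        feats.append(np.concatenate([embed.get(w, np.zeros(dim)), [overlap, tf]]))
    return np.stack(feats)

def doc_features(doc, corpus, embed, dim=4):
    """Concatenate each document word's vector with a smoothed tf-idf feature."""
    n_docs = len(corpus)
    feats = []
    for w in doc:
        tf = doc.count(w) / len(doc)
        df = sum(1 for d in corpus if w in d)
        idf = math.log((n_docs + 1) / (df + 1)) + 1   # smoothed idf
        feats.append(np.concatenate([embed.get(w, np.zeros(dim)), [tf * idf]]))
    return np.stack(feats)

q = query_features(["neural", "ranking"], ["deep", "ranking", "model"], {}, dim=4)
print(q.shape)  # (2, 6): 4-d word vector + overlap + tf
```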
In order to further obtain the dependency information between query words and between document words, the query and the document are each represented by a density operator, namely:

ρ^Q = Σ_{i=1..n} p_i |T_i^Q⟩⟨T_i^Q|
ρ^D = Σ_{j=1..m} p_j |T_j^D⟩⟨T_j^D|
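A density operator over word representations, as used above, can be sketched as follows, assuming uniform weights p_i = 1/n and L2-normalized vectors (common in quantum-inspired language models, but an assumption here):

```python
import numpy as np

def density_operator(vectors, weights=None):
    """rho = sum_i p_i |v_i><v_i| over L2-normalized word representations."""
    vecs = np.asarray(vectors, dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    n = len(vecs)
    p = np.full(n, 1.0 / n) if weights is None else np.asarray(weights) / np.sum(weights)
    # Weighted sum of outer products of the normalized vectors.
    return sum(pi * np.outer(v, v) for pi, v in zip(p, vecs))

rho = density_operator([[1.0, 0.0], [0.0, 1.0]])
print(np.trace(rho))  # 1.0 (a density matrix has unit trace)
```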
and constructing a Match-Transformer model according to the vector representation obtained above.
Furthermore, the invention uses the attention function of the Transformer model, i.e.

Attention(P, K, V) = softmax(P K^T / √d) V

where d represents the dimension of the columns in P, K and V. The basic framework of the Transformer model is thus as follows:
MultiHead(P, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(P W_i^P, K W_i^K, V W_i^V)
in the present invention, the above P, K and V correspond to ρ of the fourth step, respectivelyQ,ρQAnd ρD. The use of the Multi-head attention mechanism to build the relevance of matching features between a query and a document can also be seen as an interaction between relevance information between two texts. The specific implementation process is as follows:
Figure BDA00022109698200000410
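The interaction M = MultiHead(ρ^Q, ρ^Q, ρ^D) can be sketched in plain NumPy; the random per-head projections stand in for learned weights, and the output projection W^O is omitted for brevity:

```python
import numpy as np

def attention(P, K, V):
    """Scaled dot-product attention: softmax(P K^T / sqrt(d)) V."""
    d = P.shape[-1]
    scores = P @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head(P, K, V, heads, rng):
    """Project inputs per head, attend, then concatenate the head outputs."""
    d = P.shape[-1]
    outs = []
    for _ in range(heads):
        WP, WK, WV = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        outs.append(attention(P @ WP, K @ WK, V @ WV))
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(0)
rho_q = rng.standard_normal((4, 4))   # toy stand-in for the query density operator
rho_d = rng.standard_normal((4, 4))   # toy stand-in for the document density operator
M = multi_head(rho_q, rho_q, rho_d, heads=2, rng=rng)
print(M.shape)  # (4, 8): two heads of width 4, concatenated
```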
More subtle information among the related features is then captured through a convolutional neural network:

G_h = CNN_h(M)
P_h = Pooling(G_h)

where h represents the value of n in the n-gram, taking values in {2, 3, 4, 5}. Through the above steps, the matching feature X of the various related features is obtained:

X = P_2 ⊙ P_3 ⊙ P_4 ⊙ P_5
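Assuming max-pooling and treating CNN_h as an h-row sliding-window convolution (both assumptions; the original shows only G_h = CNN_h(M) as an image), the n-gram feature extraction can be sketched as:

```python
import numpy as np

def ngram_conv(M, h, kernel):
    """1-D convolution of window height h over the rows of the match matrix M (CNN_h)."""
    rows = M.shape[0]
    return np.stack([np.tensordot(M[i:i + h], kernel, axes=2)
                     for i in range(rows - h + 1)])

def ngram_features(M, kernels):
    """G_h = CNN_h(M); P_h = max-pool(G_h); X = P_2 ⊙ ... ⊙ P_5 (element-wise product)."""
    X = None
    for h, kernel in kernels.items():     # h ranges over {2, 3, 4, 5}
        G = ngram_conv(M, h, kernel)      # shape (rows - h + 1, n_filters)
        P = G.max(axis=0)                 # max over window positions
        X = P if X is None else X * P     # Hadamard product across n-gram sizes
    return X

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 8))           # toy match matrix
kernels = {h: rng.standard_normal((h, 8, 3)) for h in (2, 3, 4, 5)}
print(ngram_features(M, kernels).shape)   # (3,): one value per filter
```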
(4) inputting the training set query and the representation of the document into a Match-Transformer model, and calculating the score of the document by using a multilayer perceptron;
using a multi-tier perceptron to compute ranking scores for individual documents, namely:
f(X) = 2 · tanh(W · X^T + b)
where W and b are parameters to be learned in the linear ranking model and tanh(·) is the activation function; the factor 2 makes the ranking score range over (-2, 2).
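The scoring function f(X) = 2 · tanh(W · X^T + b) of step (4) is directly expressible:

```python
import numpy as np

def score(X, W, b):
    """f(X) = 2 * tanh(W . X^T + b), bounding the ranking score in (-2, 2)."""
    return 2.0 * np.tanh(W @ X + b)

X = np.array([0.5, -1.0, 2.0])   # toy matching-feature vector
W = np.array([0.1, 0.2, 0.3])    # toy learned weights
s = float(score(X, W, 0.0))
print(-2.0 < s < 2.0)  # True
```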
(5) Updating the parameters of the trained Match-Transformer model through a Learning-to-Rank algorithm;
To learn the relative position information between documents, a Learning-to-Rank function is used to train the model.
A listwise loss over the top-one ranking probabilities is minimized, where

P_j = exp(f(X_j)) / Σ_k exp(f(X_k))

represents the probability that the jth document is ranked first in the list of documents under a given query.
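Assuming the standard ListNet top-one probability and a cross-entropy loss (the exact loss formula is shown only as an image in the original), the Learning-to-Rank objective of step (5) can be sketched as:

```python
import numpy as np

def top_one_prob(scores):
    """P_j = exp(s_j) / sum_k exp(s_k): probability that document j is ranked first."""
    e = np.exp(scores - scores.max())   # shift for numerical stability
    return e / e.sum()

def listwise_loss(pred_scores, true_scores):
    """Cross-entropy between the top-one distributions of labels and predictions
    (ListNet-style learning-to-rank; an assumed form of the loss)."""
    p_true = top_one_prob(np.asarray(true_scores, dtype=float))
    p_pred = top_one_prob(np.asarray(pred_scores, dtype=float))
    return float(-np.sum(p_true * np.log(p_pred + 1e-12)))

print(abs(top_one_prob(np.array([1.0, 1.0])).sum() - 1.0) < 1e-9)  # True
```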
(6) the test-set data are imported into the trained Match-Transformer model to calculate the ranking scores of the documents finally returned for each topic;
the method comprises the steps of testing each topic by using the optimal model stored in the previous step, comparing the relevance tags of each document, calculating the score of each document, finally obtaining the prediction result of the returned document under each topic, comparing the relevance tags of all documents, calculating the evaluation result of the final model on a test set, comparing a query likelihood model (Q L), a classical information retrieval model BM25 and a neural network information retrieval model (MatchPyramid (MP), DRMM, K-NRM, Conv-KNRM and HiNT), counting a table, and visually observing the effect of obviously improving the relevance of the query and the document by the method, wherein the table comprises the following table of comprehensive utilization of differential retrieval models over the Web-09-Cat-Band Robust-04 collections,
Figure RE-GDA0002406436310000061
and
Figure RE-GDA0002406436310000062
mean a signifcant improvement over BM25*,Conv-KNRM§,
Figure RE-GDA0002406436310000063
and
Figure RE-GDA0002406436310000064
using Wilcoxon signed-rank test p<0.05,respectively.
Figure RE-GDA0002406436310000065
(7) and outputting the evaluation result of the Match-Transformer model on the test set.
The technical means disclosed in the present invention are not limited to those disclosed in the above embodiments, but also include technical solutions formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (3)

1. An information retrieval model for modeling diverse relevant features in an ad-hoc retrieval task, comprising the steps of:
(1) constructing a corpus organized by topics, wherein the corpus contains N topics in total, and each topic comprises a query and a series of documents;
(2) randomly selecting 80% of the N topics from the corpus of step (1) as a training set and the remaining 20% of the N topics as a test set, and preprocessing the training set and the test set respectively;
(3) constructing a Match-Transformer model for the preprocessed query and document;
(4) inputting the representations of the training-set queries and documents into the Match-Transformer model, and calculating the score of each document with a multilayer perceptron;
(5) updating the parameters of the trained Match-Transformer model through a Learning-to-Rank algorithm;
(6) inputting the test-set data into the trained Match-Transformer model to calculate the ranking score of the documents finally returned for each topic;
(7) and outputting the evaluation result of the Match-Transformer model on the test set.
2. The information retrieval model for modeling the diverse correlation features in the ad-hoc retrieval task according to claim 1, wherein the Match-Transformer model construction method in the step (3) comprises the following steps:
3.1 obtaining a 300-dimensional word vector of each word in each text by using the GloVe tool, initializing the parameter matrix with a uniform distribution in the model initialization stage, and updating and optimizing it in the model training process; the word vectors of the words in the query and the document correspond to W_i^Q and W_j^D respectively, where the query has n words and the document has m words, i = 1, …, n; j = 1, …, m.
3.2 determining whether the query word vector W_i^Q appears in the document T^D, and constructing the following Overlap Embedding function:

O_i^Q = 1 if w_i^Q ∈ T^D, and O_i^Q = 0 otherwise
3.3 combining the above two steps, the global information (word vectors) and the local information (traditional information retrieval features) of the query and the document are obtained, namely:

T_i^Q = [W_i^Q; O_i^Q; tf_i^Q]
T_j^D = [W_j^D; tfidf_j^D]

where tf_i^Q denotes the tf value of the ith word in the query and tfidf_j^D the tf-idf value of the jth word in the document;
3.4 since the above steps do not consider the dependency information between query words and between document words, the query and the document are each represented by a density operator, namely:

ρ^Q = Σ_{i=1..n} p_i |T_i^Q⟩⟨T_i^Q|
ρ^D = Σ_{j=1..m} p_j |T_j^D⟩⟨T_j^D|
3. The information retrieval model for modeling diverse relevant features in ad-hoc retrieval tasks as claimed in claim 2, wherein in step 3.4, to further obtain the matching feature information between the query and the document:

M = MultiHead(ρ^Q, ρ^Q, ρ^D)

where

MultiHead(P, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(P W_i^P, K W_i^K, V W_i^V)
Attention(P, K, V) = softmax(P K^T / √d) V
CN201910898272.0A 2019-09-23 2019-09-23 Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task Pending CN111444414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910898272.0A CN111444414A (en) 2019-09-23 2019-09-23 Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910898272.0A CN111444414A (en) 2019-09-23 2019-09-23 Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task

Publications (1)

Publication Number Publication Date
CN111444414A true CN111444414A (en) 2020-07-24

Family

ID=71648622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910898272.0A Pending CN111444414A (en) 2019-09-23 2019-09-23 Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task

Country Status (1)

Country Link
CN (1) CN111444414A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761890A (en) * 2021-08-17 2021-12-07 汕头市同行网络科技有限公司 BERT context sensing-based multi-level semantic information retrieval method
CN116933766A (en) * 2023-06-02 2023-10-24 盐城工学院 Ad-hoc information retrieval model based on triple word frequency scheme

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100179933A1 (en) * 2009-01-12 2010-07-15 Nec Laboratories America, Inc. Supervised semantic indexing and its extensions
US20120158621A1 (en) * 2010-12-16 2012-06-21 Microsoft Corporation Structured cross-lingual relevance feedback for enhancing search results
CN105095271A (en) * 2014-05-12 2015-11-25 北京大学 Microblog retrieval method and microblog retrieval apparatus
CN105164676A * 2013-03-29 2015-12-16 Hewlett-Packard Development Company, L.P. Query features and questions
CN109635083A (en) * 2018-11-27 2019-04-16 北京科技大学 It is a kind of for search for TED speech in topic formula inquiry document retrieval method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, FANGFANG et al.: "Intelligent passage ranking based on literal and semantic relevance matching", Journal of Shandong University (Natural Science) *


Similar Documents

Publication Publication Date Title
CN109271505B (en) Question-answering system implementation method based on question-answer pairs
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN109948165B (en) Fine granularity emotion polarity prediction method based on mixed attention network
CN110046304B (en) User recommendation method and device
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN109284357B (en) Man-machine conversation method, device, electronic equipment and computer readable medium
CN107562792B (en) question-answer matching method based on deep learning
Celikyilmaz et al. LDA based similarity modeling for question answering
CN104199965B (en) Semantic information retrieval method
CN111611361A (en) Intelligent reading, understanding, question answering system of extraction type machine
CN105183833B (en) Microblog text recommendation method and device based on user model
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN111291188B (en) Intelligent information extraction method and system
CN109635083B (en) Document retrieval method for searching topic type query in TED (tele) lecture
CN106021364A (en) Method and device for establishing picture search correlation prediction model, and picture search method and device
CN109408600B (en) Book recommendation method based on data mining
CN112434517B (en) Community question-answering website answer ordering method and system combined with active learning
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
CN111221962A (en) Text emotion analysis method based on new word expansion and complex sentence pattern expansion
CN112966091B (en) Knowledge map recommendation system fusing entity information and heat
CN112632228A (en) Text mining-based auxiliary bid evaluation method and system
CN102955848A (en) Semantic-based three-dimensional model retrieval system and method
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN113343125B (en) Academic accurate recommendation-oriented heterogeneous scientific research information integration method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20231215