CN111444414A - Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task - Google Patents
- Publication number: CN111444414A
- Application number: CN201910898272.0A
- Authority
- CN
- China
- Prior art keywords
- document
- model
- query
- information
- match
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an information retrieval model, the Match-Transformer model, for modeling diverse relevance features in an ad-hoc retrieval task. The method comprises the following steps: collecting a corpus by topic and dividing it into a training set and a test set; preprocessing the queries and documents in the corpus; constructing vector representations of the queries and documents from global information and local information; inputting the vector representations of the training-set queries and documents into the Match-Transformer model to compute document scores and train the final model; inputting the vector representations of the test-set queries and documents into the trained model to compute the final score of each document; and finally using a Learning-to-Rank model to learn the relative position information between documents, yielding a more accurate document ranking result.
Description
Technical Field
The invention relates to the technical field of text information retrieval, in particular to an information retrieval model for modeling various relevant characteristics in an ad-hoc retrieval task.
Background
With the continuous development of the internet and intelligent technology, information retrieval is no longer confined to the personal computer (PC); users increasingly rely on mobile devices to search for the information and services they need. The quality of an information retrieval model directly influences the retrieval result, so such models carry not only important theoretical significance but also great social value. The present invention primarily studies document ranking under a given query in the ad-hoc task, i.e., relevance analysis between queries and documents.
Information retrieval models are the main research content of information retrieval. Current models include Boolean models, vector space models, probabilistic models, language models, and the like. Their main purpose is to abstract, through mathematical or other formal tools, the query and document in information retrieval and their degree of matching. Ad-hoc retrieval is a classical task in which a user expresses an information need through a query, which initiates a search (performed by an information system) for documents likely to be relevant to the user. A central problem in the ad-hoc retrieval task is how to learn a generic function that evaluates the relevance between queries and documents. The heterogeneity of queries and documents poses challenges: insufficient context information and overly long documents increase the difficulty of document understanding, so relevance judgment may involve multiple relevance features tied to this heterogeneity.
Recent neural retrieval models include MatchPyramid (MP), K-NRM, Conv-KNRM and NNQLM-II; however, these models use only a small number of relevance features, or consider multiple relevance features solely from the document perspective, ignoring the relevance features of the query and the mutual information between query features and document features.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide an information retrieval model for modeling diverse relevance features in an ad-hoc retrieval task. The method constructs vector representations of queries and documents; uses a Match-Transformer model to capture dependency information, context information and interaction information between the queries and the documents; then uses a multilayer perceptron to obtain document scores and rankings; and uses Learning-to-Rank to learn the relative position information between documents, obtaining the prediction result of the optimal model on a test set and finally a more accurate evaluation result.
The purpose of the invention is realized by the following technical scheme, which comprises the following steps:
an information retrieval model for modeling diverse relevant features in an ad-hoc retrieval task, comprising the steps of:
(1) constructing a corpus according to topic, wherein the total sample of the corpus is N topics, and each topic comprises a query and a series of documents;
(2) randomly selecting 80% of the N topics from the corpus constructed in step (1) as the training set and the remaining 20% as the test set, and preprocessing each set;
(3) constructing a Match-Transformer model for the preprocessed query and document;
(4) inputting the training set query and the representation of the document into a Match-Transformer model, and calculating the score of the document by using a multilayer perceptron;
(5) updating the parameters of the trained Match-Transformer model through a Learning-to-Rank algorithm;
(6) importing the test-set data into the trained Match-Transformer model to calculate the final ranking score of the documents returned for each topic;
(7) outputting the evaluation result of the Match-Transformer model on the test set.
The method for constructing the Match-Transformer model in the step (3) comprises the following steps:
3.1 obtaining a 300-dimensional word vector for each word in each text using the GloVe tool; in the model initialization stage, the parameter matrix is initialized with a uniform distribution and is updated and optimized during model training. The word vectors of each word in the query and each word in the document are denoted W_i^Q and W_j^D, respectively, where the query has n words and the document has m words, i = 1, …, n; j = 1, …, m.
3.2 determining whether the query word vector W_i^Q appears in the document T^D, for which the following Overlap Embedding function is constructed;
3.3 combining the above two steps, the global information (word vectors) and local information (traditional information retrieval features) of the query and the document are obtained, namely:
wherein the former represents the tf value of the i-th word in the query and the latter represents the tf-idf value of the j-th word in the document;
3.4 considering that the above steps do not capture the dependency information between query words and between document words, the query information and the document information are each represented by a density operator, namely:
in step 3.4, in order to further obtain matching feature information between the query and the document, the following steps are performed:
wherein,
head_i = σ(P·W_i^P, K·W_i^K, V·W_i^V)
drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of the Match-Transformer framework model;
table 3 experimental comparison results of different information retrieval models;
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following. FIG. 1 shows a flow of query-document relevance analysis proposed by the present method; FIG. 2 shows a Match-Transformer model designed according to the present invention; table 3 shows the comparison results between the final different information retrieval models. The method comprises the following specific steps:
(1): from the TREC dataset, 1000 documents relevant to it were found from the dataset according to topic in Web TREC (Robust-04 and ClueWeb-09-CAT-B.).
(2): and (3) randomly selecting 80% 400 topic from the corpus set obtained in the step (1) as a training set and 20% 400 topic as a test set, respectively preprocessing the training set and the test set, and removing stop words and punctuation marks of each text.
(3): for the preprocessed queries and documents, vector representation of the queries and documents is constructed according to the characteristics of a traditional information retrieval model, word vectors and density operators, and a Match-Transformer model is constructed as follows: as shown in fig. 2.
3.1: and obtaining a 300-dimensional word vector of each word in each text by using a glove tool, initializing the parameter matrix by using uniform distribution in the model initialization stage, and updating and optimizing in the model training process. At this step we get the word vectors for each query in the text and word in the document, corresponding to W, respectivelyi QAnd Wj D. Wherein, the query has n words, and the document has m words, i is 1, …, n; j is 1, …, m.
3.2: according to the word W in the queryi QWhether or not it is in the document TDWhere we have constructed an overlapemembedding function. Assuming that the query term appears in the document, the function takes the value 1, otherwise it is 0.
3.3: by combining the two previous operations, global information (word vector) and local information (traditional information retrieval characteristics) of the query and the document can be obtained, namely:
wherein the former represents the tf value of the i-th word in the query and the latter represents the tf-idf value of the j-th word in the document.
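The local features can be computed as below; the smoothed idf formula is one common variant and an assumption, since the patent does not give an exact definition:

```python
import math
from collections import Counter

def tf_values(terms):
    """Term frequency of each word occurrence within its own text (the query)."""
    counts = Counter(terms)
    return [counts[t] for t in terms]

def tfidf_values(doc_terms, corpus):
    """tf-idf of each document word against a corpus, with add-one smoothing."""
    counts = Counter(doc_terms)
    n_docs = len(corpus)
    return [
        counts[t] * math.log((1 + n_docs) / (1 + sum(t in d for d in corpus)))
        for t in doc_terms
    ]
```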
In order to further obtain the dependency information between query words and between document words, the query information and the document information are each expressed by a density operator, namely:
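The density operator rho = sum_i p_i |w_i><w_i| over normalized word vectors can be sketched as follows; the uniform weights p_i are an assumption, as the patent does not state the weighting scheme:

```python
import numpy as np

def density_operator(word_vecs, weights=None):
    """rho = sum_i p_i |w_i><w_i| over L2-normalized word vectors.

    The result is a symmetric positive semidefinite matrix with trace 1.
    """
    V = np.asarray(word_vecs, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit vectors |w_i>
    if weights is None:
        weights = np.full(len(V), 1.0 / len(V))       # assumed uniform p_i
    return sum(p * np.outer(v, v) for p, v in zip(weights, V))
```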
and constructing a Match-Transformer model according to the vector representation obtained above.
Furthermore, the invention employs the Transformer model, i.e. the scaled dot-product attention σ(P, K, V) = softmax(P·K^T/√d)·V, where d represents the column dimension of P, K and V. The basic framework of the Transformer model is thus as follows:
head_i = σ(P·W_i^P, K·W_i^K, V·W_i^V)
In the present invention, P, K and V above correspond to ρ^Q, ρ^Q and ρ^D of the fourth step, respectively. Using the multi-head attention mechanism to build the relevance of matching features between a query and a document can also be seen as an interaction between the relevance information of the two texts. The specific implementation process is as follows:
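A numpy sketch of this multi-head interaction, assuming the standard scaled dot-product attention form; the random projection matrices stand in for the learned W_i^P, W_i^K, W_i^V:

```python
import numpy as np

def scaled_dot_attention(P, K, V):
    """softmax(P K^T / sqrt(d)) V — standard scaled dot-product attention."""
    d = K.shape[-1]
    scores = P @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head(P, K, V, n_heads=4, seed=0):
    """head_i = attention(P W_i^P, K W_i^K, V W_i^V), heads concatenated."""
    rng = np.random.default_rng(seed)
    d = P.shape[-1]
    heads = []
    for _ in range(n_heads):
        Wp, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        heads.append(scaled_dot_attention(P @ Wp, K @ Wk, V @ Wv))
    return np.concatenate(heads, axis=-1)
```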
and capturing more subtle information among the related characteristics through a convolutional neural network.
G_h = CNN_h(M)
where h represents the value of n in the n-gram, taking values in {2, 3, 4, 5}.
Through the steps, the matching features X of various related features can be obtained:
X = P_2 ⊙ … ⊙ P_5
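These two steps can be sketched as below; for brevity each G_h is max-pooled to a single value P_h before combination, and the random kernels stand in for learned CNN filters — an illustrative simplification, not the patent's exact architecture:

```python
import numpy as np

def conv2d_valid(M, kernel):
    """Valid 2-D convolution of match matrix M with one h x h kernel (G_h)."""
    h = kernel.shape[0]
    rows, cols = M.shape
    out = np.empty((rows - h + 1, cols - h + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(M[r:r + h, c:c + h] * kernel)
    return out

def match_signal(M, hs=(2, 3, 4, 5), seed=0):
    """Max-pool each G_h to P_h, then combine the P_h by elementwise product."""
    rng = np.random.default_rng(seed)
    x = 1.0
    for h in hs:
        kernel = rng.standard_normal((h, h))
        x *= conv2d_valid(M, kernel).max()            # pooled value P_h
    return x
```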
(4) inputting the training set query and the representation of the document into a Match-Transformer model, and calculating the score of the document by using a multilayer perceptron;
A multilayer perceptron is used to compute the ranking score of each document, namely:
f(X) = 2·tanh(W·X^T + b)
where W and b are the parameters to be learned in the linear ranking model, and tanh(·) is the activation function; the factor 2 makes the ranking score range over [-2, 2].
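A direct transcription of this scoring function (the vector shapes are an assumption for illustration):

```python
import numpy as np

def rank_score(X, W, b):
    """f(X) = 2 * tanh(W . X^T + b); the factor 2 bounds scores in (-2, 2)."""
    return 2.0 * np.tanh(W @ X + b)
```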
(5) Updating the parameters of the trained Match-Transformer model through a Learning-to-Rank algorithm;
To learn the relative position information between documents, a Learning-to-Rank function is used to train the model.
wherein the term represents the probability that the j-th document is ranked first in the returned document list under a given query, namely:
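This top-one probability, and a listwise loss over it, can be sketched as follows; the ListNet-style cross-entropy is an assumption, since the patent shows the probability but not the exact loss:

```python
import numpy as np

def top_one_prob(scores):
    """Softmax over document scores: P(d_j is ranked first | query)."""
    s = np.asarray(scores, dtype=float)
    e = np.exp(s - s.max())                            # numerical stability
    return e / e.sum()

def listwise_loss(scores, labels):
    """Cross-entropy between label and score top-one distributions."""
    p_true = top_one_prob(labels)
    p_pred = top_one_prob(scores)
    return -np.sum(p_true * np.log(p_pred + 1e-12))
```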
(6) The test-set data are imported into the trained Match-Transformer model to calculate the final ranking score of the documents returned for each topic;
the method comprises the steps of testing each topic by using the optimal model stored in the previous step, comparing the relevance tags of each document, calculating the score of each document, finally obtaining the prediction result of the returned document under each topic, comparing the relevance tags of all documents, calculating the evaluation result of the final model on a test set, comparing a query likelihood model (Q L), a classical information retrieval model BM25 and a neural network information retrieval model (MatchPyramid (MP), DRMM, K-NRM, Conv-KNRM and HiNT), counting a table, and visually observing the effect of obviously improving the relevance of the query and the document by the method, wherein the table comprises the following table of comprehensive utilization of differential retrieval models over the Web-09-Cat-Band Robust-04 collections,andmean a signifcant improvement over BM25*,Conv-KNRM§,andusing Wilcoxon signed-rank test p<0.05,respectively.
(7) The evaluation result of the Match-Transformer model on the test set is output.
The technical means disclosed in the present invention are not limited to those disclosed in the above embodiments, but also include technical solutions formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.
Claims (3)
1. An information retrieval model for modeling diverse relevant features in an ad-hoc retrieval task, comprising the steps of:
(1) constructing a corpus according to topic, wherein the total sample of the corpus is N topics, and each topic comprises a query and a series of documents;
(2) randomly selecting 80% of the N topics from the corpus constructed in step (1) as the training set and the remaining 20% as the test set, and preprocessing each set;
(3) constructing a Match-Transformer model for the preprocessed query and document;
(4) inputting the training set query and the representation of the document into a Match-Transformer model, and calculating the score of the document by using a multilayer perceptron;
(5) updating the parameters of the trained Match-Transformer model through a Learning-to-Rank algorithm;
(6) inputting the test-set data into the trained Match-Transformer model to calculate the final ranking score of the documents returned for each topic;
(7) outputting the evaluation result of the Match-Transformer model on the test set.
2. The information retrieval model for modeling the diverse correlation features in the ad-hoc retrieval task according to claim 1, wherein the Match-Transformer model construction method in the step (3) comprises the following steps:
3.1 obtaining a 300-dimensional word vector for each word in each text using the GloVe tool; in the model initialization stage, the parameter matrix is initialized with a uniform distribution and is updated and optimized during model training. The word vectors of each word in the query and each word in the document are denoted W_i^Q and W_j^D, respectively, where the query has n words and the document has m words, i = 1, …, n; j = 1, …, m.
3.2 determining whether the query word vector W_i^Q appears in the document T^D, for which the following Overlap Embedding function is constructed;
3.3 combining the above two steps, the global information (word vectors) and local information (traditional information retrieval features) of the query and the document are obtained, namely:
wherein the former represents the tf value of the i-th word in the query and the latter represents the tf-idf value of the j-th word in the document;
3.4 considering that the above steps do not capture the dependency information between query words and between document words, the query information and the document information are each represented by a density operator, namely:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910898272.0A CN111444414A (en) | 2019-09-23 | 2019-09-23 | Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111444414A true CN111444414A (en) | 2020-07-24 |
Family
ID=71648622
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910898272.0A Pending CN111444414A (en) | 2019-09-23 | 2019-09-23 | Information retrieval model for modeling various relevant characteristics in ad-hoc retrieval task |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111444414A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113761890A (en) * | 2021-08-17 | 2021-12-07 | 汕头市同行网络科技有限公司 | BERT context sensing-based multi-level semantic information retrieval method |
CN116933766A (en) * | 2023-06-02 | 2023-10-24 | 盐城工学院 | Ad-hoc information retrieval model based on triple word frequency scheme |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100179933A1 (en) * | 2009-01-12 | 2010-07-15 | Nec Laboratories America, Inc. | Supervised semantic indexing and its extensions |
US20120158621A1 (en) * | 2010-12-16 | 2012-06-21 | Microsoft Corporation | Structured cross-lingual relevance feedback for enhancing search results |
CN105095271A (en) * | 2014-05-12 | 2015-11-25 | 北京大学 | Microblog retrieval method and microblog retrieval apparatus |
CN105164676A (en) * | 2013-03-29 | 2015-12-16 | 惠普发展公司,有限责任合伙企业 | Query features and questions |
CN109635083A (en) * | 2018-11-27 | 2019-04-16 | 北京科技大学 | It is a kind of for search for TED speech in topic formula inquiry document retrieval method |
Non-Patent Citations (1)
Title |
---|
张芳芳等 (Zhang Fangfang et al.): "基于字面和语义相关性匹配的智能篇章排序" [Intelligent passage ranking based on literal and semantic relevance matching], 《山东大学学报(理学版)》 [Journal of Shandong University (Natural Science)] * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | ||
Effective date of abandoning: 20231215 |