CN107247745B - Information retrieval method and system based on a pseudo-relevance feedback model
- Publication number: CN107247745B
- Application number: CN201710370190.XA
- Authority: CN (China)
- Prior art keywords: query, document, word, expansion, words
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3325—Reformulation based on results of preceding query
- G06F16/3326—Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages
Abstract
The present invention provides an information retrieval method based on a pseudo-relevance feedback model, in which word relevance is fused into the pseudo-relevance feedback model to perform retrieval. When query expansion words are generated from the pseudo-relevant document set, two sets are generated separately: query expansion words characterized by the importance of the candidate expansion words, and query expansion words characterized by the relevance between the candidate expansion words and the query topic words. The two sets are then combined with the original query words to complete the final retrieval. When the relevance-based query expansion words are generated, a kernel function is used to calculate the relevance between query words and candidate words that appear at different positions in a document. The invention highlights the distribution of query words and candidate words and selects candidate words that are more relevant to the query topic words; the additional relevance information also allows candidate words to be located more accurately, which improves the precision of the expanded query and of the final retrieval.
Description
Technical Field
The invention belongs to the technical field of information retrieval, and particularly relates to an information retrieval method and system that fuse kernel-function-based word relevance into a pseudo-relevance feedback model.
Background
In an age of ever-growing information, browsing and acquiring information through a search engine is an important part of daily life. However, network resources are extremely rich and the total amount of information is expanding rapidly, so users find it difficult to acquire and identify important information efficiently and accurately. Information processing technology therefore urgently needs more effective theories and methods for handling increasingly large amounts of data. Information retrieval, as a classic text processing technique, can meet this requirement and is rapidly becoming a research focus in the field of information processing.
Information Retrieval refers to the process and technique of organizing information in a certain way and finding relevant information according to the needs of information users. The retrieval process can be described simply as follows: according to an information need, a user composes a query string and submits it to an information retrieval system, which retrieves the subset of documents related to the query from a document collection and returns it to the user. Specifically, given a query topic, a retrieval model calculates the relevance between each document in the target collection and the query topic and returns the documents in descending order of score; the earlier a document appears in the returned result, the more relevant it is to the query topic. Over the past half century of research, a number of effective retrieval models have been proposed and gradually applied in practical systems. The more influential retrieval models include Boolean logic models, vector space models, probabilistic models, language models, and the more recently proposed retrieval models based on supervised learning.
In practical information retrieval applications, there is a certain deviation between the user's query request and the results returned by the system, which reduces the performance of the retrieval system. Information retrieval is therefore often an iterative process, and users frequently need to adjust their queries several times to obtain satisfactory results. Query expansion addresses two problems by expanding and reconstructing the user's initial query: the mismatch between the terms used in the query and the terms used in the documents, and the incompleteness of the user's expression. For this reason it is widely applied in the field of information retrieval. In brief, before the retrieval system performs retrieval, query expansion automatically adds synonyms or near-synonyms of the keywords in the user query, taken from an expansion word list, to form a new query, and retrieval is then performed with the new query.
Pseudo-relevance feedback emerged to make retrieval systems more effective and to make the retrieval results better satisfy the user's query request. Its main mechanism is that the system assumes by default that its own first retrieval result contains a large number of documents relevant to the user's query topic, and takes the first N documents as relevant documents with which to adjust or expand the query.
Generally, many factors affect the performance of a retrieval system. The most critical is the retrieval strategy, which includes the representation of documents and query conditions, the matching strategy for evaluating the relevance of documents to queries, the ranking method for query results, and the mechanism by which the user provides relevance feedback.
With the development of the high-speed internet, enormous amounts of information have accumulated, and the accuracy of information search has become the primary concern of users. It is becoming more and more difficult for users to find what they want through an information retrieval tool, and the flood of information forces users to spend more time judging which information is valuable to them. Existing information retrieval methods generally suffer from low average precision: even the best current retrieval models achieve an average precision of only about 30%, so there is still a long way to go in improving retrieval precision. Information retrieval has penetrated every aspect of human life, and most people use search tools such as Baidu and Google every day to find the data they need and solve practical problems. In 2010 the number of Chinese web search requests exceeded 60 billion, and in 2016 Baidu alone received 6 billion search requests per day. At such a scale, every percentage point of improvement in average retrieval precision saves a large amount of the time and effort spent acquiring the required information, and is therefore extremely valuable. Large internet companies are also continuously pursuing lower-cost and more efficient information retrieval technologies.
Disclosure of Invention
The problem to be solved by the invention is how to optimize query expansion so as to improve the final average retrieval precision.
The invention provides an information retrieval method based on a pseudo-relevance feedback model, which fuses word relevance into the pseudo-relevance feedback model to perform retrieval. When query expansion words are generated from the pseudo-relevant document set, query expansion words characterized by the importance of the candidate expansion words and query expansion words characterized by the relevance between the candidate expansion words and the query topic words are generated separately, and the two are then combined with the original query words to complete the final retrieval. When the query expansion words characterized by relevance to the query topic words are generated, a kernel function is used to calculate the relevance between query words and candidate words that appear at different positions in a document.
Moreover, the word relevance is fused into the pseudo-relevance feedback model to perform retrieval as follows.
When a user submits a query topic, the query topic is preprocessed to obtain the query keywords Q. Let D be the set of all target documents and $N_D$ the total number of documents in D. The score of the query keywords Q against each document in D is calculated with a preset retrieval weight model, and the documents are ranked from high to low by score to obtain the first query result. Following the pseudo-relevance feedback approach, the first N documents of the first query result are taken as the pseudo-relevant document set $D_1$, and query expansion words are selected by the following steps,
Step 1, take all words in the documents of the pseudo-relevant document set $D_1$ as expansion candidate words, and calculate the importance score $w(t_j, d_i)$ of each expansion candidate word $t_j$ in each document $d_i$ of $D_1$, obtaining the importance vector $\vec{d_i}$ of each document $d_i$ as follows,

$$\vec{d_i} = \big(w(t_1, d_i),\ w(t_2, d_i),\ \ldots,\ w(t_n, d_i)\big)$$

where i = 1,2,3,…,N and j = 1,2,3,…,n;
calculate the importance score vector $\vec{W}$ of the expansion candidate words over all documents as follows,

$$\vec{W} = \frac{1}{N}\sum_{i=1}^{N}\vec{d_i}$$

Sort the importance scores of the expansion candidate words in $\vec{W}$ from large to small, and select the expansion candidate words corresponding to the $n_1$ largest scores in $\vec{W}$ to form the importance query expansion word set $Q_1$; a polynomial $V_1$ represents each word in $Q_1$ together with its importance score;
Step 2, take all words in the documents of the pseudo-relevant document set $D_1$ as expansion candidate words, and use a kernel function, based on co-occurrence positions and co-occurrence frequency, to calculate the relevance score $rel(t_j, Q \mid d_i)$ of each expansion candidate word $t_j$ with the query keywords Q in document $d_i$, obtaining the relevance vector $\vec{d_i}'$ of each document $d_i$ as follows,

$$\vec{d_i}' = \big(rel(t_1, Q \mid d_i),\ rel(t_2, Q \mid d_i),\ \ldots,\ rel(t_n, Q \mid d_i)\big)$$

where i = 1,2,3,…,N and j = 1,2,3,…,n;
calculate the relevance score vector $\vec{W}'$ of the expansion candidate words over all documents as follows,

$$\vec{W}' = \frac{1}{N}\sum_{i=1}^{N}\vec{d_i}'$$

Sort the relevance scores of the expansion candidate words in $\vec{W}'$ from large to small, and select the expansion candidate words corresponding to the $n_1$ largest scores in $\vec{W}'$ to form the relevance query expansion word set $Q_1'$; a polynomial $V_1'$ represents each word in $Q_1'$ together with its relevance score;
Step 3, normalize the polynomials $V_1$ and $V_1'$ obtained in step 1 and step 2 and combine them linearly to obtain a new query term polynomial V as follows,

$$V = (1-\gamma)\times\|V_1\| + \gamma\times\|V_1'\|$$

where $\|X\|$ denotes the normalization of the vector X, and γ is an adjustment factor;
Step 4, sort the terms of the query term polynomial V obtained in step 3 from large to small by coefficient, and take the $n_1$ terms with the largest coefficients to obtain a new expansion word set $Q_e$;
Step 5, let the query keywords Q comprise the query words $q_s$, s = 1,2,3,…,m, and represent Q as a polynomial $V_Q$ in which the coefficient of each query word is set to 1.0; represent the expansion word set $Q_e$ obtained in step 4 as a polynomial $V'$;
normalize the query polynomial $V_Q$ and the query expansion word polynomial $V'$ and combine them linearly to obtain a new query term polynomial K as follows,

$$K = \alpha\times\|V_Q\| + \beta\times\|V'\|$$

where α and β are adjustment factors;
Step 6, obtain a new query keyword set $Q'$ from the query term polynomial K obtained in step 5, and perform a second information retrieval with the preset retrieval weight model, using $Q'$ and the weight of each query word of $Q'$ in the polynomial K; the query result obtained is the final information retrieval result.
In step 1, the importance score $w(t_j, d_i)$ is calculated using TFIDF, BM25 or RM3.
Furthermore, in step 2, the relevance score $rel(t_j, Q \mid d_i)$ of each expansion candidate word $t_j$ with the query keywords Q in document $d_i$ is calculated as follows.
Let the co-occurrence of $t_r$ and $q_s$ in a document $d_i$ be denoted $\langle t_r, q_s\rangle$; its relevance is calculated as follows,

$$rel(t_r, q_s \mid d_i) = cf(\langle t_r, q_s\rangle, d_i)\times cidf(\langle t_r, q_s\rangle)$$

where $rel(t_r, q_s \mid d_i)$ is the relevance of $t_r$ and $q_s$ in document $d_i$, $cf(\langle t_r, q_s\rangle, d_i)$ is the co-occurrence frequency of $\langle t_r, q_s\rangle$ in document $d_i$, and $cidf(\langle t_r, q_s\rangle)$ is the co-occurrence inverse document frequency of $\langle t_r, q_s\rangle$;
the relevance of $t_r$ with the query keywords Q in document $d_i$ is then calculated as

$$rel(t_r, Q \mid d_i) = \sum_{s=1}^{m} rel(t_r, q_s \mid d_i)$$
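Furthermore, the co-occurrence frequency of $\langle t_r, q_s\rangle$ in document $d_i$ is calculated as follows,

$$cf(\langle t_r, q_s\rangle, d_i) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{L} Kernel\big(t_r^{k_1}, q_s^{k_2}\big)$$

where M and L are the numbers of occurrences of $t_r$ and $q_s$ in document $d_i$ respectively, $t_r^{k_1}$ is the $k_1$-th occurrence of $t_r$ in document $d_i$, $q_s^{k_2}$ is the $k_2$-th occurrence of $q_s$ in document $d_i$, $k_1$ = 1,2,3,…,M, and $k_2$ = 1,2,3,…,L; $Kernel(t_r^{k_1}, q_s^{k_2})$ expresses, through a kernel function, the proximity of the positions of $t_r^{k_1}$ and $q_s^{k_2}$.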
Also, the kernel function is a Gaussian function or a trigonometric function.
Further, when the kernel function is a Gaussian function, it is calculated as follows,

$$Kernel\big(t_r^{k_1}, q_s^{k_2}\big) = \exp\left[-\frac{(p_t - p_q)^2}{2\sigma^2}\right]$$

where $p_t$ and $p_q$ are the position values of $t_r^{k_1}$ and $q_s^{k_2}$ in the document respectively, and σ is a tuning parameter.
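Furthermore, the co-occurrence inverse document frequency $cidf(\langle t_r, q_s\rangle)$ of $\langle t_r, q_s\rangle$ is calculated as follows,
[formula not recoverable from this text]
where the quantity used in the formula is the total number of co-occurrences of $\langle t_r, q_s\rangle$ in document $d_i$ that satisfy a condition; the condition itself is not recoverable from this text.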
Moreover, the preset retrieval weight model is based on a vector space model, a probabilistic model or a language model.
The invention also provides an information retrieval system based on the pseudo-relevance feedback model, comprising a computer or a server on which the above method is executed.
The information retrieval method provided by the invention, which fuses kernel-function word relevance information into the pseudo-relevance feedback model, overcomes the defect that the traditional pseudo-relevance feedback model considers only word frequency information. In addition, because the relevance between query words and candidate words appearing at different positions in a document is calculated with a kernel function, the distribution of query words and candidate words is highlighted and candidate words with higher relevance to the query topic words can be selected. The additional relevance information also allows candidate words to be located more accurately, which improves the average precision of the expanded query and of the final retrieval. Comparative experiments against several leading international models on several standard international information retrieval evaluation data sets show that the proposed method achieves a remarkable improvement in retrieval accuracy and reaches an internationally leading level.
Drawings
Fig. 1 is a flowchart of a complete information retrieval process according to an embodiment of the present invention.
Detailed description of the invention
The core problem addressed by the invention is as follows: a kernel function is used to reflect the distribution of, and the relevance between, the user's query words and the candidate words in a document; this relevance is fused into a pseudo-relevance feedback model as an additional weight, and query expansion is performed to improve retrieval accuracy.
The information retrieval method that fuses kernel-function word relevance into the pseudo-relevance feedback model is described in detail below with reference to the accompanying drawings and embodiments.
Classical methods make the unreasonable assumption that words are independent of each other; the invention instead takes the relevance between words into account. By making effective use of statistical information in the document collection (such as context information and other information that reflects how words are collocated and used), and by combining this with the query condition, the technical scheme obtains words that reflect the topic of the query condition and are triggered by it. In other words, this information is used to capture the user's information need more accurately.
Kernel functions were originally used to project data that is not linearly separable in the original coordinate system into another space, so that the data becomes as linearly separable as possible in the new space. In the method of the invention, the kernel function is instead used to assess the degree of relatedness of two words in a document.
Referring to fig. 1, the flow of the embodiment is that, when a user performs retrieval according to a related query topic:
The information retrieval system builds a query index over the target document set. When a user submits a query topic, the system preprocesses the query topic into the query keywords Q (Q is a set that generally comprises several topic words $q_1$, $q_2$, $q_3$, etc.). D is the set of all target documents and $N_D$ is the total number of documents in D. The retrieval system then calculates the score of the query keywords Q against each document in D with a preset retrieval weight model (e.g., TFIDF, BM25, RM3, etc.) and ranks the documents from high to low by score to obtain the first query result. Following the principle of pseudo-relevance feedback, the retrieval system takes the first N documents of the first query result as the pseudo-relevant document set $D_1$ (in the research literature N is typically 10, 20 or 30); N is less than or equal to $N_D$, and its value can be preset by those skilled in the art. Once the retrieval system has obtained the pseudo-relevant document set $D_1$ produced by the first query, query expansion words are selected by the following steps,
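The first retrieval pass described above can be sketched as follows. This is only a minimal illustration: the function name `first_pass`, the toy documents, and the scoring rule (plain query-term frequency standing in for the preset retrieval weight model) are all assumptions made for the example, not part of the patent.

```python
from collections import Counter

def first_pass(query_terms, docs, top_n):
    """Rank documents by a toy score and return the ids of the top_n documents.

    docs maps a document id to its token list. The score is plain query-term
    frequency, a hypothetical stand-in for the preset retrieval weight model.
    """
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        scores[doc_id] = sum(counts[q] for q in query_terms)
    # Sort by descending score; ties are broken by document id.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:top_n]

docs = {
    "d1": ["kernel", "query", "expansion", "query"],
    "d2": ["water", "temperature"],
    "d3": ["query", "feedback"],
}
# The top N documents of the first pass form the pseudo-relevant set D_1.
pseudo_relevant = first_pass(["query", "expansion"], docs, top_n=2)
```

In a real system the toy score would be replaced by whichever retrieval weight model has been configured; only the "rank, then keep the first N" structure is taken from the text.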
Step 1, calculate the importance score of every word (i.e. every expansion candidate word) in each document of the pseudo-relevant document set $D_1$. The importance score can be obtained from the term frequency and inverse document frequency of the word (e.g., TFIDF, BM25, RM3, etc.). The importance scores of the same word in different documents are then accumulated in word-vector form and divided by the number N of documents in $D_1$, giving the importance score vector $\vec{W}$ of all expansion candidate words. The elements of the vector are sorted by score from large to small, and the words corresponding to the top $n_1$ scores in $\vec{W}$ are taken ($n_1$ is typically 10, 20, 30 or 50 and can be preset by those skilled in the art), giving the importance expansion candidate word set $Q_1$. A polynomial $V_1$ represents each word in $Q_1$ together with its importance score.
In the invention, each document in the set $D_1$ of N pseudo-relevant documents is treated as a bag of words and expressed as a word vector, where the vector of the i-th document is given by formula (1).

$$\vec{d_i} = \big(w(t_1, d_i),\ w(t_2, d_i),\ \ldots,\ w(t_n, d_i)\big) \qquad \text{formula (1)}$$

In formula (1), $\vec{d_i}$ is the word-vector expression of the i-th document $d_i$ (i = 1,2,3,…,N) in the pseudo-relevant document set $D_1$; $t_1$, $t_2$, $t_3$, …, $t_n$ are all the words in all documents of $D_1$, and n is the total number of these words; $w(t_1, d_i)$, $w(t_2, d_i)$, …, $w(t_n, d_i)$ are the weight scores of $t_1$, $t_2$, …, $t_n$ in document $d_i$ (the weight is the importance score and represents the importance of the expansion candidate word). The importance score of a word is obtained from the term frequency and inverse document frequency of the word (e.g., TFIDF, BM25, RM3, etc.). For example, when TFIDF is used, the importance of term $t_j$ in document $d_i$ is calculated as

$$w(t_j, d_i) = TF(t_j, d_i)\times\log\frac{N_D}{df(t_j)} \qquad \text{formula (2)}$$

where $w(t_j, d_i)$ is the importance score of term $t_j$ in document $d_i$ (j = 1,2,3,…,n), $TF(t_j, d_i)$ is the frequency (number of occurrences) of term $t_j$ in document $d_i$, $N_D$ is the total number of documents in the target document set D, and $df(t_j)$ is the number of documents in the pseudo-relevant set $D_1$ that contain the term $t_j$.
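As a hedged illustration of the TFIDF variant of the importance score, the sketch below builds the importance vector of one document. The function name `tfidf_vector`, the sample numbers, and the exact form `TF * log(N_D / df)` are assumptions for the example; the text names TF, $N_D$ and df but the original formula is not legible, and BM25 or RM3 could be used instead.

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocab, n_docs_total, doc_freq):
    """Importance vector of one document over the shared vocabulary.

    Assumes the common form TF(t, d) * log(N_D / df(t)); doc_freq maps each
    vocabulary word to the number of documents that contain it.
    """
    counts = Counter(tokens)
    return [counts[t] * math.log(n_docs_total / doc_freq[t]) for t in vocab]

vocab = ["kernel", "query"]
tokens = ["kernel", "query", "query"]
# Hypothetical collection statistics: 100 documents, "query" occurs in all of them.
vec = tfidf_vector(tokens, vocab, n_docs_total=100, doc_freq={"kernel": 10, "query": 100})
```

A word that occurs in every document gets weight 0 under this form, which is why "query" contributes nothing in the example even though it appears twice.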
According to formula (2), each document $d_i$ of the N documents can be expressed as a vector $\vec{d_i}$ of the importance of its words. The document vectors are summed and the sum is divided by the total number N of pseudo-relevant documents to obtain the importance score vector $\vec{W}$ of all terms over all documents, as shown in formula (3):

$$\vec{W} = \frac{1}{N}\sum_{i=1}^{N}\vec{d_i} \qquad \text{formula (3)}$$
The importance scores of the words in $\vec{W}$ are sorted from large to small, and the words corresponding to the $n_1$ largest scores in $\vec{W}$ are selected to form the importance query expansion word set $Q_1$. For the convenience of later calculations, a polynomial $V_1$ represents each word in $Q_1$ together with its importance score, as shown in formula (4).

$$V_1 = wh_1\times qh_1 + wh_2\times qh_2 + wh_3\times qh_3 + \cdots + wh_{n_1}\times qh_{n_1} \qquad \text{formula (4)}$$

In formula (4), $qh_1$, $qh_2$, $qh_3$, …, $qh_{n_1}$ are the individual expansion candidate words in $Q_1$ ($n_1$ in total), and $wh_1$, $wh_2$, $wh_3$, …, $wh_{n_1}$ are the scores of the corresponding expansion candidate words in $\vec{W}$.
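The averaging and top-$n_1$ selection of step 1 can be sketched as below. The helper names `average_vectors` and `top_terms` and the sample vectors are hypothetical; the polynomial $V_1$ is represented here simply as a dictionary mapping each selected word to its score.

```python
def average_vectors(doc_vectors):
    """Element-wise mean of the per-document score vectors."""
    n = len(doc_vectors)
    return [sum(column) / n for column in zip(*doc_vectors)]

def top_terms(vocab, scores, n1):
    """Keep the n1 highest-scoring words as a {word: score} 'polynomial'."""
    ranked = sorted(zip(vocab, scores), key=lambda pair: (-pair[1], pair[0]))
    return dict(ranked[:n1])

vocab = ["feedback", "kernel", "water"]
doc_vectors = [[2.0, 1.0, 0.0], [4.0, 3.0, 0.5]]  # one vector per pseudo-relevant document
avg = average_vectors(doc_vectors)                 # averaged score vector
v1 = top_terms(vocab, avg, n1=2)                   # polynomial V_1 as a dict
```

The same two helpers apply unchanged to the relevance vectors of step 2, since that step also averages per-document vectors and keeps the $n_1$ largest entries.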
Step 2, calculate in turn the relevance score between every word (i.e. every expansion candidate word) in each document of the pseudo-relevant document set $D_1$ and the query words. The score is computed with a kernel function from the positions of the query words and the expansion candidate words in each document. The scores of the same word in different documents are then accumulated to obtain the relevance score vector $\vec{W}'$ of all expansion candidate words with respect to the query words. The elements of the vector are sorted by score from large to small, and the words corresponding to the top $n_1$ scores in $\vec{W}'$ are taken ($n_1$ is typically 10, 20, 30 or 50), giving the relevance expansion candidate word set $Q_1'$. A polynomial $V_1'$ represents each word in $Q_1'$ together with its relevance score.
For ease of explanation, consider an expansion candidate word $t_r$ and a query word $q_s$ (where r = 1,2,3,…,n, n is the number of all words in the pseudo-relevant document set $D_1$, s = 1,2,3,…,m, and m is the number of words in the query keyword set Q). If $t_r$ and $q_s$ co-occur in a document $d_i$, the co-occurrence is denoted $\langle t_r, q_s\rangle$, and the pair has a co-occurrence weight (i.e. a relevance). Because $t_r$ and $q_s$ may each occur at several positions in a document, the relevance of $t_r$ and $q_s$ in document $d_i$ cannot be represented simply by a count of their co-occurrences. To measure the relevance more reasonably, the invention provides the following formula:

$$rel(t_r, q_s \mid d_i) = cf(\langle t_r, q_s\rangle, d_i)\times cidf(\langle t_r, q_s\rangle) \qquad \text{formula (5)}$$

In formula (5), $rel(t_r, q_s \mid d_i)$ is the relevance of $t_r$ and $q_s$ in document $d_i$.
In formula (5), $cf(\langle t_r, q_s\rangle, d_i)$ is the co-occurrence frequency of $\langle t_r, q_s\rangle$ in document $d_i$, calculated as follows:

$$cf(\langle t_r, q_s\rangle, d_i) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{L} Kernel\big(t_r^{k_1}, q_s^{k_2}\big) \qquad \text{formula (6)}$$

In formula (6), M and L are the numbers of occurrences of $t_r$ and $q_s$ in document $d_i$ respectively, $t_r^{k_1}$ is the $k_1$-th occurrence of $t_r$ in document $d_i$, $q_s^{k_2}$ is the $k_2$-th occurrence of $q_s$ in document $d_i$, $k_1$ = 1,2,3,…,M, and $k_2$ = 1,2,3,…,L. Kernel() is a kernel function, i.e. a function that measures the proximity of two words from their position information: the closer the positions of two co-occurring words, the stronger their proximity relationship and the higher their relevance. Gaussian functions, trigonometric functions and similar kernels are very effective in many scenarios. The embodiment uses a Gaussian kernel to express the proximity of $t_r^{k_1}$ and $q_s^{k_2}$ (other kernel functions may be used in specific implementations), as in formula (7):

$$Kernel\big(t_r^{k_1}, q_s^{k_2}\big) = \exp\left[-\frac{(p_t - p_q)^2}{2\sigma^2}\right] \qquad \text{formula (7)}$$

where $p_t$ and $p_q$ are the position values of $t_r^{k_1}$ and $q_s^{k_2}$ in the document respectively (i.e. the index of the word's position in the document, a positive integer), and σ is a parameter that adjusts the spread of the Gaussian function; σ preferably lies in the range 10 to 100 and is 50 in the specific embodiment.
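A minimal sketch of the kernel-based co-occurrence frequency follows. The Gaussian form `exp(-(p_t - p_q)^2 / (2 * sigma^2))` is the standard Gaussian kernel and is an assumption here, since the original formula image is not preserved; the function names and the token list are hypothetical, and positions are taken as 1-based word offsets.

```python
import math

def gaussian_kernel(p_t, p_q, sigma=50.0):
    """Proximity of two word positions; equals 1.0 when the positions coincide."""
    return math.exp(-((p_t - p_q) ** 2) / (2.0 * sigma ** 2))

def cooccurrence_frequency(tokens, term, query_word, sigma=50.0):
    """Sum the kernel over every (term occurrence, query-word occurrence) pair."""
    term_pos = [i + 1 for i, tok in enumerate(tokens) if tok == term]
    query_pos = [i + 1 for i, tok in enumerate(tokens) if tok == query_word]
    return sum(gaussian_kernel(pt, pq, sigma) for pt in term_pos for pq in query_pos)

tokens = ["q", "a", "t", "b", "t"]  # "t" occurs at positions 3 and 5, "q" at position 1
cf = cooccurrence_frequency(tokens, "t", "q", sigma=50.0)
```

With a large σ such as 50, nearby and moderately distant pairs score almost equally; a smaller σ makes the score fall off faster with distance, which is the effect the tuning range is meant to control.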
In formula (5), $cidf(\langle t_r, q_s\rangle)$ is the co-occurrence inverse document frequency of $\langle t_r, q_s\rangle$, calculated as follows:
[formula (8) is not recoverable from this text]
where the quantity used in formula (8) is the total number of co-occurrences of $\langle t_r, q_s\rangle$ in document $d_i$ that satisfy a condition; the condition itself is not recoverable from this text.
Formula (5) gives the relevance $rel(t_r, q_s \mid d_i)$ of $t_r$ and $q_s$ in document $d_i$. Because $q_s$ is one of the query words in the query keyword set Q, the relevance of $t_r$ with the query keywords Q in document $d_i$ can be obtained from formula (5); the invention denotes it $rel(t_r, Q \mid d_i)$ and calculates it as follows:

$$rel(t_r, Q \mid d_i) = \sum_{s=1}^{m} rel(t_r, q_s \mid d_i) \qquad \text{formula (9)}$$
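The step from pairwise relevance to relevance with the whole query can be sketched as below. Several things here are assumptions rather than the patent's confirmed formulas: the pairwise relevance is taken as co-occurrence frequency times co-occurrence inverse document frequency, the per-query-word values are summed, and the inverse document frequency values are supplied by the caller in a hypothetical `cidf` dictionary because their formula is not legible in the text.

```python
import math

def kernel_cf(tokens, term, query_word, sigma=50.0):
    """Gaussian-kernel co-occurrence frequency of (term, query_word) in one document."""
    t_pos = [i + 1 for i, tok in enumerate(tokens) if tok == term]
    q_pos = [i + 1 for i, tok in enumerate(tokens) if tok == query_word]
    return sum(math.exp(-((pt - pq) ** 2) / (2.0 * sigma ** 2))
               for pt in t_pos for pq in q_pos)

def relevance_to_query(tokens, term, query_words, cidf, sigma=50.0):
    """Assumed form: sum over query words of cf(term, q) * cidf[(term, q)]."""
    return sum(kernel_cf(tokens, term, q, sigma) * cidf[(term, q)]
               for q in query_words)

tokens = ["q1", "t", "q2"]
# Hypothetical co-occurrence inverse document frequencies supplied by the caller.
cidf = {("t", "q1"): 2.0, ("t", "q2"): 0.5}
rel = relevance_to_query(tokens, "t", ["q1", "q2"], cidf)
```

Keeping `cidf` as an input makes the sketch independent of how the inverse document frequency is actually computed.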
According to formula (9), the i-th document $d_i$ of the N documents in the pseudo-relevant document set $D_1$ can be expressed as a vector $\vec{d_i}'$ of the relevance between the expansion candidate words and the query words, as follows.

$$\vec{d_i}' = \big(rel(t_1, Q \mid d_i),\ rel(t_2, Q \mid d_i),\ \ldots,\ rel(t_n, Q \mid d_i)\big) \qquad \text{formula (10)}$$
Next, the relevance vectors $\vec{d_i}'$ of the documents are summed and the sum is divided by the total number N of pseudo-relevant documents, which finally gives the relevance score vector $\vec{W}'$ of all terms over all documents, as shown in formula (11):

$$\vec{W}' = \frac{1}{N}\sum_{i=1}^{N}\vec{d_i}' \qquad \text{formula (11)}$$
The relevance scores of the words in $\vec{W}'$ are sorted from large to small, and the words corresponding to the $n_1$ largest scores in $\vec{W}'$ are selected to form the relevance query expansion word set $Q_1'$. For the convenience of later calculations, a polynomial $V_1'$ represents each word in $Q_1'$ together with its relevance score, as shown in formula (12).

$$V_1' = wh_1'\times qh_1' + wh_2'\times qh_2' + wh_3'\times qh_3' + \cdots + wh_{n_1}'\times qh_{n_1}' \qquad \text{formula (12)}$$

In formula (12), $qh_1'$, $qh_2'$, $qh_3'$, …, $qh_{n_1}'$ are the individual expansion words in $Q_1'$ ($n_1$ in total), and $wh_1'$, $wh_2'$, $wh_3'$, …, $wh_{n_1}'$ are the scores of the corresponding expansion words in $\vec{W}'$.
Step 3, the query expansion word polynomials $V_1$ and $V_1'$ obtained in step 1 and step 2 are normalized and then combined linearly to obtain a new query term polynomial V, as shown in formula (13).

$$V = (1-\gamma)\times\|V_1\| + \gamma\times\|V_1'\| \qquad \text{formula (13)}$$
In formula (13), $\|X\|$ denotes normalization of the vector X. Normalization unifies the scale, i.e. it maps the value of each element in the vector to the interval [0, 1.0], which also facilitates subsequent parameter tuning. There are many ways to normalize; this embodiment uses division by the maximum, in which the normalized value of each element is its original value divided by the largest element in the vector. For example, the vector [1, 2, 3, 4] has 4 elements with maximum value 4, so dividing by the maximum gives [1/4, 2/4, 3/4, 4/4], i.e. [0.25, 0.5, 0.75, 1]; all values of the original vector are normalized into the interval [0, 1.0].
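The division-by-maximum normalization and the linear combination of step 3 can be sketched as follows. Polynomials are represented as `{word: coefficient}` dictionaries; the function names, the sample scores and γ = 0.5 are illustrative assumptions. The sketch also assumes that a word missing from one polynomial contributes 0 from that side and that all scores are positive.

```python
def normalize_by_max(poly):
    """Divide every coefficient by the largest one (assumes positive scores)."""
    top = max(poly.values())
    return {term: weight / top for term, weight in poly.items()}

def combine(v1, v1_prime, gamma):
    """V = (1 - gamma) * ||V_1|| + gamma * ||V_1'|| over the union of words."""
    a, b = normalize_by_max(v1), normalize_by_max(v1_prime)
    return {t: (1 - gamma) * a.get(t, 0.0) + gamma * b.get(t, 0.0)
            for t in set(a) | set(b)}

normalized = normalize_by_max({"a": 1, "b": 2, "c": 3, "d": 4})
v = combine({"x": 2.0, "y": 1.0}, {"y": 4.0, "z": 2.0}, gamma=0.5)
```

A word that appears in both polynomials, such as "y" in the example, is rewarded from both sides, which is how the combination favors candidates that are both important and close to the query words.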
The adjustment factor γ in formula (13) ranges from 0 to 1.0. It balances the importance score of an expansion word against the relevance score between the expansion word and the query words. In a specific application, the optimal value of γ for the target document set can be determined in advance using test data.
Step 4, the terms of the polynomial V obtained in step 3 are sorted from large to small by coefficient (the combined weight score), and the $n_1$ terms with the largest coefficients are taken to obtain a new expansion word set $Q_e$, i.e. the final set of query expansion words.
Step 5: the original query keyword set $Q$ is expressed as a polynomial $V_Q$. Each term of $V_Q$ is a query word $q_s$ of $Q$, where s = 1,2,3,…,m, and the coefficient of each term is set to 1.0, so that it can be expressed as
$V_Q = 1.0 \times q_1 + 1.0 \times q_2 + 1.0 \times q_3 + \ldots + 1.0 \times q_m$    formula (14)
Then the expansion word set obtained in step 4 is likewise expressed as a polynomial $V'$. Each term of $V'$ is a word of that set, and its coefficient is the value of the corresponding term in the polynomial $V$ of step 4:
$V' = w'_1 \times q'_1 + w'_2 \times q'_2 + w'_3 \times q'_3 + \ldots + w'_{n_1} \times q'_{n_1}$    formula (15)
where $q'_1, q'_2, q'_3, \ldots, q'_{n_1}$ denote the individual expansion words of the set ($n_1$ in total), and $w'_1, w'_2, w'_3, \ldots, w'_{n_1}$ denote the scores of the corresponding expansion words in the query term polynomial $V$.
The query polynomial $V_Q$ and the query expansion word polynomial $V'$ are normalized and then linearly combined again to obtain a new query term polynomial $K$, as shown in formula (16).
$K = \alpha \times \|V_Q\| + \beta \times \|V'\|$    formula (16)
Formula (16) uses the same normalization method as step 3. The adjustment factor α is generally fixed at 1.0, and the adjustment factor β ranges from 0 to 1.0. These factors balance the weights between the original query words and the expansion words, and β can be set to an empirical value in a specific implementation.
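Step 5 can be illustrated with a short Python sketch. It is an example under stated assumptions, not the patented implementation: the dictionary representation, the function names, and the default `beta=0.5` are chosen only for illustration (the source says only that β lies between 0 and 1.0 and is set empirically).

```python
from typing import Dict, Iterable


def normalize_by_max(poly: Dict[str, float]) -> Dict[str, float]:
    """Divide-by-maximum normalization, as described for step 3."""
    if not poly:
        return {}
    largest = max(poly.values())
    if largest == 0:
        return dict(poly)
    return {term: weight / largest for term, weight in poly.items()}


def build_query_polynomial(query_words: Iterable[str],
                           expansion: Dict[str, float],
                           alpha: float = 1.0,
                           beta: float = 0.5) -> Dict[str, float]:
    """Formula (16): K = alpha * ||V_Q|| + beta * ||V'||."""
    # Formula (14): every original query word gets coefficient 1.0.
    v_q = {word: 1.0 for word in query_words}
    norm_q = normalize_by_max(v_q)
    norm_e = normalize_by_max(expansion)
    k: Dict[str, float] = {}
    for term in set(norm_q) | set(norm_e):
        # Assumption: a word present in both polynomials receives both contributions.
        k[term] = alpha * norm_q.get(term, 0.0) + beta * norm_e.get(term, 0.0)
    return k
```

For the query words `["apple", "pie"]` and the expansion polynomial `{"pie": 0.8, "crust": 0.4}`, the sketch gives "apple" a weight of 1.0, "pie" a weight of 1.5 and "crust" a weight of 0.25, so an original query word that is also selected as an expansion word is weighted most heavily.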
Step 6: a new query keyword set $Q'$ is obtained from step 5, each query word of $Q'$ being a term of the query term polynomial $K$. A second information retrieval (with the same retrieval model as the first) is performed using $Q'$ and the weight of each of its query words in $K$, that is, the score of $Q'$ against each document in the target document set $D$ is calculated again; the resulting query result is the final information retrieval result.
In the second retrieval, the query words are the newly generated query keyword set $Q'$, and when the score of a query word against each document is calculated, the weight of the query word is its coefficient in the query term polynomial $K$; in the first retrieval, the weight of every query word is 1.0.
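The difference between the two retrieval passes lies only in the query word weights. The Python sketch below makes this concrete. Raw term frequency stands in for the preset retrieval weighting model purely for illustration; the source leaves the model open (vector space, probabilistic or language model), and the function names and the tokenized-document format are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def weighted_score(query_weights: Dict[str, float], doc_tokens: List[str]) -> float:
    """Sum over query words of (weight in K) * (stand-in per-term document score)."""
    counts = Counter(doc_tokens)
    # Stand-in term score: raw term frequency. A real system would use its
    # own retrieval model here (for example BM25).
    return sum(weight * counts[term] for term, weight in query_weights.items())


def retrieve(query_weights: Dict[str, float],
             documents: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Score every document and rank from highest to lowest score."""
    scored = [(doc_id, weighted_score(query_weights, tokens))
              for doc_id, tokens in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The first retrieval corresponds to calling `retrieve` with every query word weighted 1.0, and the second to calling it with the coefficients of the polynomial $K$.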
In a specific implementation, a person skilled in the art can use software technology to run the above process automatically. Accordingly, an information retrieval system based on a pseudo-relevance feedback model, comprising a computer or server on which the above process is executed to fuse word relevance into the pseudo-relevance feedback model for information retrieval, also falls within the scope of protection of the present invention.
For example, the development environment may be Java or Python, with Lucene as the supporting library.
The information retrieval framework may be a pseudo-relevance feedback retrieval framework based on a vector space model, a probabilistic model, a language model, or the like.
To verify the actual effect of the method, comparison experiments were carried out on several standard data sets. The experiments are divided into two groups: one uses the standard Rocchio pseudo-relevance feedback retrieval model, and the other uses the Rocchio pseudo-relevance feedback retrieval model combined with the method of the present invention, abbreviated KRC. Six standard international data sets were used: AP88-89, AP90, DISK1&2, DISK4&5, WT2G and WT10G. Their details are given in Table 1:
Data set | Number of documents | Size | Query topic numbers | Number of query topics |
---|---|---|---|---|
AP90 | 78,321 | 0.23 GB | 51-100 | 50 |
AP88-89 | 164,597 | 0.50 GB | 51-100 | 50 |
DISK1&2 | 741,856 | 2.03 GB | 51-200 | 150 |
DISK4&5 | 528,155 | 1.85 GB | 301-450 | 150 |
WT2G | 247,491 | 2.14 GB | 401-450 | 50 |
WT10G | 1,692,096 | 10 GB | 451-550 | 100 |
Table 1. Basic information of the six data sets
In the comparison experiments, a Gaussian kernel function is used as the kernel function of the method of the present invention (other kernel functions may also be used), with σ = 50. To make the comparison fairer, four settings of the number of query expansion words $n_1$ are used, namely 10, 20, 30 and 50; the experimental results for each setting are shown in Table 2.
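To give a feel for what σ = 50 means, the following Python sketch evaluates a Gaussian kernel on word positions. The exact kernel formula of the method is not reproduced in this text, so the common form exp(-(p_t - p_q)^2 / (2σ^2)) is assumed here. Summing the kernel over all pairs of occurrences to obtain a co-occurrence frequency is likewise an assumption, based on the description of the claims, and the function names are hypothetical.

```python
import math
from typing import Iterable


def gaussian_kernel(p_t: int, p_q: int, sigma: float = 50.0) -> float:
    """Positional proximity of two word occurrences (assumed Gaussian form)."""
    return math.exp(-((p_t - p_q) ** 2) / (2.0 * sigma ** 2))


def cooccurrence_frequency(positions_t: Iterable[int],
                           positions_q: Iterable[int],
                           sigma: float = 50.0) -> float:
    """Assumed aggregation: sum the kernel over every pair of occurrences."""
    positions_q = list(positions_q)
    return sum(gaussian_kernel(p_t, p_q, sigma)
               for p_t in positions_t
               for p_q in positions_q)
```

Under this assumed form, two occurrences 50 positions apart contribute exp(-0.5), roughly 0.61, while occurrences several hundred positions apart contribute almost nothing, so a larger σ lets more distant word pairs count as co-occurring.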
Table 2. Mean average precision (MAP) of the Rocchio and KRC models on the six standard data sets
In Table 2, the Rocchio model in the second column does not use the method of the present invention, the KRC model is the Rocchio model using the method of the present invention, and MAP is the mean average precision of the retrieval results, which can be read from the table.
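For reference, MAP as commonly defined in information retrieval evaluation can be computed as in the Python sketch below. It follows the standard definition (the mean over query topics of the average precision of each ranking) and is not taken from the patent; the function names and input format are assumptions, and each ranking is assumed to contain no duplicate document identifiers.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def average_precision(ranked_doc_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """Average of the precision values at the ranks where relevant documents appear."""
    relevant: Set[str] = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Iterable[str]]]) -> float:
    """Mean of the average precision over all query topics."""
    if not runs:
        return 0.0
    return sum(average_precision(ranking, relevant) for ranking, relevant in runs) / len(runs)
```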
Claims (9)
1. An information retrieval method based on a pseudo-relevance feedback model, characterized in that: word relevance is fused into the pseudo-relevance feedback model to realize information retrieval, which comprises, when generating query expansion words from the pseudo-relevant document set, separately generating query expansion words characterized by the importance of the candidate expansion words and query expansion words characterized by the relevance between the candidate expansion words and the query subject words, and then combining both kinds of query expansion words with the original query words to complete the final information retrieval; when generating the query expansion words characterized by the relevance between the candidate expansion words and the query subject words, a kernel function is used to calculate the relevance between the query words and the candidate words appearing at different positions in the document;
the word relevance is fused into the pseudo-relevance feedback model in the following way,
when a user submits a query topic, the query topic is preprocessed to obtain the query keyword set $Q$; $D$ is the set of all target documents and $N_D$ is the total number of documents in $D$; the score of the query keyword set $Q$ against each document in the target document set $D$ is calculated by a preset retrieval weighting model, and the documents are ranked from high to low score to obtain the first query result; following the pseudo-relevance feedback approach, the top $N$ documents of the ranked target document set $D$ are taken as the pseudo-relevant document set $D_1$; the query expansion words are then selected by the following steps,
step 1, all words in each document of the pseudo-relevant document set $D_1$ are taken as candidate expansion words, and the importance score of each candidate expansion word $t_j$ in each document $d_i$ of $D_1$ is calculated, giving the importance score vector of each document $d_i$,
wherein i = 1,2,3,…,N and j = 1,2,3,…,N;
the importance score vector of the expansion candidate words over all documents is then calculated;
the importance score of each expansion candidate word is taken from this vector, the scores are sorted from largest to smallest, and the expansion candidate words corresponding to the $n_1$ largest scores are selected to form the importance query expansion word set $Q_1$; a polynomial $V_1$ is used to represent each word in $Q_1$ together with its importance score;
step 2, all words in each document of the pseudo-relevant document set $D_1$ are taken as expansion candidate words, and a kernel function is used to calculate, from the co-occurrence positions and the co-occurrence frequency, the relevance score of each expansion candidate word $t_r$ with the query keyword set $Q$ in each document $d_i$, giving the relevance score vector of each document $d_i$,
wherein i = 1,2,3,…,N and r = 1,2,3,…,N;
the relevance score vector of the expansion candidate words over all documents is then calculated;
the relevance score of each expansion candidate word is taken from this vector, the scores are sorted from largest to smallest, and the expansion candidate words corresponding to the $n_1$ largest scores are selected to form the relevance query expansion word set $Q'_1$; a polynomial $V'_1$ is used to represent each word in $Q'_1$ together with its relevance score;
step 3, the polynomials $V_1$ and $V'_1$ obtained in step 1 and step 2 are normalized and then linearly combined to obtain a new query term polynomial $V$ as follows,
$V = (1-\gamma) \times \|V_1\| + \gamma \times \|V'_1\|$
wherein $\|X\|$ denotes the normalization of the vector $X$, and $\gamma$ is an adjustment factor;
step 4, the terms of the query term polynomial $V$ obtained in step 3 are sorted by coefficient from largest to smallest, and the $n_1$ terms with the largest coefficients are taken out to obtain a new expansion word set;
step 5, let the query keyword set $Q$ comprise the query words $q_s$, s = 1,2,3,…,m; $Q$ is expressed as a polynomial $V_Q$ in which the coefficient of each query word is set to 1.0; the expansion word set obtained in step 4 is expressed as a polynomial $V'$;
the query polynomial $V_Q$ and the query expansion word polynomial $V'$ are normalized and then linearly combined to obtain a new query term polynomial $K$ as follows,
$K = \alpha \times \|V_Q\| + \beta \times \|V'\|$
wherein $\alpha$ and $\beta$ are adjustment factors;
step 6, a new query keyword set $Q'$ is obtained from the query term polynomial $K$ obtained in step 5; using $Q'$ and the weight of each of its query words in $K$, a second information retrieval is performed with the preset retrieval weighting model, and the resulting query result is the final information retrieval result.
2. The information retrieval method based on the pseudo-relevance feedback model according to claim 1, characterized in that: in step 1, the importance score is obtained using TFIDF, BM25 or RM3.
3. The information retrieval method based on the pseudo-relevance feedback model according to claim 1, characterized in that: in step 2, the relevance score of each expansion candidate word $t_r$ with the query keyword set $Q$ in document $d_i$ is calculated as follows,
the relevance of $t_r$ and $q_s$ co-occurring in a document $d_i$ is calculated from the co-occurrence frequency of $t_r$ and $q_s$ in document $d_i$ and the co-occurrence inverse document frequency of $t_r$ and $q_s$ in document $d_i$;
the relevance of $t_r$ with the query keyword set $Q$ in document $d_i$ is then obtained.
4. The information retrieval method based on the pseudo-relevance feedback model according to claim 3, characterized in that: the co-occurrence frequency of $t_r$ and $q_s$ in document $d_i$ is calculated using the kernel function over the occurrences of $t_r$ and $q_s$,
wherein $M$ and $L$ denote the numbers of occurrences of $t_r$ and $q_s$ in document $d_i$, respectively; the $k_1$-th occurrence of $t_r$ and the $k_2$-th occurrence of $q_s$ in document $d_i$ are considered for $k_1$ = 1,2,3,…,M and $k_2$ = 1,2,3,…,L; and the kernel function expresses the positional proximity of these two occurrences.
5. The information retrieval method based on the pseudo-relevance feedback model according to claim 4, characterized in that: the kernel function is a Gaussian function or a trigonometric function.
6. The information retrieval method based on the pseudo-relevance feedback model according to claim 5, characterized in that: when the kernel function is a Gaussian function, it is calculated from $p_t$ and $p_q$, the position values in the document of the occurrences of $t_r$ and $q_s$, respectively, with $\sigma$ as the tuning parameter.
7. The information retrieval method based on the pseudo-relevance feedback model according to claim 4, characterized in that: the co-occurrence inverse document frequency of $t_r$ and $q_s$ in document $d_i$ is calculated using the total number of co-occurrences of $t_r$ and $q_s$ in document $d_i$.
8. The information retrieval method based on the pseudo-relevance feedback model according to any one of claims 1 to 7, characterized in that: the preset retrieval weighting model is based on a vector space model, a probabilistic model or a language model.
9. An information retrieval system based on a pseudo-relevance feedback model, characterized in that: it comprises a computer or server on which the method according to any one of claims 1 to 8 is performed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710370190.XA CN107247745B (en) | 2017-05-23 | 2017-05-23 | A kind of information retrieval method and system based on pseudo-linear filter model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107247745A (en) | 2017-10-13 |
CN107247745B (en) | 2018-07-03 |
Family
ID=60016912
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710370190.XA Active CN107247745B (en) | 2017-05-23 | 2017-05-23 | A kind of information retrieval method and system based on pseudo-linear filter model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107247745B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108062355B (en) * | 2017-11-23 | 2020-07-31 | 华南农业大学 | Query term expansion method based on pseudo feedback and TF-IDF |
CN108520033B (en) * | 2018-03-28 | 2020-01-24 | 华中师范大学 | Enhanced pseudo-correlation feedback model information retrieval method based on hyperspace simulation language |
CN108733745B (en) * | 2018-03-30 | 2021-10-15 | 华东师范大学 | Query expansion method based on medical knowledge |
CN108921741A (en) * | 2018-04-27 | 2018-11-30 | 广东机电职业技术学院 | A kind of internet+foreign language expansion learning method |
CN108897737A (en) * | 2018-06-28 | 2018-11-27 | 中译语通科技股份有限公司 | A kind of core vocabulary special topic construction method and system based on big data analysis |
CN109189915B (en) * | 2018-09-17 | 2021-10-15 | 重庆理工大学 | Information retrieval method based on depth correlation matching model |
CN109829104B (en) * | 2019-01-14 | 2022-12-16 | 华中师范大学 | Semantic similarity based pseudo-correlation feedback model information retrieval method and system |
CN109918661B (en) * | 2019-03-04 | 2023-05-30 | 腾讯科技(深圳)有限公司 | Synonym acquisition method and device |
CN110442777B (en) * | 2019-06-24 | 2022-11-18 | 华中师范大学 | BERT-based pseudo-correlation feedback model information retrieval method and system |
CN111737413A (en) * | 2020-05-26 | 2020-10-02 | 湖北师范大学 | Feedback model information retrieval method, system and medium based on concept net semantics |
CN111723179B (en) * | 2020-05-26 | 2023-07-07 | 湖北师范大学 | Feedback model information retrieval method, system and medium based on conceptual diagram |
CN111625624A (en) * | 2020-05-27 | 2020-09-04 | 湖北师范大学 | Pseudo-correlation feedback information retrieval method, system and storage medium based on BM25+ ALBERT model |
CN112307182B (en) * | 2020-10-29 | 2022-11-04 | 上海交通大学 | Question-answering system-based pseudo-correlation feedback extended query method |
CN112988977A (en) * | 2021-04-25 | 2021-06-18 | 成都索贝数码科技股份有限公司 | Fuzzy matching media asset content library retrieval method based on approximate words |
CN116933766B (en) * | 2023-06-02 | 2024-08-16 | 盐城工学院 | Ad-hoc information retrieval model based on triple word frequency scheme |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103324707A (en) * | 2013-06-18 | 2013-09-25 | 哈尔滨工程大学 | Query expansion method based on semi-supervised clustering |
US9411886B2 (en) * | 2008-03-31 | 2016-08-09 | Yahoo! Inc. | Ranking advertisements with pseudo-relevance feedback and translation models |
CN105975596A (en) * | 2016-05-10 | 2016-09-28 | 上海珍岛信息技术有限公司 | Query expansion method and system of search engine |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103678412B (en) * | 2012-09-21 | 2016-12-21 | 北京大学 | A kind of method and device of file retrieval |
Non-Patent Citations (2)
Title |
---|
Query Dependent Pseudo Relevance Feedback based on Wikipedia; Xu Y et al.; ACM; 20090723; full text * |
Patent retrieval and analysis supporting technological innovation; Liu Bin; 《通讯学报》; 20160331; Vol. 37 (No. 3); p. 81 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||