CN107247745B - Information retrieval method and system based on a pseudo-relevance feedback model - Google Patents

Info

Publication number
CN107247745B
CN107247745B CN201710370190.XA CN201710370190A CN107247745B CN 107247745 B CN107247745 B CN 107247745B CN 201710370190 A CN201710370190 A CN 201710370190A CN 107247745 B CN107247745 B CN 107247745B
Authority
CN
China
Prior art keywords
query
document
word
expansion
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710370190.XA
Other languages
Chinese (zh)
Other versions
CN107247745A (en
Inventor
何婷婷
潘敏
简芳洪
毛智明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong Normal University
Original Assignee
Huazhong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong Normal University filed Critical Huazhong Normal University
Priority to CN201710370190.XA priority Critical patent/CN107247745B/en
Publication of CN107247745A publication Critical patent/CN107247745A/en
Application granted granted Critical
Publication of CN107247745B publication Critical patent/CN107247745B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3325Reformulation based on results of preceding query
    • G06F16/3326Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an information retrieval method based on a pseudo-relevance feedback model, in which word relevance is fused into the pseudo-relevance feedback model to perform information retrieval. When query expansion words are generated from the pseudo-relevant document set, two kinds of query expansion words are generated separately: one characterized by the importance of the candidate expansion words, and one characterized by the relevance between the candidate expansion words and the query subject words. The two are then combined with the original query to complete the final information retrieval. When generating the query expansion words characterized by relevance to the query subject words, a kernel function is used to calculate the relevance between query words and candidate words that appear at different positions in a document. The invention highlights the distribution of query words and candidate words, selects candidate words that are more relevant to the query subject words, and uses the additional relevance information to locate candidate words more accurately, thereby improving the precision of the expanded query and of the final retrieval.

Description

Information retrieval method and system based on a pseudo-relevance feedback model
Technical Field
The invention belongs to the technical field of information retrieval, and particularly relates to an information retrieval method and system that fuse kernel-function-based word relevance into a pseudo-relevance feedback model.
Background
In an age of increasingly intense competition for information, browsing and acquiring information through a search engine has become an important part of daily life. However, network resources are extremely rich and the total amount of information is expanding rapidly, so it is difficult for users to acquire and identify important information efficiently and accurately, and information processing technology urgently needs more effective theories and methods for handling ever larger amounts of data. Information retrieval, as a classic text processing technique, can meet this need and is rapidly becoming a research focus in the field of information processing.
Information retrieval refers to the process and techniques of organizing information in a certain way and finding relevant information according to the needs of information users. The process can be described simply as follows: according to an information need, a user composes a query string and submits it to an information retrieval system, which retrieves the subset of documents related to the query from a document set and returns it to the user. Specifically, given a set of query topics, an information retrieval model calculates the relevance of every document in the target collection to the query topics and returns the documents in descending order of score; the earlier a document appears in the returned result, the more relevant it is to the query topic. Over the last half century of research, a number of effective information retrieval models have been proposed and gradually applied in practical systems. The more influential retrieval models include Boolean logic models, vector space models, probabilistic models, language models, and the more recently proposed retrieval models based on supervised learning.
In practical information retrieval applications, there is often a gap between the user's query request and the query results returned by the system, which degrades the performance of the retrieval system. Information retrieval is therefore often an iterative process, and users frequently need to adjust their queries several times to obtain satisfactory results. By expanding and reconstructing the user's initial query, query expansion addresses both the mismatch between the terms used in the query and the terms used in the documents and the incompleteness of the user's expression, and it is therefore widely used in information retrieval. In brief, query expansion means that before retrieval is performed, the retrieval system automatically expands the keywords of the user query with synonyms or near-synonyms from an expansion vocabulary to form a new query, and then performs retrieval with the new query.
Pseudo-relevance feedback was introduced to make retrieval systems more effective and to make the retrieval results better satisfy the user's query request. Its main mechanism is that the system assumes by default that its own first retrieval result contains many documents relevant to the user's query topic, and takes the top N documents as relevant documents with which to adjust or expand the query.
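The selection of the top N documents can be illustrated with a minimal Python sketch. The scoring function (a plain sum of query term frequencies), the function names and the toy documents are assumptions made only for illustration; a real system would use its own retrieval model for the first pass.

```python
from collections import Counter

def first_pass_scores(query_terms, docs):
    """Toy first-pass score: sum of raw term frequencies of the query terms."""
    scores = []
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores.append((doc_id, sum(tf[q] for q in query_terms)))
    # Rank from high to low score; ties are broken by document id.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))

def pseudo_relevant_set(query_terms, docs, top_n):
    """Treat the top_n ranked documents as relevant (pseudo-relevance feedback)."""
    ranking = first_pass_scores(query_terms, docs)
    return [doc_id for doc_id, _ in ranking[:top_n]]

docs = {
    "d1": "kernel methods for text retrieval".split(),
    "d2": "query expansion improves retrieval of text".split(),
    "d3": "cooking recipes for pasta".split(),
}
feedback_docs = pseudo_relevant_set(["text", "retrieval"], docs, top_n=2)
```

No relevance judgment from the user is involved: the top-ranked documents are simply assumed to be relevant, which is why the quality of the first pass matters for everything that follows.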
Generally, many factors affect the performance of a retrieval system. The most critical is the information retrieval strategy, which includes the representation of documents and queries, the matching strategy for evaluating the relevance between documents and queries, the ranking method for query results, and the mechanism by which the user gives relevance feedback.
With the development of the high-speed internet, information has piled up in vast quantities, and the accuracy of information search has become the primary concern of users. It is becoming more and more difficult for users to find what they want with an information retrieval tool, and the flood of information forces users to spend more time judging which information is valuable to them. Existing information retrieval methods generally suffer from low average retrieval precision: even the best current retrieval models reach an average precision of only about 30%, so there is still a long way to go in improving retrieval precision. Information retrieval has penetrated every aspect of daily life, and most people use search tools such as Baidu and Google every day to find the material they need and solve practical problems. In 2010, the number of Chinese web search requests exceeded 60 billion, and in 2016 Baidu alone handled 6 billion search requests per day. At this scale, every percentage point of improvement in average retrieval precision saves a great deal of the time and energy spent on acquiring information, and is therefore of considerable value. Large internet companies are also continuously pursuing lower-cost and more efficient information retrieval technologies.
Disclosure of Invention
The problem to be solved by the invention is to optimize query expansion so as to ultimately improve the average retrieval precision.
The invention provides an information retrieval method based on a pseudo-relevance feedback model, which fuses word relevance into the pseudo-relevance feedback model to perform information retrieval. When query expansion words are generated from the pseudo-relevant document set, query expansion words characterized by the importance of the candidate expansion words and query expansion words characterized by the relevance between the candidate expansion words and the query subject words are generated separately, and the two are then combined with the original query to complete the final information retrieval. When generating the query expansion words characterized by relevance to the query subject words, a kernel function is used to calculate the relevance between query words and candidate words appearing at different positions in a document.
Moreover, the word relevance is fused into the pseudo-relevance feedback model to perform information retrieval as follows.
When a user submits a query topic, the query topic is preprocessed to obtain the query keywords $Q$. Let $D$ be the set of all target documents and $N_D$ the total number of documents in $D$. The score of the query keywords $Q$ against each document in $D$ is calculated with a preset retrieval weight model, and the documents are ranked from high to low score to obtain the first query result. Following the pseudo-relevance feedback approach, the top $N$ documents of the first query result are taken as the pseudo-relevant document set $D_1$. Query expansion words are then selected through the following steps.
Step 1. All words in the documents of the pseudo-relevant document set $D_1$ are taken as expansion candidate words, and the importance score $w(t_j, d_i)$ of each expansion candidate word $t_j$ in each document $d_i$ of $D_1$ is calculated, giving the importance vector $\vec{d_i}$ of each document $d_i$:
$$\vec{d_i} = \big(w(t_1, d_i),\ w(t_2, d_i),\ \dots,\ w(t_n, d_i)\big)$$
where $i = 1,2,3,\dots,N$ and $j = 1,2,3,\dots,n$.
The importance score vector $\vec{I}$ of the expansion candidate words over all documents is calculated as
$$\vec{I} = \frac{1}{N}\sum_{i=1}^{N}\vec{d_i}$$
The importance scores in $\vec{I}$ are sorted from large to small, and the expansion candidate words corresponding to the $n_1$ largest scores are selected to form the importance query expansion word set $Q_1$. A polynomial $V_1$ is used to represent each word in $Q_1$ together with its importance score.
Step 2. All words in the documents of the pseudo-relevant document set $D_1$ are taken as expansion candidate words, and a kernel function is used, based on co-occurrence position and co-occurrence frequency, to calculate the relevance score $rel(t_j, Q \mid d_i)$ between each expansion candidate word $t_j$ and the query keywords $Q$ in document $d_i$, giving the relevance vector $\vec{d_i'}$ of each document $d_i$:
$$\vec{d_i'} = \big(rel(t_1, Q \mid d_i),\ rel(t_2, Q \mid d_i),\ \dots,\ rel(t_n, Q \mid d_i)\big)$$
where $i = 1,2,3,\dots,N$ and $j = 1,2,3,\dots,n$.
The relevance score vector $\vec{R}$ of the expansion candidate words over all documents is calculated as
$$\vec{R} = \frac{1}{N}\sum_{i=1}^{N}\vec{d_i'}$$
The relevance scores in $\vec{R}$ are sorted from large to small, and the expansion candidate words corresponding to the $n_1$ largest scores are selected to form the relevance query expansion word set $Q_1'$. A polynomial $V_1'$ is used to represent each word in $Q_1'$ together with its relevance score.
Step 3. The polynomials $V_1$ and $V_1'$ obtained in step 1 and step 2 are normalized and then linearly combined to obtain a new query term polynomial $V$:
$$V = (1-\gamma)\times\|V_1\| + \gamma\times\|V_1'\|$$
where $\|X\|$ denotes the normalization of the vector $X$, and $\gamma$ is an adjustment factor.
Step 4. The terms of the query term polynomial $V$ obtained in step 3 are sorted from large to small by coefficient, and the $n_1$ terms with the largest coefficients are taken out to obtain a new expansion word set.
Step 5. Let the query keywords $Q$ comprise the query words $q_s$, $s = 1,2,3,\dots,m$. $Q$ is represented as a polynomial $V_Q$ in which the coefficient of each query word is set to 1.0, and the expansion word set obtained in step 4 is represented as a polynomial $V'$.
The query polynomial $V_Q$ and the query expansion word polynomial $V'$ are normalized and then linearly combined to obtain a new query term polynomial $K$:
$$K = \alpha\times\|V_Q\| + \beta\times\|V'\|$$
where $\alpha$ and $\beta$ are adjustment factors.
Step 6. A new query keyword set $Q'$ is obtained from the query term polynomial $K$ of step 5. Using $Q'$ and the weight of each query word of $Q'$ in the polynomial $K$, a second information retrieval is performed with the preset retrieval weight model, and the query result obtained is the final information retrieval result.
In step 1, the importance score of each expansion candidate word is calculated using TFIDF, BM25 or RM3.
Furthermore, in step 2, the relevance score between each expansion candidate word $t_j$ and the query keywords $Q$ in document $d_i$ is calculated as follows.
Let $t_r$ and $q_s$ co-occur in a document $d_i$. The relevance of $t_r$ and $q_s$ in document $d_i$ is calculated from two quantities: the co-occurrence frequency of $t_r$ and $q_s$ in $d_i$, and the co-occurrence inverse document frequency of $t_r$ and $q_s$.
The relevance between $t_r$ and the query keywords $Q$ in document $d_i$ is then obtained from the relevance between $t_r$ and the query words $q_s$ of $Q$.
Furthermore, the co-occurrence frequency of $t_r$ and $q_s$ in document $d_i$, written $cotf(t_r, q_s \mid d_i)$, is calculated as
$$cotf(t_r, q_s \mid d_i) = \sum_{k1=1}^{M}\sum_{k2=1}^{L} Kernel\big(t_r^{k1}, q_s^{k2}\big)$$
where $M$ and $L$ are the numbers of occurrences of $t_r$ and $q_s$ in document $d_i$ respectively, $t_r^{k1}$ is the $k1$-th occurrence of $t_r$ in $d_i$, $q_s^{k2}$ is the $k2$-th occurrence of $q_s$ in $d_i$, $k1 = 1,2,3,\dots,M$, and $k2 = 1,2,3,\dots,L$. $Kernel(t_r^{k1}, q_s^{k2})$ expresses, through a kernel function, the positional proximity of $t_r^{k1}$ and $q_s^{k2}$.
Also, the kernel function is a Gaussian function or a trigonometric function.
Further, when the kernel function is a Gaussian function, it is calculated as
$$Kernel\big(t_r^{k1}, q_s^{k2}\big) = \exp\left[-\frac{(p_t - p_q)^2}{2\sigma^2}\right]$$
where $p_t$ and $p_q$ are the position values of $t_r^{k1}$ and $q_s^{k2}$ in the document, and $\sigma$ is a tuning parameter.
Furthermore, the co-occurrence inverse document frequency of $t_r$ and $q_s$ is calculated by a formula that uses the total number of co-occurrences of $t_r$ and $q_s$ in document $d_i$, counted when the condition given in the formula holds.
Moreover, the preset retrieval weight model is based on a vector space model, a probabilistic model or a language model.
The invention also provides an information retrieval system based on the pseudo-relevance feedback model, which comprises a computer or a server on which the above method is executed.
The information retrieval method provided by the invention, which fuses kernel-function-based word relevance information into the pseudo-relevance feedback model, overcomes the defect that traditional pseudo-relevance feedback models consider only word frequency information. In addition, because the relevance between query words and candidate words appearing at different positions in a document is calculated with a kernel function, the distribution of query words and candidate words is highlighted, candidate words that are more relevant to the query subject words can be selected, and the additional relevance information allows candidate words to be located more accurately, which improves the average precision of the expanded query and of the final retrieval. Comparative experiments on several international standard information retrieval evaluation data sets against several of the best international models show that the method achieves a significant improvement in retrieval accuracy and reaches an internationally leading level.
Drawings
Fig. 1 is a flowchart of a complete information retrieval process according to an embodiment of the present invention.
Detailed description of the invention
The core problem addressed by the invention is as follows: a kernel function is used to capture the distribution of the user's query words and the candidate words in a document, and the relevance between them; this relevance is fused into the pseudo-relevance feedback model as an additional weight, and query expansion is performed to improve retrieval accuracy.
The information retrieval method that fuses kernel-function-based word relevance into the pseudo-relevance feedback model is described in detail below with reference to the accompanying drawing and embodiments.
The classical methods rely on the unreasonable assumption that words are independent of one another; the invention instead provides a method that takes the relevance between words into account. By making effective use of statistical information about the data in the document set (such as context information and other information reflecting how words are collocated and used), and combining it with the query, a technical scheme is designed to obtain words that reflect the topic of the query and are triggered by the query, so that the user's information need is captured more accurately.
Kernel functions were originally used to project data that are not linearly separable in the original coordinate system into another space, so that the data become as linearly separable as possible in the new space. In the method of the invention, a kernel function is instead used to assess the degree of relatedness of two words in a document.
Referring to Fig. 1, the flow of the embodiment when a user performs retrieval with a query topic is as follows.
The information retrieval system builds a query index over the target document set. When a user submits a query topic, the system preprocesses it into the query keywords $Q$ ($Q$ is a set that generally comprises several subject words $q_1$, $q_2$, $q_3$, etc.). $D$ denotes the set of all target documents and $N_D$ the total number of documents in $D$. The retrieval system then calculates the score of the query keywords $Q$ against each document in $D$ using a preset retrieval weight model (e.g., TFIDF, BM25, RM3), and ranks the documents from high to low score to obtain the first query result. Following the principle of pseudo-relevance feedback, the retrieval system takes the top $N$ documents of the first query result as the pseudo-relevant document set $D_1$ (in much of the related research literature, $N$ is generally 10, 20 or 30); $N \le N_D$, and its value can be preset by those skilled in the art. Once the pseudo-relevant document set $D_1$ produced by the first query has been obtained, query expansion words are selected through the following steps.
Step 1. The importance score of every word (i.e., every expansion candidate word) in each document of the pseudo-relevant document set $D_1$ is calculated; the score can be obtained from the term frequency and inverse document frequency of the word (e.g., with TFIDF, BM25 or RM3). The importance scores of the same word in different documents are then accumulated in word-vector form and divided by the number $N$ of documents in $D_1$, which gives the importance score vector of all expansion candidate words. The elements of this vector are sorted from large to small, and the words corresponding to the $n_1$ largest scores are taken out ($n_1$ is typically 10, 20, 30 or 50, and can be preset by those skilled in the art) to obtain the importance expansion candidate word set $Q_1$. A polynomial $V_1$ is used to represent each word in $Q_1$ and its importance score.
In the invention, each of the $N$ documents in the pseudo-relevant document set $D_1$ is treated as a bag of words and expressed as a word vector. The importance vector of the $i$-th document is given by formula (1).
$$\vec{d_i} = \big(w(t_1, d_i),\ w(t_2, d_i),\ w(t_3, d_i),\ \dots,\ w(t_n, d_i)\big) \quad (1)$$
In formula (1), $\vec{d_i}$ is the word vector expression of the $i$-th document $d_i$ ($i = 1,2,3,\dots,N$) in the pseudo-relevant document set $D_1$; $t_1, t_2, t_3, \dots, t_n$ are all the words in all documents of $D_1$, and $n$ is the total number of these words; $w(t_1, d_i), w(t_2, d_i), \dots, w(t_n, d_i)$ are the weight scores of $t_1, t_2, \dots, t_n$ in document $d_i$ (the weight is the importance score, i.e., it represents the importance of the expansion candidate word). The importance score of a word is obtained from its term frequency and inverse document frequency (e.g., with TFIDF, BM25 or RM3). For example, when TFIDF is used, the importance of term $t_j$ in document $d_i$ is calculated by formula (2),
where $w(t_j, d_i)$ is the importance score of term $t_j$ in document $d_i$ ($j = 1,2,3,\dots,n$), $TF(t_j, d_i)$ is the frequency (number of occurrences) of term $t_j$ in document $d_i$, $N_D$ is the total number of documents in the target document set $D$, and $df(t_j)$ is the number of documents in the pseudo-relevant set $D_1$ that contain term $t_j$.
According to formula (2), each document $d_i$ of the $N$ documents can be expressed as the vector $\vec{d_i}$ of the importance scores of its words. The document vectors are summed and the sum is divided by the total number $N$ of pseudo-relevant documents, which gives the importance score vector $\vec{I}$ of all terms over all documents, as shown in formula (3):
$$\vec{I} = \frac{1}{N}\sum_{i=1}^{N}\vec{d_i} \quad (3)$$
The importance scores in $\vec{I}$ are sorted from large to small, and the words corresponding to the $n_1$ largest scores are selected to form the importance query expansion word set $Q_1$. For convenience in the later calculations, a polynomial $V_1$ is used to represent each word in $Q_1$ and its importance score, as shown in formula (4).
$$V_1 = wh_1\times qh_1 + wh_2\times qh_2 + wh_3\times qh_3 + \dots + wh_{n_1}\times qh_{n_1} \quad (4)$$
In formula (4), $qh_1, qh_2, qh_3, \dots, qh_{n_1}$ are the individual expansion candidate words in $Q_1$ ($n_1$ in total), and $wh_1, wh_2, wh_3, \dots, wh_{n_1}$ are the scores of the corresponding expansion candidate words in $\vec{I}$.
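Step 1 can be sketched in Python as follows. The sketch assumes the common TF-IDF form $TF(t_j, d_i)\times\log(N_D / df(t_j))$, which may differ from the exact variant intended in the embodiment, and the function name and the dictionary representation of the polynomial $V_1$ are illustrative choices rather than part of the described method.

```python
import math
from collections import Counter

def importance_polynomial(feedback_docs, n_target_docs, n1):
    """Return {word: averaged importance} for the n1 best words (polynomial V1)."""
    n = len(feedback_docs)
    # df(t): number of pseudo-relevant documents that contain t.
    df = Counter()
    for tokens in feedback_docs:
        df.update(set(tokens))
    # Accumulate the per-document importance scores word by word.
    totals = Counter()
    for tokens in feedback_docs:
        for term, tf in Counter(tokens).items():
            # Assumed TF-IDF variant: tf * log(N_D / df).
            totals[term] += tf * math.log(n_target_docs / df[term])
    # Divide by N, the number of pseudo-relevant documents.
    averaged = {term: score / n for term, score in totals.items()}
    # Keep the n1 largest scores; ties are broken alphabetically.
    top = sorted(averaged.items(), key=lambda kv: (-kv[1], kv[0]))[:n1]
    return dict(top)

v1 = importance_polynomial([["a", "a", "b"], ["a", "c"]], n_target_docs=100, n1=2)
```

Words absent from a document contribute a score of zero for that document, which matches summing fixed-length vectors over the whole vocabulary of $D_1$.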
Step 2. The relevance score between every word (i.e., every expansion candidate word) in each document of the pseudo-relevant document set $D_1$ and the query words is calculated in turn; the score is obtained by applying a kernel function to the positions of the query words and the expansion candidate words in each document. The scores of the same word in different documents are then accumulated to obtain the relevance score vector of all expansion candidate words with respect to the query words. The elements of this vector are sorted from large to small, and the words corresponding to the $n_1$ largest scores are taken out ($n_1$ is typically 10, 20, 30 or 50) to obtain the relevance expansion candidate word set $Q_1'$. A polynomial $V_1'$ is used to represent each word in $Q_1'$ and its relevance score.
For ease of explanation, consider an expansion candidate word $t_r$ and a query word $q_s$ (where $r = 1,2,3,\dots,n$, $n$ is the number of all words in the pseudo-relevant document set $D_1$, $s = 1,2,3,\dots,m$, and $m$ is the number of words in the query keyword set $Q$). If $t_r$ and $q_s$ co-occur in a document $d_i$, they have a co-occurrence weight (i.e., a relevance). Because $t_r$ and $q_s$ may each occur at several positions in a document, a single co-occurrence cannot simply be taken to represent the relevance of $t_r$ and $q_s$ in document $d_i$. To measure the relevance more reasonably, the invention provides formula (5), which gives the relevance of $t_r$ and $q_s$ in document $d_i$.
In formula (5), the co-occurrence frequency of $t_r$ and $q_s$ in document $d_i$, written $cotf(t_r, q_s \mid d_i)$, is calculated as follows:
$$cotf(t_r, q_s \mid d_i) = \sum_{k1=1}^{M}\sum_{k2=1}^{L} Kernel\big(t_r^{k1}, q_s^{k2}\big) \quad (6)$$
In formula (6), $M$ and $L$ are the numbers of occurrences of $t_r$ and $q_s$ in document $d_i$ respectively, $t_r^{k1}$ is the $k1$-th occurrence of $t_r$ in $d_i$, $q_s^{k2}$ is the $k2$-th occurrence of $q_s$ in $d_i$, $k1 = 1,2,3,\dots,M$, and $k2 = 1,2,3,\dots,L$. $Kernel()$ denotes a kernel function, a type of function that measures the proximity of two words from their position information: the closer the positions of two co-occurring words, the stronger their proximity, that is, the higher their relevance. Gaussian functions, trigonometric functions and similar functions are effective in many scenarios. The embodiment uses a Gaussian kernel (other kernel functions may be used in specific implementations) for $t_r^{k1}$ and $q_s^{k2}$, as in formula (7):
$$Kernel\big(t_r^{k1}, q_s^{k2}\big) = \exp\left[-\frac{(p_t - p_q)^2}{2\sigma^2}\right] \quad (7)$$
where $p_t$ and $p_q$ are the position values of $t_r^{k1}$ and $q_s^{k2}$ in the document (i.e., the ordinal position of the word in the document, a positive integer), and $\sigma$ is a tuning parameter that adjusts the spread of the Gaussian function. $\sigma$ preferably lies in the range 10 to 100, and is 50 in the specific embodiment.
In formula (5), the co-occurrence inverse document frequency of $t_r$ and $q_s$ is calculated by formula (8), which uses the total number of co-occurrences of $t_r$ and $q_s$ in document $d_i$, counted when the condition given in the formula holds.
Formula (5) gives the relevance of $t_r$ and $q_s$ in document $d_i$. Since $q_s$ is one query word of the query keyword set $Q$, the relevance between $t_r$ and the query keywords $Q$ as a whole in document $d_i$ can be obtained from formula (5); the invention denotes it $rel(t_r, Q \mid d_i)$ and calculates it by formula (9).
According to formula (9), the $i$-th document $d_i$ of the $N$ documents in the pseudo-relevant document set $D_1$ can be expressed as a vector $\vec{d_i'}$ of the relevance scores between the expansion candidate words and the query words, as in formula (10).
$$\vec{d_i'} = \big(rel(t_1, Q \mid d_i),\ rel(t_2, Q \mid d_i),\ \dots,\ rel(t_n, Q \mid d_i)\big) \quad (10)$$
Next, the relevance vectors of all documents are summed and the sum is divided by the total number $N$ of pseudo-relevant documents, which gives the relevance score vector $\vec{R}$ of all terms over all documents, as shown in formula (11):
$$\vec{R} = \frac{1}{N}\sum_{i=1}^{N}\vec{d_i'} \quad (11)$$
The relevance scores in $\vec{R}$ are sorted from large to small, and the words corresponding to the $n_1$ largest scores are selected to form the relevance query expansion word set $Q_1'$. For convenience in the later calculations, a polynomial $V_1'$ is used to represent each word in $Q_1'$ and its relevance score, as shown in formula (12).
$$V_1' = wh_1'\times qh_1' + wh_2'\times qh_2' + wh_3'\times qh_3' + \dots + wh_{n_1}'\times qh_{n_1}' \quad (12)$$
In formula (12), $qh_1', qh_2', qh_3', \dots, qh_{n_1}'$ are the individual expansion words in $Q_1'$ ($n_1$ in total), and $wh_1', wh_2', wh_3', \dots, wh_{n_1}'$ are the scores of the corresponding expansion words in $\vec{R}$.
Step 3. The query expansion word polynomials $V_1$ and $V_1'$ obtained in step 1 and step 2 are normalized and then linearly combined to obtain a new query term polynomial $V$, as shown in formula (13).
$$V = (1-\gamma)\times\|V_1\| + \gamma\times\|V_1'\| \quad (13)$$
In formula (13), $\|X\|$ denotes the normalization of the vector $X$. The purpose of normalization is to unify the scales, i.e., to map the value of each element of the vector into the interval [0, 1.0], which also makes subsequent parameter tuning easier. Normalization can be realized in many ways; this embodiment divides by the maximum, that is, the normalized value of each element is its original value divided by the largest element of the vector. For example, the vector [1, 2, 3, 4] has 4 elements and its largest element is 4, so dividing by the maximum gives [1/4, 2/4, 3/4, 4/4], i.e., [0.25, 0.5, 0.75, 1]. All values of the original vector are thus normalized into the interval [0, 1.0].
The adjustment factor $\gamma$ in formula (13) ranges from 0 to 1.0 and balances the importance score of an expansion word against the relevance score between the expansion word and the query words. In a specific application, the optimal value of $\gamma$ can be determined in advance by testing on the target document set with test data.
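The normalization and linear combination of formula (13) can be sketched in Python as follows. Polynomials are represented here as word-to-coefficient dictionaries, and a word that appears in only one polynomial is assumed to have coefficient 0 in the other; both choices, like the function names, are assumptions of this sketch.

```python
def normalize_by_max(poly):
    """Divide every coefficient by the largest coefficient of the polynomial."""
    if not poly:
        return {}
    peak = max(poly.values())
    if peak == 0:
        return {term: 0.0 for term in poly}
    return {term: weight / peak for term, weight in poly.items()}

def fuse_importance_and_relevance(v1, v1_rel, gamma):
    """V = (1 - gamma) * ||V1|| + gamma * ||V1'||, added term by term."""
    a = normalize_by_max(v1)
    b = normalize_by_max(v1_rel)
    return {
        term: (1.0 - gamma) * a.get(term, 0.0) + gamma * b.get(term, 0.0)
        for term in set(a) | set(b)
    }

v = fuse_importance_and_relevance(
    {"a": 2.0, "b": 1.0},   # importance polynomial V1
    {"b": 4.0, "c": 2.0},   # relevance polynomial V1'
    gamma=0.5,
)
```

Setting `gamma` to 0 reduces the result to the importance ranking alone, and setting it to 1 uses only the kernel-based relevance ranking.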
Step 4. The terms of the polynomial $V$ from step 3 are sorted from large to small by coefficient (the combined weight score), and the $n_1$ terms with the largest coefficients are taken out to obtain a new expansion word set, which is the final set of query expansion words.
Step 5. The original query keyword set $Q$ is expressed as a polynomial $V_Q$. Each term of $V_Q$ is a query word $q_s$ of $Q$ ($s = 1,2,3,\dots,m$), and the coefficient of each term is set to 1.0, so that
$$V_Q = 1.0\times q_1 + 1.0\times q_2 + 1.0\times q_3 + \dots + 1.0\times q_m \quad (14)$$
Then the expansion word set obtained in step 4 is likewise expressed as a polynomial $V'$. Each term of $V'$ is an expansion word of that set, and its coefficient is the coefficient of the same word in the polynomial $V$ of step 4:
$$V' = w_1'\times q_1' + w_2'\times q_2' + w_3'\times q_3' + \dots + w_{n_1}'\times q_{n_1}' \quad (15)$$
where $q_1', q_2', q_3', \dots, q_{n_1}'$ are the individual expansion words of the set ($n_1$ in total), and $w_1', w_2', w_3', \dots, w_{n_1}'$ are the scores of the corresponding expansion words in the query term polynomial $V$.
The query polynomial V_Q and the query expansion term polynomial V' are normalized and then linearly combined to obtain a new query term polynomial K, as shown in formula (16).
K = α×||V_Q|| + β×||V'||    formula (16)
Formula (16) uses the same normalization method as step 3. The adjustment factor α generally takes the fixed value 1.0, and the adjustment factor β ranges from 0 to 1.0; β balances the weights between the original query words and the expansion query words and can be set to an empirical value in a specific implementation.
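The combination K = α×||V_Q|| + β×||V'|| can be sketched as below. This is an illustrative reading, not the embodiment's code: polynomials are modeled as dicts from word to coefficient, the function names and sample data are hypothetical, and adding the two contributions when a word occurs in both the original query and the expansion set is an assumption that the text does not spell out.

```python
def normalize(poly):
    """Division-by-maximum normalization of a word -> coefficient mapping."""
    peak = max(poly.values(), default=0.0)
    if peak == 0:
        return dict(poly)
    return {word: coeff / peak for word, coeff in poly.items()}


def build_query_polynomial(query_words, expansion_poly, alpha=1.0, beta=0.5):
    """K = alpha * ||V_Q|| + beta * ||V'||, with every original term at 1.0."""
    v_q = {word: 1.0 for word in query_words}
    k = {}
    for word, coeff in normalize(v_q).items():
        k[word] = k.get(word, 0.0) + alpha * coeff
    for word, coeff in normalize(expansion_poly).items():
        # Assumption: a word present in both polynomials sums both parts.
        k[word] = k.get(word, 0.0) + beta * coeff
    return k


# Hypothetical query and expansion scores for illustration only.
k = build_query_polynomial(
    ["information", "retrieval"],
    {"retrieval": 2.0, "ranking": 1.0},
    alpha=1.0,
    beta=0.5,
)
print(k)  # {'information': 1.0, 'retrieval': 1.5, 'ranking': 0.25}
```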
Step 6: a new query keyword set Q' is obtained from step 5; the query words in Q' are the terms of the query term polynomial K. A second information retrieval is performed (with the same retrieval model as the first retrieval) using Q' and the weight of each of its query words in the polynomial K, i.e., the score of Q' against each document in the target document set D is calculated again, and the resulting query result is the final information retrieval result.
In the second retrieval, the query words are the newly generated query keyword set Q', and the weight of each query word when scoring it against each document is its coefficient in the query term polynomial K, whereas in the first retrieval the weight of every query word is 1.0.
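The difference between the two retrieval passes can be sketched with a toy weighted scorer. The TF-IDF style scoring below is only a stand-in chosen for brevity; the text leaves the retrieval model open, and the function name, the smoothing inside the logarithm, and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def rank_documents(query_weights, documents):
    """Rank documents by a weighted TF-IDF sum over the query words.

    query_weights: word -> weight (all 1.0 in the first pass,
                   coefficients of K in the second pass).
    documents:     doc_id -> list of tokens.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in documents.items():
        term_freq = Counter(tokens)
        score = 0.0
        for word, weight in query_weights.items():
            if term_freq[word]:
                # Assumed smoothing so that idf stays positive.
                idf = math.log(1 + n_docs / doc_freq[word])
                score += weight * term_freq[word] * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy collection.
docs = {
    "d1": ["query", "expansion", "query"],
    "d2": ["ranking", "model"],
    "d3": ["expansion"],
}
first_pass = rank_documents({"query": 1.0}, docs)
second_pass = rank_documents({"query": 1.0, "ranking": 0.5}, docs)
print([doc_id for doc_id, _ in second_pass])  # ['d1', 'd2', 'd3']
```

In this toy run, "d2" scores zero in the first pass and receives a positive score in the second pass only because the expansion word "ranking" entered the query with weight 0.5.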
In a specific implementation, a person skilled in the art can use software technology to run the above process automatically. Accordingly, an information retrieval system based on a pseudo-relevance feedback model, comprising a computer or server on which the above process is executed to fuse word relevance into the pseudo-relevance feedback model for information retrieval, also falls within the scope of protection of the present invention.
For example, the development environment for information retrieval may be a Java or Python development environment, with Lucene as the supporting library.
The information retrieval framework may be a pseudo-relevance feedback information retrieval framework based on a vector space model, a probabilistic model, a language model, or the like.
To verify the actual effect of the method, comparison experiments were carried out on several standard data sets. The experiments were divided into two groups: one used the standard Rocchio pseudo-relevance feedback information retrieval model, and the other used the Rocchio pseudo-relevance feedback information retrieval model combined with the method of the present invention, abbreviated as KRC. Six international standard data sets were used: AP88-89, AP90, DISK1&2, DISK4&5, WT2G and WT10G. Their basic information is shown in Table 1:
Data set    Number of documents    Size       Query topic numbers    Number of query topics
AP90        78,321                 0.23 GB    51-100                 50
AP88-89     164,597                0.50 GB    51-100                 50
DISK1&2     741,856                2.03 GB    51-200                 150
DISK4&5     528,155                1.85 GB    301-450                150
WT2G        247,491                2.14 GB    401-450                50
WT10G       1,692,096              10 GB      451-550                100
Table 1. Basic information of the six data sets
In the comparison experiments, a Gaussian kernel function was selected as the kernel function of the method of the present invention (other kernel functions may also be used), with σ = 50 in the Gaussian kernel. To make the comparison fairer, four values of the number n_1 of query expansion words were tested, namely 10, 20, 30 and 50; the experimental results for each case are shown in Table 2:
Table 2. Mean average precision (MAP) comparison of the Rocchio and KRC models on six standard data sets
In Table 2, the Rocchio model in the second column does not use the method of the present invention, the KRC model is the Rocchio model using the method of the present invention, and MAP is the mean average precision of the retrieval results. The comparison between the two models can be observed from the table.

Claims (9)

1. An information retrieval method based on a pseudo-relevance feedback model, characterized in that: word relevance is fused into the pseudo-relevance feedback model to realize information retrieval, comprising: when query expansion words are generated from the pseudo-relevant document set, separately generating query expansion words characterized by the importance of the candidate expansion words and query expansion words characterized by the relevance between the candidate expansion words and the query subject words, and then combining both kinds of query expansion words with the original query words to complete the final information retrieval; when the query expansion words characterized by the relevance between the candidate expansion words and the query subject words are generated, a kernel function is used to calculate the relevance between the query words and candidate words appearing at different positions in a document;
word relevance is fused into the pseudo-relevance feedback model to realize information retrieval in the following way,
when a user submits a query topic, the query topic is preprocessed to obtain the query keywords Q; D is the set of all target documents and N_D is the total number of documents in the target document set D; the score of the query keywords Q against each document in D is calculated with a preset retrieval weight model, and the documents are ranked from high to low score to obtain a first query result; the first N ranked documents of the target document set D are taken out as the pseudo-relevant document set D_1 in the pseudo-relevance feedback manner; query expansion words are then selected by the following steps,
step 1: all words in each document of the pseudo-relevant document set D_1 are taken as candidate expansion words, the importance score of each candidate expansion word t_j in each document d_i of D_1 is calculated, and the importance score vector of each document d_i is obtained as follows,
wherein i = 1, 2, 3, …, N and j = 1, 2, 3, …, N;
the importance score vector of the expansion candidate words over all documents is calculated as follows,
the importance score of each expansion candidate word is taken from this vector, the scores are sorted from large to small, and the expansion candidate words corresponding to the n_1 largest scores are selected to form the importance query expansion word set Q_1; a polynomial V_1 is used to represent each word in Q_1 together with its importance score;
step 2: all words in each document of the pseudo-relevant document set D_1 are taken as expansion candidate words, and a kernel function is used to calculate, from co-occurrence position and co-occurrence frequency, the relevance score of each expansion candidate word t_r with respect to the query keywords Q in document d_i, giving the relevance score vector of each document d_i as follows,
wherein i = 1, 2, 3, …, N and r = 1, 2, 3, …, N;
the relevance score vector of the expansion candidate words over all documents is calculated as follows,
the relevance score of each expansion candidate word is taken from this vector, the scores are sorted from large to small, and the expansion candidate words corresponding to the n_1 largest scores are selected to form the relevance query expansion word set Q'_1; a polynomial V_1' is used to represent each word in Q'_1 together with its relevance score;
step 3: the polynomials V_1 and V_1' obtained in step 1 and step 2 are normalized and then linearly combined to obtain a new query term polynomial V as follows,
V = (1-γ)×||V_1|| + γ×||V_1'||
wherein ||X|| represents the normalization operation on the vector X, and γ is an adjustment factor;
step 4: the terms of the query term polynomial V obtained in step 3 are sorted by coefficient from large to small, and the n_1 terms with the largest coefficients are taken out to obtain a new expansion word set;
step 5: let the query keywords Q comprise the query words q_s, s = 1, 2, 3, …, m; the query keywords Q are expressed as a polynomial V_Q in which the coefficient of each query word is set to 1.0; the expansion word set obtained in step 4 is expressed as a polynomial V',
the query polynomial V_Q and the query expansion term polynomial V' are normalized and then linearly combined to obtain a new query term polynomial K as follows,
K = α×||V_Q|| + β×||V'||
wherein α and β are adjustment factors;
step 6: a new query keyword set Q' is obtained from the query term polynomial K obtained in step 5, and a second information retrieval is performed with the preset retrieval weight model using Q' and the weight of each of its query words in the query term polynomial K; the resulting query result is the final information retrieval result.
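The first retrieval and the selection of the pseudo-relevant document set D_1 recited in claim 1 can be sketched as follows. The sketch is purely illustrative: `overlap_score` is a hypothetical stand-in for the preset retrieval weight model, and the function names and sample documents are assumptions.

```python
def overlap_score(query_words, tokens):
    """Hypothetical stand-in scorer: total occurrences of the query words."""
    return sum(tokens.count(word) for word in query_words)


def pseudo_relevant_set(query_words, documents, top_n, score_fn=overlap_score):
    """Rank all documents by score and return the ids of the top N.

    documents: doc_id -> list of tokens (the target document set D).
    The returned ids play the role of the pseudo-relevant set D_1.
    """
    ranked = sorted(
        documents,
        key=lambda doc_id: (-score_fn(query_words, documents[doc_id]), doc_id),
    )
    return ranked[:top_n]


# Hypothetical toy collection.
docs = {
    "d1": ["pseudo", "relevance", "feedback"],
    "d2": ["kernel", "function"],
    "d3": ["relevance", "feedback", "relevance"],
}
print(pseudo_relevant_set(["relevance", "feedback"], docs, top_n=2))
# ['d3', 'd1']
```

Passing the scorer as a parameter mirrors the claim's wording that the retrieval weight model is preset and can be swapped without changing the feedback steps.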
2. The information retrieval method based on the pseudo-relevance feedback model according to claim 1, wherein: in step 1, the importance score is obtained using TFIDF, BM25 or RM3.
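A minimal sketch of step 1 with TF-IDF as the importance measure is given below. Several choices are assumptions, because the corresponding formulas are not reproduced in this text: the particular TF-IDF variant (length-normalized term frequency times log inverse document frequency), summing the per-document scores to aggregate over the pseudo-relevant documents, and all function and variable names.

```python
import math
from collections import Counter


def importance_polynomial(pseudo_docs, doc_freq, n_collection, n1):
    """Return the n1 most important candidate words with their scores.

    pseudo_docs:  list of token lists, one per pseudo-relevant document.
    doc_freq:     word -> number of collection documents containing it.
    n_collection: total number of documents in the collection.
    """
    totals = Counter()
    for tokens in pseudo_docs:
        term_freq = Counter(tokens)
        for word, count in term_freq.items():
            idf = math.log(n_collection / doc_freq[word])
            # Per-document TF-IDF score; summed across documents (assumption).
            totals[word] += (count / len(tokens)) * idf
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:n1])


# Hypothetical pseudo-relevant documents and collection statistics.
pseudo_docs = [
    ["kernel", "proximity", "kernel", "the"],
    ["kernel", "the", "the", "feedback"],
]
doc_freq = {"kernel": 10, "proximity": 5, "the": 1000, "feedback": 20}
v1 = importance_polynomial(pseudo_docs, doc_freq, n_collection=1000, n1=2)
print(list(v1))  # ['kernel', 'proximity']
```

A word such as "the" that occurs in every collection document gets an inverse document frequency of 0 here and therefore never enters the expansion set, however frequent it is in the pseudo-relevant documents.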
3. The information retrieval method based on the pseudo-relevance feedback model according to claim 1, wherein: in step 2, the relevance score of each expansion candidate word t_r with respect to the query keywords Q in document d_i is calculated as follows,
let t_r and q_s co-occur in a document d_i; their relevance is calculated as follows,
wherein the terms of the formula are, respectively, the relevance of t_r and q_s in document d_i, the co-occurrence frequency of t_r and q_s in document d_i, and the co-occurrence inverse document frequency of t_r and q_s in document d_i;
the relevance of t_r to the query keywords Q in document d_i is then calculated as,
4. The information retrieval method based on the pseudo-relevance feedback model according to claim 3, wherein: the co-occurrence frequency of t_r and q_s in document d_i is calculated as follows,
wherein M and L represent the number of occurrences of t_r and q_s in document d_i, respectively; the formula ranges over the k1-th occurrence of t_r and the k2-th occurrence of q_s in document d_i, with k1 = 1, 2, 3, …, M and k2 = 1, 2, 3, …, L; and the kernel function expresses the positional proximity of these two occurrences.
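One plausible reading of claim 4 is a double sum of kernel values over every pair of occurrences of t_r and q_s in the document. The sketch below follows that reading, which is an assumption because the formula itself is not reproduced in this text; the function names and the inverse-distance kernel used in the example are likewise hypothetical stand-ins.

```python
def positions(tokens, word):
    """0-based positions at which `word` occurs in the token list."""
    return [index for index, token in enumerate(tokens) if token == word]


def cooccurrence_frequency(tokens, candidate, query_word, kernel):
    """Sum the kernel value over all pairs of occurrences.

    The outer loop runs over the M occurrences of the candidate word and
    the inner loop over the L occurrences of the query word.
    """
    candidate_positions = positions(tokens, candidate)
    query_positions = positions(tokens, query_word)
    return sum(
        kernel(p_t, p_q)
        for p_t in candidate_positions
        for p_q in query_positions
    )


def inverse_distance_kernel(p_t, p_q):
    """Hypothetical stand-in kernel: closer occurrences count more."""
    return 1.0 / (1 + abs(p_t - p_q))


tokens = ["a", "x", "q", "x"]
print(cooccurrence_frequency(tokens, "x", "q", inverse_distance_kernel))  # 1.0
```

Because the kernel decays with distance, a candidate word that appears right next to a query word contributes more than one that merely appears somewhere in the same document.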
5. The information retrieval method based on the pseudo-relevance feedback model according to claim 4, wherein: the kernel function is a Gaussian function or a trigonometric function.
6. The information retrieval method based on the pseudo-relevance feedback model according to claim 5, wherein: when the kernel function is a Gaussian function, it is calculated as follows,
wherein p_t and p_q respectively represent the positions in the document of the two occurrences being compared, and σ is a tuning parameter.
7. The information retrieval method based on the pseudo-relevance feedback model according to claim 4, wherein: the co-occurrence inverse document frequency of t_r and q_s in document d_i is calculated as follows,
wherein the quantity used in the formula is the total number of co-occurrences in document d_i under the condition stated in the formula.
8. The information retrieval method based on the pseudo-relevance feedback model according to any one of claims 1 to 7, wherein: the preset retrieval weight model is based on a vector space model, a probability model or a language model.
9. An information retrieval system based on a pseudo-relevance feedback model, characterized in that: it comprises a computer or server on which the method according to any one of claims 1 to 8 is performed.
CN201710370190.XA 2017-05-23 2017-05-23 A kind of information retrieval method and system based on pseudo-linear filter model Active CN107247745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710370190.XA CN107247745B (en) 2017-05-23 2017-05-23 A kind of information retrieval method and system based on pseudo-linear filter model

Publications (2)

Publication Number    Publication Date
CN107247745A (en)     2017-10-13
CN107247745B (en)     2018-07-03

Family

ID=60016912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710370190.XA Active CN107247745B (en) 2017-05-23 2017-05-23 A kind of information retrieval method and system based on pseudo-linear filter model

Country Status (1)

Country Link
CN (1) CN107247745B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062355B (en) * 2017-11-23 2020-07-31 华南农业大学 Query term expansion method based on pseudo feedback and TF-IDF
CN108520033B (en) * 2018-03-28 2020-01-24 华中师范大学 Enhanced pseudo-correlation feedback model information retrieval method based on hyperspace simulation language
CN108733745B (en) * 2018-03-30 2021-10-15 华东师范大学 Query expansion method based on medical knowledge
CN108921741A (en) * 2018-04-27 2018-11-30 广东机电职业技术学院 A kind of internet+foreign language expansion learning method
CN108897737A (en) * 2018-06-28 2018-11-27 中译语通科技股份有限公司 A kind of core vocabulary special topic construction method and system based on big data analysis
CN109189915B (en) * 2018-09-17 2021-10-15 重庆理工大学 Information retrieval method based on depth correlation matching model
CN109829104B (en) * 2019-01-14 2022-12-16 华中师范大学 Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN109918661B (en) * 2019-03-04 2023-05-30 腾讯科技(深圳)有限公司 Synonym acquisition method and device
CN110442777B (en) * 2019-06-24 2022-11-18 华中师范大学 BERT-based pseudo-correlation feedback model information retrieval method and system
CN111737413A (en) * 2020-05-26 2020-10-02 湖北师范大学 Feedback model information retrieval method, system and medium based on concept net semantics
CN111723179B (en) * 2020-05-26 2023-07-07 湖北师范大学 Feedback model information retrieval method, system and medium based on conceptual diagram
CN111625624A (en) * 2020-05-27 2020-09-04 湖北师范大学 Pseudo-correlation feedback information retrieval method, system and storage medium based on BM25+ ALBERT model
CN112307182B (en) * 2020-10-29 2022-11-04 上海交通大学 Question-answering system-based pseudo-correlation feedback extended query method
CN112988977A (en) * 2021-04-25 2021-06-18 成都索贝数码科技股份有限公司 Fuzzy matching media asset content library retrieval method based on approximate words
CN116933766B (en) * 2023-06-02 2024-08-16 盐城工学院 Ad-hoc information retrieval model based on triple word frequency scheme

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324707A (en) * 2013-06-18 2013-09-25 哈尔滨工程大学 Query expansion method based on semi-supervised clustering
US9411886B2 (en) * 2008-03-31 2016-08-09 Yahoo! Inc. Ranking advertisements with pseudo-relevance feedback and translation models
CN105975596A (en) * 2016-05-10 2016-09-28 上海珍岛信息技术有限公司 Query expansion method and system of search engine

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678412B (en) * 2012-09-21 2016-12-21 北京大学 A kind of method and device of file retrieval

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
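Query Dependent Pseudo Relevance Feedback based on Wikipedia; Xu Y et al.; ACM; 2009-07-23; full text *
Patent retrieval and analysis supporting technological innovation (支持技术创新的专利检索与分析); Liu Bin (刘斌); 《通讯学报》; 2016-03-31; Vol. 37, No. 3; p. 81 *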

Also Published As

Publication number Publication date
CN107247745A (en) 2017-10-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant