CN107247745B

CN107247745B - Information retrieval method and system based on a pseudo-relevance feedback model

Info

Publication number: CN107247745B
Application number: CN201710370190.XA
Authority: CN
Inventors: 何婷婷; 潘敏; 简芳洪; 毛智明
Original assignee: Huazhong Normal University
Current assignee: Huazhong Normal University
Priority date: 2017-05-23
Filing date: 2017-05-23
Publication date: 2018-07-03
Anticipated expiration: 2037-05-23
Also published as: CN107247745A

Abstract

The invention provides an information retrieval method based on a pseudo-relevance feedback model, in which word relevance is fused into the pseudo-relevance feedback model to realize information retrieval. When query expansion words are generated from the pseudo-relevant document set, two sets are generated separately: query expansion words characterized by the importance of the candidate expansion words, and query expansion words characterized by the relevance between the candidate expansion words and the query subject words. The two sets are combined and then merged with the original query words to complete the final information retrieval. When the expansion words characterized by relevance to the query subject words are generated, a kernel function is used to calculate the relevance between query words and candidate words appearing at different positions in a document. The invention highlights the distribution of query words and candidate words, selects candidate words with higher relevance to the query subject words, and uses this additional relevance information to locate candidate words more accurately, improving the precision of the expanded query and of the final retrieval.

Description

Information retrieval method and system based on a pseudo-relevance feedback model

Technical Field

The invention belongs to the technical field of information retrieval, and particularly relates to an information retrieval method and system that fuse kernel-function word relevance into a pseudo-relevance feedback model.

Background

In an era of ever more intense competition for information, browsing and acquiring information through a search engine has become an important part of daily life. However, network resources are extremely rich and the total amount of information is expanding rapidly, so it is difficult for users to acquire and identify important information efficiently and accurately, and information processing technology urgently needs more effective theories and methods for handling increasingly large amounts of data. Information retrieval, as a classic text processing technique, can meet this requirement and is rapidly becoming a research focus in the field of information processing.

Information retrieval refers to the process and technique of organizing information in a certain way and finding relevant information according to the needs of information users. The process can be described simply as follows: according to an information need, a user composes a query string and submits it to an information retrieval system, which retrieves from a document set the subset of documents related to the query and returns it to the user. Specifically, given a group of query topics, an information retrieval model calculates the relevance of every document in the target set to each query topic and returns the documents in descending order of score; the earlier a document appears in the returned result, the more relevant it is to the query topic. Over the last half century of research, a number of effective information retrieval models have been proposed and gradually applied in real systems. The more influential retrieval models include Boolean logic models, vector space models, probabilistic models, language models, and the more recently proposed retrieval models based on supervised learning.

In practical information retrieval applications, there is a certain deviation between the user's query request and the results returned by the system, which reduces the performance of the retrieval system. Information retrieval is therefore often an iterative process, and users often need to adjust their queries several times to obtain satisfactory results. By expanding and reconstructing the user's initial query, query expansion addresses both the mismatch between the terms used in the query and the terms used in documents and the incompleteness of the user's expression, and it is therefore widely used in information retrieval. In brief, query expansion means that before retrieval is performed, the system automatically expands the keywords in the user's query with synonyms or near-synonyms according to an expansion word list to form a new query, and then performs retrieval with that query.

Pseudo-relevance feedback was introduced to make retrieval systems more effective and to make retrieval results better satisfy the user's query request. Its main mechanism is that the system assumes by default that its own initial results contain a large number of documents relevant to the user's query topic, and takes the first N documents as relevant documents with which to adjust or expand the query.
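The selection of the first N documents can be sketched as follows. This is only an illustration: the function name, the document ids and the first-pass scores are hypothetical, and the scores are assumed to have been produced already by some retrieval model.

```python
def pseudo_relevant_set(scores, n):
    """Return the ids of the n highest-scoring documents of a first-pass retrieval."""
    # Rank documents by score, highest first
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:n]]


# Hypothetical first-pass scores: document id -> retrieval score
first_pass = {"d1": 2.4, "d2": 0.3, "d3": 1.7, "d4": 0.9}
top_docs = pseudo_relevant_set(first_pass, 2)
print(top_docs)
```

The documents returned here play the role of the "relevant" documents, even though no user has actually judged them.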

Generally, many factors affect the performance of a retrieval system, and the most critical of them is the information retrieval strategy, including the representation of documents and queries, the matching strategy for evaluating the relevance of documents to queries, the ranking method for query results, and the mechanism by which users give relevance feedback.

With the development of the high-speed Internet, information has accumulated in enormous quantities, and the accuracy of information search has become the primary concern of users. It is becoming more and more difficult to find what one wants through an information retrieval tool, and at the same time the flood of information forces users to spend more time judging which information is valuable to them. Existing information retrieval methods generally suffer from low average retrieval precision; even the best retrieval models at present reach an average precision of only 30%, so there is still a long way to go in improving retrieval precision. Information retrieval has penetrated every aspect of human life, and most people use search tools such as Baidu and Google every day to find the data they need and solve practical problems. In 2010, the number of Chinese web search requests exceeded 60 billion, and in 2016 Baidu received as many as 6 billion search requests in a single day. At such a scale, every percentage point of improvement in average search precision saves a large amount of the time and effort spent acquiring information, and its value is considerable. Large Internet companies also continue to pursue lower-cost and more efficient information retrieval technologies.

Disclosure of Invention

The problem to be solved by the invention is how to optimize query expansion so as to ultimately improve the average retrieval precision.

The invention provides an information retrieval method based on a pseudo-relevance feedback model, which fuses word relevance into the pseudo-relevance feedback model to realize information retrieval. When query expansion words are generated from the pseudo-relevant document set, query expansion words characterized by the importance of the candidate expansion words and query expansion words characterized by the relevance between the candidate expansion words and the query subject words are generated separately; the two are then combined and merged with the original query words to complete the final information retrieval. When the query expansion words characterized by the relevance between the candidate expansion words and the query subject words are generated, a kernel function is used to calculate the relevance between the query words and the candidate words appearing at different positions in a document.

Moreover, the word relevance is fused into the pseudo-relevance feedback model to realize information retrieval as follows,

when a user submits a query topic, the query topic is preprocessed to obtain the query keywords Q; D is the set of all target documents and N_D is the total number of documents in the target document set D; the score of the query keywords Q against each document in the target document set D is calculated with a preset retrieval weight model, and the documents are ranked from high to low by score to obtain a first query result; following the pseudo-relevance feedback approach, the first N documents of the first query result are taken as the pseudo-relevant document set D₁, and the query expansion words are selected by the following steps,

step 1, all the words in each document of the pseudo-relevant document set D₁ are taken as candidate expansion words, and the importance score $w(t_j, d_i)$ of each candidate expansion word t_j in each document d_i of D₁ is calculated, giving the importance vector $\vec{d_i}$ of each document d_i as follows,

$$\vec{d_i} = \left( w(t_1, d_i),\ w(t_2, d_i),\ \ldots,\ w(t_n, d_i) \right)$$

wherein i = 1, 2, 3, …, N and j = 1, 2, 3, …, n;

the importance score vector $\vec{W}_{D_1}$ of the expansion candidate words over all documents is calculated as follows,

$$\vec{W}_{D_1} = \frac{1}{N} \sum_{i=1}^{N} \vec{d_i}$$

the importance scores of the expansion candidate words in $\vec{W}_{D_1}$ are sorted from large to small, and the expansion candidate words corresponding to the n₁ largest scores are selected to form the importance query expansion word set Q₁; a polynomial V₁ is used to represent each word in Q₁ together with its importance score;

step 2, all the words in each document of the pseudo-relevant document set D₁ are taken as expansion candidate words, and a kernel function is used to calculate, from co-occurrence position and co-occurrence frequency, the relevance score $rel(t_j, Q \mid d_i)$ of each expansion candidate word t_j with the query keywords Q in document d_i, giving the relevance vector $\vec{d_i}'$ of each document d_i as follows,

$$\vec{d_i}' = \left( rel(t_1, Q \mid d_i),\ rel(t_2, Q \mid d_i),\ \ldots,\ rel(t_n, Q \mid d_i) \right)$$

wherein i = 1, 2, 3, …, N and j = 1, 2, 3, …, n;

the relevance score vector $\vec{W}'_{D_1}$ of the expansion candidate words over all documents is calculated as follows,

$$\vec{W}'_{D_1} = \frac{1}{N} \sum_{i=1}^{N} \vec{d_i}'$$

the relevance scores of the expansion candidate words in $\vec{W}'_{D_1}$ are sorted from large to small, and the expansion candidate words corresponding to the n₁ largest scores are selected to form the relevance query expansion word set Q₁'; a polynomial V₁' is used to represent each word in Q₁' together with its relevance score;

step 3, the polynomials V₁ and V₁' obtained in step 1 and step 2 are normalized and then linearly combined to obtain a new query term polynomial V as follows,

V＝(1-γ)×||V₁||+γ×||V₁'||

wherein ||X|| represents the normalization operation on the vector X, and γ is an adjustment factor;

step 4, the terms of the query term polynomial V obtained in step 3 are sorted from large to small by coefficient, and the n₁ terms with the largest coefficients are taken out to obtain a new expansion word set;

step 5, let the query keywords Q comprise the query words q_s, s = 1, 2, 3, …, m; Q is represented as a polynomial V_Q in which the coefficient of each query word is set to 1.0; the expansion word set obtained in step 4 is represented by a polynomial V';

the query polynomial V_Q and the query expansion word polynomial V' are normalized and then linearly combined to obtain a new query term polynomial K as follows,

K＝α×||V_Q||+β×||V'||

wherein α and β are adjustment factors;

step 6, a new query keyword set Q' is obtained from the query term polynomial K obtained in step 5; using Q' and the weight of each query word of Q' in the query term polynomial K, a second information retrieval is performed with the preset retrieval weight model, and the query result obtained is the final information retrieval result.

Moreover, in step 1, the importance score of a candidate expansion word in a document is obtained using TFIDF, BM25 or RM3.

Furthermore, in step 2, the relevance score of each expansion candidate word t_j with the query keywords Q in document d_i is calculated as follows.

Let the co-occurrence of t_r and q_s in a document d_i be denoted ⟨t_r, q_s⟩. The relevance of t_r and q_s in document d_i is calculated from two quantities: the co-occurrence frequency of ⟨t_r, q_s⟩ in document d_i and the co-occurrence inverse document frequency of ⟨t_r, q_s⟩;

from these pairwise relevances, the relevance of t_r with the query keywords Q in document d_i is obtained.

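Furthermore, the co-occurrence frequency of the pair ⟨t_r, q_s⟩ in document d_i is calculated as follows,

$$cof(t_r, q_s \mid d_i) = \sum_{k1=1}^{M} \sum_{k2=1}^{L} Kernel\left(t_r^{k1}, q_s^{k2}\right)$$

wherein M and L respectively represent the numbers of occurrences of t_r and q_s in document d_i, $t_r^{k1}$ represents the k1-th occurrence of t_r in document d_i, $q_s^{k2}$ represents the k2-th occurrence of q_s in document d_i, k1 = 1, 2, 3, …, M, k2 = 1, 2, 3, …, L; the kernel function $Kernel(t_r^{k1}, q_s^{k2})$ expresses the proximity of the positions of $t_r^{k1}$ and $q_s^{k2}$.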

Also, the kernel function is a gaussian function or a trigonometric function.

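Further, when the kernel function is a Gaussian function, it is calculated as follows,

$$Kernel\left(t_r^{k1}, q_s^{k2}\right) = \exp\left( -\frac{(p_t - p_q)^2}{2\sigma^2} \right)$$

wherein p_t and p_q respectively represent the position values of $t_r^{k1}$ and $q_s^{k2}$ in the document, and σ is a tuning parameter.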

Furthermore, the co-occurrence inverse document frequency of the pair ⟨t_r, q_s⟩ is calculated from the total number of co-occurrences of ⟨t_r, q_s⟩ in each document d_i.

Moreover, the preset retrieval weight model is based on a vector space model, a probabilistic model or a language model.

The invention also provides an information retrieval system based on the pseudo-relevance feedback model, which comprises a computer or a server on which the above method is executed.

The information retrieval method provided by the invention, which fuses kernel-function word relevance information into the pseudo-relevance feedback model, overcomes the defect that traditional pseudo-relevance feedback models consider only term frequency information. In addition, because the relevance between query words and candidate words appearing at different positions in a document is calculated with a kernel function, the distribution of query words and candidate words is highlighted, candidate words with higher relevance to the query subject words can be selected, and the additional relevance information allows candidate words to be located more accurately, improving the average precision of the expanded query and of the final retrieval. Comparative experiments against several of the best international models on several international standard information retrieval evaluation data sets show that the method achieves a marked improvement in retrieval precision and reaches an internationally leading level.

Drawings

Fig. 1 is a flowchart of a complete information retrieval process according to an embodiment of the present invention.

Detailed description of the invention

The core problem solved by the invention is as follows: a kernel function is used to reflect the distribution of the user's query words and the candidate words in a document and the relevance between them; this relevance is fused into the pseudo-relevance feedback model as an additional weight, and query expansion is thereby realized to improve retrieval precision.

The information retrieval method that fuses kernel-function word relevance into the pseudo-relevance feedback model is described in detail below with reference to the accompanying drawing and embodiments.

Addressing the unreasonable term-independence assumption of classical methods, the invention provides a method that takes the relevance between words into account. By making effective use of statistical information in the document set (such as context information and other information reflecting word collocation and usage) in combination with the query, the technical scheme obtains words that reflect the topic of the query and are triggered by it; that is, this information is used to capture the user's information need more accurately.

The kernel function adopted in the method was originally used to project data that are linearly inseparable in the original coordinate system into another space, so that the data become as linearly separable as possible in the new space. In the method of the invention, it is used to assess the degree of relatedness of two words in a document.

Referring to Fig. 1, the flow of the embodiment when a user performs retrieval on a query topic is as follows:

The information retrieval system builds a query index from the target document set. When a user submits a query topic, the system preprocesses it into the query keywords Q (Q is a set and generally comprises several topic words q₁, q₂, q₃, etc.). D is the set of all target documents and N_D is the total number of documents in the target document set D. The retrieval system then calculates the score of the query keywords Q against each document in D using a preset retrieval weight model (e.g., TFIDF, BM25, RM3, etc.) and ranks the scores from high to low to obtain the first query result. Following the principle of pseudo-relevance feedback, the retrieval system takes the first N documents of the first query result as the pseudo-relevant document set D₁ (in the research literature, N is generally 10, 20 or 30); N is less than or equal to N_D, and its value can be preset by those skilled in the art. After obtaining the pseudo-relevant document set D₁ generated by the first query, the retrieval system selects the query expansion words by the following steps,

Step 1, the importance score of every word (i.e., every expansion candidate word) in each document of the pseudo-relevant document set D₁ is calculated. The scores can be obtained from the term frequency and inverse document frequency of the words (e.g., TFIDF, BM25, RM3, etc.). The importance scores of the same word in different documents are then accumulated as word vectors and divided by the number N of documents in D₁ to obtain the importance score vector $\vec{W}_{D_1}$ of all expansion candidate words. The elements of the vector are sorted from large to small, and the words corresponding to the top n₁ scores in $\vec{W}_{D_1}$ are taken out (n₁ is typically 10, 20, 30 or 50 and can be preset by those skilled in the art) to obtain the importance expansion candidate word set Q₁. A polynomial V₁ is used to represent each word in Q₁ and its importance score.

In the invention, each of the N documents in the pseudo-relevant document set D₁ is treated as a bag of words and expressed as a word vector; the importance vector of the i-th document is given by formula (1).

$$\vec{d_i} = \left( w(t_1, d_i),\ w(t_2, d_i),\ w(t_3, d_i),\ \ldots,\ w(t_n, d_i) \right) \qquad (1)$$

In formula (1), $\vec{d_i}$ is the word vector of the i-th document d_i (i = 1, 2, 3, …, N) in the pseudo-relevant document set D₁; t₁, t₂, t₃, …, t_n are all the words in all documents of D₁, and n is the total number of these words; $w(t_1, d_i), w(t_2, d_i), \ldots, w(t_n, d_i)$ are the weight scores of t₁, t₂, t₃, …, t_n in document d_i (the weight score is the importance score; the weight expresses the importance of the expansion candidate word). The importance score of a word is obtained from its term frequency and inverse document frequency (e.g., TFIDF, BM25, RM3, etc.). For example, when TFIDF is used to calculate the importance of term t_j in document d_i (formula (2)), the following quantities are involved:

$w(t_j, d_i)$ is the importance score of term t_j in document d_i (j = 1, 2, 3, …, n), TF(t_j, d_i) is the frequency (number of occurrences) of term t_j in document d_i, N_D is the total number of documents in the target document set D, and df(t_j) is the number of documents in the pseudo-relevant set D₁ that contain term t_j.

According to formula (2), each document d_i of the N documents can be expressed as a vector $\vec{d_i}$ of the importance of its words. The document vectors are summed and divided by the total number N of pseudo-relevant documents to obtain the importance score vector $\vec{W}_{D_1}$ of all terms over all documents, as shown in formula (3):

$$\vec{W}_{D_1} = \frac{1}{N} \sum_{i=1}^{N} \vec{d_i} \qquad (3)$$

The importance scores of the words in $\vec{W}_{D_1}$ are sorted from large to small, and the words corresponding to the n₁ largest scores are selected to form the importance query expansion word set Q₁. For convenience in later calculations, a polynomial V₁ is used to represent each word in Q₁ and its importance score, as shown in formula (4).

$$V_1 = wh_1 \times qh_1 + wh_2 \times qh_2 + wh_3 \times qh_3 + \cdots + wh_{n_1} \times qh_{n_1} \qquad (4)$$

In formula (4), $qh_1, qh_2, qh_3, \ldots, qh_{n_1}$ are the expansion candidate words in Q₁ (n₁ in total), and $wh_1, wh_2, wh_3, \ldots, wh_{n_1}$ are the scores of the corresponding expansion candidate words in $\vec{W}_{D_1}$.
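Step 1 can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation: the TF × log(N_D / df) weighting is an assumed standard TF-IDF form (the exact formula is not legible in this text), df is counted over the pseudo-relevant set as the description states, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def importance_expansion(pseudo_docs, n_total_docs, n1):
    """Average per-document TF-IDF scores over D1 and keep the n1 best words.

    pseudo_docs: list of token lists, one per pseudo-relevant document (D1).
    n_total_docs: N_D, the size of the whole target collection.
    Returns a dict word -> averaged importance score (the polynomial V1).
    """
    n = len(pseudo_docs)
    # df(t): number of pseudo-relevant documents that contain t
    df = Counter()
    for doc in pseudo_docs:
        df.update(set(doc))
    totals = Counter()
    for doc in pseudo_docs:
        for term, freq in Counter(doc).items():
            # ASSUMPTION: standard TF x log(N_D / df) weighting
            totals[term] += freq * math.log(n_total_docs / df[term])
    # Average over the N pseudo-relevant documents
    averaged = {term: score / n for term, score in totals.items()}
    # Highest score first; ties broken alphabetically for determinism
    top = sorted(averaged.items(), key=lambda kv: (-kv[1], kv[0]))[:n1]
    return dict(top)


# Hypothetical pseudo-relevant set of two tiny documents
docs = [["kernel", "query", "kernel"], ["query", "feedback"]]
v1 = importance_expansion(docs, n_total_docs=100, n1=2)
print(v1)
```

A dictionary is used for V₁ because each term of the polynomial is simply a word paired with its coefficient.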

Step 2, the relevance score between every word (i.e., every expansion candidate word) in each document of the pseudo-relevant document set D₁ and the query words is calculated in turn. The scores are obtained by applying a kernel function to the positions of the query words and the expansion candidate words in each document. The scores of the same word in different documents are then accumulated to obtain the relevance score vector $\vec{W}'_{D_1}$ of all expansion candidate words with respect to the query words. The elements of the vector are sorted from large to small, and the words corresponding to the top n₁ scores in $\vec{W}'_{D_1}$ are taken out (n₁ is typically 10, 20, 30 or 50) to obtain the relevance expansion candidate word set Q₁'. A polynomial V₁' is used to represent each word in Q₁' and its relevance score.

For ease of explanation, consider an expansion candidate word t_r and a query word q_s (where r = 1, 2, 3, …, n, n is the number of all words in the pseudo-relevant document set D₁, s = 1, 2, 3, …, m, and m is the number of words in the query keyword set Q). If t_r and q_s co-occur in a document d_i, the co-occurrence is denoted ⟨t_r, q_s⟩, and the pair has a co-occurrence weight (i.e., a degree of relevance). Because t_r and q_s may each occur at multiple positions in a document, the mere presence of ⟨t_r, q_s⟩ cannot represent the relevance of t_r and q_s in document d_i. To measure the relevance more reasonably, the invention provides formula (5), which is built from a co-occurrence frequency and a co-occurrence inverse document frequency:

in formula (5), $rel(t_r, q_s \mid d_i)$ represents the relevance of t_r and q_s in document d_i.

In formula (5), $cof(t_r, q_s \mid d_i)$ represents the co-occurrence frequency of ⟨t_r, q_s⟩ in document d_i, calculated as follows:

$$cof(t_r, q_s \mid d_i) = \sum_{k1=1}^{M} \sum_{k2=1}^{L} Kernel\left(t_r^{k1}, q_s^{k2}\right) \qquad (6)$$

In formula (6), M and L respectively represent the numbers of occurrences of t_r and q_s in document d_i, $t_r^{k1}$ represents the k1-th occurrence of t_r in document d_i, $q_s^{k2}$ represents the k2-th occurrence of q_s in document d_i, k1 = 1, 2, 3, …, M, and k2 = 1, 2, 3, …, L. Kernel() denotes a kernel function, a class of functions that measure the proximity of two words from their positions: the closer the positions of two co-occurring words, the stronger their proximity relationship, i.e., the higher their relevance. Kernel functions such as Gaussian functions and trigonometric functions are very effective in many scenarios. In the embodiment, a Gaussian kernel is used to express the proximity of $t_r^{k1}$ and $q_s^{k2}$ (other kernel functions may be used in specific implementations), as in formula (7):

$$Kernel\left(t_r^{k1}, q_s^{k2}\right) = \exp\left( -\frac{(p_t - p_q)^2}{2\sigma^2} \right) \qquad (7)$$

wherein p_t and p_q respectively represent the position values of $t_r^{k1}$ and $q_s^{k2}$ in the document (i.e., the ordinal position of the word in the document, a positive integer), and σ is a tuning parameter that adjusts the spread of the Gaussian function; σ preferably lies in the range 10 to 100 and is 50 in the specific embodiment.

In formula (5), the remaining quantity is the co-occurrence inverse document frequency of ⟨t_r, q_s⟩, whose calculation is given by formula (8).

Formula (5) gives the relevance $rel(t_r, q_s \mid d_i)$ of t_r and q_s in document d_i. Since q_s is a query word in the query keyword set Q, the relevance of t_r with the query keywords Q in document d_i, denoted $rel(t_r, Q \mid d_i)$ in the invention, can be obtained from formula (5); the specific calculation is given by formula (9).

According to formula (9), the i-th document d_i of the N documents in the pseudo-relevant document set D₁ can be expressed as a relevance vector $\vec{d_i}'$ between the expansion candidate words and the query words, as follows.

$$\vec{d_i}' = \left( rel(t_1, Q \mid d_i),\ rel(t_2, Q \mid d_i),\ \ldots,\ rel(t_n, Q \mid d_i) \right) \qquad (10)$$

Next, the document relevance vectors $\vec{d_i}'$ are summed and divided by the total number N of pseudo-relevant documents to obtain the relevance score vector $\vec{W}'_{D_1}$ of all terms over all documents, as shown in formula (11):

$$\vec{W}'_{D_1} = \frac{1}{N} \sum_{i=1}^{N} \vec{d_i}' \qquad (11)$$

The relevance scores of the words in $\vec{W}'_{D_1}$ are sorted from large to small, and the words corresponding to the n₁ largest scores are selected to form the relevance query expansion word set Q₁'. For convenience in later calculations, a polynomial V₁' is used to represent each word in Q₁' and its relevance score, as shown in formula (12).

$$V_1' = wh_1' \times qh_1' + wh_2' \times qh_2' + wh_3' \times qh_3' + \cdots + wh_{n_1}' \times qh_{n_1}' \qquad (12)$$

In formula (12), $qh_1', qh_2', qh_3', \ldots, qh_{n_1}'$ are the expansion words in Q₁' (n₁ in total), and $wh_1', wh_2', wh_3', \ldots, wh_{n_1}'$ are the scores of the corresponding expansion words in $\vec{W}'_{D_1}$.

Step 3, the query expansion word polynomials V₁ and V₁' obtained in step 1 and step 2 are normalized and then linearly combined to obtain a new query term polynomial V, as shown in formula (13).

V＝(1-γ)×||V₁||+γ×||V₁'||    formula (13)

In formula (13), ||X|| denotes normalization of the vector X. The purpose of normalization is to unify scales, i.e., to map the value of each element of the vector into the interval [0, 1.0], which also facilitates subsequent parameter tuning. Normalization can be realized in many ways; this embodiment divides by the maximum, i.e., the normalized value of each element is its original value divided by the largest element in the vector. For example, the vector [1, 2, 3, 4] has 4 elements and a maximum value of 4; dividing by the maximum gives [1/4, 2/4, 3/4, 4/4], i.e., [0.25, 0.5, 0.75, 1], so all values of the original vector are normalized into the interval [0, 1.0].

The adjustment factor γ in formula (13) ranges from 0 to 1.0 and balances the importance score of an expansion word against its relevance score with the query words. In a specific application, the optimal value of γ can be determined in advance with test data on the target document set to which the method is to be applied.
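The normalization and combination of step 3 can be sketched as follows. Polynomials are held as dictionaries from word to coefficient; treating a word that appears in only one of the two polynomials as having coefficient 0 in the other is an assumption of this sketch, and the function names and sample values are hypothetical.

```python
def normalize_by_max(poly):
    """Divide every coefficient by the largest one so that values fall in [0, 1.0]."""
    if not poly:
        return {}
    peak = max(poly.values())
    if peak == 0:
        return dict(poly)
    return {term: weight / peak for term, weight in poly.items()}


def combine(v1, v1_prime, gamma):
    """V = (1 - gamma) * ||V1|| + gamma * ||V1'||."""
    a, b = normalize_by_max(v1), normalize_by_max(v1_prime)
    # ASSUMPTION: a word missing from one polynomial contributes 0 there
    return {t: (1 - gamma) * a.get(t, 0.0) + gamma * b.get(t, 0.0)
            for t in set(a) | set(b)}


# The worked example from the text: [1, 2, 3, 4] -> [0.25, 0.5, 0.75, 1.0]
example = normalize_by_max({"a": 1, "b": 2, "c": 3, "d": 4})
print(example)

# Hypothetical importance and relevance polynomials, gamma = 0.5
v = combine({"kernel": 4.0, "query": 2.0}, {"query": 0.9, "rank": 0.3}, gamma=0.5)
print(v)
```

Setting γ to 0 reduces V to the importance polynomial alone, and setting it to 1.0 uses only the kernel-based relevance scores.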

Step 4, the terms of the polynomial V obtained in step 3 are sorted from large to small by coefficient (the combined weight score), and the n₁ terms with the largest coefficients are taken out to obtain a new expansion word set, which is the final set of query expansion words.

Step 5, the original query keyword set Q is expressed as a polynomial V_Q. Each term of V_Q is a query word q_s in Q, where s = 1, 2, 3, …, m, and the coefficient of each term is set to 1.0, so that

$$V_Q = 1.0 \times q_1 + 1.0 \times q_2 + 1.0 \times q_3 + \cdots + 1.0 \times q_m \qquad (14)$$

Then the expansion word set obtained in step 4 is also expressed as a polynomial V'. Each term of V' is a word of the expansion word set, and its coefficient is the coefficient of that term in the polynomial V of step 4,

$$V' = w_1' \times q_1' + w_2' \times q_2' + w_3' \times q_3' + \cdots + w_{n_1}' \times q_{n_1}' \qquad (15)$$

wherein $q_1', q_2', q_3', \ldots, q_{n_1}'$ are the expansion words in the expansion word set (n₁ in total), and $w_1', w_2', w_3', \ldots, w_{n_1}'$ are the scores of the corresponding expansion words in the query term polynomial V.

The query polynomial V_Q and the query expansion word polynomial V' are normalized and then linearly combined again to obtain a new query term polynomial K, as shown in formula (16).

$$K = \alpha \times ||V_Q|| + \beta \times ||V'|| \qquad (16)$$

Formula (16) uses the same normalization method as step 3. The adjustment factor α generally takes a fixed value of 1.0, and the adjustment factor β ranges from 0 to 1.0; their function is to balance the weights between the original query words and the expansion words, and they can be set to empirical values in a specific implementation.
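Steps 4 and 5 can be sketched as follows. The sketch assumes divide-by-maximum normalization and treats a word absent from one polynomial as having coefficient 0 there; β = 0.5 is only an illustrative value from the stated range, and the function names and sample data are hypothetical.

```python
def top_terms(poly, n1):
    """Step 4: keep the n1 terms with the largest coefficients."""
    ranked = sorted(poly.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n1])


def normalize_by_max(poly):
    """Divide-by-maximum normalization, as in step 3."""
    peak = max(poly.values(), default=0.0)
    return {t: (w / peak if peak else 0.0) for t, w in poly.items()}


def build_final_query(query_words, v, n1, alpha=1.0, beta=0.5):
    """Step 5: K = alpha * ||V_Q|| + beta * ||V'||."""
    # Every original query word gets coefficient 1.0
    v_q = normalize_by_max({q: 1.0 for q in query_words})
    v_prime = normalize_by_max(top_terms(v, n1))
    # ASSUMPTION: a word missing from one polynomial contributes 0 there
    return {t: alpha * v_q.get(t, 0.0) + beta * v_prime.get(t, 0.0)
            for t in set(v_q) | set(v_prime)}


# Hypothetical combined polynomial V from step 3 and a one-word original query
k = build_final_query(["query"], {"kernel": 0.5, "query": 0.75, "rank": 0.17}, n1=2)
print(k)
```

In this example the original query word also appears among the expansion words, so its weight in K is the sum of both contributions, while a pure expansion word is weighted by β alone.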

Step 6, a new query keyword set Q' is obtained from step 5; the query words of Q' are the terms of the query term polynomial K. A second information retrieval is performed (with the same retrieval model as the first retrieval) using Q' and the weight of each query word of Q' in the query term polynomial K; that is, the score of Q' against each document in the target document set D is calculated again, and the query result obtained is the final information retrieval result.

In the second retrieval, the query words are those of the newly generated query keyword set Q', and when a query word is scored against each document its weight is its coefficient in the query term polynomial K, whereas in the first retrieval the weight of every query word is 1.0.

In specific implementation, a person skilled in the art can use software technology to run the above process automatically. Accordingly, an information retrieval system based on the pseudo-relevance feedback model, comprising a computer or a server on which the above process is executed to fuse word relevance into the pseudo-relevance feedback model for information retrieval, also falls within the scope of protection of the invention.

For example, the development environment for information retrieval may be a Java or Python development environment, with Lucene as the supporting library.

The information retrieval framework may be a pseudo-relevance feedback retrieval framework based on a vector space model, a probabilistic model, a language model, and the like.

To verify the practical effect of the method, comparison experiments were carried out on several standard data sets. The experiments are divided into two groups: one group uses the standard Rocchio pseudo-correlation feedback information retrieval model, and the other uses the Rocchio pseudo-correlation feedback information retrieval model combined with the method of the present invention, abbreviated as KRC. Six standard international data sets were used, namely AP88-89, AP90, DISK1&2, DISK4&5, WT2G and WT10G. Information on these data sets is shown in the following table (Table 1):

Data set name	Number of documents	Size	Query topic numbers	Number of query topics
AP90	78,321	0.23 GB	51-100	50
AP88-89	164,597	0.50 GB	51-100	50
DISK1&2	741,856	2.03 GB	51-200	150
DISK4&5	528,155	1.85 GB	301-450	150
WT2G	247,491	2.14 GB	401-450	50
WT10G	1,692,096	10 GB	451-550	100

TABLE 1 basic information of six data sets

In the comparison experiments, a Gaussian kernel function is selected as the kernel function in the method of the present invention (other kernel functions may also be used), and the σ value of the Gaussian kernel function is 50. To make the comparison fairer, the number N1 of query expansion words is set to 10, 20, 30 and 50 in turn, and the experimental results for each setting are reported in Table 2.
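A minimal Python sketch of position-based proximity with a Gaussian kernel and σ = 50 is given here for illustration. The kernel form exp(-(p1 - p2)² / (2σ²)) is the common Gaussian form and is an assumption, as is accumulating the kernel value over every pair of occurrences; the function names and the toy token list are hypothetical.

```python
import math


def gaussian_kernel(p1, p2, sigma=50.0):
    """Proximity of two token positions; 1.0 when they coincide."""
    return math.exp(-((p1 - p2) ** 2) / (2.0 * sigma ** 2))


def positions(tokens, term):
    """All positions at which `term` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def cooccurrence(positions_t, positions_q, sigma=50.0):
    """Sum kernel proximity over all occurrence pairs of candidate and query word."""
    return sum(
        gaussian_kernel(a, b, sigma)
        for a in positions_t
        for b in positions_q
    )


tokens = ["query"] + ["filler"] * 4 + ["near"] + ["filler"] * 300 + ["far"]
near_score = cooccurrence(positions(tokens, "near"), positions(tokens, "query"))
far_score = cooccurrence(positions(tokens, "far"), positions(tokens, "query"))
```

A candidate word that occurs close to a query word receives a score near 1.0 per pair, while a word hundreds of positions away contributes almost nothing, so σ controls how wide a window counts as "close".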

TABLE 2 average precision (MAP) comparison of Rocchio and KRC models over six standard data sets

In Table 2, the Rocchio model in the second column does not use the method of the present invention, the KRC model is the Rocchio model using the method of the present invention, and MAP is the mean average precision of the retrieval results, as can be observed from the table.
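For reference, MAP as commonly defined can be computed as in the following Python sketch. It uses the standard definition (precision averaged at the rank of each relevant document, then averaged over queries); the function names and the sample rankings are hypothetical and are not data from the experiments.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values taken at the rank of each relevant document."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with their rankings and relevance judgments.
runs = [
    (["a", "b", "c"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_value = mean_average_precision(runs)
```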

Claims

1. An information retrieval method based on a pseudo-correlation feedback model, characterized in that: word relevance is fused into a pseudo-correlation feedback model to realize information retrieval, which comprises, when query expansion words are generated from the pseudo-relevant document set, separately generating query expansion words characterized by the importance of the candidate expansion words and query expansion words characterized by the relevance between the candidate expansion words and the query subject words, and then combining the two with the original query words to complete the final information retrieval; when generating the query expansion words characterized by the relevance between the candidate expansion words and the query subject words, a kernel function is used to calculate the relevance between query words and candidate words appearing at different positions in a document;

the information retrieval is realized by fusing the word relevancy into the pseudo-relevant feedback model in the following way,

wherein i is 1,2,3 …, N, j is 1,2,3 …, N;

step 2, taking all words in each document of the pseudo-relevant document set D₁ as expansion candidate words, and using a kernel function to calculate, according to co-occurrence position and co-occurrence frequency, the relevance score of each expansion candidate word t_r with the query keywords Q in document d_i, so as to obtain the relevance vector of each document d_i as follows,

wherein, i is 1,2,3 …, N, r is 1,2,3 …, N;

after the relevance score of each expansion candidate word is obtained, the expansion candidate words are sorted in descending order of score, and the expansion candidate words corresponding to the n₁ highest scores are selected to form a relevance query expansion word set Q'₁; a polynomial V₁' is used to represent each word in the query expansion word set Q'₁ and the relevance score corresponding to that word;

V＝(1-γ)×||V₁||+γ×||V₁'||

K＝α×||V_Q||+β×||V'||

wherein α and β are adjusting factors;

2. The information retrieval method based on the pseudo-correlation feedback model according to claim 1, wherein: in step 1, the importance score is obtained using TFIDF, BM25 or RM3.

3. The information retrieval method based on the pseudo-correlation feedback model according to claim 1, wherein: in step 2, the relevance score of each expansion candidate word t_r with the query keywords Q in document d_i is calculated as follows,

4. The information retrieval method based on the pseudo-correlation feedback model according to claim 3, wherein: the co-occurrence frequency of t_r and q_s in document d_i is calculated as follows,

wherein M and L respectively represent the number of occurrences of t_r and of q_s in document d_i; the position terms refer to the k1-th occurrence of t_r in document d_i and the k2-th occurrence of q_s in document d_i, with k1 = 1,2,3…,M and k2 = 1,2,3…,L; the kernel function value expresses the proximity of the positions of these two occurrences.

5. The information retrieval method based on the pseudo-correlation feedback model according to claim 4, wherein: the kernel function is a gaussian function or a trigonometric function.

6. The information retrieval method based on the pseudo-correlation feedback model according to claim 5, wherein: when the kernel function is a gaussian function, it is calculated as follows,

7. The information retrieval method based on the pseudo-correlation feedback model according to claim 4, wherein: the co-occurrence inverse document frequency of t_r and q_s in document d_i is calculated as follows,

8. The information retrieval method based on the pseudo-correlation feedback model according to claim 1 or 2 or 3 or 4 or 5 or 6 or 7, wherein: the preset retrieval weight model is based on a vector space model, a probability model or a language model.

9. An information retrieval system based on a pseudo-correlation feedback model, characterized in that: comprising a computer or server on which the method according to claims 1 to 8 is performed.