CN111897925B - Pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning - Google Patents


Info

Publication number
CN111897925B
CN111897925B (granted from application CN202010774429.1A)
Authority
CN
China
Prior art keywords
ret
word
expansion
pseudo
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010774429.1A
Other languages
Chinese (zh)
Other versions
CN111897925A (en)
Inventor
黄名选
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics filed Critical Guangxi University of Finance and Economics
Priority to CN202010774429.1A priority Critical patent/CN111897925B/en
Publication of CN111897925A publication Critical patent/CN111897925A/en
Application granted granted Critical
Publication of CN111897925B publication Critical patent/CN111897925B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/3338 Query expansion
    • G06F16/334 Query execution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning. First, a user query retrieves the original Chinese document set to obtain initially retrieved documents, from which an initial retrieval document set is constructed; the top m initially retrieved documents are taken as the pseudo-relevance feedback document set. Candidate expansion words are mined from the pseudo-relevance feedback document set with an association rule mining method based on the CSC (Copulas-based Support and Confidence) framework, building a candidate expansion word set. The vector cosine similarity between the candidate expansion words and the original query is then computed, and candidates not lower than the similarity threshold are extracted as final expansion words. Finally, the final expansion words are combined with the original query into a new query, and the original document set is retrieved again, realizing query expansion. Experimental results show that the expansion retrieval performance of the method exceeds that of existing query expansion methods based on association patterns and word vectors; the method effectively reduces query topic drift and word mismatching, improves information retrieval performance, and has good application value and promotion prospects.

Description

Pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning
Technical Field
The invention relates to a pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning, and belongs to the technical field of information retrieval.
Background
With the development of network technology and the arrival of the big data era, it is increasingly difficult for network users to quickly and accurately obtain the information they need from massive network resources, mainly because current retrieval systems suffer from query topic drift and word mismatching. Query expansion is one of the key technologies for addressing these problems in information retrieval: it modifies the weights of the original query or adds words related to it, producing a new, longer query that describes the semantics or topic implied by the original query more completely and accurately, compensates for the deficiency of the user's query information, and improves the retrieval performance of the information retrieval system.
The core problems of query expansion are the source of the expansion terms and the design of the expansion model. In recent decades, researchers have studied query expansion models from different perspectives, introducing association pattern mining and word vector learning into the field and achieving good results in relevance feedback expansion based on association pattern mining and in query expansion based on deep learning. For example, Jabri et al. proposed an expansion word mining method based on association rule graphs (see: Jabri S, Dahbi A, Gadi T. A graph-based approach for text query expansion using pseudo relevance feedback and association rules mining [J]. International Journal of Electrical & Computer Engineering, 2019(6): 5016-5023.), Kuzi et al. proposed query expansion using word embeddings (see: Kuzi S, Shtok A, Kurland O. Query expansion using word embeddings [C]. Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016), 2016: 1929-1932.), and Bouziri et al. proposed a method for learning query expansion from association rules between terms (see: Bouziri A, Latiri C, Gaussier E, et al. Learning query expansion from association rules between terms [C]. Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), Lisbon, Portugal, 2015: 525-530.) as well as an expansion rule selection method based on learning to rank (see: Bouziri A, Latiri C, Gaussier E. Efficient association rules selection for automatic query expansion [C]. Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2017), Budapest, Hungary, Springer, Cham, LNCS 10762, 2017: 563-574.). Experimental results show that these query expansion methods are effective and perform well in improving information retrieval performance.
However, existing query expansion methods have not fully solved query topic drift, word mismatching, and related technical problems in information retrieval. Query expansion based on association rule mining mainly obtains expansion words through mining techniques based on statistical analysis and thus ignores the semantic information the expansion words carry in context. Addressing this defect, the invention fuses association patterns with word vectors rich in contextual semantic information, providing a pseudo-relevance feedback expansion method that integrates association pattern mining and word vector learning.
Disclosure of Invention
The invention aims to provide a pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning for the field of information retrieval, such as practical Chinese search engines and web information retrieval systems, to improve the query performance of information retrieval systems and reduce query topic drift and word mismatching.
The invention adopts the following specific technical scheme:
a pseudo-correlation feedback expansion method for combination of correlation pattern mining and word vector learning comprises the following steps:
step 1, a Chinese user queries and searches an original Chinese document set to obtain a primary check document, and a primary check document set is constructed.
Step 2, extracting m primary inspection documents in the front row as a pseudo-related feedback document set, and mining candidate extension words from the pseudo-related feedback document set by adopting an association rule mining method based on a CSC (computer-aided Support and Confidence) frame, wherein the CSC (computer-aided Support and Confidence) frame refers to a Support degree-Confidence frame based on a Copulas theory, and the association rule mining method based on the CSC frame specifically comprises the following steps:
and (2.1) extracting m pieces of primary detection documents in the front row from the primary detection document set as pseudo-related feedback documents, constructing a pseudo-related feedback document set, carrying out Chinese word segmentation, Chinese stop words removal and characteristic word extraction preprocessing on the primary detection pseudo-related feedback document set, calculating a weight of the characteristic words, and finally constructing a pseudo-related feedback Chinese document library and a Chinese characteristic word library.
The invention adopts the TF-IDF (term frequency-inverse document frequency) weighting technique (see: Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval (Chinese translation by Wang Zhijin et al.). China Machine Press, 2005: 21-22.) to compute the feature word weights.
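The TF-IDF weighting of the feature words can be sketched as follows (a minimal illustration: the tokenized toy documents, the normalization by document length, and the natural-log IDF are assumptions, not the patent's exact formulation):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for pre-tokenized documents.

    docs: list of token lists (already segmented, stop words removed).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

A term that appears in every document gets IDF 0 and is effectively discarded, which is the usual behavior of this weighting.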
(2.2) Extract feature words from the Chinese feature word library as the 1_candidate itemsets C_1.
(2.3) Compute C_1's CSC-framework support CSC_Sup(C_1); if CSC_Sup(C_1) is not lower than the minimum support threshold ms, take C_1 as a 1_frequent itemset L_1 and add it to the frequent itemset set FIS (Frequent ItemSet).
CSC_Sup (CSC-based Support) denotes the support under the CSC framework. CSC_Sup(C_1) is computed by equation (1) [formula image not reproduced], which combines, through a Copulas-based function involving exp, the frequency ratio n(C_1)/DocNum and the weight ratio w(C_1)/ItemsW. In equation (1), n(C_1) denotes the occurrence frequency of the 1_candidate itemset C_1 in the pseudo-relevance feedback Chinese document library, DocNum the total number of documents in that library, w(C_1) the itemset weight of C_1 in the library, ItemsW the accumulated weight of all Chinese feature words in the library, and exp the exponential function with the natural constant e as base.
(2.4) Using the self-join method, join the (k-1)_frequent itemsets L_(k-1) to derive the k_candidate itemsets C_k, k ≥ 2.
The self-join method adopts the candidate itemset join method given in the Apriori algorithm (see: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).
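The self-join step can be illustrated as follows: two (k-1)_frequent itemsets that agree on all but their last item are joined into one k_candidate itemset. This is a minimal sketch of the Apriori join, with itemsets kept as sorted tuples; it omits the Apriori prune step.

```python
def self_join(frequent_k_minus_1):
    """Apriori self-join: generate k_candidates from (k-1)_frequent itemsets.

    frequent_k_minus_1: list of sorted tuples, each of length k-1.
    Returns the sorted list of k_candidate itemsets.
    """
    candidates = set()
    items = sorted(frequent_k_minus_1)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            # Join only itemsets sharing their first k-2 items.
            if a[:-1] == b[:-1]:
                candidates.add(a + (b[-1],))
    return sorted(candidates)
```

For example, joining the 2_frequent itemsets ("a","b") and ("a","c") yields the 3_candidate ("a","b","c").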
(2.5) When a 2_candidate itemset C_2 is mined: if C_2 does not contain any original query term, delete it; if C_2 contains an original query term, retain it and go to step (2.6). When k_candidate itemsets C_k with k ≥ 3 are mined, go directly to step (2.6).
(2.6) Compute C_k's CSC-framework support CSC_Sup(C_k); if CSC_Sup(C_k) is not lower than ms, take C_k as a k_frequent itemset L_k and add it to the frequent itemset set FIS.
CSC_Sup(C_k) is computed by equation (2) [formula image not reproduced], which, like equation (1), combines the frequency ratio n(C_k)/DocNum and the weight ratio w(C_k)/ItemsW through a Copulas-based function involving exp. In equation (2), n(C_k) denotes the occurrence frequency of the k_candidate itemset C_k in the pseudo-relevance feedback Chinese document library and w(C_k) its itemset weight in that library; DocNum and ItemsW are defined as in equation (1), and exp denotes the exponential function with the natural constant e as base.
(2.7) Increment k by 1 and return to step (2.4), continuing the subsequent steps until L_k is the empty set; frequent itemset mining then ends, and the method goes to step (2.8).
(2.8) Take any L_k, k ≥ 2, out of FIS.
(2.9) From L_k, extract association rules Q_i → Ret_j and compute their CSC-framework confidence CSC_Con(Q_i → Ret_j), where i ≥ 1, j ≥ 1, Q_i ∩ Ret_j = ∅, and Q_i ∪ Ret_j = L_k. Ret_j is a proper-subset itemset containing no query terms, Q_i is a proper-subset itemset containing query terms, and Q is the original query term set.
CSC_Con (CSC-based Confidence) denotes the confidence under the CSC framework. CSC_Con(Q_i → Ret_j) is computed by equation (3) [formula image not reproduced], which combines, through a Copulas-based function, the occurrence frequency n(L_k) and itemset weight w(L_k) of the k_frequent itemset L_k in the pseudo-relevance feedback Chinese document library with the occurrence frequency n(Q_i) and itemset weight w(Q_i) of its proper-subset itemset Q_i in the same library.
(2.10) Extract every association rule Q_i → Ret_j whose CSC_Con(Q_i → Ret_j) is not lower than the minimum confidence threshold mc and add it to the association rule set AR (Association Rule). Then return to step (2.9), re-extract other proper-subset itemsets Ret_j and Q_i from L_k, and continue the subsequent steps, looping until every proper-subset itemset of L_k has been taken out exactly once. Next go to step (2.8), take any other L_k from FIS, and carry out a new round of association rule pattern mining; repeat until every k_frequent itemset L_k in FIS has been taken out exactly once. Association rule pattern mining then ends, and the method goes to step (2.11).
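The loop of steps (2.8) to (2.10) amounts to enumerating, for each frequent itemset, every split into an antecedent Q_i containing query terms and a consequent Ret_j containing none, and keeping the splits whose confidence reaches mc. A minimal sketch, with a generic conf_fn standing in for the CSC-framework confidence CSC_Con:

```python
from itertools import combinations

def extract_rules(frequent_itemsets, query_terms, conf_fn, mc):
    """Enumerate association rules Qi -> Retj from each frequent itemset Lk.

    Qi is a non-empty proper subset containing at least one query term;
    Retj is the complementary part and contains no query terms.
    conf_fn(qi, ret) stands in for CSC_Con; rules with confidence >= mc
    are kept as (antecedent, consequent, confidence) triples.
    """
    rules = []
    for lk in frequent_itemsets:
        for r in range(1, len(lk)):
            for qi in combinations(lk, r):
                ret = tuple(t for t in lk if t not in qi)
                if not any(t in query_terms for t in qi):
                    continue  # Qi must contain a query term
                if any(t in query_terms for t in ret):
                    continue  # Retj must contain no query terms
                conf = conf_fn(qi, ret)
                if conf >= mc:
                    rules.append((qi, ret, conf))
    return rules
```

With a 3_frequent itemset ("q1","w1","w2") and query term "q1", the valid splits are q1→(w1,w2), (q1,w1)→w2, and (q1,w2)→w1.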
(2.11) Extract the rule consequents Ret_j from the association rule set AR as candidate expansion words, obtaining the candidate expansion word set CET (Candidate Expansion Term), and compute the candidate expansion word weights w_Ret; then go to step (2.12).
CET is given by equation (4):
CET = {Ret_1, Ret_2, …, Ret_i, …}    (4)
where Ret_i (i ≥ 1) denotes the i-th candidate expansion word.
The candidate expansion word weight w_Ret is computed by equation (5):
w_Ret = max(CSC_Con(Q_i → Ret_j))    (5)
where max() takes the maximum association rule confidence; when the same expansion word appears as the consequent of several association rule patterns, the maximum of their confidences is taken as its weight.
(2.12) Perform Chinese word segmentation and Chinese stop-word removal on the initial retrieval document set, then apply word vector semantic learning training to it with a deep learning tool to obtain the word vector set of the feature words.
The deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec (see https://code.google.com/p/word2vec/).
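For reference, the Skip-gram model learns word vectors by predicting context words from each center word. The (center, context) training pairs it draws from a segmented sentence can be illustrated as follows (the window size is a free parameter; this sketch shows only pair generation, not word2vec's actual training):

```python
def skipgram_pairs(tokens, window=2):
    """Generate the (center, context) pairs the Skip-gram model trains on.

    tokens: one segmented sentence as a list of words.
    window: how many words on each side count as context.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

For the sentence ["a", "b", "c"] with window 1, the pairs are (a,b), (b,a), (b,c), (c,b).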
(2.13) In the word vector set of the feature words, compute the vector cosine similarity VecSim(Ret_i, q_j), 1 ≤ j ≤ r, between each candidate expansion word Ret_i and every query term q_j of the original query term set Q = (q_1, q_2, …, q_r), and accumulate these similarities as the candidate expansion word's total vector similarity VecSim(Ret_i, Q).
VecSim(Ret_i, q_j) is given by equation (6):
VecSim(Ret_i, q_j) = (vRet_i · vq_j) / (|vRet_i| × |vq_j|)    (6)
where vRet_i denotes the word vector of the i-th candidate expansion word Ret_i and vq_j the word vector of the j-th query term q_j.
VecSim(Ret_i, Q) is computed by equation (7):
VecSim(Ret_i, Q) = Σ_{j=1..r} VecSim(Ret_i, q_j)    (7)
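The computation described by formulas (6) and (7) can be sketched in plain Python (the toy vectors below are illustrative; real vectors would come from the trained word2vec model):

```python
import math

def cosine(u, v):
    """Vector cosine similarity, as in formula (6)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def total_similarity(cand_vec, query_vecs):
    """Accumulated similarity to all query terms, as in formula (7)."""
    return sum(cosine(cand_vec, qv) for qv in query_vecs)
```

A candidate whose vector matches one query term exactly and is orthogonal to another gets a total similarity of 1.0.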
(2.14) Extract the candidate expansion words whose VecSim(Ret_i, Q) is not lower than the vector similarity threshold minVSim as the final expansion words of the original query term set Q, obtaining the final expansion term set FETS (Final Expansion Term Set), and compute the final expansion word weights w_Fet; then go to step (2.15).
FETS is given by equation (8):
FETS = {Ret_i ∈ CET | VecSim(Ret_i, Q) ≥ minVSim}    (8)
The final expansion word weight w_Fet is determined jointly by the candidate expansion word weight w_Ret and the vector similarity VecSim(Ret_i, Q); w_Fet is computed by equation (9) [formula image not reproduced].
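The final selection step can be sketched as follows; since the text states only that w_Fet is determined jointly by w_Ret and VecSim(Ret_i, Q), the product used below is an illustrative assumption, not the patent's formula (9):

```python
def select_final_expansions(candidates, min_vsim):
    """Keep candidates whose total similarity reaches minVSim (formula (8)).

    candidates: {word: (w_ret, vec_sim)} where w_ret is the rule-based
    weight and vec_sim the total vector similarity to the query.
    Returns {word: final_weight}; the product w_ret * vec_sim is an
    assumed stand-in for the patent's combination in formula (9).
    """
    return {
        word: w_ret * vec_sim
        for word, (w_ret, vec_sim) in candidates.items()
        if vec_sim >= min_vsim
    }
```

A candidate below the similarity threshold is dropped regardless of its rule confidence.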
(2.15) Combine the final expansion words with the original query into a new query and retrieve the original Chinese document set again, realizing query expansion.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning. It fuses association patterns based on statistical analysis with word vectors carrying contextual semantic information: candidate expansion words are mined from the pseudo-relevance feedback document set to build a candidate expansion word set, the vector cosine similarity between each candidate expansion word and each query term is computed, and the candidates whose similarity is not lower than the threshold are extracted as final expansion words. Experimental results show that the method improves information retrieval performance, outperforms comparable methods of recent years, and has good application value and promotion prospects.
(2) Four similar query expansion methods from recent years were selected as comparison methods, with the Chinese corpus of the international standard dataset NTCIR-5 CLIR as experimental data. The results show that the MAP value of the method exceeds that of the baseline retrieval (BLR) and is in most cases higher than those of the four comparison methods, indicating that the method retrieves better than both the baseline and the comparison methods, improves information retrieval performance, reduces query topic drift and word mismatching in information retrieval, and has high application value and broad promotion prospects.
Drawings
Fig. 1 is the general flow diagram of the pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning according to the invention.
Detailed Description
Firstly, in order to better explain the technical scheme of the invention, the related concepts related to the invention are introduced as follows:
1. item set
In text mining, a text document is regarded as a transaction, each feature word in the document is called an item, a set of feature word items is called an item set, and the number of all items in the item set is called an item set length. The k _ term set refers to a term set containing k items, k being the length of the term set.
2. Association rule antecedent and consequent
Let x and y be arbitrary feature term sets, and an implication of the form x → y is called an association rule, where x is called a rule antecedent and y is called a rule consequent.
3. CSC framework
The CSC (copula-based Support and Confidence) framework refers to a Support-Confidence framework based on copula theory.
4. CSC framework based support and confidence
Copulas theory (see: Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8(1): 229-231.) is used to describe the correlation between variables; distributions of arbitrary form can be combined and connected into a valid multivariate distribution function.
The invention uses a Copula function to unify the frequency and the weight of a feature term set into the support and confidence of feature term association patterns, yielding a support-confidence framework based on Copulas theory, the CSC framework. The support CSC_Sup (CSC-based Support) and confidence CSC_Con (CSC-based Confidence) in the CSC framework are computed as follows:
The CSC-framework support CSC_Sup(T_1 ∪ T_2) of a feature term set (T_1 ∪ T_2) is computed by equation (10) [formula image not reproduced], which combines, through a Copulas-based function involving exp, the frequency ratio n(T_1 ∪ T_2)/DocNum and the weight ratio w(T_1 ∪ T_2)/ItemsW. Here n(T_1 ∪ T_2) denotes the occurrence frequency of the itemset (T_1 ∪ T_2) in the pseudo-relevance feedback Chinese document library, w(T_1 ∪ T_2) its itemset weight in that library, DocNum the total number of documents in the library, ItemsW the accumulated weight of all Chinese feature words in the library, and exp the exponential function with the natural constant e as base.
The CSC-framework confidence CSC_Con(T_1 → T_2) of a feature word association rule T_1 → T_2 is computed by equation (11) [formula image not reproduced], which combines, through a Copulas-based function, the occurrence frequency n(T_1 ∪ T_2) and itemset weight w(T_1 ∪ T_2) of the itemset (T_1 ∪ T_2) in the pseudo-relevance feedback Chinese document library with the occurrence frequency n(T_1) and itemset weight w(T_1) of the itemset T_1 in the same library.
The invention is further explained below with reference to the drawings and specific comparative experiments.
As shown in Fig. 1, the pseudo-relevance feedback expansion method of the invention, integrating association pattern mining and word vector learning, comprises the following steps:
Step 1: a Chinese user query retrieves the original Chinese document set, obtaining initially retrieved documents that form the initial retrieval document set.
Step 2: extract the top m initially retrieved documents as the pseudo-relevance feedback document set, and mine candidate expansion words from it with an association rule mining method based on the CSC (Copulas-based Support and Confidence) framework, i.e., a support-confidence framework based on Copulas theory. The association rule mining method based on the CSC framework comprises the following steps:
(2.1) Extract the top m initially retrieved documents from the initial retrieval document set as pseudo-relevance feedback documents and construct the pseudo-relevance feedback document set; perform Chinese word segmentation, Chinese stop-word removal, and feature word extraction preprocessing on it; compute the feature word weights; and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library.
The invention adopts TF-IDF weighting technology to calculate the weight of the feature words.
(2.2) Extract feature words from the Chinese feature word library as the 1_candidate itemsets C_1.
(2.3) Compute C_1's CSC-framework support CSC_Sup(C_1); if CSC_Sup(C_1) is not lower than the minimum support threshold ms, take C_1 as a 1_frequent itemset L_1 and add it to the frequent itemset set FIS (Frequent ItemSet).
CSC_Sup (CSC-based Support) denotes the support under the CSC framework. CSC_Sup(C_1) is computed by equation (1) [formula image not reproduced], which combines, through a Copulas-based function involving exp, the frequency ratio n(C_1)/DocNum and the weight ratio w(C_1)/ItemsW. In equation (1), n(C_1) denotes the occurrence frequency of the 1_candidate itemset C_1 in the pseudo-relevance feedback Chinese document library, DocNum the total number of documents in that library, w(C_1) the itemset weight of C_1 in the library, ItemsW the accumulated weight of all Chinese feature words in the library, and exp the exponential function with the natural constant e as base.
(2.4) Using the self-join method, join the (k-1)_frequent itemsets L_(k-1) to derive the k_candidate itemsets C_k, k ≥ 2.
The self-join method adopts the candidate itemset join method given in the Apriori algorithm.
(2.5) When a 2_candidate itemset C_2 is mined: if C_2 does not contain any original query term, delete it; if C_2 contains an original query term, retain it and go to step (2.6). When k_candidate itemsets C_k with k ≥ 3 are mined, go directly to step (2.6).
(2.6) Compute C_k's CSC-framework support CSC_Sup(C_k); if CSC_Sup(C_k) is not lower than ms, take C_k as a k_frequent itemset L_k and add it to the frequent itemset set FIS.
CSC_Sup(C_k) is computed by equation (2) [formula image not reproduced], which, like equation (1), combines the frequency ratio n(C_k)/DocNum and the weight ratio w(C_k)/ItemsW through a Copulas-based function involving exp. In equation (2), n(C_k) denotes the occurrence frequency of the k_candidate itemset C_k in the pseudo-relevance feedback Chinese document library and w(C_k) its itemset weight in that library; DocNum and ItemsW are defined as in equation (1), and exp denotes the exponential function with the natural constant e as base.
(2.7) Increment k by 1 and return to step (2.4), continuing the subsequent steps until L_k is the empty set; frequent itemset mining then ends, and the method goes to step (2.8).
(2.8) Take any L_k, k ≥ 2, out of FIS.
(2.9) From L_k, extract association rules Q_i → Ret_j and compute their CSC-framework confidence CSC_Con(Q_i → Ret_j), where i ≥ 1, j ≥ 1, Q_i ∩ Ret_j = ∅, and Q_i ∪ Ret_j = L_k. Ret_j is a proper-subset itemset containing no query terms, Q_i is a proper-subset itemset containing query terms, and Q is the original query term set.
CSC_Con (CSC-based Confidence) denotes the confidence under the CSC framework. CSC_Con(Q_i → Ret_j) is computed by equation (3) [formula image not reproduced], which combines, through a Copulas-based function, the occurrence frequency n(L_k) and itemset weight w(L_k) of the k_frequent itemset L_k in the pseudo-relevance feedback Chinese document library with the occurrence frequency n(Q_i) and itemset weight w(Q_i) of its proper-subset itemset Q_i in the same library.
(2.10) Extract every association rule Q_i → Ret_j whose CSC_Con(Q_i → Ret_j) is not lower than the minimum confidence threshold mc and add it to the association rule set AR (Association Rule). Then return to step (2.9), re-extract other proper-subset itemsets Ret_j and Q_i from L_k, and continue the subsequent steps, looping until every proper-subset itemset of L_k has been taken out exactly once. Next go to step (2.8), take any other L_k from FIS, and carry out a new round of association rule pattern mining; repeat until every k_frequent itemset L_k in FIS has been taken out exactly once. Association rule pattern mining then ends, and the method goes to step (2.11).
(2.11) Extract the association rule consequents Ret_j from the association rule set AR as candidate expansion words, obtaining the candidate expansion word set CET (Candidate Expansion Term set), and calculate the candidate expansion word weight w_Ret; then proceed to step (2.12).
The CET is shown in formula (4):

CET = {Ret_1, Ret_2, …, Ret_i}    (4)

In formula (4), Ret_i denotes the i-th candidate expansion word, i ≥ 1.
The candidate expansion word weight w_Ret is calculated by formula (5):

w_Ret = max(CSC_Con(Q_i → Ret_j))    (5)
In formula (5), max() takes the maximum confidence over the association rules: when the same rule expansion word appears in several association rule patterns at the same time, the maximum confidence is taken as that rule expansion word's weight.
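The construction of CET and w_Ret in step (2.11) and formula (5) can be sketched as follows, taking rule triples in the (antecedent, consequent, confidence) form produced by any miner:

```python
def candidate_expansion_terms(rules):
    """Collect rule consequent terms as candidate expansion words (CET);
    when a term appears in several association rules, its weight w_Ret is
    the maximum rule confidence, as formula (5) prescribes."""
    w_ret = {}
    for _antecedent, consequent, conf in rules:
        for term in consequent:
            w_ret[term] = max(w_ret.get(term, 0.0), conf)
    return w_ret  # keys form the CET, values are the weights w_Ret
```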
(2.12) Perform Chinese word segmentation and Chinese stop word removal on the initially retrieved document set, and carry out word vector semantic learning training on it with a deep learning tool to obtain the word vector set of the characteristic words.
The deep learning tool used by the invention is the Skip-gram model of Google's open-source word vector tool word2vec.
(2.13) In the word vector set of the characteristic words, calculate the vector cosine similarity VecSim(Ret_i, q_j), 1 ≤ j ≤ r, between each candidate expansion word Ret_i and every query term q_1, q_2, …, q_r of the original query term set Q = (q_1, q_2, …, q_r), and accumulate the candidate expansion word's vector similarity to each query term as its total vector similarity VecSim(Ret_i, Q).
VecSim(Ret_i, q_j) is shown in formula (6):

VecSim(Ret_i, q_j) = (vRet_i · vq_j) / (‖vRet_i‖ ‖vq_j‖)    (6)

In formula (6), vRet_i denotes the word vector of the i-th candidate expansion word Ret_i and vq_j denotes the word vector of the j-th query term q_j.
VecSim(Ret_i, Q) is calculated by formula (7):

VecSim(Ret_i, Q) = Σ_{j=1}^{r} VecSim(Ret_i, q_j)    (7)
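Formulas (6) and (7) — vector cosine similarity and its accumulation over the query terms — can be sketched as:

```python
import numpy as np

def vec_sim(v_ret, v_q):
    """Vector cosine similarity between a candidate expansion word and one
    query term, formula (6)."""
    return float(np.dot(v_ret, v_q) /
                 (np.linalg.norm(v_ret) * np.linalg.norm(v_q)))

def total_vec_sim(v_ret, query_vectors):
    """Total vector similarity of the candidate against the whole original
    query term set Q, formula (7): the sum over all query terms."""
    return sum(vec_sim(v_ret, v_q) for v_q in query_vectors)
```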
(2.14) Extract the candidate expansion words whose VecSim(Ret_i, Q) is not lower than the vector similarity threshold minVSim as the final expansion words of the original query term set Q, obtaining the Final Expansion Term Set (FETS); calculate the final expansion word weight w_Fet, then proceed to step (2.15).
The final expansion word set FETS is shown in formula (8):

FETS = {Ret_1, Ret_2, …, Ret_i}, where VecSim(Ret_l, Q) ≥ minVSim for l ∈ (1, 2, …, i)    (8)
The final expansion word weight w_Fet is composed of the candidate expansion word weight w_Ret and the vector similarity VecSim(Ret_i, Q); w_Fet is calculated by formula (9), which appears as an image in the original.
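Step (2.14) and formulas (8)–(9) can be sketched as below. Since formula (9) appears only as an image in the source, the combination of w_Ret with VecSim(Ret, Q) is assumed here to be a simple product — an illustrative choice, not the patent's definition:

```python
def final_expansion(w_ret, total_sims, min_vsim):
    """Select the final expansion word set FETS (formula (8)) and weight it.
    w_ret: candidate weights from formula (5); total_sims: VecSim(Ret, Q)
    values from formula (7). The weight combination below is a HYPOTHETICAL
    stand-in for formula (9)."""
    fets = {}
    for term, sim in total_sims.items():
        if sim >= min_vsim:                 # threshold test of formula (8)
            fets[term] = w_ret[term] * sim  # assumed combination for w_Fet
    return fets
```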
(2.15) Combine the expansion words with the original query into a new query and retrieve the original Chinese document set again, thereby realizing query expansion.
Experimental design and results:
We compare the method of the invention experimentally with existing methods of the same kind to show its effectiveness.
1. Experimental environment and experimental data:
The experimental data of the invention is the NTCIR-5 CLIR Chinese text corpus (see http://research.nii.ac.jp/ntcir/data/data-en.html), which comprises 8 data sets with 901,446 Chinese documents in total; the specific information is shown in Table 1. The corpus is an internationally used standard corpus consisting of a document set, a query set and result sets: 50 Chinese queries over 4 types of query topic, and result sets under 2 evaluation criteria. The retrieval experiments are completed with the Description (abbreviated Desc) and Title query topics; a Title query is a short query that describes the query topic briefly with nouns and noun phrases, while a Desc query is a long query that describes the query topic briefly in sentence form. The result sets have the Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant and partially relevant to the query) evaluation criteria.
The experimental data preprocessing comprises word segmentation and stop word removal.

The retrieval evaluation metric for the experimental results is MAP (Mean Average Precision).
Table 1. Original experimental corpora of the invention and their document counts (table given as an image in the original)
2. Baseline retrieval and comparison methods:

The basic retrieval environment for the experiments is built with Lucene.

The baseline retrieval BLR (Baseline Retrieval) is the retrieval result obtained by submitting the original query to Lucene.
The comparative method is described as follows:
QE_WAPM (Query Expansion based on Weighted Association Pattern Mining): mines rule expansion words with the weighted association pattern mining technique of the literature (Huang Mingxuan. Cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-318), with parameters mc = 0.1 and mi = 0.0001; the experimental results are the averages over ms = 0.004, 0.005, 0.006 and 0.007.
QE_WPNPM (Query Expansion based on Weighted Positive and Negative Pattern Mining): mines rule expansion words with the fully weighted positive and negative association pattern mining technique of the literature (Huang Mingxuan, Jian Caoqing. Vietnamese-English cross-language query translation expansion based on fully weighted positive and negative association pattern mining [J]. Acta Electronica Sinica, 2018, 46(12): 3029-). The parameters are mc = 0.1, α = 0.3, minPR = 0.1 and minNR = 0.01; the experimental results are the averages over ms = 0.10, 0.11, 0.12 and 0.13.
QE_WMSM (Query Expansion based on Weighted Multiple Supports Mining): mines rule expansion words with the multi-support-threshold weighted frequent pattern mining technique of the literature (Zhang H R, Zhang J W, Wei X Y, et al. A new frequent pattern mining algorithm with weighted multiple minimum supports [J]. Intelligent Automation & Soft Computing, 2017, 23(4): 605-612), with parameters mc = 0.1, LMS = 0.2, HMS = 0.25 and WT = 0.1; the experimental results are the averages over ms = 0.1, 0.15, 0.2 and 0.25.
QE_W2Vec (Query Expansion based on Word Embedding): uses the word vector based query expansion method of the literature (Kan Linyuan, et al. Research on a word vector method for patent query expansion [J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 972-980). Experimental parameters: k = 60, α = 0.1.
The word vector semantic learning training parameters of the Skip-gram model used by the invention are: batch_size = 128, embedding_size = 300, skip_window = 2, num_skip = 4, num_sampled = 64.
3. Experimental method and results:

The experiments of the invention were performed on the 8 data sets with the 50 Chinese queries of the NTCIR-5 CLIR corpus, obtaining the MAP averages of the baseline retrieval BLR, the comparison methods, and the method of the invention, as shown in Tables 2 and 3.
Table 2. Retrieval performance MAP values of the method of the invention versus the baseline retrieval and comparison methods (Title queries) (table given as an image in the original)

Table 3. Retrieval performance MAP values of the method of the invention versus the baseline retrieval and comparison methods (Desc queries) (table given as an image in the original)
The experimental results show that the MAP values of the method of the invention are higher than those of the baseline retrieval BLR and mostly improved over the 4 comparison methods, i.e. its expansion retrieval performance exceeds that of the baseline retrieval and of the comparable methods. This indicates that the method is effective, can genuinely improve information retrieval performance, and has high application value and broad prospects for popularization.

Claims (4)

1. A pseudo-relevance feedback expansion method combining association pattern mining and word vector learning, characterized by comprising the following steps:
Step 1: a Chinese user query retrieves the original Chinese document set to obtain initially retrieved documents, from which an initially retrieved document set is constructed;

Step 2: extract the top m initially retrieved documents as a pseudo-relevance feedback document set, and mine candidate expansion words from it with an association rule mining method based on the CSC framework, the CSC framework being a support-confidence framework based on Copulas theory; the association rule mining method based on the CSC framework specifically comprises:

(2.1) extracting the top m initially retrieved documents from the initially retrieved document set as pseudo-relevance feedback documents to construct a pseudo-relevance feedback document set; performing Chinese word segmentation, Chinese stop word removal and characteristic word extraction preprocessing on this set, calculating the characteristic word weights, and finally constructing a pseudo-relevance feedback Chinese document library and a Chinese characteristic word library;
(2.2) extracting characteristic words from the Chinese characteristic word library as 1_candidate item sets C_1;

(2.3) calculating the support CSC_Sup(C_1) of C_1 based on the CSC framework; if CSC_Sup(C_1) is not lower than the minimum support threshold ms, taking C_1 as a 1_frequent item set L_1 and adding it to the frequent item set FIS;
(2.4) using the self-join method to derive the k_candidate item sets C_k from the join of the (k-1)_frequent item sets L_{k-1}, where k ≥ 2;

(2.5) when mining reaches the 2_candidate item sets C_2: if a C_2 does not contain an original query term, delete it; if a C_2 contains an original query term, keep it; then pass the remaining C_2 to step (2.6); when mining reaches the k_candidate item sets C_k with k ≥ 3, go directly to step (2.6);

(2.6) calculating the support CSC_Sup(C_k) of C_k based on the CSC framework; if CSC_Sup(C_k) is not lower than ms, taking C_k as a k_frequent item set L_k and adding it to the frequent item set FIS;

(2.7) adding 1 to k and returning to step (2.4), executing the subsequent steps in order until the generated L_k is an empty set; frequent item set mining then finishes, and the method proceeds to step (2.8);

(2.8) taking any one L_k from the FIS, where k ≥ 2;
(2.9) from L_k, extracting the association rule Q_i → Ret_j and calculating its confidence CSC_Con(Q_i → Ret_j) based on the CSC framework, where i ≥ 1, j ≥ 1, Q_i ∩ Ret_j = ∅, and Q_i ∪ Ret_j = L_k; Ret_j is a proper subset item set containing no query terms, Q_i is a proper subset item set containing query terms, and Q is the original query term set;
(2.10) extracting every association rule Q_i → Ret_j whose CSC_Con(Q_i → Ret_j) is not lower than the minimum confidence threshold mc and adding it to the association rule set AR; returning to step (2.9), re-extracting other proper subset item sets Ret_j and Q_i from L_k and carrying out the subsequent steps in order, looping until each proper subset item set of L_k has been taken exactly once; then returning to step (2.8), taking any other L_k from the FIS, carrying out a new round of association rule pattern mining and again executing the subsequent steps in order; looping in this way until every k_frequent item set L_k in the FIS has been taken exactly once, whereupon association rule pattern mining is finished and the method proceeds to the following step (2.11);
(2.11) extracting the association rule consequents Ret_j from the association rule set AR as candidate expansion words to obtain the candidate expansion word set CET, and calculating the candidate expansion word weight w_Ret, then proceeding to step (2.12);
(2.12) performing Chinese word segmentation and Chinese stop word removal on the initially retrieved document set, and carrying out word vector semantic learning training on it with a deep learning tool to obtain the word vector set of the characteristic words;
(2.13) in the word vector set of the characteristic words, calculating the vector cosine similarity VecSim(Ret_i, q_j), 1 ≤ j ≤ r, between each candidate expansion word Ret_i and every query term q_1, q_2, …, q_r of the original query term set Q = (q_1, q_2, …, q_r), and accumulating the candidate expansion word's vector similarity to each query term as its total vector similarity VecSim(Ret_i, Q);
(2.14) extracting the candidate expansion words whose VecSim(Ret_i, Q) is not lower than the vector similarity threshold minVSim as the final expansion words of the original query term set Q to obtain the final expansion word set FETS, and calculating the final expansion word weight w_Fet, then proceeding to step (2.15);
(2.15) combining the expansion words with the original query into a new query and retrieving the original Chinese document set again, thereby realizing query expansion;
In step (2.3), CSC_Sup(C_1) is calculated as shown in formula (1), which appears as an image in the original. In formula (1), n(C_1) denotes the frequency with which the 1_candidate item set C_1 occurs in the pseudo-relevance feedback Chinese document library, DocNum denotes the total number of documents in that library, w(C_1) denotes the item set weight of C_1 in that library, ItemsW denotes the accumulated weight of all Chinese characteristic words in that library, and exp denotes the exponential function with the natural constant e as base;
In step (2.6), CSC_Sup(C_k) is calculated as shown in formula (2), which appears as an image in the original. In formula (2), n(C_k) denotes the frequency with which the k_candidate item set C_k occurs in the pseudo-relevance feedback Chinese document library, and w(C_k) denotes the item set weight of C_k in that library; DocNum and ItemsW are defined as in formula (1), and exp denotes the exponential function with the natural constant e as base;
In step (2.9), CSC_Con(Q_i → Ret_j) is calculated as shown in formula (3), which appears as an image in the original. In formula (3), n(L_k) denotes the frequency with which the k_frequent item set L_k occurs in the pseudo-relevance feedback Chinese document library, w(L_k) denotes the item set weight of L_k in that library, n(Q_i) denotes the frequency with which the proper subset item set Q_i of L_k occurs in that library, and w(Q_i) denotes the item set weight of Q_i in that library;
In step (2.11), the CET is shown in formula (4):

CET = {Ret_1, Ret_2, …, Ret_i}    (4)

In formula (4), Ret_i denotes the i-th candidate expansion word, i ≥ 1;
The candidate expansion word weight w_Ret is calculated by formula (5):

w_Ret = max(CSC_Con(Q_i → Ret_j))    (5)

In formula (5), max() takes the maximum confidence over the association rules: when the same rule expansion word appears in several association rule patterns at the same time, the maximum confidence is taken as that rule expansion word's weight;
In step (2.13), VecSim(Ret_i, q_j) is shown in formula (6):

VecSim(Ret_i, q_j) = (vRet_i · vq_j) / (‖vRet_i‖ ‖vq_j‖)    (6)

In formula (6), vRet_i denotes the word vector of the i-th candidate expansion word Ret_i and vq_j denotes the word vector of the j-th query term q_j;
VecSim(Ret_i, Q) is calculated by formula (7):

VecSim(Ret_i, Q) = Σ_{j=1}^{r} VecSim(Ret_i, q_j)    (7)
In step (2.14), the final expansion word set FETS is shown in formula (8):

FETS = {Ret_1, Ret_2, …, Ret_i}, where VecSim(Ret_l, Q) ≥ minVSim for l ∈ (1, 2, …, i)    (8)
The final expansion word weight w_Fet is composed of the candidate expansion word weight w_Ret and the vector similarity VecSim(Ret_i, Q); w_Fet is calculated by formula (9), which appears as an image in the original.
2. the method of claim 1, wherein the method comprises the following steps: in the step (2.1), a TF-IDF weighting technology is adopted to calculate the feature word weight.
3. The method of claim 1, wherein the method comprises the following steps: in the step (2.4), the self-connection method adopts a candidate connection method given in Apriori algorithm.
4. The method of claim 1, wherein the method comprises the following steps: in the step (2.12), the deep learning tool refers to a Skip-gram model of the Google open source word vector tool word2 vec.
CN202010774429.1A 2020-08-04 2020-08-04 Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning Active CN111897925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010774429.1A CN111897925B (en) 2020-08-04 2020-08-04 Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning


Publications (2)

Publication Number Publication Date
CN111897925A CN111897925A (en) 2020-11-06
CN111897925B true CN111897925B (en) 2022-08-26

Family

ID=73245724


Country Status (1)

Country Link
CN (1) CN111897925B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors
CN106530360A (en) * 2016-11-01 2017-03-22 复旦大学 Complementary color wavelet color image processing method
CN106570183A (en) * 2016-11-14 2017-04-19 宜宾学院 Color picture retrieval and classification method
CN109582769A (en) * 2018-11-26 2019-04-05 广西财经学院 Association mode based on weight sequence excavates and the text searching method of consequent extension
CN109739952A (en) * 2018-12-30 2019-05-10 广西财经学院 Merge the mode excavation of the degree of association and chi-square value and the cross-language retrieval method of extension
CN110879834A (en) * 2019-11-27 2020-03-13 福州大学 Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8364470B2 (en) * 2008-01-15 2013-01-29 International Business Machines Corporation Text analysis method for finding acronyms


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Stress-Strength Time-Varying Correlation Interference Model for Structural Reliability Analysis Using Copulas; Jianchun Zhang et al.; IEEE Transactions on Reliability; 2017-05-02; full text *
Biomedical literature retrieval methods and question-answering systems; Pan Haojie et al.; Technology Intelligence Engineering (情报工程); 2016-10-15 (No. 05); full text *


Similar Documents

Publication Publication Date Title
Ibrihich et al. A Review on recent research in information retrieval
Pan et al. An improved TextRank keywords extraction algorithm
Mahata et al. Theme-weighted ranking of keywords from text documents using phrase embeddings
CN109299278B (en) Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
Charnine et al. Measuring of" idea-based" influence of scientific papers
Wang et al. Named entity recognition method of brazilian legal text based on pre-training model
Zehtab-Salmasi et al. FRAKE: fusional real-time automatic keyword extraction
CN111897922A (en) Chinese query expansion method based on pattern mining and word vector similarity calculation
CN109739953B (en) Text retrieval method based on chi-square analysis-confidence framework and back-part expansion
CN111897925B (en) Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning
Abimbola et al. A Noun-Centric Keyphrase Extraction Model: Graph-Based Approach
CN109299292B (en) Text retrieval method based on matrix weighted association rule front and back part mixed expansion
CN111897924A (en) Text retrieval method based on association rule and word vector fusion expansion
Horasan et al. Alternate low-rank matrix approximation in latent semantic analysis
Li et al. Deep learning and semantic concept spaceare used in query expansion
CN111897921A (en) Text retrieval method based on word vector learning and mode mining fusion expansion
CN111897928A (en) Chinese query expansion method for embedding expansion words into query words and counting expansion word union
Li et al. Complex query recognition based on dynamic learning mechanism
CN111897926A (en) Chinese query expansion method integrating deep learning and expansion word mining intersection
Pinto et al. QUALIBETA at the NTCIR-11 Math 2 Task: An Attempt to Query Math Collections.
Heidary et al. Automatic text summarization using genetic algorithm and repetitive patterns
Kuhr et al. Context-specific adaptation of subjective content descriptions
CN111897927B (en) Chinese query expansion method integrating Copulas theory and association rule mining
CN109684465B (en) Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison
Liu et al. Modelling and Implementation of a Knowledge Question-answering System for Product Quality Problem Based on Knowledge Graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant