CN111897925B - Pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning - Google Patents


Info

Publication number
CN111897925B
CN111897925B (granted from application CN202010774429.1A)
Authority
CN
China
Prior art keywords
ret
word
expansion
pseudo
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010774429.1A
Other languages
Chinese (zh)
Other versions
CN111897925A (en)
Inventor
黄名选
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics filed Critical Guangxi University of Finance and Economics
Priority to CN202010774429.1A priority Critical patent/CN111897925B/en
Publication of CN111897925A publication Critical patent/CN111897925A/en
Application granted granted Critical
Publication of CN111897925B publication Critical patent/CN111897925B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/3338 Query expansion
    • G06F16/334 Query execution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning. First, a user query retrieves the original Chinese document set to obtain initially retrieved documents, from which an initial retrieval document set is constructed; the top m initially retrieved documents are taken as the pseudo-relevance feedback document set. Candidate expansion words are mined from the pseudo-relevance feedback document set with an association rule mining method based on the CSC (Copulas-based Support and Confidence) framework, building a candidate expansion word set. The vector cosine similarity between the candidate expansion words and the original query is then computed, and candidates not lower than the similarity threshold are extracted as final expansion words. Finally, the final expansion words are combined with the original query into a new query, and the original document set is retrieved again, realizing query expansion. Experimental results show that the expansion retrieval performance of the method exceeds that of existing query expansion methods based on association patterns and word vectors; the method effectively reduces query topic drift and word mismatching, improves information retrieval performance, and has good application value and promotion prospects.

Description

Pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning
Technical Field
The invention relates to a pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning, and belongs to the technical field of information retrieval.
Background
With the development of network technology and the arrival of the big data era, it is increasingly difficult for network users to quickly and accurately obtain the information they need from massive network resources, mainly because current retrieval systems suffer from query topic drift and word mismatching. Query expansion is one of the key technologies for addressing these problems in information retrieval: it modifies the weights of the original query or adds words related to it, producing a new, longer query that describes the semantics or topic implied by the original query more completely and accurately, compensates for the deficiency of the user's query information, and improves the retrieval performance of the information retrieval system.
The core problems of query expansion are the source of the expansion terms and the design of the expansion model. In recent decades, researchers have studied query expansion models from different perspectives, introducing association pattern mining and word vector learning into the field and achieving good results in relevance feedback expansion based on association pattern mining and in query expansion based on deep learning. For example, Jabri et al. proposed an expansion word mining method based on association rule graphs (see: Jabri S, Dahbi A, Gadi T. A graph-based approach for text query expansion using pseudo relevance feedback and association rules mining [J]. International Journal of Electrical & Computer Engineering, 2019(6): 5016-5023.), Kuzi et al. proposed query expansion using word embeddings (see: Kuzi S, Shtok A, Kurland O. Query expansion using word embeddings [C]. Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016), 2016: 1929-1932.), and Bouziri et al. proposed a method for learning query expansion from association rules between terms (see: Bouziri A, Latiri C, Gaussier E, et al. Learning query expansion from association rules between terms [C]. Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), Lisbon, Portugal, 2015: 525-530.) as well as an expansion rule selection method based on learning to rank (see: Bouziri A, Latiri C, Gaussier E. Efficient association rules selection for automatic query expansion [C]. Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2017), Budapest, Hungary, Springer, Cham, LNCS 10762, 2017: 563-574.). Experimental results show that these query expansion methods are effective and perform well in improving information retrieval performance.
However, existing query expansion methods have not fully solved query topic drift, word mismatching, and related technical problems in information retrieval. Query expansion based on association rule mining mainly obtains expansion words through mining techniques based on statistical analysis and thus ignores the semantic information the expansion words carry in context. Addressing this defect, the invention fuses association patterns with word vectors rich in contextual semantic information, providing a pseudo-relevance feedback expansion method that integrates association pattern mining and word vector learning.
Disclosure of Invention
The invention aims to provide a pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning for the field of information retrieval, such as practical Chinese search engines and web information retrieval systems, to improve the query performance of information retrieval systems and reduce query topic drift and word mismatching.
The invention adopts the following specific technical scheme:
a pseudo-correlation feedback expansion method for combination of correlation pattern mining and word vector learning comprises the following steps:
step 1, a Chinese user queries and searches an original Chinese document set to obtain a primary check document, and a primary check document set is constructed.
Step 2, extracting m primary inspection documents in the front row as a pseudo-related feedback document set, and mining candidate extension words from the pseudo-related feedback document set by adopting an association rule mining method based on a CSC (computer-aided Support and Confidence) frame, wherein the CSC (computer-aided Support and Confidence) frame refers to a Support degree-Confidence frame based on a Copulas theory, and the association rule mining method based on the CSC frame specifically comprises the following steps:
and (2.1) extracting m pieces of primary detection documents in the front row from the primary detection document set as pseudo-related feedback documents, constructing a pseudo-related feedback document set, carrying out Chinese word segmentation, Chinese stop words removal and characteristic word extraction preprocessing on the primary detection pseudo-related feedback document set, calculating a weight of the characteristic words, and finally constructing a pseudo-related feedback Chinese document library and a Chinese characteristic word library.
The invention adopts the TF-IDF (term frequency-inverse document frequency) weighting technique (see: Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval (Chinese translation by Wang Zhijin et al.). China Machine Press, 2005: 21-22.) to compute the feature word weights.
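The TF-IDF weighting of the feature words can be sketched as follows (a minimal illustration: the tokenized toy documents, the normalization by document length, and the natural-log IDF are assumptions, not the patent's exact formulation):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for pre-tokenized documents.

    docs: list of token lists (already segmented, stop words removed).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

A term that appears in every document gets IDF 0 and is effectively discarded, which is the usual behavior of this weighting.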
(2.2) Extract feature words from the Chinese feature word library as the 1_candidate itemsets C_1.
(2.3) Compute C_1's CSC-framework support CSC_Sup(C_1); if CSC_Sup(C_1) is not lower than the minimum support threshold ms, take C_1 as a 1_frequent itemset L_1 and add it to the frequent itemset set FIS (Frequent ItemSet).
CSC_Sup (CSC-based Support) denotes the support under the CSC framework. CSC_Sup(C_1) is computed by equation (1) [formula image not reproduced], which combines, through a Copulas-based function involving exp, the frequency ratio n(C_1)/DocNum and the weight ratio w(C_1)/ItemsW. In equation (1), n(C_1) denotes the occurrence frequency of the 1_candidate itemset C_1 in the pseudo-relevance feedback Chinese document library, DocNum the total number of documents in that library, w(C_1) the itemset weight of C_1 in the library, ItemsW the accumulated weight of all Chinese feature words in the library, and exp the exponential function with the natural constant e as base.
(2.4) Using the self-join method, join the (k-1)_frequent itemsets L_(k-1) to derive the k_candidate itemsets C_k, k ≥ 2.
The self-join method adopts the candidate itemset join method given in the Apriori algorithm (see: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.).
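The self-join step can be illustrated as follows: two (k-1)_frequent itemsets that agree on all but their last item are joined into one k_candidate itemset. This is a minimal sketch of the Apriori join, with itemsets kept as sorted tuples; it omits the Apriori prune step.

```python
def self_join(frequent_k_minus_1):
    """Apriori self-join: generate k_candidates from (k-1)_frequent itemsets.

    frequent_k_minus_1: list of sorted tuples, each of length k-1.
    Returns the sorted list of k_candidate itemsets.
    """
    candidates = set()
    items = sorted(frequent_k_minus_1)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            # Join only itemsets sharing their first k-2 items.
            if a[:-1] == b[:-1]:
                candidates.add(a + (b[-1],))
    return sorted(candidates)
```

For example, joining the 2_frequent itemsets ("a","b") and ("a","c") yields the 3_candidate ("a","b","c").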
(2.5) When a 2_candidate itemset C_2 is mined: if C_2 does not contain any original query term, delete it; if C_2 contains an original query term, retain it and go to step (2.6). When k_candidate itemsets C_k with k ≥ 3 are mined, go directly to step (2.6).
(2.6) Compute C_k's CSC-framework support CSC_Sup(C_k); if CSC_Sup(C_k) is not lower than ms, take C_k as a k_frequent itemset L_k and add it to the frequent itemset set FIS.
CSC_Sup(C_k) is computed by equation (2) [formula image not reproduced], which, like equation (1), combines the frequency ratio n(C_k)/DocNum and the weight ratio w(C_k)/ItemsW through a Copulas-based function involving exp. In equation (2), n(C_k) denotes the occurrence frequency of the k_candidate itemset C_k in the pseudo-relevance feedback Chinese document library and w(C_k) its itemset weight in that library; DocNum and ItemsW are defined as in equation (1), and exp denotes the exponential function with the natural constant e as base.
(2.7) Increment k by 1 and return to step (2.4), continuing the subsequent steps until L_k is the empty set; frequent itemset mining then ends, and the method goes to step (2.8).
(2.8) Take any L_k, k ≥ 2, out of FIS.
(2.9) From L_k, extract association rules Q_i → Ret_j and compute their CSC-framework confidence CSC_Con(Q_i → Ret_j), where i ≥ 1, j ≥ 1, Q_i ∩ Ret_j = ∅, and Q_i ∪ Ret_j = L_k. Ret_j is a proper-subset itemset containing no query terms, Q_i is a proper-subset itemset containing query terms, and Q is the original query term set.
CSC_Con (CSC-based Confidence) denotes the confidence under the CSC framework. CSC_Con(Q_i → Ret_j) is computed by equation (3) [formula image not reproduced], which combines, through a Copulas-based function, the occurrence frequency n(L_k) and itemset weight w(L_k) of the k_frequent itemset L_k in the pseudo-relevance feedback Chinese document library with the occurrence frequency n(Q_i) and itemset weight w(Q_i) of its proper-subset itemset Q_i in the same library.
(2.10) Extract every association rule Q_i → Ret_j whose CSC_Con(Q_i → Ret_j) is not lower than the minimum confidence threshold mc and add it to the association rule set AR (Association Rule). Then return to step (2.9), re-extract other proper-subset itemsets Ret_j and Q_i from L_k, and continue the subsequent steps, looping until every proper-subset itemset of L_k has been taken out exactly once. Next go to step (2.8), take any other L_k from FIS, and carry out a new round of association rule pattern mining; repeat until every k_frequent itemset L_k in FIS has been taken out exactly once. Association rule pattern mining then ends, and the method goes to step (2.11).
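The loop of steps (2.8) to (2.10) amounts to enumerating, for each frequent itemset, every split into an antecedent Q_i containing query terms and a consequent Ret_j containing none, and keeping the splits whose confidence reaches mc. A minimal sketch, with a generic conf_fn standing in for the CSC-framework confidence CSC_Con:

```python
from itertools import combinations

def extract_rules(frequent_itemsets, query_terms, conf_fn, mc):
    """Enumerate association rules Qi -> Retj from each frequent itemset Lk.

    Qi is a non-empty proper subset containing at least one query term;
    Retj is the complementary part and contains no query terms.
    conf_fn(qi, ret) stands in for CSC_Con; rules with confidence >= mc
    are kept as (antecedent, consequent, confidence) triples.
    """
    rules = []
    for lk in frequent_itemsets:
        for r in range(1, len(lk)):
            for qi in combinations(lk, r):
                ret = tuple(t for t in lk if t not in qi)
                if not any(t in query_terms for t in qi):
                    continue  # Qi must contain a query term
                if any(t in query_terms for t in ret):
                    continue  # Retj must contain no query terms
                conf = conf_fn(qi, ret)
                if conf >= mc:
                    rules.append((qi, ret, conf))
    return rules
```

With a 3_frequent itemset ("q1","w1","w2") and query term "q1", the valid splits are q1→(w1,w2), (q1,w1)→w2, and (q1,w2)→w1.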
(2.11) Extract the rule consequents Ret_j from the association rule set AR as candidate expansion words, obtaining the candidate expansion word set CET (Candidate Expansion Term), and compute the candidate expansion word weights w_Ret; then go to step (2.12).
CET is given by equation (4):
CET = {Ret_1, Ret_2, …, Ret_i, …}    (4)
where Ret_i (i ≥ 1) denotes the i-th candidate expansion word.
The candidate expansion word weight w_Ret is computed by equation (5):
w_Ret = max(CSC_Con(Q_i → Ret_j))    (5)
where max() takes the maximum association rule confidence; when the same expansion word appears as the consequent of several association rule patterns, the maximum of their confidences is taken as its weight.
(2.12) Perform Chinese word segmentation and Chinese stop-word removal on the initial retrieval document set, then apply word vector semantic learning training to it with a deep learning tool to obtain the word vector set of the feature words.
The deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec (see https://code.google.com/p/word2vec/).
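For reference, the Skip-gram model learns word vectors by predicting context words from each center word. The (center, context) training pairs it draws from a segmented sentence can be illustrated as follows (the window size is a free parameter; this sketch shows only pair generation, not word2vec's actual training):

```python
def skipgram_pairs(tokens, window=2):
    """Generate the (center, context) pairs the Skip-gram model trains on.

    tokens: one segmented sentence as a list of words.
    window: how many words on each side count as context.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

For the sentence ["a", "b", "c"] with window 1, the pairs are (a,b), (b,a), (b,c), (c,b).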
(2.13) In the word vector set of the feature words, compute the vector cosine similarity VecSim(Ret_i, q_j), 1 ≤ j ≤ r, between each candidate expansion word Ret_i and every query term q_j of the original query term set Q = (q_1, q_2, …, q_r), and accumulate these similarities as the candidate expansion word's total vector similarity VecSim(Ret_i, Q).
VecSim(Ret_i, q_j) is given by equation (6):
VecSim(Ret_i, q_j) = (vRet_i · vq_j) / (|vRet_i| × |vq_j|)    (6)
where vRet_i denotes the word vector of the i-th candidate expansion word Ret_i and vq_j the word vector of the j-th query term q_j.
VecSim(Ret_i, Q) is computed by equation (7):
VecSim(Ret_i, Q) = Σ_{j=1..r} VecSim(Ret_i, q_j)    (7)
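The computation described by formulas (6) and (7) can be sketched in plain Python (the toy vectors below are illustrative; real vectors would come from the trained word2vec model):

```python
import math

def cosine(u, v):
    """Vector cosine similarity, as in formula (6)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def total_similarity(cand_vec, query_vecs):
    """Accumulated similarity to all query terms, as in formula (7)."""
    return sum(cosine(cand_vec, qv) for qv in query_vecs)
```

A candidate whose vector matches one query term exactly and is orthogonal to another gets a total similarity of 1.0.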
(2.14) Extract the candidate expansion words whose VecSim(Ret_i, Q) is not lower than the vector similarity threshold minVSim as the final expansion words of the original query term set Q, obtaining the final expansion term set FETS (Final Expansion Term Set), and compute the final expansion word weights w_Fet; then go to step (2.15).
FETS is given by equation (8):
FETS = {Ret_i ∈ CET | VecSim(Ret_i, Q) ≥ minVSim}    (8)
The final expansion word weight w_Fet is determined jointly by the candidate expansion word weight w_Ret and the vector similarity VecSim(Ret_i, Q); w_Fet is computed by equation (9) [formula image not reproduced].
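The final selection step can be sketched as follows; since the text states only that w_Fet is determined jointly by w_Ret and VecSim(Ret_i, Q), the product used below is an illustrative assumption, not the patent's formula (9):

```python
def select_final_expansions(candidates, min_vsim):
    """Keep candidates whose total similarity reaches minVSim (formula (8)).

    candidates: {word: (w_ret, vec_sim)} where w_ret is the rule-based
    weight and vec_sim the total vector similarity to the query.
    Returns {word: final_weight}; the product w_ret * vec_sim is an
    assumed stand-in for the patent's combination in formula (9).
    """
    return {
        word: w_ret * vec_sim
        for word, (w_ret, vec_sim) in candidates.items()
        if vec_sim >= min_vsim
    }
```

A candidate below the similarity threshold is dropped regardless of its rule confidence.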
(2.15) Combine the final expansion words with the original query into a new query and retrieve the original Chinese document set again, realizing query expansion.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning. It fuses association patterns based on statistical analysis with word vectors carrying contextual semantic information: candidate expansion words are mined from the pseudo-relevance feedback document set to build a candidate expansion word set, the vector cosine similarity between each candidate expansion word and each query term is computed, and the candidates whose similarity is not lower than the threshold are extracted as final expansion words. Experimental results show that the method improves information retrieval performance, outperforms comparable methods of recent years, and has good application value and promotion prospects.
(2) Four similar query expansion methods from recent years were selected as comparison methods, with the Chinese corpus of the international standard dataset NTCIR-5 CLIR as experimental data. The results show that the MAP value of the method exceeds that of the baseline retrieval (BLR) and is in most cases higher than those of the four comparison methods, indicating that the method retrieves better than both the baseline and the comparison methods, improves information retrieval performance, reduces query topic drift and word mismatching in information retrieval, and has high application value and broad promotion prospects.
Drawings
Fig. 1 is the general flow diagram of the pseudo-relevance feedback expansion method integrating association pattern mining and word vector learning according to the invention.
Detailed Description
Firstly, in order to better explain the technical scheme of the invention, the related concepts related to the invention are introduced as follows:
1. item set
In text mining, a text document is regarded as a transaction, each feature word in the document is called an item, a set of feature word items is called an item set, and the number of all items in the item set is called an item set length. The k _ term set refers to a term set containing k items, k being the length of the term set.
2. Association rule antecedent and consequent
Let x and y be arbitrary feature term sets, and an implication of the form x → y is called an association rule, where x is called a rule antecedent and y is called a rule consequent.
3. CSC framework
The CSC (copula-based Support and Confidence) framework refers to a Support-Confidence framework based on copula theory.
4. CSC framework based support and confidence
Copulas theory (see: Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8(1): 229-231.) is used to describe the correlation between variables; distributions of arbitrary form can be combined and connected into a valid multivariate distribution function.
The invention uses a Copula function to unify the frequency and the weight of a feature term set into the support and confidence of feature term association patterns, yielding a support-confidence framework based on Copulas theory, the CSC framework. The support CSC_Sup (CSC-based Support) and confidence CSC_Con (CSC-based Confidence) in the CSC framework are computed as follows:
The CSC-framework support CSC_Sup(T_1 ∪ T_2) of a feature term set (T_1 ∪ T_2) is computed by equation (10) [formula image not reproduced], which combines, through a Copulas-based function involving exp, the frequency ratio n(T_1 ∪ T_2)/DocNum and the weight ratio w(T_1 ∪ T_2)/ItemsW. Here n(T_1 ∪ T_2) denotes the occurrence frequency of the itemset (T_1 ∪ T_2) in the pseudo-relevance feedback Chinese document library, w(T_1 ∪ T_2) its itemset weight in that library, DocNum the total number of documents in the library, ItemsW the accumulated weight of all Chinese feature words in the library, and exp the exponential function with the natural constant e as base.
The CSC-framework confidence CSC_Con(T_1 → T_2) of a feature word association rule T_1 → T_2 is computed by equation (11) [formula image not reproduced], which combines, through a Copulas-based function, the occurrence frequency n(T_1 ∪ T_2) and itemset weight w(T_1 ∪ T_2) of the itemset (T_1 ∪ T_2) in the pseudo-relevance feedback Chinese document library with the occurrence frequency n(T_1) and itemset weight w(T_1) of the itemset T_1 in the same library.
The invention is further explained below with reference to the drawings and specific comparative experiments.
As shown in Fig. 1, the pseudo-relevance feedback expansion method of the invention, integrating association pattern mining and word vector learning, comprises the following steps:
Step 1: a Chinese user query retrieves the original Chinese document set, obtaining initially retrieved documents that form the initial retrieval document set.
Step 2: extract the top m initially retrieved documents as the pseudo-relevance feedback document set, and mine candidate expansion words from it with an association rule mining method based on the CSC (Copulas-based Support and Confidence) framework, i.e., a support-confidence framework based on Copulas theory. The association rule mining method based on the CSC framework comprises the following steps:
(2.1) Extract the top m initially retrieved documents from the initial retrieval document set as pseudo-relevance feedback documents and construct the pseudo-relevance feedback document set; perform Chinese word segmentation, Chinese stop-word removal, and feature word extraction preprocessing on it; compute the feature word weights; and finally construct the pseudo-relevance feedback Chinese document library and the Chinese feature word library.
The invention adopts TF-IDF weighting technology to calculate the weight of the feature words.
(2.2) Extract feature words from the Chinese feature word library as the 1_candidate itemsets C_1.
(2.3) Compute C_1's CSC-framework support CSC_Sup(C_1); if CSC_Sup(C_1) is not lower than the minimum support threshold ms, take C_1 as a 1_frequent itemset L_1 and add it to the frequent itemset set FIS (Frequent ItemSet).
CSC_Sup (CSC-based Support) denotes the support under the CSC framework. CSC_Sup(C_1) is computed by equation (1) [formula image not reproduced], which combines, through a Copulas-based function involving exp, the frequency ratio n(C_1)/DocNum and the weight ratio w(C_1)/ItemsW. In equation (1), n(C_1) denotes the occurrence frequency of the 1_candidate itemset C_1 in the pseudo-relevance feedback Chinese document library, DocNum the total number of documents in that library, w(C_1) the itemset weight of C_1 in the library, ItemsW the accumulated weight of all Chinese feature words in the library, and exp the exponential function with the natural constant e as base.
(2.4) Using the self-join method, join the (k-1)_frequent itemsets L_(k-1) to derive the k_candidate itemsets C_k, k ≥ 2.
The self-join method adopts the candidate itemset join method given in the Apriori algorithm.
(2.5) When a 2_candidate itemset C_2 is mined: if C_2 does not contain any original query term, delete it; if C_2 contains an original query term, retain it and go to step (2.6). When k_candidate itemsets C_k with k ≥ 3 are mined, go directly to step (2.6).
(2.6) Compute C_k's CSC-framework support CSC_Sup(C_k); if CSC_Sup(C_k) is not lower than ms, take C_k as a k_frequent itemset L_k and add it to the frequent itemset set FIS.
CSC_Sup(C_k) is computed by equation (2) [formula image not reproduced], which, like equation (1), combines the frequency ratio n(C_k)/DocNum and the weight ratio w(C_k)/ItemsW through a Copulas-based function involving exp. In equation (2), n(C_k) denotes the occurrence frequency of the k_candidate itemset C_k in the pseudo-relevance feedback Chinese document library and w(C_k) its itemset weight in that library; DocNum and ItemsW are defined as in equation (1), and exp denotes the exponential function with the natural constant e as base.
(2.7) Increment k by 1 and return to step (2.4), continuing the subsequent steps until L_k is the empty set; frequent itemset mining then ends, and the method goes to step (2.8).
(2.8) Take any L_k, k ≥ 2, out of FIS.
(2.9) From L_k, extract association rules Q_i → Ret_j and compute their CSC-framework confidence CSC_Con(Q_i → Ret_j), where i ≥ 1, j ≥ 1, Q_i ∩ Ret_j = ∅, and Q_i ∪ Ret_j = L_k. Ret_j is a proper-subset itemset containing no query terms, Q_i is a proper-subset itemset containing query terms, and Q is the original query term set.
CSC_Con (CSC-based Confidence) denotes the confidence under the CSC framework. CSC_Con(Q_i → Ret_j) is computed by equation (3) [formula image not reproduced], which combines, through a Copulas-based function, the occurrence frequency n(L_k) and itemset weight w(L_k) of the k_frequent itemset L_k in the pseudo-relevance feedback Chinese document library with the occurrence frequency n(Q_i) and itemset weight w(Q_i) of its proper-subset itemset Q_i in the same library.
(2.10) Extract every association rule Q_i → Ret_j whose CSC_Con(Q_i → Ret_j) is not lower than the minimum confidence threshold mc and add it to the association rule set AR (Association Rule). Then return to step (2.9), re-extract other proper-subset itemsets Ret_j and Q_i from L_k, and continue the subsequent steps, looping until every proper-subset itemset of L_k has been taken out exactly once. Next go to step (2.8), take any other L_k from FIS, and carry out a new round of association rule pattern mining; repeat until every k_frequent itemset L_k in FIS has been taken out exactly once. Association rule pattern mining then ends, and the method goes to step (2.11).
(2.11) Extract the association rule consequents Ret_j from the association rule set AR as candidate expansion words, obtaining the candidate expansion word set CET (Candidate Expansion Term set), and calculate the candidate expansion word weight w_Ret; then proceed to step (2.12).
The CET is shown in formula (4):

CET = {Ret_1, Ret_2, …, Ret_i}    (4)

In formula (4), Ret_i denotes the i-th candidate expansion word, i ≥ 1.
The candidate expansion word weight w_Ret is calculated by formula (5):

w_Ret = max(CSC_Con(Q_i → Ret_j))    (5)
In formula (5), max() takes the maximum confidence over the association rules: when the same rule expansion word appears in several association rule patterns at the same time, the maximum confidence is taken as that rule expansion word's weight.
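The construction of CET and w_Ret in step (2.11) and formula (5) can be sketched as follows, taking rule triples in the (antecedent, consequent, confidence) form produced by any miner:

```python
def candidate_expansion_terms(rules):
    """Collect rule consequent terms as candidate expansion words (CET);
    when a term appears in several association rules, its weight w_Ret is
    the maximum rule confidence, as formula (5) prescribes."""
    w_ret = {}
    for _antecedent, consequent, conf in rules:
        for term in consequent:
            w_ret[term] = max(w_ret.get(term, 0.0), conf)
    return w_ret  # keys form the CET, values are the weights w_Ret
```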
(2.12) Perform Chinese word segmentation and Chinese stop word removal on the initially retrieved document set, and carry out word vector semantic learning training on it with a deep learning tool to obtain the word vector set of the characteristic words.
The deep learning tool used by the invention is the Skip-gram model of Google's open-source word vector tool word2vec.
(2.13) In the word vector set of the characteristic words, calculate the vector cosine similarity VecSim(Ret_i, q_j), 1 ≤ j ≤ r, between each candidate expansion word Ret_i and every query term q_1, q_2, …, q_r of the original query term set Q = (q_1, q_2, …, q_r), and accumulate the candidate expansion word's vector similarity to each query term as its total vector similarity VecSim(Ret_i, Q).
VecSim(Ret_i, q_j) is shown in formula (6):

VecSim(Ret_i, q_j) = (vRet_i · vq_j) / (‖vRet_i‖ ‖vq_j‖)    (6)

In formula (6), vRet_i denotes the word vector of the i-th candidate expansion word Ret_i and vq_j denotes the word vector of the j-th query term q_j.
VecSim(Ret_i, Q) is calculated by formula (7):

VecSim(Ret_i, Q) = Σ_{j=1}^{r} VecSim(Ret_i, q_j)    (7)
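Formulas (6) and (7) — vector cosine similarity and its accumulation over the query terms — can be sketched as:

```python
import numpy as np

def vec_sim(v_ret, v_q):
    """Vector cosine similarity between a candidate expansion word and one
    query term, formula (6)."""
    return float(np.dot(v_ret, v_q) /
                 (np.linalg.norm(v_ret) * np.linalg.norm(v_q)))

def total_vec_sim(v_ret, query_vectors):
    """Total vector similarity of the candidate against the whole original
    query term set Q, formula (7): the sum over all query terms."""
    return sum(vec_sim(v_ret, v_q) for v_q in query_vectors)
```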
(2.14) Extract the candidate expansion words whose VecSim(Ret_i, Q) is not lower than the vector similarity threshold minVSim as the final expansion words of the original query term set Q, obtaining the Final Expansion Term Set (FETS); calculate the final expansion word weight w_Fet, then proceed to step (2.15).
The final expansion word set FETS is shown in formula (8):

FETS = {Ret_1, Ret_2, …, Ret_i}, where VecSim(Ret_l, Q) ≥ minVSim for l ∈ (1, 2, …, i)    (8)
The final expansion word weight w_Fet is composed of the candidate expansion word weight w_Ret and the vector similarity VecSim(Ret_i, Q); w_Fet is calculated by formula (9), which appears as an image in the original.
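Step (2.14) and formulas (8)–(9) can be sketched as below. Since formula (9) appears only as an image in the source, the combination of w_Ret with VecSim(Ret, Q) is assumed here to be a simple product — an illustrative choice, not the patent's definition:

```python
def final_expansion(w_ret, total_sims, min_vsim):
    """Select the final expansion word set FETS (formula (8)) and weight it.
    w_ret: candidate weights from formula (5); total_sims: VecSim(Ret, Q)
    values from formula (7). The weight combination below is a HYPOTHETICAL
    stand-in for formula (9)."""
    fets = {}
    for term, sim in total_sims.items():
        if sim >= min_vsim:                 # threshold test of formula (8)
            fets[term] = w_ret[term] * sim  # assumed combination for w_Fet
    return fets
```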
(2.15) Combine the expansion words with the original query into a new query and retrieve the original Chinese document set again, thereby realizing query expansion.
Experimental design and results:
We compare the method of the invention experimentally with existing methods of the same kind to show its effectiveness.
1. Experimental environment and experimental data:
The experimental data of the invention is the NTCIR-5 CLIR Chinese text corpus (see http://research.nii.ac.jp/ntcir/data/data-en.html), which comprises 8 data sets with 901,446 Chinese documents in total; the specific information is shown in Table 1. The corpus is an internationally used standard corpus consisting of a document set, a query set and result sets: 50 Chinese queries over 4 types of query topic, and result sets under 2 evaluation criteria. The retrieval experiments are completed with the Description (abbreviated Desc) and Title query topics; a Title query is a short query that describes the query topic briefly with nouns and noun phrases, while a Desc query is a long query that describes the query topic briefly in sentence form. The result sets have the Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant and partially relevant to the query) evaluation criteria.
The experimental data preprocessing comprises word segmentation and stop word removal.

The retrieval evaluation metric for the experimental results is MAP (Mean Average Precision).
Table 1. Original experimental corpora of the invention and their document counts (table given as an image in the original)
2. Baseline retrieval and comparison methods:

The basic retrieval environment for the experiments is built with Lucene.

The baseline retrieval BLR (Baseline Retrieval) is the retrieval result obtained by submitting the original query to Lucene.
The comparative method is described as follows:
QE_WAPM (Query Expansion based on Weighted Association Pattern Mining): mines rule expansion words with the weighted association pattern mining technique of the literature (Huang Mingxuan. Cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-318), with parameters mc = 0.1 and mi = 0.0001; the experimental results are the averages over ms = 0.004, 0.005, 0.006 and 0.007.
QE_WPNPM (Query Expansion based on Weighted Positive and Negative Pattern Mining): mines rule expansion words with the fully weighted positive and negative association pattern mining technique of the literature (Huang Mingxuan, Jian Caoqing. Vietnamese-English cross-language query translation expansion based on fully weighted positive and negative association pattern mining [J]. Acta Electronica Sinica, 2018, 46(12): 3029-). The parameters are mc = 0.1, α = 0.3, minPR = 0.1 and minNR = 0.01; the experimental results are the averages over ms = 0.10, 0.11, 0.12 and 0.13.
QE_WMSM (Query Expansion based on Weighted Multiple Supports Mining): mines rule expansion words with the multi-support-threshold weighted frequent pattern mining technique of the literature (Zhang H R, Zhang J W, Wei X Y, et al. A new frequent pattern mining algorithm with weighted multiple minimum supports [J]. Intelligent Automation & Soft Computing, 2017, 23(4): 605-612), with parameters mc = 0.1, LMS = 0.2, HMS = 0.25 and WT = 0.1; the experimental results are the averages over ms = 0.1, 0.15, 0.2 and 0.25.
QE_W2Vec (Query Expansion based on Word Embedding): uses the word vector based query expansion method of the literature (Kan Linyuan, et al. Research on a word vector method for patent query expansion [J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 972-980). Experimental parameters: k = 60, α = 0.1.
The word vector semantic learning training parameters of the Skip-gram model used by the invention are: batch_size = 128, embedding_size = 300, skip_window = 2, num_skip = 4, num_sampled = 64.
3. Experimental method and results:

The experiments of the invention were performed on the 8 data sets with the 50 Chinese queries of the NTCIR-5 CLIR corpus, obtaining the MAP averages of the baseline retrieval BLR, the comparison methods, and the method of the invention, as shown in Tables 2 and 3.
Table 2. Retrieval performance MAP values of the method of the invention versus the baseline retrieval and comparison methods (Title queries) (table given as an image in the original)

Table 3. Retrieval performance MAP values of the method of the invention versus the baseline retrieval and comparison methods (Desc queries) (table given as an image in the original)
The experimental results show that the MAP values of the method of the invention are higher than those of the baseline retrieval BLR and mostly improved over the 4 comparison methods, i.e. its expansion retrieval performance exceeds that of the baseline retrieval and of the comparable methods. This indicates that the method is effective, can genuinely improve information retrieval performance, and has high application value and broad prospects for popularization.

Claims (4)

1. A pseudo-relevance feedback expansion method combining association pattern mining and word vector learning, characterized by comprising the following steps:
Step 1: a Chinese user query retrieves the original Chinese document set to obtain initially retrieved documents, from which an initially retrieved document set is constructed;

Step 2: extract the top m initially retrieved documents as a pseudo-relevance feedback document set, and mine candidate expansion words from it with an association rule mining method based on the CSC framework, the CSC framework being a support-confidence framework based on Copulas theory; the association rule mining method based on the CSC framework specifically comprises:

(2.1) extracting the top m initially retrieved documents from the initially retrieved document set as pseudo-relevance feedback documents to construct a pseudo-relevance feedback document set; performing Chinese word segmentation, Chinese stop word removal and characteristic word extraction preprocessing on this set, calculating the characteristic word weights, and finally constructing a pseudo-relevance feedback Chinese document library and a Chinese characteristic word library;
(2.2) extracting characteristic words from the Chinese characteristic word library as 1_candidate item sets C_1;

(2.3) calculating the support CSC_Sup(C_1) of C_1 based on the CSC framework; if CSC_Sup(C_1) is not lower than the minimum support threshold ms, taking C_1 as a 1_frequent item set L_1 and adding it to the frequent item set FIS;
(2.4) using the self-join method to derive the k_candidate item sets C_k from the join of the (k-1)_frequent item sets L_{k-1}, where k ≥ 2;

(2.5) when mining reaches the 2_candidate item sets C_2: if a C_2 does not contain an original query term, delete it; if a C_2 contains an original query term, keep it; then pass the remaining C_2 to step (2.6); when mining reaches the k_candidate item sets C_k with k ≥ 3, go directly to step (2.6);

(2.6) calculating the support CSC_Sup(C_k) of C_k based on the CSC framework; if CSC_Sup(C_k) is not lower than ms, taking C_k as a k_frequent item set L_k and adding it to the frequent item set FIS;

(2.7) adding 1 to k and returning to step (2.4), executing the subsequent steps in order until the generated L_k is an empty set; frequent item set mining then finishes, and the method proceeds to step (2.8);

(2.8) taking any one L_k from the FIS, where k ≥ 2;
(2.9) from L_k, extracting the association rule Q_i → Ret_j and calculating its confidence CSC_Con(Q_i → Ret_j) based on the CSC framework, where i ≥ 1, j ≥ 1, Q_i ∩ Ret_j = ∅, and Q_i ∪ Ret_j = L_k; Ret_j is a proper subset item set containing no query terms, Q_i is a proper subset item set containing query terms, and Q is the original query term set;
(2.10) extracting every association rule Q_i → Ret_j whose CSC_Con(Q_i → Ret_j) is not lower than the minimum confidence threshold mc and adding it to the association rule set AR; returning to step (2.9), re-extracting other proper subset item sets Ret_j and Q_i from L_k and carrying out the subsequent steps in order, looping until each proper subset item set of L_k has been taken exactly once; then returning to step (2.8), taking any other L_k from the FIS, carrying out a new round of association rule pattern mining and again executing the subsequent steps in order; looping in this way until every k_frequent item set L_k in the FIS has been taken exactly once, whereupon association rule pattern mining is finished and the method proceeds to the following step (2.11);
(2.11) extracting the association rule consequents Ret_j from the association rule set AR as candidate expansion words to obtain the candidate expansion word set CET, and calculating the candidate expansion word weight w_Ret, then proceeding to step (2.12);
(2.12) performing Chinese word segmentation and Chinese stop word removal on the initially retrieved document set, and carrying out word vector semantic learning training on it with a deep learning tool to obtain the word vector set of the characteristic words;
(2.13) in the word vector set of the characteristic words, calculating the vector cosine similarity VecSim(Ret_i, q_j), 1 ≤ j ≤ r, between each candidate expansion word Ret_i and every query term q_1, q_2, …, q_r of the original query term set Q = (q_1, q_2, …, q_r), and accumulating the candidate expansion word's vector similarity to each query term as its total vector similarity VecSim(Ret_i, Q);
(2.14) extracting the candidate expansion words whose VecSim(Ret_i, Q) is not lower than the vector similarity threshold minVSim as the final expansion words of the original query term set Q to obtain the final expansion word set FETS, and calculating the final expansion word weight w_Fet, then proceeding to step (2.15);
(2.15) combining the expansion words with the original query into a new query and retrieving the original Chinese document set again, thereby realizing query expansion;
In step (2.3), CSC_Sup(C_1) is calculated as shown in formula (1), which appears as an image in the original. In formula (1), n(C_1) denotes the frequency with which the 1_candidate item set C_1 occurs in the pseudo-relevance feedback Chinese document library, DocNum denotes the total number of documents in that library, w(C_1) denotes the item set weight of C_1 in that library, ItemsW denotes the accumulated weight of all Chinese characteristic words in that library, and exp denotes the exponential function with the natural constant e as base;
In step (2.6), CSC_Sup(C_k) is calculated as shown in formula (2), which appears as an image in the original. In formula (2), n(C_k) denotes the frequency with which the k_candidate item set C_k occurs in the pseudo-relevance feedback Chinese document library, and w(C_k) denotes the item set weight of C_k in that library; DocNum and ItemsW are defined as in formula (1), and exp denotes the exponential function with the natural constant e as base;
In step (2.9), CSC_Con(Q_i → Ret_j) is calculated as shown in formula (3), which appears as an image in the original. In formula (3), n(L_k) denotes the frequency with which the k_frequent item set L_k occurs in the pseudo-relevance feedback Chinese document library, w(L_k) denotes the item set weight of L_k in that library, n(Q_i) denotes the frequency with which the proper subset item set Q_i of L_k occurs in that library, and w(Q_i) denotes the item set weight of Q_i in that library;
In step (2.11), the CET is shown in formula (4):

CET = {Ret_1, Ret_2, …, Ret_i}    (4)

In formula (4), Ret_i denotes the i-th candidate expansion word, i ≥ 1;
The candidate expansion word weight w_Ret is calculated by formula (5):

w_Ret = max(CSC_Con(Q_i → Ret_j))    (5)

In formula (5), max() takes the maximum confidence over the association rules: when the same rule expansion word appears in several association rule patterns at the same time, the maximum confidence is taken as that rule expansion word's weight;
In step (2.13), VecSim(Ret_i, q_j) is shown in formula (6):

VecSim(Ret_i, q_j) = (vRet_i · vq_j) / (‖vRet_i‖ ‖vq_j‖)    (6)

In formula (6), vRet_i denotes the word vector of the i-th candidate expansion word Ret_i and vq_j denotes the word vector of the j-th query term q_j;
VecSim(Ret_i, Q) is calculated by formula (7):

VecSim(Ret_i, Q) = Σ_{j=1}^{r} VecSim(Ret_i, q_j)    (7)
In step (2.14), the final expansion word set FETS is shown in formula (8):

FETS = {Ret_1, Ret_2, …, Ret_i}, where VecSim(Ret_l, Q) ≥ minVSim for l ∈ (1, 2, …, i)    (8)
The final expansion word weight w_Fet is composed of the candidate expansion word weight w_Ret and the vector similarity VecSim(Ret_i, Q); w_Fet is calculated by formula (9), which appears as an image in the original.
2. the method of claim 1, wherein the method comprises the following steps: in the step (2.1), a TF-IDF weighting technology is adopted to calculate the feature word weight.
3. The method of claim 1, wherein the method comprises the following steps: in the step (2.4), the self-connection method adopts a candidate connection method given in Apriori algorithm.
4. The method of claim 1, wherein the method comprises the following steps: in the step (2.12), the deep learning tool refers to a Skip-gram model of the Google open source word vector tool word2 vec.
CN202010774429.1A 2020-08-04 2020-08-04 Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning Active CN111897925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010774429.1A CN111897925B (en) 2020-08-04 2020-08-04 Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning


Publications (2)

Publication Number Publication Date
CN111897925A CN111897925A (en) 2020-11-06
CN111897925B true CN111897925B (en) 2022-08-26

Family

ID=73245724


Country Status (1)

Country Link
CN (1) CN111897925B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors
CN106530360A (en) * 2016-11-01 2017-03-22 复旦大学 Complementary color wavelet color image processing method
CN106570183A (en) * 2016-11-14 2017-04-19 宜宾学院 Color picture retrieval and classification method
CN109582769A (en) * 2018-11-26 2019-04-05 广西财经学院 Association mode based on weight sequence excavates and the text searching method of consequent extension
CN109739952A (en) * 2018-12-30 2019-05-10 广西财经学院 Merge the mode excavation of the degree of association and chi-square value and the cross-language retrieval method of extension
CN110879834A (en) * 2019-11-27 2020-03-13 福州大学 Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8364470B2 (en) * 2008-01-15 2013-01-29 International Business Machines Corporation Text analysis method for finding acronyms


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Stress-Strength Time-Varying Correlation Interference Model for Structural Reliability Analysis Using Copulas; Jianchun Zhang et al.; IEEE Transactions on Reliability; 2017-05-02; full text *
Biomedical literature retrieval methods and question-answering systems; Pan Haojie et al.; Technology Intelligence Engineering (情报工程); 2016-10-15 (No. 05); full text *


Similar Documents

Publication Publication Date Title
Ibrihich et al. A Review on recent research in information retrieval
Pan et al. An improved TextRank keywords extraction algorithm
Mahata et al. Theme-weighted ranking of keywords from text documents using phrase embeddings
CN109299278B (en) Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
Charnine et al. Measuring of" idea-based" influence of scientific papers
Wang et al. Named entity recognition method of brazilian legal text based on pre-training model
Zehtab-Salmasi et al. FRAKE: fusional real-time automatic keyword extraction
CN111897922A (en) Chinese query expansion method based on pattern mining and word vector similarity calculation
CN109739953B (en) Text retrieval method based on chi-square analysis-confidence framework and back-part expansion
CN111897925B (en) Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning
Abimbola et al. A Noun-Centric Keyphrase Extraction Model: Graph-Based Approach
CN109299292B (en) Text retrieval method based on matrix weighted association rule front and back part mixed expansion
CN111897924A (en) Text retrieval method based on association rule and word vector fusion expansion
Horasan et al. Alternate low-rank matrix approximation in latent semantic analysis
Li et al. Deep learning and semantic concept spaceare used in query expansion
CN111897921A (en) Text retrieval method based on word vector learning and mode mining fusion expansion
CN111897928A (en) Chinese query expansion method for embedding expansion words into query words and counting expansion word union
Li et al. Complex query recognition based on dynamic learning mechanism
CN111897926A (en) Chinese query expansion method integrating deep learning and expansion word mining intersection
Pinto et al. QUALIBETA at the NTCIR-11 Math 2 Task: An Attempt to Query Math Collections.
Heidary et al. Automatic text summarization using genetic algorithm and repetitive patterns
Kuhr et al. Context-specific adaptation of subjective content descriptions
CN111897927B (en) Chinese query expansion method integrating Copulas theory and association rule mining
CN109684465B (en) Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison
Liu et al. Modelling and Implementation of a Knowledge Question-answering System for Product Quality Problem Based on Knowledge Graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant