CN111897922A - Chinese query expansion method based on pattern mining and word vector similarity calculation - Google Patents
- Publication number
- CN111897922A CN111897922A CN202010773432.1A CN202010773432A CN111897922A CN 111897922 A CN111897922 A CN 111897922A CN 202010773432 A CN202010773432 A CN 202010773432A CN 111897922 A CN111897922 A CN 111897922A
- Authority
- CN
- China
- Prior art keywords
- word
- expansion
- vector
- aet
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3325—Reformulation based on results of preceding query
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention proposes a Chinese query expansion method based on pattern mining and word vector similarity calculation. First, an initial retrieval document set is obtained by retrieving the Chinese document set with the user query, and word vector semantic learning is trained on the initial retrieval document set to obtain a word vector set containing query terms and non-query terms. Then, associated expansion words are mined from a pseudo-relevance feedback document set by a Copulas-function-based associated expansion word mining method, and an associated expansion word set is established. Two kinds of vector cosine similarity computation are performed in the word vector set to obtain a word-embedding expansion word set and a word-vector associated expansion word set. Finally, the two sets are fused to obtain the final expansion words, the final expansion words are combined with the original query into a new query, and the document set is retrieved again to realize query expansion. The method integrates association pattern mining with word vector learning, can mine high-quality expansion words, improves information retrieval performance, and has good application value and popularization prospects.
Description
Technical Field
The invention relates to a Chinese query expansion method based on pattern mining and word vector similarity calculation, and belongs to the technical field of information retrieval.
Background
Query expansion modifies the weights of the original query or adds words related to the original query, thereby compensating for the insufficiency of the user's query information and improving the recall and precision of an information retrieval system. It is one of the core technologies for alleviating query topic drift and word mismatch in the field of information retrieval.
In recent decades, with the development of network technology and the arrival of the big-data era, how to precisely retrieve the information users need from massive data resources has become a focus of academia and industry at home and abroad. Query expansion technology has therefore developed considerably, and new query expansion methods have been proposed. For example, Liu et al. (Liu C, Qi R, Liu Q. Query expansion terms based on positive and negative association rules [C]. Proceedings of the Third International Conference on Information Science and Technology (ICIST), IEEE, Yangzhou, Jiangsu, China, 2013: 802) proposed an expansion word mining method based on positive and negative association rule mining; Huang et al. proposed a pseudo-relevance feedback query expansion method based on weighted association rule mining; and Roy et al. (Roy D, Ganguly D, Mitra M, et al. Word vector compositionality based relevance feedback using kernel density estimation [C]. Proceedings of the 25th ACM International Conference on Information and Knowledge Management. New York: ACM Press, 2016: 1281-1290) proposed a word-vector-based relevance feedback method.
However, existing query expansion methods have not yet fully solved the technical problems of query topic drift and word mismatch in information retrieval. Aiming at these deficiencies, the invention integrates association pattern mining with word vector learning and proposes a Chinese query expansion method based on pattern mining and word vector similarity calculation.
Disclosure of Invention
The invention aims to provide a Chinese query expansion method based on pattern mining and word vector similarity calculation for use in the field of information retrieval, such as practical Chinese search engines and web information retrieval systems. It can improve the query performance of an information retrieval system and reduce the problems of query topic drift and word mismatch in information retrieval.
The invention adopts the following specific technical scheme:
a Chinese query expansion method based on pattern mining and word vector similarity calculation comprises the following steps:
Step 1: the user's query is used to retrieve the Chinese document set, obtaining an initial retrieval document set.
Step 2: perform Chinese word segmentation and Chinese stop-word removal on the initial retrieval document set, then apply a deep learning tool to the initial retrieval document set for word vector semantic learning, obtaining a word vector set comprising query terms and non-query terms.
The deep learning tool is the Skip-gram model of Google's open-source word vector tool word2vec (see https://code.google.com/p/word2vec/).
Step 3: calculate and accumulate the vector cosine similarity between each non-query term and all query terms, and extract the top-ranked non-query terms, in descending order of similarity value, as word-embedding expansion words to obtain a word-embedding expansion word set. The specific steps are as follows:
(3.1) Compute, for each non-query term (cet_1, cet_2, …, cet_i) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, the vector cosine similarity VecCos(cet_l, q_s), where 1 ≤ l ≤ i and 1 ≤ s ≤ j.
The vector cosine similarity VecCos(cet_l, q_s) is calculated as shown in formula (1):
VecCos(cet_l, q_s) = (vcet_l · vq_s) / (|vcet_l| × |vq_s|)    (1)
In formula (1), vcet_l denotes the word vector value of the l-th non-query term cet_l, and vq_s denotes the word vector value of the s-th query term q_s.
(3.2) Accumulate the vector cosine similarity of the non-query term with every query term in the original query term set Q; the total similarity value is used as the vector cosine similarity VecSim(cet_l, Q) of the non-query term and the original query term set Q, as shown in formula (2):
VecSim(cet_l, Q) = Σ_{s=1}^{j} VecCos(cet_l, q_s)    (2)
(3.3) Arrange the values VecSim(cet_l, Q) in descending order, extract the top Vm non-query terms as the word-embedding expansion words of the original query term set Q, construct the word-embedding expansion word set WEETS (Word Embedding Expansion Term Set), and calculate the word-embedding expansion word weight w(vet_l); then proceed to step 4.
The word-embedding expansion word set WEETS is shown in formula (3):
WEETS = {vet_1, vet_2, …, vet_Vm}    (3)
In formula (3), vet_l denotes the l-th word-embedding expansion word (l ∈ (1, 2, …, Vm)).
The invention takes the total vector cosine similarity value as the weight w(vet_l) of the word-embedding expansion word, as shown in formula (4):
w(vet_l) = VecSim(vet_l, Q)    (4)
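Steps (3.1)-(3.3) can be sketched as follows, assuming toy 3-dimensional word vectors (all term names, vector values, and the choice Vm = 2 are illustrative, not the patent's data):

```python
# Sketch of step 3: word-embedding expansion words via accumulated
# vector cosine similarity, formulas (1), (2) and (4).
import numpy as np

def vec_cos(a, b):
    """Vector cosine similarity, formula (1)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word vectors: two query terms and three non-query terms.
query_vecs = {"q1": np.array([1.0, 0.0, 0.0]), "q2": np.array([0.0, 1.0, 0.0])}
non_query_vecs = {
    "cet1": np.array([1.0, 1.0, 0.0]),
    "cet2": np.array([0.0, 0.0, 1.0]),
    "cet3": np.array([1.0, 0.2, 0.1]),
}

# Formula (2): accumulate cosine similarity against every query term.
vec_sim = {
    cet: sum(vec_cos(v, qv) for qv in query_vecs.values())
    for cet, v in non_query_vecs.items()
}

Vm = 2  # keep the top Vm non-query terms
weets = sorted(vec_sim, key=vec_sim.get, reverse=True)[:Vm]
weights = {t: vec_sim[t] for t in weets}  # formula (4): weight = total similarity
print(weets)
```

Here "cet1", which points toward both query vectors, accumulates the largest total similarity and ranks first.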
Step 4: extract the top-ranked m initial retrieval documents from the initial retrieval document set as pseudo-relevance feedback documents and construct a pseudo-relevance feedback document set; perform Chinese word segmentation, Chinese stop-word removal and feature word extraction preprocessing on the pseudo-relevance feedback document set, calculate the feature word weights, and finally construct a pseudo-relevance feedback Chinese document library and a Chinese feature word library.
The invention adopts the TF-IDF (term frequency-inverse document frequency) weighting technique (see: Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al., translated by Wang Zhijin et al. Modern Information Retrieval. China Machine Press, 2005: 21-22) to calculate the feature word weights.
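The TF-IDF weighting of step 4 can be sketched as below; the common tf × log(N/df) form is used here, and the exact variant in the cited textbook may differ (documents and terms are toy values):

```python
# Sketch of step 4's feature word weighting: tf-idf over the
# pseudo-relevance feedback documents (toy segmented documents).
import math
from collections import Counter

docs = [
    ["检索", "查询", "扩展", "查询"],
    ["词", "向量", "检索"],
    ["查询", "词", "模式"],
]

N = len(docs)
df = Counter()            # document frequency of each feature word
for d in docs:
    df.update(set(d))

def tf_idf(term, doc):
    tf = doc.count(term)                  # raw term frequency
    return tf * math.log(N / df[term])    # inverse document frequency factor

w = tf_idf("查询", docs[0])  # tf = 2, df = 2, N = 3
print(round(w, 4))
```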
Step 5: mine associated expansion words AET (Association Expansion Term) from the pseudo-relevance feedback document set using the Copulas-function-based expansion word mining method, and establish the associated expansion word set. The Copulas-function-based associated expansion word mining method specifically comprises the following steps:
(5.1) Mine the 1_frequent itemsets L1: extract feature words from the Chinese feature word library to obtain the 1_candidate itemsets C1, and compute the Copulas-function-based support Copulas_Support(C1) of each 1_candidate C1; if Copulas_Support(C1) is not lower than the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS (Frequent ItemSet).
Copulas_Support (Copulas-based Support) denotes the support based on the Copulas function.
Copulas_Support(C1) is calculated as shown in formula (5):
Copulas_Support(C1) = exp(log(frequency(C1)/frequency(allDocs)) + log(weight(C1)/weight(allItems)))    (5)
In formula (5), frequency(C1) denotes the frequency of occurrence of the 1_candidate C1 in the pseudo-relevance feedback Chinese document library; frequency(allDocs) denotes the total number of documents in the pseudo-relevance feedback Chinese document library; weight(C1) denotes the itemset weight of the 1_candidate C1 in the pseudo-relevance feedback Chinese document library; weight(allItems) denotes the accumulated weight of all Chinese feature word items in the pseudo-relevance feedback Chinese document library; exp denotes the exponential function with the natural constant e as base.
(5.2) Mine the k_frequent itemsets Lk: self-join the (k-1)_frequent itemsets L(k-1) to generate the k_candidate itemsets Ck, where k ≥ 2. When k = 2, if Ck does not contain an original query term, delete Ck; if Ck contains an original query term, retain Ck and compute its support Copulas_Support(Ck). When k > 2, directly compute the support Copulas_Support(Ck) of each Ck. If Copulas_Support(Ck) is not lower than ms, take Ck as a k_frequent itemset Lk and add it to the FIS.
The self-join operation uses the candidate itemset join method given in the Apriori algorithm (see: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C] // Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D C, USA, 1993: 207-216).
Copulas_Support(Ck) is calculated as shown in formula (6):
Copulas_Support(Ck) = exp(log(frequency(Ck)/frequency(allDocs)) + log(weight(Ck)/weight(allItems)))    (6)
In formula (6), frequency(Ck) denotes the frequency of occurrence of the k_candidate Ck in the pseudo-relevance feedback Chinese document library, weight(Ck) denotes the itemset weight of the k_candidate Ck in the pseudo-relevance feedback Chinese document library, and frequency(allDocs) and weight(allItems) are defined as in formula (5).
(5.3) Increase k by 1 and return to step (5.2), continuing the subsequent steps in order, until the generated Lk is an empty set; frequent itemset mining is then finished, and the method proceeds to step (5.4).
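The frequent-itemset loop of steps (5.1)-(5.3) can be sketched as an Apriori-style level-wise search. The support function below is a simplified frequency-only stand-in for the patent's Copulas-based support (which also folds in itemset weights), and the transactions, query term, and threshold ms are toy values:

```python
# Sketch of steps (5.1)-(5.3): level-wise frequent itemset mining with
# self-join candidate generation and the k=2 query-term pruning rule.
transactions = [
    {"查询", "扩展", "检索"},
    {"查询", "扩展"},
    {"查询", "检索"},
]
query_terms = {"查询"}
ms = 0.5  # minimum support threshold (toy value)

def support(itemset):
    """Simplified frequency-only support (stand-in for Copulas_Support)."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# (5.1) mine the 1_frequent itemsets
items = sorted({i for t in transactions for i in t})
Lk = [frozenset([i]) for i in items if support(frozenset([i])) >= ms]
FIS = list(Lk)

# (5.2)-(5.3) self-join to k_candidates, keep frequent ones, repeat
k = 2
while Lk:
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    if k == 2:
        # step (5.2): 2_candidates must contain an original query term
        candidates = {c for c in candidates if c & query_terms}
    Lk = [c for c in candidates if support(c) >= ms]
    FIS.extend(Lk)
    k += 1

print(sorted(sorted(s) for s in FIS))
```

With this toy data the loop yields three frequent single items and the two pairs that contain the query term, then stops when the 3_candidate falls below ms.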
(5.4) Take any Lk (k ≥ 2) from the FIS.
(5.5) Extract proper subset itemsets Lq and LAet from Lk, with Lq ∪ LAet = Lk, where LAet is a proper subset itemset containing no query terms and Lq is a proper subset itemset containing query terms, and calculate the Copulas-function-based confidence Copulas_Confidence(Lq→LAet) of the association rule Lq→LAet.
Copulas_Confidence (Copulas-based Confidence) denotes the confidence based on the Copulas function; Copulas_Confidence(Lq→LAet) is calculated as shown in formula (7):
Copulas_Confidence(Lq→LAet) = exp(log(frequency(Lk)/frequency(Lq)) + log(weight(Lk)/weight(Lq)))    (7)
In formula (7), frequency(Lk) denotes the frequency of occurrence of the k_frequent itemset Lk in the pseudo-relevance feedback Chinese document library, weight(Lk) denotes the itemset weight of Lk in the pseudo-relevance feedback Chinese document library, frequency(Lq) denotes the frequency of occurrence of the proper subset itemset Lq of Lk in the pseudo-relevance feedback Chinese document library, and weight(Lq) denotes the itemset weight of Lq in the pseudo-relevance feedback Chinese document library.
(5.6) Mine association rules Lq→LAet: extract the association rules Lq→LAet whose Copulas_Confidence(Lq→LAet) is not lower than the minimum confidence threshold mc and add them to the association rule set AR (Association Rule); then return to step (5.5) to extract other proper subset itemsets Lq and LAet from Lk and carry out the subsequent steps in order, looping until each proper subset itemset of Lk has been extracted exactly once; then return to step (5.4) to start a new round of association rule pattern mining, taking any other Lk from the FIS and carrying out the subsequent steps in order, looping until every k_frequent itemset Lk in the FIS has been taken exactly once; association rule pattern mining is then finished, and the method proceeds to step (5.7).
(5.7) Extract the feature words of the association rule consequents LAet from the association rule set AR as the associated expansion words, obtain the associated expansion word set AETS (Association Expansion Term Set), and calculate the associated expansion word weight wAet; then proceed to step 6.
The AETS is represented by formula (8):
AETS = {Aet_1, Aet_2, …, Aet_n}    (8)
In formula (8), Aet_i denotes the i-th associated expansion word.
The associated expansion word weight wAet is calculated as shown in formula (9):
wAet = max(Copulas_Confidence(Lq→LAet))    (9)
In formula (9), max() takes the maximum value of the association rule confidence: when the same associated expansion word appears in several association rule patterns at the same time, the maximum confidence is taken as the weight of that expansion word.
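Steps (5.5)-(5.7) can be sketched as follows. The confidence function below is a frequency-only stand-in for the Copulas-based confidence of formula (7), and the transactions, frequent itemsets, and threshold mc are toy values:

```python
# Sketch of steps (5.5)-(5.7): split each frequent itemset into a
# query-side antecedent Lq and a non-query consequent LAet, keep rules
# whose confidence clears mc, and weight each expansion word by the
# maximum confidence over all rules producing it (formula (9)).
transactions = [
    {"查询", "扩展", "检索"},
    {"查询", "扩展"},
    {"查询", "检索"},
]
query_terms = {"查询"}
mc = 0.6  # minimum confidence threshold (toy value)

def freq(itemset):
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, itemset):
    """Simplified stand-in for Copulas_Confidence(Lq -> LAet)."""
    return freq(itemset) / freq(antecedent)

frequent = [frozenset({"查询", "扩展"}), frozenset({"查询", "检索"})]
w_aet = {}  # expansion word -> max rule confidence
for lk in frequent:
    lq = lk & query_terms        # antecedent: query-term part
    laet = lk - lq               # consequent: candidate expansion words
    conf = confidence(lq, lk)
    if conf >= mc:
        for word in laet:
            w_aet[word] = max(w_aet.get(word, 0.0), conf)

print(w_aet)
```

Both toy rules have confidence 2/3, so "扩展" and "检索" each enter the associated expansion word set with that weight.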
Step 6: calculate the vector cosine similarity between the associated expansion words and the original query, and extract the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold to obtain the word-vector associated expansion word set. The specific steps are as follows:
(6.1) Compute the vector cosine similarity VecCos(Aet_l, q_s) between each associated expansion word (Aet_1, Aet_2, …, Aet_i) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, as shown in formula (10), where 1 ≤ l ≤ i and 1 ≤ s ≤ j:
VecCos(Aet_l, q_s) = (vAet_l · vq_s) / (|vAet_l| × |vq_s|)    (10)
In formula (10), vAet_l denotes the word vector value of the l-th associated expansion word Aet_l, and vq_s denotes the word vector value of the s-th query term q_s.
(6.2) Accumulate the vector similarity values of the associated expansion word with each query term; the similarity sum is used as the vector cosine similarity value VecSim(Aet_l, Q) of the associated expansion word and the original query term set Q, as shown in formula (11):
VecSim(Aet_l, Q) = Σ_{s=1}^{j} VecCos(Aet_l, q_s)    (11)
(6.3) Extract the associated expansion words whose vector similarity value VecSim(Aet_l, Q) is not lower than the minimum similarity threshold minVSim as word-vector associated expansion words, obtain the word-vector associated expansion word set WEAETS (Word Embedding Association Expansion Term Set), and calculate the word-vector associated expansion word weight w(Avet_l). The weight w(Avet_l) is determined jointly by the associated expansion word weight wAet and the vector cosine similarity value VecSim(Avet_l, Q) of the associated expansion word with the original query term set Q.
The word-vector associated expansion word set WEAETS is shown in formula (12):
WEAETS = {Avet_1, Avet_2, …, Avet_n}    (12)
In formula (12), Avet_s denotes the s-th word-vector associated expansion word, and VecSim(Avet_l, Q) denotes the accumulated sum of the vector cosine similarity values of the l-th word-vector associated expansion word with each query term, calculated according to formula (11).
w(Avet_l) is calculated as shown in formula (13):
w(Avet_l) = exp(log(wAet) + log(VecSim(Avet_l, Q)))    (13)
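The weight fusion of formula (13) can be checked numerically; note that exp(log(a) + log(b)) is algebraically just the product a × b (the two weight values below are toy assumptions):

```python
# Sketch of the step (6.3) weight fusion, formula (13).
import math

w_rule = 0.8    # associated expansion word weight wAet (toy value)
sim_q = 1.25    # accumulated cosine similarity VecSim(Avet_l, Q) (toy value)

w_fused = math.exp(math.log(w_rule) + math.log(sim_q))
print(round(w_fused, 6))   # equals w_rule * sim_q
```

The log/exp form requires both factors to be strictly positive, which holds for a retained expansion word since its similarity cleared the minVSim threshold.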
Step 7: fuse the word-embedding expansion word set WEETS and the word-vector associated expansion word set WEAETS by taking their union to obtain the final expansion word set FETS (Final Expansion Term Set), and calculate the final expansion word weight w(ET_l).
The final expansion word set FETS is shown in formula (14):
FETS = WEETS ∪ WEAETS    (14)
The final expansion word weight w(ET_l) is the word-embedding expansion word weight w(vet_l), or the word-vector associated expansion word weight w(Avet_l), or the sum of the two when the expansion word belongs to both sets, as shown in formula (15):
w(ET_l) = w(vet_l) if ET_l ∈ WEETS only; w(Avet_l) if ET_l ∈ WEAETS only; w(vet_l) + w(Avet_l) if ET_l belongs to both    (15)
Step 8: finally, combine the final expansion words and the original query into a new query, and retrieve the document set again to realize query expansion.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a Chinese query expansion method based on pattern mining and word vector similarity calculation. Word vector semantic learning is trained on the initial retrieval document set to obtain a word vector set comprising query terms and non-query terms; associated expansion words are mined from the pseudo-relevance feedback document set by the Copulas-function-based associated expansion word mining method; and two vector cosine similarity operations are performed in the word vector set: the vector cosine similarity between the non-query terms and the original query is calculated, and the top-ranked non-query terms in descending order of similarity value are extracted as word-embedding expansion words to obtain a word-embedding expansion word set; the vector cosine similarity between the associated expansion words and the original query is calculated, and the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold are extracted to obtain a word-vector associated expansion word set. The two sets are fused to obtain the final expansion words, which are combined with the original query into a new query, and the document set is retrieved again to realize query expansion. The method combines association pattern mining with word vector similarity calculation, mines high-quality expansion words, improves information retrieval performance, and has good application value and popularization prospects.
(2) Four similar query expansion methods from recent years were selected as comparison methods, with the Chinese corpus of the standard NTCIR-5 CLIR data set as experimental data. The experimental results show that the MAP and P@5 retrieval results of the method are higher than those of the baseline retrieval and the four comparison expansion methods, and that its retrieval performance is superior to both. The method can therefore improve information retrieval performance, reduce the problems of query drift and word mismatch in information retrieval, and has high application value and broad popularization prospects.
Drawings
FIG. 1 is a general flow chart of the Chinese query expansion method based on pattern mining and word vector similarity calculation according to the present invention.
Detailed Description
Firstly, in order to better explain the technical scheme of the invention, the related concepts related to the invention are introduced as follows:
1. Itemset
In text mining, a text document is regarded as a transaction, each feature word in the document is called an item, a set of feature word items is called an itemset, and the number of items in an itemset is called the itemset length. A k_itemset is an itemset containing k items, k being the itemset length.
2. Association rule antecedent and consequent
Let x and y be arbitrary feature word itemsets; an implication of the form x→y is called an association rule, where x is called the rule antecedent and y the rule consequent.
3. Feature word itemset support and confidence based on the Copulas function
Copula theory (see: Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8(1): 229-231) can link several one-dimensional marginal distributions into a joint distribution.
The invention uses the Copulas function to integrate the frequency and the weight of a feature word itemset into the support and confidence of feature word association patterns, giving the Copulas-function-based feature word itemset support Copulas_Support (Copulas-based Support) and confidence Copulas_Confidence (Copulas-based Confidence).
The Copulas-function-based support Copulas_Support(T1 ∪ T2) of a feature word itemset (T1 ∪ T2) is calculated as shown in formula (16):
Copulas_Support(T1 ∪ T2) = exp(log(frequency(T1 ∪ T2)/frequency(allDocs)) + log(weight(T1 ∪ T2)/weight(allItems)))    (16)
In formula (16), frequency(T1 ∪ T2) denotes the frequency of occurrence of the itemset (T1 ∪ T2) in the pseudo-relevance feedback Chinese document library, frequency(allDocs) denotes the total number of documents in the pseudo-relevance feedback Chinese document library, weight(T1 ∪ T2) denotes the itemset weight of the itemset (T1 ∪ T2) in the pseudo-relevance feedback Chinese document library, and weight(allItems) denotes the accumulated weight of all Chinese feature words in the pseudo-relevance feedback Chinese document library. exp denotes the exponential function with the natural constant e as base.
The Copulas-function-based feature word association rule confidence Copulas_Confidence(T1→T2) is calculated as shown in formula (17):
Copulas_Confidence(T1→T2) = exp(log(frequency(T1 ∪ T2)/frequency(T1)) + log(weight(T1 ∪ T2)/weight(T1)))    (17)
In formula (17), frequency(T1) denotes the frequency of occurrence of the itemset T1 in the pseudo-relevance feedback Chinese document library, weight(T1) denotes the itemset weight of T1 in the pseudo-relevance feedback Chinese document library, and frequency(T1 ∪ T2) and weight(T1 ∪ T2) are defined as in formula (16).
4. Associated expansion words and word-vector associated expansion words
The associated expansion words come from the consequent itemsets of the association rules, and the association rule confidence is used as the associated expansion word weight.
The vector cosine similarity between the associated expansion words and the original query is calculated, and the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold form the word-vector associated expansion word set.
The word-vector associated expansion word weight w(Avet_l) is determined jointly by the associated expansion word weight wAet and the vector cosine similarity value VecSim(Avet_l, Q) of the associated expansion word with the original query term set Q. Because the two weights come from different sources, the invention uses the Copulas idea to integrate the associated expansion word weight and the vector cosine similarity value into the word-vector associated expansion word weight w(Avet_l), as shown in formula (18):
w(Avet_l) = exp(log(wAet) + log(VecSim(Avet_l, Q)))    (18)
5. Word-embedding expansion words
Calculate the vector cosine similarity between a non-query term and every query term; take the accumulated sum as the total vector cosine similarity between the non-query term and the original query; take the top Vm non-query terms, extracted in descending order of total vector cosine similarity, as the word-embedding expansion words; and take the total vector cosine similarity as the word-embedding expansion word weight.
The invention is further explained below by referring to the drawings and specific comparative experiments.
As shown in FIG. 1, the Chinese query expansion method based on pattern mining and word vector similarity calculation of the present invention comprises the following steps:
Step 1: the user's query is used to retrieve the Chinese document set, obtaining an initial retrieval document set.
Step 2: perform Chinese word segmentation and Chinese stop-word removal on the initial retrieval document set, then apply a deep learning tool to the initial retrieval document set for word vector semantic learning, obtaining a word vector set comprising query terms and non-query terms.
The deep learning tool of the invention is the Skip-gram model of Google's open-source word vector tool word2vec.
Step 3: calculate and accumulate the vector cosine similarity between each non-query term and all query terms, and extract the top-ranked non-query terms, in descending order of similarity value, as word-embedding expansion words to obtain a word-embedding expansion word set. The specific steps are as follows:
(3.1) Compute, for each non-query term (cet_1, cet_2, …, cet_i) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, the vector cosine similarity VecCos(cet_l, q_s), where 1 ≤ l ≤ i and 1 ≤ s ≤ j.
The vector cosine similarity VecCos(cet_l, q_s) is calculated as shown in formula (1):
VecCos(cet_l, q_s) = (vcet_l · vq_s) / (|vcet_l| × |vq_s|)    (1)
In formula (1), vcet_l denotes the word vector value of the l-th non-query term cet_l, and vq_s denotes the word vector value of the s-th query term q_s.
(3.2) Accumulate the vector cosine similarity of the non-query term with every query term in the original query term set Q; the total similarity value is used as the vector cosine similarity VecSim(cet_l, Q) of the non-query term and the original query term set Q, as shown in formula (2):
VecSim(cet_l, Q) = Σ_{s=1}^{j} VecCos(cet_l, q_s)    (2)
(3.3) Arrange the values VecSim(cet_l, Q) in descending order, extract the top Vm non-query terms as the word-embedding expansion words of the original query term set Q, construct the word-embedding expansion word set WEETS (Word Embedding Expansion Term Set), and calculate the word-embedding expansion word weight w(vet_l); then proceed to step 4.
The word-embedding expansion word set WEETS is shown in formula (3):
WEETS = {vet_1, vet_2, …, vet_Vm}    (3)
In formula (3), vet_l denotes the l-th word-embedding expansion word (l ∈ (1, 2, …, Vm)).
The invention takes the total vector cosine similarity value as the weight w(vet_l) of the word-embedding expansion word, as shown in formula (4):
w(vet_l) = VecSim(vet_l, Q)    (4)
Step 4: extract the top-ranked m initial retrieval documents from the initial retrieval document set as pseudo-relevance feedback documents and construct a pseudo-relevance feedback document set; perform Chinese word segmentation, Chinese stop-word removal and feature word extraction preprocessing on the pseudo-relevance feedback document set, calculate the feature word weights with the TF-IDF weighting technique, and finally construct a pseudo-relevance feedback Chinese document library and a Chinese feature word library.
Step 5: mine associated expansion words AET (Association Expansion Term) from the pseudo-relevance feedback document set using the Copulas-function-based expansion word mining method, and establish the associated expansion word set. The Copulas-function-based associated expansion word mining method specifically comprises the following steps:
(5.1) Mine the 1_frequent itemsets L1: extract feature words from the Chinese feature word library to obtain the 1_candidate itemsets C1, and compute the Copulas-function-based support Copulas_Support(C1) of each 1_candidate C1; if Copulas_Support(C1) is not lower than the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS (Frequent ItemSet).
Copulas_Support (Copulas-based Support) denotes the support based on the Copulas function.
Copulas_Support(C1) is calculated as shown in formula (5):
Copulas_Support(C1) = exp(log(frequency(C1)/frequency(allDocs)) + log(weight(C1)/weight(allItems)))    (5)
In formula (5), frequency(C1) denotes the frequency of occurrence of the 1_candidate C1 in the pseudo-relevance feedback Chinese document library; frequency(allDocs) denotes the total number of documents in the pseudo-relevance feedback Chinese document library; weight(C1) denotes the itemset weight of the 1_candidate C1 in the pseudo-relevance feedback Chinese document library; weight(allItems) denotes the accumulated weight of all Chinese feature word items in the pseudo-relevance feedback Chinese document library; exp denotes the exponential function with the natural constant e as base.
(5.2) mining k _ frequent item set Lk: from (k-1) _ frequent item set Lk-1Self-connectingGenerating k _ candidate CkThe k is more than or equal to 2; when k is 2, if C iskIf the original query term is not contained, the C is deletedkIf the C iskIf the original query term is contained, the C is leftkThen, C is leftkComputing k _ candidate set CkDegree of Support Copulas _ Support (C)k) (ii) a When k > 2, then CkDirect computation of k _ candidate set CkDegree of Support Copulas _ Support (C)k) (ii) a If Copulas _ Support (C)k) Not less than ms, then CkAs k _ frequent item set LkAnd added to the FIS.
The self-join method adopts the candidate itemset join method given in the Apriori algorithm.
Copulas_Support(Ck) is calculated as shown in equation (6):
In equation (6), frequency(Ck) represents the frequency with which the k_candidate itemset Ck occurs in the pseudo-relevance feedback Chinese document library, and weight(Ck) represents the itemset weight of Ck in that library; frequency(allDocs) and weight(allItems) are defined as in equation (5).
(5.3) Add 1 to k and return to step (5.2), continuing the subsequent steps in order, until the generated Lk is an empty set; frequent itemset mining is then finished, and the procedure moves to step (5.4).
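Steps (5.1) to (5.3) follow the level-wise Apriori scheme, with the Copulas-based support in place of plain frequency. Equations (5) and (6) appear only as images in this text, so the support function below is a hypothetical stand-in: it combines the two signals the surrounding definitions name, the frequency ratio frequency(Ck)/frequency(allDocs) and the weight ratio weight(Ck)/weight(allItems), through exp() in a Gumbel-copula-like form; the patent's exact combination may differ. All function and variable names are illustrative.

```python
from itertools import combinations
from math import exp, log

def copulas_support(itemset, docs, weights, total_weight):
    # Hypothetical stand-in for equations (5)/(6): combines the frequency
    # ratio and the weight ratio described in the text via exp().
    freq = sum(1 for d in docs if itemset <= d)
    u = freq / len(docs)                                   # frequency ratio
    v = sum(weights.get(t, 0.0) for t in itemset) / total_weight  # weight ratio
    if u == 0 or v == 0:
        return 0.0
    # exp(-((-ln u) + (-ln v))) reduces to u*v; the patent's exact
    # Copulas combination is shown only as an image.
    return exp(-((-log(u)) + (-log(v))))

def mine_frequent_itemsets(docs, weights, query_terms, ms):
    """docs: list of sets of feature words; weights: word -> weight."""
    total_weight = sum(weights.values()) or 1.0
    items = sorted({t for d in docs for t in d})
    # (5.1): 1_frequent itemsets
    level = [frozenset([t]) for t in items
             if copulas_support(frozenset([t]), docs, weights, total_weight) >= ms]
    fis = list(level)
    k = 2
    while level:
        # (5.2): Apriori-style self-join of the previous level
        cands = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        if k == 2:  # keep only 2_candidates containing an original query term
            cands = {c for c in cands if c & query_terms}
        level = [c for c in cands
                 if copulas_support(c, docs, weights, total_weight) >= ms]
        fis.extend(level)
        k += 1      # (5.3): next level until no frequent itemsets remain
    return fis
```

With three toy documents and uniform weights, the itemset {"a", "q"} (co-occurring with the query term in two of three documents) survives the threshold while non-co-occurring pairs do not.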
(5.4) Arbitrarily take out an Lk from the FIS, where k ≥ 2.
(5.5) Extract proper subset itemsets Lq and LAet from Lk, and calculate the Copulas-based confidence Copulas_Confidence(Lq→LAet) of the association rule Lq→LAet, where Lq ∪ LAet = Lk; LAet is a proper subset itemset containing no query terms, and Lq is a proper subset itemset containing query terms.
Copulas_Confidence (Copulas-based confidence) denotes the confidence based on the Copulas function; Copulas_Confidence(Lq→LAet) is calculated as shown in equation (7):
In equation (7), frequency(Lk) represents the frequency with which the k_frequent itemset Lk occurs in the pseudo-relevance feedback Chinese document library, and weight(Lk) represents the itemset weight of Lk in that library; frequency(Lq) represents the frequency with which the proper subset itemset Lq of Lk occurs in the library, and weight(Lq) represents the itemset weight of Lq in the library.
(5.6) Mining association rules Lq→LAet: extract the association rules Lq→LAet whose Copulas_Confidence(Lq→LAet) is not lower than the minimum confidence threshold mc and add them to the association rule set AR (Association Rule); then return to step (5.5) to extract other proper subset itemsets Lq and LAet from Lk and continue the subsequent steps in order, looping until each proper subset itemset of Lk has been taken out exactly once; then move to step (5.4) to perform a new round of association rule pattern mining, taking out any other Lk from the FIS and continuing the subsequent steps in order, looping until all k_frequent itemsets Lk in the FIS have been taken out exactly once; association rule pattern mining is then finished, and the procedure moves to the following step (5.7).
(5.7) Extract the feature words of the association rule consequents LAet from the association rule set AR as the associated expansion words, obtain the associated expansion word set AETS (Association Expansion Term Set), calculate the associated expansion word weights wAet, and then proceed to step 6.
The AETS is represented by formula (8):
In equation (8), Aeti represents the i-th associated expansion word.
The associated expansion word weight wAet is calculated as shown in equation (9):
wAet=max(Copulas_Confidence(Lq→LAet)) (9)
In equation (9), max() takes the maximum value of the association rule confidence: when the same expansion word appears in several association rule patterns at the same time, the maximum confidence is taken as that expansion word's weight.
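Steps (5.4) to (5.7) and equation (9) can be sketched as follows. Since equation (7) is only given as an image, the confidence used here is the hypothetical classic ratio support(Lk)/support(Lq); the Copulas-based support itself is passed in as a callable, and all names are illustrative.

```python
from itertools import combinations

def mine_rules(fis, query_terms, support, mc):
    """fis: frequent itemsets; support: callable itemset -> float;
    mc: minimum confidence threshold. Returns {expansion word: wAet}."""
    rules = {}
    for lk in fis:                      # (5.4): take each Lk (k >= 2)
        if len(lk) < 2:
            continue
        # (5.5): enumerate proper subset pairs Lq / LAet with Lq ∪ LAet = Lk
        for r in range(1, len(lk)):
            for lq in map(frozenset, combinations(lk, r)):
                laet = lk - lq
                # Lq must contain query terms, LAet must not
                if not (lq & query_terms) or (laet & query_terms):
                    continue
                # hypothetical confidence ratio standing in for equation (7)
                conf = support(lk) / support(lq) if support(lq) else 0.0
                if conf >= mc:          # (5.6): keep rules with conf >= mc
                    for term in laet:   # (5.7): consequent words become AETS
                        # equation (9): max confidence becomes the weight wAet
                        rules[term] = max(rules.get(term, 0.0), conf)
    return rules
```

For a single frequent itemset {"q", "a"} with support 0.4 and antecedent {"q"} with support 0.5, the rule q→a has confidence 0.8 and "a" enters the AETS with weight 0.8.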
Step 6, calculating the vector cosine similarity between the associated expansion words and the original query, extracting the associated expansion words with the vector similarity value not lower than the minimum similarity threshold value to obtain a word vector associated expansion word set, and specifically comprising the following steps:
(6.1) Calculate the vector cosine similarity VecCos(Aetl, qs) between each associated expansion word (Aet1, Aet2, …, Aeti) in the word vector set and each query term (q1, q2, …, qj) in the original query term set Q = (q1, q2, …, qj), as shown in equation (10), where 1 ≤ l ≤ i and 1 ≤ s ≤ j.
In equation (10), vAetl denotes the word vector of the l-th associated expansion word Aetl, and vqs denotes the word vector of the s-th query term qs.
(6.2) Accumulate the vector similarity values between the associated expansion word and each query term; the resulting similarity sum is used as the vector cosine similarity value VecSim(Aetl, Q) between the associated expansion word and the original query term set Q, as shown in equation (11):
(6.3) Extract the associated expansion words whose vector similarity VecSim(Aetl, Q) is not lower than the minimum similarity threshold minVSim as word vector associated expansion words, obtaining the word vector associated expansion word set WEAETS, and calculate the word vector associated expansion word weight w(Avetl); w(Avetl) is composed of the associated expansion word weight wAet and the vector cosine similarity value VecSim(Avetl, Q) between the associated expansion word and the original query term set Q.
The word vector associated expansion word set WEAETS is shown in equation (12):
In equation (12), Avets denotes the s-th word vector associated expansion word, and VecSim(Avetl, Q) denotes the accumulated sum of the vector cosine similarity values between the l-th word vector associated expansion word and each query term, calculated by equation (11).
The calculation of w(Avetl) is shown in equation (13).
w(Avetl)=exp(log(wAet)+log(VecSim(Avetl,Q))) (13)
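Step 6 amounts to: per-query-term cosine similarity (equation (10)), summed over Q (equation (11)), thresholded by minVSim, with the weight of equation (13), exp(log wAet + log VecSim), which is algebraically the product wAet · VecSim. A minimal sketch with illustrative names and toy vectors:

```python
from math import exp, log, sqrt

def cosine(u, v):
    # equation (10): standard vector cosine similarity
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def word_vector_expansion(aets, vectors, query_terms, min_vsim):
    """aets: {expansion word: wAet}; vectors: {word: word vector}."""
    weaets = {}
    for aet, w_aet in aets.items():
        # equation (11): accumulate similarity over all query terms in Q
        vecsim = sum(cosine(vectors[aet], vectors[q]) for q in query_terms)
        # (6.3): keep only words at or above the threshold minVSim
        # (vecsim > 0 also guards the log below)
        if vecsim >= min_vsim and vecsim > 0:
            # equation (13): exp(log wAet + log VecSim) == wAet * VecSim
            weaets[aet] = exp(log(w_aet) + log(vecsim))
    return weaets
```

With a candidate identical to one query vector (VecSim = 1.0) and wAet = 0.5, the resulting weight is 0.5, while a candidate pointing away from the query falls below the threshold and is dropped.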
Step 7, take the union of the word embedding expansion word set WEETS and the word vector associated expansion word set WEAETS to obtain the final expansion word set FETS (Final Expansion Term Set), and calculate the final expansion word weight w(ETl).
The final expansion word set FETS is shown in equation (14):
The final expansion word weight w(ETl) is the word embedding expansion word weight w(vetl), or the word vector associated expansion word weight w(Avetl), or the sum of the two; the final expansion word weight w(ETl) is shown in equation (15):
Step 8, combine the final expansion words with the original query into a new query and retrieve the document set again, realizing query expansion.
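Steps 7 and 8 (equations (14) and (15)) take the union of WEETS and WEAETS, where a word present in both sets receives the sum of its two weights, and then append the expansion words to the original query. A sketch with illustrative names; the actual re-retrieval is delegated to the retrieval engine (Lucene in the experiments):

```python
def final_expansion(weets, weaets):
    # equation (14): FETS = WEETS ∪ WEAETS
    fets = dict(weets)
    for term, w in weaets.items():
        # equation (15): the final weight is one set's weight, or the
        # sum of both weights when the word appears in both sets
        fets[term] = fets.get(term, 0.0) + w
    return fets

def expanded_query(original_terms, fets):
    # step 8: original query terms plus the final expansion words
    return list(original_terms) + sorted(fets)
```

For example, a word present only in WEETS keeps its embedding weight, while a word in both sets gets the summed weight before the new query is issued.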
Experimental design and results:
The method of the invention is compared experimentally with existing similar methods to illustrate its effectiveness.
1. Experimental environment and experimental data:
In order to verify the validity of the proposed query expansion model, the Chinese text corpus of the international standard dataset NTCIR-5 CLIR (http://research.nii.ac.jp/ntcir/data/data-en.html) was used as experimental data. The Chinese corpus comprises 8 data sets with 901,446 documents in total; specific information is shown in Table 1. The corpus has 4 types of query topics and 50 Chinese queries in total, and the result set has 2 evaluation criteria: Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant, and partially relevant to the query).
The invention adopts the Title query topic, which is a brief description of the query by nouns and noun phrases.
The experimental data preprocessing is as follows: Chinese word segmentation and Chinese stop word removal. The retrieval evaluation indexes of the experimental results are MAP (Mean Average Precision) and P@5.
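The two evaluation measures can be reproduced from a ranked result list and a set of relevance judgments. The patent does not give its evaluation code; below is a standard sketch of P@k and non-interpolated MAP with illustrative names:

```python
def precision_at_k(ranked, relevant, k=5):
    # P@k: fraction of the top-k retrieved documents that are relevant
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    # AP: mean of precision values at each rank holding a relevant document
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, 1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # MAP: AP averaged over all queries; runs = [(ranked, relevant), ...]
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking with relevant documents at positions 1 and 3, P@5 is 0.4 and AP is (1/1 + 2/3)/2 ≈ 0.833.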
TABLE 1 NTCIR-5CLIR Chinese original corpus information
2. Benchmark retrieval and comparison methods:
The basic experimental retrieval environment is built with Lucene.
The benchmark retrieval and comparison algorithms are explained as follows:
Benchmark retrieval BR (Baseline Retrieval): the retrieval results obtained by the initial retrieval of the 50 original queries through Lucene. The specific comparison query expansion methods are described as follows:
Comparative method 1: expansion words are mined using the weighted association pattern mining technique of the literature (Huang Mingxuan. Vietnamese-English cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-). Experimental parameters: mc = 0.1, mi = 0.0001, ms ∈ {0.004, 0.005, 0.006, 0.007}.
Comparative method 2: expansion words are mined using the weighted frequent pattern mining technique with multiple minimum support thresholds of the literature (Zhang H R, Zhang J W, Wei X Y, et al. A new frequent pattern mining algorithm with weighted multiple minimum supports [J]. Intelligent Automation & Soft Computing, 2017, 23(4): 605-). Experimental parameters: mc = 0.1, LMS = 0.2, HMS = 0.25, WT = 0.1, ms ∈ {0.1, 0.15, 0.2, 0.25}.
Comparative method 3: the word vector-based query expansion method of the literature (Kan Linyuan, Qu Liang, et al. Word vector method for patent query expansion [J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 972-980.). Experimental parameters: k = 60, α = 0.1.
Comparative method 4: positive and negative expansion words are mined using the fully weighted positive and negative association pattern mining technique of the literature (Huang Mingxuan, et al. Vietnamese-English cross-language query translation and expansion [J]. Acta Electronica Sinica, 2018, 46(12): 3029-). Experimental parameters: mc = 0.1, α = 0.3, minPR = 0.1, minNR = 0.01, ms ∈ {0.10, 0.11, 0.12, 0.13}.
3. Experimental results:
The Lucene.Net source programs of the method of the invention and of the comparison methods were run on the experimental data set for the 50 Chinese queries, obtaining the average MAP and P@5 retrieval results of the benchmark retrieval, the comparison methods, and the method of the invention, as shown in Tables 2 to 5.
TABLE 2 Retrieval results P@5 (Relax) of the method of the invention and the benchmark retrieval and comparison methods
TABLE 3 Retrieval results P@5 (Rigid) of the method of the invention and the benchmark retrieval and comparison methods
TABLE 4 Retrieval results MAP (Relax) of the method of the invention and the benchmark retrieval and comparison methods
TABLE 5 Retrieval results MAP (Rigid) of the method of the invention and the benchmark retrieval and comparison methods
Tables 2 to 5 show that the MAP and P@5 values of the method of the invention are higher than those of the benchmark retrieval, and that compared with the 4 comparison methods the MAP and P@5 values of the method are mostly improved; the expanded retrieval performance of the method is therefore higher than that of the benchmark retrieval and the similar comparison methods. The experimental results show that the method is effective, can actually improve information retrieval performance, and has high application value and broad popularization prospects.
Claims (8)
1. A Chinese query expansion method based on pattern mining and word vector similarity calculation, characterized by comprising the following steps:
step 1, a user query retrieves the Chinese document set to obtain an initial retrieval document set;
step 2, performing Chinese word segmentation and Chinese stop word removal on the initial retrieval document set, and performing word vector semantic learning training on it with a deep learning tool to obtain a word vector set comprising query terms and non-query terms;
step 3, calculating and accumulating the vector cosine similarity between the non-query terms and all query terms, sorting by similarity value in descending order, and extracting the top non-query terms as word embedding expansion words to obtain a word embedding expansion word set, with the following specific steps:
(3.1) calculate the vector cosine similarity VecCos(cetl, qs) between each non-query term (cet1, cet2, …, ceti) in the word vector set and each query term (q1, q2, …, qj) in the original query term set Q, where 1 ≤ l ≤ i and 1 ≤ s ≤ j;
(3.2) accumulate the vector cosine similarity between the non-query term and each query term in the original query term set Q; the total similarity value is used as the vector cosine similarity VecSim(cetl, Q) between the non-query term and Q;
(3.3) sort the vector cosine similarity VecSim(cetl, Q) values in descending order, extract the top Vm non-query terms as word embedding expansion words of the original query term set Q according to the sorted order, construct the word embedding expansion word set WEETS, and calculate the word embedding expansion word weight w(vetl); then go to step 4;
step 4, extract the top m initially retrieved documents from the initial retrieval document set as pseudo-relevance feedback documents and construct a pseudo-relevance feedback document set; perform Chinese word segmentation, Chinese stop word removal, and feature word extraction preprocessing on this set; calculate the feature word weights; and finally construct a pseudo-relevance feedback Chinese document library and a Chinese feature word library;
step 5, mine the associated expansion words AET from the pseudo-relevance feedback document set by adopting the Copulas function-based expansion word mining method, and establish an associated expansion word set; the Copulas function-based associated expansion word mining method specifically comprises the following steps:
(5.1) mining the 1_frequent itemset L1: extract feature words from the Chinese feature word library to obtain the 1_candidate itemset C1, and calculate the Copulas-based support Copulas_Support(C1) of C1; if Copulas_Support(C1) is not lower than the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS;
(5.2) mining the k_frequent itemset Lk: self-join the (k-1)_frequent itemsets Lk-1 to generate the k_candidate itemsets Ck, where k ≥ 2; when k = 2, if Ck does not contain any original query term, delete Ck, and if Ck contains an original query term, keep it and calculate the support Copulas_Support(Ck) of the k_candidate itemset Ck; when k > 2, calculate Copulas_Support(Ck) directly; if Copulas_Support(Ck) is not lower than ms, take Ck as a k_frequent itemset Lk and add it to the FIS;
(5.3) add 1 to k and return to step (5.2), continuing the subsequent steps in order, until the generated Lk is an empty set; frequent itemset mining is then finished, and the procedure moves to step (5.4);
(5.4) arbitrarily take out an Lk from the FIS, where k ≥ 2;
(5.5) extract proper subset itemsets Lq and LAet from Lk, and calculate the Copulas-based confidence Copulas_Confidence(Lq→LAet) of the association rule Lq→LAet, where Lq ∪ LAet = Lk; LAet is a proper subset itemset containing no query terms, and Lq is a proper subset itemset containing query terms;
(5.6) mining association rules Lq→LAet: extract the association rules Lq→LAet whose Copulas_Confidence(Lq→LAet) is not lower than the minimum confidence threshold mc and add them to the association rule set AR (Association Rule); then return to step (5.5) to extract other proper subset itemsets Lq and LAet from Lk and continue the subsequent steps in order, looping until each proper subset itemset of Lk has been taken out exactly once; then move to step (5.4) to perform a new round of association rule pattern mining, taking out any other Lk from the FIS and continuing the subsequent steps in order, looping until all k_frequent itemsets Lk in the FIS have been taken out exactly once; association rule pattern mining is then finished, and the procedure moves to the following step (5.7);
(5.7) extract the feature words of the association rule consequents LAet from the association rule set AR as associated expansion words, obtain the associated expansion word set AETS, calculate the associated expansion word weight wAet, and then go to step 6;
step 6, calculating the vector cosine similarity between the associated expansion words and the original query, extracting the associated expansion words with the vector similarity value not lower than the minimum similarity threshold value to obtain a word vector associated expansion word set, and specifically comprising the following steps:
(6.1) calculate the vector cosine similarity VecCos(Aetl, qs) between each associated expansion word (Aet1, Aet2, …, Aeti) in the word vector set and each query term (q1, q2, …, qj) in the original query term set Q = (q1, q2, …, qj), where 1 ≤ l ≤ i and 1 ≤ s ≤ j;
(6.2) accumulate the vector similarity values between the associated expansion word and each query term; the resulting similarity sum is used as the vector cosine similarity value VecSim(Aetl, Q) between the associated expansion word and the original query term set Q;
(6.3) extract the associated expansion words whose vector similarity VecSim(Aetl, Q) is not lower than the minimum similarity threshold minVSim as word vector associated expansion words, obtaining the word vector associated expansion word set WEAETS, and calculate the word vector associated expansion word weight w(Avetl); w(Avetl) is composed of the associated expansion word weight wAet and the vector cosine similarity value VecSim(Avetl, Q) between the associated expansion word and the original query term set Q;
step 7, take the union of the word embedding expansion word set WEETS and the word vector associated expansion word set WEAETS to obtain the final expansion word set FETS (Final Expansion Term Set), and calculate the final expansion word weight w(ETl);
step 8, combine the final expansion words with the original query into a new query and retrieve the document set again, realizing query expansion.
2. The method of claim 1, wherein the method comprises the steps of:
in step (3.1), the vector cosine similarity VecCos(cetl, qs) is calculated as shown in equation (1):
in equation (1), vcetl denotes the word vector of the l-th non-query term cetl, and vqs denotes the word vector of the s-th query term qs;
in step (3.2), the vector cosine similarity VecSim(cetl, Q) between the non-query term and the original query term set Q is shown in equation (2):
in the step (3.3), the word embedding extended word set WEETS is as shown in formula (3):
in equation (3), vetl represents the l-th word embedding expansion word (l = 1, 2, …, Vm);
the total vector cosine similarity value is used as the word embedding expansion word weight w(vetl), as shown in equation (4):
w(vetl)=VecSim(vetl,Q) (4)。
3. the method of claim 1, wherein the method comprises the steps of:
in step (5.1), Copulas_Support(C1) is calculated as shown in equation (5):
in equation (5), frequency(C1) represents the frequency with which the 1_candidate itemset C1 occurs in the pseudo-relevance feedback Chinese document library, frequency(allDocs) represents the total number of documents in that library, weight(C1) represents the weight of the 1_candidate itemset C1, and weight(allItems) represents the accumulated weight of all Chinese feature word items in the library; exp denotes the exponential function with the natural constant e as its base;
in step (5.2), Copulas_Support(Ck) is calculated as shown in equation (6):
in equation (6), frequency(Ck) represents the frequency with which the k_candidate itemset Ck occurs in the pseudo-relevance feedback Chinese document library, and weight(Ck) represents the itemset weight of Ck in that library; frequency(allDocs) and weight(allItems) are defined as in equation (5);
in step (5.5), Copulas_Confidence(Lq→LAet) is calculated as shown in equation (7):
in equation (7), frequency(Lk) represents the frequency with which the k_frequent itemset Lk occurs in the pseudo-relevance feedback Chinese document library, and weight(Lk) represents the itemset weight of Lk in that library; frequency(Lq) represents the frequency with which the proper subset itemset Lq of Lk occurs in the library, and weight(Lq) represents the itemset weight of Lq in the library;
in step (5.7), the AETS is as shown in formula (8):
in equation (8), Aeti represents the i-th associated expansion word;
the associated expansion word weight wAet is calculated as shown in equation (9):
wAet=max(Copulas_Confidence(Lq→LAet)) (9)
in equation (9), max() takes the maximum value of the association rule confidence: when the same expansion word appears in several association rule patterns at the same time, the maximum confidence is taken as that expansion word's weight.
4. The method of claim 1, wherein the method comprises the steps of:
in step (6.1), VecCos(Aetl, qs) is shown in equation (10):
in equation (10), vAetl denotes the word vector of the l-th associated expansion word Aetl, and vqs denotes the word vector of the s-th query term qs;
in step (6.2), VecSim(Aetl, Q) is shown in equation (11):
in the step (6.3), the word vector association extended word set WEAETS is as shown in equation (12):
in equation (12), Avets denotes the s-th word vector associated expansion word, and VecSim(Avetl, Q) denotes the accumulated sum of the vector cosine similarity values between the l-th word vector associated expansion word and each query term, calculated by equation (11);
the calculation of w(Avetl) is shown in equation (13);
w(Avetl)=exp(log(wAet)+log(VecSim(Avetl,Q))) (13)。
5. the method of claim 1, wherein the method comprises the steps of:
in step 7, the final expansion word set FETS is shown in equation (14):
the final expansion word weight w(ETl) is the word embedding expansion word weight w(vetl), or the word vector associated expansion word weight w(Avetl), or the sum of the two; the final expansion word weight w(ETl) is shown in equation (15):
6. The method of claim 1, wherein: in step 2, the deep learning tool refers to the Skip-gram model of the Google open-source word vector tool word2vec.
7. The method of claim 1, wherein: in step 4, the TF-IDF weighting technique is adopted to calculate the feature word weights.
8. The method of claim 1, wherein: in step (5.2), the self-join method adopts the candidate itemset join method given in the Apriori algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010773432.1A CN111897922A (en) | 2020-08-04 | 2020-08-04 | Chinese query expansion method based on pattern mining and word vector similarity calculation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010773432.1A CN111897922A (en) | 2020-08-04 | 2020-08-04 | Chinese query expansion method based on pattern mining and word vector similarity calculation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111897922A true CN111897922A (en) | 2020-11-06 |
Family
ID=73183322
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010773432.1A Withdrawn CN111897922A (en) | 2020-08-04 | 2020-08-04 | Chinese query expansion method based on pattern mining and word vector similarity calculation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111897922A (en) |
- 2020-08-04 CN CN202010773432.1A patent/CN111897922A/en not_active Withdrawn
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112651224A (en) * | 2020-12-24 | 2021-04-13 | 天津大学 | Intelligent search method and device for engineering construction safety management document text |
CN112612875A (en) * | 2020-12-29 | 2021-04-06 | 重庆农村商业银行股份有限公司 | Method, device and equipment for automatically expanding query words and storage medium |
CN112612875B (en) * | 2020-12-29 | 2023-05-23 | 重庆农村商业银行股份有限公司 | Query term automatic expansion method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103064969A (en) | Method for automatically creating keyword index table | |
CN109299278B (en) | Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent | |
Pan et al. | An improved TextRank keywords extraction algorithm | |
CN111897922A (en) | Chinese query expansion method based on pattern mining and word vector similarity calculation | |
CN109582769A (en) | Association mode based on weight sequence excavates and the text searching method of consequent extension | |
CN109739953B (en) | Text retrieval method based on chi-square analysis-confidence framework and back-part expansion | |
CN109299292B (en) | Text retrieval method based on matrix weighted association rule front and back part mixed expansion | |
CN109726263B (en) | Cross-language post-translation hybrid expansion method based on feature word weighted association pattern mining | |
CN111897928A (en) | Chinese query expansion method for embedding expansion words into query words and counting expansion word union | |
CN109684463B (en) | Cross-language post-translation and front-part extension method based on weight comparison and mining | |
Dalton et al. | Semantic entity retrieval using web queries over structured RDF data | |
CN111897921A (en) | Text retrieval method based on word vector learning and mode mining fusion expansion | |
CN111897924A (en) | Text retrieval method based on association rule and word vector fusion expansion | |
CN111897927B (en) | Chinese query expansion method integrating Copulas theory and association rule mining | |
CN109684465B (en) | Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison | |
CN111897926A (en) | Chinese query expansion method integrating deep learning and expansion word mining intersection | |
CN108416442B (en) | Chinese word matrix weighting association rule mining method based on item frequency and weight | |
CN111897919A (en) | Text retrieval method based on Copulas function and pseudo-correlation feedback rule expansion | |
CN109684464B (en) | Cross-language query expansion method for realizing rule back-part mining through weight comparison | |
Li et al. | Deep learning and semantic concept spaceare used in query expansion | |
CN111897925B (en) | Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning | |
Yang et al. | An improved pagerank algorithm based on time feedback and topic similarity | |
CN108170778B (en) | Chinese-English cross-language query post-translation expansion method based on fully weighted rule post-piece | |
CN111897920A (en) | Text retrieval method based on word embedding and association mode union expansion | |
Wu et al. | Beyond greedy search: pruned exhaustive search for diversified result ranking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20201106 |
WW01 | Invention patent application withdrawn after publication |