CN111897922A - Chinese query expansion method based on pattern mining and word vector similarity calculation - Google Patents

Chinese query expansion method based on pattern mining and word vector similarity calculation

Info

Publication number
CN111897922A
Authority
CN
China
Prior art keywords
word
expansion
vector
aet
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010773432.1A
Other languages
Chinese (zh)
Inventor
Huang Mingxuan (黄名选)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics
Priority to CN202010773432.1A
Publication of CN111897922A
Legal status: Withdrawn

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/3338 Query expansion
    • G06F16/334 Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention proposes a Chinese query expansion method based on pattern mining and word vector similarity calculation. A user query first retrieves a Chinese document set to obtain an initial retrieval document set, on which word vector semantic learning training is performed to obtain a word vector set containing both query terms and non-query terms. Associated expansion words are then mined from a pseudo-relevance feedback document set with a Copulas-function-based associated expansion word mining method, and an associated expansion word set is established. Two kinds of vector cosine similarity computation over the word vector set yield a word embedding expansion word set and a word vector associated expansion word set; these two sets are fused into the final expansion words, which are combined with the original query into a new query used to retrieve the document set again, realizing query expansion. By integrating association pattern mining with word vector learning, the method can mine high-quality expansion words, improves information retrieval performance, and has good application value and promotion prospects.

Description

Chinese query expansion method based on pattern mining and word vector similarity calculation
Technical Field
The invention relates to a Chinese query expansion method based on pattern mining and word vector similarity calculation, and belongs to the technical field of information retrieval.
Background
Query expansion refers to modifying the weights of the original query or adding words related to it, thereby compensating for the insufficiency of the user's query information and improving the recall and precision of an information retrieval system; it is one of the core technologies for addressing query topic drift and word mismatch in the field of information retrieval.
In recent decades, with the development of network technology and the arrival of the big data era, precisely retrieving the information users need from massive data resources has become a focus of academia and industry at home and abroad, and query expansion technology has accordingly developed considerably, with a number of new query expansion methods being proposed. For example, Liu et al. (Liu C, Qi R, Liu Q. Query expansion terms based on positive and negative association rules [C]. Proceedings of the Third International Conference on Information Science and Technology (ICIST), IEEE, Yangzhou, Jiangsu, China, 2013: 802-) proposed an expansion word mining method based on positive and negative association rule mining; Huang et al. (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Query expansion of pseudo relevance feedback based on matrix-weighted association rules mining [J]. Journal of Software, 2009, 20(7): 1854-1865) proposed a pseudo-relevance feedback query expansion method based on matrix-weighted association rule mining; and Roy et al. (Roy D, Ganguly D, Mitra M, et al. Word vector compositionality based relevance feedback using kernel density estimation [C]. Proceedings of the 25th ACM International Conference on Information and Knowledge Management. New York: ACM Press, 2016: 1281-1290) studied word-vector-based retrieval feedback.
However, existing query expansion methods have not completely solved the technical problems of query topic drift and word mismatch in information retrieval. Aiming at these deficiencies, the invention integrates association pattern mining with word vector learning and proposes a Chinese query expansion method based on pattern mining and word vector similarity calculation.
Disclosure of Invention
The invention aims to provide a Chinese query expansion method based on pattern mining and word vector similarity calculation for the field of information retrieval, such as practical Chinese search engines and web information retrieval systems; it can improve the query performance of an information retrieval system and reduce the problems of query topic drift and word mismatch in information retrieval.
The invention adopts the following specific technical scheme:
a Chinese query expansion method based on pattern mining and word vector similarity calculation comprises the following steps:
Step 1: the user's query retrieves the Chinese document set to obtain an initial retrieval document set.
Step 2: perform Chinese word segmentation and stop-word removal on the initial retrieval document set, and perform word vector semantic learning training on it with a deep learning tool to obtain a word vector set comprising query terms and non-query terms.
The deep learning tool is the Skip-gram model of the Google open-source word vector tool word2vec (see https://code.google.com/p/word2vec/).
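As an illustration of the word vector training in step 2, the following minimal sketch uses the word2vec implementation of the gensim library; the toy corpus, vector dimensionality, and window size are illustrative assumptions, not values fixed by the invention.

```python
# Minimal sketch of step 2 (assumed setup): Skip-gram training over the
# segmented, stop-word-filtered initial retrieval documents.
from gensim.models import Word2Vec

# Each document is a list of Chinese terms after word segmentation;
# this two-document corpus is purely illustrative.
segmented_docs = [
    ["查询", "扩展", "信息", "检索", "文档"],
    ["词", "向量", "相似度", "计算", "查询"],
]

model = Word2Vec(
    sentences=segmented_docs,
    vector_size=100,  # word vector dimensionality (assumed)
    window=5,         # context window size (assumed)
    min_count=1,      # keep rare terms in this toy corpus
    sg=1,             # sg=1 selects the Skip-gram architecture
)

query_vec = model.wv["查询"]  # learned vector for a query term
```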
Step 3: compute and accumulate the vector cosine similarity between each non-query term and all query terms, sort by similarity value in descending order, and extract the top-ranked non-query terms as word embedding expansion words to obtain the word embedding expansion word set. The specific steps are as follows:
(3.1) Compute the vector cosine similarity VecCos(cet_l, q_s) between each non-query term (cet_1, cet_2, …, cet_i) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, where 1 ≤ l ≤ i and 1 ≤ s ≤ j.
The vector cosine similarity VecCos(cet_l, q_s) is calculated as shown in formula (1):

VecCos(cet_l, q_s) = (v_cet_l · v_q_s) / (||v_cet_l|| × ||v_q_s||)    (1)

In formula (1), v_cet_l denotes the word vector of the l-th non-query term cet_l, and v_q_s denotes the word vector of the s-th query term q_s.
(3.2) Accumulate the vector cosine similarities between the non-query term and each query term in the original query term set Q; the total similarity value serves as the vector cosine similarity VecSim(cet_l, Q) between the non-query term and Q, as shown in formula (2):

VecSim(cet_l, Q) = Σ_{s=1}^{j} VecCos(cet_l, q_s)    (2)
(3.3) Sort VecSim(cet_l, Q) in descending order, extract the top Vm non-query terms as the word embedding expansion words of the original query term set Q, construct the word embedding expansion word set WEETS (Word Embedding Expansion Term Set), and calculate the word embedding expansion word weight w(vet_l); then go to step 4.

The word embedding expansion word set WEETS is shown in formula (3):

WEETS = {vet_1, vet_2, …, vet_Vm}    (3)

In formula (3), vet_l denotes the l-th word embedding expansion word, l ∈ (1, 2, …, Vm).
The invention takes the total vector cosine similarity value as the word embedding expansion word weight w(vet_l), as shown in formula (4):

w(vet_l) = VecSim(vet_l, Q)    (4)
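A minimal sketch of step 3 under the definitions above: formula (1) as plain cosine similarity, formula (2) as its accumulation over the query terms, and formula (4) as the weight of the top-Vm terms. The function and variable names are illustrative, and `model` is assumed to be a trained gensim model as in the previous sketch.

```python
# Sketch of step 3: rank non-query terms by their accumulated cosine
# similarity to all query terms and keep the top Vm as WEETS.
import numpy as np

def vec_cos(u, v):
    """Formula (1): cosine similarity of two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_embedding_expansion(model, query_terms, non_query_terms, vm):
    """Return the top-Vm non-query terms with weights w(vet_l)."""
    scored = []
    for cet in non_query_terms:
        # Formula (2): VecSim(cet_l, Q) accumulates VecCos over all q_s.
        sim = sum(vec_cos(model.wv[cet], model.wv[q]) for q in query_terms)
        scored.append((cet, sim))
    scored.sort(key=lambda pair: pair[1], reverse=True)  # descending order
    return scored[:vm]  # WEETS; formula (4) uses sim itself as the weight
```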
Step 4: extract the top m documents from the initial retrieval document set as pseudo-relevance feedback documents to construct a pseudo-relevance feedback document set; perform Chinese word segmentation, stop-word removal, and feature word extraction preprocessing on the pseudo-relevance feedback document set; calculate the feature word weights; and finally construct a pseudo-relevance feedback Chinese document library and a Chinese feature word library.
The invention adopts the TF-IDF (term frequency-inverse document frequency) weighting technique (see: Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern Information Retrieval (Chinese translation by Wang Zhijin et al.). China Machine Press, 2005: 21-22) to calculate the feature word weights.
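A minimal sketch of the TF-IDF weighting used in step 4; the exact variant in the cited textbook may differ, so the formula below (term frequency normalized by document length, inverse document frequency as a plain logarithm) is an assumed common form.

```python
# Sketch of step 4's feature word weighting (assumed TF-IDF variant).
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of term lists, the m pseudo-relevance feedback documents.
    Returns {(doc_index, term): weight}."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    weights = {}
    for i, d in enumerate(docs):
        tf = Counter(d)
        for t, f in tf.items():
            weights[(i, t)] = (f / len(d)) * math.log(n / df[t])
    return weights
```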
Step 5: mine associated expansion words AET (Association Expansion Terms) from the pseudo-relevance feedback document set with a Copulas-function-based expansion word mining method, and build the associated expansion word set. The Copulas-function-based associated expansion word mining method specifically comprises the following steps:
(5.1) Mine the 1_frequent itemsets L1: extract feature words from the Chinese feature word library to obtain 1_candidate itemsets C1, and calculate the Copulas-function-based support Copulas_Support(C1) of each C1; if Copulas_Support(C1) is not lower than the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS (Frequent ItemSet).

Copulas_Support (Copulas based Support) denotes the support based on the Copulas function.
Copulas_Support(C1) is calculated as shown in formula (5):

Copulas_Support(C1) = exp(log(frequency(C1) / frequency(allDocs)) + log(weight(C1) / weight(allItems)))    (5)

In formula (5), frequency(C1) denotes the frequency of occurrence of the 1_candidate C1 in the pseudo-relevance feedback Chinese document library, frequency(allDocs) denotes the total number of documents in the library, weight(C1) denotes the itemset weight of C1 in the library, and weight(allItems) denotes the accumulated weight of all Chinese feature word items in the library; exp denotes the exponential function with the natural constant e as its base.
(5.2) Mine the k_frequent itemsets Lk: generate k_candidates Ck by self-joining the (k-1)_frequent itemsets L(k-1), where k ≥ 2. When k = 2, if Ck does not contain an original query term, delete Ck; if Ck contains an original query term, keep Ck and calculate the support Copulas_Support(Ck) of the kept Ck. When k > 2, directly calculate Copulas_Support(Ck) for Ck. If Copulas_Support(Ck) is not lower than ms, take Ck as a k_frequent itemset Lk and add it to the FIS.
The self-join adopts the candidate itemset join method given in the Apriori algorithm (see: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D C, USA, 1993: 207-216).
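The self-join itself can be sketched as below, joining (k-1)_frequent itemsets that agree on their first k-2 items, in the manner of Apriori candidate generation; representing itemsets as sorted tuples is an assumption of the sketch.

```python
# Sketch of the Apriori-style self-join in step (5.2): generate
# k_candidate itemsets from the (k-1)_frequent itemsets.
def self_join(freq_prev):
    """freq_prev: list of sorted term tuples, each of length k-1."""
    candidates = set()
    for a in range(len(freq_prev)):
        for b in range(a + 1, len(freq_prev)):
            p, q = freq_prev[a], freq_prev[b]
            if p[:-1] == q[:-1]:  # first k-2 items agree
                candidates.add(tuple(sorted(set(p) | set(q))))
    return candidates
```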
Copulas_Support(Ck) is calculated as shown in formula (6):

Copulas_Support(Ck) = exp(log(frequency(Ck) / frequency(allDocs)) + log(weight(Ck) / weight(allItems)))    (6)

In formula (6), frequency(Ck) denotes the frequency of occurrence of the k_candidate Ck in the pseudo-relevance feedback Chinese document library, and weight(Ck) denotes the itemset weight of Ck in the library; frequency(allDocs) and weight(allItems) are defined as in formula (5).
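Reading formulas (5) and (6) as the product-copula composition exp(log u + log v) that the patent also uses in formula (13), a sketch of the support computation might look as follows; this reading of the Copulas fusion is an assumption of the sketch, as is the document representation.

```python
# Sketch of Copulas_Support for steps (5.1)/(5.2), assuming the
# product-copula reading exp(log u + log v) of formulas (5)/(6).
import math

def copulas_support(itemset, doc_term_sets, item_weights, total_weight):
    """doc_term_sets: one set of terms per feedback document.
    item_weights: accumulated weight of each term over the library."""
    freq = sum(1 for d in doc_term_sets if set(itemset) <= d)
    u = freq / len(doc_term_sets)                      # frequency ratio
    v = sum(item_weights.get(t, 0.0) for t in itemset) / total_weight
    if u <= 0 or v <= 0:
        return 0.0
    return math.exp(math.log(u) + math.log(v))         # Copulas fusion
```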
(5.3) Add 1 to k and return to step (5.2), continuing the subsequent steps in order, until the generated Lk is an empty set; frequent itemset mining then ends, and the procedure goes to step (5.4).

(5.4) Take any Lk (k ≥ 2) from the FIS.
(5.5) Extract proper subset itemsets Lq and LAet from Lk, and calculate the Copulas-function-based confidence Copulas_Confidence(Lq → LAet) of the association rule Lq → LAet, where Lq ∪ LAet = Lk and Lq ∩ LAet = ∅.

LAet is a proper subset itemset containing no query term, and Lq is a proper subset itemset containing query terms.
Copulas_Confidence (Copulas based Confidence) denotes the confidence based on the Copulas function; Copulas_Confidence(Lq → LAet) is given by formula (7):

Copulas_Confidence(Lq → LAet) = exp(log(frequency(Lk) / frequency(Lq)) + log(weight(Lk) / weight(Lq)))    (7)

In formula (7), frequency(Lk) denotes the frequency of occurrence of the k_frequent itemset Lk in the pseudo-relevance feedback Chinese document library, weight(Lk) denotes the itemset weight of Lk in the library, frequency(Lq) denotes the frequency of occurrence of the proper subset itemset Lq of Lk in the library, and weight(Lq) denotes the itemset weight of Lq in the library.
(5.6) Mine association rules Lq → LAet: extract the association rules Lq → LAet whose Copulas_Confidence(Lq → LAet) is not lower than the minimum confidence threshold mc and add them to the association rule set AR (Association Rules); then return to step (5.5) to extract other proper subset itemsets Lq and LAet from Lk and carry out the subsequent steps in order, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (5.4) for a new round of association rule pattern mining, taking any other Lk from the FIS and carrying out the subsequent steps in order, looping until every k_frequent itemset Lk in the FIS has been taken out exactly once; association rule pattern mining then ends, and the procedure goes to step (5.7).
(5.7) Extract the feature words of the rule consequents LAet from the association rule set AR as associated expansion words, obtain the associated expansion word set AETS (Association Expansion Term Set), and calculate the associated expansion word weight w_Aet; then go to step 6.

The AETS is shown in formula (8):

AETS = {Aet_1, Aet_2, …, Aet_n}    (8)

In formula (8), Aet_i denotes the i-th associated expansion word.
The associated expansion word weight w_Aet is calculated as shown in formula (9):

w_Aet = max(Copulas_Confidence(Lq → LAet))    (9)

In formula (9), max() takes the maximum association rule confidence: when the same expansion word appears in multiple association rule patterns at the same time, the maximum confidence is taken as its weight.
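Steps (5.4) to (5.7) can be sketched as one loop over the frequent itemsets: enumerate proper subset pairs, keep rules that reach the confidence threshold mc, and weight each consequent term by its maximum rule confidence per formula (9). The `confidence` callback stands in for Copulas_Confidence and is an assumed interface.

```python
# Sketch of steps (5.4)-(5.7): mine rules L_q -> L_Aet and collect the
# associated expansion words with weights w_Aet = max rule confidence.
from itertools import combinations

def mine_expansion_words(freq_itemsets, query_terms, confidence, mc):
    """confidence(lq, lk): assumed callback returning
    Copulas_Confidence(lq -> lk minus lq)."""
    w_aet = {}
    qset = set(query_terms)
    for lk in freq_itemsets:                    # each k_frequent itemset
        for r in range(1, len(lk)):
            for lq in combinations(lk, r):      # proper subset L_q
                l_aet = tuple(t for t in lk if t not in lq)
                # Antecedent must contain query terms, consequent none.
                if not set(lq) & qset or set(l_aet) & qset:
                    continue
                c = confidence(lq, lk)
                if c >= mc:                     # rule enters AR
                    for t in l_aet:             # consequent feature words
                        w_aet[t] = max(w_aet.get(t, 0.0), c)
    return w_aet                                # AETS with weights w_Aet
```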
Step 6: calculate the vector cosine similarity between the associated expansion words and the original query, and extract the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold to obtain the word vector associated expansion word set. The specific steps are as follows:
(6.1) Compute the vector cosine similarity VecCos(Aet_l, q_s) between each associated expansion word (Aet_1, Aet_2, …, Aet_s) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, as shown in formula (10), where 1 ≤ l ≤ i and 1 ≤ s ≤ j:

VecCos(Aet_l, q_s) = (v_Aet_l · v_q_s) / (||v_Aet_l|| × ||v_q_s||)    (10)

In formula (10), v_Aet_l denotes the word vector of the l-th associated expansion word Aet_l, and v_q_s denotes the word vector of the s-th query term q_s.
(6.2) Accumulate the vector similarity values between the associated expansion word and each query term; the similarity sum serves as the vector cosine similarity value VecSim(Aet_l, Q) between the associated expansion word and the original query term set Q, as shown in formula (11):

VecSim(Aet_l, Q) = Σ_{s=1}^{j} VecCos(Aet_l, q_s)    (11)
(6.3) Extract the associated expansion words whose vector similarity VecSim(Aet_l, Q) is not lower than the minimum similarity threshold minVSim as word vector associated expansion words, obtain the word vector associated expansion word set WEAETS (Word Embedding Association Expansion Term Set), and calculate the word vector associated expansion word weight w(Avet_l); w(Avet_l) is composed of the associated expansion word weight w_Aet and the vector cosine similarity value VecSim(Avet_l, Q) between the associated expansion word and the original query term set Q.

The word vector associated expansion word set WEAETS is shown in formula (12):

WEAETS = {Avet_s | VecSim(Avet_s, Q) ≥ minVSim}    (12)

In formula (12), Avet_s denotes the s-th word vector associated expansion word, and VecSim(Avet_l, Q) denotes the accumulated sum of the vector cosine similarity values between the l-th word vector associated expansion word and each query term, calculated according to formula (11).
The calculation of w(Avet_l) is shown in formula (13):

w(Avet_l) = exp(log(w_Aet) + log(VecSim(Avet_l, Q)))    (13)
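Step 6 then reduces to a filter plus the fusion of formula (13); the sketch below assumes the `vec_cos` helper and gensim model from the step 3 sketch, and a positive threshold minVSim, since the logarithm requires positive arguments.

```python
# Sketch of step 6: keep associated expansion words whose accumulated
# cosine similarity reaches minVSim, and fuse weights per formula (13).
import math

def word_vector_assoc_expansion(model, w_aet, query_terms, min_vsim):
    result = {}
    for aet, w in w_aet.items():
        if aet not in model.wv:
            continue  # term absent from the trained vocabulary
        vec_sim = sum(vec_cos(model.wv[aet], model.wv[q])
                      for q in query_terms)     # formula (11)
        if vec_sim >= min_vsim:                 # min_vsim assumed > 0
            # Formula (13): exp(log a + log b), the Copulas fusion of
            # rule confidence and vector similarity.
            result[aet] = math.exp(math.log(w) + math.log(vec_sim))
    return result  # WEAETS with weights w(Avet_l)
```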
Step 7: fuse the word embedding expansion word set WEETS and the word vector associated expansion word set WEAETS by taking their union to obtain the final expansion word set FETS (Final Expansion Term Set), and calculate the final expansion word weight w(ET_l).

The final expansion word set FETS is shown in formula (14):

FETS = WEETS ∪ WEAETS    (14)

The final expansion word weight w(ET_l) is the word embedding expansion word weight w(vet_l), or the word vector associated expansion word weight w(Avet_l), or the sum of the two, as shown in formula (15):

w(ET_l) = w(vet_l), if ET_l ∈ WEETS and ET_l ∉ WEAETS; w(ET_l) = w(Avet_l), if ET_l ∈ WEAETS and ET_l ∉ WEETS; w(ET_l) = w(vet_l) + w(Avet_l), if ET_l ∈ WEETS ∩ WEAETS    (15)
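Since a term missing from one set contributes zero weight, the union and weighting of formulas (14) and (15) amount to summing per-term weights; a minimal sketch:

```python
# Sketch of step 7: FETS = WEETS union WEAETS, weights per formula (15).
def fuse_expansion_sets(weets, weaets):
    """weets, weaets: {term: weight} dictionaries."""
    return {t: weets.get(t, 0.0) + weaets.get(t, 0.0)
            for t in set(weets) | set(weaets)}
```

The fused terms, ranked by w(ET_l), are then appended to the original query for the second retrieval pass of step 8.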
Step 8: finally, combine the expansion words with the original query into a new query and retrieve the document set again, realizing query expansion.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a Chinese query expansion method based on pattern mining and word vector similarity calculation. Word vector semantic learning training is performed on the initial retrieval document set to obtain a word vector set comprising query terms and non-query terms; associated expansion words are mined from the pseudo-relevance feedback document set with the Copulas-function-based associated expansion word mining method; and two kinds of vector cosine similarity computation are performed in the word vector set: the vector cosine similarity between the non-query terms and the original query is calculated, and the top-ranked non-query terms in descending order of similarity value are extracted as word embedding expansion words to obtain the word embedding expansion word set; the vector cosine similarity between the associated expansion words and the original query is calculated, and the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold are extracted to obtain the word vector associated expansion word set. The word embedding expansion word set and the word vector associated expansion word set are fused into the final expansion words, which are combined with the original query into a new query used to retrieve the document set again, realizing query expansion. The method combines association pattern mining with word vector similarity calculation, mines high-quality expansion words, improves information retrieval performance, and has good application value and promotion prospects.
(2) Four similar query expansion methods from recent years are selected as comparison methods, with the Chinese corpora of the international standard dataset NTCIR-5 CLIR as experimental data. The experimental results show that the retrieval results MAP and P@5 of the method are higher than those of the baseline retrieval and of the 4 comparison expansion methods, and the retrieval performance of the method is superior to that of the baseline retrieval and the comparison methods; the method can therefore improve information retrieval performance, reduce the problems of query drift and word mismatch in information retrieval, and has high application value and broad promotion prospects.
Drawings
FIG. 1 is a general flow chart of the Chinese query expansion method based on pattern mining and word vector similarity calculation according to the present invention.
Detailed Description
First, to better explain the technical scheme of the invention, the related concepts are introduced as follows:
1. item set
In text mining, a text document is regarded as a transaction, each feature word in the document is called an item, a set of feature word items is called an itemset, and the number of items in an itemset is called the itemset length. A k_itemset is an itemset containing k items, k being its length.
2. Associating rules front and back parts
Let x and y be any feature term sets; an implication of the form x → y is called an association rule, where x is called the rule antecedent and y the rule consequent.
3. Characteristic term set support degree and confidence degree based on Copulas function
Copula theory (see: Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8: 229-231) couples multiple marginal distributions into one joint distribution.
The invention uses a Copulas function to integrate the frequency and the weight of a feature term set into the support and confidence of feature term association patterns, and proposes the Copulas-function-based feature term set support Copulas_Support (Copulas based Support) and confidence Copulas_Confidence (Copulas based Confidence).
The Copulas-function-based support Copulas_Support(T1 ∪ T2) of a feature term set (T1 ∪ T2) is calculated as shown in formula (16):

Copulas_Support(T1 ∪ T2) = exp(log(frequency(T1 ∪ T2) / frequency(allDocs)) + log(weight(T1 ∪ T2) / weight(allItems)))    (16)

In formula (16), frequency(T1 ∪ T2) denotes the frequency of occurrence of the itemset (T1 ∪ T2) in the pseudo-relevance feedback Chinese document library, frequency(allDocs) denotes the total number of documents in the library, weight(T1 ∪ T2) denotes the itemset weight of (T1 ∪ T2) in the library, and weight(allItems) denotes the accumulated weight of all Chinese feature terms in the library; exp denotes the exponential function with the natural constant e as its base.
The Copulas-function-based feature word association rule confidence Copulas_Confidence(T1 → T2) is calculated as shown in formula (17):

Copulas_Confidence(T1 → T2) = exp(log(frequency(T1 ∪ T2) / frequency(T1)) + log(weight(T1 ∪ T2) / weight(T1)))    (17)

In formula (17), frequency(T1) denotes the frequency of occurrence of the itemset T1 in the pseudo-relevance feedback Chinese document library, and weight(T1) denotes the itemset weight of T1 in the library; frequency(T1 ∪ T2) and weight(T1 ∪ T2) are defined as in formula (16).
4. Associated expansion word and word vector associated expansion word
The associated expansion words come from the consequent itemsets of association rules, and the association rule confidence serves as the associated expansion word weight.
The vector cosine similarity between the associated expansion words and the original query is calculated, and the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold form the word vector associated expansion word set.
The word vector associated expansion word weight w(Avet_l) is composed of the associated expansion word weight w_Aet and the vector cosine similarity value VecSim(Avet_l, Q) between the associated expansion word and the original query term set Q. Because the two weights have different sources, the invention uses the Copulas function to integrate them into the word vector associated expansion word weight w(Avet_l), as shown in formula (18):

w(Avet_l) = exp(log(w_Aet) + log(VecSim(Avet_l, Q)))    (18)
5. Word-embedded expansion word
The vector cosine similarity between a non-query term and all query terms is calculated, and the accumulated sum of these similarities is taken as the total vector cosine similarity between the non-query term and the original query; the top Vm non-query terms, extracted in descending order of total vector cosine similarity, are taken as word embedding expansion words, and the total vector cosine similarity is taken as the word embedding expansion word weight.
The invention is further explained below with reference to the drawings and specific comparative experiments.
As shown in FIG. 1, the Chinese query expansion method based on pattern mining and word vector similarity calculation of the present invention comprises the following steps:
Step 1: the user's query retrieves the Chinese document set to obtain an initial retrieval document set.
Step 2: perform Chinese word segmentation and stop-word removal on the initial retrieval document set, and perform word vector semantic learning training on it with a deep learning tool to obtain a word vector set comprising query terms and non-query terms.
The deep learning tool of the invention is the Skip-gram model of the Google open-source word vector tool word2vec.
Step 3: compute and accumulate the vector cosine similarity between each non-query term and all query terms, sort by similarity value in descending order, and extract the top-ranked non-query terms as word embedding expansion words to obtain the word embedding expansion word set. The specific steps are as follows:
(3.1) Compute the vector cosine similarity VecCos(cet_l, q_s) between each non-query term (cet_1, cet_2, …, cet_i) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, where 1 ≤ l ≤ i and 1 ≤ s ≤ j.
The vector cosine similarity VecCos(cet_l, q_s) is calculated as shown in formula (1):

VecCos(cet_l, q_s) = (v_cet_l · v_q_s) / (||v_cet_l|| × ||v_q_s||)    (1)

In formula (1), v_cet_l denotes the word vector of the l-th non-query term cet_l, and v_q_s denotes the word vector of the s-th query term q_s.
(3.2) Accumulate the vector cosine similarities between the non-query term and each query term in the original query term set Q; the total similarity value serves as the vector cosine similarity VecSim(cet_l, Q) between the non-query term and Q, as shown in formula (2):

VecSim(cet_l, Q) = Σ_{s=1}^{j} VecCos(cet_l, q_s)    (2)
(3.3) Sort VecSim(cet_l, Q) in descending order, extract the top Vm non-query terms as the word embedding expansion words of the original query term set Q, construct the word embedding expansion word set WEETS (Word Embedding Expansion Term Set), and calculate the word embedding expansion word weight w(vet_l); then go to step 4.

The word embedding expansion word set WEETS is shown in formula (3):

WEETS = {vet_1, vet_2, …, vet_Vm}    (3)

In formula (3), vet_l denotes the l-th word embedding expansion word, l ∈ (1, 2, …, Vm).
The invention takes the total vector cosine similarity value as the word embedding expansion word weight w(vet_l), as shown in formula (4):

w(vet_l) = VecSim(vet_l, Q)    (4)
Step 4: extract the top m documents from the initial retrieval document set as pseudo-relevance feedback documents to construct a pseudo-relevance feedback document set; perform Chinese word segmentation, stop-word removal, and feature word extraction preprocessing on the pseudo-relevance feedback document set; calculate the feature word weights with the TF-IDF weighting technique; and finally construct a pseudo-relevance feedback Chinese document library and a Chinese feature word library.
Step 5: mine associated expansion words AET (Association Expansion Terms) from the pseudo-relevance feedback document set with a Copulas-function-based expansion word mining method, and build the associated expansion word set. The Copulas-function-based associated expansion word mining method specifically comprises the following steps:

(5.1) Mine the 1_frequent itemsets L1: extract feature words from the Chinese feature word library to obtain 1_candidate itemsets C1, and calculate the Copulas-function-based support Copulas_Support(C1) of each C1; if Copulas_Support(C1) is not lower than the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS (Frequent ItemSet).

Copulas_Support (Copulas based Support) denotes the support based on the Copulas function.
Copulas_Support(C1) is calculated as shown in formula (5):

Copulas_Support(C1) = exp(log(frequency(C1) / frequency(allDocs)) + log(weight(C1) / weight(allItems)))    (5)

In formula (5), frequency(C1) denotes the frequency of occurrence of the 1_candidate C1 in the pseudo-relevance feedback Chinese document library, frequency(allDocs) denotes the total number of documents in the library, weight(C1) denotes the itemset weight of C1 in the library, and weight(allItems) denotes the accumulated weight of all Chinese feature word items in the library; exp denotes the exponential function with the natural constant e as its base.
(5.2) Mine the k_frequent itemsets Lk: generate k_candidates Ck by self-joining the (k-1)_frequent itemsets L(k-1), where k ≥ 2. When k = 2, if Ck does not contain an original query term, delete Ck; if Ck contains an original query term, keep Ck and calculate the support Copulas_Support(Ck) of the kept Ck. When k > 2, directly calculate Copulas_Support(Ck) for Ck. If Copulas_Support(Ck) is not lower than ms, take Ck as a k_frequent itemset Lk and add it to the FIS.
The self-join adopts the candidate itemset join method given in the Apriori algorithm.
Copulas_Support(Ck) is calculated as shown in formula (6):

Copulas_Support(Ck) = exp(log(frequency(Ck) / frequency(allDocs)) + log(weight(Ck) / weight(allItems)))    (6)

In formula (6), frequency(Ck) denotes the frequency of occurrence of the k_candidate Ck in the pseudo-relevance feedback Chinese document library, and weight(Ck) denotes the itemset weight of Ck in the library; frequency(allDocs) and weight(allItems) are defined as in formula (5).
(5.3) Add 1 to k and return to step (5.2), continuing the subsequent steps in order, until the generated Lk is an empty set; frequent itemset mining then ends, and the procedure goes to step (5.4).

(5.4) Take any Lk (k ≥ 2) from the FIS.
(5.5) Extract proper subset itemsets Lq and LAet from Lk, and calculate the Copulas-function-based confidence Copulas_Confidence(Lq → LAet) of the association rule Lq → LAet, where Lq ∪ LAet = Lk and Lq ∩ LAet = ∅.

LAet is a proper subset itemset containing no query term, and Lq is a proper subset itemset containing query terms.
Copulas_Confidence (Copulas based Confidence) denotes the confidence based on the Copulas function; Copulas_Confidence(Lq → LAet) is given by formula (7):

Copulas_Confidence(Lq → LAet) = exp(log(frequency(Lk) / frequency(Lq)) + log(weight(Lk) / weight(Lq)))    (7)

In formula (7), frequency(Lk) denotes the frequency of occurrence of the k_frequent itemset Lk in the pseudo-relevance feedback Chinese document library, weight(Lk) denotes the itemset weight of Lk in the library, frequency(Lq) denotes the frequency of occurrence of the proper subset itemset Lq of Lk in the library, and weight(Lq) denotes the itemset weight of Lq in the library.
(5.6) Mine association rules Lq → LAet: extract the association rules Lq → LAet whose Copulas_Confidence(Lq → LAet) is not lower than the minimum confidence threshold mc and add them to the association rule set AR (Association Rules); then return to step (5.5) to extract other proper subset itemsets Lq and LAet from Lk and carry out the subsequent steps in order, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (5.4) for a new round of association rule pattern mining, taking any other Lk from the FIS and carrying out the subsequent steps in order, looping until every k_frequent itemset Lk in the FIS has been taken out exactly once; association rule pattern mining then ends, and the procedure goes to step (5.7).
(5.7) Extract the feature words of the rule consequents LAet from the association rule set AR as associated expansion words, obtain the associated expansion word set AETS (Association Expansion Term Set), and calculate the associated expansion word weight w_Aet; then go to step 6.

The AETS is shown in formula (8):

AETS = {Aet_1, Aet_2, …, Aet_n}    (8)

In formula (8), Aet_i denotes the i-th associated expansion word.
The associated expansion word weight w_Aet is calculated as shown in formula (9):

w_Aet = max(Copulas_Confidence(Lq → LAet))    (9)

In formula (9), max() takes the maximum association rule confidence: when the same expansion word appears in multiple association rule patterns at the same time, the maximum confidence is taken as its weight.
Step 6: calculate the vector cosine similarity between the associated expansion words and the original query, and extract the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold to obtain the word vector associated expansion word set. The specific steps are as follows:
(6.1) Compute the vector cosine similarity VecCos(Aet_l, q_s) between each associated expansion word (Aet_1, Aet_2, …, Aet_s) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, as shown in formula (10), where 1 ≤ l ≤ i and 1 ≤ s ≤ j:

VecCos(Aet_l, q_s) = (v_Aet_l · v_q_s) / (||v_Aet_l|| × ||v_q_s||)    (10)

In formula (10), v_Aet_l denotes the word vector of the l-th associated expansion word Aet_l, and v_q_s denotes the word vector of the s-th query term q_s.
(6.2) Accumulate the vector similarity values between the associated expansion word and each query term; the similarity sum serves as the vector cosine similarity value VecSim(Aet_l, Q) between the associated expansion word and the original query term set Q, as shown in formula (11):

VecSim(Aet_l, Q) = Σ_{s=1}^{j} VecCos(Aet_l, q_s)    (11)
(6.3) Extract the associated expansion words whose vector similarity VecSim(Aet_l, Q) is not lower than the minimum similarity threshold minVSim as word vector associated expansion words, obtain the word vector associated expansion word set WEAETS, and calculate the word vector associated expansion word weight w(Avet_l); w(Avet_l) is composed of the associated expansion word weight w_Aet and the vector cosine similarity value VecSim(Avet_l, Q) between the associated expansion word and the original query term set Q.

The word vector associated expansion word set WEAETS is shown in formula (12):

WEAETS = {Avet_s | VecSim(Avet_s, Q) ≥ minVSim}    (12)

In formula (12), Avet_s denotes the s-th word vector associated expansion word, and VecSim(Avet_l, Q) denotes the accumulated sum of the vector cosine similarity values between the l-th word vector associated expansion word and each query term, calculated according to formula (11).
The calculation of w(Avet_l) is shown in formula (13):

w(Avet_l) = exp(log(w_Aet) + log(VecSim(Avet_l, Q)))    (13)
Step 7: fuse the word embedding expansion word set WEETS and the word vector associated expansion word set WEAETS by taking their union to obtain the final expansion word set FETS (Final Expansion Term Set), and calculate the final expansion word weight w(ET_l).

The final expansion word set FETS is shown in formula (14):

FETS = WEETS ∪ WEAETS    (14)

The final expansion word weight w(ET_l) is the word embedding expansion word weight w(vet_l), or the word vector associated expansion word weight w(Avet_l), or the sum of the two, as shown in formula (15):

w(ET_l) = w(vet_l), if ET_l ∈ WEETS and ET_l ∉ WEAETS; w(ET_l) = w(Avet_l), if ET_l ∈ WEAETS and ET_l ∉ WEETS; w(ET_l) = w(vet_l) + w(Avet_l), if ET_l ∈ WEETS ∩ WEAETS    (15)
Step 8: finally, combine the expansion words with the original query into a new query and retrieve the document set again, realizing query expansion.
Experimental design and results:
We compare the method of the invention with existing similar methods experimentally to illustrate its effectiveness.
1. Experimental environment and experimental data:
To verify the validity of the query expansion model proposed herein, the Chinese text corpora of the international standard dataset NTCIR-5 CLIR (http://research.nii.ac.jp/ntcir/data/data-en.html) are used as experimental data. The Chinese corpus comprises 8 data sets with 901,446 documents in total; the specific information is shown in Table 1. The corpus has 4 types of query topics and 50 Chinese queries in total, and the result set has 2 evaluation criteria: Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant, and partially relevant to the query).
The invention adopts the Title query topic, which is a brief description of the query in nouns and noun phrases.
The experimental data preprocessing comprises Chinese word segmentation and stop-word removal. The retrieval evaluation indexes of the experimental results are MAP (Mean Average Precision) and P@5.
TABLE 1 NTCIR-5 CLIR Chinese original corpora information (presented as an image in the original publication)
2. Baseline retrieval and comparison methods:
The basic experimental retrieval environment is built with Lucene.
The baseline retrieval and the comparison algorithms are described as follows:
Baseline retrieval BR (Baseline Retrieval): the retrieval results obtained by the initial retrieval of the 50 original queries through Lucene. The specific comparison query expansion methods are described as follows:
Comparative method 1: expansion words are mined for query expansion with the weighted association pattern mining technique of the literature (Huang Mingxuan. Vietnamese-English cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-). Experimental parameters: mc = 0.1, mi = 0.0001, ms ∈ (0.004, 0.005, 0.006, 0.007).
Comparative method 2: expansion words are mined for query expansion with the weighted frequent pattern mining technique based on multiple support thresholds of the literature (Zhang H R, Zhang J W, Wei X Y, et al. A new frequent pattern mining algorithm with weighted multiple minimum supports [J]. Intelligent Automation & Soft Computing, 2017, 23(4): 605-). Experimental parameters: mc = 0.1, LMS = 0.2, HMS = 0.25, WT = 0.1, ms ∈ (0.1, 0.15, 0.2, 0.25).
Comparative method 3: the word-vector-based query expansion method of the literature (research on patent query expansion with word vectors [J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 972-980). Experimental parameters: k = 60, α = 0.1.
Comparative method 4: positive and negative expansion words are mined with the fully weighted positive and negative association pattern mining technique of the literature (Huang Mingxuan, Jiang Caoqing. Vietnamese-English cross-language query post-translation expansion based on fully weighted positive and negative association pattern mining [J]. Acta Electronica Sinica, 2018, 46(12): 3029-) to realize query expansion. Experimental parameters: mc = 0.1, α = 0.3, minPR = 0.1, minNR = 0.01, ms ∈ (0.10, 0.11, 0.12, 0.13).
3. The experimental results are as follows:
The Lucene-based source programs of the method of the invention and of the comparison methods are run on the experimental data set for the 50 Chinese queries to obtain the average retrieval results MAP and P@5 of the baseline retrieval, the comparison methods, and the method of the invention, as shown in Tables 2 to 5.
TABLE 2 P@5 values (Relax) of the retrieval results of the method of the invention and the baseline retrieval and comparison methods (presented as an image in the original publication)
TABLE 3 P@5 values (Rigid) of the retrieval results of the method of the invention and the baseline retrieval and comparison methods (presented as an image in the original publication)
TABLE 4 MAP values (Relax) of the retrieval results of the method of the invention and the baseline retrieval and comparison methods (presented as an image in the original publication)
TABLE 5 MAP values (Rigid) of the retrieval results of the method of the invention and the baseline retrieval and comparison methods (presented as an image in the original publication)
Tables 2 to 5 show that the MAP and P@5 values of the method of the invention are higher than those of the baseline retrieval, and that compared with the 4 comparison methods, the MAP and P@5 values of the method are mostly improved; the expanded retrieval performance of the method is thus higher than that of the baseline retrieval and of the similar comparison methods. The experimental results show that the method is effective, can indeed improve information retrieval performance, and has high application value and broad promotion prospects.

Claims (8)

1. A Chinese query expansion method based on pattern mining and word vector similarity calculation is characterized by comprising the following steps:
Step 1: the user's query retrieves the Chinese document set to obtain an initial retrieval document set;
Step 2: perform Chinese word segmentation and stop-word removal on the initial retrieval document set, and perform word vector semantic learning training on it with a deep learning tool to obtain a word vector set comprising query terms and non-query terms;
Step 3: compute and accumulate the vector cosine similarity between each non-query term and all query terms, sort by similarity value in descending order, and extract the top-ranked non-query terms as word embedding expansion words to obtain the word embedding expansion word set, specifically:
(3.1) compute the vector cosine similarity VecCos(cet_l, q_s) between each non-query term (cet_1, cet_2, …, cet_i) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, where 1 ≤ l ≤ i and 1 ≤ s ≤ j;
(3.2) accumulate the vector cosine similarities between the non-query term and each query term in the original query term set Q; the total similarity value serves as the vector cosine similarity VecSim(cet_l, Q) between the non-query term and Q;
(3.3) sort VecSim(cet_l, Q) in descending order, extract the top Vm non-query terms as word embedding expansion words of the original query term set Q, construct the word embedding expansion word set WEETS, and calculate the word embedding expansion word weight w(vet_l); then go to step 4;
Step 4: extract the top m documents from the initial retrieval document set as pseudo-relevance feedback documents to construct a pseudo-relevance feedback document set; perform Chinese word segmentation, stop-word removal, and feature word extraction preprocessing on it; calculate the feature word weights; and finally construct a pseudo-relevance feedback Chinese document library and a Chinese feature word library;
Step 5: mine associated expansion words AET from the pseudo-relevance feedback document set with a Copulas-function-based expansion word mining method, and build the associated expansion word set; the Copulas-function-based associated expansion word mining method specifically comprises:
(5.1) mine the 1_frequent itemsets L1: extract feature words from the Chinese feature word library to obtain 1_candidate itemsets C1, and calculate the Copulas-function-based support Copulas_Support(C1); if Copulas_Support(C1) is not lower than the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS;
(5.2) mine the k_frequent itemsets Lk: generate k_candidates Ck by self-joining the (k-1)_frequent itemsets L(k-1), where k ≥ 2; when k = 2, if Ck does not contain an original query term, delete Ck, and if Ck contains an original query term, keep Ck and calculate the support Copulas_Support(Ck) of the kept Ck; when k > 2, directly calculate Copulas_Support(Ck) for Ck; if Copulas_Support(Ck) is not lower than ms, take Ck as a k_frequent itemset Lk and add it to the FIS;
(5.3) add 1 to k and return to step (5.2), continuing the subsequent steps in order, until the generated Lk is an empty set; frequent itemset mining then ends, and the procedure goes to step (5.4);
(5.4) take any Lk (k ≥ 2) from the FIS;
(5.5) extract proper subset itemsets Lq and LAet from Lk, and calculate the Copulas-function-based confidence Copulas_Confidence(Lq → LAet) of the association rule Lq → LAet, where Lq ∪ LAet = Lk and Lq ∩ LAet = ∅; LAet is a proper subset itemset containing no query term, and Lq is a proper subset itemset containing query terms;
(5.6) mine association rules Lq → LAet: extract the association rules Lq → LAet whose Copulas_Confidence(Lq → LAet) is not lower than the minimum confidence threshold mc and add them to the association rule set AR; then return to step (5.5) to extract other proper subset itemsets Lq and LAet from Lk and carry out the subsequent steps in order, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (5.4) for a new round of association rule pattern mining, taking any other Lk from the FIS and carrying out the subsequent steps in order, looping until every k_frequent itemset Lk in the FIS has been taken out exactly once; association rule pattern mining then ends, and the procedure goes to step (5.7);
(5.7) extract the feature words of the rule consequents LAet from the association rule set AR as associated expansion words, obtain the associated expansion word set AETS, and calculate the associated expansion word weight w_Aet; then go to step 6;
Step 6: calculate the vector cosine similarity between the associated expansion words and the original query, and extract the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold to obtain the word vector associated expansion word set, specifically:
(6.1) compute the vector cosine similarity VecCos(Aet_l, q_s) between each associated expansion word (Aet_1, Aet_2, …, Aet_s) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q = (q_1, q_2, …, q_j), where 1 ≤ l ≤ i and 1 ≤ s ≤ j;
(6.2) accumulate the vector similarity values between the associated expansion word and each query term; the similarity sum serves as the vector cosine similarity value VecSim(Aet_l, Q) between the associated expansion word and the original query term set Q;
(6.3) extract the associated expansion words whose vector similarity VecSim(Aet_l, Q) is not lower than the minimum similarity threshold minVSim as word vector associated expansion words, obtain the word vector associated expansion word set WEAETS, and calculate the word vector associated expansion word weight w(Avet_l); w(Avet_l) is composed of the associated expansion word weight w_Aet and the vector cosine similarity value VecSim(Avet_l, Q) between the associated expansion word and the original query term set Q;
Step 7: fuse the word embedding expansion word set WEETS and the word vector associated expansion word set WEAETS by taking their union to obtain the final expansion word set FETS (Final Expansion Term Set), and calculate the final expansion word weight w(ET_l);
Step 8: finally, combine the expansion words with the original query into a new query and retrieve the document set again, realizing query expansion.
2. The method of claim 1, wherein:
in step (3.1), the vector cosine similarity VecCos(cet_l, q_s) is calculated as shown in formula (1):

VecCos(cet_l, q_s) = (v_cet_l · v_q_s) / (||v_cet_l|| × ||v_q_s||)    (1)

in formula (1), v_cet_l denotes the word vector of the l-th non-query term cet_l, and v_q_s denotes the word vector of the s-th query term q_s;

in step (3.2), the vector cosine similarity VecSim(cet_l, Q) between the non-query term and the original query term set Q is as shown in formula (2):

VecSim(cet_l, Q) = Σ_{s=1}^{j} VecCos(cet_l, q_s)    (2)

in step (3.3), the word embedding expansion word set WEETS is as shown in formula (3):

WEETS = {vet_1, vet_2, …, vet_Vm}    (3)

in formula (3), vet_l denotes the l-th word embedding expansion word, l ∈ (1, 2, …, Vm);

the total vector cosine similarity value is taken as the word embedding expansion word weight w(vet_l), as shown in formula (4):

w(vet_l) = VecSim(vet_l, Q)    (4).
3. The method of claim 1, wherein:
in step (5.1), Copulas_Support(C1) is calculated as shown in formula (5):

Copulas_Support(C1) = exp(log(frequency(C1) / frequency(allDocs)) + log(weight(C1) / weight(allItems)))    (5)

in formula (5), frequency(C1) denotes the frequency of occurrence of the 1_candidate C1 in the pseudo-relevance feedback Chinese document library, frequency(allDocs) denotes the total number of documents in the library, weight(C1) denotes the itemset weight of C1 in the library, and weight(allItems) denotes the accumulated weight of all Chinese feature word items in the library; exp denotes the exponential function with the natural constant e as its base;

in step (5.2), Copulas_Support(Ck) is calculated as shown in formula (6):

Copulas_Support(Ck) = exp(log(frequency(Ck) / frequency(allDocs)) + log(weight(Ck) / weight(allItems)))    (6)

in formula (6), frequency(Ck) denotes the frequency of occurrence of the k_candidate Ck in the pseudo-relevance feedback Chinese document library, and weight(Ck) denotes the itemset weight of Ck in the library; frequency(allDocs) and weight(allItems) are defined as in formula (5);

in step (5.5), Copulas_Confidence(Lq → LAet) is as shown in formula (7):

Copulas_Confidence(Lq → LAet) = exp(log(frequency(Lk) / frequency(Lq)) + log(weight(Lk) / weight(Lq)))    (7)

in formula (7), frequency(Lk) denotes the frequency of occurrence of the k_frequent itemset Lk in the pseudo-relevance feedback Chinese document library, weight(Lk) denotes the itemset weight of Lk in the library, frequency(Lq) denotes the frequency of occurrence of the proper subset itemset Lq of Lk in the library, and weight(Lq) denotes the itemset weight of Lq in the library;
in step (5.7), the AETS is as shown in formula (8):

AETS = {Aet_1, Aet_2, …, Aet_n}    (8)

in formula (8), Aet_i denotes the i-th associated expansion word;

the associated expansion word weight w_Aet is calculated as shown in formula (9):

w_Aet = max(Copulas_Confidence(Lq → LAet))    (9)

in formula (9), max() takes the maximum association rule confidence: when the same expansion word appears in multiple association rule patterns at the same time, the maximum confidence is taken as its weight.
4. The method of claim 1, wherein:
in step (6.1), VecCos(Aet_l, q_s) is as shown in formula (10):

VecCos(Aet_l, q_s) = (v_Aet_l · v_q_s) / (||v_Aet_l|| × ||v_q_s||)    (10)

in formula (10), v_Aet_l denotes the word vector of the l-th associated expansion word Aet_l, and v_q_s denotes the word vector of the s-th query term q_s;

in step (6.2), VecSim(Aet_l, Q) is as shown in formula (11):

VecSim(Aet_l, Q) = Σ_{s=1}^{j} VecCos(Aet_l, q_s)    (11)
in step (6.3), the word vector associated expansion word set WEAETS is as shown in formula (12):

WEAETS = {Avet_s | VecSim(Avet_s, Q) ≥ minVSim}    (12)

in formula (12), Avet_s denotes the s-th word vector associated expansion word, and VecSim(Avet_l, Q) denotes the accumulated sum of the vector cosine similarity values between the l-th word vector associated expansion word and each query term, calculated according to formula (11);

w(Avet_l) is calculated as shown in formula (13):

w(Avet_l) = exp(log(w_Aet) + log(VecSim(Avet_l, Q)))    (13).
5. The method of claim 1, wherein:
in step 7, the final expansion word set FETS is as shown in formula (14):

FETS = WEETS ∪ WEAETS    (14)

the final expansion word weight w(ET_l) is the word embedding expansion word weight w(vet_l), or the word vector associated expansion word weight w(Avet_l), or the sum of the two, as shown in formula (15):

w(ET_l) = w(vet_l), if ET_l ∈ WEETS and ET_l ∉ WEAETS; w(ET_l) = w(Avet_l), if ET_l ∈ WEAETS and ET_l ∉ WEETS; w(ET_l) = w(vet_l) + w(Avet_l), if ET_l ∈ WEETS ∩ WEAETS    (15).
6. The method of claim 1, wherein in step 2 the deep learning tool is the Skip-gram model of the Google open-source word vector tool word2vec.
7. The method of claim 1, wherein in step 4 a TF-IDF weighting technique is adopted to calculate the feature word weights.
8. The method of claim 1, wherein in step (5.2) the self-join adopts the candidate itemset join method given in the Apriori algorithm.
CN202010773432.1A 2020-08-04 2020-08-04 Chinese query expansion method based on pattern mining and word vector similarity calculation Withdrawn CN111897922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010773432.1A CN111897922A (en) 2020-08-04 2020-08-04 Chinese query expansion method based on pattern mining and word vector similarity calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010773432.1A CN111897922A (en) 2020-08-04 2020-08-04 Chinese query expansion method based on pattern mining and word vector similarity calculation

Publications (1)

Publication Number Publication Date
CN111897922A true CN111897922A (en) 2020-11-06

Family

ID=73183322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010773432.1A Withdrawn CN111897922A (en) 2020-08-04 2020-08-04 Chinese query expansion method based on pattern mining and word vector similarity calculation

Country Status (1)

Country Link
CN (1) CN111897922A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651224A (en) * 2020-12-24 2021-04-13 天津大学 Intelligent search method and device for engineering construction safety management document text
CN112612875A (en) * 2020-12-29 2021-04-06 重庆农村商业银行股份有限公司 Method, device and equipment for automatically expanding query words and storage medium
CN112612875B (en) * 2020-12-29 2023-05-23 重庆农村商业银行股份有限公司 Query term automatic expansion method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103064969A (en) Method for automatically creating keyword index table
CN109299278B (en) Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
Pan et al. An improved TextRank keywords extraction algorithm
CN111897922A (en) Chinese query expansion method based on pattern mining and word vector similarity calculation
CN109582769A (en) Association mode based on weight sequence excavates and the text searching method of consequent extension
CN109739953B (en) Text retrieval method based on chi-square analysis-confidence framework and back-part expansion
CN109299292B (en) Text retrieval method based on matrix weighted association rule front and back part mixed expansion
CN109726263B (en) Cross-language post-translation hybrid expansion method based on feature word weighted association pattern mining
CN111897928A (en) Chinese query expansion method for embedding expansion words into query words and counting expansion word union
CN109684463B (en) Cross-language post-translation and front-part extension method based on weight comparison and mining
Dalton et al. Semantic entity retrieval using web queries over structured RDF data
CN111897921A (en) Text retrieval method based on word vector learning and mode mining fusion expansion
CN111897924A (en) Text retrieval method based on association rule and word vector fusion expansion
CN111897927B (en) Chinese query expansion method integrating Copulas theory and association rule mining
CN109684465B (en) Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison
CN111897926A (en) Chinese query expansion method integrating deep learning and expansion word mining intersection
CN108416442B (en) Chinese word matrix weighting association rule mining method based on item frequency and weight
CN111897919A (en) Text retrieval method based on Copulas function and pseudo-correlation feedback rule expansion
CN109684464B (en) Cross-language query expansion method for realizing rule back-part mining through weight comparison
Li et al. Deep learning and semantic concept space are used in query expansion
CN111897925B (en) Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning
Yang et al. An improved pagerank algorithm based on time feedback and topic similarity
CN108170778B (en) Chinese-English cross-language query post-translation expansion method based on fully weighted rule post-piece
CN111897920A (en) Text retrieval method based on word embedding and association mode union expansion
Wu et al. Beyond greedy search: pruned exhaustive search for diversified result ranking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20201106