CN111897922A - Chinese query expansion method based on pattern mining and word vector similarity calculation - Google Patents

Chinese query expansion method based on pattern mining and word vector similarity calculation

Info

Publication number
CN111897922A
Authority
CN
China
Prior art keywords
word
expansion
vector
aet
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010773432.1A
Other languages
Chinese (zh)
Inventor
Huang Mingxuan (黄名选)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics
Priority to CN202010773432.1A
Publication of CN111897922A
Legal status: Withdrawn

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/3338 Query expansion
    • G06F16/334 Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention proposes a Chinese query expansion method based on pattern mining and word vector similarity calculation. A user query first retrieves a Chinese document set to obtain an initial retrieval document set, on which word vector semantic learning training is performed to obtain a word vector set containing both query terms and non-query terms. Associated expansion words are then mined from a pseudo-relevance feedback document set with a Copulas-function-based associated expansion word mining method, and an associated expansion word set is established. Two kinds of vector cosine similarity computation over the word vector set yield a word embedding expansion word set and a word vector associated expansion word set; these two sets are fused into the final expansion words, which are combined with the original query into a new query used to retrieve the document set again, realizing query expansion. By integrating association pattern mining with word vector learning, the method can mine high-quality expansion words, improves information retrieval performance, and has good application value and promotion prospects.

Description

Chinese query expansion method based on pattern mining and word vector similarity calculation
Technical Field
The invention relates to a Chinese query expansion method based on pattern mining and word vector similarity calculation, and belongs to the technical field of information retrieval.
Background
Query expansion refers to modifying the weights of the original query or adding words related to it, thereby compensating for the insufficiency of the user's query information and improving the recall and precision of an information retrieval system; it is one of the core technologies for addressing query topic drift and word mismatch in the field of information retrieval.
In recent decades, with the development of network technology and the arrival of the big data era, precisely retrieving the information users need from massive data resources has become a focus of academia and industry at home and abroad, and query expansion technology has accordingly developed considerably, with a number of new query expansion methods being proposed. For example, Liu et al. (Liu C, Qi R, Liu Q. Query expansion terms based on positive and negative association rules [C]. Proceedings of the Third International Conference on Information Science and Technology (ICIST), IEEE, Yangzhou, Jiangsu, China, 2013: 802-) proposed an expansion word mining method based on positive and negative association rule mining; Huang et al. (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Query expansion of pseudo relevance feedback based on matrix-weighted association rules mining [J]. Journal of Software, 2009, 20(7): 1854-1865) proposed a pseudo-relevance feedback query expansion method based on matrix-weighted association rule mining; and Roy et al. (Roy D, Ganguly D, Mitra M, et al. Word vector compositionality based relevance feedback using kernel density estimation [C]. Proceedings of the 25th ACM International Conference on Information and Knowledge Management. New York: ACM Press, 2016: 1281-1290) studied word-vector-based retrieval feedback.
However, existing query expansion methods have not completely solved the technical problems of query topic drift and word mismatch in information retrieval. Aiming at these deficiencies, the invention integrates association pattern mining with word vector learning and proposes a Chinese query expansion method based on pattern mining and word vector similarity calculation.
Disclosure of Invention
The invention aims to provide a Chinese query expansion method based on pattern mining and word vector similarity calculation for the field of information retrieval, such as practical Chinese search engines and web information retrieval systems; it can improve the query performance of an information retrieval system and reduce the problems of query topic drift and word mismatch in information retrieval.
The invention adopts the following specific technical scheme:
a Chinese query expansion method based on pattern mining and word vector similarity calculation comprises the following steps:
Step 1: the user's query retrieves the Chinese document set to obtain an initial retrieval document set.
Step 2: perform Chinese word segmentation and stop-word removal on the initial retrieval document set, and perform word vector semantic learning training on it with a deep learning tool to obtain a word vector set comprising query terms and non-query terms.
The deep learning tool is the Skip-gram model of the Google open-source word vector tool word2vec (see https://code.google.com/p/word2vec/).
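As an illustration of the word vector training in step 2, the following minimal sketch uses the word2vec implementation of the gensim library; the toy corpus, vector dimensionality, and window size are illustrative assumptions, not values fixed by the invention.

```python
# Minimal sketch of step 2 (assumed setup): Skip-gram training over the
# segmented, stop-word-filtered initial retrieval documents.
from gensim.models import Word2Vec

# Each document is a list of Chinese terms after word segmentation;
# this two-document corpus is purely illustrative.
segmented_docs = [
    ["查询", "扩展", "信息", "检索", "文档"],
    ["词", "向量", "相似度", "计算", "查询"],
]

model = Word2Vec(
    sentences=segmented_docs,
    vector_size=100,  # word vector dimensionality (assumed)
    window=5,         # context window size (assumed)
    min_count=1,      # keep rare terms in this toy corpus
    sg=1,             # sg=1 selects the Skip-gram architecture
)

query_vec = model.wv["查询"]  # learned vector for a query term
```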
Step 3: compute and accumulate the vector cosine similarity between each non-query term and all query terms, sort by similarity value in descending order, and extract the top-ranked non-query terms as word embedding expansion words to obtain the word embedding expansion word set. The specific steps are as follows:
(3.1) Compute the vector cosine similarity VecCos(cet_l, q_s) between each non-query term (cet_1, cet_2, …, cet_i) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, where 1 ≤ l ≤ i and 1 ≤ s ≤ j.
The vector cosine similarity VecCos(cet_l, q_s) is calculated as shown in formula (1):

VecCos(cet_l, q_s) = (v_cet_l · v_q_s) / (||v_cet_l|| × ||v_q_s||)    (1)

In formula (1), v_cet_l denotes the word vector of the l-th non-query term cet_l, and v_q_s denotes the word vector of the s-th query term q_s.
(3.2) Accumulate the vector cosine similarities between the non-query term and each query term in the original query term set Q; the total similarity value serves as the vector cosine similarity VecSim(cet_l, Q) between the non-query term and Q, as shown in formula (2):

VecSim(cet_l, Q) = Σ_{s=1}^{j} VecCos(cet_l, q_s)    (2)
(3.3) Sort VecSim(cet_l, Q) in descending order, extract the top Vm non-query terms as the word embedding expansion words of the original query term set Q, construct the word embedding expansion word set WEETS (Word Embedding Expansion Term Set), and calculate the word embedding expansion word weight w(vet_l); then go to step 4.

The word embedding expansion word set WEETS is shown in formula (3):

WEETS = {vet_1, vet_2, …, vet_Vm}    (3)

In formula (3), vet_l denotes the l-th word embedding expansion word, l ∈ (1, 2, …, Vm).
The invention takes the total vector cosine similarity value as the word embedding expansion word weight w(vet_l), as shown in formula (4):

w(vet_l) = VecSim(vet_l, Q)    (4)
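A minimal sketch of step 3 under the definitions above: formula (1) as plain cosine similarity, formula (2) as its accumulation over the query terms, and formula (4) as the weight of the top-Vm terms. The function and variable names are illustrative, and `model` is assumed to be a trained gensim model as in the previous sketch.

```python
# Sketch of step 3: rank non-query terms by their accumulated cosine
# similarity to all query terms and keep the top Vm as WEETS.
import numpy as np

def vec_cos(u, v):
    """Formula (1): cosine similarity of two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_embedding_expansion(model, query_terms, non_query_terms, vm):
    """Return the top-Vm non-query terms with weights w(vet_l)."""
    scored = []
    for cet in non_query_terms:
        # Formula (2): VecSim(cet_l, Q) accumulates VecCos over all q_s.
        sim = sum(vec_cos(model.wv[cet], model.wv[q]) for q in query_terms)
        scored.append((cet, sim))
    scored.sort(key=lambda pair: pair[1], reverse=True)  # descending order
    return scored[:vm]  # WEETS; formula (4) uses sim itself as the weight
```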
Step 4: extract the top m documents from the initial retrieval document set as pseudo-relevance feedback documents to construct a pseudo-relevance feedback document set; perform Chinese word segmentation, stop-word removal, and feature word extraction preprocessing on the pseudo-relevance feedback document set; calculate the feature word weights; and finally construct a pseudo-relevance feedback Chinese document library and a Chinese feature word library.
The invention adopts the TF-IDF (term frequency-inverse document frequency) weighting technique (see: Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern Information Retrieval (Chinese translation by Wang Zhijin et al.). China Machine Press, 2005: 21-22) to calculate the feature word weights.
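A minimal sketch of the TF-IDF weighting used in step 4; the exact variant in the cited textbook may differ, so the formula below (term frequency normalized by document length, inverse document frequency as a plain logarithm) is an assumed common form.

```python
# Sketch of step 4's feature word weighting (assumed TF-IDF variant).
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of term lists, the m pseudo-relevance feedback documents.
    Returns {(doc_index, term): weight}."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    weights = {}
    for i, d in enumerate(docs):
        tf = Counter(d)
        for t, f in tf.items():
            weights[(i, t)] = (f / len(d)) * math.log(n / df[t])
    return weights
```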
Step 5: mine associated expansion words AET (Association Expansion Terms) from the pseudo-relevance feedback document set with a Copulas-function-based expansion word mining method, and build the associated expansion word set. The Copulas-function-based associated expansion word mining method specifically comprises the following steps:
(5.1) Mine the 1_frequent itemsets L1: extract feature words from the Chinese feature word library to obtain 1_candidate itemsets C1, and calculate the Copulas-function-based support Copulas_Support(C1) of each C1; if Copulas_Support(C1) is not lower than the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS (Frequent ItemSet).

Copulas_Support (Copulas based Support) denotes the support based on the Copulas function.
Copulas_Support(C1) is calculated as shown in formula (5):

Copulas_Support(C1) = exp(log(frequency(C1) / frequency(allDocs)) + log(weight(C1) / weight(allItems)))    (5)

In formula (5), frequency(C1) denotes the frequency of occurrence of the 1_candidate C1 in the pseudo-relevance feedback Chinese document library, frequency(allDocs) denotes the total number of documents in the library, weight(C1) denotes the itemset weight of C1 in the library, and weight(allItems) denotes the accumulated weight of all Chinese feature word items in the library; exp denotes the exponential function with the natural constant e as its base.
(5.2) Mine the k_frequent itemsets Lk: generate k_candidates Ck by self-joining the (k-1)_frequent itemsets L(k-1), where k ≥ 2. When k = 2, if Ck does not contain an original query term, delete Ck; if Ck contains an original query term, keep Ck and calculate the support Copulas_Support(Ck) of the kept Ck. When k > 2, directly calculate Copulas_Support(Ck) for Ck. If Copulas_Support(Ck) is not lower than ms, take Ck as a k_frequent itemset Lk and add it to the FIS.
The self-join adopts the candidate itemset join method given in the Apriori algorithm (see: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D C, USA, 1993: 207-216).
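The self-join itself can be sketched as below, joining (k-1)_frequent itemsets that agree on their first k-2 items, in the manner of Apriori candidate generation; representing itemsets as sorted tuples is an assumption of the sketch.

```python
# Sketch of the Apriori-style self-join in step (5.2): generate
# k_candidate itemsets from the (k-1)_frequent itemsets.
def self_join(freq_prev):
    """freq_prev: list of sorted term tuples, each of length k-1."""
    candidates = set()
    for a in range(len(freq_prev)):
        for b in range(a + 1, len(freq_prev)):
            p, q = freq_prev[a], freq_prev[b]
            if p[:-1] == q[:-1]:  # first k-2 items agree
                candidates.add(tuple(sorted(set(p) | set(q))))
    return candidates
```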
Copulas_Support(Ck) is calculated as shown in formula (6):

Copulas_Support(Ck) = exp(log(frequency(Ck) / frequency(allDocs)) + log(weight(Ck) / weight(allItems)))    (6)

In formula (6), frequency(Ck) denotes the frequency of occurrence of the k_candidate Ck in the pseudo-relevance feedback Chinese document library, and weight(Ck) denotes the itemset weight of Ck in the library; frequency(allDocs) and weight(allItems) are defined as in formula (5).
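Reading formulas (5) and (6) as the product-copula composition exp(log u + log v) that the patent also uses in formula (13), a sketch of the support computation might look as follows; this reading of the Copulas fusion is an assumption of the sketch, as is the document representation.

```python
# Sketch of Copulas_Support for steps (5.1)/(5.2), assuming the
# product-copula reading exp(log u + log v) of formulas (5)/(6).
import math

def copulas_support(itemset, doc_term_sets, item_weights, total_weight):
    """doc_term_sets: one set of terms per feedback document.
    item_weights: accumulated weight of each term over the library."""
    freq = sum(1 for d in doc_term_sets if set(itemset) <= d)
    u = freq / len(doc_term_sets)                      # frequency ratio
    v = sum(item_weights.get(t, 0.0) for t in itemset) / total_weight
    if u <= 0 or v <= 0:
        return 0.0
    return math.exp(math.log(u) + math.log(v))         # Copulas fusion
```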
(5.3) Add 1 to k and return to step (5.2), continuing the subsequent steps in order, until the generated Lk is an empty set; frequent itemset mining then ends, and the procedure goes to step (5.4).

(5.4) Take any Lk (k ≥ 2) from the FIS.
(5.5) Extract proper subset itemsets Lq and LAet from Lk, and calculate the Copulas-function-based confidence Copulas_Confidence(Lq → LAet) of the association rule Lq → LAet, where Lq ∪ LAet = Lk and Lq ∩ LAet = ∅.

LAet is a proper subset itemset containing no query term, and Lq is a proper subset itemset containing query terms.
Copulas_Confidence (Copulas based Confidence) denotes the confidence based on the Copulas function; Copulas_Confidence(Lq → LAet) is given by formula (7):

Copulas_Confidence(Lq → LAet) = exp(log(frequency(Lk) / frequency(Lq)) + log(weight(Lk) / weight(Lq)))    (7)

In formula (7), frequency(Lk) denotes the frequency of occurrence of the k_frequent itemset Lk in the pseudo-relevance feedback Chinese document library, weight(Lk) denotes the itemset weight of Lk in the library, frequency(Lq) denotes the frequency of occurrence of the proper subset itemset Lq of Lk in the library, and weight(Lq) denotes the itemset weight of Lq in the library.
(5.6) Mine association rules Lq → LAet: extract the association rules Lq → LAet whose Copulas_Confidence(Lq → LAet) is not lower than the minimum confidence threshold mc and add them to the association rule set AR (Association Rules); then return to step (5.5) to extract other proper subset itemsets Lq and LAet from Lk and carry out the subsequent steps in order, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (5.4) for a new round of association rule pattern mining, taking any other Lk from the FIS and carrying out the subsequent steps in order, looping until every k_frequent itemset Lk in the FIS has been taken out exactly once; association rule pattern mining then ends, and the procedure goes to step (5.7).
(5.7) Extract the feature words of the rule consequents LAet from the association rule set AR as associated expansion words, obtain the associated expansion word set AETS (Association Expansion Term Set), and calculate the associated expansion word weight w_Aet; then go to step 6.

The AETS is shown in formula (8):

AETS = {Aet_1, Aet_2, …, Aet_n}    (8)

In formula (8), Aet_i denotes the i-th associated expansion word.
The associated expansion word weight w_Aet is calculated as shown in formula (9):

w_Aet = max(Copulas_Confidence(Lq → LAet))    (9)

In formula (9), max() takes the maximum association rule confidence: when the same expansion word appears in multiple association rule patterns at the same time, the maximum confidence is taken as its weight.
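Steps (5.4) to (5.7) can be sketched as one loop over the frequent itemsets: enumerate proper subset pairs, keep rules that reach the confidence threshold mc, and weight each consequent term by its maximum rule confidence per formula (9). The `confidence` callback stands in for Copulas_Confidence and is an assumed interface.

```python
# Sketch of steps (5.4)-(5.7): mine rules L_q -> L_Aet and collect the
# associated expansion words with weights w_Aet = max rule confidence.
from itertools import combinations

def mine_expansion_words(freq_itemsets, query_terms, confidence, mc):
    """confidence(lq, lk): assumed callback returning
    Copulas_Confidence(lq -> lk minus lq)."""
    w_aet = {}
    qset = set(query_terms)
    for lk in freq_itemsets:                    # each k_frequent itemset
        for r in range(1, len(lk)):
            for lq in combinations(lk, r):      # proper subset L_q
                l_aet = tuple(t for t in lk if t not in lq)
                # Antecedent must contain query terms, consequent none.
                if not set(lq) & qset or set(l_aet) & qset:
                    continue
                c = confidence(lq, lk)
                if c >= mc:                     # rule enters AR
                    for t in l_aet:             # consequent feature words
                        w_aet[t] = max(w_aet.get(t, 0.0), c)
    return w_aet                                # AETS with weights w_Aet
```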
Step 6: calculate the vector cosine similarity between the associated expansion words and the original query, and extract the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold to obtain the word vector associated expansion word set. The specific steps are as follows:
(6.1) Compute the vector cosine similarity VecCos(Aet_l, q_s) between each associated expansion word (Aet_1, Aet_2, …, Aet_s) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, as shown in formula (10), where 1 ≤ l ≤ i and 1 ≤ s ≤ j:

VecCos(Aet_l, q_s) = (v_Aet_l · v_q_s) / (||v_Aet_l|| × ||v_q_s||)    (10)

In formula (10), v_Aet_l denotes the word vector of the l-th associated expansion word Aet_l, and v_q_s denotes the word vector of the s-th query term q_s.
(6.2) Accumulate the vector similarity values between the associated expansion word and each query term; the similarity sum serves as the vector cosine similarity value VecSim(Aet_l, Q) between the associated expansion word and the original query term set Q, as shown in formula (11):

VecSim(Aet_l, Q) = Σ_{s=1}^{j} VecCos(Aet_l, q_s)    (11)
(6.3) Extract the associated expansion words whose vector similarity VecSim(Aet_l, Q) is not lower than the minimum similarity threshold minVSim as word vector associated expansion words, obtain the word vector associated expansion word set WEAETS (Word Embedding Association Expansion Term Set), and calculate the word vector associated expansion word weight w(Avet_l); w(Avet_l) is composed of the associated expansion word weight w_Aet and the vector cosine similarity value VecSim(Avet_l, Q) between the associated expansion word and the original query term set Q.

The word vector associated expansion word set WEAETS is shown in formula (12):

WEAETS = {Avet_s | VecSim(Avet_s, Q) ≥ minVSim}    (12)

In formula (12), Avet_s denotes the s-th word vector associated expansion word, and VecSim(Avet_l, Q) denotes the accumulated sum of the vector cosine similarity values between the l-th word vector associated expansion word and each query term, calculated according to formula (11).
The calculation of w(Avet_l) is shown in formula (13):

w(Avet_l) = exp(log(w_Aet) + log(VecSim(Avet_l, Q)))    (13)
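Step 6 then reduces to a filter plus the fusion of formula (13); the sketch below assumes the `vec_cos` helper and gensim model from the step 3 sketch, and a positive threshold minVSim, since the logarithm requires positive arguments.

```python
# Sketch of step 6: keep associated expansion words whose accumulated
# cosine similarity reaches minVSim, and fuse weights per formula (13).
import math

def word_vector_assoc_expansion(model, w_aet, query_terms, min_vsim):
    result = {}
    for aet, w in w_aet.items():
        if aet not in model.wv:
            continue  # term absent from the trained vocabulary
        vec_sim = sum(vec_cos(model.wv[aet], model.wv[q])
                      for q in query_terms)     # formula (11)
        if vec_sim >= min_vsim:                 # min_vsim assumed > 0
            # Formula (13): exp(log a + log b), the Copulas fusion of
            # rule confidence and vector similarity.
            result[aet] = math.exp(math.log(w) + math.log(vec_sim))
    return result  # WEAETS with weights w(Avet_l)
```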
Step 7: fuse the word embedding expansion word set WEETS and the word vector associated expansion word set WEAETS by taking their union to obtain the final expansion word set FETS (Final Expansion Term Set), and calculate the final expansion word weight w(ET_l).

The final expansion word set FETS is shown in formula (14):

FETS = WEETS ∪ WEAETS    (14)

The final expansion word weight w(ET_l) is the word embedding expansion word weight w(vet_l), or the word vector associated expansion word weight w(Avet_l), or the sum of the two, as shown in formula (15):

w(ET_l) = w(vet_l), if ET_l ∈ WEETS and ET_l ∉ WEAETS; w(ET_l) = w(Avet_l), if ET_l ∈ WEAETS and ET_l ∉ WEETS; w(ET_l) = w(vet_l) + w(Avet_l), if ET_l ∈ WEETS ∩ WEAETS    (15)
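Since a term missing from one set contributes zero weight, the union and weighting of formulas (14) and (15) amount to summing per-term weights; a minimal sketch:

```python
# Sketch of step 7: FETS = WEETS union WEAETS, weights per formula (15).
def fuse_expansion_sets(weets, weaets):
    """weets, weaets: {term: weight} dictionaries."""
    return {t: weets.get(t, 0.0) + weaets.get(t, 0.0)
            for t in set(weets) | set(weaets)}
```

The fused terms, ranked by w(ET_l), are then appended to the original query for the second retrieval pass of step 8.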
Step 8: finally, combine the expansion words with the original query into a new query and retrieve the document set again, realizing query expansion.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a Chinese query expansion method based on pattern mining and word vector similarity calculation. Word vector semantic learning training is performed on the initial retrieval document set to obtain a word vector set comprising query terms and non-query terms; associated expansion words are mined from the pseudo-relevance feedback document set with the Copulas-function-based associated expansion word mining method; and two kinds of vector cosine similarity computation are performed in the word vector set: the vector cosine similarity between the non-query terms and the original query is calculated, and the top-ranked non-query terms in descending order of similarity value are extracted as word embedding expansion words to obtain the word embedding expansion word set; the vector cosine similarity between the associated expansion words and the original query is calculated, and the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold are extracted to obtain the word vector associated expansion word set. The word embedding expansion word set and the word vector associated expansion word set are fused into the final expansion words, which are combined with the original query into a new query used to retrieve the document set again, realizing query expansion. The method combines association pattern mining with word vector similarity calculation, mines high-quality expansion words, improves information retrieval performance, and has good application value and promotion prospects.
(2) Four similar query expansion methods from recent years are selected as comparison methods, with the Chinese corpora of the international standard dataset NTCIR-5 CLIR as experimental data. The experimental results show that the retrieval results MAP and P@5 of the method are higher than those of the baseline retrieval and of the 4 comparison expansion methods, and the retrieval performance of the method is superior to that of the baseline retrieval and the comparison methods; the method can therefore improve information retrieval performance, reduce the problems of query drift and word mismatch in information retrieval, and has high application value and broad promotion prospects.
Drawings
FIG. 1 is a general flow chart of the Chinese query expansion method based on pattern mining and word vector similarity calculation according to the present invention.
Detailed Description
First, to better explain the technical scheme of the invention, the related concepts are introduced as follows:
1. item set
In text mining, a text document is regarded as a transaction, each feature word in the document is called an item, a set of feature word items is called an itemset, and the number of items in an itemset is called the itemset length. A k_itemset is an itemset containing k items, k being its length.
2. Associating rules front and back parts
Let x and y be any feature term sets; an implication of the form x → y is called an association rule, where x is called the rule antecedent and y the rule consequent.
3. Characteristic term set support degree and confidence degree based on Copulas function
Copula theory (see: Sklar A. Fonctions de répartition à n dimensions et leurs marges [J]. Publications de l'Institut de Statistique de l'Université de Paris, 1959, 8: 229-231) couples multiple marginal distributions into one joint distribution.
The invention uses a Copulas function to integrate the frequency and the weight of a feature term set into the support and confidence of feature term association patterns, and proposes the Copulas-function-based feature term set support Copulas_Support (Copulas based Support) and confidence Copulas_Confidence (Copulas based Confidence).
The Copulas-function-based support Copulas_Support(T1 ∪ T2) of a feature term set (T1 ∪ T2) is calculated as shown in formula (16):

Copulas_Support(T1 ∪ T2) = exp(log(frequency(T1 ∪ T2) / frequency(allDocs)) + log(weight(T1 ∪ T2) / weight(allItems)))    (16)

In formula (16), frequency(T1 ∪ T2) denotes the frequency of occurrence of the itemset (T1 ∪ T2) in the pseudo-relevance feedback Chinese document library, frequency(allDocs) denotes the total number of documents in the library, weight(T1 ∪ T2) denotes the itemset weight of (T1 ∪ T2) in the library, and weight(allItems) denotes the accumulated weight of all Chinese feature terms in the library; exp denotes the exponential function with the natural constant e as its base.
The Copulas-function-based feature word association rule confidence Copulas_Confidence(T1 → T2) is calculated as shown in formula (17):

Copulas_Confidence(T1 → T2) = exp(log(frequency(T1 ∪ T2) / frequency(T1)) + log(weight(T1 ∪ T2) / weight(T1)))    (17)

In formula (17), frequency(T1) denotes the frequency of occurrence of the itemset T1 in the pseudo-relevance feedback Chinese document library, and weight(T1) denotes the itemset weight of T1 in the library; frequency(T1 ∪ T2) and weight(T1 ∪ T2) are defined as in formula (16).
4. Associated expansion word and word vector associated expansion word
The associated expansion words come from the consequent itemsets of association rules, and the association rule confidence serves as the associated expansion word weight.
The vector cosine similarity between the associated expansion words and the original query is calculated, and the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold form the word vector associated expansion word set.
The word vector associated expansion word weight w(Avet_l) is composed of the associated expansion word weight w_Aet and the vector cosine similarity value VecSim(Avet_l, Q) between the associated expansion word and the original query term set Q. Because the two weights have different sources, the invention uses the Copulas function to integrate them into the word vector associated expansion word weight w(Avet_l), as shown in formula (18):

w(Avet_l) = exp(log(w_Aet) + log(VecSim(Avet_l, Q)))    (18)
5. Word-embedded expansion word
The vector cosine similarity between a non-query term and all query terms is calculated, and the accumulated sum of these similarities is taken as the total vector cosine similarity between the non-query term and the original query; the top Vm non-query terms, extracted in descending order of total vector cosine similarity, are taken as word embedding expansion words, and the total vector cosine similarity is taken as the word embedding expansion word weight.
The invention is further explained below with reference to the drawings and specific comparative experiments.
As shown in FIG. 1, the Chinese query expansion method based on pattern mining and word vector similarity calculation of the present invention comprises the following steps:
Step 1: the user's query retrieves the Chinese document set to obtain an initial retrieval document set.
Step 2: perform Chinese word segmentation and stop-word removal on the initial retrieval document set, and perform word vector semantic learning training on it with a deep learning tool to obtain a word vector set comprising query terms and non-query terms.
The deep learning tool of the invention is the Skip-gram model of the Google open-source word vector tool word2vec.
Step 3: compute and accumulate the vector cosine similarity between each non-query term and all query terms, sort by similarity value in descending order, and extract the top-ranked non-query terms as word embedding expansion words to obtain the word embedding expansion word set. The specific steps are as follows:
(3.1) Compute the vector cosine similarity VecCos(cet_l, q_s) between each non-query term (cet_1, cet_2, …, cet_i) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, where 1 ≤ l ≤ i and 1 ≤ s ≤ j.
The vector cosine similarity VecCos(cet_l, q_s) is calculated as shown in formula (1):

VecCos(cet_l, q_s) = (v_cet_l · v_q_s) / (||v_cet_l|| × ||v_q_s||)    (1)

In formula (1), v_cet_l denotes the word vector of the l-th non-query term cet_l, and v_q_s denotes the word vector of the s-th query term q_s.
(3.2) Accumulate the vector cosine similarities between the non-query term and each query term in the original query term set Q; the total similarity value serves as the vector cosine similarity VecSim(cet_l, Q) between the non-query term and Q, as shown in formula (2):

VecSim(cet_l, Q) = Σ_{s=1}^{j} VecCos(cet_l, q_s)    (2)
(3.3) Sort VecSim(cet_l, Q) in descending order, extract the top Vm non-query terms as the word embedding expansion words of the original query term set Q, construct the word embedding expansion word set WEETS (Word Embedding Expansion Term Set), and calculate the word embedding expansion word weight w(vet_l); then go to step 4.

The word embedding expansion word set WEETS is shown in formula (3):

WEETS = {vet_1, vet_2, …, vet_Vm}    (3)

In formula (3), vet_l denotes the l-th word embedding expansion word, l ∈ (1, 2, …, Vm).
The invention takes the total vector cosine similarity value as the word embedding expansion word weight w(vet_l), as shown in formula (4):

w(vet_l) = VecSim(vet_l, Q)    (4)
Step 4: extract the top m documents from the initial retrieval document set as pseudo-relevance feedback documents to construct a pseudo-relevance feedback document set; perform Chinese word segmentation, stop-word removal, and feature word extraction preprocessing on the pseudo-relevance feedback document set; calculate the feature word weights with the TF-IDF weighting technique; and finally construct a pseudo-relevance feedback Chinese document library and a Chinese feature word library.
Step 5: mine associated expansion words AET (Association Expansion Terms) from the pseudo-relevance feedback document set with a Copulas-function-based expansion word mining method, and build the associated expansion word set. The Copulas-function-based associated expansion word mining method specifically comprises the following steps:

(5.1) Mine the 1_frequent itemsets L1: extract feature words from the Chinese feature word library to obtain 1_candidate itemsets C1, and calculate the Copulas-function-based support Copulas_Support(C1) of each C1; if Copulas_Support(C1) is not lower than the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS (Frequent ItemSet).

Copulas_Support (Copulas based Support) denotes the support based on the Copulas function.
Copulas_Support(C1) is calculated as shown in formula (5):

Copulas_Support(C1) = exp(log(frequency(C1) / frequency(allDocs)) + log(weight(C1) / weight(allItems)))    (5)

In formula (5), frequency(C1) denotes the frequency of occurrence of the 1_candidate C1 in the pseudo-relevance feedback Chinese document library, frequency(allDocs) denotes the total number of documents in the library, weight(C1) denotes the itemset weight of C1 in the library, and weight(allItems) denotes the accumulated weight of all Chinese feature word items in the library; exp denotes the exponential function with the natural constant e as its base.
(5.2) Mine the k_frequent itemsets Lk: generate k_candidates Ck by self-joining the (k-1)_frequent itemsets L(k-1), where k ≥ 2. When k = 2, if Ck does not contain an original query term, delete Ck; if Ck contains an original query term, keep Ck and calculate the support Copulas_Support(Ck) of the kept Ck. When k > 2, directly calculate Copulas_Support(Ck) for Ck. If Copulas_Support(Ck) is not lower than ms, take Ck as a k_frequent itemset Lk and add it to the FIS.
The self-join adopts the candidate itemset join method given in the Apriori algorithm.
Copulas_Support(Ck) is calculated as shown in formula (6):

Copulas_Support(Ck) = exp(log(frequency(Ck) / frequency(allDocs)) + log(weight(Ck) / weight(allItems)))    (6)

In formula (6), frequency(Ck) denotes the frequency of occurrence of the k_candidate Ck in the pseudo-relevance feedback Chinese document library, and weight(Ck) denotes the itemset weight of Ck in the library; frequency(allDocs) and weight(allItems) are defined as in formula (5).
(5.3) Add 1 to k and return to step (5.2), continuing the subsequent steps in order, until the generated Lk is an empty set; frequent itemset mining then ends, and the procedure goes to step (5.4).

(5.4) Take any Lk (k ≥ 2) from the FIS.
(5.5) Extract proper subset itemsets Lq and LAet from Lk, and calculate the Copulas-function-based confidence Copulas_Confidence(Lq → LAet) of the association rule Lq → LAet, where Lq ∪ LAet = Lk and Lq ∩ LAet = ∅.

LAet is a proper subset itemset containing no query term, and Lq is a proper subset itemset containing query terms.
Copulas_Confidence (Copulas based Confidence) denotes the confidence based on the Copulas function; Copulas_Confidence(Lq → LAet) is given by formula (7):

Copulas_Confidence(Lq → LAet) = exp(log(frequency(Lk) / frequency(Lq)) + log(weight(Lk) / weight(Lq)))    (7)

In formula (7), frequency(Lk) denotes the frequency of occurrence of the k_frequent itemset Lk in the pseudo-relevance feedback Chinese document library, weight(Lk) denotes the itemset weight of Lk in the library, frequency(Lq) denotes the frequency of occurrence of the proper subset itemset Lq of Lk in the library, and weight(Lq) denotes the itemset weight of Lq in the library.
(5.6) Mine association rules Lq → LAet: extract the association rules Lq → LAet whose Copulas_Confidence(Lq → LAet) is not lower than the minimum confidence threshold mc and add them to the association rule set AR (Association Rules); then return to step (5.5) to extract other proper subset itemsets Lq and LAet from Lk and carry out the subsequent steps in order, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (5.4) for a new round of association rule pattern mining, taking any other Lk from the FIS and carrying out the subsequent steps in order, looping until every k_frequent itemset Lk in the FIS has been taken out exactly once; association rule pattern mining then ends, and the procedure goes to step (5.7).
(5.7) Extract the feature words of the rule consequents LAet from the association rule set AR as associated expansion words, obtain the associated expansion word set AETS (Association Expansion Term Set), and calculate the associated expansion word weight w_Aet; then go to step 6.

The AETS is shown in formula (8):

AETS = {Aet_1, Aet_2, …, Aet_n}    (8)

In formula (8), Aet_i denotes the i-th associated expansion word.
The associated expansion word weight w_Aet is calculated as shown in formula (9):

w_Aet = max(Copulas_Confidence(Lq → LAet))    (9)

In formula (9), max() takes the maximum association rule confidence: when the same expansion word appears in multiple association rule patterns at the same time, the maximum confidence is taken as its weight.
Step 6: calculate the vector cosine similarity between the associated expansion words and the original query, and extract the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold to obtain the word vector associated expansion word set. The specific steps are as follows:
(6.1) Compute the vector cosine similarity VecCos(Aet_l, q_s) between each associated expansion word (Aet_1, Aet_2, …, Aet_s) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, as shown in formula (10), where 1 ≤ l ≤ i and 1 ≤ s ≤ j:

VecCos(Aet_l, q_s) = (v_Aet_l · v_q_s) / (||v_Aet_l|| × ||v_q_s||)    (10)

In formula (10), v_Aet_l denotes the word vector of the l-th associated expansion word Aet_l, and v_q_s denotes the word vector of the s-th query term q_s.
(6.2) Accumulate the vector similarity values between the associated expansion word and each query term; the similarity sum serves as the vector cosine similarity value VecSim(Aet_l, Q) between the associated expansion word and the original query term set Q, as shown in formula (11):

VecSim(Aet_l, Q) = Σ_{s=1}^{j} VecCos(Aet_l, q_s)    (11)
(6.3) Extract the associated expansion words whose vector similarity VecSim(Aet_l, Q) is not lower than the minimum similarity threshold minVSim as word vector associated expansion words, obtain the word vector associated expansion word set WEAETS, and calculate the word vector associated expansion word weight w(Avet_l); w(Avet_l) is composed of the associated expansion word weight w_Aet and the vector cosine similarity value VecSim(Avet_l, Q) between the associated expansion word and the original query term set Q.

The word vector associated expansion word set WEAETS is shown in formula (12):

WEAETS = {Avet_s | VecSim(Avet_s, Q) ≥ minVSim}    (12)

In formula (12), Avet_s denotes the s-th word vector associated expansion word, and VecSim(Avet_l, Q) denotes the accumulated sum of the vector cosine similarity values between the l-th word vector associated expansion word and each query term, calculated according to formula (11).
The calculation of w(Avet_l) is shown in formula (13):

w(Avet_l) = exp(log(w_Aet) + log(VecSim(Avet_l, Q)))    (13)
Step 7: fuse the word embedding expansion word set WEETS and the word vector associated expansion word set WEAETS by taking their union to obtain the final expansion word set FETS (Final Expansion Term Set), and calculate the final expansion word weight w(ET_l).

The final expansion word set FETS is shown in formula (14):

FETS = WEETS ∪ WEAETS    (14)

The final expansion word weight w(ET_l) is the word embedding expansion word weight w(vet_l), or the word vector associated expansion word weight w(Avet_l), or the sum of the two, as shown in formula (15):

w(ET_l) = w(vet_l), if ET_l ∈ WEETS and ET_l ∉ WEAETS; w(ET_l) = w(Avet_l), if ET_l ∈ WEAETS and ET_l ∉ WEETS; w(ET_l) = w(vet_l) + w(Avet_l), if ET_l ∈ WEETS ∩ WEAETS    (15)
Step 8: finally, combine the expansion words with the original query into a new query and retrieve the document set again, realizing query expansion.
Experimental design and results:
We compare the method of the invention with existing similar methods experimentally to illustrate its effectiveness.
1. Experimental environment and experimental data:
To verify the validity of the query expansion model proposed herein, the Chinese text corpora of the international standard dataset NTCIR-5 CLIR (http://research.nii.ac.jp/ntcir/data/data-en.html) are used as experimental data. The Chinese corpus comprises 8 data sets with 901,446 documents in total; the specific information is shown in Table 1. The corpus has 4 types of query topics and 50 Chinese queries in total, and the result set has 2 evaluation criteria: Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant, and partially relevant to the query).
The invention adopts the Title query topic, which is a brief description of the query in nouns and noun phrases.
The experimental data preprocessing comprises Chinese word segmentation and stop-word removal. The retrieval evaluation indexes of the experimental results are MAP (Mean Average Precision) and P@5.
TABLE 1 NTCIR-5 CLIR Chinese original corpora information (presented as an image in the original publication)
2. Baseline retrieval and comparison methods:
The basic experimental retrieval environment is built with Lucene.
The baseline retrieval and the comparison algorithms are described as follows:
Baseline retrieval BR (Baseline Retrieval): the retrieval results obtained by the initial retrieval of the 50 original queries through Lucene. The specific comparison query expansion methods are described as follows:
Comparative method 1: expansion words are mined for query expansion with the weighted association pattern mining technique of the literature (Huang Mingxuan. Vietnamese-English cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-). Experimental parameters: mc = 0.1, mi = 0.0001, ms ∈ (0.004, 0.005, 0.006, 0.007).
Comparative method 2: expansion words are mined for query expansion with the weighted frequent pattern mining technique based on multiple support thresholds of the literature (Zhang H R, Zhang J W, Wei X Y, et al. A new frequent pattern mining algorithm with weighted multiple minimum supports [J]. Intelligent Automation & Soft Computing, 2017, 23(4): 605-). Experimental parameters: mc = 0.1, LMS = 0.2, HMS = 0.25, WT = 0.1, ms ∈ (0.1, 0.15, 0.2, 0.25).
Comparative method 3: the word-vector-based query expansion method of the literature (research on patent query expansion with word vectors [J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 972-980). Experimental parameters: k = 60, α = 0.1.
Comparative method 4: positive and negative expansion words are mined with the fully weighted positive and negative association pattern mining technique of the literature (Huang Mingxuan, Jiang Caoqing. Vietnamese-English cross-language query post-translation expansion based on fully weighted positive and negative association pattern mining [J]. Acta Electronica Sinica, 2018, 46(12): 3029-) to realize query expansion. Experimental parameters: mc = 0.1, α = 0.3, minPR = 0.1, minNR = 0.01, ms ∈ (0.10, 0.11, 0.12, 0.13).
3. The experimental results are as follows:
The Lucene-based source programs of the method of the invention and of the comparison methods are run on the experimental data set for the 50 Chinese queries to obtain the average retrieval results MAP and P@5 of the baseline retrieval, the comparison methods, and the method of the invention, as shown in Tables 2 to 5.
TABLE 2 P@5 values (Relax) of the retrieval results of the method of the invention and the baseline retrieval and comparison methods (presented as an image in the original publication)
TABLE 3 P@5 values (Rigid) of the retrieval results of the method of the invention and the baseline retrieval and comparison methods (presented as an image in the original publication)
TABLE 4 MAP values (Relax) of the retrieval results of the method of the invention and the baseline retrieval and comparison methods (presented as an image in the original publication)
TABLE 5 MAP values (Rigid) of the retrieval results of the method of the invention and the baseline retrieval and comparison methods (presented as an image in the original publication)
Tables 2 to 5 show that the MAP and P@5 values of the method of the invention are higher than those of the baseline retrieval, and that compared with the 4 comparison methods, the MAP and P@5 values of the method are mostly improved; the expanded retrieval performance of the method is thus higher than that of the baseline retrieval and of the similar comparison methods. The experimental results show that the method is effective, can indeed improve information retrieval performance, and has high application value and broad promotion prospects.

Claims (8)

1. A Chinese query expansion method based on pattern mining and word vector similarity calculation is characterized by comprising the following steps:
Step 1: the user's query retrieves the Chinese document set to obtain an initial retrieval document set;
Step 2: perform Chinese word segmentation and stop-word removal on the initial retrieval document set, and perform word vector semantic learning training on it with a deep learning tool to obtain a word vector set comprising query terms and non-query terms;
Step 3: compute and accumulate the vector cosine similarity between each non-query term and all query terms, sort by similarity value in descending order, and extract the top-ranked non-query terms as word embedding expansion words to obtain the word embedding expansion word set, specifically:
(3.1) compute the vector cosine similarity VecCos(cet_l, q_s) between each non-query term (cet_1, cet_2, …, cet_i) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q, where 1 ≤ l ≤ i and 1 ≤ s ≤ j;
(3.2) accumulate the vector cosine similarities between the non-query term and each query term in the original query term set Q; the total similarity value serves as the vector cosine similarity VecSim(cet_l, Q) between the non-query term and Q;
(3.3) sort VecSim(cet_l, Q) in descending order, extract the top Vm non-query terms as word embedding expansion words of the original query term set Q, construct the word embedding expansion word set WEETS, and calculate the word embedding expansion word weight w(vet_l); then go to step 4;
Step 4: extract the top m documents from the initial retrieval document set as pseudo-relevance feedback documents to construct a pseudo-relevance feedback document set; perform Chinese word segmentation, stop-word removal, and feature word extraction preprocessing on it; calculate the feature word weights; and finally construct a pseudo-relevance feedback Chinese document library and a Chinese feature word library;
Step 5: mine associated expansion words AET from the pseudo-relevance feedback document set with a Copulas-function-based expansion word mining method, and build the associated expansion word set; the Copulas-function-based associated expansion word mining method specifically comprises:
(5.1) mine the 1_frequent itemsets L1: extract feature words from the Chinese feature word library to obtain 1_candidate itemsets C1, and calculate the Copulas-function-based support Copulas_Support(C1); if Copulas_Support(C1) is not lower than the minimum support threshold ms, take C1 as a 1_frequent itemset L1 and add it to the frequent itemset set FIS;
(5.2) mine the k_frequent itemsets Lk: generate k_candidates Ck by self-joining the (k-1)_frequent itemsets L(k-1), where k ≥ 2; when k = 2, if Ck does not contain an original query term, delete Ck, and if Ck contains an original query term, keep Ck and calculate the support Copulas_Support(Ck) of the kept Ck; when k > 2, directly calculate Copulas_Support(Ck) for Ck; if Copulas_Support(Ck) is not lower than ms, take Ck as a k_frequent itemset Lk and add it to the FIS;
(5.3) add 1 to k and return to step (5.2), continuing the subsequent steps in order, until the generated Lk is an empty set; frequent itemset mining then ends, and the procedure goes to step (5.4);
(5.4) take any Lk (k ≥ 2) from the FIS;
(5.5) extract proper subset itemsets Lq and LAet from Lk, and calculate the Copulas-function-based confidence Copulas_Confidence(Lq → LAet) of the association rule Lq → LAet, where Lq ∪ LAet = Lk and Lq ∩ LAet = ∅; LAet is a proper subset itemset containing no query term, and Lq is a proper subset itemset containing query terms;
(5.6) mine association rules Lq → LAet: extract the association rules Lq → LAet whose Copulas_Confidence(Lq → LAet) is not lower than the minimum confidence threshold mc and add them to the association rule set AR; then return to step (5.5) to extract other proper subset itemsets Lq and LAet from Lk and carry out the subsequent steps in order, looping until each proper subset itemset of Lk has been taken out exactly once; then go to step (5.4) for a new round of association rule pattern mining, taking any other Lk from the FIS and carrying out the subsequent steps in order, looping until every k_frequent itemset Lk in the FIS has been taken out exactly once; association rule pattern mining then ends, and the procedure goes to step (5.7);
(5.7) extract the feature words of the rule consequents LAet from the association rule set AR as associated expansion words, obtain the associated expansion word set AETS, and calculate the associated expansion word weight w_Aet; then go to step 6;
Step 6: calculate the vector cosine similarity between the associated expansion words and the original query, and extract the associated expansion words whose vector similarity value is not lower than the minimum similarity threshold to obtain the word vector associated expansion word set, specifically:
(6.1) compute the vector cosine similarity VecCos(Aet_l, q_s) between each associated expansion word (Aet_1, Aet_2, …, Aet_s) in the word vector set and each query term (q_1, q_2, …, q_j) in the original query term set Q = (q_1, q_2, …, q_j), where 1 ≤ l ≤ i and 1 ≤ s ≤ j;
(6.2) accumulate the vector similarity values between the associated expansion word and each query term; the similarity sum serves as the vector cosine similarity value VecSim(Aet_l, Q) between the associated expansion word and the original query term set Q;
(6.3) extract the associated expansion words whose vector similarity VecSim(Aet_l, Q) is not lower than the minimum similarity threshold minVSim as word vector associated expansion words, obtain the word vector associated expansion word set WEAETS, and calculate the word vector associated expansion word weight w(Avet_l); w(Avet_l) is composed of the associated expansion word weight w_Aet and the vector cosine similarity value VecSim(Avet_l, Q) between the associated expansion word and the original query term set Q;
Step 7: fuse the word embedding expansion word set WEETS and the word vector associated expansion word set WEAETS by taking their union to obtain the final expansion word set FETS (Final Expansion Term Set), and calculate the final expansion word weight w(ET_l);
Step 8: finally, combine the expansion words with the original query into a new query and retrieve the document set again, realizing query expansion.
2. The method of claim 1, wherein:
in step (3.1), the vector cosine similarity VecCos(cet_l, q_s) is calculated as shown in formula (1):

VecCos(cet_l, q_s) = (v_cet_l · v_q_s) / (||v_cet_l|| × ||v_q_s||)    (1)

in formula (1), v_cet_l denotes the word vector of the l-th non-query term cet_l, and v_q_s denotes the word vector of the s-th query term q_s;

in step (3.2), the vector cosine similarity VecSim(cet_l, Q) between the non-query term and the original query term set Q is as shown in formula (2):

VecSim(cet_l, Q) = Σ_{s=1}^{j} VecCos(cet_l, q_s)    (2)

in step (3.3), the word embedding expansion word set WEETS is as shown in formula (3):

WEETS = {vet_1, vet_2, …, vet_Vm}    (3)

in formula (3), vet_l denotes the l-th word embedding expansion word, l ∈ (1, 2, …, Vm);

the total vector cosine similarity value is taken as the word embedding expansion word weight w(vet_l), as shown in formula (4):

w(vet_l) = VecSim(vet_l, Q)    (4).
3. The method of claim 1, wherein:
in step (5.1), Copulas_Support(C1) is calculated as shown in formula (5):

Copulas_Support(C1) = exp(log(frequency(C1) / frequency(allDocs)) + log(weight(C1) / weight(allItems)))    (5)

in formula (5), frequency(C1) denotes the frequency of occurrence of the 1_candidate C1 in the pseudo-relevance feedback Chinese document library, frequency(allDocs) denotes the total number of documents in the library, weight(C1) denotes the itemset weight of C1 in the library, and weight(allItems) denotes the accumulated weight of all Chinese feature word items in the library; exp denotes the exponential function with the natural constant e as its base;

in step (5.2), Copulas_Support(Ck) is calculated as shown in formula (6):

Copulas_Support(Ck) = exp(log(frequency(Ck) / frequency(allDocs)) + log(weight(Ck) / weight(allItems)))    (6)

in formula (6), frequency(Ck) denotes the frequency of occurrence of the k_candidate Ck in the pseudo-relevance feedback Chinese document library, and weight(Ck) denotes the itemset weight of Ck in the library; frequency(allDocs) and weight(allItems) are defined as in formula (5);

in step (5.5), Copulas_Confidence(Lq → LAet) is as shown in formula (7):

Copulas_Confidence(Lq → LAet) = exp(log(frequency(Lk) / frequency(Lq)) + log(weight(Lk) / weight(Lq)))    (7)

in formula (7), frequency(Lk) denotes the frequency of occurrence of the k_frequent itemset Lk in the pseudo-relevance feedback Chinese document library, weight(Lk) denotes the itemset weight of Lk in the library, frequency(Lq) denotes the frequency of occurrence of the proper subset itemset Lq of Lk in the library, and weight(Lq) denotes the itemset weight of Lq in the library;
in step (5.7), the AETS is as shown in formula (8):

AETS = {Aet_1, Aet_2, …, Aet_n}    (8)

in formula (8), Aet_i denotes the i-th associated expansion word;

the associated expansion word weight w_Aet is calculated as shown in formula (9):

w_Aet = max(Copulas_Confidence(Lq → LAet))    (9)

in formula (9), max() takes the maximum association rule confidence: when the same expansion word appears in multiple association rule patterns at the same time, the maximum confidence is taken as its weight.
4. The method of claim 1, wherein:
in step (6.1), VecCos(Aet_l, q_s) is as shown in formula (10):

VecCos(Aet_l, q_s) = (v_Aet_l · v_q_s) / (||v_Aet_l|| × ||v_q_s||)    (10)

in formula (10), v_Aet_l denotes the word vector of the l-th associated expansion word Aet_l, and v_q_s denotes the word vector of the s-th query term q_s;

in step (6.2), VecSim(Aet_l, Q) is as shown in formula (11):

VecSim(Aet_l, Q) = Σ_{s=1}^{j} VecCos(Aet_l, q_s)    (11)
in step (6.3), the word vector associated expansion word set WEAETS is as shown in formula (12):

WEAETS = {Avet_s | VecSim(Avet_s, Q) ≥ minVSim}    (12)

in formula (12), Avet_s denotes the s-th word vector associated expansion word, and VecSim(Avet_l, Q) denotes the accumulated sum of the vector cosine similarity values between the l-th word vector associated expansion word and each query term, calculated according to formula (11);

w(Avet_l) is calculated as shown in formula (13):

w(Avet_l) = exp(log(w_Aet) + log(VecSim(Avet_l, Q)))    (13).
5. The method of claim 1, wherein:
in step 7, the final expansion word set FETS is as shown in formula (14):

FETS = WEETS ∪ WEAETS    (14)

the final expansion word weight w(ET_l) is the word embedding expansion word weight w(vet_l), or the word vector associated expansion word weight w(Avet_l), or the sum of the two, as shown in formula (15):

w(ET_l) = w(vet_l), if ET_l ∈ WEETS and ET_l ∉ WEAETS; w(ET_l) = w(Avet_l), if ET_l ∈ WEAETS and ET_l ∉ WEETS; w(ET_l) = w(vet_l) + w(Avet_l), if ET_l ∈ WEETS ∩ WEAETS    (15).
6. The method of claim 1, wherein in step 2 the deep learning tool is the Skip-gram model of the Google open-source word vector tool word2vec.
7. The method of claim 1, wherein in step 4 a TF-IDF weighting technique is adopted to calculate the feature word weights.
8. The method of claim 1, wherein in step (5.2) the self-join adopts the candidate itemset join method given in the Apriori algorithm.
CN202010773432.1A 2020-08-04 2020-08-04 Chinese query expansion method based on pattern mining and word vector similarity calculation Withdrawn CN111897922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010773432.1A CN111897922A (en) 2020-08-04 2020-08-04 Chinese query expansion method based on pattern mining and word vector similarity calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010773432.1A CN111897922A (en) 2020-08-04 2020-08-04 Chinese query expansion method based on pattern mining and word vector similarity calculation

Publications (1)

Publication Number Publication Date
CN111897922A true CN111897922A (en) 2020-11-06

Family

ID=73183322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010773432.1A Withdrawn CN111897922A (en) 2020-08-04 2020-08-04 Chinese query expansion method based on pattern mining and word vector similarity calculation

Country Status (1)

Country Link
CN (1) CN111897922A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651224A (en) * 2020-12-24 2021-04-13 天津大学 Intelligent search method and device for engineering construction safety management document text
CN112612875A (en) * 2020-12-29 2021-04-06 重庆农村商业银行股份有限公司 Method, device and equipment for automatically expanding query words and storage medium
CN112612875B (en) * 2020-12-29 2023-05-23 重庆农村商业银行股份有限公司 Query term automatic expansion method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103064969A (en) Method for automatically creating keyword index table
CN109299278B (en) Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
Pan et al. An improved TextRank keywords extraction algorithm
CN111897922A (en) Chinese query expansion method based on pattern mining and word vector similarity calculation
CN109582769A (en) Association mode based on weight sequence excavates and the text searching method of consequent extension
CN109739953B (en) Text retrieval method based on chi-square analysis-confidence framework and back-part expansion
CN109299292B (en) Text retrieval method based on matrix weighted association rule front and back part mixed expansion
CN109726263B (en) Cross-language post-translation hybrid expansion method based on feature word weighted association pattern mining
CN111897928A (en) Chinese query expansion method for embedding expansion words into query words and counting expansion word union
CN109684463B (en) Cross-language post-translation and front-part extension method based on weight comparison and mining
Dalton et al. Semantic entity retrieval using web queries over structured RDF data
CN111897921A (en) Text retrieval method based on word vector learning and mode mining fusion expansion
CN111897924A (en) Text retrieval method based on association rule and word vector fusion expansion
CN111897927B (en) Chinese query expansion method integrating Copulas theory and association rule mining
CN109684465B (en) Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison
CN111897926A (en) Chinese query expansion method integrating deep learning and expansion word mining intersection
CN108416442B (en) Chinese word matrix weighting association rule mining method based on item frequency and weight
CN111897919A (en) Text retrieval method based on Copulas function and pseudo-correlation feedback rule expansion
CN109684464B (en) Cross-language query expansion method for realizing rule back-part mining through weight comparison
Li et al. Deep learning and semantic concept space are used in query expansion
CN111897925B (en) Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning
Yang et al. An improved pagerank algorithm based on time feedback and topic similarity
CN108170778B (en) Chinese-English cross-language query post-translation expansion method based on fully weighted rule post-piece
CN111897920A (en) Text retrieval method based on word embedding and association mode union expansion
Wu et al. Beyond greedy search: pruned exhaustive search for diversified result ranking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20201106