CN107609095B - Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback - Google Patents

Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback

Info

Publication number
CN107609095B
CN107609095B (application CN201710807540.4A)
Authority
CN
China
Prior art keywords
negative
weighted
item
query
positive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710807540.4A
Other languages
Chinese (zh)
Other versions
CN107609095A (en)
Inventor
黄名选 (Huang Mingxuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics filed Critical Guangxi University of Finance and Economics
Priority to CN201710807540.4A priority Critical patent/CN107609095B/en
Publication of CN107609095A publication Critical patent/CN107609095A/en
Application granted granted Critical
Publication of CN107609095B publication Critical patent/CN107609095B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback. A source-language query is first translated into a target-language query with a translation tool, and the target documents are retrieved to obtain the initially retrieved documents; after user relevance judgment, the top-ranked initially retrieved documents are extracted to construct the target-language initially-retrieved relevant document set. A weighted positive and negative association pattern mining technique oriented to cross-language query expansion is then applied to this document set to mine weighted positive and negative association rule patterns whose feature words contain the query terms, constructing a positive and negative association rule library of feature words. From the rule library, the weighted positive and negative association rules whose consequents are query terms are extracted; the feature words in the antecedents of the positive rules serve as positive expansion words, those of the negative rules as negative expansion words, and the negative expansion words are removed from the positive ones to obtain the final antecedent expansion words, realizing post-translation antecedent expansion for cross-language queries. The invention can improve cross-language information retrieval performance and has good application value and promotion prospects.

Description

Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback
Technical Field
The invention belongs to the field of Internet information retrieval, and in particular relates to a cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback, suitable for cross-language information retrieval.
Background
Cross-Language Information Retrieval (CLIR) is a technique in which a user query expressed in one language, called the source language, retrieves information resources in other languages; the language of the documents being retrieved is called the target language. Cross-language query expansion is one of the core technologies for improving cross-language retrieval performance, and aims to alleviate long-standing problems in the field such as severe query topic drift and word mismatch. According to the stage of the retrieval process at which expansion occurs, cross-language query expansion is divided into pre-translation expansion, post-translation expansion, and hybrid expansion (i.e., expansion both before and after translation). With the rise of cross-language information retrieval research, cross-language query expansion has attracted growing attention and discussion from scholars at home and abroad and has become a research hotspot.
Cross-language information retrieval combines information retrieval with machine translation; it is more complex than monolingual retrieval and suffers from more serious problems, mainly query topic drift, word mismatch, and ambiguity in query term translation. These problems have long been the bottleneck restricting the development of cross-language information retrieval technology, and are common problems urgently awaiting solution internationally. Cross-language query expansion is one of the core technologies for solving them. Over the last decade, cross-language query expansion models and algorithms have received wide attention and deep study, yielding rich theoretical results, but the above problems have not been finally and completely solved.
Disclosure of Invention
The invention applies weighted positive and negative association pattern mining to the post-translation expansion of cross-language queries, and provides a cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback. Applied to cross-language information retrieval, it can alleviate the long-standing problems of query topic drift and word mismatch, improving cross-language retrieval performance; it can also be applied to cross-language search engines, improving retrieval performance measures such as recall and precision.
The technical scheme adopted by the invention is as follows:
1. A cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback, characterized by comprising the following steps:
1.1 translate a source-language query into a target-language query using a machine translation system;
1.2 retrieve the target-language original document set with the target-language query to obtain the target-language initially retrieved documents;
1.3 construct the target-language initially-retrieved relevant document set: perform user relevance judgment on the top n initially retrieved target-language documents to obtain the initially-retrieved relevant documents, thereby constructing the target-language initially-retrieved relevant document set;
1.4 mine the weighted frequent itemsets and negative itemsets containing original query terms from the target-language initially-retrieved relevant document set;
the method comprises the following specific steps:
1.4.1 preprocess the target-language initially-retrieved relevant document set, and construct the document index library and the total feature word library;
1.4.2 mine the frequent 1-itemsets L1
namely, obtain the candidate 1-itemsets C1 from the total feature word library and compute the support awSup(C1) of each C1; if awSup(C1) ≥ the support threshold ms, the candidate 1-itemset C1 is a frequent 1-itemset L1, and L1 is added to the weighted frequent itemset set PIS; awSup(C1) is computed by formula (1):
where n and W are respectively the total number of documents in the target-language initially-retrieved relevant document set and the sum of the weights of all feature words, n_C1 is the frequency with which C1 occurs in that document set, w_C1 is the itemset weight of C1 in that document set, and β ∈ (0,1) is an adjustment coefficient whose value cannot be 0 or 1;
1.4.3 mine the frequent k-itemsets Lk and negative k-itemsets Nk (k ≥ 2) containing query terms
The specific steps are as follows:
(1) mine the candidate k-itemsets Ck: obtained by Apriori joining of the frequent (k-1)-itemsets Lk-1;
(2) when k = 2, prune the candidate 2-itemsets C2 that contain no query term, keeping the candidate 2-itemsets C2 that contain query terms;
(3) compute the support awSup(Ck) of each candidate k-itemset Ck:
if awSup(Ck) ≥ the support threshold ms, then compute the weighted frequent itemset relevance awPIR(Ck); if awPIR(Ck) ≥ the frequent itemset relevance threshold minPR, the candidate k-itemset Ck is a weighted frequent k-itemset Lk and is added to the weighted frequent itemset set PIS;
if awSup(Ck) < ms, then compute the weighted negative itemset relevance awNIR(Ck); if awNIR(Ck) ≥ the negative itemset relevance threshold minNR, Ck is a weighted negative k-itemset Nk and is added to the weighted negative itemset set NIS; awSup(Ck) is computed by formula (2):
where n_Ck is the frequency with which Ck occurs in the target-language initially-retrieved relevant document set, w_Ck is the itemset weight of Ck in that document set, and k is the number of items in Ck;
the calculation of awPIR(Ck) is divided into two cases, m = 2 and m > 2, i.e., formulas (3) and (4),
where the candidate weighted positive itemset Ck = (t1, t2, …, tm), m ≥ 2, tmax (1 ≤ max ≤ m) is the single item with the greatest support among all items of Ck, and Iq is the sub-itemset with the greatest support among all 2- to (m-1)-sub-itemsets of Ck;
the calculation of awNIR(Ck) is divided into two cases, r = 2 and r > 2, i.e., formulas (5) and (6),
where the candidate weighted negative itemset Ck = (t1, t2, …, tr), r ≥ 2, tmax (1 ≤ max ≤ r) is the single item with the greatest support among all items of Ck, and Ip is the sub-itemset with the greatest support among all 2- to (r-1)-sub-itemsets of Ck;
(4) if the frequent k-itemset Lk is empty, end itemset mining and go to step 1.5; otherwise, go to step (1) and continue mining;
1.5 mine weighted strong positive association rules from the weighted frequent itemset set PIS: for each frequent k-itemset Lk (k ≥ 2) in the feature-word weighted frequent itemset set PIS, mine the association rules I → qt of Lk whose antecedent is the expansion term set I and whose consequent is the query term set qt, where the union of qt and I is Lk and their intersection is empty; qt is the query term set and I is the expansion term set; the specific mining steps are as follows:
(1) find all proper subsets of the positive itemset Lk to obtain the proper subset set of Lk;
(2) arbitrarily take two sub-itemsets qt and I from the proper subset set of Lk, with qt ∩ I = ∅ and qt ∪ I = Lk;
(3) compute the confidence awARConf(I → qt) of the weighted association rule I → qt and its lift awARL(I → qt); if awARL(I → qt) > 1 and awARConf(I → qt) ≥ the minimum weighted confidence threshold mc, the weighted strong association rule I → qt is obtained and added to the weighted strong positive association rule set PAR; awARConf(I → qt) and awARL(I → qt) are computed by formulas (7) and (8):
(4) return to step (2) and proceed in turn until each proper subset in the proper subset set of Lk has been taken out once and only once; then take a new positive itemset Lk from the PIS set and go to step (1) for a new round of weighted association rule mining, until every positive itemset Lk in PIS has been taken out; then go to step 1.6;
1.6 mine weighted strong negative association rules from the negative itemset set NIS: for each negative itemset Nk (k > 2) in NIS, mine the weighted negative association rules I → ¬qt and ¬I → qt whose query term set is qt and whose negative expansion term set is I, where the union of qt and I is Nk and their intersection is empty; the specific mining steps are as follows:
(1) find all proper subsets of the negative itemset Nk to obtain the proper subset set of Nk;
(2) arbitrarily take two sub-itemsets qt and I from the proper subset set of Nk, with qt ∩ I = ∅ and qt ∪ I = Nk, where qt is the query term set;
(3) compute the lift awARL(I → qt); if awARL(I → qt) < 1:
compute the confidence awARConf(I → ¬qt) of the negative association rule I → ¬qt; if awARConf(I → ¬qt) ≥ the minimum weighted confidence threshold mc, the weighted strong negative association rule I → ¬qt is obtained and added to the weighted strong negative association rule set NAR;
compute the confidence awARConf(¬I → qt) of the negative association rule ¬I → qt; if awARConf(¬I → qt) ≥ mc, the weighted strong negative association rule ¬I → qt is obtained and added to NAR; awARConf(I → ¬qt) and awARConf(¬I → qt) are computed by formulas (9) and (10):
awARConf(I → ¬qt) = 1 − awARConf(I → qt)    (9)
(4) return to step (2) and proceed in turn until each proper subset in the proper subset set of Nk has been taken out once and only once; then go to step (5);
(5) take a new negative itemset Nk from the NIS set and go to step (1) for a new round of weighted negative association rule mining; once each negative itemset in NIS has been taken out once and only once, the mining of weighted strong negative association rules ends; go to step 1.7;
1.7 extract from the weighted strong positive association rule set PAR the weighted positive association rule patterns I → qt whose rule consequent is a query term, and take the feature words in the positive rule antecedents as candidate expansion words to construct a candidate antecedent expansion word library;
1.8 extract from the weighted strong negative association rule set NAR the weighted negative association rule patterns I → ¬qt and ¬I → qt whose rule consequent involves a query term, and take the antecedent I of the negative rules as antecedent negative expansion words to construct an antecedent negative expansion word library;
1.9 compare each candidate antecedent expansion word in the candidate antecedent expansion word library with the negative expansion words in the antecedent negative expansion word library, and delete from the candidate library the candidate expansion words identical to negative expansion words; the remaining candidate antecedent expansion words in the candidate library are the final antecedent expansion words;
2.0 combine the final antecedent expansion words with the target-language original query terms and retrieve again, realizing post-translation antecedent expansion for the cross-language query.
In the above strong negative association rules I → ¬qt and ¬I → qt, "¬" is the negation symbol: ¬I means that the itemset I does not occur in the target-language initially-retrieved relevant documents, i.e., it belongs to the negatively correlated case.
"I → ¬qt" means that the expansion term set I and the query term set qt exhibit negative relevance: the occurrence of I in the target-language initially-retrieved relevant document set causes qt not to occur.
"¬I → qt" means that the expansion term set I and the query term set qt exhibit negative relevance: the absence of I in the target-language initially-retrieved relevant document set causes qt to occur.
The weighted strong positive association rule I → qt means that the occurrence of the expansion term set I in the target-language initially-retrieved relevant document set causes the query term set qt to occur as well.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback. The method adopts a positive and negative pattern mining technique based on a weighted support-relevance-lift-confidence evaluation framework to mine weighted positive and negative association rule patterns from the cross-language initially-retrieved relevant document set, and extracts the antecedents of those patterns as antecedent expansion words related to the original query terms, realizing post-translation antecedent expansion for cross-language queries and improving cross-language information retrieval performance.
(2) The English text data set of NTCIR-5 CLIR, the standard test corpus for cross-language information retrieval at the international evaluation conference on multilingual processing sponsored by the Japanese National Institute of Informatics, is selected as the experimental corpus of the invention, with Vietnamese and English as the language objects. The comparison baselines are: the Vietnamese-English Cross-Language Retrieval (VECLR) baseline method, and the Query Post-Translation Expansion based on Pseudo-Relevance Feedback (QPTE_PRF) method (Cross-language query expansion based on pseudo-relevance feedback [J]. 情报学报, 2010, 29(2): 232-). Experimental results show that, compared with the baselines VECLR and QPTE_PRF, the R-Prec and P@5 values of the TITLE-query Vietnamese-English retrieval results of the method are greatly improved, by up to 91.28% over VECLR and up to 265.88% over QPTE_PRF; the R-Prec and P@5 values of the DESC-query Vietnamese-English retrieval results are likewise greatly improved over VECLR and QPTE_PRF, with maximum improvements of 137.38% and 238.75% respectively.
(3) The experimental results show that the method is effective and can improve cross-language information retrieval performance. The main reason is analyzed as follows: cross-language information retrieval is affected by word mismatch and query translation quality, which often cause serious topic drift in the initially retrieved results; the proposed expansion method alleviates these problems.
Drawings
FIG. 1 is a block diagram of the cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback according to the present invention.
FIG. 2 is a general flow diagram of the cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback according to the present invention.
Detailed Description
In order to better illustrate the technical solution of the present invention, the related concepts involved in the invention are introduced as follows:
1. Cross-language query post-translation antecedent expansion
Cross-language query post-translation antecedent expansion refers to: in cross-language query expansion, after association rule patterns are mined from the target-language initially-retrieved relevant documents, the antecedents of the rule patterns related to the target-language original query are extracted as expansion words, and the expansion words are combined with the target-language original query terms to form a new query.
2. Degree of weighting support
Let DS = {d1, d2, …, dn} be the cross-language target-language initially-retrieved relevant Document Set (DS), where di (1 ≤ i ≤ n) is the i-th document in DS, di = {t1, t2, …, tm, …, tp}, and tm (m = 1, 2, …, p) is a document feature word item (feature item for short), generally consisting of a word or phrase. The feature item weight set corresponding to di is Wi = {wi1, wi2, …, wim, …, wip}, where wim is the weight of the m-th feature item tm of the i-th document di. TS = {t1, t2, …, tk} denotes the set of all feature items in DS; each subset of TS is called a feature itemset, or itemset for short.
Aiming at the defects of the prior art, and fully considering both the frequency and the weight of feature items, the invention gives a new method for computing the weighted support (All-weighted Support, awSup) awSup(I). awSup(I) is computed by formula (11).
Here wI is the sum of the weights of the itemset I in the cross-language target-language initially-retrieved relevant document set DS, nI is the frequency with which the weighted itemset I occurs in DS, n is the total number of documents, W is the sum of the weights of all feature words in DS, k is the number of items in I (i.e., the itemset length), and β ∈ (0,1) is an adjustment coefficient whose value cannot be 0 or 1; its main function is to adjust the influence of the combined item frequency and weight on the weighted support.
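To make the quantities concrete, the following minimal Python sketch computes a weighted support of the shape described above. Since the image of formula (11) is not reproduced in this text, the β-weighted combination of normalized frequency and normalized weight used below is an assumption; only the symbols n, W, nI, wI, k and β come from the patent.

```python
def aw_sup(itemset, docs, beta=0.3):
    """Weighted support sketch. docs: list of dicts mapping feature term -> weight.

    Assumed combination (formula (11) itself is not reproduced in the text):
    awSup(I) = beta * (n_I / n) + (1 - beta) * (w_I / (k * W)).
    """
    n = len(docs)                                   # total number of documents
    W = sum(sum(d.values()) for d in docs)          # sum of all feature-word weights
    k = len(itemset)                                # itemset length
    containing = [d for d in docs if all(t in d for t in itemset)]
    n_I = len(containing)                           # frequency of I in DS
    w_I = sum(d[t] for d in containing for t in itemset)  # weight sum of I in DS
    return beta * (n_I / n) + (1 - beta) * (w_I / (k * W))
```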
Assuming the minimum weighted support threshold is ms: if awSup(I1 ∪ I2) ≥ ms, the weighted itemset (I1 ∪ I2) is a positive itemset (i.e., a frequent itemset); otherwise, (I1 ∪ I2) is a negative itemset.
The method focuses only on the following three types of weighted negative itemsets: ¬I, (I1 ∪ ¬I2) and (¬I1 ∪ I2). Their weighted negative supports awSup(¬I), awSup(I1 ∪ ¬I2) and awSup(¬I1 ∪ I2) are computed by formulas (12) to (14).
awSup(¬I) = 1 − awSup(I)    (12)
awSup(I1 ∪ ¬I2) = awSup(I1) − awSup(I1 ∪ I2)    (13)
awSup(¬I1 ∪ I2) = awSup(I2) − awSup(I1 ∪ I2)    (14)
The method focuses only on the following two types of weighted negative association rules: (I1 → ¬I2) and (¬I1 → I2). The weighted positive and negative association rule confidences (All-weighted Association Rule Confidence, awARConf) awARConf(I1 → I2), awARConf(I1 → ¬I2) and awARConf(¬I1 → I2) are computed by formulas (15) to (17).
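The negative supports above translate directly into code. The confidence formulas (15) to (17) are not reproduced in this text, so the sketch below uses the standard definitions as a hedged reconstruction; note that the reconstructed (16) agrees with the identity awARConf(I → ¬qt) = 1 − awARConf(I → qt) given later as formula (32).

```python
def aw_sup_not(sup_I):                      # formula (12): awSup(¬I)
    return 1 - sup_I

def aw_sup_pos_neg(sup_I1, sup_I1I2):       # formula (13): awSup(I1 ∪ ¬I2)
    return sup_I1 - sup_I1I2

def aw_sup_neg_pos(sup_I2, sup_I1I2):       # formula (14): awSup(¬I1 ∪ I2)
    return sup_I2 - sup_I1I2

def aw_conf(sup_I1, sup_I1I2):              # assumed (15): awARConf(I1 -> I2)
    return sup_I1I2 / sup_I1

def aw_conf_neg_cons(sup_I1, sup_I1I2):     # assumed (16): awARConf(I1 -> ¬I2)
    return 1 - aw_conf(sup_I1, sup_I1I2)    # consistent with formula (32)

def aw_conf_neg_ante(sup_I1, sup_I2, sup_I1I2):  # assumed (17): awARConf(¬I1 -> I2)
    return (sup_I2 - sup_I1I2) / (1 - sup_I1)
```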
3. Weighted positive and negative itemset relevance
The weighted itemset relevance is a measure of the strength of the relationship between any two single items, and between the sub-itemsets, of a weighted itemset. The higher the itemset relevance, the closer the relationship between the sub-itemsets of the itemset, and the more it deserves attention. The invention improves the existing relevance measure and gives a relevance calculation method for weighted positive and negative itemsets, which considers not only the degree of association between any two single items in the itemset but also the association between two sub-itemsets of the itemset.
Weighted positive itemset relevance (All-weighted Positive Itemset Relevance, awPIR): for a weighted feature-word positive itemset Ck = (t1, t2, …, tm), where m ≥ 2 is the length of Ck, let tmax (1 ≤ max ≤ m) be the single item with the greatest support among all items of Ck, and let Iq be the sub-itemset with the greatest support among all 2- to (m-1)-sub-itemsets of Ck. The weighted positive itemset relevance awPIR(Ck) is computed by formulas (18) and (19).
Formulas (18) and (19) indicate that the relevance of the weighted positive itemset Ck equals the mean of the conditional probabilities that the positive itemset occurs when tmax occurs and when Iq occurs, respectively (for m = 2, only the tmax term applies).
Weighted negative itemset relevance (All-weighted Negative Itemset Relevance, awNIR): for a weighted feature-word negative itemset Ck = (t1, t2, …, tr), where r ≥ 2 is the length of Ck, let tmax (1 ≤ max ≤ r) be the single item with the greatest support among all items of Ck, and let Ip be the sub-itemset with the greatest support among all 2- to (r-1)-sub-itemsets of Ck. The weighted negative itemset relevance awNIR(Ck) is computed by formulas (20) and (21).
Formulas (20) and (21) indicate that the relevance of the weighted negative itemset Ck equals the mean of the conditional probabilities that the negative itemset occurs when tmax does not occur and when Ip does not occur, respectively (for r = 2, only the tmax term applies).
Example: let Ck = (t1 ∪ t2 ∪ t3 ∪ t4) with support 0.65, and let its single items t1, t2, t3 and t4 have supports 0.82, 0.45, 0.76 and 0.75 respectively; let its 2-sub-itemsets and 3-sub-itemsets (t1 ∪ t2), (t1 ∪ t3), (t1 ∪ t4), (t2 ∪ t3), (t2 ∪ t4), (t1 ∪ t2 ∪ t3), (t1 ∪ t2 ∪ t4), (t2 ∪ t3 ∪ t4) have supports 0.64, 0.78, 0.75, 0.74, 0.67, 0.66, 0.56 and 0.43 respectively. The single item with the greatest support (0.82) is t1, and the sub-itemset with the greatest support (0.78) among the 2- and 3-sub-itemsets is (t1 ∪ t3). The relevance of the positive itemset (t1 ∪ t2 ∪ t3 ∪ t4) computed with formula (19) is then 0.81.
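The example can be checked numerically. The computation below reads awPIR for m > 2 as the mean of the two conditional probabilities, which reproduces the stated value 0.81:

```python
# Quick check of the worked example above.
sup_ck   = 0.65   # support of (t1 ∪ t2 ∪ t3 ∪ t4)
sup_tmax = 0.82   # t1, the single item with the greatest support
sup_iq   = 0.78   # (t1 ∪ t3), the strongest 2-/3-sub-itemset

aw_pir = (sup_ck / sup_tmax + sup_ck / sup_iq) / 2
print(round(aw_pir, 2))   # -> 0.81, matching the example
```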
4. Weighted association rule lift
The traditional association rule evaluation framework (support-confidence) neglects the support of the itemset appearing in the rule consequent, so rules with high confidence can sometimes be misleading. Lift is an effective correlation measure for solving this problem. The lift of an association rule X → Y, Lift(X → Y), is the ratio of the probability of Y occurring given X to the probability of Y occurring overall, i.e., the ratio of the confidence of the rule (X → Y) to the support of the consequent Y, sup(Y). Based on the traditional lift concept, the weighted association rule lift (All-weighted Association Rule Lift, awARL) awARL(I1 → I2) is given by formula (22).
According to relevance theory, lift evaluates the relevance between the antecedent and consequent of an association rule, i.e., the degree to which the occurrence of one promotes (or suppresses) the occurrence of the other. When awARL(I1 → I2) > 1, I1 → I2 is a positive association rule: the occurrence of one of the itemsets I1 and I2 increases the probability of the other occurring. When awARL(I1 → I2) < 1, it is a negative association rule: the occurrence of one decreases the probability of the other. When awARL(I1 → I2) = 1, I1 and I2 are independent and unrelated, and the rule I1 → I2 is a spurious rule. It is easy to prove that awARL(I1 → I2) has the following Property 1.
Property 1: if awARL(I1 → I2) > 1, then ① awARL(I1 → ¬I2) < 1; ② awARL(¬I1 → I2) < 1; ③ awARL(¬I1 → ¬I2) > 1; if awARL(I1 → I2) < 1, then ④ awARL(I1 → ¬I2) > 1; ⑤ awARL(¬I1 → I2) > 1; ⑥ awARL(¬I1 → ¬I2) < 1.
According to Property 1, when awARL(I1 → I2) > 1, the weighted positive association rule I1 → I2 can be mined; when awARL(I1 → I2) < 1, the weighted negative association rules I1 → ¬I2 and ¬I1 → I2 can be mined.
Assuming the minimum weighted confidence threshold is mc, and combining Property 1, the weighted strong positive and negative association rules are given as follows:
For a weighted positive itemset (I1 ∪ I2): if awARL(I1 → I2) > 1 and awARConf(I1 → I2) ≥ mc, the weighted association rule I1 → I2 is a strong association rule.
For a negative itemset (I1 ∪ I2): if awARL(I1 → I2) < 1, awARConf(I1 → ¬I2) ≥ mc and awARConf(¬I1 → I2) ≥ mc, then I1 → ¬I2 and ¬I1 → I2 are strong negative association rules.
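The strong-rule criteria can be summarized in a short sketch. awARL follows the stated definition (confidence divided by consequent support, formula (22)); the confidence expressions reuse the standard reconstructions noted earlier rather than the patent's reproduced formulas.

```python
def aw_arl(sup_I1, sup_I2, sup_I1I2):
    """awARL(I1 -> I2) = awARConf(I1 -> I2) / awSup(I2)."""
    return (sup_I1I2 / sup_I1) / sup_I2

def strong_rules(sup_I1, sup_I2, sup_I1I2, mc=0.8):
    """Return the strong rules mined from the itemset (I1 ∪ I2)."""
    lift = aw_arl(sup_I1, sup_I2, sup_I1I2)
    conf_pos = sup_I1I2 / sup_I1                        # awARConf(I1 -> I2)
    rules = []
    if lift > 1 and conf_pos >= mc:
        rules.append("I1 -> I2")                        # strong positive rule
    elif lift < 1:                                      # negative rules only when lift < 1
        conf_neg_cons = 1 - conf_pos                    # awARConf(I1 -> ¬I2)
        conf_neg_ante = (sup_I2 - sup_I1I2) / (1 - sup_I1)  # awARConf(¬I1 -> I2)
        if conf_neg_cons >= mc:
            rules.append("I1 -> ¬I2")
        if conf_neg_ante >= mc:
            rules.append("¬I1 -> I2")
    return rules
```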
The cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback according to the invention comprises the following steps:
1.1 translating a source language query into a target language query using a machine translation system;
the machine translation system may be: microsoft applied to the machine translation interface Microsoft Translator API, Google machine translation interface, and so on.
1.2 retrieve the target-language original document set with the target-language query to obtain the target-language initially retrieved documents; the retrieval model used is the classical vector space model.
1.3 construct the target-language initially-retrieved relevant document set: perform user relevance judgment on the top n initially retrieved target-language documents to obtain the initially-retrieved relevant documents, thereby constructing the target-language initially-retrieved relevant document set;
1.4 mine the weighted frequent itemsets and negative itemsets containing original query terms from the target-language initially-retrieved relevant document set;
the specific steps are as follows:
1.4.1 preprocess the target-language initially-retrieved relevant document set, and construct the document index library and the total feature word library;
the pretreatment steps are as follows:
(1) for the target language is Chinese, performing Chinese word segmentation, removing stop words, extracting Chinese characteristic words, and adopting a Chinese lexical analysis system ICTCCLAS developed and compiled by the research institute of computational technology of Chinese academy of sciences to perform Chinese word segmentation; for the target language of English, a Porter program (see the website: http:// tartarus. org/. martin/Porter Stemmer in detail) is adopted to extract the stem of the word and remove the stop words of English;
(2) calculating the weight of the feature word, wherein the weight of the feature word indicates the importance degree of the feature word to the document where the feature word is located, and the invention adopts the classical and popular tf-idf feature word weight wijAnd (4) a calculation method. W isijThe calculation formula is shown in formula (23):
wherein, wijRepresenting a document diMiddle characteristic word tjWeight of (tf)j,iRepresentation feature word tjIn document diOf occurrence of (1), dfjMeaning containing a characteristic word tjN represents the total number of documents in the document set.
(3) And constructing a document index library and a total feature word library.
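For step (2), a minimal tf-idf sketch follows. Formula (23) is not reproduced in this text, so the classic variant wij = tfj,i · log(N / dfj) is assumed here; the patent's exact normalization may differ.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists; returns one {term: weight} dict per document."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))      # document frequency df_j
    weighted = []
    for d in docs:
        tf = Counter(d)                                # term frequency tf_j,i
        weighted.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weighted
```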
1.4.2 mine the frequent 1-itemsets L1: namely, obtain the candidate 1-itemsets C1 from the total feature word library and compute the support awSup(C1) of each C1; if awSup(C1) ≥ the support threshold ms, the candidate 1-itemset C1 is a frequent 1-itemset L1, and L1 is added to the weighted frequent itemset set PIS; awSup(C1) is computed by formula (24):
where n and W are respectively the total number of documents in the target-language initially-retrieved relevant document set and the sum of the weights of all feature words, n_C1 is the frequency with which C1 occurs in that document set, w_C1 is the itemset weight of C1 in that document set, and β ∈ (0,1) is an adjustment coefficient whose value cannot be 0 or 1.
1.4.3 mine the weighted frequent k-itemsets Lk and negative k-itemsets Nk (k ≥ 2) containing query terms.
The specific steps are as follows:
(1) mine the candidate k-itemsets Ck: obtained by Apriori joining of the frequent (k-1)-itemsets Lk-1 (a sketch follows at the end of this step's list);
the Apriori join method is described in the literature: Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases [C] // Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D.C., USA, 1993: 207-216.
(2) when k = 2, prune the candidate 2-itemsets C2 that contain no query term, keeping the candidate 2-itemsets C2 that contain query terms;
(3) compute the support awSup(Ck) of each candidate k-itemset Ck:
if awSup(Ck) ≥ the support threshold ms, then compute the weighted frequent itemset relevance awPIR(Ck); if awPIR(Ck) ≥ the frequent itemset relevance threshold minPR, the candidate k-itemset Ck is a weighted frequent k-itemset Lk and is added to the weighted frequent itemset set PIS;
if awSup(Ck) < ms, then compute the weighted negative itemset relevance awNIR(Ck); if awNIR(Ck) ≥ the negative itemset relevance threshold minNR, Ck is a weighted negative k-itemset Nk and is added to the weighted negative itemset set NIS. awSup(Ck) is computed by formula (25):
where n_Ck is the frequency with which Ck occurs in the target-language initially-retrieved relevant document set, w_Ck is the itemset weight of Ck in that document set, and k is the number of items in Ck.
The calculation of awPIR(Ck) is divided into two cases, m = 2 and m > 2, i.e., formulas (26) and (27),
where the candidate weighted positive itemset Ck = (t1, t2, …, tm), m ≥ 2, tmax (1 ≤ max ≤ m) is the single item with the greatest support among all items of Ck, and Iq is the sub-itemset with the greatest support among all 2- to (m-1)-sub-itemsets of Ck.
The calculation of awNIR(Ck) is divided into two cases, r = 2 and r > 2, i.e., formulas (28) and (29),
where the candidate weighted negative itemset Ck = (t1, t2, …, tr), r ≥ 2, tmax (1 ≤ max ≤ r) is the single item with the greatest support among all items of Ck, and Ip is the sub-itemset with the greatest support among all 2- to (r-1)-sub-itemsets of Ck.
(4) if the frequent k-itemset Lk is empty, end itemset mining and go to step 1.5; otherwise, go to step (1) and continue mining.
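The Apriori join of step (1) can be sketched compactly. The subset-based prune is the standard one from Agrawal et al. (1993) and is an assumption insofar as the patent text only names the join itself.

```python
from itertools import combinations

def apriori_join(L_prev):
    """L_prev: non-empty set of frozensets, each of size k-1; returns candidates C_k."""
    k = len(next(iter(L_prev))) + 1
    joined = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    # standard prune: every (k-1)-subset of a candidate must itself be frequent
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}
```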
1.5 mine weighted strong positive association rules from the weighted frequent itemset set PIS: for each frequent k-itemset Lk (k ≥ 2) in the feature-word weighted frequent itemset set PIS, mine the association rules I → qt of Lk whose antecedent is the expansion term set I and whose consequent is the query term set qt, where the union of qt and I is Lk and their intersection is empty; qt is the query term set and I is the expansion term set. The specific mining steps are as follows:
(1) find all proper subsets of the positive itemset Lk to obtain the proper subset set of Lk;
(2) arbitrarily take two sub-itemsets qt and I from the proper subset set of Lk, with qt ∩ I = ∅ and qt ∪ I = Lk;
(3) compute the confidence awARConf(I → qt) of the weighted association rule I → qt and its lift awARL(I → qt); if awARL(I → qt) > 1 and awARConf(I → qt) ≥ the minimum weighted confidence threshold mc, the weighted strong association rule I → qt is obtained and added to the weighted strong positive association rule set PAR. awARConf(I → qt) and awARL(I → qt) are computed by formulas (30) and (31):
(4) return to step (2) and proceed in turn until each proper subset in the proper subset set of Lk has been taken out once and only once; then take a new positive itemset Lk from the PIS set and go to step (1) for a new round of weighted association rule mining, until every positive itemset Lk in PIS has been taken out; then go to step 1.6.
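A simplified sketch of the positive-rule test in step 1.5 follows. It considers the single antecedent/consequent split in which the consequent is exactly the query-term subset of Lk; sup is assumed to be a callable returning the weighted support of an itemset (e.g., a closure over the aw_sup sketch given earlier).

```python
def mine_positive_rules(Lk, query_terms, sup, mc=0.8):
    """Test the rule I -> qt for the frequent itemset Lk; returns matching rules."""
    Lk = frozenset(Lk)
    qt = frozenset(query_terms) & Lk        # consequent: query terms inside L_k
    I = Lk - qt                             # antecedent: expansion term set
    if not qt or not I:
        return []
    conf = sup(Lk) / sup(I)                 # awARConf(I -> qt)
    lift = conf / sup(qt)                   # awARL(I -> qt)
    return [(I, qt, conf, lift)] if lift > 1 and conf >= mc else []
```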
1.6 mine weighted strong negative association rules from the negative itemset set NIS: for each negative itemset Nk (k > 2) in NIS, mine the weighted negative association rules I → ¬qt and ¬I → qt whose query term set is qt and whose negative expansion term set is I, where the union of qt and I is Nk and their intersection is empty. The specific mining steps are as follows:
(1) find all proper subsets of the negative itemset Nk to obtain the proper subset set of Nk.
(2) arbitrarily take two sub-itemsets qt and I from the proper subset set of Nk, with qt ∩ I = ∅ and qt ∪ I = Nk.
(3) compute the lift awARL(I → qt); if awARL(I → qt) < 1:
compute the confidence awARConf(I → ¬qt) of the negative association rule I → ¬qt; if awARConf(I → ¬qt) ≥ the minimum weighted confidence threshold mc, the weighted strong negative association rule I → ¬qt is obtained and added to the weighted strong negative association rule set NAR;
compute the confidence awARConf(¬I → qt) of the negative association rule ¬I → qt; if awARConf(¬I → qt) ≥ mc, the weighted strong negative association rule ¬I → qt is obtained and added to NAR. awARConf(I → ¬qt) and awARConf(¬I → qt) are computed by formulas (32) and (33):
awARConf(I → ¬qt) = 1 − awARConf(I → qt)    (32)
(4) return to step (2) and proceed in turn until each proper subset in the proper subset set of Nk has been taken out once and only once; then go to step (5);
(5) take a new negative itemset Nk from the NIS set and go to step (1) for a new round of weighted negative association rule mining; once each negative itemset in NIS has been taken out once and only once, the mining of weighted strong negative association rules ends; go to step 1.7.
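The negative-rule test of step 1.6 admits a parallel sketch. Formula (32) is used as stated; the confidence of ¬I → qt is the standard reconstruction noted earlier (formula (33) is not reproduced in this text).

```python
def mine_negative_rules(Nk, query_terms, sup, mc=0.8):
    """Test the rules I -> ¬qt and ¬I -> qt for the negative itemset Nk."""
    Nk = frozenset(Nk)
    qt = frozenset(query_terms) & Nk
    I = Nk - qt
    if not qt or not I:
        return []
    conf = sup(Nk) / sup(I)                 # awARConf(I -> qt)
    lift = conf / sup(qt)                   # awARL(I -> qt)
    rules = []
    if lift < 1:
        if 1 - conf >= mc:                  # formula (32): awARConf(I -> ¬qt)
            rules.append(("I -> ¬qt", I, qt, 1 - conf))
        conf_na = (sup(qt) - sup(Nk)) / (1 - sup(I))   # assumed awARConf(¬I -> qt)
        if conf_na >= mc:
            rules.append(("¬I -> qt", I, qt, conf_na))
    return rules
```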
1.7 extract from the weighted strong positive association rule set PAR the weighted positive association rule patterns I → qt whose rule consequent is a query term, and take the feature words in the positive rule antecedents as candidate expansion words to construct a candidate antecedent expansion word library.
1.8 extract from the weighted strong negative association rule set NAR the weighted negative association rule patterns I → ¬qt and ¬I → qt whose rule consequent involves a query term, and take the antecedent I of the negative rules as antecedent negative expansion words to construct an antecedent negative expansion word library.
1.9 compare each candidate antecedent expansion word in the candidate antecedent expansion word library with the negative expansion words in the antecedent negative expansion word library, and delete from the candidate library the candidate expansion words identical to negative expansion words; the remaining candidate antecedent expansion words in the candidate library are the final antecedent expansion words.
2.0 combine the final antecedent expansion words with the target-language original query terms and retrieve again, realizing post-translation antecedent expansion for the cross-language query.
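Steps 1.7 to 2.0 reduce to set operations over the mined rules. The sketch below consumes the outputs of the two rule-mining sketches above; the tuple layouts are those assumed there, not a format specified by the patent.

```python
def expand_query(original_query, positive_rules, negative_rules):
    """positive_rules: [(I, qt, conf, lift)]; negative_rules: [(label, I, qt, conf)]."""
    candidates, negatives = set(), set()
    for I, qt, conf, lift in positive_rules:        # step 1.7: antecedents as candidates
        candidates |= I
    for label, I, qt, conf in negative_rules:       # step 1.8: antecedent negative words
        negatives |= I
    final_terms = candidates - negatives            # step 1.9: remove negative words
    return list(original_query) + sorted(final_terms)   # step 2.0: the new query
```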
Experimental design and results:
In order to demonstrate the effectiveness of the method, Vietnamese-English cross-language information retrieval experiments based on the inventive method and the comparison methods were carried out, with Vietnamese and English as the language objects.
Experimental data set:
The English text data set of NTCIR-5 CLIR is selected as the experimental corpus. This corpus is the standard test corpus for cross-language information retrieval at the international evaluation conference on multilingual processing sponsored by the Japanese National Institute of Informatics, and is derived from the news texts of Mainichi Daily News for 2000 and 2001 (abbreviated mdn00 and mdn01) and Korea Times 2001 (abbreviated ktn01), with 26224 English texts in total (6608 in mdn00, 5547 in mdn01 and 14069 in ktn01). The data set comprises a document test set, a result set and a query set. The result set has two standards: Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant and partially relevant to the query). The query set contains 50 query topics in four language versions (Japanese, Korean, Chinese and English) and four query types (TITLE, DESC, NARR and CONC); the TITLE type describes a query topic briefly with nouns and noun phrases and is a short query, while the DESC type describes a query topic briefly with sentences and is a long query. The retrieval experiments use the TITLE and DESC query types.
In the experiments of the invention, because the NTCIR-5 CLIR corpus does not provide a Vietnamese query version, professional ASEAN-language translators from a translation agency were engaged to manually translate the 50 Chinese-version query topics in NTCIR-5 CLIR into Vietnamese queries as the source-language queries of the experiments.
The reference comparison method comprises the following steps:
(1) Vietnamese-English Cross-Language Retrieval (VECLR) baseline method: the result of the first Vietnamese-English cross-language retrieval, i.e., the retrieval result obtained by machine-translating the source-language Vietnamese query into English and then retrieving the English documents, with no query expansion in the retrieval process.
(2) Query Post-Translation Expansion based on Pseudo-Relevance Feedback (QPTE_PRF): the QPTE_PRF baseline is the post-translation cross-language query expansion retrieval result realized with the cross-language query expansion method of the literature (Cross-language query expansion based on pseudo-relevance feedback [J]. 情报学报, 2010, 29(2): 232-). The experimental method and parameters are: machine-translate the source-language Vietnamese query into an English query and retrieve the English documents; extract the top 20 English documents of the cross-language initial retrieval to construct the initially-retrieved English relevant document set; extract the English feature terms and compute their weights; and take the top 20 feature terms in descending order of weight as English expansion words, realizing Vietnamese-English cross-language query post-translation expansion.
R-precision (R-Prec) and P@5 are adopted as the cross-language retrieval evaluation indexes of the invention. R-precision is the precision computed when R documents have been retrieved, where R is the number of relevant documents in the document set for the query; it does not emphasize the ranking of documents within the result set.
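Both measures are straightforward to implement. The sketch below assumes ranked_ids is the ranked list of retrieved document ids and relevant_ids is the known relevant set for the query (both names are illustrative).

```python
def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R = number of relevant documents for the query."""
    R = len(relevant_ids)
    return len(set(ranked_ids[:R]) & set(relevant_ids)) / R if R else 0.0

def p_at_5(ranked_ids, relevant_ids):
    """Precision over the top five retrieved documents."""
    return len(set(ranked_ids[:5]) & set(relevant_ids)) / 5
```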
The experimental results are as follows:
Source programs of the inventive method and the baseline methods were written, and their Vietnamese-English cross-language retrieval performance was analyzed and compared experimentally. Vietnamese-English retrieval was performed for the 50 Vietnamese TITLE and DESC queries, and user relevance judgment was performed on the top 50 English documents of the cross-language initial retrieval to obtain the initial user relevance feedback documents (for simplicity, the documents among the top 50 initially retrieved that belong to the known result set are regarded as the initially-retrieved relevant documents). The averages of R-Prec and P@5 of the Vietnamese-English retrieval results obtained in the experiments are shown in Tables 1 and 2 respectively; the common experimental parameters were α = 0.3, minPR = 0.1, minNR = 0.01, and up to 3-itemsets were mined.
TABLE 1 Retrieval performance comparison of the inventive method with the comparison baseline methods (TITLE queries)
The experimental parameters of the table are mc = 0.8, ms ∈ {0.2, 0.25, 0.3, 0.35, 0.4, 0.45} (mdn00) and ms ∈ {0.2, 0.23, 0.25, 0.28, 0.3} (mdn01 and ktn01).
The experimental results in Table 1 show that, compared with the baseline methods VECLR and QPTE_PRF, the R-Prec and P@5 values of the TITLE-query Vietnamese-English retrieval results of the method are greatly improved, by up to 91.28% over VECLR and up to 265.88% over QPTE_PRF.
TABLE 2 Retrieval performance comparison of the inventive method with the baseline methods (DESC queries)
The experimental parameters of the table are mc = 0.8, ms ∈ {0.2, 0.23, 0.25, 0.28, 0.3}.
The experimental results in Table 2 show that the R-Prec and P@5 values of the DESC-query Vietnamese-English retrieval results of the method are greatly improved over the baseline methods VECLR and QPTE_PRF, with maximum improvements of 137.38% and 238.75% respectively.
The experimental results show that the method is effective and can indeed improve cross-language information retrieval performance.

Claims (1)

1. A cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback, characterized by comprising the following steps:
1.1 translate a source-language query into a target-language query using a machine translation system;
1.2 retrieve the target-language original document set with the target-language query to obtain the target-language initially retrieved documents;
1.3 construct the target-language initially-retrieved relevant document set: perform user relevance judgment on the top n initially retrieved target-language documents to obtain the initially-retrieved relevant documents, thereby constructing the target-language initially-retrieved relevant document set;
1.4 mine the weighted frequent itemsets and negative itemsets containing original query terms from the target-language initially-retrieved relevant document set;
the specific steps are as follows:
1.4.1 preprocess the target-language initially-retrieved relevant document set, and construct the document index library and the total feature word library;
1.4.2 mine the frequent 1-itemsets L1
namely, obtain the candidate 1-itemsets C1 from the total feature word library and compute the support awSup(C1) of each C1; if awSup(C1) ≥ the support threshold ms, the candidate 1-itemset C1 is a frequent 1-itemset L1, and L1 is added to the weighted frequent itemset set PIS; awSup(C1) is computed as follows:
where n and W are respectively the total number of documents in the target-language initially-retrieved relevant document set and the sum of the weights of all feature words, n_C1 is the frequency with which C1 occurs in that document set, w_C1 is the itemset weight of C1 in that document set, and β ∈ (0,1) is an adjustment coefficient whose value cannot be 0 or 1;
1.4.3 mine the frequent k-itemsets Lk and negative k-itemsets Nk (k ≥ 2) containing query terms
The specific steps are as follows:
(1) mine the candidate k-itemsets Ck: obtained by Apriori joining of the frequent (k-1)-itemsets Lk-1;
(2) when k = 2, prune the candidate 2-itemsets C2 that contain no query term, keeping the candidate 2-itemsets C2 that contain query terms;
(3) compute the support awSup(Ck) of each candidate k-itemset Ck:
if awSup(Ck) ≥ the support threshold ms, then compute the weighted frequent itemset relevance awPIR(Ck); if awPIR(Ck) ≥ the frequent itemset relevance threshold minPR, the candidate k-itemset Ck is a weighted frequent k-itemset Lk and is added to the weighted frequent itemset set PIS;
if awSup(Ck) < ms, then compute the weighted negative itemset relevance awNIR(Ck); if awNIR(Ck) ≥ the negative itemset relevance threshold minNR, Ck is a weighted negative k-itemset Nk and is added to the weighted negative itemset set NIS; awSup(Ck) is computed as follows:
where n_Ck is the frequency with which Ck occurs in the target-language initially-retrieved relevant document set, w_Ck is the itemset weight of Ck in that document set, and k is the number of items in Ck;
the calculation of awPIR(Ck) is divided into two cases, m = 2 and m > 2, that is,
where the candidate weighted positive itemset Ck = (t1, t2, …, tm), m ≥ 2, tmax (1 ≤ max ≤ m) is the single item with the greatest support among all items of Ck, and Iq is the sub-itemset with the greatest support among all 2- to (m-1)-sub-itemsets of Ck;
the calculation of awNIR(Ck) is divided into two cases, r = 2 and r > 2, that is,
where the candidate weighted negative itemset Ck = (t1, t2, …, tr), r ≥ 2, tmax (1 ≤ max ≤ r) is the single item with the greatest support among all items of Ck, and Ip is the sub-itemset with the greatest support among all 2- to (r-1)-sub-itemsets of Ck;
(4) if the frequent k-itemset Lk is empty, end itemset mining and go to step 1.5; otherwise, go to step (1) and continue mining;
1.5 mine weighted strong positive association rules from the weighted frequent itemset set PIS: for each frequent k-itemset Lk (k ≥ 2) in the feature-word weighted frequent itemset set PIS, mine the association rules I → qt of Lk whose antecedent is the expansion term set I and whose consequent is the query term set qt, where the union of qt and I is Lk and their intersection is empty; qt is the query term set and I is the expansion term set; the specific mining steps are as follows:
(1) find all proper subsets of the positive itemset Lk to obtain the proper subset set of Lk;
(2) arbitrarily take two sub-itemsets qt and I from the proper subset set of Lk, with qt ∩ I = ∅ and qt ∪ I = Lk;
(3) compute the confidence awARConf(I → qt) of the weighted association rule I → qt and its lift awARL(I → qt); if awARL(I → qt) > 1 and awARConf(I → qt) ≥ the minimum weighted confidence threshold mc, the weighted strong association rule I → qt is obtained and added to the weighted strong positive association rule set PAR; awARConf(I → qt) and awARL(I → qt) are computed as follows:
(4) return to step (2) and proceed in turn until each proper subset in the proper subset set of Lk has been taken out once and only once; then take a new positive itemset Lk from the PIS set and go to step (1) for a new round of weighted association rule mining, until every positive itemset Lk in PIS has been taken out; then go to step 1.6;
1.6 mine weighted strong negative association rules from the negative itemset set NIS: for each negative itemset Nk (k > 2) in NIS, mine the weighted negative association rules I → ¬qt and ¬I → qt whose query term set is qt and whose negative expansion term set is I, where the union of qt and I is Nk and their intersection is empty; the specific mining steps are as follows:
(1) find all proper subsets of the negative itemset Nk to obtain the proper subset set of Nk;
(2) arbitrarily take two sub-itemsets qt and I from the proper subset set of Nk, with qt ∩ I = ∅ and qt ∪ I = Nk, where qt is the query term set;
(3) compute the lift awARL(I → qt); if awARL(I → qt) < 1:
compute the confidence awARConf(I → ¬qt) of the negative association rule I → ¬qt; if awARConf(I → ¬qt) ≥ the minimum weighted confidence threshold mc, the weighted strong negative association rule I → ¬qt is obtained and added to the weighted strong negative association rule set NAR;
compute the confidence awARConf(¬I → qt) of the negative association rule ¬I → qt; if awARConf(¬I → qt) ≥ mc, the weighted strong negative association rule ¬I → qt is obtained and added to NAR; awARConf(I → ¬qt) and awARConf(¬I → qt) are computed as follows:
(4) return to step (2) and proceed in turn until each proper subset in the proper subset set of Nk has been taken out once and only once; then go to step (5);
(5) take a new negative itemset Nk from the NIS set and go to step (1) for a new round of weighted negative association rule mining; once each negative itemset in NIS has been taken out once and only once, the mining of weighted strong negative association rules ends; go to step 1.7;
1.7 extract from the weighted strong positive association rule set PAR the weighted positive association rule patterns I → qt whose rule consequent is a query term, and take the feature words in the positive rule antecedents as candidate expansion words to construct a candidate antecedent expansion word library;
1.8 extract from the weighted strong negative association rule set NAR the weighted negative association rule patterns I → ¬qt and ¬I → qt whose rule consequent involves a query term, and take the antecedent I of the negative rules as antecedent negative expansion words to construct an antecedent negative expansion word library;
1.9 compare each candidate antecedent expansion word in the candidate antecedent expansion word library with the negative expansion words in the antecedent negative expansion word library, and delete from the candidate library the candidate expansion words identical to negative expansion words; the remaining candidate antecedent expansion words in the candidate library are the final antecedent expansion words;
2.0 combine the final antecedent expansion words with the target-language original query terms and retrieve again, realizing post-translation antecedent expansion for the cross-language query.
CN201710807540.4A 2017-09-08 2017-09-08 Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback Expired - Fee Related CN107609095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710807540.4A CN107609095B (en) 2017-09-08 2017-09-08 Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710807540.4A CN107609095B (en) 2017-09-08 2017-09-08 Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback

Publications (2)

Publication Number Publication Date
CN107609095A CN107609095A (en) 2018-01-19
CN107609095B true CN107609095B (en) 2019-07-09

Family

ID=61062737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710807540.4A Expired - Fee Related CN107609095B (en) 2017-09-08 2017-09-08 Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback

Country Status (1)

Country Link
CN (1) CN107609095B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299278B (en) * 2018-11-26 2022-02-15 广西财经学院 Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
CN109299292B (en) * 2018-11-26 2022-02-15 广西财经学院 Text retrieval method based on matrix weighted association rule front and back part mixed expansion
CN109684464B (en) * 2018-12-30 2021-06-04 广西财经学院 Cross-language query expansion method for realizing rule back-part mining through weight comparison
CN112925978A (en) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 Recommendation system evaluation method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216874A (en) * 2014-09-22 2014-12-17 广西教育学院 Chinese interword weighing positive and negative mode excavation method and system based on relevant coefficients
CN105095512A (en) * 2015-09-09 2015-11-25 四川省科技交流中心 Cross-language private data retrieval system and method based on bridge language
CN106557478A (en) * 2015-09-25 2017-04-05 四川省科技交流中心 Distributed across languages searching systems and its search method based on bridge language

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216874A (en) * 2014-09-22 2014-12-17 广西教育学院 Chinese interword weighing positive and negative mode excavation method and system based on relevant coefficients
CN105095512A (en) * 2015-09-09 2015-11-25 四川省科技交流中心 Cross-language private data retrieval system and method based on bridge language
CN106557478A (en) * 2015-09-25 2017-04-05 四川省科技交流中心 Distributed across languages searching systems and its search method based on bridge language

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Completely weighted positive and negative association rule mining and its application to education data (完全加权正负关联规则挖掘及其在教育数据中的应用); Yu Ru et al.; Journal of Chinese Information Processing (中文信息学报); 2014-12-31; Vol. 28, No. 4; full text
An efficient matrix-weighted positive and negative association rule mining algorithm — MWARM-SRCCCI (有效的矩阵加权正负关联规则挖掘算法——MWARM-SRCCCI); Zhou Xiumei et al.; Journal of Computer Applications (计算机应用); 2014-12-31; Vol. 34, No. 10; full text

Also Published As

Publication number Publication date
CN107609095A (en) 2018-01-19

Similar Documents

Publication Publication Date Title
CN107609095B (en) Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback
WO2015196909A1 (en) Word segmentation method and device
US20160041986A1 (en) Smart Search Engine
CN107102983B (en) Word vector representation method of Chinese concept based on network knowledge source
CN102662936A (en) Chinese-English unknown words translating method blending Web excavation, multi-feature and supervised learning
CN109299278B (en) Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
CN105760366A (en) New word finding method aiming at specific field
US20170337179A1 (en) Construction of a lexicon for a selected context
CN107526839B (en) Consequent extended method is translated across language inquiry based on weight positive negative mode completely
CN106484781B (en) Merge the Indonesia&#39;s Chinese cross-language retrieval method and system of association mode and user feedback
CN109684463B (en) Cross-language post-translation and front-part extension method based on weight comparison and mining
CN109726263B (en) Cross-language post-translation hybrid expansion method based on feature word weighted association pattern mining
CN109739953B (en) Text retrieval method based on chi-square analysis-confidence framework and back-part expansion
Rasheed et al. Query expansion in information retrieval for Urdu language
CN109299292B (en) Text retrieval method based on matrix weighted association rule front and back part mixed expansion
CN109684465B (en) Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison
CN109739952A (en) Merge the mode excavation of the degree of association and chi-square value and the cross-language retrieval method of extension
CN109684464B (en) Cross-language query expansion method for realizing rule back-part mining through weight comparison
Alper Auto-generating Bilingual Dictionaries: Results of the TIAD-2017 Shared Task Baseline Algorithm.
CN108170778B (en) Chinese-English cross-language query post-translation expansion method based on fully weighted rule post-piece
CN107562904B (en) Positive and negative association mode method for digging is weighted between fusion item weight and the English words of frequency
CN108416442B (en) Chinese word matrix weighting association rule mining method based on item frequency and weight
Rahimi et al. Creating a Wikipedia-based Persian-English word association dictionary
Tomás et al. Mining wikipedia as a parallel and comparable corpus
Wloka Identifying bilingual topics in Wikipedia for efficient parallel corpus extraction and building domain-specific glossaries for the Japanese-English language pair

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190709

Termination date: 20200908