CN107609095A - Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback - Google Patents

Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback

Info

Publication number
CN107609095A
CN107609095A (application CN201710807540.4A)
Authority
CN
China
Prior art keywords
negative
weighted
item
query
positive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710807540.4A
Other languages
Chinese (zh)
Other versions
CN107609095B (en)
Inventor
黄名选
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Finance and Economics
Original Assignee
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Finance and Economics
Priority to CN201710807540.4A
Publication of CN107609095A
Application granted
Publication of CN107609095B
Expired - Fee Related
Anticipated expiration


Abstract

A cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback. The source-language query is first translated into a target-language query with a translation tool, the target documents are retrieved to obtain initial-retrieval documents, and, after user relevance judgment, the top-ranked initial documents are used to build a target-language initial relevant document set. Weighted positive and negative association rule patterns whose feature terms contain the query terms are then mined from the initial relevant document set with a weighted positive and negative association pattern mining technique oriented to cross-language query expansion, and a positive and negative association rule base of feature terms is constructed. The weighted positive and negative association rule patterns whose consequent is a query term are extracted from the rule base; the antecedent feature terms of positive rules serve as positive expansion terms and the antecedents of negative rules as negative expansion terms, and the final antecedent expansion terms, obtained by removing the negative expansion terms from the positive ones, implement post-translation antecedent expansion of the cross-language query. The invention can improve cross-language information retrieval performance and has good application value and promotion prospects.

Description

Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback
Technical Field
The invention belongs to the field of Internet information retrieval, and in particular relates to a cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback, suitable for the field of cross-language information retrieval.
Background
Cross-Language Information Retrieval (CLIR) is a technique for retrieving information resources in other languages with a query formulated in one language; the language in which the user query is expressed is called the source language, and the language of the documents to be retrieved is called the target language. Cross-language query expansion is one of the core technologies for improving cross-language retrieval performance, aiming at the long-standing problems of severe query topic drift and word mismatch in the cross-language information retrieval field. According to the stage of the retrieval process at which expansion occurs, cross-language query expansion is divided into pre-translation query expansion, post-translation query expansion, and hybrid query expansion (i.e., expansion both before and after translation). With the rise of cross-language information retrieval research, cross-language query expansion has attracted growing attention and discussion from scholars at home and abroad and has become a research hotspot.
Cross-language information retrieval combines information retrieval and machine translation; it is more complex than monolingual retrieval, and its problems are more serious. These problems have long been the bottleneck restricting the development of cross-language information retrieval technology and are common problems urgently awaiting solution internationally, mainly: severe query topic drift, word mismatch, and query-term translation ambiguity. Cross-language query expansion is one of the core technologies for solving these problems. Over the last 10 years, cross-language query expansion models and algorithms have received wide attention and deep study, yielding rich theoretical results, but the problems above have not been finally and completely solved.
Disclosure of Invention
The invention applies weighted positive and negative association pattern mining to post-translation expansion of cross-language queries and provides a cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback. Applied to the field of cross-language information retrieval, it can alleviate the long-standing problems of query topic drift and word mismatch in cross-language information retrieval and improve retrieval performance; it can also be applied to cross-language search engines to improve retrieval measures such as recall and precision.
The technical solution adopted by the invention is as follows:
1. A cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback, characterized by comprising the following steps:
1.1 Translate the source-language query into a target-language query using a machine translation system;
1.2 Retrieve the target-language original document set with the target-language query to obtain the target-language initial-retrieval documents;
1.3 Construct the target-language initial relevant document set: perform user relevance judgment on the top-ranked n target-language initial-retrieval documents to obtain the initially relevant documents, thereby constructing the target-language initial relevant document set;
1.4 Mine weighted frequent itemsets and negative itemsets containing the original query terms from the target-language initial relevant document set;
the method comprises the following specific steps:
1.4.1 Preprocess the target-language initial relevant document set and construct a document index library and a total feature term library;
1.4.2 Mine the frequent 1_itemsets L_1:
obtain candidate 1_itemsets C_1 of feature terms from the total feature term library and compute the support awSup(C_1) of each C_1; if awSup(C_1) ≥ the support threshold ms, the candidate 1_itemset C_1 is a frequent 1_itemset L_1, and L_1 is added to the weighted frequent itemset set PIS; awSup(C_1) is computed by formula (1):
where n and W are, respectively, the total number of documents in the target-language initial relevant document set and the sum of the weights of all feature terms, n_{C_1} is the frequency with which C_1 occurs in the target-language initial relevant document set, w_{C_1} is the itemset weight of C_1 in that document set, and β ∈ (0,1) is an adjustment coefficient whose value can be neither 0 nor 1;
1.4.3 Mine the frequent k_itemsets L_k containing query terms and the negative k_itemsets N_k, k ≥ 2
The method comprises the following specific steps:
(1) Mine candidate k_itemsets C_k: obtained by performing the Apriori join on the frequent (k-1)_itemsets L_{k-1};
(2) When k = 2, prune the candidate 2_itemsets C_2 containing no query term and keep the candidate 2_itemsets C_2 containing query terms;
(3) Compute the support awSup(C_k) of each candidate k_itemset C_k:
If awSup(C_k) ≥ the support threshold ms, then compute the weighted frequent itemset relevance awPIR(C_k) of C_k; if awPIR(C_k) ≥ the frequent itemset relevance threshold minPR, the candidate k_itemset C_k is a weighted frequent k_itemset L_k and is added to the weighted frequent itemset set PIS;
if awSup(C_k) < ms, compute the weighted negative itemset relevance awNIR(C_k); if awNIR(C_k) ≥ the negative itemset relevance threshold minNR, then C_k is a weighted negative k_itemset N_k and is added to the weighted negative itemset set NIS; awSup(C_k) is computed by formula (2):
where n_{C_k} is the frequency with which C_k occurs in the target-language initial relevant document set, w_{C_k} is the itemset weight of C_k in that document set, and k is the number of items in C_k;
awPIR(C_k) is computed in two cases, m = 2 and m > 2, as shown in formulas (3) and (4):
where the candidate weighted positive itemset C_k = (t_1, t_2, ..., t_m), m ≥ 2, t_max (1 ≤ max ≤ m) is the single item of C_k with the greatest support, and I_q is the sub-itemset with the greatest support among all 2_sub-itemsets to (m-1)_sub-itemsets of C_k;
awNIR(C_k) is computed in two cases, r = 2 and r > 2, as shown in formulas (5) and (6):
where the candidate weighted negative itemset C_k = (t_1, t_2, ..., t_r), r ≥ 2, t_max (1 ≤ max ≤ r) is the single item of C_k with the greatest support, and I_p is the sub-itemset with the greatest support among all 2_sub-itemsets to (r-1)_sub-itemsets of C_k;
(4) If the frequent k_itemset L_k is empty, itemset mining ends and the procedure goes to step 1.5; otherwise return to step (1) and continue mining;
1.5 Mine weighted strong positive association rules from the weighted frequent itemset set PIS: for each frequent k_itemset L_k in the feature-term weighted frequent itemset set PIS, k ≥ 2, mine the association rules I → qt of L_k whose antecedent is an expansion term set I and whose consequent is a query term set qt, where the union of qt and I is L_k and the intersection of qt and I is empty; qt is a query term set and I is an expansion term set; the specific mining steps are as follows:
(1) Find all proper subsets of the positive itemset L_k to obtain the proper subset set of L_k;
(2) Take any two sub-itemsets qt and I from the proper subset set such that qt ∪ I = L_k and qt ∩ I = ∅, where qt is the query term set;
(3) Compute the confidence awARConf(I → qt) of the weighted association rule I → qt and its lift awARL(I → qt); if awARL(I → qt) > 1 and awARConf(I → qt) ≥ mc, the weighted strong association rule I → qt is obtained and added to the weighted strong positive association rule set PAR; awARConf(I → qt) and awARL(I → qt) are computed by formulas (7) and (8):
(4) Return to step (2) and repeat until each proper subset in the proper subset set of L_k has been taken exactly once; then take a new positive itemset L_k from the PIS set and go to step (1) for a new round of weighted association rule mining, until every positive itemset L_k in PIS has been taken; then go to step 1.6;
1.6 Mine weighted strong negative association rules from the negative itemset set NIS: for each negative itemset N_k in NIS, k ≥ 2, mine the weighted negative association rules I → ¬qt and ¬I → qt between the query term set qt and the negative expansion term set I, where the union of qt and I is N_k and the intersection of qt and I is empty; the specific mining steps are as follows:
(1) Find all proper subsets of the negative itemset N_k to obtain the proper subset set of N_k;
(2) Take any two sub-itemsets qt and I from the proper subset set such that qt ∪ I = N_k and qt ∩ I = ∅, where qt is the query term set;
(3) Compute the lift awARL(I → qt); if awARL(I → qt) < 1:
compute the confidence awARConf(I → ¬qt) of the negative association rule I → ¬qt; if awARConf(I → ¬qt) ≥ mc, the weighted strong negative association rule I → ¬qt is obtained and added to the weighted strong negative association rule set NAR;
compute the confidence awARConf(¬I → qt) of the negative association rule ¬I → qt; if awARConf(¬I → qt) ≥ mc, the weighted strong negative association rule ¬I → qt is obtained and added to NAR; awARConf(I → ¬qt) and awARConf(¬I → qt) are computed by formulas (9) and (10):
awARConf(I → ¬qt) = 1 - awARConf(I → qt)   (9)
(4) Return to step (2) and repeat until each proper subset in the proper subset set of N_k has been taken exactly once; then go to step (5);
(5) Take a new negative itemset N_k from the NIS set and go to step (1) for a new round of weighted negative association rule mining; when every negative itemset in NIS has been taken exactly once, the mining of weighted strong negative association rules ends, and the procedure goes to step 1.7;
1.7 Extract from the weighted strong positive association rule set PAR the weighted positive association rule patterns I → qt whose consequent is a query term, and use the feature terms of the positive rule antecedents as candidate expansion terms to construct a candidate antecedent expansion term library;
1.8 Extract from the weighted strong negative association rule set NAR the weighted negative association rule patterns I → ¬qt and ¬I → qt whose consequent involves a query term, and use the antecedent I of each negative rule as antecedent negative expansion terms to construct an antecedent negative expansion term library;
1.9 Compare each candidate antecedent expansion term in the candidate antecedent expansion term library with the negative expansion terms in the antecedent negative expansion term library, and delete from the candidate library the candidate expansion terms identical to negative expansion terms; the remaining candidate antecedent expansion terms are the final antecedent expansion terms;
2.0 Combine the final antecedent expansion terms with the target-language original query terms and retrieve again, implementing post-translation antecedent expansion of the cross-language query.
In the strong negative association rules I → ¬qt and ¬I → qt above, "¬" denotes the negative association symbol; "¬I" means that the itemset I does not occur in the target-language initial relevant documents, i.e., a negatively correlated case.
"I → qt" means that the set of expanded terms I and the set of query terms qt exhibit a negative relevance, the occurrence of the set of expanded terms I in the target language first run relevant document set being such that the set of query terms qt does not occur.
I → qt "means that the set of expanded terms I and the set of query terms qt exhibit negative relevance, the absence of the set of expanded terms I in the target language first run related document set causing the set of query terms qt to appear.
The weighted strong positive association rule I → qt means that the occurrence of the expansion term set I in the target-language initial relevant document set causes the query term set qt to occur as well.
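As a concrete illustration of steps 1.7 to 1.9 and of the rule semantics just described, the following minimal sketch filters candidate antecedent expansion terms against the antecedent negative expansion terms; the rule tuples and terms below are invented sample data, not part of the invention.

```python
# Minimal sketch of steps 1.7-1.9: antecedents of strong positive rules become
# candidate expansion terms, and antecedents of strong negative rules are
# removed from them. The sample rules below are invented for illustration.

positive_rules = [({"economy", "growth"}, "gdp"),   # (antecedent I, consequent qt)
                  ({"inflation"}, "gdp")]
negative_rules = [({"inflation"}, "gdp")]           # mined as inflation -> ¬gdp

candidates = set().union(*(ante for ante, _ in positive_rules))   # step 1.7
negatives = set().union(*(ante for ante, _ in negative_rules))    # step 1.8
final_expansion = candidates - negatives                          # step 1.9
print(sorted(final_expansion))   # ['economy', 'growth']
```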
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback. The method adopts a positive and negative pattern mining technique based on a weighted support-relevance-lift-confidence evaluation framework to mine weighted positive and negative association rule patterns from the cross-language initial relevant document set, and extracts the antecedents of these patterns as antecedent expansion terms related to the original query terms to implement post-translation antecedent expansion of the cross-language query, so that cross-language information retrieval performance is better improved.
(2) The English text data set of the NTCIR-5 CLIR standard test corpus for cross-language information retrieval, used in the multilingual-processing international evaluation conference sponsored by the National Institute of Informatics of Japan, is selected as the experimental corpus of the invention, and experiments on the method of the invention are carried out with Vietnamese and English as the language pair. The experimental baseline methods are: a Vietnamese-English Cross-Language Retrieval (VECLR) baseline without query expansion, and a cross-language retrieval method based on the literature (Wu Dan, He Daqing, Wang Huilin. Pseudo-relevance feedback based cross-language query expansion [J]. Journal of the China Society for Scientific and Technical Information, 2010, 29(2): 232-239.). Experimental results show that, compared with the baselines VECLR and QPTE_PRF, the R-Prec and P@5 values of the Vietnamese-English cross-language retrieval results of the method for TITLE queries are greatly improved, by up to 91.28% over VECLR and up to 265.88% over QPTE_PRF; the R-Prec and P@5 values for DESC queries are likewise greatly improved over VECLR and QPTE_PRF, with maximum improvements of 137.38% and 238.75%, respectively.
(3) Experimental results show that the method is effective and can improve cross-language information retrieval performance. The main reason is that cross-language information retrieval is affected by word mismatch and query translation quality, which often causes serious query topic drift in the initial retrieval; the invention mines weighted positive and negative association rule patterns from the initial relevant documents and uses their antecedents to expand the translated query, thereby alleviating this drift.
Drawings
FIG. 1 is a block diagram of the cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback according to the present invention.
FIG. 2 is a general flow chart of the cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback according to the present invention.
Detailed Description
To better illustrate the technical solution of the present invention, the related concepts involved in the invention are introduced as follows:
1. Post-translation antecedent expansion of cross-language queries
Post-translation antecedent expansion of a cross-language query means: in cross-language query expansion, after association rule patterns are mined from the target-language initial relevant documents, the antecedents of the rule patterns related to the target-language original query are extracted as expansion terms, and the expansion terms are combined with the target-language original query terms to form a new query.
2. Weighted support
Suppose DS = {d_1, d_2, ..., d_n} is the cross-language target-language initial relevant document set (DS), where d_i (1 ≤ i ≤ n) is the i-th document in DS; d_i = {t_1, t_2, ..., t_m, ..., t_p}, where t_m (m = 1, 2, ..., p) is a document feature term, called a feature item for short, generally a word or phrase; the feature-item weight set corresponding to d_i is W_i = {w_i1, w_i2, ..., w_im, ..., w_ip}, where w_im is the weight of the m-th feature item t_m in the i-th document d_i; TS = {t_1, t_2, ..., t_k} denotes the set of all feature items in DS, and each subset of TS is called a feature itemset, or itemset for short.
Aiming at the defects of the prior art and fully considering both the frequency and the weight of feature items, the invention gives a new method for computing the weighted support (All-weighted Support, awSup) awSup(I); awSup(I) is computed by formula (11).
where w_I is the sum of the weights of the weighted itemset I in the cross-language target-language initial relevant document set DS, n_I is the frequency with which the weighted itemset I occurs in DS, n is the total number of documents in DS, W is the sum of the weights of all feature terms in DS, k is the number of items in itemset I (i.e., the itemset length), and β ∈ (0,1) is an adjustment coefficient whose value can be neither 0 nor 1 and whose main function is to adjust the influence of the combination of item frequency and item weight on the weighted support.
Assume the minimum weighted support threshold is ms. If awSup(I_1 ∪ I_2) ≥ ms, the weighted itemset (I_1 ∪ I_2) is a positive itemset (i.e., a frequent itemset); otherwise (I_1 ∪ I_2) is a negative itemset.
The method focuses only on the following three types of weighted negative itemsets: (¬I), (I_1 ∪ ¬I_2) and (¬I_1 ∪ I_2). The weighted negative itemset supports awSup(¬I), awSup(I_1 ∪ ¬I_2) and awSup(¬I_1 ∪ I_2) are computed by formulas (12) to (14).
awSup(¬I) = 1 - awSup(I)   (12)
awSup(I_1 ∪ ¬I_2) = awSup(I_1) - awSup(I_1 ∪ I_2)   (13)
awSup(¬I_1 ∪ I_2) = awSup(I_2) - awSup(I_1 ∪ I_2)   (14)
The method focuses only on the following two types of weighted negative association rules: (I_1 → ¬I_2) and (¬I_1 → I_2). The weighted positive and negative association rule confidences (All-weighted Association Rule Confidence, awARConf) awARConf(I_1 → I_2), awARConf(I_1 → ¬I_2) and awARConf(¬I_1 → I_2) are computed by formulas (15) to (17).
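A small numeric sketch of these quantities follows. Formulas (11) and (15) to (17) are not reproduced in this text, so the positive support below assumes a β-weighted combination of normalized item frequency and normalized item weight, and the positive-rule confidence assumes the standard support ratio; the negative supports implement formulas (12) to (14) as given, and the negative confidence follows formula (9). All numeric values are invented.

```python
# Sketch of the all-weighted support and the derived negative-itemset supports
# and rule confidences. Formula (11) is not reproduced in this text, so awsup()
# below ASSUMES a beta-weighted mix of item frequency and item weight; formulas
# (12)-(14) and (9) are implemented exactly as stated in the text.

def awsup(n_I, w_I, n, W, k, beta=0.3):
    # assumed form of formula (11): combine the normalized frequency n_I/n and
    # the length-normalized weight w_I/(k*W), balanced by beta in (0,1)
    return beta * (n_I / n) + (1 - beta) * (w_I / (k * W))

def awsup_not(sup_I):                      # formula (12): awSup(¬I)
    return 1 - sup_I

def awsup_I1_notI2(sup_I1, sup_I1I2):      # formula (13): awSup(I1 ∪ ¬I2)
    return sup_I1 - sup_I1I2

def awsup_notI1_I2(sup_I2, sup_I1I2):      # formula (14): awSup(¬I1 ∪ I2)
    return sup_I2 - sup_I1I2

def conf(sup_I1, sup_I1I2):                # assumed standard awARConf(I1 -> I2)
    return sup_I1I2 / sup_I1

def conf_neg(sup_I1, sup_I1I2):            # awARConf(I1 -> ¬I2), cf. formula (9)
    return 1 - conf(sup_I1, sup_I1I2)

sup_I1, sup_I2, sup_I1I2 = 0.4, 0.5, 0.1   # invented supports
print(awsup_not(sup_I1))                   # 0.6
print(conf_neg(sup_I1, sup_I1I2))          # 0.75
```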
3. Weighted positive and negative itemset relevance
Weighted itemset relevance is a measure of the strength of the relationship between any two single items, and between sub-itemsets, in a weighted itemset. The higher the relevance of an itemset, the closer the relationship between its sub-itemsets and the more attention it deserves. The invention improves the existing relevance measure and proposes a relevance computation method for weighted positive and negative itemsets that considers both the relatedness of any two single items in the itemset and the relatedness between two sub-itemsets of the itemset.
Weighted positive itemset relevance (All-weighted Positive Itemset Relevance, awPIR): for a weighted feature-term positive itemset C_k = (t_1, t_2, ..., t_m), where m is the length of C_k and m ≥ 2, let t_max (1 ≤ max ≤ m) be the single item of C_k with the greatest support and I_q the sub-itemset with the greatest support among all 2_sub-itemsets to (m-1)_sub-itemsets of C_k; the weighted positive itemset relevance awPIR(C_k) is computed by formulas (18) and (19).
Formulas (18) and (19) state that the relevance of the weighted positive itemset C_k equals, for m = 2, the conditional probability that C_k occurs when t_max occurs, and, for m > 2, the average of the conditional probabilities that C_k occurs when t_max and I_q (one of the 2_sub-itemsets through (m-1)_sub-itemsets), respectively, occur.
Weighted negative itemset relevance (All-weighted Negative Itemset Relevance, awNIR): for a weighted feature-term negative itemset C_k = (t_1, t_2, ..., t_r), where r is the length of C_k and r ≥ 2, let t_max (1 ≤ max ≤ r) be the single item of C_k with the greatest support and I_p the sub-itemset with the greatest support among all 2_sub-itemsets to (r-1)_sub-itemsets of C_k; the weighted negative itemset relevance awNIR(C_k) is computed by formulas (20) and (21).
Formulas (20) and (21) state that the relevance of the weighted negative itemset C_k equals, for r = 2, the conditional probability that C_k occurs when t_max does not occur, and, for r > 2, the average of the conditional probabilities that C_k occurs when t_max and I_p (one of the 2_sub-itemsets through (r-1)_sub-itemsets), respectively, do not occur.
Example: if C_k = (t_1 ∪ t_2 ∪ t_3 ∪ t_4) (support 0.65), its single items t_1, t_2, t_3 and t_4 have supports 0.82, 0.45, 0.76 and 0.75 respectively, and its 2_sub-itemsets and 3_sub-itemsets (t_1 ∪ t_2), (t_1 ∪ t_3), (t_1 ∪ t_4), (t_2 ∪ t_3), (t_2 ∪ t_4), (t_1 ∪ t_2 ∪ t_3), (t_1 ∪ t_2 ∪ t_4), (t_2 ∪ t_3 ∪ t_4) have supports 0.64, 0.78, 0.75, 0.74, 0.67, 0.66, 0.56, 0.43 respectively, then the single item with the greatest support (0.82) is t_1 and the sub-itemset with the greatest support (0.78) among the 2_ and 3_sub-itemsets is (t_1 ∪ t_3); the relevance of the positive itemset (t_1 ∪ t_2 ∪ t_3 ∪ t_4) computed by formula (19) is 0.81.
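The worked example can be checked with a few lines of code; consistent with the 0.81 result above, awPIR for m > 2 is taken here as the mean of the two conditional probabilities sup(C_k)/sup(t_max) and sup(C_k)/sup(I_q). Formula (19) itself is not reproduced in this text, so this reading is an assumption validated by the example.

```python
# Reproduce the worked awPIR example. Since formula (19) is not reproduced in
# this text, awpir is ASSUMED to be the mean of the conditional probabilities
# P(Ck | t_max) and P(Ck | I_q); the stated result 0.81 confirms this reading.

sup_Ck = 0.65                               # support of (t1 ∪ t2 ∪ t3 ∪ t4)
sup_singles = {"t1": 0.82, "t2": 0.45, "t3": 0.76, "t4": 0.75}
sup_subsets = {("t1", "t2"): 0.64, ("t1", "t3"): 0.78, ("t1", "t4"): 0.75,
               ("t2", "t3"): 0.74, ("t2", "t4"): 0.67,
               ("t1", "t2", "t3"): 0.66, ("t1", "t2", "t4"): 0.56,
               ("t2", "t3", "t4"): 0.43}

sup_tmax = max(sup_singles.values())        # 0.82, attained by t1
sup_Iq = max(sup_subsets.values())          # 0.78, attained by (t1 ∪ t3)

awpir = (sup_Ck / sup_tmax + sup_Ck / sup_Iq) / 2
print(round(awpir, 2))                      # 0.81, matching the example
```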
4. Weighted association rule lift
A limitation of the traditional association rule evaluation framework (support-confidence) is that the support of the itemset in the rule consequent is ignored, so high-confidence rules may sometimes be misleading. Lift is an effective correlation measure that addresses this problem. The lift of an association rule X → Y, Lift(X → Y), is the ratio of the probability that Y occurs given X to the probability that Y occurs overall, i.e., the ratio of the rule confidence Confidence(X → Y) to the support sup(Y) of the consequent Y. Based on the traditional lift concept, the lift of a weighted association rule I_1 → I_2 (All-weighted Association Rule Lift, awARL), awARL(I_1 → I_2), is given by formula (22).
According to correlation theory, the lift evaluates the correlation between the antecedent and the consequent of an association rule, i.e., the degree to which the occurrence of one promotes (or reduces) the occurrence of the other. When awARL(I_1 → I_2) > 1, I_1 → I_2 is a positive association rule: of the itemsets I_1 and I_2, the occurrence of one increases the probability that the other occurs. When awARL(I_1 → I_2) < 1, I_1 → I_2 is a negative association rule: the occurrence of one reduces the probability that the other occurs. When awARL(I_1 → I_2) = 1, the itemsets I_1 and I_2 are independent and unrelated, and the rule I_1 → I_2 is a spurious rule. It is easy to prove that awARL(I_1 → I_2) has the following Property 1.
Property 1. If awARL(I_1 → I_2) > 1, then: ① awARL(I_1 → ¬I_2) < 1; ② awARL(¬I_1 → I_2) < 1; ③ awARL(¬I_1 → ¬I_2) > 1. If awARL(I_1 → I_2) < 1, then: ④ awARL(I_1 → ¬I_2) > 1; ⑤ awARL(¬I_1 → I_2) > 1; ⑥ awARL(¬I_1 → ¬I_2) < 1.
According to Property 1, when awARL(I_1 → I_2) > 1, the weighted positive association rule I_1 → I_2 can be mined; when awARL(I_1 → I_2) < 1, the weighted negative association rules I_1 → ¬I_2 and ¬I_1 → I_2 can be mined.
Assuming the minimum weighted confidence threshold is mc, and combining Property 1, weighted strong positive and negative association rules are defined as follows:
for weighted positive term set (I) 1 ∪I 2 ) If awARL (I) 1 →I 2 )&gt, 1, and awARConf (I) 1 →I 2 ) If not less than mc, weighting association rule I 1 →I 2 Is a strongly associated rule.
For a negative itemset (I_1 ∪ I_2): if awARL(I_1 → I_2) < 1, awARConf(I_1 → ¬I_2) ≥ mc and awARConf(¬I_1 → I_2) ≥ mc, then I_1 → ¬I_2 and ¬I_1 → I_2 are strong negative association rules.
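The lift-based selection of strong rules can be sketched as follows. awARL is computed as confidence divided by consequent support, as defined above; the confidence of ¬I_1 → I_2 is taken as the assumed standard form, and all numeric values are invented.

```python
# Sketch of strong-rule selection by lift and confidence. awARL(I1 -> I2) is
# confidence divided by the consequent's support, as defined in the text; the
# confidence of ¬I1 -> I2 is an ASSUMED standard form.

def classify_rule(sup_I1, sup_I2, sup_I1I2, mc=0.8):
    conf = sup_I1I2 / sup_I1                  # awARConf(I1 -> I2)
    lift = conf / sup_I2                      # awARL(I1 -> I2)
    rules = []
    if lift > 1 and conf >= mc:
        rules.append("I1 -> I2")              # strong positive rule
    elif lift < 1:
        conf_neg = 1 - conf                   # awARConf(I1 -> ¬I2), formula (9)
        conf_neg2 = (sup_I2 - sup_I1I2) / (1 - sup_I1)  # awARConf(¬I1 -> I2), assumed
        if conf_neg >= mc:
            rules.append("I1 -> ¬I2")
        if conf_neg2 >= mc:
            rules.append("¬I1 -> I2")
    return rules                              # lift == 1: spurious, nothing mined

print(classify_rule(0.30, 0.25, 0.27))        # conf 0.9, lift 3.6 -> ['I1 -> I2']
print(classify_rule(0.50, 0.40, 0.05))        # conf 0.1, lift 0.25 -> ['I1 -> ¬I2']
```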
The cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback according to the invention comprises the following steps:
1.1 Translate the source-language query into a target-language query using a machine translation system;
the machine translation system may be: microsoft applied to the machine translation interface Microsoft Translator API, google machine translation interface, and so on.
1.2 Retrieve the target-language original document set with the target-language query to obtain the target-language initial documents; the specific retrieval model is a classical retrieval model based on the vector space model.
1.3 Construct the target-language initial relevant document set: perform user relevance judgment on the top-ranked n target-language initial-retrieval documents to obtain the initially relevant documents, thereby constructing the target-language initial relevant document set;
1.4 Mine weighted frequent itemsets and negative itemsets containing the original query terms from the target-language initial relevant document set;
the method comprises the following specific steps:
1.4.1 Preprocess the target-language initial relevant document set and construct a document index library and a total feature term library;
the pretreatment steps are as follows:
(1) For the target language is Chinese, performing Chinese word segmentation, removing stop words, extracting Chinese characteristic words, and adopting a Chinese lexical analysis system ICTCCLAS developed and compiled by the research institute of computational technology of Chinese academy of sciences to perform Chinese word segmentation; for English as the target language, a Porter program (see the website: http:// tartartartargarus. Org/. About martin/Porter stemmer in detail) is adopted for carrying out stem extraction, and English stop words are removed;
(2) Compute the feature-term weights. The weight of a feature term indicates its importance to the document in which it occurs; the invention adopts the classical and popular tf-idf feature-term weight w_ij, computed by formula (23) (see the sketch after this list):
where w_ij denotes the weight of feature term t_j in document d_i, tf_{j,i} denotes the frequency of occurrence of t_j in document d_i, df_j denotes the number of documents containing t_j, and N denotes the total number of documents in the document set.
(3) Construct the document index library and the total feature term library.
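A minimal sketch of step (2) follows. Formula (23) is not reproduced in this text, so the code assumes the classic tf-idf variant w_ij = tf_{j,i} × log(N / df_j); the toy documents are invented.

```python
import math

# Sketch of tf-idf feature-term weighting (step (2)). Formula (23) is not
# reproduced in this text; the classic variant tf * log(N/df) is ASSUMED here.

docs = [["query", "expansion", "retrieval"],          # invented toy documents
        ["retrieval", "model", "retrieval"],
        ["translation", "query"]]

N = len(docs)
df = {}                                               # document frequency df_j
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def tfidf(doc, term):
    tf = doc.count(term)                              # tf_{j,i}: frequency in d_i
    return tf * math.log(N / df[term])                # assumed form of formula (23)

print(round(tfidf(docs[1], "retrieval"), 3))          # 2 * log(3/2) ≈ 0.811
```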
1.4.2 Mine the frequent 1_itemsets L_1: obtain candidate 1_itemsets C_1 of feature terms from the total feature term library and compute the support awSup(C_1) of each C_1; if awSup(C_1) ≥ the support threshold ms, the candidate 1_itemset C_1 is a frequent 1_itemset L_1, and L_1 is added to the weighted frequent itemset set PIS; awSup(C_1) is computed by formula (24):
where n and W are, respectively, the total number of documents in the target-language initial relevant document set and the sum of the weights of all feature terms, n_{C_1} is the frequency with which C_1 occurs in the target-language initial relevant document set, w_{C_1} is the itemset weight of C_1 in that document set, and β ∈ (0,1) is an adjustment coefficient whose value can be neither 0 nor 1.
1.4.3 Mine the weighted frequent k_itemsets L_k containing query terms and the negative k_itemsets N_k, k ≥ 2.
The method comprises the following specific steps:
(1) Mine candidate k_itemsets C_k: obtained by performing the Apriori join on the frequent (k-1)_itemsets L_{k-1};
the Aproiri ligation method is described in the literature: agrawal R, iminilinski T, swami A. Minor association rules between sections of entities in large database [ C ]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of data, washington D C, USA, 1993.
(2) When k = 2, prune the candidate 2_itemsets C_2 containing no query term and keep the candidate 2_itemsets C_2 containing query terms;
(3) Compute the support awSup(C_k) of each candidate k_itemset C_k:
If awSup(C_k) ≥ the support threshold ms, then compute the weighted frequent itemset relevance awPIR(C_k) of C_k; if awPIR(C_k) ≥ the frequent itemset relevance threshold minPR, the candidate k_itemset C_k is a weighted frequent k_itemset L_k and is added to the weighted frequent itemset set PIS;
if awSup(C_k) < ms, compute the weighted negative itemset relevance awNIR(C_k); if awNIR(C_k) ≥ the negative itemset relevance threshold minNR, then C_k is a weighted negative k_itemset N_k and is added to the weighted negative itemset set NIS. awSup(C_k) is computed by formula (25):
where n_{C_k} is the frequency with which C_k occurs in the target-language initial relevant document set, w_{C_k} is the itemset weight of C_k in that document set, and k is the number of items in C_k.
awPIR(C_k) is computed in two cases, m = 2 and m > 2, as shown in formulas (26) and (27):
where the candidate weighted positive itemset C_k = (t_1, t_2, ..., t_m), m ≥ 2, t_max (1 ≤ max ≤ m) is the single item of C_k with the greatest support, and I_q is the sub-itemset with the greatest support among all 2_sub-itemsets to (m-1)_sub-itemsets of C_k.
awNIR(C_k) is computed in two cases, r = 2 and r > 2, as shown in formulas (28) and (29):
where the candidate weighted negative itemset C_k = (t_1, t_2, ..., t_r), r ≥ 2, t_max (1 ≤ max ≤ r) is the single item of C_k with the greatest support, and I_p is the sub-itemset with the greatest support among all 2_sub-itemsets to (r-1)_sub-itemsets of C_k.
(4) If the frequent k_itemset L_k is empty, itemset mining ends and the procedure goes to step 1.5; otherwise return to step (1) and continue mining.
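The candidate generation of steps (1) and (2), i.e., the Apriori join of L_{k-1} plus the pruning of candidates without query terms at k = 2, can be sketched generically as follows; itemsets are modeled as frozensets, and the sample itemsets and query term are invented.

```python
from itertools import combinations

# Sketch of steps 1.4.3 (1)-(2): Apriori join of the frequent (k-1)_itemsets,
# then (for k = 2) pruning of candidates that contain no query term.

def apriori_join(L_prev, k):
    """Join frequent (k-1)_itemsets into candidate k_itemsets, keeping only
    candidates all of whose (k-1)-subsets are frequent (standard Apriori)."""
    candidates = set()
    items = sorted({t for itemset in L_prev for t in itemset})
    prev = set(L_prev)
    for combo in combinations(items, k):
        c = frozenset(combo)
        if all(frozenset(s) in prev for s in combinations(combo, k - 1)):
            candidates.add(c)
    return candidates

query_terms = {"gdp"}
L1 = [frozenset(["gdp"]), frozenset(["economy"]), frozenset(["growth"])]
C2 = apriori_join(L1, 2)
C2 = {c for c in C2 if c & query_terms}     # step (2): keep query-term candidates
print(sorted(sorted(c) for c in C2))        # [['economy','gdp'], ['gdp','growth']]
```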
1.5 Mine weighted strong positive association rules from the weighted frequent itemset set PIS: for each frequent k_itemset L_k in the feature-term weighted frequent itemset set PIS, k ≥ 2, mine the association rules I → qt of L_k whose antecedent is an expansion term set I and whose consequent is a query term set qt, where the union of qt and I is L_k and the intersection of qt and I is empty; qt is a query term set and I is an expansion term set. The specific mining steps are as follows:
(1) Find all proper subsets of the positive itemset L_k to obtain the proper subset set of L_k;
(2) Take any two sub-itemsets qt and I from the proper subset set such that qt ∪ I = L_k and qt ∩ I = ∅;
(3) Compute the confidence awARConf(I → qt) of the weighted association rule I → qt and its lift awARL(I → qt). If awARL(I → qt) > 1 and awARConf(I → qt) ≥ mc, the weighted strong association rule I → qt is obtained and added to the weighted strong positive association rule set PAR. awARConf(I → qt) and awARL(I → qt) are computed by formulas (30) and (31):
(4) Return to step (2) and repeat until each proper subset in the proper subset set of L_k has been taken exactly once; then take a new positive itemset L_k from the PIS set and go to step (1) for a new round of weighted association rule mining, until every positive itemset L_k in PIS has been taken; then perform step 1.6.
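Step 1.5 enumerates, for each frequent itemset L_k, the splits into a query-term part qt and an expansion part I. A sketch under the same assumed confidence and lift forms, with an invented support oracle:

```python
from itertools import combinations

# Sketch of step 1.5: for a frequent itemset Lk, enumerate splits Lk = qt ∪ I
# with qt the query-term part, and keep rules I -> qt with lift > 1 and
# confidence >= mc. The sup dictionary is an invented support oracle standing
# in for the awSup computations.

def mine_positive_rules(Lk, query_terms, sup, mc=0.8):
    PAR = []
    for r in range(1, len(Lk)):                       # proper subsets of Lk
        for qt in combinations(sorted(Lk), r):
            qt = frozenset(qt)
            I = Lk - qt
            if not (qt <= query_terms and I):         # qt must be query terms
                continue
            conf = sup[Lk] / sup[I]                   # awARConf(I -> qt), assumed
            lift = conf / sup[qt]                     # awARL(I -> qt)
            if lift > 1 and conf >= mc:
                PAR.append((I, qt))
    return PAR

sup = {frozenset(["gdp"]): 0.30, frozenset(["economy"]): 0.25,
       frozenset(["economy", "gdp"]): 0.22}           # invented supports
Lk = frozenset(["economy", "gdp"])
print(mine_positive_rules(Lk, frozenset(["gdp"]), sup))
# [(frozenset({'economy'}), frozenset({'gdp'}))], i.e. rule economy -> gdp
```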
1.6 Mine weighted strong negative association rules from the negative itemset set NIS: for each negative itemset N_k in NIS, k ≥ 2, mine the weighted negative association rules I → ¬qt and ¬I → qt between the query term set qt and the negative expansion term set I, where the union of qt and I is N_k and the intersection of qt and I is empty. The specific mining steps are as follows:
(1) Find all proper subsets of the negative itemset N_k to obtain the proper subset set of N_k.
(2) Take any two sub-itemsets qt and I from the proper subset set such that qt ∪ I = N_k and qt ∩ I = ∅, where qt is the query term set;
(3) Compute the lift awARL(I → qt); if awARL(I → qt) < 1:
compute the confidence awARConf(I → ¬qt) of the negative association rule I → ¬qt; if awARConf(I → ¬qt) ≥ mc, the weighted strong negative association rule I → ¬qt is obtained and added to the weighted strong negative association rule set NAR;
compute the confidence awARConf(¬I → qt) of the negative association rule ¬I → qt; if awARConf(¬I → qt) ≥ mc, the weighted strong negative association rule ¬I → qt is obtained and added to NAR. awARConf(I → ¬qt) and awARConf(¬I → qt) are computed by formulas (32) and (33):
awARConf(I → ¬qt) = 1 - awARConf(I → qt)   (32)
(4) Return to step (2) and repeat until each proper subset in the proper subset set of N_k has been taken exactly once; then go to step (5);
(5) Take a new negative itemset N_k from the NIS set and go to step (1) for a new round of weighted negative association rule mining; when every negative itemset in NIS has been taken exactly once, the mining of weighted strong negative association rules ends, and the procedure goes to step 1.7.
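Step 1.6 mirrors the positive case but fires on lift < 1 and produces the two negative rule forms. A sketch, with the ¬-rule confidences taken as the assumed standard forms consistent with formula (32), and invented supports:

```python
from itertools import combinations

# Sketch of step 1.6: for a negative itemset Nk, keep the rule forms I -> ¬qt
# and ¬I -> qt when awARL(I -> qt) < 1 and the respective confidence >= mc.
# The sup values are invented; confidences use assumed standard forms.

def mine_negative_rules(Nk, query_terms, sup, mc=0.8):
    NAR = []
    for r in range(1, len(Nk)):
        for qt in combinations(sorted(Nk), r):
            qt = frozenset(qt)
            I = Nk - qt
            if not (qt <= query_terms and I):
                continue
            conf = sup[Nk] / sup[I]                       # awARConf(I -> qt)
            if conf / sup[qt] >= 1:                       # awARL(I -> qt) must be < 1
                continue
            if 1 - conf >= mc:                            # formula (32)
                NAR.append(("I -> ¬qt", I, qt))
            conf_niq = (sup[qt] - sup[Nk]) / (1 - sup[I]) # awARConf(¬I -> qt), assumed
            if conf_niq >= mc:
                NAR.append(("¬I -> qt", I, qt))
    return NAR

sup = {frozenset(["gdp"]): 0.30, frozenset(["football"]): 0.40,
       frozenset(["football", "gdp"]): 0.02}              # invented supports
Nk = frozenset(["football", "gdp"])
print(mine_negative_rules(Nk, frozenset(["gdp"]), sup))
# [('I -> ¬qt', frozenset({'football'}), frozenset({'gdp'}))]
```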
1.7 Extract from the weighted strong positive association rule set PAR the weighted positive association rule patterns I → qt whose consequent is a query term, and use the feature terms of the positive rule antecedents as candidate expansion terms to construct a candidate antecedent expansion term library.
1.8 Extract from the weighted strong negative association rule set NAR the weighted negative association rule patterns I → ¬qt and ¬I → qt whose consequent involves a query term, and use the antecedent I of each negative rule as antecedent negative expansion terms to construct an antecedent negative expansion term library.
1.9 Compare each candidate antecedent expansion term in the candidate antecedent expansion term library with the negative expansion terms in the antecedent negative expansion term library, and delete from the candidate library the candidate expansion terms identical to negative expansion terms; the remaining candidate antecedent expansion terms are the final antecedent expansion terms.
2.0 Combine the final antecedent expansion terms with the target-language original query terms and retrieve again, implementing post-translation antecedent expansion of the cross-language query.
Experimental design and results:
To demonstrate the effectiveness of the method, Vietnamese-English cross-language information retrieval experiments were carried out with Vietnamese and English as the language pair, using the method of the invention and the comparison methods.
Experimental data set:
The English text data set of NTCIR-5 CLIR is selected as the experimental corpus. The corpus is the standard test collection for cross-language information retrieval used in the multilingual-processing international evaluation conference sponsored by the National Institute of Informatics of Japan; it is derived from Mainichi Daily News 2000 and 2001 (abbreviated mdn00 and mdn01) and Korea Times 2001 (abbreviated ktn), totaling 26224 English texts (6608 in mdn00, 5547 in mdn01 and 14069 in ktn). The data set comprises a document test set, a result set and a query set. The result set has two standards: Rigid (highly relevant and relevant to the query) and Relax (highly relevant, relevant and partially relevant to the query). The query set contains 50 query topics in four language versions (Japanese, Korean, Chinese and English) and 4 query topic types: TITLE, DESC, NARR and CONC; the TITLE type describes a query topic briefly with nouns and noun phrases and belongs to short queries, while the DESC type describes a query topic briefly with sentences and belongs to long queries. Retrieval experiments were performed with the TITLE and DESC query types.
In the experiments of the invention, because the NTCIR-5 CLIR corpus does not provide a Vietnamese query version, professional ASEAN-language translators of a translation agency were asked to manually translate the 50 Chinese-version query topics in NTCIR-5 CLIR into Vietnamese queries as the source-language queries of the experiments.
The baseline comparison methods are as follows:
(1) Vietnamese-English Cross-Language Retrieval (VECLR) baseline: the result of the first Vietnamese-English cross-language retrieval, i.e., the retrieval result obtained by machine-translating the Vietnamese source-language query into English and retrieving the English documents, without any query expansion in the retrieval process.
(2) Query Post-Translation Expansion based on Pseudo-Relevance Feedback (QPTE_PRF): the QPTE_PRF baseline follows the cross-language query expansion method of the literature (Wu Dan, He Daqing, Wang Huilin. Pseudo-relevance feedback based cross-language query expansion [J]. Journal of the China Society for Scientific and Technical Information, 2010, 29(2): 232-239.) to realize post-translation expansion for Vietnamese-English cross-language retrieval. The experimental method and parameters are: machine-translate the Vietnamese source-language query into an English query and retrieve the English documents; extract the top 20 English documents of the cross-language initial retrieval to construct the initial English relevant document set; extract English feature terms and compute their weights; and take the top 20 feature terms in descending order of weight as English expansion terms to implement post-translation expansion of the Vietnamese-English cross-language query.
R-precision (R-Prec) and P@5 are adopted as the cross-language retrieval evaluation metrics of the invention. R-precision is the precision computed when R documents have been retrieved, where R is the number of relevant documents in the document collection for the query; it does not emphasize the ranking of documents within the result set.
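Both metrics admit a direct implementation; a sketch with an invented ranked list and relevance judgments:

```python
# Sketch of the two evaluation metrics: R-precision (precision at rank R, with
# R the number of relevant documents for the query) and P@5 (precision at 5).
# The ranking and relevance judgments below are invented sample data.

def precision_at(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

def r_precision(ranking, relevant):
    return precision_at(ranking, relevant, len(relevant))

ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]   # system output, best first
relevant = {"d1", "d3", "d4"}                    # known relevant documents (R = 3)

print(r_precision(ranking, relevant))            # 2/3: d3 and d1 in the top 3
print(precision_at(ranking, relevant, 5))        # P@5 = 3/5 = 0.6
```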
The experimental results are as follows:
The method of the invention and the baseline methods were implemented, and their Vietnamese-English cross-language information retrieval performance was analyzed and compared experimentally. Vietnamese-English cross-language retrieval was performed for the 50 Vietnamese TITLE and DESC queries; user relevance judgment was performed on the top 50 English documents of the cross-language initial retrieval to obtain the user relevance-feedback documents of the initial retrieval (for simplicity, in these experiments the relevant documents among the top 50 initially retrieved documents according to the known result set were taken as the initial relevant documents). The average R-Prec and P@5 values of the Vietnamese-English retrieval results are shown in Tables 1 and 2, respectively. The common experimental parameters were set as follows: β = 0.3, minPR = 0.1, minNR = 0.01, mining up to 3_itemsets.
Table 1 Retrieval performance comparison of the method of the invention with the baseline methods (TITLE queries)
The experimental parameters for this table are mc = 0.8, ms ∈ {0.2, 0.25, 0.3, 0.35, 0.4, 0.45} (mdn00) and ms ∈ {0.2, 0.23, 0.25, 0.28, 0.3} (mdn01 and ktn).
The experimental results in Table 1 show that, compared with the baseline methods VECLR and QPTE_PRF, the R-Prec and P@5 values of the Vietnamese-English cross-language retrieval results of the method for TITLE queries are greatly improved: the improvement over VECLR reaches up to 91.28%, and the improvement over QPTE_PRF reaches up to 265.88%.
Table 2 Retrieval performance comparison of the method of the invention with the baseline methods (DESC queries)
The experimental parameters for this table are mc = 0.8, ms ∈ {0.2, 0.23, 0.25, 0.28, 0.3}.
The experimental results in Table 2 show that the R-Prec and P@5 values of the Vietnamese-English cross-language retrieval results of the method for DESC queries are also greatly improved compared with the baselines VECLR and QPTE_PRF, with maximum improvements of 137.38% and 238.75%, respectively.
The experimental results show that the method is effective and can indeed improve cross-language information retrieval performance.

Claims (1)

1. A cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback, characterized by comprising the following steps:
1.1 Translate the source-language query into a target-language query using a machine translation system;
1.2 Retrieve the target-language original document set with the target-language query to obtain the target-language initial-retrieval documents;
1.3 Construct the target-language initial relevant document set: perform user relevance judgment on the top-ranked n target-language initial-retrieval documents to obtain the initially relevant documents, thereby constructing the target-language initial relevant document set;
1.4 Mine weighted frequent itemsets and negative itemsets containing the original query terms from the target-language initial relevant document set;
the method comprises the following specific steps:
1.4.1 Preprocess the target-language initial relevant document set and construct a document index library and a total feature term library;
1.4.2 Mine the frequent 1_itemsets L_1:
obtain candidate 1_itemsets C_1 of feature terms from the total feature term library and compute the support awSup(C_1) of each C_1; if awSup(C_1) ≥ the support threshold ms, the candidate 1_itemset C_1 is a frequent 1_itemset L_1, and L_1 is added to the weighted frequent itemset set PIS; awSup(C_1) is computed as follows:
where n and W are, respectively, the total number of documents in the target-language initial relevant document set and the sum of the weights of all feature terms, n_{C_1} is the frequency with which C_1 occurs in the target-language initial relevant document set, w_{C_1} is the itemset weight of C_1 in that document set, and β ∈ (0,1) is an adjustment coefficient whose value can be neither 0 nor 1;
1.4.3 Mine the frequent k_itemsets L_k containing query terms and the negative k_itemsets N_k, k ≥ 2
The method comprises the following specific steps:
(1) Mine candidate k_itemsets C_k: obtained by performing the Apriori join on the frequent (k-1)_itemsets L_{k-1};
(2) When k = 2, prune the candidate 2_itemsets C_2 containing no query term and keep the candidate 2_itemsets C_2 containing query terms;
(3) Compute the support awSup(C_k) of each candidate k_itemset C_k:
If awSup(C_k) ≥ the support threshold ms, then compute the weighted frequent itemset relevance awPIR(C_k) of C_k; if awPIR(C_k) ≥ the frequent itemset relevance threshold minPR, the candidate k_itemset C_k is a weighted frequent k_itemset L_k and is added to the weighted frequent itemset set PIS;
if awSup(C_k) < ms, compute the weighted negative itemset relevance awNIR(C_k); if awNIR(C_k) ≥ the negative itemset relevance threshold minNR, then C_k is a weighted negative k_itemset N_k and is added to the weighted negative itemset set NIS; awSup(C_k) is computed as follows:
where n_{C_k} is the frequency with which C_k occurs in the target-language initial relevant document set, w_{C_k} is the itemset weight of C_k in that document set, and k is the number of items in C_k;
awPIR(C_k) is computed in two cases, m = 2 and m > 2, namely:
where the candidate weighted positive itemset C_k = (t_1, t_2, ..., t_m), m ≥ 2, t_max (1 ≤ max ≤ m) is the single item of C_k with the greatest support, and I_q is the sub-itemset with the greatest support among all 2_sub-itemsets to (m-1)_sub-itemsets of C_k;
awNIR(C_k) is computed in two cases, r = 2 and r > 2, namely:
where the candidate weighted negative itemset C_k = (t_1, t_2, ..., t_r), r ≥ 2, t_max (1 ≤ max ≤ r) is the single item of C_k with the greatest support, and I_p is the sub-itemset with the greatest support among all 2_sub-itemsets to (r-1)_sub-itemsets of C_k;
(4) If the frequent k_itemset L_k is empty, itemset mining ends and the procedure goes to step 1.5; otherwise return to step (1) and continue mining;
1.5 Mine weighted strong positive association rules from the weighted frequent itemset set PIS: for each frequent k_itemset L_k in the feature-term weighted frequent itemset set PIS, k ≥ 2, mine the association rules I → qt of L_k whose antecedent is an expansion term set I and whose consequent is a query term set qt, where the union of qt and I is L_k and the intersection of qt and I is empty; qt is a query term set and I is an expansion term set; the specific mining steps are as follows:
(1) Find all proper subsets of the positive itemset L_k to obtain the proper subset set of L_k;
(2) Take any two sub-itemsets qt and I from the proper subset set such that qt ∪ I = L_k and qt ∩ I = ∅;
(3) Compute the confidence awARConf(I → qt) of the weighted association rule I → qt and its lift awARL(I → qt); if awARL(I → qt) > 1 and awARConf(I → qt) ≥ mc, the weighted strong association rule I → qt is obtained and added to the weighted strong positive association rule set PAR; awARConf(I → qt) and awARL(I → qt) are computed as follows:
(4) Return to step (2) and repeat until each proper subset in the proper subset set of L_k has been taken exactly once; then take a new positive itemset L_k from the PIS set and go to step (1) for a new round of weighted association rule mining, until every positive itemset L_k in PIS has been taken; then go to step 1.6;
1.6 Mine weighted strong negative association rules from the negative itemset set NIS: for each negative itemset N_k in NIS, k ≥ 2, mine the weighted negative association rules I → ¬qt and ¬I → qt between the query term set qt and the negative expansion term set I, where the union of qt and I is N_k and the intersection of qt and I is empty; the specific mining steps are as follows:
(1) Find all proper subsets of the negative itemset N_k to obtain the proper subset set of N_k;
(2) Take any two sub-itemsets qt and I from the proper subset set such that qt ∪ I = N_k and qt ∩ I = ∅, where qt is the query term set;
(3) Compute the lift awARL(I → qt); if awARL(I → qt) < 1:
compute the confidence awARConf(I → ¬qt) of the negative association rule I → ¬qt; if awARConf(I → ¬qt) ≥ mc, the weighted strong negative association rule I → ¬qt is obtained and added to the weighted strong negative association rule set NAR;
compute the confidence awARConf(¬I → qt) of the negative association rule ¬I → qt; if awARConf(¬I → qt) ≥ mc, the weighted strong negative association rule ¬I → qt is obtained and added to NAR; awARConf(I → ¬qt) and awARConf(¬I → qt) are computed as follows:
awARConf(I → ¬qt) = 1 - awARConf(I → qt)
(4) Return to step (2) and repeat until each proper subset in the proper subset set of N_k has been taken exactly once; then go to step (5);
(5) Take a new negative itemset N_k from the NIS set and go to step (1) for a new round of weighted negative association rule mining; when every negative itemset in NIS has been taken exactly once, the mining of weighted strong negative association rules ends, and the procedure goes to step 1.7;
1.7 Extract from the weighted strong positive association rule set PAR the weighted positive association rule patterns I → qt whose consequent is a query term, and use the feature terms of the positive rule antecedents as candidate expansion terms to construct a candidate antecedent expansion term library;
1.8 Extract from the weighted strong negative association rule set NAR the weighted negative association rule patterns I → ¬qt and ¬I → qt whose consequent involves a query term, and use the antecedent I of each negative rule as antecedent negative expansion terms to construct an antecedent negative expansion term library;
1.9 Compare each candidate antecedent expansion term in the candidate antecedent expansion term library with the negative expansion terms in the antecedent negative expansion term library, and delete from the candidate library the candidate expansion terms identical to negative expansion terms; the remaining candidate antecedent expansion terms are the final antecedent expansion terms;
2.0 Combine the final antecedent expansion words with the original target-language query terms into a new query and retrieve again, realizing post-translation antecedent expansion of the cross-language query. A sketch chaining steps 1.7 through 2.0 follows.
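Chaining the two sketches above, steps 1.7 through 2.0 reduce to collecting rule antecedents and taking a set difference. mine_positive_rules and mine_negative_rules are the hypothetical helpers defined earlier; PIS and NIS stand for the positive and negative item set collections produced by the mining phase.

```python
def expand_query(PIS, NIS, query_terms, aw_sup, mc):
    """Steps 1.7-2.0: final antecedent expansion words and the expanded query."""
    candidates, negatives = set(), set()
    for Lk in PIS:                       # 1.7: antecedents of strong positive rules
        for I, qt, conf, lift in mine_positive_rules(Lk, query_terms, aw_sup, mc):
            candidates |= I
    for Nk in NIS:                       # 1.8: antecedents of strong negative rules
        for kind, I, qt, conf in mine_negative_rules(Nk, query_terms, aw_sup, mc):
            negatives |= I
    final_words = candidates - negatives            # 1.9: drop negative expansion words
    return list(query_terms) + sorted(final_words)  # 2.0: re-retrieve with this query
```

The set difference in step 1.9 is what distinguishes this scheme from plain positive-rule expansion: a term proposed as the antecedent of any strong negative rule is vetoed even if a positive rule also proposes it.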
CN201710807540.4A 2017-09-08 2017-09-08 Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback Expired - Fee-Related CN107609095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710807540.4A CN107609095B (en) 2017-09-08 2017-09-08 Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback


Publications (2)

Publication Number Publication Date
CN107609095A (en) 2018-01-19
CN107609095B CN107609095B (en) 2019-07-09

Family

ID=61062737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710807540.4A Expired - Fee-Related CN107609095B (en) Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback

Country Status (1)

Country Link
CN (1) CN107609095B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216874A (en) * 2014-09-22 2014-12-17 广西教育学院 Method and system for mining weighted positive and negative patterns between Chinese words based on correlation coefficients
CN105095512A (en) * 2015-09-09 2015-11-25 四川省科技交流中心 Cross-language private data retrieval system and method based on bridge language
CN106557478A (en) * 2015-09-25 2017-04-05 四川省科技交流中心 Distributed cross-language retrieval system and retrieval method based on bridge language

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
余如 et al.: "Fully weighted positive and negative association rule mining and its application in education data", Journal of Chinese Information Processing (《中文信息学报》) *
周秀梅 et al.: "An effective matrix-weighted positive and negative association rule mining algorithm: MWARM-SRCCCI", Journal of Computer Applications (《计算机应用》) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299278A (en) * 2018-11-26 2019-02-01 广西财经学院 Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
CN109299292A (en) * 2018-11-26 2019-02-01 广西财经学院 Text retrieval method based on matrix weighted association rule front and back part mixed expansion
CN109299292B (en) * 2018-11-26 2022-02-15 广西财经学院 Text retrieval method based on matrix weighted association rule front and back part mixed expansion
CN109299278B (en) * 2018-11-26 2022-02-15 广西财经学院 Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
CN109684464A (en) * 2018-12-30 2019-04-26 广西财经学院 Cross-language query expansion method for realizing rule back-part mining through weight comparison

Also Published As

Publication number Publication date
CN107609095B (en) 2019-07-09

Similar Documents

Publication Publication Date Title
Xu et al. SemEval-2015 Task 1: Paraphrase and semantic similarity in Twitter (PIT)
Han et al. A generative entity-mention model for linking entities with knowledge base
WO2015196909A1 (en) Word segmentation method and device
US20160041986A1 (en) Smart Search Engine
EP1877939A1 (en) Suggesting and refining user input based on original user input
CN107193921A (en) Method and system for search-engine-oriented Chinese-English mixed query error correction
CN107102983B (en) Word vector representation method of Chinese concept based on network knowledge source
CN107609095A (en) Cross-language query expansion method based on weighted positive and negative rule antecedents and relevance feedback
CN102662936A (en) Chinese-English unknown word translation method combining Web mining, multiple features and supervised learning
CN109299278B (en) Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent
CN105760366A (en) New word discovery method for a specific field
CN109684463B (en) Cross-language post-translation and front-part extension method based on weight comparison and mining
CN107526839A (en) Cross-language query post-translation consequent expansion method based on fully weighted positive and negative patterns
Thu An optimization text summarization method based on Naive Bayes and topic words for single-syllable languages
CN109726263B (en) Cross-language post-translation hybrid expansion method based on feature word weighted association pattern mining
CN109739953B (en) Text retrieval method based on chi-square analysis-confidence framework and back-part expansion
Ethiraj et al. NELIS-Named Entity and Language Identification System: Shared Task System Description.
CN109299292B (en) Text retrieval method based on matrix weighted association rule front and back part mixed expansion
CN109684465B (en) Text retrieval method based on pattern mining and mixed expansion of item set weight value comparison
CN109739952A (en) Cross-language retrieval method based on pattern mining and expansion fusing relevance degree and chi-square value
CN109684464B (en) Cross-language query expansion method for realizing rule back-part mining through weight comparison
Alper Auto-generating Bilingual Dictionaries: Results of the TIAD-2017 Shared Task Baseline Algorithm.
CN108170778B (en) Chinese-English cross-language query post-translation expansion method based on fully weighted rule consequents
CN108416442B (en) Chinese word matrix weighting association rule mining method based on item frequency and weight
CN107562904B (en) Weighted positive and negative association pattern mining method between English words fusing term weight and frequency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190709

Termination date: 20200908