CN109739952A - Cross-language retrieval method based on pattern mining and expansion fusing the association degree and the chi-square value - Google Patents
- Publication number
- CN109739952A CN201811646512.XA
- Authority
- CN
- China
- Prior art keywords
- item
- document
- language
- item collection
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a cross-language retrieval method based on pattern mining and expansion that fuses the association degree and the chi-square value. A source-language query is first translated into the target language by a machine translation tool and used to retrieve target-language documents. An initial relevant document set is built from user relevance feedback. Frequent itemsets containing original query terms are mined from this set by comparing itemset weights. By fusing the association degree, the chi-square value, and the confidence, feature-word association rules containing original query terms are extracted from the frequent itemsets. The antecedent itemsets of rules whose consequent is an original query term are taken as expansion words. The expansion words are combined with the original query words into a new query, which retrieves the target-language documents again to obtain the final result documents; these are machine-translated into the source language and returned to the user. The invention improves mining efficiency through itemset pruning, mines high-quality expansion words, and improves cross-language retrieval performance, giving it high application value and promotion prospects in the field of information retrieval.
Description
Technical field
The invention belongs to the field of information retrieval, and specifically relates to a cross-language retrieval method based on pattern mining and expansion that fuses the association degree and the chi-square value.
Background technique
Cross-language information retrieval is a retrieval technique that uses a query in one language, with the help of machine translation tools, to retrieve information resources in another language or in multiple languages. With the rapid development of network and machine translation technology, cross-language information retrieval has attracted wide attention. Scholars have studied cross-language retrieval models and algorithms from different angles and directions and have achieved rich results. However, current cross-language information retrieval research still suffers from serious query topic drift and word mismatch, which often lead to poor cross-language retrieval performance. In recent years, cross-language information retrieval based on association rule mining and query expansion has received more attention. Examples include a method based on relevance feedback expansion (Gao J F, Nie J Y, Zhang J, et al. TREC-9 CLIR Experiments at MSRCN [C]. In: Proceedings of the 9th Text Retrieval Evaluation Conference, 2001: 343-353.), a method based on latent semantics (Ning Jian, Lin Hongfei. Cross-language retrieval based on improved latent semantic analysis [J]. Journal of Chinese Information Processing, 2010, 24(3): 105-111.), and methods based on association pattern mining and query expansion (Huang Mingxuan. Vietnamese-English cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-318.; Huang Mingxuan. Indonesian-Chinese cross-language query expansion fusing all-weighted pattern mining and relevance feedback [J]. Journal of Chinese Computer Systems, 2017, 38(8): 1783-1791.). Nevertheless, the recall and precision problems in cross-language information retrieval have not yet been fully solved.
At present, because Nanning, China is the permanent host city of the China-ASEAN Expo, political, economic, and cultural exchanges between China and the ASEAN countries are increasingly frequent and close, and research on cross-language information retrieval and cross-language information services for the languages of the ASEAN countries is becoming more urgent and more important. It is therefore necessary to study a cross-language retrieval method that takes Indonesian as the source language and English as the target language. Such a method can improve cross-language text retrieval performance and has good application value and promotion prospects.
Summary of the invention
The invention proposes a cross-language retrieval method based on pattern mining and expansion that fuses the association degree and the chi-square value. It is suitable for the field of cross-language information retrieval, can improve cross-language retrieval performance, and addresses the query topic drift and word mismatch problems in cross-language information retrieval.
The present invention adopts the following technical scheme:
The cross-language retrieval method based on pattern mining and expansion that fuses the association degree and the chi-square value comprises the following steps:
(1) Translate the source-language user query into the target language with a machine translation tool, and retrieve the target-language document collection with a vector space retrieval model to obtain the top-ranked documents of the initial retrieval.
The machine translation tool may be the Microsoft Bing machine translation interface (Microsoft Translator API), the Google machine translation interface, or the like.
(2) Build the initial relevant document set by judging the relevance of the top-ranked documents of the initial retrieval.
(3) Preprocess the initial relevant document set, and build the target-language document index and the feature dictionary.
The preprocessing depends on the language. For example, if the target language is English, remove English stop words, extract English feature-word stems with the Porter stemmer (see http://tartarus.org/martin/PorterStemmer), and compute the English feature-word weights. If the target language is Chinese, remove Chinese stop words, segment the Chinese documents, extract the Chinese feature words, and compute their weights. The feature-word weight is given by formula (1):
In formula (1), w_ij is the weight of feature word t_j in document d_i, and tf_{j,i} is the term frequency of t_j in d_i. tf_{j,i} is usually normalized by dividing the tf_{j,i} of each feature word in d_i by the maximum term frequency in d_i. idf_j is the inverse document frequency.
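The weighting described for formula (1) can be sketched as follows. This is a minimal illustration, not the patent's formula: it assumes the weight is the normalized term frequency multiplied by idf_j, and that idf_j = log(n / df_j), where df_j is the number of documents containing t_j. Both the product form and the logarithm, as well as the function name feature_weights, are assumptions made for illustration.

```python
import math
from collections import Counter


def feature_weights(docs):
    """Compute a {term: weight} dict for each document.

    docs: list of token lists, already stop-word filtered and stemmed.
    Assumed form: w_ij = (tf_ji / max tf in d_i) * log(n / df_j).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        max_tf = max(tf.values()) if tf else 1
        # Normalize tf by the maximum term frequency of this document.
        weights.append(
            {t: (c / max_tf) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights
```

With this assumed idf, a term that occurs in every document gets weight 0, so it can never reach a positive support threshold.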
(4) Mine the text feature-word 1-frequent itemsets L_1, as follows:
(4.1) Extract the feature words from the feature dictionary as 1-candidate itemsets C_1;
(4.2) Scan the target-language document index, and count the total number of text documents n and the itemset weight w[C_1] of each C_1;
(4.3) Compute the minimum weight support threshold MWS by formula (2):
MWS = n × ms    (2)
In formula (2), ms is the minimum support threshold and n is the total number of text documents in the target-language document index.
(4.4) If w[C_1] ≥ MWS, then C_1 is a text feature-word 1-frequent itemset L_1 and is added to the frequent itemset set FIS.
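Step (4) can be sketched in a few lines. The sketch assumes the document index is a list of {term: weight} dictionaries, one per document, and that the weight of a 1-itemset is the sum of that term's weights over all documents; the index layout and the name mine_l1 are hypothetical.

```python
def mine_l1(index, ms):
    """Return the 1-frequent itemsets and the threshold MWS.

    index: list of {term: weight} dicts, one per document (assumed layout).
    ms:    minimum support threshold.
    """
    n = len(index)
    mws = n * ms  # formula (2): MWS = n * ms

    # Accumulate the weight of every 1-candidate over the whole index.
    item_weight = {}
    for doc in index:
        for term, w in doc.items():
            item_weight[term] = item_weight.get(term, 0.0) + w

    # Step (4.4): keep the candidates whose weight reaches MWS.
    l1 = {term: w for term, w in item_weight.items() if w >= mws}
    return l1, mws
```

For example, with two documents and ms = 0.5 the threshold is MWS = 1.0, so only terms whose accumulated weight is at least 1.0 enter L_1.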
(5) Mine the text feature-word 2-frequent itemsets L_2, as follows:
(5.1) Self-join the text feature-word 1-frequent itemsets L_1 with the Apriori join method to obtain multiple 2-candidate itemsets C_2.
The Apriori join method is described in (Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large database [C] // Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D C, USA, 1993: 207-216.)
(5.2) Prune the 2-candidate itemsets C_2 that contain no original query term;
(5.3) Scan the target-language document index and count the itemset weight w[C_2] of each remaining 2-candidate itemset C_2;
(5.4) If w[C_2] ≥ MWS, then C_2 is a text feature-word 2-frequent itemset L_2 and is added to the frequent itemset set FIS.
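Step (5) can be sketched as below. Two points are assumptions made for illustration: the index is a list of {term: weight} dictionaries, and the itemset weight is taken as the sum of the item weights over the documents that contain every item of the itemset. The helper names itemset_weight and mine_l2 are hypothetical.

```python
from itertools import combinations


def itemset_weight(index, itemset):
    """Sum the item weights over documents containing all items (assumed definition)."""
    total = 0.0
    for doc in index:
        if all(t in doc for t in itemset):
            total += sum(doc[t] for t in itemset)
    return total


def mine_l2(index, l1_terms, query_terms, mws):
    """Return {pair: weight} for the 2-frequent itemsets."""
    query = set(query_terms)
    l2 = {}
    # Step (5.1): self-join of L_1 yields every unordered pair.
    for pair in combinations(sorted(l1_terms), 2):
        # Step (5.2): prune pairs without any original query term.
        if not query & set(pair):
            continue
        # Steps (5.3) and (5.4): count the weight and compare with MWS.
        w = itemset_weight(index, pair)
        if w >= mws:
            l2[pair] = w
    return l2
```

Pruning before the index scan is what saves work here: pairs without a query term are discarded without ever touching the documents.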
(6) Mine the text feature-word k-frequent itemsets L_k, k ≥ 2, as follows:
(6.1) Self-join the text feature-word (k-1)-frequent itemsets L_{k-1} with the Apriori join method to obtain multiple k-candidate itemsets C_k = (i_1, i_2, …, i_k), k ≥ 2;
(6.2) Scan the target-language document index, count the itemset weight w[C_k] of each C_k and the maximum item weight w_m in each C_k, and obtain the item i_m corresponding to w_m, where m ∈ (1, 2, …, k);
(6.3) If the 1-itemset (i_m) corresponding to the item i_m is non-frequent, or w_m < MWS, prune C_k.
(6.4) For each remaining C_k, compute its itemset association degree IRe(C_k). If w[C_k] ≥ MWS × k and IRe(C_k) ≥ minIRe, then C_k is a text feature-word k-frequent itemset L_k and is added to the frequent itemset set FIS. Here minIRe is the minimum itemset association degree threshold, and IRe(C_k) is computed by formula (3):
In formula (3), w_min[(i_q)] and w_max[(i_p)] are defined as follows: for C_k = (i_1, i_2, …, i_k), the items i_1, i_2, …, i_k of the k-candidate itemset C_k taken alone as 1-itemsets are (i_1), (i_2), …, (i_k); w_min[(i_q)] and w_max[(i_p)] denote the smallest and the largest 1-itemset weight among (i_1), (i_2), …, (i_k), with q ∈ (1, 2, …, k) and p ∈ (1, 2, …, k);
(6.5) If the set of text feature-word k-frequent itemsets L_k is empty, frequent itemset mining ends and the method proceeds to step (7); otherwise, k is incremented by 1 and the method returns to step (6.1).
(7) Take any text feature-word k-frequent itemset L_k from the frequent itemset set FIS, and mine all association rule patterns of each L_k that contain original query terms, as follows:
(7.1) Construct the set of all proper subsets of L_k;
(7.2) Take any two proper subsets q_t and E_t from that set such that q_t ∪ E_t = L_k, where Q_TL is the set of original query terms in the target language and E_t is a feature-word itemset containing no original query term, and compute the chi-square value Chis(q_t, E_t) of the itemset (q_t, E_t) by formula (4).
In formula (4), w[(q_t)] is the itemset weight of q_t in the target-language text document index, k_1 is the length of q_t, w[(E_t)] is the itemset weight of E_t in the target-language text document index, k_2 is the length of E_t, w[(q_t, E_t)] is the itemset weight of (q_t, E_t) in the target-language text document index, and k_L is the number of items in (q_t, E_t).
(7.3) If Chis(q_t, E_t) > 0, compute the feature-word association rule confidence (Weighted Confidence, WConf) WConf(E_t → q_t). If WConf(E_t → q_t) ≥ the minimum confidence threshold mc, then the association rule E_t → q_t is a strong association rule pattern and is added to the association rule pattern set AR. WConf(E_t → q_t) is computed by formula (5):
In formula (5), w[(E_t)], k_2, w[(q_t, E_t)], and k_L are defined as in formula (4).
(7.4) If each proper subset of L_k has been taken out once and only once, association rule mining for this L_k ends; another L_k is then taken from the frequent itemset set FIS, and the method returns to step (7.1) to mine its association rule patterns. Otherwise, the method returns to step (7.2) and repeats the steps in order. When every L_k in FIS has been taken out and mined, association rule pattern mining ends and the method proceeds to step (8).
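The rule extraction of step (7) can be sketched as follows. Formulas (4) and (5) are not reproduced here, so the sketch takes them as caller-supplied functions chis and wconf; these parameters, the function name rule_from_itemset, and the returned tuple layout are hypothetical. The sketch also assumes that q_t consists only of original query terms and that q_t and E_t are disjoint. Under those assumptions the split of L_k is determined by the query term set, so no enumeration of subset pairs is needed.

```python
def rule_from_itemset(lk, query_terms, chis, wconf, mc):
    """Return (E_t, q_t, confidence) for a strong rule E_t -> q_t, or None.

    lk:          frequent itemset as a tuple of terms.
    query_terms: original query terms in the target language (Q_TL).
    chis:        callable (q_t, e_t) -> chi-square value, stands in for formula (4).
    wconf:       callable (e_t, q_t) -> confidence, stands in for formula (5).
    mc:          minimum confidence threshold.
    """
    query = set(query_terms)
    q_t = tuple(t for t in lk if t in query)      # query-term part
    e_t = tuple(t for t in lk if t not in query)  # candidate expansion part

    # Both sides must be non-empty proper subsets of L_k.
    if not q_t or not e_t:
        return None
    # Step (7.3): only positively correlated pairs are considered.
    if chis(q_t, e_t) <= 0:
        return None
    conf = wconf(e_t, q_t)
    if conf < mc:
        return None
    return e_t, q_t, conf
```

Passing the two measures in as functions keeps the control flow of steps (7.2) and (7.3) testable without committing to a particular form of the formulas.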
(8) Extract the association rule antecedents E_t from the association rule pattern set AR as expansion words, and compute the expansion word weights.
The antecedent E_t of each association rule E_t → q_t in AR is taken as a post-translation expansion word, and its weight w_e is computed by formula (6):
w_e = 0.5 × max(WConf()) + 0.3 × max(Chis()) + 0.2 × max(IRe())    (6)
In formula (6), max(WConf()), max(Chis()), and max(IRe()) denote the maximum values of the association rule confidence, the chi-square value, and the association degree respectively; that is, when an expansion word appears in several association rule patterns, the maximum of each of the three measures is taken.
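Formula (6) can be sketched directly. The input layout is an assumption: each mined rule is given as a tuple (expansion_terms, wconf, chis, ire), where ire is taken to be the association degree of the frequent itemset the rule came from. The function name expansion_weights is hypothetical.

```python
def expansion_weights(rules):
    """Compute w_e for every expansion word by formula (6).

    rules: iterable of (expansion_terms, wconf, chis, ire) tuples (assumed layout).
    """
    best = {}
    for terms, wconf, chis, ire in rules:
        for t in terms:
            prev = best.get(t)
            if prev is None:
                best[t] = (wconf, chis, ire)
            else:
                # A word seen in several rules keeps the maximum of each measure.
                best[t] = (max(prev[0], wconf), max(prev[1], chis), max(prev[2], ire))

    # w_e = 0.5 * max(WConf) + 0.3 * max(Chis) + 0.2 * max(IRe)
    return {t: 0.5 * c + 0.3 * x + 0.2 * r for t, (c, x, r) in best.items()}
```

Note that the three maxima may come from three different rules, which matches the wording that the maximum of each measure is taken separately.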
(9) Combine the post-translation expansion words with the original query words into a new post-translation query, and retrieve the target-language documents again to obtain the final retrieval result documents.
(10) Translate the final retrieval result documents into the source language with the machine translation tool and return them to the user.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention proposes a cross-language retrieval method based on pattern mining and expansion that fuses the association degree and the chi-square value. The method mines frequent itemsets containing original query terms from the initial relevant document set by comparing itemset weights; fuses the itemset association degree, the chi-square value, and the confidence to extract feature-word association rule patterns containing original query terms from the frequent itemsets; takes the antecedent itemsets of rules whose consequent is an original query term itemset as expansion words; combines the post-translation expansion words with the original query words into a new post-translation query that retrieves the target-language documents again to obtain the final retrieval result documents; and translates these documents into the source language with a machine translation tool and returns them to the user. Experimental results show that the invention can improve cross-language information retrieval performance and has good application value and promotion prospects.
(2) The international standard cross-language retrieval dataset NTCIR-5 CLIR is used as the experimental corpus of the method, which is compared experimentally with the cross-language baseline retrieval and with the comparison methods. The results show that the P@20 and MAP values of the cross-language retrieval results of the method are higher than those of the cross-language baseline retrieval and of the comparison methods, and the effect is significant. This indicates that the retrieval performance of the method is better than that of the baseline retrieval and of the comparison methods, that it can improve cross-language information retrieval performance and reduce query drift and word mismatch in cross-language information retrieval, and that it has high application value and broad promotion prospects.
Brief description of the drawings
Fig. 1 is a flow diagram of the cross-language retrieval method of the invention based on pattern mining and expansion that fuses the association degree and the chi-square value.
Specific embodiments
The concepts related to the invention are introduced below:
1. Antecedent and consequent of an association rule
Let T_1 and T_2 be arbitrary text feature-term itemsets. An implication of the form T_1 → T_2 is called a text feature-word association rule, where T_1 is the rule antecedent and T_2 is the rule consequent.
2. Let DS = {d_1, d_2, …, d_n} be a text document set (Document Set, DS), where d_i (1 ≤ i ≤ n) is the i-th document in DS, d_i = {t_1, t_2, …, t_m, …, t_p}, and t_m (m = 1, 2, …, p) is a document feature term, or feature item for short, usually a character, word, or phrase. The corresponding feature-item weight set of d_i is W_i = {w_i1, w_i2, …, w_im, …, w_ip}, where w_im is the weight of the m-th feature item t_m in the i-th document d_i. T = {t_1, t_2, …, t_n} denotes the set of all feature items in DS; each subset of T is called a feature-item itemset, or itemset for short.
The difference between an item weight and an itemset weight is as follows. Suppose that counting the itemset weight w[C_k] of the k-candidate itemset C_k = (i_1, i_2, …, i_k) in the text document index yields the weights w_1, w_2, …, w_k for the items i_1, i_2, …, i_k of C_k. Then w_1, w_2, …, w_k are called item weights, and the itemset weight of C_k is w[C_k] = w_1 + w_2 + … + w_k.
Embodiment 1
As shown in Fig. 1, the cross-language retrieval method based on pattern mining and expansion that fuses the association degree and the chi-square value comprises the following steps:
(1) Translate the source-language user query into the target language with a machine translation tool, and retrieve the target-language document collection with a vector space retrieval model to obtain the top-ranked documents of the initial retrieval.
The machine translation tool may be the Microsoft Bing machine translation interface (Microsoft Translator API), the Google machine translation interface, or the like.
(2) Build the initial relevant document set by judging the relevance of the top-ranked documents of the initial retrieval.
(3) Preprocess the initial relevant document set, and build the target-language document index and the feature dictionary.
The preprocessing depends on the language. For example, if the target language is English, remove English stop words, extract English feature-word stems with the Porter stemmer (see http://tartarus.org/martin/PorterStemmer), and compute the English feature-word weights. If the target language is Chinese, remove Chinese stop words, segment the Chinese documents, extract the Chinese feature words, and compute their weights.
The invention computes the feature-word weights of the initial relevant document set by formula (1):
In formula (1), w_ij is the weight of feature word t_j in document d_i, and tf_{j,i} is the term frequency of t_j in d_i. tf_{j,i} is usually normalized by dividing the tf_{j,i} of each feature word in d_i by the maximum term frequency in d_i. idf_j is the inverse document frequency (Inverse Document Frequency).
(4) Mine the text feature-word 1-frequent itemsets L_1, as follows:
(4.1) Extract the feature words from the feature dictionary as 1-candidate itemsets C_1;
(4.2) Scan the target-language document index, and count the total number of text documents n and the itemset weight w[C_1] of each C_1;
(4.3) Compute the minimum weight support threshold MWS by formula (2):
MWS = n × ms    (2)
In formula (2), ms is the minimum support threshold and n is the total number of text documents in the target-language document index.
(4.4) If w[C_1] ≥ MWS, then C_1 is a text feature-word 1-frequent itemset L_1 and is added to the frequent itemset set FIS (Frequent ItemSet).
(5) Mine the text feature-word 2-frequent itemsets L_2, as follows:
(5.1) Self-join the text feature-word 1-frequent itemsets L_1 with the Apriori join method to obtain multiple 2-candidate itemsets C_2.
The Apriori join method is described in (Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large database [C] // Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington D C, USA, 1993: 207-216.)
(5.2) Prune the 2-candidate itemsets C_2 that contain no original query term;
(5.3) Scan the target-language document index and count the itemset weight w[C_2] of each remaining 2-candidate itemset C_2;
(5.4) If w[C_2] ≥ MWS, then C_2 is a text feature-word 2-frequent itemset L_2 and is added to the frequent itemset set FIS (Frequent ItemSet).
(6) Mine the text feature-word k-frequent itemsets L_k, k ≥ 2, as follows:
(6.1) Self-join the text feature-word (k-1)-frequent itemsets L_{k-1} with the Apriori join method to obtain multiple k-candidate itemsets C_k = (i_1, i_2, …, i_k), k ≥ 2;
(6.2) Scan the target-language document index, count the itemset weight w[C_k] of each C_k and the maximum item weight w_m in each C_k, and obtain the item i_m corresponding to w_m, where m ∈ (1, 2, …, k);
(6.3) If the 1-itemset (i_m) corresponding to the item i_m is non-frequent, or w_m < MWS, prune C_k.
(6.4) For each remaining C_k, compute its itemset association degree (ItemSet Relevancy, IRe) IRe(C_k). If w[C_k] ≥ MWS × k and IRe(C_k) ≥ minIRe, then the k-candidate itemset C_k is a text feature-word k-frequent itemset L_k and is added to the frequent itemset set FIS; otherwise, C_k is pruned.
Here minIRe is the minimum itemset association degree threshold, and IRe(C_k) is computed by formula (3):
In formula (3), w_min[(i_q)] and w_max[(i_p)] are defined as follows: for C_k = (i_1, i_2, …, i_k), the items i_1, i_2, …, i_k of the k-candidate itemset C_k taken alone as 1-itemsets are (i_1), (i_2), …, (i_k); w_min[(i_q)] and w_max[(i_p)] denote the smallest and the largest 1-itemset weight among (i_1), (i_2), …, (i_k), with q ∈ (1, 2, …, k) and p ∈ (1, 2, …, k).
(6.5) If the set of text feature-word k-frequent itemsets L_k is empty, frequent itemset mining ends and the method proceeds to step (7); otherwise, k is incremented by 1 and the method returns to step (6.1).
(7) Take any text feature-word k-frequent itemset L_k from the frequent itemset set FIS, and mine all association rule patterns of each L_k that contain original query terms, as follows:
(7.1) Construct the set of all proper subsets of L_k;
(7.2) Take any two proper subsets q_t and E_t from that set such that q_t ∪ E_t = L_k, where Q_TL is the set of original query terms in the target language and E_t is a feature-word itemset containing no original query term, and compute the chi-square (Chi-Square, Chis) value Chis(q_t, E_t) of the itemset (q_t, E_t) by formula (4).
In formula (4), w[(q_t)] is the itemset weight of q_t in the target-language text document index, k_1 is the length of q_t, w[(E_t)] is the itemset weight of E_t in the target-language text document index, k_2 is the length of E_t, w[(q_t, E_t)] is the itemset weight of (q_t, E_t) in the target-language text document index, k_L is the number of items in (q_t, E_t), and n is the total number of text documents in the target-language text document index.
(7.3) If Chis(q_t, E_t) > 0, compute the feature-word association rule confidence (Weighted Confidence, WConf) WConf(E_t → q_t). If WConf(E_t → q_t) ≥ the minimum confidence threshold mc, then the association rule E_t → q_t is a strong association rule pattern and is added to the association rule pattern set AR (Association Rule). WConf(E_t → q_t) is computed by formula (5).
In formula (5), w[(E_t)], k_2, w[(q_t, E_t)], and k_L are defined as in formula (4).
(7.4) If each proper subset of L_k has been taken out once and only once, association rule mining for this L_k ends; another L_k is then taken from the frequent itemset set FIS, and the method returns to step (7.1) to mine its association rule patterns. Otherwise, the method returns to step (7.2) and repeats the steps in order. When every L_k in FIS has been taken out and mined, association rule pattern mining ends and the method proceeds to step (8).
(8) Extract the association rule antecedents E_t from the association rule pattern set AR as expansion words, and compute the expansion word weights.
The antecedent E_t of each association rule E_t → q_t in AR is taken as a post-translation expansion word, and its weight w_e is computed by formula (6):
w_e = 0.5 × max(WConf()) + 0.3 × max(Chis()) + 0.2 × max(IRe())    (6)
In formula (6), max(WConf()), max(Chis()), and max(IRe()) denote the maximum values of the association rule confidence, the chi-square value, and the association degree respectively; that is, when an expansion word appears in several association rule patterns, the maximum of each of the three measures is taken.
(9) Combine the post-translation expansion words with the original query words into a new post-translation query, and retrieve the target-language documents again to obtain the final retrieval result documents.
(10) Translate the final retrieval result documents into the source language with the machine translation tool and return them to the user.
Pruning uses the following rules:
(1) For a k-candidate itemset C_k = (i_1, i_2, …, i_k), if its itemset weight w[C_k] < MWS × k, then C_k is non-frequent and is pruned; if its itemset association degree IRe(C_k) < minIRe, then C_k is an invalid itemset and is pruned. In summary, the invention mines only the valid frequent itemsets with w[C_k] ≥ MWS × k and IRe(C_k) ≥ minIRe, where minIRe is the minimum itemset association degree threshold.
(2) If the maximum item weight in a k-candidate itemset C_k = (i_1, i_2, …, i_k) is less than the minimum weight support threshold MWS, then C_k is non-frequent and is pruned;
(3) Let (i_m) be the 1-itemset formed by the item with the maximum item weight in a k-candidate itemset C_k = (i_1, i_2, …, i_k). If (i_m) is non-frequent, C_k is pruned.
(4) When mining reaches the candidate 2-itemsets, the candidate 2-itemsets that contain no original query term are deleted, and those that contain original query terms are kept.
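The four pruning rules can be combined into a single predicate, sketched below. The itemset association degree is passed in as a precomputed number because formula (3) is not reproduced here; the function name keep_candidate and the argument layout are hypothetical. The itemset weight is taken as the sum of the item weights, as stated in the concept definitions.

```python
def keep_candidate(item_weights, frequent_1, query_terms, mws, min_ire, ire):
    """Return True if the k-candidate itemset survives all pruning rules.

    item_weights: {item: item weight within this candidate C_k}.
    frequent_1:   set of items whose 1-itemset is frequent.
    query_terms:  original query terms in the target language.
    mws:          minimum weight support threshold MWS.
    min_ire:      minimum itemset association degree threshold minIRe.
    ire:          precomputed IRe(C_k) (formula (3) is not implemented here).
    """
    k = len(item_weights)

    # Rule (4): candidate 2-itemsets must contain an original query term.
    if k == 2 and not set(item_weights) & set(query_terms):
        return False

    # Rule (2): the maximum item weight must reach MWS.
    top_item = max(item_weights, key=item_weights.get)
    if item_weights[top_item] < mws:
        return False

    # Rule (3): the item with the maximum weight must be a frequent 1-itemset.
    if top_item not in frequent_1:
        return False

    # Rule (1): w[C_k] >= MWS * k and IRe(C_k) >= minIRe.
    if sum(item_weights.values()) < mws * k:
        return False
    return ire >= min_ire
```

The cheap checks run first, so the association degree only needs to be computed for candidates that pass the weight tests.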
Experimental design and results:
To demonstrate the effectiveness of the method of the invention, Indonesian-English cross-language information retrieval experiments were carried out with the method of the invention and with the comparison methods, and their cross-language retrieval performance was compared.
Test corpus:
The NTCIR-5 CLIR corpus, a cross-language standard dataset widely used internationally in the field of information retrieval (see http://research.nii.ac.jp/ntcir/permission/ntcir-5/perm-en-CLIR.html), is used as the test corpus of the invention. Specifically, the English document collections Mainichi Daily News 2000, Mainichi Daily News 2001, and Korea Times 2001 in the NTCIR-5 CLIR corpus are selected, giving 26224 English news documents in total as the experimental data: 6608 documents from Mainichi Daily News 2000 (m0 for short), 5547 from Mainichi Daily News 2001 (m1), and 14069 from Korea Times 2001 (k1).
The NTCIR-5 CLIR corpus provides document collections, 50 query topics, and the corresponding result sets. Each query topic has 4 types: Title, Desc, Narr, and Conc. The result sets have 2 evaluation standards: the Rigid standard (highly relevant and relevant) and the Relax standard (highly relevant, relevant, and partially relevant). The experiments of the invention use the Title and Desc query types. Title queries are short queries that briefly describe the query topic with nouns and noun phrases; Desc queries are long queries that briefly describe the query topic in sentence form.
P@20 and MAP are used as the evaluation metrics for the experimental results of the method. P@20 is the precision of the first 20 results returned for a test query. MAP is the mean average precision (Mean Average Precision, MAP), that is, the arithmetic mean of the average precision over all queries; it measures the average retrieval quality of a retrieval system over multiple queries.
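The two metrics can be computed as in the following sketch. It uses the standard definitions of precision at k and of non-interpolated average precision; the function names and the representation of a run as a ranked list plus a set of relevant document IDs are choices made for illustration.

```python
def precision_at_k(ranked, relevant, k=20):
    """Fraction of the first k returned documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Average of the precision values at the rank of each relevant document."""
    hits = 0
    total = 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Note that precision_at_k divides by k even when fewer than k documents are returned, which is the usual convention for P@20.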
Control methods:
(1) reference retrieval: across language reference retrieval method (the Indonesian-English Cross- of Indonesia-English
Language Retrieval,IECLR)。
The reference retrieval IECLR method refers to obtains Indonesian inquiry by machine translation for retrieval English document after English
The search result arrived, without carrying out various expansion techniques in retrieving.
(2) control methods 1: the Indonesia-English cross-language information retrieval method excavated based on all-weighted association.It is described
Using document, (the complete weighting pattern of Huang Mingxuan excavates across the language inquiry expansion of Indonesia's Chinese merged with relevant feedback for control methods 1
Open up the small-sized microcomputer system of, 2017,38 (8): 1783-1791.) all-weighted association digging technology to Indonesia-
Across the language initial survey user relevant feedback document sets of English excavate Feature Words correlation rule, are the association of former inquiry lexical item by regular former piece
Consequent is realized and is extended after Indonesia-English is translated across language inquiry as expansion word.Experiment parameter is: minimal confidence threshold mc
It is 0.1, minimum support threshold value ms is respectively 0.8,1.0,1.3,1.5,1.7.
(3) Comparison method 2: Indonesian-English cross-language information retrieval based on pseudo-relevance feedback expansion. Comparison method 2 obtains its Indonesian-English cross-language retrieval results with the pseudo-relevance feedback expansion method from the literature (Wu Dan, He Daqing, Wang Huilin. Cross-language query expansion based on pseudo-relevance feedback [J]. Journal of the China Society for Scientific and Technical Information, 2010, 29(2): 232-239.). Experimental method: the top 20 English documents of the Indonesian-English cross-language first-pass retrieval are taken to build the first-pass relevant document set; feature terms are extracted and their weights computed; the terms are sorted by weight in descending order and the top 20 feature terms are taken as English expansion words; the expansion words are combined with the original query terms into a new query, which is used to retrieve the English documents again and obtain the final search results.
(4) Comparison method 3: Indonesian-English cross-language information retrieval based on all-weighted positive and negative association rule mining. Comparison method 3 applies the all-weighted positive and negative association rule mining technique from the literature (Zhou Xiumei, Huang Mingxuan. All-weighted positive and negative association rule mining based on item weight changes [J]. Acta Electronica Sinica, 2015, 43(8): 1545-1554.) to the user relevance feedback document set of the Indonesian-English cross-language first-pass retrieval in order to mine feature-word association rules. The antecedents of the rules whose consequent is the original query terms are used as expansion words to realize Indonesian-English cross-language information retrieval. Experimental parameters: the minimum confidence threshold mc is 0.5, the minimum support threshold ms is 0.2, 0.25, 0.3, 0.35 and 0.4 in turn, and the minimum interest threshold mi is 0.02.
(5) Comparison method 4: Indonesian-English cross-language information retrieval based on weighted association pattern mining. Comparison method 4 obtains its Indonesian-English cross-language retrieval results with the cross-language query expansion method from the literature (Huang Mingxuan. Vietnamese-English cross-language query expansion based on weighted association pattern mining [J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(3): 307-318.). Experimental parameters: the minimum confidence threshold mc is 0.01, the minimum interest threshold mi is 0.0001, and the minimum support threshold ms is 0.007, 0.008, 0.009, 0.01 and 0.011 in turn.
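As a rough illustration of the feedback-based expansion described in item (3) above (top 20 first-pass documents, top 20 feature terms by descending weight), the following Python sketch may help. The function name `prf_expand`, the dict-based document representation, the summation of term weights across documents, and the exclusion of terms already in the query are all assumptions made for illustration; the cited paper's exact weighting scheme is not given in this text.

```python
from collections import defaultdict


def prf_expand(query_terms, ranked_docs, top_docs=20, top_terms=20):
    """Return the original query terms followed by the selected expansion terms.

    ranked_docs: list of dicts mapping term -> weight, in first-pass rank order
    (hypothetical representation of the retrieved English documents).
    """
    scores = defaultdict(float)
    # Build the first-pass relevant set from the top-ranked documents.
    for doc in ranked_docs[:top_docs]:
        for term, weight in doc.items():
            # Assumption: a term's score is the sum of its weights over the set.
            scores[term] += weight

    query_set = set(query_terms)
    # Assumption: terms already in the query are not reused as expansion terms.
    candidates = [(t, s) for t, s in scores.items() if t not in query_set]
    # Sort by descending weight; ties are broken alphabetically for determinism.
    candidates.sort(key=lambda ts: (-ts[1], ts[0]))

    expansion = [t for t, _ in candidates[:top_terms]]
    return list(query_terms) + expansion


docs = [{"bank": 1.0, "loan": 0.5}, {"loan": 0.8, "rate": 0.2}]
new_query = prf_expand(["bank"], docs, top_docs=20, top_terms=1)
```

The new query would then be submitted to the same retrieval engine for the second pass.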
The experimental method and results are as follows:
The source programs of the method of the present invention and of the comparison methods were run. The Title and Desc queries of the 50 Indonesian query topics were first translated into English queries by a machine translation system, and English documents were then retrieved on the 3 data sets m0, m1 and k1, thereby realizing Indonesian-English cross-language information retrieval. In the experiments, user relevance feedback was performed on the top 50 English documents of the Indonesian-English cross-language first-pass results to obtain the first-pass user relevance feedback documents (for simplicity, in the experiments of the present invention, the documents among the top 50 first-pass documents that appear in the known relevance judgments are treated as the first-pass relevant documents). The method of the present invention mines frequent itemsets containing original query terms from the first-pass user relevance feedback document set by comparing itemset weights; it fuses the itemset degree of association, the chi-square value and the confidence to extract, from the frequent itemsets, feature-word association rule patterns containing original query terms; it takes as expansion words the antecedent itemsets of the rules whose consequent is an original query term itemset; and it combines the expansion words with the post-translation original query terms into a new post-translation query, which is used to retrieve the English documents again and obtain the final search result documents. The resulting P@20 and MAP values of the Indonesian-English cross-language retrieval for the method of the present invention and the comparison methods are shown in Tables 1 to 4. Mining proceeded up to 3-itemsets. The experimental parameters of the method of the present invention were: minimum support threshold ms = 0.5; minimum confidence threshold mc = 0.5, 0.6, 0.7, 0.8 and 0.9 in turn; minimum itemset degree of association threshold minIRe = 0.4.
Table 1: P@20 values of the search results of the method of the present invention compared with the comparison methods (Title queries)
Table 2: MAP values of the search results of the method of the present invention compared with the comparison methods (Title queries)
Table 3: P@20 values of the search results of the method of the present invention compared with the comparison methods (Desc queries)
Table 4: MAP values of the search results of the method of the present invention compared with the comparison methods (Desc queries)
Tables 1 to 4 show that the P@20 and MAP values of the cross-language retrieval results of the method of the present invention are higher than those of the cross-language baseline retrieval and of the 4 comparison methods, and that the improvement is significant. The experimental results show that the method of the present invention is effective and does improve cross-language information retrieval performance; it has high application value and broad prospects for adoption.
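For readers unfamiliar with the two measures reported in Tables 1 to 4, the following Python sketch shows the standard definitions of P@20 and MAP. The function names and the list/set representation of a ranked run and its relevance judgments are assumptions for illustration; the evaluation program actually used in the experiments is not described in this text.

```python
def precision_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document occurs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant_ids)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


run_a = (["d1", "d2", "d3", "d4"], {"d1", "d3"})
run_b = (["a", "b"], {"b"})
p_at_2 = precision_at_k(run_a[0], run_a[1], k=2)
map_value = mean_average_precision([run_a, run_b])
```

In the experiments described above, each of the 50 query topics would contribute one (ranking, judgments) pair per data set.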
Claims (2)
1. A cross-language retrieval method based on pattern mining and expansion fusing the degree of association and the chi-square value, characterized by comprising the following steps:
(1) the source-language user query is translated into the target language by a machine translation tool, and the target-language document set is retrieved with a vector space retrieval model to obtain the top-ranked first-pass documents;
(2) a first-pass relevant document set is constructed by making relevance judgments on the top-ranked first-pass documents;
(3) the first-pass relevant document set is preprocessed, and a target-document index library and a feature dictionary are constructed;
(4) the text feature-word frequent 1-itemsets L_1 are mined, with the following specific steps:
(4.1) feature words are extracted from the feature dictionary as candidate 1-itemsets C_1;
(4.2) the target-document index library is scanned to count the total number of text documents n and the itemset weight w[C_1] of each C_1;
(4.3) the minimum weighted support threshold MWS is calculated as shown in formula (2):
MWS = n × ms (2)
In formula (2), ms is the minimum support threshold and n is the total number of text documents in the target-document index library;
(4.4) if w[C_1] ≥ MWS, then C_1 is a text feature-word frequent 1-itemset L_1 and is added to the frequent itemset set FIS;
(5) the text feature-word frequent 2-itemsets L_2 are mined, with the following specific steps:
(5.1) the text feature-word frequent 1-itemsets L_1 are self-joined with the Apriori join method to obtain multiple candidate 2-itemsets C_2;
(5.2) candidate 2-itemsets C_2 that contain no original query term are pruned;
(5.3) the target-document index library is scanned to count the itemset weight w[C_2] of each remaining candidate 2-itemset C_2;
(5.4) if w[C_2] ≥ MWS, then C_2 is a text feature-word frequent 2-itemset L_2 and is added to the frequent itemset set FIS;
(6) the text feature-word frequent k-itemsets L_k, k ≥ 2, are mined, with the following specific steps:
(6.1) the text feature-word frequent (k-1)-itemsets L_{k-1} are self-joined with the Apriori join method to obtain multiple candidate k-itemsets C_k = (i_1, i_2, …, i_k), k ≥ 2;
(6.2) the target-language text document index library is scanned to count, for each C_k, its itemset weight w[C_k] and the maximum item weight w_m among its items, and to obtain the item i_m corresponding to that maximum item weight w_m, where m ∈ (1, 2, …, k);
(6.3) if the 1-itemset (i_m) corresponding to the item i_m is not frequent, or w_m < MWS, then that C_k is pruned;
(6.4) for each remaining C_k, the itemset degree of association IRe(C_k) is calculated; if w[C_k] ≥ MWS × k and IRe(C_k) ≥ minIRe, then that C_k is a text feature-word frequent k-itemset L_k and is added to the frequent itemset set FIS; minIRe is the minimum itemset degree of association threshold; IRe(C_k) is calculated as shown in formula (3);
In formula (3), w_min[(i_q)] and w_max[(i_p)] have the following meaning: for C_k = (i_1, i_2, …, i_k), the items i_1, i_2, …, i_k of the candidate k-itemset C_k, each taken alone as a 1-itemset, are (i_1), (i_2), …, (i_k); w_min[(i_q)] and w_max[(i_p)] denote the smallest and the largest 1-itemset weight among the 1-itemsets (i_1), (i_2), …, (i_k), respectively; q ∈ (1, 2, …, k), p ∈ (1, 2, …, k);
(6.5) if the text feature-word frequent k-itemset L_k is the empty set, feature-word frequent itemset mining ends and the method proceeds to step (7) below; otherwise k is incremented by 1 and the method returns to step (6.1) to continue the loop;
(7) any text feature-word frequent k-itemset L_k is taken from the frequent itemset set FIS, and all association rule patterns of each L_k that contain original query terms are mined according to the following steps:
(7.1) the set of all proper-subset itemsets of L_k is constructed;
(7.2) any two proper-subset itemsets q_t and E_t are taken from the set of proper-subset itemsets such that q_t ∪ E_t = L_k, where Q_TL is the set of original query terms in the target language and E_t is a feature-word itemset containing no original query term; the chi-square value Chis(q_t, E_t) of the itemset (q_t, E_t) is calculated as shown in formula (4);
In formula (4), w[(q_t)] is the itemset weight of the itemset q_t in the target-language text document index library, k_1 is the length of the itemset q_t, w[(E_t)] is the itemset weight of the itemset E_t in the target-language text document index library, k_2 is the length of the itemset E_t, w[(q_t, E_t)] is the itemset weight of the itemset (q_t, E_t) in the target-language text document index library, k_L is the number of items in the itemset (q_t, E_t), and n is the total number of text documents in the target-language text document index library;
(7.3) if Chis(q_t, E_t) > 0, the feature-word association rule confidence WConf(E_t→q_t) is calculated; if WConf(E_t→q_t) ≥ the minimum confidence threshold mc, then the association rule E_t→q_t is a strong association rule pattern and is added to the association rule pattern set AR; WConf(E_t→q_t) is calculated as shown in formula (5):
In formula (5), w[(E_t)], k_2, w[(q_t, E_t)] and k_L are defined as in formula (4);
(7.4) if and only if every proper-subset itemset of L_k has been taken out exactly once, the mining of feature-word association rule patterns in this L_k ends; another L_k is then taken from the frequent itemset set FIS and the method returns to step (7.1) to mine the association rule patterns of that L_k; otherwise, the method returns to step (7.2) and executes the steps in order again; if every L_k in the frequent itemset set FIS has been taken out for association rule pattern mining, association rule pattern mining ends and the method proceeds to step (8) below;
(8) the antecedent E_t of each association rule E_t→q_t is extracted from the association rule pattern set AR as a post-translation expansion word, and the expansion word weight w_e is calculated as shown in formula (6):
w_e = 0.5 × max(WConf()) + 0.3 × max(Chis()) + 0.2 × max(IRe()) (6)
In formula (6), max(WConf()), max(Chis()) and max(IRe()) denote the maximum values of the association rule confidence, the chi-square value and the degree of association, respectively; that is, when an expansion word appears in several association rule patterns, the maximum of each of these 3 measures is taken;
(9) the post-translation expansion words are combined with the original query terms into a new post-translation query, which is used to retrieve the target-language documents again and obtain the final search result documents;
(10) the final search result documents are translated into the source language by the machine translation tool and returned to the user.
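Steps (4), (5) and (8) of claim 1 can be illustrated with a short Python sketch: the MWS threshold of formula (2), the pruning of candidate 2-itemsets that contain no original query term, and the expansion word weight of formula (6). This is a minimal sketch under stated assumptions, not the patented implementation. In particular, the claim does not define how an itemset weight w[C] is computed, so `itemset_weight` assumes it is the sum of the member terms' weights over the documents containing all members; the function names and the dict-based index are likewise hypothetical. The measures IRe, Chis and WConf are passed in as precomputed numbers because formulas (3) to (5) are not reproduced in this text.

```python
from itertools import combinations


def itemset_weight(docs, itemset):
    # Assumption: sum the member terms' weights over documents containing all members.
    return sum(sum(doc[t] for t in itemset)
               for doc in docs if all(t in doc for t in itemset))


def mine_l1_l2(docs, query_terms, ms):
    """docs: list of dicts term -> weight (hypothetical index representation)."""
    n = len(docs)
    mws = n * ms  # formula (2): MWS = n x ms
    vocab = sorted({t for doc in docs for t in doc})
    # Step (4): frequent 1-itemsets.
    l1 = [(t,) for t in vocab if itemset_weight(docs, (t,)) >= mws]
    query = set(query_terms)
    l2 = []
    # Step (5.1): self-join of L_1 into candidate 2-itemsets.
    for cand in combinations([s[0] for s in l1], 2):
        # Step (5.2): prune candidates that contain no original query term.
        if not query.intersection(cand):
            continue
        # Steps (5.3)-(5.4): keep candidates whose weight reaches MWS.
        if itemset_weight(docs, cand) >= mws:
            l2.append(cand)
    return mws, l1, l2


def expansion_weights(rules):
    """rules: list of (antecedent_terms, wconf, chis, ire) with precomputed measures."""
    best = {}
    for antecedent, wconf, chis, ire in rules:
        for word in antecedent:
            c, x, r = best.get(word, (0.0, 0.0, 0.0))
            # A word occurring in several rules keeps the maximum of each measure.
            best[word] = (max(c, wconf), max(x, chis), max(r, ire))
    # Formula (6): fixed 0.5 / 0.3 / 0.2 fusion of the three maxima.
    return {w: 0.5 * c + 0.3 * x + 0.2 * r for w, (c, x, r) in best.items()}


docs = [
    {"bank": 0.6, "loan": 0.5},
    {"bank": 0.7, "loan": 0.4, "rate": 0.1},
    {"rate": 0.2, "tree": 0.9},
]
mws, l1, l2 = mine_l1_l2(docs, ["bank"], ms=0.25)
weights = expansion_weights([(("loan",), 0.8, 2.0, 0.5),
                             (("loan", "rate"), 0.6, 3.0, 0.4)])
```

In this toy example the candidate (loan, tree) is discarded before any weight is counted because it contains no query term, which is the source of the pruning savings the method relies on.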
2. The cross-language retrieval method based on pattern mining and expansion fusing the degree of association and the chi-square value according to claim 1, characterized in that the preprocessing of the first-pass relevant document set in step (3) is specifically: stop words are removed, feature words are extracted, and the feature word weights are calculated as shown in formula (1):
In formula (1), w_ij denotes the weight of feature word t_j in document d_i, and tf_{j,i} denotes the term frequency of feature word t_j in document d_i; tf_{j,i} is generally normalized, where normalization means dividing the tf_{j,i} of each feature word in document d_i by the maximum term frequency of document d_i; idf_j is the inverse document frequency.
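The weighting described in claim 2 can be sketched in Python as follows. Formula (1) itself is not reproduced in this text, so the sketch assumes the common form w_ij = (tf_{j,i} / max tf in d_i) × idf_j with idf_j = ln(N / df_j); both the product form and the logarithmic idf are assumptions, as are the function name and the token-list input.

```python
import math


def feature_weights(docs):
    """docs: list of token lists with stop words already removed (hypothetical input)."""
    n = len(docs)
    # Document frequency of each feature word.
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    weights = []
    for tokens in docs:
        tf = {}
        for term in tokens:
            tf[term] = tf.get(term, 0) + 1
        # Normalization from claim 2: divide by the document's maximum term frequency.
        max_tf = max(tf.values(), default=1)
        # Assumption: idf_j = ln(N / df_j), multiplied by the normalized tf.
        weights.append({term: (count / max_tf) * math.log(n / df[term])
                        for term, count in tf.items()})
    return weights


w = feature_weights([["bank", "bank", "loan"], ["loan", "rate"]])
```

Under this assumed idf, a word that occurs in every document (here "loan") receives weight 0.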
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811646512.XA CN109739952A (en) | 2018-12-30 | 2018-12-30 | Merge the mode excavation of the degree of association and chi-square value and the cross-language retrieval method of extension |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109739952A true CN109739952A (en) | 2019-05-10 |
Family
ID=66362826
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811646512.XA Withdrawn CN109739952A (en) | 2018-12-30 | 2018-12-30 | Merge the mode excavation of the degree of association and chi-square value and the cross-language retrieval method of extension |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109739952A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111897928A (en) * | 2020-08-04 | 2020-11-06 | 广西财经学院 | Chinese query expansion method for embedding expansion words into query words and counting expansion word union |
CN111897925A (en) * | 2020-08-04 | 2020-11-06 | 广西财经学院 | Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111897928A (en) * | 2020-08-04 | 2020-11-06 | 广西财经学院 | Chinese query expansion method for embedding expansion words into query words and counting expansion word union |
CN111897925A (en) * | 2020-08-04 | 2020-11-06 | 广西财经学院 | Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning |
CN111897925B (en) * | 2020-08-04 | 2022-08-26 | 广西财经学院 | Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | | Application publication date: 20190510 |