CN106484781A

CN106484781A - Indonesian-Chinese cross-language retrieval method and system fusing association patterns and user feedback

Info

Publication number: CN106484781A
Application number: CN201610827858.4A
Authority: CN
Inventors: 黄名选
Original assignee: Guangxi University of Finance and Economics
Current assignee: Guangxi University of Finance and Economics
Priority date: 2016-09-18
Filing date: 2016-09-18
Publication date: 2017-03-08
Anticipated expiration: 2036-09-18
Also published as: CN106484781B

Abstract

The invention discloses an Indonesian-Chinese cross-language retrieval method and system that fuse association patterns and user feedback. A machine translation module translates the Indonesian user query into a Chinese query, which is submitted to a search engine module to obtain an initial retrieval result document set. A user click-behavior relevance feedback extraction module derives the user-feedback relevant document set from these initial results, and a document preprocessing module preprocesses it into the initial-retrieval relevant document database. A fully weighted association rule mining module is then called to build the fully weighted association rule library, and a cross-language query expansion term generation module builds the expansion term library from it. A cross-language query expansion module combines the expansion terms with the original query and resubmits the new query to the search engine module to obtain the final Chinese result documents. A final result display module submits these documents to the machine translation module, which translates them into Indonesian before they are returned to the user. The invention improves cross-language retrieval performance and has good practical application value and promotion prospects.

Description

Indonesian-Chinese cross-language retrieval method and system fusing association patterns and user feedback

Technical field

The invention belongs to the field of document information retrieval. Specifically, it is an Indonesian-Chinese cross-language retrieval method and system fusing association patterns and user feedback, suitable for fields such as cross-language text information retrieval in which Indonesian queries are used to retrieve Chinese documents.

Background art

Cross-language information retrieval is a technology that uses a query in one language to retrieve information resources in other languages. Indonesian-Chinese cross-language information retrieval addresses the problem of retrieving Chinese documents with Indonesian queries; the Indonesian in which the query is expressed is called the source language, and the Chinese of the retrieved documents is called the target language. As exchanges between China and the ASEAN countries grow ever closer, research on cross-language information retrieval methods for ASEAN languages has become urgent and important.

Scholars around the world have studied cross-language information retrieval methods and systems in depth from different angles and directions and have achieved abundant results. However, the problems facing current cross-language information retrieval research have not been completely solved. One pressing and widely discussed problem is the serious query topic drift that arises during cross-language retrieval, together with a word mismatch problem even more severe than in monolingual retrieval. These problems often make cross-language retrieval perform worse than monolingual retrieval. In response, cross-language information retrieval based on query expansion has received growing attention in recent years. This research mainly concentrates on relevance feedback (Parton K, Gao J. Combining Signals for Cross-Lingual Relevance Feedback [C]. Proceedings of the 8th Asia Information Retrieval Societies Conference (AIRS 2012), Tianjin, China. Springer-Verlag Berlin Heidelberg 2012, LNCS 7675, Information Retrieval Technology. 2012: 356-365; Lee C J, Croft W B. Cross-Language Pseudo-Relevance Feedback Techniques for Informal Text [C]. Proceedings of the 36th European Conference on IR Research (ECIR 2014), Amsterdam, The Netherlands. Advances in Information Retrieval. Springer International Publishing, 2014: 260-272), latent semantics (Bi Jianting, Su Yidan. Cross-language query expansion method based on latent semantic analysis [J]. Computer Engineering, 2009, 35(10): 49-53; Ning Jian, Lin Hongfei. Cross-language retrieval based on improved latent semantic analysis [J]. Journal of Chinese Information Processing, 2010, 24(3): 105-111), and language models and topic models (Ganguly Debasis, Leveling Johannes, Jones Gareth J.F. Cross-lingual topical relevance models [C]. In: 24th International Conference on Computational Linguistics (COLING 2012), 2012; Wang Xuwen, Zhang Qiang, Wang Xiaojie, et al. LDA based pseudo relevance feedback for cross language information retrieval [C]. IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS 2012). Hangzhou: IEEE, 2012: 1993-1998; Xuwen Wang, Qiang Zhang, Xiaojie Wang, et al. Cross-lingual Pseudo Relevance Feedback Based on Weak Relevant Topic Alignment. Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, PACLIC 29, Shanghai, China, 2015: 529-534). The language pairs in this research are dominated by English; most of it studies cross-language retrieval between English and other languages.

Since Nanning, China, became the permanent host city of the China-ASEAN Expo, political, economic, and cultural contacts between China and the ASEAN countries have become more frequent and closer. Research on cross-language information retrieval and cross-language information services for ASEAN languages has therefore become more urgent, and its importance is increasingly prominent.

Summary of the invention

The object of the invention is to address the above problems in the prior art by jointly applying fully weighted association rule mining and user relevance feedback to Indonesian-Chinese cross-language information retrieval. It provides an Indonesian-Chinese cross-language retrieval method and system fusing association patterns and user feedback that improve Indonesian-Chinese cross-language retrieval performance, with a more pronounced effect for long queries.

To achieve the above object, the invention adopts the following technical solution:

An Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback comprises the following steps:

(1) The Indonesian user query is translated into a Chinese query by the machine translation module and submitted to the search engine for an initial retrieval on the Internet, yielding the initial retrieval result document set;

(2) The top r Chinese documents of the cross-language initial retrieval result document set are extracted and submitted to the user;

(3) The user judges the relevance of these Chinese documents, which yields the user-feedback relevant document set; the total number of documents in this set is denoted n;

(4) The user-feedback relevant document set is preprocessed, namely by Chinese word segmentation, stop-word removal, feature-term weight calculation, and feature-term extraction, to build the initial-retrieval relevant document database;

(5) The initial-retrieval relevant document database is scanned to mine the fully weighted feature-term candidate 1-itemsets $C_1$. The weight $w(C_1)$ of $C_1$ is calculated, and the maximum weight $\mathrm{maxCw}_i(!C_1)$ of the items other than $C_1$ and the support count $n_{c1}$ of $C_1$ are counted. With ms denoting the minimum support threshold, the value of KIWT(1,2) is calculated as: $\mathrm{KIWT}(1,2) = n \times 1 \times ms - n_{c1} \times \mathrm{maxCw}_i(!C_1)$;

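(6) The support $\mathrm{FTISup}(C_1)$ of $C_1$ is calculated. If $\mathrm{FTISup}(C_1) \ge ms$, the frequent 1-itemset $L_1$ is mined from the candidate 1-itemsets $C_1$ and added to the fully weighted feature-term frequent itemset collection L. $\mathrm{FTISup}(C_1)$ is calculated as: $\mathrm{FTISup}(C_1) = \dfrac{w(C_1)}{n \times k}$, with $k = 1$;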

(7) The k-itemsets are mined, where $k \ge 2$, through steps (7.1) to (7.7):

(7.1) The weight $w(C_{k-1})$ of each candidate (k-1)-itemset $C_{k-1}$ is compared with the value of KIWT(k-1, k), and every candidate $C_{k-1}$ with $w(C_{k-1}) < \mathrm{KIWT}(k-1, k)$ is pruned;

(7.2) The remaining candidate (k-1)-itemsets $C_{k-1}$ are joined by the Apriori join to obtain $C_k$;

(7.3) When k = 2, candidate 2-itemsets that contain no query term are pruned;

(7.4) The initial-retrieval relevant document database is scanned; the maximum weight $\mathrm{maxCw}_i(!C_k)$ of the items other than $C_k$ and the support count $n_{ck}$ of $C_k$ are counted, and the weight $w(C_k)$ of $C_k$ and the value of KIWT(k-1, k) are calculated, where $\mathrm{KIWT}(k-1, k) = n \times k \times ms - n_{ck} \times \mathrm{maxCw}_i(!C_k)$;

(7.5) Candidates $C_k$ whose support count $n_{ck}$ is 0 are pruned;

(7.6) For each remaining candidate k-itemset $C_k$, the support $\mathrm{FTISup}(C_k)$ is calculated. If $\mathrm{FTISup}(C_k) \ge ms$, the frequent k-itemset $L_k$ is mined from the candidate k-itemsets $C_k$ and added to the fully weighted feature-term frequent itemset collection L. $\mathrm{FTISup}(C_k)$ is calculated as: $\mathrm{FTISup}(C_k) = \dfrac{w(C_k)}{n \times k}$;

(7.7) If k exceeds the candidate itemset length threshold or the set of candidate k-itemsets is empty, mining ends; otherwise, steps (7.1) to (7.6) are repeated;

(8) Fully weighted feature-term association rules containing query terms are mined from the fully weighted feature-term frequent itemset collection L to build the fully weighted association rule library;

(9) Cross-language expansion terms related to the original query are extracted from the fully weighted association rule library to build the expansion term library;

(10) The original query and the expansion terms are combined into a new query, which is submitted to the search engine for a second retrieval to obtain the final Chinese result documents;

(11) The final Chinese result documents are submitted to the machine translation module and translated into Indonesian documents; both the final Chinese result documents and their Indonesian translations are returned to the user.
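As an illustration only, steps (1) to (11) can be read as a chain of components. The sketch below assumes every component is injected as a plain callable; the function name `cross_language_retrieve`, its parameters, and the space-joined query format are hypothetical and are not taken from the patent.

```python
from typing import Callable, List, Tuple


def cross_language_retrieve(
    indonesian_query: str,
    translate: Callable[[str, str], str],            # hypothetical: (text, target_lang) -> text
    search: Callable[[str], List[str]],              # hypothetical: query -> ranked documents
    judge_relevant: Callable[[List[str]], List[str]],
    mine_expansion_terms: Callable[[List[str], str], List[str]],
    top_r: int = 100,
) -> List[Tuple[str, str]]:
    """Chain steps (1)-(11); returns (Chinese document, Indonesian translation) pairs."""
    zh_query = translate(indonesian_query, "zh")             # step (1): translate the query
    initial = search(zh_query)[:top_r]                       # steps (1)-(2): top r initial results
    feedback = judge_relevant(initial)                       # step (3): user relevance judgement
    expansion = mine_expansion_terms(feedback, zh_query)     # steps (4)-(9): mining and expansion terms
    new_query = " ".join([zh_query, *expansion])             # step (10): combine (format is assumed)
    final_docs = search(new_query)                           # step (10): second retrieval
    return [(doc, translate(doc, "id")) for doc in final_docs]  # step (11): translate back
```

Passing the components as parameters keeps the sketch independent of any particular translation service or search engine.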

The feature-term weights described in step (4) are calculated with the tf-idf method. In its formula, $tf_{m,n}$ denotes the number of occurrences of feature term $t_m$ in document $d_n$, $df_m$ denotes the number of documents containing feature term $t_m$, and N denotes the total number of documents in the document collection.
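A minimal sketch of a tf-idf weighting over tokenised documents follows. The exact tf-idf variant used by the patent is not reproduced in this text, so the classic form tf × log(N / df) is an assumption, and the function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} map per document.

    Assumed variant: weight = tf * log(N / df), where tf is the raw count
    of the term in the document, df the number of documents containing
    the term, and N the number of documents.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights
```

With this variant a term that occurs in every document receives weight 0; a smoothed variant would avoid that, at the cost of departing further from the symbols named above.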

The method of step (8) comprises steps (8.1) to (8.4):

(8.1) A fully weighted frequent i-itemset $tlL_i$ is taken from the fully weighted feature-term frequent itemset collection L, and all proper subsets of $tlL_i$ are found;

(8.2) Any two proper subsets $tlI_1$ and $tlI_2$ are taken from the proper-subset collection of $tlL_i$. When $tlI_1 \cap tlI_2 = \emptyset$ and $tlI_1 \cup tlI_2 = tlL_i$: if $\mathrm{FTARConf}(tlI_1 \to tlI_2) \ge mc$, the fully weighted feature-term strong association rule $tlI_1 \to tlI_2$ is mined; if $\mathrm{FTARConf}(tlI_2 \to tlI_1) \ge mc$, the fully weighted feature-term strong association rule $tlI_2 \to tlI_1$ is mined. Here mc is the minimum confidence threshold, $tlI_1$ and $tlI_2$ are fully weighted feature-term frequent itemsets that are proper subsets of $tlL_i$, and $\mathrm{FTARConf}(tlI_1 \to tlI_2)$ is the confidence of the fully weighted feature-term association rule $tlI_1 \to tlI_2$, calculated as:

$$\mathrm{FTARConf}(tlI_1 \to tlI_2) = \frac{\mathrm{FTISup}(tlL_i)}{\mathrm{FTISup}(tlI_1)}$$

where $\mathrm{FTISup}(tlL_i)$ is the support of the fully weighted frequent itemset $tlL_i$ and $\mathrm{FTISup}(tlI_1)$ is the support of the fully weighted frequent itemset $tlI_1$;

(8.3) Step (8.2) is repeated until every proper subset in the proper-subset collection of the fully weighted frequent i-itemset $tlL_i$ has been taken out exactly once; the method then proceeds to step (8.4);

(8.4) Steps (8.1) to (8.3) are repeated until every itemset in the fully weighted feature-term frequent itemset collection L has been taken out exactly once, whereupon mining ends.
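Steps (8.1) to (8.4) can be sketched as a loop over frequent itemsets and their proper subsets. This is a simplified illustration, assuming the supports have already been computed and stored in a dictionary keyed by `frozenset`; the name `mine_rules` is hypothetical, and the filter that keeps only rules containing query terms is left out for brevity.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]  # (antecedent, consequent, confidence)


def mine_rules(support: Dict[FrozenSet[str], float], mc: float) -> List[Rule]:
    """Emit every rule A -> B with A, B disjoint, A | B a frequent itemset, conf >= mc."""
    rules: List[Rule] = []
    for itemset, sup in support.items():            # step (8.1): take each frequent itemset
        for size in range(1, len(itemset)):         # proper, non-empty antecedents only
            for subset in combinations(sorted(itemset), size):
                antecedent = frozenset(subset)
                if antecedent not in support:       # antecedent support unknown: skip
                    continue
                confidence = sup / support[antecedent]   # FTISup(itemset) / FTISup(antecedent)
                if confidence >= mc:                # step (8.2): keep strong rules
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules
```

Because each antecedent determines its consequent as the set difference, every pair of disjoint proper subsets whose union is the itemset is visited exactly once in each direction.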

A retrieval system applying the above Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback comprises the following 4 modules and 3 databases:

Machine translation module: this module uses the Bing machine translation interface to translate the Indonesian user query into a Chinese query, and to translate the final Chinese result documents into Indonesian documents that are submitted to the user;

Search engine module: this module is a search engine that retrieves on the Internet with the translated Chinese query and obtains the cross-language initial retrieval result document set;

Fully weighted association pattern mining and user relevance feedback module: submits the top r documents of the cross-language initial retrieval result set to the user, who judges their relevance and thereby determines the initial-retrieval relevant document database; it then applies fully weighted association rule mining to this database to mine expansion terms related to the query, implements cross-language query expansion, and retrieves again with the combination of expansion terms and original query to obtain the final Chinese result documents;

Final result display module: submits the final Chinese result documents to the machine translation module for translation into Indonesian documents, and returns both the final Chinese result documents and their Indonesian translations to the user;

Initial-retrieval relevant document database;

Fully weighted association rule library;

Expansion term library.

The above fully weighted association pattern mining and user relevance feedback module comprises the following 5 modules:

User click-behavior relevance feedback extraction module: captures the document download behavior produced while the user browses the initial retrieval result document set, and extracts the initial-retrieval documents the user downloaded to build the user-feedback relevant document set;

Document preprocessing module: preprocesses the user-feedback relevant document set by Chinese word segmentation, stop-word removal, feature-term weight calculation, and feature-term extraction, and builds the initial-retrieval relevant document database;

Fully weighted association rule mining module: performs fully weighted association rule mining on the initial-retrieval relevant document database, mines the fully weighted feature-term frequent itemsets and association rule patterns that contain original query terms, and builds the fully weighted association rule library;

Cross-language query expansion term generation module: extracts expansion terms related to the original query from the fully weighted association rule library and builds the expansion term library;

Cross-language query expansion module: extracts Chinese expansion terms from the expansion term library, combines them with the original query into a new query, and resubmits it to the search engine for retrieval on the Internet to obtain the final Chinese result documents.

Compared with the prior art, the advantages of the invention are:

(1) The invention jointly applies fully weighted association rule mining and user relevance feedback to Indonesian-Chinese cross-language information retrieval, and proposes an Indonesian-Chinese cross-language information retrieval method and system that fuse user click and download behavior with fully weighted association pattern mining. Compared with the monolingual Chinese text retrieval baseline MB, the Indonesian-Chinese cross-language retrieval baseline CLB, and the traditional pseudo-relevance-feedback-based cross-language retrieval method CLR_PRF, the retrieval performance of the inventive method is greatly improved. The experimental results show that the invention obtains good retrieval results: all of its metric values are higher than those of the baseline CLB and the CLR_PRF algorithm, description-type queries retrieve better than title-type queries, and the MAP value of the retrieval results shows the largest improvement.

(2) The experimental results show that the proposed Indonesian-Chinese cross-language information retrieval method and system fusing fully weighted association pattern mining and user relevance feedback are effective and improve cross-language information retrieval performance. The main reasons are as follows. In cross-language information retrieval, the query translation result strongly affects the retrieval result, so the quality of the cross-language initial retrieval results is often lower than that of monolingual initial retrieval; that is, query topic drift occurs. Applying user click behavior together with fully weighted association pattern mining to the Indonesian-Chinese cross-language retrieval model yields feedback information related to the original query. Fully weighted association rule mining then produces expansion terms related to the original query for cross-language query expansion, which avoids the serious topic drift present in cross-language retrieval and improves Indonesian-Chinese cross-language retrieval performance.

Brief description of the drawings

Fig. 1 is a block diagram of the Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to the invention.

Fig. 2 is an overall flow chart of the Indonesian-Chinese cross-language retrieval system fusing association patterns and user feedback according to the invention.

Fig. 3 is a structural block diagram of the Indonesian-Chinese cross-language retrieval system fusing association patterns and user feedback according to the invention.

Fig. 4 is a structural block diagram of the fully weighted association pattern mining and user relevance feedback module according to the invention.

Detailed description of the embodiments

The technical solution of the invention is described in further detail below, without limitation, with reference to the embodiments and the accompanying drawings.

1. To better explain the technical solution of the invention, the concepts involved are introduced as follows:

Assume that the target-language (Target Language, TL) initial-retrieval relevant document set obtained after cross-language initial retrieval and user relevance feedback is $TLdoc = \{tld_1, tld_2, \ldots, tld_n\}$, where $tld_i$ ($1 \le i \le n$) denotes the i-th document in TLdoc, and $tld_i = \{t_1, t_2, \ldots, t_m, \ldots, t_p\}$, where $t_m$ ($m = 1, 2, \ldots, p$) is called a target-language feature-term item (Feature-term Item, FTI), or feature item for short, and generally consists of a character, word, or phrase. The corresponding feature-item weight set of $tld_i$ is $W_i = \{w_{i1}, w_{i2}, \ldots, w_{im}, \ldots, w_{ip}\}$, where $w_{im}$ is the weight of the m-th feature item $t_m$ in the i-th document $tld_i$. Let $tlI = \{t_1, t_2, \ldots, t_k\}$ denote the set of all feature items in TLdoc; then a subset Y of tlI is called a feature-term itemset (Feature-term Itemset) in TLdoc, or itemset Y for short.

For itemsets $tlI_1$ and $tlI_2$ with $tlI_1 \subset tlI$, $tlI_2 \subset tlI$, and $tlI_1 \cap tlI_2 = \emptyset$, the following basic concepts are given on the basis of fully weighted association pattern mining theory (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining. Journal of Software, Vol.20, No.7, July 2009, pp.1854-1865).

Definition 1. The fully weighted support (Feature-term Itemset Support, FTISup) of a feature-term itemset $I$ ($I = (tlI_1, tlI_2)$) is calculated as shown in formula (1).

$$\mathrm{FTISup}(I) = \frac{w_I}{n \times k} \qquad (1)$$

where $w_I$ is the sum of the weights of itemset I over the documents in TLdoc, k is the length of itemset I (its number of items), and n is the total number of documents in the initial-retrieval relevant document set TLdoc.
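As a hedged numeric illustration of the support definition, the sketch below assumes the support is the itemset's total weight divided by n × k, and that an itemset's weight in a document is the sum of its item weights, counted only for documents that contain every item of the itemset. Both the aggregation rule and the name `ftisup` are assumptions made for this example.

```python
from typing import Dict, FrozenSet, List


def ftisup(itemset: FrozenSet[str], doc_weights: List[Dict[str, float]]) -> float:
    """Fully weighted support, assumed here to be w_I / (n * k)."""
    n = len(doc_weights)          # total number of feedback documents
    k = len(itemset)              # number of items in the itemset
    # Sum the item weights over the documents that contain the whole itemset.
    w_total = sum(
        sum(doc[t] for t in itemset)
        for doc in doc_weights
        if all(t in doc for t in itemset)
    )
    return w_total / (n * k)


docs = [{"a": 0.5, "b": 0.5}, {"a": 1.0}, {"b": 0.25, "c": 1.0}]
support_a = ftisup(frozenset({"a"}), docs)        # (0.5 + 1.0) / (3 * 1)
support_ab = ftisup(frozenset({"a", "b"}), docs)  # (0.5 + 0.5) / (3 * 2)
```

Dividing by k keeps supports of itemsets of different lengths on a comparable scale, which is why the threshold test can use a single ms for all k.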

Definition 2. The fully weighted confidence (Feature-term Association Rule Confidence, FTARConf) of the inter-term association rule $tlI_1 \to tlI_2$ is as shown in formula (2).

$$\mathrm{FTARConf}(tlI_1 \to tlI_2) = \frac{\mathrm{FTISup}(tlI_1, tlI_2)}{\mathrm{FTISup}(tlI_1)} \qquad (2)$$

where $\mathrm{FTISup}(tlI_1, tlI_2)$ is the fully weighted support of the itemset $(tlI_1, tlI_2)$.

Definition 3. Let the minimum support threshold be ms and the minimum confidence threshold be mc. If $\mathrm{FTISup}(tlI_1, tlI_2) \ge ms$ and $\mathrm{FTARConf}(tlI_1 \to tlI_2) \ge mc$, the feature-term itemset $(tlI_1, tlI_2)$ is called a frequent itemset and the inter-term association rule $tlI_1 \to tlI_2$ is called a strong association rule.

Definition 4. The feature-term k-itemset weight threshold (k-Item Weighted Threshold, KIWT) containing a q-itemset ($q < k$) is a weight prediction for the subsequent itemsets that contain the q-itemset.

Let tlT be a fully weighted q-itemset with $tlT \subset tlI$ and $q < k$. In the itemset (tlI − tlT), denote the weights of the (k − q) items with the largest weights as $w_1, w_2, \ldots, w_{k-q}$, and denote the support count of the q-itemset tlT in TLdoc as SC(tlT). Based on the k-weight threshold theory in the literature (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining. Journal of Software, Vol.20, No.7, July 2009, pp.1854-1865), the formula for the feature-term k-itemset weight threshold containing a q-itemset is given as formula (3).
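The pruning test built on this threshold can be sketched with the special case that the method steps state explicitly, KIWT(k-1, k) = n × k × ms − n_ck × maxCw. The helper names `kiwt` and `survives_pruning` are hypothetical, and the general formula (3) is not reproduced here.

```python
def kiwt(n: int, k: int, ms: float, support_count: int, max_other_weight: float) -> float:
    """Threshold from the method steps: n * k * ms - support_count * max_other_weight."""
    return n * k * ms - support_count * max_other_weight


def survives_pruning(weight: float, n: int, k: int, ms: float,
                     support_count: int, max_other_weight: float) -> bool:
    """A candidate is pruned when its weight is strictly below the threshold."""
    return weight >= kiwt(n, k, ms, support_count, max_other_weight)


# Illustrative numbers only: 10 feedback documents, extension to 2-itemsets, ms = 0.5,
# a candidate occurring in 4 documents, largest outside item weight 2.0.
threshold = kiwt(n=10, k=2, ms=0.5, support_count=4, max_other_weight=2.0)
```

The intuition is that a candidate occurring in `support_count` documents can gain at most `support_count × max_other_weight` when one more item is added, so a candidate whose weight is below the threshold cannot reach the total weight n × k × ms that a frequent k-itemset needs.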

2. As shown in Fig. 1, the Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback of this embodiment comprises the following steps:

(1) The Indonesian user query is translated into a Chinese query by the machine translation module and submitted to the search engine for an initial retrieval on the Internet, yielding the initial retrieval result document set. The machine translation module uses the Bing machine translation interface, i.e. the Microsoft Translator API; the search engine module may be an existing search engine such as Baidu or Google;

(2) The top r Chinese documents of the cross-language initial retrieval result document set are extracted and submitted to the user;

The feature-term weights are calculated with the tf-idf method. In its formula, $tf_{m,n}$ denotes the number of occurrences of feature term $t_m$ in document $d_n$, $df_m$ denotes the number of documents containing feature term $t_m$, and N denotes the total number of documents in the document collection;

(7) The k-itemsets are mined, where $k \ge 2$, through steps (7.1) to (7.7):

(7.3) When k = 2, candidate 2-itemsets that contain no query term are pruned;

(7.5) Candidates $C_k$ whose support count $n_{ck}$ is 0 are pruned;

(8) Fully weighted feature-term association rules containing query terms are mined from the fully weighted feature-term frequent itemset collection L to build the fully weighted association rule library, through steps (8.1) to (8.4):

where $\mathrm{FTISup}(tlL_i)$ is the support of the fully weighted frequent itemset $tlL_i$ and $\mathrm{FTISup}(tlI_1)$ is the support of the fully weighted frequent itemset $tlI_1$;

(8.4) Steps (8.1) to (8.3) are repeated until every itemset in the fully weighted feature-term frequent itemset collection L has been taken out exactly once, whereupon mining ends;

3. As shown in Figs. 2 to 4, the retrieval system applying the Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback of this embodiment comprises the following 4 modules and 3 databases:

Machine translation module: this module uses the Bing machine translation interface, i.e. the Microsoft Translator API, to translate the Indonesian user query into a Chinese query, and to translate the final Chinese result documents into Indonesian documents that are submitted to the user;

Initial-retrieval relevant document database;

Fully weighted association rule library;

Expansion term library.

The fully weighted association pattern mining and user relevance feedback module comprises the following 5 modules:

4. In combination with the technical solution, the beneficial effects of the invention are further illustrated by the following experiments:

Because search-engine research is broad in scope and involves many factors, the invention is instead run in an Indonesian-Chinese cross-language retrieval system based on the vector space model; the experiment is therefore a simulation. Source programs implementing the inventive method and system were written to carry out the experiments. The Chinese corpus of NTCIR-5 CLIR, the standard cross-language information retrieval test collection from the multilingual processing evaluation workshop hosted by Japan's National Institute of Informatics, is used as the experimental corpus.

NTCIR-5 CLIR comprises a query set, a document set, and a result set. The query set has 50 query topics, each given in 4 types: TITLE, DESC, NARR, and CONC. These experiments use the TITLE and DESC types. A TITLE query briefly describes the topic with nouns and noun phrases and is a short query; a DESC query briefly describes the topic in sentence form and is a long query. The result set has 2 evaluation standards, Rigid and Relax: under Rigid, the answers are highly relevant or relevant to the original query; under Relax, they are highly relevant, relevant, or partially relevant.

To carry out the Indonesian-Chinese cross-language retrieval experiments, professional translators from a translation agency were invited to manually translate the 50 Chinese query topics of NTCIR-5 CLIR into Indonesian queries.

In these experiments, the Chinese lexical analysis system ICTCLAS, developed by the Institute of Computing Technology, Chinese Academy of Sciences, is used to preprocess the Chinese experimental corpus and the translated Chinese queries. Feature-term weights are calculated with the traditional tf-idf method, and the weight $w_{i,q}$ of a translated query term is calculated as shown in formula (4) (from G. Salton, C. Buckley. Term-weighting approaches in automatic text retrieval [J]. Information Processing & Management, 1988, 24(5): 513-523).

where $tf_{i,q}$ is the raw frequency with which the query term occurs in the query text, N is the total number of initial-retrieval relevant documents, and $df_i$ is the number of initial-retrieval relevant documents containing the i-th query term.

In this experiment, the weights of the Chinese expansion terms are set as follows: the confidence of the matrix-weighted association rule is used as the weight of the expansion term, and when the same expansion term appears in several association rules, the highest of their confidences is taken as its weight.
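One way to read this weighting rule is sketched below. It assumes rules are available as (antecedent, consequent, confidence) triples, that only rules touching at least one query term are used, and that every non-query item of such a rule counts as an expansion term; the name `expansion_weights` and these choices are assumptions for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]  # (antecedent, consequent, confidence)


def expansion_weights(rules: Iterable[Rule], query_terms: Set[str]) -> Dict[str, float]:
    """Weight each expansion term by the highest confidence of any rule containing it."""
    weights: Dict[str, float] = {}
    for antecedent, consequent, confidence in rules:
        items = antecedent | consequent
        if not items & query_terms:          # rule unrelated to the query: ignore it
            continue
        for term in items - query_terms:     # non-query items become expansion terms
            weights[term] = max(weights.get(term, 0.0), confidence)
    return weights
```

Taking the maximum rather than the sum keeps every expansion weight within the range of a single confidence value.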

Experimental evaluation baselines:

(1) Monolingual retrieval baseline (Monolingual Baseline, MB): the retrieval results obtained by directly retrieving Chinese documents with Chinese queries.

(2) Cross-language retrieval baseline (Cross-language Baseline, CLB): the first-pass cross-language retrieval results without any relevance feedback, that is, the results obtained by retrieving Chinese documents with the Indonesian queries after translation by the machine translation system.

(3) Traditional cross-language retrieval method based on pseudo-relevance feedback, CLR_PRF (Jianfeng Gao, Jianyun Nie, Jian Zhang, et al. TREC-9 CLIR Experiments at MSRCN [C]. In: Proc. of the 9th Text Retrieval Evaluation Conference, 2001: 343-353; Wu Dan, He Daqing, Wang Huilin. Cross-language query expansion based on pseudo-relevance [J]. Journal of the China Society for Scientific and Technical Information, 2010, 29(2): 232-239). In this experiment, the top 20 cross-language initial-retrieval documents are extracted to build the initial-retrieval relevant document set, and the 20 feature terms with the highest weights (in descending order) are taken as expansion terms.
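The CLR_PRF configuration described in item (3) can be sketched as follows. The sketch assumes each ranked document is already a {term: weight} map and that a term's overall weight is the sum of its per-document weights; the cited papers may aggregate differently, and the name `prf_expansion_terms` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def prf_expansion_terms(
    ranked_docs: List[Dict[str, float]],
    top_docs: int = 20,
    top_terms: int = 20,
) -> List[str]:
    """Treat the top-ranked documents as relevant and return their heaviest terms."""
    totals: Dict[str, float] = defaultdict(float)
    for doc in ranked_docs[:top_docs]:           # pseudo-relevant set: first top_docs results
        for term, weight in doc.items():
            totals[term] += weight               # assumed aggregation: sum of weights
    # Sort by descending weight; ties are broken alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_terms]]
```

Unlike the inventive method, no relevance judgement is involved here: every one of the top documents is assumed relevant.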

Experimental parameters of the inventive method: the top 100 documents of the cross-language initial retrieval are extracted and submitted to the user, who judges their relevance and thereby determines the initial-retrieval relevant document set. In these experiments, the documents among the top 100 initial results that appear in the known relevant results are treated as user relevance feedback and extracted to build the user initial-retrieval relevant document set. Finally, fully weighted association rule mining is applied to this set to mine expansion terms and implement query expansion.

Source programs were written to run Indonesian-Chinese cross-language text retrieval on the NTCIR-5 CLIR test collection with the inventive method and with the baseline methods MB, CLB, and CLR_PRF, and their cross-language retrieval performance was compared and analysed.
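The simulated user feedback described above amounts to intersecting the top-ranked results with the known relevant set. A minimal sketch, with the hypothetical name `simulated_feedback` and document identifiers represented as strings:

```python
from typing import List, Set


def simulated_feedback(
    ranked_doc_ids: List[str],
    known_relevant: Set[str],
    top_r: int = 100,
) -> List[str]:
    """Keep the top_r initial results that appear in the known relevant set, in rank order."""
    return [doc_id for doc_id in ranked_doc_ids[:top_r] if doc_id in known_relevant]


# Illustrative identifiers only: "d2" is relevant but falls outside the top 3.
feedback = simulated_feedback(["d3", "d1", "d7", "d2"], {"d1", "d2"}, top_r=3)
```

In a live system the same set would instead come from the captured download behavior of the user.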

(1) Baseline experiment results

The experimental source programs were run with the title part and the description part of the 50 query topics of NTCIR-5 CLIR to perform monolingual Chinese retrieval, Indonesian-Chinese cross-language retrieval, and traditional pseudo-relevance-feedback-based Indonesian-Chinese cross-language retrieval; that is, the baseline algorithms MB, CLB, and CLR_PRF were run. The retrieval results of the 3 baseline methods are shown in Table 1.

Table 1:

The results in Table 1 show that every evaluation metric of the Indonesian-Chinese cross-language retrieval baseline CLB and of the traditional CLR_PRF method reaches only about 30% to 60% of the monolingual retrieval baseline MB, and that the long description-type queries retrieve better than the short title-type queries. For the CLR_PRF algorithm, all metrics except MAP rise relative to the baseline CLB, by about 5% to 30%, whereas the MAP value generally falls, by up to 46%. These results show that cross-language retrieval is affected by query translation; its performance is generally low and does not reach that of the corresponding monolingual retrieval.

(2) the retrieval Performance comparision of the inventive method and benchmark algorithm

Using title type and the description type of 50 inquiry themes of NTCIR-5CLIR, support is become Change and two kinds of situations carry out retrieving performance test during confidence level change, with Indonesia Chinese cross-language retrieval benchmark CLB and traditional CLR_PRF method, and single language retrieval benchmark MB carries out retrieving Performance comparision.Experiment design parameter：Support threshold changes When retrieval Performance comparision as shown in table 2, during confidence threshold value change, MAP, P 5 and P 15 of retrieval result is worth as shown in table 3.

Table 2：

Table 3：

Knowable to the experimental result of table 2, when complete weighted support measure changes of threshold, the inventive method retrieval result each Item desired value is higher than all the value of Indonesia Chinese cross-language retrieval benchmark CLB and traditional spurious correlation cross-language retrieval method CLR_PRF, All reach the 60% to 102% of single language retrieval benchmark MB.Compare with benchmark CLB, the amplitude that it improves is 91.55% to the maximum (i.e. the P@5 of Rigid type is worth), minimum be 36.06% type, Relax evaluation and test P@15 be worth).With CLR_PRF method phase The amplitude maximum that it improves is up to 244.97% (i.e. the MAP value of description query type, Rigid evaluation and test), minimum for ratio Be 32.89%, especially, its description query type, Rigid evaluation and test MAP value met and exceeded single language The 2% of retrieval benchmark MB.In addition, the retrieval effectiveness of inquiry theme description type is better than title type, its retrieval The MAP value increase rate of result is maximum.

The results in Table 3 show that, when the confidence threshold varies, the present invention obtains good retrieval results: every metric is higher than the corresponding value of the baseline CLB and of the CLR_PRF algorithm, and reaches 58.07% to 101.2% of the monolingual retrieval baseline MB. Description-type query topics again retrieve better than title-type ones, and the MAP values of their retrieval results improve the most.
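As an illustration of how figures such as MAP, P@5, P@15 and the percentage improvements quoted above are typically obtained, a minimal Python sketch follows. It assumes the standard textbook definitions of these metrics; the function names are hypothetical, and the text does not state which evaluation tool was actually used. The set of relevant documents passed in is whatever the chosen judgment standard (Rigid or Relax) designates.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """P@k: fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP over (ranked list, relevant set) pairs, one pair per query topic."""
    runs = list(runs)
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def relative_change_percent(value: float, baseline: float) -> float:
    """Improvement of `value` over `baseline`, in percent of the baseline."""
    return (value - baseline) / baseline * 100.0
```

A statement such as "reaches 60% to 102% of MB" corresponds to the ratio value / baseline × 100, while "improves by 91.55%" corresponds to `relative_change_percent`.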

In summary, the present invention has good practical application value.

Claims

1. An Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback, characterized in that it comprises the following steps:

(1) The Indonesian user query is translated into a Chinese query by the machine translation module and submitted to a search engine for an initial retrieval on the Internet, obtaining the initial retrieval result document set;

(3) The user judges the relevance of the Chinese documents in the cross-language initial retrieval result document set, yielding the user feedback relevant document set; the total number of documents in this document set is denoted n;

(4) The user feedback relevant document set is preprocessed, that is, Chinese word segmentation, stop-word removal, feature-word weight calculation and feature-word extraction are performed, to build the initial retrieval relevant document database;

(5) The initial retrieval relevant document database is scanned and the all-weighted feature-word candidate 1-itemsets C₁ are mined; the weight w(C₁) of C₁ is calculated, and the maximum weight maxCw_i(!C₁) of the items other than C₁ and the support count n_c1 of C₁ are counted; with ms denoting the minimum support threshold, the value of KIWT(1,2) is calculated by the formula: KIWT(1,2) = n × 1 × ms - n_c1 × maxCw_i(!C₁);

(6) The support FTISup(C₁) of C₁ is calculated; if FTISup(C₁) ≥ ms, the frequent 1-itemsets L₁ are mined from the candidate 1-itemsets C₁ and added to the all-weighted feature-word frequent itemset collection L; the formula for FTISup(C₁) is:

(7.1) The weight of each candidate (k-1)-itemset C_k-1 is compared with the value of KIWT(k-1, k), and every candidate C_k-1 with w(C_k-1) < KIWT(k-1, k) is pruned;

(7.3) When k = 2, candidate 2-itemsets that contain no query term are pruned;

(7.4) The initial retrieval relevant document database is scanned; the maximum weight maxCw_i(!C_k) of the items other than C_k and the support count n_ck of C_k are counted, and the weight w(C_k) of C_k and the value of KIWT(k-1, k) are calculated, the formula for KIWT(k-1, k) being: KIWT(k-1, k) = n × k × ms - n_ck × maxCw_i(!C_k);

(7.5) Candidates C_k whose support count n_ck is 0 are pruned;

(7.6) For the remaining candidate k-itemsets C_k, the support FTISup(C_k) is calculated; if FTISup(C_k) ≥ ms, the frequent k-itemsets L_k are mined from the candidate k-itemsets C_k and added to the all-weighted feature-word frequent itemset collection L; the formula for FTISup(C_k) is:

(7.7) If k exceeds the candidate itemset length threshold or the set of candidate k-itemsets is empty, mining terminates; otherwise, steps (7.1) to (7.6) are repeated;

(8) All-weighted feature-word association rules containing query terms are mined from the all-weighted feature-word frequent itemset collection L, and the all-weighted association rule base is built;

(10) The original query and the expansion words are combined and submitted to the search engine for a second retrieval, obtaining the final retrieval result Chinese documents;

(11) The final retrieval result Chinese documents are submitted to the machine translation module and translated into Indonesian documents, and both the final retrieval result Chinese documents and the final retrieval result Indonesian documents are returned to the user.
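The mining loop of steps (5) to (7.7) can be illustrated with the following minimal Python sketch. It is not the claimed implementation, and several points are assumptions made only for illustration: the support formula, which is not reproduced in the text, is taken to be FTISup(C_k) = w(C_k) / (n × k); the pruning bound for extending a k-itemset by one item is read as n × (k+1) × ms minus the support count times the maximum weight of the other items; that maximum is taken over the documents containing the itemset; and all weights are non-negative. The function names and the English feature words in the usage note are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Doc = Dict[str, float]  # feature word -> weight of that word in one document


def itemset_stats(docs: List[Doc], itemset: FrozenSet[str]) -> Tuple[float, int, float]:
    """Return (w, count, max_outside) for one candidate itemset.

    w           -- summed weights of the itemset's words over the documents
                   that contain every word of the itemset
    count       -- number of such documents (the support count)
    max_outside -- largest weight of any other word in those documents
    """
    w, count, max_outside = 0.0, 0, 0.0
    for doc in docs:
        if all(term in doc for term in itemset):
            count += 1
            w += sum(doc[term] for term in itemset)
            for term, weight in doc.items():
                if term not in itemset:
                    max_outside = max(max_outside, weight)
    return w, count, max_outside


def mine_frequent_itemsets(docs: List[Doc], query_terms: Set[str],
                           ms: float, max_len: int = 3) -> Dict[FrozenSet[str], float]:
    """Map each frequent itemset to its (assumed) all-weighted support."""
    n = len(docs)
    frequent: Dict[FrozenSet[str], float] = {}
    candidates = sorted({frozenset([t]) for d in docs for t in d}, key=sorted)
    k = 1
    while candidates and k <= max_len:
        extendable = []
        for cand in candidates:
            w, count, max_outside = itemset_stats(docs, cand)
            if count == 0:                 # step (7.5): drop candidates never observed
                continue
            support = w / (n * k)          # ASSUMED support formula
            if support >= ms:              # steps (6) and (7.6)
                frequent[cand] = support
            # ASSUMED reading of KIWT: the weight a k-itemset needs so that
            # some (k+1)-itemset containing it could still reach ms.
            kiwt = n * (k + 1) * ms - count * max_outside
            if w >= kiwt:                  # step (7.1): keep only extendable candidates
                extendable.append(cand)
        k += 1
        joined = {a | b for a, b in combinations(extendable, 2) if len(a | b) == k}
        if k == 2:                         # step (7.3): 2-itemsets must contain a query term
            joined = {c for c in joined if c & query_terms}
        candidates = sorted(joined, key=sorted)
    return frequent
```

Calling `mine_frequent_itemsets(docs, {"retrieval"}, ms=0.2)` on a handful of weighted documents returns a dictionary keyed by `frozenset`, which is a convenient input for a later rule-mining step. The pruning bound is safe under the stated assumptions because adding one item to an itemset can raise its total weight by at most the support count times the largest outside weight.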

2. The Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to claim 1, characterized in that the feature-word weights in step (4) are calculated by the tf-idf method, whose formula is: where tf_m,n denotes the number of occurrences of feature word t_m in document d_n, df_m denotes the number of documents containing feature word t_m, and N denotes the total number of documents in the document collection.
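A minimal Python sketch of a tf-idf weighting over the quantities named in claim 2 is given below. Because the formula itself is not reproduced in the text, the common variant w = tf_m,n × log(N / df_m) is assumed here; the patented formula may include a different logarithm base or a normalization term. The function name is hypothetical, and the input is assumed to be already segmented with stop words removed.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return, for each document, a mapping feature word -> tf-idf weight.

    ASSUMED variant: weight = tf * ln(N / df), where tf is the number of
    occurrences in the document, df the number of documents containing the
    word, and N the total number of documents.
    """
    total_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(total_docs / df[t]) for t in tf})
    return weights
```

With this variant a word that occurs in every document receives weight 0, which is one reason practical systems often add smoothing; whether the patent does so cannot be determined from the text.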

3. The Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to claim 1, characterized in that step (8) comprises steps (8.1) to (8.4):

(8.1) An all-weighted frequent i-itemset tlL_i is taken from the all-weighted feature-word frequent itemset collection L, and all proper subsets of tlL_i are found;

(8.2) Two proper subsets tlI₁ and tlI₂ are taken arbitrarily from the set of proper subsets of tlL_i; when tlI₁ ∩ tlI₂ = ∅ and tlI₁ ∪ tlI₂ = tlL_i, then if FTARConf(tlI₁→tlI₂) ≥ mc, the all-weighted feature-word strong association rule tlI₁→tlI₂ is mined, and if FTARConf(tlI₂→tlI₁) ≥ mc, the all-weighted feature-word strong association rule tlI₂→tlI₁ is mined; here mc is the minimum confidence threshold, tlI₁ and tlI₂ are all-weighted feature-word frequent itemsets that are proper-subset itemsets of tlL_i, and FTARConf(tlI₁→tlI₂) is the confidence of the all-weighted feature-word association rule tlI₁→tlI₂, whose formula is:

(8.3) Step (8.2) is repeated until every proper subset in the set of proper subsets of the all-weighted frequent i-itemset tlL_i has been taken out once and only once, after which the method proceeds to step (8.4);

(8.4) Steps (8.1) to (8.3) are repeated; when every itemset in the all-weighted feature-word frequent itemset collection L has been taken out once and only once, mining terminates.
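Steps (8.1) to (8.4) can be sketched in Python as follows. The confidence formula is not reproduced in the text, so the sketch assumes FTARConf(A→B) = FTISup(A ∪ B) / FTISup(A); this is an assumption, not the claimed formula. The function name, the `frequent` mapping (itemset to support) and the `query_terms` filter are hypothetical names introduced for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]  # (antecedent, consequent, confidence)


def mine_rules(frequent: Dict[FrozenSet[str], float],
               query_terms: Set[str], mc: float) -> List[Rule]:
    """Generate rules A -> B with A, B disjoint, A | B a frequent itemset,
    confidence >= mc, and at least one query term in A | B."""
    rules: List[Rule] = []
    for itemset, support in frequent.items():
        if len(itemset) < 2 or not (itemset & query_terms):
            continue
        items = sorted(itemset)
        # Every non-empty proper subset serves once as an antecedent; its
        # complement is the consequent, so both rule directions are covered.
        for size in range(1, len(items)):
            for left in combinations(items, size):
                antecedent = frozenset(left)
                if antecedent not in frequent:
                    continue  # the antecedent must itself be a frequent itemset
                confidence = support / frequent[antecedent]  # ASSUMED formula
                if confidence >= mc:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules
```

Because the supports here are weight-based rather than simple document counts, this ratio is not guaranteed to stay at or below 1; the sketch does not clamp it.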

4. A retrieval system applying the Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to claim 1, characterized in that it comprises the following 4 modules and 3 databases:

Machine translation module: this module uses the Bing machine translation interface to translate the Indonesian user query into a Chinese query, and to translate the final retrieval result Chinese documents into Indonesian documents that are returned to the user;

Search engine module: this module is a search engine, used to retrieve on the Internet with the translated Chinese query and obtain the cross-language initial retrieval result document set;

All-weighted association pattern mining and user relevance feedback module: used to submit the top r documents of the cross-language initial retrieval result document set to the user, who judges the relevance of these documents and thereby determines the initial retrieval relevant document database; all-weighted association rule mining is then applied to the initial retrieval relevant document database to mine expansion words related to the query, realizing cross-language query expansion; the expansion words are combined with the original query and retrieved again to obtain the final retrieval result Chinese documents;

Final result display module: used to submit the final retrieval result Chinese documents to the machine translation module for translation into Indonesian documents, and to return both the final retrieval result Chinese documents and the final retrieval result Indonesian documents to the user;

Initial retrieval relevant document database;

All-weighted association rule base;

Expansion word base.
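The way the modules of claim 4 hand data to one another can be sketched as a single orchestration function in Python. Every name below is hypothetical: the translation, search, judgment and mining components are passed in as plain callables because the text does not specify their programming interfaces, and the default value of `top_r` is an arbitrary placeholder for the parameter r.

```python
from typing import Callable, List, Sequence, Tuple


def cross_language_search(
    indonesian_query: str,
    translate_id_to_zh: Callable[[str], str],
    translate_zh_to_id: Callable[[str], str],
    search: Callable[[str], List[str]],
    judge_relevant: Callable[[Sequence[str]], List[str]],
    mine_expansion_words: Callable[[Sequence[str], str], List[str]],
    top_r: int = 50,  # placeholder: the value of r is not given in the text
) -> Tuple[List[str], List[str]]:
    """Return (final Chinese documents, their Indonesian translations)."""
    # Machine translation module: Indonesian query -> Chinese query.
    zh_query = translate_id_to_zh(indonesian_query)
    # Search engine module: initial retrieval, keep the top r documents.
    initial_docs = search(zh_query)[:top_r]
    # User relevance feedback: the user marks the relevant documents.
    relevant_docs = judge_relevant(initial_docs)
    # Association pattern mining: expansion words related to the query.
    extra_words = mine_expansion_words(relevant_docs, zh_query)
    # Query expansion: original query plus expansion words, second retrieval.
    new_query = " ".join([zh_query, *extra_words])
    final_zh = search(new_query)
    # Final result display module: translate the results back to Indonesian.
    final_id = [translate_zh_to_id(doc) for doc in final_zh]
    return final_zh, final_id
```

Passing the components in as arguments keeps the sketch independent of any particular translation service or search engine and makes each stage easy to replace with a stub during testing.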

5. The retrieval system according to claim 4, characterized in that the all-weighted association pattern mining and user relevance feedback module comprises the following 5 modules:

User click-behavior relevance feedback extraction module: used to capture the document download behavior that occurs while the user browses the initial retrieval result document set, and to extract the initial retrieval documents downloaded by the user to build the user feedback relevant document set;

Document preprocessing module: used to preprocess the user feedback relevant document set by Chinese word segmentation, stop-word removal, feature-word weight calculation and feature-word extraction, and to build the initial retrieval relevant document database;

All-weighted association rule mining module: used to perform all-weighted association rule mining on the initial retrieval relevant document database, mining the all-weighted feature-word frequent itemsets and association rule patterns that contain original query terms, and to build the all-weighted association rule base;

Cross-language query expansion word generation module: used to extract the expansion words related to the original query from the all-weighted association rule base and to build the expansion word base;

Cross-language query expansion implementation module: used to extract Chinese expansion words from the expansion word base, combine the expansion words with the original query into a new query, and submit the new query to the search engine for a second retrieval on the Internet, obtaining the final retrieval result Chinese documents.
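The last two modules of claim 5 can be illustrated with a short Python sketch. The text does not state how expansion words are selected from the rule base, so the sketch assumes one plausible policy: take the consequent words of rules whose antecedent contains a query term, rank them by rule confidence, and keep the best few. The function names, the rule tuple layout (antecedent, consequent, confidence) and the `top_n` cutoff are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]  # (antecedent, consequent, confidence)


def expansion_words(rules: Sequence[Rule], query_terms: Set[str], top_n: int = 5) -> List[str]:
    """Pick expansion words from rules whose antecedent contains a query term.

    ASSUMED policy: a word is scored by the highest confidence of any rule
    that yields it; ties are broken alphabetically.
    """
    best: Dict[str, float] = {}
    for antecedent, consequent, confidence in rules:
        if antecedent & query_terms:
            for word in consequent - query_terms:
                best[word] = max(confidence, best.get(word, 0.0))
    ranked = sorted(best, key=lambda word: (-best[word], word))
    return ranked[:top_n]


def expand_query(query_terms: Sequence[str], rules: Sequence[Rule], top_n: int = 5) -> List[str]:
    """Original query terms followed by the selected expansion words."""
    return list(query_terms) + expansion_words(rules, set(query_terms), top_n)
```

The expanded term list would then be joined into a query string and submitted to the search engine for the second retrieval.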