CN106484781A - Indonesian-Chinese cross-language retrieval method and system fusing association patterns and user feedback

Info

Publication number: CN106484781A
Authority: CN (China)
Prior art keywords: language, chinese, tli, user, document
Legal status: Granted; Expired - Fee Related (the legal status is an assumption and is not a legal conclusion)
Application number: CN201610827858.4A
Other languages: Chinese (zh)
Other versions: CN106484781B (en)
Inventor: Huang Mingxuan (黄名选)
Current Assignee: Guangxi University of Finance and Economics (the listed assignees may be inaccurate)
Original Assignee: Guangxi University of Finance and Economics
Application filed by: Guangxi University of Finance and Economics
Priority: CN201610827858.4A
Publications: CN106484781A (application), CN106484781B (grant)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3337 Translation of the query language, e.g. Chinese to English
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3326 Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an Indonesian-Chinese cross-language retrieval method and system that fuse association patterns and user feedback. A machine translation module translates the Indonesian user query into a Chinese query, which is submitted to a search engine module to obtain the initial-retrieval result document set. A user click-behavior relevance feedback extraction module obtains the user-feedback relevant document set from the initial results, and a document preprocessing module preprocesses it into the initial-retrieval relevant document database. A fully weighted association rule mining module then builds the fully weighted association rule base, and a cross-language query expansion word generation module builds the expansion word dictionary. A cross-language query expansion implementation module combines the expansion words with the original query and resubmits the new query to the search engine module to obtain the final Chinese result documents. Finally, a final result display module submits the final results to the machine translation module, which translates them into Indonesian before they are returned to the user. The invention improves cross-language retrieval performance and has good practical application value and prospects for wider adoption.

Description

Indonesian-Chinese cross-language retrieval method and system fusing association patterns and user feedback
Technical field
The invention belongs to the field of document information retrieval. Specifically, it is an Indonesian-Chinese cross-language retrieval method and system that fuse association patterns and user feedback, suitable for fields such as cross-language text information retrieval in which Indonesian queries are used to retrieve Chinese documents.
Background
Cross-language information retrieval is a technique for retrieving information resources in other languages using a query in one language. Indonesian-Chinese cross-language information retrieval addresses the problem of retrieving Chinese documents with Indonesian queries; the Indonesian in which the query is expressed is called the source language, and the Chinese of the retrieved documents is called the target language. As exchanges between China and the ASEAN countries grow ever closer, research on cross-language information retrieval methods for ASEAN languages has become urgent and important.
Scholars around the world have studied cross-language information retrieval methods and systems in depth from different angles and directions, and have achieved rich results. However, the problems in current cross-language information retrieval research are not yet fully solved. One of the most pressing and closely watched problems is severe query topic drift during cross-language retrieval, together with a word mismatch problem that is more severe than in monolingual retrieval; these problems often make cross-language retrieval perform worse than monolingual retrieval. In response, cross-language information retrieval based on query expansion has received increasing attention in recent years. This research concentrates mainly on relevance feedback (Parton K, Gao J. Combining Signals for Cross-Lingual Relevance Feedback [C]. Proceedings of the 8th Asia Information Retrieval Societies Conference (AIRS 2012), Tianjin, China. Springer-Verlag Berlin Heidelberg, 2012, LNCS 7675, Information Retrieval Technology: 356-365; Lee C J, Croft W B. Cross-Language Pseudo-Relevance Feedback Techniques for Informal Text [C]. Proceedings of the 36th European Conference on IR Research (ECIR 2014), Amsterdam, The Netherlands. Advances in Information Retrieval. Springer International Publishing, 2014: 260-272), latent semantics (Bi Jianting, Su Yidan. Cross-language query expansion method based on latent semantic analysis [J]. Computer Engineering, 2009, 35(10): 49-53; Ning Jian, Lin Hongfei. Cross-language retrieval based on improved latent semantic analysis [J]. Journal of Chinese Information Processing, 2010, 24(3): 105-111), and language models and topic models (Ganguly D, Leveling J, Jones G J F. Cross-lingual topical relevance models [C]. 24th International Conference on Computational Linguistics (COLING 2012), 2012; Wang Xuwen, Zhang Qiang, Wang Xiaojie, et al. LDA based pseudo relevance feedback for cross language information retrieval [C]. IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS 2012). Hangzhou: IEEE, 2012: 1993-1998; Xuwen Wang, Qiang Zhang, Xiaojie Wang, et al. Cross-lingual Pseudo Relevance Feedback Based on Weak Relevant Topic Alignment. Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation (PACLIC 29), Shanghai, China, 2015: 529-534). These studies mainly take English as the language object, and most address cross-language retrieval between English and other languages.
Since Nanning, China became the permanent host city of the China-ASEAN Expo, political, economic, and cultural contacts between China and the ASEAN countries have become more frequent and closer. Research on cross-language information retrieval and cross-language information services for ASEAN languages has therefore become more urgent, and its importance is increasingly evident.
Summary of the invention
In view of the above problems in the prior art, the present invention applies fully weighted association rule mining together with user relevance feedback to Indonesian-Chinese cross-language information retrieval, and provides an Indonesian-Chinese cross-language retrieval method and system fusing association patterns and user feedback. The invention can improve Indonesian-Chinese cross-language information retrieval performance, and works better for long queries.
To achieve the above object, the present invention adopts the following technical solution:
An Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback, comprising the following steps:
(1) The Indonesian user query is translated into a Chinese query by the machine translation module and submitted to the search engine for an initial retrieval on the Internet, yielding the initial-retrieval result document set;
(2) The top $r$ Chinese documents of the cross-language initial-retrieval result document set are extracted and presented to the user;
(3) The user judges the relevance of these Chinese documents, yielding the user-feedback relevant document set; the total number of documents in this set is denoted $n$;
(4) The user-feedback relevant document set is preprocessed, that is, Chinese word segmentation, stop-word removal, feature word weight computation, and feature word extraction are performed, to build the initial-retrieval relevant document database;
(5) The initial-retrieval relevant document database is scanned to mine the fully weighted feature word candidate 1-itemsets $C_1$. The weight $w(C_1)$ of $C_1$ is computed, and the maximum weight $\mathrm{maxCw}_i(!C_1)$ among the items other than $C_1$ and the support count $n_{c_1}$ of $C_1$ are counted. With $ms$ the minimum support threshold, the value of $\mathrm{KIWT}(1,2)$ is computed as: $\mathrm{KIWT}(1,2) = n \times 1 \times ms - n_{c_1} \times \mathrm{maxCw}_i(!C_1)$;
(6) The support $\mathrm{FTISup}(C_1)$ of $C_1$ is computed. If $\mathrm{FTISup}(C_1) \ge ms$, the frequent 1-itemset $L_1$ is mined from the candidate 1-itemsets $C_1$ and added to the set $L$ of fully weighted feature word frequent itemsets. $\mathrm{FTISup}(C_1)$ is computed as: $\mathrm{FTISup}(C_1) = \dfrac{w(C_1)}{n \times 1}$;
(7) The $k$-itemsets are mined, where $k \ge 2$, through steps (7.1) to (7.7):
(7.1) The weight of each candidate $(k-1)$-itemset $C_{k-1}$ is compared with the value of $\mathrm{KIWT}(k-1,k)$, and every candidate $C_{k-1}$ with $w(C_{k-1}) < \mathrm{KIWT}(k-1,k)$ is pruned;
(7.2) The remaining candidate $(k-1)$-itemsets $C_{k-1}$ are combined by Apriori join to obtain $C_k$;
(7.3) When $k = 2$, the candidate 2-itemsets that contain no query term are pruned;
(7.4) The initial-retrieval relevant document database is scanned to count the maximum weight $\mathrm{maxCw}_i(!C_k)$ among the items other than $C_k$ and the support count $n_{c_k}$ of $C_k$, and to compute the weight $w(C_k)$ of $C_k$ and the value of $\mathrm{KIWT}(k-1,k)$, which is computed as: $\mathrm{KIWT}(k-1,k) = n \times k \times ms - n_{c_k} \times \mathrm{maxCw}_i(!C_k)$;
(7.5) Every candidate $C_k$ whose $n_{c_k}$ is 0 is pruned;
(7.6) For each remaining candidate $k$-itemset $C_k$, the support $\mathrm{FTISup}(C_k)$ is computed. If $\mathrm{FTISup}(C_k) \ge ms$, the frequent $k$-itemset $L_k$ is mined from the candidate $k$-itemsets $C_k$ and added to the set $L$ of fully weighted feature word frequent itemsets. $\mathrm{FTISup}(C_k)$ is computed as: $\mathrm{FTISup}(C_k) = \dfrac{w(C_k)}{n \times k}$;
(7.7) If $k$ exceeds the candidate itemset length threshold or the set of candidate $k$-itemsets is empty, mining ends; otherwise, steps (7.1) to (7.6) are repeated;
(8) The fully weighted feature word association rules that contain query terms are mined from the set $L$ of fully weighted feature word frequent itemsets to build the fully weighted association rule base;
(9) The cross-language expansion words related to the original query are extracted from the fully weighted association rule base to build the expansion word dictionary;
(10) The original query and the expansion words are combined and submitted to the search engine for a second retrieval, yielding the final Chinese result documents;
(11) The final Chinese result documents are submitted to the machine translation module and translated into Indonesian documents; finally, both the final Chinese result documents and their Indonesian translations are returned to the user.
The feature word weights in step (4) are computed with the tf-idf method, where $tf_{m,n}$ denotes the number of occurrences of feature word $t_m$ in document $d_n$, $df_m$ denotes the number of documents containing feature word $t_m$, and $N$ denotes the total number of documents in the document collection.
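As a rough illustration of the tf-idf weighting used in step (4), the sketch below assumes the common textbook variant $w(t_m, d_n) = tf_{m,n} \times \log(N / df_m)$. The patent's exact formula may differ, for example in normalization; the function name `tfidf_weights` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = tf * log(N / df). This is a common
    textbook form, not necessarily the patent's exact formula.
    """
    n_docs = len(docs)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrence counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy documents, assumed already segmented with stop words removed.
docs = [
    ["retrieval", "language", "retrieval"],
    ["language", "translation"],
    ["retrieval", "translation", "query"],
]
weights = tfidf_weights(docs)
```

Each resulting dict can serve as one row of the initial-retrieval relevant document database, mapping feature words to their weights in that document.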
Step (8) above comprises steps (8.1) to (8.4):
(8.1) A fully weighted frequent $i$-itemset $tlL_i$ is taken from the set $L$ of fully weighted feature word frequent itemsets, and all proper subsets of $tlL_i$ are found;
(8.2) Two proper subsets $tlI_1$ and $tlI_2$ are taken arbitrarily from the set of proper subsets of $tlL_i$. When $tlI_1 \cap tlI_2 = \emptyset$ and $tlI_1 \cup tlI_2 = tlL_i$: if $\mathrm{FTARConf}(tlI_1 \to tlI_2) \ge mc$, the fully weighted feature word strong association rule $tlI_1 \to tlI_2$ is mined; if $\mathrm{FTARConf}(tlI_2 \to tlI_1) \ge mc$, the fully weighted feature word strong association rule $tlI_2 \to tlI_1$ is mined. Here $mc$ is the minimum confidence threshold, $tlI_1$ and $tlI_2$ are fully weighted feature word frequent itemsets that are proper subsets of $tlL_i$, and $\mathrm{FTARConf}(tlI_1 \to tlI_2)$ is the confidence of the fully weighted feature word association rule $tlI_1 \to tlI_2$, computed as:
$\mathrm{FTARConf}(tlI_1 \to tlI_2) = \dfrac{\mathrm{FTISup}(tlL_i)}{\mathrm{FTISup}(tlI_1)}$
where $\mathrm{FTISup}(tlL_i)$ is the support of the fully weighted frequent itemset $tlL_i$ and $\mathrm{FTISup}(tlI_1)$ is the support of the fully weighted frequent itemset $tlI_1$;
(8.3) Step (8.2) is repeated until every proper subset in the set of proper subsets of $tlL_i$ has been taken out exactly once, and then step (8.4) is performed;
(8.4) Steps (8.1) to (8.3) are repeated until every itemset in the set $L$ of fully weighted feature word frequent itemsets has been taken out exactly once, at which point mining ends.
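The rule generation of steps (8.1) to (8.4) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the confidence of a rule is the support of the whole itemset divided by the support of the antecedent, and it assumes supports are supplied in a dict keyed by `frozenset`. Splits whose antecedent or consequent is not itself a frequent itemset are skipped, since the text requires both sides to be frequent. The names `mine_rules` and `supports` are hypothetical.

```python
from itertools import combinations

def mine_rules(supports, mc):
    """Return (antecedent, consequent, confidence) for strong rules.

    supports maps frozenset itemsets to their FTISup values and is
    assumed to contain only frequent itemsets.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        items = sorted(itemset)
        for size in range(1, len(items)):
            for left in combinations(items, size):
                antecedent = frozenset(left)
                consequent = itemset - antecedent  # disjoint; union is itemset
                if antecedent not in supports or consequent not in supports:
                    continue  # both sides must be frequent itemsets
                conf = sup / supports[antecedent]
                if conf >= mc:
                    rules.append((antecedent, consequent, conf))
    return rules

# Toy supports for two single items and their union.
supports = {
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.5,
    frozenset({"a", "b"}): 0.45,
}
rules = mine_rules(supports, mc=0.8)
```

With these toy values only the rule b → a (confidence 0.45 / 0.5) passes the threshold; a → b (0.45 / 0.6) does not. Enumerating each antecedent once covers both rule directions, because the complement of every antecedent is also enumerated as an antecedent in its own right.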
A retrieval system suitable for the above Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback comprises the following 4 modules and 3 databases:
Machine translation module: this module uses the Bing machine translation interface to translate the Indonesian user query into a Chinese query, and to translate the final Chinese result documents into Indonesian documents for the user;
Search engine module: this module is a search engine that retrieves the translated Chinese query on the Internet and obtains the cross-language initial-retrieval result document set;
Fully weighted association pattern mining and user relevance feedback module: submits the top $r$ documents of the cross-language initial-retrieval result set to the user, who judges their relevance and thereby determines the initial-retrieval relevant document database; then applies fully weighted association rule mining to that database to mine expansion words related to the query, realizing cross-language query expansion; the expansion words are combined with the original query and retrieved again to obtain the final Chinese result documents;
Final result display module: submits the final Chinese result documents to the machine translation module for translation into Indonesian documents, and returns both the final Chinese result documents and their Indonesian translations to the user;
Initial-retrieval relevant document database;
Fully weighted association rule base;
Expansion word dictionary.
The above fully weighted association pattern mining and user relevance feedback module comprises the following 5 modules:
User click-behavior relevance feedback extraction module: captures the document download behavior produced while the user browses the initial-retrieval result document set, and extracts the initial-retrieval documents the user downloaded to build the user-feedback relevant document set;
Document preprocessing module: preprocesses the user-feedback relevant document set by Chinese word segmentation, stop-word removal, feature word weight computation, and feature word extraction, and builds the initial-retrieval relevant document database;
Fully weighted association rule mining module: performs fully weighted association rule mining on the initial-retrieval relevant document database, mines the fully weighted feature word frequent itemsets and association rule patterns that contain original query terms, and builds the fully weighted association rule base;
Cross-language query expansion word generation module: extracts the expansion words related to the original query from the fully weighted association rule base and builds the expansion word dictionary;
Cross-language query expansion implementation module: extracts Chinese expansion words from the expansion word dictionary, combines the expansion words with the original query into a new query, resubmits it to the search engine for retrieval on the Internet, and obtains the final Chinese result documents.
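The last two modules, expansion word generation and query expansion, might be sketched as below. The text does not spell out how expansion words are selected from the rule base, so this sketch assumes one plausible policy: keep rules whose antecedent consists only of query terms, take the non-query words of their consequents, and rank them by rule confidence. The function names, the `max_words` cap, and the tuple layout of a rule are all hypothetical.

```python
def expansion_words(rules, query_terms, max_words=5):
    """Pick expansion words from (antecedent, consequent, confidence) rules.

    Assumed policy: the antecedent must consist of query terms only;
    each candidate word keeps the best confidence among its rules.
    """
    query = set(query_terms)
    best = {}
    for antecedent, consequent, conf in rules:
        if not antecedent <= query:
            continue  # antecedent contains a non-query term
        for word in consequent - query:
            best[word] = max(best.get(word, 0.0), conf)
    # Highest confidence first; ties broken alphabetically for determinism.
    ranked = sorted(best, key=lambda w: (-best[w], w))
    return ranked[:max_words]

def expand_query(query_terms, words):
    """Append expansion words to the original query terms, without duplicates."""
    return list(query_terms) + [w for w in words if w not in query_terms]

rules = [
    (frozenset({"q1"}), frozenset({"x"}), 0.9),
    (frozenset({"q1"}), frozenset({"y", "q2"}), 0.7),
    (frozenset({"z"}), frozenset({"q1"}), 0.95),
]
words = expansion_words(rules, ["q1", "q2"])
new_query = expand_query(["q1", "q2"], words)
```

In this toy run the third rule is ignored because its antecedent is not a query term, so the new query keeps the original terms first and appends the words "x" and "y" in confidence order.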
Compared with the prior art, the advantages of the invention are:
(1) The invention applies fully weighted association rule mining together with user relevance feedback to Indonesian-Chinese cross-language information retrieval, and proposes an Indonesian-Chinese cross-language information retrieval method and system that fuse user click-and-download behavior with fully weighted association pattern mining. The method was compared with the monolingual Chinese text retrieval baseline MB, the Indonesian-Chinese cross-language retrieval baseline CLB, and the traditional cross-language information retrieval method based on pseudo-relevance feedback, CLR_PRF. The experimental results show that the invention obtains good retrieval results: all of its metric values are higher than those of the CLB baseline and the CLR_PRF algorithm, the retrieval effect for description-type query topics is better than for title-type topics, and the MAP value of its retrieval results shows the largest improvement.
(2) The experimental results show that the proposed Indonesian-Chinese cross-language information retrieval method and system, which fuse fully weighted association pattern mining and user relevance feedback, are effective and can improve cross-language information retrieval performance. The main reasons are as follows. In cross-language information retrieval, the query translation result strongly affects the retrieval result and often makes the cross-language initial-retrieval results worse than monolingual initial-retrieval results; that is, query topic drift occurs. Fusing user click behavior with fully weighted association pattern mining in the Indonesian-Chinese cross-language retrieval model yields feedback information related to the original query, and fully weighted association rule mining on that feedback yields expansion words related to the original query for cross-language query expansion. This avoids the severe topic drift present in cross-language retrieval and improves Indonesian-Chinese cross-language retrieval performance.
Brief description of the drawings
Fig. 1 is a block diagram of the Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to the invention.
Fig. 2 is an overall flow chart of the Indonesian-Chinese cross-language retrieval system fusing association patterns and user feedback according to the invention.
Fig. 3 is a structural block diagram of the Indonesian-Chinese cross-language retrieval system fusing association patterns and user feedback according to the invention.
Fig. 4 is a structural block diagram of the fully weighted association pattern mining and user relevance feedback module according to the invention.
Detailed description of the embodiments
The technical solution of the invention is described in further detail below, without limitation, with reference to the embodiments and the accompanying drawings.
I. To better explain the technical solution, the concepts involved in the invention are first introduced as follows:
Assume that the target language (Target Language, TL) initial-retrieval relevant document set obtained from the user query after cross-language initial retrieval and user relevance feedback is $TLdoc = \{tld_1, tld_2, \ldots, tld_n\}$, where $tld_i$ ($1 \le i \le n$) denotes the $i$-th document in the target language document set $TLdoc$, and $tld_i = \{t_1, t_2, \ldots, t_m, \ldots, t_p\}$. Each $t_m$ ($m = 1, 2, \ldots, p$) is called a target language feature-term item (Feature-term Item, FTI), or feature item for short, and usually consists of a character, word, or phrase. The corresponding feature item weight set of $tld_i$ is $W_i = \{w_{i1}, w_{i2}, \ldots, w_{im}, \ldots, w_{ip}\}$, where $w_{im}$ is the weight of the $m$-th feature item $t_m$ in the $i$-th document $tld_i$. Let $tlI = \{t_1, t_2, \ldots, t_k\}$ denote the set of all feature items in $TLdoc$; then a subset $Y$ of $tlI$ is called a feature-term itemset (Feature-term Itemset) in $TLdoc$, or itemset $Y$ for short.
For an itemset $(tlI_1, tlI_2)$ with $tlI_1 \subseteq tlI$, $tlI_2 \subseteq tlI$, and $tlI_1 \cap tlI_2 = \emptyset$, the following basic concepts are given according to the theory of fully weighted association pattern mining (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining. Journal of Software, Vol. 20, No. 7, July 2009, pp. 1854-1865).
Definition 1. The fully weighted support (Feature-term Itemsets Support, FTISup) of a feature-term itemset $I$ ($I = (tlI_1, tlI_2)$) is computed as shown in formula (1):
$\mathrm{FTISup}(I) = \dfrac{w_I}{n \times k}$  (1)
where $w_I$ is the sum of the weights of itemset $I$ over the documents of $TLdoc$, $k$ is the length of itemset $I$ (that is, its number of items), and $n$ is the total number of documents in the initial-retrieval relevant document set $TLdoc$.
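The support in Definition 1 can be illustrated with a small sketch. It reads the definition as FTISup(I) = w_I / (n × k) and assumes, as a labeled simplification, that the weight of an itemset in a document is the sum of its item weights when the document contains every item of the itemset and zero otherwise. Documents are modeled as dicts from feature word to weight; the names `itemset_weight` and `ftisup` are illustrative.

```python
def itemset_weight(docs, itemset):
    """Sum the item weights of `itemset` over documents containing all its items.

    Assumption: a document that lacks any item of the itemset contributes 0.
    """
    return sum(
        sum(doc[t] for t in itemset)
        for doc in docs
        if all(t in doc for t in itemset)
    )

def ftisup(docs, itemset):
    """Fully weighted support: w_I / (n * k)."""
    n = len(docs)      # total number of documents
    k = len(itemset)   # number of items in the itemset
    return itemset_weight(docs, itemset) / (n * k)

# Toy weighted documents (feature word -> weight).
docs = [
    {"a": 0.8, "b": 0.4},
    {"a": 0.6},
    {"a": 0.5, "b": 0.9, "c": 0.2},
]
sup_a = ftisup(docs, ("a",))        # (0.8 + 0.6 + 0.5) / (3 * 1)
sup_ab = ftisup(docs, ("a", "b"))   # (0.8 + 0.4 + 0.5 + 0.9) / (3 * 2)
```

Dividing by k as well as n keeps the supports of long and short itemsets on a comparable scale, which is why the frequency test can use a single threshold ms at every itemset length.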
Definition 2. The fully weighted confidence (Feature-term Association Rule Confidence, FTARConf) of the inter-word association rule $tlI_1 \to tlI_2$ is computed as shown in formula (2):
$\mathrm{FTARConf}(tlI_1 \to tlI_2) = \dfrac{\mathrm{FTISup}(tlI_1, tlI_2)}{\mathrm{FTISup}(tlI_1)}$  (2)
where $\mathrm{FTISup}(tlI_1, tlI_2)$ is the fully weighted support of the itemset $(tlI_1, tlI_2)$.
Definition 3. Let the minimum support threshold be $ms$ and the minimum confidence threshold be $mc$. If $\mathrm{FTISup}(tlI_1, tlI_2) \ge ms$ and $\mathrm{FTARConf}(tlI_1 \to tlI_2) \ge mc$, then the feature-term itemset $(tlI_1, tlI_2)$ is called a frequent itemset and the inter-word association rule $tlI_1 \to tlI_2$ is called a strong association rule.
Definition 4. The feature-term $k$-itemset weight threshold (k-Item Weighted Threshold, KIWT) of a $k$-itemset containing a $q$-itemset ($q < k$) is a weight prediction for the subsequent itemsets that contain the $q$-itemset.
Let $tlT$ be a fully weighted $q$-itemset with $tlT \subset tlI$ and $q < k$. Denote by $w_1, w_2, \ldots, w_{k-q}$ the weights of the $(k-q)$ items with the largest weights in the itemset $(tlI - tlT)$, and denote the support count of the $q$-itemset $tlT$ in $TLdoc$ by $SC(tlT)$. According to the k-weight threshold theory in the literature (Huang Mingxuan, Yan Xiaowei, Zhang Shichao. Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining. Journal of Software, Vol. 20, No. 7, July 2009, pp. 1854-1865), the feature-term $k$-itemset weight threshold of a $k$-itemset containing a $q$-itemset is computed as shown in formula (3):
$\mathrm{KIWT}(q,k) = n \times k \times ms - SC(tlT) \times \sum_{r=1}^{k-q} w_r$  (3)
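The role of the KIWT threshold in pruning can be illustrated numerically. The sketch below follows the formula as it is written in step (7.4) of the method, KIWT(k-1, k) = n × k × ms − n_ck × maxCw, and the pruning rule of step (7.1), which discards a candidate whose weight is below the threshold. The function and parameter names, and the numbers, are illustrative assumptions.

```python
def kiwt(n, k, ms, support_count, max_other_weight):
    """Weight threshold as written in step (7.4): n*k*ms - n_ck * maxCw.

    n: number of feedback documents; k: target itemset length;
    ms: minimum support threshold; support_count: documents containing
    the candidate; max_other_weight: largest weight among other items.
    """
    return n * k * ms - support_count * max_other_weight

def can_prune(candidate_weight, threshold):
    """Step (7.1): prune a candidate whose weight is below the threshold."""
    return candidate_weight < threshold

# Toy numbers: 10 documents, target length 2, ms = 0.3.
threshold = kiwt(n=10, k=2, ms=0.3, support_count=4, max_other_weight=0.9)
prune_light = can_prune(2.0, threshold)   # weight too small to ever reach ms
prune_heavy = can_prune(3.0, threshold)   # weight large enough to keep
```

The intuition is that a k-itemset is frequent only if its total weight reaches n × k × ms, and extending a candidate can add at most the largest remaining item weight in each document that contains it. A candidate whose current weight is below the difference cannot grow into a frequent k-itemset, so it is safe to drop it before the join.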
II. As shown in Fig. 1, the Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback of this embodiment comprises the following steps:
(1) The Indonesian user query is translated into a Chinese query by the machine translation module and submitted to the search engine for an initial retrieval on the Internet, yielding the initial-retrieval result document set. The machine translation module uses the Bing machine translation interface, that is, the Microsoft Translator API; the search engine module may be an existing search engine such as Baidu or Google;
(2) The top $r$ Chinese documents of the cross-language initial-retrieval result document set are extracted and presented to the user;
(3) The user judges the relevance of these Chinese documents, yielding the user-feedback relevant document set; the total number of documents in this set is denoted $n$;
(4) The user-feedback relevant document set is preprocessed, that is, Chinese word segmentation, stop-word removal, feature word weight computation, and feature word extraction are performed, to build the initial-retrieval relevant document database;
The feature word weights are computed with the tf-idf method, where $tf_{m,n}$ denotes the number of occurrences of feature word $t_m$ in document $d_n$, $df_m$ denotes the number of documents containing feature word $t_m$, and $N$ denotes the total number of documents in the document collection;
(5) The initial-retrieval relevant document database is scanned to mine the fully weighted feature word candidate 1-itemsets $C_1$. The weight $w(C_1)$ of $C_1$ is computed, and the maximum weight $\mathrm{maxCw}_i(!C_1)$ among the items other than $C_1$ and the support count $n_{c_1}$ of $C_1$ are counted. With $ms$ the minimum support threshold, the value of $\mathrm{KIWT}(1,2)$ is computed as: $\mathrm{KIWT}(1,2) = n \times 1 \times ms - n_{c_1} \times \mathrm{maxCw}_i(!C_1)$;
(6) The support $\mathrm{FTISup}(C_1)$ of $C_1$ is computed. If $\mathrm{FTISup}(C_1) \ge ms$, the frequent 1-itemset $L_1$ is mined from the candidate 1-itemsets $C_1$ and added to the set $L$ of fully weighted feature word frequent itemsets. $\mathrm{FTISup}(C_1)$ is computed as: $\mathrm{FTISup}(C_1) = \dfrac{w(C_1)}{n \times 1}$;
(7) The $k$-itemsets are mined, where $k \ge 2$, through steps (7.1) to (7.7):
(7.1) The weight of each candidate $(k-1)$-itemset $C_{k-1}$ is compared with the value of $\mathrm{KIWT}(k-1,k)$, and every candidate $C_{k-1}$ with $w(C_{k-1}) < \mathrm{KIWT}(k-1,k)$ is pruned;
(7.2) The remaining candidate $(k-1)$-itemsets $C_{k-1}$ are combined by Apriori join to obtain $C_k$;
(7.3) When $k = 2$, the candidate 2-itemsets that contain no query term are pruned;
(7.4) The initial-retrieval relevant document database is scanned to count the maximum weight $\mathrm{maxCw}_i(!C_k)$ among the items other than $C_k$ and the support count $n_{c_k}$ of $C_k$, and to compute the weight $w(C_k)$ of $C_k$ and the value of $\mathrm{KIWT}(k-1,k)$, which is computed as: $\mathrm{KIWT}(k-1,k) = n \times k \times ms - n_{c_k} \times \mathrm{maxCw}_i(!C_k)$;
(7.5) Every candidate $C_k$ whose $n_{c_k}$ is 0 is pruned;
(7.6) For each remaining candidate $k$-itemset $C_k$, the support $\mathrm{FTISup}(C_k)$ is computed. If $\mathrm{FTISup}(C_k) \ge ms$, the frequent $k$-itemset $L_k$ is mined from the candidate $k$-itemsets $C_k$ and added to the set $L$ of fully weighted feature word frequent itemsets. $\mathrm{FTISup}(C_k)$ is computed as: $\mathrm{FTISup}(C_k) = \dfrac{w(C_k)}{n \times k}$;
(7.7) If $k$ exceeds the candidate itemset length threshold or the set of candidate $k$-itemsets is empty, mining ends; otherwise, steps (7.1) to (7.6) are repeated;
(8) The fully weighted feature word association rules that contain query terms are mined from the set $L$ of fully weighted feature word frequent itemsets to build the fully weighted association rule base, through steps (8.1) to (8.4):
(8.1) A fully weighted frequent $i$-itemset $tlL_i$ is taken from the set $L$ of fully weighted feature word frequent itemsets, and all proper subsets of $tlL_i$ are found;
(8.2) Two proper subsets $tlI_1$ and $tlI_2$ are taken arbitrarily from the set of proper subsets of $tlL_i$. When $tlI_1 \cap tlI_2 = \emptyset$ and $tlI_1 \cup tlI_2 = tlL_i$: if $\mathrm{FTARConf}(tlI_1 \to tlI_2) \ge mc$, the fully weighted feature word strong association rule $tlI_1 \to tlI_2$ is mined; if $\mathrm{FTARConf}(tlI_2 \to tlI_1) \ge mc$, the fully weighted feature word strong association rule $tlI_2 \to tlI_1$ is mined. Here $mc$ is the minimum confidence threshold, $tlI_1$ and $tlI_2$ are fully weighted feature word frequent itemsets that are proper subsets of $tlL_i$, and $\mathrm{FTARConf}(tlI_1 \to tlI_2)$ is the confidence of the fully weighted feature word association rule $tlI_1 \to tlI_2$, computed as:
$\mathrm{FTARConf}(tlI_1 \to tlI_2) = \dfrac{\mathrm{FTISup}(tlL_i)}{\mathrm{FTISup}(tlI_1)}$
where $\mathrm{FTISup}(tlL_i)$ is the support of the fully weighted frequent itemset $tlL_i$ and $\mathrm{FTISup}(tlI_1)$ is the support of the fully weighted frequent itemset $tlI_1$;
(8.3) Step (8.2) is repeated until every proper subset in the set of proper subsets of $tlL_i$ has been taken out exactly once, and then step (8.4) is performed;
(8.4) Steps (8.1) to (8.3) are repeated until every itemset in the set $L$ of fully weighted feature word frequent itemsets has been taken out exactly once, at which point mining ends;
(9) The cross-language expansion words related to the original query are extracted from the fully weighted association rule base to build the expansion word dictionary;
(10) The original query and the expansion words are combined and submitted to the search engine for a second retrieval, yielding the final Chinese result documents;
(11) The final Chinese result documents are submitted to the machine translation module and translated into Indonesian documents; finally, both the final Chinese result documents and their Indonesian translations are returned to the user.
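The level-wise mining of steps (5) to (7) can be sketched in simplified form. This is a hedged illustration rather than the patented procedure: it omits the KIWT pruning of step (7.1), uses a relaxed Apriori-style join (any two (k-1)-candidates whose union has exactly k items), and assumes an itemset's weight in a document is the sum of its item weights when the document contains all of its items. Candidates, not only frequent itemsets, are carried to the next level, as in step (7.2). All function names and the toy data are hypothetical.

```python
from itertools import combinations

def support_count(docs, itemset):
    """Number of documents containing every item of the itemset."""
    return sum(1 for doc in docs if all(t in doc for t in itemset))

def ftisup(docs, itemset):
    """Assumed fully weighted support: summed item weights / (n * k)."""
    weight = sum(sum(doc[t] for t in itemset)
                 for doc in docs if all(t in doc for t in itemset))
    return weight / (len(docs) * len(itemset))

def join(candidates, k):
    """Relaxed join: unions of two (k-1)-candidates that have k items."""
    joined = set()
    for a, b in combinations(sorted(candidates), 2):
        union = tuple(sorted(set(a) | set(b)))
        if len(union) == k:
            joined.add(union)
    return joined

def mine_frequent_itemsets(docs, query_terms, ms, max_len=3):
    query = set(query_terms)
    candidates = {(t,) for doc in docs for t in doc}   # step (5)
    frequent = {}
    k = 1
    while candidates and k <= max_len:                 # step (7.7)
        for c in candidates:                           # steps (6), (7.6)
            sup = ftisup(docs, c)
            if sup >= ms:
                frequent[c] = sup
        k += 1
        nxt = join(candidates, k)                      # step (7.2)
        if k == 2:                                     # step (7.3)
            nxt = {c for c in nxt if query & set(c)}
        # step (7.5): drop candidates that occur in no document
        candidates = {c for c in nxt if support_count(docs, c) > 0}
    return frequent

docs = [
    {"q": 0.9, "a": 0.8, "b": 0.1},
    {"q": 0.7, "a": 0.6},
    {"b": 0.2, "c": 0.3},
]
frequent = mine_frequent_itemsets(docs, {"q"}, ms=0.4, max_len=2)
```

In this toy run the frequent itemsets are ("q",), ("a",), and ("a", "q"). The low-weight word "b" co-occurs with the query term but never reaches the support threshold.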
3. As shown in Figures 2 to 4, the retrieval system suited to the Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback of this embodiment comprises the following 4 modules and 3 databases:
Machine translation module: this module uses the Bing machine translation interface, i.e. the Microsoft Translator API, to translate the Indonesian user query into a Chinese query and to translate the final retrieval result Chinese documents into Indonesian documents for the user;
Search engine module: this module is a search engine that retrieves on the Internet with the translated Chinese query, obtaining the cross-language initial retrieval result document set;
All-weighted association pattern mining and user relevance feedback module: submits the top r documents of the cross-language initial retrieval result document set to the user, who judges their relevance and thereby determines the initial retrieval relevant document database; all-weighted association rule mining is then applied to this database to mine expansion terms related to the query, realizing cross-language query expansion; the expansion terms are combined with the original query and retrieval is performed again to obtain the final retrieval result Chinese documents;
Final result display module: submits the final retrieval result Chinese documents to the machine translation module for translation into Indonesian documents, and returns both the Chinese and the Indonesian final retrieval result documents to the user;
Initial retrieval relevant document database;
All-weighted association rule base;
Expansion term library.
Wherein the all-weighted association pattern mining and user relevance feedback module comprises the following 5 modules:
User click-behavior relevance feedback extraction module: captures the document download behavior produced while the user browses the initial retrieval result document set, and extracts the initial retrieval documents downloaded by the user to build the user relevance feedback document set;
Document preprocessing module: preprocesses the user relevance feedback document set, i.e. performs Chinese word segmentation, removes stop words, computes feature term weights and extracts feature terms, and builds the initial retrieval relevant document database;
All-weighted association rule mining module: performs all-weighted association rule mining on the initial retrieval relevant document database, mines the all-weighted feature term frequent itemsets and association rule patterns that contain original query terms, and builds the all-weighted association rule base;
Cross-language query expansion term generation module: extracts expansion terms related to the original query from the all-weighted association rule base and builds the expansion term library;
Cross-language query expansion implementation module: extracts Chinese expansion terms from the expansion term library, combines them with the original query into a new query, and submits it to the search engine again for retrieval on the Internet, obtaining the final retrieval result Chinese documents.
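The module layout described above can be sketched as a small pipeline of injected callables. This is only an illustrative sketch: the class name `CrossLanguagePipeline`, its field names and the space-joined query format are hypothetical, and the translation, search, relevance judgment and mining steps are passed in as plain functions rather than bound to any real translation or search API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class CrossLanguagePipeline:
    """Hypothetical wiring of the 4 modules; every callable is supplied by the caller."""
    translate_id_to_zh: Callable[[str], str]           # machine translation module (query)
    translate_zh_to_id: Callable[[str], str]           # machine translation module (documents)
    search: Callable[[str], List[str]]                 # search engine module
    judge_relevant: Callable[[Sequence[str]], List[str]]      # user relevance feedback
    mine_expansion_terms: Callable[[List[str], str], List[str]]  # rule mining + term extraction
    top_r: int = 100                                   # number of top documents shown to the user

    def run(self, indonesian_query: str) -> Tuple[List[str], List[str]]:
        zh_query = self.translate_id_to_zh(indonesian_query)
        initial = self.search(zh_query)[: self.top_r]  # top r initial retrieval documents
        relevant = self.judge_relevant(initial)        # user feedback relevant document set
        expansion = self.mine_expansion_terms(relevant, zh_query)
        new_query = " ".join([zh_query, *expansion])   # original query + expansion terms
        final_zh = self.search(new_query)              # second retrieval pass
        final_id = [self.translate_zh_to_id(doc) for doc in final_zh]
        return final_zh, final_id
```

Passing the modules in as functions keeps the sketch independent of any particular search engine or translation service, which the text itself treats as replaceable components.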
4. In combination with the technical scheme, the beneficial effects of the present invention are further illustrated by experiment below:
Because research on search engines is broad in scope and involves many factors, the present invention is instead evaluated in an Indonesian-Chinese cross-language retrieval system based on the vector space model; this experiment is therefore a simulation experiment. Source programs implementing the method and system of the invention were written to carry out the experiment. The Chinese corpus of NTCIR-5 CLIR, the standard cross-language information retrieval test collection of the multilingual processing evaluation workshop sponsored by Japan's National Institute of Informatics, is used as the experimental corpus.
NTCIR-5 CLIR provides a query set, a document set and a result set. The query set has 50 query topics, each with 4 field types: TITLE, DESC, NARR and CONC. This experiment uses the TITLE and DESC types. A TITLE query briefly describes the topic with nouns and noun phrases and is a short query; a DESC query briefly describes the topic in sentence form and is a long query. The result set has 2 evaluation criteria, Rigid and Relax: under the Rigid standard an answer must be highly relevant or relevant to the original query; under the Relax standard an answer may be highly relevant, relevant or partially relevant.
To carry out the Indonesian-Chinese cross-language information retrieval experiment, professional translators from a translation agency were invited to manually translate the 50 Chinese query topics of NTCIR-5 CLIR into Indonesian queries.
In this experiment, the Chinese lexical analysis system ICTCLAS, developed by the Institute of Computing Technology, Chinese Academy of Sciences, is used to preprocess the Chinese experimental corpus and the translated Chinese queries. Feature term weights are computed with the traditional tf-idf method. The weight w_{i,q} of a translated query term is computed as shown in formula (4) (from G. Salton, C. Buckley. Term-weighting approaches in automatic text retrieval [J]. Information Processing & Management, 1988, 24(5): 513-523).
Wherein tf_{i,q} is the raw frequency with which the query term occurs in the query text, N is the total number of initial retrieval relevant documents, and df_i is the number of initial retrieval relevant documents containing the i-th query term.
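As a rough illustration of how a query term weight can be derived from these three quantities, the sketch below uses the common tf × log(N / df) form. That form is an assumption made for illustration only; it is one of several tf-idf variants discussed by Salton and Buckley and may differ from formula (4). The function name `query_term_weight` is hypothetical.

```python
import math


def query_term_weight(tf: int, n_docs: int, df: int) -> float:
    """Assumed tf-idf variant: raw term frequency times log inverse document frequency.

    tf     -- raw frequency of the term in the query text (tf_{i,q})
    n_docs -- total number of initial retrieval relevant documents (N)
    df     -- number of those documents that contain the term (df_i)
    """
    if tf <= 0 or df <= 0 or n_docs <= 0:
        return 0.0  # a term that never occurs carries no weight
    return tf * math.log(n_docs / df)
```

Under this variant a term that occurs in every document gets weight 0, so only terms that discriminate between documents contribute to the query vector.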
In this experiment, the weight of a Chinese expansion term is set as follows: the confidence of the matrix-weighted association rule is used as the weight of the expansion term; when several association rules contain the same term, the highest of their confidences is taken as the weight of that expansion term.
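The weighting rule just described can be sketched as follows. The rule representation (antecedent, consequent, confidence) and the function name `expansion_term_weights` are hypothetical, and treating every non-query term of a rule as an expansion term is an assumption of this sketch.

```python
from typing import AbstractSet, Dict, FrozenSet, Iterable, Tuple

# One mined rule: (antecedent itemset, consequent itemset, confidence).
Rule = Tuple[FrozenSet[str], FrozenSet[str], float]


def expansion_term_weights(rules: Iterable[Rule],
                           query_terms: AbstractSet[str]) -> Dict[str, float]:
    """Weight each expansion term by the highest confidence of any rule containing it."""
    weights: Dict[str, float] = {}
    for antecedent, consequent, confidence in rules:
        # Assumption: every term of the rule that is not a query term is an expansion term.
        for term in (antecedent | consequent) - query_terms:
            weights[term] = max(confidence, weights.get(term, 0.0))
    return weights
```

Taking the maximum rather than the sum keeps a term's weight bounded by the confidence scale, however many rules repeat it.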
Experimental evaluation baselines:
(1) Monolingual retrieval baseline (Monolingual Baseline, MB): the retrieval result obtained by retrieving Chinese documents directly with the Chinese queries.
(2) Cross-language retrieval baseline (Cross-language Baseline, CLB): the first-pass cross-language retrieval result without any relevance feedback, i.e. the result obtained by retrieving Chinese documents with the Indonesian queries after translation by the machine translation system.
(3) The traditional cross-language retrieval method based on pseudo-relevance feedback, CLR_PRF (Jianfeng Gao, Jianyun Nie, Jian Zhang, et al. TREC-9 CLIR Experiments at MSRCN [C]. In: Proc. of the 9th Text Retrieval Evaluation Conference, 2001: 343-353; Wu Dan, He Daqing, Wang Huilin. Cross-language query expansion based on pseudo-relevance [J]. Journal of the China Society for Scientific and Technical Information, 2010, 29(2): 232-239). In this experiment, the top 20 cross-language initial retrieval documents are taken to build the initial retrieval relevant document set, and the 20 feature terms with the highest weights (in descending order) are extracted as expansion terms.
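A minimal sketch of the pseudo-relevance feedback baseline parameters (top 20 documents, top 20 terms) is given below. It is not the CLR_PRF implementation from the cited papers: summing a term's weights over the top documents and excluding query terms are assumptions of this sketch, and the function name `prf_expansion_terms` is hypothetical.

```python
from collections import Counter
from typing import AbstractSet, Dict, List, Sequence


def prf_expansion_terms(ranked_docs: Sequence[Dict[str, float]],
                        query_terms: AbstractSet[str],
                        top_docs: int = 20,
                        top_terms: int = 20) -> List[str]:
    """Pick expansion terms from the top-ranked documents without any user judgment.

    ranked_docs -- documents in rank order, each a mapping feature term -> weight
    """
    totals: Counter = Counter()
    for doc in ranked_docs[:top_docs]:          # pseudo-relevant set: the top documents
        for term, weight in doc.items():
            if term not in query_terms:         # assumption: query terms are not re-added
                totals[term] += weight          # assumption: aggregate by summing weights
    # Sort by descending total weight; ties are broken alphabetically for determinism.
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered[:top_terms]]
```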
Experimental parameters of the method of the invention: the top 100 cross-language initial retrieval documents are submitted to the user, who judges their relevance to determine the initial retrieval relevant document set. In this experiment, the documents among the top 100 initial retrieval results that appear in the known result set are regarded as the user relevance feedback and are extracted to build the user's initial retrieval relevant document set; finally, all-weighted association rule mining is applied to the initial retrieval relevant document set to mine expansion terms and realize query expansion.
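The simulated user feedback used in the experiment amounts to intersecting the top 100 results with the known relevant documents. A minimal sketch, with the hypothetical function name `simulated_feedback`:

```python
from typing import AbstractSet, List, Sequence


def simulated_feedback(ranked_doc_ids: Sequence[str],
                       known_relevant: AbstractSet[str],
                       top_n: int = 100) -> List[str]:
    """Return the top-n retrieved documents that appear in the known result set.

    The rank order of the initial retrieval is preserved.
    """
    return [doc_id for doc_id in ranked_doc_ids[:top_n] if doc_id in known_relevant]
```

Whether `known_relevant` holds the Rigid or the Relax judgments is left to the caller; the text does not state which standard was used for the simulation.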
Source programs were written, and Indonesian-Chinese cross-language text retrieval was run on the NTCIR-5 CLIR test set with the method of the invention and with the baseline methods MB, CLB and CLR_PRF, in order to compare and analyze their cross-language retrieval performance.
(1) Baseline experiment results
The experimental source programs were run, submitting the title part and the description part of the 50 NTCIR-5 CLIR query topics for Chinese monolingual retrieval, Indonesian-Chinese cross-language retrieval, and traditional Indonesian-Chinese cross-language retrieval based on pseudo-relevance feedback, i.e. running the baseline algorithms MB, CLB and CLR_PRF. The retrieval results of the 3 baseline methods are shown in Table 1.
Table 1:
The results in Table 1 show that each evaluation index of the Indonesian-Chinese cross-language retrieval baseline CLB and of the traditional CLR_PRF method reaches only about 30% to 60% of the monolingual retrieval baseline MB, and that the long description-type queries retrieve better than the short title-type queries. For the CLR_PRF algorithm, all evaluation indexes except MAP improve over the baseline CLB, by about 5% to 30%, whereas MAP generally falls, by as much as 46%. These results show that cross-language retrieval is affected by query translation: its retrieval performance is generally low and still falls short of the corresponding monolingual retrieval performance.
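The evaluation indexes used in this section, MAP and precision at a cutoff (P@5, P@15), can be computed as sketched below under their standard definitions. The function names are hypothetical; the set of relevant documents passed in would be taken from the Rigid or the Relax judgments, depending on the evaluation standard.

```python
from typing import AbstractSet, Dict, List, Sequence


def precision_at_k(ranked: Sequence[str], relevant: AbstractSet[str], k: int) -> float:
    """P@k: fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def average_precision(ranked: Sequence[str], relevant: AbstractSet[str]) -> float:
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)  # relevant documents never retrieved count as 0


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, AbstractSet[str]]) -> float:
    """MAP: mean of the average precision over all query topics in `runs`."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, qrels.get(topic, set()))
               for topic, ranked in runs.items()) / len(runs)
```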
(2) Retrieval performance comparison between the method of the invention and the baseline algorithms
Using the title type and the description type of the 50 NTCIR-5 CLIR query topics, retrieval performance was tested in two settings, with the support threshold varying and with the confidence threshold varying, and compared with the Indonesian-Chinese cross-language retrieval baseline CLB, the traditional CLR_PRF method and the monolingual retrieval baseline MB. The retrieval performance comparison when the support threshold varies is shown in Table 2; the MAP, P@5 and P@15 values of the retrieval results when the confidence threshold varies are shown in Table 3.
Table 2:
Table 3:
The results in Table 2 show that, when the all-weighted support threshold varies, every index value of the retrieval results of the method of the invention is higher than that of the Indonesian-Chinese cross-language retrieval baseline CLB and of the traditional pseudo-relevance cross-language retrieval method CLR_PRF, and all reach 60% to 102% of the monolingual retrieval baseline MB. Compared with the baseline CLB, the improvement is at most 91.55% (the P@5 value under Rigid evaluation) and at least 36.06% (the P@15 value under Relax evaluation). Compared with the CLR_PRF method, the improvement is at most 244.97% (the MAP value of the description query type under Rigid evaluation) and at least 32.89%; in particular, the MAP value of the description query type under Rigid evaluation reaches and exceeds the monolingual retrieval baseline MB by 2%. In addition, the description-type query topics retrieve better than the title type, and their MAP values show the largest improvement.
The results in Table 3 show that, when the confidence threshold varies, the invention obtains good retrieval results: every index value is higher than that of the baseline CLB and of the CLR_PRF algorithm, and all reach 58.07% to 101.2% of the monolingual retrieval baseline MB. The description-type query topics again retrieve better than the title type, and their MAP values show the largest improvement.
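Two different percentages are quoted in this comparison: the improvement over a baseline, and the share of the monolingual baseline that is reached. A short sketch of both, assuming the usual definitions (the function names are hypothetical):

```python
def relative_improvement(value: float, baseline: float) -> float:
    """Percentage by which `value` exceeds `baseline` (negative if it is lower)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


def percent_of_reference(value: float, reference: float) -> float:
    """`value` expressed as a percentage of `reference`."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return value / reference * 100.0
```

Under these definitions, reaching 102% of MB is the same statement as exceeding MB by 2%, which is how the Rigid MAP result for description queries is reported.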
In summary, the present invention has good value for application and wider adoption.

Claims (5)

1. An Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback, characterized by comprising the following steps:
(1) the Indonesian user query is translated into a Chinese query by the machine translation module and submitted to the search engine for preliminary retrieval on the Internet, obtaining the initial retrieval result document set;
(2) the top r Chinese documents of the cross-language initial retrieval result document set are extracted and submitted to the user;
(3) the user judges the relevance of the Chinese documents of the cross-language initial retrieval result document set, obtaining the user feedback relevant document set, the total number of document records in the set being n;
(4) the user feedback relevant document set is preprocessed, i.e. Chinese word segmentation, stop word removal, feature term weight computation and feature term extraction are performed, and the initial retrieval relevant document database is built;
(5) scan the initial retrieval relevant document database and mine the all-weighted feature term candidate 1-itemsets C_1; compute the weight w(C_1) of C_1, and count the maximum weight maxCw_i(!C_1) of the items other than C_1 and the support count n_{C1} of C_1; with ms denoting the minimum support threshold, compute the value of KIWT(1,2) by the formula: KIWT(1,2) = n × 1 × ms - n_{C1} × maxCw_i(!C_1);
(6) compute the support FTISup(C_1) of C_1; if FTISup(C_1) ≥ ms, mine the frequent 1-itemset L_1 from the candidate 1-itemsets C_1 and add it to the all-weighted feature term frequent itemset collection L; the computing formula of FTISup(C_1) is:
(7) mine k-itemsets, where k ≥ 2, comprising steps (7.1) to (7.7):
(7.1) compare the weight W(C_{k-1}) of each candidate (k-1)-itemset C_{k-1} with the value KIWT(k-1, k), and prune every candidate C_{k-1} with W(C_{k-1}) < KIWT(k-1, k);
(7.2) perform the Apriori join on the remaining candidate (k-1)-itemsets C_{k-1} to obtain C_k;
(7.3) when k = 2, prune the candidate 2-itemsets that contain no query term;
(7.4) scan the initial retrieval relevant document database, count the maximum weight maxCw_i(!C_k) of the items other than C_k and the support count n_{Ck} of C_k, and compute the weight w(C_k) of C_k and the value of KIWT(k-1, k), whose computing formula is: KIWT(k-1, k) = n × k × ms - n_{Ck} × maxCw_i(!C_k);
(7.5) prune the candidates C_k whose n_{Ck} is 0;
(7.6) for the remaining candidate k-itemsets C_k, compute the support FTISup(C_k); if FTISup(C_k) ≥ ms, mine the frequent k-itemset L_k from the candidate k-itemsets C_k and add it to the all-weighted feature term frequent itemset collection L; the computing formula of FTISup(C_k) is:
(7.7) if k exceeds the candidate itemset length threshold or the candidate k-itemsets are the empty set, mining ends; otherwise, repeat steps (7.1) to (7.6);
(8) mine the all-weighted feature term association rules containing query terms from the all-weighted feature term frequent itemset collection L, and build the all-weighted association rule base;
(9) extract the cross-language expansion terms related to the original query from the all-weighted association rule base, and build the expansion term library;
(10) combine the original query with the expansion terms and submit the combination to the search engine for a second retrieval, obtaining the final retrieval result Chinese documents;
(11) submit the final retrieval result Chinese documents to the machine translation module for translation into Indonesian documents, and finally return both the final retrieval result Chinese documents and the final retrieval result Indonesian documents to the user.
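Steps (5) to (7) can be sketched in Python as below. This is a minimal illustration rather than the claimed implementation. Several points are assumptions of the sketch: the support is taken as FTISup(C) = w(C) / (n × k) for a k-itemset; w(C) sums the weights of the items of C over the documents that contain all of them; maxCw_i(!C) is taken over the whole database; the pruning threshold is computed with the expression n × k × ms - n_C × maxCw_i(!C) as written in steps (5) and (7.4); and the join simply unites pairs of surviving itemsets. The names `stats` and `mine_frequent_itemsets` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Doc = Dict[str, float]  # one document: feature term -> weight


def stats(itemset: FrozenSet[str], docs: List[Doc]) -> Tuple[float, int, float]:
    """Return (w(C), support count n_C, maximum weight of any item outside C)."""
    weight, count = 0.0, 0
    for doc in docs:
        if all(term in doc for term in itemset):
            weight += sum(doc[term] for term in itemset)
            count += 1
    others = [w for doc in docs for term, w in doc.items() if term not in itemset]
    return weight, count, max(others, default=0.0)


def mine_frequent_itemsets(docs: List[Doc], query_terms: Set[str],
                           ms: float, max_len: int = 3) -> Dict[FrozenSet[str], float]:
    """Level-wise mining; returns frequent itemset -> assumed support."""
    n = len(docs)
    frequent: Dict[FrozenSet[str], float] = {}
    candidates = {frozenset([term]) for doc in docs for term in doc}
    k = 1
    while candidates and k <= max_len:           # step (7.7): length limit or no candidates
        survivors = []
        for cand in candidates:
            weight, count, max_other = stats(cand, docs)
            if count == 0:                       # step (7.5): never occurs together
                continue
            support = weight / (n * k)           # ASSUMED support formula
            if support >= ms:                    # steps (6) and (7.6)
                frequent[cand] = support
            kiwt = n * k * ms - count * max_other
            if weight >= kiwt:                   # step (7.1): keep only promising candidates
                survivors.append(cand)
        k += 1
        candidates = {a | b for a, b in combinations(survivors, 2)   # step (7.2)
                      if len(a | b) == k}
        if k == 2:                               # step (7.3): 2-itemsets need a query term
            candidates = {c for c in candidates if c & query_terms}
    return frequent
```

Because item weights differ from document to document, a subset of a frequent itemset is not guaranteed to be frequent, which is why the sketch joins the candidates that survive the weight threshold rather than only the frequent ones.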
2. The Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to claim 1, characterized in that the feature term weights in step (4) are computed with the tf-idf method, its computing formula being: wherein tf_{m,n} denotes the number of occurrences of feature term t_m in document d_n, df_m denotes the number of documents containing feature term t_m, and N denotes the total number of documents in the document collection.
3. The Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to claim 1, characterized in that step (8) comprises steps (8.1) to (8.4):
(8.1) take an all-weighted frequent i-itemset tlL_i from the all-weighted feature term frequent itemset collection L, and find all proper subsets of tlL_i;
(8.2) take any two proper subsets tlI_1 and tlI_2 from the proper subset collection of tlL_i; when tlI_1 ∩ tlI_2 = ∅ and tlI_1 ∪ tlI_2 = tlL_i: if FTARConf(tlI_1→tlI_2) ≥ mc, mine the all-weighted feature term strong association rule tlI_1→tlI_2; if FTARConf(tlI_2→tlI_1) ≥ mc, mine the all-weighted feature term strong association rule tlI_2→tlI_1; here mc is the minimum confidence threshold, tlI_1 and tlI_2 are all-weighted feature term frequent itemsets that are proper subsets of tlL_i, and FTARConf(tlI_1→tlI_2) is the confidence of the all-weighted feature term association rule tlI_1→tlI_2, computed as: FTARConf(tlI_1→tlI_2) = FTISup(tlL_i) / FTISup(tlI_1),
wherein FTISup(tlL_i) is the support of the all-weighted frequent itemset tlL_i, and FTISup(tlI_1) is the support of the all-weighted frequent itemset tlI_1;
(8.3) repeat step (8.2) until every proper subset in the proper subset collection of the all-weighted frequent i-itemset tlL_i has been taken out exactly once, then go to step (8.4);
(8.4) repeat steps (8.1) to (8.3) until every itemset in the all-weighted feature term frequent itemset collection L has been taken out exactly once, whereupon mining ends.
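Steps (8.1) to (8.4) can be sketched as follows. The sketch assumes that the supports of all frequent itemsets are already available in a dictionary and computes the confidence as the support of the whole itemset divided by the support of the antecedent; the function name `mine_rules` and the tuple layout of a rule are hypothetical.

```python
from itertools import combinations
from typing import AbstractSet, Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]  # antecedent, consequent, confidence


def mine_rules(supports: Dict[FrozenSet[str], float],
               query_terms: AbstractSet[str], mc: float) -> List[Rule]:
    """Generate strong rules I1 -> I2 from each frequent itemset containing a query term."""
    rules: List[Rule] = []
    for itemset, support in supports.items():
        if len(itemset) < 2 or not (itemset & query_terms):
            continue
        items = sorted(itemset)
        # Enumerating every non-empty proper subset as antecedent covers both
        # directions (I1 -> I2 and I2 -> I1) of each complementary pair.
        for size in range(1, len(items)):
            for left in combinations(items, size):
                antecedent = frozenset(left)
                consequent = itemset - antecedent
                # Both sides must themselves be frequent itemsets with known support.
                if antecedent not in supports or consequent not in supports:
                    continue
                if supports[antecedent] <= 0:
                    continue
                confidence = support / supports[antecedent]
                if confidence >= mc:
                    rules.append((antecedent, consequent, confidence))
    return rules
```

Since each subset is visited once as an antecedent, every itemset and every complementary pair of proper subsets is processed exactly once, which matches the termination conditions of steps (8.3) and (8.4).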
4. A retrieval system suited to the Indonesian-Chinese cross-language retrieval method fusing association patterns and user feedback according to claim 1, characterized by comprising the following 4 modules and 3 databases:
Machine translation module: this module uses the Bing machine translation interface to translate the Indonesian user query into a Chinese query and to translate the final retrieval result Chinese documents into Indonesian documents for the user;
Search engine module: this module is a search engine that retrieves on the Internet with the translated Chinese query, obtaining the cross-language initial retrieval result document set;
All-weighted association pattern mining and user relevance feedback module: submits the top r documents of the cross-language initial retrieval result document set to the user, who judges their relevance and thereby determines the initial retrieval relevant document database; all-weighted association rule mining is then applied to this database to mine expansion terms related to the query, realizing cross-language query expansion; the expansion terms are combined with the original query and retrieval is performed again to obtain the final retrieval result Chinese documents;
Final result display module: submits the final retrieval result Chinese documents to the machine translation module for translation into Indonesian documents, and returns both the Chinese and the Indonesian final retrieval result documents to the user;
Initial retrieval relevant document database;
All-weighted association rule base;
Expansion term library.
5. The retrieval system according to claim 4, characterized in that the all-weighted association pattern mining and user relevance feedback module comprises the following 5 modules:
User click-behavior relevance feedback extraction module: captures the document download behavior produced while the user browses the initial retrieval result document set, and extracts the initial retrieval documents downloaded by the user to build the user relevance feedback document set;
Document preprocessing module: preprocesses the user relevance feedback document set, i.e. performs Chinese word segmentation, removes stop words, computes feature term weights and extracts feature terms, and builds the initial retrieval relevant document database;
All-weighted association rule mining module: performs all-weighted association rule mining on the initial retrieval relevant document database, mines the all-weighted feature term frequent itemsets and association rule patterns that contain original query terms, and builds the all-weighted association rule base;
Cross-language query expansion term generation module: extracts expansion terms related to the original query from the all-weighted association rule base and builds the expansion term library;
Cross-language query expansion implementation module: extracts Chinese expansion terms from the expansion term library, combines them with the original query into a new query, and submits it to the search engine again for retrieval on the Internet, obtaining the final retrieval result Chinese documents.
CN201610827858.4A 2016-09-18 2016-09-18 Merge the Indonesia's Chinese cross-language retrieval method and system of association mode and user feedback Expired - Fee Related CN106484781B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610827858.4A CN106484781B (en) 2016-09-18 2016-09-18 Merge the Indonesia's Chinese cross-language retrieval method and system of association mode and user feedback

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610827858.4A CN106484781B (en) 2016-09-18 2016-09-18 Merge the Indonesia's Chinese cross-language retrieval method and system of association mode and user feedback

Publications (2)

Publication Number Publication Date
CN106484781A true CN106484781A (en) 2017-03-08
CN106484781B CN106484781B (en) 2019-03-15

Family

ID=58267229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610827858.4A Expired - Fee Related CN106484781B (en) 2016-09-18 2016-09-18 Merge the Indonesia's Chinese cross-language retrieval method and system of association mode and user feedback

Country Status (1)

Country Link
CN (1) CN106484781B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526839A (en) * 2017-09-08 2017-12-29 广西财经学院 Based on weight positive negative mode completely consequent extended method is translated across language inquiry
CN109739953A (en) * 2018-12-30 2019-05-10 广西财经学院 The text searching method extended based on chi-square analysis-Confidence Framework and consequent
CN109992644A (en) * 2019-03-26 2019-07-09 苏州大成有方数据科技有限公司 A kind of intellectual property type of structured text intelligent semantic reconfiguration system
CN111125102A (en) * 2019-12-16 2020-05-08 北京明略软件系统有限公司 Data query method and device based on index data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279570A (en) * 2013-06-19 2013-09-04 广西教育学院 Text database oriented matrix weighting negative pattern mining method
CN104182527A (en) * 2014-08-27 2014-12-03 广西教育学院 Partial-sequence itemset based Chinese-English test word association rule mining method and system
CN104216874A (en) * 2014-09-22 2014-12-17 广西教育学院 Chinese interword weighing positive and negative mode excavation method and system based on relevant coefficients

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DEBASIS GANGULY ET AL.: "Cross-Lingual Topical Relevance Models", 《24TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS》 *
XUWEN WANG ET AL.: "LDA BASED PSEUDO RELEVANCE FEEDBACK FOR CROSS LANGUAGE INFORMATION RETRIEVAL", 《PROCEEDINGS OF IEEE CCIS2012》 *
HUANG MINGXUAN, YAN XIAOWEI, ET AL.: "Pseudo-relevance feedback query expansion based on matrix-weighted association rule mining", 《JOURNAL OF SOFTWARE》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526839A (en) * 2017-09-08 2017-12-29 广西财经学院 Based on weight positive negative mode completely consequent extended method is translated across language inquiry
CN107526839B (en) * 2017-09-08 2019-09-10 广西财经学院 Consequent extended method is translated across language inquiry based on weight positive negative mode completely
CN109739953A (en) * 2018-12-30 2019-05-10 广西财经学院 The text searching method extended based on chi-square analysis-Confidence Framework and consequent
CN109739953B (en) * 2018-12-30 2021-07-20 广西财经学院 Text retrieval method based on chi-square analysis-confidence framework and back-part expansion
CN109992644A (en) * 2019-03-26 2019-07-09 苏州大成有方数据科技有限公司 A kind of intellectual property type of structured text intelligent semantic reconfiguration system
CN109992644B (en) * 2019-03-26 2022-07-12 苏州大成有方数据科技有限公司 Intelligent semantic reconstruction system for intellectual property structured text
CN111125102A (en) * 2019-12-16 2020-05-08 北京明略软件系统有限公司 Data query method and device based on index data
CN111125102B (en) * 2019-12-16 2023-03-21 北京明略软件系统有限公司 Data query method and device based on index data

Also Published As

Publication number Publication date
CN106484781B (en) 2019-03-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190315

Termination date: 20190918