CN102117284A - Method for retrieving cross-language knowledge - Google Patents

Info

Publication number
CN102117284A
Authority
CN
China
Prior art keywords
language
verb
index
object
retrieval
Prior art date
Application number
CN2009102439934A
Other languages
Chinese (zh)
Inventor
吴祖林
赵琦
邱李豪
高建忠
Original Assignee
安世亚太科技(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 安世亚太科技(北京)有限公司
Priority to CN2009102439934A priority Critical patent/CN102117284A/en
Publication of CN102117284A publication Critical patent/CN102117284A/en

Abstract

The invention provides a method for cross-language knowledge retrieval, comprising the following steps: 10) semantically analyzing a source-language query to obtain a source-language retrieval index, which has a "verb + object" structure formed from the verb-object construction of the source-language query; 20) translating the source-language retrieval index into a target-language retrieval index; and 30) matching a target-language document index against the target-language retrieval index, wherein the target-language retrieval index has a "verb + object" structure formed from the verb-object constructions of a target-language document library, obtained by semantically analyzing that library. With this method, cross-language knowledge can be retrieved efficiently and accurately.

Description

A method for cross-language knowledge retrieval

Technical Field

[0001] The present invention relates to the field of computer retrieval, and in particular to a method for cross-language knowledge retrieval.

Background

[0002] With the development of information technology, people increasingly acquire knowledge by searching electronic documents. The knowledge a user needs, however, may reside in documents written in different languages, while users prefer to interact with electronic systems in their native language. This creates a need for cross-language knowledge retrieval and extraction.

[0003] Cross-language retrieval means that a user retrieves documents expressed in one natural language (the target language) using search terms in another natural language (the source language). It lets users formulate a query in a language they know and then use that query to retrieve documents written in any language other than the query language.

[0004] Common approaches to cross-language retrieval include document translation and query translation.

[0005] Document translation converts the language of the documents (the target language) into the query language (the source language) before retrieval. Its advantages are that the results returned to the user are in the source (query) language, so the user can readily select and use them, and that translation at the document level has a broader context that can be used to resolve translation ambiguity. However, document translation requires all searchable information to be converted into another language, and the accuracy of most existing machine translation systems is still unsatisfactory and short of practical use. Translating every document in a database from the target language into the source language also takes an enormous and costly amount of work, and rebuilding the index over a large body of translated data is not cheap either. Document translation is therefore worthwhile only when the amount of searchable content is limited. At present it lags far behind query translation in both research and practice.

[0006] Query translation translates the user's query into each language the retrieval system supports and then submits the multilingual queries to the system's matching module to retrieve documents in the corresponding languages. It is currently the most commonly used approach to cross-language retrieval. Its advantage is that only the query is translated, so the translation volume is small and translation is fast. Its main drawbacks are: 1. the results are returned in the target language, which makes the retrieved information harder for the user to use; 2. queries are usually short and carry little context, so ambiguity is hard to resolve; each query term is replaced by all of its possible translations, and translation ambiguity becomes severe. Controlling translation ambiguity is therefore a key problem in designing an effective query translation method.

[0007] Query translation can be implemented with dictionary-based methods, corpus-based methods, hybrid dictionary-corpus methods, and so on. Dictionary-based query translation usually just translates the keywords of the user's query one by one; it cannot use the query context to resolve ambiguity, so the precision of the results is low. Corpus-based query translation can obtain translations of certain phrases or short sentences in the query from a corpus and can remove part of the ambiguity, but it is limited by the size and content of the corpus. It often obtains only one or a few translations of the query keywords and cannot return results for synonyms of those keywords, so recall is low.

Summary of the Invention

[0008] The technical problem to be solved by the present invention is to improve the precision of cross-language knowledge retrieval.

[0009] To solve the above problem, according to one aspect of the invention, a method for cross-language knowledge retrieval is provided, comprising the following steps:

[0010] 10) performing semantic analysis on a source-language query to obtain a source-language retrieval index, wherein the source-language retrieval index is the "verb + object" formed by the verb-object construction of the source-language query;

[0011] 20) translating the source-language retrieval index into a target-language retrieval index;

[0012] 30) matching a target-language document index against the target-language retrieval index, wherein the target-language retrieval index is the "verb + object" formed by the verb-object constructions in a target-language document library, obtained by semantic analysis of that library.

[0013] In the above method, after step 10), the method further comprises the following step:

[0014] 11) performing synonym expansion on the source-language retrieval index.

[0015] In the above method, after step 11), the method further comprises the following step:

[0016] 12) verifying the source-language retrieval index.

[0017] In the above method, step 20) uses a "verb + object" bilingual dictionary, wherein the "verb + object" bilingual dictionary contains source-language "verb + object" entries and the corresponding target-language "verb + object" entries.

[0018] In the above method, if in step 20) the "verb + object" bilingual dictionary does not contain the target-language retrieval index, the method comprises the following step: translating the source-language retrieval index into the target-language retrieval index using a verb bilingual dictionary and a noun bilingual dictionary.

[0019] In the above method, step 20) uses a verb bilingual dictionary and a noun bilingual dictionary.

[0020] In the above method, after step 20), the method further comprises the following step:

[0021] 21) performing synonym expansion on the target-language retrieval index.

[0022] In the above method, after step 21), the method further comprises the following step:

[0023] 22) verifying the target-language retrieval index.
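Steps 10) to 30), together with the optional steps 11), 12), 21) and 22), can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the patented implementation: the function name `retrieve`, its callable parameters, and the representation of an index as a `(verb, object)` tuple are all hypothetical, and step 10) is assumed to have already produced `query_vo`.

```python
from typing import Callable, Dict, Iterable, Optional, Set, Tuple

VO = Tuple[str, str]  # a (verb, object) pair used as an index entry


def retrieve(
    query_vo: VO,
    translate: Callable[[VO], Optional[VO]],
    doc_index: Dict[VO, Set[str]],
    expand_src: Callable[[VO], Iterable[VO]] = lambda vo: [],
    verify_src: Callable[[VO], bool] = lambda vo: True,
    expand_tgt: Callable[[VO], Iterable[VO]] = lambda vo: [],
    verify_tgt: Callable[[VO], bool] = lambda vo: True,
) -> Set[str]:
    """Hypothetical driver; query_vo is assumed to be the output of step 10)."""
    # Steps 11) and 12): optional synonym expansion and verification (source side).
    sources = [query_vo] + [vo for vo in expand_src(query_vo) if verify_src(vo)]
    # Step 20): translate each source-language index; drop untranslatable ones.
    targets = [t for t in map(translate, sources) if t is not None]
    # Steps 21) and 22): optional expansion and verification (target side).
    extra = [e for t in targets for e in expand_tgt(t) if verify_tgt(e)]
    # Step 30): match against the target-language document index.
    hits: Set[str] = set()
    for vo in targets + extra:
        hits |= doc_index.get(vo, set())
    return hits
```

The optional steps default to no-ops, which mirrors the way the method presents them as refinements on top of the three core steps.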

[0024] The beneficial effect of the invention is that it provides a cross-language knowledge retrieval method with higher precision; in addition, the invention effectively improves the recall of cross-language knowledge retrieval.

Brief Description of the Drawings

[0025] FIG. 1 is a flowchart of a cross-language knowledge retrieval method according to a specific embodiment of the invention;

[0026] FIG. 2 is a flowchart of building a bilingual dictionary according to a specific embodiment of the invention.

Detailed Description

[0027] To make the objectives, technical solutions and advantages of the invention clearer, the cross-language knowledge retrieval method according to specific embodiments of the invention is described in further detail below with reference to the drawings. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

[0028] FIG. 1 shows a flowchart of a cross-language knowledge retrieval method according to a specific embodiment of the invention. The method comprises the following steps:

[0029] Semantic analysis is performed on the source-language query and on the target-language document library to extract their verb-object constructions, which yields the source-language retrieval index and the target-language document index. In general, the verb-object construction is the core component of a sentence and conveys its main point. For example, in "如何在冬天提高室内的温度?" ("How can the indoor temperature be raised in winter?") the verb-object construction is "提高 + 温度" ("raise + temperature"). Moreover, the "verb + object" combinations in verb-object constructions follow certain regular patterns of usage. The "verb + object" combination (verb-object construction) is therefore extracted as the index.

[0030] The Stanford Parser from Stanford University is chosen as the semantic analyzer. The tool currently supports semantic analysis of English, Chinese, German and Arabic; for details see http://www-nlp.stanford.edu/software/lex-parser.shtml. A person of ordinary skill in the art will understand that semantic analysis can be performed with many existing semantic analyzers from the field of natural language processing, each of which may support different languages. This step is not limited to a specific semantic analyzer or language. Two concrete examples of semantic analysis follow:

[0031] Example 1: Assume the source language is Chinese and the source-language query is "如何探测微波辐射" ("how to detect microwave radiation"). The semantic analysis result is:

[0032] (ROOT
[0033]   (IP
[0034]     (VP
[0035]       (ADVP (AD 如何))
[0036]       (VP (VV 探测)
[0037]         (NP
[0038]           (ADJP (JJ 微波))
[0039]           (NP (NN 辐射)))))))

[0040] The abbreviations have the following meanings:

[0041] ROOT: root node;
[0042] IP: inflectional phrase;
[0043] VP: verb phrase;
[0044] ADVP: adverb phrase;
[0045] AD: adverb;
[0046] VV: verb;
[0047] NP: noun phrase;
[0048] ADJP: adjective phrase;
[0049] JJ: adjective;
[0050] NN: common noun.

[0051] Based on the labels in the semantic analysis result, the verb-object construction VP "探测微波辐射" is extracted automatically, giving the combination of the verb VV "探测" ("detect") and the object NP "微波辐射" ("microwave radiation") as the source-language retrieval index. The verb is marked V and the object is marked O, so the source-language retrieval index is "探测(V) + 微波辐射(O)".
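The extraction in Example 1 can be sketched in a few lines of Python. The sketch below parses the bracketed tree with a small hand-written reader instead of calling the Stanford Parser, and the rule "take the first VP that directly contains a verb tag and an NP" is an assumption made for illustration; the source does not spell out its extraction rule. The names `parse_tree`, `leaves` and `extract_vo` are hypothetical.

```python
import re
from typing import List, Optional, Tuple, Union

Tree = Tuple[str, List[Union["Tree", str]]]  # (label, children)


def parse_tree(text: str) -> Tree:
    """Parse a Penn-style bracketed tree such as '(NP (NN 辐射))'."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Tree:
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children: List[Union[Tree, str]] = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()


def leaves(node: Union[Tree, str]) -> List[str]:
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]


def extract_vo(node: Union[Tree, str], joiner: str = "") -> Optional[Tuple[str, str]]:
    """Return (verb, object) from the first VP that directly holds a verb tag and an NP."""
    if isinstance(node, str):
        return None
    label, children = node
    if label == "VP":
        subtrees = [c for c in children if not isinstance(c, str)]
        # Assumed rule: verb tags start with "V" (VV, VBZ, ...) but are not VP itself.
        verb = next((c for c in subtrees if c[0].startswith("V") and c[0] != "VP"), None)
        obj = next((c for c in subtrees if c[0] == "NP"), None)
        if verb and obj:
            return (joiner.join(leaves(verb)), joiner.join(leaves(obj)))
    for child in children:
        found = extract_vo(child, joiner)
        if found:
            return found
    return None
```

For Chinese the words are joined without a separator; for a space-delimited language, `joiner=" "` keeps multi-word objects readable.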

[0052] Example 2: Assume the target language is English and a sentence in the target-language document library is "Doppler effect transducer measures fluid flow". The semantic analysis result is:

[0053] (ROOT
[0054]   (S
[0055]     (NP (JJ Doppler) (NN effect) (NN transducer))
[0056]     (VP (VBZ measures)
[0057]       (NP (JJ fluid) (NN flow)))))

[0058] The abbreviations have the following meanings:

[0059] ROOT: root node;
[0060] S: sentence;
[0061] NP: noun phrase;
[0062] JJ: adjective;
[0063] NN: common noun;
[0064] VP: verb phrase;
[0065] VBZ: present-tense verb (third-person singular).

[0066] Based on the labels in the semantic analysis result, the verb-object construction VP "measures fluid flow" is extracted automatically, giving the combination of the verb VBZ "measures" and the object NP "fluid flow" as the target-language document index. The verb is marked V and the object is marked O, so the target-language document index is "measure (V) + fluid flow (O)".
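Example 2 records the index entry as "measure (V)" although the sentence contains "measures", which suggests that the verb is reduced to its base form before indexing. A minimal sketch of building a target-language document index along those lines is shown below. The inverted-index layout (pair to document ids), the tiny `LEMMAS` table and the function names are assumptions for illustration; a real system would use a proper lemmatizer.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

VO = Tuple[str, str]

# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {"measures": "measure", "measured": "measure", "heats": "heat"}


def normalize(vo: VO) -> VO:
    """Lower-case the pair and map the verb to its base form when known."""
    verb, obj = vo
    verb = verb.lower()
    return (LEMMAS.get(verb, verb), obj.lower())


def build_document_index(pairs_by_doc: Dict[str, Iterable[VO]]) -> Dict[VO, Set[str]]:
    """Map each normalized (verb, object) pair to the ids of documents containing it."""
    index: Dict[VO, Set[str]] = defaultdict(set)
    for doc_id, pairs in pairs_by_doc.items():
        for vo in pairs:
            index[normalize(vo)].add(doc_id)
    return dict(index)
```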

[0067] Preferably, synonym expansion is performed automatically on the source-language retrieval index. More specifically, a source-language thesaurus is used to expand the "verb (V)" and the "object (O)" of a source-language retrieval index with synonyms, and the expanded "verb (V)" and "object (O)" words are combined into extended source-language retrieval indexes, that is, extended "verb (V) + object (O)" combinations. The source-language thesaurus comprises a verb thesaurus and a noun thesaurus. The verb thesaurus can be built from the verb synonyms in an existing well-known dictionary such as 《常用同义词词典》 (a dictionary of common synonyms), and the noun thesaurus can be built in the same way from its noun synonyms. An example of synonym expansion of a source-language retrieval index follows.

[0068] Example 3: Assume the source language is Chinese and a source-language retrieval index is "稀释(V) + 光刻胶(O)" ("dilute + photoresist").

[0069] Looking up "稀释(V)" in the source-language verb thesaurus yields no synonym; looking up "光刻胶(O)" in the source-language object thesaurus yields the synonym "光致抗蚀剂(O)". The extended source-language retrieval index for "稀释(V) + 光刻胶(O)" is therefore "稀释(V) + 光致抗蚀剂(O)".
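The expansion in Example 3 amounts to combining every verb alternative with every object alternative and keeping the combinations that differ from the original. A minimal sketch, assuming the thesauri are plain dictionaries from a word to its list of synonyms (the function name `expand` and that data layout are hypothetical):

```python
from itertools import product
from typing import Dict, List, Tuple

VO = Tuple[str, str]


def expand(vo: VO, verb_syn: Dict[str, List[str]], obj_syn: Dict[str, List[str]]) -> List[VO]:
    """Return every new (verb, object) combination built from synonyms, excluding the original."""
    verb, obj = vo
    verbs = [verb] + verb_syn.get(verb, [])   # a word with no entry keeps only itself
    objs = [obj] + obj_syn.get(obj, [])
    return [pair for pair in product(verbs, objs) if pair != vo]
```

With the data of Example 3, `expand(("稀释", "光刻胶"), {}, {"光刻胶": ["光致抗蚀剂"]})` yields the single extended index `("稀释", "光致抗蚀剂")`.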

[0070] In this step, synonym expansion of keyword combinations is used to obtain more correct results, which improves the recall of cross-language retrieval. A person of ordinary skill in the art will understand that the synonym expansion step may also be omitted.

[0071] The above steps of dictionary-based synonym expansion and keyword combination can produce the following error: a synonym of some "verb (V)" and a synonym of some "object (O)" may be unlikely to occur together in actual usage. For example, given "增加(V) + 热量(O)" ("increase + heat"), "增长(V)" ("grow") is a synonym of "增加(V)", but the combination "增长(V) + 热量(O)" does not conform to usage and is implausible. Therefore, according to a preferred embodiment, the invention further includes a step of verifying the plausibility of the extended source-language retrieval indexes.

[0072] Co-occurrence techniques can be used in the step of verifying the extended source-language retrieval indexes. These techniques rest on the following assumption: when one query term is translated, the other query terms (or their translations) serve as the "context" for choosing its translation. Correct translations co-occur frequently in target-language documents, whereas incorrect translations co-occur rarely. When the correct translation is chosen for each query term, a translation is therefore selected only if its co-occurrence with the translations of the other query terms in target-language documents is the highest. The procedure runs as follows: for a set of n query terms {S1, ..., Sn}, first obtain from the dictionary the translation set Ti of each Si (1 ≤ i ≤ n), then select from Ti the word with the highest co-occurrence rate with the translation sets Tj of the other query terms Sj (1 ≤ j ≤ n, j ≠ i) as the translation of Si. The verification method used here considers only the co-occurrence of the "verb (V)" and the "object (O)" and ignores the other words in the sentence, which makes the method more efficient to execute.
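The selection procedure over {S1, ..., Sn} can be sketched as follows. The source only says that the candidate with the "highest co-occurrence rate" is chosen; scoring a candidate by the summed number of documents it shares with the other terms' candidates is an assumption of this sketch, as are the function names and the representation of documents as sets of words.

```python
from typing import Dict, List, Sequence, Set


def cooccurrence(a: str, b: str, docs: Sequence[Set[str]]) -> int:
    """Count documents (given as sets of words) that contain both a and b."""
    return sum(1 for words in docs if a in words and b in words)


def select_translations(
    terms: List[str],
    candidates: Dict[str, List[str]],
    docs: Sequence[Set[str]],
) -> Dict[str, str]:
    """For each term S_i, pick from T_i the candidate that co-occurs most
    with the candidates of the other terms S_j (j != i)."""
    chosen = {}
    for i, term in enumerate(terms):
        # Union of the translation sets T_j of all other terms.
        others = [c for j, t in enumerate(terms) if j != i for c in candidates[t]]
        chosen[term] = max(
            candidates[term],
            key=lambda cand: sum(cooccurrence(cand, other, docs) for other in others),
        )
    return chosen
```

When two candidates tie, `max` keeps the one listed first, so the order of each candidate list acts as a tie-breaker.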

[0073] According to a specific embodiment of the invention, the co-occurrence degree of an extended source-language retrieval index is computed as follows:

[0074] The extended source-language retrieval index is searched for in the source-language document library, and the documents that contain both the "verb (V)" and the "object (O)" of the index are extracted.

[0075] Let the "verb" be denoted v, the "object" be denoted o, and the co-occurrence degree of an extended source-language retrieval index in the source-language document library be SIM(v, o). The formula is:

[0076] $SIM(v, o) = p(v, o) \times \log_2 \frac{p(v, o)}{p(v) \times p(o)} - \log_2 Dis(v, o)$ (Formula 1)

[0077] where c(v) and c(o) are the numbers of occurrences of v and o in the source-language document library; c(v, o) is the number of times v and o co-occur in the same sentence in the source-language document library; $p(v, o) = c(v, o)/c(v) + c(v, o)/c(o)$; $p(v) = c(v)/\sum c(v)$; and Dis(v, o) is the average distance between v and o within a sentence, measured as the number of words between them.
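Formula 1 can be evaluated directly from the counts. The sketch below makes three assumptions that the text does not state: p(o) is defined by analogy with p(v) as c(o) divided by the total count of objects, a pair that never co-occurs receives negative infinity, and the average distance must be positive (so adjacent words are taken to be at distance 1). The function and parameter names are hypothetical.

```python
import math


def sim(c_v: int, c_o: int, c_vo: int, total_v: int, total_o: int, avg_dis: float) -> float:
    """Formula 1: p(v,o) * log2(p(v,o) / (p(v) * p(o))) - log2(Dis(v,o))."""
    if avg_dis <= 0:
        raise ValueError("avg_dis must be positive")  # assumption: adjacent words count as 1
    if c_vo == 0:
        return float("-inf")          # assumption: pairs that never co-occur score lowest
    p_vo = c_vo / c_v + c_vo / c_o    # as defined in [0077]
    p_v = c_v / total_v               # c(v) / sum of c(v) over all verbs
    p_o = c_o / total_o               # assumption: defined by analogy with p(v)
    return p_vo * math.log2(p_vo / (p_v * p_o)) - math.log2(avg_dis)
```

As a worked check: with c(v) = c(o) = 4, c(v, o) = 2, totals of 16 and an average distance of 2, p(v, o) = 1, p(v) = p(o) = 0.25, so SIM = 1 × log2(16) − log2(2) = 3.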

[0078] A person of ordinary skill in the art will understand that the co-occurrence degree of an extended source-language retrieval index may also be calculated according to Formula 2:

[0079]

Figure CN102117284AD00071

[0080] In general, an extended source-language retrieval index whose SIM(v, o) value is less than 2 is considered to have passed verification; an extended source-language retrieval index whose SIM(v, o) value is greater than 2 is deleted.

[0081] The verified extended source-language retrieval indexes are translated into target-language retrieval indexes. Preferably, a "verb + object" bilingual dictionary is matched against the "verb (V) + object (O)" of each verified extended source-language retrieval index, where the "verb + object" bilingual dictionary contains source-language "verb + object" entries and the corresponding target-language "verb + object" entries. Table 1 shows part of a "verb + object" bilingual dictionary in which the source language is Chinese and the target language is English.

[0082] Table 1: Chinese-English bilingual dictionary

[0083]

Figure CN102117284AD00072

[0084] FIG. 2 shows a flowchart of building the "verb + object" bilingual dictionary according to a specific embodiment of the invention. The dictionary is built from a parallel corpus, that is, a bilingual or multilingual corpus that contains not only source-language text but also the corresponding target-language text. The two or more texts are generally aligned by sentence or by paragraph. A computer can run full-text searches over the source and translated texts and display them side by side. Building the bilingual dictionary comprises the following steps. First, two corpora T1 and T2 are processed with the semantic analyzer; T1 and T2 contain translated documents whose content corresponds sentence by sentence, with T1 in language s and T2 in language t. The semantic analyzer converts T1 and T2 into semantic indexes expressed as parallel "verb (V) + object (O)" combinations. Parallel "verb (V) + object (O)" combinations are extracted from these indexes and bilingual "verb (V) + object (O)" word pairs are created; for example, "heat (V) + water (O)" is parallel to "加热(V) + 水(O)", and the two together form one word pair. The word pairs are then edited, for example by removing duplicate pairs among the lexical units. The edited word pairs are added to the "verb + object" bilingual dictionary.
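The dictionary-building flow can be sketched under the assumption that semantic analysis has already reduced each aligned sentence to at most one `(verb, object)` pair, with `None` where no verb-object construction was found. The function name and this input layout are hypothetical; only the "加热 + 水" / "heat + water" pair comes from the text.

```python
from typing import Dict, List, Optional, Set, Tuple

VO = Tuple[str, str]


def build_vo_dictionary(
    source_pairs: List[Optional[VO]],
    target_pairs: List[Optional[VO]],
) -> Dict[VO, Set[VO]]:
    """Pair up the (verb, object) found in the i-th sentence of each aligned corpus."""
    if len(source_pairs) != len(target_pairs):
        raise ValueError("corpora must be aligned sentence by sentence")
    dictionary: Dict[VO, Set[VO]] = {}
    for src, tgt in zip(source_pairs, target_pairs):
        if src is None or tgt is None:
            continue  # skip sentences without a verb-object construction on either side
        # Storing translations in a set drops duplicate word pairs (the editing step).
        dictionary.setdefault(src, set()).add(tgt)
    return dictionary
```

Keeping a set of translations per source pair allows one source-language combination to map to several target-language combinations when the corpus supports them.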

[0085] This step preferentially uses the match from the "verb + object" bilingual dictionary to translate the verified source-language retrieval index. If no match is found, a separate verb bilingual dictionary and noun bilingual dictionary are used to match the verified source-language retrieval index and obtain the target-language retrieval index. A person of ordinary skill in the art will understand that the separate verb bilingual dictionary and object bilingual dictionary can also be used directly to match the verified source-language retrieval index and obtain the target-language retrieval index.
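The lookup order described here, pair dictionary first and word-level dictionaries as a fallback, can be sketched as below. The function name and the dictionary layouts are assumptions; returning `None` when neither lookup succeeds is also a choice of this sketch, since the text does not say what happens in that case.

```python
from typing import Dict, Optional, Tuple

VO = Tuple[str, str]


def translate_vo(
    vo: VO,
    vo_dict: Dict[VO, VO],
    verb_dict: Dict[str, str],
    noun_dict: Dict[str, str],
) -> Optional[VO]:
    """Prefer the "verb + object" dictionary; fall back to word-level dictionaries."""
    if vo in vo_dict:
        return vo_dict[vo]
    verb, obj = vo
    if verb in verb_dict and obj in noun_dict:
        return (verb_dict[verb], noun_dict[obj])
    return None  # assumption: report failure when no translation is available
```

Passing an empty `vo_dict` reproduces the variant in which only the separate verb and noun dictionaries are used.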

[0086] As can be seen from the above, the translation process of the invention does not simply translate each word of the user's request; it translates certain combinations of informative words in the request while preserving the part-of-speech tags and semantic relations of the request.

[0087] According to a preferred embodiment of the invention, the method further includes a step of synonym expansion of the obtained target-language retrieval index using a target-language thesaurus, where the target-language thesaurus comprises a verb thesaurus and a noun thesaurus. Specifically, the target-language verb thesaurus and noun thesaurus are used to expand the "verb (V)" and the "object (O)" of a target-language retrieval index with synonyms, and the expanded "verb (V)" and "object (O)" words are combined into extended target-language retrieval indexes, that is, extended target-language "verb (V) + object (O)" combinations. An example of expanding a target-language retrieval index follows.

[0088] Example 4: Assume the target language is English and a target-language retrieval index is "dissolve (V) + aluminum layer (O)".

[0089] Looking up "dissolve (V)" in the target-language verb thesaurus yields the synonym "liquefy (V)"; looking up "aluminum layer (O)" in the target-language object thesaurus yields the synonym "Al layer (O)". The extended target-language indexes for the target-language retrieval index "dissolve (V) + aluminum layer (O)" are therefore:

[0090] "liquefy (V) + aluminum layer (O)",

[0091] "dissolve (V) + Al layer (O)", and

[0092] "liquefy (V) + Al layer (O)".

[0093] In the context of a query, the translations of two unrelated query terms may still appear together in the target corpus, so an unsuitable translation may be selected, which seriously degrades retrieval. Therefore, as with the verification of the extended source-language retrieval indexes, the extended target-language retrieval indexes are verified, which yields target-language retrieval indexes that are both comprehensive and accurate.

[0094] The verified target-language retrieval indexes are matched against the target-language document index, and the documents that match the user's query are obtained as output. Specifically, the target-language document library is searched using the target-language document index; within the resulting subset of text files for which a target-language document index exists, the knowledge and documents relevant to the user's request are then retrieved. That is, the documents whose target-language document index is identical to a verified target-language retrieval index are retrieved and returned to the user as output.
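The matching step reduces to exact lookups of each verified retrieval index in the document index. The sketch below assumes the document index is stored as a mapping from a `(verb, object)` pair to the ids of the documents that contain it, and it additionally records which retrieval index matched each document; both the layout and the function name are illustrative choices, not details given in the text.

```python
from typing import Dict, Iterable, List, Set, Tuple

VO = Tuple[str, str]


def match_documents(
    retrieval_indexes: Iterable[VO],
    document_index: Dict[VO, Set[str]],
) -> Dict[str, List[VO]]:
    """Return, for each matching document id, the retrieval indexes it matched."""
    matches: Dict[str, List[VO]] = {}
    for vo in retrieval_indexes:
        # Exact match: the document index entry must equal the retrieval index.
        for doc_id in sorted(document_index.get(vo, set())):
            matches.setdefault(doc_id, []).append(vo)
    return matches
```

Recording the matched pairs is optional; the keys of the returned mapping are the documents that would be returned to the user.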

[0095] A person of ordinary skill in the art will understand that the method of the invention uses the target-language document index, which, as described above, is obtained by performing on the target-language document library a semantic analysis similar to that applied to the source-language query. If further retrievals are carried out on the basis of the above method, the target-language document index obtained in the above steps can be used directly, without repeating the semantic analysis of the target-language document library.

[0096] In summary, the invention uses the "verb + object" combination (verb-object construction) of the query as the retrieval index, which reduces the ambiguity that arises when individual keywords are translated and improves the precision of cross-language retrieval. Preferably, combining this with synonym expansion of keyword combinations yields more correct results and can improve the recall of cross-language retrieval.

[0097] It should be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. The scope of the claimed technical solution is therefore not limited by any specific exemplary teaching given.

Claims (10)

1. A method of cross-language knowledge retrieval, comprising the following steps: 10) performing semantic analysis on a source-language retrieval query to obtain a source-language retrieval index, wherein the source-language retrieval index is the “verb + object” formed by the verb-object structure of the source-language retrieval query; 20) translating the source-language retrieval index into a target-language retrieval index; 30) matching a target-language document index against the target-language retrieval index, wherein the target-language retrieval index is the “verb + object” formed by the verb-object structures in the target-language document library, obtained by performing semantic analysis on the target-language document library.
2. The method according to claim 1, characterized in that, after step 10), the method further comprises the following step: 11) performing synonym expansion on the source-language retrieval index.
3. The method according to claim 2, characterized in that, after step 11), the method further comprises the following step: 12) verifying the source-language retrieval index.
4. The method according to claim 3, characterized in that step 12) further comprises calculating the degree of co-occurrence of the verb and the object in the source-language retrieval index according to the following formula: $\mathrm{SIM}(v, o) = p(v, o) \times \log_2 \dfrac{p(v, o)}{p(v) \times p(o)} - \log_2 \mathrm{Dis}(v, o)$, where the verb is denoted $v$ and the object $o$; $c(v)$ and $c(o)$ are the numbers of times $v$ and $o$ occur in the source-language document library; $c(v, o)$ is the number of times $v$ and $o$ co-occur in the same sentence of the source-language document library; $p(v, o) = c(v, o)/c(v) + c(v, o)/c(o)$; $p(v) = c(v) / \sum c(v)$; and $\mathrm{Dis}(v, o)$ is the average distance between $v$ and $o$ within a sentence.
5. The method according to claim 3, characterized in that step 12) further comprises calculating the degree of co-occurrence of the verb and the object in the source-language retrieval index according to the following formula:
Figure CN102117284AC00021
where the verb is denoted v and the object o; c(v) and c(o) are the numbers of times v and o occur in the source-language document library; and c(v, o) is the number of times v and o co-occur in the same sentence of the source-language document library.
6. The method according to any one of claims 1 to 5, characterized in that step 20) uses a “verb + object” bilingual dictionary, wherein the “verb + object” bilingual dictionary comprises source-language “verb + object” and the corresponding target-language “verb + object”.
7. The method according to claim 6, characterized in that, in step 20), if the “verb + object” bilingual dictionary does not include the target-language retrieval index, the method comprises the following step: translating the source-language retrieval index into the target-language retrieval index using a verb bilingual dictionary and a noun bilingual dictionary.
8. The method according to any one of claims 1 to 5, characterized in that step 20) uses a verb bilingual dictionary and a noun bilingual dictionary.
9. The method according to any one of claims 1 to 5, characterized in that, after step 20), the method further comprises the following step: 21) performing synonym expansion on the target-language retrieval index.
10. The method according to claim 9, characterized in that, after step 21), the method further comprises the step: 22) verifying the target-language retrieval index.
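For illustration, the co-occurrence measure recited in claim 4 can be evaluated as in the sketch below. It is not part of the claims. It assumes that p(o) is defined by analogy with p(v), that is, as c(o) divided by the total count of all objects, which the claim does not state explicitly; the function and parameter names are hypothetical.

```python
import math


def cooccurrence_score(
    c_v: int,            # occurrences of the verb in the source-language library
    c_o: int,            # occurrences of the object in the source-language library
    c_vo: int,           # same-sentence co-occurrences of verb and object
    total_verbs: int,    # sum of c(v) over all verbs
    total_objects: int,  # sum of c(o) over all objects (assumed analogue)
    avg_distance: float, # Dis(v, o): average distance between v and o in a sentence
) -> float:
    """SIM(v, o) = p(v,o) * log2(p(v,o) / (p(v) * p(o))) - log2(Dis(v,o))."""
    if min(c_v, c_o, c_vo, total_verbs, total_objects) <= 0 or avg_distance <= 0:
        raise ValueError("all counts and the average distance must be positive")
    p_vo = c_vo / c_v + c_vo / c_o
    p_v = c_v / total_verbs
    p_o = c_o / total_objects  # assumption: defined like p(v)
    return p_vo * math.log2(p_vo / (p_v * p_o)) - math.log2(avg_distance)


# Worked example: p(v,o) = 2/4 + 2/4 = 1, p(v) = p(o) = 0.5,
# so SIM = 1 * log2(1 / 0.25) - log2(2) = 2 - 1 = 1.
print(cooccurrence_score(c_v=4, c_o=4, c_vo=2,
                         total_verbs=8, total_objects=8, avg_distance=2.0))  # 1.0
```

A pair that co-occurs often and close together receives a higher score, so a threshold on this value is one conceivable way to decide whether an expanded pair passes verification; any such threshold is an assumption here.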
CN2009102439934A 2009-12-30 2009-12-30 Method for retrieving cross-language knowledge CN102117284A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102439934A CN102117284A (en) 2009-12-30 2009-12-30 Method for retrieving cross-language knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102439934A CN102117284A (en) 2009-12-30 2009-12-30 Method for retrieving cross-language knowledge

Publications (1)

Publication Number Publication Date
CN102117284A true CN102117284A (en) 2011-07-06

Family

ID=44216058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102439934A CN102117284A (en) 2009-12-30 2009-12-30 Method for retrieving cross-language knowledge

Country Status (1)

Country Link
CN (1) CN102117284A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294682A (en) * 2012-02-24 2013-09-11 摩根全球购物有限公司 Multi-language retrieving method, computer readable storage medium and network searching system
CN103678714A (en) * 2013-12-31 2014-03-26 北京百度网讯科技有限公司 Construction method and device for entity knowledge base
CN104573019A (en) * 2015-01-12 2015-04-29 百度在线网络技术(北京)有限公司 Information searching method and device
CN104850610A (en) * 2015-05-11 2015-08-19 均康(上海)信息科技有限公司 Network search engine system
CN106372187A (en) * 2016-08-31 2017-02-01 中译语通科技(北京)有限公司 Cross-language retrieval method oriented to big data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1325513A (en) * 1998-09-09 2001-12-05 发明机器公司 Document semantic analysis/selection with knowledge creativity capability
CN101194253A (en) * 2005-06-14 2008-06-04 微软公司 Collocation translation from monolingual and available bilingual corpora
CN101510221A (en) * 2009-02-17 2009-08-19 北京大学 Enquiry statement analytical method and system for information retrieval

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294682A (en) * 2012-02-24 2013-09-11 摩根全球购物有限公司 Multi-language retrieving method, computer readable storage medium and network searching system
CN103678714A (en) * 2013-12-31 2014-03-26 北京百度网讯科技有限公司 Construction method and device for entity knowledge base
CN104573019A (en) * 2015-01-12 2015-04-29 百度在线网络技术(北京)有限公司 Information searching method and device
CN104573019B (en) * 2015-01-12 2019-04-02 百度在线网络技术(北京)有限公司 Information retrieval method and device
CN104850610A (en) * 2015-05-11 2015-08-19 均康(上海)信息科技有限公司 Network search engine system
CN106372187A (en) * 2016-08-31 2017-02-01 中译语通科技(北京)有限公司 Cross-language retrieval method oriented to big data

Similar Documents

Publication Publication Date Title
Diab Second generation AMIRA tools for Arabic processing: Fast and robust tokenization, POS tagging, and base phrase chunking
Mcnamee et al. Character n-gram tokenization for European language text retrieval
JP3266246B2 (en) Knowledge base construction method for natural language analysis apparatus and method, as well as natural language analysis
CN100392644C (en) Method for synthesising self-learning system for knowledge acquisition for retrieval systems
US6535842B1 (en) Automatic bilingual translation memory system
Foo et al. Chinese word segmentation and its effect on information retrieval
CN101878476B (en) Machine translation for query expansion
US20070073678A1 (en) Semantic document profiling
US20060235689A1 (en) Question answering system, data search method, and computer program
Arampatzis et al. Phrase-based information retrieval
Grave et al. Learning word vectors for 157 languages
US7783633B2 (en) Display of results of cross language search
Kishida Technical issues of cross-language information retrieval: a review
Davis et al. tion Using Evolutionary Programming for Multi-Lingual Information Retrieval,” In Proceedings of the Fourth Annual Conference on Evolutionary Programming, San Diego, Evolutionary Programming Society.
US8306807B2 (en) Structured data translation apparatus, system and method
US7672831B2 (en) System and method for cross-language knowledge searching
Resnik et al. The web as a parallel corpus
Nie Cross-language information retrieval
KR20040102329A (en) Unilingual translator
JP2005216126A (en) Text generation method and text generation device of other language
Levow et al. Dictionary-based techniques for cross-language information retrieval
US8041697B2 (en) Semi-automatic example-based induction of semantic translation rules to support natural language search
US20080235202A1 (en) Method and system for translation of cross-language query request and cross-language information retrieval
CN1894688A (en) Translation determination system, method, and program
US9477656B1 (en) Cross-lingual indexing and information retrieval

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C53 Correction of patent for invention or patent application
COR Change of bibliographic data

Free format text: CORRECT: APPLICANT; FROM: PERA GLOBAL TECHNOLOGY (BEIJING) CO., LTD. TO: PERA GLOBAL TECHNOLOGY CO., LTD.

C12 Rejection of a patent application after its publication