CN102117284A - Method for retrieving cross-language knowledge

Info

Publication number: CN102117284A
Application number: CN2009102439934A
Authority: CN
Inventors: 高建忠; 赵琦; 吴祖林; 邱李豪
Original assignee: PERA GLOBAL TECHNOLOGY (BEIJING) Co Ltd
Current assignee: PERA GLOBAL TECHNOLOGY (BEIJING) Co Ltd
Priority date: 2009-12-30
Filing date: 2009-12-30
Publication date: 2011-07-06

Abstract

The invention provides a method for cross-language knowledge retrieval, comprising the following steps: 10) performing semantic analysis on a source-language query to obtain a source-language retrieval index, which is the 'verb + object' combination formed by the verb-object structure of the source-language query; 20) translating the source-language retrieval index into a target-language retrieval index; and 30) matching a target-language document index against the target-language retrieval index, wherein the target-language document index is the 'verb + object' combination formed by the verb-object structures in a target-language document library, obtained by performing semantic analysis on that library. The method enables efficient and accurate cross-language knowledge retrieval.

Description

A method for cross-language knowledge retrieval

Technical field

The present invention relates to the field of computer-based retrieval, and in particular to a method for cross-language knowledge retrieval.

Background technology

With the development of information technology, people increasingly obtain knowledge by searching electronic documents. The knowledge a user needs, however, may reside in documents written in different languages, while the user prefers to interact with an electronic system in his or her native language. This creates a demand for cross-language knowledge retrieval and extraction.

Cross-language retrieval means that a user uses search terms in one natural language (the source language) to retrieve documents expressed in another natural language (the target language). It allows the user to construct a query in a familiar language and then use that query to retrieve documents written in any language other than the query language.

Common methods for implementing cross-language retrieval include the document translation method and the query translation method.

The document translation method converts the language of the documents (the target language) into the query language (the source language) before retrieval. Its advantages are that the results returned to the user are described in the source (query) language, so the user can readily choose among and use them, and that document-level translation has a broader context, which can be used to resolve translation ambiguity. However, document translation changes the language of all the information to be retrieved, and the accuracy of most existing machine translation systems is not yet satisfactory or practical. Translating every document in the database from the target language into the source language also requires an enormous and costly amount of work, and rebuilding the index for the large volume of translated data is expensive as well. The document translation method is therefore worthwhile only when the amount of information to be retrieved is limited. At present it lags far behind the query translation method in both research and practical use.

The query translation method translates the query entered by the user into every language supported by the retrieval system, then submits the multilingual queries to the system's matching module to retrieve documents in the corresponding languages. It is currently the most commonly used way of implementing cross-language retrieval. Its advantage is that only the query is translated, so the translation workload is small and translation is fast. Its main drawbacks are: 1. because the returned results are described in the target language, it is harder for the user to make use of the information obtained; 2. queries are usually very short and carry little context, so disambiguation is difficult; each query term is replaced by all of its possible translations, and translation ambiguity is severe. Controlling translation ambiguity is therefore a key issue in designing an effective query translation method.

Query translation can be implemented with dictionary-based, corpus-based, or hybrid dictionary-corpus methods. Dictionary-based query translation usually just translates the keywords of the user's query and cannot disambiguate them according to the query context, so the precision of the results is low. Corpus-based query translation can obtain translations of some phrases or short sentences in the query from a corpus and can eliminate part of the ambiguity, but it is limited by the size and content of the corpus: it often yields only one or a few translations of the query keywords and cannot retrieve results for their synonyms, so recall is low.

Summary of the invention

The technical problem to be solved by the present invention is to improve the precision of cross-language knowledge retrieval.

To solve the above problem, according to one aspect of the present invention, a method for cross-language knowledge retrieval is provided, comprising the following steps:

10) performing semantic analysis on a source-language query to obtain a source-language retrieval index, wherein the source-language retrieval index is the "verb+object" combination formed by the verb-object structure of the source-language query;

20) translating the source-language retrieval index into a target-language retrieval index;

30) matching a target-language document index against the target-language retrieval index, wherein the target-language document index is the "verb+object" combination formed by the verb-object structures in a target-language document library, obtained by performing semantic analysis on the target-language document library.

In the above method, after step 10), the method further comprises the following step:

11) performing synonym expansion on the source-language retrieval index.

In the above method, after step 11), the method further comprises the following step:

12) verifying the source-language retrieval index.

In the above method, step 20) uses a "verb+object" bilingual dictionary, wherein the "verb+object" bilingual dictionary contains source-language "verb+object" entries and the corresponding target-language "verb+object" entries.

In the above method, if in step 20) the "verb+object" bilingual dictionary does not contain the target-language retrieval index, the method comprises the following step: translating the source-language retrieval index into the target-language retrieval index using a verb bilingual dictionary and a noun bilingual dictionary.

In the above method, step 20) uses a verb bilingual dictionary and a noun bilingual dictionary.

In the above method, after step 20), the method further comprises the following step:

21) performing synonym expansion on the target-language retrieval index.

In the above method, after step 21), the method further comprises the following step:

22) verifying the target-language retrieval index.

The beneficial effect of the present invention is that it provides a cross-language knowledge retrieval method with higher precision; in addition, the present invention effectively improves the recall of cross-language knowledge retrieval.

Description of drawings

Fig. 1 is a flowchart of a cross-language knowledge retrieval method according to a specific embodiment of the present invention;

Fig. 2 is a flowchart of building a bilingual dictionary according to a specific embodiment of the present invention.

Embodiment

To make the purpose, technical solution, and advantages of the present invention clearer, the cross-language knowledge retrieval method according to specific embodiments of the invention is further described below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

Fig. 1 shows a flowchart of a cross-language knowledge retrieval method according to a specific embodiment of the present invention. The method comprises the following steps:

Semantic analysis is performed on the source-language query and on the target-language document library to extract the verb-object structures they contain, thereby obtaining the source-language retrieval index and the target-language document index. In general, the verb-object structure is the core of a sentence and conveys its main content; for example, in "How can indoor temperature be raised in winter?" the verb-object structure is "raising+temperature". Moreover, the "verb+object" combination in a verb-object structure follows certain collocation regularities in a language. The "verb+object" combination (verb-object structure) is therefore extracted as the index.

The Stanford Parser from Stanford University is chosen as the semantic analyzer to perform the semantic analysis. This tool currently supports semantic analysis of English, Chinese, German, and Arabic; see http://www-nlp.stanford.edu/software/lex-parser.shtml for details. Those of ordinary skill in the art will appreciate that semantic analysis can be performed with many existing semantic analyzers from the field of natural language processing, each of which may support different languages. This step does not restrict the specific semantic analyzer or the languages it targets. Two concrete semantic analysis examples are described below:

Example 1: suppose the source language is Chinese and the source-language query is "how to detect microwave radiation". The semantic analysis result is:

(ROOT
  (IP
    (VP
      (ADVP (AD how))
      (VP (VV detect)
        (NP
          (ADJP (JJ microwave))
          (NP (NN radiation)))))))

The meaning of each abbreviation is as follows:

ROOT: root node;

IP: inflectional phrase (simple clause);

VP: verb phrase;

ADVP: adverbial phrase;

AD: adverb;

VV: verb;

NP: noun phrase;

ADJP: adjective phrase;

JJ: adjective;

NN: common noun.

Based on the labels in the semantic analysis result, the verb-object structure VP "detect microwave radiation" is extracted automatically, yielding the combination of the verb VV "detect" and the object NP "microwave radiation" as the source-language retrieval index. The verb is labeled V and the object is labeled O, so this source-language retrieval index is "detect (V)+microwave radiation (O)".

Example 2: suppose the target language is English and a sentence in the target-language document library is "Doppler effect transducer measures fluid flow". The semantic analysis result is:

(ROOT
  (S
    (NP (JJ Doppler) (NN effect) (NN transducer))
    (VP (VBZ measures)
      (NP (JJ fluid) (NN flow)))))

ROOT: root node;

S: sentence;

NP: noun phrase;

JJ: adjective;

NN: common noun;

VP: verb phrase;

VBZ: verb, third-person singular present tense.

Based on the labels in the semantic analysis result, the verb-object structure VP "measures fluid flow" is extracted automatically, yielding the combination of the verb VBZ "measures" and the object NP "fluid flow" as the target-language document index. The verb is labeled V and the object is labeled O, so this target-language document index is "measure (V)+fluid flow (O)".
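The extraction in Examples 1 and 2 can be sketched in a few lines. The following is a minimal illustration, not the patent's implementation: it assumes the parser output is available as a Penn-style bracketed string, and the helper names `parse_brackets`, `leaves`, and `extract_vo` are hypothetical. It returns the surface verb ("measures") and does not lemmatize it to "measure".

```python
import re

def parse_brackets(text):
    """Parse a Penn-style bracketed tree into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a leaf word
                pos += 1
        pos += 1                      # skip ")"
        return [label] + children

    return node()

def leaves(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in leaves(child)]

def extract_vo(tree):
    """Return (verb, object) for the first VP that directly holds a verb tag and an NP."""
    if isinstance(tree, str):
        return None
    if tree[0] == "VP":
        kids = [c for c in tree[1:] if not isinstance(c, str)]
        verbs = [c for c in kids if c[0].startswith("V") and c[0] != "VP"]
        objects = [c for c in kids if c[0] == "NP"]
        if verbs and objects:
            return " ".join(leaves(verbs[0])), " ".join(leaves(objects[0]))
    for child in tree[1:]:
        found = extract_vo(child)
        if found:
            return found
    return None

example2 = """(ROOT (S (NP (JJ Doppler) (NN effect) (NN transducer))
                 (VP (VBZ measures) (NP (JJ fluid) (NN flow)))))"""
print(extract_vo(parse_brackets(example2)))
```

The search descends into nested VPs, so the adverb-modified structure of Example 1 is handled by the same function; a production system would also need to decide how to treat sentences with several verb-object structures.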

Preferably, synonym expansion is performed automatically on the source-language retrieval index. More specifically, a source-language thesaurus is used to expand the "verb (V)" and the "object (O)" of the source-language retrieval index with synonyms, and the expanded "verb (V)" and "object (O)" words are combined to form the extended source-language retrieval index, that is, the expanded "verb (V)+object (O)" combinations. The source-language thesaurus comprises a verb thesaurus and a noun thesaurus. The verb thesaurus can be built from the verb synonyms in an existing, well-known dictionary such as a dictionary of common synonyms; the noun thesaurus can likewise be built from the noun synonyms in such a dictionary. An example of synonym expansion of a source-language retrieval index is given below.

Example 3: suppose the source language is Chinese and a source-language retrieval index is "dilution (V)+photoresist (O)".

The source-language verb thesaurus is searched for synonyms of "dilution (V)", and none is found; the source-language object thesaurus is searched for synonyms of "photoresist (O)", and one synonym is found (a different Chinese term that is likewise rendered as "photoresist (O)" in English). The extended source-language retrieval index of "dilution (V)+photoresist (O)" is therefore "dilution (V)" combined with that synonym: "dilution (V)+photoresist (O)".

In this step, synonym expansion of the keyword combination is used to obtain more correct retrieval results, which improves the recall of cross-language retrieval. Those of ordinary skill in the art will appreciate that this synonym expansion step may also be omitted.

The above step of expanding synonyms with dictionaries and recombining the keywords may introduce the following error: a synonym of a given "verb (V)" and a synonym of a given "object (O)" may be unlikely to appear together in actual usage. For example, in "increasing (V)+heat (O)" the verb is a dictionary synonym of "increasing (V)" (two distinct Chinese verbs that are rendered identically in English), yet the resulting combination does not conform to language usage and is unreasonable. Therefore, according to a preferred embodiment, the present invention further comprises a step of verifying the reasonableness of the extended source-language retrieval index.

In this step, the extended source-language retrieval index can be verified using a co-occurrence technique. The co-occurrence technique rests on the following assumption: when a query term is translated, the other query terms (or their translations) serve as the "context" for selecting its translation. Correct translations co-occur frequently in target-language documents, whereas wrong translations co-occur rarely. Hence, the correct translation of each query term is the one whose co-occurrence with the translations of the other query terms in the target-language documents is highest. Concretely, for a set of n query terms {S1, ..., Sn}, a translation set Ti (1≤i≤n) is first obtained for each Si from a dictionary; then the word in Ti that has the highest co-occurrence rate with the translation sets Tj (1≤j≤n, j≠i) of the other query terms Sj is selected as the translation of Si. The verification method described here considers only the co-occurrence of the "verb (V)" and the "object (O)" and ignores the other words in the sentence, which effectively improves the efficiency of the method.
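The selection rule just described (for each Si, pick from Ti the candidate that co-occurs most with the translation sets Tj of the other terms) can be sketched as follows. This is an illustration under stated assumptions: the text does not say how co-occurrence with a whole set Tj is aggregated, so the sketch simply sums pairwise scores; the function name `select_translations` and the counts are hypothetical.

```python
def select_translations(candidates, cooccur):
    """candidates: list of translation sets T_i, one list per query term S_i.
    cooccur(a, b): co-occurrence score of two target-language words.
    Returns one chosen translation per query term."""
    chosen = []
    for i, t_i in enumerate(candidates):
        def support(word):
            # Assumption: sum the co-occurrence with every candidate
            # translation of every other query term.
            return sum(cooccur(word, other)
                       for j, t_j in enumerate(candidates) if j != i
                       for other in t_j)
        chosen.append(max(t_i, key=support))
    return chosen

# Hypothetical co-occurrence counts from a target-language corpus.
counts = {("raise", "temperature"): 12, ("lift", "temperature"): 1,
          ("raise", "heat"): 3, ("lift", "heat"): 0}

def cooccur(a, b):
    # Treat the counts as symmetric.
    return counts.get((a, b), 0) + counts.get((b, a), 0)

print(select_translations([["raise", "lift"], ["temperature", "heat"]], cooccur))
```

With only a verb and an object (n = 2), as in the method described here, each candidate verb is scored solely against the candidate objects and vice versa.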

According to a specific embodiment of the present invention, the co-occurrence degree of an extended source-language retrieval index is calculated as follows:

The extended source-language retrieval index is searched for in the source-language document library, and the documents that contain both the "verb (V)" and the "object (O)" of the extended source-language retrieval index are extracted.

Let the "verb" be denoted v and the "object" be denoted o, and let SIM(v, o) be the co-occurrence degree of an extended source-language retrieval index in the source-language document library. The calculation formula is:

SIM(v, o) = p(v, o) \times \log_2 \frac{p(v, o)}{p(v) \times p(o)} - \log_2 Dis(v, o)    (Formula 1)

where c(v) and c(o) are the numbers of times v and o occur in the source-language document library; c(v, o) is the number of times v and o co-occur in the same sentence of the source-language document library; p(v, o) = c(v, o)/c(v) + c(v, o)/c(o); p(v) = c(v)/∑c(v); and Dis(v, o) is the average distance between v and o, measured as the number of words between the two.
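Formula 1 can be evaluated directly once the counts are known. The sketch below is illustrative only and makes several assumptions that the text does not state: p(o) is taken to be c(o)/∑c(o) by analogy with p(v); ∑c(v) and ∑c(o) are passed in as the totals `total_v` and `total_o`; the average distance Dis(v, o) is supplied by the caller and must be positive; and the case c(v, o) = 0 is treated as undefined. The function name `sim_formula1` and the numbers are hypothetical.

```python
import math

def sim_formula1(c_v, c_o, c_vo, total_v, total_o, mean_dist):
    """Co-occurrence degree per Formula 1, from raw counts."""
    if c_vo == 0 or mean_dist <= 0:
        # Assumption: the formula is undefined in these cases.
        raise ValueError("Formula 1 needs c(v, o) > 0 and Dis(v, o) > 0")
    p_vo = c_vo / c_v + c_vo / c_o      # p(v, o) as defined in the text
    p_v = c_v / total_v                 # p(v) = c(v) / sum of c(v)
    p_o = c_o / total_o                 # assumed analogous definition of p(o)
    return p_vo * math.log2(p_vo / (p_v * p_o)) - math.log2(mean_dist)

# Hypothetical counts for one "verb (V)+object (O)" pair.
score = sim_formula1(c_v=4, c_o=8, c_vo=2, total_v=16, total_o=32, mean_dist=2.0)
print(round(score, 3))
```

Passing the counts in explicitly keeps the arithmetic separate from how the document library is tokenized and searched, which the text leaves open.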

Those of ordinary skill in the art will appreciate that the co-occurrence degree of the extended source-language retrieval index can also be calculated according to Formula 2:

SIM(v, o) = \left( \frac{c(v, o)}{c(v)} + \frac{c(v, o)}{c(o)} \right) / 2    (Formula 2)
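As a small worked illustration of Formula 2, the sketch below counts occurrences in a toy tokenized corpus and averages the two ratios. The corpus, the function names `count_stats` and `sim_formula2`, and the choice to count co-occurrence once per sentence are assumptions made for this example.

```python
def count_stats(sentences, v, o):
    """Return c(v), c(o), c(v, o) for tokenized sentences (lists of words)."""
    c_v = sum(s.count(v) for s in sentences)
    c_o = sum(s.count(o) for s in sentences)
    # Assumption: a sentence containing both words counts as one co-occurrence.
    c_vo = sum(1 for s in sentences if v in s and o in s)
    return c_v, c_o, c_vo

def sim_formula2(c_v, c_o, c_vo):
    """Co-occurrence degree per Formula 2."""
    if c_v == 0 or c_o == 0:
        return 0.0          # assumption: no occurrences means no co-occurrence
    return (c_vo / c_v + c_vo / c_o) / 2

# Hypothetical toy corpus.
sentences = [["measure", "flow"], ["measure", "pressure"],
             ["flow", "rate"], ["measure", "the", "flow"]]
c_v, c_o, c_vo = count_stats(sentences, "measure", "flow")
print(c_v, c_o, c_vo, sim_formula2(c_v, c_o, c_vo))
```

Here "measure" and "flow" each occur three times and share two sentences, so both ratios equal 2/3 and so does their average.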

Typically, an extended source-language retrieval index whose SIM(v, o) value is less than 2 is considered to have passed verification, and an extended source-language retrieval index whose SIM(v, o) value is greater than 2 is deleted.

The verified extended source-language retrieval index is then translated into a target-language retrieval index. Preferably, the "verb (V)+object (O)" of the verified extended source-language retrieval index is matched against a "verb+object" bilingual dictionary, which contains source-language "verb+object" entries and the corresponding target-language "verb+object" entries. Table 1 shows part of a "verb+object" bilingual dictionary in which the source language is Chinese and the target language is English.

Table 1 Chinese-English bilingual dictionary

Chinese	English
Raising+temperature	increase+temperature; raise+temperature
Output+light signal	output+light signal; output+optic signal; output+optical signal

Fig. 2 shows a flowchart of building the "verb+object" bilingual dictionary according to a specific embodiment of the present invention. The dictionary is built from a parallel corpus, that is, a bilingual or multilingual corpus that contains not only source-language texts but also the corresponding target-language texts. The two or more texts are generally aligned at the sentence or paragraph level. A computer can perform full-text search over both the source text and the translated text and display them side by side. The process of building the bilingual dictionary comprises the following steps. First, two corpora T1 and T2 are processed with the semantic analyzer, where T1 and T2 contain translated documents whose content corresponds sentence by sentence; the language of corpus T1 is s and the language of corpus T2 is t. The semantic analyzer converts T1 and T2 into semantic indexes represented by a number of parallel "verb (V)+object (O)" combinations. Parallel "verb (V)+object (O)" combinations are extracted from these indexes, and bilingual "verb (V)+object (O)" word pairs are built; for example, "heat (V)+water (O)" in one corpus is parallel to "heating (V)+water (O)" in the other, and the two together form a word pair. The word pairs are then edited and processed, for example by deleting duplicate pairs among the lexical units. The edited word pairs are added to the "verb+object" bilingual dictionary.
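The dictionary-building flow can be sketched as below. This is a simplified illustration, not the patent's procedure: it assumes the semantic analyzer has already produced, for each aligned sentence pair, the list of "verb (V)+object (O)" tuples on each side, and that the k-th structure on the source side corresponds to the k-th on the target side. The function name `build_vo_dictionary` is hypothetical, and the source-language keys are written as English glosses, as in Table 1.

```python
from collections import defaultdict

def build_vo_dictionary(aligned_pairs):
    """aligned_pairs: iterable of (source_vo_list, target_vo_list),
    one entry per sentence-aligned pair from corpora T1 and T2."""
    dictionary = defaultdict(list)
    for source_vos, target_vos in aligned_pairs:
        # Assumption: V-O structures are paired by position within the sentence.
        for src, tgt in zip(source_vos, target_vos):
            if tgt not in dictionary[src]:   # delete duplicate word pairs
                dictionary[src].append(tgt)
    return dict(dictionary)

# Hypothetical analyzer output for three aligned sentence pairs.
aligned_pairs = [
    ([("raising", "temperature")], [("increase", "temperature")]),
    ([("raising", "temperature")], [("raise", "temperature")]),
    ([("raising", "temperature")], [("increase", "temperature")]),  # duplicate
]
vo_dict = build_vo_dictionary(aligned_pairs)
print(vo_dict)
```

Positional pairing is the simplest possible alignment; sentences whose two sides yield different numbers of structures would need a more careful alignment step.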

In this step, the verified source-language retrieval index is preferably translated using the matching result from the "verb+object" bilingual dictionary. If no match is found, the verified source-language retrieval index is matched against a separate verb bilingual dictionary and a separate noun bilingual dictionary to obtain the target-language retrieval index. Those of ordinary skill in the art will appreciate that the verified source-language retrieval index can of course also be matched directly against the separate verb bilingual dictionary and object bilingual dictionary to obtain the target-language retrieval index.
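The lookup order described here (the "verb+object" dictionary first, then the separate verb and noun dictionaries) can be sketched as follows. The function name `translate_index` and the fallback dictionary entries are hypothetical; the "verb+object" entries follow Table 1, with the Chinese keys written as English glosses.

```python
from itertools import product

def translate_index(vo, vo_dict, verb_dict, noun_dict):
    """Translate one source-language (verb, object) index into target-language indexes."""
    if vo in vo_dict:                       # preferred: whole "verb+object" entry
        return list(vo_dict[vo])
    verb, obj = vo
    # Fallback: combine every verb translation with every object translation.
    return list(product(verb_dict.get(verb, []), noun_dict.get(obj, [])))

vo_dict = {
    ("raising", "temperature"): [("increase", "temperature"), ("raise", "temperature")],
    ("output", "light signal"): [("output", "light signal"), ("output", "optic signal"),
                                 ("output", "optical signal")],
}
verb_dict = {"detect": ["detect", "probe"]}      # hypothetical entries
noun_dict = {"radiation": ["radiation"]}         # hypothetical entries

print(translate_index(("raising", "temperature"), vo_dict, verb_dict, noun_dict))
print(translate_index(("detect", "radiation"), vo_dict, verb_dict, noun_dict))
```

The fallback can produce combinations that are not natural collocations, which is one reason a later verification step is useful.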

As the above description shows, the translation process of the present invention does not simply translate each word of the user's request; instead, it translates certain combinations of information words in the request while preserving the part-of-speech labels and semantic relations of the request.

According to a preferred embodiment of the invention, the method further comprises a step of performing synonym expansion on the obtained target-language retrieval index using a target-language thesaurus, which comprises a verb thesaurus and a noun thesaurus. Specifically, the target-language verb thesaurus and noun thesaurus are used to expand the "verb (V)" and the "object (O)" of the target-language retrieval index, respectively, and the expanded "verb (V)" and "object (O)" words are combined to form the extended target-language retrieval index, that is, the target-language expanded "verb (V)+object (O)" combinations. An example of expanding a target-language retrieval index is given below.

Example 4: suppose the target language is English and a target-language retrieval index is "dissolve (V)+aluminum layer (O)".

The target-language verb thesaurus is searched for synonyms of "dissolve (V)", and the synonym "liquefy (V)" is obtained; the target-language object thesaurus is searched for synonyms of "aluminum layer (O)", and the synonym "Al layer (O)" is obtained. The extended target-language indexes of the target-language retrieval index "dissolve (V)+aluminum layer (O)" are therefore:

"liquefy (V)+aluminum layer (O)",

"dissolve (V)+Al layer (O)" and

"liquefy (V)+Al layer (O)".
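The expansion in Example 4 amounts to combining each verb variant with each object variant and leaving out the original pair. A minimal sketch, with the hypothetical function name `expand_index` and thesauri holding only the entries from Example 4; the same approach would apply to the source-language expansion:

```python
from itertools import product

def expand_index(vo, verb_thesaurus, noun_thesaurus):
    """Return the synonym-expanded (verb, object) combinations, excluding the original."""
    verb, obj = vo
    verbs = [verb] + verb_thesaurus.get(verb, [])
    objects = [obj] + noun_thesaurus.get(obj, [])
    return [pair for pair in product(verbs, objects) if pair != vo]

verb_thesaurus = {"dissolve": ["liquefy"]}
noun_thesaurus = {"aluminum layer": ["Al layer"]}
expanded = expand_index(("dissolve", "aluminum layer"), verb_thesaurus, noun_thesaurus)
print(expanded)
```

The result contains the same three combinations as Example 4, in a different order. If neither word has a synonym, the expansion is empty and only the original index is used.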

In the query context, the translations of two unrelated query terms may nevertheless appear together in the target corpus, so an inappropriate translation may be selected. This would seriously degrade retrieval performance. Therefore, in a process similar to the verification of the extended source-language retrieval index, the extended target-language retrieval index is verified, yielding target-language retrieval indexes that are both comprehensive and accurate.

The verified target-language retrieval index is matched against the target-language document index, and the documents that match the user's query are output. Specifically, retrieval is performed in the target-language document library using the target-language document index; within the subset of texts that have a target-language document index, the knowledge/documents relevant to the user's query are then retrieved, that is, the documents whose target-language document index is identical to the verified target-language retrieval index are found and returned to the user as output.
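The matching step can be sketched with an inverted index that maps each "verb (V)+object (O)" pair to the documents containing it. This is an illustrative sketch only: the function names `build_document_index` and `match`, the document identifiers, and the contents are hypothetical, and matching is by exact equality of the pairs, as the text describes.

```python
from collections import defaultdict

def build_document_index(doc_vos):
    """doc_vos: {doc_id: [(verb, object), ...]} extracted from the document library."""
    index = defaultdict(set)
    for doc_id, vos in doc_vos.items():
        for vo in vos:
            index[vo].add(doc_id)
    return index

def match(query_indexes, document_index):
    """Return ids of documents whose index equals any verified query index."""
    hits = set()
    for vo in query_indexes:
        hits |= document_index.get(vo, set())
    return sorted(hits)

# Hypothetical document library after semantic analysis.
doc_vos = {
    "doc1": [("measure", "fluid flow")],
    "doc2": [("dissolve", "Al layer")],
    "doc3": [("heat", "water")],
}
document_index = build_document_index(doc_vos)
query = [("dissolve", "aluminum layer"), ("dissolve", "Al layer"), ("liquefy", "Al layer")]
print(match(query, document_index))
```

Because the document index depends only on the document library, it can be built once and reused across queries.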

Those of ordinary skill in the art will appreciate that the method of the present invention uses the target-language document index, which, as described above, is obtained by performing on the target-language document library a semantic analysis similar to that performed on the source-language query. If further retrievals are carried out with the above method, the target-language document index already obtained in the steps above can be used directly, and the semantic analysis of the target-language document library need not be performed again.

In summary, by using the "verb+object" combination (verb-object structure) of the query as the retrieval index, the present invention reduces the ambiguity that arises when single keywords are translated and improves the precision of cross-language retrieval. Preferably, combining this with synonym expansion of the keyword combination yields more correct retrieval results and improves the recall of cross-language retrieval.

It should be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as set forth in the appended claims. Accordingly, the scope of the claimed technical solution is not limited by any particular exemplary teaching given.

Claims

1. A method for cross-language knowledge retrieval, comprising the following steps:

2. The method according to claim 1, characterized in that, after step 10), the method further comprises the following steps:

3. The method according to claim 2, characterized in that, after step 11), the method further comprises the following step:

12) verifying the source-language retrieval index.

4. The method according to claim 3, characterized in that step 12) further comprises calculating the co-occurrence degree of the verb and the object in the source-language retrieval index according to the following formula:

SIM(v, o) = p(v, o) \times \log_2 \frac{p(v, o)}{p(v) \times p(o)} - \log_2 Dis(v, o),

where the verb is denoted v and the object is denoted o; c(v) and c(o) are the numbers of times v and o occur in the source-language document library; c(v, o) is the number of times v and o co-occur in the same sentence of the source-language document library; p(v, o) = c(v, o)/c(v) + c(v, o)/c(o); p(v) = c(v)/∑c(v); and Dis(v, o) is the average distance between v and o.

5. The method according to claim 3, characterized in that step 12) further comprises calculating the co-occurrence degree of the verb and the object in the source-language retrieval index according to the following formula:

SIM(v, o) = \left( \frac{c(v, o)}{c(v)} + \frac{c(v, o)}{c(o)} \right) / 2,

where the verb is denoted v and the object is denoted o; c(v) and c(o) are the numbers of times v and o occur in the source-language document library; and c(v, o) is the number of times v and o co-occur in the same sentence of the source-language document library.

6. The method according to any one of claims 1 to 5, characterized in that step 20) uses a "verb+object" bilingual dictionary, wherein the "verb+object" bilingual dictionary contains source-language "verb+object" entries and the corresponding target-language "verb+object" entries.

7. The method according to claim 6, characterized in that, if in step 20) the "verb+object" bilingual dictionary does not contain the target-language retrieval index, the method comprises the following step: translating the source-language retrieval index into the target-language retrieval index using a verb bilingual dictionary and a noun bilingual dictionary.

8. The method according to any one of claims 1 to 5, characterized in that step 20) uses a verb bilingual dictionary and a noun bilingual dictionary.

9. The method according to any one of claims 1 to 5, characterized in that, after step 20), the method further comprises the following steps:

10. The method according to claim 9, characterized in that, after step 21), the method further comprises the step:

22) verifying the target-language retrieval index.