CN102654866A

CN102654866A - Method and device for establishing example sentence index and method and device for indexing example sentences

Info

Publication number: CN102654866A
Application number: CN2011100498475A
Authority: CN
Inventors: 赵世奇; 吴甜; 王海峰; 吴华
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2011-03-02
Filing date: 2011-03-02
Publication date: 2012-09-05

Abstract

The invention provides a method and a device for establishing an example sentence index and a method and a device for indexing example sentences. A special index is established for the example sentences by performing text analysis on the example sentences in an example sentence library; when a user inputs a grammar-based advanced search requirement, the search requirement input by the user is resolved; search results of respective inquiry items are acquired according to resolved inquiry items; and the search results of the respective inquiry items are integrated and processed according to a logic relation of the resolved inquiry items. The established index and the inquiry items are at least one of the following combinations: a combination of terms in the example sentences and parts of speech corresponding to the terms, a combination of terms in the example sentences and types of named entities corresponding to the terms, a combination of terms in the example sentences and syntactic roles corresponding to the terms, and a combination of terms in the example sentences. According to the methods and the devices, the grammar-based advanced search can be realized, so that the search effect can be improved.

Description

Method and device for creating an example sentence index, and method and device for retrieving example sentences

[Technical Field]

The present invention relates to the field of computer technology, and in particular to a method and device for creating an example sentence index and a method and device for retrieving example sentences.

[Background]

Information retrieval refers to the process and technology of organizing information in a certain way and finding the relevant information according to the needs of information users. Information retrieval is widely used in fields such as documents, multimedia, and translation.

Among existing information retrieval techniques there is a special kind, example sentence retrieval, which retrieves the example sentences that contain certain keywords. It is typically used to display example sentences in a monolingual dictionary or in translation technology. However, existing example sentence retrieval usually relies on keyword matching alone. For example, when it is applied to example sentence display in a monolingual dictionary and the user enters the query "computer", the example sentences containing the keyword "computer" are retrieved. When it is applied to Chinese-English translation and the user enters the Chinese query meaning "computer", the example sentences containing the corresponding English word "computer" are retrieved. Grammar-based advanced searches, however, cannot be realized. For example, the user cannot retrieve the example sentences in which "difficulty" is used as a noun, those in which "raise" and "level" are used as a collocation, or those in which "apple" refers to an electronics brand.

[Summary of the Invention]

The invention provides a method and device for creating an example sentence index and a method and device for retrieving example sentences, so as to realize grammar-based advanced search.

The specific technical solution is as follows:

An example sentence index creation method, in which the following steps are performed on each example sentence in an example sentence library:

A. performing text analysis on the example sentence;

B. creating the index corresponding to the example sentence according to the result of the text analysis;

wherein the index includes at least one of the following: a combination of a word in the example sentence and the part of speech corresponding to the word; a combination of a word in the example sentence and the named entity type corresponding to the word; a combination of a word in the example sentence and the syntactic role corresponding to the word; and a combination of one word in the example sentence with another.

Step A specifically includes:

A1. performing word segmentation on the example sentence;

A2. performing at least one of steps A21, A22, A23 and A24:

A21. performing part-of-speech tagging on each word obtained by the word segmentation;

A22. performing proper noun recognition on each word obtained by the word segmentation, and determining the named entity type corresponding to each word recognized as a proper noun;

A23. performing syntactic analysis on the words obtained by the word segmentation, and determining the syntactic role of each word;

A24. combining the words obtained by the word segmentation in pairs.

If step A21 is performed, step B specifically includes: taking each combination of a word and its corresponding part of speech, one by one, as an index of the example sentence.

If step A22 is performed, step B specifically includes: taking each combination of a word recognized as a proper noun and its corresponding named entity type, one by one, as an index of the example sentence.

If step A23 is performed, step B specifically includes: taking each combination of a word and its corresponding syntactic role, one by one, as an index of the example sentence.

If step A24 is performed, step B specifically includes: taking each combination obtained in step A24 as an index of the example sentence.

In addition, the method further includes: taking each word obtained by the word segmentation as an index of the example sentence.
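Steps A and B can be sketched as follows. This is a minimal illustration that assumes the text analysis has already produced, for each word, a part of speech, an optional named entity type, and a syntactic role. The `Token` type, the `build_index_entries` function, the tag strings, and the sample collocation pair are all hypothetical and are not defined by the source.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    """One segmented word with its analysis results (hypothetical structure)."""
    word: str
    pos: str                 # part of speech (step A21)
    ne_type: Optional[str]   # named entity type (step A22), None if not a proper noun
    role: Optional[str]      # syntactic role (step A23)


def build_index_entries(tokens, collocations):
    """Return the set of index entries for one analysed example sentence."""
    entries = set()
    for t in tokens:
        entries.add(("word", t.word))                  # each word by itself
        entries.add(("word-pos", t.word, t.pos))       # word + part of speech
        if t.ne_type is not None:
            entries.add(("word-ne", t.word, t.ne_type))  # proper nouns only
        if t.role is not None:
            entries.add(("word-role", t.word, t.role))
    for first, second in collocations:                 # pairs from step A24
        entries.add(("word-word", first, second))
    return entries


# Analysis of "I like the mobile phone of apple"; the particle is omitted for brevity.
tokens = [
    Token("I", "pronoun", None, "subject"),
    Token("like", "verb", None, "predicate"),
    Token("apple", "noun", "brand", "attribute"),
    Token("mobile phone", "noun", None, "object"),
]
# The verb-object pair below is an assumed output of the syntactic analysis.
entries = build_index_entries(tokens, [("like", "mobile phone")])
```

Tagging each entry with its kind keeps the different combinations apart, so an entry pairing a word with a part of speech cannot be confused with one pairing it with a syntactic role.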

Step A24 specifically includes: determining, based on syntactic analysis, the pairs of words obtained by the word segmentation between which a collocation relation exists;

wherein the collocation relation includes: a subject-predicate relation, a verb-object relation, a modifier-head relation, a head-complement relation, or an appositive relation.

Preferably, before step A24, or before step B, the method further includes:

filtering the words obtained by the word segmentation against a preset stop-word list, and removing the words contained in the stop-word list.

The example sentence library is a monolingual example sentence library or a bilingual example sentence library.

If the example sentence library is a bilingual example sentence library, the method further includes:

taking the indexes corresponding to both example sentences of each bilingual example sentence pair in the bilingual example sentence library as the indexes corresponding to that bilingual example sentence pair.

Further, the method also includes:

building an index table by inverted indexing from the example sentences in the example sentence library and the indexes corresponding to the example sentences;

wherein, in the index table, the index value is an example sentence and the index key is an index corresponding to the example sentence.

For a bilingual example sentence library, the index table is built by inverted indexing from the bilingual example sentence pairs in the library and the indexes corresponding to those pairs, wherein the index value in the index table is a bilingual example sentence pair and the index key is an index corresponding to the pair.
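The inverted table described above can be sketched as a mapping from each index key to the set of example sentences that carry it. This is only an illustration; the function names and the sample sentences are hypothetical, and a production system would more likely store sentence identifiers than full strings.

```python
from collections import defaultdict


def build_index_table(library):
    """Invert (sentence, index entries) records into key -> set of sentences."""
    table = defaultdict(set)
    for sentence, entries in library:
        for key in entries:
            table[key].add(sentence)
    return table


def bilingual_record(source, source_entries, target, target_entries):
    """For a bilingual library, the pair is the value and both sides' entries are its keys."""
    return (source, target), set(source_entries) | set(target_entries)


# Made-up monolingual library with "word + part of speech" entries.
library = [
    ("I like the mobile phone of apple", {("apple", "noun"), ("like", "verb")}),
    ("He ate an apple", {("apple", "noun"), ("ate", "verb")}),
]
table = build_index_table(library)
```

Looking up `table[("apple", "noun")]` returns both sentences, while `table[("like", "verb")]` returns only the first.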

The index table includes at least one of the following:

a "word-POS" index table, in which the index key is the combination of a word and the part of speech corresponding to the word;

a "word-NE type" index table, in which the index key is the combination of a word and the NE type corresponding to the word;

a "word-syntactic role" index table, in which the index key is the combination of a word and the syntactic role corresponding to the word; and

a "word-word" index table, in which the index key is the combination of one word with another.

Preferably, in the "word-POS" index table, the "word-NE type" index table, the "word-syntactic role" index table, or the "word-word" index table, the index key is a two-level index key, specifically:

identical words in the index keys are grouped together as the first-level index; in the "word-POS" index table, the parts of speech corresponding to the first-level index serve as the second-level index; in the "word-NE type" index table, the NE types corresponding to the first-level index serve as the second-level index; in the "word-syntactic role" index table, the syntactic roles corresponding to the first-level index serve as the second-level index; and in the "word-word" index table, the other word combined with the first-level index serves as the second-level index.
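A two-level key of this kind can be modelled as a nested mapping, where the first level is the word and the second level is the part of speech, NE type, syntactic role, or partner word. The sketch below assumes a flat table whose keys are `(word, second)` tuples; the function name and the sample data (`s1`, `s2`, `s3` stand for example sentences) are hypothetical.

```python
from collections import defaultdict


def build_two_level_table(flat_table):
    """Group a flat {(word, second): sentences} table under each first-level word."""
    nested = defaultdict(lambda: defaultdict(set))
    for (word, second), sentences in flat_table.items():
        nested[word][second] |= sentences
    return nested


flat = {
    ("apple", "noun"): {"s1", "s2"},
    ("like", "verb"): {"s1"},
    ("like", "preposition"): {"s3"},
}
nested = build_two_level_table(flat)
```

Grouping by word means the word is stored once, and all of its second-level values can be listed with a single lookup such as `nested["like"]`.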

An example sentence retrieval method, comprising:

A. receiving a retrieval request (query) from a user;

B. parsing out the query terms contained in the query and, if the query contains multiple query terms, also parsing out the logical relations between the query terms;

C. retrieving with each parsed query term, one by one, to obtain the retrieval result corresponding to each query term;

D. if the query contains multiple query terms, integrating the retrieval results corresponding to the query terms according to the logical relations between them, and returning the integrated retrieval result to the user; if the query contains a single query term, returning the retrieval result corresponding to that query term to the user;

wherein each query term is at least one of the following: a combination of a word and the part of speech corresponding to the word; a combination of a word and the named entity type corresponding to the word; a combination of a word and the syntactic role corresponding to the word; and a combination of one word with another; and each logical relation is an intersection or a difference.
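The integration in step D maps naturally onto set operations. The sketch below assumes that the query has already been parsed into an ordered list of per-term result sets and a list of operators, with `"+"` standing for intersection and `"-"` for difference, and that the operators are applied from left to right. The evaluation order and the function name are assumptions; the source does not specify them.

```python
def integrate(results, operators):
    """Combine per-term result sets according to the relations between adjacent terms.

    results:   list of sets, one per query term, in query order
    operators: list of "+" (intersection) or "-" (difference), one between each pair
    """
    if not results or len(operators) != len(results) - 1:
        raise ValueError("need one operator between each pair of query terms")
    combined = set(results[0])
    for op, next_result in zip(operators, results[1:]):
        if op == "+":
            combined &= next_result   # keep sentences matching both terms
        elif op == "-":
            combined -= next_result   # drop sentences matching the following term
        else:
            raise ValueError(f"unknown logical relation: {op!r}")
    return combined


merged = integrate([{1, 2, 3}, {2, 3}, {3}], ["+", "-"])
single = integrate([{1}], [])
```

With a single query term the operator list is empty and the term's own result is returned unchanged, which matches the second branch of step D.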

Step C is specifically:

if the parsed query term is a combination of a word and the part of speech corresponding to the word, matching the query term against the index keys in the "word-POS" index table, and taking the index values corresponding to the matched index key as the retrieval result of the query term;

if the parsed query term is a combination of a word and the NE type corresponding to the word, matching the query term against the index keys in the "word-NE type" index table, and taking the index values corresponding to the matched index key as the retrieval result of the query term;

if the parsed query term is a combination of a word and the syntactic role corresponding to the word, matching the query term against the index keys in the "word-syntactic role" index table, and taking the index values corresponding to the matched index key as the retrieval result of the query term;

if the parsed query term is a combination of one word with another, matching the query term against the index keys in the "word-word" index table, and taking the index values corresponding to the matched index key as the retrieval result of the query term.

The combination of one word with another is a combination of two words between which a collocation relation based on syntactic analysis exists.

In addition, the parsed query terms may also include a single word;

if a query term is a word, the query term is matched against the index keys in the "word" index table, and the index values corresponding to the matched index key are taken as the retrieval result of the query term.

The index values in the "word-POS" index table, the "word-NE type" index table, the "word-syntactic role" index table, the "word-word" index table, and the "word" index table are example sentences or bilingual example sentence pairs.

Preferably, if a query term is not the query term immediately following a difference relation, and the retrieval result corresponding to the query term falls below a preset minimum retrieval requirement, each word in the query term is matched against the index keys in the "word" index table, and the index values corresponding to the matched index keys are taken as the retrieval result of the query term.
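This supplementary retrieval can be sketched as follows. The sketch makes two assumptions that the source leaves open: the "minimum retrieval requirement" is treated as a minimum number of hits, and the word-level hits are merged with the original hits by union. The function name and parameters are hypothetical.

```python
def retrieve_with_fallback(term_words, primary_hits, word_table,
                           follows_difference, min_hits=1):
    """Fall back to plain word lookup when a query term returns too few results.

    term_words:         the words that make up the query term
    primary_hits:       sentences found with the full query term
    word_table:         the "word" index table, word -> set of sentences
    follows_difference: True if the term comes right after a difference relation
    """
    hits = set(primary_hits)
    # A term that is being subtracted must not be widened, and a term with
    # enough hits needs no supplement.
    if follows_difference or len(hits) >= min_hits:
        return hits
    for word in term_words:
        hits |= word_table.get(word, set())
    return hits


word_table = {"raise": {"s1"}, "level": {"s2"}}
widened = retrieve_with_fallback(["raise", "level"], set(), word_table, False)
untouched = retrieve_with_fallback(["raise", "level"], set(), word_table, True)
```

Excluding the term after a difference relation matters because widening it would remove more sentences from the final result rather than add any.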

Further, before the integrated retrieval result is returned to the user, the method also includes:

sorting the integrated retrieval results, wherein the basis of the sorting includes one or a combination of the following:

the confidence of the source of a retrieval result, and the degree of match between the retrieval result and the query.

Specifically, the degree of match $F(R_i)$ between a retrieval result and the query is:

$$F(R_i) = \lambda_{item} \sum_{j=1}^{J} \delta(R_i, item_j) + \lambda_{word} \sum_{k=1}^{K} \delta(R_i, word_k) + \lambda_{[+]} \sum_{m=1}^{M} \delta(R_i, [+]_m) + \lambda_{[-]} \sum_{n=1}^{N} \delta(R_i, [-]_n)$$

where $\lambda_{item}$, $\lambda_{word}$, $\lambda_{[+]}$ and $\lambda_{[-]}$ are preset weight parameters; $\delta(R_i, item_j)$ is the matching value between retrieval result $R_i$ and the $j$-th query term, and $J$ is the number of query terms contained in the query; $\delta(R_i, word_k)$ is the matching value between retrieval result $R_i$ and the $k$-th word, and $K$ is the number of words used in the query; $\delta(R_i, [+]_m)$ is the matching value between retrieval result $R_i$ and the $m$-th intersection relation, and $M$ is the number of intersection relations in the query; $\delta(R_i, [-]_n)$ is the matching value between retrieval result $R_i$ and the $n$-th difference relation, and $N$ is the number of difference relations in the query.

If $item_j$ is an index of $R_i$, $\delta(R_i, item_j)$ is 1; otherwise $\delta(R_i, item_j)$ is 0.

If $word_k$ is an index of $R_i$, $\delta(R_i, word_k)$ is 1; otherwise $\delta(R_i, word_k)$ is 0.

If the query terms at both ends of the intersection relation $[+]_m$ are indexes of $R_i$, $\delta(R_i, [+]_m)$ is 1; otherwise $\delta(R_i, [+]_m)$ is 0.

If the query term immediately preceding the difference relation $[-]_n$ is an index of $R_i$ and the query term immediately following it is not an index of $R_i$, $\delta(R_i, [-]_n)$ is 1; otherwise $\delta(R_i, [-]_n)$ is 0.
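The matching score can be computed directly from the four indicator sums. The sketch below assumes that the indexes of a retrieval result are available as a set, that each logical relation is given as an `(operator, preceding term, following term)` triple, and that the weights are supplied in a dictionary. These data shapes, the function name, and the weight values are all hypothetical.

```python
def match_score(entries, items, words, relations, weights):
    """Compute F(R_i) as the weighted sum of the four indicator sums.

    entries:   set of indexes of the retrieval result R_i
    items:     query terms contained in the query
    words:     words used in the query
    relations: (op, left, right) triples, op is "+" (intersection) or "-" (difference)
    weights:   dict with keys "item", "word", "plus", "minus"
    """
    item_hits = sum(1 for item in items if item in entries)
    word_hits = sum(1 for word in words if word in entries)
    plus_hits = sum(1 for op, left, right in relations
                    if op == "+" and left in entries and right in entries)
    minus_hits = sum(1 for op, left, right in relations
                     if op == "-" and left in entries and right not in entries)
    return (weights["item"] * item_hits + weights["word"] * word_hits
            + weights["plus"] * plus_hits + weights["minus"] * minus_hits)


entries = {"I", "like", "apple", "mobile phone",
           ("apple", "brand"), ("like", "verb"), ("mobile phone", "object")}
items = [("apple", "brand"), ("like", "verb"), ("mobile phone", "subject")]
relations = [("+", items[0], items[1]), ("-", items[1], items[2])]
weights = {"item": 1.0, "word": 0.5, "plus": 1.0, "minus": 1.0}  # made-up values
score = match_score(entries, items, ["apple", "like"], relations, weights)
```

In this example two query terms and two words match, the intersection holds, and the subtracted term is absent from the result, so every sum contributes to the score.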

An example sentence index creation device, comprising a text analysis unit and an index creation unit;

the text analysis unit is configured to perform text analysis on each example sentence in an example sentence library;

the index creation unit is configured to create the index corresponding to each example sentence according to the analysis result of the text analysis unit; wherein the index includes at least one of the following: a combination of a word in the example sentence and the part of speech corresponding to the word; a combination of a word in the example sentence and the named entity type corresponding to the word; a combination of a word in the example sentence and the syntactic role corresponding to the word; and a combination of one word in the example sentence with another.

The text analysis unit includes a word segmentation subunit, and also includes at least one of the following subunits: a part-of-speech tagging subunit, an NE recognition subunit, a syntactic analysis subunit, and a collocation combination subunit;

the word segmentation subunit is configured to perform word segmentation on the example sentence;

the part-of-speech tagging subunit is configured to perform part-of-speech tagging on each word obtained by the word segmentation;

the NE recognition subunit is configured to perform proper noun recognition on each word obtained by the word segmentation, and to determine the named entity type corresponding to each word recognized as a proper noun;

the syntactic analysis subunit is configured to perform syntactic analysis on the words obtained by the word segmentation, and to determine the syntactic role of each word;

the collocation combination subunit is configured to combine the words obtained by the word segmentation in pairs.

The index creation unit takes, according to the tagging result of the part-of-speech tagging subunit, each combination of a word and its corresponding part of speech, one by one, as an index of the example sentence; or takes, according to the recognition result of the NE recognition subunit, each combination of a word recognized as a proper noun and its corresponding named entity type, one by one, as an index of the example sentence; or takes, according to the analysis result of the syntactic analysis subunit, each combination of a word and its corresponding syntactic role, one by one, as an index of the example sentence; or takes each combination obtained by the collocation combination subunit as an index of the example sentence.

In addition, the index creation unit is also configured to take each word obtained by the word segmentation subunit as an index of the example sentence.

The collocation combination subunit specifically determines, based on syntactic analysis, the pairs of words obtained by the word segmentation between which a collocation relation exists.

Preferably, the text analysis unit also includes a word filtering subunit, configured to filter the words obtained by the word segmentation subunit against a preset stop-word list; after the words contained in the stop-word list have been removed, the remaining words are provided to the collocation combination subunit for combination, or to the index creation unit for index creation.

If the example sentence library is a bilingual example sentence library, the index creation unit takes the indexes corresponding to both example sentences of each bilingual example sentence pair in the bilingual example sentence library as the indexes corresponding to that bilingual example sentence pair.

Further, the device also includes an index table creation unit, configured to build an index table by inverted indexing from the indexes that the index creation unit has created for each example sentence in the example sentence library; wherein, in the index table, the index value is an example sentence and the index key is an index corresponding to the example sentence.

For a bilingual example sentence library, the index table creation unit is configured to build the index table by inverted indexing from the indexes that the index creation unit has created for each bilingual example sentence pair in the bilingual example sentence library, wherein the index value in the index table is a bilingual example sentence pair and the index key is an index corresponding to the pair.

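The index table includes at least one of the following: the "word-POS" index table, the "word-NE type" index table, the "word-syntactic role" index table, and the "word-word" index table.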

An example sentence retrieval device, comprising a user interaction unit, a request parsing unit, a retrieval processing unit, and a result integration unit;

the user interaction unit is configured to receive a retrieval request (query) from a user, and to return the retrieval result provided by the result integration unit to the user;

the request parsing unit is configured to parse out the query terms contained in the query and, if the query contains multiple query terms, also to parse out the logical relations between the query terms;

the retrieval processing unit is configured to retrieve with each query term parsed out by the request parsing unit, one by one, to obtain the retrieval result corresponding to each query term;

the result integration unit is configured, when the request parsing unit finds that the query contains multiple query terms, to integrate the retrieval results corresponding to the query terms according to the logical relations parsed out by the request parsing unit, and to provide the integrated retrieval result to the user interaction unit; and, when the request parsing unit finds that the query contains a single query term, to provide the retrieval result corresponding to that query term to the user interaction unit.

If the query term parsed out by the request parsing unit is a combination of a word and the part of speech corresponding to the word, the retrieval processing unit matches the query term against the index keys in the "word-POS" index table, and takes the index values corresponding to the matched index key as the retrieval result of the query term;

if the query term parsed out by the request parsing unit is a combination of a word and the NE type corresponding to the word, the retrieval processing unit matches the query term against the index keys in the "word-NE type" index table, and takes the index values corresponding to the matched index key as the retrieval result of the query term;

if the query term parsed out by the request parsing unit is a combination of a word and the syntactic role corresponding to the word, the retrieval processing unit matches the query term against the index keys in the "word-syntactic role" index table, and takes the index values corresponding to the matched index key as the retrieval result of the query term;

if the query term parsed out by the request parsing unit is a combination of one word with another, the retrieval processing unit matches the query term against the index keys in the "word-word" index table, and takes the index values corresponding to the matched index key as the retrieval result of the query term.

In addition, the query terms parsed out by the request parsing unit may include a single word;

if the query term parsed out by the request parsing unit is a word, the retrieval processing unit matches the query term against the index keys in the "word" index table, and takes the index values corresponding to the matched index key as the retrieval result of the query term.

The index values in the "word-POS" index table, the "word-NE type" index table, the "word-syntactic role" index table, the "word-word" index table, and the "word" index table are example sentences or bilingual example sentence pairs.

Preferably, the device also includes a supplementary retrieval unit, configured, when a query term is not the query term immediately following a difference relation and the retrieval result corresponding to the query term falls below a preset minimum retrieval requirement, to match each word in the query term against the index keys in the "word" index table and to take the index values corresponding to the matched index keys as the retrieval result of the query term.

Specifically, the result integration unit may include:

an integration subunit, configured, when the request parsing unit finds that the query contains multiple query terms, to integrate the retrieval results corresponding to the query terms according to the logical relations parsed out by the request parsing unit;

a sorting subunit, configured to sort the integrated retrieval results, wherein the basis of the sorting includes one or a combination of the following: the confidence of the source of a retrieval result, and the degree of match between the retrieval result and the query.

The degree of match $F(R_i)$ between a retrieval result and the query is:

$$F(R_i) = \lambda_{item} \sum_{j=1}^{J} \delta(R_i, item_j) + \lambda_{word} \sum_{k=1}^{K} \delta(R_i, word_k) + \lambda_{[+]} \sum_{m=1}^{M} \delta(R_i, [+]_m) + \lambda_{[-]} \sum_{n=1}^{N} \delta(R_i, [-]_n)$$

Specifically, if $item_j$ is an index of $R_i$, $\delta(R_i, item_j)$ is 1; otherwise $\delta(R_i, item_j)$ is 0.

As can be seen from the above technical solution, the invention builds a dedicated index for example sentence retrieval by performing text analysis on the example sentences in the example sentence library. When the user enters a grammar-based advanced search, the retrieval request is parsed, the retrieval result of each parsed query term is obtained, and the retrieval results of the query terms are integrated according to the logical relations between the parsed query terms. The index and the query terms are at least one of the following: a combination of a word in the example sentence and the part of speech corresponding to the word; a combination of a word in the example sentence and the named entity type corresponding to the word; a combination of a word in the example sentence and the syntactic role corresponding to the word; and a combination of one word in the example sentence with another. The invention can therefore satisfy grammar-based advanced searches, for example when the user wants to retrieve the example sentences in which "difficulty" is used as a noun, those in which "raise" and "level" are used as a collocation, or those in which "apple" refers to an electronics brand, thereby improving the retrieval effect.

[Description of the Drawings]

Fig. 1 is a flowchart of the example sentence index creation method provided by Embodiment 1 of the invention;

Fig. 2a is an example of the "word-POS" index table provided by Embodiment 1 of the invention;

Fig. 2b is an example of the "word-NE type" index table provided by Embodiment 1 of the invention;

Fig. 2c is an example of the "word-syntactic role" index table provided by Embodiment 1 of the invention;

Fig. 2d is an example of the "word-word" index table provided by Embodiment 1 of the invention;

Fig. 3 is a schematic diagram of the index table built for a bilingual example sentence library, provided by Embodiment 1 of the invention;

Fig. 4 is a flowchart of the example sentence retrieval method provided by Embodiment 2 of the invention;

Fig. 5 is a structural diagram of the example sentence index creation device provided by Embodiment 3 of the invention;

Fig. 6 is a structural diagram of the example sentence retrieval device provided by Embodiment 4 of the invention.

[Detailed Description]

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described below with reference to the drawings and specific embodiments.

First, the example sentence index creation method provided by the invention is described in detail through Embodiment 1.

Embodiment 1

Fig. 1 is a flowchart of the example sentence index creation method provided by Embodiment 1 of the invention. As shown in Fig. 1, the following steps are performed on each example sentence in the example sentence library:

Step 101: perform word segmentation on the example sentence.

Word segmentation is a relatively mature technology in this field. English is written word by word, with words separated by spaces, so English example sentences can be segmented very easily. Chinese is written character by character, so Chinese example sentences can be segmented with existing methods such as segmentation based on string matching, segmentation based on understanding, or segmentation based on statistics; a commonly used string-matching method is, for example, the forward maximum matching algorithm. The word segmentation methods are not described in further detail here.
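The forward maximum matching algorithm mentioned above can be illustrated with a short sketch. It scans the text from left to right and, at each position, takes the longest dictionary word that starts there, falling back to a single character when nothing matches. The function name, the `max_len` parameter, and the toy dictionary are hypothetical, and the unspaced English string only stands in for unsegmented Chinese text.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Segment text greedily, preferring the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


segments = forward_max_match("ilikeapple", {"like", "apple"})
```

Here `segments` is `["i", "like", "apple"]`. The greedy choice is simple and fast, but it can pick the wrong boundary when a longer dictionary word overlaps the intended one, which is one reason understanding-based and statistical methods exist.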

The example sentence "I like the mobile phone of apple" is used throughout Embodiment 1 to illustrate the processing of each step.

In this step, word segmentation is first performed on the example sentence "I like the mobile phone of apple", which yields the words "I", "like", "apple", "of", and "mobile phone".

Step 102: perform part-of-speech tagging on each word obtained by the word segmentation.

Part-of-speech tagging is also an existing mature technology. It is usually based on a corpus and determines the part of speech of a word by combining the part-of-speech probability of the word itself with the part-of-speech probability of the word given its context. For example, a hidden Markov model or a maximum entropy Markov model may be used.

Continuing the example above, the words obtained by the word segmentation are tagged as follows: "I" is a pronoun, "like" is a verb, "apple" is a noun, "of" is an auxiliary word, and "mobile phone" is a noun.

Step 103: perform proper noun recognition on each word obtained by the word segmentation, and determine the named entity (NE) type corresponding to each word recognized as a proper noun.

A proper noun is a name with a specific meaning. The corresponding NE types may include, but are not limited to: person name, place name, institution name, organization name, city name, country name, and brand name.

For example, the Chinese words for "United Nations", "New York", and "Hemingway" are all proper nouns, and their NE types are organization name, city name, and person name respectively. Likewise, the corresponding English "United Nations", "New York", and "Hemingway" are also proper nouns.

Proper noun recognition is also an existing mature technology. It is usually based on a corpus and determines the NE type of a word by combining the NE type probability of the word itself with the NE type probability of the word given its context. For example, a conditional random field (CRF, Conditional Random Fields) model or a cascaded hidden Markov model may be used.

Continuing the example above, proper noun recognition is performed on the words obtained by the word segmentation, and "apple" is recognized as a proper noun whose NE type is brand name.

Step 104: perform syntactic analysis on the words obtained from word segmentation, and determine the syntactic role of each word.

The syntactic roles of a word can include, but are not limited to: subject, predicate, object, adverbial, attributive, predicative and complement.

Syntactic analysis is also a mature existing technique: the syntactic role of a word is determined from its part of speech and its syntactic role probability in the given context. For example, an analysis method based on dependency syntax, or an analysis method that generates a complex-feature-set syntax tree, can be used.

Continuing the example above, after syntactic analysis of the words obtained by segmenting the example sentence "I like the mobile phone of apple", the syntactic roles are determined as follows: "I" is the subject, "like" is the predicate, "apple" is an attributive, and "mobile phone" is the object.

Step 105: combine the words obtained from word segmentation in pairs.

Preferably, before step 105, or before steps 106 to 109 are executed, the words obtained from segmentation can first be filtered against a stop word list, removing the words that the list contains. A stop word list usually includes pronouns, function words, auxiliary words, articles, interjections, etc., and is built by collecting words whose frequency of occurrence in existing resources reaches a preset high-frequency condition. For example, the auxiliary word "的" occurs very frequently but usually carries very little meaning, so it is collected into the stop word list.

Of course, the words may also be left unfiltered, and all words obtained from segmentation used to build the index. The examples in this embodiment assume that the words obtained from segmentation are filtered against a stop word list.

Suppose the stop word list contains "I" and "的". After the words obtained from segmentation in the example are filtered against the stop word list, "I" and "的" are removed, leaving "like", "apple" and "mobile phone".

In step 105, combining the words in pairs yields the following combinations: "like" with "apple", "like" with "mobile phone", and "apple" with "mobile phone".

Preferably, when the words obtained from segmentation are combined in pairs in this step, only words that stand in a collocation relation according to syntactic analysis are combined. The collocation relations can include, but are not limited to: subject-predicate relation, verb-object relation, modifier-head relation, head-complement relation and appositive relation.

In the example above, the pairwise combinations of words that stand in a collocation relation are: "like" with "mobile phone", and "apple" with "mobile phone".
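The filtering and pairwise combination of step 105 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names, the English glosses used as tokens ("of" stands in for the auxiliary word), and the hard-coded collocation set standing in for a syntactic analyzer's output are all assumptions.

```python
from itertools import combinations

# Hypothetical stop word list for the running example.
STOP_WORDS = {"I", "of"}


def filter_stop_words(words, stop_words=STOP_WORDS):
    """Drop every segmented word that appears in the stop word list."""
    return [w for w in words if w not in stop_words]


def pairwise_combinations(words):
    """Return every unordered pair of the remaining words, in sentence order."""
    return list(combinations(words, 2))


def collocation_pairs(pairs, collocations):
    """Keep only the pairs that the (assumed) syntactic analysis marked as collocations."""
    return [p for p in pairs if frozenset(p) in collocations]


segmented = ["I", "like", "apple", "of", "mobile phone"]
kept = filter_stop_words(segmented)
pairs = pairwise_combinations(kept)

# Assumed analyzer output: one verb-object and one modifier-head relation.
collocations = {
    frozenset(("like", "mobile phone")),
    frozenset(("apple", "mobile phone")),
}

print(pairs)
print(collocation_pairs(pairs, collocations))
```

With these inputs the unrestricted variant keeps three pairs and the collocation-restricted variant keeps two, matching the example in the text.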

Steps 101 to 105 constitute the text analysis of the example sentence. At least one of steps 102 to 105 is executed; if several of them are selected, there is no fixed order among them: they can be executed one after another in any order, or simultaneously.

Step 106: take the combination of each word and its part of speech, one by one, as an index of the example sentence.

Continuing the example above, the example sentence "I like the mobile phone of apple" produces three indexes in this step: "like" as verb, "apple" as noun, and "mobile phone" as noun.

This step corresponds to step 102: if step 102 is not executed, step 106 is not executed either.

Step 107: take the combination of each word recognized as a proper noun and its NE type, one by one, as an index of the example sentence.

Continuing the example above, the example sentence "I like the mobile phone of apple" produces one index in this step: "apple" as brand name.

This step corresponds to step 103: if step 103 is not executed, step 107 is not executed either.

Step 108: take the combination of each word obtained from segmentation and its syntactic role, one by one, as an index of the example sentence.

Continuing the example above, the example sentence "I like the mobile phone of apple" produces three indexes in this step: "like" as predicate, "apple" as attributive, and "mobile phone" as object.

This step corresponds to step 104: if step 104 is not executed, step 108 is not executed either.

Step 109: take each combination obtained in step 105, one by one, as an index of the example sentence.

Continuing the example above, the example sentence "I like the mobile phone of apple" produces three indexes in this step: the combination of "like" and "apple", the combination of "like" and "mobile phone", and the combination of "apple" and "mobile phone". If only pairwise combinations with a collocation relation are used, two indexes are produced: the combination of "like" and "mobile phone", and the combination of "apple" and "mobile phone".

This step corresponds to step 105: if step 105 is not executed, step 109 is not executed either.

Likewise, if several of steps 106 to 109 are executed, they can be executed one after another in any order, or simultaneously.
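Steps 106 to 109 can be illustrated with a short sketch that turns the analysis result of one example sentence into the four kinds of index. The `Token` record, the function name `build_index_keys` and the table labels are hypothetical; the tags and roles are simply those of the running example, assumed to come from upstream analysis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """One segmented word with its (assumed) analysis results."""
    word: str
    pos: str                       # part of speech from step 102
    role: str                      # syntactic role from step 104
    ne_type: Optional[str] = None  # NE type from step 103, if a proper noun


def build_index_keys(tokens, pairs):
    """Produce the indexes of steps 106 to 109 for one example sentence."""
    return {
        "word-pos": [(t.word, t.pos) for t in tokens],                  # step 106
        "word-ne": [(t.word, t.ne_type) for t in tokens if t.ne_type],  # step 107
        "word-role": [(t.word, t.role) for t in tokens],                # step 108
        "word-word": list(pairs),                                       # step 109
    }


tokens = [
    Token("like", "verb", "predicate"),
    Token("apple", "noun", "attributive", ne_type="brand name"),
    Token("mobile phone", "noun", "object"),
]
pairs = [("like", "mobile phone"), ("apple", "mobile phone")]
keys = build_index_keys(tokens, pairs)
print(keys)
```

Skipping an analysis step corresponds to leaving the matching entry out of the result, which mirrors the rule that step 106 is skipped when step 102 is skipped, and so on.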

After the above steps have been executed for the example sentences in the example sentence library and their indexes determined, step 110 is executed.

Step 110: build index tables as inverted indexes, in which the index value is an example sentence and the index key is an index of that example sentence.

When the index tables are built, different kinds of index table can be built for the different types of index. Specifically, at least one of the following index tables can be included:

"word-part of speech" index table: the index key is the combination of a word and its part of speech, and the index value is the example sentences indexed by that key.

"word-NE type" index table: the index key is the combination of a word and its NE type, and the index value is the example sentences indexed by that key.

"word-syntactic role" index table: the index key is the combination of a word and its syntactic role, and the index value is the example sentences indexed by that key.

"word-word" index table: the index key is the combination of a word with another word, and the index value is the example sentences indexed by that key.

In addition to the above index tables, a conventional "word" index table can also be built, in which the index key is a word and the index value is the example sentences containing that word.

Preferably, the index keys in the "word-part of speech", "word-NE type", "word-syntactic role" and "word-word" index tables can be consolidated into two-level index keys. Identical words in the index keys are grouped together as the first-level index. In the "word-part of speech" index table, the parts of speech corresponding to the first-level index serve as the second-level index, as shown for example in Fig. 2a. In the "word-NE type" index table, the NE types corresponding to the first-level index serve as the second-level index, as shown for example in Fig. 2b. In the "word-syntactic role" index table, the syntactic roles corresponding to the first-level index serve as the second-level index, as shown for example in Fig. 2c. In the "word-word" index table, the other word combined with the first-level index serves as the second-level index, as shown for example in Fig. 2d.

When an index table is updated, if the index of an example sentence already exists among the index keys of the table, the example sentence is simply added to the index values of that key. If the index does not yet exist among the index keys, an index key for that index is added, and the example sentence becomes the index value of the new key.
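The two-level inverted index and its update rule can be sketched as below. This is one possible in-memory layout, assuming sentence identifiers as index values; the class name `IndexTable` and its methods are hypothetical and not taken from the source.

```python
from collections import defaultdict


class IndexTable:
    """One inverted index table with two-level keys.

    First-level key: the word. Second-level key: its part of speech,
    NE type, syntactic role, or partner word, depending on the table.
    Index value: the set of example sentence ids indexed by that key.
    """

    def __init__(self):
        self._table = defaultdict(lambda: defaultdict(set))

    def add(self, word, attribute, sentence_id):
        # An existing key gains one more value; a missing key is
        # created on first access by the nested defaultdict.
        self._table[word][attribute].add(sentence_id)

    def lookup(self, word, attribute):
        # Use .get so that a lookup never creates empty keys.
        return set(self._table.get(word, {}).get(attribute, ()))


word_pos = IndexTable()
word_pos.add("apple", "noun", 1)
word_pos.add("apple", "noun", 2)
word_pos.add("like", "verb", 1)

print(word_pos.lookup("apple", "noun"))
print(word_pos.lookup("pear", "noun"))
```

Grouping by word first keeps every attribute of one word under a single entry, which is the consolidation shown in Figs. 2a to 2d.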

The method provided by embodiment one of the present invention can be applied to a monolingual example sentence library, in which all example sentences are in the same language. It can also be applied to a bilingual example sentence library, which contains example sentences in two languages. Two example sentences that are translations of each other form a bilingual example sentence pair; that is, the two sentences of a pair express the same meaning in two languages.

If the method is applied to a bilingual example sentence library, each example sentence in the library is first analyzed separately in the manner of embodiment one, yielding the indexes of each of the two sentences of a bilingual pair. The indexes of both sentences are all taken as indexes of the pair. When the index tables are built as inverted indexes, the index value of each index key in every table is a bilingual example sentence pair.

As an example, suppose the bilingual example sentence library contains the pair formed by the Chinese sentence meaning "difficulty is surmountable" and the English sentence "Every difficulty can be overcome". After text analysis of both sentences, the index obtained for the Chinese sentence is the Chinese word for "difficulty" as subject, and the index obtained for "Every difficulty can be overcome" is "difficulty" as subject. Both are therefore indexes of this bilingual pair. In the "word-syntactic role" index table built as an inverted index, this pair is an index value of the key formed by the Chinese word for "difficulty" as subject, and also an index value of the key "difficulty" as subject, as shown in Fig. 3.
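For the bilingual case, the only change to indexing is that the index value is the sentence pair and that the indexes of both sentences point to it. A minimal sketch follows, assuming a flat dictionary as the "word-syntactic role" table; the function name and the `zh:` placeholders (standing in for the Chinese sentence and word, which are not reproduced here) are hypothetical.

```python
from collections import defaultdict


def index_bilingual_pair(table, pair, source_keys, target_keys):
    """Register one bilingual pair under the indexes of both of its sentences."""
    for key in (*source_keys, *target_keys):
        table[key].add(pair)


word_role = defaultdict(set)

# Placeholder strings stand in for the Chinese sentence and the Chinese word.
pair = ("zh:difficulty is surmountable", "Every difficulty can be overcome")
index_bilingual_pair(
    word_role,
    pair,
    source_keys=[("zh:difficulty", "subject")],
    target_keys=[("difficulty", "subject")],
)

print(word_role[("difficulty", "subject")])
```

A query in either language then reaches the same pair, which is the behaviour illustrated in Fig. 3.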

This concludes the example sentence index creation method of embodiment one. The example sentence retrieval method provided by the present invention is described in detail below through embodiment two.

Embodiment two,

Fig. 4 is a flowchart of the example sentence retrieval method provided by embodiment two of the present invention. As shown in Fig. 4, the method includes the following steps:

Step 401: receive the user's query.

In the embodiments of the present invention, the syntax rules of the query can be defined in advance, and the user enters the query according to these syntax rules.

The entered query must contain at least one query term; if there are several query terms, it can further contain the logical relations between them. A query term is at least one of the following: the combination of a word and its part of speech, the combination of a word and its NE type, the combination of a word and its syntactic role, and the combination of a word with another word. The logical relation is intersection or difference.

Preferably, the combination of a word with another word is a combination of words that stand in a collocation relation according to syntactic analysis, where the collocation relations can include, but are not limited to: subject-predicate relation, verb-object relation, modifier-head relation, head-complement relation and appositive relation.

The syntax rules can take various flexible forms. For example, the part of speech, NE type or syntactic role of a word can be given in parentheses after the word, with identifiers for the various parts of speech, NE types and syntactic roles defined in advance. "^" can be placed after a word to connect the word combined with it. Among the logical relations, intersection can be written "+" and difference "-".

Some examples: if the user wants to retrieve example sentences in which "solve" and "difficulty" are used as a collocation, the query can be: solve^difficulty.

If the user wants to retrieve example sentences in which "difficulty" is a noun and serves as subject, the query can be difficulty(N)+difficulty(SUB), where N identifies a noun and SUB identifies a subject.

If the user wants to retrieve example sentences in which "apple" is a brand name but which do not contain "mobile phone", the query can be apple(TRM)-mobile phone, where TRM identifies a brand name.

Step 402: parse out the query terms contained in the query; if there are several query terms, also parse out the logical relations between them.

For example, if the user's query is difficulty(N)+difficulty(SUB), parsing yields two query terms, "difficulty(N)" and "difficulty(SUB)", and the logical relation between them: intersection.
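The parsing of step 402 can be sketched with a small regular-expression parser for the example syntax (`word(TAG)`, `word^word`, plain `word`, joined by `+` or `-`). This is an illustrative sketch under stated assumptions: the function names and the tuple representation of query terms are hypothetical, and words are assumed not to contain `+`, `-`, `^` or parentheses.

```python
import re

# One query term: a word, optionally followed by "(TAG)" or "^other word".
ITEM_RE = re.compile(
    r"^\s*(?P<word>[^()^]+?)\s*"
    r"(?:\((?P<tag>[A-Za-z]+)\)|\^\s*(?P<other>[^()^]+?))?\s*$"
)


def parse_item(text):
    """Parse one query term into a tagged, pair, or plain-word tuple."""
    match = ITEM_RE.match(text)
    if match is None:
        raise ValueError(f"malformed query term: {text!r}")
    word = match.group("word")
    if match.group("tag"):
        return ("tagged", word, match.group("tag"))
    if match.group("other"):
        return ("pair", word, match.group("other"))
    return ("word", word)


def parse_query(query):
    """Split a query into its query terms and the operators between them."""
    parts = re.split(r"([+-])", query)  # the capture group keeps the operators
    items = [parse_item(part) for part in parts[0::2]]
    operators = parts[1::2]
    return items, operators


print(parse_query("difficulty(N)+difficulty(SUB)"))
print(parse_query("apple(TRM)-mobile phone"))
print(parse_query("solve^difficulty"))
```

Mapping a tag such as N, SUB or TRM to the part-of-speech, syntactic-role or NE-type table would need the predefined identifier list mentioned in the text, which is left out of this sketch.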

Step 403: retrieve with each parsed query term in turn, and obtain the search result of each query term.

When retrieving with a query term, if the query term is the combination of a word and its part of speech, the query term is matched against the index keys of the "word-part of speech" index table, and the index values of the matching key are taken as the search result of the query term.

If the query term is the combination of a word and its NE type, the query term is matched against the index keys of the "word-NE type" index table, and the index values of the matching key are taken as the search result of the query term.

If the query term is the combination of a word and its syntactic role, the query term is matched against the index keys of the "word-syntactic role" index table, and the index values of the matching key are taken as the search result of the query term.

If the query term is the combination of a word with another word, the query term is matched against the index keys of the "word-word" index table, and the index values of the matching key are taken as the search result of the query term.

If the query term is a single word, the query term is matched against the index keys of the "word" index table, and the index values of the matching key are taken as the search result of the query term.

Further, for some query terms the search result may fall below a preset minimum retrieval requirement, for example when a query term retrieves no search result at all. In that case, provided the query term is not the one being subtracted (that is, it is not the query term immediately following a difference operator), each word in the query term can be matched against the index keys of the "word" index table, and the index values of the matching keys taken as the search result of the query term.

For example, if the query term "solve(SUB)" yields no search result, the word "solve" can be matched in the "word" index table, and the search results containing "solve" taken as the search result of the query term. As another example, if the query term "solve^level" yields no search result, "solve" and "level" can each be matched in the "word" index table to obtain the search results containing "solve" and those containing "level"; both sets together serve as the search result of the query term. This supplementary retrieval improves the retrieval effect as far as possible.
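The table lookup of step 403 together with the supplementary retrieval can be sketched as follows. The index tables are modelled as plain dictionaries from keys to sets of sentence ids; the function name, its parameters and the sample data are hypothetical, and "below the minimum retrieval requirement" is simplified here to "no result at all".

```python
def retrieve(table_name, key, words, tables, is_subtrahend=False):
    """Look up one query term, falling back to the plain "word" table.

    table_name    -- which index table the query term belongs to
    key           -- the index key built from the query term
    words         -- the bare words contained in the query term
    is_subtrahend -- True if the term directly follows a "-" operator;
                     such terms never use the fallback
    """
    hits = set(tables[table_name].get(key, ()))
    if hits or is_subtrahend:
        return hits
    # Supplementary retrieval: take the results of every word of the term together.
    for word in words:
        hits |= tables["word"].get(word, set())
    return hits


tables = {
    "word-role": {("difficulty", "SUB"): {1}},
    "word-word": {},
    "word": {"solve": {2, 3}, "level": {3, 4}, "difficulty": {1, 2}},
}

print(retrieve("word-role", ("difficulty", "SUB"), ["difficulty"], tables))
print(retrieve("word-word", ("solve", "level"), ["solve", "level"], tables))
```

Excluding subtracted terms from the fallback avoids removing sentences merely because they contain one of the words.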

Likewise, the index values in the above index tables can be monolingual example sentences or bilingual example sentence pairs. In the latter case, the search results obtained are bilingual example sentence pairs. For example, if the user's query is "difficulty(SUB)", the search results are bilingual example sentence pairs and can include, for instance, the pair formed by the Chinese sentence meaning "difficulty can be defeated" and "Every difficulty can be overcome".

Step 404: integrate the search results of the query terms according to the logical relations between the query terms.

After the search result of each query term has been obtained, if the logical relation between two query terms is intersection (symbol "+"), the intersection of their search results is taken; if the logical relation between two query terms is difference (symbol "-"), the difference of their search results is taken.

For example, if the user's query is difficulty(N)+difficulty(SUB), the intersection of the search result of query term "difficulty(N)" and the search result of query term "difficulty(SUB)" is taken. To obtain the search result of "difficulty(N)", "difficulty as noun" is matched against the index keys of the "word-part of speech" index table, and the index values of the matching key are taken as the search result of "difficulty(N)". To obtain the search result of "difficulty(SUB)", "difficulty as subject" is matched against the index keys of the "word-syntactic role" index table, and the index values of the matching key are taken as the search result of "difficulty(SUB)".

If the user's query is apple(TRM)-mobile phone, the difference of the search result of query term "apple(TRM)" and the search result of query term "mobile phone" is taken. To obtain the search result of "apple(TRM)", "apple as brand name" is matched against the index keys of the "word-NE type" index table, and the index values of the matching key are taken as the search result of "apple(TRM)". To obtain the search result of "mobile phone", "mobile phone" is matched against the index keys of the "word" index table, and the index values of the matching key are taken as the search result of "mobile phone".

If parsing shows that the query contains only one query term, step 404 is not executed; the search result of that query term is sorted directly and returned to the user.
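The integration of step 404 maps directly onto set operations. The sketch below applies the operators from left to right, which is an assumption, since the text does not state an evaluation order for queries with more than two query terms; the function name and the sample ids are also hypothetical.

```python
def integrate(results, operators):
    """Combine per-term search results with "+" (intersection) and "-" (difference).

    results   -- list of sets of sentence ids, one set per query term
    operators -- list of "+" / "-" strings, one between each adjacent pair
    """
    if len(results) != len(operators) + 1:
        raise ValueError("need exactly one operator between adjacent results")
    combined = set(results[0])
    for operator, nxt in zip(operators, results[1:]):
        if operator == "+":
            combined &= nxt
        elif operator == "-":
            combined -= nxt
        else:
            raise ValueError(f"unknown logical relation: {operator!r}")
    return combined


# difficulty(N)+difficulty(SUB)
print(integrate([{1, 2, 3}, {2, 3, 4}], ["+"]))
# apple(TRM)-mobile phone
print(integrate([{5, 6, 7}, {6}], ["-"]))
# A single query term passes through unchanged.
print(integrate([{8, 9}], []))
```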

Step 405: sort the integrated search results.

Various sorting criteria can be used for the integrated search results. For example, the results can be sorted by the confidence of their source, i.e. the confidence of the web page from which the example sentence comes: the higher the confidence, the higher the rank.

Preferably, sorting can also be based on how well each search result matches the user's query. The matching degree F(R_i) of the i-th search result R_i with the user's query can be expressed through R_i's coverage of the query terms, the words and the logical relations.

Specifically, the following formula can be used:

$$F(R_i) = \lambda_{item} \sum_{j=1}^{J} \delta(R_i, item_j) + \lambda_{word} \sum_{k=1}^{K} \delta(R_i, word_k) + \lambda_{[+]} \sum_{m=1}^{M} \delta(R_i, [+]_m) + \lambda_{[-]} \sum_{n=1}^{N} \delta(R_i, [-]_n) \tag{1}$$

Here λ_item, λ_word, λ_[+] and λ_[-] are preset weight parameters. They can be set to empirical values or determined by system tuning. System tuning means that a standard ranking is prepared in advance, and tuning then determines the weight parameters that minimize the difference between the ranking of the search results and the standard ranking.

δ(R_i, item_j) is the matching value between search result R_i and the j-th query term, and J is the number of query terms in the query. If item_j is an index of R_i, i.e. R_i covers item_j, then δ(R_i, item_j) is 1; otherwise δ(R_i, item_j) is 0.

δ(R_i, word_k) is the matching value between search result R_i and the k-th word, and K is the number of words used in the query. If word_k is an index of R_i, i.e. R_i covers word_k, then δ(R_i, word_k) is 1; otherwise δ(R_i, word_k) is 0. If R_i's coverage of the words is not to be considered, λ_word can be set to 0.

δ(R_i, [+]_m) is the matching value between search result R_i and the m-th intersection relation, and M is the number of intersection relations in the query. The value of δ(R_i, [+]_m) depends on the coverage of the query terms on both sides of the intersection relation [+]_m: if the query terms on both sides of [+]_m are both indexes of R_i, δ(R_i, [+]_m) is 1; otherwise δ(R_i, [+]_m) is 0.

δ(R_i, [-]_n) is the matching value between search result R_i and the n-th difference relation, and N is the number of difference relations in the query. Likewise, the value of δ(R_i, [-]_n) depends on the coverage of the query terms on both sides of the difference relation [-]_n: if the query term immediately before [-]_n is an index of R_i and the query term immediately after it is not an index of R_i, then δ(R_i, [-]_n) is 1; otherwise δ(R_i, [-]_n) is 0.
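Formula (1) can be evaluated with a few lines of code once the indexes of a search result are available as a set. The sketch below is illustrative only: the function name, the representation of index keys as tuples and strings, and the weight values are assumptions chosen for the example, not empirical values from the source.

```python
def match_score(result_indexes, items, words, intersections, differences, weights):
    """Compute F(R_i) of formula (1) for one search result.

    result_indexes -- set of all index keys of the search result R_i
    items          -- the J query terms, as index keys
    words          -- the K words used in the query
    intersections  -- the M "+" relations, as (left term, right term) pairs
    differences    -- the N "-" relations, as (left term, right term) pairs
    weights        -- dict with the keys "item", "word", "+" and "-"
    """
    item_term = sum(1 for item in items if item in result_indexes)
    word_term = sum(1 for word in words if word in result_indexes)
    and_term = sum(
        1 for left, right in intersections
        if left in result_indexes and right in result_indexes
    )
    diff_term = sum(
        1 for left, right in differences
        if left in result_indexes and right not in result_indexes
    )
    return (
        weights["item"] * item_term
        + weights["word"] * word_term
        + weights["+"] * and_term
        + weights["-"] * diff_term
    )


weights = {"item": 1.0, "word": 0.5, "+": 2.0, "-": 2.0}  # illustrative values

# Query difficulty(N)+difficulty(SUB) against a result covering both terms.
noun, subject = ("difficulty", "N"), ("difficulty", "SUB")
score = match_score(
    result_indexes={noun, subject, "difficulty"},
    items=[noun, subject],
    words=["difficulty"],
    intersections=[(noun, subject)],
    differences=[],
    weights=weights,
)
print(score)  # 1.0*2 + 0.5*1 + 2.0*1 + 2.0*0 = 4.5
```

Setting `weights["word"]` to 0 reproduces the case in which word coverage is ignored.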

Step 406: return the search results to the user.

This concludes the flow of embodiment two. The example sentence index creation device and the example sentence retrieval device provided by the present invention are described below through embodiment three and embodiment four respectively.

Embodiment three,

Fig. 5 is a structural diagram of the example sentence index creation device provided by embodiment three of the present invention. As shown in Fig. 5, the device can include a text analysis unit 500 and an index establishing unit 510.

Text analysis unit 500 performs text analysis on each example sentence in the example sentence library.

Index establishing unit 510 creates the index of each example sentence according to the analysis result of text analysis unit 500. The index includes at least one of the following: the combination of a word in the example sentence and its part of speech, the combination of a word in the example sentence and its named entity type, the combination of a word in the example sentence and its syntactic role, and the combination of a word in the example sentence with another word.

Text analysis unit 500 includes word segmentation subunit 501 and can further include at least one of the following subunits: part-of-speech tagging subunit 502, NE recognition subunit 503, syntactic analysis subunit 504 and collocation combination subunit 505.

Word segmentation subunit 501 performs word segmentation on the example sentence.

Part-of-speech tagging subunit 502 performs part-of-speech tagging on each word obtained from word segmentation.

NE recognition subunit 503 performs proper noun recognition on each word obtained from word segmentation and determines the NE type of each word recognized as a proper noun.

Syntactic analysis subunit 504 performs syntactic analysis on the words obtained from word segmentation and determines the syntactic role of each word.

For details of word segmentation, part-of-speech tagging, NE recognition and syntactic analysis, see steps 101 to 104 of embodiment one; since these are mature existing techniques, they are not described in detail here.

Collocation combination subunit 505 combines the words obtained from word segmentation in pairs.

Index establishing unit 510 can, according to the tagging result of part-of-speech tagging subunit 502, take the combination of each word and its part of speech, one by one, as an index of the example sentence; or, according to the recognition result of NE recognition subunit 503, take the combination of each word recognized as a proper noun and its NE type, one by one, as an index of the example sentence; or, according to the analysis result of syntactic analysis subunit 504, take the combination of each word and its syntactic role, one by one, as an index of the example sentence; or take each combination obtained by collocation combination subunit 505 as an index of the example sentence.

Preferably, collocation combination subunit 505 determines, based on syntactic analysis, the pairwise combinations of words obtained from segmentation that stand in a collocation relation. The collocation relations include: subject-predicate relation, verb-object relation, modifier-head relation, head-complement relation or appositive relation.

In addition, index establishing unit 510 can also take each word obtained by word segmentation subunit 501 as an index of the example sentence.

Some words in an example sentence may carry little meaning, for example pronouns, function words, auxiliary words, articles and interjections. Words of these types can be placed in a stop word list in advance, and the stop word list used to filter them out. To realize this function, text analysis unit 500 can further include word filtering subunit 506, which filters the words obtained by word segmentation subunit 501 against the preset stop word list. After the words contained in the stop word list have been removed, the remaining words are passed to collocation combination subunit 505 for combination (this case is not shown in Fig. 5), or to index establishing unit 510 for index creation.

In the case shown in Fig. 5, after filtering by word filtering subunit 506, the words that index establishing unit 510 uses to create indexes are all filtered words, which carry more meaning.

The example sentence library can be a monolingual example sentence library or a bilingual example sentence library. A bilingual example sentence library contains bilingual example sentence pairs, and the two sentences of each pair express the same meaning in two languages.

If the example sentence library is a bilingual example sentence library, index establishing unit 510 takes the indexes of both sentences of each bilingual pair in the library as indexes of that pair.

In addition, the device can further include index table establishing unit 520, which uses the indexes created by index establishing unit 510 for each example sentence in the example sentence library to build index tables as inverted indexes. In an index table, the index value is an example sentence and the index key is an index of that example sentence.

For a bilingual example sentence library, index table establishing unit 520 can use the indexes created by index establishing unit 510 for each bilingual example sentence pair in the library to build index tables as inverted indexes. In such an index table, the index value is a bilingual example sentence pair and the index key is an index of that pair.

The above index tables can include at least one of the following:

"word-part of speech" index table, in which the index key is the combination of a word and its part of speech.

"word-NE type" index table, in which the index key is the combination of a word and its NE type.

"word-syntactic role" index table, in which the index key is the combination of a word and its syntactic role.

The index keys in the above index tables can be further consolidated by grouping identical words together. In that case, the index keys in the "word-part of speech", "word-NE type", "word-syntactic role" or "word-word" index table are two-level index keys, specifically:

Identical words in the index keys are grouped together as the first-level index. In the "word-part of speech" index table, the parts of speech corresponding to the first-level index serve as the second-level index; in the "word-NE type" index table, the NE types corresponding to the first-level index serve as the second-level index; in the "word-syntactic role" index table, the syntactic roles corresponding to the first-level index serve as the second-level index; and in the "word-word" index table, the other word combined with the first-level index serves as the second-level index.

Embodiment four,

Fig. 6 is a structural diagram of the example sentence retrieval device provided by Embodiment 4 of the present invention. As shown in Fig. 6, the device may comprise: a user-side interaction unit 600, a request parsing unit 610, a retrieval processing unit 620 and a result integration unit 630.

The user-side interaction unit 600 is configured to receive a user's retrieval request (query) and to return to the user the retrieval results provided by the result integration unit 630.

The request parsing unit 610 is configured to parse out the query terms contained in the query and, if the query contains multiple query terms, to also parse out the logical relations between the query terms.

The retrieval processing unit 620 is configured to perform retrieval with each query term parsed out by the request parsing unit 610, one by one, to obtain the retrieval results corresponding to each query term.

The result integration unit 630 is configured to: when the request parsing unit 610 determines that the query contains multiple query terms, integrate the retrieval results corresponding to the query terms according to the logical relations between the query terms parsed out by the request parsing unit 610, and provide the integrated retrieval results to the user-side interaction unit 600; and when the request parsing unit 610 determines that the query contains a single query term, provide the retrieval results corresponding to that query term to the user-side interaction unit 600.

A query term is at least one of the following: a combination of a word and the part of speech corresponding to the word, a combination of a word and the NE type corresponding to the word, a combination of a word and the syntactic role corresponding to the word, and a combination of a word and another word. The logical relation is intersection or difference.
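The parsing step can be illustrated with a small sketch. The concrete query syntax is not given in this passage, so the one used here is purely an assumption: query terms are separated by the markers `[+]` (intersection) and `[-]` (difference), and a term such as `book/v` is left as an opaque string. The function name `parse_query` is hypothetical.

```python
import re

def parse_query(query):
    """Split a query into query terms and the relations between them.

    Assumed syntax: term ( ("[+]" | "[-]") term )*
    Returns (terms, relations); relations[i] links terms[i] and terms[i+1].
    """
    # The capturing group makes re.split keep the separators in the output.
    parts = re.split(r"\s*(\[\+\]|\[-\])\s*", query.strip())
    terms = parts[0::2]
    relations = parts[1::2]
    return terms, relations

terms, relations = parse_query("book/v [+] flight/n [-] hotel/n")
single_terms, single_relations = parse_query("book/v")
```

A query with a single term yields an empty relation list, which corresponds to the case where the result integration unit passes the results through unchanged.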

If the query term parsed out by the request parsing unit 610 is a combination of a word and the part of speech corresponding to the word, the retrieval processing unit 620 matches the query term against the index keys in the "word-part of speech" index table and takes the index values corresponding to the matched index key as the retrieval results of the query term.

If the query term parsed out by the request parsing unit 610 is a combination of a word and the NE type corresponding to the word, the retrieval processing unit 620 matches the query term against the index keys in the "word-NE type" index table and takes the index values corresponding to the matched index key as the retrieval results of the query term.

If the query term parsed out by the request parsing unit 610 is a combination of a word and the syntactic role corresponding to the word, the retrieval processing unit 620 matches the query term against the index keys in the "word-syntactic role" index table and takes the index values corresponding to the matched index key as the retrieval results of the query term.

If the query term parsed out by the request parsing unit 610 is a combination of a word and another word, the retrieval processing unit 620 matches the query term against the index keys in the "word-word" index table and takes the index values corresponding to the matched index key as the retrieval results of the query term.

Preferably, the combination of a word and another word may be a combination of words between which a collocation relation based on syntactic analysis exists. The collocation relation may include, but is not limited to: a subject-predicate relation, a verb-object relation, a modifier-head relation, a head-complement relation or an appositive relation.

In addition, the query terms parsed out by the request parsing unit 610 may also include a single word. If the query term parsed out by the request parsing unit 610 is a word, the retrieval processing unit 620 matches the query term against the index keys in the "word" index table and takes the index values corresponding to the matched index key as the retrieval results of the query term.

The index values in the "word-part of speech" index table, the "word-NE type" index table, the "word-syntactic role" index table, the "word-word" index table and the "word" index table may be example sentences or bilingual example sentence pairs.
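The per-type lookup described above amounts to choosing an index table by the kind of query term and matching the term against that table's keys. The following sketch assumes each query term has already been classified into a (kind, key) pair; the kind labels, the function name `retrieve` and the sample tables are hypothetical.

```python
def retrieve(term, tables):
    """Return the index values whose index key matches the query term.

    term: (kind, key), where kind names the index table to search, e.g.
          "word-pos", "word-ne", "word-role", "word-word" or "word".
    tables: dict mapping kind -> {index key: [index values]}.
    """
    kind, key = term
    return list(tables.get(kind, {}).get(key, []))

# Hypothetical index tables holding monolingual example sentences.
tables = {
    "word-pos": {("book", "v"): ["I book a flight"],
                 ("book", "n"): ["She read the book"]},
    "word": {"book": ["I book a flight", "She read the book"]},
}
verb_hits = retrieve(("word-pos", ("book", "v")), tables)
word_hits = retrieve(("word", "book"), tables)
no_hits = retrieve(("word-ne", ("Paris", "LOC")), tables)
```

If the index values are bilingual example sentence pairs, the lists would simply hold pairs instead of single sentences.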

For some query terms, the corresponding retrieval results may fall below a preset minimum retrieval requirement. To improve retrieval effectiveness, the device may further comprise a supplementary retrieval unit 640, configured to: when a query term is not the query term immediately following a difference logical relation and the retrieval results corresponding to the query term fall below the preset minimum retrieval requirement, match each word in the query term against the index keys in the "word" index table and take the index values corresponding to the matched index keys as the retrieval results of the query term.
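The supplementary retrieval can be sketched as a fallback to the "word" index table. Two points are assumptions rather than statements of the source: the minimum retrieval requirement is modelled as a minimum number of results, and the results found for the individual words are merged as a union. The function and parameter names are hypothetical.

```python
def supplementary_retrieve(term_words, results, word_table,
                           min_results, follows_difference):
    """Fall back to word-level retrieval when a term returns too little.

    term_words: the words contained in the query term
    results: results already retrieved for the term
    word_table: "word" index table, {word: [index values]}
    min_results: assumed form of the preset minimum retrieval requirement
    follows_difference: True if the term immediately follows a difference
                        relation, in which case no fallback is applied
    """
    if follows_difference or len(results) >= min_results:
        return list(results)
    supplemented = []
    for word in term_words:
        for value in word_table.get(word, []):
            if value not in supplemented:  # union, preserving order
                supplemented.append(value)
    return supplemented

word_table = {
    "book": ["I book a flight", "She read the book"],
    "flight": ["I book a flight"],
}
fallback = supplementary_retrieve(["book", "flight"], [], word_table, 1, False)
kept = supplementary_retrieve(["book", "flight"], [], word_table, 1, True)
```

Terms that follow a difference relation are excluded because broadening the set to be subtracted would remove more results instead of supplying more.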

The structure of the result integration unit 630 in the above device is described in detail below. The result integration unit 630 may specifically comprise an integration processing subunit 631 and a ranking subunit 632.

The integration processing subunit 631 is configured to: when the request parsing unit 610 determines that the query contains multiple query terms, integrate the retrieval results corresponding to the query terms according to the logical relations between the query terms parsed out by the request parsing unit 610.

If the logical relation between two adjacent query terms is intersection, the intersection of the retrieval results corresponding to the two query terms is taken; if the logical relation between two adjacent query terms is difference, the difference of the retrieval results corresponding to the two query terms is taken.
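The integration step can be illustrated with ordinary set operations. The sketch below assumes the relations are applied from left to right, which this passage does not state explicitly, and it reuses the hypothetical markers `[+]` and `[-]` for intersection and difference.

```python
def integrate(results_per_term, relations):
    """Combine per-term results according to the logical relations.

    results_per_term: list of result lists, one per query term
    relations: relations[i] is "[+]" or "[-]" and links term i and term i+1
    Evaluation order is left to right (an assumption of this sketch).
    """
    combined = list(results_per_term[0])
    for relation, results in zip(relations, results_per_term[1:]):
        other = set(results)
        if relation == "[+]":    # intersection
            combined = [r for r in combined if r in other]
        elif relation == "[-]":  # difference
            combined = [r for r in combined if r not in other]
        else:
            raise ValueError(f"unknown logical relation: {relation!r}")
    return combined

merged = integrate([["s1", "s2", "s3"], ["s2", "s3"], ["s3"]], ["[+]", "[-]"])
```

Here the intersection of the first two result lists leaves `s2` and `s3`, and subtracting the third list leaves `s2`.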

The ranking subunit 632 is configured to rank the integrated retrieval results, wherein the ranking is based on one or a combination of the following: the confidence of the source of a retrieval result, and the degree of match between the retrieval result and the query.

The degree of match F(R_i) between a retrieval result and the query may be:

$$F(R_i) = \lambda_{item} \sum_{j=1}^{J} \delta(R_i, item_j) + \lambda_{word} \sum_{k=1}^{K} \delta(R_i, word_k) + \lambda_{[+]} \sum_{m=1}^{M} \delta(R_i, [+]_m) + \lambda_{[-]} \sum_{n=1}^{N} \delta(R_i, [-]_n)$$

Here, λ_item, λ_word, λ_[+] and λ_[-] are preset weight parameters; δ(R_i, item_j) is the match value between retrieval result R_i and the j-th query term, and J is the number of query terms contained in the query; δ(R_i, word_k) is the match value between R_i and the k-th word, and K is the number of words used for retrieval in the query; δ(R_i, [+]_m) is the match value between R_i and the m-th intersection logical relation, and M is the number of intersection logical relations in the query; δ(R_i, [-]_n) is the match value between R_i and the n-th difference logical relation, and N is the number of difference logical relations in the query.

If item_j is an index of R_i, i.e. R_i covers item_j, then δ(R_i, item_j) is 1; otherwise δ(R_i, item_j) is 0.

If word_k is an index of R_i, i.e. R_i covers word_k, then δ(R_i, word_k) is 1; otherwise δ(R_i, word_k) is 0.

The value of δ(R_i, [+]_m) depends on how the query terms at the two ends of the intersection logical relation [+]_m are covered: if the query terms at both ends of [+]_m are indexes of R_i, δ(R_i, [+]_m) is 1; otherwise δ(R_i, [+]_m) is 0.

Similarly, the value of δ(R_i, [-]_n) depends on how the query terms at the two ends of the difference logical relation [-]_n are covered: if the query term immediately preceding [-]_n is an index of R_i and the query term immediately following [-]_n is not an index of R_i, then δ(R_i, [-]_n) is 1; otherwise δ(R_i, [-]_n) is 0.
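The match degree F(R_i) defined above can be computed directly from the set of indexes of a retrieval result. The following is a minimal sketch under stated assumptions: the function name `match_score`, the data layout and the weight values are hypothetical, and each logical relation is represented as a triple of the relation marker and the query terms on either side of it.

```python
def match_score(result_indexes, terms, words, relations, weights):
    """Compute F(R_i) for one retrieval result R_i.

    result_indexes: set of all indexes of R_i
    terms: query terms item_1..item_J
    words: words used for retrieval, word_1..word_K
    relations: list of (op, left_term, right_term), op is "[+]" or "[-]"
    weights: dict with keys "item", "word", "[+]" and "[-]"
    """
    def covered(x):
        return x in result_indexes

    item_sum = sum(covered(t) for t in terms)
    word_sum = sum(covered(w) for w in words)
    # Intersection: both neighbouring terms must be indexes of R_i.
    plus_sum = sum(covered(left) and covered(right)
                   for op, left, right in relations if op == "[+]")
    # Difference: the preceding term is an index, the following one is not.
    minus_sum = sum(covered(left) and not covered(right)
                    for op, left, right in relations if op == "[-]")
    return (weights["item"] * item_sum + weights["word"] * word_sum
            + weights["[+]"] * plus_sum + weights["[-]"] * minus_sum)

# Hypothetical example: R_i is "I book a flight".
book_v, flight_n, hotel_n = ("book", "v"), ("flight", "n"), ("hotel", "n")
indexes = {book_v, flight_n, "I", "book", "a", "flight"}
score = match_score(
    indexes,
    terms=[book_v, flight_n, hotel_n],
    words=["book", "flight", "hotel"],
    relations=[("[+]", book_v, flight_n), ("[-]", flight_n, hotel_n)],
    weights={"item": 1.0, "word": 0.5, "[+]": 2.0, "[-]": 2.0},
)
```

In this example two of the three query terms and two of the three words are covered, and both logical relations are satisfied, so with the illustrative weights the score is 1.0·2 + 0.5·2 + 2.0·1 + 2.0·1 = 7.0.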

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims

1. An example sentence index creation method, characterized in that the following steps are performed for each example sentence in an example sentence library:

A. performing text analysis on the example sentence;

2. The method according to claim 1, characterized in that step A specifically comprises:

A1. performing word segmentation on the example sentence;

A2. performing at least one of steps A21, A22, A23 and A24:

3. The method according to claim 2, characterized in that the method further comprises: using each word obtained after word segmentation as an index of the example sentence.

4. The method according to claim 2, characterized in that step A24 specifically comprises: determining, based on syntactic analysis, the pairwise combinations of the words obtained after word segmentation between which a collocation relation exists;

5. The method according to claim 2, characterized in that, before step A24 or before step B, the method further comprises:

6. The method according to claim 1, characterized in that the example sentence library is a monolingual example sentence library or a bilingual example sentence library.

7. The method according to claim 6, characterized in that, if the example sentence library is a bilingual example sentence library, the method further comprises:

8. The method according to claim 1, characterized in that the method further comprises:

9. The method according to claim 7, characterized in that the method further comprises:

building an index table in inverted form using each bilingual example sentence pair in the bilingual example sentence library and the indexes corresponding to the bilingual example sentence pair, wherein each index value in the index table is a bilingual example sentence pair and each index key is an index corresponding to that pair.

10. The method according to claim 8 or 9, characterized in that the index table comprises at least one of the following:

11. The method according to claim 10, characterized in that, in the "word-part of speech" index table, the "word-NE type" index table, the "word-syntactic role" index table or the "word-word" index table, the index key is a two-level index key, specifically:

12. An example sentence retrieval method, characterized in that the method comprises:

A. receiving a user's retrieval request (query);

13. The method according to claim 12, characterized in that step C is specifically:

14. The method according to claim 12, characterized in that the combination of a word and another word is: a combination of words between which a collocation relation based on syntactic analysis exists;

15. The method according to claim 12, 13 or 14, characterized in that the parsed query terms further include: a word;

16. The method according to claim 15, characterized in that the index values in the "word-part of speech" index table, the "word-NE type" index table, the "word-syntactic role" index table, the "word-word" index table and the "word" index table are example sentences or bilingual example sentence pairs.

17. The method according to claim 15, characterized in that, if a query term is not the query term immediately following a difference logical relation and the retrieval results corresponding to the query term fall below a preset minimum retrieval requirement, each word in the query term is matched against the index keys in the "word" index table, and the index values corresponding to the matched index keys are taken as the retrieval results of the query term.

18. The method according to claim 12, characterized in that, before step E, the method further comprises:

19. The method according to claim 18, characterized in that the degree of match F(R_i) between the retrieval result and the query is:

$$F(R_i) = \lambda_{item} \sum_{j=1}^{J} \delta(R_i, item_j) + \lambda_{word} \sum_{k=1}^{K} \delta(R_i, word_k) + \lambda_{[+]} \sum_{m=1}^{M} \delta(R_i, [+]_m) + \lambda_{[-]} \sum_{n=1}^{N} \delta(R_i, [-]_n)$$

20. The method according to claim 19, characterized in that, if item_j is an index of R_i, δ(R_i, item_j) is 1; otherwise δ(R_i, item_j) is 0;

21. An example sentence index creation device, characterized in that the device comprises: a text analysis unit and an index establishing unit;

22. The device according to claim 21, characterized in that the text analysis unit comprises a word segmentation subunit and further comprises at least one of the following subunits: a part-of-speech tagging subunit, an NE recognition subunit, a syntactic analysis subunit and a collocation combination subunit;

23. The device according to claim 22, characterized in that the index establishing unit is further configured to use each word obtained by the word segmentation subunit after word segmentation as an index of the example sentence.

24. The device according to claim 22, characterized in that the collocation combination subunit specifically determines, based on syntactic analysis, the pairwise combinations of the words obtained after word segmentation between which a collocation relation exists;

25. The device according to claim 22, characterized in that the text analysis unit further comprises a word filtering subunit, configured to filter, based on a preset stop word list, the words obtained by the word segmentation subunit after word segmentation, and, after the words contained in the stop word list have been filtered out, to supply the remaining words to the collocation combination subunit for combination or to the index establishing unit for index establishment.

26. The device according to claim 21, characterized in that the example sentence library is a monolingual example sentence library or a bilingual example sentence library.

27. The device according to claim 26, characterized in that, if the example sentence library is a bilingual example sentence library, the index establishing unit uses the indexes corresponding to each example sentence of a bilingual example sentence pair in the bilingual example sentence library as the indexes corresponding to that bilingual example sentence pair.

28. The device according to claim 21, characterized in that the device further comprises an index table establishing unit, configured to build an index table in inverted form from the indexes that the index establishing unit has established for each example sentence in the example sentence library, wherein each index value in the index table is an example sentence and each index key is an index corresponding to that example sentence.

29. The device according to claim 27, characterized in that the device further comprises an index table establishing unit, configured to build an index table in inverted form from the indexes that the index establishing unit has established for each bilingual example sentence pair in the bilingual example sentence library, wherein each index value in the index table is a bilingual example sentence pair and each index key is an index corresponding to that pair.

30. The device according to claim 28 or 29, characterized in that the index table comprises at least one of the following:

31. The device according to claim 30, characterized in that, in the "word-part of speech" index table, the "word-NE type" index table, the "word-syntactic role" index table or the "word-word" index table, the index key is a two-level index key, specifically:

32. An example sentence retrieval device, characterized in that the device comprises: a user-side interaction unit, a request parsing unit, a retrieval processing unit and a result integration unit;

33. The device according to claim 32, characterized in that, if the query term parsed out by the request parsing unit is a combination of a word and the part of speech corresponding to the word, the retrieval processing unit matches the query term against the index keys in the "word-part of speech" index table and takes the index values corresponding to the matched index key as the retrieval results of the query term;

34. The device according to claim 32, characterized in that the combination of a word and another word is: a combination of words between which a collocation relation based on syntactic analysis exists;

35. The device according to claim 32, 33 or 34, characterized in that the query terms parsed out by the request parsing unit include a word;

36. The device according to claim 35, characterized in that the index values in the "word-part of speech" index table, the "word-NE type" index table, the "word-syntactic role" index table, the "word-word" index table and the "word" index table are example sentences or bilingual example sentence pairs.

37. The device according to claim 35, characterized in that the device further comprises a supplementary retrieval unit, configured to: when a query term is not the query term immediately following a difference logical relation and the retrieval results corresponding to the query term fall below a preset minimum retrieval requirement, match each word in the query term against the index keys in the "word" index table and take the index values corresponding to the matched index keys as the retrieval results of the query term.

38. The device according to claim 32, characterized in that the result integration unit specifically comprises:

39. The device according to claim 38, characterized in that the degree of match F(R_i) between the retrieval result and the query is:

$$F(R_i) = \lambda_{item} \sum_{j=1}^{J} \delta(R_i, item_j) + \lambda_{word} \sum_{k=1}^{K} \delta(R_i, word_k) + \lambda_{[+]} \sum_{m=1}^{M} \delta(R_i, [+]_m) + \lambda_{[-]} \sum_{n=1}^{N} \delta(R_i, [-]_n)$$

40. The device according to claim 39, characterized in that, if item_j is an index of R_i, δ(R_i, item_j) is 1; otherwise δ(R_i, item_j) is 0;