CN102682000A - Text clustering method, question-answering system applying same and search engine applying same - Google Patents

Text clustering method, question-answering system applying same and search engine applying same

Info

Publication number
CN102682000A
CN102682000A CN2011100563794A CN201110056379A
Authority
CN
China
Prior art keywords
text
probability
characteristic
language
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011100563794A
Other languages
Chinese (zh)
Inventor
沈文竹
吴甜
柴春光
吴华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2011100563794A priority Critical patent/CN102682000A/en
Publication of CN102682000A publication Critical patent/CN102682000A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text clustering method, a question-answering system applying the same, and a search engine applying the same. The method comprises the following steps: 1) clustering the texts of each language separately; 2) extracting feature word vectors from the clustered texts of each language; and 3) calculating the similarity between the feature word vectors of texts in different languages, and clustering all the texts. With the method, the question-answering system, and the search engine, texts in various languages can be clustered correctly.

Description

Text clustering method, and question-answering system and search engine adopting the same
Technical field
The present invention relates to the field of pattern recognition, and more particularly to a pattern recognition method for natural language.
Background art
With the spread of information networks, the emergence of massive volumes of electronic text has created an urgent demand for automatic text classification by machine. Automatic text classification saves considerable manpower and material resources and avoids many drawbacks of manual classification, such as long cycles, high cost, and low efficiency. Automatic text classification categorizes large numbers of texts automatically according to their content, thereby helping people process and organize text data effectively.
This demand is even stronger for search engines. People increasingly rely on search to obtain knowledge and information. Faced with hundreds of millions of web pages and information resources, the greatest challenge a search engine faces is how to provide users with the information they need quickly and accurately from such a vast repository.
Consider, for example, the question-answering systems associated with search engines. In a question-answering system, a complete question page contains a question posed by one user and one or more answers provided by other users. When a new user queries a question in the question-answering system, the system needs to search the existing question pages in the system using the search keywords extracted from the new user's question, and return them to the new user.
Because of the diversity of spoken and written language, a question-answering system cannot restrict the form in which a new user phrases a question, so a question that is identical in substance may have multiple forms of expression. The search keywords extracted from different expressions of the same question may differ, and the question pages that can be found therefore also differ. Since these questions are the same in essential content, returning all related question pages to the user, rather than only a subset, would clearly improve the user experience. To solve this problem, the questions need to be clustered.
Existing question-answering systems analyze the text of question pages, build an index from "search keyword" to "question page", and return the related question pages for a user query. In other words, a "question page" can be returned only if it contains the "search keyword". From the user's point of view, questions such as "way of fried rice with eggs", "how fried rice with eggs is cooked", and "how to cook fried rice with eggs" are equivalent in content; when the user searches, all question pages corresponding to these questions that meet the user's need should be returned. In short, questions on the same topic should be grouped into the same class. When a user's query keyword is related to one class of questions, all pages of that class are put into the candidate set and shown to the user. As shown in Fig. 1, before clustering, search keyword 1 returns question pages 1 and 3, and search keyword 2 returns question pages 2 and 4. During clustering, pages 1 and 2 are grouped into one class and pages 3 and 4 into another class; after clustering, either search keyword 1 or search keyword 2 returns question pages 1, 2, 3, and 4.
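Purely to illustrate the retrieval behaviour described above (this sketch is not part of the patent; the index structures and names are assumed), the following Python fragment contrasts plain keyword lookup with cluster-expanded lookup, using the page and keyword numbering of Fig. 1:

    def search_without_clusters(keyword, index):
        """index: dict mapping a search keyword to the set of question page ids containing it."""
        return index.get(keyword, set())

    def search_with_clusters(keyword, index, cluster_of, clusters):
        """cluster_of: page id -> cluster id; clusters: cluster id -> set of page ids.
        Every page in the cluster of a matched page is added to the candidate set."""
        candidates = set()
        for page in index.get(keyword, set()):
            candidates |= clusters[cluster_of[page]]
        return candidates

    # Example of Fig. 1: keyword 1 matches pages 1 and 3; pages {1, 2} and {3, 4} form two clusters.
    index = {"keyword1": {1, 3}, "keyword2": {2, 4}}
    cluster_of = {1: "A", 2: "A", 3: "B", 4: "B"}
    clusters = {"A": {1, 2}, "B": {3, 4}}
    assert search_with_clusters("keyword1", index, cluster_of, clusters) == {1, 2, 3, 4}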
However, text clustering in the prior art considers only monolingual text and ignores the barrier posed by language. In fact, with the growth of international exchange and cooperation, multilingual content appears in more and more communities. Continuing the question-answering example above, multilingual questions arise, such as "how to make the fried rice and eggs" in English and a Japanese question with the same meaning. From the perspective of content, these two questions are also equivalent to the three "fried rice with eggs" questions above. Ideally, these five questions should therefore be grouped into one class, so as to provide users with a better search experience.
In summary, there is an urgent need for a method that can accurately cluster texts in various languages, and for a question-answering system that can unify questions in multiple languages so that answerers can easily identify and answer them.
Summary of the invention
The object of the present invention is to provide a method that can accurately cluster texts in various languages.
According to one aspect of the present invention, a text clustering method is provided, comprising the following steps:
1) clustering the texts of each language separately;
2) extracting feature word vectors from the clustered texts of each language respectively;
3) calculating the similarity between the feature word vectors of texts in different languages, and clustering all the texts.
In the method, step 3) further comprises:
31) calculating the translation probability $Sim(w_i, \overline{w_j})$ of the feature words in the texts of the two languages, where $w_i$ and $\overline{w_j}$ are feature words of a text in one language and of a text in the other language, respectively;
32) calculating the similarity according to the following formula:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
where $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are the feature word vectors of the text in one language and the text in the other language respectively, and n and m are the numbers of feature words in the two feature word vectors;
33) grouping texts whose similarity is greater than a threshold into one class.
In the method, step 31) further comprises:
311) calculating the word translation probabilities $P(w_i \mid \overline{w_j})$ and $P(\overline{w_j} \mid w_i)$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$, and $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
312) calculating the translation probability of the feature words according to the following formula:
$Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$.
In the method, step 31) may alternatively comprise:
313) calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
314) calculating the translation probability of the feature words according to the following formula:
$Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$.
In the method, step 31) may alternatively comprise:
315) calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
316) calculating the translation probability of the feature words according to the following formula:
$Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
In the method, the value range of the threshold is determined by T, where T = n = m when n and m are equal, and T = min(n, m) when n and m are unequal.
According to another aspect of the present invention, a text clustering system is also provided, comprising:
a monolingual text clustering module for clustering the texts of each language separately;
an extraction module for extracting feature word vectors from the texts of each language clustered by the monolingual text clustering module;
an analysis module for calculating the similarity between the feature word vectors of the texts in different languages extracted by the extraction module, and clustering all the texts.
In a system according to one embodiment of the present invention, the analysis module further comprises:
a translation probability computing module for calculating the translation probability $Sim(w_i, \overline{w_j})$ of the feature words in the texts of the two languages extracted by the extraction module, where $w_i$ and $\overline{w_j}$ are feature words of a text in one language and of a text in the other language, respectively;
a similarity calculation module for calculating the similarity from the $Sim(w_i, \overline{w_j})$ calculated by the translation probability computing module, using the following formula:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
where $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are the feature word vectors of the text in one language and the text in the other language respectively, and n and m are the numbers of feature words in the two feature word vectors;
a multilingual text clustering module for grouping texts whose similarity, as calculated by the similarity calculation module, is greater than a threshold into one class.
In a system according to one embodiment of the present invention, the translation probability computing module further comprises:
a first probability calculation module for calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
a second probability calculation module for calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
a first probability determination module for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module and the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the following formula:
$Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$.
In a system according to another embodiment of the present invention, the translation probability computing module further comprises:
a first probability calculation module for calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
a second probability determination module for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module, using the following formula: $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$.
In a system according to yet another embodiment of the present invention, the translation probability computing module further comprises:
a second probability calculation module for calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
a third probability determination module for calculating the translation probability of the feature words from the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the following formula: $Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
In a system according to one embodiment of the present invention, the extraction module further comprises:
a feature clustering module for clustering the features according to the similarity between the features in the texts of each language clustered by the monolingual text clustering module, selecting in each cluster one feature to represent the whole cluster, and removing the other features in the cluster from the candidate feature set;
a feature selection module for performing feature selection on the features remaining after removal by the feature clustering module, using the information gain method, to obtain the feature word vector.
According to a further aspect of the present invention, a question-answering system is also provided, which comprises the above text clustering system, so as to classify queried questions and then provide answers according to the classification.
According to another aspect of the present invention, a search engine is also provided, comprising the above question-answering system.
The text clustering method provided by the present invention enables accurate clustering of multilingual texts. Applying this method in a question-answering system, especially the question-answering system of a search engine, can improve the accuracy of answers and effectively save network resources.
Description of drawings
Fig. 1 is a schematic diagram of question clustering in a question-answering system;
Fig. 2 is a flowchart of a text clustering method according to an embodiment of the present invention;
Fig. 3 is a block diagram of a text clustering system according to a preferred embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the text clustering method according to an embodiment of the present invention is further described below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are intended only to illustrate the present invention and not to limit it.
The text clustering method according to an embodiment of the present invention is described in detail below with reference to the flowchart of Fig. 2, taking bilingual Chinese and English texts as an example.
First, the Chinese texts and the English texts are each clustered separately. This clustering may use a content-dependent clustering method or a content-independent clustering method.
A content-dependent clustering method uses a similarity function to describe the degree of similarity between texts according to certain internal features of the elements, for example whether or not two texts share a certain feature.
Preferably, a content-independent clustering method is used, which can assess the degree of similarity between texts, and then cluster them, without considering the text content. Further, a graph-based iterative clustering method may be used, which can mine implicit relations among keywords: some keywords appear entirely unrelated at the beginning of the computation but may gradually become similar as it proceeds. In this way, Chinese or English texts can be clustered efficiently.
In particular, for clustering the search questions associated with a search engine, clustering can be performed using the question log: if the results clicked by the users of two questions are consistent, the two questions are considered similar. Specifically, an agglomerative hierarchical clustering method can be used, which mainly comprises the following two steps (an illustrative sketch follows the two steps):
1) First, a bipartite graph is constructed from the search engine query log that is read in. The nodes q on one side represent keywords, and the nodes d on the other side represent links. If a keyword and a link are found to appear in the same log record, that is, the user clicked a link in the search result page when issuing a keyword query, a bidirectional edge is created between the node representing the keyword and the node representing the link.
2) After the bidirectional edges are constructed, the two q nodes with the greatest similarity are merged, then the two d nodes with the greatest similarity are merged, and this is iterated until a termination condition is satisfied.
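As an illustration only (not part of the patent text), the following minimal Python sketch shows one way such an agglomerative clustering of the query log could be implemented. The log format, the Jaccard-style node similarity, and the termination threshold are assumptions, and for brevity only the keyword (q) side is merged, whereas the method described above alternately merges q nodes and d nodes:

    from collections import defaultdict

    def cluster_query_log(click_records, min_sim=0.5):
        """click_records: iterable of (keyword, clicked_link) pairs read from the query log."""
        # Build the bipartite graph: each keyword node is connected to the links clicked for it.
        links_of = defaultdict(set)
        for keyword, link in click_records:
            links_of[keyword].add(link)          # bidirectional edge q <-> d

        clusters = [{q} for q in links_of]       # start with one cluster per keyword

        def cluster_links(cluster):
            return set().union(*(links_of[q] for q in cluster))

        def similarity(c1, c2):                  # assumed: Jaccard overlap of clicked links
            a, b = cluster_links(c1), cluster_links(c2)
            return len(a & b) / len(a | b) if (a | b) else 0.0

        while True:                              # iteratively merge the most similar pair
            best, pair = 0.0, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = similarity(clusters[i], clusters[j])
                    if s > best:
                        best, pair = s, (i, j)
            if pair is None or best < min_sim:   # termination condition
                break
            i, j = pair
            clusters[i] |= clusters.pop(j)
        return clusters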
Next, cross-language clustering is performed on the bilingual texts.
Feature word vectors $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are extracted from the clustered Chinese texts and English texts respectively, where n and m are the dimensions of the feature word vectors of the Chinese text and of the English text, respectively.
The feature word vectors can be extracted by a clustering-based method, which specifically comprises: first clustering the features according to the similarity between features, selecting in each cluster one feature to represent the whole cluster, and removing the other features in the cluster from the candidate feature set, thereby reducing redundancy in the feature set; then performing feature selection on the remaining features using the information gain method to obtain the feature word vector. See Zhang Wenliang et al., "A clustering-based text feature selection method", Computer Applications, January 2007.
Of course, the feature word vectors can also be extracted by other methods, for example methods based on mutual information or the χ² statistic.
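For illustration only, the following sketch shows how the information gain selection mentioned above could be computed over a candidate feature set; the bag-of-words document representation and helper names are assumptions, and the redundancy-removing feature clustering step is taken as already done:

    import math
    from collections import Counter

    def information_gain(docs, labels, feature):
        """docs: list of token sets; labels: cluster label of each document."""
        def entropy(lbls):
            n = len(lbls)
            return -sum((c / n) * math.log2(c / n) for c in Counter(lbls).values()) if n else 0.0
        with_f = [l for d, l in zip(docs, labels) if feature in d]
        without_f = [l for d, l in zip(docs, labels) if feature not in d]
        p = len(with_f) / len(docs)
        return entropy(labels) - p * entropy(with_f) - (1 - p) * entropy(without_f)

    def select_feature_words(docs, labels, candidates, top_k=100):
        # Rank the remaining candidate features by information gain and keep the best ones;
        # the selected words form the feature word vector of the texts.
        ranked = sorted(candidates, key=lambda f: information_gain(docs, labels, f), reverse=True)
        return ranked[:top_k]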
Then the translation probabilities $Sim(w_i, \overline{w_j})$ of the Chinese and English feature words are calculated, where $w_i \in \{w_1, w_2, \ldots, w_n\}$, $\overline{w_j} \in \{\overline{w_1}, \overline{w_2}, \ldots, \overline{w_m}\}$, $1 \le i \le n$, $1 \le j \le m$, the two being feature words of the Chinese text and of the English text respectively. To guarantee the symmetry of the similarity, let $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$, and $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$. For example, the English word "bank" has two Chinese meanings: one is "银行" (the financial institution) and the other is "河岸" (riverbank). The two conditional probabilities P(银行 | bank) and P(河岸 | bank) denote the probability that "bank" is translated into "银行" and the probability that "bank" is translated into "河岸", respectively. These conditional probabilities can be calculated by segmenting the bilingual sentence pairs in a bilingual corpus into words, performing word alignment, and then counting from the alignment result how many times "bank" was translated into "银行" and how many times it was translated into "河岸".
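The following minimal sketch (an illustration, not the patent's implementation) estimates these word translation probabilities by relative frequency from word-aligned sentence pairs; the alignment-link input format is an assumption:

    from collections import Counter, defaultdict

    def estimate_translation_probs(alignment_links):
        """alignment_links: iterable of (english_word, chinese_word) links obtained from
        word segmentation and word alignment of a bilingual corpus."""
        joint = Counter(alignment_links)              # joint counts c(e, c)
        e_total, c_total = defaultdict(int), defaultdict(int)
        for (e, c), n in joint.items():
            e_total[e] += n
            c_total[c] += n
        # P(c | e): probability that English word e is translated into Chinese word c
        p_c_given_e = {(c, e): n / e_total[e] for (e, c), n in joint.items()}
        # P(e | c): probability that Chinese word c is translated into English word e
        p_e_given_c = {(e, c): n / c_total[c] for (e, c), n in joint.items()}
        return p_c_given_e, p_e_given_c

    def word_sim(w_cn, w_en, p_c_given_e, p_e_given_c):
        # Symmetric word similarity: Sim(w_i, w_j_bar) = P(w_i | w_j_bar) * P(w_j_bar | w_i)
        return p_c_given_e.get((w_cn, w_en), 0.0) * p_e_given_c.get((w_en, w_cn), 0.0)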
Those of ordinary skill in the art will appreciate that one may also let $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$ or $Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
The similarity of the Chinese and English feature word vectors is then calculated from the translation probabilities $Sim(w_i, \overline{w_j})$:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
Texts whose similarity exceeds a threshold are grouped into one class. Preferably, this threshold is obtained by training on test texts according to the above steps. Of course, the threshold can also be determined empirically. Preferably, the threshold is related to the dimensions of the feature word vectors of the texts, and its value range is determined by T, where, for example, T = n = m when n and m are equal, and T = min(n, m) when n and m are unequal.
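A minimal sketch of this vector similarity and threshold grouping, assuming a two-argument word_sim(w_cn, w_en) callable (for example the earlier word_sim with its probability tables pre-bound), might look as follows; names and data structures are assumptions, not the patent's implementation:

    def text_similarity(cn_features, en_features, word_sim):
        """Implements Sim(<w_1..w_n>, <w_1bar..w_mbar>) = (sum_i sum_j Sim(w_i, w_jbar)) / (m * n)."""
        n, m = len(cn_features), len(en_features)
        if not (n and m):
            return 0.0
        total = sum(word_sim(wc, we) for wc in cn_features for we in en_features)
        return total / (m * n)

    def cross_language_cluster(cn_clusters, en_clusters, word_sim, threshold):
        """cn_clusters / en_clusters: dicts mapping a cluster id to its feature word vector.
        A Chinese cluster and an English cluster are grouped into one class when their
        similarity exceeds the threshold."""
        merged = []
        for cid, cn_vec in cn_clusters.items():
            for eid, en_vec in en_clusters.items():
                if text_similarity(cn_vec, en_vec, word_sim) > threshold:
                    merged.append((cid, eid))
        return merged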
Based on the above text clustering method, the present invention also provides a text clustering system, comprising:
a monolingual text clustering module 100 for clustering the texts of each language separately;
an extraction module 200 for extracting feature word vectors from the texts of each language clustered by the monolingual text clustering module;
an analysis module 300 for calculating the similarity between the feature word vectors of the texts in different languages extracted by the extraction module, and clustering all the texts.
According to a specific embodiment of the present invention, the extraction module 200 may comprise a feature clustering module 210 and a feature selection module 220.
The feature clustering module 210 is used for clustering the features according to the similarity between the features in the texts of each language clustered by the monolingual text clustering module, selecting in each cluster one feature to represent the whole cluster, and removing the other features in the cluster from the candidate feature set.
The feature selection module 220 is used for performing feature selection on the features remaining after removal by the feature clustering module, using the information gain method, to obtain the feature word vector.
The analysis module 300 may comprise a translation probability computing module 310, a similarity calculation module 320, and a multilingual text clustering module 330.
The translation probability computing module 310 is used for calculating the translation probability $Sim(w_i, \overline{w_j})$ of the feature words in the feature word vectors of the bilingual texts extracted by the extraction module, where $w_i$ and $\overline{w_j}$ are feature words of a text in one language and of a text in the other language, respectively. According to different embodiments of the present invention, it may comprise a first probability calculation module 311 and a first probability determination module, or a second probability calculation module 312 and a second probability determination module. Fig. 3 shows a block diagram of the text clustering system according to a preferred embodiment of the present invention. According to this preferred embodiment, the translation probability computing module comprises the first probability calculation module 311, the second probability calculation module 312, and a third probability determination module 313.
The first probability calculation module 311 is used for calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$. The second probability calculation module 312 is used for calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$.
The first probability determination module is used for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module, using the formula $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$.
The second probability determination module is used for calculating the translation probability of the feature words from the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the formula $Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
The third probability determination module 313 is used for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module and the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the formula $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$.
The similarity calculation module 320 is used for calculating the similarity from the $Sim(w_i, \overline{w_j})$ calculated by the translation probability computing module, using the following formula:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
where $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are the feature word vectors of the text in one language and the text in the other language respectively, and n and m are the numbers of feature words in the two feature word vectors.
The multilingual text clustering module 330 is used for grouping texts whose similarity, as calculated by the similarity calculation module 320, is greater than a threshold into one class.
The above text clustering method and text clustering system can be applied in a question-answering system, in particular in the question-answering system of a search engine. For example, suppose the question-answering system receives a question "how to make the fried rice and eggs" through a question page. The question-answering system extracts the feature word vector of this question, and then calculates the similarity between this feature word vector and the feature word vectors of each class for which clustering has already been completed in the system, thereby clustering the question into the appropriate class. If no class qualifies, a new class is created for the question. In this way, queried questions are classified, and more accurate answers are then provided based on the classification.
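As a hedged illustration of this assignment step (reusing the text_similarity helper sketched earlier; the threshold handling and names are assumptions):

    def assign_question(question_features, class_vectors, word_sim, threshold):
        """class_vectors: dict mapping a class id to the feature word vector of that class.
        Returns the id of the most similar existing class, or None if a new class should be created."""
        best_id, best_sim = None, threshold
        for class_id, class_vec in class_vectors.items():
            sim = text_similarity(question_features, class_vec, word_sim)
            if sim > best_sim:
                best_id, best_sim = class_id, sim
        return best_id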
Those of ordinary skill in the art will appreciate that although Chinese and English texts are used as examples above, the present invention is not limited thereto and can be widely applied to texts in various languages. Moreover, the application of the text clustering method of the present invention is not limited to the question-answering system described in the specific embodiments; it can be applied to other scenarios involving multilingual texts.
It should be noted and understood that various modifications and improvements may be made to the present invention described in detail above without departing from the spirit and scope of the present invention as required by the appended claims. Therefore, the scope of the claimed technical solution is not limited by any particular exemplary teaching given herein.

Claims (18)

1. A text clustering method, comprising the steps of:
1) clustering the texts of each language separately;
2) extracting feature word vectors from the clustered texts of each language respectively;
3) calculating the similarity between the feature word vectors of texts in different languages, and clustering all the texts.
2. The method according to claim 1, characterized in that step 3) further comprises:
31) calculating the translation probability $Sim(w_i, \overline{w_j})$ of the feature words in the texts of the two languages, where $w_i$ and $\overline{w_j}$ are feature words of a text in one language and of a text in the other language, respectively;
32) calculating the similarity according to the following formula:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
where $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are the feature word vectors of the text in one language and the text in the other language respectively, and n and m are the numbers of feature words in the two feature word vectors;
33) grouping texts whose similarity is greater than a threshold into one class.
3. The method according to claim 2, characterized in that step 31) further comprises:
311) calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
312) calculating the translation probability of the feature words according to the following formula: $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$.
4. The method according to claim 2, characterized in that step 31) further comprises:
313) calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
314) calculating the translation probability of the feature words according to the following formula: $Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
5. The method according to claim 2, characterized in that step 31) further comprises:
311) calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
313) calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
315) calculating the translation probability of the feature words according to the following formula: $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$.
6. The method according to any one of claims 1 to 5, characterized in that the clustering in step 1) uses an agglomerative hierarchical clustering method.
7. The method according to any one of claims 1 to 5, characterized in that the feature word vectors in step 2) are extracted by a clustering-based method.
8. The method according to claim 7, characterized in that step 2) further comprises:
21) first clustering the features according to the similarity between the features in the clustered texts of each language, selecting in each cluster one feature to represent the whole cluster, and removing the other features in the cluster from the candidate feature set;
22) performing feature selection on the remaining features using the information gain method to obtain the feature word vector.
9. The method according to claim 2, characterized in that the threshold is obtained by training on test texts.
10. The method according to claim 2, characterized in that the value range of the threshold is determined by T, where T = n = m when n and m are equal, and T = min(n, m) when n and m are unequal.
11. A text clustering system, comprising:
a monolingual text clustering module for clustering the texts of each language separately;
an extraction module for extracting feature word vectors from the texts of each language clustered by the monolingual text clustering module;
an analysis module for calculating the similarity between the feature word vectors of the texts in different languages extracted by the extraction module, and clustering all the texts.
12. The system according to claim 11, characterized in that the analysis module further comprises:
a translation probability computing module for calculating the translation probability $Sim(w_i, \overline{w_j})$ of the feature words in the texts of the two languages extracted by the extraction module, where $w_i$ and $\overline{w_j}$ are feature words of a text in one language and of a text in the other language, respectively;
a similarity calculation module for calculating the similarity from the $Sim(w_i, \overline{w_j})$ calculated by the translation probability computing module, using the following formula:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
where $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are the feature word vectors of the text in one language and the text in the other language respectively, and n and m are the numbers of feature words in the two feature word vectors;
a multilingual text clustering module for grouping texts whose similarity, as calculated by the similarity calculation module, is greater than a threshold into one class.
13. The system according to claim 12, characterized in that the translation probability computing module further comprises:
a first probability calculation module for calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
a second probability calculation module for calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
a first probability determination module for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module and the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the following formula: $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$.
14. The system according to claim 12, characterized in that the translation probability computing module further comprises:
a first probability calculation module for calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
a second probability determination module for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module, using the following formula: $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$.
15. The system according to claim 12, characterized in that the translation probability computing module further comprises:
a second probability calculation module for calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
a third probability determination module for calculating the translation probability of the feature words from the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the following formula: $Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
16. The system according to claim 11, characterized in that the extraction module further comprises:
a feature clustering module for clustering the features according to the similarity between the features in the texts of each language clustered by the monolingual text clustering module, selecting in each cluster one feature to represent the whole cluster, and removing the other features in the cluster from the candidate feature set;
a feature selection module for performing feature selection on the features remaining after removal by the feature clustering module, using the information gain method, to obtain the feature word vector.
17. A question-answering system, comprising the text clustering system according to claim 11.
18. A search engine, comprising the question-answering system according to claim 17.
CN2011100563794A 2011-03-09 2011-03-09 Text clustering method, question-answering system applying same and search engine applying same Pending CN102682000A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100563794A CN102682000A (en) 2011-03-09 2011-03-09 Text clustering method, question-answering system applying same and search engine applying same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011100563794A CN102682000A (en) 2011-03-09 2011-03-09 Text clustering method, question-answering system applying same and search engine applying same

Publications (1)

Publication Number Publication Date
CN102682000A true CN102682000A (en) 2012-09-19

Family

ID=46813948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100563794A Pending CN102682000A (en) 2011-03-09 2011-03-09 Text clustering method, question-answering system applying same and search engine applying same

Country Status (1)

Country Link
CN (1) CN102682000A (en)


Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955857B (en) * 2012-11-09 2015-07-08 北京航空航天大学 Class center compression transformation-based text clustering method in search engine
CN102955857A (en) * 2012-11-09 2013-03-06 北京航空航天大学 Class center compression transformation-based text clustering method in search engine
CN103049433B (en) * 2012-12-11 2015-10-28 微梦创科网络科技(中国)有限公司 The method of automatic question-answering method, automatically request-answering system and structure question and answer case library
CN103049433A (en) * 2012-12-11 2013-04-17 微梦创科网络科技(中国)有限公司 Automatic question answering method, automatic question answering system and method for constructing question answering case base
WO2015035628A1 (en) * 2013-09-12 2015-03-19 广东电子工业研究院有限公司 Method of clustering literature in multiple languages
CN104361127A (en) * 2014-12-05 2015-02-18 广西师范大学 Multilanguage question and answer interface fast constituting method based on domain ontology and template logics
CN104361127B (en) * 2014-12-05 2017-09-26 广西师范大学 The multilingual quick constructive method of question and answer interface based on domain body and template logic
CN104573046A (en) * 2015-01-20 2015-04-29 成都品果科技有限公司 Comment analyzing method and system based on term vector
CN104573046B (en) * 2015-01-20 2018-07-31 成都品果科技有限公司 A kind of comment and analysis method and system based on term vector
CN104778256A (en) * 2015-04-20 2015-07-15 江苏科技大学 Rapid incremental clustering method for domain question-answering system consultations
CN104778256B (en) * 2015-04-20 2017-10-17 江苏科技大学 A kind of the quick of field question answering system consulting can increment clustering method
CN106095845A (en) * 2016-06-02 2016-11-09 腾讯科技(深圳)有限公司 File classification method and device
CN105912734A (en) * 2016-06-22 2016-08-31 北京金山安全软件有限公司 User feedback automatic reply method and device
CN106559695A (en) * 2016-10-14 2017-04-05 北京金山安全软件有限公司 Barrage message processing method and device and electronic equipment
CN108170691A (en) * 2016-12-07 2018-06-15 北京国双科技有限公司 It is associated with the determining method and apparatus of document
CN106815310A (en) * 2016-12-20 2017-06-09 华南师范大学 A kind of hierarchy clustering method and system to magnanimity document sets
CN106815310B (en) * 2016-12-20 2020-04-21 华南师范大学 Hierarchical clustering method and system for massive document sets
CN107145573A (en) * 2017-05-05 2017-09-08 上海携程国际旅行社有限公司 The problem of artificial intelligence customer service robot, answers method and system
CN108416014A (en) * 2018-03-05 2018-08-17 杭州朗和科技有限公司 Data processing method, medium, system and electronic equipment
CN109063184A (en) * 2018-08-24 2018-12-21 广东外语外贸大学 Multilingual newsletter archive clustering method, storage medium and terminal device
CN109063184B (en) * 2018-08-24 2020-09-01 广东外语外贸大学 Multi-language news text clustering method, storage medium and terminal device
CN110046332A (en) * 2019-04-04 2019-07-23 珠海远光移动互联科技有限公司 A kind of Similar Text data set generation method and device
CN110046332B (en) * 2019-04-04 2024-01-23 远光软件股份有限公司 Similar text data set generation method and device
CN113570380A (en) * 2020-04-28 2021-10-29 中国移动通信集团浙江有限公司 Service complaint processing method, device and equipment based on semantic analysis and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN102682000A (en) Text clustering method, question-answering system applying same and search engine applying same
CN103399901B (en) A kind of keyword abstraction method
WO2018189589A2 (en) Systems and methods for document processing using machine learning
CN102651003B (en) Cross-language searching method and device
CN105844424A (en) Product quality problem discovery and risk assessment method based on network comments
Mori et al. A machine learning approach to recipe text processing
US20190171713A1 (en) Semantic parsing method and apparatus
CN104008126A (en) Method and device for segmentation on basis of webpage content classification
CN102214189B (en) Data mining-based word usage knowledge acquisition system and method
CN104391885A (en) Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training
CN102043808A (en) Method and equipment for extracting bilingual terms using webpage structure
CN103544266A (en) Method and device for generating search suggestion words
CN102253930A (en) Method and device for translating text
CN102789464A (en) Natural language processing method, device and system based on semanteme recognition
CN104281565A (en) Semantic dictionary constructing method and device
CN111563382A (en) Text information acquisition method and device, storage medium and computer equipment
CN104391969A (en) User query statement syntactic structure determining method and device
Duc et al. Cross-language latent relational search: Mapping knowledge across languages
CN110209781A (en) A kind of text handling method, device and relevant device
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
CN101763403A (en) Query translation method facing multi-lingual information retrieval system
KR102083017B1 (en) Method and system for analyzing social review of place
Perez-Tellez et al. On the difficulty of clustering microblog texts for online reputation management
CN112487263A (en) Information processing method, system, equipment and computer readable storage medium
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120919