CN102682000A - Text clustering method, question-answering system applying same and search engine applying same - Google Patents

Text clustering method, question-answering system applying same and search engine applying same

Info

Publication number
CN102682000A
CN102682000A CN2011100563794A CN201110056379A
Authority
CN
China
Prior art keywords
text
probability
characteristic
language
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011100563794A
Other languages
Chinese (zh)
Inventor
沈文竹
吴甜
柴春光
吴华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2011100563794A priority Critical patent/CN102682000A/en
Publication of CN102682000A publication Critical patent/CN102682000A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text clustering method, a question-answering system applying the same, and a search engine applying the same. The method comprises the following steps: 1) clustering the texts of each language separately; 2) extracting feature word vectors from the clustered texts of each language; and 3) calculating the similarity between the feature word vectors of texts in different languages, and clustering all the texts. With the method, the question-answering system, and the search engine, texts in various languages can be clustered correctly.

Description

Text clustering method, and question-answering system and search engine adopting the same
Technical field
The present invention relates to the field of pattern recognition, and more particularly to a pattern recognition method for natural language.
Background art
With the spread of information networks, the emergence of massive volumes of electronic text has created an urgent demand for automatic text classification by machine. Automatic text classification saves considerable manpower and material resources and avoids many drawbacks of manual classification, such as long cycles, high cost, and low efficiency. Automatic text classification categorizes large numbers of texts automatically according to their content, thereby helping people process and organize text data effectively.
This demand is even stronger for search engines. People increasingly rely on search to obtain knowledge and information. Faced with hundreds of millions of web pages and information resources, the greatest challenge a search engine faces is how to provide users with the information they need quickly and accurately from such a vast repository.
Consider, for example, the question-answering systems associated with search engines. In a question-answering system, a complete question page contains a question posed by one user and one or more answers provided by other users. When a new user queries a question in the question-answering system, the system needs to search the existing question pages in the system using the search keywords extracted from the new user's question, and return them to the new user.
Because of the diversity of spoken and written language, a question-answering system cannot restrict the form in which a new user phrases a question, so a question that is identical in substance may have multiple forms of expression. The search keywords extracted from different expressions of the same question may differ, and the question pages that can be found therefore also differ. Since these questions are the same in essential content, returning all related question pages to the user, rather than only a subset, would clearly improve the user experience. To solve this problem, the questions need to be clustered.
Existing question-answering systems analyze the text of question pages, build an index from "search keyword" to "question page", and return the related question pages for a user query. In other words, a "question page" can be returned only if it contains the "search keyword". From the user's point of view, questions such as "way of fried rice with eggs", "how fried rice with eggs is cooked", and "how to cook fried rice with eggs" are equivalent in content; when the user searches, all question pages corresponding to these questions that meet the user's need should be returned. In short, questions on the same topic should be grouped into the same class. When a user's query keyword is related to one class of questions, all pages of that class are put into the candidate set and shown to the user. As shown in Fig. 1, before clustering, search keyword 1 returns question pages 1 and 3, and search keyword 2 returns question pages 2 and 4. During clustering, pages 1 and 2 are grouped into one class and pages 3 and 4 into another class; after clustering, either search keyword 1 or search keyword 2 returns question pages 1, 2, 3, and 4.
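Purely to illustrate the retrieval behaviour described above (this sketch is not part of the patent; the index structures and names are assumed), the following Python fragment contrasts plain keyword lookup with cluster-expanded lookup, using the page and keyword numbering of Fig. 1:

    def search_without_clusters(keyword, index):
        """index: dict mapping a search keyword to the set of question page ids containing it."""
        return index.get(keyword, set())

    def search_with_clusters(keyword, index, cluster_of, clusters):
        """cluster_of: page id -> cluster id; clusters: cluster id -> set of page ids.
        Every page in the cluster of a matched page is added to the candidate set."""
        candidates = set()
        for page in index.get(keyword, set()):
            candidates |= clusters[cluster_of[page]]
        return candidates

    # Example of Fig. 1: keyword 1 matches pages 1 and 3; pages {1, 2} and {3, 4} form two clusters.
    index = {"keyword1": {1, 3}, "keyword2": {2, 4}}
    cluster_of = {1: "A", 2: "A", 3: "B", 4: "B"}
    clusters = {"A": {1, 2}, "B": {3, 4}}
    assert search_with_clusters("keyword1", index, cluster_of, clusters) == {1, 2, 3, 4}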
However, text clustering in the prior art considers only monolingual text and ignores the barrier posed by language. In fact, with the growth of international exchange and cooperation, multilingual content appears in more and more communities. Continuing the question-answering example above, multilingual questions arise, such as "how to make the fried rice and eggs" in English and a Japanese question with the same meaning. From the perspective of content, these two questions are also equivalent to the three "fried rice with eggs" questions above. Ideally, these five questions should therefore be grouped into one class, so as to provide users with a better search experience.
In summary, there is an urgent need for a method that can accurately cluster texts in various languages, and for a question-answering system that can unify questions in multiple languages so that answerers can easily identify and answer them.
Summary of the invention
The object of the present invention is to provide a method that can accurately cluster texts in various languages.
According to one aspect of the present invention, a text clustering method is provided, comprising the following steps:
1) clustering the texts of each language separately;
2) extracting feature word vectors from the clustered texts of each language respectively;
3) calculating the similarity between the feature word vectors of texts in different languages, and clustering all the texts.
In the method, step 3) further comprises:
31) calculating the translation probability $Sim(w_i, \overline{w_j})$ of the feature words in the texts of the two languages, where $w_i$ and $\overline{w_j}$ are feature words of a text in one language and of a text in the other language, respectively;
32) calculating the similarity according to the following formula:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
where $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are the feature word vectors of the text in one language and the text in the other language respectively, and n and m are the numbers of feature words in the two feature word vectors;
33) grouping texts whose similarity is greater than a threshold into one class.
In the method, step 31) further comprises:
311) calculating the word translation probabilities $P(w_i \mid \overline{w_j})$ and $P(\overline{w_j} \mid w_i)$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$, and $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
312) calculating the translation probability of the feature words according to the following formula:
$Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$.
In the method, step 31) may alternatively comprise:
313) calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
314) calculating the translation probability of the feature words according to the following formula:
$Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$.
In the method, step 31) may alternatively comprise:
315) calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
316) calculating the translation probability of the feature words according to the following formula:
$Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
In the method, the value range of the threshold is determined by T, where T = n = m when n and m are equal, and T = min(n, m) when n and m are unequal.
According to another aspect of the present invention, a text clustering system is also provided, comprising:
a monolingual text clustering module for clustering the texts of each language separately;
an extraction module for extracting feature word vectors from the texts of each language clustered by the monolingual text clustering module;
an analysis module for calculating the similarity between the feature word vectors of the texts in different languages extracted by the extraction module, and clustering all the texts.
In a system according to one embodiment of the present invention, the analysis module further comprises:
a translation probability computing module for calculating the translation probability $Sim(w_i, \overline{w_j})$ of the feature words in the texts of the two languages extracted by the extraction module, where $w_i$ and $\overline{w_j}$ are feature words of a text in one language and of a text in the other language, respectively;
a similarity calculation module for calculating the similarity from the $Sim(w_i, \overline{w_j})$ calculated by the translation probability computing module, using the following formula:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
where $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are the feature word vectors of the text in one language and the text in the other language respectively, and n and m are the numbers of feature words in the two feature word vectors;
a multilingual text clustering module for grouping texts whose similarity, as calculated by the similarity calculation module, is greater than a threshold into one class.
In a system according to one embodiment of the present invention, the translation probability computing module further comprises:
a first probability calculation module for calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
a second probability calculation module for calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
a first probability determination module for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module and the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the following formula:
$Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$.
In a system according to another embodiment of the present invention, the translation probability computing module further comprises:
a first probability calculation module for calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
a second probability determination module for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module, using the following formula: $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$.
In a system according to yet another embodiment of the present invention, the translation probability computing module further comprises:
a second probability calculation module for calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
a third probability determination module for calculating the translation probability of the feature words from the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the following formula: $Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
In a system according to one embodiment of the present invention, the extraction module further comprises:
a feature clustering module for clustering the features according to the similarity between the features in the texts of each language clustered by the monolingual text clustering module, selecting in each cluster one feature to represent the whole cluster, and removing the other features in the cluster from the candidate feature set;
a feature selection module for performing feature selection on the features remaining after removal by the feature clustering module, using the information gain method, to obtain the feature word vector.
According to a further aspect of the present invention, a question-answering system is also provided, which comprises the above text clustering system, so as to classify queried questions and then provide answers according to the classification.
According to another aspect of the present invention, a search engine is also provided, comprising the above question-answering system.
The text clustering method provided by the present invention enables accurate clustering of multilingual texts. Applying this method in a question-answering system, especially the question-answering system of a search engine, can improve the accuracy of answers and effectively save network resources.
Description of drawings
Fig. 1 is a schematic diagram of question clustering in a question-answering system;
Fig. 2 is a flowchart of a text clustering method according to an embodiment of the present invention;
Fig. 3 is a block diagram of a text clustering system according to a preferred embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the text clustering method according to an embodiment of the present invention is further described below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are intended only to illustrate the present invention and not to limit it.
The text clustering method according to an embodiment of the present invention is described in detail below with reference to the flowchart of Fig. 2, taking bilingual Chinese and English texts as an example.
First, the Chinese texts and the English texts are each clustered separately. This clustering may use a content-dependent clustering method or a content-independent clustering method.
A content-dependent clustering method uses a similarity function to describe the degree of similarity between texts according to certain internal features of the elements, for example whether or not two texts share a certain feature.
Preferably, a content-independent clustering method is used, which can assess the degree of similarity between texts, and then cluster them, without considering the text content. Further, a graph-based iterative clustering method may be used, which can mine implicit relations among keywords: some keywords appear entirely unrelated at the beginning of the computation but may gradually become similar as it proceeds. In this way, Chinese or English texts can be clustered efficiently.
In particular, for clustering the search questions associated with a search engine, clustering can be performed using the question log: if the results clicked by the users of two questions are consistent, the two questions are considered similar. Specifically, an agglomerative hierarchical clustering method can be used, which mainly comprises the following two steps (an illustrative sketch follows the two steps):
1) First, a bipartite graph is constructed from the search engine query log that is read in. The nodes q on one side represent keywords, and the nodes d on the other side represent links. If a keyword and a link are found to appear in the same log record, that is, the user clicked a link in the search result page when issuing a keyword query, a bidirectional edge is created between the node representing the keyword and the node representing the link.
2) After the bidirectional edges are constructed, the two q nodes with the greatest similarity are merged, then the two d nodes with the greatest similarity are merged, and this is iterated until a termination condition is satisfied.
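As an illustration only (not part of the patent text), the following minimal Python sketch shows one way such an agglomerative clustering of the query log could be implemented. The log format, the Jaccard-style node similarity, and the termination threshold are assumptions, and for brevity only the keyword (q) side is merged, whereas the method described above alternately merges q nodes and d nodes:

    from collections import defaultdict

    def cluster_query_log(click_records, min_sim=0.5):
        """click_records: iterable of (keyword, clicked_link) pairs read from the query log."""
        # Build the bipartite graph: each keyword node is connected to the links clicked for it.
        links_of = defaultdict(set)
        for keyword, link in click_records:
            links_of[keyword].add(link)          # bidirectional edge q <-> d

        clusters = [{q} for q in links_of]       # start with one cluster per keyword

        def cluster_links(cluster):
            return set().union(*(links_of[q] for q in cluster))

        def similarity(c1, c2):                  # assumed: Jaccard overlap of clicked links
            a, b = cluster_links(c1), cluster_links(c2)
            return len(a & b) / len(a | b) if (a | b) else 0.0

        while True:                              # iteratively merge the most similar pair
            best, pair = 0.0, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = similarity(clusters[i], clusters[j])
                    if s > best:
                        best, pair = s, (i, j)
            if pair is None or best < min_sim:   # termination condition
                break
            i, j = pair
            clusters[i] |= clusters.pop(j)
        return clusters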
Next, cross-language clustering is performed on the bilingual texts.
Feature word vectors $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are extracted from the clustered Chinese texts and English texts respectively, where n and m are the dimensions of the feature word vectors of the Chinese text and of the English text, respectively.
The feature word vectors can be extracted by a clustering-based method, which specifically comprises: first clustering the features according to the similarity between features, selecting in each cluster one feature to represent the whole cluster, and removing the other features in the cluster from the candidate feature set, thereby reducing redundancy in the feature set; then performing feature selection on the remaining features using the information gain method to obtain the feature word vector. See Zhang Wenliang et al., "A clustering-based text feature selection method", Computer Applications, January 2007.
Of course, the feature word vectors can also be extracted by other methods, for example methods based on mutual information or the χ² statistic.
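For illustration only, the following sketch shows how the information gain selection mentioned above could be computed over a candidate feature set; the bag-of-words document representation and helper names are assumptions, and the redundancy-removing feature clustering step is taken as already done:

    import math
    from collections import Counter

    def information_gain(docs, labels, feature):
        """docs: list of token sets; labels: cluster label of each document."""
        def entropy(lbls):
            n = len(lbls)
            return -sum((c / n) * math.log2(c / n) for c in Counter(lbls).values()) if n else 0.0
        with_f = [l for d, l in zip(docs, labels) if feature in d]
        without_f = [l for d, l in zip(docs, labels) if feature not in d]
        p = len(with_f) / len(docs)
        return entropy(labels) - p * entropy(with_f) - (1 - p) * entropy(without_f)

    def select_feature_words(docs, labels, candidates, top_k=100):
        # Rank the remaining candidate features by information gain and keep the best ones;
        # the selected words form the feature word vector of the texts.
        ranked = sorted(candidates, key=lambda f: information_gain(docs, labels, f), reverse=True)
        return ranked[:top_k]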
Then the translation probabilities $Sim(w_i, \overline{w_j})$ of the Chinese and English feature words are calculated, where $w_i \in \{w_1, w_2, \ldots, w_n\}$, $\overline{w_j} \in \{\overline{w_1}, \overline{w_2}, \ldots, \overline{w_m}\}$, $1 \le i \le n$, $1 \le j \le m$, the two being feature words of the Chinese text and of the English text respectively. To guarantee the symmetry of the similarity, let $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$, and $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$. For example, the English word "bank" has two Chinese meanings: one is "银行" (the financial institution) and the other is "河岸" (riverbank). The two conditional probabilities P(银行 | bank) and P(河岸 | bank) denote the probability that "bank" is translated into "银行" and the probability that "bank" is translated into "河岸", respectively. These conditional probabilities can be calculated by segmenting the bilingual sentence pairs in a bilingual corpus into words, performing word alignment, and then counting from the alignment result how many times "bank" was translated into "银行" and how many times it was translated into "河岸".
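The following minimal sketch (an illustration, not the patent's implementation) estimates these word translation probabilities by relative frequency from word-aligned sentence pairs; the alignment-link input format is an assumption:

    from collections import Counter, defaultdict

    def estimate_translation_probs(alignment_links):
        """alignment_links: iterable of (english_word, chinese_word) links obtained from
        word segmentation and word alignment of a bilingual corpus."""
        joint = Counter(alignment_links)              # joint counts c(e, c)
        e_total, c_total = defaultdict(int), defaultdict(int)
        for (e, c), n in joint.items():
            e_total[e] += n
            c_total[c] += n
        # P(c | e): probability that English word e is translated into Chinese word c
        p_c_given_e = {(c, e): n / e_total[e] for (e, c), n in joint.items()}
        # P(e | c): probability that Chinese word c is translated into English word e
        p_e_given_c = {(e, c): n / c_total[c] for (e, c), n in joint.items()}
        return p_c_given_e, p_e_given_c

    def word_sim(w_cn, w_en, p_c_given_e, p_e_given_c):
        # Symmetric word similarity: Sim(w_i, w_j_bar) = P(w_i | w_j_bar) * P(w_j_bar | w_i)
        return p_c_given_e.get((w_cn, w_en), 0.0) * p_e_given_c.get((w_en, w_cn), 0.0)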
Those of ordinary skill in the art will appreciate that one may also let $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$ or $Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
The similarity of the Chinese and English feature word vectors is then calculated from the translation probabilities $Sim(w_i, \overline{w_j})$:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
Texts whose similarity exceeds a threshold are grouped into one class. Preferably, this threshold is obtained by training on test texts according to the above steps. Of course, the threshold can also be determined empirically. Preferably, the threshold is related to the dimensions of the feature word vectors of the texts, and its value range is determined by T, where, for example, T = n = m when n and m are equal, and T = min(n, m) when n and m are unequal.
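A minimal sketch of this vector similarity and threshold grouping, assuming a two-argument word_sim(w_cn, w_en) callable (for example the earlier word_sim with its probability tables pre-bound), might look as follows; names and data structures are assumptions, not the patent's implementation:

    def text_similarity(cn_features, en_features, word_sim):
        """Implements Sim(<w_1..w_n>, <w_1bar..w_mbar>) = (sum_i sum_j Sim(w_i, w_jbar)) / (m * n)."""
        n, m = len(cn_features), len(en_features)
        if not (n and m):
            return 0.0
        total = sum(word_sim(wc, we) for wc in cn_features for we in en_features)
        return total / (m * n)

    def cross_language_cluster(cn_clusters, en_clusters, word_sim, threshold):
        """cn_clusters / en_clusters: dicts mapping a cluster id to its feature word vector.
        A Chinese cluster and an English cluster are grouped into one class when their
        similarity exceeds the threshold."""
        merged = []
        for cid, cn_vec in cn_clusters.items():
            for eid, en_vec in en_clusters.items():
                if text_similarity(cn_vec, en_vec, word_sim) > threshold:
                    merged.append((cid, eid))
        return merged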
Based on the above text clustering method, the present invention also provides a text clustering system, comprising:
a monolingual text clustering module 100 for clustering the texts of each language separately;
an extraction module 200 for extracting feature word vectors from the texts of each language clustered by the monolingual text clustering module;
an analysis module 300 for calculating the similarity between the feature word vectors of the texts in different languages extracted by the extraction module, and clustering all the texts.
According to a specific embodiment of the present invention, the extraction module 200 may comprise a feature clustering module 210 and a feature selection module 220.
The feature clustering module 210 is used for clustering the features according to the similarity between the features in the texts of each language clustered by the monolingual text clustering module, selecting in each cluster one feature to represent the whole cluster, and removing the other features in the cluster from the candidate feature set.
The feature selection module 220 is used for performing feature selection on the features remaining after removal by the feature clustering module, using the information gain method, to obtain the feature word vector.
The analysis module 300 may comprise a translation probability computing module 310, a similarity calculation module 320, and a multilingual text clustering module 330.
The translation probability computing module 310 is used for calculating the translation probability $Sim(w_i, \overline{w_j})$ of the feature words in the feature word vectors of the bilingual texts extracted by the extraction module, where $w_i$ and $\overline{w_j}$ are feature words of a text in one language and of a text in the other language, respectively. According to different embodiments of the present invention, it may comprise a first probability calculation module 311 and a first probability determination module, or a second probability calculation module 312 and a second probability determination module. Fig. 3 shows a block diagram of the text clustering system according to a preferred embodiment of the present invention. According to this preferred embodiment, the translation probability computing module comprises the first probability calculation module 311, the second probability calculation module 312, and a third probability determination module 313.
The first probability calculation module 311 is used for calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$. The second probability calculation module 312 is used for calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$.
The first probability determination module is used for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module, using the formula $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$.
The second probability determination module is used for calculating the translation probability of the feature words from the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the formula $Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
The third probability determination module 313 is used for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module and the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the formula $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$.
The similarity calculation module 320 is used for calculating the similarity from the $Sim(w_i, \overline{w_j})$ calculated by the translation probability computing module, using the following formula:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
where $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are the feature word vectors of the text in one language and the text in the other language respectively, and n and m are the numbers of feature words in the two feature word vectors.
The multilingual text clustering module 330 is used for grouping texts whose similarity, as calculated by the similarity calculation module 320, is greater than a threshold into one class.
The above text clustering method and text clustering system can be applied in a question-answering system, in particular in the question-answering system of a search engine. For example, suppose the question-answering system receives a question "how to make the fried rice and eggs" through a question page. The question-answering system extracts the feature word vector of this question, and then calculates the similarity between this feature word vector and the feature word vectors of each class for which clustering has already been completed in the system, thereby clustering the question into the appropriate class. If no class qualifies, a new class is created for the question. In this way, queried questions are classified, and more accurate answers are then provided based on the classification.
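As a hedged illustration of this assignment step (reusing the text_similarity helper sketched earlier; the threshold handling and names are assumptions):

    def assign_question(question_features, class_vectors, word_sim, threshold):
        """class_vectors: dict mapping a class id to the feature word vector of that class.
        Returns the id of the most similar existing class, or None if a new class should be created."""
        best_id, best_sim = None, threshold
        for class_id, class_vec in class_vectors.items():
            sim = text_similarity(question_features, class_vec, word_sim)
            if sim > best_sim:
                best_id, best_sim = class_id, sim
        return best_id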
Those of ordinary skill in the art will appreciate that although Chinese and English texts are used as examples above, the present invention is not limited thereto and can be widely applied to texts in various languages. Moreover, the application of the text clustering method of the present invention is not limited to the question-answering system described in the specific embodiments; it can be applied to other scenarios involving multilingual texts.
It should be noted and understood that various modifications and improvements may be made to the present invention described in detail above without departing from the spirit and scope of the present invention as required by the appended claims. Therefore, the scope of the claimed technical solution is not limited by any particular exemplary teaching given herein.

Claims (18)

1. A text clustering method, comprising the steps of:
1) clustering the texts of each language separately;
2) extracting feature word vectors from the clustered texts of each language respectively;
3) calculating the similarity between the feature word vectors of texts in different languages, and clustering all the texts.
2. The method according to claim 1, characterized in that step 3) further comprises:
31) calculating the translation probability $Sim(w_i, \overline{w_j})$ of the feature words in the texts of the two languages, where $w_i$ and $\overline{w_j}$ are feature words of a text in one language and of a text in the other language, respectively;
32) calculating the similarity according to the following formula:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
where $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are the feature word vectors of the text in one language and the text in the other language respectively, and n and m are the numbers of feature words in the two feature word vectors;
33) grouping texts whose similarity is greater than a threshold into one class.
3. The method according to claim 2, characterized in that step 31) further comprises:
311) calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
312) calculating the translation probability of the feature words according to the following formula: $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$.
4. The method according to claim 2, characterized in that step 31) further comprises:
313) calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
314) calculating the translation probability of the feature words according to the following formula: $Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
5. The method according to claim 2, characterized in that step 31) further comprises:
311) calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
313) calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
315) calculating the translation probability of the feature words according to the following formula: $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$.
6. The method according to any one of claims 1 to 5, characterized in that the clustering in step 1) uses an agglomerative hierarchical clustering method.
7. The method according to any one of claims 1 to 5, characterized in that the feature word vectors in step 2) are extracted by a clustering-based method.
8. The method according to claim 7, characterized in that step 2) further comprises:
21) first clustering the features according to the similarity between the features in the clustered texts of each language, selecting in each cluster one feature to represent the whole cluster, and removing the other features in the cluster from the candidate feature set;
22) performing feature selection on the remaining features using the information gain method to obtain the feature word vector.
9. The method according to claim 2, characterized in that the threshold is obtained by training on test texts.
10. The method according to claim 2, characterized in that the value range of the threshold is determined by T, where T = n = m when n and m are equal, and T = min(n, m) when n and m are unequal.
11. A text clustering system, comprising:
a monolingual text clustering module for clustering the texts of each language separately;
an extraction module for extracting feature word vectors from the texts of each language clustered by the monolingual text clustering module;
an analysis module for calculating the similarity between the feature word vectors of the texts in different languages extracted by the extraction module, and clustering all the texts.
12. The system according to claim 11, characterized in that the analysis module further comprises:
a translation probability computing module for calculating the translation probability $Sim(w_i, \overline{w_j})$ of the feature words in the texts of the two languages extracted by the extraction module, where $w_i$ and $\overline{w_j}$ are feature words of a text in one language and of a text in the other language, respectively;
a similarity calculation module for calculating the similarity from the $Sim(w_i, \overline{w_j})$ calculated by the translation probability computing module, using the following formula:
$Sim(\langle w_1, w_2, \ldots, w_n \rangle, \langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sim(w_i, \overline{w_j})}{m \cdot n}$
where $\langle w_1, w_2, \ldots, w_n \rangle$ and $\langle \overline{w_1}, \overline{w_2}, \ldots, \overline{w_m} \rangle$ are the feature word vectors of the text in one language and the text in the other language respectively, and n and m are the numbers of feature words in the two feature word vectors;
a multilingual text clustering module for grouping texts whose similarity, as calculated by the similarity calculation module, is greater than a threshold into one class.
13. The system according to claim 12, characterized in that the translation probability computing module further comprises:
a first probability calculation module for calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
a second probability calculation module for calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
a first probability determination module for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module and the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the following formula: $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j}) \cdot P(\overline{w_j} \mid w_i)$.
14. The system according to claim 12, characterized in that the translation probability computing module further comprises:
a first probability calculation module for calculating the word translation probability $P(w_i \mid \overline{w_j})$, where $P(w_i \mid \overline{w_j})$ denotes the probability that $\overline{w_j}$ is translated into $w_i$;
a second probability determination module for calculating the translation probability of the feature words from the $P(w_i \mid \overline{w_j})$ calculated by the first probability calculation module, using the following formula: $Sim(w_i, \overline{w_j}) = P(w_i \mid \overline{w_j})$.
15. The system according to claim 12, characterized in that the translation probability computing module further comprises:
a second probability calculation module for calculating the word translation probability $P(\overline{w_j} \mid w_i)$, where $P(\overline{w_j} \mid w_i)$ denotes the probability that $w_i$ is translated into $\overline{w_j}$;
a third probability determination module for calculating the translation probability of the feature words from the $P(\overline{w_j} \mid w_i)$ calculated by the second probability calculation module, using the following formula: $Sim(w_i, \overline{w_j}) = P(\overline{w_j} \mid w_i)$.
16. The system according to claim 11, characterized in that the extraction module further comprises:
a feature clustering module for clustering the features according to the similarity between the features in the texts of each language clustered by the monolingual text clustering module, selecting in each cluster one feature to represent the whole cluster, and removing the other features in the cluster from the candidate feature set;
a feature selection module for performing feature selection on the features remaining after removal by the feature clustering module, using the information gain method, to obtain the feature word vector.
17. A question-answering system, comprising the text clustering system according to claim 11.
18. A search engine, comprising the question-answering system according to claim 17.
CN2011100563794A 2011-03-09 2011-03-09 Text clustering method, question-answering system applying same and search engine applying same Pending CN102682000A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100563794A CN102682000A (en) 2011-03-09 2011-03-09 Text clustering method, question-answering system applying same and search engine applying same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011100563794A CN102682000A (en) 2011-03-09 2011-03-09 Text clustering method, question-answering system applying same and search engine applying same

Publications (1)

Publication Number Publication Date
CN102682000A true CN102682000A (en) 2012-09-19

Family

ID=46813948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100563794A Pending CN102682000A (en) 2011-03-09 2011-03-09 Text clustering method, question-answering system applying same and search engine applying same

Country Status (1)

Country Link
CN (1) CN102682000A (en)


Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955857B (en) * 2012-11-09 2015-07-08 北京航空航天大学 Class center compression transformation-based text clustering method in search engine
CN102955857A (en) * 2012-11-09 2013-03-06 北京航空航天大学 Class center compression transformation-based text clustering method in search engine
CN103049433B (en) * 2012-12-11 2015-10-28 微梦创科网络科技(中国)有限公司 The method of automatic question-answering method, automatically request-answering system and structure question and answer case library
CN103049433A (en) * 2012-12-11 2013-04-17 微梦创科网络科技(中国)有限公司 Automatic question answering method, automatic question answering system and method for constructing question answering case base
WO2015035628A1 (en) * 2013-09-12 2015-03-19 广东电子工业研究院有限公司 Method of clustering literature in multiple languages
CN104361127A (en) * 2014-12-05 2015-02-18 广西师范大学 Multilanguage question and answer interface fast constituting method based on domain ontology and template logics
CN104361127B (en) * 2014-12-05 2017-09-26 广西师范大学 The multilingual quick constructive method of question and answer interface based on domain body and template logic
CN104573046A (en) * 2015-01-20 2015-04-29 成都品果科技有限公司 Comment analyzing method and system based on term vector
CN104573046B (en) * 2015-01-20 2018-07-31 成都品果科技有限公司 A kind of comment and analysis method and system based on term vector
CN104778256A (en) * 2015-04-20 2015-07-15 江苏科技大学 Rapid incremental clustering method for domain question-answering system consultations
CN104778256B (en) * 2015-04-20 2017-10-17 江苏科技大学 A kind of the quick of field question answering system consulting can increment clustering method
CN106095845A (en) * 2016-06-02 2016-11-09 腾讯科技(深圳)有限公司 File classification method and device
CN105912734A (en) * 2016-06-22 2016-08-31 北京金山安全软件有限公司 User feedback automatic reply method and device
CN106559695A (en) * 2016-10-14 2017-04-05 北京金山安全软件有限公司 Barrage message processing method and device and electronic equipment
CN108170691A (en) * 2016-12-07 2018-06-15 北京国双科技有限公司 It is associated with the determining method and apparatus of document
CN106815310A (en) * 2016-12-20 2017-06-09 华南师范大学 A kind of hierarchy clustering method and system to magnanimity document sets
CN106815310B (en) * 2016-12-20 2020-04-21 华南师范大学 Hierarchical clustering method and system for massive document sets
CN107145573A (en) * 2017-05-05 2017-09-08 上海携程国际旅行社有限公司 The problem of artificial intelligence customer service robot, answers method and system
CN108416014A (en) * 2018-03-05 2018-08-17 杭州朗和科技有限公司 Data processing method, medium, system and electronic equipment
CN109063184A (en) * 2018-08-24 2018-12-21 广东外语外贸大学 Multilingual newsletter archive clustering method, storage medium and terminal device
CN109063184B (en) * 2018-08-24 2020-09-01 广东外语外贸大学 Multi-language news text clustering method, storage medium and terminal device
CN110046332A (en) * 2019-04-04 2019-07-23 珠海远光移动互联科技有限公司 A kind of Similar Text data set generation method and device
CN110046332B (en) * 2019-04-04 2024-01-23 远光软件股份有限公司 Similar text data set generation method and device
CN113570380A (en) * 2020-04-28 2021-10-29 中国移动通信集团浙江有限公司 Service complaint processing method, device and equipment based on semantic analysis and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN102682000A (en) Text clustering method, question-answering system applying same and search engine applying same
CN103399901B (en) A kind of keyword abstraction method
WO2018189589A2 (en) Systems and methods for document processing using machine learning
CN102651003B (en) Cross-language searching method and device
CN105844424A (en) Product quality problem discovery and risk assessment method based on network comments
Mori et al. A machine learning approach to recipe text processing
US20190171713A1 (en) Semantic parsing method and apparatus
CN104008126A (en) Method and device for segmentation on basis of webpage content classification
CN102214189B (en) Data mining-based word usage knowledge acquisition system and method
CN104391885A (en) Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training
CN102043808A (en) Method and equipment for extracting bilingual terms using webpage structure
CN103544266A (en) Method and device for generating search suggestion words
CN102253930A (en) Method and device for translating text
CN102789464A (en) Natural language processing method, device and system based on semanteme recognition
CN104281565A (en) Semantic dictionary constructing method and device
CN111563382A (en) Text information acquisition method and device, storage medium and computer equipment
CN104391969A (en) User query statement syntactic structure determining method and device
Duc et al. Cross-language latent relational search: Mapping knowledge across languages
CN110209781A (en) A kind of text handling method, device and relevant device
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
CN101763403A (en) Query translation method facing multi-lingual information retrieval system
KR102083017B1 (en) Method and system for analyzing social review of place
Perez-Tellez et al. On the difficulty of clustering microblog texts for online reputation management
CN112487263A (en) Information processing method, system, equipment and computer readable storage medium
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120919