CN102682000A - Text clustering method, question-answering system applying same and search engine applying same - Google Patents
- Publication number
- CN102682000A
- Authority
- CN
- China
- Prior art keywords
- text
- probability
- characteristic
- language
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a text clustering method, and a question-answering system and a search engine that apply it. The method comprises the following steps: 1) clustering the texts of each language separately; 2) extracting feature word vectors from the clustered texts of each language; and 3) calculating the similarity between the feature word vectors of texts in different languages, and clustering all the texts accordingly. With the method, the question-answering system and the search engine, texts in multiple languages can be clustered accurately.
Description
Technical field
The present invention relates to the field of pattern recognition and, more specifically, to a pattern recognition method for natural language.
Background technology
With the spread of information networks, the emergence of massive amounts of electronic text urgently calls for automatic text classification by machine. Automatic text classification saves considerable manpower and material resources and avoids the drawbacks of manual classification, such as long cycles, high cost and low efficiency. It means classifying large numbers of texts automatically according to their content, thereby helping people process and organize text data effectively.
This demand is even stronger for search engines. People increasingly rely on search to obtain knowledge and information. Faced with hundreds of millions of web pages and information resources, the greatest problem a search engine must solve is how to provide users with the information they need, quickly and accurately, from such a huge information base.
Take, for example, the question-answering systems associated with search engines. In such a system, a complete question page contains a question posed by one user and one or more answers to that question provided by other users. When a new user queries a question, the system needs to search the existing question pages using the query keywords in the new user's question and return the matching pages to the new user.
Because of the diversity of spoken and written language, a question-answering system cannot restrict the wording of the questions that new users enter, so a question that is identical in substance can be expressed in many ways. The query keywords extracted from differently worded questions may differ, and so may the question pages that can be found. Since these questions are the same in essence, returning the question pages of all related questions to the user, rather than only some of them, improves the user experience. Solving this problem requires clustering the questions.
Current question-answering systems perform text analysis on the question pages, build an index from "query keyword" to "question page", and return the relevant question pages for a user query. That is, a "question page" can be returned only if it contains the "query keyword". From the user's point of view, questions such as "the way to make fried rice with eggs", "how is fried rice with eggs cooked" and "how to cook fried rice with eggs" are equivalent in content; when the user searches, all the "question pages" corresponding to these questions that meet the user's need should be returned. In short, people want questions on the same topic to be grouped into the same class. When the user's query keyword is related to one question in a class, all pages of that class are put into the candidate set and shown to the user. As shown in Fig. 1, before clustering, search keyword 1 returns question pages 1 and 3, and search keyword 2 returns question pages 2 and 4. During clustering, pages 1 and 2 are grouped into one class and pages 3 and 4 into another; after clustering, both search keyword 1 and search keyword 2 return question pages 1, 2, 3 and 4.
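The effect illustrated in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the system's implementation: the function name `search`, the keyword labels `kw1`/`kw2` and the cluster labels `A`/`B` are assumptions made for the example.

```python
def search(keyword, index, page_to_cluster=None):
    """Return the pages for a keyword, widened to whole clusters if a mapping is given."""
    hits = set(index.get(keyword, ()))
    if page_to_cluster is None:
        return sorted(hits)  # plain keyword index: only pages containing the keyword
    wanted = {page_to_cluster[page] for page in hits}
    return sorted(page for page, cluster in page_to_cluster.items() if cluster in wanted)


# Hypothetical data mirroring Fig. 1.
index = {"kw1": [1, 3], "kw2": [2, 4]}
page_to_cluster = {1: "A", 2: "A", 3: "B", 4: "B"}

print(search("kw1", index))                   # lookup before clustering
print(search("kw1", index, page_to_cluster))  # lookup after clustering
```

Keeping the cluster mapping separate from the keyword index means the index itself does not have to be rebuilt when the clusters change.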
However, prior-art text clustering considers only monolingual text and ignores the language barrier. In fact, with growing international exchange and cooperation, multiple languages appear on more and more occasions. Continuing the question-answering example above, multilingual questions can occur, such as "how to make the fired rice and eggs" in English and the same question asked in Japanese. These two questions are equivalent in content to "the way to make fried rice with eggs" and the other two questions above. These five questions should therefore be grouped accurately into one class to give the user a better search experience.
In summary, there is an urgent need for a method that can accurately cluster texts in various languages, and for a question-answering system that can unify multilingual questions so that answerers can easily identify and answer them.
Summary of the invention
The object of the present invention is to provide a method that can accurately cluster texts in various languages.
According to one aspect of the present invention, a text clustering method is provided, comprising the following steps:
1) clustering the texts of each language separately;
2) extracting feature word vectors from the clustered texts of each language;
3) calculating the similarity between the feature word vectors of texts in different languages, and clustering all the texts.
In the above method, step 3) further comprises:
31) calculating the translation probability between feature words of the texts in two languages, where the two feature words are taken from a text in one language and a text in another language, respectively (formula images not reproduced here);
32) calculating the similarity according to a formula (image BDA0000049500890000031.GIF, not reproduced here), where <w_1, w_2, ..., w_n> and a second vector (image BDA0000049500890000032.GIF) are the feature word vectors of a text in one language and a text in another language, respectively, and n and m are the numbers of feature words in the two vectors;
33) grouping texts whose similarity is greater than a threshold into one class.
In the above method, step 31) further comprises:
311) calculating the lexical translation probabilities in both directions according to formulas (images not reproduced here), where one is the probability that the other-language feature word is translated into w_i and the other is the probability that w_i is translated into the other-language feature word;
312) calculating the translation probability of the feature words according to a formula (image not reproduced here).
In the above method, step 31) further comprises:
313) calculating, according to a formula (image not reproduced here), the lexical translation probability that the other-language feature word is translated into w_i;
314) calculating the translation probability of the feature words according to a formula (image not reproduced here).
In the above method, step 31) further comprises:
315) calculating, according to a formula (image not reproduced here), the lexical translation probability that w_i is translated into the other-language feature word;
316) calculating the translation probability of the feature words according to a formula (image not reproduced here).
According to another aspect of the present invention, a text clustering system is also provided, comprising:
a monolingual text clustering module, configured to cluster the texts of each language separately;
an extraction module, configured to extract feature word vectors from the texts of each language clustered by the monolingual text clustering module;
an analysis module, configured to calculate the similarity between the feature word vectors, extracted by the extraction module, of texts in different languages, and to cluster all the texts.
In one specific implementation of the system of the present invention, the analysis module further comprises:
a translation probability calculation module, configured to calculate the translation probability between feature words of the two-language texts extracted by the extraction module, where w_i and its other-language counterpart are feature words of a text in one language and a text in another language, respectively (formula images not reproduced here);
a similarity calculation module, configured to calculate the similarity from the translation probability computed by the translation probability calculation module, using a formula (image not reproduced here), where <w_1, w_2, ..., w_n> and a second vector are the feature word vectors of a text in one language and a text in another language, respectively, and n and m are the numbers of feature words in the two vectors;
a multilingual text clustering module, configured to group texts whose similarity, as calculated by the similarity calculation module, is greater than a threshold into one class.
In one specific implementation of the system of the present invention, the translation probability calculation module further comprises:
a first probability calculation module, configured to calculate, according to a formula (image not reproduced here), the lexical translation probability that the other-language feature word is translated into w_i;
a second probability calculation module, configured to calculate, according to a formula (image not reproduced here), the lexical translation probability that w_i is translated into the other-language feature word;
a first probability determination module, configured to calculate the translation probability of the feature words, using a formula (image not reproduced here), from the probabilities computed by the first probability calculation module and the second probability calculation module.
In another specific implementation of the system of the present invention, the translation probability calculation module further comprises:
a first probability calculation module, configured to calculate, according to a formula (image not reproduced here), the lexical translation probability that the other-language feature word is translated into w_i;
a second probability determination module, configured to calculate the translation probability of the feature words, using a formula (image not reproduced here), from the probability computed by the first probability calculation module.
In another specific implementation of the system of the present invention, the translation probability calculation module further comprises:
a second probability calculation module, configured to calculate, according to a formula (image not reproduced here), the lexical translation probability that w_i is translated into the other-language feature word;
a third probability determination module, configured to calculate the translation probability of the feature words, using a formula (image not reproduced here), from the probability computed by the second probability calculation module.
In one specific implementation of the system of the present invention, the extraction module further comprises:
a feature clustering module, configured to cluster the features according to the similarity between the features of the texts of each language clustered by the monolingual text clustering module, select one feature in each cluster to represent the whole cluster, and remove the other features of the cluster from the candidate feature set;
a feature selection module, configured to perform feature selection, using the information gain method, on the features remaining after the feature clustering module has completed the removal, to obtain the feature word vector.
According to a further aspect of the present invention, a question-answering system is also provided, which comprises the above text clustering system so as to classify the queried questions and then provide answers according to the classification.
According to yet another aspect of the present invention, a search engine is also provided, comprising the above question-answering system.
The text clustering method provided by the present invention can accurately cluster multilingual texts. Applying the method in a question-answering system, especially the question-answering system of a search engine, can improve the accuracy of the answers and effectively save network resources.
Description of drawings
Fig. 1 is a schematic diagram of question clustering in a question-answering system;
Fig. 2 is a flowchart of the text clustering method according to a specific embodiment of the present invention;
Fig. 3 is a block diagram of the text clustering system according to a preferred embodiment of the present invention.
Embodiment
To make the objects, technical solutions and advantages of the present invention clearer, the text clustering method according to an embodiment of the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The text clustering method according to a specific embodiment of the present invention is described in detail below with reference to the flowchart of Fig. 2, taking texts in two languages, Chinese and English, as an example.
First, the Chinese texts and the English texts are clustered separately. This clustering may use a content-dependent clustering method or a content-independent clustering method.
A content-dependent clustering method uses a similarity function to describe how similar texts are according to intrinsic features of the elements, for example whether two texts both have a certain feature.
Preferably, a content-independent clustering method is used, which can assess the similarity between texts without considering their content and then cluster them. Further, a graph-based iterative clustering method is used, which can uncover implicit relations among keywords: some keywords appear unrelated at the start of the computation but may gradually become similar as it proceeds. It can therefore cluster Chinese or English texts efficiently.
In particular, for clustering the search questions handled by a search engine, clustering can be performed on the question log: if users click the same results for two questions, the two questions are considered similar. Specifically, an agglomerative hierarchical clustering method can be used. The method mainly comprises the following two steps:
1) First, a bipartite graph is constructed from the search engine query log that has been read in. The nodes q on one side represent keywords, and the nodes d on the other side represent links. If a keyword and a link are found in the same log record, that is, the user clicked that link on the search result page when querying the keyword, a bidirectional edge is created between the nodes representing the keyword and the link.
2) After the bidirectional edges have been constructed, the two most similar q nodes are merged, then the two most similar d nodes are merged, and this is iterated until the termination condition is met.
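The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method as specified: it merges only the keyword (q) side, uses Jaccard similarity of clicked-link sets as the similarity measure, and stops when the best similarity falls below a hypothetical `min_sim`; the text does not specify the similarity measure or the termination condition.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two link sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_bipartite(log):
    """Map each keyword node q to the set of link nodes d it shares an edge with."""
    graph = {}
    for keyword, link in log:
        graph.setdefault(keyword, set()).add(link)
    return graph


def cluster_queries(log, min_sim=0.5):
    """Merge the two most similar keyword clusters until none reaches min_sim."""
    graph = build_bipartite(log)
    clusters = {frozenset([q]): set(links) for q, links in graph.items()}
    while len(clusters) > 1:
        (a, b), sim = max(
            ((pair, jaccard(clusters[pair[0]], clusters[pair[1]]))
             for pair in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        if sim < min_sim:  # assumed termination condition
            break
        clusters[a | b] = clusters.pop(a) | clusters.pop(b)
    return sorted(sorted(c) for c in clusters)


# Hypothetical click log of (keyword, clicked link) records.
log = [("q1", "d1"), ("q1", "d2"), ("q2", "d1"), ("q2", "d2"), ("q3", "d9")]
print(cluster_queries(log))
```

Alternating merges on the link (d) side, as step 2) describes, would follow the same pattern with the roles of keywords and links swapped.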
Then, the texts of the two languages are clustered across languages.
Feature word vectors, <w_1, w_2, ..., w_n> and a second vector (image not reproduced here), are extracted from the clustered Chinese texts and the clustered English texts, respectively, where n and m are the dimensions of the feature word vector of the Chinese text and of the English text, respectively.
The feature word vectors can be extracted with a clustering-based method, which comprises: first, clustering the features according to the similarity between features, selecting one feature in each resulting cluster to represent the whole cluster, and removing the other features of the cluster from the candidate feature set, thereby reducing redundancy in the feature set; then, performing feature selection on the remaining features with the information gain method to obtain the feature word vector. See Zhang Wenliang et al., "A clustering-based text feature selection method", Computer Applications, January 2007.
Of course, the feature word vectors can also be extracted by other methods, for example methods based on mutual information or the χ² statistic.
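As an illustration of the information gain step only (the preceding feature-clustering step is omitted), the following sketch scores each word by how much knowing its presence reduces the entropy of the class labels. The function names, the toy documents and the labels are assumptions made for the example and do not come from the cited paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, word):
    """IG(word) = H(class) - H(class | word present or absent)."""
    present = [lab for doc, lab in zip(docs, labels) if word in doc]
    absent = [lab for doc, lab in zip(docs, labels) if word not in doc]
    conditional = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Keep the k words with the highest information gain."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab, key=lambda w: information_gain(docs, labels, w), reverse=True)[:k]


# Hypothetical documents (as word sets) with the cluster each belongs to.
docs = [{"egg", "rice"}, {"egg", "fry"}, {"bank", "loan"}, {"bank", "river"}]
labels = ["food", "food", "money", "money"]
print(select_features(docs, labels, 2))
```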
The translation probability between Chinese and English feature words is then calculated (formula image not reproduced here), where w_i ∈ {w_1, w_2, ..., w_n} is a feature word of the Chinese text, its counterpart is a feature word of the English text, and 1≤i≤n, 1≤j≤m. To guarantee that the similarity is symmetric, this translation probability is defined by a formula (image not reproduced here) in terms of two directional probabilities: the probability that the English feature word is translated into w_i, and the probability that w_i is translated into the English feature word. For example, the English word "bank" has two Chinese meanings: one is "bank" (the financial institution) and the other is "riverside". The two conditional probabilities P(bank | bank) and P(riverside | bank) denote the probability that "bank" is translated into the Chinese word for bank and into the Chinese word for riverside, respectively. These conditional probabilities can be estimated by segmenting the bilingual text in a bilingual corpus into words, performing word alignment, and then counting from the alignment results how many occurrences of "bank" were translated into the word for bank and how many into the word for riverside.
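The counting procedure described above amounts to a relative-frequency estimate over aligned word pairs. A minimal sketch, assuming word segmentation and alignment have already been done; the tokens `ZH_BANK` and `ZH_RIVERSIDE` stand in for the two Chinese words, and the counts are invented for illustration.

```python
from collections import Counter, defaultdict


def translation_probabilities(aligned_pairs):
    """Estimate P(target | source) by relative frequency over aligned word pairs."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    probs = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        probs[src][tgt] = count / source_counts[src]
    return dict(probs)


# Hypothetical alignment result: "bank" aligned three times to the word for
# bank (financial institution) and once to the word for riverside.
pairs = [("bank", "ZH_BANK")] * 3 + [("bank", "ZH_RIVERSIDE")]
print(translation_probabilities(pairs))
```

Running the same function on the pairs with source and target swapped gives the probabilities in the opposite direction.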
The similarity of the Chinese and English feature word vectors is then calculated from the translation probability (formula image not reproduced here).
Texts whose similarity is higher than a threshold are grouped into one class. Preferably, the threshold is obtained by training on test texts following the steps above. The threshold can also be set empirically; preferably, it is related to the dimension of the feature word vectors of the texts, with a value range given by a formula not reproduced here. For example, T = n = m when n equals m, and T = min(n, m) when n and m are unequal.
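Because the similarity formula is not available in this text, the following sketch only illustrates the general shape of the computation under two loudly stated assumptions: the symmetric translation probability is taken as the average of the two directional probabilities, and the similarity is the sum of that value over all pairs of feature words. Neither choice is confirmed by the source, and all names and numbers are hypothetical.

```python
def symmetric_prob(w, v, p_fwd, p_bwd):
    """Average the two directional probabilities (assumed combination rule)."""
    return 0.5 * (p_fwd.get(w, {}).get(v, 0.0) + p_bwd.get(v, {}).get(w, 0.0))


def similarity(vec_a, vec_b, p_fwd, p_bwd):
    """Sum the symmetric probability over every pair of feature words (assumed score)."""
    return sum(symmetric_prob(w, v, p_fwd, p_bwd) for w in vec_a for v in vec_b)


def same_class(vec_a, vec_b, p_fwd, p_bwd, threshold):
    """Group two texts into one class when their similarity exceeds the threshold."""
    return similarity(vec_a, vec_b, p_fwd, p_bwd) > threshold


# Hypothetical feature words and translation tables for two languages.
vec_a, vec_b = ["egg", "rice"], ["dan", "fan"]
p_fwd = {"egg": {"dan": 0.8}, "rice": {"fan": 0.6}}   # P(v | w)
p_bwd = {"dan": {"egg": 0.6}, "fan": {"rice": 1.0}}   # P(w | v)
print(similarity(vec_a, vec_b, p_fwd, p_bwd))
```

Whatever the actual formula is, using both directional probabilities makes the score independent of which language is treated as the source, which is the symmetry property the text asks for.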
Based on the above text clustering method, the present invention also provides a text clustering system, comprising:
a monolingual text clustering module 100, configured to cluster the texts of each language separately;
an extraction module 200, configured to extract feature word vectors from the texts of each language clustered by the monolingual text clustering module;
an analysis module 300, configured to calculate the similarity between the feature word vectors, extracted by the extraction module, of texts in different languages, and to cluster all the texts.
According to a specific embodiment of the present invention, the extraction module 200 may comprise a feature clustering module 210 and a feature selection module 220.
The feature clustering module 210 is configured to cluster the features according to the similarity between the features of the texts of each language clustered by the monolingual text clustering module, select one feature in each cluster to represent the whole cluster, and remove the other features of the cluster from the candidate feature set.
The feature selection module 220 is configured to perform feature selection, using the information gain method, on the features remaining after the feature clustering module has completed the removal, to obtain the feature word vector.
The analysis module 300 may comprise a translation probability calculation module 310, a similarity calculation module 320 and a multilingual text clustering module 330.
The translation probability calculation module 310 is configured to calculate the translation probability between feature words in the feature word vectors of the two-language texts extracted by the extraction module, where w_i and its other-language counterpart are feature words of a text in one language and a text in another language, respectively (formula images not reproduced here). According to different embodiments of the present invention, it may comprise the first probability calculation module 311 and the first probability determination module, or the second probability calculation module 312 and the second probability determination module. Fig. 3 shows a block diagram of the text clustering system according to a preferred embodiment of the present invention. According to the preferred embodiment, the translation probability calculation module comprises the first probability calculation module 311, the second probability calculation module 312 and the third probability determination module 313.
The first probability calculation module 311 is configured to calculate, according to a formula (image not reproduced here), the lexical translation probability that the other-language feature word is translated into w_i. The second probability calculation module 312 is configured to calculate, according to a formula (image not reproduced here), the lexical translation probability that w_i is translated into the other-language feature word.
The first probability determination module is configured to calculate the translation probability of the feature words, using a formula (image not reproduced here), from the probability computed by the first probability calculation module.
The second probability determination module is configured to calculate the translation probability of the feature words, using a formula (image not reproduced here), from the probability computed by the second probability calculation module.
The third probability determination module 313 is configured to calculate the translation probability of the feature words, using a formula (image not reproduced here), from the probabilities computed by the first probability calculation module and the second probability calculation module.
The similarity calculation module 320 is configured to calculate the similarity from the translation probability computed by the translation probability calculation module, using a formula (image not reproduced here), where <w_1, w_2, ..., w_n> and a second vector are the feature word vectors of a text in one language and a text in another language, respectively, and n and m are the numbers of feature words in the two vectors.
The multilingual text clustering module 330 is configured to group texts whose similarity, as calculated by the similarity calculation module, is greater than a threshold into one class.
The above text clustering method and text clustering system can be applied in a question-answering system, in particular the question-answering system of a search engine. For example, the question-answering system receives the question "how to make the fired rice and eggs" through a question page. The system extracts the feature word vector of this question and then calculates the similarity between this vector and the feature word vector of each class already clustered in the system, thereby assigning the question to a suitable class. If no class qualifies, a new class is created for the question. In this way the queried questions are classified, and more accurate answers are then provided on the basis of the classification.
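The assignment logic in this example can be sketched as follows. It is a minimal illustration with assumed names: `assign_question` takes any scoring function, and the `overlap` score used in the demonstration is a stand-in for the translation-probability similarity, chosen only to keep the example self-contained.

```python
def assign_question(features, categories, score, threshold):
    """Return the index of the class the question joins, creating a new class if none qualifies."""
    best_index, best_score = None, threshold
    for index, category_features in enumerate(categories):
        current = score(features, category_features)
        if current > best_score:
            best_index, best_score = index, current
    if best_index is None:
        categories.append(list(features))  # no qualified class: start a new one
        return len(categories) - 1
    return best_index


def overlap(a, b):
    """Stand-in similarity: number of shared feature words (an assumption for the demo)."""
    return len(set(a) & set(b))


# Hypothetical classes, each represented by its feature word vector.
categories = [["egg", "rice", "fry"], ["bank", "loan"]]
print(assign_question(["egg", "rice"], categories, overlap, 1))
print(assign_question(["weather"], categories, overlap, 1))
```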
Those of ordinary skill in the art will appreciate that, although the description above takes Chinese and English texts as an example, the present invention is not limited to them and can be widely applied to texts in various languages. Moreover, the application of the text clustering method of the present invention is not limited to the question-answering system described in the specific embodiments; it can also be applied to other settings that involve multilingual text.
It should be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. Therefore, the scope of the claimed technical solution is not limited by any specific exemplary teaching given.
Claims (18)
1. A text clustering method, comprising the following steps:
1) clustering the texts of each language separately;
2) extracting feature word vectors from the clustered texts of each language;
3) calculating the similarity between the feature word vectors of texts in different languages, and clustering all the texts.
2. The method according to claim 1, characterized in that step 3) further comprises:
31) calculating the translation probability between feature words of the texts in two languages, where w_i and its other-language counterpart are feature words of a text in one language and a text in another language, respectively (formula images not reproduced here);
32) calculating the similarity according to a formula (image FDA0000049500880000013.GIF, not reproduced here), where <w_1, w_2, ..., w_n> and a second vector (image FDA0000049500880000014.GIF) are the feature word vectors of a text in one language and a text in another language, respectively, and n and m are the numbers of feature words in the two vectors;
33) grouping texts whose similarity is greater than a threshold into one class.
3. The method according to claim 2, characterized in that step 31) further comprises:
311) calculating, according to a formula (image not reproduced here), the lexical translation probability that the other-language feature word is translated into w_i;
4. The method according to claim 2, characterized in that step 31) further comprises:
313) calculating, according to a formula (image not reproduced here), the lexical translation probability that w_i is translated into the other-language feature word;
5. The method according to claim 2, characterized in that step 31) further comprises:
311) calculating, according to a formula (image not reproduced here), the lexical translation probability that the other-language feature word is translated into w_i;
313) calculating, according to a formula (image not reproduced here), the lexical translation probability that w_i is translated into the other-language feature word;
6. The method according to any one of claims 1 to 5, characterized in that the clustering in step 1) uses an agglomerative hierarchical clustering method.
7. The method according to any one of claims 1 to 5, characterized in that the extraction of the feature word vectors in step 2) is a clustering-based method.
8. The method according to claim 7, characterized in that step 2) further comprises:
21) first clustering the features according to the similarity between the features in the clustered texts of each language, selecting one feature in each cluster to represent the whole cluster, and removing the other features of the cluster from the candidate feature set;
22) performing feature selection on the remaining features using the information gain method to obtain the feature word vector.
9. The method according to claim 2, characterized in that the threshold is obtained by training on test texts.
11. A text clustering system, comprising:
a monolingual text clustering module, configured to cluster the texts of each language separately;
an extraction module, configured to extract feature word vectors from the texts of each language clustered by the monolingual text clustering module;
an analysis module, configured to calculate the similarity between the feature word vectors, extracted by the extraction module, of texts in different languages, and to cluster all the texts.
12. The system according to claim 11, characterized in that the analysis module further comprises:
a translation probability calculation module, configured to calculate the translation probability between feature words of the two-language texts extracted by the extraction module, where w_i and its other-language counterpart are feature words of a text in one language and a text in another language, respectively (formula images not reproduced here);
a similarity calculation module, configured to calculate the similarity from the translation probability computed by the translation probability calculation module, using a formula (image not reproduced here), where <w_1, w_2, ..., w_n> and a second vector are the feature word vectors of a text in one language and a text in another language, respectively, and n and m are the numbers of feature words in the two vectors;
a multilingual text clustering module, configured to group texts whose similarity, as calculated by the similarity calculation module, is greater than a threshold into one class.
13. The system according to claim 12, characterized in that the translation probability calculation module further comprises:
a first probability calculation module, configured to calculate, according to a formula (image not reproduced here), the lexical translation probability that the other-language feature word is translated into w_i;
a second probability calculation module, configured to calculate, according to a formula (image not reproduced here), the lexical translation probability that w_i is translated into the other-language feature word;
a first probability determination module, configured to calculate the translation probability of the feature words, using a formula (image not reproduced here), from the probabilities computed by the first probability calculation module and the second probability calculation module.
14. The system according to claim 12, characterized in that the translation probability calculation module further comprises:
a first probability calculation module, configured to calculate, according to a formula (image not reproduced here), the lexical translation probability that the other-language feature word is translated into w_i;
15. The system according to claim 12, characterized in that the translation probability calculation module further comprises:
a second probability calculation module, configured to calculate, according to a formula (image not reproduced here), the lexical translation probability that w_i is translated into the other-language feature word;
16. system according to claim 11 is characterized in that, said extraction module further comprises:
The feature clustering module; Be used for characteristic being carried out cluster according to the similarity between the characteristic of the text of each language after single language text cluster module cluster; In each bunch, select whole bunch of a characteristic representative, with bunch in other characteristics concentrate from candidate feature and reject;
Characteristic selecting module is used for that said feature clustering module is accomplished the remaining characteristic in rejecting back and uses the information gain method to carry out feature selecting, obtains the characteristic term vector.
17. A question answering system, comprising the text clustering system according to claim 11.
18. A search engine, comprising the question answering system according to claim 17.
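The information gain selection named in claim 16 can be illustrated with a minimal sketch. It assumes that each text is represented as a set of feature words and that the labels are the cluster assignments produced by the single-language clustering step. The function names `entropy`, `information_gain` and `select_features`, and the cut-off `k`, are hypothetical and do not appear in the patent. The sketch uses the standard textbook definition of information gain, which may differ in detail from the formula the applicant intended.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(docs, labels, term):
    """IG(term) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t).

    docs   : list of sets of feature words (assumed representation)
    labels : cluster label of each text, same length as docs
    """
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_features(docs, labels, candidates, k):
    """Keep the k candidate features with the highest information gain."""
    ranked = sorted(candidates,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy data: two clusters, four short texts (hypothetical example).
    docs = [{"bank", "loan"}, {"bank", "rate"}, {"goal", "match"}, {"goal", "team"}]
    labels = [0, 0, 1, 1]
    print(select_features(docs, labels, ["loan", "bank", "goal", "team"], 2))
```

In this sketch the candidates passed to `select_features` would be the cluster representatives left after the feature clustering module has removed the other members of each feature cluster, so that near-synonymous features do not compete for the same slots.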
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011100563794A CN102682000A (en) | 2011-03-09 | 2011-03-09 | Text clustering method, question-answering system applying same and search engine applying same |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011100563794A CN102682000A (en) | 2011-03-09 | 2011-03-09 | Text clustering method, question-answering system applying same and search engine applying same |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102682000A (en) | 2012-09-19 |
Family
ID=46813948
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011100563794A Pending CN102682000A (en) | 2011-03-09 | 2011-03-09 | Text clustering method, question-answering system applying same and search engine applying same |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102682000A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102955857A (en) * | 2012-11-09 | 2013-03-06 | 北京航空航天大学 | Class center compression transformation-based text clustering method in search engine |
CN103049433A (en) * | 2012-12-11 | 2013-04-17 | 微梦创科网络科技(中国)有限公司 | Automatic question answering method, automatic question answering system and method for constructing question answering case base |
CN104361127A (en) * | 2014-12-05 | 2015-02-18 | 广西师范大学 | Multilanguage question and answer interface fast constituting method based on domain ontology and template logics |
WO2015035628A1 (en) * | 2013-09-12 | 2015-03-19 | 广东电子工业研究院有限公司 | Method of clustering literature in multiple languages |
CN104573046A (en) * | 2015-01-20 | 2015-04-29 | 成都品果科技有限公司 | Comment analyzing method and system based on term vector |
CN104778256A (en) * | 2015-04-20 | 2015-07-15 | 江苏科技大学 | Rapid incremental clustering method for domain question-answering system consultations |
CN105912734A (en) * | 2016-06-22 | 2016-08-31 | 北京金山安全软件有限公司 | User feedback automatic reply method and device |
CN106095845A (en) * | 2016-06-02 | 2016-11-09 | 腾讯科技(深圳)有限公司 | File classification method and device |
CN106559695A (en) * | 2016-10-14 | 2017-04-05 | 北京金山安全软件有限公司 | Barrage message processing method and device and electronic equipment |
CN106815310A (en) * | 2016-12-20 | 2017-06-09 | 华南师范大学 | A kind of hierarchy clustering method and system to magnanimity document sets |
CN107145573A (en) * | 2017-05-05 | 2017-09-08 | 上海携程国际旅行社有限公司 | The problem of artificial intelligence customer service robot, answers method and system |
CN108170691A (en) * | 2016-12-07 | 2018-06-15 | 北京国双科技有限公司 | It is associated with the determining method and apparatus of document |
CN108416014A (en) * | 2018-03-05 | 2018-08-17 | 杭州朗和科技有限公司 | Data processing method, medium, system and electronic equipment |
CN109063184A (en) * | 2018-08-24 | 2018-12-21 | 广东外语外贸大学 | Multilingual newsletter archive clustering method, storage medium and terminal device |
CN110046332A (en) * | 2019-04-04 | 2019-07-23 | 珠海远光移动互联科技有限公司 | A kind of Similar Text data set generation method and device |
CN113570380A (en) * | 2020-04-28 | 2021-10-29 | 中国移动通信集团浙江有限公司 | Service complaint processing method, device and equipment based on semantic analysis and computer readable storage medium |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102955857B (en) * | 2012-11-09 | 2015-07-08 | 北京航空航天大学 | Class center compression transformation-based text clustering method in search engine |
CN102955857A (en) * | 2012-11-09 | 2013-03-06 | 北京航空航天大学 | Class center compression transformation-based text clustering method in search engine |
CN103049433B (en) * | 2012-12-11 | 2015-10-28 | 微梦创科网络科技(中国)有限公司 | The method of automatic question-answering method, automatically request-answering system and structure question and answer case library |
CN103049433A (en) * | 2012-12-11 | 2013-04-17 | 微梦创科网络科技(中国)有限公司 | Automatic question answering method, automatic question answering system and method for constructing question answering case base |
WO2015035628A1 (en) * | 2013-09-12 | 2015-03-19 | 广东电子工业研究院有限公司 | Method of clustering literature in multiple languages |
CN104361127A (en) * | 2014-12-05 | 2015-02-18 | 广西师范大学 | Multilanguage question and answer interface fast constituting method based on domain ontology and template logics |
CN104361127B (en) * | 2014-12-05 | 2017-09-26 | 广西师范大学 | The multilingual quick constructive method of question and answer interface based on domain body and template logic |
CN104573046A (en) * | 2015-01-20 | 2015-04-29 | 成都品果科技有限公司 | Comment analyzing method and system based on term vector |
CN104573046B (en) * | 2015-01-20 | 2018-07-31 | 成都品果科技有限公司 | A kind of comment and analysis method and system based on term vector |
CN104778256A (en) * | 2015-04-20 | 2015-07-15 | 江苏科技大学 | Rapid incremental clustering method for domain question-answering system consultations |
CN104778256B (en) * | 2015-04-20 | 2017-10-17 | 江苏科技大学 | A kind of the quick of field question answering system consulting can increment clustering method |
CN106095845A (en) * | 2016-06-02 | 2016-11-09 | 腾讯科技(深圳)有限公司 | File classification method and device |
CN105912734A (en) * | 2016-06-22 | 2016-08-31 | 北京金山安全软件有限公司 | User feedback automatic reply method and device |
CN106559695A (en) * | 2016-10-14 | 2017-04-05 | 北京金山安全软件有限公司 | Barrage message processing method and device and electronic equipment |
CN108170691A (en) * | 2016-12-07 | 2018-06-15 | 北京国双科技有限公司 | It is associated with the determining method and apparatus of document |
CN106815310A (en) * | 2016-12-20 | 2017-06-09 | 华南师范大学 | A kind of hierarchy clustering method and system to magnanimity document sets |
CN106815310B (en) * | 2016-12-20 | 2020-04-21 | 华南师范大学 | Hierarchical clustering method and system for massive document sets |
CN107145573A (en) * | 2017-05-05 | 2017-09-08 | 上海携程国际旅行社有限公司 | The problem of artificial intelligence customer service robot, answers method and system |
CN108416014A (en) * | 2018-03-05 | 2018-08-17 | 杭州朗和科技有限公司 | Data processing method, medium, system and electronic equipment |
CN109063184A (en) * | 2018-08-24 | 2018-12-21 | 广东外语外贸大学 | Multilingual newsletter archive clustering method, storage medium and terminal device |
CN109063184B (en) * | 2018-08-24 | 2020-09-01 | 广东外语外贸大学 | Multi-language news text clustering method, storage medium and terminal device |
CN110046332A (en) * | 2019-04-04 | 2019-07-23 | 珠海远光移动互联科技有限公司 | A kind of Similar Text data set generation method and device |
CN110046332B (en) * | 2019-04-04 | 2024-01-23 | 远光软件股份有限公司 | Similar text data set generation method and device |
CN113570380A (en) * | 2020-04-28 | 2021-10-29 | 中国移动通信集团浙江有限公司 | Service complaint processing method, device and equipment based on semantic analysis and computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102682000A (en) | Text clustering method, question-answering system applying same and search engine applying same | |
Gholamrezazadeh et al. | A comprehensive survey on text summarization systems | |
CN103399901B (en) | A kind of keyword abstraction method | |
WO2018189589A2 (en) | Systems and methods for document processing using machine learning | |
CN102651003B (en) | Cross-language searching method and device | |
CN105844424A (en) | Product quality problem discovery and risk assessment method based on network comments | |
Mori et al. | A machine learning approach to recipe text processing | |
US20190171713A1 (en) | Semantic parsing method and apparatus | |
CN104008126A (en) | Method and device for segmentation on basis of webpage content classification | |
CN110532390B (en) | News keyword extraction method based on NER and complex network characteristics | |
CN102214189B (en) | Data mining-based word usage knowledge acquisition system and method | |
CN104391885A (en) | Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training | |
CN102043808A (en) | Method and equipment for extracting bilingual terms using webpage structure | |
CN103544266A (en) | Method and device for generating search suggestion words | |
CN102789464A (en) | Natural language processing method, device and system based on semanteme recognition | |
CN104281565A (en) | Semantic dictionary constructing method and device | |
CN111563382A (en) | Text information acquisition method and device, storage medium and computer equipment | |
Jain et al. | Context sensitive text summarization using k means clustering algorithm | |
CN104391969A (en) | User query statement syntactic structure determining method and device | |
CN110209781A (en) | A kind of text handling method, device and relevant device | |
CN112015907A (en) | Method and device for quickly constructing discipline knowledge graph and storage medium | |
CN101763403A (en) | Query translation method facing multi-lingual information retrieval system | |
KR102083017B1 (en) | Method and system for analyzing social review of place | |
Perez-Tellez et al. | On the difficulty of clustering microblog texts for online reputation management | |
CN112487263A (en) | Information processing method, system, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20120919 |