CN102955857B - Class center compression transformation-based text clustering method in search engine - Google Patents

Class center compression transformation-based text clustering method in search engine

Info

Publication number
CN102955857B
CN102955857B (application CN201210447277.XA)
Authority
CN
China
Prior art keywords
text
class
center
similarity
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210447277.XA
Other languages
Chinese (zh)
Other versions
CN102955857A (en)
Inventor
欧阳元新 (Ouyang Yuanxin)
袁满 (Yuan Man)
谢舒翼 (Xie Shuyi)
刘文琦 (Liu Wenqi)
熊璋 (Xiong Zhang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai haotengzhisheng Technology Co., Ltd
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201210447277.XA priority Critical patent/CN102955857B/en
Publication of CN102955857A publication Critical patent/CN102955857A/en
Application granted
Publication of CN102955857B publication Critical patent/CN102955857B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text clustering method for search engines based on class-center compression transformation. The method comprises the following steps: using an improved tf-idf formula, compute the word weights of each document in the text set; compute the initial class centers; mine synonym sets and co-occurring high-frequency word sets; compute word centers and perform a first-pass classification according to the similarity between the initial class centers and each document; compress the center words according to information such as title words, article length, synonyms, and co-occurring associated words, so that the same word appears only in the few class centers most similar to it; re-cluster the documents with the new cluster centers; compute the core similarity of each class; split the largest class and merge smaller classes to produce new classes; iterate the compression, clustering, and splitting operations until the number of classes converges and the similarity between each text and its cluster center reaches a given threshold. The clustering accuracy is markedly higher than that of conventional methods such as KMeans and DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

Description

Text clustering method based on class-center compression transformation in a search engine
Technical field
The invention belongs to the technical field of text mining and machine learning research, and in particular relates to a text clustering method based on class-center compression transformation in a search engine. By combining multiple factors such as synonym sets, co-occurring associated word sets, word centers, class centers, title content, and document length, it applies repeated rounds of clustering and splitting iterations to improve the clustering precision on a text set. The method is applicable to search engines and information retrieval systems.
Background technology
In the real world, text is the most important carrier of information; research shows that about 80% of information is contained in text documents. On the Internet in particular, text data exists widely in various forms, such as news reports, e-books, research papers, digital libraries, web pages, and e-mail. Text clustering technology can be applied to information filtering and personalized information recommendation, enabling people to retrieve the required information accurately and shortening the time spent on retrieval. At the same time, text clustering is a method that can partition texts into classes without a training set, effectively solving the problem of automatic text partitioning. Because it does not require texts to be manually labeled with categories in advance, text clustering offers flexibility and a high degree of automation, and has become an important means of organizing, summarizing, and navigating text information.
Most existing text clustering methods compute the similarity between texts based on the VSM (vector space model), which assumes that words are mutually independent when constructing text vectors. This approach ignores the associations between words within the same document and the latent connections between words across different documents. Traditional clustering models are also constrained by conditions such as the input order of documents, the number of initial classes, and the choice of initial center points. Positional clustering of words and synonym mining are likewise ignored by conventional text clustering methods. The computation of document similarity is thereby affected, and the clustering results are not accurate enough. The method proposed in this patent therefore extracts keywords from the features of the data set, removes meaningless words, filters out words with low influence, and mines latent semantic relations such as document topics, synonym sets, and co-occurring high-frequency word sets to improve clustering precision. By compressing center words, using an improved tf-idf method to compute similarity weights between words, and iterating clustering and class splitting, it eliminates the influence of document input order. The goal is to make the similarity of texts within a class as large as possible and the similarity of texts across classes as small as possible.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the limitations of the prior art and provide a text clustering method based on class-center compression transformation. The method mines latent semantic relations such as document topics, synonym sets, and co-occurring high-frequency word sets, and applies transformations such as class-center compression, re-clustering around centers, and splitting of new classes to improve text clustering precision.
The technical scheme by which the present invention solves the above technical problem is a text clustering method based on class-center compression transformation in a search engine, comprising the following steps:
Step 1: perform word segmentation on each text in the text set to be clustered;
Step 2: remove stop words and filter out words with little influence;
Step 3: count the frequency tf with which each word occurs in each text;
Step 4: compute the inverse document frequency of each word, idf = log(fileNum / freOccur), where fileNum is the total number of texts and freOccur is the number of texts in which the word occurs;
Step 5: mine synonym sets;
Step 6: mine co-occurring high-frequency word pairs, i.e., word pairs that appear together in multiple different texts;
Step 7: generate the initial class centers from the synonym sets and high-frequency co-occurring word pairs; each class center consists of a series of high-frequency words; record the tf and idf of each high-frequency word and mark the class center to which it belongs;
Step 8: compute the content length of each text, extract the article title, and segment the title; if there is no title, set the title to empty; extract the head and tail words of each paragraph and mark them for the weighted computation below;
Step 9: compute the similarity between every pair of texts; when a word that is identical or synonymous appears in the title or content, increase its weight; the head and tail words of paragraphs are given different weights. The computation formulas are as follows:
$$\mathrm{pureFileSim}(i,j)=\frac{\mathrm{contentSimilarity}(i,j)+\mathrm{titleSimilarity}(i,j)}{\log\big(\mathrm{fileLength}_i\cdot\mathrm{fileLength}_j\big)}$$

$$\mathrm{contentSimilarity}(i,j)=\sum_{x,y}\Big[\big(\log\mathrm{fileKeywordTf}(i,x)+1\big)\cdot\mathrm{fileKeywordIdf}(i,x)\cdot\partial+\big(\log\mathrm{fileKeywordTf}(j,y)+1\big)\cdot\mathrm{fileKeywordIdf}(j,y)\cdot\partial\Big]$$

$$\mathrm{titleSimilarity}(i,j)=\sum_{x,y}\Big[\mathrm{fileTitleWordTf}(i,x)\cdot\mathrm{fileTitleWordIdf}(i,x)\cdot\partial+\mathrm{fileTitleWordTf}(j,y)\cdot\mathrm{fileTitleWordIdf}(j,y)\cdot\partial\Big]$$
In the formulas:
pureFileSim(i, j): the pure similarity between text i and text j;
contentSimilarity(i, j): the content similarity between text i and text j;
titleSimilarity(i, j): the title similarity between text i and text j;
fileKeywordTf(i, x): the tf of keyword x in text i;
fileKeywordIdf(i, x): the idf of keyword x in text i;
fileTitleWordTf(j, y): the tf of title word y in text j;
fileTitleWordIdf(j, y): the idf of title word y in text j;
fileLength_i: the content length of text i;
∂: the weighting coefficient;
Step 10: randomize the input order of the texts, and perform initial clustering of the text set against the initial cluster centers. The algorithm is as follows: for each text, compute its similarity to all cluster centers and assign the text to the cluster center id with the largest similarity. The similarity between text i and class center j is computed as follows:
$$\mathrm{fileSim}(i,j)=\Big(\sum_{\mathrm{fileKeyword}(i,x)\in\mathrm{center}_j}\big(\log\mathrm{fileKeywordTf}(x,i)+1\big)\cdot\mathrm{fileKeywordIdf}(x,i)+\sum_{\mathrm{fileTitleWord}(i,y)\in\mathrm{center}_j}\big(\log\mathrm{centerKeywordTf}(j,y)+1\big)\cdot\mathrm{centerKeywordIdf}(j,y)\Big)\Big/\mathrm{fileContentLength}_i$$
In the formula:
fileKeywordTf(x, i): the tf of keyword x in text i;
fileKeywordIdf(x, i): the idf of keyword x in text i;
centerKeywordTf(j, y): the tf of keyword y in class center j;
centerKeywordIdf(j, y): the idf of keyword y in class center j;
fileContentLength_i: the content length of text i;
At the same time, compute the class center closest to each word and record the word's wordid;
Compute the percentage by which the most similar class center exceeds the second most similar class center, and record it as the text's diffRatio;
Step 11: discard texts whose diffRatio is less than 10%; on the remaining texts, extract and count keywords from the set of texts belonging to the same class, and use these words to regenerate that class's center; the selected words must have tf and idf no less than given thresholds; update the center id of each word and compress the class centers, so that the same word appears only in the few class centers most similar to it, and merge class centers with high mutual similarity;
Step 12: recompute the cluster center to which each text belongs according to the new cluster centers; the similarity computation is the same as in step 9;
Step 13: compute the core similarity of each class and attempt to split the largest class to produce a new class. The splitting method is as follows: find the most active text fx in the class, i.e., the text that most often appears as another text's most similar text, with the largest similarity values; find the text fy in the class with the smallest similarity to fx; build a new class center ctx from fx and the set of texts most similar to fx, and a new class center cty from fy and the set of texts most similar to fy; for each remaining text in the class, compute its similarity to ctx and cty and assign it to one of the two;
Step 14: building on step 11, for texts with low similarity to their class center, assign each text to the class whose id matches the center id of the majority of its words;
Step 15: repeat steps 10-14 until the number of classes converges and the similarity between each text in a class and the class center reaches a given threshold, then stop.
The principle of the present invention is as follows:
In this text clustering method based on class-center compression transformation in a search engine, the traditional tf-idf (term frequency-inverse document frequency) formula is improved to compute the word weights of each document in the text set. Initial class centers are produced from a large data set; synonym sets and co-occurring high-frequency word sets are mined; word centers, the pure similarity between documents, and the association similarity are computed; and a first-pass classification is performed according to the similarity between the initial class centers and each document. Based on information such as title words, article length, synonyms, co-occurring associated words, word centers, and the percentage by which the most similar class center exceeds the second most similar one, the center words are compressed so that the same word appears only in the few class centers most similar to it, and the document set is re-clustered with the new cluster centers. The core similarity of each class is computed, and the largest class is split to produce new classes. The compression, clustering, and splitting operations are iterated until the number of classes converges and the similarity between each text in a class and its class center reaches a given threshold.
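To make the shape of this iteration concrete, the following minimal Python sketch expresses the compress / re-cluster / split / merge loop, assuming the individual steps are supplied as functions; all names here (compress, assign, regenerate, split_largest, merge_similar, converged) are illustrative stand-ins rather than the patented implementation.

```python
from typing import Callable, Dict, List, Tuple

def iterate_clustering(
    texts: List[str],
    centers: List[dict],
    compress: Callable[[List[dict]], List[dict]],
    assign: Callable[[List[str], List[dict]], Dict[int, int]],
    regenerate: Callable[[List[str], Dict[int, int]], List[dict]],
    split_largest: Callable[[List[str], List[dict], Dict[int, int]], List[dict]],
    merge_similar: Callable[[List[dict]], List[dict]],
    converged: Callable[[List[dict], Dict[int, int]], bool],
    max_iters: int = 100,
) -> Tuple[Dict[int, int], List[dict]]:
    """Iterate compression, clustering, splitting and merging (cf. steps 10-15)."""
    assignment: Dict[int, int] = {}
    prev_k = -1
    for _ in range(max_iters):
        centers = compress(centers)                          # compress class centers
        assignment = assign(texts, centers)                  # nearest-center assignment
        centers = regenerate(texts, assignment)              # rebuild centers from member texts
        centers = split_largest(texts, centers, assignment)  # try to split the largest class
        centers = merge_similar(centers)                     # merge highly similar centers
        # stop once the number of classes converges and every text is
        # close enough to its class center
        if len(centers) == prev_k and converged(centers, assignment):
            break
        prev_k = len(centers)
    return assignment, centers
```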
Compared with the prior art, the advantages of the present invention are as follows:
In the research field of text clustering, KMeans (given the number k of partitions to build, the partitioning method first creates an initial partition, then adopts an iterative relocation technique that attempts to improve the partition by moving objects between partitions) and DBSCAN (which regards clusters as high-density regions of objects in data space separated by low-density regions: for each text object in a cluster, the number of text objects contained within its neighborhood of a given radius, denoted ε, must be no less than a given minimum number, denoted MinPts, and clustering continues as long as the density, i.e., the number of objects, of a nearby region exceeds the threshold) are the commonly used methods. But both have shortcomings: the KMeans method is strongly affected by the initial cluster centers, requires the cluster number k to be specified in advance, and is easily influenced by isolated points and by the input order of files. The DBSCAN method, although it can find clusters of arbitrary shape, is sensitive to the parameters ε and MinPts and easily splits texts of the same type into multiple different classes. The clustering results produced by these traditional methods are not ideal.
The present invention overcomes the shortcomings exhibited by the classic methods: it does not require the number of clusters to be specified in advance, the input order of documents does not affect the clustering result, and it is insensitive to parameters such as the similarity radius.
Brief description of the drawings
Fig. 1 is a block diagram of the text clustering method based on class-center compression transformation;
Fig. 2 is a structural diagram of the text clustering method based on class-center compression transformation;
Fig. 3 shows how the clustering precision of the KMeans method and the class-center compression transformation method varies as the initial K value rises;
Fig. 4 shows how the number of clusters of the DBSCAN method and the class-center compression transformation method varies as the class-center similarity radius increases;
Fig. 5 compares the average clustering precision of the KMeans method, the DBSCAN method, and the class-center compression transformation method.
Embodiment
The present invention is further described below in conjunction with the drawings and embodiments.
The text clustering method based on class-center compression transformation of the present invention fully mines the latent semantic associations between words in texts, computes word centers, and compresses class centers to improve the precision of text clustering. It computes the similarity between class centers and texts, and iterates splitting, merging, and recombination of class centers until a given criterion is met. In mining the latent semantic associations between words, the improved tf-idf is used to compute the similarity between texts, which serves as an important indicator of the degree of association between words. At the same time, the title of each document is extracted and segmented, and the similarity contributed by title words is weighted.
tf_new = log(tf) + 1
idf = log(fileNum / freOccur), where fileNum is the total number of texts and freOccur is the number of texts in which the word occurs;
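A minimal sketch of this improved weighting in Python; the function name, the Counter-based document-frequency computation, and the toy documents are illustrative assumptions, with documents taken as already-segmented word lists:

```python
import math
from collections import Counter
from typing import Dict, List

def improved_tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each word by (log(tf) + 1) * log(fileNum / freOccur)."""
    file_num = len(docs)
    # freOccur: the number of texts in which each word occurs
    fre_occur = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            w: (math.log(tf[w]) + 1) * math.log(file_num / fre_occur[w])
            for w in tf
        })
    return weights

# Example: three tiny "documents"
docs = [["cluster", "text", "text"], ["cluster", "center"], ["search", "engine"]]
print(improved_tf_idf(docs)[0])
```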
In mining the latent semantic associations between words, synonym sets and co-occurring high-frequency word pairs (pairs appearing together in many documents) are mined to improve the accuracy of word-similarity computation, and these serve as the initial word clustering centers.
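A minimal sketch of mining the co-occurring high-frequency pairs in Python; the min_docs threshold and all names are illustrative assumptions:

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple

def mine_cooccurring_pairs(docs: List[List[str]], min_docs: int = 2) -> List[Tuple[str, str]]:
    """Return word pairs that co-occur in at least `min_docs` different texts."""
    pair_counts = Counter()
    for doc in docs:
        # count each unordered pair at most once per document
        for pair in combinations(sorted(set(doc)), 2):
            pair_counts[pair] += 1
    return [pair for pair, n in pair_counts.items() if n >= min_docs]

docs = [["search", "engine", "text"], ["search", "engine"], ["text", "cluster"]]
print(mine_cooccurring_pairs(docs))  # [('engine', 'search')]
```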
In computing word centers, features such as the number of times words appear in the same document, the frequency with which they appear across different documents, their near-synonyms, and their co-occurring associated words are used to classify and label words and to compute the class center closest to each word.
In compressing class centers, the center id of each word is updated and the class centers are compressed, so that the same word appears only in the few class centers most similar to it.
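A minimal sketch of this compression in Python, treating each class center as a word-to-weight mapping and using a word's weight in a center as a proxy for its similarity to that center; the parameter k and all names are illustrative assumptions:

```python
from typing import Dict, List

def compress_centers(centers: List[Dict[str, float]], k: int = 2) -> List[Dict[str, float]]:
    """Keep each word only in the k class centers where its weight is highest."""
    # collect, for every word, the centers it appears in and its weight there
    occurrences: Dict[str, List[tuple]] = {}
    for idx, center in enumerate(centers):
        for word, weight in center.items():
            occurrences.setdefault(word, []).append((weight, idx))
    compressed = [dict(c) for c in centers]
    for word, occ in occurrences.items():
        occ.sort(reverse=True)            # highest-weight centers first
        for _, idx in occ[k:]:            # drop the word everywhere else
            del compressed[idx][word]
    return compressed

centers = [{"text": 3.0, "cluster": 2.0}, {"text": 1.0, "search": 2.5}, {"text": 0.5, "engine": 1.0}]
print(compress_centers(centers, k=2))  # "text" survives only in its two strongest centers
```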
In computing the similarity between class centers and texts, the input order of texts is randomized. Similarity is computed from the cluster centers and the preprocessed text information. For each text, its similarity to all cluster centers is computed, and the text is assigned to the cluster center id with the largest similarity.
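A minimal sketch of this assignment in Python, with a simplified dot-product over shared words standing in for the full fileSim formula; all names are illustrative assumptions:

```python
from typing import Dict, List

def assign_to_centers(texts: List[Dict[str, float]],
                      centers: List[Dict[str, float]]) -> List[int]:
    """Assign each text (word -> weight) to the most similar class center."""
    def sim(text: Dict[str, float], center: Dict[str, float]) -> float:
        # simplified proxy for fileSim: sum of weight products over shared words
        return sum(w * center[word] for word, w in text.items() if word in center)

    return [max(range(len(centers)), key=lambda j: sim(text, centers[j]))
            for text in texts]

texts = [{"search": 1.2, "engine": 0.8}, {"cluster": 1.5, "text": 0.6}]
centers = [{"search": 2.0, "engine": 1.0}, {"cluster": 1.0, "text": 1.0}]
print(assign_to_centers(texts, centers))  # [0, 1]
```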
In splitting class centers, the core similarity of each class is computed and an attempt is made to split the largest class to produce a new class. The splitting method is as follows: find the most active text fx in the class, i.e., the text that most often appears as another text's most similar text, with the largest similarity values; find the text fy in the class with the smallest similarity to fx; build a new class center ctx from fx and the set of texts most similar to fx, and a new class center cty from fy and the set of texts most similar to fy; for each remaining text in the class, compute its similarity to ctx and cty and assign it to one of the two.
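A minimal sketch of this split in Python over a pairwise similarity matrix for the texts of one class; the matrix-based formulation and all names are illustrative assumptions:

```python
from typing import List, Tuple

def split_class(sim: List[List[float]]) -> Tuple[List[int], List[int]]:
    """Split one class given its pairwise text-similarity matrix sim[i][j]."""
    n = len(sim)
    # fx: the most "active" text, i.e. the one that most often appears as
    # another text's most similar text
    nearest = [max((j for j in range(n) if j != i), key=lambda j: sim[i][j])
               for i in range(n)]
    fx = max(range(n), key=nearest.count)
    # fy: the text least similar to fx
    fy = min((i for i in range(n) if i != fx), key=lambda i: sim[fx][i])
    group_x, group_y = [fx], [fy]
    for i in range(n):
        if i in (fx, fy):
            continue
        (group_x if sim[i][fx] >= sim[i][fy] else group_y).append(i)
    return group_x, group_y

sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.3],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.3, 0.8, 1.0]]
print(split_class(sim))  # ([0, 1], [2, 3]): two natural sub-groups
```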
In merging class centers, the similarity between class centers is computed, classes whose similarity reaches a given criterion are merged, and the words of these classes are used to regenerate the merged class center. The selected words must have tf and idf no less than given thresholds.
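A minimal sketch of this merge in Python, with cosine similarity between the centers' word-weight vectors, an illustrative threshold, and weight summation as a simplified stand-in for regenerating the merged center:

```python
import math
from typing import Dict, List

def merge_centers(centers: List[Dict[str, float]], threshold: float = 0.8) -> List[Dict[str, float]]:
    """Greedily merge class centers whose cosine similarity exceeds `threshold`."""
    def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(w * b[word] for word, w in a.items() if word in b)
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    merged: List[Dict[str, float]] = []
    for center in centers:
        for target in merged:
            if cosine(center, target) > threshold:
                for word, w in center.items():   # fold this center into the match
                    target[word] = target.get(word, 0.0) + w
                break
        else:
            merged.append(dict(center))
    return merged

centers = [{"search": 1.0, "engine": 1.0}, {"search": 0.9, "engine": 1.1}, {"cluster": 1.0}]
print(merge_centers(centers))  # the two near-duplicate centers are merged
```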
The iterative operation repeatedly splits and recombines class centers until the number of classes converges and the similarity between each text in a class and its class center reaches a given threshold.
The text clustering method based on class-center compression transformation in a search engine of the present invention is mainly divided into the following 15 steps.
1. Perform word segmentation on each text in the text set to be clustered.
2. Remove stop words and filter out words with little influence.
3. Count the frequency tf with which each word occurs in each text.
4. Compute the inverse document frequency idf of each word.
5. Mine synonym sets.
6. Mine co-occurring high-frequency word pairs, i.e., word pairs that appear together in multiple different texts.
7. Generate the initial class centers from the synonym sets and high-frequency co-occurring word pairs; each class center consists of a series of high-frequency words; record the tf and idf of each high-frequency word and mark the class center to which it belongs.
8. Compute the content length of each text, extract the article title (if there is no title, set the title to empty), and segment the title; extract the head and tail words of each paragraph and mark them for the weighted computation below.
9. Compute the similarity between every pair of texts (the weight is increased when a word that is identical or synonymous appears in the title or content); the head and tail words of paragraphs are given different weights. The computation formulas are:
$$\mathrm{pureFileSim}(i,j)=\frac{\mathrm{contentSimilarity}(i,j)+\mathrm{titleSimilarity}(i,j)}{\log\big(\mathrm{fileLength}_i\cdot\mathrm{fileLength}_j\big)}$$

$$\mathrm{contentSimilarity}(i,j)=\sum_{x,y}\Big[\big(\log\mathrm{fileKeywordTf}(i,x)+1\big)\cdot\mathrm{fileKeywordIdf}(i,x)\cdot\partial+\big(\log\mathrm{fileKeywordTf}(j,y)+1\big)\cdot\mathrm{fileKeywordIdf}(j,y)\cdot\partial\Big]$$

$$\mathrm{titleSimilarity}(i,j)=\sum_{x,y}\Big[\mathrm{fileTitleWordTf}(i,x)\cdot\mathrm{fileTitleWordIdf}(i,x)\cdot\partial+\mathrm{fileTitleWordTf}(j,y)\cdot\mathrm{fileTitleWordIdf}(j,y)\cdot\partial\Big]$$
In the formulas:
pureFileSim(i, j): the pure similarity between text i and text j;
contentSimilarity(i, j): the content similarity between text i and text j;
titleSimilarity(i, j): the title similarity between text i and text j;
fileKeywordTf(i, x): the tf of keyword x in text i;
fileKeywordIdf(i, x): the idf of keyword x in text i;
fileTitleWordTf(j, y): the tf of title word y in text j;
fileTitleWordIdf(j, y): the idf of title word y in text j;
fileLength_i: the content length of text i;
∂: the weighting coefficient.
10. Randomize the input order of the texts. Perform initial clustering of the text set against the initial cluster centers. The algorithm is as follows: for each text, compute its similarity to all cluster centers and assign the text to the cluster center id with the largest similarity. The similarity between text i and class center j is computed as follows:
$$\mathrm{fileSim}(i,j)=\Big(\sum_{\mathrm{fileKeyword}(i,x)\in\mathrm{center}_j}\big(\log\mathrm{fileKeywordTf}(x,i)+1\big)\cdot\mathrm{fileKeywordIdf}(x,i)+\sum_{\mathrm{fileTitleWord}(i,y)\in\mathrm{center}_j}\big(\log\mathrm{centerKeywordTf}(j,y)+1\big)\cdot\mathrm{centerKeywordIdf}(j,y)\Big)\Big/\mathrm{fileContentLength}_i$$
In the formula:
fileKeywordTf(x, i): the tf of keyword x in text i;
fileKeywordIdf(x, i): the idf of keyword x in text i;
centerKeywordTf(j, y): the tf of keyword y in class center j;
centerKeywordIdf(j, y): the idf of keyword y in class center j;
fileContentLength_i: the content length of text i.
At the same time, compute the class center closest to each word and record the word's wordid.
Compute the percentage by which the most similar class center exceeds the second most similar class center, and record it as the text's diffRatio.
11. Discard texts whose diffRatio is less than 10%; on the remaining texts, extract and count keywords from the set of texts belonging to the same class, and use these words to regenerate that class's center. The selected words must have tf and idf no less than given thresholds. Update the center id of each word and compress the class centers, so that the same word appears only in the few class centers most similar to it. Merge class centers with high mutual similarity.
12. Recompute the cluster center to which each text belongs according to the new cluster centers; the similarity computation is the same as in step 9.
13. Compute the core similarity of each class and attempt to split the largest class to produce a new class. The splitting method is as follows: find the most active text fx in the class, i.e., the text that most often appears as another text's most similar text, with the largest similarity values; find the text fy in the class with the smallest similarity to fx; build a new class center ctx from fx and the set of texts most similar to fx, and a new class center cty from fy and the set of texts most similar to fy; for each remaining text in the class, compute its similarity to ctx and cty and assign it to one of the two.
14. Building on step 11, for texts with low similarity to their class center, assign each text to the class whose id matches the center id of the majority of its words.
15. Repeat steps 10-14 until the number of classes converges and the similarity between each text in a class and the class center reaches a given threshold, then terminate.

Claims (1)

1. A text clustering method based on class-center compression transformation in a search engine, characterized in that the method comprises the following steps:
Step 1: perform word segmentation on each text in the text set to be clustered;
Step 2: remove stop words and filter out words with little influence;
Step 3: count the frequency tf with which each word occurs in each text;
Step 4: compute the inverse document frequency idf of each word;
Step 5: mine synonym sets;
Step 6: mine co-occurring high-frequency word pairs, i.e., word pairs that appear together in multiple different texts;
Step 7: generate the initial class centers from the synonym sets and high-frequency co-occurring word pairs; each class center consists of a series of high-frequency words; record the tf and idf of each high-frequency word and mark the class center to which it belongs;
Step 8: compute the content length of each text, extract the article title, and segment the title; if there is no title, set the title to empty; extract the head and tail words of each paragraph and mark them for the weighted computation below;
Step 9: compute the similarity between every pair of texts; when a word that is identical or synonymous appears in the title or content, increase its weight; the head and tail words of paragraphs are given different weights. The computation formulas are as follows:
$$\mathrm{pureFileSim}(i,j)=\frac{\mathrm{contentSimilarity}(i,j)+\mathrm{titleSimilarity}(i,j)}{\log\big(\mathrm{fileLength}_i\cdot\mathrm{fileLength}_j\big)}$$

$$\mathrm{contentSimilarity}(i,j)=\sum_{x,y}\Big[\big(\log\mathrm{fileKeywordTf}(i,x)+1\big)\cdot\mathrm{fileKeywordIdf}(i,x)\cdot\partial+\big(\log\mathrm{fileKeywordTf}(j,y)+1\big)\cdot\mathrm{fileKeywordIdf}(j,y)\cdot\partial\Big]$$

$$\mathrm{titleSimilarity}(i,j)=\sum_{x,y}\Big[\mathrm{fileTitleWordTf}(i,x)\cdot\mathrm{fileTitleWordIdf}(i,x)\cdot\partial+\mathrm{fileTitleWordTf}(j,y)\cdot\mathrm{fileTitleWordIdf}(j,y)\cdot\partial\Big]$$
In the formulas:
pureFileSim(i, j): the pure similarity between text i and text j;
contentSimilarity(i, j): the content similarity between text i and text j;
titleSimilarity(i, j): the title similarity between text i and text j;
fileKeywordTf(i, x): the tf of keyword x in text i;
fileKeywordIdf(i, x): the idf of keyword x in text i;
fileTitleWordTf(j, y): the tf of title word y in text j;
fileTitleWordIdf(j, y): the idf of title word y in text j;
fileLength_i: the content length of text i;
∂: the weighting coefficient;
Step 10: randomize the input order of the texts, and perform initial clustering of the text set against the initial cluster centers. The algorithm is as follows: for each text, compute its similarity to all cluster centers and assign the text to the cluster center id with the largest similarity. The similarity between text i and class center m is computed as follows:
$$\mathrm{fileSim}(i,m)=\Big(\sum_{\mathrm{fileKeyword}(i,x)\in\mathrm{center}_m}\big(\log\mathrm{fileKeywordTf}(x,i)+1\big)\cdot\mathrm{fileKeywordIdf}(x,i)+\sum_{\mathrm{fileTitleWord}(i,y)\in\mathrm{center}_m}\big(\log\mathrm{centerKeywordTf}(m,y)+1\big)\cdot\mathrm{centerKeywordIdf}(m,y)\Big)\Big/\mathrm{fileContentLength}_i$$
In the formula:
fileKeywordTf(x, i): the tf of keyword x in text i;
fileKeywordIdf(x, i): the idf of keyword x in text i;
centerKeywordTf(m, y): the tf of keyword y in class center m;
centerKeywordIdf(m, y): the idf of keyword y in class center m;
fileContentLength_i: the content length of text i;
At the same time, compute the class center closest to each word and record the word's wordid;
Compute the percentage by which the most similar class center exceeds the second most similar class center, and record it as the text's diffRatio;
Step 11: discard texts whose diffRatio is less than 10%; on the remaining texts, extract and count keywords from the set of texts belonging to the same class, and use these words to regenerate that class's center; the selected words must have tf and idf no less than given thresholds; update the center id of each word and compress the class centers, so that the same word appears only in the few class centers most similar to it, and merge class centers with high mutual similarity;
Step 12: recompute the cluster center to which each text belongs according to the new cluster centers; the similarity computation is the same as in step 9;
Step 13: compute the core similarity of each class and attempt to split the largest class to produce a new class. The splitting method is as follows: find the most active text fx in the class, i.e., the text that most often appears as another text's most similar text, with the largest similarity values; find the text fy in the class with the smallest similarity to fx; build a new class center ctx from fx and the set of texts most similar to fx, and a new class center cty from fy and the set of texts most similar to fy; for each remaining text in the class, compute its similarity to ctx and cty and assign it to one of the two;
Step 14: building on step 11, for texts with low similarity to their class center, assign each text to the class whose id matches the center id of the majority of its words;
Step 15: repeat steps 10-14 until the number of classes converges and the similarity between each text in a class and the class center reaches a given threshold, then stop.
CN201210447277.XA 2012-11-09 2012-11-09 Class center compression transformation-based text clustering method in search engine Active CN102955857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210447277.XA CN102955857B (en) 2012-11-09 2012-11-09 Class center compression transformation-based text clustering method in search engine

Publications (2)

Publication Number Publication Date
CN102955857A CN102955857A (en) 2013-03-06
CN102955857B (en) 2015-07-08

Family

ID=47764663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210447277.XA Active CN102955857B (en) 2012-11-09 2012-11-09 Class center compression transformation-based text clustering method in search engine

Country Status (1)

Country Link
CN (1) CN102955857B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216931A (en) * 2013-05-29 2014-12-17 酷盛(天津)科技有限公司 Real-time recommending system and method
CN103942347B (en) * 2014-05-19 2017-04-05 焦点科技股份有限公司 A kind of segmenting method based on various dimensions synthesis dictionary
CN104331510B (en) * 2014-11-24 2018-09-04 小米科技有限责任公司 Approaches to IM and device
CN104598532A (en) * 2014-12-29 2015-05-06 中国联合网络通信有限公司广东省分公司 Information processing method and device
CN106294868A (en) * 2016-08-23 2017-01-04 达而观信息科技(上海)有限公司 A kind of personalized recommendation method based on search engine and system
CN106650803B (en) * 2016-12-09 2019-06-18 北京锐安科技有限公司 The method and device of similarity between a kind of calculating character string
CN106778880B (en) * 2016-12-23 2020-04-07 南开大学 Microblog topic representation and topic discovery method based on multi-mode deep Boltzmann machine
CN108052659B (en) 2017-12-28 2022-03-11 北京百度网讯科技有限公司 Search method and device based on artificial intelligence and electronic equipment
CN110750963B (en) * 2018-07-02 2023-09-26 北京四维图新科技股份有限公司 News document duplication removing method, device and storage medium
CN110196974B (en) * 2019-06-11 2023-07-07 吉林大学 Rapid data aggregation method for big data cleaning
CN110991168B (en) * 2019-12-05 2024-05-17 京东方科技集团股份有限公司 Synonym mining method, synonym mining device, and storage medium
CN111161819B (en) * 2019-12-31 2023-06-30 重庆亚德科技股份有限公司 System and method for processing medical record data of traditional Chinese medicine
CN113806524B (en) * 2020-06-16 2024-05-24 阿里巴巴集团控股有限公司 Hierarchical category construction and hierarchical structure adjustment method and device for text content
CN113254584A (en) * 2021-05-28 2021-08-13 北京明略昭辉科技有限公司 Document retrieval method, system, electronic equipment and storage medium
CN113673684B (en) * 2021-08-24 2024-08-02 东北大学 Edge-end DNN model loading system and method based on input pruning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5640488A (en) * 1995-05-05 1997-06-17 Panasonic Technologies, Inc. System and method for constructing clustered dictionary for speech and text recognition
US7827168B2 (en) * 2007-05-30 2010-11-02 Red Hat, Inc. Index clustering for full text search engines
CN101339553A (en) * 2008-01-14 2009-01-07 浙江大学 Approximate quick clustering and index method for mass data
CN101706790A (en) * 2009-09-18 2010-05-12 浙江大学 Clustering method of WEB objects in search engine
CN102682000A (en) * 2011-03-09 2012-09-19 北京百度网讯科技有限公司 Text clustering method, question-answering system applying same and search engine applying same
CN102629272A (en) * 2012-03-14 2012-08-08 北京邮电大学 Clustering based optimization method for examination system database

Also Published As

Publication number Publication date
CN102955857A (en) 2013-03-06

Similar Documents

Publication Publication Date Title
CN102955857B (en) Class center compression transformation-based text clustering method in search engine
CN107193801B (en) Short text feature optimization and emotion analysis method based on deep belief network
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
US10346257B2 (en) Method and device for deduplicating web page
CN103514183B (en) Information search method and system based on interactive document clustering
CN101685455B (en) Method and system of data retrieval
CN109960799B (en) Short text-oriented optimization classification method
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN101464898B (en) Method for extracting feature word of text
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN108710611B (en) Short text topic model generation method based on word network and word vector
CN104765769A (en) Short text query expansion and indexing method based on word vector
CN103617157A (en) Text similarity calculation method based on semantics
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN101630312A (en) Clustering method for question sentences in question-and-answer platform and system thereof
CN103049569A (en) Text similarity matching method on basis of vector space model
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN104484343A (en) Topic detection and tracking method for microblog
CN105022740A (en) Processing method and device of unstructured data
CN110019820B (en) Method for detecting time consistency of complaints and symptoms of current medical history in medical records
CN103390004A (en) Determination method and determination device for semantic redundancy and corresponding search method and device
CN109815401A (en) A kind of name disambiguation method applied to Web people search
CN111460147A (en) Title short text classification method based on semantic enhancement
CN104317783A (en) SRC calculation method
Campbell et al. Content+ context networks for user classification in twitter

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Ouyang Yuanxin

Inventor after: Yuan Man

Inventor after: Xie Shuyi

Inventor after: Liu Wenqi

Inventor after: Xiong Zhang

Inventor before: Ouyang Yuanxin

Inventor before: Xie Shuyi

Inventor before: Liu Wenqi

Inventor before: Xiong Zhang

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: OUYANG YUANXIN XIE SHUYI LIU WENQI XIONG ZHANG TO: OUYANG YUANXIN YUAN MAN XIE SHUYI LIU WENQI XIONG ZHANG

C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20200110

Address after: 519080 5th floor, building 8, science and Technology Innovation Park, No.1 Gangwan, Jintang Road, Tangjiawan, Xiangzhou District, Zhuhai City, Guangdong Province

Patentee after: Zhuhai haotengzhisheng Technology Co., Ltd

Address before: 100191 Haidian District, Xueyuan Road, No. 37,

Patentee before: Beijing University of Aeronautics and Astronautics