CN101694670A - Chinese Web document online clustering method based on common substrings - Google Patents


Info

Publication number
CN101694670A
CN101694670A CN200910236138A CN200910236138A CN101694670A CN 101694670 A CN101694670 A CN 101694670A CN 200910236138 A CN200910236138 A CN 200910236138A CN 200910236138 A CN200910236138 A CN 200910236138A CN 101694670 A CN101694670 A CN 101694670A
Authority
CN
China
Prior art keywords
cluster
document
clustering
public substring
substring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910236138A
Other languages
Chinese (zh)
Other versions
CN101694670B (en)
Inventor
张辉
王德庆
王晗
杨高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN2009102361380A
Publication of CN101694670A
Application granted
Publication of CN101694670B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an online clustering method for Chinese Web documents based on common substrings. As the volume of information on the Internet grows sharply, search engines have become essential for finding and locating information. Web document clustering automatically groups the results returned by a search engine by topic, helping users narrow the query scope and locate the information they need quickly. Online clustering of Web documents has two characteristics: the method must cope with the non-numeric, unstructured nature of Web documents, and the clustering time must meet the requirements of online search. To address both, the method comprises the following steps: (1) preprocess the first n query results returned by the search engine, deleting or replacing the non-Chinese characters in them; (2) extract the common substrings of the Web documents using a generalized suffix array (GSA); (3) define a weighting formula modeled on TF*IDF for the extracted common substrings and build a document feature vector model; (4) compute the pairwise similarity of the Web documents from the model to obtain a similarity matrix; (5) cluster the Web documents with an improved hierarchical clustering algorithm based on the matrix; and (6) generate cluster descriptions and extract labels. The method has clear advantages in clustering performance, cluster label generation, and clustering time.

Description

Chinese Web document online clustering method based on common substrings
Technical field
The invention belongs to the technical field of information processing. It is a data mining method and relates specifically to an online clustering method for Web documents.
Background art
Clustering is essentially a mapping. Given an object set $O = \{o_1, o_2, \ldots, o_n\}$ and a class set $\pi = \{c_1, c_2, \ldots, c_m\}$, clustering is the following mapping:
[formula image G2009102361380D0000011]
which satisfies:
(1) $c_i \subseteq O \quad (i = 1, 2, \ldots, t)$
(2) $\bigcup_{i=1}^{t} c_i = O$
With the growing popularity of the Internet and the rapid increase in online information, a traditional search engine often returns so many results that users struggle to find what they actually need. Web document clustering addresses this problem well: it groups the results returned by the search engine by content, so the user can narrow the range of candidates and find the information of interest quickly.
Web document clustering is unsupervised document classification. It divides a document set into several clusters (subsets) so that documents within a cluster are as similar in content as possible and documents in different clusters are as dissimilar as possible. Compared with general clustering, online clustering of Web documents has two characteristics. First, the objects being clustered are Web documents, which are non-numeric and unstructured. Second, the clustering time must meet the requirements of online retrieval, so the algorithm must be real-time and interactive.
Research on Web document clustering follows three main approaches: link-based clustering, text-similarity-based clustering, and user-feedback-based clustering. At present, the most common methods for clustering search engine results are based on document similarity. The idea is to represent each document abstractly as a vector, measure the similarity between documents by the cosine of the angle between their vectors, and then cluster the documents with some clustering algorithm (such as K-means or STC).
These methods suit English information retrieval systems. Chinese text has no spaces between words, so these methods must rely on a word segmentation system, and they therefore work poorly for Chinese information retrieval. The present invention proposes an online clustering algorithm for Chinese Web documents that requires no Chinese word segmentation.
Summary of the invention
Technical problems to be solved by the invention:
1. Current general-purpose Web document clustering methods suit English information retrieval systems. Chinese text has no spaces between words, so such methods must rely on a word segmentation system, and the quality of the dictionary fundamentally affects the clustering result. The invention uses a segmentation-free technique, which avoids the influence of the dictionary and also improves clustering performance.
2. The execution time of online Web document clustering must meet the requirements of online retrieval, so the algorithm must be strongly real-time and interactive.
Technical solution adopted by the invention:
The processing flow consists of the following steps:
1) Web document preprocessing: delete or replace the non-Chinese characters in the results returned by the search engine;
2) use GSA to extract the common substrings of the Web documents, and use these common substrings as document features;
3) compute the pairwise similarity of the documents to be clustered, forming the document similarity matrix;
4) cluster the documents with a clustering algorithm based on the similarity matrix;
5) cluster description and label extraction: assign each class a label that both summarizes the content of the class and distinguishes it from the other classes.
Beneficial effects of the invention:
The online clustering method has clear advantages in performance, cluster label generation, and clustering time:
1. Unlike traditional text clustering systems, the proposed method needs no word segmentation. Instead, it uses the GSA algorithm to extract the common substrings between Web documents, uses them as document features, and performs the clustering computation on the resulting feature vectors. This solves the problem that Web text, as a clustering object, is non-numeric and unstructured.
2. To find the common substrings between strings, the invention uses the GSA algorithm, a variant of the suffix tree algorithm. Its time complexity is O(n) and its space complexity is S(n), which is better than that of the suffix tree algorithm.
3. Traditional hierarchical clustering methods, whether agglomerative or divisive, have high complexity and scale poorly, so they are unsuitable for clustering large numbers of documents. The invention therefore optimizes traditional agglomerative hierarchical clustering and obtains good clustering results.
4. The invention uses the common substring with the largest weight as the cluster label, which preserves semantic content and makes the labels highly readable.
The effects of the invention are verified experimentally below.
The main evaluation indicators for the clustering algorithm are the CH value, cluster label validity, and the clustering result.
The CH function is defined as follows:
$$CH = \frac{\mathrm{trace}B / (k - 1)}{\mathrm{trace}W / (n - k)}$$
$$\mathrm{trace}B = \sum_{j=1}^{k} n_j \lVert u_j - u \rVert^2$$
$$\mathrm{trace}W = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \lVert x_i - u_j \rVert$$
where $n_j$ is the number of texts in the j-th cluster; $u_j$ is the centroid of the j-th cluster; $u$ is the centroid of all texts taking part in the clustering; $x_i$ is the i-th text in the corresponding cluster; $k$ is the total number of clusters; and $n$ is the total number of texts. The CH function combines the intra-class and inter-class distances of a clustering result; the larger the CH value, the better the clustering.
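The CH function above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: texts are taken to be equal-length numeric feature vectors, the norm is assumed to be Euclidean, and the function names (`centroid`, `distance`, `ch_value`) are hypothetical. The sketch follows traceW as printed above, with unsquared distances; the widely used Calinski-Harabasz index squares them.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def distance(a, b):
    """Euclidean distance between two vectors (assumed norm)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ch_value(clusters):
    """CH value of a clustering given as a list of clusters of vectors.

    Requires at least two clusters, more texts than clusters,
    and a non-zero traceW.
    """
    k = len(clusters)
    all_points = [p for cluster in clusters for p in cluster]
    n = len(all_points)
    u = centroid(all_points)
    trace_b = 0.0
    trace_w = 0.0
    for cluster in clusters:
        u_j = centroid(cluster)
        trace_b += len(cluster) * distance(u_j, u) ** 2
        # Unsquared distance, following the formula as printed in the text.
        trace_w += sum(distance(x, u_j) for x in cluster)
    return (trace_b / (k - 1)) / (trace_w / (n - k))


loose = [[[0, 0], [0, 2]], [[10, 0], [10, 2]]]
tight = [[[0, 0], [0, 1]], [[10, 0], [10, 1]]]
print(ch_value(loose), ch_value(tight))
```

In this toy input the tighter clustering gets the larger CH value, which matches the reading that a larger CH value means a better clustering.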
Five keywords were used for retrieval in the experiment. The following table compares the CH values of the proposed model and of the Chinese word segmentation + tf*idf model:
Keyword | Word segmentation + tf*idf | Model based on common substrings
Apple | 16.916 | 17.007
Yao Ming | 14.785 | 20.516
The Department of Science and Technology | 13.146 | 16.597
Object-oriented | 17.860 | 17.764
Data mining | 11.593 | 16.974
The experiment shows that the model based on common substrings yields a larger CH value than the word segmentation + tf*idf model for four of the five keywords (the exception is "Object-oriented", where the two values are nearly equal). The CH values for "Yao Ming" and "Data mining" improve by about 5.7 and 5.4 respectively. The new method therefore clusters better than the traditional method.
The validity of a cluster label, that is, its readability, matters greatly to users: only a phrase with real meaning can serve as a cluster label. Label validity is computed as P = M/N, where M is the number of readable labels and N is the total number of labels. The results are shown in Fig. 1: the label validity of the new method lies between 0.8 and 0.95, whereas that of the traditional method is mostly below 0.8. The cluster labels produced by the new method are therefore more readable than those of the traditional model.
Finally, the top 100 results returned by Baidu for the keyword "apple" were used as the Web documents to be clustered; the result is shown in Fig. 2. As Fig. 2 shows, the proposed method achieves a good clustering result.
The analysis and comparison of the experimental results show that, compared with Chinese clustering algorithms based on word segmentation, the proposed online clustering method for Chinese Web documents based on common substrings has clear advantages in clustering quality, clustering performance, and cluster label readability.
Description of drawings
Fig. 1 is a comparison of label validity;
Fig. 2 is the clustering result obtained for the input keyword "apple";
Fig. 3 shows the test results for three query words (apple, Yao Ming, data mining);
Fig. 4 is the flowchart of the Chinese Web document online clustering method based on common substrings.
Embodiment:
1. Web document preprocessing
The results returned by a Chinese search engine (such as Baidu) usually contain some non-Chinese characters, such as English letters, spaces, punctuation marks, or garbled characters. Because the invention focuses on clustering Chinese Web documents, the non-Chinese content of the search results must be replaced before clustering.
The preprocessing stage replaces these non-Chinese characters with a separator predefined by the system. The characters to replace mainly include spaces, digits, uppercase and lowercase English letters, Chinese and English punctuation marks (both full-width and half-width), and Chinese stop words (for example: " ", " ", " ", etc.). After preprocessing, each search result item contains only Chinese characters (apart from the separators) and serves as the input to common substring extraction.
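The replacement step can be sketched as follows. Everything specific here is an assumption made for illustration: the separator character `|`, the small stop-word tuple, the function name `preprocess`, and the use of the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) as the definition of "Chinese character". The text does not specify any of these.

```python
import re

SEPARATOR = "|"                  # assumed separator; the text only says "predefined"
STOP_WORDS = ("的", "了", "是")   # illustrative stop words, not taken from the text

# Any run of characters outside the basic CJK Unified Ideographs block.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(snippet, stop_words=STOP_WORDS, sep=SEPARATOR):
    """Replace stop words and non-Chinese runs with a single separator."""
    for word in stop_words:
        snippet = snippet.replace(word, sep)
    # The separator itself is non-Chinese, so adjacent separators collapse here.
    cleaned = _NON_CHINESE.sub(sep, snippet)
    return cleaned.strip(sep)


print(preprocess("苹果iPhone 15的价格"))
```

Collapsing each run of replaced characters into one separator keeps the output short, which matters because the later suffix array is built over the whole preprocessed text.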
2. Common substring extraction based on GSA
● Common substring (CS): if a string u is a substring of both string S and string T, then u is a common substring of S and T. Writing Sub(S, u) for "u is a substring of S", the set of common substrings Com(S, T) of S and T can be defined as: $Com(S, T) = \{\, u \mid \forall u,\ Sub(S, u) \wedge Sub(T, u) \,\}$.
● Longest common substring (LCS): the longest common substring of S and T is the common substring of S and T with the greatest length. If a string u satisfies $u \in Com(S, T)$ and
[formula image G2009102361380D0000042]
then u is a longest common substring of S and T.
For example, take two strings of length 4, "abac" and "caba". Their common substrings are "", "a", "b", "ab", "ba", "aba", and "c"; the longest common substring is "aba".
Two classic algorithms are commonly used to solve the common substring problem: dynamic programming and the suffix tree algorithm. The former is easy to implement but has high time complexity; the latter has only linear time complexity but is comparatively hard to implement. This method uses the generalized suffix array (GSA) algorithm, a variant of the suffix tree algorithm, to extract the common substrings between texts.
With the GSA algorithm, finding the common substrings between strings has time complexity O(n). The space complexity of the GSA algorithm is S(n), which is better than that of the suffix tree algorithm.
Definitions:
● Suffix: a suffix of a string S is the substring that runs from some position i (1 ≤ i ≤ len(S)) to the last character of S. It is written Suffix(S, i), that is, Suffix(S, i) = subString(S, i, len(S)).
● Suffix array (SA): the suffix array SA corresponds one-to-one with the string S, and each of its elements is an index into S. That is, len(SA) = len(S), SA[i] ∈ {1, 2, ..., len(S)} for 1 ≤ i ≤ len(S), and SA[i] ≠ SA[j] for i ≠ j. The array also satisfies Suffix(S, SA[i]) < Suffix(S, SA[i+1]) for 1 ≤ i < len(S).
● Generalized suffix array (GSA): the generalized suffix array of strings S_1, S_2, ..., S_n is the suffix array of the new string formed by joining S_1, S_2, ..., S_n with a special end marker.
For example, take the two strings S1 = "abac" and S2 = "caba". Joining them with the special character @ gives the string abac@caba. This string has 8 non-empty suffixes; the original sequence and the sequence after sorting in lexicographic order are shown in the following table:
Non-empty suffixes before and after sorting
The one-dimensional array SA = [8, 6, 1, 3, 7, 2, 5, 4] is then the generalized suffix array of S_1 and S_2.
Once the generalized suffix array of the two joined strings is obtained, the longest common prefix of each pair of adjacent suffixes is computed in turn. Every longest common prefix of length at least 1 is a common substring of the two strings.
The algorithm above for two strings extends to N strings (N > 1). For N strings S_1, S_2, ..., S_N, join them with N-1 special characters (which need not be pairwise distinct) to obtain the string SE = S_1 a_1 S_2 a_2 ... S_{N-1} a_{N-1} S_N, where each a_i (1 ≤ i ≤ N-1) is an inserted special character and, for all a_i and S_j (1 ≤ i ≤ N-1, 1 ≤ j ≤ N), a_i does not occur in S_j. Then construct the suffix array of SE and compare the longest common prefixes of adjacent suffixes to obtain all the common substrings of the N strings S_1, S_2, ..., S_N.
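The adjacent-suffix procedure described above can be sketched as follows. This is a minimal illustration, not the linear-time construction the text refers to: the suffix array is built by naively sorting suffixes, positions are 0-based, a single separator character is reused between all documents, and the function names are hypothetical. The sketch also assumes the separator occurs in none of the documents, and it stops a common prefix at the separator so that no extracted substring spans two documents.

```python
def build_suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix(a, b, sep):
    """Longest common prefix of a and b, cut off before any separator."""
    length = 0
    for x, y in zip(a, b):
        if x != y or x == sep:
            break
        length += 1
    return a[:length]


def extract_common_substrings(docs, sep="\x00"):
    """Collect the non-empty longest common prefixes of adjacent suffixes."""
    joined = sep.join(docs)
    suffix_array = build_suffix_array(joined)
    found = set()
    for i, j in zip(suffix_array, suffix_array[1:]):
        prefix = common_prefix(joined[i:], joined[j:], sep)
        if prefix:
            found.add(prefix)
    return found


print(sorted(extract_common_substrings(["abac", "caba"], sep="@")))
```

For the "abac" / "caba" example this sketch returns "a", "aba", "ba", and "c". Comparing only adjacent suffixes reports the longest shared prefix of each neighbouring pair, so shorter common substrings such as "b" and "ab" appear only as prefixes of the reported ones. Adjacent suffixes may also come from the same document; filtering those out would need each suffix tagged with its source document, which this sketch omits.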
3. Building the text feature vector model
In text-based information retrieval, the feature vector model of a text is a set made up of several features of the text. In this text feature vector model based on common substrings, each document D can be represented as a feature vector consisting of M common substrings and their weights. Assume the following:
● the texts to be clustered are {D_1, D_2, ..., D_N};
● the sequence of common substrings after filtering is (S_1, S_2, ..., S_{n-1}, S_n);
● the function len(S_k) (k = 1, 2, ..., n) gives the length of string S_k;
● the function tf(S_k, D_j) gives the frequency with which the common substring S_k occurs in text D_j; tf is the term frequency commonly used in information retrieval;
● the function idf(S_k) gives the inverse document frequency of the common substring S_k;
● the constant N is the number of results returned by the search engine, that is, the number of texts to be clustered;
● the function df(S_k) gives the number of texts that contain the common substring S_k.
Based on these assumptions, document D_j can be represented as the following vector:
$D_j = \{ w(S_1, D_j), w(S_2, D_j), \ldots, w(S_n, D_j) \}, \quad (j = 1, 2, \ldots, N)$
where w(S_k, D_j) (k = 1, 2, ..., n) is the weight of the common substring S_k with respect to text D_j. The weight formula, modeled on TF*IDF, is:
$w(S_k, D_j) = \log(1 + tf(S_k, D_j)) \times idf(S_k) \times (len(S_k))^{\alpha}$, where
$idf(S_k) = \log\left(1 + \frac{N}{df(S_k)}\right)$
Within a document, the weight of a common substring is positively correlated with its length: the longer the substring, the larger its weight. In the formula above, the length len(S_k) is raised to the power α to amplify the influence of long common substrings on the weight; the value of α must be determined experimentally.
With this formula, once the weights of all common substrings have been computed, a traditional similarity measure such as cosine similarity can be used to compute the similarity between texts. The similarity of two texts is computed as:
$Sim(d_i, d_j) = \frac{d_i \cdot d_j}{|d_i| \times |d_j|}$
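The weight formula and the cosine similarity can be sketched together. Several details are assumptions for illustration only: tf is taken to be the raw (possibly overlapping) occurrence count, the logarithm is natural, α defaults to the experimentally chosen 1.3, two all-zero vectors are given similarity 0, and the function names are hypothetical.

```python
import math


def count_occurrences(text, sub):
    """Number of (possibly overlapping) occurrences of sub in text."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def feature_vectors(docs, substrings, alpha=1.3):
    """One weight vector per document: log(1+tf) * idf * len**alpha."""
    n_docs = len(docs)
    vectors = [[0.0] * len(substrings) for _ in docs]
    for k, sub in enumerate(substrings):
        tfs = [count_occurrences(doc, sub) for doc in docs]
        df = sum(1 for tf in tfs if tf > 0)
        if df == 0:
            continue  # substring occurs nowhere; leave its weights at zero
        idf = math.log(1 + n_docs / df)
        for j, tf in enumerate(tfs):
            vectors[j][k] = math.log(1 + tf) * idf * len(sub) ** alpha
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = feature_vectors(["abac", "caba"], ["aba", "c"])
print(cosine_similarity(vectors[0], vectors[1]))
```

Calling `cosine_similarity` on every pair of vectors fills the upper triangle of the similarity matrix used in the clustering step.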
4. Clustering method and implementation
From the text feature vector model proposed above, the similarity matrix between texts can be obtained, as shown in the following table:
Similarity matrix
   | D1 | D2 | D3 | … | DN
D1 |  | Sim(1,2) | Sim(1,3) | … | Sim(1,N)
D2 |  |  | Sim(2,3) | … | Sim(2,N)
… |  |  |  | … | …
DN |  |  |  |  |
In the table, D_i (1 ≤ i ≤ N) are the N result items (the documents to be clustered), and Sim(i, j) (1 ≤ i, j ≤ N) is the similarity between result items D_i and D_j.
Once the similarity matrix is obtained, the result items can be clustered with hierarchical clustering. Hierarchical clustering keeps the total number of classes small, which helps users locate the information they need quickly, and each class can be subdivided further. Traditional hierarchical clustering methods, whether agglomerative or divisive, have high complexity and scale poorly, so they are unsuitable for clustering large numbers of documents. The invention therefore optimizes traditional agglomerative hierarchical clustering.
Suppose the total number of documents to cluster is N, N > 0. N_i denotes the number of documents still unclassified at step i+1, i = 0, 1, ...; the set T_i denotes the i-th cluster, whose elements are the indices of the documents classified at step i+1, i = 0, 1, .... For the convenience of users browsing the results, the total number of classes after clustering is limited to at most 20. The clustering method is described in claim 4.
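The stepwise procedure set out in claim 4 can be sketched roughly as follows. Sub-steps i to iii of the first step follow the claim text. The rule for later steps and the stopping condition are given in the claim only as formula images that are not reproduced in this text, so the sketch makes its own assumptions: each later step reuses "half of the maximum similarity" over the documents still unclassified, and the loop stops when fewer than two documents remain, when no pair exceeds the threshold, or when only one slot is left under the cap of 20 classes. Label selection (sub-step iv) is omitted, and all function names are hypothetical.

```python
def first_conflict(members, sim, theta):
    """Larger index of the first member pair with similarity below theta."""
    ordered = sorted(members)
    for pos, i in enumerate(ordered):
        for j in ordered[pos + 1:]:
            if sim[i][j] < theta:
                return j
    return None


def clustering_step(sim, candidates, theta):
    """Sub-steps ii and iii: gather pairs above theta, then drop conflicts."""
    members = set()
    for pos, i in enumerate(candidates):
        for j in candidates[pos + 1:]:
            if sim[i][j] > theta:
                members.update((i, j))
    j = first_conflict(members, sim, theta)
    while j is not None:
        members.discard(j)
        j = first_conflict(members, sim, theta)
    return sorted(members)


def cluster_documents(sim, max_clusters=20):
    """Return (clusters, other), using only sim[i][j] with i < j."""
    remaining = list(range(len(sim)))
    clusters = []
    # Keep one slot free for the final "other" class (assumption).
    while len(remaining) >= 2 and len(clusters) < max_clusters - 1:
        # Assumed rule for every step: half of the largest remaining similarity.
        theta = 0.5 * max(sim[i][j]
                          for pos, i in enumerate(remaining)
                          for j in remaining[pos + 1:])
        members = clustering_step(sim, remaining, theta)
        if not members:
            break
        clusters.append(members)
        remaining = [d for d in remaining if d not in members]
    return clusters, remaining


sim = [
    [1.0, 0.9, 0.1, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.8, 0.0],
    [0.1, 0.1, 0.8, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
print(cluster_documents(sim))
```

The second element of the returned pair holds the documents that would go into the class labeled "other".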
In this method, the similarity computation is a key factor affecting the clustering result. The invention makes the weight of a common substring depend on the α-th power of its length. If α is too large, the effect of long common substrings is exaggerated and the clustering result suffers, so the value of α must be obtained experimentally. In the experiment, α was increased from 0 to 2 in steps of 0.1, and the 100 search results returned for each of several keywords ("apple", "Yao Ming", and "data mining") were clustered for each value. The evaluation parameter is the ratio of intra-class distance to inter-class distance. By definition, clustering is best when the intra-class distance is small and the inter-class distance is large, so the smaller this ratio, the better the clustering result and the more appropriate the corresponding α. The results are shown in Fig. 3.
The test results for the three keywords show that the ratio of intra-class to inter-class distance is smallest when α lies in the interval [1.2, 1.4]. Extensive experiments found that the mean of the α values at which the ratio is smallest is 1.3, so α is set to 1.3.

Claims (5)

1. A Chinese Web document online clustering method based on common substrings, characterized by the following steps:
(1) extracting the common substrings in the Web documents using the generalized suffix array (GSA) algorithm;
(2) building a document feature vector model from the extracted common substrings, and computing the pairwise similarity of the Web documents based on this model to obtain a similarity matrix;
(3) clustering the Web documents with an improved hierarchical clustering algorithm based on this similarity matrix;
(4) during clustering, using the common substring with the largest weight in each cluster as the label of that cluster.
2. The Chinese Web document online clustering method based on common substrings according to claim 1, characterized in that the extraction process using the GSA algorithm in step (1) is as follows: suppose there are N documents in total, each of which can be regarded as a string, giving N strings S_1, S_2, ..., S_N, where N is greater than 1; joining these strings with N-1 special characters gives the string SE = S_1 a_1 S_2 a_2 ... S_{N-1} a_{N-1} S_N, where a_i is an inserted special character and 1 ≤ i ≤ N-1; and for all a_i and S_j,
[formula image F2009102361380C0000011]
where 1 ≤ i ≤ N-1 and 1 ≤ j ≤ N; the suffix array of SE is constructed and the longest common prefix of each pair of adjacent suffixes is computed; every longest common prefix of length at least 1 of two adjacent suffixes is a common substring of the two strings concerned, and proceeding in this way yields all the common substrings of S_1, S_2, ..., S_N.
3. The Chinese Web document online clustering method based on common substrings according to claim 1, characterized in that the document feature vector model built in step (2) is as follows: first, assume the texts to be clustered are {D_1, D_2, ..., D_N}; the sequence of common substrings after filtering is S_1, S_2, ..., S_{n-1}, S_n; the function len(S_k) gives the length of string S_k, where k = 1, 2, ..., n; the function tf(S_k, D_j) gives the frequency with which the common substring S_k occurs in text D_j; the function idf(S_k) gives the inverse document frequency of the common substring S_k; the constant N is the number of results returned by the search engine, that is, the number of texts to be clustered; the function df(S_k) gives the number of texts that contain the common substring S_k;
the feature vector model of document D_j is built as: $D_j = \{ w(S_1, D_j), w(S_2, D_j), \ldots, w(S_n, D_j) \}$, (j = 1, 2, ..., N), that is, the feature vector formed by the common substrings and their weights;
where w(S_k, D_j) is the weight of string S_k with respect to text D_j, k = 1, 2, ..., n;
$w(S_k, D_j) = \log(1 + tf(S_k, D_j)) \times idf(S_k) \times (len(S_k))^{\alpha}$, where $idf(S_k) = \log\left(1 + \frac{N}{df(S_k)}\right)$;
the value of α is determined experimentally to be 1.3.
4. The Chinese Web document online clustering method based on common substrings according to claim 1, characterized in that the improved hierarchical clustering algorithm in step (3) proceeds as follows: suppose the total number of documents to cluster is N, where N > 0; N_i denotes the number of documents still unclassified at step i+1, where i is an integer and i ≥ 0; the set T_i denotes the i-th cluster, whose elements are the indices of the documents classified at step i+1, where i is an integer and i ≥ 0; the clustering method is as follows:
The first clustering step comprises four sub-steps:
i. At this point the number of unclassified documents is N_0 = N. Take the initial threshold to be half of the maximum similarity in the similarity matrix, that is, $\theta_0 = \frac{1}{2} \max_{i,j = 1,2,\ldots,N,\ i \neq j} Sim(D_i, D_j)$;
ii. For any two documents D_i, D_j, if Sim(D_i, D_j) > θ_0, put D_i and D_j into the set T_0, that is, T_0 = T_0 ∪ {D_i, D_j};
iii. If T_0 contains documents whose mutual similarity is less than the threshold θ_0, that is, if for some D_i, D_j ∈ T_0 with i < j, Sim(D_i, D_j) < θ_0, remove from T_0 the one of D_i, D_j with the larger index, that is, T_0 = T_0 - {D_j}, and repeat until no such D_i, D_j remain in T_0;
iv. Put all the elements of T_0 into one class, and take the common substring that occurs most often among the common substrings of these elements as the label of the class; this completes the clustering step;
From the second clustering step onward, the procedure generalizes to step n as follows:
n) At step n, where n is an integer and n ≥ 2, the number of unclassified documents is
[formula image F2009102361380C0000022]
where |T_i| is the number of elements in T_i; if
[formula image F2009102361380C0000023]
then take
[formula image F2009102361380C0000024]
where, for any D_i, D_j ∈ T, sub-steps ii, iii, and iv of the first clustering step are repeated to obtain T_{n-1} and complete the n-th clustering step; clustering then proceeds to step n+1 (following the same process as step n), until
[formula image F2009102361380C0000026]
at which point the documents not yet classified are put into one class labeled "other", and clustering is complete.
5. The Chinese Web document online clustering method based on common substrings according to claim 1, characterized in that, for the convenience of users browsing the results, the total number of cluster classes in step (3) is limited to at most 20.
CN2009102361380A 2009-10-20 2009-10-20 Chinese Web document online clustering method based on common substrings Expired - Fee Related CN101694670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102361380A CN101694670B (en) 2009-10-20 2009-10-20 Chinese Web document online clustering method based on common substrings

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102361380A CN101694670B (en) 2009-10-20 2009-10-20 Chinese Web document online clustering method based on common substrings

Publications (2)

Publication Number Publication Date
CN101694670A true CN101694670A (en) 2010-04-14
CN101694670B CN101694670B (en) 2012-07-04

Family

ID=42093643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102361380A Expired - Fee Related CN101694670B (en) 2009-10-20 2009-10-20 Chinese Web document online clustering method based on common substrings

Country Status (1)

Country Link
CN (1) CN101694670B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1609859A (en) * 2004-11-26 2005-04-27 孙斌 Search result clustering method
CN101464898B (en) * 2009-01-12 2011-09-21 腾讯科技(深圳)有限公司 Method for extracting feature word of text

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004724A (en) * 2010-12-23 2011-04-06 哈尔滨工业大学 Document paragraph segmenting method
CN103123685B (en) * 2011-11-18 2016-03-02 江南大学 Text mode recognition method
CN103123685A (en) * 2011-11-18 2013-05-29 江南大学 Text mode recognition method
CN102682132A (en) * 2012-05-18 2012-09-19 合一网络技术(北京)有限公司 Method and system for searching information based on word frequency, play amount and creation time
CN102693304A (en) * 2012-05-22 2012-09-26 北京邮电大学 Search engine feedback information processing method and search engine
CN102693304B (en) * 2012-05-22 2014-10-22 北京邮电大学 Search engine feedback information processing method and search engine
CN103902599A (en) * 2012-12-27 2014-07-02 北京新媒传信科技有限公司 Fuzzy search method and fuzzy search device
CN103902599B (en) * 2012-12-27 2017-04-05 北京新媒传信科技有限公司 Fuzzy search method and fuzzy search device
CN104346411B (en) * 2013-08-09 2018-11-06 北大方正集团有限公司 Method and equipment for clustering multiple manuscripts
CN104346411A (en) * 2013-08-09 2015-02-11 北大方正集团有限公司 Method and equipment for clustering multiple manuscripts
CN103699567A (en) * 2013-11-04 2014-04-02 北京中搜网络技术股份有限公司 Method for realizing same news clustering based on title fingerprint and text fingerprint
CN104090890B (en) * 2013-12-12 2016-05-04 深圳市腾讯计算机系统有限公司 Keyword similarity acquisition method, device and server
CN104090890A (en) * 2013-12-12 2014-10-08 深圳市腾讯计算机系统有限公司 Method, device and server for obtaining similarity of key words
CN104156418B (en) * 2014-08-01 2015-09-30 北京系统工程研究所 Knowledge reuse based evolutionary clustering method
CN104156418A (en) * 2014-08-01 2014-11-19 北京系统工程研究所 Knowledge reuse based evolutionary clustering method
CN104462301A (en) * 2014-11-28 2015-03-25 北京奇虎科技有限公司 Network data processing method and device
CN104462301B (en) * 2014-11-28 2018-05-04 北京奇虎科技有限公司 Network data processing method and device
CN106202405A (en) * 2016-07-11 2016-12-07 中国人民大学 Compact text extraction method based on text similarity relation
CN106202405B (en) * 2016-07-11 2019-06-25 中国人民大学 Compact text extraction method based on text similarity relation
CN106844748A (en) * 2017-02-16 2017-06-13 湖北文理学院 Text clustering method, device and electronic equipment
CN108763369A (en) * 2018-05-17 2018-11-06 北京奇艺世纪科技有限公司 Video searching method and device
CN109344245A (en) * 2018-06-05 2019-02-15 安徽省泰岳祥升软件有限公司 Text similarity computing method and device
CN109344245B (en) * 2018-06-05 2019-07-23 安徽省泰岳祥升软件有限公司 Text similarity computing method and device
CN109241275B (en) * 2018-07-05 2022-02-11 广东工业大学 Text topic clustering algorithm based on natural language processing
CN109241275A (en) * 2018-07-05 2019-01-18 广东工业大学 Text topic clustering algorithm based on natural language processing
CN109684928B (en) * 2018-11-22 2023-04-11 西交利物浦大学 Chinese document identification method based on internet retrieval
CN109684928A (en) * 2018-11-22 2019-04-26 西交利物浦大学 Chinese document identification method based on internet retrieval
CN110532389A (en) * 2019-08-22 2019-12-03 四川睿象科技有限公司 Text clustering method and device and computing equipment
CN110532389B (en) * 2019-08-22 2023-07-14 北京睿象科技有限公司 Text clustering method and device and computing equipment
CN111753547A (en) * 2020-06-30 2020-10-09 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN111753547B (en) * 2020-06-30 2024-02-27 上海观安信息技术股份有限公司 Keyword extraction method and system for sensitive data leakage detection
CN113128592A (en) * 2021-04-20 2021-07-16 重庆邮电大学 Medical instrument identification analysis method and system for isomerism and storage medium
CN116757807A (en) * 2023-08-14 2023-09-15 湖南华菱电子商务有限公司 Intelligent auxiliary label evaluation method based on optical character recognition
CN116757807B (en) * 2023-08-14 2023-11-14 湖南华菱电子商务有限公司 Intelligent auxiliary label evaluation method based on optical character recognition

Also Published As

Publication number Publication date
CN101694670B (en) 2012-07-04

Similar Documents

Publication Publication Date Title
CN101694670B (en) Chinese Web document online clustering method based on common substrings
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
Froud et al. Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering
Ni et al. Short text clustering by finding core terms
CN106599054B (en) Method and system for classifying and pushing questions
Bouaziz et al. Short text classification using semantic random forest
Al-diabat Arabic text categorization using classification rule mining
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
Xie et al. Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb
CN105426529A (en) Image retrieval method and system based on user search intention positioning
Man Feature extension for short text categorization using frequent term sets
CN102651003A (en) Cross-language searching method and device
CN105912662A (en) Coreseek-based vertical search engine research and optimization method
Sun et al. Towards effective short text deep classification
CN105447119A (en) Text clustering method
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
Bellare et al. Lightly-supervised attribute extraction
Aliguliyev A novel partitioning-based clustering method and generic document summarization
CN105404677A (en) Tree structure based retrieval method
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN102929977B (en) Event tracing method aiming at news website
Aung et al. Random forest classifier for multi-category classification of web pages
Zhen et al. Notice of Retraction: Multi-modal music genre classification approach
Pereira et al. A generic Web‐based entity resolution framework

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120704

Termination date: 20171020
