CN101694670A - Chinese Web document online clustering method based on common substrings - Google Patents
- Publication number
- CN101694670A
- Authority
- CN
- China
- Prior art keywords
- cluster
- document
- clustering
- public substring
- substring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an online clustering method for Chinese Web documents based on common substrings. As the volume of information on the Internet grows rapidly, search engines have become essential for finding and locating information. Web document clustering automatically groups the results returned by a search engine by topic, helping users narrow the query scope and quickly locate the information they need. Online clustering of Web documents has two characteristics: the objects being clustered, Web documents, are non-numeric and unstructured, and the clustering time must meet the requirements of users' online search. To address both, the method comprises the following steps: (1) preprocess the first n query results returned by the search engine, deleting or replacing the non-Chinese characters in them; (2) extract the common substrings of the Web documents using a generalized suffix array (GSA); (3) build a document feature vector model from the extracted common substrings, using a weight formula modeled on TF*IDF; (4) compute the pairwise similarity of the Web documents on the basis of this model to obtain a similarity matrix; (5) cluster the Web documents with an improved hierarchical clustering algorithm on the basis of the matrix; and (6) generate cluster descriptions and extract labels. The method has clear advantages in clustering performance, cluster label generation, and clustering time.
Description
Technical field
The invention belongs to the technical field of information processing. It is a data mining method and relates specifically to an online clustering method for Web documents.
Background technology
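Clustering is essentially a mapping process. Given an object set $O = \{o_1, o_2, \ldots, o_n\}$ and a class set $\pi = \{c_1, c_2, \ldots, c_m\}$, clustering is a mapping from $O$ to $\pi$ that satisfies given conditions (the mapping and its conditions are given as formulas in the original publication).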
As the Internet becomes more widespread and the amount of information on the network grows rapidly, a traditional search engine tends to return so many results that users have difficulty finding the information they actually need. Web document clustering addresses this problem well: it groups the results returned by the search engine by content, so that users can narrow the range of choices and find the information they are interested in quickly.
Web document clustering is an unsupervised form of document classification. It divides a document set into several clusters (subsets) so that the similarity of document content within a cluster is as high as possible and the similarity of document content between different clusters is as low as possible. Compared with general clustering, online clustering of Web documents has two characteristics: first, the objects being clustered are Web documents, which are non-numeric and unstructured; second, the clustering time must meet the requirements of users' online search, so the algorithm must be real-time and interactive.
Research on Web document clustering mainly follows three approaches: clustering based on links, clustering based on text similarity, and clustering based on user feedback. At present, the most common methods for clustering search engine results are algorithms based on document similarity. The idea is to represent each document abstractly as a vector, use the cosine of the angle between vectors as the similarity between documents, and then cluster the documents with some clustering algorithm (such as K-means or STC).
These methods are designed for English information retrieval systems. Chinese has no spaces between words, so they must rely on a word segmentation system, and their results for Chinese information retrieval are poor. The invention proposes an online clustering algorithm for Chinese Web documents that requires no Chinese word segmentation.
Summary of the invention
Technical problems to be solved by the invention:
1. Current Web document clustering methods are designed for English information retrieval systems. Chinese has no spaces between words, so these methods must rely on a word segmentation system, and the quality of the dictionary has a fundamental influence on the clustering result. The invention uses a segmentation-free technique, which avoids the influence of the dictionary and at the same time improves clustering performance;
2. The execution time of online Web document clustering must meet the requirements of users' online search, so the algorithm must be strongly real-time and interactive.
Technical solution adopted by the invention:
The processing flow is divided into the following steps: 1) Web document preprocessing, which deletes or replaces the non-Chinese characters in the search engine results; 2) extraction of the common substrings of the Web documents with GSA, the common substrings then serving as the features of the documents; 3) computation of the pairwise similarity of the documents to be clustered, forming the document similarity matrix; 4) clustering of the documents with a clustering algorithm that uses the similarity matrix; 5) cluster description and label extraction, that is, assigning each category a label that describes it; the label should both summarize the content of the category and distinguish it from the other categories.
Beneficial effects of the invention:
The online clustering method has clear advantages in performance, cluster label generation, and clustering time:
1. Unlike traditional text clustering systems, the proposed online clustering method for Chinese Web documents does not require word segmentation. Instead, it uses the GSA algorithm to extract the common substrings shared by the Web documents, uses them to determine the features of each document, and then uses these features as the feature vectors for the clustering computation. This addresses the problem that Web texts, as clustering objects, are non-numeric and unstructured.
2. To find the common substrings of strings, the invention uses the GSA algorithm, a variant of the suffix tree algorithm. Its time complexity is O(n) and its space complexity is S(n), and it is better than the suffix tree algorithm in space complexity.
3. Traditional hierarchical clustering methods, whether agglomerative or divisive, have high complexity and scale poorly, so they are not suitable for clustering large numbers of documents. The invention therefore optimizes traditional agglomerative hierarchical clustering and obtains a good clustering effect.
4. The invention uses the common substring with the greatest weight as the cluster label, which preserves semantic content and makes the label highly readable.
The following experiments verify the effect of the invention:
The main indicators for the clustering algorithm are the CH value, cluster label validity, and clustering effect.
The CH function is defined in terms of the following quantities: $n_j$ is the number of texts in the j-th cluster; $u_j$ is the centroid of the j-th cluster; $u$ is the centroid of all texts taking part in the clustering; $x_i$ is the i-th text in the corresponding cluster; k is the total number of clusters; and n is the total number of texts. The CH function reflects both the within-cluster distance and the between-cluster distance of a clustering result; the larger the CH value, the better the clustering.
Five keywords were used for retrieval in the experiment. The table below compares the CH values of the proposed model and of the Chinese word segmentation + tf*idf model:
Keyword | Segmentation + tf*idf | Model based on common substrings |
---|---|---|
Apple | 16.916 | 17.007 |
Yao Ming | 14.785 | 20.516 |
The Department of Science and Technology | 13.146 | 16.597 |
Object-oriented | 17.860 | 17.764 |
Data mining | 11.593 | 16.974 |
The experiment shows that the CH value obtained with the model based on common substrings is higher than that of the segmentation + tf*idf model for four of the five keywords ("Object-oriented" is slightly lower, 17.764 against 17.860). The CH values for "Yao Ming" and "Data mining" improve by about 5.7 and 5.4 respectively. The new method therefore gives a better clustering effect than the traditional method.
The validity, that is, the readability, of cluster labels is very important to users: only phrases with a real meaning can serve as cluster labels. Label validity is computed as P = M/N, where M is the number of labels with good readability and N is the total number of labels. The results are shown in Fig. 1: the phrase validity of the new method lies between 0.8 and 0.95, whereas that of the traditional method is mostly below 0.8. The cluster labels obtained with the new method are therefore more readable than those of the conventional model.
Finally, the first 100 results of a Baidu query for the keyword "apple" were used as the Web documents to be clustered; the final result is shown in Fig. 2, which indicates that the proposed method achieves a good clustering effect.
The analysis and comparison of the experimental results show that the proposed online clustering method for Chinese Web documents based on common substrings has clear advantages over Chinese clustering algorithms based on word segmentation in clustering effect, clustering performance, and cluster label readability.
Description of drawings
Fig. 1 is a comparison of label validity;
Fig. 2 shows the clustering result obtained for the input keyword "apple";
Fig. 3 shows the test results for three query terms (apple, Yao Ming, data mining);
Fig. 4 is a flow chart of the online clustering method for Chinese Web documents based on common substrings.
Embodiment:
1. Web document preprocessing
The results returned by a Chinese search engine (such as Baidu) often contain non-Chinese characters such as English letters, spaces, punctuation marks, or garbled characters. Because the invention focuses on clustering Chinese Web documents, the non-Chinese content in the search results has to be replaced before clustering.
The preprocessing stage replaces these non-Chinese characters with a separator predefined by the system. The characters to be replaced mainly include spaces, digits, upper- and lower-case English letters, Chinese and English punctuation marks (both full-width and half-width), and Chinese stop words. After preprocessing, each search result item contains only Chinese characters and is used as the input for common substring extraction.
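The replacement step can be sketched as follows. This is a minimal illustration under stated assumptions: the separator `@`, the two stop words, the use of the CJK Unified Ideographs range `\u4e00-\u9fff` to decide what counts as a Chinese character, and the collapsing of consecutive separators are all choices made here for illustration, not details given in the text.

```python
import re

SEPARATOR = "@"            # assumed separator; the text only says "predefined"
STOP_WORDS = ("的", "了")   # hypothetical stop words, for illustration only

# Any run of characters outside the basic CJK Unified Ideographs block.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(snippet, stop_words=STOP_WORDS, sep=SEPARATOR):
    """Replace stop words and non-Chinese runs with a single separator."""
    for word in stop_words:
        snippet = snippet.replace(word, sep)
    # The separator itself is non-Chinese, so adjacent separators collapse here.
    cleaned = _NON_CHINESE.sub(sep, snippet)
    return cleaned.strip(sep)


result = preprocess("苹果 iPhone 15 的价格,2024!")
```

Here `result` is `苹果@价格`: the Latin letters, digits, spaces, full-width punctuation and the stop word have all been replaced, leaving only Chinese segments between separators.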
2. Common substring extraction based on GSA
● Common substring (CS): if a string u is a substring of both string S and string T, then u is a common substring of S and T. Writing Sub(S, u) to mean that u is a substring of S, the set of common substrings Com(S, T) of S and T may be defined as: $\mathrm{Com}(S, T) = \{u \mid \mathrm{Sub}(S, u) \wedge \mathrm{Sub}(T, u)\}$
● Longest common substring (LCS): the longest common substring of S and T is the substring of maximum length among all common substrings of S and T. If a string u satisfies $u \in \mathrm{Com}(S, T)$ and $\mathrm{len}(u) \ge \mathrm{len}(v)$ for every $v \in \mathrm{Com}(S, T)$, then u is a longest common substring of S and T.
For example, take two strings of length 4, "abac" and "caba". Their non-empty common substrings are "a", "b", "c", "ab", "ba" and "aba", and the longest common substring is "aba".
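The definitions of Com(S, T) and the longest common substring can be checked directly with a brute-force sketch. It enumerates every substring, so it is only meant to make the definitions concrete on short strings; the function names are illustrative.

```python
def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def common_substrings(s, t):
    """Com(S, T): strings that are substrings of both s and t."""
    return substrings(s) & substrings(t)


def longest_common_substrings(s, t):
    """All common substrings of maximum length (there may be ties)."""
    com = common_substrings(s, t)
    best = max((len(u) for u in com), default=0)
    return {u for u in com if len(u) == best}


com = common_substrings("abac", "caba")
lcs = longest_common_substrings("abac", "caba")
```

For the example strings, `com` contains the six substrings "a", "b", "c", "ab", "ba" and "aba", and `lcs` contains only "aba".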
Two classical algorithms are commonly used for the common substring problem: dynamic programming and suffix trees. The former is easy to implement but has high time complexity; the latter has only linear time complexity but is relatively hard to implement. This method uses the generalized suffix array (GSA) algorithm, a variant of the suffix tree algorithm, to extract the common substrings of the texts.
With the GSA algorithm, the time complexity of finding the common substrings of strings is O(n), and its space complexity is S(n), which is better than that of the suffix tree algorithm.
Definitions:
● Suffix: a suffix of a string S is the substring that runs from some position i (i ≤ len(S)) to the last character of S. It is written Suffix(S, i), that is, Suffix(S, i) = subString(S, i, len(S)).
● Suffix array (SA): the suffix array SA corresponds one-to-one with the string S, and each of its elements is an index into S. That is, len(SA) = len(S), SA[i] ∈ {1, 2, ..., len(S)} for 1 ≤ i ≤ len(S), and SA[i] ≠ SA[j] for i ≠ j. The array also satisfies Suffix(S, SA[i]) < Suffix(S, SA[i+1]) for 1 ≤ i < len(S); in other words, it lists the suffixes of S in lexicographic order.
● Generalized suffix array (GSA): the generalized suffix array of several strings $S_1, S_2, \ldots, S_n$ is the suffix array of the new string formed by joining $S_1, S_2, \ldots, S_n$ with special end markers.
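For example, take the two strings S1 = "abac" and S2 = "caba". Joining them with the special character @ gives the string abac@caba. This string has 8 non-empty suffixes that start inside S1 or S2 (the suffix that starts at the separator itself is not counted). The table below lists them in their original order and in lexicographic order; the order shown corresponds to the separator @ sorting after the letters:
Non-empty suffixes before and after sorting
No. | Suffix before sorting | Suffix after sorting | SA entry |
---|---|---|---|
1 | abac@caba | a | 8 |
2 | bac@caba | aba | 6 |
3 | ac@caba | abac@caba | 1 |
4 | c@caba | ac@caba | 3 |
5 | caba | ba | 7 |
6 | aba | bac@caba | 2 |
7 | ba | caba | 5 |
8 | a | c@caba | 4 |
The one-dimensional array SA = [8, 6, 1, 3, 7, 2, 5, 4] is then the generalized suffix array of $S_1$ and $S_2$.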
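The array in this example can be reproduced with a short sketch. Two conventions are assumed here because they are what makes the result agree with SA = [8, 6, 1, 3, 7, 2, 5, 4]: the suffix starting at the separator is left out, and the separator sorts after every ordinary character. The sketch sorts the suffixes naively, so it does not have the linear time complexity attributed to the GSA algorithm; the function name is illustrative.

```python
def generalized_suffix_array(strings, sep="@"):
    """Return (sa, sorted_suffixes) for the strings joined with `sep`.

    Naive construction by sorting suffixes; for illustration only.
    """
    text = sep.join(strings)
    # Suffix start positions, skipping suffixes that begin at a separator.
    starts = [i for i, ch in enumerate(text) if ch != sep]
    # Number the kept suffixes 1..m in their original order.
    number = {pos: n for n, pos in enumerate(starts, start=1)}

    def sort_key(pos):
        # Assumption: the separator sorts after every ordinary character.
        return [(1, 0) if ch == sep else (0, ord(ch)) for ch in text[pos:]]

    ordered = sorted(starts, key=sort_key)
    return [number[p] for p in ordered], [text[p:] for p in ordered]


sa, suffixes = generalized_suffix_array(["abac", "caba"])
```

With these conventions `sa` is `[8, 6, 1, 3, 7, 2, 5, 4]`, and `suffixes` runs from "a" to "c@caba".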
Once the generalized suffix array of the two joined strings has been obtained, the longest common prefix of each pair of adjacent suffixes is computed in turn; every longest common prefix of length at least 1 is a common substring of the two strings.
The algorithm above for two strings extends to N (N > 1) strings. Given N strings $S_1, S_2, \ldots, S_N$, join them with N−1 special characters (which need not be pairwise distinct) to obtain the string $SE = S_1 a_1 S_2 a_2 \cdots S_{N-1} a_{N-1} S_N$, where $a_i$ ($1 \le i \le N-1$) are the inserted special characters and, for all $a_i$ and $S_j$ ($1 \le i \le N-1$, $1 \le j \le N$), $a_i$ does not occur in $S_j$. Construct the suffix array of SE and then compare the longest common prefix of each pair of adjacent suffixes to obtain all common substrings of the N strings $S_1, S_2, \ldots, S_N$.
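The adjacent-suffix comparison for N strings can be sketched as follows. Several simplifications are assumed: each suffix is tagged with the index of the string it comes from instead of physically inserting separators (a longest common prefix then cannot run across a string boundary), suffixes are sorted naively rather than in linear time, and only adjacent suffixes that come from different strings are compared, a filter added here so that repeats inside a single string are not reported. The function name is illustrative.

```python
def common_substrings_gsa(strings):
    """Longest common prefixes of adjacent suffixes from different strings."""
    # One (suffix, source index) entry per suffix of every string, sorted.
    suffixes = sorted(
        (s[i:], doc) for doc, s in enumerate(strings) for i in range(len(s))
    )
    found = set()
    for (a, doc_a), (b, doc_b) in zip(suffixes, suffixes[1:]):
        if doc_a == doc_b:
            continue  # assumption: skip pairs from the same string
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        if n >= 1:
            found.add(a[:n])
    return found


found = common_substrings_gsa(["abac", "caba"])
```

For "abac" and "caba" this yields "aba", "ba" and "c". Every prefix of a reported string (here "a", "ab" and "b") is also a common substring, so the full set can be recovered from these maximal prefixes if needed.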
3. Building the text feature vector model
In text-based information retrieval, the feature vector model of a text is a set made up of certain features of the text. In this model, which is based on common substrings, each document D can be expressed as a feature vector made up of common substrings and their corresponding weights. The following is assumed:
● the texts to be clustered are {D1, D2, ..., DN};
● the sequence of common substrings after filtering is $(S_1, S_2, \ldots, S_{n-1}, S_n)$;
● the function len(Sk) (k = 1, 2, ..., n) is the length of string Sk;
● the function tf(Sk, Dj) is the frequency with which common substring Sk occurs in text Dj; tf is the term frequency commonly used in information retrieval;
● the function idf(Sk) is the inverse document frequency of common substring Sk;
● the constant N is the number of results returned by the search engine, that is, the number of texts to be clustered;
● the function df(Sk) is the number of texts that contain common substring Sk.
Based on these assumptions, document $D_j$ can be expressed as the following vector:
$D_j = \{w(S_1, D_j), w(S_2, D_j), \ldots, w(S_n, D_j)\}, \quad j = 1, 2, \ldots, N$
where $w(S_k, D_j)$ ($k = 1, 2, \ldots, n$) is the weight of common substring $S_k$ with respect to text $D_j$. The weight formula, modeled on TF*IDF, is:
$w(S_k, D_j) = \log(1 + tf(S_k, D_j)) \times idf(S_k) \times (\mathrm{len}(S_k))^{\alpha}$
Within a document, the weight of a common substring is positively correlated with its length: the longer the substring, the larger its weight. In the formula above, the length $\mathrm{len}(S_k)$ of common substring $S_k$ is raised to the power α to amplify the influence of long common substrings on the weight; the specific value of α must be determined by experiment.
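The weight computation can be sketched as below. The text does not give the exact form of idf, so the common choice idf(S_k) = log(N / df(S_k)) is assumed here and labeled as such in the code; counting occurrences with `str.count` (non-overlapping matches) and the name `feature_vectors` are also illustrative choices.

```python
import math


def feature_vectors(docs, substrings, alpha=1.3):
    """One weight vector per document, one entry per common substring.

    w = log(1 + tf) * idf * len(substring) ** alpha
    Assumption: idf = log(N / df); the source does not spell this out.
    """
    n_docs = len(docs)
    # df: number of documents that contain each substring.
    df = {s: sum(1 for d in docs if s in d) for s in substrings}
    vectors = []
    for d in docs:
        vec = []
        for s in substrings:
            tf = d.count(s)  # non-overlapping occurrences of s in d
            if tf == 0:
                vec.append(0.0)
            else:
                idf = math.log(n_docs / df[s])
                vec.append(math.log(1 + tf) * idf * len(s) ** alpha)
        vectors.append(vec)
    return vectors


vectors = feature_vectors(["苹果手机", "苹果电脑", "姚明篮球"], ["苹果", "篮球"])
```

With this form of idf, a substring that occurs in every document gets weight 0, and a larger α increases the weight of longer substrings relative to shorter ones.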
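Once the weights of all common substrings have been calculated with the weight formula, a conventional similarity measure such as cosine similarity can be used to compute the similarity between texts. For cosine similarity, the similarity of two texts is computed as: $\mathrm{Sim}(D_i, D_j) = \dfrac{\sum_{k=1}^{n} w(S_k, D_i)\, w(S_k, D_j)}{\sqrt{\sum_{k=1}^{n} w(S_k, D_i)^2}\, \sqrt{\sum_{k=1}^{n} w(S_k, D_j)^2}}$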
4. Clustering method and implementation
The text feature vector model proposed above yields the similarity matrix between texts shown in the table below:
Similarity matrix
 | D1 | D2 | D3 | … | DN |
---|---|---|---|---|---|
D1 | | Sim(1,2) | Sim(1,3) | … | Sim(1,N) |
D2 | | | Sim(2,3) | … | Sim(2,N) |
… | | | | … | … |
DN | | | | | |
In the table above, $D_i$ (1 ≤ i ≤ N) are the N result items (the documents to be clustered), and Sim(i, j) (1 ≤ i, j ≤ N) is the similarity between result items $D_i$ and $D_j$.
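Building the matrix from weight vectors can be sketched as follows, assuming cosine similarity is the measure used. The sketch computes the upper triangle shown in the table and mirrors it so that `sim[i][j]` can be read in either order; setting the diagonal to 1.0 and returning 0.0 for an all-zero vector are conventions chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Symmetric N x N matrix of pairwise cosine similarities."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # convention: a document is fully similar to itself
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = cosine_similarity(vectors[i], vectors[j])
    return sim


sim = similarity_matrix([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy input the first two vectors are identical and the third is orthogonal to both, so `sim[0][1]` is 1.0 and `sim[0][2]` is 0.0.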
Once the similarity matrix has been obtained, hierarchical clustering can be applied to the result items. Hierarchical clustering keeps the total number of categories small, which helps users locate the information they need quickly, and each category can be subdivided further. Traditional hierarchical clustering methods, whether agglomerative or divisive, have high complexity and scale poorly, so they are not suitable for clustering large numbers of documents. The invention therefore optimizes traditional agglomerative hierarchical clustering.
Suppose the total number of documents to be clustered is N (N > 0), $N_i$ is the number of documents not yet classified at step i+1 (i = 0, 1, ...), and the set $T_i$ is the i-th cluster, whose elements are the numbers of the documents classified at step i+1 (i = 0, 1, ...). For the convenience of users browsing the results, the total number of categories after clustering is limited to at most 20. The clustering method is described in claim 4.
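The threshold-based procedure of claim 4 can be sketched as below. Parts of the claim are not recoverable from the text (the threshold used from the second step onward and the exact stopping condition), so this sketch fills them with assumptions: at every step the threshold is half of the maximum similarity among the documents still unclassified, and the loop stops when fewer than two documents remain, when no positive similarity is left, or when one slot is left for the final "other" class. Label extraction is omitted, and the function name is illustrative.

```python
def threshold_clustering(sim, max_clusters=20):
    """Return (clusters, other) as lists of document indices.

    sim: symmetric matrix of non-negative pairwise similarities.
    Assumed details (not given in the source): per-step threshold and
    stopping rules, see the lead-in.
    """
    remaining = list(range(len(sim)))
    clusters = []
    while len(remaining) > 1 and len(clusters) < max_clusters - 1:
        pairs = [(i, j) for a, i in enumerate(remaining) for j in remaining[a + 1:]]
        # Sub-step i: threshold = half of the maximum similarity.
        theta = max(sim[i][j] for i, j in pairs) / 2
        if theta <= 0:
            break
        # Sub-step ii: collect every document in a pair above the threshold.
        members = sorted({d for i, j in pairs if sim[i][j] > theta for d in (i, j)})
        # Sub-step iii: while some pair is below the threshold,
        # drop the document with the larger index.
        changed = True
        while changed:
            changed = False
            for a, i in enumerate(members):
                for j in members[a + 1:]:
                    if sim[i][j] < theta:
                        members.remove(j)
                        changed = True
                        break
                if changed:
                    break
        # Sub-step iv: the surviving documents form one cluster.
        clusters.append(members)
        remaining = [d for d in remaining if d not in members]
    return clusters, remaining  # `remaining` plays the role of "other"


toy = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.85, 0.1, 0.1],
    [0.8, 0.85, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.1, 0.7, 1.0],
]
clusters, other = threshold_clustering(toy)
```

On the toy matrix the first step uses a threshold of 0.45 and keeps documents 0, 1 and 2 together; the second step groups documents 3 and 4, leaving nothing for the "other" class.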
In this method, the similarity computation is a key factor in the clustering effect. The weight of a common substring grows with the α-th power of its length; if α is too large, the effect of long common substrings is exaggerated and the clustering suffers, so the value of α has to be determined by experiment. α was increased from 0 to 2 in steps of 0.1, and for each value the 100 search results returned for each of several keywords, including "apple", "Yao Ming" and "data mining", were clustered. The evaluation measure is the ratio of within-cluster distance to between-cluster distance. Clustering is best when the within-cluster distance is small and the between-cluster distance is large, so the smaller this ratio, the better the clustering result and the more appropriate the corresponding α. The results are shown in Fig. 3.
The test results for the three keywords show that the ratio of within-cluster to between-cluster distance is smallest when α lies in the interval [1.2, 1.4]. Over a large number of experiments, the mean of the α values at which the ratio is smallest was 1.3, so α is set to 1.3.
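The evaluation measure used to choose α can be sketched as below. The text does not define the ratio precisely, so this sketch assumes one plausible reading: distance is taken as 1 minus similarity, and the ratio is the mean distance over document pairs in the same cluster divided by the mean distance over pairs in different clusters. The function name and the toy data are illustrative.

```python
def intra_inter_ratio(sim, clusters):
    """Mean within-cluster distance divided by mean between-cluster distance.

    Assumptions: distance = 1 - similarity; every document appears in
    exactly one cluster; at least one same-cluster pair and one
    cross-cluster pair exist.
    """
    label = {d: c for c, members in enumerate(clusters) for d in members}
    docs = sorted(label)
    intra, inter = [], []
    for a, i in enumerate(docs):
        for j in docs[a + 1:]:
            dist = 1.0 - sim[i][j]
            (intra if label[i] == label[j] else inter).append(dist)
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


toy = [
    [1.0, 0.8, 0.5, 0.5],
    [0.8, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.6],
    [0.5, 0.5, 0.6, 1.0],
]
ratio = intra_inter_ratio(toy, [[0, 1], [2, 3]])
```

A sweep over α would rebuild the weight vectors and similarity matrix for each value from 0 to 2 in steps of 0.1, cluster the documents, and keep the α with the smallest ratio.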
Claims (5)
1. An online clustering method for Chinese Web documents based on common substrings, characterized by the following steps:
(1) extracting the common substrings of the Web documents using the generalized suffix array (GSA) algorithm;
(2) building a document feature vector model from the extracted common substrings and, based on this model, computing the pairwise similarity of the Web documents to obtain a similarity matrix;
(3) clustering the Web documents with an improved hierarchical clustering algorithm based on the similarity matrix;
(4) during clustering, using the common substring with the greatest weight within a cluster as the label of that cluster.
2. The online clustering method for Chinese Web documents based on common substrings according to claim 1, characterized in that the extraction with the GSA algorithm in step (1) proceeds as follows: suppose there are N documents in total, each of which can be regarded as a string, giving N strings $S_1, S_2, \ldots, S_N$ with N greater than 1; joining these strings with N−1 special characters gives the string $SE = S_1 a_1 S_2 a_2 \cdots S_{N-1} a_{N-1} S_N$, where $a_i$ is an inserted special character and 1 ≤ i ≤ N−1; for all $a_i$, $S_j$, with 1 ≤ i ≤ N−1 and 1 ≤ j ≤ N, $a_i$ does not occur in $S_j$; the suffix array of SE is constructed, and the longest common prefix of each pair of adjacent suffixes is then compared; every longest common prefix of two adjacent suffixes whose length is at least 1 is a common substring of the two strings concerned, and proceeding in this way yields all common substrings of $S_1, S_2, \ldots, S_N$.
3. The online clustering method for Chinese Web documents based on common substrings according to claim 1, characterized in that the document feature vector model built in step (2) is as follows: first suppose that the texts to be clustered are {D1, D2, ..., DN}; the sequence of common substrings after filtering is $S_1, S_2, \ldots, S_{n-1}, S_n$; the function len(Sk) is the length of string Sk, where k = 1, 2, ..., n; the function tf(Sk, Dj) is the frequency with which common substring Sk occurs in text Dj; the function idf(Sk) is the inverse document frequency of common substring Sk; the constant N is the number of results returned by the search engine, that is, the number of texts to be clustered; the function df(Sk) is the number of texts that contain common substring Sk;
The feature vector model of document $D_j$ is built as $D_j = \{w(S_1, D_j), w(S_2, D_j), \ldots, w(S_n, D_j)\}$, $j = 1, 2, \ldots, N$, that is, a feature vector made up of the common substrings and their corresponding weights;
where $w(S_k, D_j)$ is the weight of string $S_k$ with respect to text $D_j$, with k = 1, 2, ..., n;
$w(S_k, D_j) = \log(1 + tf(S_k, D_j)) \times idf(S_k) \times (\mathrm{len}(S_k))^{\alpha}$, where [formula not reproduced]
and the value of α, determined by experiment, is 1.3.
4. The online clustering method for Chinese Web documents based on common substrings according to claim 1, characterized in that the improved hierarchical clustering algorithm in step (3) proceeds as follows: suppose the total number of documents to be clustered is N, where N > 0; $N_i$ is the number of documents not yet classified at step i+1, where i is an integer and i ≥ 0; the set $T_i$ is the i-th cluster, whose elements are the numbers of the documents classified at step i+1, where i is an integer and i ≥ 0; the clustering method is as follows:
The first clustering step comprises four sub-steps:
i. At this point the number of unclassified documents is $N_0 = N$; the initial threshold is set to half of the maximum similarity in the similarity matrix, that is, $\theta_0 = \frac{1}{2} \max \mathrm{Sim}(D_i, D_j)$;
ii. For any two documents $D_i$, $D_j$, if $\mathrm{Sim}(D_i, D_j) > \theta_0$, put $D_i$ and $D_j$ into the set $T_0$, that is, $T_0 = T_0 \cup \{D_i, D_j\}$;
iii. If $T_0$ contains documents whose mutual similarity is below the threshold $\theta_0$, that is, if for some $D_i, D_j \in T_0$ with i < j we have $\mathrm{Sim}(D_i, D_j) < \theta_0$, remove from $T_0$ whichever of $D_i$, $D_j$ has the larger index, that is, $T_0 = T_0 - \{D_j\}$, and repeat until no such $D_i$, $D_j$ remain in $T_0$;
iv. All elements of $T_0$ are now assigned to one class, and the common substring that occurs most often among the common substrings of these elements is taken as the label of the class; this completes the clustering step;
From the second clustering step onward, the procedure generalizes to step n as follows:
n) At step n, where n is an integer and n ≥ 2, the number of unclassified documents is $N_{n-1} = N - \sum_{i=0}^{n-2} |T_i|$, where $|T_i|$ is the number of elements in $T_i$; if [condition not reproduced], take [threshold formula not reproduced], where [formula not reproduced]; for any $D_i, D_j \in T$, repeat sub-steps ii, iii and iv of the first clustering step to obtain $T_{n-1}$ and complete clustering step n, then proceed to clustering step n+1 (same procedure as step n); until [condition not reproduced], at which point the documents not yet classified are assigned to one class labeled "other", and clustering ends.
5. The online clustering method for Chinese Web documents based on common substrings according to claim 1, characterized in that, for the convenience of users browsing the results, the total number of cluster categories in step (3) is limited to at most 20.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009102361380A CN101694670B (en) | 2009-10-20 | 2009-10-20 | Chinese Web document online clustering method based on common substrings |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101694670A true CN101694670A (en) | 2010-04-14 |
CN101694670B CN101694670B (en) | 2012-07-04 |
Family
ID=42093643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2009102361380A Expired - Fee Related CN101694670B (en) | 2009-10-20 | 2009-10-20 | Chinese Web document online clustering method based on common substrings |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101694670B (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102004724A (en) * | 2010-12-23 | 2011-04-06 | 哈尔滨工业大学 | Document paragraph segmenting method |
CN102682132A (en) * | 2012-05-18 | 2012-09-19 | 合一网络技术(北京)有限公司 | Method and system for searching information based on word frequency, play amount and creation time |
CN102693304A (en) * | 2012-05-22 | 2012-09-26 | 北京邮电大学 | Search engine feedback information processing method and search engine |
CN103123685A (en) * | 2011-11-18 | 2013-05-29 | 江南大学 | Text mode recognition method |
CN103699567A (en) * | 2013-11-04 | 2014-04-02 | 北京中搜网络技术股份有限公司 | Method for realizing same news clustering based on title fingerprint and text fingerprint |
CN103902599A (en) * | 2012-12-27 | 2014-07-02 | 北京新媒传信科技有限公司 | Fuzzy search method and fuzzy search device |
CN104090890A (en) * | 2013-12-12 | 2014-10-08 | 深圳市腾讯计算机系统有限公司 | Method, device and server for obtaining similarity of key words |
CN104156418A (en) * | 2014-08-01 | 2014-11-19 | 北京系统工程研究所 | Knowledge reuse based evolutionary clustering method |
CN104346411A (en) * | 2013-08-09 | 2015-02-11 | 北大方正集团有限公司 | Method and equipment for clustering multiple manuscripts |
CN104462301A (en) * | 2014-11-28 | 2015-03-25 | 北京奇虎科技有限公司 | Network data processing method and device |
CN106202405A (en) * | 2016-07-11 | 2016-12-07 | 中国人民大学 | A kind of compactedness Text Extraction based on text similarity relation |
CN106844748A (en) * | 2017-02-16 | 2017-06-13 | 湖北文理学院 | Text Clustering Method, device and electronic equipment |
CN108763369A (en) * | 2018-05-17 | 2018-11-06 | 北京奇艺世纪科技有限公司 | A kind of video searching method and device |
CN109241275A (en) * | 2018-07-05 | 2019-01-18 | 广东工业大学 | A kind of text subject clustering algorithm based on natural language processing |
CN109344245A (en) * | 2018-06-05 | 2019-02-15 | 安徽省泰岳祥升软件有限公司 | Text similarity computing method and device |
CN109684928A (en) * | 2018-11-22 | 2019-04-26 | 西交利物浦大学 | Chinese document recognition methods based on Internal retrieval |
CN110532389A (en) * | 2019-08-22 | 2019-12-03 | 四川睿象科技有限公司 | A kind of Text Clustering Method, device and calculate equipment |
CN111753547A (en) * | 2020-06-30 | 2020-10-09 | 上海观安信息技术股份有限公司 | Keyword extraction method and system for sensitive data leakage detection |
CN113128592A (en) * | 2021-04-20 | 2021-07-16 | 重庆邮电大学 | Medical instrument identification analysis method and system for isomerism and storage medium |
CN116757807A (en) * | 2023-08-14 | 2023-09-15 | 湖南华菱电子商务有限公司 | Intelligent auxiliary label evaluation method based on optical character recognition |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1609859A (en) * | 2004-11-26 | 2005-04-27 | 孙斌 | Search result clustering method |
CN101464898B (en) * | 2009-01-12 | 2011-09-21 | 腾讯科技(深圳)有限公司 | Method for extracting feature word of text |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102004724A (en) * | 2010-12-23 | 2011-04-06 | 哈尔滨工业大学 | Document paragraph segmenting method |
CN103123685B (en) * | 2011-11-18 | 2016-03-02 | 江南大学 | Text mode recognition method |
CN103123685A (en) * | 2011-11-18 | 2013-05-29 | 江南大学 | Text mode recognition method |
CN102682132A (en) * | 2012-05-18 | 2012-09-19 | 合一网络技术(北京)有限公司 | Method and system for searching information based on word frequency, play amount and creation time |
CN102693304A (en) * | 2012-05-22 | 2012-09-26 | 北京邮电大学 | Search engine feedback information processing method and search engine |
CN102693304B (en) * | 2012-05-22 | 2014-10-22 | 北京邮电大学 | Search engine feedback information processing method and search engine |
CN103902599A (en) * | 2012-12-27 | 2014-07-02 | 北京新媒传信科技有限公司 | Fuzzy search method and fuzzy search device |
CN103902599B (en) * | 2012-12-27 | 2017-04-05 | 北京新媒传信科技有限公司 | The method and apparatus of fuzzy search |
CN104346411B (en) * | 2013-08-09 | 2018-11-06 | 北大方正集团有限公司 | The method and apparatus that multiple contributions are clustered |
CN104346411A (en) * | 2013-08-09 | 2015-02-11 | 北大方正集团有限公司 | Method and equipment for clustering multiple manuscripts |
CN103699567A (en) * | 2013-11-04 | 2014-04-02 | 北京中搜网络技术股份有限公司 | Method for realizing same news clustering based on title fingerprint and text fingerprint |
CN104090890B (en) * | 2013-12-12 | 2016-05-04 | 深圳市腾讯计算机系统有限公司 | Keyword similarity acquisition methods, device and server |
CN104090890A (en) * | 2013-12-12 | 2014-10-08 | 深圳市腾讯计算机系统有限公司 | Method, device and server for obtaining similarity of key words |
CN104156418B (en) * | 2014-08-01 | 2015-09-30 | 北京系统工程研究所 | Knowledge reuse based evolutionary clustering method |
CN104156418A (en) * | 2014-08-01 | 2014-11-19 | 北京系统工程研究所 | Knowledge reuse based evolutionary clustering method |
CN104462301A (en) * | 2014-11-28 | 2015-03-25 | 北京奇虎科技有限公司 | Network data processing method and device |
CN104462301B (en) * | 2014-11-28 | 2018-05-04 | 北京奇虎科技有限公司 | Network data processing method and device |
CN106202405A (en) * | 2016-07-11 | 2016-12-07 | 中国人民大学 | Compact text extraction method based on text similarity relation |
CN106202405B (en) * | 2016-07-11 | 2019-06-25 | 中国人民大学 | Compact text extraction method based on text similarity relation |
CN106844748A (en) * | 2017-02-16 | 2017-06-13 | 湖北文理学院 | Text clustering method, device and electronic equipment |
CN108763369A (en) * | 2018-05-17 | 2018-11-06 | 北京奇艺世纪科技有限公司 | Video searching method and device |
CN109344245A (en) * | 2018-06-05 | 2019-02-15 | 安徽省泰岳祥升软件有限公司 | Text similarity computing method and device |
CN109344245B (en) * | 2018-06-05 | 2019-07-23 | 安徽省泰岳祥升软件有限公司 | Text similarity computing method and device |
CN109241275B (en) * | 2018-07-05 | 2022-02-11 | 广东工业大学 | Text topic clustering algorithm based on natural language processing |
CN109241275A (en) * | 2018-07-05 | 2019-01-18 | 广东工业大学 | Text topic clustering algorithm based on natural language processing |
CN109684928B (en) * | 2018-11-22 | 2023-04-11 | 西交利物浦大学 | Chinese document identification method based on internet retrieval |
CN109684928A (en) * | 2018-11-22 | 2019-04-26 | 西交利物浦大学 | Chinese document identification method based on internet retrieval |
CN110532389A (en) * | 2019-08-22 | 2019-12-03 | 四川睿象科技有限公司 | Text clustering method and device and computing equipment |
CN110532389B (en) * | 2019-08-22 | 2023-07-14 | 北京睿象科技有限公司 | Text clustering method and device and computing equipment |
CN111753547A (en) * | 2020-06-30 | 2020-10-09 | 上海观安信息技术股份有限公司 | Keyword extraction method and system for sensitive data leakage detection |
CN111753547B (en) * | 2020-06-30 | 2024-02-27 | 上海观安信息技术股份有限公司 | Keyword extraction method and system for sensitive data leakage detection |
CN113128592A (en) * | 2021-04-20 | 2021-07-16 | 重庆邮电大学 | Medical instrument identification analysis method and system for isomerism and storage medium |
CN116757807A (en) * | 2023-08-14 | 2023-09-15 | 湖南华菱电子商务有限公司 | Intelligent auxiliary label evaluation method based on optical character recognition |
CN116757807B (en) * | 2023-08-14 | 2023-11-14 | 湖南华菱电子商务有限公司 | Intelligent auxiliary label evaluation method based on optical character recognition |
Also Published As
Publication number | Publication date |
---|---|
CN101694670B (en) | 2012-07-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101694670B (en) | | Chinese Web document online clustering method based on common substrings |
CN101593200B (en) | | Method for classifying Chinese webpages based on keyword frequency analysis |
Froud et al. | | Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering |
Ni et al. | | Short text clustering by finding core terms |
CN106599054B (en) | | Method and system for classifying and pushing questions |
Bouaziz et al. | | Short text classification using semantic random forest |
Al-diabat | | Arabic text categorization using classification rule mining |
WO2015149533A1 (en) | | Method and device for word segmentation processing on basis of webpage content classification |
CN108647322B (en) | | Method for identifying similarity of mass Web text information based on word network |
Xie et al. | | Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb |
CN105426529A (en) | | Image retrieval method and system based on user search intention positioning |
Man | | Feature extension for short text categorization using frequent term sets |
CN102651003A (en) | | Cross-language searching method and device |
CN105912662A (en) | | Coreseek-based vertical search engine research and optimization method |
Sun et al. | | Towards effective short text deep classification |
CN105447119A (en) | | Text clustering method |
CN115563313A (en) | | Knowledge graph-based document book semantic retrieval system |
Bellare et al. | | Lightly-supervised attribute extraction |
Aliguliyev | | A novel partitioning-based clustering method and generic document summarization |
CN105404677A (en) | | Tree structure based retrieval method |
CN102722526B (en) | | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method |
CN102929977B (en) | | Event tracing method aiming at news website |
Aung et al. | | Random forest classifier for multi-category classification of web pages |
Zhen et al. | | Notice of Retraction: Multi-modal music genre classification approach |
Pereira et al. | | A generic Web‐based entity resolution framework |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20120704 Termination date: 20171020 |