CN102081598B - Method for detecting duplicated texts - Google Patents

Info

Publication number
CN102081598B
Authority
CN
China
Prior art keywords
text
word
keyword set
weights
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2011100294938A
Other languages
Chinese (zh)
Other versions
CN102081598A (en)
Inventor
李蕾
聂洋
赵青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN2011100294938A
Publication of CN102081598A
Application granted
Publication of CN102081598B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for detecting duplicated texts, comprising the following steps: obtaining a weight for each word from the term frequency (TF) value of the word in a text and from whether the word occurs in the title, and extracting in sequence the words with the highest weights from the text to form a keyword set frame; for any two texts whose keyword set frames have been obtained, judging in turn whether each word in the keyword set frame of one text is in the keyword set frame of the other text, and adding 1 to a matching value representing the matching degree whenever the word belongs to the keyword set frames of both texts and its weights in the two texts match, until the last word in the keyword set frame of the one text has been checked; obtaining the similarity of the two texts from the resulting matching value; and judging whether the two texts are duplicated texts according to the similarity and a similarity threshold. The method can effectively detect texts with duplicated information and improves the efficiency of searching for useful information among a large number of texts.

Description

A method for detecting duplicated texts
Technical field
The present invention relates to the technical field of text processing, and in particular to a method for detecting duplicated texts.
Background art
At present, more and more texts are appearing in all trades and professions, and the amount of information they carry is unbounded. In general, useful information needs to be found quickly among numerous texts. However, many existing texts are duplicates, which seriously reduces the speed at which useful information can be found among them. How to find useful information quickly among numerous texts has therefore become a problem in urgent need of a solution.
Summary of the invention
In view of this, the invention provides a method for detecting duplicated texts that can effectively detect texts whose information is duplicated, thereby improving the efficiency of searching for useful information among numerous texts.
To achieve the above object, the technical scheme of the present invention is realized as follows:
A method for detecting duplicated texts, the method comprising:
For each text:
Taking the term frequency (TF) value of each word in the text as the weight of that word, and adjusting the weights of the words that appear in the title of the text according to the number of sentences in the text; after arranging all words in descending order of weight, taking a certain number of top-ranked words as the keyword set of the text; taking out of the text, in order, the words that belong to the keyword set, and using all the words so taken out, in that order, as the keyword set framework of the text;
For any two texts, text A and text B, whose keyword set frameworks have been obtained:
Judging in turn whether each word in the keyword set framework of text B is in the keyword set framework of text A; if so, judging whether the weights of the word in text A and text B match; if they match, deleting from the keyword set framework of text A the matched word and the words before it, using the framework that remains after the deletion as the keyword set framework of text A, and adding 1 to the matching value that characterizes the matching degree; this continues until the last word in the keyword set framework of text B; wherein judging whether each word in the keyword set framework of text B is in the keyword set framework of text A means: comparing the current word of text B in turn with each word in the keyword set framework of text A, to judge whether the current word is in the keyword set framework of text A;
Calculating the sum of the total number of words in the keyword set framework of text A before the first deletion and the total number of words in the keyword set framework of text B; dividing the obtained matching value by half of that sum, and using the quotient as the similarity that characterizes how similar text A and text B are;
Comparing the obtained similarity with a similarity threshold: when the similarity is greater than the similarity threshold, text A and text B are judged to be duplicated texts; otherwise, text A and text B are judged not to be duplicated texts.
The TF value of each word in a text is the quotient of the total number of times the word occurs in the text and the total number of occurrences of all words in the text.
Adjusting the weights of the words that appear in the title of the text according to the number of sentences in the text comprises:
Multiplying the weight of each word in the title of the text by half of the total number of sentences in the text, and using the product as the weight of that title word.
Taking a certain number of top-ranked words as the keyword set of the text comprises: choosing the number of words according to the length of the text; when the text has 320 words or fewer, the 6 top-ranked words are chosen as the keyword set of the text; when the text has more than 320 words, one keyword is taken for every 70 to 90 words, and that number of top-ranked words is taken as the keyword set of the text.
Judging whether the weights of the word in text A and text B match comprises: when the difference between the weights of the word in the two texts is less than 1/4 of the total number of sentences in one of the texts, the weights of the word in the two texts are judged to match; otherwise, they are judged not to match.
The similarity threshold is determined from the accuracy rate and recall rate obtained by experimental statistics.
The similarity threshold is 0.4 to 0.6.
In summary, the method for detecting duplicated texts adopted by the present invention works as follows. First, the weight of each word is determined from the TF value of the word in the text and from whether the word appears in the title; the top-weighted words are then selected as the keyword set of the text, and all words belonging to the keyword set are taken out of the text in order to form the keyword set framework of the text. Second, for any two texts whose keyword set frameworks have been obtained, a matching value characterizing their matching degree is set up, and each word in the keyword set framework of one text is checked in turn against the keyword set framework of the other text; whenever the word belongs to the keyword set frameworks of both texts and its weights in the two texts match, the matching value is increased by 1, until the last word in the keyword set framework of the one text has been checked. Third, the similarity of the two texts is obtained from the resulting matching value. Finally, whether the two texts are duplicates is determined from the similarity and a similarity threshold. Because the method compares the keyword set frameworks of two texts, and a word found in both frameworks is only counted as truly belonging to both texts when its weights in the two texts also match, the method can effectively detect texts whose information is duplicated; the duplicated texts among numerous texts can then be deleted, which improves the efficiency of searching for useful information among them.
Description of drawings
Fig. 1 is a workflow diagram of an embodiment of the keyword set extraction method of the present invention;
Fig. 2 is a workflow diagram of an embodiment of the text similarity calculation method of the present invention.
Embodiment
To solve the problem existing in the prior art, the present invention proposes a new method for detecting duplicated texts, namely: first, the weight of each word is determined from the TF value of the word in the text and from whether the word appears in the title; the top-weighted words are then selected as the keyword set of the text, and all words belonging to the keyword set are taken out of the text in order to form the keyword set framework of the text. Second, for any two texts whose keyword set frameworks have been obtained, a matching value characterizing their matching degree is set up, and each word in the keyword set framework of one text is checked in turn against the keyword set framework of the other text; whenever the word belongs to the keyword set frameworks of both texts and its weights in the two texts match, the matching value is increased by 1, until the last word in the keyword set framework of the one text has been checked. Third, the similarity of the two texts is obtained from the resulting matching value. Finally, whether the two texts are duplicates is determined from the similarity and a similarity threshold.
Before the concrete implementation is introduced, the concepts of keyword set, keyword set framework, and text similarity are introduced. A keyword set is, for a specific domain, the set of all words that best represent the meaning expressed in that domain; a keyword set framework is, for one text, all the words belonging to the keyword set of the text, taken out of the text in order; text similarity refers to how close two texts are in meaning.
Based on the above, the concrete realization of the scheme of the invention comprises:
For each text:
Taking the term frequency (TF) value of each word in the text as the weight of that word, and adjusting the weights of the words that appear in the title of the text according to the number of sentences in the text; after arranging all words in descending order of weight, taking a certain number of top-ranked words as the keyword set of the text; taking out of the text, in order, the words that belong to the keyword set, and using all the words so taken out, in that order, as the keyword set framework of the text;
For any two texts, text A and text B, whose keyword set frameworks have been obtained:
Judging in turn whether each word in the keyword set framework of text B is in the keyword set framework of text A; if so, judging whether the weights of the word in text A and text B match; if they match, deleting from the keyword set framework of text A the matched word and the words before it, using the framework that remains after the deletion as the keyword set framework of text A, and adding 1 to the matching value that characterizes the matching degree; this continues until the last word in the keyword set framework of text B;
Calculating the sum of the total number of words in the keyword set framework of text A before the first deletion and the total number of words in the keyword set framework of text B; dividing the obtained matching value by half of that sum, and using the quotient as the similarity that characterizes how similar text A and text B are;
Comparing the obtained similarity with a similarity threshold: when the similarity is greater than the similarity threshold, text A and text B are judged to be duplicated texts; otherwise, text A and text B are judged not to be duplicated texts.
To make the object, technical scheme, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Fig. 1 is a workflow diagram of the keyword set extraction method of the present invention. As shown in Fig. 1, the flow comprises:
Step 101: Segment one text into words to obtain the segmented text.
Because existing texts are very numerous and span every field, this step takes one of them and performs word segmentation on it, that is, converts the text from a sequence of sentences into the sequence of all the words it contains; obtaining the segmented text thus means obtaining all the words in the text.
Step 102: Calculate the TF value of each word in the segmented text, and use the calculated TF value of each word as the weight of that word.
In this step, the TF value of a word in the text is the quotient of the total number of times the word occurs in the text and the total number of occurrences of all words in the text.
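The TF calculation of step 102 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `term_frequencies` is hypothetical, and the input is assumed to be a list of words that has already been segmented as in step 101.

```python
from collections import Counter


def term_frequencies(words):
    """Return {word: occurrences of word / total word occurrences}.

    `words` is assumed to be the already segmented text (step 101).
    """
    total = len(words)
    if total == 0:
        # An empty text has no words and therefore no weights.
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}
```

For a segmented text `["a", "b", "a", "c"]`, the word `a` occurs 2 times out of 4 occurrences in total, so its initial weight is 0.5.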
Step 103: Adjust the weights of the words that appear in the title of the text according to the number of sentences in the text.
In general, the words that appear in the title are very important words, so their weights need to be adjusted. In this step the weights are adjusted according to the number of sentences in the text, specifically: the weight of each word in the title is multiplied by half of the total number of sentences in the text, and the product is used as the weight of that title word. In practice, the weights of title words may also be adjusted in other ways, provided that the realization of the embodiment of the invention is not affected.
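The title adjustment of step 103 can be sketched as follows, assuming the weights are held in a dictionary keyed by word. The function name `adjust_title_weights` and its parameters are hypothetical names chosen for illustration; how sentences are counted is left to the caller.

```python
def adjust_title_weights(weights, title_words, sentence_count):
    """Multiply the weight of every title word by half the sentence count.

    weights:        {word: TF weight} for one text
    title_words:    words that occur in the title of that text
    sentence_count: total number of sentences in the text
    """
    factor = sentence_count / 2
    title = set(title_words)
    # Words outside the title keep their TF weight unchanged.
    return {
        word: (weight * factor if word in title else weight)
        for word, weight in weights.items()
    }
```

With 8 sentences the factor is 4, so a title word with TF weight 0.5 ends up with weight 2.0 while a non-title word keeps its original weight.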
Step 104: Arrange all words in descending order of weight.
After the weights of the words appearing in the title have been adjusted, all words can be arranged in descending order of weight; the concrete arrangement can take many forms, provided that the realization of the embodiment of the invention is not affected.
Step 105: According to the length of the text, take out a certain number of top-ranked words, and use all the words taken out as the keyword set of the text.
Texts of different lengths require different amounts of information: a longer text requires more information than a shorter one. In this step, therefore, the number of words in the keyword set is determined by the length of the text. In general, when the text length is below a limit of 280 to 320 words, the 6 top-ranked words are chosen as the keyword set of the text; when the text has more than 320 words, one keyword is taken for every 70 to 90 words (80 is the optimum value), the result is rounded to the nearest integer, and that number of top-ranked words is taken as the keyword set of the text. For example, if the length of the text is 670 (taking one keyword per 80 words), then 670 ÷ 80 = 8.375 ≈ 8, so the keyword set of the text contains 8 words, that is, the 8 words with the highest weights are taken as the keyword set of the text.
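The keyword count rule of step 105 can be sketched as below. The function name `keyword_count` and its parameter names are hypothetical; the defaults (limit 320, 80 words per keyword, 6 keywords for short texts) are the values given in the text, and treating a length of exactly 320 as "short" is an assumption, since the source does not settle the boundary case.

```python
def keyword_count(text_length, short_limit=320, words_per_keyword=80,
                  short_count=6):
    """Return how many top-weighted words form the keyword set.

    Assumption: a text of exactly `short_limit` words counts as short.
    """
    if text_length <= short_limit:
        return short_count
    # Round to the nearest integer, with halves rounded up. int(x + 0.5)
    # is used because Python's built-in round() rounds halves to the
    # nearest even number.
    return int(text_length / words_per_keyword + 0.5)
```

This reproduces the worked example in the text: a length of 670 gives 670 / 80 = 8.375, which rounds to 8 keywords.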
Step 106: Take out of the text, in order, the words that belong to the keyword set, and use all the words taken out, in that order, as the keyword set framework of the text.
After the keyword set of the text has been obtained, all words belonging to the keyword set must also be taken out of the text in order, specifically: scan from the beginning of the text; if a scanned word is present in the keyword set, take it out; otherwise, do not take it out. When the scan reaches the end of the text, all the words taken out, in order, form the keyword set framework of the text.
It should be noted that when the keyword set framework of a text is obtained, the weight of each word in the framework must also be kept, that is, every word in the resulting keyword set framework carries its weight as an attribute.
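Steps 104 to 106 can be sketched together as follows. The names `top_keywords` and `keyword_framework` are hypothetical, and representing the framework as a list of `(word, weight)` pairs is one possible way to keep the weight attribute described above. Ties in weight are broken by dictionary order here, which the source does not specify.

```python
def top_keywords(weights, k):
    """Return the set of the k words with the highest weights (steps 104-105)."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return {word for word, _ in ranked[:k]}


def keyword_framework(words, weights, k):
    """Scan the text in order and keep every occurrence of a keyword (step 106).

    words:   the segmented text, in reading order
    weights: {word: adjusted weight} for this text
    Returns a list of (word, weight) pairs in text order.
    """
    keywords = top_keywords(weights, k)
    return [(word, weights[word]) for word in words if word in keywords]
```

Because the scan keeps the order of the text, a keyword that occurs several times appears several times in the framework, which is what the later order-sensitive matching relies on.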
This completes the whole workflow of the keyword set extraction method of the present invention.
It should be noted that the extraction flow shown in Fig. 1 applies to one text; in practical application, it must be carried out for every text that is to be checked for duplication, so as to obtain the keyword set framework of each text.
After the keyword set frameworks of the texts have been obtained, the similarity of the texts can be calculated from them; see the workflow of the embodiment of the text similarity calculation method given in Fig. 2. This flow is explained with two texts, text A and text B, as an example. Suppose the question is whether text B is similar to text A; text A then serves as the text compared against, and text B as the text being compared. As shown in Fig. 2, the flow comprises:
Step 201: Take the first word in the keyword set framework of text B as the current word, and set the current matching value, which characterizes the matching degree of text A and text B, to 0.
It should be noted that in this step the matching value describes the matching degree of text A and text B in an absolute sense, and its initial value should be 0.
Steps 202-203: Compare the current word of text B in turn with each word in the keyword set framework of text A to judge whether the current word is in the keyword set framework of text A; if so, execute step 204; otherwise, execute step 206.
Step 204: When the current word is in the keyword set framework of text A, further judge whether the weight of the current word in text A matches its weight in text B; if so, execute step 205; otherwise, execute step 206.
Because every word in a keyword set framework carries its weight as an attribute, when the current word is judged to be in the keyword set framework of text A as well, it must further be judged whether its weight in text A matches its weight in text B, specifically: when the difference between the weights of the current word in text A and text B is less than 1/5 to 1/3 (optimum value 1/4) of the total number of sentences in text A or text B, the weights are judged to match; otherwise, they are judged not to match.
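The weight test of step 204 can be sketched as a single comparison. The function name `weights_match` is hypothetical, the absolute difference is an assumption about what "difference" means, and because the text allows the sentence count of either text A or text B, the caller chooses which one to pass in.

```python
def weights_match(weight_a, weight_b, sentence_count, fraction=0.25):
    """Return True when the two weights of a word are close enough.

    sentence_count: total sentences in text A or text B (caller's choice)
    fraction:       1/5 to 1/3 according to the text, 1/4 being the optimum
    """
    return abs(weight_a - weight_b) < fraction * sentence_count
```

With 8 sentences and the default fraction the tolerance is 2.0, so weights 1.0 and 2.5 match, while weights 1.0 and 3.0 do not because the comparison is strict.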
Step 205: Delete from the keyword set framework of text A the matched current word and all words before it, use the framework remaining after the deletion as the keyword set framework of text A, and add 1 to the current matching value of text A and text B to obtain the new current matching value.
When the weight of the current word in text A matches its weight in text B, the matched current word and all words before it are deleted from the keyword set framework of text A, and the remaining framework is used as the keyword set framework of text A; at the same time, the current matching value of text A and text B is increased by 1.
Step 206: Judge whether the current word is the last word in the keyword set framework of text B; if so, execute step 207; otherwise, execute step 208.
Step 207: Calculate the sum of the total number of words in the keyword set framework of text A before the first deletion and the total number of words in the keyword set framework of text B; divide the current matching value by half of that sum; use the quotient as the similarity of text A and text B, and end the whole workflow.
When the current word is the last word in the keyword set framework of text B, the current matching value could be used as the final matching value of text A and text B. However, a matching value of 3 when the keyword set frameworks of both texts contain no more than 5 words each clearly means something different from a matching value of 3 when both frameworks contain more than 10 words each; that is, the matching value alone does not reflect the similarity of two texts accurately. In this step, therefore, the current matching value obtained in step 205 is used to derive the similarity of text A and text B, specifically: calculate the sum of the total number of words in the keyword set framework of text A before the first deletion and the total number of words in the keyword set framework of text B; divide the current matching value by half of that sum; use the quotient as the similarity of text A and text B, and end the whole workflow.
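The normalization of step 207 can be written as a formula. The symbols are introduced here for illustration only: M is the final matching value, |F_A| is the number of words in the keyword set framework of text A before the first deletion, and |F_B| is the number of words in the keyword set framework of text B. The numbers in the comments are made-up values chosen to mirror the contrast described above.

```latex
\[
\mathrm{sim}(A,B) \;=\; \frac{M}{\tfrac{1}{2}\,\bigl(|F_A| + |F_B|\bigr)}
\]
% Illustrative values (not taken from the source):
%   M = 3, |F_A| = |F_B| = 5:   sim = 3 / 5  = 0.6
%   M = 3, |F_A| = |F_B| = 12:  sim = 3 / 12 = 0.25
```

The same matching value of 3 therefore gives a high similarity for two short frameworks and a low one for two long frameworks, which is the reason the matching value is not used directly.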
Step 208: Take the next word after the current word in the keyword set framework of text B as the current word, and return to step 202, until the current word is the last word in the keyword set framework of text B.
This completes the whole workflow of the text similarity calculation method of the present invention.
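The loop of steps 201 to 208 can be sketched as one function. This is an interpretation under stated assumptions rather than the definitive procedure: the name `framework_similarity` is hypothetical, each framework is assumed to be a list of `(word, weight)` pairs in text order, the current word of text B is compared with the first occurrence of that word in what remains of the framework of text A, and two empty frameworks are given similarity 0.0.

```python
def framework_similarity(frame_a, frame_b, sentence_count, fraction=0.25):
    """Similarity of two keyword set frameworks (steps 201-208).

    frame_a, frame_b: lists of (word, weight) pairs in text order
    sentence_count:   total sentences in text A or text B (caller's choice)
    fraction:         weight tolerance factor, 1/4 by default
    """
    total = len(frame_a) + len(frame_b)  # sizes before any deletion
    if total == 0:
        return 0.0  # assumption: nothing to compare means no similarity
    remaining = list(frame_a)  # working copy; frame_a itself is not modified
    matching_value = 0         # step 201
    for word_b, weight_b in frame_b:  # steps 206 and 208: walk through B
        # Steps 202-203: look for the current word in what is left of A.
        index = next(
            (i for i, (word_a, _) in enumerate(remaining) if word_a == word_b),
            None,
        )
        if index is None:
            continue
        # Step 204: the weights in both texts must also match.
        weight_a = remaining[index][1]
        if abs(weight_a - weight_b) < fraction * sentence_count:
            # Step 205: drop the matched word and everything before it.
            remaining = remaining[index + 1:]
            matching_value += 1
    # Step 207: divide the matching value by half of the combined size.
    return matching_value / (total / 2)
```

Deleting the matched word together with the words before it makes the comparison order-sensitive: a keyword of text B only counts if it appears in text A after the previously matched keyword.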
After the similarity of two texts has been obtained, whether the texts are duplicates can be judged from it. In this embodiment, the similarity of the texts is compared with a similarity threshold: when the similarity is greater than the threshold, the two texts are judged to be duplicates; otherwise, they are judged not to be duplicates.
Once it is known whether texts are duplicates, the duplicated texts can be handled accordingly, for example deleted, so that the number of duplicated texts among numerous texts is reduced and the efficiency of finding useful information among them is improved.
It should be noted that in this embodiment the similarity threshold is determined from the accuracy rate and recall rate obtained by experimental statistics. The accuracy rate reflects what percentage of the judged results are correct; the recall rate reflects what percentage of all expected results (also called the standard answers) are judged correctly. Experimental analysis shows that, when both accuracy rate and recall rate are reasonably good, a similarity threshold of 0.4 to 0.6 is appropriate; this embodiment can adopt 0.5 as the similarity threshold.
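How a threshold might be evaluated against hand-labelled text pairs can be sketched as follows. The source does not describe its experiment, so this is only an assumed procedure: the function name `accuracy_and_recall` is hypothetical, the accuracy rate is computed as what is usually called precision (correct duplicate judgments divided by all duplicate judgments), and any labelled pairs passed in are the caller's own data.

```python
def accuracy_and_recall(scored_pairs, threshold):
    """Evaluate one similarity threshold on labelled text pairs.

    scored_pairs: list of (similarity, is_duplicate) tuples, where
                  is_duplicate is the hand-assigned expected result
    Returns (accuracy_rate, recall_rate).
    """
    judged_duplicate = 0    # pairs the threshold judges to be duplicates
    expected_duplicate = 0  # pairs labelled as duplicates
    correct = 0             # pairs that are both
    for similarity, is_duplicate in scored_pairs:
        judged = similarity > threshold  # strict comparison, as in the text
        judged_duplicate += judged
        expected_duplicate += is_duplicate
        correct += judged and is_duplicate
    accuracy_rate = correct / judged_duplicate if judged_duplicate else 0.0
    recall_rate = correct / expected_duplicate if expected_duplicate else 0.0
    return accuracy_rate, recall_rate
```

Running this for several candidate thresholds between 0.4 and 0.6 and keeping one where both rates are acceptable would be one way to arrive at a value such as 0.5.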
In summary, the method for detecting duplicated texts adopted by the present invention works as follows. First, the weight of each word is determined from the TF value of the word in the text and from whether the word appears in the title; the top-weighted words are then selected as the keyword set of the text, and all words belonging to the keyword set are taken out of the text in order to form the keyword set framework of the text. Second, for any two texts whose keyword set frameworks have been obtained, a matching value characterizing their matching degree is set up, and each word in the keyword set framework of one text is checked in turn against the keyword set framework of the other text; whenever the word belongs to the keyword set frameworks of both texts and its weights in the two texts match, the matching value is increased by 1, until the last word in the keyword set framework of the one text has been checked. Third, the similarity of the two texts is obtained from the resulting matching value. Finally, whether the two texts are duplicates is determined from the similarity and a similarity threshold. Because the method compares the keyword set frameworks of two texts, and a word found in both frameworks is only counted as truly belonging to both texts when its weights in the two texts also match, the method can effectively detect texts whose information is duplicated; the duplicated texts among numerous texts can then be deleted, which improves the efficiency of searching for useful information among them.
The above is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present invention shall be included within the scope of protection of the present invention.

Claims (7)

1. A method for detecting duplicated texts, characterized in that the method comprises:
For each text:
Taking the term frequency (TF) value of each word in the text as the weight of that word, and adjusting the weights of the words that appear in the title of the text according to the number of sentences in the text; after arranging all words in descending order of weight, taking a certain number of top-ranked words as the keyword set of the text; taking out of the text, in order, the words that belong to the keyword set, and using all the words so taken out, in that order, as the keyword set framework of the text;
For any two texts, text A and text B, whose keyword set frameworks have been obtained:
Judging in turn whether each word in the keyword set framework of text B is in the keyword set framework of text A; if so, judging whether the weights of the word in text A and text B match; if they match, deleting from the keyword set framework of text A the matched word and the words before it, using the framework that remains after the deletion as the keyword set framework of text A, and adding 1 to the matching value that characterizes the matching degree; this continues until the last word in the keyword set framework of text B; wherein judging whether each word in the keyword set framework of text B is in the keyword set framework of text A means: comparing the current word of text B in turn with each word in the keyword set framework of text A, to judge whether the current word is in the keyword set framework of text A;
Calculating the sum of the total number of words in the keyword set framework of text A before the first deletion and the total number of words in the keyword set framework of text B; dividing the obtained matching value by half of that sum, and using the quotient as the similarity that characterizes how similar text A and text B are;
Comparing the obtained similarity with a similarity threshold: when the similarity is greater than the similarity threshold, text A and text B are judged to be duplicated texts; otherwise, text A and text B are judged not to be duplicated texts.
2. The method according to claim 1, characterized in that the TF value of each word in a text is the quotient of the total number of times the word occurs in the text and the total number of occurrences of all words in the text.
3. The method according to claim 1, characterized in that adjusting the weights of the words that appear in the title of the text according to the number of sentences in the text comprises:
Multiplying the weight of each word in the title of the text by half of the total number of sentences in the text, and using the product as the weight of that title word.
4. The method according to claim 3, characterized in that taking a certain number of top-ranked words as the keyword set of the text comprises: choosing the number of words according to the length of the text; when the text has 320 words or fewer, the 6 top-ranked words are chosen as the keyword set of the text; when the text has more than 320 words, one keyword is taken for every 70 to 90 words, and that number of top-ranked words is taken as the keyword set of the text.
5. The method according to claim 1, characterized in that judging whether the weights of the word in text A and text B match comprises: when the difference between the weights of the word in the two texts is less than 1/4 of the total number of sentences in one of the texts, the weights of the word in the two texts are judged to match; otherwise, they are judged not to match.
6. The method according to claim 1, characterized in that the similarity threshold is determined from the accuracy rate and recall rate obtained by experimental statistics.
7. The method according to claim 6, characterized in that the similarity threshold is 0.4 to 0.6.
CN2011100294938A 2011-01-27 2011-01-27 Method for detecting duplicated texts Active CN102081598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100294938A CN102081598B (en) 2011-01-27 2011-01-27 Method for detecting duplicated texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011100294938A CN102081598B (en) 2011-01-27 2011-01-27 Method for detecting duplicated texts

Publications (2)

Publication Number Publication Date
CN102081598A CN102081598A (en) 2011-06-01
CN102081598B true CN102081598B (en) 2012-07-04

Family

ID=44087566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100294938A Active CN102081598B (en) 2011-01-27 2011-01-27 Method for detecting duplicated texts

Country Status (1)

Country Link
CN (1) CN102081598B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591976A (en) * 2012-01-04 2012-07-18 复旦大学 Text characteristic extracting method and document copy detection system based on sentence level
CN103176962B (en) * 2013-03-08 2015-11-04 深圳先进技术研究院 Statistical method and system for text similarity
CN104239285A (en) * 2013-06-06 2014-12-24 腾讯科技(深圳)有限公司 New article chapter detecting method and device
CN106528581B (en) * 2015-09-15 2019-05-07 阿里巴巴集团控股有限公司 Text detection method and device
CN106528508A (en) * 2016-10-27 2017-03-22 乐视控股(北京)有限公司 Repeated text judgment method and apparatus
CN107133218A (en) * 2017-05-26 2017-09-05 北京惠商之星网络科技有限公司 Intelligent trade name matching method, system and computer-readable recording medium
CN110147443B (en) * 2017-08-03 2021-04-27 北京国双科技有限公司 Topic classification judging method and device
CN110019660A (en) * 2017-08-06 2019-07-16 北京国双科技有限公司 Similar text detection method and device
US11232132B2 (en) 2018-11-30 2022-01-25 Wipro Limited Method, device, and system for clustering document objects based on information content
CN113779222A (en) * 2021-09-14 2021-12-10 北京捷风数据技术有限公司 Method, system and storage medium for matching bid winning information based on contract information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101076800A (en) * 2004-08-23 2007-11-21 汤姆森环球资源公司 Repetitive file detecting and displaying function
CN101408893A (en) * 2008-11-26 2009-04-15 哈尔滨工业大学 Method for rapidly clustering documents

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8046363B2 (en) * 2006-04-13 2011-10-25 Lg Electronics Inc. System and method for clustering documents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陶跃华.基于向量的相似度计算方案.《云南师范大学学报》.2001,第21卷(第5期),17-19. *

Also Published As

Publication number Publication date
CN102081598A (en) 2011-06-01

Similar Documents

Publication Publication Date Title
CN102081598B (en) Method for detecting duplicated texts
CN106528642B (en) Short text classification method based on TF-IDF feature extraction
CN102662952B (en) Chinese text parallel data mining method based on hierarchy
CN106407484B (en) Video tag extraction method based on barrage semantic association
CN104881458B (en) Annotation method and device for web page topics
CN103473262B (en) Automatic Web comment opinion classification system and classification method based on association rules
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN102955857B (en) Class center compression transformation-based text clustering method in search engine
CN105005553A (en) Emotional thesaurus based short text emotional tendency analysis method
CN104077407B (en) Intelligent data search system and method
CN103399901A (en) Keyword extraction method
CN102194012B (en) Microblog topic detecting method and system
CN103970733B (en) Chinese new word identification method based on graph structure
CN104268160A (en) Evaluation object extraction method based on domain dictionary and semantic roles
CN101127042A (en) Sensibility classification method based on language model
CN101630312A (en) Clustering method for question sentences in question-and-answer platform and system thereof
CN105022805A (en) Emotional analysis method based on SO-PMI (Semantic Orientation-Pointwise Mutual Information) commodity evaluation information
US20190163737A1 (en) Method and apparatus for constructing binary feature dictionary
CN103345528A (en) Text classification method based on correlation analysis and KNN
CN103324745A (en) Text garbage identifying method and system based on Bayesian model
CN106202313B (en) Search result towards academic Meta Search Engine synthesizes sort method
CN102063424A (en) Method for Chinese word segmentation
CN101540017A (en) Feature extraction method based on byte level n-gram and junk mail filter
CN104317965A (en) Establishment method of emotion dictionary based on linguistic data
CN102081667A (en) Chinese text classification method based on Base64 coding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant