CN105224518B - Text similarity calculation method and system and similar text search method and system - Google Patents

Info

Publication number
CN105224518B
Authority
CN
China
Prior art keywords
text
frequency
vocabulary
word
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410270637.2A
Other languages
Chinese (zh)
Other versions
CN105224518A (en
Inventor
刘健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201410270637.2A priority Critical patent/CN105224518B/en
Publication of CN105224518A publication Critical patent/CN105224518A/en
Application granted granted Critical
Publication of CN105224518B publication Critical patent/CN105224518B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention relates to the technical field of computers, and discloses a method and a system for calculating text similarity and a method and a system for searching similar texts. The text similarity calculation method comprises the following steps: acquiring a first text and a second text which need to be subjected to text similarity calculation; respectively carrying out vocabulary segmentation on the first text and the second text to obtain a first vocabulary set and a second vocabulary set; deleting stop words in the first vocabulary set and the second vocabulary set respectively to obtain a third vocabulary set and a fourth vocabulary set; extracting high-frequency words in the third vocabulary set and the fourth vocabulary set respectively to form a fifth vocabulary set and a sixth vocabulary set; and calculating the text similarity of the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set. By implementing the embodiment of the invention, the accuracy of text similarity calculation can be improved, and the efficiency of searching similar texts can be improved.

Description

Text similarity calculation method and system and similar text search method and system
Technical Field
The invention relates to the technical field of computers, in particular to a method and a system for calculating text similarity and a method and a system for searching similar texts.
Background
The text similarity calculation and the search of similar texts have wide application in the fields of paper anti-plagiarism, website anti-counterfeiting and the like, for example:
1. Counterfeit website identification. Taking counterfeit Industrial and Commercial Bank of China (ICBC) websites as an example, if the content of a website is found to be close to the content of the official ICBC website (http://www.icbc.com.cn), the website can be regarded as a counterfeit website.
2. Paper plagiarism identification. A paper is compared with other papers in a paper library to judge whether plagiarism exists.
3. Product recommendation. For example, when a user purchases a book introducing computer operating systems on a website, the product recommendation system can automatically recommend other books with similar content.
4. Near-duplicate removal. A search engine automatically removes similar web pages from its results so as to provide more useful information to the user.
The common text similarity calculation methods in the prior art include the following methods:
Scheme 1, the longest common string algorithm: assuming that the lengths of two strings are n and m respectively and that the length of their longest common string is c, the similarity is c/MIN(n, m), i.e. c divided by the smaller of n and m. For example, the two texts "我叫张三" ("My name is Zhang San") and "我叫李四" ("My name is Li Si") have the longest common string "我叫" ("My name is"), so the similarity is 2/MIN(4, 4) = 2/4 = 0.5.
Scheme 2, the minimum edit distance algorithm: the minimum edit distance is the minimum number of edits (insert, delete, and modify operations) required to convert one string into another. In the example above, "张" (Zhang) must be changed to "李" (Li) and "三" (San) to "四" (Si), for 2 edits. Assuming that the lengths of the two strings are n and m respectively and that the minimum edit distance is d, the similarity is 1 - d/MIN(n, m).
After the text similarity is calculated, the similarity may be compared with a threshold (e.g., 0.8), and if the similarity exceeds the threshold, the text is considered similar.
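The two prior-art measures above can be sketched as follows. This is a minimal illustration rather than code from the patent: the function names are hypothetical, non-empty inputs are assumed, and the sample strings are an assumed reconstruction of the four-character example ("My name is Zhang San" / "My name is Li Si").

```python
def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous string shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insert, delete, and modify operations (Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # modify or keep
        prev = curr
    return prev[-1]


def scheme1_similarity(a: str, b: str) -> float:
    """c / MIN(n, m); assumes both strings are non-empty."""
    return longest_common_substring_len(a, b) / min(len(a), len(b))


def scheme2_similarity(a: str, b: str) -> float:
    """1 - d / MIN(n, m); assumes both strings are non-empty."""
    return 1 - edit_distance(a, b) / min(len(a), len(b))


# Assumed sample strings, four characters each.
s1 = scheme1_similarity("我叫张三", "我叫李四")
s2 = scheme2_similarity("我叫张三", "我叫李四")
is_similar = s1 > 0.8  # comparison against the example threshold of 0.8
```

Both measures work on character positions, which is why reordering words changes the result so strongly.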
Various text similarity calculation methods in the prior art have some problems:
Both scheme 1 and scheme 2 are easy to bypass: simply transposing words, sentences, or paragraphs greatly lowers the computed similarity, so the accuracy is low. For example, take two nine-character texts with the same substantive content, "his current name is Zhang San", that differ only in word order. Using scheme 1, the longest common string ("the name of") has length 3, so the similarity is only 3/9 = 0.33. Using scheme 2, the minimum edit distance is 9, so the similarity is 1 - 9/9 = 0. The text similarity calculated by the prior-art methods is low, and the two texts may be considered dissimilar.
In summary, the text similarity calculation method in the prior art has the problem of low accuracy, and is not beneficial to finding out similar texts of the text to be detected from the text library.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a method and a system for calculating text similarity, and a method and a system for searching similar texts, which are used for improving the accuracy of text similarity calculation and are beneficial to searching similar texts of texts to be detected from a text library.
The embodiment of the invention provides a text similarity calculation method, which comprises the following steps:
acquiring a first text and a second text which need to be subjected to text similarity calculation;
performing vocabulary segmentation on the first text to obtain a first vocabulary set, and performing vocabulary segmentation on the second text to obtain a second vocabulary set;
deleting stop words in the first vocabulary set to obtain a third vocabulary set, and deleting stop words in the second vocabulary set to obtain a fourth vocabulary set;
extracting high-frequency words from the third vocabulary set to form a fifth vocabulary set, and extracting high-frequency words from the fourth vocabulary set to form a sixth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TF-IDF) value is higher than a first threshold;
and calculating the text similarity of the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set.
Correspondingly, the embodiment of the invention also provides a method for searching similar texts, which comprises the following steps:
acquiring a data structure of high-frequency words and text numbers, wherein the data structure comprises information of each high-frequency word in a text library and the number of a corresponding text comprising each high-frequency word;
carrying out vocabulary segmentation on the third text to obtain a seventh vocabulary set;
deleting stop words in the seventh vocabulary set to obtain an eighth vocabulary set;
extracting high-frequency words in the eighth vocabulary set to form a ninth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TF-IDF) value is higher than a second threshold;
searching for similar text of the third text using the data structure and the ninth vocabulary set.
Correspondingly, an embodiment of the present invention further provides a system for calculating text similarity, including:
the first acquiring unit is used for acquiring a first text and a second text which need to be subjected to text similarity calculation;
the first segmentation unit is used for carrying out vocabulary segmentation on the first text to obtain a first vocabulary set and carrying out vocabulary segmentation on the second text to obtain a second vocabulary set;
the first deleting unit is used for deleting stop words in the first vocabulary set to obtain a third vocabulary set and deleting stop words in the second vocabulary set to obtain a fourth vocabulary set;
the first extraction unit is used for extracting the high-frequency words in the third vocabulary set to form a fifth vocabulary set, and extracting the high-frequency words in the fourth vocabulary set to form a sixth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TF-IDF) value is higher than a first threshold;
and the calculating unit is used for calculating the text similarity of the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set.
Correspondingly, the embodiment of the present invention further provides a system for searching for similar texts, including:
the second acquisition unit is used for acquiring a data structure of high-frequency words and text numbers, and the data structure comprises information of each high-frequency word in a text library and the number of a corresponding text comprising each high-frequency word;
the second segmentation unit is used for carrying out vocabulary segmentation on the third text to obtain a seventh vocabulary set;
the second deleting unit is used for deleting stop words in the seventh vocabulary set to obtain an eighth vocabulary set;
the second extraction unit is used for extracting the high-frequency words in the eighth vocabulary set to form a ninth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TF-IDF) value is higher than a second threshold;
and the searching unit is used for searching the similar text of the third text by utilizing the data structure and the ninth vocabulary set.
According to the method and the system for calculating the text similarity and the method and the system for searching the similar text, the text is subjected to word segmentation, stop word deletion and high-frequency word extraction, and similarity calculation or similar text search is performed by using the high-frequency words of the text, so that the accuracy of text similarity calculation can be improved, and the efficiency of searching the similar text can be improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for calculating text similarity according to a first embodiment of the present invention;
Fig. 2 is a first schematic flowchart of a method for searching for similar texts according to a second embodiment of the present invention;
Fig. 3 is a second schematic flowchart of the method for searching for similar texts according to the second embodiment of the present invention;
Fig. 4 is a third schematic flowchart of the method for searching for similar texts according to the second embodiment of the present invention;
Fig. 5 is a fourth schematic flowchart of the method for searching for similar texts according to the second embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a system for calculating text similarity according to a third embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a system for searching for similar texts according to a fourth embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one:
the embodiment of the invention provides a method for calculating text similarity, which can comprise the following steps of:
101. acquiring a first text and a second text which need to be subjected to text similarity calculation;
in this embodiment, the first text and the second text may refer to any two texts for which text similarity calculation is required;
it should be noted that the text in the present embodiment may include a document (e.g., an article, a paper, a book, etc.), a web page, etc., and the type of the text is not specifically limited herein;
102. performing vocabulary segmentation on the first text to obtain a first vocabulary set, and performing vocabulary segmentation on the second text to obtain a second vocabulary set;
for example, in this embodiment, various existing word segmentation systems may be used to segment a text. For English, since words are separated by spaces, segmentation may be performed on the spaces; for Chinese, segmentation may be implemented with the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) system of the Institute of Computing Technology of the Chinese Academy of Sciences, or with the word segmentation system of the SOSO search engine of Tencent. No specific limitation is placed here on which word segmentation method or system is used;
103. deleting stop words in the first vocabulary set to obtain a third vocabulary set, and deleting stop words in the second vocabulary set to obtain a fourth vocabulary set;
prior to 103, the method may further comprise: acquiring a stop word list;
the stop word list defines which words are stop words to be deleted; for example, the stop word list may include common words without actual meaning such as "you", "me", "he", "in", and "is". In this embodiment, which stop words the list contains can be set by the user as needed and is not specifically limited herein;
104. extracting the high-frequency words in the third vocabulary set to form a fifth vocabulary set, and extracting the high-frequency words in the fourth vocabulary set to form a sixth vocabulary set; a high-frequency word is a word whose TF-IDF (term frequency-inverse document frequency) value in the text library is higher than a first threshold;
the TF-IDF value represents the importance of a word: if a word has a high term frequency (TF) in a text and rarely appears in other texts, its TF-IDF value is large, and the word is considered to have good category-distinguishing ability and to be suitable for classification;
the method for calculating the TF-IDF value belongs to the prior art and is not described herein again;
the specific value of the first threshold may be set by a user as appropriate, and the specific value is not specifically limited herein;
105. and calculating the text similarity of the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set.
According to the text similarity calculation method provided by the embodiment of the invention, the high-frequency words are extracted from the text, and the text similarity is calculated by using the high-frequency words included in the text, so that the influence of factors such as the sequence of each word and paragraph in the text on the text similarity is eliminated, and the accuracy of text similarity calculation is improved.
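Steps 102 to 104 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, segmentation is plain whitespace splitting (adequate only for space-delimited languages), the text library is assumed to be non-empty and already segmented, and the TF-IDF formula shown is just one common variant, since the description leaves the formula to the prior art.

```python
import math
from collections import Counter


def segment(text):
    """Whitespace segmentation; adequate only for space-delimited languages."""
    return text.lower().split()


def remove_stop_words(words, stop_words):
    """Drop every word that appears in the stop word list."""
    return [w for w in words if w not in stop_words]


def tf_idf(words, library):
    """TF-IDF per word, with tf = count / len(words) and idf = log(N / (1 + df)).

    This particular formula is an assumption; other variants exist.
    `library` is a non-empty list of word lists, one per text in the library.
    """
    doc_sets = [set(doc) for doc in library]
    n_texts = len(doc_sets)
    scores = {}
    for word, count in Counter(words).items():
        df = sum(1 for doc in doc_sets if word in doc)
        scores[word] = (count / len(words)) * math.log(n_texts / (1 + df))
    return scores


def high_frequency_words(text, stop_words, library, threshold):
    """Steps 102-104: segment, delete stop words, keep words above the threshold."""
    words = remove_stop_words(segment(text), stop_words)
    return {w for w, score in tf_idf(words, library).items() if score > threshold}


# Hypothetical three-text library and input text.
library = [
    ["bank", "login", "account"],
    ["movie", "review", "account"],
    ["paper", "review", "method"],
]
keywords = high_frequency_words("The bank bank account login", {"the"}, library, 0.15)
```

Because the result is a set, word order in the input text no longer affects anything computed from it.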
For example, step 105 may include:
calculating the text similarity of the first text and the second text as S = C/min(A, B) according to the fifth vocabulary set and the sixth vocabulary set;
wherein A is the number of words in the fifth vocabulary set, B is the number of words in the sixth vocabulary set, and C is the number of words common to the fifth vocabulary set and the sixth vocabulary set.
Of course, those skilled in the art may make other variations or modifications to the above formula for calculating text similarity, and the specific form of the formula is not limited herein.
The following describes a method for calculating text similarity according to this embodiment with a specific example:
for example, text 1 has the content "Zhang San's major is computer", and text 2 has the content "Li Si's major is literature and film";
first, vocabulary segmentation is performed on texts 1 and 2, giving the vocabulary set {Zhang San, 's, major, is, computer} for text 1 and the vocabulary set {Li Si, 's, major, is, literature, and, film} for text 2;
second, stop words are deleted: assuming that the stop word list includes "'s", "is", and "and", the vocabulary set of text 1 becomes {Zhang San, major, computer} and the vocabulary set of text 2 becomes {Li Si, major, literature, film};
third, high-frequency words are extracted: assuming that "Zhang San", "major", "computer", "Li Si", "literature", and "film" are all high-frequency words, the high-frequency vocabulary set of text 1 is {Zhang San, major, computer} and that of text 2 is {Li Si, major, literature, film};
fourth, the text similarity is calculated: text 1 contains 3 high-frequency words, text 2 contains 4, and the two texts share 1 high-frequency word, so the text similarity of text 1 and text 2 is S12 = 1/min(3, 4) = 1/3 = 0.33.
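The formula S = C/min(A, B) and the worked example can be reproduced with a short sketch. The function name is hypothetical, the English word labels are assumed renderings of the example's high-frequency words, and returning 0.0 for an empty set is an assumption, since the description does not define that case.

```python
def text_similarity(fifth: set, sixth: set) -> float:
    """S = C / min(A, B): A and B are the set sizes, C the number of shared words."""
    if not fifth or not sixth:
        return 0.0  # assumption: the empty case is not defined in the description
    return len(fifth & sixth) / min(len(fifth), len(sixth))


# High-frequency vocabulary sets of text 1 and text 2 (assumed English labels).
text1 = {"Zhang San", "major", "computer"}
text2 = {"Li Si", "major", "literature", "film"}
s12 = text_similarity(text1, text2)  # 1 / min(3, 4)
```

Dividing by the smaller set means that a short text fully contained in a longer one scores 1.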
Generally, text similarity detection needs to compare a text against a large number of samples in order to find the samples similar to it. For example, in a counterfeit website detection system, the contents of a large number of legitimate websites (such as those of the Industrial and Commercial Bank of China, Tencent, and Taobao) are collected in advance, and the system must judge whether the content of a newly appearing website is similar to that of a collected website; in a paper plagiarism detection system, a large number of academic papers are collected in advance, and the system must judge whether a newly appearing paper is similar to a collected paper.
Assume that the number of currently collected texts is N. With the existing schemes, the text to be detected must be compared with the N texts in the library one by one. When N is small (e.g., tens or hundreds), the overall computational overhead is also small. In practical large-scale applications, however, N is very large (for example, the number of scientific and technical papers published worldwide is in the tens of millions, and the number of websites in China is in the millions), and the computational cost is considerable. If one text similarity calculation takes 0.1 ms on average and N is 1,000,000, checking a single text takes 100 s, which means that a single machine can process only 0.01 texts per second; the efficiency is obviously too low.
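The cost estimate above works out as follows, writing t for the average time of one comparison and T for the time needed to check one text against the whole library (the symbols t and T are introduced here only for illustration):

```latex
T = N \cdot t = 10^{6} \times 0.1\,\text{ms} = 10^{5}\,\text{ms} = 100\,\text{s},
\qquad
\text{throughput} = \frac{1}{T} = \frac{1}{100\,\text{s}} = 0.01\ \text{texts per second}.
```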
The second embodiment below addresses text similarity calculation when the number of samples is large: it finds the texts similar to the text to be detected in the text library while increasing the calculation speed, as described below.
Example two:
the embodiment of the present invention provides a method for searching for a similar text, as shown in fig. 2, the method may include:
201. acquiring a data structure of high-frequency words and text numbers, wherein the data structure comprises information of each high-frequency word in a text library and the number of a corresponding text comprising each high-frequency word;
202. carrying out vocabulary segmentation on the third text to obtain a seventh vocabulary set;
the specific vocabulary segmentation method may be the same as the description of the foregoing embodiment, and is not repeated herein;
203. deleting stop words in the seventh vocabulary set to obtain an eighth vocabulary set;
the specific stop word deleting method may be the same as the description of the foregoing embodiment, and is not described herein again;
204. extracting the high-frequency words in the eighth word set to form a ninth word set;
wherein, the high-frequency vocabulary is the vocabulary with TFIDF value higher than the second threshold value in the text library;
the specific value of the second threshold may be set by a user as appropriate, and the specific value is not specifically limited herein;
in one embodiment, the second threshold may be equal to the first threshold;
205. and searching for similar texts of the third text by using the data structure and the ninth vocabulary set.
In the method for searching for similar texts provided by this embodiment, after the third text is segmented, the stop word is deleted, and the high frequency word is extracted, the similar text of the third text can be searched by using the data structure of the high frequency word and the document number, and it is not necessary to compare each text in the text library with the third text one by one, so that the searching speed can be increased, and the effect is improved more obviously when the number of texts in the text library is larger.
Preferably, as shown in fig. 3, before the step 201, the method may further include:
200. and sequentially carrying out vocabulary segmentation, stop word deletion and high-frequency vocabulary extraction on each text in the text library to generate a data structure of high-frequency vocabularies and text numbers.
For example, as shown in fig. 4, 200 may include:
200A, initially setting n to 1;
200B, performing vocabulary segmentation on the nth text in the text library;
200C, deleting stop words from the words obtained by segmenting the nth text;
200D, extracting high-frequency words from the vocabulary set of the nth text obtained after the stop words are deleted, to obtain the high-frequency vocabulary set of the nth text;
200E, judging whether n is greater than or equal to the total number N of texts in the text library; if so, executing 200G, otherwise executing 200F;
200F, setting n = n + 1 and returning to 200B to process the next text;
200G, integrating the high-frequency vocabulary sets of all the texts to generate the data structure of high-frequency words and text numbers.
For example, as shown in fig. 5, the 205 may include:
205A, searching a text which comprises at least one same high-frequency vocabulary as the third text by using the data structure;
205B, calculating the text similarity S = F/min(D, E) between the third text and each text that includes at least one high-frequency word in common with the third text; D represents the number of high-frequency words in the ninth vocabulary set, E represents the number of high-frequency words included in a fourth text, and F represents the number of high-frequency words shared by the third text and the fourth text; the fourth text is any text that includes at least one high-frequency word in common with the third text;
that is, the same calculation method as that of the foregoing embodiment can be adopted here to calculate the text similarity between texts;
205C, determining the text with the text similarity meeting the preset condition with the third text as the similar text of the third text.
In this embodiment, a text whose text similarity with the third text meets a preset condition is defined as a similar text of the third text, where the preset condition may include:
1. the similarity is greater than or equal to a third threshold; the specific value of the third threshold may also be set according to actual needs, for example 0.7 or 0.9; or
2. the text with the highest similarity, or the first M texts with the highest similarity; m is a preset integer.
Of course, the preset condition may have other forms, and those skilled in the art can make appropriate settings according to actual needs, and is not limited specifically herein.
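The two preset conditions listed above can be sketched as a single selection function. The function and parameter names are hypothetical, the similarity scores are made-up sample values, and breaking ties by the smaller text number is an assumption that the description does not specify.

```python
from typing import Dict, List, Optional


def select_similar(scores: Dict[int, float],
                   threshold: Optional[float] = None,
                   top_m: Optional[int] = None) -> List[int]:
    """Return the text numbers that meet the preset condition.

    threshold: keep texts whose similarity is >= threshold (condition 1).
    top_m: keep the M texts with the highest similarity (condition 2).
    """
    # Highest similarity first; ties broken by text number (an assumption).
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    if threshold is not None:
        ranked = [n for n in ranked if scores[n] >= threshold]
    if top_m is not None:
        ranked = ranked[:top_m]
    return ranked


# Hypothetical similarity scores keyed by text number.
scores = {1: 0.25, 3: 0.75, 4: 0.5, 5: 0.5}
by_threshold = select_similar(scores, threshold=0.7)
best_one = select_similar(scores, top_m=1)
best_three = select_similar(scores, top_m=3)
```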
It should be noted that the data structure of the high-frequency vocabulary and the text number may have various forms, such as "inverted index", "signature file", "suffix tree", and the like, and the specific form of the data structure is not limited in this embodiment.
The data structure of high-frequency words and text numbers is described below, taking the inverted index as an example:
assuming that the high-frequency vocabulary set of text 1 is {Zhang San, major, computer}, the following inverted index is generated:
High-frequency word    Text number
Zhang San              1
Major                  1
Computer               1
Each row in the table indicates which texts a given high-frequency word appears in.
Assuming that the high-frequency vocabulary set of text 2 is {Li Si, major, literature, film}, it is added to the above inverted index, resulting in:
High-frequency word    Text number
Zhang San              1
Major                  1, 2
Computer               1
Li Si                  2
Literature             2
Film                   2
Then, the high-frequency vocabulary sets of texts 3, 4, ..., N are added to the inverted index in the same way to form the inverted index structure of high-frequency words and text numbers.
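Building the inverted index from per-text high-frequency vocabulary sets can be sketched as follows. The function name is hypothetical, a plain dictionary stands in for whatever storage a real system would use, and the English word labels are assumed renderings of the two example texts.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(high_freq_sets: Dict[int, Set[str]]) -> Dict[str, List[int]]:
    """Map each high-frequency word to the numbers of the texts that contain it."""
    index = defaultdict(list)
    # Visit texts in numeric order so each posting list ends up sorted.
    for text_no in sorted(high_freq_sets):
        for word in high_freq_sets[text_no]:
            index[word].append(text_no)
    return dict(index)


# High-frequency vocabulary sets of texts 1 and 2 (assumed English labels).
library = {
    1: {"Zhang San", "major", "computer"},
    2: {"Li Si", "major", "literature", "film"},
}
index = build_inverted_index(library)
```

Adding a further text only appends its number to the posting lists of its own high-frequency words, so the index can grow one text at a time.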
After all texts in a text library are processed to form an inverted index structure of high-frequency words and text numbers, for a third text needing to search for the most similar text, after a ninth high-frequency word set of the third text is obtained, for each high-frequency word in the ninth high-frequency word set, searching for a corresponding text comprising the high-frequency word, and forming a similar text set; then calculating the similarity between each text in the similar text set and the third text; and determining the text with the similarity meeting the preset condition as the similar text of the third text.
For example, assume that the high-frequency vocabulary set of a text D is {a, b, c, d} and that the corresponding texts obtained from the inverted index are texts 1, 2, 3, 4, 5, 6, 7, and 8, where the high-frequency word a appears in texts 1, 3, and 5, the high-frequency word b appears in texts 2 and 3, the high-frequency word c appears in texts 4, 5, and 8, and the high-frequency word d appears in texts 3, 4, 6, and 7.
Traversing all the high-frequency words shows that text 3 occurs 3 times, text 4 and text 5 each occur 2 times, and the other texts each occur 1 time. Assuming that the high-frequency vocabulary sets of texts 1 to 8 all have size 10, then
the similarity of each text to text D is:
SD-1 (the similarity of text D to text 1) = 1/min(4, 10) = 0.25;
SD-2=1/min(4,10)=0.25;
SD-3=3/min(4,10)=0.75;
SD-4=2/min(4,10)=0.5;
SD-5=2/min(4,10)=0.5;
SD-6=1/min(4,10)=0.25;
SD-7=1/min(4,10)=0.25;
SD-8=1/min(4,10)=0.25;
Assuming that the preset condition is "the text with the highest similarity", text 3 is determined to be the similar text of text D, with a similarity of 0.75.
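The lookup in this example can be sketched as follows. The function and variable names are hypothetical; the inverted index, the set size of 10, and the query set {a, b, c, d} are taken from the example, and only texts that share at least one high-frequency word with the query are ever scored.

```python
from collections import Counter
from typing import Dict, List, Set


def search_similar(index: Dict[str, List[int]],
                   query_words: Set[str],
                   set_sizes: Dict[int, int]) -> Dict[int, float]:
    """Score every candidate text with S = F / min(D, E).

    index: high-frequency word -> numbers of the texts containing it.
    set_sizes: text number -> size E of that text's high-frequency vocabulary set.
    """
    shared = Counter()  # F: number of shared high-frequency words per text
    for word in query_words:
        for text_no in index.get(word, []):
            shared[text_no] += 1
    d = len(query_words)  # D: size of the query's high-frequency vocabulary set
    return {n: f / min(d, set_sizes[n]) for n, f in shared.items()}


index = {"a": [1, 3, 5], "b": [2, 3], "c": [4, 5, 8], "d": [3, 4, 6, 7]}
set_sizes = {n: 10 for n in range(1, 9)}
scores = search_similar(index, {"a", "b", "c", "d"}, set_sizes)
best = max(scores, key=scores.get)  # text with the highest similarity
```

The work done is proportional to the total length of the posting lists touched, not to the number of texts in the library.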
Example three:
an embodiment of the present invention further provides a system for calculating text similarity, as shown in fig. 6, the system may include:
a first obtaining unit 601, configured to obtain a first text and a second text that need to be subjected to text similarity calculation;
a first segmentation unit 602, configured to perform vocabulary segmentation on a first text to obtain a first vocabulary set, and perform vocabulary segmentation on a second text to obtain a second vocabulary set;
a first deleting unit 603, configured to delete a stop word in the first vocabulary set to obtain a third vocabulary set, and delete a stop word in the second vocabulary set to obtain a fourth vocabulary set;
a first extracting unit 604, configured to extract high-frequency words from the third vocabulary set to form a fifth vocabulary set, and extract high-frequency words from the fourth vocabulary set to form a sixth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TF-IDF) value is higher than a first threshold;
the calculating unit 605 is configured to calculate the text similarity between the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set.
The system for calculating the text similarity provided by the embodiment extracts the high-frequency words from the text, calculates the text similarity according to the number of the same high-frequency words included in the text to be detected, eliminates the influence of factors such as the sequence of each word and paragraph in the text on the text similarity, and improves the accuracy of text similarity calculation.
Preferably, the first obtaining unit may be further configured to obtain a stop word list; for example, the stop word list may include common words without actual meaning such as "you", "I", "he", "in", and "is".
For example, the calculating unit 605 may be specifically configured to calculate the text similarity S = C/min(A, B) of the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set;
wherein A is the number of words in the fifth vocabulary set, B is the number of words in the sixth vocabulary set, and C is the number of words common to the fifth vocabulary set and the sixth vocabulary set.
Of course, those skilled in the art may make other variations or modifications to the above formula for calculating text similarity, and the specific form of the formula is not limited herein.
Example four:
an embodiment of the present invention further provides a system for searching for a similar text, as shown in fig. 7, the system may include:
a second obtaining unit 701, configured to obtain a data structure of high-frequency words and text numbers, where the data structure includes information of each high-frequency word in a text library and a number of a corresponding text including each high-frequency word;
a second segmentation unit 702, configured to perform vocabulary segmentation on the third text to obtain a seventh vocabulary set;
a second deleting unit 703, configured to delete the stop word in the seventh vocabulary set, so as to obtain an eighth vocabulary set;
a second extracting unit 704, configured to extract the high-frequency words in the eighth vocabulary set to form a ninth vocabulary set; a high-frequency word is a word whose TF-IDF value in the text library is higher than a second threshold;
the specific value of the second threshold may be set by a user as appropriate, and the specific value is not specifically limited herein;
in one embodiment, the second threshold may be equal to the first threshold;
the searching unit 705 is configured to search for a similar text of the third text by using the data structure and the ninth vocabulary set.
The similar text searching system provided in this embodiment segments the third text, deletes its stop words, and extracts its high-frequency words, and then searches for similar texts of the third text by using the data structure of high-frequency words and text numbers. It is therefore unnecessary to compare each text in the text library with the third text one by one, so the search speed is increased, and the improvement becomes more pronounced as the number of texts in the text library grows.
Further, the system for searching for similar texts provided by this embodiment may further include:
a processing unit, configured to sequentially perform vocabulary segmentation, stop-word deletion, and high-frequency word extraction on each text in the text library to generate the data structure of high-frequency words and text numbers.
It should be noted that the data structure of the high-frequency vocabulary and the text number may have various forms, such as "inverted index", "signature file", "suffix tree", and the like, and the specific form of the data structure is not limited in this embodiment.
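Taking the inverted index as one of the forms named above, the data structure can be sketched as a mapping from each high-frequency word to the numbers of the texts that contain it. The function name and the input format (a dictionary from text number to that text's high-frequency word set) are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(high_freq_sets):
    """Map each high-frequency word to the set of text numbers containing it.

    high_freq_sets: {text_number: set of high-frequency words of that text}
    """
    index = defaultdict(set)
    for text_no, words in high_freq_sets.items():
        for word in words:
            index[word].add(text_no)
    return dict(index)
```

With this layout, looking up one high-frequency word of the third text returns every candidate text that shares it, without scanning the texts that do not.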
Specifically, the search unit may include:
a searching subunit, configured to search, by using the data structure, a text that includes at least one same high-frequency word as the third text;
a calculating subunit, configured to calculate the text similarity S = F/min(D, E) between the third text and each text that includes at least one high-frequency word in common with the third text; D represents the number of high-frequency words in the ninth vocabulary set, E represents the number of high-frequency words included in a fourth text, and F represents the number of high-frequency words that the third text and the fourth text have in common; the fourth text is any text that includes at least one high-frequency word in common with the third text;
a determining subunit, configured to determine a text whose text similarity with the third text meets a preset condition as a similar text of the third text.
In this embodiment, a text whose text similarity with the third text meets a preset condition is defined as a similar text of the third text, where the preset condition may include:
1. the similarity is greater than or equal to a third threshold; the specific value of the third threshold may also be set according to actual needs, for example, to 0.7 or 0.9; or
2. the text with the highest similarity, or the first M texts with the highest similarity; m is a preset integer.
Of course, the preset condition may have other forms, and those skilled in the art can make appropriate settings according to actual needs, and is not limited specifically herein.
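The three subunits can be sketched together as follows. This is only an illustration under stated assumptions: the data structure is taken to be an inverted index from word to text numbers, `set_sizes` (the number of high-frequency words of each library text, i.e. E) is assumed to be stored alongside it, and the function and parameter names are hypothetical.

```python
from collections import Counter


def find_similar(ninth_set, index, set_sizes, threshold=None, top_m=None):
    """Return (text_number, S) pairs for texts similar to the third text.

    ninth_set: set of high-frequency words of the third text
    index:     {word: set of text numbers containing that word}
    set_sizes: {text_number: number of high-frequency words in that text}
    """
    # Searching subunit: count shared words F per candidate text.
    shared = Counter()
    for word in ninth_set:
        for text_no in index.get(word, ()):
            shared[text_no] += 1

    # Calculating subunit: S = F / min(D, E) for each candidate.
    d = len(ninth_set)
    scored = [(text_no, f / min(d, set_sizes[text_no]))
              for text_no, f in shared.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))

    # Determining subunit: apply the preset condition(s).
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] >= threshold]
    if top_m is not None:
        scored = scored[:top_m]
    return scored
```

Only texts that share at least one high-frequency word with the third text are ever scored, which is the source of the speed-up over one-by-one comparison. Passing `threshold=0.7` corresponds to condition 1, and `top_m=1` or `top_m=M` to condition 2.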
It should be noted that the above embodiments belong to the same inventive concept, and the description of each embodiment has a different emphasis, and reference may be made to the description in other embodiments where the description in individual embodiments is not detailed.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The method and system for calculating text similarity, the method and system for searching similar texts provided by the embodiment of the invention are described in detail, a specific example is applied in the text to explain the principle and the implementation mode of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (6)

1. A method for searching for similar texts, characterized in that the method is used to solve the problem of text similarity calculation when the number of samples is large, by searching a text library for texts similar to a text under test while improving the calculation speed; the method comprises the following steps:
sequentially carrying out vocabulary segmentation, stop word deletion and high-frequency vocabulary extraction on each text in the text library to generate a data structure of high-frequency vocabularies and text numbers; the method comprises the following steps:
A. initially setting n to 1;
B. carrying out vocabulary segmentation on the nth text in the text library;
C. deleting the stop words from the words obtained by segmenting the nth text;
D. extracting high-frequency words from the vocabulary set of the nth text obtained after the stop words are deleted, to obtain a high-frequency vocabulary set of the nth text;
E. judging whether n is greater than or equal to the total number N of texts in the text library; if so, executing step G; otherwise, executing step F;
F. setting n = n + 1 and returning to step B to process the next text;
G. synthesizing the high-frequency vocabulary set of each text to generate a data structure of the high-frequency vocabulary and the text number;
acquiring a data structure of high-frequency words and text numbers, wherein the data structure comprises information of each high-frequency word in a text library and the number of a corresponding text comprising each high-frequency word;
carrying out vocabulary segmentation on the third text to obtain a seventh vocabulary set;
deleting stop words in the seventh vocabulary set to obtain an eighth vocabulary set;
extracting high-frequency words in the eighth vocabulary set to form a ninth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TF-IDF) value is higher than a second threshold;
searching for similar texts of the third text by using the data structure and the ninth vocabulary set; the method comprises the following steps:
searching for a text comprising at least one same high-frequency vocabulary as the third text by using the data structure;
calculating the text similarity S = F/min(D, E) between the third text and each text that includes at least one high-frequency word in common with the third text; D represents the number of high-frequency words in the ninth vocabulary set, E represents the number of high-frequency words included in a fourth text, and F represents the number of high-frequency words that the third text and the fourth text have in common; the fourth text is any text that includes at least one high-frequency word in common with the third text;
determining a text with the text similarity meeting preset conditions with the third text as a similar text of the third text;
wherein the preset conditions include: the similarity is greater than or equal to a third threshold; or the text with the highest similarity, or the top M texts with the highest similarity; m is a preset integer.
2. The method of claim 1, further comprising, after the vocabulary segmentation of the nth text in the text library:
acquiring a stop-word list; wherein the stop-word list defines the stop words to be deleted, namely common words without substantive meaning.
3. The method according to claim 1 or 2, wherein the data structure of the high frequency vocabulary and text numbers is in the form of an "inverted index", "signature file" or "suffix tree".
4. A system for finding similar text, comprising:
the processing unit is used for sequentially carrying out vocabulary segmentation, stop word deletion and high-frequency vocabulary extraction on each text in the text library to generate a data structure of high-frequency vocabularies and text numbers;
the processing unit is further to: A. initially setting n to 1;
B. carrying out vocabulary segmentation on the nth text in the text library;
C. deleting the stop words from the words obtained by segmenting the nth text;
D. extracting high-frequency words from the vocabulary set of the nth text obtained after the stop words are deleted, to obtain a high-frequency vocabulary set of the nth text;
E. judging whether n is greater than or equal to the total number N of texts in the text library; if so, executing step G; otherwise, executing step F;
F. setting n = n + 1 and returning to step B to process the next text;
G. synthesizing the high-frequency vocabulary set of each text to generate a data structure of the high-frequency vocabulary and the text number;
a second acquiring unit, configured to acquire a data structure of high-frequency words and text numbers, wherein the data structure comprises information on each high-frequency word in the text library and the number of each corresponding text that includes the high-frequency word;
the second segmentation unit is used for carrying out vocabulary segmentation on the third text to obtain a seventh vocabulary set;
the second deleting unit is used for deleting stop words in the seventh vocabulary set to obtain an eighth vocabulary set;
a second extracting unit, configured to extract the high-frequency words in the eighth vocabulary set to form a ninth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TF-IDF) value is higher than a second threshold;
a searching unit, configured to search for a similar text of the third text by using the data structure and the ninth vocabulary set;
the search unit includes:
the searching subunit is used for searching a text which comprises at least one same high-frequency vocabulary as the third text by using the data structure;
a calculating subunit, configured to calculate the text similarity S = F/min(D, E) between the third text and each text that includes at least one high-frequency word in common with the third text; D represents the number of high-frequency words in the ninth vocabulary set, E represents the number of high-frequency words included in a fourth text, and F represents the number of high-frequency words that the third text and the fourth text have in common; the fourth text is any text that includes at least one high-frequency word in common with the third text;
a determining subunit, configured to determine, as a similar text of the third text, a text whose text similarity to the third text satisfies a preset condition, where the preset condition includes: the similarity is greater than or equal to a third threshold; or the text with the highest similarity, or the top M texts with the highest similarity; m is a preset integer.
5. The system according to claim 4, wherein the first obtaining unit is further configured to obtain a stop-word list; wherein the stop-word list defines the stop words to be deleted, namely common words without substantive meaning.
6. The system according to claim 4 or 5, wherein the data structure of the high frequency vocabulary and text number is in the form of an "inverted index", "signature file" or "suffix tree".
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410270637.2A CN105224518B (en) 2014-06-17 2014-06-17 Text similarity calculation method and system and similar text search method and system

Publications (2)

Publication Number Publication Date
CN105224518A CN105224518A (en) 2016-01-06
CN105224518B true CN105224518B (en) 2020-03-17

Family

ID=54993497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410270637.2A Active CN105224518B (en) 2014-06-17 2014-06-17 Text similarity calculation method and system and similar text search method and system

Country Status (1)

Country Link
CN (1) CN105224518B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273391A (en) * 2016-04-08 2017-10-20 北京国双科技有限公司 Document recommends method and apparatus
CN106528507B (en) * 2016-10-25 2018-12-18 中南林业科技大学 A kind of detection method and detection device of Chinese text similarity
CN107085568B (en) * 2017-03-29 2022-11-22 腾讯科技(深圳)有限公司 Text similarity distinguishing method and device
CN107491425A (en) * 2017-07-26 2017-12-19 合肥美的智能科技有限公司 Determine method, determining device, computer installation and computer-readable recording medium
CN109325509B (en) * 2017-07-31 2023-01-17 北京国双科技有限公司 Similarity determination method and device
CN110019660A (en) * 2017-08-06 2019-07-16 北京国双科技有限公司 A kind of Similar Text detection method and device
CN107870976A (en) * 2017-09-25 2018-04-03 平安科技(深圳)有限公司 Resume identification device, method and computer-readable recording medium
CN108763569A (en) * 2018-06-05 2018-11-06 北京玄科技有限公司 Text similarity computing method and device, intelligent robot
CN110866407B (en) * 2018-08-17 2024-03-01 阿里巴巴集团控股有限公司 Analysis method, device and equipment for determining similarity between text of mutual translation
CN109635084B (en) * 2018-11-30 2020-11-24 宁波深擎信息科技有限公司 Real-time rapid duplicate removal method and system for multi-source data document
CN109614478B (en) * 2018-12-18 2020-12-08 北京中科闻歌科技股份有限公司 Word vector model construction method, keyword matching method and device
CN109977194B (en) * 2019-03-20 2021-08-10 华南理工大学 Text similarity calculation method, system, device and medium based on unsupervised learning
CN110083832B (en) * 2019-04-17 2020-12-29 北大方正集团有限公司 Article reprint relation identification method, device, equipment and readable storage medium
CN110321931A (en) * 2019-06-05 2019-10-11 上海易点时空网络有限公司 Original content referee method and device
CN112528630A (en) * 2019-09-19 2021-03-19 北京国双科技有限公司 Text similarity determination method and device, storage medium and electronic equipment
CN112435688B (en) * 2020-11-20 2024-06-18 腾讯音乐娱乐科技(深圳)有限公司 Audio identification method, server and storage medium
CN115455950B (en) * 2022-09-27 2023-06-16 中科雨辰科技有限公司 Acquiring text data processing system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102693279A (en) * 2012-04-28 2012-09-26 合一网络技术(北京)有限公司 Method, device and system for fast calculating comment similarity
CN103823862A (en) * 2014-02-24 2014-05-28 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8738361B2 (en) * 2009-07-01 2014-05-27 International Business Machines Corporation Systems and methods for extracting patterns from graph and unstructered data

Also Published As

Publication number Publication date
CN105224518A (en) 2016-01-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant