CN105224518B

CN105224518B - Text similarity calculation method and system and similar text search method and system

Info

Publication number: CN105224518B
Application number: CN201410270637.2A
Authority: CN
Inventors: 刘健
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2014-06-17
Filing date: 2014-06-17
Publication date: 2020-03-17
Anticipated expiration: 2034-06-17
Also published as: CN105224518A

Abstract

The embodiments of the invention relate to the field of computer technology and disclose a method and a system for calculating text similarity and a method and a system for searching for similar texts. The text similarity calculation method comprises the following steps: acquiring a first text and a second text for which text similarity is to be calculated; performing vocabulary segmentation on the first text and the second text respectively to obtain a first vocabulary set and a second vocabulary set; deleting stop words from the first vocabulary set and the second vocabulary set respectively to obtain a third vocabulary set and a fourth vocabulary set; extracting high-frequency words from the third vocabulary set and the fourth vocabulary set respectively to form a fifth vocabulary set and a sixth vocabulary set; and calculating the text similarity of the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set. Implementing the embodiments of the invention can improve the accuracy of text similarity calculation and the efficiency of searching for similar texts.

Description

Text similarity calculation method and system and similar text search method and system

Technical Field

The invention relates to the technical field of computers, in particular to a method and a system for calculating text similarity and a method and a system for searching similar texts.

Background

The text similarity calculation and the search of similar texts have wide application in the fields of paper anti-plagiarism, website anti-counterfeiting and the like, for example:

1. Counterfeit website identification: taking counterfeit Industrial and Commercial Bank of China (ICBC) websites as an example, if the content of a website is found to be close to that of the official ICBC website (http://www.icbc.com.cn), the website can be regarded as a counterfeit website.

2. Paper plagiarism identification: a paper is compared with the other papers in a paper library to judge whether plagiarism exists.

3. Commodity recommendation: for example, when a user purchases a book introducing computer operating systems on a website, the commodity recommendation system can automatically recommend other books with similar content.

4. Near-duplicate removal: a search engine automatically de-duplicates similar web pages so as to provide more useful information to the user.

The common text similarity calculation methods in the prior art include the following methods:

Scheme 1, the longest common string algorithm: assuming that the lengths of two strings are n and m respectively and the length of their longest common string is c, the similarity is c/MIN(n, m), i.e. c divided by the smaller of n and m. For example, the two four-character Chinese texts "我叫张三" ("my name is Zhang San") and "我叫李四" ("my name is Li Si") have the longest common string "我叫" ("my name is"), so the similarity is 2/MIN(4, 4) = 2/4 = 0.5.

Scheme 2, the minimum edit distance algorithm: the minimum edit distance is the minimum number of edits (insert, delete, and modify operations) required to convert one string into the other. In the above example, "Zhang" must be changed to "Li" and "San" to "Si", for 2 edits. Assuming that the lengths of the two strings are n and m respectively and the minimum edit distance is d, the similarity is 1 - d/MIN(n, m).

After the text similarity is calculated, the similarity may be compared with a threshold (e.g., 0.8), and if the similarity exceeds the threshold, the text is considered similar.
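The two prior-art schemes and the threshold comparison can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the inputs are assumed to be non-empty strings, and the strings "abXY" and "abPQ" are four-character stand-ins that share a two-character prefix, mirroring the example above.

```python
def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous string shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insert, delete, and modify operations."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # modify or keep
        prev = curr
    return prev[-1]


def scheme1_similarity(a: str, b: str) -> float:
    """Scheme 1: c / MIN(n, m)."""
    return longest_common_substring_len(a, b) / min(len(a), len(b))


def scheme2_similarity(a: str, b: str) -> float:
    """Scheme 2: 1 - d / MIN(n, m)."""
    return 1 - edit_distance(a, b) / min(len(a), len(b))


def is_similar(similarity: float, threshold: float = 0.8) -> bool:
    """Texts are considered similar when the similarity exceeds the threshold."""
    return similarity > threshold


s1 = scheme1_similarity("abXY", "abPQ")
s2 = scheme2_similarity("abXY", "abPQ")
```

With these stand-in strings both schemes give 0.5, which is below the example threshold of 0.8.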

Various text similarity calculation methods in the prior art have some problems:

Both scheme 1 and scheme 2 are easy to bypass: a simple transposition of words, sentences, or paragraphs greatly lowers the computed similarity, so the accuracy is low. Consider two texts with the same substantive content that differ only in word order, both meaning "his current name is Zhang San" and each nine characters long. With scheme 1, the longest common string has only 3 characters, so the similarity is only 3/9 = 0.33; with scheme 2, the minimum edit distance is 9, so the similarity is 1 - 9/9 = 0. The text similarity calculated by the prior-art methods is therefore low, and the two texts may be judged dissimilar.

In summary, the text similarity calculation method in the prior art has the problem of low accuracy, and is not beneficial to finding out similar texts of the text to be detected from the text library.

Disclosure of Invention

The technical problem to be solved by the embodiments of the present invention is to provide a method and a system for calculating text similarity, and a method and a system for searching similar texts, which are used for improving the accuracy of text similarity calculation and are beneficial to searching similar texts of texts to be detected from a text library.

The embodiment of the invention provides a text similarity calculation method, which comprises the following steps:

acquiring a first text and a second text which need to be subjected to text similarity calculation;

performing vocabulary segmentation on the first text to obtain a first vocabulary set, and performing vocabulary segmentation on the second text to obtain a second vocabulary set;

deleting stop words in the first vocabulary set to obtain a third vocabulary set, and deleting stop words in the second vocabulary set to obtain a fourth vocabulary set;

extracting high-frequency words from the third vocabulary set to form a fifth vocabulary set, and extracting high-frequency words from the fourth vocabulary set to form a sixth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TFIDF) value is higher than a first threshold;

and calculating the text similarity of the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set.

Correspondingly, the embodiment of the invention also provides a method for searching similar texts, which comprises the following steps:

acquiring a data structure of high-frequency words and text numbers, wherein the data structure comprises information of each high-frequency word in a text library and the number of a corresponding text comprising each high-frequency word;

carrying out vocabulary segmentation on the third text to obtain a seventh vocabulary set;

deleting stop words in the seventh vocabulary set to obtain an eighth vocabulary set;

extracting high-frequency words from the eighth vocabulary set to form a ninth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TFIDF) value is higher than a second threshold;

searching for similar text of the third text using the data structure and the ninth vocabulary set.

Correspondingly, an embodiment of the present invention further provides a system for calculating text similarity, including:

the first acquiring unit is used for acquiring a first text and a second text which need to be subjected to text similarity calculation;

the first segmentation unit is used for carrying out vocabulary segmentation on the first text to obtain a first vocabulary set and carrying out vocabulary segmentation on the second text to obtain a second vocabulary set;

the first deleting unit is used for deleting stop words in the first vocabulary set to obtain a third vocabulary set and deleting stop words in the second vocabulary set to obtain a fourth vocabulary set;

the first extraction unit is used for extracting the high-frequency words in the third vocabulary set to form a fifth vocabulary set, and extracting the high-frequency words in the fourth vocabulary set to form a sixth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TFIDF) value is higher than a first threshold;

and the calculating unit is used for calculating the text similarity of the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set.

Correspondingly, the embodiment of the present invention further provides a system for searching for similar texts, including:

the second acquisition unit is used for acquiring a data structure of high-frequency words and text numbers, wherein the data structure comprises information of each high-frequency word in a text library and the number of each corresponding text comprising that high-frequency word;

the second segmentation unit is used for carrying out vocabulary segmentation on the third text to obtain a seventh vocabulary set;

the second deleting unit is used for deleting stop words in the seventh vocabulary set to obtain an eighth vocabulary set;

the second extraction unit is used for extracting the high-frequency words in the eighth vocabulary set to form a ninth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TFIDF) value is higher than a second threshold;

and the searching unit is used for searching the similar text of the third text by utilizing the data structure and the ninth vocabulary set.

According to the method and the system for calculating the text similarity and the method and the system for searching the similar text, the text is subjected to word segmentation, stop word deletion and high-frequency word extraction, and similarity calculation or similar text search is performed by using the high-frequency words of the text, so that the accuracy of text similarity calculation can be improved, and the efficiency of searching the similar text can be improved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a method for calculating text similarity according to an embodiment of the present invention;

Fig. 2 is a first schematic flowchart of a method for searching for similar texts according to a second embodiment of the present invention;

Fig. 3 is a second schematic flowchart of the method for searching for similar texts according to the second embodiment of the present invention;

Fig. 4 is a third schematic flowchart of the method for searching for similar texts according to the second embodiment of the present invention;

Fig. 5 is a fourth schematic flowchart of the method for searching for similar texts according to the second embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a system for calculating text similarity according to a third embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a system for searching for similar texts according to a fourth embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example one:

An embodiment of the invention provides a method for calculating text similarity, which may include the following steps:

101. acquiring a first text and a second text which need to be subjected to text similarity calculation;

in this embodiment, the first text and the second text may refer to any two texts for which text similarity calculation is required;

it should be noted that the text in the present embodiment may include a document (e.g., an article, a paper, a book, etc.), a web page, etc., and the type of the text is not specifically limited herein;

102. performing vocabulary segmentation on the first text to obtain a first vocabulary set, and performing vocabulary segmentation on the second text to obtain a second vocabulary set;

in this embodiment, any word segmentation system in the prior art may be used to segment a text into words. For English, since words are separated by spaces, segmentation may be performed on the spaces. For Chinese, segmentation may be implemented with the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) system of the Institute of Computing Technology of the Chinese Academy of Sciences, or with the word segmentation system of the SOSO search engine of Tencent. No specific limitation is placed here on which word segmentation method or system is used;

103. deleting stop words in the first vocabulary set to obtain a third vocabulary set, and deleting stop words in the second vocabulary set to obtain a fourth vocabulary set;

prior to 103, the method may further comprise: acquiring a stop word list;

the stop word list defines which words are stop words to be deleted; for example, the stop word list may include common words without actual meaning such as "you", "I", "he", "of", and "is". In this embodiment, the specific stop words in the list can be set by the user as needed and are not specifically limited herein;

104. extracting the high-frequency words in the third vocabulary set to form a fifth vocabulary set, and extracting the high-frequency words in the fourth vocabulary set to form a sixth vocabulary set; a high-frequency word is a word whose TFIDF (term frequency-inverse document frequency) value in the text library is higher than a first threshold;

the TFIDF value represents the importance of a word: if the term frequency (TF) of a word in a text is high and the word rarely appears in other texts, its TFIDF value is large, and the word is considered to have good category-distinguishing capability and to be suitable for classification;

the method for calculating the TFIDF value belongs to the prior art and is not described herein again;

the specific value of the first threshold may be set by a user as appropriate, and the specific value is not specifically limited herein;

105. and calculating the text similarity of the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set.

According to the text similarity calculation method provided by the embodiment of the invention, the high-frequency words are extracted from the text, and the text similarity is calculated by using the high-frequency words included in the text, so that the influence of factors such as the sequence of each word and paragraph in the text on the text similarity is eliminated, and the accuracy of text similarity calculation is improved.
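Step 104 can be illustrated with the following sketch. The specification leaves the TFIDF calculation to the prior art, so the formula used here (term frequency multiplied by a smoothed logarithmic inverse document frequency) is only one common variant and is an assumption, as are the function names, the toy text library, and the threshold value of 0.1.

```python
import math
from collections import Counter


def tfidf_scores(words, library):
    """Score each word of one text against a library of tokenized texts.

    Assumed variant: tf = count / len(words),
                     idf = log((1 + N) / (1 + df)).
    """
    counts = Counter(words)
    n_texts = len(library)
    scores = {}
    for word, count in counts.items():
        tf = count / len(words)
        # Number of library texts that contain the word.
        df = sum(1 for text in library if word in text)
        idf = math.log((1 + n_texts) / (1 + df))
        scores[word] = tf * idf
    return scores


def high_frequency_words(words, library, threshold):
    """Keep the words whose TFIDF value is higher than the threshold."""
    scores = tfidf_scores(words, library)
    return {word for word, score in scores.items() if score > threshold}


# Hypothetical library of three texts, already segmented and stop-word free.
library = [
    ["zhang_san", "profession", "computer"],
    ["li_si", "profession", "literature", "film"],
    ["profession", "history"],
]
selected = high_frequency_words(library[0], library, threshold=0.1)
```

In this toy library the word "profession" appears in every text, so its score is 0 and it is not selected, while "zhang_san" and "computer" are. A real text library and threshold would give different results.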

For example, the 105 may include:

calculating the text similarity S of the first text and the second text as S = C/min(A, B) according to the fifth vocabulary set and the sixth vocabulary set;

wherein A is the number of words in the fifth vocabulary set, B is the number of words in the sixth vocabulary set, and C is the number of words common to the fifth vocabulary set and the sixth vocabulary set.

Of course, those skilled in the art may make other variations or modifications to the above formula for calculating text similarity, and the specific form of the formula is not limited herein.

The following describes a method for calculating text similarity according to this embodiment with a specific example:

for example, text 1 has the content "Zhang San's profession is computer" and text 2 has the content "Li Si's profession is literature and film";

first, the words of texts 1 and 2 are segmented, giving the vocabulary set {Zhang San, of, profession, is, computer} for text 1 and the vocabulary set {Li Si, of, profession, is, literature, and, film} for text 2;

second, stop words are deleted: assuming that the stop word list includes "of", "is", and "and", the vocabulary set of text 1 becomes {Zhang San, profession, computer} and the vocabulary set of text 2 becomes {Li Si, profession, literature, film};

third, high-frequency words are extracted: assuming that Zhang San, profession, computer, Li Si, literature, and film are all high-frequency words, the high-frequency vocabulary set of text 1 is {Zhang San, profession, computer} and that of text 2 is {Li Si, profession, literature, film};

fourth, the text similarity is calculated: text 1 has 3 high-frequency words, text 2 has 4 high-frequency words, and the two texts share 1 high-frequency word, so the text similarity of text 1 and text 2 is S₁₂ = 1/min(3, 4) = 1/3 = 0.33.
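The worked example can be reproduced with the short sketch below. The token lists are hand-written English stand-ins for the segmented words (a real system would obtain them from a word segmentation system), every remaining word is assumed to be a high-frequency word as in the example, and the function names are hypothetical.

```python
def remove_stop_words(words, stop_words):
    """Step 103: drop every word that appears in the stop word list."""
    return [word for word in words if word not in stop_words]


def overlap_similarity(words_a, words_b):
    """Step 105: S = C / min(A, B) over two high-frequency vocabulary sets."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a or not set_b:
        return 0.0  # assumption: an empty set is treated as not similar
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Stand-in segmentation results for text 1 and text 2.
text1 = ["Zhang San", "of", "profession", "is", "computer"]
text2 = ["Li Si", "of", "profession", "is", "literature", "and", "film"]
stop_words = {"of", "is", "and"}

high_freq_1 = remove_stop_words(text1, stop_words)
high_freq_2 = remove_stop_words(text2, stop_words)
similarity = overlap_similarity(high_freq_1, high_freq_2)
```

Because the calculation uses sets, reordering the words of either text leaves the result unchanged, which is the property that the prior-art schemes lack.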

Generally, text similarity detection must compare a sample against a large number of samples to find those similar to the sample under test. For example, a counterfeit website detection system collects in advance the contents of a large number of legitimate websites (such as those of the Industrial and Commercial Bank of China, Tencent, and Taobao) and must judge whether the content of a newly appearing website is similar to that of a collected website; a paper plagiarism detection system collects a large number of academic papers in advance and must judge whether a newly appearing paper is similar to a collected paper.

Assuming that the number of currently collected texts is N, the existing schemes must compare the text under test with the N texts in the library one by one. When N is small (e.g., tens or hundreds), the overall computational overhead is small. In practical large-scale applications, however, N is very large (for example, the number of scientific and technical papers published worldwide is in the tens of millions, and the number of websites in China is in the millions), and the computational cost becomes considerable: if one text similarity calculation takes 0.1 ms on average and N is 1,000,000, checking a single text takes 100 s, meaning that a single machine can process only 0.01 texts per second, which is clearly too slow.
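The cost estimate above is simple arithmetic and can be checked as follows; the function name is hypothetical, and the time per comparison is expressed in whole microseconds (0.1 ms = 100 microseconds) only to keep the arithmetic exact.

```python
def seconds_per_query(library_size: int, micros_per_comparison: int) -> float:
    """Time to compare one text with every text in the library, in seconds."""
    return library_size * micros_per_comparison / 1_000_000


# N = 1,000,000 texts, 0.1 ms (100 microseconds) per similarity calculation.
cost_seconds = seconds_per_query(1_000_000, 100)
texts_per_second = 1 / cost_seconds
```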

The second embodiment below addresses text similarity calculation when the number of samples is large: it finds the texts similar to the text under test in the text library while increasing the calculation speed, as described below.

Example two:

the embodiment of the present invention provides a method for searching for a similar text, as shown in fig. 2, the method may include:

201. acquiring a data structure of high-frequency words and text numbers, wherein the data structure comprises information of each high-frequency word in a text library and the number of a corresponding text comprising each high-frequency word;

202. carrying out vocabulary segmentation on the third text to obtain a seventh vocabulary set;

the specific vocabulary segmentation method may be the same as the description of the foregoing embodiment, and is not repeated herein;

203. deleting stop words in the seventh vocabulary set to obtain an eighth vocabulary set;

the specific stop word deleting method may be the same as the description of the foregoing embodiment, and is not described herein again;

204. extracting the high-frequency words in the eighth word set to form a ninth word set;

wherein, the high-frequency vocabulary is the vocabulary with TFIDF value higher than the second threshold value in the text library;

the specific value of the second threshold may be set by a user as appropriate, and the specific value is not specifically limited herein;

in one embodiment, the second threshold may be equal to the first threshold;

205. and searching for similar texts of the third text by using the data structure and the ninth vocabulary set.

In the method for searching for similar texts provided by this embodiment, after the third text is segmented, the stop word is deleted, and the high frequency word is extracted, the similar text of the third text can be searched by using the data structure of the high frequency word and the document number, and it is not necessary to compare each text in the text library with the third text one by one, so that the searching speed can be increased, and the effect is improved more obviously when the number of texts in the text library is larger.

Preferably, as shown in fig. 3, before the step 201, the method may further include:

200. and sequentially carrying out vocabulary segmentation, stop word deletion and high-frequency vocabulary extraction on each text in the text library to generate a data structure of high-frequency vocabularies and text numbers.

For example, as shown in fig. 4, 200 may include:

200A, setting n to 1 initially;

200B, carrying out vocabulary segmentation on the nth text in the text library;

200C, deleting stop words in the words obtained by dividing the nth text word;

200D, extracting high-frequency words from the word set of the nth text obtained after the stop words are deleted to obtain a high-frequency word set of the nth text;

200E, judging whether n is greater than or equal to the total number N of texts in the text library; if so, executing 200G, otherwise executing 200F;

200F, setting n = n + 1 and returning to 200B to process the next text;

and 200G, integrating the high-frequency vocabulary sets of the texts to generate a data structure of the high-frequency vocabularies and the text numbers.

For example, as shown in fig. 5, the 205 may include:

205A, searching a text which comprises at least one same high-frequency vocabulary as the third text by using the data structure;

205B, for each text that shares at least one high-frequency word with the third text, calculating its text similarity with the third text as S = F/min(D, E); D represents the number of high-frequency words in the ninth vocabulary set, E represents the number of high-frequency words included in a fourth text, and F represents the number of high-frequency words shared by the third text and the fourth text; the fourth text is any text that shares at least one high-frequency word with the third text;

that is, the same calculation method as that of the foregoing embodiment can be adopted here to calculate the text similarity between texts;

205C, determining the text with the text similarity meeting the preset condition with the third text as the similar text of the third text.

In this embodiment, a text whose text similarity with the third text meets a preset condition is defined as a similar text of the third text, where the preset condition may include:

1. the similarity is greater than or equal to a third threshold; the specific value of the third threshold may be set according to actual needs, for example to 0.7 or 0.9; or,

2. the text with the highest similarity, or the first M texts with the highest similarity; m is a preset integer.

Of course, the preset condition may have other forms, and those skilled in the art can make appropriate settings according to actual needs, and is not limited specifically herein.
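The two preset conditions listed above can be sketched as two selection functions applied to a mapping from text number to similarity. The function names and the sample similarity values are hypothetical and serve only as an illustration.

```python
def select_by_threshold(similarities, third_threshold):
    """Condition 1: keep texts whose similarity is >= the third threshold."""
    return {text_id: s for text_id, s in similarities.items()
            if s >= third_threshold}


def select_top_m(similarities, m):
    """Condition 2: keep the first M texts with the highest similarity.

    Ties are broken by the smaller text number (an assumption).
    """
    ranked = sorted(similarities.items(),
                    key=lambda item: (-item[1], item[0]))
    return ranked[:m]


# Hypothetical similarities between a third text and three candidate texts.
similarities = {1: 0.25, 3: 0.75, 4: 0.5}
above_threshold = select_by_threshold(similarities, 0.7)
top_two = select_top_m(similarities, 2)
```

Setting M to 1 in the second function yields the single text with the highest similarity.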

It should be noted that the data structure of the high-frequency vocabulary and the text number may have various forms, such as "inverted index", "signature file", "suffix tree", and the like, and the specific form of the data structure is not limited in this embodiment.

The data structure of the high frequency vocabulary and the text number is described by the inverted index as follows:

assuming that the high-frequency vocabulary set of text 1 is {Zhang San, profession, computer}, the following inverted index is generated:

High-frequency word	Text numbers
Zhang San	1
Profession	1
Computer	1

Each row in the table represents in which text a certain high frequency vocabulary is present.

Assuming that the high-frequency vocabulary set of text 2 is {Li Si, profession, literature, film}, adding it to the above inverted index results in:

High-frequency word	Text numbers
Zhang San	1
Profession	1, 2
Computer	1
Li Si	2
Literature	2
Film	2

Then, the high-frequency vocabulary sets of texts 3, 4, ..., N are added to the inverted index in turn, forming the inverted index structure of high-frequency words and text numbers.
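The construction of the inverted index described above can be sketched as follows. The function name is hypothetical, and the high-frequency vocabulary sets are assumed to have already been produced by vocabulary segmentation, stop word deletion, and high-frequency word extraction.

```python
from collections import defaultdict


def add_to_index(index, text_number, high_freq_words):
    """Record, for each high-frequency word, that it occurs in text_number."""
    for word in set(high_freq_words):
        index[word].add(text_number)


# Maps each high-frequency word to the set of text numbers containing it.
index = defaultdict(set)
add_to_index(index, 1, ["Zhang San", "profession", "computer"])
add_to_index(index, 2, ["Li Si", "profession", "literature", "film"])
```

After the two calls, the entry for "profession" holds text numbers 1 and 2 and every other entry holds a single text number, matching the table above. Further texts are added with the same call.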

After all texts in a text library are processed to form an inverted index structure of high-frequency words and text numbers, for a third text needing to search for the most similar text, after a ninth high-frequency word set of the third text is obtained, for each high-frequency word in the ninth high-frequency word set, searching for a corresponding text comprising the high-frequency word, and forming a similar text set; then calculating the similarity between each text in the similar text set and the third text; and determining the text with the similarity meeting the preset condition as the similar text of the third text.

For example, assuming that the high-frequency vocabulary set of text D is {a, b, c, d}, the texts obtained from the inverted index are texts 1, 2, 3, 4, 5, 6, 7, and 8, where high-frequency word a appears in texts 1, 3, and 5, high-frequency word b appears in texts 2 and 3, high-frequency word c appears in texts 4, 5, and 8, and high-frequency word d appears in texts 3, 4, 6, and 7.

Traversing all the high-frequency words shows that text 3 occurs 3 times, texts 4 and 5 each occur 2 times, and the other texts occur once each. Assuming that the high-frequency vocabulary sets of texts 1 to 8 all have size 10, then:

the similarity of each text to the text D is:

SD-1 (the similarity of text D to text 1) = 1/min(4, 10) = 0.25;

SD-2 = 1/min(4, 10) = 0.25;

SD-3 = 3/min(4, 10) = 0.75;

SD-4 = 2/min(4, 10) = 0.5;

SD-5 = 2/min(4, 10) = 0.5;

SD-6 = 1/min(4, 10) = 0.25;

SD-7 = 1/min(4, 10) = 0.25;

SD-8 = 1/min(4, 10) = 0.25;

assuming that the preset condition is "the text with the highest similarity", text 3 is determined to be the similar text of text D, with a similarity of 0.75.
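The search over the inverted index in this example can be sketched as follows. The function name is hypothetical, the index and set sizes are copied from the example above, and only texts that share at least one high-frequency word with the query text are ever scored.

```python
from collections import Counter


def search_similar(query_words, index, set_sizes):
    """Return {text number: S} with S = F / min(D, E) for candidate texts."""
    query = set(query_words)
    shared = Counter()
    # 205A: collect every text that shares a high-frequency word with the
    # query and count how many words it shares (F).
    for word in query:
        for text_number in index.get(word, ()):
            shared[text_number] += 1
    # 205B: D is the size of the query set, E the size of the candidate's set.
    return {text_number: f / min(len(query), set_sizes[text_number])
            for text_number, f in shared.items()}


index = {
    "a": {1, 3, 5},
    "b": {2, 3},
    "c": {4, 5, 8},
    "d": {3, 4, 6, 7},
}
set_sizes = {text_number: 10 for text_number in range(1, 9)}

scores = search_similar(["a", "b", "c", "d"], index, set_sizes)
# 205C with the preset condition "the text with the highest similarity".
best = max(scores, key=scores.get)
```

The cost of a query depends on the number of index entries touched rather than on the total number of texts in the library, which is the source of the speed-up over one-by-one comparison.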

Example three:

an embodiment of the present invention further provides a system for calculating text similarity, as shown in fig. 6, the system may include:

a first obtaining unit 601, configured to obtain a first text and a second text that need to be subjected to text similarity calculation;

a first segmentation unit 602, configured to perform vocabulary segmentation on a first text to obtain a first vocabulary set, and perform vocabulary segmentation on a second text to obtain a second vocabulary set;

a first deleting unit 603, configured to delete a stop word in the first vocabulary set to obtain a third vocabulary set, and delete a stop word in the second vocabulary set to obtain a fourth vocabulary set;

a first extracting unit 604, configured to extract high-frequency words from the third vocabulary set to form a fifth vocabulary set, and extract high-frequency words from the fourth vocabulary set to form a sixth vocabulary set; a high-frequency word is a word whose term frequency-inverse document frequency (TFIDF) value is higher than a first threshold;

the calculating unit 605 is configured to calculate the text similarity between the first text and the second text according to the fifth vocabulary set and the sixth vocabulary set.

The system for calculating the text similarity provided by the embodiment extracts the high-frequency words from the text, calculates the text similarity according to the number of the same high-frequency words included in the text to be detected, eliminates the influence of factors such as the sequence of each word and paragraph in the text on the text similarity, and improves the accuracy of text similarity calculation.

Preferably, the first obtaining unit may be further configured to obtain a stop word list; for example, the stop word list may include common words without actual meaning such as "you", "I", "he", "of", and "is".

For example, the calculating unit 605 may be specifically configured to calculate the text similarity of the first text and the second text as S = C/min(A, B) according to the fifth vocabulary set and the sixth vocabulary set, where A is the number of words in the fifth vocabulary set, B is the number of words in the sixth vocabulary set, and C is the number of words common to both sets.

Example four:

an embodiment of the present invention further provides a system for searching for a similar text, as shown in fig. 7, the system may include:

a second obtaining unit 701, configured to obtain a data structure of high-frequency words and text numbers, where the data structure includes information of each high-frequency word in a text library and a number of a corresponding text including each high-frequency word;

a second segmentation unit 702, configured to perform vocabulary segmentation on the third text to obtain a seventh vocabulary set;

a second deleting unit 703, configured to delete the stop word in the seventh vocabulary set, so as to obtain an eighth vocabulary set;

a second extracting unit 704, configured to extract the high-frequency vocabulary in the eighth vocabulary set to form a ninth vocabulary set; the high-frequency vocabulary is the vocabulary with TFIDF value higher than a second threshold value in the text library;

in one embodiment, the second threshold may be equal to the first threshold;

the searching unit 705 is configured to search for a similar text of the third text by using the data structure and the ninth vocabulary set.

The similar text searching system provided in this embodiment may search for the similar text of the third text by using the data structure of the high-frequency vocabulary and the document number after the third text is segmented, the stop word is deleted, and the high-frequency word is extracted, and it is not necessary to compare each text in the text library with the third text one by one, so that the searching speed may be increased, and the effect is improved more obviously when the number of texts in the text library is larger.

Further, the system for searching for similar texts provided by this embodiment may further include:

and the processing unit is used for sequentially carrying out vocabulary segmentation, stop word deletion and high-frequency vocabulary extraction on each text in the text library to generate a data structure of high-frequency vocabularies and text numbers.
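One possible shape for the data structure generated by the processing unit is an inverted index that maps each high-frequency word to the numbers of the texts containing it. The sketch below assumes that form. The helpers `segment` and `extract_high_frequency` are hypothetical placeholders: whitespace splitting stands in for a real word segmenter (Chinese text would need one), and the extraction step simply keeps every remaining word instead of applying a TFIDF threshold. The stop word set is illustrative.

```python
from collections import defaultdict

# Illustrative stop word list (assumption)
STOP_WORDS = {"you", "i", "he", "is", "yes", "the", "a"}


def segment(text):
    # Placeholder segmentation: lowercase and split on whitespace
    return text.lower().split()


def extract_high_frequency(words):
    # Placeholder extraction: keep every word that survived stop word deletion
    return set(words)


def build_index(library):
    """Map each high-frequency word to the set of text numbers containing it."""
    index = defaultdict(set)
    for number, text in enumerate(library, start=1):   # texts are numbered from 1
        words = [w for w in segment(text) if w not in STOP_WORDS]
        for word in extract_high_frequency(words):
            index[word].add(number)
    return dict(index)


# Illustrative data only
library = [
    "He is the inventor",
    "the inventor filed a patent",
    "patent search is fast",
]
index = build_index(library)
```

Because the index is built once for the whole text library, a later query only has to look up the words of the third text rather than rescan every text.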

Specifically, the search unit may include:

a searching subunit, configured to search, by using the data structure, a text that includes at least one same high-frequency word as the third text;

a calculating subunit, configured to calculate a text similarity S = F/min(D, E) between the third text and each text that includes at least one same high-frequency word as the third text; D represents the number of high-frequency words in the ninth vocabulary set, E represents the number of high-frequency words included in a fourth text, and F represents the number of same high-frequency words included in both the third text and the fourth text; the fourth text is any text which includes at least one same high-frequency word as the third text;

and the determining subunit is configured to determine a text of which the text similarity with the third text meets a preset condition as a similar text of the third text.
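The three subunits above can be sketched together as follows. This is a hedged illustration, assuming the data structure is an inverted index from high-frequency words to sets of text numbers and assuming the preset condition is "similarity greater than or equal to a threshold"; the function name `find_similar`, the sample index, and the threshold 0.5 are hypothetical.

```python
from collections import Counter


def find_similar(index, ninth_set, threshold):
    """Return (text number, S) pairs with S = F / min(D, E) >= threshold."""
    # E for every text: how many high-frequency words list that text number
    sizes = Counter(n for numbers in index.values() for n in numbers)
    # Searching step: texts sharing at least one word, with F counted per text
    shared = Counter(n for word in ninth_set for n in index.get(word, ()))
    d = len(ninth_set)                                  # D: size of the ninth set
    # Calculating step: similarity of each candidate text
    scores = {n: f / min(d, sizes[n]) for n, f in shared.items()}
    # Determining step: keep texts meeting the assumed threshold condition
    hits = [(n, s) for n, s in scores.items() if s >= threshold]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Illustrative data only
index = {
    "inventor": {1, 2},
    "filed": {2},
    "patent": {2, 3},
    "search": {3},
    "fast": {3},
}
ninth_set = {"patent", "search", "engine"}
result = find_similar(index, ninth_set, threshold=0.5)
```

Only texts 2 and 3 are ever scored here, since text 1 shares no high-frequency word with the query; this is where the saving over one-by-one comparison comes from. A "top M" condition could be expressed by slicing the sorted list instead of filtering by threshold.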

It should be noted that the above embodiments belong to the same inventive concept, and the description of each embodiment has a different emphasis, and reference may be made to the description in other embodiments where the description in individual embodiments is not detailed.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

The method and system for calculating text similarity and the method and system for searching similar texts provided by the embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help in understanding the method and core idea of the invention. Meanwhile, a person skilled in the art may, according to the idea of the invention, vary the specific embodiments and the application scope. In summary, the content of this specification should not be construed as a limitation of the invention.

Claims

1. A search method of similar texts is characterized in that the method is used for solving the problem of text similarity calculation when the number of samples is large, searching out texts similar to texts to be tested from a text library, and improving the calculation speed; the method comprises the following steps:

sequentially carrying out vocabulary segmentation, stop word deletion and high-frequency vocabulary extraction on each text in the text library to generate a data structure of high-frequency vocabularies and text numbers; the method comprises the following steps:

A. initially setting n to 1;

B. carrying out vocabulary segmentation on the nth text in the text library;

C. deleting stop words in the words obtained after the nth text word is segmented;

D. extracting high-frequency words from the word set of the nth text obtained after the stop words are deleted to obtain a high-frequency word set of the nth text;

E. judging whether n is greater than or equal to the total number N of texts in the text library; if so, executing step G; otherwise, executing step F;

F. setting n = n + 1, and returning to step B to process the next text;

G. synthesizing the high-frequency vocabulary set of each text to generate a data structure of the high-frequency vocabulary and the text number;

searching for similar texts of the third text by using the data structure and the ninth vocabulary set; the method comprises the following steps:

searching for a text comprising at least one same high-frequency vocabulary as the third text by using the data structure;

calculating the text similarity S = F/min(D, E) between the third text and each text which comprises at least one same high-frequency word as the third text; the D represents the number of high-frequency words in a ninth vocabulary set, the E represents the number of high-frequency words included in the fourth text, and the F represents the number of same high-frequency words included in both the third text and the fourth text; the fourth text is any text which comprises at least one same high-frequency word as the third text;

determining a text with the text similarity meeting preset conditions with the third text as a similar text of the third text;

wherein the preset conditions include: the similarity is greater than or equal to a third threshold; or the text is the text with the highest similarity; or the text is among the top M texts with the highest similarity; M is a preset integer.

2. The method of claim 1, further comprising, after the vocabulary segmentation of the nth text in the text library:

acquiring a stop word list; wherein the stop word list defines the stop words to be deleted, the stop words being common words without actual meaning.

3. The method according to claim 1 or 2, wherein the data structure of the high frequency vocabulary and text numbers is in the form of an "inverted index", "signature file" or "suffix tree".

4. A system for finding similar text, comprising:

the processing unit is used for sequentially carrying out vocabulary segmentation, stop word deletion and high-frequency vocabulary extraction on each text in the text library to generate a data structure of high-frequency vocabularies and text numbers;

the processing unit is further to: A. initially setting n to 1;

B. carrying out vocabulary segmentation on the nth text in the text library;

F. setting n = n + 1, and returning to step B to process the next text;

a searching unit, configured to search for a similar text of the third text by using the data structure and the ninth vocabulary set;

the search unit includes:

the searching subunit is used for searching a text which comprises at least one same high-frequency vocabulary as the third text by using the data structure;

a calculating subunit, configured to calculate a text similarity S = F/min(D, E) between the third text and each text including at least one same high-frequency word as the third text; the D represents the number of high-frequency words in a ninth vocabulary set, the E represents the number of high-frequency words included in a fourth text, and the F represents the number of same high-frequency words included in both the third text and the fourth text; the fourth text is any text which comprises at least one same high-frequency word as the third text;

a determining subunit, configured to determine, as a similar text of the third text, a text whose text similarity to the third text satisfies a preset condition, where the preset condition includes: the similarity is greater than or equal to a third threshold; or the text is the text with the highest similarity; or the text is among the top M texts with the highest similarity; M is a preset integer.

5. The system according to claim 4, wherein the first obtaining unit is further configured to obtain a stop word list; wherein the stop word list defines the stop words to be deleted, the stop words being common words without actual meaning.

6. The system according to claim 4 or 5, wherein the data structure of the high frequency vocabulary and text number is in the form of an "inverted index", "signature file" or "suffix tree".