CN107644010A - Text similarity calculation method and device - Google Patents
- Publication number: CN107644010A
- Application number: CN201610578843.9A
- Authority
- CN
- China
- Prior art keywords
- text
- similarity
- texts
- predetermined object
- hamming distance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
A text similarity calculation method for computing the similarity between two texts, where the data of at least two objects can be extracted from each text and an object is a feature that embodies the semantics of the text. The method includes: determining the shared objects of the two texts, the number of shared objects being at least two; calculating, for each shared object, the Hamming distance between the two texts; and, when the Hamming distances of the at least two shared objects satisfy a first preset condition, determining the similarity between the two texts according to at least one of the following for a predetermined object among the at least two shared objects: word vector similarity, Hamming distance, and spliced-string similarity. This scheme improves both the efficiency and the accuracy of text similarity calculation.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a text similarity calculation method and device.
Background technology
At present, similarity calculation between texts is applied in many areas. In the related art, the following two schemes can be used to compare texts.
The first scheme: after a long text is segmented into words, a hash is computed for each word, the hashes are weighted by word frequency to obtain a vector, and the vector is then binarized to obtain the hash value of the text. The Hamming distance between texts is determined from these hash values. This scheme is widely used for web page deduplication by search engines such as Google and Baidu.
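The first scheme described above is essentially the SimHash algorithm. The following is a minimal illustrative sketch, not the claimed method: it uses MD5 as the per-word hash and whitespace tokenization, both of which are simplifying assumptions.

```python
import hashlib
from collections import Counter

def simhash(words, bits=64):
    """Simplified SimHash: hash each word, weight its bits by word
    frequency, sum the signed contributions, then binarize."""
    v = [0] * bits
    for word, freq in Counter(words).items():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += freq if (h >> i) & 1 else -freq
    # Binarize: positive components become 1-bits of the fingerprint.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")
```

Similar texts share most weighted bit contributions, so their fingerprints differ in few positions, while unrelated texts differ in roughly half of the 64 bits.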
The second scheme: a topic model such as Latent Dirichlet Allocation (LDA) or Probabilistic Latent Semantic Analysis (PLSA) is trained by machine learning to map each text to a topic vector carrying a certain semantic meaning; the similarity between two texts is then obtained by computing the cosine similarity of their two vectors.
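The cosine similarity used by the second scheme can be sketched as follows. This is a plain implementation over generic numeric vectors; the topic vectors themselves would come from a trained LDA or PLSA model, which is not reproduced here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (nu * nv)
```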
However, although the first scheme can efficiently obtain the Hamming distance between two texts, it discards the semantics of the content and measures distance purely at the character-string level; when the texts are short, the comparison results are unsatisfactory. Moreover, the result of the first scheme is a distance value rather than a similarity, which is inconvenient for downstream business processing. The second scheme can represent text semantics well through machine learning, but training the model is very time-consuming and highly dependent on the training samples, and even very simple sentences may be scored incorrectly. In addition, cosine computation between high-dimensional vectors is relatively inefficient and unsuitable for large texts or big-data environments.

In summary, the text similarity calculation schemes in the related art are relatively inefficient and inaccurate.
Summary of the invention

The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.

Embodiments of the present application provide a text similarity calculation method and device that can improve the efficiency and accuracy of text similarity calculation.
An embodiment of the present application provides a text similarity calculation method for computing the similarity between two texts, where the data of at least two objects can be extracted from each text and an object is a feature that embodies the semantics of the text. The method includes: determining the shared objects of the two texts, the number of shared objects being at least two; calculating, for each shared object, the Hamming distance between the two texts; and, when the Hamming distances of the at least two shared objects satisfy a first preset condition, determining the similarity between the two texts according to at least one of the following for a predetermined object among the at least two shared objects: word vector similarity, Hamming distance, and spliced-string similarity.
Optionally, the method further includes: when the Hamming distances of the at least two shared objects do not satisfy the first preset condition, determining the similarity between the two texts from the minimum of the Hamming distances of the at least two shared objects.
Optionally, when the Hamming distances of the at least two shared objects satisfy the first preset condition, determining the similarity between the two texts includes:

When the predetermined objects include a first predetermined object and a second predetermined object, or include a first predetermined object, a second predetermined object and a third predetermined object: if the word vector similarity of the first predetermined object satisfies a second preset condition, the similarity between the two texts is taken to be that word vector similarity; if the word vector similarity of the first predetermined object does not satisfy the second preset condition, and the first Hamming distance of the second predetermined object satisfies a third preset condition and the second Hamming distance of the second predetermined object satisfies a fourth preset condition, the similarity between the two texts is determined from the second Hamming distance of the second predetermined object; if the word vector similarity of the first predetermined object does not satisfy the second preset condition, and the first Hamming distance of the second predetermined object does not satisfy the third preset condition or its second Hamming distance does not satisfy the fourth preset condition, the similarity between the two texts is determined from the spliced-string similarity of the first predetermined object and/or from the word vector similarity of the third predetermined object.
Optionally, when the Hamming distances of the at least two shared objects satisfy the first preset condition, determining the similarity between the two texts includes:

When the predetermined objects include a second predetermined object but no first predetermined object, or include a second predetermined object and a third predetermined object but no first predetermined object: if the first Hamming distance of the second predetermined object satisfies the third preset condition and its second Hamming distance satisfies the fourth preset condition, the similarity between the two texts is determined from the second Hamming distance of the second predetermined object; if the first Hamming distance of the second predetermined object does not satisfy the third preset condition or its second Hamming distance does not satisfy the fourth preset condition, the similarity between the two texts is determined from the word vector similarity of the third predetermined object.
Optionally, when the Hamming distances of the at least two shared objects satisfy the first preset condition, determining the similarity between the two texts includes:

When the predetermined objects include a first predetermined object but no second predetermined object, or include a first predetermined object and a third predetermined object but no second predetermined object: if the word vector similarity of the first predetermined object satisfies the second preset condition, the similarity between the two texts is taken to be that word vector similarity; if it does not, the similarity between the two texts is determined from the spliced-string similarity of the first predetermined object and/or from the word vector similarity of the third predetermined object.
Optionally, before determining the similarity between the two texts when the Hamming distances of the at least two shared objects satisfy the first preset condition, the method further includes determining the word vector similarity of a predetermined object as follows: computing the word vector cosine similarity of the predetermined object from its word vector data; when the word count or length of the predetermined object in both texts exceeds a preset penalty factor, taking the word vector similarity of the predetermined object to be that word vector cosine similarity; when the word count or length of the predetermined object in at least one of the two texts is less than or equal to the penalty factor, computing a penalty correction value from the word counts or lengths of the two texts and a base penalty value, and computing an additional penalty value from the sum of the additional penalty values of the two texts, where the additional penalty value of each text is determined from the penalty factor, the word count or length of that text, and an additional penalty maximum; and then determining the word vector similarity of the predetermined object from its word vector cosine similarity, the smaller of the two texts' word counts or the smaller of their lengths, the penalty correction value, the additional penalty value, the penalty factor, the additional penalty maximum, and the base penalty value.
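The paragraph above names the quantities involved in the short-text penalty but does not fix their formulas at this point. The following sketch therefore uses assumed linear penalties purely for illustration; the function name, the formulas and the default parameter values are all hypothetical, not the claimed computation.

```python
def short_text_penalized_similarity(cos_sim, len_a, len_b,
                                    penalty_factor=5,
                                    base_penalty=0.1,
                                    extra_penalty_max=0.2):
    """Hypothetical short-text penalty: shrink the word vector cosine
    similarity when either text's object is at or below the penalty
    factor in word count (the linear forms here are assumptions)."""
    if len_a > penalty_factor and len_b > penalty_factor:
        return cos_sim  # both long enough: no penalty applied
    shorter = min(len_a, len_b)
    # Correction grows as the shorter text falls below the factor.
    correction = base_penalty * (penalty_factor - shorter + 1) / penalty_factor
    # Per-text additional penalty, capped at extra_penalty_max.
    extra = sum(
        min(extra_penalty_max,
            extra_penalty_max * (penalty_factor - n) / penalty_factor)
        for n in (len_a, len_b) if n <= penalty_factor
    )
    return max(0.0, cos_sim - correction - extra)
```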
Optionally, before calculating, for each shared object, the Hamming distance between the two texts, the method further includes: extracting the data of at least two objects from each of the two texts, the data including hash values and/or word vector data.
Optionally, the first preset condition includes: the minimum of the Hamming distances of the at least two shared objects is less than or equal to a first distance threshold, or lies within a first preset range.
Optionally, the first predetermined object is the title, the second predetermined object is the sentences representing the central idea of the text (the number of such sentences being greater than or equal to 3), and the third predetermined object is the keywords. The first Hamming distance of the second predetermined object is the total Hamming distance of those sentences between the two texts, and the second Hamming distance of the second predetermined object includes the Hamming distance between each sentence in one text and each sentence in the other text.
Optionally, one of the two texts is a target text and the other is a text to be analyzed. Before determining the shared objects of the two texts, the method further includes obtaining the text to be analyzed as follows:

Building an index domain for a first object and/or an index domain for a second object from the hash values of the first object and/or the second object of each of a plurality of texts, where each index domain includes one or more index trees; searching, according to the hash value of the first object of the target text, each index tree in the index domain of the first object for texts whose Hamming distance to the target text satisfies a first condition, and/or searching, according to the hash value of the second object of the target text, each index tree in the index domain of the second object for texts whose Hamming distance to the target text satisfies a second condition; and determining the text to be analyzed from the texts found.
Optionally, the index trees are bk-tree index trees.
Optionally, the first object is the content and the second object is the title.
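A bk-tree supports efficient radius search under any metric satisfying the triangle inequality, including the Hamming distance used here. The following is an illustrative sketch of such an index tree over integer fingerprints; the class layout is an assumption, not the claimed implementation.

```python
class BKTree:
    """BK-tree over integer fingerprints keyed by Hamming distance,
    as a sketch of the bk-tree index domains described above."""

    def __init__(self):
        self.root = None  # node = (fingerprint, {distance: child_node})

    @staticmethod
    def _dist(a, b):
        return bin(a ^ b).count("1")

    def add(self, value):
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = self._dist(value, node[0])
            if d in node[1]:
                node = node[1][d]  # descend along the existing edge
            else:
                node[1][d] = (value, {})
                return

    def query(self, value, radius):
        """Return all stored fingerprints within `radius` of `value`."""
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self._dist(value, node[0])
            if d <= radius:
                results.append(node[0])
            # Triangle inequality prunes children outside [d-radius, d+radius].
            for cd, child in node[1].items():
                if d - radius <= cd <= d + radius:
                    stack.append(child)
        return results
```

Looking up candidate texts then reduces to one `query` per index tree with the distance bound from the first or second condition.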
An embodiment of the present application also provides a text similarity calculation device for computing the similarity between two texts. The device includes: an extraction module for extracting the data of at least two objects from each text, an object being a feature that embodies the semantics of the text; a determining module for determining the shared objects of the two texts, the number of shared objects being at least two; a first calculation module for calculating, for each shared object, the Hamming distance between the two texts; and a second calculation module for determining, when the Hamming distances of the at least two shared objects satisfy a first preset condition, the similarity between the two texts according to at least one of the following for a predetermined object among the at least two shared objects: word vector similarity, Hamming distance, and spliced-string similarity.
Optionally, the device further includes: an index building module for building an index domain for a first object and/or an index domain for a second object from the hash values of the first object and/or the second object of each of a plurality of texts, where each index domain includes one or more index trees; and a search module for searching, according to the hash value of the first object of the target text, each index tree in the index domain of the first object for texts whose Hamming distance to the target text satisfies a first condition, and/or searching, according to the hash value of the second object of the target text, each index tree in the index domain of the second object for texts whose Hamming distance to the target text satisfies a second condition, and for determining the text to be analyzed from the texts found. The determining module then determines the shared objects of the target text and the text to be analyzed.
An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the above text similarity calculation method.
Embodiments of the present application fully exploit the characteristics of texts by extracting the data of multiple features that embody text semantics (for example the title, content, keywords and kernel sentences) and performing similarity calculation on the data of these multiple features, which improves both the efficiency and the accuracy of similarity calculation. The text similarity calculation method provided by the embodiments performs a preliminary match through low-cost computation before any complex computation, improving efficiency.

Further, the method can combine the fast and efficient computation of Hamming distances with the semantic expressiveness of word vector methods, avoiding both the lack of semantic expressiveness of similarities obtained purely from Hamming distances in the related art and the slow speed of purely word-vector-based similarity calculation and its applicability only to short texts. In addition, the embodiments introduce a short-text penalty mechanism that corrects the errors which easily arise when matching short texts.

Of course, a product implementing the present application need not achieve all of the above advantages simultaneously. Other aspects will become apparent upon reading and understanding the accompanying drawings and the detailed description.
Brief description of the drawings
Fig. 1 is a flow chart of the text similarity calculation method provided by embodiment one of the present application;
Fig. 2 is a first optional flow chart of the text similarity calculation method of embodiment one;
Fig. 3 is a second optional flow chart of the text similarity calculation method of embodiment one;
Fig. 4 is a third optional flow chart of the text similarity calculation method of embodiment one;
Fig. 5 is a fourth optional flow chart of the text similarity calculation method of embodiment one;
Fig. 6 is a flow chart of application example one;
Fig. 7 is a flow chart of step 104 in Fig. 6;
Fig. 8 is a flow chart of step 109 in Fig. 6;
Fig. 9 is a flow chart of application example two;
Fig. 10 is a flow chart of application example three;
Fig. 11 is a flow chart of application example four;
Fig. 12 is a flow chart of application example five;
Fig. 13 is a flow chart of application example six;
Fig. 14 is a flow chart of application example seven;
Fig. 15 is a schematic diagram of the text similarity calculation device provided by embodiment one of the present application;
Fig. 16 is a flow chart of the text similarity calculation method provided by embodiment two of the present application;
Fig. 17 is a schematic diagram of the text similarity calculation device provided by embodiment two of the present application.
Detailed description

The embodiments of the present application are described in detail below with reference to the accompanying drawings. It should be understood that the embodiments described below are intended only to illustrate and explain the application, not to limit it.
It should be noted that, provided there is no conflict, the features in the embodiments of the present application may be combined with each other, and all such combinations fall within the scope of protection of the application. In addition, although a logical order is shown in the flow charts, in some cases the steps may be performed in an order different from that shown or described herein.
Definitions of terms:

Object: a feature that embodies the semantics of a text; examples include the title, content, keywords and kernel sentences.

Kernel sentence: a sentence expressing the central idea of a text; in this embodiment, the number of kernel sentences extracted from each text is greater than or equal to three.

Hamming distance: the number of positions at which the corresponding characters of two (equal-length) character strings differ; the Hamming distance between two texts or objects can be determined from the hash values of the two texts or objects.

Hash value: the value extracted by a locality-sensitive hashing (LSH) algorithm from an object after word segmentation and word removal; LSH algorithms include the SimHash and MinHash algorithms.
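As an illustration of the MinHash variant of LSH mentioned above, the following sketch builds a signature from salted MD5 hashes and estimates Jaccard similarity from signature agreement; the salting scheme and signature length are illustrative assumptions.

```python
import hashlib

def minhash_signature(words, num_hashes=16):
    """Sketch of MinHash: for each of num_hashes salted hash
    functions, keep the minimum hash value over the word set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{w}".encode("utf-8")).hexdigest(), 16)
            for w in set(words)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree estimates the
    Jaccard similarity of the underlying word sets."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```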
Word vector cosine similarity: the value obtained by computing the cosine of two word vectors.

Word vector data: the data obtained by processing an object or text with a word vector model; word vector models include word2vec, PLSA and LDA.

Word vector similarity: the value obtained after the word vector cosine similarity has been adjusted by the penalty mechanism.

Spliced-string similarity: the similarity between two spliced strings computed by a string similarity algorithm, where a spliced string is the string obtained by re-concatenating an object after word segmentation and word removal; string similarity algorithms include, for example, the Jaro-Winkler algorithm, edit distance, and longest common subsequence (LCS).
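Of the string similarity algorithms listed above, edit distance is the simplest to illustrate. The following sketch computes the classic Levenshtein distance and normalizes it into a similarity in [0, 1]; the normalization is an illustrative choice, not one mandated by the description.

```python
def edit_distance(s, t):
    """Classic Levenshtein distance via dynamic programming,
    keeping only the previous row to save memory."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def string_similarity(s, t):
    """Normalize edit distance into a [0, 1] similarity."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))
```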
Embodiment one
An embodiment of the present application provides a text similarity calculation method for computing the similarity between two texts. The texts described in this embodiment may include public-opinion texts such as news articles, microblog posts and forum articles. The two texts being compared may be of the same type, for example two news articles, two microblog posts, or two forum articles; or they may be of different types, for example a news article and a microblog post, a microblog post and a forum article, or a news article and a forum article. The application is not limited in this respect.

In this embodiment, the data of at least two objects can be extracted from each text, the data including, for example, hash values and/or word vector data. Taking hash values extracted by the SimHash algorithm and word vector data extracted by a word2vec model as an example, from a forum article one may extract the SimHash value of the title, the vector space model (VSM) of the title, the SimHash value of the content, the VSM of the keywords, the SimHash value of the keywords, and the SimHash value of each kernel sentence; from a microblog post one may extract the SimHash value of the content, the VSM of the keywords, the SimHash value of the keywords, and the SimHash value of each kernel sentence. In practice, the data of the appropriate objects can be extracted according to the actual characteristics of the text; the application is not limited in this respect.

The text similarity calculation method of this embodiment may be applied, for example, on a server-side computing device or a client computing device; this embodiment is not limited in this respect. The similarity obtained by the method can subsequently be used for services such as deduplication, infringement search and template filtering. For example, an infringement or spam sample library can be established in a public-opinion application system; when a crawler captures a new public-opinion text, the method provided by this embodiment is used to compute the similarity between the captured text and the samples in the library, so as to judge whether the current text is similar to a sample and hence whether it is an infringing article or junk data.
Fig. 1 is a flow chart of the text similarity calculation method provided by embodiment one of the present application. As shown in Fig. 1, the method comprises the following steps:

Step S11: determine the shared objects of the two texts, the number of shared objects being at least two;

Step S12: calculate, for each shared object, the Hamming distance between the two texts;

Step S13: when the Hamming distances of the at least two shared objects satisfy a first preset condition, determine the similarity between the two texts according to at least one of the following for a predetermined object among the at least two shared objects: word vector similarity, Hamming distance, and spliced-string similarity.
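Steps S11 to S13 can be sketched as follows. The representation of a text as a mapping from object names to 64-bit fingerprints, and the concrete distance threshold standing in for the first preset condition, are illustrative assumptions rather than part of the claims.

```python
def text_similarity(objects_a, objects_b, distance_threshold=3):
    """Sketch of steps S11-S13: objects_a / objects_b map object names
    (e.g. 'title', 'content') to integer fingerprints; the threshold
    for the 'first preset condition' is an assumed parameter."""
    # S11: shared objects of the two texts.
    shared = set(objects_a) & set(objects_b)
    if len(shared) < 2:
        return None  # the method requires at least two shared objects
    # S12: Hamming distance per shared object.
    distances = {name: bin(objects_a[name] ^ objects_b[name]).count("1")
                 for name in shared}
    # S13: only when the smallest distance is within the threshold does
    # the finer-grained (word vector / string) comparison proceed.
    if min(distances.values()) <= distance_threshold:
        return distances  # hand off to the predetermined-object comparison
    return None
```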
In this embodiment, before step S12, the method further includes: extracting the data of at least two objects from each of the two texts, the data including hash values and/or word vector data.

In this embodiment, the at least two objects include at least two of the following: title, content, keywords and kernel sentences. The title and content of a text can be determined from its structural markers, for example a Title tag identifying the title and a Content tag identifying the body. The keywords and kernel sentences of a text must be extracted from its content.
In an optional embodiment, keywords can be extracted by computing the weights of the words in the text. The process of extracting keywords from a text is as follows:

Split the text into sentences;

Partition the sentences of the text: select the first A sentences as the first partition sectionA (A may be 2), select the last B sentences as the second partition sectionB (B may be 2), and take the remaining sentences as the third partition sectionC;

Segment the sentences of each partition into words;

Traverse every word in every partition. If a word has not yet been recorded, record it, compute its inverse document frequency (IDF) value, and initialize its private parameters local_head, local_tail and local_mid to 0. If the word has already been recorded, increment its private parameter local_head by 1 when the current sentence lies in sectionA, increment local_tail by 1 when it lies in sectionB, and increment local_mid by 1 when it lies in sectionC;

After every word has been traversed and the statistics of the private parameters completed, compute the weight of each word, sort the words by weight in descending order, and select the first N words of the sorted sequence as the keywords of the text, where N is an integer greater than 1 and can be configured as needed.
The weight W of a word can be computed according to the following formula:

W = tw × (local_head × locw1 + local_mid × locw2 + local_tail × locw3 + WL × lenw + PS × posw + TFIDF × tfw);

where the parameters tw, locw1, locw2, locw3, lenw, posw and tfw may be set, purely by way of example, to tw = 0.4, locw1 = 0.5, locw2 = 0.3, locw3 = 0.3, lenw = 0.01, posw = 0.5, tfw = 0.8;

local_head, local_tail and local_mid are the final values of the private parameters after the text has been traversed;

WL is the length of the word;

PS is the part-of-speech score of the word, which may for example be determined as follows: PS = 0.2 if the word is a morpheme word; PS = 0.6 if it is a noun, nominal verb, nominal adjective, idiom or set phrase; PS = 0.3 if it is a verb; PS = 0.4 if it is an auxiliary verb; PS = 0.2 if it is an adjective; PS = 0.5 if it is an English word; and PS = 0 for any other part of speech;

TFIDF is the TF-IDF value of the word, TFIDF = TF × IDF, where TF is the term frequency, i.e. the frequency with which the word appears in the text, and IDF is obtained by dividing the total number of texts (the total number of texts in the corpus) by the number of texts containing the word and taking the logarithm of the quotient.
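The weight formula above can be expressed directly in code. The coefficient values are the illustrative ones given in the description; the function signature itself is an assumption.

```python
# Example coefficients from the description (illustrative only).
PARAMS = dict(tw=0.4, locw1=0.5, locw2=0.3, locw3=0.3,
              lenw=0.01, posw=0.5, tfw=0.8)

def keyword_weight(local_head, local_mid, local_tail,
                   word_length, pos_score, tf, idf, p=PARAMS):
    """Weight W of a candidate keyword, following
    W = tw * (local_head*locw1 + local_mid*locw2 + local_tail*locw3
              + WL*lenw + PS*posw + TFIDF*tfw)."""
    tfidf = tf * idf  # TFIDF = TF * IDF
    return p["tw"] * (local_head * p["locw1"]
                      + local_mid * p["locw2"]
                      + local_tail * p["locw3"]
                      + word_length * p["lenw"]
                      + pos_score * p["posw"]
                      + tfidf * p["tfw"])
```

The top N words by this weight then become the keywords of the text.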
In another optional embodiment, the keywords of a text can be extracted with the TextRank algorithm. The process of extracting keywords from a text is as follows:

Split the text into sentences and segment each sentence into words, obtaining the set of sentences and the set of words; optionally, filter out stop words in each sentence and retain only words of specified parts of speech (for example nouns, verbs and adjectives);

Treat each word as a node, as in the PageRank algorithm, and set the window size to k. Suppose a sentence consists of the words w1, w2, w3, w4, w5, ..., wn in order; then {w1, w2, ..., wk}, {w2, w3, ..., wk+1}, {w3, w4, ..., wk+2} and so on are all windows, where k and n are integers and k is less than n. An unweighted, undirected edge exists between the nodes of any two words within the same window;

From the graph formed by these edges, the importance of each word node can be computed; according to the importance of each word node, the N most important words are selected as the keywords of the text, where N is an integer greater than 1 and can be configured as needed.
The importance of a word node can be computed according to the following formula:

S(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|

where S(Vi) is the importance of word node i; d is the damping coefficient, usually set to 0.85; In(Vi) is the set of word nodes pointing to word node i; S(Vj) is the importance of word node j in the set of word nodes pointing to word node i; Out(Vj) is the set of word nodes pointed to by word node j; and |Out(Vj)| is the number of nodes in that set.

The importance of the word nodes is obtained by iterating the above formula several times; initially, the importance of every word node can be set to 1. The left-hand side of the formula is the importance of a word node after an iteration, and the importances used on the right-hand side are those before the iteration.
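The iteration described above can be sketched as follows; the small co-occurrence graph is invented purely for illustration, and for the undirected graph used here In(Vi) and Out(Vi) coincide with the neighbor set.

```python
def textrank(neighbors, d=0.85, iterations=50):
    """Iterate S(Vi) = (1 - d) + d * sum over Vj in In(Vi) of
    S(Vj) / |Out(Vj)|, starting from importance 1 for every node."""
    scores = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for node in neighbors:
            rank = sum(scores[nb] / len(neighbors[nb])
                       for nb in neighbors[node])
            new_scores[node] = (1 - d) + d * rank
        scores = new_scores
    return scores

# Undirected co-occurrence graph: each key lists its window neighbors.
graph = {
    "text": ["similarity", "hamming"],
    "similarity": ["text", "hamming", "vector"],
    "hamming": ["text", "similarity"],
    "vector": ["similarity"],
}
```

Nodes with more (and better-connected) neighbors accumulate higher importance, so the best-connected word ranks first.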
In an alternative embodiment, the kernel sentence in text can be extracted based on TextRank algorithm.Carried from a text
Take the process of kernel sentence as follows:
Subordinate sentence is carried out to the text;The weight of each sentence is calculated, is arranged according to the weight of each sentence is descending
Sequence, select kernel sentence of N number of sentence as the text from front to back from collating sequence, wherein, N can be more than or equal to 3
Integer, N can also configure according to being actually needed.
Wherein it is possible to the weight of a sentence is calculated according to following formula:
Wherein, WS (Vi) is sentence i weight;D is damped coefficient, is traditionally arranged to be 0.85;In (Vi) is directed to sentence i
Sentence set;S (Vj) is directed to the weight of sentence j in sentence i sentence set;Out (Vj) is the sentence pointed by sentence j
Set;wjiRepresent the similarity between sentence j in sentence i and sensing sentence i sentence set;wjkRepresent sentence j and sentence j
Similarity in pointed sentence set between sentence k.
Wherein, the weight of a sentence is likewise obtained only after several iterations of the above formula; the left side of the equation represents the weight of one sentence, and the summation on the right represents the contribution of each adjacent sentence to that sentence. Unlike keyword extraction, all sentences are considered adjacent to one another, so no window is extracted.
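As a rough sketch of the iterative weighting described above, the following Python fragment computes sentence weights from a precomputed similarity matrix. The matrix would in practice come from BM25 or a comparable measure; here it is supplied directly, and the function name and iteration count are illustrative assumptions, not details fixed by this embodiment.

```python
def textrank_sentence_weights(sim, d=0.85, iterations=50):
    """Iteratively compute TextRank sentence weights.

    sim[j][i] is the similarity between sentence j and sentence i
    (w_ji in the formula); all sentences are treated as adjacent.
    Weights start at 1 and are refined over successive iterations.
    """
    n = len(sim)
    ws = [1.0] * n
    for _ in range(iterations):
        new_ws = []
        for i in range(n):
            total = 0.0
            for j in range(n):
                if j == i:
                    continue
                # Normalise sentence j's contribution by the sum of its
                # outgoing similarities (the inner sum in the formula).
                denom = sum(sim[j][k] for k in range(n) if k != j)
                if denom > 0:
                    total += sim[j][i] / denom * ws[j]
            new_ws.append((1 - d) + d * total)
        ws = new_ws
    return ws

# Three sentences; sentences 0 and 1 are highly similar to each other,
# sentence 2 is an outlier, so it should rank last.
sim = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
weights = textrank_sentence_weights(sim)
ranked = sorted(range(3), key=lambda i: weights[i], reverse=True)
```

Sorting the indices by the converged weights and taking the first N entries corresponds to the kernel-sentence selection step described above.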
Wherein, the similarities w_ji and w_jk may be calculated with the BM25 algorithm; that is, the similarity between two sentences can be obtained by analysing the morphemes in the sentences, weighting the morphemes, and predicting the relevance between morphemes and sentences.
The above ways of extracting keywords and kernel sentences from a text are merely examples, and the application is not limited thereto. In other embodiments, keywords and/or kernel sentences may be extracted from a text in any other feasible way.
The following takes the SimHash algorithm and the word2vec model as examples to explain how the data of the title, content, keywords and kernel sentences is obtained. Here, the hash value is a SimHash value, and the term vector data is a VSM. However, the application is not limited thereto; in other embodiments, other feasible algorithms and models may be used.
In this embodiment, the process of extracting the data of a text's title is as follows: segment the title of the text and perform word removal, where word removal includes removing stop words, adverbs, auxiliary words, punctuation marks, prepositions and some conjunctions; extract a 64-bit title SimHash value according to the SimHash algorithm; obtain the VSM of the title through the word2vec model; and splice the segmented, filtered title back together to obtain a spliced character string.
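The 64-bit SimHash fingerprint mentioned here can be sketched as follows. The choice of hash function, the whitespace-free token list, and the absence of term weighting are simplifying assumptions for illustration, not details fixed by this embodiment.

```python
import hashlib

def simhash64(tokens):
    """Toy 64-bit SimHash: each token's hash votes on each bit position;
    the sign of the accumulated vote fixes the final bit."""
    v = [0] * 64
    for tok in tokens:
        # Take 64 bits of a stable hash of the token.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    out = 0
    for i in range(64):
        if v[i] > 0:
            out |= 1 << i
    return out

def hamming(a, b):
    """Number of differing bits between two 64-bit fingerprints."""
    return bin(a ^ b).count("1")

t1 = simhash64(["text", "similarity", "hamming", "distance"])
t2 = simhash64(["text", "similarity", "hamming", "metric"])
t3 = simhash64(["completely", "unrelated", "words", "here"])
```

Token lists sharing most of their words tend to produce fingerprints differing in few bits, which is what makes the later distance thresholds meaningful.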
In this embodiment, the process of extracting the data of a text's content is as follows: segment the content of the text and perform word removal, where word removal includes removing stop words, adverbs, auxiliary words, punctuation marks, prepositions and some conjunctions, and, when the text is a microblog, junk content such as microblog emoticons also needs to be removed; then extract a 64-bit content SimHash value according to the SimHash algorithm.
In this embodiment, the process of extracting the data of a text's keywords is as follows: for the keywords extracted from the text, extract a 64-bit keyword SimHash value according to the SimHash algorithm, and obtain the VSM of the keywords through the word2vec model.
In this embodiment, the process of extracting the data of a text's kernel sentences is as follows: for the at least three kernel sentences extracted from the text, extract a 64-bit SimHash value of each kernel sentence according to the SimHash algorithm, and merge the SimHash values of all kernel sentences to obtain a total SimHash value.
In this embodiment, before step S13, the method further includes determining the term vector similarity of a predetermined object in the following manner:
according to the term vector data of the predetermined object, calculate the term vector cosine similarity of the predetermined object;
when the word count or length of the predetermined object in both texts is greater than a preset penalty factor, determine the term vector similarity of the predetermined object to be that term vector cosine similarity;
when the word count or length of the predetermined object in at least one of the two texts is less than or equal to the penalty factor, calculate a punishment correction value from the word counts or lengths of the two texts and a basic penalty value; determine an additional penalty value as the sum of the addition penalty values of the two texts, where the addition penalty value of each text is determined from the penalty factor, the word count or length of that text, and an additional punishment maximum; and determine the term vector similarity of the predetermined object from the term vector cosine similarity, the smaller of the two texts' word counts or lengths, the punishment correction value, the additional penalty value, the penalty factor, the additional punishment maximum and the basic penalty value.
Wherein, the punishment correction value equals the difference between the basic penalty value and the absolute value of the difference of the word counts or lengths of the two texts. For each text, if the penalty factor minus the word count or length of that text is greater than 0, the addition penalty value of that text equals that difference multiplied by the additional punishment maximum and divided by the penalty factor; if the difference is less than or equal to 0, the addition penalty value of that text equals 0.
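A minimal sketch of this penalty mechanism, using the concrete formula that appears in example one below. The function name and the default parameter values for PF, BP and maxAP are illustrative assumptions; the text fixes the formula but not these values.

```python
def penalized_similarity(ts, fa, fb, pf=10, bp=5, max_ap=20):
    """Apply the short-text penalty to a term vector cosine similarity.

    ts       -- raw term vector cosine similarity
    fa, fb   -- word count (or length) of the object in each text
    pf       -- penalty factor (PF); bp -- basic penalty value (BP)
    max_ap   -- additional punishment maximum (maxAP)
    """
    if fa > pf and fb > pf:          # both long enough: no penalty
        return ts
    mpv = bp - abs(fa - fb)          # punishment correction value (MPV)
    def p(f):                        # per-text addition penalty value
        return (pf - f) * max_ap / pf if pf - f > 0 else 0.0
    ap = p(fa) + p(fb)               # additional penalty value (AP)
    return ts * (100 - pf + min(fa, fb) - bp + mpv - ap) / 100

# Both objects long enough: the cosine similarity passes through.
unpenalized = penalized_similarity(0.9, 20, 30)
# One side very short: the similarity is scaled down.
penalized = penalized_similarity(0.9, 4, 30)
```

With the assumed defaults, the short side (4 words against a penalty factor of 10) both triggers the addition penalty and drags down the correction value, so the result is well below the raw cosine similarity.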
In this embodiment, as shown in Figures 2 to 5, the method further includes:
Step S14: when the Hamming distances of the at least two shared objects do not satisfy the first preset condition, determine the similarity between the two texts according to the minimum of the Hamming distances of the at least two shared objects.
In this embodiment, the first preset condition includes: the minimum of the Hamming distances of the at least two shared objects is less than or equal to a first distance threshold, or the minimum of the Hamming distances of the at least two shared objects lies within a first preset range.
In this embodiment, when the Hamming distances of the at least two shared objects between two texts do not satisfy the first preset condition, it may be determined that the two texts are unlikely to be similar, and the similarity between them can be determined in a predetermined way, avoiding complex computation and improving computational efficiency. When the Hamming distances of the at least two shared objects between two texts satisfy the first preset condition, it may be determined that the two texts are possibly similar, and subsequent complex computation may be performed on the data of the predetermined objects to obtain an accurate similarity. For example, if the shared objects between the two texts are the title, content and keywords, the predetermined objects are the title and the keywords; if the shared objects between the two texts are the title, content, keywords and kernel sentences, the predetermined objects are the title, the keywords and the kernel sentences.
In an alternative embodiment, as shown in Fig. 2, when the predetermined objects include a first predetermined object and a second predetermined object, or include the first predetermined object, the second predetermined object and a third predetermined object, step S13 may include:
Step S131: if the term vector similarity of the first predetermined object satisfies a second preset condition, determine that the similarity between the two texts equals the term vector similarity of the first predetermined object;
Step S132: if the term vector similarity of the first predetermined object does not satisfy the second preset condition, the first Hamming distance of the second predetermined object satisfies a third preset condition, and the second Hamming distance of the second predetermined object satisfies a fourth preset condition, determine the similarity between the two texts according to the second Hamming distance of the second predetermined object;
Step S133: if the term vector similarity of the first predetermined object does not satisfy the second preset condition, and the first Hamming distance of the second predetermined object does not satisfy the third preset condition or the second Hamming distance of the second predetermined object does not satisfy the fourth preset condition, determine the similarity between the two texts according to the spliced-string similarity of the first predetermined object and/or the term vector similarity of the third predetermined object.
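The branch structure of steps S131 to S133 can be sketched as follows, instantiated with the objects and thresholds of example one below (title, kernel sentences, keywords). Reading the "and/or" of step S133 as the weighted combination used in example one is an assumption, as are all parameter names.

```python
def combined_similarity(title_ts, core_total_dist, core_min_pair_dist,
                        splice_sim, keyword_ts,
                        st=0.92, d2=10, d3=5, L=100, w1=0.8, w2=0.2):
    """Cascade of steps S131-S133 with the thresholds of example one.

    title_ts           -- term vector similarity of the title (1st object)
    core_total_dist    -- total kernel-sentence distance (1st Hamming dist)
    core_min_pair_dist -- minimum pairwise kernel-sentence distance
                          (2nd Hamming dist)
    splice_sim         -- spliced-string similarity of the title
    keyword_ts         -- term vector similarity of the keywords (3rd object)
    """
    if title_ts > st:                                        # S131
        return title_ts
    if core_total_dist >= d2 and core_min_pair_dist <= d3:   # S132
        return (L - core_min_pair_dist) / L
    return keyword_ts * w1 + splice_sim * w2                 # S133
```

Each branch returns as soon as a cheaper signal is decisive, which mirrors the efficiency argument made above: the weighted combination is only computed when both the title similarity and the kernel-sentence distances fail to settle the question.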
In an alternative embodiment, as shown in Fig. 3, when the predetermined objects include the second predetermined object and the third predetermined object but not the first predetermined object, step S13 may include:
Step S134: if the first Hamming distance of the second predetermined object satisfies the third preset condition and the second Hamming distance of the second predetermined object satisfies the fourth preset condition, determine the similarity between the two texts according to the second Hamming distance of the second predetermined object;
Step S135: if the first Hamming distance of the second predetermined object does not satisfy the third preset condition or the second Hamming distance of the second predetermined object does not satisfy the fourth preset condition, determine the similarity between the two texts according to the term vector similarity of the third predetermined object.
In an alternative embodiment, as shown in Fig. 4, when the predetermined objects include the first predetermined object but not the second predetermined object, or include the first predetermined object and the third predetermined object but not the second predetermined object, step S13 may include:
Step S136: if the term vector similarity of the first predetermined object satisfies the second preset condition, determine that the similarity between the two texts equals the term vector similarity of the first predetermined object;
Step S137: if the term vector similarity of the first predetermined object does not satisfy the second preset condition, determine the similarity between the two texts according to the spliced-string similarity of the first predetermined object and/or the term vector similarity of the third predetermined object.
In an alternative embodiment, as shown in Fig. 5, when the predetermined objects include the second predetermined object but neither the first predetermined object nor the third predetermined object, step S13 may include:
Step S138: determine the similarity between the two texts according to the second Hamming distance of the second predetermined object.
Wherein, the second preset condition includes: the term vector similarity of the first predetermined object is greater than a first similarity threshold, or the term vector similarity of the first predetermined object lies within a second preset range. The third preset condition includes: the first Hamming distance of the second predetermined object is greater than or equal to a second distance threshold, or the first Hamming distance of the second predetermined object lies within a third preset range. The fourth preset condition includes: the second Hamming distance of the second predetermined object is less than or equal to a third distance threshold, or the second Hamming distance of the second predetermined object lies within a fourth preset range.
The present embodiment is described in detail below through examples one to seven. In the following examples, the first predetermined object is the title, the second predetermined object is the kernel sentences, and the third predetermined object is the keywords. The first Hamming distance of the second predetermined object is the total kernel-sentence Hamming distance between the two texts; the second Hamming distance of the second predetermined object comprises the Hamming distances between each kernel sentence of one text and each kernel sentence of the other text. In the following examples, the Hamming distance is the SimHash distance obtained from the difference of SimHash values. The examples detail the similarity computation; the process of extracting the object data is as described above and is not repeated here.
Example one
In this example, calculating the similarity of text A and text B is taken as an example. From text A are extracted: title SimHash value a1, content SimHash value a2, keyword SimHash value a3, kernel-sentence SimHash values a4, a5 and a6 (here, three kernel sentences are taken as an example), the total SimHash value a7 of the three kernel sentences, the VSM of the title and the VSM of the keywords. From text B are extracted: title SimHash value b1, content SimHash value b2, keyword SimHash value b3, kernel-sentence SimHash values b4, b5 and b6 (here, three kernel sentences are taken as an example), the total SimHash value b7 of the three kernel sentences, the VSM of the title and the VSM of the keywords. Here, the shared objects between text A and text B are: title, content, keywords and kernel sentences.
As shown in Fig. 6, the similarity computation between text A and text B comprises the following steps:
Step 101: compute the content SimHash distance D1, the keyword SimHash distance D2, the kernel-sentence total SimHash distance D3 and the title SimHash distance D4 between text A and text B, respectively;
where D1 = |a2-b2|, D2 = |a3-b3|, D3 = |a7-b7|, D4 = |a1-b1|.
It should be noted that the title SimHash distance D4 between text A and text B is computed only when the title lengths of both text A and text B are greater than a title-length threshold. When the title length of at least one of text A and text B is less than the title-length threshold, D4 is not computed.
In this embodiment, the title lengths of text A and text B are both taken to be greater than the title-length threshold, so the title SimHash distance D4 needs to be computed.
Step 102: select the minimum of D1, D2, D3 and D4 as the minimum distance Dmin; here, Dmin = min{D1, D2, D3, D4}.
Step 103: compare the minimum distance Dmin with a preset first distance threshold maxDt; here, the value of the first distance threshold maxDt is, for example, 25, although the application is not limited thereto.
When the minimum distance Dmin is greater than the first distance threshold maxDt, compute the similarity between text A and text B according to the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here, the value of L is 100.
When the minimum distance Dmin is less than or equal to the first distance threshold maxDt, perform step 104.
Step 103 performs preliminary screening with a cheap computation, so that data with low similarity likelihood avoid the subsequent complex computation, improving computational efficiency.
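Steps 101 to 103 can be sketched as follows. The text writes a SimHash distance as a difference of SimHash values; the sketch reads this as the bit-level Hamming distance between the fingerprints, which is the conventional SimHash interpretation and therefore an assumption here.

```python
def hamming(a, b):
    """Bit-level Hamming distance between two 64-bit SimHash values."""
    return bin(a ^ b).count("1")

def prescreen(dists, max_dt=25, L=100):
    """Steps 102-103: if even the closest shared object is far apart,
    return a cheap similarity estimate; otherwise return None to signal
    that the expensive comparison pipeline should run."""
    d_min = min(dists)
    if d_min > max_dt:
        return (L - d_min) / L
    return None

# Fingerprints differing in every bit fail the screen cheaply.
a, b = 0xFFFFFFFFFFFFFFFF, 0x0000000000000000
print(prescreen([hamming(a, b)]))   # 64 bits differ, prints 0.36
```

Any object pair within the threshold (for example a distance of 3) drops the caller into the full computation instead, matching the efficiency argument made for step 103.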
Step 104: compute the term vector similarity Ts of the title.
As shown in Fig. 7, the computation of the term vector similarity Ts of the title comprises the following steps:
Step 1041: determine the term vector cosine similarity ts of the title from the VSM of text A's title and the VSM of text B's title;
Step 1042: judge whether F(A) is less than or equal to PF, or whether F(B) is less than or equal to PF, where PF is a preset penalty factor, F(A) is the length of text A's title and F(B) is the length of text B's title.
When F(A) is less than or equal to PF, or F(B) is less than or equal to PF, perform step 1043; when F(A) is greater than PF and F(B) is greater than PF, determine the term vector similarity of the title as Ts = ts.
Step 1043: compute the term vector similarity Ts of the title according to the following formula:
Ts = ts × (100 - PF + min(F(A), F(B)) - BP + MPV - AP)/100,
where MPV = BP - |F(A) - F(B)| and AP = P(A) + P(B);
when PF - F(A) > 0, P(A) = (PF - F(A)) × maxAP/PF; when PF - F(A) ≤ 0, P(A) = 0;
when PF - F(B) > 0, P(B) = (PF - F(B)) × maxAP/PF; when PF - F(B) ≤ 0, P(B) = 0;
where ts is the term vector cosine similarity of the title determined in step 1041, PF is the preset penalty factor, F(A) is the length of text A's title, F(B) is the length of text B's title, BP is the basic penalty value, and maxAP is the additional punishment maximum.
In this embodiment, processing the term vector cosine similarity through the penalty mechanism of step 104 avoids the larger error that short texts would otherwise cause.
Step 105: compare the term vector similarity Ts of the title with a preset first similarity threshold St; here, the value of the first similarity threshold St is, for example, 0.92, although the application is not limited thereto.
When the term vector similarity Ts of the title is greater than the first similarity threshold St, determine the similarity between text A and text B as S = Ts;
when the term vector similarity Ts of the title is less than or equal to the first similarity threshold St, perform step 106.
Step 106: compare D3 with a second distance threshold; here, the second distance threshold is 10, although the application is not limited thereto.
When D3 is greater than or equal to the second distance threshold, perform steps 107 and 108; when D3 is less than the second distance threshold, perform step 109.
Step 107: compute the SimHash distances between each kernel sentence in text A and each kernel sentence in text B, and determine the minimum value dmin.
In step 107, the differences between the SimHash values of the kernel sentences in text A and those of the kernel sentences in text B are computed, yielding multiple SimHash distances. In this example, the SimHash values of text A's three kernel sentences are a4, a5 and a6, and those of text B's three kernel sentences are b4, b5 and b6, so the following distances are obtained: |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|, |a6-b4|, |a6-b5|, |a6-b6|.
For each distance, if the distance is less than or equal to a third distance threshold minDt (for example, 3 or 5), compute the similarity between the two kernel sentences corresponding to that distance through the Jaro-Winkler algorithm; if that similarity is greater than or equal to a second similarity threshold (for example, 0.8), retain the distance, and if it is less than the second similarity threshold, discard the distance. If the distance is greater than the third distance threshold minDt, discard the distance. In this embodiment, the data retained after the similarity and distance comparisons includes, for example: |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|.
The minimum value dmin is then determined from all retained distances; in this embodiment, dmin = min{|a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|}.
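Step 107 can be sketched as follows. The shape of the inputs and the trivial equality-based string similarity used in the demonstration (a real pipeline would pass a Jaro-Winkler function) are assumptions for illustration.

```python
def min_kernel_distance(sents_a, sents_b, string_sim,
                        min_dt=5, sim_threshold=0.8):
    """Pairwise kernel-sentence distances, filtered twice (step 107).

    sents_a, sents_b -- lists of (simhash, sentence_text) pairs
    string_sim       -- string similarity, e.g. Jaro-Winkler
    Returns the minimum retained distance, or None if nothing survives.
    """
    def hamming(x, y):
        return bin(x ^ y).count("1")
    kept = []
    for ha, ta in sents_a:
        for hb, tb in sents_b:
            d = hamming(ha, hb)
            # Keep a distance only if it is small AND the sentence texts
            # themselves are confirmed similar by the string measure.
            if d <= min_dt and string_sim(ta, tb) >= sim_threshold:
                kept.append(d)
    return min(kept) if kept else None

exact = lambda a, b: 1.0 if a == b else 0.0   # stand-in for Jaro-Winkler
d = min_kernel_distance([(0b1111, "abc")],
                        [(0b1011, "abc"), (0b0000, "xyz")], exact)
```

In the demonstration, the first pair survives both filters (1 differing bit, identical text) while the second is rejected by the string check, so d is 1.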
Step 108: compare dmin with the preset third distance threshold minDt; here, the value of the third distance threshold minDt is 3 or 5, although the application is not limited thereto.
When dmin is less than or equal to the third distance threshold minDt, compute the similarity between text A and text B according to the formula S = (L - dmin)/L, where L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here, the value of L is 100.
When dmin is greater than the third distance threshold minDt, perform step 109.
Step 109: compute the term vector similarity S1 of the keywords.
As shown in Fig. 8, the computation of the term vector similarity S1 of the keywords comprises the following steps:
Step 1091: determine the term vector cosine similarity s1 of the keywords from the VSM of text A's keywords and the VSM of text B's keywords;
Step 1092: judge whether F(A) is less than or equal to PF, or whether F(B) is less than or equal to PF, where PF is the preset penalty factor, F(A) is the word count of text A's keywords and F(B) is the word count of text B's keywords.
When F(A) is less than or equal to PF, or F(B) is less than or equal to PF, perform step 1093; when F(A) is greater than PF and F(B) is greater than PF, determine the term vector similarity of the keywords as S1 = s1.
Step 1093: determine the term vector similarity S1 of the keywords according to the following formula:
S1 = s1 × (100 - PF + min(F(A), F(B)) - BP + MPV - AP)/100,
where MPV = BP - |F(A) - F(B)| and AP = P(A) + P(B);
when PF - F(A) > 0, P(A) = (PF - F(A)) × maxAP/PF; when PF - F(A) ≤ 0, P(A) = 0;
when PF - F(B) > 0, P(B) = (PF - F(B)) × maxAP/PF; when PF - F(B) ≤ 0, P(B) = 0;
where s1 is the term vector cosine similarity of the keywords determined in step 1091, PF is the preset penalty factor, F(A) is the word count of text A's keywords, F(B) is the word count of text B's keywords, BP is the basic penalty value, and maxAP is the additional punishment maximum.
In this embodiment, processing the term vector cosine similarity through the penalty mechanism of step 109 avoids the error that too few keywords would otherwise cause.
Step 110: compute the similarity S2 of the spliced character string obtained by splicing the segmented title back together. In this example, the similarity S2 of the spliced character string is computed through the Jaro-Winkler algorithm. However, the application is not limited thereto; in other embodiments, the Jaro-Winkler algorithm may be replaced by string-similarity algorithms such as edit distance or longest common subsequence (LCS).
Step 111: compute the similarity between text A and text B according to the formula S = S1 × w1 + S2 × w2; that is, the similarity between text A and text B is determined as the weighted sum of S1 and S2, where w1 + w2 = 1, the value of w1 being, for example, 0.8 and the value of w2 being, for example, 0.2, although the application is not limited thereto.
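A compact reference implementation of the Jaro-Winkler similarity used in steps 110 and 111, together with the weighted combination of step 111. This is the standard formulation (prefix scale 0.1, common prefix capped at 4 characters), not code taken from the patent.

```python
def jaro_winkler(s1, s2, p=0.1):
    """Compact Jaro-Winkler similarity (prefix scale p, max prefix 4)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):          # count characters matching
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):         # within the match window
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                          # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    jaro = (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3
    prefix = 0                           # length of the common prefix
    for a, b in zip(s1, s2):
        if a == b and prefix < 4:
            prefix += 1
        else:
            break
    return jaro + prefix * p * (1 - jaro)

def final_similarity(s1_kw, s2_splice, w1=0.8, w2=0.2):
    """Step 111: weighted sum of keyword and spliced-string similarity."""
    return s1_kw * w1 + s2_splice * w2
```

The classic "MARTHA"/"MARHTA" pair gives roughly 0.961, which illustrates why Jaro-Winkler suits short spliced title strings: it rewards shared prefixes heavily.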
In addition, when the shared objects between two texts are the title, the keywords and the kernel sentences, the similarity computation between the two texts can refer to this example and is therefore not repeated here.
Example two
In this example, calculating the similarity of text A and text B is taken as an example. From text A are extracted: title SimHash value a1, content SimHash value a2, kernel-sentence SimHash values a4, a5 and a6 (here, three kernel sentences are taken as an example), the total SimHash value a7 of the three kernel sentences and the VSM of the title. From text B are extracted: title SimHash value b1, content SimHash value b2, kernel-sentence SimHash values b4, b5 and b6 (here, three kernel sentences are taken as an example), the total SimHash value b7 of the three kernel sentences and the VSM of the title. Here, the shared objects between text A and text B are: title, content and kernel sentences.
As shown in Fig. 9, the similarity computation between text A and text B comprises the following steps:
Step 201: compute the content SimHash distance D1, the kernel-sentence total SimHash distance D3 and the title SimHash distance D4 between text A and text B, respectively;
where D1 = |a2-b2|, D3 = |a7-b7|, D4 = |a1-b1|.
It should be noted that the title SimHash distance D4 between text A and text B is computed only when the title lengths of both text A and text B are greater than the title-length threshold. When the title length of at least one of text A and text B is less than or equal to the title-length threshold, D4 is not computed.
In this example, the title lengths of text A and text B are both taken to be less than or equal to the title-length threshold, so the title SimHash distance D4 need not be computed.
Step 202: select the minimum of D1 and D3 as the minimum distance Dmin; here, Dmin = min{D1, D3}.
Step 203: compare the minimum distance Dmin with the preset first distance threshold maxDt; here, the value of the first distance threshold maxDt is, for example, 25, although the application is not limited thereto.
When the minimum distance Dmin is greater than the first distance threshold maxDt, compute the similarity between text A and text B according to the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here, the value of L is 100.
When the minimum distance Dmin is less than or equal to the first distance threshold maxDt, perform step 204.
Step 203 performs preliminary screening with a cheap computation, so that data with low similarity likelihood avoid the subsequent complex computation, improving computational efficiency.
Step 204: compute the term vector similarity Ts of the title. Here, step 204 can refer to step 104 in example one and is therefore not repeated.
Step 205: compare the term vector similarity Ts of the title with the preset first similarity threshold St; here, the value of the first similarity threshold St is, for example, 0.92, although the application is not limited thereto.
When the term vector similarity Ts of the title is greater than the first similarity threshold St, determine the similarity between text A and text B as S = Ts;
when the term vector similarity Ts of the title is less than or equal to the first similarity threshold St, perform step 206.
Step 206: compare D3 with the second distance threshold; here, the second distance threshold is 10, although the application is not limited thereto.
When D3 is greater than or equal to the second distance threshold, perform steps 207 and 208; when D3 is less than the second distance threshold, perform step 209.
Step 207: compute the SimHash distances between each kernel sentence in text A and each kernel sentence in text B, and determine the minimum value dmin. Step 207 can refer to step 107 in example one and is therefore not repeated.
In this embodiment, the SimHash values of text A's three kernel sentences are a4, a5 and a6, and those of text B's three kernel sentences are b4, b5 and b6, so the following distances are obtained: |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|, |a6-b4|, |a6-b5|, |a6-b6|. The data retained after the similarity and distance comparisons includes, for example: |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|. Then dmin = min{|a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|}.
Step 208: compare dmin with the preset third distance threshold minDt; here, the value of the third distance threshold minDt is 3 or 5, although the application is not limited thereto.
When dmin is less than or equal to the third distance threshold minDt, compute the similarity between text A and text B according to the formula S = (L - dmin)/L, where L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here, the value of L is 100.
When dmin is greater than the third distance threshold minDt, perform step 209.
Step 209: compute the similarity S2 of the spliced character string obtained by splicing the segmented title back together, and determine the similarity between text A and text B as S = S2.
In this embodiment, the similarity S2 of the spliced character string is computed through the Jaro-Winkler algorithm. However, the application is not limited thereto; in other embodiments, the Jaro-Winkler algorithm may be replaced by string-similarity algorithms such as edit distance or longest common subsequence (LCS).
In addition, when the shared objects between two texts are the title and the kernel sentences, the similarity computation between the two texts can refer to this example and is therefore not repeated here.
Example three
In this example, calculating the similarity of text A and text B is taken as an example. From text A are extracted: title SimHash value a1, content SimHash value a2, keyword SimHash value a3, the VSM of the title and the VSM of the keywords. From text B are extracted: title SimHash value b1, content SimHash value b2, keyword SimHash value b3, the VSM of the title and the VSM of the keywords. Here, the shared objects between text A and text B are: title, content and keywords.
As shown in Fig. 10, the similarity computation between text A and text B comprises the following steps:
Step 301: compute the content SimHash distance D1, the keyword SimHash distance D2 and the title SimHash distance D4 between text A and text B, respectively;
where D1 = |a2-b2|, D2 = |a3-b3|, D4 = |a1-b1|.
It should be noted that the title SimHash distance D4 between text A and text B is computed only when the title lengths of both text A and text B are greater than the title-length threshold. When the title length of at least one of text A and text B is less than or equal to the title-length threshold, D4 is not computed.
In this example, the title lengths of text A and text B are both taken to be greater than the title-length threshold, so the title SimHash distance D4 needs to be computed.
Step 302: select the minimum of D1, D2 and D4 as the minimum distance Dmin; here, Dmin = min{D1, D2, D4}.
Step 303: compare the minimum distance Dmin with the preset first distance threshold maxDt; here, the value of the first distance threshold maxDt is, for example, 25, although the application is not limited thereto.
When the minimum distance Dmin is greater than the first distance threshold maxDt, compute the similarity between text A and text B according to the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here, the value of L is 100.
When the minimum distance Dmin is less than or equal to the first distance threshold maxDt, perform step 304.
Step 303 performs preliminary screening with a cheap computation, so that data with low similarity likelihood avoid the subsequent complex computation, improving computational efficiency.
Step 304: compute the term vector similarity Ts of the title. Here, step 304 can refer to step 104 in example one and is therefore not repeated.
Step 305: compare the term vector similarity Ts of the title with the preset first similarity threshold St; here, the value of the first similarity threshold St is, for example, 0.92, although the application is not limited thereto.
When the term vector similarity Ts of the title is greater than the first similarity threshold St, determine the similarity between text A and text B as S = Ts;
when the term vector similarity Ts of the title is less than or equal to the first similarity threshold St, perform step 306.
Step 306: compute the term vector similarity S1 of the keywords. Here, step 306 can refer to step 109 in example one and is therefore not repeated.
Step 307: compute the similarity S2 of the spliced character string obtained by splicing the segmented title back together. In this embodiment, the similarity S2 of the spliced character string is computed through the Jaro-Winkler algorithm. However, the application is not limited thereto; in other embodiments, the Jaro-Winkler algorithm may be replaced by string-similarity algorithms such as edit distance or longest common subsequence (LCS).
Step 308: compute the similarity between text A and text B according to the formula S = S1 × w1 + S2 × w2; that is, the similarity between text A and text B is determined as the weighted sum of S1 and S2, where w1 + w2 = 1, the value of w1 being, for example, 0.8 and the value of w2 being, for example, 0.2, although the application is not limited thereto.
In addition, when the shared objects between two texts are the title and the keywords, the similarity computation between the two texts can refer to this example and is therefore not repeated here.
Example four
In this example, calculating the similarity of text A and text B is taken as an example. From text A are extracted: title SimHash value a1, content SimHash value a2, keyword SimHash value a3, the VSM of the title and the VSM of the keywords. From text B are extracted: title SimHash value b1, content SimHash value b2 and the VSM of the title. Here, the shared objects between text A and text B are: title and content.
As shown in figure 11, the similarity calculation between text A and text B comprises the following steps:
Step 401: Calculate the content SimHash distance D1 and the title SimHash distance D4 of texts A and B, where D1 = |a2-b2| and D4 = |a1-b1|.
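The notation |a-b| above denotes the Hamming distance between two SimHash values. In a typical SimHash implementation this is the number of differing bits between the two fingerprints, which could be computed as follows (a sketch assuming integer fingerprints; the function name is illustrative, not from the patent):

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR leaves a 1 exactly in the bit positions where the two
    # fingerprints differ; counting those bits gives the Hamming distance.
    return bin(a ^ b).count("1")
```

With 64-bit fingerprints the result ranges from 0 (identical) to 64 (all bits differ).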
It should be noted that the title SimHash distance D4 of texts A and B is calculated only when the title lengths of both text A and text B exceed the title-length threshold. When the title length of at least one of texts A and B is less than or equal to the title-length threshold, D4 is not calculated.
In this example, the title lengths of both text A and text B are assumed to exceed the title-length threshold, so D4 is calculated.
Step 402: Select the minimum of D1 and D4 as the minimum distance Dmin, i.e. Dmin = min{D1, D4}.
Step 403: Compare the minimum distance Dmin with a preset first distance threshold maxDt. Here maxDt is, for example, 25; however, the application is not limited thereto.
When Dmin is greater than the first distance threshold maxDt, the similarity between texts A and B is calculated by the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
When Dmin is less than or equal to the first distance threshold maxDt, step 404 is performed.
Step 403 performs preliminary screening with an inexpensive calculation, preventing data with little likelihood of similarity from undergoing the subsequent complex calculations and thereby improving computational efficiency.
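The preliminary screening of steps 401 to 403 can be sketched as follows. The thresholds are the example values; passing None for the title distance models the case where a title is too short to compare:

```python
def screen_by_simhash(d_content, d_title=None, max_dt: int = 25, L: float = 100.0):
    # Step 402: the minimum SimHash distance over the available shared objects.
    dmin = min(d for d in (d_content, d_title) if d is not None)
    # Step 403: if even the closest object is farther than maxDt, score the
    # pair immediately and skip the expensive term vector steps.
    if dmin > max_dt:
        return (L - dmin) / L
    return None  # proceed to step 404 and beyond
```

Returning None signals that the cheap screen was inconclusive and the full comparison must run.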
Step 404: Calculate the term vector similarity Ts of the titles. Step 404 may be performed as described for step 104 in example one and is not repeated here.
Step 405: Compare the term vector similarity Ts of the titles with the preset first similarity threshold St. Here St is, for example, 0.92; however, the application is not limited thereto.
When Ts is greater than the first similarity threshold St, the similarity between texts A and B is determined by the formula S = Ts.
When Ts is less than or equal to the first similarity threshold St, step 406 is performed.
Step 406: Calculate the similarity S2 of the spliced strings formed by segmenting and re-joining the titles, and determine the similarity between text A and text B by the formula S = S2.
In this embodiment, S2 is calculated with the Jaro-Winkler algorithm; however, the application is not limited thereto. In other embodiments, the Jaro-Winkler algorithm may be replaced by another string-similarity algorithm such as edit distance or longest common subsequence (LCS).
Example five
In this example, the calculation of the similarity between text A and text B is illustrated. From text A are extracted the title SimHash value a1, content SimHash value a2, keyword SimHash value a3, kernel-sentence SimHash values a4, a5, a6 (three kernel sentences are taken as an example here), the total SimHash value a7 of the three kernel sentences, the VSM of the title, and the VSM of the keywords; from text B are extracted the content SimHash value b2, keyword SimHash value b3, kernel-sentence SimHash values b4, b5, b6 (again for three kernel sentences), the total SimHash value b7 of the three kernel sentences, and the VSM of the keywords. The shared objects of texts A and B are therefore the content, the keywords, and the kernel sentences.
As shown in figure 12, the similarity calculation between text A and text B comprises the following steps:
Step 501: Calculate the content SimHash distance D1, the keyword SimHash distance D2, and the total kernel-sentence SimHash distance D3 of texts A and B, where D1 = |a2-b2|, D2 = |a3-b3|, and D3 = |a7-b7|.
Step 502: Select the minimum of D1, D2, and D3 as the minimum distance Dmin, i.e. Dmin = min{D1, D2, D3}.
Step 503: Compare the minimum distance Dmin with the preset first distance threshold maxDt. Here maxDt is, for example, 25; however, the application is not limited thereto.
When Dmin is greater than the first distance threshold maxDt, the similarity between texts A and B is calculated by the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
When Dmin is less than or equal to the first distance threshold maxDt, step 504 is performed.
Step 503 performs preliminary screening with an inexpensive calculation, preventing data with little likelihood of similarity from undergoing the subsequent complex calculations and thereby improving computational efficiency.
Step 504: Compare D3 with a second distance threshold. Here the second distance threshold is 10; however, the application is not limited thereto.
When D3 is greater than or equal to the second distance threshold, steps 505 and 506 are performed; when D3 is less than the second distance threshold, step 507 is performed.
Step 505: Calculate the SimHash distance between each kernel sentence of text A and each kernel sentence of text B, and determine the minimum value dmin. Step 505 may be performed as described for step 107 in example one and is not repeated here.
In this embodiment, the SimHash values of the three kernel sentences of text A are a4, a5, a6, and those of text B are b4, b5, b6, giving the distances |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|, |a6-b4|, |a6-b5|, |a6-b6|. The data retained after the similarity and distance comparisons include, for example, |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|; in that case dmin = min{|a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|}.
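The pairwise kernel-sentence comparison of step 505 can be sketched as follows. This is a simplified illustration that takes the minimum over all kernel-sentence pairs (the example above additionally retains only a filtered subset of pairs), and the default distance function is an assumed bitwise Hamming distance:

```python
def kernel_sentence_dmin(hashes_a, hashes_b,
                         dist=lambda x, y: bin(x ^ y).count("1")) -> int:
    # Distance between every kernel sentence of text A and every kernel
    # sentence of text B; dmin is the overall minimum.
    return min(dist(a, b) for a in hashes_a for b in hashes_b)
```

With three kernel sentences per text this evaluates nine distances, matching the 3x3 table of |ai-bj| values above.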
Step 506: Compare dmin with a preset third distance threshold minDt. Here minDt takes the value 3 or 5, for example; however, the application is not limited thereto.
When dmin is less than or equal to the third distance threshold minDt, the similarity between texts A and B is calculated by the formula S = (L - dmin)/L, where L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
When dmin is greater than the third distance threshold minDt, step 507 is performed.
Step 507: Calculate the term vector similarity S1 of the keywords, and determine the similarity between texts A and B by the formula S = S1. The calculation of S1 may be performed as described for step 109 in example one and is not repeated here.
In addition, when the shared objects of two texts are the keywords and the kernel sentences, the similarity calculation between the two texts may follow this example and is not repeated here.
Example six
In this example, the calculation of the similarity between text A and text B is illustrated. From text A are extracted the title SimHash value a1, keyword SimHash value a3, content SimHash value a2, the VSM of the title, and the VSM of the keywords; from text B are extracted the keyword SimHash value b3, content SimHash value b2, and the VSM of the keywords. The shared objects of texts A and B are therefore the keywords and the content.
As shown in figure 13, the similarity calculation between text A and text B comprises the following steps:
Step 601: Calculate the content SimHash distance D1 and the keyword SimHash distance D2 of texts A and B, where D1 = |a2-b2| and D2 = |a3-b3|.
Step 602: Select the minimum of D1 and D2 as the minimum distance Dmin, i.e. Dmin = min{D1, D2}.
Step 603: Compare the minimum distance Dmin with the preset first distance threshold maxDt. Here maxDt is, for example, 25; however, the application is not limited thereto.
When Dmin is greater than the first distance threshold maxDt, the similarity between texts A and B is calculated by the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
When Dmin is less than or equal to the first distance threshold maxDt, step 604 is performed.
Step 603 performs preliminary screening with an inexpensive calculation, preventing data with little likelihood of similarity from undergoing the subsequent complex calculations and thereby improving computational efficiency.
Step 604: Calculate the term vector similarity S1 of the keywords, and determine the similarity between texts A and B by the formula S = S1. The calculation of S1 may be performed as described for step 109 in example one and is not repeated here.
Example seven
In this example, the calculation of the similarity between text A and text B is illustrated. From text A are extracted the title SimHash value a1, content SimHash value a2, kernel-sentence SimHash values a4, a5, a6 (three kernel sentences are taken as an example here), the total SimHash value a7 of the three kernel sentences, and the VSM of the title; from text B are extracted the content SimHash value b2, kernel-sentence SimHash values b4, b5, b6 (again for three kernel sentences), and the total SimHash value b7 of the three kernel sentences. The shared objects of texts A and B are therefore the content and the kernel sentences.
As shown in figure 14, the similarity calculation between text A and text B comprises the following steps:
Step 701: Calculate the content SimHash distance D1 and the total kernel-sentence SimHash distance D3 of texts A and B, where D1 = |a2-b2| and D3 = |a7-b7|.
Step 702: Select the minimum of D1 and D3 as the minimum distance Dmin, i.e. Dmin = min{D1, D3}.
Step 703: Compare the minimum distance Dmin with the preset first distance threshold maxDt. Here maxDt is, for example, 25; however, the application is not limited thereto.
When Dmin is greater than the first distance threshold maxDt, the similarity between texts A and B is calculated by the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
When Dmin is less than or equal to the first distance threshold maxDt, step 704 is performed.
Step 704: Calculate the SimHash distance between each kernel sentence of text A and each kernel sentence of text B, obtain the minimum value dmin, and calculate the similarity between texts A and B by the formula S = (L - dmin)/L, where L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
In this embodiment, the SimHash values of the three kernel sentences of text A are a4, a5, a6, and those of text B are b4, b5, b6, giving the distances |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|, |a6-b4|, |a6-b5|, |a6-b6|. The data retained after the similarity and distance comparisons include, for example, |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|; in that case dmin = min{|a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|}. The determination of dmin may be performed as described for step 107 in example one and is not repeated here.
In addition, as shown in figure 15, this embodiment also provides a text similarity calculation device for calculating the similarity between two texts. The device includes:
an extraction module 11, configured to extract the data of at least two objects from each text, an object being a feature that embodies the semantics of the text;
a determining module 12, configured to determine the shared objects of the two texts, the number of shared objects being at least two;
a first calculation module 13, configured to calculate the Hamming distance of each shared object between the two texts; and
a second calculation module 14, configured to determine, when the Hamming distances of the at least two shared objects meet a first preset condition, the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the spliced-string similarity of a predetermined object among the at least two shared objects.
The processing flow of the device is as described in the method embodiments above and is not repeated here.
The embodiments of the application make full use of the characteristics of text: they extract the data of multiple features that embody text semantics (such as the title, content, keywords, and kernel sentences) and calculate similarity on the basis of those features, which improves both computational efficiency and the accuracy of the similarity calculation. The text similarity calculation method provided by the embodiments performs preliminary matching with low-cost calculations before any complex calculation, improving computational efficiency. The method can also combine the fast and efficient Hamming-distance calculation with term vector methods that have semantic expressive power, which both avoids the defect in the related art that a similarity obtained with Hamming distance alone lacks semantic expressiveness, and remedies the problems in the related art that calculating similarity with term vector methods alone is too slow and is only applicable to short texts. Moreover, the embodiments introduce a short-text penalty mechanism that can correct the errors that easily occur when matching short texts.
Embodiment two
The text similarity calculation method provided by this embodiment can be applied to big-data scenarios. There, calculating the similarity between a target text and every one of a large number of texts involves a heavy computational load and is time-consuming. In this embodiment, therefore, the texts that satisfy a given condition with respect to the target text are first looked up among the large number of texts as the texts to be analyzed, and the similarity is then calculated between the target text and each text to be analyzed.
As shown in figure 16, the text similarity calculation method provided by this embodiment, for calculating the similarity between a target text and the texts to be analyzed, comprises the following steps:
Step S21: According to the hash values of the first object and/or the second object of each of multiple texts, establish an index domain for the first object and/or an index domain for the second object, where each index domain includes one or more index trees.
In this embodiment, the first object is the content and the second object is the title. Taking SimHash values obtained with the SimHash algorithm as the hash values, the SimHash values of the first object and of the second object are obtained as described in embodiment one and are not repeated here.
For the multiple texts, a title index domain is built from the title SimHash values and a content index domain is built from the content SimHash values; if the multiple texts are microblog posts, only a content index domain built from the content SimHash values is needed.
One or more bk-tree index trees are established under each index domain. Each bk-tree index tree can hold G texts, where G can be configured according to actual business requirements, for example G = 1000; the application is not limited thereto.
A bk-tree index tree is a data structure built on the Hamming distance, which satisfies the triangle inequality. The triangle inequality is as follows: let d(x, y) denote the Hamming distance from string x to string y; then the sum of d(x, y) and d(y, z) is greater than or equal to d(x, z), i.e., the number of steps needed to change string x directly into string z does not exceed the number of steps needed to first change x into string y and then change y into z.
The construction of the bk-tree index trees in this embodiment is as follows: in an index tree, an arbitrary text is first chosen as the root node; thereafter, whenever a text is inserted, the Hamming distance between the inserted text and the root node is calculated. If that Hamming distance value occurs for the first time at the root node, a new child node is created; otherwise the insertion recurses down along the edge corresponding to that distance. Each child node is handled in the same way, which is not repeated here. In this manner, the corresponding number of texts can be inserted into the index tree.
Alternatively, kd-tree index trees may be established in each index domain, or an inverted index may be built from the SimHash values. An inverted index built from SimHash values is an index in which the positions of records are determined from the SimHash values. A kd-tree index tree is a binary tree storing K-dimensional data points; building a kd-tree on a set of K-dimensional data represents a partition of the K-dimensional space formed by that set, i.e., each node in the tree corresponds to a K-dimensional hyper-rectangle region.
Step S22: According to the hash value of the first object of the target text, look up, in each index tree of the index domain of the first object, the texts whose Hamming distance to the target text satisfies a first condition; and/or, according to the hash value of the second object of the target text, look up, in each index tree of the index domain of the second object, the texts whose Hamming distance to the target text satisfies a second condition; and determine the texts to be analyzed from the texts found.
The first condition includes, for example: the Hamming distance to the target text is less than a first threshold, or the Hamming distance to the target text lies within a first range. The second condition includes, for example: the Hamming distance to the target text is less than a second threshold, or the Hamming distance to the target text lies within a second range. The first threshold and the second threshold may be identical, and the first range may be identical to the second range.
When only one index domain is included, it suffices to look up, in that index domain, N texts that satisfy the first condition or the second condition as the texts to be analyzed.
When two index domains are included, N1 texts are first found in the content index domain among the texts whose Hamming distance to the target text is less than the first threshold; then, in the title index domain, excluding the N1 texts already found, N2 texts are found among the texts whose Hamming distance to the target text is less than the second threshold, finally yielding N1 + N2 texts as the texts to be analyzed. The first threshold may equal the second threshold, and N1 may equal N2; however, the application is not limited thereto. Alternatively, when two index domains are included, the content index domain and the title index domain are searched simultaneously: in the content index domain, N1 texts are found among the texts whose Hamming distance to the target text is less than the first threshold, and in the title index domain, N2 texts are found among the texts whose Hamming distance to the target text is less than the second threshold; the N1 texts and the N2 texts are then de-duplicated to determine the final N texts to be analyzed, where N is less than or equal to N1 + N2. With a SimHash dimension of 64, the first and second thresholds may both take the value 25.
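The two-domain candidate selection with de-duplication described above can be sketched as follows. This is a simplified illustration; the hit lists are assumed to be already filtered by the distance thresholds:

```python
def candidate_texts(content_hits, title_hits, n1: int, n2: int):
    # Step S22 with two index domains: take up to N1 candidates from the
    # content index domain, then up to N2 more from the title index domain,
    # skipping any text already found (de-duplication).
    picked = list(content_hits[:n1])
    seen = set(picked)
    extra = [t for t in title_hits if t not in seen][:n2]
    return picked + extra
```

The result holds at most N1 + N2 distinct texts, matching the bound N <= N1 + N2 stated above.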
The lookup process within one index domain is illustrated below:
In each bk-tree index tree of the index domain, the texts to be returned are those whose Hamming distance to the target text does not exceed a threshold n. If the Hamming distance between the target text and the text at the root node is d, only the subtrees attached to edges numbered in the range d - n to d + n need to be considered recursively for the query. Since n is generally small, many subtrees can be excluded at every node comparison. This greatly reduces the amount of calculation and improves computational efficiency, by a factor of at least 5 to 10.
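The bk-tree construction and the pruned lookup described above can be sketched together as follows. This is an illustrative implementation, not code from the patent; the distance function is an assumed bitwise Hamming distance over SimHash fingerprints:

```python
class BKTree:
    """bk-tree over a metric distance satisfying the triangle inequality."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None  # each node is (value, {edge distance: child node})

    def insert(self, value):
        if self.root is None:
            self.root = (value, {})  # first value becomes the root
            return
        node = self.root
        while True:
            d = self.dist(value, node[0])
            if d in node[1]:          # an edge for this distance exists: descend
                node = node[1][d]
            else:                     # distance occurs for the first time: new child
                node[1][d] = (value, {})
                return

    def query(self, target, n):
        """All stored values within distance n of target. The triangle
        inequality guarantees that only edges labelled d-n .. d+n can lead
        to hits, so the other subtrees are pruned."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            value, children = stack.pop()
            d = self.dist(target, value)
            if d <= n:
                results.append(value)
            for edge, child in children.items():
                if d - n <= edge <= d + n:
                    stack.append(child)
        return results
```

A usage sketch: build one tree per index domain with `dist = lambda a, b: bin(a ^ b).count("1")`, insert the fingerprints of up to G texts, and call `query(target_fingerprint, 25)` to obtain the candidate texts for step S22.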
After the texts to be analyzed are determined, the similarity between the target text and each text to be analyzed can be calculated; the resulting similarity can be used for services such as infringement comparison, de-duplication, and template filtering.
Step S23: Determine the shared objects of the target text and a text to be analyzed, the number of shared objects being at least two.
Step S24: Calculate the Hamming distance of each shared object between the target text and the text to be analyzed.
Step S25: When the Hamming distances of the at least two shared objects meet the first preset condition, determine the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the spliced-string similarity of a predetermined object among the at least two shared objects.
Steps S23 to S25 may be performed as described in embodiment one; other details of the similarity calculation in this embodiment may also refer to embodiment one and are not repeated here.
Figure 17 is a structural schematic diagram of the text similarity calculation device provided by embodiment two of the application. As shown in figure 17, the device provided by this embodiment includes an index building module 21, a lookup module 22, an extraction module 23, a determining module 24, a first calculation module 25, and a second calculation module 26, where:
the extraction module 23 is configured to extract the hash values of the first object and/or the second object from the target text and from multiple texts;
the index building module 21 is configured to establish, according to the hash values of the first object and/or the second object of each of the multiple texts, the index domain of the first object and/or the index domain of the second object, each index domain including one or more index trees;
the lookup module 22 is configured to look up, according to the hash value of the first object of the target text, in each index tree of the index domain of the first object, the texts whose Hamming distance to the target text satisfies the first condition, and/or to look up, according to the hash value of the second object of the target text, in each index tree of the index domain of the second object, the texts whose Hamming distance to the target text satisfies the second condition, and to determine the texts to be analyzed from the texts found;
the extraction module 23 is further configured to extract the data of at least two objects from the target text and from each text to be analyzed respectively;
the determining module 24 is configured to determine the shared objects of the target text and a text to be analyzed, the number of shared objects being at least two;
the first calculation module 25 is configured to calculate the Hamming distance of each shared object between the target text and the text to be analyzed; and
the second calculation module 26 is configured to determine, when the Hamming distances of the at least two shared objects meet the first preset condition, the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the spliced-string similarity of a predetermined object among the at least two shared objects.
The processing flow of the device is as described in the method of this embodiment and is not repeated here.
Embodiment three
The embodiments of the application also provide a data processing electronic device for calculating the similarity between two texts. The electronic device includes a memory and a processor; the memory stores a program for text similarity calculation, and when the program is read and executed by the processor, the following operations are performed:
determining the shared objects of the two texts, the number of shared objects being at least two;
calculating the Hamming distance of each shared object between the two texts; and
when the Hamming distances of the at least two shared objects meet the first preset condition, determining the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the spliced-string similarity of a predetermined object among the at least two shared objects.
Optionally, the program is written in Java, C++, or Python.
In addition, the embodiments of the invention also provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the text similarity calculation method described above.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method can be completed by a program instructing the relevant hardware (such as a processor), the program being stored in a computer-readable storage medium such as a read-only memory, magnetic disk, or optical disc. Optionally, all or part of the steps of the above embodiments can also be implemented using one or more integrated circuits. Correspondingly, each module/unit in the above embodiments can be realized in the form of hardware, for example by an integrated circuit realizing its corresponding function, or in the form of a software functional module, for example by a processor executing a program/instruction stored in a memory to realize its corresponding function. The application is not restricted to any particular combination of hardware and software.
The above has shown and described the basic principles, principal features, and advantages of the application. The application is not limited by the above embodiments, which, together with the description, merely illustrate its principles; without departing from the spirit and scope of the application, various changes and improvements are possible, and all such changes and improvements fall within the scope of the claimed application.
Claims (14)
- 1. A text similarity calculation method for calculating the similarity between two texts, wherein data of at least two objects can be extracted from each text, an object being a feature that embodies the semantics of the text, the method comprising: determining the shared objects of the two texts, the number of shared objects being at least two; calculating the Hamming distance of each shared object between the two texts; and, when the Hamming distances of the at least two shared objects meet a first preset condition, determining the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the spliced-string similarity of a predetermined object among the at least two shared objects.
- 2. The method of claim 1, further comprising: when the Hamming distances of the at least two shared objects do not meet the first preset condition, determining the similarity between the two texts according to the minimum value among the Hamming distances of the at least two shared objects.
- 3. The method of claim 1, wherein determining the similarity between the two texts when the Hamming distances of the at least two shared objects meet the first preset condition comprises, when the predetermined objects include a first predetermined object and a second predetermined object, or include a first predetermined object, a second predetermined object, and a third predetermined object: if the term vector similarity of the first predetermined object meets a second preset condition, determining that the similarity between the two texts equals the term vector similarity of the first predetermined object; if the term vector similarity of the first predetermined object does not meet the second preset condition, and the first Hamming distance of the second predetermined object meets a third preset condition and the second Hamming distance of the second predetermined object meets a fourth preset condition, determining the similarity between the two texts according to the second Hamming distance of the second predetermined object; and if the term vector similarity of the first predetermined object does not meet the second preset condition, and the first Hamming distance of the second predetermined object does not meet the third preset condition or the second Hamming distance of the second predetermined object does not meet the fourth preset condition, determining the similarity between the two texts according to the spliced-string similarity of the first predetermined object and/or according to the term vector similarity of the third predetermined object.
- 4. The method of claim 1, wherein determining the similarity between the two texts when the Hamming distances of the at least two shared objects meet the first preset condition comprises, when the predetermined objects include the second predetermined object but not the first predetermined object, or include the second predetermined object and the third predetermined object but not the first predetermined object: determining the similarity between the two texts according to the second Hamming distance of the second predetermined object; or, if the first Hamming distance of the second predetermined object meets the third preset condition and the second Hamming distance of the second predetermined object meets the fourth preset condition, determining the similarity between the two texts according to the second Hamming distance of the second predetermined object, and if the first Hamming distance of the second predetermined object does not meet the third preset condition or the second Hamming distance of the second predetermined object does not meet the fourth preset condition, determining the similarity between the two texts according to the term vector similarity of the third predetermined object.
- 5. The method according to claim 1, wherein, when the Hamming distances of the at least two shared objects satisfy the first preset condition, determining the similarity between the two texts comprises:
  when the predetermined objects comprise the first predetermined object but not the second predetermined object, or comprise the first predetermined object and the third predetermined object but not the second predetermined object:
  if the term vector similarity of the first predetermined object satisfies the second preset condition, determining that the similarity between the two texts is equal to the term vector similarity of the first predetermined object;
  if the term vector similarity of the first predetermined object does not satisfy the second preset condition, determining the similarity between the two texts according to the splicing character string similarity of the first predetermined object, and/or according to the term vector similarity of the third predetermined object.
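Claims 3 to 5 describe a tiered fallback: title similarity first, then the Hamming distances of the key sentences, then splicing character string and/or keyword similarity. A minimal sketch of that cascade follows; the function name, the threshold values, the 64-bit fingerprint width, and the averaging of the two fallback similarities are all illustrative assumptions, since the claims leave the preset conditions and the combination rule unspecified.

```python
def combined_similarity(title_sim, total_sent_hd, pairwise_sent_hd,
                        splice_sim, keyword_sim,
                        title_thresh=0.8, total_hd_thresh=10, pair_hd_thresh=5):
    """Tiered similarity decision sketch (claims 3-5).

    title_sim        -- term vector similarity of the titles (first predetermined object)
    total_sent_hd    -- total Hamming distance of the key sentences (first Hamming distance)
    pairwise_sent_hd -- minimum pairwise sentence Hamming distance (second Hamming distance)
    splice_sim       -- splicing character string similarity of the titles
    keyword_sim      -- term vector similarity of the keywords (third predetermined object)
    All threshold values are illustrative assumptions, not taken from the patent.
    """
    # Second preset condition: a high title similarity is decisive on its own.
    if title_sim >= title_thresh:
        return title_sim
    # Third and fourth preset conditions: fall back to sentence Hamming distances.
    if total_sent_hd <= total_hd_thresh and pairwise_sent_hd <= pair_hd_thresh:
        # Map a Hamming distance to a [0, 1] similarity (assuming 64-bit hashes).
        return 1.0 - pairwise_sent_hd / 64.0
    # Last resort: splicing string and/or keyword similarity (here: their mean).
    return (splice_sim + keyword_sim) / 2.0
```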
- 6. The method according to claim 1, wherein, before determining the similarity between the two texts when the Hamming distances of the at least two shared objects satisfy the first preset condition, the method further comprises determining the term vector similarity of a predetermined object as follows:
  calculating the term vector cosine similarity of the predetermined object according to the term vector data of the predetermined object;
  when the word counts or lengths of the predetermined object in both texts are greater than a preset penalty factor, determining the term vector similarity of the predetermined object to be the term vector cosine similarity;
  when the word count or length of the predetermined object in at least one of the two texts is less than or equal to the penalty factor, calculating a penalty correction value according to the word counts or lengths of the two texts and a basic penalty value; determining an additional penalty value according to the sum of the addition penalty values of the two texts, wherein the addition penalty value of each text is determined by the penalty factor, the word count or length of that text, and an additional penalty maximum; and determining the term vector similarity of the predetermined object according to the term vector cosine similarity, the smaller of the word counts or lengths of the two texts, the penalty correction value, the additional penalty value, the penalty factor, the additional penalty maximum, and the basic penalty value.
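Claim 6 damps the cosine similarity of a predetermined object when it is short in either text. The exact combination formula is not disclosed, so the sketch below is one plausible reading under stated assumptions: a linear correction scaled by how far the shorter object falls below the penalty factor, plus a capped additional penalty summed over both texts. All names and constants are hypothetical.

```python
import math

def cosine(u, v):
    """Plain cosine similarity of two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def penalized_term_vector_similarity(u, v, len_a, len_b,
                                     penalty_factor=5, base_penalty=0.1,
                                     extra_penalty_max=0.2):
    """Sketch of claim 6: penalize the cosine similarity of short objects.

    len_a, len_b are the word counts (or lengths) of the object in the two
    texts. The formula below is an assumption, not the patent's own.
    """
    sim = cosine(u, v)
    if len_a > penalty_factor and len_b > penalty_factor:
        return sim  # both long enough: no penalty (claim 6, first branch)
    # Penalty correction value, scaled by the shorter text's deficit.
    shorter = min(len_a, len_b)
    correction = base_penalty * (penalty_factor - shorter) / penalty_factor
    # Additional penalty value, summed over both texts and capped.
    extra = min(extra_penalty_max,
                sum(max(0, penalty_factor - n) for n in (len_a, len_b))
                * extra_penalty_max / (2 * penalty_factor))
    return max(0.0, sim - correction - extra)
```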
- 7. The method according to claim 1, wherein, before calculating the Hamming distance of each shared object between the two texts, the method further comprises:
  extracting the data of at least two objects from each of the two texts, the data comprising hash values and/or term vector data.
- 8. The method according to any one of claims 1 to 7, wherein the first preset condition comprises: the minimum of the Hamming distances of the at least two shared objects being less than or equal to a first distance threshold, or the minimum of the Hamming distances of the at least two shared objects falling within a first preset range.
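Claims 7 and 8 reduce to comparing per-object hash values by bitwise Hamming distance and gating on the minimum over the shared objects. A sketch, assuming simhash-style integer fingerprints and an illustrative threshold of 3:

```python
def hamming_distance(h1, h2):
    """Bit-level Hamming distance between two integer fingerprints."""
    return bin(h1 ^ h2).count("1")

def first_preset_condition(hash_pairs, distance_threshold=3):
    """Claim 8 sketch: the two texts pass the gate when the minimum Hamming
    distance over their shared objects is at or below the threshold.
    The fingerprint format and the threshold of 3 are assumptions."""
    min_hd = min(hamming_distance(a, b) for a, b in hash_pairs)
    return min_hd <= distance_threshold

# Shared objects of two texts, e.g. (title_hash, content_hash) pairs:
pairs = [(0b1011_0001, 0b1011_0011),   # distance 1
         (0b1111_0000, 0b0000_1111)]   # distance 8
assert first_preset_condition(pairs)   # minimum distance 1 <= 3
```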
- 9. The method according to any one of claims 3 to 5, wherein the first predetermined object is the title, the second predetermined object is the sentences expressing the main idea of the text, the number of the sentences being greater than or equal to three, and the third predetermined object is the keywords; the first Hamming distance of the second predetermined object is the total Hamming distance of the sentences between the two texts, and the second Hamming distance of the second predetermined object comprises the Hamming distance between each sentence in one text and each sentence in the other text.
- 10. The method according to any one of claims 1 to 7, wherein one of the two texts is a target text and the other is a text to be analyzed, and before determining the shared objects of the two texts, the method further comprises obtaining the text to be analyzed as follows:
  establishing an index domain of a first object and/or an index domain of a second object according to the hash values of the first object and/or the second object of each of multiple texts, wherein each index domain comprises one or more index trees;
  searching, according to the hash value of the first object of the target text, each index tree in the index domain of the first object for texts whose Hamming distance to the target text satisfies a first condition, and/or searching, according to the hash value of the second object of the target text, each index tree in the index domain of the second object for texts whose Hamming distance to the target text satisfies a second condition; and
  determining the text to be analyzed from the texts found.
- 11. The method according to claim 10, wherein the index trees are BK-tree index trees.
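Claim 11 names BK-trees as the index trees. A minimal BK-tree over Hamming distance is sketched below (class and method names are assumptions): children are keyed by their distance to the parent, and a radius query prunes any child whose key lies outside [d - radius, d + radius], which by the triangle inequality cannot contain a match.

```python
class BKTree:
    """Minimal BK-tree over bitwise Hamming distance (claim 11 sketch)."""

    def __init__(self, fingerprint):
        self.fp = fingerprint
        self.children = {}  # distance to this node -> subtree

    @staticmethod
    def _dist(a, b):
        return bin(a ^ b).count("1")

    def add(self, fp):
        """Insert a fingerprint by walking down the child at its distance."""
        node = self
        while True:
            d = self._dist(fp, node.fp)
            if d == 0:
                return  # duplicate fingerprint, nothing to insert
            if d in node.children:
                node = node.children[d]
            else:
                node.children[d] = BKTree(fp)
                return

    def query(self, fp, radius):
        """Return all stored fingerprints within `radius` of `fp`."""
        out = []
        d = self._dist(fp, self.fp)
        if d <= radius:
            out.append(self.fp)
        for key, child in self.children.items():
            # Triangle inequality: only these children can hold matches.
            if d - radius <= key <= d + radius:
                out.extend(child.query(fp, radius))
        return out
```

Because near-duplicate fingerprints differ in only a few bits, small-radius queries touch few branches, which is what would make the index domains of claim 10 cheap to search.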
- 12. The method according to claim 10, wherein the first object is the content and the second object is the title.
- 13. A text similarity calculation device for calculating the similarity between two texts, the device comprising:
  an extraction module, configured to extract the data of at least two objects from each text, wherein an object is a feature capable of embodying the semantics of the text;
  a determining module, configured to determine the shared objects of the two texts, wherein the number of the shared objects is at least two;
  a first calculation module, configured to calculate the Hamming distance of each shared object between the two texts; and
  a second calculation module, configured to determine, when the Hamming distances of the at least two shared objects satisfy a first preset condition, the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the splicing character string similarity of a predetermined object among the at least two shared objects.
- 14. The device according to claim 13, further comprising:
  an index establishing module, configured to establish an index domain of a first object and/or an index domain of a second object according to the hash values of the first object and/or the second object of each of multiple texts, wherein each index domain comprises one or more index trees; and
  a searching module, configured to search, according to the hash value of the first object of the target text, each index tree in the index domain of the first object for texts whose Hamming distance to the target text satisfies a first condition, and/or to search, according to the hash value of the second object of the target text, each index tree in the index domain of the second object for texts whose Hamming distance to the target text satisfies a second condition, and to determine the text to be analyzed from the texts found;
  wherein the determining module is configured to determine the shared objects of the target text and the text to be analyzed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610578843.9A CN107644010B (en) | 2016-07-20 | 2016-07-20 | Text similarity calculation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107644010A true CN107644010A (en) | 2018-01-30 |
CN107644010B CN107644010B (en) | 2021-05-25 |
Family
ID=61109052
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610578843.9A Active CN107644010B (en) | 2016-07-20 | 2016-07-20 | Text similarity calculation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107644010B (en) |
- 2016-07-20: application CN201610578843.9A filed in CN (granted as CN107644010B, status: Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7707157B1 (en) * | 2004-03-25 | 2010-04-27 | Google Inc. | Document near-duplicate detection |
CN103678275A (en) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Two-level text similarity calculation method based on subjective and objective semantics |
CN105095162A (en) * | 2014-05-19 | 2015-11-25 | 腾讯科技(深圳)有限公司 | Text similarity determining method and device, electronic equipment and system |
US20160103831A1 (en) * | 2014-10-14 | 2016-04-14 | Adobe Systems Incorporated | Detecting homologies in encrypted and unencrypted documents using fuzzy hashing |
CN105302779A (en) * | 2015-10-23 | 2016-02-03 | 北京慧点科技有限公司 | Text similarity comparison method and device |
Non-Patent Citations (1)
Title |
---|
Chen Lu et al.: "Text de-duplication method based on semantic fingerprint and LCS", Software (《软件》) *
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595439A (en) * | 2018-05-04 | 2018-09-28 | 北京中科闻歌科技股份有限公司 | A kind of character spread path analysis method and system |
CN108595439B (en) * | 2018-05-04 | 2022-04-12 | 北京中科闻歌科技股份有限公司 | Method and system for analyzing character propagation path |
CN110555198B (en) * | 2018-05-31 | 2023-05-23 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer readable storage medium for generating articles |
CN110555198A (en) * | 2018-05-31 | 2019-12-10 | 北京百度网讯科技有限公司 | method, apparatus, device and computer-readable storage medium for generating article |
CN110717092A (en) * | 2018-06-27 | 2020-01-21 | 北京京东尚科信息技术有限公司 | Method, system, device and storage medium for matching objects for articles |
CN108897861A (en) * | 2018-07-01 | 2018-11-27 | 东莞市华睿电子科技有限公司 | A kind of information search method |
CN109190117A (en) * | 2018-08-10 | 2019-01-11 | 中国船舶重工集团公司第七〇九研究所 | A kind of short text semantic similarity calculation method based on term vector |
CN110891010B (en) * | 2018-09-05 | 2022-09-16 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
CN110891010A (en) * | 2018-09-05 | 2020-03-17 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
CN109299260A (en) * | 2018-09-29 | 2019-02-01 | 上海晶赞融宣科技有限公司 | Data classification method, device and computer readable storage medium |
CN109241505A (en) * | 2018-10-09 | 2019-01-18 | 北京奔影网络科技有限公司 | Text De-weight method and device |
CN109089018A (en) * | 2018-10-29 | 2018-12-25 | 上海理工大学 | A kind of intelligence prompter devices and methods therefor |
CN109635090A (en) * | 2018-12-14 | 2019-04-16 | 安徽中船璞华科技有限公司 | A kind of copyright method for tracing based on machine learning |
CN110134768B (en) * | 2019-05-13 | 2023-05-26 | 腾讯科技(深圳)有限公司 | Text processing method, device, equipment and storage medium |
CN110134768A (en) * | 2019-05-13 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Processing method, device, equipment and the storage medium of text |
CN110348539A (en) * | 2019-07-19 | 2019-10-18 | 知者信息技术服务成都有限公司 | Short text correlation method of discrimination |
CN111061983A (en) * | 2019-12-17 | 2020-04-24 | 上海冠勇信息科技有限公司 | Evaluation method for capturing priority of infringement data and network monitoring system thereof |
CN111061983B (en) * | 2019-12-17 | 2024-01-09 | 上海冠勇信息科技有限公司 | Evaluation method of infringement data grabbing priority and network monitoring system thereof |
CN111104794A (en) * | 2019-12-25 | 2020-05-05 | 同方知网(北京)技术有限公司 | Text similarity matching method based on subject words |
CN111104794B (en) * | 2019-12-25 | 2023-07-04 | 同方知网数字出版技术股份有限公司 | Text similarity matching method based on subject term |
CN111061842A (en) * | 2019-12-26 | 2020-04-24 | 上海众源网络有限公司 | Similar text determination method and device |
CN111061842B (en) * | 2019-12-26 | 2023-06-30 | 上海众源网络有限公司 | Similar text determining method and device |
CN111611399A (en) * | 2020-04-15 | 2020-09-01 | 广发证券股份有限公司 | Information event mapping system and method based on natural language processing |
CN112100372A (en) * | 2020-08-20 | 2020-12-18 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Head news prediction classification method |
CN113011194A (en) * | 2021-04-15 | 2021-06-22 | 电子科技大学 | Text similarity calculation method fusing keyword features and multi-granularity semantic features |
CN113268986A (en) * | 2021-05-24 | 2021-08-17 | 交通银行股份有限公司 | Unit name matching and searching method and device based on fuzzy matching algorithm |
WO2022267325A1 (en) * | 2021-06-25 | 2022-12-29 | 完美世界控股集团有限公司 | News popularity calculation method, device and storage medium |
CN113449077A (en) * | 2021-06-25 | 2021-09-28 | 完美世界控股集团有限公司 | News popularity calculation method, equipment and storage medium |
CN113449077B (en) * | 2021-06-25 | 2024-04-05 | 完美世界控股集团有限公司 | News heat calculation method, device and storage medium |
CN113505835A (en) * | 2021-07-14 | 2021-10-15 | 杭州隆埠科技有限公司 | Similar news duplicate removal method and device |
CN113673216B (en) * | 2021-10-20 | 2022-02-01 | 支付宝(杭州)信息技术有限公司 | Text infringement detection method and device and electronic equipment |
CN113673216A (en) * | 2021-10-20 | 2021-11-19 | 支付宝(杭州)信息技术有限公司 | Text infringement detection method and device and electronic equipment |
CN114168809A (en) * | 2021-11-22 | 2022-03-11 | 中核核电运行管理有限公司 | Similarity-based document character string code matching method and device |
CN114398968A (en) * | 2022-01-06 | 2022-04-26 | 北京博瑞彤芸科技股份有限公司 | Method and device for labeling similar customer-obtaining files based on file similarity |
CN117235546A (en) * | 2023-11-14 | 2023-12-15 | 国泰新点软件股份有限公司 | Multi-version file comparison method, device, system and storage medium |
CN117235546B (en) * | 2023-11-14 | 2024-03-12 | 国泰新点软件股份有限公司 | Multi-version file comparison method, device, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107644010B (en) | 2021-05-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107644010A (en) | A kind of Text similarity computing method and device | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN109190117B (en) | Short text semantic similarity calculation method based on word vector | |
CN105095204B (en) | The acquisition methods and device of synonym | |
TWI512507B (en) | A method and apparatus for providing multi-granularity word segmentation results | |
US9626358B2 (en) | Creating ontologies by analyzing natural language texts | |
US8380489B1 (en) | System, methods, and data structure for quantitative assessment of symbolic associations in natural language | |
CN109299280B (en) | Short text clustering analysis method and device and terminal equipment | |
CN104881458B (en) | A kind of mask method and device of Web page subject | |
CN112395395B (en) | Text keyword extraction method, device, equipment and storage medium | |
CN110134792B (en) | Text recognition method and device, electronic equipment and storage medium | |
KR101423549B1 (en) | Sentiment-based query processing system and method | |
CN115630640B (en) | Intelligent writing method, device, equipment and medium | |
CN110347790B (en) | Text duplicate checking method, device and equipment based on attention mechanism and storage medium | |
CN111090731A (en) | Electric power public opinion abstract extraction optimization method and system based on topic clustering | |
CN110083832B (en) | Article reprint relation identification method, device, equipment and readable storage medium | |
JP2003223456A (en) | Method and device for automatic summary evaluation and processing, and program therefor | |
CN114065758A (en) | Document keyword extraction method based on hypergraph random walk | |
CN112597300A (en) | Text clustering method and device, terminal equipment and storage medium | |
JP6420268B2 (en) | Image evaluation learning device, image evaluation device, image search device, image evaluation learning method, image evaluation method, image search method, and program | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. | |
CN114997288A (en) | Design resource association method | |
CN112749272A (en) | Intelligent new energy planning text recommendation method for unstructured data | |
Ashna et al. | Lexicon based sentiment analysis system for malayalam language | |
CN104021202A (en) | Device and method for processing entries of knowledge sharing platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||