CN107644010A - Text similarity calculation method and device - Google Patents
- Publication number: CN107644010A
- Application number: CN201610578843.9A
- Authority
- CN
- China
- Prior art keywords
- text
- similarity
- texts
- predetermined object
- hamming distance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
A text similarity calculation method for computing the similarity between two texts, where the data of at least two objects can be extracted from each text and an object is a feature that embodies the semantics of the text. The method includes: determining the shared objects of the two texts, the number of shared objects being at least two; calculating, for each shared object, the Hamming distance between the two texts; and, when the Hamming distances of the at least two shared objects satisfy a first preset condition, determining the similarity between the two texts according to at least one of the following for a predetermined object among the at least two shared objects: word vector similarity, Hamming distance, and spliced-string similarity. This scheme improves both the efficiency and the accuracy of text similarity calculation.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a text similarity calculation method and device.
Background technology
At present, similarity calculation between texts is applied in many areas. In the related art, the following two schemes can be used to compare texts.
The first scheme: after a long text is segmented into words, a hash is computed for each word, the hashes are weighted by word frequency to obtain a vector, and the vector is then binarized to obtain the hash value of the text. The Hamming distance between texts is determined from these hash values. This scheme is widely used for web page deduplication by search engines such as Google and Baidu.
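The first scheme described above is essentially the SimHash algorithm. The following is a minimal illustrative sketch, not the claimed method: it uses MD5 as the per-word hash and whitespace tokenization, both of which are simplifying assumptions.

```python
import hashlib
from collections import Counter

def simhash(words, bits=64):
    """Simplified SimHash: hash each word, weight its bits by word
    frequency, sum the signed contributions, then binarize."""
    v = [0] * bits
    for word, freq in Counter(words).items():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += freq if (h >> i) & 1 else -freq
    # Binarize: positive components become 1-bits of the fingerprint.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")
```

Similar texts share most weighted bit contributions, so their fingerprints differ in few positions, while unrelated texts differ in roughly half of the 64 bits.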
The second scheme: a topic model such as Latent Dirichlet Allocation (LDA) or Probabilistic Latent Semantic Analysis (PLSA) is trained by machine learning to map each text to a topic vector carrying a certain semantic meaning; the similarity between two texts is then obtained by computing the cosine similarity of their two vectors.
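The cosine similarity used by the second scheme can be sketched as follows. This is a plain implementation over generic numeric vectors; the topic vectors themselves would come from a trained LDA or PLSA model, which is not reproduced here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (nu * nv)
```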
However, although the first scheme can efficiently obtain the Hamming distance between two texts, it discards the semantics of the content and measures distance purely at the character-string level; when the texts are short, the comparison results are unsatisfactory. Moreover, the result of the first scheme is a distance value rather than a similarity, which is inconvenient for downstream business processing. The second scheme can represent text semantics well through machine learning, but training the model is very time-consuming and highly dependent on the training samples, and even very simple sentences may be scored incorrectly. In addition, cosine computation between high-dimensional vectors is relatively inefficient and unsuitable for large texts or big-data environments.

In summary, the text similarity calculation schemes in the related art are relatively inefficient and inaccurate.
Summary of the invention

The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.

Embodiments of the present application provide a text similarity calculation method and device that can improve the efficiency and accuracy of text similarity calculation.
An embodiment of the present application provides a text similarity calculation method for computing the similarity between two texts, where the data of at least two objects can be extracted from each text and an object is a feature that embodies the semantics of the text. The method includes: determining the shared objects of the two texts, the number of shared objects being at least two; calculating, for each shared object, the Hamming distance between the two texts; and, when the Hamming distances of the at least two shared objects satisfy a first preset condition, determining the similarity between the two texts according to at least one of the following for a predetermined object among the at least two shared objects: word vector similarity, Hamming distance, and spliced-string similarity.
Optionally, the method further includes: when the Hamming distances of the at least two shared objects do not satisfy the first preset condition, determining the similarity between the two texts from the minimum of the Hamming distances of the at least two shared objects.
Optionally, when the Hamming distances of the at least two shared objects satisfy the first preset condition, determining the similarity between the two texts includes:

When the predetermined objects include a first predetermined object and a second predetermined object, or include a first predetermined object, a second predetermined object and a third predetermined object: if the word vector similarity of the first predetermined object satisfies a second preset condition, the similarity between the two texts is taken to be that word vector similarity; if the word vector similarity of the first predetermined object does not satisfy the second preset condition, and the first Hamming distance of the second predetermined object satisfies a third preset condition and the second Hamming distance of the second predetermined object satisfies a fourth preset condition, the similarity between the two texts is determined from the second Hamming distance of the second predetermined object; if the word vector similarity of the first predetermined object does not satisfy the second preset condition, and the first Hamming distance of the second predetermined object does not satisfy the third preset condition or its second Hamming distance does not satisfy the fourth preset condition, the similarity between the two texts is determined from the spliced-string similarity of the first predetermined object and/or from the word vector similarity of the third predetermined object.
Optionally, when the Hamming distances of the at least two shared objects satisfy the first preset condition, determining the similarity between the two texts includes:

When the predetermined objects include a second predetermined object but no first predetermined object, or include a second predetermined object and a third predetermined object but no first predetermined object: if the first Hamming distance of the second predetermined object satisfies the third preset condition and its second Hamming distance satisfies the fourth preset condition, the similarity between the two texts is determined from the second Hamming distance of the second predetermined object; if the first Hamming distance of the second predetermined object does not satisfy the third preset condition or its second Hamming distance does not satisfy the fourth preset condition, the similarity between the two texts is determined from the word vector similarity of the third predetermined object.
Optionally, when the Hamming distances of the at least two shared objects satisfy the first preset condition, determining the similarity between the two texts includes:

When the predetermined objects include a first predetermined object but no second predetermined object, or include a first predetermined object and a third predetermined object but no second predetermined object: if the word vector similarity of the first predetermined object satisfies the second preset condition, the similarity between the two texts is taken to be that word vector similarity; if it does not, the similarity between the two texts is determined from the spliced-string similarity of the first predetermined object and/or from the word vector similarity of the third predetermined object.
Optionally, before determining the similarity between the two texts when the Hamming distances of the at least two shared objects satisfy the first preset condition, the method further includes determining the word vector similarity of a predetermined object as follows: computing the word vector cosine similarity of the predetermined object from its word vector data; when the word count or length of the predetermined object in both texts exceeds a preset penalty factor, taking the word vector similarity of the predetermined object to be that word vector cosine similarity; when the word count or length of the predetermined object in at least one of the two texts is less than or equal to the penalty factor, computing a penalty correction value from the word counts or lengths of the two texts and a base penalty value, and computing an additional penalty value from the sum of the additional penalty values of the two texts, where the additional penalty value of each text is determined from the penalty factor, the word count or length of that text, and an additional penalty maximum; and then determining the word vector similarity of the predetermined object from its word vector cosine similarity, the smaller of the two texts' word counts or the smaller of their lengths, the penalty correction value, the additional penalty value, the penalty factor, the additional penalty maximum, and the base penalty value.
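The paragraph above names the quantities involved in the short-text penalty but does not fix their formulas at this point. The following sketch therefore uses assumed linear penalties purely for illustration; the function name, the formulas and the default parameter values are all hypothetical, not the claimed computation.

```python
def short_text_penalized_similarity(cos_sim, len_a, len_b,
                                    penalty_factor=5,
                                    base_penalty=0.1,
                                    extra_penalty_max=0.2):
    """Hypothetical short-text penalty: shrink the word vector cosine
    similarity when either text's object is at or below the penalty
    factor in word count (the linear forms here are assumptions)."""
    if len_a > penalty_factor and len_b > penalty_factor:
        return cos_sim  # both long enough: no penalty applied
    shorter = min(len_a, len_b)
    # Correction grows as the shorter text falls below the factor.
    correction = base_penalty * (penalty_factor - shorter + 1) / penalty_factor
    # Per-text additional penalty, capped at extra_penalty_max.
    extra = sum(
        min(extra_penalty_max,
            extra_penalty_max * (penalty_factor - n) / penalty_factor)
        for n in (len_a, len_b) if n <= penalty_factor
    )
    return max(0.0, cos_sim - correction - extra)
```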
Optionally, before calculating, for each shared object, the Hamming distance between the two texts, the method further includes: extracting the data of at least two objects from each of the two texts, the data including hash values and/or word vector data.
Optionally, the first preset condition includes: the minimum of the Hamming distances of the at least two shared objects is less than or equal to a first distance threshold, or lies within a first preset range.
Optionally, the first predetermined object is the title, the second predetermined object is the sentences representing the central idea of the text (the number of such sentences being greater than or equal to 3), and the third predetermined object is the keywords. The first Hamming distance of the second predetermined object is the total Hamming distance of those sentences between the two texts, and the second Hamming distance of the second predetermined object includes the Hamming distance between each sentence in one text and each sentence in the other text.
Optionally, one of the two texts is a target text and the other is a text to be analyzed. Before determining the shared objects of the two texts, the method further includes obtaining the text to be analyzed as follows:

Building an index domain for a first object and/or an index domain for a second object from the hash values of the first object and/or the second object of each of a plurality of texts, where each index domain includes one or more index trees; searching, according to the hash value of the first object of the target text, each index tree in the index domain of the first object for texts whose Hamming distance to the target text satisfies a first condition, and/or searching, according to the hash value of the second object of the target text, each index tree in the index domain of the second object for texts whose Hamming distance to the target text satisfies a second condition; and determining the text to be analyzed from the texts found.
Optionally, the index trees are bk-tree index trees.
Optionally, the first object is the content and the second object is the title.
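A bk-tree supports efficient radius search under any metric satisfying the triangle inequality, including the Hamming distance used here. The following is an illustrative sketch of such an index tree over integer fingerprints; the class layout is an assumption, not the claimed implementation.

```python
class BKTree:
    """BK-tree over integer fingerprints keyed by Hamming distance,
    as a sketch of the bk-tree index domains described above."""

    def __init__(self):
        self.root = None  # node = (fingerprint, {distance: child_node})

    @staticmethod
    def _dist(a, b):
        return bin(a ^ b).count("1")

    def add(self, value):
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = self._dist(value, node[0])
            if d in node[1]:
                node = node[1][d]  # descend along the existing edge
            else:
                node[1][d] = (value, {})
                return

    def query(self, value, radius):
        """Return all stored fingerprints within `radius` of `value`."""
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self._dist(value, node[0])
            if d <= radius:
                results.append(node[0])
            # Triangle inequality prunes children outside [d-radius, d+radius].
            for cd, child in node[1].items():
                if d - radius <= cd <= d + radius:
                    stack.append(child)
        return results
```

Looking up candidate texts then reduces to one `query` per index tree with the distance bound from the first or second condition.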
An embodiment of the present application also provides a text similarity calculation device for computing the similarity between two texts. The device includes: an extraction module for extracting the data of at least two objects from each text, an object being a feature that embodies the semantics of the text; a determining module for determining the shared objects of the two texts, the number of shared objects being at least two; a first calculation module for calculating, for each shared object, the Hamming distance between the two texts; and a second calculation module for determining, when the Hamming distances of the at least two shared objects satisfy a first preset condition, the similarity between the two texts according to at least one of the following for a predetermined object among the at least two shared objects: word vector similarity, Hamming distance, and spliced-string similarity.
Optionally, the device further includes: an index building module for building an index domain for a first object and/or an index domain for a second object from the hash values of the first object and/or the second object of each of a plurality of texts, where each index domain includes one or more index trees; and a search module for searching, according to the hash value of the first object of the target text, each index tree in the index domain of the first object for texts whose Hamming distance to the target text satisfies a first condition, and/or searching, according to the hash value of the second object of the target text, each index tree in the index domain of the second object for texts whose Hamming distance to the target text satisfies a second condition, and for determining the text to be analyzed from the texts found. The determining module then determines the shared objects of the target text and the text to be analyzed.
An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the above text similarity calculation method.
Embodiments of the present application fully exploit the characteristics of texts by extracting the data of multiple features that embody text semantics (for example the title, content, keywords and kernel sentences) and performing similarity calculation on the data of these multiple features, which improves both the efficiency and the accuracy of similarity calculation. The text similarity calculation method provided by the embodiments performs a preliminary match through low-cost computation before any complex computation, improving efficiency.

Further, the method can combine the fast and efficient computation of Hamming distances with the semantic expressiveness of word vector methods, avoiding both the lack of semantic expressiveness of similarities obtained purely from Hamming distances in the related art and the slow speed of purely word-vector-based similarity calculation and its applicability only to short texts. In addition, the embodiments introduce a short-text penalty mechanism that corrects the errors which easily arise when matching short texts.

Of course, a product implementing the present application need not achieve all of the above advantages simultaneously. Other aspects will become apparent upon reading and understanding the accompanying drawings and the detailed description.
Brief description of the drawings
Fig. 1 is a flow chart of the text similarity calculation method provided by embodiment one of the present application;
Fig. 2 is a first optional flow chart of the text similarity calculation method of embodiment one;
Fig. 3 is a second optional flow chart of the text similarity calculation method of embodiment one;
Fig. 4 is a third optional flow chart of the text similarity calculation method of embodiment one;
Fig. 5 is a fourth optional flow chart of the text similarity calculation method of embodiment one;
Fig. 6 is a flow chart of application example one;
Fig. 7 is a flow chart of step 104 in Fig. 6;
Fig. 8 is a flow chart of step 109 in Fig. 6;
Fig. 9 is a flow chart of application example two;
Fig. 10 is a flow chart of application example three;
Fig. 11 is a flow chart of application example four;
Fig. 12 is a flow chart of application example five;
Fig. 13 is a flow chart of application example six;
Fig. 14 is a flow chart of application example seven;
Fig. 15 is a schematic diagram of the text similarity calculation device provided by embodiment one of the present application;
Fig. 16 is a flow chart of the text similarity calculation method provided by embodiment two of the present application;
Fig. 17 is a schematic diagram of the text similarity calculation device provided by embodiment two of the present application.
Detailed description

The embodiments of the present application are described in detail below with reference to the accompanying drawings. It should be understood that the embodiments described below are intended only to illustrate and explain the application, not to limit it.
It should be noted that, provided there is no conflict, the features in the embodiments of the present application may be combined with each other, and all such combinations fall within the scope of protection of the application. In addition, although a logical order is shown in the flow charts, in some cases the steps may be performed in an order different from that shown or described herein.
Definitions of terms:

Object: a feature that embodies the semantics of a text; examples include the title, content, keywords and kernel sentences.

Kernel sentence: a sentence expressing the central idea of a text; in this embodiment, the number of kernel sentences extracted from each text is greater than or equal to three.

Hamming distance: the number of positions at which the corresponding characters of two (equal-length) character strings differ; the Hamming distance between two texts or objects can be determined from the hash values of the two texts or objects.

Hash value: the value extracted by a locality-sensitive hashing (LSH) algorithm from an object after word segmentation and word removal; LSH algorithms include the SimHash and MinHash algorithms.
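As an illustration of the MinHash variant of LSH mentioned above, the following sketch builds a signature from salted MD5 hashes and estimates Jaccard similarity from signature agreement; the salting scheme and signature length are illustrative assumptions.

```python
import hashlib

def minhash_signature(words, num_hashes=16):
    """Sketch of MinHash: for each of num_hashes salted hash
    functions, keep the minimum hash value over the word set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{w}".encode("utf-8")).hexdigest(), 16)
            for w in set(words)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree estimates the
    Jaccard similarity of the underlying word sets."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```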
Word vector cosine similarity: the value obtained by computing the cosine of two word vectors.

Word vector data: the data obtained by processing an object or text with a word vector model; word vector models include word2vec, PLSA and LDA.

Word vector similarity: the value obtained after the word vector cosine similarity has been adjusted by the penalty mechanism.

Spliced-string similarity: the similarity between two spliced strings computed by a string similarity algorithm, where a spliced string is the string obtained by re-concatenating an object after word segmentation and word removal; string similarity algorithms include, for example, the Jaro-Winkler algorithm, edit distance, and longest common subsequence (LCS).
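Of the string similarity algorithms listed above, edit distance is the simplest to illustrate. The following sketch computes the classic Levenshtein distance and normalizes it into a similarity in [0, 1]; the normalization is an illustrative choice, not one mandated by the description.

```python
def edit_distance(s, t):
    """Classic Levenshtein distance via dynamic programming,
    keeping only the previous row to save memory."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def string_similarity(s, t):
    """Normalize edit distance into a [0, 1] similarity."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))
```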
Embodiment one
An embodiment of the present application provides a text similarity calculation method for computing the similarity between two texts. The texts described in this embodiment may include public-opinion texts such as news articles, microblog posts and forum articles. The two texts being compared may be of the same type, for example two news articles, two microblog posts, or two forum articles; or they may be of different types, for example a news article and a microblog post, a microblog post and a forum article, or a news article and a forum article. The application is not limited in this respect.

In this embodiment, the data of at least two objects can be extracted from each text, the data including, for example, hash values and/or word vector data. Taking hash values extracted by the SimHash algorithm and word vector data extracted by a word2vec model as an example, from a forum article one may extract the SimHash value of the title, the vector space model (VSM) of the title, the SimHash value of the content, the VSM of the keywords, the SimHash value of the keywords, and the SimHash value of each kernel sentence; from a microblog post one may extract the SimHash value of the content, the VSM of the keywords, the SimHash value of the keywords, and the SimHash value of each kernel sentence. In practice, the data of the appropriate objects can be extracted according to the actual characteristics of the text; the application is not limited in this respect.

The text similarity calculation method of this embodiment may be applied, for example, on a server-side computing device or a client computing device; this embodiment is not limited in this respect. The similarity obtained by the method can subsequently be used for services such as deduplication, infringement search and template filtering. For example, an infringement or spam sample library can be established in a public-opinion application system; when a crawler captures a new public-opinion text, the method provided by this embodiment is used to compute the similarity between the captured text and the samples in the library, so as to judge whether the current text is similar to a sample and hence whether it is an infringing article or junk data.
Fig. 1 is a flow chart of the text similarity calculation method provided by embodiment one of the present application. As shown in Fig. 1, the method comprises the following steps:

Step S11: determine the shared objects of the two texts, the number of shared objects being at least two;

Step S12: calculate, for each shared object, the Hamming distance between the two texts;

Step S13: when the Hamming distances of the at least two shared objects satisfy a first preset condition, determine the similarity between the two texts according to at least one of the following for a predetermined object among the at least two shared objects: word vector similarity, Hamming distance, and spliced-string similarity.
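Steps S11 to S13 can be sketched as follows. The representation of a text as a mapping from object names to 64-bit fingerprints, and the concrete distance threshold standing in for the first preset condition, are illustrative assumptions rather than part of the claims.

```python
def text_similarity(objects_a, objects_b, distance_threshold=3):
    """Sketch of steps S11-S13: objects_a / objects_b map object names
    (e.g. 'title', 'content') to integer fingerprints; the threshold
    for the 'first preset condition' is an assumed parameter."""
    # S11: shared objects of the two texts.
    shared = set(objects_a) & set(objects_b)
    if len(shared) < 2:
        return None  # the method requires at least two shared objects
    # S12: Hamming distance per shared object.
    distances = {name: bin(objects_a[name] ^ objects_b[name]).count("1")
                 for name in shared}
    # S13: only when the smallest distance is within the threshold does
    # the finer-grained (word vector / string) comparison proceed.
    if min(distances.values()) <= distance_threshold:
        return distances  # hand off to the predetermined-object comparison
    return None
```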
In this embodiment, before step S12, the method further includes: extracting the data of at least two objects from each of the two texts, the data including hash values and/or word vector data.

In this embodiment, the at least two objects include at least two of the following: title, content, keywords and kernel sentences. The title and content of a text can be determined from its structural markers, for example a Title tag identifying the title and a Content tag identifying the body. The keywords and kernel sentences of a text must be extracted from its content.
In an optional embodiment, keywords can be extracted by computing the weights of the words in the text. The process of extracting keywords from a text is as follows:

Split the text into sentences;

Partition the sentences of the text: select the first A sentences as the first partition sectionA (A may be 2), select the last B sentences as the second partition sectionB (B may be 2), and take the remaining sentences as the third partition sectionC;

Segment the sentences of each partition into words;

Traverse every word in every partition. If a word has not yet been recorded, record it, compute its inverse document frequency (IDF) value, and initialize its private parameters local_head, local_tail and local_mid to 0. If the word has already been recorded, increment its private parameter local_head by 1 when the current sentence lies in sectionA, increment local_tail by 1 when it lies in sectionB, and increment local_mid by 1 when it lies in sectionC;

After every word has been traversed and the statistics of the private parameters completed, compute the weight of each word, sort the words by weight in descending order, and select the first N words of the sorted sequence as the keywords of the text, where N is an integer greater than 1 and can be configured as needed.
The weight W of a word can be computed according to the following formula:

W = tw × (local_head × locw1 + local_mid × locw2 + local_tail × locw3 + WL × lenw + PS × posw + TFIDF × tfw);

where the parameters tw, locw1, locw2, locw3, lenw, posw and tfw may be set, purely by way of example, to tw = 0.4, locw1 = 0.5, locw2 = 0.3, locw3 = 0.3, lenw = 0.01, posw = 0.5, tfw = 0.8;

local_head, local_tail and local_mid are the final values of the private parameters after the text has been traversed;

WL is the length of the word;

PS is the part-of-speech score of the word, which may for example be determined as follows: PS = 0.2 if the word is a morpheme word; PS = 0.6 if it is a noun, nominal verb, nominal adjective, idiom or set phrase; PS = 0.3 if it is a verb; PS = 0.4 if it is an auxiliary verb; PS = 0.2 if it is an adjective; PS = 0.5 if it is an English word; and PS = 0 for any other part of speech;

TFIDF is the TF-IDF value of the word, TFIDF = TF × IDF, where TF is the term frequency, i.e. the frequency with which the word appears in the text, and IDF is obtained by dividing the total number of texts (the total number of texts in the corpus) by the number of texts containing the word and taking the logarithm of the quotient.
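The weight formula above can be expressed directly in code. The coefficient values are the illustrative ones given in the description; the function signature itself is an assumption.

```python
# Example coefficients from the description (illustrative only).
PARAMS = dict(tw=0.4, locw1=0.5, locw2=0.3, locw3=0.3,
              lenw=0.01, posw=0.5, tfw=0.8)

def keyword_weight(local_head, local_mid, local_tail,
                   word_length, pos_score, tf, idf, p=PARAMS):
    """Weight W of a candidate keyword, following
    W = tw * (local_head*locw1 + local_mid*locw2 + local_tail*locw3
              + WL*lenw + PS*posw + TFIDF*tfw)."""
    tfidf = tf * idf  # TFIDF = TF * IDF
    return p["tw"] * (local_head * p["locw1"]
                      + local_mid * p["locw2"]
                      + local_tail * p["locw3"]
                      + word_length * p["lenw"]
                      + pos_score * p["posw"]
                      + tfidf * p["tfw"])
```

The top N words by this weight then become the keywords of the text.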
In another optional embodiment, the keywords of a text can be extracted with the TextRank algorithm. The process of extracting keywords from a text is as follows:

Split the text into sentences and segment each sentence into words, obtaining the set of sentences and the set of words; optionally, filter out stop words in each sentence and retain only words of specified parts of speech (for example nouns, verbs and adjectives);

Treat each word as a node, as in the PageRank algorithm, and set the window size to k. Suppose a sentence consists of the words w1, w2, w3, w4, w5, ..., wn in order; then {w1, w2, ..., wk}, {w2, w3, ..., wk+1}, {w3, w4, ..., wk+2} and so on are all windows, where k and n are integers and k is less than n. An unweighted, undirected edge exists between the nodes of any two words within the same window;

From the graph formed by these edges, the importance of each word node can be computed; according to the importance of each word node, the N most important words are selected as the keywords of the text, where N is an integer greater than 1 and can be configured as needed.
The importance of a word node can be computed according to the following formula:

S(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|

where S(Vi) is the importance of word node i; d is the damping coefficient, usually set to 0.85; In(Vi) is the set of word nodes pointing to word node i; S(Vj) is the importance of word node j in the set of word nodes pointing to word node i; Out(Vj) is the set of word nodes pointed to by word node j; and |Out(Vj)| is the number of nodes in that set.

The importance of the word nodes is obtained by iterating the above formula several times; initially, the importance of every word node can be set to 1. The left-hand side of the formula is the importance of a word node after an iteration, and the importances used on the right-hand side are those before the iteration.
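The iteration described above can be sketched as follows; the small co-occurrence graph is invented purely for illustration, and for the undirected graph used here In(Vi) and Out(Vi) coincide with the neighbor set.

```python
def textrank(neighbors, d=0.85, iterations=50):
    """Iterate S(Vi) = (1 - d) + d * sum over Vj in In(Vi) of
    S(Vj) / |Out(Vj)|, starting from importance 1 for every node."""
    scores = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for node in neighbors:
            rank = sum(scores[nb] / len(neighbors[nb])
                       for nb in neighbors[node])
            new_scores[node] = (1 - d) + d * rank
        scores = new_scores
    return scores

# Undirected co-occurrence graph: each key lists its window neighbors.
graph = {
    "text": ["similarity", "hamming"],
    "similarity": ["text", "hamming", "vector"],
    "hamming": ["text", "similarity"],
    "vector": ["similarity"],
}
```

Nodes with more (and better-connected) neighbors accumulate higher importance, so the best-connected word ranks first.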
In an alternative embodiment, the kernel sentence in text can be extracted based on TextRank algorithm.Carried from a text
Take the process of kernel sentence as follows:
Subordinate sentence is carried out to the text;The weight of each sentence is calculated, is arranged according to the weight of each sentence is descending
Sequence, select kernel sentence of N number of sentence as the text from front to back from collating sequence, wherein, N can be more than or equal to 3
Integer, N can also configure according to being actually needed.
Wherein it is possible to the weight of a sentence is calculated according to following formula:
Wherein, WS (Vi) is sentence i weight;D is damped coefficient, is traditionally arranged to be 0.85;In (Vi) is directed to sentence i
Sentence set;S (Vj) is directed to the weight of sentence j in sentence i sentence set;Out (Vj) is the sentence pointed by sentence j
Set;wjiRepresent the similarity between sentence j in sentence i and sensing sentence i sentence set;wjkRepresent sentence j and sentence j
Similarity in pointed sentence set between sentence k.
Wherein, the weight of a sentence is likewise obtained only after several iterations of the above formula; the left side of the equation represents the weight of one sentence, and the summation on the right represents the contribution of each adjacent sentence to that sentence. Unlike keyword extraction, all sentences are considered adjacent to one another, so no window is extracted.
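As a rough sketch of the iterative weighting described above, the following Python fragment computes sentence weights from a precomputed similarity matrix. The matrix would in practice come from BM25 or a comparable measure; here it is supplied directly, and the function name and iteration count are illustrative assumptions, not details fixed by this embodiment.

```python
def textrank_sentence_weights(sim, d=0.85, iterations=50):
    """Iteratively compute TextRank sentence weights.

    sim[j][i] is the similarity between sentence j and sentence i
    (w_ji in the formula); all sentences are treated as adjacent.
    Weights start at 1 and are refined over successive iterations.
    """
    n = len(sim)
    ws = [1.0] * n
    for _ in range(iterations):
        new_ws = []
        for i in range(n):
            total = 0.0
            for j in range(n):
                if j == i:
                    continue
                # Normalise sentence j's contribution by the sum of its
                # outgoing similarities (the inner sum in the formula).
                denom = sum(sim[j][k] for k in range(n) if k != j)
                if denom > 0:
                    total += sim[j][i] / denom * ws[j]
            new_ws.append((1 - d) + d * total)
        ws = new_ws
    return ws

# Three sentences; sentences 0 and 1 are highly similar to each other,
# sentence 2 is an outlier, so it should rank last.
sim = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
weights = textrank_sentence_weights(sim)
ranked = sorted(range(3), key=lambda i: weights[i], reverse=True)
```

Sorting the indices by the converged weights and taking the first N entries corresponds to the kernel-sentence selection step described above.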
Wherein, the similarities w_ji and w_jk may be calculated with the BM25 algorithm; that is, the similarity between two sentences can be obtained by analysing the morphemes in the sentences, weighting the morphemes, and predicting the relevance between morphemes and sentences.
The above ways of extracting keywords and kernel sentences from a text are merely examples, and the application is not limited thereto. In other embodiments, keywords and/or kernel sentences may be extracted from a text in any other feasible way.
The following takes the SimHash algorithm and the word2vec model as examples to explain how the data of the title, content, keywords and kernel sentences is obtained. Here, the hash value is a SimHash value, and the term vector data is a VSM. However, the application is not limited thereto; in other embodiments, other feasible algorithms and models may be used.
In this embodiment, the process of extracting the data of a text's title is as follows: segment the title of the text and perform word removal, where word removal includes removing stop words, adverbs, auxiliary words, punctuation marks, prepositions and some conjunctions; extract a 64-bit title SimHash value according to the SimHash algorithm; obtain the VSM of the title through the word2vec model; and splice the segmented, filtered title back together to obtain a spliced character string.
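The 64-bit SimHash fingerprint mentioned here can be sketched as follows. The choice of hash function, the whitespace-free token list, and the absence of term weighting are simplifying assumptions for illustration, not details fixed by this embodiment.

```python
import hashlib

def simhash64(tokens):
    """Toy 64-bit SimHash: each token's hash votes on each bit position;
    the sign of the accumulated vote fixes the final bit."""
    v = [0] * 64
    for tok in tokens:
        # Take 64 bits of a stable hash of the token.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    out = 0
    for i in range(64):
        if v[i] > 0:
            out |= 1 << i
    return out

def hamming(a, b):
    """Number of differing bits between two 64-bit fingerprints."""
    return bin(a ^ b).count("1")

t1 = simhash64(["text", "similarity", "hamming", "distance"])
t2 = simhash64(["text", "similarity", "hamming", "metric"])
t3 = simhash64(["completely", "unrelated", "words", "here"])
```

Token lists sharing most of their words tend to produce fingerprints differing in few bits, which is what makes the later distance thresholds meaningful.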
In this embodiment, the process of extracting the data of a text's content is as follows: segment the content of the text and perform word removal, where word removal includes removing stop words, adverbs, auxiliary words, punctuation marks, prepositions and some conjunctions, and, when the text is a microblog, junk content such as microblog emoticons also needs to be removed; then extract a 64-bit content SimHash value according to the SimHash algorithm.
In this embodiment, the process of extracting the data of a text's keywords is as follows: for the keywords extracted from the text, extract a 64-bit keyword SimHash value according to the SimHash algorithm, and obtain the VSM of the keywords through the word2vec model.
In this embodiment, the process of extracting the data of a text's kernel sentences is as follows: for the at least three kernel sentences extracted from the text, extract a 64-bit SimHash value of each kernel sentence according to the SimHash algorithm, and merge the SimHash values of all kernel sentences to obtain a total SimHash value.
In this embodiment, before step S13, the method further includes determining the term vector similarity of a predetermined object in the following manner:
according to the term vector data of the predetermined object, calculate the term vector cosine similarity of the predetermined object;
when the word count or length of the predetermined object in both texts is greater than a preset penalty factor, determine the term vector similarity of the predetermined object to be that term vector cosine similarity;
when the word count or length of the predetermined object in at least one of the two texts is less than or equal to the penalty factor, calculate a punishment correction value from the word counts or lengths of the two texts and a basic penalty value; determine an additional penalty value as the sum of the addition penalty values of the two texts, where the addition penalty value of each text is determined from the penalty factor, the word count or length of that text, and an additional punishment maximum; and determine the term vector similarity of the predetermined object from the term vector cosine similarity, the smaller of the two texts' word counts or lengths, the punishment correction value, the additional penalty value, the penalty factor, the additional punishment maximum and the basic penalty value.
Wherein, the punishment correction value equals the difference between the basic penalty value and the absolute value of the difference of the word counts or lengths of the two texts. For each text, if the penalty factor minus the word count or length of that text is greater than 0, the addition penalty value of that text equals that difference multiplied by the additional punishment maximum and divided by the penalty factor; if the difference is less than or equal to 0, the addition penalty value of that text equals 0.
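A minimal sketch of this penalty mechanism, using the concrete formula that appears in example one below. The function name and the default parameter values for PF, BP and maxAP are illustrative assumptions; the text fixes the formula but not these values.

```python
def penalized_similarity(ts, fa, fb, pf=10, bp=5, max_ap=20):
    """Apply the short-text penalty to a term vector cosine similarity.

    ts       -- raw term vector cosine similarity
    fa, fb   -- word count (or length) of the object in each text
    pf       -- penalty factor (PF); bp -- basic penalty value (BP)
    max_ap   -- additional punishment maximum (maxAP)
    """
    if fa > pf and fb > pf:          # both long enough: no penalty
        return ts
    mpv = bp - abs(fa - fb)          # punishment correction value (MPV)
    def p(f):                        # per-text addition penalty value
        return (pf - f) * max_ap / pf if pf - f > 0 else 0.0
    ap = p(fa) + p(fb)               # additional penalty value (AP)
    return ts * (100 - pf + min(fa, fb) - bp + mpv - ap) / 100

# Both objects long enough: the cosine similarity passes through.
unpenalized = penalized_similarity(0.9, 20, 30)
# One side very short: the similarity is scaled down.
penalized = penalized_similarity(0.9, 4, 30)
```

With the assumed defaults, the short side (4 words against a penalty factor of 10) both triggers the addition penalty and drags down the correction value, so the result is well below the raw cosine similarity.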
In this embodiment, as shown in Figures 2 to 5, the method further includes:
Step S14: when the Hamming distances of the at least two shared objects do not satisfy the first preset condition, determine the similarity between the two texts according to the minimum of the Hamming distances of the at least two shared objects.
In this embodiment, the first preset condition includes: the minimum of the Hamming distances of the at least two shared objects is less than or equal to a first distance threshold, or the minimum of the Hamming distances of the at least two shared objects lies within a first preset range.
In this embodiment, when the Hamming distances of the at least two shared objects between two texts do not satisfy the first preset condition, it may be determined that the two texts are unlikely to be similar, and the similarity between them can be determined in a predetermined way, avoiding complex computation and improving computational efficiency. When the Hamming distances of the at least two shared objects between two texts satisfy the first preset condition, it may be determined that the two texts are possibly similar, and subsequent complex computation may be performed on the data of the predetermined objects to obtain an accurate similarity. For example, if the shared objects between the two texts are the title, content and keywords, the predetermined objects are the title and the keywords; if the shared objects between the two texts are the title, content, keywords and kernel sentences, the predetermined objects are the title, the keywords and the kernel sentences.
In an alternative embodiment, as shown in Fig. 2, when the predetermined objects include a first predetermined object and a second predetermined object, or include the first predetermined object, the second predetermined object and a third predetermined object, step S13 may include:
Step S131: if the term vector similarity of the first predetermined object satisfies a second preset condition, determine that the similarity between the two texts equals the term vector similarity of the first predetermined object;
Step S132: if the term vector similarity of the first predetermined object does not satisfy the second preset condition, the first Hamming distance of the second predetermined object satisfies a third preset condition, and the second Hamming distance of the second predetermined object satisfies a fourth preset condition, determine the similarity between the two texts according to the second Hamming distance of the second predetermined object;
Step S133: if the term vector similarity of the first predetermined object does not satisfy the second preset condition, and the first Hamming distance of the second predetermined object does not satisfy the third preset condition or the second Hamming distance of the second predetermined object does not satisfy the fourth preset condition, determine the similarity between the two texts according to the spliced-string similarity of the first predetermined object and/or the term vector similarity of the third predetermined object.
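The branch structure of steps S131 to S133 can be sketched as follows, instantiated with the objects and thresholds of example one below (title, kernel sentences, keywords). Reading the "and/or" of step S133 as the weighted combination used in example one is an assumption, as are all parameter names.

```python
def combined_similarity(title_ts, core_total_dist, core_min_pair_dist,
                        splice_sim, keyword_ts,
                        st=0.92, d2=10, d3=5, L=100, w1=0.8, w2=0.2):
    """Cascade of steps S131-S133 with the thresholds of example one.

    title_ts           -- term vector similarity of the title (1st object)
    core_total_dist    -- total kernel-sentence distance (1st Hamming dist)
    core_min_pair_dist -- minimum pairwise kernel-sentence distance
                          (2nd Hamming dist)
    splice_sim         -- spliced-string similarity of the title
    keyword_ts         -- term vector similarity of the keywords (3rd object)
    """
    if title_ts > st:                                        # S131
        return title_ts
    if core_total_dist >= d2 and core_min_pair_dist <= d3:   # S132
        return (L - core_min_pair_dist) / L
    return keyword_ts * w1 + splice_sim * w2                 # S133
```

Each branch returns as soon as a cheaper signal is decisive, which mirrors the efficiency argument made above: the weighted combination is only computed when both the title similarity and the kernel-sentence distances fail to settle the question.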
In an alternative embodiment, as shown in Fig. 3, when the predetermined objects include the second predetermined object and the third predetermined object but not the first predetermined object, step S13 may include:
Step S134: if the first Hamming distance of the second predetermined object satisfies the third preset condition and the second Hamming distance of the second predetermined object satisfies the fourth preset condition, determine the similarity between the two texts according to the second Hamming distance of the second predetermined object;
Step S135: if the first Hamming distance of the second predetermined object does not satisfy the third preset condition or the second Hamming distance of the second predetermined object does not satisfy the fourth preset condition, determine the similarity between the two texts according to the term vector similarity of the third predetermined object.
In an alternative embodiment, as shown in Fig. 4, when the predetermined objects include the first predetermined object but not the second predetermined object, or include the first predetermined object and the third predetermined object but not the second predetermined object, step S13 may include:
Step S136: if the term vector similarity of the first predetermined object satisfies the second preset condition, determine that the similarity between the two texts equals the term vector similarity of the first predetermined object;
Step S137: if the term vector similarity of the first predetermined object does not satisfy the second preset condition, determine the similarity between the two texts according to the spliced-string similarity of the first predetermined object and/or the term vector similarity of the third predetermined object.
In an alternative embodiment, as shown in Fig. 5, when the predetermined objects include the second predetermined object but neither the first predetermined object nor the third predetermined object, step S13 may include:
Step S138: determine the similarity between the two texts according to the second Hamming distance of the second predetermined object.
Wherein, the second preset condition includes: the term vector similarity of the first predetermined object is greater than a first similarity threshold, or the term vector similarity of the first predetermined object lies within a second preset range. The third preset condition includes: the first Hamming distance of the second predetermined object is greater than or equal to a second distance threshold, or the first Hamming distance of the second predetermined object lies within a third preset range. The fourth preset condition includes: the second Hamming distance of the second predetermined object is less than or equal to a third distance threshold, or the second Hamming distance of the second predetermined object lies within a fourth preset range.
The present embodiment is described in detail below through examples one to seven. In the following examples, the first predetermined object is the title, the second predetermined object is the kernel sentences, and the third predetermined object is the keywords. The first Hamming distance of the second predetermined object is the total kernel-sentence Hamming distance between the two texts; the second Hamming distance of the second predetermined object comprises the Hamming distances between each kernel sentence of one text and each kernel sentence of the other text. In the following examples, the Hamming distance is the SimHash distance obtained from the difference of SimHash values. The examples detail the similarity computation; the process of extracting the object data is as described above and is not repeated here.
Example one
In this example, calculating the similarity of text A and text B is taken as an example. From text A are extracted: title SimHash value a1, content SimHash value a2, keyword SimHash value a3, kernel-sentence SimHash values a4, a5 and a6 (here, three kernel sentences are taken as an example), the total SimHash value a7 of the three kernel sentences, the VSM of the title and the VSM of the keywords. From text B are extracted: title SimHash value b1, content SimHash value b2, keyword SimHash value b3, kernel-sentence SimHash values b4, b5 and b6 (here, three kernel sentences are taken as an example), the total SimHash value b7 of the three kernel sentences, the VSM of the title and the VSM of the keywords. Here, the shared objects between text A and text B are: title, content, keywords and kernel sentences.
As shown in Fig. 6, the similarity computation between text A and text B comprises the following steps:
Step 101: compute the content SimHash distance D1, the keyword SimHash distance D2, the kernel-sentence total SimHash distance D3 and the title SimHash distance D4 between text A and text B, respectively;
where D1 = |a2-b2|, D2 = |a3-b3|, D3 = |a7-b7|, D4 = |a1-b1|.
It should be noted that the title SimHash distance D4 between text A and text B is computed only when the title lengths of both text A and text B are greater than a title-length threshold. When the title length of at least one of text A and text B is less than the title-length threshold, D4 is not computed.
In this embodiment, the title lengths of text A and text B are both taken to be greater than the title-length threshold, so the title SimHash distance D4 needs to be computed.
Step 102: select the minimum of D1, D2, D3 and D4 as the minimum distance Dmin; here, Dmin = min{D1, D2, D3, D4}.
Step 103: compare the minimum distance Dmin with a preset first distance threshold maxDt; here, the value of the first distance threshold maxDt is, for example, 25, although the application is not limited thereto.
When the minimum distance Dmin is greater than the first distance threshold maxDt, compute the similarity between text A and text B according to the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here, the value of L is 100.
When the minimum distance Dmin is less than or equal to the first distance threshold maxDt, perform step 104.
Step 103 performs preliminary screening with a cheap computation, so that data with low similarity likelihood avoid the subsequent complex computation, improving computational efficiency.
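Steps 101 to 103 can be sketched as follows. The text writes a SimHash distance as a difference of SimHash values; the sketch reads this as the bit-level Hamming distance between the fingerprints, which is the conventional SimHash interpretation and therefore an assumption here.

```python
def hamming(a, b):
    """Bit-level Hamming distance between two 64-bit SimHash values."""
    return bin(a ^ b).count("1")

def prescreen(dists, max_dt=25, L=100):
    """Steps 102-103: if even the closest shared object is far apart,
    return a cheap similarity estimate; otherwise return None to signal
    that the expensive comparison pipeline should run."""
    d_min = min(dists)
    if d_min > max_dt:
        return (L - d_min) / L
    return None

# Fingerprints differing in every bit fail the screen cheaply.
a, b = 0xFFFFFFFFFFFFFFFF, 0x0000000000000000
print(prescreen([hamming(a, b)]))   # 64 bits differ, prints 0.36
```

Any object pair within the threshold (for example a distance of 3) drops the caller into the full computation instead, matching the efficiency argument made for step 103.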
Step 104: compute the term vector similarity Ts of the title.
As shown in Fig. 7, the computation of the term vector similarity Ts of the title comprises the following steps:
Step 1041: determine the term vector cosine similarity ts of the title from the VSM of text A's title and the VSM of text B's title;
Step 1042: judge whether F(A) is less than or equal to PF, or whether F(B) is less than or equal to PF, where PF is a preset penalty factor, F(A) is the length of text A's title and F(B) is the length of text B's title.
When F(A) is less than or equal to PF, or F(B) is less than or equal to PF, perform step 1043; when F(A) is greater than PF and F(B) is greater than PF, determine the term vector similarity of the title as Ts = ts.
Step 1043: compute the term vector similarity Ts of the title according to the following formula:
Ts = ts × (100 - PF + min(F(A), F(B)) - BP + MPV - AP)/100,
where MPV = BP - |F(A) - F(B)| and AP = P(A) + P(B);
when PF - F(A) > 0, P(A) = (PF - F(A)) × maxAP/PF; when PF - F(A) ≤ 0, P(A) = 0;
when PF - F(B) > 0, P(B) = (PF - F(B)) × maxAP/PF; when PF - F(B) ≤ 0, P(B) = 0;
where ts is the term vector cosine similarity of the title determined in step 1041, PF is the preset penalty factor, F(A) is the length of text A's title, F(B) is the length of text B's title, BP is the basic penalty value, and maxAP is the additional punishment maximum.
In this embodiment, processing the term vector cosine similarity through the penalty mechanism of step 104 avoids the larger error that short texts would otherwise cause.
Step 105: compare the term vector similarity Ts of the title with a preset first similarity threshold St; here, the value of the first similarity threshold St is, for example, 0.92, although the application is not limited thereto.
When the term vector similarity Ts of the title is greater than the first similarity threshold St, determine the similarity between text A and text B as S = Ts;
when the term vector similarity Ts of the title is less than or equal to the first similarity threshold St, perform step 106.
Step 106: compare D3 with a second distance threshold; here, the second distance threshold is 10, although the application is not limited thereto.
When D3 is greater than or equal to the second distance threshold, perform steps 107 and 108; when D3 is less than the second distance threshold, perform step 109.
Step 107: compute the SimHash distances between each kernel sentence in text A and each kernel sentence in text B, and determine the minimum value dmin.
In step 107, the differences between the SimHash values of the kernel sentences in text A and those of the kernel sentences in text B are computed, yielding multiple SimHash distances. In this example, the SimHash values of text A's three kernel sentences are a4, a5 and a6, and those of text B's three kernel sentences are b4, b5 and b6, so the following distances are obtained: |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|, |a6-b4|, |a6-b5|, |a6-b6|.
For each distance, if the distance is less than or equal to a third distance threshold minDt (for example, 3 or 5), compute the similarity between the two kernel sentences corresponding to that distance through the Jaro-Winkler algorithm; if that similarity is greater than or equal to a second similarity threshold (for example, 0.8), retain the distance, and if it is less than the second similarity threshold, discard the distance. If the distance is greater than the third distance threshold minDt, discard the distance. In this embodiment, the data retained after the similarity and distance comparisons includes, for example: |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|.
The minimum value dmin is then determined from all retained distances; in this embodiment, dmin = min{|a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|}.
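Step 107 can be sketched as follows. The shape of the inputs and the trivial equality-based string similarity used in the demonstration (a real pipeline would pass a Jaro-Winkler function) are assumptions for illustration.

```python
def min_kernel_distance(sents_a, sents_b, string_sim,
                        min_dt=5, sim_threshold=0.8):
    """Pairwise kernel-sentence distances, filtered twice (step 107).

    sents_a, sents_b -- lists of (simhash, sentence_text) pairs
    string_sim       -- string similarity, e.g. Jaro-Winkler
    Returns the minimum retained distance, or None if nothing survives.
    """
    def hamming(x, y):
        return bin(x ^ y).count("1")
    kept = []
    for ha, ta in sents_a:
        for hb, tb in sents_b:
            d = hamming(ha, hb)
            # Keep a distance only if it is small AND the sentence texts
            # themselves are confirmed similar by the string measure.
            if d <= min_dt and string_sim(ta, tb) >= sim_threshold:
                kept.append(d)
    return min(kept) if kept else None

exact = lambda a, b: 1.0 if a == b else 0.0   # stand-in for Jaro-Winkler
d = min_kernel_distance([(0b1111, "abc")],
                        [(0b1011, "abc"), (0b0000, "xyz")], exact)
```

In the demonstration, the first pair survives both filters (1 differing bit, identical text) while the second is rejected by the string check, so d is 1.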
Step 108: compare dmin with the preset third distance threshold minDt; here, the value of the third distance threshold minDt is 3 or 5, although the application is not limited thereto.
When dmin is less than or equal to the third distance threshold minDt, compute the similarity between text A and text B according to the formula S = (L - dmin)/L, where L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here, the value of L is 100.
When dmin is greater than the third distance threshold minDt, perform step 109.
Step 109: compute the term vector similarity S1 of the keywords.
As shown in Fig. 8, the computation of the term vector similarity S1 of the keywords comprises the following steps:
Step 1091: determine the term vector cosine similarity s1 of the keywords from the VSM of text A's keywords and the VSM of text B's keywords;
Step 1092: judge whether F(A) is less than or equal to PF, or whether F(B) is less than or equal to PF, where PF is the preset penalty factor, F(A) is the word count of text A's keywords and F(B) is the word count of text B's keywords.
When F(A) is less than or equal to PF, or F(B) is less than or equal to PF, perform step 1093; when F(A) is greater than PF and F(B) is greater than PF, determine the term vector similarity of the keywords as S1 = s1.
Step 1093: determine the term vector similarity S1 of the keywords according to the following formula:
S1 = s1 × (100 - PF + min(F(A), F(B)) - BP + MPV - AP)/100,
where MPV = BP - |F(A) - F(B)| and AP = P(A) + P(B);
when PF - F(A) > 0, P(A) = (PF - F(A)) × maxAP/PF; when PF - F(A) ≤ 0, P(A) = 0;
when PF - F(B) > 0, P(B) = (PF - F(B)) × maxAP/PF; when PF - F(B) ≤ 0, P(B) = 0;
where s1 is the term vector cosine similarity of the keywords determined in step 1091, PF is the preset penalty factor, F(A) is the word count of text A's keywords, F(B) is the word count of text B's keywords, BP is the basic penalty value, and maxAP is the additional punishment maximum.
In this embodiment, processing the term vector cosine similarity through the penalty mechanism of step 109 avoids the error that too few keywords would otherwise cause.
Step 110: compute the similarity S2 of the spliced character string obtained by splicing the segmented title back together. In this example, the similarity S2 of the spliced character string is computed through the Jaro-Winkler algorithm. However, the application is not limited thereto; in other embodiments, the Jaro-Winkler algorithm may be replaced by string-similarity algorithms such as edit distance or longest common subsequence (LCS).
Step 111: compute the similarity between text A and text B according to the formula S = S1 × w1 + S2 × w2; that is, the similarity between text A and text B is determined as the weighted sum of S1 and S2, where w1 + w2 = 1, the value of w1 being, for example, 0.8 and the value of w2 being, for example, 0.2, although the application is not limited thereto.
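A compact reference implementation of the Jaro-Winkler similarity used in steps 110 and 111, together with the weighted combination of step 111. This is the standard formulation (prefix scale 0.1, common prefix capped at 4 characters), not code taken from the patent.

```python
def jaro_winkler(s1, s2, p=0.1):
    """Compact Jaro-Winkler similarity (prefix scale p, max prefix 4)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):          # count characters matching
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):         # within the match window
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                          # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    jaro = (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3
    prefix = 0                           # length of the common prefix
    for a, b in zip(s1, s2):
        if a == b and prefix < 4:
            prefix += 1
        else:
            break
    return jaro + prefix * p * (1 - jaro)

def final_similarity(s1_kw, s2_splice, w1=0.8, w2=0.2):
    """Step 111: weighted sum of keyword and spliced-string similarity."""
    return s1_kw * w1 + s2_splice * w2
```

The classic "MARTHA"/"MARHTA" pair gives roughly 0.961, which illustrates why Jaro-Winkler suits short spliced title strings: it rewards shared prefixes heavily.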
In addition, when the shared objects between two texts are the title, the keywords and the kernel sentences, the similarity computation between the two texts can refer to this example and is therefore not repeated here.
Example two
In this example, calculating the similarity of text A and text B is taken as an example. From text A are extracted: title SimHash value a1, content SimHash value a2, kernel-sentence SimHash values a4, a5 and a6 (here, three kernel sentences are taken as an example), the total SimHash value a7 of the three kernel sentences and the VSM of the title. From text B are extracted: title SimHash value b1, content SimHash value b2, kernel-sentence SimHash values b4, b5 and b6 (here, three kernel sentences are taken as an example), the total SimHash value b7 of the three kernel sentences and the VSM of the title. Here, the shared objects between text A and text B are: title, content and kernel sentences.
As shown in Fig. 9, the similarity computation between text A and text B comprises the following steps:
Step 201: compute the content SimHash distance D1, the kernel-sentence total SimHash distance D3 and the title SimHash distance D4 between text A and text B, respectively;
where D1 = |a2-b2|, D3 = |a7-b7|, D4 = |a1-b1|.
It should be noted that the title SimHash distance D4 between text A and text B is computed only when the title lengths of both text A and text B are greater than the title-length threshold. When the title length of at least one of text A and text B is less than or equal to the title-length threshold, D4 is not computed.
In this example, the title lengths of text A and text B are both taken to be less than or equal to the title-length threshold, so the title SimHash distance D4 need not be computed.
Step 202: select the minimum of D1 and D3 as the minimum distance Dmin; here, Dmin = min{D1, D3}.
Step 203: compare the minimum distance Dmin with the preset first distance threshold maxDt; here, the value of the first distance threshold maxDt is, for example, 25, although the application is not limited thereto.
When the minimum distance Dmin is greater than the first distance threshold maxDt, compute the similarity between text A and text B according to the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here, the value of L is 100.
When the minimum distance Dmin is less than or equal to the first distance threshold maxDt, perform step 204.
Step 203 performs preliminary screening with a cheap computation, so that data with low similarity likelihood avoid the subsequent complex computation, improving computational efficiency.
Step 204: compute the term vector similarity Ts of the title. Here, step 204 can refer to step 104 in example one and is therefore not repeated.
Step 205: compare the term vector similarity Ts of the title with the preset first similarity threshold St; here, the value of the first similarity threshold St is, for example, 0.92, although the application is not limited thereto.
When the term vector similarity Ts of the title is greater than the first similarity threshold St, determine the similarity between text A and text B as S = Ts;
when the term vector similarity Ts of the title is less than or equal to the first similarity threshold St, perform step 206.
Step 206: compare D3 with the second distance threshold; here, the second distance threshold is 10, although the application is not limited thereto.
When D3 is greater than or equal to the second distance threshold, perform steps 207 and 208; when D3 is less than the second distance threshold, perform step 209.
Step 207: compute the SimHash distances between each kernel sentence in text A and each kernel sentence in text B, and determine the minimum value dmin. Step 207 can refer to step 107 in example one and is therefore not repeated.
In this embodiment, the SimHash values of text A's three kernel sentences are a4, a5 and a6, and those of text B's three kernel sentences are b4, b5 and b6, so the following distances are obtained: |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|, |a6-b4|, |a6-b5|, |a6-b6|. The data retained after the similarity and distance comparisons includes, for example: |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|. Then dmin = min{|a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|}.
Step 208: compare dmin with the preset third distance threshold minDt; here, the value of the third distance threshold minDt is 3 or 5, although the application is not limited thereto.
When dmin is less than or equal to the third distance threshold minDt, compute the similarity between text A and text B according to the formula S = (L - dmin)/L, where L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here, the value of L is 100.
When dmin is greater than the third distance threshold minDt, perform step 209.
Step 209: compute the similarity S2 of the spliced character string obtained by splicing the segmented title back together, and determine the similarity between text A and text B as S = S2.
In this embodiment, the similarity S2 of the spliced character string is computed through the Jaro-Winkler algorithm. However, the application is not limited thereto; in other embodiments, the Jaro-Winkler algorithm may be replaced by string-similarity algorithms such as edit distance or longest common subsequence (LCS).
In addition, when the shared objects between two texts are the title and the kernel sentences, the similarity computation between the two texts can refer to this example and is therefore not repeated here.
Example three
In this example, calculating the similarity of text A and text B is taken as an example. From text A are extracted: title SimHash value a1, content SimHash value a2, keyword SimHash value a3, the VSM of the title and the VSM of the keywords. From text B are extracted: title SimHash value b1, content SimHash value b2, keyword SimHash value b3, the VSM of the title and the VSM of the keywords. Here, the shared objects between text A and text B are: title, content and keywords.
As shown in Fig. 10, the similarity computation between text A and text B comprises the following steps:
Step 301: compute the content SimHash distance D1, the keyword SimHash distance D2 and the title SimHash distance D4 between text A and text B, respectively;
where D1 = |a2-b2|, D2 = |a3-b3|, D4 = |a1-b1|.
It should be noted that the title SimHash distance D4 between text A and text B is computed only when the title lengths of both text A and text B are greater than the title-length threshold. When the title length of at least one of text A and text B is less than or equal to the title-length threshold, D4 is not computed.
In this example, the title lengths of text A and text B are both taken to be greater than the title-length threshold, so the title SimHash distance D4 needs to be computed.
Step 302: select the minimum of D1, D2 and D4 as the minimum distance Dmin; here, Dmin = min{D1, D2, D4}.
Step 303: compare the minimum distance Dmin with the preset first distance threshold maxDt; here, the value of the first distance threshold maxDt is, for example, 25, although the application is not limited thereto.
When the minimum distance Dmin is greater than the first distance threshold maxDt, compute the similarity between text A and text B according to the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here, the value of L is 100.
When the minimum distance Dmin is less than or equal to the first distance threshold maxDt, perform step 304.
Step 303 performs preliminary screening with a cheap computation, so that data with low similarity likelihood avoid the subsequent complex computation, improving computational efficiency.
Step 304: compute the term vector similarity Ts of the title. Here, step 304 can refer to step 104 in example one and is therefore not repeated.
Step 305: compare the term vector similarity Ts of the title with the preset first similarity threshold St; here, the value of the first similarity threshold St is, for example, 0.92, although the application is not limited thereto.
When the term vector similarity Ts of the title is greater than the first similarity threshold St, determine the similarity between text A and text B as S = Ts;
when the term vector similarity Ts of the title is less than or equal to the first similarity threshold St, perform step 306.
Step 306: compute the term vector similarity S1 of the keywords. Here, step 306 can refer to step 109 in example one and is therefore not repeated.
Step 307: compute the similarity S2 of the spliced character string obtained by splicing the segmented title back together. In this embodiment, the similarity S2 of the spliced character string is computed through the Jaro-Winkler algorithm. However, the application is not limited thereto; in other embodiments, the Jaro-Winkler algorithm may be replaced by string-similarity algorithms such as edit distance or longest common subsequence (LCS).
Step 308: compute the similarity between text A and text B according to the formula S = S1 × w1 + S2 × w2; that is, the similarity between text A and text B is determined as the weighted sum of S1 and S2, where w1 + w2 = 1, the value of w1 being, for example, 0.8 and the value of w2 being, for example, 0.2, although the application is not limited thereto.
In addition, when the shared objects between two texts are the title and the keywords, the similarity computation between the two texts can refer to this example and is therefore not repeated here.
Example four
In this example, calculating the similarity of text A and text B is taken as an example. From text A are extracted: title SimHash value a1, content SimHash value a2, keyword SimHash value a3, the VSM of the title and the VSM of the keywords. From text B are extracted: title SimHash value b1, content SimHash value b2 and the VSM of the title. Here, the shared objects between text A and text B are: title and content.
As shown in figure 11, the similarity calculation between text A and text B comprises the following steps:
Step 401: Calculate the content SimHash distance D1 and the title SimHash distance D4 of texts A and B, where D1 = |a2-b2| and D4 = |a1-b1|.
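The notation |a-b| above denotes the Hamming distance between two SimHash values. In a typical SimHash implementation this is the number of differing bits between the two fingerprints, which could be computed as follows (a sketch assuming integer fingerprints; the function name is illustrative, not from the patent):

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR leaves a 1 exactly in the bit positions where the two
    # fingerprints differ; counting those bits gives the Hamming distance.
    return bin(a ^ b).count("1")
```

With 64-bit fingerprints the result ranges from 0 (identical) to 64 (all bits differ).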
It should be noted that the title SimHash distance D4 of texts A and B is calculated only when the title lengths of both text A and text B exceed the title-length threshold. When the title length of at least one of texts A and B is less than or equal to the title-length threshold, D4 is not calculated.
In this example, the title lengths of both text A and text B are assumed to exceed the title-length threshold, so D4 is calculated.
Step 402: Select the minimum of D1 and D4 as the minimum distance Dmin, i.e. Dmin = min{D1, D4}.
Step 403: Compare the minimum distance Dmin with a preset first distance threshold maxDt. Here maxDt is, for example, 25; however, the application is not limited thereto.
When Dmin is greater than the first distance threshold maxDt, the similarity between texts A and B is calculated by the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
When Dmin is less than or equal to the first distance threshold maxDt, step 404 is performed.
Step 403 performs preliminary screening with an inexpensive calculation, preventing data with little likelihood of similarity from undergoing the subsequent complex calculations and thereby improving computational efficiency.
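The preliminary screening of steps 401 to 403 can be sketched as follows. The thresholds are the example values; passing None for the title distance models the case where a title is too short to compare:

```python
def screen_by_simhash(d_content, d_title=None, max_dt: int = 25, L: float = 100.0):
    # Step 402: the minimum SimHash distance over the available shared objects.
    dmin = min(d for d in (d_content, d_title) if d is not None)
    # Step 403: if even the closest object is farther than maxDt, score the
    # pair immediately and skip the expensive term vector steps.
    if dmin > max_dt:
        return (L - dmin) / L
    return None  # proceed to step 404 and beyond
```

Returning None signals that the cheap screen was inconclusive and the full comparison must run.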
Step 404: Calculate the term vector similarity Ts of the titles. Step 404 may be performed as described for step 104 in example one and is not repeated here.
Step 405: Compare the term vector similarity Ts of the titles with the preset first similarity threshold St. Here St is, for example, 0.92; however, the application is not limited thereto.
When Ts is greater than the first similarity threshold St, the similarity between texts A and B is determined by the formula S = Ts.
When Ts is less than or equal to the first similarity threshold St, step 406 is performed.
Step 406: Calculate the similarity S2 of the spliced strings formed by segmenting and re-joining the titles, and determine the similarity between text A and text B by the formula S = S2.
In this embodiment, S2 is calculated with the Jaro-Winkler algorithm; however, the application is not limited thereto. In other embodiments, the Jaro-Winkler algorithm may be replaced by another string-similarity algorithm such as edit distance or longest common subsequence (LCS).
Example five
In this example, the calculation of the similarity between text A and text B is illustrated. From text A are extracted the title SimHash value a1, content SimHash value a2, keyword SimHash value a3, kernel-sentence SimHash values a4, a5, a6 (three kernel sentences are taken as an example here), the total SimHash value a7 of the three kernel sentences, the VSM of the title, and the VSM of the keywords; from text B are extracted the content SimHash value b2, keyword SimHash value b3, kernel-sentence SimHash values b4, b5, b6 (again for three kernel sentences), the total SimHash value b7 of the three kernel sentences, and the VSM of the keywords. The shared objects of texts A and B are therefore the content, the keywords, and the kernel sentences.
As shown in figure 12, the similarity calculation between text A and text B comprises the following steps:
Step 501: Calculate the content SimHash distance D1, the keyword SimHash distance D2, and the total kernel-sentence SimHash distance D3 of texts A and B, where D1 = |a2-b2|, D2 = |a3-b3|, and D3 = |a7-b7|.
Step 502: Select the minimum of D1, D2, and D3 as the minimum distance Dmin, i.e. Dmin = min{D1, D2, D3}.
Step 503: Compare the minimum distance Dmin with the preset first distance threshold maxDt. Here maxDt is, for example, 25; however, the application is not limited thereto.
When Dmin is greater than the first distance threshold maxDt, the similarity between texts A and B is calculated by the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
When Dmin is less than or equal to the first distance threshold maxDt, step 504 is performed.
Step 503 performs preliminary screening with an inexpensive calculation, preventing data with little likelihood of similarity from undergoing the subsequent complex calculations and thereby improving computational efficiency.
Step 504: Compare D3 with a second distance threshold. Here the second distance threshold is 10; however, the application is not limited thereto.
When D3 is greater than or equal to the second distance threshold, steps 505 and 506 are performed; when D3 is less than the second distance threshold, step 507 is performed.
Step 505: Calculate the SimHash distance between each kernel sentence of text A and each kernel sentence of text B, and determine the minimum value dmin. Step 505 may be performed as described for step 107 in example one and is not repeated here.
In this embodiment, the SimHash values of the three kernel sentences of text A are a4, a5, a6, and those of text B are b4, b5, b6, giving the distances |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|, |a6-b4|, |a6-b5|, |a6-b6|. The data retained after the similarity and distance comparisons include, for example, |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|; in that case dmin = min{|a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|}.
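The pairwise kernel-sentence comparison of step 505 can be sketched as follows. This is a simplified illustration that takes the minimum over all kernel-sentence pairs (the example above additionally retains only a filtered subset of pairs), and the default distance function is an assumed bitwise Hamming distance:

```python
def kernel_sentence_dmin(hashes_a, hashes_b,
                         dist=lambda x, y: bin(x ^ y).count("1")) -> int:
    # Distance between every kernel sentence of text A and every kernel
    # sentence of text B; dmin is the overall minimum.
    return min(dist(a, b) for a in hashes_a for b in hashes_b)
```

With three kernel sentences per text this evaluates nine distances, matching the 3x3 table of |ai-bj| values above.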
Step 506: Compare dmin with a preset third distance threshold minDt. Here minDt takes the value 3 or 5, for example; however, the application is not limited thereto.
When dmin is less than or equal to the third distance threshold minDt, the similarity between texts A and B is calculated by the formula S = (L - dmin)/L, where L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
When dmin is greater than the third distance threshold minDt, step 507 is performed.
Step 507: Calculate the term vector similarity S1 of the keywords, and determine the similarity between texts A and B by the formula S = S1. The calculation of S1 may be performed as described for step 109 in example one and is not repeated here.
In addition, when the shared objects of two texts are the keywords and the kernel sentences, the similarity calculation between the two texts may follow this example and is not repeated here.
Example six
In this example, the calculation of the similarity between text A and text B is illustrated. From text A are extracted the title SimHash value a1, keyword SimHash value a3, content SimHash value a2, the VSM of the title, and the VSM of the keywords; from text B are extracted the keyword SimHash value b3, content SimHash value b2, and the VSM of the keywords. The shared objects of texts A and B are therefore the keywords and the content.
As shown in figure 13, the similarity calculation between text A and text B comprises the following steps:
Step 601: Calculate the content SimHash distance D1 and the keyword SimHash distance D2 of texts A and B, where D1 = |a2-b2| and D2 = |a3-b3|.
Step 602: Select the minimum of D1 and D2 as the minimum distance Dmin, i.e. Dmin = min{D1, D2}.
Step 603: Compare the minimum distance Dmin with the preset first distance threshold maxDt. Here maxDt is, for example, 25; however, the application is not limited thereto.
When Dmin is greater than the first distance threshold maxDt, the similarity between texts A and B is calculated by the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
When Dmin is less than or equal to the first distance threshold maxDt, step 604 is performed.
Step 603 performs preliminary screening with an inexpensive calculation, preventing data with little likelihood of similarity from undergoing the subsequent complex calculations and thereby improving computational efficiency.
Step 604: Calculate the term vector similarity S1 of the keywords, and determine the similarity between texts A and B by the formula S = S1. The calculation of S1 may be performed as described for step 109 in example one and is not repeated here.
Example seven
In this example, the calculation of the similarity between text A and text B is illustrated. From text A are extracted the title SimHash value a1, content SimHash value a2, kernel-sentence SimHash values a4, a5, a6 (three kernel sentences are taken as an example here), the total SimHash value a7 of the three kernel sentences, and the VSM of the title; from text B are extracted the content SimHash value b2, kernel-sentence SimHash values b4, b5, b6 (again for three kernel sentences), and the total SimHash value b7 of the three kernel sentences. The shared objects of texts A and B are therefore the content and the kernel sentences.
As shown in figure 14, the similarity calculation between text A and text B comprises the following steps:
Step 701: Calculate the content SimHash distance D1 and the total kernel-sentence SimHash distance D3 of texts A and B, where D1 = |a2-b2| and D3 = |a7-b7|.
Step 702: Select the minimum of D1 and D3 as the minimum distance Dmin, i.e. Dmin = min{D1, D3}.
Step 703: Compare the minimum distance Dmin with the preset first distance threshold maxDt. Here maxDt is, for example, 25; however, the application is not limited thereto.
When Dmin is greater than the first distance threshold maxDt, the similarity between texts A and B is calculated by the formula S = (L - Dmin)/L, where Dmin is the minimum distance and L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
When Dmin is less than or equal to the first distance threshold maxDt, step 704 is performed.
Step 704: Calculate the SimHash distance between each kernel sentence of text A and each kernel sentence of text B, obtain the minimum value dmin, and calculate the similarity between texts A and B by the formula S = (L - dmin)/L, where L is the normalization parameter corresponding to the dimension of the SimHash algorithm; here L is 100.
In this embodiment, the SimHash values of the three kernel sentences of text A are a4, a5, a6, and those of text B are b4, b5, b6, giving the distances |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|, |a6-b4|, |a6-b5|, |a6-b6|. The data retained after the similarity and distance comparisons include, for example, |a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|; in that case dmin = min{|a4-b4|, |a4-b5|, |a4-b6|, |a5-b4|, |a5-b5|, |a5-b6|}. The determination of dmin may be performed as described for step 107 in example one and is not repeated here.
In addition, as shown in figure 15, this embodiment also provides a text similarity calculation device for calculating the similarity between two texts. The device includes:
an extraction module 11, configured to extract the data of at least two objects from each text, an object being a feature that embodies the semantics of the text;
a determining module 12, configured to determine the shared objects of the two texts, the number of shared objects being at least two;
a first calculation module 13, configured to calculate the Hamming distance of each shared object between the two texts; and
a second calculation module 14, configured to determine, when the Hamming distances of the at least two shared objects meet a first preset condition, the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the spliced-string similarity of a predetermined object among the at least two shared objects.
The processing flow of the device is as described in the method embodiments above and is not repeated here.
The embodiments of the application make full use of the characteristics of text: they extract the data of multiple features that embody text semantics (such as the title, content, keywords, and kernel sentences) and calculate similarity on the basis of those features, which improves both computational efficiency and the accuracy of the similarity calculation. The text similarity calculation method provided by the embodiments performs preliminary matching with low-cost calculations before any complex calculation, improving computational efficiency. The method can also combine the fast and efficient Hamming-distance calculation with term vector methods that have semantic expressive power, which both avoids the defect in the related art that a similarity obtained with Hamming distance alone lacks semantic expressiveness, and remedies the problems in the related art that calculating similarity with term vector methods alone is too slow and is only applicable to short texts. Moreover, the embodiments introduce a short-text penalty mechanism that can correct the errors that easily occur when matching short texts.
Embodiment two
The text similarity calculation method provided by this embodiment can be applied to big-data scenarios. There, calculating the similarity between a target text and every one of a large number of texts involves a heavy computational load and is time-consuming. In this embodiment, therefore, the texts that satisfy a given condition with respect to the target text are first looked up among the large number of texts as the texts to be analyzed, and the similarity is then calculated between the target text and each text to be analyzed.
As shown in figure 16, the text similarity calculation method provided by this embodiment, for calculating the similarity between a target text and the texts to be analyzed, comprises the following steps:
Step S21: According to the hash values of the first object and/or the second object of each of multiple texts, establish an index domain for the first object and/or an index domain for the second object, where each index domain includes one or more index trees.
In this embodiment, the first object is the content and the second object is the title. Taking SimHash values obtained with the SimHash algorithm as the hash values, the SimHash values of the first object and of the second object are obtained as described in embodiment one and are not repeated here.
For the multiple texts, a title index domain is built from the title SimHash values and a content index domain is built from the content SimHash values; if the multiple texts are microblog posts, only a content index domain built from the content SimHash values is needed.
One or more bk-tree index trees are established under each index domain. Each bk-tree index tree can hold G texts, where G can be configured according to actual business requirements, for example G = 1000; the application is not limited thereto.
A bk-tree index tree is a data structure built on the Hamming distance, which satisfies the triangle inequality. The triangle inequality is as follows: let d(x, y) denote the Hamming distance from string x to string y; then the sum of d(x, y) and d(y, z) is greater than or equal to d(x, z), i.e., the number of steps needed to change string x directly into string z does not exceed the number of steps needed to first change x into string y and then change y into z.
The construction of the bk-tree index trees in this embodiment is as follows: in an index tree, an arbitrary text is first chosen as the root node; thereafter, whenever a text is inserted, the Hamming distance between the inserted text and the root node is calculated. If that Hamming distance value occurs for the first time at the root node, a new child node is created; otherwise the insertion recurses down along the edge corresponding to that distance. Each child node is handled in the same way, which is not repeated here. In this manner, the corresponding number of texts can be inserted into the index tree.
Alternatively, kd-tree index trees may be established in each index domain, or an inverted index may be built from the SimHash values. An inverted index built from SimHash values is an index in which the positions of records are determined from the SimHash values. A kd-tree index tree is a binary tree storing K-dimensional data points; building a kd-tree on a set of K-dimensional data represents a partition of the K-dimensional space formed by that set, i.e., each node in the tree corresponds to a K-dimensional hyper-rectangle region.
Step S22: According to the hash value of the first object of the target text, look up, in each index tree of the index domain of the first object, the texts whose Hamming distance to the target text satisfies a first condition; and/or, according to the hash value of the second object of the target text, look up, in each index tree of the index domain of the second object, the texts whose Hamming distance to the target text satisfies a second condition; and determine the texts to be analyzed from the texts found.
The first condition includes, for example: the Hamming distance to the target text is less than a first threshold, or the Hamming distance to the target text lies within a first range. The second condition includes, for example: the Hamming distance to the target text is less than a second threshold, or the Hamming distance to the target text lies within a second range. The first threshold and the second threshold may be identical, and the first range may be identical to the second range.
When only one index domain is included, it suffices to look up, in that index domain, N texts that satisfy the first condition or the second condition as the texts to be analyzed.
When two index domains are included, N1 texts are first found in the content index domain among the texts whose Hamming distance to the target text is less than the first threshold; then, in the title index domain, excluding the N1 texts already found, N2 texts are found among the texts whose Hamming distance to the target text is less than the second threshold, finally yielding N1 + N2 texts as the texts to be analyzed. The first threshold may equal the second threshold, and N1 may equal N2; however, the application is not limited thereto. Alternatively, when two index domains are included, the content index domain and the title index domain are searched simultaneously: in the content index domain, N1 texts are found among the texts whose Hamming distance to the target text is less than the first threshold, and in the title index domain, N2 texts are found among the texts whose Hamming distance to the target text is less than the second threshold; the N1 texts and the N2 texts are then de-duplicated to determine the final N texts to be analyzed, where N is less than or equal to N1 + N2. With a SimHash dimension of 64, the first and second thresholds may both take the value 25.
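The two-domain candidate selection with de-duplication described above can be sketched as follows. This is a simplified illustration; the hit lists are assumed to be already filtered by the distance thresholds:

```python
def candidate_texts(content_hits, title_hits, n1: int, n2: int):
    # Step S22 with two index domains: take up to N1 candidates from the
    # content index domain, then up to N2 more from the title index domain,
    # skipping any text already found (de-duplication).
    picked = list(content_hits[:n1])
    seen = set(picked)
    extra = [t for t in title_hits if t not in seen][:n2]
    return picked + extra
```

The result holds at most N1 + N2 distinct texts, matching the bound N <= N1 + N2 stated above.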
The lookup process within one index domain is illustrated below:
In each bk-tree index tree of the index domain, the texts to be returned are those whose Hamming distance to the target text does not exceed a threshold n. If the Hamming distance between the target text and the text at the root node is d, only the subtrees attached to edges numbered in the range d - n to d + n need to be considered recursively for the query. Since n is generally small, many subtrees can be excluded at every node comparison. This greatly reduces the amount of calculation and improves computational efficiency, by a factor of at least 5 to 10.
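The bk-tree construction and the pruned lookup described above can be sketched together as follows. This is an illustrative implementation, not code from the patent; the distance function is an assumed bitwise Hamming distance over SimHash fingerprints:

```python
class BKTree:
    """bk-tree over a metric distance satisfying the triangle inequality."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None  # each node is (value, {edge distance: child node})

    def insert(self, value):
        if self.root is None:
            self.root = (value, {})  # first value becomes the root
            return
        node = self.root
        while True:
            d = self.dist(value, node[0])
            if d in node[1]:          # an edge for this distance exists: descend
                node = node[1][d]
            else:                     # distance occurs for the first time: new child
                node[1][d] = (value, {})
                return

    def query(self, target, n):
        """All stored values within distance n of target. The triangle
        inequality guarantees that only edges labelled d-n .. d+n can lead
        to hits, so the other subtrees are pruned."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            value, children = stack.pop()
            d = self.dist(target, value)
            if d <= n:
                results.append(value)
            for edge, child in children.items():
                if d - n <= edge <= d + n:
                    stack.append(child)
        return results
```

A usage sketch: build one tree per index domain with `dist = lambda a, b: bin(a ^ b).count("1")`, insert the fingerprints of up to G texts, and call `query(target_fingerprint, 25)` to obtain the candidate texts for step S22.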
After the texts to be analyzed are determined, the similarity between the target text and each text to be analyzed can be calculated; the resulting similarity can be used for services such as infringement comparison, de-duplication, and template filtering.
Step S23: Determine the shared objects of the target text and a text to be analyzed, the number of shared objects being at least two.
Step S24: Calculate the Hamming distance of each shared object between the target text and the text to be analyzed.
Step S25: When the Hamming distances of the at least two shared objects meet the first preset condition, determine the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the spliced-string similarity of a predetermined object among the at least two shared objects.
Steps S23 to S25 may be performed as described in embodiment one; other details of the similarity calculation in this embodiment may also refer to embodiment one and are not repeated here.
Figure 17 is a structural schematic diagram of the text similarity calculation device provided by embodiment two of the application. As shown in figure 17, the device provided by this embodiment includes an index building module 21, a lookup module 22, an extraction module 23, a determining module 24, a first calculation module 25, and a second calculation module 26, where:
the extraction module 23 is configured to extract the hash values of the first object and/or the second object from the target text and from multiple texts;
the index building module 21 is configured to establish, according to the hash values of the first object and/or the second object of each of the multiple texts, the index domain of the first object and/or the index domain of the second object, each index domain including one or more index trees;
the lookup module 22 is configured to look up, according to the hash value of the first object of the target text, in each index tree of the index domain of the first object, the texts whose Hamming distance to the target text satisfies the first condition, and/or to look up, according to the hash value of the second object of the target text, in each index tree of the index domain of the second object, the texts whose Hamming distance to the target text satisfies the second condition, and to determine the texts to be analyzed from the texts found;
the extraction module 23 is further configured to extract the data of at least two objects from the target text and from each text to be analyzed respectively;
the determining module 24 is configured to determine the shared objects of the target text and a text to be analyzed, the number of shared objects being at least two;
the first calculation module 25 is configured to calculate the Hamming distance of each shared object between the target text and the text to be analyzed; and
the second calculation module 26 is configured to determine, when the Hamming distances of the at least two shared objects meet the first preset condition, the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the spliced-string similarity of a predetermined object among the at least two shared objects.
The processing flow of the device is as described in the method of this embodiment and is not repeated here.
Embodiment three
The embodiments of the application also provide a data processing electronic device for calculating the similarity between two texts. The electronic device includes a memory and a processor; the memory stores a program for text similarity calculation, and when the program is read and executed by the processor, the following operations are performed:
determining the shared objects of the two texts, the number of shared objects being at least two;
calculating the Hamming distance of each shared object between the two texts; and
when the Hamming distances of the at least two shared objects meet the first preset condition, determining the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the spliced-string similarity of a predetermined object among the at least two shared objects.
Optionally, the program is written in Java, C++, or Python.
In addition, the embodiments of the invention also provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the text similarity calculation method described above.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method can be completed by a program instructing the relevant hardware (such as a processor), the program being stored in a computer-readable storage medium such as a read-only memory, magnetic disk, or optical disc. Optionally, all or part of the steps of the above embodiments can also be implemented using one or more integrated circuits. Correspondingly, each module/unit in the above embodiments can be realized in the form of hardware, for example by an integrated circuit realizing its corresponding function, or in the form of a software functional module, for example by a processor executing a program/instruction stored in a memory to realize its corresponding function. The application is not restricted to any particular combination of hardware and software.
The above has shown and described the basic principles, principal features, and advantages of the application. The application is not limited by the above embodiments, which, together with the description, merely illustrate its principles; without departing from the spirit and scope of the application, various changes and improvements are possible, and all such changes and improvements fall within the scope of the claimed application.
Claims (14)
- 1. A text similarity calculation method for calculating the similarity between two texts, wherein data of at least two objects can be extracted from each text, an object being a feature that embodies the semantics of the text, the method comprising: determining the shared objects of the two texts, the number of shared objects being at least two; calculating the Hamming distance of each shared object between the two texts; and, when the Hamming distances of the at least two shared objects meet a first preset condition, determining the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the spliced-string similarity of a predetermined object among the at least two shared objects.
- 2. The method of claim 1, further comprising: when the Hamming distances of the at least two shared objects do not meet the first preset condition, determining the similarity between the two texts according to the minimum value among the Hamming distances of the at least two shared objects.
- 3. The method of claim 1, wherein determining the similarity between the two texts when the Hamming distances of the at least two shared objects meet the first preset condition comprises, when the predetermined objects include a first predetermined object and a second predetermined object, or include a first predetermined object, a second predetermined object, and a third predetermined object: if the term vector similarity of the first predetermined object meets a second preset condition, determining that the similarity between the two texts equals the term vector similarity of the first predetermined object; if the term vector similarity of the first predetermined object does not meet the second preset condition, and the first Hamming distance of the second predetermined object meets a third preset condition and the second Hamming distance of the second predetermined object meets a fourth preset condition, determining the similarity between the two texts according to the second Hamming distance of the second predetermined object; and if the term vector similarity of the first predetermined object does not meet the second preset condition, and the first Hamming distance of the second predetermined object does not meet the third preset condition or the second Hamming distance of the second predetermined object does not meet the fourth preset condition, determining the similarity between the two texts according to the spliced-string similarity of the first predetermined object and/or according to the term vector similarity of the third predetermined object.
- 4. The method of claim 1, wherein determining the similarity between the two texts when the Hamming distances of the at least two shared objects meet the first preset condition comprises, when the predetermined objects include the second predetermined object but not the first predetermined object, or include the second predetermined object and the third predetermined object but not the first predetermined object: determining the similarity between the two texts according to the second Hamming distance of the second predetermined object; or, if the first Hamming distance of the second predetermined object meets the third preset condition and the second Hamming distance of the second predetermined object meets the fourth preset condition, determining the similarity between the two texts according to the second Hamming distance of the second predetermined object, and if the first Hamming distance of the second predetermined object does not meet the third preset condition or the second Hamming distance of the second predetermined object does not meet the fourth preset condition, determining the similarity between the two texts according to the term vector similarity of the third predetermined object.
- 5. The method according to claim 1, wherein, when the Hamming distances of the at least two shared objects satisfy the first preset condition, determining the similarity between the two texts comprises:
  when the predetermined objects comprise the first predetermined object but not the second predetermined object, or comprise the first predetermined object and the third predetermined object but not the second predetermined object:
  if the term vector similarity of the first predetermined object satisfies the second preset condition, determining that the similarity between the two texts is equal to the term vector similarity of the first predetermined object;
  if the term vector similarity of the first predetermined object does not satisfy the second preset condition, determining the similarity between the two texts according to the splicing character string similarity of the first predetermined object, and/or according to the term vector similarity of the third predetermined object.
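Claims 3 to 5 describe a tiered fallback: title similarity first, then the Hamming distances of the key sentences, then splicing character string and/or keyword similarity. A minimal sketch of that cascade follows; the function name, the threshold values, the 64-bit fingerprint width, and the averaging of the two fallback similarities are all illustrative assumptions, since the claims leave the preset conditions and the combination rule unspecified.

```python
def combined_similarity(title_sim, total_sent_hd, pairwise_sent_hd,
                        splice_sim, keyword_sim,
                        title_thresh=0.8, total_hd_thresh=10, pair_hd_thresh=5):
    """Tiered similarity decision sketch (claims 3-5).

    title_sim        -- term vector similarity of the titles (first predetermined object)
    total_sent_hd    -- total Hamming distance of the key sentences (first Hamming distance)
    pairwise_sent_hd -- minimum pairwise sentence Hamming distance (second Hamming distance)
    splice_sim       -- splicing character string similarity of the titles
    keyword_sim      -- term vector similarity of the keywords (third predetermined object)
    All threshold values are illustrative assumptions, not taken from the patent.
    """
    # Second preset condition: a high title similarity is decisive on its own.
    if title_sim >= title_thresh:
        return title_sim
    # Third and fourth preset conditions: fall back to sentence Hamming distances.
    if total_sent_hd <= total_hd_thresh and pairwise_sent_hd <= pair_hd_thresh:
        # Map a Hamming distance to a [0, 1] similarity (assuming 64-bit hashes).
        return 1.0 - pairwise_sent_hd / 64.0
    # Last resort: splicing string and/or keyword similarity (here: their mean).
    return (splice_sim + keyword_sim) / 2.0
```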
- 6. The method according to claim 1, wherein, before determining the similarity between the two texts when the Hamming distances of the at least two shared objects satisfy the first preset condition, the method further comprises determining the term vector similarity of a predetermined object as follows:
  calculating the term vector cosine similarity of the predetermined object according to the term vector data of the predetermined object;
  when the word counts or lengths of the predetermined object in both texts are greater than a preset penalty factor, determining the term vector similarity of the predetermined object to be the term vector cosine similarity;
  when the word count or length of the predetermined object in at least one of the two texts is less than or equal to the penalty factor, calculating a penalty correction value according to the word counts or lengths of the two texts and a basic penalty value; determining an additional penalty value according to the sum of the addition penalty values of the two texts, wherein the addition penalty value of each text is determined by the penalty factor, the word count or length of that text, and an additional penalty maximum; and determining the term vector similarity of the predetermined object according to the term vector cosine similarity, the smaller of the word counts or lengths of the two texts, the penalty correction value, the additional penalty value, the penalty factor, the additional penalty maximum, and the basic penalty value.
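Claim 6 damps the cosine similarity of a predetermined object when it is short in either text. The exact combination formula is not disclosed, so the sketch below is one plausible reading under stated assumptions: a linear correction scaled by how far the shorter object falls below the penalty factor, plus a capped additional penalty summed over both texts. All names and constants are hypothetical.

```python
import math

def cosine(u, v):
    """Plain cosine similarity of two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def penalized_term_vector_similarity(u, v, len_a, len_b,
                                     penalty_factor=5, base_penalty=0.1,
                                     extra_penalty_max=0.2):
    """Sketch of claim 6: penalize the cosine similarity of short objects.

    len_a, len_b are the word counts (or lengths) of the object in the two
    texts. The formula below is an assumption, not the patent's own.
    """
    sim = cosine(u, v)
    if len_a > penalty_factor and len_b > penalty_factor:
        return sim  # both long enough: no penalty (claim 6, first branch)
    # Penalty correction value, scaled by the shorter text's deficit.
    shorter = min(len_a, len_b)
    correction = base_penalty * (penalty_factor - shorter) / penalty_factor
    # Additional penalty value, summed over both texts and capped.
    extra = min(extra_penalty_max,
                sum(max(0, penalty_factor - n) for n in (len_a, len_b))
                * extra_penalty_max / (2 * penalty_factor))
    return max(0.0, sim - correction - extra)
```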
- 7. The method according to claim 1, wherein, before calculating the Hamming distance of each shared object between the two texts, the method further comprises:
  extracting the data of at least two objects from each of the two texts, the data comprising hash values and/or term vector data.
- 8. The method according to any one of claims 1 to 7, wherein the first preset condition comprises: the minimum of the Hamming distances of the at least two shared objects being less than or equal to a first distance threshold, or the minimum of the Hamming distances of the at least two shared objects falling within a first preset range.
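Claims 7 and 8 reduce to comparing per-object hash values by bitwise Hamming distance and gating on the minimum over the shared objects. A sketch, assuming simhash-style integer fingerprints and an illustrative threshold of 3:

```python
def hamming_distance(h1, h2):
    """Bit-level Hamming distance between two integer fingerprints."""
    return bin(h1 ^ h2).count("1")

def first_preset_condition(hash_pairs, distance_threshold=3):
    """Claim 8 sketch: the two texts pass the gate when the minimum Hamming
    distance over their shared objects is at or below the threshold.
    The fingerprint format and the threshold of 3 are assumptions."""
    min_hd = min(hamming_distance(a, b) for a, b in hash_pairs)
    return min_hd <= distance_threshold

# Shared objects of two texts, e.g. (title_hash, content_hash) pairs:
pairs = [(0b1011_0001, 0b1011_0011),   # distance 1
         (0b1111_0000, 0b0000_1111)]   # distance 8
assert first_preset_condition(pairs)   # minimum distance 1 <= 3
```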
- 9. The method according to any one of claims 3 to 5, wherein the first predetermined object is the title, the second predetermined object is the sentences expressing the main idea of the text, the number of the sentences being greater than or equal to three, and the third predetermined object is the keywords; the first Hamming distance of the second predetermined object is the total Hamming distance of the sentences between the two texts, and the second Hamming distance of the second predetermined object comprises the Hamming distance between each sentence in one text and each sentence in the other text.
- 10. The method according to any one of claims 1 to 7, wherein one of the two texts is a target text and the other is a text to be analyzed, and before determining the shared objects of the two texts, the method further comprises obtaining the text to be analyzed as follows:
  establishing an index domain of a first object and/or an index domain of a second object according to the hash values of the first object and/or the second object of each of multiple texts, wherein each index domain comprises one or more index trees;
  searching, according to the hash value of the first object of the target text, each index tree in the index domain of the first object for texts whose Hamming distance to the target text satisfies a first condition, and/or searching, according to the hash value of the second object of the target text, each index tree in the index domain of the second object for texts whose Hamming distance to the target text satisfies a second condition; and
  determining the text to be analyzed from the texts found.
- 11. The method according to claim 10, wherein the index trees are BK-tree index trees.
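Claim 11 names BK-trees as the index trees. A minimal BK-tree over Hamming distance is sketched below (class and method names are assumptions): children are keyed by their distance to the parent, and a radius query prunes any child whose key lies outside [d - radius, d + radius], which by the triangle inequality cannot contain a match.

```python
class BKTree:
    """Minimal BK-tree over bitwise Hamming distance (claim 11 sketch)."""

    def __init__(self, fingerprint):
        self.fp = fingerprint
        self.children = {}  # distance to this node -> subtree

    @staticmethod
    def _dist(a, b):
        return bin(a ^ b).count("1")

    def add(self, fp):
        """Insert a fingerprint by walking down the child at its distance."""
        node = self
        while True:
            d = self._dist(fp, node.fp)
            if d == 0:
                return  # duplicate fingerprint, nothing to insert
            if d in node.children:
                node = node.children[d]
            else:
                node.children[d] = BKTree(fp)
                return

    def query(self, fp, radius):
        """Return all stored fingerprints within `radius` of `fp`."""
        out = []
        d = self._dist(fp, self.fp)
        if d <= radius:
            out.append(self.fp)
        for key, child in self.children.items():
            # Triangle inequality: only these children can hold matches.
            if d - radius <= key <= d + radius:
                out.extend(child.query(fp, radius))
        return out
```

Because near-duplicate fingerprints differ in only a few bits, small-radius queries touch few branches, which is what would make the index domains of claim 10 cheap to search.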
- 12. The method according to claim 10, wherein the first object is the content and the second object is the title.
- 13. A text similarity calculation device for calculating the similarity between two texts, the device comprising:
  an extraction module, configured to extract the data of at least two objects from each text, wherein an object is a feature capable of embodying the semantics of the text;
  a determining module, configured to determine the shared objects of the two texts, wherein the number of the shared objects is at least two;
  a first calculation module, configured to calculate the Hamming distance of each shared object between the two texts; and
  a second calculation module, configured to determine, when the Hamming distances of the at least two shared objects satisfy a first preset condition, the similarity between the two texts according to at least one of: the term vector similarity, the Hamming distance, and the splicing character string similarity of a predetermined object among the at least two shared objects.
- 14. The device according to claim 13, further comprising:
  an index establishing module, configured to establish an index domain of a first object and/or an index domain of a second object according to the hash values of the first object and/or the second object of each of multiple texts, wherein each index domain comprises one or more index trees; and
  a searching module, configured to search, according to the hash value of the first object of the target text, each index tree in the index domain of the first object for texts whose Hamming distance to the target text satisfies a first condition, and/or to search, according to the hash value of the second object of the target text, each index tree in the index domain of the second object for texts whose Hamming distance to the target text satisfies a second condition, and to determine the text to be analyzed from the texts found;
  wherein the determining module is configured to determine the shared objects of the target text and the text to be analyzed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610578843.9A CN107644010B (en) | 2016-07-20 | 2016-07-20 | Text similarity calculation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107644010A true CN107644010A (en) | 2018-01-30 |
CN107644010B CN107644010B (en) | 2021-05-25 |
Family
ID=61109052
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610578843.9A Active CN107644010B (en) | 2016-07-20 | 2016-07-20 | Text similarity calculation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107644010B (en) |
- 2016-07-20: application CN201610578843.9A filed in CN (granted as CN107644010B, status: Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7707157B1 (en) * | 2004-03-25 | 2010-04-27 | Google Inc. | Document near-duplicate detection |
CN103678275A (en) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Two-level text similarity calculation method based on subjective and objective semantics |
CN105095162A (en) * | 2014-05-19 | 2015-11-25 | 腾讯科技(深圳)有限公司 | Text similarity determining method and device, electronic equipment and system |
US20160103831A1 (en) * | 2014-10-14 | 2016-04-14 | Adobe Systems Incorporated | Detecting homologies in encrypted and unencrypted documents using fuzzy hashing |
CN105302779A (en) * | 2015-10-23 | 2016-02-03 | 北京慧点科技有限公司 | Text similarity comparison method and device |
Non-Patent Citations (1)
Title |
---|
Chen Lu et al.: "Text de-duplication method based on semantic fingerprint and LCS", Software (《软件》) *
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595439A (en) * | 2018-05-04 | 2018-09-28 | 北京中科闻歌科技股份有限公司 | A kind of character spread path analysis method and system |
CN108595439B (en) * | 2018-05-04 | 2022-04-12 | 北京中科闻歌科技股份有限公司 | Method and system for analyzing character propagation path |
CN110555198B (en) * | 2018-05-31 | 2023-05-23 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer readable storage medium for generating articles |
CN110555198A (en) * | 2018-05-31 | 2019-12-10 | 北京百度网讯科技有限公司 | method, apparatus, device and computer-readable storage medium for generating article |
CN110717092A (en) * | 2018-06-27 | 2020-01-21 | 北京京东尚科信息技术有限公司 | Method, system, device and storage medium for matching objects for articles |
CN108897861A (en) * | 2018-07-01 | 2018-11-27 | 东莞市华睿电子科技有限公司 | A kind of information search method |
CN109190117A (en) * | 2018-08-10 | 2019-01-11 | 中国船舶重工集团公司第七〇九研究所 | A kind of short text semantic similarity calculation method based on term vector |
CN110891010B (en) * | 2018-09-05 | 2022-09-16 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
CN110891010A (en) * | 2018-09-05 | 2020-03-17 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
CN109299260A (en) * | 2018-09-29 | 2019-02-01 | 上海晶赞融宣科技有限公司 | Data classification method, device and computer readable storage medium |
CN109241505A (en) * | 2018-10-09 | 2019-01-18 | 北京奔影网络科技有限公司 | Text De-weight method and device |
CN109089018A (en) * | 2018-10-29 | 2018-12-25 | 上海理工大学 | A kind of intelligence prompter devices and methods therefor |
CN109635090A (en) * | 2018-12-14 | 2019-04-16 | 安徽中船璞华科技有限公司 | A kind of copyright method for tracing based on machine learning |
CN110134768B (en) * | 2019-05-13 | 2023-05-26 | 腾讯科技(深圳)有限公司 | Text processing method, device, equipment and storage medium |
CN110134768A (en) * | 2019-05-13 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Processing method, device, equipment and the storage medium of text |
CN110348539A (en) * | 2019-07-19 | 2019-10-18 | 知者信息技术服务成都有限公司 | Short text correlation method of discrimination |
CN111061983A (en) * | 2019-12-17 | 2020-04-24 | 上海冠勇信息科技有限公司 | Evaluation method for capturing priority of infringement data and network monitoring system thereof |
CN111061983B (en) * | 2019-12-17 | 2024-01-09 | 上海冠勇信息科技有限公司 | Evaluation method of infringement data grabbing priority and network monitoring system thereof |
CN111104794A (en) * | 2019-12-25 | 2020-05-05 | 同方知网(北京)技术有限公司 | Text similarity matching method based on subject words |
CN111104794B (en) * | 2019-12-25 | 2023-07-04 | 同方知网数字出版技术股份有限公司 | Text similarity matching method based on subject term |
CN111061842A (en) * | 2019-12-26 | 2020-04-24 | 上海众源网络有限公司 | Similar text determination method and device |
CN111061842B (en) * | 2019-12-26 | 2023-06-30 | 上海众源网络有限公司 | Similar text determining method and device |
CN111611399A (en) * | 2020-04-15 | 2020-09-01 | 广发证券股份有限公司 | Information event mapping system and method based on natural language processing |
CN112100372A (en) * | 2020-08-20 | 2020-12-18 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Head news prediction classification method |
CN113011194A (en) * | 2021-04-15 | 2021-06-22 | 电子科技大学 | Text similarity calculation method fusing keyword features and multi-granularity semantic features |
CN113268986A (en) * | 2021-05-24 | 2021-08-17 | 交通银行股份有限公司 | Unit name matching and searching method and device based on fuzzy matching algorithm |
WO2022267325A1 (en) * | 2021-06-25 | 2022-12-29 | 完美世界控股集团有限公司 | News popularity calculation method, device and storage medium |
CN113449077A (en) * | 2021-06-25 | 2021-09-28 | 完美世界控股集团有限公司 | News popularity calculation method, equipment and storage medium |
CN113449077B (en) * | 2021-06-25 | 2024-04-05 | 完美世界控股集团有限公司 | News heat calculation method, device and storage medium |
CN113505835A (en) * | 2021-07-14 | 2021-10-15 | 杭州隆埠科技有限公司 | Similar news duplicate removal method and device |
CN113673216B (en) * | 2021-10-20 | 2022-02-01 | 支付宝(杭州)信息技术有限公司 | Text infringement detection method and device and electronic equipment |
CN113673216A (en) * | 2021-10-20 | 2021-11-19 | 支付宝(杭州)信息技术有限公司 | Text infringement detection method and device and electronic equipment |
CN114168809A (en) * | 2021-11-22 | 2022-03-11 | 中核核电运行管理有限公司 | Similarity-based document character string code matching method and device |
CN114398968A (en) * | 2022-01-06 | 2022-04-26 | 北京博瑞彤芸科技股份有限公司 | Method and device for labeling similar customer-obtaining files based on file similarity |
CN117235546A (en) * | 2023-11-14 | 2023-12-15 | 国泰新点软件股份有限公司 | Multi-version file comparison method, device, system and storage medium |
CN117235546B (en) * | 2023-11-14 | 2024-03-12 | 国泰新点软件股份有限公司 | Multi-version file comparison method, device, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107644010B (en) | 2021-05-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107644010A (en) | A kind of Text similarity computing method and device | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN109190117B (en) | Short text semantic similarity calculation method based on word vector | |
CN105095204B (en) | The acquisition methods and device of synonym | |
TWI512507B (en) | A method and apparatus for providing multi-granularity word segmentation results | |
US9626358B2 (en) | Creating ontologies by analyzing natural language texts | |
US8380489B1 (en) | System, methods, and data structure for quantitative assessment of symbolic associations in natural language | |
CN109299280B (en) | Short text clustering analysis method and device and terminal equipment | |
CN104881458B (en) | A kind of mask method and device of Web page subject | |
CN112395395B (en) | Text keyword extraction method, device, equipment and storage medium | |
CN110134792B (en) | Text recognition method and device, electronic equipment and storage medium | |
KR101423549B1 (en) | Sentiment-based query processing system and method | |
CN115630640B (en) | Intelligent writing method, device, equipment and medium | |
CN110347790B (en) | Text duplicate checking method, device and equipment based on attention mechanism and storage medium | |
CN111090731A (en) | Electric power public opinion abstract extraction optimization method and system based on topic clustering | |
CN110083832B (en) | Article reprint relation identification method, device, equipment and readable storage medium | |
JP2003223456A (en) | Method and device for automatic summary evaluation and processing, and program therefor | |
CN114065758A (en) | Document keyword extraction method based on hypergraph random walk | |
CN112597300A (en) | Text clustering method and device, terminal equipment and storage medium | |
JP6420268B2 (en) | Image evaluation learning device, image evaluation device, image search device, image evaluation learning method, image evaluation method, image search method, and program | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. | |
CN114997288A (en) | Design resource association method | |
CN112749272A (en) | Intelligent new energy planning text recommendation method for unstructured data | |
Ashna et al. | Lexicon based sentiment analysis system for malayalam language | |
CN104021202A (en) | Device and method for processing entries of knowledge sharing platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||