CN110321925A - A multi-granularity text similarity comparison method based on semantic fusion fingerprints - Google Patents

A multi-granularity text similarity comparison method based on semantic fusion fingerprints

Info

Publication number
CN110321925A
CN110321925A
Authority
CN
China
Prior art keywords
text
word
fingerprint
semantic feature
semantics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910441282.1A
Other languages
Chinese (zh)
Other versions
CN110321925B (en)
Inventor
梁燕
万正景
陶以政
李龚亮
许峰
曹政
谢杨
马丹阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
COMPUTER APPLICATION INST CHINA ENGINEERING PHYSICS ACADEMY
Original Assignee
COMPUTER APPLICATION INST CHINA ENGINEERING PHYSICS ACADEMY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by COMPUTER APPLICATION INST CHINA ENGINEERING PHYSICS ACADEMY filed Critical COMPUTER APPLICATION INST CHINA ENGINEERING PHYSICS ACADEMY
Priority to CN201910441282.1A priority Critical patent/CN110321925B/en
Publication of CN110321925A publication Critical patent/CN110321925A/en
Application granted granted Critical
Publication of CN110321925B publication Critical patent/CN110321925B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G06F 18/232: Non-hierarchical techniques
    • G06F 18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213: Non-hierarchical techniques with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-granularity text similarity comparison method based on semantic fusion fingerprints, comprising the following steps: word-vector representation training; semantic feature extraction; multi-feature aggregation; hierarchical index construction; similarity calculation. The invention models word-vector representations by jointly exploiting multi-dimensional semantic correlations, fully mining the semantic information between words; features are extracted sentence by sentence and characterized with multiple weights; statistical-learning methods mine the statistics and distribution of the text corpus to achieve a finer division of the feature space; compact, highly discriminative text fingerprints are then generated by multi-feature aggregation, effectively improving the descriptive power and discrimination of text fingerprints. Following a top-down strategy, text similarity is compared using both the semantic fusion fingerprint and the local semantic features; by building a hierarchical index, multi-granularity similarity comparison from the global level down to the local level can be performed quickly and efficiently. The method also has good scalability.

Description

A multi-granularity text similarity comparison method based on semantic fusion fingerprints
Technical field
The present invention relates to text similarity comparison methods, and in particular to a multi-granularity text similarity comparison method based on semantic fusion fingerprints. It belongs to the field of pattern recognition and information processing.
Background technique
Two texts are approximate when the content and information they describe are similar or even identical. If one text is generated from another by modifying a small portion of its content through insertion, deletion, replacement, or similar operations, the two texts are considered approximate. The spread of approximate texts or web pages is usually undesirable, and with the surge of data the problems they cause grow increasingly severe. Approximate-text detection is therefore an important technique for reducing storage overhead, improving retrieval efficiency and data utilization, and preventing plagiarism and unauthorized copying.
Experts and scholars at home and abroad have proposed a variety of methods. Traditional text similarity comparison falls into two broad classes. The first is based on string comparison. The second is based on word-frequency statistics: on top of the vector space model, a text is characterized by a feature vector, and the similarity distance between vectors measures the similarity between texts. The former can use strings of different granularities, such as sentence-level or paragraph-level strings. However, because a text usually contains a large number of strings, string-matching methods can hardly achieve real-time performance on massive long texts.
Among Shingle-based methods, one group treats a text as a set of shingles, where a shingle is a contiguous subsequence of the text; these methods inevitably suffer from high computational cost and essentially cannot handle massive data. Another group maps every word of a text to a simple hash value using an existing dictionary; although this effectively reduces the cost of similarity computation, the resulting text fingerprints are unstable: when the dictionary vocabulary does not cover the words in a text, even a tiny change in the text can cause the hash value to fluctuate.
The Simhash algorithm proposed by Charikar is currently regarded as the best and most effective algorithm for approximate-text detection. Simhash performs probabilistic dimensionality reduction on high-dimensional data, mapping a high-dimensional text feature vector to a small, fixed-length fingerprint. Most approximate-text detection systems today are built on Simhash. However, these methods focus only on the text itself and ignore useful information in the text corpus; moreover, Simhash generates a single fingerprint per text and therefore cannot support comparison of locally similar parts of two texts.
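Since Simhash is the baseline the invention is compared against, a minimal sketch of the standard Charikar construction may help; the MD5 token hash and the toy documents below are arbitrary illustrative choices, not anything specified by the patent:

```python
import hashlib

def simhash(tokens, bits=64):
    """Charikar's Simhash: every token votes +1/-1 on each bit of a
    fixed-length fingerprint; the sign of each bit-sum gives the final bit."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumped over the lazy dog".split()  # one word changed
doc3 = "completely unrelated sentence about databases".split()

d_near = hamming(simhash(doc1), simhash(doc2))
d_far = hamming(simhash(doc1), simhash(doc3))
```

A small edit leaves most bit-sums dominated by the shared tokens, so the near-duplicate pair lands at a much smaller Hamming distance than the unrelated pair. This also illustrates the limitation the patent notes: one fingerprint per text, so no locally similar passages can be reported.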
Summary of the invention
The object of the present invention is to solve the above problems by providing a multi-granularity text similarity comparison method based on semantic fusion fingerprints.
The present invention achieves this object through the following technical solution:
A multi-granularity text similarity comparison method based on semantic fusion fingerprints, comprising the following steps:
Step 1, word-vector representation training: jointly model word-vector learning with integrated multi-dimensional semantic correlations. Using a public corpus, text-to-word mappings represent the horizontal (co-occurrence) relationship between words within a context, while word-to-context mappings represent the vertical relationship between words that share similar contexts; synonym and antonym information is added to the model. Word-vector training is performed by unsupervised learning, so that the resulting word vectors perform better on semantic-correlation and synonym/antonym identification tasks;
Step 2, semantic feature extraction: taking the sentence, which is richer in semantic information, as the basic unit, each sentence is preprocessed (word segmentation, stop-word removal); each token is then represented by its word vector and characterized with multiple weights including term frequency and part of speech, and the semantic feature is computed as the weighted sum of the token word vectors, realizing semantic feature extraction;
Step 3, multi-feature aggregation: cluster the semantic features of the training corpus using their statistics and distribution characteristics, dividing the semantic feature space into multiple sub-partitions and thereby achieving a finer division of the feature space. Each semantic feature of a text is assigned, by its distance to the cluster centres, to the nearest sub-partition; within each sub-partition, the residuals between the assigned semantic features and the cluster centre are summed, and the residual sums of all sub-partitions are concatenated to generate the semantic fusion fingerprint;
Step 4, hierarchical index construction: cluster the semantic fusion fingerprints of the training corpus to form the first index layer; then split each fingerprint residual into sub-vectors and cluster them again to build the second index layer, obtaining the hierarchical index. All semantic fusion fingerprints of the test corpus are quantized onto the hierarchical index, generating the corresponding inverted file;
Step 5, similarity calculation: following the multi-granularity comparison algorithm based on the hierarchical index, a top-down computation first evaluates the global similarity between the text to be compared and each text in the corpus; when the similarity exceeds a set threshold, the text is added to the candidate set of similar texts. The local similarity against the candidate texts is then computed, yielding the final similar texts together with their specific locally similar content, i.e., the multi-granularity similarity comparison result.
Preferably, the integrated multi-dimensional semantic-correlation joint modeling of step 1 represents each word as a K-dimensional vector; the objective function of word-vector learning is expressed as follows:
Here, N denotes the number of texts in the training corpus; d_n denotes the n-th text; w_i^n denotes the word vector of the i-th word in the n-th text; x_i^n denotes the vector obtained by summing the context of w_i^n; p(w_i^n | x_i^n) and p(w_i^n | d_n) denote, respectively, the probability of the word w_i^n appearing in its context and the probability of it appearing in the text; SYN_i^n and ANT_i^n denote the synonym set and antonym set of w_i^n; p(w_i^n | u) denotes the probability of w_i^n appearing given a known synonym or antonym u; α is a weight factor with 0 < α < 1. Training maximizes the objective function and is solved by stochastic gradient ascent.
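A plausible form of the objective, consistent with the symbol definitions given above (a context term, a document term, and α-weighted synonym and antonym terms); this reconstruction is an assumption, not the patent's verbatim formula:

```latex
\mathcal{L} \;=\; \sum_{n=1}^{N} \sum_{i} \Big[
    \log p\!\left(w_i^n \mid x_i^n\right)
  + \log p\!\left(w_i^n \mid d_n\right)
  + \alpha \!\!\sum_{u \in \mathrm{SYN}_i^n}\!\! \log p\!\left(w_i^n \mid u\right)
  - \alpha \!\!\sum_{u \in \mathrm{ANT}_i^n}\!\! \log p\!\left(w_i^n \mid u\right)
\Big]
```

Maximizing this by stochastic gradient ascent, as the text states, would pull a word's vector toward its contexts, documents, and synonyms while pushing it away from its antonyms.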
Preferably, step 2 proceeds as follows:
First, the text is preprocessed: it is segmented at punctuation marks to obtain a sentence set {S_1, S_2, ..., S_M}, where M is the number of sentences in the text; each sentence is word-segmented and stop words are removed, giving {c_1, c_2, ..., c_T}, where T is the number of tokens in the sentence. The token weight is characterized as the product of the term-frequency weight ω_f and the part-of-speech weight ω_n, i.e. ω_c = ω_f × ω_n; regarding part of speech, nouns carry the highest weight, verbs the second, adjectives the third, and all others the lowest;
Then, each token is represented by its word vector, and the semantic feature is expressed as the weighted sum of the token word vectors:
Here, f_{i,k} denotes the k-th dimension of the feature of the i-th sentence; w_{i,j,k} and ω_{i,j} denote, respectively, the k-th component of the word vector of the j-th token of the i-th sentence and the corresponding token weight; I(·) is an indicator function whose value is 1 when w_{i,j,k} > 0 and −1 otherwise. The text is then represented as the set of M semantic features {f_1, f_2, ..., f_M}.
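As a minimal sketch of the sentence-feature computation, assuming the elided formula is f_{i,k} = Σ_j ω_{i,j} · I(w_{i,j,k}) with I(·) = ±1 as defined above; the toy word vectors, term-frequency weights, and part-of-speech weight values are hypothetical, only the noun > verb > adjective > other ordering comes from the text:

```python
import numpy as np

# Hypothetical POS weights following the text's ordering: noun > verb > adjective > other.
POS_WEIGHT = {"noun": 1.0, "verb": 0.75, "adj": 0.5, "other": 0.25}

def sentence_feature(word_vecs, tf_weights, pos_tags):
    """One semantic feature per sentence: weighted sum of sign-binarized
    word vectors, f_k = sum_j w_j * I(v_{j,k} > 0), I(.) in {+1, -1}."""
    feat = np.zeros(word_vecs.shape[1])
    for vec, tf, pos in zip(word_vecs, tf_weights, pos_tags):
        w = tf * POS_WEIGHT[pos]                    # omega_c = omega_f * omega_n
        feat += w * np.where(vec > 0, 1.0, -1.0)    # indicator I(.)
    return feat

# Toy sentence: 3 tokens with K = 4 dimensional word vectors.
vecs = np.array([[ 0.2, -0.1,  0.5, -0.3],
                 [-0.4,  0.3,  0.1,  0.2],
                 [ 0.1,  0.1, -0.2, -0.6]])
f = sentence_feature(vecs, tf_weights=[2.0, 1.0, 1.0],
                     pos_tags=["noun", "verb", "adj"])
```

Binarizing each component before weighting keeps the feature robust to small fluctuations in the word-vector values, in the spirit of the fingerprinting methods the patent builds on.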
Preferably, step 3 proceeds as follows:
First, the semantic features of the training corpus are clustered into L classes with the K-means algorithm, giving a codebook of cluster centres C = {μ_1, μ_2, ..., μ_L}, each cluster corresponding to one sub-partition of the semantic feature space. K-means is a hard clustering algorithm and a typical representative of prototype-based objective-function clustering: it takes a distance from data points to prototypes as the objective to optimize and derives the iterative update rules by seeking the extremum of that function.
Then, using the statistics and distribution of the semantic features relative to the semantic feature space, the semantic fusion text fingerprint is generated: the distance between each semantic feature f_i of the text and the cluster centre of every sub-partition is computed, and the feature is assigned to the nearest sub-partition:
Id(f_i) = arg min_j ||f_i - μ_j||_2,  i = 1, 2, ..., M,  j = 1, 2, ..., L
where Id(f_i) denotes the index of the sub-partition to which the semantic feature is assigned, and μ_j denotes the cluster centre of the j-th sub-partition;
Finally, the sum of the residuals between the semantic features belonging to the same sub-partition and its cluster centre is computed:
Here, f_j : Id(f_j) = i denotes a semantic feature f_j assigned to the i-th sub-partition. The residual sums of all sub-partitions are concatenated into a K × L-dimensional vector V_d = [v_1, v_2, ..., v_L], which is the semantic fusion fingerprint; a text is ultimately represented as one semantic fusion fingerprint V_d plus M semantic features {f_1, f_2, ..., f_M}.
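The per-sub-partition residual sums concatenated into a K × L vector amount to a VLAD-style encoding; a minimal sketch with hypothetical centroids and toy features (in the patent the centroids come from K-means over the training corpus):

```python
import numpy as np

def aggregate_fingerprint(features, centroids):
    """Assign each sentence feature to its nearest cluster centre, sum the
    residuals per sub-partition, and concatenate into one K*L vector."""
    L, K = centroids.shape
    V = np.zeros((L, K))
    for f in features:
        j = np.argmin(np.linalg.norm(f - centroids, axis=1))  # Id(f) = argmin ||f - mu_j||
        V[j] += f - centroids[j]                              # accumulate residual
    return V.reshape(-1)  # the semantic fusion fingerprint

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # L = 2 sub-partitions, K = 2
feats = [np.array([1.0, 1.0]), np.array([9.0, 11.0]), np.array([-1.0, 0.0])]
fp = aggregate_fingerprint(feats, centroids)
```

Features near the same centre partially cancel or reinforce in the residual sum, so the fingerprint captures how the text's features are distributed relative to the corpus-wide partition rather than just which partitions they hit.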
Preferably, step 4 proceeds as follows:
First, the semantic fusion fingerprints of the training corpus are clustered with the K-means algorithm, and the resulting cluster centres serve as semantic fusion fingerprint words, constituting the first index layer;
Then, each semantic fusion fingerprint of the training corpus is quantized onto its nearest fingerprint word according to their distance, and its difference from that fingerprint word is computed as the fingerprint residual. Each fingerprint residual is split evenly into L sub-vectors of K dimensions; K-means is applied to the sub-vectors to obtain D cluster centres, i.e. D sub-vector words, completing the construction of the second index layer;
Finally, according to the hierarchical index, the inverted file is generated for the semantic fusion fingerprints of the test corpus: on the first layer, each fingerprint is quantized onto its nearest fingerprint word according to their distance, and its fingerprint residual is computed; the residual is split into L sub-vectors, and the distance from each sub-vector to the sub-vector words is computed to obtain the ID of the nearest sub-vector word. In the hierarchical index, the text information is stored under the corresponding fingerprint-word entry of the fingerprint dictionary; the stored content comprises the text ID and the sub-vector word ID corresponding to each sub-vector.
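The two-layer quantization above resembles an inverted file with residual sub-vector codebooks; a minimal sketch under toy codebooks (the coarse words and sub-vector words below are hypothetical stand-ins for the K-means-trained codebooks of the patent):

```python
import numpy as np
from collections import defaultdict

def build_index(fingerprints, coarse_words, sub_words, L):
    """Quantize each fingerprint to its nearest coarse word (layer 1),
    split the residual into L sub-vectors and quantize each against the
    sub-vector codebook (layer 2); store (doc_id, sub-word ids) in an
    inverted list keyed by the coarse-word id."""
    inverted = defaultdict(list)
    for doc_id, fp in enumerate(fingerprints):
        c = int(np.argmin(np.linalg.norm(fp - coarse_words, axis=1)))
        residual = fp - coarse_words[c]
        sub_ids = [int(np.argmin(np.linalg.norm(s - sub_words, axis=1)))
                   for s in np.split(residual, L)]
        inverted[c].append((doc_id, sub_ids))
    return inverted

coarse = np.array([[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]])
subs = np.array([[0.0, 0.0], [1.0, 1.0]])  # D = 2 sub-vector words
fps = [np.array([0.9, 1.1, 0.1, 0.0]), np.array([10.0, 10.0, 11.0, 11.0])]
index = build_index(fps, coarse, subs, L=2)
```

At query time only the inverted list of the query's coarse word needs scanning, which is where the efficiency gain over exhaustive fingerprint comparison comes from.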
Preferably, step 5 proceeds as follows:
For global fingerprint similarity comparison, the similarity distance between the semantic fusion fingerprint of the text to be compared and the semantic fusion fingerprints quantized onto the same fingerprint word of the index is computed first, measured by the asymmetric distance: the nearest fingerprint word is selected, the residual between the query's semantic fusion fingerprint and that fingerprint word is computed, and the residual is split to produce {v_1, v_2, ..., v_L}; the distance between each sub-vector and every sub-vector word is then computed, generating the corresponding distance matrix. The global distance between the query's fingerprint and the i-th semantic fusion text fingerprint quantized onto the same fingerprint word is computed as:
where v_{q,j} denotes the j-th sub-vector of the text to be compared, and v_{id(i),j} denotes the sub-vector word corresponding to the j-th sub-vector of the i-th semantic fusion text fingerprint quantized onto that fingerprint word;
The results are sorted by the obtained similarity distances, and the 10 text fingerprints with the smallest distances are chosen as the candidate set; the similarity between the local semantic features of the text to be compared and those of the i-th text in the candidate set is then computed:
where d_t and d_i denote the numbers of semantic features of the text to be compared and of the i-th candidate text, respectively, and f_t^q and f_j^i denote the t-th semantic feature of the text to be compared and the j-th semantic feature of the i-th candidate text;
The similarity distance between two texts is the weighted sum of the global similarity and the local similarity:
where α is a weight factor with 0 < α < 1. The final similar texts are obtained from the inter-text similarity distance, and the similar local content is obtained from the local fingerprint similarity, yielding the multi-granularity similarity comparison result.
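The elided distance formulas can be illustrated under stated assumptions: the global distance is read as the sum over sub-vectors of the distance between the un-quantized query sub-vector and the document's codebook word (the asymmetric distance), the local similarity as an average best-match distance between sentence features, and the final score as the α-weighted combination. All three readings are sketches, not the patent's exact formulas:

```python
import numpy as np

def global_distance(query_subs, doc_sub_ids, sub_words):
    """Asymmetric distance: the query residual stays un-quantized, each
    document sub-vector is looked up from the codebook, so the global
    distance is sum_j ||v_{q,j} - word(id_j)||."""
    return sum(np.linalg.norm(q - sub_words[i])
               for q, i in zip(query_subs, doc_sub_ids))

def local_similarity(query_feats, doc_feats):
    """For each query sentence feature, its best (smallest) distance to any
    document feature, averaged over the query features."""
    return float(np.mean([min(np.linalg.norm(q - d) for d in doc_feats)
                          for q in query_feats]))

def combined_distance(d_global, d_local, alpha=0.6):
    """Weighted combination, alpha in (0, 1) as in the text."""
    return alpha * d_global + (1 - alpha) * d_local

sub_words = np.array([[0.0, 0.0], [1.0, 1.0]])
q_subs = [np.array([1.0, 1.0]), np.array([0.0, 0.0])]
d_exact = global_distance(q_subs, [1, 0], sub_words)   # codewords match exactly
d_swap = global_distance(q_subs, [0, 1], sub_words)    # codewords swapped
loc = local_similarity([np.array([0.0, 0.0])],
                       [np.array([0.0, 0.0]), np.array([5.0, 5.0])])
d_total = combined_distance(1.0, 2.0, alpha=0.6)
```

Because only the candidate set survives the global stage, the more expensive feature-by-feature local comparison runs over at most 10 texts, which is what makes the multi-granularity comparison tractable.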
The beneficial effects of the present invention are:
The present invention models word-vector representations by jointly exploiting multi-dimensional semantic correlations, fully mining the semantic information between words and obtaining word vectors that perform better on semantic correlation; features are extracted per sentence, the unit with richer and more complete semantic information, and characterized with multiple weights; statistical-learning methods mine the corpus statistics and distribution to achieve a finer division of the feature space, after which compact, highly discriminative text fingerprints are generated by multi-feature aggregation, effectively improving the descriptive power and discrimination of text fingerprints. Following a top-down strategy, text similarity is compared using the semantic fusion fingerprint and the local semantic features; by building a hierarchical index, multi-granularity comparison from the global level down to the local level can be performed quickly and efficiently. Experiments verify that the semantic fusion fingerprint of the invention effectively improves the accuracy of global text similarity comparison and, thanks to the hierarchical index, effectively improves the efficiency of text similarity comparison. The invention has good scalability, is suitable for similarity comparison over massive texts, satisfies users' demands for efficient multi-granularity similarity comparison, and greatly improves user experience.
Brief description of the drawings
Fig. 1 is the framework diagram of multi-granularity text similarity comparison in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the word-vector representation model in an embodiment of the present invention;
Fig. 3 is the block diagram of semantic fusion fingerprint generation in an embodiment of the present invention, with the semantic feature space divided into 8 sub-partitions;
Fig. 4 is the construction-process diagram of the hierarchical-index inverted file in an embodiment of the present invention;
Fig. 5 is the flow chart of multi-granularity text similarity comparison in an embodiment of the present invention;
Fig. 6 is the text comparison result diagram in an embodiment of the present invention.
Specific embodiments
The invention is further described below with reference to an embodiment and the accompanying drawings:
Embodiment:
As shown in Fig. 1, the multi-granularity text similarity comparison method based on semantic fusion fingerprints of the present invention comprises the following steps:
Step 1, word-vector representation training: jointly model word-vector learning with integrated multi-dimensional semantic correlations. Using a public corpus, text-to-word mappings represent the horizontal (co-occurrence) relationship between words within a context, while word-to-context mappings represent the vertical relationship between words that share similar contexts; synonym and antonym information is added to the model. Word-vector training is performed by unsupervised learning, so that the resulting word vectors perform better on semantic-correlation and synonym/antonym identification tasks. Fig. 2 shows the word-vector representation model.
In this step, the integrated multi-dimensional semantic-correlation joint modeling represents each word as a K-dimensional vector; the objective function of word-vector learning is expressed as follows:
Here, N denotes the number of texts in the training corpus; d_n denotes the n-th text; w_i^n denotes the word vector of the i-th word in the n-th text; x_i^n denotes the vector obtained by summing the context of w_i^n; p(w_i^n | x_i^n) and p(w_i^n | d_n) denote, respectively, the probability of the word w_i^n appearing in its context and the probability of it appearing in the text; SYN_i^n and ANT_i^n denote the synonym set and antonym set of w_i^n; p(w_i^n | u) denotes the probability of w_i^n appearing given a known synonym or antonym u; α is a weight factor with 0 < α < 1. Training maximizes the objective function and is solved by stochastic gradient ascent.
Step 2, semantic feature extraction: taking the sentence, which is richer in semantic information, as the basic unit, each sentence is preprocessed (word segmentation, stop-word removal); each token is then represented with the word-vector representation method described in step 1 above, characterized with multiple weights including term frequency and part of speech, and the semantic feature is computed as the weighted sum of the token word vectors, realizing semantic feature extraction.
This step proceeds as follows:
First, the text is preprocessed: it is segmented at punctuation marks to obtain a sentence set {S_1, S_2, ..., S_M}, where M is the number of sentences in the text; each sentence is word-segmented and stop words are removed, giving {c_1, c_2, ..., c_T}, where T is the number of tokens in the sentence. The token weight is characterized as the product of the term-frequency weight ω_f and the part-of-speech weight ω_n, i.e. ω_c = ω_f × ω_n; regarding part of speech, nouns carry the highest weight, verbs the second, adjectives the third, and all others the lowest;
Then, each token is represented by its word vector, and the semantic feature is expressed as the weighted sum of the token word vectors:
Here, f_{i,k} denotes the k-th dimension of the feature of the i-th sentence; w_{i,j,k} and ω_{i,j} denote, respectively, the k-th component of the word vector of the j-th token of the i-th sentence and the corresponding token weight; I(·) is an indicator function whose value is 1 when w_{i,j,k} > 0 and −1 otherwise. The text is then represented as the set of M semantic features {f_1, f_2, ..., f_M}.
Step 3, multi-feature aggregation: as shown in Fig. 3, the semantic features of the training corpus are clustered using their statistics and distribution characteristics, dividing the semantic feature space into multiple sub-partitions and achieving a finer division of the feature space. Each semantic feature of a text is assigned, by its distance to the cluster centres, to the nearest sub-partition; within each sub-partition the residuals between the assigned semantic features and the cluster centre are summed, and the residual sums of all sub-partitions are concatenated to generate the semantic fusion fingerprint.
First, the semantic features of the training corpus are clustered into L classes with the K-means algorithm, giving cluster centres C = {μ_1, μ_2, ..., μ_L}, each cluster corresponding to one sub-partition of the semantic feature space;
Then, using the statistics and distribution of the semantic features relative to the semantic feature space, the semantic fusion text fingerprint is generated: the distance between each semantic feature f_i of the text and the cluster centre of every sub-partition is computed, and the feature is assigned to the nearest sub-partition:
Id(f_i) = arg min_j ||f_i - μ_j||_2,  i = 1, 2, ..., M,  j = 1, 2, ..., L
where Id(f_i) denotes the index of the sub-partition to which the semantic feature is assigned, and μ_j denotes the cluster centre of the j-th sub-partition;
Finally, the sum of the residuals between the semantic features belonging to the same sub-partition and its cluster centre is computed:
Here, f_j : Id(f_j) = i denotes a semantic feature f_j assigned to the i-th sub-partition. The residual sums of all sub-partitions are concatenated into a K × L-dimensional vector V_d = [v_1, v_2, ..., v_L], which is the semantic fusion fingerprint; a text is ultimately represented as one semantic fusion fingerprint V_d plus M semantic features {f_1, f_2, ..., f_M}.
By assigning the semantic features of the text to be processed to their nearest sub-partitions, the distribution of the text's semantic features relative to the corpus's semantic features is obtained; the residual sums within the sub-partitions are concatenated into a K × L-dimensional vector, the semantic fusion fingerprint. Compared with traditional fingerprint algorithms, which in effect quantize the whole text feature corpus as one big cluster centred at the origin, the method of the invention partitions the feature space more finely; meanwhile, word-vector representations that better characterize multi-dimensional semantic correlations replace word-hash representations, so the descriptive power of the text fingerprint is effectively improved.
Step 4, hierarchical index construction: cluster the semantic fusion fingerprints of the training corpus to form the first index layer; then split each fingerprint residual into sub-vectors and cluster them again to build the second index layer, obtaining the hierarchical index. All semantic fusion fingerprints of the test corpus are quantized onto the hierarchical index, generating the corresponding inverted file.
This step proceeds as follows:
First, the semantic fusion fingerprints of the training corpus are clustered with the K-means algorithm, and the resulting cluster centres serve as semantic fusion fingerprint words, constituting the first index layer;
Then, each semantic fusion fingerprint of the training corpus is quantized onto its nearest fingerprint word according to their distance, and its difference from that fingerprint word is computed as the fingerprint residual. Each fingerprint residual is split evenly into L sub-vectors of K dimensions; K-means is applied to the sub-vectors to obtain D cluster centres, i.e. D sub-vector words, completing the construction of the second index layer;
Finally, as shown in Fig. 4, according to the hierarchical index, the inverted file is generated for the semantic fusion fingerprints of the test corpus: on the first layer, each fingerprint is quantized onto its nearest fingerprint word according to their distance, and its fingerprint residual is computed; the residual is split into L sub-vectors, and the distance from each sub-vector to the sub-vector words is computed to obtain the ID of the nearest sub-vector word. In the hierarchical index, the text information is stored under the corresponding fingerprint-word entry of the fingerprint dictionary; the stored content comprises the text ID and the sub-vector word ID corresponding to each sub-vector.
Step 5: as shown in figure 5, similarity calculation: according to the more granularity alignment algorithms indexed based on level, using from upper Calculation under and first calculates the global similitude of text in text and text library to be compared, when the similitude is greater than setting Threshold value when, be added into Similar Text alternative collection;The local similarity with text in alternative text set is calculated later, in turn Final Similar Text and its specific local Similar content are obtained, the more granularity similarity comparison results of text are obtained.
The specific method is as follows for this step:
When carrying out global fingerprint similarity comparison, calculating text semantics fusion fingerprint and quantization to the same finger of index first The similarity distance of the semantics fusion fingerprint of line word, is measured using non symmetrical distance, is selected apart from nearest fingerprint Word calculates the fingerprint surplus of the semantics fusion fingerprint and corresponding fingerprint word, is split to the semantics fusion fingerprint surplus Generate { v1,v2,...,vL};The distance for calculating each subvector and each subvector word again, generates corresponding distance matrix; Text fingerprints to be compared and quantization are calculated to the global distance between i-th of semantics fusion text fingerprints on same fingerprint word
Wherein, vq,jIndicate j-th of subvector of text to be compared, vid(i),jIndicate the finger that text to be compared quantifies The corresponding subvector word of j-th of subvector of i-th of semantics fusion text fingerprints on line word;
It is ranked up according to obtained similarity distance, chooses preceding 10 similarities apart from minimum text fingerprints as standby Selected works;The similitude of the local semantic feature of i-th of text in text to be compared and alternative collection is calculated again
Wherein, dtAnd diRespectively indicate the semantic feature number of i-th of text in text to be compared and alternative collection, ft qWith Respectively indicate j-th of semantic feature of i-th of text in t-th of the semantic feature and alternative collection of text to be measured;
The similarity distance between texts is the weighted sum of the global similarity and the local similarity:

Sim = α · Sim_global + (1 − α) · Sim_local

where α is a weight factor, 0 < α < 1. The final similar texts are obtained from the similarity distance between texts, and similar local content can meanwhile be obtained from the local fingerprint similarity, yielding the multi-granularity text similarity comparison result.
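The top-down comparison described above (asymmetric subvector distances followed by a weighted combination of global and local scores) can be sketched as follows. This is a minimal illustration, not the patented implementation; the function and parameter names (`global_distance`, `combined_similarity`, `alpha`) and the use of squared Euclidean distances are assumptions:

```python
import numpy as np

def global_distance(query_subvectors, subvector_words, assignment):
    """Asymmetric distance: the query's residual subvectors are compared
    against the quantized subvector words of a stored fingerprint.
    assignment[j] holds the subvector-word ID for the j-th subvector."""
    return sum(
        np.sum((q - subvector_words[j][assignment[j]]) ** 2)
        for j, q in enumerate(query_subvectors)
    )

def combined_similarity(sim_global, sim_local, alpha=0.5):
    """Weighted sum of global and local similarity, with 0 < alpha < 1."""
    return alpha * sim_global + (1 - alpha) * sim_local
```

A query would first rank stored fingerprints by `global_distance`, keep the closest candidates, and then blend in the sentence-level score with `combined_similarity`.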
To verify the effect of the invention, the SogouCS news corpus was selected as the test text library, containing 18 classes of news from the Sohu website, such as domestic, international, sports, society and entertainment, from between June and July 2012. 1,000 news items were chosen and modified manually, giving 5 approximate texts per news item (including the news item itself). The SogouCA news corpus was used as the training text set, for training the partition of the semantic feature space and the construction of the hierarchical index. In the experiment, 50,000 of the remaining SogouCA news items were randomly selected as a distractor set. The semantic features of all texts in the training and test sets were extracted in advance; the training set contains 14,744,203 semantic features. The semantic features of the training set were then randomly sampled and clustered with the K-Means algorithm, producing clusters of different sizes and hence semantic aggregation fingerprints of different dimensions. The semantic aggregation fingerprint is denoted SAF; the dictionary size L was set to 8, 16, 32, 64 and 128, and the word-vector dimension K to 8, 16, 32, 64, 128 and 256. Every text in the test set was tested as a text to be compared.
The comparison accuracy of the global text fingerprints is evaluated directly first. Table 1 below shows the similarity comparison accuracy of SAF under different parameters and of the method from the reference documents. Compared with the Simhash method of the reference documents, the semantic aggregation fingerprint proposed by the present invention increases the accuracy markedly when the word-vector dimension parameter K is set consistently.
Table 1
K Simhash L=8 L=16 L=32 L=64 L=128
8 39.00 92.22 92.54 92.86 92.56 93.14
16 78.68 92.96 94.02 94.48 95.38 94.52
32 88.64 93.84 94.94 96.16 96.82 96.80
64 90.42 94.76 96.32 97.48 97.64 97.98
128 91.00 95.60 97.22 97.84 98.68 98.14
256 91.00 96.16 97.88 98.18 98.92 97.86
When similarity comparison is performed with the multi-granularity comparison algorithm based on the hierarchical index, the comparison efficiency is improved by 87.93% relative to directly comparing text semantic features, and the method scales well. Figure 6 gives a sample of text comparison results; Table 2 shows part of the comparison results more clearly, with similarities kept to 8 decimal places. It can be seen that the invention correctly identifies sentences with partial content replaced or reordered, and with slight modifications of some conjunctions and keywords. When the keywords of a sentence are modified more heavily, the similarity between the sentences is low from the viewpoint of both sentence structure and semantics; the system reports a similarity of 0 between such sentences, which accords with the actual situation. The invention therefore effectively realizes multi-granularity text similarity comparison.
Table 2
The above embodiment is a preferred embodiment of the present invention and does not limit the technical solution of the present invention; any technical solution that can be realized on the basis of the above embodiment without creative work shall be regarded as falling within the scope of protection of the patent rights of the present invention.

Claims (6)

1. A multi-granularity text similarity comparison method based on semantic aggregation fingerprints, characterized by comprising the following steps:
Step 1: word-vector training: word vectors are learned by joint modelling that integrates multi-dimensional semantic relations, i.e., using an open corpus, the horizontal co-occurrence relation between words within a context is represented through the text-word and word-context mappings, the longitudinal relation between words with similar contexts is represented, and synonym and antonym information is added to the model; word-vector training is carried out by an unsupervised learning method, so that the trained word vectors perform better on semantic-relatedness and synonym-identification tasks;
Step 2: semantic feature extraction: taking the sentence, which carries richer semantic information, as the basic unit, each sentence is pre-processed by operations including word segmentation and stop-word removal; each segmented word is then represented by its word vector and characterized by multiple weights including term frequency and part of speech; the weighted sum of the word vectors of the segmented words realizes the semantic feature extraction;
Step 3: multi-feature aggregation: clustering is performed using the statistics and distribution characteristics of the semantic features in the training library, dividing the semantic feature space into multiple sub-regions and thereby achieving a finer partition of the feature space; according to the distances between the semantic features of a text and the cluster centres, each semantic feature is assigned to the nearest sub-region; the sum of the residuals between the semantic features and the cluster centre is computed in each sub-region, and the multiple residual sums of the sub-regions are concatenated to generate the semantic aggregation fingerprint;
Step 4: hierarchical index construction: the semantic aggregation fingerprints in the training library are clustered to form the first-layer index; each fingerprint residual is then split, and the subvectors obtained from the split are clustered again to complete the construction of the second-layer index, yielding the hierarchical index; all semantic aggregation fingerprints of the texts in the test text library are quantized onto the hierarchical index to generate the corresponding inverted file;
Step 5: similarity calculation: following the multi-granularity comparison algorithm based on the hierarchical index, a top-down calculation is used: the global similarity between the text to be compared and each text in the text library is computed first, and when this similarity exceeds a set threshold the text is added to the candidate set of similar texts; the local similarity with each text in the candidate set is then computed, yielding the final similar texts and their specific locally similar content, i.e., the multi-granularity text similarity comparison result.
2. The multi-granularity text similarity comparison method based on semantic aggregation fingerprints according to claim 1, characterized in that: in Step 1, each word is represented as a K-dimensional vector by the joint modelling integrating multi-dimensional semantic relations, and the objective function of word-vector learning is expressed as follows:
where N denotes the number of texts in the training library, d_n denotes the n-th text, w_i^n denotes the word vector of the i-th word in the n-th text, s_i^n denotes the vector of the context of word w_i^n obtained by summation, P(w_i^n | s_i^n) and P(w_i^n | d_n) respectively denote the probability of word w_i^n appearing in the context and the probability of word w_i^n appearing in the text, SYN_i^n and ANT_i^n respectively denote the synonym set and the antonym set of the word, P(w_i^n | u) denotes the probability of word w_i^n appearing given a known synonym or antonym u, and α is a weight factor, 0 < α < 1; training maximizes the objective function and is solved by stochastic gradient ascent.
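The objective function itself is not reproduced in this text. One plausible LaTeX rendering consistent with the definitions above, with the combination, sign and weighting of the synonym and antonym terms assumed for illustration, is:

```latex
\max \sum_{n=1}^{N}\sum_{i}\Big[
  \log P\!\left(w_i^{n}\mid s_i^{n}\right)
  + \log P\!\left(w_i^{n}\mid d_n\right)
  + \alpha \sum_{u\in \mathrm{SYN}_i^{n}} \log P\!\left(w_i^{n}\mid u\right)
  - \alpha \sum_{u\in \mathrm{ANT}_i^{n}} \log P\!\left(w_i^{n}\mid u\right)
\Big]
```

The first two terms capture the context and document relations, while the last two reward agreement with synonyms and penalize agreement with antonyms, weighted by α.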
3. The multi-granularity text similarity comparison method based on semantic aggregation fingerprints according to claim 1, characterized in that the specific method of Step 2 is as follows:
The text is pre-processed first: it is split on punctuation marks, giving a sentence set {S_1, S_2, ..., S_M}, where M denotes the number of sentences in the text; each sentence is segmented into words and stop words are removed, giving a representation {c_1, c_2, ..., c_T}, where T denotes the number of segmented words in the sentence; each segmented word is characterized by the product of its term-frequency weight ω_f and its part-of-speech weight ω_ni, i.e. ω_c = ω_f × ω_ni; among the part-of-speech weights, nouns have the highest weight, verbs the second, adjectives the third, and all others the lowest;
Then each segmented word is represented by its word vector, and the semantic feature is expressed as the weighted sum of the word vectors of the segmented words:

f_{i,k} = Σ_{j=1}^{T} ω_{i,j} · I(w_{i,j,k})

where f_{i,k} denotes the value of the k-th dimension of the i-th sentence, w_{i,j,k} and ω_{i,j} respectively denote the value of the k-th dimension of the j-th segmented word of the i-th sentence and the weight of that word; I(·) is the indicator function, equal to 1 when w_{i,j,k} > 0 and −1 otherwise; the text is finally represented as a set of M semantic features {f_1, f_2, ..., f_M}.
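The feature construction in this claim (a weight-signed sum over word-vector dimensions) can be sketched as follows; reading the definitions above as f_{i,k} = Σ_j ω_{i,j}·I(w_{i,j,k}) is an interpretation, and the names `sentence_feature`, `word_vectors` and `weights` are illustrative:

```python
import numpy as np

def sentence_feature(word_vectors, weights):
    """word_vectors: (T, K) array, one row per segmented word;
    weights: (T,) array of per-word weights (term frequency x POS weight).
    Each output dimension sums the word weights signed by I(w > 0)."""
    signs = np.where(word_vectors > 0, 1.0, -1.0)  # I(.) = 1 if > 0, else -1
    return (weights[:, None] * signs).sum(axis=0)  # (K,) semantic feature
```

A text would then be represented as the list of such features, one per sentence.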
4. The multi-granularity text similarity comparison method based on semantic aggregation fingerprints according to claim 1, characterized in that the specific method of Step 3 is as follows:
First, using the K-means algorithm, the semantic features in the training text library are clustered into L classes, with the cluster centres representing the clusters as C = {μ_1, μ_2, ..., μ_L}; each cluster corresponds to a sub-partition of the semantic feature space;
Then, using the statistics and distribution information of the semantic features relative to the semantic feature space, the semantic aggregation text fingerprint is generated: the distance between each semantic feature f_i of the text and the cluster centre of each sub-partition is computed, and the feature is assigned to the nearest sub-partition:

Id(f_i) = arg min_j ||f_i − μ_j||², i = 1, 2, ..., M, j = 1, 2, ..., L

where Id(f_i) denotes the index of the sub-partition to which the semantic feature is assigned, and μ_j denotes the cluster centre of the j-th sub-partition;
Finally, the sum of the differences between the semantic features belonging to the same sub-partition and its cluster centre is computed:

v_i = Σ_{j: Id(f_j)=i} (f_j − μ_i)

where f_j: Id(f_j) = i denotes the j-th semantic feature assigned to the i-th sub-partition; the residual sums of the sub-partitions are concatenated to generate a K × L-dimensional vector V_d = [v_1, v_2, ..., v_L], which is the semantic aggregation fingerprint; the text is ultimately represented as one semantic aggregation fingerprint V_d together with M semantic features {f_1, f_2, ..., f_M}.
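The residual aggregation in this claim follows a VLAD-style scheme (assign each feature to its nearest centre, accumulate the residuals per centre, concatenate). A minimal sketch under that reading, with all names illustrative:

```python
import numpy as np

def semantic_aggregation_fingerprint(features, centers):
    """features: (M, K) semantic features of one text;
    centers: (L, K) cluster centres of the sub-partitions.
    Returns the K*L-dimensional fingerprint V_d = [v_1, ..., v_L]."""
    L, K = centers.shape
    fingerprint = np.zeros((L, K))
    # squared distances of every feature to every centre, then assignment
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)
    # accumulate residuals f - mu in the assigned sub-partition
    for f, i in zip(features, ids):
        fingerprint[i] += f - centers[i]
    return fingerprint.ravel()
```

The centres themselves would come from K-means over the training library's features, as described in the claim.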
5. The multi-granularity text similarity comparison method based on semantic aggregation fingerprints according to claim 1, characterized in that the specific method of Step 4 is as follows:
First, the semantic aggregation fingerprints in the training text library are clustered with the K-means algorithm; the resulting cluster centres serve as the semantic aggregation fingerprint words and constitute the first-layer index;
Then each semantic aggregation fingerprint in the training text library is quantized, according to its distance from the fingerprint words, onto the nearest fingerprint word, and its difference from that fingerprint word is computed as the fingerprint residual; each fingerprint residual is split evenly into L K-dimensional subvectors; K-means is applied to the subvectors, giving D cluster centres, i.e. D subvector words, which completes the construction of the second-layer index;
Finally, the inverted file is generated for the semantic aggregation fingerprints in the test text library according to the hierarchical index: on the first-layer index, each semantic aggregation fingerprint is quantized onto the nearest fingerprint word according to its distance from the fingerprint words, and its fingerprint residual is computed; the fingerprint residual is split into L subvectors, the distance between each subvector and the subvector words is computed, and the ID of the nearest subvector word is obtained; in the hierarchical index, the text information is stored under the corresponding fingerprint-word entry of the fingerprint dictionary, the stored content comprising the text ID and the subvector-word ID corresponding to each subvector.
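Quantizing one fingerprint onto the two-layer index and posting it into the inverted file can be sketched as follows; the dict-based file layout and all function and variable names are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict

def post_to_inverted_file(inverted, text_id, fingerprint,
                          fingerprint_words, subvector_words, L):
    """fingerprint: (K*L,) vector; fingerprint_words: (W, K*L) first layer;
    subvector_words: list of L arrays, each (D, K), the second layer;
    inverted: dict mapping fingerprint-word ID -> list of postings."""
    # first layer: nearest fingerprint word and the fingerprint residual
    word_id = int(((fingerprint_words - fingerprint) ** 2).sum(1).argmin())
    residual = fingerprint - fingerprint_words[word_id]
    # second layer: split the residual into L subvectors, quantize each
    sub_ids = [
        int(((subvector_words[j] - sub) ** 2).sum(1).argmin())
        for j, sub in enumerate(np.split(residual, L))
    ]
    # posting: text ID plus the subvector-word ID of each subvector
    inverted[word_id].append((text_id, sub_ids))
    return word_id, sub_ids
```

At query time the same quantization path selects the posting list to scan, so only fingerprints sharing the query's fingerprint word are compared.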
6. The multi-granularity text similarity comparison method based on semantic aggregation fingerprints according to claim 1, characterized in that the specific method of Step 5 is as follows:
When performing the global fingerprint similarity comparison, the similarity distance between the semantic aggregation fingerprint of the text to be compared and the semantic aggregation fingerprints quantized onto the same index entry is computed first, measured with an asymmetric distance: the nearest fingerprint word is selected, the fingerprint residual between the semantic aggregation fingerprint and that fingerprint word is computed, and the residual is split to generate {v_1, v_2, ..., v_L}; the distance between each subvector and each subvector word is then computed, producing the corresponding distance matrix; the global distance between the fingerprint of the text to be compared and the i-th semantic aggregation text fingerprint quantized onto the same fingerprint word is computed, where v_{q,j} denotes the j-th subvector of the text to be compared, and v_{id(i),j} denotes the subvector word corresponding to the j-th subvector of the i-th semantic aggregation text fingerprint on the fingerprint word onto which the text to be compared is quantized;
The resulting similarity distances are sorted, and the 10 text fingerprints with the smallest distances are chosen as the candidate set; the similarity between the local semantic features of the text to be compared and those of the i-th text in the candidate set is then computed, where d_t and d_i denote the numbers of semantic features of the text to be compared and of the i-th candidate text respectively, and f_t^q and f_j^i denote the t-th semantic feature of the text to be compared and the j-th semantic feature of the i-th candidate text;
The similarity distance between texts is the weighted sum of the global similarity and the local similarity, where α is a weight factor, 0 < α < 1; the final similar texts are obtained from the similarity distance between texts, and similar local content can meanwhile be obtained from the local fingerprint similarity, yielding the multi-granularity text similarity comparison result.
CN201910441282.1A 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints Active CN110321925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910441282.1A CN110321925B (en) 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910441282.1A CN110321925B (en) 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Publications (2)

Publication Number Publication Date
CN110321925A true CN110321925A (en) 2019-10-11
CN110321925B CN110321925B (en) 2022-11-18

Family

ID=68119119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910441282.1A Active CN110321925B (en) 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Country Status (1)

Country Link
CN (1) CN110321925B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750616A (en) * 2019-10-16 2020-02-04 网易(杭州)网络有限公司 Retrieval type chatting method and device and computer equipment
CN110909550A (en) * 2019-11-13 2020-03-24 北京环境特性研究所 Text processing method and device, electronic equipment and readable storage medium
CN110956039A (en) * 2019-12-04 2020-04-03 中国太平洋保险(集团)股份有限公司 Text similarity calculation method and device based on multi-dimensional vectorization coding
CN110990538A (en) * 2019-12-20 2020-04-10 深圳前海黑顿科技有限公司 Semantic fuzzy search method based on sentence-level deep learning language model
CN111381191A (en) * 2020-05-29 2020-07-07 支付宝(杭州)信息技术有限公司 Method for synonymy modifying text and determining text creator
CN111461109A (en) * 2020-02-27 2020-07-28 浙江工业大学 Method for identifying documents based on environment multi-type word bank
CN111694952A (en) * 2020-04-16 2020-09-22 国家计算机网络与信息安全管理中心 Big data analysis model system based on microblog and implementation method thereof
CN111859635A (en) * 2020-07-03 2020-10-30 中国人民解放军海军航空大学航空作战勤务学院 Simulation system based on multi-granularity modeling technology and construction method
CN112287669A (en) * 2020-12-28 2021-01-29 深圳追一科技有限公司 Text processing method and device, computer equipment and storage medium
CN113111645A (en) * 2021-04-28 2021-07-13 东南大学 Media text similarity detection method
CN113313180A (en) * 2021-06-04 2021-08-27 太原理工大学 Remote sensing image semantic segmentation method based on deep confrontation learning
CN115935195A (en) * 2022-11-08 2023-04-07 华院计算技术(上海)股份有限公司 Text matching method and device, computer readable storage medium and terminal
CN116129146A (en) * 2023-03-29 2023-05-16 中国工程物理研究院计算机应用研究所 Heterogeneous image matching method and system based on local feature consistency

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090119572A1 (en) * 2007-11-02 2009-05-07 Marja-Riitta Koivunen Systems and methods for finding information resources
US20150120720A1 (en) * 2012-06-22 2015-04-30 Krishna Kishore Dhara Method and system of identifying relevant content snippets that include additional information
CN107423729A (en) * 2017-09-20 2017-12-01 湖南师范大学 A kind of remote class brain three-dimensional gait identifying system and implementation method towards under complicated visual scene
CN108399163A (en) * 2018-03-21 2018-08-14 北京理工大学 Bluebeard compound polymerize the text similarity measure with word combination semantic feature
CN108595706A (en) * 2018-05-10 2018-09-28 中国科学院信息工程研究所 A kind of document semantic representation method, file classification method and device based on theme part of speech similitude
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090119572A1 (en) * 2007-11-02 2009-05-07 Marja-Riitta Koivunen Systems and methods for finding information resources
US20150120720A1 (en) * 2012-06-22 2015-04-30 Krishna Kishore Dhara Method and system of identifying relevant content snippets that include additional information
CN107423729A (en) * 2017-09-20 2017-12-01 湖南师范大学 A kind of remote class brain three-dimensional gait identifying system and implementation method towards under complicated visual scene
CN108399163A (en) * 2018-03-21 2018-08-14 北京理工大学 Bluebeard compound polymerize the text similarity measure with word combination semantic feature
CN108595706A (en) * 2018-05-10 2018-09-28 中国科学院信息工程研究所 A kind of document semantic representation method, file classification method and device based on theme part of speech similitude
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
JADALLA,A: ""A fingerprinting based plagiarism detection system for Arabic text based documents"", 《PROCEEDINGS OF THE 2012 8TH INTERNATIONAL CONFERENCE ON COMPUTING TECHNOLOGY AND INFORMATION MANAGEMENT》 *
MOHAMED ELHOSENY: ""FPSS: Fingerprint-based semantic similarity detection in big data environment"", 《2017 EIGHTH INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND INFORMATION SYSTEMS (ICICIS)》 *
WEN XIA: ""Similarity and Locality Based Indexing for High Performance Data Deduplication"", 《IEEE TRANSACTIONS ON COMPUTERS》 *
刘宏哲: "Research on Text Semantic Similarity Calculation Methods", China Doctoral Dissertations Full-text Database, Information Science and Technology *
刘礼芳: "Social-Network-Based WEB Image Semantic Annotation and Aggregation", China Master's Theses Full-text Database, Information Science and Technology *
姜雪: "Research on Simhash-Based Text Similarity Detection Algorithms", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750616A (en) * 2019-10-16 2020-02-04 网易(杭州)网络有限公司 Retrieval type chatting method and device and computer equipment
CN110909550A (en) * 2019-11-13 2020-03-24 北京环境特性研究所 Text processing method and device, electronic equipment and readable storage medium
CN110909550B (en) * 2019-11-13 2023-11-03 北京环境特性研究所 Text processing method, text processing device, electronic equipment and readable storage medium
CN110956039A (en) * 2019-12-04 2020-04-03 中国太平洋保险(集团)股份有限公司 Text similarity calculation method and device based on multi-dimensional vectorization coding
CN110990538A (en) * 2019-12-20 2020-04-10 深圳前海黑顿科技有限公司 Semantic fuzzy search method based on sentence-level deep learning language model
CN110990538B (en) * 2019-12-20 2022-04-01 深圳前海黑顿科技有限公司 Semantic fuzzy search method based on sentence-level deep learning language model
CN111461109A (en) * 2020-02-27 2020-07-28 浙江工业大学 Method for identifying documents based on environment multi-type word bank
CN111461109B (en) * 2020-02-27 2023-09-15 浙江工业大学 Method for identifying documents based on environment multi-class word stock
CN111694952A (en) * 2020-04-16 2020-09-22 国家计算机网络与信息安全管理中心 Big data analysis model system based on microblog and implementation method thereof
CN111381191B (en) * 2020-05-29 2020-09-01 支付宝(杭州)信息技术有限公司 Method for synonymy modifying text and determining text creator
CN111381191A (en) * 2020-05-29 2020-07-07 支付宝(杭州)信息技术有限公司 Method for synonymy modifying text and determining text creator
CN111859635A (en) * 2020-07-03 2020-10-30 中国人民解放军海军航空大学航空作战勤务学院 Simulation system based on multi-granularity modeling technology and construction method
CN112287669A (en) * 2020-12-28 2021-01-29 深圳追一科技有限公司 Text processing method and device, computer equipment and storage medium
CN113111645A (en) * 2021-04-28 2021-07-13 东南大学 Media text similarity detection method
CN113111645B (en) * 2021-04-28 2024-02-06 东南大学 Media text similarity detection method
CN113313180A (en) * 2021-06-04 2021-08-27 太原理工大学 Remote sensing image semantic segmentation method based on deep confrontation learning
CN115935195B (en) * 2022-11-08 2023-08-08 华院计算技术(上海)股份有限公司 Text matching method and device, computer readable storage medium and terminal
CN115935195A (en) * 2022-11-08 2023-04-07 华院计算技术(上海)股份有限公司 Text matching method and device, computer readable storage medium and terminal
CN116129146A (en) * 2023-03-29 2023-05-16 中国工程物理研究院计算机应用研究所 Heterogeneous image matching method and system based on local feature consistency
CN116129146B (en) * 2023-03-29 2023-09-01 中国工程物理研究院计算机应用研究所 Heterogeneous image matching method and system based on local feature consistency

Also Published As

Publication number Publication date
CN110321925B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN110321925A (en) A kind of more granularity similarity comparison methods of text based on semantics fusion fingerprint
CN106383877B (en) Social media online short text clustering and topic detection method
Wang et al. A hybrid document feature extraction method using latent Dirichlet allocation and word2vec
CN103514183B (en) Information search method and system based on interactive document clustering
CN102207945B (en) Knowledge network-based text indexing system and method
Ni et al. Short text clustering by finding core terms
CN108197111A (en) A kind of text automatic abstracting method based on fusion Semantic Clustering
CN111291188B (en) Intelligent information extraction method and system
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN108268449A (en) A kind of text semantic label abstracting method based on lexical item cluster
CN111368077A (en) K-Means text classification method based on particle swarm location updating thought wolf optimization algorithm
CN106934005A (en) A kind of Text Clustering Method based on density
CN110765755A (en) Semantic similarity feature extraction method based on double selection gates
CN106599072B (en) Text clustering method and device
CN112926340B (en) Semantic matching model for knowledge point positioning
CN112883722B (en) Distributed text summarization method based on cloud data center
Odeh et al. Arabic text categorization algorithm using vector evaluation method
Naeem et al. Development of an efficient hierarchical clustering analysis using an agglomerative clustering algorithm
CN114997288A (en) Design resource association method
CN115248839A (en) Knowledge system-based long text retrieval method and device
Yin et al. Sentence-bert and k-means based clustering technology for scientific and technical literature
Hassan et al. Automatic document topic identification using wikipedia hierarchical ontology
Ding et al. The research of text mining based on self-organizing maps
Zhang et al. Extractive Document Summarization based on hierarchical GRU
Yang et al. Research on improvement of text processing and clustering algorithms in public opinion early warning system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant