CN110321925A - Text multi-granularity similarity comparison method based on semantic aggregated fingerprints - Google Patents
Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
- Publication number
- CN110321925A CN110321925A CN201910441282.1A CN201910441282A CN110321925A CN 110321925 A CN110321925 A CN 110321925A CN 201910441282 A CN201910441282 A CN 201910441282A CN 110321925 A CN110321925 A CN 110321925A
- Authority
- CN
- China
- Prior art keywords
- text
- word
- fingerprint
- semantic feature
- semantics
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text multi-granularity similarity comparison method based on semantic aggregated fingerprints, comprising the following steps: training of word vector representations; semantic feature extraction; multi-feature aggregation; hierarchical index construction; and similarity calculation. The invention models word vector representations by jointly exploiting multi-dimensional semantic correlations, fully mining the semantic information between words; features are extracted sentence by sentence and characterized with multiple weights, and statistical learning is used to mine the statistics and distribution of the text library, achieving a finer partition of the feature space; a compact, highly discriminative text fingerprint is then generated by multi-feature aggregation, which effectively improves the descriptive power and discrimination of text fingerprints. Following a top-down strategy, text similarity comparison is carried out with the semantic aggregated fingerprint and the local semantic features; by building a hierarchical index, multi-granularity similarity comparison from the global level to the local level can be realized quickly and efficiently. The method also has good scalability.
Description
Technical field
The present invention relates to a text similarity comparison method, and more particularly to a text multi-granularity similarity comparison method based on semantic aggregated fingerprints, belonging to the technical fields of pattern recognition and information processing.
Background technique
Two texts are approximate when the content and information they describe are similar or even identical. If one text is generated from another by modifying a small portion of its content through insertion, deletion, replacement and similar operations, the two texts are considered similar. The spread of approximate texts or web pages is usually undesirable, and with the surge of data the problems caused by approximate texts are becoming increasingly severe. Approximate text detection is therefore an important technique for reducing storage overhead, improving retrieval efficiency and data utilization, and preventing plagiarism and unauthorized copying.
Experts and scholars at home and abroad have proposed a variety of methods. Traditional text similarity comparison falls into two broad classes. One class is based on string comparison; the other is based on word frequency statistics and, on the basis of the vector space model, characterizes a text with a feature vector and measures the similarity between texts by the distance between vectors. The former can use strings of different granularities, such as strings at the sentence level or the paragraph level. However, since a text usually contains a large number of strings, string matching methods can hardly avoid poor real-time performance on massive long texts.
Among Shingle-based methods, one group treats a text as a set of Shingles, where a Shingle is a contiguous subsequence of the text; these methods, however, inevitably suffer from high computational cost and are essentially unable to handle massive data. Another group maps each word of a text to a simple hash value using an existing dictionary. Although this effectively reduces the cost of similarity calculation, the generated text fingerprints are unstable: when the vocabulary of the dictionary does not cover the words in the text, even a tiny change of the text causes the hash values to fluctuate.
The Simhash algorithm proposed by Charikar is currently regarded as the best and most effective algorithm for approximate text detection. Simhash is a probabilistic dimensionality-reduction method for high-dimensional data that maps a high-dimensional text feature vector to a fingerprint with a small, fixed number of bits. Most existing approximate text detection systems are built on Simhash. However, these methods focus only on the text itself and ignore useful information in the text library; moreover, Simhash generates a single fingerprint per text and therefore cannot support partial, local similarity comparison between two texts.
Summary of the invention
The object of the present invention is to solve the above problems by providing a text multi-granularity similarity comparison method based on semantic aggregated fingerprints.
The present invention achieves the above object through the following technical solution:
A text multi-granularity similarity comparison method based on semantic aggregated fingerprints, comprising the following steps:
Step 1, word vector representation training: word vectors are learned by jointly modeling multi-dimensional semantic correlations. Using a public corpus, text-to-word mappings represent the horizontal relationship of words that co-occur in the same context, word-to-context mappings represent the vertical relationship between words that share similar contexts, and synonym and antonym information is added to the model; word vector representations are trained by an unsupervised learning method, so that the resulting word vectors perform better on semantic relatedness and on synonym/antonym discrimination tasks;
Step 2, semantic feature extraction: taking the sentence, which carries richer semantic information, as the basic unit, each sentence is preprocessed by word segmentation and stop-word removal; each token is then represented by its word vector, tokens are weighted with multiple weights including word frequency and part of speech, and the weighted sum of the token word vectors is computed to obtain the semantic feature;
Step 3, multi-feature aggregation: the statistics and distribution of the semantic features in the training library are clustered, dividing the semantic feature space into multiple sub-partitions and thereby achieving a finer partition of the feature space; according to the distance between each semantic feature of the text and the cluster centers, the semantic feature is assigned to its nearest sub-partition, the sum of the residuals between the semantic features in each sub-partition and the cluster center is computed, and the residual sums of all sub-partitions are concatenated to generate the semantic aggregated fingerprint;
Step 4, hierarchical index construction: the semantic aggregated fingerprints in the training library are clustered to form the first-layer index; each fingerprint residual is then split into sub-vectors, which are clustered again to build the second-layer index, yielding the hierarchical index; all semantic aggregated fingerprints of the texts in the test text library are quantized onto the hierarchical index to generate the corresponding inverted file;
Step 5, similarity calculation: following the multi-granularity comparison algorithm based on the hierarchical index, a top-down computation is used: the global similarity between the text to be compared and the texts in the library is computed first, and a text is added to the candidate set of similar texts when this similarity exceeds a set threshold; the local similarity with the texts in the candidate set is then computed, giving the final similar texts and their specific locally similar content, i.e. the multi-granularity similarity comparison result.
Preferably, in step 1 the joint modeling of multi-dimensional semantic correlations represents each word as a K-dimensional vector, and the objective function of word vector learning is expressed as follows:
wherein N denotes the number of texts in the training library, d_n denotes the n-th text, w_i^n denotes the word vector of the i-th word of the n-th text, the summed context vector of w_i^n is obtained by adding up the word vectors of its context, the two conditional probabilities denote, respectively, the probability of the word w_i^n appearing in its context and the probability of it appearing in the text, SYN_i^n and ANT_i^n denote the synonym set and the antonym set of w_i^n, a further term denotes the probability of the word w_i^n appearing when a known synonym or antonym u is given, and α is a weight factor with 0 < α < 1; training is performed by maximizing the objective function and solved with stochastic gradient ascent.
Preferably, the specific method of step 2 is as follows:
The text is first preprocessed: it is split into sentences at punctuation marks, giving a sentence set {S_1, S_2, ..., S_M}, where M is the number of sentences in the text; each sentence is segmented into tokens and stop words are removed, giving {c_1, c_2, ..., c_T}, where T is the number of tokens in the sentence. The token weight is the product of the word frequency weight ω_f and the part-of-speech weight ω_ni, ω_c = ω_f × ω_ni; for the part-of-speech weight, nouns are weighted highest, verbs second, adjectives third, and all other parts of speech lowest.
Then each token is represented by its word vector, and the semantic feature is expressed as the weighted sum of the token word vectors:
wherein f_{i,k} denotes the value of the k-th dimension of the i-th sentence, w_{i,j,k} and ω_{i,j} respectively denote the value of the k-th dimension of the j-th token of the i-th sentence and the weight of that token; I(·) is an indicator function that takes the value 1 when w_{i,j,k} > 0 and −1 otherwise. The text is thus represented as a set of M semantic features {f_1, f_2, ..., f_M}.
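As an illustration only, the sketch below computes sentence-level semantic features in the form reconstructed from the variable definitions above, f_{i,k} = Σ_j I(w_{i,j,k}) · ω_{i,j}; the tokenizer, stop-word handling, word vectors, word-frequency weights and the concrete part-of-speech weight values are all placeholders, not values given by the patent.

```python
# Sketch of sentence-level semantic feature extraction.
# The feature formula f[i,k] = sum_j I(w[i,j,k]) * weight[i,j]
# (I(.) = +1 if the entry is positive, else -1) is reconstructed from the
# variable definitions in the text; POS weight values are assumed.
import numpy as np

POS_WEIGHT = {"noun": 1.0, "verb": 0.8, "adj": 0.6, "other": 0.4}  # assumed values

def sentence_feature(tokens, word_vectors, word_freq_weight, pos_of):
    """tokens: list of strings; word_vectors: dict token -> K-dim np.array;
    word_freq_weight: dict token -> float; pos_of: dict token -> POS tag."""
    K = len(next(iter(word_vectors.values())))
    feature = np.zeros(K)
    for tok in tokens:
        if tok not in word_vectors:
            continue
        w = word_vectors[tok]                                   # token word vector
        weight = word_freq_weight.get(tok, 1.0) * POS_WEIGHT.get(pos_of.get(tok, "other"), 0.4)
        feature += np.where(w > 0, 1.0, -1.0) * weight          # signed, weighted accumulation
    return feature

def text_features(sentences, word_vectors, word_freq_weight, pos_of):
    """A text is represented as the set {f_1, ..., f_M} of its sentence features."""
    return [sentence_feature(s, word_vectors, word_freq_weight, pos_of) for s in sentences]
```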
Preferably, the specific method of step 3 is as follows:
First, the semantic features in the training text library are clustered into L classes with the K-means algorithm, the clustering being represented by its centers C = {μ_1, μ_2, ..., μ_L}; each cluster corresponds to one sub-partition of the semantic feature space. K-means is a hard clustering algorithm and a representative prototype-based objective-function clustering method: it takes a distance from data points to prototypes as the objective to be optimized and derives the iterative update rule by seeking the extremum of that function.
Then, using the statistics and distribution of the semantic features relative to the semantic feature space, the semantic aggregated text fingerprint is generated: the distance between each semantic feature f_i of the text and the cluster center of each sub-partition is computed, and the feature is assigned to the nearest sub-partition:
Id(f_i) = arg min_j ||f_i − μ_j||_2, i = 1, 2, ..., M, j = 1, 2, ..., L
wherein Id(f_i) denotes the index of the sub-partition to which the semantic feature is assigned and μ_j denotes the cluster center of the j-th sub-partition.
Finally, the sum of the differences between the semantic features belonging to the same sub-partition and its cluster center is computed:
wherein f_j : Id(f_j) = i denotes the j-th semantic feature assigned to the i-th sub-partition. The residual sums of all sub-partitions are concatenated into a K × L-dimensional vector V_d = [v_1, v_2, ..., v_L], which is the semantic aggregated fingerprint; the text is finally represented as one semantic aggregated fingerprint V_d together with its M semantic features {f_1, f_2, ..., f_M}.
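A minimal sketch of the aggregation step follows, assuming scikit-learn's KMeans for the clustering (the patent specifies K-means but no particular implementation); the parameter values are examples only.

```python
# Minimal sketch of semantic aggregated fingerprint generation: residual
# aggregation over the learned sub-partitions of the feature space.
# Assumes scikit-learn is available; L and the seed are example values.
import numpy as np
from sklearn.cluster import KMeans

def train_partitions(training_features, L=8, seed=0):
    """Cluster the training-library semantic features into L sub-partitions."""
    return KMeans(n_clusters=L, random_state=seed, n_init=10).fit(np.asarray(training_features))

def aggregated_fingerprint(text_features, kmeans):
    """Concatenate per-partition residual sums into one K*L-dimensional fingerprint."""
    centers = kmeans.cluster_centers_                 # mu_1 ... mu_L, each K-dimensional
    L, K = centers.shape
    F = np.asarray(text_features)                     # M x K sentence features
    assign = kmeans.predict(F)                        # Id(f_i): nearest sub-partition
    V = np.zeros((L, K))
    for i, f in zip(assign, F):
        V[i] += f - centers[i]                        # residual against the cluster center
    return V.reshape(-1)                              # fingerprint V_d of dimension K*L
```

For example, with K = 64 and L = 64 the fingerprint V_d would have 4096 real-valued dimensions.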
Preferably, the specific method of step 4 is as follows:
First, the semantic aggregated fingerprints in the training text library are clustered with the K-means algorithm, and the resulting cluster centers serve as semantic aggregated fingerprint words, forming the first-layer index.
Then, each semantic aggregated fingerprint in the training text library is quantized onto its nearest fingerprint word according to its distance to the fingerprint words, and its difference from that fingerprint word is computed as the fingerprint residual; each fingerprint residual is split evenly into L sub-vectors of K dimensions, K-means is applied to the sub-vectors to obtain D cluster centers, i.e. D sub-vector words, completing the construction of the second-layer index.
Finally, an inverted file is generated for the semantic aggregated fingerprints in the test text library according to the hierarchical index: on the first-layer index, each semantic aggregated fingerprint is quantized onto its nearest fingerprint word according to its distance to the fingerprint words, and its fingerprint residual is computed; the residual is split into L sub-vectors, the distance between each sub-vector and the sub-vector words is computed, and the ID of the nearest sub-vector word is obtained. In the hierarchical index, the text information is stored under the corresponding fingerprint word entry of the fingerprint dictionary; the stored content includes the text ID and the sub-vector word ID corresponding to each sub-vector.
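The two-layer index and inverted file can be sketched as follows; again scikit-learn's KMeans stands in for the clustering, and the numbers of fingerprint words and sub-vector words (n_fp_words, D) are illustrative assumptions.

```python
# Sketch of the two-layer index and inverted file. The first layer quantizes
# whole fingerprints onto "fingerprint words"; the second layer quantizes the
# L sub-vectors of each fingerprint residual onto D "sub-vector words"; the
# inverted file stores (text id, sub-vector word ids) under each fingerprint word.
import numpy as np
from sklearn.cluster import KMeans
from collections import defaultdict

def build_hierarchical_index(train_fps, K, L, n_fp_words=64, D=256, seed=0):
    train_fps = np.asarray(train_fps)                          # N x (K*L) fingerprints
    layer1 = KMeans(n_clusters=n_fp_words, random_state=seed, n_init=10).fit(train_fps)
    residuals = train_fps - layer1.cluster_centers_[layer1.labels_]
    subvectors = residuals.reshape(-1, K)                      # each residual split into L K-dim pieces
    layer2 = KMeans(n_clusters=D, random_state=seed, n_init=10).fit(subvectors)
    return layer1, layer2

def build_inverted_file(test_fps, text_ids, layer1, layer2, K, L):
    inverted = defaultdict(list)                               # fingerprint word id -> postings
    for fp, tid in zip(np.asarray(test_fps), text_ids):
        w = int(layer1.predict(fp[None])[0])                   # nearest fingerprint word
        residual = fp - layer1.cluster_centers_[w]
        codes = layer2.predict(residual.reshape(L, K))         # one sub-vector word id per sub-vector
        inverted[w].append((tid, codes.tolist()))
    return inverted
```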
Preferably, the specific method of step 5 is as follows:
For global fingerprint similarity comparison, the similarity distance between the fingerprint of the text to be compared and the semantic aggregated fingerprints quantized onto the same fingerprint word of the index is computed first, measured with an asymmetric distance: the nearest fingerprint word is selected, the fingerprint residual between the semantic aggregated fingerprint and the corresponding fingerprint word is computed, and the residual is split to generate {v_1, v_2, ..., v_L}; the distance between each sub-vector and each sub-vector word is then computed, generating the corresponding distance matrix. The global distance between the fingerprint of the text to be compared and the i-th semantic aggregated text fingerprint quantized onto the same fingerprint word is then computed as
wherein v_{q,j} denotes the j-th sub-vector of the text to be compared and v_{id(i),j} denotes the sub-vector word corresponding to the j-th sub-vector of the i-th semantic aggregated text fingerprint quantized onto that fingerprint word.
The resulting similarity distances are sorted, and the 10 text fingerprints with the smallest distances are taken as the candidate set; the similarity between the local semantic features of the text to be compared and those of the i-th text in the candidate set is then computed as
wherein d_t and d_i respectively denote the number of semantic features of the text to be compared and of the i-th text in the candidate set, and f_t^q and f_j^i respectively denote the t-th semantic feature of the text to be compared and the j-th semantic feature of the i-th text in the candidate set.
The similarity distance between two texts is the weighted sum of the global similarity and the local similarity:
wherein α is a weight factor with 0 < α < 1. The final similar texts are obtained from the similarity distance between texts, while similar local content is obtained from the local fingerprint similarity, giving the multi-granularity similarity comparison result.
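The sketch below illustrates the top-down comparison: an asymmetric-distance lookup table scores the fingerprints stored under one fingerprint word, the ten nearest candidates are kept, and a local sentence-feature similarity refines them. Since the global-distance and local-similarity formulas appear only as images in the original publication, the cosine-based local similarity, the conversion of the global distance into a similarity, and the example value of α are assumptions of this sketch.

```python
# Sketch of the top-down, multi-granularity comparison. The local-similarity
# form, the distance-to-similarity conversion and alpha are assumptions.
import numpy as np

def global_distances(query_fp, fp_word_center, postings, layer2_centers, K, L):
    """postings: list of (text_id, [sub-vector word ids]) under one fingerprint word."""
    residual = (query_fp - fp_word_center).reshape(L, K)            # query residual sub-vectors
    # distance table: entry [j, c] = distance of query sub-vector j to sub-vector word c
    table = np.linalg.norm(residual[:, None, :] - layer2_centers[None, :, :], axis=2)
    return [(tid, float(table[np.arange(L), codes].sum())) for tid, codes in postings]

def local_similarity(query_feats, cand_feats):
    """Mean best-match cosine similarity between the two texts' sentence features (assumed form)."""
    Q = np.asarray(query_feats); C = np.asarray(cand_feats)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    return float((Qn @ Cn.T).max(axis=1).mean())

def compare(query_fp, query_feats, fp_word_center, postings, layer2_centers,
            features_by_id, K, L, alpha=0.5):
    ranked = sorted(global_distances(query_fp, fp_word_center, postings,
                                     layer2_centers, K, L), key=lambda x: x[1])[:10]
    results = []
    for tid, gdist in ranked:                                        # candidate set of size 10
        lsim = local_similarity(query_feats, features_by_id[tid])
        results.append((tid, alpha * (1.0 / (1.0 + gdist)) + (1.0 - alpha) * lsim))
    return sorted(results, key=lambda x: -x[1])                      # most similar texts first
```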
The beneficial effects of the present invention are:
The present invention models word vector representations by jointly exploiting multi-dimensional semantic correlations, fully mining the semantic information between words and obtaining word vectors with better semantic relatedness; features are extracted with the sentence, which carries richer and more complete semantic information, as the unit, semantic features are characterized with multiple weights, and statistical learning is used to mine the statistics and distribution of the text library, achieving a finer partition of the feature space; a compact, highly discriminative text fingerprint is then generated by multi-feature aggregation, effectively improving the descriptive power and discrimination of the text fingerprint. Following a top-down strategy, text similarity comparison is carried out with the semantic aggregated fingerprint and the local semantic features, and by building a hierarchical index, multi-granularity similarity comparison from the global level to the local level can be realized quickly and efficiently. Experiments verify that the semantic aggregated fingerprint of the invention effectively increases the accuracy of global text similarity comparison, and the hierarchical index effectively increases the efficiency of similarity comparison. The invention has good scalability, is suitable for similarity comparison over massive texts, satisfies users' demands for efficient multi-granularity similarity comparison, and greatly improves the user experience.
Description of the drawings
Fig. 1 is the framework of text multi-granularity similarity comparison in the embodiment of the present invention;
Fig. 2 is a schematic diagram of the word vector representation model in the embodiment of the present invention;
Fig. 3 is the block diagram of semantic aggregated fingerprint generation in the embodiment of the present invention, with the semantic feature space divided into 8 sub-partitions;
Fig. 4 is the flow chart of building the inverted file of the hierarchical index in the embodiment of the present invention;
Fig. 5 is the flow chart of text multi-granularity similarity comparison in the embodiment of the present invention;
Fig. 6 shows a text comparison result in the embodiment of the present invention.
Specific embodiment
The invention will be further described below with reference to the embodiment and the accompanying drawings.
Embodiment:
As shown in Fig. 1, the text multi-granularity similarity comparison method based on semantic aggregated fingerprints of the present invention comprises the following steps:
Step 1, word vector representation training: word vectors are learned by jointly modeling multi-dimensional semantic correlations. Using a public corpus, text-to-word mappings represent the horizontal relationship of words that co-occur in the same context, word-to-context mappings represent the vertical relationship between words that share similar contexts, and synonym and antonym information is added to the model; word vector representations are trained by an unsupervised learning method, so that the resulting word vectors perform better on semantic relatedness and on synonym/antonym discrimination tasks. Fig. 2 shows the word vector representation model.
In this step, each word is represented as a K-dimensional vector, and the objective function of word vector learning is expressed as follows:
wherein N denotes the number of texts in the training library, d_n denotes the n-th text, w_i^n denotes the word vector of the i-th word of the n-th text, the summed context vector of w_i^n is obtained by adding up the word vectors of its context, the two conditional probabilities denote, respectively, the probability of the word w_i^n appearing in its context and the probability of it appearing in the text, SYN_i^n and ANT_i^n denote the synonym set and the antonym set of w_i^n, a further term denotes the probability of the word w_i^n appearing when a known synonym or antonym u is given, and α is a weight factor with 0 < α < 1; training is performed by maximizing the objective function and solved with stochastic gradient ascent.
Step 2, semantic feature extraction: taking the sentence, which carries richer semantic information, as the basic unit, each sentence is preprocessed by word segmentation and stop-word removal; each token is then represented by the word vector representation described in step 1, tokens are weighted with multiple weights including word frequency and part of speech, and the weighted sum of the token word vectors is computed to obtain the semantic feature.
The specific method of this step is as follows:
The text is first preprocessed: it is split into sentences at punctuation marks, giving a sentence set {S_1, S_2, ..., S_M}, where M is the number of sentences in the text; each sentence is segmented into tokens and stop words are removed, giving {c_1, c_2, ..., c_T}, where T is the number of tokens in the sentence. The token weight is the product of the word frequency weight ω_f and the part-of-speech weight ω_ni, ω_c = ω_f × ω_ni; for the part-of-speech weight, nouns are weighted highest, verbs second, adjectives third, and all other parts of speech lowest.
Then each token is represented by its word vector, and the semantic feature is expressed as the weighted sum of the token word vectors:
wherein f_{i,k} denotes the value of the k-th dimension of the i-th sentence, w_{i,j,k} and ω_{i,j} respectively denote the value of the k-th dimension of the j-th token of the i-th sentence and the weight of that token; I(·) is an indicator function that takes the value 1 when w_{i,j,k} > 0 and −1 otherwise. The text is thus represented as a set of M semantic features {f_1, f_2, ..., f_M}.
Step 3, multi-feature aggregation: as shown in Fig. 3, the statistics and distribution of the semantic features in the training library are clustered, dividing the semantic feature space into multiple sub-partitions and thereby achieving a finer partition of the feature space; according to the distance between the semantic features of the text and the cluster centers, each semantic feature is assigned to its nearest sub-partition, the sum of the residuals between the semantic features in each sub-partition and the cluster center is computed, and the residual sums of all sub-partitions are concatenated to generate the semantic aggregated fingerprint.
First, the semantic features in the training text library are clustered into L classes with the K-means algorithm, the clustering being represented by its centers C = {μ_1, μ_2, ..., μ_L}; each cluster corresponds to one sub-partition of the semantic feature space.
Then, using the statistics and distribution of the semantic features relative to the semantic feature space, the semantic aggregated text fingerprint is generated: the distance between each semantic feature f_i of the text and the cluster center of each sub-partition is computed, and the feature is assigned to the nearest sub-partition:
Id(f_i) = arg min_j ||f_i − μ_j||_2, i = 1, 2, ..., M, j = 1, 2, ..., L
wherein Id(f_i) denotes the index of the sub-partition to which the semantic feature is assigned and μ_j denotes the cluster center of the j-th sub-partition.
Finally, the sum of the differences between the semantic features belonging to the same sub-partition and its cluster center is computed:
wherein f_j : Id(f_j) = i denotes the j-th semantic feature assigned to the i-th sub-partition. The residual sums of all sub-partitions are concatenated into a K × L-dimensional vector V_d = [v_1, v_2, ..., v_L], which is the semantic aggregated fingerprint; the text is finally represented as one semantic aggregated fingerprint V_d together with its M semantic features {f_1, f_2, ..., f_M}.
By assigning each semantic feature of the text to be processed to its nearest sub-partition, the distribution of the text's semantic features relative to the semantic features of the text library is obtained. The residual sums of all sub-partitions are concatenated to generate a K × L-dimensional vector, which is the semantic aggregated fingerprint. Traditional fingerprint algorithms in effect treat the entire text feature space as one large cluster and quantize against it, with the origin as the cluster center; the method of the invention instead provides a finer partition of the feature space and, at the same time, uses word vector representations that better characterize multi-dimensional semantic correlations in place of word hash representations, so the descriptive power of the text fingerprint is effectively improved.
Step 4, hierarchical index construction: the semantic aggregated fingerprints in the training library are clustered to form the first-layer index; each fingerprint residual is then split into sub-vectors, which are clustered again to build the second-layer index, yielding the hierarchical index; all semantic aggregated fingerprints of the texts in the test text library are quantized onto the hierarchical index to generate the corresponding inverted file.
The specific method of this step is as follows:
First, the semantic aggregated fingerprints in the training text library are clustered with the K-means algorithm, and the resulting cluster centers serve as semantic aggregated fingerprint words, forming the first-layer index.
Then, each semantic aggregated fingerprint in the training text library is quantized onto its nearest fingerprint word according to its distance to the fingerprint words, and its difference from that fingerprint word is computed as the fingerprint residual; each fingerprint residual is split evenly into L sub-vectors of K dimensions, K-means is applied to the sub-vectors to obtain D cluster centers, i.e. D sub-vector words, completing the construction of the second-layer index.
Finally, as shown in Fig. 4, an inverted file is generated for the semantic aggregated fingerprints in the test text library according to the hierarchical index: on the first-layer index, each semantic aggregated fingerprint is quantized onto its nearest fingerprint word according to its distance to the fingerprint words, and its fingerprint residual is computed; the residual is split into L sub-vectors, the distance between each sub-vector and the sub-vector words is computed, and the ID of the nearest sub-vector word is obtained. In the hierarchical index, the text information is stored under the corresponding fingerprint word entry of the fingerprint dictionary; the stored content includes the text ID and the sub-vector word ID corresponding to each sub-vector.
Step 5, similarity calculation: as shown in Fig. 5, following the multi-granularity comparison algorithm based on the hierarchical index, a top-down computation is used: the global similarity between the text to be compared and the texts in the library is computed first, and a text is added to the candidate set of similar texts when this similarity exceeds a set threshold; the local similarity with the texts in the candidate set is then computed, giving the final similar texts and their specific locally similar content, i.e. the multi-granularity similarity comparison result.
The specific method of this step is as follows:
For global fingerprint similarity comparison, the similarity distance between the fingerprint of the text to be compared and the semantic aggregated fingerprints quantized onto the same fingerprint word of the index is computed first, measured with an asymmetric distance: the nearest fingerprint word is selected, the fingerprint residual between the semantic aggregated fingerprint and the corresponding fingerprint word is computed, and the residual is split to generate {v_1, v_2, ..., v_L}; the distance between each sub-vector and each sub-vector word is then computed, generating the corresponding distance matrix. The global distance between the fingerprint of the text to be compared and the i-th semantic aggregated text fingerprint quantized onto the same fingerprint word is computed as
wherein v_{q,j} denotes the j-th sub-vector of the text to be compared and v_{id(i),j} denotes the sub-vector word corresponding to the j-th sub-vector of the i-th semantic aggregated text fingerprint quantized onto that fingerprint word.
The resulting similarity distances are sorted, and the 10 text fingerprints with the smallest distances are taken as the candidate set; the similarity between the local semantic features of the text to be compared and those of the i-th text in the candidate set is then computed as
wherein d_t and d_i respectively denote the number of semantic features of the text to be compared and of the i-th text in the candidate set, and f_t^q and f_j^i respectively denote the t-th semantic feature of the text to be compared and the j-th semantic feature of the i-th text in the candidate set.
The similarity distance between two texts is the weighted sum of the global similarity and the local similarity:
wherein α is a weight factor with 0 < α < 1. The final similar texts are obtained from the similarity distance between texts, while similar local content is obtained from the local fingerprint similarity, giving the multi-granularity similarity comparison result.
To verify the effect of the invention, the SogouCS news corpus is selected as the test text library; it contains 18 categories of news from sohu.com between June and July 2012, such as domestic, international, sports, society and entertainment. 1,000 news articles are chosen and manually modified, each article corresponding to 5 approximate texts (including the article itself). The SogouCA news corpus is used as the training text set for training the partition of the semantic feature space and the construction of the hierarchical index. In the experiments, 50,000 of the remaining SogouCA news articles are randomly selected as the distractor set. The semantic features of all texts in the training and test sets are extracted in advance; the training set contains 14,744,203 semantic features. The semantic features of the training set are then randomly sampled and clustered with the K-means algorithm to generate clusters of different sizes, and hence semantic aggregated fingerprints of different dimensions. The semantic aggregated fingerprint is denoted SAF; the dictionary size L is set to 8, 16, 32, 64 and 128, and the word vector dimension K is set to 8, 16, 32, 64, 128 and 256. Every text in the test set is used in turn as the text to be compared.
The comparison accuracy of the global text fingerprints is evaluated first; Table 1 below shows the similarity comparison accuracy of SAF with different parameters and of the reference method. Compared with the Simhash method of the reference document, the semantic aggregated fingerprint proposed by the present invention significantly increases the accuracy when the word vector dimension K is set identically.
Table 1
K | Simhash | L=8 | L=16 | L=32 | L=64 | L=128 |
---|---|---|---|---|---|---|
8 | 39.00 | 92.22 | 92.54 | 92.86 | 92.56 | 93.14 |
16 | 78.68 | 92.96 | 94.02 | 94.48 | 95.38 | 94.52 |
32 | 88.64 | 93.84 | 94.94 | 96.16 | 96.82 | 96.80 |
64 | 90.42 | 94.76 | 96.32 | 97.48 | 97.64 | 97.98 |
128 | 91.00 | 95.60 | 97.22 | 97.84 | 98.68 | 98.14 |
256 | 91.00 | 96.16 | 97.88 | 98.18 | 98.92 | 97.86 |
When similarity comparison is performed with the multi-granularity comparison algorithm based on the hierarchical index, the comparison efficiency is improved by 87.93% compared with direct comparison of the text semantic features, showing good scalability. Fig. 6 gives a sample of the text comparison results, and Table 2 shows part of the comparison results more clearly, with similarities kept to 8 decimal places. It can be seen that the present invention correctly identifies sentences whose content has been partly replaced or reordered and sentences with slightly modified conjunctions and keywords. When the keywords of a sentence are heavily modified, the similarity between the sentences is low from the point of view of both sentence structure and semantics; the system reports a similarity of 0 between such sentences, which is consistent with the actual situation. The present invention can therefore effectively realize multi-granularity similarity comparison of texts.
Table 2
The above embodiment is a preferred embodiment of the present invention and does not limit the technical solution of the present invention; any technical solution that can be realized on the basis of the above embodiment without creative work shall be regarded as falling within the scope of protection of the present patent.
Claims (6)
1. A text multi-granularity similarity comparison method based on semantic aggregated fingerprints, characterized by comprising the following steps:
Step 1, word vector representation training: word vectors are learned by jointly modeling multi-dimensional semantic correlations, i.e. using a public corpus, text-to-word mappings represent the horizontal relationship of words that co-occur in the same context, word-to-context mappings represent the vertical relationship between words that share similar contexts, and synonym and antonym information is added to the model; word vector representations are trained by an unsupervised learning method, so that the resulting word vectors perform better on semantic relatedness and on synonym/antonym discrimination tasks;
Step 2, semantic feature extraction: taking the sentence, which carries richer semantic information, as the basic unit, each sentence is preprocessed by word segmentation and stop-word removal; each token is then represented by its word vector, tokens are weighted with multiple weights including word frequency and part of speech, and the weighted sum of the token word vectors is computed to obtain the semantic feature;
Step 3, multi-feature aggregation: the statistics and distribution of the semantic features in the training library are clustered, dividing the semantic feature space into multiple sub-partitions and thereby achieving a finer partition of the feature space; according to the distance between each semantic feature of the text and the cluster centers, the semantic feature is assigned to its nearest sub-partition, the sum of the residuals between the semantic features in each sub-partition and the cluster center is computed, and the residual sums of all sub-partitions are concatenated to generate the semantic aggregated fingerprint;
Step 4, hierarchical index construction: the semantic aggregated fingerprints in the training library are clustered to form the first-layer index; each fingerprint residual is then split into sub-vectors, which are clustered again to build the second-layer index, yielding the hierarchical index; all semantic aggregated fingerprints of the texts in the test text library are quantized onto the hierarchical index to generate the corresponding inverted file;
Step 5, similarity calculation: following the multi-granularity comparison algorithm based on the hierarchical index, a top-down computation is used: the global similarity between the text to be compared and the texts in the library is computed first, and a text is added to the candidate set of similar texts when this similarity exceeds a set threshold; the local similarity with the texts in the candidate set is then computed, giving the final similar texts and their specific locally similar content, i.e. the multi-granularity similarity comparison result.
2. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, characterized in that: in step 1, the joint modeling of multi-dimensional semantic correlations represents each word as a K-dimensional vector, and the objective function of word vector learning is expressed as follows:
wherein N denotes the number of texts in the training library, d_n denotes the n-th text, w_i^n denotes the word vector of the i-th word of the n-th text, the summed context vector of w_i^n is obtained by adding up the word vectors of its context, the two conditional probabilities denote, respectively, the probability of the word w_i^n appearing in its context and the probability of it appearing in the text, SYN_i^n and ANT_i^n denote the synonym set and the antonym set of w_i^n, a further term denotes the probability of the word w_i^n appearing when a known synonym or antonym u is given, and α is a weight factor with 0 < α < 1; training is performed by maximizing the objective function and solved with stochastic gradient ascent.
3. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, characterized in that the specific method of step 2 is as follows:
the text is first preprocessed: it is split into sentences at punctuation marks, giving a sentence set {S_1, S_2, ..., S_M}, where M is the number of sentences in the text; each sentence is segmented into tokens and stop words are removed, giving {c_1, c_2, ..., c_T}, where T is the number of tokens in the sentence; the token weight is characterized by the product of the word frequency weight ω_f and the part-of-speech weight ω_ni, ω_c = ω_f × ω_ni, where, in terms of part of speech, nouns are weighted highest, verbs second, adjectives third, and all other parts of speech lowest;
then each token is represented by its word vector, and the semantic feature is expressed as the weighted sum of the token word vectors:
wherein f_{i,k} denotes the value of the k-th dimension of the i-th sentence, w_{i,j,k} and ω_{i,j} respectively denote the value of the k-th dimension of the j-th token of the i-th sentence and the weight of that token, and I(·) is an indicator function that takes the value 1 when w_{i,j,k} > 0 and −1 otherwise; the text is represented as a set of M semantic features {f_1, f_2, ..., f_M}.
4. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, characterized in that the specific method of step 3 is as follows:
first, the semantic features in the training text library are clustered into L classes with the K-means algorithm, the clustering being represented by its centers C = {μ_1, μ_2, ..., μ_L}, each cluster corresponding to one sub-partition of the semantic feature space;
then, using the statistics and distribution of the semantic features relative to the semantic feature space, the semantic aggregated text fingerprint is generated: the distance between each semantic feature f_i of the text and the cluster center of each sub-partition is computed, and the feature is assigned to the nearest sub-partition:
Id(f_i) = arg min_j ||f_i − μ_j||_2, i = 1, 2, ..., M, j = 1, 2, ..., L
wherein Id(f_i) denotes the index of the sub-partition to which the semantic feature is assigned and μ_j denotes the cluster center of the j-th sub-partition;
finally, the sum of the differences between the semantic features belonging to the same sub-partition and its cluster center is computed:
wherein f_j : Id(f_j) = i denotes the j-th semantic feature assigned to the i-th sub-partition; the residual sums of all sub-partitions are concatenated into a K × L-dimensional vector V_d = [v_1, v_2, ..., v_L], which is the semantic aggregated fingerprint, and the text is finally represented as one semantic aggregated fingerprint V_d together with M semantic features {f_1, f_2, ..., f_M}.
5. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, characterized in that the specific method of step 4 is as follows:
first, the semantic aggregated fingerprints in the training text library are clustered with the K-means algorithm, and the resulting cluster centers serve as semantic aggregated fingerprint words, forming the first-layer index;
then, each semantic aggregated fingerprint in the training text library is quantized onto the nearest fingerprint word according to its distance to the semantic aggregated fingerprint words, and its difference from that fingerprint word is computed as the fingerprint residual; each fingerprint residual is split evenly into L sub-vectors of K dimensions, K-means is applied to the sub-vectors to obtain D cluster centers, i.e. D sub-vector words, completing the construction of the second-layer index;
finally, an inverted file is generated for the semantic aggregated fingerprints in the test text library according to the hierarchical index: on the first-layer index, each semantic aggregated fingerprint is quantized onto the nearest fingerprint word according to its distance to the fingerprint words, and its fingerprint residual is computed; the residual is split into L sub-vectors, the distance between each sub-vector and the sub-vector words is computed, and the ID of the nearest sub-vector word is obtained; in the hierarchical index, the text information is stored under the corresponding fingerprint word index of the fingerprint dictionary, the stored content comprising the text ID and the sub-vector word ID corresponding to each sub-vector.
6. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, characterized in that the specific method of step 5 is as follows:
for global fingerprint similarity comparison, the similarity distance between the semantic aggregated fingerprint of the text to be compared and the semantic aggregated fingerprints quantized onto the same fingerprint word of the index is computed first, measured with an asymmetric distance: the nearest fingerprint word is selected, the fingerprint residual between the semantic aggregated fingerprint and the corresponding fingerprint word is computed, and the residual is split to generate {v_1, v_2, ..., v_L}; the distance between each sub-vector and each sub-vector word is then computed, generating the corresponding distance matrix; the global distance between the fingerprint of the text to be compared and the i-th semantic aggregated text fingerprint quantized onto the same fingerprint word is computed as
wherein v_{q,j} denotes the j-th sub-vector of the text to be compared and v_{id(i),j} denotes the sub-vector word corresponding to the j-th sub-vector of the i-th semantic aggregated text fingerprint quantized onto the fingerprint word;
the resulting similarity distances are sorted, and the 10 text fingerprints with the smallest distances are taken as the candidate set; the similarity between the local semantic features of the text to be compared and those of the i-th text in the candidate set is then computed as
wherein d_t and d_i respectively denote the number of semantic features of the text to be compared and of the i-th text in the candidate set, and f_t^q and f_j^i respectively denote the t-th semantic feature of the text to be compared and the j-th semantic feature of the i-th text in the candidate set;
the similarity distance between two texts is the weighted sum of the global similarity and the local similarity:
wherein α is a weight factor with 0 < α < 1; the final similar texts are obtained from the similarity distance between texts, while similar local content is obtained from the local fingerprint similarity, giving the multi-granularity similarity comparison result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910441282.1A CN110321925B (en) | 2019-05-24 | 2019-05-24 | Text multi-granularity similarity comparison method based on semantic aggregated fingerprints |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910441282.1A CN110321925B (en) | 2019-05-24 | 2019-05-24 | Text multi-granularity similarity comparison method based on semantic aggregated fingerprints |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110321925A true CN110321925A (en) | 2019-10-11 |
CN110321925B CN110321925B (en) | 2022-11-18 |
Family
ID=68119119
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910441282.1A Active CN110321925B (en) | 2019-05-24 | 2019-05-24 | Text multi-granularity similarity comparison method based on semantic aggregated fingerprints |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321925B (en) |
-
2019
- 2019-05-24 CN CN201910441282.1A patent/CN110321925B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090119572A1 (en) * | 2007-11-02 | 2009-05-07 | Marja-Riitta Koivunen | Systems and methods for finding information resources |
US20150120720A1 (en) * | 2012-06-22 | 2015-04-30 | Krishna Kishore Dhara | Method and system of identifying relevant content snippets that include additional information |
CN107423729A (en) * | 2017-09-20 | 2017-12-01 | 湖南师范大学 | A kind of remote class brain three-dimensional gait identifying system and implementation method towards under complicated visual scene |
CN108399163A (en) * | 2018-03-21 | 2018-08-14 | 北京理工大学 | Bluebeard compound polymerize the text similarity measure with word combination semantic feature |
CN108595706A (en) * | 2018-05-10 | 2018-09-28 | 中国科学院信息工程研究所 | A kind of document semantic representation method, file classification method and device based on theme part of speech similitude |
CN109271626A (en) * | 2018-08-31 | 2019-01-25 | 北京工业大学 | Text semantic analysis method |
Non-Patent Citations (6)
Title |
---|
JADALLA, A.: "A fingerprinting based plagiarism detection system for Arabic text based documents", Proceedings of the 2012 8th International Conference on Computing Technology and Information Management *
MOHAMED ELHOSENY: "FPSS: Fingerprint-based semantic similarity detection in big data environment", 2017 Eighth International Conference on Intelligent Computing and Information Systems (ICICIS) *
WEN XIA: "Similarity and Locality Based Indexing for High Performance Data Deduplication", IEEE Transactions on Computers *
刘宏哲: "Research on text semantic similarity calculation methods" (文本语义相似度计算方法研究), China Doctoral Dissertations Full-text Database, Information Science and Technology *
刘礼芳: "Social-network-based web image semantic annotation and aggregation" (基于社会网络的WEB图像语义标注与聚合), China Master's Theses Full-text Database, Information Science and Technology *
姜雪: "Research on Simhash-based text similarity detection algorithms" (基于simhash的文本相似检测算法研究), China Master's Theses Full-text Database, Information Science and Technology *
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110750616A (en) * | 2019-10-16 | 2020-02-04 | 网易(杭州)网络有限公司 | Retrieval type chatting method and device and computer equipment |
CN110909550A (en) * | 2019-11-13 | 2020-03-24 | 北京环境特性研究所 | Text processing method and device, electronic equipment and readable storage medium |
CN110909550B (en) * | 2019-11-13 | 2023-11-03 | 北京环境特性研究所 | Text processing method, text processing device, electronic equipment and readable storage medium |
CN110956039A (en) * | 2019-12-04 | 2020-04-03 | 中国太平洋保险(集团)股份有限公司 | Text similarity calculation method and device based on multi-dimensional vectorization coding |
CN110990538A (en) * | 2019-12-20 | 2020-04-10 | 深圳前海黑顿科技有限公司 | Semantic fuzzy search method based on sentence-level deep learning language model |
CN110990538B (en) * | 2019-12-20 | 2022-04-01 | 深圳前海黑顿科技有限公司 | Semantic fuzzy search method based on sentence-level deep learning language model |
CN111461109A (en) * | 2020-02-27 | 2020-07-28 | 浙江工业大学 | Method for identifying documents based on environment multi-type word bank |
CN111461109B (en) * | 2020-02-27 | 2023-09-15 | 浙江工业大学 | Method for identifying documents based on environment multi-class word stock |
CN111694952A (en) * | 2020-04-16 | 2020-09-22 | 国家计算机网络与信息安全管理中心 | Big data analysis model system based on microblog and implementation method thereof |
CN111381191B (en) * | 2020-05-29 | 2020-09-01 | 支付宝(杭州)信息技术有限公司 | Method for synonymy modifying text and determining text creator |
CN111381191A (en) * | 2020-05-29 | 2020-07-07 | 支付宝(杭州)信息技术有限公司 | Method for synonymy modifying text and determining text creator |
CN111859635A (en) * | 2020-07-03 | 2020-10-30 | 中国人民解放军海军航空大学航空作战勤务学院 | Simulation system based on multi-granularity modeling technology and construction method |
CN112287669A (en) * | 2020-12-28 | 2021-01-29 | 深圳追一科技有限公司 | Text processing method and device, computer equipment and storage medium |
CN113111645A (en) * | 2021-04-28 | 2021-07-13 | 东南大学 | Media text similarity detection method |
CN113111645B (en) * | 2021-04-28 | 2024-02-06 | 东南大学 | Media text similarity detection method |
CN113313180A (en) * | 2021-06-04 | 2021-08-27 | 太原理工大学 | Remote sensing image semantic segmentation method based on deep confrontation learning |
CN115935195B (en) * | 2022-11-08 | 2023-08-08 | 华院计算技术(上海)股份有限公司 | Text matching method and device, computer readable storage medium and terminal |
CN115935195A (en) * | 2022-11-08 | 2023-04-07 | 华院计算技术(上海)股份有限公司 | Text matching method and device, computer readable storage medium and terminal |
CN116129146A (en) * | 2023-03-29 | 2023-05-16 | 中国工程物理研究院计算机应用研究所 | Heterogeneous image matching method and system based on local feature consistency |
CN116129146B (en) * | 2023-03-29 | 2023-09-01 | 中国工程物理研究院计算机应用研究所 | Heterogeneous image matching method and system based on local feature consistency |
Also Published As
Publication number | Publication date |
---|---|
CN110321925B (en) | 2022-11-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321925A (en) | Text multi-granularity similarity comparison method based on semantic aggregated fingerprints | |
CN106383877B (en) | Social media online short text clustering and topic detection method | |
Wang et al. | A hybrid document feature extraction method using latent Dirichlet allocation and word2vec | |
CN103514183B (en) | Information search method and system based on interactive document clustering | |
CN102207945B (en) | Knowledge network-based text indexing system and method | |
Ni et al. | Short text clustering by finding core terms | |
CN108197111A (en) | A kind of text automatic abstracting method based on fusion Semantic Clustering | |
CN111291188B (en) | Intelligent information extraction method and system | |
CN105808524A (en) | Patent document abstract-based automatic patent classification method | |
CN108268449A (en) | A kind of text semantic label abstracting method based on lexical item cluster | |
CN111368077A (en) | K-Means text classification method based on particle swarm location updating thought wolf optimization algorithm | |
CN106934005A (en) | A kind of Text Clustering Method based on density | |
CN110765755A (en) | Semantic similarity feature extraction method based on double selection gates | |
CN106599072B (en) | Text clustering method and device | |
CN112926340B (en) | Semantic matching model for knowledge point positioning | |
CN112883722B (en) | Distributed text summarization method based on cloud data center | |
Odeh et al. | Arabic text categorization algorithm using vector evaluation method | |
Naeem et al. | Development of an efficient hierarchical clustering analysis using an agglomerative clustering algorithm | |
CN114997288A (en) | Design resource association method | |
CN115248839A (en) | Knowledge system-based long text retrieval method and device | |
Yin et al. | Sentence-bert and k-means based clustering technology for scientific and technical literature | |
Hassan et al. | Automatic document topic identification using wikipedia hierarchical ontology | |
Ding et al. | The research of text mining based on self-organizing maps | |
Zhang et al. | Extractive Document Summarization based on hierarchical GRU | |
Yang et al. | Research on improvement of text processing and clustering algorithms in public opinion early warning system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |