CN110321925A - A multi-granularity text similarity comparison method based on semantic fusion fingerprints - Google Patents

A multi-granularity text similarity comparison method based on semantic fusion fingerprints

Info

Publication number
CN110321925A
CN110321925A
Authority
CN
China
Prior art keywords
text
word
fingerprint
semantic feature
semantics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910441282.1A
Other languages
Chinese (zh)
Other versions
CN110321925B (en)
Inventor
梁燕
万正景
陶以政
李龚亮
许峰
曹政
谢杨
马丹阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
COMPUTER APPLICATION INST CHINA ENGINEERING PHYSICS ACADEMY
Original Assignee
COMPUTER APPLICATION INST CHINA ENGINEERING PHYSICS ACADEMY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by COMPUTER APPLICATION INST CHINA ENGINEERING PHYSICS ACADEMY filed Critical COMPUTER APPLICATION INST CHINA ENGINEERING PHYSICS ACADEMY
Priority to CN201910441282.1A priority Critical patent/CN110321925B/en
Publication of CN110321925A publication Critical patent/CN110321925A/en
Application granted granted Critical
Publication of CN110321925B publication Critical patent/CN110321925B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G06F 18/232: Non-hierarchical techniques
    • G06F 18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213: Non-hierarchical techniques with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-granularity text similarity comparison method based on semantic fusion fingerprints, comprising the following steps: word-vector representation training; semantic feature extraction; multi-feature aggregation; hierarchical index construction; similarity calculation. The invention models word-vector representations by jointly exploiting multi-dimensional semantic correlations, fully mining the semantic information between words; features are extracted sentence by sentence and characterized with multiple weights; statistical-learning methods mine the statistics and distribution of the text corpus to achieve a finer division of the feature space; compact, highly discriminative text fingerprints are then generated by multi-feature aggregation, effectively improving the descriptive power and discrimination of text fingerprints. Following a top-down strategy, text similarity is compared using both the semantic fusion fingerprint and the local semantic features; by building a hierarchical index, multi-granularity similarity comparison from the global level down to the local level can be performed quickly and efficiently. The method also has good scalability.

Description

A multi-granularity text similarity comparison method based on semantic fusion fingerprints
Technical field
The present invention relates to text similarity comparison methods, and in particular to a multi-granularity text similarity comparison method based on semantic fusion fingerprints. It belongs to the field of pattern recognition and information processing.
Background technique
Two texts are approximate when the content and information they describe are similar or even identical. If one text is generated from another by modifying a small portion of its content through insertion, deletion, replacement, or similar operations, the two texts are considered approximate. The spread of approximate texts or web pages is usually undesirable, and with the surge of data the problems they cause grow increasingly severe. Approximate-text detection is therefore an important technique for reducing storage overhead, improving retrieval efficiency and data utilization, and preventing plagiarism and unauthorized copying.
Experts and scholars at home and abroad have proposed a variety of methods. Traditional text similarity comparison falls into two broad classes. The first is based on string comparison. The second is based on word-frequency statistics: on top of the vector space model, a text is characterized by a feature vector, and the similarity distance between vectors measures the similarity between texts. The former can use strings of different granularities, such as sentence-level or paragraph-level strings. However, because a text usually contains a large number of strings, string-matching methods can hardly achieve real-time performance on massive long texts.
Among Shingle-based methods, one group treats a text as a set of shingles, where a shingle is a contiguous subsequence of the text; these methods inevitably suffer from high computational cost and essentially cannot handle massive data. Another group maps every word of a text to a simple hash value using an existing dictionary; although this effectively reduces the cost of similarity computation, the resulting text fingerprints are unstable: when the dictionary vocabulary does not cover the words in a text, even a tiny change in the text can cause the hash value to fluctuate.
The Simhash algorithm proposed by Charikar is currently regarded as the best and most effective algorithm for approximate-text detection. Simhash performs probabilistic dimensionality reduction on high-dimensional data, mapping a high-dimensional text feature vector to a small, fixed-length fingerprint. Most approximate-text detection systems today are built on Simhash. However, these methods focus only on the text itself and ignore useful information in the text corpus; moreover, Simhash generates a single fingerprint per text and therefore cannot support comparison of locally similar parts of two texts.
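Since Simhash is the baseline the invention is compared against, a minimal sketch of the standard Charikar construction may help; the MD5 token hash and the toy documents below are arbitrary illustrative choices, not anything specified by the patent:

```python
import hashlib

def simhash(tokens, bits=64):
    """Charikar's Simhash: every token votes +1/-1 on each bit of a
    fixed-length fingerprint; the sign of each bit-sum gives the final bit."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumped over the lazy dog".split()  # one word changed
doc3 = "completely unrelated sentence about databases".split()

d_near = hamming(simhash(doc1), simhash(doc2))
d_far = hamming(simhash(doc1), simhash(doc3))
```

A small edit leaves most bit-sums dominated by the shared tokens, so the near-duplicate pair lands at a much smaller Hamming distance than the unrelated pair. This also illustrates the limitation the patent notes: one fingerprint per text, so no locally similar passages can be reported.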
Summary of the invention
The object of the present invention is to solve the above problems by providing a multi-granularity text similarity comparison method based on semantic fusion fingerprints.
The present invention achieves this object through the following technical solution:
A multi-granularity text similarity comparison method based on semantic fusion fingerprints, comprising the following steps:
Step 1, word-vector representation training: jointly model word-vector learning with integrated multi-dimensional semantic correlations. Using a public corpus, text-to-word mappings represent the horizontal (co-occurrence) relationship between words within a context, while word-to-context mappings represent the vertical relationship between words that share similar contexts; synonym and antonym information is added to the model. Word-vector training is performed by unsupervised learning, so that the resulting word vectors perform better on semantic-correlation and synonym/antonym identification tasks;
Step 2, semantic feature extraction: taking the sentence, which is richer in semantic information, as the basic unit, each sentence is preprocessed (word segmentation, stop-word removal); each token is then represented by its word vector and characterized with multiple weights including term frequency and part of speech, and the semantic feature is computed as the weighted sum of the token word vectors, realizing semantic feature extraction;
Step 3, multi-feature aggregation: cluster the semantic features of the training corpus using their statistics and distribution characteristics, dividing the semantic feature space into multiple sub-partitions and thereby achieving a finer division of the feature space. Each semantic feature of a text is assigned, by its distance to the cluster centres, to the nearest sub-partition; within each sub-partition, the residuals between the assigned semantic features and the cluster centre are summed, and the residual sums of all sub-partitions are concatenated to generate the semantic fusion fingerprint;
Step 4, hierarchical index construction: cluster the semantic fusion fingerprints of the training corpus to form the first index layer; then split each fingerprint residual into sub-vectors and cluster them again to build the second index layer, obtaining the hierarchical index. All semantic fusion fingerprints of the test corpus are quantized onto the hierarchical index, generating the corresponding inverted file;
Step 5, similarity calculation: following the multi-granularity comparison algorithm based on the hierarchical index, a top-down computation first evaluates the global similarity between the text to be compared and each text in the corpus; when the similarity exceeds a set threshold, the text is added to the candidate set of similar texts. The local similarity against the candidate texts is then computed, yielding the final similar texts together with their specific locally similar content, i.e., the multi-granularity similarity comparison result.
Preferably, the integrated multi-dimensional semantic-correlation joint modeling of step 1 represents each word as a K-dimensional vector; the objective function of word-vector learning is expressed as follows:
Here, N denotes the number of texts in the training corpus; d_n denotes the n-th text; w_i^n denotes the word vector of the i-th word in the n-th text; x_i^n denotes the vector obtained by summing the context of w_i^n; p(w_i^n | x_i^n) and p(w_i^n | d_n) denote, respectively, the probability of the word w_i^n appearing in its context and the probability of it appearing in the text; SYN_i^n and ANT_i^n denote the synonym set and antonym set of w_i^n; p(w_i^n | u) denotes the probability of w_i^n appearing given a known synonym or antonym u; α is a weight factor with 0 < α < 1. Training maximizes the objective function and is solved by stochastic gradient ascent.
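A plausible form of the objective, consistent with the symbol definitions given above (a context term, a document term, and α-weighted synonym and antonym terms); this reconstruction is an assumption, not the patent's verbatim formula:

```latex
\mathcal{L} \;=\; \sum_{n=1}^{N} \sum_{i} \Big[
    \log p\!\left(w_i^n \mid x_i^n\right)
  + \log p\!\left(w_i^n \mid d_n\right)
  + \alpha \!\!\sum_{u \in \mathrm{SYN}_i^n}\!\! \log p\!\left(w_i^n \mid u\right)
  - \alpha \!\!\sum_{u \in \mathrm{ANT}_i^n}\!\! \log p\!\left(w_i^n \mid u\right)
\Big]
```

Maximizing this by stochastic gradient ascent, as the text states, would pull a word's vector toward its contexts, documents, and synonyms while pushing it away from its antonyms.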
Preferably, step 2 proceeds as follows:
First, the text is preprocessed: it is segmented at punctuation marks to obtain a sentence set {S_1, S_2, ..., S_M}, where M is the number of sentences in the text; each sentence is word-segmented and stop words are removed, giving {c_1, c_2, ..., c_T}, where T is the number of tokens in the sentence. The token weight is characterized as the product of the term-frequency weight ω_f and the part-of-speech weight ω_n, i.e. ω_c = ω_f × ω_n; regarding part of speech, nouns carry the highest weight, verbs the second, adjectives the third, and all others the lowest;
Then, each token is represented by its word vector, and the semantic feature is expressed as the weighted sum of the token word vectors:
Here, f_{i,k} denotes the k-th dimension of the feature of the i-th sentence; w_{i,j,k} and ω_{i,j} denote, respectively, the k-th component of the word vector of the j-th token of the i-th sentence and the corresponding token weight; I(·) is an indicator function whose value is 1 when w_{i,j,k} > 0 and −1 otherwise. The text is then represented as the set of M semantic features {f_1, f_2, ..., f_M}.
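As a minimal sketch of the sentence-feature computation, assuming the elided formula is f_{i,k} = Σ_j ω_{i,j} · I(w_{i,j,k}) with I(·) = ±1 as defined above; the toy word vectors, term-frequency weights, and part-of-speech weight values are hypothetical, only the noun > verb > adjective > other ordering comes from the text:

```python
import numpy as np

# Hypothetical POS weights following the text's ordering: noun > verb > adjective > other.
POS_WEIGHT = {"noun": 1.0, "verb": 0.75, "adj": 0.5, "other": 0.25}

def sentence_feature(word_vecs, tf_weights, pos_tags):
    """One semantic feature per sentence: weighted sum of sign-binarized
    word vectors, f_k = sum_j w_j * I(v_{j,k} > 0), I(.) in {+1, -1}."""
    feat = np.zeros(word_vecs.shape[1])
    for vec, tf, pos in zip(word_vecs, tf_weights, pos_tags):
        w = tf * POS_WEIGHT[pos]                    # omega_c = omega_f * omega_n
        feat += w * np.where(vec > 0, 1.0, -1.0)    # indicator I(.)
    return feat

# Toy sentence: 3 tokens with K = 4 dimensional word vectors.
vecs = np.array([[ 0.2, -0.1,  0.5, -0.3],
                 [-0.4,  0.3,  0.1,  0.2],
                 [ 0.1,  0.1, -0.2, -0.6]])
f = sentence_feature(vecs, tf_weights=[2.0, 1.0, 1.0],
                     pos_tags=["noun", "verb", "adj"])
```

Binarizing each component before weighting keeps the feature robust to small fluctuations in the word-vector values, in the spirit of the fingerprinting methods the patent builds on.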
Preferably, step 3 proceeds as follows:
First, the semantic features of the training corpus are clustered into L classes with the K-means algorithm, giving a codebook of cluster centres C = {μ_1, μ_2, ..., μ_L}, each cluster corresponding to one sub-partition of the semantic feature space. K-means is a hard clustering algorithm and a typical representative of prototype-based objective-function clustering: it takes a distance from data points to prototypes as the objective to optimize and derives the iterative update rules by seeking the extremum of that function.
Then, using the statistics and distribution of the semantic features relative to the semantic feature space, the semantic fusion text fingerprint is generated: the distance between each semantic feature f_i of the text and the cluster centre of every sub-partition is computed, and the feature is assigned to the nearest sub-partition:
Id(f_i) = arg min_j ||f_i - μ_j||_2,  i = 1, 2, ..., M,  j = 1, 2, ..., L
where Id(f_i) denotes the index of the sub-partition to which the semantic feature is assigned, and μ_j denotes the cluster centre of the j-th sub-partition;
Finally, the sum of the residuals between the semantic features belonging to the same sub-partition and its cluster centre is computed:
Here, f_j : Id(f_j) = i denotes a semantic feature f_j assigned to the i-th sub-partition. The residual sums of all sub-partitions are concatenated into a K × L-dimensional vector V_d = [v_1, v_2, ..., v_L], which is the semantic fusion fingerprint; a text is ultimately represented as one semantic fusion fingerprint V_d plus M semantic features {f_1, f_2, ..., f_M}.
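The per-sub-partition residual sums concatenated into a K × L vector amount to a VLAD-style encoding; a minimal sketch with hypothetical centroids and toy features (in the patent the centroids come from K-means over the training corpus):

```python
import numpy as np

def aggregate_fingerprint(features, centroids):
    """Assign each sentence feature to its nearest cluster centre, sum the
    residuals per sub-partition, and concatenate into one K*L vector."""
    L, K = centroids.shape
    V = np.zeros((L, K))
    for f in features:
        j = np.argmin(np.linalg.norm(f - centroids, axis=1))  # Id(f) = argmin ||f - mu_j||
        V[j] += f - centroids[j]                              # accumulate residual
    return V.reshape(-1)  # the semantic fusion fingerprint

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # L = 2 sub-partitions, K = 2
feats = [np.array([1.0, 1.0]), np.array([9.0, 11.0]), np.array([-1.0, 0.0])]
fp = aggregate_fingerprint(feats, centroids)
```

Features near the same centre partially cancel or reinforce in the residual sum, so the fingerprint captures how the text's features are distributed relative to the corpus-wide partition rather than just which partitions they hit.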
Preferably, step 4 proceeds as follows:
First, the semantic fusion fingerprints of the training corpus are clustered with the K-means algorithm, and the resulting cluster centres serve as semantic fusion fingerprint words, constituting the first index layer;
Then, each semantic fusion fingerprint of the training corpus is quantized onto its nearest fingerprint word according to their distance, and its difference from that fingerprint word is computed as the fingerprint residual. Each fingerprint residual is split evenly into L sub-vectors of K dimensions; K-means is applied to the sub-vectors to obtain D cluster centres, i.e. D sub-vector words, completing the construction of the second index layer;
Finally, according to the hierarchical index, the inverted file is generated for the semantic fusion fingerprints of the test corpus: on the first layer, each fingerprint is quantized onto its nearest fingerprint word according to their distance, and its fingerprint residual is computed; the residual is split into L sub-vectors, and the distance from each sub-vector to the sub-vector words is computed to obtain the ID of the nearest sub-vector word. In the hierarchical index, the text information is stored under the corresponding fingerprint-word entry of the fingerprint dictionary; the stored content comprises the text ID and the sub-vector word ID corresponding to each sub-vector.
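The two-layer quantization above resembles an inverted file with residual sub-vector codebooks; a minimal sketch under toy codebooks (the coarse words and sub-vector words below are hypothetical stand-ins for the K-means-trained codebooks of the patent):

```python
import numpy as np
from collections import defaultdict

def build_index(fingerprints, coarse_words, sub_words, L):
    """Quantize each fingerprint to its nearest coarse word (layer 1),
    split the residual into L sub-vectors and quantize each against the
    sub-vector codebook (layer 2); store (doc_id, sub-word ids) in an
    inverted list keyed by the coarse-word id."""
    inverted = defaultdict(list)
    for doc_id, fp in enumerate(fingerprints):
        c = int(np.argmin(np.linalg.norm(fp - coarse_words, axis=1)))
        residual = fp - coarse_words[c]
        sub_ids = [int(np.argmin(np.linalg.norm(s - sub_words, axis=1)))
                   for s in np.split(residual, L)]
        inverted[c].append((doc_id, sub_ids))
    return inverted

coarse = np.array([[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]])
subs = np.array([[0.0, 0.0], [1.0, 1.0]])  # D = 2 sub-vector words
fps = [np.array([0.9, 1.1, 0.1, 0.0]), np.array([10.0, 10.0, 11.0, 11.0])]
index = build_index(fps, coarse, subs, L=2)
```

At query time only the inverted list of the query's coarse word needs scanning, which is where the efficiency gain over exhaustive fingerprint comparison comes from.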
Preferably, step 5 proceeds as follows:
For global fingerprint similarity comparison, the similarity distance between the semantic fusion fingerprint of the text to be compared and the semantic fusion fingerprints quantized onto the same fingerprint word of the index is computed first, measured by the asymmetric distance: the nearest fingerprint word is selected, the residual between the query's semantic fusion fingerprint and that fingerprint word is computed, and the residual is split to produce {v_1, v_2, ..., v_L}; the distance between each sub-vector and every sub-vector word is then computed, generating the corresponding distance matrix. The global distance between the query's fingerprint and the i-th semantic fusion text fingerprint quantized onto the same fingerprint word is computed as:
where v_{q,j} denotes the j-th sub-vector of the text to be compared, and v_{id(i),j} denotes the sub-vector word corresponding to the j-th sub-vector of the i-th semantic fusion text fingerprint quantized onto that fingerprint word;
The results are sorted by the obtained similarity distances, and the 10 text fingerprints with the smallest distances are chosen as the candidate set; the similarity between the local semantic features of the text to be compared and those of the i-th text in the candidate set is then computed:
where d_t and d_i denote the numbers of semantic features of the text to be compared and of the i-th candidate text, respectively, and f_t^q and f_j^i denote the t-th semantic feature of the text to be compared and the j-th semantic feature of the i-th candidate text;
The similarity distance between two texts is the weighted sum of the global similarity and the local similarity:
where α is a weight factor with 0 < α < 1. The final similar texts are obtained from the inter-text similarity distance, and the similar local content is obtained from the local fingerprint similarity, yielding the multi-granularity similarity comparison result.
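The elided distance formulas can be illustrated under stated assumptions: the global distance is read as the sum over sub-vectors of the distance between the un-quantized query sub-vector and the document's codebook word (the asymmetric distance), the local similarity as an average best-match distance between sentence features, and the final score as the α-weighted combination. All three readings are sketches, not the patent's exact formulas:

```python
import numpy as np

def global_distance(query_subs, doc_sub_ids, sub_words):
    """Asymmetric distance: the query residual stays un-quantized, each
    document sub-vector is looked up from the codebook, so the global
    distance is sum_j ||v_{q,j} - word(id_j)||."""
    return sum(np.linalg.norm(q - sub_words[i])
               for q, i in zip(query_subs, doc_sub_ids))

def local_similarity(query_feats, doc_feats):
    """For each query sentence feature, its best (smallest) distance to any
    document feature, averaged over the query features."""
    return float(np.mean([min(np.linalg.norm(q - d) for d in doc_feats)
                          for q in query_feats]))

def combined_distance(d_global, d_local, alpha=0.6):
    """Weighted combination, alpha in (0, 1) as in the text."""
    return alpha * d_global + (1 - alpha) * d_local

sub_words = np.array([[0.0, 0.0], [1.0, 1.0]])
q_subs = [np.array([1.0, 1.0]), np.array([0.0, 0.0])]
d_exact = global_distance(q_subs, [1, 0], sub_words)   # codewords match exactly
d_swap = global_distance(q_subs, [0, 1], sub_words)    # codewords swapped
loc = local_similarity([np.array([0.0, 0.0])],
                       [np.array([0.0, 0.0]), np.array([5.0, 5.0])])
d_total = combined_distance(1.0, 2.0, alpha=0.6)
```

Because only the candidate set survives the global stage, the more expensive feature-by-feature local comparison runs over at most 10 texts, which is what makes the multi-granularity comparison tractable.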
The beneficial effects of the present invention are:
The present invention models word-vector representations by jointly exploiting multi-dimensional semantic correlations, fully mining the semantic information between words and obtaining word vectors that perform better on semantic correlation; features are extracted per sentence, the unit with richer and more complete semantic information, and characterized with multiple weights; statistical-learning methods mine the corpus statistics and distribution to achieve a finer division of the feature space, after which compact, highly discriminative text fingerprints are generated by multi-feature aggregation, effectively improving the descriptive power and discrimination of text fingerprints. Following a top-down strategy, text similarity is compared using the semantic fusion fingerprint and the local semantic features; by building a hierarchical index, multi-granularity comparison from the global level down to the local level can be performed quickly and efficiently. Experiments verify that the semantic fusion fingerprint of the invention effectively improves the accuracy of global text similarity comparison and, thanks to the hierarchical index, effectively improves the efficiency of text similarity comparison. The invention has good scalability, is suitable for similarity comparison over massive texts, satisfies users' demands for efficient multi-granularity similarity comparison, and greatly improves user experience.
Brief description of the drawings
Fig. 1 is the framework diagram of multi-granularity text similarity comparison in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the word-vector representation model in an embodiment of the present invention;
Fig. 3 is the block diagram of semantic fusion fingerprint generation in an embodiment of the present invention, with the semantic feature space divided into 8 sub-partitions;
Fig. 4 is the construction-process diagram of the hierarchical-index inverted file in an embodiment of the present invention;
Fig. 5 is the flow chart of multi-granularity text similarity comparison in an embodiment of the present invention;
Fig. 6 is the text comparison result diagram in an embodiment of the present invention.
Specific embodiments
The invention is further described below with reference to an embodiment and the accompanying drawings:
Embodiment:
As shown in Fig. 1, the multi-granularity text similarity comparison method based on semantic fusion fingerprints of the present invention comprises the following steps:
Step 1, word-vector representation training: jointly model word-vector learning with integrated multi-dimensional semantic correlations. Using a public corpus, text-to-word mappings represent the horizontal (co-occurrence) relationship between words within a context, while word-to-context mappings represent the vertical relationship between words that share similar contexts; synonym and antonym information is added to the model. Word-vector training is performed by unsupervised learning, so that the resulting word vectors perform better on semantic-correlation and synonym/antonym identification tasks. Fig. 2 shows the word-vector representation model.
In this step, the integrated multi-dimensional semantic-correlation joint modeling represents each word as a K-dimensional vector; the objective function of word-vector learning is expressed as follows:
Here, N denotes the number of texts in the training corpus; d_n denotes the n-th text; w_i^n denotes the word vector of the i-th word in the n-th text; x_i^n denotes the vector obtained by summing the context of w_i^n; p(w_i^n | x_i^n) and p(w_i^n | d_n) denote, respectively, the probability of the word w_i^n appearing in its context and the probability of it appearing in the text; SYN_i^n and ANT_i^n denote the synonym set and antonym set of w_i^n; p(w_i^n | u) denotes the probability of w_i^n appearing given a known synonym or antonym u; α is a weight factor with 0 < α < 1. Training maximizes the objective function and is solved by stochastic gradient ascent.
Step 2, semantic feature extraction: taking the sentence, which is richer in semantic information, as the basic unit, each sentence is preprocessed (word segmentation, stop-word removal); each token is then represented with the word-vector representation method described in step 1 above, characterized with multiple weights including term frequency and part of speech, and the semantic feature is computed as the weighted sum of the token word vectors, realizing semantic feature extraction.
This step proceeds as follows:
First, the text is preprocessed: it is segmented at punctuation marks to obtain a sentence set {S_1, S_2, ..., S_M}, where M is the number of sentences in the text; each sentence is word-segmented and stop words are removed, giving {c_1, c_2, ..., c_T}, where T is the number of tokens in the sentence. The token weight is characterized as the product of the term-frequency weight ω_f and the part-of-speech weight ω_n, i.e. ω_c = ω_f × ω_n; regarding part of speech, nouns carry the highest weight, verbs the second, adjectives the third, and all others the lowest;
Then, each token is represented by its word vector, and the semantic feature is expressed as the weighted sum of the token word vectors:
Here, f_{i,k} denotes the k-th dimension of the feature of the i-th sentence; w_{i,j,k} and ω_{i,j} denote, respectively, the k-th component of the word vector of the j-th token of the i-th sentence and the corresponding token weight; I(·) is an indicator function whose value is 1 when w_{i,j,k} > 0 and −1 otherwise. The text is then represented as the set of M semantic features {f_1, f_2, ..., f_M}.
Step 3, multi-feature aggregation: as shown in Fig. 3, the semantic features of the training corpus are clustered using their statistics and distribution characteristics, dividing the semantic feature space into multiple sub-partitions and achieving a finer division of the feature space. Each semantic feature of a text is assigned, by its distance to the cluster centres, to the nearest sub-partition; within each sub-partition the residuals between the assigned semantic features and the cluster centre are summed, and the residual sums of all sub-partitions are concatenated to generate the semantic fusion fingerprint.
First, the semantic features of the training corpus are clustered into L classes with the K-means algorithm, giving cluster centres C = {μ_1, μ_2, ..., μ_L}, each cluster corresponding to one sub-partition of the semantic feature space;
Then, using the statistics and distribution of the semantic features relative to the semantic feature space, the semantic fusion text fingerprint is generated: the distance between each semantic feature f_i of the text and the cluster centre of every sub-partition is computed, and the feature is assigned to the nearest sub-partition:
Id(f_i) = arg min_j ||f_i - μ_j||_2,  i = 1, 2, ..., M,  j = 1, 2, ..., L
where Id(f_i) denotes the index of the sub-partition to which the semantic feature is assigned, and μ_j denotes the cluster centre of the j-th sub-partition;
Finally, the sum of the residuals between the semantic features belonging to the same sub-partition and its cluster centre is computed:
Here, f_j : Id(f_j) = i denotes a semantic feature f_j assigned to the i-th sub-partition. The residual sums of all sub-partitions are concatenated into a K × L-dimensional vector V_d = [v_1, v_2, ..., v_L], which is the semantic fusion fingerprint; a text is ultimately represented as one semantic fusion fingerprint V_d plus M semantic features {f_1, f_2, ..., f_M}.
By assigning the semantic features of the text to be processed to their nearest sub-partitions, the distribution of the text's semantic features relative to the corpus's semantic features is obtained; the residual sums within the sub-partitions are concatenated into a K × L-dimensional vector, the semantic fusion fingerprint. Compared with traditional fingerprint algorithms, which in effect quantize the whole text feature corpus as one big cluster centred at the origin, the method of the invention partitions the feature space more finely; meanwhile, word-vector representations that better characterize multi-dimensional semantic correlations replace word-hash representations, so the descriptive power of the text fingerprint is effectively improved.
Step 4, hierarchical index construction: cluster the semantic fusion fingerprints of the training corpus to form the first index layer; then split each fingerprint residual into sub-vectors and cluster them again to build the second index layer, obtaining the hierarchical index. All semantic fusion fingerprints of the test corpus are quantized onto the hierarchical index, generating the corresponding inverted file.
This step proceeds as follows:
First, the semantic fusion fingerprints of the training corpus are clustered with the K-means algorithm, and the resulting cluster centres serve as semantic fusion fingerprint words, constituting the first index layer;
Then, each semantic fusion fingerprint of the training corpus is quantized onto its nearest fingerprint word according to their distance, and its difference from that fingerprint word is computed as the fingerprint residual. Each fingerprint residual is split evenly into L sub-vectors of K dimensions; K-means is applied to the sub-vectors to obtain D cluster centres, i.e. D sub-vector words, completing the construction of the second index layer;
Finally, as shown in Fig. 4, according to the hierarchical index, the inverted file is generated for the semantic fusion fingerprints of the test corpus: on the first layer, each fingerprint is quantized onto its nearest fingerprint word according to their distance, and its fingerprint residual is computed; the residual is split into L sub-vectors, and the distance from each sub-vector to the sub-vector words is computed to obtain the ID of the nearest sub-vector word. In the hierarchical index, the text information is stored under the corresponding fingerprint-word entry of the fingerprint dictionary; the stored content comprises the text ID and the sub-vector word ID corresponding to each sub-vector.
Step 5: as shown in figure 5, similarity calculation: according to the more granularity alignment algorithms indexed based on level, using from upper Calculation under and first calculates the global similitude of text in text and text library to be compared, when the similitude is greater than setting Threshold value when, be added into Similar Text alternative collection;The local similarity with text in alternative text set is calculated later, in turn Final Similar Text and its specific local Similar content are obtained, the more granularity similarity comparison results of text are obtained.
The specific method is as follows for this step:
When carrying out global fingerprint similarity comparison, calculating text semantics fusion fingerprint and quantization to the same finger of index first The similarity distance of the semantics fusion fingerprint of line word, is measured using non symmetrical distance, is selected apart from nearest fingerprint Word calculates the fingerprint surplus of the semantics fusion fingerprint and corresponding fingerprint word, is split to the semantics fusion fingerprint surplus Generate { v1,v2,...,vL};The distance for calculating each subvector and each subvector word again, generates corresponding distance matrix; Text fingerprints to be compared and quantization are calculated to the global distance between i-th of semantics fusion text fingerprints on same fingerprint word
Wherein, vq,jIndicate j-th of subvector of text to be compared, vid(i),jIndicate the finger that text to be compared quantifies The corresponding subvector word of j-th of subvector of i-th of semantics fusion text fingerprints on line word;
It is ranked up according to obtained similarity distance, chooses preceding 10 similarities apart from minimum text fingerprints as standby Selected works;The similitude of the local semantic feature of i-th of text in text to be compared and alternative collection is calculated again
Wherein, dtAnd diRespectively indicate the semantic feature number of i-th of text in text to be compared and alternative collection, ft qWith Respectively indicate j-th of semantic feature of i-th of text in t-th of the semantic feature and alternative collection of text to be measured;
The similarity distance between texts is the weighted sum of the global similarity and the local similarity:

Sim = α · Sim_global + (1 − α) · Sim_local

where α is a weight factor, 0 < α < 1. The final similar texts are obtained from the similarity distance between texts, and similar local content can meanwhile be obtained from the local fingerprint similarity, yielding the multi-granularity text similarity comparison result.
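The top-down comparison described above (asymmetric subvector distances followed by a weighted combination of global and local scores) can be sketched as follows. This is a minimal illustration, not the patented implementation; the function and parameter names (`global_distance`, `combined_similarity`, `alpha`) and the use of squared Euclidean distances are assumptions:

```python
import numpy as np

def global_distance(query_subvectors, subvector_words, assignment):
    """Asymmetric distance: the query's residual subvectors are compared
    against the quantized subvector words of a stored fingerprint.
    assignment[j] holds the subvector-word ID for the j-th subvector."""
    return sum(
        np.sum((q - subvector_words[j][assignment[j]]) ** 2)
        for j, q in enumerate(query_subvectors)
    )

def combined_similarity(sim_global, sim_local, alpha=0.5):
    """Weighted sum of global and local similarity, with 0 < alpha < 1."""
    return alpha * sim_global + (1 - alpha) * sim_local
```

A query would first rank stored fingerprints by `global_distance`, keep the closest candidates, and then blend in the sentence-level score with `combined_similarity`.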
To verify the effect of the invention, the SogouCS news corpus was selected as the test text library, containing 18 classes of news from the Sohu website, such as domestic, international, sports, society and entertainment, from between June and July 2012. 1,000 news items were chosen and modified manually, giving 5 approximate texts per news item (including the news item itself). The SogouCA news corpus was used as the training text set, for training the partition of the semantic feature space and the construction of the hierarchical index. In the experiment, 50,000 of the remaining SogouCA news items were randomly selected as a distractor set. The semantic features of all texts in the training and test sets were extracted in advance; the training set contains 14,744,203 semantic features. The semantic features of the training set were then randomly sampled and clustered with the K-Means algorithm, producing clusters of different sizes and hence semantic aggregation fingerprints of different dimensions. The semantic aggregation fingerprint is denoted SAF; the dictionary size L was set to 8, 16, 32, 64 and 128, and the word-vector dimension K to 8, 16, 32, 64, 128 and 256. Every text in the test set was tested as a text to be compared.
The comparison accuracy of the global text fingerprints is evaluated directly first. Table 1 below shows the similarity comparison accuracy of SAF under different parameters and of the method from the reference documents. Compared with the Simhash method of the reference documents, the semantic aggregation fingerprint proposed by the present invention increases the accuracy markedly when the word-vector dimension parameter K is set consistently.
Table 1
K Simhash L=8 L=16 L=32 L=64 L=128
8 39.00 92.22 92.54 92.86 92.56 93.14
16 78.68 92.96 94.02 94.48 95.38 94.52
32 88.64 93.84 94.94 96.16 96.82 96.80
64 90.42 94.76 96.32 97.48 97.64 97.98
128 91.00 95.60 97.22 97.84 98.68 98.14
256 91.00 96.16 97.88 98.18 98.92 97.86
When similarity comparison is performed with the multi-granularity comparison algorithm based on the hierarchical index, the comparison efficiency is improved by 87.93% relative to directly comparing text semantic features, and the method scales well. Figure 6 gives a sample of text comparison results; Table 2 shows part of the comparison results more clearly, with similarities kept to 8 decimal places. It can be seen that the invention correctly identifies sentences with partial content replaced or reordered, and with slight modifications of some conjunctions and keywords. When the keywords of a sentence are modified more heavily, the similarity between the sentences is low from the viewpoint of both sentence structure and semantics; the system reports a similarity of 0 between such sentences, which accords with the actual situation. The invention therefore effectively realizes multi-granularity text similarity comparison.
Table 2
The above embodiment is a preferred embodiment of the present invention and does not limit the technical solution of the present invention; any technical solution that can be realized on the basis of the above embodiment without creative work shall be regarded as falling within the scope of protection of the patent rights of the present invention.

Claims (6)

1. A multi-granularity text similarity comparison method based on semantic aggregation fingerprints, characterized by comprising the following steps:
Step 1: word-vector training: word vectors are learned by joint modelling that integrates multi-dimensional semantic relations, i.e., using an open corpus, the horizontal co-occurrence relation between words within a context is represented through the text-word and word-context mappings, the longitudinal relation between words with similar contexts is represented, and synonym and antonym information is added to the model; word-vector training is carried out by an unsupervised learning method, so that the trained word vectors perform better on semantic-relatedness and synonym-identification tasks;
Step 2: semantic feature extraction: taking the sentence, which carries richer semantic information, as the basic unit, each sentence is pre-processed by operations including word segmentation and stop-word removal; each segmented word is then represented by its word vector and characterized by multiple weights including term frequency and part of speech; the weighted sum of the word vectors of the segmented words realizes the semantic feature extraction;
Step 3: multi-feature aggregation: clustering is performed using the statistics and distribution characteristics of the semantic features in the training library, dividing the semantic feature space into multiple sub-regions and thereby achieving a finer partition of the feature space; according to the distances between the semantic features of a text and the cluster centres, each semantic feature is assigned to the nearest sub-region; the sum of the residuals between the semantic features and the cluster centre is computed in each sub-region, and the multiple residual sums of the sub-regions are concatenated to generate the semantic aggregation fingerprint;
Step 4: hierarchical index construction: the semantic aggregation fingerprints in the training library are clustered to form the first-layer index; each fingerprint residual is then split, and the subvectors obtained from the split are clustered again to complete the construction of the second-layer index, yielding the hierarchical index; all semantic aggregation fingerprints of the texts in the test text library are quantized onto the hierarchical index to generate the corresponding inverted file;
Step 5: similarity calculation: following the multi-granularity comparison algorithm based on the hierarchical index, a top-down calculation is used: the global similarity between the text to be compared and each text in the text library is computed first, and when this similarity exceeds a set threshold the text is added to the candidate set of similar texts; the local similarity with each text in the candidate set is then computed, yielding the final similar texts and their specific locally similar content, i.e., the multi-granularity text similarity comparison result.
2. The multi-granularity text similarity comparison method based on semantic aggregation fingerprints according to claim 1, characterized in that: in Step 1, each word is represented as a K-dimensional vector by the joint modelling integrating multi-dimensional semantic relations, and the objective function of word-vector learning is expressed as follows:
where N denotes the number of texts in the training library, d_n denotes the n-th text, w_i^n denotes the word vector of the i-th word in the n-th text, s_i^n denotes the vector of the context of word w_i^n obtained by summation, P(w_i^n | s_i^n) and P(w_i^n | d_n) respectively denote the probability of word w_i^n appearing in the context and the probability of word w_i^n appearing in the text, SYN_i^n and ANT_i^n respectively denote the synonym set and the antonym set of the word, P(w_i^n | u) denotes the probability of word w_i^n appearing given a known synonym or antonym u, and α is a weight factor, 0 < α < 1; training maximizes the objective function and is solved by stochastic gradient ascent.
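The objective function itself is not reproduced in this text. One plausible LaTeX rendering consistent with the definitions above, with the combination, sign and weighting of the synonym and antonym terms assumed for illustration, is:

```latex
\max \sum_{n=1}^{N}\sum_{i}\Big[
  \log P\!\left(w_i^{n}\mid s_i^{n}\right)
  + \log P\!\left(w_i^{n}\mid d_n\right)
  + \alpha \sum_{u\in \mathrm{SYN}_i^{n}} \log P\!\left(w_i^{n}\mid u\right)
  - \alpha \sum_{u\in \mathrm{ANT}_i^{n}} \log P\!\left(w_i^{n}\mid u\right)
\Big]
```

The first two terms capture the context and document relations, while the last two reward agreement with synonyms and penalize agreement with antonyms, weighted by α.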
3. The multi-granularity text similarity comparison method based on semantic aggregation fingerprints according to claim 1, characterized in that the specific method of Step 2 is as follows:
The text is pre-processed first: it is split on punctuation marks, giving a sentence set {S_1, S_2, ..., S_M}, where M denotes the number of sentences in the text; each sentence is segmented into words and stop words are removed, giving a representation {c_1, c_2, ..., c_T}, where T denotes the number of segmented words in the sentence; each segmented word is characterized by the product of its term-frequency weight ω_f and its part-of-speech weight ω_ni, i.e. ω_c = ω_f × ω_ni; among the part-of-speech weights, nouns have the highest weight, verbs the second, adjectives the third, and all others the lowest;
Then each segmented word is represented by its word vector, and the semantic feature is expressed as the weighted sum of the word vectors of the segmented words:

f_{i,k} = Σ_{j=1}^{T} ω_{i,j} · I(w_{i,j,k})

where f_{i,k} denotes the value of the k-th dimension of the i-th sentence, w_{i,j,k} and ω_{i,j} respectively denote the value of the k-th dimension of the j-th segmented word of the i-th sentence and the weight of that word; I(·) is the indicator function, equal to 1 when w_{i,j,k} > 0 and −1 otherwise; the text is finally represented as a set of M semantic features {f_1, f_2, ..., f_M}.
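The feature construction in this claim (a weight-signed sum over word-vector dimensions) can be sketched as follows; reading the definitions above as f_{i,k} = Σ_j ω_{i,j}·I(w_{i,j,k}) is an interpretation, and the names `sentence_feature`, `word_vectors` and `weights` are illustrative:

```python
import numpy as np

def sentence_feature(word_vectors, weights):
    """word_vectors: (T, K) array, one row per segmented word;
    weights: (T,) array of per-word weights (term frequency x POS weight).
    Each output dimension sums the word weights signed by I(w > 0)."""
    signs = np.where(word_vectors > 0, 1.0, -1.0)  # I(.) = 1 if > 0, else -1
    return (weights[:, None] * signs).sum(axis=0)  # (K,) semantic feature
```

A text would then be represented as the list of such features, one per sentence.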
4. The multi-granularity text similarity comparison method based on semantic aggregation fingerprints according to claim 1, characterized in that the specific method of Step 3 is as follows:
First, using the K-means algorithm, the semantic features in the training text library are clustered into L classes, with the cluster centres representing the clusters as C = {μ_1, μ_2, ..., μ_L}; each cluster corresponds to a sub-partition of the semantic feature space;
Then, using the statistics and distribution information of the semantic features relative to the semantic feature space, the semantic aggregation text fingerprint is generated: the distance between each semantic feature f_i of the text and the cluster centre of each sub-partition is computed, and the feature is assigned to the nearest sub-partition:

Id(f_i) = arg min_j ||f_i − μ_j||², i = 1, 2, ..., M, j = 1, 2, ..., L

where Id(f_i) denotes the index of the sub-partition to which the semantic feature is assigned, and μ_j denotes the cluster centre of the j-th sub-partition;
Finally, the sum of the differences between the semantic features belonging to the same sub-partition and its cluster centre is computed:

v_i = Σ_{j: Id(f_j)=i} (f_j − μ_i)

where f_j: Id(f_j) = i denotes the j-th semantic feature assigned to the i-th sub-partition; the residual sums of the sub-partitions are concatenated to generate a K × L-dimensional vector V_d = [v_1, v_2, ..., v_L], which is the semantic aggregation fingerprint; the text is ultimately represented as one semantic aggregation fingerprint V_d together with M semantic features {f_1, f_2, ..., f_M}.
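The residual aggregation in this claim follows a VLAD-style scheme (assign each feature to its nearest centre, accumulate the residuals per centre, concatenate). A minimal sketch under that reading, with all names illustrative:

```python
import numpy as np

def semantic_aggregation_fingerprint(features, centers):
    """features: (M, K) semantic features of one text;
    centers: (L, K) cluster centres of the sub-partitions.
    Returns the K*L-dimensional fingerprint V_d = [v_1, ..., v_L]."""
    L, K = centers.shape
    fingerprint = np.zeros((L, K))
    # squared distances of every feature to every centre, then assignment
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)
    # accumulate residuals f - mu in the assigned sub-partition
    for f, i in zip(features, ids):
        fingerprint[i] += f - centers[i]
    return fingerprint.ravel()
```

The centres themselves would come from K-means over the training library's features, as described in the claim.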
5. The multi-granularity text similarity comparison method based on semantic aggregation fingerprints according to claim 1, characterized in that the specific method of Step 4 is as follows:
First, the semantic aggregation fingerprints in the training text library are clustered with the K-means algorithm; the resulting cluster centres serve as the semantic aggregation fingerprint words and constitute the first-layer index;
Then each semantic aggregation fingerprint in the training text library is quantized, according to its distance from the fingerprint words, onto the nearest fingerprint word, and its difference from that fingerprint word is computed as the fingerprint residual; each fingerprint residual is split evenly into L K-dimensional subvectors; K-means is applied to the subvectors, giving D cluster centres, i.e. D subvector words, which completes the construction of the second-layer index;
Finally, the inverted file is generated for the semantic aggregation fingerprints in the test text library according to the hierarchical index: on the first-layer index, each semantic aggregation fingerprint is quantized onto the nearest fingerprint word according to its distance from the fingerprint words, and its fingerprint residual is computed; the fingerprint residual is split into L subvectors, the distance between each subvector and the subvector words is computed, and the ID of the nearest subvector word is obtained; in the hierarchical index, the text information is stored under the corresponding fingerprint-word entry of the fingerprint dictionary, the stored content comprising the text ID and the subvector-word ID corresponding to each subvector.
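Quantizing one fingerprint onto the two-layer index and posting it into the inverted file can be sketched as follows; the dict-based file layout and all function and variable names are assumptions for illustration:

```python
import numpy as np
from collections import defaultdict

def post_to_inverted_file(inverted, text_id, fingerprint,
                          fingerprint_words, subvector_words, L):
    """fingerprint: (K*L,) vector; fingerprint_words: (W, K*L) first layer;
    subvector_words: list of L arrays, each (D, K), the second layer;
    inverted: dict mapping fingerprint-word ID -> list of postings."""
    # first layer: nearest fingerprint word and the fingerprint residual
    word_id = int(((fingerprint_words - fingerprint) ** 2).sum(1).argmin())
    residual = fingerprint - fingerprint_words[word_id]
    # second layer: split the residual into L subvectors, quantize each
    sub_ids = [
        int(((subvector_words[j] - sub) ** 2).sum(1).argmin())
        for j, sub in enumerate(np.split(residual, L))
    ]
    # posting: text ID plus the subvector-word ID of each subvector
    inverted[word_id].append((text_id, sub_ids))
    return word_id, sub_ids
```

At query time the same quantization path selects the posting list to scan, so only fingerprints sharing the query's fingerprint word are compared.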
6. The multi-granularity text similarity comparison method based on semantic aggregation fingerprints according to claim 1, characterized in that the specific method of Step 5 is as follows:
When performing the global fingerprint similarity comparison, the similarity distance between the semantic aggregation fingerprint of the text to be compared and the semantic aggregation fingerprints quantized onto the same index entry is computed first, measured with an asymmetric distance: the nearest fingerprint word is selected, the fingerprint residual between the semantic aggregation fingerprint and that fingerprint word is computed, and the residual is split to generate {v_1, v_2, ..., v_L}; the distance between each subvector and each subvector word is then computed, producing the corresponding distance matrix; the global distance between the fingerprint of the text to be compared and the i-th semantic aggregation text fingerprint quantized onto the same fingerprint word is computed, where v_{q,j} denotes the j-th subvector of the text to be compared, and v_{id(i),j} denotes the subvector word corresponding to the j-th subvector of the i-th semantic aggregation text fingerprint on the fingerprint word onto which the text to be compared is quantized;
The resulting similarity distances are sorted, and the 10 text fingerprints with the smallest distances are chosen as the candidate set; the similarity between the local semantic features of the text to be compared and those of the i-th text in the candidate set is then computed, where d_t and d_i denote the numbers of semantic features of the text to be compared and of the i-th candidate text respectively, and f_t^q and f_j^i denote the t-th semantic feature of the text to be compared and the j-th semantic feature of the i-th candidate text;
The similarity distance between texts is the weighted sum of the global similarity and the local similarity, where α is a weight factor, 0 < α < 1; the final similar texts are obtained from the similarity distance between texts, and similar local content can meanwhile be obtained from the local fingerprint similarity, yielding the multi-granularity text similarity comparison result.
CN201910441282.1A 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints Active CN110321925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910441282.1A CN110321925B (en) 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910441282.1A CN110321925B (en) 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Publications (2)

Publication Number Publication Date
CN110321925A true CN110321925A (en) 2019-10-11
CN110321925B CN110321925B (en) 2022-11-18

Family

ID=68119119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910441282.1A Active CN110321925B (en) 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Country Status (1)

Country Link
CN (1) CN110321925B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750616A (en) * 2019-10-16 2020-02-04 网易(杭州)网络有限公司 Retrieval type chatting method and device and computer equipment
CN110909550A (en) * 2019-11-13 2020-03-24 北京环境特性研究所 Text processing method and device, electronic equipment and readable storage medium
CN110956039A (en) * 2019-12-04 2020-04-03 中国太平洋保险(集团)股份有限公司 Text similarity calculation method and device based on multi-dimensional vectorization coding
CN110990538A (en) * 2019-12-20 2020-04-10 深圳前海黑顿科技有限公司 Semantic fuzzy search method based on sentence-level deep learning language model
CN111381191A (en) * 2020-05-29 2020-07-07 支付宝(杭州)信息技术有限公司 Method for synonymy modifying text and determining text creator
CN111461109A (en) * 2020-02-27 2020-07-28 浙江工业大学 Method for identifying documents based on environment multi-type word bank
CN111694952A (en) * 2020-04-16 2020-09-22 国家计算机网络与信息安全管理中心 Big data analysis model system based on microblog and implementation method thereof
CN111859635A (en) * 2020-07-03 2020-10-30 中国人民解放军海军航空大学航空作战勤务学院 Simulation system based on multi-granularity modeling technology and construction method
CN112287669A (en) * 2020-12-28 2021-01-29 深圳追一科技有限公司 Text processing method and device, computer equipment and storage medium
CN113111645A (en) * 2021-04-28 2021-07-13 东南大学 Media text similarity detection method
CN113313180A (en) * 2021-06-04 2021-08-27 太原理工大学 Remote sensing image semantic segmentation method based on deep confrontation learning
CN115935195A (en) * 2022-11-08 2023-04-07 华院计算技术(上海)股份有限公司 Text matching method and device, computer readable storage medium and terminal
CN116129146A (en) * 2023-03-29 2023-05-16 中国工程物理研究院计算机应用研究所 Heterogeneous image matching method and system based on local feature consistency

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090119572A1 (en) * 2007-11-02 2009-05-07 Marja-Riitta Koivunen Systems and methods for finding information resources
US20150120720A1 (en) * 2012-06-22 2015-04-30 Krishna Kishore Dhara Method and system of identifying relevant content snippets that include additional information
CN107423729A (en) * 2017-09-20 2017-12-01 湖南师范大学 A kind of remote class brain three-dimensional gait identifying system and implementation method towards under complicated visual scene
CN108399163A (en) * 2018-03-21 2018-08-14 北京理工大学 Bluebeard compound polymerize the text similarity measure with word combination semantic feature
CN108595706A (en) * 2018-05-10 2018-09-28 中国科学院信息工程研究所 A kind of document semantic representation method, file classification method and device based on theme part of speech similitude
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090119572A1 (en) * 2007-11-02 2009-05-07 Marja-Riitta Koivunen Systems and methods for finding information resources
US20150120720A1 (en) * 2012-06-22 2015-04-30 Krishna Kishore Dhara Method and system of identifying relevant content snippets that include additional information
CN107423729A (en) * 2017-09-20 2017-12-01 湖南师范大学 A kind of remote class brain three-dimensional gait identifying system and implementation method towards under complicated visual scene
CN108399163A (en) * 2018-03-21 2018-08-14 北京理工大学 Bluebeard compound polymerize the text similarity measure with word combination semantic feature
CN108595706A (en) * 2018-05-10 2018-09-28 中国科学院信息工程研究所 A kind of document semantic representation method, file classification method and device based on theme part of speech similitude
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
JADALLA,A: ""A fingerprinting based plagiarism detection system for Arabic text based documents"", 《PROCEEDINGS OF THE 2012 8TH INTERNATIONAL CONFERENCE ON COMPUTING TECHNOLOGY AND INFORMATION MANAGEMENT》 *
MOHAMED ELHOSENY: ""FPSS: Fingerprint-based semantic similarity detection in big data environment"", 《2017 EIGHTH INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND INFORMATION SYSTEMS (ICICIS)》 *
WEN XIA: ""Similarity and Locality Based Indexing for High Performance Data Deduplication"", 《IEEE TRANSACTIONS ON COMPUTERS》 *
刘宏哲: "Research on Text Semantic Similarity Calculation Methods", China Doctoral Dissertations Full-text Database, Information Science and Technology *
刘礼芳: "Social-Network-Based WEB Image Semantic Annotation and Aggregation", China Master's Theses Full-text Database, Information Science and Technology *
姜雪: "Research on Simhash-Based Text Similarity Detection Algorithms", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750616A (en) * 2019-10-16 2020-02-04 网易(杭州)网络有限公司 Retrieval type chatting method and device and computer equipment
CN110909550A (en) * 2019-11-13 2020-03-24 北京环境特性研究所 Text processing method and device, electronic equipment and readable storage medium
CN110909550B (en) * 2019-11-13 2023-11-03 北京环境特性研究所 Text processing method, text processing device, electronic equipment and readable storage medium
CN110956039A (en) * 2019-12-04 2020-04-03 中国太平洋保险(集团)股份有限公司 Text similarity calculation method and device based on multi-dimensional vectorization coding
CN110990538A (en) * 2019-12-20 2020-04-10 深圳前海黑顿科技有限公司 Semantic fuzzy search method based on sentence-level deep learning language model
CN110990538B (en) * 2019-12-20 2022-04-01 深圳前海黑顿科技有限公司 Semantic fuzzy search method based on sentence-level deep learning language model
CN111461109A (en) * 2020-02-27 2020-07-28 浙江工业大学 Method for identifying documents based on environment multi-type word bank
CN111461109B (en) * 2020-02-27 2023-09-15 浙江工业大学 Method for identifying documents based on environment multi-class word stock
CN111694952A (en) * 2020-04-16 2020-09-22 国家计算机网络与信息安全管理中心 Big data analysis model system based on microblog and implementation method thereof
CN111381191B (en) * 2020-05-29 2020-09-01 支付宝(杭州)信息技术有限公司 Method for synonymy modifying text and determining text creator
CN111381191A (en) * 2020-05-29 2020-07-07 支付宝(杭州)信息技术有限公司 Method for synonymy modifying text and determining text creator
CN111859635A (en) * 2020-07-03 2020-10-30 中国人民解放军海军航空大学航空作战勤务学院 Simulation system based on multi-granularity modeling technology and construction method
CN112287669A (en) * 2020-12-28 2021-01-29 深圳追一科技有限公司 Text processing method and device, computer equipment and storage medium
CN113111645A (en) * 2021-04-28 2021-07-13 东南大学 Media text similarity detection method
CN113111645B (en) * 2021-04-28 2024-02-06 东南大学 Media text similarity detection method
CN113313180A (en) * 2021-06-04 2021-08-27 太原理工大学 Remote sensing image semantic segmentation method based on deep confrontation learning
CN115935195B (en) * 2022-11-08 2023-08-08 华院计算技术(上海)股份有限公司 Text matching method and device, computer readable storage medium and terminal
CN115935195A (en) * 2022-11-08 2023-04-07 华院计算技术(上海)股份有限公司 Text matching method and device, computer readable storage medium and terminal
CN116129146A (en) * 2023-03-29 2023-05-16 中国工程物理研究院计算机应用研究所 Heterogeneous image matching method and system based on local feature consistency
CN116129146B (en) * 2023-03-29 2023-09-01 中国工程物理研究院计算机应用研究所 Heterogeneous image matching method and system based on local feature consistency

Also Published As

Publication number Publication date
CN110321925B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN110321925A (en) A kind of more granularity similarity comparison methods of text based on semantics fusion fingerprint
CN106383877B (en) Social media online short text clustering and topic detection method
Wang et al. A hybrid document feature extraction method using latent Dirichlet allocation and word2vec
CN103514183B (en) Information search method and system based on interactive document clustering
CN102207945B (en) Knowledge network-based text indexing system and method
Ni et al. Short text clustering by finding core terms
CN108197111A (en) A kind of text automatic abstracting method based on fusion Semantic Clustering
CN111291188B (en) Intelligent information extraction method and system
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN108268449A (en) A kind of text semantic label abstracting method based on lexical item cluster
CN111368077A (en) K-Means text classification method based on particle swarm location updating thought wolf optimization algorithm
CN106934005A (en) A kind of Text Clustering Method based on density
CN110765755A (en) Semantic similarity feature extraction method based on double selection gates
CN106599072B (en) Text clustering method and device
CN112926340B (en) Semantic matching model for knowledge point positioning
CN112883722B (en) Distributed text summarization method based on cloud data center
Odeh et al. Arabic text categorization algorithm using vector evaluation method
Naeem et al. Development of an efficient hierarchical clustering analysis using an agglomerative clustering algorithm
CN114997288A (en) Design resource association method
CN115248839A (en) Knowledge system-based long text retrieval method and device
Yin et al. Sentence-bert and k-means based clustering technology for scientific and technical literature
Hassan et al. Automatic document topic identification using wikipedia hierarchical ontology
Ding et al. The research of text mining based on self-organizing maps
Zhang et al. Extractive Document Summarization based on hierarchical GRU
Yang et al. Research on improvement of text processing and clustering algorithms in public opinion early warning system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant