CN110321925B - Text multi-granularity similarity comparison method based on semantic aggregated fingerprints - Google Patents

Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Info

Publication number
CN110321925B
Authority
CN
China
Prior art keywords
text
semantic
word
fingerprint
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910441282.1A
Other languages
Chinese (zh)
Other versions
CN110321925A (en
Inventor
梁燕
万正景
陶以政
李龚亮
许峰
曹政
谢杨
马丹阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
COMPUTER APPLICATION RESEARCH INST CHINA ACADEMY OF ENGINEERING PHYSICS
Original Assignee
COMPUTER APPLICATION RESEARCH INST CHINA ACADEMY OF ENGINEERING PHYSICS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by COMPUTER APPLICATION RESEARCH INST CHINA ACADEMY OF ENGINEERING PHYSICS filed Critical COMPUTER APPLICATION RESEARCH INST CHINA ACADEMY OF ENGINEERING PHYSICS
Priority to CN201910441282.1A priority Critical patent/CN110321925B/en
Publication of CN110321925A publication Critical patent/CN110321925A/en
Application granted granted Critical
Publication of CN110321925B publication Critical patent/CN110321925B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/35: Information retrieval of unstructured textual data; Clustering; Classification
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F18/22: Pattern recognition; Matching criteria, e.g. proximity measures
    • G06F18/23213: Non-hierarchical clustering using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering


Abstract

The invention discloses a text multi-granularity similarity comparison method based on semantic aggregated fingerprints, comprising the following steps: training word vector representations; extracting semantic features; performing multi-feature aggregation; constructing a hierarchical index; and calculating similarity. The method models word vector representations jointly with multi-dimensional semantic relevance, fully mining the semantic information among words; it extracts features sentence by sentence, characterizes the semantic features with multiple weights, and uses statistical learning to mine the statistics and distribution information of the text library, achieving a finer division of the feature space. Compact, highly discriminative text fingerprints are generated by multi-feature aggregation, effectively improving the descriptive power and discriminability of the text fingerprints. Following a top-down strategy, the semantic aggregated fingerprint and the local semantic features are used together for text similarity comparison, and the hierarchical index enables fast, efficient multi-granularity comparison from the whole text down to local text. The method also scales well.

Description

Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
Technical Field
The invention relates to a text similarity comparison method, in particular to a text multi-granularity similarity comparison method based on semantic aggregated fingerprints, and belongs to the technical field of pattern recognition and information processing.
Background
Two texts being approximate means that the content and information they describe are similar or even identical. Two texts are considered similar if one is generated from the other by modifying a small part of its content through insertions, deletions, replacements and the like. The spread of near-duplicate texts or web pages is generally undesirable, and as data proliferates, the problems caused by approximate texts become more and more severe. Approximate text detection is therefore an important technology for reducing storage overhead, improving search efficiency and data utilization, and deterring plagiarism.
Experts and scholars at home and abroad have proposed a variety of methods for this task. Traditional text similarity comparison falls mainly into two categories: one is based on string comparison; the other is based on word frequency statistics and, on top of a vector space model, represents texts as feature vectors and measures the similarity between texts by the similarity distance between their vectors. The former may employ strings of different granularity, such as sentence-level or paragraph-level strings. However, since a text usually contains a large number of strings, string matching can hardly avoid poor real-time performance on large volumes of long texts.
Among these, some methods treat a text as a set of shingles, where a shingle is a contiguous subsequence of the text; these inevitably suffer from high computational cost and are essentially unable to handle massive data. Others map the words of each text to simple hash values using an existing dictionary; although this effectively reduces the cost of similarity computation, the resulting text fingerprints are unstable: when the dictionary does not sufficiently cover the words in the text, small changes to the text cause the hash values to fluctuate.
The Simhash algorithm proposed by Charikar is currently regarded as one of the best and most efficient approximate text detection algorithms. Simhash performs probabilistic dimensionality reduction on high-dimensional data, mapping high-dimensional text feature vectors to fingerprints with few, fixed bits. Most current approximate text detection systems are built on Simhash. However, these methods attend only to the text itself and ignore the useful information in the text library, and Simhash generates a single fingerprint per text, so it cannot compare the local similarity of two texts.
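For background, the core of Charikar's Simhash can be sketched in a few lines. This is a generic illustration of the classic scheme, not the invention's method; the MD5-based token hash and the 64-bit width are illustrative choices:

```python
import hashlib

def simhash(weighted_features, bits=64):
    """Classic Simhash: map weighted text features to a fixed-width fingerprint.

    weighted_features: iterable of (token, weight) pairs.
    """
    v = [0.0] * bits
    for token, weight in weighted_features:
        # Hash each token to a `bits`-wide integer (MD5 prefix, illustrative).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:bits // 8], "big")
        for i in range(bits):
            # Add the weight where bit i is set, subtract it otherwise.
            v[i] += weight if (h >> i) & 1 else -weight
    # Keep only the sign of each accumulated component.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Hamming distance between two fingerprints; small distance means similar texts."""
    return bin(a ^ b).count("1")
```

This makes the limitation noted above concrete: one text yields exactly one integer fingerprint, so only whole-text similarity can be compared.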
Disclosure of Invention
The invention aims to solve the problems and provide a text multi-granularity similarity comparison method based on semantic aggregated fingerprints.
The invention achieves the above purpose through the following technical scheme:
a text multi-granularity similarity comparison method based on semantic aggregated fingerprints comprises the following steps:
step one, word vector representation training: modeling word vector learning jointly with multi-dimensional semantic relevance, i.e., using a public corpus, exploiting both the horizontal relation between a text and its words (words co-occurring in context) and the longitudinal relation between a context and its words (words sharing similar contexts), adding synonym and antonym information to the model, and training the word vector representations by unsupervised learning, so that the trained word vectors perform better on semantic relevance and synonym/antonym recognition tasks;
step two, semantic feature extraction: taking the sentence, which carries richer semantic information, as the unit; after preprocessing each sentence, including word segmentation and stop-word removal, each participle is represented by its word vector and characterized by a combination of multiple weights, including term frequency and part of speech; semantic feature extraction is realized by computing the weighted sum of the participle word vectors;
step three, multi-feature aggregation: clustering by the statistics and distribution characteristics of the semantic features in a training library and dividing the semantic feature space into several partitions, achieving a finer division of the feature space; assigning each semantic feature of a text to its nearest partition according to its distance to the cluster centers, computing in each partition the sum of the residuals between the semantic features and the cluster center, and aggregating the residual sums of the sub-partitions to generate a semantic aggregated fingerprint;
step four, hierarchical index construction: clustering the semantic aggregated fingerprints in the training library to form a first-layer index; splitting the residual of each fingerprint and clustering the resulting sub-vectors to build a second-layer index, yielding a hierarchical index; quantizing the semantic aggregated fingerprints of the texts in the test text library onto the hierarchical index to generate the corresponding inverted files;
step five, similarity calculation: following the hierarchical-index-based multi-granularity comparison algorithm in a top-down manner, first computing the global similarity between the text to be compared and each text in the text library, and adding a text to the candidate set of similar texts when its similarity exceeds a set threshold; then computing the local similarity against the texts in the candidate set to obtain the final similar texts and their specific locally similar content, i.e., the text multi-granularity similarity comparison result.
Preferably, in the first step, the joint modeling of multi-dimensional semantic relevance represents each word as a K-dimensional vector, and the objective function of word vector learning is expressed as:

$$\mathcal{L}=\sum_{n=1}^{N}\sum_{i}\Big[\log p\big(w_i^n\mid \mathrm{ctx}(w_i^n)\big)+\log p\big(w_i^n\mid d_n\big)+\alpha\sum_{u\in \mathrm{SYN}_i^n}\log p\big(w_i^n\mid u\big)-\alpha\sum_{u\in \mathrm{ANT}_i^n}\log p\big(w_i^n\mid u\big)\Big]$$

where N is the number of texts in the training library, d_n is the nth text, w_i^n is the word vector of the ith word in the nth text, ctx(w_i^n) is the vector of the context of w_i^n obtained by summation, p(w_i^n | ctx(w_i^n)) and p(w_i^n | d_n) are the probabilities of the word appearing given its context and given the text, SYN_i^n and ANT_i^n are the synonym set and antonym set of w_i^n, p(w_i^n | u) is the probability of the word appearing when a known synonym or antonym u is given, and α (0 < α < 1) is a weight factor. Training maximizes the objective function and is solved by stochastic gradient ascent.
Preferably, the specific method of the second step is as follows:

firstly, the text is preprocessed: it is split on punctuation marks into a sentence set {S_1, S_2, ..., S_M}, where M is the number of sentences in the text; each sentence is segmented and stop words are removed, giving {c_1, c_2, ..., c_T}, where T is the number of participles in the sentence; the product of the term-frequency weight ω_f and the part-of-speech weight ω_ni characterizes the participle weight ω_c = ω_f × ω_ni; among parts of speech, nouns carry the highest weight, then verbs, then adjectives, and the rest the lowest;

then, each participle is represented by its word vector, and the semantic feature is the weighted, sign-indicated sum over the participle word vectors:

$$f_{i,k}=\sum_{j=1}^{T}\omega_{i,j}\cdot I(w_{i,j,k})$$

wherein f_{i,k} is the value of the kth dimension of the feature of the ith sentence, w_{i,j,k} and ω_{i,j} are the kth dimension value of the jth participle of the ith sentence and the weight of that participle, and I(·) is an indicator function whose value is 1 when w_{i,j,k} > 0 and −1 otherwise; the text is represented as a set of M semantic features {f_1, f_2, ..., f_M}.
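A minimal sketch of this extraction step in Python, assuming word vectors and per-participle weights are given as plain lists. The numeric part-of-speech weights are assumptions; the patent fixes only the ordering noun > verb > adjective > rest:

```python
# Assumed part-of-speech weights; the patent specifies only the ordering
# noun > verb > adjective > others, not the numeric values.
POS_WEIGHT = {"noun": 1.0, "verb": 0.8, "adj": 0.6}

def participle_weight(term_freq, pos):
    """Participle weight omega_c = omega_f * omega_ni (values above are illustrative)."""
    return term_freq * POS_WEIGHT.get(pos, 0.4)

def sentence_feature(word_vectors, weights):
    """f_{i,k} = sum_j omega_{i,j} * I(w_{i,j,k}): add the participle weight
    when the k-th component of its word vector is positive, subtract otherwise."""
    dim = len(word_vectors[0])
    feat = [0.0] * dim
    for vec, w in zip(word_vectors, weights):
        for k in range(dim):
            feat[k] += w if vec[k] > 0 else -w
    return feat
```

One feature vector of the word-vector dimension is produced per sentence, so a text of M sentences yields the set {f_1, ..., f_M} described above.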
Preferably, the specific method of the third step is as follows:

firstly, the semantic features in the training text library are clustered into L classes with the K-means algorithm, represented by the cluster centers C = {μ_1, μ_2, ..., μ_L}; each cluster corresponds to one sub-partition of the semantic feature space. K-means is a hard, prototype-based clustering algorithm with an objective function: it takes a distance from the data points to the prototypes as the objective to be optimized and obtains the iterative update rule by extremizing that function.

Then, the semantic aggregated text fingerprint is generated from the statistics and distribution of the semantic features relative to the semantic feature space: the distance between each semantic feature f_i of the text and the cluster center of each partition is computed, and the feature is assigned to the nearest sub-partition:

$$\mathrm{Id}(f_i)=\arg\min_j \lVert f_i-\mu_j\rVert^2,\quad i=1,2,\ldots,M,\; j=1,2,\ldots,L$$

wherein Id(f_i) is the index of the sub-partition to which the semantic feature is assigned and μ_j is the cluster center of the jth sub-partition.

Finally, the sum of the residuals between the semantic features belonging to the same partition and its cluster center is computed:

$$v_i=\sum_{f_j:\,\mathrm{Id}(f_j)=i}(f_j-\mu_i)$$

wherein f_j : Id(f_j) = i ranges over the semantic features assigned to the ith sub-partition. Aggregating the residual sums of all sub-partitions yields a K×L-dimensional vector V_d = [v_1, v_2, ..., v_L] as the semantic aggregated fingerprint; the text is finally represented by the semantic aggregated fingerprint V_d together with its M semantic features {f_1, f_2, ..., f_M}.
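The residual aggregation of this step can be sketched as follows, assuming the L cluster centres have already been trained offline (e.g. by K-means over the training library). Plain lists are used for clarity; the function name is illustrative:

```python
def aggregate_fingerprint(features, centroids):
    """Semantic aggregated fingerprint: assign each sentence feature to its
    nearest cluster centre, accumulate the residual (feature - centre) per
    sub-partition, then concatenate into one K*L dimensional vector."""
    K = len(centroids[0])   # feature dimension
    L = len(centroids)      # number of sub-partitions
    fp = [[0.0] * K for _ in range(L)]
    for f in features:
        # Nearest centre by squared Euclidean distance.
        j = min(range(L),
                key=lambda c: sum((f[k] - centroids[c][k]) ** 2 for k in range(K)))
        for k in range(K):
            fp[j][k] += f[k] - centroids[j][k]
    # Flatten [v_1, ..., v_L] into the final fingerprint.
    return [x for part in fp for x in part]
```

Sub-partitions that receive no feature contribute a zero residual block, so the fingerprint dimension is always K×L regardless of text length.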
Preferably, the specific method of the step four is as follows:

firstly, the semantic aggregated fingerprints in the training text library are clustered with the K-means algorithm, and the resulting cluster centers serve as semantic aggregated fingerprint words, forming the first-layer index;

secondly, each semantic aggregated fingerprint in the training text library is quantized to its nearest fingerprint word according to distance, and the difference between the fingerprint word and the fingerprint is computed as the fingerprint residual; each residual is split evenly into L K-dimensional sub-vectors, and K-means over these sub-vectors yields D cluster centers, i.e., D sub-vector words, completing the second-layer index;

and finally, an inverted file is generated for the semantic aggregated fingerprints in the test text library according to the hierarchical index: on the first-layer index, each fingerprint is quantized to its nearest fingerprint word and its residual is computed; the residual is split into L sub-vectors, and for each sub-vector the distance to every sub-vector word is computed to obtain the ID of the nearest sub-vector word. In the hierarchical index, the text information is stored under the index of the corresponding fingerprint word in the fingerprint dictionary; each entry contains the text ID and one sub-vector word ID per sub-vector.
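A sketch of the two-layer index and inverted-file generation, under the assumption that the coarse fingerprint words and the shared sub-vector codebook have already been trained by K-means. Identifiers such as `build_inverted_file` are illustrative, not from the patent:

```python
def nearest(v, codebook):
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(v, codebook[i])))

def build_inverted_file(fingerprints, coarse_words, sub_words, L):
    """Two-layer index in the spirit of step four: quantize each semantic
    aggregated fingerprint to its nearest coarse fingerprint word, split the
    residual into L sub-vectors, and code each sub-vector against a shared
    sub-vector codebook. Returns {coarse_word_id: [(text_id, [sub codes])]}."""
    inverted = {}
    for text_id, fp in enumerate(fingerprints):
        c = nearest(fp, coarse_words)
        residual = [a - b for a, b in zip(fp, coarse_words[c])]
        step = len(residual) // L
        codes = [nearest(residual[j * step:(j + 1) * step], sub_words)
                 for j in range(L)]
        inverted.setdefault(c, []).append((text_id, codes))
    return inverted
```

At query time only the entries under one coarse fingerprint word need to be scanned, which is what makes the top-down comparison of step five efficient.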
Preferably, the specific method of the fifth step is as follows:

for global fingerprint comparison, the similarity distance between the semantic aggregated fingerprint of the text and the semantic aggregated fingerprints indexed under the same fingerprint word is computed using the asymmetric distance: the nearest fingerprint word is selected, the residual between the semantic aggregated fingerprint and that fingerprint word is computed, and the residual is split to generate {v_1, v_2, ..., v_L}; the distance between each sub-vector and every sub-vector word is then computed, producing the corresponding distance matrix. The global distance between the fingerprint of the text to be compared and the ith semantic aggregated text fingerprint quantized to the same fingerprint word is

$$dis_{glo}(q,i)=\sum_{j=1}^{L}\lVert v_{q,j}-v_{id(i),j}\rVert^2$$

wherein v_{q,j} is the jth sub-vector of the text to be compared and v_{id(i),j} is the sub-vector word corresponding to the jth sub-vector of the ith semantic aggregated text fingerprint under the fingerprint word to which the text to be compared is quantized;

the resulting similarity distances are sorted, and the top 10 text fingerprints with the lowest distances are taken as the candidate set; the local semantic-feature similarity between the text to be compared and the ith text in the candidate set is then computed as

$$sim_{loc}(q,i)=\frac{1}{d_t}\sum_{t=1}^{d_t}\max_{1\le j\le d_i} sim\big(f_t^q,f_j^i\big)$$

wherein d_t and d_i are the numbers of semantic features of the text to be compared and of the ith candidate text, and f_t^q and f_j^i are the tth semantic feature of the text under test and the jth semantic feature of the ith candidate text;

the similarity distance between two texts is the weighted sum of the global and local terms:

$$D(q,i)=\alpha\, dis_{glo}(q,i)+(1-\alpha)\,\big(1-sim_{loc}(q,i)\big)$$

wherein α is a weight factor with 0 < α < 1. The similar texts are finally obtained from the inter-text similarity distances, and the locally similar content is obtained from the local fingerprint similarities, yielding the text multi-granularity similarity comparison result.
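A sketch of the asymmetric global distance and the weighted fusion used in this step. The default `alpha = 0.7` is an assumed value of the weight factor, and the fusion form (a weighted sum) follows the description above:

```python
def global_distance(query_subs, db_codes, sub_words):
    """Asymmetric distance: the query keeps its exact residual sub-vectors,
    while the database fingerprint is represented only by its sub-vector
    word IDs; sum the squared Euclidean distances per sub-vector."""
    total = 0.0
    for q, code in zip(query_subs, db_codes):
        word = sub_words[code]
        total += sum((a - b) ** 2 for a, b in zip(q, word))
    return total

def combined_score(sim_global, sim_local, alpha=0.7):
    """Weighted fusion of global and local similarity (0 < alpha < 1);
    the default 0.7 is an assumption, not specified by the patent."""
    return alpha * sim_global + (1 - alpha) * sim_local
```

In practice the inner distances are read from a precomputed query-to-codeword distance matrix, so each database fingerprint costs only L table lookups.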
The invention has the beneficial effects that:
the method carries out word vector representation modeling by combining multidimensional semantic correlation, fully excavates semantic information among words, and obtains word vectors with better semantic correlation representation; extracting features by taking sentences with richer and more complete semantic information as units, representing the semantic features by adopting multiple weights, mining statistics and distribution information of a text library by using a statistical learning method, realizing more fine division of a feature space, generating compact text fingerprints with high identification degree based on multi-feature aggregation, and effectively improving the description capacity and the identification degree of the text fingerprints; the idea from top to bottom is adopted, the semantic aggregation fingerprint and the local semantic features are used for comparing the text similarity, and the multi-granularity similarity comparison from the whole text to the local text can be quickly and efficiently realized by constructing the hierarchical index. Experiments verify that the semantic aggregation fingerprint effectively improves the accuracy of text global similarity comparison, and effectively improves the efficiency of text similarity comparison by benefiting from hierarchical indexes; the method has good expandability, is suitable for comparing the similarity of massive texts, can well meet the requirement of a user on efficient multi-granularity similarity comparison, and can greatly increase the user experience.
Drawings
FIG. 1 is a block diagram of a text multi-granularity similarity comparison in an embodiment of the present invention;
FIG. 2 is a diagram of a word vector representation model according to an embodiment of the present invention;
FIG. 3 is a block diagram of semantic aggregate fingerprint generation according to an embodiment of the present invention, in which a semantic feature space is divided into 8 sub-partitions;
FIG. 4 is a diagram illustrating a process for constructing a hierarchical index inverted file according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a process of comparing text multi-granularity similarities according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a text comparison result according to an embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following examples and drawings:
Embodiment:
as shown in fig. 1, the text multi-granularity similarity comparison method based on semantic aggregated fingerprints of the present invention includes the following steps:
step one, word vector representation training: word vector learning is modeled jointly with multi-dimensional semantic relevance; that is, using a public corpus, both the horizontal relation between a text and its words (words co-occurring in context) and the longitudinal relation between a context and its words (words sharing similar contexts) are exploited, synonym and antonym information is added to the model, and word vector representations are trained by unsupervised learning so that the trained word vectors perform better on semantic relevance and synonym/antonym recognition tasks. Fig. 2 shows the word vector representation model.
In this step, the joint modeling of multi-dimensional semantic relevance represents each word as a K-dimensional vector, and the objective function of word vector learning is expressed as:

$$\mathcal{L}=\sum_{n=1}^{N}\sum_{i}\Big[\log p\big(w_i^n\mid \mathrm{ctx}(w_i^n)\big)+\log p\big(w_i^n\mid d_n\big)+\alpha\sum_{u\in \mathrm{SYN}_i^n}\log p\big(w_i^n\mid u\big)-\alpha\sum_{u\in \mathrm{ANT}_i^n}\log p\big(w_i^n\mid u\big)\Big]$$

where N is the number of texts in the training corpus, d_n is the nth text, w_i^n is the word vector of the ith word in the nth text, ctx(w_i^n) is the vector of the context of w_i^n obtained by summation, p(w_i^n | ctx(w_i^n)) and p(w_i^n | d_n) are the probabilities of the word appearing given its context and given the text, SYN_i^n and ANT_i^n are the synonym set and antonym set of w_i^n, p(w_i^n | u) is the probability of the word appearing when a known synonym or antonym u is given, and α (0 < α < 1) is a weight factor. Training maximizes the objective function and is solved by stochastic gradient ascent.
Step two, semantic feature extraction: taking the sentence, which carries richer semantic information, as the unit; after preprocessing each sentence, including word segmentation and stop-word removal, each participle is represented by the word vector representation of step one and characterized by a combination of multiple weights, including term frequency and part of speech; semantic feature extraction is realized by computing the weighted sum of the participle word vectors.
The specific method of this step is as follows:

firstly, the text is preprocessed: it is split on punctuation marks into a sentence set {S_1, S_2, ..., S_M}, where M is the number of sentences in the text; each sentence is segmented and stop words are removed, giving {c_1, c_2, ..., c_T}, where T is the number of participles in the sentence; the product of the term-frequency weight ω_f and the part-of-speech weight ω_ni characterizes the participle weight ω_c = ω_f × ω_ni; among parts of speech, nouns carry the highest weight, then verbs, then adjectives, and the rest the lowest;

then, each participle is represented by its word vector, and the semantic feature is the weighted, sign-indicated sum over the participle word vectors:

$$f_{i,k}=\sum_{j=1}^{T}\omega_{i,j}\cdot I(w_{i,j,k})$$

wherein f_{i,k} is the value of the kth dimension of the feature of the ith sentence, w_{i,j,k} and ω_{i,j} are the kth dimension value of the jth participle of the ith sentence and the weight of that participle, and I(·) is an indicator function whose value is 1 when w_{i,j,k} > 0 and −1 otherwise; the text is represented as a set of M semantic features {f_1, f_2, ..., f_M}.
Step three, multi-feature aggregation: as shown in fig. 3, clustering is performed using the statistics and distribution characteristics of the semantic features in the training library, and the semantic feature space is divided into several partitions to achieve a finer division of the feature space; each semantic feature of a text is assigned to its nearest partition according to its distance to the cluster centers, the sum of the residuals between the semantic features and the cluster center is computed in each partition, and the residual sums of the sub-partitions are aggregated to generate the semantic aggregated fingerprint.
Firstly, the semantic features in the training text library are clustered into L classes with the K-means algorithm, represented by the cluster centers C = {μ_1, μ_2, ..., μ_L}; each cluster corresponds to one sub-partition of the semantic feature space.

Then, the semantic aggregated text fingerprint is generated from the statistics and distribution of the semantic features relative to the semantic feature space: the distance between each semantic feature f_i of the text and the cluster center of each partition is computed, and the feature is assigned to the nearest sub-partition:

$$\mathrm{Id}(f_i)=\arg\min_j \lVert f_i-\mu_j\rVert^2,\quad i=1,2,\ldots,M,\; j=1,2,\ldots,L$$

wherein Id(f_i) is the index of the sub-partition to which the semantic feature is assigned and μ_j is the cluster center of the jth sub-partition.

Finally, the sum of the residuals between the semantic features belonging to the same partition and its cluster center is computed:

$$v_i=\sum_{f_j:\,\mathrm{Id}(f_j)=i}(f_j-\mu_i)$$

wherein f_j : Id(f_j) = i ranges over the semantic features assigned to the ith sub-partition. Aggregating the residual sums of all sub-partitions yields a K×L-dimensional vector V_d = [v_1, v_2, ..., v_L] as the semantic aggregated fingerprint; the text is finally represented by the semantic aggregated fingerprint V_d together with its M semantic features {f_1, f_2, ..., f_M}.
Assigning the semantic features of the text to be processed to their nearest sub-partitions captures the distribution of the text's semantic features relative to those of the text library, and aggregating the residual sums in the sub-partitions yields a K×L-dimensional vector as the semantic aggregated fingerprint. Compared with traditional fingerprint algorithms, which in effect quantize the features of the whole text library as one large cluster with the origin as its center, this method partitions the feature space more finely; at the same time, word vector representations that capture multi-dimensional semantic relevance replace word-hash representations, which effectively improves the descriptive power of the text fingerprint.
Step four, hierarchical index construction: the semantic aggregated fingerprints in the training library are clustered to form a first-layer index; each fingerprint residual is split and the resulting sub-vectors are clustered to build a second-layer index, yielding the hierarchical index. The semantic aggregated fingerprints of the texts in the test text library are then quantized onto the hierarchical index to generate the corresponding inverted files.
The specific method of this step is as follows:

firstly, the semantic aggregated fingerprints in the training text library are clustered with the K-means algorithm, and the resulting cluster centers serve as semantic aggregated fingerprint words, forming the first-layer index;

secondly, each semantic aggregated fingerprint in the training text library is quantized to its nearest fingerprint word according to distance, and the difference between the fingerprint word and the fingerprint is computed as the fingerprint residual; each residual is split evenly into L K-dimensional sub-vectors, and K-means over these sub-vectors yields D cluster centers, i.e., D sub-vector words, completing the second-layer index;

finally, as shown in fig. 4, an inverted file is generated for the semantic aggregated fingerprints in the test text library according to the hierarchical index: on the first-layer index, each fingerprint is quantized to its nearest fingerprint word and its residual is computed; the residual is split into L sub-vectors, and for each sub-vector the distance to every sub-vector word is computed to obtain the ID of the nearest sub-vector word. In the hierarchical index, the text information is stored under the index of the corresponding fingerprint word in the fingerprint dictionary; each entry contains the text ID and one sub-vector word ID per sub-vector.
Step five, as shown in fig. 5, similarity calculation: according to the multi-granularity comparison algorithm based on the hierarchical index, a top-down calculation mode is adopted: the global similarity between the text to be compared and the texts in the text library is calculated first, and a text is added to the candidate set of similar texts when its similarity is greater than a set threshold; then the local similarity with the texts in the candidate set is calculated to obtain the final similar texts and their specific locally similar contents, yielding the multi-granularity text similarity comparison result.
The specific method of the step is as follows:
when global fingerprint similarity comparison is carried out, firstly, the similarity distance between the semantic aggregated fingerprint of the text and the semantic aggregated fingerprints quantized to the same fingerprint word in the index is calculated, using the asymmetric distance as the measure: the fingerprint word with the closest distance is selected, the fingerprint allowance between the semantic aggregated fingerprint and the corresponding fingerprint word is calculated, and the allowance is divided to generate {v_1, v_2, ..., v_L}; then the distance between each sub-vector and each sub-vector word is calculated, generating the corresponding distance matrix; the global distance between the text fingerprint to be compared and the i-th semantic aggregated text fingerprint quantized to the same fingerprint word is then calculated as

D_g(q, i) = Σ_{j=1}^{L} ||v_{q,j} − v_{id(i),j}||_2

where v_{q,j} represents the j-th sub-vector of the text to be compared, and v_{id(i),j} represents the sub-vector word corresponding to the j-th sub-vector of the i-th semantic aggregated text fingerprint quantized to the same fingerprint word as the text to be compared;
sorting is performed according to the obtained similarity distances, and the 10 text fingerprints with the smallest similarity distances are selected as the candidate set; then the local semantic feature similarity between the text to be compared and the i-th text in the candidate set is calculated as

D_l(q, i) = (1 / d_t) Σ_{t=1}^{d_t} min_{1≤j≤d_i} ||f_t^q − f_j^i||_2

where d_t and d_i respectively represent the number of semantic features of the text to be compared and of the i-th text in the candidate set, and f_t^q and f_j^i respectively represent the t-th semantic feature of the text to be compared and the j-th semantic feature of the i-th text in the candidate set;
the similarity distance between texts is the weighted sum of the global and local distances:

D(q, i) = α · D_g(q, i) + (1 − α) · D_l(q, i)

where α is a weight factor, 0 < α < 1, and D_g and D_l are the global and local distances defined above; similar texts are finally obtained according to the similarity distance between texts, and the similar local contents are obtained according to the local fingerprint similarity, thereby obtaining the multi-granularity text similarity comparison result.
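The top-down scoring of this step can be sketched in a few lines: an asymmetric global distance (the query keeps its exact residual sub-vectors while candidates are represented only by quantized sub-vector words) combined with a local feature distance through the weight α. The exact form of the local term and all helper names here are assumptions for illustration; the patent's own formulas govern.

```python
def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def global_distance(query_subs, cand_sub_ids, sub_words):
    # asymmetric distance: query residual sub-vectors vs. the candidate's
    # quantized sub-vector words, summed over the L sub-vectors
    return sum(l2(q, sub_words[i]) for q, i in zip(query_subs, cand_sub_ids))

def local_distance(query_feats, cand_feats):
    # assumed form: mean distance from each query semantic feature
    # to its best-matching candidate feature
    return sum(min(l2(f, g) for g in cand_feats)
               for f in query_feats) / len(query_feats)

def combined_distance(query_subs, cand_sub_ids, sub_words,
                      query_feats, cand_feats, alpha=0.5):
    """Weighted sum of global and local distances, 0 < alpha < 1."""
    return (alpha * global_distance(query_subs, cand_sub_ids, sub_words)
            + (1 - alpha) * local_distance(query_feats, cand_feats))

sub_words = [[0.0, 0.0], [1.0, 1.0]]
d = combined_distance(query_subs=[[1.0, 1.0], [0.0, 0.0]],
                      cand_sub_ids=[1, 0], sub_words=sub_words,
                      query_feats=[[0.0, 0.0]], cand_feats=[[3.0, 4.0]],
                      alpha=0.5)
print(d)  # global part is 0.0, local part is 5.0 -> 0.5*0 + 0.5*5 = 2.5
```

Because the global term is computed only against quantized sub-vector words, candidate filtering stays cheap; the more expensive local term is evaluated only for the top-10 candidate set.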
In order to verify the effect of the invention, the SogouCS news corpus is selected as the test text library; it contains 18 categories of news (domestic, international, sports, society, entertainment, etc.) collected from the Sohu website between June and July 2012. 1000 news items were selected for manual modification, each corresponding to 5 approximate texts (including the item itself). Meanwhile, the SogouCA news corpus is used as the training text set for training the semantic feature space partition and constructing the hierarchical index. In the experiment, 50,000 of the remaining SogouCA news items were randomly selected as an interference set. The semantic features of all texts in the training and test sets were extracted in advance; the training set contains 14,744,203 semantic features. The training-set semantic features were then randomly sampled and clustered with the K-means algorithm to generate clusters of different sizes, producing semantic aggregated fingerprints of different dimensions. SAF denotes the semantic aggregated fingerprint; the dictionary size L was set to 8, 16, 32, 64 and 128, and the word vector dimension K to 8, 16, 32, 64, 128 and 256. Each text in the test set was used as a text to be compared for testing.
Firstly, the comparison accuracy of the global text fingerprints is compared directly; Table 1 below shows the similarity comparison accuracy of SAF under different parameters against the reference document. Compared with the Simhash method of the reference document, the semantic aggregated fingerprint proposed by the invention achieves a markedly higher accuracy when the word vector dimension K is set identically.
TABLE 1

K      Simhash   L=8     L=16    L=32    L=64    L=128
8      39.00     92.22   92.54   92.86   92.56   93.14
16     78.68     92.96   94.02   94.48   95.38   94.52
32     88.64     93.84   94.94   96.16   96.82   96.80
64     90.42     94.76   96.32   97.48   97.64   97.98
128    91.00     95.60   97.22   97.84   98.68   98.14
256    91.00     96.16   97.88   98.18   98.92   97.86
When the multi-granularity comparison algorithm based on the hierarchical index is used for similarity comparison, comparison efficiency improves by 87.93% relative to direct comparison of the text semantic features, and the method scales well. Fig. 6 shows a sample text comparison result, and Table 2 shows part of the comparison results more clearly, with similarities kept to 8 decimal places. It can be seen that partial content replacement or reordering is compared well, as are sentences whose connecting words and keywords are slightly modified. When the keywords of a sentence are modified more heavily, the similarity between the sentences is low in terms of both sentence structure and semantics, and the system reports a similarity of 0, which accords with the actual situation. The method therefore effectively realizes multi-granularity similarity comparison of text.
TABLE 2
(Table 2 is reproduced as images in the original publication; its contents are not recoverable here.)
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its technical solutions; any technical solution that can be realized on the basis of the above embodiments without creative effort shall be considered to fall within the protection scope of this patent.

Claims (6)

1. A text multi-granularity similarity comparison method based on semantic aggregated fingerprints is characterized in that: the method comprises the following steps:
step one, word vector representation training: word vector learning is jointly modeled by integrating multi-dimensional semantic relevance; that is, using a public corpus, words are represented through the horizontal relation of word-word co-occurrence in the text-to-word mapping and the vertical relation of similar contexts in the word-to-context mapping, synonym and antonym information is added to the model, and word vector representations are trained by an unsupervised learning method, so that the trained word vectors perform better on semantic relatedness and synonym-antonym recognition tasks;
step two, semantic feature extraction: taking the sentence, which carries richer semantic information, as the basic unit, each sentence is preprocessed (including word segmentation and stop-word removal), each segmented word is represented by the word vector representation method and weighted by combining multiple weights including word frequency and part of speech, and the weighted sum of the word vectors of the segmented words is calculated to realize semantic feature extraction;
step three, multi-feature polymerization: clustering is carried out by utilizing the statistics and distribution characteristics of semantic features in a training library, and a semantic feature space is divided into a plurality of partitions, so that the feature space is more finely divided; according to the distance between the semantic features in the text and the clustering center, distributing the semantic features in the text to the nearest partitions, calculating the sum of the residual amounts of the semantic features and the clustering center in each partition, and aggregating a plurality of difference values in each sub-partition to generate a semantic aggregation fingerprint;
step four, hierarchical index construction: clustering training is performed on the semantic aggregated fingerprints in the training library to form the first-layer index; each fingerprint allowance is split, and the split sub-vectors are clustered to complete the construction of the second-layer index, yielding the hierarchical index; the semantic aggregated fingerprints of the texts in the test text library are quantized onto the hierarchical index to generate the corresponding inverted file;
step five, similarity calculation: according to the multi-granularity comparison algorithm based on the hierarchical index, a top-down calculation mode is adopted: the global similarity between the text to be compared and the texts in the text library is calculated first, and a text is added to the candidate set of similar texts when its similarity is greater than a set threshold; then the local similarity with the texts in the candidate set is calculated to obtain the final similar texts and their specific locally similar contents, thereby obtaining the multi-granularity text similarity comparison result.
2. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, wherein: in the first step, the comprehensive multi-dimensional semantic relevance joint modeling expresses each word as a K-dimensional vector, and an objective function of word vector learning is expressed as follows:
L = Σ_{n=1}^{N} Σ_{i} [ log p(w_i^n | C(w_i^n)) + log p(w_i^n | d_n) + α Σ_{u ∈ SYN_i^n} log p(w_i^n | u) − α Σ_{u ∈ ANT_i^n} log p(w_i^n | u) ]

where N represents the number of texts in the training library; d_n represents the n-th text; w_i^n represents the word vector of the i-th word in the n-th text; C(w_i^n) represents the context of w_i^n, whose vector is obtained by the summation method; p(w_i^n | C(w_i^n)) and p(w_i^n | d_n) respectively represent the probability of the word w_i^n occurring in its context and in the text; SYN_i^n and ANT_i^n respectively represent the synonym set and the antonym set of w_i^n; p(w_i^n | u) represents the probability of the word w_i^n occurring when the known synonym or antonym is u; α is a weight factor, 0 < α < 1; training is performed by maximizing the objective function, which is solved by the stochastic gradient ascent method.
3. The method for comparing text multi-granularity similarity based on semantic aggregated fingerprints according to claim 1, wherein the method comprises the following steps: the specific method of the second step is as follows:
firstly, the text is preprocessed: it is divided on punctuation to obtain the sentence set {S_1, S_2, ..., S_M}, where M represents the number of sentences in the text; word segmentation and stop-word removal are performed on each sentence, denoted {c_1, c_2, ..., c_T}, where T represents the number of segmented words in the sentence; the word-frequency weight ω_f and the part-of-speech weight ω_n are combined to characterize the word weight ω_c = ω_f × ω_n; among parts of speech, nouns have the highest weight, verbs the next, adjectives the next, and the remainder the lowest;
then, each segmented word is represented by a word vector, and the semantic feature is represented as the weighted signed sum of the word vectors:

f_{i,k} = Σ_{j=1}^{T} ω_{i,j} · I(w_{i,j,k})

where f_{i,k} represents the value of the k-th dimension of the i-th sentence, and w_{i,j,k} and ω_{i,j} respectively represent the k-th dimension value of the j-th segmented word of the i-th sentence and the weight corresponding to that word; I(·) is an indicator function whose value is 1 when w_{i,j,k} > 0 and −1 otherwise; the text is then represented as a set of M semantic features {f_1, f_2, ..., f_M}.
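As a non-limiting illustration of the semantic feature computation in this claim (a weighted signed sum over the word-vector dimensions), the following sketch uses hypothetical word vectors and weights; only the formula itself is taken from the claim.

```python
def sentence_feature(word_vectors, word_weights):
    """f_k = sum_j omega_j * I(w_{j,k}), with I(x) = 1 if x > 0 else -1."""
    K = len(word_vectors[0])  # word vector dimension
    return [sum(w * (1.0 if vec[k] > 0 else -1.0)
                for vec, w in zip(word_vectors, word_weights))
            for k in range(K)]

# two segmented words with 2-dimensional word vectors; the weights stand in
# for the word-frequency x part-of-speech product (hypothetical values)
vecs = [[0.2, -0.5], [-0.1, 0.4]]
weights = [2.0, 1.0]
print(sentence_feature(vecs, weights))  # [1.0, -1.0]
```

The sign-only treatment of each dimension, combined with the per-word weight, is what makes the feature behave like a weighted locality-sensitive signature of the sentence rather than a plain centroid.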
4. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, wherein: the concrete method of the third step is as follows:
firstly, the K-means algorithm is used to cluster the semantic features in the training text library into L classes, with the cluster centers representing the clusters, C = {μ_1, μ_2, ..., μ_L}; each cluster corresponds to a sub-partition of the semantic feature space;
then, the semantic aggregated text fingerprint is generated using the statistics and distribution information of the semantic features with respect to the semantic feature space: the distance between each semantic feature f_i of the text and the cluster center of each partition is calculated, and the feature is assigned to the nearest sub-partition:

Id(f_i) = arg min_j ||f_i − μ_j||_2,  i = 1, 2, ..., M,  j = 1, 2, ..., L

where Id(f_i) represents the index of the sub-partition to which the semantic feature is assigned, and μ_j represents the cluster center of the j-th sub-partition;
finally, the sum of the differences between the semantic features belonging to the same partition and its cluster center is calculated:

v_i = Σ_{f_j : Id(f_j) = i} (f_j − μ_i)

where f_j : Id(f_j) = i denotes the semantic features f_j assigned to the i-th sub-partition; the difference sums of the sub-partitions are aggregated to generate a K×L-dimensional vector V_d = [v_1, v_2, ..., v_L] as the semantic aggregated fingerprint, and the text is finally represented by the semantic aggregated fingerprint V_d and the M semantic features {f_1, f_2, ..., f_M}.
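The residual aggregation in this claim (assign each semantic feature to its nearest cluster center, sum the differences per sub-partition, and concatenate the partial sums into a K×L-dimensional fingerprint) can be sketched as follows; the toy centers and features are hypothetical values for illustration only.

```python
def aggregate_fingerprint(features, centers):
    """Assign each semantic feature to its nearest cluster center, sum the
    residuals (f - mu) per sub-partition, and concatenate the L partial sums
    into the K*L-dimensional semantic aggregated fingerprint."""
    K = len(centers[0])
    sums = [[0.0] * K for _ in centers]  # one K-dim accumulator per partition
    for f in features:
        i = min(range(len(centers)),     # nearest sub-partition
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centers[c])))
        for k in range(K):
            sums[i][k] += f[k] - centers[i][k]
    return [x for part in sums for x in part]  # flatten to K*L dimensions

# toy example: L=2 partitions with 2-dim centers, two features
centers = [[0.0, 0.0], [10.0, 10.0]]
feats = [[1.0, 1.0], [9.0, 11.0]]
print(aggregate_fingerprint(feats, centers))  # [1.0, 1.0, -1.0, 1.0]
```

Because each partition accumulates residuals rather than raw features, the fingerprint captures how the text's features deviate from the trained feature space, which is what makes the later residual quantization in the hierarchical index effective.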
5. The method for comparing text multi-granularity similarity based on semantic aggregated fingerprints according to claim 1, wherein the method comprises the following steps: the concrete method of the fourth step is as follows:
firstly, clustering is carried out on semantic aggregation fingerprints in a training text library by adopting a K-means algorithm, and a clustering center obtained by clustering is used as a semantic aggregation fingerprint word to further form a first-layer index;
secondly, according to the distance between each semantic aggregated fingerprint in the training text library and the semantic aggregated fingerprint words, each fingerprint is quantized to the closest fingerprint word, and the difference between the fingerprint word and the semantic aggregated fingerprint is calculated as the fingerprint allowance; each fingerprint allowance is divided equally into L K-dimensional sub-vectors, and K-means is applied to the sub-vectors to obtain D cluster centers, i.e. D sub-vector words, completing the construction of the second-layer index;
and finally, an inverted file is generated for the semantic aggregated fingerprints in the test text library according to the hierarchical index: first, on the first-layer index, each semantic aggregated fingerprint is quantized to the closest fingerprint word according to their distance, and the fingerprint allowance is calculated; the fingerprint allowance is divided into L sub-vectors, the distance between each sub-vector and the sub-vector words is calculated, and the ID of the closest sub-vector word is obtained; in the hierarchical index, the text information is stored under the corresponding fingerprint word entry of the fingerprint dictionary, and the stored content comprises the text ID together with the sub-vector word ID corresponding to each sub-vector.
6. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, wherein: the concrete method of the step five is as follows:
when global fingerprint similarity comparison is carried out, firstly, the similarity distance between the semantic aggregated fingerprint of the text and the semantic aggregated fingerprints quantized to the same fingerprint word in the index is calculated, using the asymmetric distance as the measure: the fingerprint word with the closest distance is selected, the fingerprint allowance between the semantic aggregated fingerprint and the corresponding fingerprint word is calculated, and the allowance is divided to generate {v_1, v_2, ..., v_L}; then the distance between each sub-vector and each sub-vector word is calculated, generating the corresponding distance matrix; the global distance between the text fingerprint to be compared and the i-th semantic aggregated text fingerprint quantized to the same fingerprint word is then calculated as

D_g(q, i) = Σ_{j=1}^{L} ||v_{q,j} − v_{id(i),j}||_2

where v_{q,j} represents the j-th sub-vector of the text to be compared, and v_{id(i),j} represents the sub-vector word corresponding to the j-th sub-vector of the i-th semantic aggregated text fingerprint quantized to the same fingerprint word as the text to be compared;
sorting is performed according to the obtained similarity distances, and the 10 text fingerprints with the smallest similarity distances are selected as the candidate set; then the local semantic feature similarity between the text to be compared and the i-th text in the candidate set is calculated as

D_l(q, i) = (1 / d_t) Σ_{t=1}^{d_t} min_{1≤j≤d_i} ||f_t^q − f_j^i||_2

where d_t and d_i respectively represent the number of semantic features of the text to be compared and of the i-th text in the candidate set, and f_t^q and f_j^i respectively represent the t-th semantic feature of the text to be compared and the j-th semantic feature of the i-th text in the candidate set;
the similarity distance between texts is the weighted sum of the global and local distances:

D(q, i) = α · D_g(q, i) + (1 − α) · D_l(q, i)

where α is a weight factor, 0 < α < 1, and D_g and D_l are the global and local distances defined above; similar texts are finally obtained according to the similarity distance between texts, and the similar local contents are obtained according to the local fingerprint similarity, thereby obtaining the multi-granularity text similarity comparison result.
CN201910441282.1A 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints Active CN110321925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910441282.1A CN110321925B (en) 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910441282.1A CN110321925B (en) 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Publications (2)

Publication Number Publication Date
CN110321925A CN110321925A (en) 2019-10-11
CN110321925B true CN110321925B (en) 2022-11-18

Family

ID=68119119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910441282.1A Active CN110321925B (en) 2019-05-24 2019-05-24 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Country Status (1)

Country Link
CN (1) CN110321925B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750616B (en) * 2019-10-16 2023-02-03 网易(杭州)网络有限公司 Retrieval type chatting method and device and computer equipment
CN110909550B (en) * 2019-11-13 2023-11-03 北京环境特性研究所 Text processing method, text processing device, electronic equipment and readable storage medium
CN110956039A (en) * 2019-12-04 2020-04-03 中国太平洋保险(集团)股份有限公司 Text similarity calculation method and device based on multi-dimensional vectorization coding
CN110990538B (en) * 2019-12-20 2022-04-01 深圳前海黑顿科技有限公司 Semantic fuzzy search method based on sentence-level deep learning language model
CN111461109B (en) * 2020-02-27 2023-09-15 浙江工业大学 Method for identifying documents based on environment multi-class word stock
CN111694952A (en) * 2020-04-16 2020-09-22 国家计算机网络与信息安全管理中心 Big data analysis model system based on microblog and implementation method thereof
CN111381191B (en) * 2020-05-29 2020-09-01 支付宝(杭州)信息技术有限公司 Method for synonymy modifying text and determining text creator
CN111859635A (en) * 2020-07-03 2020-10-30 中国人民解放军海军航空大学航空作战勤务学院 Simulation system based on multi-granularity modeling technology and construction method
CN112287669B (en) * 2020-12-28 2021-05-25 深圳追一科技有限公司 Text processing method and device, computer equipment and storage medium
CN113111645B (en) * 2021-04-28 2024-02-06 东南大学 Media text similarity detection method
CN113313180B (en) * 2021-06-04 2022-08-16 太原理工大学 Remote sensing image semantic segmentation method based on deep confrontation learning
CN115935195B (en) * 2022-11-08 2023-08-08 华院计算技术(上海)股份有限公司 Text matching method and device, computer readable storage medium and terminal
CN116129146B (en) * 2023-03-29 2023-09-01 中国工程物理研究院计算机应用研究所 Heterogeneous image matching method and system based on local feature consistency

Citations (4)

Publication number Priority date Publication date Assignee Title
CN107423729A (en) * 2017-09-20 2017-12-01 湖南师范大学 A kind of remote class brain three-dimensional gait identifying system and implementation method towards under complicated visual scene
CN108399163A (en) * 2018-03-21 2018-08-14 北京理工大学 Bluebeard compound polymerize the text similarity measure with word combination semantic feature
CN108595706A (en) * 2018-05-10 2018-09-28 中国科学院信息工程研究所 A kind of document semantic representation method, file classification method and device based on theme part of speech similitude
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20090119572A1 (en) * 2007-11-02 2009-05-07 Marja-Riitta Koivunen Systems and methods for finding information resources
US9720979B2 (en) * 2012-06-22 2017-08-01 Krishna Kishore Dhara Method and system of identifying relevant content snippets that include additional information


Non-Patent Citations (6)

Title
"A fingerprinting based plagiarism detection system for Arabic text based documents"; Jadalla, A.; Proceedings of the 2012 8th International Conference on Computing Technology and Information Management; 2012-09-20; full text *
"FPSS: Fingerprint-based semantic similarity detection in big data environment"; Mohamed Elhoseny; 2017 Eighth International Conference on Intelligent Computing and Information Systems (ICICIS); 2018-01-18; full text *
"Similarity and Locality Based Indexing for High Performance Data Deduplication"; Wen Xia; IEEE Transactions on Computers; 2014-02-25; full text *
"Research on a text similarity detection algorithm based on simhash" (基于simhash的文本相似检测算法研究); Jiang Xue; China Masters' Theses Full-text Database, Information Science and Technology; 2018-07-15; full text *
"Semantic annotation and aggregation of Web images based on social networks" (基于社会网络的WEB图像语义标注与聚合); Liu Lifang; China Masters' Theses Full-text Database, Information Science and Technology; 2011-11-15; full text *
"Research on text semantic similarity calculation methods" (文本语义相似度计算方法研究); Liu Hongzhe; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2013-05-15; full text *

Also Published As

Publication number Publication date
CN110321925A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN109190117B (en) Short text semantic similarity calculation method based on word vector
CN108573045B (en) Comparison matrix similarity retrieval method based on multi-order fingerprints
CN108710894B (en) Active learning labeling method and device based on clustering representative points
CN107895000B (en) Cross-domain semantic information retrieval method based on convolutional neural network
CN110209808A (en) A kind of event generation method and relevant apparatus based on text information
CN111832289A (en) Service discovery method based on clustering and Gaussian LDA
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
Noaman et al. Naive Bayes classifier based Arabic document categorization
JP2011227688A (en) Method and device for extracting relation between two entities in text corpus
CN106599072B (en) Text clustering method and device
Zhang et al. Continuous word embeddings for detecting local text reuses at the semantic level
CN108647322A (en) The method that word-based net identifies a large amount of Web text messages similarities
Odeh et al. Arabic text categorization algorithm using vector evaluation method
CN113962293A (en) LightGBM classification and representation learning-based name disambiguation method and system
CN114997288A (en) Design resource association method
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Zahedi et al. Improving text classification performance using PCA and recall-precision criteria
Minkov et al. NER systems that suit user’s preferences: adjusting the recall-precision trade-off for entity extraction
JP4567025B2 (en) Text classification device, text classification method, text classification program, and recording medium recording the program
Long et al. Multi-document summarization by information distance
KR101240330B1 (en) System and method for mutidimensional document classification
CN114298020A (en) Keyword vectorization method based on subject semantic information and application thereof
Nagaraj et al. A novel semantic level text classification by combining NLP and Thesaurus concepts

Legal Events

Date Code Title Description
PB01    Publication
SE01    Entry into force of request for substantive examination
GR01    Patent grant