CN110321925B

CN110321925B - Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Info

Publication number: CN110321925B
Application number: CN201910441282.1A
Authority: CN
Inventors: 梁燕; 万正景; 陶以政; 李龚亮; 许峰; 曹政; 谢杨; 马丹阳
Original assignee: COMPUTER APPLICATION RESEARCH INST CHINA ACADEMY OF ENGINEERING PHYSICS
Current assignee: COMPUTER APPLICATION RESEARCH INST CHINA ACADEMY OF ENGINEERING PHYSICS
Priority date: 2019-05-24
Filing date: 2019-05-24
Publication date: 2022-11-18
Anticipated expiration: 2039-05-24
Also published as: CN110321925A

Abstract

The invention discloses a text multi-granularity similarity comparison method based on semantic aggregated fingerprints, which comprises the following steps of: training of word vector representations; extracting semantic features; performing multi-feature polymerization; constructing a hierarchical index; and calculating the similarity. The method carries out word vector representation modeling in combination with multi-dimensional semantic correlation, fully excavates semantic information among words, extracts features by taking sentences as units, represents the semantic features by adopting multiple weights, excavates statistics and distribution information of a text library by utilizing a statistical learning method, realizes finer division of feature space, generates compact text fingerprints with high identification degree based on multi-feature aggregation, and effectively improves description capacity and discrimination degree of the text fingerprints; the idea of top-down is adopted, the semantic aggregate fingerprint and the local semantic features are used for comparing the text similarity, and the multi-granularity similarity comparison from the whole text to the local text can be quickly and efficiently realized by constructing the hierarchical index; the method has good expandability.

Description

Text multi-granularity similarity comparison method based on semantic aggregated fingerprints

Technical Field

The invention relates to a text similarity comparison method, in particular to a text multi-granularity similarity comparison method based on semantic aggregated fingerprints, and belongs to the technical field of pattern recognition and information processing.

Background

Both text approximations mean that the content and information described in the text are similar or even identical. Two texts are considered similar if one text is generated by modifying a small part of the content of the other text in a similar way of inserting, deleting, replacing and the like. Text or web page proximity diffusion is generally undesirable, and as data proliferates, the problems caused by proximity text become more and more severe. Therefore, the approximate text detection is an important technology for reducing storage overhead, improving search efficiency and data utilization rate and avoiding illegal plagiarism and plagiarism.

Experts and scholars at home and abroad propose various methods for the method. The traditional text similarity comparison is mainly divided into two categories: one is a string comparison based approach; the other type is a word frequency statistics-based method, and on the basis of a vector space model, the feature vectors are adopted to represent texts, and the similarity distance between the vectors is calculated to measure the similarity between the texts. The former may employ strings of different granularity, such as sentence-level based and paragraph-level based strings. However, since a large number of character strings are usually contained in the text, the character string matching method is difficult to avoid the embarrassing situation of low real-time performance of a large amount of long text.

The method based on the shift, which partially considers the text as a group of shift, wherein shift refers to a continuous subsequence in the text, has the inevitable difficulty of high computational cost and can not process mass data basically. And the other part maps words in each text into a simple hash value by using the existing dictionary, although the cost of similarity calculation is effectively reduced, the generated text fingerprints have instability, and when the words in the dictionary do not sufficiently contain the words in the text, the hash value is fluctuated due to small changes of the text.

The Simhash algorithm proposed by Charikar is currently recognized as the best, most efficient approximate text detection algorithm. Simhash is a method for carrying out probability dimension reduction on high-dimensional data, and maps high-dimensional text feature vectors into fingerprints with small digits and fixed positions. At present, most approximate text detection systems are realized based on Simhash. However, these methods only focus on the text itself and ignore the useful information in the text library, and Simhash generates a text fingerprint for one text, and cannot provide comparison of local text similarity of two texts.

Disclosure of Invention

The invention aims to solve the problems and provide a text multi-granularity similarity comparison method based on semantic aggregated fingerprints.

The invention achieves the above purpose through the following technical scheme:

a text multi-granularity similarity comparison method based on semantic aggregated fingerprints comprises the following steps:

step one, word vector representation training: the method comprises the steps of combining multi-dimensional semantic correlation and modeling word vector learning, namely adopting a public corpus, utilizing the horizontal relation of the co-occurrence between words in context between text and word mapping expression words and the longitudinal relation of similar context between words in context mapping expression words, adding synonym and antisense information for modeling, and performing word vector expression training through an unsupervised learning method to enable word vectors obtained by training to better perform on semantic correlation and antisense synonym recognition tasks;

step two, semantic feature extraction: based on sentences with richer semantic information, after preprocessing operations including word segmentation and word stop removal are carried out on the sentences, word vector representation is adopted to represent the words, then multi-weight including word frequency and word property is combined to represent the words, and the sum of each word segmentation word vector and the word segmentation weight is calculated to realize semantic feature extraction;

step three, multi-feature polymerization: clustering is carried out by utilizing the statistics and distribution characteristics of semantic features in a training library, and a semantic feature space is divided into a plurality of partitions, so that the feature space is more finely divided; according to the distance between the semantic features in the text and the clustering center, distributing the semantic features in the text to the nearest partitions, calculating the sum of the residual amounts of the semantic features and the clustering center in each partition, and aggregating a plurality of difference values in each sub-partition to generate a semantic aggregation fingerprint;

step four, hierarchical index construction: performing clustering training on semantic aggregation fingerprints in a training library to form a first-layer index, splitting the margin of each fingerprint, clustering the split sub-vectors to complete construction of a second-layer index, and obtaining a hierarchical index; quantizing the semantic aggregation fingerprints of the texts in the test text library to the hierarchical indexes to generate corresponding inverted files;

step five, similarity calculation: according to a multi-granularity comparison algorithm based on hierarchical indexes, a top-down calculation mode is adopted, the global similarity between a text to be compared and a text in a text library is calculated, and when the similarity is larger than a set threshold value, the global similarity is added into a similar text alternative set; and then, calculating the local similarity with the texts in the alternative text set to obtain the final similar texts and the specific local similar contents thereof, and further obtaining a text multi-granularity similarity comparison result.

Preferably, in the first step, the comprehensive multi-dimensional semantic relevance joint modeling expresses each word as a K-dimensional vector, and an objective function of word vector learning is expressed as:

where N represents the number of texts in the training library, d _n Which represents the n-th text of the text,

a word vector representing the ith word in the nth text,

representing words

The context of (a) employs vector translation values of a summation method,

and

respectively representing the occurrence of words in context

Probability of and occurrence of words in the text

The probability of (a) of (b) being,

and ANT _i ⁿ Respectively representing words

A set of synonyms and a set of anti-synonyms,

when the known synonym or antonym is u, the word appears

The probability of (a) is a weight factor, alpha is more than 0 and less than 1, training is carried out through a maximized objective function, and a random gradient ascent method is adopted for solving.

Preferably, the specific method of the second step is as follows:

firstly, preprocessing a text: dividing based on punctuation marks to obtain a sentence set { S ₁ ,S ₂ ,...,S _M Where M represents the number of text sentences, word segmentation and word de-stop processing is performed on each sentence, denoted as { c } ₁ ,c ₂ ,...,c _T Where T represents the number of participles in a sentence, and the word frequency weight ω is used _f And part-of-speech weight ω _ni Is used to characterize the participle weight omega _c ＝ω _f ×ω _ni In part of speech, the noun has the highest weight, the verb is inferior, the adjective is inferior, and the rest is lowest;

then, each participle is represented by a word vector, and the semantic features are represented as the sum of each participle word vector and a participle weight:

wherein, f _i,k Value, w, representing the kth dimension of the ith sentence _i,j,k And omega _i,j Respectively representing the kth dimension value of the jth participle of the ith sentence and the weight value corresponding to the participle; i (-) is an indicator function when w _i,j,k When the value is more than 0, the value is 1, and the others are-1; the text is represented as a set containing M semantic features f ₁ ,f ₂ ,...,f _M }。

Preferably, the specific method of the third step is as follows:

firstly, utilizing a K-means algorithm to cluster semantic features in a training text library into L classes, and representing a cluster C = { mu ] by a cluster center ₁ ,μ ₂ ,...,μ _L Each cluster corresponds to a sub-partition of the semantic feature space; the K-means algorithm is a hard clustering algorithm, is a typical target function clustering method based on a prototype, takes a certain distance from a data point to the prototype as an optimized target function, and obtains an adjustment rule of iterative operation by using a function extremum solving method.

Then, comparing the semantic features with the semantic featuresThe statistical and distribution information of the space, the semantic aggregation text fingerprint is generated, and the semantic feature f of the text is calculated _i And the distance between the cluster center of each partition, which is assigned to the nearest sub-partition:

Id(f _i )＝arg min||f _i -μ _j || ² ,i＝1,2,...,M,j＝1,2,...,L

wherein, id (f) _i ) Index, μ, representing the sub-partition to which the semantic features are assigned _j A cluster center representing a jth sub-partition;

and finally, calculating the sum of the difference values between the semantic features belonging to the same partition and the clustering center:

wherein, f _j :Id(f _j ) = i represents the j semantic features allocated to the i sub-partition, and the multiple difference sums in each sub-partition are aggregated to generate a K x L-dimensional vector V _d ＝[v ₁ ,v ₂ ,...,v _L ]As a semantic aggregated fingerprint, the text is finally represented as a semantic aggregated fingerprint V _d And M semantic features { f } ₁ ,f ₂ ,...,f _M }。

Preferably, the specific method of the step four is as follows:

firstly, clustering is carried out on semantic aggregate fingerprints in a training text base by adopting a K-means algorithm, and a clustering center obtained by clustering is used as a semantic aggregate fingerprint word to further form a first-layer index;

secondly, according to the distance between the semantic aggregation fingerprint in the training text base and the semantic aggregation fingerprint word, quantizing the fingerprint word with the closest distance, and calculating the difference between the fingerprint word and the semantic aggregation fingerprint as the fingerprint allowance; averagely dividing each fingerprint allowance into L K-dimensional sub-vectors, using K-means for the sub-vectors to obtain D clustering centers, namely D sub-vector words are contained, and completing construction of a second-layer index;

and finally, generating an inverted file for the semantic aggregation fingerprints in the test text library according to the hierarchical index: firstly, on a first-layer index, according to the distance between the semantic aggregate fingerprint and a fingerprint word, quantizing the fingerprint word with the closest distance, calculating the fingerprint margin, dividing the fingerprint margin into L sub-vectors, calculating the distance between each sub-vector and the sub-vector word, and obtaining the ID of the sub-vector word with the closest distance; in the hierarchical index, the text information is stored in the fingerprint word index corresponding to the fingerprint dictionary, and the index storage content comprises text IDs, and each sub-vector word ID corresponds to a sub-vector.

Preferably, the specific method of the fifth step is as follows:

when global fingerprint similarity comparison is carried out, firstly, the similarity distance between the text semantic aggregation fingerprint and the semantic aggregation fingerprint quantized to index the same fingerprint word is calculated, asymmetric distance is adopted for measurement, the fingerprint word with the closest distance is selected, the fingerprint allowance of the semantic aggregation fingerprint and the corresponding fingerprint word is calculated, and the semantic aggregation fingerprint allowance is divided to generate { v } v ₁ ,v ₂ ,...,v _L }; then calculating the distance between each sub-vector and each sub-vector word, and generating a corresponding distance matrix; calculating the global distance between the text fingerprint to be compared and the ith semantic aggregation text fingerprint quantized to the same fingerprint word

Wherein v is _q,j J-th sub-vector, v, representing the text to be compared _id(i),j A sub-vector word corresponding to the jth sub-vector of the ith semantic aggregation text fingerprint on the fingerprint word quantized by the text to be compared is represented;

sorting according to the obtained similarity distances, and selecting the top 10 text fingerprints with the lowest similarity distance as an alternative set; then calculating the similarity of the local semantic features of the text to be compared and the ith text in the alternative set

Wherein, d _t And d _i Respectively representing the semantic feature number f of the ith text in the text to be compared and the candidate set _t ^q And

respectively representing the t semantic feature of the text to be detected and the j semantic feature of the ith text in the candidate set;

the similarity distance between texts is the sum of the global similarity and the local similarity of the corresponding weight:

wherein alpha is a weight factor, and alpha is more than 0 and less than 1; finally obtaining similar texts according to the similarity distance between the texts, and simultaneously obtaining similar local contents according to the local fingerprint similarity, thereby obtaining a text multi-granularity similarity comparison result.

The invention has the beneficial effects that:

the method carries out word vector representation modeling by combining multidimensional semantic correlation, fully excavates semantic information among words, and obtains word vectors with better semantic correlation representation; extracting features by taking sentences with richer and more complete semantic information as units, representing the semantic features by adopting multiple weights, mining statistics and distribution information of a text library by using a statistical learning method, realizing more fine division of a feature space, generating compact text fingerprints with high identification degree based on multi-feature aggregation, and effectively improving the description capacity and the identification degree of the text fingerprints; the idea from top to bottom is adopted, the semantic aggregation fingerprint and the local semantic features are used for comparing the text similarity, and the multi-granularity similarity comparison from the whole text to the local text can be quickly and efficiently realized by constructing the hierarchical index. Experiments verify that the semantic aggregation fingerprint effectively improves the accuracy of text global similarity comparison, and effectively improves the efficiency of text similarity comparison by benefiting from hierarchical indexes; the method has good expandability, is suitable for comparing the similarity of massive texts, can well meet the requirement of a user on efficient multi-granularity similarity comparison, and can greatly increase the user experience.

Drawings

FIG. 1 is a block diagram of a text multi-granularity similarity comparison in an embodiment of the present invention;

FIG. 2 is a diagram of a word vector representation model according to an embodiment of the present invention;

FIG. 3 is a block diagram of semantic aggregate fingerprint generation according to an embodiment of the present invention, in which a semantic feature space is divided into 8 sub-partitions;

FIG. 4 is a diagram illustrating a process for constructing a hierarchical index inverted file according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating a process of comparing text multi-granularity similarities according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating a text comparison result according to an embodiment of the present invention.

Detailed Description

The invention will be further illustrated with reference to the following examples and drawings:

example (b):

as shown in fig. 1, the text multi-granularity similarity comparison method based on semantic aggregated fingerprints of the present invention includes the following steps:

step one, word vector representation training: the method is characterized by combining multi-dimensional semantic correlation and modeling word vector learning, namely adopting a public corpus, utilizing the horizontal relation of the co-occurrence between words in the context between texts and word mapping expression words and the longitudinal relation of similar contexts between words and context mapping expression words, adding synonym and antisense information for modeling, and performing word vector expression training by an unsupervised learning method to ensure that the word vector obtained by training has better performance on semantic correlation and antisense synonym recognition tasks. Fig. 2 shows a word vector representation model.

In this step, the integrated multidimensional semantic correlation joint modeling represents each word as a K-dimensional vector, and the target function of word vector learning is represented as:

where N represents the number of texts in the training corpus, d _n The n-th text is represented by the n-th text,

a word vector representing the ith word in the nth text,

representing words

The context of (a) employs vector conversion values of a summation method,

and

respectively representing the occurrence of words in context

Probability of and occurrence of words in the text

The probability of (a) of (b) being,

and ANT _i ⁿ Respectively representing words

A set of synonyms and a set of anti-synonyms,

when the known synonym or antonym is u, the word appears

Step two, semantic feature extraction: based on sentences with richer semantic information, after preprocessing operations including word segmentation and word stop removal are carried out on the sentences, the word vector representation method in the step one is adopted to represent the words, then the words are represented by combining multiple weights including word frequency and word property, the sum of each word vector and the word weight is calculated, and the semantic feature extraction is realized.

The specific method of the step is as follows:

wherein f is _i,k Value, w, representing the kth dimension of the ith sentence _i,j,k And omega _i,j Respectively representing the kth dimension value of the jth participle of the ith sentence and the weight value corresponding to the participle; i (-) is an indicator function when w _i,j,k When > 0, it has a value of1, others are-1; the text is represented as a set containing M semantic features f ₁ ,f ₂ ,...,f _M }。

Step three, multi-feature polymerization: as shown in fig. 3, clustering is performed by using the statistical and distribution characteristics of the semantic features in the training library, and the semantic feature space is divided into a plurality of partitions, so as to realize finer division of the feature space; according to the distance between the semantic features in the text and the clustering center, the semantic features in the text are distributed to the nearest partitions, the sum of the margins of the semantic features and the clustering center in each partition is calculated, and the multiple difference values in each sub-partition are aggregated to generate a semantic aggregation fingerprint.

Firstly, utilizing a K-means algorithm to cluster semantic features in a training text library into L classes, and representing a cluster C = { mu ] by a cluster center ₁ ,μ ₂ ,...,μ _L Each cluster corresponds to a sub-partition of the semantic feature space;

then, generating semantic aggregation text fingerprint by using statistic and distribution information of semantic features compared with semantic feature space, and calculating semantic feature f of text _i And the distance between the cluster centers of each partition, which is assigned to the nearest sub-partition:

Id(f _i )＝arg min||f _i -μ _j || ² ,i＝1,2,...,M,j＝1,2,...,L

and finally, calculating the difference sum of the semantic features belonging to the same partition and the clustering center:

wherein f is _j :Id(f _j ) = i represents the j semantic features allocated to the i sub-partition, and the multiple difference sums in each sub-partition are aggregated to generate a K x L-dimensional vector V _d ＝[v ₁ ,v ₂ ,...,v _L ]As a semantic aggregated fingerprint, the text is finally represented as a semantic aggregated fingerprint V _d And M semantic features { f ₁ ,f ₂ ,...,f _M }。

The distribution of the text semantic features relative to the text library semantic features is obtained by distributing the semantic features of the text to be processed to the sub-partitions closest to the semantic features. And aggregating the plurality of difference sums in each sub-partition to generate a vector with dimension of K multiplied by L as a semantic aggregation fingerprint. Compared with the traditional fingerprint algorithm, the method has the advantages that the characteristics of the whole text library are quantified as a large cluster, the original point is used as a cluster center, the method provides finer distribution for the characteristic space, and meanwhile, word vector representation which can represent multi-dimensional semantic correlation is adopted to replace word hash representation, so that the description capacity of the text fingerprint can be effectively improved.

Step four, hierarchical index construction: performing clustering training on semantic aggregated fingerprints in a training library to form a first-layer index, splitting each fingerprint margin, clustering split sub-vectors to complete construction of a second-layer index, and obtaining a hierarchical index; and quantizing the semantic aggregation fingerprints of the texts in the test text library to the hierarchical indexes to generate corresponding inverted files.

The specific method of the step is as follows:

firstly, clustering is carried out on semantic aggregation fingerprints in a training text library by adopting a K-means algorithm, and a clustering center obtained by clustering is used as a semantic aggregation fingerprint word to further form a first-layer index;

secondly, according to the distance between the semantic aggregation fingerprint in the training text base and the semantic aggregation fingerprint word, quantizing the fingerprint word with the closest distance, and calculating the difference between the fingerprint word and the semantic aggregation fingerprint as the fingerprint allowance; equally dividing each fingerprint allowance into L K-dimensional sub-vectors, using K-means for the sub-vectors to obtain D clustering centers, namely D sub-vector words are contained, and completing construction of a second-layer index;

finally, as shown in fig. 4, according to the hierarchical index, generating an inverted file for the semantic aggregated fingerprint in the test text library: firstly, on a first-layer index, according to the distance between the semantic aggregated fingerprint and a fingerprint word, quantizing the fingerprint word with the closest distance, calculating the fingerprint allowance, dividing the fingerprint allowance into L sub-vectors, calculating the distance between each sub-vector and the sub-vector word, and obtaining the sub-vector word ID with the closest distance; in the hierarchical index, the text information is stored in the fingerprint word index corresponding to the fingerprint dictionary, and the index storage content comprises text IDs, and each sub-vector word ID corresponds to a sub-vector.

Step five, as shown in fig. 5, similarity calculation: according to a multi-granularity comparison algorithm based on hierarchical indexes, adopting a top-down calculation mode, firstly calculating the global similarity between a text to be compared and texts in a text library, and adding the global similarity into a similar text alternative set when the similarity is greater than a set threshold value; and then calculating the local similarity with the texts in the alternative text set, further obtaining the final similar texts and the specific local similar contents thereof, and obtaining the multi-granularity similarity comparison result of the texts.

The specific method of the step is as follows:

Wherein v is _q,j Representing text to be comparedThe jth sub-vector of v _id(i),j A sub-vector word corresponding to the jth sub-vector of the ith semantic aggregation text fingerprint on the fingerprint word quantized by the text to be compared is represented;

Wherein d is _t And d _i Respectively representing the semantic feature number f of the ith text in the text to be compared and the candidate set _t ^q And

In order to verify the effect of the invention, a SogouCS news library is selected as a test text library and comprises 18 types of news from a fox searching website, including domestic, international, sports, social, entertainment and the like between 6 and 7 months in 2012. 1000 news items were selected for manual modification, each news item corresponding to 5 approximate texts (including the news item itself). Meanwhile, a SogouCA news library is used as a training text set for training semantic feature space division and hierarchical index construction. In the experiment, 50000 of the rest news of the SogouCA news library were randomly selected as an interference set. And extracting semantic features of all texts in the training set and the test set in advance. The training set includes 14,744,203 semantic features. Then, randomly sampling the semantic features of the training set, clustering the semantic features by using a K-Means algorithm to generate clusters with different sizes, and further generating semantic aggregation fingerprints with different dimensions. The SAF is used to represent the semantic aggregate fingerprint and the dictionary size L is set to 8,16,32,64,128 and the word vector dimension K is set to 8,16,32,64,128,256. And each text in the test set is used as a text to be compared for testing.

Firstly, the comparison accuracy of global text fingerprints is directly compared, and the following table 1 shows the similarity comparison accuracy of the SAFs and the comparison documents with different parameters. Compared with the Simhash method of the comparison document, the semantic aggregation fingerprint provided by the invention has the advantage that the accuracy rate is obviously increased when the word vector dimension K parameters are set consistently.

TABLE 1

K	Simhash	L＝8	L＝16	L＝32	L＝64	L＝128
							8	39.00	92.22	92.54	92.86	92.56	93.14
16	78.68	92.96	94.02	94.48	95.38	94.52
							32	88.64	93.84	94.94	96.16	96.82	96.80
64	90.42	94.76	96.32	97.48	97.64	97.98
							128	91.00	95.60	97.22	97.84	98.68	98.14
256	91.00	96.16	97.88	98.18	98.92	97.86

When the multi-granularity comparison algorithm based on the hierarchical index is used for comparing the similarity, compared with the method for directly comparing the text semantic features, the method improves the comparison efficiency by 87.93 percent and has good expansibility. Fig. 6 shows a text comparison result sample, and table 2 shows a part of the comparison result more clearly, and the similarity retains 8-digit decimal points. The invention can be found out that partial content replacement or sequence exchange can be well compared, and sentences of connecting words and key words can be slightly modified. When the keywords of the sentences are modified more, the similarity between the sentences is low from the perspective of sentence structure and semantics, and the system comparison result shows that the similarity between the sentences is 0, which accords with the actual situation. Therefore, the method can effectively realize the multi-granularity similarity comparison of the text.

TABLE 2

The above embodiments are only preferred embodiments of the present invention, and are not intended to limit the technical solutions of the present invention, so long as the technical solutions can be realized on the basis of the above embodiments without creative efforts, which should be considered to fall within the protection scope of the patent of the present invention.

Claims

1. A text multi-granularity similarity comparison method based on semantic aggregated fingerprints is characterized in that: the method comprises the following steps:

step one, word vector representation training: the word vector learning is jointly modeled by integrating the multi-dimensional semantic relevance, namely, a public corpus is adopted, the horizontal relation of the co-occurrence of words and words between the text and the word mapping and the longitudinal relation of similar contexts between the words and the context mapping are used for representing the words and the words, synonym and antisense information are added for modeling, and training of word vector representation is performed through an unsupervised learning method, so that the word vector obtained by training can better perform on semantic correlation and an antisense synonym recognition task;

step two, semantic feature extraction: based on a sentence with richer semantic information, after preprocessing operations including word segmentation and stop word removal are carried out on the sentence, the word is represented by adopting a word vector representation method, word segmentation is represented by combining multiple weights including word frequency and part of speech, the sum of each word segmentation word vector and the word segmentation weight is calculated, and semantic feature extraction is realized;

step four, hierarchical index construction: performing clustering training on semantic aggregated fingerprints in a training library to form a first-layer index, splitting each fingerprint margin, clustering split sub-vectors to complete construction of a second-layer index, and obtaining a hierarchical index; quantizing the semantic aggregation fingerprints of the texts in the test text library to the hierarchical indexes to generate corresponding inverted files;

2. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, wherein: in the first step, the comprehensive multi-dimensional semantic relevance joint modeling expresses each word as a K-dimensional vector, and an objective function of word vector learning is expressed as follows:

a word vector representing the ith word in the nth text,

representing words

The context of (a) employs vector translation values of a summation method,

and

respectively representing the occurrence of words in context

Probability of occurrence of word in text and text

The probability of (a) of (b) being,

and ANT _i ⁿ Respectively represent words

A set of synonyms and a set of anti-synonyms,

when the known synonym or antonym is u, the word appears

3. The method for comparing text multi-granularity similarity based on semantic aggregated fingerprints according to claim 1, wherein the method comprises the following steps: the specific method of the second step is as follows:

firstly, preprocessing a text: dividing based on punctuation to obtain a sentence set { S ₁ ,S ₂ ,...,S _M Where M represents the number of text sentences, word segmentation and word de-stop processing is performed on each sentence, denoted as { c } ₁ ,c ₂ ,...,c _T In which T represents the number of participles in a sentence, with a word frequency weight ω _f And part-of-speech weight ω _ni Is used to characterize the participle weight omega _c ＝ω _f ×ω _ni In part of speech, the noun has the highest weight, the verb is inferior, the adjective is inferior, and the rest is lowest;

wherein f is _i,k Value, w, representing the k-dimension of the ith sentence _i,j,k And ω _i,j Respectively representing the kth dimension value of the jth participle of the ith sentence and the weight value corresponding to the participle; i (-) is an indicator function when w _i,j,k When the value is more than 0, the value is 1, and the others are-1; the text is represented as a set containing M semantic features f ₁ ,f ₂ ,...,f _M }。

4. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, wherein: the concrete method of the third step is as follows:

firstly, using a K-means algorithm to cluster semantic features in a training text library into L classes, and using a cluster center to represent a cluster C = { mu = ₁ ,μ ₂ ,...,μ _L Each cluster corresponds to a sub-partition of the semantic feature space;

Id(f _i )＝arg min||f _i -μ _j || ² ,i＝1,2,...,M,j＝1,2,...,L

wherein, id (f) _i ) Index, μ, representing the sub-partition to which the semantic features are assigned _j Representing the cluster center of the jth sub-partition;

5. The method for comparing text multi-granularity similarity based on semantic aggregated fingerprints according to claim 1, wherein the method comprises the following steps: the concrete method of the fourth step is as follows:

and finally, generating an inverted file for the semantic aggregation fingerprints in the test text library according to the hierarchical index: firstly, on a first-layer index, according to the distance between the semantic aggregated fingerprint and a fingerprint word, quantizing the fingerprint word with the closest distance, calculating the fingerprint allowance, dividing the fingerprint allowance into L sub-vectors, calculating the distance between each sub-vector and the sub-vector word, and obtaining the sub-vector word ID with the closest distance; in the hierarchical index, the text information is stored in the fingerprint word index corresponding to the fingerprint dictionary, and the index storage content comprises text IDs, and each sub-vector word ID corresponds to a sub-vector.

6. The text multi-granularity similarity comparison method based on semantic aggregated fingerprints according to claim 1, wherein: the concrete method of the step five is as follows:

when global fingerprint similarity comparison is carried out, firstly, the similarity distance between the text semantic aggregation fingerprint and the semantic aggregation fingerprint quantized to index the same fingerprint word is calculated, asymmetric distance is adopted for measurement, the fingerprint word with the closest distance is selected, the fingerprint allowance of the semantic aggregation fingerprint and the corresponding fingerprint word is calculated, and the semantic aggregation fingerprint allowance is divided to generate { v } v ₁ ,v ₂ ,...,v _L }; then calculating the distance between each sub-vector and each sub-vector word, and generating a corresponding distance matrix; calculating the global distance between the text fingerprint to be compared and the ith semanteme aggregation text fingerprint quantized to the same fingerprint word

sorting according to the obtained similarity distances, and selecting the top 10 text fingerprints with the lowest similarity distances as an alternative set; then calculating the similarity of the local semantic features of the text to be compared and the ith text in the alternative set

Wherein d is _t And d _i Respectively representing the text to be compared and the ith text in the alternative setNumber of semantic features of f _t ^q And