CN115774785A - Duplicate checking method and system based on feature vector space

Info

Publication number: CN115774785A
Application number: CN202310091416.8A
Authority: CN
Inventors: 蓝建敏; 李思伟; 池沐霖; 纪绿彬
Original assignee: Excellence Information Technology Co ltd
Current assignee: Excellence Information Technology Co ltd
Priority date: 2023-02-10
Filing date: 2023-02-10
Publication date: 2023-03-10
Anticipated expiration: 2043-02-10
Also published as: CN115774785B

Abstract

The invention discloses a duplicate checking method and a duplicate checking system based on a feature vector space, wherein the method comprises the following steps: performing word segmentation processing on the target text and extracting paragraph feature vectors to obtain text data consisting of a plurality of paragraph feature vectors; confirming a cluster center vector space where each paragraph feature vector is located, and recording a first cluster number according to the cluster center vector space where each paragraph feature vector is located; obtaining paragraph feature fingerprints of each paragraph feature vector, giving fingerprint numbers to the paragraph feature fingerprints, and establishing a mapping relation between the fingerprint numbers and the first cluster numbers; and respectively acquiring all library paragraph feature fingerprints having a mapping relation with the first cluster number according to the first cluster number of each paragraph feature vector, and matching to obtain a paragraph duplication checking result. By adopting the method, the vector space is confirmed to optimize the duplicate checking data, and then the duplicate checking is accurately carried out according to the paragraph characteristic fingerprints, so that the duplicate checking dimensionality is ensured, and the duplicate checking efficiency is optimized.

Description

Duplicate checking method and system based on feature vector space

Technical Field

The invention relates to the technical field of text duplicate checking, in particular to a duplicate checking method and system based on a feature vector space.

Background

Text duplicate checking is the process of finding duplicate texts in a data stream according to a given similarity model. It is widely applied in search engine construction, plagiarism detection, news classification, and similar fields. Duplicate detection is a special case of text filtering in which the filtering condition is that the similarity between the target text and a source text exceeds a threshold.

In prior-art text duplicate checking methods, the text content is segmented into words, stop words are removed, feature words are extracted, and the feature word vectors are stored in a database; the same processing must be applied to the requested article at search time. A method based on feature word vectors can find highly similar articles, but the dimension of the feature vector is hard to control: with a high dimension, text comparison and retrieval are very slow; with a low dimension, the extracted feature words may be insufficient for similarity search, and the quality of the search results may degrade considerably.

In summary, when retrieving and comparing a large number of texts, existing text duplicate checking methods usually spend a large amount of query time in order to keep the search terms rich enough.

Disclosure of Invention

The embodiments of the invention provide a duplicate checking method and system based on a feature vector space, which narrow the duplicate checking data by confirming the vector space according to a first cluster number, and then check duplicates accurately according to paragraph feature fingerprints.

A first aspect of the embodiments of the present application provides a duplicate checking method based on a feature vector space, including:

performing word segmentation processing on the target text and extracting paragraph feature vectors to obtain text data consisting of a plurality of paragraph feature vectors; each paragraph feature vector is composed of the weights corresponding to a plurality of feature words;

confirming the cluster center vector space in which each paragraph feature vector is located, and recording a first cluster number according to that cluster center vector space; a cluster center vector space comprises all paragraph feature vectors whose cosine distance from the cluster center vector is smaller than a preset cluster value, and the union of the cluster center vector spaces covers the entire paragraph feature vector space;

obtaining a paragraph feature fingerprint of each paragraph feature vector, giving a fingerprint number to the paragraph feature fingerprint, and establishing a mapping relation between the fingerprint number and the first cluster number; the paragraph feature fingerprint is an array of features comprising the target text;

and respectively acquiring all library paragraph feature fingerprints which have a mapping relation with the first cluster number according to the first cluster number of each paragraph feature vector, and matching all library paragraph feature fingerprints with paragraph feature fingerprints corresponding to the paragraph feature vectors one by one to obtain a paragraph duplicate checking result.

In a possible implementation manner of the first aspect, a specific construction process of the cluster-like center vector space is as follows:

respectively carrying out word segmentation on all texts in a library and extracting paragraph feature vectors to obtain a plurality of library text data;

clustering all paragraph feature vectors in the plurality of library text data by adopting a clustering algorithm based on division to obtain a plurality of cluster center vectors;

dividing the vector space in which the cosine distance between a vector and a cluster center vector is smaller than the preset cluster value into the cluster center vector space corresponding to that cluster center vector; the cluster center vector spaces may intersect one another;

each cluster center vector space is assigned a first cluster number.

In a possible implementation manner of the first aspect, after assigning a first cluster number to each cluster center vector space, the method further includes:

and respectively acquiring library paragraph characteristic fingerprints of each paragraph characteristic vector in the plurality of library text data, giving a fingerprint number to each library paragraph characteristic fingerprint, and establishing a mapping relation between each fingerprint number and the first cluster numbers corresponding to the library paragraph characteristic fingerprints one by one.

In a possible implementation manner of the first aspect, after performing word segmentation processing on the target text and extracting a paragraph feature vector to obtain text data composed of a plurality of paragraph feature vectors, the method further includes:

adding the target text into a library and storing the target text into a sequence to be added;

and if the number of the paragraph feature vectors of the text in the sequence to be added is greater than a newly added threshold value, performing cluster division on all the first class cluster center vectors and all the paragraph feature vectors of the text in the sequence to be added to obtain a plurality of new class cluster center vectors.

In a possible implementation manner of the first aspect, the obtaining a paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint, and establishing a mapping relationship between the fingerprint number and the first cluster number specifically includes:

calculating a hash value of each feature word through a hash function;

weighting the character string of each characteristic word according to the hash value of each characteristic word;

accumulating the weighted results of the character strings corresponding to the feature words to obtain a paragraph sequence string;

performing dimensionality reduction calculation on the paragraph sequence string to obtain a paragraph feature fingerprint corresponding to the target text paragraph;

and giving a fingerprint number to the paragraph feature fingerprint and establishing a one-to-many mapping relation between the fingerprint number and the first cluster number.

In a possible implementation manner of the first aspect, the weighting the character string of each feature word according to the hash value of each feature word specifically includes:

obtaining a weighting factor according to the frequency of the characteristic words appearing in the target text;

multiplying each bit of the hash string of each feature word by the weighting factor; where a bit is 1, the weight is taken with a positive sign, and where a bit is 0, the weight is taken with a negative sign.

In a possible implementation manner of the first aspect, after the obtaining a paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint, and establishing a mapping relationship between the fingerprint number and the first cluster number, the method further includes:

performing weighted accumulation on all paragraph feature vectors to obtain a text vector corresponding to the target text and a corresponding target sequence string;

confirming a cluster-like central vector space where a text vector is located, and recording a second cluster number according to the cluster-like central vector space where the text vector is located;

performing displacement calculation on the target sequence string to obtain a target text fingerprint corresponding to the target text;

obtaining first cluster numbers with equal number values according to second cluster numbers corresponding to the text vectors, and obtaining all library paragraph feature fingerprints having a mapping relation with the second cluster numbers; each second cluster number corresponds to the first cluster number with the same number value one by one;

matching all the library paragraph characteristic fingerprints with the target text fingerprints one by one to obtain a text duplicate checking result.

In a possible implementation manner of the first aspect, the matching all the library paragraph feature fingerprints with the paragraph feature fingerprints corresponding to the paragraph feature vector one by one to obtain the paragraph duplication checking result specifically includes:

expanding each library paragraph feature fingerprint and a paragraph feature fingerprint corresponding to the paragraph feature vector;

performing exclusive-or operation on each expanded library paragraph feature fingerprint and the paragraph feature fingerprint, and calculating the number of 1 in an exclusive-or operation result as a sequence similarity;

turning over the paragraph characteristic fingerprints, carrying out XOR operation on each unfolded library paragraph characteristic fingerprint and the turned paragraph characteristic fingerprint, and calculating the number of 1 in the XOR operation result as turning similarity;

if the average value of the sequence similarity and the turning similarity is larger than the paragraph similarity threshold, one paragraph of the target text has repeated content in the library.

In a possible implementation manner of the first aspect, the matching all the library paragraph feature fingerprints with the target text fingerprints one by one to obtain a text duplication result specifically includes:

expanding each library paragraph feature fingerprint and the target text fingerprint;

confirming the number of the segmentation segments according to the number of the paragraph feature vectors of the target text;

respectively carrying out equal-length segmentation on the characteristic fingerprint of each library paragraph and the target text fingerprint according to the number of the segments to obtain a plurality of equal-length library paragraph characteristic sub-fingerprints and a plurality of equal-length target text sub-fingerprints;

respectively carrying out XOR operation on each target text sub-fingerprint and each library section feature sub-fingerprint, and taking the number of 1 in the XOR operation result as the section similarity;

carrying out weighted accumulation on the similarity of each subsection to obtain the text similarity;

if the text similarity is larger than a preset threshold value, repeated contents exist in the full text of the target text in the library.
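
By way of a non-limiting illustration, the segmented comparison above can be sketched in Python as follows. The function names, the use of bit strings, the pairing of sub-fingerprints by position, and the equal segment weights are all assumptions made for the sketch; the text does not fix them.

```python
def split_equal(bits, segments):
    """Split a bit string into `segments` pieces of equal length."""
    if len(bits) % segments:
        raise ValueError("fingerprint length must be divisible by the segment count")
    size = len(bits) // segments
    return [bits[i:i + size] for i in range(0, len(bits), size)]


def text_similarity(library_fp, text_fp, segments, weights):
    """Weighted sum of per-segment XOR counts (assumed pairing: same position)."""
    library_parts = split_equal(library_fp, segments)
    text_parts = split_equal(text_fp, segments)
    # Number of 1s in the XOR of two sub-fingerprints = number of differing bits.
    segment_scores = [sum(x != y for x, y in zip(a, b))
                      for a, b in zip(library_parts, text_parts)]
    return sum(w * s for w, s in zip(weights, segment_scores))


# Hypothetical 8-bit fingerprints, two segments, equal weights.
score = text_similarity("10101100", "00110101", segments=2, weights=[0.5, 0.5])
print(score)  # 2.0
```

The segment count would come from the number of paragraph feature vectors of the target text, as stated above; the value 2 here is only a placeholder.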

A second aspect of the embodiments of the present application provides a duplicate checking system based on a feature vector space, including:

the word segmentation module is used for carrying out word segmentation processing on the target text and extracting paragraph feature vectors to obtain text data consisting of a plurality of paragraph feature vectors; each paragraph feature vector is composed of a plurality of feature word corresponding weights;

the space recording module is used for confirming the cluster center vector space in which each paragraph feature vector is located and recording a first cluster number according to that cluster center vector space; a cluster center vector space comprises all paragraph feature vectors whose cosine distance from the cluster center vector is smaller than a preset cluster value, and the union of the cluster center vector spaces covers the entire paragraph feature vector space;

the fingerprint module is used for acquiring the paragraph characteristic fingerprint of each paragraph characteristic vector, giving a fingerprint number to the paragraph characteristic fingerprint and establishing a mapping relation between the fingerprint number and the first cluster number; the paragraph feature fingerprint is an array containing features of the target text;

and the matching module is used for acquiring all library paragraph characteristic fingerprints which have a mapping relation with the first class cluster numbers according to the first class cluster numbers of each paragraph characteristic vector, and matching all library paragraph characteristic fingerprints with the paragraph characteristic fingerprints corresponding to the paragraph characteristic vectors one by one to obtain a paragraph duplicate checking result.

Compared with the prior art, the embodiments of the invention provide a duplicate checking method and system based on a feature vector space. When a new target text is checked for duplicates, the target text is first segmented into words and paragraph feature vectors are extracted, giving the paragraph feature vectors of the whole text. The cluster center to which each paragraph feature vector belongs is then recorded as a first cluster number for later lookup. From the first cluster number, the cluster center vector space containing texts similar to the target text can be confirmed, and the library text data in the text database that lies outside that cluster center vector space is excluded, which shortens the time required for duplicate checking.

The information is then condensed further to obtain the corresponding paragraph feature fingerprints and fingerprint numbers. Paragraph feature fingerprints, paragraph feature vectors, and fingerprint numbers correspond one to one, so the corresponding paragraph text can easily be marked and accessed by fingerprint number. Matching the obtained paragraph feature fingerprints against the library paragraph feature fingerprints that correspond to the first cluster number in the text database shows which similar paragraphs exist for each paragraph of the target text, and the library paragraph feature fingerprints and fingerprint numbers identify the similar texts to which those paragraphs belong.

In addition, once a certain number of target texts has accumulated, clustering is performed again to obtain a plurality of new cluster center vectors; this re-divides the vector space and keeps the first cluster numbers accurate.

Drawings

FIG. 1 is a flowchart of a duplicate checking method based on a feature vector space according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a duplicate checking system based on a feature vector space according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, an embodiment of the present invention provides a duplicate checking method based on a feature vector space:

S10, performing word segmentation processing on the target text and extracting paragraph feature vectors to obtain text data consisting of a plurality of paragraph feature vectors; each paragraph feature vector is composed of the weights corresponding to a plurality of feature words.

S11, confirming the cluster center vector space in which each paragraph feature vector is located, and recording a first cluster number according to that cluster center vector space; a cluster center vector space comprises all paragraph feature vectors whose cosine distance from the cluster center vector is smaller than a preset cluster value, and the union of the cluster center vector spaces covers the entire paragraph feature vector space.

S12, obtaining the paragraph characteristic fingerprint of each paragraph characteristic vector, giving a fingerprint number to the paragraph characteristic fingerprint, and establishing a mapping relation between the fingerprint number and the first cluster number; the paragraph feature fingerprint is an array of features that includes the target text.

S13, according to the first cluster serial number of each paragraph feature vector, all library paragraph feature fingerprints which have a mapping relation with the first cluster serial number are obtained, all library paragraph feature fingerprints are matched with paragraph feature fingerprints corresponding to the paragraph feature vectors one by one, and a paragraph duplicate checking result is obtained.

A vector space model is used to express the unstructured target text as paragraph feature vectors that a computer can process; going from the target text to the paragraph feature vectors and then to the paragraph feature fingerprints is a one-way, irreversible reduction of information. The method of this embodiment comprises two stages. The first stage (S10-S11) confirms the vector space (the first cluster number) to which each paragraph feature vector belongs; the duplicate checking unit of this stage is a vector. The second stage (S12-S13) searches for similar texts within that vector space; the duplicate checking unit of this stage is a fingerprint. If the paragraph feature fingerprint contains 64 bits, the 64-bit fingerprint retains the direction information of a 64-dimensional vector space, and the duplicate checking unit is 64 bits. Paragraph feature vectors and paragraph feature fingerprints each have a uniform length: data longer than the uniform length is cut down to it, and data shorter than the uniform length is zero-padded up to it.

Finally, the target text data stored in the text database comprises the mark of the target text, the first cluster number, the center vector of the cluster to which the text belongs, the paragraph feature fingerprint, and the fingerprint number. The mark of the target text, the first cluster number, and the fingerprint number are arbitrarily assigned attributes that make it convenient for an administrator to query and sort the database.

It should be noted that, in the embodiment of the present invention, the text to be queried may be given by a user, or may be automatically acquired on a network node, and the user may set the text according to a requirement.

The word segmentation processing mentioned in S10 is a mature natural language processing technology. It divides the text to be queried into a plurality of sentences and then combines the sentences into a plurality of split paragraphs of the target text; each split paragraph contains at least one sentence and corresponds to one paragraph feature vector.

For example, given the target text "Life originally has no road; as more people walk, a road comes into being; believe that sunshine always comes after the wind and rain", word segmentation yields the feature words: life, no, become, believe, sunshine, wind and rain. Each feature word is then given a weight: life (5), no (2), become (1), believe (2), sunshine (3), wind and rain (2), where the number in parentheses represents how important the word is in the whole sentence; the larger the number, the more important the word.

Exemplarily, the specific construction process of the cluster-like center vector space in S11 is as follows:

S91, performing word segmentation processing on all texts in the library respectively and extracting paragraph feature vectors to obtain a plurality of library text data;

S92, clustering all paragraph feature vectors in the plurality of library text data by adopting a partition-based clustering algorithm to obtain a plurality of cluster center vectors;

S93, dividing the vector space in which the cosine distance between a vector and a cluster center vector is smaller than the preset cluster value into the cluster center vector space corresponding to that cluster center vector; the cluster center vector spaces may intersect one another;

S94, assigning a first cluster number to each cluster center vector space.

The embodiment of the invention adopts the K-Means method, in which each cluster is represented by the mean point of the samples grouped into it. S91-S94 cluster all texts in the library and must be completed before the target text is checked for duplicates; their purpose is to initialize the library text data of the database and the cluster center vector spaces, on which all subsequent duplicate checking is based.

K text paragraph data are randomly selected from all M library text data as the initial cluster centers; that is, the K feature vectors corresponding to the K text paragraph data are taken as the initial center vectors. The K center vectors are denoted $T'_1, T'_2, \ldots, T'_K$, and the feature vectors of the remaining M-K text paragraph data are denoted $T'_{K+1}, T'_{K+2}, \ldots, T'_M$. M and K are both positive integers, and K is less than M.

The feature vectors of the M-K text paragraph data are then divided into clusters: $T'_{K+1}, T'_{K+2}, \ldots, T'_M$ are assigned to the class clusters whose center vectors are $T'_1, T'_2, \ldots, T'_K$.
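
By way of a non-limiting illustration, the clustering described above can be sketched in pure Python as follows. This is a minimal K-Means loop that uses cosine distance for the assignment step; the function names, the fixed iteration count, the random seed, and the sample vectors are assumptions made for the sketch and do not come from the embodiment.

```python
import math
import random


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def kmeans_cosine(vectors, k, iterations=20, seed=0):
    """Divide `vectors` into k class clusters; returns (centers, labels)."""
    rng = random.Random(seed)
    # K randomly selected paragraph feature vectors are the initial center vectors.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment: each vector joins the cluster of its nearest center.
        labels = [min(range(k), key=lambda c: cosine_distance(v, centers[c]))
                  for v in vectors]
        # Update: each center becomes the mean point of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical paragraph feature vectors (weights of three feature words).
vectors = [[5, 1, 0], [4, 2, 0], [0, 1, 5], [0, 2, 4]]
centers, labels = kmeans_cosine(vectors, k=2)
```

With these sample vectors, the first two paragraphs end up in one class cluster and the last two in the other, whichever pair is drawn as initial centers.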

A cluster center vector space is determined by two parameters: the cluster center vector and the distance from that center. The distance is measured as the cosine distance between a vector and the cluster center vector, so the size of a cluster center vector space can be adjusted by setting the preset cluster value. The larger the cluster center vector space, the more library text data it finally contains and the greater the time complexity of duplicate checking; the administrator can adjust it according to actual conditions.
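
By way of a non-limiting illustration, membership in a cluster center vector space can be sketched as a threshold test on cosine distance. The function names, the numbering of clusters from 0, and the sample values are assumptions made for the sketch. Because the test is applied to every center independently, one vector may fall into several spaces.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cluster_numbers(vector, centers, preset_cluster_value):
    """Numbers of all cluster center vector spaces that contain `vector`."""
    return [number for number, center in enumerate(centers)
            if cosine_distance(vector, center) < preset_cluster_value]


# Two hypothetical cluster center vectors and a preset cluster value of 0.5.
centers = [[1.0, 0.0], [0.0, 1.0]]
near_first = cluster_numbers([0.9, 0.1], centers, preset_cluster_value=0.5)
in_between = cluster_numbers([1.0, 1.0], centers, preset_cluster_value=0.5)
print(near_first)  # [0]
print(in_between)  # [0, 1]
```

Raising the preset cluster value enlarges every space, so more library paragraphs become candidates for each query.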

After S94, the method further includes:

S95, respectively obtaining the library paragraph feature fingerprint of each paragraph feature vector in the plurality of library text data, giving a fingerprint number to each library paragraph feature fingerprint, and establishing, one by one, a mapping relation between each fingerprint number and the first cluster number corresponding to the library paragraph feature fingerprint.

Once this mapping relation has been established, all text paragraphs in the library have been assigned to their cluster center vector spaces, so an administrator can later narrow the duplicate checking space (range) from the first cluster number alone.
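
By way of a non-limiting illustration, the mapping between fingerprint numbers and first cluster numbers can be sketched as a small in-memory index. The class name, the sequential numbering, and the use of integers as fingerprints are assumptions made for the sketch; an actual text database would store these relations in its own tables.

```python
from collections import defaultdict


class FingerprintIndex:
    """Maps fingerprint numbers to fingerprints and cluster numbers to fingerprint numbers."""

    def __init__(self):
        self.fingerprints = {}              # fingerprint number -> fingerprint
        self.by_cluster = defaultdict(set)  # first cluster number -> fingerprint numbers

    def add(self, fingerprint, cluster_numbers):
        """Store a library paragraph feature fingerprint and return its new number."""
        number = len(self.fingerprints)
        self.fingerprints[number] = fingerprint
        for cluster in cluster_numbers:
            self.by_cluster[cluster].add(number)
        return number

    def candidates(self, cluster_number):
        """All library fingerprints mapped to the given first cluster number."""
        numbers = sorted(self.by_cluster.get(cluster_number, ()))
        return {n: self.fingerprints[n] for n in numbers}


index = FingerprintIndex()
index.add(0b10101, [0])
index.add(0b00110, [0, 1])  # a paragraph lying in two cluster center vector spaces
index.add(0b11100, [1])
print(index.candidates(1))  # {1: 6, 2: 28}
```

Only the fingerprints returned by `candidates` need to be compared with a query paragraph, which is how the first cluster number narrows the duplicate checking range.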

Exemplarily, S10 further includes:

S101, adding the target text into the library and storing the target text in a sequence to be added;

S102, if the number of paragraph feature vectors of the texts in the sequence to be added is larger than a newly added threshold value, performing cluster division on all the first cluster center vectors and all the paragraph feature vectors of the texts in the sequence to be added to obtain a plurality of new cluster center vectors.

In this embodiment, the sequence to be added is monitored in real time. A number of paragraph feature vectors greater than the newly added threshold means that the database has received and stored a certain amount of text within a period of time; if the cluster center vectors were not updated at that point, the influence of these texts would be ignored in subsequent duplicate checking, reducing the accuracy of the results.
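
By way of a non-limiting illustration, the threshold-triggered re-clustering can be sketched as follows. The class name, the pluggable `recluster` callable, the stand-in `mean_center` routine, and the emptying of the sequence after re-clustering are all assumptions made for the sketch; in practice `recluster` would be the clustering routine used to build the cluster center vectors.

```python
def mean_center(vectors):
    """Stand-in for a real clustering routine: returns one mean center vector."""
    return [[sum(col) / len(vectors) for col in zip(*vectors)]]


class PendingSequence:
    """Collects paragraph feature vectors of newly added texts until a threshold is passed."""

    def __init__(self, new_threshold, recluster):
        self.new_threshold = new_threshold
        self.recluster = recluster
        self.pending = []

    def add_text(self, paragraph_vectors, centers):
        """Queue the vectors of one text; return the (possibly updated) center vectors."""
        self.pending.extend(paragraph_vectors)
        if len(self.pending) > self.new_threshold:
            # Re-divide using the current centers together with the queued vectors.
            centers = self.recluster(list(centers) + self.pending)
            self.pending = []  # assumption: the sequence is emptied afterwards
        return centers


queue = PendingSequence(new_threshold=2, recluster=mean_center)
centers = [[1.0, 0.0]]
centers = queue.add_text([[0.0, 1.0]], centers)               # 1 queued, no update
after_first = centers
centers = queue.add_text([[0.0, 1.0], [1.0, 0.0]], centers)   # 3 queued > 2, update
print(centers)  # [[0.5, 0.5]]
```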

Exemplarily, S12 specifically includes:

S120, calculating a hash value of each feature word through a hash function.

S121, weighting the character string of each feature word according to the hash value of each feature word.

S122, accumulating the weighted results of the character strings corresponding to the feature words to obtain a paragraph sequence string.

S123, performing dimension reduction calculation on the paragraph sequence string to obtain a paragraph feature fingerprint corresponding to the target text paragraph.

S124, giving a fingerprint number to the paragraph feature fingerprint and establishing a one-to-many mapping relation between the fingerprint number and the first cluster number.

Exemplarily, S121 specifically includes:

S1210, obtaining a weighting factor according to the frequency with which the feature word appears in the target text.

S1211, multiplying each bit of the hash string of each feature word by the weighting factor; where a bit is 1, the weight is taken with a positive sign, and where a bit is 0, the weight is taken with a negative sign.

A hash value is calculated for each feature word through a hash function; the hash value is an n-bit signature consisting of the binary digits 0 and 1. For example, the hash value of "life" is 110101 and the hash value of "no" is 101001, so each string becomes a series of digits. On the basis of the hash values, all feature words are weighted, that is, W = Hash × weight: where a bit is 1 the weight is taken with a positive sign, and where a bit is 0 the weight is taken with a negative sign. For example, weighting the hash value 110101 of "life" with weight 5 yields W(life) = 5 5 -5 5 -5 5, and weighting the hash value 101001 of "no" with weight 2 yields W(no) = 2 -2 2 -2 -2 2. Likewise, for a hash value 101011… of "natural language" with weight 5, W(natural language) = 5 -5 5 -5 5 5 …, and for a hash value 100101… of "processing" with weight 4, W(processing) = 4 -4 -4 4 -4 4 …. The remaining feature words are handled in the same way.

The weighted results of the feature words are accumulated position by position to form a sequence string. Taking the first two feature words as an example, "5 5 -5 5 -5 5" of "life" and "2 -2 2 -2 -2 2" of "no" add up to "5+2, 5-2, -5+2, 5-2, -5-2, 5+2", which gives "7 3 -3 3 -7 7".
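
By way of a non-limiting illustration, the weighting and accumulation rule can be sketched in Python using the hash values 110101 (weight 5) and 101001 (weight 2) from the example. The function names and the representation of hash values as strings are assumptions made for the sketch.

```python
def weight_hash(hash_bits, weight):
    """Map each bit of a binary hash string to +weight (bit 1) or -weight (bit 0)."""
    return [weight if bit == "1" else -weight for bit in hash_bits]


def accumulate(weighted_rows):
    """Add the weighted results position by position to form the sequence string."""
    return [sum(column) for column in zip(*weighted_rows)]


life = weight_hash("110101", 5)   # [5, 5, -5, 5, -5, 5]
no = weight_hash("101001", 2)     # [2, -2, 2, -2, -2, 2]
sequence = accumulate([life, no])
print(sequence)  # [7, 3, -3, 3, -7, 7]
```

All hash values must have the same number of bits n so that the position-wise sum is defined for every feature word.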

Exemplarily, S123 specifically includes:

S1230, sequentially recording the accumulation order of each feature word according to the weighted result accumulation process of the paragraph sequence string;

S1231, accumulating the character strings corresponding to each feature word, and respectively generating a word code corresponding to each feature word; generating a corresponding word coding sequence according to the accumulation order of each feature word;

S1232, forming a coding feature matrix from the word coding sequence and the paragraph sequence string;

S1233, performing an XOR operation on the coding feature matrix to obtain the paragraph feature fingerprint corresponding to the target text paragraph.

Through these steps, the accumulation order of the feature words is carried into the fingerprint of the paragraph sequence string, which prevents the paragraph sequence string from losing its connection to the original information after the dimension reduction calculation. The word coding sequence and the paragraph sequence string form a coding feature matrix; the XOR operation further reduces the dimension of the paragraph sequence string, while the added word coding sequence preserves the features of the information, so the paragraph feature fingerprint is extracted more reliably.
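
The word codes and the XOR over the coding feature matrix are not specified in enough detail here to be reproduced. For comparison only, the sketch below shows a different, well-known reduction: the sign thresholding used in standard SimHash, where each positive accumulated value becomes bit 1 and every other value becomes bit 0. It is a substitute technique, not the procedure of S1230-S1233, and the function name and sample sequence are assumptions.

```python
def sign_fingerprint(sequence):
    """Standard SimHash reduction: positive values map to '1', all others to '0'."""
    return "".join("1" if value > 0 else "0" for value in sequence)


# An accumulated sequence string such as the one built from two weighted hashes.
fingerprint = sign_fingerprint([7, 3, -3, 3, -7, 7])
print(fingerprint)  # 110101
```

Unlike the coding feature matrix approach described above, sign thresholding keeps no record of the order in which feature words were accumulated.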

Exemplarily, after S12, further comprising:

and S125, performing weighted accumulation on all paragraph feature vectors to obtain a text vector corresponding to the target text and a corresponding target sequence string.

And S126, confirming the center vector space of the similar cluster where the text vector is located, and recording the serial number of the second similar cluster according to the center vector space of the similar cluster where the text vector is located.

And S127, performing displacement calculation on the target sequence string to obtain a target text fingerprint corresponding to the target text.

S128, obtaining first cluster numbers with equal number values according to second cluster numbers corresponding to the text vectors, and obtaining all library paragraph feature fingerprints having a mapping relation with the second cluster numbers; and each second cluster number corresponds to the first cluster numbers with the same number value one by one.

And S129, matching all the library paragraph characteristic fingerprints with the target text fingerprints one by one to obtain a text duplicate checking result.

In addition to querying the similar texts corresponding to each paragraph, if a duplicate checking applicant wants to check for duplication from the perspective of the full-text concept and the full-text subject, all paragraph feature vectors of the target text need to be weighted to form a text vector and a corresponding target sequence string. The target text fingerprint is formed in a similar way to the paragraph feature fingerprint, which is not described again here; the difference is that the paragraph feature fingerprint is calculated from the paragraph feature vector, whereas the target text fingerprint is calculated from the text vector. It should be noted that the paragraph feature fingerprint and the target text fingerprint must have the same length.
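The weighted accumulation in S125 can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify the weights, so the hypothetical function `text_vector` defaults to uniform weights 1/n, and all paragraph feature vectors are assumed to share the same dimension (the same feature-word vocabulary).

```python
from typing import List, Optional, Sequence


def text_vector(
    paragraph_vectors: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted accumulation of paragraph feature vectors into one text vector.

    Assumption: uniform weights (1/n) when none are given; the source
    leaves the weighting scheme open.
    """
    if not paragraph_vectors:
        raise ValueError("at least one paragraph feature vector is required")
    n = len(paragraph_vectors)
    dim = len(paragraph_vectors[0])
    if any(len(v) != dim for v in paragraph_vectors):
        raise ValueError("all paragraph feature vectors must have the same dimension")
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("one weight per paragraph feature vector is required")
    # Accumulate weight * component for every dimension.
    result = [0.0] * dim
    for w, vec in zip(weights, paragraph_vectors):
        for k, component in enumerate(vec):
            result[k] += w * component
    return result
```

The resulting text vector can then be assigned a second cluster number in the same way that a paragraph feature vector is assigned a first cluster number.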

Exemplarily, S13 specifically includes:

S130, expanding each library paragraph feature fingerprint and the paragraph feature fingerprint corresponding to the paragraph feature vector.

S131, performing an XOR operation on each expanded library paragraph feature fingerprint and the paragraph feature fingerprint, and taking the number of 1s in the XOR result as the sequence similarity.

S132, flipping the paragraph feature fingerprint, performing an XOR operation on each expanded library paragraph feature fingerprint and the flipped paragraph feature fingerprint, and taking the number of 1s in the XOR result as the flip similarity.

S133, if the average of the sequence similarity and the flip similarity is greater than the paragraph similarity threshold, a paragraph of the target text has repeated content in the library.

For example, if the expanded library paragraph feature fingerprint and the expanded paragraph feature fingerprint corresponding to the paragraph feature vector are 10101 and 00110 respectively, the XOR result is 10011, which contains three 1s, so the sequence similarity is 3. The flip similarity is calculated in the same way, except that the operands are 10101 and 01100.

Given the sequence similarity Sim₁ with a weight of 0.5 and the flip similarity Sim₂ with a weight of 0.5, the similarity fusion algorithm, namely formula (1), calculates the vector-space-based similarity Sim between the target text and the library text.

Sim = Sim₁ · 0.5 + Sim₂ · 0.5    (1)

If the value of Sim is greater than the paragraph similarity threshold, a paragraph of the target text has repeated content in the library. It should be noted that each library paragraph feature fingerprint needs to be compared once with the paragraph feature fingerprint corresponding to the paragraph feature vector, and the value of Sim is recalculated for each comparison.
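The paragraph matching of S130 to S133 and formula (1) can be sketched as follows, using the worked example 10101 and 00110. The function names are hypothetical, and the sketch assumes that the fingerprints have already been expanded into equal-length bit strings and that "flipping" means reversing the bit order, as the example operand 01100 suggests.

```python
from typing import Tuple


def count_xor_ones(a: str, b: str) -> int:
    """Number of 1 bits in the XOR of two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def paragraph_similarity(
    library_fp: str,
    paragraph_fp: str,
    w_seq: float = 0.5,
    w_flip: float = 0.5,
) -> Tuple[int, int, float]:
    """Return (sequence similarity, flip similarity, fused Sim per formula (1))."""
    sim_seq = count_xor_ones(library_fp, paragraph_fp)
    # Assumption: flipping reverses the bit order (00110 -> 01100).
    sim_flip = count_xor_ones(library_fp, paragraph_fp[::-1])
    return sim_seq, sim_flip, sim_seq * w_seq + sim_flip * w_flip


def is_duplicate(library_fp: str, paragraph_fp: str, threshold: float) -> bool:
    """S133: compare the fused value with the paragraph similarity threshold."""
    return paragraph_similarity(library_fp, paragraph_fp)[2] > threshold
```

For the example above, `paragraph_similarity("10101", "00110")` gives a sequence similarity of 3, a flip similarity of 3, and Sim = 3.0. The comparison is repeated for every library paragraph feature fingerprint retrieved through the first cluster number.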

Exemplarily, S129 specifically includes:

S1290, expanding each library paragraph feature fingerprint and the target text fingerprint.

S1291, confirming the number of segments according to the number of paragraph feature vectors of the target text.

S1292, dividing each library paragraph feature fingerprint and the target text fingerprint into equal-length segments according to the number of segments, to obtain a plurality of equal-length library paragraph feature sub-fingerprints and a plurality of equal-length target text sub-fingerprints.

S1293, performing an XOR operation on each target text sub-fingerprint and each library paragraph feature sub-fingerprint, and taking the number of 1s in the XOR result as the sub-segment similarity.

S1294, performing weighted accumulation on the sub-segment similarities to obtain the text similarity.

S1295, if the text similarity is greater than a preset threshold, the full text of the target text has repeated content in the library.

In this embodiment, the repetition of the full text of the target text is determined according to the difference between each library paragraph feature fingerprint and the target text fingerprint, and the first parameter to be obtained is the number of segments. The number of segments depends on the number of paragraph feature vectors of the target text, that is, the number of split paragraphs of the target text obtained in S10.

If the target text has 8 split paragraphs, the number of segments is 8, and the text similarity is Sim′ = Sim₁₁′·α₁₁ + Sim₂₁′·α₂₁ + Sim₃₁′·α₃₁ + … + Sim₈₁′·α₈₁, where Sim₁₁′ is the sub-segment similarity between the first target text sub-fingerprint and the first library paragraph feature sub-fingerprint and α₁₁ is the corresponding weight, and Sim₁₂′ is the sub-segment similarity between the first target text sub-fingerprint and the second library paragraph feature sub-fingerprint and α₁₂ is the corresponding weight. In total, 8 × 8 sub-segment similarities are calculated, weighted, and accumulated.
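The full-text matching of S1290 to S1294 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, both fingerprints are already expanded into bit strings of the same length, that length is divisible by the number of segments, and the weights default to a uniform 1/n² because the source does not specify them.

```python
from typing import List, Optional, Sequence


def split_equal(bits: str, n_segments: int) -> List[str]:
    """S1292: split a bit string into n_segments equal-length sub-fingerprints."""
    if n_segments <= 0 or len(bits) % n_segments != 0:
        raise ValueError("length must be divisible by the number of segments")
    size = len(bits) // n_segments
    return [bits[i:i + size] for i in range(0, len(bits), size)]


def text_similarity(
    library_fp: str,
    target_fp: str,
    n_segments: int,
    weights: Optional[Sequence[Sequence[float]]] = None,
) -> float:
    """S1293-S1294: weighted sum of the n x n sub-segment similarities.

    weights[i][j] pairs target sub-fingerprint i with library sub-fingerprint j.
    Assumption: uniform weights 1/n^2 when none are given.
    """
    if len(library_fp) != len(target_fp):
        raise ValueError("fingerprints must have the same length")
    target_subs = split_equal(target_fp, n_segments)
    library_subs = split_equal(library_fp, n_segments)
    if weights is None:
        uniform = 1.0 / (n_segments * n_segments)
        weights = [[uniform] * n_segments for _ in range(n_segments)]
    total = 0.0
    for i, t_sub in enumerate(target_subs):
        for j, l_sub in enumerate(library_subs):
            # Sub-segment similarity: number of 1s in the XOR result.
            ones = bin(int(t_sub, 2) ^ int(l_sub, 2)).count("1")
            total += weights[i][j] * ones
    return total
```

With 8 segments, the double loop evaluates the 8 × 8 sub-segment similarities described above; the result is then compared with the preset threshold of S1295.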

Compared with the prior art, the embodiment of the invention provides a duplicate checking method based on a feature vector space. When a new target text is checked for duplication, word segmentation processing is first performed on the target text and paragraph feature vectors are extracted, yielding a plurality of paragraph feature vectors corresponding to the whole text. The first cluster number then records the cluster center to which each paragraph feature vector belongs, to facilitate subsequent retrieval. According to the first cluster number, related personnel can confirm the cluster center vector space where the similar texts of the target text are located, so that the library text data of the text database corresponding to spaces other than that cluster center vector space is excluded and the time required for duplicate checking is shortened.

Information concentration is then further performed to obtain the corresponding paragraph feature fingerprints and fingerprint numbers; the paragraph feature fingerprints and the paragraph feature vectors correspond one-to-one with the fingerprint numbers, so that related personnel can easily mark and access the corresponding paragraph text according to the fingerprint number. The obtained paragraph feature fingerprints are matched with the library paragraph feature fingerprints corresponding to the first cluster number in the text database, so that the similar paragraphs of each paragraph in the target text can be obtained, and the similar texts to which the similar paragraphs belong can be confirmed from the library paragraph feature fingerprints and the fingerprint numbers.

Referring to fig. 2, an embodiment of the present application provides a duplicate checking system based on a feature vector space, which includes a word segmentation module 20, a space recording module 21, a fingerprint module 22, and a matching module 23.

The word segmentation module 20 is configured to perform word segmentation processing on the target text and extract paragraph feature vectors to obtain text data composed of a plurality of paragraph feature vectors; each paragraph feature vector is composed of the weights corresponding to a plurality of feature words.

The space recording module 21 is configured to confirm the cluster center vector space where each paragraph feature vector is located, and record a first cluster number according to the cluster center vector space where the paragraph feature vector is located; the cluster center vector space comprises all paragraph feature vectors whose cosine distance from the cluster center vector is smaller than a preset cluster value, and the sum of the cluster center vector spaces is greater than or equal to the whole paragraph feature vector space.

The fingerprint module 22 is configured to obtain a paragraph feature fingerprint of each paragraph feature vector, assign a fingerprint number to the paragraph feature fingerprint, and establish a mapping relationship between the fingerprint number and the first cluster number; the paragraph feature fingerprint is an array containing the features of the target text.

The matching module 23 is configured to obtain, according to the first cluster number of each paragraph feature vector, all library paragraph feature fingerprints having a mapping relationship with the first cluster number, and match all the library paragraph feature fingerprints one by one with the paragraph feature fingerprint corresponding to the paragraph feature vector to obtain a paragraph duplicate checking result.
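The cooperation of the space recording module and the matching module can be sketched as follows. This is a sketch under stated assumptions, not the system's actual implementation: cosine distance is taken to be 1 minus cosine similarity, the names `cluster_numbers` and `candidate_fingerprints` are hypothetical, and the mapping from cluster numbers to fingerprint numbers is modeled as a plain dictionary.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed definition: 1 - cosine similarity (1.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cluster_numbers(
    vector: Sequence[float],
    centers: Dict[int, Sequence[float]],
    preset_cluster_value: float,
) -> List[int]:
    """All cluster numbers whose center lies within the preset cosine distance.

    Because the cluster center vector spaces may overlap, one vector can
    receive several cluster numbers.
    """
    return [
        number
        for number, center in centers.items()
        if cosine_distance(vector, center) < preset_cluster_value
    ]


def candidate_fingerprints(
    numbers: Sequence[int],
    cluster_to_fingerprints: Dict[int, List[str]],
) -> List[str]:
    """Fingerprint numbers mapped to the given clusters, without duplicates."""
    seen = set()
    result = []
    for number in numbers:
        for fp_number in cluster_to_fingerprints.get(number, []):
            if fp_number not in seen:
                seen.add(fp_number)
                result.append(fp_number)
    return result
```

Only the fingerprints returned by `candidate_fingerprints` need to be matched against the paragraph feature fingerprint, which is how library data outside the relevant cluster center vector spaces is excluded.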

It is clear to those skilled in the art that for the convenience and brevity of description, the specific working procedures of the system described above may refer to the corresponding procedures in the foregoing method embodiments, which are not reiterated herein.

Compared with the prior art, the embodiment of the invention provides a duplicate checking system based on a feature vector space. When a new target text is checked for duplication, word segmentation processing is first performed on the target text and paragraph feature vectors are extracted, yielding a plurality of paragraph feature vectors corresponding to the whole text. The first cluster number then records the cluster center to which each paragraph feature vector belongs, to facilitate subsequent retrieval. According to the first cluster number, related personnel can confirm the cluster center vector space where the similar texts of the target text are located, so that the library text data of the text database corresponding to spaces other than that cluster center vector space is excluded and the time required for duplicate checking is shortened.

In addition, when the accumulated target texts reach a certain number, cluster division is performed again to obtain a plurality of new cluster center vectors, so that the vector space is re-divided and the accuracy of the first cluster numbers is guaranteed.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A duplicate checking method based on a feature vector space is characterized by comprising the following steps:

confirming a cluster center vector space where each paragraph feature vector is located, and recording a first cluster number according to the cluster center vector space where each paragraph feature vector is located; the cluster center vector space comprises all paragraph feature vectors whose cosine distance from the cluster center vector is smaller than a preset cluster value, and the sum of the cluster center vector spaces is greater than or equal to the whole paragraph feature vector space;

obtaining paragraph feature fingerprints of each paragraph feature vector, giving fingerprint numbers to the paragraph feature fingerprints, and establishing a mapping relation between the fingerprint numbers and the first cluster numbers; the paragraph feature fingerprint is an array of features comprising the target text;

2. The duplicate checking method based on the feature vector space of claim 1, wherein the specific construction process of the cluster center vector space is as follows:

dividing a vector space with a cosine distance between the vector and each cluster center vector smaller than a preset cluster value into cluster center vector spaces corresponding to the cluster center vectors; the feature vector space of each paragraph has intersection;

each cluster center vector space is assigned a first cluster number.

3. The duplicate checking method based on the feature vector space as claimed in claim 2, wherein after assigning a first cluster number to each cluster center vector space, the method further comprises:

4. The method for duplicate checking based on feature vector space of claim 2, wherein after performing word segmentation processing on the target text and extracting paragraph feature vectors to obtain text data composed of a plurality of paragraph feature vectors, the method further comprises:

and if the number of the paragraph feature vectors of the text in the sequence to be added is larger than a newly added threshold value, performing cluster division on all the first class cluster center vectors and all the paragraph feature vectors of the text in the sequence to be added to obtain a plurality of new class cluster center vectors.

5. The method according to claim 1, wherein the obtaining of the paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint, and establishing a mapping relationship between the fingerprint number and the first cluster number specifically comprises:

calculating a hash value of each feature word through a hash function;

weighting the character strings of each feature word according to the hash value of each feature word;

and assigning a fingerprint number to the paragraph feature fingerprint and establishing a one-to-many mapping relation between the fingerprint number and the first cluster number.

6. The method for duplicate checking based on the feature vector space according to claim 5, wherein the weighting is performed on the character string of each feature word according to the hash value of each feature word, and specifically comprises:

7. The method for duplicate checking based on feature vector space of claim 5, wherein after obtaining the paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint and establishing a mapping relationship between the fingerprint number and the first cluster number, the method further comprises:

obtaining first cluster numbers with equal number values according to second cluster numbers corresponding to the text vectors, and obtaining all library paragraph feature fingerprints having a mapping relation with the second cluster numbers; each second cluster number corresponds to the first cluster numbers with the same number value one by one;

8. The method for duplicate checking based on feature vector space according to claim 1, wherein the matching of all the library paragraph feature fingerprints with the paragraph feature fingerprints corresponding to the paragraph feature vector one by one to obtain the paragraph duplicate checking result specifically comprises:

expanding each library paragraph feature fingerprint and the paragraph feature fingerprint corresponding to the paragraph feature vector;

flipping the paragraph feature fingerprint, performing an XOR operation on each expanded library paragraph feature fingerprint and the flipped paragraph feature fingerprint, and taking the number of 1s in the XOR result as the flip similarity;

if the average of the sequence similarity and the flip similarity is greater than the paragraph similarity threshold, a paragraph of the target text has repeated content in the library.

9. The method for duplicate checking based on feature vector space according to claim 7, wherein the matching of all the library paragraph feature fingerprints with the target text fingerprints one by one to obtain a text duplicate checking result specifically comprises:

dividing each library paragraph feature fingerprint and the target text fingerprint into equal-length segments according to the number of segments, to obtain a plurality of equal-length library paragraph feature sub-fingerprints and a plurality of equal-length target text sub-fingerprints;

10. A duplicate checking system based on a feature vector space is characterized by comprising:

the word segmentation module is used for performing word segmentation processing on the target text and extracting paragraph feature vectors to obtain text data consisting of a plurality of paragraph feature vectors; each paragraph feature vector consists of the weights corresponding to a plurality of feature words;

the space recording module is used for confirming the cluster center vector space where each paragraph feature vector is located and recording a first cluster number according to the cluster center vector space where each paragraph feature vector is located; the cluster center vector space comprises all paragraph feature vectors whose cosine distance from the cluster center vector is smaller than a preset cluster value, and the sum of the cluster center vector spaces is greater than or equal to the whole paragraph feature vector space;

the fingerprint module is used for acquiring the paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint, and establishing a mapping relation between the fingerprint number and the first cluster number; the paragraph feature fingerprint is an array containing the features of the target text;