CN115774785B - Duplicate checking method and system based on feature vector space

Publication number
CN115774785B
Authority
CN
China
Prior art keywords
paragraph
feature
fingerprint
text
library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310091416.8A
Other languages
Chinese (zh)
Other versions
CN115774785A (en
Inventor
蓝建敏
李思伟
池沐霖
纪绿彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Excellence Information Technology Co ltd
Original Assignee
Excellence Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Excellence Information Technology Co ltd
Priority to CN202310091416.8A
Publication of CN115774785A
Application granted
Publication of CN115774785B
Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a duplicate checking method and system based on a feature vector space. The method comprises: performing word segmentation and paragraph segmentation on a target text and extracting paragraph feature vectors, to obtain text data consisting of a plurality of paragraph feature vectors; determining the cluster center vector space in which each paragraph feature vector lies, and recording a first cluster number according to that cluster center vector space; obtaining the paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint, and establishing a mapping between the fingerprint number and the first cluster number; and, for each paragraph feature vector, acquiring all library paragraph feature fingerprints mapped to its first cluster number and matching against them, to obtain a paragraph duplicate checking result. With this method, the vector space is determined first to narrow the data to be checked, and precise duplicate checking is then performed on the paragraph feature fingerprints, which preserves the dimensionality used for checking while improving checking efficiency.

Description

Duplicate checking method and system based on feature vector space
Technical Field
The invention relates to the technical field of text duplicate checking, in particular to a duplicate checking method and system based on a feature vector space.
Background
Text duplicate checking is the process of finding duplicate text in a data stream according to a given similarity model. It is widely applied in search engine construction, plagiarism detection, news classification, and similar fields. It can be regarded as a special case of text filtering in which the condition is that the similarity between the target text and the source text exceeds a threshold.
In prior-art text duplicate checking methods, the text content is segmented into words, stop words are removed, feature words are extracted, and the feature word vectors are stored in a database; an article submitted for checking must be processed in the same way. Articles with high similarity can be found on the basis of the feature word vectors, but the dimension of the feature vector is hard to control: with a high dimension, text comparison and search are very slow; with a low dimension, the extracted feature words may be insufficient for similarity retrieval, and the quality of the search results may degrade considerably.
In summary, when many texts must be searched and compared, conventional text duplicate checking methods generally consume more search time in order to keep the feature terms rich.
Disclosure of Invention
The embodiments of the invention provide a duplicate checking method and system based on a feature vector space, which first determine the vector space according to the first cluster number to narrow the data to be checked, and then perform precise duplicate checking according to paragraph feature fingerprints.
A first aspect of an embodiment of the present application provides a duplication checking method based on feature vector space, including:
performing word segmentation and paragraph segmentation on the target text, and extracting paragraph feature vectors to obtain text data consisting of a plurality of paragraph feature vectors; each paragraph feature vector consists of the weights corresponding to a plurality of feature words;
determining the cluster center vector space in which each paragraph feature vector lies, and recording a first cluster number according to that cluster center vector space; each cluster center vector space comprises all paragraph feature vectors whose cosine distance to the cluster center vector is smaller than a preset cluster threshold, and the union of the cluster center vector spaces is greater than or equal to the entire paragraph feature vector space;
obtaining the paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint, and establishing a mapping between the fingerprint number and the first cluster number; the paragraph feature fingerprint is an array containing features of the target text;
and, according to the first cluster number of each paragraph feature vector, acquiring all library paragraph feature fingerprints mapped to that first cluster number, and matching them one by one against the paragraph feature fingerprint corresponding to the paragraph feature vector to obtain a paragraph duplicate checking result.
In a possible implementation manner of the first aspect, the cluster center vector spaces are constructed as follows:
performing word segmentation on each text in the library and extracting paragraph feature vectors, to obtain a plurality of library text data;
clustering all paragraph feature vectors in the plurality of library text data with a partition-based clustering algorithm to obtain a plurality of cluster center vectors;
for each cluster center vector, taking the region of vector space whose cosine distance to that cluster center vector is smaller than the preset cluster threshold as the cluster center vector space corresponding to that cluster center vector; the resulting vector spaces have intersections;
assigning a first cluster number to each cluster center vector space.
In a possible implementation manner of the first aspect, after assigning a first cluster number to each cluster center vector space, the method further includes:
obtaining the library paragraph feature fingerprint of each paragraph feature vector in the plurality of library text data, assigning a fingerprint number to each library paragraph feature fingerprint, and establishing, one by one, a mapping between each fingerprint number and the first cluster number corresponding to that library paragraph feature fingerprint.
In a possible implementation manner of the first aspect, after performing word segmentation and paragraph segmentation on the target text and extracting paragraph feature vectors to obtain text data composed of a plurality of paragraph feature vectors, the method further includes:
adding the target text to the library and storing it in a sequence to be added;
and, if the number of paragraph feature vectors of the texts in the sequence to be added is greater than a newly added threshold, clustering all first cluster center vectors together with all paragraph feature vectors of the texts in the sequence to be added, to obtain a plurality of new cluster center vectors.
In a possible implementation manner of the first aspect, the obtaining a paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint, and establishing a mapping relationship between the fingerprint number and the first cluster number specifically includes:
calculating a hash value of each feature word through a hash function;
weighting the character strings of each feature word according to the hash value of each feature word;
accumulating the weighted results of the character strings corresponding to the characteristic words to obtain paragraph sequence strings;
performing dimension reduction calculation on the paragraph sequence strings to obtain paragraph feature fingerprints corresponding to the paragraphs of the target text;
and giving a fingerprint number to the paragraph characteristic fingerprint and establishing a one-to-many mapping relation between the fingerprint number and the first cluster number.
In a possible implementation manner of the first aspect, the weighting the character string of each feature word according to the hash value of each feature word specifically includes:
obtaining a weighting factor according to the frequency of the feature words in the target text;
multiplying each bit of the character string of each feature word by the weighting factor; where the hash bit is 1 the weight is multiplied positively, and where the hash bit is 0 the weight is multiplied negatively.
In a possible implementation manner of the first aspect, after the obtaining a paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint and establishing a mapping relationship between the fingerprint number and the first cluster number, the method further includes:
performing weighted accumulation on all paragraph feature vectors to obtain a text vector corresponding to the target text and a corresponding target sequence string;
confirming a class cluster center vector space in which a text vector is located, and recording a second class cluster number according to the class cluster center vector space;
performing displacement calculation on the target sequence string to obtain a target text fingerprint corresponding to the target text;
obtaining the first cluster number whose value equals the second cluster number of the text vector, and acquiring all library paragraph feature fingerprints mapped to it; each second cluster number corresponds one-to-one to the first cluster number with the same value;
and matching all the library paragraph feature fingerprints with the target text fingerprints one by one to obtain a text duplication checking result.
In a possible implementation manner of the first aspect, matching all the library paragraph feature fingerprints one by one against the paragraph feature fingerprint corresponding to the paragraph feature vector to obtain a paragraph duplicate checking result specifically includes:
expanding each library paragraph feature fingerprint and the paragraph feature fingerprint corresponding to the paragraph feature vector;
performing an exclusive-or (XOR) operation on each expanded library paragraph feature fingerprint and the paragraph feature fingerprint, and counting the number of 1 bits in the XOR result as the sequential similarity;
flipping the paragraph feature fingerprint, performing an XOR operation on each expanded library paragraph feature fingerprint and the flipped paragraph feature fingerprint, and counting the number of 1 bits in the XOR result as the flipped similarity;
if the average of the sequential similarity and the flipped similarity is greater than a paragraph similarity threshold, the paragraph of the target text has duplicate content in the library.
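The XOR counting in the steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: fingerprints are assumed to be Python integers of a fixed bit width, "flipping" is assumed to mean reversing the bit order (the text does not define it), and the names `xor_ones`, `flip_bits` and `paragraph_scores` are hypothetical. The number of 1 bits in an XOR result is the Hamming distance between the two fingerprints; the comparison against the paragraph similarity threshold is left to the caller.

```python
def xor_ones(a: int, b: int) -> int:
    """Count the 1 bits in a XOR b (the Hamming distance of a and b)."""
    return bin(a ^ b).count("1")


def flip_bits(fingerprint: int, width: int) -> int:
    """Reverse the bit order of a fixed-width fingerprint.

    Assumption: this is one possible reading of 'flipping'.
    """
    return int(format(fingerprint, f"0{width}b")[::-1], 2)


def paragraph_scores(target_fp: int, library_fps: dict, width: int = 64) -> dict:
    """Return {fingerprint number: (sequential, flipped, average)}."""
    flipped_target = flip_bits(target_fp, width)
    scores = {}
    for number, library_fp in library_fps.items():
        sequential = xor_ones(library_fp, target_fp)
        flipped = xor_ones(library_fp, flipped_target)
        scores[number] = (sequential, flipped, (sequential + flipped) / 2)
    return scores
```

For example, `paragraph_scores(0b110101, {1: 0b101001}, width=6)` compares one 6-bit library fingerprint with a 6-bit target fingerprint.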
In a possible implementation manner of the first aspect, matching all the library paragraph feature fingerprints one by one against the target text fingerprint to obtain a text duplicate checking result specifically includes:
expanding each library paragraph feature fingerprint and the target text fingerprint;
determining the number of segments according to the number of paragraph feature vectors of the target text;
dividing each library paragraph feature fingerprint and the target text fingerprint into equal-length segments according to the number of segments, to obtain a plurality of equal-length library paragraph feature sub-fingerprints and a plurality of equal-length target text sub-fingerprints;
performing an XOR operation on each target text sub-fingerprint and each library paragraph feature sub-fingerprint, and taking the number of 1 bits in the XOR result as the sub-paragraph similarity;
performing weighted accumulation on the sub-paragraph similarities to obtain the text similarity;
and, if the text similarity is greater than a preset threshold, the full text of the target text has duplicate content in the library.
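The segment-wise comparison above can be sketched as follows. The sketch assumes fingerprints are given as equal-length bit strings, that sub-fingerprints are compared position by position (segment i against segment i), that any bits left over after equal division are ignored, and that the weights default to 1; none of these details is fixed by the text, and the names `split_equal` and `text_score` are hypothetical.

```python
def split_equal(bits: str, segments: int) -> list:
    """Split a bit string into `segments` equal-length pieces."""
    size = len(bits) // segments
    return [bits[i * size:(i + 1) * size] for i in range(segments)]


def text_score(target_bits: str, library_bits: str, segments: int, weights=None):
    """Return (weighted total, per-segment XOR 1-bit counts)."""
    if len(target_bits) != len(library_bits):
        raise ValueError("fingerprints must have the same length")
    if weights is None:
        weights = [1.0] * segments  # assumption: equal weights by default
    target_parts = split_equal(target_bits, segments)
    library_parts = split_equal(library_bits, segments)
    # For bit strings, counting differing positions equals counting 1s in the XOR.
    per_segment = [
        sum(x != y for x, y in zip(t, l))
        for t, l in zip(target_parts, library_parts)
    ]
    total = sum(w * s for w, s in zip(weights, per_segment))
    return total, per_segment
```

The weighted total would then be compared with the preset threshold by the caller.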
A second aspect of the embodiments of the present application provides a duplicate checking system based on a feature vector space, including:
a word segmentation module, used for performing word segmentation and paragraph segmentation on the target text and extracting paragraph feature vectors to obtain text data composed of a plurality of paragraph feature vectors; each paragraph feature vector consists of the weights corresponding to a plurality of feature words;
a space recording module, used for determining the cluster center vector space in which each paragraph feature vector lies, and recording a first cluster number according to that cluster center vector space; each cluster center vector space comprises all paragraph feature vectors whose cosine distance to the cluster center vector is smaller than a preset cluster threshold, and the union of the cluster center vector spaces is greater than or equal to the entire paragraph feature vector space;
a fingerprint module, used for obtaining the paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint, and establishing a mapping between the fingerprint number and the first cluster number; the paragraph feature fingerprint is an array containing features of the target text;
and a matching module, used for acquiring, according to the first cluster number of each paragraph feature vector, all library paragraph feature fingerprints mapped to that first cluster number, and matching them one by one against the paragraph feature fingerprint corresponding to the paragraph feature vector to obtain a paragraph duplicate checking result.
Compared with the prior art, the embodiments of the invention provide a duplicate checking method and system based on a feature vector space. When a new target text is to be checked, word segmentation and paragraph segmentation are performed on the target text and paragraph feature vectors are extracted, yielding the paragraph feature vectors of the whole text. The cluster center of the corresponding cluster is then recorded with a first cluster number for subsequent retrieval. From the first cluster number, the relevant personnel can identify the cluster center vector space containing texts similar to the target text and exclude the library text data in the text database that falls outside that space, which shortens the time required for duplicate checking.
The information is then condensed into the corresponding paragraph feature fingerprints and fingerprint numbers. The fingerprint numbers correspond one-to-one to the paragraph feature fingerprints and paragraph feature vectors, so the relevant personnel can easily mark and access the corresponding paragraph text by fingerprint number. The paragraph feature fingerprints obtained are matched against the library paragraph feature fingerprints that correspond to the first cluster number in the text database, which identifies the similar paragraphs for each paragraph of the target text; the similar text to which each similar paragraph belongs can be determined from the library paragraph feature fingerprint and its fingerprint number.
In addition, once a certain number of target texts has accumulated, clustering can be performed again to obtain a plurality of new cluster center vectors, which repartitions the vector space and keeps the first cluster numbers accurate.
Drawings
FIG. 1 is a flow chart of a duplicate checking method based on feature vector space according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a duplicate checking system based on feature vector space according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, an embodiment of the present invention provides a duplicate checking method based on feature vector space:
S10, performing word segmentation and paragraph segmentation on a target text, and extracting paragraph feature vectors to obtain text data composed of a plurality of paragraph feature vectors; each paragraph feature vector consists of the weights corresponding to a plurality of feature words.
S11, determining the cluster center vector space in which each paragraph feature vector lies, and recording a first cluster number according to that cluster center vector space; each cluster center vector space comprises all paragraph feature vectors whose cosine distance to the cluster center vector is smaller than a preset cluster threshold, and the union of the cluster center vector spaces is greater than or equal to the entire paragraph feature vector space.
S12, obtaining the paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint, and establishing a mapping between the fingerprint number and the first cluster number; the paragraph feature fingerprint is an array containing features of the target text.
S13, according to the first cluster number of each paragraph feature vector, acquiring all library paragraph feature fingerprints mapped to that first cluster number, and matching them one by one against the paragraph feature fingerprint corresponding to the paragraph feature vector to obtain a paragraph duplicate checking result.
The vector space model represents unstructured target text as paragraph feature vectors that are easy for a computer to process; going from the target text to the paragraph feature vectors and then to the paragraph feature fingerprints is a one-way, irreversible reduction of information. The method mainly comprises two stages. The first stage (S10-S11) determines the vector space (first cluster number) to which each paragraph feature vector belongs; the duplicate checking unit of this stage is the vector. The second stage (S12-S13) finds similar text within that vector space; the duplicate checking unit of this stage is the fingerprint. If the paragraph feature fingerprint contains 64 bits, the 64-bit fingerprint in effect retains the direction information of a 64-dimensional vector space, and the checking unit is 64 bits. The paragraph feature vectors and the paragraph feature fingerprints have uniform lengths: for anything longer than the uniform length only the part within the uniform length is kept, and anything shorter is zero-padded to the uniform length.
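The uniform-length rule at the end of the paragraph above can be illustrated with a short sketch. It assumes the kept part is the leading part and that padding is appended at the end, which the text does not specify; the helper name `to_uniform_length` is hypothetical.

```python
def to_uniform_length(values, length: int) -> list:
    """Truncate or zero-pad a sequence to exactly `length` items.

    Assumptions: the leading items are kept and zeros are appended at the end.
    """
    trimmed = list(values[:length])
    return trimmed + [0] * (length - len(trimmed))
```

With this helper, `to_uniform_length([1, 2, 3], 5)` pads with two zeros and `to_uniform_length([1, 2, 3, 4], 2)` keeps only the first two items.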
Finally, the target text data stored in the text database comprises an identifier of the target text, the first cluster number, the cluster center vector of the first cluster, the paragraph feature fingerprint, and the fingerprint number. The identifier of the target text, the first cluster number, and the fingerprint number are arbitrarily assigned attributes that make it easier for an administrator to retrieve and organize the database.
It should be noted that, in the embodiment of the present invention, the text to be queried may be provided by a user or acquired automatically from a network node; this may be configured by the user as required and is not specifically limited by the embodiment of the present invention.
The word segmentation and paragraph segmentation mentioned in S10 are mature natural language processing techniques. Their purpose is to split the text to be queried into a plurality of sentences and then combine the sentences into a plurality of split paragraphs of the target text; each split paragraph contains at least one sentence and corresponds to one paragraph feature vector.
For example, given a piece of target text: "there is no way for life book, more people get way, the sunlight is believed to be always behind the wind and rain", the result after word segmentation is: "life / none / become / believe / sunlight / wind and rain". Weights are then assigned to the feature words: life (5), none (2), become (1), believe (2), sunlight (3), wind and rain (2), where the number in brackets represents the importance of the word in the whole sentence; the larger the number, the more important the word.
The cluster center vector spaces in S11 are constructed as follows:
S91, performing word segmentation on each text in the library and extracting paragraph feature vectors, to obtain a plurality of library text data;
S92, clustering all paragraph feature vectors in the plurality of library text data with a partition-based clustering algorithm to obtain a plurality of cluster center vectors;
S93, for each cluster center vector, taking the region of vector space whose cosine distance to that cluster center vector is smaller than the preset cluster threshold as the cluster center vector space corresponding to that cluster center vector; the resulting vector spaces have intersections;
S94, assigning a first cluster number to each cluster center vector space.
The embodiment of the invention adopts the K-Means method, in which each cluster is represented by the mean point of the samples in its group. S91-S94 cluster all texts in the database. This process must be completed before the target text is checked; its purpose is to initialize the library text data and the cluster center vector spaces of the database, on which later duplicate checking is based.
K text paragraph data are randomly selected from all M library text data as the initial cluster centers, that is, the K feature vectors corresponding to the K text paragraph data are taken as the initial center vectors. The K center vectors are denoted $T'_1, T'_2, \ldots, T'_K$; the feature vectors of the M-K text paragraph data other than the cluster centers are denoted $T'_{K+1}, T'_{K+2}, \ldots, T'_M$. M and K are positive integers, and K is less than M.
The feature vectors of the M-K text paragraph data are then clustered, so that $T'_{K+1}, T'_{K+2}, \ldots, T'_M$ are divided into the clusters whose center vectors are $T'_1, T'_2, \ldots, T'_K$.
A cluster center vector space is determined by two parameters: the cluster center vector and the distance from the center. The distance from the center is the cosine distance between a vector and the cluster center vector, so the size of a cluster center vector space can be adjusted by setting the preset cluster threshold. The larger the cluster center vector space, the more library text data it ends up containing and the higher the time complexity of duplicate checking; an administrator can adjust this according to actual conditions.
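The membership test described above can be sketched as follows, assuming cosine distance is defined as 1 minus cosine similarity (a common convention that the text does not spell out) and that a zero vector is treated as maximally distant. The names `cosine_distance` and `cluster_numbers` are hypothetical. Because the spaces may intersect, one vector can receive more than one first cluster number.

```python
import math


def cosine_distance(u, v) -> float:
    """1 - cosine similarity; assumption: this is the intended 'cosine distance'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: a zero vector matches no cluster direction
    return 1.0 - dot / (norm_u * norm_v)


def cluster_numbers(vector, centers: dict, threshold: float) -> list:
    """Return every first cluster number whose space contains `vector`.

    `centers` maps cluster number -> cluster center vector.
    """
    return [
        number
        for number, center in centers.items()
        if cosine_distance(vector, center) < threshold
    ]
```

Raising `threshold` enlarges every cluster center vector space, so more library paragraphs become candidates for each lookup.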
Illustratively, after S94, the method further comprises:
S95, obtaining the library paragraph feature fingerprint of each paragraph feature vector in the plurality of library text data, assigning a fingerprint number to each library paragraph feature fingerprint, and establishing a mapping between each fingerprint number and the first cluster number corresponding to that library paragraph feature fingerprint.
Once each fingerprint number has been mapped to the first cluster number of its library paragraph feature fingerprint, all text paragraphs in the library have been assigned to their corresponding cluster center vector spaces, and an administrator can subsequently narrow the search space (range) using the first cluster number alone.
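The mapping between fingerprint numbers and first cluster numbers can be pictured as an index keyed by cluster number. The following is a minimal in-memory sketch under that assumption; the class `FingerprintIndex` and its methods are hypothetical and say nothing about how the text database actually stores the mapping.

```python
from collections import defaultdict


class FingerprintIndex:
    """Hypothetical in-memory index: first cluster number -> library fingerprints."""

    def __init__(self):
        self._by_cluster = defaultdict(dict)

    def add(self, fingerprint_number: int, fingerprint: int, cluster_numbers) -> None:
        # One fingerprint may map to several cluster numbers (one-to-many).
        for cluster_number in cluster_numbers:
            self._by_cluster[cluster_number][fingerprint_number] = fingerprint

    def candidates(self, cluster_number: int) -> dict:
        """Return {fingerprint number: fingerprint} for one cluster number."""
        return dict(self._by_cluster.get(cluster_number, {}))
```

A lookup by first cluster number then returns only the fingerprints that need to be matched, which is the narrowing of the search range described above.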
Illustratively, after S10, the method further comprises:
S101, adding the target text to the library and storing it in a sequence to be added;
S102, if the number of paragraph feature vectors of the texts in the sequence to be added is greater than a newly added threshold, clustering all first cluster center vectors together with all paragraph feature vectors of the texts in the sequence to be added, to obtain a plurality of new cluster center vectors.
In this embodiment, the sequence to be added is monitored in real time. When the number of paragraph feature vectors exceeds the newly added threshold, the database has received and stored a certain amount of text within a period of time; if the cluster center vectors were not updated at this point, the influence of that text would be ignored in subsequent duplicate checking, which would affect the accuracy of the results.
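The threshold-triggered update can be sketched as a small queue. The sketch assumes the re-clustering itself is supplied by the caller as a callback and that the sequence to be added is emptied after each re-clustering; both are illustrative choices, and the names `PendingSequence`, `add_text` and `recluster` are hypothetical.

```python
class PendingSequence:
    """Hypothetical 'sequence to be added' that triggers re-clustering."""

    def __init__(self, threshold: int, recluster):
        self.threshold = threshold    # the newly added threshold
        self.recluster = recluster    # caller-supplied callback (assumption)
        self.pending = []             # paragraph feature vectors awaiting clustering

    def add_text(self, paragraph_vectors) -> bool:
        """Queue one text's paragraph vectors; return True if re-clustering ran."""
        self.pending.extend(paragraph_vectors)
        if len(self.pending) > self.threshold:
            self.recluster(list(self.pending))
            self.pending.clear()      # assumption: start a fresh sequence afterwards
            return True
        return False
```

In a fuller version the callback would combine the existing first cluster center vectors with the queued vectors before clustering, as S102 describes.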
Illustratively, S12 specifically includes:
S120, calculating a hash value of each feature word through a hash function.
S121, weighting the character strings of each feature word according to the hash value of each feature word.
S122, accumulating the weighted results of the character strings corresponding to the feature words to obtain paragraph sequence strings.
S123, performing dimension reduction calculation on the paragraph sequence strings to obtain paragraph feature fingerprints corresponding to the target text paragraphs.
S124, giving a fingerprint number to the paragraph characteristic fingerprint and establishing a one-to-many mapping relation between the fingerprint number and the first cluster number.
Illustratively, S121 specifically includes:
S1210, obtaining a weighting factor according to the frequency of the feature words in the target text.
S1211, multiplying each bit of the character string of each feature word by the weighting factor; where the hash bit is 1 the weight is multiplied positively, and where the hash bit is 0 the weight is multiplied negatively.
The hash value of each feature word is calculated through a hash function; the hash value is an n-bit signature consisting of the binary digits 0 and 1. For example, the hash value Hash(life) of "life" is "110101", and the hash value Hash(none) of "none" is "101001". Each character string thus becomes a series of digits. On the basis of the hash values, all feature words are weighted, namely W = Hash × weight, where a 1 bit in the hash value contributes the weight positively and a 0 bit contributes the weight negatively. For example, weighting the hash value "110101" of "life" gives W(life) = 110101 × 5 = 5 5 -5 5 -5 5, and weighting the hash value "101001" of "none" gives W(none) = 101001 × 2 = 2 -2 2 -2 -2 2; the remaining feature words are handled in the same way. As a further example, weighting the hash value "101011…" of "natural language" (weight 5) gives W(natural language) = 5 -5 5 -5 5 5 …, and weighting the hash value "100101…" of "processing" (weight 4) gives W(processing) = 4 -4 -4 4 -4 4 ….
The weighted results of the feature words are accumulated position by position into a single sequence string. Taking the first two feature words as an example, accumulating "5 5 -5 5 -5 5" for "life" and "2 -2 2 -2 -2 2" for "none" gives "5+2, 5-2, -5+2, 5-2, -5-2, 5+2", that is, "7 3 -3 3 -7 7".
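The weighting and accumulation of the worked example can be reproduced with a short sketch. It takes the hash values as bit strings exactly as given in the example and does not compute them with any particular hash function; the names `weight_hash` and `accumulate` are hypothetical.

```python
def weight_hash(bits: str, weight: int) -> list:
    """Map each hash bit to +weight (bit 1) or -weight (bit 0)."""
    return [weight if bit == "1" else -weight for bit in bits]


def accumulate(weighted_rows) -> list:
    """Add the weighted results position by position into one sequence string."""
    return [sum(column) for column in zip(*weighted_rows)]


# Values taken from the example in the text.
life = weight_hash("110101", 5)   # [5, 5, -5, 5, -5, 5]
none = weight_hash("101001", 2)   # [2, -2, 2, -2, -2, 2]
sequence = accumulate([life, none])  # [7, 3, -3, 3, -7, 7]
```

Adding further feature words only appends rows to the list passed to `accumulate`; the length of the sequence string stays equal to the number of hash bits.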
Illustratively, S123 specifically includes:
S1230, during the accumulation of the weighted results into the paragraph sequence string, sequentially recording the accumulation order of each feature word;
S1231, accumulating the character string corresponding to each feature word to generate the word code corresponding to each feature word, and generating the corresponding word coding sequence according to the accumulation order of the feature words;
S1232, forming a coding feature matrix from the word coding sequence and the paragraph sequence string;
S1233, performing an XOR operation on the coding feature matrix to obtain the paragraph feature fingerprint corresponding to the target text paragraph.
Through the above steps, the accumulation order of the feature words serves as a fingerprint characteristic of the paragraph sequence string, which prevents the paragraph sequence string from losing its association with the original information after the dimension reduction calculation. Forming the coding feature matrix from the word coding sequence and the paragraph sequence string and then performing the XOR operation further reduces the dimension of the paragraph sequence string, while the added word coding sequence preserves the characteristics of the information, so the paragraph feature fingerprint can be extracted more effectively.
Illustratively, after S12, further comprising:
S125, performing weighted accumulation on all paragraph feature vectors to obtain the text vector corresponding to the target text and the corresponding target sequence string.
S126, confirming the class cluster center vector space in which the text vector is located, and recording a second class cluster number according to that class cluster center vector space.
S127, performing displacement calculation on the target sequence string to obtain the target text fingerprint corresponding to the target text.
S128, obtaining the first class cluster numbers whose number values equal the second class cluster numbers corresponding to the text vector, and obtaining all library paragraph feature fingerprints that have a mapping relationship with the second class cluster numbers; each second class cluster number corresponds one-to-one to the first class cluster number with the same number value.
S129, matching all the library paragraph feature fingerprints with the target text fingerprint one by one to obtain a text duplicate checking result.
In addition to searching for the similar texts corresponding to each paragraph, if the party requesting the duplicate check wants to check for duplication from the perspective of the overall concept and purport of the full text, all paragraph feature vectors of the target text need to be weighted to form the text vector and the corresponding target sequence string. The target text fingerprint is formed in a way similar to the paragraph feature fingerprint, which is not repeated here; the difference between the two is that the paragraph feature fingerprint is calculated from the paragraph feature vector, whereas the target text fingerprint is calculated from the text vector. It should be noted that the paragraph feature fingerprint and the target text fingerprint need to have the same length.
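A minimal sketch of S125 and S126 follows, under stated assumptions: the paragraph weights are equal, the cluster for the text vector is chosen as the nearest center by cosine distance (the text itself only says that the cluster space containing the vector is confirmed), and the vectors and centers are made-up toy values. The function names are hypothetical.

```python
import math
from typing import Dict, List


def weighted_sum(vectors: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted accumulation of paragraph vectors into one text vector."""
    return [sum(w * v[i] for v, w in zip(vectors, weights))
            for i in range(len(vectors[0]))]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest_cluster(vector: List[float], centers: Dict[int, List[float]]) -> int:
    """Return the cluster number whose center is closest to the vector."""
    return min(centers, key=lambda number: cosine_distance(vector, centers[number]))


# Toy data: two paragraph feature vectors and two cluster centers.
paragraph_vectors = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]
text_vector = weighted_sum(paragraph_vectors, [0.5, 0.5])
centers = {1: [1.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}
second_cluster_number = nearest_cluster(text_vector, centers)
print(text_vector, second_cluster_number)  # [0.5, 0.5, 2.0] 2
```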
Illustratively, S13 specifically includes:
S130, expanding each library paragraph feature fingerprint and the paragraph feature fingerprint corresponding to the paragraph feature vector.
S131, performing an exclusive-OR operation on each expanded library paragraph feature fingerprint and the paragraph feature fingerprint, and taking the number of 1s in the result as the sequence similarity.
S132, flipping the paragraph feature fingerprint, performing an exclusive-OR operation on each expanded library paragraph feature fingerprint and the flipped paragraph feature fingerprint, and taking the number of 1s in the result as the flip similarity.
S133, if the average of the sequence similarity and the flip similarity is greater than a paragraph similarity threshold, a paragraph of the target text has duplicate content in the library.
For example, suppose that after expansion the library paragraph feature fingerprint and the paragraph feature fingerprint corresponding to the paragraph feature vector are 10101 and 00110, respectively. The exclusive-OR result is 10011, which contains three 1s, so the sequence similarity is 3. The flip similarity is calculated in the same way, except that the operands are now 10101 and 01100.
With the sequence similarity $\mathrm{Sim}_1$ weighted by 0.5 and the flip similarity $\mathrm{Sim}_2$ weighted by 0.5, the vector-space-based similarity $\mathrm{Sim}$ between the target text and the library text is calculated by the similarity fusion algorithm, namely formula (1):
$\mathrm{Sim} = \mathrm{Sim}_1 \cdot 0.5 + \mathrm{Sim}_2 \cdot 0.5 \quad (1)$
If the value of Sim is greater than the paragraph similarity threshold, a paragraph of the target text has duplicate content in the library. It should be noted that Sim must be recalculated each time a library paragraph feature fingerprint is compared with the paragraph feature fingerprint corresponding to the paragraph feature vector.
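The comparison in S131 to S133 and formula (1) can be sketched as follows. Flipping is taken to mean reversing the bit string, which matches the example in the text (00110 becomes 01100); the function names are hypothetical. Note that the number of 1s in an exclusive-OR result is the Hamming distance between the two fingerprints; the sketch simply follows the text in using that count as the score.

```python
def xor_ones(a: str, b: str) -> int:
    """Count the 1 bits in the XOR of two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def fused_similarity(library_fp: str, paragraph_fp: str) -> float:
    """Formula (1): equal-weight fusion of sequence and flip similarity."""
    sim1 = xor_ones(library_fp, paragraph_fp)        # sequence similarity
    sim2 = xor_ones(library_fp, paragraph_fp[::-1])  # flip similarity
    return sim1 * 0.5 + sim2 * 0.5


print(xor_ones("10101", "00110"))          # 3
print(xor_ones("10101", "00110"[::-1]))    # 3
print(fused_similarity("10101", "00110"))  # 3.0
```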
Illustratively, S129 specifically includes:
S1290, expanding each library paragraph feature fingerprint and the target text fingerprint.
S1291, determining the number of segments according to the number of paragraph feature vectors of the target text.
S1292, splitting each library paragraph feature fingerprint and the target text fingerprint into equal-length segments according to the number of segments, to obtain a plurality of equal-length library paragraph feature sub-fingerprints and a plurality of equal-length target text sub-fingerprints.
S1293, performing an exclusive-OR operation on each target text sub-fingerprint and each library paragraph feature sub-fingerprint, and taking the number of 1s in the result as the sub-paragraph similarity.
S1294, performing weighted accumulation on the sub-paragraph similarities to obtain the text similarity.
S1295, if the text similarity is greater than a preset threshold, the full text of the target text has duplicate content in the library.
In this embodiment, whether the full text of the target text is duplicated is determined from the difference between each library paragraph feature fingerprint and the target text fingerprint, and the parameter that has to be obtained is the number of segments. The number of segments is determined by the number of paragraph feature vectors of the target text, i.e. the number of paragraphs into which the target text was split in S10.
If the target text has 8 split paragraphs, the number of segments is 8, and the text similarity is $\mathrm{Sim}' = \mathrm{Sim}'_{11}\,\alpha_{11} + \mathrm{Sim}'_{21}\,\alpha_{21} + \mathrm{Sim}'_{31}\,\alpha_{31} + \dots + \mathrm{Sim}'_{81}\,\alpha_{81}$, where $\mathrm{Sim}'_{11}$ is the sub-paragraph similarity between the first target text sub-fingerprint and the first library paragraph feature sub-fingerprint and $\alpha_{11}$ is the corresponding weight, and $\mathrm{Sim}'_{12}$ is the sub-paragraph similarity between the first target text sub-fingerprint and the second library paragraph feature sub-fingerprint and $\alpha_{12}$ is the corresponding weight. In total, 8×8 sub-paragraph similarities are calculated and accumulated with their weights.
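The segmented comparison of S1290 to S1294 can be sketched as below. The sketch assumes the fingerprint length is divisible by the number of segments and uses a uniform weight matrix; the text does not say how the weights are chosen, so `uniform` is purely illustrative, as are the function names and the 8-bit toy fingerprints. The entry `weights[i][j]` plays the role of the weight for the i-th target sub-fingerprint and the j-th library sub-fingerprint.

```python
from typing import List


def split_equal(fingerprint: str, segments: int) -> List[str]:
    """Cut a bit string into `segments` pieces of equal length."""
    if len(fingerprint) % segments != 0:
        raise ValueError("fingerprint length must be divisible by segments")
    size = len(fingerprint) // segments
    return [fingerprint[i:i + size] for i in range(0, len(fingerprint), size)]


def xor_ones(a: str, b: str) -> int:
    """Count the 1 bits in the XOR of two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def text_similarity(target_fp: str, library_fp: str, segments: int,
                    weights: List[List[float]]) -> float:
    """Weighted sum over all (target, library) sub-fingerprint pairs."""
    target_parts = split_equal(target_fp, segments)
    library_parts = split_equal(library_fp, segments)
    return sum(weights[i][j] * xor_ones(t, lib)
               for i, t in enumerate(target_parts)
               for j, lib in enumerate(library_parts))


segments = 2
uniform = [[1 / segments ** 2] * segments for _ in range(segments)]
score = text_similarity("10100110", "00111100", segments, uniform)
print(score)  # 2.0
```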
Compared with the prior art, the embodiment of the invention provides a duplicate checking method based on feature vector space. Each time a new target text is checked, word segmentation and paragraph segmentation are performed on the target text and paragraph feature vectors are extracted, giving the paragraph feature vectors of the whole text. The class cluster center of each corresponding class cluster is then recorded with a first class cluster number for subsequent retrieval. From the first class cluster numbers, the relevant personnel can confirm the class cluster center vector spaces in which texts similar to the target text are located and exclude the library text data in the text database that lies outside those spaces, which shortens the time required for duplicate checking.
The information is then condensed to obtain the corresponding paragraph feature fingerprints and fingerprint numbers. The fingerprint numbers correspond one-to-one to the paragraph feature fingerprints and paragraph feature vectors, so the relevant personnel can easily mark and access the corresponding paragraph text by fingerprint number. The obtained paragraph feature fingerprints are matched against the library paragraph feature fingerprints in the text database that correspond to the first class cluster numbers, which reveals which similar paragraphs exist for each paragraph of the target text; the similar text to which each similar paragraph belongs can be identified from the library paragraph feature fingerprint and its fingerprint number.
In addition, once a certain number of target texts have accumulated, clustering can be performed again to obtain a number of new class cluster center vectors, which repartitions the vector space and keeps the first class cluster numbers accurate.
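The re-clustering trigger mentioned above can be sketched as a small queue that collects paragraph feature vectors of newly added texts and re-runs clustering once a threshold is exceeded. Everything concrete here is an assumption for illustration: the class name `PendingQueue`, the threshold value, the fixed number of rounds, and the use of a plain k-means style refinement with cosine distance as the partition-based clustering algorithm.

```python
import math
from typing import List

Vector = List[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def recluster(vectors: List[Vector], centers: List[Vector],
              rounds: int = 10) -> List[Vector]:
    """Plain k-means style refinement using cosine distance."""
    for _ in range(rounds):
        groups: List[List[Vector]] = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)),
                          key=lambda k: cosine_distance(v, centers[k]))
            groups[nearest].append(v)
        # Keep the old center when a group ends up empty.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return centers


class PendingQueue:
    """Collects new paragraph vectors and re-clusters past a threshold."""

    def __init__(self, centers: List[Vector], threshold: int) -> None:
        self.centers = centers
        self.threshold = threshold
        self.pending: List[Vector] = []

    def add(self, vectors: List[Vector]) -> bool:
        """Queue vectors; return True if re-clustering was triggered."""
        self.pending.extend(vectors)
        if len(self.pending) <= self.threshold:
            return False
        # Cluster the old centers together with the queued vectors.
        self.centers = recluster(self.centers + self.pending, self.centers)
        self.pending.clear()
        return True


queue = PendingQueue([[1.0, 0.0], [0.0, 1.0]], threshold=2)
first = queue.add([[0.9, 0.1]])
second = queue.add([[0.8, 0.2], [0.1, 0.9]])
print(first, second, len(queue.centers))  # False True 2
```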
Referring to fig. 2, an embodiment of the present application provides a duplicate checking system based on feature vector space, which includes a word segmentation and paragraph segmentation module 20, a space recording module 21, a fingerprint module 22 and a matching module 23.
The word segmentation and paragraph segmentation module 20 is configured to perform word segmentation and paragraph segmentation on the target text and extract paragraph feature vectors to obtain text data composed of a plurality of paragraph feature vectors; each paragraph feature vector consists of the weights corresponding to a plurality of feature words.
The space recording module 21 is configured to confirm the class cluster center vector space in which each paragraph feature vector is located, and to record a first class cluster number according to that class cluster center vector space; the class cluster center vector space comprises all paragraph feature vectors whose cosine distance to the class cluster center vector is smaller than a preset class cluster value, and the union of the class cluster center vector spaces is greater than or equal to the whole paragraph feature vector space.
The fingerprint module 22 is configured to obtain the paragraph feature fingerprint of each paragraph feature vector, assign a fingerprint number to the paragraph feature fingerprint, and establish a mapping relationship between the fingerprint number and the first class cluster number; the paragraph feature fingerprint is an array containing features of the target text.
The matching module 23 is configured to obtain, according to the first class cluster number of each paragraph feature vector, all library paragraph feature fingerprints that have a mapping relationship with that first class cluster number, and to match all those library paragraph feature fingerprints one by one with the paragraph feature fingerprint corresponding to the paragraph feature vector to obtain a paragraph duplicate checking result.
It will be clear to those skilled in the art that for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method embodiments for the specific working procedure of the above-described system, which is not further described herein.
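How the space recording module and the matching module might cooperate is sketched below: a vector is assigned every cluster number whose center lies within the preset cosine distance (so the spaces may overlap), and an index keyed by cluster number returns the candidate library fingerprints. The class `FingerprintIndex`, the fingerprint numbers 101 and 102, the threshold 0.3 and the toy vectors are all hypothetical and serve only to illustrate the mapping.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set

Vector = List[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def cluster_numbers(vector: Vector, centers: Dict[int, Vector],
                    limit: float) -> List[int]:
    """All clusters whose center lies within `limit`; spaces may overlap."""
    return [n for n, c in centers.items() if cosine_distance(vector, c) < limit]


class FingerprintIndex:
    """Maps first class cluster numbers to fingerprint numbers."""

    def __init__(self) -> None:
        self.by_cluster: Dict[int, Set[int]] = defaultdict(set)
        self.fingerprints: Dict[int, str] = {}

    def add(self, number: int, fingerprint: str, clusters: List[int]) -> None:
        self.fingerprints[number] = fingerprint
        for cluster in clusters:
            self.by_cluster[cluster].add(number)

    def candidates(self, clusters: List[int]) -> Dict[int, str]:
        """Library fingerprints mapped to any of the given clusters."""
        numbers = set().union(*(self.by_cluster.get(c, set()) for c in clusters))
        return {n: self.fingerprints[n] for n in sorted(numbers)}


centers = {1: [1.0, 0.0], 2: [0.0, 1.0]}
index = FingerprintIndex()
index.add(101, "10101", cluster_numbers([0.9, 0.1], centers, 0.3))
index.add(102, "00110", cluster_numbers([0.1, 0.9], centers, 0.3))
query_clusters = cluster_numbers([1.0, 0.2], centers, 0.3)
print(index.candidates(query_clusters))  # {101: '10101'}
```

Only the fingerprints returned by `candidates` need to be compared bit by bit, which is where the reduction in search time comes from.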
Compared with the prior art, the embodiment of the invention provides a duplicate checking system based on feature vector space. Each time a new target text is checked, word segmentation and paragraph segmentation are performed on the target text and paragraph feature vectors are extracted, giving the paragraph feature vectors of the whole text. The class cluster center of each corresponding class cluster is then recorded with a first class cluster number for subsequent retrieval. From the first class cluster numbers, the relevant personnel can confirm the class cluster center vector spaces in which texts similar to the target text are located and exclude the library text data in the text database that lies outside those spaces, which shortens the time required for duplicate checking.
The information is then condensed to obtain the corresponding paragraph feature fingerprints and fingerprint numbers. The fingerprint numbers correspond one-to-one to the paragraph feature fingerprints and paragraph feature vectors, so the relevant personnel can easily mark and access the corresponding paragraph text by fingerprint number. The obtained paragraph feature fingerprints are matched against the library paragraph feature fingerprints in the text database that correspond to the first class cluster numbers, which reveals which similar paragraphs exist for each paragraph of the target text; the similar text to which each similar paragraph belongs can be identified from the library paragraph feature fingerprint and its fingerprint number.
In addition, once a certain number of target texts have accumulated, clustering can be performed again to obtain a number of new class cluster center vectors, which repartitions the vector space and keeps the first class cluster numbers accurate.
While the foregoing is directed to the preferred embodiments of the present invention, those skilled in the art will appreciate that changes and modifications may be made without departing from the principles of the invention, and such changes and modifications are also intended to fall within the scope of the invention.

Claims (8)

1. A duplicate checking method based on feature vector space, characterized by comprising the following steps:
performing word segmentation and paragraph segmentation on the target text, and extracting paragraph feature vectors to obtain text data consisting of a plurality of paragraph feature vectors; each paragraph feature vector consists of the weights corresponding to a plurality of feature words;
confirming the class cluster center vector space in which each paragraph feature vector is located, and recording a first class cluster number according to that class cluster center vector space; the class cluster center vector space comprises all paragraph feature vectors whose cosine distance to the class cluster center vector is smaller than a preset class cluster value, and the union of the class cluster center vector spaces is greater than or equal to the whole paragraph feature vector space;
acquiring paragraph feature fingerprints of each paragraph feature vector, assigning fingerprint numbers to the paragraph feature fingerprints, and establishing a mapping relation between the fingerprint numbers and the first cluster numbers; the paragraph feature fingerprint is an array containing features of the target text;
respectively obtaining all library paragraph feature fingerprints with mapping relation with the first class cluster numbers according to the first class cluster numbers of each paragraph feature vector, and matching all library paragraph feature fingerprints with paragraph feature fingerprints corresponding to the paragraph feature vectors one by one to obtain paragraph check repetition results;
the obtaining the paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint, and establishing a mapping relationship between the fingerprint number and the first cluster number, specifically includes:
calculating a hash value of each feature word through a hash function;
weighting the character strings of each feature word according to the hash value of each feature word;
accumulating the weighted results of the character strings corresponding to the characteristic words to obtain paragraph sequence strings;
performing dimension reduction calculation on the paragraph sequence strings to obtain paragraph feature fingerprints corresponding to the paragraphs of the target text;
assigning a fingerprint number for the paragraph characteristic fingerprint and establishing a one-to-many mapping relation between the fingerprint number and the first cluster number;
after obtaining the paragraph feature fingerprint of each paragraph feature vector, assigning a fingerprint number to the paragraph feature fingerprint and establishing a mapping relationship between the fingerprint number and the first cluster number, the method further comprises:
performing weighted accumulation on all paragraph feature vectors to obtain a text vector corresponding to the target text and a corresponding target sequence string;
confirming a class cluster center vector space in which a text vector is located, and recording a second class cluster number according to the class cluster center vector space;
performing displacement calculation on the target sequence string to obtain a target text fingerprint corresponding to the target text;
obtaining first class cluster numbers with equal number values according to second class cluster numbers corresponding to the text vectors, and obtaining all library paragraph feature fingerprints with mapping relations with the second class cluster numbers; each second type cluster number corresponds to the first type cluster number with the same number value one by one;
and matching all the library paragraph feature fingerprints with the target text fingerprints one by one to obtain a text duplication checking result.
2. The duplicate checking method based on feature vector space according to claim 1, wherein the class cluster center vector space is constructed as follows:
performing word segmentation on all texts in the library respectively and extracting paragraph feature vectors to obtain a plurality of library text data;
clustering all paragraph feature vectors in the plurality of library text data with a partition-based clustering algorithm to obtain a plurality of class cluster center vectors;
taking the vector space whose cosine distance to each class cluster center vector is smaller than a preset class cluster value as the class cluster center vector space corresponding to that class cluster center vector; each paragraph feature vector space has an intersection;
each cluster-like center vector space is assigned a first cluster number.
3. The duplicate checking method based on feature vector space according to claim 2, further comprising, after assigning a first class cluster number to each class cluster center vector space:
acquiring the library paragraph feature fingerprint of each paragraph feature vector in the plurality of library text data, assigning a fingerprint number to each library paragraph feature fingerprint, and establishing, one by one, a mapping relationship between each fingerprint number and the first class cluster number corresponding to the library paragraph feature fingerprint.
4. The duplicate checking method based on feature vector space according to claim 2, further comprising, after performing word segmentation and paragraph segmentation on the target text and extracting paragraph feature vectors to obtain text data consisting of a plurality of paragraph feature vectors:
adding the target text into a library and storing the target text into a sequence to be added;
and if the number of the paragraph feature vectors of the texts in the sequence to be added is larger than a new threshold, clustering and dividing all the first class cluster center vectors and all the paragraph feature vectors of the texts in the sequence to be added to obtain a plurality of new class cluster center vectors.
5. The duplicate checking method based on feature vector space according to claim 1, wherein weighting the character strings of each feature word according to the hash value of each feature word specifically comprises:
obtaining a weighting factor according to the frequency of the feature word in the target text;
multiplying each bit of the character string of each feature word by the weighting factor, wherein a hash bit of 1 is multiplied positively by the weight and a hash bit of 0 is multiplied negatively by the weight.
6. The duplicate checking method based on feature vector space according to claim 1, wherein matching all the library paragraph feature fingerprints one by one with the paragraph feature fingerprint corresponding to the paragraph feature vector to obtain a paragraph duplicate checking result comprises the following steps:
expanding each library paragraph feature fingerprint and the paragraph feature fingerprint corresponding to the paragraph feature vector;
performing an exclusive-OR operation on each expanded library paragraph feature fingerprint and the paragraph feature fingerprint, and taking the number of 1s in the result as the sequence similarity;
flipping the paragraph feature fingerprint, performing an exclusive-OR operation on each expanded library paragraph feature fingerprint and the flipped paragraph feature fingerprint, and taking the number of 1s in the result as the flip similarity;
if the average of the sequence similarity and the flip similarity is greater than a paragraph similarity threshold, a paragraph of the target text has duplicate content in the library.
7. The duplicate checking method based on feature vector space according to claim 1, wherein matching all the library paragraph feature fingerprints one by one with the target text fingerprint to obtain a text duplicate checking result comprises the following steps:
expanding each library paragraph feature fingerprint and the target text fingerprint;
determining the number of segments according to the number of paragraph feature vectors of the target text;
splitting each library paragraph feature fingerprint and the target text fingerprint into equal-length segments according to the number of segments, to obtain a plurality of equal-length library paragraph feature sub-fingerprints and a plurality of equal-length target text sub-fingerprints;
performing an exclusive-OR operation on each target text sub-fingerprint and each library paragraph feature sub-fingerprint, and taking the number of 1s in the result as the sub-paragraph similarity;
performing weighted accumulation on the sub-paragraph similarities to obtain the text similarity;
if the text similarity is greater than a preset threshold, the full text of the target text has duplicate content in the library.
8. A duplicate checking system based on feature vector space, comprising:
a word segmentation and paragraph segmentation module, configured to perform word segmentation and paragraph segmentation on the target text and extract paragraph feature vectors to obtain text data composed of a plurality of paragraph feature vectors; each paragraph feature vector consists of the weights corresponding to a plurality of feature words;
a space recording module, configured to confirm the class cluster center vector space in which each paragraph feature vector is located, and to record a first class cluster number according to that class cluster center vector space; the class cluster center vector space comprises all paragraph feature vectors whose cosine distance to the class cluster center vector is smaller than a preset class cluster value, and the union of the class cluster center vector spaces is greater than or equal to the whole paragraph feature vector space;
the fingerprint module is used for acquiring paragraph feature fingerprints of each paragraph feature vector, giving fingerprint numbers to the paragraph feature fingerprints and establishing a mapping relation between the fingerprint numbers and the first cluster numbers; the paragraph feature fingerprint is an array containing features of the target text;
the matching module is used for acquiring all library paragraph feature fingerprints with mapping relation with the first class cluster numbers according to the first class cluster numbers of each paragraph feature vector respectively, and matching all library paragraph feature fingerprints with paragraph feature fingerprints corresponding to the paragraph feature vectors one by one to obtain paragraph check and repeat results;
the fingerprint module is specifically used for: calculating a hash value of each feature word through a hash function; weighting the character strings of each feature word according to the hash value of each feature word; accumulating the weighted results of the character strings corresponding to the characteristic words to obtain paragraph sequence strings; performing dimension reduction calculation on the paragraph sequence strings to obtain paragraph feature fingerprints corresponding to the paragraphs of the target text; assigning a fingerprint number for the paragraph characteristic fingerprint and establishing a one-to-many mapping relation between the fingerprint number and the first cluster number;
the fingerprint module is also used for: performing weighted accumulation on all paragraph feature vectors to obtain a text vector corresponding to the target text and a corresponding target sequence string; confirming a class cluster center vector space in which a text vector is located, and recording a second class cluster number according to the class cluster center vector space; performing displacement calculation on the target sequence string to obtain a target text fingerprint corresponding to the target text; obtaining first class cluster numbers with equal number values according to second class cluster numbers corresponding to the text vectors, and obtaining all library paragraph feature fingerprints with mapping relations with the second class cluster numbers; each second type cluster number corresponds to the first type cluster number with the same number value one by one; and matching all the library paragraph feature fingerprints with the target text fingerprints one by one to obtain a text duplication checking result.
CN202310091416.8A 2023-02-10 2023-02-10 Weight checking method and system based on feature vector space Active CN115774785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310091416.8A CN115774785B (en) 2023-02-10 2023-02-10 Weight checking method and system based on feature vector space

Publications (2)

Publication Number Publication Date
CN115774785A CN115774785A (en) 2023-03-10
CN115774785B true CN115774785B (en) 2023-04-25

Family

ID=85393402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310091416.8A Active CN115774785B (en) 2023-02-10 2023-02-10 Weight checking method and system based on feature vector space

Country Status (1)

Country Link
CN (1) CN115774785B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915295A (en) * 2011-03-31 2013-02-06 百度在线网络技术(北京)有限公司 Document detecting method and document detecting device
CN106446148A (en) * 2016-09-21 2017-02-22 中国运载火箭技术研究院 Cluster-based text duplicate checking method
CN109359183A (en) * 2018-10-11 2019-02-19 南京中孚信息技术有限公司 The duplicate checking method, apparatus and electronic equipment of text information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11822589B2 (en) * 2020-09-16 2023-11-21 L&T Technology Services Limited Method and system for performing summarization of text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant