CN111104484B - Text similarity detection method and device and electronic equipment - Google Patents

Publication number
CN111104484B
Authority
CN
China
Prior art keywords
fingerprint
sliding window
text
target
digital
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911321980.4A
Other languages
Chinese (zh)
Other versions
CN111104484A (en
Inventor
王超
熊英超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhongfu Information Technology Co Ltd
Original Assignee
Nanjing Zhongfu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhongfu Information Technology Co Ltd filed Critical Nanjing Zhongfu Information Technology Co Ltd
Priority to CN201911321980.4A priority Critical patent/CN111104484B/en
Publication of CN111104484A publication Critical patent/CN111104484A/en
Application granted granted Critical
Publication of CN111104484B publication Critical patent/CN111104484B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 - Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/16 - Program or content traceability, e.g. by watermarking

Abstract

The invention provides a text similarity detection method, a text similarity detection device and electronic equipment, and relates to the technical field of data processing.

Description

Text similarity detection method and device and electronic equipment
Technical Field
The invention relates to the technical field of data processing, in particular to a text similarity detection method and device and electronic equipment.
Background
With the widespread use of the internet in all aspects of life, electronic documents have become an important carrier for delivering information. Because electronic documents are inherently easy to tamper with, plagiarize, and redistribute, text plagiarism is increasingly common and seriously affects information retrieval, copyright protection of electronic resources, and the like. Text similarity detection methods have emerged in response.
The core idea of text similarity detection based on digital fingerprint technology is as follows: first, a subset of character strings (also called fingerprints) is selected from a text according to a text block division strategy; the character strings are then mapped into numeric form to obtain digital fingerprints; finally, the degree of overlap between texts is obtained by counting the identical digital fingerprints, or by computing the ratio of identical digital fingerprints to the total number of digital fingerprints. Because similar texts are mapped to similar digital fingerprints, digital fingerprint technology can convert an original text into digital fingerprint features (a digital fingerprint sequence), and text similarity detection is achieved by calculating the degree of digital fingerprint overlap between two texts. The method requires little storage for the digital fingerprints, detects quickly, and is suitable for similarity detection over large-scale text collections. Text similarity detection based on digital fingerprints therefore plays an essential role in fields such as web page deduplication and information retrieval, and can also be applied to library resource protection, software copyright protection, data leakage protection, spam filtering, and other fields.
However, existing text similarity detection methods based on digital fingerprint technology extract a large number of digital fingerprints at a high density, which results in a large amount of calculation and a low detection speed.
Disclosure of Invention
The invention aims to provide a text similarity detection method, a text similarity detection device and electronic equipment, so that the calculation amount is reduced, and the detection speed is increased.
The embodiment of the invention provides a text similarity detection method, which comprises the following steps:
acquiring two texts to be detected;
acquiring initial fingerprint features corresponding to each text, wherein the initial fingerprint features comprise a plurality of digital fingerprints;
for each text, extracting target digital fingerprints from the digital fingerprints corresponding to the text based on a sliding window algorithm to obtain a target fingerprint feature corresponding to the text; wherein the extraction of the target digital fingerprints depends on the size (numeric value) of each digital fingerprint corresponding to the text, and the starting point of the next sliding window depends on the target digital fingerprint extracted from the previous sliding window;
and performing similarity calculation on the target fingerprint characteristics corresponding to the two texts to obtain a similarity detection result of the two texts.
Further, the acquiring of the initial fingerprint feature corresponding to each text includes:
preprocessing each text to obtain a word sequence of each text;
coding words in the word sequence of each text to obtain text characteristics of each text;
and performing digital fingerprint mapping on the text features of each text to obtain initial fingerprint features corresponding to each text.
Further, the extracting a target digital fingerprint from each digital fingerprint corresponding to the text based on a sliding window algorithm to obtain a target fingerprint feature corresponding to the text includes:
acquiring a fingerprint sequence in a first sliding window from each digital fingerprint corresponding to the text according to a preset sliding window size; the starting point of the first sliding window is a first digital fingerprint in initial fingerprint features corresponding to the text;
extracting at least one target digital fingerprint in the first sliding window from the fingerprint sequence in the first sliding window according to the size of each digital fingerprint in the fingerprint sequence in the first sliding window;
acquiring a fingerprint sequence in a next sliding window, and extracting at least one target digital fingerprint in the next sliding window; the starting point of the next sliding window is the next digital fingerprint of the last target digital fingerprint extracted from the previous sliding window in the initial fingerprint features corresponding to the text;
until the last sliding window is processed, at least one target digital fingerprint in the last sliding window is obtained; the terminal point of the last sliding window is the last digital fingerprint in the initial fingerprint characteristics corresponding to the text;
and generating target fingerprint characteristics corresponding to the text according to the extracted target digital fingerprints in the sliding windows.
Further, the extracting at least one target digital fingerprint in the first sliding window from the fingerprint sequence in the first sliding window according to the size of each digital fingerprint in the fingerprint sequence in the first sliding window includes:
dividing the fingerprint sequence in the first sliding window into a plurality of parts to obtain at least one first interval block and at least one second interval block; the first interval block is a preset number of parts in the plurality of parts, and the second interval block is the other part except the first interval block in the plurality of parts; the first interval block and the second interval block each comprise a plurality of digital fingerprints;
determining the minimum value in each first interval block as a characteristic reference value of the first sliding window;
according to the characteristic reference value, selecting a target digital fingerprint in each second interval block from the digital fingerprints in the second interval blocks;
and determining the target digital fingerprint in each second interval block as the target digital fingerprint in the first sliding window.
Further, the selecting the target digital fingerprint in each second interval block from the digital fingerprints in the second interval blocks according to the characteristic reference value includes:
judging whether each second interval block meets a preset optimal decision constraint condition or not according to the characteristic reference value and a preset threshold value;
if so, determining the minimum value in the second interval block as the target digital fingerprint in the second interval block;
if not, determining the last digital fingerprint in the second interval block as the target digital fingerprint in the second interval block;
wherein the optimal decision constraint comprises:
Figure BDA0002326801650000041
where $H_{1t}$ denotes the second interval block with sequence number t in the first sliding window; $\min(H_{1t})$ represents the minimum value in $H_{1t}$; $\max(H_{1t})$ represents the maximum value in $H_{1t}$; v represents the characteristic reference value; and T represents the preset threshold value.
Further, the generating a target fingerprint feature corresponding to the text according to the extracted target digital fingerprints in each sliding window includes:
and according to the sequence of each digital fingerprint in the initial fingerprint characteristics corresponding to the text, vector generation is carried out on the extracted target digital fingerprints in each sliding window, and the target fingerprint characteristics corresponding to the text are obtained.
Further, the similarity calculation of the target fingerprint features corresponding to the two texts to obtain a similarity detection result of the two texts includes:
and calculating the similarity of the target fingerprint characteristics corresponding to the two texts by adopting cosine similarity to obtain a similarity detection result of the two texts.
The embodiment of the invention also provides a text similarity detection device, which comprises:
the first acquisition module is used for acquiring two texts to be detected;
the second acquisition module is used for acquiring initial fingerprint features corresponding to each text, and the initial fingerprint features comprise a plurality of digital fingerprints;
the extraction module is used for extracting, for each text, target digital fingerprints from the digital fingerprints corresponding to the text based on a sliding window algorithm to obtain a target fingerprint feature corresponding to the text; wherein the extraction of the target digital fingerprints depends on the size (numeric value) of each digital fingerprint corresponding to the text, and the starting point of the next sliding window depends on the target digital fingerprint extracted from the previous sliding window;
and the calculation module is used for calculating the similarity of the target fingerprint characteristics corresponding to the two texts to obtain the similarity detection result of the two texts.
The embodiment of the invention also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the computer program to realize the text similarity detection method.
The embodiment of the invention also provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the text similarity detection method is executed.
In the text similarity detection method, device, and electronic equipment provided by the embodiments of the invention, the method comprises: acquiring two texts to be detected; acquiring initial fingerprint features corresponding to each text, wherein the initial fingerprint features comprise a plurality of digital fingerprints; for each text, extracting target digital fingerprints from the digital fingerprints corresponding to the text based on a sliding window algorithm to obtain a target fingerprint feature corresponding to the text, wherein the extraction of the target digital fingerprints depends on the size (numeric value) of each digital fingerprint corresponding to the text, and the starting point of the next sliding window depends on the target digital fingerprint extracted from the previous sliding window; and performing similarity calculation on the target fingerprint features corresponding to the two texts to obtain a similarity detection result of the two texts. After the initial fingerprint features of the two texts are obtained, the target digital fingerprints are extracted from them based on the sliding window algorithm and the sizes of the digital fingerprints, with the starting point of each subsequent sliding window determined by the target digital fingerprint extracted from the previous one. On the basis of maintaining detection accuracy, this reduces the number and density of the digital fingerprints in the target fingerprint features, reduces the amount of calculation in the similarity calculation, and increases the speed of text similarity detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flowchart of a text similarity detection method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram illustrating a hash value calculation according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another text similarity detection method according to an embodiment of the present invention;
Fig. 4 is a diagram illustrating an example of obtaining a target digital fingerprint according to an embodiment of the present invention;
Fig. 5 is a schematic diagram illustrating a comparison of the precision ratios of two text similarity detection methods according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a text similarity detection apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In text similarity detection based on digital fingerprint technology, text features are extracted after the text is divided into blocks, and digital fingerprint mapping is then performed on the extracted text features to obtain the digital fingerprints of the text. The quality of the feature extraction results and the number of digital fingerprints directly affect both the accuracy and the efficiency of text similarity detection, so research on text feature selection and digital fingerprint extraction is of considerable significance and value. In the prior art, text similarity detection based on digital fingerprints suffers from numerous and poorly representative text features and from numerous, densely distributed digital fingerprints, so the amount of calculation is large and the detection speed is low. Based on this, the text similarity detection method, the text similarity detection device, and the electronic device provided by the embodiments of the invention extract the target digital fingerprints with an improved Winnowing algorithm, which reduces the number and density of the digital fingerprints and allows the text similarity to be calculated quickly.
To facilitate understanding of the embodiment, a text similarity detection method disclosed in the embodiment of the present invention is first described in detail.
The embodiment of the invention provides a text similarity detection method, which can be executed by an electronic device with data processing capability, wherein the electronic device can be but is not limited to a notebook computer or a desktop computer and the like. Referring to a flow diagram of a text similarity detection method shown in fig. 1, the method mainly includes the following steps S102 to S108:
Step S102, acquiring two texts to be detected.
The text may be a Chinese text or a text in another language; the language of the text is not limited herein.
Step S104, acquiring initial fingerprint characteristics corresponding to each text, wherein the initial fingerprint characteristics comprise a plurality of digital fingerprints.
When acquiring the initial fingerprint features corresponding to a text, the text may first be preprocessed, for example by word segmentation and stop-word removal; the text features of the preprocessed text are then extracted; and finally the extracted text features are converted into digital fingerprints, yielding the initial fingerprint features corresponding to the text. The initial fingerprint feature may be a vector composed of a plurality of digital fingerprints, each digital fingerprint being a number corresponding to a text feature.
Step S106, extracting a target digital fingerprint from each digital fingerprint corresponding to each text based on a sliding window algorithm to obtain a target fingerprint characteristic corresponding to each text; the extraction of the target digital fingerprint is related to the size of each digital fingerprint corresponding to the text, and the starting point of the next sliding window is related to the target digital fingerprint extracted from the previous sliding window.
The target digital fingerprints are the more representative digital fingerprints among the digital fingerprints, and the target fingerprint feature may be a vector formed by the extracted target digital fingerprints. In the conventional Winnowing algorithm, the sliding step is a fixed value, for example 1 (that is, the window advances by one digital fingerprint at a time). This embodiment improves the Winnowing algorithm so that the sliding step is no longer fixed but depends on the position of the extracted target digital fingerprint. Compared with the conventional Winnowing algorithm, extracting the target digital fingerprints with the improved Winnowing algorithm retains the more representative digital fingerprints while reducing both the number and the density of the digital fingerprints.
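For comparison, the conventional fixed-step Winnowing baseline mentioned above can be sketched as follows. This is only an illustrative sketch: the function name `winnow` is hypothetical, and breaking ties toward the rightmost minimum is an assumption taken from common Winnowing practice, not something this description specifies.

```python
def winnow(fingerprints, w):
    """Conventional Winnowing with a fixed sliding step of 1.

    Keeps the minimum fingerprint of every window of size w and records
    each selected position only once. Ties go to the rightmost minimum
    (an assumption; the description does not state a tie rule).
    Returns a list of (position, value) pairs with 0-based positions.
    """
    n = len(fingerprints)
    if n == 0:
        return []
    selected = []
    last_pos = -1
    # If the sequence is shorter than w, treat it as a single window.
    for start in range(max(n - w, 0) + 1):
        window = fingerprints[start:start + w]
        lo = min(window)
        # Position of the rightmost occurrence of the minimum.
        pos = start + len(window) - 1 - window[::-1].index(lo)
        if pos != last_pos:
            selected.append((pos, lo))
            last_pos = pos
    return selected
```

Because the window advances by a single fingerprint, consecutive windows overlap heavily and every window must be scanned, which is the cost the improved scheme is described as reducing.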
Step S108, performing similarity calculation on the target fingerprint features corresponding to the two texts to obtain a similarity detection result of the two texts.
Similarity calculation may be performed on the target fingerprint features corresponding to the two texts by, for example but not limited to, cosine similarity, to obtain the similarity detection result of the two texts. In a specific implementation, the similarity between the two texts can be calculated by the following formula:
$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$$
where $v_1$ and $v_2$ respectively represent the target fingerprint features corresponding to the two texts.
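As a rough illustration of the cosine similarity step, the sketch below computes the cosine of the angle between two target fingerprint vectors. It assumes the two vectors have the same length, which the description does not state; the function name `cosine_similarity` and the zero-vector handling are illustrative choices.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(v1) != len(v2):
        # Assumption: the vectors are already aligned to the same length.
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Illustrative convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm1 * norm2)
```

How two target fingerprint features of different lengths should be aligned before this calculation is left open here, since the description does not cover it.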
In the embodiment of the invention, two texts to be detected are acquired; initial fingerprint features corresponding to each text are acquired, wherein the initial fingerprint features comprise a plurality of digital fingerprints; for each text, target digital fingerprints are extracted from the digital fingerprints corresponding to the text based on a sliding window algorithm to obtain a target fingerprint feature corresponding to the text, wherein the extraction of the target digital fingerprints depends on the size (numeric value) of each digital fingerprint corresponding to the text, and the starting point of the next sliding window depends on the target digital fingerprint extracted from the previous sliding window; and similarity calculation is performed on the target fingerprint features corresponding to the two texts to obtain a similarity detection result of the two texts. After the initial fingerprint features of the two texts are obtained, the target digital fingerprints are extracted from them based on the sliding window algorithm and the sizes of the digital fingerprints, with the starting point of each subsequent sliding window determined by the target digital fingerprint extracted from the previous one. On the basis of maintaining detection accuracy, this reduces the number and density of the digital fingerprints in the target fingerprint features, reduces the amount of calculation in the similarity calculation, and increases the speed of text similarity detection.
For convenience of understanding, an embodiment of the present invention provides an implementation manner of the step S104, which is shown in the following step 1.1 to step 1.3:
Step 1.1, preprocessing each text to obtain a word sequence of each text.
The preprocessing may include format conversion, word segmentation, stop-word removal, and the like; for the specific preprocessing process, reference may be made to the related prior art, which is not described again here. It should be noted that the preprocessing differs for texts in different languages.
The word sequence is formed by arranging the words obtained by preprocessing in the order in which they appear in the corresponding text, and can be expressed as $T_0^* = [t_1, t_2, t_3, \ldots, t_n]$, where $t_1$, $t_2$, $t_3$, and $t_n$ respectively represent the words with sequence numbers 1, 2, 3, and n, and n represents the number of words obtained by preprocessing.
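A minimal sketch of such preprocessing for English text is given below, assuming whitespace-and-punctuation tokenization and a tiny hypothetical stop-word list (`STOP_WORDS`). Chinese text would need a dedicated word segmenter instead, and none of these names come from the description.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def to_word_sequence(text):
    """Lowercase the text, split it into alphanumeric tokens, and drop
    stop words, preserving the original word order."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The returned list plays the role of the word sequence $[t_1, \ldots, t_n]$, with n equal to its length.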
Step 1.2, encoding the words in the word sequence of each text to obtain text features of each text.
Optionally, when the words in the word sequence are Chinese characters, the words may be encoded using a Chinese encoding method based on pinyin initials, and the text features obtained by encoding can be expressed as $T_c^* = [\mathrm{code}(t_1), \mathrm{code}(t_2), \mathrm{code}(t_3), \ldots, \mathrm{code}(t_n)]$, where $\mathrm{code}(t_1)$, $\mathrm{code}(t_2)$, $\mathrm{code}(t_3)$, and $\mathrm{code}(t_n)$ respectively represent the encoding results of the words $t_1$, $t_2$, $t_3$, and $t_n$.
It should be noted that word sequences in different languages may be encoded differently; for example, the encoding of Chinese characters may differ from the encoding of English words. Word sequences in the same language may also be encoded in several ways; for example, there are numerous encodings for Chinese characters. Different encodings may cause the same text to be mapped to different digital fingerprints in the subsequent step.
Step 1.3, performing digital fingerprint mapping on the text features of each text to obtain initial fingerprint features corresponding to each text.
In a possible implementation, a mapping model based on a hash function may be used to perform digital fingerprint mapping; that is, hash mapping is used, and the hash value calculated by the hash function serves as the digital fingerprint. The hash calculation is described in detail below with reference to Fig. 2: the hash values of the text features are calculated by rolling hash. As shown in Fig. 2, every m consecutive words in the word sequence of the text are mapped to one hash value; for example, $t_1$ to $t_m$ are mapped to the first hash value $h_1$ (i.e., the digital fingerprint with sequence number 1), $t_2$ to $t_{m+1}$ are mapped to the second hash value $h_2$, ..., and $t_{n-m+1}$ to $t_n$ are mapped to the (n-m+1)-th hash value $h_{n-m+1}$. The hash value (i.e., the digital fingerprint) can be calculated by the following formula:
$$h_i = \mathrm{code}(t_i)\,b^{m-1} + \mathrm{code}(t_{i+1})\,b^{m-2} + \cdots + \mathrm{code}(t_{i+m-1})$$
where $h_i$ represents the digital fingerprint with sequence number i, $\mathrm{code}(t_i)$ represents the encoding result of the word $t_i$, b represents a preset constant, and the value of m is positively correlated with n (namely, the number of words in the word sequence). The hash value sequence formed by the calculated hash values is determined as the initial fingerprint feature corresponding to the text, which may be represented as $H = [h_1, h_2, h_3, \ldots, h_s]$, $s = n - m + 1$.
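The rolling hash over m consecutive word codes can be sketched as follows. The sketch assumes the word codes are already integers, uses an illustrative default of b = 31, and applies no modulus because the formula in the description shows none; the function name `kgram_hashes` is hypothetical.

```python
def kgram_hashes(codes, m, b=31):
    """Compute h_i = code(t_i)*b^(m-1) + ... + code(t_{i+m-1}) for every
    run of m consecutive codes, updating each hash from the previous one."""
    n = len(codes)
    if m <= 0 or n < m:
        return []
    # Hash of the first m codes, evaluated by Horner's rule.
    h = 0
    for c in codes[:m]:
        h = h * b + c
    hashes = [h]
    top = b ** (m - 1)  # weight of the leading code in each hash
    for i in range(1, n - m + 1):
        # Remove the leading term, shift by b, and append the new code.
        h = (h - codes[i - 1] * top) * b + codes[i + m - 1]
        hashes.append(h)
    return hashes
```

For a sequence of n codes the result has n - m + 1 entries, matching s in the description; each update costs constant time instead of re-summing m terms.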
In another possible implementation manner, for example, a word embedding model or a word2vec model may be adopted to perform digital fingerprint mapping on text features of each text, so as to obtain initial fingerprint features corresponding to each text.
It should be noted that the mapping modes of the text features obtained by different encoding modes may be different; the text features obtained by the same encoding mode can also adopt a plurality of mapping modes.
For convenience of understanding, an embodiment of the present invention provides an implementation manner of the step S106, where an improved Winnowing algorithm is adopted in the implementation manner, specifically as shown in the following step 2.1 to step 2.5:
Step 2.1, acquiring the fingerprint sequence in the first sliding window from the digital fingerprints corresponding to the text according to a preset sliding window size, wherein the starting point of the first sliding window is the first digital fingerprint in the initial fingerprint features corresponding to the text.
The sliding window size w is preset and is positively correlated with the number of digital fingerprints corresponding to the text; the fingerprint sequence in the first sliding window can be represented as $H_1 = [h_1, h_2, h_3, \ldots, h_w]$. The number of digital fingerprints within each sliding window is typically equal to the sliding window size w, while the number of digital fingerprints within the last sliding window may be less than or equal to w.
Step 2.2, extracting at least one target digital fingerprint in the first sliding window from the fingerprint sequence in the first sliding window according to the size of each digital fingerprint in the fingerprint sequence in the first sliding window.
Alternatively, the step 2.2 can be realized by the following process:
(1) dividing the fingerprint sequence in the first sliding window into a plurality of parts to obtain at least one first interval block and at least one second interval block; the first interval block is the first preset number of parts in the multiple parts, and the second interval block is the other parts except the first interval block in the multiple parts; the first interval block and the second interval block each include a plurality of digital fingerprints.
The preset number can be set according to actual requirements. For example, if the fingerprint sequence in the first sliding window is divided into 3 parts and the preset number is 2, the first 2 parts are two first interval blocks and the 3rd part is a second interval block.
(2) Determining the minimum value among the digital fingerprints in the first interval blocks as the characteristic reference value of the first sliding window.
The characteristic reference value is used for the subsequent extraction of the target digital fingerprints. If the preset number is k, the characteristic reference value may be represented as $v = \min(H_{11}, H_{12}, \ldots, H_{1k})$, where $H_{11}$, $H_{12}$, and $H_{1k}$ respectively denote the first interval blocks with sequence numbers 1, 2, and k in the first sliding window.
(3) Selecting, according to the characteristic reference value, the target digital fingerprint in each second interval block from the digital fingerprints in that second interval block.
(4) Determining the target digital fingerprint in each second interval block as a target digital fingerprint in the first sliding window.
The number of target digital fingerprints within the first sliding window may be equal to the number of second interval blocks comprised by the first sliding window. For example, when the first sliding window includes two second interval blocks, the number of target digital fingerprints within the first sliding window is 2.
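Processes (1) and (2) can be sketched as below. The description does not say how a window is divided, so splitting it into contiguous blocks of near-equal size is an assumption, as are the names `split_window` and `reference_value`.

```python
def split_window(window, parts, k):
    """Split a window into `parts` contiguous blocks of near-equal size.

    The first k blocks are returned as first interval blocks and the
    remaining blocks as second interval blocks.
    """
    if not 0 < k < parts <= len(window):
        raise ValueError("require 0 < k < parts <= len(window)")
    size, extra = divmod(len(window), parts)
    blocks, start = [], 0
    for i in range(parts):
        # The first `extra` blocks absorb the remainder, one element each.
        end = start + size + (1 if i < extra else 0)
        blocks.append(window[start:end])
        start = end
    return blocks[:k], blocks[k:]


def reference_value(first_blocks):
    """Characteristic reference value v: the minimum over all first blocks."""
    return min(min(block) for block in first_blocks)
```

With a window of 6 fingerprints, 3 parts, and k = 2, this yields two first interval blocks and one second interval block, mirroring the example in the text.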
Optionally, the process (3) is specifically as follows: judging whether each second interval block meets a preset optimal decision constraint condition or not according to the characteristic reference value and a preset threshold value; if yes, determining the minimum value in the second interval block as the target digital fingerprint in the second interval block; if not, determining the last digital fingerprint in the second interval block as the target digital fingerprint in the second interval block; wherein the optimal decision constraint condition comprises:
Figure BDA0002326801650000121
where $H_{1t}$ denotes the second interval block with sequence number t in the first sliding window; $\min(H_{1t})$ represents the minimum value in $H_{1t}$; $\max(H_{1t})$ represents the maximum value in $H_{1t}$; v represents the characteristic reference value; and T represents the preset threshold value, $T \in (0, 1)$.
The optimal decision constraint condition is normalized to constrain the preset threshold T within (0, 1), which is beneficial to the selection of the preset threshold T.
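The selection rule for a single second interval block can be sketched as follows. Because the constraint formula itself survives here only as a figure reference, the sketch takes it as a caller-supplied predicate `satisfies_constraint(block, v, T)`; that parameter, the function name `select_target`, and the leftmost-minimum tie rule are all assumptions.

```python
def select_target(block, v, T, satisfies_constraint):
    """Pick the target fingerprint of one second interval block.

    If the block satisfies the decision constraint, the target is the
    block minimum (leftmost on ties, an assumption); otherwise it is the
    last fingerprint of the block. Returns (offset_in_block, value).
    """
    if satisfies_constraint(block, v, T):
        lo = min(block)
        return block.index(lo), lo
    return len(block) - 1, block[-1]
```

For instance, `select_target([8, 5, 9], 3, 0.5, lambda blk, v, T: True)` returns the block minimum, while passing a predicate that always returns `False` returns the last element instead.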
Step 2.3, acquiring a fingerprint sequence in the next sliding window, and extracting at least one target digital fingerprint in the next sliding window; and the starting point of the next sliding window is the next digital fingerprint of the last target digital fingerprint extracted from the previous sliding window in the initial fingerprint features corresponding to the text.
For example, if the last target digital fingerprint extracted in the previous sliding window is $h_a$ (i.e., the digital fingerprint with sequence number a), then the starting point of the next sliding window is $h_{a+1}$ (i.e., the digital fingerprint with sequence number a + 1).
Step 2.4, until the last sliding window is processed, at least one target digital fingerprint in the last sliding window is obtained; and the terminal point of the last sliding window is the last digital fingerprint in the initial fingerprint features corresponding to the text.
Optionally, when the number of digital fingerprints in the last sliding window is smaller than the sliding window size, the minimum value of the digital fingerprints in the last sliding window may be used as the target digital fingerprint in the last sliding window. For example, if the sliding window size is 7 and the fingerprint sequence in the last sliding window is $[h_{s-1}, h_s]$, the target digital fingerprint in the last sliding window is $\min(h_{s-1}, h_s)$.
Step 2.5, generating the target fingerprint feature corresponding to the text according to the target digital fingerprints extracted in the sliding windows.
Vector generation may be performed on the extracted target digital fingerprints in each sliding window according to the order (the above sequence number) of each digital fingerprint in the initial fingerprint feature corresponding to the text, so as to obtain the target fingerprint feature corresponding to the text.
During specific implementation, each sliding window can generate a feature vector formed by the extracted target digital fingerprints, and then the corresponding feature vectors are connected end to end according to the sequence of the sliding windows to obtain the target fingerprint features corresponding to the text.
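The window traversal of steps 2.3 to 2.5 can be sketched as follows. This is an illustrative reading rather than the definitive procedure: `extract_targets` and `pick_offsets` are hypothetical names, the per-window selection is delegated to a caller-supplied picker, and treating a full window that ends exactly on the last digital fingerprint as the last window is an assumption.

```python
from typing import Callable, List

# A picker receives one full window and returns the offsets (within that
# window, in ascending order) of the target fingerprints it selects.
Picker = Callable[[List[int]], List[int]]


def extract_targets(fingerprints: List[int], window_size: int,
                    pick_offsets: Picker) -> List[int]:
    """Collect target fingerprints window by window, in original order."""
    targets: List[int] = []
    n = len(fingerprints)
    start = 0
    while start < n:
        window = fingerprints[start:start + window_size]
        if len(window) < window_size:
            # Short last window: its minimum is the target.
            targets.append(min(window))
            break
        offsets = pick_offsets(window)
        if not offsets:
            raise ValueError("picker must select at least one fingerprint")
        targets.extend(window[o] for o in offsets)
        if start + window_size >= n:
            break  # assumed: this full window is the last one
        # The next window starts right after the last extracted target.
        start += offsets[-1] + 1
    return targets


hashes = [68, 14, 42, 55, 16, 45, 26, 88]
# Stand-in picker (leftmost minimum), NOT the interval-block rule.
by_min = extract_targets(hashes, 7, lambda w: [w.index(min(w))])
# Stand-in picker that always selects offset 4 of a full window.
by_fixed = extract_targets(hashes, 7, lambda w: [4])
print(by_min, by_fixed)  # [14, 16] [16, 26]
```

Appending targets window by window, and by ascending offset inside each window, keeps them in the order of the initial fingerprint feature, which corresponds to joining the per-window feature vectors end to end.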
For ease of understanding, an embodiment of the present invention further provides another text similarity detection method. Referring to the flowchart of this method shown in fig. 3, the method includes the following steps S302 to S314:
Step S302, acquiring the texts to be detected. There are two texts to be detected.
Step S304, preprocessing. That is, the texts to be detected acquired in step S302 are preprocessed.
Step S306, extracting text features. That is, text feature extraction is performed on the texts preprocessed in step S304.
Step S308, mapping digital fingerprints. That is, digital fingerprint mapping is performed on the text features extracted in step S306 to obtain a plurality of digital fingerprints.
Step S310, extracting digital fingerprints. That is, extraction is performed on the digital fingerprints obtained in step S308 to obtain the target fingerprint feature corresponding to each text to be detected.
Step S312, calculating the similarity. That is, similarity calculation is performed on the target fingerprint features obtained in step S310.
Step S314, outputting the calculation result. That is, the calculation result of step S312 is output.
For the parts not described in detail in step S302 to step S312, reference may be made to the corresponding contents in the foregoing embodiments, and details are not described here again.
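How steps S302 to S314 chain together can be sketched as below. Every stage here is a stand-in chosen for illustration and not taken from the patent: whitespace splitting replaces word segmentation, the encoding step is left as the identity, SHA-256 reduced modulo 100 replaces the unspecified hash mapping, and the extraction and similarity stages are supplied by the caller. All function names are hypothetical.

```python
import hashlib
from typing import Callable, List, Sequence


def preprocess(text: str) -> List[str]:
    """Stand-in for S304: lower-case and split on whitespace."""
    return text.lower().split()


def map_to_fingerprints(features: Sequence[str], modulus: int = 100) -> List[int]:
    """Stand-in for S308: hash every feature to a small integer."""
    return [int(hashlib.sha256(f.encode("utf-8")).hexdigest(), 16) % modulus
            for f in features]


def detect_similarity(text_a: str, text_b: str,
                      extract: Callable[[List[int]], List[int]],
                      similarity: Callable[[List[int], List[int]], float]) -> float:
    """Run S304 to S312 on two texts and return the similarity score."""
    target_features = []
    for text in (text_a, text_b):                       # S302: two texts
        words = preprocess(text)                        # S304
        features = words                                # S306: encoding omitted
        fingerprints = map_to_fingerprints(features)    # S308
        target_features.append(extract(fingerprints))   # S310
    return similarity(target_features[0], target_features[1])  # S312


# S314: output the result, using trivial stand-in stages.
score = detect_similarity(
    "this cost is higher", "this cost is higher",
    extract=lambda fps: fps,
    similarity=lambda a, b: 1.0 if a == b else 0.0,
)
print(score)  # 1.0
```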
To facilitate understanding, the process of acquiring the target digital fingerprint is described below with reference to the example shown in fig. 4.
As shown in fig. 4, (1) one text to be detected is "this cost is slightly higher than I expected"; (2) preprocessing the text gives the word segmentation result: this/cost/ratio/me/expect/little/high/some; (3) Chinese coding of the word segmentation result gives the text features: ZH8g1/d5j5/b5/w1/y515/sh5/g4/y1x 3; (4) hash mapping of the text features gives the hash value sequence (namely the initial fingerprint feature): [68 14 42 55 16 45 26 88]; (5) digital fingerprint extraction is performed on the hash value sequence with a sliding window size of 7, a preset number of 2 and a preset threshold of 0.8. The fingerprint sequence in the first sliding window is [68 14 42 55 16 45 26], which is divided into three parts: a first part [68 14], a second part [42 55] and a third part [16 45 26]. The first part and the second part are both first interval blocks, and the third part is a second interval block. Since the minimum value of the first part is 14 and the minimum value of the second part is 42, the characteristic reference value is v = min(14, 42) = 14. Substituting the characteristic reference value 14, the preset threshold 0.8, and the minimum value 16 and maximum value 45 of the second interval block into the optimal decision constraint condition shows that
Figure BDA0002326801650000141
is true, so the target digital fingerprint in the first sliding window is 16. The fingerprint sequence in the second sliding window is [45 26 88]; because the number of digital fingerprints in the second sliding window is smaller than the sliding window size, the minimum value 26 is selected as the target digital fingerprint in the second sliding window; (6) the target fingerprint feature obtained by the above process is [16 26].
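The arithmetic of the fig. 4 example can be traced step by step as below. This is only a trace of the numbers given in the text: the outcome of the optimal decision constraint condition is taken from the text as true rather than recomputed, since the formula appears only as a figure, and all variable names are hypothetical.

```python
hashes = [68, 14, 42, 55, 16, 45, 26, 88]   # initial fingerprint feature
window_size = 7

# First sliding window and its three parts, as listed in the example.
window = hashes[:window_size]                # [68, 14, 42, 55, 16, 45, 26]
first_blocks = [window[0:2], window[2:4]]    # preset number = 2
second_block = window[4:7]                   # [16, 45, 26]

# Characteristic reference value: min(14, 42) = 14.
v = min(min(block) for block in first_blocks)

# Taken from the text: the constraint holds for v = 14, T = 0.8,
# block minimum 16 and block maximum 45.
constraint_holds = True
target_1 = min(second_block) if constraint_holds else second_block[-1]

# The next window starts right after the extracted target
# (index lookup is safe here because the values are distinct).
next_start = hashes.index(target_1) + 1
last_window = hashes[next_start:next_start + window_size]   # [45, 26, 88]

# Fewer than 7 fingerprints remain, so the minimum is the target.
target_2 = min(last_window)

target_feature = [target_1, target_2]
print(v, last_window, target_feature)  # 14 [45, 26, 88] [16, 26]
```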
In the embodiment, the digital fingerprints are selected from the hash value sequence by using the sliding window, so that the density of the digital fingerprints is reduced, and the aim of quickly calculating the text similarity is fulfilled.
The embodiment of the present invention further provides a precision comparison of two text similarity detection methods, shown in fig. 5: one method uses the Winnowing algorithm (dotted line in fig. 5), and the other uses the improved Winnowing algorithm (solid line in fig. 5). Fig. 5 shows that, as the T value (preset threshold) increases, the precision of both the improved Winnowing algorithm and the Winnowing algorithm increases, and the precision of the improved Winnowing algorithm is always greater than that of the Winnowing algorithm. This also holds when the T value is within the range of 0.7 to 0.9, so the improved Winnowing algorithm can ensure the accuracy of the detection result.
Corresponding to the text similarity detection method, an embodiment of the present invention further provides a text similarity detection apparatus, referring to a schematic structural diagram of the text similarity detection apparatus shown in fig. 6, where the apparatus includes:
the first obtaining module 62 is configured to obtain two texts to be detected;
a second obtaining module 64, configured to obtain an initial fingerprint feature corresponding to each text, where the initial fingerprint feature includes a plurality of digital fingerprints;
an extracting module 66, configured to, for each text, extract target digital fingerprints from the digital fingerprints corresponding to the text based on a sliding window algorithm, so as to obtain a target fingerprint feature corresponding to the text; wherein the extraction of the target digital fingerprints is related to the size of each digital fingerprint corresponding to the text, and the starting point of the next sliding window is related to the target digital fingerprint extracted from the previous sliding window;
and the calculating module 68 is configured to perform similarity calculation on the target fingerprint features corresponding to the two texts to obtain a similarity detection result of the two texts.
In the embodiment of the invention, after the initial fingerprint characteristics of the two texts are obtained, the target digital fingerprint is extracted from the initial fingerprint characteristics based on the sliding window algorithm and the size of the digital fingerprint, and when the target digital fingerprint is extracted, the starting point of the next sliding window is related to the target digital fingerprint extracted from the previous sliding window, so that on the basis of ensuring the detection accuracy, the number of the digital fingerprints in the target fingerprint characteristics is reduced, the density of the digital fingerprints is reduced, the calculation amount in similarity calculation is reduced, and the detection speed of text similarity detection is improved.
Optionally, the second obtaining module 64 is specifically configured to: preprocessing each text to obtain a word sequence of each text; coding words in the word sequence of each text to obtain text characteristics of each text; and carrying out digital fingerprint mapping on the text features of each text to obtain the initial fingerprint features corresponding to each text.
Optionally, the extracting module 66 is specifically configured to: acquiring a fingerprint sequence in a first sliding window from each digital fingerprint corresponding to the text according to a preset sliding window size; the starting point of the first sliding window is a first digital fingerprint in the initial fingerprint characteristics corresponding to the text; extracting at least one target digital fingerprint in a first sliding window from the fingerprint sequence in the first sliding window according to the size of each digital fingerprint in the fingerprint sequence in the first sliding window; acquiring a fingerprint sequence in the next sliding window, and extracting at least one target digital fingerprint in the next sliding window; the starting point of the next sliding window is the next digital fingerprint of the last target digital fingerprint extracted from the previous sliding window in the initial fingerprint features corresponding to the text; processing to the last sliding window to obtain at least one target digital fingerprint in the last sliding window; the terminal point of the last sliding window is the last digital fingerprint in the initial fingerprint characteristics corresponding to the text; and generating target fingerprint characteristics corresponding to the text according to the extracted target digital fingerprints in the sliding windows.
Further, the extracting module 66 is further configured to: dividing the fingerprint sequence in the first sliding window into a plurality of parts to obtain at least one first interval block and at least one second interval block; the first interval block is the first preset number of parts in the multiple parts, and the second interval block is the other parts except the first interval block in the multiple parts; the first interval block and the second interval block each include a plurality of digital fingerprints; determining the minimum value in each first interval block as a characteristic reference value of a first sliding window; according to the characteristic reference value, selecting a target digital fingerprint in each second interval block from the digital fingerprints in each second interval block; and determining the target digital fingerprint in each second interval block as the target digital fingerprint in the first sliding window.
Further, the extracting module 66 is further configured to: judging whether each second interval block meets a preset optimal decision constraint condition or not according to the characteristic reference value and a preset threshold value; if yes, determining the minimum value in the second interval block as the target digital fingerprint in the second interval block; if not, determining the last digital fingerprint in the second interval block as the target digital fingerprint in the second interval block; wherein the optimal decision constraint condition comprises:
Figure BDA0002326801650000171
where $H_{1t}$ denotes the second interval block with sequence number t in the first sliding window; $\min(H_{1t})$ denotes the minimum value in $H_{1t}$; $\max(H_{1t})$ denotes the maximum value in $H_{1t}$; v denotes the characteristic reference value; and T denotes the preset threshold.
Further, the extracting module 66 is further configured to: and according to the sequence of each digital fingerprint in the initial fingerprint characteristics corresponding to the text, vector generation is carried out on the extracted target digital fingerprints in each sliding window, and the target fingerprint characteristics corresponding to the text are obtained.
Optionally, the calculating module 68 is specifically configured to: and calculating the similarity of the target fingerprint characteristics corresponding to the two texts by adopting cosine similarity to obtain a similarity detection result of the two texts.
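A cosine similarity over two target fingerprint features might look like the sketch below. The text does not say how fingerprint sequences of different lengths are turned into comparable vectors, so the sketch assumes each feature is treated as a count vector over fingerprint values; that vectorisation, and the function name `cosine_similarity`, are assumptions.

```python
import math
from collections import Counter
from typing import Sequence


def cosine_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Cosine similarity of two fingerprint features as count vectors."""
    counts_a, counts_b = Counter(a), Counter(b)
    # Only fingerprint values present in both features contribute.
    dot = sum(counts_a[k] * counts_b[k]
              for k in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty feature shares nothing with any other
    return dot / (norm_a * norm_b)


same = cosine_similarity([16, 26], [16, 26])
half = cosine_similarity([16, 26], [16, 30])
none = cosine_similarity([16, 26], [40, 50])
print(round(same, 6), round(half, 6), none)  # 1.0 0.5 0.0
```

Because the target fingerprint features are much shorter than the initial ones, the vectors compared here are small, which is where the reduction in computation described in this embodiment comes from.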
The implementation principle and technical effects of the device provided by this embodiment are the same as those of the foregoing method embodiments. For brevity, where the device embodiment does not mention something, reference may be made to the corresponding content in the method embodiments.
Referring to fig. 7, an embodiment of the present invention further provides an electronic device 100, including: a processor 70, a memory 71, a bus 72 and a communication interface 73, wherein the processor 70, the communication interface 73 and the memory 71 are connected through the bus 72; the processor 70 is arranged to execute executable modules, such as computer programs, stored in the memory 71.
The memory 71 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 73 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used.
The bus 72 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 7, but this does not indicate only one bus or one type of bus.
The memory 71 is configured to store a program, and the processor 70 executes the program after receiving an execution instruction. The method performed by the apparatus defined by the flow disclosed in any of the foregoing embodiments of the present invention may be applied to, or implemented by, the processor 70.
The processor 70 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 70. The processor 70 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 71, and the processor 70 reads the information in the memory 71 and completes the steps of the method in combination with its hardware.
The embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the text similarity detection method described in the foregoing method embodiments is performed. The computer-readable storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A text similarity detection method is characterized by comprising the following steps:
acquiring two texts to be detected;
acquiring initial fingerprint features corresponding to each text, wherein the initial fingerprint features comprise a plurality of digital fingerprints;
for each text, extracting target digital fingerprints from the digital fingerprints corresponding to the text based on a sliding window algorithm to obtain a target fingerprint feature corresponding to the text; wherein the extraction of the target digital fingerprints is related to the size of each digital fingerprint corresponding to the text, and the starting point of the next sliding window is related to the target digital fingerprint extracted from the previous sliding window;
similarity calculation is carried out on target fingerprint characteristics corresponding to the two texts, and a similarity detection result of the two texts is obtained;
the method for extracting a target digital fingerprint from each digital fingerprint corresponding to the text based on the sliding window algorithm to obtain the target fingerprint characteristics corresponding to the text comprises the following steps:
acquiring a fingerprint sequence in a first sliding window from each digital fingerprint corresponding to the text according to a preset sliding window size; the starting point of the first sliding window is a first digital fingerprint in initial fingerprint features corresponding to the text;
extracting at least one target digital fingerprint in the first sliding window from the fingerprint sequence in the first sliding window according to the size of each digital fingerprint in the fingerprint sequence in the first sliding window;
acquiring a fingerprint sequence in a next sliding window, and extracting at least one target digital fingerprint in the next sliding window; the starting point of the next sliding window is the next digital fingerprint of the last target digital fingerprint extracted from the previous sliding window in the initial fingerprint features corresponding to the text;
until the last sliding window is processed, at least one target digital fingerprint in the last sliding window is obtained; the terminal point of the last sliding window is the last digital fingerprint in the initial fingerprint characteristics corresponding to the text;
generating target fingerprint characteristics corresponding to the text according to the extracted target digital fingerprints in the sliding windows;
the extracting at least one target digital fingerprint in the first sliding window from the fingerprint sequence in the first sliding window according to the size of each digital fingerprint in the fingerprint sequence in the first sliding window includes:
dividing the fingerprint sequence in the first sliding window into a plurality of parts to obtain at least one first interval block and at least one second interval block; the first interval block is a preset number of parts in the plurality of parts, and the second interval block is the other part except the first interval block in the plurality of parts; the first interval block and the second interval block each comprise a plurality of digital fingerprints;
determining the minimum value in each first interval block as a characteristic reference value of the first sliding window;
according to the characteristic reference value, selecting a target digital fingerprint in each second interval block from the digital fingerprints in the second interval blocks;
and determining the target digital fingerprint in each second interval block as the target digital fingerprint in the first sliding window.
2. The method of claim 1, wherein the obtaining of the initial fingerprint feature corresponding to each of the texts comprises:
preprocessing each text to obtain a word sequence of each text;
coding words in the word sequence of each text to obtain text characteristics of each text;
and performing digital fingerprint mapping on the text features of each text to obtain initial fingerprint features corresponding to each text.
3. The method according to claim 1, wherein the selecting the target digital fingerprint in each second block from the respective digital fingerprints in the second blocks according to the feature reference value comprises:
judging whether each second interval block meets a preset optimal decision constraint condition or not according to the characteristic reference value and a preset threshold value;
if so, determining the minimum value in the second interval block as the target digital fingerprint in the second interval block;
if not, determining the last digital fingerprint in the second interval block as the target digital fingerprint in the second interval block;
wherein the optimal decision constraint comprises:
Figure FDA0002971297950000031
where $H_{1t}$ denotes the second interval block with sequence number t in the first sliding window; $\min(H_{1t})$ denotes the minimum value in $H_{1t}$; $\max(H_{1t})$ denotes the maximum value in $H_{1t}$; v denotes the characteristic reference value; and T denotes the preset threshold.
4. The method of claim 1, wherein generating the target fingerprint feature corresponding to the text according to the extracted target digital fingerprint in each sliding window comprises:
and according to the sequence of each digital fingerprint in the initial fingerprint characteristics corresponding to the text, vector generation is carried out on the extracted target digital fingerprints in each sliding window, and the target fingerprint characteristics corresponding to the text are obtained.
5. The method according to claim 1, wherein the performing similarity calculation on the target fingerprint features corresponding to the two texts to obtain a similarity detection result of the two texts comprises:
and calculating the similarity of the target fingerprint characteristics corresponding to the two texts by adopting cosine similarity to obtain a similarity detection result of the two texts.
6. A text similarity detection apparatus, comprising:
the first acquisition module is used for acquiring two texts to be detected;
the second acquisition module is used for acquiring initial fingerprint features corresponding to each text, and the initial fingerprint features comprise a plurality of digital fingerprints;
the extraction module is used for extracting, for each text, target digital fingerprints from the digital fingerprints corresponding to the text based on a sliding window algorithm to obtain a target fingerprint feature corresponding to the text; wherein the extraction of the target digital fingerprints is related to the size of each digital fingerprint corresponding to the text, and the starting point of the next sliding window is related to the target digital fingerprint extracted from the previous sliding window;
the calculation module is used for carrying out similarity calculation on target fingerprint characteristics corresponding to the two texts to obtain a similarity detection result of the two texts;
the extraction module is specifically configured to: acquiring a fingerprint sequence in a first sliding window from each digital fingerprint corresponding to the text according to a preset sliding window size; the starting point of the first sliding window is a first digital fingerprint in initial fingerprint features corresponding to the text; extracting at least one target digital fingerprint in the first sliding window from the fingerprint sequence in the first sliding window according to the size of each digital fingerprint in the fingerprint sequence in the first sliding window; acquiring a fingerprint sequence in a next sliding window, and extracting at least one target digital fingerprint in the next sliding window; the starting point of the next sliding window is the next digital fingerprint of the last target digital fingerprint extracted from the previous sliding window in the initial fingerprint features corresponding to the text; until the last sliding window is processed, at least one target digital fingerprint in the last sliding window is obtained; the terminal point of the last sliding window is the last digital fingerprint in the initial fingerprint characteristics corresponding to the text; generating target fingerprint characteristics corresponding to the text according to the extracted target digital fingerprints in the sliding windows;
the extraction module is further configured to: dividing the fingerprint sequence in the first sliding window into a plurality of parts to obtain at least one first interval block and at least one second interval block; the first interval block is a preset number of parts in the plurality of parts, and the second interval block is the other part except the first interval block in the plurality of parts; the first interval block and the second interval block each comprise a plurality of digital fingerprints; determining the minimum value in each first interval block as a characteristic reference value of the first sliding window; according to the characteristic reference value, selecting a target digital fingerprint in each second interval block from the digital fingerprints in the second interval blocks; and determining the target digital fingerprint in each second interval block as the target digital fingerprint in the first sliding window.
7. An electronic device comprising a memory, a processor, a computer program being stored in the memory and being executable on the processor, wherein the processor realizes the method of any of claims 1-5 when executing the computer program.
8. A computer-readable storage medium, having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, is adapted to carry out the method of any one of claims 1-5.
CN201911321980.4A 2019-12-19 2019-12-19 Text similarity detection method and device and electronic equipment Active CN111104484B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911321980.4A CN111104484B (en) 2019-12-19 2019-12-19 Text similarity detection method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911321980.4A CN111104484B (en) 2019-12-19 2019-12-19 Text similarity detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111104484A CN111104484A (en) 2020-05-05
CN111104484B true CN111104484B (en) 2021-09-03

Family

ID=70423555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911321980.4A Active CN111104484B (en) 2019-12-19 2019-12-19 Text similarity detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111104484B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102413007A (en) * 2011-10-12 2012-04-11 上海奇微通讯技术有限公司 Deep packet inspection method and equipment
CN105718430A (en) * 2016-01-13 2016-06-29 湖南工业大学 Grouping minimum value-based method for calculating fingerprint similarity
CN109241274A (en) * 2017-07-04 2019-01-18 腾讯科技(深圳)有限公司 text clustering method and device
CN109359183A (en) * 2018-10-11 2019-02-19 南京中孚信息技术有限公司 The duplicate checking method, apparatus and electronic equipment of text information

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8144947B2 (en) * 2008-06-27 2012-03-27 Palo Alto Research Center Incorporated System and method for finding a picture image in an image collection using localized two-dimensional visual fingerprints
CN107133622B (en) * 2016-02-29 2022-08-26 阿里巴巴集团控股有限公司 Word segmentation method and device

Also Published As

Publication number Publication date
CN111104484A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
CN107767870B (en) Punctuation mark adding method and device and computer equipment
CN107330306B (en) Text watermark embedding and extracting method and device, electronic equipment and storage medium
US8838657B1 (en) Document fingerprints using block encoding of text
CN112527992B (en) Long text processing method, related device and readable storage medium
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN110321913B (en) Text recognition method and device
CN110991310A (en) Portrait detection method, portrait detection device, electronic equipment and computer readable medium
CN113553848A (en) Long text classification method, system, electronic equipment and computer readable storage medium
CN111159394A (en) Text abstract generation method and device
CN111563391A (en) Machine translation method and device and electronic equipment
CN114581926A (en) Multi-line text recognition method, device, equipment and medium
CN111651674B (en) Bidirectional searching method and device and electronic equipment
CN110704608A (en) Text theme generation method and device and computer equipment
CN112182337B (en) Method for identifying similar news from massive short news and related equipment
CN111831869B (en) Character string duplicate checking method, device, terminal equipment and storage medium
CN111104484B (en) Text similarity detection method and device and electronic equipment
CN112861844A (en) Service data processing method and device and server
CN114996360B (en) Data analysis method, system, readable storage medium and computer equipment
CN107832341B (en) AGNSS user duplicate removal statistical method
CN111159996B (en) Short text set similarity comparison method and system based on text fingerprint algorithm
CN113392902A (en) Data set processing method and device, storage medium and electronic equipment
CN114528944A (en) Medical text encoding method, device and equipment and readable storage medium
CN114629707A (en) Method and device for detecting messy codes, electronic equipment and storage medium
KR20060112380A (en) Apparatus and method for binary image compression
CN112765937A (en) Text regularization method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant