CN114492369A - Text comparison method and device, electronic equipment and computer readable storage medium - Google Patents

Publication number
CN114492369A
Authority
CN
China
Prior art keywords
text
fingerprint
value
character string
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210096674.0A
Other languages
Chinese (zh)
Inventor
郭峰
范泽宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qianxin Technology Group Co Ltd
Secworld Information Technology Beijing Co Ltd
Original Assignee
Qianxin Technology Group Co Ltd
Secworld Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qianxin Technology Group Co Ltd, Secworld Information Technology Beijing Co Ltd filed Critical Qianxin Technology Group Co Ltd
Priority to CN202210096674.0A priority Critical patent/CN114492369A/en
Publication of CN114492369A publication Critical patent/CN114492369A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Abstract

The application provides a text comparison method, a text comparison device, electronic equipment and a computer-readable storage medium. The method comprises: traversing a first text to obtain a fingerprint value of a first fingerprint of each character string in the first text; traversing a second text to obtain a fingerprint value of a second fingerprint of each character string in the second text; and comparing the fingerprint value of the first fingerprint with the fingerprint value of the second fingerprint to obtain the similarity between the second text and the first text. The scheme of the embodiments of the application has a time complexity of about O(M + N) and a space complexity of O(M + N). Compared with the related technology, it effectively reduces the time complexity and space complexity of the algorithm, which improves operating efficiency and text comparison efficiency and saves computing resources and time overhead.

Description

Text comparison method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of data analysis technologies, and in particular, to a text comparison method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
When similarity calculation is performed on large-scale texts, conventional schemes use similarity comparison algorithms such as the LD (Levenshtein distance, i.e., edit distance) algorithm and the Needleman-Wunsch algorithm. However, both the LD algorithm and the Needleman-Wunsch algorithm have high algorithmic complexity and low operating efficiency.
Disclosure of Invention
An embodiment of the present application provides a text comparison method, a text comparison device, an electronic device, and a computer-readable storage medium, so as to improve text comparison efficiency.
The embodiment of the application provides a text comparison method, which comprises the following steps: traversing a first text to obtain a fingerprint value of a first fingerprint of each character string in the first text; traversing a second text to obtain a fingerprint value of a second fingerprint of each character string in the second text; comparing the fingerprint value of the first fingerprint with the fingerprint value of the second fingerprint to obtain the similarity between the second text and the first text; the fingerprint value is a value representing the content and the structural characteristics of the character string.
In the implementation process, the fingerprint values of the character strings in the first text and the second text are obtained by traversing the two texts. Because each fingerprint value depends on one character string, a difference between fingerprint values reflects a difference between the corresponding character strings. Therefore, the similarity between the second text and the first text can be effectively determined by comparing the fingerprint value of the first fingerprint with the fingerprint value of the second fingerprint. The scheme of the embodiment of the application only needs to traverse and compare the first text and the second text, so its time complexity is about O(M + N) (M and N are the character string lengths of the two texts, and O is the complexity symbol) and its space complexity is O(M + N). In the LD algorithm, the Needleman-Wunsch algorithm, and other algorithms in the related art, both the time complexity and the space complexity are O(M × N). Compared with the related art, the scheme of the embodiment of the application therefore effectively reduces the time complexity and space complexity of the algorithm, which improves operating efficiency and text comparison efficiency and saves computing resources and time overhead.
Further, the fingerprint value of the first fingerprint is a hash value of a character string in the first text; and the fingerprint value of the second fingerprint is the hash value of the character string in the second text.
In the implementation process, the hash value is used as the fingerprint value of the character string, so that the implementation is simple and reliable, and the scheme of the embodiment of the application is beneficial to popularization in industrial application.
Further, the hash value of a character string in the first text is the hash value calculated for the last character of that character string; the hash value of a character string in the second text is the hash value calculated for the last character of that character string; and the hash value of each character in a character string is a value calculated from the hash value of the previous character and the unique identification value of the current character.
In the implementation process, because the hash value of the character string is the hash value calculated for its last character, and the hash value of each character is calculated from the hash value of the previous character and the unique identification value of the current character, the characteristics of every character in the character string are accumulated towards the end of the string. As a result, the difference between two character strings that differ only very slightly is amplified during the actual calculation, the difference between the calculated hash values increases, and the subsequent comparison effect is ensured.
Further, the hash value of each character in the character string is the value f(x) calculated according to the formula shown in Figure BDA0003491077870000021 (formula image not reproduced in this text), where x denotes the x-th character in the character string, seed is a preset constant, and str[x] is the unique identification value of the x-th character in the character string.
In the implementation process, through the design of seed, when the hash value of each character is calculated, the difference can be larger, so that the hash value difference between different character strings is larger, and the subsequent comparison effect is ensured.
Further, the unique identification value is an ASCII code value of a character.
In the implementation process, the ASCII code value of the character is used as the unique identification value of the character, so that the uniqueness of the hash value calculation of each character can be effectively ensured, and the uniqueness of the hash value of the character string is ensured.
Further, each character string in the first text is a character string formed by characters corresponding to each line in the first text; each character string in the second text is a character string formed by characters corresponding to each line in the second text.
In the implementation process, the character strings are constructed line by line and then compared, so that line-based file similarity comparison can be realized, which makes the method well suited to large files such as log files, data files and code.
Further, comparing the fingerprint value of the first fingerprint with the fingerprint value of the second fingerprint to obtain the similarity between the second text and the first text, including: for each second fingerprint in a second text, sequentially comparing the fingerprint value of the second fingerprint with the fingerprint value of each first fingerprint in the first text; counting the number and the positions of first fingerprints with the fingerprint values of the second fingerprints in the first text; and determining the similarity between the second text and the first text according to the counted number and position of the first fingerprints corresponding to the second fingerprints.
Further, determining the similarity between the second text and the first text according to the counted number and positions of the first fingerprints includes: constructing a similarity matrix according to the counted number and positions of the first fingerprints; and processing the similarity matrix with a preset similarity algorithm to obtain the similarity between the second text and the first text.
By the method, the sparse similarity matrix can be generated quickly, so that the time overhead cost for calculating the similarity can be greatly reduced through the sparse similarity matrix.
An embodiment of the present application further provides a text comparison apparatus, including: an acquisition module, configured to traverse a first text to obtain a fingerprint value of a first fingerprint of each character string in the first text, and to traverse a second text to obtain a fingerprint value of a second fingerprint of each character string in the second text; and a comparison module, configured to compare the fingerprint value of the first fingerprint with the fingerprint value of the second fingerprint to obtain the similarity between the second text and the first text; wherein the fingerprint value is a value representing the content and the structural characteristics of the character string.
The embodiment of the application also provides electronic equipment, which comprises a processor, a memory and a communication bus; the communication bus is used for realizing connection communication between the processor and the memory; the processor is configured to execute one or more programs stored in the memory to implement any of the text comparison methods described above.
Also provided in an embodiment of the present application is a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement any of the above-described text comparison methods.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic basic flow chart of a text comparison method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a specific text comparison process provided in the embodiment of the present application;
Fig. 3 is a schematic structural diagram of a text comparison apparatus according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The first embodiment is as follows:
In practical applications, the complexity of an algorithm is usually evaluated by its time complexity and space complexity. Time complexity evaluates the time required to execute a program and thus estimates the program's use of the processor; space complexity evaluates the storage space required to execute the program and thus estimates the program's use of computer memory.
In the field of text similarity comparison, the inventor notes that when similarity calculation is performed on large-scale texts, conventional schemes use similarity comparison algorithms such as the LD algorithm and the Needleman-Wunsch algorithm.
The LD algorithm calculates similarity based on edit distance: it converts a character string A into another character string B by inserting, deleting and replacing characters, and quantifies the degree of difference between the two character strings by the minimum number of such operations. Because the LD algorithm derives the specific matching character strings by computing a matching path through an LD matrix, its time complexity and space complexity are both O(M × N), where M and N are the character string lengths of the two texts. Performance is acceptable when the two texts are small, but computation and memory consumption grow with the product of the two lengths as the texts become larger. For example, if both character strings have 10240 characters, the size of the LD matrix is 10240 × 10240 × 4 B = 400 MB. Storage use of this magnitude is clearly unsuitable for application scenarios involving large-scale text files.
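The cost described above can be illustrated with a minimal sketch of the textbook full-matrix edit-distance computation. The function names `levenshtein` and `matrix_bytes` and the 4-byte cell size are assumptions made for illustration; they do not come from the application.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the full (len(a)+1) x (len(b)+1) matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    d: List[List[int]] = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
    return d[-1][-1]


def matrix_bytes(m: int, n: int, cell_bytes: int = 4) -> int:
    """Approximate matrix storage for strings of length m and n (assumed 4-byte cells)."""
    return m * n * cell_bytes
```

Every cell of the matrix is filled once, so both running time and storage scale with M × N; with 4-byte cells, two strings of 10240 characters each already need `matrix_bytes(10240, 10240)` bytes, which is 400 MiB.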
The Needleman-Wunsch algorithm is a text comparison algorithm based on the longest common subsequence (LCS). Like the LD algorithm, the Needleman-Wunsch algorithm uses the idea of dynamic programming. It additionally sets weights to prioritize the three operations (insert, delete, change). As with the LD algorithm, the time complexity and space complexity of the Needleman-Wunsch algorithm are both O(M × N), so it is also unsuitable for application scenarios involving large-scale text files.
In order to ensure the operating efficiency when similarity calculation is performed on a large-scale text and improve the text comparison efficiency, the embodiment of the application provides a text comparison method. As shown in fig. 1, fig. 1 is a schematic flow chart of a text comparison method provided in an embodiment of the present application, and includes:
S101: Traverse the first text to obtain the fingerprint value of the first fingerprint of each character string in the first text.
S102: Traverse the second text to obtain the fingerprint value of the second fingerprint of each character string in the second text.
It should be noted that, in the embodiment of the present application, the first text and the second text are two texts that need to be compared similarly. When a plurality of texts need to be subjected to similar comparison, the similar comparison between any two texts can be realized by adopting the method.
In this embodiment of the present application, the first text and the second text may be a log file, a data file, a code, a configuration file, and the like, which are not limited in this embodiment of the present application.
It should be noted that, in the embodiment of the present application, there is no ordering constraint between step S101 and step S102. That is, step S101 may be executed first, step S102 may be executed first, or the two steps may be executed simultaneously, which is not limited in the embodiment of the present application.
It should be further noted that, in the embodiment of the present application, the fingerprint value may characterize the content and the structural characteristics of the character string. That is, different numbers of characters in a character string, different character contents, and different permutation and combination among characters all affect the fingerprint value corresponding to the character string. Each fingerprint value depends on one character string, so that whether the character strings corresponding to the fingerprint values are different or not can be reflected according to whether the fingerprint values are different or not.
For example, in a possible implementation manner of the embodiment of the present application, a hash value of a character string in the first text may be used as a fingerprint value of the first fingerprint, and a hash value of a character string in the second text may be used as a fingerprint value of the second fingerprint.
In the embodiment of the present application, the hash value of the character string may be calculated by using an algorithm such as MD5 (Message-Digest Algorithm 5), SHA (Secure Hash Algorithm), DES (Data Encryption Standard), or AES (Advanced Encryption Standard).
In addition, in an optional implementation manner of the embodiment of the present application, a hash value corresponding to each character in the character string may also be calculated, and during the calculation, the hash value corresponding to the current character may be determined and obtained by using the hash value of the previous character and the unique identification value of the current character (it should be noted that, for the first character in the character string, since there is no hash value of the previous character, the hash value of the current character may be determined and obtained only according to the unique identification value of the current character).
Since the hash value of the last character is obtained based on the hash values of the preceding characters, the hash value of the last character is affected by all the characters in the character string, and the change of any one of the preceding characters results in the change of the hash value of the last character. Thus, in one possible example approach, the hash value of the last character in the string may be used as the hash value of the string, so that a unique representation of the string may be achieved.
For example, assuming that a character string is composed of A, B, C characters in sequence, a hash value a corresponding to the character a may be obtained by calculation according to the unique identifier of the character a, then a hash value B corresponding to the character B (actually equivalent to the hash value of the character string AB being calculated) may be obtained by calculation according to the hash value a and the unique identifier of the character B, then a hash value C corresponding to the character C (actually equivalent to the hash value of the character string ABC being calculated) may be obtained by calculation according to the hash value B and the unique identifier of the character C, and finally the hash value C may be used as the hash value of the character string.
It should be noted that, in the embodiment of the present application, the unique identification value of a character refers to a value that can uniquely identify one character. For example, it may be an ASCII code value of a character. Also for example, the number may be a number uniquely corresponding to each character written in advance by an engineer.
In another possible example manner of the embodiment of the present application, the hash value of the character string may be further determined according to the hash value of each character in the character string. For example, a hash calculation may be performed once on a hash value of each character of the string, so as to obtain a hash value of the string.
It should be understood that, in the two example manners described above, one alternative way to calculate the hash value of each character in the character string is to use the formula shown in Figure BDA0003491077870000081 (formula image not reproduced in this text), where x denotes the x-th character in the character string, seed is a preset random number, and str[x] is the unique identification value of the x-th character in the character string.
It should be understood that, in the above calculation, seed may be a random number selected in advance by an engineer. Furthermore, seed may be chosen as an odd number greater than 1, which reduces the risk of different character strings producing the same hash value and ensures the reliability of the scheme.
Thus, identical character strings always produce the same fingerprint value, whereas for differing character strings an "avalanche effect" occurs that amplifies even a very small text difference.
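The per-character calculation can be sketched as follows. Because the formula image is not reproduced in this text, the recurrence used here, f(x) = f(x-1) × seed + str[x] with an implicit f before the first character of 0, is an assumption in the style of a common multiplicative string hash; it is one recurrence that fits the description (previous hash plus unique identification value, odd seed greater than 1), not necessarily the formula of the application. The names `char_hashes` and `string_fingerprint` and the seed value 131 are likewise hypothetical.

```python
from typing import List

SEED = 131  # assumed odd constant greater than 1; the source gives no concrete value


def char_hashes(s: str, seed: int = SEED) -> List[int]:
    """Hash value of every character, each derived from the previous character's hash.

    Assumed recurrence: f(x) = f(x-1) * seed + str[x], where str[x] is the
    character's code value (ord), used here as its unique identification value.
    """
    hashes: List[int] = []
    prev = 0  # the first character has no predecessor, so only its own value counts
    for ch in s:
        prev = prev * seed + ord(ch)
        hashes.append(prev)
    return hashes


def string_fingerprint(s: str, seed: int = SEED) -> int:
    """Use the hash of the last character as the fingerprint of the whole string."""
    hashes = char_hashes(s, seed)
    return hashes[-1] if hashes else 0
```

Under this assumed recurrence, changing or reordering any character changes every later per-character hash, so the last value reflects the whole string: `string_fingerprint("ABC")` and `string_fingerprint("ACB")` differ even though both strings contain the same characters.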
It is noted that in practical applications, the first text and the second text are both sets of a large number of characters. And the character strings in the first text and the second text can be obtained by presetting a division rule.
For example, each line of characters in the first text and the second text may be determined as a character string in units of lines, and then step S101 and step S102 may be performed.
Of course, the division may be performed in other manners, such as dividing in units of segments, or dividing in units of every n (n is a preset positive integer greater than 1) words, and the division manner is not limited in the embodiment of the present application.
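The division rules mentioned above can be sketched as simple splitting helpers. The function names are hypothetical, and splitting words on whitespace is an assumption; the application does not define how words are delimited.

```python
from typing import List


def split_by_lines(text: str) -> List[str]:
    """One character string per line of the text."""
    return text.splitlines()


def split_by_words(text: str, n: int) -> List[str]:
    """One character string per group of n whitespace-separated words."""
    if n < 2:
        raise ValueError("n is expected to be a positive integer greater than 1")
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]
```

Whichever rule is chosen, the same rule would have to be applied to both texts so that their fingerprint values are comparable.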
It should be noted that, in practical applications, the hash value calculated for a character string may overflow (i.e., the number of bits required for the hash value exceeds the maximum number of bits supported by the computer), which may make the calculated hash value negative. To solve this problem, the calculated hash value can be ANDed with the maximum positive value representable in the computer's word size, so that the resulting hash value is not negative, which facilitates subsequent use.
For example, on a 32-bit computer system, if the calculated hash value overflows, it may be ANDed with 0x7fffffff to obtain a non-negative number.
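The masking step can be sketched as follows. Python integers do not overflow, so the sketch emulates 32-bit wraparound explicitly; the helper names, the seed value 131 and the multiplicative recurrence are assumptions for illustration only.

```python
MASK_31 = 0x7FFFFFFF  # largest positive value of a signed 32-bit integer


def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def masked_hash(s: str, seed: int = 131) -> int:
    """Assumed multiplicative hash with 32-bit wraparound, masked to be non-negative."""
    h = 0
    for ch in s:
        h = (h * seed + ord(ch)) & 0xFFFFFFFF  # emulate 32-bit overflow
    return h & MASK_31  # clear the sign bit
```

Clearing the sign bit discards one bit of the hash, which slightly raises the chance of two strings sharing a value; the source accepts this in exchange for non-negative results.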
In the above process, the time complexity of traversing the two texts to obtain the fingerprint value of each character string is O(M + N) (where M is the total character string length of the first text and N is the total character string length of the second text), which does not cause a performance bottleneck.
S103: Compare the fingerprint value of the first fingerprint with the fingerprint value of the second fingerprint to obtain the similarity between the second text and the first text.
In this embodiment of the present application, for each second fingerprint in the second text, the fingerprint value of the second fingerprint may be sequentially compared with the fingerprint values of the first fingerprints in the first text, then the number and the positions of the first fingerprints having the fingerprint value of the second fingerprint in the first text are counted, and the similarity between the second text and the first text is determined according to the counted number and positions of the first fingerprints corresponding to the second fingerprints.
For example, a similarity matrix may be constructed from the counted number and positions of the first fingerprints in the first text whose fingerprint values equal that of a second fingerprint, and the similarity matrix may then be processed with a preset similarity algorithm (e.g., Dijkstra's algorithm or a dynamic programming algorithm) to obtain the similarity between the second text and the first text.
For example, suppose the texts to be compared are:
text 1:
AAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEE
text 2:
AAAAAAAAAAAAAAAAAAA
bBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCC
dDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEE
Assume that the calculated array of fingerprint values for text 1 is {1, 2, 3, 4, 5}, and the calculated array of fingerprint values for text 2 is {1, 6, 3, 7, 5}.
For each of the fingerprint values 1, 6, 3, 7 and 5 of text 2, the fingerprint value array of text 1 is searched in turn to check whether the value exists. The result is: fingerprint values 1, 3 and 5 exist, each with a count of 1, at positions 1, 3 and 5 respectively; fingerprint values 6 and 7 do not exist, each with a count of 0. The resulting similarity matrix is:
{1 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 1}
Then, Dijkstra's algorithm is used to calculate the maximum path of the matrix, which is the diagonal "1 0 1 0 1" in the matrix, so the similarity between text 2 and text 1 is calculated to be 3. Dijkstra's algorithm is a general algorithm for calculating an optimal path and is not described further here.
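The worked example can be reproduced with a short sketch. It stores only the non-zero cells of the similarity matrix and, instead of Dijkstra's algorithm, uses a simple dynamic programming pass (the other option the text mentions) to find the longest chain of matches that advances in both texts. The function names are hypothetical, and the quadratic chain search is chosen for brevity rather than speed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def sparse_matches(fp1: List[int], fp2: List[int]) -> List[Tuple[int, int]]:
    """Non-zero cells (row, col) of the similarity matrix.

    row indexes the strings of text 2 and col the strings of text 1 (0-based).
    """
    index: Dict[int, List[int]] = defaultdict(list)
    for col, value in enumerate(fp1):
        index[value].append(col)
    return [(row, col) for row, value in enumerate(fp2) for col in index.get(value, [])]


def longest_chain(cells: List[Tuple[int, int]]) -> int:
    """Length of the longest chain of cells strictly increasing in row and column."""
    cells = sorted(cells)
    best = [1] * len(cells)
    for i, (row, col) in enumerate(cells):
        for j in range(i):
            prev_row, prev_col = cells[j]
            if prev_row < row and prev_col < col:
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)


cells = sparse_matches([1, 2, 3, 4, 5], [1, 6, 3, 7, 5])
similarity = longest_chain(cells)
```

For the fingerprint arrays of the example, `cells` holds the three diagonal matches and `similarity` is 3, which agrees with the result stated above.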
In this embodiment, for convenience of comparison, the fingerprint value of the first fingerprint may be recorded in a fingerprint table, and the fingerprint values of the second fingerprints are stored in an array, so as to traverse the data in the array, search whether the first fingerprint exists in the fingerprint table by using the fingerprint value extracted from the array, and count the number C1 and the position of the first fingerprint that exists and is the same as the fingerprint value. And repeating the process until the array traversal is completed, so that the number and the positions of the first fingerprints corresponding to all the second fingerprints are obtained, and the similarity between the second text and the first text is generated according to the number and the positions of the first fingerprints.
It should be understood that the above manner of obtaining the similarity between the second text and the first text is only one possible implementation manner shown in the embodiment of the present application, and besides, other manners may also be implemented in the embodiment of the present application, and the embodiment of the present application does not limit the specific implementation manner of step S103.
According to the text comparison method provided by the embodiment of the application, the fingerprint values of the character strings in the first text and the second text are obtained by traversing the first text and the second text, and because each fingerprint value depends on one character string, whether the character strings corresponding to the fingerprint values are different can be reflected according to whether the fingerprint values are different. Therefore, the similarity between the second text and the first text can be effectively determined by comparing the fingerprint value of the first fingerprint with the fingerprint value of the second fingerprint.
The scheme of the embodiment of the application only needs to traverse and compare the first text and the second text, so its time complexity is about O(M + N) (M and N are the lengths of the character string arrays of the two texts, and O is the complexity symbol) and its space complexity is O(M + N). In the LD algorithm, the Needleman-Wunsch algorithm, and other algorithms in the related art, both the time complexity and the space complexity are O(M × N). Compared with the related art, the scheme of the embodiment of the application therefore effectively reduces the time complexity and space complexity of the algorithm, which improves operating efficiency and text comparison efficiency and saves computing resources and time overhead.
Example two:
Based on the first embodiment, this embodiment further illustrates the present application by taking a specific implementation process for the similarity comparison of two texts as an example.
As shown in Fig. 2, the method comprises the following steps:
S201: Read the character string in file 1 line by line.
S202: Check the read character string and determine whether file 1 has ended.
For example, if the read character string is empty, it may be considered that the file 1 is completely read, and it may be considered that the file 1 is ended. At which point it may proceed to step S204. If the read character string is not empty, it can be considered that the file 1 is not finished. At this point, the process may proceed to step S203.
S203: For the read character string, calculate the hash value according to the formula shown in Figure BDA0003491077870000111 (formula image not reproduced in this text), add it to the hash table (denoted HashList1) to build the index, and then return to step S201.
Here, x denotes the x-th character in the character string, seed is a preset odd number greater than 1, and str[x] is the ASCII code value of the x-th character in the character string.
S204: the character string in file 2 is read line by line.
S205: the read character string is checked to determine whether the file 2 is finished.
If file 2 is finished, then go to step S207. If file 2 is not finished, go to step S206.
The verification method is consistent with step S202, and is not described herein.
S206: For the read character string, calculate the hash value according to the formula shown in Figure BDA0003491077870000121 (formula image not reproduced in this text) and store it in an array (denoted HashArr2). Then return to step S204.
S207: Read the current data from the array HashArr2.
S208: and judging whether the currently read data is empty or not.
If the result is null, the data in the HashArr2 is read completely, and the step S210 is switched to; if not, go to step S209.
S209: Use the hash value retrieved from HashArr2 to search HashList1 for a matching entry, and record the number C1 and the positions of the matches. Then return to step S207.
S210: Collect the C1 values and positions corresponding to the hash values in HashArr2, and generate the similarity result from them.
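Steps S201 to S210 can be sketched end to end as follows. The hash recurrence, the seed value 131 and all Python names are assumptions (the formula image is not reproduced in this text); `hash_list1` and `hash_arr2` mirror the HashList1 and HashArr2 of the steps, and in-memory text streams stand in for file 1 and file 2.

```python
import io
from typing import Dict, List, TextIO, Tuple

SEED = 131         # assumed odd seed greater than 1
MASK = 0x7FFFFFFF  # keeps every hash value non-negative


def line_hash(line: str) -> int:
    """Assumed multiplicative hash of one line, masked at every step."""
    h = 0
    for ch in line:
        h = (h * SEED + ord(ch)) & MASK
    return h


def compare_files(file1: TextIO, file2: TextIO) -> List[Tuple[int, int, List[int]]]:
    """Return (hash, C1, positions in file 1) for every line of file 2."""
    # S201-S203: read file 1 line by line and index each line hash in HashList1.
    # Iteration stops at end of file, which plays the role of the S202 check.
    hash_list1: Dict[int, List[int]] = {}
    for line_no, line in enumerate(file1, start=1):
        hash_list1.setdefault(line_hash(line.rstrip("\n")), []).append(line_no)
    # S204-S206: read file 2 line by line and store each line hash in HashArr2.
    hash_arr2 = [line_hash(line.rstrip("\n")) for line in file2]
    # S207-S210: look up every HashArr2 entry and collect its count C1 and positions.
    results = []
    for h in hash_arr2:
        positions = hash_list1.get(h, [])
        results.append((h, len(positions), positions))
    return results


first = io.StringIO("AAA\nBBB\nCCC\nDDD\nEEE\n")
second = io.StringIO("AAA\nbBB\nCCC\ndDD\nEEE\n")
result = compare_files(first, second)
```

Each line is hashed once and each lookup in the dictionary takes expected constant time, which is what keeps the whole pass linear in the combined size of the two files. Turning `result` into a single similarity score is left open here, as it is in step S210.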
By the scheme, the time complexity and the space complexity of the algorithm can be controlled to be O (M + N), so that the operation efficiency is improved, the text comparison efficiency is improved, the calculation resources and the time overhead are saved, and the working efficiency is improved. See also the following table, which provides the results of comparative testing with the Needleman-Wunsch algorithm provided herein:
[Table images BDA0003491077870000122 and BDA0003491077870000131: comparison test results, not reproduced in this text]
As can be seen from the above table, the scheme of the embodiment of the application reduces time consumption by about 90% on average. The improvement is more pronounced for large text documents, where time consumption is reduced by nearly 99%.
Example three:
Based on the same inventive concept, the embodiment of the present application further provides a text comparison apparatus 300. FIG. 3 illustrates a text comparison apparatus that uses the method shown in FIG. 1. It should be understood that the specific functions of the apparatus 300 can be found in the description above, and the detailed description is omitted here as appropriate to avoid redundancy. The apparatus 300 includes at least one software functional module that can be stored in a memory in the form of software or firmware or solidified in an operating system of the apparatus 300. Specifically:
Referring to FIG. 3, the apparatus 300 includes an acquisition module 301 and a comparison module 302, wherein:
the acquisition module 301 is configured to traverse a first text to obtain a fingerprint value of a first fingerprint of each character string in the first text, and to traverse a second text to obtain a fingerprint value of a second fingerprint of each character string in the second text;
the comparison module 302 is configured to compare the fingerprint value of the first fingerprint with the fingerprint value of the second fingerprint to obtain a similarity between the second text and the first text;
the fingerprint value is a value representing the content and the structural characteristics of the character string.
In a possible implementation manner of the embodiment of the present application, the fingerprint value of the first fingerprint is a hash value of a character string in the first text; and the fingerprint value of the second fingerprint is the hash value of the character string in the second text.
In an optional example scenario of this possible implementation, the hash value of a character string in the first text is the hash value calculated for the last character of that character string; likewise, the hash value of a character string in the second text is the hash value calculated for the last character of that character string. The hash value of each character in a character string is a value calculated from the hash value of the previous character and the unique identification value of the character.
In this alternative example scenario, the hash value of each character in the character string is the value of f(x) calculated according to the formula (formula image BDA0003491077870000141, not reproduced in this text), where x denotes the x-th character in the character string, seed is a preset random number, and str[x] is the unique identification value of the x-th character in the character string.
In the embodiment of the present application, the unique identification value is an ASCII code value of a character.
In this embodiment of the present application, each character string in the first text is a character string formed by characters corresponding to each line in the first text; each character string in the second text is a character string formed by characters corresponding to each line in the second text.
In this embodiment, the comparison module 302 is specifically configured to, for each second fingerprint in the second text, sequentially compare the fingerprint value of the second fingerprint with the fingerprint values of the first fingerprints in the first text, count the number and positions of the first fingerprints having the fingerprint value of the second fingerprint in the first text, and determine the similarity between the second text and the first text according to the counted number and positions of the first fingerprints corresponding to the second fingerprints.
In this embodiment of the application, the comparison module 302 is specifically configured to construct a similarity matrix according to the counted number and positions of the first fingerprints, and to process the similarity matrix with a preset similarity algorithm to obtain the similarity between the second text and the first text.
It should be understood that, for the sake of brevity, the contents described in some embodiments are not repeated in this embodiment.
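The division into an acquisition module and a comparison module can be sketched as two small classes. Everything below is an assumption made for illustration: the class and method names are hypothetical, the hash recurrence f(x) = f(x - 1) * seed + str[x] is assumed because the formula image is unavailable, and neither the layout of the similarity matrix nor the "preset similarity algorithm" is specified in the text. Here the matrix has one row per second-text line and one column per first-text line, and the similarity is the share of rows containing at least one match.

```python
from typing import List


class AcquisitionModule:
    """Stand-in for module 301: one fingerprint value per line of a text."""

    def __init__(self, seed: int = 31) -> None:
        self.seed = seed  # hypothetical seed value

    def fingerprints(self, text: str) -> List[int]:
        values = []
        for line in text.splitlines():
            h = 0
            for ch in line:
                # Assumed recurrence: previous hash * seed + character code.
                h = h * self.seed + ord(ch)
            values.append(h)
        return values


class ComparisonModule:
    """Stand-in for module 302: matrix construction and similarity."""

    def similarity_matrix(self, first: List[int],
                          second: List[int]) -> List[List[int]]:
        # Entry [i][j] is 1 when line i of the second text has the same
        # fingerprint value as line j of the first text.
        return [[1 if f2 == f1 else 0 for f1 in first] for f2 in second]

    def similarity(self, first: List[int], second: List[int]) -> float:
        matrix = self.similarity_matrix(first, second)
        if not matrix:
            return 0.0
        matched_rows = sum(1 for row in matrix if any(row))
        return matched_rows / len(matrix)


if __name__ == "__main__":
    acquisition = AcquisitionModule()
    comparison = ComparisonModule()
    first = acquisition.fingerprints("x\ny\nx")
    second = acquisition.fingerprints("x\nz")
    print(comparison.similarity_matrix(first, second))
    print(comparison.similarity(first, second))
```

A dense matrix like this costs time and memory proportional to the product of the two line counts, so it is shown only to make the data layout concrete; a hash-table lookup that records counts and positions avoids materializing it.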
Example four:
The present embodiment provides an electronic device which, as shown in FIG. 4, includes a processor 401, a memory 402, and a communication bus 403, wherein:
the communication bus 403 is used to enable connection and communication between the processor 401 and the memory 402;
the processor 401 is configured to execute one or more programs stored in the memory 402 to implement the text comparison method described in the first embodiment and/or the second embodiment.
It will be appreciated that the configuration shown in fig. 4 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 4 or have a different configuration than shown in fig. 4. For example, the electronic device may further have input and output means such as a mouse, a keyboard, and the like, may further have display means such as a display screen, and may further have external communication means such as an antenna, a USB bus, and the like.
It should be understood that the electronic device described in the embodiments of the present application may be a device such as a computer, a server, or the like, which has data comparison processing capability.
The present embodiment further provides a computer-readable storage medium, such as a floppy disk, an optical disk, a hard disk, a flash memory, a USB (Universal Serial Bus) card, or an MMC (Multimedia Card) card, in which one or more programs for implementing the above steps are stored. The one or more programs can be executed by one or more processors to implement the steps of the text comparison method executed by the service distribution device in the first embodiment and/or the second embodiment, or the steps of the text comparison method executed by the processing node in the first embodiment and/or the second embodiment. Details are not repeated here.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
In this context, a plurality means two or more.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (11)

1. A method of comparing text, comprising:
traversing a first text to obtain a fingerprint value of a first fingerprint of each character string in the first text;
traversing a second text to obtain a fingerprint value of a second fingerprint of each character string in the second text;
comparing the fingerprint value of the first fingerprint with the fingerprint value of the second fingerprint to obtain the similarity between the second text and the first text;
the fingerprint value is a value representing the content and the structural characteristics of the character string.
2. The text comparison method of claim 1, wherein the fingerprint value of the first fingerprint is a hash value of a character string in the first text; and the fingerprint value of the second fingerprint is the hash value of the character string in the second text.
3. The text comparison method of claim 2,
the hash value of a character string in the first text is the hash value calculated for the last character of that character string;
the hash value of a character string in the second text is the hash value calculated for the last character of that character string;
the hash value of each character in the character string is a value calculated according to the hash value of the previous character and the unique identification value of the character.
4. The text comparison method of claim 3, wherein the hash value of each character in the character string is the value of f(x) calculated according to the formula
[formula image FDA0003491077860000011, not reproduced in this text]
wherein x represents the x-th character in the character string, seed is a preset constant, and str[x] is the unique identification value of the x-th character in the character string.
5. The text comparison method of claim 4, wherein the unique identification value is an ASCII code value of a character.
6. The text comparison method of any one of claims 1-5,
each character string in the first text is a character string formed by characters corresponding to each line in the first text;
each character string in the second text is a character string formed by characters corresponding to each line in the second text.
7. The text comparison method of any one of claims 1-5, wherein comparing the fingerprint value of the first fingerprint to the fingerprint value of the second fingerprint to obtain a similarity between the second text and the first text comprises:
for each second fingerprint in a second text, sequentially comparing the fingerprint value of the second fingerprint with the fingerprint value of each first fingerprint in the first text;
counting the number and the positions of first fingerprints with the fingerprint values of the second fingerprints in the first text;
and determining the similarity between the second text and the first text according to the counted number and position of the first fingerprints.
8. The text comparison method of claim 7, wherein determining the similarity between the second text and the first text according to the counted number and positions of the first fingerprints comprises:
constructing a similarity matrix according to the number and the positions of the first fingerprints obtained by statistics;
and calculating the similarity matrix by adopting a preset similarity algorithm to obtain the similarity of the second text and the first text.
9. A text comparison apparatus, comprising:
the acquisition module is used for traversing a first text to obtain a fingerprint value of a first fingerprint of each character string in the first text, and for traversing a second text to obtain a fingerprint value of a second fingerprint of each character string in the second text;
the comparison module is used for comparing the fingerprint value of the first fingerprint with the fingerprint value of the second fingerprint to obtain the similarity between the second text and the first text;
the fingerprint value is a value representing the content and the structural characteristics of the character string.
10. An electronic device, comprising: a processor, a memory, and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute a program stored in the memory to implement the text comparison method of any one of claims 1 to 8.
11. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the text comparison method of any one of claims 1 to 8.
Priority Applications (1)

Application Number: CN202210096674.0A
Publication: CN114492369A (en)
Priority Date: 2022-01-26
Filing Date: 2022-01-26
Title: Text comparison method and device, electronic equipment and computer readable storage medium
Status: Pending

Publications (1)

Publication Number: CN114492369A
Publication Date: 2022-05-13

Family

ID: 81475724

Country Status (1)

Country: CN
Link: CN114492369A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100032 NO.332, 3rd floor, Building 102, 28 xinjiekouwai street, Xicheng District, Beijing

Applicant after: Qianxin Technology Group Co.,Ltd.

Applicant after: Qianxin Wangshen information technology (Beijing) Co.,Ltd.

Address before: 100032 NO.332, 3rd floor, Building 102, 28 xinjiekouwai street, Xicheng District, Beijing

Applicant before: Qianxin Technology Group Co.,Ltd.

Applicant before: LEGENDSEC INFORMATION TECHNOLOGY (BEIJING) Inc.