CN118095247A - Text deduplication method, apparatus, device, storage medium, and program product - Google Patents
- Publication number
- CN118095247A (CN202410224044.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- processed
- hash value
- fuzzy hash
- deduplication
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Library & Information Science (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure provides a text deduplication method, relates to the technical field of artificial intelligence, and can be applied to the technical field of finance. The method comprises the following steps: acquiring a text to be processed in response to a data deduplication service request; calculating a fuzzy hash value of the text to be processed; querying a database for text data similar to the text to be processed according to the fuzzy hash value, wherein the database stores description information, summary information and fuzzy hash values of historical texts; if text data that duplicates the text to be processed exists, marking the state of the text to be processed as duplicate; and performing a deduplication operation on the text whose state is marked as duplicate. The present disclosure also provides a text deduplication apparatus, a device, a storage medium, and a program product.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to a text deduplication method, apparatus, device, storage medium, and program product.
Background
In the technical field of large models, training a large model requires a large amount of data, and high-quality data improves the training effect. Corpora are currently obtained by downloading public datasets, crawling web page data, collecting internal data, and the like. However, once all of this data is gathered together, duplicates inevitably appear, and duplicate data has a certain influence on large model training, so it needs to be removed when the data is prepared. In the related art, duplicates are generally removed by directly comparing text content for equality, and few better text deduplication methods are available. Direct comparison of text content is inefficient, cannot be persisted, and cannot be used to deduplicate a large amount of text.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
In view of the foregoing, the present disclosure provides an efficient text deduplication method, apparatus, device, storage medium, and program product.
According to a first aspect of the present disclosure, there is provided a text deduplication method, the method comprising:
acquiring a text to be processed in response to a data deduplication service request;
calculating a fuzzy hash value of the text to be processed;
querying a database for text data similar to the text to be processed according to the fuzzy hash value, wherein the database stores description information, summary information and fuzzy hash values of historical texts;
if text data that duplicates the text to be processed exists, marking the state of the text to be processed as duplicate; and
performing a deduplication operation on the text whose state is marked as duplicate.
According to an embodiment of the present disclosure, calculating the fuzzy hash value of the text to be processed includes:
performing a blocking operation on the text to be processed;
calculating the hash value of each block separately; and
generating the fuzzy hash value of the text to be processed according to the hash value of each block.
According to an embodiment of the present disclosure, performing the blocking operation on the text to be processed includes:
determining a fragmentation condition value according to the length of the text to be processed and the actual content of the text to be processed; and
performing the blocking operation on the text to be processed according to the fragmentation condition value.
According to an embodiment of the disclosure, querying the database for text data similar to the text to be processed according to the fuzzy hash value includes:
calculating the similarity between the text to be processed and the historical texts in the database according to the fuzzy hash value; and
if the similarity is greater than a first preset threshold, determining that text data similar to the text to be processed exists.
According to an embodiment of the present disclosure, calculating the similarity between the text to be processed and the historical texts in the database according to the fuzzy hash value includes:
searching the database for target hash values whose text length difference is less than or equal to a second preset threshold; and
among the target hash values, comparing the similarity of the hash values using the Hamming distance to determine texts whose difference is less than or equal to a third preset threshold.
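The two-stage comparison described above (a length pre-filter followed by a Hamming-distance check) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `HistoryRecord` type, the function names and the threshold parameters are hypothetical, and because the Hamming distance is strictly defined only for equal-length strings, the sketch counts any length gap between two hash strings as additional differing positions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HistoryRecord:
    """Hypothetical database row: length of the original text and its fuzzy hash."""
    text_length: int
    fuzzy_hash: str


def hamming_distance(a: str, b: str) -> int:
    """Count differing positions; a length gap counts as extra differences (assumption)."""
    differing = sum(ca != cb for ca, cb in zip(a, b))
    return differing + abs(len(a) - len(b))


def find_similar(query_length: int, query_hash: str, history: List[HistoryRecord],
                 max_length_diff: int, max_distance: int) -> List[HistoryRecord]:
    # Stage 1: keep only records whose text length is close enough
    # (the "second preset threshold" in the text).
    candidates = [r for r in history
                  if abs(r.text_length - query_length) <= max_length_diff]
    # Stage 2: among those, keep records whose hash differs in few positions
    # (the "third preset threshold" in the text).
    return [r for r in candidates
            if hamming_distance(query_hash, r.fuzzy_hash) <= max_distance]
```

The length pre-filter is cheap and shrinks the candidate set before the character-by-character comparison is run.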
According to an embodiment of the present disclosure, the method further comprises:
if no text data that duplicates the text to be processed exists, marking the state of the text to be processed as non-duplicate.
According to an embodiment of the present disclosure, the method further comprises:
storing, in the database, the description information, summary information and fuzzy hash value of the text to be processed whose state is marked as non-duplicate.
A second aspect of the present disclosure provides a text deduplication apparatus, the apparatus comprising:
an acquisition module, configured to acquire a text to be processed in response to a data deduplication service request;
a computing module, configured to calculate a fuzzy hash value of the text to be processed;
a query module, configured to query a database for text data similar to the text to be processed according to the fuzzy hash value, wherein the database stores description information, summary information and fuzzy hash values of historical texts;
a first determining module, configured to mark the state of the text to be processed as duplicate if it is determined that text data that duplicates the text to be processed exists; and
a deduplication module, configured to perform a deduplication operation on the text whose state is marked as duplicate.
According to an embodiment of the present disclosure, the computing module includes a blocking sub-module, a first computing sub-module and a generation sub-module.
The blocking sub-module is configured to perform a blocking operation on the text to be processed;
the first computing sub-module is configured to calculate the hash value of each block separately; and
the generation sub-module is configured to generate the fuzzy hash value of the text to be processed according to the hash value of each block.
According to an embodiment of the disclosure, the blocking sub-module includes a first determining unit and a blocking unit.
The first determining unit is configured to determine a fragmentation condition value according to the length of the text to be processed and the actual content of the text to be processed; and
the blocking unit is configured to perform the blocking operation on the text to be processed according to the fragmentation condition value.
According to an embodiment of the present disclosure, the query module includes a second computing sub-module and a first determining sub-module.
The second computing sub-module is configured to calculate the similarity between the text to be processed and the historical texts in the database according to the fuzzy hash value; and
the first determining sub-module is configured to determine that text data similar to the text to be processed exists if the similarity is greater than a first preset threshold.
According to an embodiment of the present disclosure, the second computing sub-module includes a search unit and a second determining unit.
The search unit is configured to search the database for target hash values whose text length difference is less than or equal to a second preset threshold; and
the second determining unit is configured to compare, among the target hash values, the similarity of the hash values using the Hamming distance to determine texts whose difference is less than or equal to a third preset threshold.
According to an embodiment of the present disclosure, the apparatus further comprises a second determining module.
The second determining module is configured to mark the state of the text to be processed as non-duplicate if it is determined that no text data that duplicates the text to be processed exists.
According to an embodiment of the present disclosure, the apparatus further comprises a storage module.
The storage module is configured to store, in the database, the description information, summary information and fuzzy hash value of the text to be processed whose state is marked as non-duplicate.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the text deduplication method described above.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described text deduplication method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above text deduplication method.
According to the text deduplication method provided by the embodiment of the disclosure, a text to be processed is acquired in response to a data deduplication service request; a fuzzy hash value of the text to be processed is calculated; a database is queried for text data similar to the text to be processed according to the fuzzy hash value; whether the text to be processed is a duplicate is determined through a fuzzy hash algorithm, and if text data that duplicates the text to be processed exists, the state of the text to be processed is marked as duplicate; and a deduplication operation is performed on the text whose state is marked as duplicate. Compared with the related art, the fuzzy hash algorithm provided by the embodiment of the disclosure can efficiently and quickly identify texts that differ only by partial changes, remove duplicate text, and rapidly deliver high-quality text.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a text deduplication method, apparatus, device, storage medium, and program product according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a system architecture diagram of a text deduplication device in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a text deduplication method provided in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates one of the flowcharts of the fuzzy hash value calculation method provided in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a second flowchart of a fuzzy hash value calculation method provided in accordance with an embodiment of the present disclosure;
FIG. 6 schematically illustrates one of the flowcharts of a similar text data query method provided in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a second flowchart of a similar text data query method provided in accordance with an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of a text deduplication method provided in accordance with another embodiment of the present disclosure;
Fig. 9 schematically illustrates a block diagram of a text deduplication apparatus according to an embodiment of the present disclosure; and
Fig. 10 schematically illustrates a block diagram of an electronic device adapted to implement a text deduplication method in accordance with an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B and C, etc." is used, such a convention should in general be interpreted in the sense in which one of ordinary skill in the art would generally understand it (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
The terms appearing in the embodiments of the present disclosure will first be explained:
Fuzzy hash algorithm: the fuzzy hash algorithm (Context Triggered Piecewise Hashing, CTPH), also called a content-segmentation-based hash algorithm, is mainly used for similarity comparison of files.
Existing data deduplication methods generally remove duplicates by directly comparing text content for equality, and few better text deduplication methods exist. Direct comparison of text content is inefficient and cannot be persisted, so it clearly cannot meet the requirement of deduplicating a large amount of text.
Based on the technical problems described above, embodiments of the present disclosure provide a text deduplication method, which includes: acquiring a text to be processed in response to a data deduplication service request; calculating a fuzzy hash value of the text to be processed; querying a database for text data similar to the text to be processed according to the fuzzy hash value, wherein the database stores description information, summary information and fuzzy hash values of historical texts; if text data that duplicates the text to be processed exists, marking the state of the text to be processed as duplicate; and performing a deduplication operation on the text whose state is marked as duplicate.
Fig. 1 schematically illustrates an application scenario diagram of a text deduplication method, apparatus, device, storage medium, and program product according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a text deduplication scenario. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a backend server, which may perform the text deduplication method provided by the embodiments of the present disclosure: acquiring a text to be processed in response to a data deduplication service request; calculating a fuzzy hash value of the text to be processed; querying a database for text data similar to the text to be processed according to the fuzzy hash value, wherein the database stores description information, summary information and fuzzy hash values of historical texts; if text data that duplicates the text to be processed exists, marking the state of the text to be processed as duplicate; and performing a deduplication operation on the text whose state is marked as duplicate.
It should be noted that the text deduplication method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the text deduplication apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The text deduplication method provided by the embodiments of the present disclosure may also be performed by a server or cluster of servers that are different from server 105 and capable of communicating with terminal devices 101, 102, 103 and/or server 105. Accordingly, the text deduplication apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that, the text deduplication method and device determined by the embodiments of the present disclosure may be used in the internet technical field, the financial technical field, and any field other than the financial field, and the application field of the text deduplication method and device determined by the embodiments of the present disclosure is not limited.
Fig. 2 schematically illustrates a system architecture diagram of a text deduplication device in accordance with an embodiment of the present disclosure. As shown in fig. 2, the data deduplication service architecture provided by the embodiment of the present disclosure includes a storage module and a screening module. The screening module calculates a fuzzy hash value of the text and passes it to the storage module, which performs a lookup by the fuzzy hash value. If the fuzzy hash value is found, the text is judged to already exist and is a duplicate text. If it is not found, the text is not a duplicate, and the storage module stores the fuzzy hash value for subsequent text deduplication.
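The interaction between the screening module and the storage module can be sketched as below. This is a minimal sketch rather than the disclosed implementation: the class and method names are hypothetical, the storage is an in-memory dictionary standing in for the database, the "summary" is simply the first 50 characters of the text, and `placeholder_hash` is a whitespace-normalised SHA-256 digest used only as a stand-in so the example runs; it is not a fuzzy hash.

```python
import hashlib
from typing import Callable, Dict


class StorageModule:
    """Hypothetical in-memory stand-in for the database of historical texts."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def contains(self, fuzzy_hash: str) -> bool:
        return fuzzy_hash in self._records

    def save(self, fuzzy_hash: str, description: str, summary: str) -> None:
        self._records[fuzzy_hash] = {"description": description, "summary": summary}


def placeholder_hash(text: str) -> str:
    # Stand-in only (NOT a fuzzy hash): SHA-256 of the whitespace-normalised text.
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class ScreeningModule:
    """Computes the hash of a text and asks the storage module whether it is known."""

    def __init__(self, storage: StorageModule,
                 hash_fn: Callable[[str], str] = placeholder_hash) -> None:
        self._storage = storage
        self._hash_fn = hash_fn

    def screen(self, text: str, description: str = "") -> str:
        digest = self._hash_fn(text)
        if self._storage.contains(digest):
            return "duplicate"
        # Unknown text: persist its record so later requests can be checked against it.
        self._storage.save(digest, description, summary=text[:50])
        return "non-duplicate"
```

Passing the hash function in as a parameter keeps the sketch independent of any particular fuzzy hash implementation.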
The text deduplication method of the embodiments of the present disclosure will be described in detail below by fig. 3 to 6 based on the application scenario described in fig. 1 and the architecture described in fig. 2.
Fig. 3 schematically illustrates a flowchart of a text deduplication method according to an embodiment of the present disclosure. As shown in fig. 3, the text deduplication method of the embodiment includes operations S210 to S250, which may be performed by a server or other computing device.
In operation S210, a text to be processed is acquired in response to the data deduplication service request.
In operation S220, a fuzzy hash value of the text to be processed is calculated.
In operation S230, text data similar to the text to be processed is queried in a database according to the fuzzy hash value.
According to an embodiment of the present disclosure, the database stores descriptive information, summary information, and fuzzy hash values of the history text.
In one example, in order to read and process text corpora efficiently and quickly, an embodiment of the present disclosure provides a text deduplication system that exposes text deduplication as an external service. The text deduplication system comprises two parts, namely a storage module and a screening module. The screening module calculates a fuzzy hash value of the text and passes it to the storage module, which performs a lookup by the fuzzy hash value. If the fuzzy hash value is found, the text is judged to already exist and is a duplicate text. If it is not found, the text is not a duplicate, and the storage module stores the fuzzy hash value for subsequent text deduplication. In response to the data deduplication service request, the text to be processed is acquired, its fuzzy hash value is calculated using the ssdeep algorithm, and that value is compared with the fuzzy hash values of the historical texts in the database to determine whether a text similar to the text to be processed exists in the database. The database stores description information, summary information and fuzzy hash values of the historical texts.
In operation S240, if it is determined that there is text data that is repeated with the text to be processed, the state of the text to be processed is marked as repeated.
In operation S250, a deduplication operation is performed on text whose status is marked as duplicate.
In one example, after it is determined that text data that duplicates the text to be processed exists, the state of the text to be processed is marked as duplicate, and a deduplication operation is performed on the text whose state is marked as duplicate.
According to the text deduplication method provided by the embodiment of the disclosure, a text to be processed is acquired in response to a data deduplication service request; a fuzzy hash value of the text to be processed is calculated; a database is queried for text data similar to the text to be processed according to the fuzzy hash value; whether the text to be processed is a duplicate is determined through a fuzzy hash algorithm, and if text data that duplicates the text to be processed exists, the state of the text to be processed is marked as duplicate; and a deduplication operation is performed on the text whose state is marked as duplicate. Compared with the related art, the fuzzy hash algorithm provided by the embodiment of the disclosure can efficiently and quickly identify texts that differ only by partial changes, remove duplicate text, and rapidly deliver high-quality text.
Fig. 4 schematically illustrates one of flowcharts of a fuzzy hash value calculation method provided according to an embodiment of the present disclosure. Fig. 5 schematically illustrates a second flowchart of a fuzzy hash value calculation method provided according to an embodiment of the present disclosure.
As shown in fig. 4, operation S220 includes operations S221 to S223.
In operation S221, a blocking operation is performed on the text to be processed.
As shown in fig. 5, operation S221 includes operation S2211 and operation S2212.
In operation S2211, a fragmentation condition value is determined according to the length of the text to be processed and the actual content of the text to be processed.
In operation S2212, a blocking operation is performed on the text to be processed according to the fragmentation condition value.
In one example, when calculating the fuzzy hash value, the text to be processed is first fragmented using a weak hash algorithm and a fragmentation condition value. Specifically, a portion of the content is read from the file and passed to the weak hash algorithm to compute a hash value. Fixed-length content is typically read byte by byte: a fixed-size window slides over the file, much like a sliding window in a network protocol, and the hash of the content inside the window is computed at each position. For convenience, a rolling hash algorithm is generally adopted. With a rolling hash, if the hash value of abcdef has already been calculated as h1, the hash value of bcdefg does not need to be recomputed from scratch; it is enough to compute h1 - X(a) + Y(g), where X and Y are two functions that remove the contribution of the outgoing character and add the contribution of the incoming character. This greatly speeds up the fragmentation decision. In ssdeep, a rolling hash based on the Adler-32 algorithm is used as the weak hash. In addition to the weak hash algorithm, a fragmentation condition value n is required to control where fragments are cut; in ssdeep, n is determined by the length of the text to be processed and the actual content of the text to be processed.
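The rolling update h1 - X(a) + Y(g) and the use of a fragmentation condition value n can be illustrated with a deliberately simple sketch. It assumes that the weak hash is a plain sum of the bytes in the window (so X and Y are both the byte value itself) and that a fragment ends whenever the rolling value modulo n equals n - 1. Both choices, along with the function names and default parameters, are assumptions made for illustration; this is not the Adler-32-based hash used by ssdeep.

```python
from typing import List


def rolling_sums(data: bytes, window: int) -> List[int]:
    """Weak hash (byte sum) of the window ending at each position, updated incrementally."""
    sums = []
    h = 0
    for i, byte in enumerate(data):
        h += byte                    # Y(incoming byte): add its contribution
        if i >= window:
            h -= data[i - window]    # X(outgoing byte): remove its contribution
        sums.append(h)
    return sums


def split_by_trigger(data: bytes, window: int = 7, n: int = 64) -> List[bytes]:
    """Cut a fragment after every position where the rolling hash hits the trigger value."""
    chunks = []
    start = 0
    for i, h in enumerate(rolling_sums(data, window)):
        if h % n == n - 1:           # fragmentation condition (assumed form)
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):            # keep the trailing remainder as the last fragment
        chunks.append(data[start:])
    return chunks
```

Because each cut depends only on the bytes inside the window, a local edit tends to move only nearby fragment boundaries, which is what lets the per-fragment hashes of two similar texts largely agree.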
In operation S222, a hash value of each chunk is calculated, respectively.
In operation S223, a fuzzy hash value of the text to be processed is generated according to the hash value of each block.
In one example, after the file is partitioned, a hash value is calculated separately for each fragment. A conventional hash algorithm such as MD5 may be used; ssdeep uses a Fowler-Noll-Vo (FNV) hash algorithm. After a hash value has been calculated for each file fragment, the hash values are concatenated to obtain the fuzzy hash value of the text to be processed. The fragmentation condition value may differ between files and may also be incorporated into the fuzzy hash value. Before the hash values are concatenated, each of them can be passed through a compression mapping; for example, in ssdeep only the lowest 6 bits of the FNV hash result are kept and represented as an ASCII character, which serves as the final hash result of the fragment. Because compression mapping loses some accuracy and can introduce false positives, it is optional.
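The per-fragment hashing and compression mapping could look roughly like the following sketch. It assumes the standard 32-bit FNV-1a variant and a Base64-style alphabet for turning the lowest 6 bits into one ASCII character; ssdeep's actual FNV variant and output format may differ, and the function names and the `n:characters` layout are illustrative only.

```python
import string
from typing import Iterable

# 64-character alphabet, so each 6-bit value maps to exactly one ASCII character.
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash (assumed variant for this sketch)."""
    h = 0x811C9DC5                           # FNV-1a 32-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF    # multiply by the FNV prime, keep 32 bits
    return h


def fragment_char(fragment: bytes) -> str:
    """Compression mapping: keep the lowest 6 bits and render them as one character."""
    return B64_ALPHABET[fnv1a_32(fragment) & 0x3F]


def fuzzy_signature(fragments: Iterable[bytes], n: int) -> str:
    """Concatenate the per-fragment characters, prefixed with the fragmentation value n."""
    return f"{n}:" + "".join(fragment_char(f) for f in fragments)
```

Keeping only 6 bits per fragment makes the signature short and easy to compare as a string, at the cost of unrelated fragments occasionally mapping to the same character.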
Fig. 6 schematically illustrates one of the flowcharts of the similar text data query method provided in accordance with an embodiment of the present disclosure. Fig. 7 schematically illustrates a second flowchart of a similar text data query method provided in accordance with an embodiment of the present disclosure. Fig. 8 schematically illustrates a flow chart of a text deduplication method provided in accordance with another embodiment of the present disclosure.
As shown in fig. 6, operation S230 includes operations S310 to S320.
In operation S310, a similarity between the text to be processed and the historical text in the database is calculated according to the fuzzy hash value.
In one example, after the fuzzy hash value of the text to be processed is obtained, the database is searched according to the fuzzy hash value for historical text similar to the text to be processed, in order to determine whether the text to be processed is repeated text. ssdeep takes the following approach. Because ssdeep represents each block hash as an ASCII character, the resulting fuzzy hash value of a file is a character string. Let S1 be the fuzzy hash value of the text to be processed and S2 be the fuzzy hash value of a historical text stored in the database; the weighted edit distance from S1 to S2 is used as the basis for determining similarity. The weighted edit distance is obtained by determining how many operations (insertion, deletion, modification, and exchange) are needed to turn S1 into S2, assigning a weight to each type of operation, and summing the weighted results. The distance is then divided by the sum of the lengths of S1 and S2, which converts the absolute result into a relative one, and the relative result is mapped to an integer from 0 to 100, where 100 means the two character strings are completely identical and 0 means they are completely dissimilar. The resulting similarity score is used to judge whether the two files are similar.
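A minimal sketch of the weighted edit distance and its mapping to a 0-100 score is given below. The four weights are assumptions chosen for illustration (the disclosure does not state them), and the function names are hypothetical. The exchange operation is modelled as a swap of two adjacent characters.

```python
# Assumed operation weights; the disclosure does not specify the actual values.
INSERT, DELETE, CHANGE, SWAP = 1, 1, 2, 1


def weighted_edit_distance(s1: str, s2: str) -> int:
    """Dynamic-programming edit distance with per-operation weights."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * DELETE
    for j in range(1, cols):
        d[0][j] = j * INSERT
    for i in range(1, rows):
        for j in range(1, cols):
            same = s1[i - 1] == s2[j - 1]
            best = min(
                d[i - 1][j] + DELETE,
                d[i][j - 1] + INSERT,
                d[i - 1][j - 1] + (0 if same else CHANGE),
            )
            # Exchange of two adjacent characters.
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                best = min(best, d[i - 2][j - 2] + SWAP)
            d[i][j] = best
    return d[-1][-1]


def similarity_score(s1: str, s2: str) -> int:
    """Map the distance, relative to len(s1) + len(s2), to an integer 0..100."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100
    return round(100 * (1 - weighted_edit_distance(s1, s2) / total))
```

With these weights the distance never exceeds `len(s1) + len(s2)`, so the score stays within 0 to 100; a different choice of weights would need a matching normalisation.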
As shown in fig. 7, operation S310 includes operations S311 to S312.
In operation S311, a target hash value having a text length difference value less than or equal to a second preset threshold value is searched for in the database.
In operation S312, in the target hash value, the similarity of the hash values is compared using the hamming distance to determine texts having differences less than or equal to a third preset threshold.
In one example, text similarity is calculated by the ssdeep algorithm: similar texts generate similar hash values, so the similarity of two texts can be calculated from their hash values, and the texts are considered similar when the similarity meets a certain threshold. In this disclosure, both the text length difference and the hash value (which contains length information) are generally used, and two documents are considered similar if the difference is less than 5%. Specifically, among all stored ssdeep hash values, target hash values whose text length difference is less than or equal to a second preset threshold are searched for; the second preset threshold may be, for example, 5%. Then, among all hash values whose text length difference is less than or equal to 5%, the similarity of the hash values is compared by using the Hamming distance to determine texts whose difference is less than or equal to a third preset threshold; the third preset threshold may be, for example, 5%.
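The two-stage screening (length filter first, Hamming distance second) might look like the sketch below. The `Record` type, the function names, and the treatment of hash strings of unequal length (extra characters counted as mismatches) are assumptions made for illustration; the 5% values are the example thresholds given in the text.

```python
from dataclasses import dataclass


@dataclass
class Record:
    text_length: int  # length of the original text
    fuzzy_hash: str   # fuzzy hash string stored for that text


def hamming(a: str, b: str) -> int:
    """Count differing positions; surplus characters count as differences (assumed)."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def find_similar(query: Record, records: list[Record],
                 length_tol: float = 0.05, hash_tol: float = 0.05) -> list[Record]:
    """Return stored records that pass both the length and the Hamming threshold."""
    matches = []
    for record in records:
        longest = max(query.text_length, record.text_length)
        # Stage 1: relative text length difference (second preset threshold).
        if longest and abs(query.text_length - record.text_length) / longest > length_tol:
            continue
        # Stage 2: relative Hamming distance (third preset threshold).
        width = max(len(query.fuzzy_hash), len(record.fuzzy_hash))
        if width == 0 or hamming(query.fuzzy_hash, record.fuzzy_hash) / width <= hash_tol:
            matches.append(record)
    return matches
```

Running the cheap length comparison first keeps the character-by-character comparison limited to a small set of candidates.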
In operation S320, if the similarity is greater than a first preset threshold, it is determined that there is text data similar to the text to be processed.
In one example, if similar text data is found, the text to be processed is characterized as repeated text, its state is marked as repeated, and subsequent text deduplication processing is performed.
As shown in fig. 8, after operation S240, operations S410 and S420 are also included.
In operation S410, if it is determined that there is no text data that is repeated with the text to be processed, the state of the text to be processed is marked as not repeated.
In operation S420, the description information, abstract information, and fuzzy hash value of the text to be processed whose state is marked as not repeated are stored in the database.
In one example, if no text data similar to the text to be processed is found in the database, the state of the text to be processed is marked as not repeated, and the fuzzy hash value, description information, and abstract information of the text to be processed are persisted to the database, sorted by text length and hash value. The text deduplication system provided by the embodiments of the present disclosure adopts a distributed architecture, in which the storage module serves as a core service and the screening module serves as a multi-node service that provides concurrent deduplication. As the database is continuously updated, the persistent storage of historical texts supports ongoing deduplication, so that the text deduplication service can be provided externally on a continuous basis.
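The mark-and-store step can be sketched with an in-memory stand-in for the database. Everything named here (`HistoryStore`, `deduplicate`, the default exact-match `is_similar` placeholder, and the 5% length window) is a hypothetical illustration; a real deployment would use a persistent database and the similarity test described earlier.

```python
import bisect
import math


class HistoryStore:
    """In-memory stand-in for the database, kept sorted by (length, hash)."""

    def __init__(self):
        self._rows = []  # tuples: (text_length, fuzzy_hash, description, abstract)

    def __len__(self):
        return len(self._rows)

    def add(self, text_length, fuzzy_hash, description, abstract):
        # insort keeps the rows ordered by text length, then by hash value.
        bisect.insort(self._rows, (text_length, fuzzy_hash, description, abstract))

    def in_length_range(self, low, high):
        """Return rows whose text length lies in [low, high], via binary search."""
        start = bisect.bisect_left(self._rows, (low,))
        end = bisect.bisect_left(self._rows, (high + 1,))
        return self._rows[start:end]


def deduplicate(store, text_length, fuzzy_hash, description, abstract,
                is_similar=lambda a, b: a == b, tolerance=0.05):
    """Return the state of the text; store it only when it is not repeated."""
    low = math.floor(text_length * (1 - tolerance))
    high = math.ceil(text_length * (1 + tolerance))
    for _, stored_hash, _, _ in store.in_length_range(low, high):
        if is_similar(fuzzy_hash, stored_hash):
            return "repeated"
    store.add(text_length, fuzzy_hash, description, abstract)
    return "not repeated"
```

Sorting by text length is what makes the length window a binary search instead of a full scan, which is one plausible reason for the ordering mentioned in the text.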
Based on the text deduplication method, the disclosure also provides a text deduplication device. The device will be described in detail below in connection with fig. 9.
Fig. 9 schematically illustrates a block diagram of a text deduplication apparatus according to an embodiment of the present disclosure. As shown in fig. 9, the text deduplication apparatus 700 of this embodiment includes an acquisition module 710, a calculation module 720, a query module 730, a first determination module 740, and a deduplication module 750.
The acquisition module 710 is configured to acquire a text to be processed in response to a data deduplication service request. In an embodiment, the acquisition module 710 may be configured to perform the operation S210 described above, which is not repeated here.
The calculation module 720 is configured to calculate a fuzzy hash value of the text to be processed. In an embodiment, the calculation module 720 may be configured to perform the operation S220 described above, which is not repeated here.
The query module 730 is configured to query, according to the fuzzy hash value, a database for text data similar to the text to be processed, where the database stores description information, abstract information, and fuzzy hash values of historical texts. In an embodiment, the query module 730 may be configured to perform the operation S230 described above, which is not repeated here.
The first determination module 740 is configured to mark the state of the text to be processed as repeated if it is determined that text data that is repeated with the text to be processed exists. In an embodiment, the first determination module 740 may be configured to perform the operation S240 described above, which is not repeated here.
The deduplication module 750 is configured to perform a deduplication operation on text whose state is marked as repeated. In an embodiment, the deduplication module 750 may be configured to perform the operation S240 described above, which is not repeated here.
According to an embodiment of the present disclosure, the calculation module includes a blocking sub-module, a first calculation sub-module, and a generation sub-module.
The blocking sub-module is configured to perform a blocking operation on the text to be processed. In an embodiment, the blocking sub-module may be configured to perform the operation S221 described above, which is not repeated here.
The first calculation sub-module is configured to calculate the hash value of each block respectively. In an embodiment, the first calculation sub-module may be configured to perform the operation S222 described above, which is not repeated here.
The generation sub-module is configured to generate a fuzzy hash value of the text to be processed according to the hash value of each block. In an embodiment, the generation sub-module may be configured to perform the operation S223 described above, which is not repeated here.
According to an embodiment of the disclosure, the blocking sub-module comprises a first determination unit and a blocking unit.
The first determination unit is configured to determine the fragmentation condition value according to the length of the text to be processed and the actual content of the text to be processed. In an embodiment, the first determination unit may be configured to perform the operation S2211 described above, which is not repeated here.
The blocking unit is configured to perform a blocking operation on the text to be processed according to the fragmentation condition value. In an embodiment, the blocking unit may be configured to perform the operation S2212 described above, which is not repeated here.
According to an embodiment of the present disclosure, the query module includes: a second calculation sub-module and a first determination sub-module.
The second calculation sub-module is configured to calculate the similarity between the text to be processed and the historical text in the database according to the fuzzy hash value. In an embodiment, the second calculation sub-module may be configured to perform the operation S310 described above, which is not repeated here.
The first determination sub-module is configured to determine that text data similar to the text to be processed exists if the similarity is greater than a first preset threshold. In an embodiment, the first determination sub-module may be configured to perform the operation S320 described above, which is not repeated here.
According to an embodiment of the present disclosure, the second calculation sub-module includes a search unit and a second determination unit.
The search unit is configured to search the database for target hash values whose text length difference is less than or equal to a second preset threshold. In an embodiment, the search unit may be configured to perform the operation S311 described above, which is not repeated here.
The second determination unit is configured to compare, among the target hash values, the similarity of the hash values by using the Hamming distance to determine texts whose difference is less than or equal to a third preset threshold. In an embodiment, the second determination unit may be configured to perform the operation S312 described above, which is not repeated here.
According to an embodiment of the present disclosure, the apparatus further includes a second determination module.
The second determination module is configured to mark the state of the text to be processed as not repeated if it is determined that no text data that is repeated with the text to be processed exists. In an embodiment, the second determination module may be configured to perform the operation S410 described above, which is not repeated here.
According to an embodiment of the present disclosure, the apparatus further includes a storage module.
The storage module is configured to store, in the database, the description information, abstract information, and fuzzy hash value of the text to be processed whose state is marked as not repeated. In an embodiment, the storage module may be configured to perform the operation S420 described above, which is not repeated here.
Any of the acquisition module 710, the calculation module 720, the query module 730, the first determination module 740, and the deduplication module 750 may be implemented in one module, or any of the modules may be split into multiple modules, according to embodiments of the present disclosure. Or at least some of the functionality of one or more of the modules may be combined with, and implemented in, at least some of the functionality of other modules. According to embodiments of the present disclosure, at least one of the acquisition module 710, the calculation module 720, the query module 730, the first determination module 740, and the deduplication module 750 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or in any one of or a suitable combination of any of the three implementations of software, hardware, and firmware. Or at least one of the acquisition module 710, the calculation module 720, the query module 730, the first determination module 740 and the deduplication module 750 may be at least partially implemented as computer program modules, which when executed, may perform the respective functions.
Fig. 10 schematically illustrates a block diagram of an electronic device adapted to implement a text deduplication method in accordance with an embodiment of the present disclosure.
As shown in fig. 10, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. The processor 901 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 901 may also include on-board memory for caching purposes. Processor 901 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. The processor 901 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the program may be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, the input/output (I/O) interface 905 also being connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 910 as needed, so that a computer program read therefrom is installed into the storage section 908 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs that, when executed, implement text deduplication methods according to embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 902 and/or RAM 903 and/or one or more memories other than ROM 902 and RAM 903 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the text deduplication method provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed, and downloaded and installed in the form of a signal on a network medium, via communication portion 909, and/or installed from removable medium 911. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for executing the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in a variety of ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or integrated without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.
Claims (11)
1. A text deduplication method, the method comprising:
acquiring a text to be processed in response to a data deduplication service request;
calculating a fuzzy hash value of the text to be processed;
querying a database for text data similar to the text to be processed according to the fuzzy hash value, wherein the database stores description information, abstract information, and fuzzy hash values of historical texts;
if it is determined that text data that is repeated with the text to be processed exists, marking the state of the text to be processed as repeated; and
performing a deduplication operation on the text whose state is marked as repeated.
2. The method of claim 1, wherein the calculating a fuzzy hash value of the text to be processed comprises:
performing a blocking operation on the text to be processed;
calculating the hash value of each block respectively; and
generating the fuzzy hash value of the text to be processed according to the hash value of each block.
3. The method of claim 2, wherein the performing a blocking operation on the text to be processed comprises:
determining a fragmentation condition value according to the length of the text to be processed and the actual content of the text to be processed; and
performing the blocking operation on the text to be processed according to the fragmentation condition value.
4. The method of claim 1, wherein the querying a database for text data similar to the text to be processed according to the fuzzy hash value comprises:
calculating the similarity between the text to be processed and the historical text in the database according to the fuzzy hash value; and
if the similarity is greater than a first preset threshold, determining that text data similar to the text to be processed exists.
5. The method of claim 4, wherein the calculating the similarity between the text to be processed and the historical text in the database according to the fuzzy hash value comprises:
searching the database for target hash values whose text length difference is less than or equal to a second preset threshold; and
among the target hash values, comparing the similarity of the hash values by using the Hamming distance to determine texts whose difference is less than or equal to a third preset threshold.
6. The method according to any one of claims 1 to 5, wherein the method further comprises:
if it is determined that no text data that is repeated with the text to be processed exists, marking the state of the text to be processed as not repeated.
7. The method of claim 6, further comprising:
storing, in the database, the description information, abstract information, and fuzzy hash value of the text to be processed whose state is marked as not repeated.
8. A text deduplication apparatus, the apparatus comprising:
an acquisition module, configured to acquire a text to be processed in response to a data deduplication service request;
a calculation module, configured to calculate a fuzzy hash value of the text to be processed;
a query module, configured to query a database for text data similar to the text to be processed according to the fuzzy hash value, wherein the database stores description information, abstract information, and fuzzy hash values of historical texts;
a first determination module, configured to mark the state of the text to be processed as repeated if it is determined that text data that is repeated with the text to be processed exists; and
a deduplication module, configured to perform a deduplication operation on the text whose state is marked as repeated.
9. An electronic device, comprising:
one or more processors; and
a storage device for storing one or more computer programs,
wherein the one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
11. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410224044.6A CN118095247A (en) | 2024-02-28 | 2024-02-28 | Text deduplication method, apparatus, device, storage medium, and program product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118095247A (en) | 2024-05-28 |
Family
ID=91152767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410224044.6A Pending CN118095247A (en) | 2024-02-28 | 2024-02-28 | Text deduplication method, apparatus, device, storage medium, and program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118095247A (en) |
- 2024-02-28: CN application CN202410224044.6A, publication CN118095247A (en), status: active, pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||