CN118095247A - Text deduplication method, apparatus, device, storage medium, and program product

Info

Publication number
CN118095247A
Authority
CN
China
Prior art keywords
text
processed
hash value
fuzzy hash
deduplication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410224044.6A
Other languages
Chinese (zh)
Inventor
孔德耀
李天浩
丁鑫煜
杨冰彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202410224044.6A
Publication of CN118095247A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a text deduplication method, relates to the technical field of artificial intelligence, and can be applied to the technical field of finance. The method comprises the following steps: responding to a data deduplication service request, and acquiring a text to be processed; calculating a fuzzy hash value of the text to be processed; inquiring text data similar to the text to be processed in a database according to the fuzzy hash value, wherein the database stores description information, abstract information and the fuzzy hash value of historical text; if the text data which is repeated with the text to be processed exists, marking the state of the text to be processed as repeated; and performing a deduplication operation on the text whose status is marked as duplicate. The present disclosure also provides a text deduplication apparatus, a device, a storage medium, and a program product.

Description

Text deduplication method, apparatus, device, storage medium, and program product
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to a text deduplication method, apparatus, device, storage medium, and program product.
Background
In the technical field of large models, training a large model requires a large amount of data, and high-quality data improves the training result. Corpora are currently obtained by downloading public datasets, crawling web page data, collecting internal data, and the like. However, gathering all of this data together inevitably produces duplicate data, and duplicate data affects large model training to some extent, so duplicates need to be removed when the data is prepared. In the related art, duplicates are generally removed by directly comparing text content for equality, and no better text deduplication method is adopted. Directly comparing text content is inefficient, cannot be persisted, and cannot be used to deduplicate large volumes of text.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
In view of the foregoing, the present disclosure provides an efficient text deduplication method, apparatus, device, storage medium, and program product.
According to a first aspect of the present disclosure, there is provided a text deduplication method, the method comprising:
responding to a data deduplication service request, and acquiring a text to be processed;
calculating a fuzzy hash value of the text to be processed;
querying text data similar to the text to be processed in a database according to the fuzzy hash value, wherein the database stores description information, abstract information and the fuzzy hash value of historical text;
if text data that is repeated with the text to be processed exists, marking the state of the text to be processed as repeated; and
performing a deduplication operation on the text whose state is marked as repeated.
According to an embodiment of the present disclosure, the calculating the fuzzy hash value of the text to be processed includes:
performing a blocking operation on the text to be processed;
calculating the hash value of each block respectively; and
generating the fuzzy hash value of the text to be processed according to the hash value of each block.
According to an embodiment of the present disclosure, the performing a blocking operation on the text to be processed includes:
determining a fragmentation condition value according to the length of the text to be processed and the actual content of the text to be processed; and
performing the blocking operation on the text to be processed according to the fragmentation condition value.
According to an embodiment of the present disclosure, the querying, in a database, text data similar to the text to be processed according to the fuzzy hash value includes:
calculating the similarity between the text to be processed and the historical text in the database according to the fuzzy hash value; and
if the similarity is greater than a first preset threshold, determining that text data similar to the text to be processed exists.
According to an embodiment of the present disclosure, the calculating, according to the fuzzy hash value, a similarity between the text to be processed and a historical text in a database includes:
searching the database for target hash values whose text length difference is less than or equal to a second preset threshold; and
among the target hash values, comparing the similarity of the hash values by using the Hamming distance to determine texts whose difference is less than or equal to a third preset threshold.
According to an embodiment of the present disclosure, the method further comprises:
if text data that is repeated with the text to be processed does not exist, marking the state of the text to be processed as non-repeated.
According to an embodiment of the present disclosure, the method further comprises:
storing, in the database, the description information, summary information, and fuzzy hash value of the text to be processed whose state is marked as non-repeated.
A second aspect of the present disclosure provides a text deduplication apparatus, the apparatus comprising:
an acquisition module, configured to acquire a text to be processed in response to a data deduplication service request;
a computing module, configured to calculate a fuzzy hash value of the text to be processed;
a query module, configured to query text data similar to the text to be processed in a database according to the fuzzy hash value, wherein the database stores description information, abstract information and the fuzzy hash value of historical text;
a first determining module, configured to mark the state of the text to be processed as repeated if it is determined that text data that is repeated with the text to be processed exists; and
a deduplication module, configured to perform a deduplication operation on the text whose state is marked as repeated.
According to an embodiment of the present disclosure, the computing module includes a blocking sub-module, a first computing sub-module, and a generation sub-module.
The blocking sub-module is used for performing a blocking operation on the text to be processed;
the first computing sub-module is used for calculating the hash value of each block respectively; and
the generation sub-module is used for generating a fuzzy hash value of the text to be processed according to the hash value of each block.
According to an embodiment of the present disclosure, the blocking sub-module includes a first determining unit and a blocking unit.
The first determining unit is used for determining a fragmentation condition value according to the length of the text to be processed and the actual content of the text to be processed; and
the blocking unit is used for performing the blocking operation on the text to be processed according to the fragmentation condition value.
According to an embodiment of the present disclosure, the query module includes a second computing sub-module and a first determining sub-module.
The second computing sub-module is used for calculating the similarity between the text to be processed and the historical text in the database according to the fuzzy hash value; and
the first determining sub-module is used for determining that text data similar to the text to be processed exists if the similarity is greater than a first preset threshold.
According to an embodiment of the present disclosure, the second computing sub-module includes a searching unit and a second determining unit.
The searching unit is used for searching the database for target hash values whose text length difference is less than or equal to a second preset threshold; and
the second determining unit is used for comparing, among the target hash values, the similarity of the hash values by using the Hamming distance to determine texts whose difference is less than or equal to a third preset threshold.
According to an embodiment of the present disclosure, the apparatus further comprises a second determining module.
The second determining module is used for marking the state of the text to be processed as non-repeated if it is determined that text data that is repeated with the text to be processed does not exist.
According to an embodiment of the present disclosure, the apparatus further comprises a storage module.
The storage module is used for storing, in the database, the description information, the abstract information and the fuzzy hash value of the text to be processed whose state is marked as non-repeated.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the text deduplication method described above.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described text deduplication method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above text deduplication method.
According to the text deduplication method provided by the embodiments of the present disclosure, a text to be processed is acquired in response to a data deduplication service request; a fuzzy hash value of the text to be processed is calculated; text data similar to the text to be processed is queried in a database according to the fuzzy hash value; whether the text to be processed is a repeated text is determined through the fuzzy hash algorithm, and if text data that is repeated with the text to be processed exists, the state of the text to be processed is marked as repeated; and a deduplication operation is performed on the text whose state is marked as repeated. Compared with the related art, the fuzzy hash algorithm provided by the embodiments of the present disclosure can efficiently and quickly identify texts that differ only by partial changes, remove repeated texts, and quickly deliver high-quality text.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a text deduplication method, apparatus, device, storage medium, and program product according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a system architecture diagram of a text deduplication device in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a text deduplication method provided in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates a first flowchart of the fuzzy hash value calculation method provided in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a second flowchart of the fuzzy hash value calculation method provided in accordance with an embodiment of the present disclosure;
FIG. 6 schematically illustrates a first flowchart of the similar text data query method provided in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a second flowchart of the similar text data query method provided in accordance with an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of a text deduplication method provided in accordance with another embodiment of the present disclosure;
Fig. 9 schematically illustrates a block diagram of a text deduplication apparatus according to an embodiment of the present disclosure; and
Fig. 10 schematically illustrates a block diagram of an electronic device adapted to implement a text deduplication method in accordance with an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B and C, etc." is used, it should generally be interpreted as one skilled in the art would commonly understand it (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
The terms appearing in the embodiments of the present disclosure will first be explained:
Fuzzy hash algorithm: the fuzzy hash algorithm, also known as Context Triggered Piecewise Hashing (CTPH) or content-segmentation-based hashing, is mainly used to compare files for similarity.
Existing data deduplication methods generally remove duplicates by directly comparing text content for equality, and few better text deduplication methods exist. Directly comparing text content is inefficient and cannot be persisted, so it clearly cannot meet the requirements of deduplicating large volumes of text.
Based on the technical problems described above, embodiments of the present disclosure provide a text deduplication method, which includes: responding to a data deduplication service request, and acquiring a text to be processed; calculating a fuzzy hash value of the text to be processed; inquiring text data similar to the text to be processed in a database according to the fuzzy hash value, wherein the database stores description information, abstract information and the fuzzy hash value of historical text; if the text data which is repeated with the text to be processed exists, marking the state of the text to be processed as repeated; and performing a deduplication operation on the text whose status is marked as duplicate.
Fig. 1 schematically illustrates an application scenario diagram of a text deduplication method, apparatus, device, storage medium, and program product according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a text deduplication scenario. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a backend server, which may perform the text deduplication method provided by the embodiments of the present disclosure, and obtain a text to be processed in response to a data deduplication service request; calculating a fuzzy hash value of the text to be processed; inquiring text data similar to the text to be processed in a database according to the fuzzy hash value, wherein the database stores description information, abstract information and the fuzzy hash value of historical text; if the text data which is repeated with the text to be processed exists, marking the state of the text to be processed as repeated; and performing a deduplication operation on the text whose status is marked as duplicate.
It should be noted that the text deduplication method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the text deduplication apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The text deduplication method provided by the embodiments of the present disclosure may also be performed by a server or cluster of servers that are different from server 105 and capable of communicating with terminal devices 101, 102, 103 and/or server 105. Accordingly, the text deduplication apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that the text deduplication method and apparatus provided by the embodiments of the present disclosure may be used in the internet technical field, in the financial technical field, and in any field other than the financial field; the application field of the text deduplication method and apparatus is not limited.
Fig. 2 schematically illustrates a system architecture diagram of a text deduplication device in accordance with an embodiment of the present disclosure. As shown in fig. 2, the data deduplication service architecture provided by the embodiment of the present disclosure includes a storage module and a screening module. The screening module calculates the fuzzy hash value of a text and passes it to the storage module, which performs a query according to the fuzzy hash value. If the fuzzy hash value is found, the text is judged to exist already and to be a repeated text. If it is not found, the text is judged not to exist, and the storage module stores the fuzzy hash value for subsequent text deduplication.
The text deduplication method of the embodiments of the present disclosure will be described in detail below by fig. 3 to 6 based on the application scenario described in fig. 1 and the architecture described in fig. 2.
Fig. 3 schematically illustrates a flowchart of a text deduplication method according to an embodiment of the present disclosure. As shown in fig. 3, the text deduplication method of the embodiment includes operations S210 to S250, which may be performed by a server or other computing device.
In operation S210, a text to be processed is acquired in response to the data deduplication service request.
In operation S220, a fuzzy hash value of the text to be processed is calculated.
In operation S230, text data similar to the text to be processed is queried in a database according to the fuzzy hash value.
According to an embodiment of the present disclosure, the database stores descriptive information, summary information, and fuzzy hash values of the history text.
In one example, in order to read and process text corpora efficiently and quickly, an embodiment of the present disclosure provides a text deduplication system that offers text deduplication as an external service. The text deduplication system comprises two parts, a storage module and a screening module. The screening module calculates the fuzzy hash value of a text and passes it to the storage module, which performs a query according to the fuzzy hash value. If the fuzzy hash value is found, the text is judged to exist already and to be a repeated text. If it is not found, the text is judged not to exist, and the storage module stores the fuzzy hash value for subsequent text deduplication. In response to a data deduplication service request, the text to be processed is acquired, its fuzzy hash value is calculated with the ssdeep algorithm, and this value is compared with the fuzzy hash values of the historical texts in the database to determine whether a text similar to the text to be processed exists in the database. The database stores description information, summary information, and fuzzy hash values of the historical texts.
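The division of work between the screening module and the storage module can be illustrated with a minimal Python sketch. The class names, the `handle_request` method, the in-memory dictionary standing in for the database, and the use of the first 50 characters as summary information are all assumptions made for illustration and are not taken from the disclosure. The fuzzy hash function is passed in as a parameter, and the lookup shown here is an exact match on the hash value only; a similarity-based query would replace `lookup`.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StorageModule:
    """Hypothetical in-memory stand-in for the database of historical texts."""
    records: dict = field(default_factory=dict)

    def lookup(self, fuzzy_hash: str) -> Optional[dict]:
        # Exact-match query on the fuzzy hash value (an assumed simplification).
        return self.records.get(fuzzy_hash)

    def save(self, fuzzy_hash: str, description: str, summary: str) -> None:
        # Persist description information, summary information, and the hash.
        self.records[fuzzy_hash] = {"description": description, "summary": summary}


@dataclass
class ScreeningModule:
    """Hypothetical screening module: hashes a text and asks storage about it."""
    storage: StorageModule
    hash_fn: Callable[[str], str]  # any fuzzy hash function, supplied by the caller

    def handle_request(self, text: str, description: str = "") -> str:
        fuzzy_hash = self.hash_fn(text)
        if self.storage.lookup(fuzzy_hash) is not None:
            return "repeated"
        # Not found: store the record so later requests can be checked against it.
        self.storage.save(fuzzy_hash, description, text[:50])
        return "non-repeated"
```

For a quick trial, any deterministic function can be supplied as `hash_fn`, for example one that lowercases the text and collapses whitespace; a real deployment would supply a fuzzy hash such as ssdeep instead.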
In operation S240, if it is determined that there is text data that is repeated with the text to be processed, the state of the text to be processed is marked as repeated.
In operation S250, a deduplication operation is performed on text whose status is marked as duplicate.
In one example, after determining that text data that is repeated with the text to be processed exists, marking the state of the text to be processed as repeated, and performing a deduplication operation on the text with the state marked as repeated.
According to the text deduplication method provided by the embodiments of the present disclosure, a text to be processed is acquired in response to a data deduplication service request; a fuzzy hash value of the text to be processed is calculated; text data similar to the text to be processed is queried in a database according to the fuzzy hash value; whether the text to be processed is a repeated text is determined through the fuzzy hash algorithm, and if text data that is repeated with the text to be processed exists, the state of the text to be processed is marked as repeated; and a deduplication operation is performed on the text whose state is marked as repeated. Compared with the related art, the fuzzy hash algorithm provided by the embodiments of the present disclosure can efficiently and quickly identify texts that differ only by partial changes, remove repeated texts, and quickly deliver high-quality text.
Fig. 4 schematically illustrates a first flowchart of the fuzzy hash value calculation method provided according to an embodiment of the present disclosure. Fig. 5 schematically illustrates a second flowchart of the fuzzy hash value calculation method provided according to an embodiment of the present disclosure.
As shown in fig. 4, operation S220 includes operations S221 to S223.
In operation S221, a blocking operation is performed on the text to be processed.
As shown in fig. 5, operation S221 includes operation S2211 and operation S2212.
In operation S2211, a fragmentation condition value is determined according to the length of the text to be processed and the actual content of the text to be processed.
In operation S2212, a blocking operation is performed on the text to be processed according to the fragmentation condition value.
In one example, when calculating the fuzzy hash value, the text to be processed is first fragmented by using a weak hash algorithm and a fragmentation condition value. Specifically, a portion of the content is read from the file and a hash value is calculated for it using the weak hash algorithm. The content is typically read byte by byte with a fixed length: a window of fixed size slides through the file, much like the sliding window in a network protocol, and the hash of the content inside the window is calculated at each position. For efficiency, a rolling hash algorithm is generally adopted. With a rolling hash, if the hash value of abcdef has already been calculated as h1, the hash value of bcdefg does not need to be recalculated from scratch; it is obtained as h1 - X(a) + Y(g), where X and Y are two functions. Only the effect of the differing characters on the hash value is added or subtracted, which greatly speeds up the fragmentation decision. In ssdeep, the Adler-32 algorithm is used. In addition to the weak hash algorithm, a fragmentation condition value n is required, which is used to control the fragmentation condition; in ssdeep, n is determined by the length of the text to be processed and the actual content of the text to be processed.
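The rolling update h1 - X(a) + Y(g) and the role of the fragmentation condition value n can be sketched as follows. This is a minimal illustration, not ssdeep's actual rolling hash: the window size of 7, the additive hash (where X and Y both return the byte value), the trigger condition `h % n == n - 1`, and the function name `split_chunks` are all assumptions chosen to keep the example short.

```python
from collections import deque

WINDOW = 7  # assumed window size, for illustration only


def split_chunks(data: bytes, n: int) -> list:
    """Split data wherever the rolling hash of the sliding window hits the trigger."""
    window = deque()
    h = 0          # rolling hash: sum of the bytes currently inside the window
    chunks = []
    start = 0
    for i, byte in enumerate(data):
        window.append(byte)
        h += byte                      # Y(g): add the byte entering the window
        if len(window) > WINDOW:
            h -= window.popleft()      # X(a): subtract the byte leaving the window
        if h % n == n - 1:             # assumed trigger condition controlled by n
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])    # trailing bytes that never hit the trigger
    return chunks
```

Because the trigger depends only on the bytes inside the window, a local edit changes the fragment boundaries near the edit and leaves distant fragments unchanged, which is the property that lets similar texts share most of their fragments. A larger n makes the trigger rarer and the fragments longer.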
In operation S222, a hash value of each chunk is calculated, respectively.
In operation S223, a fuzzy hash value of the text to be processed is generated according to the hash value of each block.
In one example, after the file is fragmented, a hash value is calculated separately for each fragment. A conventional hash algorithm such as MD5 may be used; ssdeep uses the Fowler-Noll-Vo (FNV) hash algorithm. After a hash value has been calculated for each file fragment, the hash values are concatenated to obtain the fuzzy hash value of the text to be processed. The fragmentation condition value may differ between files, so it may also be incorporated into the fuzzy hash value. Before the hash values are concatenated, each one may be compressed. For example, ssdeep takes only the lowest 6 bits of the FNV hash result and represents them as one ASCII character, which serves as the final hash result of the fragment. The compression mapping loses some accuracy and introduces false positives, so it is optional.
Fig. 6 schematically illustrates a first flowchart of the similar text data query method provided in accordance with an embodiment of the present disclosure. Fig. 7 schematically illustrates a second flowchart of the similar text data query method provided in accordance with an embodiment of the present disclosure. Fig. 8 schematically illustrates a flow chart of a text deduplication method provided in accordance with another embodiment of the present disclosure.
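The per-fragment hashing, the 6-bit compression mapping, and the concatenation step can be sketched as below. The sketch uses the standard 32-bit FNV-1a variant and the Base64 alphabet for the 6-bit mapping; both are assumptions for illustration and may differ from the constants and variant that ssdeep itself uses. The function name `fuzzy_signature` and the `n:signature` output layout are likewise hypothetical.

```python
# 64 printable characters, one for each possible 6-bit value (assumed alphabet).
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash."""
    h = 0x811C9DC5                       # FNV-1a 32-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h


def fuzzy_signature(chunks: list, n: int) -> str:
    """Hash each fragment, keep the lowest 6 bits, and join the characters."""
    chars = [B64[fnv1a_32(chunk) & 0x3F] for chunk in chunks]
    # Prefix the fragmentation condition value so that only signatures built
    # with the same n are compared with each other.
    return f"{n}:{''.join(chars)}"
```

Keeping only 6 bits per fragment means two different fragments map to the same character with probability of roughly 1 in 64, which is the source of the false positives mentioned above.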
As shown in fig. 6, operation S230 includes operations S310 to S320.
In operation S310, a similarity between the text to be processed and the historical text in the database is calculated according to the fuzzy hash value.
In one example, after the fuzzy hash value of the text to be processed is obtained, historical texts similar to it are searched for in the database according to the fuzzy hash value to determine whether the text to be processed is a repeated text. ssdeep takes the following approach. Because ssdeep represents each fragment hash as one ASCII character, the final fuzzy hash value of a file is a character string. Let the two strings be S1 and S2, where S1 is the fuzzy hash value of the text to be processed and S2 is the fuzzy hash value of a historical text stored in the database. The weighted edit distance from S1 to S2 is used as the basis for determining similarity. The weighted edit distance is obtained by determining the operations (insertion, deletion, modification, and exchange) needed to transform S1 into S2, assigning a weight to each kind of operation, and summing the weights. The distance is then divided by the sum of the lengths of S1 and S2, which turns the absolute result into a relative one, and the relative result is mapped to an integer from 0 to 100, where 100 means that the two strings are identical and 0 means that they are completely dissimilar. The resulting similarity score is used to judge whether the two files are similar.
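The scoring described above can be sketched with a small dynamic-programming routine. The weights (1 for insertion, 1 for deletion, 2 for modification) are assumptions chosen so that the distance never exceeds the sum of the two lengths; the exchange operation is omitted for brevity, and ssdeep's actual weights and rescaling may differ.

```python
def weighted_edit_distance(s1: str, s2: str, ins: int = 1, dele: int = 1, sub: int = 2) -> int:
    """Weighted edit distance from s1 to s2 (exchange operation not modeled)."""
    prev = [j * ins for j in range(len(s2) + 1)]   # cost of building s2[:j] from ""
    for i, c1 in enumerate(s1, start=1):
        cur = [i * dele]                           # cost of deleting all of s1[:i]
        for j, c2 in enumerate(s2, start=1):
            change = 0 if c1 == c2 else sub
            cur.append(min(prev[j] + dele,         # delete c1
                           cur[j - 1] + ins,       # insert c2
                           prev[j - 1] + change))  # keep or modify
        prev = cur
    return prev[-1]


def similarity_score(s1: str, s2: str) -> int:
    """Map the relative distance to an integer in 0..100 (100 = identical)."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100
    distance = weighted_edit_distance(s1, s2)
    return round(100 * (total - distance) / total)
```

With these assumed weights, two four-character signatures that differ in one position have a distance of 2 and a score of 75, while two signatures with no characters in common score 0.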
As shown in fig. 7, operation S310 includes operations S311 to S312.
In operation S311, a target hash value having a text length difference value less than or equal to a second preset threshold value is searched for in the database.
In operation S312, among the target hash values, the similarity of the hash values is compared using the Hamming distance to determine texts whose difference is less than or equal to a third preset threshold.
In one example, text similarity is calculated with the ssdeep algorithm: similar texts produce similar hash values, so the similarity between texts can be computed from their hash values, and two texts are considered similar when the similarity meets a certain threshold. In this disclosure, the text length difference and the hash value (which contains length information) are generally used together, and texts whose difference is less than 5% are considered similar documents. Specifically, among all stored ssdeep hash values, target hash values whose text length difference is less than or equal to a second preset threshold are searched for; the second preset threshold may be, for example, 5%. Then, among all hash values whose text length difference is less than or equal to 5%, the Hamming distance is used to compare the similarity of the hash values and to determine texts whose difference is less than or equal to a third preset threshold; the third preset threshold may be, for example, 5%.
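The two-stage filtering in operations S311 and S312 can be sketched as follows. The `Record` type, the function names, and the way thresholds are expressed as fractions are assumptions made for illustration. Because the Hamming distance is only defined for equal-length strings, this sketch additionally counts any length difference between two hash strings as mismatches, which is also an assumption rather than something the disclosure specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text_length: int   # length of the original text
    fuzzy_hash: str    # stored fuzzy hash string


def length_close(a, b, max_rel_diff=0.05):
    """Stage 1 (S311): relative text length difference within the threshold."""
    longer = max(a, b)
    return longer == 0 or abs(a - b) / longer <= max_rel_diff


def hamming_ratio(h1, h2):
    """Fraction of differing positions; extra characters count as mismatches."""
    longer = max(len(h1), len(h2))
    if longer == 0:
        return 0.0
    mismatches = sum(c1 != c2 for c1, c2 in zip(h1, h2))
    mismatches += abs(len(h1) - len(h2))
    return mismatches / longer


def find_similar(query, records, length_threshold=0.05, hash_threshold=0.05):
    """Stage 1 narrows by text length, stage 2 (S312) compares hash values."""
    candidates = [
        r for r in records
        if length_close(query.text_length, r.text_length, length_threshold)
    ]
    return [
        r for r in candidates
        if hamming_ratio(query.fuzzy_hash, r.fuzzy_hash) <= hash_threshold
    ]
```

Filtering by length first keeps the more expensive character-by-character comparison restricted to a small candidate set.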
In operation S320, if the similarity is greater than a first preset threshold, it is determined that there is text data similar to the text to be processed.
In one example, if similar text data is found, the text to be processed is identified as repeated text and its state is marked as repeated. Subsequent text deduplication processing is then performed.
As shown in fig. 8, after operation S240, operations S410 and S420 are also included.
In operation S410, if it is determined that there is no text data that is repeated with the text to be processed, the state of the text to be processed is marked as not repeated.
In operation S420, description information, digest information, and fuzzy hash values of the text to be processed, the status of which is marked as not repeated, are stored in the database.
In one example, if no text data similar to the text to be processed is found in the database, the state of the text to be processed is marked as not repeated, and the fuzzy hash value, description information, and abstract information of the text to be processed are persisted to the database, sorted by text length and hash value. The text deduplication system provided by the embodiments of the present disclosure is implemented with a distributed architecture, in which the storage module serves as a core service and the screening module serves as a multi-node service providing concurrent deduplication. As the database is continuously updated, persistent storage of historical text supports ongoing deduplication, so that the text deduplication service can be provided externally on a continuous basis.
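One reason to keep the stored records sorted by text length and hash value is that candidates within a length tolerance then form a contiguous range. The following in-memory sketch illustrates that idea; the `HashStore` class, its method names, and the integer percentage tolerance are hypothetical stand-ins for the persistent database, which the disclosure does not detail.

```python
import bisect


class HashStore:
    """In-memory stand-in for the database, ordered by (text_length, fuzzy_hash)."""

    def __init__(self):
        self._keys = []   # sorted list of (text_length, fuzzy_hash)
        self._meta = {}   # key -> (description, abstract)

    def add(self, text_length, fuzzy_hash, description, abstract):
        key = (text_length, fuzzy_hash)
        if key not in self._meta:
            bisect.insort(self._keys, key)   # keeps the list sorted
        self._meta[key] = (description, abstract)

    def candidates(self, text_length, tolerance_percent=5):
        """Return all keys whose text length is within the tolerance."""
        max_diff = text_length * tolerance_percent // 100
        low, high = text_length - max_diff, text_length + max_diff
        # A 1-tuple sorts before any 2-tuple with the same first element,
        # so these bounds select every key with low <= length <= high.
        start = bisect.bisect_left(self._keys, (low,))
        end = bisect.bisect_left(self._keys, (high + 1,))
        return self._keys[start:end]

    def metadata(self, key):
        return self._meta[key]
```

A real deployment would more likely rely on a database index over the length column; the binary search here only mirrors what such an index provides.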
Based on the text deduplication method, the disclosure also provides a text deduplication device. The device will be described in detail below in connection with fig. 9.
Fig. 9 schematically illustrates a block diagram of a text deduplication apparatus according to an embodiment of the present disclosure. As shown in fig. 9, the text deduplication apparatus 700 of this embodiment includes an acquisition module 710, a calculation module 720, a query module 730, a first determination module 740, and a deduplication module 750.
The obtaining module 710 is configured to obtain a text to be processed in response to the data deduplication service request. In an embodiment, the obtaining module 710 may be configured to perform the operation S210 described above, which is not described herein.
The calculating module 720 is configured to calculate a fuzzy hash value of the text to be processed. In an embodiment, the calculating module 720 may be configured to perform the operation S220 described above, which is not described herein.
The query module 730 is configured to query, according to the fuzzy hash value, a database for text data similar to the text to be processed, where the database stores description information, abstract information and fuzzy hash value of historical text. In an embodiment, the query module 730 may be configured to perform the operation S230 described above, which is not described herein.
The first determining module 740 is configured to mark the state of the text to be processed as repeated if it is determined that text data that is repeated with the text to be processed exists. In an embodiment, the first determining module 740 may be configured to perform the operation S240 described above, which is not described herein.
The deduplication module 750 is configured to perform a deduplication operation on text whose status is marked as duplicate. In an embodiment, the deduplication module 750 may be used to perform the operation S240 described above, which is not described herein.
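The cooperation of the modules described above can be sketched as a small pipeline. The `Deduplicator` class, its field names, and the use of injected `hash_fn` and `similarity_fn` callables are assumptions for illustration; the stand-in functions in the usage example are deliberately trivial and are not the fuzzy hash or similarity measure of the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Deduplicator:
    hash_fn: Callable[[str], str]                 # role of the calculation module
    similarity_fn: Callable[[str, str], float]    # used by the query module
    threshold: float                              # first preset threshold
    history: List[str] = field(default_factory=list)  # stored fuzzy hashes

    def process(self, text: str) -> str:
        """Return the state assigned to one text to be processed."""
        h = self.hash_fn(text)
        repeated = any(
            self.similarity_fn(h, old) > self.threshold for old in self.history
        )
        if repeated:
            return "repeated"
        self.history.append(h)   # persist only texts that are not repeated
        return "not repeated"

    def deduplicate(self, texts: List[str]) -> List[str]:
        """Drop every text whose state is marked as repeated."""
        return [t for t in texts if self.process(t) == "not repeated"]


# Usage with trivial stand-ins: case-insensitive exact matching.
dedup = Deduplicator(
    hash_fn=str.lower,
    similarity_fn=lambda a, b: 100.0 if a == b else 0.0,
    threshold=80.0,
)
print(dedup.deduplicate(["Hello", "hello", "World"]))  # ['Hello', 'World']
```

Passing the hash and similarity functions in as parameters keeps the control flow independent of whichever fuzzy hashing scheme is actually deployed.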
According to an embodiment of the present disclosure, the calculation module includes a blocking sub-module, a first calculation sub-module, and a generation sub-module.
The blocking sub-module is configured to perform a blocking operation on the text to be processed. In an embodiment, the blocking sub-module may be used to perform the operation S221 described above, which is not described herein.
The first calculation sub-module is configured to calculate the hash value of each block separately. In an embodiment, the first calculation sub-module may be used to perform the operation S222 described above, which is not described herein.
The generation sub-module is configured to generate the fuzzy hash value of the text to be processed according to the hash value of each block. In an embodiment, the generation sub-module may be used to perform the operation S223 described above, which is not described herein.
According to an embodiment of the disclosure, the blocking sub-module comprises a first determination unit and a blocking unit.
The first determination unit is configured to determine a blocking condition value according to the length of the text to be processed and the actual content of the text to be processed. In an embodiment, the first determination unit may be used to perform the operation S2211 described above, which is not described herein.
The blocking unit is configured to perform the blocking operation on the text to be processed according to the blocking condition value. In an embodiment, the blocking unit may be used to perform the operation S2212 described above, which is not described herein.
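The sub-modules above (blocking, per-block hashing, and generation of the fuzzy hash value) can be sketched as follows. This is a simplified illustration in the spirit of context-triggered piecewise hashing, not ssdeep's actual algorithm: the window sum used as a trigger, the constants `WINDOW`, `TARGET_CHUNKS`, and `MIN_BLOCK_SIZE`, the use of SHA-256 for block hashes, and all function names are assumptions.

```python
import hashlib

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7            # bytes of context examined at each position
TARGET_CHUNKS = 64    # rough upper bound on the number of blocks
MIN_BLOCK_SIZE = 3


def blocking_condition_value(data: bytes) -> int:
    """Length-dependent part: grow the block size with the text length."""
    block_size = MIN_BLOCK_SIZE
    while block_size * TARGET_CHUNKS < len(data):
        block_size *= 2
    return block_size


def split_into_chunks(data: bytes, block_size: int) -> list:
    """Content-dependent part: cut where the recent bytes hit a trigger value."""
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        if sum(window) % block_size == block_size - 1:
            chunks.append(data[start: i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing block without a trigger
    return chunks


def fuzzy_hash(text: str) -> str:
    """Hash each block, map it to one character, and concatenate."""
    data = text.encode("utf-8")
    block_size = blocking_condition_value(data)
    chars = [
        B64[hashlib.sha256(chunk).digest()[0] % 64]
        for chunk in split_into_chunks(data, block_size)
    ]
    return f"{block_size}:{''.join(chars)}"
```

Because block boundaries depend on nearby content rather than on fixed offsets, a local edit tends to change only the characters for the blocks it touches, which is what makes the resulting strings comparable.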
According to an embodiment of the present disclosure, the query module includes a second calculation sub-module and a first determination sub-module.
The second calculation sub-module is configured to calculate the similarity between the text to be processed and the historical text in the database according to the fuzzy hash value. In an embodiment, the second calculation sub-module may be used to perform the operation S310 described above, which is not described herein.
The first determination sub-module is configured to determine that text data similar to the text to be processed exists if the similarity is greater than a first preset threshold. In an embodiment, the first determination sub-module may be used to perform the operation S320 described above, which is not described herein.
According to an embodiment of the present disclosure, the second calculation sub-module includes a search unit and a second determination unit.
The search unit is configured to search the database for target hash values whose text length difference is less than or equal to a second preset threshold. In an embodiment, the search unit may be used to perform the operation S311 described above, which is not described herein.
The second determination unit is configured to compare, among the target hash values, the similarity of the hash values using the Hamming distance to determine texts whose difference is less than or equal to a third preset threshold. In an embodiment, the second determination unit may be used to perform the operation S312 described above, which is not described herein.
According to an embodiment of the present disclosure, the apparatus further comprises a second determination module.
The second determination module is configured to mark the state of the text to be processed as not repeated if it is determined that no text data that repeats the text to be processed exists. In an embodiment, the second determination module may be used to perform the operation S410 described above, which is not described herein.
According to an embodiment of the present disclosure, the apparatus further comprises a storage module.
The storage module is configured to store, in the database, the description information, abstract information, and fuzzy hash value of the text to be processed whose state is marked as not repeated. In an embodiment, the storage module may be used to perform the operation S420 described above, which is not described herein.
Any of the acquisition module 710, the calculation module 720, the query module 730, the first determination module 740, and the deduplication module 750 may be implemented in one module, or any of the modules may be split into multiple modules, according to embodiments of the present disclosure. Or at least some of the functionality of one or more of the modules may be combined with, and implemented in, at least some of the functionality of other modules. According to embodiments of the present disclosure, at least one of the acquisition module 710, the calculation module 720, the query module 730, the first determination module 740, and the deduplication module 750 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or in any one of or a suitable combination of any of the three implementations of software, hardware, and firmware. Or at least one of the acquisition module 710, the calculation module 720, the query module 730, the first determination module 740 and the deduplication module 750 may be at least partially implemented as computer program modules, which when executed, may perform the respective functions.
Fig. 10 schematically illustrates a block diagram of an electronic device adapted to implement a text deduplication method in accordance with an embodiment of the present disclosure.
As shown in fig. 10, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. The processor 901 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 901 may also include on-board memory for caching purposes. Processor 901 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. The processor 901 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the program may be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, the input/output (I/O) interface 905 also being connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as needed, so that a computer program read therefrom is installed into the storage section 908 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs that, when executed, implement text deduplication methods according to embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 902 and/or RAM 903 and/or one or more memories other than ROM 902 and RAM 903 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the text deduplication method provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed, and downloaded and installed in the form of a signal on a network medium, via communication portion 909, and/or installed from removable medium 911. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for performing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, and similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in a variety of ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or integrated without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (11)

1. A text deduplication method, the method comprising:
acquiring a text to be processed in response to a data deduplication service request;
calculating a fuzzy hash value of the text to be processed;
querying a database, according to the fuzzy hash value, for text data similar to the text to be processed, wherein the database stores description information, abstract information, and fuzzy hash values of historical text;
if it is determined that text data that repeats the text to be processed exists, marking the state of the text to be processed as repeated; and
performing a deduplication operation on the text whose state is marked as repeated.
2. The method of claim 1, wherein calculating the fuzzy hash value of the text to be processed comprises:
performing a blocking operation on the text to be processed;
calculating a hash value of each block separately; and
generating the fuzzy hash value of the text to be processed according to the hash value of each block.
3. The method of claim 2, wherein performing the blocking operation on the text to be processed comprises:
determining a blocking condition value according to the length of the text to be processed and the actual content of the text to be processed; and
performing the blocking operation on the text to be processed according to the blocking condition value.
4. The method of claim 1, wherein querying the database for text data similar to the text to be processed according to the fuzzy hash value comprises:
calculating the similarity between the text to be processed and the historical text in the database according to the fuzzy hash value; and
if the similarity is greater than a first preset threshold, determining that text data similar to the text to be processed exists.
5. The method of claim 4, wherein calculating the similarity between the text to be processed and the historical text in the database according to the fuzzy hash value comprises:
searching the database for target hash values whose text length difference is less than or equal to a second preset threshold; and
among the target hash values, comparing the similarity of the hash values using the Hamming distance to determine texts whose difference is less than or equal to a third preset threshold.
6. The method according to any one of claims 1 to 5, wherein the method further comprises:
if it is determined that no text data that repeats the text to be processed exists, marking the state of the text to be processed as not repeated.
7. The method of claim 6, further comprising:
storing, in the database, the description information, abstract information, and fuzzy hash value of the text to be processed whose state is marked as not repeated.
8. A text deduplication apparatus, the apparatus comprising:
an acquisition module, configured to acquire a text to be processed in response to a data deduplication service request;
a calculation module, configured to calculate a fuzzy hash value of the text to be processed;
a query module, configured to query a database, according to the fuzzy hash value, for text data similar to the text to be processed, wherein the database stores description information, abstract information, and fuzzy hash values of historical text;
a first determination module, configured to mark the state of the text to be processed as repeated if it is determined that text data that repeats the text to be processed exists; and
a deduplication module, configured to perform a deduplication operation on the text whose state is marked as repeated.
9. An electronic device, comprising:
one or more processors; and
a storage means for storing one or more computer programs,
wherein the one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
11. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410224044.6A CN118095247A (en) 2024-02-28 2024-02-28 Text deduplication method, apparatus, device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410224044.6A CN118095247A (en) 2024-02-28 2024-02-28 Text deduplication method, apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN118095247A true CN118095247A (en) 2024-05-28

Family

ID=91152767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410224044.6A Pending CN118095247A (en) 2024-02-28 2024-02-28 Text deduplication method, apparatus, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN118095247A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination