CN116561110A - Spark-based large-scale data global deduplication method, electronic equipment and medium


Info

Publication number
CN116561110A
Authority
CN
China
Prior art keywords
value
similarity
global
document
storage
Prior art date
Legal status
Pending
Application number
CN202310439940.XA
Other languages
Chinese (zh)
Inventor
邓凌风
张水勇
耿林
徐春香
王晖
余跃
Current Assignee
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority claimed from CN202310439940.XA
Publication of CN116561110A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In the Spark-based large-scale data deduplication method, electronic equipment and storage medium, large-scale corpus data are preprocessed and the resulting first processed documents are stored into different storage partitions. Within each storage partition the first processed documents are grouped, so that a large number of completely unrelated documents are excluded, and similarity detection then yields the similar pairs of each first processed document. The similar pairs are merged at three granularities: document grouping, storage partition, and system global. At the document-grouping and storage-partition granularities the merging runs as a distributed parallel operation, which greatly reduces the amount of computation left for merging at the global granularity, thereby achieving efficient fuzzy deduplication of large-scale data.

Description

Spark-based large-scale data global deduplication method, electronic equipment and medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a Spark-based global deduplication method for large-scale data, an electronic device, and a storage medium.
Background
When a large model is trained, a large amount of corpus data is required as training samples, and these samples are usually obtained by crawling various data sources. On the Internet, a large amount of content in different data sources, or in different partitions of the same data source, is essentially the same and differs only in its mode of expression; such corpus data are duplicates in substance. Because these duplicates are not literally identical, it is difficult for traditional deduplication methods, which directly judge whether two pieces of corpus data are expressed identically, to detect them. If such duplicated corpus data cannot be removed before training, the model easily over-learns the repeated corpus and the training effect deteriorates. Moreover, for a large model the scale of the corpus data required for training is enormous, so how to efficiently deduplicate large-scale similar corpus data, which a traditional single-machine approach cannot handle, is a problem that urgently needs to be solved.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a Spark-based global deduplication method for large-scale data, an electronic device, and a computer-readable storage medium, which merge and deduplicate corpus data at three different granularities, namely grouping, partition, and global, thereby achieving efficient fuzzy deduplication of large-scale corpus data.
A first aspect of the embodiments of the present application provides a Spark-based global deduplication method for large-scale data, including:
preprocessing large-scale data to obtain a plurality of first processing documents, and respectively storing the first processing documents into a plurality of storage partitions;
grouping the first processing documents in each storage partition to obtain a plurality of document groups;
performing similarity detection on the first processed document in each document group to obtain a plurality of similarity pairs;
determining the minimum merging similar pair of each similar pair according to the preconfigured global number of the first processing document, and merging all similar pairs pointing to the same minimum merging similar pair to obtain a grouping similar set;
determining a minimum merging group set of each group similar set according to the global number of the first processing document, merging all group similar sets pointing to the same minimum merging group set to obtain a partition similar set;
determining a minimum merging partition set of each partition similar set according to the global number of the first processing document, merging all partition similar sets pointing to the same minimum merging partition set to obtain a global similar set, and recording the global numbers of all the first processing documents in each global similar set through a second processing document;
removing a first global number in each global similarity set from the second processed document, summarizing all the remaining global numbers in all the global similarity sets to obtain a document number set to be eliminated, and broadcasting the document number set to be eliminated to each storage partition;
and in each storage partition, filtering the first processing document according to the document number set to be eliminated.
In some embodiments, the determining a minimum merging similarity pair of each of the similarity pairs according to the preconfigured global number of the first processing document, and merging all the similarity pairs pointing to the same minimum merging similarity pair to obtain a group similarity set includes:
numbering the similar pairs, and constructing a first hash table according to the global number of the first processing document and the numbers of the similar pairs, wherein a key value of the first hash table is the global number of the first processing document, and a true value of the first hash table is an ordered linked list formed by the numbers of all the similar pairs of the first processing document corresponding to the key value;
and constructing a first dynamic array, assigning values to the first dynamic array according to the first hash table, determining the minimum merging similar pair of each similar pair according to the assignment result, and merging a plurality of similar pairs which point to the same minimum merging similar pair in the same document group to obtain a group similarity set.
In some embodiments, the determining the minimum merging group set of each group similar set according to the global number of the first processing document, and merging all group similar sets pointing to the same minimum merging group set to obtain a partition similar set includes:
numbering the grouping similar sets, and constructing a second hash table according to the global numbers and the numbering of the grouping similar sets, wherein a key value of the second hash table is the global number of the first processing document, and a true value of the second hash table is an ordered linked list formed by the numbers of all the grouping similar sets of the first processing document corresponding to the key value;
and constructing a second dynamic array, assigning values to the second dynamic array according to the second hash table, determining the minimum merging group set of each grouping similarity set according to the assignment result, and merging a plurality of grouping similarity sets pointing to the same minimum merging group set in the same storage partition to obtain a partition similarity set.
In some embodiments, the determining a minimum merging partition set of each partition similarity set according to the global number of the first processing document, merging all partition similarity sets pointing to the same minimum merging partition set to obtain a global similarity set, and recording, by a second processing document, the global numbers of all the first processing documents in each global similarity set, includes:
numbering the partition similarity sets, and constructing a third hash table according to the global numbers and the numbers of the partition similarity sets, wherein a key value of the third hash table is a global number of the first processing document, and a true value of the third hash table is an ordered linked list formed by the numbers of all the partition similarity sets of the first processing document corresponding to the key value;
and constructing a third dynamic array, assigning values to the third dynamic array according to the third hash table, determining the minimum merging partition set of each partition similar set according to the assignment result, merging a plurality of partition similar sets pointing to the same minimum merging partition set to obtain a global similar set, and recording the global numbers of all the first processing documents in each global similar set through a second processing document.
In some embodiments, the constructing a first dynamic array, assigning values to the first dynamic array according to the first hash table, determining the minimum merging similar pair of each similar pair according to the assignment result, and merging multiple similar pairs pointing to the same minimum merging similar pair in the same document group to obtain a group similarity set includes:
constructing and initializing the first dynamic array;
scanning the first hash table to assign values to the first dynamic array, wherein the index value of each item of the first dynamic array after assignment is the number of the similar pair, and the initial value stored by each item is the number of the similar pair positioned at the later item of the index value in the truth value of the first hash table;
obtaining a first storage value of the ith item of the first dynamic array, and enabling the first storage value to be i under the condition that the first storage value is a first preset value, wherein i is a positive integer;
acquiring a second storage value of an item taking the first storage value as an index value in the first dynamic array under the condition that the first storage value is not the first preset value, taking the second storage value as a new first storage value under the condition that the second storage value is not the first preset value, and acquiring the new second storage value by taking the new first storage value as the index value until the second storage value is the first preset value;
if the second storage value is the first preset value, determining that the first storage value is the number of the minimum merging similar pair of the similar pair numbered i;
and merging a plurality of corresponding similar pairs in each document group according to index values of all items with the same storage value in the first dynamic array to obtain the group similarity set.
In some embodiments, the constructing a second dynamic array and assigning values to the second dynamic array according to the second hash table, determining the minimum merging group set of each grouping similarity set according to the assignment result, and merging multiple grouping similarity sets pointing to the same minimum merging group set in the same storage partition to obtain a partition similarity set includes:
constructing and initializing the second dynamic array;
scanning the second hash table to assign a value to the second dynamic array, wherein an index value of each item of the second dynamic array after the value assignment is the number of the grouping similarity set, and an initial value stored by each item is the number of the grouping similarity set positioned at the later item of the index value in a true value of the second hash table;
obtaining a third storage value of the j-th item of the second dynamic array, and enabling the third storage value to be j under the condition that the third storage value is a second preset value, wherein j is a positive integer;
acquiring a fourth storage value of an item taking the third storage value as an index value in the second dynamic array under the condition that the third storage value is not a second preset value, taking the fourth storage value as a new third storage value under the condition that the fourth storage value is not the second preset value, and acquiring the new fourth storage value by taking the new third storage value as the index value until the fourth storage value is the second preset value;
if the fourth storage value is the second preset value, determining that the third storage value is the number of the minimum merging group set of the grouping similarity set numbered j;
and merging the corresponding grouping similarity sets according to index values of all items with the same storage value in the second dynamic array to obtain the partition similarity set.
In some embodiments, the constructing a third dynamic array, assigning values to the third dynamic array according to the third hash table, determining the minimum merging partition set of each partition similarity set according to the assignment result, merging a plurality of partition similarity sets pointing to the same minimum merging partition set to obtain a global similarity set, and recording the global numbers of all the first processing documents included in each global similarity set through a second processing document, includes:
constructing and initializing the third dynamic array;
scanning the third hash table to assign a value to the third dynamic array, wherein the index value of each item of the third dynamic array after the value assignment is the number of the partition similarity set, and the initial value stored by each item is the number of the partition similarity set positioned at the later item of the index value in the true value of the third hash table;
for the k-th item of the third dynamic array, obtaining a fifth storage value of the k-th item, and enabling the fifth storage value to be k under the condition that the fifth storage value is a third preset value, wherein k is a positive integer;
obtaining a sixth storage value of an item taking the fifth storage value as an index value in the third dynamic array when the fifth storage value is not the third preset value, taking the sixth storage value as a new fifth storage value when the sixth storage value is not the third preset value, and obtaining the new sixth storage value by taking the new fifth storage value as the index value until the sixth storage value is the third preset value;
determining that the fifth storage value is the number of the minimum merging partition set of the partition similarity set numbered k under the condition that the sixth storage value is the third preset value;
merging the corresponding partition similarity sets according to index values of all items with the same storage value in the third dynamic array to obtain the global similarity set;
the global numbers of the first processed documents included in each of the global similarity sets are recorded by the second processed documents.
In some embodiments, the preprocessing the large-scale data to obtain a plurality of first processing documents, and storing the first processing documents in a plurality of storage partitions respectively, including:
extracting a plurality of original input documents from the large-scale data, numbering the original input documents, and obtaining the global number of each original input document;
performing word segmentation processing on all the original input documents, and converting the original input documents into word sets comprising a plurality of words;
calculating hash codes of each word;
and generating the first processing documents corresponding to each original input document according to the hash codes and the global numbers, and storing all the first processing documents into each storage partition.
In some embodiments, said grouping the first processed documents in each storage partition to obtain a plurality of document groupings includes:
calculating word frequency of each word in the first processing document, sorting the words in each first processing document according to the word frequency, and performing prefix pruning on the sorted first processing documents to obtain a prefix array of each first processing document;
and in each storage partition, dividing the first processing document comprising at least one same word in the prefix array into the same group to obtain a plurality of document groups.
In some embodiments, the performing similarity detection on the first processed document in each document group to obtain a plurality of similarity pairs includes:
calculating the document fingerprint of each first processed document through a preset algorithm;
determining a Hamming distance between the first processed documents within each of the document groupings based on the document fingerprints;
and dividing the first processing documents with the Hamming distance smaller than a fourth preset value into similar pairs.
A second aspect of an embodiment of the present application proposes an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the Spark-based global deduplication method for large-scale data as described in any of the embodiments of the first aspect when executing the computer program.
A third aspect of the embodiments of the present application proposes a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, which are executable by one or more processors to implement a Spark-based global deduplication method for large-scale data according to any of the embodiments of the first aspect.
The embodiments of the present application provide a Spark-based global deduplication method for large-scale data, an electronic device, and a storage medium. Large-scale corpus data are stored into different storage partitions and grouped within each storage partition, so that a large amount of unrelated corpus data is excluded; similarity detection is performed on the first processed documents in each document grouping to determine whether documents in the same grouping are similar; and the similar pairs are then merged successively at three different granularities: document grouping, storage partition, and system global. Because the similar corpus data are preliminarily merged twice, by a distributed parallel processing method at the two granularities of document grouping and storage partition, before the preliminarily merged results are summarized at the system-global level, a large amount of time is saved when deduplicating large-scale data, and efficient fuzzy deduplication of large-scale corpora is achieved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
Fig. 1 is a flowchart of a global de-duplication method for large-scale data based on Spark provided in this embodiment;
fig. 2 is a flowchart of step S400 in fig. 1;
fig. 3 is a flowchart of step S220 in fig. 2;
fig. 4 is a flowchart of step S500 in fig. 1;
fig. 5 is a flowchart of step S420 in fig. 4;
fig. 6 is a flowchart of step S600 in fig. 1;
fig. 7 is a flowchart of step S620 in fig. 6;
fig. 8 is a schematic structural diagram of an electronic device according to the present embodiment;
The accompanying drawings are included to provide a further understanding of the technical solutions of the present application and constitute a part of this specification; together with the embodiments of the present application, they serve to explain the technical solutions and do not constitute a limitation thereof.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the content and operations/steps, nor must they necessarily be run in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual operations may be changed according to actual situations.
Referring to fig. 1, a first aspect of the embodiments of the present application proposes a Spark-based global de-duplication method for large-scale data, including steps S100 to S800:
step S100, preprocessing large-scale data to obtain a plurality of first processing documents, and respectively storing the first processing documents into a plurality of storage partitions;
In some embodiments, the large-scale data may be a large amount of text data crawled from data sources such as various web pages, the text data comprising a plurality of documents. The step of preprocessing the large-scale data may include extracting a plurality of original input documents from the large-scale data. In some embodiments the corpus is enormous, possibly reaching tens of TB, and cannot be loaded directly into memory; it is therefore kept in external storage, from which the corpus data are read, and each original input document is recorded through a resilient distributed dataset (RDD). All original input documents are numbered to obtain each document's number in the global system; specifically, the global number of each original input document can be obtained by executing a zipWithIndex operation, and each numbered original input document is stored into the storage partitions with the MemoryAndDisk persistence level. In some embodiments, the preprocessing step further includes word segmentation of each original input document: a conversion can be performed through a map operator, and a word segmentation tool splits the original input document into a set of words. In some embodiments, repeated words in the same original input document are recorded only once, avoiding the data redundancy caused by recording the same word repeatedly within one document and thereby improving deduplication efficiency. In some embodiments, the preprocessing step further includes calculating a hash code for each word; specifically, each word can be converted from a string into a 64-bit hash value through the Murmur hashing algorithm, giving the hash code of each word.
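As an illustration of this preprocessing step, a minimal Spark (Scala) sketch follows. The input path, the one-document-per-line layout, and the whitespace tokenizer are assumptions standing in for the unspecified external storage and word segmentation tool, and Scala's 32-bit MurmurHash3 (widened to Long) stands in for the 64-bit Murmur hash described above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import scala.util.hashing.MurmurHash3

object Preprocess {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder.appName("SparkGlobalDedup").getOrCreate().sparkContext

    // Read the corpus from external storage; one original input document per line
    // (hypothetical layout and path).
    val raw = sc.textFile("hdfs:///corpus/raw")

    // zipWithIndex assigns every original input document its global number.
    val numbered = raw.zipWithIndex()

    // Map operator: segment each document into a set of words (a repeated word is
    // recorded only once) and hash every word to a numeric code.
    val firstDocs = numbered.map { case (text, globalId) =>
      val words  = text.split("\\s+").filter(_.nonEmpty).toSet
      val hashes = words.map(w => MurmurHash3.stringHash(w).toLong)
      (globalId, hashes)
    }

    // Keep the first processed documents across stages, spilling to disk when
    // memory runs out (the MemoryAndDisk level mentioned above).
    firstDocs.persist(StorageLevel.MEMORY_AND_DISK)
  }
}
```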
Step S200, grouping the first processed documents in each storage partition to obtain a plurality of document groups;
In some embodiments, step S200 includes: calculating the word frequency of each word in the first processed documents, sorting the words in each first processed document according to word frequency, and performing prefix pruning on the sorted first processed documents to obtain a prefix array of each first processed document; and, in each storage partition, dividing the first processed documents whose prefix arrays share at least one word into the same group to obtain a plurality of document groupings. In this embodiment, all first processed documents in the same storage partition are first expanded through a flatMap operator and the word frequency of each word is initialized to 1; at this point the hash code and corresponding word frequency of each word can be recorded through a first word list. Words with the same hash code are then merged, and the word frequency of each word is calculated from the number of merges, which yields the word frequencies within each storage partition. The first word lists of all storage partitions are then aggregated into a second word list, and the word frequencies of words with the same hash code in the second word list are accumulated, giving the global word frequency of each word. In some embodiments, the second word list may also be broadcast to each storage partition to facilitate the subsequent sorting of the words in the first processed documents within each storage partition.
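A sketch of this word-frequency pass under the same assumptions, with firstDocs as built in the preceding sketch; collecting the vocabulary on the driver assumes it is much smaller than the corpus, which is implied by the broadcast of the second word list.

```scala
// flatMap expands every first processed document into (hash code, 1) records,
// i.e. each word's frequency is initialized to 1; reduceByKey then merges the
// words with identical hash codes and accumulates their frequencies.
val wordFreq = firstDocs
  .flatMap { case (_, hashes) => hashes.iterator.map(h => (h, 1L)) }
  .reduceByKey(_ + _)

// Gather the global vocabulary (the "second word list") on the driver and
// broadcast it to every storage partition for the subsequent word sorting.
val secondVocab = sc.broadcast(wordFreq.collectAsMap())
```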
In some embodiments, step S200 further includes sorting the words in each first processed document by word frequency from small to large, calculating the prefix length of each first processed document through a preset prefix length formula, and intercepting the prefix array from the sorted first processed document. First processed documents whose prefix arrays include at least one word with the same hash code are assigned to the same group: each word in the prefix array is traversed and expanded through a flatMap operation into (hash code, fingerprint) records, forming a third processed document, and a groupByKey operation on the third processed document realizes the document grouping. A large number of unrelated documents are thereby excluded, which effectively reduces the amount of computation in the subsequent similarity detection. Specifically, the prefix length may be calculated by the preset prefix length formula as follows:
j = min{ j : Σ_{i=1}^{j} weight(x[i]) > (1 − t) · σ_x }

wherein x represents the document with the global number x, j represents the prefix length, σ_x represents the sum of the IDF weights of all words in the third processed document, i is an integer in the range [1, j], x[i] represents the i-th word in the third processed document, weight(x[i]) represents the IDF weight of the i-th word in the third processed document, and t represents a preset similarity threshold. The IDF weight is determined from the word frequency of each word over the first processed documents, with reference to the following formula:
idf_i = log( |D| / |{ j : t_i ∈ d_j }| )

wherein idf_i represents the IDF weight of the i-th word, |D| represents the total number of first processed documents, and |{ j : t_i ∈ d_j }| represents the number of first processed documents that contain the i-th word.
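Following the two formulas above, the per-document prefix computation can be sketched as below; idf is an assumed map from a word's hash code to its IDF weight, and sorting by word frequency from small to large is realized as sorting by descending IDF weight.

```scala
// Sort a document's words rarest-first and keep the shortest prefix whose
// accumulated IDF weight exceeds the (1 - t) fraction of the document's total
// weight sigma_x; the result is the document's prefix array.
def prefixArray(words: Seq[Long], idf: Map[Long, Double], t: Double): Seq[Long] = {
  val sorted = words.sortBy(h => -idf(h))                   // ascending frequency
  val sigmaX = sorted.iterator.map(idf).sum                 // total IDF weight
  val cum    = sorted.scanLeft(0.0)((acc, h) => acc + idf(h)).tail
  val j      = cum.indexWhere(_ > (1 - t) * sigmaX) + 1     // prefix length
  sorted.take(if (j > 0) j else sorted.length)
}
```

Grouping then reduces to a flatMap of each document over its prefix words followed by a groupByKey on the word hash, so that documents sharing at least one prefix word land in the same document grouping.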
It will be understood that if the same word appears in a large number of documents, the word is a general-purpose word (for example, the word "first"); its actual relevance to the document content is low, and it can hardly reflect the features of the content. Conversely, when a word has a low frequency and appears in only a small number of documents, the word is highly related to the content of a certain field (for example, the word "concrete" generally appears only in documents of the construction field); such a word is highly relevant to the document content and better reflects its features.
Step S300, performing similarity detection on the first processed document in each document group to obtain a plurality of similarity pairs;
In some embodiments, the document fingerprint of each first processed document may be calculated by a preset algorithm; in this embodiment, the fingerprint is calculated by the simhash algorithm. Similarity verification is then performed within each document grouping: the Hamming distance between the first processed documents is calculated, and first processed documents whose Hamming distance is smaller than a fourth preset value are divided into similar pairs. Specifically, a map operation may be performed on the third processed documents after grouping, and similarity verification may be carried out within each grouping according to the Hamming distance between fingerprints; in this embodiment, the fourth preset value may be 3. The similar pairs in each grouping, together with the numbers of the first processed documents contained in each similar pair, may be recorded through a fourth processed document.
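A compact sketch of the fingerprinting and verification described above; the patent fixes neither the fingerprint width nor the per-word weights, so a 64-bit fingerprint and unit weights are assumed here.

```scala
// simhash: for every bit position, vote +1/-1 according to that bit of each
// word hash, then keep the sign of the vote as the fingerprint bit.
def simHash(wordHashes: Iterable[Long]): Long = {
  val votes = new Array[Int](64)
  for (h <- wordHashes; b <- 0 until 64)
    votes(b) += (if (((h >>> b) & 1L) == 1L) 1 else -1)
  (0 until 64).foldLeft(0L)((fp, b) => if (votes(b) > 0) fp | (1L << b) else fp)
}

// Hamming distance between two fingerprints; a distance below the fourth preset
// value (3 in this embodiment) marks the two documents as a similar pair.
def hamming(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)
```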
It will be appreciated that, in some embodiments, the document fingerprints may be computed in advance, for example right after the first processed documents are segmented and the word frequencies are calculated, with each fingerprint also recorded in its first processed document. Alternatively, the global number and fingerprint of each first processed document may be recorded through a fifth processed document, so that during the subsequent Hamming verification and division into similar pairs, each global number and document fingerprint can be read directly from the fifth processed document instead of repeatedly reading and writing the corresponding first processed document; this saves communication time and reduces the time wasted on memory reads and writes.
Step S400, determining the minimum merging similar pair of each similar pair according to the preconfigured global number of the first processing document, and merging all similar pairs pointing to the same minimum merging similar pair to obtain a grouping similar set;
in this embodiment, inside each document group, all similar pairs including the same first processed document are regarded as similar pairs that can be merged with each other, and are merged, specifically, referring to fig. 2, step S400 includes, but is not limited to, the following steps S210 to S220:
step S210, numbering the similar pairs, and constructing a first hash table according to the global number of the first processing document and the numbers of the similar pairs, wherein the key value of the first hash table is the global number of the first processing document, and the true value of the first hash table is an ordered linked list formed by the numbers of all the similar pairs of the first processing document corresponding to the key value;
It will be understood that if, for example, the similar pairs numbered 1, 3, 7, and 10 each include the first processed document with global number 1, then the key value of the first hash table is set to the global number of that first processed document, the numbers of all similar pairs including that document are formed into an ordered linked list, and the ordered linked list is set as the true value corresponding to the key value; the true value stored for the item with key value 1 in the first hash table is thus the ordered linked list formed by 10, 7, 3, and 1.
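A sketch of this table construction. pairs is assumed to hold the similar pairs of one document grouping, numbered from 1 in scan order, each recording the global numbers of its two documents; prepending while scanning in ascending pair number yields exactly the descending ordered list of the example (10, 7, 3, 1).

```scala
import scala.collection.mutable

// First hash table: key = global number of a first processed document,
// true value = descending ordered list of the numbers of all similar pairs
// that contain this document.
def buildInvertedTable(pairs: IndexedSeq[(Long, Long)]): mutable.Map[Long, List[Int]] = {
  val table = mutable.Map.empty[Long, List[Int]].withDefaultValue(Nil)
  for (((docA, docB), idx) <- pairs.zipWithIndex; pairNum = idx + 1; doc <- Seq(docA, docB))
    table(doc) = pairNum :: table(doc) // larger numbers end up at the head
  table
}
```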
Step S220, a first dynamic array is constructed, the first dynamic array is assigned according to a first hash table, the minimum similarity pair of merging of each similarity pair is determined according to the assignment result, and a plurality of similarity pairs pointing to the same minimum similarity pair in the same document group are merged to obtain a group similarity set.
The minimum merging similar pair of each similar pair is then determined from the assigned first dynamic array; in this example, the similar pairs numbered 10, 7, and 3 are all merged into the similar pair numbered 1, yielding the grouping similar set.
Referring to fig. 3, in some embodiments, step S220 may include, but is not limited to, steps S310 through S360:
step S310, a first dynamic array is constructed and initialized;
It will be appreciated that each item in the first dynamic array is initialized with a storage value of 0.
Step S320, scanning the first hash table to assign values to the first dynamic array, wherein the index value of each item of the assigned first dynamic array is the number of a similar pair, and the initial value stored by each item is the number of a similar pair located at the later item of the index value in the truth value of the first hash table;
in this embodiment, the assignment to the first dynamic array is implemented by scanning the truth value of the hash table, where the truth value of the hash table is specifically an ordered linked list formed by a plurality of numbers of similar pairs including the same first processed document, and the linked list may be an inverted list, where two values that are sequentially adjacent in the ordered linked list are used as the index value and the storage value of the first dynamic array to assign the first dynamic array.
Step S330, for the ith item of the first dynamic array, obtaining a first storage value of the ith item, and enabling the first storage value to be i under the condition that the first storage value is a first preset value, wherein i is a positive integer;
In this embodiment, before the first hash table is scanned to assign values to the first dynamic array, every initial storage value of the first dynamic array is 0; based on this, the first preset value is taken to be 0. The storage value B[i] of the i-th item of the first dynamic array is obtained; if B[i] is 0, then in the truth value of the hash table, i is the rightmost node of its ordered linked list, the similar pair numbered i has, among all similar pairs containing the same first processed document, the minimum number, and can therefore be regarded as the minimum merging similar pair into which it merges itself, so B[i] = i.
Step S340, obtaining a second storage value of the item taking the first storage value as the index value in the first dynamic array under the condition that the first storage value is not the first preset value, taking the second storage value as a new first storage value under the condition that the second storage value is not the first preset value, and obtaining the new second storage value by taking the new first storage value as the index value until the second storage value is the first preset value;
Step S350, in the case that the second storage value is the first preset value, determining that the first storage value is the number of the smallest similarity pair of the merging of the similarity pairs with the similarity pair number i;
In some embodiments, if B[i] is not 0, then in the corresponding ordered linked list the node to the right of i stores a value smaller than i, so i is not the number of the minimum merging similar pair. The stored value of B[i] is then used as a new index value to look up B[B[i]], and B[i] = B[B[i]] is applied repeatedly until B[B[i]] = 0, at which point the stored value of B[i] is the number of the minimum merging similar pair of similar pair i.
Step S360, merging the corresponding multiple similar pairs in each document group according to the index values of all items with the same storage value in the first dynamic array to obtain a group similarity set.
After the minimum merging similar pair of every similar pair has been found through the above steps, the first dynamic array is traversed, and all similar pairs with the same storage value, i.e. those that merge into the same minimum merging similar pair, are merged together to obtain a plurality of grouping similar sets. The preliminary merging of similar pairs within each document grouping is thereby realized; at this point, the numbers of all first processed documents in each grouping similar set may be recorded through a sixth processed document, to facilitate the subsequent merging of the grouping similar sets at the granularity of each storage partition. In addition, steps S310 to S360 may be executed in parallel in a distributed manner across the document groupings, so the total time spent equals the time taken by the document grouping with the largest data volume. Since only documents sharing at least one word in their prefix arrays are divided into the same grouping, the data volume in each grouping is greatly reduced compared with the total volume of documents in the storage partition; the time spent on the above merging step is therefore also greatly reduced, and the merging efficiency is significantly improved.
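Steps S310 to S360 can be condensed into the following routine, written against the table from the previous sketch and reusable unchanged at the other two granularities. The assignment pass keeps the minimum when several linked lists touch the same slot, matching the Min-recording described later; the resolution pass follows stored values to the 0 sentinel, and a separate output array avoids disturbing chains that are still being followed.

```scala
// b is indexed by similar-pair number; 0 is the "rightmost node" sentinel.
// Returns minOf, where minOf(i) is the number of the minimum merging similar
// pair into which pair i merges.
def resolveMinIds(table: collection.Map[Long, List[Int]], n: Int): Array[Int] = {
  val b = new Array[Int](n + 1)
  // Assignment pass: every adjacent (larger, smaller) couple of a descending
  // ordered list writes the smaller number into the larger number's slot.
  for (lst <- table.values; w <- lst.sliding(2) if w.size == 2) {
    val (hi, lo) = (w.head, w(1))
    if (b(hi) == 0 || lo < b(hi)) b(hi) = lo
  }
  // Resolution pass: follow the chain B[i], B[B[i]], ... until the sentinel;
  // the chains strictly decrease, so they always terminate.
  val minOf = new Array[Int](n + 1)
  for (i <- 1 to n) {
    var cur = i
    while (b(cur) != 0) cur = b(cur)
    minOf(i) = cur
  }
  minOf
}

// Merging: pairs (or sets) with the same resolved minimum form one similar set.
def mergeByMin(minOf: Array[Int], n: Int): Map[Int, Seq[Int]] =
  (1 to n).groupBy(i => minOf(i))
```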
Step S500, determining a minimum merging group set of each group similar set according to the global number of the first processing document, merging all group similar sets pointing to the same minimum merging group set to obtain a partition similar set;
In this embodiment, after the similar pairs have been merged within each document grouping, the grouping similar sets of different document groupings need to be merged further. It will be understood that when the first processed documents in each storage partition are grouped, documents including at least one common prefix word are divided into the same grouping; when a first processed document shares different words with several other documents, the same document is divided into multiple document groupings. Therefore, in this embodiment, the grouping similar sets that include the same first processed document can be merged again at the level of the storage partition. Specifically, step S500 includes, but is not limited to, steps S410 to S420:
step S410, numbering the grouping similarity sets, and constructing a second hash table according to the global numbers and the numbers of the grouping similarity sets, wherein key values of the second hash table are global numbers of the first processing document, and true values of the second hash table are ordered linked lists formed by the numbers of all grouping similarity sets of the first processing document corresponding to the key values;
In some embodiments, a mapPartitions operation may be performed on the sixth processed document to aggregate all document groupings in a storage partition, and the grouping similar sets in the same storage partition are numbered from 1. The global number of the first processed document is used as the key value of the second hash table, the numbers of all grouping similar sets including that document are formed into an ordered linked list, and the ordered linked list is used as the true value corresponding to the key value. For example, if the grouping similar sets numbered 1, 3, 7, and 10 each include the first processed document with global number 1, the true value stored for the item with key value 1 in the second hash table is the ordered linked list formed by 10, 7, 3, and 1.
Step S420, constructing a second dynamic array, assigning values to the second dynamic array according to the second hash table, determining the minimum merging group set of each grouping similarity set according to the assignment result, and merging a plurality of grouping similarity sets pointing to the same minimum merging group set in the same storage partition to obtain the partition similarity sets.
Referring to FIG. 5, in some embodiments, step S420 includes, but is not limited to, the following steps S510 to S560:
Step S510, constructing and initializing a second dynamic array;
It will be appreciated that each item in the second dynamic array is initialized with a storage value of 0.
Step S520, scanning the second hash table to assign values to the second dynamic array, wherein the index value of each item of the assigned second dynamic array is the number of the group similarity set, and the initial value stored by each item is the number of the group similarity set located at the later item of the index value in the true value of the second hash table;
in this embodiment, the assignment to the second dynamic array is implemented by scanning the truth value of the hash table, where the truth value of the hash table is specifically an ordered linked list formed by the numbers of multiple similar sets of packets including the same first processed document, where the linked list may be an inverted list, and two values that are sequentially adjacent in the ordered linked list are used as the index value and the storage value of the second dynamic array to assign the second dynamic array.
Step S530, for the j-th item of the second dynamic array, obtaining a third storage value of the j-th item, and letting the third storage value be j when the third storage value is a second preset value, wherein j is a positive integer;
In this embodiment, before the second hash table is scanned to assign values to the second dynamic array, every initial storage value of the second dynamic array is 0; based on this, the second preset value is taken to be 0. The storage value B[j] of the j-th item of the second dynamic array is obtained; if B[j] is 0, then in the truth value of the hash table, j is the rightmost node of its ordered linked list, the grouping similar set numbered j has, among all grouping similar sets containing the same first processed document, the minimum number, and can therefore be regarded as the minimum merging group set into which it merges itself, so B[j] = j.
Step S540, obtaining a fourth storage value of the item taking the third storage value as the index value in the second dynamic array under the condition that the third storage value is not the second preset value, taking the fourth storage value as a new third storage value under the condition that the fourth storage value is not the second preset value, and obtaining the new fourth storage value by taking the new third storage value as the index value until the fourth storage value is the second preset value;
step S550, in the case that the fourth storage value is the second preset value, determining that the third storage value is the number of the minimum merging group set of the grouping similarity set numbered j;
In some embodiments, if B[j] is not 0, then in the corresponding ordered linked list the node to the right of j stores a value smaller than j, so j is not the number of the minimum merging group set. The stored value of B[j] is then used as a new index value to look up B[B[j]], and B[j] = B[B[j]] is applied repeatedly until B[B[j]] = 0, at which point the stored value of B[j] is the number of the minimum merging group set into which grouping similar set j can merge.
Step S560, merging the corresponding multiple grouping similarity sets in each storage partition according to the index values of all items with the same storage value in the second dynamic array to obtain the partition similarity set.
After the minimum merging group set of every grouping similar set has been found through the above steps, the second dynamic array is traversed, and all grouping similar sets with the same storage value, i.e. the same minimum merging group set, are merged to obtain a plurality of partition similar sets. It can be understood that since the minimum merging group set of each grouping similar set within a storage partition is unique, the partition similar sets obtained in the same storage partition have no intersection. The grouping similar sets are thereby preliminarily merged within each storage partition; at this point, the numbers of all first processed documents in each partition similar set can be recorded through a seventh processed document, to facilitate the subsequent merging at the global granularity of the system. In addition, steps S510 to S560 may be executed in parallel in a distributed manner across the storage partitions, so the total time spent equals the time taken by the storage partition with the largest data volume. Since the similar pairs have already been preliminarily merged at the document-grouping granularity, the data volume to be processed at the storage-partition granularity is greatly reduced relative to the total number of first processed documents; the time spent on the above merging step is therefore also greatly reduced, and the merging efficiency is significantly improved.
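A hypothetical sketch of this partition-level pass, reusing resolveMinIds and mergeByMin from the previous sketch inside mapPartitions so that every storage partition merges its own grouping similar sets in parallel. groupSets is assumed to be an RDD whose elements are grouping similar sets (sets of document global numbers) already residing in their home partitions.

```scala
val partitionSets = groupSets.mapPartitions { it =>
  val sets = it.toIndexedSeq // the grouping similar sets of this storage partition
  // Second hash table: document global number -> descending list of set numbers.
  val table = scala.collection.mutable.Map.empty[Long, List[Int]].withDefaultValue(Nil)
  for ((s, idx) <- sets.zipWithIndex; setNum = idx + 1; doc <- s)
    table(doc) = setNum :: table(doc)
  // Resolve each set's minimum merging group set and union the members.
  val minOf = resolveMinIds(table, sets.length)
  mergeByMin(minOf, sets.length).valuesIterator
    .map(nums => nums.iterator.flatMap(i => sets(i - 1)).toSet)
}
```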
Step S600, determining a minimum merging partition set of each partition similar set according to the global number of the first processing document, merging all partition similar sets pointing to the same minimum merging partition set to obtain a global similar set, and recording the global numbers of all the first processing documents in each global similar set through the second processing document;
After the grouping similar sets have been merged at the storage-partition granularity, the partition similar sets can be merged further at the global granularity of the system. With reference to the above embodiments, a Collect operation is first performed on the seventh processed document to summarize all partition similar sets at the Master node. Referring to fig. 6, step S600 includes, but is not limited to, the following steps S610 to S620.
Step S610, numbering the partition similarity sets, and constructing a third hash table according to the global numbers and the numbers of the partition similarity sets, wherein the key value of the third hash table is the global number of the first processing document, and the true value of the third hash table is an ordered linked list formed by the numbers of all the partition similarity sets of the first processing document corresponding to the key value;
All partition similar sets are numbered from 1, the global number of the first processed document is used as the key value of the third hash table, and the numbers of all partition similar sets including that document are formed into an ordered linked list which serves as the true value corresponding to the key value. For example, if the partition similar sets numbered 1, 3, 7, and 10 each include the first processed document with global number 1, the true value stored for the item with key value 1 in the third hash table is the ordered linked list formed by 10, 7, 3, and 1.
In step S620, a third dynamic array is constructed, values are assigned to the third dynamic array according to the third hash table, the minimum merging partition set of each partition similarity set is determined according to the assignment result, a plurality of partition similarity sets pointing to the same minimum merging partition set are merged to obtain the global similarity sets, and the global numbers of all first processed documents in each global similarity set are recorded through the second processed document.
Referring to FIG. 7, in some embodiments, step S620 includes, but is not limited to, the following steps S710 to S770:
Step S710, constructing and initializing a third dynamic array;
It will be appreciated that each item in the third dynamic array is initialized with a storage value of 0.
Step S720, scanning a third hash table to assign values to the third dynamic array, wherein the index value of each item of the assigned third dynamic array is the number of the partition similarity set, and the initial value stored by each item is the number of the partition similarity set positioned at the later item of the index value in the true value of the third hash table;
in this embodiment, the third dynamic array is assigned by scanning the truth value of the hash table, where the truth value of the hash table is an ordered linked list formed by the numbers of the multiple partition similarity sets including the same first processed document, and the linked list may be an inverted list, where two values sequentially adjacent to each other in the ordered linked list are used as the index value and the storage value of the third dynamic array to assign a value to the third dynamic array.
Step S730, for the kth item of the third dynamic array, obtaining a fifth storage value of the kth item, and when the fifth storage value is a third preset value, making the fifth storage value be k, wherein k is a positive integer;
In this embodiment, before the third hash table is scanned to assign values to the third dynamic array, every initial storage value of the third dynamic array is 0; based on this, the third preset value is taken to be 0. The storage value B[k] of the k-th item of the third dynamic array is obtained; if B[k] is 0, then in the truth value of the hash table, k is the rightmost node of its ordered linked list, the partition similar set numbered k has, among all partition similar sets containing the same first processed document, the minimum number, and can therefore be regarded as the minimum merging partition set into which it merges itself, so B[k] = k.
Step S740, obtaining a sixth storage value of the item taking the fifth storage value as the index value in the third dynamic array when the fifth storage value is not the third preset value, taking the sixth storage value as a new fifth storage value when the sixth storage value is not the third preset value, and obtaining the new sixth storage value by taking the new fifth storage value as the index value until the sixth storage value is the third preset value;
Step S750, when the sixth storage value is the third preset value, determining that the fifth storage value is the number of the smallest merging partition set of the partition similarity set merging with the partition similarity set number k;
In some embodiments, if B[k] is not 0, then in the corresponding ordered linked list the node to the right of k stores a value smaller than k, so k is not the number of the minimum merging partition set. The stored value of B[k] is then used as a new index value to look up B[B[k]], and B[k] = B[B[k]] is applied repeatedly until B[B[k]] = 0, at which point the stored value of B[k] is the number of the minimum merging partition set into which partition similar set k can merge.
Step S760, merging the plurality of partition similarity sets according to the index values of all items with the same storage value in the third dynamic array to obtain a global similarity set;
after the minimum merging partition set which can be merged by each partition similar set is found through the steps, traversing the third dynamic array, merging all partition similar sets which can be merged to the same minimum merging partition set with the same storage value, and obtaining a plurality of global similar sets.
In step S770, the global numbers of the first processed documents included in each global similarity set are recorded by the second processed documents.
After merging the partition similarity sets at the Master node to obtain a plurality of global similarity sets, the numbers of all the first processing documents in each global similarity set can be recorded through the second processing documents.
In this embodiment, the partition similar sets of all storage partitions are aggregated at the global Master node and then merged. Because the first processed documents have already undergone preliminary merging at the two granularities of document grouping and storage partition, the amount of data gathered at the global Master node for processing is greatly reduced compared with the original number of first processed documents; the time required for the merging is correspondingly reduced, and the merging efficiency is significantly improved.
Step S700, removing a first global number in each global similarity set from the second processing document, summarizing all the remaining global numbers in all the global similarity sets to obtain a document number set to be eliminated, and broadcasting the document number set to be eliminated to each storage partition;
Step S800, in each storage partition, filtering the first processing documents according to the document number set to be eliminated.
It can be understood that the second processing document records the global numbers of all the first processing documents in each global similarity set. The first processing documents in the same global similarity set are similar to one another, so only one of them needs to be retained; the others can be regarded as repeated, redundant data to be deduplicated. Therefore, in this embodiment, the first global number in each global similarity set is removed, the remaining global numbers are summarized to obtain the set of document numbers to be eliminated, and this document number set is broadcast to each storage partition. In each storage partition, a filtering operation is performed on the first processing documents according to the document number set to be eliminated: every first processing document whose global number appears in the set is filtered out, which completes the deduplication. It can be appreciated that, compared with unified filtering at the global level, performing step S800 as a distributed parallel operation in each storage partition effectively reduces the filtering load of each storage partition and improves deduplication efficiency.
In some embodiments, the first processing documents remaining after filtering may also be saved from each storage partition to the file system, so that the deduplicated high-quality corpus data is preserved.
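As an illustration of steps S700 and S800, the broadcast-and-filter flow might look like the following PySpark sketch; the example inputs, the (global_number, text) field layout and the output path are assumptions of this sketch rather than details from the patent.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("global-dedup").getOrCreate()
sc = spark.sparkContext

# Illustrative inputs: global similarity sets computed at the Master node,
# and first processing documents as (global_number, text) pairs.
global_similarity_sets = [{0, 3, 7}, {2, 5}]
docs_rdd = sc.parallelize([(i, "doc-%d" % i) for i in range(10)], 4)

# Step S700: keep one document per global similarity set; the remaining
# numbers form the set of document numbers to be eliminated.
to_eliminate = set()
for similar_set in global_similarity_sets:
    keep, *drop = sorted(similar_set)   # "keep" is the retained document
    to_eliminate.update(drop)

bc = sc.broadcast(to_eliminate)   # sent once to every storage partition

# Step S800: each storage partition filters its own documents in parallel.
deduped = docs_rdd.filter(lambda doc: doc[0] not in bc.value)
deduped.saveAsTextFile("hdfs:///corpus/deduped")   # hypothetical output path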
The embodiment of the application provides a Spark-based large-scale data deduplication method, which realizes a distributed similar-pair elimination process from local to global. After the Hamming distance check, a large number of similar pairs are obtained, and these similar pairs are merged globally at three granularities.

In the first stage, within each group, all similar-pair results are scanned and each similar pair is given a number. At the same time, an inverted list recording document number to similar-pair numbers is constructed dynamically; it records in which similar pairs each document appears, and the list corresponding to each document is in ascending order. In addition, an array B records, for each similar pair, the minimum number of a similar pair it can be merged into. After the inverted list is built, each of its lists is scanned item by item: assuming the first item number of a list is F, B[F] = min(B[F], F) is recorded, and thereafter, for each subsequent item number i of the list, B[i] = min(B[i], B[F]) is recorded. After the scan, the B array holds the minimum number of the similar pair into which each similar pair can be merged; the B array is traversed, and similar pairs with the same B value are merged together, yielding the merged similarity sets of each group. These similarity sets are pairwise disjoint, and their number is markedly smaller than the number of similar pairs.

In the second stage, merging is performed at the granularity of the storage partition: based on the similarity sets obtained in each group, an inverted list and a B array are constructed in the same way, with the group similarity sets as objects, and the in-group flow is repeated at partition granularity to obtain disjoint partition-level similarity sets.

In the third stage, the results of all partitions are summarized at the Spark Master node and processed in the same way as within the partitions, further merging the disjoint similarity sets to finally obtain the global similarity sets; at the end, only one arbitrary document in each global similarity set is retained, giving the global deduplication result.

Because the merging at the two granularities of document grouping and storage partition is processed in parallel in the distributed environment, the number of sets finally converging to the Master node is small, and large-scale data sets can be handled easily. On this basis, the Spark-based global deduplication method for large-scale data provided by this embodiment effectively solves the computation bottleneck of global fuzzy-deduplication merging, greatly reduces the time required for fuzzy deduplication of large-scale data, and improves deduplication efficiency.
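The first-stage scan described above can be sketched minimally in Python as follows; the pair representation, the sentinel initialization of B (standing in for the patent's preset value) and all names are assumptions of this illustration.

from collections import defaultdict

def stage_one_b_array(similar_pairs):
    """similar_pairs: list of (doc_a, doc_b) global-number tuples; a pair's
    number is its position in the list."""
    # Inverted list: document global number -> ascending similar-pair numbers.
    inverted = defaultdict(list)
    for pair_no, (a, b) in enumerate(similar_pairs):
        inverted[a].append(pair_no)   # appended in scan order, so ascending
        inverted[b].append(pair_no)

    sentinel = len(similar_pairs)     # larger than any valid pair number
    B = [sentinel] * len(similar_pairs)
    for pair_list in inverted.values():
        f = pair_list[0]              # F: smallest pair number in this list
        B[f] = min(B[f], f)
        for i in pair_list[1:]:
            B[i] = min(B[i], B[f])    # pair i can merge into pair B[F]
    return B                          # B[i]: minimum mergeable pair number

Similar pairs whose B entries resolve to the same number are then merged into one group similarity set, and the same construction is repeated at storage-partition and global granularity.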
Referring to fig. 8, an embodiment of the present application further provides an electronic device 800, including:
at least one processor, and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to perform the method of any one of the embodiments of the present application.
The hardware configuration of the electronic device will be described in detail with reference to fig. 8. The electronic device includes: processor 810, memory 820, input/output interface 830, communication interface 840 and bus 850.
The processor 810 may be implemented by a general purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., for executing related programs to implement the technical solutions provided in the embodiments of the present application;
the memory 820 may be implemented in the form of a read-only memory (Read-Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 820 may store an operating system and other application programs; when the technical solutions provided by the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 820 and invoked by the processor 810 to perform the Spark-based global deduplication method of the embodiments of the present application;
An input/output interface 830 for implementing information input and output;
the communication interface 840 is configured to implement communication interaction between this device and other devices, where the communication may be implemented in a wired manner (for example, USB or a network cable) or in a wireless manner (for example, a mobile network, Wi-Fi or Bluetooth);
bus 850 transfers information between the various components of the device (e.g., processor 810, memory 820, input/output interface 830, and communication interface 840);
wherein processor 810, memory 820, input/output interface 830, and communication interface 840 enable communication connections among each other within the device via bus 850.
The embodiments of the present application also provide a storage medium, which is a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the Spark-based global deduplication method of the embodiments of the present application.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described herein are intended to describe the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application; as those skilled in the art will appreciate, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in FIG. 1 to FIG. 8 do not limit the embodiments of the present application, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
It should be understood that in this application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be single or plural.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, which does not thereby limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (12)

1. A Spark-based global de-duplication method for large-scale data, the method comprising:
preprocessing large-scale data to obtain a plurality of first processing documents, and respectively storing the first processing documents into a plurality of storage partitions;
grouping the first processing documents in each storage partition to obtain a plurality of document groups;
performing similarity detection on the first processed document in each document group to obtain a plurality of similarity pairs;
determining the minimum merging similar pair of each similar pair according to the preconfigured global number of the first processing document, and merging all similar pairs pointing to the same minimum merging similar pair to obtain a grouping similar set;
determining a minimum merging group set of each group similar set according to the global number of the first processing document, merging all group similar sets pointing to the same minimum merging group set to obtain a partition similar set;
Determining a minimum merging partition set of each partition similar set according to the global number of the first processing document, merging all partition similar sets pointing to the same minimum merging partition set to obtain a global similar set, and recording the global numbers of all the first processing documents in each global similar set through a second processing document;
removing a first global number in each global similarity set from the second processed document, summarizing all the remaining global numbers in all the global similarity sets to obtain a document number set to be eliminated, and broadcasting the document number set to be eliminated to each storage partition;
and in each storage partition, filtering the first processing document according to the document number set to be eliminated.
2. The method of claim 1, wherein determining a minimum merging similarity pair for each of the similarity pairs according to the preconfigured global number of the first processing document, and merging all the similarity pairs pointing to the same minimum merging similarity pair to obtain a group similarity set, comprises:
numbering the similarity pairs, and constructing a first hash table according to the global number of the first processing document and the numbers of the similarity pairs, wherein a key of the first hash table is the global number of the first processing document, and a value of the first hash table is an ordered linked list formed by the numbers of all the similarity pairs of the first processing document corresponding to the key;
and constructing a first dynamic array, assigning a value to the first dynamic array according to the first hash table, determining the minimum merging similarity pair of each similarity pair according to an assignment result, and merging a plurality of similarity pairs which point to the same minimum merging similarity pair in the same document group to obtain a group similarity set.
3. The method of claim 1, wherein determining a minimum merging group set for each of the group similarity sets according to the global number of the first processing document, and merging all group similarity sets that point to the same minimum merging group set to obtain a partition similarity set, comprises:
numbering the group similarity sets, and constructing a second hash table according to the global numbers and the numbers of the group similarity sets, wherein a key of the second hash table is the global number of the first processing document, and a value of the second hash table is an ordered linked list formed by the numbers of all the group similarity sets of the first processing document corresponding to the key;
and constructing a second dynamic array, assigning a value to the second dynamic array according to the second hash table, determining the minimum merging group set of each group similarity set according to an assignment result, and merging a plurality of group similarity sets pointing to the same minimum merging group set in the same storage partition to obtain a partition similarity set.
4. The method of claim 1, wherein determining a minimum merging partition set for each of the partition similarity sets according to the global numbers of the first processing documents, merging all partition similarity sets pointing to the same minimum merging partition set to obtain a global similarity set, and recording, by a second processing document, the global numbers of all the first processing documents included in each global similarity set, comprises:
numbering the partition similarity sets, and constructing a third hash table according to the global numbers and the numbers of the partition similarity sets, wherein a key of the third hash table is the global number of the first processing document, and a value of the third hash table is an ordered linked list formed by the numbers of all the partition similarity sets of the first processing document corresponding to the key;
and constructing a third dynamic array, assigning a value to the third dynamic array according to the third hash table, determining a minimum global set for merging each partition similar set according to an assignment result, merging a plurality of partition similar sets pointing to the same minimum global set to obtain a global similar set, and recording global numbers of all the first processing documents in each global similar set through a second processing document.
5. The method of claim 2, wherein constructing a first dynamic array, assigning a value to the first dynamic array according to the first hash table, determining the minimum merging similarity pair of each similarity pair according to the assignment result, and merging a plurality of similarity pairs within the same document group that point to the same minimum merging similarity pair to obtain a group similarity set, comprises:
constructing and initializing the first dynamic array;
scanning the first hash table to assign values to the first dynamic array, wherein the index value of each item of the first dynamic array after assignment is the number of a similarity pair, and the initial value stored by each item is the number of the similarity pair positioned at the item following the index value in the value of the first hash table;
obtaining a first storage value of the ith item of the first dynamic array, and setting the first storage value to i under the condition that the first storage value is a first preset value, wherein i is a positive integer;
acquiring a second storage value of an item taking the first storage value as an index value in the first dynamic array under the condition that the first storage value is not the first preset value, taking the second storage value as a new first storage value under the condition that the second storage value is not the first preset value, and acquiring the new second storage value by taking the new first storage value as the index value until the second storage value is the first preset value;
if the second storage value is the first preset value, determining that the first storage value is the number of the minimum merging similarity pair of the similarity pair numbered i;
and merging a plurality of corresponding similar pairs in each document group according to index values of all items with the same storage value in the first dynamic array to obtain the group similarity set.
6. The method of claim 3, wherein constructing a second dynamic array, assigning a value to the second dynamic array according to the second hash table, determining the minimum merging group set of each group similarity set according to the assignment result, and merging multiple group similarity sets pointing to the same minimum merging group set in the same storage partition to obtain a partition similarity set, comprises:
constructing and initializing the second dynamic array;
scanning the second hash table to assign values to the second dynamic array, wherein the index value of each item of the second dynamic array after assignment is the number of a group similarity set, and the initial value stored by each item is the number of the group similarity set positioned at the item following the index value in the value of the second hash table;
obtaining a third storage value of the jth item of the second dynamic array, and setting the third storage value to j under the condition that the third storage value is a second preset value, wherein j is a positive integer;
Acquiring a fourth storage value of an item taking the third storage value as an index value in the second dynamic array under the condition that the third storage value is not a second preset value, taking the fourth storage value as a new third storage value under the condition that the fourth storage value is not the second preset value, and acquiring the new fourth storage value by taking the new third storage value as the index value until the fourth storage value is the second preset value;
if the fourth storage value is the second preset value, determining that the third storage value is the number of the minimum merging group set of the group similarity set numbered j;
and merging the corresponding grouping similarity sets according to index values of all items with the same storage value in the second dynamic array to obtain the partition similarity set.
7. The method of claim 4, wherein constructing a third dynamic array, assigning a value to the third dynamic array according to the third hash table, determining a minimum global set for merging each of the partition similarity sets according to the assignment result, merging a plurality of partition similarity sets pointing to the same minimum global set to obtain a global similarity set, and recording, by a second processing document, the global numbers of all the first processing documents included in each of the global similarity sets, comprises:
Constructing and initializing the third dynamic array;
scanning the third hash table to assign values to the third dynamic array, wherein the index value of each item of the third dynamic array after assignment is the number of a partition similarity set, and the initial value stored by each item is the number of the partition similarity set positioned at the item following the index value in the value of the third hash table;
obtaining, for the kth item of the third dynamic array, a fifth storage value of the kth item, and setting the fifth storage value to k under the condition that the fifth storage value is a third preset value, wherein k is a positive integer;
obtaining a sixth storage value of an item taking the fifth storage value as an index value in the third dynamic array when the fifth storage value is not the third preset value, taking the sixth storage value as a new fifth storage value when the sixth storage value is not the third preset value, and obtaining the new sixth storage value by taking the new fifth storage value as the index value until the sixth storage value is the third preset value;
determining, under the condition that the sixth storage value is the third preset value, that the fifth storage value is the number of the minimum global set of the partition similarity set numbered k;
Merging the corresponding partition similarity sets according to index values of all items with the same storage value in the third dynamic array to obtain the global similarity set;
the global numbers of the first processing documents included in each of the global similarity sets are recorded by the second processing document.
8. The method of claim 1, wherein preprocessing the large-scale data to obtain a plurality of first processing documents and storing the first processing documents in a plurality of storage partitions, respectively, comprises:
extracting a plurality of original input documents from the large-scale data, numbering the original input documents, and obtaining the global number of each original input document;
performing word segmentation processing on all the original input documents, and converting the original input documents into word sets comprising a plurality of words;
calculating hash codes of each word;
and generating the first processing document corresponding to each original input document according to the hash codes and the global numbers, and storing all the first processing documents into the plurality of storage partitions.
9. The method of claim 1, wherein grouping the first processing documents in each storage partition to obtain a plurality of document groups comprises:
Calculating word frequency of each word in the first processing document, sorting the words in each first processing document according to the word frequency, and performing prefix pruning on the sorted first processing documents to obtain a prefix array of each first processing document;
and in each storage partition, dividing the first processing document comprising at least one same word in the prefix array into the same group to obtain a plurality of document groups.
10. The method of claim 1, wherein performing similarity detection on the first processing documents within each of the document groups to obtain a plurality of similarity pairs comprises:
calculating the document fingerprint of each first processing document through a preset algorithm;
determining a Hamming distance between the first processing documents within each of the document groups based on the document fingerprints;
and dividing the first processing documents with the Hamming distance smaller than a fourth preset value into similar pairs.
11. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements a Spark-based global deduplication method for large-scale data according to any of claims 1 to 10 when executing the computer program.
12. A computer readable storage medium storing one or more programs executable by one or more processors to implement the Spark-based global deduplication method of large-scale data according to any of claims 1 to 10.
CN202310439940.XA (filed 2023-04-18) Spark-based large-scale data global deduplication method, electronic equipment and medium; legal status: Pending

Publications (1)

Publication Number CN116561110A, Publication Date 2023-08-08

Family

ID=87497502

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination