CN109828968B

CN109828968B - Data deduplication processing method, device, equipment, cluster and storage medium

Info

Publication number: CN109828968B
Application number: CN201910121676.9A
Authority: CN
Inventors: 陶胜; 仇贲
Original assignee: Guangzhou Huya Information Technology Co Ltd
Current assignee: Guangzhou Huya Information Technology Co Ltd
Priority date: 2019-02-19
Filing date: 2019-02-19
Publication date: 2021-12-21
Anticipated expiration: 2039-02-19
Also published as: CN109828968A

Abstract

The embodiment of the invention discloses a data deduplication processing method, a data deduplication processing device, data deduplication processing equipment, a cluster and a storage medium, wherein the method is applied to a main node in the cluster, and comprises the following steps: acquiring a data query request, wherein the data query request comprises a query data table, a grouping field name and a duplication removal field name; distributing the data in the query data table to at least two data nodes for grouping and deduplication processing, and acquiring deduplication results formed by the at least two data nodes; the data node is used for storing the distributed data in a plurality of data grouping tables according to the field values of the grouping field names and the duplication removal field names, adding row identifiers for the data in the data grouping tables, and performing grouping duplication removal on the data in the data grouping tables according to the grouping field names and the row identifiers to form duplication removal results; different data nodes perform row identification addition operations of different data grouping tables. The technical scheme provided by the embodiment of the invention can save time and improve efficiency.

Description

Data deduplication processing method, device, equipment, cluster and storage medium

Technical Field

The embodiment of the invention relates to the technical field of data analysis, in particular to a data deduplication processing method, device, equipment, cluster and storage medium.

Background

In recent years, with the rapid development and widespread application of computers and information technologies, various data are generated, and a large amount of repeated data also exists. Under the condition of continuous data growth, how to eliminate repeated data becomes a business demand which needs to be solved urgently in the field of data analysis.

In the prior art, when a large amount of data is deduplicated, the deduplication task is allocated to a cluster; the data nodes in the cluster group the data to be deduplicated based on a grouping index, each data node matches the data in its corresponding data groups one by one, and duplicate data having the same grouping index and the same deduplication index are removed, thereby completing deduplication of the data. For example, when the number of users of each product needs to be counted, the deduplication task may be allocated to a cluster; the data nodes in the cluster group the data by product, each data node matches the data in the data groups of the corresponding products one by one, and redundant data of the same user of the same product is removed, so that the user data of each product is obtained and the number of users can be counted. However, this method takes a long time, and especially when the amount of data is large, it wastes more time and is inefficient.

Disclosure of Invention

The embodiment of the invention provides a data deduplication processing method, a data deduplication processing device, data deduplication processing equipment, a data deduplication cluster and a storage medium, which can save time and improve efficiency.

In a first aspect, an embodiment of the present invention provides a data deduplication processing method, where the method is applied to a master node in a cluster, and the method includes:

acquiring a data query request, wherein the data query request comprises a query data table, a grouping field name and a duplication removal field name;

distributing the data in the query data table to at least two data nodes for carrying out grouping deduplication processing, and acquiring deduplication results formed by the at least two data nodes;

the data node is used for storing distributed data in a plurality of data grouping tables according to field values of grouping field names and de-duplication field names, adding row identifiers for the data in the data grouping tables, and performing grouping de-duplication on the data in the data grouping tables according to the grouping field names and the row identifiers to form de-duplication results; different data nodes perform row identification addition operations of different data grouping tables.

In a second aspect, an embodiment of the present invention provides a data deduplication processing method, where the method is applied to a data node set in a cluster, where the data node set includes at least two data nodes, and the method includes:

acquiring data distributed by a main node, wherein the distributed data is data distributed by the main node according to a query data table included in a data query request, and the data query request includes the query data table, a grouping field name and a duplication removal field name;

storing the distributed data in a plurality of data grouping tables according to the grouping field names and the field values of the duplication removal field names;

adding row identification to the data in the data grouping table; wherein, different data nodes execute row identification adding operation of different data grouping tables;

and carrying out grouping duplicate removal on the data in the data grouping table according to the grouping field names and the row identifiers to form a duplicate removal result and feeding the duplicate removal result back to the main node.

In a third aspect, an embodiment of the present invention provides a data deduplication processing method, where the method is applied to a cluster, where the cluster includes a master node and a data node set, and the data node set includes at least two data nodes, and the method includes:

the method comprises the steps that a main node obtains a data query request, wherein the data query request comprises a query data table, a grouping field name and a duplication removal field name;

the main node distributes the data in the query data table to a data node set;

the data node set acquires data distributed by the main node;

the data node set stores the distributed data in a plurality of data grouping tables according to the field values of the grouping field names and the de-duplication field names;

the data node set adds row identification to the data in the data grouping table; wherein, different data nodes execute row identification adding operation of different data grouping tables;

and the data node set carries out grouping duplicate removal on the data in the data grouping table according to the grouping field names and the row identifiers to form duplicate removal results and feeds the duplicate removal results back to the main node.

In a fourth aspect, an embodiment of the present invention provides a data deduplication processing apparatus, including:

the request acquisition module is used for acquiring a data query request, wherein the data query request comprises a query data table, a grouping field name and a duplication removal field name;

the result acquisition module is used for the main node to distribute the data in the query data table to at least two data nodes for grouping and deduplication processing and acquire deduplication results formed by the at least two data nodes;

In a fifth aspect, an embodiment of the present invention provides a cluster, including a master node and a data node set, where the data node set includes at least two data nodes;

the main node is used for acquiring a data query request, wherein the data query request comprises a query data table, a grouping field name and a duplication removal field name;

the main node is further used for allocating the data in the query data table to a data node set;

the data node set is used for acquiring data distributed by the main node;

the data node set is also used for respectively storing the distributed data in a plurality of data grouping tables according to the field values of the grouping field names and the duplication removal field names;

the data node set is also used for adding row identifiers to the data in the data grouping table; wherein, different data nodes execute row identification adding operation of different data grouping tables;

and the data node set is also used for carrying out grouping duplicate removal on the data in the data grouping table according to the grouping field names and the row identifiers to form a duplicate removal result which is fed back to the main node.

In a sixth aspect, an embodiment of the present invention provides an apparatus, including:

one or more processors;

a storage device for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors implement a data deduplication processing method provided by the embodiment of the present invention.

In a seventh aspect, a computer-readable storage medium is provided in an embodiment of the present invention, and has a computer program stored thereon, where the computer program is executed by a processor to implement a data deduplication processing method provided in an embodiment of the present invention.

According to the technical scheme provided by the embodiment of the invention, the main node distributes the data of the query data table to be deduplicated to at least two data nodes; the data nodes store the distributed data in a plurality of data grouping tables based on the field values of the grouping field name and the deduplication field name and add row identifiers to the data in the data grouping tables, wherein different data nodes execute the row identifier adding operations of different data grouping tables; and deduplication processing is carried out according to the row identifiers and the grouping field name to form deduplication results. In other words, the data to be deduplicated is grouped based on both the grouping index and the deduplication index, the data node corresponding to each data group adds row identifiers to the data, and deduplication is carried out based on the row identifiers and the grouping index. Compared with grouping based on the grouping index alone in the related art, grouping by both the grouping index and the deduplication index generates more data groups, so that more data nodes can be allocated to add the row identifiers, the amount of data processed by each data node is reduced, and the processing is faster. Because deduplication is carried out through the row identifiers and the grouping field name, that is, through the row identifiers and the grouping index, the data matching based on the deduplication field name in the related art is not needed, so that a large amount of time can be saved and the efficiency is improved.

Drawings

Fig. 1 is a flowchart of a data deduplication processing method according to an embodiment of the present invention;

Fig. 2 is a flowchart of a data deduplication processing method according to an embodiment of the present invention;

Fig. 3 is a flowchart of a data deduplication processing method according to an embodiment of the present invention;

Fig. 4 is a flowchart of a data deduplication processing method according to an embodiment of the present invention;

Fig. 5 is a flowchart of a data deduplication processing method according to an embodiment of the present invention;

Fig. 6 is a block diagram of a data deduplication processing architecture provided by an embodiment of the present invention;

Fig. 7 is a schematic diagram of a cluster structure according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Fig. 1 is a flowchart of a data deduplication processing method according to an embodiment of the present invention, where the method may be performed by a data deduplication processing apparatus, and the apparatus is implemented by software and/or hardware. The device can be configured in a main node in a cluster, and the method can be applied to a big data deduplication processing scene.

Optionally, the method provided by the embodiment of the present invention may be applied to a cluster. Wherein a cluster may include a master node and at least two data nodes. The cluster may be installed with Hadoop, which is a distributed system infrastructure developed by the Apache foundation.

Optionally, the technical solution provided by the embodiment of the present invention may be applied to the following specific scenario. In the related art, when a large amount of data is deduplicated, the deduplication task can be allocated to a cluster; the data nodes in the cluster group the data to be deduplicated based on a single grouping index, each data node matches the data in its corresponding data groups one by one, and duplicate data with the same grouping index and the same deduplication index are removed, thereby completing the deduplication. However, because the related art groups the data by the grouping index only, few data groups are generated; when each data group is allocated to a corresponding data node for deduplication, few data nodes are allocated and each data node processes a large amount of data, which takes more time. Moreover, in the related art the data node deduplicates its data group by matching data one by one based on the deduplication index, which is time-consuming. As a result, the data deduplication method in the related art takes more time and is less efficient.
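To make the cost of the related-art approach concrete, the following minimal sketch groups rows by the grouping index only and then removes duplicates by comparing each row against the values already kept. It is one illustrative reading of the description above, not code from the related art; the row layout `(product, passport)` and the function name `dedup_by_matching` are assumptions.

```python
from collections import defaultdict

def dedup_by_matching(rows):
    """Group by the grouping index only, then drop duplicates by pairwise comparison."""
    groups = defaultdict(list)
    for product, passport in rows:
        groups[product].append(passport)

    result = {}
    for product, passports in groups.items():
        kept = []
        for p in passports:
            # One-by-one matching: compare against every value kept so far.
            if not any(p == k for k in kept):
                kept.append(p)
        result[product] = kept
    return result

rows = [("anzhuo", "aa"), ("anzhuo", "aa"), ("anzhuo", "bb"), ("IOS", "aa")]
deduped = dedup_by_matching(rows)
```

In this sketch only two groups exist for the four rows, so at most two workers could share the deduplication, and the comparison work inside a group grows with the number of rows times the number of distinct values kept.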

As shown in fig. 1, the technical solution provided by the embodiment of the present invention includes:

S110: the main node acquires a data query request, wherein the data query request comprises a query data table, a grouping field name and a duplication removal field name.

In the embodiment of the invention, a user can query data as required so as to remove repeated data. The user's data query request can include the query data table to be accessed, a grouping field name and a deduplication field name, wherein there may be one or more grouping field names and one deduplication field name. For example, when a user needs to query the number of users using each product, the grouping field name may be product and the deduplication field name may be passport.

In the embodiment of the present invention, optionally, the data query request may further include operation functions, for example, a count(*) function and a row_number function. The data query request may be written in Structured Query Language (SQL). The user side can send the data query request to Hive in the cluster through a Hive interface, and Hive converts the SQL data query request into tasks that Hadoop can recognize, so that the tasks run in the Hadoop of the cluster and the corresponding operations are executed. Hadoop is a distributed system infrastructure developed by the Apache Foundation and runs in the cluster. Hive is a data warehouse tool based on Hadoop that can map structured data files into database tables, provides a simple SQL query function, and can convert SQL statements into tasks that run on Hadoop. The row_number function is used to perform the operation of adding the row identifier, and the count(*) function is used to perform the deduplication operation according to the row identifier.
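The text does not give the query itself, so the following is only a hypothetical sketch of how a request combining row_number and count(*) might look. It uses Python's built-in `sqlite3` module as a local stand-in for Hive (an assumption made purely so the example is runnable; window functions require SQLite 3.25 or later). The table name `user_events` and the alias `rn` are invented for illustration; `product` and `passport` are the field names from the example above.

```python
import sqlite3

# Hypothetical query; the table name user_events and the alias rn are assumptions.
QUERY = """
SELECT product, COUNT(*) AS user_count
FROM (
    SELECT product,
           passport,
           ROW_NUMBER() OVER (PARTITION BY product, passport ORDER BY passport) AS rn
    FROM user_events
) AS numbered
WHERE rn = 1
GROUP BY product
ORDER BY product
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_events (product TEXT, passport TEXT)")
conn.executemany(
    "INSERT INTO user_events VALUES (?, ?)",
    [("anzhuo", "aa"), ("anzhuo", "aa"), ("anzhuo", "bb"), ("IOS", "aa")],
)
# Map each grouping value to its number of distinct passports.
counts = dict(conn.execute(QUERY).fetchall())
conn.close()
```

The inner query numbers the rows within each (product, passport) pair, and the outer query counts only the rows numbered 1 for each product, so no comparison of passport values is needed in the outer step.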

S120: the main node distributes the data in the query data table to at least two data nodes for grouping and duplicate removal processing, and obtains duplicate removal results formed by the at least two data nodes; the data node is used for storing distributed data in a plurality of data grouping tables according to field values of grouping field names and de-duplication field names, adding row identifiers for the data in the data grouping tables, and performing grouping de-duplication on the data in the data grouping tables according to the grouping field names and the row identifiers to form de-duplication results; different data nodes perform row identification addition operations of different data grouping tables.

In an implementation manner of the embodiment of the present invention, optionally, the distributing, by the main node, of the data in the query data table to at least two data nodes for grouping and deduplication processing includes: the main node divides the data in the query data table according to the number of the data nodes and distributes the divided data to the respective data nodes, so that each data node performs grouping and deduplication processing on the data distributed to it.

Specifically, the master node may equally divide the data in the query data table according to the number of the data nodes, and send an instruction to each data node to instruct it to acquire the corresponding data from the query data table and perform grouping and deduplication processing on the acquired data. For example, if the size of the query data table is n megabits and the number of data nodes is m, the query data table is equally divided into m parts, and each part is allocated to one data node, so that the data node performs grouping and deduplication processing on the allocated data.

Therefore, the data of the query data table is divided and distributed to the corresponding data nodes so that the data nodes perform grouping and de-duplication processing on the distributed data, so that the data nodes can be utilized most efficiently, and the data processing speed is improved.
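The equal division described above can be sketched as follows. This is a minimal illustration that splits by row count rather than by storage size, which is an assumption; the function name `split_evenly` is hypothetical.

```python
def split_evenly(rows, node_count):
    """Split rows into node_count contiguous parts whose sizes differ by at most one."""
    if node_count < 2:
        raise ValueError("the method assumes at least two data nodes")
    base, extra = divmod(len(rows), node_count)
    parts, start = [], 0
    for i in range(node_count):
        # The first `extra` parts take one additional row each.
        size = base + (1 if i < extra else 0)
        parts.append(rows[start:start + size])
        start += size
    return parts

rows = [("anzhuo", "aa"), ("anzhuo", "aa"), ("anzhuo", "bb"), ("IOS", "aa"), ("IOS", "cc")]
parts = split_evenly(rows, 3)
```

Each part would then be handed to one data node for grouping and deduplication processing.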

In the embodiment of the present invention, optionally, the master node may allocate the data in the query data table to the data nodes, and each data node may store, in the same data grouping table, the allocated data whose field values of the grouping field name and the deduplication field name are equal. Data with unequal field values of the grouping field name and the deduplication field name are stored in different data grouping tables. For example, if the number of data nodes is 3, the data in the query data table may be divided into three parts, which are respectively allocated to the 3 data nodes. Data node 1 may store the data in the first part whose field values of the grouping field name and the deduplication field name are "anzhuo" and "aa", respectively, in data grouping table 1; data node 2 may store the data in the second part whose field values of the grouping field name and the deduplication field name are "anzhuo" and "aa", respectively, in data grouping table 1; data node 3 may store the data in the third part whose field values of the grouping field name and the deduplication field name are "anzhuo" and "aa", respectively, in data grouping table 1. Similarly, the 3 data nodes may store all allocated data whose field values of the grouping field name and the deduplication field name are "anzhuo" and "bb", respectively, in data grouping table 2. In the same way, each data node may store the allocated data having the same field values of the grouping field name and the deduplication field name in the same data grouping table.
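A minimal sketch of this storage step follows, assuming each row is a `(product, passport)` pair and modelling the data grouping tables as an in-memory dictionary keyed by the pair of field values. The function name `build_grouping_tables` and the sample data are hypothetical.

```python
from collections import defaultdict

def build_grouping_tables(parts):
    """Route every row to the grouping table keyed by (grouping value, dedup value)."""
    tables = defaultdict(list)
    for part in parts:  # one part per data node
        for product, passport in part:
            tables[(product, passport)].append((product, passport))
    return dict(tables)

parts = [
    [("anzhuo", "aa"), ("anzhuo", "bb")],  # data allocated to data node 1
    [("anzhuo", "aa"), ("IOS", "aa")],     # data allocated to data node 2
    [("anzhuo", "aa"), ("anzhuo", "bb")],  # data allocated to data node 3
]
grouping_tables = build_grouping_tables(parts)
```

Rows with the values "anzhuo" and "aa" from all three nodes end up in the same table, while rows with "anzhuo" and "bb" go to a separate table.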

After the data with equal field values of the grouping field name and the deduplication field name are stored in the same data grouping table, the main node allocates a corresponding data node to each data grouping table, and each data node adds row identifiers to the data in its corresponding data grouping table, wherein the row identifiers may be numbered in ascending order, starting from 1.
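The row identifier step can be illustrated with the sketch below. The text does not say how the main node chooses a data node for each data grouping table, so the round-robin rule in `assign_tables` is purely an assumption, as are both function names.

```python
def add_row_ids(table):
    """Number the rows of one grouping table in ascending order, starting at 1."""
    return [(product, passport, row_id)
            for row_id, (product, passport) in enumerate(table, start=1)]

def assign_tables(table_keys, node_count):
    """Hypothetical round-robin assignment of grouping tables to data nodes."""
    return {key: i % node_count for i, key in enumerate(sorted(table_keys))}

grouping_tables = {
    ("anzhuo", "aa"): [("anzhuo", "aa")] * 3,
    ("anzhuo", "bb"): [("anzhuo", "bb")] * 2,
    ("IOS", "aa"): [("IOS", "aa")],
}
owners = assign_tables(grouping_tables, node_count=3)
numbered = {key: add_row_ids(rows) for key, rows in grouping_tables.items()}
```

Because every grouping table holds a single combination of grouping value and deduplication value, exactly one row in each table receives the identifier 1.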

In this embodiment of the present invention, optionally, performing grouping deduplication on the data in the data grouping tables according to the grouping field name and the row identifiers to form a deduplication result may specifically be: storing the data in the data grouping tables and the row identifiers into the same target data table; storing the data in the target data table in a plurality of data summary tables according to the field values of the grouping field name, wherein the data in the same data summary table have equal field values of the grouping field name; screening the data with row identifier 1 in each data summary table to deduplicate the data in each data summary table and obtain the deduplicated data in each data summary table; counting the number of deduplicated data in each data summary table, taking the number as the number corresponding to the query request, and feeding the number back to the main node; wherein different data nodes perform the deduplication operations of different data summary tables.

The master node may divide the data in the target data table and allocate the divided data to the data nodes, and each data node may store the allocated data with the same field value of the grouping field name in the same data summary table. For example, each data node may store all allocated data whose field value of the grouping field name is "anzhuo" in data summary table 1, and all allocated data whose field value of the grouping field name is "IOS" in data summary table 2. Data with other field values of the grouping field name is stored in the same way.

The field values of the grouping field name of the data in different data summary tables are not equal. After the data nodes store the allocated data with the same field value of the grouping field name in the same data summary table, the master node may allocate the data summary tables to the corresponding data nodes, and each data node may screen the data with row identifier 1 in its corresponding data summary table to deduplicate the data in each data summary table.
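The summary and screening steps can be sketched as follows, assuming the target data table is a list of `(product, passport, row_id)` tuples. The function name `count_distinct` and the sample rows are hypothetical.

```python
from collections import defaultdict

def count_distinct(target_table):
    """Group the target table by grouping value and count rows whose row identifier is 1."""
    summary_tables = defaultdict(list)
    for product, passport, row_id in target_table:
        summary_tables[product].append((passport, row_id))

    # Each (grouping value, dedup value) pair has exactly one row with identifier 1,
    # so filtering on it removes duplicates without comparing dedup values.
    return {product: sum(1 for _, row_id in rows if row_id == 1)
            for product, rows in summary_tables.items()}

target_table = [
    ("anzhuo", "aa", 1), ("anzhuo", "aa", 2), ("anzhuo", "aa", 3),
    ("anzhuo", "bb", 1), ("anzhuo", "bb", 2),
    ("IOS", "aa", 1),
]
user_counts = count_distinct(target_table)
```

The screening is a single integer comparison per row, which is where the approach avoids the one-by-one value matching of the related art.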

Fig. 2 is a flowchart of a data deduplication processing method provided in an embodiment of the present invention, where the method may be applied in a big data deduplication processing scenario, and the method may be applied in a data node set, where the data node set includes at least two data nodes.

As shown in fig. 2, the technical solution provided by the embodiment of the present invention includes:

S210: acquiring data distributed by a main node, wherein the distributed data is data distributed by the main node according to a query data table included in a data query request, and the data query request includes the query data table, a grouping field name and a duplication removal field name.

In the embodiment of the present invention, the master node may equally divide the data in the query data table according to the number of the data nodes, and send an instruction to each data node to instruct it to acquire the corresponding data from the query data table and perform grouping and deduplication processing on the acquired data. For example, if the size of the query data table is n megabits and the number of data nodes is m, the query data table is equally divided into m parts, and each part is allocated to one data node.

S220: the distributed data is stored in a plurality of data grouping tables according to the field values of the grouping field names and the duplication removal field names.

In an implementation manner of the embodiment of the present invention, optionally, storing the distributed data in a plurality of data grouping tables according to the field values of the grouping field name and the deduplication field name includes: storing the data in the distributed data that have the same field values of the grouping field name and the deduplication field name into the same data grouping table.

The main node may allocate the data in the query data table to the data nodes, and each data node may store, in the same data grouping table, the allocated data whose field values of the grouping field name and the deduplication field name are equal. Data with unequal field values of the grouping field name and the deduplication field name are stored in different data grouping tables. For example, if the number of data nodes is 3, the data in the query data table may be divided into three parts, which are respectively allocated to the 3 data nodes. Data node 1 may store the data in the first part whose field values of the grouping field name and the deduplication field name are "anzhuo" and "aa", respectively, in data grouping table 1; data node 2 may store the data in the second part whose field values of the grouping field name and the deduplication field name are "anzhuo" and "aa", respectively, in data grouping table 1; data node 3 may store the data in the third part whose field values of the grouping field name and the deduplication field name are "anzhuo" and "aa", respectively, in data grouping table 1. Similarly, the 3 data nodes may store all allocated data whose field values of the grouping field name and the deduplication field name are "anzhuo" and "bb", respectively, in data grouping table 2. In the same way, each data node may store the allocated data having the same field values of the grouping field name and the deduplication field name in the same data grouping table.

S230: adding row identification to the data in the data grouping table; wherein different data nodes execute row identification adding operation of different data grouping tables.

In the embodiment of the invention, after the data with equal field values of the grouping field name and the deduplication field name are stored in the same data grouping table, the main node allocates a corresponding data node to each data grouping table, and each data node adds row identifiers to the data in its corresponding data grouping table, wherein the row identifiers may be numbered in ascending order, starting from 1.

S240: and carrying out grouping duplicate removal on the data in the data grouping table according to the grouping field names and the row identifiers to form a duplicate removal result and feeding the duplicate removal result back to the main node.

Specifically, the data in each data grouping table may be stored in a target data table, the data in the target data table may be grouped based on the grouping field name, and the data with the same field value of the grouping field name may be stored in the same data summary table. In each data summary table, the data with a row identifier of 1 is screened to deduplicate the data, and the deduplication result is fed back to the master node. Details can be found in the following description of the embodiments.

Compared with the related art that the data in the query data table is grouped based on the grouping field names, the technical scheme provided by the embodiment of the invention can divide the data in the query data table into more groups and store the data in the query data table into more data grouping tables by grouping the data in the query data table based on the de-duplication field names and the grouping field names and storing the data in the same group in the same data grouping table. The adding operation of the row identifiers in different data grouping tables is executed by different data nodes, more data nodes can process data, the row identifiers can be added to the data more quickly, and the time is less. By grouping and screening the data based on the row identification and the grouping field names, compared with the prior art in a data matching mode, the data processing time is greatly reduced, and the duplicate removal efficiency is improved.
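The claim that grouping by both field names yields more, smaller groups can be checked on a toy data set. The rows below are invented for illustration and say nothing about real workloads.

```python
rows = [("anzhuo", "aa"), ("anzhuo", "aa"), ("anzhuo", "bb"),
        ("anzhuo", "cc"), ("IOS", "aa"), ("IOS", "bb")]

# Related art: one group per grouping value.
groups_by_grouping_field = {product for product, _ in rows}
# This scheme: one group per (grouping value, dedup value) pair.
groups_by_both_fields = {(product, passport) for product, passport in rows}

# Largest number of rows any single group holds under each scheme.
largest_single = max(sum(1 for p, _ in rows if p == g) for g in groups_by_grouping_field)
largest_both = max(rows.count(key) for key in groups_by_both_fields)
```

Here the six rows form two groups of up to four rows under the related-art grouping, but five groups of at most two rows under the combined grouping, so more data nodes can take part in adding row identifiers and each handles less data.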

Fig. 3 is a flowchart of a data deduplication processing method according to an embodiment of the present invention, and as shown in fig. 3, a technical solution according to an embodiment of the present invention includes:

S310: acquiring data distributed by a main node, wherein the distributed data is data distributed by the main node according to a query data table included in a data query request, and the data query request includes the query data table, a grouping field name and a duplication removal field name.

S320: the distributed data is stored in a plurality of data grouping tables according to the field values of the grouping field names and the duplication removal field names.

S330: adding row identification to the data in the data grouping table; wherein different data nodes execute row identification adding operation of different data grouping tables.

S340: and storing the data in the data grouping table and the row identification into the same target data table.

In the embodiment of the present invention, the target data table includes all data in the data grouping table and a row identifier of each data in the corresponding data grouping table.

S350: storing the data in the target data table into a plurality of data summary tables according to the field value of the grouping field name; wherein the data in the same data summary table have equal field values of the grouping field name.

In the embodiment of the present invention, the master node may divide the data in the target data table and allocate the divided data to the data nodes, and each data node may store, among its allocated data, the data having the same field value of the grouping field name in the same data summary table. For example, each data node may store all allocated data whose grouping field name has the field value "anzhuo" in data summary table 1, and all allocated data whose grouping field name has the field value "IOS" in data summary table 2. Data with other field values of the grouping field name is stored in the same manner. The field values of the grouping field name differ between different data summary tables.

S360: screening the data whose row identifier is 1 in each data summary table, so as to deduplicate the data in each data summary table and obtain the deduplicated data of each data summary table.

In the embodiment of the present invention, each data node may screen the data whose row identifier is 1 in its corresponding data summary table, thereby deduplicating the data of that data summary table and obtaining its deduplicated data.

S370: counting the number of the data after the duplication removal in each data summary table, taking the number as the number corresponding to the query request, and feeding back the number to the main node; wherein different data nodes perform deduplication operations for different data summary tables.
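Steps S320 to S370 can be illustrated with a minimal single-process Python sketch. It is not the implementation of the embodiment: the function name `dedup_count`, the dict-based rows, the field names in the sample and the in-memory "tables" are hypothetical, and the distribution of tables across data nodes is left out.

```python
from collections import defaultdict


def dedup_count(rows, group_field, dedup_field):
    """Count deduplicated rows per grouping value, following S320 to S370.

    Hypothetical helper: rows are plain dicts and every "table" is an
    in-memory list; a real cluster would spread the tables over data nodes.
    """
    # S320: one data grouping table per (grouping value, dedup value) pair.
    grouping_tables = defaultdict(list)
    for row in rows:
        grouping_tables[(row[group_field], row[dedup_field])].append(row)

    # S330 and S340: number the rows of each grouping table from 1 and
    # collect data plus row identifier in one target data table.
    target_table = []
    for table in grouping_tables.values():
        for rid, row in enumerate(table, start=1):
            target_table.append({**row, "rid": rid})

    # S350: one data summary table per grouping value.
    summary_tables = defaultdict(list)
    for row in target_table:
        summary_tables[row[group_field]].append(row)

    # S360 and S370: keep rows whose row identifier is 1 and count them.
    return {
        value: sum(1 for row in table if row["rid"] == 1)
        for value, table in summary_tables.items()
    }


# Made-up sample data.
rows = [
    {"product": "anzhuo", "passport": "u1"},
    {"product": "anzhuo", "passport": "u1"},
    {"product": "anzhuo", "passport": "u2"},
    {"product": "IOS", "passport": "u1"},
]
print(dedup_count(rows, "product", "passport"))
```

For the sample rows, the duplicate ("anzhuo", "u1") record receives row identifier 2 and is dropped by the screening step, so the counts are 2 for "anzhuo" and 1 for "IOS".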

Compared with the related art, in which the data in the query data table is grouped based only on the grouping field name, the embodiment of the invention groups the data in the query data table based on both the deduplication field name and the grouping field name, and stores the data whose deduplication field values are equal and whose grouping field values are equal in the same data grouping table. The data in the query data table is thereby divided into more data groups, and the row-identifier adding operations for different data grouping tables are performed by different data nodes, that is, the row identifiers of each data group are added by its corresponding data node. Because more data groups are formed, more data nodes can take part in the processing, and the row identifiers can be added to the data more quickly.

In addition, the data in the data grouping tables and the row identifiers are stored in the target data table, the data in the target data table is regrouped based on the grouping field name, the data with the same field value of the grouping field name is stored in the same data summary table, and the data whose row identifier is 1 is screened in each data summary table to deduplicate that table. This saves time compared with the related art, in which the data in the query data table is grouped based on the grouping field name and deduplicated by data matching. Although the technical solution groups the data twice, the first grouping is based on both the grouping field name and the deduplication field name, so more data groups are formed and more data nodes can add row identifiers, which makes the row-identifier adding operation fast. The second grouping is based on the grouping field name, and deduplication only requires screening the data whose row identifier is 1, which is much faster than the data matching approach of the related art. Therefore, although the technical solution adds one grouping operation and one row-identifier adding operation, these operations take little time, and the time saved by screening for row identifier 1 instead of data matching is larger. Overall, the technical solution provided by the embodiment of the invention saves time and improves efficiency compared with the data deduplication method of the related art.

Fig. 4 is a flowchart of a data deduplication processing method provided in an embodiment of the present invention, where the method is applied to a cluster, where the cluster includes a master node and a data node set, and the data node set includes at least two data nodes. As shown in fig. 4, the technical solution provided by the embodiment of the present invention includes:

S410: the master node acquires a data query request, wherein the data query request comprises a query data table, a grouping field name and a deduplication field name.

S420: the master node distributes the data in the query data table to a data node set.

S430: the data node set acquires the data distributed by the master node.

S440: the data nodes of the data node set respectively store the distributed data in a plurality of data grouping tables according to the field values of the grouping field name and the deduplication field name.

S450: the data node set adds row identification to the data in the data grouping table; wherein different data nodes execute row identification adding operation of different data grouping tables.

S460: the data node set performs grouping deduplication on the data in the data grouping table according to the grouping field name and the row identifier to form a deduplication result, and feeds the deduplication result back to the master node.

In this embodiment of the present invention, optionally, the step in which the data node set performs grouping deduplication on the data in the data grouping table according to the grouping field name and the row identifier to form a deduplication result and feeds the deduplication result back to the master node includes:

storing the data in the data grouping table and the row identification into the same target data table;

storing the data in the target data table in a plurality of data summary tables according to the field value of the grouping field name; wherein the field values of the grouping field name corresponding to the data in the same data summary table are equal;

screening the data whose row identifier is 1 in each data summary table, so as to deduplicate the data in each data summary table and obtain the deduplicated data of each data summary table;

counting the number of the data after the duplication removal in each data summary table, taking the number as the number corresponding to the query request, and feeding back the number to the main node; wherein different data nodes perform deduplication operations for different data summary tables.

For detailed descriptions of various steps in the embodiments of the present invention, reference may be made to the descriptions of the corresponding steps in the above embodiments.
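The interaction between the master node and the data node set in S410 to S460 can be sketched as a single-process Python simulation. All names (`master_distribute`, `owner_of`, `run_cluster`) are hypothetical, and two policies are assumptions that the embodiment does not specify: the strided split used by the master node, and the round-robin choice of the node that holds each data grouping table.

```python
from collections import Counter, defaultdict


def master_distribute(rows, node_count):
    """S420: split the rows across the data nodes (strided split is an assumption)."""
    return [rows[i::node_count] for i in range(node_count)]


def owner_of(key, owners, node_count):
    """Pick the node that holds a data grouping table (round-robin is an assumption)."""
    if key not in owners:
        owners[key] = len(owners) % node_count
    return owners[key]


def run_cluster(rows, node_count=2):
    chunks = master_distribute(rows, node_count)  # S420 and S430

    # S440: rows with equal (grouping value, deduplication value) go to one
    # data grouping table, and each table is held by exactly one node.
    owners = {}
    node_tables = [defaultdict(list) for _ in range(node_count)]
    for chunk in chunks:
        for row in chunk:
            node_tables[owner_of(row, owners, node_count)][row].append(row)

    # S450: each node numbers the rows of its own grouping tables from 1.
    # S460: rows with row identifier 1 are counted per grouping value.
    counts = Counter()
    for tables in node_tables:
        for (group_value, _dedup_value), table in tables.items():
            for rid, _row in enumerate(table, start=1):
                if rid == 1:
                    counts[group_value] += 1
    return dict(counts)  # deduplication result fed back to the master node


# Made-up sample rows: (grouping field value, deduplication field value).
rows = [("anzhuo", "u1"), ("IOS", "u1"), ("anzhuo", "u1"), ("anzhuo", "u2")]
result = run_cluster(rows, node_count=2)
print(result)
```

Because every data grouping table is held by a single node, the row numbering of one table never needs coordination with another node, which is the property the steps above rely on.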

Fig. 5 is a flowchart of a data deduplication processing method according to an embodiment of the present invention, and as shown in fig. 5, a technical solution according to an embodiment of the present invention includes:

S510: the master node acquires a data query request, wherein the data query request comprises a query data table, a product field name and a user field name.

In an embodiment of the present invention, the data query request of the user may include the query data table to be accessed, a grouping field name and a deduplication field name; here the product field name is the grouping field name and the user field name is the deduplication field name. Part of the fields of the query data table may take the form shown in Table 1 below.

TABLE 1

Field	Type	Remarks
stime	String	Time point of data reporting
passport	String	User
product	String	Product

In the embodiment of the present invention, optionally, the data query request may further include operation functions, for example a count(*) function and a row_number function. The row_number function performs the operation of adding the row identifier, and the count(*) function performs the deduplication according to the row identifier. Specifically, the row_number function groups the data in the query data table according to the user field name and the product field name and adds the row identifiers. The count(*) function is applied, within the data grouped by the product field name, to the data whose row identifier is 1. The target data table comprises the data in the query data table and the row identifier of each piece of data within its group.

S520: the master node distributes the data in the query data table to a data node set.

S530: the data node set acquires the data distributed by the master node.

S540: the data nodes of the data node set respectively store the distributed data in a plurality of data grouping tables according to the field values of the product field name and the user field name.

S550: the data node set adds row identification to the data in the data grouping table; wherein different data nodes execute row identification adding operation of different data grouping tables.

S560: the data node set stores the data in the data grouping table and the row identifiers in the same target data table.

S570: the data node set stores the data in the target data table into a plurality of data summary tables according to the field value of the product field name; the field values of the product field name corresponding to the data in the same data summary table are equal.

S580: the data node set screens the data whose row identifier is 1 in each data summary table, so as to deduplicate the data in each data summary table and obtain the deduplicated data of each data summary table.

S590: the data node set counts the number of the data after the duplication removal in each data summary table, and the number is used as the number corresponding to the query request and is fed back to the main node; wherein different data nodes perform deduplication operations for different data summary tables.

In the related art, the data query request includes a count distinct function. When the user needs to count the number of users of each product, the data query request may take the following form:

select product, count(distinct passport)
from query_data_table
group by product

When the data query request is sent to the cluster, the cluster executes the task in the request: the data of the query data table is grouped according to the product field name, the data with the same field value of the product field name is stored in one data group, and the corresponding data nodes of the cluster remove, by data matching, the repeated data of the same product and the same user in each data group. Because the grouping is based only on the product field name, few data groups are generated. When the data groups are allocated to data nodes for deduplication, few data nodes are used and each of them processes a large amount of data, which takes more time. In addition, deduplicating a data group by data matching is itself time-consuming.

In the technical solution provided in the embodiment of the present invention, the data query request includes a count(*) function and a row_number function; for example, the query request may take the following form:

select product, count(*)
from (
    select product, passport,
           row_number() over (partition by product, passport) rid
    from query_data_table
) t1
where rid = 1
group by product;

Here t1 is the target data table. When the data query request is sent to the cluster, the cluster executes the task in the request. The technical solution of the embodiment of the invention uses the row_number function to group the data in the query data table based on the product field name and the user field name, stores the data whose product field values are equal and whose user field values are equal in one data group, and allocates data nodes in the cluster to add the row identifiers for the corresponding data groups. The data with the added row identifiers is then stored in the target data table, the data in the target data table is grouped based on the product field name, the data with the same field value of the product field name is stored in one product data group, the data whose row identifier is 1 is screened, and the count(*) function counts it, which yields the number of deduplicated users. Because the row_number function groups on both the product field name and the user field name, more data groups are generated and more data nodes can be allocated to add row identifiers, so this operation takes less time; and deduplicating by screening for row identifier 1 and counting with count(*) saves more time than the data matching approach of the related art. Therefore, in the embodiment of the invention, the count distinct operation is converted into a count(*) operation by using the row_number function in the data query request, so that when the cluster receives the data query request and executes the task, time is saved and efficiency is improved.
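That the count distinct form and the row_number form return the same counts can be checked on a small scale with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: SQLite stands in for the cluster's query engine, window functions require SQLite 3.25 or later, the table name `query_data_table` and the sample rows are made up, and an `order by product` clause is added to both queries solely to make the two results comparable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table query_data_table (stime text, passport text, product text)"
)
# Made-up sample records with the three fields of Table 1.
conn.executemany(
    "insert into query_data_table values (?, ?, ?)",
    [
        ("2019-02-19 10:00", "u1", "anzhuo"),
        ("2019-02-19 10:05", "u1", "anzhuo"),
        ("2019-02-19 10:07", "u2", "anzhuo"),
        ("2019-02-19 10:09", "u1", "IOS"),
    ],
)

# Related-art form: count distinct per product.
count_distinct = """
select product, count(distinct passport)
from query_data_table
group by product
order by product
"""

# Form of the embodiment: number rows per (product, passport), keep rid = 1.
row_number_rewrite = """
select product, count(*)
from (
    select product, passport,
           row_number() over (partition by product, passport) rid
    from query_data_table
) t1
where rid = 1
group by product
order by product
"""

related_art = conn.execute(count_distinct).fetchall()
rewritten = conn.execute(row_number_rewrite).fetchall()
print(related_art)
print(rewritten)
```

Both queries report one user for "IOS" and two users for "anzhuo" on this sample. The check shows only that the results agree; it says nothing about the running time on a cluster.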

In a performance test, the query data table contained 520 million data records. Deduplication by the method of the related art took 3234 seconds (about 54 minutes), whereas deduplication by the method of the embodiment of the invention took 224 seconds (less than 4 minutes). The deduplication task was thus shortened by about 50 minutes, which greatly improves efficiency.
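The reported figures can be checked with a few lines of Python. Only the two durations come from the test described above; the variable names are illustrative and the speed-up ratio is a derived value.

```python
related_art_seconds = 3234  # reported duration of the related-art method
embodiment_seconds = 224    # reported duration of the embodiment's method

saved_seconds = related_art_seconds - embodiment_seconds
saved_minutes = saved_seconds / 60
speedup = related_art_seconds / embodiment_seconds

print(saved_seconds, round(saved_minutes, 1), round(speedup, 1))
```

The difference is 3010 seconds, or about 50.2 minutes, which matches the stated saving of about 50 minutes and corresponds to a speed-up of roughly 14.4 times.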

Fig. 6 is a block diagram of a data deduplication processing apparatus according to an embodiment of the present invention, and as shown in fig. 6, the apparatus includes: a request acquisition module 610 and a result acquisition module 620.

A request acquisition module 610, configured to obtain a data query request, where the data query request includes a query data table, a grouping field name, and a deduplication field name;

a result acquisition module 620, configured to distribute the data in the query data table to at least two data nodes for grouping deduplication processing, and to obtain the deduplication results formed by the at least two data nodes;

Optionally, the result acquisition module 620 is configured to divide the data in the query data table according to the number of data nodes and to allocate the divided data to each data node, so that each data node performs grouping deduplication processing on its allocated data.
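Dividing the data according to the number of data nodes can be sketched as follows. The embodiment does not state how the division is made; the nearly equal contiguous chunks and the function name `split_for_nodes` are assumptions chosen for illustration.

```python
def split_for_nodes(rows, node_count):
    """Divide rows into node_count nearly equal contiguous chunks.

    Assumed policy: the first (len(rows) % node_count) chunks receive one
    extra row, so chunk sizes differ by at most one.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    base, extra = divmod(len(rows), node_count)
    chunks = []
    start = 0
    for index in range(node_count):
        size = base + (1 if index < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


chunks = split_for_nodes(list(range(7)), 3)
print(chunks)
```

Seven rows divided among three nodes give chunks of sizes 3, 2 and 2, and every row is assigned to exactly one node.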

The device can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Fig. 7 shows a cluster according to an embodiment of the present invention. As shown in Fig. 7, the cluster includes a master node 710 and a data node set 720, where the data node set includes at least two data nodes;

the master node 710 is configured to obtain a data query request, where the data query request includes a query data table, a grouping field name, and a deduplication field name;

the master node 710 is further configured to assign data in the query data table to a data node set;

the data node set 720 is used for acquiring data allocated by the master node;

the data node set 720 is further configured to store the allocated data in a plurality of data grouping tables according to field values of a grouping field name and a deduplication field name;

the data node set 720 is further configured to add a row identifier to data in a data grouping table; wherein, different data nodes execute row identification adding operation of different data grouping tables;

the data node set 720 is further configured to perform grouping deduplication on the data in the data grouping table according to the grouping field name and the row identifier, and to form a deduplication result that is fed back to the master node.

Specifically, the data node set 720 is configured to:

counting the number of the data after the duplication removal in each data summary table, taking the number as the number corresponding to the query request, and feeding back the number to the main node; wherein different data nodes 730 perform deduplication operations for different data summary tables.

Fig. 8 is a schematic structural diagram of an apparatus provided in an embodiment of the present invention, and as shown in fig. 8, the apparatus includes:

one or more processors 810, one processor 810 being illustrated in FIG. 8;

a memory 820;

the apparatus may further include: an input device 830 and an output device 840.

The processor 810, the memory 820, the input device 830 and the output device 840 of the apparatus may be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 8.

The memory 820 is a non-transitory computer-readable storage medium and can be used for storing software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the data deduplication processing method of the embodiment of the present invention (e.g., the request acquisition module 610 and the result acquisition module 620 shown in Fig. 6). By running the software programs, instructions and modules stored in the memory 820, the processor 810 executes the various functional applications and data processing of the computer device, that is, implements the data deduplication processing method of the above method embodiments, namely:

acquiring a data query request, wherein the data query request comprises a query data table, a grouping field name and a duplication removal field name;

The memory 820 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 820 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 820 may optionally include memory located remotely from processor 810, which may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 830 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus. The output device 840 may include an output interface, and the like.

An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a data deduplication processing method as provided in the embodiment of the present invention:

Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A data deduplication processing method applied to a master node in a cluster, the method comprising:

the data node is used for storing the distributed data in a plurality of data grouping tables according to the field values of the grouping field names and the duplication removal field names, adding row identifiers for the data in the data grouping tables, and performing grouping and duplication removal on the data in the data grouping tables according to the grouping field names and the row identifiers to form duplication removal results; wherein, different data nodes execute row identification adding operation of different data grouping tables;

the storing the distributed data in a plurality of data grouping tables according to the field values of the grouping field names and the duplication removal field names comprises the following steps:

and storing the data with the same grouping field name and the same field value of the de-duplication inquiry field name in the distributed data into the same data grouping table.

2. The method of claim 1, wherein distributing the data in the query data table to at least two data nodes for grouping deduplication processing comprises:

and dividing the data in the query data table according to the number of the data nodes, and respectively distributing the divided data to each data node so as to enable each data node to perform grouping deduplication processing according to the distributed data.

3. A data deduplication processing method applied to a data node set in a cluster, the data node set including at least two data nodes, the method comprising:

storing the distributed data in a plurality of data grouping tables according to the grouping field names and the field values of the de-duplication field names;

grouping and duplicate removal are carried out on the data in the data grouping table according to the grouping field names and the row identifiers, and duplicate removal results are formed and fed back to the main node;

4. The method of claim 3, wherein performing grouping deduplication on the data in the data grouping table according to the grouping field name and the row identifier, and forming a deduplication result that is fed back to the master node, comprises:

storing the data in the data grouping table and the row identification into the same target data table; storing the data in the target data table into a plurality of data summary tables according to the field value of the grouping field name; the field values of the grouping field names corresponding to the data in the same data summary table are equal;

screening data with row identification 1 in each data summary table respectively to obtain data after duplication removal in each data summary grouping table;

5. A data deduplication processing method is applied to a cluster, wherein the cluster includes a master node and a data node set, and the data node set includes at least two data nodes, and the method includes:

the main node distributes the data in the query data table to a data node set;

the data node set acquires data distributed by the main node;

the data node set carries out grouping duplicate removal on the data in the data grouping table with the row identification according to the grouping field names and the row identification to form a duplicate removal result which is fed back to the main node;

6. A data deduplication processing apparatus, comprising:

the request acquisition module is used for acquiring a data query request, wherein the data query request comprises a query data table, a grouping field name and a duplication removal field name;

the result acquisition module is used for distributing the data in the query data table to at least two data nodes for grouping and deduplication processing and acquiring deduplication results formed by the at least two data nodes;

the data node is used for storing distributed data in a plurality of data grouping tables according to field values of grouping field names and de-duplication field names, adding row identifiers for the data in the data grouping tables, and performing grouping de-duplication on the data in the data grouping tables according to the grouping field names and the row identifiers to form de-duplication results; different data nodes execute row identification adding operation of different data grouping tables;

7. A cluster comprising a master node and a set of data nodes, the set of data nodes comprising at least two data nodes;

the data node set is used for acquiring data distributed by the main node;

the data node set is also used for storing the distributed data in a plurality of data grouping tables according to the field values of the grouping field names and the duplication removal field names;

the data node set is also used for carrying out grouping duplicate removal on the data in the data grouping table with the row identification according to the grouping field names and the row identification to form a duplicate removal result which is fed back to the main node;

8. An apparatus, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data deduplication processing method as recited in claim 1 or 2.

9. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, carries out the data deduplication processing method according to claim 1 or 2.