CN111159112B - Data processing method and system - Google Patents

Data processing method and system

Info

Publication number
CN111159112B
Authority
CN
China
Prior art keywords
data
written
node
identifier
rowkey
Prior art date
Legal status
Active
Application number
CN201911326062.0A
Other languages
Chinese (zh)
Other versions
CN111159112A (en)
Inventor
王胜杰
Current Assignee
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd filed Critical New H3C Big Data Technologies Co Ltd
Priority to CN201911326062.0A priority Critical patent/CN111159112B/en
Publication of CN111159112A publication Critical patent/CN111159112A/en
Application granted granted Critical
Publication of CN111159112B publication Critical patent/CN111159112B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/16 - File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/17 - Details of further file system functions
    • G06F16/172 - Caching, prefetching or hoarding of files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data processing method and system. In the method, a Map node processes data to be processed to obtain data to be written, a Rowkey corresponding to the data to be written, and an identifier of the table to which the data to be written belongs; the Shuffle node determines a target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, and sends the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs to the target Reduce node; the target Reduce node writes the data to be written and the Rowkey corresponding to the data to be written into a temporary directory of the table to which the data to be written belongs; the DoBulkload node transfers the HFile file under the temporary directory of the table to the actual directory of the table. The data storage efficiency can thereby be effectively improved.

Description

Data processing method and system
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method and system.
Background
HBase is a highly reliable, high-performance, column-oriented, scalable, distributed database that supports mass data storage. Referring to fig. 1, a schematic diagram of the HBase database is shown.
Writing data into a database requires a series of processes. Referring to fig. 2, a schematic structural diagram of data processing based on a MapReduce framework is shown. The Hadoop Distributed File System (HDFS) on the left stores the file to be processed, which comprises a plurality of pieces of data to be processed. A Map node (Mapper1 to Mapper3) reads the data to be processed from the file to be processed and processes it to obtain the data to be written into the HBase database (data to be written for short). The Map node sends the data to be written to the HBase database through a write (Put) operation instruction. Inside the HBase database, a series of operations is required to generate the HFile file that can be stored at the bottom layer of the HBase database. This data storage process is inefficient and affects data query.
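For reference, the conventional Put-based write path described above can be sketched with the standard HBase client API. The table name, column family, and values below are illustrative only, and error handling is omitted:

```java
// A minimal sketch of the conventional Put-based write path (illustrative table,
// column family and values). Each record triggers a Put that the RegionServer must
// log to the HLog/WAL and buffer in the MemStore before it is eventually flushed
// to an HFile.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutBasedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("Table1"))) {
            Put put = new Put(Bytes.toBytes("Rowkey1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Xiao Wang"));
            table.put(put);   // one RPC per record (or per buffered batch)
        }
    }
}
```

Every record written this way passes through the RegionServer's write-ahead log and MemStore before it reaches an HFile, which is the overhead the present application seeks to avoid.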
Disclosure of Invention
In view of this, the present application provides a data processing method and system, so as to improve data storage efficiency and reduce influence on data query.
In order to achieve the purpose of the application, the application provides the following technical scheme:
in a first aspect, the present application provides a data processing method applied to a data processing system, where the system includes at least one Map node, a Shuffle node, at least one Reduce node, and a DoBulkload node, and the method includes:
the Map node processes data to be processed to obtain data to be written, Rowkey corresponding to the data to be written and an identifier of a table to which the data to be written belongs;
the Shuffle node determines a target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, and sends the data to be written, the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs to the target Reduce node;
the target Reduce node writes the data to be written and the Rowkey corresponding to the data to be written into the HFile file under the temporary directory corresponding to the table to which the data to be written belongs;
and the DoBulkload node transfers the HFile file under the temporary directory corresponding to the table to the actual directory corresponding to the table.
Optionally, the processing, by the Map node, of the data to be processed to obtain the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs, includes:
obtaining characteristics of the created table;
extracting data to be written which is matched with the characteristics of the table from the data to be processed;
generating Rowkey corresponding to the data to be written according to a preset generation rule;
and determining the identifier of the table matched with the data to be written as the identifier of the table to which the data to be written belongs.
Optionally, before the Shuffle node determines, according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, the target Reduce node for processing the data to be written, the method further includes:
the Map node generates a target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs;
and the Shuffle node analyzes the target key to obtain the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs.
Optionally, the generating, by the Map node, a target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs includes:
the Map node splices the identifier of the table to which the data to be written belongs with the Rowkey corresponding to the data to be written, and uses the spliced result as the target key corresponding to the data to be written.
Optionally, the determining, by the Shuffle node, a target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written includes:
aiming at each Reduce node, acquiring a preset Rowkey range of a Region corresponding to the Reduce node and an identifier of a table to which the Region belongs;
matching the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written against the preset Rowkey range and table identifier corresponding to each Reduce node;
and determining the matched Reduce node as a target Reduce node for processing the data to be written.
In a second aspect, the present application provides a data processing system comprising at least one Map node, a Shuffle node, at least one Reduce node, and a DoBulkload node,
the Map node is used for processing data to be processed to obtain data to be written, Rowkey corresponding to the data to be written and an identifier of a table to which the data to be written belongs;
the Shuffle node is configured to determine, according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, a target Reduce node that processes the data to be written, and send the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs to the target Reduce node;
the target Reduce node is used for writing the data to be written and the Rowkey corresponding to the data to be written into the HFile file under the temporary directory corresponding to the table to which the data to be written belongs;
the DoBulkload node is used for transferring the HFile file in the temporary directory corresponding to the table to the actual directory corresponding to the table.
Optionally, the processing, by the Map node, of the data to be processed to obtain the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs, includes:
obtaining characteristics of the created table;
extracting data to be written which is matched with the characteristics of the table from the data to be processed;
generating Rowkey corresponding to the data to be written according to a preset generation rule;
and determining the identifier of the table matched with the data to be written as the identifier of the table to which the data to be written belongs.
Optionally, the Map node is further configured to generate a target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs;
the Shuffle node is further configured to parse the target key, and obtain a Rowkey corresponding to the data to be written and an identifier of a table to which the data to be written belongs.
Optionally, the generating, by the Map node, a target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs includes:
the Map node splices the identifier of the table to which the data to be written belongs with the Rowkey corresponding to the data to be written, and uses the spliced result as the target key corresponding to the data to be written.
Optionally, the determining, by the Shuffle node, a target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written includes:
aiming at each Reduce node, acquiring a preset Rowkey range of a Region corresponding to the Reduce node and an identifier of a table to which the Region belongs;
matching the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written against the preset Rowkey range and table identifier corresponding to each Reduce node;
and determining the matched Reduce node as a target Reduce node for processing the data to be written.
As can be seen from the above description, in the present application, the Map node processes the data to be processed to obtain the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs. The Shuffle node determines, according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, a target Reduce node for processing the data to be written, and sends the data to be written, the corresponding Rowkey, and the table identifier to the target Reduce node. The target Reduce node writes the data to be written and the corresponding Rowkey into the temporary directory corresponding to the table to which the data belongs. Finally, the DoBulkload node transfers the HFile file under the temporary directory corresponding to the table to the actual directory corresponding to the table, and the data storage is completed. Because the data is converted into the file format required for storage (the HFile file) before being put into the database, warehousing only requires the DoBulkload node to perform the HFile transfer operation, and the RegionServer does not need to execute a series of write operations for each piece of data. Therefore, the data warehousing efficiency can be effectively improved and the influence on data query is reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic diagram of an HBase database according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of data processing based on a MapReduce framework in the prior art;
FIG. 3 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating an implementation of step 301 according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating an implementation of step 302 according to an embodiment of the present application;
fig. 6 is a flowchart illustrating an implementation procedure of a table identifier transmission process according to an embodiment of the present application;
FIG. 7 is a block diagram of a data processing system according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a data processing system according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the embodiments of the present application. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination", depending on the context.
Referring to fig. 2, a schematic structural diagram of data processing based on the MapReduce framework is shown. The file to be processed is stored in the HDFS in advance and comprises a plurality of pieces of data to be processed. A Map node (Mapper1 to Mapper3) reads the data to be processed from the file to be processed and processes it to obtain the data to be written into the HBase database. The Map node sends the data to be written to the designated RegionServer (RS1 or RS2) through a Put operation instruction. The RegionServer writes the data to be written into the HLog of the Region and then into the MemStore cache. When the cached data in the MemStore meets a certain condition, it is flushed into the HDFS to form an HFile, and the data is then stored in the database. It can be seen that, in this processing procedure, the RegionServer needs to execute a series of operations for each piece of data to be written in order to complete data storage, so the data storage efficiency is low; moreover, the RegionServer has to handle data query operations in addition to data storage, so this series of operations seriously affects data query.
In view of the foregoing problems, an embodiment of the present application provides a data processing method. In the method, a Map node processes data to be processed to obtain the data to be written, a Rowkey corresponding to the data to be written, and an identifier of the table to which the data to be written belongs. A Shuffle node determines, according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, a target Reduce node for processing the data to be written, and sends the data to be written, the corresponding Rowkey, and the table identifier to the target Reduce node. The target Reduce node writes the data to be written and the corresponding Rowkey into a temporary directory corresponding to the table to which the data belongs. Finally, the DoBulkload node transfers the HFile file under the temporary directory corresponding to the table to the actual directory corresponding to the table in the database. The method can effectively improve the data storage efficiency and reduce the influence on data query.
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application are described in detail below with reference to the accompanying drawings and specific embodiments:
referring to fig. 3, a flowchart of a data processing method according to an embodiment of the present application is shown. The flow is applied to a data processing system comprising a Map node, a Shuffle node, a Reduce node and a DoBulkload node.
As shown in fig. 3, the process may include the following steps:
step 301, the Map node processes the data to be processed to obtain the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs.
It should be noted that, in the embodiment of the present application, the data to be written may be written into an HBase database, where the HBase database stores data in the form of tables. Each table may include a plurality of key-value pairs, where the key is the row key (Rowkey) corresponding to the data and the value is the data itself, i.e., the data to be written. The data processing system in the embodiment of the present application comprises at least one Map node.
In this step, after the Map node processes the data to be processed, at least one data to be written may be obtained, and the Rowkey corresponding to each data to be written and the identifier of the table to which each data to be written belongs may be determined, where the data to be processed may be service data.
That is, after the Map node processes the data to be processed, data to be written belonging to different tables can be obtained. By determining the identifier of the table to which each piece of data to be written belongs, the Map node lays a foundation for the subsequent nodes to output each piece of data to be written to a different table according to that identifier.
Step 302, the Shuffle node determines a target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, and sends the data to be written, the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs to the target Reduce node.
It should be noted that the data processing system may include a plurality of Reduce nodes, which process data in parallel and operate independently of one another.
In this step, the Shuffle node needs to determine which Reduce node processes the data to be written according to the identifier of the table to which the data to be written output by the Map node belongs and the Rowkey corresponding to the data to be written. The determined Reduce node for processing the data to be written is referred to as a target Reduce node. And the Shuffle node sends the data to be written, the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs to the target Reduce node. Subsequent processing is performed by the target Reduce node.
Step 303, the target Reduce node writes the data to be written and the Rowkey corresponding to the data to be written into the HFile file under the temporary directory corresponding to the table to which the data to be written belongs.
In the embodiment of the application, the target Reduce node can directly generate the HFile file containing the data to be written and the Rowkey corresponding to the data. The HFile file is a format file which can be directly stored to the bottom layer of the HBase database.
In this step, the target Reduce node may create a temporary directory corresponding to the table according to the identifier of the table to which the data to be written belongs, and then store the HFile file in the temporary directory.
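The following is a hedged sketch of such a target Reduce node. The "<tableId>#<rowkey>" key format, the column family "cf", the qualifier "data", and the use of MultipleOutputs to route KeyValues to per-table base paths are all assumptions made for illustration; whether MultipleOutputs can wrap HFileOutputFormat2 this way depends on the Hadoop/HBase versions in use, and a production setup might instead run one job per table or use MultiTableHFileOutputFormat where available.

```java
// A hedged sketch of the target Reduce node in step 303: it parses the
// "<tableId>#<rowkey>" composite key, turns each value into an HBase KeyValue,
// and writes it under a per-table base path (the table's temporary directory).
import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class HFileWritingReducer
        extends Reducer<Text, Text, ImmutableBytesWritable, KeyValue> {

    private MultipleOutputs<ImmutableBytesWritable, KeyValue> outputs;

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String[] parts = key.toString().split("#", 2);     // tableId # rowkey
        byte[] rowkey = Bytes.toBytes(parts[1]);
        for (Text value : values) {
            KeyValue kv = new KeyValue(rowkey, Bytes.toBytes("cf"),
                    Bytes.toBytes("data"), Bytes.toBytes(value.toString()));
            // The named output (registered per table in the driver) and the base
            // output path select the table's temporary directory, e.g.
            // outputDir/Table1/... or outputDir/Table2/...
            outputs.write(parts[0], new ImmutableBytesWritable(rowkey), kv, parts[0] + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        outputs.close();
    }
}
```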
In step 304, the DoBulkload node transfers the HFile file in the temporary directory corresponding to the table to the actual directory corresponding to the table.
As mentioned above, the HFile file is a format file that can be directly stored in the bottom layer of the HBase database.
Therefore, in the embodiment of the application, the HFile file import function of the DoBulkload node is utilized to directly transfer the HFile file in the temporary directory corresponding to each table to the actual directory corresponding to each table.
Specifically, the DoBulkload node groups the HFile files according to the Region to which each HFile file belongs, determines a Region server for managing the Region, and imports the HFile files into an actual directory by interacting with the Region server.
It should be noted that the actual directory of each table in the present application is dedicated to store the data to be written that needs to be put into the table.
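A minimal sketch of this transfer step using the HBase bulk-load tool is shown below. The class and method used (org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles and its four-argument doBulkLoad) follow the HBase 1.x API; HBase 2.x relocates and renames this tool, so treat the exact call as an assumption to adapt to the installed version, and the directory path as illustrative:

```java
// A sketch of the DoBulkload step (304): moving finished HFiles from a table's
// temporary directory into the table's actual directory via the bulk-load tool.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class DoBulkload {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(TableName.valueOf("Table1"));
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("Table1"))) {
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            // HFiles previously written by the Reduce nodes into the table's
            // temporary directory (path is illustrative; it must contain the
            // column family subdirectories produced by the job).
            loader.doBulkLoad(new Path("/outputDir/Table1"), admin, table, locator);
        }
    }
}
```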
The flow shown in fig. 3 is completed.
As can be seen from the flow shown in fig. 3, in the embodiment of the present application, after the Map node processes the data to be processed, multiple pieces of data to be written may be obtained and the identifier of the table to which each piece belongs is determined. The subsequent nodes (the Shuffle node, the Reduce nodes, and the DoBulkload node) can distinguish data to be written belonging to different tables according to the table identifiers provided by the Map node, and then output each piece of data to be written to a different table, thereby implementing multi-table output. In addition, in the present application, since the HFile file that can be stored directly at the bottom layer of the database is generated before the data is put into the database, and the HFile importing function of the DoBulkload node is used to put the data directly into the database, the RegionServer is not required to execute a series of write operations for each piece of data; therefore, the data warehousing efficiency can be effectively improved and the influence on data query is reduced.
Next, a process of processing the data to be processed by the Map node in step 301 to obtain the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs is described. Referring to fig. 4, a flow of implementing step 301 is shown in the embodiment of the present application.
As shown in fig. 4, the process may include the following steps:
in step 401, the Map node obtains the characteristics of the created table.
The HBase database stores data in the form of tables, and therefore, before writing data, various types of tables required for satisfying a service need to be created according to a specific service requirement. For example, an identification card information table for storing identification card information; a city population information table for storing city population information.
In this step, the Map node obtains the characteristics of the created tables. For example, the ID card information table may include characteristics such as name, gender, ID card number, and home address; the city population information table may include characteristics such as name, ID card number, and home address.
Step 402, the Map node extracts the data to be written which is matched with the characteristics of the table from the data to be processed.
Here, it should be noted that the data to be processed is raw data, and may include data that is not required to be stored in the database. For this purpose, this step requires extracting the data to be stored in the database from the data to be processed.
Specifically, the Map node extracts the data to be written that matches the characteristics of a table from the data to be processed according to the characteristics of the created table. For example, the Map node reads a piece of data to be processed in the format "name + gender + ID card number + home address", specifically "Xiao Wang + male + 123456 + Xi'an, Shaanxi". A city population information table is created in advance in the HBase database and includes the following characteristics: name, ID card number, home address. The Map node then extracts the data to be written, "Xiao Wang + 123456 + Xi'an, Shaanxi", from the data to be processed according to the characteristics included in the city population information table.
And step 403, generating Rowkey corresponding to the data to be written by the Map node according to a preset generation rule.
As previously described, data is stored in the form of key-value pairs in the table. And the key is a Rowkey corresponding to the data. The process of generating Rowkey by the Map node can be implemented by using the prior art, and is not described in detail here.
In step 404, the Map node determines the identifier of the table matched with the data to be written as the identifier of the table to which the data to be written belongs.
As described above, the Map node extracts the data to be written based on the characteristics of the table, and thus it can be determined that the data to be written belongs to that table, whose identifier is the identifier of the table to which the data to be written belongs. For example, the Map node extracts the data to be written, "Xiao Wang + 123456 + Xi'an, Shaanxi", according to the characteristics of the city population information table; this data to be written belongs to the city population information table, and the identifier of the table to which it belongs is the identifier of the city population information table.
The flow shown in fig. 4 is completed. Through the flow shown in fig. 4, the Map node extracts the characteristics of each table and then determines the data to be written that matches those characteristics, so the identifier of the table to which each piece of data to be written belongs can be determined flexibly, quickly, accurately, and in a highly general way. In addition, the Map node can extract the data to be written for each table and determine the identifier of the table to which it belongs, laying a foundation for the subsequent Shuffle node to distribute data based on the table identifier.
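A minimal Mapper sketch of this flow is given below. The "+"-separated field layout, the Rowkey rule, and the "<tableId>#<rowkey>" key format are assumptions taken from the examples in this description, not a definitive implementation:

```java
// A minimal Mapper sketch of the flow in Fig. 4 (hypothetical field layout and
// Rowkey rule). For each raw "name+gender+id+address" record it extracts the
// fields matching each created table's characteristics, generates a Rowkey, and
// emits a "<tableId>#<rowkey>" key so the table identifier travels downstream.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultiTableMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Raw record, e.g. "Xiao Wang+male+123456+Xi'an, Shaanxi"
        String[] f = record.toString().split("\\+");
        String name = f[0], gender = f[1], idNumber = f[2], address = f[3];

        // Table1 (ID card information table): name, gender, ID number, home address.
        String rowkey1 = generateRowkey(idNumber);
        context.write(new Text("Table1#" + rowkey1),
                      new Text(name + "+" + gender + "+" + idNumber + "+" + address));

        // Table2 (city population information table): name, ID number, home address.
        String rowkey2 = generateRowkey(name);
        context.write(new Text("Table2#" + rowkey2),
                      new Text(name + "+" + idNumber + "+" + address));
    }

    // Stand-in for the preset Rowkey generation rule referred to in the description.
    private String generateRowkey(String seed) {
        return seed;
    }
}
```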
The following describes a process of determining, by the Shuffle node in step 302, a target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written. Referring to fig. 5, a flow of implementing step 302 is shown for the embodiment of the present application.
As shown in fig. 5, the process may include the following steps:
step 501, for each Reduce node, the Shuffle node acquires a preset Rowkey range of a Region corresponding to the Reduce node and an identifier of a table to which the Region belongs.
It should be noted that, in the embodiment of the present application, the number of Reduce nodes is the same as the total number of regions of each table created in advance. Each Reduce node is configured to process data within a Region. For example, 2 tables are created in advance, and are respectively marked as Table1 and Table 2; wherein, Table1 includes 2 regions, which are respectively marked as Region1 and Region 2; table2 includes 1 Region, which is designated as Region3, and 3 corresponding Reduce nodes, which are designated as Reducer 1-Reducer 3. On this basis, Reducer1 may be configured to process data belonging to Region1, Reducer2 to process data belonging to Region2, and Reducer3 to process data belonging to Region 3.
Each Region corresponds to a certain Rowkey range. For example, the Rowkey range of Region1 is Rowkey 0-Rowkey 10; the Rowkey range of Region2 is Rowkey 11-Rowkey 20; the Rowkey range of Region3 is RowkeyA-RowkeyZ.
In this step, the Shuffle node may obtain, according to the above configuration, the Rowkey range of the data to be written that can be processed by each Reduce node and the identifier of the table to which the Shuffle node belongs. For example, according to the foregoing configuration, Reducer1 can process data to be written belonging to Table1 and having Rowkey between Rowkey0 and Rowkey 10; reducer2 can process the data to be written, which belongs to Table1 and is located between Rowkey11 and Rowkey 20; reducer3 may process data to be written belonging to Table2 having Rowkey between RowkeyA and RowkeyZ.
In step 502, the Shuffle node matches the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written against the pre-configured Rowkey range and table identifier corresponding to each Reduce node.
In step 503, the Shuffle node determines the matched Reduce node as a target Reduce node for processing the data to be written.
For example, the identifier of the Table to which the current data to be written belongs is Table1, and the Rowkey corresponding to the data to be written is Rowkey 15. The identifier of the corresponding Table of Reducer1 is Table1, and the corresponding Rowkey range is Rowkey 0-Rowkey 10; the identifier of the corresponding Table of Reducer2 is Table1, and the corresponding Rowkey range is Rowkey 11-Rowkey 20; the identifier of the Reducer3 corresponding Table is Table2, and the corresponding Rowkey range is RowkeyA to RowkeyZ. It can be seen that the current data to be written falls within the processing range of Reducer2, and therefore, Reducer2 is determined to be the target Reduce node for processing the current data to be written.
The flow shown in fig. 5 is completed.
Through the process shown in fig. 5, the Shuffle node may determine a target Reduce node that processes the data to be written, and then distribute the data to be written to the target Reduce node for processing.
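One way to realize this routing in a MapReduce job is a custom Partitioner, sketched below under the assumption that the "<tableId>#<rowkey>" composite key described in the next section is used. The hard-coded Region list mirrors the three-Region example above; in practice the (table, Rowkey range, Reduce index) list would be read from the job configuration rather than hard-coded:

```java
// A hedged sketch of the routing in Fig. 5: a custom Partitioner that parses the
// "<tableId>#<rowkey>" key and selects the Reduce task pre-assigned to the Region
// whose table and Rowkey range match.
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class TableRegionPartitioner extends Partitioner<Text, Text> {

    private static class RegionSlot {
        final String tableId, startKey, endKey;
        final int reducer;
        RegionSlot(String t, String s, String e, int r) {
            tableId = t; startKey = s; endKey = e; reducer = r;
        }
    }

    // One entry per pre-created Region, mirroring the example in the description.
    private final List<RegionSlot> slots = Arrays.asList(
            new RegionSlot("Table1", "Rowkey0",  "Rowkey10", 0),   // Reducer1 / Region1
            new RegionSlot("Table1", "Rowkey11", "Rowkey20", 1),   // Reducer2 / Region2
            new RegionSlot("Table2", "RowkeyA",  "RowkeyZ",  2));  // Reducer3 / Region3

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String[] parts = key.toString().split("#", 2);   // tableId # rowkey
        for (RegionSlot slot : slots) {
            if (slot.tableId.equals(parts[0])
                    && parts[1].compareTo(slot.startKey) >= 0
                    && parts[1].compareTo(slot.endKey) <= 0) {
                return slot.reducer % numPartitions;
            }
        }
        // No matching Region: fall back to hashing so the record is not dropped.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With this sketch, a key such as Table1#Rowkey15 falls into the Table1 range Rowkey11 to Rowkey20 and is routed to the Reduce task standing in for Reducer2, matching the worked example above.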
The following describes the transmission process of the table identifier. Referring to fig. 6, an implementation flow of a table identifier transmission process shown in the embodiment of the present application is shown.
As shown in fig. 6, the process may include the following steps:
step 601, the Map node generates a target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs.
After the Map node obtains the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs through step 301, the Map node can determine the target key corresponding to the data to be written through the step.
And the target key is a new key generated by the Map node according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs.
As an embodiment, the Map node may directly splice the identifier of the table to which the data to be written belongs with the Rowkey corresponding to the data to be written, and use the spliced result as the target key corresponding to the data to be written. For example, if the Rowkey corresponding to the data to be written is Rowkey1 and the identifier of the table to which the data to be written belongs is Table1, the generated target key corresponding to the data to be written may be Table1#Rowkey1.
Step 602, the Shuffle node analyzes the target key, and obtains the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs.
In this step, the Shuffle node obtains, from the target key provided by the Map node, the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs by parsing. For example, the Map node generates a target key Table1# Rowkey1 corresponding to the data to be written by using the aforementioned splicing manner, and the Shuffle node can analyze the target key according to the splicing rule of the Map node, and obtain the Rowkey1 corresponding to the data to be written and an identifier Table1 of a Table to which the data to be written belongs.
After the Shuffle node obtains the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs through the step, step 302 can be executed to determine the target Reduce node for processing the data to be written.
The flow shown in fig. 6 is completed.
As can be seen from the flow shown in fig. 6, in the embodiment of the present application, the identifier of the table to which the data to be written belongs may be transmitted between the Map node and the Shuffle node based on a key (target key).
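A small sketch of this key handling is shown below; the "#" separator is an assumption, and any delimiter that cannot occur inside a table identifier would work:

```java
// A sketch of the target-key handling in Fig. 6: the Map side splices the table
// identifier and the Rowkey into one key, and the Shuffle side parses them back out.
public final class TargetKey {

    private static final String SEPARATOR = "#";

    // Map side: splice "<tableId>" and "<rowkey>" into the target key.
    public static String splice(String tableId, String rowkey) {
        return tableId + SEPARATOR + rowkey;
    }

    // Shuffle side: parse the target key back into [tableId, rowkey].
    public static String[] parse(String targetKey) {
        return targetKey.split(SEPARATOR, 2);
    }

    public static void main(String[] args) {
        String key = splice("Table1", "Rowkey1");        // "Table1#Rowkey1"
        String[] parts = parse(key);                     // ["Table1", "Rowkey1"]
        System.out.println(key + " -> table=" + parts[0] + ", rowkey=" + parts[1]);
    }
}
```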
The data processing method provided by the present application is described below by a specific embodiment.
Referring to fig. 7, a schematic structural diagram of a data processing system according to an embodiment of the present application is shown. The system comprises 2 Map nodes, namely Mapper1 and Mapper2; 3 Reduce nodes, namely Reducer1 to Reducer3; 1 DoBulkload node; and 1 Shuffle node (not shown in the figure).
The file to be processed is stored in the HDFS system in advance. The file to be processed comprises a plurality of data to be processed, and the data format of the data to be processed is name + sex + identity card number + family address. The data to be processed in the file to be processed may be divided into two parts in advance, and processed by Mapper1 and Mapper2, respectively.
2 tables are created in advance in the HBase database, namely an identity card information Table (marked as Table1) and a city population information Table (marked as Table 2). Table1 includes Table features: name, gender, identification card number, home address; table2 includes Table features: name, identification card number, home address.
Wherein, Table1 includes 2 regions, which are Region1 and Region2, Region1 corresponds to Rowkey range from Rowkey0 to Rowkey10, and Region2 corresponds to Rowkey range from Rowkey11 to Rowkey 20; table2 includes 1 Region, which is designated as Region3, and the Region3 corresponds to Rowkey in the Rowkey range Rowkey A-Rowkey Z.
Configuring a Reducer1 to process data to be written between Rowkey0 and Rowkey10 in Table1, namely the data to be written belonging to Region 1; configuring a Reducer2 to process data to be written between Rowkey11 and Rowkey20 in Table1, namely the data to be written belonging to Region 2; configuring Reducer3 to process data to be written between RowkeyA and RowkeyZ in Table2, namely the data to be written belonging to Region 3.
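Under the above configuration, the job could be wired together roughly as sketched below. The class names MultiTableMapper, HFileWritingReducer, and TableRegionPartitioner refer to the illustrative sketches earlier in this description, the paths are hypothetical, and the full HFileOutputFormat2 configuration (compression, data block encoding, and so on) is elided because it depends on the HBase version:

```java
// A hedged sketch of the driver: two input splits feed the Map nodes, three
// Reduce tasks mirror the three pre-created Regions, and the custom partitioner
// routes records by table identifier and Rowkey range.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultiTableBulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "multi-table HFile generation");
        job.setJarByClass(MultiTableBulkLoadDriver.class);

        job.setMapperClass(MultiTableMapper.class);
        job.setReducerClass(HFileWritingReducer.class);
        job.setPartitionerClass(TableRegionPartitioner.class);
        job.setNumReduceTasks(3);                       // one Reduce task per pre-created Region

        job.setMapOutputKeyClass(Text.class);           // "<tableId>#<rowkey>"
        job.setMapOutputValueClass(Text.class);

        // One named output per table; the reducer selects the table's temporary
        // directory through the base output path it passes to MultipleOutputs.
        MultipleOutputs.addNamedOutput(job, "Table1", HFileOutputFormat2.class,
                ImmutableBytesWritable.class, KeyValue.class);
        MultipleOutputs.addNamedOutput(job, "Table2", HFileOutputFormat2.class,
                ImmutableBytesWritable.class, KeyValue.class);

        FileInputFormat.addInputPath(job, new Path("/input/toBeProcessed"));  // file to be processed
        FileOutputFormat.setOutputPath(job, new Path("/outputDir"));          // per-table temp dirs below this

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```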
Take the example in which Mapper1 reads a piece of data to be processed, "Xiao Wang + male + 123456 + Xi'an, Shaanxi". Mapper1 extracts the data to be written, "Xiao Wang + male + 123456 + Xi'an, Shaanxi", according to the table features of Table1, generates the Rowkey corresponding to this data to be written, recorded as Rowkey1, and obtains the identifier of the table to which it belongs, namely Table1. Similarly, Mapper1 may extract the data to be written, "Xiao Wang + 123456 + Xi'an, Shaanxi", according to the table features of Table2, generate the Rowkey corresponding to this data to be written, recorded as RowkeyB, and obtain the identifier of the table to which it belongs, namely Table2.
Taking the data to be written "Xiao Wang + 123456 + Xi'an, Shaanxi" extracted by Mapper1 as an example, Mapper1 generates a new key (Table2#RowkeyB) corresponding to this data to be written according to the identifier Table2 of the table to which it belongs and the RowkeyB corresponding to it.
The Shuffle node parses out the identifier Table2 of the table to which the data to be written belongs and the row key RowkeyB corresponding to the data to be written from the new key (Table2#RowkeyB) corresponding to the data to be written "Xiao Wang + 123456 + Xi'an, Shaanxi" provided by Mapper1. The Shuffle node acquires the pre-configured range of data to be written that each Reduce node can process; for example, Reducer1 can process data to be written in Table1 whose Rowkey is located between Rowkey0 and Rowkey10; Reducer2 can process data to be written in Table1 whose Rowkey is located between Rowkey11 and Rowkey20; Reducer3 can process data to be written in Table2 whose Rowkey is located between RowkeyA and RowkeyZ. According to these ranges, the Shuffle node determines that the current data to be written, "Xiao Wang + 123456 + Xi'an, Shaanxi", is to be sent to Reducer3 for processing. The Shuffle node sends the data to be written "Xiao Wang + 123456 + Xi'an, Shaanxi", the RowkeyB corresponding to the data to be written, and the identifier Table2 of the table to which it belongs to Reducer3.
After Reducer3 receives the data to be written "Xiao Wang + 123456 + Xi'an, Shaanxi" and the corresponding RowkeyB, it writes RowkeyB and "Xiao Wang + 123456 + Xi'an, Shaanxi" into hfile1 under the temporary directory corresponding to Table2, for example, /outputDir/Table2/columnFamily/hfile1.
Similarly, the Shuffle node may determine, according to the identifier Table1 of the table to which the data to be written "Xiao Wang + male + 123456 + Xi'an, Shaanxi" provided by Mapper1 belongs and the Rowkey1 corresponding to the data to be written, that the Reduce node for processing this data to be written is Reducer1. The Shuffle node sends the data to be written "Xiao Wang + male + 123456 + Xi'an, Shaanxi", the Rowkey1 corresponding to the data to be written, and the identifier Table1 of the table to which it belongs to Reducer1.
After receiving the data to be written "Xiao Wang + male + 123456 + Xi'an, Shaanxi" and the corresponding Rowkey1, Reducer1 writes Rowkey1 and "Xiao Wang + male + 123456 + Xi'an, Shaanxi" into hfile2 under the temporary directory corresponding to Table1, for example, /outputDir/Table1/columnFamily/hfile2.
Here, since Reducer1 is responsible for processing the data to be written in Region1 (corresponding to Rowkey0 to Rowkey10), all data written into hfile2 belongs to Region1; in other words, hfile2 belongs to Region1.
Through the processing process, the data to be written extracted by each Map node can be written into the hfile file under the temporary directory corresponding to each table respectively.
The DoBulkload node traverses the hfile files under the temporary directory corresponding to each table; taking Table1 as an example, it traverses hfile2 to hfile5 under the temporary directory corresponding to Table1. The DoBulkload node groups hfile2 to hfile5 according to the Region to which each of them belongs. For example, hfile2 and hfile3 belong to Region1, and hfile4 and hfile5 belong to Region2. The DoBulkload node acquires the information of the RegionServer (RS) responsible for managing each Region; for example, Region1 is managed by RS1, and Region2 is managed by RS2.
Through interaction with RS1, the DoBulkload node imports hfile2 and hfile3, which belong to Region1, into the actual directory corresponding to Table1, for example, /hbaseDir/Table1/Region1/columnFamily/hfile2 and /hbaseDir/Table1/Region1/columnFamily/hfile3.
Through interaction with RS2, the DoBulkload node imports hfile4 and hfile5, which belong to Region2, into the actual directory corresponding to Table1, for example, /hbaseDir/Table1/Region2/columnFamily/hfile4 and /hbaseDir/Table1/Region2/columnFamily/hfile5.
Similarly, the DoBulkload node may traverse the hfile file under the temporary directory corresponding to Table2, for example, hfile1. The DoBulkload node determines that hfile1 belongs to Region3. Region3 is managed by RS1, so the DoBulkload node imports hfile1, which belongs to Region3, into the actual directory corresponding to Table2 through interaction with RS1, for example, /hbaseDir/Table2/Region3/columnFamily/hfile1.
Data warehousing is thus completed.
In addition, the applicant performed the following tests for the existing data processing method and the data processing method provided in the present application:
the test hardware environment is as follows:
name of server Memory device CPU Network card Magnetic disk
Node1 252G 56 core 112 thread 2.00GHz Ten-thousand-million network card 18.2T+744.7G+558.4G
Node2 252G 56 core 112 thread 2.40GHz Ten-thousand-million network card 18.2T+744.7G+558.4G
Node3 252G 56 core 112 thread 2.40GHz Ten-thousand-million network card 18.2T+744.7G+558.4G
Node4 252G 56 core 112 thread 2.40GHz Ten-thousand-million network card 18.2T+1.1T
Table1 the configuration of each role in the HBase is as follows:
master memory Number of RegionServer RegionServer memory
32G 4 20G
Table2 the table configuration in HBase is as follows:
number of HBase tables Number of fields Number of columns per table Number of regions per table
4 146 1 10
TABLE 3
The data import efficiency is compared as follows:
Figure BDA0002328408850000161
TABLE 4
It can be seen that, under the same hardware environment, the same software configuration, and the same data volume, the data processing scheme of the present application can greatly improve the data storage efficiency compared with the existing data processing scheme, and the larger the data volume, the more obvious the improvement.
In order to describe the method provided by the embodiment of the present application, the following describes a system provided by the embodiment of the present application:
referring to fig. 8, a schematic structural diagram of a system provided in an embodiment of the present application is shown. The system comprises: map node 801, Shuffle node 802, Reduce node 803, and DoBulkload node 804,
the Map node 801 is configured to process data to be processed to obtain data to be written, a Rowkey corresponding to the data to be written, and an identifier of a table to which the data to be written belongs;
the Shuffle node 802 is configured to determine, according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, a target Reduce node that processes the data to be written, and send the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs to the target Reduce node 803;
the target Reduce node 803 is configured to write the data to be written and the Rowkey corresponding to the data to be written into the HFile file in the temporary directory corresponding to the table to which the data to be written belongs;
the DoBulkload node 804 is configured to transfer the HFile file in the temporary directory corresponding to the table to the actual directory corresponding to the table.
As an embodiment, the processing, by the Map node 801, of the data to be processed to obtain the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs includes:
obtaining characteristics of the created table;
extracting data to be written which is matched with the characteristics of the table from the data to be processed;
generating Rowkey corresponding to the data to be written according to a preset generation rule;
and determining the identifier of the table matched with the data to be written as the identifier of the table to which the data to be written belongs.
As an embodiment, the Map node 801 is further configured to generate a target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs;
the Shuffle node 802 is further configured to parse the target key, and obtain a Rowkey corresponding to the data to be written and an identifier of a table to which the data to be written belongs.
As an embodiment, the generating, by the Map node 801, a target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs includes:
and the Map node 801 splices the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, and uses the spliced identifier as a target key corresponding to the data to be written.
As an embodiment, the determining, by the Shuffle node 802, a target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written includes:
aiming at each Reduce node, acquiring a preset Rowkey range of a Region corresponding to the Reduce node and an identifier of a table to which the Region belongs;
matching the Rowkey range corresponding to each preset Reduce node and the table identification according to the table identification to which the data to be written belongs and the Rowkey corresponding to the data to be written;
and determining the matched Reduce node as a target Reduce node 803 for processing the data to be written.
This completes the description of the system of fig. 8.
The above description is only a preferred embodiment of the present application, and should not be taken as limiting the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present application shall be included in the scope of the present application.

Claims (10)

1. A data processing method applied to a data processing system, wherein the system comprises at least one Map node, a Shuffle node, at least one Reduce node, and a DoBulkload node, and the method comprises the following steps:
the Map node processes data to be processed to obtain data to be written, Rowkey corresponding to the data to be written and an identifier of a table to which the data to be written belongs;
the Shuffle node determines a target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, and sends the data to be written, the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs to the target Reduce node;
the target Reduce node writes the data to be written and the Rowkey corresponding to the data to be written into the HFile file under the temporary directory corresponding to the table to which the data to be written belongs;
and the DoBulkload node transfers the HFile file under the temporary directory corresponding to the table to the actual directory corresponding to the table.
2. The method of claim 1, wherein the processing, by the Map node, of the data to be processed to obtain the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs includes:
obtaining characteristics of the created table;
extracting data to be written which is matched with the characteristics of the table from the data to be processed;
generating Rowkey corresponding to the data to be written according to a preset generation rule;
and determining the identifier of the table matched with the data to be written as the identifier of the table to which the data to be written belongs.
3. The method as claimed in claim 1, wherein before the Shuffle node determines the target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, the method further comprises:
the Map node generates a target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs;
and the Shuffle node analyzes the target key to obtain the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs.
4. The method of claim 3, wherein the generating, by the Map node, the target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs includes:
and the Map node splices the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, and the spliced identifier is used as a target key corresponding to the data to be written.
5. The method as claimed in claim 1, wherein the determining, by the Shuffle node, the target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written includes:
aiming at each Reduce node, acquiring a preset Rowkey range of a Region corresponding to the Reduce node and an identifier of a table to which the Region belongs;
matching the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written against the preset Rowkey range and table identifier corresponding to each Reduce node;
and determining the matched Reduce node as a target Reduce node for processing the data to be written.
6. A data processing system comprising at least one Map node, a Shuffle node, at least one Reduce node, and a DoBulkload node, wherein:
the Map node is used for processing data to be processed to obtain data to be written, a Rowkey corresponding to the data to be written and an identifier of a table to which the data to be written belongs;
the Shuffle node is configured to determine, according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, a target Reduce node that processes the data to be written, and send the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs to the target Reduce node;
the target Reduce node is used for writing the data to be written and the Rowkey corresponding to the data to be written into the HFile file under the temporary directory corresponding to the table to which the data to be written belongs;
the DoBulkload node is used for transferring the HFile file in the temporary directory corresponding to the table to the actual directory corresponding to the table.
7. The system of claim 6, wherein the processing, by the Map node, of the data to be processed to obtain the data to be written, the Rowkey corresponding to the data to be written, and the identifier of the table to which the data to be written belongs includes:
obtaining characteristics of the created table;
extracting data to be written which is matched with the characteristics of the table from the data to be processed;
generating Rowkey corresponding to the data to be written according to a preset generation rule;
and determining the identifier of the table matched with the data to be written as the identifier of the table to which the data to be written belongs.
8. The system of claim 6,
the Map node is further configured to generate a target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs;
the Shuffle node is further configured to parse the target key, and obtain a Rowkey corresponding to the data to be written and an identifier of a table to which the data to be written belongs.
9. The system of claim 8, wherein the Map node generates the target key corresponding to the data to be written according to the Rowkey corresponding to the data to be written and the identifier of the table to which the data to be written belongs, and the method includes:
and the Map node splices the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written, and the spliced identifier is used as a target key corresponding to the data to be written.
10. The system as claimed in claim 6, wherein the determining, by the Shuffle node, the target Reduce node for processing the data to be written according to the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written includes:
aiming at each Reduce node, acquiring a preset Rowkey range of a Region corresponding to the Reduce node and an identifier of a table to which the Region belongs;
matching the identifier of the table to which the data to be written belongs and the Rowkey corresponding to the data to be written against the preset Rowkey range and table identifier corresponding to each Reduce node;
and determining the matched Reduce node as a target Reduce node for processing the data to be written.
CN201911326062.0A 2019-12-20 2019-12-20 Data processing method and system Active CN111159112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911326062.0A CN111159112B (en) 2019-12-20 2019-12-20 Data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911326062.0A CN111159112B (en) 2019-12-20 2019-12-20 Data processing method and system

Publications (2)

Publication Number Publication Date
CN111159112A CN111159112A (en) 2020-05-15
CN111159112B true CN111159112B (en) 2022-03-25

Family

ID=70557477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911326062.0A Active CN111159112B (en) 2019-12-20 2019-12-20 Data processing method and system

Country Status (1)

Country Link
CN (1) CN111159112B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312383A (en) * 2021-06-01 2021-08-27 拉卡拉支付股份有限公司 Data query method, data query device, electronic equipment, storage medium and program product

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8972986B2 (en) * 2012-05-25 2015-03-03 International Business Machines Corporation Locality-aware resource allocation for cloud computing
US9424274B2 (en) * 2013-06-03 2016-08-23 Zettaset, Inc. Management of intermediate data spills during the shuffle phase of a map-reduce job
CN104915450B (en) * 2015-07-01 2017-11-28 武汉大学 A kind of big data storage and retrieval method and system based on HBase
CN106407355A (en) * 2016-09-07 2017-02-15 中国农业银行股份有限公司 Data storage method and device
CN106970929B (en) * 2016-09-08 2020-09-01 阿里巴巴集团控股有限公司 Data import method and device
CN106295403A (en) * 2016-10-11 2017-01-04 北京集奥聚合科技有限公司 A kind of data safety processing method based on hbase and system
CN109271365A (en) * 2018-09-19 2019-01-25 浪潮软件股份有限公司 Method for accelerating reading and writing of HBase database based on Spark memory technology
CN109582831B (en) * 2018-10-16 2022-02-01 中国科学院计算机网络信息中心 Graph database management system supporting unstructured data storage and query

Also Published As

Publication number Publication date
CN111159112A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
CN107818115B (en) Method and device for processing data table
CN104182405B (en) Method and device for connection query
WO2015106711A1 (en) Method and device for constructing nosql database index for semi-structured data
CN103703467B (en) Method and apparatus for storing data
CN108255958A (en) Data query method, apparatus and storage medium
CN107704202B (en) Method and device for quickly reading and writing data
CN110674154B (en) Spark-based method for inserting, updating and deleting data in Hive
CN102725755B (en) Method and system of file access
CN106326429A (en) Hbase second-level query scheme based on solr
CN111258978B (en) Data storage method
WO2015110062A1 (en) Distributed data storage method, device and system
CN107977396B (en) Method and device for updating data table of KeyValue database
CN110597852B (en) Data processing method, device, terminal and storage medium
CN107085570A (en) Data processing method, application server and router
CN106970929A (en) Data lead-in method and device
CN104408044A (en) File access method and system
KR20160100216A (en) Method and device for constructing on-line real-time updating of massive audio fingerprint database
CN105868253A (en) Data importing and query methods and apparatuses
CN103077208A (en) Uniform resource locator matching processing method and device
CN110442585A (en) Data-updating method, data update apparatus, computer equipment and storage medium
CN111159112B (en) Data processing method and system
CN107493309B (en) File writing method and device in distributed system
CN108038253B (en) Log query processing method and device
CN104408084A (en) Method and device for screening big data
CN111680030A (en) Data fusion method and device, and data processing method and device based on meta information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant