CN108228752B

CN108228752B - Data total export method, data export task allocation device and data export node device

Info

Publication number: CN108228752B
Application number: CN201711395359.3A
Authority: CN
Inventors: 牛龙飞; 陈斌; 周一峰
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2017-12-21
Filing date: 2017-12-21
Publication date: 2022-04-15
Anticipated expiration: 2037-12-21
Also published as: CN108228752A

Abstract

The invention provides a data full-scale export method, a data export task allocation device and a data export node device. The method comprises the following steps: the data export task allocation device determines the service unit that manages each data unit in a data table to be exported, wherein the data table to be exported comprises at least one data unit; and the data export task allocation device allocates the export task of each data unit in the data table to be exported to the host where the service unit managing that data unit is located, so that the data export node device deployed on the host divides the data units it manages by weighted averaging and then exports the data. The method and the devices reduce network IO while improving data export efficiency.

Description

Data total export method, data export task allocation device and data export node device

Technical Field

The present invention relates to the field of communications, and in particular, to a data full-scale export method, a data export task allocation apparatus, and a data export node apparatus.

Background

HBase is a highly reliable, high-performance, column-oriented and scalable distributed storage system. It provides two ways of looking up data in a table, namely Get and Scan. The Get method retrieves a single record according to a specified Rowkey, while the Scan method retrieves, in one request, all records whose Rowkey lies between a defined StartRowkey and EndRowkey. The design of HBase makes Rowkey-based retrieval very efficient, but if the retrieval condition is an ordinary column, a full table scan is required: a Scan query object that specifies neither StartRowkey nor EndRowkey is constructed and a request is initiated. Full data export is a practical application of this scenario.
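The difference between Get, a bounded Scan and a full table scan can be illustrated with a small in-memory model. This is only a sketch: `ToyTable` is a hypothetical stand-in for a table sorted by Rowkey, not the HBase client API, and the half-open range (start included, end excluded) is an assumption of the sketch.

```python
from bisect import bisect_left


class ToyTable:
    """Hypothetical in-memory table whose rows are kept sorted by Rowkey."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self._keys = sorted(self._rows)

    def get(self, rowkey):
        # Get: at most one record for the given Rowkey.
        return self._rows.get(rowkey)

    def scan(self, start_rowkey=None, end_rowkey=None):
        # Scan: records with start_rowkey <= Rowkey < end_rowkey (assumed bounds).
        # Leaving both bounds unset corresponds to a full table scan.
        lo = 0 if start_rowkey is None else bisect_left(self._keys, start_rowkey)
        hi = len(self._keys) if end_rowkey is None else bisect_left(self._keys, end_rowkey)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = ToyTable({"r1": "a", "r2": "b", "r3": "c", "r4": "d"})
one = table.get("r2")            # single record
ranged = table.scan("r2", "r4")  # bounded range
full = table.scan()              # full table scan
```

The full scan touches every row, which is why exporting a whole table is the expensive case the rest of the description addresses.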

In the prior art, a MapReduce batch task mode is usually adopted for exporting the full data. The method fully utilizes the computing resources of the whole HBase cluster, and disperses the data export task of the whole table to each node in the cluster for operation. By means of the MapReduce framework, a user only needs to write two functions of map and reduce, create HBase connection, open a specified table, construct a Scan object and send a query request in the map function, and directly process a result set or send data in the result set out for processing in the reduce stage. The MapReduce framework divides the task into a plurality of fine-grained tasks, disperses the fine-grained tasks to each node in the cluster to run in parallel, and outputs a final result set to an output directory configured on the HDFS.

However, the MapReduce batch task mode generally allocates tasks according to how idle the CPU of each node in HBase is. The host of the node that receives a task may hold no copy of the data to be exported, so the data must be fetched from other hosts. As a result, a large amount of data is copied between hosts in the cluster and high network IO is generated; in extreme cases, communication between service processes in the cluster may be affected while the CPUs of the hosts remain idle, so the resources of the cluster are not utilized in a balanced manner. In addition, the MapReduce batch task mode generally allocates the data to be exported to the nodes at a granularity of one task per Region. Because the amount of data differs greatly between Regions, 80% of the tasks may take 20% of the time while the remaining 20% of the tasks take 80% of the time, so the data export efficiency is low.
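The effect of skewed Region sizes on total export time can be checked with a small calculation. The sketch below is illustrative only: the Region sizes are invented, one unit of data is assumed to take one unit of time, and `makespan` is a hypothetical helper using simple greedy scheduling.

```python
def makespan(task_sizes, workers):
    """Time until the last worker finishes, scheduling the largest tasks first."""
    loads = [0] * workers
    for size in sorted(task_sizes, reverse=True):
        # Give the next task to the currently least-loaded worker.
        loads[loads.index(min(loads))] += size
    return max(loads)


# Hypothetical Region sizes: one large Region dominates the table.
region_sizes = [80, 5, 5, 5, 5]

# One task per Region: the worker holding the large Region finishes last.
per_region = makespan(region_sizes, workers=4)

# The same total volume divided evenly among the four workers.
evenly_split = makespan([sum(region_sizes) / 4] * 4, workers=4)
```

With these numbers the one-task-per-Region plan takes 80 time units while the even split takes 25, which is the imbalance the even division in the embodiments is meant to remove.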

Disclosure of Invention

The invention provides a data full-scale export method, a data export task allocation device and a data export node device, which are used for improving the data export efficiency of HBase and reducing the high network IO generated during HBase data export.

A first aspect of the invention provides a data full-scale export method, which comprises the following steps: the data export task allocation device determines the service unit that manages each data unit in a data table to be exported, wherein the data table to be exported comprises at least one data unit; and the data export task allocation device allocates the export task of each data unit in the data table to be exported to the host where the service unit managing that data unit is located, so that the data export node device deployed on the host divides the data units it manages by weighted averaging and then exports the data.

Another aspect of the present invention provides a method for exporting data in full scale, including: the data export node device receives a data unit export task distributed by the data export task distribution device, and a data unit corresponding to the data unit export task is managed by the data export node device deployed on a host where a service unit for managing the data unit is located; the data export node device calls a service unit to obtain a copy of a data unit to be exported currently from a data node of an HDFS (Hadoop distributed file system) configured on a host where the service unit is located according to the data unit export task, and the data node stores the copies of all data units managed by the service unit; and the data export node device evenly distributes the copies of the data units to each thread pool for data export.

Still another aspect of the present invention provides a data export task allocation device, including: an analysis module, configured to determine the service unit that manages each data unit in a data table to be exported, wherein the data table to be exported comprises at least one data unit; and an allocation module, configured to allocate the export task of each data unit in the data table to be exported to the host where the service unit managing that data unit is located, so that the data export node device deployed on the host divides the data units it manages by weighted averaging and then exports the data.

Still another aspect of the present invention provides a data exporting node apparatus, comprising: the receiving module is used for receiving the data unit export task distributed by the data export task distribution device, and the data unit corresponding to the data unit export task is managed by the data export node device deployed on the host where the service unit for managing the data unit is located; the copy obtaining module is used for calling the service unit to obtain a copy of the current data unit to be exported from a data node of an HDFS (Hadoop distributed File System) configured on a host where the service unit is located according to the data unit export task, and the data node stores the copies of all data units managed by the service unit; and the splitting module is used for distributing the copies of the data units to all the thread pools evenly for data export.

According to the data total export method, the data export task allocation device and the data export node device, each Region in the data table to be exported in Hbase is allocated to the Region server managing the Region, so that local export of the data to be exported can be achieved, the data to be exported does not need to be called from other hosts, and network IO can be reduced on the basis of improving data export efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art according to the drawings.

Fig. 1 is a flowchart of a data full-scale export method according to an embodiment of the present invention;

Fig. 2 is a flowchart of a data full-scale export method according to a second embodiment of the present invention;

Fig. 3 is a block diagram of a data export task allocation apparatus according to a third embodiment of the present invention;

Fig. 4 is a structural diagram of a data export node apparatus according to a fourth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other examples obtained based on the examples in the present invention are within the scope of the present invention.

The terms to which the present invention relates will be explained first:

HBase: a highly reliable, high-performance, column-oriented, scalable distributed storage system;

service unit: namely a RegionServer, a server process of HBase that is deployed on a physical server and manages at least one Region;

data unit: namely a Region, the basic unit for storing and managing HBase data; each Region can be served by only one RegionServer.

Fig. 1 is a flowchart of a data full-scale export method according to an embodiment of the present invention, as shown in fig. 1, the method includes:

101. The data export task allocation device determines the service unit that manages each data unit in a data table to be exported, wherein the data table to be exported comprises at least one data unit.

In this embodiment, at least one Region is stored in any data table of Hbase, and each Region server manages one or more regions in the data table. Therefore, when data export is performed on the regions in the data table, it may be firstly analyzed by which Region server the Region to be exported is specifically managed.

102. The data export task allocation device allocates the export task of each data unit in the data table to be exported to the host where the service unit managing that data unit is located, so that the data export node device deployed on the host divides the data units it manages by weighted averaging and then exports the data.

In this embodiment, a data export node device is configured on the host where each RegionServer is located. After it is determined which RegionServer manages a Region to be exported, the Region can therefore be allocated to the data export node device deployed on the host of that RegionServer, and the data export node device exports the Region after dividing it by weighted averaging.

In practical applications, when an HBase cluster is deployed, the RegionServers and the DataNodes that store the underlying data are generally deployed on the same group of machines. According to the copy storage policy of HDFS, data written to a Region is first stored as one copy on the local DataNode, and then as one copy on a different node of the same rack and one copy on a node of a different rack. In theory, therefore, all data of the Regions on a RegionServer has one local copy, that is, the locality attribute of the Region is 100%. Even if HBase performs a balance operation or restarts the RegionServer, the locality attribute of the Region gradually returns to 100% as data is written and the Region is continually compacted. Therefore, after it is determined which RegionServer manages each Region to be exported, each Region export task in the data table to be exported may be allocated to the data export node device deployed on the host of the RegionServer managing that Region. That data export node device divides the Regions to be exported by weighted averaging and then distributes them to the thread pools for data export.

In the data full-scale export method provided by this embodiment, each Region in the to-be-exported data table in Hbase is allocated to the Region server that manages the Region, so that local export of the to-be-exported data can be realized, the to-be-exported data does not need to be called from other hosts, and further, network IO can be reduced on the basis of improving data export efficiency.
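Steps 101 and 102 amount to grouping the Regions of a table by the host of the RegionServer that manages them. The following is a minimal sketch under stated assumptions: `region_locations` is a hypothetical mapping that is assumed to have been looked up from the cluster beforehand, and the function name is illustrative rather than part of the described devices.

```python
from collections import defaultdict


def allocate_export_tasks(region_locations):
    """Group Region export tasks by the host that manages each Region.

    region_locations: hypothetical mapping of Region name -> host name of the
    RegionServer managing that Region.
    """
    tasks_by_host = defaultdict(list)
    for region, host in sorted(region_locations.items()):
        # The task goes to the export node on the managing host, so the data
        # can be read from the local copy instead of over the network.
        tasks_by_host[host].append(region)
    return dict(tasks_by_host)


locations = {"region-a": "host1", "region-b": "host2", "region-c": "host1"}
plan = allocate_export_tasks(locations)
```

Each host receives only the Regions it already serves, in contrast to scheduling by CPU idleness as in the background section.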

Further, since a data table to be exported includes at least one Region, the data table is exported only after all regions are exported, and therefore, on the basis of the above embodiment, the method further includes:

the data export task allocation device detects whether the export of the data unit to be exported managed by each data export node device is finished;

and if so, the data export task allocation device determines that all the data in the data table to be exported has been exported.

In this embodiment, in order to determine whether the data table currently to be exported has been fully exported, it may first be detected whether the Region export tasks for which each RegionServer is responsible have completed data export; if so, it may be determined that the data table currently to be exported has been fully exported.

According to the data full-scale export method provided by this embodiment, whether the export of the current data table is complete is judged by detecting whether the data on every RegionServer has been fully exported, which improves the accuracy of data export and avoids misjudging whether the export has finished.
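The completion check can be sketched as a conjunction over all nodes and all of their Regions. The status structure below is an assumption made for illustration; the source does not specify how the allocation device obtains the per-node state.

```python
def table_export_finished(node_status):
    """Return True only when every Region on every export node is exported.

    node_status: hypothetical mapping of host -> {region name: exported flag}.
    An empty status is treated as "not finished" to avoid a vacuous success.
    """
    if not node_status:
        return False
    return all(all(regions.values()) for regions in node_status.values())


in_progress = {
    "host1": {"region-a": True, "region-c": False},
    "host2": {"region-b": True},
}
done = {
    "host1": {"region-a": True, "region-c": True},
    "host2": {"region-b": True},
}
```

A single unfinished Region on any host keeps the whole table in the unfinished state.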

Fig. 2 is a flowchart of a data full-scale export method according to a second embodiment of the present invention, and as shown in fig. 2, the method further includes:

201. The data export node device receives the data unit export task allocated by the data export task allocation device, wherein the data unit corresponding to the data unit export task is managed by the data export node device deployed on the host where the service unit managing the data unit is located.

In this embodiment, each data export node device receives the Region export tasks assigned by the data export task allocation device. A Region export task is produced after the data export task allocation device determines which RegionServer manages each Region and allocates each Region to the data export node device deployed on the host of the RegionServer managing that Region.

202. The data export node device calls the service unit to acquire, according to the data unit export task, a copy of the data unit currently to be exported from the data node of the HDFS configured on the host where the service unit is located, wherein the data node stores copies of all the data units managed by the service unit.

In this embodiment, since the RegionServer and the data node storing the underlying data are generally deployed on the same group of machines, the data of all Regions on the RegionServer has one local copy, i.e., the locality attribute of the Region is 100%. Therefore, after the data export node device receives the Region export task, it can obtain the copy of the Region currently to be exported from the data node of the HDFS by calling the RegionServer, so that the data to be exported does not need to be obtained from other hosts and network IO is reduced.

203. The data export node device evenly distributes the copies of the data units to each thread pool for data export.

In this embodiment, the data export node device may perform a calculation on the copies of the Regions to be exported and allocate them equally to each thread pool for data export. Specifically, the weight of the Regions to be exported can be obtained by calculating the difference between the length of all Regions to be exported (i.e., endkey - startkey) and the number of thread pools, and the Regions to be exported are divided equally according to this weight. Alternatively, the Regions to be exported may be divided equally by any other calculation method, and the present invention is not limited in this respect.

It should be noted that, before the Regions are exported, the current local thread pool may be initialized so that it can be called for data export in the subsequent process, or M thread pools may be established for data export in the subsequent process, which is not limited herein.
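One possible way to divide the key ranges equally among thread pools is sketched below. Several assumptions are made that the source does not state: Rowkeys are modelled as integers, the length of a Region is taken as endkey minus startkey, and the per-pool share is taken as the total length divided by the number of thread pools. Since the text leaves the exact calculation open, this is one hypothetical instance rather than the described formula.

```python
from fractions import Fraction


def split_evenly(regions, pool_count):
    """Cut (startkey, endkey) ranges into per-pool lists of equal total length."""
    total = sum(end - start for start, end in regions)
    share = Fraction(total, pool_count)  # exact share, avoids float drift
    pools = [[] for _ in range(pool_count)]
    idx, room = 0, share
    for start, end in regions:
        while start < end:
            if room == 0:
                # The current pool has its full share; move to the next pool.
                idx, room = idx + 1, share
            take = min(end - start, room)
            pools[idx].append((start, start + take))
            start += take
            room -= take
    return pools


# Hypothetical Regions: one large key range and two small ones.
regions = [(0, 90), (90, 100), (100, 120)]
pools = split_evenly(regions, pool_count=3)
lengths = [sum(end - start for start, end in pool) for pool in pools]
```

The large Region is cut across pools while the small ones are packed together, so every pool covers the same key-range length. Equal key-range length only approximates equal data volume, which is a limitation of this simplified weighting.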

According to the data total export method provided by the embodiment, the locally stored Region to be exported is obtained and is evenly distributed to each thread pool for data export, so that the data quantity required to be exported of each thread pool is equal, and the data export efficiency is improved.

Optionally, in order to improve the quality of the exported data, the data to be exported may be further screened before step 203. On the basis of any of the foregoing embodiments, the method further includes:

the data export node device screens the copies of the data units to be exported according to preset screening conditions.

In this embodiment, before data export, the copy of a Region to be exported may be screened according to a preset screening condition. Specifically, the screening condition may select data having a certain attribute, so that the data export node device obtains all data in the data to be exported that satisfies the attribute. The screening condition may also be a filtering condition, for example filtering out all duplicate data, so that the data export node device deletes all duplicate data from the data to be exported.

In the method for exporting the total amount of data provided by this embodiment, the data to be exported is screened according to the preset screening condition, so that on one hand, the quality of the exported data is improved, and on the other hand, the data to be exported is reduced due to further screening of the data, so that the speed of exporting the data is improved.
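Both kinds of screening condition, selecting by attribute and removing duplicates, can be sketched in a few lines. The record layout (plain dictionaries) and the function name are assumptions made for illustration only.

```python
def screen_records(records, predicate=None, drop_duplicates=False):
    """Keep records that satisfy the predicate, optionally dropping duplicates."""
    seen = set()
    kept = []
    for record in records:
        if predicate is not None and not predicate(record):
            continue  # does not satisfy the attribute condition
        if drop_duplicates:
            key = tuple(sorted(record.items()))
            if key in seen:
                continue  # exact duplicate of an earlier record
            seen.add(key)
        kept.append(record)
    return kept


# Hypothetical records read from a Region copy.
rows = [
    {"id": 1, "city": "Beijing"},
    {"id": 2, "city": "Shanghai"},
    {"id": 1, "city": "Beijing"},
]
beijing_only = screen_records(rows, predicate=lambda r: r["city"] == "Beijing")
deduplicated = screen_records(rows, drop_duplicates=True)
```

Screening before the records are handed to the thread pools reduces the volume each pool has to export.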

Further, since in practical applications, there may be a case that a Region does not have a complete copy on the host where its corresponding Region server is located, on the basis of any of the above embodiments, to improve the efficiency of data export, the method further includes:

the data export node device detects the integrity of the copies of all data units in the current data table to be exported, which are stored in the data nodes of the HDFS;

and the data export node device adopts different data export modes according to different completeness degrees of the copies.

In this embodiment, because a RegionServer may have undergone a balance operation or a restart, Regions may migrate between RegionServers, so that a Region and the copies of its corresponding StoreFiles may, to some extent, not all reside on the same node. Therefore, in order to improve the efficiency of data export, the integrity of the copies that all Regions in the data table to be exported have stored in the data node of the HDFS needs to be detected. Since different degrees of completeness affect data export differently, different data export modes need to be adopted for different degrees of completeness.

According to the data full-scale export method provided by this embodiment, the completeness of the copies stored in the data node of the HDFS by all Regions in the data table to be exported is detected before data export, and different data export modes are adopted according to the degree of completeness, so that the data export efficiency can be improved.

Further, on the basis of the above embodiment, adopting different data export modes for different degrees of completeness includes:

if detecting that all data units in the current data table to be exported store complete copies in the data nodes of the HDFS, directly exporting the data units in the data table;

if the fact that the number of data units of which the integrity of the copies stored in the data nodes of the HDFS of the current data table to be exported is lower than a preset first threshold exceeds a preset second threshold is detected, aiming at each service unit, acquiring incomplete data to be exported from the data nodes of the HDFS deployed on other hosts in the data exporting process;

if the fact that the number of data units of the current data table to be exported, of which the integrity of the copies stored in the data nodes of the HDFS is lower than the preset first threshold value, is lower than the preset second threshold value is detected, for each service unit, before data export, the data to be exported are acquired from the data nodes of the HDFS deployed on other hosts, and the locally stored copies are completely supplemented.

In this embodiment, if it is detected that all Regions in the data table currently to be exported have complete copies stored in the data node of the HDFS, the copies stored in the data node of the HDFS can be called directly to export the data. If it is detected that the number of Regions whose copy in the data node of the HDFS has an integrity lower than the preset first threshold exceeds the preset second threshold, the integrity of the data is low and completing the copies would take a long time; to improve the data export speed, the data can be exported directly, with the missing data called from other hosts during the export. If it is detected that the number of such Regions is lower than the preset second threshold, the integrity of the data is high and completing the copies takes little time; to improve the data export speed, the locally stored copies can first be completed and the data then exported from them.

For example, the first threshold and the second threshold may be set by the user, and in this example both may be 80%. When the Regions whose locally stored copy has a locality attribute above 80% account for more than 80% of the total number of Regions, the copies are exported directly and the missing data is acquired from other hosts during the export; when such Regions account for less than 80% of the total number of Regions, the Region copies are first completed and data export is performed afterwards.

According to the data total export method provided by the embodiment, different data export modes are adopted for the integrity of the Region copy to be exported, so that the data export efficiency can be increased.
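The choice between the three export modes enumerated above can be sketched as a small decision function. The sketch follows the three enumerated cases and makes several assumptions: integrity is expressed as a fraction between 0 and 1, the second threshold is a count of Regions (a percentage of the total would work equally well), the case where the count equals the second threshold is grouped with the "complete first" branch because the text leaves it open, and the mode names are hypothetical labels.

```python
def choose_export_mode(region_integrity, first_threshold, second_threshold):
    """Pick an export mode from the local copy integrity of each Region.

    region_integrity: hypothetical mapping of Region -> fraction of its data
    that has a complete local copy (0.0 to 1.0).
    first_threshold: integrity below this value counts as incomplete.
    second_threshold: number of incomplete Regions tolerated before switching
    to fetching missing data during export.
    """
    if all(value >= 1.0 for value in region_integrity.values()):
        return "export_directly"
    incomplete = sum(1 for value in region_integrity.values() if value < first_threshold)
    if incomplete > second_threshold:
        # Many incomplete copies: completing them first would take too long.
        return "fetch_remote_during_export"
    # Few incomplete copies: complete the local copies, then export locally.
    return "complete_local_copy_first"


all_local = choose_export_mode({"a": 1.0, "b": 1.0}, 0.8, 1)
mostly_missing = choose_export_mode({"a": 0.5, "b": 0.6, "c": 1.0}, 0.8, 1)
mostly_local = choose_export_mode({"a": 0.5, "b": 1.0, "c": 1.0}, 0.8, 1)
```

Both thresholds are left as parameters because the text states that they may be set by the user.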

Further, on the basis of any of the above embodiments, in order to detect whether the data on each RegionServer has completed exporting, the method further includes:

detecting whether all tasks in the thread pool are exported or not for each data export node device;

and if so, judging that the data unit exporting task is completed.

In this embodiment, for each data export node device, it is detected whether data in each thread pool is exported, and if so, it is determined that the Region to be exported currently has been exported. Optionally, if it is detected that the data on the current data export node device is not exported, it is determined that the Region to be exported currently is not exported completely, and as an implementable manner, information that the data is not exported completely may be pushed to a user, and an alarm message may also be sent out to enable the user to know the current export process.

According to the data full-scale export method provided by the embodiment, whether data in the Regionserver is exported or not is determined by detecting whether data in each thread pool is exported or not, so that the accuracy of data export can be improved, and the misjudgment of whether data is exported or not is avoided.
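Detecting whether all tasks submitted to a thread pool have finished can be sketched with the Python standard library. The sketch uses a single pool with several tasks rather than several pools, and `export_chunk` is a hypothetical placeholder for the real export work.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def export_chunk(chunk):
    # Placeholder for the real export work; returns the number of records handled.
    return len(chunk)


def run_export(chunks, workers=2):
    """Run one export task per chunk and report whether every task completed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(export_chunk, chunk) for chunk in chunks]
        done, not_done = wait(futures)  # blocks until all tasks have completed
    # The Region export counts as finished only if no task is pending or failed.
    finished = not not_done and all(f.exception() is None for f in done)
    exported = sum(f.result() for f in done) if finished else 0
    return finished, exported


finished, exported = run_export([[1, 2], [3], [4, 5, 6]])
```

If `finished` is false, the node could push a notification or raise an alarm, as the embodiment suggests.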

Further, after the data export is completed, a copy of the exported data unit may be stored to a preset storage path.

Fig. 3 is a structural diagram of a data export task allocation apparatus according to a third embodiment of the present invention, and as shown in fig. 3, the data export task allocation apparatus includes:

an analysis module 31, configured to determine the service unit that manages each data unit in a data table to be exported, where the data table to be exported includes at least one data unit.

The distribution module 32 is configured to distribute each data unit export task in the to-be-exported data table to a host where a service unit managing the data unit is located, so that the data export node device deployed on the host performs weighted average division on the managed data unit and then performs data export.

In this embodiment, at least one Region is stored in any data table of Hbase, and each Region server manages one or more regions in the data table. Therefore, when data export is performed on a Region in the data table, first, the analysis module 31 may analyze which Region server specifically manages the Region to be exported.

A data export node device is configured on the host where each RegionServer is located. After it is determined which RegionServer manages the Region to be exported, the Region can therefore be allocated to the data export node device deployed on the host of that RegionServer, and the data export node device exports the Region after dividing it by weighted averaging.

In practical applications, when an HBase cluster is deployed, the RegionServers and the DataNodes that store the underlying data are generally deployed on the same group of machines. According to the copy storage policy of HDFS, data written to a Region is first stored as one copy on the local DataNode, and then as one copy on a different node of the same rack and one copy on a node of a different rack. In theory, therefore, all data of the Regions on a RegionServer has one local copy, that is, the locality attribute of the Region is 100%. Even if HBase performs a balance operation or restarts the RegionServer, the locality attribute of the Region gradually returns to 100% as data is written and the Region is continually compacted. Therefore, after it is determined which RegionServer manages each Region to be exported, the allocation module 32 may allocate each Region export task in the data table to be exported to the data export node device deployed on the host of the RegionServer managing that Region. Since the data of all Regions on the RegionServer has one local copy, the locally stored copy can be exported directly without acquiring the export data from other hosts through the network, so that network IO is low and data export efficiency is high. The data export node device divides the Regions to be exported by weighted averaging and then distributes them to the thread pools for data export.

The data export task allocation device provided by this embodiment allocates each Region in the to-be-exported data table in Hbase to the Region server managing the Region, so that local export of the to-be-exported data can be realized, the to-be-exported data does not need to be called from other hosts, and then network IO can be reduced on the basis of improving data export efficiency.

Further, since a data table to be exported includes at least one Region, the data table is fully exported only after all of its Regions are exported. Therefore, on the basis of the above embodiment, the data export task allocation device further includes:

the first detection module is used for detecting whether the data units to be exported managed by each data export node device are exported completely;

and the first judging module, configured to determine, if so, that all data in the data table to be exported has been exported.

In this embodiment, in order to determine whether the data table currently to be exported has been fully exported, the first detection module may first detect whether the Region export tasks for which each RegionServer is responsible have completed data export; if so, the first judging module may determine that the data table currently to be exported has been fully exported.

The data export task allocation device provided by this embodiment judges whether the data table currently to be exported has been fully exported by detecting whether the data on every RegionServer has been exported, which improves the accuracy of data export and avoids misjudging whether the export has finished.

Fig. 4 is a structural diagram of a data export node apparatus according to a fourth embodiment of the present invention, and as shown in fig. 4, the data export node apparatus includes:

a receiving module 41, configured to receive a data unit export task allocated by the data export task allocation device, where a data unit corresponding to the data unit export task is managed by a data export node device deployed on a host where a service unit that manages the data unit is located.

And the copy obtaining module 42 is configured to invoke the service unit to obtain a copy of the current data unit to be exported from a data node of the HDFS configured on the host where the service unit is located according to the data unit export task, where the data node stores copies of all data units managed by the service unit.

And the splitting module 43 is configured to equally distribute the copies of the data unit to each thread pool for data export.

In this embodiment, the receiving module 41 of each data export node device receives the Region export tasks assigned by the data export task allocation device. A Region export task is produced after the data export task allocation device determines which RegionServer manages each Region and allocates each Region to the data export node device deployed on the host of the RegionServer managing that Region.

Since the RegionServer and the data nodes storing data at the bottom are generally deployed on the same set of machines, there will be one copy of the data of all the regions on the RegionServer locally, i.e. the locality attribute of the Region is 100%. Therefore, after the data export node device receives the Region export task, the copy obtaining module 42 may obtain the copy of the current Region to be exported from the data node of the HDFS by calling the Region server, so that the data to be exported does not need to be obtained from other hosts, and network IO is reduced.

The data export node device can calculate how to split the copy of the Region to be exported and distribute it evenly to the thread pools for data export. Specifically, the weight of the Regions to be exported may be calculated from the total length of all Regions to be exported (i.e., endkey - startkey) and the number of thread pools, and the splitting module 43 divides the Regions to be exported evenly according to this weight. Alternatively, the Regions to be exported may be divided evenly by any other calculation method, and the present invention is not limited in this respect.
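One possible way to carry out such an even split is sketched below in Python. The sketch assumes that Rowkeys can be mapped to integers, that each Region is a half-open range [startkey, endkey), and that the weight of a range is its length; the function name `split_regions` is hypothetical. When the total length is not divisible by the number of thread pools, the per-pool shares differ by at most one key.

```python
def split_regions(regions, num_pools):
    """Split Region key ranges into sub-ranges of equal total length per pool.

    regions: list of (start_key, end_key) integer half-open ranges
             (assumption: Rowkeys are mapped to integers).
    num_pools: number of thread pools.
    Returns a list with one list of (start, end) sub-ranges per pool.
    """
    total = sum(end - start for start, end in regions)
    base, extra = divmod(total, num_pools)
    # Each pool's quota of keys; the first `extra` pools take one more key.
    quotas = [base + (1 if i < extra else 0) for i in range(num_pools)]
    pools = [[] for _ in range(num_pools)]
    pool = 0
    for start, end in regions:
        while start < end:
            # Move on to the next pool once the current quota is used up.
            while quotas[pool] == 0:
                pool += 1
            take = min(end - start, quotas[pool])
            pools[pool].append((start, start + take))
            quotas[pool] -= take
            start += take
    return pools


if __name__ == "__main__":
    for i, ranges in enumerate(split_regions([(0, 90), (90, 120)], 3)):
        print(i, ranges)
```

Each sub-range can then be exported with a Scan bounded by its start and end keys, so every thread pool handles the same weighted amount of key space.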

The data export node device provided in this embodiment obtains the locally stored Region to be exported and equally allocates the Region to each thread pool for data export, so that the amount of data that needs to be exported for each thread pool is equal, and thus, the efficiency of data export is improved.

Optionally, in order to improve the quality of the exported data, the data to be exported may be further filtered, and on the basis of any of the above embodiments, the data export node apparatus further includes:

and the screening module is used for screening the copies of the data units to be exported according to preset screening conditions.

In this embodiment, before data export, the screening module may screen the copy of the Region to be exported according to the preset screening condition. Specifically, the screening condition may select data having a certain attribute, in which case the data export node device obtains all data in the data to be exported that satisfies the attribute. The screening condition may also be a filtering condition, for example filtering out all duplicate data, in which case the data export node device deletes all duplicate data from the data to be exported.
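The two kinds of screening condition can be sketched in Python as follows. The record format (hashable tuples of Rowkey and value), the function name `screen_records`, and the `predicate` and `dedupe` parameters are all illustrative assumptions, not details given in this embodiment.

```python
def screen_records(records, predicate=None, dedupe=True):
    """Screen records before export.

    records: iterable of hashable records, e.g. (rowkey, value) tuples
             (hypothetical format).
    predicate: optional function returning True for records that satisfy
               the required attribute; None keeps every record.
    dedupe: if True, drop records identical to one already kept.
    Returns the screened records in their original order.
    """
    seen = set()
    result = []
    for record in records:
        if predicate is not None and not predicate(record):
            continue  # does not satisfy the attribute condition
        if dedupe:
            if record in seen:
                continue  # duplicate of a record that was already kept
            seen.add(record)
        result.append(record)
    return result


if __name__ == "__main__":
    rows = [("r1", "a"), ("r2", "b"), ("r1", "a"), ("r3", "a")]
    print(screen_records(rows, predicate=lambda r: r[1] == "a"))
```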

The data export node device provided by this embodiment screens the data to be exported according to preset screening conditions. This improves the quality of the exported data and also reduces the volume of data to be exported, which increases the speed of data export.

Further, in practical applications, a Region may not have a complete copy on the host where its corresponding RegionServer is located. Therefore, to improve the efficiency of data export, on the basis of any of the above embodiments, the data export node apparatus further includes:

the integrity analysis module is used for calling the service unit to detect the integrity of the copies of all data units in the current data table to be exported, which are stored in the data nodes of the HDFS;

and a data export mode determination module, configured to adopt different data export modes according to the different integrity of the copies.

In this embodiment, some Regions on a RegionServer may have undergone a balance operation or a RegionServer restart, so Regions may migrate between RegionServers, and the copies of a Region and its corresponding StoreFiles may then be only partly located on the same node. Therefore, in order to improve the efficiency of data export, the integrity analysis module needs to detect the integrity of the copies of all Regions in the data table to be exported that are stored in the data nodes of the HDFS. Because different degrees of integrity affect data export differently, the data export mode determination module needs to adopt different data export modes according to the integrity.
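This embodiment does not specify how integrity is measured. One plausible measure, shown here only as an assumption, is the fraction of a Region's StoreFile bytes whose blocks have a replica on the local host. The function name `copy_integrity` and the input format are hypothetical.

```python
def copy_integrity(blocks, local_host):
    """Estimate how complete the local copy of a Region is.

    blocks: list of (block_size, hosts) pairs, where hosts is the set of
            hosts holding a replica of that block (hypothetical format).
    local_host: name of the host where the RegionServer runs.
    Returns the fraction of bytes with a local replica, from 0.0 to 1.0.
    """
    total = sum(size for size, _ in blocks)
    if total == 0:
        return 1.0  # assumption: an empty Region counts as fully local
    local = sum(size for size, hosts in blocks if local_host in hosts)
    return local / total


if __name__ == "__main__":
    demo = [(128, {"host1", "host2"}), (128, {"host2", "host3"})]
    print(copy_integrity(demo, "host1"))
```

A value of 1.0 corresponds to the 100% locality case, in which the whole Region can be read from the local data node.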

The data export node device provided by this embodiment detects, before data export, the integrity of the copies of all Regions in the current data table to be exported that are stored in the data nodes of the HDFS, and adopts different data export modes according to the different integrity, so that the data export efficiency can be improved.

Further, on the basis of the above embodiment, the data export mode determination module specifically includes:

a first export mode determination unit, configured to directly export the data units in the data table if it is detected that all data units in the current data table to be exported have complete copies stored in the data nodes of the HDFS;

a second export mode determination unit, configured to, if it is detected that the number of data units in the current data table to be exported whose copies stored in the data nodes of the HDFS have an integrity lower than a preset first threshold exceeds a preset second threshold, acquire, for each service unit during data export, the missing data to be exported from data nodes of the HDFS deployed on other hosts;

and a third export mode determination unit, configured to, if it is detected that the number of data units in the current data table to be exported whose copies stored in the data nodes of the HDFS have an integrity lower than the preset first threshold is lower than the preset second threshold, acquire, for each service unit before data export, the data to be exported from data nodes of the HDFS deployed on other hosts so as to completely supplement the locally stored copies.

In this embodiment, if it is detected that all Regions in the current data table to be exported have complete copies stored in the data nodes of the HDFS, the first export mode determination unit may directly call the copies stored in the data nodes of the HDFS to export the data. If it is detected that the number of Regions in the current data table to be exported whose copies stored in the data nodes of the HDFS have an integrity lower than the preset first threshold exceeds the preset second threshold, the integrity of the data is low and completely supplementing the copies would take a long time; therefore, to improve the data export speed, the second export mode determination unit exports the data directly and, during the export, fetches the missing data from other hosts. If it is detected that the number of such Regions is lower than the preset second threshold, that is, the integrity of the data is relatively high, completely supplementing the copies takes little time; therefore, to improve the data export speed, the third export mode determination unit first completely supplements the locally stored copies and then exports the data from them.
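The three-way decision above can be summarized in a short Python sketch. The integrity values are assumed to be fractions between 0 and 1, and the names `ExportMode` and `choose_export_mode` are hypothetical. The text does not say what happens when the count equals the second threshold exactly, so the sketch treats that case as the third mode, which is an assumption.

```python
from enum import Enum


class ExportMode(Enum):
    DIRECT_LOCAL = 1              # every local copy is complete
    FETCH_DURING_EXPORT = 2       # fetch missing data while exporting
    SUPPLEMENT_BEFORE_EXPORT = 3  # complete local copies, then export


def choose_export_mode(integrities, first_threshold, second_threshold):
    """Pick an export mode from the per-Region copy integrity.

    integrities: list of floats in [0, 1], one per Region.
    first_threshold: integrity below which a copy counts as incomplete.
    second_threshold: count of incomplete Regions separating modes 2 and 3.
    """
    if all(value >= 1.0 for value in integrities):
        return ExportMode.DIRECT_LOCAL
    low = sum(1 for value in integrities if value < first_threshold)
    if low > second_threshold:
        return ExportMode.FETCH_DURING_EXPORT
    # Assumption: a count equal to the threshold also falls in this branch.
    return ExportMode.SUPPLEMENT_BEFORE_EXPORT


if __name__ == "__main__":
    values = [0.5] * 9 + [1.0]
    # The claims give the second threshold as 80% of the number of data units.
    print(choose_export_mode(values, 0.9, 0.8 * len(values)))
```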

The data export node device provided by this embodiment can increase the efficiency of data export by adopting different data export modes according to the integrity of the copies of the Regions to be exported.

Further, on the basis of any of the above embodiments, in order to detect whether the data on each RegionServer has been fully exported, the data export node apparatus further includes:

a second detection module, configured to detect, for each data export node device, whether all tasks in the thread pools have been exported;

and a second determination module, configured to determine that the data unit export task is completed if all tasks in the thread pools have been exported.

The data export node device provided by this embodiment determines whether the data on the RegionServer has been fully exported by detecting whether the data in each thread pool has been fully exported, which improves the accuracy of data export and avoids misjudging whether the export is complete.
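The completion check can be sketched with Python's standard `concurrent.futures` module. A single thread pool stands in for the thread pools of the data export node device, and `export_subrange` is a hypothetical placeholder for the real export of one key sub-range; treating a task that raised an exception as not exported is an assumption of the sketch.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def export_subrange(subrange):
    """Placeholder export: returns the number of keys in the sub-range."""
    start, end = subrange
    return end - start


def run_export_task(subranges, export_fn=export_subrange, max_workers=4):
    """Export all sub-ranges and report whether every task finished.

    Returns True only if no task is left pending and none raised an
    exception; otherwise the data unit export task is not judged complete.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(export_fn, s) for s in subranges]
        # wait() blocks until all futures are done (ALL_COMPLETED default).
        done, not_done = wait(futures)
    return not not_done and all(f.exception() is None for f in done)


if __name__ == "__main__":
    print(run_export_task([(0, 40), (40, 80), (80, 120)]))
```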

Further, after the data export is completed, the storage module may store a copy of the exported Region to a preset storage path.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data full-scale export method, comprising:

the data export task allocation device analyzes the service unit that manages each data unit in a data table to be exported, wherein the data table to be exported comprises at least one data unit;

the data export task distribution device distributes each data unit export task in the data table to be exported to a host where a service unit managing the data units is located, so that a data export node device deployed on the host averagely distributes the data units to each thread pool for data export according to the weight of each data unit managed by the data export node device, wherein the weighted sum of the data units for data export in each thread pool is the same; the data units exported by the data export task allocation device are the data export units screened by preset screening conditions; the data export task allocation device determines the data export mode of the exported data unit according to the integrity determined after the service unit detects the copy of the current data unit stored in the data node of the HDFS;

wherein the screening comprises deleting duplicate data in the data unit;

wherein the data export mode comprises at least one of the following:

if the fact that the number of data units of a current data table to be exported, of which the integrity of the copies stored in the data nodes of the HDFS is lower than a preset first threshold value, is lower than a preset second threshold value is detected, acquiring data to be exported from the data nodes of the HDFS deployed on other hosts to completely supplement the locally stored copies for each service unit before data export;

wherein the second threshold is 80% of the total number of data units.

2. The method of claim 1, further comprising:

3. A data full-scale export method, comprising:

the data export node device receives a data unit export task distributed by the data export task distribution device, and a data unit corresponding to the data unit export task is managed by the data export node device deployed on a host where a service unit for managing the data unit is located;

the data export node device calls a service unit to obtain a copy of a data unit to be exported currently from a data node of an HDFS (Hadoop distributed file system) configured on a host where the service unit is located according to the data unit export task, and the data node stores the copies of all data units managed by the service unit;

the data export node device evenly distributes the copies of the data units to all thread pools for data export according to the weight of each data unit managed by the service unit, wherein the weighted sum of the data units for data export in all the thread pools is the same;

the method further comprises the following steps:

the data export node device screens the copies of the data units to be exported according to preset screening conditions; the screening comprises deleting repeated data in the copy of the data unit to be exported;

the method further comprises the following steps:

the data export node device calls a service unit to detect the integrity of the copies of all data units in the current data table to be exported, which are stored in the data nodes of the HDFS;

adopting different data export modes according to different completeness degrees of the copies;

the data export mode comprises at least one of the following modes:

wherein the second threshold is 80% of the total number of data units.

4. The method according to claim 3, wherein after the data export node apparatus filters the copy of the data unit to be exported according to a preset filtering condition, the method further comprises:

and the data export node device stores the exported copies of the data units to a preset storage path.

5. The method of claim 3, further comprising:

and if so, judging that the data unit exporting task is completed.

6. A data export task assigning apparatus, comprising:

an analysis module, configured to analyze the service unit that manages each data unit in a data table to be exported, wherein the data table to be exported comprises at least one data unit;

the distribution module is used for the data export task distribution device to distribute each data unit export task in the data table to be exported to a host where a service unit for managing the data unit is located, so that the data export node device deployed on the host can distribute the data units to all thread pools to conduct data export according to the weight of all data units managed by the host, wherein the weight sum of the data units conducting data export in all the thread pools is the same; the data units exported by the data export task allocation device are the data export units screened by preset screening conditions; the data export task allocation device determines the data export mode of the exported data unit according to the integrity determined after the service unit detects the copy of the current data unit stored in the data node of the HDFS;

wherein the screening comprises deleting duplicate data in the data unit;

wherein the data export mode comprises at least one of the following:

wherein the second threshold is 80% of the total number of data units.

7. A data export node apparatus, comprising:

the receiving module is used for receiving the data unit export task distributed by the data export task distribution device, and the data unit corresponding to the data unit export task is managed by a data export node device deployed on a host where a service unit for managing the data unit is located;

the copy obtaining module is used for calling the service unit to obtain a copy of the current data unit to be exported from a data node of an HDFS (Hadoop distributed File System) configured on a host where the service unit is located according to the data unit export task, and the data node stores the copies of all data units managed by the service unit;

the splitting module is used for averagely distributing the copies of the data units to each thread pool for data export according to the weight of each data unit managed by the service unit, wherein the weighted sum of the data units for data export in each thread pool is the same;

the device further comprises:

the data export mode comprises at least one of the following modes:

wherein the second threshold is 80% of the total number of data units.