CN106294886A

CN106294886A - Method and system for full data extraction from HBase

Info

Publication number: CN106294886A
Application number: CN201610902484.8A
Authority: CN
Inventors: 范卫卫; 张翼; 温宗臣; 何良均
Original assignee: BEIJING GEO POLYMERIZATION TECHNOLOGY Co Ltd
Current assignee: BEIJING GEO POLYMERIZATION TECHNOLOGY Co Ltd
Priority date: 2016-10-17
Filing date: 2016-10-17
Publication date: 2017-01-04

Abstract

The present invention discloses a method for full data extraction from HBase, which extracts all HBase data efficiently and concurrently with multiple threads while keeping the load balanced across regionServers during extraction, so that hotspots do not put pressure on any single regionServer. The method includes: (1) deploying on a server that can access the HBase cluster; (2) on this server, configuring the mapping between the host names and IP addresses of the region servers (regionserver) in the hosts file; (3) first reading the HBase metadata table to obtain region information, then building a different scan object for each region's information to extract the data; (4) storing the extracted data under different HDFS directories according to region. A system for full data extraction from HBase is also provided.

Description

Method and system for full data extraction from HBase

Technical field

The present invention relates to the technical field of big data processing, and in particular to a method for full data extraction from HBase and a system for full data extraction from HBase.

Background art

HBase is a distributed, column-oriented open-source database. The technology originates from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang et al. Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of the Apache Hadoop project. Unlike a general relational database, HBase is a database suited to storing unstructured data. A further difference is that HBase is column-based rather than row-based.

An HBase table can be logically divided into different data blocks, and no data is shared between blocks. To extract all data from an HBase table, the HBase scan API (which reads table data by scanning) can be used without setting the scan's start row key (startRow) or stop row key (stopRow), so that the scan reads the whole table.

Such a scan reads the data of only one region (partition) at a time and moves on to the next region only after the current one has been read completely, which is a serial approach. When the HBase table is large, data extraction is therefore inefficient.

Summary of the invention

To overcome the defects of the prior art, the technical problem to be solved by the present invention is to provide a method for full data extraction from HBase, which extracts all HBase data efficiently and concurrently with multiple threads while keeping the load balanced across regionServers during extraction, so that hotspots do not put pressure on any single regionServer.

The technical solution of the present invention is a method for full data extraction from HBase, comprising the following steps:

(1) deploying on a server that can access the HBase cluster;

(2) on this server, configuring the mapping between the host names and IP addresses of the region servers (regionserver) in the hosts file;

(3) first reading the HBase metadata table to obtain region information; then building a different scan object for each region's information to extract the data;

(4) storing the extracted data under different directories of HDFS (Hadoop Distributed File System) according to region.

By converting the reading of one huge table into the reading of many small data blocks (regions), and by building a different scan object for each region's information to extract the data, the present invention can extract all HBase data efficiently and concurrently with multiple threads, while keeping the load balanced across regionServers during extraction so that hotspots do not put pressure on any single regionServer.

A system for full data extraction from HBase is also provided, the system comprising:

a deployment module, configured to deploy the system on a server that can access the HBase cluster;

a configuration module, configured to write, on that server, the mapping between the host names and IP addresses of the region servers (regionserver) into the hosts file;

a data extraction module, configured to first read the HBase metadata table to obtain region information, and then build a different scan object for each region's information to extract the data;

a data storage module, configured to store the extracted data under different HDFS directories according to region.

Brief description of the drawings

Fig. 1 is a flow chart of the method for full data extraction from HBase according to the present invention.

Detailed description of the invention

As shown in Fig. 1, the method for full data extraction from HBase comprises the following steps:

(1) deploying on a server that can access the HBase cluster;

(2) on this server, configuring the mapping between the host names and IP addresses of the region servers (regionserver) in the hosts file;

The hosts file is a system file without an extension that can be opened with tools such as Notepad. Its purpose is to associate commonly used domain names with their corresponding IP addresses, forming a kind of lookup "database". When a user enters an address in a browser, the system first looks up the corresponding IP address in the hosts file. If an entry is found, the system opens the corresponding page immediately; if not, the system submits the domain name to a DNS server for IP address resolution.
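As an illustration of the name-to-IP mapping that the hosts file provides, the following minimal sketch parses hosts-style text into a lookup table. The host names, IP addresses, and the helper `parse_hosts` are hypothetical and chosen only for this example; an actual deployment would edit the operating system's hosts file directly.

```python
# Hypothetical hosts-file content; names and addresses are illustrative only.
HOSTS_TEXT = """
# regionserver host name to IP mapping
10.0.0.11  rs1.example.com  rs1
10.0.0.12  rs2.example.com  rs2
"""


def parse_hosts(text):
    """Return a dict mapping every host name in hosts-style text to its IP."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping


hosts = parse_hosts(HOSTS_TEXT)
```

With entries of this kind in place, a client that receives a regionServer host name from the HBase metadata can resolve it locally without relying on DNS.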

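(3) first reading the HBase metadata table to obtain region information; then building a different scan object for each region's information to extract the data;

(4) storing the extracted data under different HDFS directories according to region.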
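The per-region storage layout can be pictured with a small sketch. The directory scheme (base directory, then table name, then region name), the function `region_output_dir`, and the sample names are assumptions made for illustration; the source only states that each region's data goes to a different HDFS directory.

```python
import posixpath


def region_output_dir(base_dir, table, region_name):
    """Build a per-region output directory path (hypothetical layout)."""
    return posixpath.join(base_dir, table, region_name)


# Illustrative region names; real names would come from the HBase metadata table.
paths = [
    region_output_dir("/extract", "orders", name)
    for name in ("region-0001", "region-0002")
]
```

Because every region writes to its own directory, concurrent extraction tasks never write to the same output location.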

In addition, in step (3), the region information includes:

the data range of the region, namely startRow and stopRow;

the host name of the regionServer on which the region resides;

From the above region information a map data structure is built: the key is the host name of the regionServer, and the value is a list of region information.
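The map described above can be sketched as follows. The record type `RegionInfo`, its field names, and the sample regions are hypothetical; in practice the values would be read from the HBase metadata table.

```python
from collections import defaultdict, namedtuple

# Hypothetical record for one region; the field names are assumptions.
RegionInfo = namedtuple("RegionInfo", ["start_row", "stop_row", "server"])


def group_by_server(regions):
    """Map each regionServer host name to the list of regions it hosts."""
    by_server = defaultdict(list)
    for region in regions:
        by_server[region.server].append(region)
    return dict(by_server)


# Illustrative regions: an empty row key stands for an open-ended boundary.
regions = [
    RegionInfo(b"", b"g", "rs1"),
    RegionInfo(b"g", b"p", "rs2"),
    RegionInfo(b"p", b"", "rs1"),
]
region_map = group_by_server(regions)
```

Grouping by host name is what later allows regions to be picked evenly across regionServers.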

In addition, in step (3):

a thread pool of capacity N is created to execute the tasks that extract data from HBase;

the number of regions is M and the thread pool size is N; regions are selected and read as follows:

(a) if M ≤ N, all regions are taken as read targets;

(b) if M > N, the Map is traversed in a loop until the Map is empty;

According to the region information, the startRow and stopRow of the region are set as the start row key and stop row key of the scan, so that the scan reads data only from that region. The information of each region is wrapped in a thread class and submitted to the thread pool, which performs the data read.
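The per-region scan submitted to a thread pool can be sketched as below. This is not the HBase client API: `TABLE` is an in-memory stand-in for an HBase table, and `scan_region`, `extract_all`, and the region boundaries are hypothetical. The sketch only assumes the usual HBase convention that startRow is inclusive, stopRow is exclusive, and an empty key means no bound.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for an HBase table: row key -> value (illustrative data).
TABLE = {b"apple": 1, b"grape": 2, b"melon": 3, b"plum": 4, b"zebra": 5}


def scan_region(start_row, stop_row):
    """Return rows with start_row <= key < stop_row; b"" means unbounded."""
    return sorted(
        (key, value)
        for key, value in TABLE.items()
        if key >= start_row and (stop_row == b"" or key < stop_row)
    )


# Hypothetical region boundaries as (startRow, stopRow) pairs.
REGION_RANGES = [(b"", b"g"), (b"g", b"p"), (b"p", b"")]


def extract_all(region_ranges, pool_size):
    """Submit one scan task per region to a pool of pool_size threads."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [
            pool.submit(scan_region, start, stop) for start, stop in region_ranges
        ]
        # Collect results in submission order, one list of rows per region.
        return [future.result() for future in futures]


results = extract_all(REGION_RANGES, pool_size=2)
```

Because the region ranges do not overlap, the per-region results together cover every row exactly once.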

In addition, in (b), the logic of each traversal of the Map is: obtain an entry from the Map, the value of the entry being a region information list;

take one record out of the region information list and remove that record from the list; if the region information list is empty after the record is removed, remove the entry from the Map.
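One reading of this traversal is a round-robin over the regionServers, taking one region from each server per pass. The sketch below follows that reading; the function name `round_robin_order` and the sample map are hypothetical, and the strings stand in for region information records.

```python
def round_robin_order(region_map):
    """Yield regions by cycling over servers, one region per server per pass."""
    # Copy the lists so the caller's map is not modified.
    remaining = {server: list(regions) for server, regions in region_map.items()}
    while remaining:
        for server in list(remaining):  # snapshot of keys: entries may be removed
            regions = remaining[server]
            yield regions.pop(0)        # take one record and remove it from the list
            if not regions:             # list exhausted: remove the entry from the map
                del remaining[server]


example = {"rs1": ["r1", "r2", "r3"], "rs2": ["r4"], "rs3": ["r5", "r6"]}
order = list(round_robin_order(example))
# order == ["r1", "r4", "r5", "r2", "r6", "r3"]
```

Submitting tasks in this order means that consecutive tasks target different regionServers wherever possible, which is how the reads stay balanced.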

Those of ordinary skill in the art will understand that all or part of the steps of the method in the above embodiment can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method; the storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card, and the like. Accordingly, corresponding to the method of the present invention, the present invention also includes a system for full data extraction from HBase, which is generally represented in the form of functional modules corresponding to the steps of the method. The system using the method includes:

In addition, in the data extraction module, the region information includes:

the data range of the region, namely startRow and stopRow;

the host name of the regionServer on which the region resides;

In addition, in the data extraction module:

(a) if M ≤ N, all regions are taken as read targets;

(b) if M > N, the Map is traversed in a loop until the Map is empty;

The beneficial effects of the present invention are as follows:

1. Reading one huge table is converted into reading many small data blocks (regions).

2. Data blocks are read concurrently by multiple threads, so HBase data is extracted quickly.

3. The regions being extracted at any one time are balanced across the regionservers, which prevents hotspots from putting pressure on a regionServer during extraction.

The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Any simple modification, equivalent change, or variation made to the above embodiment in accordance with the technical essence of the present invention still falls within the protection scope of the technical solution of the present invention.

Claims

1. A method for full data extraction from HBase, characterized in that the method comprises the following steps:

(1) deploying on a server that can access the HBase cluster;

(2) on this server, configuring the mapping between the host names and IP addresses of the region servers (regionserver) in the hosts file;

(3) first reading the HBase metadata table to obtain region information; then building a different scan object for each region's information to extract the data;

2. The method for full data extraction from HBase according to claim 1, characterized in that:

in step (3), the region information includes:

the data range of the region, namely startRow and stopRow;

the host name of the regionServer on which the region resides;

from the above region information a map data structure is built: the key is the host name of the regionServer, and the value is a list of region information.

3. The method for full data extraction from HBase according to claim 2, characterized in that:

in step (3),

(a) if M ≤ N, all regions are taken as read targets;

(b) if M > N, the Map is traversed in a loop until the Map is empty;

according to the region information, the startRow and stopRow of the region are set as the start row key and stop row key of the scan, so that the scan reads data only from that region; the information of each region is wrapped in a thread class and submitted to the thread pool, which performs the data read.

4. The method for full data extraction from HBase according to claim 3, characterized in that:

in (b), the logic of each traversal of the Map is: obtain an entry from the Map, the value of the entry being a region information list;

take one record out of the region information list and remove that record from the list; if the region information list is empty after the record is removed, remove the entry from the Map.

5. A system for full data extraction from HBase, characterized in that the system comprises:

a configuration module, configured to write, on that server, the mapping between the host names and IP addresses of the region servers (regionserver) into the hosts file;

a data extraction module, configured to first read the HBase metadata table to obtain region information, and then build a different scan object for each region's information to extract the data;

a data storage module, configured to store the extracted data under different HDFS directories according to region.

6. The system for full data extraction from HBase according to claim 5, characterized in that:

in the data extraction module, the region information includes:

the data range of the region, namely startRow and stopRow;

the host name of the regionServer on which the region resides;

7. The system for full data extraction from HBase according to claim 6, characterized in that:

in the data extraction module,

(a) if M ≤ N, all regions are taken as read targets;

(b) if M > N, the Map is traversed in a loop until the Map is empty;

8. The system for full data extraction from HBase according to claim 7, characterized in that: