CN106294886A - Method and system for full data extraction from HBase - Google Patents

Publication number
CN106294886A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610902484.8A
Other languages
Chinese (zh)
Inventor
范卫卫
张翼
温宗臣
何良均
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING GEO POLYMERIZATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING GEO POLYMERIZATION TECHNOLOGY Co Ltd
Application filed by BEIJING GEO POLYMERIZATION TECHNOLOGY Co Ltd
Priority to CN201610902484.8A
Publication of CN106294886A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a method for full data extraction from HBase. It can extract all HBase data efficiently with multiple concurrent threads while keeping the load balanced across RegionServers during extraction, so that no hotspot puts pressure on a single RegionServer. The method includes: (1) deploying on a server that can access the HBase cluster; (2) on this server, configuring the mapping between the hostnames and IP addresses of the RegionServers in the hosts file; (3) first reading the HBase metadata table to obtain region information, then building different scan objects according to the different region information to extract data; (4) storing the extracted data under different directories of HDFS according to region. A system for full data extraction from HBase is also provided.

Description

Method and system for full data extraction from HBase
Technical field
The present invention relates to the technical field of big data processing, and in particular to a method for full data extraction from HBase and a system for full data extraction from HBase.
Background
HBase is a distributed, column-oriented open-source database. The technology originates from the Google paper by Fay Chang, "Bigtable: A Distributed Storage System for Structured Data". Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of the Apache Hadoop project. HBase differs from a typical relational database: it is a database suited to storing unstructured data. Another difference is that HBase is column-based rather than row-based.
An HBase table can be logically split into different data blocks, with no data shared between blocks. To extract all data from an HBase table, one can use the HBase scan API (which reads table data through a scan) without setting the scan's start row key (startRow) or stop row key (stopRow), so that the whole table is read.
This approach reads only one region (partition) at a time and moves on to the next region only after the current one has been read completely, so it is serial. When the HBase table holds a large amount of data, extraction is inefficient.
Summary of the invention
To overcome the shortcomings of the prior art, the technical problem to be solved by the present invention is to provide a method for full data extraction from HBase that can extract all HBase data efficiently with multiple concurrent threads while keeping the load balanced across RegionServers during extraction, so that no hotspot puts pressure on a single RegionServer.
The technical solution of the present invention is a method for full data extraction from HBase, comprising the following steps:
(1) deploying on a server that can access the HBase cluster;
(2) on this server, configuring the mapping between the hostnames and IP addresses of the RegionServers in the hosts file;
(3) first reading the HBase metadata table to obtain region information; then building different scan objects according to the different region information to extract data;
(4) storing the extracted data under different directories of HDFS (Hadoop Distributed File System) according to region.
The present invention turns reading one huge table into reading many small data blocks (regions). By building a different scan object for each region's information, it can extract all HBase data efficiently with multiple concurrent threads while keeping the load balanced across RegionServers during extraction, so that no hotspot puts pressure on a single RegionServer.
A system for full data extraction from HBase is also provided, the system comprising:
a deployment module, configured to deploy the system on a server that can access the HBase cluster;
a configuration module, configured to configure, on that server, the mapping between the hostnames and IP addresses of the RegionServers in the hosts file;
a data extraction module, configured to first read the HBase metadata table to obtain region information, and then build different scan objects according to the different region information to extract data;
a data storage module, configured to store the extracted data under different directories of HDFS according to region.
Brief description of the drawings
Fig. 1 shows a flow chart of the method for full data extraction from HBase according to the present invention.
Detailed description
As shown in Fig. 1, the method for full data extraction from HBase comprises the following steps:
(1) deploying on a server that can access the HBase cluster;
(2) on this server, configuring the mapping between the hostnames and IP addresses of the RegionServers in the hosts file;
The hosts file is a system file without an extension that can be opened with tools such as Notepad. Its purpose is to associate commonly used domain names with their corresponding IP addresses in a lookup "database". When a user enters a URL in a browser, the system first looks up the corresponding IP address in the hosts file. If an entry is found, the system opens the page immediately; if not, the system submits the domain name to a DNS server for resolution.
(3) first reading the HBase metadata table to obtain region information; then building different scan objects according to the different region information to extract data;
(4) storing the extracted data under different directories of HDFS according to region.
The present invention turns reading one huge table into reading many small data blocks (regions). By building a different scan object for each region's information, it can extract all HBase data efficiently with multiple concurrent threads while keeping the load balanced across RegionServers during extraction, so that no hotspot puts pressure on a single RegionServer.
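As an illustration of step (4), the following minimal sketch writes each region's rows to its own directory. It is not the patent's implementation: a local temporary directory stands in for HDFS, and the directory layout, the file name part-00000, and the helper names region_dir and write_region are all assumptions made for this example.

```python
import tempfile
from pathlib import Path

def region_dir(base, table, region_name):
    """Hypothetical layout: <base>/<table>/<region_name>/"""
    return Path(base) / table / region_name

def write_region(base, table, region_name, rows):
    """Write one region's rows to its own directory and return the file path."""
    out_dir = region_dir(base, table, region_name)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "part-00000"  # assumed file name
    with out_file.open("w", encoding="utf-8") as f:
        for key, value in rows:
            f.write(f"{key}\t{value}\n")
    return out_file

# A local temporary directory stands in for an HDFS base path.
base = tempfile.mkdtemp()
p1 = write_region(base, "events", "region-0001", [("a", 1), ("b", 2)])
p2 = write_region(base, "events", "region-0002", [("n", 3)])
```

Because every region writes to a separate directory, concurrent extraction tasks never contend for the same output file.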
In addition, in step (3), the region information includes:
the data range of the region, given by startRow and stopRow;
the hostname of the RegionServer hosting the region.
A Map data structure is built from the above region information: the key is the hostname of a RegionServer, and the value is the list of region information for that server.
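The Map described above can be sketched as follows. This is only an illustration: the RegionInfo type, the group_by_server helper, the hostnames, and the row-key boundaries are assumptions, and the region list is hard-coded rather than read from the HBase metadata table.

```python
from collections import defaultdict
from typing import NamedTuple

class RegionInfo(NamedTuple):
    start_row: bytes  # start row key; b"" means the start of the table
    stop_row: bytes   # stop row key; b"" means the end of the table
    server: str       # hostname of the RegionServer hosting the region

def group_by_server(regions):
    """Build the map: RegionServer hostname -> list of its regions."""
    by_server = defaultdict(list)
    for region in regions:
        by_server[region.server].append(region)
    return dict(by_server)

# Hypothetical regions; a real run would obtain these from the metadata table.
regions = [
    RegionInfo(b"", b"g", "rs1.example.com"),
    RegionInfo(b"g", b"n", "rs2.example.com"),
    RegionInfo(b"n", b"t", "rs1.example.com"),
    RegionInfo(b"t", b"", "rs3.example.com"),
]
region_map = group_by_server(regions)
```

Grouping by hostname is what later allows regions to be picked from different RegionServers in turn.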
In addition, in step (3),
a thread pool of capacity N is created to execute the tasks that extract data from HBase;
the number of regions is M and the thread pool size is N; regions are selected and read as follows:
(a) if M ≤ N, all regions are taken as read objects;
(b) if M > N, the Map is traversed in a loop until the Map is empty;
according to the region information, the region's startRow and stopRow are set as the scan's start row key and stop row key, so that the scan reads data from this region only; the information of each region is wrapped in a thread class and submitted to the thread pool, which performs the read.
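The per-region scan and the thread pool can be sketched as below. This is a simplified model under stated assumptions, not HBase client code: an in-memory dictionary stands in for the table, the scan function imitates a range scan with an inclusive start key and an exclusive stop key (an empty bound meaning unbounded), and the region boundaries and pool size are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for an HBase table: row key -> value.
TABLE = {f"row{i:03d}".encode(): i for i in range(100)}

def scan(start_row, stop_row):
    """Stand-in for a range scan: rows with start_row <= key < stop_row.
    An empty stop_row means "no upper bound"."""
    return [
        (key, TABLE[key])
        for key in sorted(TABLE)
        if key >= start_row and (stop_row == b"" or key < stop_row)
    ]

def extract_region(bounds):
    """One task per region: scan only that region's key range."""
    start_row, stop_row = bounds
    return scan(start_row, stop_row)

# Hypothetical (startRow, stopRow) pairs that together cover the whole table.
REGION_BOUNDS = [
    (b"", b"row025"),
    (b"row025", b"row050"),
    (b"row050", b"row075"),
    (b"row075", b""),
]

N = 2  # thread pool size (assumed)
with ThreadPoolExecutor(max_workers=N) as pool:
    # map() returns results in the order the regions were submitted.
    per_region = list(pool.map(extract_region, REGION_BOUNDS))

total_rows = sum(len(rows) for rows in per_region)
```

Since the key ranges do not overlap and together cover the table, the union of the per-region results equals a full-table read, while at most N regions are read at any moment.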
In addition, in (b), the logic of each traversal of the Map is: obtain an entry from the Map, whose value is a region information list;
take one record from the region information list and remove that record from the list; if the region information list is empty after the record is removed, remove the entry from the Map.
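One plausible reading of this traversal is a round-robin over the RegionServers: each pass takes at most one region from every server, so regions submitted close together belong to different servers. The sketch below follows that reading; the function name round_robin and the sample server and region names are assumptions made for illustration.

```python
def round_robin(region_map):
    """Yield (server, region) pairs, taking one region per server per pass.
    Consumes region_map: it is empty once the generator is exhausted."""
    while region_map:
        # Snapshot the keys so entries can be removed during the pass.
        for server in list(region_map):
            region_list = region_map[server]
            region = region_list.pop(0)   # take one record and remove it
            if not region_list:           # the list is now empty,
                del region_map[server]    # so drop the entry from the map
            yield server, region

# Hypothetical map: RegionServer hostname -> list of region names.
region_map = {
    "rs1": ["r1", "r2", "r3"],
    "rs2": ["r4"],
    "rs3": ["r5", "r6"],
}
order = list(round_robin(region_map))
```

Submitting tasks to the thread pool in this order spreads the concurrently running scans across RegionServers instead of draining one server's regions first.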
Those of ordinary skill in the art will understand that all or some of the steps of the method in the above embodiment can be carried out by a program that instructs the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method of the above embodiment; the storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card, and so on. Accordingly, corresponding to the method of the present invention, the present invention also includes a system for full data extraction from HBase, which is generally expressed as functional modules corresponding to the steps of the method. The system using this method comprises:
a deployment module, configured to deploy the system on a server that can access the HBase cluster;
a configuration module, configured to configure, on that server, the mapping between the hostnames and IP addresses of the RegionServers in the hosts file;
a data extraction module, configured to first read the HBase metadata table to obtain region information, and then build different scan objects according to the different region information to extract data;
a data storage module, configured to store the extracted data under different directories of HDFS according to region.
In addition, in the data extraction module, the region information includes:
the data range of the region, given by startRow and stopRow;
the hostname of the RegionServer hosting the region.
A Map data structure is built from the above region information: the key is the hostname of a RegionServer, and the value is the list of region information for that server.
In addition, in the data extraction module,
a thread pool of capacity N is created to execute the tasks that extract data from HBase;
the number of regions is M and the thread pool size is N; regions are selected and read as follows:
(a) if M ≤ N, all regions are taken as read objects;
(b) if M > N, the Map is traversed in a loop until the Map is empty;
according to the region information, the region's startRow and stopRow are set as the scan's start row key and stop row key, so that the scan reads data from this region only; the information of each region is wrapped in a thread class and submitted to the thread pool, which performs the read.
In addition, in (b), the logic of each traversal of the Map is: obtain an entry from the Map, whose value is a region information list;
take one record from the region information list and remove that record from the list; if the region information list is empty after the record is removed, remove the entry from the Map.
The beneficial effects of the present invention are as follows:
1. Reading one huge table is turned into reading many small data blocks (regions).
2. Data blocks are read by multiple concurrent threads, so HBase data is extracted quickly.
3. The regions being extracted at any one time are balanced across the RegionServers, which prevents hotspots from putting pressure on a single RegionServer during extraction.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Any simple modification, equivalent change, or variation made to the above embodiment according to the technical essence of the present invention still falls within the scope of protection of the technical solution of the present invention.

Claims (8)

1. A method for full data extraction from HBase, characterized in that the method comprises the following steps:
(1) deploying on a server that can access the HBase cluster;
(2) on this server, configuring the mapping between the hostnames and IP addresses of the RegionServers in the hosts file;
(3) first reading the HBase metadata table to obtain region information; then building different scan objects according to the different region information to extract data;
(4) storing the extracted data under different directories of HDFS according to region.
2. The method for full data extraction from HBase according to claim 1, characterized in that:
in step (3), the region information includes:
the data range of the region, given by startRow and stopRow;
the hostname of the RegionServer hosting the region;
a Map data structure is built from the above region information: the key is the hostname of a RegionServer, and the value is a region information list.
3. The method for full data extraction from HBase according to claim 2, characterized in that:
in step (3),
a thread pool of capacity N is created to execute the tasks that extract data from HBase;
the number of regions is M and the thread pool size is N; regions are selected and read as follows:
(a) if M ≤ N, all regions are taken as read objects;
(b) if M > N, the Map is traversed in a loop until the Map is empty;
according to the region information, the region's startRow and stopRow are set as the scan's start row key and stop row key, so that the scan reads data from this region only; the information of each region is wrapped in a thread class and submitted to the thread pool, which performs the read.
4. The method for full data extraction from HBase according to claim 3, characterized in that:
in (b), the logic of each traversal of the Map is: obtain an entry from the Map, whose value is a region information list;
take one record from the region information list and remove that record from the list; if the region information list is empty after the record is removed, remove the entry from the Map.
5. A system for full data extraction from HBase, characterized in that the system comprises:
a deployment module, configured to deploy the system on a server that can access the HBase cluster;
a configuration module, configured to configure, on that server, the mapping between the hostnames and IP addresses of the RegionServers in the hosts file;
a data extraction module, configured to first read the HBase metadata table to obtain region information, and then build different scan objects according to the different region information to extract data;
a data storage module, configured to store the extracted data under different directories of HDFS according to region.
6. The system for full data extraction from HBase according to claim 5, characterized in that:
in the data extraction module, the region information includes:
the data range of the region, given by startRow and stopRow;
the hostname of the RegionServer hosting the region;
a Map data structure is built from the above region information: the key is the hostname of a RegionServer, and the value is a region information list.
7. The system for full data extraction from HBase according to claim 6, characterized in that:
in the data extraction module,
a thread pool of capacity N is created to execute the tasks that extract data from HBase;
the number of regions is M and the thread pool size is N; regions are selected and read as follows:
(a) if M ≤ N, all regions are taken as read objects;
(b) if M > N, the Map is traversed in a loop until the Map is empty;
according to the region information, the region's startRow and stopRow are set as the scan's start row key and stop row key, so that the scan reads data from this region only; the information of each region is wrapped in a thread class and submitted to the thread pool, which performs the read.
8. The system for full data extraction from HBase according to claim 7, characterized in that:
in (b), the logic of each traversal of the Map is: obtain an entry from the Map, whose value is a region information list;
take one record from the region information list and remove that record from the list; if the region information list is empty after the record is removed, remove the entry from the Map.
CN201610902484.8A 2016-10-17 2016-10-17 Method and system for full data extraction from HBase Pending CN106294886A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610902484.8A CN106294886A (en) 2016-10-17 2016-10-17 Method and system for full data extraction from HBase

Publications (1)

Publication Number Publication Date
CN106294886A (en) 2017-01-04

Family

ID=57717746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610902484.8A Pending CN106294886A (en) 2016-10-17 2016-10-17 Method and system for full data extraction from HBase

Country Status (1)

Country Link
CN (1) CN106294886A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102725753A (en) * 2011-11-28 2012-10-10 华为技术有限公司 Method and apparatus for optimizing data access, method and apparatus for optimizing data storage
US20130282668A1 (en) * 2012-04-20 2013-10-24 Cloudera, Inc. Automatic repair of corrupt hbases
CN103631922A (en) * 2013-12-03 2014-03-12 南通大学 Hadoop cluster-based large-scale Web information extraction method and system
CN103646073A (en) * 2013-12-11 2014-03-19 浪潮电子信息产业股份有限公司 Condition query optimizing method based on HBase table
CN104516985A (en) * 2015-01-15 2015-04-15 浪潮(北京)电子信息产业有限公司 Rapid mass data importing method based on HBase database
CN104820670A (en) * 2015-03-13 2015-08-05 国家电网公司 Method for acquiring and storing big data of power information
CN105205154A (en) * 2015-09-24 2015-12-30 浙江宇视科技有限公司 Data migration method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
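Wang Jiangyong (王姜勇): "Research on concurrent processing based on large-scale data sets", China Master's Theses Full-text Database, Information Science and Technology *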

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110389766A (en) * 2019-06-21 2019-10-29 深圳市汇川技术股份有限公司 HBase container cluster dispositions method, system, equipment and computer readable storage medium
CN110389766B (en) * 2019-06-21 2022-12-27 深圳市汇川技术股份有限公司 HBase container cluster deployment method, system, equipment and computer readable storage medium
CN110457279A (en) * 2019-07-11 2019-11-15 新华三大数据技术有限公司 Off-line data scan method, device, server and readable storage medium storing program for executing
CN110457279B (en) * 2019-07-11 2022-03-11 新华三大数据技术有限公司 Data offline scanning method and device, server and readable storage medium
CN111241171A (en) * 2019-10-28 2020-06-05 杭州美创科技有限公司 Full-amount data extraction method for database
CN110928941A (en) * 2019-11-28 2020-03-27 杭州数梦工场科技有限公司 Data fragment extraction method and device
CN110928941B (en) * 2019-11-28 2023-10-27 杭州数梦工场科技有限公司 Data fragment extraction method and device
CN111949673A (en) * 2020-08-04 2020-11-17 贵州易鲸捷信息技术有限公司 Hbase storage-based distributed pessimistic lock and implementation method thereof
CN111949673B (en) * 2020-08-04 2024-02-20 贵州易鲸捷信息技术有限公司 Hbase storage-based distributed pessimistic lock and implementation method thereof
CN116049197A (en) * 2023-03-07 2023-05-02 中船重工奥蓝托无锡软件技术有限公司 HBase-based data equilibrium storage method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170104