CN106294886A - Method and system for full extraction of data from HBase - Google Patents
Method and system for full extraction of data from HBase
- Publication number
- CN106294886A (application CN201610902484.8A)
- Authority
- CN
- China
- Prior art keywords
- region
- hbase
- data
- extracted data
- region information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a method for full extraction of data from HBase. The method extracts all data of an HBase table efficiently using multiple concurrent threads and balances the load across regionServers during extraction, preventing hotspots that would put pressure on individual regionServers. The method comprises: (1) deploying on a server that can access the HBase cluster; (2) on this server, configuring the mapping between the host names and IP addresses of the regionServers in the hosts file; (3) first reading the metadata table of HBase to obtain region information, then building a separate scan object for each region according to its region information to extract the data; (4) storing the extracted data in different HDFS directories according to region. A system for full extraction of data from HBase is also provided.
Description
Technical field
The present invention relates to the technical field of big data processing, and in particular to a method for full extraction of data from HBase and a system for full extraction of data from HBase.
Background technology
HBase is a distributed, column-oriented open-source database. The technology originates from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang et al. Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a sub-project of the Apache Hadoop project. Unlike a typical relational database, HBase is a database suited to storing unstructured data. Another difference is that HBase is column-based rather than row-based.
An HBase table is logically split into separate data blocks, and no data is shared between blocks. Full extraction of data from an HBase table can use the HBase scan API (which reads table data through a scan): if the start row key (startRow) and stop row key (stopRow) of the scan are not set, the scan reads the whole table.
Such a scan reads only one region (partition) at a time and moves on to the next region only after the current one has been read completely, which is a serial approach. When the HBase table is large, extracting data this way is inefficient.
Summary of the invention
To overcome the defects of the prior art, the technical problem to be solved by the present invention is to provide a method for full extraction of data from HBase that can extract all HBase data efficiently using multiple concurrent threads while balancing the load across regionServers during extraction, preventing hotspots from putting pressure on the regionServers.
The technical solution of the present invention is a method for full extraction of data from HBase, comprising the following steps:
(1) deploying on a server that can access the HBase cluster;
(2) on this server, configuring the mapping between the host names and IP addresses of the regionServers in the hosts file;
(3) first reading the metadata table of HBase to obtain region information; then building a separate scan object for each region according to its region information to extract the data;
(4) storing the extracted data in different directories of HDFS (Hadoop Distributed File System) according to region.
The present invention converts the reading of one huge table into the reading of many small data blocks (regions). A separate scan object is built for each region according to its region information, so that all HBase data can be extracted efficiently by multiple concurrent threads, with the load balanced across regionServers during extraction, which prevents hotspots from putting pressure on the regionServers.
A system for full extraction of data from HBase is also provided, the system comprising:
a deployment module, configured to deploy the system on a server that can access the HBase cluster;
a configuration module, configured to write, on that server, the mapping between the host names and IP addresses of the regionServers into the hosts file;
a data extraction module, configured to first read the metadata table of HBase to obtain region information, and then build a separate scan object for each region according to its region information to extract the data;
a data storage module, configured to store the extracted data in different directories of HDFS according to region.
Brief description of the drawings
Fig. 1 is a flow chart of the method for full extraction of data from HBase according to the present invention.
Detailed description of the embodiments
As shown in Fig. 1, the method for full extraction of data from HBase comprises the following steps:
(1) deploying on a server that can access the HBase cluster;
(2) on this server, configuring the mapping between the host names and IP addresses of the regionServers in the hosts file;
The hosts file is a system file without an extension that can be opened with tools such as Notepad. Its purpose is to associate commonly used domain names with their corresponding IP addresses, forming a small lookup "database". When a user enters a web address in a browser, the system first looks for the corresponding IP address in the hosts file. If an entry is found, the system opens the corresponding page immediately; if not, the system submits the address to a DNS server for IP address resolution.
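As an illustration of step (2), the following minimal Python sketch renders hosts-file entries from a mapping of host names to IP addresses. The host names and addresses are hypothetical placeholders, not values from the patent; in practice they would come from the cluster being accessed.

```python
def render_hosts_entries(servers):
    """Return hosts-file lines ("<ip><TAB><hostname>") for a {hostname: ip} mapping."""
    return [f"{ip}\t{hostname}" for hostname, ip in sorted(servers.items())]


# Hypothetical regionServer names and addresses, used only for illustration.
region_servers = {
    "rs1.example.internal": "10.0.0.11",
    "rs2.example.internal": "10.0.0.12",
}

hosts_lines = render_hosts_entries(region_servers)
print("\n".join(hosts_lines))
```

Each printed line would be appended to the hosts file on the extraction server, so that the host names returned with the region information resolve to reachable addresses.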
(3) first reading the metadata table of HBase to obtain region information; then building a separate scan object for each region according to its region information to extract the data;
(4) storing the extracted data in different directories of HDFS according to region.
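Step (4) can be sketched as follows. This is only an illustration under stated assumptions: a local temporary directory stands in for HDFS, and the directory layout (`<base>/<table>/<region>`), the file name `part-00000`, and the tab-separated row format are all hypothetical choices that the patent does not specify.

```python
import tempfile
from pathlib import Path


def write_region_rows(base_dir, table, region_name, rows):
    """Write the rows of one region into that region's own directory."""
    out_dir = Path(base_dir) / table / region_name  # one directory per region
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "part-00000"  # hypothetical file name
    with out_file.open("w", encoding="utf-8") as handle:
        for row_key, value in rows:
            handle.write(f"{row_key}\t{value}\n")
    return out_file


# A local temporary directory stands in for an HDFS base path.
base = tempfile.mkdtemp()
f1 = write_region_rows(base, "demo_table", "region-0001", [("a", "1"), ("b", "2")])
f2 = write_region_rows(base, "demo_table", "region-0002", [("m", "3")])
```

Because every region writes to its own directory, concurrent extraction tasks never write to the same file.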
The present invention converts the reading of one huge table into the reading of many small data blocks (regions). A separate scan object is built for each region according to its region information, so that all HBase data can be extracted efficiently by multiple concurrent threads, with the load balanced across regionServers during extraction, which prevents hotspots from putting pressure on the regionServers.
In addition, in step (3), the region information includes:
the data range of the region, given by startRow and stopRow;
the host name of the regionServer on which the region is located.
From this region information a map data structure is built: the key is the host name of a regionServer, and the value is a list of region information.
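The map described above can be sketched in Python as follows. The `RegionInfo` class, its field names, and the sample regions are hypothetical; a real implementation would fill them from the HBase metadata table. The empty start and stop keys follow the HBase convention that the first region has an empty start key and the last region has an empty end key.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionInfo:
    start_row: bytes  # startRow of the region (empty for the first region)
    stop_row: bytes   # stopRow of the region (empty for the last region)
    server: str       # host name of the regionServer hosting the region


def group_by_server(regions):
    """Build {regionServer host name: [RegionInfo, ...]}."""
    grouped = defaultdict(list)
    for region in regions:
        grouped[region.server].append(region)
    return dict(grouped)


# Hypothetical region layout for illustration.
regions = [
    RegionInfo(b"", b"g", "rs1"),
    RegionInfo(b"g", b"n", "rs2"),
    RegionInfo(b"n", b"t", "rs1"),
    RegionInfo(b"t", b"", "rs2"),
]
region_map = group_by_server(regions)
```

Keying the map by regionServer is what later allows regions to be picked evenly across servers.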
In addition, in step (3),
a thread pool of capacity N is created to execute the tasks that extract data from HBase;
the number of regions is M and the thread pool size is N; regions are selected and read as follows:
(a) if M ≤ N, all regions are taken as read targets;
(b) if M > N, the Map is traversed in a loop until the Map is empty.
According to the region information, the startRow and stopRow of a region are set as the start row key and stop row key of a scan, so that the scan reads data from this region only. The information of each region is wrapped in a thread class and submitted to the thread pool, which performs the data read.
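The per-region scan and the thread pool can be sketched as below. This is not HBase client code: a sorted in-memory list stands in for the table, `scan_range` is a hypothetical helper that mimics a scan whose start key is inclusive and whose stop key is exclusive, and the region boundaries and pool size N = 2 are made up for the example.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for an HBase table: rows sorted by row key.
TABLE = sorted((f"row{i:03d}", i) for i in range(100))
KEYS = [key for key, _ in TABLE]


def scan_range(start_row, stop_row):
    """Return rows with start_row <= key < stop_row; "" means unbounded."""
    lo = bisect.bisect_left(KEYS, start_row) if start_row else 0
    hi = bisect.bisect_left(KEYS, stop_row) if stop_row else len(KEYS)
    return TABLE[lo:hi]


# Hypothetical regions given as (startRow, stopRow) pairs; here M = 4.
REGIONS = [("", "row025"), ("row025", "row050"), ("row050", "row075"), ("row075", "")]
N = 2  # thread pool size

# One task per region; at most N tasks run at the same time.
with ThreadPoolExecutor(max_workers=N) as pool:
    results = list(pool.map(lambda region: scan_range(*region), REGIONS))

extracted = [row for part in results for row in part]
```

Since the region ranges are disjoint and together cover the whole key space, concatenating the per-region results yields the full table exactly once.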
In addition, in (b), the logic of each traversal of the Map is: obtain an entry from the Map, whose value is a region information list;
take one record from the region information list and remove it from the list; if the region information list is empty after the record is removed, remove the entry from the Map.
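One possible reading of this traversal is a round-robin drain of the Map: each pass takes one region from every regionServer that still has regions left, so consecutive tasks target different servers. The sketch below assumes that reading; the function name and the sample data are hypothetical.

```python
def interleave_regions(region_map):
    """Drain {server: [regions]} round-robin and return the read order."""
    # Work on a copy so the caller's map is left untouched.
    pending = {server: list(regions) for server, regions in region_map.items() if regions}
    order = []
    while pending:  # loop until the map is empty
        for server in list(pending):  # one pass over the remaining entries
            order.append(pending[server].pop(0))  # take one record and remove it
            if not pending[server]:  # list is now empty: remove the entry
                del pending[server]
    return order


# Hypothetical map: rs1 hosts three regions, rs2 one, rs3 two.
region_map = {"rs1": ["r1", "r2", "r3"], "rs2": ["r4"], "rs3": ["r5", "r6"]}
order = interleave_regions(region_map)
print(order)  # ['r1', 'r4', 'r5', 'r2', 'r6', 'r3']
```

Submitting tasks to the thread pool in this order keeps the regions being read at any one time spread across regionServers rather than concentrated on one of them.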
A person of ordinary skill in the art will appreciate that all or part of the steps of the method of the above embodiment can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method of the above embodiment; the storage medium may be a ROM/RAM, a magnetic disk, an optical disc, a memory card, and so on. Accordingly, corresponding to the method of the present invention, the present invention also includes a system for full extraction of data from HBase, which is generally expressed as functional modules corresponding to the steps of the method. The system using this method comprises:
a deployment module, configured to deploy the system on a server that can access the HBase cluster;
a configuration module, configured to write, on that server, the mapping between the host names and IP addresses of the regionServers into the hosts file;
a data extraction module, configured to first read the metadata table of HBase to obtain region information, and then build a separate scan object for each region according to its region information to extract the data;
a data storage module, configured to store the extracted data in different directories of HDFS according to region.
In addition, in the data extraction module, the region information includes:
the data range of the region, given by startRow and stopRow;
the host name of the regionServer on which the region is located.
From this region information a map data structure is built: the key is the host name of a regionServer, and the value is a list of region information.
In addition, in the data extraction module,
a thread pool of capacity N is created to execute the tasks that extract data from HBase;
the number of regions is M and the thread pool size is N; regions are selected and read as follows:
(a) if M ≤ N, all regions are taken as read targets;
(b) if M > N, the Map is traversed in a loop until the Map is empty.
According to the region information, the startRow and stopRow of a region are set as the start row key and stop row key of a scan, so that the scan reads data from this region only. The information of each region is wrapped in a thread class and submitted to the thread pool, which performs the data read.
In addition, in (b), the logic of each traversal of the Map is: obtain an entry from the Map, whose value is a region information list;
take one record from the region information list and remove it from the list; if the region information list is empty after the record is removed, remove the entry from the Map.
The beneficial effects of the present invention are as follows:
1. Reading one huge table is converted into reading many small data blocks (regions).
2. Data blocks are read by multiple concurrent threads, so HBase data is extracted quickly.
3. The regions being extracted at any one time are distributed evenly across regionServers, preventing hotspots from putting pressure on the regionServers during extraction.
The above is merely a preferred embodiment of the present invention and does not limit the present invention in any form. Any simple modification, equivalent change, or adaptation made to the above embodiment according to the technical essence of the present invention still falls within the protection scope of the technical solution of the present invention.
Claims (8)
1. A method for full extraction of data from HBase, characterized in that the method comprises the following steps:
(1) deploying on a server that can access the HBase cluster;
(2) on this server, configuring the mapping between the host names and IP addresses of the regionServers in the hosts file;
(3) first reading the metadata table of HBase to obtain region information; then building a separate scan object for each region according to its region information to extract the data;
(4) storing the extracted data in different directories of HDFS according to region.
2. The method for full extraction of data from HBase according to claim 1, characterized in that:
in step (3), the region information includes:
the data range of the region, given by startRow and stopRow;
the host name of the regionServer on which the region is located;
from this region information a map data structure is built: the key is the host name of a regionServer, and the value is a list of region information.
3. The method for full extraction of data from HBase according to claim 2, characterized in that:
in step (3),
a thread pool of capacity N is created to execute the tasks that extract data from HBase;
the number of regions is M and the thread pool size is N; regions are selected and read as follows:
(a) if M ≤ N, all regions are taken as read targets;
(b) if M > N, the Map is traversed in a loop until the Map is empty;
according to the region information, the startRow and stopRow of a region are set as the start row key and stop row key of a scan, so that the scan reads data from this region only; the information of each region is wrapped in a thread class and submitted to the thread pool, which performs the data read.
4. The method for full extraction of data from HBase according to claim 3, characterized in that:
in (b), the logic of each traversal of the Map is: obtain an entry from the Map, whose value is a region information list;
take one record from the region information list and remove it from the list; if the region information list is empty after the record is removed, remove the entry from the Map.
5. A system for full extraction of data from HBase, characterized in that the system comprises:
a deployment module, configured to deploy the system on a server that can access the HBase cluster;
a configuration module, configured to write, on that server, the mapping between the host names and IP addresses of the regionServers into the hosts file;
a data extraction module, configured to first read the metadata table of HBase to obtain region information, and then build a separate scan object for each region according to its region information to extract the data;
a data storage module, configured to store the extracted data in different directories of HDFS according to region.
6. The system for full extraction of data from HBase according to claim 5, characterized in that:
in the data extraction module, the region information includes:
the data range of the region, given by startRow and stopRow;
the host name of the regionServer on which the region is located;
from this region information a map data structure is built: the key is the host name of a regionServer, and the value is a list of region information.
7. The system for full extraction of data from HBase according to claim 6, characterized in that:
in the data extraction module,
a thread pool of capacity N is created to execute the tasks that extract data from HBase;
the number of regions is M and the thread pool size is N; regions are selected and read as follows:
(a) if M ≤ N, all regions are taken as read targets;
(b) if M > N, the Map is traversed in a loop until the Map is empty;
according to the region information, the startRow and stopRow of a region are set as the start row key and stop row key of a scan, so that the scan reads data from this region only; the information of each region is wrapped in a thread class and submitted to the thread pool, which performs the data read.
8. The system for full extraction of data from HBase according to claim 7, characterized in that:
in (b), the logic of each traversal of the Map is: obtain an entry from the Map, whose value is a region information list;
take one record from the region information list and remove it from the list; if the region information list is empty after the record is removed, remove the entry from the Map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610902484.8A CN106294886A (en) | 2016-10-17 | 2016-10-17 | Method and system for full extraction of data from HBase |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610902484.8A CN106294886A (en) | 2016-10-17 | 2016-10-17 | Method and system for full extraction of data from HBase |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106294886A (en) | 2017-01-04 |
Family
ID=57717746
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610902484.8A Pending CN106294886A (en) | 2016-10-17 | 2016-10-17 | Method and system for full extraction of data from HBase |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106294886A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110389766A (en) * | 2019-06-21 | 2019-10-29 | 深圳市汇川技术股份有限公司 | HBase container cluster dispositions method, system, equipment and computer readable storage medium |
CN110457279A (en) * | 2019-07-11 | 2019-11-15 | 新华三大数据技术有限公司 | Off-line data scan method, device, server and readable storage medium storing program for executing |
CN110928941A (en) * | 2019-11-28 | 2020-03-27 | 杭州数梦工场科技有限公司 | Data fragment extraction method and device |
CN111241171A (en) * | 2019-10-28 | 2020-06-05 | 杭州美创科技有限公司 | Full-amount data extraction method for database |
CN111949673A (en) * | 2020-08-04 | 2020-11-17 | 贵州易鲸捷信息技术有限公司 | Hbase storage-based distributed pessimistic lock and implementation method thereof |
CN116049197A (en) * | 2023-03-07 | 2023-05-02 | 中船重工奥蓝托无锡软件技术有限公司 | HBase-based data equilibrium storage method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102725753A (en) * | 2011-11-28 | 2012-10-10 | 华为技术有限公司 | Method and apparatus for optimizing data access, method and apparatus for optimizing data storage |
US20130282668A1 (en) * | 2012-04-20 | 2013-10-24 | Cloudera, Inc. | Automatic repair of corrupt hbases |
CN103631922A (en) * | 2013-12-03 | 2014-03-12 | 南通大学 | Hadoop cluster-based large-scale Web information extraction method and system |
CN103646073A (en) * | 2013-12-11 | 2014-03-19 | 浪潮电子信息产业股份有限公司 | Condition query optimizing method based on HBase table |
CN104516985A (en) * | 2015-01-15 | 2015-04-15 | 浪潮(北京)电子信息产业有限公司 | Rapid mass data importing method based on HBase database |
CN104820670A (en) * | 2015-03-13 | 2015-08-05 | 国家电网公司 | Method for acquiring and storing big data of power information |
CN105205154A (en) * | 2015-09-24 | 2015-12-30 | 浙江宇视科技有限公司 | Data migration method and device |
Non-Patent Citations (1)
Title |
---|
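Wang Jiangyong: "Research on Concurrent Processing Based on Large-Scale Data Sets", China Master's Theses Full-text Database, Information Science and Technology * |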
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110389766A (en) * | 2019-06-21 | 2019-10-29 | 深圳市汇川技术股份有限公司 | HBase container cluster dispositions method, system, equipment and computer readable storage medium |
CN110389766B (en) * | 2019-06-21 | 2022-12-27 | 深圳市汇川技术股份有限公司 | HBase container cluster deployment method, system, equipment and computer readable storage medium |
CN110457279A (en) * | 2019-07-11 | 2019-11-15 | 新华三大数据技术有限公司 | Off-line data scan method, device, server and readable storage medium storing program for executing |
CN110457279B (en) * | 2019-07-11 | 2022-03-11 | 新华三大数据技术有限公司 | Data offline scanning method and device, server and readable storage medium |
CN111241171A (en) * | 2019-10-28 | 2020-06-05 | 杭州美创科技有限公司 | Full-amount data extraction method for database |
CN110928941A (en) * | 2019-11-28 | 2020-03-27 | 杭州数梦工场科技有限公司 | Data fragment extraction method and device |
CN110928941B (en) * | 2019-11-28 | 2023-10-27 | 杭州数梦工场科技有限公司 | Data fragment extraction method and device |
CN111949673A (en) * | 2020-08-04 | 2020-11-17 | 贵州易鲸捷信息技术有限公司 | Hbase storage-based distributed pessimistic lock and implementation method thereof |
CN111949673B (en) * | 2020-08-04 | 2024-02-20 | 贵州易鲸捷信息技术有限公司 | Hbase storage-based distributed pessimistic lock and implementation method thereof |
CN116049197A (en) * | 2023-03-07 | 2023-05-02 | 中船重工奥蓝托无锡软件技术有限公司 | HBase-based data equilibrium storage method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106294886A (en) | A kind of method and system of full dose extracted data from HBase | |
CN107957957B (en) | Test case obtaining method and device | |
JP5961689B2 (en) | Incremental data extraction | |
Lee et al. | Efficient spatial query processing for big data | |
CN105718455A (en) | Data query method and apparatus | |
CN104133867A (en) | DOT in-fragment secondary index method and DOT in-fragment secondary index system | |
CN104657423A (en) | Method and device thereof for sharing contents of applications | |
CN109241003B (en) | File management method and device | |
CN110019542B (en) | Generation of enterprise relationship, generation of organization member database and identification of same name member | |
CN110888837A (en) | Object storage small file merging method and device | |
CN103744875B (en) | Data quick migration method and system based on file system | |
JPWO2014006903A1 (en) | Content control method, content control apparatus, and program | |
CN110399096B (en) | Method, device and equipment for deleting metadata cache of distributed file system again | |
US20110264703A1 (en) | Importing Tree Structure | |
CN105786843A (en) | Multi-language implementation method for applications and multi-language information query method and device | |
CN103593447B (en) | Data processing method and device applied to database table | |
CN111209061B (en) | User information filling method, device, computer equipment and storage medium | |
CN104408128B (en) | A kind of reading optimization method indexed based on B+ trees asynchronous refresh | |
CN111176901B (en) | HDFS deleted file recovery method, terminal device and storage medium | |
CN105279166B (en) | File management method and system | |
CN107239568B (en) | Distributed index implementation method and device | |
CN112328379A (en) | Application migration method, device, equipment and medium | |
CN107315806B (en) | Embedded storage method and device based on file system | |
CN107357836B (en) | VNF package and method and device for deleting mirror image file referenced by VNF package | |
CN110968555A (en) | Dimension data processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170104 |