CN105808577A - HBase database-based data batch loading method and device - Google Patents


Info

Publication number
CN105808577A
Authority
CN
China
Prior art keywords
subregion
scope
end value
hbase
hbase table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410848940.6A
Other languages
Chinese (zh)
Other versions
CN105808577B (en
Inventor
唐正才
王庆磊
张国波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenzhou Taiyue Software Co Ltd
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd filed Critical Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201410848940.6A priority Critical patent/CN105808577B/en
Publication of CN105808577A publication Critical patent/CN105808577A/en
Application granted granted Critical
Publication of CN105808577B publication Critical patent/CN105808577B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an HBase database-based data batch loading method and device. The method comprises the following steps: extracting row keys of to-be-loaded source data, sorting the row keys, and partitioning the sorted row keys evenly according to a specified number of partitions so as to determine the row key corresponding to the end value of each partition range; adding a predetermined length to the row key corresponding to the end value of each partition range to serve as the end value of each pre-established partition range; judging whether an HBase table exists in the HBase database; if not, creating an HBase table and establishing partitions in the HBase table according to the end value of each pre-established partition range; generating corresponding HFile files for the to-be-loaded source data in parallel according to each partition in the HBase table; and importing the HFile files into the HBase table in batches. With the disclosed data batch loading method, the HFile file generation speed and loading speed are improved, so that HBase batch loading efficiency is greatly enhanced.

Description

HBase database-based data batch loading method and device
Technical field
The present invention relates to the technical field of HBase databases, and in particular to a method and device for batch loading of data into an HBase database.
Background
HBase is a highly reliable, high-performance, column-oriented, scalable distributed database. Unlike a conventional relational database, HBase is suited to storing unstructured data. With HBase, a large-scale structured storage cluster can be built on inexpensive PC servers, which effectively reduces storage cost in a big-data setting. However, HBase has a problem with batch loading: loading large volumes of data with the loading tools that HBase itself provides is very slow and inefficient. For example, loading a data file of hundreds of gigabytes generally takes 23 to 24 hours, or even longer. The batch loading steps are roughly as follows:
1. The data file is first converted in parallel by the Importtsv tool into HFile files, the underlying storage files of HBase. Data skew occurs very easily in this step: each partition covers a range, and when the partition ranges are poorly designed, a large amount of data is concentrated in a few partitions. Those partitions are then processed very slowly, which reduces the overall running speed.
2. The generated HFile files are imported in batches into the HBase table by the BulkLoad tool. In this step an HFile file can easily span partitions, meaning that part of the data in one HFile file belongs to the range of partition A and another part belongs to the range of partition B. Because HBase manages HFile files by partition at the storage layer, once a generated HFile file spans partitions, the import process has to copy and split that file. Copying and splitting is quite time-consuming and greatly reduces the overall loading efficiency.
The problems in these two steps severely limit the efficiency of HBase batch loading, making the whole data loading process slow and time-consuming.
Summary of the invention
The present invention provides a method and device for batch loading of data into an HBase database, to solve the problem that existing batch loading into HBase databases is time-consuming and inefficient.
According to one aspect of the invention, a batch data loading method based on an HBase database is provided. The method includes:
For the source data to be loaded, extracting row keys and sorting them, partitioning the sorted row keys evenly according to a specified number of partitions, and determining the row key corresponding to the end value of each partition range;
Adding a predetermined length to the row key corresponding to the end value of each partition range, and using the result as the end value of each pre-established partition range;
Judging whether an HBase table exists in the HBase database;
If not, creating an HBase table in the HBase database and establishing partitions in the HBase table according to the end values of the pre-established partition ranges; then generating the corresponding HFile files for the source data to be loaded, in parallel, according to the partitions established in the HBase table;
Importing the generated HFile files into the HBase table in batches.
The method further includes:
If so, extracting the end value of each existing partition range from the existing HBase table;
Sorting the end values of the pre-established partition ranges together with the end values of the existing partition ranges to obtain the new partition ranges;
Generating the corresponding HFile files for the source data to be loaded, in parallel, according to the new partition ranges;
Importing the generated HFile files into the existing HBase table in batches.
Sorting the end values of the pre-established partition ranges together with the end values of the existing partition ranges to obtain the new partition ranges includes:
Sorting the end values of the pre-established partition ranges and of the existing partition ranges;
Determining one new partition range from every two adjacent end values after sorting.
Importing the generated HFile files into the HBase table in batches includes:
According to the end values of the existing partition ranges in the existing HBase table, importing each HFile file whose new partition range falls within an existing partition range into that existing partition.
Generating the corresponding HFile files for the source data to be loaded, in parallel, according to the new partition ranges includes:
Generating a partition file from the new partition ranges, modifying the source code of the Importtsv tool so that the generated partition file is passed to the TotalOrderPartitioner class, and then generating the corresponding HFile files in parallel with the modified Importtsv tool.
Extracting and sorting the row keys of the source data to be loaded includes:
Extracting all row keys of the source data to be loaded and sorting them;
Or, extracting a portion of the row keys of the source data to be loaded and sorting them.
According to another aspect of the invention, a batch data loading device based on an HBase database is provided. The device includes:
A partition range determining unit, configured to extract and sort the row keys of the source data to be loaded, partition the sorted row keys evenly according to a specified number of partitions, and determine the row key corresponding to the end value of each partition range;
A pre-established partition range end value determining unit, configured to add a predetermined length to the row key corresponding to the end value of each partition range and use the result as the end value of each pre-established partition range;
An HBase table judging unit, configured to judge whether an HBase table exists in the HBase database;
A partition establishing unit, configured to create an HBase table in the HBase database when the HBase table judging unit determines that no HBase table exists, and to establish partitions in the HBase table according to the end values of the pre-established partition ranges;
An HFile file generating unit, configured to generate the corresponding HFile files for the source data to be loaded, in parallel, according to the partitions established in the HBase table;
An HFile file importing unit, configured to import the generated HFile files into the HBase table in batches.
The device further includes: a new partition range determining unit, configured to extract the end value of each existing partition range from the existing HBase table when the HBase table judging unit determines that an HBase table exists in the HBase database, and to sort the end values of the pre-established partition ranges together with the end values of the existing partition ranges to obtain the new partition ranges;
The HFile file generating unit is then configured to generate the corresponding HFile files for the source data to be loaded, in parallel, according to the new partition ranges;
The HFile file importing unit is configured to import the generated HFile files into the HBase table in batches.
The new partition range determining unit is specifically configured to sort the end values of the pre-established partition ranges and of the existing partition ranges;
And to determine one new partition range from every two adjacent end values after sorting.
The HFile file importing unit is configured to import, according to the end values of the existing partition ranges in the existing HBase table, each HFile file whose new partition range falls within an existing partition range into that existing partition.
In the method and device of the invention, the row keys of the source data to be loaded are extracted and partitioned evenly according to the specified number of partitions, which avoids data skew while HFile files are generated. The row key corresponding to the end value of each partition range is lengthened by a predetermined length and used as the end value of a pre-established partition range, so that the end values used to create the table are longer than the row keys; this prevents the generated HFile files from spanning partitions. As a result, both the generation speed of the underlying HFile files and the speed of loading them into the HBase database are increased, and HBase batch loading efficiency is greatly improved.
Brief description of the drawings
Fig. 1 is a flow chart of a batch data loading method based on an HBase database according to one embodiment of the invention;
Fig. 2 is a schematic flow diagram of a batch data loading method based on an HBase database according to another embodiment of the invention;
Fig. 3 is a schematic diagram of partition range cross-processing in a batch data loading method based on an HBase database according to one embodiment of the invention;
Fig. 4 is a block diagram of a batch data loading device based on an HBase database according to one embodiment of the invention.
Detailed description
The core idea of the invention is as follows: to address the key bottlenecks that limit batch loading in existing HBase databases, row keys are extracted from the source data to be loaded and partitioned evenly according to a specified number of partitions, so that the partition ranges of the data are divided effectively and both data skew and partition-spanning HFile files are avoided during HFile generation.
Fig. 1 is a flow chart of a batch data loading method based on an HBase database according to one embodiment of the invention. Referring to Fig. 1, when no HBase table exists in the HBase database, the batch data loading method includes:
Step S110: for the source data to be loaded, extract the row keys and sort them, partition the sorted row keys evenly according to the specified number of partitions, and determine the row key corresponding to the end value of each partition range;
Here, the row key is the identifier that the HBase database assigns to each record when data is loaded. For example, suppose the source data to be loaded is as follows:
abef,1ewew,df;
aaxe,5sfd,3wesdd;
xrty,9dsdw,32dd;
From these 3 records, the first-column values abef, aaxe and xrty are extracted as the row keys of the respective records. The row keys must be sorted at load time so that the data can be loaded into the HBase database in ascending order, because HBase requires loaded data to be ordered by row key, using natural ordering. Sorted in ascending order, the three row keys abef, aaxe and xrty become:
aaxe, abef, xrty;
In addition, an end value here means the boundary value of a partition. For example, if the sorted row keys of the source data are 123, 3se, 4ad and 5dw and they are divided into 2 partitions, one partition runs from negative infinity to 3se and the other from 3se to positive infinity; 3se is then the row key corresponding to the end value of the partition range.
Step S120: add a predetermined length to the row key corresponding to the end value of each partition range, and use the result as the end value of each pre-established partition range;
The row keys here are the ones extracted from the source data to be loaded. The value obtained by lengthening the original row key by one character is used as the end value of a pre-established partition range. For example, the row key obtained for the partition range in the example of step S110 is 3se; lengthened by one character it becomes 3se1, which is used as the end value of the pre-established partition range. The original row key may also be lengthened by more than one character.
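The lengthening in step S120 can be sketched in a few lines of Python. The function name `pre_established_end_values` and the default suffix `"1"` are assumptions taken from the 3se to 3se1 example; the text only requires that a predetermined length be added.

```python
def pre_established_end_values(end_keys, suffix="1"):
    """Append a fixed suffix to each end key so that every resulting
    boundary is longer than the row keys it was derived from."""
    if not suffix:
        raise ValueError("suffix must add at least one character")
    return [key + suffix for key in end_keys]


boundaries = pre_established_end_values(["3se"])
print(boundaries)  # ['3se1']

# The boundary sorts strictly between the end key and the next row key,
# so no row key of the original length can coincide with it.
print("3se" < boundaries[0] < "4ad")  # True
```

Because the boundary is strictly longer than any row key, a row key can never equal a boundary, which is the property the description relies on to keep HFile files from spanning partitions.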
Step S130: judge whether an HBase table exists in the HBase database; if not, perform step S140;
Step S140: create an HBase table in the HBase database, and establish partitions in the HBase table according to the end values of the pre-established partition ranges;
Step S150: generate the corresponding HFile files for the source data to be loaded, in parallel, according to the partitions established in the HBase table;
That is, while the HFile files are being generated, every record to be loaded is compared against the partition ranges to decide which partition range it belongs to, and the data contained in each partition range is finally written out as HFile files. HFile is the binary storage format for KeyValue data in HBase.
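The comparison of each record against the partition ranges can be sketched as a binary search over the sorted boundaries. This is only an illustration in plain Python, not the MapReduce job itself; `partition_index` and `group_by_partition` are hypothetical names, and each partition is assumed to cover a half-open range that includes its start value and excludes its end value.

```python
from bisect import bisect_right
from collections import defaultdict


def partition_index(row_key, end_values):
    """Return the index of the partition whose range contains row_key.
    end_values must be sorted; partition i covers [end_values[i-1], end_values[i])."""
    return bisect_right(end_values, row_key)


def group_by_partition(records, end_values):
    """Group records by target partition, sorted by row key within each group,
    mimicking the per-partition ordering needed before writing HFile files."""
    groups = defaultdict(list)
    for rec in records:
        row_key = rec.split(",", 1)[0]
        groups[partition_index(row_key, end_values)].append(rec)
    return {
        idx: sorted(rows, key=lambda r: r.split(",", 1)[0])
        for idx, rows in groups.items()
    }


records = ["4ad,x", "123,y", "5dw,z", "3se,w"]
print(group_by_partition(records, ["3se1"]))
# {1: ['4ad,x', '5dw,z'], 0: ['123,y', '3se,w']}
```

With the boundary 3se1 the four records split two and two, so neither partition receives a disproportionate share of the data.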
Step S180: import the HFile files generated in step S150 into the HBase table in batches.
That is, HFile files are generated according to the partition ranges obtained after sorting and cross-processing, yielding the HFile files that correspond to each partition range; these HFile files are then placed into their matching partitions in the HBase table.
In this embodiment the method further includes: if an HBase table exists, performing step S160, which extracts the end value of each existing partition range from the existing HBase table and sorts the end values of the pre-established partition ranges together with the end values of the existing partition ranges to obtain the new partition ranges;
Step S170: generate the corresponding HFile files for the source data to be loaded, in parallel, according to the new partition ranges;
Step S180: import the HFile files generated in step S170 into the existing HBase table in batches.
Through the above process, the partition ranges are determined and the partitions are established in the HBase database, and all data to be loaded is then converted into HFile files according to the determined partition ranges. For newly added data, the partition ranges determined from the new data are cross-processed with the partition ranges already existing in the HBase table to determine the new partition ranges, and the corresponding HFile files are generated. The generated HFile files are imported in batches into the HBase table by the BulkLoad tool, achieving fast batch loading. Sorting the data to be loaded by row key and determining the partition ranges from it prevents large amounts of data from concentrating in a few partitions, which solves the data skew problem that arises when HFile files are generated by partition range. In addition, lengthening the row keys corresponding to the partition range end values by a predetermined length avoids the situation in which, because the row keys and the partition range end values have equal length, the HFile files generated by the Importtsv tool span partitions and have to be re-split when imported into the HBase database. In summary, the method can greatly improve the efficiency of batch loading and shorten the time needed to batch-load data into an HBase database.
Fig. 2 is a schematic flow diagram of a batch data loading method based on an HBase database according to another embodiment of the invention, and Fig. 3 is a schematic diagram of partition range cross-processing in a batch data loading method based on an HBase database according to one embodiment of the invention. The working process of the batch data loading method is described below with reference to Fig. 2 and Fig. 3.
In this embodiment, to achieve fast batch loading into HBase, the source data to be loaded is handled in the following two cases:
A) Data is batch-loaded into the HBase table for the first time (the HBase table has not yet been created)
The row keys of the source data to be loaded are sampled and sorted (that is, a portion of the row keys is extracted and sorted), and the sampled, sorted row keys are partitioned evenly according to the specified number of partitions to determine the row key corresponding to the end value of each partition range. Referring to Fig. 2, because the data is being loaded into the HBase table for the first time, MapReduce is first used to sample and sort the row keys of the source data. The extracted row keys are partitioned evenly according to the specified number of partitions, giving the row key corresponding to the end value of each partition range. Each of these row keys is lengthened by one or more characters to obtain the end values of the pre-established partition ranges, and the HBase table is created in the HBase database, with its partitions established, according to those end values. The corresponding HFile files are then generated in parallel by the Importtsv tool according to the partitions established in the HBase table; at this point the Importtsv tool reads the partition ranges pre-established in the HBase table and uses them to partition all of the source data to be loaded. Finally, the generated HFile files are imported into the HBase table in batches. This effectively prevents data skew when HFile files are generated from the source data during batch loading.
It should be noted that when MapReduce is used to divide the partition ranges of the source data to be loaded, all of the data may instead be sorted by row key (that is, all row keys are extracted), and all row keys are then partitioned evenly according to the specified number of partitions to obtain the row key corresponding to the end value of each partition range.
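The choice between sampling and a full sort can be sketched as follows. This single-process Python version only illustrates the idea; the description performs the step with MapReduce. The name `sampled_end_keys`, the uniform random sampling, and the `seed` parameter are assumptions, since the text does not specify the sampling strategy.

```python
import random


def sampled_end_keys(row_keys, num_partitions, sample_size, seed=0):
    """Estimate partition end keys from a random sample of the row keys.
    If sample_size covers all keys, this degenerates to the full-sort variant."""
    rng = random.Random(seed)
    k = min(sample_size, len(row_keys))
    sample = sorted(rng.sample(list(row_keys), k))
    end_keys = []
    for i in range(1, num_partitions):
        cut = (i * k) // num_partitions  # sampled keys in the first i partitions
        if cut > 0 and (not end_keys or sample[cut - 1] != end_keys[-1]):
            end_keys.append(sample[cut - 1])
    return end_keys


keys = [f"{i:05d}" for i in range(10_000)]
approx = sampled_end_keys(keys, 4, sample_size=400)   # boundaries from a 4% sample
exact = sampled_end_keys(keys, 4, sample_size=len(keys))
print(exact)  # ['02499', '04999', '07499']
```

Sampling trades some balance between partitions for a much cheaper sort; the boundaries in `approx` are close to, but generally not identical to, those in `exact`.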
In this embodiment, creating the HBase table in the HBase database and establishing partitions in it according to the row keys corresponding to the determined partition range end values includes: lengthening the row key corresponding to the end value of each partition range by a predetermined length and using the result as the end value of a pre-established partition. Generally it suffices to lengthen each sampled row key by one character and use the result as the end value of a pre-established partition range; the row key may also be lengthened by several characters. All that is required is that the end values of the pre-established partition ranges be longer than the row keys. This avoids the case in which the partition range end values and the data row keys have equal length when the HFile files are generated, which would cause HFile files to span partitions when they are imported with the BulkLoad tool.
B) New data is loaded into the HBase table (the HBase table already exists and contains data)
When the HBase database already contains an HBase table with data and new source data is to be loaded, the row keys of the new source data are sampled, sorted and partitioned evenly according to the specified number of partitions to determine the row key corresponding to the end value of each partition range. Each of these row keys is lengthened by a predetermined length and used as the end value of a pre-established partition range.
Referring to Fig. 2, step 1: to prevent data skew and partition-spanning HFile files from recurring when the new source data is batch-loaded, MapReduce is again used first to sample and sort the row keys of the new data. The extracted row keys are partitioned evenly according to the specified number of partitions to obtain the row key for each partition range. These row keys are then each lengthened by one or more characters to obtain the end values of the pre-established partition ranges, ensuring that the end values of the pre-established partition ranges have the same length as the end values of the existing partition ranges in the HBase table;
Step 2: the end values of the existing partition ranges are extracted from the HBase table and sorted together with the end values of the pre-established partition ranges to obtain the new partition ranges;
Specifically, the end values of the existing partition ranges and of the pre-established partition ranges are sorted, and one new partition range is determined from every two adjacent end values after sorting. This approach is called the partition range cross method.
The cross-processing of partition ranges is shown in Fig. 3. For example, in the HBase table the end values negative infinity, X1, X2 and positive infinity define three existing partition ranges, while negative infinity, Y1, Y2 and positive infinity define three pre-established partition ranges. In Fig. 3, the first row of ranges represents the partition ranges that already exist in the HBase database. The second row represents the pre-established partition ranges, which are determined by sampling and sorting the row keys of the new source data, partitioning them evenly according to the specified number of partitions, determining the row key corresponding to the end value of each partition range, and lengthening each such row key by a predetermined length to serve as an end value of a pre-established partition range. The end values of the existing and pre-established partition ranges are then sorted, and every two adjacent end values after sorting determine one new partition range. Sorting the end values of the existing and pre-established partition ranges is precisely what is meant by cross-processing the existing partition ranges with the pre-established partition ranges.
Referring to Fig. 3, the left end value of the first existing partition is negative infinity and its right end value is X1; the left end value of the second existing partition is X1 and its right end value is X2; the left end value of the third existing partition is X2 and its right end value is positive infinity;
The left end value of the first pre-established partition is negative infinity and its right end value is Y1; the left end value of the second pre-established partition is Y1 and its right end value is Y2; the left end value of the third pre-established partition is Y2 and its right end value is positive infinity;
Using the cross-sorting method, the six distinct end values obtained above are sorted (negative infinity → X1 → Y1 → X2 → Y2 → positive infinity), and every two adjacent end values after sorting determine one new partition range. That is, among the new partition ranges, partition 1 is {negative infinity, X1}, partition 2 is {X1, Y1}, partition 3 is {Y1, X2}, partition 4 is {X2, Y2} and partition 5 is {Y2, positive infinity}. This gives the end values of every new partition range, and these new partition ranges are used as the partition ranges when HFile files are generated for the new data.
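The cross-processing just described amounts to merging two sorted boundary lists and pairing up neighbours. The sketch below is a hedged Python illustration: `merge_partition_ranges` is a hypothetical name, `None` stands in for the unbounded lower and upper ends, and the concrete boundary strings are invented so that they sort in the same order as X1, Y1, X2, Y2 in Fig. 3.

```python
def merge_partition_ranges(existing_ends, pre_established_ends):
    """Sort the union of both boundary sets and pair adjacent values into
    (start, end) ranges. None marks an unbounded start or end."""
    points = sorted(set(existing_ends) | set(pre_established_ends))
    bounds = [None] + points + [None]
    return list(zip(bounds[:-1], bounds[1:]))


existing = ["b1", "d1"]          # plays the role of X1, X2
pre_established = ["c1", "e1"]   # plays the role of Y1, Y2
for start, end in merge_partition_ranges(existing, pre_established):
    print(start, end)
# None b1
# b1 c1
# c1 d1
# d1 e1
# e1 None
```

Two interior boundaries from each side yield five new ranges, matching partitions 1 to 5 in the text. Using a set also removes boundaries that happen to appear in both lists.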
It should be noted that when the row keys of the source data to be loaded are sampled and sorted to determine the row keys corresponding to the partition range end values, the values at the two outer ends of the partition ranges are system defaults in the HBase database. Therefore only the row keys corresponding to the intermediate end values need to be extracted to determine the partition ranges. Referring to Fig. 3, when partitions are pre-established in the HBase table with a specified partition number of 3, only the two intermediate row keys need to be extracted to determine the three partition ranges.
Likewise, when the pre-established partition ranges of the new source data are determined, only the row keys corresponding to the two intermediate end values need to be determined to obtain the three pre-established partition ranges.
Step 3: a partition file is generated from the new partition ranges, and the source code of the Importtsv tool is modified so that this partition file is passed to the TotalOrderPartitioner class. TotalOrderPartitioner then reads the newly generated partition file rather than the partition information of the existing HBase table in the database. This solves both the data skew problem and the problem of generated HFile files spanning partitions when new data is loaded;
Step 4: the corresponding HFile files are generated in parallel by the modified Importtsv tool;
Step 5: the generated HFile files are imported in batches into the HBase table by the BulkLoad tool.
When new data is loaded, to prevent data skew and partition-spanning HFile files from recurring, the pre-established partition ranges of the new source data must be determined and the HFile files generated accordingly. Taking the 5 new partition ranges shown in Fig. 3 and the 3 existing partition ranges in the HBase table as an example, 5 HFile files are generated for the 5 new partition ranges, and each HFile file whose new partition range falls within an existing partition range is imported into that existing partition. That is, the HFile file generated for new partition 1 is imported into the first existing partition, the HFile files generated for new partitions 2 and 3 are imported into the second existing partition, and the HFile files generated for new partitions 4 and 5 are imported into the third existing partition.
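The mapping from new partition ranges to existing partitions can be sketched as a lookup on each new range's start value. This Python fragment is illustrative only and does not call the BulkLoad tool; `target_existing_partition` is a hypothetical name, `None` again denotes the unbounded start, and the boundary strings are invented stand-ins for X1, X2, Y1 and Y2.

```python
from bisect import bisect_right


def target_existing_partition(new_range_start, existing_ends):
    """Return the 0-based index of the existing partition that contains a new
    partition range, identified by the range's start value."""
    if new_range_start is None:  # unbounded start: always the first partition
        return 0
    return bisect_right(existing_ends, new_range_start)


existing_ends = ["b1", "d1"]                 # X1, X2
new_starts = [None, "b1", "c1", "d1", "e1"]  # starts of new partitions 1 to 5
targets = [target_existing_partition(s, existing_ends) for s in new_starts]
print(targets)  # [0, 1, 1, 2, 2]
```

Because every existing boundary is also a boundary of the new ranges, each new range lies wholly inside one existing partition, so no HFile file needs to be split at import time.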
Through the above steps, the partition ranges of the source data to be loaded can be divided effectively, which solves the data skew and partition-spanning HFile problems that slow down batch loading. With the batch data loading method of the invention, a file of hundreds of gigabytes can be loaded in only half an hour. The generation speed of HFile files and the speed of loading them are greatly increased, so that HBase batch loading efficiency is greatly improved.
According to another aspect of the invention, a batch data loading device 400 based on an HBase database is provided. The device includes:
A partition range determining unit 401, configured to extract and sort the row keys of the source data to be loaded, partition the sorted row keys evenly according to a specified number of partitions, and determine the row key corresponding to the end value of each partition range;
A pre-established partition range end value determining unit 402, configured to add a predetermined length to the row key corresponding to the end value of each partition range and use the result as the end value of each pre-established partition range;
An HBase table judging unit 403, configured to judge whether an HBase table exists in the HBase database;
A partition establishing unit 404, configured to create an HBase table in the HBase database when the HBase table judging unit 403 determines that no HBase table exists, and to establish partitions in the HBase table according to the end values of the pre-established partition ranges;
An HFile file generating unit 406, configured to generate the corresponding HFile files for the source data to be loaded, in parallel, according to the partitions established in the HBase table;
An HFile file importing unit 407, configured to import the generated HFile files into the HBase table in batches.
Wherein, this device also includes:
New partition scope determines unit 405, for when HBase table judging unit 403 judges there is HBase table in HBase data base, extracting the end value of each existing subregion scope from existing HBase table;End value according to each end value building subregion scope in advance and existing subregion scope is ranked up processing, and obtains each new partition scope;
HFile file generating unit 406, is used for the source data treating warehouse-in according to each new partition scope, the HFile file that parallel generation is corresponding;
HFile file imports unit 407, imports in HBase table for the HFile files in batch that will generate.
Wherein, the new partition range determination unit is specifically configured to sort the end values of the pre-built partition ranges together with the end values of the existing partition ranges;
Every two adjacent end values after sorting determine one new partition range.
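The merge-and-sort step for an existing table can be sketched as follows. This is an illustrative Python sketch, not the patented implementation: the function name is hypothetical, end values are modeled as plain strings, and the open-ended first and last ranges of a real HBase table are not handled.

```python
from typing import List, Tuple


def new_partition_ranges(prebuilt_ends: List[str],
                         existing_ends: List[str]) -> List[Tuple[str, str]]:
    """Merge both sets of end values, sort them, and pair adjacent values."""
    boundaries = sorted(set(prebuilt_ends) | set(existing_ends))
    # Each pair of adjacent boundaries delimits one new partition range.
    return list(zip(boundaries, boundaries[1:]))


if __name__ == "__main__":
    ranges = new_partition_ranges(["r02~~", "r06~~"], ["r04", "r08"])
    for start, end in ranges:
        print(start, "->", end)
```

Because every existing end value is itself a boundary in the merged list, no new partition range straddles an existing partition boundary.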
Wherein, the HFile import unit is configured to, according to the end value of each existing partition range in the existing HBase table, import the HFile files generated for the new partition ranges that fall within an existing partition range into that existing partition.
It should be noted that the device for batch loading of data based on an HBase database corresponds to the method for batch loading of data based on an HBase database described above. For the specific working process of the device, refer to the description of the method; it is not repeated here.
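The routing performed by the HFile import unit, in which each new partition range is matched to the existing partition that contains it, can be sketched in Python as below. The function name and data shapes are hypothetical, and the sketch assumes every new range ends at or before the largest existing end value; a real HBase table's last region is open-ended, which this sketch does not model.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Range = Tuple[str, str]


def group_by_existing_partition(new_ranges: List[Range],
                                existing_ends: List[str]) -> Dict[str, List[Range]]:
    """Map each existing partition end value to the new ranges lying inside it."""
    ends = sorted(existing_ends)
    groups: Dict[str, List[Range]] = {end: [] for end in ends}
    for start, end in new_ranges:
        # The first existing end value that is >= this range's end owns the range.
        i = bisect_left(ends, end)
        if i == len(ends):
            raise ValueError(f"range ending at {end!r} lies beyond all existing partitions")
        groups[ends[i]].append((start, end))
    return groups


if __name__ == "__main__":
    new_ranges = [("r02~~", "r04"), ("r04", "r06~~"), ("r06~~", "r08")]
    print(group_by_existing_partition(new_ranges, ["r04", "r08"]))
```

The HFile files generated for the ranges in each group would then be imported into the corresponding existing partition.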
In summary, the method and device for batch loading of data based on an HBase database of the present invention pre-build partitions by sampling the source data to be loaded and intersect the pre-built partition ranges with the existing partition ranges, so that the partition ranges of the data are divided effectively. This greatly improves the efficiency of batch loading into the HBase distributed database and shortens the time required for batch loading.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (10)

1. A method for batch loading of data based on an HBase database, characterized in that the method comprises:
extracting row keys from the source data to be loaded, sorting them, dividing the sorted row keys evenly into a specified number of partitions, and determining the row key corresponding to the end value of each partition range;
adding a predetermined length to the row key corresponding to the end value of each said partition range, and using the result as the end value of each pre-built partition range;
judging whether an HBase table exists in said HBase database;
if not, creating the HBase table in the HBase database, and establishing partitions in said HBase table according to the end values of the pre-built partition ranges;
generating, in parallel, the corresponding HFile files for said source data to be loaded according to the partitions established in said HBase table;
importing said generated HFile files into said HBase table in batches.
2. The method as claimed in claim 1, characterized in that the method further comprises:
if so, extracting the end value of each existing partition range from said existing HBase table;
sorting the end values of said pre-built partition ranges together with the end values of said existing partition ranges to obtain the new partition ranges;
generating, in parallel, the corresponding HFile files for said source data to be loaded according to said new partition ranges;
importing said generated HFile files into said existing HBase table in batches.
3. The method as claimed in claim 2, characterized in that said sorting of the end values of said pre-built partition ranges together with the end values of said existing partition ranges to obtain the new partition ranges comprises:
sorting the end values of said pre-built partition ranges and the end values of said existing partition ranges;
determining one said new partition range from every two adjacent end values after sorting.
4. The method as claimed in claim 3, characterized in that said importing of said generated HFile files into said HBase table in batches comprises:
according to the end value of each existing partition range in said existing HBase table, importing the HFile files generated for the new partition ranges that fall within an existing partition range into that existing partition.
5. The method as claimed in claim 2, characterized in that said generating, in parallel, of the corresponding HFile files for said source data to be loaded according to said new partition ranges comprises:
generating a partition file for each of said new partition ranges, modifying the source code of the ImportTsv tool so that said generated partition files are passed to the TotalOrderPartitioner class, and then generating the corresponding HFile files in parallel by means of the modified ImportTsv tool.
6. The method as claimed in claim 1 or 2, characterized in that said extracting of row keys from the source data to be loaded and sorting them comprises:
extracting all row keys from the source data to be loaded and sorting them;
or, extracting a portion of the row keys from the source data to be loaded and sorting them.
7. A device for batch loading of data based on an HBase database, characterized in that the device comprises:
a partition range determination unit, configured to extract row keys from the source data to be loaded, sort them, divide the sorted row keys evenly into a specified number of partitions, and determine the row key corresponding to the end value of each partition range;
a pre-built partition range end value determination unit, configured to add a predetermined length to the row key corresponding to the end value of each said partition range and to use the result as the end value of each pre-built partition range;
an HBase table judging unit, configured to judge whether an HBase table exists in said HBase database;
a partition establishing unit, configured to, when said HBase table judging unit judges that the HBase table does not exist in said HBase database, create the HBase table in the HBase database and establish partitions in said HBase table according to the end values of the pre-built partition ranges;
an HFile generation unit, configured to generate, in parallel, the corresponding HFile files for the source data to be loaded according to the partitions established in said HBase table;
an HFile import unit, configured to import said generated HFile files into said HBase table in batches.
8. The device as claimed in claim 7, characterized in that the device further comprises:
a new partition range determination unit, configured to, when said HBase table judging unit judges that the HBase table exists in said HBase database, extract the end value of each existing partition range from said existing HBase table, and to sort the end values of said pre-built partition ranges together with the end values of said existing partition ranges to obtain the new partition ranges;
said HFile generation unit, configured to generate, in parallel, the corresponding HFile files for said source data to be loaded according to said new partition ranges;
said HFile import unit, configured to import said generated HFile files into said HBase table in batches.
9. The device as claimed in claim 8, characterized in that said new partition range determination unit is specifically configured to sort the end values of said pre-built partition ranges and the end values of said existing partition ranges;
every two adjacent end values after sorting determine one said new partition range.
10. The device as claimed in claim 9, characterized in that
said HFile import unit is configured to, according to the end value of each existing partition range in said existing HBase table, import the HFile files generated for the new partition ranges that fall within an existing partition range into that existing partition.
CN201410848940.6A 2014-12-29 2014-12-29 A kind of method and apparatus of the batch data storage based on HBase database Active CN105808577B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410848940.6A CN105808577B (en) 2014-12-29 2014-12-29 A kind of method and apparatus of the batch data storage based on HBase database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410848940.6A CN105808577B (en) 2014-12-29 2014-12-29 A kind of method and apparatus of the batch data storage based on HBase database

Publications (2)

Publication Number Publication Date
CN105808577A (en) 2016-07-27
CN105808577B (en) 2019-08-20

Family

ID=56420579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410848940.6A Active CN105808577B (en) 2014-12-29 2014-12-29 A kind of method and apparatus of the batch data storage based on HBase database

Country Status (1)

Country Link
CN (1) CN105808577B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130282668A1 (en) * 2012-04-20 2013-10-24 Cloudera, Inc. Automatic repair of corrupt hbases
CN103617211A (en) * 2013-11-20 2014-03-05 浪潮电子信息产业股份有限公司 HBase loaded data importing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
佚名: "MapReduce生成HFile文件,再使用BulkLoad导入HBase中(完全分布式运行)", 《HTTP://WWW.ABOUTYUN.COM/THREAD-10665-1-1.HTML》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363766A (en) * 2018-02-06 2018-08-03 福建星瑞格软件有限公司 A kind of method and computer equipment of uniform cutting database table data
CN109284335A (en) * 2018-09-10 2019-01-29 郑州云海信息技术有限公司 A kind of method and apparatus of integration across database batch conduct data
CN112445759A (en) * 2020-11-30 2021-03-05 中国人寿保险股份有限公司 Method and device for cluster data replication across distributed databases and electronic equipment
CN112445759B (en) * 2020-11-30 2024-04-16 中国人寿保险股份有限公司 Method and device for copying data across clusters of distributed database and electronic equipment

Also Published As

Publication number Publication date
CN105808577B (en) 2019-08-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 818, 8 / F, 34 Haidian Street, Haidian District, Beijing 100080

Patentee after: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building 6 storey block A Room 601

Patentee before: BEIJING ULTRAPOWER SOFTWARE Co.,Ltd.