CN108319714A - A kind of row storage compacting method based on HBase - Google Patents
A kind of row storage compacting method based on HBase
- Publication number: CN108319714A (application CN201810130781.4A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/1744 (Physics; Computing; Electric digital data processing; Information retrieval and file system structures; File systems; Redundancy elimination performed by the file system using compression, e.g. sparse files)
- H03M7/3084 (Electricity; Electronic circuitry; Coding, decoding, code conversion in general; Compression; expansion; suppression of unnecessary data using adaptive string matching, e.g. the Lempel-Ziv method)
Abstract
The invention discloses a column-store compression method based on HBase, comprising: reading each column of data from HBase, re-sorting each column's data, and storing it into sectors; computing the statistics of randomly sampled sectors to obtain the similarity factor S between sectors, where S is a defined quantity for judging inter-sector similarity, obtained from the absolute differences of the feature components of the statistic T of two sectors, and judging whether the column's distribution is uniform or discrete; if the distribution is uniform, applying the hybrid column-based compression mode; if it is discrete, applying the hybrid sector-based compression mode. The compression method provided by the invention greatly reduces computation cost while improving compression efficiency.
Description
Technical field
The present invention relates to the field of big data technology, and in particular to a column-store compression method based on HBase.
Background technology
Data compression has always been a focus of attention in the data field, and many compression methods exist. Lightweight compression modes include run-length encoding, dictionary encoding, and null suppression, while heavyweight compression encodings include GZIP, the Lempel-Ziv family, Huffman coding, and arithmetic coding. The difference between lightweight and heavyweight compression algorithms is that lightweight algorithms operate on contiguous values, whereas heavyweight algorithms break the boundaries between values and operate on them as a sequence of bytes. A common classification of lightweight and heavyweight compression algorithms is shown in Figure 1.
Research on compression strategies for column stores began with the work on C-Store. J. Abadi et al. proposed a decision-tree-based column compression model that determines the optimal compression algorithm for each column by building a compression-algorithm decision tree; however, its compression granularity is too coarse, ignoring the local distribution characteristics of the data and the influence of whole-column data differences on compression.
Wang Zhenxi et al. proposed a sector-level compression strategy that partitions the data into sectors and selects compression algorithms according to the correlations and differences between partitions. This method can apply different compression algorithms to different sectors according to their characteristics and guarantees the compression ratio, but for partitions with little difference in similarity it still performs excessive, redundant computation.
Idreos et al. proposed a dynamic compression-algorithm selection strategy based on Bayesian classification, which uses Bayes' formula to select different compression algorithms for different data blocks, so that the data to be compressed approach the best possible compression effect. However, the accuracy of this method depends largely on the training samples, and no evaluation layer is established to assess the quality of the compression algorithms from feedback results.
Wang Haiyan et al. proposed a compression-strategy selection method based on hot/cold data classification: HBase data are first divided into hot and cold data according to file access frequency, the Bayesian-classification compression method is improved, an evaluation layer is added on the basis of earlier work, and a new compression classification is proposed by combining the advantages of the sector-level compression strategy. However, its classification algorithm does not support parallel processing, and it makes no breakthrough in compression granularity, so the drawbacks of sector-level computation remain.
Today, research on compression-strategy selection tailored to the characteristics of column-store databases has achieved many results, but the data are not preprocessed before compression: the data distribution differs greatly between sections and the degree of dispersion is high, making the data ill-suited to compression. In the choice of compression granularity, fine-grained compression strategies are preferred, yet a fine-grained strategy must collect statistics for every sector, which is computationally expensive and prolongs compression time; moreover, research has not paid enough attention to the compression algorithms themselves. As for the classification algorithms used for strategy selection, previous work commonly used decision trees and naive Bayes. Decision trees are well interpretable but easily overfit when faced with real-world complex unstructured data and assorted noisy data. Naive Bayes classification has a solid mathematical foundation and the algorithm is simple and easy to implement, but its classification performance is affected by the prior and is often unsatisfactory, so the accuracy and compression efficiency of the strategy cannot be guaranteed. In addition, the classification algorithms in existing applications do not support parallel processing, cannot fully exploit the computing power of a cluster, and leave the load unbalanced.
Summary of the invention
Therefore, in view of the problems that existing column-store compression strategies encounter in compression, namely the large dispersion of the data, the high computation cost when the classification granularity is small, and the compression efficiency that cannot be guaranteed owing to defects of the accompanying classification algorithms, this method proposes a sort-based hybrid column/sector compression strategy.
The present invention provides a column-store compression method based on HBase, comprising the following steps:
reading each column of data from HBase, re-sorting each column's data, and storing it into sectors;
computing the statistics of randomly sampled sectors to obtain the similarity factor S between sectors, and judging whether the column's distribution is uniform or discrete;
if the distribution is uniform, applying the hybrid column-based compression mode; if it is discrete, applying the hybrid sector-based compression mode.
Optionally, the similarity factor S is a defined quantity for judging inter-sector similarity, obtained from the absolute differences of the feature components of the statistic T of two sectors.
Optionally, when re-sorting each column's data, each column's information is stored in the StoreFile that each region composes from HFiles.
Optionally, the columns are first split into separate tables, the column values are sorted, and the composite row key of the new table generated for each column after sorting is designed following the format convention <columnID>_<rowID>_<row-key>.
Optionally, the compression algorithms of the hybrid column-based compression mode include run-length encoding, bit-vector encoding, WAH encoding, prefix encoding, delta encoding, and improved LZO.
Optionally, the sort-based hybrid column/sector compression strategy comprises the following steps:
Step 1: read each column of data from HBase.
Step 2: sort each column's data and store each column in the specified format.
Step 3: randomly select 10 sectors in the column and compute their feature statistics Ti = {q2, q3, q4, q5, q6, q7}, i ∈ [1, 10].
Step 4: judge each column's data-distribution characteristics and, according to them, assign the data to the hybrid sector-based compression strategy (Hybrid Sector-Based Compression) or the hybrid column-based compression strategy (Hybrid Column-Based Compression) for storage.
Step 5: compress each column's data according to its assigned compression strategy.
Step 6: store the compressed data into HDFS.
Optionally, the hybrid sector-based compression strategy comprises:
Step 1: set i = 1;
Step 2: compute Ti = {q1, q2, q3, q4, q5, q6};
Step 3: if i = 1, go to Step 5; otherwise go to Step 4;
Step 4: compute the similarity with the previous block via the similarity factor S; if the similarity is high, set mi = mi-1 (reuse the previous block's strategy) and go to Step 6; otherwise recompute Ti = {q1, q2, q3, q4, q5, q6} and go to Step 5;
Step 5: apply the XGBoost-based strategy selection method to the data block;
Step 6: if block i is not the last block, set i = i + 1 and jump to Step 3;
Step 7: return the compression strategy vector Ms.
Optionally, the hybrid column-based compression strategy comprises:
Input: column data to be compressed
Output: compression strategy m
Step 1: compute the feature statistic Tc = {q1, q2, q5, q6, q7};
Step 2: judge the cardinality q1: if it is below the threshold, set m = WAH encoding and go to Step 6; if above, go to Step 3;
Step 3: judge the data type t: if numeric, set m = delta encoding and go to Step 6; if text, go to Step 4;
Step 4: judge the data skew: if the data show obvious skew, set m = prefix encoding and go to Step 6; if there is no obvious skew, go to Step 5;
Step 5: according to the usage frequency, choose between improved LZO and no compression: if the usage frequency is high, set m = no compression and go to the final step; if it is low, set m = improved LZO and go to Step 6;
Step 6: return the compression strategy m.
The technical solution of the present invention has the following advantages:
1. The present invention first designs a method that sorts each column's data according to the characteristics of HBase to strengthen data compactness: each column is split out and a structure is designed for each split column, so that after sorting the data can be stored in order into each region while hot-spot problems are avoided; by making the data tightly arranged, the differences in local data distribution are reduced to the greatest extent.
2. A hybrid column/sector compression strategy is proposed, in which the hybrid sector-based and hybrid column-based compression strategies are used according to the data characteristics to recommend compression strategies. Columns whose sectors have highly similar distribution characteristics are compressed with the hybrid column-based mode, while columns with low characteristic similarity are compressed with the hybrid sector-based mode; combining the two granularities reduces computation cost.
3. In the strategy design, suitable compression algorithms are selected for different data characteristics, and the XGBoost algorithm is for the first time introduced into a compression strategy as the classification algorithm; its outstanding generalization and its support for parallel computation compensate for the deficiencies of previous classification algorithms.
Description of the drawings
To explain the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative labor.
Fig. 1 is a diagram of prior-art lightweight and heavyweight compression algorithms;
Fig. 2 is a flow chart of the sort-based hybrid column/sector compression strategy of the present invention;
Fig. 3 compares the compression ratio of the present invention with that of prior-art compression strategies;
Fig. 4 compares the compression effectiveness of the present invention with that of prior-art compression strategies;
Fig. 5 compares the compression time of the present invention with that of prior-art compression strategies;
Fig. 6 compares the decompression time of the present invention with that of prior-art compression strategies.
Specific implementation modes
The technical solution of the present invention is described clearly and completely below with reference to the drawings. Obviously, the described embodiments are a part, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative labor fall within the protection scope of the present invention. In addition, the technical features involved in the different embodiments of the invention described below can be combined with each other as long as they do not conflict.
Embodiment 1
This embodiment provides a column-store compression method based on HBase, a sort-based hybrid column/sector compression strategy (a Hybrid Compression Strategy of Column-Based Compression and Sector-Based Compression), as shown in Fig. 2. Each column of data is first read in from HBase, each column's data is sorted, and the sorted column data is stored. The similarity factor S between sectors is calculated by computing the statistics of randomly sampled sectors; S is a defined quantity for judging inter-sector similarity, obtained from the absolute differences of the feature components of the statistic T of two sectors, where the required feature components are supplied by the compression strategy of the first sector. If the sampled sectors of a column have a high degree of characteristic similarity, the column's distribution is judged to be uniform and the column's data use the hybrid column-based compression strategy; if the similarity is low, the column's distribution is judged to be discrete and the hybrid sector-based compression strategy applies. The data are then compressed with the compression algorithms determined from the data characteristics by the applicable strategy, and the compressed data are stored into HDFS.
Specifically, the data characteristics are a group of information describing the selected data, comprising the cardinality q, the total number of identical values a, the data type t, the degree of skew d of the data, the total number of key-value pairs v, the average run length of identical values c, and the average length of values l. The cardinality specifically refers to the degree of dispersion of the data in a column, i.e., the number of distinct attribute values in this method. The data access frequency is f = C/t, where C is the number of file accesses and t is the corresponding time period. The data statistic T is a group of data obtained from the data characteristics; it serves as the input of the compression strategy, is calculated from the data characteristics, and has 7 feature components in total: q1, the degree of dispersion of the data; q2, the degree of skew of the data; q3, the percentage of identical values a*100/v; q4, the average run length of identical values c; q5, the data type t; q6, the average length of values l; q7, the data usage frequency f.
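As an illustration, the statistic T described above can be sketched in Python. The helper name `statistic_T` is ours, not the patent's, and since the text does not give an exact formula for the skew component q2, the sketch substitutes a simple stand-in (the share of the most frequent value):

```python
from collections import Counter

def statistic_T(column, access_count=0, period=1.0):
    """Compute the 7 feature components q1..q7 described above for one
    column of values. Names follow the text; the skew measure q2 is an
    assumed stand-in, since the patent does not give its formula."""
    v = len(column)                               # key-value pair total v
    counts = Counter(column)
    q1 = len(counts)                              # cardinality: distinct values
    q2 = counts.most_common(1)[0][1] / v          # crude skew proxy (assumed)
    a = sum(c for c in counts.values() if c > 1)  # total of repeated values
    q3 = a * 100 / v                              # identical-value percentage
    runs = 1                                      # q4: average run length
    for prev, cur in zip(column, column[1:]):
        if cur != prev:
            runs += 1
    q4 = v / runs
    q5 = 1 if all(isinstance(x, (int, float)) for x in column) else 0  # type t
    q6 = sum(len(str(x)) for x in column) / v     # average value length l
    q7 = access_count / period                    # usage frequency f = C/t
    return [q1, q2, q3, q4, q5, q6, q7]
```

For a sorted, low-cardinality column, q4 grows large, which is exactly the regime in which the text later recommends run-length encoding.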
The sort-based hybrid column/sector compression strategy is as follows:
Input: data W to be compressed
Output: whether compression succeeded (0: failure, 1: success)
Step 1: read each column of data from HBase.
Step 2: sort each column's data and store each column in the specified format.
Step 3: randomly select 10 sectors in the column and compute their feature statistics Ti = {q2, q3, q4, q5, q6, q7}, i ∈ [1, 10].
Step 4: judge each column's data-distribution characteristics and, according to them, assign the data to the hybrid sector-based compression strategy (Hybrid Sector-Based Compression) or the hybrid column-based compression strategy (Hybrid Column-Based Compression) for storage.
Step 5: compress each column's data according to its assigned compression strategy.
Step 6: store the compressed data into HDFS.
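The uniform/discrete judgment of Steps 3 and 4 can be sketched as follows. The sector size, sample count, similarity threshold, and the reduced two-component statistic are all illustrative assumptions, since the patent leaves these parameters open:

```python
import random

def column_distribution(column, sector_size=100, n_samples=10, threshold=5.0):
    """Sample up to n_samples sectors of a column, compute a reduced
    statistic per sector, and judge the column 'uniform' (hybrid
    column-based compression) when every pairwise similarity factor S
    stays under an assumed threshold, else 'discrete' (hybrid
    sector-based compression)."""
    sectors = [column[i:i + sector_size]
               for i in range(0, len(column), sector_size)]
    sample = random.sample(sectors, min(n_samples, len(sectors)))

    def stat(sector):                       # reduced statistic T (assumed)
        return (len(set(sector)),           # distinct values
                sum(len(str(x)) for x in sector) / len(sector))

    stats = [stat(s) for s in sample]
    # similarity factor S: sum of absolute differences of the components
    for t1 in stats:
        for t2 in stats:
            s = sum(abs(a - b) for a, b in zip(t1, t2))
            if s > threshold:
                return "discrete"           # -> hybrid sector-based
    return "uniform"                        # -> hybrid column-based
```

A column of identical values yields identical sector statistics (S = 0 everywhere) and is judged uniform, matching the intent of the text.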
The hybrid sector-based compression strategy (Hybrid Sector-Based Compression Strategy) comprises:
Input: column data to be compressed
Output: compression strategy vector Ms
Step 1: set i = 1.
Step 2: compute Ti = {q1, q2, q3, q4, q5, q6}.
Step 3: if i = 1, go to Step 5; otherwise go to Step 4.
Step 4: compute the similarity with the previous block via the similarity factor S; if the similarity is high, set mi = mi-1 (reuse the previous block's strategy) and go to Step 6; otherwise recompute Ti = {q1, q2, q3, q4, q5, q6} and go to Step 5.
Step 5: apply the XGBoost-based strategy selection method to the data block.
Step 6: if block i is not the last block, set i = i + 1 and jump to Step 3.
Step 7: return the compression strategy vector Ms.
For a column to which the hybrid sector-based compression strategy applies, the statistic Ti = {q1, q2, q3, q4, q5, q6} of the first block is computed first (i = 1). The similarity of two sectors is judged by calculating the similarity factor S: if the two sectors are similar, the compression algorithm of the previous sector is applied; otherwise the statistic Ti = {q1, q2, q3, q4, q5, q6} is computed again and the corresponding compression algorithm for the block is obtained with the XGBoost strategy. Each block's compression strategy, whether obtained from the XGBoost strategy or learned from the adjacent sector, is stored in the compression strategy vector Ms.
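A minimal sketch of this sector loop follows, with the XGBoost selection replaced by a placeholder rule, an assumed similarity threshold, and a reduced two-component statistic; all three are illustrative assumptions, not the patent's actual model:

```python
def hybrid_sector_strategy(blocks, s_threshold=1.0):
    """Walk the blocks of one column: reuse the previous block's strategy
    when the similarity factor S is small, otherwise query a (placeholder)
    selector. Returns the compression strategy vector Ms."""
    def stat(block):                       # reduced statistic T (assumed)
        return (len(set(block)),
                sum(len(str(x)) for x in block) / len(block))

    def xgboost_select(t):                 # placeholder for the real model
        return "WAH" if t[0] < 10 else "improved-LZO"

    Ms, prev_t = [], None
    for block in blocks:
        t = stat(block)
        if prev_t is not None:
            S = sum(abs(a - b) for a, b in zip(t, prev_t))
            if S <= s_threshold:           # similar: reuse previous strategy
                Ms.append(Ms[-1])
                prev_t = t
                continue
        Ms.append(xgboost_select(t))       # first block, or dissimilar
        prev_t = t
    return Ms
```

The reuse branch is what saves computation relative to a pure sector-level strategy: similar adjacent sectors never reach the classifier.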
The hybrid column-based compression strategy (Hybrid Column-Based Compression Strategy) comprises:
Input: column data to be compressed
Output: compression strategy m
Step 1: compute the feature statistic Tc = {q1, q2, q5, q6, q7}.
Step 2: judge the cardinality q1: if it is below the threshold, set m = WAH encoding and go to Step 6; if above, go to Step 3.
Step 3: judge the data type t: if numeric, set m = delta encoding and go to Step 6; if text, go to Step 4.
Step 4: judge the data skew: if the data show obvious skew, set m = prefix encoding and go to Step 6; if there is no obvious skew, go to Step 5.
Step 5: according to the usage frequency, choose between improved LZO and no compression: if the usage frequency is high, set m = no compression and go to the final step; if it is low, set m = improved LZO and go to Step 6.
Step 6: return the compression strategy m.
Lexicographic order is first chosen as the sort order, and the sorted data are judged by cardinality. Cardinality is taken as the primary classification criterion because the WAH algorithm performs best among the four algorithms when trading off compression ratio, compression/decompression time, and query efficiency, so the data should be made to fit the WAH algorithm as much as possible. Delta encoding, by contrast, is adapted to a narrower range of data forms and makes stronger demands on the data, so it is placed in the next judgment. The final step classifies the remaining data by their distribution characteristics: data with obvious skew use prefix compression, while data without obvious skew are further distinguished by usage frequency to choose between the improved LZO algorithm and no compression.
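The decision chain above can be sketched as a short function. The threshold values and parameter names are illustrative assumptions, as the patent does not fix them:

```python
def hybrid_column_strategy(q1, t, skewed, freq,
                           card_threshold=32, freq_threshold=10.0):
    """Sketch of the hybrid column-based decision chain (Steps 1-6):
    cardinality first, then data type, then skew, then usage frequency.
    Both thresholds are assumed values for illustration."""
    if q1 < card_threshold:          # Step 2: low cardinality -> WAH
        return "WAH"
    if t == "numeric":               # Step 3: numeric -> delta encoding
        return "delta"
    if skewed:                       # Step 4: clearly skewed text -> prefix
        return "prefix"
    if freq > freq_threshold:        # Step 5: hot data, skip compression
        return "none"
    return "improved-LZO"            # cold, unskewed, high-cardinality text
```

The ordering mirrors the rationale in the text: WAH is tried first because it wins the overall trade-off, and only the residue falls through to improved LZO or no compression.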
In this embodiment, the files HBase supports for compression are SequenceFile and HFile. The WAL (Write-Ahead Log) is the main SequenceFile in HBase, a kind of write-ahead log: data are first written to the WAL, then stored in the MemStore; when the store is full, it is flushed to HDFS as a new HFile. In HBase, each column's information is stored in the StoreFile that each region composes from HFiles, and the contents of the original table cannot be stored in sorted order of content directly, because the regions are partitioned by sorted row key: each region corresponds to the contents of a certain row-key range, i.e., rows with adjacent row keys are stored in the same region. Whichever kind of file is to be compressed, the key-value pairs of adjacent row keys are always arranged together. Therefore, to sort the contents of a table, the columns must first be split into separate tables, the column values sorted, and the composite row key of the new table generated for each column designed following the format convention <columnID>_<rowID>_<row-key>. Here <columnID> serves as the identifier of the original column position and, acting as a salting prefix, distributes the data to different region servers, avoiding hot-spot problems; <rowID> keeps the sorted values in a column arranged together, preventing them from being scattered into different regions, which would stop the sorted data from being stored contiguously; and <row-key> serves as the identifier of the original record, associating the columns with one another. The values of the <columnID> and <rowID> fields should be of fixed length.
For example, consider a sample table of 5 columns and 100 rows with column family Cf; the specific table format is shown in Table 1.
Table 1: sample table
The data are first sorted top to bottom, and the sorted data are stored in a new table whose composite row key is designed with the <columnID>_<rowID>_<row-key> format; the column family is still set to Cf. The table built for the sorted ItemID column is shown here as an example; its specific format is shown in Table 2.
Table 2: ItemID table structure
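A sketch of the composite row-key construction follows. The field widths are illustrative; the patent only requires that <columnID> and <rowID> be fixed length, which is what makes lexicographic order agree with numeric order:

```python
def composite_row_key(column_id, row_id, row_key,
                      col_width=4, row_width=8):
    """Build the <columnID>_<rowID>_<row-key> composite row key described
    above. Zero-padding the two numeric fields to assumed fixed widths
    keeps HBase's lexicographic row-key order consistent with the sort
    order of the column values."""
    return "{:0{cw}d}_{:0{rw}d}_{}".format(
        column_id, row_id, row_key, cw=col_width, rw=row_width)
```

Without the fixed width, "10" would sort before "2" lexicographically and the sorted values would no longer be stored contiguously.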
The research of Abadi, D. and Ferreira shows that in column-store application scenarios, lightweight compression algorithms not only have low CPU cost but can also support operating directly on compressed data, improving query efficiency; the selection principle for compression algorithms is therefore to favor lightweight algorithms while taking heavyweight algorithms into account.
Common lightweight compression methods include run-length encoding, bit-vector encoding, null suppression, simple dictionary encoding, and delta encoding. Since HBase does not store null values, null-suppression compression is excluded from the candidates. In row stores, run-length encoding is only used to compress runs of spaces and letters, but in column stores its range of application is very wide: the data attributes of a column are similar, and after sorting they exhibit long runs of identical values that suit it well, so choosing run-length encoding for low-cardinality columns is a very appropriate choice. As for data requirements, the bit-vector algorithm demands little of the data type, while run-length encoding demands more of the data type but compresses repetitive, regularly ordered data very well; the two complement each other and can combine to great effect. The WAH (Word-Aligned Hybrid) algorithm combines the two well and outperforms uncompressed bitmap vectors in query efficiency, so it is added to the compression algorithm set. Real data often contain unstructured values such as URLs and home addresses, for which the prefix encoding (Trie encoding) variant of simple dictionary encoding achieves good results. And for dates, times, and other data types with small gaps between values, delta encoding (Delta Encoding) is a very suitable lightweight compression method.
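Two of these lightweight encoders are simple enough to sketch directly: run-length encoding pairs each value with its run length (effective on a sorted, low-cardinality column), and delta encoding stores a first value plus successive differences (effective on dates and other small-gap types):

```python
def run_length_encode(column):
    """Run-length encoding: collapse a (sorted) column into
    (value, run length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def delta_encode(values):
    """Delta encoding for slowly varying numeric columns (dates, times):
    keep the first value, then store successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]
```

Both operate on contiguous values without crossing value boundaries, which is exactly the lightweight property the background section describes.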
As for heavyweight algorithms, the literature has tested the performance of GZIP, LZO, and Snappy on HBase: LZO's compression ratio is middling, but its compression/decompression is clearly much faster. As for Huffman coding and arithmetic coding of the entropy-coding family, the literature points out that both perform poorly in column-store databases: they do not support operating on compressed data, and their compression/decompression speed has no advantage over the above three methods. This method therefore adds an improved LZO method proposed in the literature to the compression algorithm set; compared with the original LZO algorithm it can save up to twice the storage space with 10% of the memory usage, and it decompresses faster than Snappy. Improved LZO encoding is recommended for data with uneven distribution and no obvious skew.
Based on the above, run-length encoding, bit-vector encoding, WAH encoding, prefix encoding, delta encoding, and improved LZO were finally chosen as the compression algorithms.
In column stores, the purpose of using the sector as the compression granularity is mainly to solve the mismatch between the local data distribution and the whole-column data distribution. For the problem of differing data-distribution characteristics pointed out in the literature, sorting the data before compression makes the data arrangement more compact; in particular, for columns with low cardinality, a high percentage of identical values, and a large average run length of identical values, the differences between sectors become very small and the column can be used directly as the compression granularity. This method therefore selects a mixed column/sector form as the compression granularity.
In summary, to reduce computation cost as much as possible, both the compression granularity and the compression algorithms must be addressed. The compression granularity must mix column-level and sector-level compression rather than rely on sector-level compression alone, and sorting is first applied to keep the data tightly ordered, so as to avoid the cases where an overly coarse granularity makes the compression algorithm locally unsuitable. In the choice of compression algorithms, lightweight compression should be favored; and to ensure high compression efficiency, different compression algorithms must be designed for different data characteristics, and the defects of the classification algorithms in previous classification strategies must be remedied to reach optimal compression efficiency.
Compression algorithm selection based on XGBoost
XGBoost (eXtreme Gradient Boosting) is an improved gradient-boosted decision tree (Gradient Boosted Decision Trees, GBDT) proposed by Tianqi Chen et al. It has fast processing speed and excellent classification performance, and retains the advantages of GBDT trees: flexibility in the data types handled, robustness to outliers, and strong generalization ability. It also enumerates candidate split points in parallel when selecting the best split, overcoming GBDT's inability to train in parallel, and its design performs thorough cache-line optimization, accelerating training. It supports column sampling, applied when building tree attributes, which makes training fast and effective, and it uses regularization to prevent overfitting, further enhancing generalization.
Therefore, a compression algorithm set M = {run-length encoding, bit-vector encoding, WAH encoding, prefix encoding, delta encoding, improved LZO, no compression} is constructed, and XGBoost is used to partition data over this compression set.
When performing compression classification with XGBoost, each iteration requires m trees, where m is the number of classes and each tree predicts one class; in this example m = 7. Each tree can be regarded as a function f: with the statistic T as input, f maps the input sample's statistic to f1(T), f2(T), f3(T), ..., fm(T) as the predicted values for T.
The XGBoost workflow centers on the tree-building process and the leaf-node splitting process.
During tree building, the most important element is the objective function:
Obj(φ) = L(φ) + Ω(φ)  (1)
where L(φ) is the loss function (cost function), the sum of the losses over the training samples:
L(φ) = Σi l(ŷi, yi)  (2)
Because this method ultimately selects only one compression strategy, the classes are mutually exclusive and the task is a multi-class (multi-class) problem rather than a multi-label (multi-label) problem; the loss function is therefore set to the softmax loss, and the probability of belonging to a class i is
p(i | T) = e^fi(T) / Σj=1..m e^fj(T)  (3)
In the loss function L(φ), the prediction ŷ is calculated as
ŷ = Σk=1..K fk(T)  (4)
where K is the total number of trees and f denotes a specific CART tree; the model is composed of K CART trees,
fk(T) = ωq(T), ω ∈ R^P, q: R^d → {1, 2, 3, ..., P}  (5)
where ω denotes the leaf weights and q the tree structure: q maps an input to a leaf index number, and ω assigns the leaf score of each index number, the value of ω being computed by function optimization so as to minimize the objective function. P is the total number of leaf indices.
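Equation (3) can be checked with a few lines of Python: given the m per-class tree scores, the class probabilities are their softmax, and the compression strategy with the highest probability is the one selected. The helper names here are illustrative:

```python
import math

def softmax(scores):
    """Eq. (3): turn the m per-class tree scores f1(T)..fm(T) into
    mutually exclusive class probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For m = 7 scores, `softmax(scores).index(max(softmax(scores)))` picks the index of the recommended compression algorithm in the set M.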
Ω(φ) is the regularization term, used to penalize the complexity of the decision trees, weighing the decrease of the objective function against model complexity and avoiding overfitting:
Ω(f) = γ·Tθ + (1/2)·λ·Σj ωj²  (6)
where Tθ is the number of leaves and ω the value of each leaf node; the values γ and λ set the complexity cost of introducing new leaf nodes. The larger γ and λ are, the greater the penalty on trees with more leaf nodes and extreme values; the smaller the value of Ω(φ), the simpler the tree structure.
The structure of each tree is determined by repeatedly attempting to add a split to an existing leaf. The formula XGBoost uses for tree-node splitting is
Gain = (1/2)·[GL²/(HL+λ) + GR²/(HR+λ) - (GL+GR)²/(HL+HR+λ)] - γ  (7)
where GL²/(HL+λ) is the left-subtree score, GR²/(HR+λ) the right-subtree score, and (GL+GR)²/(HL+HR+λ) the score obtained without splitting; γ is the complexity cost introduced by adding a new node; GL and GR are the sums of the first derivatives of all nodes in the left and right subtrees, and HL and HR the sums of the second derivatives of all nodes in the left and right subtrees.
The node-splitting operation of XGBoost differs from the splitting process of an ordinary decision tree: an ordinary decision tree does not consider the complexity of the tree when splitting and relies on subsequent pruning, whereas XGBoost already accounts for the tree's complexity through γ at split time, so no separate pruning operation is needed. When the gain brought by a split is less than the threshold γ, splitting stops. The next iteration then begins, and tree building stops when the sum of the sample weights is less than the given threshold.
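Equation (7) translates directly into code; a split is kept only when its gain is positive, i.e., when the improvement outweighs the complexity cost γ of the new leaf:

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    """Eq. (7): XGBoost's split gain from the first-derivative sums
    (GL, GR) and second-derivative sums (HL, HR) of the candidate left
    and right subtrees, with regularization lambda and complexity cost
    gamma."""
    left = GL ** 2 / (HL + lam)                 # left-subtree score
    right = GR ** 2 / (HR + lam)                # right-subtree score
    whole = (GL + GR) ** 2 / (HL + HR + lam)    # score without splitting
    return 0.5 * (left + right - whole) - gamma
```

Raising gamma lowers every candidate gain uniformly, which is how the built-in complexity control replaces a separate pruning pass.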
Example of effect
Test environment:
This experiment uses 9 servers running CentOS 7, one MASTER and 8 SLAVEs, with identical hardware: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 64 GB memory, and a 3 TB disk, running Hadoop 2.7.3 and HBase 1.2.4.
Dataset description:
For column-data compression, previous work mostly used the TPC-H dataset, which is mainly suited to OLAP operations on relational data; its data is not skewed, its data patterns are simple, and it differs considerably from real usage scenarios. This method uses the TPC-DS dataset, a version published by the authoritative standards body TPC (Transaction Processing Performance Council) to remedy the shortcomings of TPC-H with respect to real scenarios; its test data is skewed and more consistent with real data.
The ITEM table was selected as the test dataset because its data types are rich, which is realistic and better reflects the performance of each compression algorithm; this table, which has 22 attributes, is compressed with the selected strategies. In this experiment an ITEM table of 3414 MB was generated with the dsdgen program. The generated per-column files are listed in Table 3.
Experimental results and analysis
The experiments compare each compression strategy on the selected data in terms of compression ratio and compression/decompression time; the results are shown in Fig. 3.
As Fig. 3 shows, the proposed method outperforms the other four strategies in compression effect on each column, reaching a compression ratio of about 20% of the original data. The compression strategy based on hot/cold data comes second with a ratio of 27.8%; the naive Bayes compression strategy achieves 36.5%; the learning-based sector-level compression strategy 38.9%; and the c-store companion compression strategy is the worst at 52.8%. The proposed method compresses better because previous work mostly focused on the sorting algorithm and paid insufficient attention to the choice of compression algorithm, mostly reusing the compression algorithms of the c-store companion strategy. Although fine-grained classification of the data can improve classification precision, if the compression algorithm is not suitable, even high classification precision cannot deliver good results. The algorithm selection of this method fully considers the characteristics of the various kinds of data in a big-data environment and chooses a suitable compression algorithm for data with different characteristics. Moreover, the data is sorted before compression, so the characteristics of each segment are clear and easy to compress, and a very high compression ratio can be reached. The compression effect on each individual column is shown in Fig. 4.
As shown in Fig. 4, the proposed method outperforms the other methods on every column except column 17 (i_informulaiton). That column is judged, from its data characteristics, to belong to hot data, so under this method's strategy it is not compressed; the sector-level compression strategy, however, has finer granularity, and because some text may be locally similar it still applies a compression algorithm there. It can be seen that the other compression algorithms do not perform well on this column: although they achieve some compression, they consume more compression time and complicate subsequent queries. Note that in each of the 22 groups of bars in Fig. 4, the results from left to right are: the c-store companion compression strategy, the learning-based sector-level compression strategy, the naive Bayes compression strategy, the hot/cold-data-based compression classification strategy, and the proposed strategy.
The compression/decompression time results are shown in Figs. 5 and 6, and they demonstrate that the proposed strategy also has a clear advantage in compression/decompression time. This is because the method compresses with mixed granularity, applying a different compression granularity to each column according to its characteristics, so it does not need to collect statistics for every sector of every column as previous work did, saving substantial computation cost. In addition, sorting the data before compression not only benefits the compression ratio: the tighter data distribution after sorting also saves time for lightweight compression algorithms. Taking the WAH algorithm as an example, gathering equal values together increases run lengths and reduces the number of triples, accelerating both compression and decompression.
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the embodiments. For those of ordinary skill in the art, other variations or changes in different forms can be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments here, and obvious variations or changes derived therefrom remain within the protection scope of the invention.
Claims (8)
1. An HBase-based column-store compression method, characterized by comprising the following steps:
reading each column of data from HBase, re-sorting each column and storing it into sectors;
computing statistics over randomly selected sectors to calculate the similarity factor S between sectors, and judging whether the column distribution is uniform or discrete;
if the distribution is uniform, using the hybrid column-level compression mode; if the distribution is discrete, using the hybrid sector-level compression mode.
2. The HBase-based column-store compression method according to claim 1, characterized in that the similarity factor S is a defined quantity for judging the similarity between sectors, obtained from the absolute difference of the characteristic components of the statistics T of two sectors.
3. The HBase-based column-store compression method according to claim 1 or 2, characterized in that when each column of data is re-sorted, the information of each column is stored in the StoreFile of each sector, which is composed of HFiles.
4. The HBase-based column-store compression method according to claim 3, characterized in that the columns are first split into different tables and the column values are sorted; the composite rows of the new table generated for each column after sorting use a key designed according to the <columnID>_<rowID>_<row-key> format convention.
5. The HBase-based column-store compression method according to claim 3, characterized in that the compression algorithms of the hybrid column-level compression mode use run-length encoding, bit-vector encoding, WAH encoding, prefix encoding, delta encoding, and improved LZO.
6. The HBase-based column-store compression method according to claim 1, characterized in that the sorting-based column-sector hybrid compression strategy comprises the following steps:
Step1: read each column of data from HBase;
Step2: sort each column of data, and store each column according to the specified format;
Step3: randomly select 10 sectors from the column and compute the feature statistics Ti = {q2, q3, q4, q5, q6, q7}, i ∈ [1, 10];
Step4: judge the distribution characteristic of each column and, according to the data distribution, store the data with either the hybrid sector-level compression strategy (Hybrid Sector-Based Compression) or the hybrid column-level compression strategy (Hybrid Column-Based Compression);
Step5: compress each column of data according to the compression strategy for its distribution;
Step6: store the compressed data into HDFS.
7. The HBase-based column-store compression method according to claim 6, characterized in that the hybrid sector-level compression strategy comprises:
Step1: let i = 1;
Step2: compute the statistics Ti = {q1, q2, q3, q4, q5, q6};
Step3: if i = 1, jump to Step5; otherwise jump to Step4;
Step4: calculate the similarity with the previous block via the similarity factor S; if the similarity is high, set mi = mi-1; otherwise compute the statistics Ti = {q1, q2, q3, q4, q5, q6} and jump to Step5;
Step5: apply the XGBoost-based strategy selection method to the data block;
Step6: if block i is not the last block, set i = i + 1 and jump to Step3;
Step7: return the compression-strategy vector Ms.
8. The HBase-based column-store compression method according to claim 6, characterized in that the hybrid column-level compression strategy comprises:
Input: the column data to be compressed
Output: the compression strategy m
Step1: compute the feature statistics Tc = {q1, q2, q5, q6, q7};
Step2: judge the cardinality q: if it is below the threshold, set m = WAH encoding and jump to Step6; if it is above the threshold, jump to Step3;
Step3: judge the data type t: if the data is numeric, set m = delta encoding and jump to Step6; if it is text, jump to Step4;
Step4: judge the data skew: if the data is clearly skewed, set m = prefix encoding and jump to Step6; if there is no obvious skew, jump to Step5;
Step5: according to the usage frequency l, choose between improved LZO and no compression: if the usage frequency is high, set m = no compression and jump to Step6; if it is low, set m = improved LZO and jump to Step6;
Step6: return the compression strategy m.
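The decision sequence of claim 8 can be sketched as a simple rule chain. This is a sketch only: the threshold values, parameter names, and boolean feature inputs are assumptions for illustration, not values taken from the patent.

```python
def choose_column_strategy(cardinality, is_numeric, is_skewed,
                           usage_frequency, card_threshold=64,
                           freq_threshold=0.5):
    """Hybrid column-level strategy selection, following the test
    order of claim 8: cardinality -> data type -> skew -> frequency."""
    if cardinality < card_threshold:       # Step2: low cardinality
        return "WAH"
    if is_numeric:                         # Step3: numeric data
        return "delta"
    if is_skewed:                          # Step4: clearly skewed text
        return "prefix"
    if usage_frequency > freq_threshold:   # Step5: hot data is left
        return "none"                      # uncompressed
    return "improved LZO"                  # Step5: cold text
```

The fixed test order matters: an early match short-circuits the later, more expensive checks, mirroring the "jump to Step6" structure of the claim.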
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810130781.4A CN108319714A (en) | 2018-02-08 | 2018-02-08 | A kind of row storage compacting method based on HBase |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108319714A true CN108319714A (en) | 2018-07-24 |
Family
ID=62903490
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810130781.4A Pending CN108319714A (en) | 2018-02-08 | 2018-02-08 | A kind of row storage compacting method based on HBase |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108319714A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102609491A (en) * | 2012-01-20 | 2012-07-25 | 东华大学 | Column-storage oriented area-level data compression method |
US20170134044A1 (en) * | 2015-11-10 | 2017-05-11 | International Business Machines Corporation | Fast evaluation of predicates against compressed data |
CN105512305A (en) * | 2015-12-14 | 2016-04-20 | 北京奇虎科技有限公司 | Serialization-based document compression and decompression method and device |
Non-Patent Citations (1)
Title |
---|
WANG Haiyan et al.: "Compression strategy selection method based on HBase data classification", Journal on Communications *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110147372A (en) * | 2019-05-21 | 2019-08-20 | 电子科技大学 | A kind of distributed data base Intelligent Hybrid storage method towards HTAP |
CN110147372B (en) * | 2019-05-21 | 2022-12-23 | 电子科技大学 | HTAP-oriented distributed database intelligent hybrid storage method |
CN111010189A (en) * | 2019-10-21 | 2020-04-14 | 清华大学 | Multi-path compression method and device for data set and storage medium |
CN111010189B (en) * | 2019-10-21 | 2021-10-26 | 清华大学 | Multi-path compression method and device for data set and storage medium |
CN111552669A (en) * | 2020-04-26 | 2020-08-18 | 北京达佳互联信息技术有限公司 | Data processing method and device, computing equipment and storage medium |
CN113688127A (en) * | 2020-05-19 | 2021-11-23 | Sap欧洲公司 | Data compression technique |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180724 |