CN108319714A - A column-store compression method based on HBase - Google Patents

A column-store compression method based on HBase

Info

Publication number
CN108319714A
CN108319714A (application CN201810130781.4A)
Authority
CN
China
Prior art keywords
data
compression
row
hbase
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810130781.4A
Other languages
Chinese (zh)
Inventor
芦天亮
孙靖超
杜彦辉
蔡满春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Original Assignee
CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY filed Critical CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Priority to CN201810130781.4A priority Critical patent/CN108319714A/en
Publication of CN108319714A publication Critical patent/CN108319714A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1744Redundancy elimination performed by the file system using compression, e.g. sparse files
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method

Abstract

The invention discloses a column-store compression method based on HBase, comprising: reading each column of data from HBase, re-sorting each column, and storing it into sectors; gathering statistics over randomly sampled sectors to compute the similarity factor S between sectors, where S is a defined quantity for judging inter-sector similarity, obtained from the absolute differences of the feature components of the statistic T of two sectors; and judging whether the column's distribution is uniform or discrete. If the distribution is uniform, the hybrid column-based compression mode is applied; if the distribution is discrete, the hybrid sector-based compression mode is applied. The column-store compression method provided by the invention greatly reduces computational cost while improving compression efficiency.

Description

A column-store compression method based on HBase
Technical field
The present invention relates to the field of big data technology, and in particular to a column-store compression method based on HBase.
Background technology
Data compression has long been a central concern in the data field, and many compression methods exist. Lightweight compression schemes include run-length encoding, dictionary encoding, null suppression, and so on; heavyweight compression codings include GZIP, the Lempel-Ziv family, Huffman coding, and arithmetic coding. The difference between lightweight and heavyweight algorithms is that lightweight algorithms operate on runs of consecutive values, whereas heavyweight algorithms break the boundaries between values and operate on them as a stream of bytes. A common classification of lightweight and heavyweight compression algorithms is shown in Figure 1.
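As a rough illustration of the lightweight/heavyweight distinction described above (a sketch, not the patent's code): run-length coding keeps value boundaries and collapses runs of identical values, while a byte-oriented codec such as zlib's DEFLATE treats the column as one opaque byte stream.

```python
import zlib

def rle_encode(values):
    """Lightweight: collapse runs of identical values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

column = ["US", "US", "US", "CN", "CN", "DE"]
runs = rle_encode(column)            # [['US', 3], ['CN', 2], ['DE', 1]]
assert rle_decode(runs) == column

# Heavyweight: serialize the column to bytes, ignoring value boundaries.
blob = ",".join(column).encode()
packed = zlib.compress(blob)
assert zlib.decompress(packed) == blob
```

Sorting a column first lengthens the runs, which is exactly why the method re-sorts each column before choosing a codec.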
Research on compression strategies for column stores began with work related to C-Store. Abadi et al. proposed a decision-tree-based column compression model that determines the optimal compression algorithm for each column by building a compression-algorithm decision tree; however, its compression granularity is too coarse, ignoring the influence on compression of local distribution features and of differences across whole-column data.
Wang Zhenxi et al. proposed a sector-level compression strategy that divides data into sectors and selects a compression algorithm according to the correlation and difference between partitions. The method can apply different compression algorithms to different sectors according to their features, which guarantees the compression ratio, but when many partitions differ widely in similarity the computation becomes intensive.
Idreos et al. proposed a dynamic compression-algorithm selection strategy based on Bayesian classification, which uses Bayes' formula to select different compression algorithms for different data blocks, so that the data to be compressed achieves the best possible compression effect; however, the accuracy of this method depends heavily on the training samples, and no evaluation layer is established to assess the quality of the compression algorithm from feedback results.
Wang Haiyan et al. proposed a compression-strategy selection method based on classifying hot and cold data: HBase data are first divided into hot and cold data according to file access frequency, the Bayesian-classification compression method is improved by adding an evaluation layer on top of earlier work, and a new compression classification is proposed by combining the advantages of sector-level compression strategies. However, its classification algorithm does not support parallel processing, and there is no breakthrough in compression granularity, so the drawback of costly sector-level computation remains.
Research on selecting compression strategies for the characteristics of column-store databases has already produced many results, but the data are not preprocessed before compression: the distribution differs greatly between sections and the degree of dispersion is high, which is unsuited to compression. In the choice of compression granularity, fine-grained strategies are preferred, yet a fine-grained strategy must gather statistics for every sector, which is computationally expensive and affects compression time; moreover, existing research pays too little attention to the compression algorithms themselves. As for the classification algorithms used in strategy selection, previous work frequently used decision trees and naive Bayes. Decision trees are well interpretable but prone to overfitting when facing real-world complex unstructured data and various noisy data; naive Bayes has a solid mathematical foundation and is simple to implement, but its classification performance depends on the prior and is often unsatisfactory, so neither the accuracy nor the compression efficiency of the strategy can be guaranteed. In addition, the classification algorithms applied so far do not support parallel processing, cannot make full use of the cluster's computing power, and leave the load unbalanced.
Invention content
Therefore, existing column-store compression strategies suffer from highly dispersed data during compression, high computational cost caused by fine classification granularity, and compression efficiency that is hard to guarantee owing to defects in the accompanying classification algorithms. To address these problems, this method proposes a sort-based hybrid column/sector compression strategy.
The present invention provides a column-store compression method based on HBase, comprising the following steps:
Read each column of data from HBase, re-sort each column, and store it into sectors;
Gather the statistics of randomly sampled sectors to compute the similarity factor S between sectors, and judge whether the column's distribution is uniform or discrete;
If the distribution is uniform, apply the hybrid column-based compression mode; if the distribution is discrete, apply the hybrid sector-based compression mode.
Optionally, the similarity factor S is a defined quantity for judging inter-sector similarity, obtained from the absolute differences of the feature components of the statistic T of two sectors.
Optionally, when each column of data is re-sorted, the information of each column is stored in the StoreFile that each region composes from HFiles.
Optionally, the columns are first split into separate tables and the column values are sorted; the composite row key of the new table generated for each sorted column is designed according to the format convention <columnID>_<rowID>_<Row-key>.
Optionally, the compression algorithms of the hybrid column-based compression mode are run-length encoding, bit-vector encoding, WAH encoding, prefix encoding, delta encoding, and improved LZO.
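The claim above defines S only as the absolute difference of the feature components of the statistic T of two sectors. A minimal reading (an assumption on our part, since the patent gives no exact weighting or threshold) is the sum of component-wise absolute differences, compared against a threshold:

```python
def similarity_factor(t_a, t_b):
    """S = sum of |q_j(a) - q_j(b)| over the shared feature components.
    Smaller S means the two sectors are more alike."""
    assert len(t_a) == len(t_b)
    return sum(abs(a - b) for a, b in zip(t_a, t_b))

def sectors_similar(t_a, t_b, threshold=1.0):
    # The threshold value is illustrative; the patent does not fix one.
    return similarity_factor(t_a, t_b) < threshold

t1 = [0.9, 0.1, 55.0, 3.2, 1.0, 8.0]
t2 = [0.8, 0.1, 54.5, 3.0, 1.0, 8.0]
assert sectors_similar(t1, t2, threshold=1.5)
```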
Optionally, the sort-based hybrid column/sector compression strategy comprises the following steps:
Step1: read each column of data from HBase.
Step2: sort each column of data and store each column in the specified format.
Step3: randomly select 10 sectors from the column and gather their feature statistics T_i = {q2, q3, q4, q5, q6, q7}, i ∈ [1, 10].
Step4: judge the distribution characteristic of each column and, according to it, assign the data to the hybrid sector-based compression strategy (Hybrid Sector-Based Compression) or the hybrid column-based compression strategy (Hybrid Column-Based Compression) and store them accordingly.
Step5: compress each column of data according to the strategy it was assigned.
Step6: store the compressed data into HDFS.
Optionally, the hybrid sector-based compression strategy comprises:
Step1: let i = 1;
Step2: gather T_i = {q1, q2, q3, q4, q5, q6};
Step3: if i = 1, jump to Step5; otherwise continue to Step4;
Step4: compute the similarity with the previous block via the similarity factor S; if the similarity is high, set m_i = m_{i-1} and jump to Step6; otherwise gather T_i = {q1, q2, q3, q4, q5, q6} again and jump to Step5;
Step5: apply the XGBoost-based strategy selection method to the data block;
Step6: if block i is not the last block, set i = i + 1 and jump to Step3;
Step7: return the compression-strategy vector M_s.
Optionally, the hybrid column-based compression strategy comprises:
Input: the column data to be compressed
Output: compression strategy m
Step1: gather the feature statistic T_c = {q1, q2, q5, q6, q7};
Step2: judge the cardinality q1: if it is below the threshold, set m = WAH encoding and jump to Step6; if above the threshold, jump to Step3;
Step3: judge the data type t: if numeric, set m = delta encoding and jump to Step6; if text, jump to Step4;
Step4: judge the data skew: if the data are obviously skewed, set m = prefix encoding and jump to Step6; if there is no obvious skew, jump to Step5;
Step5: divide by the usage frequency between improved LZO and no compression: if the usage frequency is high, set m = no compression and jump to the final step; if the usage frequency is low, set m = improved LZO and jump to Step6;
Step6: return the compression strategy m.
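Steps 1-6 above amount to a fixed decision chain. A sketch in Python follows; the threshold values are placeholders, since the patent does not publish them:

```python
def choose_column_codec(cardinality, dtype, skewed, usage_freq,
                        card_threshold=0.05, freq_threshold=0.5):
    """Follow the hybrid column-based strategy: WAH for low cardinality,
    delta for numeric data, prefix coding for skewed text, then improved
    LZO vs. no compression decided by access frequency (hot data stays raw)."""
    if cardinality < card_threshold:          # Step2: cardinality check
        return "WAH"
    if dtype == "numeric":                    # Step3: data-type check
        return "delta"
    if skewed:                                # Step4: skew check
        return "prefix"
    if usage_freq > freq_threshold:           # Step5: frequently accessed
        return "none"
    return "improved-LZO"

assert choose_column_codec(0.01, "text", False, 0.1) == "WAH"
assert choose_column_codec(0.9, "numeric", False, 0.1) == "delta"
assert choose_column_codec(0.9, "text", True, 0.1) == "prefix"
assert choose_column_codec(0.9, "text", False, 0.9) == "none"
assert choose_column_codec(0.9, "text", False, 0.1) == "improved-LZO"
```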
The technical solution of the present invention has the following advantages:
1. The present invention first designs a method of sorting each column of data according to the characteristics of HBase to tighten the data: each column is split out and a structure is designed for it, so that after sorting the data can be stored sequentially in each region while hotspot problems are avoided. By packing the data tightly, the difference in local data distribution is reduced to the greatest extent.
2. A hybrid column/sector compression strategy is proposed, which recommends compression strategies using the hybrid sector-based strategy or the hybrid column-based strategy according to data features. Columns whose sectors have highly similar distribution features are compressed with the hybrid column-based strategy, and columns with low feature similarity are compressed with the hybrid sector-based strategy; combining coarse and fine granularity reduces the computational cost.
3. In the strategy design, suitable compression algorithms are selected for different data features, and for the first time the XGBoost algorithm is introduced into a compression strategy as the classification algorithm; its outstanding generalization properties and support for parallel computation make up for the deficiencies of previous classification algorithms.
Description of the drawings
In order to explain the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the specific embodiments or of the prior art are briefly introduced below. It is obvious that the drawings described below show some embodiments of the present invention, and that a person of ordinary skill in the art could obtain other drawings from them without creative effort.
Fig. 1 is a diagram of prior-art lightweight and heavyweight compression algorithms;
Fig. 2 is a flow chart of the sort-based hybrid column/sector compression strategy of the present invention;
Fig. 3 compares the compression ratio of the present invention with that of prior-art compression strategies;
Fig. 4 compares the compression effectiveness of the present invention with that of prior-art compression strategies;
Fig. 5 compares the compression time of the present invention with that of prior-art compression strategies;
Fig. 6 compares the decompression time of the present invention with that of prior-art compression strategies.
Specific implementation mode
The technical solution of the present invention is described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In addition, the technical features involved in the different embodiments of the invention described below can be combined with each other as long as they do not conflict.
Embodiment 1
This embodiment provides a column-store compression method based on HBase, a sort-based hybrid column/sector compression strategy (a Hybrid Compression Strategy of Column-Based Compression and Sector-Based Compression), shown in Figure 2. First, each column of data is read from HBase, sorted, and the sorted column data are stored. The similarity factor S between sectors is computed by gathering statistics over randomly sampled sectors; S is a defined quantity for judging inter-sector similarity, obtained from the absolute differences of the feature components of the statistic T of two sectors, where the required feature components are provided by the compression strategy of the first sector. If the feature similarity of the sampled sectors in a column is high, the column's distribution is judged to be uniform and the hybrid column-based compression strategy applies to the column's data; if the feature similarity of the sampled sectors is low, the column's distribution is judged to be discrete and the hybrid sector-based compression strategy applies. The data are then compressed with the compression algorithm judged from the data features under the applicable strategy and stored into HDFS.
Specifically, data features refer to a set of information describing the selected data, comprising the cardinality q, the count of identical values a, the data type t, the degree of skew d, the total number of key-value pairs v, the average run length of identical values c, and the average length of values l. Cardinality specifically refers to the degree of dispersion of the data in a column, i.e., the number of distinct attribute values in this method. The data access frequency is f = C/t, where C is the number of file accesses and t the corresponding period. The data statistic T is a set of values derived from the data features, intended as the input of the compression strategy; it has 7 feature components: q1, the dispersion (cardinality) of the data; q2, the degree of skew; q3, the identical-value percentage a*100/v; q4, the average run length of identical values c; q5, the data type t; q6, the average value length l; q7, the data usage frequency f.
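The seven feature components defined above can be computed directly from a column. A sketch follows; mapping "degree of dispersion" to the distinct-value ratio and "degree of skew" to the share of the modal value are our readings, since the patent does not give formulas for q1 and q2:

```python
from collections import Counter
from itertools import groupby

def feature_statistic(column, access_count, period, is_numeric):
    """Compute T = (q1..q7) for one column, per the definitions above."""
    v = len(column)                               # key-value pair total
    counts = Counter(column)
    q1 = len(counts) / v                          # cardinality / dispersion (assumed ratio form)
    q2 = counts.most_common(1)[0][1] / v          # skew: share of the modal value (assumption)
    a = sum(c for c in counts.values() if c > 1)  # occurrences of repeating values
    q3 = a * 100 / v                              # identical-value percentage a*100/v
    runs = [len(list(g)) for _, g in groupby(column)]
    q4 = sum(runs) / len(runs)                    # average run length of identical values
    q5 = "numeric" if is_numeric else "text"      # data type t
    q6 = sum(len(str(x)) for x in column) / v     # average value length l
    q7 = access_count / period                    # access frequency f = C / t
    return (q1, q2, q3, q4, q5, q6, q7)

T = feature_statistic(["aa", "aa", "bb", "aa"],
                      access_count=10, period=5, is_numeric=False)
assert T[0] == 0.5 and T[6] == 2.0
```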
The sort-based hybrid column/sector compression strategy is as follows:
Input: the data W to be compressed
Output: whether compression succeeded (0: failure, 1: success)
Step1: read each column of data from HBase.
Step2: sort each column of data and store each column in the specified format.
Step3: randomly select 10 sectors from the column and gather their feature statistics T_i = {q2, q3, q4, q5, q6, q7}, i ∈ [1, 10].
Step4: judge the distribution characteristic of each column and, according to it, assign the data to the hybrid sector-based compression strategy (Hybrid Sector-Based Compression) or the hybrid column-based compression strategy (Hybrid Column-Based Compression) and store them accordingly.
Step5: compress each column of data according to the strategy it was assigned.
Step6: store the compressed data into HDFS.
The hybrid sector-based compression strategy (Hybrid Sector-Based Compression Strategy) is:
Input: the column data to be compressed
Output: the compression-strategy vector M_s
Step1: let i = 1.
Step2: gather T_i = {q1, q2, q3, q4, q5, q6}.
Step3: if i = 1, jump to Step5; otherwise continue to Step4.
Step4: compute the similarity with the previous block via the similarity factor S; if the similarity is high, set m_i = m_{i-1} and jump to Step6; otherwise gather T_i = {q1, q2, q3, q4, q5, q6} again and jump to Step5.
Step5: apply the XGBoost-based strategy selection method to the data block.
Step6: if block i is not the last block, set i = i + 1 and jump to Step3.
Step7: return the compression-strategy vector M_s.
For a column to which the hybrid sector-based strategy applies, the statistic T_i = {q1, q2, q3, q4, q5, q6} of the first block (i = 1) is gathered first. The similarity of two sectors is judged by computing the similarity factor S: if the two sectors are similar, the compression algorithm of the previous sector applies; otherwise the statistic T_i = {q1, q2, q3, q4, q5, q6} is gathered again and the block's compression algorithm is obtained with the XGBoost strategy. Each block's compression strategy, obtained either from the XGBoost strategy or from adjacent-sector learning, is stored in the compression-strategy vector M_s.
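The sector-level loop just described can be sketched as follows. Here `select` is a trivial placeholder standing in for the XGBoost-based selector, and `similar` is a simple sum-of-absolute-differences test; both are assumptions for illustration:

```python
def hybrid_sector_strategy(sector_stats, similar, select):
    """For each sector: reuse the previous sector's codec when the two are
    similar (adjacent-sector learning, m_i = m_{i-1}); otherwise ask the
    selector (the XGBoost classifier in the patent) for a fresh choice."""
    strategies = []
    for i, stats in enumerate(sector_stats):
        if i > 0 and similar(sector_stats[i - 1], stats):
            strategies.append(strategies[-1])   # m_i = m_{i-1}
        else:
            strategies.append(select(stats))    # XGBoost-based selection
    return strategies

# Placeholder selector: low cardinality -> RLE, else improved LZO.
select = lambda s: "RLE" if s[0] < 0.1 else "improved-LZO"
similar = lambda a, b: sum(abs(x - y) for x, y in zip(a, b)) < 0.05

stats = [(0.01, 0.9), (0.02, 0.9), (0.8, 0.1)]
assert hybrid_sector_strategy(stats, similar, select) == \
    ["RLE", "RLE", "improved-LZO"]
```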
The hybrid column-based compression strategy (Hybrid Column-Based Compression Strategy) is:
Input: the column data to be compressed
Output: compression strategy m
Step1: gather the feature statistic T_c = {q1, q2, q5, q6, q7}.
Step2: judge the cardinality q1: if below the threshold, set m = WAH encoding and jump to Step6; if above the threshold, jump to Step3.
Step3: judge the data type t: if numeric, set m = delta encoding and jump to Step6; if text, jump to Step4.
Step4: judge the data skew: if the data are obviously skewed, set m = prefix encoding and jump to Step6; if there is no obvious skew, jump to Step5.
Step5: divide by the usage frequency between improved LZO and no compression: if the usage frequency is high, set m = no compression and jump to the final step; if low, set m = improved LZO and jump to Step6.
Step6: return the compression strategy m.
Lexicographic order is first chosen as the sort order, and the cardinality of the sorted data is judged. Cardinality is taken as the primary classification criterion because the WAH algorithm performs best among the four classic algorithms when trading off compression ratio, compression/decompression time, and query efficiency, so as much data as possible should be made eligible for WAH; delta compression, by contrast, suits a rather narrow range of data forms and places higher demands on the data, so it is judged at the next step. The final step classifies the remaining data by distribution: prefix compression applies to obviously skewed data, while data without obvious skew are assigned, according to usage frequency, either the improved LZO algorithm or no compression.
In this embodiment, the file formats in HBase that support compression are SequenceFile and HFile. The WAL (Write-Ahead Log) is the main SequenceFile in HBase, a kind of write-ahead log: data are first written to the WAL, then stored in the MemStore, and when the store fills up it is flushed, generating a new HFile in HDFS. In HBase, the information of each column is stored in the StoreFile that each region composes from HFiles; the contents of the original table cannot be stored sorted by content directly, because each region is divided by sorted row keys and corresponds to a certain row-key range, i.e., rows with adjacent row keys are stored in the same region. Whichever file type is compressed, key-value pairs with adjacent row keys are always arranged together. Therefore, to sort the contents of a table, the columns must first be split into separate tables and the column values sorted; the composite row key of the new table generated for each sorted column is designed with the format convention <columnID>_<rowID>_<Row-key>. Here <columnID> serves as a marker of the original column position and, acting as a salting prefix, spreads the data over different region servers, avoiding hotspot problems; <rowID> keeps the sorted values of a column arranged together, preventing them from being scattered over different regions where the sorted data could not be stored contiguously; and <Row-key> serves as the identifier of the original record, associating the columns with each other. The values of the <columnID> and <rowID> fields should be of fixed length. For example, for a sample table with 5 columns, 100 rows, and column family Cf, the specific table format is shown in Table 1:
Table 1 Sample table
The data are first sorted from top to bottom and the sorted data are stored in a new table whose composite row key is designed with the <columnID>_<rowID>_<Row-key> format; the column family is still set to Cf. Taking the ItemID column as an example of a sorted column, the table built for this column has the specific format shown in Table 2:
Table 2 ItemID table structure
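The composite row-key construction described above can be sketched in Python. The field widths are illustrative: the patent requires fixed-length <columnID> and <rowID> fields but does not specify a width:

```python
def composite_row_key(column_id, row_id, original_row_key,
                      col_width=3, row_width=8):
    """Build <columnID>_<rowID>_<Row-key>. columnID acts as a salting
    prefix spreading columns over region servers; rowID preserves the
    sorted order of values within a column; the original row key links
    the value back to its source record."""
    return "{}_{}_{}".format(str(column_id).zfill(col_width),
                             str(row_id).zfill(row_width),
                             original_row_key)

keys = [composite_row_key(2, i, "r%d" % i) for i in (0, 1, 10)]
# Zero-padding makes lexicographic order match the numeric order of rowID,
# so the sorted column values stay contiguous within the region.
assert keys == sorted(keys)
assert keys[0] == "002_00000000_r0"
```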
The research of Abadi and Ferreira shows that in column-store application scenarios, lightweight compression algorithms not only have low CPU cost but can also support operating directly on compressed data, improving query efficiency. The selection principle for compression algorithms is therefore to favor lightweight algorithms while taking heavyweight algorithms into account.
Common lightweight compression methods include run-length encoding, bit-vector encoding, null suppression, simple dictionary encoding, and delta encoding. HBase does not store null values, so null-suppression compression is not included among the candidates. In row stores, run-length encoding is only used to compress consecutive spaces and letters, but in column stores its field of application is very wide: since the data of one column share similar attributes and, once sorted, exhibit long runs of equal values, run-length encoding is a very suitable choice for low-cardinality columns. As for demands on the data, the bit-vector algorithm makes few demands on data type while run-length encoding makes more, but the latter compresses duplicated and regularly ordered data very well; the two complement each other and together produce very good results. The WAH (Word-Aligned Hybrid) code combines the two well and outperforms uncompressed bitmap vectors in query efficiency, so it is added to the compression-algorithm set. Real data often contain unstructured values such as URLs and home addresses, for which the prefix code (trie encoding) from simple dictionary encoding achieves good results. And for dates, times, and other types with small gaps between values, delta encoding is a very suitable lightweight compression method.
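Delta encoding, recommended above for dates, times, and other small-gap values, can be sketched as follows (a generic illustration, not the patent's implementation):

```python
def delta_encode(values):
    """Store the first value plus successive differences; sorted or
    near-sorted numeric columns yield many small, compressible deltas."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = []
    acc = 0
    for i, d in enumerate(deltas):
        acc = d if i == 0 else acc + d
        out.append(acc)
    return out

timestamps = [1517788800, 1517788860, 1517788920, 1517789040]
deltas = delta_encode(timestamps)      # [1517788800, 60, 60, 120]
assert delta_decode(deltas) == timestamps
```

The small repeated deltas (60, 60, 120) are themselves good input for a second lightweight pass such as run-length encoding, which is why the two codecs often appear together in column stores.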
As for heavyweight algorithms, the literature has tested the performance of GZIP, LZO, and Snappy on HBase: LZO's compression ratio is middling, but its compression and decompression are clearly much faster. As for the entropy coders, Huffman coding and arithmetic coding, the literature points out that both perform poorly in column-store databases: they do not support operations on compressed data, and their compression/decompression speed holds no advantage over the three methods above. This method adopts into its compression-algorithm set an improved LZO proposed in the literature, which compared with the original LZO can save up to twice the storage space and 10% of memory usage, and which decompresses faster than Snappy. For data whose distribution is uneven but whose skew is not obvious, the improved LZO encoding is recommended.
Based on the above, run-length encoding, bit-vector encoding, WAH encoding, prefix encoding, delta encoding, and improved LZO were finally chosen as the compression algorithms.
In column stores, the purpose of compressing at sector granularity is mainly to solve the mismatch between local data distribution and whole-column data distribution. For the problem of differing data-distribution features pointed out in the literature, sorting the data before compression packs the data more tightly; especially for columns with low cardinality, a high identical-value percentage, and a long average run of identical values, the differences between sectors become very small, so the whole column can serve directly as the compression granularity. This method therefore adopts a mixture of column and sector as the compression granularity.
In summary, to minimize the computational cost, one must start from both the compression granularity and the compression algorithm: the granularity must mix column-level and sector-level compression rather than use sector-level compression alone, and sorting is applied first to keep the data tightly ordered, avoiding cases where too coarse a granularity leaves the compression algorithm locally unsuitable. The selection of compression algorithms should also favor lightweight compression. To guarantee high compression efficiency, different compression algorithms must be designed for different data features, and the defects of the classification algorithms in previous strategies must be remedied to reach optimal compression efficiency.
Compression algorithm selection based on XGBoost
XGBoost (eXtreme Gradient Boosting), proposed by Tianqi Chen et al., is an improved form of gradient-boosted decision trees (Gradient Boosted Decision Trees, GBDT). It offers fast processing and excellent classification performance, retains GBDT's flexibility in handling data types, is robust to outliers, and generalizes well. When selecting the best split point it enumerates candidates in parallel, overcoming GBDT's inability to train in parallel, and its design includes thorough cache-line optimization, which speeds up training. It supports column sampling, sampling attributes when building trees, making training both fast and effective, and it uses regularization to prevent overfitting, further strengthening its generalization performance.
Therefore, the compression-algorithm set M = {run-length encoding, bit-vector encoding, WAH encoding, prefix encoding, delta encoding, improved LZO, no compression} is constructed, and XGBoost is used to classify over this set.
When classifying for compression with XGBoost, each iteration requires m trees, where m is the number of classes and each tree makes the prediction for one class; in this example m = 7. Each tree can be regarded as a function f: when the input is the statistic T, f maps the input sample's statistic to f1(T), f2(T), f3(T), ..., fm(T) as the predicted values for T.
The flow of XGBoost focuses on achievement process and leaf node fission process.
During achievement, most important is exactly object function, first objective function:
Obj (φ)=L (φ)+Ω (φ) (1)
Wherein L (φ) is loss function (cost function).The expression formula of L (φ) is as follows:
Because the Compression Strategies of this method finally only select one, all kinds of mutual exclusions are classification (multi-class) more than one Problem rather than multi-tag (multi-label) problem, therefore loss function is set to Softmax loss functions, then belong to certain The probability of a classification i is
In loss function L (φ)Calculation be
K is the sum of tree, and f indicates that each specific CART tree, model are made of k CART tree,
fk(T)=ωq(T),ω∈RP,q:Rd→{1,2,3,....,P} (5)
R refers to that leaf weight, q refer to that tree construction, q are mapped to input in leaf call number, and ω specifies each The leaf score of call number, the wherein value of ω are by the calculated value for making object function minimum of function optimization.What P referred to It is index sum.
Ω (φ) is regularization term, is used for the complexity of decision tree.To weigh decline and the model of object function Complexity avoids over-fitting.
TθRefer to leaf number, ω indicates that the value of each leaf node, wherein γ values, λ value guide the complexity into new leaf node Cost is spent, γ values, λ value value are bigger, and to there is the punishment of the tree of more leaf node and extremum bigger, the value of Ω (φ) gets over little tree Structure is simpler.
The structure of each tree is determined by repeatedly attempting a split on an existing leaf. The formula XGBoost uses for node splitting is

Gain = ½ [ G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ) ] − γ  (7)

where G_L²/(H_L + λ) refers to the left-subtree score, G_R²/(H_R + λ) to the right-subtree score, and (G_L + G_R)²/(H_L + H_R + λ) to the score obtained without splitting; γ refers to the complexity cost introduced by adding a new node. G_L and G_R are the sums of the first derivatives over all nodes of the left and right subtrees respectively, and H_L and H_R are the sums of the second derivatives over all nodes of the left and right subtrees respectively. The node-splitting operation of XGBoost differs from that of an ordinary decision tree: an ordinary decision tree does not consider tree complexity when splitting and relies on subsequent pruning, whereas XGBoost already accounts for tree complexity at split time through γ, so no separate pruning operation is needed. When the gain brought by a split is less than the threshold γ, splitting stops; the next round of iteration then begins, and tree building stops when the sum of the sample weights falls below the given threshold.
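The split formula above can be sketched as a small function (the gradient sums and the γ, λ values below are illustrative assumptions):

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    # 0.5 * [GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam)] - gamma
    def structure_score(G, H):
        return G * G / (H + lam)
    unsplit = structure_score(GL + GR, HL + HR)
    return 0.5 * (structure_score(GL, HL) + structure_score(GR, HR) - unsplit) - gamma

gain = split_gain(GL=10.0, HL=4.0, GR=-6.0, HR=3.0, lam=1.0, gamma=0.5)
# A split is kept only when the gain is positive; gamma prunes at split time.
```

With a large enough gamma the gain goes negative and the leaf is simply not split, which is exactly the built-in pruning described above.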
Effect example
Test environment:
The experiments were run on nine CentOS 7 servers, one MASTER and eight SLAVEs, with identical hardware: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz, 64 GB of memory, and 3 TB disks, running Hadoop 2.7.3 and HBase 1.2.4.
Dataset description:
For column-data compression, earlier work mainly used the TPC-H dataset, which suits relational OLAP workloads: its data is not skewed and its data patterns are simple, so it is far from real usage scenarios. This method instead uses the TPC-DS dataset, a version produced by the authoritative standards body TPC (Transaction Processing Performance Council) to remedy TPC-H's shortcomings with respect to real-world scenarios; its test data is skewed and closer to real data.
This experiment selected the ITEM table as the test dataset because its data types are rich, it matches practice well, and it better exposes the performance of each compression algorithm. The table, which has 22 attributes, was compressed with the selected strategies; an ITEM table of 3414 MB was generated with the dsdgen program. The access rank of each generated column file is shown in Table 3.
Experimental results and analysis
The experiment compares the compression strategies on the selected data in terms of compression ratio and compression/decompression time; the results are shown in Fig. 3.
As can be seen from Fig. 3, the method proposed here outperforms the other four strategies in compression effect on every column, reaching a compression ratio of about 20% of the original data. The hot/cold-data compression strategy comes second at 27.8%; the naive Bayes compression strategy reaches 36.5%; the learning-based sector-level compression strategy reaches 38.9%; and the c-store companion compression strategy has the worst compression ratio at 52.8%. This method achieves a better compression effect because earlier work mostly put its research emphasis on the classification algorithm and paid too little attention to the choice of compression algorithm, mostly reusing the compression algorithms of the c-store companion strategy. Even though fine-grained classification can improve classification precision on the data, if the compression algorithm is not suitable enough, high classification precision still cannot yield good results. The algorithm selection of this method fully considers the characteristics of the various data types in a big-data environment and selects a suitable compression algorithm for data with different characteristics. Moreover, the data is sorted before compression, so the characteristics of each section are clear and easy to compress, allowing a very high compression ratio; the per-column compression effects are shown in Fig. 4.
As Fig. 4 shows, this method beats the other methods in per-column compression effect on every column except column 17. Column 17, i_information, is judged from its data characteristics to be hot data and is therefore left uncompressed under this method's strategy, whereas the sector-level compression strategy, with its smaller granularity, finds local similarity in some texts and therefore applies a compression algorithm. As can be seen, the other compression algorithms perform poorly on this column: although they achieve some compression, they spend more compression time and complicate subsequent queries. Note that, from left to right, the 22 groups of bars in Fig. 4 are the results for the c-store companion compression strategy, the learning-based sector-level compression strategy, the naive Bayes compression strategy, the hot/cold-data compression classification strategy, and this method's strategy.
The experimental results for compression/decompression time are shown in Fig. 5 and Fig. 6; they show that the proposed strategy also has a clear advantage in compression/decompression time. This is because the method compresses at mixed granularity, applying different compression granularities to columns according to their different characteristics, so it does not need to collect statistics for every sector of every column as previous work did, saving a great deal of computation. In addition, the data is sorted before compression, which, besides benefiting the compression ratio, yields a more compact and regular data distribution that saves time for heavyweight compression algorithms. Taking the WAH algorithm as an example, after sorting, values with the same features are grouped together, which lengthens runs and reduces the number of triples, speeding up both compression and decompression.
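The run-length effect of sorting can be seen with a toy run-length encoder (a sketch for illustration only; this is not the WAH implementation used in the experiments):

```python
from itertools import groupby

def rle(values):
    # Encode a sequence as (value, run_length) pairs.
    return [(v, len(list(g))) for v, g in groupby(values)]

col = ["b", "a", "b", "a", "b", "a"]
unsorted_runs = rle(col)        # six runs of length 1
sorted_runs = rle(sorted(col))  # two runs of length 3
```

Sorting groups equal values together, so the number of runs (and hence of encoded triples) drops while run lengths grow, which is what speeds up both compression and decompression.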
Obviously, the above embodiments are merely examples given for clarity of description and are not intended to limit the embodiments. Those of ordinary skill in the art can make other changes or modifications of various forms on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments here, and obvious changes or modifications derived therefrom remain within the protection scope of this invention.

Claims (8)

1. An HBase-based column storage compression method, characterized by comprising the following steps:
reading each column of data from HBase, re-sorting each column, and storing it in sectors;
computing the statistics of randomly sampled sectors to calculate the similarity factor S between sectors, and judging whether the column's distribution is uniform or discrete;
if the distribution is uniform, using the hybrid column-level compression mode; if the distribution is discrete, using the hybrid sector-level compression mode.
2. The HBase-based column storage compression method according to claim 1, characterized in that the similarity factor S is a defined quantity for judging the similarity between sectors, obtained from the absolute differences of the feature components of the statistics T of two sectors.
3. The HBase-based column storage compression method according to claim 1 or 2, characterized in that, when each column of data is re-sorted, the information of each column is stored in each sector in a StoreFile composed of HFiles.
4. The HBase-based column storage compression method according to claim 3, characterized in that the table is first split by column into different tables, the column values are sorted, and the composite key of the new table generated for each column after sorting is designed following the format convention <columnID>_<rowID>_<row-key>.
5. The HBase-based column storage compression method according to claim 3, characterized in that the compression algorithms used by the hybrid column-level compression mode are run-length encoding, bit-vector encoding, WAH encoding, prefix encoding, delta encoding, and improved LZO.
6. The HBase-based column storage compression method according to claim 1, characterized in that the sort-based column/sector hybrid compression strategy comprises the following steps:
Step 1: read each column of data from HBase;
Step 2: sort each column of data and store it in the specified format;
Step 3: randomly select 10 sectors of the column and compute the feature statistics T_i = {q2, q3, q4, q5, q6, q7}, i ∈ [1, 10];
Step 4: judge the distribution characteristic of each column and, according to it, assign the data to either the hybrid sector-level compression strategy (Hybrid Sector-Based Compression) or the hybrid column-level compression strategy (Hybrid Column-Based Compression);
Step 5: compress each column of data with the compression strategy matching its distribution;
Step 6: store the compressed data into HDFS.
7. The HBase-based column storage compression method according to claim 6, characterized in that the hybrid sector-level compression strategy comprises:
Step 1: set i = 1;
Step 2: compute the statistics T_i = {q1, q2, q3, q4, q5, q6};
Step 3: if i = 1, go to Step 5; otherwise go to Step 4;
Step 4: calculate the similarity with the previous block through the similarity factor S; if the similarity is high, set m_i = m_{i-1}; otherwise compute the statistics T_i = {q1, q2, q3, q4, q5, q6} and go to Step 5;
Step 5: apply the XGBoost-based strategy selection method to the data block;
Step 6: if block i is not the last block, set i = i + 1 and go to Step 3;
Step 7: return the compression-strategy vector M_s.
8. The HBase-based column storage compression method according to claim 6, characterized in that the hybrid column-level compression strategy comprises:
Input: the column data to be compressed
Output: the compression strategy m
Step 1: compute the feature statistics T_c = {q1, q2, q5, q6, q7};
Step 2: judge the cardinality q; if it is below the threshold, set m = WAH encoding and go to Step 6; if it is above the threshold, go to Step 3;
Step 3: judge the data type t; if numeric, set m = delta encoding and go to Step 6; if text, go to Step 4;
Step 4: judge the data skew; if the data has obvious skew, set m = prefix encoding and go to Step 6; if there is no obvious skew, go to Step 5;
Step 5: choose between improved LZO and no compression according to the usage frequency l; if the usage frequency is high, set m = no compression and go to the final step; if the usage frequency is low, set m = improved LZO and go to Step 6;
Step 6: return the compression strategy m.
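Purely for illustration (code forms no part of the claims), the decision flow of claim 8 can be sketched as a decision function; the threshold values and parameter names are assumptions not fixed by the claims:

```python
def select_column_strategy(cardinality, is_numeric, has_obvious_skew, usage_freq,
                           card_threshold=1000, freq_threshold=0.5):
    # Step 2: low-cardinality columns compress well with WAH encoding.
    if cardinality < card_threshold:
        return "WAH"
    # Step 3: numeric columns use delta encoding.
    if is_numeric:
        return "delta"
    # Step 4: obviously skewed text uses prefix encoding.
    if has_obvious_skew:
        return "prefix"
    # Step 5: frequently used (hot) data stays uncompressed; cold data uses improved LZO.
    return "no compression" if usage_freq >= freq_threshold else "improved LZO"

m = select_column_strategy(cardinality=50, is_numeric=False,
                           has_obvious_skew=False, usage_freq=0.1)
```

Each branch corresponds to one step of claim 8, and the first matching condition fixes the strategy m for the whole column.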
CN201810130781.4A 2018-02-08 2018-02-08 A kind of row storage compacting method based on HBase Pending CN108319714A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810130781.4A CN108319714A (en) 2018-02-08 2018-02-08 A kind of row storage compacting method based on HBase

Publications (1)

Publication Number Publication Date
CN108319714A true CN108319714A (en) 2018-07-24

Family

ID=62903490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810130781.4A Pending CN108319714A (en) 2018-02-08 2018-02-08 A kind of row storage compacting method based on HBase

Country Status (1)

Country Link
CN (1) CN108319714A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609491A (en) * 2012-01-20 2012-07-25 东华大学 Column-storage oriented area-level data compression method
CN105512305A (en) * 2015-12-14 2016-04-20 北京奇虎科技有限公司 Serialization-based document compression and decompression method and device
US20170134044A1 (en) * 2015-11-10 2017-05-11 International Business Machines Corporation Fast evaluation of predicates against compressed data


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Haiyan et al., "A compression strategy selection method based on HBase data classification" (基于HBase数据分类的压缩策略选择方法), Journal on Communications (通信学报) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147372A (en) * 2019-05-21 2019-08-20 电子科技大学 A kind of distributed data base Intelligent Hybrid storage method towards HTAP
CN110147372B (en) * 2019-05-21 2022-12-23 电子科技大学 HTAP-oriented distributed database intelligent hybrid storage method
CN111010189A (en) * 2019-10-21 2020-04-14 清华大学 Multi-path compression method and device for data set and storage medium
CN111010189B (en) * 2019-10-21 2021-10-26 清华大学 Multi-path compression method and device for data set and storage medium
CN111552669A (en) * 2020-04-26 2020-08-18 北京达佳互联信息技术有限公司 Data processing method and device, computing equipment and storage medium
CN113688127A (en) * 2020-05-19 2021-11-23 Sap欧洲公司 Data compression technique

Similar Documents

Publication Publication Date Title
CN108319714A (en) A kind of row storage compacting method based on HBase
Lemire et al. SIMD compression and the intersection of sorted integers
Grabowski et al. Disk-based compression of data from genome sequencing
US20140229454A1 (en) Method and system for data compression in a relational database
CN109325032B (en) Index data storage and retrieval method, device and storage medium
US10210280B2 (en) In-memory database search optimization using graph community structure
Jiang et al. Good to the last bit: Data-driven encoding with codecdb
Gao et al. Squish: Near-optimal compression for archival of relational datasets
Zhang et al. Improved covering-based collaborative filtering for new users’ personalized recommendations
Hernández-Illera et al. RDF-TR: Exploiting structural redundancies to boost RDF compression
CN108932738B (en) Bit slice index compression method based on dictionary
Yan et al. Micronet for efficient language modeling
Lei et al. Compressing deep convolutional networks using k-means based on weights distribution
Pibiri et al. On optimally partitioning variable-byte codes
Sun et al. A novel fractal coding method based on MJ sets
Fan et al. Codebook-softened product quantization for high accuracy approximate nearest neighbor search
Vimal et al. An Experiment with Distance Measures for Clustering.
Guzun et al. Performance evaluation of word-aligned compression methods for bitmap indices
Xie et al. Query log compression for workload analytics
Kattan et al. Evolutionary lossless compression with GP-ZIP
Vo et al. Using column dependency to compress tables
CN115982634A (en) Application program classification method and device, electronic equipment and computer program product
CN113468186A (en) Data table primary key association method and device, computer equipment and readable storage medium
CN109614587B (en) Intelligent human relationship analysis modeling method, terminal device and storage medium
Fu et al. All-CQS: adaptive locality-based lossy compression of quality scores

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180724