CN106021360A - Method and device for autonomously learning and optimizing MapReduce processing data


Info

Publication number
CN106021360A
Authority
CN
China
Prior art keywords
data
sampling
key
learning
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610305912.9A
Other languages
Chinese (zh)
Inventor
张伟
王界兵
李�杰
董迪马
郭宇翔
梁猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Frontsurf Information Technology Co Ltd
Original Assignee
Shenzhen Frontsurf Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Frontsurf Information Technology Co Ltd filed Critical Shenzhen Frontsurf Information Technology Co Ltd
Priority to CN201610305912.9A
Publication of CN106021360A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for autonomously learning and optimizing MapReduce data processing. The method comprises the following steps: in a job, sampling the data before the reduce computation in a preset manner, forming a learning file from the obtained sampling keys, and storing the learning file in a learning folder whose directory is the corresponding job label information; in a subsequent job, searching for the corresponding learning folder according to the job label information; if the corresponding learning folder is found, directly calling the processing results in the learning folder to optimize the processing; and if the corresponding learning folder is not found, forming and storing a new learning file. According to the disclosed method and device, the sampling keys of the data before the reduce computation are obtained for sampling-based learning, and the job label is used to judge whether the data of a subsequent job has a corresponding learning file, so as to determine whether to call the processing results of the learning file. The learning method is simple and enables jobs that repeatedly process similar data to be completed quickly and efficiently.

Description

Method and apparatus for autonomously learning to optimize MapReduce data processing
Technical field
The present invention relates to the technical field of MapReduce data processing, and in particular to a method and apparatus for autonomously learning to optimize MapReduce data processing.
Background art
In practical big-data scenarios, many applications work on data with very high similarity and repeatedly perform statistics, analysis, computation, and similar tasks on it. If the processing information of historical data is collected and analyzed, the results are valuable for, and improve the efficiency of, subsequent jobs that repeatedly process similar data.
Summary of the invention
The main object of the present invention is to provide a method and apparatus for autonomously learning to optimize MapReduce data processing that can reuse the processing results of data with high similarity.
To achieve the foregoing object, the present invention proposes a method for autonomously learning to optimize MapReduce data processing, comprising:
in a job, sampling the data before the reduce computation in a preset manner, and storing the obtained sampling keys as a learning file in a learning folder whose directory is the corresponding job label information;
in a subsequent job, searching for the corresponding learning folder according to the job's label information; if it is found, directly calling the processing results in the learning folder to optimize this processing; if it is not found, forming a new learning file and storing it.
Further, the step of, in a job, sampling the data before the reduce computation in a preset manner and storing the obtained sampling keys as a learning file in a learning folder whose directory is the corresponding job label information comprises:
obtaining one sampling key at intervals of a specified number of key-value pairs; or,
obtaining one sampling key at intervals of a specified number of bytes.
Further, the directory of the learning folder is:
a signature value calculated from the value of a signature template in a specified way.
Further, the signature value is a value obtained by hashing the value of the signature template.
Further, the step of obtaining one sampling key at intervals of a specified number of key-value pairs, or at intervals of a specified number of bytes, comprises:
adaptively selecting the specified interval according to changes in the data.
The present invention also provides an apparatus for autonomously learning to optimize MapReduce data processing, comprising:
a sampling storage unit, configured to, in a job, sample the data before the reduce computation in a preset manner, and store the obtained sampling keys as a learning file in a learning folder whose directory is the corresponding job label information;
a selection unit, configured to, in a subsequent job, search for the corresponding learning folder according to the job's label information; if it is found, directly call the processing results in the learning folder to optimize this processing; if it is not found, form a new learning file and store it.
Further, the sampling storage unit comprises:
a first sampling module, configured to obtain one sampling key at intervals of a specified number of key-value pairs; or,
a second sampling module, configured to obtain one sampling key at intervals of a specified number of bytes.
Further, the directory of the learning folder is:
a signature value calculated from the value of a signature template in a specified way.
Further, the sampling storage unit comprises:
a hash calculation module, configured to hash the value of the signature template to obtain the signature value.
Further, the first sampling module or the second sampling module comprises:
an adaptive submodule, configured to adapt the sampling strategy according to changes in the data.
With the method and apparatus for autonomously learning to optimize MapReduce data processing of the present invention, the sampling keys of the data before the reduce computation are obtained for sampling-based learning; then, for the data of a subsequent job, the job label is used to judge whether a corresponding learning file exists, which determines whether the processing results of the learning file are called and used. The learning method is simple, and jobs that repeatedly process similar data can be completed quickly and efficiently.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of a method for autonomously learning to optimize MapReduce data processing according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of representing the values of keys in a data set by their offsets according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of splitting the Map output data according to an embodiment of the present invention;
Fig. 4 is a schematic structural block diagram of an apparatus for autonomously learning to optimize MapReduce data processing according to an embodiment of the present invention;
Fig. 5 is a schematic structural block diagram of the sampling storage unit according to an embodiment of the present invention.
The realization of the objects, functional characteristics, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the invention
It should be understood that the specific embodiments described herein are intended only to explain the present invention and are not intended to limit it.
With reference to Fig. 1, an embodiment of the present invention provides a method for autonomously learning to optimize MapReduce data processing, comprising the steps of:
S1: in a job, sampling the data before the reduce computation in a preset manner, and storing the obtained sampling keys as a learning file in a learning folder whose directory is the corresponding job label information;
S2: in a subsequent job, searching for the corresponding learning folder according to the job's label information; if it is found, directly calling the processing results in the learning folder to optimize this processing; if it is not found, forming a new learning file and storing it.
As described in step S1, the data before the reduce computation is sampled. The data before the reduce computation consists of multiple key-value pairs; a key-value pair is simply a key and a value. The key and value of each sampled key-value pair are recorded, and because the data itself is sorted, the sampled keys are also sorted. To make it easy to find the corresponding sampling information, the sampling information can be stored in a learning folder whose directory is the job label information. In this embodiment, a job may contain multiple reduce computations and therefore may generate multiple corresponding learning files; these learning files are stored in a learning folder whose directory is the corresponding job label information, and different jobs have different job label information.
As described in step S2, tasks of jobs similar to previous ones can then be processed quickly, improving work efficiency.
In this embodiment, step S1, in which the data before the reduce computation is sampled in a preset manner in a job and the obtained sampling keys are stored as a learning file in a learning folder whose directory is the corresponding job label information, comprises:
S11: obtaining one sampling key at intervals of a specified number of key-value pairs; or,
S12: obtaining one sampling key at intervals of a specified number of bytes.
As described in step S11, this sampling mode can be called the key-value-pair sampling mode. Sampling is simple: just count the key-value pairs, for example sampling once every 5,000 key-value pairs. If every key-value pair has the same number of bytes, this is equivalent to counting bytes.
As described in step S12, this sampling mode can be called the byte sampling mode, which is suitable when the byte counts of the key-value pairs differ. This mode can control the amount of data precisely so that computation can be performed in memory; for example, 500 MB is a fairly typical configuration.
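As an illustration of these two sampling modes, the following Java sketch collects one sampling key either every fixed number of key-value pairs or every fixed number of bytes, and records the offset of each sampled key for later use. The class, field, and method names are hypothetical and not taken from the patent; this is a minimal sketch, not the patent's implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the two sampling modes (S11: count key-value pairs, S12: count bytes).
    public class KeySampler {
        private final long recordInterval;   // e.g. one sample every 5000 key-value pairs
        private final long byteInterval;     // byte-based interval for records of uneven size
        private final boolean byBytes;       // false: mode S11, true: mode S12

        private long recordCount = 0;
        private long byteCount = 0;
        private final List<String> sampledKeys = new ArrayList<>();
        private final List<Long> sampleOffsets = new ArrayList<>();

        public KeySampler(long recordInterval, long byteInterval, boolean byBytes) {
            this.recordInterval = recordInterval;
            this.byteInterval = byteInterval;
            this.byBytes = byBytes;
        }

        // Called once per key-value pair seen before the reduce computation.
        public void observe(String key, String value, long offsetInDataSet) {
            recordCount++;
            byteCount += key.length() + value.length();  // rough byte count; string lengths used for simplicity
            boolean take = byBytes ? (byteCount >= byteInterval) : (recordCount >= recordInterval);
            if (take) {
                sampledKeys.add(key);                // the sampling key
                sampleOffsets.add(offsetInDataSet);  // its offset (sample position) in the data set
                recordCount = 0;
                byteCount = 0;
            }
        }

        public List<String> keys() { return sampledKeys; }
        public List<Long> offsets() { return sampleOffsets; }
    }

For instance, new KeySampler(5000, 0, false) reproduces the "one sample every 5,000 key-value pairs" example above, while new KeySampler(0, 500L * 1024 * 1024, true) samples roughly every 500 MB.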
In this embodiment, the directory of the learning folder is the signature value calculated from the value of the signature template in a specified way.
The signature template is the job signature; it identifies the job uniquely and is used as the directory on HDFS for storing the learning files, so that a subsequent identical job can find the corresponding learning files. The template consists of configuration parameters that can identify the job uniquely and are used to produce the signature, for example: "mapred.mapper.class, mapreduce.map.class, mapred.reducer.class, mapreduce.reduce.class, mapred.reduce.tasks, mapreduce.job.reduces, mapreduce.workflow.name, mapreduce.workflow.node.name".
The value of the signature template is the string obtained by concatenating the values of the parameters in the signature template.
The signature value is a value calculated from the value of the signature template in a specified way; in one specific embodiment, the signature value is obtained by hashing the value of the signature template.
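As a minimal sketch of how such a signature value could be derived, the following Java code concatenates the values of the template parameters listed above and hashes the result. The choice of MD5, the use of a plain map in place of the Hadoop job configuration, and the helper names are assumptions, since the description only requires some hash calculation.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.Map;

    // Sketch: concatenate the values of the signature-template parameters and hash them.
    public class JobSignature {
        private static final String[] TEMPLATE = {
            "mapred.mapper.class", "mapreduce.map.class",
            "mapred.reducer.class", "mapreduce.reduce.class",
            "mapred.reduce.tasks", "mapreduce.job.reduces",
            "mapreduce.workflow.name", "mapreduce.workflow.node.name"
        };

        // conf is a plain map standing in for the Hadoop job configuration.
        public static String signatureValue(Map<String, String> conf) throws Exception {
            StringBuilder templateValue = new StringBuilder();
            for (String param : TEMPLATE) {
                templateValue.append(conf.getOrDefault(param, ""));  // concatenated parameter values
            }
            // "Hash calculation": MD5 here, purely as an example of a hash function.
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(templateValue.toString().getBytes("UTF-8"));
            return new BigInteger(1, digest).toString(16);  // hex string used as the HDFS directory name
        }
    }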
In this embodiment, the step of obtaining one sampling key at intervals of a specified number of key-value pairs, or at intervals of a specified number of bytes, comprises:
S110: adaptively selecting the specified interval according to changes in the data.
As described in step S110, a change in the data means that the data at run time differs from the data at sampling time, and the difference can sometimes be very large. For example, the transaction data of an ordinary business day on Taobao and the transaction data of the Singles' Day (November 11) shopping festival differ by several orders of magnitude. If the results learned from ordinary data are used to guide the processing of the heavily skewed Singles' Day data, the data are bound to spill out of memory. To avoid this, different strategies can be formulated according to the actual situation, such as using different learning files at different times or regenerating the learning files. How often the learning files are regenerated can be decided according to the learning cost; the cost of the sampling-based learning method described above is very low, so learning can be repeated on every run, which makes the method adaptive to changes in the data.
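A minimal sketch of one possible adaptive policy follows; it assumes the decision to reuse or regenerate a learning file is based on how far the current input size deviates from the size recorded when the file was built. The threshold, the size bookkeeping, and all names are illustrative only, since the description deliberately leaves the concrete strategy open.

    // Sketch of an adaptive policy: reuse the learning file when the data looks similar,
    // regenerate it when the input has changed too much (e.g. an ordinary day vs. a
    // Singles' Day workload). Thresholds and method names are illustrative only.
    public class LearningFilePolicy {
        private static final double MAX_SIZE_RATIO = 4.0;  // assumed tolerance

        // bytesAtLearning: input size when the learning file was produced
        // bytesNow: input size of the current job
        public static boolean reuseLearningFile(long bytesAtLearning, long bytesNow) {
            if (bytesAtLearning <= 0) {
                return false;                         // no previous learning information
            }
            double ratio = (double) bytesNow / (double) bytesAtLearning;
            // Since sampling-based learning is cheap, err on the side of relearning.
            return ratio < MAX_SIZE_RATIO && ratio > 1.0 / MAX_SIZE_RATIO;
        }
    }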
In this embodiment, the sampling information can be saved in file form on the Hadoop Distributed File System (HDFS) for access by subsequent jobs. The directory on HDFS also matters; the signature of the job is used. Sampling is performed for each Reduce process, so each reduce process produces one learning file. For the file format, Hadoop's built-in SequenceFile format for storing key-value pairs is used to save the sampling information: the key is the sampling key, and the value can be the offset of that key in the data set. As shown in Fig. 2, the data is sampled once every 5,000 records, yielding a total of 4 sampled keys, "cat", "fox", "lion", and "snake"; the values written to the file are the offsets (or sample positions) of these keys in the data set.
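The following sketch writes such a learning file with Hadoop's SequenceFile writer, using the sampled key as the key and its offset as the value, as in Fig. 2. The directory layout, file naming, and class names are simplified assumptions rather than the patent's exact implementation.

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Sketch: persist the sampled keys of one reduce process as a SequenceFile on HDFS.
    public class LearningFileWriter {
        public static void write(Configuration conf, String signatureDir, int reduceId,
                                 List<String> sampledKeys, List<Long> offsets) throws IOException {
            // One learning file per reduce process, under the job-signature directory.
            Path file = new Path(signatureDir, "learning-" + reduceId);
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(file),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(LongWritable.class))) {
                for (int i = 0; i < sampledKeys.size(); i++) {
                    // key = sampling key, value = its offset (sample position) in the data set
                    writer.append(new Text(sampledKeys.get(i)), new LongWritable(offsets.get(i)));
                }
            }
        }
    }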
In this embodiment, the sampled keys in the learning file can be used to build, in memory, a mapping from keys to data-block (bucket) identifiers (ids). In a subsequent job this mapping can be used to look up the bucket a piece of data belongs to, so the data can be partitioned at the map side. Moreover, because the sampled keys are sorted, the buckets are also ordered, which yields a coarse-grained, globally ordered sort. The mapping can be built in several ways. One is a TreeMap: it is simple to build, uses little memory, and lookups are generally O(log n). Another is a multidimensional array, which can be built in the learning stage by using the first few most significant bytes of each sampled key as array indices to fill the array; lookups are O(1), but the probability of collisions is high. This embodiment uses a hybrid, hierarchical scheme: the three-dimensional array is searched first, where CC (Collision Counter) is the number of collisions of an array element. If CC < 2, there is only one bucket. If CC = 2, the 1st and 2nd bucket ids are meaningful, and a comparison with the full key decides between the 1st and the 2nd bucket. If CC > 2, the result of the TreeMap is used. A full three-dimensional array would occupy 255*255*255*8 = 132,651,000 bytes, about 128 MB, so one array per Reduce is clearly not feasible; instead, all reduces can share the data. If reduces collide with one another, a special bit flag (SB) is used to help determine the reduce, i.e., the partition number.
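A minimal sketch of the TreeMap variant of this key-to-bucket mapping is given below; the hybrid three-dimensional-array scheme and the collision counter are omitted for brevity, and the class and method names are illustrative.

    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Sketch: map an arbitrary key to a bucket id using the sorted sampled keys.
    // With N sampled keys this yields N + 1 ordered buckets.
    public class BucketLookup {
        private final TreeMap<String, Integer> boundaries = new TreeMap<>();

        public BucketLookup(List<String> sortedSampledKeys) {
            for (int i = 0; i < sortedSampledKeys.size(); i++) {
                boundaries.put(sortedSampledKeys.get(i), i);  // i-th boundary key
            }
        }

        // Bucket id in [0, N]; keys <= the i-th sampled key fall into bucket i.
        public int bucketOf(String key) {
            Map.Entry<String, Integer> e = boundaries.ceilingEntry(key);  // O(log n) lookup
            return (e == null) ? boundaries.size() : e.getValue();
        }
    }

With the sampled keys "cat", "fox", "lion", and "snake" from Fig. 2, for example, the key "dog" falls into bucket 1 (between "cat" and "fox"), while any key greater than "snake" falls into bucket 4.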
With the method for autonomously learning to optimize MapReduce data processing of this embodiment, the Map output data can be split and processed in multiple pipelined, concurrent batches. Referring to Fig. 3, when the Map output is produced, each partition can use the learning file of its corresponding Reduce to split the data into blocks (buckets). With N sampled keys, a partition can be split into N+1 data blocks, and because the keys are sorted, the data blocks are also ordered relative to one another. In a pipelined MapReduce, one partition can be shuffled in multiple batches; each batch (pass) contains the data blocks (buckets) with the same number from all map output files (MOFs), and the buckets are ordered, which makes concurrent multi-batch processing possible. In addition, by adjusting the bucket size, the shuffle and reduce of each batch can be performed in memory, greatly reducing hard-disk IO accesses and latency.
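The sketch below, building on the BucketLookup sketch above, shows one way a map output record could be routed to a (partition, bucket) pair so that buckets with the same number can be shuffled together as one batch. The hash-based choice of partition and all names here are illustrative assumptions rather than the patent's exact implementation.

    import java.util.List;

    // Sketch: route each map output record to its reduce partition and, within that
    // partition, to the bucket defined by that reduce's learning file. Buckets with the
    // same number across all map output files form one shuffle batch (pass).
    public class BucketedRouter {
        private final List<BucketLookup> lookups;  // one BucketLookup per reduce partition

        public BucketedRouter(List<BucketLookup> lookupsPerReduce) {
            this.lookups = lookupsPerReduce;
        }

        public static final class Route {
            public final int partition;  // which reduce
            public final int bucket;     // which batch within that reduce
            Route(int partition, int bucket) { this.partition = partition; this.bucket = bucket; }
        }

        // Each partition is split into N + 1 ordered buckets by its sampled keys.
        public Route route(String key) {
            int partition = (key.hashCode() & Integer.MAX_VALUE) % lookups.size();  // default hash partitioning
            int bucket = lookups.get(partition).bucketOf(key);
            return new Route(partition, bucket);
        }
    }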
In a specific embodiment, experimental data were compared:
(1) Test environment:
Four data nodes
Hadoop software: results are similar for the three major distributions, CDH, HDP, and MAPR
CPU: 2 x 8 cores
RAM: 128 GB
Disk: 2 TB x 12
(2) Measured results.
Whether or not learning files are used, the code path is the same; the only difference is that without a learning file the default assumption is that there is only one data block (bucket), which, like the native implementation, shuffles the whole partition, and the merge and reduce computations are identical, so hard-disk IO accesses cannot be avoided. Table 1 shows that using learning files and batched processing significantly improves MapReduce's data processing capacity, to roughly 1.6 to 2 times the original. In addition, Hadoop's own statistics report shows that the amount of data read from and written to the hard disks is greatly reduced, indicating that with learning files and batched processing the computation has shifted to being memory-based.
With the method for autonomously learning to optimize MapReduce data processing of the present invention, the sampling keys of the data before the reduce computation are obtained for sampling-based learning; then, for the data of a subsequent job, the job label is used to judge whether a corresponding learning file exists, which determines whether the processing results of the learning file are called and used. The learning method is simple, and jobs that repeatedly process similar data can be completed quickly and efficiently.
With reference to Fig. 4, an embodiment of the present invention also provides an apparatus for autonomously learning to optimize MapReduce data processing, comprising:
a sampling storage unit 10, configured to, in a job, sample the data before the reduce computation in a preset manner, and store the obtained sampling keys as a learning file in a learning folder whose directory is the corresponding job label information;
a selection unit 20, configured to, in a subsequent job, search for the corresponding learning folder according to the job's label information; if it is found, directly call the processing results in the learning folder to optimize this processing; if it is not found, form a new learning file and store it.
In the sampling storage unit 10, the data before the reduce computation is sampled. The data before the reduce computation consists of multiple key-value pairs; a key-value pair is simply a key and a value. The key and value of each sampled key-value pair are recorded, and because the data itself is sorted, the sampled keys are also sorted. To make it easy to find the corresponding sampling information, the sampling information can be stored in a learning folder whose directory is the job label information. In this embodiment, a job may contain multiple reduce computations and therefore may generate multiple corresponding learning files; these learning files are stored in a learning folder whose directory is the corresponding job label information, and different jobs have different job label information.
With the selection unit 20, tasks of jobs similar to previous ones can be processed quickly, improving work efficiency.
With reference to Fig. 5, in this embodiment the sampling storage unit 10 comprises:
a first sampling module 11, configured to obtain one sampling key at intervals of a specified number of key-value pairs; or,
a second sampling module 12, configured to obtain one sampling key at intervals of a specified number of bytes.
The sampling mode of the first sampling module 11 can be called the key-value-pair sampling mode. Sampling is simple: just count the key-value pairs, for example sampling once every 5,000 key-value pairs. If every key-value pair has the same number of bytes, this is equivalent to counting bytes.
The sampling mode of the second sampling module 12 can be called the byte sampling mode, which is suitable when the byte counts of the key-value pairs differ. This mode can control the amount of data precisely so that computation can be performed in memory; for example, 500 MB is a fairly typical configuration.
In this embodiment, the directory of the learning folder is the signature value calculated from the value of the signature template in a specified way. The signature template is the job signature; it identifies the job uniquely and is used as the directory on HDFS for storing the learning files, so that a subsequent identical job can find the corresponding learning files. The template consists of configuration parameters that can identify the job uniquely and are used to produce the signature, for example: "mapred.mapper.class, mapreduce.map.class, mapred.reducer.class, mapreduce.reduce.class, mapred.reduce.tasks, mapreduce.job.reduces, mapreduce.workflow.name, mapreduce.workflow.node.name".
The value of the signature template is the string obtained by concatenating the values of the parameters in the signature template.
The signature value is a value calculated from the value of the signature template in a specified way; in one specific embodiment, the sampling storage unit 10 comprises a hash calculation module 13, configured to hash the value of the signature template to obtain the signature value.
In this embodiment, the first sampling module 11 or the second sampling module 12 comprises:
an adaptive submodule 110, configured to adapt the sampling strategy according to changes in the data.
In the adaptive submodule 110, a change in the data means that the data at run time differs from the data at sampling time, and the difference can sometimes be very large. For example, the transaction data of an ordinary business day on Taobao and the transaction data of the Singles' Day (November 11) shopping festival differ by several orders of magnitude. If the results learned from ordinary data are used to guide the processing of the heavily skewed Singles' Day data, the data are bound to spill out of memory. To avoid this, different strategies can be formulated according to the actual situation, such as using different learning files at different times or regenerating the learning files. How often the learning files are regenerated can be decided according to the learning cost; the cost of the sampling-based learning method described above is very low, so learning can be repeated on every run, which makes the method adaptive to changes in the data.
With the apparatus for autonomously learning to optimize MapReduce data processing of the present invention, the sampling keys of the data before the reduce computation are obtained for sampling-based learning; then, for the data of a subsequent job, the job label is used to judge whether a corresponding learning file exists, which determines whether the processing results of the learning file are called and used. The learning method is simple, and jobs that repeatedly process similar data can be completed quickly and efficiently.
The foregoing are only preferred embodiments of the present invention and do not thereby limit its scope of protection. Any equivalent structural or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A method for autonomously learning to optimize MapReduce data processing, characterized in that it comprises:
in a job, sampling the data before the reduce computation in a preset manner, and storing the obtained sampling keys as a learning file in a learning folder whose directory is the corresponding job label information;
in a subsequent job, searching for the corresponding learning folder according to the job's label information; if it is found, directly calling the processing results in the learning folder to optimize this processing; if it is not found, forming a new learning file and storing it.
2. The method for autonomously learning to optimize MapReduce data processing according to claim 1, characterized in that the step of, in a job, sampling the data before the reduce computation in a preset manner and storing the obtained sampling keys as a learning file in a learning folder whose directory is the corresponding job label information comprises:
obtaining one sampling key at intervals of a specified number of key-value pairs; or,
obtaining one sampling key at intervals of a specified number of bytes.
3. The method for autonomously learning to optimize MapReduce data processing according to claim 1, characterized in that the directory of the learning folder is:
a signature value calculated from the value of a signature template in a specified way.
4. The method for autonomously learning to optimize MapReduce data processing according to claim 3, characterized in that the signature value is a value obtained by hashing the value of the signature template.
5. The method for autonomously learning to optimize MapReduce data processing according to claim 2, characterized in that the step of obtaining one sampling key at intervals of a specified number of key-value pairs, or at intervals of a specified number of bytes, comprises:
adaptively selecting the specified interval according to changes in the data.
6. An apparatus for autonomously learning to optimize MapReduce data processing, characterized in that it comprises:
a sampling storage unit, configured to, in a job, sample the data before the reduce computation in a preset manner, and store the obtained sampling keys as a learning file in a learning folder whose directory is the corresponding job label information;
a selection unit, configured to, in a subsequent job, search for the corresponding learning folder according to the job's label information; if it is found, directly call the processing results in the learning folder to optimize this processing; if it is not found, form a new learning file and store it.
7. The apparatus for autonomously learning to optimize MapReduce data processing according to claim 6, characterized in that the sampling storage unit comprises:
a first sampling module, configured to obtain one sampling key at intervals of a specified number of key-value pairs; or,
a second sampling module, configured to obtain one sampling key at intervals of a specified number of bytes.
8. The apparatus for autonomously learning to optimize MapReduce data processing according to claim 6, characterized in that the directory of the learning folder is:
a signature value calculated from the value of a signature template in a specified way.
9. The apparatus for autonomously learning to optimize MapReduce data processing according to claim 8, characterized in that the sampling storage unit comprises:
a hash calculation module, configured to hash the value of the signature template to obtain the signature value.
10. The apparatus for autonomously learning to optimize MapReduce data processing according to claim 7, characterized in that the first sampling module or the second sampling module comprises:
an adaptive submodule, configured to adapt the sampling strategy according to changes in the data.
CN201610305912.9A 2016-05-10 2016-05-10 Method and device for autonomously learning and optimizing MapReduce processing data Pending CN106021360A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610305912.9A CN106021360A (en) 2016-05-10 2016-05-10 Method and device for autonomously learning and optimizing MapReduce processing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610305912.9A CN106021360A (en) 2016-05-10 2016-05-10 Method and device for autonomously learning and optimizing MapReduce processing data

Publications (1)

Publication Number Publication Date
CN106021360A true CN106021360A (en) 2016-10-12

Family

ID=57100172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610305912.9A Pending CN106021360A (en) 2016-05-10 2016-05-10 Method and device for autonomously learning and optimizing MapReduce processing data

Country Status (1)

Country Link
CN (1) CN106021360A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810293A (en) * 2014-02-28 2014-05-21 广州云宏信息科技有限公司 Text classification method and device based on Hadoop
CN104536959A (en) * 2014-10-16 2015-04-22 南京邮电大学 Optimized method for accessing lots of small files for Hadoop
CN105303456A (en) * 2015-10-16 2016-02-03 国家电网公司 Method for processing monitoring data of electric power transmission equipment
CN105404652A (en) * 2015-10-29 2016-03-16 河海大学 Mass small file processing method based on HDFS

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107967265A (en) * 2016-10-18 2018-04-27 华为技术有限公司 Access method, data server and the file access system of file


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20161012