CN106021360A - Method and device for autonomously learning and optimizing MapReduce processing data - Google Patents
- Publication number
- CN106021360A CN106021360A CN201610305912.9A CN201610305912A CN106021360A CN 106021360 A CN106021360 A CN 106021360A CN 201610305912 A CN201610305912 A CN 201610305912A CN 106021360 A CN106021360 A CN 106021360A
- Authority
- CN
- China
- Prior art keywords
- data
- sampling
- key
- learning
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
Abstract
The invention discloses a method and a device for autonomously learning to optimize MapReduce data processing. The method comprises the following steps: in a job, sampling the data before the reduce computation in a preset manner, forming a learning file from the sampling keys obtained, and storing it in a learning folder whose directory name is the corresponding job label information; in a subsequent job, searching for the corresponding learning folder according to the job label information; if the corresponding learning folder is found, directly reusing the processing result stored in it to optimize the current processing; and if it is not found, forming and storing a new learning file. By sampling the keys of the data before the reduce computation and using the job label to judge whether the data of a subsequent job already has a corresponding learning file, the method and device determine whether the stored processing result can be reused. The learning method is simple and allows repeated jobs over similar data to be processed quickly and efficiently.
Description
Technical field
The present invention relates to the field of MapReduce data processing, and in particular to a method and apparatus for autonomously learning to optimize MapReduce data processing.
Background art
In practical big-data scenarios, many applications work on data with a high degree of similarity and repeatedly perform statistics, analysis, and computation over it. If the processing information of some historical data is collected and analyzed, the result is valuable for improving the efficiency of subsequent jobs that repeatedly process similar data.
Summary of the invention
The main object of the present invention is to provide a method and apparatus for autonomously learning to optimize MapReduce data processing that can exploit the processing results of highly similar data.
To achieve the above object, the present invention proposes a method for autonomously learning to optimize MapReduce data processing, comprising:
In a job, sampling the data before the reduce computation in a preset manner, and storing the learning file formed from the sampling keys obtained in a learning folder whose directory name is the corresponding job label information;
In a subsequent job, searching for the corresponding learning folder according to the job label information; if it exists, directly reusing the processing result in the learning folder to optimize the current processing; if it does not exist, forming and storing a new learning file.
Further, the step of sampling the data before the reduce computation in a preset manner and storing the resulting learning file in the learning folder whose directory name is the corresponding job label information comprises:
Obtaining one sampling key at an interval of a specified number of key-value pairs; or,
Obtaining one sampling key at an interval of a specified number of bytes.
Further, the directory name of the learning folder is:
A signature value computed from the value of a signature template by a specified calculation.
Further, the signature value is a hash computed over the value of the signature template.
Further, the step of obtaining one sampling key at an interval of a specified number of key-value pairs, or at an interval of a specified number of bytes, comprises:
Adaptively selecting the specified interval according to the variation of the data.
The present invention also provides a device for autonomously learning to optimize MapReduce data processing, comprising:
A sampling and storage unit for, in a job, sampling the data before the reduce computation in a preset manner and storing the learning file formed from the sampling keys obtained in a learning folder whose directory name is the corresponding job label information;
A selection unit for, in a subsequent job, searching for the corresponding learning folder according to the job label information; if it exists, directly reusing the processing result in the learning folder to optimize the current processing; if it does not exist, forming and storing a new learning file.
Further, the sampling and storage unit comprises:
A first sampling module for obtaining one sampling key at an interval of a specified number of key-value pairs; or,
A second sampling module for obtaining one sampling key at an interval of a specified number of bytes.
Further, the directory name of the learning folder is:
A signature value computed from the value of a signature template by a specified calculation.
Further, the sampling and storage unit comprises:
A hash calculation module for computing the signature value as a hash over the value of the signature template.
Further, the first sampling module, or the second sampling module, comprises:
A self-adaptation submodule for adapting the sampling strategy according to the variation of the data.
The method and apparatus of the present invention sample the keys of the data before the reduce computation for learning, then use the job label to judge whether the data of a subsequent job already has a corresponding learning file, and thereby decide whether the stored processing result can be reused. The learning method is simple, and repeated jobs over similar data can be processed quickly and efficiently.
Brief description of the drawings
Fig. 1 is a flow diagram of the method for autonomously learning to optimize MapReduce data processing according to an embodiment of the invention;
Fig. 2 is a schematic diagram of an embodiment in which the value of a key is represented as an offset within the data set;
Fig. 3 is a schematic diagram of the segmentation of the Map output data according to an embodiment of the invention;
Fig. 4 is a structural block diagram of the device for autonomously learning to optimize MapReduce data processing according to an embodiment of the invention;
Fig. 5 is a structural block diagram of the sampling and storage unit according to an embodiment of the invention.
The realization of the object, functional characteristics, and advantages of the invention will be further explained with reference to the embodiments and the accompanying drawings.
Detailed description of the invention
It should be appreciated that the specific embodiments described herein are intended only to explain the present invention, not to limit it.
With reference to Fig. 1, an embodiment of the present invention provides a method for autonomously learning to optimize MapReduce data processing, comprising the steps:
S1: in a job, sampling the data before the reduce computation in a preset manner, and storing the learning file formed from the sampling keys obtained in a learning folder whose directory name is the corresponding job label information;
S2: in a subsequent job, searching for the corresponding learning folder according to the job label information; if it exists, directly reusing the processing result in the learning folder to optimize the current processing; if it does not exist, forming and storing a new learning file.
In step S1, the data before the reduce computation are sampled. These data consist of multiple key-value pairs, each pair being simply a key (key) together with a value (value). Both the key and the value of each sampled pair are recorded; because the data themselves are sorted, the sampled keys are sorted as well. To make the corresponding sampling information easy to find, it is stored in a learning folder whose directory name is the job label information. In this embodiment one job may comprise multiple reduce computations and can therefore generate multiple corresponding learning files; these learning files are stored in the learning folder named by the job label information, and different jobs have different job label information.
In step S2, a job identical to an earlier one can be processed quickly, improving work efficiency.
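As a minimal sketch of this lookup-or-learn flow, the following could implement steps S1 and S2 at the folder level. The patent does not prescribe an implementation: the function name, the pickle format, and the local filesystem standing in for HDFS are all assumptions.

```python
import os
import pickle
import tempfile  # used below to create a throwaway base directory for a demo

def get_or_learn(base_dir, job_label, sample_fn):
    """Return the stored sampling result for job_label, learning on a miss.

    base_dir   -- root holding one learning folder per job label (assumption)
    job_label  -- string identifying the job (the patent's job label information)
    sample_fn  -- callable producing the sampling result for a fresh job
    Returns (result, hit) where hit tells whether a stored file was reused.
    """
    folder = os.path.join(base_dir, job_label)
    path = os.path.join(folder, "learning_file.pkl")
    if os.path.exists(path):                # subsequent job: reuse stored result
        with open(path, "rb") as f:
            return pickle.load(f), True
    os.makedirs(folder, exist_ok=True)      # first run: learn, then store
    result = sample_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, False
```

On the second call with the same label the sampling callable is never invoked; the stored result is returned directly, which is the optimization the embodiment describes.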
In this embodiment, step S1 of sampling the data before the reduce computation in a preset manner and storing the resulting learning file in the learning folder whose directory name is the corresponding job label information comprises:
S11: obtaining one sampling key at an interval of a specified number of key-value pairs; or,
S12: obtaining one sampling key at an interval of a specified number of bytes.
In step S11, the sampling mode may be called key-value-pair sampling. Sampling is simple: only the key-value pairs need to be counted, for example taking one sample every 5,000 key-value pairs. If every key-value pair has the same number of bytes, this is equivalent to counting bytes.
In step S12, the sampling mode may be called byte sampling. It is suitable when the byte counts of the key-value pairs differ. This mode can control the amount of data precisely so that the computation can be done in memory; for example, 500 MB is a fairly typical configuration.
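The two sampling modes of steps S11 and S12 might be sketched as follows. The function names are illustrative, and string lengths stand in for serialized record sizes:

```python
def sample_by_pairs(pairs, every_n):
    """Mode S11: one sampling key per every_n key-value pairs (input sorted by key)."""
    return [k for i, (k, _) in enumerate(pairs, 1) if i % every_n == 0]

def sample_by_bytes(pairs, every_n_bytes):
    """Mode S12: one sampling key each time roughly every_n_bytes have passed,
    which keeps sampling even when record sizes vary between pairs."""
    keys, seen = [], 0
    for k, v in pairs:
        seen += len(k) + len(v)       # approximate record size (assumption)
        if seen >= every_n_bytes:
            keys.append(k)
            seen = 0
    return keys
```

When all records are the same size the two modes pick equivalent keys, which matches the remark above that pair counting degenerates to byte counting.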
In this embodiment, the directory name of the learning folder is the signature value computed from the value of the signature template by a specified calculation.
The signature template is the job signature; it identifies the job uniquely and is used as the directory for storing the learning files on HDFS, so that a subsequent identical job can find the corresponding learning files. The template is composed of configuration parameters that can identify the job uniquely, for example:
"mapred.mapper.class,mapreduce.map.class,mapred.reducer.class,mapreduce.reduce.class,mapred.reduce.tasks,mapreduce.job.reduces,mapreduce.workflow.name,mapreduce.workflow.node.name".
The value of the signature template is the string formed by concatenating the values of the parameters in the template.
The signature value is the value computed from the value of the signature template by a specified calculation; in a specific embodiment, the signature value is a hash computed over the value of the signature template.
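A sketch of how such a signature value could be derived: the parameter list is taken from the text above, while the use of SHA-1 and the `job_signature` name are assumptions (the embodiment only requires some hash over the template value).

```python
import hashlib

# Configuration parameters of the signature template, as listed in the description.
SIGNATURE_PARAMS = [
    "mapred.mapper.class", "mapreduce.map.class",
    "mapred.reducer.class", "mapreduce.reduce.class",
    "mapred.reduce.tasks", "mapreduce.job.reduces",
    "mapreduce.workflow.name", "mapreduce.workflow.node.name",
]

def job_signature(conf):
    """Join the parameter values into the signature-template value and hash it;
    the hex digest names the learning folder on HDFS. SHA-1 is illustrative."""
    template_value = ",".join(conf.get(p, "") for p in SIGNATURE_PARAMS)
    return hashlib.sha1(template_value.encode()).hexdigest()
```

Two runs of the same job configuration then resolve to the same learning folder, and any change to a signature parameter produces a fresh folder.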
In this embodiment, the step of obtaining one sampling key at an interval of a specified number of key-value pairs, or at an interval of a specified number of bytes, comprises:
S110: adaptively selecting the specified interval according to the variation of the data.
In step S110, data variation means that the data at actual run time differ from the data at sampling time, and the difference can sometimes be very large. For example, the transaction data of an ordinary business day at Taobao and the transaction data of the Double 11 shopping festival differ by several orders of magnitude; if the result learned from the ordinary data were used to guide the processing of the grossly different Double 11 data, the data would certainly overflow from memory. To avoid this, different strategies can be formulated according to the actual situation, for example using different learning files for different periods, or regenerating the learning files. How often the learning files are regenerated can be decided according to the cost of learning; the cost of the sampling-based learning described above is very low, so learning can be redone on every run, thereby achieving adaptation to data variation.
In this embodiment, the sampling information can be saved in files on the Hadoop distributed file system HDFS so that subsequent jobs can access it. The directory on HDFS matters as well; the signature (signature) of the job is used. Sampling is performed per reduce process, so each reduce process produces one learning file. For the file format, Hadoop's native key-value file format SequenceFile is used to save the sampling information: the key is the sampling key, and the value is the offset of that key within the data set. As shown in Fig. 2, sampling the data once every 5,000 records yields four sampling keys, "cat", "fox", "lion", and "snake"; the value written to the file for each key is its offset (or sampling position) within the data set.
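The key-to-offset records of Fig. 2 could be produced as below; a plain Python list stands in for Hadoop's SequenceFile, and the byte accounting is a simplification of real record serialization.

```python
def build_learning_records(pairs, every_n):
    """Emit (sampling key, offset) records like the SequenceFile entries of
    Fig. 2: the key is the sampled key, the value its byte offset within the
    sorted data. Offsets count the bytes of all records before the sample."""
    records, offset = [], 0
    for i, (k, v) in enumerate(pairs, 1):
        if i % every_n == 0:
            records.append((k, offset))   # offset = start of this record
        offset += len(k) + len(v)         # advance past the record (assumption)
    return records
```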
In this embodiment, the sampled keys in the learning file can be used to build an in-memory mapping from keys to data-block (bucket) identifiers (ids). In subsequent jobs these mappings can be used to find the bucket to which a datum belongs, and hence to divide the data at the map end. Because the sampled keys are sorted, the buckets are sorted relative to one another as well, which forms a kind of globally ordered coarse-grained sort. The mapping can be built in several ways. The TreeMap approach is simple to build, occupies little memory, and looks up in roughly O(log n). Alternatively, a multidimensional array can be built in the learning stage, filled using the first few most significant bytes (Most Significant Bytes) of the sampling keys as array indices; lookup is an efficient O(1), but the probability of collisions is high.
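The TreeMap variant of the key-to-bucket mapping can be approximated with a binary search over the sorted sampling keys; Python's `bisect` stands in for Java's TreeMap, and the boundary convention (bucket i holds keys up to the i-th sampling key) is an assumption.

```python
import bisect

class BucketMap:
    """TreeMap-style key -> bucket-id mapping over the sorted sampling keys:
    N keys define N+1 buckets, and lookup is an O(log n) binary search."""
    def __init__(self, sampling_keys):
        self.bounds = sorted(sampling_keys)

    def bucket_of(self, key):
        # Keys up to bounds[i] fall in bucket i; keys past the last bound
        # fall in bucket N, preserving global order across buckets.
        return bisect.bisect_left(self.bounds, key)
```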
This embodiment uses a mixed classification scheme: the three-dimensional array is searched first, where CC (Collision Counter) is the number of collisions for an array element. If CC < 2 there is only one bucket. If CC = 2, both the 1st and the 2nd bucket id are meaningful, and a comparison with the full key decides between the 1st and the 2nd bucket. If CC > 2, the result of the TreeMap is used. The memory footprint of the three-dimensional array reaches 255*255*255*8 = 132,651,000 bytes, about 128 MB, so one array per reduce is clearly infeasible; instead all the reduces share the array. If a collision occurs between reduces, a special bit flag SB (Special Bit flag) helps determine the reduce, that is, the partition number (partition number).
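The mixed classification could be sketched as follows, with a one-byte prefix table standing in for the three-byte array, a sorted-list binary search playing the TreeMap role, and the shared-array and special-bit details omitted. All names are assumptions.

```python
import bisect

def build_mixed_index(sampling_keys, prefix_len=1):
    """Prefix table for O(1) lookups; entries record the collision counter CC,
    the first bucket id for that prefix, and the boundary keys sharing it."""
    bounds = sorted(sampling_keys)
    table = {}   # prefix -> (CC, first bucket id, boundary keys with prefix)
    for i, b in enumerate(bounds):
        p = b[:prefix_len]
        cc, base, keys = table.get(p, (0, i, []))
        table[p] = (cc + 1, base, keys + [b])
    return bounds, table

def mixed_bucket_of(index, key, prefix_len=1):
    """Fast path when CC <= 2 (compare full keys against the few boundaries);
    fall back to the TreeMap-style binary search otherwise."""
    bounds, table = index
    entry = table.get(key[:prefix_len])
    if entry is not None and entry[0] <= 2:
        cc, base, keys = entry
        for j, b in enumerate(keys):
            if key <= b:
                return base + j
        return base + cc
    return bisect.bisect_left(bounds, key)   # unseen prefix or CC > 2
```

The fast path must agree with the fallback for every key, which is what the test below checks.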
The method of this embodiment can segment the Map output data and process the segments concurrently in multiple pipelined batches. With reference to Fig. 3, during Map output each Partition (partition) can use the learning file of its corresponding Reduce to divide the data into data blocks (buckets): with N sampling keys, the partition can be divided into N+1 data blocks, and because the keys are sorted, the data blocks are sorted relative to one another as well. In a pipelined MapReduce, one partition can be shuffled in multiple batches; each batch (Pass) comprises the identically numbered data blocks (buckets) from all of the map output files (MOFs), and since the buckets are ordered, the batches can run concurrently. Additionally, by tuning the bucket size, the shuffle and reduce processing of each batch can be kept in memory, greatly reducing hard-disk I/O accesses and latency.
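The division of one partition into N+1 ordered buckets around the N sampling keys might look like this; the names and the in-memory list representation are assumptions.

```python
import bisect

def split_partition(pairs, sampling_keys):
    """Split one map-side partition into N+1 sorted buckets around the N
    sampling keys, so that equally numbered buckets from every map output
    file can be shuffled and reduced together, one in-memory batch per number."""
    bounds = sorted(sampling_keys)
    buckets = [[] for _ in range(len(bounds) + 1)]
    for k, v in pairs:                            # pairs assumed sorted by key
        buckets[bisect.bisect_left(bounds, k)].append((k, v))
    return buckets
```

Because the input pairs and the sampling keys are both sorted, every bucket is internally sorted and the buckets are sorted with respect to one another, which is what allows batch-by-batch merging downstream.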
In a specific embodiment, experimental data were compared.
(1) Test environment:
Four data nodes
Hadoop software from the three big vendors CDH, HDP, and MAPR, with similar results
CPU: 2 x 8 cores
RAM: 128 GB
Disk: 12 x 2 TB
(2) Measured results.
With or without learning files, the code path is the same; without a learning file, the default is to assume a single data block (bucket), which, like the native implementation, shuffles the whole partition (partition) and performs the merge (merge) and reduce computations without being able to avoid hard-disk I/O. Table 1 shows that using learning files with batched processing markedly improves the data-processing capability of MapReduce, roughly 1.6 to 2 times the original. In addition, Hadoop's own statistics reports show that the amount of data accessed on hard disk is greatly reduced, indicating that with learning files the batched processing turns into in-memory computation.
The method of the present invention samples the keys of the data before the reduce computation for learning, then uses the job label to judge whether the data of a subsequent job already has a corresponding learning file and thereby decides whether the stored processing result can be reused. The learning method is simple, and repeated jobs over similar data can be processed quickly and efficiently.
With reference to Fig. 4, an embodiment of the present invention also provides a device for autonomously learning to optimize MapReduce data processing, comprising:
A sampling and storage unit 10 for, in a job, sampling the data before the reduce computation in a preset manner and storing the learning file formed from the sampling keys obtained in a learning folder whose directory name is the corresponding job label information;
A selection unit 20 for, in a subsequent job, searching for the corresponding learning folder according to the job label information; if it exists, directly reusing the processing result in the learning folder to optimize the current processing; if it does not exist, forming and storing a new learning file.
The sampling and storage unit 10 samples the data before the reduce computation, as described above for step S1 of the method embodiment: the data consist of sorted key-value pairs, both the key and the value of each sampled pair are recorded, and because one job may comprise multiple reduce computations, multiple learning files may be generated and stored in the learning folder named by the job label information, with different jobs having different job label information.
With the selection unit 20, a job identical to an earlier one can be processed quickly, improving work efficiency.
With reference to Fig. 5, in this embodiment the sampling and storage unit 10 comprises:
A first sampling module 11 for obtaining one sampling key at an interval of a specified number of key-value pairs; or,
A second sampling module 12 for obtaining one sampling key at an interval of a specified number of bytes.
The sampling mode of the first sampling module 11 may be called key-value-pair sampling; it is simple, requiring only that key-value pairs be counted, for example one sample every 5,000 key-value pairs. If every key-value pair has the same number of bytes, this is equivalent to counting bytes.
The sampling mode of the second sampling module 12 may be called byte sampling; it is suitable when the byte counts of the key-value pairs differ. This mode can control the amount of data precisely so that the computation can be done in memory; for example, 500 MB is a fairly typical configuration.
In this embodiment, the directory name of the learning folder is the signature value computed from the value of the signature template by a specified calculation. The signature template, the configuration parameters composing it, and the value of the signature template are as described above for the method embodiment. In a specific embodiment, the sampling and storage unit 10 comprises a hash calculation module 13 for performing the hash computation over the value of the signature template to obtain the signature value.
In this embodiment, the first sampling module 11, or the second sampling module 12, comprises a self-adaptation submodule 110 for adapting the sampling strategy according to the variation of the data. Data variation, and the strategies for coping with it (using different learning files for different periods, or regenerating the learning files on each run at the low cost of the sampling-based learning described above), are as explained for step S110 of the method embodiment.
In this embodiment, the sampling information is saved on HDFS in SequenceFile format under the job signature directory, the in-memory key-to-bucket mappings are built from the sampled keys, and the mixed classification scheme with the collision counter CC and the special bit flag SB is applied, all in the same manner as described above for the method embodiment.
The device of this embodiment likewise segments the Map output data and processes it concurrently in multiple pipelined batches: each Partition uses the learning file of its corresponding Reduce to divide the data into N+1 sorted data blocks from N sampling keys, the identically numbered buckets from all map output files form one batch, and tuning the bucket size keeps the shuffle and reduce of each batch in memory, as described above for the method embodiment, greatly reducing hard-disk I/O accesses and latency.
The device of the present invention samples the keys of the data before the reduce computation for learning, then uses the job label to judge whether the data of a subsequent job already has a corresponding learning file and thereby decides whether the stored processing result can be reused. The learning method is simple, and repeated jobs over similar data can be processed quickly and efficiently.
The above are only preferred embodiments of the present invention and do not thereby limit its scope of claims; any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect use in other related technical fields, is likewise included in the patent protection scope of the present invention.
Claims (10)
1. an autonomic learning optimizes the method that MapReduce processes data, it is characterised in that including:
In an operation, the data before calculating reduce sample according to default mode, and by taking of obtaining
In the leaning portfolio of sample key-like become learning files to store operation label information that catalogue is its correspondence;
In subsequent job, search corresponding leaning portfolio according to its operation label information, if it has, then
Directly invoke the result in leaning portfolio to process to optimize this;If it is not, form new
Practise file and store.
The most according to claim 1 by machine autonomic learning method optimize MapReduce process number
According to method, it is characterised in that described in an operation, to reduce calculate before data according to default
Mode samples, and operation label that catalogue is its correspondence is believed to become learning files to store the sampling key-like obtained
Step in the leaning portfolio of breath, including:
A sampling key is obtained at interval of the key-value pair specified number;Or,
A sampling key is obtained at interval of the byte specified number.
Autonomic learning the most according to claim 1 optimizes the method that MapReduce processes data, and it is special
Levying and be, the catalogue of described leaning portfolio is:
The signature value that the value of signature template calculates by specifying calculation.
Autonomic learning the most according to claim 3 optimizes the method that MapReduce processes data, and it is special
Levying and be, described signature value is that the value of signature template carries out the value that Hash calculation goes out.
5. The method for optimizing MapReduce data processing through autonomous learning according to claim 2, characterized in that the step of obtaining one sampling key at every specified number of key-value pairs, or obtaining one sampling key at every specified number of bytes, comprises:
adaptively selecting the specified interval number according to the variation of the data.
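One way to read claim 5's adaptive interval selection is sketched below. The linear scaling rule and the `change_rate` measure (e.g. the fraction of keys in the last window not seen before) are assumptions; the claim only says the interval is chosen adaptively from the data's variation:

```python
def adaptive_interval(base_interval, change_rate):
    """Scale the sampling interval by how fast the data varies:
    change_rate in [0, 1]; more change -> a smaller interval,
    i.e. denser sampling. Never drops below 1."""
    return max(1, round(base_interval * (1.0 - change_rate)))
```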
6. A device for optimizing MapReduce data processing through autonomous learning, characterized by comprising:
a sampling and storage unit, configured to, in a job, sample the data before the reduce computation in a preset manner, form the obtained sampling keys into a learning file, and store the file in a learning folder whose directory is the job's corresponding job label information;
a selection unit, configured to, in a subsequent job, search for the corresponding learning folder according to the job's label information; if the folder is found, directly invoke the processing result in the learning folder to optimize the current processing; if it is not found, form a new learning file and store it.
7. The device for optimizing MapReduce data processing through autonomous machine learning according to claim 6, characterized in that the sampling and storage unit comprises:
a first sampling module, configured to obtain one sampling key at every specified number of key-value pairs; or,
a second sampling module, configured to obtain one sampling key at every specified number of bytes.
8. The device for optimizing MapReduce data processing through autonomous learning according to claim 6, characterized in that the directory of the learning folder is:
a signature value computed from the value of a signature template by a specified calculation method.
9. The device for optimizing MapReduce data processing through autonomous learning according to claim 8, characterized in that the sampling and storage unit comprises:
a hash calculation module, configured to obtain the signature value by applying a hash calculation to the value of the signature template.
10. The device for optimizing MapReduce data processing through autonomous learning according to claim 7, characterized in that the first sampling module, or the second sampling module, comprises:
an adaptive submodule, configured to adaptively select the sampling strategy according to the variation of the data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610305912.9A CN106021360A (en) | 2016-05-10 | 2016-05-10 | Method and device for autonomously learning and optimizing MapReduce processing data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610305912.9A CN106021360A (en) | 2016-05-10 | 2016-05-10 | Method and device for autonomously learning and optimizing MapReduce processing data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106021360A true CN106021360A (en) | 2016-10-12 |
Family
ID=57100172
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610305912.9A Pending CN106021360A (en) | 2016-05-10 | 2016-05-10 | Method and device for autonomously learning and optimizing MapReduce processing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021360A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107967265A (en) * | 2016-10-18 | 2018-04-27 | 华为技术有限公司 | Access method, data server and the file access system of file |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103810293A (en) * | 2014-02-28 | 2014-05-21 | 广州云宏信息科技有限公司 | Text classification method and device based on Hadoop |
CN104536959A (en) * | 2014-10-16 | 2015-04-22 | 南京邮电大学 | Optimized method for accessing lots of small files for Hadoop |
CN105303456A (en) * | 2015-10-16 | 2016-02-03 | 国家电网公司 | Method for processing monitoring data of electric power transmission equipment |
CN105404652A (en) * | 2015-10-29 | 2016-03-16 | 河海大学 | Mass small file processing method based on HDFS |
Worldwide applications: 2016-05-10 — CN CN201610305912.9A (CN106021360A), status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102915347B (en) | A kind of distributed traffic clustering method and system | |
CN108304409B (en) | Carry-based data frequency estimation method of Sketch data structure | |
CN104778182B (en) | Data lead-in method and system based on HBase | |
US20160103858A1 (en) | Data management system comprising a trie data structure, integrated circuits and methods therefor | |
CN104618304B (en) | Data processing method and data handling system | |
WO2007085187A1 (en) | Method of data retrieval, method of generating index files and search engine | |
CN109885782B (en) | Ecological environment space big data integration method | |
CN103177035A (en) | Data query device and data query method in data base | |
CN106599190A (en) | Dynamic Skyline query method based on cloud computing | |
CN108140022B (en) | Data query method and database system | |
CN108304460B (en) | Improved database positioning method and system | |
CN113918605A (en) | Data query method, device, equipment and computer storage medium | |
US8438173B2 (en) | Indexing and querying data stores using concatenated terms | |
CN110083731B (en) | Image retrieval method, device, computer equipment and storage medium | |
CN108319604B (en) | Optimization method for association of large and small tables in hive | |
CN106021360A (en) | Method and device for autonomously learning and optimizing MapReduce processing data | |
CN110348693B (en) | Multi-robot rapid task scheduling method based on multi-core computing | |
Feng et al. | Real-time SLAM relocalization with online learning of binary feature indexing | |
CN111858607A (en) | Data processing method and device, electronic equipment and computer readable medium | |
CN104361090B (en) | Data query method and device | |
CN108121807B (en) | Method for realizing multi-dimensional Index structure OBF-Index in Hadoop environment | |
CN106845787A (en) | A kind of data method for automatically exchanging and device | |
CN107315801B (en) | parallel discrete event simulation system initialization data storage method | |
CN115687352A (en) | Storage method and device | |
CN108536819A (en) | Integer arranges method, apparatus, server and the storage medium with character string comparison |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161012 |