CN103617087A - MapReduce optimizing method suitable for iterative computations - Google Patents
- Publication number
- CN103617087A CN103617087A CN201310600745.7A CN201310600745A CN103617087A CN 103617087 A CN103617087 A CN 103617087A CN 201310600745 A CN201310600745 A CN 201310600745A CN 103617087 A CN103617087 A CN 103617087A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a MapReduce optimization method suitable for iterative computation, applied in a Hadoop cluster system comprising a master node and a plurality of slave nodes. The method comprises the following steps: the master node receives a plurality of Hadoop jobs submitted by users; the job service process of the master node places the jobs in a job queue, where they wait to be scheduled by the job scheduler of the master node; the master node waits for task requests sent by the slave nodes; after the master node receives a task request, the job scheduler of the master node preferentially schedules localized tasks; and if the slave node that sent the task request has no localized task, predictive scheduling is performed according to the task types of the Hadoop jobs. The method supports traditional data-intensive applications while also supporting iterative computation transparently and efficiently; dynamic data and static data are treated separately, and the volume of data transmission is reduced.
Description
Technical field
The invention belongs to the field of parallel computing and mass data processing, and more particularly relates to a MapReduce optimization method suitable for iterative computation.
Background technology
Since the start of the 21st century, the scale of data to be processed has kept growing: terabyte-scale datasets are increasingly common, and even petabyte-scale datasets have appeared. Data at this scale is far beyond the processing power of a single PC. The demand for such processing power has driven the development of parallel and distributed computing platforms. Against this background, Google's MapReduce model emerged as a popular data-intensive computing model for large cluster environments.
MapReduce is a programming model for parallel operations on large-scale datasets (greater than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming, with additional characteristics borrowed from vector programming languages. The model makes it easy for programmers without any experience in distributed parallel programming to run their programs on a distributed system. In this model, all data is organized as <key, value> pairs. When programming, all a programmer needs to do is implement the Map and Reduce functions: the Map function processes an input <key, value> pair and outputs zero or more intermediate key-value pairs, and the Reduce function reads the intermediate output of Map and finally produces zero or more results. The MapReduce model follows the principle of relative independence: there are no data dependencies among Map tasks or among Reduce tasks.
The design philosophy of the MapReduce model makes it well suited to batch-style computation, such as log analysis and text processing. Besides such batch applications, however, there are also applications based on machine learning and pattern recognition, typical examples being computer vision and data mining, whose core algorithms are designed in an iterative manner. The current open-source implementation of the MapReduce model, Hadoop, cannot support iterative computation transparently and efficiently, and some characteristics of Hadoop are even unsuitable for iterative computation. With the development of social networks, computer vision, data mining and the like, the data-processing scale of this class of applications keeps growing, and so does the demand for a parallel computing model that supports them effectively.
Summary of the invention
In view of the above defects of, or improvement requirements on, the prior art, the present invention provides a MapReduce optimization method suitable for iterative computation. Its objective is to improve on the basis of Hadoop so as to support both traditional data-intensive applications and, transparently and efficiently, iterative computation, and to study and realize the reduction of data transmission volume from the two aspects of dynamic data and static data.
To achieve the above objective, according to one aspect of the present invention, a MapReduce optimization method suitable for iterative computation is provided, applied in a Hadoop cluster system comprising a master node and a plurality of slave nodes, the method comprising the following steps:
(1) the master node receives a plurality of Hadoop jobs submitted by users; the job service process of the master node places the jobs in a job queue, where they wait to be scheduled by the job scheduler of the master node;
(2) the master node waits for task requests sent by the slave nodes; after receiving a task request, the job scheduler of the master node preferentially schedules localized tasks; if the slave node that sent the task request has no localized task, predictive scheduling is performed according to the task types of the Hadoop jobs: a compute-type task of a Hadoop job is scheduled directly, while a transfer-type task is postponed by a certain interval at a time, and the Hadoop job is scheduled only when the total delay reaches a delay threshold;
(3) a slave node receives the task of a Hadoop job scheduled by the master node and handles it according to the job type, job types being divided into iterative and non-iterative: a non-iterative job is processed in the conventional Hadoop manner; for an iterative job, a Map-side shuffle process is added before the Map phase to read dynamic data for the Map tasks, in the Reduce phase the dynamic data is cached locally under the management of the slave node's dynamic-data cache component, and after the job completes the final result is saved in HDFS.
Preferably, step (2) specifically comprises the following sub-steps:
(2-1) the job service process on the master node monitors and waits for heartbeat messages sent by the task service processes of the slave nodes; a heartbeat message carries the current running information of the slave node, specifically its total number of slots and the number of slots currently running;
(2-2) upon receiving a heartbeat message from a slave node, the master node calculates from the message the number of idle slots on that slave node and the average number of running slots across the whole Hadoop cluster system, and judges from the result whether a task of a job needs to be allocated to the current slave node; if not, return to step (2-1), otherwise execute step (2-3);
(2-3) set counter i = 0;
(2-4) judge whether the i-th Hadoop job has a localized task on the current slave node, i.e. whether the current slave node stores an input data split (Split) of the i-th Hadoop job; if not, proceed to step (2-5); if so, proceed to step (2-11);
(2-5) set i = i + 1, and judge whether i equals the number of Hadoop jobs; if so, enter step (2-6), otherwise return to step (2-4);
(2-6) set counter j = 0;
(2-7) judge whether the task type of the j-th Hadoop job is compute-type or transfer-type; if compute-type, enter step (2-12); if transfer-type, enter step (2-8);
(2-8) postpone the task scheduling of the j-th Hadoop job by one heartbeat interval;
(2-9) judge whether the total delay time for which the tasks of the j-th Hadoop job have been postponed reaches a threshold; if so, proceed to step (2-12), otherwise proceed to step (2-10);
(2-10) set j = j + 1, and judge whether j equals the number of Hadoop jobs; if so, return to step (2-1), otherwise return to step (2-7);
(2-11) dispatch the localized task of the i-th Hadoop job to the current slave node, and the process ends;
(2-12) dispatch the task of the j-th Hadoop job to the current slave node, and the process ends.
Preferably, step (2-2) is specifically as follows: the number of idle slots on the current slave node equals its total number of slots minus the number of slots currently running, and the average number of running slots of the whole Hadoop cluster system is the sum of the running slots of all slave nodes monitored by the tracking process divided by the number of slave nodes; if the number of idle slots of the current node equals 0, no task needs to be allocated, and likewise if the number of slots currently running on the slave node is greater than the average number of running slots of the whole Hadoop cluster system, no task needs to be allocated.
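The slot bookkeeping of step (2-2) can be sketched as follows. This is a minimal illustrative sketch, not Hadoop code: "run channel" is read as "running slot", the cluster average is taken per slave node, and all function and parameter names are the author's own assumptions.

```python
def average_running_slots(cluster_running_slots):
    """Cluster-wide average of step (2-2): the sum of running slots over all
    slave nodes divided by the number of slave nodes."""
    return sum(cluster_running_slots) / len(cluster_running_slots)

def should_assign_task(total_slots, running_slots, avg_running):
    """Decide whether the master should allocate a task to the requesting
    slave: refuse when the slave has no idle slot, or when it is already
    running more tasks than the cluster average."""
    idle = total_slots - running_slots      # idle slots = total - running
    if idle == 0:
        return False
    if running_slots > avg_running:
        return False
    return True
```

In other words, a heartbeat only leads to steps (2-3) onward when the sender is both not full and not busier than average.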
Preferably, step (3) specifically comprises the following sub-steps:
(3-1) receive the task of a Hadoop job scheduled by the master node;
(3-2) judge whether the job type of the task is iterative or non-iterative; if iterative, proceed to step (3-3); if non-iterative, proceed to step (3-4);
(3-3) judge whether the task type of this iterative job is a Map task or a Reduce task; if a Map task, proceed to step (3-5); if a Reduce task, proceed to step (3-9);
(3-4) judge whether the task type of this non-iterative job is a Map task or a Reduce task; if a Map task, proceed to step (3-8); if a Reduce task, proceed to step (3-9);
(3-5) judge whether this iterative job is running for the first time; if not, proceed to step (3-6); if so, proceed to step (3-7);
(3-6) the task service process of the slave node where the Map task process (Mapper) resides starts a plurality of data-copy threads, which request via HTTP the slave nodes where the Reduce task processes reside and obtain the dynamic data files computed by the Reduce task processes, then proceed to step (3-8);
(3-7) the Map task process reads the initial values of the dynamic data, then proceeds to step (3-8);
(3-8) the Hadoop cluster system decomposes the input file of the job into splits, and the Map task process processes a split, then proceeds to step (3-14);
(3-9) the Reduce task process (Reducer) of the slave node starts data-copy threads, which request via HTTP the slave nodes where the Map task processes reside and obtain the intermediate output files of the Map task processes; the intermediate output files are stored on the local disk of the slave node; a copied file is first placed in a memory buffer, and the multiple copied files are merged into a final large file, which is sorted by key; then proceed to step (3-10);
(3-10) the Reduce task process reads records in <key, iterator> form from the obtained large file and executes the Reduce() method, then proceeds to step (3-11);
(3-11) judge whether the job type is iterative or non-iterative; if iterative, proceed to step (3-12); if non-iterative, proceed to step (3-13);
(3-12) the dynamic-data cache component of the slave node caches the results of the Reduce task process in memory, spilling to a local disk file when the buffer is full, then proceed to step (3-14);
(3-13) the Reduce task process writes its results into HDFS, then proceeds to step (3-14);
(3-14) the task execution ends; return to step (3-1).
Preferably, the dynamic data files in step (3-6) are managed by the dynamic-data cache component of the slave node and are kept in memory and on local disk; the copied dynamic data is likewise managed by the dynamic-data cache component, and the multiple Map task processes on the same slave node request the dynamic data files from their local slave node, these data serving as the dynamic-data input that the Map task processes need.
Preferably, in step (3-8), the size of a split defaults to the HDFS block size, the block size being configurable through the configuration file; the Map task process decomposes the split into records of the <key, value> form it needs and executes the Map() method; the results of execution are cached in memory and spilled to disk when the buffer is full; a spilled file records its partition information, and a single spill file is sorted first by partition and then by key; if multiple spill files need to be merged into one large file, this process merge-sorts the spill files.
In general, compared with the prior art, the above technical scheme conceived by the present invention can achieve the following beneficial effects:
(1) Better data locality of tasks: thanks to step (2), the scheduling strategy proposed in the present invention, compared with the delay scheduling strategy, better balances the demand for task localization against the overhead of task delay: tasks in Hadoop are divided into compute-intensive and transfer-intensive classes, and the delay time is predicted in real time in combination with the load information of the cluster network. This both raises the task localization ratio and effectively reduces the overall delay overhead of jobs. The present invention therefore has an obvious advantage.
(2) Lower dynamic-data transmission overhead: thanks to step (3), the dynamic-data caching strategy proposed by the present invention greatly reduces the cluster-network transmission overhead caused by reading and writing dynamic data. Theoretical analysis and experimental verification show that the total dynamic-data transmission volume of an iterative job is proportional only to the number of nodes on which its tasks are placed, and has a definite upper bound, namely the number of cluster nodes. The present invention therefore has an obvious advantage.
(3) Higher cluster efficiency under multi-job, multi-user environments: in such environments the network resources of the cluster become the bottleneck of cluster efficiency and greatly limit its effective utilization. By optimizing the network data flows of Hadoop, the present invention reduces cluster-network transmission overhead, effectively relieves cluster network load, reduces competition for network resources among users and among jobs, and improves the effective utilization of the cluster under multiple jobs and multiple users. The present invention therefore has an obvious advantage.
(4) Efficient and transparent support for iterative computation: compared with traditional Hadoop, the present invention supports both traditional batch jobs and, better, iterative jobs, so its fields of application are wider, e.g. social networks, computer vision and data mining. The present invention therefore has an obvious advantage.
Brief description of the drawings
Fig. 1 is a flowchart of the MapReduce optimization method of the present invention suitable for iterative computation.
Fig. 2 is a detailed flowchart of step (2) of the present invention.
Fig. 3 is a detailed flowchart of step (3) of the present invention.
Detailed description of the embodiments
In order to make the objective, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention described below can be combined with each other as long as they do not conflict.
The technical terms of the present invention are first explained and illustrated below:
Dynamic data: in an iterative computation problem, a variable whose new value is continually derived, directly or indirectly, from its old value by recursion.
Static data: in an iterative computation problem, data that never changes, generally the original input data of the algorithm.
Compute-type task: a Map task whose computation time accounts for the major part of its whole processing procedure.
Transfer-type task: a Map task whose data transmission time accounts for the major part of its whole processing procedure.
Localized task: a Map task whose input data split is stored locally on the slave node.
Delay scheduling strategy: a strategy that postpones the scheduling of non-localized tasks.
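The distinction between compute-type ("calculation type") and transfer-type ("mode transmission") tasks can be expressed as a trivial predicate. This is an illustrative sketch only; the patent does not specify how the two time components are measured, so the function and its parameters are assumptions.

```python
def classify_task(compute_time, transfer_time):
    """Classify a Map task per the definitions above: compute-type when
    computation dominates its total processing time, transfer-type when
    data transmission dominates."""
    return "compute" if compute_time >= transfer_time else "transfer"
```

The predictive scheduler of step (2) treats the two classes differently: compute-type tasks are dispatched immediately, transfer-type tasks are delayed.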
The overall idea of the present invention is to focus on multi-user, multi-job cluster environments and reduce the network transmission load of the cluster by optimizing the static data flow and the shared (dynamic) data flow. For the optimization of the static data flow, the main contribution of the present invention is a predictive scheduling algorithm; for the optimization of the shared data flow, the present invention achieves its goal through a data caching strategy and an added Map-side shuffle process.
The MapReduce optimization method of the present invention suitable for iterative computation is applied in a Hadoop cluster system comprising a master (Master) node and a plurality of slave (Slave) nodes, and comprises the following steps (as shown in Fig. 1):
(1) the master node receives a plurality of Hadoop jobs submitted by users; the job service process (JobTracker) of the master node places the jobs in a job queue, where they wait to be scheduled by the job scheduler of the master node;
(2) the master node waits for task requests sent by the slave nodes; after receiving a task request, the job scheduler of the master node preferentially schedules localized tasks; if the slave node that sent the task request has no localized task, predictive scheduling is performed according to the task types of the Hadoop jobs: a compute-type task of a Hadoop job is scheduled directly, while a transfer-type task is postponed by a certain interval at a time, and the Hadoop job is scheduled only when the total delay reaches a delay threshold. This step specifically comprises the following sub-steps (as shown in Fig. 2):
(2-1) the job service process on the master node monitors and waits for heartbeat messages sent by the task service processes (TaskTracker) of the slave nodes; a heartbeat message carries the current running information of the slave node, specifically including its total number of slots and the number of slots currently running;
(2-2) upon receiving a heartbeat message from a slave node, the master node calculates from the message the number of idle slots on that slave node and the average number of running slots across the whole Hadoop cluster system, and judges from the result whether a task of a job needs to be allocated to the current slave node; if not, return to step (2-1), otherwise execute step (2-3). Specifically, the number of idle slots on the current slave node equals its total number of slots minus the number of slots currently running, and the average number of running slots of the whole Hadoop cluster system is the sum of the running slots of all slave nodes monitored by the tracking process divided by the number of slave nodes; if the number of idle slots of the current node equals 0, no task needs to be allocated, and likewise if the number of slots currently running on the slave node is greater than the average number of running slots of the whole Hadoop cluster system, no task needs to be allocated;
(2-3) set counter i = 0;
(2-4) judge whether the i-th Hadoop job has a localized task on the current slave node, i.e. whether the current slave node stores an input data split (Split) of the i-th Hadoop job; if not, proceed to step (2-5); if so, proceed to step (2-11);
(2-5) set i = i + 1, and judge whether i equals the number of Hadoop jobs; if so, enter step (2-6), otherwise return to step (2-4);
(2-6) set counter j = 0;
(2-7) judge whether the task type of the j-th Hadoop job is compute-type or transfer-type; if compute-type, enter step (2-12); if transfer-type, enter step (2-8);
(2-8) postpone the task scheduling of the j-th Hadoop job by one heartbeat interval; specifically, the heartbeat interval is the interval at which a slave node sends heartbeat messages, here 3 seconds;
(2-9) judge whether the total delay time for which the tasks of the j-th Hadoop job have been postponed reaches a threshold; if so, proceed to step (2-12), otherwise proceed to step (2-10). The threshold can be configured by the cluster administrator on the following basis: the larger the threshold, the larger the task localization ratio but also the larger the delay overhead; the smaller the threshold, the relatively smaller the localization ratio but also the smaller the delay overhead. The threshold defaults to 3 minutes;
(2-10) set j = j + 1, and judge whether j equals the number of Hadoop jobs; if so, return to step (2-1), otherwise return to step (2-7);
(2-11) dispatch the localized task of the i-th Hadoop job to the current slave node, and the process ends;
(2-12) dispatch the task of the j-th Hadoop job to the current slave node, and the process ends.
The advantage of this step is that tasks are classified: compute-type tasks are scheduled in the default manner, while transfer-type tasks are scheduled predictively. This both improves the localization ratio of tasks and reduces the overhead brought by delay.
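The scheduling loop of sub-steps (2-3) through (2-12) can be sketched as follows. This is a simplified, illustrative reading of the flow under stated assumptions: jobs are plain dictionaries (not a Hadoop API), the step-number comments map onto the sub-steps above, and per-job delay state is kept on the job itself.

```python
def schedule_on_heartbeat(jobs, slave, heartbeat_interval=3, delay_threshold=180):
    """One heartbeat's worth of scheduling: first scan jobs for a task whose
    input split is stored on the requesting slave (a localized task); failing
    that, dispatch the first compute-type job immediately, and delay each
    transfer-type job by one heartbeat until its accumulated delay reaches
    the threshold. Returns the chosen job, or None to wait for the next
    heartbeat. `jobs` is a list of dicts with illustrative keys."""
    # Steps (2-3)-(2-5): prefer a localized task.
    for job in jobs:
        if slave in job["split_locations"]:
            return job                      # step (2-11): dispatch localized task
    # Steps (2-6)-(2-10): predictive scheduling for non-localized tasks.
    for job in jobs:
        if job["task_type"] == "compute":
            return job                      # step (2-12): schedule immediately
        job["delay"] = job.get("delay", 0) + heartbeat_interval   # step (2-8)
        if job["delay"] >= delay_threshold:
            return job                      # steps (2-9)/(2-12): threshold reached
    return None                             # step (2-1): wait for next heartbeat
```

With the default 3-second heartbeat and 3-minute threshold, a transfer-type job waits at most 60 heartbeats for a localized slot before being scheduled remotely.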
(3) a slave node receives the task of a Hadoop job scheduled by the master node and handles it according to the job type, job types being divided into iterative and non-iterative. A non-iterative job is processed in the conventional Hadoop manner. For an iterative job, a Map-side shuffle process is added before the Map phase to read dynamic data for the tasks of the Map phase (i.e. the Map tasks); in the Reduce phase, the dynamic data is cached locally under the management of the slave node's dynamic-data cache component; and after the job completes, the final result is saved in the Hadoop Distributed File System (HDFS). This step specifically comprises the following sub-steps (as shown in Fig. 3):
(3-1) receive the task of a Hadoop job scheduled by the master node;
(3-2) judge whether the job type of the task is iterative or non-iterative; if iterative, proceed to step (3-3); if non-iterative, proceed to step (3-4);
(3-3) judge whether the task type of this iterative job is a Map task or a Reduce task; if a Map task, proceed to step (3-5); if a Reduce task, proceed to step (3-9);
(3-4) judge whether the task type of this non-iterative job is a Map task or a Reduce task; if a Map task, proceed to step (3-8); if a Reduce task, proceed to step (3-9);
(3-5) judge whether this iterative job is running for the first time; if not, proceed to step (3-6); if so, proceed to step (3-7);
(3-6) the task service process of the slave node where the Map task process (Mapper) resides starts a plurality of data-copy threads, which request via HTTP the slave nodes where the Reduce task processes (Reducer) reside and obtain the dynamic data files computed by the Reduce task processes, then proceed to step (3-8). These dynamic data files are managed by the dynamic-data cache component of the slave node and are kept in memory and on local disk; the copied dynamic data is likewise managed by the dynamic-data cache component, and the multiple Map task processes on the same slave node request the dynamic data files from their local slave node, these data serving as the dynamic-data input that the Map task processes need.
The advantages of this sub-step are: 1. the dynamic data produced after the Reduce phase is kept locally, reducing the overhead of writing it to HDFS; 2. in the Map phase the dynamic data is requested from the slave node of the Reduce task process and kept locally on the slave node of the Map task process, so Map task processes request the data from their local slave node, which greatly reduces the transmission volume of dynamic data.
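The dynamic-data cache component described above can be sketched as follows. This is a minimal in-memory sketch under stated assumptions: disk spilling is omitted, lists stand in for dynamic data files, and all class and method names are illustrative rather than taken from the patent or Hadoop.

```python
class DynamicDataCache:
    """Slave-side dynamic-data cache: Reduce output for one iteration is
    kept locally and served to the Map tasks of the next iteration, so all
    co-located Mappers share a single copy instead of each fetching it
    over the network (or reading it back from HDFS)."""
    def __init__(self):
        self._store = {}          # (job_id, iteration) -> dynamic data
        self.remote_fetches = 0   # counts simulated cross-node transfers

    def put(self, job_id, iteration, data):
        # Reduce side, step (3-12): cache the Reduce result locally.
        self._store[(job_id, iteration)] = data

    def get(self, job_id, iteration, fetch_remote):
        # Map side, step (3-6): hit the local cache first; only the first
        # miss on this node triggers a copy from the Reducer's node.
        key = (job_id, iteration)
        if key not in self._store:
            self.remote_fetches += 1
            self._store[key] = fetch_remote()
        return self._store[key]
```

This mirrors the bound claimed in the beneficial effects: however many Map tasks run on a node, each (job, iteration) pair is transferred to that node at most once.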
(3-7) the Map task process reads the initial values of the dynamic data, then proceeds to step (3-8). Briefly, an iterative job needs and produces dynamic data; when the job executes for the first time, the initial values of this dynamic data need to be provided by the user;
(3-8) the Hadoop cluster system decomposes the input file of the job into splits, and the Map task process processes a split, then proceeds to step (3-14). Specifically, the size of a split defaults to the HDFS block size, the block size being configurable through the configuration file; the Map task process decomposes the split into records of the <key, value> form it needs and executes the Map() method; the results of execution are cached in memory and spilled to disk when the buffer is full; a spilled file records its partition information, and a single spill file is sorted first by partition and then by key; if multiple spill files need to be merged into one large file, this process merge-sorts the spill files;
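The spill-and-merge behaviour of step (3-8) can be sketched as follows. This is an illustrative simplification under stated assumptions: in-memory lists stand in for spill files on disk, the partition of a key is taken as `hash(key) % num_partitions`, and the function name is the author's own.

```python
import heapq

def spill_and_merge(records, buffer_limit, num_partitions):
    """Buffered (key, value) records are spilled to a sorted run whenever
    the buffer fills, each run ordered first by partition and then by key;
    the runs are finally merge-sorted into the single 'large file'."""
    order = lambda kv: (hash(kv[0]) % num_partitions, kv[0])
    spills, buf = [], []
    for rec in records:
        buf.append(rec)
        if len(buf) == buffer_limit:        # buffer full: spill a sorted run
            spills.append(sorted(buf, key=order))
            buf = []
    if buf:                                 # final partial buffer
        spills.append(sorted(buf, key=order))
    # Merge-sort all spill runs into one output, as in the last sentence above.
    return list(heapq.merge(*spills, key=order))
```

Because each run is already sorted, the final pass is a k-way merge rather than a full re-sort, which is exactly why Hadoop sorts spill files individually before merging.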
(3-9) the Reduce task process (Reducer) of the slave node starts data-copy threads, which request via HTTP the slave nodes where the Map task processes reside and obtain the intermediate output files of the Map task processes; the intermediate output files are stored on the local disk of the slave node; a copied file is first placed in a memory buffer, and the multiple copied files are merged into a final large file, which is sorted by key; then proceed to step (3-10);
(3-10) the Reduce task process reads records in <key, iterator> form from the obtained large file and executes the Reduce() method, then proceeds to step (3-11);
(3-11) judge whether the job type is iterative or non-iterative; if iterative, proceed to step (3-12); if non-iterative, proceed to step (3-13);
(3-12) the dynamic-data cache component of the slave node caches the results of the Reduce task process in memory, spilling to a local disk file when the buffer is full, then proceed to step (3-14);
(3-13) the Reduce task process writes its results into HDFS, then proceeds to step (3-14);
(3-14) the task execution ends; return to step (3-1).
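The branching of sub-steps (3-2) through (3-13) can be summarized as a small dispatcher. This is a sketch of the decision tree only, not an executable Hadoop task runner; the dictionary keys and returned step labels are illustrative assumptions.

```python
def run_task(task, first_iteration):
    """Trace the path a task takes through step (3): non-iterative jobs
    follow the stock Hadoop path, while an iterative job's Map task first
    obtains its dynamic data (user-supplied initial values on the first
    run, cached Reduce output afterwards) before processing its split.
    Returns the sequence of sub-step labels taken."""
    steps = []
    if task["job_type"] == "iterative" and task["task_type"] == "map":
        if first_iteration:
            steps.append("3-7: read user-supplied dynamic-data initial values")
        else:
            steps.append("3-6: Map-side shuffle: fetch dynamic data via local cache")
        steps.append("3-8: process input split with Map()")
    elif task["task_type"] == "map":
        steps.append("3-8: process input split with Map()")
    else:
        steps.append("3-9/3-10: copy, merge and reduce intermediate output")
        if task["job_type"] == "iterative":
            steps.append("3-12: cache result in dynamic-data cache component")
        else:
            steps.append("3-13: write result to HDFS")
    return steps
```

The key asymmetry is visible here: only iterative jobs get the extra Map-side shuffle before Map and the local cache write after Reduce; everything else is unchanged Hadoop behaviour.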
Example:
In order to verify the feasibility and validity of the present invention, the computer programs written were executed under the experimental configuration environment shown in Table 1 below, and the invention was tested; the test results are shown in Tables 2 and 3 below:
Table 1: experimental configuration environment
In Tables 2 and 3, the comparison objects of the present invention are Hadoop-0.20.0 and HaLoop, and the experimental algorithm is fuzzy C-Means. Table 2 shows the comparison of the dynamic-data transmission volume of the three MapReduce implementations at different experimental scales. Table 3 shows the comparison of the execution times of the three MapReduce implementations for different iteration counts at a fixed experimental scale. The experimental results show that the present invention achieves a fairly ideal improvement in both network data transmission and time performance.
Table 2: comparison of dynamic-data transmission volume for fuzzy C-Means
Table 3: comparison of execution time for fuzzy C-Means
Those skilled in the art will readily understand that the foregoing are only preferred embodiments of the present invention and are not intended to limit it; any modifications, equivalent replacements, improvements and the like made within the spirit and principles of the present invention shall all be included within the protection scope of the present invention.
Claims (6)
1. A MapReduce optimization method suitable for iterative computation, applied in a Hadoop cluster system comprising a master node and a plurality of slave nodes, characterized in that the method comprises the following steps:
(1) the master node receives a plurality of Hadoop jobs submitted by users; the job service process of the master node places the jobs in a job queue, where they wait to be scheduled by the job scheduler of the master node;
(2) the master node waits for task requests sent by the slave nodes; after receiving a task request, the job scheduler of the master node preferentially schedules localized tasks; if the slave node that sent the task request has no localized task, predictive scheduling is performed according to the task types of the Hadoop jobs: a compute-type task of a Hadoop job is scheduled directly, while a transfer-type task is postponed by a certain interval at a time, and the Hadoop job is scheduled only when the total delay reaches a delay threshold;
(3) a slave node receives the task of a Hadoop job scheduled by the master node and handles it according to the job type, job types being divided into iterative and non-iterative: a non-iterative job is processed in the conventional Hadoop manner; for an iterative job, a Map-side shuffle process is added before the Map phase to read dynamic data for the Map tasks, in the Reduce phase the dynamic data is cached locally under the management of the slave node's dynamic-data cache component, and after the job completes the final result is saved in HDFS.
2. MapReduce optimization method according to claim 1, is characterized in that, step (2) specifically comprises following sub-step:
(2-1) heartbeat message that the job service process monitoring on host node the task service processes of wait from node send, this heartbeat message comprises the current operation information from node, specifically comprises total groove number and the current groove number moving;
(2-2) host node is receiving the heartbeat message of sending from node, according to this heartbeat message, calculate current from the idle groove number of node and the average run channel number of whole Hadoop group system, according to the result of calculating, judging whether need to be to current task of distributing this operation from node, if do not need allocating task, return to step (2-1), otherwise execution step (2-3);
(2-3) counter i=0 is set;
(2-4) judge whether i Hadoop operation has localization tasks current from node, it is the current input data fragmentation (Split) that whether stores i Hadoop operation from node, if do not proceed to step (2-5), if having, proceed to step (2-11);
(2-5) i=i+1 is set, and judges whether i equals the number of Hadoop operation, if equal, enter step (2-7), otherwise return to step (2-4);
(2-6) counter j=0 is set;
(2-7) task type that judges j Hadoop operation is calculation type task or mode transmission task, if calculation type task enters step (2-11), if mode transmission task enters step (2-8);
(2-8) task scheduling of j Hadoop operation is postponed to a heart time;
(2-9) judge that whether the total delay time that j Hadoop job task dispatched reaches a threshold value, if reach, proceeds to step (2-11), otherwise proceeds to step (2-10);
(2-10) set j = j + 1 and judge whether j equals the number of Hadoop jobs; if so, enter step (2-12), otherwise return to step (2-7);
(2-11) schedule the localized task of the i-th Hadoop job to the current slave node; the process then ends;
(2-12) schedule a task of the j-th Hadoop job to the current slave node; the process then ends.
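Sub-steps (2-3) through (2-12) amount to a locality-first pass followed by delay scheduling for transfer-type tasks. The sketch below is a simplified, illustrative model of that loop (the job-dictionary fields, parameter names, and return convention are all assumptions):

```python
def pick_task(jobs, node, heartbeat_delay, delay_threshold):
    """One scheduling decision per heartbeat, per sub-steps (2-3)-(2-12).

    Each job is a dict: {'splits_on': set of node ids holding a split,
    'type': 'compute' | 'transfer', 'delayed': seconds delayed so far}.
    Returns (job, is_local) or None when every transfer job was only delayed.
    """
    # (2-3)-(2-5): prefer a job with an input split on this node (locality).
    for job in jobs:
        if node in job["splits_on"]:
            return job, True              # (2-11) schedule the localized task
    # (2-6)-(2-10): no local work. Compute-type tasks run anywhere at once;
    # transfer-type tasks are postponed, up to a threshold (delay scheduling).
    for job in jobs:
        if job["type"] == "compute":
            return job, False             # (2-12) schedule non-locally
        job["delayed"] += heartbeat_delay  # (2-8) postpone one heartbeat
        if job["delayed"] >= delay_threshold:
            return job, False             # (2-9) delay budget exhausted
    return None                            # wait for the next heartbeat
```

The design choice mirrored here is that only data-movement-heavy tasks are worth waiting for locality; compute-bound tasks lose little by running on a remote node.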
3. The MapReduce optimization method according to claim 2, characterized in that step (2-2) is specifically: the current slave node's idle slot count equals its total slot count minus the number of slots currently running; the average running-slot count of the whole Hadoop cluster system equals the sum of the running-slot counts of all slave nodes monitored by the tracking process, divided by the number of slave nodes; if the idle slot count of the current node equals 0, no task needs to be assigned; likewise, if the current slave node's running-slot count is greater than the cluster's average running-slot count, no task needs to be assigned.
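The two-part load test of claim 3 reduces to a few lines of arithmetic. A minimal sketch, with illustrative names (the claim does not name a function):

```python
def needs_task(total_slots, running_slots, cluster_running, node_count):
    """Decide whether the heartbeat node should get a task (claim 3).

    cluster_running is the sum of running slots across all slave nodes.
    """
    idle = total_slots - running_slots        # idle slots on this node
    avg = cluster_running / node_count        # cluster-wide average load
    # Assign only if the node has a free slot AND is not above-average loaded.
    return idle > 0 and running_slots <= avg
```

This keeps heavily loaded nodes from pulling yet more work even when a slot happens to be free.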
4. The MapReduce optimization method according to claim 1, characterized in that step (3) specifically comprises the following sub-steps:
(3-1) receive a task of a Hadoop job scheduled by the master node;
(3-2) judge whether the job type of the task is iterative or non-iterative; for an iterative job proceed to step (3-3), for a non-iterative job proceed to step (3-4);
(3-3) judge whether the task of this iterative job is a Map task or a Reduce task; for a Map task proceed to step (3-5), for a Reduce task proceed to step (3-9);
(3-4) judge whether the task of this non-iterative job is a Map task or a Reduce task; for a Map task proceed to step (3-8), for a Reduce task proceed to step (3-9);
(3-5) judge whether this iterative job is running for the first time; if not, proceed to step (3-6); if so, proceed to step (3-7);
(3-6) the task service process on the slave node where the Map task process (Mapper) resides starts multiple data-copy threads, which request over HTTP, from the slave nodes where the Reduce task processes reside, the dynamic data files those Reduce task processes computed; then proceed to step (3-8);
(3-7) the Map task process reads the initial values of the dynamic data, then proceeds to step (3-8);
(3-8) the Hadoop cluster system decomposes the job's input file into splits, and the Map task process processes its split; then proceed to step (3-14);
(3-9) the task service process on the slave node where the Reduce task process resides starts data-copy threads, which request over HTTP, from the slave nodes where the Map task processes reside, the intermediate output files of those Map task processes, and store them on the slave node's local disk; copied files are first placed in a memory buffer, multiple copied files are merged into one final large file, and this large file is sorted by key; then proceed to step (3-10);
(3-10) the Reduce task process reads records in <key, iterator> form from the obtained large file and executes the Reduce() method, then proceeds to step (3-11);
(3-11) judge whether the job type is iterative or non-iterative; for an iterative job proceed to step (3-12), for a non-iterative job proceed to step (3-13);
(3-12) the execution results of the Reduce task process are cached in memory by the dynamic-data cache component on the slave node and spilled to a local disk file when the buffer is full; then proceed to step (3-14);
(3-13) the Reduce task process writes its execution results to HDFS, then proceeds to step (3-14);
(3-14) task execution finishes; then return to step (3-1).
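The routing in sub-steps (3-2) through (3-9) can be condensed into one decision function. A sketch under the assumption that each task is labeled with its job type, task type, and whether the iteration is the first (all names below are illustrative):

```python
def route(job_type, task_type, first_run=False):
    """Pick the first action for an arriving task, per steps (3-2)-(3-9)."""
    if task_type == "reduce":
        # (3-9): both job types start by copying Map intermediate output.
        return "copy-map-outputs"
    if job_type == "iterative":
        # (3-5)-(3-7): fetch or initialize dynamic data before mapping.
        return "read-dynamic-initial-values" if first_run else "fetch-dynamic-files"
    # (3-4)/(3-8): a non-iterative Map task processes its split directly.
    return "process-input-split"
```

Note that the Reduce path is shared; the iterative/non-iterative distinction only matters again at step (3-11), when deciding where the results go.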
5. The MapReduce optimization method according to claim 4, characterized in that the dynamic data files in step (3-6) are managed by the dynamic-data cache component on the slave node and kept in memory and on local disk; copied dynamic data is likewise managed by the dynamic-data cache component, and the multiple Map task processes on the same slave node request the files from the node's local dynamic data; these data serve as the dynamic data input the Map task processes need.
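The cache component of claim 5 is essentially a two-tier store: memory first, local disk on overflow, with all Map tasks on the node reading from it instead of re-fetching remotely. A toy sketch (the class name, capacity model, and dict-backed "disk" are simplifying assumptions):

```python
class DynamicDataCache:
    """Two-tier dynamic-data cache per claim 5: memory, then local disk."""

    def __init__(self, mem_capacity):
        self.mem = {}            # in-memory tier
        self.disk = {}           # stands in for local disk files
        self.mem_capacity = mem_capacity

    def put(self, key, value):
        # Keep data in memory while there is room; overflow goes to disk.
        if len(self.mem) < self.mem_capacity:
            self.mem[key] = value
        else:
            self.disk[key] = value

    def get(self, key):
        # Memory tier is checked first; fall back to the disk tier.
        return self.mem.get(key, self.disk.get(key))
```

Because every Mapper on the node shares this cache, each dynamic data file crosses the network at most once per iteration.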
6. The MapReduce optimization method according to claim 4, characterized in that, in step (3-8), the size of a split defaults to the HDFS block size, the block size being configurable via the configuration file; the Map task process parses a split into records of the <key, value> form it needs and executes the Map() method; execution results are cached in memory and spilled to disk when the buffer is full, with partition information recorded for each spilled file; each single spill file is sorted first by partition and then by key; if there are multiple spill files, they need to be merged into one large file, and this merging performs a merge sort over the spill files.
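The sort-and-merge of claim 6 (spill files sorted by partition then key, then merge-sorted into one large file) can be sketched directly; `records` as lists of `(key, value)` pairs and `partition_of` as a caller-supplied partitioner are illustrative stand-ins for Hadoop's spill files and Partitioner:

```python
import heapq


def sort_spill(records, partition_of):
    """Sort one spill file first by partition, then by key (claim 6)."""
    return sorted(records, key=lambda kv: (partition_of(kv[0]), kv[0]))


def merge_spills(sorted_spills, partition_of):
    """Merge several sorted spill files into one large file via merge sort."""
    key = lambda kv: (partition_of(kv[0]), kv[0])
    # heapq.merge streams the inputs, so the merged output never needs to
    # fit in memory all at once - the same reason Hadoop uses a merge sort.
    return list(heapq.merge(*sorted_spills, key=key))
```

Keeping spill files individually sorted is what makes the final merge a linear streaming pass rather than a full re-sort.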
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310600745.7A CN103617087B (en) | 2013-11-25 | 2013-11-25 | MapReduce optimizing method suitable for iterative computations |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103617087A true CN103617087A (en) | 2014-03-05 |
CN103617087B CN103617087B (en) | 2017-04-26 |
Family
ID=50167790
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310600745.7A Active CN103617087B (en) | 2013-11-25 | 2013-11-25 | MapReduce optimizing method suitable for iterative computations |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103617087B (en) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104158860A (en) * | 2014-07-31 | 2014-11-19 | 国家超级计算深圳中心(深圳云计算中心) | Job scheduling method and job scheduling system |
CN104270412A (en) * | 2014-06-24 | 2015-01-07 | 南京邮电大学 | Three-level caching method based on Hadoop distributed file system |
CN104503820A (en) * | 2014-12-10 | 2015-04-08 | 华南师范大学 | Hadoop optimization method based on asynchronous starting |
CN105117286A (en) * | 2015-09-22 | 2015-12-02 | 北京大学 | Task scheduling and pipelining executing method in MapReduce |
CN105808634A (en) * | 2015-01-15 | 2016-07-27 | 国际商业机器公司 | Distributed map reduce network |
WO2016145904A1 (en) * | 2015-09-10 | 2016-09-22 | 中兴通讯股份有限公司 | Resource management method, device and system |
CN106354563A (en) * | 2016-08-29 | 2017-01-25 | 广州市香港科大霍英东研究院 | Distributed computing system for 3D (three-dimensional reconstruction) and 3D reconstruction method |
CN106506255A (en) * | 2016-09-21 | 2017-03-15 | 微梦创科网络科技(中国)有限公司 | A kind of method of pressure test, apparatus and system |
CN106547609A (en) * | 2015-09-18 | 2017-03-29 | 阿里巴巴集团控股有限公司 | A kind of event-handling method and equipment |
CN106897133A (en) * | 2017-02-27 | 2017-06-27 | 郑州云海信息技术有限公司 | A kind of implementation method based on the management cluster load of PBS job schedulings |
CN107122238A (en) * | 2017-04-25 | 2017-09-01 | 郑州轻工业学院 | Efficient iterative Mechanism Design method based on Hadoop cloud Computational frame |
CN107316124A (en) * | 2017-05-10 | 2017-11-03 | 中国航天系统科学与工程研究院 | Extensive affairs type job scheduling and processing general-purpose platform under big data environment |
CN107391250A (en) * | 2017-08-11 | 2017-11-24 | 成都优易数据有限公司 | A kind of controller of raising Mapreduce task Shuffle performances |
CN107807983A (en) * | 2017-10-30 | 2018-03-16 | 辽宁大学 | A kind of parallel processing framework and design method for supporting extensive Dynamic Graph data query |
CN108153583A (en) * | 2016-12-06 | 2018-06-12 | 阿里巴巴集团控股有限公司 | Method for allocating tasks and device, real-time Computational frame system |
CN108270634A (en) * | 2016-12-30 | 2018-07-10 | 中移(苏州)软件技术有限公司 | A kind of method and system of heartbeat detection |
CN108376104A (en) * | 2018-02-12 | 2018-08-07 | 上海帝联网络科技有限公司 | Node scheduling method and device, computer readable storage medium |
CN108563497A (en) * | 2018-04-11 | 2018-09-21 | 中译语通科技股份有限公司 | A kind of efficient various dimensions algorithmic dispatching method, task server |
CN109117285A (en) * | 2018-07-27 | 2019-01-01 | 高新兴科技集团股份有限公司 | Support the distributed memory computing cluster system of high concurrent |
CN105204920B (en) * | 2014-06-18 | 2019-07-23 | 阿里巴巴集团控股有限公司 | A kind of implementation method and device of the distributed computing operation based on mapping polymerization |
CN110297714A (en) * | 2019-06-19 | 2019-10-01 | 上海冰鉴信息科技有限公司 | The method and device of PageRank is obtained based on large-scale graph data collection |
CN110908796A (en) * | 2019-11-04 | 2020-03-24 | 北京理工大学 | Multi-operation merging and optimizing system and method in Gaia system |
CN111813527A (en) * | 2020-07-15 | 2020-10-23 | 江苏方天电力技术有限公司 | Data-aware task scheduling method |
CN112148202A (en) * | 2019-06-26 | 2020-12-29 | 杭州海康威视数字技术股份有限公司 | Training sample reading method and device |
CN107562926B (en) * | 2017-09-14 | 2023-09-26 | 丙申南京网络技术有限公司 | Multi-hadoop distributed file system for big data analysis |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737114A (en) * | 2012-05-18 | 2012-10-17 | 北京大学 | MapReduce-based big picture distance connection query method |
US20120304186A1 (en) * | 2011-05-26 | 2012-11-29 | International Business Machines Corporation | Scheduling Mapreduce Jobs in the Presence of Priority Classes |
CN103279328A (en) * | 2013-04-08 | 2013-09-04 | 河海大学 | BlogRank algorithm parallelization processing construction method based on Haloop |
Non-Patent Citations (2)
Title |
---|
SANGWON SEO et al.: "HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment", IEEE International Conference on Cluster Computing and Workshops, 2009, 31 December 2009 (2009-12-31), pages 1 - 4 * |
FENG XINJIAN: "Research on MapReduce-based Iterative Distributed Data Processing", China Masters' Theses Full-text Database, Information Science and Technology, no. 10, 15 October 2013 (2013-10-15), pages 137 - 20 * |
Cited By (40)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105204920B (en) * | 2014-06-18 | 2019-07-23 | 阿里巴巴集团控股有限公司 | A kind of implementation method and device of the distributed computing operation based on mapping polymerization |
CN104270412A (en) * | 2014-06-24 | 2015-01-07 | 南京邮电大学 | Three-level caching method based on Hadoop distributed file system |
CN104158860B (en) * | 2014-07-31 | 2017-09-29 | 国家超级计算深圳中心(深圳云计算中心) | A kind of job scheduling method and job scheduling system |
CN104158860A (en) * | 2014-07-31 | 2014-11-19 | 国家超级计算深圳中心(深圳云计算中心) | Job scheduling method and job scheduling system |
CN104503820A (en) * | 2014-12-10 | 2015-04-08 | 华南师范大学 | Hadoop optimization method based on asynchronous starting |
CN104503820B (en) * | 2014-12-10 | 2018-07-24 | 华南师范大学 | A kind of Hadoop optimization methods based on asynchronous starting |
CN105808634B (en) * | 2015-01-15 | 2019-12-03 | 国际商业机器公司 | Distributed mapping reduction network |
CN105808634A (en) * | 2015-01-15 | 2016-07-27 | 国际商业机器公司 | Distributed map reduce network |
WO2016145904A1 (en) * | 2015-09-10 | 2016-09-22 | 中兴通讯股份有限公司 | Resource management method, device and system |
CN106547609A (en) * | 2015-09-18 | 2017-03-29 | 阿里巴巴集团控股有限公司 | A kind of event-handling method and equipment |
CN105117286B (en) * | 2015-09-22 | 2018-06-12 | 北京大学 | The dispatching method of task and streamlined perform method in MapReduce |
CN105117286A (en) * | 2015-09-22 | 2015-12-02 | 北京大学 | Task scheduling and pipelining executing method in MapReduce |
CN106354563A (en) * | 2016-08-29 | 2017-01-25 | 广州市香港科大霍英东研究院 | Distributed computing system for 3D (three-dimensional reconstruction) and 3D reconstruction method |
CN106354563B (en) * | 2016-08-29 | 2020-05-22 | 广州市香港科大霍英东研究院 | Distributed computing system for 3D reconstruction and 3D reconstruction method |
CN106506255B (en) * | 2016-09-21 | 2019-11-05 | 微梦创科网络科技(中国)有限公司 | A kind of method, apparatus and system of pressure test |
CN106506255A (en) * | 2016-09-21 | 2017-03-15 | 微梦创科网络科技(中国)有限公司 | A kind of method of pressure test, apparatus and system |
CN108153583A (en) * | 2016-12-06 | 2018-06-12 | 阿里巴巴集团控股有限公司 | Method for allocating tasks and device, real-time Computational frame system |
CN108153583B (en) * | 2016-12-06 | 2022-05-13 | 阿里巴巴集团控股有限公司 | Task allocation method and device and real-time computing framework system |
CN108270634A (en) * | 2016-12-30 | 2018-07-10 | 中移(苏州)软件技术有限公司 | A kind of method and system of heartbeat detection |
CN108270634B (en) * | 2016-12-30 | 2021-08-24 | 中移(苏州)软件技术有限公司 | Heartbeat detection method and system |
CN106897133B (en) * | 2017-02-27 | 2020-09-29 | 苏州浪潮智能科技有限公司 | Implementation method for managing cluster load based on PBS job scheduling |
CN106897133A (en) * | 2017-02-27 | 2017-06-27 | 郑州云海信息技术有限公司 | A kind of implementation method based on the management cluster load of PBS job schedulings |
CN107122238A (en) * | 2017-04-25 | 2017-09-01 | 郑州轻工业学院 | Efficient iterative Mechanism Design method based on Hadoop cloud Computational frame |
CN107316124A (en) * | 2017-05-10 | 2017-11-03 | 中国航天系统科学与工程研究院 | Extensive affairs type job scheduling and processing general-purpose platform under big data environment |
CN107391250A (en) * | 2017-08-11 | 2017-11-24 | 成都优易数据有限公司 | A kind of controller of raising Mapreduce task Shuffle performances |
CN107562926B (en) * | 2017-09-14 | 2023-09-26 | 丙申南京网络技术有限公司 | Multi-hadoop distributed file system for big data analysis |
CN107807983A (en) * | 2017-10-30 | 2018-03-16 | 辽宁大学 | A kind of parallel processing framework and design method for supporting extensive Dynamic Graph data query |
CN107807983B (en) * | 2017-10-30 | 2021-08-24 | 辽宁大学 | Design method of parallel processing framework supporting large-scale dynamic graph data query |
CN108376104B (en) * | 2018-02-12 | 2020-10-27 | 上海帝联网络科技有限公司 | Node scheduling method and device and computer readable storage medium |
CN108376104A (en) * | 2018-02-12 | 2018-08-07 | 上海帝联网络科技有限公司 | Node scheduling method and device, computer readable storage medium |
CN108563497B (en) * | 2018-04-11 | 2022-03-29 | 中译语通科技股份有限公司 | Efficient multi-dimensional algorithm scheduling method and task server |
CN108563497A (en) * | 2018-04-11 | 2018-09-21 | 中译语通科技股份有限公司 | A kind of efficient various dimensions algorithmic dispatching method, task server |
CN109117285A (en) * | 2018-07-27 | 2019-01-01 | 高新兴科技集团股份有限公司 | Support the distributed memory computing cluster system of high concurrent |
CN110297714A (en) * | 2019-06-19 | 2019-10-01 | 上海冰鉴信息科技有限公司 | The method and device of PageRank is obtained based on large-scale graph data collection |
CN110297714B (en) * | 2019-06-19 | 2023-05-30 | 上海冰鉴信息科技有限公司 | Method and device for acquiring PageRank based on large-scale graph dataset |
CN112148202A (en) * | 2019-06-26 | 2020-12-29 | 杭州海康威视数字技术股份有限公司 | Training sample reading method and device |
CN112148202B (en) * | 2019-06-26 | 2023-05-26 | 杭州海康威视数字技术股份有限公司 | Training sample reading method and device |
CN110908796A (en) * | 2019-11-04 | 2020-03-24 | 北京理工大学 | Multi-operation merging and optimizing system and method in Gaia system |
CN111813527A (en) * | 2020-07-15 | 2020-10-23 | 江苏方天电力技术有限公司 | Data-aware task scheduling method |
CN111813527B (en) * | 2020-07-15 | 2022-06-14 | 江苏方天电力技术有限公司 | Data-aware task scheduling method |
Also Published As
Publication number | Publication date |
---|---|
CN103617087B (en) | 2017-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103617087A (en) | MapReduce optimizing method suitable for iterative computations | |
Gu et al. | SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters | |
Cho et al. | Natjam: Design and evaluation of eviction policies for supporting priorities and deadlines in mapreduce clusters | |
Kc et al. | Scheduling hadoop jobs to meet deadlines | |
Li et al. | Map-Balance-Reduce: An improved parallel programming model for load balancing of MapReduce | |
US9323580B2 (en) | Optimized resource management for map/reduce computing | |
Yang et al. | Design adaptive task allocation scheduler to improve MapReduce performance in heterogeneous clouds | |
WO2021254135A1 (en) | Task execution method and storage device | |
CN108153589B (en) | Method and system for data processing in a multi-threaded processing arrangement | |
CN106293944A (en) | System and optimization method is accessed based on nonuniformity I/O under virtualization multi-core environment | |
Bok et al. | An efficient MapReduce scheduling scheme for processing large multimedia data | |
Gu et al. | Improving execution concurrency of large-scale matrix multiplication on distributed data-parallel platforms | |
Wang et al. | Actcap: Accelerating mapreduce on heterogeneous clusters with capability-aware data placement | |
Shi et al. | MapReduce short jobs optimization based on resource reuse | |
Han et al. | Energy efficient VM scheduling for big data processing in cloud computing environments | |
US10579419B2 (en) | Data analysis in storage system | |
Irandoost et al. | Mapreduce data skewness handling: a systematic literature review | |
Shabeera et al. | Optimising virtual machine allocation in MapReduce cloud for improved data locality | |
Slagter et al. | SmartJoin: a network-aware multiway join for MapReduce | |
Li et al. | Performance optimization of computing task scheduling based on the Hadoop big data platform | |
Yu et al. | Sasm: Improving spark performance with adaptive skew mitigation | |
Liu et al. | KubFBS: A fine‐grained and balance‐aware scheduling system for deep learning tasks based on kubernetes | |
Liu et al. | Run-time dynamic resource adjustment for mitigating skew in mapreduce | |
Mian et al. | Managing data-intensive workloads in a cloud | |
Zhao et al. | A holistic cross-layer optimization approach for mitigating stragglers in in-memory data processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |