[background technology]
When a distributed computing system (Map-Reduce) processes large volumes of data, it divides the data into slices (divide and conquer), computes each slice in parallel, and then aggregates the results. A traditional distributed computing system comprises a Master unit, several Map units, and several Reduce units. The Master unit is the main control program of the system; it schedules the tasks of the Map and Reduce units, controls their operation, and monitors their running status. A Map unit processes one part of the data; the whole data set is processed by multiple Map units, each of which produces temporary intermediate results (intermediate files). A Reduce unit merges the intermediate results produced by all the Map units to obtain the final result.
Usually, the computing units of a Map-Reduce system (the Map units and Reduce units) are deployed on a number of computers. The local disk space of each computer is limited, while the intermediate files can be very large and exceed the available local disk space: for example, the local disk may have only 500M of free space while the computed intermediate data file is 100G or larger. Such massive intermediate data files must therefore be correctly saved and transferred to the Reduce unit for further computation.
The traditional approach is to split the large file to be processed on the computer running the Map unit into many small files, so that the intermediate file generated for each small file fits on the local disk; this requires more Map units. However, servers are expensive and hardware resources are usually limited. When the number of machines is fixed, splitting a large file into many small files increases the number of files each Map unit has to process. Each Map unit processes files serially, that is, it can start the next file only after finishing the current one. Constrained by local disk space, the whole Map-Reduce task therefore takes a long time to produce its final result, so this traditional approach is inefficient. It also requires the developer to handle file splitting, and it is hard to split files to a suitable size without knowing how much disk space is in use, so the approach is inflexible. Finally, processing a large file requires many Map units and therefore a lot of hardware, which raises cost, and the implementation is complex and difficult.
[summary of the invention]
In view of this, it is necessary to provide an intermediate file processing apparatus for a distributed computing system that can improve operating efficiency.
An intermediate file processing apparatus for a distributed computing system, the apparatus being based on the Map-Reduce framework and comprising a Map unit and a Reduce unit. The Map unit comprises: an intermediate file generation module, configured to generate a plurality of intermediate files according to a preset file size after processing a Map task; and a transmission module, configured to transmit the intermediate files to the Reduce unit one by one in the order in which they were generated. The Reduce unit comprises a communication module that receives the intermediate files and a computing module that performs computation on the intermediate files and outputs the final result.
The Map unit may further comprise a description file generation module, configured to generate a Map description file that records information about the intermediate files, the information comprising file length, file name, and file hash value. After the Reduce unit reads an intermediate file, it obtains the information about that intermediate file and stores it in a Reduce description file.
The Map unit may further comprise: a Map detection module, configured to detect, according to the Map description file, whether the number of generated intermediate files exceeds a first threshold; and a Map control module, configured to suspend the generation of intermediate files when the Map detection module detects that the number of generated intermediate files exceeds the first threshold.
The Map control module may also be configured to delete an intermediate file once the Map unit has finished transmitting it.
The Reduce unit may further comprise: a Reduce detection module, configured to determine, according to the Reduce description file, whether the number of intermediate files temporarily stored by the Reduce unit exceeds a second threshold; and a Reduce control module, configured to notify the Map unit to suspend the transmission of intermediate files when the Reduce detection module detects that the number of temporarily stored intermediate files exceeds the second threshold, and to notify the Map unit to continue transmitting intermediate files when that number does not exceed the second threshold.
The Reduce control module may also be configured to delete the intermediate files temporarily stored by the Reduce unit after the computing module finishes its computation.
In addition, it is also necessary to provide an intermediate file processing method for a distributed computing system that can improve operating efficiency.
An intermediate file processing method for a distributed computing system, the method being based on the Map-Reduce framework and comprising the following steps: processing a Map task and generating a plurality of intermediate files according to a preset file size; transmitting the intermediate files to the Reduce unit one by one in the order in which they were generated; and receiving, by the Reduce unit, the intermediate files, performing computation on them, and outputting the final result.
The step of generating the plurality of intermediate files may further comprise recording information about the intermediate files in a Map description file, the information comprising file length, file name, and file hash value. The step of the Reduce unit receiving the intermediate files further comprises obtaining the information about each intermediate file and storing it in a Reduce description file.
The method may further comprise: detecting, according to the Map description file, whether the number of generated intermediate files exceeds a first threshold, and if so, suspending the generation of intermediate files.
The step of transmitting the intermediate files to the Reduce unit in the order in which they were generated may further comprise: once an intermediate file has been transmitted to the Reduce unit, deleting the transmitted intermediate file.
The method may further comprise: determining, according to the Reduce description file, whether the number of intermediate files temporarily stored by the Reduce unit exceeds a second threshold; if so, suspending the transmission of intermediate files; otherwise, continuing to transmit intermediate files.
The step of performing computation on the intermediate files may further comprise: deleting the intermediate file temporarily stored by the Reduce unit after the computation on that intermediate file finishes.
In the above intermediate file processing apparatus and method for a distributed computing system, a plurality of intermediate files are generated according to a preset file size after the Map task is processed and are transmitted to the Reduce unit one by one in the order in which they were generated; that is, the Map unit transmits each intermediate file to the Reduce unit as soon as it is generated. This can improve the operating efficiency of the distributed computing system.
In addition, by setting the size of the intermediate files and a threshold on the number of intermediate files transmitted, the number of intermediate files in transit can be controlled: when the number reaches a certain upper limit, data transmission can be suspended, which avoids the congestion caused by transmitting too much data. Very large intermediate data files in a distributed computing system can thus be handled without adding many Map units, which saves hardware resources and reduces the complexity and difficulty of implementation.
[embodiment]
As shown in Figure 1, an intermediate file processing apparatus for a distributed computing system comprises a plurality of Map units 10 and a plurality of Reduce units 20. Each Map unit 10 processes a Map task: its Map function receives input (key/value) pairs and produces intermediate files, which are passed to a Reduce unit 20. The Reduce unit 20, through its Reduce function, receives an intermediate key and its associated values and merges these values to obtain the final computation result.
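The division of labour between the Map function and the Reduce function can be illustrated with a minimal word-count sketch. The names map_fn, reduce_fn and run are hypothetical and chosen only for illustration; the grouping is done in memory here, whereas the apparatus described passes intermediate files between units.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Map function: take one input (key/value) pair, emit intermediate pairs."""
    for word in value.split():
        yield word, 1


def reduce_fn(key: str, values: Iterable[int]) -> int:
    """Reduce function: merge all values that share one intermediate key."""
    return sum(values)


def run(inputs: Dict[str, str]) -> Dict[str, int]:
    # Group intermediate values by key (in memory, as an illustrative shortcut).
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs.items():
        for inter_key, inter_value in map_fn(key, value):
            grouped[inter_key].append(inter_value)
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}


result = run({"doc1": "a b a", "doc2": "b c"})
```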
As shown in Figure 2, in one embodiment, the Map unit 10 comprises an intermediate file generation module 101 and a transmission module 102. The intermediate file generation module 101 is configured to generate a plurality of intermediate files according to a preset file size after processing the Map task. The preset file size can be set according to the size of the local disk; for example, each intermediate file may be set to 100M. The transmission module 102 is configured to transmit the intermediate files to the Reduce unit 20 one by one in the order in which they were generated.
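One possible way to cut Map output into intermediate files of a preset size is sketched below. The function name split_into_intermediate_files and the record-based input are assumptions made for illustration; the chunks are kept as byte strings rather than written to disk, and the tiny max_size stands in for a value such as 100M.

```python
from typing import Iterable, Iterator, List


def split_into_intermediate_files(records: Iterable[bytes], max_size: int) -> Iterator[bytes]:
    """Yield chunks of Map output, closing a chunk before it would exceed max_size."""
    buffer: List[bytes] = []
    size = 0
    for record in records:
        # Start a new chunk when adding this record would pass the preset size.
        if buffer and size + len(record) > max_size:
            yield b"".join(buffer)
            buffer, size = [], 0
        buffer.append(record)
        size += len(record)
    # Emit whatever is left as the last (possibly smaller) chunk.
    if buffer:
        yield b"".join(buffer)


chunks = list(split_into_intermediate_files([b"aaaa", b"bbbb", b"cc", b"dddd"], max_size=8))
```

In this sketch records are never split, so a single record larger than max_size would produce an oversized chunk; a real implementation would need to decide how to handle that case.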
In another embodiment, as shown in Figure 3, in addition to the intermediate file generation module 101 and transmission module 102 described above, the Map unit 10 further comprises a description file generation module 103, a Map detection module 104, and a Map control module 105, wherein:
The description file generation module 103 is configured to generate a Map description file that records information about the intermediate files. When the Map outputs an intermediate file, it checks whether a Map description file exists on the local disk and, if not, generates one. The description file records information about the generated intermediate files, such as file length, file name, and file hash value. In this embodiment, the Map description file is intermediate_head.dat. When the intermediate file generation module 101 generates the first intermediate file (e.g. intermediate_data-1.data) according to the preset file size, the Map description file records the length, name, and data hash value of that file. When the output exceeds the preset file size (e.g. 100M), the subsequent data is saved as a new intermediate file, e.g. intermediate_data-2.data, and the Map description file is updated; subsequent intermediate files are handled in the same way. As soon as an intermediate file is generated, the transmission module 102 transmits it to the Reduce unit 20, so that intermediate files generated first are transmitted first.
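The bookkeeping performed by the description file can be sketched as follows. The text does not specify the on-disk format of intermediate_head.dat or the hash algorithm, so the JSON layout, the SHA-256 hash and the function name record_intermediate_file are all assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile


def record_intermediate_file(description_path: str, data_path: str) -> dict:
    """Append one intermediate file's name, length and hash to the description file."""
    with open(data_path, "rb") as f:
        data = f.read()
    entry = {
        "name": os.path.basename(data_path),
        "length": len(data),
        "hash": hashlib.sha256(data).hexdigest(),  # hash algorithm is an assumption
    }
    # Create the description file on first use; otherwise load and extend it.
    entries = []
    if os.path.exists(description_path):
        with open(description_path, "r", encoding="utf-8") as f:
            entries = json.load(f)
    entries.append(entry)
    with open(description_path, "w", encoding="utf-8") as f:
        json.dump(entries, f)
    return entry


workdir = tempfile.mkdtemp()
head = os.path.join(workdir, "intermediate_head.dat")
data_file = os.path.join(workdir, "intermediate_data-1.data")
with open(data_file, "wb") as f:
    f.write(b"key1\tvalue1\n")
entry = record_intermediate_file(head, data_file)
```

Reading the whole file into memory keeps the sketch short; for files of 100M or more, hashing in fixed-size blocks would be the more careful choice.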
The Map detection module 104 is configured to detect, according to the Map description file, whether the number of generated intermediate files exceeds a first threshold (e.g. 2); if it does, the Map unit 10 is running too fast. When the Map detection module 104 detects that the number of generated intermediate files exceeds the first threshold, the Map control module 105 suspends the computation of the Map unit 10, that is, it suspends the generation of intermediate files. In one embodiment, the Map control module 105 is also configured to delete an intermediate file once the Map unit 10 has finished transmitting it.
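The interplay of the detection and control modules might look like the sketch below. It assumes that the count compared against the first threshold is the number of files generated but not yet transmitted and deleted, which the text does not state explicitly; the class name MapController and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, List


class MapController:
    """Tracks intermediate files still on the Map side and decides when to pause."""

    def __init__(self, first_threshold: int = 2) -> None:
        self.first_threshold = first_threshold
        self.pending: Deque[str] = deque()  # generated, not yet transmitted

    def on_generated(self, name: str) -> None:
        self.pending.append(name)

    def on_transmitted(self) -> str:
        # The oldest file is sent first; its local copy would be deleted here.
        return self.pending.popleft()

    def should_pause(self) -> bool:
        # Pause generation only when the count exceeds the first threshold.
        return len(self.pending) > self.first_threshold


controller = MapController(first_threshold=2)
states: List[bool] = []
for i in (1, 2, 3):
    controller.on_generated(f"intermediate_data-{i}.data")
    states.append(controller.should_pause())
sent = controller.on_transmitted()
states.append(controller.should_pause())
```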
In one embodiment, as shown in Figure 4, the Reduce unit 20 comprises a communication module 201 and a computing module 202. The communication module 201 receives the intermediate files transmitted by the Map unit 10; the computing module 202 performs computation on the intermediate files and outputs the final result. In this embodiment, on receiving an intermediate file, the Reduce unit 20 reads it and obtains its information (which may be, for example, the hash value of the intermediate file), and stores the obtained information (file length, file name, file hash value, and so on) in the Reduce description file. The intermediate file to be computed is stored temporarily in the Reduce unit 20, and the computation in the Reduce unit 20 is then carried out. After the computation finishes, the temporarily stored intermediate file and any intermediate computation files produced during the computation are deleted. While waiting for the next intermediate file, the Reduce unit 20 is placed in a waiting or suspended state.
In another embodiment, as shown in Figure 5, in addition to the communication module 201 and computing module 202 described above, the Reduce unit 20 further comprises a Reduce detection module 203 and a Reduce control module 204. The Reduce detection module 203 determines, according to the Reduce description file, whether the number of intermediate files temporarily stored by the Reduce unit 20 exceeds a second threshold (e.g. 2); if it does, the Map unit 10 has transmitted too many files. When the number of stored intermediate files is detected to exceed the second threshold, the Reduce control module 204 notifies the Map unit 10 to suspend the transmission of intermediate files, and notifies the Map unit 10 to continue transmitting only once the number of intermediate files temporarily stored by the Reduce unit 20 is less than the second threshold. The Reduce control module 204 is also configured to delete the intermediate files temporarily stored by the Reduce unit 20 after the computation finishes.
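The back-pressure behaviour of the Reduce control module can be sketched as a small state machine. The class name ReduceController, the notify_map callback and the "pause"/"resume" message strings are hypothetical. The sketch resumes as soon as the stored count no longer exceeds the second threshold, following the wording of the summary; resuming only when the count is strictly below the threshold would be a one-line change.

```python
from typing import Callable, List


class ReduceController:
    """Notifies the Map side to pause or resume based on locally stored files."""

    def __init__(self, second_threshold: int, notify_map: Callable[[str], None]) -> None:
        self.second_threshold = second_threshold
        self.notify_map = notify_map
        self.stored: List[str] = []  # intermediate files held temporarily
        self.paused = False

    def _check(self) -> None:
        over = len(self.stored) > self.second_threshold
        if over and not self.paused:
            self.paused = True
            self.notify_map("pause")
        elif not over and self.paused:
            self.paused = False
            self.notify_map("resume")

    def on_received(self, name: str) -> None:
        self.stored.append(name)
        self._check()

    def on_computed(self, name: str) -> None:
        # The temporary file would be deleted from disk at this point.
        self.stored.remove(name)
        self._check()


messages: List[str] = []
reducer = ReduceController(second_threshold=2, notify_map=messages.append)
for name in ("f1", "f2", "f3"):
    reducer.on_received(name)
reducer.on_computed("f1")
```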
After all intermediate files generated by the Map unit 10 have been transmitted, the Reduce unit 20 can be notified that data transmission is complete, and the Map description file is marked to indicate that the transmission of intermediate files has finished.
As shown in Figure 6, an intermediate file processing method for a distributed computing system comprises the following steps:
Step S10: process the Map task and generate a plurality of intermediate files according to a preset file size. In one embodiment, the size of the intermediate files can be set according to the size of the local disk, for example 100M for each intermediate file. The information about the generated intermediate files, including file length, file name, and hash value, is recorded in the Map description file. In this embodiment, the method first checks whether a Map description file exists on the local disk; if not, it generates a Map description file for recording the intermediate file information (file length, file name, file hash value, and so on). When the output exceeds the preset file size (e.g. 100M), the first 100M intermediate file is generated and the subsequent data is saved as a new intermediate file, and so on.
Step S20: transmit the intermediate files to the Reduce unit one by one in the order in which they were generated. In one embodiment, as soon as the Map unit 10 generates an intermediate file, it passes that file to the Reduce unit 20, so that intermediate files generated first are transmitted first. Once an intermediate file has been transmitted to the Reduce unit 20, the transmitted file is deleted to free space on the local disk.
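Step S20 can be sketched as a loop that sends files in generation order and removes each local copy once it has been sent. The function transmit_in_order and the send callback are hypothetical stand-ins for whatever transport the system actually uses; here the callback simply appends to a list.

```python
import os
import tempfile
from typing import Callable, List, Tuple


def transmit_in_order(paths: List[str], send: Callable[[str, bytes], None]) -> None:
    """Send files first-generated-first and delete each local copy after sending."""
    for path in paths:
        with open(path, "rb") as f:
            send(os.path.basename(path), f.read())
        # Free local disk space only after the send call has returned.
        os.remove(path)


workdir = tempfile.mkdtemp()
paths: List[str] = []
for i in (1, 2, 3):
    p = os.path.join(workdir, f"intermediate_data-{i}.data")
    with open(p, "wb") as f:
        f.write(f"chunk {i}".encode("utf-8"))
    paths.append(p)

received: List[Tuple[str, bytes]] = []
transmit_in_order(paths, lambda name, data: received.append((name, data)))
```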
Step S30: the Reduce unit receives the intermediate files, performs computation on them, and outputs the final result. In one embodiment, after receiving an intermediate file, the Reduce unit 20 reads it, computes its hash value, and stores information such as the hash value in the Reduce description file. The intermediate file is stored temporarily in the Reduce unit 20; after the computation on it finishes, the temporarily stored intermediate file and any intermediate computation files produced during the computation are deleted. While waiting for the next intermediate file, the Reduce unit 20 is placed in a waiting or suspended state.
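A minimal sketch of step S30 follows, assuming each intermediate file holds tab-separated key/count lines and that the Reduce computation is a per-key sum. The file format, the SHA-256 hash, the in-memory list standing in for the Reduce description file, and the name reduce_one_file are all illustrative assumptions.

```python
import hashlib
import os
import tempfile
from collections import Counter
from typing import Dict, List


def reduce_one_file(path: str, totals: Counter, description: List[Dict[str, object]]) -> None:
    """Record one received file's info, fold it into the totals, then delete it."""
    with open(path, "rb") as f:
        data = f.read()
    description.append({
        "name": os.path.basename(path),
        "length": len(data),
        "hash": hashlib.sha256(data).hexdigest(),  # hash algorithm is an assumption
    })
    for line in data.decode("utf-8").splitlines():
        key, value = line.split("\t")
        totals[key] += int(value)
    # Computation on this file is done, so the temporary copy can go.
    os.remove(path)


workdir = tempfile.mkdtemp()
totals: Counter = Counter()
description: List[Dict[str, object]] = []
for i, body in enumerate([b"a\t1\nb\t2\n", b"a\t3\n"], start=1):
    path = os.path.join(workdir, f"intermediate_data-{i}.data")
    with open(path, "wb") as f:
        f.write(body)
    reduce_one_file(path, totals, description)
```

Because each file is folded into the running totals and deleted before the next one is handled, the Reduce side never needs disk space for more than the files currently waiting.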
In one embodiment, the method further comprises a step of detecting, according to the Map description file, whether the number of intermediate files generated by the Map unit 10 exceeds a first threshold. In this embodiment, the first threshold (e.g. 2) can be set in advance. When the number of intermediate files exceeds the first threshold, the Map unit 10 is running too fast and the generation of intermediate files must be suspended; once the number of intermediate files no longer exceeds the first threshold, computation resumes and intermediate files are generated again.
In another embodiment, the method further comprises a step of determining, according to the Reduce description file, whether the number of intermediate files stored by the Reduce unit exceeds a second threshold. In this embodiment, the second threshold (e.g. 2) can be set in advance. When the number of intermediate files temporarily stored by the Reduce unit 20 exceeds the second threshold, the Map unit 10 has transmitted too many files, and the transmission of intermediate files must be suspended. Transmission resumes only once the number of intermediate files temporarily stored by the Reduce unit 20 no longer exceeds the second threshold, which saves space on the local disk of the Reduce unit 20.
It should be noted that the above method and system can also be combined with a distributed storage system: after the Map unit 10 of the distributed computing system generates an intermediate file, the file is first uploaded to the distributed storage system for storage, and the Reduce unit 20 then downloads it from the distributed storage system.
In the above intermediate file processing apparatus and method for a distributed computing system, a plurality of intermediate files are generated according to a preset file size after the Map task is processed and are transmitted to the Reduce unit 20 one by one in the order in which they were generated; that is, the Map unit 10 transmits each intermediate file to the Reduce unit as soon as it is generated. This can improve the operating efficiency of the distributed computing system.
In addition, by setting the size of the intermediate files and a threshold on the number of intermediate files transmitted, the number of intermediate files in transit can be controlled: when the number reaches a certain upper limit, data transmission can be suspended, which avoids the congestion caused by transmitting too much data. Very large intermediate data files in a distributed computing system can thus be handled without adding many Map units 10, which saves hardware resources and reduces the complexity and difficulty of implementation.
The above embodiments express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the claims. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.