CN103176843A - File migration method and file migration equipment of MapReduce distributed system


Publication number: CN103176843A
Authority: CN (China)
Application number: CN2013100906609A
Priority: CN201310090660.9A
Other versions: CN103176843B
Other languages: Chinese (zh)
Inventor: 潘瑾瑜
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Legal status: Granted, active
Prior art keywords: identification information, target file, data block, distributed system, data

Abstract

The invention provides a file migration method and file migration equipment for a MapReduce distributed system. A migration job for migrating a target file is started. The migration job comprises at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task, so that the Reduce task can generate the metadata of the target file in the target MapReduce distributed system. Because the migration job for a single target file comprises at least the first Map task and the second Map task, and these two tasks are executed in parallel, the migration time of the target file is shortened and the migration efficiency of the target file is improved.

Description

File migration method and equipment for a MapReduce distributed system
[Technical Field]
The present invention relates to file migration technology, and in particular to a file migration method and equipment for a MapReduce distributed system.
[Background]
In recent years, with the rapid development of broadband network technology and parallel computing theory, a more simplified kind of distributed system, the MapReduce distributed system, has emerged to provide services for a variety of applications, for example, search engines. In a MapReduce distributed system (also called a MapReduce distributed cluster), for example a Hadoop system, one data processing procedure is called a job (Job). After a Job is submitted, the data to be processed is divided into N parts, and each part is processed by one Map task. A Map task runs on a node device of the MapReduce distributed system, and one node device can run one or more Map tasks. The output of all the Map tasks is aggregated by a Reduce task, which outputs the corresponding result. Hadoop is an open-source project of the Apache Software Foundation.
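As a rough illustration of the Job, Map and Reduce flow described above, the following Python sketch splits the data to be processed into N parts, processes each part in its own Map task, and aggregates the outputs in one Reduce task. It is not Hadoop code: the names `split`, `map_task`, `reduce_task` and `run_job` are hypothetical, threads stand in for node devices, and summing numbers is an arbitrary workload chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n):
    """Divide a list into n contiguous, nearly equal parts."""
    size, rem = divmod(len(data), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        parts.append(data[start:end])
        start = end
    return parts


def map_task(part):
    """Hypothetical Map task: process one part (here, sum it)."""
    return sum(part)


def reduce_task(partials):
    """Hypothetical Reduce task: aggregate the outputs of all Map tasks."""
    return sum(partials)


def run_job(data, n):
    """Run one Job: N Map tasks in parallel, then a single Reduce task."""
    parts = split(data, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(map_task, parts))
    return reduce_task(partials)
```

For instance, `run_job(list(range(10)), 3)` runs three Map tasks over the parts `[0, 1, 2, 3]`, `[4, 5, 6]` and `[7, 8, 9]`.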
However, in existing file migration for MapReduce distributed systems, migration is performed file by file, with each file migrated within a single task, so migration efficiency is low.
[Summary of the Invention]
Aspects of the present invention provide a file migration method and equipment for a MapReduce distributed system, in order to improve file migration efficiency.
One aspect of the present invention provides a file migration method for a MapReduce distributed system, comprising:
starting a migration job for migrating a target file, the migration job comprising at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task, wherein the target file comprises at least first data and second data, the first data being stored in at least one first data block and the second data being stored in at least one second data block;
in the first Map task, copying the first data into a target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block;
in the second Map task, copying the second data into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block;
in the Reduce task, generating the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block.
In the aspect above and any possible implementation, an implementation is further provided in which the identification information of the target file comprises the path information under which the target file is stored in the file system of the target MapReduce distributed system.
In the aspect above and any possible implementation, an implementation is further provided in which:
the identification information of the at least one first data block is the offset of the start position of the at least one first data block within the target file;
the identification information of the at least one second data block is the offset of the start position of the at least one second data block within the target file.
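To make the offset-based identification concrete, here is a minimal Python sketch that lists the start offset of every data block of a file. It assumes fixed-size blocks, which the text does not state; `block_offsets`, `file_size` and `block_size` are hypothetical names introduced for illustration.

```python
def block_offsets(file_size, block_size):
    """Return the start offset, within the file, of each data block.

    Assumes every block except possibly the last has the same size.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    return list(range(0, file_size, block_size))


def block_range(offset, file_size, block_size):
    """Return the half-open byte range [start, end) covered by one block."""
    return offset, min(offset + block_size, file_size)
```

Together with the path of the target file, each offset is enough for a Map task to locate the data it has to copy.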
In the aspect above and any possible implementation, an implementation is further provided in which generating, in the Reduce task, the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block comprises:
in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, modifying the metadata in the mapping table maintained by the target MapReduce distributed system, so as to generate the metadata of the target file in the target MapReduce distributed system; or
in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, and according to the metadata in the mapping table maintained by the target MapReduce distributed system, copying the first data and the second data again into a new file that serves as the target file, and generating the metadata of the new file in the target MapReduce distributed system.
In the aspect above and any possible implementation, an implementation is further provided in which the migration job is specifically used to
migrate the target file from a source MapReduce distributed system to the target MapReduce distributed system.
Another aspect of the present invention provides file migration equipment for a MapReduce distributed system, comprising:
a start unit, configured to start a migration job for migrating a target file, the migration job comprising at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task, wherein the target file comprises at least first data and second data, the first data being stored in at least one first data block and the second data being stored in at least one second data block;
a first Map task execution unit, configured to, in the first Map task, copy the first data into a target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block;
a second Map task execution unit, configured to, in the second Map task, copy the second data into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block;
a Reduce task execution unit, configured to, in the Reduce task, generate the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block.
In the aspect above and any possible implementation, an implementation is further provided in which the identification information of the target file comprises the path information under which the target file is stored in the file system of the target MapReduce distributed system.
In the aspect above and any possible implementation, an implementation is further provided in which:
the identification information of the at least one first data block is the offset of the start position of the at least one first data block within the target file;
the identification information of the at least one second data block is the offset of the start position of the at least one second data block within the target file.
In the aspect above and any possible implementation, an implementation is further provided in which the Reduce task execution unit is specifically configured to:
in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, modify the metadata in the mapping table maintained by the target MapReduce distributed system, so as to generate the metadata of the target file in the target MapReduce distributed system; or
in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, and according to the metadata in the mapping table maintained by the target MapReduce distributed system, copy the first data and the second data again into a new file that serves as the target file, and generate the metadata of the new file in the target MapReduce distributed system.
In the aspect above and any possible implementation, an implementation is further provided in which the migration job is specifically used to
migrate the target file from a source MapReduce distributed system to the target MapReduce distributed system.
As can be seen from the above technical solution, an embodiment of the present invention starts a migration job for migrating a target file, the migration job comprising at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task. In the first Map task, the first data is copied into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block. In the second Map task, the second data is copied into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block. In the Reduce task, the metadata of the target file in the target MapReduce distributed system can then be generated according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block. Because the migration job for a single target file comprises at least the first Map task and the second Map task, and these two tasks are executed in parallel, the migration time of the target file is shortened and the migration efficiency of the target file is improved.
[Brief Description of the Drawings]
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the file migration method for a MapReduce distributed system provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the migration job started in the embodiment corresponding to Fig. 1 migrating the target file from Hadoop system A to Hadoop system B;
Fig. 3 is a schematic structural diagram of the file migration equipment for a MapReduce distributed system provided by another embodiment of the present invention.
[Detailed Description]
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, A and/or B can mean: A alone, both A and B, or B alone. The character "/" herein generally indicates an "or" relationship between the objects before and after it.
Fig. 1 is a schematic flowchart of the file migration method for a MapReduce distributed system provided by an embodiment of the present invention.
101. Start a migration job for migrating a target file. The migration job comprises at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task. The target file comprises at least first data and second data; the first data is stored in at least one first data block, and the second data is stored in at least one second data block.
102. In the first Map task, copy the first data into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block.
103. In the second Map task, copy the second data into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block.
104. In the Reduce task, generate the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block.
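Steps 101 to 104 can be sketched end to end in Python as follows. This is a simplified model under stated assumptions, not the patented implementation and not Hadoop code: the source file system is a plain dictionary from path to bytes, the target system is a pair of dictionaries, blocks have a fixed size, a block is identified by its start offset, and threads stand in for parallel Map tasks. All function and variable names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def map_copy(source, target_blocks, path, offset, length):
    """Map task (102/103): copy the block of `path` that starts at `offset`."""
    target_blocks[(path, offset)] = source[path][offset:offset + length]
    return path, offset


def reduce_metadata(target_meta, map_outputs):
    """Reduce task (104): build the file's metadata from the block identifiers."""
    for path, offset in map_outputs:
        target_meta.setdefault(path, []).append(offset)
    for offsets in target_meta.values():
        offsets.sort()
    return target_meta


def migrate(source, path, block_size):
    """Migration job (101): one Map task per block, then one Reduce task."""
    target_blocks, target_meta = {}, {}
    offsets = range(0, len(source[path]), block_size)
    # Each Map task writes a distinct key, so the shared dict needs no lock
    # in this sketch.
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(
            lambda off: map_copy(source, target_blocks, path, off, block_size),
            offsets))
    reduce_metadata(target_meta, outputs)
    return target_blocks, target_meta
```

Reading the blocks back in the order recorded in the metadata reproduces the original file content, which is the property the Reduce task is meant to guarantee.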
It should be noted that the executing entity of 101 to 104 may be a MapReduce distributed system, for example, the target MapReduce distributed system or an independent MapReduce distributed system.
In this way, a migration job for migrating a target file is started, the migration job comprising at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task. In the first Map task, the first data is copied into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block. In the second Map task, the second data is copied into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block. In the Reduce task, the metadata of the target file in the target MapReduce distributed system can then be generated according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block. Because the migration job for a single target file comprises at least the first Map task and the second Map task, and these two tasks are executed in parallel, the migration time of the target file is shortened and the migration efficiency of the target file is improved.
In existing file migration methods for MapReduce distributed systems, the migration job for a single target file comprises only one Map task. That is, migration is performed file by file, with each file migrated within a single Map task, so migration efficiency is low.
Optionally, in one possible implementation of this embodiment, the identification information of the target file may comprise the path information under which the target file is stored in the file system of the target MapReduce distributed system.
Optionally, in one possible implementation of this embodiment, the identification information of the at least one first data block may be the offset of the start position of the at least one first data block within the target file. Correspondingly, the identification information of the at least one second data block is the offset of the start position of the at least one second data block within the target file.
Optionally, in one possible implementation of this embodiment, in 104 the Reduce task may specifically, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, modify the metadata in the mapping table maintained by the target MapReduce distributed system, so as to generate the metadata of the target file in the target MapReduce distributed system.
In this implementation, the Reduce task only needs to modify (for example, merge) the metadata in the mapping table maintained by the target MapReduce distributed system, without rewriting the first data and the second data contained in the target file, which further improves the migration efficiency of the target file. The metadata may comprise the file information of the target file and the data block information of the first data block storing the first data and of the second data block storing the second data contained in the target file.
Optionally, in one possible implementation of this embodiment, in 104 the Reduce task may instead, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, and according to the metadata in the mapping table maintained by the target MapReduce distributed system, copy the first data and the second data again into a new file that serves as the target file, and generate the metadata of the new file in the target MapReduce distributed system.
In this implementation, the Reduce task needs to rewrite the first data and the second data contained in the target file, and then generates the metadata of the new file in the target MapReduce distributed system after the rewrite.
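The rewrite variant just described might look like the following Python sketch. It assumes that each Map task left its data in the target system as a temporary file recorded in the mapping table, and that the Reduce task receives `(offset, temp_name)` pairs from the Map tasks; both are modelling assumptions, as are the names `reduce_rewrite`, `mapping_table` and `block_store`. Removing the temporary entries at the end is also an assumption made to keep the table consistent.

```python
def reduce_rewrite(mapping_table, block_store, path, parts):
    """Rebuild `path` as one new file from the temporary files of the Map tasks.

    parts: list of (offset, temp_name) pairs, one per Map task.
    """
    data = bytearray()
    # Concatenate the copied data in offset order, using the mapping table
    # to find the blocks behind each temporary file.
    for _offset, temp_name in sorted(parts):
        for block_id in mapping_table[temp_name]:
            data += block_store[block_id]

    # Write the new file (modelled here as a single new block).
    new_block = f"{path}#rewritten"
    block_store[new_block] = bytes(data)

    # Drop the temporary entries and their blocks (an assumption of this sketch).
    for _offset, temp_name in parts:
        for block_id in mapping_table.pop(temp_name):
            del block_store[block_id]

    # Generate the metadata of the new file.
    mapping_table[path] = [new_block]
    return mapping_table
```

Compared with a metadata-only merge, this variant moves every byte a second time inside the target system, which is why the text presents the merge as the more efficient option.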
Further, in the Reduce task, other metadata that contradicts the generated metadata may also be deleted, which further improves the reliability of the migration of the target file.
Optionally, in one possible implementation of this embodiment, the migration job may specifically be used to migrate the target file from a source MapReduce distributed system to the target MapReduce distributed system. That is, before 101 the target file is stored in the file system of the source MapReduce distributed system, and after 104 the target file has been written into the file system of the target MapReduce distributed system.
To make the method provided by the embodiment of the present invention clearer, a Hadoop system is used as an example below, in which the file system of the MapReduce distributed system is the Hadoop Distributed File System (HDFS). As shown in Fig. 2, suppose target file 1 (file name: file 1; storage path: file1) is stored in data block 1, data block 2, data block 3, data block 4 and data block 5 in the HDFS file system of Hadoop system A. Target file 1 comprises data 1, data 2, data 3, data 4 and data 5: data 1 is stored in data block 1, data 2 in data block 2, data 3 in data block 3, data 4 in data block 4, and data 5 in data block 5.
Hadoop system A maintains a mapping table A, which contains the following metadata related to the target file:
File 1, [data block 1, data block 2, data block 3, data block 4 and data block 5];
The migration equipment starts a migration job, which comprises Map task 1, Map task 2, Map task 3, Map task 4 and Map task 5 executed in parallel, and a corresponding Reduce task. The key (Key) and value (Value) of Map task N (N = 1, 2, 3, 4, 5) are, respectively, the identification information file1 of the target file and data block N. Specifically:
In Map task 1, data 1 is copied into Hadoop system B according to the identification information file1 of the target file and the identification information offset1 of data block 1;
In Map task 2, data 2 is copied into Hadoop system B according to the identification information file1 of the target file and the identification information offset2 of data block 2;
In Map task 3, data 3 is copied into Hadoop system B according to the identification information file1 of the target file and the identification information offset3 of data block 3;
In Map task 4, data 4 is copied into Hadoop system B according to the identification information file1 of the target file and the identification information offset4 of data block 4; and
In Map task 5, data 5 is copied into Hadoop system B according to the identification information file1 of the target file and the identification information offset5 of data block 5.
Hadoop system B maintains a mapping table B, which after the Map tasks contains the following metadata related to the target file:
File 1, [data block 1];
File 2, [data block 2];
File 3, [data block 3];
File 4, [data block 4];
File 5, [data block 5];
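The five Map tasks and the resulting state of mapping table B can be modelled with the short Python sketch below. The dictionaries standing in for Hadoop system A and Hadoop system B, and the names `map_task_inputs` and `run_map_task`, are hypothetical; the tasks are run one after another here only to keep the example short, whereas the text has them execute in parallel.

```python
def map_task_inputs(file_id, block_count):
    """Return (N, key, value) for Map tasks 1..block_count.

    The key is the file identifier and the value identifies data block N.
    """
    return [(n, file_id, f"offset{n}") for n in range(1, block_count + 1)]


def run_map_task(n, key, value, system_a, mapping_table_b, store_b):
    """Map task N: copy the data identified by (key, value) into system B."""
    block_name = f"data block {n}"
    store_b[block_name] = system_a[(key, value)]
    # After the copy, system B knows this data only as a separate
    # one-block entry, as in the mapping table B listing above.
    mapping_table_b[f"file {n}"] = [block_name]
    return key, value


# Hadoop system A, modelled as (file identifier, block offset) -> data.
system_a = {("file1", f"offset{n}"): f"data {n}" for n in range(1, 6)}
mapping_table_b, store_b = {}, {}
for n, key, value in map_task_inputs("file1", 5):
    run_map_task(n, key, value, system_a, mapping_table_b, store_b)
```

At this point `mapping_table_b` holds five single-block entries, and it is left to the Reduce task to turn them into one entry for the target file.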
The key (Key) and value (Value) of the Reduce task are, respectively, the identification information file1 of the target file and data block N. Specifically:
In the Reduce task, the metadata in mapping table B maintained by Hadoop system B is modified according to the identification information file1 of the target file, the identification information offset1 of data block 1, the identification information offset2 of data block 2, the identification information offset3 of data block 3, the identification information offset4 of data block 4 and the identification information offset5 of data block 5, so as to generate the metadata of the target file in Hadoop system B. After the modification, the metadata related to the target file is as follows:
File 1, [data block 1, data block 2, data block 3, data block 4 and data block 5];
In addition, the other metadata in mapping table B that contradicts this metadata is deleted, that is, the following metadata is deleted:
File 1, [data block 1];
File 2, [data block 2];
File 3, [data block 3];
File 4, [data block 4];
File 5, [data block 5];
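The metadata-only Reduce step in this example can be sketched in Python as a merge over the mapping table. The function name `reduce_merge` and the dictionary representation of mapping table B are assumptions made for illustration; no data block is read or rewritten.

```python
def reduce_merge(mapping_table, file_name, temp_files):
    """Merge one-block entries into a single entry for the target file.

    temp_files must be ordered by block offset. Each temporary entry is
    removed as it is merged, which also deletes the metadata that would
    contradict the merged entry.
    """
    merged = []
    for temp in temp_files:
        merged.extend(mapping_table.pop(temp))
    mapping_table[file_name] = merged
    return mapping_table


mapping_table_b = {f"file {n}": [f"data block {n}"] for n in range(1, 6)}
reduce_merge(mapping_table_b, "file 1", [f"file {n}" for n in range(1, 6)])
```

After the call, the table contains only the entry for file 1, listing data block 1 through data block 5 in order.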
At this point, the target file has been migrated from Hadoop system A to Hadoop system B; that is, target file 1 is stored in data block 1, data block 2, data block 3, data block 4 and data block 5 in the HDFS file system of Hadoop system B.
In this embodiment, a migration job for migrating a target file is started, the migration job comprising at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task. In the first Map task, the first data is copied into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block. In the second Map task, the second data is copied into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block. In the Reduce task, the metadata of the target file in the target MapReduce distributed system can then be generated according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block. Because the migration job for a single target file comprises at least the first Map task and the second Map task, and these two tasks are executed in parallel, the migration time of the target file is shortened and the migration efficiency of the target file is improved.
In addition, if the migration of the target file fails, only the data stored in the data blocks whose migration failed needs to be migrated again, rather than the whole target file, which further improves the migration efficiency of the target file.
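One way to picture this per-block retry is the Python sketch below. The retry policy is not specified in the text: the round limit `max_rounds`, the use of `OSError` to signal a failed copy, and the names `migrate_with_retry` and `copy_block` are all assumptions made for illustration.

```python
def migrate_with_retry(offsets, copy_block, max_rounds=3):
    """Copy every block, re-running only the blocks that failed.

    copy_block(offset) is expected to raise OSError on failure.
    Returns the offsets that still failed after max_rounds rounds.
    """
    pending = list(offsets)
    for _ in range(max_rounds):
        failed = []
        for offset in pending:
            try:
                copy_block(offset)
            except OSError:
                failed.append(offset)
        if not failed:
            return []
        # Only the failed blocks are retried in the next round.
        pending = failed
    return pending
```

Blocks that were copied successfully in an earlier round are never touched again, so the cost of a failure is proportional to the failed blocks rather than to the size of the whole file.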
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related description of the other embodiments.
The structural representation of the file migration equipment of the MapReduce distributed system that Fig. 3 provides for another embodiment of the present invention.As shown in Figure 3, the file migration equipment of the MapReduce distributed system that provides of the present embodiment can comprise start unit 31, a Map task executing units 32, the 2nd Map task executing units 33 and Reduce task executing units 34.Wherein, start unit 31, be used for starting the migration operation that is used for the migration file destination, at least the Map task and the 2nd Map task that comprise executed in parallel in described migration operation, and a described Map task and Reduce task corresponding to described the 2nd Map task, described file destination comprises the first data and the second data at least, and described the first data are stored at least one first data block, and described the second data are stored at least one second data block; The one Map task executing units 32 is used in a described Map task, according to the identification information of described file destination and the identification information of described at least one the first data block, with described the first data copy in target MapReduce distributed system; The 2nd Map task executing units 33 is used in described the 2nd Map task, according to the identification information of described file destination and the identification information of described at least one the second data block, with described the second data copy in target MapReduce distributed system; Reduce task executing units 34, be used in described Reduce task, according to the identification information of the identification information of described file destination, described at least one the first data block and the identification information of described at least one the second data block, generate the metadata of described file destination in described target MapReduce distributed system.
It should be noted that the file migration equipment provided by this embodiment may itself be a MapReduce distributed system, for example the target MapReduce distributed system or a separate MapReduce distributed system.
In this way, the start unit starts a migration job for migrating the target file, the job comprising at least a first Map task and a second Map task executed in parallel, and a Reduce task corresponding to the two Map tasks. In the first Map task, the first Map task execution unit copies the first data to the target MapReduce distributed system according to the identification information of the target file and of the at least one first data block; in the second Map task, the second Map task execution unit copies the second data to the target MapReduce distributed system according to the identification information of the target file and of the at least one second data block. In the Reduce task, the Reduce task execution unit can then generate the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, of the at least one first data block and of the at least one second data block. Because the migration job for a single target file comprises at least a first Map task and a second Map task, and these are executed in parallel, the migration time of the target file can be shortened and its migration efficiency improved.
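The division of work among the four units can be illustrated with a small in-memory sketch. It is not the patented implementation: the dictionaries `source_fs`, `target_blocks` and `target_metadata`, the function names, and the use of a thread pool in place of a real MapReduce framework are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the source and target file systems.
source_fs = {"file1": b"AAAABBBBCC"}   # path -> file bytes
target_blocks = {}                     # (path, offset) -> copied bytes
target_metadata = {}                   # path -> ordered list of block offsets


def map_task(path, offset, length):
    """Map task: copy one byte range to the target and emit (path, offset)."""
    target_blocks[(path, offset)] = source_fs[path][offset:offset + length]
    return path, offset


def reduce_task(path, emitted):
    """Reduce task: build the target-side metadata from the Map outputs."""
    target_metadata[path] = sorted(off for p, off in emitted if p == path)


def start_migration(path, block_size):
    """Start unit: one Map task per block, run in parallel, then one Reduce."""
    size = len(source_fs[path])
    ranges = [(off, min(block_size, size - off))
              for off in range(0, size, block_size)]
    with ThreadPoolExecutor(max_workers=max(1, len(ranges))) as pool:
        emitted = list(pool.map(lambda r: map_task(path, r[0], r[1]), ranges))
    reduce_task(path, emitted)


start_migration("file1", block_size=4)
```

After the call, `target_metadata["file1"]` lists the block offsets in file order, and concatenating the copied blocks in that order reproduces the source bytes.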
In existing file migration equipment for MapReduce distributed systems, the migration job for one target file comprises only a single Map task; that is, migration is performed file by file, with each file handled entirely within one Map task, so migration efficiency is low.
Optionally, in one possible implementation of this embodiment, the identification information of the target file may comprise the path information under which the target file is stored in the file system of the target MapReduce distributed system.
Optionally, in one possible implementation of this embodiment, the identification information of the at least one first data block may be the offset of the start position of the at least one first data block within the target file; correspondingly, the identification information of the at least one second data block may be the offset of the start position of the at least one second data block within the target file.
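As a minimal sketch of offset-based block identification, the start offset of each block can be derived from the file size and a fixed block size. The function name and the fixed-size assumption are illustrative only; the source does not specify how the offsets are computed.

```python
def block_offsets(file_size, block_size):
    """Return the start offset of every block of a file, in file order.

    Assumes fixed-size blocks; only the last block may be shorter.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return list(range(0, file_size, block_size))
```

Because each offset is unique within a file, the pair (file path, offset) identifies a block, and sorting by offset restores the original block order regardless of which Map task finishes first.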
Optionally, in one possible implementation of this embodiment, the Reduce task execution unit 34 may specifically be configured to modify, in the Reduce task, the metadata in the mapping table maintained by the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block and the identification information of the at least one second data block, so as to generate the metadata of the target file in the target MapReduce distributed system.
In this implementation, the Reduce task execution unit 34 only needs to modify (for example, merge) the metadata in the mapping table maintained by the target MapReduce distributed system in the Reduce task; the first data and the second data of the target file do not need to be rewritten, which can further improve the migration efficiency of the target file. Here, the metadata may comprise the file information of the target file and the block information of the first data block storing the first data and of the second data block storing the second data.
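The metadata-only variant can be sketched as a merge over a mapping table. The table layout below (one temporary entry per copied part, keyed by path and offset, each listing its block identifiers and length) is a hypothetical structure chosen for illustration, not the mapping table of any real system.

```python
def merge_metadata(mapping_table, path, part_offsets):
    """Merge per-part entries into one file entry without touching block data.

    mapping_table maps (path, offset) -> {"blocks": [...], "length": int}
    for each copied part (hypothetical layout).
    """
    blocks, size = [], 0
    for offset in sorted(part_offsets):
        part = mapping_table[(path, offset)]
        if offset != size:
            # Parts must tile the file exactly, with no gap or overlap.
            raise ValueError(f"gap or overlap at offset {offset}")
        blocks.extend(part["blocks"])
        size += part["length"]
    for offset in part_offsets:
        del mapping_table[(path, offset)]  # drop the temporary part entries
    mapping_table[path] = {"blocks": blocks, "length": size}
    return mapping_table[path]
```

Only table entries are changed, so the cost of this step does not depend on the amount of data in the file.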
Optionally, in one possible implementation of this embodiment, the Reduce task execution unit 34 may alternatively be configured to, in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block and the identification information of the at least one second data block, and based on the metadata in the mapping table maintained by the target MapReduce distributed system, copy the first data and the second data again into a new file that serves as the target file, and generate the metadata of the new file in the target MapReduce distributed system.
In this implementation, the Reduce task execution unit 34 needs to rewrite the first data and the second data of the target file in the Reduce task and, after rewriting, generates the metadata of the new file in the target MapReduce distributed system.
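The rewrite variant can be sketched as concatenating the copied parts, in offset order, into a new output stream. The `parts` dictionary and the returned metadata record are assumptions for illustration; an in-memory `io.BytesIO` stands in for a file on the target file system.

```python
import io


def rewrite_as_new_file(parts, out):
    """Write copied parts to `out` in offset order and return new-file metadata.

    parts maps start offset -> bytes of that part (hypothetical layout).
    """
    written = 0
    for offset in sorted(parts):
        if offset != written:
            # Each part must begin exactly where the previous one ended.
            raise ValueError(f"gap or overlap at offset {offset}")
        written += out.write(parts[offset])
    return {"length": written}


buf = io.BytesIO()
meta = rewrite_as_new_file({4: b"BBBB", 0: b"AAAA", 8: b"CC"}, buf)
```

Unlike the metadata-only merge, this variant reads and writes every byte of the file a second time, which is why the source describes the merge as the more efficient option.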
Further, in the Reduce task, the Reduce task execution unit 34 may also delete other metadata that conflicts with the generated metadata, which can further improve the reliability of migrating the target file.
Optionally, in one possible implementation of this embodiment, the migration job may specifically be used to migrate the target file from a source MapReduce distributed system to the target MapReduce distributed system. That is, before the file migration equipment provided by this embodiment performs its operations, the target file is stored in the file system of the source MapReduce distributed system; after the equipment has performed its operations, the target file has been written into the file system of the target MapReduce distributed system.
To make the method provided by the embodiments of the present invention clearer, a Hadoop system is taken as an example below, in which the file system of the MapReduce distributed system is the Hadoop Distributed File System (HDFS). As shown in Fig. 2, assume that target file 1 (file name: file 1; storage path: file1) is stored in data block 1, data block 2, data block 3, data block 4 and data block 5 in the HDFS file system of Hadoop system A. Target file 1 comprises data 1, data 2, data 3, data 4 and data 5: data 1 is stored in data block 1, data 2 in data block 2, data 3 in data block 3, data 4 in data block 4, and data 5 in data block 5. For a detailed description, refer to the related content in the embodiment corresponding to Fig. 1, which is not repeated here.
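The five-block example can be turned into a concrete plan of Map task inputs. The block size (64 MB) and the file size (300 MB) below are hypothetical values chosen so that the file spans exactly five blocks; the source does not state either figure.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # assumed block size, not given in the source
FILE_SIZE = 300 * 1024 * 1024   # hypothetical size that spans five blocks


def plan_map_inputs(path, file_size, block_size):
    """Return one input record per data block, i.e. per Map task."""
    plan = []
    for index, offset in enumerate(range(0, file_size, block_size), start=1):
        plan.append({
            "path": path,                      # identifies the target file
            "block": index,                    # data block 1..N, as in Fig. 2
            "offset": offset,                  # identifies the block in the file
            "length": min(block_size, file_size - offset),
        })
    return plan


plan = plan_map_inputs("file1", FILE_SIZE, BLOCK_SIZE)
```

Under these assumptions the plan contains five records, one for each of data blocks 1 to 5, and the last block is shorter than the others.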
In this embodiment, the start unit starts a migration job for migrating the target file, the job comprising at least a first Map task and a second Map task executed in parallel, and a Reduce task corresponding to the two Map tasks. In the first Map task, the first Map task execution unit copies the first data to the target MapReduce distributed system according to the identification information of the target file and of the at least one first data block; in the second Map task, the second Map task execution unit copies the second data to the target MapReduce distributed system according to the identification information of the target file and of the at least one second data block. In the Reduce task, the Reduce task execution unit can then generate the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, of the at least one first data block and of the at least one second data block. Because the migration job for a single target file comprises at least a first Map task and a second Map task, and these are executed in parallel, the migration time of the target file can be shortened and its migration efficiency improved.
In addition, if migration of the target file fails, only the data stored in the data blocks whose migration failed needs to be migrated again, rather than the whole target file, which can further improve the migration efficiency of the target file.
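Block-level retry can be sketched as a loop that re-runs only the copies that failed. The callback `copy_block`, the use of `OSError` as the failure signal and the attempt limit are assumptions for illustration; in a real MapReduce framework the re-execution of failed tasks is normally handled by the framework itself.

```python
def migrate_with_retry(offsets, copy_block, max_attempts=3):
    """Copy every block; re-copy only the blocks whose copy failed.

    Returns the offsets that still failed after max_attempts rounds.
    """
    pending = list(offsets)
    attempts = 0
    while pending and attempts < max_attempts:
        attempts += 1
        failed = []
        for offset in pending:
            try:
                copy_block(offset)
            except OSError:
                failed.append(offset)   # retry this block only
        pending = failed
    return pending


# Usage: the block at offset 4 fails once, then succeeds on the retry.
calls = []
already_failed = set()


def flaky_copy(offset):
    calls.append(offset)
    if offset == 4 and offset not in already_failed:
        already_failed.add(offset)
        raise OSError("simulated copy failure")


remaining = migrate_with_retry([0, 4, 8], flaky_copy)
```

In the usage example the blocks at offsets 0 and 8 are copied once each, and only the block at offset 4 is copied a second time.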
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and comprises instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to execute some of the steps of the methods described in the embodiments of the present invention. The storage medium comprises various media capable of storing program code, such as a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A file migration method for a MapReduce distributed system, characterized by comprising:
starting a migration job for migrating a target file, the migration job comprising at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task, the target file comprising at least first data and second data, the first data being stored in at least one first data block and the second data being stored in at least one second data block;
in the first Map task, copying the first data to a target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block;
in the second Map task, copying the second data to the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block;
in the Reduce task, generating the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block and the identification information of the at least one second data block.
2. The method according to claim 1, characterized in that the identification information of the target file comprises the path information under which the target file is stored in the file system of the target MapReduce distributed system.
3. method according to claim 1 and 2, is characterized in that,
The identification information of described at least one the first data block is the side-play amount of starting position in described file destination of described at least one the first data block;
The side-play amount of the identification information of described at least one the second data block in described file destination is the side-play amount of starting position in described file destination of described at least one the second data block.
4. The method according to any one of claims 1 to 3, characterized in that generating, in the Reduce task, the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block and the identification information of the at least one second data block comprises:
in the Reduce task, modifying the metadata in the mapping table maintained by the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block and the identification information of the at least one second data block, so as to generate the metadata of the target file in the target MapReduce distributed system; or
in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block and the identification information of the at least one second data block, and based on the metadata in the mapping table maintained by the target MapReduce distributed system, copying the first data and the second data again into a new file that serves as the target file, and generating the metadata of the new file in the target MapReduce distributed system.
5. The method according to any one of claims 1 to 4, characterized in that the migration job is specifically used to
migrate the target file from a source MapReduce distributed system to the target MapReduce distributed system.
6. File migration equipment for a MapReduce distributed system, characterized by comprising:
a start unit, configured to start a migration job for migrating a target file, the migration job comprising at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task, the target file comprising at least first data and second data, the first data being stored in at least one first data block and the second data being stored in at least one second data block;
a first Map task execution unit, configured to copy, in the first Map task, the first data to a target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block;
a second Map task execution unit, configured to copy, in the second Map task, the second data to the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block;
a Reduce task execution unit, configured to generate, in the Reduce task, the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block and the identification information of the at least one second data block.
7. The equipment according to claim 6, characterized in that the identification information of the target file comprises the path information under which the target file is stored in the file system of the target MapReduce distributed system.
8. The equipment according to claim 6 or 7, characterized in that
the identification information of the at least one first data block is the offset of the start position of the at least one first data block within the target file;
the identification information of the at least one second data block is the offset of the start position of the at least one second data block within the target file.
9. The equipment according to any one of claims 6 to 8, characterized in that the Reduce task execution unit is specifically configured to:
in the Reduce task, modify the metadata in the mapping table maintained by the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block and the identification information of the at least one second data block, so as to generate the metadata of the target file in the target MapReduce distributed system; or
in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block and the identification information of the at least one second data block, and based on the metadata in the mapping table maintained by the target MapReduce distributed system, copy the first data and the second data again into a new file that serves as the target file, and generate the metadata of the new file in the target MapReduce distributed system.
10. The equipment according to any one of claims 6 to 9, characterized in that the migration job is specifically used to
migrate the target file from a source MapReduce distributed system to the target MapReduce distributed system.
CN201310090660.9A 2013-03-20 2013-03-20 The file migration method and apparatus of MapReduce distributed system Active CN103176843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310090660.9A CN103176843B (en) 2013-03-20 2013-03-20 The file migration method and apparatus of MapReduce distributed system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310090660.9A CN103176843B (en) 2013-03-20 2013-03-20 The file migration method and apparatus of MapReduce distributed system

Publications (2)

Publication Number Publication Date
CN103176843A true CN103176843A (en) 2013-06-26
CN103176843B CN103176843B (en) 2018-12-14

Family

ID=48636744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310090660.9A Active CN103176843B (en) 2013-03-20 2013-03-20 The file migration method and apparatus of MapReduce distributed system

Country Status (1)

Country Link
CN (1) CN103176843B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000047996A (en) * 1998-07-31 2000-02-18 Nippon Telegr & Teleph Corp <Ntt> Load leveling method for distributed system
CN101764835A (en) * 2008-12-25 2010-06-30 华为技术有限公司 Task allocation method and device based on MapReduce programming framework
CN102196049A (en) * 2011-05-31 2011-09-21 北京大学 Method suitable for secure migration of data in storage cloud
RU2469388C1 (en) * 2011-09-19 2012-12-10 Российская Федерация, от имени которой выступает Государственная корпорация по атомной энергии "Росатом" - Госкорпорация "Росатом" Method of handling data stored in parallel file system with hierarchical memory organisation
CN102855297A (en) * 2012-08-14 2013-01-02 北京高森明晨信息科技有限公司 Method for controlling data transmission, and connector

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808612A (en) * 2014-12-31 2016-07-27 北京嘀嘀无限科技发展有限公司 Method and equipment used for migrating data of database
CN105808612B (en) * 2014-12-31 2019-08-27 北京嘀嘀无限科技发展有限公司 The method and apparatus of data for migrating data library
CN106528711A (en) * 2016-11-02 2017-03-22 北京集奥聚合科技有限公司 Intersection solving method and system for data of out-of-table files
CN106528711B (en) * 2016-11-02 2019-04-30 北京集奥聚合科技有限公司 Intersection solving method and system for data of out-of-table files
CN111444148A (en) * 2020-04-09 2020-07-24 南京大学 Data transmission method and device based on MapReduce
CN111444148B (en) * 2020-04-09 2023-09-05 南京大学 Data transmission method and device based on MapReduce

Also Published As

Publication number Publication date
CN103176843B (en) 2018-12-14

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant