CN103176843B - File migration method and apparatus for a MapReduce distributed system

Info

Publication number
CN103176843B
Authority
CN
China
Prior art keywords
identification information
file destination
data block
data
distributed system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310090660.9A
Other languages
Chinese (zh)
Other versions
CN103176843A (en)
Inventor
潘瑾瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201310090660.9A priority Critical patent/CN103176843B/en
Publication of CN103176843A publication Critical patent/CN103176843A/en
Application granted granted Critical
Publication of CN103176843B publication Critical patent/CN103176843B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a file migration method and apparatus for a MapReduce distributed system. In embodiments of the invention, a migration job for migrating a target file is started. The migration job includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task, so that metadata of the target file in the target MapReduce distributed system can be generated in the Reduce task. Because the migration job for a single target file includes at least the first Map task and the second Map task, and the two Map tasks are executed in parallel, the migration time of the target file can be shortened and its migration efficiency improved.

Description

File migration method and apparatus for a MapReduce distributed system
[Technical field]
The present invention relates to file migration technology, and in particular to a file migration method and apparatus for a MapReduce distributed system.
[Background]
In recent years, with the rapid development of broadband network technology and parallel computing theory, a simpler kind of distributed system, the map-and-reduce (MapReduce) distributed system, has emerged to serve a variety of applications, for example search engines. In a MapReduce distributed system, which may also be called a MapReduce distributed cluster (for example, a Hadoop system), one data processing procedure is called a job (Job). After a Job is submitted, the data to be processed is divided into N parts, and each part is processed by one map (Map) task. A Map task runs on one node device of the MapReduce distributed system, and one or more Map tasks may run on a single node device. The output results of all Map tasks are aggregated by a reduce (Reduce) task, which outputs the corresponding result. Hadoop is an open-source project of the Apache Software Foundation.
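As a rough illustration of the Job, Map, and Reduce roles described above, the following Python sketch splits the input into parts, processes each part in its own map function, and aggregates the partial outputs in a reduce function. A thread pool stands in for the cluster's node devices, and word counting is only an assumed example workload; none of the names come from Hadoop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_task(part):
    """Process one part of the input: count the words it contains."""
    return Counter(part.split())

def reduce_task(partials):
    """Aggregate the outputs of all Map tasks into one result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

# The data to be processed, already divided into N = 2 parts.
parts = ["a b a", "b c"]

# Each part is handled by its own Map task; the tasks may run in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_task, parts))

result = reduce_task(partials)
```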
However, during file migration in a MapReduce distributed system, migration is performed in units of files, with each file migrated within a single task, so migration efficiency is low.
[Summary of the invention]
Aspects of the present invention provide a file migration method and apparatus for a MapReduce distributed system, in order to improve file migration efficiency.
One aspect of the present invention provides a file migration method for a MapReduce distributed system, comprising:
Starting a migration job for migrating a target file, where the migration job includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task; the target file includes at least first data and second data, the first data is stored in at least one first data block, and the second data is stored in at least one second data block;
In the first Map task, copying the first data to a target MapReduce distributed system according to identification information of the target file and identification information of the at least one first data block;
In the second Map task, copying the second data to the target MapReduce distributed system according to the identification information of the target file and identification information of the at least one second data block;
In the Reduce task, generating metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block.
In the above aspect and any possible implementation, an implementation is further provided in which the identification information of the target file includes the path information under which the target file is stored in the file system of the target MapReduce distributed system.
In the above aspect and any possible implementation, an implementation is further provided in which:
The identification information of the at least one first data block is the offset, within the target file, of the start position of the at least one first data block;
The identification information of the at least one second data block is the offset, within the target file, of the start position of the at least one second data block.
In the above aspect and any possible implementation, an implementation is further provided in which generating, in the Reduce task, the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block comprises:
In the Reduce task, modifying the metadata in a mapping table maintained by the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, so as to generate the metadata of the target file in the target MapReduce distributed system; or
In the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block, the identification information of the at least one second data block, and the metadata in the mapping table maintained by the target MapReduce distributed system, copying the first data and the second data again into a new file that serves as the target file, and generating metadata of the new file in the target MapReduce distributed system.
In the above aspect and any possible implementation, an implementation is further provided in which the migration job is specifically used to:
Migrate the target file from a source MapReduce distributed system to the target MapReduce distributed system.
Another aspect of the present invention provides a file migration apparatus for a MapReduce distributed system, comprising:
A start unit, configured to start a migration job for migrating a target file, where the migration job includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task; the target file includes at least first data and second data, the first data is stored in at least one first data block, and the second data is stored in at least one second data block;
A first Map task execution unit, configured to, in the first Map task, copy the first data to a target MapReduce distributed system according to identification information of the target file and identification information of the at least one first data block;
A second Map task execution unit, configured to, in the second Map task, copy the second data to the target MapReduce distributed system according to the identification information of the target file and identification information of the at least one second data block;
A Reduce task execution unit, configured to, in the Reduce task, generate metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block.
In the above aspect and any possible implementation, an implementation is further provided in which the identification information of the target file includes the path information under which the target file is stored in the file system of the target MapReduce distributed system.
In the above aspect and any possible implementation, an implementation is further provided in which:
The identification information of the at least one first data block is the offset, within the target file, of the start position of the at least one first data block;
The identification information of the at least one second data block is the offset, within the target file, of the start position of the at least one second data block.
In the above aspect and any possible implementation, an implementation is further provided in which the Reduce task execution unit is specifically configured to:
In the Reduce task, modify the metadata in a mapping table maintained by the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, so as to generate the metadata of the target file in the target MapReduce distributed system; or
In the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block, the identification information of the at least one second data block, and the metadata in the mapping table maintained by the target MapReduce distributed system, copy the first data and the second data again into a new file that serves as the target file, and generate metadata of the new file in the target MapReduce distributed system.
In the above aspect and any possible implementation, an implementation is further provided in which the migration job is specifically used to:
Migrate the target file from a source MapReduce distributed system to the target MapReduce distributed system.
As can be seen from the above technical solutions, in the embodiments of the present invention, a migration job for migrating a target file is started, and the migration job includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task. In the first Map task, the first data is copied to the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block. In the second Map task, the second data is copied to the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block. In the Reduce task, the metadata of the target file in the target MapReduce distributed system can then be generated according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block. Because the migration job for a single target file includes at least the first Map task and the second Map task, and the two Map tasks are executed in parallel, the migration time of the target file can be shortened and its migration efficiency improved.
[Brief description of the drawings]
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a file migration method for a MapReduce distributed system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of migrating a target file from Hadoop system A to Hadoop system B by the migration job started in the embodiment corresponding to Fig. 1;
Fig. 3 is a schematic structural diagram of a file migration apparatus for a MapReduce distributed system according to another embodiment of the present invention.
[Detailed description]
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
In addition, the term "and/or" in this document describes only an association between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A alone, both A and B, and B alone. The character "/" in this document generally indicates an "or" relationship between the objects before and after it.
Fig. 1 is a schematic flowchart of a file migration method for a MapReduce distributed system according to an embodiment of the present invention.
101. Start a migration job for migrating a target file. The migration job includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task. The target file includes at least first data and second data; the first data is stored in at least one first data block, and the second data is stored in at least one second data block.
102. In the first Map task, copy the first data to a target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block.
103. In the second Map task, copy the second data to the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block.
104. In the Reduce task, generate metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block.
It should be noted that 101 to 104 may be performed by a MapReduce distributed system, for example, the target MapReduce distributed system or a separate MapReduce distributed system.
In this way, a migration job for migrating a target file is started, and the migration job includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the two Map tasks. The first Map task copies the first data to the target MapReduce distributed system according to the identification information of the target file and of the at least one first data block, and the second Map task copies the second data to the target MapReduce distributed system according to the identification information of the target file and of the at least one second data block. The Reduce task can then generate the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, of the at least one first data block, and of the at least one second data block. Because the migration job for a single target file includes at least the first Map task and the second Map task, and the two Map tasks are executed in parallel, the migration time of the target file can be shortened and its migration efficiency improved.
In existing file migration methods for MapReduce distributed systems, the migration job for a single target file contains only one Map task. That is, migration is performed in units of files, with each file migrated within a single Map task, so migration efficiency is low.
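Steps 101 to 104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a thread pool stands in for parallel Map tasks, plain dictionaries stand in for the source and target file systems, and the names `map_task`, `reduce_task`, and `run_migration_job` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def map_task(path, offset, source_blocks, target_blocks):
    """Copy the block of `path` that starts at `offset` into the target system."""
    target_blocks[(path, offset)] = source_blocks[(path, offset)]
    return path, offset  # key/value pair handed to the Reduce task

def reduce_task(path, offsets):
    """Generate target-side metadata: the file's blocks listed in offset order."""
    return {path: sorted(offsets)}

def run_migration_job(path, source_blocks):
    target_blocks = {}  # stand-in for the target system's block storage
    offsets = [off for (p, off) in source_blocks if p == path]
    # One Map task per block, all submitted at once so they can run in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(offsets))) as pool:
        futures = [pool.submit(map_task, path, off, source_blocks, target_blocks)
                   for off in offsets]
        emitted = [f.result() for f in futures]
    # The Reduce task runs only after every Map task has finished.
    metadata = reduce_task(path, [off for _, off in emitted])
    return target_blocks, metadata

# Assumed toy data: a file made of two blocks at offsets 0 and 6.
source_blocks = {("file1", 0): b"first ", ("file1", 6): b"second"}
target_blocks, metadata = run_migration_job("file1", source_blocks)
```

The design point is that each Map task needs only the file's identification information and one block identifier, so the tasks share no state beyond the target store and can be scheduled independently.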
Optionally, in a possible implementation of this embodiment, the identification information of the target file may include the path information under which the target file is stored in the file system of the target MapReduce distributed system.
Optionally, in a possible implementation of this embodiment, the identification information of the at least one first data block may be the offset, within the target file, of the start position of the at least one first data block; correspondingly, the identification information of the at least one second data block is the offset, within the target file, of the start position of the at least one second data block.
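To illustrate how a start-position offset can identify a block, the sketch below lists the offsets of consecutive blocks in a file. The fixed block size is an assumption made for the example; the source does not state how blocks are sized, and `block_offsets` is a hypothetical helper.

```python
def block_offsets(file_size, block_size):
    """Return the start offset of every block in a file of `file_size` bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return list(range(0, file_size, block_size))

# Assumed example: a 300-byte file stored in 128-byte blocks.
offsets = block_offsets(file_size=300, block_size=128)
```

With these inputs the file has three blocks, starting at offsets 0, 128, and 256; the last block is shorter than the others.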
Optionally, in a possible implementation of this embodiment, in 104, the metadata in the mapping table maintained by the target MapReduce distributed system may specifically be modified in the Reduce task according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, so as to generate the metadata of the target file in the target MapReduce distributed system.
In this implementation, the Reduce task only needs to modify (for example, merge) the metadata in the mapping table maintained by the target MapReduce distributed system, and does not need to rewrite the first data and the second data included in the target file, which further improves the migration efficiency of the target file. The metadata may be the file information of the target file together with the data block information of the first data block storing the first data and of the second data block storing the second data.
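One way to picture this metadata, and a merge that touches only metadata, is the following Python sketch. The `BlockInfo` and `FileMetadata` structures, their fields, and the block identifiers are assumptions made for illustration; the source says only that the metadata covers file information and data block information.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BlockInfo:
    offset: int    # start position of the block within the target file
    length: int    # number of bytes stored in the block
    block_id: str  # hypothetical identifier of the block in the target system

@dataclass
class FileMetadata:
    path: str                                   # file information: storage path
    blocks: list = field(default_factory=list)  # block information, in offset order

    @property
    def size(self):
        return sum(b.length for b in self.blocks)

def merge_blocks(path, copied_blocks):
    """Reduce-side merge: only metadata is built, no block data is rewritten."""
    ordered = sorted(copied_blocks, key=lambda b: b.offset)
    expected = 0
    for b in ordered:
        # The blocks must tile the file without gaps or overlaps.
        if b.offset != expected:
            raise ValueError(f"gap or overlap at offset {b.offset}")
        expected += b.length
    return FileMetadata(path=path, blocks=ordered)

meta = merge_blocks("file1", [BlockInfo(6, 6, "blk_b"), BlockInfo(0, 6, "blk_a")])
```

Because the blocks copied by the Map tasks stay where they are, the cost of this step depends on the number of blocks rather than on the size of the file.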
Optionally, in a possible implementation of this embodiment, in 104, the first data and the second data may instead be copied again into a new file in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block, the identification information of the at least one second data block, and the metadata in the mapping table maintained by the target MapReduce distributed system. The new file serves as the target file, and metadata of the new file in the target MapReduce distributed system is generated.
In this implementation, the Reduce task needs to rewrite the first data and the second data included in the target file, and then generate the metadata of the rewritten new file in the target MapReduce distributed system.
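The rewrite variant can be sketched as follows, assuming each Map task left behind the bytes it copied, keyed by the block's start offset. An in-memory buffer stands in for the new file, and `rewrite_as_new_file` and the metadata fields are hypothetical names.

```python
import io

def rewrite_as_new_file(path, parts):
    """Copy every part into one new file; `parts` maps start offset -> bytes."""
    out = io.BytesIO()
    for offset in sorted(parts):
        out.seek(offset)          # place each part at its offset in the file
        out.write(parts[offset])
    data = out.getvalue()
    metadata = {"path": path, "size": len(data)}  # metadata of the new file
    return data, metadata

data, metadata = rewrite_as_new_file("file1", {6: b"second", 0: b"first "})
```

Unlike the metadata-only merge, this variant copies the data a second time, so it trades extra I/O for a target file that is written in one contiguous pass.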
Further, in the Reduce task, other metadata that conflicts with the generated metadata may also be deleted, which further improves the reliability of target file migration.
Optionally, in a possible implementation of this embodiment, the migration job may specifically be used to migrate the target file from a source MapReduce distributed system to the target MapReduce distributed system. That is, before 101, the target file is stored in the file system of the source MapReduce distributed system; after 104, the target file has been written to the file system of the target MapReduce distributed system.
To make the method provided in the embodiments of the present invention clearer, a Hadoop system is used as an example below. The file system of this MapReduce distributed system is the Hadoop Distributed File System (HDFS). As shown in Fig. 2, assume that target file 1 (file name: file 1; storage path: file1) is stored in data block 1, data block 2, data block 3, data block 4, and data block 5 in the HDFS file system of Hadoop system A. Target file 1 includes data 1, data 2, data 3, data 4, and data 5: data 1 is stored in data block 1, data 2 in data block 2, data 3 in data block 3, data 4 in data block 4, and data 5 in data block 5.
Hadoop system A maintains a mapping table A, which contains the metadata related to the target file, as follows:
File 1, [data block 1, data block 2, data block 3, data block 4, data block 5];
The migration apparatus starts a migration job that includes Map task 1, Map task 2, Map task 3, Map task 4, and Map task 5, which are executed in parallel, and a corresponding Reduce task. The key (Key) and value (Value) of Map task N (N = 1, 2, 3, 4, 5) are, respectively, the identification information file1 of the target file and data block N. Specifically,
In Map task 1, data 1 is copied to Hadoop system B according to the identification information file1 of the target file and the identification information offset1 of data block 1;
In Map task 2, data 2 is copied to Hadoop system B according to the identification information file1 of the target file and the identification information offset2 of data block 2;
In Map task 3, data 3 is copied to Hadoop system B according to the identification information file1 of the target file and the identification information offset3 of data block 3;
In Map task 4, data 4 is copied to Hadoop system B according to the identification information file1 of the target file and the identification information offset4 of data block 4; and
In Map task 5, data 5 is copied to Hadoop system B according to the identification information file1 of the target file and the identification information offset5 of data block 5.
Hadoop system B maintains a mapping table B, which contains the metadata related to the target file, as follows:
File 1, [data block 1];
File 2, [data block 2];
File 3, [data block 3];
File 4, [data block 4];
File 5, [data block 5];
The key (Key) and value (Value) of the Reduce task are, respectively, the identification information file1 of the target file and data block N. Specifically,
In the Reduce task, the metadata in mapping table B maintained by Hadoop system B is modified according to the identification information file1 of the target file, the identification information offset1 of data block 1, the identification information offset2 of data block 2, the identification information offset3 of data block 3, the identification information offset4 of data block 4, and the identification information offset5 of data block 5, so as to generate the metadata of the target file in Hadoop system B. After the modification, the metadata related to the target file is as follows:
File 1, [data block 1, data block 2, data block 3, data block 4, data block 5];
Other metadata in mapping table B that conflicts with this metadata is also deleted. The deleted metadata is as follows:
File 1, [data block 1];
File 2, [data block 2];
File 3, [data block 3];
File 4, [data block 4];
File 5, [data block 5];
At this point, the target file has been migrated from Hadoop system A to Hadoop system B; that is, target file 1 is stored in data block 1, data block 2, data block 3, data block 4, and data block 5 in the HDFS file system of Hadoop system B.
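The change to mapping table B in this example can be sketched in Python as below. A plain dictionary stands in for the mapping table, and the temporary entry names `file1.part1` to `file1.part5` are hypothetical labels for the five single-block entries created by the Map tasks; the source does not say how those entries are named internally.

```python
def finalize_mapping_table(table, path, part_keys):
    """Merge single-block entries into one entry for `path` and delete them.

    `part_keys` lists the temporary entries in block order (offset1..offset5).
    """
    merged = []
    for key in part_keys:
        merged.extend(table.pop(key))  # remove the now-conflicting entry
    table[path] = merged               # single entry covering the whole file
    return table

# Mapping table B after the five Map tasks have finished.
part_keys = [f"file1.part{n}" for n in range(1, 6)]
table_b = {key: [f"data block {n}"] for n, key in enumerate(part_keys, start=1)}

finalize_mapping_table(table_b, "file1", part_keys)
```

After the call, the table holds a single entry for `file1` that lists data block 1 through data block 5 in order, and none of the temporary entries remain.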
In this embodiment, a migration job for migrating a target file is started, and the migration job includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the two Map tasks. The first Map task copies the first data to the target MapReduce distributed system according to the identification information of the target file and of the at least one first data block, and the second Map task copies the second data to the target MapReduce distributed system according to the identification information of the target file and of the at least one second data block. The Reduce task can then generate the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, of the at least one first data block, and of the at least one second data block. Because the migration job for a single target file includes at least the first Map task and the second Map task, and the two Map tasks are executed in parallel, the migration time of the target file can be shortened and its migration efficiency improved.
In addition, if migration of the target file fails, only the data stored in the data blocks whose migration failed needs to be migrated again, and the entire target file does not need to be migrated again, which further improves the migration efficiency of the target file.
It should be noted that, for brevity, the foregoing method embodiments are described as a series of action combinations. However, a person skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. A person skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, refer to the related descriptions in other embodiments.
Fig. 3 is a schematic structural diagram of a file migration apparatus for a MapReduce distributed system according to another embodiment of the present invention. As shown in Fig. 3, the file migration apparatus provided in this embodiment may include a start unit 31, a first Map task execution unit 32, a second Map task execution unit 33, and a Reduce task execution unit 34. The start unit 31 is configured to start a migration job for migrating a target file, where the migration job includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task; the target file includes at least first data and second data, the first data is stored in at least one first data block, and the second data is stored in at least one second data block. The first Map task execution unit 32 is configured to, in the first Map task, copy the first data to a target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block. The second Map task execution unit 33 is configured to, in the second Map task, copy the second data to the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block. The Reduce task execution unit 34 is configured to, in the Reduce task, generate metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block.
It should be noted that the file migration apparatus provided in this embodiment may be a MapReduce distributed system, for example, the target MapReduce distributed system or a separate MapReduce distributed system.
In this way, starting the migration operation for migrating file destination by start unit, at least wrapped in the migration operation It is corresponding containing the first Map task executed parallel and the 2nd Map task and the first Map task and the 2nd Map task Reduce task so that the first Map task executing units are in the first Map task, according to the file destination The identification information of identification information and at least one first data block, by first data copy to target MapReduce In distributed system and the 2nd Map task executing units are in the 2nd Map task, according to the mark of the file destination The identification information for knowing information and at least one second data block, by second data copy to target MapReduce point In cloth system, so that Reduce task executing units are in the Reduce task, it can be according to the mark of the file destination Know information, the identification information of at least one first data block and the identification information of at least one second data block, it is raw At metadata of the file destination in the target MapReduce distributed system, due to moving for one file destination of migration Shifting task includes at least the first Map task and the 2nd Map task, and the first Map task is with the 2nd Map task It executes parallel, therefore, the transit time of the file destination can be shortened, to improve the transport efficiency of file destination.
In an existing file migration apparatus for a MapReduce distributed system, the migration task that migrates one target file contains only one Map task. That is, migration is performed file by file, with each file handled entirely within a single Map task, so the migration efficiency is low.
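The difference between the two approaches can be estimated roughly as follows. This is a simplified model and not part of the original disclosure: it assumes that copying block i takes time t_i, that the n blocks are partitioned into sets B_1, ..., B_k handled by k parallel Map tasks, and that scheduling overhead and contention for shared network bandwidth are ignored.

```latex
T_{\text{single}} \approx \sum_{i=1}^{n} t_i ,
\qquad
T_{\text{parallel}} \approx \max_{1 \le j \le k} \sum_{i \in B_j} t_i \;+\; T_{\text{reduce}}
```

For example, with five blocks of equal copy time t and one Map task per block, the estimate drops from 5t to roughly t plus the time of the Reduce task.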
Optionally, in a possible implementation of this embodiment, the identification information of the target file may include path information indicating where the target file is stored in the file system of the target MapReduce distributed system.
Optionally, in a possible implementation of this embodiment, the identification information of the at least one first data block may be the offset, within the target file, of the start position of the at least one first data block. Correspondingly, the identification information of the at least one second data block may be the offset, within the target file, of the start position of the at least one second data block.
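As an illustration of offset-based block identification, the following sketch computes the start offset of every block of a file. It assumes that the file is split into fixed-size blocks, which the text above does not require; the function name is hypothetical.

```python
def block_start_offsets(file_length, block_size):
    """Return the start offset, within the file, of each fixed-size block."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # The last block may be shorter than block_size; its offset is still a multiple of it.
    return list(range(0, file_length, block_size))
```

Together with the path of the target file, each offset is enough for a Map task to locate the byte range it has to copy.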
Optionally, in a possible implementation of this embodiment, the Reduce task execution unit 34 may be specifically configured to, in the Reduce task, modify the metadata in the mapping table maintained by the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, so as to generate the metadata of the target file in the target MapReduce distributed system.
In this implementation, the Reduce task execution unit 34 only needs to modify, for example merge, the metadata in the mapping table maintained by the target MapReduce distributed system in the Reduce task. It does not need to rewrite the first data and the second data included in the target file, which can further improve the migration efficiency of the target file. The metadata may include the file information of the target file, and the data block information of the first data block that stores the first data and of the second data block that stores the second data of the target file.
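A metadata-only merge of this kind might look like the sketch below. The layout of the mapping table is an assumption made for illustration: a dictionary from path to a list of (offset in target file, block identifier) entries, in which each Map task has registered its copied blocks under a hypothetical temporary path of its own.

```python
def merge_metadata(mapping_table, target_path, part_paths):
    """Merge the block entries of the per-task paths into one entry for target_path.

    Only the mapping table is changed; no block payload is read or written.
    """
    merged = []
    for part in part_paths:
        # Remove the temporary entry and take over its block list.
        merged.extend(mapping_table.pop(part))
    merged.sort(key=lambda entry: entry[0])  # order the blocks by offset in the target file
    mapping_table[target_path] = merged
    return merged
```

Because only table entries move, the cost of this step does not depend on the size of the migrated data.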
Optionally, in a possible implementation of this embodiment, the Reduce task execution unit 34 may further be specifically configured to, in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, and according to the metadata in the mapping table maintained by the target MapReduce distributed system, copy the first data and the second data again into a new file that serves as the target file, and generate metadata of the new file in the target MapReduce distributed system.
In this implementation, the Reduce task execution unit 34 needs to rewrite, in the Reduce task, the first data and the second data included in the target file, and then generate the metadata of the rewritten new file in the target MapReduce distributed system.
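The rewriting variant can be sketched in the same style. The block store (a dictionary from block identifier to payload) and the shape of the returned metadata are assumptions made for illustration, not structures defined by the patent.

```python
def rewrite_as_new_file(block_store, block_entries):
    """Copy the block payloads, in offset order, into one new file.

    block_entries is a list of (offset_in_target_file, block_id) pairs.
    Returns the bytes of the new file and its metadata.
    """
    data = bytearray()
    new_blocks = []
    for offset, block_id in sorted(block_entries):
        if offset != len(data):
            # The copied blocks must cover the file without gaps or overlaps.
            raise ValueError(f"gap or overlap at offset {offset}")
        payload = block_store[block_id]
        data.extend(payload)
        new_blocks.append((offset, len(payload)))
    return bytes(data), {"length": len(data), "blocks": new_blocks}
```

Unlike the metadata-only merge, this variant reads and writes every byte of the file once more, which is why the text treats the merge as the more efficient option.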
Further, in the Reduce task, the Reduce task execution unit 34 may also delete other metadata that conflicts with the generated metadata, which can further improve the reliability of the migration of the target file.
Optionally, in a possible implementation of this embodiment, the migration operation may specifically be used to migrate the target file from a source MapReduce distributed system to the target MapReduce distributed system. That is, before the file migration apparatus provided in this embodiment performs its operations, the target file is stored in the file system of the source MapReduce distributed system; after the file migration apparatus performs its operations, the target file has been written into the file system of the target MapReduce distributed system.
To make the method provided in the embodiments of the present invention clearer, a Hadoop system is used as an example below. The file system of this MapReduce distributed system is the Hadoop Distributed File System (HDFS). As shown in Fig. 2, assume that target file 1 (file name: file 1; storage path: file1) is stored in data block 1, data block 2, data block 3, data block 4, and data block 5 in the HDFS file system of Hadoop system A. Target file 1 includes data 1, data 2, data 3, data 4, and data 5: data 1 is stored in data block 1, data 2 in data block 2, data 3 in data block 3, data 4 in data block 4, and data 5 in data block 5. For a detailed description, refer to the related content in the embodiment corresponding to Fig. 1; details are not repeated here.
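For the five-block example above, one possible way of dividing the data blocks among Map tasks is sketched below. The contiguous, near-equal split and the function name are assumptions made for illustration; this passage does not prescribe how blocks are assigned to Map tasks.

```python
def assign_blocks(block_ids, num_map_tasks):
    """Split block ids into contiguous, near-equal groups, one group per Map task."""
    if num_map_tasks <= 0:
        raise ValueError("num_map_tasks must be positive")
    base, extra = divmod(len(block_ids), num_map_tasks)
    groups, start = [], 0
    for task_index in range(num_map_tasks):
        # The first `extra` tasks each take one additional block.
        end = start + base + (1 if task_index < extra else 0)
        groups.append(block_ids[start:end])
        start = end
    return groups
```

With two Map tasks, data blocks 1 to 3 would go to the first task and data blocks 4 and 5 to the second; with five Map tasks, each task would copy exactly one block.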
In this embodiment, the start unit starts a migration operation for migrating the target file, and the migration operation includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task. In the first Map task, the first Map task execution unit copies the first data into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one first data block. In the second Map task, the second Map task execution unit copies the second data into the target MapReduce distributed system according to the identification information of the target file and the identification information of the at least one second data block. In the Reduce task, the Reduce task execution unit can then generate the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block. Because the migration task that migrates one target file includes at least the first Map task and the second Map task, and the two Map tasks are executed in parallel, the migration time of the target file can be shortened, which improves the migration efficiency of the target file.
In addition, if the migration of the target file fails, only the data stored in the data blocks whose migration failed needs to be migrated again, without migrating the entire target file again, which can further improve the migration efficiency of the target file.
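Block-level retry can be sketched as follows. The `copy_block` callback, the use of `OSError` to signal a failed copy, and the attempt limit are all assumptions made for illustration.

```python
def migrate_with_retry(block_ids, copy_block, max_attempts=3):
    """Copy every block; on failure, retry only the blocks that failed.

    copy_block(block_id) is a caller-supplied function that raises OSError on failure.
    Returns the list of blocks that still failed after max_attempts passes.
    """
    pending = list(block_ids)
    for _ in range(max_attempts):
        failed = []
        for block_id in pending:
            try:
                copy_block(block_id)
            except OSError:
                failed.append(block_id)
        if not failed:
            return []
        pending = failed  # blocks that were copied successfully are never copied again
    return pending
```

A single transient failure therefore costs one extra block copy rather than a second copy of the whole file.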
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system, apparatus, and units described above may be found in the corresponding processes of the foregoing method embodiments; details are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A file migration method for a MapReduce distributed system, characterized by comprising:
starting a migration operation for migrating a target file, wherein the migration operation includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task, the target file includes at least first data and second data, the first data is stored in at least one first data block, and the second data is stored in at least one second data block;
in the first Map task, copying the first data into a target MapReduce distributed system according to identification information of the target file and identification information of the at least one first data block;
in the second Map task, copying the second data into the target MapReduce distributed system according to the identification information of the target file and identification information of the at least one second data block; and
in the Reduce task, generating metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block.
2. The method according to claim 1, characterized in that the identification information of the target file includes path information indicating where the target file is stored in the file system of the target MapReduce distributed system.
3. The method according to claim 1 or 2, characterized in that:
the identification information of the at least one first data block is the offset, within the target file, of the start position of the at least one first data block; and
the identification information of the at least one second data block is the offset, within the target file, of the start position of the at least one second data block.
4. The method according to claim 1 or 2, characterized in that the generating, in the Reduce task, of the metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block comprises:
in the Reduce task, modifying, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, the metadata in a mapping table maintained by the target MapReduce distributed system, so as to generate the metadata of the target file in the target MapReduce distributed system; or
in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, and according to the metadata in the mapping table maintained by the target MapReduce distributed system, copying the first data and the second data again into a new file that serves as the target file, and generating metadata of the new file in the target MapReduce distributed system.
5. The method according to claim 1 or 2, characterized in that the migration operation is specifically used for:
migrating the target file from a source MapReduce distributed system to the target MapReduce distributed system.
6. A file migration apparatus for a MapReduce distributed system, characterized by comprising:
a start unit, configured to start a migration operation for migrating a target file, wherein the migration operation includes at least a first Map task and a second Map task that are executed in parallel, and a Reduce task corresponding to the first Map task and the second Map task, the target file includes at least first data and second data, the first data is stored in at least one first data block, and the second data is stored in at least one second data block;
a first Map task execution unit, configured to copy, in the first Map task, the first data into a target MapReduce distributed system according to identification information of the target file and identification information of the at least one first data block;
a second Map task execution unit, configured to copy, in the second Map task, the second data into the target MapReduce distributed system according to the identification information of the target file and identification information of the at least one second data block; and
a Reduce task execution unit, configured to generate, in the Reduce task, metadata of the target file in the target MapReduce distributed system according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block.
7. The apparatus according to claim 6, characterized in that the identification information of the target file includes path information indicating where the target file is stored in the file system of the target MapReduce distributed system.
8. The apparatus according to claim 6 or 7, characterized in that:
the identification information of the at least one first data block is the offset, within the target file, of the start position of the at least one first data block; and
the identification information of the at least one second data block is the offset, within the target file, of the start position of the at least one second data block.
9. The apparatus according to claim 6 or 7, characterized in that the Reduce task execution unit is specifically configured to:
in the Reduce task, modify, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, the metadata in a mapping table maintained by the target MapReduce distributed system, so as to generate the metadata of the target file in the target MapReduce distributed system; or
in the Reduce task, according to the identification information of the target file, the identification information of the at least one first data block, and the identification information of the at least one second data block, and according to the metadata in the mapping table maintained by the target MapReduce distributed system, copy the first data and the second data again into a new file that serves as the target file, and generate metadata of the new file in the target MapReduce distributed system.
10. The apparatus according to claim 6 or 7, characterized in that the migration operation is specifically used for:
migrating the target file from a source MapReduce distributed system to the target MapReduce distributed system.
CN201310090660.9A 2013-03-20 2013-03-20 The file migration method and apparatus of MapReduce distributed system Active CN103176843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310090660.9A CN103176843B (en) 2013-03-20 2013-03-20 The file migration method and apparatus of MapReduce distributed system

Publications (2)

Publication Number Publication Date
CN103176843A CN103176843A (en) 2013-06-26
CN103176843B true CN103176843B (en) 2018-12-14

Family

ID=48636744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310090660.9A Active CN103176843B (en) 2013-03-20 2013-03-20 The file migration method and apparatus of MapReduce distributed system

Country Status (1)

Country Link
CN (1) CN103176843B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant