CN107609061A

CN107609061A - Method and apparatus for data synchronization

Info

Publication number: CN107609061A
Application number: CN201710750922.8A
Authority: CN
Inventors: 李贵荣; 黄承松; 夏里峰; 宋书俊
Original assignee: Wuhan Chimy Network Technology Co Ltd
Current assignee: Wuhan Chimy Network Technology Co Ltd
Priority date: 2017-08-28
Filing date: 2017-08-28
Publication date: 2018-01-19

Abstract

The present invention provides a method and apparatus for data synchronization. MapReduce passes the data of each data transfer task, one by one, to the DataX on different nodes in a Hadoop cluster and starts the DataX on those nodes; the DataX on the different nodes then transfers data from the source to each destination, completing the data synchronization from the source to each destination. The multiple data transfer tasks in the data synchronization task list are thereby distributed among the DataX on different nodes in the Hadoop cluster. This avoids the insufficient memory and the restricted network transfer speed that arise when a single machine executes multiple data transfer tasks simultaneously, and improves the efficiency of data synchronization.

Description

Method and apparatus for data synchronization

Technical field

The present invention relates to the field of computer technology, and more particularly to a method and apparatus for data synchronization.

Background art

DataX is a tool for high-speed data exchange between heterogeneous databases and file systems; it can synchronize data between arbitrary data systems.

To solve the problem of data synchronization between heterogeneous data sources, DataX turns a complex mesh of data synchronization links into a star-shaped data synchronization link, in which DataX serves as the intermediate transport carrier connecting the various data sources. Typically, a terminal with DataX installed acts as the task machine: it receives data from the source and sends it to the destination. The data transfer is completed within a single process on the task machine and operates in the task machine's memory, without reading or writing the disk.

Using a single terminal as the task machine has two drawbacks. First, because the data transfer is completed within a single process and relies on the task machine's memory, the single machine often runs short of memory when DataX executes multiple data transfer tasks simultaneously. Second, each data transfer task usually moves a large amount of data, and a single terminal acting as the task machine is limited by its network bandwidth, so it cannot meet the network transfer speed required to execute multiple data transfer tasks simultaneously. Both drawbacks reduce the efficiency of data synchronization.

Summary of the invention

To overcome the above problems, or at least partially solve them, the present invention provides a method and apparatus for data synchronization.

According to one aspect of the present invention, a method of data synchronization is provided, comprising: uploading a data synchronization task list to a distributed file system, the data synchronization task list comprising the data transfer tasks from a source to each destination; uploading DataX to each node in a Hadoop cluster; obtaining, by MapReduce, the data of each data transfer task from the data synchronization task list, and passing the data of each data transfer task one by one to the DataX on different nodes in the Hadoop cluster; and starting, by MapReduce, the DataX on the different nodes in the Hadoop cluster, the DataX on the different nodes carrying out the data transfer from the source to each destination, so as to complete the data synchronization from the source to each destination.

Before the data synchronization task list is uploaded to the distributed file system, the method further comprises: obtaining the address information of the source and the address information of each destination; and determining the data synchronization task list according to the address information of the source and the address information of each destination.

Determining the data synchronization task list according to the address information of the source and the address information of each destination comprises: writing the address information of the source and the address information of each destination into the data synchronization task list in sequence, following the format of the DataX task configuration file.

Before DataX is uploaded to each node in the Hadoop cluster, the method further comprises: obtaining the data type information of the source and the data type information of each destination; and configuring DataX according to the data type information of the source and the data type information of each destination.

Configuring DataX according to the data type information of the source and the data type information of each destination comprises: adding, according to the data type information of the source, a plugin to the data read module of DataX, so that DataX supports reading the data type of the source; and adding, according to the data type information of each destination, a plugin to the data write module of DataX, so that DataX supports writing the data type of each destination.

Obtaining, by MapReduce, the data of each data transfer task from the data synchronization task list comprises: customizing the InputFormat class and the RecordReader class of MapReduce according to the data format of each data transfer task; dividing the data of the data synchronization task list by the InputFormat class of MapReduce; and reading the data of each data transfer task in sequence by the RecordReader class of MapReduce.

Starting, by MapReduce, the DataX on different nodes in the Hadoop cluster, and carrying out the data transfer from the source to each destination by the DataX on the different nodes, comprises: starting the DataX on the different nodes in the Hadoop cluster by the Mapper class of MapReduce, according to the storage path of the DataX on each node, so that each DataX transfers the data of the source to a destination according to the address information of the source and the address information of that destination.

According to another aspect of the present invention, an apparatus for data synchronization is provided, comprising: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the above method.

According to another aspect of the present invention, a computer program product is provided. The computer program product comprises a computer program stored on a non-transitory computer-readable storage medium; the computer program comprises program instructions which, when executed by a computer, cause the computer to perform the above method.

According to another aspect of the present invention, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium stores a computer program that causes a computer to perform the above method.

In the method and apparatus for data synchronization provided by the present invention, MapReduce passes the data of each data transfer task one by one to the DataX on different nodes in the Hadoop cluster and starts the DataX on those nodes; the DataX on the different nodes then carries out the data transfer from the source to each destination, completing the data synchronization from the source to each destination. The multiple data transfer tasks in the data synchronization task list are thereby distributed among the DataX on different nodes in the Hadoop cluster, which avoids the insufficient memory and the restricted network transfer speed that arise when a single machine executes multiple data transfer tasks simultaneously, and improves the efficiency of data synchronization.

Brief description of the drawings

To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow chart of the method of data synchronization according to an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

To aid understanding, the overall idea of the method of data synchronization provided by the embodiments of the present invention is as follows. A Hadoop cluster has a large number of distributed nodes. The question addressed by the method is how to use the resources of the Hadoop cluster nodes to complete the multiple data transfer tasks in the data synchronization task list simultaneously, and thereby improve the efficiency of data synchronization.

The implementation of the method is described below using the Hadoop cluster environment of a cloud computing platform as an example, but the method provided by the embodiments of the present invention is not limited to a Hadoop cluster environment.

In one embodiment of the present invention, with reference to Fig. 1, a method of data synchronization is provided, comprising: S11, uploading a data synchronization task list to a distributed file system, the data synchronization task list comprising the data transfer tasks from a source to each destination; S12, uploading DataX to each node in a Hadoop cluster; S13, obtaining, by MapReduce, the data of each data transfer task from the data synchronization task list, and passing the data of each data transfer task one by one to the DataX on different nodes in the Hadoop cluster; S14, starting, by MapReduce, the DataX on the different nodes in the Hadoop cluster, the DataX on the different nodes carrying out the data transfer from the source to each destination, so as to complete the data synchronization from the source to each destination.

Specifically, Hadoop is a framework for distributed data storage and computation. Its bottom layer is the Hadoop Distributed File System (HDFS), which stores the files on all nodes in the Hadoop cluster; the layer above HDFS is the MapReduce engine.

MapReduce is a computation model and framework for the parallel processing of big data. The term carries the following three meanings:

1) MapReduce is a cluster-based high-performance parallel computing platform (Cluster Infrastructure). It allows ordinary commodity servers to form a distributed, parallel computing cluster of tens, hundreds, or even thousands of nodes.

2) MapReduce is a software framework for parallel computation and execution (Software Framework). It provides a large but well-designed parallel computing framework that automatically parallelizes computing tasks, automatically partitions the data and the computing tasks, automatically assigns and executes tasks on cluster nodes, and collects the results. The many low-level system details involved in parallel computation, such as distributed data storage, data communication, and fault tolerance, are handled by the system, which greatly reduces the burden on software developers.

3) MapReduce is a parallel programming model and method (Programming Model & Methodology). Drawing on the design ideas of the functional programming language Lisp, it provides a simple parallel programming method: basic parallel computing tasks are implemented with two functions, Map and Reduce, and abstract operations and parallel programming interfaces are provided, so that programming and computation over large-scale data can be completed simply and conveniently.

Thus, an important capability of MapReduce is that it splits a job into tasks according to the definitions of the framework and distributes the tasks to the individual nodes for processing, so that the tasks run in parallel.
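As a rough illustration of this split-and-distribute behaviour, the following Python sketch divides a list of tasks into chunks and processes the chunks in parallel worker threads. It is a toy stand-in, not Hadoop: the names `split_tasks` and `run_in_parallel` and the use of a thread pool are assumptions made only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_tasks(tasks, num_splits):
    """Divide tasks into at most num_splits roughly equal, ordered chunks."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    size, extra = divmod(len(tasks), num_splits)
    chunks, start = [], 0
    for i in range(num_splits):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few tasks
            chunks.append(tasks[start:end])
        start = end
    return chunks


def run_in_parallel(tasks, worker, num_workers=3):
    """Apply worker to every task, one chunk per simulated node."""
    chunks = split_tasks(tasks, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each chunk plays the role of the input handed to one node.
        per_chunk = pool.map(lambda chunk: [worker(t) for t in chunk], chunks)
        return [result for chunk in per_chunk for result in chunk]
```

In a real cluster the chunks would be input splits scheduled onto different machines, which is what relieves the memory and bandwidth pressure on any single machine.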

A data synchronization task synchronizes the data of a source to different destinations. Its characteristic is high-speed data transfer between heterogeneous databases and file systems; that is, the transfer between the source and each destination constitutes one data transfer task.

HDFS supports a traditional hierarchical file organization. A user or a program can create directories and store files in them. The namespace hierarchy of the file system is similar to that of other file systems: files can be created, moved from one directory to another, or renamed. For a data synchronization task, the data transfer tasks from the source to each destination can be written into a data synchronization task list, which is uploaded to HDFS for MapReduce to read.

Meanwhile, the transfer between the source and each destination is a high-speed data transfer between heterogeneous databases and file systems, which must be carried out by DataX. DataX therefore needs to be uploaded to each node in the Hadoop cluster, so that every node in the Hadoop cluster can run DataX.
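The upload of the task list to HDFS could, for instance, be done with the standard `hdfs dfs -put` shell command. The Python sketch below only builds that command line; the helper name `build_hdfs_put_command` and the example paths are hypothetical, and the embodiment does not say which upload mechanism is actually used.

```python
import shlex


def build_hdfs_put_command(local_path, hdfs_dir, overwrite=False):
    """Return the argument list for uploading one local file to HDFS."""
    command = ["hdfs", "dfs", "-put"]
    if overwrite:
        command.append("-f")  # replace an existing file of the same name
    command += [local_path, hdfs_dir]
    return command


if __name__ == "__main__":
    # Hypothetical paths; the command is printed rather than executed.
    cmd = build_hdfs_put_command("task_list.txt", "/datax/tasks/", overwrite=True)
    print(shlex.join(cmd))
```

The argument list can be handed to `subprocess.run` on a machine where the Hadoop client is installed.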

MapReduce obtains the data of each data transfer task from the data synchronization task list and passes the data of each data transfer task one by one to the DataX on different nodes in the Hadoop cluster. MapReduce then starts the DataX on the different nodes, and the DataX on the different nodes carries out the data transfer from the source to each destination. Once all data transfer tasks in the data synchronization task list have been completed, the data synchronization from the source to each destination is complete.

In this embodiment, MapReduce passes the data of each data transfer task one by one to the DataX on different nodes in the Hadoop cluster and starts the DataX on those nodes; the DataX on the different nodes then carries out the data transfer from the source to each destination, completing the data synchronization from the source to each destination. The multiple data transfer tasks in the data synchronization task list are thereby distributed among the DataX on different nodes in the Hadoop cluster, which avoids the insufficient memory and the restricted network transfer speed that arise when a single machine executes multiple data transfer tasks simultaneously, and improves the efficiency of data synchronization.

Based on the above embodiment, before the data synchronization task list is uploaded to the distributed file system, the method further comprises: obtaining the address information of the source and the address information of each destination; and determining the data synchronization task list according to the address information of the source and the address information of each destination. Determining the data synchronization task list in this way comprises: writing the address information of the source and the address information of each destination into the data synchronization task list in sequence, following the format of the DataX task configuration file.

Specifically, a DataX task configuration file must follow a specific format. Each data transfer task is therefore drawn up in the format of the DataX task configuration file, and all data transfer tasks are written into a single data synchronization task list.

When DataX transfers data, it reads the data from the source and then writes it out to the destination. The address information of the source and the address information of each destination must therefore be obtained and then written into the data synchronization task list in sequence, following the format of the DataX task configuration file.
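A minimal Python sketch of building such a task list follows. The text does not give the concrete file format, so the layout here is an assumption loosely modelled on the JSON job files of the open-source DataX 3.0 release (`job.content[].reader` and `.writer`). The one-task-per-line convention, the placeholder `address` key, and the function names are hypothetical.

```python
import json


def make_transfer_task(source_address, destination_address):
    """Describe one source-to-destination transfer as a job dictionary."""
    return {
        "job": {
            "content": [
                {
                    # "address" is a placeholder key, not a real plugin parameter.
                    "reader": {"parameter": {"address": source_address}},
                    "writer": {"parameter": {"address": destination_address}},
                }
            ]
        }
    }


def build_task_list(source_address, destination_addresses):
    """Return the task list text: one JSON-encoded task per line, in order."""
    lines = [
        json.dumps(make_transfer_task(source_address, dest), sort_keys=True)
        for dest in destination_addresses
    ]
    return "\n".join(lines) + ("\n" if lines else "")
```

Keeping one task per line makes the later division of the list into individual data transfer tasks straightforward, since each line is a complete record.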

Based on the above embodiment, before DataX is uploaded to each node in the Hadoop cluster, the method further comprises: obtaining the data type information of the source and the data type information of each destination; and configuring DataX according to the data type information of the source and the data type information of each destination. Configuring DataX in this way comprises: adding, according to the data type information of the source, a plugin to the data read module of DataX, so that DataX supports reading the data type of the source; and adding, according to the data type information of each destination, a plugin to the data write module of DataX, so that DataX supports writing the data type of each destination.

Specifically, DataX achieves high-speed data exchange between heterogeneous databases and file systems with a Framework + Plugin architecture. The Framework handles most of the technical problems of high-speed data synchronization, such as buffering, flow control, concurrency, and context loading, and exposes a simple interface to the plugins; a plugin only needs to implement access to a data processing system. DataX has an open framework, so developers can use different plugins to quickly support data synchronization between the various databases and file systems of different sources and destinations. The data read module of DataX (the Reader module) is configured with a plugin matching the database or file system of the source, so that the source data can be read in; the data write module of DataX (the Writer module) is configured with a plugin matching the database or file system of the destination, so that the data in DataX can be written out to the destination.

The data type information of the source and the data type information of each destination are obtained. A plugin is added to the Reader module of DataX according to the data type information of the source, and a plugin is added to the Writer module of DataX according to the data type information of each destination, so that DataX supports reading the data type of the source and writing the data type of each destination.
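The selection of Reader and Writer plugins from the data type information can be sketched in Python as a simple lookup. The plugin names below follow the naming used by the open-source DataX 3.0 release (for example `mysqlreader`, `hdfswriter`); whether the DataX version contemplated here uses the same names is an assumption, and the mapping tables and the function `select_plugins` are hypothetical.

```python
# Assumed mapping from a data type label to DataX-style plugin names.
READER_PLUGINS = {"mysql": "mysqlreader", "oracle": "oraclereader", "hdfs": "hdfsreader"}
WRITER_PLUGINS = {"mysql": "mysqlwriter", "oracle": "oraclewriter", "hdfs": "hdfswriter"}


def select_plugins(source_type, destination_types):
    """Return the reader plugin and a sorted list of the writer plugins needed."""
    try:
        reader = READER_PLUGINS[source_type.lower()]
    except KeyError:
        raise ValueError(f"no reader plugin known for source type {source_type!r}")
    writers = set()
    for dest_type in destination_types:
        try:
            writers.add(WRITER_PLUGINS[dest_type.lower()])
        except KeyError:
            raise ValueError(f"no writer plugin known for destination type {dest_type!r}")
    # Several destinations of the same type share one writer plugin.
    return reader, sorted(writers)
```

The result tells which plugins must be present in the DataX package before it is uploaded to the cluster nodes.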

Based on the above embodiment, obtaining, by MapReduce, the data of each data transfer task from the data synchronization task list comprises: customizing the InputFormat class and the RecordReader class of MapReduce according to the data format of each data transfer task; dividing the data of the data synchronization task list by the InputFormat class of MapReduce; and reading the data of each data transfer task in sequence by the RecordReader class of MapReduce.

Specifically, when the input format of MapReduce is set, a corresponding InputFormat class must be customized to ensure that the input file is read according to the preset format. Because the MapReduce input format is the same as the format of the DataX task configuration file (that is, the data format of a data transfer task), the MapReduce InputFormat class must be customized according to the data format of the data transfer tasks. An InputFormat class customized in this way divides the data of the data synchronization task list into the data of multiple data transfer tasks. After that division, how the data of the individual data transfer tasks are read one by one from the data of the data synchronization task list is determined by the MapReduce RecordReader class, which is likewise customized according to the format of the DataX task configuration file. Each time the data of one data transfer task are read, the customized RecordReader class is called; it converts the data of the data transfer task into the key-value pair required by MapReduce and outputs it to DataX.
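The division of the task list and the record-by-record reading can be mimicked in plain Python. In Hadoop these would be Java subclasses of `InputFormat` and `RecordReader`; the classes below are hypothetical stand-ins for the idea, and they assume a task list with one JSON-encoded task per line, which is not stated in the text.

```python
import json


class TaskListInputFormat:
    """Plain-Python stand-in for a custom InputFormat: divides the task list."""

    def __init__(self, tasks_per_split=1):
        if tasks_per_split < 1:
            raise ValueError("tasks_per_split must be at least 1")
        self.tasks_per_split = tasks_per_split

    def get_splits(self, task_list_text):
        """Return a list of splits; each split is a list of (line_no, line)."""
        records = [
            (number, line)
            for number, line in enumerate(task_list_text.splitlines())
            if line.strip()  # ignore blank lines
        ]
        step = self.tasks_per_split
        return [records[i:i + step] for i in range(0, len(records), step)]


class TaskRecordReader:
    """Stand-in for a custom RecordReader: yields one key-value pair per task."""

    def __init__(self, split):
        self.split = split

    def __iter__(self):
        for line_no, line in self.split:
            # Key: position of the task in the list; value: the parsed task.
            yield line_no, json.loads(line)
```

With `tasks_per_split=1`, every data transfer task becomes its own split, so each task can be handed to a different node.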

Based on the above embodiment, starting, by MapReduce, the DataX on different nodes in the Hadoop cluster, and carrying out the data transfer from the source to each destination by the DataX on the different nodes, comprises: starting the DataX on the different nodes in the Hadoop cluster by the Mapper class of MapReduce, according to the storage path of the DataX on each node, so that each DataX transfers the data of the source to a destination according to the address information of the source and the address information of that destination.

Specifically, the Mapper class of MapReduce starts the DataX on the different nodes in the Hadoop cluster according to the storage path of the DataX on each node. Each DataX so started reads data from the source according to the source address information contained in the data transfer task passed to it, and then writes the data out to the destination according to the destination address information contained in the same data transfer task, thereby transferring the data of the source to the destination.
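What a Mapper does for one data transfer task can be sketched in Python as follows. The launcher path `bin/datax.py` under the DataX storage path matches how the open-source DataX 3.0 release is started, but its applicability here is an assumption, as are the function names and the injectable `runner` parameter, which exists only so the sketch can be exercised without DataX installed.

```python
import json
import os
import subprocess
import tempfile


def build_datax_command(datax_home, job_file, python_bin="python"):
    """Build the command line that starts DataX for one job file."""
    launcher = os.path.join(datax_home, "bin", "datax.py")
    return [python_bin, launcher, job_file]


def map_task(task, datax_home, runner=subprocess.run):
    """Mapper-like step: write the task to a job file and start the local DataX."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as handle:
        json.dump(task, handle)
        job_file = handle.name
    try:
        result = runner(build_datax_command(datax_home, job_file))
        # subprocess.run returns an object with a returncode; a test double
        # may return the exit code directly.
        return getattr(result, "returncode", result)
    finally:
        os.remove(job_file)  # the temporary job file is no longer needed
```

Because every Mapper runs on its own node and starts the DataX stored on that node, each transfer uses the memory and network bandwidth of a different machine.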

As another embodiment of the present invention, an apparatus for data synchronization is provided, comprising: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method provided by the above method embodiments, for example: uploading a data synchronization task list to a distributed file system, the data synchronization task list comprising the data transfer tasks from a source to each destination; uploading DataX to each node in a Hadoop cluster; obtaining, by MapReduce, the data of each data transfer task from the data synchronization task list, and passing the data of each data transfer task one by one to the DataX on different nodes in the Hadoop cluster; and starting, by MapReduce, the DataX on the different nodes in the Hadoop cluster, the DataX on the different nodes carrying out the data transfer from the source to each destination, so as to complete the data synchronization from the source to each destination.

As another embodiment of the present invention, a computer program product is provided. The computer program product comprises a computer program stored on a non-transitory computer-readable storage medium; the computer program comprises program instructions which, when executed by a computer, enable the computer to perform the method provided by the above method embodiments, for example: uploading a data synchronization task list to a distributed file system, the data synchronization task list comprising the data transfer tasks from a source to each destination; uploading DataX to each node in a Hadoop cluster; obtaining, by MapReduce, the data of each data transfer task from the data synchronization task list, and passing the data of each data transfer task one by one to the DataX on different nodes in the Hadoop cluster; and starting, by MapReduce, the DataX on the different nodes in the Hadoop cluster, the DataX on the different nodes carrying out the data transfer from the source to each destination, so as to complete the data synchronization from the source to each destination.

As another embodiment of the present invention, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium stores a computer program that causes a computer to perform the method provided by the above method embodiments, for example: uploading a data synchronization task list to a distributed file system, the data synchronization task list comprising the data transfer tasks from a source to each destination; uploading DataX to each node in a Hadoop cluster; obtaining, by MapReduce, the data of each data transfer task from the data synchronization task list, and passing the data of each data transfer task one by one to the DataX on different nodes in the Hadoop cluster; and starting, by MapReduce, the DataX on the different nodes in the Hadoop cluster, the DataX on the different nodes carrying out the data transfer from the source to each destination, so as to complete the data synchronization from the source to each destination.

A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of computer program instructions. The computer program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.

The apparatus embodiments described above are merely illustrative. The units described as separate modules may or may not be physically separate, and may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

From the above description of the embodiments, a person skilled in the art can clearly understand that each embodiment can be implemented by software plus the necessary general-purpose hardware platform, or alternatively by hardware. On this understanding, the above technical solutions, or the part of them that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of data synchronization, characterized by comprising:

uploading a data synchronization task list to a distributed file system, the data synchronization task list comprising the data transfer tasks from a source to each destination;

uploading DataX to each node in a Hadoop cluster;

obtaining, by MapReduce, the data of each data transfer task from the data synchronization task list, and passing the data of each data transfer task one by one to the DataX on different nodes in the Hadoop cluster;

starting, by the MapReduce, the DataX on the different nodes in the Hadoop cluster, the DataX on the different nodes in the Hadoop cluster carrying out the data transfer from the source to each destination, so as to complete the data synchronization from the source to each destination.
2. The method according to claim 1, characterized in that, before the data synchronization task list is uploaded to the distributed file system, the method further comprises:

obtaining the address information of the source and the address information of each destination;

determining the data synchronization task list according to the address information of the source and the address information of each destination.
3. The method according to claim 2, characterized in that determining the data synchronization task list according to the address information of the source and the address information of each destination comprises:

writing the address information of the source and the address information of each destination into the data synchronization task list in sequence, following the format of the DataX task configuration file.
4. The method according to claim 2, characterized in that, before DataX is uploaded to each node in the Hadoop cluster, the method further comprises:

obtaining the data type information of the source and the data type information of each destination;

configuring DataX according to the data type information of the source and the data type information of each destination.
5. The method according to claim 4, characterized in that configuring DataX according to the data type information of the source and the data type information of each destination comprises:

adding, according to the data type information of the source, a plugin to the data read module of DataX, so that DataX supports reading the data type of the source;

adding, according to the data type information of each destination, a plugin to the data write module of DataX, so that DataX supports writing the data type of each destination.
6. The method according to claim 1, characterized in that obtaining, by MapReduce, the data of each data transfer task from the data synchronization task list comprises:

customizing the InputFormat class and the RecordReader class of the MapReduce according to the data format of each data transfer task;

dividing the data of the data synchronization task list by the InputFormat class of the MapReduce, and reading the data of each data transfer task in sequence by the RecordReader class of the MapReduce.
7. The method according to claim 3, characterized in that starting, by the MapReduce, the DataX on the different nodes in the Hadoop cluster, and carrying out the data transfer from the source to each destination by the DataX on the different nodes in the Hadoop cluster, comprises:

starting the DataX on the different nodes in the Hadoop cluster by the Mapper class of the MapReduce, according to the storage path of the DataX on each node in the Hadoop cluster, so that each DataX transfers the data of the source to a destination according to the address information of the source and the address information of that destination.
8. An apparatus for data synchronization, characterized by comprising:

at least one processor; and at least one memory communicatively connected to the processor, wherein:

the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method according to any one of claims 1 to 7.
9. A computer program product, characterized in that the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores a computer program that causes a computer to perform the method according to any one of claims 1 to 7.