CN105718507A

CN105718507A - Data migration method and device

Info

Publication number: CN105718507A
Application number: CN201610007991.5A
Authority: CN
Inventors: 郑振峰
Original assignee: Hangzhou Dt Dream Technology Co Ltd
Current assignee: Hangzhou Dt Dream Technology Co Ltd
Priority date: 2016-01-06
Filing date: 2016-01-06
Publication date: 2016-06-29

Abstract

The invention provides a data migration method and device. The method comprises the steps as follows: a data source of a first cluster is loaded through a first class loader; the data source of a second cluster is loaded through a second class loader; the first class loader and the second class loader inherit a loader of a data migration tool; in the data migration tool, a first thread reads data of the data source of the first cluster through the first class loader and the data is put into a data queue; and in the data migration tool, a second thread writes the data in the data queue into the data source of the second cluster through the second class loader. The data migration method and device improve the data migration efficiency across the Hadoop cluster.

Description

Data migration method and device

Technical field

The present disclosure relates to computer technology, and in particular to a data migration method and device.

Background

Hadoop is a software framework for the distributed processing of massive data. It is a distributed computing platform that users can easily set up and use: users can readily develop and run applications that process massive data on Hadoop, and it is widely used in big data processing. As the demand for big data applications keeps growing, Hadoop has gone through a series of version changes to resolve the technical bottlenecks that this demand exposes. However, Hadoop versions are usually not compatible with one another, so data migration becomes an indispensable operation during a version upgrade. For example, data migration between HIVE (Hive is a data warehouse tool based on Hadoop) and HDFS (Hadoop Distributed File System) is a frequently encountered scenario.

In the scenario of cross-cluster data migration between HIVE and HDFS, the current approach is to use Hadoop's own tools to export the data of one Hadoop cluster to local storage and then import it into the HDFS or HIVE of another Hadoop cluster. With this approach the user must perform two operations to complete the migration, and the whole process amounts to two full data migrations, which lowers migration efficiency. Moreover, this approach exports the data to local storage, which occupies local disk space, and disk I/O is relatively time-consuming, which lowers migration efficiency further.

Summary of the invention

In view of this, the present disclosure provides a data migration method and device to improve the efficiency of data migration across Hadoop clusters.

Specifically, the present disclosure is implemented through the following technical solutions:

In a first aspect, a data migration method is provided. The data migration method is performed by a data migration tool, and the method includes:

loading a data source of a first cluster through a first class loader, and loading a data source of a second cluster through a second class loader, wherein the first class loader and the second class loader both inherit the loader of the data migration tool;

in the data migration tool, reading, by a first thread, data of the data source of the first cluster through the first class loader, and putting the data into a data queue;

in the data migration tool, writing, by a second thread, the data in the data queue into the data source of the second cluster through the second class loader.

In a second aspect, a data migration device is provided, including:

a loader configuration module, configured to load a data source of a first cluster through a first class loader and load a data source of a second cluster through a second class loader, wherein the first class loader and the second class loader both inherit the loader of the data migration tool;

a data read module, configured so that, in the data migration tool, a first thread reads data of the data source of the first cluster through the first class loader and puts the data into a data queue;

a data write module, configured so that, in the data migration tool, a second thread writes the data in the data queue into the data source of the second cluster through the second class loader.

In the data migration method and device of the embodiments of the present disclosure, the data directories of the two clusters are loaded by the first class loader and the second class loader respectively. The data that the first class loader reads from the data directory of the first cluster can be obtained by the second class loader and written into the data directory of the second cluster, so the data migration between the two clusters is carried out within a single JVM. Compared with migrating the data in two operations, this improves the efficiency of data migration across Hadoop clusters.

Brief description of the drawings

Fig. 1 is a schematic diagram of a data migration method provided by an embodiment of the present disclosure;

Fig. 2 is a flow chart of a data migration method provided by an embodiment of the present disclosure;

Fig. 3 is a schematic diagram of a data migration process provided by an embodiment of the present disclosure;

Fig. 4 is a schematic diagram of another data migration process provided by an embodiment of the present disclosure;

Fig. 5 is a schematic diagram of yet another data migration process provided by an embodiment of the present disclosure;

Fig. 6 is a structural diagram of a data migration device provided by an embodiment of the present disclosure.

Detailed description

To overcome the low efficiency of current data migration, in which two migration tools must be run in two separate operations, the data migration method provided by the embodiments of the present application carries out the migration within a single data migration tool, thereby improving the efficiency of cross-cluster data migration.

Fig. 1 illustrates the principle of the data migration method of the present application. As shown in Fig. 1, assume that the first cluster 11 and the second cluster 12 are two different Hadoop clusters. For example, the first cluster 11 may be CDH 5.2.0 and the second cluster 12 may be Hadoop 2.6.0; or the first cluster 11 is Hadoop 1.x and the second cluster 12 is Hadoop 2.x; or the first cluster 11 is HDP and the second cluster 12 is Hadoop, and so on; the possibilities are not listed exhaustively. In these examples, data migration between the first cluster 11 and the second cluster 12 is cross-cluster data migration.

A cross-cluster data migration scenario may be data migration between HIVE and HDFS. For example, data in the HIVE of CDH 5.2.0 is migrated to the HDFS of Hadoop 2.6.0, or data in the HDFS of Hadoop 2.6.0 is migrated to the HIVE of CDH 5.2.0. That is, the migration may run in either direction between HIVE and HDFS.

Still referring to Fig. 1, the present application provides a data migration tool 13, and the cross-cluster data migration process is carried out inside this data migration tool 13. For example, the data migration tool 13 may be a JVM (Java Virtual Machine). In this JVM, two class loaders can be created according to the parent delegation mechanism of class loaders, so that data operations can be performed on both clusters. Fig. 2 illustrates the flow of this data migration method:

In step 201, the data source of the first cluster is loaded through a first class loader, and the data source of the second cluster is loaded through a second class loader; the first class loader and the second class loader both inherit the loader of the data migration tool.

For example, the JVM may first be initialized, loading the Jar packages that the JVM tool itself requires.

In this step, two class loaders can be created. In the example of Fig. 1, one is called the first class loader 14 and the other is called the second class loader 15.

The first class loader 14 inherits the loader of the current JVM. For example, when the first cluster 11 is CDH 5.2.0, the first class loader 14 may be named CDH5.2.0ClassLoader, and it is used to load the data source of the first cluster 11, for example by loading the directory where the CDH Lib resides, including the HIVE JDBC driver Jar.

The second class loader 15 likewise inherits the loader of the JVM. For example, when the second cluster 12 is Hadoop 2.6.0, the second class loader 15 may be named Hadoop2.6.0ClassLoader, and it is used to load the data source of the second cluster 12, for example by loading the directory where the Hadoop Lib resides. The first class loader 14 and the second class loader 15 inherit the same loader, which follows the parent delegation mechanism.
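Step 201 can be sketched in plain Java as follows. This is a minimal illustration, not the patented implementation: the class name `ClusterLoaders`, the method `forLibDir`, and the idea of collecting every `*.jar` in a lib directory are assumptions made for the example. It relies only on the standard `URLClassLoader`, whose second constructor argument is the parent loader used for delegation.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one class loader per cluster lib directory.
final class ClusterLoaders {

    /**
     * Returns a loader that sees every *.jar in libDir and delegates to
     * toolLoader (the migration tool's own loader) as its parent.
     */
    static URLClassLoader forLibDir(Path libDir, ClassLoader toolLoader) throws IOException {
        List<URL> urls = new ArrayList<>();
        if (Files.isDirectory(libDir)) {
            try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
                for (Path jar : jars) {
                    urls.add(jar.toUri().toURL());
                }
            }
        }
        // Two loaders built this way are siblings: each resolves its own
        // cluster's classes, and both share the parent's classes.
        return new URLClassLoader(urls.toArray(new URL[0]), toolLoader);
    }
}
```

Because of parent delegation, a class that the parent can load is never loaded from the child's jars, so under this sketch the cluster-specific jars would have to be kept off the tool's own classpath for the two versions to stay isolated.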

In step 202, in the data migration tool, a first thread reads data of the data source of the first cluster through the first class loader and puts the data into a data queue.

For example, the JVM may create a new thread, which may be called the first thread. The first thread is used to read data from the first cluster. The ContextClassLoader of the first thread can be set to CDH5.2.0ClassLoader, so that the first thread can read data in the first cluster through the first class loader CDH5.2.0ClassLoader.

In this step, the data read by the first thread can be put into a data queue. Fig. 1 shows one such data queue 16, into which the data read by the first thread is placed.

In step 203, in the data migration tool, a second thread writes the data in the data queue into the data source of the second cluster through the second class loader.

For example, the JVM may create another new thread, which may be called the second thread. The second thread is used to write data into the second cluster, and its ContextClassLoader can be set to Hadoop2.6.0ClassLoader. In this step, the second thread, using Hadoop2.6.0ClassLoader, can take the data that the first thread put into the data queue 16 and write it into the second cluster.
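Steps 202 and 203 amount to a producer and a consumer that share an in-memory queue, each running with its own context class loader. The sketch below illustrates that shape with standard library classes only. The class `QueueMigration`, the `END` marker, and the use of in-memory lists in place of real cluster reads and writes are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: lists stand in for the two clusters' data sources.
final class QueueMigration {
    // Hypothetical end-of-data marker, compared by identity so that a real
    // row with the same text is never mistaken for it.
    private static final String END = new String("END");

    static List<String> migrate(List<String> sourceRows,
                                ClassLoader readLoader,
                                ClassLoader writeLoader) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1024);
        List<String> target = new ArrayList<>();

        // First thread: reads from the first cluster and fills the queue.
        Thread reader = new Thread(() -> {
            try {
                for (String row : sourceRows) {
                    queue.put(row);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.setContextClassLoader(readLoader);

        // Second thread: drains the queue and writes to the second cluster.
        Thread writer = new Thread(() -> {
            try {
                for (String row = queue.take(); row != END; row = queue.take()) {
                    target.add(row);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.setContextClassLoader(writeLoader);

        reader.start();
        writer.start();
        reader.join();
        writer.join();
        return target;
    }
}
```

A bounded queue is used here so that a fast reader cannot fill memory while the writer lags; the capacity of 1024 is arbitrary.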

In the data migration method of this example, the data directories of the two clusters, for example the directory where the Hadoop Lib resides, are loaded by the first class loader and the second class loader respectively. A data directory corresponds to the location where the cluster's data is stored and is also called a data source in this example. The data that the first class loader reads from the data directory of the first cluster can be obtained by the second class loader and written into the data directory of the second cluster, so the data migration between the two clusters is carried out within a single JVM. Compared with migrating the data in two operations, this improves the efficiency of data migration. As long as the user provides the LIB corresponding to the version number of each of the two clusters, the scheme above can complete the data migration between the HIVE and HDFS of the two clusters in a single process, shielding the implementation differences between clusters of different versions.

In addition, when the second thread writes data to the second cluster in step 203, the data can also be cleansed; for example, the data may be cleansed and filtered according to rules specified by the user before it is written. Furthermore, the data in the migration process is held entirely within the same JVM and can be cached temporarily in memory, so no disk I/O is performed, which greatly improves data migration efficiency. When the data is cached in memory, it is cleansed and filtered during its in-memory life cycle and then written to the second cluster.

The data migration method provided by the present application can be applied to migration in the HIVE-to-HDFS direction and also to migration in the HDFS-to-HIVE direction. The following examples describe in detail how the data migration method of the present application is used to migrate data between HIVE and HDFS.

Data migration from HIVE to HDFS

In this example, the first cluster 11 is, for instance, CDH 5.2.0, and the data source of the first cluster is HIVE, namely CDH 5.2.0 Hive, as shown in Fig. 3. The second cluster 12 is, for instance, Hadoop 2.6.0, and the data source of the second cluster is HDFS, namely Hadoop 2.6.0 HDFS Cluster. This example migrates the data in CDH 5.2.0 Hive to Hadoop 2.6.0 HDFS Cluster.

Still referring to Fig. 3, the data migration from CDH 5.2.0 Hive to Hadoop 2.6.0 HDFS Cluster can use concurrent migration to improve efficiency further. To migrate concurrently, the data to be migrated must be split. For example, the data in CDH 5.2.0 Hive is divided into multiple data slices, slice1, slice2, ..., slicen in Fig. 3, and these n data slices are migrated in parallel.

In this example, multiple first threads can be created in the JVM, each used to read one of the data slices. For example, one first thread reads slice1, another first thread reads slice2, yet another first thread reads slicen, and so on. These first threads form a first thread pool. In a specific implementation, a first thread can connect to HIVE through a HIVE interface. The HIVE interface may be a JDBC (Java DataBase Connectivity) interface; JDBC is a Java API for executing SQL statements that provides unified access to multiple relational databases. A first thread may create a JDBC Driver instance to connect to HIVE and use a Statement to execute a select statement that reads data from HIVE; for example, every 5000 rows of data may be submitted to the data queue.

Still referring to Fig. 3, a second thread pool is also created in the JVM. The second thread pool includes multiple second threads, and each second thread corresponds to one first thread. A second thread connects to HDFS through an HDFS interface and writes the data slice read by its first thread into one HDFS file. For example, the HDFS interface is the FileSystem interface; the FileSystem Java API can be used to operate on HDFS and perform operations such as reading, writing, and deleting data. A second thread may create an HDFS FileSystem and an OutputStream instance and write the data of one slice into one corresponding HDFSSubFile. For example, the data of slice1 is written into the corresponding HDFSSubFile1, and the data of slice2 is written into the corresponding HDFSSubFile2. In addition, in this example the connections to the HIVE or HDFS of the two clusters are made through the abstract interfaces of JDBC and streams, which shields the differences between Hadoop release versions.

In this example, the data in CDH 5.2.0 Hive can be divided into data slices in several ways; it can be divided based on an ID index column or an integer column in the HIVE table. For example, if the HIVE table has an ID index column covering rows 1 to 30000000, thirty million rows in total, and the data is split into three data slices, then rows 1 to 10000000 are read by one first thread, rows 10000001 to 20000000 by another first thread, and rows 20000001 to 30000000 by yet another first thread. As another example, if the HIVE table has an integer column Price and the data is split into 50 slices, then the data read by each first thread is select * from table where Price % 50 = i limit 10000, where i is the thread ID number. Given the file handling characteristics of HDFS, the data volume of each data slice should not be too small.
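The two slicing strategies described above can be sketched as small helper functions. This is an illustration under assumptions: the class name `IdSlices` and both method names are hypothetical, and the modulo query omits the limit clause of the example statement for simplicity.

```java
// Hypothetical helpers for computing slice boundaries and per-slice queries.
final class IdSlices {

    /**
     * Splits the inclusive ID range minId..maxId into n consecutive
     * inclusive ranges [start, end]; earlier slices absorb any remainder.
     */
    static long[][] split(long minId, long maxId, int n) {
        long total = maxId - minId + 1;
        long base = total / n;
        long extra = total % n;
        long[][] ranges = new long[n][2];
        long start = minId;
        for (int i = 0; i < n; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges[i][0] = start;
            ranges[i][1] = start + size - 1;
            start += size;
        }
        return ranges;
    }

    /** Builds the select statement for slice i when slicing by an integer column. */
    static String moduloQuery(String table, String column, int slices, int i) {
        return "SELECT * FROM " + table + " WHERE " + column + " % " + slices + " = " + i;
    }
}
```

With these helpers, `IdSlices.split(1, 30000000, 3)` yields the three ranges of the thirty-million-row example, and `IdSlices.moduloQuery("table", "Price", 50, i)` yields the query for thread i.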

In addition, after all the HIVE data has been imported, the user may choose to merge the multiple HDFSSubFiles; this is an optional operation. For example, if the HIVE table supports timestamps, the data of each time period can be imported into HDFS separately, and the HDFS files of the different time periods can then be merged.

The advantage of concurrent migration is that when the transfer of a data slice fails, it is only necessary to delete the corresponding HDFS file and retransmit that slice. Only the data of that slice is migrated again, rather than re-executing the migration of all the data, which improves the efficiency of fault recovery.
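The recovery behaviour described above, re-running only the slices that failed, can be sketched with a standard thread pool. Everything here is a hypothetical illustration: the class `SliceRetry`, the `migrateSlice` callback (which in a real tool would delete the partial HDFS file and retransmit the slice), the pool size, and the attempt limit are not taken from the source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

// Hypothetical sketch: retries only the failed slices, never the whole job.
final class SliceRetry {

    /**
     * Runs slices 0..n-1 in parallel. migrateSlice returns true on success.
     * Failed slices are re-run, up to maxAttempts rounds in total.
     * Returns the indices of slices that still failed at the end.
     */
    static List<Integer> runAll(int n, int maxAttempts, IntPredicate migrateSlice)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, Math.min(n, 4)));
        try {
            List<Integer> pending = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                pending.add(i);
            }
            for (int attempt = 1; attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
                List<Future<Boolean>> results = new ArrayList<>();
                for (int slice : pending) {
                    Callable<Boolean> task = () -> migrateSlice.test(slice);
                    results.add(pool.submit(task));
                }
                List<Integer> failed = new ArrayList<>();
                for (int k = 0; k < pending.size(); k++) {
                    boolean ok;
                    try {
                        ok = results.get(k).get();
                    } catch (ExecutionException e) {
                        ok = false; // an exception in the task counts as a failed slice
                    }
                    if (!ok) {
                        failed.add(pending.get(k));
                    }
                }
                pending = failed;
            }
            return pending;
        } finally {
            pool.shutdown();
        }
    }
}
```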

Data migration from HDFS to HIVE

In this example, the data source of the first cluster is HDFS, for example CDH 5.2.0 HDFS Cluster, and the data source of the second cluster is HIVE, for example Hadoop 2.6.0 HIVE. When data is migrated from HDFS to HIVE, the concurrent migration scheme can also be used to improve migration efficiency further. Moreover, the underlying storage of HIVE is HDFS, so this example implements concurrent writes on top of HDFS. Hive tables are divided into external tables and internal tables, and the migration for each kind of table is described below.

External table

As shown in Fig. 4, in this example data is migrated from CDH 5.2.0 HDFS Cluster to a Hadoop 2.6.0 HIVE external table. A HIVE external table usually uses a specified HDFS directory as its storage space, and inserting data is equivalent to adding or modifying files under that HDFS directory. Therefore, writing data to a HIVE external table is equivalent to writing data to the HDFS directory, that is, the data is written into the HDFS directory specified by the external table.

As shown in Fig. 4, the data of CDH 5.2.0 HDFS Cluster is split into multiple segmentThreads, that is, into multiple data slices: segmentThread1 is one data slice, segmentThread2 is another data slice, and so on, with n slices in this example. The JVM again reads the data through the multiple first threads in the first thread pool, and each first thread reads one data slice. The multiple second threads in the second thread pool correspond one-to-one with the multiple first threads in the first thread pool; each second thread corresponds to one first thread and writes the data slice of that first thread into one HDFS file in the HDFS directory specified by the HIVE external table. For example, if a certain first thread has read segmentThread1, the second thread corresponding to that first thread writes segmentThread1 into HDFSSubFile1 in the specified HDFS directory.

Internal table

As shown in Fig. 5, the scheme for migrating data from CDH 5.2.0 HDFS Cluster to a Hadoop 2.6.0 HIVE internal table is similar to the scheme for migrating to an external table. The difference is that the data storage location of an internal table is managed by Hive. Therefore, in this example a temporary HDFS directory can be created, and, following the external table scheme above, the data of each data slice read by the first thread pool is written into a corresponding HDFS file in the temporary HDFS directory. For example, if a certain first thread has read segmentThread1, the second thread corresponding to that first thread writes segmentThread1 into HDFSSubFile1 in the temporary HDFS directory.

After each second thread has written all the data it is responsible for into an HDFS file in the temporary HDFS directory, it can connect to HIVE through the HIVE interface and import the data in the HDFS file into HIVE. For example, a JDBC Driver instance can be created to connect to the Hive database, and a load data inpath statement can be used to import the data in the HDFS file into HIVE; this process is a data copy. After the copy finishes, the corresponding temporary HDFS file can be deleted, and after all data copies have finished, the temporary HDFS directory that was created is deleted.
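The import step can be sketched with the standard `java.sql` interfaces. This is a minimal illustration: the class `HiveLoad` and its method names are hypothetical, the `Connection` is assumed to have been obtained from a Hive JDBC driver loaded by the second cluster's class loader, and no escaping of the path or table name is attempted.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper for importing one temporary HDFS file into a Hive table.
final class HiveLoad {

    /** Builds the HiveQL statement that imports the file at hdfsPath into table. */
    static String loadStatement(String hdfsPath, String table) {
        return "LOAD DATA INPATH '" + hdfsPath + "' INTO TABLE " + table;
    }

    /** Executes the import over an already opened Hive JDBC connection. */
    static void load(Connection hive, String hdfsPath, String table) throws SQLException {
        try (Statement st = hive.createStatement()) {
            st.execute(loadStatement(hdfsPath, table));
        }
    }
}
```

A second thread would call `HiveLoad.load(...)` once its HDFSSubFile is fully written, then clean up the temporary file as described above.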

Both cases above, writing data to a HIVE external table and to an internal table, involve splitting the source HDFS file. In this example the file can simply be divided evenly by file size. For example, if the size of the source HDFS file to be migrated in CDH 5.2.0 HDFS Cluster is 1G and the file is split into 10 data slices, the size of each data slice is obtained by dividing the file size evenly. For example, segmentThread1 contains the data from 0 to 102.4M, and segmentThread2 contains the data from 102.4M to 204.8M.
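The even split by file size can be sketched as follows. The class `ByteRanges` and its method are hypothetical names; the ranges are half-open byte offsets, and the sketch assumes `fileSize * n` fits in a `long`.

```java
// Hypothetical helper: nominal byte ranges before any newline adjustment.
final class ByteRanges {

    /**
     * Splits the byte range [0, fileSize) into n half-open ranges
     * [start, end) of near-equal size that cover the file without gaps.
     */
    static long[][] split(long fileSize, int n) {
        long[][] ranges = new long[n][2];
        for (int i = 0; i < n; i++) {
            ranges[i][0] = fileSize * i / n;
            ranges[i][1] = fileSize * (i + 1) / n;
        }
        return ranges;
    }
}
```

For a file of 1073741824 bytes and 10 slices, each range is about 102.4M bytes, matching the example above. These are nominal boundaries only; they can fall in the middle of a line.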

When the file is split, each data slice needs to determine its true start position and end position. For example, when determining the start position, if the character preceding the current nominal split position is a newline, the current position is the start position; otherwise reading continues forward until a newline is read, and the next line is where this slice should begin reading. When determining the end position, if the character at the nominal last position is not a newline, reading continues forward until a newline is read, and that line is the last line of data that this slice should read.
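The boundary rule above can be sketched as a single alignment function applied to both the nominal start and the nominal end of a slice. For self-containment the sketch works on an in-memory byte array rather than an HDFS stream; the class `LineAlign` and the half-open range convention are assumptions for illustration.

```java
// Hypothetical helper: moves nominal split offsets onto line boundaries.
final class LineAlign {

    /**
     * Returns pos if it is already the first byte of a line (offset 0, or
     * the preceding byte is a newline); otherwise scans forward and returns
     * the offset just after the next newline, or data.length if none.
     */
    static int align(byte[] data, int pos) {
        if (pos <= 0) {
            return 0;
        }
        if (pos >= data.length) {
            return data.length;
        }
        int p = pos;
        while (p < data.length && data[p - 1] != '\n') {
            p++;
        }
        return p;
    }
}
```

A slice with nominal range [start, end) then reads the bytes in [align(start), align(end)). Because adjacent slices share a nominal boundary, they also share the aligned boundary, so every line is read by exactly one slice.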

Fig. 6 illustrates the structure of a data migration device that can perform the data migration method of the embodiments above. As shown in Fig. 6, the device may include a loader configuration module 61, a data read module 62, and a data write module 63.

The loader configuration module 61 is configured to load the data source of the first cluster through a first class loader and load the data source of the second cluster through a second class loader; the first class loader and the second class loader both inherit the loader of the data migration tool.

The data read module 62 is configured so that, in the data migration tool, a first thread reads data of the data source of the first cluster through the first class loader and puts the data into a data queue.

The data write module 63 is configured so that, in the data migration tool, a second thread writes the data in the data queue into the data source of the second cluster through the second class loader.

In one example, the data source of the first cluster is HIVE, and the data source of the second cluster is HDFS.

The data read module 62 includes a first thread pool, the first thread pool includes multiple first threads, and each first thread is configured to connect to HIVE through a HIVE interface and read one data slice of the HIVE data source.

The data write module 63 includes a second thread pool, the second thread pool includes multiple second threads, each second thread corresponds to one first thread and is configured to connect to HDFS through an HDFS interface and write the data slice of that first thread into one HDFS file.

For example, the HIVE interface is a JDBC interface, and the HDFS interface is a FileSystem interface.

In one example, the data source of the first cluster is HDFS, and the data source of the second cluster is HIVE.

The data read module 62 includes a first thread pool, the first thread pool includes multiple first threads, and each first thread is configured to read one data slice of the HDFS data source.

The data write module 63 includes a second thread pool, the second thread pool includes multiple second threads, and each second thread corresponds to one first thread and is configured to write the data slice of that first thread into one HDFS file in the HDFS directory specified by a HIVE external table.

Alternatively, the data write module 63 includes a second thread pool, the second thread pool includes multiple second threads, and each second thread corresponds to one first thread and is configured to write the data slice of that first thread into one HDFS file in a temporary HDFS directory, connect to HIVE through a HIVE interface, and write the data in the HDFS file into HIVE.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

The foregoing is only the preferred embodiments of the present disclosure and is not intended to limit the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims

1. A data migration method, characterized in that the data migration method is performed by a data migration tool, and the method comprises:

2. The method according to claim 1, characterized in that

the data source of the first cluster is HIVE, and the data source of the second cluster is HDFS;

the data migration tool includes a first thread pool, the first thread pool includes multiple first threads, and each first thread is configured to connect to HIVE through a HIVE interface and read one data slice of the HIVE data source;

the data migration tool includes a second thread pool, the second thread pool includes multiple second threads, and each second thread corresponds to one first thread and is configured to connect to HDFS through an HDFS interface and write the data slice of that first thread into one HDFS file.

3. The method according to claim 2, characterized in that the HIVE interface is a JDBC interface and the HDFS interface is a FileSystem interface.

4. The method according to claim 1, characterized in that

the data source of the first cluster is HDFS, and the data source of the second cluster is HIVE;

the data migration tool includes a first thread pool, the first thread pool includes multiple first threads, and each first thread is configured to read one data slice of the HDFS data source;

the data migration tool includes a second thread pool, the second thread pool includes multiple second threads, and each second thread corresponds to one first thread and is configured to write the data slice of that first thread into one HDFS file in the HDFS directory specified by a HIVE external table.

5. The method according to claim 1, characterized in that

the data migration tool includes a second thread pool, the second thread pool includes multiple second threads, and each second thread corresponds to one first thread and is configured to write the data slice of that first thread into one HDFS file in a temporary HDFS directory, connect to HIVE through a HIVE interface, and write the data in the HDFS file into HIVE.

6. A data migration device, characterized by comprising:

7. The device according to claim 6, characterized in that the data source of the first cluster is HIVE, and the data source of the second cluster is HDFS;

the data read module includes a first thread pool, the first thread pool includes multiple first threads, and each first thread is configured to connect to HIVE through a HIVE interface and read one data slice of the HIVE data source;

the data write module includes a second thread pool, the second thread pool includes multiple second threads, and each second thread corresponds to one first thread and is configured to connect to HDFS through an HDFS interface and write the data slice of that first thread into one HDFS file.

8. The device according to claim 7, characterized in that the HIVE interface is a JDBC interface and the HDFS interface is a FileSystem interface.

9. The device according to claim 6, characterized in that the data source of the first cluster is HDFS, and the data source of the second cluster is HIVE;

the data read module includes a first thread pool, the first thread pool includes multiple first threads, and each first thread is configured to read one data slice of the HDFS data source;

the data write module includes a second thread pool, the second thread pool includes multiple second threads, and each second thread corresponds to one first thread and is configured to write the data slice of that first thread into one HDFS file in the HDFS directory specified by a HIVE external table.

10. The device according to claim 6, characterized in that the data source of the first cluster is HDFS, and the data source of the second cluster is HIVE;

the data write module includes a second thread pool, the second thread pool includes multiple second threads, and each second thread corresponds to one first thread and is configured to write the data slice of that first thread into one HDFS file in a temporary HDFS directory, connect to HIVE through a HIVE interface, and write the data in the HDFS file into HIVE.