CN110928860B

CN110928860B - Data migration method and device

Info

Publication number: CN110928860B
Application number: CN201911181471.6A
Authority: CN
Inventors: 陈国杰; 罗建林
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2019-11-27
Filing date: 2019-11-27
Publication date: 2023-06-20
Anticipated expiration: 2039-11-27
Also published as: CN110928860A

Abstract

The invention provides a data migration method and device. The method comprises: acquiring the data distribution condition on the Hadoop cluster to be migrated; acquiring a data migration strategy according to the data distribution condition; directly migrating the data on the Hadoop cluster to be migrated to a target Hadoop cluster according to the data migration strategy; and verifying the data on the target Hadoop cluster after migration according to the data distribution condition on the Hadoop cluster to be migrated. Data, directories, permissions, verification, monitoring and display are synchronized automatically between the two Hadoop clusters. A user only needs to install one installation package for the migration to run automatically, with no additional workload; the verification accuracy is one hundred percent, and much unnecessary work is avoided.

Description

Data migration method and device

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data migration method and apparatus.

Background

Hadoop is a software framework, developed by the Apache Software Foundation, for distributed processing of large amounts of data. It is a distributed computing platform that users can easily build and use, allowing them to develop distributed programs without knowing the details of the underlying distributed layer.

As business volume grows, data has also entered an era of rapid growth, and many big data companies now use the Hadoop architecture. The conventional Hadoop cluster data migration method first downloads the data from the old Hadoop cluster to local storage, then sends it to the local storage of a remote cluster, then uploads it from there to the new Hadoop cluster, and finally assigns permissions and verifies the result. This conventional technique is insufficiently automated and involves many manual operations, which increases the risk of omissions or errors and the workload of the data warehouse staff.

Disclosure of Invention

In view of the problems in the prior art, the present invention provides a data migration method and apparatus, an electronic device, and a computer-readable storage medium, which can at least partially solve the problems in the prior art.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

In a first aspect, a data migration method is provided, including:

acquiring the data distribution condition on the Hadoop cluster to be migrated;

acquiring a data migration strategy according to the data distribution condition;

directly migrating the data on the Hadoop cluster to be migrated to a target Hadoop cluster according to the data migration strategy;

and verifying the data on the target Hadoop cluster after migration according to the data distribution condition on the Hadoop cluster to be migrated.

Further, the acquiring the data distribution condition on the Hadoop cluster to be migrated includes:

recursively scanning the directories on the Hadoop cluster to be migrated to obtain the data distribution condition, wherein the data distribution condition comprises: the number of files under each directory, the sizes of the files and the preset priority level of the directory.

Further, the acquiring the data migration strategy according to the data distribution condition includes:

sorting the directories according to the preset priority level of each directory;

acquiring a data transmission strategy for each directory according to the number of files under the directory and the sizes of the files;

and generating a data migration template according to the directory sorting result and the data transmission strategy of each directory.

Further, the directly migrating the data on the Hadoop cluster to be migrated to the target Hadoop cluster according to the data migration strategy includes:

migrating the data on the Hadoop cluster to be migrated to the target Hadoop cluster according to the data migration template.

Further, the verifying the data on the target Hadoop cluster after migration according to the data distribution condition on the Hadoop cluster to be migrated includes:

verifying whether the directories and subdirectories on the target Hadoop cluster are complete after migration;

verifying whether the permissions of the directories and subdirectories on the target Hadoop cluster are correct after migration;

verifying whether the number of files in the directories and subdirectories on the target Hadoop cluster is correct after migration;

verifying whether the number of bytes in the directories and subdirectories on the target Hadoop cluster is correct after migration;

verifying whether the number of records corresponding to the files on the target Hadoop cluster is correct after migration;

and verifying whether the file content on the target Hadoop cluster is correct after migration.

Further, the data migration method further includes:

displaying the data migration status.

Further, the data migration status includes: the directories waiting to be migrated, the migration progress of each directory, the size of the migrated data of each directory, the migration time consumed by each directory, the expected migration completion time of each directory and the priority level of the migration processing of each directory.

In a second aspect, a data migration apparatus is provided, comprising:

a preprocessing module for acquiring the data distribution condition on the Hadoop cluster to be migrated;

a migration strategy acquisition module for acquiring a data migration strategy according to the data distribution condition;

a data migration module for directly migrating the data on the Hadoop cluster to be migrated to the target Hadoop cluster according to the data migration strategy;

and a verification module for verifying the data on the target Hadoop cluster after migration according to the data distribution condition on the Hadoop cluster to be migrated.

Further, the preprocessing module includes:

a recursive scanning unit for recursively scanning the directories on the Hadoop cluster to be migrated to obtain the data distribution condition, wherein the data distribution condition comprises: the number of files under each directory, the sizes of the files and the preset priority level of the directory.

Further, the migration strategy acquisition module includes:

a sorting unit for sorting the directories according to the preset priority level of each directory;

a transmission planning unit for acquiring the data transmission strategy of each directory according to the number of files under the directory and the sizes of the files;

and a migration template acquisition unit for generating a data migration template according to the directory sorting result and the data transmission strategy of each directory.

Further, the data migration module includes:

a data migration unit for migrating the data on the Hadoop cluster to be migrated to the target Hadoop cluster according to the data migration template.

Further, the verification module includes:

a first verification unit for verifying whether the directories and subdirectories on the target Hadoop cluster are complete after migration;

a second verification unit for verifying whether the permissions of the directories and subdirectories on the target Hadoop cluster are correct after migration;

a third verification unit for verifying whether the number of files in the directories and subdirectories on the target Hadoop cluster is correct after migration;

a fourth verification unit for verifying whether the number of bytes in the directories and subdirectories on the target Hadoop cluster is correct after migration;

a fifth verification unit for verifying whether the number of records corresponding to the files on the target Hadoop cluster is correct after migration;

and a sixth verification unit for verifying whether the file content on the target Hadoop cluster is correct after migration.

Further, the data migration apparatus further includes:

a progress display module for displaying the data migration status.

In a third aspect, an electronic device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the data migration method described above when the program is executed.

In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the data migration method described above.

The invention provides a data migration method and device, an electronic device and a computer readable storage medium. The method comprises: acquiring the data distribution condition on the Hadoop cluster to be migrated; acquiring a data migration strategy according to the data distribution condition; directly migrating the data on the Hadoop cluster to be migrated to a target Hadoop cluster according to the data migration strategy; and verifying the data on the target Hadoop cluster after migration according to the data distribution condition on the Hadoop cluster to be migrated. Data, directories, permissions, verification, monitoring and display are synchronized automatically between the two Hadoop clusters. A user only needs to install one installation package for the migration to run automatically, with no additional workload; the verification accuracy is one hundred percent, and much unnecessary work is avoided.

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments, as illustrated in the accompanying drawings.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:

FIG. 1 is a flow chart of a data migration method according to an embodiment of the present invention;

FIG. 2 illustrates a data structure on a Hadoop cluster;

FIG. 3 illustrates how directories and files are recorded in a system;

FIG. 4 shows the specific steps of step S200 in FIG. 1;

FIG. 5 illustrates a specific implementation of a migration template in an embodiment of the present invention;

FIG. 6 shows the specific steps of step S400 in FIG. 1;

FIG. 7 shows a process of verifying according to file size;

FIG. 8 shows a process of verification based on day data;

FIG. 9 is a block diagram of a data migration apparatus in an embodiment of the present invention;

fig. 10 is a block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present application and in the foregoing figures, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

At present there is no tool for automatically migrating a big data platform. Data and directories are mostly migrated manually, without the support of a system tool, and the accuracy of the migrated data is strongly affected by human factors.

In order to at least partially solve the technical problems in the prior art, an embodiment of the invention provides a data migration method that automatically synchronizes data, directories, permissions, verification, monitoring, display and the like between two different Hadoop clusters. A user only needs to install one installation package for this to happen automatically, without any additional workload.

Fig. 1 is a flow chart of a data migration method in an embodiment of the invention. As shown in fig. 1, the data migration method may include the following:

Step S100: acquiring the data distribution condition on the Hadoop cluster to be migrated;

Specifically, the directories on the Hadoop cluster to be migrated are recursively scanned to obtain the data distribution condition, wherein the data distribution condition comprises: the number of files under each directory, the sizes of the files and the preset priority level of the directory.

Files and directories are scanned, and the directories are scanned recursively in order from the directory root node to the directory child nodes. Referring to the data structure shown in fig. 2, the recursive scan proceeds as follows:

First, a command is used to check which directories and files exist under directory A, showing that directory A contains directory B, directory D and file C. Directory B is then checked and found to contain files E and F. Directory D is then checked and found to contain file G. Note that whenever the directory being checked contains a subdirectory, the command is applied recursively to that subdirectory, until only files remain.

In addition, to determine whether the entry currently being scanned is a directory or a file, refer to fig. 3: in the first column, if the first character is d, the entry is a directory; if it is -, the entry is a file.
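The first-column check described above can be sketched in Java as follows. This is only an illustration: the class name `ListingEntry` and the sample lines in the usage note are hypothetical, and the sketch assumes a listing whose first column is a permission string and whose last column is a path without whitespace.

```java
/**
 * Sketch: decide whether a listing line describes a directory or a file.
 * Assumes the first column is a permission string such as "drwxr-xr-x"
 * (directory) or "-rw-r--r--" (file). ListingEntry is a hypothetical name.
 */
final class ListingEntry {
    private ListingEntry() {}

    /** Returns true when the first character of the listing line is 'd'. */
    static boolean isDirectory(String listingLine) {
        String trimmed = listingLine.trim();
        return !trimmed.isEmpty() && trimmed.charAt(0) == 'd';
    }

    /** Returns the last whitespace-separated column, assumed to be the path. */
    static String path(String listingLine) {
        String[] columns = listingLine.trim().split("\\s+");
        return columns[columns.length - 1];
    }
}
```

For example, `ListingEntry.isDirectory("drwxr-xr-x   - hdfs supergroup 0 2019-11-27 10:00 /A/B")` is true, while a line starting with `-rw-r--r--` is treated as a file.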

Furthermore, for each scanned directory, the number of files and the sizes of the files under it are counted, so that processes can be allocated during the subsequent migration and so that the counts can serve as the basis for verification in the verification stage.

It is worth noting that the data distribution condition is equivalent to a ledger recording the data on the Hadoop cluster to be migrated. The ledger contains not only the data structure but also information such as the number of directories, the number of files, the file sizes, and the preset priority and permissions of each directory.
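The recursive scan and the resulting ledger can be illustrated with a minimal Java sketch. It does not talk to a cluster: `Node`, `DirStats` and `RecursiveScanner` are hypothetical names, and the directory tree is held in memory so that the root-to-leaf traversal is easy to follow. Each directory's entry counts only the files directly under it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for a directory or file on the cluster. */
final class Node {
    final String name;
    final boolean directory;
    final long sizeBytes;
    final List<Node> children = new ArrayList<>();

    private Node(String name, boolean directory, long sizeBytes) {
        this.name = name;
        this.directory = directory;
        this.sizeBytes = sizeBytes;
    }

    static Node dir(String name, Node... children) {
        Node node = new Node(name, true, 0L);
        node.children.addAll(Arrays.asList(children));
        return node;
    }

    static Node file(String name, long sizeBytes) {
        return new Node(name, false, sizeBytes);
    }
}

/** Per-directory ledger entry: files directly under the directory. */
final class DirStats {
    int fileCount;
    long totalBytes;
}

final class RecursiveScanner {
    private RecursiveScanner() {}

    /** Scans from the root towards the leaves and returns path -> stats. */
    static Map<String, DirStats> scan(Node root) {
        Map<String, DirStats> ledger = new LinkedHashMap<>();
        visit("/" + root.name, root, ledger);
        return ledger;
    }

    private static void visit(String path, Node dir, Map<String, DirStats> ledger) {
        DirStats stats = new DirStats();
        ledger.put(path, stats);
        for (Node child : dir.children) {
            if (child.directory) {
                // A subdirectory: apply the same scan one level down.
                visit(path + "/" + child.name, child, ledger);
            } else {
                stats.fileCount++;
                stats.totalBytes += child.sizeBytes;
            }
        }
    }
}
```

Applied to the structure of fig. 2 (directory A with directories B and D and file C, B with files E and F, D with file G), the ledger has one entry each for /A, /A/B and /A/D. A real implementation would also record the priority and permissions mentioned above.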

Step S200: acquiring a data migration strategy according to the data distribution condition;

The data migration strategy includes: the migration order, the number of processes started during migration, the amount of CPU resources occupied, the amount of memory occupied, and the like.

Step S300: directly migrating the data on the Hadoop cluster to be migrated to a target Hadoop cluster according to the data migration strategy;

That is, the data is migrated directly from the Hadoop cluster to be migrated to the target Hadoop cluster, without any intermediate storage.

Step S400: verifying the data on the target Hadoop cluster after migration according to the data distribution condition on the Hadoop cluster to be migrated.

Specifically, after the migration is finished, the data on the target Hadoop cluster is verified in several respects, such as integrity, data structure and file permissions, to ensure its accuracy.

It should be noted that the data migration method provided by the embodiment of the invention can be implemented in the Java language.

In summary, the data migration method provided by the embodiment of the invention is based on the Hadoop protocol transmission concept: a file transmission command is generated each time the files of a directory are transmitted, and different transmission parameters achieve different effects, such as overwrite transmission, update transmission or default transmission. The data directories are first preprocessed; the directories and data are then formally migrated based on the preprocessing result; the migrated data and directories undergo multidimensional verification; and the whole process is monitored, with the results displayed on a page. Migration is thereby automated, human involvement is reduced, and accuracy is high.
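The text does not name the transfer tool. As a sketch under the assumption that Hadoop DistCp is used, the following Java snippet builds one transfer command per directory, with the transmission mode selecting the `-update` or `-overwrite` flag and `-m` limiting the number of simultaneous copies. The class name `TransferCommand` and the cluster addresses in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: build a per-directory transfer command.
 * ASSUMPTION: the transfer tool is Hadoop DistCp; the source text only
 * speaks of a "file transmission command" with different parameters.
 */
final class TransferCommand {
    enum Mode { DEFAULT, UPDATE, OVERWRITE }

    private TransferCommand() {}

    static List<String> build(String sourceUri, String targetUri, Mode mode, int maxMaps) {
        List<String> command = new ArrayList<>();
        command.add("hadoop");
        command.add("distcp");
        if (mode == Mode.UPDATE) {
            command.add("-update");      // copy only files that differ on the target
        } else if (mode == Mode.OVERWRITE) {
            command.add("-overwrite");   // replace files that already exist on the target
        }
        command.add("-m");               // upper bound on simultaneous copy tasks
        command.add(String.valueOf(maxMaps));
        command.add(sourceUri);
        command.add(targetUri);
        return command;
    }
}
```

For instance, `String.join(" ", TransferCommand.build("hdfs://nn1:8020/A/B", "hdfs://nn2:8020/A/B", TransferCommand.Mode.UPDATE, 2))` yields `hadoop distcp -update -m 2 hdfs://nn1:8020/A/B hdfs://nn2:8020/A/B`. DistCp interprets the target path differently with and without `-update` or `-overwrite`, so the target argument may need adjusting per mode.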

In an alternative embodiment, referring to fig. 4, this step S200 may include the following:

Step S210: sorting the directories according to the preset priority level of each directory.

When the data is migrated, the directories are migrated in order of their priority.
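A minimal Java sketch of the priority sort in step S210 follows. The names `PrioritizedDir` and `PrioritySorter` are hypothetical, and the sketch assumes that a smaller number means a higher priority, which the text does not specify. Directories with equal priority are ordered by path so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical pairing of a directory path with its preset priority. */
final class PrioritizedDir {
    final String path;
    final int priority; // ASSUMPTION: smaller value = migrated earlier

    PrioritizedDir(String path, int priority) {
        this.path = path;
        this.priority = priority;
    }
}

final class PrioritySorter {
    private PrioritySorter() {}

    /** Returns a new list ordered by priority, then by path. */
    static List<PrioritizedDir> sort(List<PrioritizedDir> dirs) {
        List<PrioritizedDir> ordered = new ArrayList<>(dirs);
        ordered.sort(Comparator
            .comparingInt((PrioritizedDir d) -> d.priority)
            .thenComparing(d -> d.path));
        return ordered;
    }
}
```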

Step S220: acquiring a data transmission strategy for each directory according to the number of files under the directory and the sizes of the files;

Specifically, the scanned directories are first classified into: 1. directories containing only directories; 2. directories containing directories and small files; 3. directories containing directories and large files. Whether a file is large or small is determined by a program parameter: if the parameter is set to 2G, a file of 2G or more is a large file and a file smaller than 2G is a small file. Different transmission strategies are used for large and small files.
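The three-way classification can be sketched in Java as follows. `DirectoryClassifier` is a hypothetical name, "2G" is taken here to mean 2 GiB, and the sketch assumes that a directory holding at least one large file falls into the large-file category, since the text does not say how mixed directories are treated.

```java
/** Sketch: classify a directory by the sizes of the files directly under it. */
final class DirectoryClassifier {
    enum Category { ONLY_DIRECTORIES, SMALL_FILES, LARGE_FILES }

    /** ASSUMPTION: the configured "2G" threshold is read as 2 GiB. */
    static final long DEFAULT_THRESHOLD_BYTES = 2L * 1024 * 1024 * 1024;

    private DirectoryClassifier() {}

    static Category classify(long[] fileSizes, long thresholdBytes) {
        if (fileSizes.length == 0) {
            return Category.ONLY_DIRECTORIES;
        }
        for (long size : fileSizes) {
            // A file at or above the threshold counts as a large file.
            if (size >= thresholdBytes) {
                return Category.LARGE_FILES;
            }
        }
        return Category.SMALL_FILES;
    }
}
```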

Then, according to the number of files in the directory and whether they are large or small files, the number of parallel tasks to start, the number of CPUs to occupy, the amount of memory to use and the like are planned.

For example, when a directory contains no files, a single process may be started. If a directory contains many small files, which occupy little memory and migrate quickly, multiple threads may be started to process the small files concurrently and improve efficiency. When a directory contains large files, which occupy more memory and migrate slowly, starting more threads does not speed up the migration, so only a small number of threads is configured.

Specifically, the migration is configured with a maximum of 100 parallel tasks, 100 CPU cores and a memory limit of 200 (in the same unit as the 4G of the example that follows), and these limits cannot be exceeded. For example, if scanning finds two files under the /A/B/ directory, only 2 parallel tasks, 2 CPU cores and 4G of memory are started. If the number of files exceeds 100, so that the configured maximum is reached, only 100 concurrent tasks are started; after they complete, the next 100 files are processed in turn.
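The capped resource planning can be sketched in Java as follows. The names `ResourcePlan` and `ResourcePlanner` are hypothetical. The figure of 2G of memory per parallel task is inferred from the example (2 files, 4G) and is an assumption, as is treating the memory limit of 200 as 200G. The sketch covers only the small-file case and does not reduce parallelism for large files.

```java
/** Hypothetical result of planning the resources for one directory. */
final class ResourcePlan {
    final int parallelism;
    final int cpuCores;
    final int memoryG;
    final int batches;

    ResourcePlan(int parallelism, int cpuCores, int memoryG, int batches) {
        this.parallelism = parallelism;
        this.cpuCores = cpuCores;
        this.memoryG = memoryG;
        this.batches = batches;
    }
}

final class ResourcePlanner {
    static final int MAX_PARALLEL = 100;
    static final int MAX_CPU_CORES = 100;
    static final int MAX_MEMORY_G = 200;      // ASSUMPTION: unit is G
    static final int MEMORY_G_PER_TASK = 2;   // ASSUMPTION: inferred from "2 files -> 4G"

    private ResourcePlanner() {}

    static ResourcePlan plan(int fileCount) {
        // One task per file, at least one task, never more than the cap.
        int parallelism = Math.max(1, Math.min(fileCount, MAX_PARALLEL));
        int cpuCores = Math.min(parallelism, MAX_CPU_CORES);
        int memoryG = Math.min(parallelism * MEMORY_G_PER_TASK, MAX_MEMORY_G);
        // Files beyond the cap are handled in further rounds of up to 100.
        int batches = fileCount == 0 ? 1 : (fileCount + MAX_PARALLEL - 1) / MAX_PARALLEL;
        return new ResourcePlan(parallelism, cpuCores, memoryG, batches);
    }
}
```

With these assumptions, a directory with 2 files is planned with 2 parallel tasks, 2 CPU cores and 4G of memory, and a directory with 250 files is planned with 100 parallel tasks run in 3 rounds.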

Step S230: generating a data migration template according to the directory sorting result and the data transmission strategy of each directory, so that the data on the Hadoop cluster to be migrated can be migrated to the target Hadoop cluster according to the data migration template.

Specifically, the migration order is defined by the directory priority sorting result, and each directory is migrated according to its own data transmission strategy, which yields a complete migration template.

It should be noted that the migration template can be understood as a migration ledger that records each directory to be migrated, the number and sizes of the files in each directory, and the migration strategy of each directory. During migration, the current directory is migrated according to its own migration strategy. The migration template is a file whose entries are all directories, and migration is carried out directory by directory: the directories listed in the template are migrated, and directories not listed are not. See fig. 5 for the specific structure.

Each directory is migrated to the same directory on the target cluster. For example, if directory B is migrated from the first cluster to the second cluster, everything under the /A/B/ directory of the first cluster (files E and F) is migrated to the /A/B/ directory of the second cluster.
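A migration template of this kind can be sketched in Java as a list of text lines, one per directory, in priority order, each mapping a source directory to the same path on the target cluster. The line layout (tab-separated source, target, priority, file count, total bytes) and the names `TemplateEntry` and `MigrationTemplate` are hypothetical; the actual structure is the one shown in fig. 5. Smaller priority numbers are assumed to be migrated first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical template entry for one directory to be migrated. */
final class TemplateEntry {
    final String path;
    final int priority; // ASSUMPTION: smaller value = migrated earlier
    final int fileCount;
    final long totalBytes;

    TemplateEntry(String path, int priority, int fileCount, long totalBytes) {
        this.path = path;
        this.priority = priority;
        this.fileCount = fileCount;
        this.totalBytes = totalBytes;
    }
}

final class MigrationTemplate {
    private MigrationTemplate() {}

    /** Renders one line per directory, ordered by priority. */
    static List<String> render(List<TemplateEntry> entries, String sourceRoot, String targetRoot) {
        List<TemplateEntry> ordered = new ArrayList<>(entries);
        ordered.sort(Comparator.comparingInt((TemplateEntry e) -> e.priority));
        List<String> lines = new ArrayList<>();
        for (TemplateEntry e : ordered) {
            // The target path equals the source path; only the cluster root differs.
            lines.add(String.join("\t",
                sourceRoot + e.path,
                targetRoot + e.path,
                String.valueOf(e.priority),
                String.valueOf(e.fileCount),
                String.valueOf(e.totalBytes)));
        }
        return lines;
    }
}
```

Only directories present in the entry list appear in the rendered template, which matches the rule that directories not in the template are not migrated.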

In an alternative embodiment, referring to fig. 6, the step S400 may include the following:

Step S410: verifying whether the directories and subdirectories on the target Hadoop cluster are complete after migration;

Step S420: verifying whether the permissions of the directories and subdirectories on the target Hadoop cluster are correct after migration;

It should be noted that the permissions include read permission, write permission and the like, which are preset attributes of a file.

Step S430: verifying whether the number of files in the directories and subdirectories on the target Hadoop cluster is correct after migration;

Step S440: verifying whether the number of bytes in the directories and subdirectories on the target Hadoop cluster is correct after migration;

Step S450: verifying whether the number of records corresponding to the files on the target Hadoop cluster is correct after migration;

Step S460: verifying whether the file content on the target Hadoop cluster is correct after migration.

It should be noted that the verification may be performed on the data conditions of the two clusters obtained by recursive scanning. Alternatively, a set of threads may read the directory or file information on each cluster step by step, compare it, and then read the next directory or file for comparison.

Verification can be based on the number of bytes of the files under a directory. The corresponding record count means that each file corresponds to a table, and the check determines whether the record counts of the tables are consistent.

Referring specifically to FIG. 7, it can be checked whether the file sizes of the same directory are the same on the first cluster and on the second cluster. Referring to fig. 8, it can be checked whether the same table holds the same data for a given day on the first cluster and on the second cluster.
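The directory-level checks (completeness, permissions, file count and byte count) can be sketched in Java as a comparison of two ledgers, one scanned from each cluster. `DirRecord` and `MigrationVerifier` are hypothetical names, and the record-count and content checks of steps S450 and S460 are not covered by this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical ledger entry for one directory on one cluster. */
final class DirRecord {
    final String permission;
    final int fileCount;
    final long totalBytes;

    DirRecord(String permission, int fileCount, long totalBytes) {
        this.permission = permission;
        this.fileCount = fileCount;
        this.totalBytes = totalBytes;
    }
}

final class MigrationVerifier {
    private MigrationVerifier() {}

    /** Compares the target ledger with the source ledger; an empty list means no mismatch. */
    static List<String> verify(Map<String, DirRecord> source, Map<String, DirRecord> target) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, DirRecord> entry : source.entrySet()) {
            String path = entry.getKey();
            DirRecord expected = entry.getValue();
            DirRecord actual = target.get(path);
            if (actual == null) {
                problems.add("missing directory: " + path);
                continue;
            }
            if (!expected.permission.equals(actual.permission)) {
                problems.add("permission mismatch: " + path);
            }
            if (expected.fileCount != actual.fileCount) {
                problems.add("file count mismatch: " + path);
            }
            if (expected.totalBytes != actual.totalBytes) {
                problems.add("byte count mismatch: " + path);
            }
        }
        return problems;
    }
}
```

Returning a list of mismatches rather than a single boolean lets a display page show which directory failed and why.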

In an alternative embodiment, the data migration method may further include: displaying the data migration status.

The data migration status includes: the directories waiting to be migrated, the migration progress of each directory, the size of the migrated data of each directory, the migration time consumed by each directory, the expected migration completion time of each directory and the priority level of the migration processing of each directory.

Specifically, a front-end interface displays the directories to be migrated, the migration progress of each directory, the migrated size of each directory, the migration time of each directory, the migration completion time of each directory, the priority level of each directory's migration processing, whether the expected directories and files are present, the total data volume of the current migration, whether the data migration succeeded or failed, the failure details and cause, whether to retry, and so on. The migration progress is thus displayed intuitively, so that migration operators can follow the migration status.
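The text does not say how the progress and the expected completion time are computed. One simple possibility, sketched here in Java under the assumption of a constant transfer rate, is to derive both from the bytes migrated so far. `ProgressEstimator` is a hypothetical name.

```java
/**
 * Sketch: per-directory progress figures for a display page.
 * ASSUMPTION: the remaining time is extrapolated linearly from the
 * average rate so far; the source does not specify the formula.
 */
final class ProgressEstimator {
    private ProgressEstimator() {}

    /** Whole-number percentage of bytes already migrated, capped at 100. */
    static int percentDone(long migratedBytes, long totalBytes) {
        if (totalBytes <= 0) {
            return 100; // nothing to copy, e.g. a directory without files
        }
        return (int) Math.min(100.0, 100.0 * migratedBytes / totalBytes);
    }

    /** Estimated seconds left, or -1 when no data has been migrated yet. */
    static long remainingSeconds(long migratedBytes, long totalBytes, long elapsedSeconds) {
        if (migratedBytes <= 0) {
            return -1;
        }
        long remainingBytes = Math.max(0L, totalBytes - migratedBytes);
        return remainingBytes * elapsedSeconds / migratedBytes;
    }
}
```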

In an alternative embodiment, the data migration method may further include: setting a priority for each scanned directory.

Based on the same inventive concept, the embodiments of the present application also provide a data migration apparatus, which may be used to implement the method described in the above embodiments, as described in the following embodiments. Since the principle of the data migration device for solving the problem is similar to that of the above method, the implementation of the data migration device can be referred to the implementation of the above method, and the repetition is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.

Fig. 9 is a block diagram of a data migration apparatus in an embodiment of the present invention. As shown in fig. 9, the data migration apparatus specifically includes: a preprocessing module 10, a migration strategy acquisition module 20, a data migration module 30 and a verification module 40.

The preprocessing module 10 acquires the data distribution condition on the Hadoop cluster to be migrated;

the migration strategy acquisition module 20 acquires a data migration strategy according to the data distribution condition;

the data migration module 30 directly migrates the data on the Hadoop cluster to be migrated to the target Hadoop cluster according to the data migration strategy;

and the verification module 40 verifies the data on the target Hadoop cluster after migration according to the data distribution condition on the Hadoop cluster to be migrated.

In an alternative embodiment, the preprocessing module 10 includes: a recursive scanning unit for recursively scanning the directories on the Hadoop cluster to be migrated to obtain the data distribution condition, wherein the data distribution condition comprises: the number of files under each directory, the sizes of the files and the preset priority level of the directory.

In an alternative embodiment, the migration strategy acquisition module 20 includes: a sorting unit, a transmission planning unit and a migration template acquisition unit.

The sorting unit sorts the directories according to the preset priority level of each directory;

the transmission planning unit acquires the data transmission strategy of each directory according to the number of files under the directory and the sizes of the files;

the migration template acquisition unit generates a data migration template according to the directory sorting result and the data transmission strategy of each directory.

In an alternative embodiment, the data migration module 30 includes: a data migration unit for migrating the data on the Hadoop cluster to be migrated to the target Hadoop cluster according to the data migration template.

In an alternative embodiment, the verification module 40 includes: first to sixth verification units.

The first verification unit verifies whether the directories and subdirectories on the target Hadoop cluster are complete after migration;

the second verification unit verifies whether the permissions of the directories and subdirectories on the target Hadoop cluster are correct after migration;

the third verification unit verifies whether the number of files in the directories and subdirectories on the target Hadoop cluster is correct after migration;

the fourth verification unit verifies whether the number of bytes in the directories and subdirectories on the target Hadoop cluster is correct after migration;

the fifth verification unit verifies whether the number of records corresponding to the files on the target Hadoop cluster is correct after migration;

and the sixth verification unit verifies whether the file content on the target Hadoop cluster is correct after migration.

In an alternative embodiment, the data migration apparatus further includes: a progress display module for displaying the data migration status.

The apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is an electronic device, which may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

In a typical example the electronic device comprises in particular a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the following steps when said program is executed:

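acquiring the data distribution condition on the Hadoop cluster to be migrated;

acquiring a data migration strategy according to the data distribution condition;

directly migrating the data on the Hadoop cluster to be migrated to a target Hadoop cluster according to the data migration strategy;

and verifying the data on the target Hadoop cluster after migration according to the data distribution condition on the Hadoop cluster to be migrated.

From the above description, it can be seen that the electronic device provided by the embodiment of the invention can be used for data migration between Hadoop clusters. Data, directories, permissions, verification, monitoring and display are synchronized automatically between the two Hadoop clusters; a user only needs to install one installation package, with no additional workload; the verification accuracy is one hundred percent, and much unnecessary work is avoided.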

Referring now to fig. 10, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present application is shown.

As shown in fig. 10, the electronic apparatus 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate works and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM)) 603. In the RAM603, various programs and data required for the operation of the system 600 are also stored. The CPU601, ROM602, and RAM603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.

In particular, according to embodiments of the present invention, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, an embodiment of the invention includes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:

acquiring the data distribution condition on the Hadoop cluster to be migrated;

As can be seen from the above description, the computer readable storage medium provided by the embodiments of the present invention can be used for data migration between Hadoop clusters. Data, directories, permissions, verification, monitoring, display and the like are synchronized automatically between the two Hadoop clusters. The user only needs to install a single installation package and incurs no additional workload, the verification accuracy is one hundred percent, and much unnecessary work is avoided.
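The verification mentioned above checks the data on the target cluster against the data distribution condition recorded for the cluster to be migrated. A minimal sketch of such a check follows. It assumes the distribution is held as a mapping from directory path to a `(file_count, total_bytes)` pair, and the comparison criteria (presence, count, size) are likewise an assumption. The names `Distribution` and `verify` are hypothetical.

```python
from typing import Dict, List, Tuple

# (file_count, total_bytes) per directory path; this record shape is assumed.
Distribution = Dict[str, Tuple[int, int]]


def verify(source: Distribution, target: Distribution) -> List[str]:
    """Return readable mismatches; an empty list means the check passed."""
    problems: List[str] = []
    for path, (count, size) in sorted(source.items()):
        if path not in target:
            problems.append(f"{path}: missing on target")
            continue
        t_count, t_size = target[path]
        if t_count != count:
            problems.append(f"{path}: file count {t_count} != {count}")
        if t_size != size:
            problems.append(f"{path}: total bytes {t_size} != {size}")
    return problems
```

Driving the check from the source-side record means that a directory that never arrived on the target is reported instead of being silently skipped.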

In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.

Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

For convenience of description, the above devices are described as being functionally divided into various units. Of course, when the present application is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described in a progressive manner. Identical and similar parts of the embodiments may be cross-referenced, and each embodiment mainly describes its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for the relevant details, refer to the corresponding parts of the description of the method embodiments.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims

1. A method of data migration, comprising:

acquiring the data distribution condition on the Hadoop cluster to be migrated;

acquiring a data migration strategy according to the data distribution condition; wherein acquiring the data migration strategy according to the data distribution condition includes: sorting the directories according to the preset priority level of each directory; acquiring a data transmission strategy for each directory according to the number of files under the directory and the size of the files; and generating a data migration template according to the directory sorting result and the data transmission strategy of each directory; specifically, the directories are classified, the classes including: directory only, directory with small files, and directory with large files; the degree of parallelism to start and the CPU and memory to occupy are determined according to the number of files in the directory and whether the files are large files or small files; the data migration template records each directory to be migrated, the number and size of the files in each directory, and the migration strategy of each directory;

2. The data migration method according to claim 1, wherein acquiring the data distribution condition on the Hadoop cluster to be migrated includes:

3. The data migration method according to claim 2, wherein directly migrating the data on the Hadoop cluster to be migrated to the target Hadoop cluster according to the data migration strategy includes:

4. The data migration method according to claim 1, wherein verifying the data on the target Hadoop cluster after migration according to the data distribution condition on the Hadoop cluster to be migrated includes:

5. The data migration method of claim 1, further comprising:

and displaying the data migration condition.

6. The data migration method of claim 5, wherein the data migration condition comprises: the directories waiting to be migrated, the migration progress of each directory, the size of the migrated data of each directory, the migration time consumed by each directory, the expected migration completion time of each directory, and the migration priority level of each directory.

7. A data migration apparatus, comprising:

a migration strategy acquisition module that acquires a data migration strategy according to the data distribution condition; the migration strategy acquisition module includes: a sorting unit for sorting the directories according to the preset priority level of each directory; specifically, the directories are classified, the classes including: directory only, directory with small files, and directory with large files; the degree of parallelism to start and the CPU and memory to occupy are determined according to the number of files in the directory and whether the files are large files or small files; the data migration template records each directory to be migrated, the number and size of the files in each directory, and the migration strategy of each directory;

a transmission planning unit that acquires the data transmission strategy of each directory according to the number of files under the directory and the size of the files; and a migration template acquisition unit that generates a data migration template according to the directory sorting result and the data transmission strategy of each directory;

8. The data migration apparatus of claim 7, wherein the preprocessing module comprises:

a recursive scanning unit for recursively scanning the directories on the Hadoop cluster to be migrated to obtain the data distribution condition, wherein the data distribution condition comprises: the number of files under each directory, the size of the files, and the preset priority level of the directory.

9. The data migration apparatus of claim 8, wherein the data migration module comprises:

10. The data migration apparatus of claim 7, wherein the verification module comprises:

11. The data migration apparatus of claim 7, further comprising:

12. The data migration apparatus of claim 11, wherein the data migration condition comprises: the directories waiting to be migrated, the migration progress of each directory, the size of the migrated data of each directory, the migration time consumed by each directory, the expected migration completion time of each directory, and the migration priority level of each directory.

13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the data migration method of any one of claims 1 to 6.

14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the data migration method of any one of claims 1 to 6.
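For illustration only, the strategy generation recited in claim 1 (sorting directories by preset priority, classifying each directory, choosing parallelism and resources, and recording the result in a migration template) can be sketched as follows. Every concrete value is an assumption made for the sketch: the 128 MiB cut-off between small and large files, the parallelism and memory figures, the one-core-per-task CPU model, the convention that a lower priority number migrates first, and the names `DirInfo`, `TemplateEntry`, `classify`, `plan`, and `build_template`.

```python
from dataclasses import dataclass
from typing import List

LARGE_FILE_BYTES = 128 * 1024 * 1024  # assumed small/large cut-off


@dataclass
class DirInfo:
    path: str
    file_count: int
    total_bytes: int
    priority: int  # assumed: lower number migrates first


@dataclass
class TemplateEntry:
    path: str
    file_count: int
    total_bytes: int
    category: str
    parallelism: int
    cpu_cores: int
    memory_mb: int


def classify(d: DirInfo) -> str:
    """Place a directory in one of the three classes named in claim 1."""
    if d.file_count == 0:
        return "directory only"
    average = d.total_bytes / d.file_count
    if average >= LARGE_FILE_BYTES:
        return "directory and large files"
    return "directory and small files"


def plan(d: DirInfo) -> TemplateEntry:
    """Pick illustrative parallelism and resources for one directory."""
    category = classify(d)
    if category == "directory only":
        parallelism, memory_mb = 1, 256
    elif category == "directory and small files":
        # Many small files: more parallel tasks, modest memory per task.
        parallelism, memory_mb = min(32, d.file_count // 1000 + 1), 512
    else:
        # Large files: at most one task per file, more memory per task.
        parallelism, memory_mb = min(16, d.file_count), 2048
    return TemplateEntry(d.path, d.file_count, d.total_bytes, category,
                         parallelism, parallelism, memory_mb)


def build_template(dirs: List[DirInfo]) -> List[TemplateEntry]:
    """Order directories by priority and attach a strategy to each."""
    return [plan(d) for d in sorted(dirs, key=lambda d: d.priority)]
```

Because `sorted` is stable, directories that share a priority level keep their scan order in the resulting template.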