CN113032368A - Data migration method and device, storage medium and platform - Google Patents

Info

Publication number
CN113032368A
Authority
CN
China
Prior art keywords
data
migrated
hive
target
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110321312.2A
Other languages
Chinese (zh)
Inventor
金磐石
鲜伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202110321312.2A
Publication of CN113032368A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/214: Database migration support
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a data migration method, a data migration device, a storage medium and a data migration platform, and relates to the technical field of big data processing. The method is applied to a distributed big data migration platform and comprises the following steps: loading data to be migrated in a source database into a Hive data warehouse of the distributed big data migration platform; in the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data; and migrating the target data from the Hive data warehouse to a target database. Through the technical scheme provided by the embodiment of the invention, the data in the source database can be quickly and efficiently migrated to the target database, and the influence on system service in the data migration process is reduced.

Description

Data migration method and device, storage medium and platform
Technical Field
The embodiment of the invention relates to the technical field of big data processing, in particular to a data migration method, a data migration device, a storage medium and a data migration platform.
Background
With advances in internet technology, data acquisition channels have diversified, providing abundant data and massive amounts of information to many industries. As information systems continue to evolve, however, an existing environment that can no longer meet new requirements is replaced by a more capable system, which requires migrating data between different databases.
Historical abnormal data must also be repaired during migration so that the data conforms to the characteristics of the new database. Whether the migration succeeds directly determines whether the new system can go live, and the quality of the migrated data strongly affects the stability of the new system; this is particularly important in the financial and telecommunications industries. How to migrate and convert data quickly and efficiently while reducing the impact on services during migration has therefore become very important.
Disclosure of Invention
Embodiments of the present invention provide a data migration method, apparatus, storage medium, and platform, which can quickly and efficiently migrate data in a source database to a target database.
In a first aspect, an embodiment of the present invention provides a data migration method, including:
loading data to be migrated in a source database into a Hive data warehouse of the distributed big data migration platform;
in the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data;
and migrating the target data from the Hive data warehouse to a target database.
In a second aspect, an embodiment of the present invention further provides a data migration apparatus, including:
the data loading module is used for loading data to be migrated in the source database into a Hive data warehouse of the distributed big data migration platform;
the data conversion module is used for performing data conversion on the data to be migrated through a Spark engine in the Hive data warehouse to generate target data;
and the data migration module is used for migrating the target data from the Hive data warehouse to a target database.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data migration method provided in the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a distributed big data migration platform, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the data migration method provided in the embodiment of the present invention.
According to the data migration scheme provided by the embodiment of the invention, data to be migrated in a source database is loaded into a Hive data warehouse of the distributed big data migration platform; in the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data; and migrating the target data from the Hive data warehouse to a target database. Through the technical scheme provided by the embodiment of the invention, the data in the source database can be quickly and efficiently migrated to the target database, and the influence on system service in the data migration process is reduced.
Drawings
FIG. 1 is a flowchart of a data migration method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of data conversion provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of data cleansing provided by an embodiment of the present invention;
FIG. 4 is a flow chart of a method of data migration in another embodiment of the present invention;
FIG. 5 is a schematic diagram of a data migration process provided by an embodiment of the invention;
FIG. 6 is a schematic structural diagram of a data migration apparatus according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a distributed big data migration platform in another embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present invention. It should be understood that the drawings and the embodiments of the present invention are illustrative only and are not intended to limit the scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an", and "the" in the present invention are illustrative rather than limiting, and those skilled in the art will understand them to mean "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present invention are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
In the related art, data migration is mainly performed with data migration tools, but such tools can only migrate data one-to-one and cannot handle scenarios that require data conversion, data processing, or data repair. Distributed large-scale databases are increasingly used in industries such as finance and telecommunications, while the data processing capacity of a single machine has reached its ceiling and cannot meet the performance requirements of mass data migration. Data conversion operations such as data splitting, handling of illegal data, and removal of garbled characters are indispensable steps in the migration process, yet existing migration tools cannot perform them.
FIG. 1 is a flowchart of a data migration method according to an embodiment of the present invention. The method is applicable to data migration scenarios and may be executed by a data migration apparatus, which may be implemented in hardware and/or software and is generally integrated in a distributed big data migration platform. As shown in FIG. 1, the method includes the following steps:
Step 110: load the data to be migrated in the source database into a Hive data warehouse of the distributed big data migration platform.
The distributed big data migration platform can be understood as a data migration platform that includes a distributed database. A distributed database is formed by connecting multiple computer servers at different locations over a network, integrating them into one complete, global large database that is logically centralized and physically distributed. For example, the distributed big data migration platform may be a Hadoop platform. The source database may be a database that stores data in a source system, where the source system is independent of the distributed big data migration platform. Hive may be a data warehouse tool built on the distributed big data migration platform (such as Hadoop); it can be used for data extraction, transformation and loading, and is a large-scale data component that can store, query and analyze data stored in Hadoop.
In the embodiment of the invention, the distributed big data migration platform loads the data to be migrated in the source database into its Hive data warehouse. The data to be migrated may be all or part of the data in the source database. Specifically, the distributed big data migration platform acquires the data to be migrated from the source database and stores it in the Hive data warehouse of the platform. Optionally, the source database may include databases such as HDFS, Hive, Elasticsearch, HBase, Oracle and MySQL. The data to be migrated may be any business-related data. For example, it may be client-related data, such as basic client certificate information, basic client address information, basic client telephone information, basic client contract information, client institution information, basic client relationship information, and other client information.
Optionally, loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform includes: acquiring the data to be migrated from the source database based on the File Transfer Protocol (FTP) and storing it in an HDFS file system of the distributed big data migration platform; and loading the data to be migrated in the HDFS file system into the Hive data warehouse of the distributed big data migration platform. That is, the distributed big data migration platform pulls the data to be migrated from the source database over FTP and stores it in its HDFS file system. The data may be stored in the HDFS file system directly, or it may first be stored on a local disk of the distributed big data migration platform and then loaded from the local disk into the HDFS file system. Optionally, acquiring the data to be migrated from the source database based on FTP and storing it in the HDFS file system of the distributed big data migration platform includes: acquiring the data to be migrated from the source database based on FTP and synchronizing it to a local disk of the distributed big data migration platform; and loading the data to be migrated from the local disk through a data access component and storing it in the HDFS file system of the distributed big data migration platform.
Specifically, the distributed big data migration platform synchronizes the data to be migrated from the source database to the local disk over FTP, then uploads it from the local disk to the HDFS file system through the data access component. The data to be migrated in the HDFS file system is then loaded into the Hive data warehouse of the distributed big data migration platform.
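As an illustration only, the local-disk to HDFS to Hive path described above might be planned as follows. The patent does not name the data access component; this sketch assumes it is the standard `hdfs dfs -put` command followed by a HiveQL `LOAD DATA INPATH` statement. The function name `build_load_plan` and all paths and table names are hypothetical.

```python
import posixpath
import shlex


def build_load_plan(local_file, hdfs_dir, hive_table):
    """Return the upload command and HiveQL statement for one staged file."""
    hdfs_path = posixpath.join(hdfs_dir, posixpath.basename(local_file))
    # Upload the FTP-synchronized file from local disk into HDFS.
    put_cmd = ["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir]
    # Move the HDFS file into the storage location of the Hive table.
    load_sql = "LOAD DATA INPATH '{}' INTO TABLE {}".format(hdfs_path, hive_table)
    return put_cmd, load_sql


put_cmd, load_sql = build_load_plan(
    "/data/ftp_in/customer_20210325.dat",  # hypothetical local path
    "/staging/customer",                   # hypothetical HDFS directory
    "staging.customer_raw",                # hypothetical Hive table
)
print(shlex.join(put_cmd))
print(load_sql)
```

The sketch only builds the command and the statement; running them requires a Hadoop client and a Hive session on the platform.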
Optionally, the source database is a relational database; correspondingly, loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform includes: loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform based on Java Database Connectivity (JDBC). Specifically, when the source database in the source system is a relational database, that is, when data in a relational database needs to be migrated, the distributed big data migration platform may extract the data to be migrated from the source database through JDBC and load it into the Hive data warehouse of the platform.
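One possible way to realize the JDBC path is Spark's JDBC data source; the patent does not state which JDBC client is used, so this is an assumption. The helper names, the URL and the table names below are hypothetical. `load_into_hive` expects a `SparkSession` created with Hive support and is not executed here.

```python
def jdbc_read_options(url, table, user, password, partition_column=None,
                      lower_bound=None, upper_bound=None, num_partitions=None):
    """Build the option map for Spark's JDBC data source."""
    opts = {"url": url, "dbtable": table, "user": user, "password": password}
    if partition_column is not None:
        # Spark requires these four options together for a parallel read.
        opts.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower_bound),
            "upperBound": str(upper_bound),
            "numPartitions": str(num_partitions),
        })
    return opts


def load_into_hive(spark, opts, hive_table):
    """Read the source table over JDBC and save it as a Hive table."""
    df = spark.read.format("jdbc").options(**opts).load()
    df.write.mode("overwrite").saveAsTable(hive_table)


opts = jdbc_read_options(
    "jdbc:mysql://db.example.com:3306/crm",  # hypothetical source database
    "customer", "reader", "secret",
    partition_column="cust_id", lower_bound=1, upper_bound=1000000,
    num_partitions=8,
)
```

Partitioning the read on a numeric column lets several executors pull from the source in parallel, which is one way to keep extraction time short.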
Optionally, the source database is a non-relational database; correspondingly, loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform includes: loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform based on the network authentication protocol Kerberos. The non-relational database may be a NoSQL database, that is, a database that cannot be queried or updated with standard SQL statements. The non-relational database may include an HBase database and an Elasticsearch database. Specifically, when the source database of the source system is a non-relational database whose data needs to be migrated, the distributed big data migration platform may extract the data to be migrated from the source database through the network authentication protocol Kerberos and load it into the Hive data warehouse of the platform.
Step 120: in the Hive data warehouse, perform data conversion on the data to be migrated through a Spark engine to generate target data.
The Spark engine is a large-scale data processing engine that computes in memory, so intermediate results do not need to be written to disk; it may include Spark Core, Spark SQL, Spark Streaming and the like. In the embodiment of the invention, the distributed big data migration platform converts the data to be migrated through the Spark engine in the Hive data warehouse and takes the converted data as the target data. The conversion may include operations such as data fragmentation, data association, redundant data discarding, difference data repair and missing data completion, so that the generated target data conforms to the storage rules of the target database.
Optionally, performing data conversion on the data to be migrated through the Spark engine to generate target data includes: executing a data conversion strategy based on the Spark engine, and converting the data to be migrated according to the data conversion strategy to generate the target data. The data conversion strategy may be developed as an SQL script and specifies the conversion mode and conversion rules for the data to be migrated. Executing the strategy on the Spark engine therefore converts the data to be migrated according to that strategy. Optionally, the data conversion strategy includes at least one of a data fragmentation strategy, a data association strategy, a redundant data discarding strategy, a difference data repair strategy, and a missing data completion strategy. The data fragmentation strategy can be understood as a rule for splitting the data to be migrated; the data association strategy as a rule for associating (joining) the data to be migrated; the redundant data discarding strategy as a rule for identifying redundant data in the data to be migrated together with a rule for handling it; the difference data repair strategy as a rule for identifying difference data in the data to be migrated together with a rule for repairing it; and the missing data completion strategy as a rule for identifying missing data in the data to be migrated together with a rule for completing it.
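To make three of the strategies concrete, the following is a minimal sketch in plain Python. The patent describes the strategies as SQL scripts executed on the Spark engine, so these functions are only a stand-in for the rules, not the described implementation; the function names, field names and code mappings are hypothetical.

```python
def discard_redundant(rows, key):
    """Redundant data discarding: keep the first row seen for each key."""
    seen, kept = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


def complete_missing(rows, defaults):
    """Missing data completion: fill absent or None fields from defaults."""
    out = []
    for row in rows:
        filled = dict(row)
        for field, default in defaults.items():
            if filled.get(field) is None:
                filled[field] = default
        out.append(filled)
    return out


def repair_difference(rows, field, mapping):
    """Difference data repair: map legacy code values to target values."""
    return [{**row, field: mapping.get(row[field], row[field])} for row in rows]


rows = [
    {"id": 1, "gender": "M", "country": None},
    {"id": 1, "gender": "M", "country": None},  # duplicate of the first row
    {"id": 2, "gender": "0", "country": "CN"},  # legacy gender code
]
result = repair_difference(
    complete_missing(discard_redundant(rows, "id"), {"country": "CN"}),
    "gender", {"0": "F", "1": "M"},
)
print(result)
```

Each rule pairs a judgment (is this row redundant, missing a value, or using a legacy code?) with a handling step, which mirrors how the strategies are described above.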
For example, when the data conversion strategy is a data fragmentation strategy and the data to be migrated is client information, that is, when the client information in the source database is migrated to a distributed target, the information for every dimension of a client can be fragmented by the institution to which the client belongs to generate the target data. This reduces cross-database transactions: queries and maintenance for the same client are completed within the same fragment, which shortens transaction response time and can improve the client experience.
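A fragmentation rule keyed on the client's institution could look like the sketch below. The patent does not specify the routing function; the CRC32-modulo choice, the function names and the `institution_id` field are assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def shard_for(institution_id, num_shards):
    """Map an institution to a shard; crc32 is stable across processes."""
    return zlib.crc32(institution_id.encode("utf-8")) % num_shards


def fragment(records, num_shards):
    """Group records so all rows of one institution land in one shard."""
    shards = defaultdict(list)
    for rec in records:
        shards[shard_for(rec["institution_id"], num_shards)].append(rec)
    return dict(shards)


records = [
    {"cust_id": 1, "institution_id": "BR001", "kind": "address"},
    {"cust_id": 1, "institution_id": "BR001", "kind": "phone"},
    {"cust_id": 2, "institution_id": "BR002", "kind": "address"},
]
shards = fragment(records, num_shards=4)
```

Because every dimension of a client's data carries the same institution key, the address and phone rows of client 1 end up in the same fragment, so a later lookup touches one shard only.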
Illustratively, FIG. 2 is a schematic diagram of data conversion provided by an embodiment of the present invention. As shown in FIG. 2, the data conversion strategy is a data association strategy, and data A (Data-A) and data B (Data-B) are data to be migrated from the source database. The Spark engine first determines the DataFrame corresponding to data A and to data B, then determines the Hive data table corresponding to each, and finally joins (Join) the DataFrame and Hive data table information of data A with those of data B to generate the final target data (Data-Result).
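The association step of FIG. 2 amounts to a join on a shared key. The sketch below shows the idea with a plain Python hash join over lists of dictionaries; on the platform itself the same operation would be expressed with Spark, for example through `DataFrame.join`. The key name `cust_id` and the sample rows are hypothetical.

```python
def inner_join(left, right, key):
    """Hash join: index the right side by key, then probe with the left."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            # On a field-name clash, the left-hand value wins.
            joined.append({**rrow, **lrow})
    return joined


data_a = [{"cust_id": 1, "name": "Zhang"}, {"cust_id": 2, "name": "Li"}]
data_b = [{"cust_id": 1, "phone": "010-0000"}, {"cust_id": 3, "phone": "021-0000"}]
data_result = inner_join(data_a, data_b, "cust_id")
print(data_result)
```

Only client 1 appears in both inputs, so the result holds a single merged row.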
Optionally, performing data conversion on the data to be migrated through the Spark engine in the Hive data warehouse to generate target data includes: performing data conversion on the data to be migrated through the Spark engine in the memory of the Yarn component of the Hive data warehouse to generate the target data. Specifically, the distributed big data platform converts the data to be migrated in the memory of the Yarn component through the Spark engine. The advantage of this arrangement is very high conversion performance: unlike traditional data conversion, temporary data does not have to be written to the local disk, which effectively reduces the number of disk I/O reads and writes. Because memory reads and writes are tens of times faster than disk reads and writes, the more complex the conversion logic, the greater the efficiency gain.
Step 130: migrate the target data from the Hive data warehouse to a target database.
In the embodiment of the invention, the distributed big data migration platform migrates the target data out of the Hive data warehouse and into the target database. The target database may be a database that stores data in a target system, where the target system is independent of the distributed big data migration platform. The target database may include databases such as HDFS, Hive, HBase, Oracle and OceanBase.
Optionally, migrating the target data from the Hive data warehouse to the target database includes: migrating the target data from the Hive data warehouse to the target database based on the File Transfer Protocol (FTP); or, the target database is a relational database, and correspondingly, migrating the target data from the Hive data warehouse to the target database includes: migrating the target data from the Hive data warehouse to the relational database based on Java Database Connectivity (JDBC); or, the target database is a non-relational database, and correspondingly, migrating the target data from the Hive data warehouse to the target database includes: migrating the target data from the Hive data warehouse to the non-relational database based on the network authentication protocol Kerberos. Specifically, to migrate over FTP, the target data is first exported as a data file, and the distributed big data migration platform then transmits the data file from the Hive data warehouse to the target database over FTP. The target data may be exported as a fixed-length or non-fixed-length data file. When the target database is a relational database, the distributed big data migration platform can migrate the target data from the Hive data warehouse to the target database based on JDBC. When the target database is a non-relational database, the distributed big data migration platform can migrate the target data from the Hive data warehouse to the target database based on the network authentication protocol Kerberos.
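The difference between fixed-length and non-fixed-length export files can be sketched as follows. The patent gives no file layout, so the field widths, the `|` separator and the function names are assumptions; widths are counted in characters here, whereas a real layout for Chinese text would more likely be defined in bytes.

```python
def to_fixed_length(record, layout):
    """layout: list of (field, width); values are left-aligned, space-padded."""
    parts = []
    for field, width in layout:
        text = str(record.get(field, ""))
        if len(text) > width:
            raise ValueError(f"{field!r} exceeds width {width}: {text!r}")
        parts.append(text.ljust(width))
    return "".join(parts)


def to_delimited(record, fields, sep="|"):
    """Non-fixed-length form: fields joined by a separator."""
    return sep.join(str(record.get(f, "")) for f in fields)


record = {"id": 7, "name": "Li"}
fixed_line = to_fixed_length(record, [("id", 4), ("name", 6)])
delimited_line = to_delimited(record, ["id", "name"])
```

Fixed-length lines can be parsed by position alone, while delimited lines are shorter but require a separator that never occurs in the data.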
According to the data migration method provided by the embodiment of the invention, data to be migrated in a source database is loaded into a Hive data warehouse of the distributed big data migration platform; in the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data; and migrating the target data from the Hive data warehouse to a target database. Through the technical scheme provided by the embodiment of the invention, the data in the source database can be quickly and efficiently migrated to the target database, and the influence on system service in the data migration process is reduced.
In some embodiments, before the data to be migrated is converted by the Spark engine, the method further includes: performing data cleaning on the data to be migrated based on data cleaning configuration information. Specifically, garbled data or illegal data can be identified in the data to be migrated based on the data cleaning configuration information and then cleaned. Illegal Unicode code values, illegal dates, illegal numerical values, invisible garbled characters and control characters can all be cleaned in this way. FIG. 3 is a schematic diagram of data cleaning according to an embodiment of the present invention. As shown in FIG. 3, code values in the commonly used UTF-8 encoding are divided into regions such as Chinese characters, digits, letters and control characters. It can then be determined whether the data to be migrated falls in the Chinese character, letter, digit or symbol Unicode region; if it falls in none of them, it is cleaned as garbled data. Whether the data to be migrated is illegal or abnormal is further judged by numeric type, timestamp type, non-null constraint or length; if so, the abnormal data is cleaned based on the data cleaning configuration information.
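A region-based cleaning rule of this kind might be sketched as below. The allowed code point ranges, the choice to strip offending characters rather than reject the whole value, and the `%Y%m%d` date format are all assumptions standing in for the data cleaning configuration information, which the patent does not spell out.

```python
from datetime import datetime

# Allowed code point regions (assumed for illustration).
ALLOWED_RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (common Chinese characters)
    (0x0030, 0x0039),  # digits
    (0x0041, 0x005A),  # uppercase letters
    (0x0061, 0x007A),  # lowercase letters
    (0x0020, 0x002F),  # space and basic ASCII symbols
]


def is_allowed(ch):
    """True if the character falls in one of the configured regions."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def strip_garbled(text):
    """Drop control characters and anything outside the allowed regions."""
    return "".join(ch for ch in text if is_allowed(ch))


def is_valid_date(text, fmt="%Y%m%d"):
    """Illegal-date check: the value must parse as a real calendar date."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


cleaned = strip_garbled("张三\x00A1\ufffd")
```

In the example, the NUL control character and the U+FFFD replacement character are removed, while the Chinese characters, the letter and the digit are kept.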
In some embodiments, the method further includes: performing data checking on the data to be migrated or the target data based on a preset checking mode. Specifically, to ensure the correctness of the migrated data, the data to be migrated or the target data may be checked during migration so that the data is consistent before and after migration. The preset checking mode includes at least one of file checking, record count checking, data size checking, sampling checking, code value checking and primary key checking. Optionally, the data to be migrated may be checked before and after it is loaded from the source database into the Hive data warehouse of the distributed big data migration platform, or before and after it is converted by the Spark engine; the target data may be checked before and after it is migrated from the Hive data warehouse to the target database. Subsequent operations on the data to be migrated or the target data are performed only after the check passes.
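Record count, primary key and content checks can be combined into one before/after comparison, as in this sketch. The SHA-256 content digest is an assumption rather than a checking mode named in the patent, and it only applies to stages that should not change the data, such as loading or export. Function and field names are hypothetical.

```python
import hashlib


def summarize(rows, key):
    """Collect the record count, key uniqueness and an order-free digest."""
    keys = [row[key] for row in rows]
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: str(r[key])):
        digest.update(repr(sorted(row.items())).encode("utf-8"))
    return {
        "count": len(rows),
        "unique_keys": len(set(keys)) == len(keys),
        "digest": digest.hexdigest(),
    }


def check(before, after, key):
    """Return a list of problems; an empty list means the check passed."""
    a, b = summarize(before, key), summarize(after, key)
    problems = []
    if a["count"] != b["count"]:
        problems.append("record count mismatch")
    if not b["unique_keys"]:
        problems.append("duplicate primary key")
    if a["digest"] != b["digest"]:
        problems.append("content mismatch")
    return problems


source = [{"id": 1, "name": "Zhang"}, {"id": 2, "name": "Li"}]
loaded = [{"id": 2, "name": "Li"}, {"id": 1, "name": "Zhang"}]
problems = check(source, loaded, "id")
```

Sorting by primary key before hashing makes the digest independent of row order, which matters because distributed loads rarely preserve order.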
Fig. 4 is a flowchart of a data migration method in another embodiment of the present invention, as shown in fig. 4, the method includes the following steps:
Step 410: load the data to be migrated in the source database into a Hive data warehouse of the distributed big data migration platform.
Optionally, loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform, including: acquiring data to be migrated from a source database based on a File Transfer Protocol (FTP), and storing the data to be migrated into an HDFS file system of the distributed big data migration platform; and loading the data to be migrated in the HDFS file system into a Hive data warehouse of the distributed big data migration platform. The method for acquiring data to be migrated from a source database based on a File Transfer Protocol (FTP) and storing the data to be migrated into an HDFS file system of the distributed big data migration platform comprises the following steps: acquiring data to be migrated from a source database based on a File Transfer Protocol (FTP), and synchronizing the data to be migrated to a local disk of the distributed big data migration platform; and loading the data to be migrated from the local disk based on a data access component, and storing the data to be migrated into an HDFS file system of the distributed big data migration platform.
Optionally, the source database is a relational database; correspondingly, loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform includes: loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform based on Java Database Connectivity (JDBC).
Optionally, the source database is a non-relational database; correspondingly, loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform includes: and loading the data to be migrated in the source database to a Hive data warehouse of the distributed big data migration platform based on a network authentication protocol Kerberos.
Step 420: perform data cleaning on the data to be migrated based on the data cleaning configuration information.
Step 430: in the memory of the Yarn component of the Hive data warehouse, execute a data conversion strategy based on a Spark engine, and perform data conversion on the data to be migrated based on the data conversion strategy to generate target data.
Step 440: migrate the target data from the Hive data warehouse to a target database.
Optionally, migrating the target data from the Hive data warehouse to a target database, including: migrating the target data from the Hive data warehouse to a target database based on a File Transfer Protocol (FTP); or, the target database is a relational database, and correspondingly, migrating the target data from the Hive data warehouse to the target database includes: migrating the target data from the Hive data warehouse to the relational database based on Java database connectivity JDBC; or, the target database is a non-relational database, and correspondingly, migrating the target data from the Hive data warehouse to the target database includes: migrating the target data from the Hive data warehouse to a non-relational database based on a network authentication protocol Kerberos.
Optionally, the method further includes: and performing data checking on the data to be migrated or the target data based on a preset checking mode.
Illustratively, fig. 5 is a schematic diagram of a data migration process according to an embodiment of the present invention.
The data migration method provided by the embodiment of the invention can rapidly and efficiently migrate the data in the source database to the target database, reduces the influence on system services during the data migration process, and is suitable for data migration between multiple source databases and multiple target databases. The data to be migrated is cleaned during the migration process, so that the cleanliness and stability of the data in the target database are ensured.
Fig. 6 is a schematic structural diagram of a data migration apparatus according to another embodiment of the present invention. As shown in fig. 6, the apparatus includes: a data loading module 610, a data conversion module 620 and a data migration module 630, wherein:
the data loading module 610 is configured to load data to be migrated in a source database into a Hive data warehouse of the distributed big data migration platform;
a data conversion module 620, configured to perform data conversion on the data to be migrated through a Spark engine in the Hive data warehouse to generate target data;
a data migration module 630, configured to migrate the target data from the Hive data warehouse to a target database.
The data migration device provided by the embodiment of the invention loads the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform; in the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data; and migrating the target data from the Hive data warehouse to a target database. Through the technical scheme provided by the embodiment of the invention, the data in the source database can be quickly and efficiently migrated to the target database, and the influence on system service in the data migration process is reduced.
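The three-module structure (loading, conversion, migration) can be sketched as a small pipeline of callables. This is an in-memory illustration under stated assumptions, not the device itself: the class name `MigrationPipeline`, the sample rows and the list `sink` standing in for the target database are all hypothetical, and plain Python functions take the place of the Hive data warehouse and the Spark engine.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class MigrationPipeline:
    """Chains a loading step, a conversion step and a migration step."""

    def __init__(
        self,
        load: Callable[[], Iterable[Row]],
        convert: Callable[[List[Row]], List[Row]],
        migrate: Callable[[List[Row]], int],
    ) -> None:
        self.load = load
        self.convert = convert
        self.migrate = migrate

    def run(self) -> int:
        staged = list(self.load())     # data loading module
        target = self.convert(staged)  # data conversion module
        return self.migrate(target)    # data migration module


source = [{"id": 1, "amount": "10"}, {"id": 2, "amount": "25"}]
sink: List[Row] = []  # stands in for the target database


def load() -> Iterable[Row]:
    return iter(source)


def convert(rows: List[Row]) -> List[Row]:
    # Example conversion: cast the amount column from text to integer.
    return [{**row, "amount": int(row["amount"])} for row in rows]


def migrate(rows: List[Row]) -> int:
    sink.extend(rows)
    return len(rows)


moved = MigrationPipeline(load, convert, migrate).run()
print(moved, sink)
```

Passing the three steps in as callables keeps each module replaceable, which matches the optional variants the embodiment lists for loading and migration.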
Optionally, the data loading module includes:
the data storage unit is used for acquiring data to be migrated from a source database based on a File Transfer Protocol (FTP) and storing the data to be migrated into an HDFS file system of the distributed big data migration platform;
and the data loading unit is used for loading the data to be migrated in the HDFS file system into a Hive data warehouse of the distributed big data migration platform.
Optionally, the data storage unit is configured to:
acquiring data to be migrated from a source database based on a File Transfer Protocol (FTP), and synchronizing the data to be migrated to a local disk of the distributed big data migration platform;
and loading the data to be migrated from the local disk based on a data access component, and storing the data to be migrated into an HDFS file system of the distributed big data migration platform.
Optionally, the source database is a relational database;
correspondingly, the data loading module is configured to:
and loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform based on Java Database Connectivity (JDBC).
Optionally, the source database is a non-relational database;
correspondingly, the data loading module is configured to:
and loading the data to be migrated in the source database to a Hive data warehouse of the distributed big data migration platform based on a network authentication protocol Kerberos.
Optionally, the data conversion module is configured to:
and executing a data conversion strategy based on a Spark engine, and performing data conversion on the data to be migrated based on the data conversion strategy to generate target data.
Optionally, the data conversion policy includes at least one of a data fragmentation policy, a data association policy, a redundant data discarding policy, a difference data repair policy, and a missing data completion policy.
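Some of the listed conversion policies can be illustrated as composable functions over rows. The sketch below is written in plain Python so that it runs without a cluster; it does not use the Spark engine, and every name in it (`discard_redundant`, `complete_missing`, `run_policies`, `fragment`) is hypothetical. Only three of the five policies are shown, and the hash-based sharding is one assumed way to realize a data fragmentation policy.

```python
import zlib
from typing import Callable, Dict, List

Row = Dict[str, object]
Policy = Callable[[List[Row]], List[Row]]


def discard_redundant(key: str) -> Policy:
    """Redundant data discarding: keep the first row seen for each key."""
    def apply(rows: List[Row]) -> List[Row]:
        seen = set()
        kept = []
        for row in rows:
            if row[key] not in seen:
                seen.add(row[key])
                kept.append(row)
        return kept
    return apply


def complete_missing(defaults: Row) -> Policy:
    """Missing data completion: fill None or absent fields from defaults."""
    def apply(rows: List[Row]) -> List[Row]:
        return [
            {**defaults, **{k: v for k, v in row.items() if v is not None}}
            for row in rows
        ]
    return apply


def run_policies(rows: List[Row], policies: List[Policy]) -> List[Row]:
    """Apply the configured policies in order."""
    for policy in policies:
        rows = policy(rows)
    return rows


def fragment(rows: List[Row], key: str, shard_count: int) -> List[List[Row]]:
    """Data fragmentation: assign each row to a shard by a stable hash of its key."""
    buckets: List[List[Row]] = [[] for _ in range(shard_count)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % shard_count
        buckets[index].append(row)
    return buckets


rows = [
    {"id": 1, "currency": "CNY"},
    {"id": 1, "currency": "CNY"},   # redundant copy
    {"id": 2, "currency": None},    # missing value
]
converted = run_policies(
    rows, [discard_redundant("id"), complete_missing({"currency": "CNY"})]
)
shards = fragment(converted, "id", 2)
print(converted)
```

`zlib.crc32` is used instead of the built-in `hash` because it gives the same shard assignment across processes and runs.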
Optionally, the data conversion module is configured to:
and in the memory of the Yarn component of the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data.
Optionally, the apparatus further comprises:
and the data cleaning module is used for cleaning the data to be migrated based on the data cleaning configuration information before the data to be migrated is converted by the Spark engine.
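Cleaning driven by configuration information might look like the following minimal sketch. The embodiment does not specify the format of the data cleaning configuration information, so the keys `trim_whitespace`, `null_tokens` and `required_fields`, as well as the function `clean_rows`, are assumptions made for illustration.

```python
# Hypothetical cleaning configuration; the embodiment does not define its format.
CLEANING_CONFIG = {
    "trim_whitespace": True,
    "null_tokens": {"", "NULL", "N/A"},
    "required_fields": ["id"],
}


def clean_rows(rows, config):
    """Normalize string fields and drop rows that lack a required field."""
    null_tokens = config.get("null_tokens", set())
    required = config.get("required_fields", [])
    cleaned = []
    for row in rows:
        out = {}
        for field, value in row.items():
            if isinstance(value, str):
                if config.get("trim_whitespace"):
                    value = value.strip()
                if value in null_tokens:
                    value = None  # treat placeholder text as a missing value
            out[field] = value
        # Keep the row only if every required field has a value.
        if all(out.get(field) is not None for field in required):
            cleaned.append(out)
    return cleaned


raw = [
    {"id": " 7 ", "name": "N/A"},
    {"id": "NULL", "name": "x"},
]
cleaned = clean_rows(raw, CLEANING_CONFIG)
print(cleaned)
```

Running the cleaning step before conversion, as the module description requires, means the conversion policies never see placeholder strings such as "NULL".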
Optionally, the apparatus further comprises:
and the data checking module is used for performing data checking on the data to be migrated or the target data based on a preset checking mode.
Optionally, the preset checking mode includes at least one of file checking, data number checking, data size checking, sampling checking, code value checking and primary key checking.
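Several of the checking modes can be sketched as simple comparisons between the data to be migrated and the target data. The embodiment names the modes but does not define them, so the interpretations below (row-count comparison, key uniqueness, allowed code values, random sample comparison) and all function names are assumptions. File checking and data size checking are omitted.

```python
import random


def count_check(source, target):
    """Data number checking: both sides hold the same number of rows."""
    return len(source) == len(target)


def primary_key_check(rows, key):
    """Primary key checking: keys are present and unique."""
    keys = [row.get(key) for row in rows]
    return None not in keys and len(keys) == len(set(keys))


def code_value_check(rows, field, allowed):
    """Code value checking: every value of the field is an allowed code."""
    return all(row[field] in allowed for row in rows)


def sampling_check(source, target, key, sample_size, seed=0):
    """Sampling checking: a random sample of source rows matches the target."""
    by_key = {row[key]: row for row in target}
    rng = random.Random(seed)  # fixed seed so a failed check can be reproduced
    sample = rng.sample(source, min(sample_size, len(source)))
    return all(by_key.get(row[key]) == row for row in sample)


source_rows = [
    {"id": 1, "currency": "CNY"},
    {"id": 2, "currency": "USD"},
    {"id": 3, "currency": "CNY"},
]
target_rows = [dict(row) for row in source_rows]

report = {
    "count": count_check(source_rows, target_rows),
    "primary_key": primary_key_check(target_rows, "id"),
    "code_value": code_value_check(target_rows, "currency", {"CNY", "USD"}),
    "sampling": sampling_check(source_rows, target_rows, "id", sample_size=2),
}
print(report)
```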
Optionally, the data migration module is configured to: migrating the target data from the Hive data warehouse to a target database based on a File Transfer Protocol (FTP); or,
the target database is a relational database, and correspondingly, the data migration module is configured to: migrating the target data from the Hive data warehouse to the relational database based on Java Database Connectivity (JDBC); or,
the target database is a non-relational database, and correspondingly, the data migration module is configured to: migrating the target data from the Hive data warehouse to a non-relational database based on a network authentication protocol Kerberos.
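For the relational-target option, writing target data in batches can be sketched with Python's built-in `sqlite3` module. This is a stand-in chosen so that the example runs without a server: the embodiment uses JDBC, not SQLite, and the function `write_batches`, the table `accounts` and the batch size are hypothetical.

```python
import sqlite3


def write_batches(conn, table, columns, rows, batch_size=500):
    """Insert rows in batches, committing each batch as one transaction.

    The table and column names are interpolated into the SQL text, so they
    must come from trusted configuration, never from user input.
    """
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    written = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # commits on success, rolls back if the batch fails
            conn.executemany(sql, batch)
        written += len(batch)
    return written


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, currency TEXT)")

target_data = [(1, "CNY"), (2, "USD"), (3, "CNY")]
written = write_batches(conn, "accounts", ["id", "currency"], target_data, batch_size=2)
stored = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
print(written, stored)
```

Committing per batch bounds the amount of work lost on failure, and comparing `written` with the count stored in the target is one simple form of the data number checking mentioned above.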
The device can execute the methods provided by all the embodiments of the invention, and has corresponding functional modules and beneficial effects for executing the methods. For technical details which are not described in detail in the embodiments of the present invention, reference may be made to the methods provided in all the aforementioned embodiments of the present invention.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a data migration method, the method including:
loading data to be migrated in a source database into a Hive data warehouse of the distributed big data migration platform;
in the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data;
and migrating the target data from the Hive data warehouse to a target database.
Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory, magnetic media (e.g., hard disk), or optical storage; registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in a first computer system in which the program is executed, or may be located in a different second computer system connected to the first computer system through a network (such as the internet). The second computer system may provide program instructions to the first computer system for execution. The term "storage medium" may include two or more storage media that may reside in different locations, such as in different computer systems that are connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) that are executable by one or more processors.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the data migration operation described above, and may also perform related operations in the data migration method provided by any embodiment of the present invention.
The embodiment of the invention provides a distributed big data migration platform, in which the data migration device provided by the embodiment of the invention can be integrated. Fig. 7 is a block diagram of a distributed big data migration platform according to an embodiment of the present invention. The distributed big data migration platform may include: a memory 701, a processor 702 and a computer program which is stored on the memory 701 and can be run by the processor 702, wherein the processor 702 implements the data migration method according to the embodiment of the invention when executing the computer program.
The distributed big data migration platform provided in the embodiment of the invention loads data to be migrated in a source database into a Hive data warehouse of the distributed big data migration platform; in the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data; and migrating the target data from the Hive data warehouse to a target database. Through the technical scheme provided by the embodiment of the invention, the data in the source database can be quickly and efficiently migrated to the target database, and the influence on system service in the data migration process is reduced.
The data migration device, the storage medium and the platform provided in the above embodiments may execute the data migration method provided in any embodiment of the present invention, and have corresponding functional modules and beneficial effects for executing the method. For technical details that are not described in detail in the above embodiments, reference may be made to a data migration method provided in any embodiment of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that those skilled in the art can make various obvious changes, rearrangements and substitutions without departing from the scope of the invention. Therefore, although the present invention has been described in some detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (15)

1. A data migration method is applied to a distributed big data migration platform and comprises the following steps:
loading data to be migrated in a source database into a Hive data warehouse of the distributed big data migration platform;
in the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data;
and migrating the target data from the Hive data warehouse to a target database.
2. The method of claim 1, wherein loading data to be migrated in a source database into a Hive data warehouse of the distributed big data migration platform comprises:
acquiring data to be migrated from a source database based on a File Transfer Protocol (FTP), and storing the data to be migrated into an HDFS file system of the distributed big data migration platform;
and loading the data to be migrated in the HDFS file system into a Hive data warehouse of the distributed big data migration platform.
3. The method according to claim 2, wherein the obtaining of the data to be migrated from the source database based on the file transfer protocol FTP and the storing of the data to be migrated into the HDFS file system of the distributed big data migration platform comprises:
acquiring data to be migrated from a source database based on a File Transfer Protocol (FTP), and synchronizing the data to be migrated to a local disk of the distributed big data migration platform;
and loading the data to be migrated from the local disk based on a data access component, and storing the data to be migrated into an HDFS file system of the distributed big data migration platform.
4. The method of claim 1, wherein the source database is a relational database;
correspondingly, loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform includes:
and loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform based on Java Database Connectivity (JDBC).
5. The method of claim 1, wherein the source database is a non-relational database;
correspondingly, loading the data to be migrated in the source database into the Hive data warehouse of the distributed big data migration platform includes:
and loading the data to be migrated in the source database to a Hive data warehouse of the distributed big data migration platform based on a network authentication protocol Kerberos.
6. The method of claim 1, wherein performing data transformation on the data to be migrated through a Spark engine to generate target data comprises:
and executing a data conversion strategy based on a Spark engine, and performing data conversion on the data to be migrated based on the data conversion strategy to generate target data.
7. The method of claim 6, wherein the data conversion policy comprises at least one of a data fragmentation policy, a data association policy, a redundant data discard policy, a differential data repair policy, and a missing data completion policy.
8. The method of claim 1, wherein, in the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data comprises:
and in the memory of the Yarn component of the Hive data warehouse, performing data conversion on the data to be migrated through a Spark engine to generate target data.
9. The method of claim 1, further comprising, before performing data conversion on the data to be migrated by a Spark engine:
and performing data cleaning on the data to be migrated based on the data cleaning configuration information.
10. The method of claim 1, further comprising:
and performing data checking on the data to be migrated or the target data based on a preset checking mode.
11. The method of claim 10, wherein the preset checking mode comprises at least one of file checking, data number checking, data size checking, sampling checking, code value checking, and primary key checking.
12. The method of claim 1, wherein migrating the target data from the Hive data warehouse to a target database comprises: migrating the target data from the Hive data warehouse to a target database based on a File Transfer Protocol (FTP); or,
the target database is a relational database, and correspondingly, the step of migrating the target data from the Hive data warehouse to the target database comprises the following steps: migrating the target data from the Hive data warehouse to the relational database based on Java Database Connectivity (JDBC); or,
the target database is a non-relational database, and correspondingly, the step of migrating the target data from the Hive data warehouse to the target database comprises the following steps: migrating the target data from the Hive data warehouse to a non-relational database based on a network authentication protocol Kerberos.
13. A data migration device is applied to a distributed big data migration platform and comprises:
the data loading module is used for loading data to be migrated in the source database into a Hive data warehouse of the distributed big data migration platform;
the data conversion module is used for performing data conversion on the data to be migrated through a Spark engine in the Hive data warehouse to generate target data;
and the data migration module is used for migrating the target data from the Hive data warehouse to a target database.
14. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processing means, is adapted to carry out a data migration method according to any one of claims 1 to 12.
15. A distributed big data migration platform comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the data migration method according to any one of claims 1 to 12 when executing the computer program.
CN202110321312.2A 2021-03-25 2021-03-25 Data migration method and device, storage medium and platform Pending CN113032368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110321312.2A CN113032368A (en) 2021-03-25 2021-03-25 Data migration method and device, storage medium and platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110321312.2A CN113032368A (en) 2021-03-25 2021-03-25 Data migration method and device, storage medium and platform

Publications (1)

Publication Number Publication Date
CN113032368A true CN113032368A (en) 2021-06-25

Family

ID=76473788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110321312.2A Pending CN113032368A (en) 2021-03-25 2021-03-25 Data migration method and device, storage medium and platform

Country Status (1)

Country Link
CN (1) CN113032368A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860349A (en) * 2022-07-06 2022-08-05 深圳华锐分布式技术股份有限公司 Data loading method, device, equipment and medium
CN114860349B (en) * 2022-07-06 2022-11-08 深圳华锐分布式技术股份有限公司 Data loading method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination