CN110502505A - Data migration method and device

Info

Publication number: CN110502505A
Application number: CN201910806491.1A
Authority: CN
Inventors: 苏新锋; 薛飞; 王会武; 赵焕芳; 王太宁; 吴洋
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank of China
Priority date: 2019-08-29
Filing date: 2019-08-29
Publication date: 2019-11-26

Abstract

The invention discloses a data migration method and device. The partition column of a migration table is set first, together with the degree of parallelism of the data migration task of the migration table. A preset hash algorithm is then used to hash-partition the data in the migration table, giving the data volume of each partition. The partition data skew is calculated from the data volume of each partition. When the partition data skew is greater than a skew threshold, hash partitioning is performed again; when the partition data skew is not greater than the skew threshold, the data migration task of the migration table is executed in parallel. By keeping the partition data skew no greater than the skew threshold, the invention distributes the partition data evenly, so that the data migration task executes in parallel with a balanced load, the data skew problem is avoided, and the efficiency and quality of Spark data migration are improved.

Description

A data migration method and device

Technical field

The present invention relates to the field of data migration technology, and more particularly to a data migration method and device.

Background art

With the rapid development of big data and artificial intelligence technologies, new technologies are gradually being applied across all sectors of society. Commercial banks are progressively adopting big data technology and deeply integrating new technology with their banking service strategies, laying the foundation for the development of financial technology. Big data has broad application prospects in financial fields such as anti-money laundering and anti-fraud. For a long time, however, the business data of banks has been stored mainly in relational databases. As the use of big data deepens, there is an urgent need to migrate data from relational databases to big data platforms quickly and stably.

The industry currently relies on data migration tools such as Sqoop and Spark. Sqoop is convenient, but because it is implemented on Map/Reduce, intermediate data must be written to disk, so data migration efficiency is low, and problems such as character-set transcoding between different databases must also be handled. The partitioned mode of Spark is more efficient, but when the values of the partition field are discrete, data skew tends to occur: a large amount of data is concentrated on one or a few machines for computation, which slows the entire data migration process and lowers data migration efficiency.

Summary of the invention

In view of this, the present invention provides a data migration method and device that avoid the data skew problem that arises when Spark is used for data migration.

To achieve the above object, the specific technical solution provided by the present invention is as follows:

A data migration method, comprising:

Setting a partition column of a migration table, and setting a degree of parallelism for the data migration task of the migration table;

Hash-partitioning the data in the migration table using a preset hash algorithm to obtain the data volume of each partition;

Calculating a partition data skew from the data volume of each partition;

When the partition data skew is greater than a skew threshold, returning to the step of hash-partitioning the data in the migration table using the preset hash algorithm;

When the partition data skew is not greater than the skew threshold, executing the data migration task of the migration table in parallel.

Optionally, setting the partition column of the migration table comprises:

Obtaining basic information of the migration table, and setting the partition column according to the basic information of the migration table.

Optionally, setting the degree of parallelism of the data migration task of the migration table comprises:

Setting the degree of parallelism of the data migration task of the migration table according to the total number of CPU cores of the compute nodes in the Spark cluster, wherein the degree of parallelism is a prime number and is less than the total number of CPU cores of the compute nodes in the Spark cluster.

Optionally, hash-partitioning the data in the migration table using the preset hash algorithm to obtain the data volume of each partition comprises:

Generating a first random number and a second random number;

For each piece of data in the migration table, executing the following loop iteration:

hash = hash * a + key.charAt(i);

a = a * b;

Wherein the initial value of hash is 0, i = {0, ..., len-1}, key is the partition column value of the data and len is its length, a denotes the first random number, b denotes the second random number, and key.charAt(i) denotes the numeric value corresponding to the i-th character of the partition column value of the data;

When the loop iteration ends, obtaining the final hash value of the data, and taking the final hash value modulo the degree of parallelism to obtain the hashed value of the data;

Determining the partition corresponding to the data according to the hashed value of the data.
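
Written as a recurrence, the loop iteration can be sketched as follows. This is only a restatement under one assumption the text does not spell out: a restarts from the first random number for every piece of data. The symbols h_i, a_i, c_i, P and p are notation introduced here for illustration, with c_i standing for key.charAt(i), a_0 for the first random number, b for the second random number, P for the degree of parallelism, and p for the resulting partition number.

```latex
\begin{aligned}
h_{-1} &= 0, \qquad a_i = a_0 \, b^{\,i}, \\
h_i &= h_{i-1} \, a_i + c_i, \qquad i = 0, \dots, len-1, \\
p &= h_{len-1} \bmod P .
\end{aligned}
```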

Optionally, calculating the partition data skew from the data volume of each partition comprises:

Determining the maximum data volume and the minimum data volume among the partitions;

Calculating the data volume difference between the maximum data volume and the minimum data volume;

Calculating the ratio of the data volume difference to the total data volume of the migration table to obtain the partition data skew.
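
As a worked illustration of this ratio, the three steps can be written as one expression. T_k is notation introduced here for the data volume of partition k, and the numbers in the example are made up for illustration; they do not come from the source.

```latex
d = \frac{\max_k T_k - \min_k T_k}{\sum_k T_k},
\qquad \text{e.g. } T = (30,\ 35,\ 35): \quad d = \frac{35 - 30}{100} = 0.05 .
```

With a skew threshold of, say, 0.1, this partitioning would be accepted because 0.05 is not greater than 0.1.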

A data migration device, comprising:

A setting unit, configured to set a partition column of a migration table and to set a degree of parallelism for the data migration task of the migration table;

A hash partitioning unit, configured to hash-partition the data in the migration table using a preset hash algorithm to obtain the data volume of each partition;

A skew calculation unit, configured to calculate a partition data skew from the data volume of each partition, to trigger the hash partitioning unit when the partition data skew is greater than a skew threshold, and to trigger a task execution unit when the partition data skew is not greater than the skew threshold;

The task execution unit, configured to execute the data migration task of the migration table in parallel.

Optionally, the setting unit includes:

A partition column setting subunit, configured to obtain basic information of the migration table and to set the partition column according to the basic information of the migration table.

Optionally, the setting unit includes:

A degree-of-parallelism setting subunit, configured to set the degree of parallelism of the data migration task of the migration table according to the total number of CPU cores of the compute nodes in the Spark cluster, wherein the degree of parallelism is a prime number and is less than the total number of CPU cores of the compute nodes in the Spark cluster.

Optionally, the hash partitioning unit is specifically configured to:

Generate a first random number and a second random number;

For each piece of data in the migration table, execute the following loop iteration:

hash = hash * a + key.charAt(i);

a = a * b;

Optionally, the skew calculation unit is specifically configured to:

Compared with the prior art, the beneficial effects of the present invention are as follows:

In the data migration method disclosed by the present invention, the partition column of the migration table is set first, together with the degree of parallelism of the data migration task of the migration table. A preset hash algorithm is then used to hash-partition the data in the migration table, giving the data volume of each partition, and the partition data skew is calculated from the data volume of each partition. When the partition data skew is greater than the skew threshold, hash partitioning is performed again; when the partition data skew is not greater than the skew threshold, the data migration task of the migration table is executed in parallel. By keeping the partition data skew no greater than the skew threshold during data migration, the partition data is distributed evenly, so that the data migration task executes in parallel with a balanced load, the data skew problem is avoided, and the efficiency and quality of Spark data migration are improved.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are merely embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a flow diagram of a data migration method disclosed in an embodiment of the present invention;

Fig. 2 is a flow diagram of a hash partitioning method disclosed in an embodiment of the present invention;

Fig. 3 is a structural diagram of a data migration device disclosed in an embodiment of the present invention.

Specific embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Evidently, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

This embodiment discloses a data migration method applied to data migration scenarios based on the Spark technical framework. Specifically, referring to Fig. 1, the data migration method disclosed in this embodiment includes the following steps:

S101: set the partition column of the migration table, and set the degree of parallelism of the data migration task of the migration table;

The migration table is the database table whose data needs to be migrated.

Before data migration, the basic information of the migration table must first be obtained, such as the database driver, address, user name, password, database name, and migration table. The data migration task of the migration table can be generated from this basic information.

The partition column is set according to the basic information of the migration table. The partition column is a column of the migration table with few repeated values, which makes it suitable for hash partitioning; for example, a primary key column or a unique key column may be used as the partition column of the migration table.

The degree of parallelism is the number of jobs executed in parallel while Spark migrates the data.

Specifically, the degree of parallelism of the data migration task of the migration table is set according to the total number of CPU cores of the compute nodes in the Spark cluster, where the degree of parallelism is a prime number smaller than that total. Because the degree of parallelism is prime, it has no divisors other than 1 and itself, which ensures that the jobs of the data migration task can be distributed evenly across the CPU cores in the Spark cluster.
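
The rule for the degree of parallelism can be illustrated with a minimal Python sketch. It assumes the total CPU core count is already known and that the largest qualifying prime is chosen; the text only requires a prime smaller than the core total, so that selection policy and the function names `is_prime` and `choose_parallelism` are illustrative assumptions.

```python
def is_prime(n: int) -> bool:
    """Return True if n is a prime number (simple trial division)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def choose_parallelism(total_cores: int) -> int:
    """Pick the largest prime strictly below total_cores.

    Assumption: "largest" is a policy chosen for this sketch; the source
    only states that the value is prime and less than the core total.
    """
    for candidate in range(total_cores - 1, 1, -1):
        if is_prime(candidate):
            return candidate
    raise ValueError("no prime is smaller than total_cores")
```

Under this policy a cluster with 32 cores in total would get a degree of parallelism of 31, and one with 24 cores would get 23.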

S102: hash-partition the data in the migration table using the preset hash algorithm to obtain the data volume of each partition;

First, a first random number and a second random number are generated; the first random number may be a 5-digit number and the second random number may be a 6-digit number. The hash partitioning method shown in Fig. 2 is then executed for each piece of data in the migration table, and specifically includes the following steps:

S201: calculate hash = hash * a + key.charAt(i);

Where the initial value of hash is 0 and i = {0, ..., len-1}, that is, the initial value of i is 0; key is the partition column value of the target data and len is its length; a denotes the first random number; b denotes the second random number; and key.charAt(i) denotes the numeric value corresponding to the i-th character of the partition column value of the data.

The target data is the data in the migration table on which the hash partitioning method is currently being executed.

Taking name as the partition column and a target data whose partition column value is Zhang San as an example, the string corresponding to Zhang San is zhangsan; that is, key is zhangsan and len is 8. When i = 0, the character is z, and key.charAt(i) is 0x7a (the character code of lowercase z; 0x5a is the code of uppercase Z).

S202: determine whether i is equal to len-1;

If not, execute S203: calculate i = i + 1 and a = a * b, and return to S201;

If so, execute S204: obtain the final hash value of the target data;

S205: take the final hash value modulo the degree of parallelism to obtain the hashed value of the target data;

S206: determine the partition corresponding to the target data according to the hashed value of the target data.

Because the hashed value of the target data is obtained by taking the final hash value of the target data modulo the degree of parallelism, the hashed value of the target data is an integer in [0, Pd), where Pd is the degree of parallelism. A hashed value of 0 corresponds to the partition numbered 0, a hashed value of 1 corresponds to the partition numbered 1, and so on, which gives the partition corresponding to each piece of data.
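
Steps S201 to S206 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: `ord(ch)` stands in for key.charAt(i), the function name `hash_partition` is hypothetical, and the multiplier is assumed to restart from the first random number for every piece of data. Python integers do not overflow; an implementation on a fixed-width integer type would also have to deal with overflow and negative remainders, which the text does not discuss.

```python
def hash_partition(key: str, a: int, b: int, parallelism: int) -> int:
    """Map a partition column value to a partition number in [0, parallelism)."""
    h = 0                      # initial value of hash is 0
    last = len(key) - 1
    for i, ch in enumerate(key):
        h = h * a + ord(ch)    # S201: hash = hash * a + key.charAt(i)
        if i != last:          # S202: is i equal to len-1?
            a = a * b          # S203: a = a * b, then back to S201
    return h % parallelism     # S205: final hash value modulo the parallelism
```

For a two-character key such as `"ab"` with a = 2, b = 3 and a degree of parallelism of 7, the trace is hash = 97, then a = 6, then hash = 97 * 6 + 98 = 680, and 680 mod 7 = 1, so the data goes to partition 1.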

It should be noted that the hash partitioning method disclosed in this embodiment uses the key.charAt(i) function to map character-type columns to numeric values, so that any character string in the migration table is mapped to a number within a fixed range. As a result, when Spark is used to migrate relational database tables in parallel, the user does not need to supply a numeric field as the partition column. Support is thereby provided for non-numeric partition columns, which widens the range of use of Spark for parallel migration of relational database tables.

S103: calculate the partition data skew from the data volume of each partition;

Specifically, the data skew d = (MAX(T) - MIN(T)) / SUM(T);

Where MAX(T) is the maximum data volume among the partitions, MIN(T) is the minimum data volume among the partitions, and SUM(T) is the total data volume of the migration table.
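
The formula for d translates directly into a few lines of Python. The function name `partition_skew` and the handling of an empty table are assumptions made for this sketch; the text does not say what happens when SUM(T) is zero.

```python
from typing import Sequence


def partition_skew(partition_sizes: Sequence[int]) -> float:
    """Compute d = (MAX(T) - MIN(T)) / SUM(T) over the per-partition data volumes."""
    total = sum(partition_sizes)
    if total == 0:
        return 0.0  # assumption: an empty table is treated as not skewed
    return (max(partition_sizes) - min(partition_sizes)) / total
```

A perfectly even split gives 0.0, while placing every row in a single partition gives 1.0, so the skew threshold is a value between those two extremes.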

S104: determine whether the partition data skew is greater than the skew threshold;

If so, return to S102; that is, regenerate the first random number and the second random number, and perform hash partitioning again.

If not, execute S105: execute the data migration task of the migration table in parallel.

When the data skew is not greater than the skew threshold, the data migration task of the migration table is submitted to Spark. Spark divides the data migration task of the migration table into multiple data migration jobs, whose number equals the number of partitions and the degree of parallelism. Each CPU core of a node in the Spark cluster can be assigned at most one data migration job, and the CPU cores that have been assigned data migration jobs execute them in parallel.

It can be seen that, in the data migration method disclosed in this embodiment, the partition column of the migration table is set first, together with the degree of parallelism of the data migration task of the migration table. A preset hash algorithm is then used to hash-partition the data in the migration table, giving the data volume of each partition, and the partition data skew is calculated from the data volume of each partition. When the partition data skew is greater than the skew threshold, hash partitioning is performed again; when the partition data skew is not greater than the skew threshold, the data migration task of the migration table is executed in parallel. By keeping the partition data skew no greater than the skew threshold during data migration, the partition data is distributed evenly, so that the data migration task executes in parallel with a balanced load, the data skew problem is avoided, and the efficiency and quality of Spark data migration are improved.
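
The control flow of S102 to S104 can be sketched end to end in Python, operating on an in-memory list of partition column values instead of a real table. Several details are assumptions made for illustration: the function name `find_balanced_partitioning`, the `max_attempts` cap (the loop in the text has no stated limit), counting empty partitions as zero when taking the minimum, and restarting the multiplier from the first random number for each piece of data. The actual submission to Spark in S105 is outside the sketch.

```python
import random
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def find_balanced_partitioning(
    keys: Iterable[str],
    parallelism: int,
    skew_threshold: float,
    max_attempts: int = 100,   # assumption: the source loop is unbounded
    rng: Optional[random.Random] = None,
) -> Optional[Tuple[int, int, List[int]]]:
    """Retry hash partitioning with fresh random numbers until the skew is acceptable.

    Returns (a, b, partition_sizes) on success, or None if every attempt failed.
    """
    rng = rng or random.Random()
    rows = list(keys)
    if not rows:
        raise ValueError("no rows to partition")
    for _ in range(max_attempts):
        a = rng.randint(10_000, 99_999)      # first random number (5 digits)
        b = rng.randint(100_000, 999_999)    # second random number (6 digits)
        sizes = Counter()
        for key in rows:                     # S102: hash-partition every row
            h, mult = 0, a                   # multiplier restarts at a per row
            for i, ch in enumerate(key):
                h = h * mult + ord(ch)
                if i != len(key) - 1:
                    mult *= b
            sizes[h % parallelism] += 1
        # Partitions that received no rows count as 0 (assumption).
        counts = [sizes[p] for p in range(parallelism)]
        skew = (max(counts) - min(counts)) / len(rows)   # S103
        if skew <= skew_threshold:           # S104: not greater than the threshold
            return a, b, counts              # S105 would run the migration here
    return None
```

Returning the accepted pair of random numbers lets the caller reuse exactly the same mapping when the partitions are later read in parallel.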

Based on the data migration method disclosed in the above embodiment, this embodiment correspondingly discloses a data migration device. Referring to Fig. 3, the device includes:

A setting unit 301, configured to set a partition column of a migration table and to set a degree of parallelism for the data migration task of the migration table;

A hash partitioning unit 302, configured to hash-partition the data in the migration table using a preset hash algorithm to obtain the data volume of each partition;

A skew calculation unit 303, configured to calculate a partition data skew from the data volume of each partition, to trigger the hash partitioning unit 302 when the partition data skew is greater than a skew threshold, and to trigger a task execution unit 304 when the partition data skew is not greater than the skew threshold;

The task execution unit 304, configured to execute the data migration task of the migration table in parallel.

Optionally, the setting unit 301 includes:

Optionally, the hash partitioning unit 302 is specifically configured to:

Generate a first random number and a second random number;

For each piece of data in the migration table, execute the following loop iteration:

hash = hash * a + key.charAt(i);

a = a * b;

Determine the partition corresponding to the data according to the hashed value of the data.

Optionally, the skew calculation unit 303 is specifically configured to:

The data migration device disclosed in this embodiment keeps the partition data skew no greater than the skew threshold during data migration, so that the partition data is distributed evenly, the data migration task executes in parallel with a balanced load, the data skew problem is avoided, and the efficiency and quality of Spark data migration are improved.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data migration method, characterized by comprising:

Hash-partitioning the data in the migration table using a preset hash algorithm to obtain the data volume of each partition;

When the partition data skew is greater than a skew threshold, returning to the step of hash-partitioning the data in the migration table using the preset hash algorithm;

When the partition data skew is not greater than the skew threshold, executing the data migration task of the migration table in parallel.

2. The method according to claim 1, characterized in that setting the partition column of the migration table comprises:

3. The method according to claim 1, characterized in that setting the degree of parallelism of the data migration task of the migration table comprises:

Setting the degree of parallelism of the data migration task of the migration table according to the total number of CPU cores of the compute nodes in the Spark cluster, wherein the degree of parallelism is a prime number and is less than the total number of CPU cores of the compute nodes in the Spark cluster.

4. The method according to claim 1, characterized in that hash-partitioning the data in the migration table using the preset hash algorithm to obtain the data volume of each partition comprises:

Generating a first random number and a second random number;

For each piece of data in the migration table, executing the following loop iteration:

hash = hash * a + key.charAt(i);

a = a * b;

Wherein the initial value of hash is 0, i = {0, ..., len-1}, key is the partition column value of the data and len is its length, a denotes the first random number, b denotes the second random number, and key.charAt(i) denotes the numeric value corresponding to the i-th character of the partition column value of the data;

When the loop iteration ends, obtaining the final hash value of the data, and taking the final hash value modulo the degree of parallelism to obtain the hashed value of the data;

5. The method according to claim 1, characterized in that calculating the partition data skew from the data volume of each partition comprises:

Calculating the ratio of the data volume difference to the total data volume of the migration table to obtain the partition data skew.

6. A data migration device, characterized by comprising:

A setting unit, configured to set a partition column of a migration table and to set a degree of parallelism for the data migration task of the migration table;

A hash partitioning unit, configured to hash-partition the data in the migration table using a preset hash algorithm to obtain the data volume of each partition;

A skew calculation unit, configured to calculate a partition data skew from the data volume of each partition, to trigger the hash partitioning unit when the partition data skew is greater than a skew threshold, and to trigger a task execution unit when the partition data skew is not greater than the skew threshold;

7. The device according to claim 6, characterized in that the setting unit includes:

A partition column setting subunit, configured to obtain basic information of the migration table and to set the partition column according to the basic information of the migration table.

8. The device according to claim 6, characterized in that the setting unit includes:

A degree-of-parallelism setting subunit, configured to set the degree of parallelism of the data migration task of the migration table according to the total number of CPU cores of the compute nodes in the Spark cluster, wherein the degree of parallelism is a prime number and is less than the total number of CPU cores of the compute nodes in the Spark cluster.

9. The device according to claim 6, characterized in that the hash partitioning unit is specifically configured to:

Generate a first random number and a second random number;

For each piece of data in the migration table, execute the following loop iteration:

hash = hash * a + key.charAt(i);

a = a * b;

10. The device according to claim 6, characterized in that the skew calculation unit is specifically configured to: