CN103761167B - Method and apparatus for realizing data center backup

Info

Publication number: CN103761167B
Application number: CN201410032550.1A
Authority: CN
Inventors: 刘璧怡; 邓强; 吴楠; 邓鹏飞; 宗栋瑞
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2014-01-23
Filing date: 2014-01-23
Publication date: 2017-04-05
Anticipated expiration: 2034-01-23
Also published as: CN103761167A

Abstract

The present invention proposes a method and apparatus for realizing data center backup, including: determining the storage range of a data block in the target data center according to the table name and the column family name of the data block to be backed up; and selecting one data node within the determined storage range to store the data block. The invention solves the problem that data blocks with the same column family name are stored in a scattered manner when HBase data is backed up across data centers, so that data blocks with the same column family name backed up to the target data center are stored more centrally, thereby improving the reading speed.

Description

Method and device for realizing data center backup

Technical Field

The invention relates to the field of big data, and in particular to an HBase-based backup method and backup device for a data center.

Background

HBase usually stores its data in the Hadoop Distributed File System (HDFS). Data stored in an original data center usually needs to be backed up, and HDFS keeps three copies by default: two copies on two different data nodes belonging to the same rack, and the third copy on a data node belonging to a different rack. In addition, so that work can continue normally when the data center fails, the data center itself needs to be backed up.

The existing method for realizing data center backup comprises the following steps:

acquiring a data block to be backed up in the original data center; randomly selecting one data node in the target data center to back up the data block; and then selecting two further data nodes for backup according to the existing backup method.

In this method, one data node of the target data center is selected at random when the data center is backed up. HBase, however, stores data by column family: in the original data center, the data blocks belonging to the same column family of a table are stored together in the same data node or in several adjacent data nodes. When data is read, all data nodes holding the column family must be located according to the name of the column family to which the data belongs, and after a random backup these data nodes may be spread across all data nodes of the target data center. When the method is used for HBase cross-data-center backup, it therefore cannot make full use of column family storage, and data blocks with the same column family name are stored in a scattered, discontinuous manner in the target data center, which makes reading slow.
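The scattering described above can be illustrated with a minimal sketch. The node names, block identifiers, and the `random_placement` helper are hypothetical and introduced only for illustration; the sketch models only the random choice of the first data node in the target data center.

```python
import random


def random_placement(block_ids, data_nodes, rng):
    """Prior-art style placement: each block goes to a uniformly random node."""
    return {block: rng.choice(data_nodes) for block in block_ids}


# Hypothetical target data center with 12 data nodes.
nodes = [f"node-{i}" for i in range(12)]
# Hypothetical blocks that all belong to the same table and column family.
blocks = [f"orders/cf1/block-{i}" for i in range(50)]

placement = random_placement(blocks, nodes, random.Random(0))

# A read of this column family has to visit every node that holds one of its blocks.
nodes_to_scan = set(placement.values())
print(len(nodes_to_scan), "of", len(nodes), "nodes hold blocks of this column family")
```

Because every block is placed independently, the number of nodes to scan tends to grow toward the size of the whole data center as more blocks are backed up.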

Disclosure of Invention

In order to solve the above technical problem, the invention provides a backup method and a backup device for a data center that make full use of the characteristics of column family storage, so that data blocks with the same column family name backed up to the target data center are stored more centrally, thereby improving the reading speed.

In order to achieve the above object, the present invention provides a method for implementing data center backup, including:

determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up;

and selecting one data node in the determined storage range to store the data block.

Preferably, the determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up includes:

determining, according to the table name, the range of racks in which the data node storing the data block is located;

determining a physical address range of the data node according to the column family name;

the selecting one of the data nodes from the storage range to store the data block comprises:

selecting one rack from the range of racks, and selecting one physical address from the range of physical addresses.

Preferably, the determining, according to the table name, the range of racks in which the data node storing the data block is located includes:

calculating the hash value of the table name, and calculating the hash value of each rack in the target data center;

determining the range of rack hash values as follows: the absolute value of the difference between the hash value of a rack and the hash value of the table name is less than or equal to a preset proportion of the maximum hash value of all racks in the target data center;

wherein the maximum hash value of all table names in the original data center is equal to the maximum hash value of all racks in the target data center.

Preferably, the selecting one of the data nodes from the storage range to store the data block includes:

randomly selecting one rack from the range of racks, or selecting the rack whose hash value has the smallest absolute difference from the hash value of the table name.

Preferably, the determining the physical address range of the data node according to the column family name includes:

calculating the hash value of the column family name, and calculating the hash value of the physical address of each data node in the selected rack;

determining the range of physical address hash values as follows: the absolute value of the difference between the hash value of a physical address and the hash value of the column family name is less than or equal to a preset proportion of the maximum hash value of the physical addresses of all data nodes in the selected rack; wherein the maximum hash value of all column family names in the table corresponding to the table name is equal to the maximum hash value of the physical addresses of all data nodes in the selected rack;

and randomly selecting, from the physical address range, the data node corresponding to one physical address, or selecting the data node whose physical address hash value has the smallest absolute difference from the hash value of the column family name.

Preferably, the maximum value is 2π.

The invention also provides a device for realizing the data center backup, which at least comprises:

the determining module is used for determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up;

and the selection module is used for selecting one data node from the determined storage range to store the data block.

Preferably, the determining module is specifically configured to:

determining, according to the table name, the range of racks in which the data node storing the data block is located; determining a physical address range of the data node according to the column family name;

the selection module is specifically configured to:

Preferably, the determining module is specifically configured to:

Compared with the prior art, the invention comprises: determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up; and selecting one data node in the determined storage range to store the data block. This technical scheme makes full use of the characteristics of column family storage and solves the problem that data blocks with the same column family name are stored in a scattered manner when HBase data is backed up across data centers, so that data blocks with the same column family name backed up to the target data center are stored more centrally, thereby improving the reading speed.

Drawings

The accompanying drawings in the embodiments of the present invention are described below, and the drawings in the embodiments are provided for further understanding of the present invention, and together with the description serve to explain the present invention without limiting the scope of the present invention.

FIG. 1 is a flow chart of a method of implementing a data center backup in accordance with the present invention;

FIG. 2 is a flow chart of an embodiment of a method of implementing data center backup in accordance with the present invention;

FIG. 3 is a schematic structural diagram of an apparatus for implementing data center backup according to the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings in order to facilitate understanding by those skilled in the art; the description is not intended to limit the scope of the present invention.

Referring to FIG. 1, the present invention provides a method for implementing data center backup, including:

Step 100, determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up.

In this step, the range of racks in which the data node storing the data block is located can be determined according to the table name, and the physical address range of the data node can be determined according to the column family name.

Determining, according to the table name, the range of racks in which the data node storing the data block is located comprises the following steps:

calculating the hash value of the table name, and calculating the hash value of each rack in the target data center; determining the range of rack hash values as follows: the absolute value of the difference between the hash value of a rack and the hash value of the table name is less than or equal to a preset proportion of the maximum hash value of all racks in the target data center; wherein the maximum hash value of all table names in the original data center is equal to the maximum hash value of all racks in the target data center.

Determining the physical address range of the data node according to the column family name comprises:

calculating the hash value of the column family name, and calculating the hash value of the physical address of each data node in the selected rack; determining the range of physical address hash values as follows: the absolute value of the difference between the hash value of a physical address and the hash value of the column family name is less than or equal to a preset proportion of the maximum hash value of the physical addresses of all data nodes in the selected rack; wherein the maximum hash value of all column family names in the table corresponding to the table name is equal to the maximum hash value of the physical addresses of all data nodes in the selected rack.

The maximum value may be, but is not limited to, 2π. The maximum hash value of all table names and the maximum hash value of all column family names in the table corresponding to the table name may be equal or unequal.
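The two range determinations of step 100 can be sketched as one helper applied twice, first to racks and then to the physical addresses inside a rack. This is a minimal sketch under assumptions: the text does not name a hash algorithm, so SHA-256 scaled into [0, 2π) is an illustrative choice, and `angle_hash`, `hash_range`, the rack names, the addresses, and the 0.25 proportion are all hypothetical.

```python
import hashlib
import math

TWO_PI = 2 * math.pi  # assumed shared maximum of every hash space


def angle_hash(name: str, maximum: float = TWO_PI) -> float:
    """Map a string into [0, maximum); SHA-256 is an assumed, illustrative choice."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48 * maximum


def hash_range(key: str, candidates: list, proportion: float,
               maximum: float = TWO_PI) -> list:
    """Candidates whose hash lies within proportion * maximum of hash(key)."""
    target = angle_hash(key, maximum)
    limit = proportion * maximum
    return [c for c in candidates
            if abs(angle_hash(c, maximum) - target) <= limit]


# Hypothetical target data center: rack name -> physical addresses of its data nodes.
target_dc = {
    f"rack-{r}": [f"10.0.{r}.{n}" for n in range(1, 9)]
    for r in range(8)
}

# Rack range from the table name, then address range from the column family name.
racks_in_range = hash_range("orders", list(target_dc), proportion=0.25)
addresses_in_range = (
    hash_range("cf_payment", target_dc[racks_in_range[0]], proportion=0.25)
    if racks_in_range else []
)
```

A larger proportion widens both ranges, trading locality for more candidate nodes; a proportion of 1.0 admits every rack or address.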

Step 101, selecting one data node in the determined storage range to store the data block.

In this step, one rack may be randomly selected from the range of racks, and one physical address may be randomly selected from the range of physical addresses. After the data block is stored, two other data nodes are selected for backup according to the existing backup method.

Alternatively, in this step, the rack whose hash value has the smallest absolute difference from the hash value of the table name may be selected from the range of racks, and the data node whose physical address hash value has the smallest absolute difference from the hash value of the column family name may be selected from the range of physical addresses.

The invention stores data blocks with the same table name and the same column family name in the same storage range, so that the data blocks with the same column family name are stored more centrally, thereby improving the reading speed.
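The two selection options of step 101 (random choice, or smallest absolute hash difference) can be sketched as follows. The helper names and the hash values are hypothetical; the values stand for racks already found to be in range.

```python
import random


def pick_random(in_range, rng):
    """Option 1: uniform random choice among the in-range candidates."""
    return rng.choice(in_range)


def pick_closest(hashes, key_hash):
    """Option 2: the candidate whose hash has the smallest absolute
    difference from key_hash; hashes maps candidate name -> hash value."""
    return min(hashes, key=lambda name: abs(hashes[name] - key_hash))


# Hypothetical hash values (in radians) of the racks that are in range.
rack_hashes = {"rack-a": 1.10, "rack-b": 1.45, "rack-c": 1.62}
table_hash = 1.50  # hypothetical hash value of the table name

closest_rack = pick_closest(rack_hashes, table_hash)
random_rack = pick_random(list(rack_hashes), random.Random(7))
```

The same two helpers apply unchanged to physical addresses, with the column family hash in place of the table hash. The closest-hash option is deterministic, so repeated backups of the same table and column family land on the same node, while the random option spreads load across the range.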

How to select a data node for backup is described below by a specific embodiment.

Step 200, obtaining the table name and the column family name of the data block to be backed up.

In this step, how to obtain the name of the table and the name of the column family to which the data block to be backed up belongs is prior art, and is not intended to limit the protection scope of the present invention. For example, when a data block is updated, the rack can record the table name and the column family name of the updated data block, so that when backup is performed, the table name and the column family name of the data block to be backed up can be acquired directly.
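The record-on-update example can be sketched as a small in-memory log. The `UpdateLog` class and its method names are hypothetical and are not part of HBase or HDFS; a real deployment would obtain this metadata from its own bookkeeping.

```python
class UpdateLog:
    """Hypothetical record of the (table name, column family name) of each
    updated data block, kept until the next backup run."""

    def __init__(self):
        self._pending = {}

    def record_update(self, block_id, table_name, column_family):
        """Remember which table and column family an updated block belongs to."""
        self._pending[block_id] = (table_name, column_family)

    def drain(self):
        """Return the blocks waiting for backup and clear the log."""
        pending, self._pending = self._pending, {}
        return pending


log = UpdateLog()
log.record_update("blk_001", "orders", "cf_payment")
log.record_update("blk_002", "orders", "cf_items")
to_back_up = log.drain()  # block id -> (table name, column family name)
```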

Step 201, establishing a spherical coordinate system, and mapping the obtained table name and column family name of the data block onto a spherical surface with a fixed radius in the spherical coordinate system.

In this step, a point on the spherical surface is represented by two coordinate values: the angle θ between the Z axis and the line joining the point to the coordinate center, and the angle φ between the X axis and the projection of that line onto the XoY plane.

In this step, points on the spherical surface are used to represent the position of the data block in the original data center; that is, the name of the table in which the data block is located is taken as φ, and the name of the column family is taken as θ.

Here, φ may be the hash value of the table name and θ may be the hash value of the column family name. Each hash value can be calculated with an existing method, for example an existing hash algorithm; the specific implementation is not intended to limit the protection scope of the present invention and is not described here again.
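The mapping of step 201 can be sketched as follows, writing the azimuth (derived from the table name) as φ and the polar angle (derived from the column family name) as θ. This is a sketch under assumptions: SHA-256 is an illustrative hash, and scaling θ into [0, π) so that it is a valid angle from the Z axis is a choice made here, not something the text prescribes. The function names are hypothetical.

```python
import hashlib
import math


def unit_hash(name: str) -> float:
    """Map a string into [0, 1); SHA-256 is an assumed, illustrative choice."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def to_sphere(table_name: str, column_family: str):
    """Return (phi, theta): phi in [0, 2*pi) from the table name,
    theta in [0, pi) from the column family name (assumed scaling)."""
    phi = unit_hash(table_name) * 2 * math.pi
    theta = unit_hash(column_family) * math.pi
    return phi, theta


def to_cartesian(phi: float, theta: float, radius: float = 1.0):
    """Convert spherical coordinates to (x, y, z) on a sphere of fixed radius."""
    return (radius * math.sin(theta) * math.cos(phi),
            radius * math.sin(theta) * math.sin(phi),
            radius * math.cos(theta))


phi, theta = to_sphere("orders", "cf_payment")
x, y, z = to_cartesian(phi, theta)
```

Blocks of the same table share φ and blocks of the same column family in that table share both coordinates, so they map to the same point on the sphere.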

Step 202, taking the line joining the point mapped on the spherical surface to the coordinate center as an axis, rotating a straight line that passes through the coordinate center and forms a preset angle with the axis once around the axis, and taking the resulting spherical area as the storage area of the data block corresponding to the point.

In this step, let P be the point to which the data block is mapped on the spherical surface, and take the line OP joining P to the coordinate center O as the axis. The spherical area obtained by rotating, once around OP, a straight line that passes through O and forms an angle γ with OP is taken as the storage area corresponding to (hash value of the table name, hash value of the column family name). Projecting this conical region onto the YoZ plane gives the range θ′ ∈ [θ − γ, θ + γ]; projecting the conical region onto the XoY plane yields formula (1):

which gives the range of φ′, as shown in formula (2):

where b is the distance to OP from the intersection point of the straight line that passes through O and forms an angle γ with OP, r is the length of OP, and γ′ is the angle, in the XoY plane, between the projection of that straight line and the projection of OP.

The ratio of the area of the spherical surface truncated by the conical region to the total area of the spherical surface is shown in equation (3):

The value of γ′ is obtained as shown in formula (4):

The region that satisfies both the range of θ′ and the range of φ′ is the storage area of the data block.
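For orientation, the standard spherical-cap relations behind this construction can be written out as follows. This is a sketch under stated assumptions and is not a reproduction of formulas (1) to (4): the point Q, the preset area ratio k, and the reading of b as the distance from Q to the axis OP are introduced here for illustration.

```latex
% Assumptions: sphere of radius r = |OP|; Q is the point where the line through O
% at angle \gamma to OP meets the sphere; k is the preset ratio of cap area to sphere area.
\begin{align*}
b &= r\sin\gamma
  && \text{(distance from } Q \text{ to the axis } OP\text{)}\\
S_{\text{cap}} &= 2\pi r^{2}\,(1-\cos\gamma)
  && \text{(area of the spherical cap cut off by the cone)}\\
\frac{S_{\text{cap}}}{4\pi r^{2}} &= \frac{1-\cos\gamma}{2} = \sin^{2}\frac{\gamma}{2}
  && \text{(ratio to the total sphere area)}\\
\gamma &= \arccos(1-2k)
  && \text{(cone angle for a preset ratio } k \in [0,1]\text{)}
\end{align*}
```

For example, k = 1/2 gives γ = π/2 (a hemisphere), and k = 1 gives γ = π (the whole sphere).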

Step 203, mapping the racks and the physical addresses corresponding to all data nodes in the target data center onto the spherical surface with the fixed radius in the spherical coordinate system.

In this step, points on the spherical surface are used to represent the positions of the data nodes in the target data center; that is, the rack is taken as φ and the physical address is taken as θ.

Here, φ may be the hash value of the rack and θ may be the hash value of the physical address. How to calculate each hash value is prior art and is not intended to limit the scope of the present invention.

Step 204, selecting the data node corresponding to one of the points in the storage area of the data block as the storage location of the data block.

In this step, a data node corresponding to a point closest to a point corresponding to the data block may be selected as a storage location of the data block.
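The closest-point selection of step 204 can be sketched as follows. This is a minimal sketch under assumptions: it tests membership in the storage area with the exact central angle (at most γ from the block's point) rather than with the projected θ′ and φ′ ranges, and the function names, node names, and coordinates are hypothetical.

```python
import math


def angular_distance(p, q):
    """Central angle between two points given as (phi, theta) on a sphere,
    where theta is measured from the Z axis and phi from the X axis."""
    (phi1, theta1), (phi2, theta2) = p, q
    cos_angle = (math.cos(theta1) * math.cos(theta2)
                 + math.sin(theta1) * math.sin(theta2) * math.cos(phi1 - phi2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def choose_node(block_point, node_points, gamma):
    """Closest node whose point lies within angle gamma of the block's point,
    or None when no node falls inside the storage area."""
    distances = {n: angular_distance(block_point, p) for n, p in node_points.items()}
    in_area = {n: d for n, d in distances.items() if d <= gamma}
    return min(in_area, key=in_area.get) if in_area else None


# Hypothetical (phi, theta) points of data nodes in the target data center.
node_points = {
    "node-1": (1.00, 1.20),
    "node-2": (1.05, 1.22),
    "node-3": (4.00, 0.30),
}
block_point = (1.04, 1.21)  # hypothetical point of the data block

chosen = choose_node(block_point, node_points, gamma=0.2)
```

When no node falls inside the area, the sketch returns `None`; how to handle that case (for example, by widening γ) is left open here.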

Referring to FIG. 3, the present invention further provides an apparatus for implementing data center backup, which at least includes:

In the apparatus of the present invention, the determining module is specifically configured to:

determining, according to the table name, the range of racks in which the data node storing the data block is located; determining a physical address range of the data node according to the column family name;

a selection module specifically configured to:

one rack is selected from a range of racks and one physical address is selected from a range of physical addresses.

In the backup apparatus of the present invention, the determining module is specifically configured to:

It should be noted that the above-mentioned embodiments are only for facilitating the understanding of those skilled in the art, and are not intended to limit the scope of the present invention, and any obvious substitutions, modifications, etc. made by those skilled in the art without departing from the inventive concept of the present invention are within the scope of the present invention.

Claims

1. A method for implementing data center backup is characterized by comprising the following steps:

determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up,

the method specifically comprises the following steps:

determining the range of the rack where the data node stored in the data block is located according to the table name,

the method specifically comprises the following steps:

the maximum value of the hash values of all table names in the original data center is equal to the maximum value of the hash values of all racks in the target data center; and selecting one data node in the determined storage range to store the data block.

2. The method of claim 1, wherein determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up further comprises:

3. The method of claim 1, wherein selecting one of the data nodes from the storage range to store the data block comprises:

4. The method of claim 1, wherein determining the physical address range of the data node according to the column family name comprises:

5. The method of claim 4, wherein selecting one of the data nodes from the storage range to store the data block comprises:

6. A method according to any one of claims 3 to 5, wherein the maximum value is 2π.

7. An apparatus for implementing data center backup, comprising at least:

the determining module is specifically configured to:

the method is specifically used for:

the maximum value of the hash values of all table names in the original data center is equal to the maximum value of the hash values of all racks in the target data center; and the selection module is used for selecting one data node from the determined storage range to store the data block.

8. The apparatus of claim 7, wherein the determining module is specifically configured to:

the selection module is specifically configured to:

9. The apparatus of claim 7, wherein the determining module is specifically configured to: