CN103761167B - A kind of method and apparatus for realizing data center backup - Google Patents
- Publication number
- CN103761167B (application CN201410032550.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Storage Device Security (AREA)
Abstract
The present invention proposes a method and apparatus for implementing data center backup, including: determining the storage range of a data block in the target data center according to the table name and the column family name of the data block to be backed up; and selecting one data node within the determined storage range to store the data block. The invention solves the problem that data blocks with the same column family name are stored in a scattered manner when HBase data is backed up across data centers, so that data blocks with the same column family name backed up to the target data center are stored in a more concentrated manner, thereby improving reading speed.
Description
Technical Field
The invention relates to the field of big data, and in particular to an HBase-based data center backup method and apparatus.
Background
HBase data is usually stored on the Hadoop Distributed File System (HDFS). Data stored in the original data center usually needs to be backed up, and HDFS keeps three copies by default: two copies on two different data nodes in the same rack, and the third copy on a data node in a different rack. In addition, to ensure that service can continue normally when a data center fails, the data center itself needs to be backed up.
The existing method for realizing data center backup comprises the following steps:
acquiring a data block to be backed up in the original data center; randomly selecting one data node in the target data center to store the backup of the data block, and then selecting another two data nodes for backup according to the existing backup method.
In this method, one data node of the target data center is selected at random when the data center is backed up. However, HBase stores data by column family: in the original data center, the data blocks belonging to the same column family of a table are stored together on the same data node or on several adjacent data nodes. When data is read, all data nodes holding the column family must be located according to the name of the column family to which the data belongs, and after a random backup those data nodes may be spread across all data nodes of the target data center. When this method is used for HBase cross-data-center backup, the characteristics of column family storage therefore cannot be fully exploited, and data blocks with the same column family name are stored in a scattered and discontinuous manner in the target data center, which makes reading slow.
Disclosure of Invention
To solve the above technical problem, the invention provides a data center backup method and apparatus that make full use of the characteristics of column family storage, so that data blocks with the same column family name backed up to the target data center are stored in a more concentrated manner, thereby improving reading speed.
In order to achieve the above object, the present invention provides a method for implementing data center backup, including:
determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up;
and selecting one data node in the determined storage range to store the data block.
Preferably, the determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up includes:
determining, according to the table name, the range of racks in which the data node that stores the data block is located;
determining the physical address range of the data node according to the column family name;
the selecting one data node in the storage range to store the data block comprises:
selecting one rack from the range of racks, and selecting one physical address from the range of physical addresses.
Preferably, the determining, according to the table name, a range of a rack in which a data node stored in the data block is located includes:
calculating hash values of the table names, and calculating hash values of all racks in the target data center respectively;
determining the range of the hash value of the rack as follows: the absolute value of the difference between the hash values of the racks and the hash values of the table names is smaller than or equal to the preset proportion of the maximum value of the hash values of all the racks in the target data center;
and the maximum value of the hash values of all the table names in the original data center is equal to the maximum value of the hash values of all the racks in the target data center.
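The rack-range rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the patent does not name a hash algorithm, so MD5 scaled into [0, 2π) is an assumption, and `hash_to_range`, `rack_range`, and the rack identifiers are hypothetical names.

```python
import hashlib
import math

MAX_HASH = 2 * math.pi  # the patent's preferred maximum hash value


def hash_to_range(name: str, max_value: float = MAX_HASH) -> float:
    """Map a name into [0, max_value). MD5 is an assumed stand-in hash."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    # Use 48 bits so the fraction is exactly representable and strictly below 1.
    return int.from_bytes(digest[:6], "big") / 2 ** 48 * max_value


def rack_range(table_name, racks, proportion, max_value=MAX_HASH):
    """Racks whose hash differs from the table-name hash by at most
    proportion * max_value (plain absolute difference, no wrap-around)."""
    target = hash_to_range(table_name, max_value)
    limit = proportion * max_value
    return [r for r in racks if abs(hash_to_range(r, max_value) - target) <= limit]
```

For example, `rack_range("orders", ["rack-0", "rack-1", "rack-2"], 0.25)` returns the subset of the three racks whose hash lies within a quarter of the hash space of the hash of `"orders"`. Both hashes share the same maximum value, which is what the equal-maximum condition above requires.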
Preferably, the selecting one of the data nodes from the storage range to store the data block includes:
randomly selecting one rack from the range of racks, or selecting the rack whose hash value has the smallest absolute difference from the hash value of the table name.
Preferably, the determining the physical address range of the data node according to the column family name includes:
calculating hash values of the column family names, and respectively calculating hash values of physical addresses of all data nodes in the selected rack;
determining the range of the hash value of the physical address as follows: the absolute value of the difference between the hash value of the physical address and the hash value of the column family name is smaller than or equal to the preset proportion of the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack; and the maximum value of the hash values of all column family names in the table corresponding to the table name is equal to the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack.
Preferably, the selecting one of the data nodes from the storage range to store the data block includes:
randomly selecting the data node corresponding to one physical address from the physical address range, or selecting the data node whose physical address has the hash value with the smallest absolute difference from the hash value of the column family name.
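The two selection rules for physical addresses (random choice, or smallest hash difference) can be sketched as follows. This is an illustrative sketch only: MD5 scaled to [0, 2π) is an assumed hash, and `name_hash`, `address_range`, `select_random`, and `select_nearest` are hypothetical helper names.

```python
import hashlib
import math
import random

TWO_PI = 2 * math.pi


def name_hash(name: str) -> float:
    """Hash a name into [0, 2*pi); MD5 is an assumed stand-in hash."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2 ** 48 * TWO_PI


def address_range(column_family, addresses, proportion):
    """Addresses whose hash lies within proportion * 2*pi of the column-family hash."""
    target = name_hash(column_family)
    return [a for a in addresses if abs(name_hash(a) - target) <= proportion * TWO_PI]


def select_random(candidates, rng=random):
    """Rule 1: pick any address in the range uniformly at random."""
    return rng.choice(candidates)


def select_nearest(column_family, candidates):
    """Rule 2: pick the address whose hash is closest to the column-family hash."""
    target = name_hash(column_family)
    return min(candidates, key=lambda a: abs(name_hash(a) - target))
```

The nearest rule is deterministic, so repeated backups of blocks from the same column family land on the same data node; the random rule spreads them over the whole range while still keeping them within it.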
Preferably, the maximum value is 2π.
The invention also provides a device for realizing the data center backup, which at least comprises:
the determining module is used for determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up;
and the selection module is used for selecting one data node from the determined storage range to store the data block.
Preferably, the determining module is specifically configured to:
determining the range of the rack where the data nodes stored in the data block are located according to the table name; determining a physical address range of the data node according to the column family name;
the selection module is specifically configured to:
selecting one rack from the range of racks, and selecting one physical address from the range of physical addresses.
Preferably, the determining module is specifically configured to:
calculating hash values of the table names, and calculating hash values of all racks in the target data center respectively;
determining the range of the hash value of the rack as follows: the absolute value of the difference between the hash values of the racks and the hash values of the table names is smaller than or equal to the preset proportion of the maximum value of the hash values of all the racks in the target data center;
and the maximum value of the hash values of all the table names in the original data center is equal to the maximum value of the hash values of all the racks in the target data center.
Preferably, the determining module is specifically configured to:
calculating hash values of the column family names, and respectively calculating hash values of physical addresses of all data nodes in the selected rack;
determining the range of the hash value of the physical address as follows: the absolute value of the difference between the hash value of the physical address and the hash value of the column family name is smaller than or equal to the preset proportion of the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack; and the maximum value of the hash values of all column family names in the table corresponding to the table name is equal to the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack.
Compared with the prior art, the invention comprises the following steps: determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up; and selecting one data node in the determined storage range to store the data block. According to the technical scheme, the characteristic of column family storage is fully utilized, the problem that data blocks with the same column family name are stored dispersedly when Hbase is backed up across data centers is solved, the data blocks with the same column family name backed up to a target data center are stored more intensively, and therefore the reading speed is improved.
Drawings
The accompanying drawings in the embodiments of the present invention are described below, and the drawings in the embodiments are provided for further understanding of the present invention, and together with the description serve to explain the present invention without limiting the scope of the present invention.
FIG. 1 is a flow chart of a method of implementing a data center backup in accordance with the present invention;
FIG. 2 is a flow chart of an embodiment of a method of implementing data center backup in accordance with the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for implementing data center backup according to the present invention.
Detailed Description
The present invention is further described below in conjunction with the accompanying drawings to facilitate understanding by those skilled in the art; the description is not intended to limit the scope of the present invention.
Referring to FIG. 1, the present invention provides a method for implementing data center backup, including:
Step 100: determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up.
In this step, the range of racks in which the data node that stores the data block is located can be determined according to the table name, and the physical address range of the data node can be determined according to the column family name.
Determining the range of racks according to the table name comprises:
calculating hash values of the table names, and calculating the hash values of all racks in the target data center respectively; determining the range of the hash value of the rack as follows: the absolute value of the difference between the hash values of the racks and the hash values of the table names is smaller than or equal to the preset proportion of the maximum value of the hash values of all the racks in the target data center; and the maximum value of the hash values of all the table names in the original data center is equal to the maximum value of the hash values of all the racks in the target data center.
Determining the physical address range of the data node according to the column family name comprises:
calculating hash values of the column family names, and respectively calculating the hash values of the physical addresses of all the data nodes in the selected rack; determining the range of the hash value of the physical address as follows: the absolute value of the difference between the hash value of the physical address and the hash value of the column family name is smaller than or equal to the preset proportion of the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack; and the maximum value of the hash values of all column family names in the table corresponding to the table name is equal to the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack.
The maximum value may be, but is not limited to, 2π. The maximum value of the hash values of all table names and the maximum value of the hash values of all column family names in the table corresponding to a table name may be equal or unequal.
Step 101: selecting one data node in the determined storage range to store the data block.
In this step, one rack may be randomly selected from the range of racks, and one physical address may be randomly selected from the range of physical addresses. After the data block is stored, two other data nodes are selected for backup according to the existing backup method.
Alternatively, in this step, the rack whose hash value has the smallest absolute difference from the hash value of the table name may be selected from the range of racks, and the data node whose physical address has the hash value with the smallest absolute difference from the hash value of the column family name may be selected from the range of physical addresses.
The invention stores data blocks with the same table name and the same column family name in the same storage range, so that data blocks with the same column family name are stored in a more concentrated manner, thereby improving reading speed.
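Steps 100 and 101 can be combined into one end-to-end sketch that uses the smallest-difference rule at both levels. Everything named here is an assumption for illustration: the MD5-based `name_hash`, the `topology` mapping from rack id to data-node addresses, and the choice to return `None` for an empty range (the patent does not say what happens in that case).

```python
import hashlib
import math

TWO_PI = 2 * math.pi


def name_hash(name):
    """Hash a name into [0, 2*pi); MD5 is an assumed stand-in hash."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2 ** 48 * TWO_PI


def nearest_in_range(key, items, proportion):
    """Item closest (by hash) to `key` among those within proportion * 2*pi, else None."""
    target = name_hash(key)
    in_range = [i for i in items if abs(name_hash(i) - target) <= proportion * TWO_PI]
    if not in_range:
        return None  # empty range: behaviour is not specified by the patent
    return min(in_range, key=lambda i: abs(name_hash(i) - target))


def place_block(table, column_family, topology, proportion):
    """Return (rack, address) for a block, or None if either range is empty.

    `topology` maps a rack id to the physical addresses of its data nodes.
    """
    rack = nearest_in_range(table, list(topology), proportion)
    if rack is None:
        return None
    address = nearest_in_range(column_family, topology[rack], proportion)
    return None if address is None else (rack, address)
```

Because the rack depends only on the table name and the address only on the column family name within that rack, every block of the same table and column family resolves to the same placement, which is the concentration effect the method aims for.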
How to select a data node for backup is described below by a specific embodiment.
Step 200: obtaining the table name and the column family name of the data block to be backed up.
In this step, how to obtain the name of the table and the name of the column family to which the data block to be backed up belongs is prior art and is not intended to limit the protection scope of the present invention. For example, when a data block is updated, the rack can record the table name and the column family name of the updated data block, so that they can be obtained directly when the backup is performed.
Step 201: establishing a spherical coordinate system, and mapping the obtained table name and column family name of the data block onto a sphere of fixed radius in the spherical coordinate system.
In this step, a point on the sphere is represented by two coordinate values: the angle θ between the Z axis and the line connecting the point to the coordinate center, and the angle φ between the X axis and the projection of that line onto the XoY plane.
In this step, a point on the sphere represents the position of the data block in the original data center; that is, the name of the table containing the data block is taken as φ, and the name of its column family is taken as θ.
Here, φ may be the hash value of the table name and θ may be the hash value of the column family name. Each hash value can be calculated by an existing method, for example an existing hash algorithm; the specific implementation is not intended to limit the protection scope of the present invention and is not described here.
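The mapping of step 201 can be illustrated with a short sketch. The assumptions are labeled in the code: MD5 is a stand-in hash, the column-family hash is scaled to [0, π) so that it is a valid angle from the Z axis, and the table-name hash is scaled to [0, 2π) as the azimuthal angle (written φ here). The function names are hypothetical.

```python
import hashlib
import math


def unit_hash(name: str) -> float:
    """Hash a name into [0, 1); MD5 is an assumed stand-in hash."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2 ** 48


def block_point(table_name, column_family):
    """Return (theta, phi): theta from the column family, phi from the table name."""
    theta = unit_hash(column_family) * math.pi   # angle from the Z axis (assumed range)
    phi = unit_hash(table_name) * 2 * math.pi    # angle from the X axis in the XoY plane
    return theta, phi


def to_cartesian(theta, phi, radius=1.0):
    """Convert spherical angles on a sphere of the given radius to (x, y, z)."""
    return (radius * math.sin(theta) * math.cos(phi),
            radius * math.sin(theta) * math.sin(phi),
            radius * math.cos(theta))
```

Two column families of the same table share the same φ and differ only in θ, so they fall on the same meridian of the sphere.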
Step 202: taking the line from the point mapped onto the sphere to the coordinate center as an axis, rotating a straight line that passes through the coordinate center and forms a preset angle with the axis one full turn around the axis, and taking the region of the sphere thus enclosed as the storage area of the data block corresponding to the point.
In this step, let P be the point on the sphere to which the data block is mapped and O the coordinate center, and take the line OP as the axis. A straight line that passes through O at an angle γ to OP is rotated one full turn around OP, and the region of the sphere enclosed by the resulting cone is the storage area corresponding to (hash value of the table name, hash value of the column family name). Projecting the conical region onto the YoZ plane gives the range θ' ∈ [θ − γ, θ + γ]. Projecting the conical region onto the XoY plane gives formula (1), from which the range of φ' is obtained as shown in formula (2). (Formulas (1) and (2) are images in the original publication and are not reproduced in this text.)
Here b is the distance to OP from the intersection point of the straight line that passes through O at an angle γ to OP, r is the length of OP, and γ' is the angle between the projection onto the XoY plane of that straight line and the projection of OP onto the XoY plane.
The ratio of the area of the spherical surface cut off by the conical region to the total area of the sphere is shown in formula (3), and the value of γ' is obtained as shown in formula (4). (Formulas (3) and (4) are images in the original publication and are not reproduced in this text.)
The region that satisfies both the range of θ' and the range of φ' is the storage area of the data block.
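As an illustration of the conical storage area, the sketch below tests membership directly through the central angle between two points on the sphere, rather than through the projected θ' and φ' ranges. This is a simplification chosen for the example, not the patent's formulas. The cap-area fraction (1 − cos γ)/2 is the standard result for a spherical cap of half-angle γ and is not necessarily the exact form used in the patent. All function names are hypothetical.

```python
import math


def angular_distance(p, q):
    """Angle at the sphere's center between points p and q, each given as (theta, phi)."""
    (t1, f1), (t2, f2) = p, q
    cos_d = (math.cos(t1) * math.cos(t2)
             + math.sin(t1) * math.sin(t2) * math.cos(f1 - f2))
    return math.acos(max(-1.0, min(1.0, cos_d)))  # clamp against rounding error


def in_storage_area(block_point, candidate, gamma):
    """True if `candidate` lies inside the cone of half-angle gamma around the axis OP."""
    return angular_distance(block_point, candidate) <= gamma


def cap_fraction(gamma):
    """Fraction of the sphere's area inside a cap of half-angle gamma."""
    return (1.0 - math.cos(gamma)) / 2.0


def gamma_for_fraction(fraction):
    """Inverse of cap_fraction: the half-angle whose cap covers `fraction` of the sphere."""
    return math.acos(1.0 - 2.0 * fraction)
```

With these helpers, a preset proportion of the sphere can be turned into a cone angle via `gamma_for_fraction`, and any candidate point can then be checked with `in_storage_area`.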
Step 203: mapping the racks and physical addresses corresponding to all data nodes in the target data center onto the sphere of fixed radius in the spherical coordinate system.
In this step, a point on the sphere represents the position of a data node in the target data center; that is, the rack is taken as φ and the physical address is taken as θ.
Here, φ may be the hash value of the rack and θ may be the hash value of the physical address. How to calculate each hash value is prior art and is not intended to limit the scope of the present invention.
Step 204: selecting the data node corresponding to one of the points in the storage area of the data block as the storage location of the data block.
In this step, a data node corresponding to a point closest to a point corresponding to the data block may be selected as a storage location of the data block.
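The closest-point choice of step 204 can be sketched as follows, assuming each data node has already been mapped to a (θ, φ) pair as in step 203. The function names, the `node_points` dictionary, and the use of the central angle as the distance measure are assumptions made for this example.

```python
import math


def central_angle(p, q):
    """Angle at the sphere's center between points p and q, each given as (theta, phi)."""
    (t1, f1), (t2, f2) = p, q
    cos_d = (math.cos(t1) * math.cos(t2)
             + math.sin(t1) * math.sin(t2) * math.cos(f1 - f2))
    return math.acos(max(-1.0, min(1.0, cos_d)))


def select_node(block_point, node_points, gamma):
    """Return the id of the node nearest to the block's point within the cone of
    half-angle gamma, or None if no node falls inside the storage area.

    `node_points` maps a node id to its (theta, phi) point on the sphere.
    """
    distances = {n: central_angle(block_point, pt) for n, pt in node_points.items()}
    inside = {n: d for n, d in distances.items() if d <= gamma}
    if not inside:
        return None
    return min(inside, key=inside.get)
```

For instance, with nodes at (1.0, 1.0) and (1.05, 1.0), a block mapped to (1.04, 1.0) and γ = 0.1 selects the second node, since it is 0.01 rad away versus 0.04 rad for the first.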
Referring to FIG. 3, the present invention further provides an apparatus for implementing data center backup, which at least includes:
the determining module is used for determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up;
and the selection module is used for selecting one data node from the determined storage range to store the data block.
In the apparatus of the present invention, the determining module is specifically configured to:
determining the range of a rack where a data node stored in a data block is located according to the table name; determining a physical address range of the data node according to the column family name;
a selection module specifically configured to:
one rack is selected from a range of racks and one physical address is selected from a range of physical addresses.
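The division of work between the two modules might be organized as in the sketch below. This is only one possible structure: the class names mirror the module names in the text, but the method names, the MD5-based `name_hash`, and the use of the smallest-difference rule in the selection module are assumptions.

```python
import hashlib
import math
from dataclasses import dataclass

TWO_PI = 2 * math.pi


def name_hash(name):
    """Hash a name into [0, 2*pi); MD5 is an assumed stand-in hash."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2 ** 48 * TWO_PI


@dataclass
class DeterminingModule:
    proportion: float  # preset proportion of the maximum hash value

    def _within(self, key, items):
        target = name_hash(key)
        limit = self.proportion * TWO_PI
        return [i for i in items if abs(name_hash(i) - target) <= limit]

    def rack_range(self, table_name, racks):
        """Racks eligible for the block, determined from the table name."""
        return self._within(table_name, racks)

    def address_range(self, column_family, addresses):
        """Addresses eligible within the chosen rack, from the column family name."""
        return self._within(column_family, addresses)


class SelectionModule:
    @staticmethod
    def select(key, candidates):
        """Pick the candidate whose hash is closest to the hash of `key`."""
        target = name_hash(key)
        return min(candidates, key=lambda c: abs(name_hash(c) - target))
```

A caller would first ask the determining module for the rack range, let the selection module pick a rack, then repeat the same two calls for the physical addresses inside that rack.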
In the apparatus of the present invention, the determining module is specifically configured to:
calculating hash values of the table names, and calculating hash values of all racks in the target data center respectively;
determining the range of the hash value of the rack as follows: the absolute value of the difference between the hash values of the racks and the hash values of the table names is smaller than or equal to the preset proportion of the maximum value of the hash values of all the racks in the target data center;
and the maximum value of the hash values of all the table names in the original data center is equal to the maximum value of the hash values of all the racks in the target data center.
In the backup apparatus of the present invention, the determining module is specifically configured to:
calculating hash values of the column family names, and respectively calculating hash values of physical addresses of all data nodes in the selected rack;
determining the range of the hash value of the physical address as follows: the absolute value of the difference between the hash value of the physical address and the hash value of the column family name is smaller than or equal to the preset proportion of the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack; and the maximum value of the hash values of all column family names in the table corresponding to the table name is equal to the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack.
It should be noted that the above-mentioned embodiments are only for facilitating the understanding of those skilled in the art, and are not intended to limit the scope of the present invention, and any obvious substitutions, modifications, etc. made by those skilled in the art without departing from the inventive concept of the present invention are within the scope of the present invention.
Claims (9)
1. A method for implementing data center backup is characterized by comprising the following steps:
determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up,
the method specifically comprises the following steps:
determining the range of the rack where the data node stored in the data block is located according to the table name,
the method specifically comprises the following steps:
calculating hash values of the table names, and calculating hash values of all racks in the target data center respectively;
determining the range of the hash value of the rack as follows: the absolute value of the difference between the hash values of the racks and the hash values of the table names is smaller than or equal to the preset proportion of the maximum value of the hash values of all the racks in the target data center;
the maximum value of the hash values of all table names in the original data center is equal to the maximum value of the hash values of all racks in the target data center; and selecting one data node in the determined storage range to store the data block.
2. The method of claim 1, wherein determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up further comprises:
determining a physical address range of the data node according to the column family name;
the selecting one of the data nodes from the storage range to store the data block comprises:
selecting one rack from the range of racks, and selecting one physical address from the range of physical addresses.
3. The method of claim 1, wherein selecting one of the data nodes from the storage range to store the data block comprises:
randomly selecting one rack from the range of racks, or selecting the rack whose hash value has the smallest absolute difference from the hash value of the table name.
4. The method of claim 1, wherein determining the physical address range of the data node according to the column family name comprises:
calculating hash values of the column family names, and respectively calculating hash values of physical addresses of all data nodes in the selected rack;
determining the range of the hash value of the physical address as follows: the absolute value of the difference between the hash value of the physical address and the hash value of the column family name is smaller than or equal to the preset proportion of the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack; and the maximum value of the hash values of all column family names in the table corresponding to the table name is equal to the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack.
5. The method of claim 4, wherein selecting one of the data nodes from the storage range to store the data block comprises:
randomly selecting the data node corresponding to one physical address from the physical address range, or selecting the data node whose physical address has the hash value with the smallest absolute difference from the hash value of the column family name.
6. A method according to any one of claims 3 to 5, wherein the maximum value is 2π.
7. An apparatus for implementing data center backup, comprising at least:
the determining module is used for determining the storage range of the data block in the target data center according to the table name and the column family name of the data block to be backed up;
the determining module is specifically configured to:
determining the range of the rack where the data node stored in the data block is located according to the table name,
which specifically includes:
calculating hash values of the table names, and calculating hash values of all racks in the target data center respectively;
determining the range of the hash value of the rack as follows: the absolute value of the difference between the hash values of the racks and the hash values of the table names is smaller than or equal to the preset proportion of the maximum value of the hash values of all the racks in the target data center;
the maximum value of the hash values of all table names in the original data center is equal to the maximum value of the hash values of all racks in the target data center; and the selection module is used for selecting one data node from the determined storage range to store the data block.
8. The apparatus of claim 7, wherein the determining module is specifically configured to:
determining a physical address range of the data node according to the column family name;
the selection module is specifically configured to:
selecting one rack from the range of racks, and selecting one physical address from the range of physical addresses.
9. The apparatus of claim 7, wherein the determining module is specifically configured to:
calculating hash values of the column family names, and respectively calculating hash values of physical addresses of all data nodes in the selected rack;
determining the range of the hash value of the physical address as follows: the absolute value of the difference between the hash value of the physical address and the hash value of the column family name is smaller than or equal to the preset proportion of the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack; and the maximum value of the hash values of all column family names in the table corresponding to the table name is equal to the maximum value of the hash values of the physical addresses of all the data nodes in the selected rack.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410032550.1A CN103761167B (en) | 2014-01-23 | 2014-01-23 | A kind of method and apparatus for realizing data center backup |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410032550.1A CN103761167B (en) | 2014-01-23 | 2014-01-23 | A kind of method and apparatus for realizing data center backup |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103761167A CN103761167A (en) | 2014-04-30 |
CN103761167B CN103761167B (en) | 2017-04-05 |
Family
ID=50528409
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410032550.1A Active CN103761167B (en) | 2014-01-23 | 2014-01-23 | A kind of method and apparatus for realizing data center backup |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103761167B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105592178B (en) * | 2015-09-17 | 2018-12-25 | 新华三技术有限公司 | A kind of back end method for determining position and device |
CN105511801B (en) * | 2015-11-12 | 2018-11-16 | 长春理工大学 | The method and apparatus of data storage |
CN107463342B (en) * | 2017-08-28 | 2021-04-20 | 北京奇艺世纪科技有限公司 | CDN edge node file storage method and device |
CN112379840B (en) * | 2020-11-17 | 2023-02-24 | 海光信息技术股份有限公司 | Terminal data protection method and device and terminal |
CN114780298B (en) * | 2022-06-16 | 2022-09-06 | 深圳市慧为智能科技股份有限公司 | File data processing method and device, computer terminal and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102750356A (en) * | 2012-06-11 | 2012-10-24 | 清华大学 | Construction and management method for secondary indexes of key value library |
CN103281291A (en) * | 2013-02-19 | 2013-09-04 | 电子科技大学 | Application layer protocol identification method based on Hadoop |
-
2014
- 2014-01-23 CN CN201410032550.1A patent/CN103761167B/en active Active
Non-Patent Citations (1)
Title |
---|
《多数据中心非结构化数据复制方法研究》;王开煊;《中国优秀硕士学位论文全文数据库信息科技辑》;20121015(第10期);14-22 * |
Also Published As
Publication number | Publication date |
---|---|
CN103761167A (en) | 2014-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103761167B (en) | A kind of method and apparatus for realizing data center backup | |
JP6302951B2 (en) | Method and system for global namespace using consistent hash method | |
TWI544334B (en) | Data storage device and operating method thereof | |
EP3251033B1 (en) | Hybrid data distribution in a massively parallel processing architecture | |
US20110099187A1 (en) | Method and System for Locating Update Operations in a Virtual Machine Disk Image | |
US8924776B1 (en) | Method and system for calculating parity values for multi-dimensional raid | |
CN106126374B (en) | Method for writing data, method for reading data and device | |
JP2015528603A5 (en) | ||
CN105027069A (en) | Deduplication of volume regions | |
US11029857B2 (en) | Offloading device maintenance to an external processor in low-latency, non-volatile memory | |
JP6870466B2 (en) | Control programs, control methods, controls, and database servers | |
CN106911743A (en) | Small documents write polymerization, read polymerization and system and client | |
CN106201331A (en) | For writing method and apparatus and the storage media of data | |
WO2017020668A1 (en) | Physical disk sharing method and apparatus | |
US20170039142A1 (en) | Persistent Memory Manager | |
GB2593408A (en) | Increasing data performance by transferring data between storage tiers using workload characteristics | |
US20150052327A1 (en) | Dynamic memory relocation | |
CN105518612A (en) | Write-in method and apparatus for monitoring data | |
US10936497B2 (en) | Method and system for writing data to and read data from persistent storage | |
US10157216B2 (en) | Data management system and data management method | |
JP6090489B1 (en) | Error detection device, storage device, and error correction method | |
US10725877B2 (en) | System, method and computer program product for performing a data protection operation | |
CN105677843A (en) | Method for automatically obtaining attribute of four boundaries of parcel | |
WO2017035813A1 (en) | Data access method, device and system | |
CN103902230A (en) | Data processing method, device and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||