CN112578992A - Data storage method and data storage device

Info

Publication number: CN112578992A
Application number: CN201910926872.3A
Authority: CN
Inventors: 杨艳伟; 孙荣宗
Original assignee: Xian Huawei Technologies Co Ltd
Current assignee: Xian Huawei Technologies Co Ltd
Priority date: 2019-09-27
Filing date: 2019-09-27
Publication date: 2021-03-30
Anticipated expiration: 2039-09-27
Also published as: CN112578992B; WO2021057377A1

Abstract

An embodiment of the present application discloses a data storage method comprising the following steps: sending a first control instruction, where the first control instruction instructs installing data processing software in N storage devices and creating a storage resource pool and a virtual machine in each of the N storage devices, each virtual machine stores data using the storage resource pool created in its corresponding storage device, and each virtual machine serves as an optional data node of the data processing software; obtaining a configuration instruction that includes the number of copies M and a storage-aware policy, where the storage-aware policy specifies that M data nodes are used to store data and that the M data nodes are located in M different storage devices; determining, according to the configuration instruction, M data nodes for storing the data to be stored; and storing the data to be stored in the M data nodes. By means of the configured storage-aware policy, the data storage method helps improve the reliability of data storage.

Description

Data storage method and data storage device

Technical Field

The present application relates to the field of computer networks, and in particular, to a data storage method and a data storage apparatus.

Background

To improve the security of data storage, a multi-copy method is generally adopted: a data file is copied into multiple copies, which are stored on multiple servers or magnetic arrays. Taking storage on multiple servers as an example, the data file can be accessed as long as any one server holding a copy is available, which avoids the data loss and inaccessibility that a single server would suffer from a network failure, disk damage, power failure, downtime, and the like.

Although this method makes data easy to restore, when a data storage device that holds several copies of the same data is powered off or goes down, the number of usable copies is not the total number of copies minus one, but the total number of copies minus the number of copies stored on that device. The number of copies actually usable is therefore less than expected, which reduces storage reliability.
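The arithmetic described above can be illustrated with a minimal sketch. The function name `usable_copies` and the device names are hypothetical and used only for illustration; they do not appear in the application.

```python
from collections import Counter


def usable_copies(placement, failed_device):
    """Return how many copies remain usable after one device fails.

    placement: list with one entry per copy, naming the storage device
    that holds that copy (hypothetical representation).
    """
    copies_per_device = Counter(placement)
    # Every copy held by the failed device is lost at once.
    return len(placement) - copies_per_device[failed_device]


# Three copies, two of which happen to sit on the same device.
colocated = ["device-A", "device-A", "device-B"]
# Three copies spread over three different devices.
spread = ["device-A", "device-B", "device-C"]

print(usable_copies(colocated, "device-A"))
print(usable_copies(spread, "device-A"))
```

With co-located copies, a single device failure removes more than one copy, which is the reliability problem the application sets out to address.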

Therefore, how to improve the reliability of data storage is a problem to be solved.

Disclosure of Invention

The embodiment of the application provides a data storage method and a data storage device, which can improve the reliability of data storage.

In a first aspect, an embodiment of the present application provides a data storage method, where the method includes the following steps:

sending a first control instruction, wherein the first control instruction instructs to install data processing software in N storage devices, a storage resource pool and a virtual machine are created in any one of the N storage devices, any virtual machine uses the created storage resource pool in the corresponding storage device to store data, any virtual machine serves as an optional data node of the data processing software, and N is an integer greater than or equal to 2;

obtaining configuration instructions, the configuration instructions comprising: setting the copy number M and a storage perception strategy when data storage is carried out; the storage aware policy includes: determining M data nodes for storing data, wherein the M data nodes are located in M different storage devices, and M is an integer less than or equal to N;

determining M data nodes for storing data to be stored according to the configuration instruction;

and storing the data to be stored in the M data nodes.

In the data storage method provided by the embodiment of the application, when multiple copies are stored, different copies are located in different storage devices, so that the failure of one storage device holding backup data reduces the number of available copies by only one. In the prior art, by contrast, the failure of a single storage device may make several copies unavailable; storage reliability is therefore improved.
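The selection rule of the storage-aware policy (M data nodes located in M different storage devices, with N >= 2 and M <= N) can be sketched as follows. This is only an illustration under stated assumptions: the function `pick_data_nodes`, the dictionary layout, and the choice of the first M devices in sorted order are all hypothetical, and the application does not specify how the M devices are chosen among the N.

```python
def pick_data_nodes(nodes_by_device, m):
    """Pick M data nodes so that no two share a storage device.

    nodes_by_device: dict mapping a storage device name to the list of
    data nodes (virtual machines) created on it. Each device is assumed
    to hold at least one data node.
    """
    n = len(nodes_by_device)
    if n < 2:
        raise ValueError("need at least 2 storage devices (N >= 2)")
    if not 1 <= m <= n:
        raise ValueError("number of copies M must satisfy 1 <= M <= N")

    chosen = []
    # Arbitrary illustrative choice: the first M devices in sorted order,
    # taking one data node from each.
    for device in sorted(nodes_by_device)[:m]:
        chosen.append((device, nodes_by_device[device][0]))
    return chosen


cluster = {"dev1": ["vm1"], "dev2": ["vm2"], "dev3": ["vm3"]}
targets = pick_data_nodes(cluster, m=2)
print(targets)
```

Because exactly one data node is taken per device, the failure of any single storage device can remove at most one of the M copies.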

In some possible embodiments, the storage resource pools created in the N storage devices are encoded with erasure coding (EC).

When EC coding is adopted in the storage resource pool created in a storage device, lost or damaged data can be recomputed, so the storage device remains usable after some data is lost. Compared with the traditional approach, in which a distributed system keeps 3 copies of each piece of data so that it can continue to provide service after failures such as hardware failure, the embodiment of the application saves storage space and improves storage utilization.

In some possible embodiments, the storage utilization rate of the hard disk is 88.89% when the EC coding uses the 8-data-block, 1-parity-block (8D1P) mode, 80% in the 4D1P mode, 80% in the 8D2P mode, and 66.67% in the 4D2P mode.
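The utilization figures above follow from the ratio of data blocks to total blocks, D / (D + P). A minimal sketch of this arithmetic is given below; the function name `ec_utilization` is hypothetical.

```python
def ec_utilization(data_blocks, parity_blocks):
    """Fraction of raw capacity that holds user data for an xDyP EC mode."""
    return data_blocks / (data_blocks + parity_blocks)


modes = {"8D1P": (8, 1), "4D1P": (4, 1), "8D2P": (8, 2), "4D2P": (4, 2)}
for name, (d, p) in modes.items():
    # Formats as a percentage with two decimals, e.g. 8/9 -> 88.89%.
    print(f"{name}: {ec_utilization(d, p):.2%}")
```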

In a second aspect, an embodiment of the present application provides a data storage device, including:

a sending unit, configured to send a first control instruction, where the first control instruction instructs to install data processing software in N storage devices, and create a storage resource pool and a virtual machine in any one of the N storage devices, where any one of the virtual machines uses the created storage resource pool in the storage device corresponding to the virtual machine to store data, and any one of the virtual machines serves as an optional data node of the data processing software, where N is an integer greater than or equal to 2.

An obtaining unit, configured to obtain a configuration instruction, where the configuration instruction includes: setting the copy number M and a storage perception strategy when data storage is carried out; the storage aware policy includes: determining M data nodes for storing data, the M data nodes being located in M different storage devices, M being an integer less than or equal to N.

A determining unit, configured to determine, according to the configuration instruction, M data nodes for storing the data to be stored.

A processing unit, configured to store the data to be stored in the M data nodes.

When the data storage device provided by the embodiment of the application stores multiple copies, different copies are located in different storage devices, so that the failure of one storage device holding backup data reduces the number of available copies by only one. In the prior art, by contrast, the failure of a single storage device may make several copies unavailable; storage reliability is therefore improved.

In some possible embodiments, the storage resource pools created in the N storage devices are encoded with erasure coding (EC). When EC coding is adopted in the storage resource pool created in a storage device, lost or damaged data can be recomputed, so the storage device remains usable after some data is lost. Compared with the traditional approach, in which a distributed system keeps 3 copies of each piece of data so that it can continue to provide service after failures such as hardware failure, the embodiment of the application saves storage space and improves storage utilization.
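The storage-space claim can be made concrete with a rough calculation. The sketch below assumes, as a reading not spelled out in the application, that the baseline is three plain copies without EC and that the embodiment keeps M = 2 copies, each in a pool using 8D1P EC. The function name `raw_bytes_per_logical_byte` is hypothetical.

```python
def raw_bytes_per_logical_byte(copies, data_blocks, parity_blocks):
    """Raw capacity consumed per byte of user data.

    Each copy is stored in an EC pool that expands data by
    (data_blocks + parity_blocks) / data_blocks.
    """
    return copies * (data_blocks + parity_blocks) / data_blocks


# Assumed baseline: 3 plain copies, modeled as 1 data block and 0 parity blocks.
three_copies = raw_bytes_per_logical_byte(3, 1, 0)
# Assumed embodiment: 2 copies, each in an 8D1P pool.
two_copies_ec = raw_bytes_per_logical_byte(2, 8, 1)

print(three_copies, two_copies_ec)
```

Under these assumptions the two-copy EC layout consumes 2.25 bytes of raw capacity per byte of user data, compared with 3 bytes for three plain copies.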

In some possible embodiments, the storage device comprises: distributed servers or magnetic arrays.

In some possible embodiments, the data processing software comprises the distributed processing software Hadoop.

In some possible embodiments, M is 2.

In a third aspect, an embodiment of the present application provides a data storage system, which includes N storage devices and the data storage device described in the second aspect or any one of the possible implementation manners of the second aspect, where N is an integer greater than or equal to 2.

In a fourth aspect, an embodiment of the present application provides an electronic device, including:

one or more processors;

storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in the first aspect or any one of the possible embodiments of the first aspect.

In a fifth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which when executed by a processor, implements the method as described in the first aspect or any one of the possible implementation manners of the first aspect.

Drawings

Fig. 1 is a schematic flowchart of a data storage method according to an embodiment of the present application.

Fig. 2 is a schematic flowchart of a data storage method according to another embodiment of the present application.

Fig. 3 is an interaction flow diagram of a data storage method according to an embodiment of the present application.

Fig. 4 is a schematic structural diagram of a data storage device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. Other embodiments can be derived by those skilled in the art based on the embodiments in the present application.

Referring to fig. 1, fig. 1 is a schematic flowchart of a data storage method according to an embodiment of the present application, including the following steps.

101. Sending a first control instruction, wherein the first control instruction instructs to install data processing software in N storage devices, a storage resource pool and a virtual machine are created in any one of the N storage devices, any virtual machine uses the created storage resource pool in the corresponding storage device to store data, any virtual machine serves as an optional data node of the data processing software, and N is an integer greater than or equal to 2.

The storage device may be, for example, a distributed server or a magnetic array, etc.

102. Obtaining configuration instructions, the configuration instructions comprising: setting the copy number M and a storage perception strategy when data storage is carried out; the storage aware policy includes: determining M data nodes for storing data, the M data nodes being located in M different storage devices, M being an integer less than or equal to N.

For example, if N is 3 and M is 2, the first control instruction instructs to install the data processing software in 3 storage devices and to create a storage resource pool and a virtual machine in any one of the 3 storage devices, where any virtual machine stores data using the created storage resource pool in the storage device corresponding to the virtual machine, and the 3 created virtual machines may serve as optional data nodes of the data processing software. In some possible embodiments, the data processing software may be the distributed processing software Hadoop.

103. Determining M data nodes for storing the data to be stored according to the configuration instruction.

For example, if M is 2, two data nodes for saving data to be stored are determined according to the configuration instruction.

104. Storing the data to be stored in the M data nodes.

For example, if M is 2, the data to be stored is saved into the two determined data nodes.

Referring to fig. 2, fig. 2 is a schematic flowchart of a data storage method according to another embodiment of the present application. The method comprises the following steps:

201. sending a first control instruction, wherein the first control instruction instructs to install data processing software in N storage devices, a storage resource pool and a virtual machine are created in any one of the N storage devices, any one virtual machine uses the created storage resource pool in the corresponding storage device to store data, any one virtual machine serves as an optional data node of the data processing software, N is an integer greater than or equal to 2, and the storage resource pool is encoded by erasure codes EC.

The storage device may be, for example, a distributed server or a magnetic array, etc. The following description will be made taking a distributed server as an example.

202. Obtaining configuration instructions, the configuration instructions comprising: setting the copy number M and a storage perception strategy when data storage is carried out; the storage aware policy includes: determining M data nodes for storing data, the M data nodes being located in M different storage devices, M being an integer less than or equal to N.

203. Determining M data nodes for storing the data to be stored according to the configuration instruction.

204. Storing the data to be stored in the M data nodes.

Referring to fig. 3, fig. 3 is a schematic interaction flow diagram of a data storage method according to an embodiment of the present application. As shown in fig. 3, the data storage in this embodiment includes the following steps.

301. Software-defined storage (SDS) is installed on the first rack, Rack1, and the second rack, Rack2; a separate storage resource pool is created on each of Rack1 and Rack2, and EC (N: N + M) coding is adopted. In this embodiment, the EC coding uses the 8D1P mode.

In this embodiment, as shown in FIG. 3, storage resource pool 1 is created by the SDS and is encoded with EC in 8D1P mode, and storage resource pool 2 is likewise created by the SDS and is encoded with EC in 8D1P mode.

SDS is a storage architecture that separates storage software from hardware. Unlike conventional Network Attached Storage (NAS) or Storage Area Network (SAN) systems, SDS is typically implemented on industry-standard or x86 systems, thereby eliminating the dependency of software on proprietary hardware. SDS typically employs a distributed architecture to improve reliability and scalability, so it is sometimes equated with distributed storage. The two are nevertheless distinct: distributed storage refers to an architecture, with the emphasis on the architecture being distributed, whereas SDS refers to software-defined storage, with the emphasis on decoupling software from hardware.

SDS has the following advantages: (1) Software and hardware are decoupled. The storage hardware is commercial off-the-shelf (COTS), which avoids vendor lock-in, and purchasing hardware and software in separate layers reduces equipment procurement cost. (2) Scalability is strong. SDS adopts a distributed architecture, so storage capacity can in theory be expanded without limit and grows linearly with the number of servers (horizontal expansion). A SAN is limited by the processing capacity of its controller, so the capacity of a single magnetic array is limited; once the required capacity exceeds that of the magnetic array, another set of storage equipment must be added (vertical expansion). (3) Reliability is high.

302. Virtual machines for use by Hadoop are created, and the disk of each virtual machine uses the storage resource pool on the rack where that virtual machine is located.

Specifically, the virtual machine disk on Rack1 uses storage resource pool 1, and the virtual machine disk on Rack2 uses storage resource pool 2.

Hadoop is a distributed system infrastructure frequently used in the prior art. The Hadoop Distributed File System (HDFS) divides nodes into two types: NameNodes and DataNodes. The NameNode manages the namespace of the file system. It maintains the file system tree and all files and directories within the tree. This information is stored persistently on the local disk in two files: the namespace image file and the edit log file. The NameNode also records the DataNodes on which each block of each file is located, but it does not store block locations persistently; this information is rebuilt from the DataNodes when the system starts.

303. Hadoop is installed in the virtual machines created in step 302, and the virtual machines are managed by Hadoop as DataNodes.

304. The rack awareness policy of Hadoop is configured, and the number of copies is set to 2.

When data is written into Hadoop, one DataNode is taken from each of the two racks, so the data is written into two different storage resource pools. Rack-level reliability keeps the data active-active, with a live copy on each rack, which improves overall reliability.
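As an illustration of step 304, Hadoop's rack awareness is commonly driven by a topology script: the script receives one or more host names or IP addresses as command-line arguments and prints one rack path per argument. The sketch below assumes such a script is referenced through the `net.topology.script.file.name` property in core-site.xml and that `dfs.replication` is set to 2 in hdfs-site.xml. The IP addresses and rack names are hypothetical placeholders, not values from the application.

```python
#!/usr/bin/env python3
"""Minimal rack-mapping script sketch for Hadoop rack awareness."""
import sys

# Hypothetical mapping of DataNode virtual machine addresses to racks.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
    "10.0.2.12": "/rack2",
}
# Fallback rack for hosts that are not listed above.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes the hosts as arguments and reads rack paths from stdout.
    print(" ".join(resolve(sys.argv[1:])))
```

With two racks defined this way and two copies per block, the default HDFS placement puts the second copy on a different rack from the first, so each block lands in both storage resource pools.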

It can be understood that, when the storage system is actually built, the number of the racks is not limited to two, and a plurality of racks can adopt the same strategy.

Referring to fig. 4, fig. 4 is a block diagram illustrating a data storage device 400 according to an embodiment of the present application, where the data storage device 400 includes:

a sending unit 401, configured to send a first control instruction, where the first control instruction instructs to install data processing software in N storage devices, and create a storage resource pool and a virtual machine in any one of the N storage devices, where any one of the virtual machines uses the created storage resource pool in the storage device corresponding to the virtual machine to store data, and any one of the virtual machines serves as an optional data node of the data processing software, where N is an integer greater than or equal to 2.

An obtaining unit 402, configured to obtain a configuration instruction, where the configuration instruction includes: setting the copy number M and a storage perception strategy when data storage is carried out; the storage aware policy includes: determining M data nodes for storing data, the M data nodes being located in M different storage devices, M being an integer less than or equal to N.

A determining unit 403, configured to determine, according to the configuration instruction, M data nodes for storing data to be stored.

A processing unit 404, configured to store the data to be stored in the M data nodes.

In some possible embodiments, the storage resource pools created in the N storage devices are encoded with erasure coding (EC). When EC coding is adopted in the storage resource pool created in a storage device, lost or damaged data can be recomputed, so the storage device remains usable after some data is lost. Compared with the traditional approach, in which a distributed system keeps 3 copies of each piece of data so that it can continue to provide service after failures such as hardware failure, the embodiment of the application saves storage space and improves storage utilization.

An embodiment of the application also provides a data storage system, which includes N storage devices and the data storage device of any of the foregoing embodiments, where N is an integer greater than or equal to 2. As shown in fig. 4, the data storage device includes: a sending unit 401, configured to send a first control instruction, where the first control instruction instructs to install data processing software in N storage devices, and create a storage resource pool and a virtual machine in any one of the N storage devices, where any one of the virtual machines uses the created storage resource pool in the storage device corresponding to the virtual machine to store data, and any one of the virtual machines serves as an optional data node of the data processing software, where N is an integer greater than or equal to 2.

An embodiment of the present application further provides an electronic device, including: one or more processors; and storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a data storage method as in any preceding method embodiment. The method comprises the following steps:

and storing the data to be stored in the M data nodes.

In some possible embodiments, the storage resource pools created in the N storage devices are encoded with erasure coding (EC).

In some possible embodiments, the EC coding uses an 8-data-block, 1-parity-block (8D1P) mode, a 4D1P mode, an 8D2P mode, or a 4D2P mode.

In some possible embodiments, M = 2.

In the data storage method provided by the embodiment of the application, when multiple copies are stored, different copies are located in different storage devices, so that the failure of one storage device holding backup data reduces the number of available copies by only one. In the prior art, by contrast, the failure of a single storage device may make several copies unavailable; storage reliability is therefore improved. When the storage resource pools created in the N storage devices are encoded with erasure coding (EC), lost or damaged data can be recomputed, so a storage resource pool remains usable after some data is lost. Compared with the traditional approach, in which a distributed system keeps 3 copies of each piece of data so that it can continue to provide service after failures such as hardware failure, the embodiment of the application saves storage space and improves storage utilization.

The present application also provides a computer readable medium, on which a computer program is stored, which when executed by a processor implements the data storage method according to any of the preceding method embodiments. The method comprises the following steps:

and storing the data to be stored in the M data nodes.

In some possible embodiments, M = 2.

It is to be understood that the terms "first," "second," and the like in the description and in the claims, and in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method of storing data, the method comprising the steps of:

and storing the data to be stored in the M data nodes.

2. The data storage method of claim 1,

wherein the storage resource pools created in the N storage devices are encoded with erasure coding (EC).

3. The data storage method of claim 2, wherein the storage device comprises: distributed servers or magnetic arrays.

4. The data storage method of claim 2, wherein the EC coding uses an 8-data-block, 1-parity-block (8D1P) mode, a 4D1P mode, an 8D2P mode, or a 4D2P mode.

5. The data storage method of claim 1, wherein the data processing software comprises the distributed processing software Hadoop.

6. The data storage method according to any one of claims 1 to 5, wherein M = 2.

7. A data storage device, comprising:

a sending unit, configured to send a first control instruction, where the first control instruction instructs to install data processing software in N storage devices, and create a storage resource pool and a virtual machine in any one of the N storage devices, where any one of the virtual machines uses the created storage resource pool in the storage device corresponding to the virtual machine to store data, and any one of the virtual machines serves as an optional data node of the data processing software, where N is an integer greater than or equal to 2;

an obtaining unit, configured to obtain a configuration instruction, where the configuration instruction includes: setting the copy number M and a storage perception strategy when data storage is carried out; the storage aware policy includes: determining M data nodes for storing data, wherein the M data nodes are located in M different storage devices, and M is an integer less than or equal to N;

the determining unit is used for determining M data nodes for storing data to be stored according to the configuration instruction;

8. The data storage device of claim 7, wherein the created pool of storage resources of the N storage devices is encoded with Erasure Codes (EC).

9. The data storage device of claim 8, wherein the storage device comprises: distributed servers or magnetic arrays.

10. The data storage device of claim 8, wherein the EC coding uses an 8-data-block, 1-parity-block (8D1P) mode, a 4D1P mode, an 8D2P mode, or a 4D2P mode.

11. The data storage device of claim 7, wherein the data processing software comprises the distributed processing software Hadoop.

12. The data storage device of any of claims 7 to 11, wherein M = 2.

13. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a data storage method as claimed in any one of claims 1 to 6.

14. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the data storage method of any one of claims 1 to 6.