CN105323271B - Cloud computing system and processing method and device thereof

Info

Publication number: CN105323271B
Application number: CN201410289531.7A
Authority: CN
Inventors: 莫嫣; 高洪; 韩银俊
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2014-06-24
Filing date: 2014-06-24
Publication date: 2020-04-24
Anticipated expiration: 2034-06-24
Also published as: CN105323271A; WO2015196692A1

Abstract

The invention provides a cloud computing system and a processing method and device of the cloud computing system. The processing method of the cloud computing system comprises the following steps: receiving an operation request of a client to a cloud computing system; acquiring a data identifier to be operated in the cloud computing system according to the operation request; searching each disk storing data corresponding to the data identification in each node of the cloud computing system and the state of each disk according to the node disk state report of the cloud computing system; the node disk status report comprises: the state of a disk in each node of the cloud computing system and a data identifier corresponding to data stored in the disk; and performing corresponding operation according to the state of each disk storing the data corresponding to the data identification in each node in the cloud computing system. The invention can improve the tolerance of the system to the disk fault.

Description

Cloud computing system and processing method and device thereof

Technical Field

The invention relates to the technical field of cloud computing, in particular to a cloud computing system and a processing method and device of the cloud computing system.

Background

Cloud computing is a product of the convergence of traditional computer technologies and network technologies, such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization, and load balancing. It aims to integrate many relatively low-cost computing entities, through a network, into a single system with powerful computing capability. Distributed caching is one field within cloud computing; its role is to provide distributed storage for massive data together with high-speed read and write access.

A distributed cache system is formed by connecting a plurality of server nodes and clients. The server nodes are responsible for storing data, and a client can write, read, update, and delete data on the servers. Generally, data is not stored on a single server node (hereinafter referred to as a "node"); instead, copies of the same data are stored on multiple nodes and back each other up. The most common storage mode is the master-slave mode, in which one node acts as the master node (master) and the other nodes act as slave nodes (slave); the master node is chosen by election or by another algorithm. To simplify processing, data updates generally occur on the master node, and the slave nodes acquire data from the master node for synchronization. Data access can be served by the master node or by a slave node, depending on the access consistency policy.

In a distributed cache system, the data storage modes are generally classified according to NRW, where N represents the number of copies of data, R represents the number of copies of data obtained in one data access request, and W represents the minimum number of participating nodes in one data update request (i.e., how many nodes complete data update).

When the distributed cache system realizes the persistence function, the data distributed on the server is stored on the disk. In practical situations, if a disk fails, the server cannot provide read-write service. Because the distributed cache system data has the characteristic of a plurality of copies, at this time, as long as other servers are in a normal state, the system can still normally provide read-write service through the copies of other nodes.

If a plurality of disks are attached to a node of the distributed cache system and only one or a few of them are damaged, the server as a whole can no longer provide service normally; as described above, the cluster remains available because the other servers are still working. If a similar failure then occurs on another server, so that this node cannot provide service either, the number of available copies may no longer satisfy the NRW policy, and the distributed cache cluster cannot provide service at all. A typical case is the fairly common configuration NRW = 3/2/2: when two nodes are down and only one node is normal, read and write operations cannot meet the requirement of operating on at least two copies.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a cloud computing system, and a processing method and device of the cloud computing system, which can improve the tolerance of the system to disk faults.

To solve the above technical problem, embodiments of the present invention provide the following technical solutions:

in one aspect, a processing method of a cloud computing system is provided, including:

receiving an operation request of a client to a cloud computing system;

acquiring a data identifier to be operated in the cloud computing system according to the operation request;

searching each disk storing data corresponding to the data identification in each node of the cloud computing system and the state of each disk according to the node disk state report of the cloud computing system; the node disk status report comprises: the state of a disk in each node of the cloud computing system and a data identifier corresponding to data stored in the disk;

and performing corresponding operation according to the state of each disk storing the data corresponding to the data identification in each node in the cloud computing system.

The step of performing corresponding operations according to the states of the disks comprises:

when the operation request is an update request, responding to the update request if the number of disks in the cloud computing system that store the data and are in a normal state is greater than or equal to the minimum number of participating nodes of one data update request predetermined by the cloud computing system, and otherwise rejecting the update request; or

when the operation request is a data access request, responding to the data access request if the number of disks in the cloud computing system that store the data and are in a normal state is greater than or equal to the number of data copies acquired by one data access request predetermined by the cloud computing system, and otherwise rejecting the data access request.
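The acceptance rule above can be sketched as a small function. This is only an illustration, not the patented implementation: the names `Op`, `should_respond`, and `normal_disks` are hypothetical, and the count of normal disks is assumed to have been derived already from the node disk status report.

```python
from enum import Enum


class Op(Enum):
    UPDATE = "update"
    ACCESS = "access"


def should_respond(op: Op, normal_disks: int, r: int, w: int) -> bool:
    """Decide whether a request can be served (hypothetical helper).

    normal_disks: number of disks that store the data and are in a normal state.
    r: number of copies acquired by one data access request (R).
    w: minimum number of participating nodes of one data update request (W).
    """
    required = w if op is Op.UPDATE else r
    return normal_disks >= required


# NRW = 3/2/2, one of the three disks holding the data has failed.
print(should_respond(Op.UPDATE, normal_disks=2, r=2, w=2))  # True
# Two of the three disks holding the data have failed.
print(should_respond(Op.ACCESS, normal_disks=1, r=2, w=2))  # False
```

The decision depends on the number of normal disks holding the requested data, not on the number of nodes that are entirely fault-free.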

When the number of disks in the cloud computing system that store the data and are in a normal state is greater than or equal to the minimum number of participating nodes of one data update request predetermined by the cloud computing system, the step of responding to the update request comprises:

when the operation request is an update request and the disk of the master node that stores the data is in a normal state, the master node of the cloud computing system updates the data on the disk of the master node where the data is located; a slave node of the cloud computing system acquires the data to be synchronized from the master node and updates the data on the disk of the slave node where the data is located;

when the operation request is an update request and the disk of the master node that stores the data is in a failed state, a first slave node of the cloud computing system updates the data on the disk of the first slave node where the data is located; a second slave node of the cloud computing system acquires the data to be synchronized from the first slave node and updates the data on the disk of the second slave node where the data is located; the disks of the first slave node and the second slave node that store the data are in a normal state.
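The two update paths can be sketched as follows. The function name `plan_update` and the `disk_ok` map are hypothetical, and the text does not say how the first slave node is chosen; picking the first slave in list order whose disk is normal is an assumption made only for this sketch.

```python
from typing import Dict, List, Tuple


def plan_update(master: str, slaves: List[str],
                disk_ok: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Return (node that applies the update first, nodes that then sync from it).

    disk_ok[node] is True when the disk on that node holding the data is normal.
    """
    normal_slaves = [s for s in slaves if disk_ok.get(s, False)]
    if disk_ok.get(master, False):
        # Master disk is normal: master updates, slaves synchronize from it.
        return master, normal_slaves
    if not normal_slaves:
        raise RuntimeError("no normal disk holds the data")
    # Master disk has failed: a first slave updates, the others sync from it.
    # Assumption: the first slave is simply the first normal one in list order.
    return normal_slaves[0], normal_slaves[1:]


# Master B is normal; slave A's disk has failed, so only C synchronizes.
print(plan_update("B", ["A", "C"], {"A": False, "B": True, "C": True}))
# Master A's disk has failed; B updates first and C synchronizes from B.
print(plan_update("A", ["B", "C"], {"A": False, "B": True, "C": True}))
```

The sketch only plans the order of updates; whether the request is accepted at all is still governed by the comparison against W.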

When the number of disks in the cloud computing system that store the data and are in a normal state is greater than or equal to the number of data copies acquired by one data access request predetermined by the cloud computing system, the step of responding to the data access request comprises:

when the operation request is a data access request and the disk of the master node that stores the data is in a normal state, acquiring a first copy of the data from the disk of the master node of the cloud computing system where the data is located, and acquiring a second copy of the data from the disk of at least one slave node of the cloud computing system where the data is located; selecting the copy of the latest version from the first copy and the second copy; and sending the copy of the latest version to the client; the disk of the at least one slave node that stores the data is in a normal state;

when the operation request is a data access request and the disk of the master node that stores the data is in a failed state, acquiring a third copy of the data from the disk of at least one slave node of the cloud computing system where the data is located; selecting the copy of the latest version from the at least one third copy, and sending the copy of the latest version to the client; the disk of the at least one slave node that stores the data is in a normal state.
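A minimal sketch of the read path follows. The `Copy` record and `read_latest` function are hypothetical, and representing a version as an integer is an assumption; the text only requires that the latest version be selected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Copy:
    node: str
    version: int          # assumption: versions are comparable integers
    value: str
    disk_ok: bool = True  # state of the disk that holds this copy


def read_latest(copies: List[Copy], r: int) -> Optional[Copy]:
    """Read R copies from normal disks and return the latest one.

    Returns None when fewer than R copies are on normal disks,
    which corresponds to rejecting the data access request.
    """
    usable = [c for c in copies if c.disk_ok]
    if len(usable) < r:
        return None
    return max(usable[:r], key=lambda c: c.version)


copies = [
    Copy("A", 5, "stale", disk_ok=False),  # master copy on a failed disk
    Copy("B", 7, "new"),
    Copy("C", 6, "old"),
]
result = read_latest(copies, r=2)
print(result.node, result.value)  # B new
```

When the master's disk has failed, its copy is skipped and the result is chosen from slave copies only, which matches the second branch above.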

Before the step of receiving the operation request of the client, the method further comprises:

and acquiring a node disk state report of the cloud computing system from a node.

In another aspect, a processing apparatus of a cloud computing system is provided, including:

the first receiving unit is used for receiving an operation request of a client to the cloud computing system;

the acquisition unit is used for acquiring a data identifier to be operated in the cloud computing system according to the operation request;

the searching unit is used for searching each disk for storing the data corresponding to the data identification in each node of the cloud computing system and the state of each disk according to the node disk state report of the cloud computing system; the node disk status report comprises: the state of a disk in each node of the cloud computing system and a data identifier corresponding to data stored in the disk;

and the operation unit is used for carrying out corresponding operation according to the state of each disk storing the data corresponding to the data identification in each node in the cloud computing system.

The operation unit includes:

the first response subunit is used for responding to the update request when the operation request is an update request and the number of disks in the cloud computing system that store the data and are in a normal state is greater than or equal to the minimum number of participating nodes of one data update request predetermined by the cloud computing system;

the first rejecting subunit is used for rejecting the update request when the number of disks in the cloud computing system that store the data and are in a normal state is less than the minimum number of participating nodes of one data update request predetermined by the cloud computing system;

the second response subunit is used for responding to the data access request when the operation request is a data access request and the number of disks in the cloud computing system that store the data and are in a normal state is greater than or equal to the number of data copies acquired by one data access request predetermined by the cloud computing system;

and the second rejecting subunit is used for rejecting the data access request when the number of disks in the cloud computing system that store the data and are in a normal state is less than the number of data copies acquired by one data access request predetermined by the cloud computing system.

The apparatus further includes:

the second receiving unit is used for receiving the node disk status report of the cloud computing system from the node.

In another aspect, a cloud computing system is provided, including: the system comprises a client, a processing device, nodes and disks corresponding to the nodes;

the processing device receives an operation request of the client to the cloud computing system; acquiring a data identifier to be operated in the cloud computing system according to the operation request; searching each disk storing data corresponding to the data identifier in each node of the cloud computing system and the state of each disk according to the node disk state report of the cloud computing system; the node disk status report comprises: the state of the disk in each node of the cloud computing system and a data identifier corresponding to data stored in the disk; and performing corresponding operation according to the state of each disk storing the data corresponding to the data identification in each node in the cloud computing system.

The node sends a node disk status report to the processing device.

The technical scheme of the invention has the following beneficial effects:

aiming at the distributed cache system, the invention can fully utilize available resources under the condition of disk damage, integrate copy resources meeting the requirements of consistency and availability, improve the availability of the system as much as possible and improve the tolerance of the system to faults.

Drawings

Fig. 1 is a schematic flow chart of a processing method of a cloud computing system according to the present invention;

fig. 2 is a schematic structural diagram of a processing device of a cloud computing system according to the present invention;

FIG. 3 is a schematic structural diagram of a cloud computing system according to the present invention;

fig. 4 and fig. 5 are schematic structural diagrams of an application scenario of the cloud computing system according to the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

As shown in fig. 1, a processing method of a cloud computing system according to the present invention includes:

step 11, receiving an operation request of a client to a cloud computing system; the operation request may be a data update request or a data access request, etc.

Step 12, acquiring a data identifier to be operated in the cloud computing system according to the operation request; for example, the operation request is to update copy 1 in fig. 4, and copy 1 is the data identifier.

Step 13, searching for each disk that stores data corresponding to the data identifier in each node of the cloud computing system, and the state of each such disk, according to the node disk status report of the cloud computing system; the node disk status report comprises: the state of a disk in each node of the cloud computing system and a data identifier corresponding to data stored in the disk. A disk state is either normal or failed; in fig. 4, the disk status report of node A is: (node A: disk I, replica 1, failed; disk II, replica 2, normal; disk III, replica 3, normal).

And step 14, performing corresponding operation according to the state of each disk storing the data corresponding to the data identifier in each node in the cloud computing system.

Prior to step 14, the method further comprises:

Step 10, acquiring a node disk status report of the cloud computing system from a node. A node sends the report when it detects that a disk storing data is damaged or has failed, or sends the report in response to a request.
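The lookup in step 13 can be illustrated with a small data structure. The layout below (a dictionary of per-node entry lists) and the name `find_disks` are assumptions for illustration. The entries for node A follow the fig. 4 report quoted above; nodes B and C list only the disks that the second application scenario states hold replica 1.

```python
from typing import Dict, List, Tuple

# reports[node] = list of (disk, data identifier, disk state)
Report = Dict[str, List[Tuple[str, str, str]]]

reports: Report = {
    "A": [("I", "replica 1", "failed"),
          ("II", "replica 2", "normal"),
          ("III", "replica 3", "normal")],
    "B": [("I", "replica 1", "normal")],
    "C": [("III", "replica 1", "normal")],
}


def find_disks(reports: Report, data_id: str) -> List[Tuple[str, str, str]]:
    """Return (node, disk, state) for every disk that stores data_id."""
    return [(node, disk, state)
            for node, entries in reports.items()
            for disk, ident, state in entries
            if ident == data_id]


hits = find_disks(reports, "replica 1")
print(hits)
# Number of normal disks holding the data, as used in step 14.
print(sum(state == "normal" for _, _, state in hits))  # 2
```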

Wherein, step 14 includes:

The method specifically comprises the following steps:

When the operation request is a data access request and the disk of the master node that stores the data is in a normal state, acquiring a first copy of the data from the disk of the master node of the cloud computing system where the data is located, and acquiring a second copy of the data from the disk of at least one slave node (or two or three slave nodes, set according to actual conditions) of the cloud computing system where the data is located; selecting the copy of the latest version from the first copy and the second copy; and sending the copy of the latest version to the client; the disk of the at least one slave node that stores the data is in a normal state;

For example, fig. 5 shows a distributed cache storage system consisting of 3 nodes, in which each piece of data has three copies and data is updated and accessed with NRW = 3/2/2. The number of copies accessed by a read request specified by the cloud computing system is 2. When one of the disks holding a piece of data fails, the cloud computing system can still respond to update and data access requests for that data; when two of the disks holding the same data fail, the cloud computing system cannot respond to requests for that data.

In the invention, when disks fail on a node, or even on several nodes at the same time, the system can guarantee consistency and availability as long as the number of copies on the remaining usable disks in the cluster satisfies the NRW policy. Service for all data may even remain unaffected. The system does not become entirely unable to provide service, and service is provided to the greatest extent possible.

Of course, continuing to provide service while some disks are damaged raises the problem of restoring data after a disk is recovered. This can be handled by the data recovery function of the distributed cache, that is, by obtaining copy data from other nodes to repair the recovered disk.

As shown in fig. 2, the processing apparatus of a cloud computing system according to the present invention includes:

a first receiving unit 21, which receives an operation request of a client to the cloud computing system;

the obtaining unit 22 is used for obtaining a data identifier to be operated in the cloud computing system according to the operation request;

the searching unit 23 is configured to search, according to the node disk state report of the cloud computing system, each disk storing data corresponding to the data identifier in each node of the cloud computing system and a state of each disk; the node disk status report comprises: the state of a disk in each node of the cloud computing system and a data identifier corresponding to data stored in the disk;

the operation unit 24 performs corresponding operations according to the states of the disks storing the data corresponding to the data identifier in each node in the cloud computing system.

The operation unit 24 includes:

The apparatus further includes:

the second receiving unit 25 receives a node disk status report of the cloud computing system from a node.

As shown in fig. 3, the cloud computing system according to the present invention includes: a client 31, a processing device 32, a node 33, and a disk 34 corresponding to the node 33;

the processing device 32 receives an operation request of the client 31 to the cloud computing system; acquiring a data identifier to be operated in the cloud computing system according to the operation request; searching for a disk storing data corresponding to the data identifier in each node 33 of the cloud computing system and the state of each disk 34 according to the node disk state report of the cloud computing system; the node disk status report comprises: the state of the disk in each node 33 of the cloud computing system and the data identifier corresponding to the data stored in the disk; and performing corresponding operation according to the state of each disk 34 storing the data corresponding to the data identifier in each node in the cloud computing system.

The node 33 sends a node disk status report to the processing device 32.

Two application scenarios of the present invention are described below.

The first application scenario describes a method for achieving availability when nodes have multiple disk paths in a cloud computing distributed cache system.

The method comprises the following steps, under these conditions: the client establishes connections with a plurality of server nodes in the distributed cache system; the server nodes establish connections with each other and operate normally; each server is provided with a plurality of disks for data persistence, and different data fragments are persisted on different disks. The number of data copies is N, the number of copies accessed by a read request is R, the minimum number of copies that a write request must update successfully is W, and the maximum fault tolerance of the system for a single request is O (requests are allowed to fail on O nodes; for a single point of failure, O is 1; O < W). The consistency requirement is W + R > N.
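The consistency requirement W + R > N can be checked directly. The sketch below uses hypothetical function names; it also computes the smallest possible overlap between any set of R copies that is read and any set of W copies that was written, which is R + W - N when that value is positive. A positive overlap is why a read is guaranteed to see at least one copy touched by the last successful update.

```python
def meets_consistency(n: int, r: int, w: int) -> bool:
    """True when the NRW configuration satisfies W + R > N."""
    return w + r > n


def min_overlap(n: int, r: int, w: int) -> int:
    """Smallest number of copies shared by any R-copy read set and W-copy write set."""
    return max(0, r + w - n)


for n, r, w in [(3, 2, 2), (3, 1, 1)]:
    print((n, r, w), meets_consistency(n, r, w), min_overlap(n, r, w))
# (3, 2, 2) True 1
# (3, 1, 1) False 0
```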

Step A: when the system is in its normal condition, all disks on every node work normally and each piece of data has N copies in the system. When a client initiates a data update request, the master updates the data on the disk where the data is located; each slave synchronizes the data from the master and updates the data on its own disk where the data is located; when the data update has been completed successfully on W nodes, a data update success message is returned to the client;

when a client initiates a data access request, the master and slaves process the request; after copies of the accessed data have been acquired from the disks of R nodes where the data is located, the latest copy is selected from the R copies and returned to the client.

Step B: when node A starts up, it finds that access to a certain disk fails while the other disks are still normal; or, while node A is running, access to a certain disk fails multiple times and the disk is judged to have failed. Node A does not switch to a node-failure state; it continues to provide read-write service and records the identifier of the failed disk and the data copies stored on that disk.

Step C: when a client initiates a data update request and the data happens to be located on the failed disk of node A from step B, node A directly returns failure when the update reaches it; when the data update has been completed successfully on W nodes (which do not include node A), a data update success message is returned to the client;

when a client initiates a data access request for such data, node A directly returns failure; the master and slaves process the request, and after copies of the accessed data have been acquired from the disks of R nodes (which do not include node A) where the data is located, the latest copy is selected from the R copies and returned to the client.

Step D: when a client initiates a data update or access request and the data is not located on the failed disk of node A from step B, the request is processed in the same way as in step A.

Step E: while node B is running, access to a certain disk fails multiple times and the disk is judged to have failed. Node B does not switch to a node-failure state; it continues to provide read-write service and records the identifier of the failed disk and the data copies stored on that disk.

Assume that the copies stored on the failed disk of node B do not coincide with the copies stored on the failed disk of node A. The following steps continue under this assumption.

Step F: when a client initiates a data update request and the data happens to be located on the failed disk of node B from step E (and therefore, under the above assumption, not on the failed disk of node A from step B), node B directly returns failure when the update reaches it; when the data update has been completed successfully on W nodes (which do not include node B), a data update success message is returned to the client;

when a client initiates a data access request for such data, node B directly returns failure; the master and slaves process the request, copies of the accessed data are acquired from the disks of R nodes (which do not include node B) where the data is located, and the latest copy is selected from the R copies and returned to the client.

Step G: when a client initiates a data update or access request and the data is located on the failed disk of node A from step B (and therefore, under the above assumption, not on the failed disk of node B from step E), the request is processed in the same way as in step C, and the data can be updated and accessed normally.
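The node-side behavior in steps B through G can be sketched as a small class. Everything here is illustrative: the class `Node`, its method names, and the threshold `max_failures` are hypothetical, since the text says only that a disk is judged to have failed after multiple access failures.

```python
from typing import Dict, Set


class Node:
    def __init__(self, name: str, placement: Dict[str, str], max_failures: int = 3):
        self.name = name
        self.placement = placement        # data identifier -> disk holding it
        self.max_failures = max_failures  # assumed threshold, not from the text
        self.failure_counts: Dict[str, int] = {}
        self.failed_disks: Set[str] = set()

    def record_access_failure(self, disk: str) -> None:
        """Count a failed disk access; mark the disk failed at the threshold."""
        self.failure_counts[disk] = self.failure_counts.get(disk, 0) + 1
        if self.failure_counts[disk] >= self.max_failures:
            self.failed_disks.add(disk)

    def can_serve(self, data_id: str) -> bool:
        """False means the node directly returns failure for this data."""
        return self.placement[data_id] not in self.failed_disks

    def unavailable_data(self) -> Set[str]:
        """Data identifiers recorded against failed disks."""
        return {d for d, disk in self.placement.items() if disk in self.failed_disks}


node_a = Node("A", {"copy 1": "I", "copy 2": "II", "copy 3": "III"})
for _ in range(3):
    node_a.record_access_failure("I")
print(node_a.can_serve("copy 1"))  # False: node A returns failure (step C)
print(node_a.can_serve("copy 2"))  # True: handled as in the normal case (step D)
```

The node stays in service after a disk failure and only refuses requests for the data recorded against the failed disk.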

The invention provides a method for improving the availability of a distributed cache system when multiple disks are damaged. Availability is enhanced while consistency remains unchanged, which improves the application experience.

The second application scenario is described below in conjunction with fig. 4 and 5.

Specifically, for a master-slave storage system operating with NRW = 3/2/2, the following describes in detail how availability is achieved when a single node has disk damage and when multiple nodes have disk damage at the same time.

A distributed cache system is formed by server nodes and clients. For a specific piece of data, a master node (master) is responsible for processing update and access requests from clients, and a plurality of slave nodes (slave) synchronize the data from the master and receive data access requests from clients (a slave does not process data update requests).

Environment: a distributed cache storage system consisting of 3 nodes, in which each piece of data has three copies and data is updated and accessed with NRW = 3/2/2.

The invention comprises the following steps:

Step 1, in the initial normal stage, the system receives a client request; assume that the data is located in copy 1 (equivalent to the data identifier) on disk I of node A, copy 1 on disk I of node B, and copy 1 on disk III of node C. For simplicity of description, assume that copy 1 on node B is the master and the copies on the other two nodes are slaves; copy 2 on node A is the master and the copies on the other two nodes are slaves; copy 3 on node A is the master and the copies on the other two nodes are slaves.

Step 2, when a client initiates a data update request, the master on node B updates the data in copy 1 on disk I; each slave synchronizes the data from the master and updates the data on its own disk where the data is located; when the data update has been completed successfully on W = 2 nodes, a data update success message is returned to the client. Because all disks are normal, all copies are in fact updated successfully. When a client initiates a data access request, all three nodes process the request; after copies of the accessed data have been acquired from the disks of R = 2 nodes where the data is located, the data is returned to the client, although in fact the copies on all nodes are read successfully.

Step 3, as shown in fig. 4, assume that disk I on node A is damaged, so that copy 1 on node A is unusable. When the data of an update request initiated by a client is located in copy 1, the master on node B updates the data in copy 1 on disk I, and the slave on node C synchronizes the data from the master and updates the copy on disk III of node C; once the data update has been completed successfully on W = 2 nodes, a data update success message is returned to the client;

when the data of a data access request initiated by the client is located in copy 1, node A directly returns failure; after the data has been obtained from copy 1 on node B and on node C (satisfying R = 2), it is returned to the client.

Step 4, in the situation of step 3, when the data of an update or access request initiated by the client is located in copy 2 or copy 3, the processing flow is the same as in step 2, because the copies on all three nodes are available.

Step 5, as shown in fig. 5, when disk II on node B is also damaged, copy 3 on node B is unavailable. When the data of an update or access request initiated by the client is located in copy 1, the copies on node B and node C are both available and the NRW policy is satisfied, so the processing flow is the same as in step 3.

Step 6, in the situation of step 5, when the data of an update or access request initiated by the client is located in copy 2, the processing flow is the same as in step 2, because copy 2 is available on all three nodes.

Step 7, in the situation of step 5, when the data of an update request initiated by the client is located in copy 3, copy 3 on node B is damaged and copy 3 on node C is available. The master on node A updates the data in copy 3 on disk III, and the slave on node C synchronizes the data from the master and updates the data in copy 3 on disk II of node C; once the data update has been completed successfully on W = 2 nodes, a data update success message is returned to the client;

when the data of a data access request initiated by the client is located in copy 3, node B directly returns failure; after the data has been obtained from copy 3 on node A and on node C (satisfying R = 2), it is returned to the client.

As can be seen from the above, even if node A and node B both have disk damage, the distributed cache cluster can provide read-write service for all data as long as the damaged disks do not hold copies of the same data.

In the above application scenario, there are two faulty nodes, but each of them has in fact lost only some of its disks. In the optimistic case, where the damaged disks store copies of different data, at least two copies of every piece of data remain on usable disks in the system, which fully satisfies the conditions for providing all services normally. Even if copies of the same data happen to be stored on the damaged disks, the data on the other disks still satisfies consistency and availability and can be read and written; only the data whose copies were damaged at the same time cannot be accessed.
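The second scenario can be replayed in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the disks holding copy 2 on nodes B and C are not stated in the text, so the values marked below are inferred placeholders. The other placements follow steps 1 to 7.

```python
from typing import Dict, Set, Tuple

# PLACEMENT[data identifier][node] = disk holding that copy
PLACEMENT: Dict[str, Dict[str, str]] = {
    "copy 1": {"A": "I", "B": "I", "C": "III"},
    "copy 2": {"A": "II", "B": "III", "C": "I"},  # B and C disks are assumed
    "copy 3": {"A": "III", "B": "II", "C": "II"},
}


def normal_copies(failed: Set[Tuple[str, str]]) -> Dict[str, int]:
    """Count, per data identifier, the copies that sit on normal disks."""
    return {
        data_id: sum((node, disk) not in failed for node, disk in disks.items())
        for data_id, disks in PLACEMENT.items()
    }


def served(failed: Set[Tuple[str, str]], r: int = 2, w: int = 2) -> Dict[str, bool]:
    """True when both reads (R) and updates (W) can still be served."""
    return {d: n >= max(r, w) for d, n in normal_copies(failed).items()}


fig5 = {("A", "I"), ("B", "II")}  # damaged disks hold different data
print(normal_copies(fig5))  # {'copy 1': 2, 'copy 2': 3, 'copy 3': 2}
print(served(fig5))         # every entry is True

same_data = {("A", "I"), ("B", "I")}  # both damaged disks hold copy 1
print(served(same_data))  # only 'copy 1' is False
```

With NRW = 3/2/2, two damaged disks on two different nodes leave every piece of data served unless both disks hold copies of the same data, in which case only that data becomes unavailable.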

The invention has the following beneficial effects:

For the distributed cache system, the invention makes full use of the resources that remain available when disks are damaged, integrates the copy resources that satisfy the consistency and availability requirements, improves the availability of the system as far as possible, and improves the tolerance of the system to faults. That is, the invention provides a disk and data management mechanism for distributed cache systems in the cloud computing field, so that even when some of a node's disks fail, the data on the usable disks is used as far as possible, the capability of providing service is maintained, and the server side provides a consistent and available storage service with fewer disk or data resources.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A processing method of a cloud computing system is characterized by comprising the following steps:

receiving an operation request of a client to a cloud computing system;

performing corresponding operation according to the state of each disk storing the data corresponding to the data identifier in each node in the cloud computing system;

wherein, the step of performing corresponding operations according to the state of each disk storing the data corresponding to the data identifier in each node in the cloud computing system includes:

2. The method according to claim 1, wherein when the number of disks in the cloud computing system that store the data and are in a normal state is greater than or equal to the minimum number of participating nodes of one data update request predetermined by the cloud computing system, the step of responding to the update request comprises:

when the operation request is an update request and the state of a disk of a master node storing the data is a fault, a first slave node of the cloud computing system updates the data to the disk of the first slave node where the data is located; a second slave node of the cloud computing system acquires data to be synchronized from the first slave node; the second slave node updates data to a disk where the data of the second slave node is located; the state of the disks of the first slave node and the second slave node, which store the data, is normal.

3. The method according to claim 2, wherein when the number of the disks in the cloud computing system storing the data and in a normal state is greater than or equal to the number of data copies acquired by one data access request predetermined by the cloud computing system, the step of responding to the data access request comprises:

4. The method of claim 1, wherein the step of receiving the operation request of the client is preceded by the method further comprising:

5. A processing apparatus of a cloud computing system, comprising:

the operation unit is used for carrying out corresponding operation according to the state of each disk storing data corresponding to the data identification in each node in the cloud computing system;

wherein the operation unit includes:

6. The apparatus of claim 5, further comprising:

7. A cloud computing system, comprising: the system comprises a client, a processing device, nodes and disks corresponding to the nodes;

the processing device receives an operation request of the client to the cloud computing system; acquiring a data identifier to be operated in the cloud computing system according to the operation request; searching for a disk storing data corresponding to the data identifier in each node of the cloud computing system and the state of each disk according to the node disk state report of the cloud computing system; the node disk status report comprises: the state of the disk in each node of the cloud computing system and a data identifier corresponding to data stored in the disk; performing corresponding operation according to the state of each disk storing the data corresponding to the data identifier in each node in the cloud computing system;

8. The system of claim 7, wherein the node sends a node disk status report to the processing device.