CN102088490A - Data storage method, device and system - Google Patents

Data storage method, device and system

Info

Publication number
CN102088490A
CN102088490A (application number CN201110021715)
Authority
CN
China
Prior art keywords
primary volume
volume
virtual block
data
write operation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011100217151A
Other languages
Chinese (zh)
Other versions
CN102088490B (en)
Inventor
周文明 (Zhou Wenming)
钟炎培 (Zhong Yanpei)
吴清 (Wu Qing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou biological Polytron Technologies Inc
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN 201110021715 priority Critical patent/CN102088490B/en
Publication of CN102088490A publication Critical patent/CN102088490A/en
Priority to PCT/CN2011/078476 priority patent/WO2012097588A1/en
Application granted granted Critical
Publication of CN102088490B publication Critical patent/CN102088490B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
    • G06F11/2071Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring using a plurality of controllers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0617Improving the reliability of storage systems in relation to availability
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/065Replication mechanisms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • H04L69/161Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields
    • H04L69/162Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields involving adaptations of sockets based mechanisms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a data storage method, device, and system. The data storage method comprises the following steps: receiving data to be written over a socket connection between a primary volume and a virtual block storage client; writing the data to be written into a volume file of the primary volume; and sending the data to be written to a backup volume over a socket connection between the primary volume and the backup volume, so that the backup volume, after writing the data into a volume file of the backup volume, reports the result of the write operation to the kernel of the virtual block storage client. The data storage method, device, and system provided by the invention improve storage reliability and reduce message traffic; in addition, the backup volume can take over part of the primary volume's load, achieving dynamic load balancing.

Description

Data storage method, device and system
Technical field
The embodiments of the invention relate to the field of communication technology, and in particular to a data storage method, device, and system.
Background
Network Block Device (NBD) is a technology that exports a file or block device on a server, over a TCP (Transmission Control Protocol)/IP (Internet Protocol) network, as an abstract device for a client to use. The corresponding software is Linux-based network storage software, with which a Linux network storage system can be built. As a storage system, especially one serving enterprise solutions, such a system is complex and has high performance and reliability requirements. Existing NBD, however, only performs ordinary network transmission and takes no account of network or storage-node failures.
For example, an NBD client is deployed on local server A together with a virtual NBD device nbd1, and an NBD server is deployed on remote server B. If a virtual machine created on nbd1 is running and a network or storage-node failure makes it impossible to read data from the NBD server, the virtual machine is shut down.
To solve the above reliability problem, the prior art proposes a RAID1 (Redundant Array of Independent Disks 1) scheme, in which several hard disks on a single underlying storage node form a RAID1 array for the NBD server to use.
The inventors found, however, that the RAID1 scheme has at least the following shortcomings:
(1) RAID1 cannot store data across nodes; it only lowers the probability of a single-node failure, and if the RAID card fails, the upper-layer service program becomes unavailable;
(2) it cannot solve service unavailability caused by network failures;
(3) for the storage node, the amount of data is multiplied after passing through the RAID card, which greatly increases the node's load.
To solve the above reliability problem, the prior art also provides another scheme, a primary volume and backup volume scheme. Specifically, when the upper-layer service program issues a write I (Input)/O (Output), the data is first written to the primary volume; the primary volume then passes the data to the backup volume; after the backup volume has finished writing, it reports the write I/O result to the primary volume; finally, the primary volume reports the write I/O result to the upper-layer service program.
The primary volume and backup volume scheme separates the primary and the backup physically; compared with the RAID1 scheme, data is not damaged by a single-node failure, further improving reliability. The inventors found, however, that the scheme still has the following shortcomings:
(1) the backup volume is used only for backing up data; during system operation all load is on the primary volume, that is, the node hosting the primary volume becomes the I/O bottleneck;
(2) one write I/O requires four message exchanges, so message traffic is large.
Summary of the invention
The embodiments of the invention provide a data storage method, device, and system to improve storage reliability and reduce message traffic.
An embodiment of the invention provides a data storage method, comprising:
receiving data to be written over a socket connection between a primary volume and a virtual block storage client; and
writing the data to be written into the volume file of the primary volume, and sending the data to be written to a backup volume over a socket connection between the primary volume and the backup volume, so that the backup volume, after writing the data into the volume file of the backup volume, reports the result of the write operation to the kernel of the virtual block storage client.
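The write path in the two method steps above can be sketched as follows, with the socket transport replaced by direct method calls; the class and method names are illustrative, not from the patent:

```python
class BackupVolume:
    """Backup volume: writes forwarded data, then reports to the client kernel."""

    def __init__(self):
        self.volume_file = bytearray()
        self.reported = []  # write results reported to the client's kernel

    def receive_from_primary(self, offset, data):
        end = offset + len(data)
        if len(self.volume_file) < end:
            self.volume_file.extend(b"\x00" * (end - len(self.volume_file)))
        self.volume_file[offset:end] = data
        # The backup, not the primary, reports the write result to the
        # client kernel; this is what saves one message per write.
        self.reported.append(("write_ok", offset, len(data)))


class PrimaryVolume:
    """Primary volume: writes locally, then forwards the data to the backup."""

    def __init__(self, backup):
        self.volume_file = bytearray()
        self.backup = backup

    def handle_write(self, offset, data):
        end = offset + len(data)
        if len(self.volume_file) < end:
            self.volume_file.extend(b"\x00" * (end - len(self.volume_file)))
        self.volume_file[offset:end] = data             # write to own volume file
        self.backup.receive_from_primary(offset, data)  # forward to the backup
```

Note that the primary never sees the write result here; in the patent's scheme the result travels directly from the backup volume to the client kernel.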
An embodiment of the invention also provides a primary volume node device, comprising:
a receiving module, configured to receive data to be written over a socket connection between the primary volume node device and a virtual block storage client; and
a writing module, configured to write the data received by the receiving module into the volume file of the primary volume node device, and to send the data to a backup volume node device over a socket connection between the primary volume node device and the backup volume node device, so that the backup volume node device, after writing the data into its volume file, reports the result of the write operation to the kernel of the virtual block storage client.
An embodiment of the invention also provides a virtual block storage client device, comprising:
a connection establishing module, configured to establish a socket connection with a pre-assigned backup volume according to the listening port of the pre-assigned backup volume, and to establish a socket connection with a pre-assigned primary volume according to the listening port of the pre-assigned primary volume;
an acquiring module, configured to obtain the volume size, check point, and solo bitmap of the pre-assigned backup volume, and the volume size, check point, and solo bitmap of the pre-assigned primary volume;
a comparing module, configured to compare the check points of the pre-assigned primary volume and the pre-assigned backup volume;
a determining module, configured to determine, according to the comparison result of the comparing module, that the volume with the newest check point is the real primary volume and the volume with the second-newest check point is the real backup volume;
a registering module, configured to register the roles of the real primary volume and the real backup volume with the kernel of the virtual block storage client device, and, when the link between the real primary volume and the real backup volume is normal, to register with the kernel the descriptor of the socket connection established by the connection establishing module with the pre-assigned backup volume and the descriptor of the socket connection established with the pre-assigned primary volume; and
a calling module, configured to call a system function to enter a kernel-state thread and handle, in the kernel-state thread, write requests sent by the upper-layer service program.
An embodiment of the invention also provides a backup volume node device, comprising:
a data receiving module, configured to receive data to be written that a primary volume node device sends over a socket connection between the primary volume node device and the backup volume node device;
a data writing module, configured to write the data received by the data receiving module into the volume file of the backup volume node device; and
a result reporting module, configured to report the result of the write operation to the kernel of a virtual block storage client device.
An embodiment of the invention also provides a storage system, comprising the above primary volume node device, the above virtual block storage client device, and the above backup volume node device.
Through the embodiments of the invention, after the primary volume receives data to be written over the socket connection between the primary volume and the virtual block storage client, it writes the data into its volume file and sends the data to the backup volume over the socket connection between the primary volume and the backup volume, so that the backup volume writes the data into its own volume file; this improves storage reliability. After the backup volume has written the data into its volume file, the backup volume itself reports the result of the write operation to the kernel of the virtual block storage client; this reduces message traffic, and the backup volume takes over part of the primary volume's load, achieving dynamic load balancing.
Brief description of the drawings
To describe the technical solutions in the embodiments of the invention or in the prior art more clearly, the accompanying drawings needed for the description of the embodiments or the prior art are introduced briefly below. Evidently, the drawings described below show some embodiments of the invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flow chart of an embodiment of the data storage method of the invention;
Fig. 2 is a schematic diagram of an embodiment of the network architecture of the invention;
Fig. 3 is a flow chart of an embodiment of establishing the triangle model of the invention;
Fig. 4 is a flow chart of another embodiment of the data storage method of the invention;
Fig. 5 is a schematic diagram of another embodiment of the network architecture of the invention;
Fig. 6 is a flow chart of another embodiment of the data storage method of the invention;
Fig. 7 is a schematic diagram of another embodiment of the network architecture of the invention;
Fig. 8 is a flow chart of another embodiment of the data storage method of the invention;
Fig. 9 is a schematic diagram of another embodiment of the network architecture of the invention;
Fig. 10 is a schematic structural diagram of an embodiment of the primary volume node device of the invention;
Fig. 11 is a schematic structural diagram of another embodiment of the primary volume node device of the invention;
Fig. 12 is a schematic structural diagram of an embodiment of the virtual block storage client device of the invention;
Fig. 13 is a schematic structural diagram of another embodiment of the virtual block storage client device of the invention;
Fig. 14 is a schematic structural diagram of an embodiment of the backup volume node device of the invention;
Fig. 15 is a schematic structural diagram of another embodiment of the backup volume node device of the invention;
Fig. 16 is a schematic structural diagram of an embodiment of the storage system of the invention;
Fig. 17 is a schematic diagram of an embodiment of the cloud storage system of the invention.
Embodiment
To make the objectives, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
Fig. 1 is a flow chart of an embodiment of the data storage method of the invention. As shown in Fig. 1, the data storage method may comprise:
Step 101: receive data to be written over the socket connection between the primary volume and the virtual block storage client.
Step 102: write the data to be written into the volume file of the primary volume, and send the data to the backup volume over the socket connection between the primary volume and the backup volume, so that the backup volume, after writing the data into its volume file, reports the result of the write operation to the kernel of the virtual block storage client.
In this embodiment, before receiving the data to be written over the socket connection between the primary volume and the virtual block storage client, the primary volume may also receive a write request sent by the kernel of the virtual block storage client and forward it to the backup volume; the write request notifies the primary volume and/or the backup volume to prepare to receive the data to be written. The kernel of the virtual block storage client obtains the write request from the request queue registered with the kernel; the write requests in that queue are placed there by the virtual block storage client after it receives them from the upper-layer service program.
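The request-queue hand-off described above can be sketched roughly as follows; the queue API and the request format are assumptions made for illustration only:

```python
from collections import deque


class VbsClient:
    """Client side: write requests from the service program go into the queue
    registered with the kernel; the kernel pulls them and sends them to the
    primary volume (which would forward them to the backup volume)."""

    def __init__(self):
        self.request_queue = deque()   # the queue registered with the kernel
        self.sent_to_primary = []      # requests the kernel has forwarded

    def on_service_write_request(self, request):
        # The client receives a write request from the upper-layer service
        # program and puts it into the kernel-registered request queue.
        self.request_queue.append(request)

    def kernel_dispatch(self):
        # The kernel obtains requests from the registered queue and sends
        # each one to the primary volume.
        while self.request_queue:
            self.sent_to_primary.append(self.request_queue.popleft())
```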
In this embodiment, after the backup volume has reported the result of the write operation to the kernel of the virtual block storage client, the primary volume may send a check point record request to the backup volume so that the primary volume and the backup volume each update their own check point.
In addition, after the backup volume has reported the result of the write operation to the kernel of the virtual block storage client, the primary volume may also check whether there is dirty data in the dirty data block list; if there is, and a predetermined condition is met, the dirty data is written to disk. Here, dirty data is data temporarily held in a memory buffer that has not yet been written to the volume file; the dirty data block list records which data blocks are dirty.
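The dirty-data bookkeeping above can be illustrated with a minimal sketch; the flush threshold below stands in for the patent's unspecified "predetermined condition", and all names are assumptions:

```python
BLOCK_FLUSH_THRESHOLD = 2  # assumed stand-in for the "predetermined condition"


class DirtyBlockCache:
    def __init__(self):
        self.buffer = {}               # block number -> data held only in memory
        self.dirty_block_list = set()  # records which data blocks are dirty
        self.disk = {}                 # stand-in for the on-disk volume file

    def write(self, block_no, data):
        # Dirty data sits in the memory buffer, not yet in the volume file.
        self.buffer[block_no] = data
        self.dirty_block_list.add(block_no)
        if len(self.dirty_block_list) >= BLOCK_FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        # Condition met: write the dirty data to disk and clear the list.
        for block_no in sorted(self.dirty_block_list):
            self.disk[block_no] = self.buffer.pop(block_no)
        self.dirty_block_list.clear()
```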
Further, in this embodiment, before the primary volume receives the write request sent by the kernel of the virtual block storage client, the client may establish a socket connection with the pre-assigned backup volume according to the backup volume's listening port and obtain the backup volume's size, check point, and solo bitmap; the client may likewise establish a socket connection with the pre-assigned primary volume according to the primary volume's listening port and obtain its size, check point, and solo bitmap. The client then compares the check points of the pre-assigned primary volume and the pre-assigned backup volume: the volume with the newest check point is the real primary volume, and the volume with the second-newest check point is the real backup volume. Afterwards, the client registers the roles of the real primary volume and the real backup volume with its kernel. When the link between the real primary volume and the real backup volume is normal, the client registers with its kernel the descriptor of the socket connection established with the pre-assigned backup volume and the descriptor of the socket connection established with the pre-assigned primary volume. The client then calls a system function, for example ioctl, to enter a kernel-state thread, in which write requests sent by the upper-layer service program are handled.
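The check-point comparison at the heart of the set-up flow above can be sketched as a small function; the dict fields are illustrative, not the patent's wire format:

```python
def choose_roles(preassigned_pv, preassigned_bv):
    """Pick the real primary and backup volumes by newest check point.

    Each argument is a dict with at least 'name' and 'check_point' (a number
    or timestamp; larger means newer). Returns (real_pv, real_bv).
    """
    if preassigned_bv["check_point"] > preassigned_pv["check_point"]:
        # The backup holds the newer data, so the previous primary must have
        # failed: the client kernel performs a primary/backup switchover.
        return preassigned_bv, preassigned_pv
    return preassigned_pv, preassigned_bv
```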
When the link between the real primary volume and the real backup volume is normal but their data is out of sync, the virtual block storage client may, before registering the two socket descriptors with its kernel, send a data synchronization request to the real primary volume so that the real primary volume and the real backup volume synchronize their data.
In one implementation of this embodiment, when the primary volume fails, the kernel of the virtual block storage client performs a primary/backup switchover and registers the backup volume as the new primary volume. The new primary volume then receives data to be written over the socket connection between itself and the virtual block storage client, writes the data into its volume file, updates its check point and solo bitmap, and reports the result of the write operation to the kernel of the virtual block storage client.
In another implementation of this embodiment, when the network between the virtual block storage client and the backup volume fails, the primary volume may receive a link-abnormal message sent by the virtual block storage client and forward it to the backup volume, so that the backup volume sends the result of the write operation to the primary volume, which then sends the result on to the kernel of the virtual block storage client. The link-abnormal message is delivered by the heartbeat process of the virtual block storage client.
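The fall-back path above (routing the backup's write result through the primary when the client-to-backup link is down) can be sketched like this; the boolean link flag and the callables stand in for the heartbeat's link-abnormal message and the socket sends:

```python
def backup_report(link_client_ok, send_to_client, send_to_primary):
    """Backup-volume side: report the write result directly to the client
    kernel if its link is normal, otherwise relay it via the primary volume."""
    if link_client_ok:
        send_to_client("write_ok")
        return "direct"
    # A link-abnormal message was forwarded by the primary: send the result
    # to the primary, which passes it on to the client kernel.
    send_to_primary("write_ok")
    return "via_primary"
```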
In yet another implementation of this embodiment, when the network between the primary volume and the backup volume fails, the primary volume may, after writing the data to be written into its volume file, update its check point and solo bitmap, and then itself report the result of the write operation to the kernel of the virtual block storage client.
In this embodiment, after receiving the result of a write operation, the kernel of the virtual block storage client may first determine whether the result corresponds to a write request that has been sent. If it does, the kernel sends the result to the upper-layer service program; if it does not, the kernel may discard the result, or cache it without processing it; the embodiments of the invention place no limit on this. The embodiments are described by taking as an example the case where the kernel discards a result that does not correspond to any sent write request.
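The matching rule above can be sketched as follows; the request identifiers are an assumed mechanism for pairing results with sent requests:

```python
class ClientKernel:
    """Client kernel: pass a write result up only if it matches a sent request."""

    def __init__(self):
        self.outstanding = set()  # ids of write requests already sent out
        self.delivered = []       # results passed to the upper-layer service

    def send_write_request(self, req_id):
        self.outstanding.add(req_id)

    def on_write_result(self, req_id, result):
        if req_id in self.outstanding:      # corresponds to a sent request
            self.outstanding.remove(req_id)
            self.delivered.append((req_id, result))
            return True
        return False                        # no match: discard the result
```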
In the above embodiment, after the primary volume receives the data to be written over the socket connection between itself and the virtual block storage client, it writes the data into its volume file and sends the data to the backup volume over the socket connection between the primary volume and the backup volume, so that the backup volume writes the data into its own volume file; this improves storage reliability. After the backup volume has written the data into its volume file, the backup volume itself reports the result of the write operation to the kernel of the virtual block storage client; this reduces message traffic, and the backup volume takes over part of the primary volume's load, achieving dynamic load balancing.
The data storage method provided by the embodiments of the invention can improve storage reliability and, while guaranteeing that reliability, further reduce message traffic and improve performance.
In the embodiments of the invention, the primary volume and the backup volume can be deployed on different storage nodes, which solves both service unavailability caused by the failure of a single storage node and service unavailability caused by a network failure between the service and the node hosting the primary volume or the backup volume. Moreover, while keeping service uninterrupted across node failures, the embodiments can reduce message traffic by 25%, a considerable performance improvement. In addition, because the backup volume participates in the service flow (responding with the result of the write operation) while backing up data, it takes over part of the primary volume's load, achieving dynamic load balancing.
The embodiments of the invention adopt the network architecture shown in Fig. 2, a schematic diagram of an embodiment of the network architecture of the invention. As shown in Fig. 2, the architecture is a stable triangle model: solid lines represent socket (SOCKET) connections, with arrows pointing to the server side, and dotted lines show the flow of control messages.
In the embodiments of the invention, when a read request is handled, the backup volume (BV) is not involved at all; the read and its response are handled entirely by the primary volume (PV). When a write request is handled, the PV receives the data to be written and the BV responds with the result of the write operation. In this way, the number of messages can be reduced by 25%, which improves performance.
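The 25% figure follows from simple message counting: the prior primary/backup scheme needs four messages per write (client to primary with data, primary to backup with data, backup to primary with the result, primary to client with the result), while the triangle model needs three, because the backup answers the client directly:

```python
def reduction_percent(messages_before, messages_after):
    """Percentage reduction in per-write message count."""
    return 100.0 * (messages_before - messages_after) / messages_before


PRIOR_SCHEME_MESSAGES = 4  # write I/O messages in the primary/backup scheme
TRIANGLE_MESSAGES = 3      # write I/O messages in the triangle model
```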
In addition, the network architecture shown in Fig. 2 effectively guards against storage-node and network failures: as long as no two of the virtual block storage client (vbs-client), the PV, and the BV fail at the same time, and the vbs-client's link to the PV and its link to the BV do not fail at the same time, the upper-layer service program runs without interruption.
Fig. 3 is a flow chart of an embodiment of establishing the triangle model of the invention. As shown in Fig. 3, the flow for establishing the triangle model of Fig. 2 may comprise:
Step 301: the vbs-client establishes a SOCKET connection with the pre-assigned BV according to the listening port of the pre-assigned BV.
In this embodiment, the listening port of the pre-assigned BV is published externally in advance by the pre-assigned BV.
Step 302: the vbs-client negotiates with the pre-assigned BV; the negotiation covers:
(1) notifying the pre-assigned BV that the current connection comes from the vbs-client;
(2) notifying the pre-assigned BV of the information of the pre-assigned PV, including the name and listening port of the pre-assigned PV (the listening port of the pre-assigned PV is likewise published in advance by the pre-assigned PV);
(3) sending the pre-assigned BV a request for its volume size, check point, and solo bitmap (single-node bitmap); the check point records the time point at which data was written.
Step 303: the pre-designated BV returns the volume size, the checkpoint, and the solo bitmap to the vbs-client.
Step 304: the vbs-client establishes a SOCKET connection with the pre-designated PV according to the listening port of that PV.
Step 305: the vbs-client negotiates with the pre-designated PV. The negotiated content includes:
(1) notifying the pre-designated PV that the current connection comes from a vbs-client;
(2) notifying the pre-designated PV of the information of the current pre-designated BV, including the name and listening port of the pre-designated BV;
(3) sending to the pre-designated PV a request to obtain the volume size, the checkpoint, and the solo bitmap.
Step 306: the pre-designated PV returns the volume size, the checkpoint, and the solo bitmap to the vbs-client.
Step 307: the vbs-client compares the checkpoints of the pre-designated PV and the pre-designated BV, determines the volume with the newest checkpoint to be the real PV, and the volume with the second-newest checkpoint to be the real BV.
Specifically, if after comparing the checkpoints the vbs-client finds that the checkpoint of the pre-designated BV is the newest, that is, the data on the pre-designated BV side is the newest, this indicates that the previously pre-designated PV failed at some point. The kernel of the vbs-client then performs a primary/backup switchover and takes the pre-designated BV as the real PV, and the storage system subsequently writes data to the pre-designated BV.
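The role decision of step 307 can be sketched as follows. This is an illustrative simplification in which the checkpoint is reduced to a monotonically increasing timestamp; the function names are hypothetical:

```python
def assign_roles(pv_checkpoint, bv_checkpoint):
    """Compare the two checkpoints and decide which volume is the real PV.

    The volume with the newest checkpoint holds the newest data and
    becomes the real PV; the other becomes the real BV (step 307).
    Returns a (real_pv, real_bv) pair of role names.
    """
    if bv_checkpoint > pv_checkpoint:
        # The pre-designated BV has newer data, so the old PV must have
        # failed at some point: a primary/backup switchover is performed.
        return ("pre-designated BV", "pre-designated PV")
    return ("pre-designated PV", "pre-designated BV")

# Normal case: the pre-designated PV keeps the primary role.
assert assign_roles(1002, 1001) == ("pre-designated PV", "pre-designated BV")
# The BV side is newer: switchover.
assert assign_roles(1001, 1002) == ("pre-designated BV", "pre-designated PV")
```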
This embodiment is described by taking as an example the case in which the pre-designated PV is the real PV and the pre-designated BV is the real BV.
Step 308: the vbs-client registers the role of the real BV with its kernel, and notifies the pre-designated BV that it is the real BV.
Step 309: the vbs-client registers the role of the real PV with its kernel, and notifies the pre-designated PV that it is the real PV.
Step 310: the vbs-client judges whether the link between the real PV and the real BV is normal; if it is normal, steps 311 to 313 are executed; if the link between the real PV and the real BV has failed, this procedure is exited and the rebuilding procedure is entered.
Step 311: if the data on the real PV and the real BV are not synchronized, the vbs-client sends a data synchronization request to the real PV, so that the real PV and the real BV perform data synchronization.
Step 312: the vbs-client registers the descriptors of the SOCKET connections established in step 301 and step 304 with its kernel. In this way, the kernel of the vbs-client can use the SOCKET connection between the vbs-client and the real PV when sending a read operation request, a write operation request, or a control message to the real PV, and can select a suitable SOCKET connection when receiving data or control messages. Under the Triangle Model, the kernel of the vbs-client receives data over the SOCKET connection between the vbs-client and the real BV, whereas under the L model or the SOLO model, the kernel of the vbs-client receives data over the SOCKET connection between the vbs-client and the real PV.
Step 313: the vbs-client calls a system function, for example ioctl, to enter the kernel-mode thread, and in the kernel-mode thread the I/O requests sent by the upper-layer service program, for example write operation requests and read operation requests, are handled through callback functions registered with the system.
The above embodiment establishes the Triangle Model; by handling write operation requests through this Triangle Model, storage reliability can be improved, message traffic can be reduced, and storage performance can be improved.
Fig. 4 is a flowchart of another embodiment of the data storage method of the present invention. In the embodiment of the invention, the processing of a read operation request involves only interaction between the vbs-client and the PV and is the same as in the prior art; therefore, this embodiment introduces only the processing of a write operation request.
As shown in Fig. 4, the data storage method may include:
Step 401: the Triangle Model has been established, and the storage system is in a stable state.
Step 402: after the Triangle Model is established, the vbs-client may start a kernel-mode thread dedicated to handling write operation requests.
Step 403: a write operation request sent by the upper-layer service program is placed by the operating system (Operating System; hereinafter referred to as OS) of the vbs-client into the request queue registered with the kernel.
Step 404: the kernel thread of the vbs-client obtains a write operation request from the above request queue.
In this embodiment, the kernel thread of the vbs-client may obtain a write operation request from the request queue according to a predetermined rule. The predetermined rule may be a first-in-first-out rule or another rule; this embodiment does not limit this, as long as the kernel thread of the vbs-client can obtain a write operation request from the request queue according to the predetermined rule. This embodiment is described by taking the first-in-first-out rule as the example.
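The first-in-first-out retrieval used as the example in steps 403 and 404 can be sketched with a plain queue. This is a simplified user-space stand-in for the kernel request queue, with hypothetical function names:

```python
from collections import deque

request_queue = deque()  # stand-in for the queue registered with the kernel

def submit(request):
    """OS side: place a write operation request into the queue (step 403)."""
    request_queue.append(request)

def next_request():
    """Kernel-thread side: obtain one request under the FIFO rule (step 404)."""
    return request_queue.popleft() if request_queue else None

submit({"id": 1, "data": b"aaaa"})
submit({"id": 2, "data": b"bbbb"})
assert next_request()["id"] == 1   # first in, first out
assert next_request()["id"] == 2
```

Any other retrieval rule would only change the body of `next_request`; the rest of the write path is unaffected, which is why the embodiment leaves the rule open.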
Step 405: the kernel of the vbs-client sends the write operation request to the PV; this write operation request is used to notify the PV to prepare to receive the data to be written.
Step 406: after receiving the write operation request, the PV subprocess forwards it to the BV subprocess; this write operation request is used to notify the BV to prepare to receive the data to be written.
Step 407: the PV receives the data to be written over the SOCKET connection between the PV and the vbs-client, writes the data to be written into the volume file of the PV, and sends the data to be written to the BV over the SOCKET connection between the PV and the BV, so that the BV writes the data to be written into the volume file of the BV.
Step 408: after writing the data to be written into its volume file, the BV reports the result of the write operation to the kernel of the vbs-client.
Step 409: the PV sends a checkpoint record request to the BV.
Step 410: the PV and the BV each update their own checkpoint.
In this embodiment, the checkpoint is the sole criterion for judging whether the data on the current side is the newest data.
Step 411: the PV checks whether there is dirty data in the dirty block list (Dirty Block List; hereinafter referred to as DBL); if there is, and a predetermined condition is satisfied, the PV forcibly writes the dirty data in the DBL to disk. Dirty data is data that is temporarily held in a memory buffer and has not yet been written to the volume file; the DBL records which data is dirty.
In this embodiment, the predetermined condition may be one of the following, or a combination thereof:
(1) if the DBL is found not to have changed during polling, the storage system is not busy, and the dirty data in the DBL can be written to disk directly;
(2) if the DBL is not empty and is found to have changed during polling, the dirty data in the DBL can be written to disk after a preset time interval has elapsed.
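The two flush conditions can be sketched as a single decision made at poll time. The sketch below is illustrative: the DBL is reduced to a set of dirty block ids, and the time-interval parameter is an assumed value:

```python
def should_flush(prev_dbl, curr_dbl, elapsed, flush_interval):
    """Decide at poll time whether the PV should force the dirty data
    in the DBL to disk.

    Condition (1): the DBL has dirty data but did not change since the
    last poll -> the storage system is not busy, flush immediately.
    Condition (2): the DBL is non-empty and did change -> flush only
    once the preset time interval has elapsed.
    """
    if not curr_dbl:
        return False                      # nothing dirty, nothing to do
    if curr_dbl == prev_dbl:
        return True                       # condition (1): system idle
    return elapsed >= flush_interval      # condition (2): interval elapsed

assert not should_flush(set(), set(), 0, 5)       # empty DBL: no flush
assert should_flush({1, 2}, {1, 2}, 0, 5)         # unchanged: flush now
assert not should_flush({1}, {1, 2}, 3, 5)        # changed, interval not reached
assert should_flush({1}, {1, 2}, 5, 5)            # changed, interval elapsed
```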
Step 412: after receiving the result of the write operation sent by the BV, the kernel of the vbs-client judges whether the received result corresponds to a write operation request that has been sent; if it does, step 413 is executed. If the result of the write operation does not correspond to any write operation request that has been sent, the kernel of the vbs-client may discard the result, or cache it without processing it; this embodiment does not limit this, and is described by taking as the example the case in which the kernel of the vbs-client discards a result of a write operation that does not correspond to any sent write operation request.
Step 413: the kernel of the vbs-client sends the result of the write operation to the upper-layer service program.
In the above embodiment, after receiving the data to be written over the SOCKET connection between the PV and the vbs-client, the PV writes the data to be written into the volume file of the PV and sends it to the BV over the SOCKET connection between the PV and the BV, so that the BV writes the data to be written into the volume file of the BV, thereby improving storage reliability. After the BV writes the data to be written into its volume file, the BV reports the result of the write operation to the kernel of the vbs-client; this reduces message traffic, shares part of the load of the PV, and achieves dynamic load balancing.
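The write path of Fig. 4 (steps 405 to 410 and 412 to 413) can be summarized in a small in-memory simulation. The classes below are illustrative stand-ins, not the patented implementation: SOCKET transport is replaced by direct method calls, and all names are hypothetical:

```python
class Volume:
    """A PV or BV; the PV's `peer` is its BV, the BV has no peer."""
    def __init__(self, name):
        self.name = name
        self.volume_file = bytearray()   # stand-in for the on-disk volume file
        self.checkpoint = 0
        self.peer = None

    def handle_write(self, req_id, data, client):
        self.volume_file += data                 # step 407: write own volume file
        if self.peer is not None:                # PV side: forward data to the BV
            self.peer.handle_write(req_id, data, client)
            self.checkpoint += 1                 # steps 409-410: both sides
            self.peer.checkpoint += 1            # update their checkpoints
        else:
            client.report_result(req_id, "ok")   # step 408: BV reports to client

class VbsClientKernel:
    def __init__(self):
        self.pending = set()
        self.results = {}

    def write(self, req_id, data, pv):
        self.pending.add(req_id)                 # step 405: request sent to PV
        pv.handle_write(req_id, data, self)

    def report_result(self, req_id, result):
        if req_id in self.pending:               # step 412: must match a request
            self.pending.discard(req_id)
            self.results[req_id] = result        # step 413: passed to upper layer
        # otherwise the result is discarded

pv, bv = Volume("PV"), Volume("BV")
pv.peer = bv
client = VbsClientKernel()
client.write(1, b"hello", pv)
assert pv.volume_file == bv.volume_file == bytearray(b"hello")
assert client.results[1] == "ok"
assert pv.checkpoint == bv.checkpoint == 1
```

Note how the acknowledgment travels BV → client directly rather than BV → PV → client, which is the source of the message savings claimed for the Triangle Model.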
The following describes how, when various failures occur, the storage system switches from the stable Triangle Model to other models, thereby ensuring that the upper-layer service program is unaffected.
When the PV fails, the original BV becomes the real PV and is responsible both for receiving requests and for returning results. The network architecture at this point can be as shown in Fig. 5; Fig. 5 is a schematic diagram of another embodiment of the network architecture of the present invention, and the network architecture shown in Fig. 5 is the single-node (SOLO) model.
Fig. 6 is a flowchart of another embodiment of the data storage method of the present invention; this embodiment introduces the data storage procedure under the network architecture shown in Fig. 5.
As shown in Fig. 6, the data storage method may include:
Step 601: the kernel-mode thread of the vbs-client receives a write operation request before the PV fails.
Step 602: the failure of the PV causes the SOCKET connection between the vbs-client and the PV to fail; the kernel of the vbs-client performs a primary/backup switchover and registers the original BV as the new PV.
Step 603: the kernel-mode thread of the vbs-client starts a process that detects whether the failed PV has recovered.
Step 604: the kernel-mode thread of the vbs-client sends the above write operation request to the new PV (i.e., the original BV).
Step 605: the new PV writes the data to be written into the volume file of the new PV.
Step 606: the new PV updates the checkpoint.
Step 607: the new PV updates the solo bitmap.
Step 608: the new PV reports the result of the write operation to the kernel of the vbs-client.
Step 609: after determining that the result of the write operation corresponds to a previously sent write operation request, the kernel of the vbs-client sends the result of the write operation to the upper-layer service program.
Step 610: the process that detects whether the failed PV has recovered sends a detection message to the kernel of the vbs-client, so that the kernel of the vbs-client determines, according to the detection message, whether the failed PV has recovered.
In this embodiment, after writing the data to be written into its own volume file, the new PV (i.e., the original BV) updates the solo bitmap and the checkpoint; the solo bitmap records which data blocks have changed, and the checkpoint records the time point at which the data to be written was written. The purpose of recording the solo bitmap and the checkpoint is that, when the next Triangle Model is established, the storage system can judge from the checkpoints which side's data is the newest, and only the volume holding the newest data can become the PV. In addition, if the data on the PV is newer than that on the BV, the corresponding data on the PV can be synchronized to the BV according to the information recorded in the solo bitmap, ensuring the consistency of the data on the PV and the BV; and vice versa.
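The resynchronization role of the solo bitmap can be sketched as follows. The sketch is illustrative only: blocks are fixed-size slices of the volume, the bitmap is a set of changed block indices, and `BLOCK_SIZE` is an assumed parameter:

```python
BLOCK_SIZE = 4

def solo_write(volume, bitmap, block_idx, data):
    """SOLO-model write: update the block and mark it in the solo bitmap."""
    volume[block_idx * BLOCK_SIZE:(block_idx + 1) * BLOCK_SIZE] = data
    bitmap.add(block_idx)

def resync(newer, stale, bitmap):
    """When the Triangle Model is rebuilt, copy only the blocks recorded
    in the solo bitmap from the newer volume to the stale one."""
    for idx in bitmap:
        stale[idx * BLOCK_SIZE:(idx + 1) * BLOCK_SIZE] = \
            newer[idx * BLOCK_SIZE:(idx + 1) * BLOCK_SIZE]
    bitmap.clear()

pv = bytearray(12); bv = bytearray(12)    # both volumes start identical
solo_bitmap = set()
solo_write(pv, solo_bitmap, 2, b"NEWS")   # the BV was unreachable for this write
assert pv != bv
resync(pv, bv, solo_bitmap)               # only block 2 is copied
assert pv == bv and not solo_bitmap
```

Because only the marked blocks are transferred, resynchronization cost is proportional to the amount of data written during the outage, not to the size of the volume.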
When the link between the vbs-client and the BV fails, the network architecture can be as shown in Fig. 7; Fig. 7 is a schematic diagram of another embodiment of the network architecture of the present invention, and the network architecture shown in Fig. 7 is the L model. Under the network architecture shown in Fig. 7, the BV cannot send the result of a write operation to the upper-layer service program; in this case, the BV instead reports the result of the write operation to the PV, and the PV finally reports the result of the write operation to the upper-layer service program.
Fig. 8 is a flowchart of another embodiment of the data storage method of the present invention; this embodiment introduces the data storage procedure under the network architecture shown in Fig. 7.
As shown in Fig. 8, the data storage method may include:
Step 801: in the stable state, the link between the vbs-client and the BV suddenly fails.
Step 802: a result of a write operation that should have been reported by the BV cannot be sent to the kernel of the vbs-client, because the link between the vbs-client and the BV has failed; the BV blocks.
Step 803: when the heartbeat mechanism of the vbs-client detects the link failure, it sends a link failure message to the vbs-client.
Step 804: the vbs-client sends the link failure message to the PV.
Step 805: the PV forwards the link failure message to the BV.
Step 806: after receiving the link failure message, the BV sends the unsent results of write operations to the PV.
Step 807: the PV sends the result of the write operation to the kernel of the vbs-client.
Step 808: the kernel of the vbs-client reports the result of the write operation to the upper-layer service program.
Step 809: for subsequent write operation requests, after writing the data to be written into the volume file of the BV, the BV directly reports the result of the write operation to the PV and no longer attempts to send it to the kernel of the vbs-client.
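The rerouting of write results under the L model (steps 806 to 809) amounts to a per-result choice of report path. A minimal, illustrative sketch:

```python
def bv_report_path(link_vbs_bv_up):
    """Under the Triangle Model the BV reports results straight to the
    vbs-client; once the vbs-client<->BV link has failed (L model), the
    BV reports to the PV instead, and the PV relays to the vbs-client."""
    if link_vbs_bv_up:
        return ["BV", "vbs-client"]
    return ["BV", "PV", "vbs-client"]

assert bv_report_path(True) == ["BV", "vbs-client"]          # Triangle Model
assert bv_report_path(False) == ["BV", "PV", "vbs-client"]   # L model
```

The detour costs one extra hop per result, but data replication from the PV to the BV is untouched, so reliability is preserved while only the acknowledgment path degrades.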
When the link between the PV and the BV fails, the network architecture can be as shown in Fig. 9; Fig. 9 is a schematic diagram of another embodiment of the network architecture of the present invention, and the network architecture shown in Fig. 9 is the inverted-V model. Under the inverted-V model, the storage system can ignore the BV. Taking the handling of a write operation request as an example: after writing the data to be written into the volume file of the PV, the PV updates the checkpoint and the solo bitmap of the PV; then the PV reports the result of the write operation to the kernel of the vbs-client. That is, the vbs-client receives the result of the write operation not from the BV but from the PV; the whole handling flow is similar to that of the SOLO model and is not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be performed by hardware under the direction of program instructions. The aforesaid program can be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments; the aforesaid storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Fig. 10 is a schematic structural diagram of an embodiment of a primary volume node device of the present invention. This primary volume node device can implement the flow of the embodiment of the present invention shown in Fig. 1. As shown in Fig. 10, the primary volume node device may include: a receiving module 1001 and a writing module 1002.
The receiving module 1001 is configured to receive data to be written over the socket connection between the primary volume node device and the virtual block storage client.
The writing module 1002 is configured to write the data to be written received by the receiving module 1001 into the volume file of the primary volume node device, and to send the data to be written to the backup volume node device over the socket connection between the primary volume node device and the backup volume node device, so that the backup volume node device, after writing the data to be written into its volume file, reports the result of the write operation to the kernel of the virtual block storage client.
In the above embodiment, after the receiving module 1001 receives the data to be written over the socket connection between the primary volume node device and the virtual block storage client, the writing module 1002 writes the data to be written into the volume file of the primary volume node device and sends it to the backup volume node device over the socket connection between the primary volume node device and the backup volume node device, so that the backup volume node device writes the data to be written into its volume file, thereby improving storage reliability. After the backup volume node device writes the data to be written into its volume file, it reports the result of the write operation to the kernel of the virtual block storage client; this reduces message traffic, shares part of the load of the primary volume node device, and achieves dynamic load balancing.
Fig. 11 is a schematic structural diagram of another embodiment of the primary volume node device of the present invention. Compared with the primary volume node device shown in Fig. 10, the difference is that the primary volume node device shown in Fig. 11 may further include: a sending module 1003.
In this embodiment, before receiving the data to be written, the receiving module 1001 may also receive the write operation request sent by the kernel of the virtual block storage client; the sending module 1003 is then configured to forward the write operation request received by the receiving module 1001 to the backup volume node device, the write operation request being used to notify the primary volume node device and/or the backup volume node device to prepare to receive the data to be written. The write operation request is obtained by the kernel of the virtual block storage client from the request queue registered with that kernel; the write operation requests in this request queue are placed there by the virtual block storage client after it receives the write operation requests sent by the upper-layer service program.
Further, in this embodiment, the sending module 1003 may also send a checkpoint record request to the backup volume node device, so that the primary volume node device and the backup volume node device update their respective checkpoints.
Further, the primary volume node device in this embodiment may further include:
a checking module 1004, configured to check whether there is dirty data in the dirty block list;
the writing module 1002 may then, when there is dirty data in the dirty block list and a predetermined condition is satisfied, write the dirty data to disk.
Further, in this embodiment, when the link between the virtual block storage client and the backup volume node device fails, the receiving module 1001 may also receive the link failure message sent by the virtual block storage client; the sending module 1003 may then forward the link failure message received by the receiving module 1001 to the backup volume node device, so that the backup volume node device sends the result of the write operation to the primary volume node device, and the sending module 1003 then sends the result of the write operation to the kernel of the virtual block storage client. The link failure message is sent to the virtual block storage client by the heartbeat process of the virtual block storage client.
Further, the primary volume node device in this embodiment may further include:
an updating module 1005, configured to, when the link between the primary volume node device and the backup volume node device fails, update the checkpoint and the single-node bitmap of the primary volume node device after the writing module 1002 writes the data to be written into the volume file of the primary volume node device; at this time, the sending module 1003 may report the result of the write operation to the kernel of the virtual block storage client.
Building a storage system with the above primary volume node device can improve storage reliability, reduce message traffic, and improve storage performance.
Fig. 12 is a schematic structural diagram of an embodiment of a virtual block storage client device of the present invention. As shown in Fig. 12, the virtual block storage client device may include: a connection establishing module 1201, an obtaining module 1202, a comparing module 1203, a determining module 1204, a registering module 1205, and a calling module 1206.
The connection establishing module 1201 is configured to establish a socket connection with a pre-designated backup volume according to the listening port of the pre-designated backup volume, and to establish a socket connection with a pre-designated primary volume according to the listening port of the pre-designated primary volume.
The obtaining module 1202 is configured to obtain the volume size, checkpoint, and single-node bitmap of the pre-designated backup volume, and to obtain the volume size, checkpoint, and single-node bitmap of the pre-designated primary volume.
The comparing module 1203 is configured to compare the checkpoints of the pre-designated primary volume and the pre-designated backup volume.
The determining module 1204 is configured to determine, according to the comparison result of the comparing module 1203, that the volume with the newest checkpoint is the real primary volume and that the volume with the second-newest checkpoint is the real backup volume.
The registering module 1205 is configured to register the roles of the real primary volume and the real backup volume with the kernel of the virtual block storage client device, and, when the link between the real primary volume and the real backup volume is normal, to register with the kernel of the virtual block storage client device the descriptor of the socket connection established by the connection establishing module 1201 with the pre-designated backup volume and the descriptor of the socket connection established by the connection establishing module 1201 with the pre-designated primary volume.
The calling module 1206 is configured to call a system function to enter the kernel-mode thread and to handle, in the kernel-mode thread, the write operation requests sent by the upper-layer service program.
Building a storage system with the above virtual block storage client device can improve storage reliability, reduce message traffic, and improve storage performance.
Fig. 13 is a schematic structural diagram of another embodiment of the virtual block storage client device of the present invention. Compared with the virtual block storage client device shown in Fig. 12, the difference is that the virtual block storage client device 12 shown in Fig. 13 may further include:
a request sending module 1207, configured to, when the link between the real primary volume and the real backup volume is normal but the data on the real primary volume and the real backup volume are not synchronized, send a data synchronization request to the real primary volume, so that the real primary volume and the real backup volume perform data synchronization;
a primary/backup switchover module 1208, configured to, when the primary volume fails, perform a primary/backup switchover in the kernel of the virtual block storage client device and register the backup volume as the new primary volume, so that the new primary volume receives the data to be written over the socket connection between the new primary volume and the virtual block storage client and, after writing the data to be written into the volume file of the new primary volume, updates the checkpoint and the single-node bitmap of the new primary volume and reports the result of the write operation to the kernel of the virtual block storage client device.
Further, the virtual block storage client device 12 in this embodiment may further include: a result receiving module 1209 and a result sending module 1210.
The result receiving module 1209 is configured to receive the result of a write operation.
The determining module 1204 may then further determine whether the result of the write operation received by the result receiving module 1209 corresponds to a write operation request that has been sent.
The result sending module 1210 is configured to send the result of the write operation to the upper-layer service program after the determining module 1204 determines that the result of the write operation received by the result receiving module 1209 corresponds to a write operation request that has been sent.
Building a storage system with the above virtual block storage client device can improve storage reliability, reduce message traffic, and improve storage performance.
Fig. 14 is a schematic structural diagram of an embodiment of a backup volume node device of the present invention. As shown in Fig. 14, the backup volume node device may include: a data receiving module 1401, a data writing module 1402, and a result reporting module 1403.
The data receiving module 1401 is configured to receive the data to be written sent by the primary volume node device over the socket connection between the primary volume node device and the backup volume node device.
The data writing module 1402 is configured to write the data to be written received by the data receiving module 1401 into the volume file of the backup volume node device.
The result reporting module 1403 is configured to report the result of the write operation to the kernel of the virtual block storage client device.
Building a storage system with the above backup volume node device can improve storage reliability, reduce message traffic, and improve storage performance.
Fig. 15 is a schematic structural diagram of another embodiment of the backup volume node device of the present invention. Compared with the backup volume node device shown in Fig. 14, the difference is that the backup volume node device shown in Fig. 15 may further include:
a request receiving module 1404, configured to receive, before the data receiving module 1401 receives the data to be written, the write operation request sent by the primary volume node device, the write operation request being used to notify the backup volume node device to prepare to receive the data to be written; the request receiving module 1404 may also receive, after the result reporting module 1403 reports the result of the write operation, the checkpoint record request sent by the primary volume node device, so as to update the checkpoint of the backup volume node device.
Building a storage system with the above backup volume node device can improve storage reliability, reduce message traffic, and improve storage performance.
Fig. 16 is a schematic structural diagram of an embodiment of a storage system of the present invention. As shown in Fig. 16, the storage system may include: a vbs client 1601, a PV 1602, and a BV 1603.
The PV 1602 is configured to receive data to be written over the socket connection between the PV 1602 and the vbs client 1601, to write the data to be written into the volume file of the PV 1602, and to send the data to be written to the BV 1603 over the socket connection between the PV 1602 and the BV 1603, so that the BV 1603, after writing the data to be written into the volume file of the BV 1603, reports the result of the write operation to the kernel of the vbs client 1601.
Specifically, the vbs client 1601 can be implemented by the virtual block storage client device shown in Fig. 12 or Fig. 13 of the present invention, and the PV 1602 can be implemented by the primary volume node device shown in Fig. 10 or Fig. 11 of the present invention.
In this embodiment, the vbs client 1601 may include a Triangle Model establishing module 16011, a control message processing module 16012, an I/O request sending module 16013, a link selection module 16014, and an I/O result receiving module 16015.
The Triangle Model establishing module 16011 is configured to rebuild the Triangle Model or perform a model switch at initial startup or when a node or the network fails. Specifically, the Triangle Model establishing module 16011 can establish the Triangle Model with reference to the method provided by the embodiment of the present invention shown in Fig. 3, and implements the functions of the connection establishing module 1201, the obtaining module 1202, the comparing module 1203, the determining module 1204, the registering module 1205, the calling module 1206, and the request sending module 1207 in the virtual block storage client device provided by the embodiments shown in Fig. 12 and Fig. 13.
Control message processing module 16012: the interface between user mode and kernel mode.
I/O request sending module 16013: I/O requests from the upper-layer service program, for example write operation requests or read operation requests, are sent to the PV 1602 by the I/O request sending module 16013.
Link selection module 16014: when a failure occurs, the sending and receiving of data differ from the Triangle Model, so under the SOLO model or the L model the kernel of the vbs-client must select a suitable link when sending an I/O request. In this embodiment, the link selection module 16014 can implement the function of the primary/backup switchover module 1208 in the virtual block storage client device provided by the embodiment shown in Fig. 13.
I/O result receiving module 16015: the results of write operations from the BV are handled by the I/O result receiving module 16015. In this embodiment, the I/O result receiving module 16015 can implement the functions of the result receiving module 1209 and the result sending module 1210 in the virtual block storage client device provided by the embodiment shown in Fig. 13.
In the present embodiment, PV 1602 can comprise read operation request processing module 16021, write operation requests processing module 16022, master/slave data synchronization module 16023, PV state detection module 16024, single node bitmap (solo bitmap) 16025, DBL 16026 and volume file 16027.
Wherein, read operation request processing module 16021 is used to handle the read operation request from vbs client 1601, and PV 1602 is only arrived in the read operation request under Triangle Model, BV 1603 not perception;
Write operation requests processing module 16022 is used to handle the write operation requests from vbs client 1601, and write operation requests can forward BV 1603 under Triangle Model, at last by the result of BV 1603 to the 1601 report write operations of vbs client; In the present embodiment, write operation requests processing module 16022 can be provided by the partial function of receiver module 1001, writing module 1002 and sending module 1003 in the present invention's master file node device that provides embodiment illustrated in fig. 10;
Master/slave data synchronization module 16023: when PV 1602 or BV 1603 takes place when unusual, this storage system enters the SOLO model, when later on write operation requests being arranged, solo bitmap 16025 can write down the data that change, when rebuilding Triangle Model, master/slave data synchronization module 16023 to opposite side, keeps data consistent with data sync;
PV state detection module 16024: when PV 1602 generations are unusual, storage system enters the SOLO model, BV 1603 originally can become real PV, and whether PV state detection module 16024 polls detect the unusual PV 1602 of generation and recover normally, recover the back and rebuild Triangle Model;
Solo bitmap 16025: system recorder memory is run duration under the SOLO model, the data block that changes on the PV 1602; In the present embodiment, solo bitmap 16025 can be provided by the partial function of update module 1005 in the present invention's master file node device that provides embodiment illustrated in fig. 10;
DBL (dirty block list) 16026: when a write request occurs, the storage system first writes the data to be written into a buffer; before actually being written to disk, such data are regarded as dirty data, and the DBL 16026 records them. When there are dirty data in the DBL 16026 and a predetermined condition is satisfied, the PV 1602 forces the dirty data in the DBL 16026 to be written to disk. In the present embodiment, the predetermined condition may be one of the following, or a combination thereof:
(1) if polling finds that the DBL 16026 has not changed, the storage system is not busy, and the PV 1602 may directly write the dirty data in the DBL 16026 to disk;
(2) if the DBL 16026 is not empty and polling finds that the DBL 16026 has changed, the PV 1602 writes the dirty data in the DBL 16026 to disk after a preset time interval has elapsed.
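The two flush conditions above can be sketched as a single polling policy: flush immediately when a poll sees no new changes (the system is idle), otherwise flush only once a preset interval expires. The class name and the default interval are hypothetical illustrations, not values from the patent:

```python
import time

class DirtyBlockList:
    """Hypothetical sketch of the DBL flush policy described above."""

    def __init__(self, flush_interval=5.0):
        self.entries = {}          # offset -> dirty data
        self.version = 0           # bumped on every new write
        self.last_seen = 0         # version observed at the previous poll
        self.last_flush = time.monotonic()
        self.flush_interval = flush_interval

    def record(self, offset, data):
        self.entries[offset] = data
        self.version += 1

    def poll(self, write_to_disk):
        """Called periodically; decides whether to force dirty data to disk."""
        now = time.monotonic()
        changed = self.version != self.last_seen
        self.last_seen = self.version
        if not self.entries:
            return False
        # Condition (1): no change since the last poll -> system idle, flush now.
        # Condition (2): still changing -> flush once the interval has expired.
        if not changed or now - self.last_flush >= self.flush_interval:
            for offset, data in sorted(self.entries.items()):
                write_to_disk(offset, data)
            self.entries.clear()
            self.last_flush = now
            return True
        return False
```

Combining the two conditions this way lets an idle system drain its buffer promptly while a busy system batches writes, which is the trade-off the text describes.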
Volume file 16027: a sparse file based on the XFS file system, used to store all the data. Before any data to be written are actually written, the volume file 16027 occupies no disk space; this is a form of thin provisioning and gives the user great flexibility.
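Thin provisioning via a sparse file can be illustrated as follows: truncating a file sets its logical size without allocating blocks, and only the ranges actually written consume space. The function names are illustrative; the mechanism works on XFS and most Unix filesystems that support holes:

```python
import os

def create_volume_file(path, volume_size):
    """Hypothetical sketch of a thin-provisioned volume file: a sparse
    file with the full logical size that occupies no data blocks until
    something is actually written."""
    with open(path, "wb") as f:
        f.truncate(volume_size)   # sets logical size only; allocates no blocks

def write_at(path, offset, data):
    # Writing at an offset allocates just the blocks that are touched;
    # the untouched remainder of the file stays a hole.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
```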
In the present embodiment, the functions of the modules in the BV 1603 are identical to those of the corresponding modules in the PV 1602, and are not repeated here.
In the above storage system, after the PV 1602 receives the data to be written over the socket connection between the PV 1602 and the VBS client 1601, it writes the data into the volume file of the PV 1602 and sends the data over the socket connection between the PV 1602 and the BV 1603 to the BV 1603, so that the BV 1603 writes the data into the volume file of the BV 1603, thereby improving storage reliability. After the BV 1603 has written the data to be written into its volume file, the BV 1603 reports the result of the write operation to the kernel of the VBS client 1601. This reduces message traffic, shares part of the load of the PV 1602, and achieves dynamic load balancing.
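The end-to-end write path just described — the PV writes locally, forwards to the BV, and the BV alone acknowledges to the client — can be sketched in a simplified in-process form. Class and method names are hypothetical; the real nodes communicate over sockets:

```python
class BackupVolume:
    """Hypothetical sketch of the Triangle write path: the BV, not the
    PV, reports the result back to the VBS client's kernel, which both
    removes a reply hop from the PV and shares its load."""

    def __init__(self):
        self.volume = {}            # stands in for the BV's volume file

    def handle_write(self, offset, data, report):
        self.volume[offset] = data  # write into the BV's volume file
        report("success")           # the BV reports the result to the client


class PrimaryVolume:
    def __init__(self, backup):
        self.volume = {}            # stands in for the PV's volume file
        self.backup = backup

    def handle_write(self, offset, data, report):
        self.volume[offset] = data                       # write locally first
        self.backup.handle_write(offset, data, report)   # then forward to the BV


results = []
pv = PrimaryVolume(BackupVolume())
pv.handle_write(0, b"hello", results.append)  # client's completion callback
```

Note that the completion callback travels with the request, so the acknowledgement naturally comes from whichever node finishes last — here, the BV.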
The implementation of the embodiments of the present invention in cloud storage is introduced below.
Cloud storage is a new concept extended and developed from the concept of cloud computing. It refers to a system that, through functions such as cluster applications, grids or distributed file systems, aggregates a large number of storage devices of different types in a network through application software to work together, and jointly provides data storage and service access functions externally. When the core of a cloud computing system is the storage and management of massive data, a large number of storage devices must be configured in the cloud computing system; the cloud computing system is then transformed into a cloud storage system, so a cloud storage system is a cloud computing system whose core is storage and management.
Figure 17 is a schematic diagram of an embodiment of the cloud storage system of the present invention. As shown in Figure 17, the cloud storage system in this embodiment may comprise the following devices:
(1) Three Block Storage Providers (hereinafter: BSP), denoted BSP1, BSP2 and BSP3, which provide storage space for the upper-layer Block Storage Agents (hereinafter: BSA). Meanwhile, the PV and the BV in the Triangle mode provided by the embodiments of the present invention are deployed on each BSP. Under this deployment, the I/O load can be distributed evenly across the BSPs.
(2) BSAs, which serve as the interface between the storage middleware and the underlying BSPs and are responsible for providing virtual NBD devices to the upper layer.
(3) Another server, on which a monitoring system, a charging system, a storage resource management system and the like are deployed. The storage resource management system is mainly responsible for selecting the PV and the BV on BSP1, BSP2 and BSP3, thereby balancing the load among BSP1, BSP2 and BSP3. The monitoring system monitors each node in real time for failures, and is also responsible for monitoring the performance of BSP1, BSP2 and BSP3, promptly notifying the storage resource management system of any abnormality.
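One way the storage resource management system might spread PV and BV roles evenly over the three BSPs is sketched below. The round-robin scheme and the function name are assumptions for illustration only; the patent does not specify the selection algorithm:

```python
from itertools import combinations

def place_volumes(num_volumes, bsps):
    """Hypothetical sketch: assign each volume a (PV host, BV host) pair
    so that every BSP ends up hosting a similar mix of primary and
    backup roles, which evens out the I/O load."""
    pairs = list(combinations(bsps, 2))   # candidate (PV-host, BV-host) pairs
    placement = []
    for i in range(num_volumes):
        pv_host, bv_host = pairs[i % len(pairs)]
        if (i // len(pairs)) % 2 == 1:    # alternate roles on each full pass
            pv_host, bv_host = bv_host, pv_host
        placement.append((pv_host, bv_host))
    return placement
```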
The data storage method, devices and system provided by the embodiments of the present invention make it easy to build a large-scale, reliable storage system. At the same time, while improving storage reliability, the embodiments of the present invention also effectively reduce the message traffic of the whole storage system; the advantage is more obvious under very high I/O load.
Those skilled in the art will appreciate that the drawings are schematic diagrams of preferred embodiments, and that the modules or flows in the drawings are not necessarily required to implement the present invention.
Those skilled in the art will appreciate that the modules in the devices of the embodiments may be distributed in the devices as described in the embodiments, or may be changed accordingly and arranged in one or more devices different from those of the present embodiments. The modules of the foregoing embodiments may be merged into one module, or further split into a plurality of sub-modules.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (24)

1. A data storage method, characterized by comprising:
Receiving data to be written over a socket connection between a primary volume and a virtual block storage client;
Writing the data to be written into a volume file of the primary volume, and sending the data to be written to a backup volume over a socket connection between the primary volume and the backup volume, so that the backup volume, after writing the data to be written into a volume file of the backup volume, reports a result of the write operation to a kernel of the virtual block storage client.
2. The method according to claim 1, characterized in that before the receiving of the data to be written over the socket connection between the primary volume and the virtual block storage client, the method further comprises:
Receiving a write request sent by the kernel of the virtual block storage client, and forwarding the write request to the backup volume, the write request being used to notify the primary volume and/or the backup volume to prepare to receive the data to be written; wherein the write request is obtained by the kernel of the virtual block storage client from a request queue registered with the kernel; and a write request in the request queue registered with the kernel is placed into that queue by the virtual block storage client after the virtual block storage client receives the write request sent by an upper-layer service program.
3. The method according to claim 1, characterized in that after the reporting of the result of the write operation to the kernel of the virtual block storage client, the method further comprises:
Sending a checkpoint recording request to the backup volume, so that the primary volume and the backup volume each update their respective checkpoints.
4. The method according to claim 1 or 3, characterized in that after the reporting of the result of the write operation to the kernel of the virtual block storage client, the method further comprises:
Checking whether there are dirty data in a dirty block list;
When there are dirty data in the dirty block list and a predetermined condition is satisfied, writing the dirty data to disk.
5. The method according to claim 2, characterized in that before the receiving of the write request sent by the kernel of the virtual block storage client, the method further comprises:
The virtual block storage client establishing a socket connection with a pre-assigned backup volume according to a listening port of the pre-assigned backup volume, and obtaining the volume size, checkpoint and solo bitmap of the pre-assigned backup volume;
The virtual block storage client establishing a socket connection with a pre-assigned primary volume according to a listening port of the pre-assigned primary volume, and obtaining the volume size, checkpoint and solo bitmap of the pre-assigned primary volume;
The virtual block storage client comparing the checkpoints of the pre-assigned primary volume and the pre-assigned backup volume, and determining that the volume corresponding to the newest checkpoint is the real primary volume and the volume corresponding to the second-newest checkpoint is the real backup volume;
The virtual block storage client registering the roles of the real primary volume and the real backup volume with the kernel of the virtual block storage client;
When the link between the real primary volume and the real backup volume is normal, the virtual block storage client registering, with the kernel of the virtual block storage client, the descriptor of the socket connection established with the pre-assigned backup volume and the descriptor of the socket connection established with the pre-assigned primary volume;
The virtual block storage client calling a system function to enter a kernel-state thread, and handling, in the kernel-state thread, write requests sent by the upper-layer service program.
6. The method according to claim 5, characterized in that before the virtual block storage client registers, with the kernel of the virtual block storage client, the descriptor of the socket connection established with the pre-assigned backup volume and the descriptor of the socket connection established with the pre-assigned primary volume, the method further comprises:
When the link between the real primary volume and the real backup volume is normal, if the data between the real primary volume and the real backup volume are not synchronized, the virtual block storage client sending a data synchronization request to the real primary volume, so that the real primary volume and the real backup volume synchronize their data.
7. The method according to claim 2, characterized by further comprising:
When the primary volume fails, the kernel of the virtual block storage client performing a primary/backup switchover, and after the backup volume is registered as a new primary volume, the new primary volume receiving data to be written over a socket connection between the new primary volume and the virtual block storage client;
After the new primary volume writes the data to be written into a volume file of the new primary volume, updating the checkpoint and the solo bitmap of the new primary volume;
The new primary volume reporting the result of the write operation to the kernel of the virtual block storage client.
8. The method according to claim 2, characterized by further comprising:
When the link between the virtual block storage client and the backup volume fails, the primary volume receiving a link failure message sent by the virtual block storage client and forwarding the link failure message to the backup volume, so that the backup volume sends the result of the write operation to the primary volume, and the primary volume sends the result of the write operation to the kernel of the virtual block storage client; wherein the link failure message is sent by a heartbeat process of the virtual block storage client.
9. The method according to claim 2, characterized by further comprising:
When the link between the primary volume and the backup volume fails, after the primary volume writes the data to be written into the volume file of the primary volume, updating the checkpoint and the solo bitmap of the primary volume;
The primary volume reporting the result of the write operation to the kernel of the virtual block storage client.
10. The method according to claim 2, 7, 8 or 9, characterized by further comprising:
After receiving the result of the write operation, the kernel of the virtual block storage client determining whether the result of the write operation corresponds to a write request that has been sent;
If so, the kernel of the virtual block storage client sending the result of the write operation to the upper-layer service program.
11. A primary volume node device, characterized by comprising:
A receiving module, configured to receive data to be written over a socket connection between the primary volume node device and a virtual block storage client;
A writing module, configured to write the data to be written received by the receiving module into a volume file of the primary volume node device, and to send the data to be written to a backup volume node device over a socket connection between the primary volume node device and the backup volume node device, so that the backup volume node device, after writing the data to be written into a volume file of the backup volume node device, reports a result of the write operation to a kernel of the virtual block storage client.
12. The device according to claim 11, characterized by further comprising a sending module; wherein:
The receiving module is further configured to receive, before receiving the data to be written, a write request sent by the kernel of the virtual block storage client;
The sending module is configured to forward the write request received by the receiving module to the backup volume node device, the write request being used to notify the primary volume node device and/or the backup volume node device to prepare to receive the data to be written; wherein the write request is obtained by the kernel of the virtual block storage client from a request queue registered with the kernel; and a write request in the request queue registered with the kernel is placed into that queue by the virtual block storage client after the virtual block storage client receives the write request sent by an upper-layer service program.
13. The device according to claim 12, characterized in that:
The sending module is further configured to send a checkpoint recording request to the backup volume node device, so that the primary volume node device and the backup volume node device each update their respective checkpoints.
14. The device according to claim 11 or 13, characterized by further comprising a checking module; wherein:
The checking module is configured to check whether there are dirty data in a dirty block list;
The writing module is further configured to write the dirty data to disk when there are dirty data in the dirty block list and a predetermined condition is satisfied.
15. The device according to claim 12, characterized in that:
The receiving module is further configured to receive, when the link between the virtual block storage client and the backup volume node device fails, a link failure message sent by the virtual block storage client;
The sending module is further configured to forward the link failure message received by the receiving module to the backup volume node device, so that the backup volume node device sends the result of the write operation to the primary volume node device, and the sending module then sends the result of the write operation to the kernel of the virtual block storage client; wherein the link failure message is sent by a heartbeat process of the virtual block storage client.
16. The device according to claim 12, characterized by further comprising an update module; wherein:
The update module is configured to update, when the link between the primary volume node device and the backup volume node device fails, the checkpoint and the solo bitmap of the primary volume node device after the writing module writes the data to be written into the volume file of the primary volume node device;
The sending module is further configured to report the result of the write operation to the kernel of the virtual block storage client.
17. A virtual block storage client device, characterized by comprising:
A connection establishing module, configured to establish a socket connection with a pre-assigned backup volume according to a listening port of the pre-assigned backup volume, and to establish a socket connection with a pre-assigned primary volume according to a listening port of the pre-assigned primary volume;
An obtaining module, configured to obtain the volume size, checkpoint and solo bitmap of the pre-assigned backup volume, and to obtain the volume size, checkpoint and solo bitmap of the pre-assigned primary volume;
A comparison module, configured to compare the checkpoints of the pre-assigned primary volume and the pre-assigned backup volume;
A determination module, configured to determine, according to the comparison result of the comparison module, that the volume corresponding to the newest checkpoint is the real primary volume and the volume corresponding to the second-newest checkpoint is the real backup volume;
A registration module, configured to register the roles of the real primary volume and the real backup volume with a kernel of the virtual block storage client device, and, when the link between the real primary volume and the real backup volume is normal, to register with the kernel of the virtual block storage client device the descriptor of the socket connection established by the connection establishing module with the pre-assigned backup volume and the descriptor of the socket connection established by the connection establishing module with the pre-assigned primary volume;
A calling module, configured to call a system function to enter a kernel-state thread, and to handle, in the kernel-state thread, write requests sent by an upper-layer service program.
18. The device according to claim 17, characterized by further comprising:
A request sending module, configured to send, when the link between the real primary volume and the real backup volume is normal and the data between the real primary volume and the real backup volume are not synchronized, a data synchronization request to the real primary volume, so that the real primary volume and the real backup volume synchronize their data.
19. The device according to claim 17, characterized by further comprising:
A primary/backup switchover module, configured to perform, when the primary volume fails, a primary/backup switchover at the kernel of the virtual block storage client device, registering the backup volume as a new primary volume, so that the new primary volume receives data to be written over the socket connection between the new primary volume and the virtual block storage client, updates the checkpoint and the solo bitmap of the new primary volume after writing the data to be written into a volume file of the new primary volume, and reports the result of the write operation to the kernel of the virtual block storage client device.
20. The device according to claim 19, characterized by further comprising a result receiving module and a result sending module; wherein:
The result receiving module is configured to receive the result of a write operation;
The determination module is further configured to determine whether the result of the write operation received by the result receiving module corresponds to a write request that has been sent;
The result sending module is configured to send the result of the write operation to the upper-layer service program after the determination module determines that the result of the write operation received by the result receiving module corresponds to the write request that has been sent.
21. A backup volume node device, characterized by comprising:
A data receiving module, configured to receive data to be written sent by a primary volume node device over a socket connection between the primary volume node device and the backup volume node device;
A data writing module, configured to write the data to be written received by the data receiving module into a volume file of the backup volume node device;
A result reporting module, configured to report a result of the write operation to a kernel of a virtual block storage client device.
22. The device according to claim 21, characterized by further comprising:
A request receiving module, configured to receive, before the data receiving module receives the data to be written, a write request sent by the primary volume node device, the write request being used to notify the backup volume node device to prepare to receive the data to be written.
23. The device according to claim 22, characterized in that:
The request receiving module is further configured to receive, after the result reporting module reports the result of the write operation, a checkpoint recording request sent by the primary volume node device, so as to update the checkpoint of the backup volume node device.
24. A storage system, characterized by comprising: the primary volume node device according to any one of claims 11-16, the virtual block storage client device according to any one of claims 17-20, and the backup volume node device according to any one of claims 21-23.
CN 201110021715 2011-01-19 2011-01-19 Data storage method, device and system Expired - Fee Related CN102088490B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN 201110021715 CN102088490B (en) 2011-01-19 2011-01-19 Data storage method, device and system
PCT/CN2011/078476 WO2012097588A1 (en) 2011-01-19 2011-08-16 Data storage method, apparatus and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110021715 CN102088490B (en) 2011-01-19 2011-01-19 Data storage method, device and system

Publications (2)

Publication Number Publication Date
CN102088490A true CN102088490A (en) 2011-06-08
CN102088490B CN102088490B (en) 2013-06-12

Family

ID=44100102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110021715 Expired - Fee Related CN102088490B (en) 2011-01-19 2011-01-19 Data storage method, device and system

Country Status (2)

Country Link
CN (1) CN102088490B (en)
WO (1) WO2012097588A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102291466A (en) * 2011-09-05 2011-12-21 浪潮电子信息产业股份有限公司 Method for optimizing cluster storage network resource configuration
WO2012097588A1 (en) * 2011-01-19 2012-07-26 华为技术有限公司 Data storage method, apparatus and system
CN104699419A (en) * 2013-12-09 2015-06-10 陈勋元 Operation method of distributed memory disk cluster storage system
CN105940658A (en) * 2015-01-04 2016-09-14 华为技术有限公司 A user data transmission method, apparatus and terminal
CN108804248A (en) * 2017-04-28 2018-11-13 南京壹进制信息技术股份有限公司 A kind of automatic Verification method of volume real-time guard data
CN110837442A (en) * 2019-11-14 2020-02-25 北京京航计算通讯研究所 KVM virtual machine backup system based on dirty data bitmap and network block equipment
CN110837441A (en) * 2019-11-14 2020-02-25 北京京航计算通讯研究所 KVM virtual machine backup method based on dirty data bitmap and network block equipment
CN110879760A (en) * 2018-09-05 2020-03-13 北京鲸鲨软件科技有限公司 Unified storage system and method and electronic equipment
CN112559445A (en) * 2020-12-11 2021-03-26 上海哔哩哔哩科技有限公司 Data writing method and device
CN113032768A (en) * 2021-03-31 2021-06-25 广州锦行网络科技有限公司 Authentication method, device, equipment and computer readable medium
CN113721857A (en) * 2021-09-05 2021-11-30 苏州浪潮智能科技有限公司 Method, equipment and storage medium for managing double-active storage system
CN117421160A (en) * 2023-11-01 2024-01-19 广州鼎甲计算机科技有限公司 Data backup method, device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059734A1 (en) * 2006-09-06 2008-03-06 Hitachi, Ltd. Storage subsystem and back-up/recovery method
CN101291205A (en) * 2008-06-16 2008-10-22 杭州华三通信技术有限公司 Backup data transmitting method, system, mirror-image server and customer terminal
CN101706805A (en) * 2009-10-30 2010-05-12 中国科学院计算技术研究所 Method and system for storing object
CN101808127A (en) * 2010-03-15 2010-08-18 成都市华为赛门铁克科技有限公司 Data backup method, system and server

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4456909B2 (en) * 2004-03-29 2010-04-28 株式会社日立製作所 Backup method, storage system and program thereof
US20090177715A1 (en) * 2008-01-04 2009-07-09 Aten International Co., Ltd. Data backup device and system with the same
US7752168B2 (en) * 2008-02-07 2010-07-06 Novell, Inc. Method for coordinating peer-to-peer replicated backup and versioning based on usage metrics acquired from peer client
JP2009211401A (en) * 2008-03-04 2009-09-17 Hitachi Ltd Storage device and its control method
CN102088490B (en) * 2011-01-19 2013-06-12 华为技术有限公司 Data storage method, device and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059734A1 (en) * 2006-09-06 2008-03-06 Hitachi, Ltd. Storage subsystem and back-up/recovery method
CN101291205A (en) * 2008-06-16 2008-10-22 杭州华三通信技术有限公司 Backup data transmitting method, system, mirror-image server and customer terminal
CN101706805A (en) * 2009-10-30 2010-05-12 中国科学院计算技术研究所 Method and system for storing object
CN101808127A (en) * 2010-03-15 2010-08-18 成都市华为赛门铁克科技有限公司 Data backup method, system and server

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012097588A1 (en) * 2011-01-19 2012-07-26 华为技术有限公司 Data storage method, apparatus and system
CN102291466B (en) * 2011-09-05 2014-02-26 浪潮电子信息产业股份有限公司 Method for optimizing cluster storage network resource configuration
CN102291466A (en) * 2011-09-05 2011-12-21 浪潮电子信息产业股份有限公司 Method for optimizing cluster storage network resource configuration
CN104699419A (en) * 2013-12-09 2015-06-10 陈勋元 Operation method of distributed memory disk cluster storage system
CN104699419B (en) * 2013-12-09 2020-05-12 陈勋元 Operation method of distributed memory disk cluster storage system
CN105940658A (en) * 2015-01-04 2016-09-14 华为技术有限公司 A user data transmission method, apparatus and terminal
CN105940658B (en) * 2015-01-04 2019-04-26 华为技术有限公司 A kind of transmission method of user data, device and terminal
CN108804248A (en) * 2017-04-28 2018-11-13 南京壹进制信息技术股份有限公司 A kind of automatic Verification method of volume real-time guard data
CN110879760A (en) * 2018-09-05 2020-03-13 北京鲸鲨软件科技有限公司 Unified storage system and method and electronic equipment
CN110837441A (en) * 2019-11-14 2020-02-25 北京京航计算通讯研究所 KVM virtual machine backup method based on dirty data bitmap and network block equipment
CN110837442A (en) * 2019-11-14 2020-02-25 北京京航计算通讯研究所 KVM virtual machine backup system based on dirty data bitmap and network block equipment
CN112559445A (en) * 2020-12-11 2021-03-26 上海哔哩哔哩科技有限公司 Data writing method and device
CN112559445B (en) * 2020-12-11 2022-12-27 上海哔哩哔哩科技有限公司 Data writing method and device
CN113032768A (en) * 2021-03-31 2021-06-25 广州锦行网络科技有限公司 Authentication method, device, equipment and computer readable medium
CN113721857A (en) * 2021-09-05 2021-11-30 苏州浪潮智能科技有限公司 Method, equipment and storage medium for managing double-active storage system
CN113721857B (en) * 2021-09-05 2023-08-25 苏州浪潮智能科技有限公司 Dual-active storage system management method, device and storage medium
CN117421160A (en) * 2023-11-01 2024-01-19 广州鼎甲计算机科技有限公司 Data backup method, device, computer equipment and storage medium
CN117421160B (en) * 2023-11-01 2024-04-30 广州鼎甲计算机科技有限公司 Data backup method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2012097588A1 (en) 2012-07-26
CN102088490B (en) 2013-06-12

Similar Documents

Publication Publication Date Title
CN102088490B (en) Data storage method, device and system
US11516072B2 (en) Hybrid cluster recovery techniques
US8886796B2 (en) Load balancing when replicating account data
US7340637B2 (en) Server duplexing method and duplexed server system
CN101997823B (en) Distributed file system and data access method thereof
US8688773B2 (en) System and method for dynamically enabling an application for business continuity
CN101706805B (en) Method and system for storing object
US8856091B2 (en) Method and apparatus for sequencing transactions globally in distributed database cluster
US20070061379A1 (en) Method and apparatus for sequencing transactions globally in a distributed database cluster
US20060203718A1 (en) Method, apparatus and program storage device for providing a triad copy of storage data
CN102411639B (en) Multi-copy storage management method and system of metadata
CN110795503A (en) Multi-cluster data synchronization method and related device of distributed storage system
CN110912991A (en) Super-fusion-based high-availability implementation method for double nodes
CN102265277A (en) Operation method and device for data memory system
CN112559637B (en) Data processing method, device, equipment and medium based on distributed storage
CN110727709A (en) Cluster database system
CN101594256A (en) Disaster recovery method, device and system
CN107454171A (en) Message service system and its implementation
CN108512753B (en) Method and device for transmitting messages in cluster file system
CN112181723A (en) Financial disaster recovery method and device, storage medium and electronic equipment
CN103384882A (en) Method of managing usage rights in a share group of servers
CN109254873B (en) Data backup method, related device and system
CN105205160A (en) Data write-in method and device
EP3316114A1 (en) Data reading and writing method and device
CN102833096A (en) Method and device for implementation of low-cost high-availability system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170609

Address after: 510640 Guangdong City, Tianhe District Province, No. five, road, public education building, unit 371-1, unit 2401

Patentee after: Guangdong Gaohang Intellectual Property Operation Co., Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: Huawei Technologies Co., Ltd.

TR01 Transfer of patent right
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Huang He

Inventor after: Chen Zengdong

Inventor after: Wu Ying

Inventor before: Zhou Wenming

Inventor before: Zhong Yanpei

Inventor before: Wu Qing

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20170801

Address after: 510000, room 76, 2605 West Whampoa Avenue, Tianhe District, Guangdong, Guangzhou

Patentee after: Guangzhou biological Polytron Technologies Inc

Address before: 510640 Guangdong City, Tianhe District Province, No. five, road, public education building, unit 371-1, unit 2401

Patentee before: Guangdong Gaohang Intellectual Property Operation Co., Ltd.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130612

Termination date: 20210119