CN115437550A - Writing method of storage system and writing method of distributed storage system

Info

Publication number: CN115437550A
Application number: CN202110619995.XA
Authority: CN (China)
Prior art keywords: access key, storage system, write, data, command
Legal status: Pending (assumed; not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 杨国华, 朱文禧, 邓瑾
Assignee (current and original): Suzhou Kuhan Information Technology Co Ltd
Application filed by Suzhou Kuhan Information Technology Co Ltd
Priority date / filing date: 2021-06-03
Publication date: 2022-12-06

Classifications

    • G06F 3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/061: Interfaces specially adapted for storage systems; improving I/O performance
    • G06F 3/0629: Interfaces making use of a particular technique; configuration or reconfiguration of storage systems
    • G06F 3/067: Interfaces adopting a particular infrastructure; distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the field of storage technology and discloses a writing method for a storage system and a writing method for a distributed storage system. The storage system includes a journal area, a data storage area, and a metadata area. When a pre-write command is received, the storage system writes the received data into the data storage area according to a first access key and a second access key, and adds a log record corresponding to the written data in the journal area. When a confirm write command is received, the storage system identifies the written data through the first access key in the command, deletes the log record in the journal area, and updates the metadata area; until the confirm write command succeeds, the written data in the data storage area cannot be acquired using the second access key. When a rollback command is received, the written data is identified through the first access key in the command, the log record in the journal area is deleted, and the written data and its storage space in the data storage area are released.

Description

Writing method of storage system and writing method of distributed storage system
Technical Field
The present application relates to the field of storage technology, and in particular to a writing method for a storage system and a writing method for a distributed storage system.
Background
In a stand-alone storage system, a Write Ahead Log (WAL) is typically used to ensure atomicity of modifications to a target data block: data is recorded to the WAL before being modified, and the target data block is updated only after the WAL write succeeds. This requirement arises essentially from the mismatch between the logical data units of the service and the physical data units of the underlying storage: although the underlying physical units can be updated atomically, atomic updates of the logical data units cannot be guaranteed. As a result, every module along the IO path (applications, file systems, and storage devices such as SSDs) may introduce its own WAL, dramatically increasing the write amplification of the overall IO operation.
In a distributed system the problem is more serious, because the system must not only ensure atomicity of data modification on each storage node but also, under the coordination of a distributed consistency protocol, achieve agreement among multiple nodes on the content of a given piece of data, i.e., external consistency. Mainstream consistency protocols (for example, Raft) require a log system: before the nodes in the cluster reach consensus on a piece of data, the data is recorded only in the log system, not in the final data system. Only after the nodes in the cluster reach consensus is the data applied to the final data system.
The log system introduced in the above scenarios forces the application to submit two IO requests (two writes) for every data write/update operation, which increases the end-to-end latency of the IO link. Moreover, writing the same copy of data twice to the storage device causes write amplification; in this case the effective bandwidth utilization of the storage device is at most 50%, greatly wasting the device's processing capacity. For low-speed devices the already limited processing capacity is reduced further, substantially increasing the overall cost of the system.
For an NVMe SSD, acknowledging completion of an IO submitted by the upper layer means that subsequent queries for that data will return the latest version. If the service data needs to be rolled back to the pre-update version, the service system must implement support outside the SSD device, which complicates the service implementation and reduces the efficiency of data rollback.
Disclosure of Invention
To overcome these defects of existing systems, the present invention provides a data writing method for a distributed storage system; by fusing the method with a consistency protocol, the write amplification of the overall system is reduced. A storage system implemented according to the invention is aware of the rollback and commit operations of a consistency protocol (such as two-phase commit), improving the overall efficiency of the system.
One embodiment of the present application discloses a writing method for a storage system, where the storage system includes a first storage area, a second storage area, and a metadata area. The storage system performs the following steps:
when a pre-write command is received, writing the received data into the second storage area according to a first access key and a second access key, and adding a log record corresponding to the written data in the first storage area;
when a confirm write command is received, identifying the written data through the first access key in the confirm write command, deleting the log record in the first storage area, and updating the metadata area, wherein the written data in the second storage area cannot be acquired using the second access key until the confirm write command succeeds; and
when a rollback command is received, identifying the written data through the first access key in the rollback command, deleting the log record in the first storage area, and releasing the written data and its storage space in the second storage area.
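The following minimal Python sketch models these three areas and three commands, assuming the variant in which the pre-write command carries both access keys; all names (JournalingStore, prewrite, commit, rollback) are illustrative stand-ins, not the patent's actual NVMe interface.

```python
# Minimal model of the three-area storage system and its three commands.
# Illustrative only: names and structures are assumptions, not a real NVMe API.

class JournalingStore:
    def __init__(self):
        self.journal = {}    # first storage area: key1 -> log record
        self.data = {}       # second storage area: media slot -> payload
        self.metadata = {}   # metadata area: key2 -> media slot (L2P-style index)
        self._next_slot = 0

    def prewrite(self, key1, key2, payload):
        """Persist the payload once and journal it; invisible via key2 for now."""
        slot, self._next_slot = self._next_slot, self._next_slot + 1
        self.data[slot] = payload                          # the single media write
        self.journal[key1] = {"slot": slot, "key2": key2}  # log record with data pointer

    def commit(self, key1):
        """Confirm write: drop the log record and publish key2 in the index."""
        rec = self.journal.pop(key1)
        self.metadata[rec["key2"]] = rec["slot"]           # data now externally visible

    def rollback(self, key1):
        """Discard the pre-written data and release its media space."""
        rec = self.journal.pop(key1)
        del self.data[rec["slot"]]

    def read(self, key2):
        """Normal external access path; raises KeyError before commit."""
        return self.data[self.metadata[key2]]
```

Note that in this model prewrite performs the only physical write; commit merely publishes the already-persisted data by updating the metadata index, which is how the double write of a conventional WAL is avoided.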
In a preferred embodiment, the pre-write command includes both the first access key and the second access key.
In a preferred embodiment, the pre-write command includes neither access key; when the pre-write command is received, the storage system returns the first access key to the external application, and when the confirm write command is received, the storage system returns the second access key to the external application.
In a preferred embodiment, the pre-write command includes the first access key, and when the confirm write command is received, the storage system returns the second access key to the external application.
In a preferred embodiment, the pre-write command includes the second access key, and when the pre-write command is received, the storage system returns the first access key to the external application.
In a preferred embodiment, the confirm write command includes both the first access key and the second access key.
In a preferred embodiment, the confirm write command includes the first access key, and when the confirm write command is received, the storage system obtains the second access key through an internal association mechanism.
In a preferred embodiment, the rollback command includes both the first access key and the second access key.
In a preferred embodiment, the rollback command includes the first access key, and when the rollback command is received, the storage system obtains the second access key through an internal association mechanism.
In a preferred embodiment, the first access key and the second access key are, for the written data, a physical block address (PBA), a logical block address (LBA), a Zone defined by ZNS together with a relative address within the Zone, or a key of a key-value store (Key-Value Store), where the first access key is located in the first storage area and the second access key is located in the second storage area.
In a preferred embodiment, the first storage area is a log area.
In a preferred embodiment, the log record includes a data pointer to the written data.
In a preferred embodiment, the step of updating the metadata area further includes updating the index information of the metadata area used to access data in the second storage area, the index information including: an L2P (logical-to-physical) address translation table, a P2L (physical-to-logical) address translation table, data block wear-leveling statistics, or statistics required for data block garbage collection.
In a preferred embodiment, after the metadata area is updated, a message confirming that the write succeeded is returned.
In a preferred embodiment, the storage system is an SSD.
Another embodiment of the present application discloses a writing method for a distributed storage system, where the system includes at least two nodes, each node comprising an application and a storage system in communication with the application, the storage system comprising a first storage area, a second storage area, and a metadata area; the method comprises the following steps:
an application on one of the nodes initiates a data consistency resolution request to the other nodes using a first access key, a second access key, and the data to be written;
that application and the applications on all nodes that received the resolution request each issue a pre-write command to their local storage system; when a local storage system receives the pre-write command, it writes the received data into the second storage area according to the first access key and the second access key, and adds a log record corresponding to the written data in the first storage area;
when the initiating application confirms that the resolution meets the success condition, it broadcasts a resolution-success message carrying the first access key and the second access key to the other nodes, and the local application and the applications on all nodes that received the broadcast issue a confirm write command to their local storage systems; and
when a local storage system receives the confirm write command, it identifies the written data through the first access key in the confirm write command, deletes the log record in the first storage area, and updates the metadata area, wherein the written data in the second storage area cannot be acquired using the second access key until the confirm write command succeeds.
In a preferred embodiment, when the initiating application confirms that the resolution does not meet the success condition, it broadcasts a resolution-failure message carrying the first access key and the second access key to the other nodes, and the local application and the applications on all nodes that received the broadcast issue a rollback command to their local storage systems; when a local storage system receives the rollback command, it deletes the log record in the first storage area and releases the written data and its storage space in the second storage area.
In a preferred embodiment, the rollback command includes both the first access key and the second access key.
In a preferred embodiment, the rollback command includes the first access key, and when the rollback command is received, the storage system obtains the second access key through an internal association mechanism.
In a preferred embodiment, the pre-write command includes both the first access key and the second access key.
In a preferred embodiment, the pre-write command includes neither access key; when the pre-write command is received, the storage system returns the first access key to the external application, and when the confirm write command is received, the storage system returns the second access key to the external application.
In a preferred embodiment, the pre-write command includes the first access key, and when the confirm write command is received, the storage system returns the second access key to the external application.
In a preferred embodiment, the pre-write command includes the second access key, and when the pre-write command is received, the storage system returns the first access key to the external application.
In a preferred embodiment, the confirm write command includes both the first access key and the second access key.
In a preferred embodiment, the confirm write command includes the first access key, and when the confirm write command is received, the storage system obtains the second access key through an internal association mechanism.
In a preferred embodiment, the first access key and the second access key are, for the written data, a physical block address (PBA), a logical block address (LBA), a Zone defined by ZNS together with a relative address within the Zone, or a key of a key-value store (Key-Value Store), where the first access key is located in the first storage area and the second access key is located in the second storage area.
In a preferred embodiment, the log record includes a data pointer to the written data.
In a preferred embodiment, the step of updating the metadata area further includes updating the index information of the metadata area used to access data in the second storage area, the index information including: an L2P (logical-to-physical) address translation table, a P2L (physical-to-logical) address translation table, data block wear-leveling statistics, or statistics required for data block garbage collection.
Compared with the prior art, the method has the following beneficial effects:
1) Data writing and external data visibility are split into two explicit operations (pre-write and confirm write/rollback), giving the caller more flexible control over data visibility. Single-node applications built on this can implement data updates and rollback efficiently.
2) The distributed storage system can complete the resolution process of a consistency protocol such as two-phase commit (2PC) using the explicit two-phase write commands; the semantic match between the two makes the overall system design simpler and more natural.
3) By explicitly exposing the WAL capability inside the storage system to the application layer, the application layer reduces two writes of the same data to one, lowering write latency and improving write bandwidth. Fewer writes also improve the endurance of the SSD device, thereby reducing cost.
This specification describes a number of technical features distributed among the various technical solutions; listing all possible combinations of these features (i.e., all technical solutions) would make the specification excessively long. To avoid this, the technical features disclosed in the summary above, in the embodiments and examples below, and in the drawings may be freely combined with one another to form new technical solutions (all of which are deemed to be described in this specification), unless such a combination is technically infeasible. For example, if one example discloses feature A+B+C and another discloses feature A+B+D+E, where C and D are equivalent means for the same purpose of which technically only one can be used at a time, and E can technically be combined with C, then the solution A+B+C+D is not deemed described because the combination is infeasible, whereas the solution A+B+C+E is deemed described.
Drawings
FIG. 1 is a flow chart of the data writing method on a single storage system according to the present invention.
FIG. 2 is a diagram of the objects and flow involved when an application performs a data write to an SSD device without the present invention.
FIG. 3 is a diagram of the objects and flow involved in a successful data write on a single SSD device according to the present invention.
FIG. 4 is a diagram of the objects and flow involved in a data write rollback on a single SSD device according to the present invention.
FIG. 5 is a flow chart of the data writing method on a distributed storage system according to the present invention.
FIG. 6 is a diagram of the objects and flow typically involved in resolving a data write using a consistency protocol in a distributed system without the present invention.
FIG. 7 is a diagram of the objects and flow involved in a successful data write across multiple nodes of a distributed system according to the present invention.
FIG. 8 is a diagram of the objects and flow involved in a data write rollback across multiple nodes of a distributed system according to the present invention.
Detailed Description
In the following description, numerous technical details are set forth to provide a better understanding of the present application. However, those skilled in the art will understand that the technical solutions claimed in this application can be implemented without some of these details, and with various changes and modifications based on the following embodiments.
The following outlines some of the innovative points of the embodiments of the present application:
In conventional writing schemes, the application layer must write the same data twice because it must ensure the new data is persisted before it is applied to the final data area, so that if the service/program is unexpectedly interrupted or exits while applying the data to its final location, the entire operation can be redone from the log data; this guarantees that the modification of the final data area is atomic and no partial modification occurs. Furthermore, the storage hardware cannot perceive that two write operations sent to disk independently are related, and therefore cannot internally merge the two writes to the persistent medium into one. The present invention splits data writing and external data visibility into two explicit operations, defining three commands in NVMe: a pre-write (Prewrite) operation, a confirm write (Commit) operation, and a rollback (Rollback) operation. The pre-write operation writes the data to the SSD device; the written data is invisible to the outside but can be identified (or indirectly accessed) through Key1. The confirm write operation identifies the written data through Key1 and makes it retrievable through Key2, so the confirm write operation is what makes the written data externally visible. The rollback operation discards the written data, making it permanently invisible to the outside.
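These visibility rules can be exercised against the JournalingStore sketch given earlier in the disclosure (again, hypothetical names rather than real NVMe commands):

```python
# Key1 identifies pre-written data; Key2 only works after Commit.
store = JournalingStore()

store.prewrite(key1="wal-7", key2="lba-42", payload=b"new version")
try:
    store.read("lba-42")                  # Key2 access before Commit...
except KeyError:
    print("invisible before commit")      # ...fails: data not yet published

store.commit("wal-7")                     # Commit carries only Key1, no data
assert store.read("lba-42") == b"new version"

store.prewrite(key1="wal-8", key2="lba-43", payload=b"abandoned")
store.rollback("wal-8")                   # Rollback: permanently invisible
```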
To make the objects, technical solutions, and advantages of the present application clearer, embodiments of the present application are described in further detail below with reference to the accompanying drawings.
One embodiment of the application discloses a writing method for a storage system comprising a first storage area, a second storage area, and a metadata area. The storage system may be a storage node containing SSD storage devices, host CPUs, and other components; in other embodiments, the storage system may be an SSD storage device alone, without a host. The first storage area may be a journal (log) area, and the second storage area is the final data storage area. FIG. 1 shows a flow chart of the writing method. The storage system performs the following steps:
step 101, when a pre-write command is received, writing the received data into the data storage area according to the first access key and the second access key, and adding a log record corresponding to the written data into the log area.
In one embodiment, the first access key and the second access key are included in the pre-write command. For example, the first access key and the second access key may be specified in a pre-write command.
In another embodiment, the pre-write command may not include the first access key and the second access key, the storage system may return the first access key to the external application when the pre-write command is received, and the storage system may return the second access key to the external application when the confirm write command is received.
In another embodiment, the first access key is included in the pre-write command, e.g., the first access key may be specified in the pre-write command. The storage system may return the second access key to an external application when the confirm write command is received.
In another embodiment, the second access key is included in the pre-write command, e.g., the second access key may be specified in the pre-write command. When the pre-write command is received, the storage system may return the first access key to an external application.
In one embodiment, the first access key and the second access key are, for the written data, a physical block address (PBA), a logical block address (LBA), a Zone defined by ZNS together with a relative address within the Zone, or a key of a key-value store (Key-Value Store); the first access key is located in the log area and the second access key in the data storage area.
It should be understood that the first access key and the second access key may not be limited to the aforementioned enumerated forms, but may also take other forms known to those skilled in the art or known in the future.
It should be noted that the first access key is mainly used to identify the written data and is used during confirmation or rollback. For example, the first access key may be the identification information of a log record. In special cases the written data may also be accessed indirectly through the first access key, for example by fetching the log record via the first access key and then accessing the written data through the record. The second access key may be the address of the written data used for normal external access.
Step 102: when a confirm write command is received, identify the written data through the first access key in the confirm write command, delete the log record in the log area, and update the metadata area. The confirm write command does not carry the written data.
In one embodiment, the log record includes a data pointer to the written data.
In one embodiment, the step of updating the metadata area further includes updating the index information of the metadata area used to access data in the data storage area, the index information including: an L2P (logical-to-physical) address translation table, a P2L (physical-to-logical) address translation table, data block wear-leveling statistics, or statistics required for data block garbage collection. It should be understood that the data structures contained in the metadata are not limited to the listed forms and may take other forms known to those skilled in the art now or in the future.
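As a sketch of what the commit-time index update might look like, assuming a dictionary-based L2P/P2L pair (the real tables, wear-leveling counters, and GC statistics are firmware-specific):

```python
# Hypothetical commit-time metadata update: publish the new mapping and
# mark the superseded physical block as reclaimable.
l2p = {}  # logical address -> physical address
p2l = {}  # physical address -> logical address

def commit_index_update(lba, new_pba):
    old_pba = l2p.get(lba)
    l2p[lba] = new_pba            # readers of this LBA now see the new data
    p2l[new_pba] = lba
    if old_pba is not None:
        p2l.pop(old_pba, None)    # old block becomes garbage-collectable;
        # wear-leveling and GC statistics for old_pba would be updated here
```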
In one embodiment, after the metadata area is updated, a message confirming that the write succeeded is returned. Until the confirm write command succeeds, the written data in the data storage area cannot be retrieved using the second access key. Note that after the confirm write command is executed, it is not required that the data become inaccessible through the first access key: the log-record cleanup may be performed immediately when the confirm write command is received, or it may be deferred, with the confirm-write-success message returned first; until the deferred cleanup runs, the first access key may still be used to access the written data.
In one embodiment, the first access key and the second access key may be included in the confirm write command.
In another embodiment, the first access key may be included in the confirm write command, and the storage system acquires the second access key through an internal association mechanism when the confirm write command is received.
Step 103: when a rollback command is received, identify the written data through the first access key in the rollback command, delete the log record in the log area, and release the written data and its storage space in the data storage area.
In one embodiment, the first access key and the second access key may be included in the rollback command.
In one embodiment, the rollback command may include the first access key, and when the rollback command is received, the storage system acquires the second access key through an internal association mechanism.
To better understand the technical solution of the present application, the method of writing data to the storage device of a single node is described below with a specific example; the listed details are mainly for ease of understanding and are not intended to limit the scope of the present application.
FIG. 2 illustrates the objects and flow involved when an application performs a data write to an SSD device without the present invention. The application initiates a first write request to record the data to the WAL; after the WAL write succeeds, the application initiates a second write request to update the data at its final data file address. The application must submit two write requests, which increases the end-to-end latency of the IO link, and the same copy of data is written to the storage device twice, causing write amplification.
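For contrast, the conventional FIG. 2 path can be sketched as follows; BlockDev is an illustrative stand-in for any block store, and the 8192-byte total makes the 2x write amplification concrete:

```python
# Conventional double write: the same 4 KiB payload hits the device twice.
class BlockDev:
    def __init__(self):
        self.blocks = {}
        self.bytes_written = 0

    def write(self, addr, payload):
        self.blocks[addr] = payload
        self.bytes_written += len(payload)

def conventional_update(dev, wal_addr, data_addr, payload):
    dev.write(wal_addr, payload)    # write 1: persist to the WAL first
    dev.write(data_addr, payload)   # write 2: same bytes, final location
    del dev.blocks[wal_addr]        # log record discarded after apply

dev = BlockDev()
conventional_update(dev, "wal-0", "lba-42", b"x" * 4096)
print(dev.bytes_written)            # 8192 bytes written for a 4096-byte update
```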
FIG. 3 shows the objects and flow involved in a successful data write on a single SSD device according to the present invention, comprising the following steps:
Step 201: the application initiates a data pre-write operation and submits it to the SSD device through the NVMe Prewrite command.
Step 202: after allocating space on the persistent medium (i.e., the data storage area), the SSD controller writes the data to the newly allocated media space.
Step 203: the SSD controller creates a log entry in the WAL; the log entry contains a pointer to the data on the persistent medium, through which the pre-written data can be accessed.
Step 204: the SSD controller returns pre-write success upward, and guarantees that the written data can still be retrieved through Key1 after the SSD device is subsequently powered down and reloaded, i.e., the pre-write is persistent.
Step 205: after receiving the pre-write-success response and after confirmation by its service logic, the application initiates a confirm write operation and submits it to the SSD device through the NVMe Commit command; at this point no data needs to be carried, only Key1 to identify the data.
Step 206: the SSD controller clears the log entry in the WAL.
Step 207: the SSD controller updates internal metadata structures such as the L2P table, so that subsequent accesses to the data through Key2 see the latest data.
Step 208: the SSD controller returns confirm-write success upward.
In this way, the entire write operation produces only one write to the persistent medium (e.g., NAND), increasing the effective write bandwidth of the SSD device. Key1 and Key2 in this example can be embodied as the LBAs corresponding to the space allocated to the data on the persistent medium.
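Replaying steps 201-208 against the illustrative JournalingStore sketch shows the single-write property (Key1/Key2 written here as LBA-style strings, per the embodiment above):

```python
# Single-node success path: one media write for the whole update.
store = JournalingStore()
store.prewrite(key1="lba-journal-7", key2="lba-100", payload=b"v2")  # steps 201-204
store.commit("lba-journal-7")                                        # steps 205-208
assert store.read("lba-100") == b"v2"
assert len(store.data) == 1        # exactly one copy reached the medium
```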
FIG. 4 shows the objects and flow involved in a data write rollback on a single SSD device according to the present invention, comprising the following steps:
Step 301: the application initiates a data pre-write operation and submits it to the SSD device through the NVMe Prewrite command.
Step 302: after allocating space on the persistent medium, the SSD controller writes the data to the newly allocated media space.
Step 303: the SSD controller creates a log entry in the WAL; the log entry contains a pointer to the data on the persistent medium, through which the pre-written data can be accessed.
Step 304: the SSD controller returns pre-write success upward, and guarantees that the written data can still be retrieved through Key1 after the SSD device is subsequently powered down and reloaded, i.e., the pre-write is persistent.
Step 305: after receiving the pre-write-success response and after confirmation by its service logic, the application initiates a rollback operation and submits it to the SSD device through the NVMe Rollback command; no data needs to be carried, only Key1 to identify the data.
Step 306: the SSD controller clears the log entry in the WAL.
Step 307: the SSD controller releases the data stored on the persistent medium and its space.
Step 308: the SSD controller returns rollback success upward.
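The rollback path of steps 301-308, against the same illustrative sketch:

```python
# Single-node rollback path: log entry cleared, media space released.
store = JournalingStore()
store.prewrite(key1="wal-9", key2="lba-200", payload=b"tentative")  # steps 301-304
store.rollback("wal-9")                                             # steps 305-308
assert "wal-9" not in store.journal   # log entry cleared (step 306)
assert len(store.data) == 0           # media space released (step 307)
```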
Another embodiment of the present application discloses a writing method for a distributed storage system comprising at least two nodes, each node comprising an application and a storage system in communication with the application, the storage system comprising a journal area, a data storage area, and a metadata area. FIG. 5 shows a flow chart of the writing method of the distributed storage system. It should be noted that the terms and techniques described for the writing method of the foregoing storage system also apply here. The writing method comprises the following steps:
in step 401, an application of one of said nodes initiates a data consistency resolution request to the other node using the first access key, the second access key and the data to be written.
Step 402, the application program and all the application programs of the nodes receiving the resolution request initiate a pre-write command to the local storage system, and when the local storage system receives the pre-write command, the local storage system writes the received data into the data storage area according to the first access key and the second access key, and adds a log record corresponding to the written data in the log area.
And step 403, when the application program confirms that the resolution meets the success condition, broadcasting a resolution success message to other nodes by using the first access key and the second access key, and initiating a confirmation write command to the local storage system by the local application program and the application programs of all the nodes receiving the broadcast.
Step 404, when the local storage system receives the write-in command, identifying the write-in data through a first access key in the write-in command, deleting the log record in the log area, and updating the metadata area, wherein before the write-in command is successfully confirmed, the write-in data in the data storage area cannot be acquired by using a second access key.
Step 405, when the application program confirms that the resolution does not meet the success condition, the first access key and the second access key are used to broadcast the resolution failure message to other nodes, and the local application program and the application programs of all the nodes receiving the broadcast initiate a rollback command to the local storage system.
Step 406, when the local storage system receives the rollback command, deleting the log record in the log area and releasing the written data and the storage space thereof in the data storage area.
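A minimal sketch of steps 401-406, assuming each node wraps one JournalingStore and modeling the resolution simply as unanimous pre-write votes (a stand-in for a real 2PC/Raft round, not an implementation of one):

```python
# Two-node replication driven by the explicit two-phase write commands.
class Node:
    def __init__(self):
        self.store = JournalingStore()

    def on_prewrite(self, key1, key2, payload):   # step 402
        self.store.prewrite(key1, key2, payload)
        return True                               # vote: local pre-write succeeded

    def on_commit(self, key1):                    # steps 403-404
        self.store.commit(key1)

    def on_rollback(self, key1):                  # steps 405-406
        self.store.rollback(key1)

def replicate(nodes, key1, key2, payload):
    votes = [n.on_prewrite(key1, key2, payload) for n in nodes]  # steps 401-402
    if all(votes):                                # success condition met
        for n in nodes:
            n.on_commit(key1)                     # broadcast confirm write
    else:
        for n in nodes:
            n.on_rollback(key1)                   # broadcast rollback

cluster = [Node(), Node()]
replicate(cluster, "wal-1", "lba-500", b"replicated payload")
assert all(n.store.read("lba-500") == b"replicated payload" for n in cluster)
```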
To better understand the technical solution of the present application, the method of writing data in a distributed storage system is described below with a specific example; the listed details are mainly for ease of understanding and are not intended to limit the scope of the present application.
FIG. 6 illustrates the objects and flow involved in resolving a data write using a consistency protocol in a distributed system without the present invention. FIG. 6 takes two nodes as an example. As in FIG. 2, each application must submit two write requests, which increases the end-to-end latency of the IO link, and the same copy of data is written to the storage device twice, causing write amplification.
FIG. 7 shows the objects and flow involved in a successful data write across multiple nodes of a distributed system (two nodes are taken as an example) according to the present invention, comprising the following steps:
Step 501: an application initiates a distributed consistency protocol resolution operation for writing a piece of data.
Step 502: the application on each node sends a request to its node's SSD through the NVMe Prewrite command; the command carries the application-layer data.
Step 503: the SSD controller allocates space on the persistent medium and writes the data to the media space.
Step 504: the SSD controller creates a log entry in the WAL; the log entry contains a pointer to the data on the persistent medium, through which the pre-written data can be accessed.
Step 505: the SSD controller returns pre-write success upward, and guarantees that the written data can still be recovered after a subsequent power-down restart, i.e., the pre-write is persistent.
Step 506: the application that initiated the resolution waits for the other nodes' responses to the consistency resolution, and decides to commit the write if the collected responses satisfy the resolution-success condition.
Step 507: the application initiates a confirm write operation to the local SSD device and submits it through the NVMe Commit command; no data needs to be carried.
Step 508: the SSD controller clears the log entry in the WAL.
Step 509: the SSD controller updates internal metadata structures such as the L2P table, ensuring that subsequent accesses to the data see the latest version.
Step 510: the SSD controller returns confirm-write success.
Step 511 (performed in parallel with step 507): the resolution-initiating application broadcasts the resolution result to the other nodes in the cluster.
Steps 512-515: after receiving the resolution result, the other nodes initiate confirm write operations to their local SSD devices; these steps are the same as steps 507-510.
A characteristic of distributed systems is that a resolution request may fail to pass, for example due to a node failure, and must then be rolled back. FIG. 8 shows the objects and flow involved in a data write rollback across multiple nodes of a distributed system (two nodes are taken as an example) according to the present invention, comprising the following steps:
Step 601: an application initiates a distributed consistency resolution operation for writing a piece of data.
Step 602: the application on each node sends the request data to its SSD through the NVMe Prewrite command; the command carries the application-layer data.
Step 603: the SSD controller allocates space on the persistent medium and writes the data to the media space.
Step 604: the SSD controller creates a log entry in the WAL; the log entry contains a pointer to the data on the persistent medium, through which the pre-written data can be accessed.
Step 605: the SSD controller returns pre-write success upward, and guarantees that the written data can still be recovered after a subsequent power-down restart, i.e., the pre-write is persistent.
Step 606: the application that initiated the resolution waits for the other nodes' responses to the consistency resolution.
Step 607: if the nodes in the cluster do not form a resolution, the application decides to roll back the write, initiates a rollback operation to the local SSD device, and submits it through the NVMe Rollback command; no data needs to be carried.
Step 608: the SSD controller clears the log entry in the WAL.
Step 609: the SSD controller deletes the pre-written data on the persistent medium and frees its space.
Step 610: the SSD controller returns rollback success upward.
Step 611 (performed in parallel with step 607): the resolution-initiating application broadcasts the resolution result to the other nodes in the cluster.
Steps 612-615: after receiving the resolution result, the other nodes initiate rollback operations to their local SSD devices; these steps are the same as steps 607-610.
The scheme provided by the invention can be conveniently used in a distributed system: it reduces the write amplification of each single node while achieving consistency of the distributed data and guaranteeing its persistence.
It is noted that in this patent application, relational terms such as first and second are used solely to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between them. The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus comprising a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus comprising that element. In this application, if an action is said to be performed according to an element, it means the action is performed according to at least that element, covering two cases: performing the action based on that element only, and performing it based on that element together with other elements. Expressions such as "a plurality of" mean two or more.
All documents mentioned in this specification are considered to be incorporated in their entirety into the disclosure of the present application, so that they may serve as a basis for amendment where necessary. It should further be understood that the above description covers only preferred embodiments of this specification and is not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of one or more embodiments of this specification shall fall within their scope of protection.
In some cases, the actions or steps recited in the claims can be performed in a different order from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Claims (28)

1. A writing method of a storage system, the storage system comprising: a first storage area, a second storage area, and a metadata area; the storage system performing the steps of:
when a pre-write command is received, writing the received data into the second storage area according to a first access key and a second access key, and adding a log record corresponding to the written data in the first storage area;
when a confirm write command is received, identifying the written data through the first access key in the confirm write command, deleting the log record in the first storage area, and updating the metadata area, wherein the written data in the second storage area cannot be acquired using the second access key until the confirm write command succeeds; and
when a rollback command is received, identifying the written data through the first access key in the rollback command, deleting the log record in the first storage area, and releasing the written data and its storage space in the second storage area.
2. The writing method of the storage system according to claim 1, wherein the pre-write command includes the first access key and the second access key.
3. The writing method of the storage system according to claim 1, wherein
the first access key and the second access key are not included in the pre-write command;
when the pre-write command is received, the storage system returns the first access key to an external application; and
when the confirm write command is received, the storage system returns the second access key to the external application.
4. The writing method of the storage system according to claim 1, wherein
the pre-write command includes the first access key; and
when the confirm write command is received, the storage system returns the second access key to the external application.
5. The writing method of the storage system according to claim 1, wherein
the pre-write command includes the second access key; and
when the pre-write command is received, the storage system returns the first access key to an external application.
6. The writing method of the storage system according to claim 1, wherein the first access key and the second access key are included in the confirm write command.
7. The writing method of the storage system according to claim 1, wherein the confirm write command includes the first access key, and when the confirm write command is received, the storage system obtains the second access key through an internal association mechanism.
8. The writing method of the storage system according to claim 1, wherein the rollback command includes the first access key and the second access key.
9. The writing method of the storage system according to claim 1, wherein the rollback command includes the first access key, and when the rollback command is received, the storage system obtains the second access key through an internal association mechanism.
10. The writing method of the storage system according to claim 1, wherein the first access key and the second access key are, for the written data, a physical address, a logical address, a Zone defined by ZNS together with a relative address within the Zone, or a key of a key-value store, wherein the first access key is located in the first storage area and the second access key is located in the second storage area.
11. The writing method of the storage system according to claim 1, wherein the first storage area is a log area.
12. The writing method of the storage system according to claim 1, wherein the log record includes a data pointer to the written data.
13. The writing method of the storage system according to claim 1, wherein the step of updating the metadata area further comprises updating index information of the metadata area used to access data in the second storage area, the index information comprising: an L2P logical-to-physical address translation table, a P2L physical-to-logical address translation table, data block wear-leveling statistics, or statistics required for data block garbage collection.
14. The writing method of the storage system according to claim 1, further comprising, after the metadata area is updated, returning a message confirming that the write succeeded.
15. The writing method of the storage system according to claim 1, wherein the storage system is an SSD.
16. A method for writing to a distributed storage system, the system comprising: at least two nodes, each node comprising an application and a storage system in communication with the application, the storage system comprising a first storage area, a second storage area, and a metadata area; the method comprising the following steps:
an application on one of the nodes initiates a data consistency resolution request to the other nodes using a first access key, a second access key, and the data to be written;
that application and the applications on all nodes that received the resolution request each issue a pre-write command to their local storage system; when a local storage system receives the pre-write command, it writes the received data into the second storage area according to the first access key and the second access key, and adds a log record corresponding to the written data in the first storage area;
when the initiating application confirms that the resolution meets the success condition, it broadcasts a resolution-success message carrying the first access key and the second access key to the other nodes, and the local application and the applications on all nodes that received the broadcast issue a confirm write command to their local storage systems; and
when a local storage system receives the confirm write command, it identifies the written data through the first access key in the confirm write command, deletes the log record in the first storage area, and updates the metadata area, wherein the written data in the second storage area cannot be acquired using the second access key until the confirm write command succeeds.
17. The writing method according to claim 16, further comprising:
when the initiating application confirms that the resolution does not meet the success condition, it broadcasts a resolution-failure message carrying the first access key and the second access key to the other nodes, and the local application and the applications on all nodes that received the broadcast issue a rollback command to their local storage systems; and
when a local storage system receives the rollback command, it deletes the log record in the first storage area and releases the written data and its storage space in the second storage area.
18. The writing method according to claim 17, wherein the first access key and the second access key are included in the rollback command.
19. The writing method according to claim 17, wherein the rollback command includes the first access key, and when the rollback command is received, the storage system obtains the second access key through an internal association mechanism.
20. The writing method according to claim 16, wherein the pre-write command includes the first access key and the second access key.
21. The writing method according to claim 16, wherein
the first access key and the second access key are not included in the pre-write command;
when the pre-write command is received, the storage system returns the first access key to an external application; and
when the confirm write command is received, the storage system returns the second access key to the external application.
22. The writing method according to claim 16, wherein
the pre-write command includes the first access key; and
when the confirm write command is received, the storage system returns the second access key to the external application.
23. The writing method according to claim 16, wherein
the pre-write command includes the second access key; and
when the pre-write command is received, the storage system returns the first access key to an external application.
24. The writing method of claim 16, wherein the first access key and the second access key are included in the confirm write command.
25. The writing method according to claim 16, wherein the confirm write command includes the first access key, and when the confirm write command is received, the storage system obtains the second access key through an internal association mechanism.
26. The writing method according to claim 16, wherein the first access key and the second access key are, for the written data, a physical address, a logical address, a Zone defined by ZNS together with a relative address within the Zone, or a key of a key-value store, wherein the first access key is located in the first storage area and the second access key is located in the second storage area.
27. The writing method of claim 16, wherein the log record includes a data pointer to the written data.
28. The writing method according to claim 16, wherein the step of updating the metadata area further comprises updating index information of the metadata area used to access data in the second storage area, the index information comprising: an L2P logical-to-physical address translation table, a P2L physical-to-logical address translation table, data block wear-leveling statistics, or statistics required for data block garbage collection.
CN202110619995.XA (filed 2021-06-03, priority date 2021-06-03): Writing method of storage system and writing method of distributed storage system. Status: Pending. Published as CN115437550A.

Priority Applications (1)

Application Number: CN202110619995.XA; Priority Date: 2021-06-03; Filing Date: 2021-06-03; Title: Writing method of storage system and writing method of distributed storage system

Publications (1)

Publication Number: CN115437550A; Publication Date: 2022-12-06

Family

ID: 84271748
Family Applications (1): CN202110619995.XA (pending): Writing method of storage system and writing method of distributed storage system
Country Status (1): CN (CN115437550A)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115794497A (en) * 2023-02-08 2023-03-14 成都佰维存储科技有限公司 SSD power failure solution method and device, readable storage medium and electronic equipment

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination