WO2019028799A1 - A data access method, apparatus, and system - Google Patents

A data access method, apparatus, and system

Info

Publication number
WO2019028799A1
Authority
WO
WIPO (PCT)
Prior art keywords: data, storage node, read, node, storage
Application number: PCT/CN2017/096958
Other languages: English (en), French (fr)
Inventors: 刘华伟, 胡瑜, 陈灿, 刘金水, 李晓初, 谭春毅
Original Assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2017/096958 (WO2019028799A1)
Priority to JP2020507656A (JP7105870B2)
Priority to EP23177282.3A (EP4273688A3)
Priority to CN202110394808.2A (CN113485636B)
Priority to KR1020207006897A (KR20200037376A)
Priority to CN201780002892.0A (CN108064374B)
Priority to EP17920626.3A (EP3657315A4)
Publication of WO2019028799A1
Priority to US16/785,008 (US11416172B2)
Priority to US17/872,201 (US11748037B2)
Priority to US18/353,334 (US20230359400A1)

Classifications

    • G06F 3/0665: Virtualisation aspects at area level, e.g. provisioning of virtual or logical volumes
    • G06F 3/0629: Configuration or reconfiguration of storage systems
    • G06F 3/0604: Improving or facilitating administration, e.g. storage management
    • G06F 11/2221: Detection or location of defective computer hardware by testing during standby operation or during idle time, using arrangements specific to the hardware being tested, to test input/output devices or peripheral units
    • G06F 12/1009: Address translation using page tables, e.g. page table structures
    • G06F 3/061: Improving I/O performance
    • G06F 3/0611: Improving I/O performance in relation to response time
    • G06F 3/0617: Improving the reliability of storage systems in relation to availability
    • G06F 3/064: Management of blocks
    • G06F 3/0653: Monitoring storage devices or systems
    • G06F 3/0656: Data buffering arrangements
    • G06F 3/0659: Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G06F 3/067: Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F 3/0679: Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]
    • G06F 3/0689: Disk arrays, e.g. RAID, JBOD

Definitions

  • the present application relates to the field of storage technologies, and in particular, to a data access method, apparatus, and system.
  • FIG. 1 is a schematic structural diagram of a storage system provided by the prior art.
  • the storage system is connected to the host through two switches.
  • the storage system also includes a plurality of dual control arrays coupled to each switch.
  • Each dual control array includes two storage controllers and a plurality of hard disk drives (HDDs) connected to each storage controller.
  • Two storage controllers are connected through a redundant mirror channel to implement mirroring operations in the write data flow.
  • each dual control array acts as a dual control array unit, and each dual control array unit corresponds to a part of the logical block address (LBA) range of the host.
  • the read/write request sent by the host is forwarded by the switch to the dual control array unit corresponding to the LBA carried by the read/write request. Then, the dual control array unit locally performs read/write operations of data.
  • the system architecture shown in Figure 1 is based on HDD.
  • compared with HDDs, the performance of non-volatile memory express (NVMe) solid state disks (SSDs) has been enhanced by hundreds or even thousands of times; for example, Intel's P3600 NVMe SSD reaches 450,000 read-only IOPS and 70,000 write-only IOPS, where IOPS is the English abbreviation for the number of input/output operations per second.
  • however, because the processing in the architecture shown in FIG. 1 is concentrated on the two storage controllers, and the processing capability of a storage controller is limited, the dual-control array storage architecture shown in FIG. 1 is no longer suitable for a storage system that uses NVMe SSDs as storage media, and a new system architecture urgently needs to be provided.
  • the embodiment of the present application provides a data access method, apparatus, and system for providing a storage system suitable for using an NVMe SSD as a storage medium.
  • the first aspect provides a data access method, which is applied to a first storage node in a storage system, where the first storage node communicates, through a switch, with the host and with at least one second storage node in the storage system, and a physical disk included in the at least one second storage node is mapped to a virtual disk of the first storage node.
  • the method may include: receiving a first write request, where the first write request carries first data to be written; striping the first data to be written to obtain stripe data; writing the stripe data into a physical disk and/or a virtual disk of the first storage node; and recording the write location of the stripe data.
  • the first storage node may be any one of the storage nodes in the storage system.
  • the first write request received by the first storage node may be a first write request sent by the host, or may be a first write request from the host forwarded by any one of the second storage nodes.
  • in this technical solution, part or all of the physical disks (for example, memory chips) included in each storage node may be mapped to other storage nodes as virtual disks of the other storage nodes, such as, but not limited to, through mapping by the NOF protocol; thus, compared with the prior art, the solution is free from the limitation of the processing capability of the CPU or the storage controller in a dual-control array, and the processing capability of the storage system can be greatly improved.
  • the striped data is written to the physical disk of the second storage node that maps the virtual disk.
  • the first storage node sends the corresponding stripe data to the corresponding second storage node, and the second storage node then stores the received data in its local disk (that is, the physical disk mapped as the virtual disk).
  • the fingerprint of the first data to be written is also recorded when the write position of the stripe data is recorded.
  • the write position of the stripe data and the fingerprint of the first data to be written are recorded in the distribution information of the first data to be written.
  • the LBA of the first data to be written is also recorded, wherein the LBA is the LBA carried in the write request.
  • the write position of the stripe data and the LBA of the first data to be written are recorded in the distribution information of the first data to be written.
  • the first storage node may also perform other steps:
  • the first storage node may receive a second write request sent by the host, where the second write request carries second data to be written; then, the home node of the second write request is determined according to the second write request. If the home node of the second write request is the first storage node, the first storage node performs a write operation for the second write request; if the home node of the second write request is a second storage node, the first storage node forwards the second write request to the second storage node, so that the second storage node performs the write operation for the second write request (see the sketch below).
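  • As an illustration only, the routing decision described above could be sketched as follows; determine_home_node, write_locally and forward_to are hypothetical helpers, not names taken from the application:

```python
def handle_write_request(self_node_id, write_request,
                         determine_home_node, write_locally, forward_to):
    """Route a write request received from the host (hypothetical helper names).

    determine_home_node: maps the request to a home node ID, e.g. by the
                         fingerprint of the data to be written or by the carried LBA.
    write_locally:       performs the write operation on this (first) storage node.
    forward_to:          sends the request on to another (second) storage node.
    """
    home_node_id = determine_home_node(write_request)
    if home_node_id == self_node_id:
        return write_locally(write_request)         # first storage node is the home node
    return forward_to(home_node_id, write_request)  # second storage node performs the write
```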
  • for the write operation, reference may be made to the technical solution provided above or to the specific implementations below; details are not described here again.
  • determining the home node of the second write request according to the second write request may include: calculating a fingerprint of the second data to be written; and then determining the home node of the second write request according to the fingerprint of the second data to be written.
  • the home node of the second write request is specifically the home node of the second data to be written.
  • the method may further include: determining a home node of the LBA carried by the second write request, where the home node of the LBA is configured to manage the mapping relationship between the LBA and the fingerprint of the second data to be written.
  • determining the home node of the second write request according to the second write request may include: determining, according to the LBA carried by the second write request, the home node of the second write request.
  • the home node of the second write request is specifically the home node of the LBA carried by the second write request.
  • the above is an example of the steps performed by the first storage node in the write data flow.
  • the following describes the steps performed by the first storage node in the read data flow.
  • the first storage node receives the fingerprint of the first data to be read requested by a first read request; then, according to the fingerprint of the first data to be read, it obtains the write position of the first data to be read and reads the stripe data of the first data to be read from that write position.
  • the mapping relationship between the write location of the first data to be written and the fingerprint of the first data to be written is stored in the first storage node.
  • the first storage node receives a first read request, where the first read request carries a first LBA; then, according to the first LBA, the write position of the first data to be read requested by the first read request is obtained, and the stripe data of the first data to be read is read from that write position.
  • the first storage node stores a mapping relationship between the write location of the first data to be written and the first LBA.
  • the first storage node may perform other steps:
  • the first storage node receives the second read request sent by the host;
  • the home node of the second read request is determined according to the second read request. If the home node of the second read request is the first storage node, the first storage node performs a read operation for the second read request; if the home node of the second read request is a second storage node, the first storage node forwards the second read request to the second storage node, so that the second storage node performs the read operation for the second read request.
  • for the read operation, reference may be made to the technical solution provided above or to the specific implementations below; details are not described here again.
  • determining the home node of the second read request according to the second read request may include: determining the home node of the LBA carried by the second read request, where the home node of the LBA is used to manage the mapping relationship between the LBA and the fingerprint of the second data to be read requested by the second read request; then acquiring the fingerprint of the second data to be read from the home node of that LBA; and determining the home node of the second read request according to the fingerprint of the second data to be read.
  • alternatively, the home node of the LBA carried by the second read request is determined, and then both the fingerprint of the second data to be read and the home node of the second read request are obtained from the home node of that LBA.
  • the home node of the second read request is specifically the home node of the second data to be read.
  • determining the home node of the second read request according to the second read request may include: determining, according to the LBA carried by the second read request, the home node of the second read request.
  • the home node of the second read request is specifically the home node of the LBA carried by the second read request.
  • in a second aspect, a storage node is provided. The storage node may be divided into function modules according to the foregoing method example; for example, each function module may correspond to one function, or two or more functions may be integrated into one processing module.
  • in a third aspect, a storage node is provided, comprising a memory and a processor, where the memory is configured to store a computer program that, when executed by the processor, causes any of the methods provided in the first aspect to be performed.
  • the memory may be an internal memory and/or a memory chip, or the like.
  • the processor may be a CPU and/or a storage controller, or the like.
  • in another aspect, a data access system is provided, comprising any one of the storage nodes provided in the second aspect or the third aspect, where the storage node communicates, through a switch, with the host and with at least one second storage node in the storage system, and a physical disk included in the at least one second storage node is mapped to a virtual disk of the storage node.
  • the present application also provides a computer readable storage medium having stored thereon a computer program that, when executed on a computer, causes the computer to perform the method described in the first aspect above.
  • the application also provides a computer program product which, when run on a computer, causes the computer to perform the method of any of the above aspects.
  • the present application also provides a communication chip in which instructions are stored that, when run on a storage node, cause the storage node to perform the method described in the first aspect above.
  • any of the devices, computer storage media, or computer program products provided above is used to perform the corresponding method provided above; therefore, for the beneficial effects that can be achieved, reference may be made to the beneficial effects of the corresponding method, which are not repeated here.
  • FIG. 1 is a schematic structural diagram of a storage system provided by the prior art
  • FIG. 2 is a schematic diagram of a system architecture applicable to the technical solution provided by the embodiment of the present application.
  • FIG. 3 is a schematic diagram of mapping between a physical disk and a virtual disk according to an embodiment of the present disclosure
  • FIG. 4a is a front view of a hardware form of the system architecture shown in FIG. 2;
  • FIG. 4b is a rear view of a hardware form of the system architecture shown in FIG. 2;
  • FIG. 4c is a top view of a hardware form of the system architecture shown in FIG. 2;
  • FIG. 5 is a schematic diagram of an expanded system architecture of the system architecture shown in FIG. 2;
  • FIG. 6 is a first flowchart of a method for writing data according to an embodiment of the present application.
  • FIG. 7 is a flowchart of a method for reading data based on FIG. 6 according to an embodiment of the present application.
  • FIG. 8 is a second flowchart of a method for writing data according to an embodiment of the present disclosure.
  • FIG. 9 is a flowchart of a method for reading data based on FIG. 8 according to an embodiment of the present application.
  • FIG. 10 is a third flowchart of a method for writing data according to an embodiment of the present disclosure.
  • FIG. 11 is a flowchart of a method for reading data based on FIG. 10 according to an embodiment of the present application.
  • FIG. 12 is a fourth flowchart of a method for writing data according to an embodiment of the present disclosure.
  • FIG. 13 is a flowchart of a method for reading data based on FIG. 12 according to an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a storage node according to an embodiment of the present disclosure.
  • the term "plurality” as used herein refers to two or more.
  • the terms “first”, “second”, etc. are used herein only to distinguish different objects, and the order is not limited.
  • the first storage node and the second storage node are used to distinguish different objects, and the order is not limited.
  • the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A exists alone, both A and B exist, and B exists alone.
  • the character "/" in this document generally indicates an "or" relationship between the associated objects.
  • FIG. 2 is a schematic diagram showing a system architecture to which the technical solution provided by the present application is applied.
  • the system architecture shown in FIG. 2 can include a host 1 and a storage system 2.
  • the storage system 2 may include a switch 21, and a plurality of storage nodes 22 connected to the switch 21, respectively. It can be understood that, in order to improve reliability, at least two switches 21 can be generally disposed in the storage system 2, and each storage node 22 is connected to each switch 21 at this time.
  • the storage system 2 includes two switches 21 as an example for description.
  • the switch 21 is configured to connect the storage nodes 22 to each other and to connect the storage nodes 22 with the host 1.
  • the switch 21 can be, for example but not limited to, an Ethernet switch, an InfiniBand switch, a PCIe switch, or the like.
  • the switch 21 may include an internal switch port 211 and a storage service port 212.
  • an expansion port 213 may also be included.
  • the internal switch port 211 is a port that connects to the storage node 22.
  • One or more internal switch ports 211 may be provided on each switch 21, and each internal switch port 211 may be connected to an internal port 220 of one storage node 22.
  • the storage service port 212 is a port connected to the host 1 for providing external storage services.
  • One or more storage service ports 212 may be provided on each switch 21.
  • the expansion port 213 is used to connect to other switches 21 to implement horizontal expansion across multiple storage systems 2. It should be noted that the above ports are distinguished by their usage; in physical implementation, these ports may be identical.
  • the expansion port 213 can be used as the storage service port 212, and other examples are not enumerated.
  • the internal switch port 211 can be used as the storage service port 212 or the expansion port 213.
  • it can be set according to the hardware form of the storage system. For example, in the hardware form shown in FIGS. 4a to 4c, since the internal switch port 211 is located inside the chassis while the storage service port 212 and the expansion port 213 are located on the chassis surface, the internal switch port 211 is generally not used as a storage service port 212 or an expansion port 213.
  • the storage node 22 is a core component of the storage system that provides input/output (I/O) processing capabilities and storage space.
  • one or more internal ports 220 may be disposed on each storage node 22, wherein the internal ports 220 are ports connecting the internal switch ports 211 of the switch 21, and each internal port 220 may be connected to one switch 21.
  • Internal port 220 can be provided, for example, but not limited to, by a remote direct memory access (RDMA) network card. If switch 21 is an Ethernet switch, then a redundant Ethernet network, or internal Ethernet of storage system 2, is formed to help achieve connectivity when any port or connection or switch fails.
  • the storage node shown in FIG. 2 includes an execution and I/O module 221, and one or more storage modules 222 that are coupled to the I/O module 221.
  • the execution and I/O module 221 is responsible for the input and output of I/O requests (including read/write requests) and the execution of related processing flows.
  • the execution and I/O module 221 may be at least one central processing unit (CPU) connected to at least one RDMA network card through an I/O bus.
  • An internal port 220 is provided on the RDMA network card to connect to the switch 21.
  • the I/O bus can be, for example but not limited to, a peripheral component interconnect express (PCIe) bus.
  • some or all of the CPU, the I/O bus, and the RDMA network card may be integrated, for example, as a system on chip (SoC) or a field-programmable gate array (FPGA); they may also be implemented with general-purpose devices, for example, a general-purpose CPU (such as a Xeon CPU) and a general-purpose RDMA network card.
  • the execution and I/O module 221 is connected to the memory module 222 via an internal I/O bus.
  • the storage module 222 can include at least one storage controller and a plurality of storage chips connected to each storage controller.
  • the memory chip may be a NAND flash chip, or may be another non-volatile memory chip, such as a phase change memory (PCM), a magnetic random access memory (MRAM), or a resistive random access memory (RRAM).
  • the memory controller can be an application specific integrated circuit (ASIC) chip or an FPGA.
  • the physical form of the storage module may be a general-purpose solid state drive (SSD), or the storage controller and the memory chips may be directly connected to the execution and I/O module 221 through the I/O bus.
  • when the execution and I/O module 221 and the storage module 222 are implemented with common components, such as a general-purpose CPU (for example, an x86 Xeon), a general-purpose RDMA network card, and a general-purpose SSD, the storage node 22 is a general-purpose server.
  • the host and the storage system can access each other through the NOF (NVMe over Fabric) protocol.
  • Some or all of the physical disks (eg, memory chips) included in each storage node may be mapped to other storage nodes as virtual disks of other storage nodes. For example, mapping based on the NOF protocol.
  • in this way, the software system in the storage node (that is, the instructions executed by the CPU or the storage controller) can use a virtual disk in the same way as a local physical disk.
  • the present application provides a distributed storage system.
  • different storage nodes communicate with each other through the switches and access each other's disks through the RDMA network cards and the RDMA capability provided by the NOF protocol.
  • Figure 3 shows a schematic diagram of a mapping between a physical disk and a virtual disk.
  • Figure 3 takes as an example a storage system that includes 16 storage nodes, numbered 1 to 16, where each of the storage nodes 2 to 16 maps a physical disk to the storage node 1 as a virtual disk of the storage node 1.
  • the description will be made by taking an example in which the memory chip is an NVMe SSD.
  • one implementation in which each of the storage nodes 2 to 16 maps its physical disk to the storage node 1 is as follows: when the storage system is initialized, the storage nodes 2 to 16 are each configured with information about the physical disks that are allowed to be mapped to the storage node 1; each of them then establishes a connection with the storage node 1.
  • over these connections, the storage node 1 can obtain the information about the physical disks that the storage nodes 2 to 16 allow to be mapped to the storage node 1, assign a drive letter to each physical disk mapped to the storage node 1 as a virtual disk of the storage node 1, and record the mapping relationship between the virtual disk and the remote physical disk.
  • after the mapping, the software system on the storage node 1 can be aware of 16 NVMe SSDs, but only one NVMe SSD is actually local; the remaining 15 NVMe SSDs are virtualized from the NVMe SSDs of the other storage nodes through the NOF protocol. Due to the low-latency nature of the NOF protocol, the performance difference between accessing a local disk (that is, a physical disk) and a virtual disk is negligible.
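  • As a small illustration of the bookkeeping described above, the mapping recorded on storage node 1 could be sketched as follows; the drive-letter scheme and field names are assumptions, not taken from the application:

```python
from dataclasses import dataclass

@dataclass
class VirtualDiskMapping:
    local_drive_letter: str   # drive letter assigned by storage node 1 to the virtual disk
    remote_node_id: int       # storage node (2 to 16) that owns the physical NVMe SSD
    remote_disk_id: str       # identifier of the physical disk on that remote node

# mapping table on storage node 1: 15 virtual disks backed, via the NOF protocol,
# by the physical NVMe SSDs of storage nodes 2 to 16 (its own SSD stays local)
virtual_disks = {
    f"vdisk{n}": VirtualDiskMapping(f"vdisk{n}", remote_node_id=n, remote_disk_id="nvme0")
    for n in range(2, 17)
}
```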
  • the goal of the NOF protocol is to decouple the NVMe SSD from the local computer system; that is, a remote NVMe SSD can be connected to the local computer through an RDMA network card and is "seen" by the local computer system as a virtual NVMe SSD. Due to the use of RDMA technology, the performance of the remote NVMe SSD (i.e., the virtual NVMe SSD) is basically the same as that of a local NVMe SSD (i.e., a physical NVMe SSD). NOF inherits all NVMe commands and adds some administrative commands, such as Authentication Send, Authentication Receive, Connect, Property Get, and Property Set.
  • the data transfer mode and process of the NOF protocol have some changes relative to the original NVMe protocol, which may include: using RDMA to transfer commands (such as read/write requests) and data, instead of the PCIe memory-space mapping method used by NVMe, because in an NOF system the initiator and the target cannot "see" each other's memory space.
  • the initiator may be a host, and the target may be a storage node. In another implementation, the initiator can be a storage node and the target can be another storage node.
  • the read data flow may include: after receiving the read request (i.e., the READ command), the target obtains, according to the read request, the address information of the initiator-side cache into which the data is to be written. The target then initiates RDMA_WRITE to the initiator to write the read data into the host cache. The target then initiates RDMA_SEND to the initiator to inform the initiator that the transfer is complete.
  • the write data flow may include: the initiator assembles the write request (i.e., the Write command) and sends it to the target through RDMA_SEND. After the target receives the write request, it starts RDMA_READ and gets the data to be written from the initiator. After the target receives the data returned by the initiator, it initiates RDMA_SEND to the initiator to notify the initiator that the transfer is complete.
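  • The two NOF transfer sequences can be summarized in the annotated sketch below; the rdma_send/rdma_read/rdma_write helpers are placeholders standing in for the RDMA verbs rather than a real RDMA API:

```python
# Placeholder transport helpers standing in for the RDMA verbs; a real
# implementation would use an RDMA library, which is not shown here.
def rdma_send(src, dst, payload):
    print(f"{src} --RDMA_SEND--> {dst}: {payload}")

def rdma_write(src, dst, remote_addr, data):
    print(f"{src} --RDMA_WRITE--> {dst} @ {remote_addr}: {len(data)} bytes")

def rdma_read(src, dst, remote_addr, length):
    print(f"{src} --RDMA_READ--> {dst} @ {remote_addr}: {length} bytes")
    return b"\x00" * length

def nof_read_flow(initiator, target, cache_addr, data):
    """READ command flow as described above (target pushes data into the initiator cache)."""
    rdma_send(initiator, target, "READ command")       # initiator sends the read request
    rdma_write(target, initiator, cache_addr, data)    # target writes the read data to the host cache
    rdma_send(target, initiator, "transfer complete")  # target notifies completion

def nof_write_flow(initiator, target, data_addr, length):
    """Write command flow as described above (target pulls data from the initiator)."""
    rdma_send(initiator, target, "Write command")      # initiator sends the write request
    rdma_read(target, initiator, data_addr, length)    # target fetches the data to be written
    rdma_send(target, initiator, "transfer complete")  # target notifies completion
```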
  • the present application does not limit the hardware configuration of the switch 21 and the storage node 22.
  • for example, the storage node 22 and the switch 21 can exist in one chassis; when the storage node 22 is implemented by a general-purpose server, the storage node 22 and the switch 21 may exist in one rack.
  • the chassis may include one or more switches, one or more power sources, multiple storage nodes, and a backplane connecting the storage node 22 and the switch 21.
  • Figures 4a-4c show a hardware form of the system architecture shown in Figure 2.
  • FIG. 4a shows a front view of a rack frame,
  • FIG. 4b shows a rear view of the rack frame
  • FIG. 4c shows a top view of the rack frame.
  • the rack frame includes 2 Ethernet switches, 4 redundant power supplies, 16 storage nodes, and a backplane that connects the storage nodes to the Ethernet switches.
  • the rack frame is provided with 16 slots, and each slot can be used to insert one storage node. The slots need not be fully populated; to meet the redundancy requirement, at least 2 storage nodes should be inserted.
  • One or more handle bars are disposed on each storage node for inserting the storage node into the empty slot.
  • the storage service port and the expansion port are provided through an Ethernet switch.
  • the storage service port here can be an Ethernet port of various speeds (such as 10G/40G/25G/50G/100G); the expansion port can be a high-speed port (such as 40G/50G/100G).
  • the internal ports of the Ethernet switch and the internal ports of the storage node can be connected through the backplane.
  • it should be noted that FIG. 4a to FIG. 4c show only one example of the hardware form of the system architecture shown in FIG. 2 and do not constitute a limitation on that hardware form.
  • Figure 5 provides an extended system architecture.
  • the storage system provides a storage service externally through the storage service port, and is connected to the M hosts through a network, where M is an integer greater than or equal to 1.
  • the network here can be a directly connected network or a network connected through a switch. If the network is an Ethernet network, the external services of the storage system may be provided using an Ethernet-based storage protocol, including but not limited to any of the following: iSCSI, NOF, iSER, NFS, Samba, and the like.
  • the storage system can also be scaled horizontally through the expansion port, as shown in FIG. 5, including N storage systems, where N is an integer greater than or equal to 2.
  • scale-out can be connected through a directly connected network or through a switch.
  • in the following embodiments, a storage system including 16 storage nodes, numbered 1 to 16, is used as an example for description. It should also be noted that the steps performed by each storage node may be performed by a CPU and/or a storage controller in the storage node.
  • FIG. 6 is a flowchart of a method for writing data applied to the storage system shown in FIG. 2 according to an embodiment of the present application. Specifically:
  • S101 The host sends a write request to the storage system, where the write request includes LBA1 and data to be written.
  • the storage node 1 in the storage system receives the write request. Specifically, the host forwards the write request to the storage node 1 via the switch.
  • the storage node 1 may be any storage node in the storage system.
  • S102 to S103: the write request is backed up. It can be understood that S102 to S103 are optional steps.
  • the storage node 1 sends the write request to the mirror node of the storage node 1, such as the storage node 2. Specifically, the storage node 1 sends the write request to the storage node 2 via the switch. The storage node 2 receives the write request.
  • Any two storage nodes in the storage system may be mirror nodes of each other.
  • which two storage nodes are mutually mirrored nodes may be preset according to certain rules.
  • the rule may be, for example, but not limited to, configured according to a certain rule to implement load balancing.
  • the load balancing here means that the step of performing the mirroring operation is shared as evenly as possible by the storage nodes; for example, storage nodes with adjacent numbers can be set as each other's mirror node: the storage node 1 and the storage node 2 are mirror nodes of each other, the storage node 3 and the storage node 4 are mirror nodes of each other, and so on.
  • the storage node 2 caches the write request, and returns a mirror completion indication to the storage node 1, and the storage node 1 receives the mirror completion indication.
  • the storage system sends a write operation completion indication to the host.
  • subsequently, some or all of the following steps S105 to S118 are executed in the storage system to complete the writing of the data to be written carried in the write request.
  • the storage node 1 generates a fingerprint of the data to be written, and determines, according to the fingerprint, the home node of the data to be written, for example, the storage node 3.
  • the fingerprint of the data is used to uniquely mark the characteristics of the data.
  • the fingerprint of the data can be understood as the identity (ID) of the data. If the fingerprints of the two data are the same, the two data are considered to be the same. If the fingerprints of the two data are different, the two data are considered to be different.
  • the present application does not limit how the fingerprint of the data is calculated; for example, the fingerprint may be obtained by performing a hash operation on the data, where the hash operation may be, for example but not limited to, secure hash algorithm 1 (SHA-1), cyclic redundancy check (CRC) 32, or the like, where CRC32 is a specific implementation of CRC that generates a 32-bit check value. Taking SHA-1 as an example, hashing the data yields a 160-bit digest, which is used as the fingerprint of the data.
  • the home node of the data is the storage node that performs the write operation on the data.
  • the application does not limit how the home node of the data is determined; for example, but not limited to, the home node of the data may be determined according to a certain algorithm to implement load balancing, where load balancing means that the step of performing the write operation is shared as evenly as possible by the storage nodes.
  • the algorithm can be a modulo operation.
  • the fingerprint is subjected to a modulo operation; if the obtained value is a, the home node of the data is the storage node a+1, where a ≥ 0, a is an integer, and the storage nodes in the storage system are numbered from 1. For example, if there are 16 storage nodes in the storage system and the fingerprint of the data is 65537, 65537 modulo 16 is 1, that is, the home node of the data is the storage node 2.
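  • A self-contained illustration of this fingerprint-plus-modulo placement, using the SHA-1 option named in the text (interpreting the digest as a big-endian integer is an assumption about the encoding):

```python
import hashlib

def data_fingerprint(data: bytes) -> int:
    """SHA-1 fingerprint of the data, read here as a big-endian integer."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

def home_node_of(fingerprint: int, num_nodes: int = 16) -> int:
    """Modulo placement: if fingerprint % num_nodes == a, the home node is storage node a + 1."""
    return fingerprint % num_nodes + 1

# the worked examples from the text, for a 16-node storage system
assert home_node_of(65537) == 2   # 65537 % 16 == 1, so the home node is storage node 2
assert home_node_of(65536) == 1   # 65536 % 16 == 0, so the home node is storage node 1
```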
  • it should be noted that the home node of the data to be written is determined according to the fingerprint of the data, whereas the mirror node of the storage node 1 is determined according to the storage node 1; the two are not associated with each other and may be the same storage node or different storage nodes. In this embodiment, they are described as different nodes by way of example.
  • the storage node that receives the write request sent by the host can also be used as the home node of the data to be written carried by the write request.
  • the storage node 1 can also be used as the home node of the data to be written carried by the write request.
  • for example, if the fingerprint of the data to be written is 65536, then 65536 modulo 16 is 0, that is, the home node of the data to be written is the storage node 1.
  • the storage node 1 forwards the write request to the home node of the data to be written (such as the storage node 3).
  • the storage node 3 receives the write request.
  • the storage node 3 queries the data distribution information set, and determines whether the set contains the fingerprint of the data to be written.
  • the home node of the data can manage the data distribution information set.
  • the number of pieces of data distribution information included in the data distribution information set managed by the home node of the data increases as the number of disk write operations performed by the storage node increases.
  • the data distribution information set managed by the storage node may be considered to be empty, or the data distribution information set managed by the storage node may not be established in the storage system.
  • each data distribution information set may include distribution information of at least one data.
  • the distribution information of the data can be represented by the metadata table M1, and the related description of M1 is as follows.
  • S108 The storage node 3 performs striping on the write data to obtain stripe data, and writes the stripe data to the physical disk and/or the virtual disk of the storage node 3.
  • This step can be understood as redundant processing of data.
  • the basic principle is: a complete piece of data (specifically, the data carried in the write request) is broken up into multiple data blocks, and one or more check blocks are optionally generated; these data blocks and check blocks are then stored on different disks.
  • the stripe data in S108 includes data blocks and may further include check blocks.
  • the present application does not limit the redundancy processing manner, and may be, for example but not limited to, a redundant array of independent disks (RAID) or an erasure coding (EC).
  • the virtual disk can be used as a local disk, so the storage node 3 can select a virtual disk as a disk to which the stripe data is written; when writing data to the virtual disk, the storage node 3 can first determine the physical disk of the other storage node that is mapped to the virtual disk, and then, according to the NOF protocol, write the block destined for the virtual disk to the determined physical disk of the other storage node through RDMA.
  • the storage system includes 16 storage nodes, and the redundancy processing mode is an example of the EC.
  • the storage node 3 stripes the data to be written according to the EC algorithm and obtains 14 data blocks and 2 check blocks; each of these 16 blocks is then written to one storage node of the storage system.
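  • The shape of this 14+2 striping step is sketched below; the check blocks here are simple XOR parities used only to show the layout, since the application names EC as the redundancy scheme without prescribing a particular code:

```python
def stripe(data: bytes, num_data_blocks: int = 14, num_check_blocks: int = 2):
    """Split data into equal-size blocks and append placeholder check blocks.

    A real implementation would compute the check blocks with a proper erasure
    code (EC); the XOR parity below is only an illustration of the layout.
    """
    block_len = -(-len(data) // num_data_blocks)              # ceiling division
    padded = data.ljust(block_len * num_data_blocks, b"\x00")
    data_blocks = [padded[i * block_len:(i + 1) * block_len]
                   for i in range(num_data_blocks)]

    parity = bytearray(block_len)
    for block in data_blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    check_blocks = [bytes(parity)] * num_check_blocks          # placeholder, not real EC

    return data_blocks, check_blocks

# each of the 16 resulting blocks would then be written to one storage node,
# either to a local physical disk or, via the NOF protocol, to a virtual disk
blocks, checks = stripe(b"data to be written carried by the write request")
assert len(blocks) == 14 and len(checks) == 2
```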
  • the storage node 3 records the write position of the stripe data. Specifically, the storage node 3 can record the write position of the stripe data by recording the distribution information of the data to be written.
  • the distribution information of the data can be represented by the metadata table M1, and the elements included in one metadata table M1 can be as shown in Table 1.
  • Table 1:
    FingerPrint: fingerprint of the data
    hostLBA: the LBA carried in the write request
    hostLength: total length of the data
    Seg.type: whether each block in the stripe data is a data block or a check block
    Seg.diskID: ID of the disk (a virtual disk or a physical disk) to which each block in the stripe data is written
    Seg.startLBA: starting LBA of each block in the stripe data on the disk to which it is written
    Seg.length: length of each block in the stripe data
  • the FingerPrint can be used to index the write position of the stripe data of a piece of data; hostLBA is the LBA used for interaction between the host and the storage system, while Seg.startLBA is the starting LBA at which a block is written inside the storage module. This application does not limit how each element in Table 1 is recorded; for example, if all blocks in the stripe data have the same length, the length may be recorded only once. Other examples are not listed one by one.
  • the distribution information of the data to be written recorded by the storage node 3 may include: the fingerprint of the data to be written, LBA1, the total length of the data to be written, and, for each of the 14 data blocks and 2 check blocks, information such as its type, the ID of the disk to which it is written, and its length.
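  • For illustration, the records of Table 1 could be represented by the following structures; the class and field names mirror the table, but the encoding itself is an assumption:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StripeSegment:
    seg_type: str      # Seg.type: "data" block or "check" block
    disk_id: str       # Seg.diskID: disk written to (virtual or physical)
    start_lba: int     # Seg.startLBA: starting LBA of the block on that disk
    length: int        # Seg.length: length of the block

@dataclass
class MetadataM1:
    fingerprint: bytes             # FingerPrint of the data
    host_lba: int                  # hostLBA carried in the write request
    host_length: int               # hostLength: total length of the data
    segments: List[StripeSegment]  # one entry per block of the stripe data
```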
  • S108 to S109 may be referred to as the home node of the data performing a write operation on the write request/data to be written.
  • the home node of the data may perform redundancy processing on the metadata table M1.
  • S110 is an optional step.
  • the storage node 3 writes the write position of the stripe data to the physical disk and/or the virtual disk of the storage node 3. Specifically, the storage node 3 can do this by writing the distribution information of the data to be written to the physical disk and/or the virtual disk of the storage node 3.
  • This step can be understood as redundant processing of the distribution information of the data. It can be understood that this step is an optional step.
  • the present application does not limit the redundancy processing manner, which may be, but is not limited to, multiple copies, EC, or RAID. Taking three copies as an example, the storage node 3 can store one copy of the distribution information of the data to be written locally, select two other storage nodes from the storage system, make two further copies of the distribution information, and write one copy to each of the two selected storage nodes.
  • the application does not limit how to select the two storage nodes, for example, but not limited to, using a modulo operation.
  • S111 The storage node 3 feeds back a write operation completion indication to the storage node 1, and the storage node 1 receives the write operation completion indication.
  • it should be noted that, in S105, if the home node of the data to be written determined by the storage node 1 is the storage node 1 itself, S106 and S111 need not be executed, and S107 to S110 are executed by the storage node 1.
  • the storage node 1 acquires a home node of the LBA1 carried by the write request, for example, the storage node 4.
  • the home node of the LBA is used to manage the mapping relationship between the LBA and the fingerprint.
  • the application does not limit how the home node of the LBA is determined; for example, but not limited to, the home node of the LBA may be determined according to a certain algorithm to implement load balancing, where load balancing means that the step of managing the mapping relationship between LBAs and fingerprints is shared as evenly as possible by the storage nodes.
  • the algorithm can be a modulo operation.
  • it should be noted that the home node of the data is determined according to the fingerprint of the data, whereas the home node of the LBA is determined according to the LBA; the two are not associated with each other and may be the same storage node or different storage nodes. There is also no association between the home node of the LBA and the mirror node of the storage node 1. This embodiment is described by taking the case where the home node of LBA1, the home node of the data to be written, and the mirror node of the storage node 1 are all different as an example.
  • S113 The storage node 1 sends the fingerprint of the data to be written and the LBA1 carried by the write request to the storage node 4.
  • the storage node 4 records the mapping relationship between the fingerprint of the data to be written and the LBA1 carried by the write request.
  • the mapping relationship can be represented by a metadata table M2, and an element included in a metadata table M2 can be as shown in Table 2.
  • Table 2:
    FingerPrint: the fingerprint
    LBA list: the list of LBAs corresponding to the fingerprint
    NodeID: ID of the home node of the data indicated by the fingerprint
  • the metadata table M2 includes the above FingerPrint and LBA list.
  • an LBA list can include one or more LBAs.
  • the LBA list can be represented as a singly linked list.
  • the same fingerprint can be mapped to multiple LBAs. For example, suppose the host sends 4 write requests to the storage system. The related information of the 4 write requests is shown in Table 3.
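  • A sketch of the M2 bookkeeping kept by an LBA home node, mapping a fingerprint to its LBA list and to the home node of the data; plain dictionaries are used here for brevity, whereas the text mentions a singly linked list as one option for the LBA list:

```python
# metadata table M2: fingerprint -> {"lba_list": [...], "node_id": data home node ID}
m2 = {}

def record_mapping(fingerprint: bytes, lba: int, data_home_node: int) -> None:
    """Record that `lba` refers to the data identified by `fingerprint`, as in the write flow above."""
    entry = m2.setdefault(fingerprint, {"lba_list": [], "node_id": data_home_node})
    if lba not in entry["lba_list"]:
        entry["lba_list"].append(lba)   # the same fingerprint may map to several LBAs

def lookup_by_lba(lba: int):
    """Return (fingerprint, data home node) for an LBA, as needed in the read flow of FIG. 7."""
    for fingerprint, entry in m2.items():
        if lba in entry["lba_list"]:
            return fingerprint, entry["node_id"]
    return None
```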
  • the metadata table M2 recorded by the storage node A is as shown in Table 4.
  • Table 4:
    FingerPrint | LBA list corresponding to the fingerprint | Home node of the data indicated by the fingerprint
    Fingerprint of data to be written 1 | LBA1, LBA2 | Storage node C
    Fingerprint of data to be written 2 | LBA4 | Storage node D
  • the metadata table M2 recorded by the storage node B is as shown in Table 5.
  • S115 The storage node 4 writes the mapping relationship between the fingerprint of the data to be written and the LBA1 carried by the write request to the physical disk and/or the virtual disk of the storage node 4.
  • This step can be understood as redundant processing of the mapping relationship between the fingerprint of the data to be written and the LBA1 carried by the write request. It can be understood that this step is an optional step.
  • the present application does not limit the redundancy processing manner. For example, reference may be made to the above.
  • S116 The storage node 4 feeds back a mapping relationship completion indication to the storage node 1, and the storage node 1 receives the mapping relationship completion indication.
  • in this way, in a subsequent data read operation, the data to be written can be read, starting from the home node of the LBA. For details, refer to the flow of reading data described in FIG. 7.
  • the storage node 1 sends an instruction to delete the mirror data to the mirror node (such as the storage node 2) of the storage node 1, and the storage node 2 receives the instruction.
  • FIG. 7 is a flowchart of a method for reading data applied to the storage system shown in FIG. 2 according to an embodiment of the present application. Specifically:
  • S201 The host sends a read request to the storage system, where the read request carries the LBA1.
  • the storage node 1 in the storage system receives the read request.
  • the host sends a read request to the switch, and after receiving the read request, the switch forwards the read request to the storage node 1.
  • it should be noted that the storage node 1 receiving the requests (including the write request and the read request) from the host is merely an example; the write data flow and the read data flow shown in Embodiment 1 are designed based on the idea that the switch may forward a write request/read request to any storage node.
  • the storage nodes that receive write requests and read requests from the host can also be different.
  • S202 The storage node 1 acquires the home node of the LBA1.
  • the determined home node of LBA1 is the storage node 4.
  • S203 The storage node 1 sends the read request to the storage node 4, and the storage node 4 receives the read request.
  • the storage node 4 obtains a fingerprint of the data to be read according to a mapping relationship between the LBA and the fingerprint of the data, such as but not limited to the metadata table M2 as shown in Table 2.
  • the fingerprint of the data to be read is the fingerprint of the data to be written in the above.
  • the storage node 4 can also obtain the home node of the data to be read. According to the above description, the determined home node of the data to be read is the storage node 3.
  • the storage node 4 feeds back the fingerprint of the data to be read to the storage node 1, and the storage node 1 receives the fingerprint of the data to be read, and determines the home node of the data to be read, that is, the storage node 3 according to the fingerprint of the data to be read.
  • the storage node 4 can also feed back the ID of the home node of the data to be read, that is, the storage node 3, to the storage node 1, so that the storage node 1 does not need to determine the home node of the data to be read according to the fingerprint, which reduces the computation burden on the storage node 1.
  • S206 The storage node 1 sends a fingerprint of the data to be read to the storage node 3, and the storage node 3 receives the fingerprint of the data to be read.
  • S207: The storage node 3 determines, according to the fingerprint of the data to be read, the write locations of the stripe data of the data to be read, for example but not limited to, according to the metadata table M1 shown in Table 1. Then, some or all of the stripe data is acquired from these write locations.
  • It can be understood that, in a normal read process, only the data blocks of the data to be read need to be read; the parity blocks do not need to be read. Optionally, in a scenario in which the data needs to be recovered, the parity blocks of the data to be read may also be read, and data recovery may then be performed, for example but not limited to, according to a RAID or EC algorithm.
  • S208: The storage node 3 assembles the read data blocks into complete data, that is, the data as it was before striping. At this point, the storage node 3 is considered to have acquired the data to be read. The storage node 3 feeds back the data to be read to the storage node 1, and the storage node 1 receives the data to be read.
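  • The following sketch illustrates S207 and S208 under the assumption of a simple single-parity stripe layout, used here only as a stand-in for the RAID or EC schemes the application mentions. It reads the blocks recorded in an M1 style entry, rebuilds at most one missing data block from an XOR parity block, and concatenates the data blocks back into the pre-striping data. The function names, the dictionary layout of a segment, and the XOR scheme are illustrative assumptions.

```python
from functools import reduce

def read_striped_data(m1_entry, read_block):
    """m1_entry: list of segments like {"type": "data" or "parity", "disk_id": ..., "start_lba": ..., "length": ...}
    read_block(segment) -> bytes, or None if that segment cannot be read; segments assumed equal length."""
    data_segs = [s for s in m1_entry if s["type"] == "data"]
    parity_segs = [s for s in m1_entry if s["type"] == "parity"]

    blocks = [read_block(s) for s in data_segs]
    missing = [i for i, b in enumerate(blocks) if b is None]

    if len(missing) == 1 and parity_segs:
        # recovery path: rebuild the single lost data block from the XOR parity block
        parity = read_block(parity_segs[0])
        survivors = [b for b in blocks if b is not None]
        blocks[missing[0]] = reduce(lambda x, y: bytes(a ^ b for a, b in zip(x, y)), survivors, parity)
    elif missing:
        raise IOError("more blocks lost than this simple parity scheme can recover")

    # S208 / S405 / S803: assemble the data blocks back into the data as it was before striping
    return b"".join(blocks)
```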
  • The data read procedure shown in FIG. 7 is described based on the data write procedure shown in FIG. 6. Based on the read procedure shown in FIG. 7, a person skilled in the art should be able to derive the embodiments for the following scenarios: the scenario in which the home node of LBA1 is the same as the storage node 1, and/or the scenario in which the home node of the data to be read is the same as the storage node 1, and/or the scenario in which the home node of LBA1 is the same as the home node of the data to be read. Details are not repeated here.
  • In the read and write procedures provided in this embodiment, the steps of performing the read/write operations are distributed among the storage nodes of the storage system according to the fingerprints of the data, and the steps of managing the mapping between the fingerprints and the LBAs of the host are distributed among the storage nodes of the storage system according to the LBAs. This helps achieve load balancing and improves system performance.
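  • The description gives modulo arithmetic over a fingerprint (for example a SHA-1 or CRC32 value) as one example of how a home node may be chosen, with the 16 storage nodes numbered from 1. The sketch below only restates that example; the helper names and the analogous modulo placement for LBAs are assumptions.

```python
import hashlib

NODE_COUNT = 16  # the examples in the description use 16 storage nodes numbered 1..16

def fingerprint(data: bytes) -> int:
    # SHA-1 is one of the hash functions named in the description (CRC32 is another option)
    return int(hashlib.sha1(data).hexdigest(), 16)

def data_home_node(data: bytes) -> int:
    # home node of the data: fingerprint mod N, plus 1 because the nodes are numbered from 1
    return fingerprint(data) % NODE_COUNT + 1

def lba_home_node(lba: int) -> int:
    # home node of the LBA: an analogous modulo placement, assumed here for illustration
    return lba % NODE_COUNT + 1

# e.g. a fingerprint value of 65537 gives 65537 % 16 + 1 == 2, i.e. storage node 2,
# matching the worked example in the description.
```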
  • FIG. 8 is a flowchart of a method for writing data applied to the storage system shown in FIG. 2 according to an embodiment of the present application. Specifically:
  • S301 to S304: Refer to S101 to S104 above; of course, the present application is not limited thereto. S305: Refer to S112 above; again, the present application is not limited thereto.
  • S306 The storage node 1 sends the write request to the storage node 4, and the storage node 4 receives the write request.
  • the storage node 4 performs striping on the write data to obtain stripe data, and writes the stripe data to the physical disk and/or the virtual disk of the storage node 4.
  • the storage node 4 records the write position of the stripe data. Specifically, the storage node 4 can record the write position of the stripe data by recording the distribution information of the data to be written.
  • The distribution information of the data can be represented by a metadata table M3. The elements included in the metadata table M3 may be those of Table 1 above with the FingerPrint element removed.
  • the storage node 4 writes the write position of the stripe data to the physical disk and/or the virtual disk of the storage node 4.
  • the storage node 4 records the mapping relationship between the writing position of the striping data and the LBA1.
  • S311 The storage node 4 writes the mapping relationship between the write position of the stripe data and the LBA1 to the physical disk and/or the virtual disk of the storage node 4.
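  • To make S307 to S311 concrete, the sketch below models an M3 style record (Table 1 with the FingerPrint element removed) and the mapping from the host LBA to the write locations of the stripe data. The dataclass and field names mirror the elements of Table 1 but are otherwise assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Segment:
    seg_type: str    # whether the block is a data block or a parity block (Seg.type)
    disk_id: int     # ID of the physical or virtual disk the block was written to (Seg.diskID)
    start_lba: int   # start LBA of the block on that disk (Seg.startLBA)
    length: int      # length of the block (Seg.length)

@dataclass
class M3Record:              # Table 1 without the FingerPrint element
    host_lba: int            # hostLBA carried by the write request
    host_length: int         # total length of the data
    segments: List[Segment] = field(default_factory=list)

class LbaKeyedLocations:
    """S310: mapping between the write locations of the stripe data and the host LBA."""

    def __init__(self):
        self.by_lba: Dict[int, M3Record] = {}

    def record(self, rec: M3Record) -> None:
        self.by_lba[rec.host_lba] = rec

    def locate(self, host_lba: int) -> M3Record:
        # S404 / S802: look up the write locations of the stripe data by the LBA in the read request
        return self.by_lba[host_lba]
```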
  • FIG. 9 is a flowchart of a method for reading data applied to the storage system shown in FIG. 2 according to an embodiment of the present application. Specifically:
  • S401 to S403 Reference may be made to S201 to S203. Of course, the present application is not limited thereto.
  • S404: The storage node 4 determines, according to LBA1, the write locations of the stripe data of the data to be read, for example but not limited to, according to the metadata table obtained by removing the fingerprint of the data from Table 1. Then, some or all of the stripe data is acquired from these write locations.
  • S405: The storage node 4 assembles the read data blocks into complete data, that is, the data as it was before striping. At this point, the storage node 4 is considered to have acquired the data to be read. The storage node 4 feeds back the data to be read to the storage node 1.
  • The data read procedure shown in FIG. 9 is described based on the data write procedure shown in FIG. 8. Based on the read procedure shown in FIG. 9, a person skilled in the art should be able to derive the embodiment for the scenario in which the home node of LBA1 is the same as the storage node 1. Details are not repeated here.
  • In this embodiment, the step of managing the mapping between the write locations of the data and the LBAs of the host is distributed among the storage nodes of the storage system according to the LBAs. This helps achieve load balancing and improves system performance.
  • FIG. 10 is a flowchart of a method for writing data applied to the storage system shown in FIG. 2 according to an embodiment of the present application. Specifically:
  • S501 The host sends a write request to the storage system, where the write request includes LBA1 and data to be written.
  • the storage node 1 in the storage system receives the write request. Specifically, the host sends a write request to the switch, and after receiving the write request, the switch forwards the write request to the storage node 1 according to the information carried by the write request.
  • The difference from Embodiment 1 and Embodiment 2 above is that, in this embodiment, the host can send a write request carrying a specific LBA directly to a specific storage node, which can reduce the computational complexity of the storage system.
  • In one example, the host may pre-store the correspondence between LBA ranges and storage nodes, for example, LBAs 1 to 100 correspond to the storage node 1, LBAs 101 to 200 correspond to the storage node 2, and so on. The write request then carries the information of the storage node corresponding to the LBA, where the information of the storage node may include, for example but not limited to, the network address of the storage node and, optionally, the ID of the storage node. In this way, when the switch receives the write request, it can determine, according to the storage node information carried in the write request, to which storage node the write request is to be forwarded. A sketch of such a host-side lookup follows.
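  • A minimal sketch of that host-side lookup is given below. The range boundaries follow the example in the text (LBAs 1 to 100 for storage node 1, LBAs 101 to 200 for storage node 2); the network addresses are placeholders and the function names are assumptions.

```python
# (lba_low, lba_high, node_id, node_address): correspondence pre-stored on the host
LBA_RANGE_TABLE = [
    (1, 100, 1, "192.0.2.1"),    # LBAs 1..100 correspond to storage node 1 (placeholder address)
    (101, 200, 2, "192.0.2.2"),  # LBAs 101..200 correspond to storage node 2 (placeholder address)
]

def target_node_for(lba: int):
    for low, high, node_id, address in LBA_RANGE_TABLE:
        if low <= lba <= high:
            return node_id, address
    raise ValueError(f"no storage node configured for LBA {lba}")

def build_write_request(lba: int, payload: bytes) -> dict:
    node_id, address = target_node_for(lba)
    # the request carries the information of the storage node so that the switch can forward it directly
    return {"lba": lba, "data": payload, "node_id": node_id, "node_address": address}
```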
  • S502 to S511 Reference may be made to S102 to S111. Of course, the present application is not limited thereto.
  • the storage node 1 records the mapping relationship between the fingerprint of the data to be written and the LBA1 carried by the write request.
  • S513 The storage node 1 writes the mapping relationship between the fingerprint of the data to be written and the LBA1 carried by the write request to the physical disk and/or the virtual disk of the storage node 1.
  • S514 to S515 Reference may be made to S117 to S118. Of course, the present application is not limited thereto.
  • FIG. 11 is a flowchart of a method for reading data applied to the storage system shown in FIG. 2 according to an embodiment of the present application. Specifically:
  • S601 The host sends a read request to the storage system, where the read request carries LBA1.
  • The storage node 1 in the storage system receives the read request. Specifically, the host sends the read request to the switch, and after receiving the read request, the switch forwards the read request to the storage node 1 according to the read request.
  • The data read procedure shown in FIG. 11 is described based on the data write procedure shown in FIG. 10. For the specific implementation of how the switch forwards the read request to the storage node 1, refer to how the switch forwards the write request to the storage node 1 in the data write procedure shown in FIG. 10; details are not repeated here.
  • the storage node 1 obtains a fingerprint of the data to be read according to a mapping relationship between the LBA and the fingerprint of the data, such as but not limited to the metadata table M2 as shown in Table 2.
  • the fingerprint of the data to be read is the fingerprint of the data to be written in the above.
  • Then, the storage node 1 may also obtain the home node of the data to be read according to the information recorded in the metadata table M2, or by calculation. Based on the description above, the determined home node of the data to be read is the storage node 3.
  • S603: The storage node 1 acquires the home node of the data to be read, for example, the storage node 3.
  • S604 to S607 Reference may be made to S206 to S209. Of course, the present application is not limited thereto.
  • In this embodiment, the host determines, according to the correspondence between the LBAs and the storage nodes, to which storage node in the storage system a read/write request is sent; that is, the storage node does not need to determine the home node of the LBA. This reduces the signaling interaction between the storage nodes and thereby increases the read/write rate.
  • the present embodiment contributes to load balancing, thereby improving system performance.
  • FIG. 12 is a flowchart of a method for writing data applied to the storage system shown in FIG. 2 according to an embodiment of the present application. Specifically:
  • S701 to S704 Reference may be made to S501 to S504. Of course, the present application is not limited thereto.
  • S705 The storage node 1 performs striping on the write data to obtain stripe data, and writes the stripe data to the physical disk and/or the virtual disk of the storage node 1.
  • S706: The storage node 1 records the write location of the stripe data. Specifically, the storage node 1 may record the write location of the stripe data by recording the distribution information of the data to be written.
  • S707 The storage node 1 writes the write position of the stripe data to the physical disk and/or the virtual disk of the storage node 1.
  • S708 The storage node 1 records the mapping relationship between the writing position of the striping data and the LBA1.
  • the storage node 1 writes the mapping relationship between the write position of the stripe data and the LBA1 to the physical disk and/or the virtual disk of the storage node 1.
  • FIG. 13 is a flowchart of a method for reading data applied to the storage system shown in FIG. 2 according to an embodiment of the present application. Specifically:
  • S801: Refer to S601 above; of course, the present application is not limited thereto.
  • S802: The storage node 1 determines, according to the LBA, the write locations of the stripe data of the data to be read, for example but not limited to, according to the metadata table obtained by removing the fingerprint of the data from Table 1. Then, some or all of the stripe data is acquired from these write locations.
  • S803: The storage node 1 assembles the read data blocks into complete data, that is, the data as it was before striping.
  • S804 The storage node 1 feeds back the data to be read to the host.
  • In this embodiment, the host determines, according to the correspondence between the LBAs and the storage nodes, to which storage node in the storage system a read/write request is sent; that is, the storage node does not need to determine the home node of the LBA. This reduces the signaling interaction between the storage nodes and thereby increases the read/write rate.
  • the present embodiment contributes to load balancing, thereby improving system performance.
  • The foregoing mainly describes the solutions provided in the embodiments of the present application from the perspective of interaction between the nodes. It can be understood that each node, such as the host or a storage node, includes corresponding hardware structures and/or software modules for performing the respective functions. A person skilled in the art should be readily aware that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
  • In the embodiments of the present application, the storage node may be divided into functional modules according to the foregoing method examples. For example, each functional module may be obtained through division according to a corresponding function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of the present application is an example and is merely logical function division; another division manner may be used in an actual implementation. The following description uses, as an example, division of each functional module according to a corresponding function:
  • FIG. 14 shows a schematic structural diagram of a storage node 140.
  • the storage node 140 can be any of the storage nodes referred to above.
  • the storage node 140 is connected to the host and the at least one second storage node in the storage system, and the physical disk included in the at least one second storage node is mapped to the virtual disk of the storage node 140.
  • The storage node 140 includes a transceiver unit 1401, a processing unit 1402, and a storage unit 1403. The transceiver unit 1401 is configured to receive a first write request, where the first write request carries first data to be written.
  • the processing unit 1402 is configured to perform striping on the first data to be written to obtain stripe data, and write the stripe data into a physical disk and/or a virtual disk of the storage node 140.
  • the storage unit 1403 is configured to record a stripe data writing position.
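  • A rough sketch of how the three units could cooperate on the first write request is given below: the transceiver unit receives the request, the processing unit stripes the data and writes the stripes to the physical and/or virtual disks, and the storage unit records the write locations. The class, the fixed-size chunking used as a stand-in for real striping (no parity block is produced), and the disk objects with a write() method are all assumptions.

```python
class StorageNode140Sketch:
    def __init__(self, disks):
        self.disks = disks         # physical disks plus virtual disks mapped from second storage nodes
        self.write_locations = {}  # storage unit 1403: recorded write locations of the stripe data

    def receive_write(self, request):                 # transceiver unit 1401
        stripes = self.stripe(request["data"])        # processing unit 1402: striping
        locations = [self.disks[i % len(self.disks)].write(chunk)   # write to physical and/or virtual disks
                     for i, chunk in enumerate(stripes)]
        self.write_locations[request["lba"]] = locations            # storage unit 1403: record locations
        return locations

    @staticmethod
    def stripe(data: bytes, chunk_size: int = 4096):
        # placeholder striping: split into fixed-size chunks; a real implementation would add parity
        return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
```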
  • For example, with reference to FIG. 6 or FIG. 10, the storage node 140 may be the storage node 3: the transceiver unit 1401 may be configured to execute S106/S506, the processing unit 1402 may be configured to execute S108/S508, and the storage unit 1403 may be configured to execute S109/S509. For example, with reference to FIG. 8, the storage node 140 may be the storage node 4: the transceiver unit 1401 may be configured to execute S306, the processing unit 1402 may be configured to execute S307, and the storage unit 1403 may be configured to execute S308. For example, with reference to FIG. 12, the storage node 140 may be the storage node 1: the transceiver unit 1401 may be configured to execute S701, the processing unit 1402 may be configured to execute S705, and the storage unit 1403 may be configured to execute S706.
  • the processing unit 1402 may be specifically configured to: when the stripe data is written into the virtual disk, write the stripe data into the physical disk of the second storage node that maps the virtual disk.
  • the storage unit 1403 can also be configured to: when recording the stripe data writing location, also record the fingerprint of the first data to be written. For example, reference can be made to Table 1 in Example 1.
  • the storage unit 1403 may be further configured to: when recording the stripe data writing position, also recording the LBA of the first data to be written.
  • the storage unit 1403 can be used to execute S310/S708.
  • the transceiver unit 1401 is further configured to: receive a second write request sent by the host, and the second write request carries the second data to be written.
  • the processing unit 1402 is further configured to: determine, according to the second write request, the home node of the second write request, if the home node of the second write request is the storage node 140, perform a write operation on the second write request by the storage node 140, if The home node of the second write request is the second storage node, and the storage node 140 forwards the second write request to the second storage node, so that the second storage node performs a write operation on the second write request.
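  • The forwarding decision described in this design can be summarized in a few lines. The sketch assumes a determine_home_node helper (for example the modulo placement shown earlier) and write_locally/forward callables, none of which are defined by the application itself.

```python
def handle_second_write(self_node_id, request, determine_home_node, write_locally, forward):
    home = determine_home_node(request)
    if home == self_node_id:
        # the home node of the second write request is this storage node: perform the write here
        return write_locally(request)
    # otherwise forward the second write request to the second storage node that owns it
    return forward(home, request)
```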
  • storage node 140 can be storage node 1.
  • the transceiver unit 1401 can be configured to execute S101/S301/S501.
  • Processing unit 1402 can be used to execute S105/S305/S505.
  • the processing unit 1402 may be specifically configured to: calculate a fingerprint of the second data to be written; and then determine a home node of the second write request according to the fingerprint of the second data to be written.
  • For example, in conjunction with FIG. 6 or FIG. 10, the processing unit 1402 can be used to execute S105/S505.
  • the processing unit 1402 is further configured to: determine a home node of the LBA carried by the second write request; the home node of the LBA is configured to manage a mapping relationship between the LBA and the fingerprint of the second data to be written.
  • processing unit 1402 can be used to execute S112.
  • processing unit 1402 may be specifically configured to: determine, according to the LBA carried by the second write request, the home node of the second write request. For example, in conjunction with FIG. 8, processing unit 1402 can be used to execute S305.
  • the transceiver unit 1401 is further configured to receive a fingerprint of the first data to be read requested by the first read request.
  • the processing unit 1402 is further configured to: obtain a write position of the first data to be read according to the fingerprint of the first data to be read, and read a strip of the first data to be read from the write position of the first data to be read. Data.
  • storage node 140 may be storage node 3.
  • the transceiver unit 1401 can be configured to execute S206/S604, and the processing unit 1402 can be configured to execute S207/S605.
  • the transceiver unit 1401 is further configured to receive a first read request, where the first read request carries the first LBA.
  • the processing unit 1402 is further configured to: obtain, according to the first LBA, a write position of the first data to be read requested by the first read request, and read the first data to be read from the write position of the first data to be read. Striped data.
  • For example, in conjunction with FIG. 9, the storage node 140 may be the storage node 4; the transceiver unit 1401 may be configured to execute S403, and the processing unit 1402 may be configured to execute S404.
  • For example, in conjunction with FIG. 13, the storage node 140 may be the storage node 1; the transceiver unit 1401 may be configured to execute S801, and the processing unit 1402 may be configured to execute S803.
  • the transceiver unit 1401 is further configured to receive a second read request sent by the host.
  • the processing unit 1402 is further configured to: determine, according to the second read request, the home node of the second read request, if the home node of the second read request is the storage node 140, perform a read operation on the second read request by the storage node 140, if The home node of the second read request is the second storage node, and the storage node 140 forwards the second read request to the second storage node to cause the second storage node to perform a read operation on the second read request.
  • storage node 140 may be storage node 1.
  • the transceiver unit 1401 can be configured to execute S201/S401/S601, and the processing unit 1402 can be configured to execute S205/S402/S603.
  • The processing unit 1402 may be specifically configured to: determine the home node of the LBA carried by the second read request, where the home node of the LBA is configured to manage the mapping relationship between the LBA and the fingerprint of the second data to be read requested by the second read request; obtain the fingerprint of the second data to be read from the home node of the second LBA; and determine the home node of the second read request according to the fingerprint of the second data to be read.
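  • The two-step resolution just described (the home node of the LBA yields the fingerprint, and the fingerprint yields the home node of the data to be read) can be sketched as follows, reusing the helpers assumed in the earlier sketches; the function signature is an assumption.

```python
def resolve_second_read(lba, lba_home_node_of, fetch_fingerprint_from, data_home_node_of):
    # step 1: the home node of the LBA manages the LBA -> fingerprint mapping
    lba_home = lba_home_node_of(lba)
    fingerprint = fetch_fingerprint_from(lba_home, lba)
    # step 2: the fingerprint of the data to be read determines its home node
    return fingerprint, data_home_node_of(fingerprint)
```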
  • the transceiving unit 1401 may be configured to execute S201/S601
  • the processing unit 1402 may be configured to execute S205/S603.
  • the processing unit 1402 may be specifically configured to: determine, according to the LBA carried by the second read request, the home node of the second read request.
  • the transceiving unit 1401 can be used to execute S401, and the processing unit 1402 can be used to execute S402.
  • It should be noted that, although in some of the foregoing implementations the storage node 140 may specifically be different storage nodes within the same figure (for example, in conjunction with FIG. 6, in the scenario in which the storage system receives the first write request, the storage node 140 may specifically be the storage node 3, while in the scenario in which the storage system receives the first read request, the storage node 140 may specifically be the storage node 1), each storage node in the storage system may take any of these roles. Therefore, in a specific implementation in which different data is read and written multiple times, the same storage node 140 may have the functions provided in any of the foregoing technical solutions.
  • In addition, it should be noted that the foregoing describes, only by way of example, the relationship between the units in the storage node 140 and some steps in the method embodiments shown above. In fact, the units in the storage node 140 may also perform other related steps in the method embodiments shown above, which are not enumerated here one by one.
  • a hardware implementation of storage node 140 may refer to the storage node shown in FIG. 2.
  • With reference to FIG. 2, the transceiver unit 1401 may correspond to the internal port 220 in FIG. 2. The processing unit 1402 may correspond to the CPU and/or the storage controller in FIG. 2. The storage unit 1403 may correspond to the memory in FIG. 2, and may optionally also correspond to the storage chip in FIG. 2.
  • Because the storage node provided in the embodiments of the present application can be used to perform the read and write procedures described above, the technical effects that it can achieve can be obtained by referring to the foregoing method embodiments; details are not repeated here.
  • All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When a software program is used for implementation, the embodiments may be implemented completely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated completely or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, an SSD), or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

一种数据访问方法、装置和系统,涉及存储技术领域。该方法应用于存储系统中的第一存储节点,第一存储节点通过交换机与主机和存储系统中的至少一个第二存储节点连通,至少一个第二存储节点所包括的物理盘映射为第一存储节点的虚拟盘。该方法可以包括:接收第一写请求,第一写请求携带第一待写数据;对第一待写数据进行条带化,得到条带化数据;并将条带化数据写入第一存储节点的物理盘和/或虚拟盘;记录条带化数据的写入位置。可以应用于包含NVMe SSD的存储系统。

Description

一种数据访问方法、装置和系统 技术领域
本申请涉及存储技术领域,尤其涉及一种数据访问方法、装置和系统。
背景技术
如图1所示,为现有技术提供的一种存储系统的架构示意图。该存储系统通过两个交换机连接至主机。该存储系统还包括与每一交换机均连接的多个双控阵列。每个双控阵列包括两个存储控制器,以及与每一存储控制器均连接的多个机械硬盘(hard disk drive,HDD)。两个存储控制器之间通过冗余镜像通道连接,以实现写数据流程中的镜像操作。在该系统中,每个双控阵列作为一个双控阵列单元,每个双控阵列单元与主机的一部分逻辑区块地址(logical block address,LBA)对应。主机发送的读/写请求经交换机被转发至与该读/写请求所携带的LBA对应的双控阵列单元。然后,该双控阵列单元在本地实现数据的读/写操作。
图1所示的系统架构是基于HDD提出的,随着快速非易失性存储器(non volatile memory Express,NVMe)固态硬盘(solid state disk,SSD)的逐渐普及,将NVMe SSD应用于双控阵列成为一种普遍的选择。然而,相对HDD,NVMe SSD的性能有了数百倍乃至上千倍的增强,如Intel的P3600NVMe SSD,只读IOPS达到了45万次,只写IOPS也达到了7万次,其中,IOPS是每秒进行读/写操作的次数(input/output operations per second)的英文缩写。由于图1所示的系统架构中,所有的处理操作均集中在两个存储控制器上,而存储控制器的处理能力有限,因此,图1所示的双控阵列存储架构不再适用于以NVMe SSD作为存储介质的存储系统,亟需提供一种新的系统架构。
发明内容
本申请实施例提供一种数据访问方法、装置和系统,用以提供一种适用于以NVMe SSD作为存储介质的存储系统。
为达到上述目的,本申请实施例采用如下技术方案:
第一方面,提供一种数据访问方法,应用于存储系统中的第一存储节点,第一存储节点通过交换机与主机和存储系统中的至少一个第二存储节点连通,至少一个第二存储节点所包括的物理盘映射为第一存储节点的虚拟盘。该方法可以包括:接收第一写请求,第一写请求携带第一待写数据;然后,对第一待写数据进行条带化,得到条带化数据;并将条带化数据写入第一存储节点的物理盘和/或虚拟盘;以及记录条带化数据的写入位置。其中,第一存储节点可以是存储系统中的任意一个存储节点。第一存储节点接收的第一写请求可以是主机发送的第一写请求,也可以是任一个第二存储节点转发的来自主机的第一写请求。该技术方案中,每一存储节点包括的部分或全部物理盘(例如存储芯片)可以映射到其他存储节点,作为其他存储节点的虚拟盘,例如但不限于通过NOF协议进行映射,因此,相比现有技术,可以不受双控阵列中CPU或存储控制器处理能力的限制,可以很大程度上提高存储系统的处理能力。
在一种可能的设计中,当条带化数据写入的是虚拟盘时,将条带化数据写入第二存储节点中映射虚拟盘的物理盘。例如,第一存储节点将对应的条带化数据发送至对应的第二存储节点中,然后,由第二存储节点将所接收到的数据存储在本地盘(即映射至该虚拟盘的物理盘)中。
在一种可能的设计中,在记录条带化数据的写入位置时,还记录第一待写数据的指纹。例如但不限于在第一待写数据的分布信息中记录条带化数据的写入位置以及第一待写数据的指纹。具体实现可以参考下述具体实施方式。
在一种可能的设计中,在记录条带化数据的写入位置时,还记录第一待写数据的LBA,其中,该LBA是写请求中携带的LBA。例如但不限于在第一待写数据的分布信息中记录条带化数据的写入位置,以及第一待写数据的LBA。
上文中是以第一存储节点作为执行写操作的存储节点提供的技术方案。在另一些实施例中,第一存储节点还可以执行其他步骤:
在一种可能的设计中,第一存储节点可以接收主机发送的第二写请求,第二写请求携带第二待写数据;然后,根据第二写请求确定第二写请求的归属节点,如果第二写请求的归属节点是第一存储节点,则由第一存储节点对第二写请求执行写操作,如果第二写请求的归属节点是第二存储节点,则第一存储节点将第二写请求转发至第二存储节点,以使第二存储节点对第二写请求执行写操作。其中,执行写操作的实现方式可以参考上文提供的技术方案或者下文中的具体实施方式,此处不再赘述。
在一种可能的设计中,根据第二写请求确定第二写请求的归属节点,可以包括:计算第二待写数据的指纹;然后,根据第二待写数据的指纹确定第二写请求的归属节点。该可能的设计中,第二写请求的归属节点具体是第二待写数据的归属节点。在此基础上,可选的,该方法还可以包括;确定第二写请求携带的LBA的归属节点;其中,LBA的归属节点用于管理LBA与第二待写数据的指纹之间的映射关系。
在一种可能的设计中,根据第二写请求确定第二写请求的归属节点,可以包括:根据第二写请求携带的LBA确定第二写请求的归属节点。该可能的设计中,第二写请求的归属节点具体是第二写请求携带的LBA的归属节点。
上文中是以写数据流程中,第一存储节点所执行的步骤为例进行说明的,以下说明在读数据流程中,第一存储节点所执行的步骤。
在一种可能的设计中,第一存储节点接收第一读请求所请求的第一待读数据的指纹;然后,根据第一待读数据的指纹,获取第一待读数据的写入位置,并从第一待读数据的写入位置,读取第一待读数据的条带化数据。其中,第一存储节点中存储有第一待写数据的写入位置与第一待写数据的指纹之间的映射关系。
在一种可能的设计中,第一存储节点第一读请求,第一读请求携带第一LBA;然后,根据第一LBA,获取第一读请求所请求的第一待读数据的写入位置,并从第一待读数据的写入位置,读取第一待读数据的条带化数据。其中,第一存储节点中存储有第一待写数据的写入位置与第一LBA之间的映射关系。
上文中读数据流程所提供的技术方案,是以第一存储节点执行读操作的存储节点为例进行说明的,在另一些实施例中,第一存储节点还可以执行其他步骤:
在一种可能的设计中,第一存储节点接收主机发送的第二读请求;然后,根据第 二读请求确定第二读请求的归属节点,如果第二读请求的归属节点是第一存储节点,则由第一存储节点对第二读请求执行读操作,如果第二读请求的归属节点是第二存储节点,则第一存储节点将第二读请求转发至第二存储节点,以使第二存储节点对第二读请求执行读操作。其中,执行读操作的实现方式可以参考上文提供的技术方案或者下文中的具体实施方式,此处不再赘述。
在一种可能的设计中,根据第二读请求确定第二读请求的归属节点,可以包括:确定第二读请求携带的LBA的归属节点,LBA的归属节点用于管理LBA与第二读请求所请求的第二待读数据的指纹的映射关系;然后,从第二LBA的归属节点中获取第二待读数据的指纹;并根据第二待读数据的指纹确定第二读请求的归属节点。或者,确定第二读请求携带的LBA的归属节点,然后,从第二LBA的归属节点中获取第二待读数据的指纹以及第二读请求的归属节点。该可能的设计中,第二读请求的归属节点具体是第二待读数据的归属节点。
在一种可能的设计中,根据第二读请求确定第二读请求的归属节点,可以包括:根据第二读请求携带的LBA,确定第二读请求的归属节点。该可能的设计中,第二读请求的归属节点具体是第二读请求携带的LBA的归属节点。
第二方面,提供一种存储节点,该存储节点可以根据上述方法示例对存储节点进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。
第三方面,提供一种存储节点,包括:存储器和处理器,其中,存储器用于存储计算机程序,该计算机程序被处理器执行时,使得第一方面提供的任意方法被执行。其中,存储器可以是内存和/或存储芯片等。处理器可以是CPU和/或控制存储器等。
第四方面,提供一种数据访问系统,包括:上述第二方面和第三方面提供的任一种存储节点,该存储节点通过交换机与主机和存储系统中的至少一个第二存储节点连通,至少一个第二存储节点所包括的物理盘映射为该存储节点的虚拟盘。
本申请还提供了一种计算机可读存储介质,其上储存有计算机程序,当该程序在计算机上运行时,使得计算机执行上述第一方面所述的方法。
本申请还提供了一种计算机程序产品,当其在计算机上运行时,使得计算机执行上述任一方面所述的方法。
本申请还提供了一种通信芯片,其中存储有指令,当其在存储节点上运行时,使得存储节点上述第一方面所述的方法。
可以理解地,上述提供的任一种装置或计算机存储介质或计算机程序产品均用于执行上文所提供的对应的方法,因此,其所能达到的有益效果可参考对应的方法中的有益效果,此处不再赘述。
附图说明
图1为现有技术提供的一种存储系统的架构示意图;
图2为本申请实施例提供的技术方案所适用的一种系统架构的示意图;
图3为本申请实施例提供的一种物理盘与虚拟盘之间的映射示意图;
图4a为图2所示的系统架构的一种硬件形态的前视图;
图4b为图2所示的系统架构的一种硬件形态的后视图;
图4c为图2所示的系统架构的一种硬件形态的顶视图;
图5为图2所示的系统架构的一种扩展后的系统架构的示意图;
图6为本申请实施例提供的写数据的方法的流程图一;
图7为本申请实施例提供的基于图6的读数据的方法的流程图;
图8为本申请实施例提供的写数据的方法的流程图二;
图9为本申请实施例提供的基于图8的读数据的方法的流程图;
图10为本申请实施例提供的写数据的方法的流程图三;
图11为本申请实施例提供的基于图10的读数据的方法的流程图;
图12为本申请实施例提供的写数据的方法的流程图四;
图13为本申请实施例提供的基于图12的读数据的方法的流程图;
图14为本申请实施例提供的一种存储节点的结构示意图。
具体实施方式
本文中的术语“多个”是指两个或两个以上。本文中的术语“第一”、“第二”等仅是为了区分不同的对象,并不对其顺序进行限定,比如第一存储节点和第二存储节点是为了区分不同的对象,并不对其顺序进行限定。本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
图2给出了本申请提供的技术方案所适用的一种系统架构的示意图。图2所示的系统架构可以包括主机1以及存储系统2。存储系统2可以包括:交换机21,以及与该交换机21分别连接的多个存储节点22。可以理解的,为了提高可靠性,存储系统2中一般可以设置至少两个交换机21,此时,每一存储节点22均与每一交换机21连接。图2中以存储系统2包括2个交换机21为例进行说明。
交换机21,用于连通各存储节点22,以及存储节点22与主机1。交换机21可以例如但不限于是以太网交换机(ethernet switch),IB交换机(infiniband switch),PCIe交换机(PCIe switch)等。
示例的,从功能上进行划分,交换机21可以包括内部交换端口211和存储服务端口212,可选的,还可以包括扩展端口213。其中,内部交换机端口211是连接存储节点22的端口。每个交换机22上可以设置一个或多个内部交换端口211,每个内部交换端口211可以连接一个存储节点22的一个内部端口220。存储服务端口212是连接主机1的端口,用于对外提供存储服务。每个交换机21上可以设置一个或多个存储服务端口212。扩展端口213,用于连接其他交换机21,以实现多个存储系统2的横向扩展。应注意,上述几种端口是从用途上进行划分的,在物理实现上,这些端口可以是相同的。例如,在某些情况下,扩展端口213可以作为存储服务端口212使用,其他示例不再一一列举。理论上,内部交换端口211可以作为存储服务端口212或扩展端口213,实际上,可以根据存储系统的硬件形态来设置。例如在如图4a~4c所示的硬件形态中,由于内部交换端口211位于机框内部,而存储服务端口212和扩展端口213位于机框表面,因此,一般地,不将内部交换端口211作为存储服务端口212或扩展端口213。
存储节点22,是存储系统中提供输入/输出(input/output,I/O)处理能力与存储空间的核心部件。示例的,每个存储节点22上可以设置一个或多个内部端口220,其中,内部端口220是连接交换机21的内部交换端口211的端口,每个内部端口220可以连接一个交换机21。内部端口220可以例如但不限于通过远程直接内存访问(remote direct memory access,RDMA)网卡提供。如果交换机21是以太网交换机,这样,就组成了冗余的以太网络,或者称为存储系统2的内部以太网,有助于实现当任何一个端口或者连接或者交换机失效时仍然有连接可用。
在一种实现方式中,图2所示的存储节点包括:执行与I/O模块221,以及与执行与I/O模块221连接的一个或多个存储模块222。
执行与I/O模块221,负责I/O请求(包括读/写请求)的输入输出以及相关处理流程的执行。具体实现时,执行与I/O模块221可以是至少一个中央处理器(central processing unit,CPU)通过I/O总线,与至少一个RDMA网卡连接,另外,CPU上还可以连接一定数量的内存。RDMA网卡上提供内部端口220,以连接交换机21。其中,I/O总线可以例如但不限于是快速外部设备互连总线(peripheral component interconnect express,PCIe)。应注意,在物理实现上,此处的CPU、I/O总线、与RDMA网卡中的部分或全部可以集成在一起,如片上系统(system on chip,Soc)或现场可编程门阵列(field-programmable gate array,FPGA);也可以是通用器件,如通过CPU(如至强CPU)和通用RDMA网卡。执行与I/O模块221通过内部I/O总线与存储模块222连接。
存储模块222,可以包括至少一个存储控制器,以及每一存储控制器连接的多个存储芯片。其中,存储芯片可以是NandFlash芯片,也可以是其他的非易失的存储芯片,如相变存储器(phase change memory,PCM)、磁阻式随机访问存储器(magnetic random access memory,MRAM)、阻变式存储器(resistive random access memory,RRAM)。存储控制器可以是专用集成电路(application specific integrated circuit,ASIC)芯片,也可以是FPGA。同样的,此处的存储模块的物理形态可以是通用的固态硬盘(solid state drives,SSD),也可以是将存储控制器、存储芯片与上述执行与I/O模块221通过I/O总线连接起来。
在一种实现方式中,当执行与I/O模块221、存储模块222均由通用部件构成,例如由通用的CPU(如X86至强)、通用的RDMA网卡、通用的SSD等器件构成,此时,存储节点22即为通用服务器。
本申请提供的技术方案中,主机与存储系统之间可以通过NOF(NVMe over Fabric)协议访问。每一存储节点包括的部分或全部物理盘(例如存储芯片)可以映射到其他存储节点,作为其他存储节点的虚拟盘。例如基于NOF协议进行映射。如此一来,在执行读/写操作时,存储节点中的软件系统(即CPU或内存控制器所执行的指令)可以将这些虚拟盘当作本地的物理盘使用。也就是说,本申请提供了一种分布式存储系统。由于本存储系统中,不同存储节点之间通过交换机互相连通,通过RDMA网卡及NOF协议所提供的RDMA方式互相访问,相比图1所示的系统架构,不受双控阵列中CPU或存储控制器处理能力的限制,可以很大程度上提高存储系统的处理能力。
图3给出了一种物理盘与虚拟盘之间的映射示意图。其中,图3中是以存储系统 中包括16个存储节点,分别编号为1~16,且以存储节点2~15中的每一存储节点将其中的物理盘映射到存储节点1中,作为存储节点1的虚拟盘为例进行说明的。另外,以存储芯片是NVMe SSD为例进行说明。例如,存储节点2~15中的每一存储节点将其中的物理盘映射到存储节点1中的一种实现方式为:存储系统初始化时,存储节点2~15分别配置允许映射至存储节点1的物理盘的信息;然后分别与存储节点1之间建立连接。建立连接之后,存储节点1可以获取到存储节点2~15分别确定允许映射至存储节点1的物理盘的信息,并为映射至存储节点1的物理盘分配盘符,作为存储节点1的虚拟盘,并记录虚拟盘与远端物理盘的映射关系。如此一来,存储节点1上的软件系统可感知有16个NVMe SSD,但实际上只有1个NVMe SSD是在本地的,其余15个NVMe SSD是其他存储节点的NVMe SSD通过NOF协议虚拟出来的。由于NOF协议的低延迟特性,访问本地盘(即物理盘)与虚拟盘性能差异可以时忽略不计。
以下,对NOF协议做简单介绍:NOF协议的目标是可以将NVMe SSD与本地计算机系统解耦,即可以将远端NVMe SSD用RDMA网卡与本地计算机连接起来,在本地计算机系统中“看到”的是一个虚拟的NVMe SSD。由于使用了RDMA技术,远端的NVMe SSD(即虚拟NVMe SSD)与本地的NVMe SSD(即物理NVMe SSD)性能基本没有差别。NOF继承了NVMe的所有命令,并增加了一些管理命令,例如:Authentication Send、Authentication Receive、Connect、Property Get、Property Set。为了适配RDMA网络,NOF协议的数据传送方式以及流程相对于原始NVMe协议有一些改变,具体可以包括:使用RDMA方式传送命令(如读/写请求)/数据,而非NVMe使用的PCIe内存空间映射方式,这是因为在NOF系统中,启动器(initiator)和目标器(target)无法“看到”对方的内存空间。
在本申请的一种实现方式中,启动器可以是主机,目标器可以是存储节点。在另一种实现方式中,启动器可以是一个存储节点,目标器可以是另一个存储节点。
NOF协议中,读数据流程可以包括:目标器接收到读请求(即READ命令)之后,根据该读请求获得要写入启动器的缓存的地址信息。然后,目标器向启动器发起RDMA_WRITE,以将读取到的数据写入主机缓存。接着,目标器向启动器发起RDMA_SEND,以通知启动器传送完毕。
NOF协议中,写数据流程可以包括:启动器将写请求(即Write命令)组装完毕,通过RDMA_SEND发送给目标器。目标器收到写请求后,启动RDMA_READ并从启动器获取该写请求要写入的数据。目标器接收到启动器回应的数据之后,向启动器发起RDMA_SEND,以通知启动器传送完毕。
以下,说明存储系统2的硬件形态。本申请对交换机21和存储节点22的硬件形态均不进行限定。存储节点22与交换机21可以存在于一个机框内。当存储节点22采用通用服务器实现时,存储节点22与交换机21可以存在于一个机架内。存储节点22与交换机21存在于一个机框时,该机框内可以包括一个或多个交换机,一个或多个电源,多个存储节点,以及连接存储节点22与交换机21的背板等。
图4a~图4c给出了图2所示的系统架构的一种硬件形态。其中,图4a示出了一个机架式机框的前视图,图4b示出了该机架式机框的后视图,图4c示出了该机架式机框的顶视图。由图4a~2c可以看出,该机架式机框包括2个以太网交换机、4个冗 余电源、16个存储节点、以及连接存储节点与以太网交换机的背板。
从图4a所示的前视图以及图4c所示的顶视图可以看出,机架式机框上设置有16个空槽位,每个空槽位可用于插入一个存储节点,实际实现时,可以不全部插满,同时为了满足冗余要求,最低可以插入2个存储节点。每个存储节点上设置有一个或多个拉手条,用于实现将该存储节点插入到空槽位。
从图4b所示的后视图可以看出,存储服务端口与扩展端口通过以太网交换机提供。这里的服务端口可以是多种以太网速率(如10G/40G/25G/50G/100G)的端口;扩展端口则可以是高速端口(如40G/50G/100G)。
从图4c所示的顶视图可以看出,以太网交换机的内部端口与存储节点的内部端口可以通过背板实现连接。
应注意,图4a~图4c所示的机架式机框,仅为图2所示的系统架构的硬件形态的一种示例,其不构成对图2所示的系统架构的硬件形态的限制。
图5提供了一种扩展的系统架构。该系统架构中,存储系统通过存储服务端口对外提供存储服务,与M个主机通过网络连接,其中,M是大于或等于1的整数。此处的网络可由直连的或者通过交换机组成的网络构成。若该网络是以太网网络,则存储系统对外的服务可使用基于以太网的存储协议提供,包括但不限于是以下任一种:iSCSI、NOF、iSER、NFS、Samba等。另外,存储系统也可以通过扩展端口进行横向扩展,如图5所示包括N个存储系统,其中,N是大于或等于2的整数。同样的,横向扩展可以通过直连的网络或者通过交换机进行连接。
以下,结合附图对基于本申请提供的系统架构下的读写数据的流程进行说明。在此之前,需要说明的是,下文所示的实施例均是以存储系统中包括16个存储节点,编号为存储节点1~16为例进行说明的,当然具体实现时不限于此。另外需要说明的是,每一存储节点所执行的步骤可以是由该存储节点中的CPU和/或存储控制器执行。
实施例1
如图6所示,为本申请实施例提供的应用于图2所示的存储系统的一种写数据的方法的流程图。具体的:
S101:主机向存储系统发送写请求,其中,该写请求包括LBA1和待写数据。存储系统中的存储节点1接收该写请求。具体的:主机经交换机将该写请求转发至存储节点1。
其中,存储节点1可以是存储系统中的任意一个存储节点。
通常,为了降低写流程中因存储节点失效等原因导致待写数据丢失,存储系统接收到来自主机的写请求之后,会对该写请求进行备份。具体可以参考S102~S103。可以理解的,S102~S103是可选的步骤。
S102:存储节点1向存储节点1的镜像节点如存储节点2发送该写请求。具体的,存储节点1经交换机向存储节点2发送该写请求。存储节点2接收该写请求。
存储系统中的任意两个存储节点可以互为彼此的镜像节点,通常,哪两个存储节点互为彼此的镜像节点可以是根据一定的规则预先设置好的。该规则可以例如但不限于根据一定的规则设置,以实现负载均衡,其中,这里的负载均衡是指,执行镜像操作的步骤尽量由各存储节点均匀分担。例如,依次将编号相邻的两个存储节点作为彼 此的镜像节点,例如,存储节点1与存储节点2互为彼此的镜像节点,存储节点3与存储节点4互为彼此的镜像节点……
S103:存储节点2缓存该写请求,并向存储节点1返回镜像完成指示,存储节点1接收该镜像完成指示。
S104:存储节点1接收到镜像完成指示之后,向主机发送写操作完成指示。
通常,为了加快存储系统对主机的写请求的响应,在镜像完成之后,存储系统即向主机发送写操作完成指示。而存储系统中继续执行以下步骤S105~S118中的部分或全部,完成对写请求中的待写数据的写入。
S105:存储节点1生成待写数据的指纹,并根据该指纹确定待写数据的归属节点,例如存储节点3。
数据的指纹用于唯一标记该数据的特征,换言之,数据的指纹可以理解为数据的身份标识(identity,ID)。若两个数据的指纹相同,则认为这两个数据相同。若两个数据的指纹不同,则认为这两个数据不同。本申请对如何计算得到数据的指纹不进行限定,例如,可以通过对数据进行哈希(hash)运算得到,其中,哈希运算可以例如但不限于是安全散列算法1(secure hash algorithm 1,SHA-1),循环移位校验(cyclic redundancy check,CRC)32等,其中,CRC32是CRC一种具体实现,其可以生成32比特的校验值。以SHA-1为例,对数据进行哈希运算之后,会得到160位的摘要,该摘要即为该数据的指纹。
数据的归属节点即对该数据执行写操作的存储节点。本申请对如何确定数据的归属节点不进行限定,具体可以例如但不限于根据一定的算法确定数据的归属节点,以实现负载均衡,这里的负载均衡是指,执行写操作的步骤尽量由各存储节点均匀分担。例如,该算法可以是取模运算。具体的,对指纹进行取模运算,若得到的值为a,则该数据的归属节点是存储节点a+1,其中,a≥0,a是整数,存储系统中的存储节点从1开始编号。例如,若存储系统中共有16个存储节点,数据的指纹是65537,则可以将65537按照16取模,得到1,即该数据的归属节点是存储节点2。
应注意,待写数据的归属节点是依据数据的指纹确定的,而存储节点1的镜像节点是依据存储节点1确定的,二者无关联关系,可以为同一存储节点,也可以为不同的存储节点,本实施例中是二者不同为例进行说明的。
可以理解的,接收主机发送的写请求的存储节点(本实施例中以存储节点1为例),也可以作为该写请求所携带的待写数据的归属节点。例如,基于上述示例,若待写数据的指纹是65536,则可以将65536按照16取模,得到0,即待写数据的归属节点是存储节点1。
S106:存储节点1向待写数据的归属节点(如存储节点3)转发该写请求。存储节点3接收该写请求。
S107:存储节点3查询数据分布信息集合,并判断该集合中是否包含待写数据的指纹。
若否,说明存储系统中还未存储待写数据,则执行S108。若是,说明存储系统中已经存储了待写数据,则执行S111,以避免重复存储,从而节省存储空间。
数据的归属节点可以管理数据分布信息集合。数据的归属节点所管理的数据分布 信息集合中包括的数据分布信息的个数随着该存储节点执行下盘操作的个数的增加而增加。初始时刻(即该存储节点并未执行写操作时),可以认为该存储节点所管理的数据分布信息集合为空,或者可以认为存储系统中还没有建立该存储节点所管理的数据分布信息集合。其他时刻,每一数据分布信息集合可以包括至少一个数据的分布信息。数据的分布信息可以用元数据表M1表示,M1的相关说明如下文。
S108:存储节点3对待写数据进行条带化,得到条带化数据,并将条带化数据写入存储节点3的物理盘和/或虚拟盘。
该步骤可以理解为对数据进行冗余处理,其基本原理为:将一个完整的数据(具体为写请求中携带的数据)打散,得到多个数据块,可选的还可以生成一个或多个校验块。然后将这些数据块和校验块存储在不同的盘(即磁盘)中。S108中的条带化数据可以包括数据块,进一步地可以包括校验块。本申请对冗余处理方式不进行限定,可以例如但不限于是:独立硬盘冗余阵列(redundant array of independent disks,RAID,简称磁盘阵列)或者纠删码(erasure coding,EC)等。
本申请中,由于一个或多个存储节点的物理盘可以被映射至同一个存储节点的虚拟盘,所以在存储节点3写入数据的时候,可以将该虚拟盘作为本地盘使用,这样,存储节点3可以选择该虚拟盘作为条带化数据写入的磁盘,而在将数据写入虚拟盘时,存储节点3可以首先确定映射该虚拟盘的其他存储节点的物理磁盘,然后根据NOF协议,通过RDMA的方式将写入该虚拟盘的数据块写入所确定的其他存储节点的物理磁盘。
例如,以存储系统包括16个存储节点,且冗余处理方式是EC为例,一种可能的实现方式是:存储节点3依据EC算法对待写数据进行条带化,得到14个数据块以及2个校验块。然后,将这16个块中的每一块写入存储系统的一个存储节点。
S109:存储节点3记录条带化数据的写入位置。具体的,存储节点3可以通过记录待写数据的分布信息记录该条带化数据的写入位置。
数据的分布信息可以用元数据表M1表示,一种元数据表M1包括的元素可以如表1所示。
表1
元素 描述
FingerPrint 数据的指纹
hostLBA 写请求中携带的LBA
hostLength 数据的总长度
Seg.type 条带化数据中的每一块是数据块还是校验块
Seg.diskID 条带化数据中的每一块写入的盘(可以是虚拟盘或者物理盘)的ID
Seg.startLBA 条带化数据中的每一块在写入的盘中的起始LBA
Seg.length 条带化数据中的每一块的长度
其中,FingerPrint、Seg.diskID、Seg.startLBA和Seg.length可以用于表示一数据的条带化数据的写入位置。
应注意,hostLBA所表示的是主机和存储系统信息交互所使用的LBA。而Seg.startLBA所表示的是所写入的数据在存储模块中的起始LBA地址。本申请对表1 中的各元素的记录方式不进行限定,例如若条带化数据中的每一块的长度均相同,则记录一个长度即可,其他示例不再一一列举。
例如,基于S108中的示例,执行步骤S109之后,存储节点3记录的待写数据的分布信息可以包括:待写数据的指纹,LBA1,待写数据的总长度,待写数据的14个数据块以及2个冗余块中每一块的类型、写入的盘的ID、长度等信息。
S108~S109可以被称为数据的归属节点对写请求/待写入数据执行写操作。为了提高可靠性,数据的归属节点可以对元数据表M1进行冗余处理,具体可以参考S110。可以理解的,S110是可选的步骤。
S110:存储节点3将条带化数据的写入位置写入存储节点3的物理盘和/或虚拟盘。具体的,存储节点3可以通过将待写数据的分布信息写入存储节点3的物理盘和/或虚拟盘,实现将条带化数据的写入位置写入存储节点3的物理盘和/或虚拟盘。
该步骤可以理解为对数据的分布信息进行冗余处理,可以理解的,该步骤为可选的步骤。本申请对冗余处理方式不进行限定,例如但不限于是多份复制、EC或RAID等方式。以三份复制为例,存储节点3可以在本地保存一份待写数据的分布信息,然后从存储系统中选择两个存储节点,接着,将该待写数据的分布信息复制两份,每一份写入这两个存储节点中的一个存储节点。其中,本申请对如何选择这两个存储节点不进行限定,例如但不限于采用取模运算进行选择。
S111:存储节点3向存储节点1反馈写操作完成指示,存储节点1接收该写操作完成指示。
可以理解的,在S105中,若存储节点1确定的待写数据的归属节点是存储节点1,则可以不执行S106和S111,且S107~S111的执行主体是存储节点1。
S112:存储节点1获取该写请求携带的LBA1的归属节点,例如存储节点4。
LBA的归属节点用于管理LBA与指纹之间的映射关系。本申请对如何确定LBA的归属节点不进行限定,具体可以例如但不限于根据一定的算法确定LBA的归属节点,以实现负载均衡,这里的负载均衡是指,执行管理LBA与指纹之间的映射关系的步骤尽量由各存储节点均匀分担。例如,该算法可以是取模运算。
应注意,数据的归属节点是依据该数据的指纹确定的,而LBA的归属节点是依据LBA确定的,二者无关联关系,可以为同一存储节点,也可以为不同的存储节点。LBA的归属节点与存储节点1的镜像节点之间无关联关系。本实施例是以LBA1的归属节点、待写数据的归属节点与存储节点1的镜像节点均不同为例进行说明的。
S113:存储节点1向存储节点4发送待写数据的指纹与写请求携带的LBA1。
S114:存储节点4记录待写数据的指纹与写请求携带的LBA1之间的映射关系。
该映射关系可以用元数据表M2表示,一种元数据表M2包括的元素可以如表2所示。
表2
元素 描述
FingerPrint 指纹
LBA list 该指纹对应的LBA列表
NodeID 该指纹所指示的数据的归属节点的ID
在一种实现方式中,元数据表M2包括上述FingerPrint与LBA list即可。
其中,一个LBA列表可以包括一个或多个LBA。LBA列表可以表示为单向链表。同一个指纹可以与多个LBA之间存在映射关系。例如,假设主机向存储系统发送了4个写请求,这4个写请求的相关信息如表3所示。
表3
写请求 携带的信息 LBA的归属节点 数据的归属节点
写请求1 LBA1,待写数据1 存储节点A 存储节点C
写请求2 LBA2,待写数据1 存储节点A 存储节点C
写请求3 LBA3,待写数据1 存储节点B 存储节点C
写请求4 LBA4,待写数据2 存储节点A 存储节点D
基于表3,存储系统执行这4个写请求之后,存储节点A记录的元数据表M2如表4所示。
表4
指纹 该指纹对应的LBA列表 该指纹所指示的数据的归属节点
待写数据1的指纹 LBA1、LBA2 存储节点C
待写数据2的指纹 LBA4 存储节点D
基于表3,存储系统执行这4个写请求之后,存储节点B记录的元数据表M2如表5所示。
表5
指纹 该指纹对应的LBA列表 该指纹所指示的数据的归属节点
待写数据1的指纹 LBA3 存储节点C
S115:存储节点4将待写数据的指纹与写请求携带的LBA1间的映射关系写入存储节点4的物理盘和/或虚拟盘。
该步骤可以理解为对待写数据的指纹与写请求携带的LBA1之间的映射关系进行冗余处理,可以理解的,该步骤为可选的步骤。本申请对冗余处理方式不进行限定,例如可以参考上文。
S116:存储节点4向存储节点1反馈映射关系完成指示,存储节点1接收该映射关系完成指示。
这样,可以防止后续进行待写数据读取时,在同一节点读取指纹相同的数据,从而造成拥塞,如果将LBA打散到不同的节点,则在后续对数据进行读操作时,可以从LBA的归属节点处读取待写数据,具体请参考图7所述的读数据的流程。
S117:存储节点1向该存储节点1的镜像节点(如存储节点2)发送删除镜像数据的指令,存储节点2接收该指令。
S118:存储节点2接收到该指令后,删除该写请求的镜像数据。
至此,写流程结束。
如图7所示,为本申请实施例提供的一种应用于图2所示的存储系统的一种读数据的方法的流程图。具体的:
S201:主机向存储系统发送读请求,该读请求中携带LBA1。存储系统中的存储节 点1接收该写请求。具体的:主机向交换机发送读请求,该交换机接收到该读请求后,将该读请求转发至存储节点1。
需要说明的是,在实施例1中,图6所示的写数据流程和图7所示的读数据流程中,均是以存储节点1接收来自主机的请求(包括写请求和读请求)为例进行说明的,实际实现时,由于实施例1所示的写数据流程和读数据流程均是基于交换机将写请求/读请求转发至任一存储节点这一思想进行设计的,因此,写数据流程和读数据流程中,接收来自主机发送的写请求和读请求的存储节点也可以不同。
S202:存储节点1获取LBA1的归属节点。其具体实现方式可以参考上文,此处不再赘述。基于上文中的描述可知,所确定的LBA1的归属节点是存储节点4。
S203:存储节点1向存储节点4发送该读请求,存储节点4接收该读请求。
S204:存储节点4根据LBA与数据的指纹之间的映射关系,例如但不限于如表2所示的元数据表M2,得到待读数据的指纹。其中,该待读数据的指纹即为上文中的待写数据的指纹。可选的,存储节点4还可以得到该待读数据的归属节点,基于上文中的描述可知,所确定的待读数据的归属节点是存储节点3。
S205:存储节点4向存储节点1反馈待读数据的指纹,存储节点1接收待读数据的指纹,并根据待读数据的指纹,确定待读数据的归属节点,即存储节点3。可选的,存储节点4还可以向存储节点1反馈待数据的归属节点即存储节点3的ID,这样存储节点1不需要根据指纹获取待读数据的归属节点,从而可以节省存储节点1的计算复杂度。
S206:存储节点1向存储节点3发送待读数据的指纹,存储节点3接收待读数据的指纹。
S207:存储节点3根据待读数据的指纹确定待读数据的条带化数据的写入位置,例如但不限于根据表1所示的元数据表M1,得到条带化数据的写入位置。然后,从这些写入位置获取部分或全部条带化数据。
可以理解的,在正常读取数据的过程中,读取待读数据的数据块即可,无需读取校验块。可选的,在数据需要恢复的场景中,可以读取待读数据的校验块,然后可以例如但不限于按照RAID或者EC算法进行数据恢复。
S208:存储节点3将所读取到的数据块构建成完整的数据,即执行条带化之前的数据,至此,认为存储节点3获取到了待读数据。存储节点3将该待读数据反馈给存储节点1,存储节点1接收该待读数据。
S209:存储节点1接收到待读数据之后,将该待读数据反馈给主机。
图7所示的读数据流程是基于图6所示的写数据流程进行说明的,本领域技术人员应当能够根据图7所示的读数据流程,确定出如下场景下的实施例:LBA1的归属节点与存储节点1相同的场景,和/或,待读数据的归属节点与存储节点1相同的场景,和/或,LBA1的归属节点与待读数据的归属节点相同的场景。此处不再一一赘述。
本实施例提供的读写数据流程中,将执行读写操作步骤按照数据的指纹分配到了存储系统的存储节点中,以及将执行管理指纹与主机的LBA的步骤按照LBA分配到了存储系统的存储节点中。这样,有助于实现负载均衡,从而提高系统性能。
实施例2
如图8所示,为本申请实施例提供的一种应用于图2所示的存储系统的一种写数据的方法的流程图。具体的:
S301~S304:可以参考上述S101~S104,当然本申请不限于此。
S305:可以参考上述S112,当然本申请不限于此。
S306:存储节点1向存储节点4发送该写请求,存储节点4接收该写请求。
S307:存储节点4对待写数据进行条带化,得到条带化数据,并将条带化数据写入存储节点4的物理盘和/或虚拟盘。
S308:存储节点4记录条带化数据的写入位置。具体的,存储节点4可以通过记录待写数据的分布信息记录该条带化数据的写入位置。
数据的分布信息可以用元数据表M3表示,一种元数据表M3包括的元素可以是上述表1中除去FingerPrint后得到的表。
S309:存储节点4将条带化数据的写入位置写入存储节点4的物理盘和/或虚拟盘。
上述步骤S307~S309的相关描述可以参考上述S108~S110,此处不再赘述。
S310:存储节点4记录条带化数据的写入位置与LBA1之间的映射关系。
S311:存储节点4将条带化数据的写入位置与LBA1之间的映射关系写入存储节点4的物理盘和/或虚拟盘。
S312~S314:可以参考上述S116~S118,当然本申请不限于此。
如图9所示,为本申请实施例提供的一种应用于图2所示的存储系统的一种读数据的方法的流程图。具体的:
S401~S403:可以参考S201~S203当然本申请不限于此。
S404:存储节点4根据LBA1确定待读数据的条带化数据的写入位置,例如但不限于基于表1删除数据的指纹后得到的元数据表,得到待读数据的条带化数据的写入位置。然后,从这些写入位置获取部分或全部条带化数据。
S405:存储节点4将所读取到的数据块构建成完整的数据,即执行条带化之前的数据,至此,认为存储节点4获取到了待读数据。存储节点4将该待读数据反馈给存储节点1。
S406:可以参考S209,当然本申请不限于此。
图9所示的读数据流程是基于图8所示的写数据流程进行说明的,本领域技术人员应当能够根据图9所示的读数据流程,确定出如下场景下的实施例:LBA1的归属节点与存储节点1相同的场景。此处不再赘述。
本实施例中,将执行管理数据的写入位置与主机的LBA的步骤按照LBA分配到了存储系统的存储节点中。这样,有助于实现负载均衡,从而提高系统性能。
实施例3
如图10所示,为本申请实施例提供的一种应用于图2所示的存储系统的一种写数据的方法的流程图。具体的:
S501:主机向存储系统发送写请求,该写请求包括LBA1和待写数据。存储系统中的存储节点1接收该写请求。具体的:主机向交换机发送写请求,该交换机接收到该写请求后,根据该写请求携带的信息,将该写请求转发至存储节点1。
与上述实施例1和实施例2的区别在于,本实施例中,主机可以先将携带特定LBA 的写请求发送至特定的存储节点,这样,可以降低存储系统的计算复杂度。
在一种示例中,主机可以预先存储LBA范围与存储节点之间的对应关系,例如LBA1~100对应存储节点1,LBA101~LBA200对应存储节点2……,然后,在写请求中携带LBA所对应的存储节点的信息,其中,存储节点的信息可以例如但不限于包括该存储节点的网络地址,可选的还可以包括该存储节点的ID等,这样,交换机接收到写请求时,可以根据该写请求中写到的存储节点的信息,确定将该写请求转发至哪个存储节点。
S502~S511:可以参考S102~S111,当然本申请不限于此。
S512:存储节点1记录待写数据的指纹与写请求携带的LBA1之间的映射关系。
S513:存储节点1将待写数据的指纹与写请求携带的LBA1之间的映射关系写入存储节点1的物理盘和/或虚拟盘。
其中,S512~S513的相关内容的解释可以参考上述S114~S115,此处不再赘述。
S514~S515:可以参考S117~S118,当然本申请不限于此。
如图11所示,为本申请实施例提供的一种应用于图2所示的存储系统的一种读数据的方法的流程图。具体的:
S601:主机向存储系统发送读请求,该读请求中携带LBA1。存储系统中的存储节点1接收该写请求。具体的:主机向交换机发送读请求,该交换机接收到该读请求后,根据该读请求将该读请求转发至存储节点1。
图11所示的读数据流程时基于图10所示的写数据流程进行说明的。关于交换机如何将读请求转发至存储节点1的具体实现过程,可以参考图10所示的写数据流程中,交换机如何将写请求转发至存储节点1的具体实现过程,此处不再赘述。
S602:存储节点1根据LBA与数据的指纹之间的映射关系,例如但不限于如表2所示的元数据表M2,得到待读数据的指纹。其中,该待读数据的指纹即为上文中的待写数据的指纹。然后,存储节点1还可以根据元数据表M2记录的信息,或者根据计算得到该待读数据的归属节点,基于上文中的描述可知,所确定的该待读数据的归属节点是存储节点3。
S603:存储节点1获取待读数据的归属节点如存储节点3
S604~S607:可以参考S206~S209,当然本申请不限于此。
本实施例中,由主机根据LBA与存储节点之间的对应关系,确定将读/写请求发送至存储系统中的哪个存储节点,也就是说,存储节点不需要确定LBA的归属节点,这样,可以减少存储节点之间的信令交互,从而提高读/写速率。另外,基于上述实施例1的有益效果可知,本实施例有助于实现负载均衡,从而提高系统性能。
实施例4
如图12所示,为本申请实施例提供的一种应用于图2所示的存储系统的一种写数据的方法的流程图。具体的:
S701~S704:可以参考S501~S504,当然本申请不限于此。
S705:存储节点1对待写数据进行条带化,得到条带化数据,并将条带化数据写入存储节点1的物理盘和/或虚拟盘。
S706:存储节点1记录条带化数据的写入位置。具体的,存储节点1可以通过记 录待写数据的分布信息记录该条带化数据的写入位置。
S707:存储节点1将条带化数据的写入位置写入存储节点1的物理盘和/或虚拟盘。
其中,S705~S707的相关描述可以参考上述S307~S309,此处不再赘述。
S708:存储节点1记录条带化数据的写入位置与LBA1之间的映射关系。
S709:存储节点1将条带化数据的写入位置与LBA1之间的映射关系写入存储节点1的物理盘和/或虚拟盘。
S710~S711:可以参考上述S117~S118,当然本申请不限于此。
如图13所示,为本申请实施例提供的一种应用于图2所示的存储系统的一种读数据的方法的流程图。具体的:
S801:可以参考上述S601,当然本申请不限于此。
S802:存储节点1根据LBA确定待读数据的条带化数据的写入位置,例如但不限于基于表1删除数据的指纹后得到的元数据表,得到待读数据的条带化数据的写入位置。然后,从这些写入位置获取部分或全部条带化数据。
S803:存储节点1将所读取到的数据块构建成完整的数据,即执行条带化之前的数据。
其中,S802~S803的相关描述可以参考上述S404~S405,此处不再赘述。
S804:存储节点1向主机反馈该待读数据。
本实施例中,由主机根据LBA与存储节点之间的对应关系,确定将读/写请求发送至存储系统中的哪个存储节点,也就是说,存储节点不需要确定LBA的归属节点,这样,可以减少存储节点之间的信令交互,从而提高读/写速率。另外,基于上述实施例2的有益效果可知,本实施例有助于实现负载均衡,从而提高系统性能。
上述主要从各个节点之间交互的角度对本申请实施例提供的方案进行了介绍。可以理解的是,各个节点,例如主机或者存储节点。为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本申请实施例可以根据上述方法示例对存储节点进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。需要说明的是,本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。下面以采用对应各个功能划分各个功能模块为例进行说明:
图14给出了一种存储节点140的结构示意图。该存储节点140可以是上文中涉及的任意一个存储节点。存储节点140通过交换机与主机和存储系统中的至少一个第二存储节点连通,至少一个第二存储节点所包括的物理盘映射为存储节点140的虚拟盘;存储节点140包括:收发单元1401、处理单元1402和存储单元1403。收发单元1401, 用于接收第一写请求,第一写请求携带第一待写数据。处理单元1402,用于对第一待写数据进行条带化,得到条带化数据;并将条带化数据写入存储节点140的物理盘和/或虚拟盘。存储单元1403,用于记录条带化数据写入位置。例如,结合图6或图10,存储节点140可以是存储节点3,收发单元1401可以用于执行S106/S506,处理单元1402可以用于执行S108/S508,存储单元1403可以用于执行S109/S509。例如,结合图8,存储节点140可以是存储节点4,收发单元1401可以用于执行S306,处理单元1402可以用于执行S307,存储单元1403可以用于执行S308。例如,结合图12,存储节点140可以是存储节点1,收发单元1401可以用于执行S701,处理单元1402可以用于执行S705,存储单元1403可以用于执行S706。
在一种可能的设计中,处理单元1402具体可以用于:当条带化数据写入的是虚拟盘时,将条带化数据写入第二存储节点中映射虚拟盘的物理盘。
在一种可能的设计中,存储单元1403还可以用于:在记录条带化数据写入位置时,还记录第一待写数据的指纹。例如,可以参考实施例1中的表1。
在一种可能的设计中,存储单元1403还可以用于:在记录条带化数据写入位置时,还记录第一待写数据的LBA。例如,结合图8或图12,存储单元1403可以用于执行S310/S708。
在一种可能的设计中,收发单元1401还可以用于:接收主机发送的第二写请求,第二写请求携带第二待写数据。处理单元1402还用于:根据第二写请求确定第二写请求的归属节点,如果第二写请求的归属节点是存储节点140,则由存储节点140对第二写请求执行写操作,如果第二写请求的归属节点是第二存储节点,则存储节点140将第二写请求转发至第二存储节点,以使第二存储节点对第二写请求执行写操作。例如,结合图6、图8或图10,存储节点140可以是存储节点1。收发单元1401可以用于执行S101/S301/S501。处理单元1402可以用于执行S105/S305/S505。
在一种可能的设计中,处理单元1402具体可以用于:计算第二待写数据的指纹;然后,根据第二待写数据的指纹确定第二写请求的归属节点。例如,结合图6或图10处理单元1402可以用于执行S105/S505。在一种可能的设计中,处理单元1402还可以用于:确定第二写请求携带的LBA的归属节点;LBA的归属节点用于管理LBA与第二待写数据的指纹之间的映射关系。例如,结合图6,处理单元1402可以用于执行S112。
在一种可能的设计中,处理单元1402具体可以用于:根据第二写请求携带的LBA确定第二写请求的归属节点。例如,结合图8,处理单元1402可以用于执行S305。
在一种可能的设计中,收发单元1401还可以用于,接收第一读请求所请求的第一待读数据的指纹。处理单元1402还可以用于,根据第一待读数据的指纹,获取第一待读数据的写入位置,并从第一待读数据的写入位置,读取第一待读数据的条带化数据。例如,结合图7或图11,存储节点140可以是存储节点3。收发单元1401可以用于执行S206/S604,处理单元1402可以用于执行S207/S605。
在一种可能的设计中,收发单元1401还可以用于,接收第一读请求,第一读请求携带第一LBA。处理单元1402还可以用于:根据第一LBA,获取第一读请求所请求的第一待读数据的写入位置,并从第一待读数据的写入位置,读取第一待读数据的条带化数据。例如,结合图9,存储节点140可以是存储节点4,收发单元1401可以用于 执行S403,处理单元1402可以用于执行S404。例如,结合图13,存储节点140可以是存储节点1,收发单元1401可以用于执行S801,处理单元1402可以用于执行S803。
在一种可能的设计中,收发单元1401还可以用于,接收主机发送的第二读请求。处理单元1402还可以用于,根据第二读请求确定第二读请求的归属节点,如果第二读请求的归属节点是存储节点140,则由存储节点140对第二读请求执行读操作,如果第二读请求的归属节点是第二存储节点,则存储节点140将第二读请求转发至第二存储节点,以使第二存储节点对第二读请求执行读操作。例如,结合图7、图9或图11,存储节点140可以是存储节点1。收发单元1401可以用于执行S201/S401/S601,处理单元1402可以用于执行S205/S402/S603。
在一种可能的设计中,处理单元1402具体可以用于:确定第二读请求携带的LBA的归属节点,LBA的归属节点用于管理LBA与第二读请求所请求的第二待读数据的指纹的映射关系;从第二LBA的归属节点中获取第二待读数据的指纹;根据第二待读数据的指纹确定第二读请求的归属节点。例如,结合图7或图11,收发单元1401可以用于执行S201/S601,处理单元1402可以用于执行S205/S603。
在一种可能的设计中,处理单元1402具体可以用于:根据第二读请求携带的LBA,确定第二读请求的归属节点。例如,结合图9,收发单元1401可以用于执行S401,处理单元1402可以用于执行S402。
需要说明的是,虽然,在上述有些实现方式中,存储节点140可以具体可以是同一个附图中的不同的存储节点,例如,结合图6,在存储系统接收到第一写请求的场景中,存储节点140具体可以是存储节点3;在存储系统接收到第一读请求的场景中,存储节点140具体可以是存储节点1。但是,由于存储系统中每一存储节点均具有任意性,因此,具体实现时,在多次对不同数据进行读写的场景中,同一存储节点140可以具有上文任一技术方案中所提供的功能。
另外需要说明的是,上述均以举例的方式说明了存储节点140中的各单元与上文所示的方法实施例中的一些步骤之间的联系,实际上,存储节点140中的各单元还可以执行上文所示的方法实施例中的其他相关步骤,此处不再一一列举。
存储节点140的一种硬件实现方式可以参考图2中所示的存储节点。结合图2,上述接收单元1401可以对应图2中的内部端口220。处理单元1402可以对应图2中的CPU和/或存储控制器。存储单元1403可以对应图2中的内存,可选的也可以对应图2中的存储芯片。
由于本申请实施例提供的存储节点可用于执行上述提供的读写流程,因此其所能获得的技术效果可参考上述方法实施例,本申请实施例在此不再赘述。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件程序实现时,可以全部或部分地以计算机程序产品的形式来实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或者数据中 心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可以用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带),光介质(例如,DVD)、或者半导体介质(例如SSD)等。
尽管在此结合各实施例对本申请进行了描述,然而,在实施所要求保护的本申请过程中,本领域技术人员通过查看所述附图、公开内容、以及所附权利要求书,可理解并实现所述公开实施例的其他变化。在权利要求中,“包括”(comprising)一词不排除其他组成部分或步骤,“一”或“一个”不排除多个的情况。单个处理器或其他单元可以实现权利要求中列举的若干项功能。相互不同的从属权利要求中记载了某些措施,但这并不表示这些措施不能组合起来产生良好的效果。
尽管结合具体特征及其实施例对本申请进行了描述,显而易见的,在不脱离本申请的精神和范围的情况下,可对其进行各种修改和组合。相应地,本说明书和附图仅仅是所附权利要求所界定的本申请的示例性说明,且视为已覆盖本申请范围内的任意和所有修改、变化、组合或等同物。显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的精神和范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (29)

  1. 一种数据访问方法,其特征在于,应用于存储系统中的第一存储节点,所述第一存储节点通过交换机与主机和所述存储系统中的至少一个第二存储节点连通,所述至少一个第二存储节点所包括的物理盘映射为所述第一存储节点的虚拟盘;所述方法包括:
    接收第一写请求,所述第一写请求携带第一待写数据;
    对所述第一待写数据进行条带化,得到条带化数据;并将所述条带化数据写入所述第一存储节点的物理盘和/或虚拟盘;
    记录所述条带化数据写入位置。
  2. 根据权利要求1所述的方法,其特征在于,当所述条带化数据写入的是所述虚拟盘时,将所述条带化数据写入所述第二存储节点中映射所述虚拟盘的物理盘。
  3. 根据权利要求1或2所述的方法,其特征在于,在记录所述条带化数据写入位置时,还记录所述第一待写数据的指纹。
  4. 根据权利要求1或2所述的方法,其特征在于,在记录所述条带化数据写入位置时,还记录所述第一待写数据的逻辑区块地址LBA。
  5. 根据权利要求1或2所述的方法,其特征在于,所述方法还包括:
    接收所述主机发送的第二写请求,所述第二写请求携带第二待写数据;
    根据所述第二写请求确定所述第二写请求的归属节点,如果所述第二写请求的归属节点是所述第一存储节点,则由所述第一存储节点对所述第二写请求执行写操作,如果所述第二写请求的归属节点是所述第二存储节点,则所述第一存储节点将所述第二写请求转发至所述第二存储节点,以使所述第二存储节点对所述第二写请求执行写操作。
  6. 根据权利要求5所述的方法,其特征在于,所述根据所述第二写请求确定所述第二写请求的归属节点包括:
    计算所述第二待写数据的指纹;
    根据所述第二待写数据的指纹确定所述第二写请求的归属节点。
  7. 根据权利要求6所述的方法,其特征在于,所述方法还包括:
    确定所述第二写请求携带的LBA的归属节点;其中,所述LBA的归属节点用于管理所述LBA与所述第二待写数据的指纹之间的映射关系。
  8. 根据权利要求5所述的方法,其特征在于,所述根据所述第二写请求确定所述第二写请求的归属节点包括:
    根据所述第二写请求携带的LBA确定所述第二写请求的归属节点。
  9. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    接收第一读请求所请求的第一待读数据的指纹;
    根据所述第一待读数据的指纹,获取所述第一待读数据的写入位置,并从所述第一待读数据的写入位置,读取所述第一待读数据的条带化数据。
  10. 根据权利要求4所述的方法,其特征在于,所述方法还包括:
    接收第一读请求,所述第一读请求携带第一LBA;
    根据所述第一LBA,获取所述第一读请求所请求的第一待读数据的写入位置,并 从所述第一待读数据的写入位置,读取所述第一待读数据的条带化数据。
  11. 根据权利要求1或2所述的方法,其特征在于,所述方法还包括:
    接收所述主机发送的第二读请求;
    根据所述第二读请求确定所述第二读请求的归属节点,如果所述第二读请求的归属节点是所述第一存储节点,则由所述第一存储节点对所述第二读请求执行读操作,如果所述第二读请求的归属节点是所述第二存储节点,则所述第一存储节点将所述第二读请求转发至所述第二存储节点,以使所述第二存储节点对所述第二读请求执行读操作。
  12. 根据权利要求11所述的方法,其特征在于,所述根据所述第二读请求确定所述第二读请求的归属节点包括:
    确定所述第二读请求携带的LBA的归属节点,所述LBA的归属节点用于管理所述LBA与所述第二读请求所请求的第二待读数据的指纹的映射关系;
    从所述第二LBA的归属节点中获取所述第二待读数据的指纹;
    根据所述第二待读数据的指纹确定所述第二读请求的归属节点。
  13. 根据权利要求11所述的方法,其特征在于,所述根据所述第二读请求确定所述第二读请求的归属节点,包括:
    根据所述第二读请求携带的LBA,确定所述第二读请求的归属节点。
  14. 一种存储节点,其特征在于,所述存储节点通过交换机与主机和所述存储系统中的至少一个第二存储节点连通,所述至少一个第二存储节点所包括的物理盘映射为所述存储节点的虚拟盘;所述存储节点包括:
    收发单元,用于接收第一写请求,所述第一写请求携带第一待写数据;
    处理单元,用于对所述第一待写数据进行条带化,得到条带化数据;并将所述条带化数据写入所述存储节点的物理盘和/或虚拟盘;
    存储单元,用于记录所述条带化数据写入位置。
  15. 根据权利要求14所述的存储节点,其特征在于,
    所述处理单元具体用于:当所述条带化数据写入的是所述虚拟盘时,将所述条带化数据写入所述第二存储节点中映射所述虚拟盘的物理盘。
  16. 根据权利要求14或15所述的存储节点,其特征在于,
    所述存储单元还用于:在记录所述条带化数据写入位置时,还记录所述第一待写数据的指纹。
  17. 根据权利要求14或15所述的存储节点,其特征在于,
    所述存储单元还用于:在记录所述条带化数据写入位置时,还记录所述第一待写数据的逻辑区块地址LBA。
  18. 根据权利要求14或15所述的存储节点,其特征在于,
    所述收发单元还用于:接收所述主机发送的第二写请求,所述第二写请求携带第二待写数据;
    所述处理单元还用于:根据所述第二写请求确定所述第二写请求的归属节点,如果所述第二写请求的归属节点是所述存储节点,则由所述存储节点对所述第二写请求执行写操作,如果所述第二写请求的归属节点是所述第二存储节点,则所述存储节点 将所述第二写请求转发至所述第二存储节点,以使所述第二存储节点对所述第二写请求执行写操作。
  19. 根据权利要求18所述的存储节点,其特征在于,所述处理单元具体用于:
    计算所述第二待写数据的指纹;
    根据所述第二待写数据的指纹确定所述第二写请求的归属节点。
  20. 根据权利要求19所述的存储节点,其特征在于,
    所述处理单元还用于:确定所述第二写请求携带的LBA的归属节点;其中,所述LBA的归属节点用于管理所述LBA与所述第二待写数据的指纹之间的映射关系。
  21. 根据权利要求18所述的存储节点,其特征在于,
    所述处理单元具体用于:根据所述第二写请求携带的LBA确定所述第二写请求的归属节点。
  22. 根据权利要求16所述的存储节点,其特征在于,
    所述收发单元还用于,接收第一读请求所请求的第一待读数据的指纹;
    所述处理单元还用于,根据所述第一待读数据的指纹,获取所述第一待读数据的写入位置,并从所述第一待读数据的写入位置,读取所述第一待读数据的条带化数据。
  23. 根据权利要求17所述的存储节点,其特征在于,
    所述收发单元还用于,接收第一读请求,所述第一读请求携带第一LBA;
    所述处理单元还用于:根据所述第一LBA,获取所述第一读请求所请求的第一待读数据的写入位置,并从所述第一待读数据的写入位置,读取所述第一待读数据的条带化数据。
  24. 根据权利要求14或15所述的存储节点,其特征在于,
    所述收发单元还用于,接收所述主机发送的第二读请求;
    所述处理单元还用于,根据所述第二读请求确定所述第二读请求的归属节点,如果所述第二读请求的归属节点是所述存储节点,则由所述存储节点对所述第二读请求执行读操作,如果所述第二读请求的归属节点是所述第二存储节点,则所述存储节点将所述第二读请求转发至所述第二存储节点,以使所述第二存储节点对所述第二读请求执行读操作。
  25. 根据权利要求24所述的存储节点,其特征在于,所述处理单元具体用于:
    确定所述第二读请求携带的LBA的归属节点,所述LBA的归属节点用于管理所述LBA与所述第二读请求所请求的第二待读数据的指纹的映射关系;
    从所述第二LBA的归属节点中获取所述第二待读数据的指纹;
    根据所述第二待读数据的指纹确定所述第二读请求的归属节点。
  26. 根据权利要求24所述的存储节点,其特征在于,
    所述处理单元具体用于:根据所述第二读请求携带的LBA,确定所述第二读请求的归属节点。
  27. 一种存储节点,其特征在于,包括:存储器和处理器,其中,所述存储器用于存储计算机程序,所述计算机程序被所述处理器执行时,使得如权利要求1至13任一项所述的方法被执行。
  28. 一种数据访问系统,其特征在于,包括:如权利要求14至27任一项所述的 存储节点,所述存储节点通过交换机与主机和所述存储系统中的至少一个第二存储节点连通,所述至少一个第二存储节点所包括的物理盘映射为所述存储节点的虚拟盘。
  29. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,该程序在计算机上运行时,使得如权利要求1至13任一项所述的方法被执行。
PCT/CN2017/096958 2017-08-10 2017-08-10 一种数据访问方法、装置和系统 WO2019028799A1 (zh)

Priority Applications (10)

Application Number Priority Date Filing Date Title
PCT/CN2017/096958 WO2019028799A1 (zh) 2017-08-10 2017-08-10 一种数据访问方法、装置和系统
JP2020507656A JP7105870B2 (ja) 2017-08-10 2017-08-10 データアクセス方法、装置およびシステム
EP23177282.3A EP4273688A3 (en) 2017-08-10 2017-08-10 Data access method, device and system
CN202110394808.2A CN113485636B (zh) 2017-08-10 2017-08-10 一种数据访问方法、装置和系统
KR1020207006897A KR20200037376A (ko) 2017-08-10 2017-08-10 데이터 액세스 방법, 디바이스 및 시스템
CN201780002892.0A CN108064374B (zh) 2017-08-10 2017-08-10 一种数据访问方法、装置和系统
EP17920626.3A EP3657315A4 (en) 2017-08-10 2017-08-10 DATA ACCESS METHOD, DEVICE AND SYSTEM
US16/785,008 US11416172B2 (en) 2017-08-10 2020-02-07 Physical disk and virtual disk mapping in storage systems
US17/872,201 US11748037B2 (en) 2017-08-10 2022-07-25 Physical disk and virtual disk mapping in storage systems
US18/353,334 US20230359400A1 (en) 2017-08-10 2023-07-17 Data Access Method, Apparatus, and System

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/096958 WO2019028799A1 (zh) 2017-08-10 2017-08-10 一种数据访问方法、装置和系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/785,008 Continuation US11416172B2 (en) 2017-08-10 2020-02-07 Physical disk and virtual disk mapping in storage systems

Publications (1)

Publication Number Publication Date
WO2019028799A1 true WO2019028799A1 (zh) 2019-02-14

Family

ID=62141827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/096958 WO2019028799A1 (zh) 2017-08-10 2017-08-10 一种数据访问方法、装置和系统

Country Status (6)

Country Link
US (3) US11416172B2 (zh)
EP (2) EP4273688A3 (zh)
JP (1) JP7105870B2 (zh)
KR (1) KR20200037376A (zh)
CN (2) CN113485636B (zh)
WO (1) WO2019028799A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158948A (zh) * 2019-12-30 2020-05-15 深信服科技股份有限公司 基于去重的数据存储与校验方法、装置及存储介质

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033420B (zh) * 2018-08-08 2020-11-03 北京奇艺世纪科技有限公司 一种数据处理方法和装置
CN112256657B (zh) * 2019-07-22 2023-03-28 华为技术有限公司 日志镜像方法及系统
CN112988034B (zh) * 2019-12-02 2024-04-12 华为云计算技术有限公司 一种分布式系统数据写入方法及装置
WO2022002010A1 (zh) * 2020-07-02 2022-01-06 华为技术有限公司 使用中间设备对数据处理的方法、计算机系统、及中间设备
CN112162693B (zh) * 2020-09-04 2024-06-18 郑州浪潮数据技术有限公司 一种数据刷写方法、装置、电子设备和存储介质
CN112783722B (zh) * 2021-01-12 2021-12-24 深圳大学 一种区块链安全监测方法、装置、电子设备及存储介质
CN113253944A (zh) * 2021-07-07 2021-08-13 苏州浪潮智能科技有限公司 一种磁盘阵列访问方法、系统及存储介质
CN113419684B (zh) * 2021-07-09 2023-02-24 深圳大普微电子科技有限公司 一种数据处理方法、装置、设备及可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102460439A (zh) * 2009-04-30 2012-05-16 网络存储技术公司 通过条带式文件系统中的容量平衡进行数据分布
CN103092786A (zh) * 2013-02-25 2013-05-08 浪潮(北京)电子信息产业有限公司 一种双控双活存储控制系统及方法
CN104020961A (zh) * 2014-05-15 2014-09-03 深圳市深信服电子科技有限公司 分布式数据存储方法、装置及系统
CN105095290A (zh) * 2014-05-15 2015-11-25 中国银联股份有限公司 一种分布式存储系统的数据布局方法

Family Cites Families (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7389393B1 (en) * 2004-10-21 2008-06-17 Symantec Operating Corporation System and method for write forwarding in a storage environment employing distributed virtualization
US9032164B2 (en) * 2006-02-17 2015-05-12 Emulex Corporation Apparatus for performing storage virtualization
JP4643543B2 (ja) 2006-11-10 2011-03-02 株式会社東芝 キャッシュ一貫性保証機能を有するストレージクラスタシステム
US8566508B2 (en) * 2009-04-08 2013-10-22 Google Inc. RAID configuration in a flash memory data storage device
US8473690B1 (en) * 2009-10-30 2013-06-25 Netapp, Inc. Using logical block addresses with generation numbers as data fingerprints to provide cache coherency
CN102402394B (zh) 2010-09-13 2014-10-22 腾讯科技(深圳)有限公司 一种基于哈希算法的数据存储方法及装置
US8788788B2 (en) * 2011-08-11 2014-07-22 Pure Storage, Inc. Logical sector mapping in a flash storage array
CN102622189B (zh) * 2011-12-31 2015-11-25 华为数字技术(成都)有限公司 存储虚拟化的装置、数据存储方法及系统
US8972478B1 (en) 2012-05-23 2015-03-03 Netapp, Inc. Using append only log format in data storage cluster with distributed zones for determining parity of reliability groups
CN102915278A (zh) * 2012-09-19 2013-02-06 浪潮(北京)电子信息产业有限公司 重复数据删除方法
US20140195634A1 (en) 2013-01-10 2014-07-10 Broadcom Corporation System and Method for Multiservice Input/Output
US9009397B1 (en) 2013-09-27 2015-04-14 Avalanche Technology, Inc. Storage processor managing solid state disk array
US9483431B2 (en) 2013-04-17 2016-11-01 Apeiron Data Systems Method and apparatus for accessing multiple storage devices from multiple hosts without use of remote direct memory access (RDMA)
US9986028B2 (en) 2013-07-08 2018-05-29 Intel Corporation Techniques to replicate data between storage servers
US11016820B2 (en) * 2013-08-26 2021-05-25 Vmware, Inc. Load balancing of resources
CN103942292A (zh) * 2014-04-11 2014-07-23 华为技术有限公司 虚拟机镜像文件处理方法、装置及系统
US9558085B2 (en) * 2014-07-02 2017-01-31 Hedvig, Inc. Creating and reverting to a snapshot of a virtual disk
CN105612488B (zh) 2014-09-15 2017-08-18 华为技术有限公司 数据写请求处理方法和存储阵列
US9565269B2 (en) 2014-11-04 2017-02-07 Pavilion Data Systems, Inc. Non-volatile memory express over ethernet
CN108702374A (zh) * 2015-09-02 2018-10-23 科内克斯实验室公司 用于以太网类型网络上的存储器和I/O的远程访问的NVM Express控制器
CN105487818B (zh) * 2015-11-27 2018-11-09 清华大学 针对云存储系统中重复冗余数据的高效去重方法
WO2017113960A1 (zh) * 2015-12-28 2017-07-06 华为技术有限公司 一种数据处理方法以及NVMe存储器
CN107430494B (zh) * 2016-01-29 2020-09-15 慧与发展有限责任合伙企业 用于远程直接存储器访问的系统、方法和介质
US10334334B2 (en) * 2016-07-22 2019-06-25 Intel Corporation Storage sled and techniques for a data center
US20180032249A1 (en) * 2016-07-26 2018-02-01 Microsoft Technology Licensing, Llc Hardware to make remote storage access appear as local in a virtualized environment
US10509708B2 (en) * 2017-06-13 2019-12-17 Vmware, Inc. Code block resynchronization for distributed multi-mirror erasure coding system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102460439A (zh) * 2009-04-30 2012-05-16 网络存储技术公司 通过条带式文件系统中的容量平衡进行数据分布
CN103092786A (zh) * 2013-02-25 2013-05-08 浪潮(北京)电子信息产业有限公司 一种双控双活存储控制系统及方法
CN104020961A (zh) * 2014-05-15 2014-09-03 深圳市深信服电子科技有限公司 分布式数据存储方法、装置及系统
CN105095290A (zh) * 2014-05-15 2015-11-25 中国银联股份有限公司 一种分布式存储系统的数据布局方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158948A (zh) * 2019-12-30 2020-05-15 深信服科技股份有限公司 基于去重的数据存储与校验方法、装置及存储介质
CN111158948B (zh) * 2019-12-30 2024-04-09 深信服科技股份有限公司 基于去重的数据存储与校验方法、装置及存储介质

Also Published As

Publication number Publication date
US20200174708A1 (en) 2020-06-04
EP4273688A3 (en) 2024-01-03
US11748037B2 (en) 2023-09-05
US11416172B2 (en) 2022-08-16
EP3657315A1 (en) 2020-05-27
CN113485636B (zh) 2023-07-18
EP3657315A4 (en) 2020-07-22
CN108064374A (zh) 2018-05-22
JP2020530169A (ja) 2020-10-15
EP4273688A2 (en) 2023-11-08
CN108064374B (zh) 2021-04-09
CN113485636A (zh) 2021-10-08
US20230359400A1 (en) 2023-11-09
KR20200037376A (ko) 2020-04-08
US20220357894A1 (en) 2022-11-10
JP7105870B2 (ja) 2022-07-25

Similar Documents

Publication Publication Date Title
CN113485636B (zh) 一种数据访问方法、装置和系统
US10459649B2 (en) Host side deduplication
US11010078B2 (en) Inline deduplication
US11620064B2 (en) Asynchronous semi-inline deduplication
US11573855B2 (en) Object format resilient to remote object store errors
WO2019127018A1 (zh) 存储系统访问方法及装置
US9548888B1 (en) Technique for setting WWNN scope for multi-port fibre channel SCSI target deduplication appliances
US20230052732A1 (en) Object and sequence number management
WO2019127021A1 (zh) 存储系统中存储设备的管理方法及装置
WO2021017782A1 (zh) 分布式存储系统访问方法、客户端及计算机程序产品
WO2019127017A1 (zh) 存储系统中存储设备的管理方法及装置
US11947419B2 (en) Storage device with data deduplication, operation method of storage device, and operation method of storage server
US9501290B1 (en) Techniques for generating unique identifiers
US11662917B1 (en) Smart disk array enclosure race avoidance in high availability storage systems
US20210311654A1 (en) Distributed Storage System and Computer Program Product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17920626

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020507656

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017920626

Country of ref document: EP

Effective date: 20200221

ENP Entry into the national phase

Ref document number: 20207006897

Country of ref document: KR

Kind code of ref document: A