CN111045597A - Computer system - Google Patents


Info

Publication number
CN111045597A
CN111045597A
Authority
CN
China
Prior art keywords
erasure coding
coding logic
data
logic
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910951173.4A
Other languages
Chinese (zh)
Other versions
CN111045597B (en)
Inventor
Sompong Paul Olarig
Fred Worley
Oscar P. Pinto
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/207,080 external-priority patent/US10635609B2/en
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Publication of CN111045597A publication Critical patent/CN111045597A/en
Application granted granted Critical
Publication of CN111045597B publication Critical patent/CN111045597B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/40Bus structure
    • G06F13/4004Coupling between buses
    • G06F13/4022Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1668Details of memory controller
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1668Details of memory controller
    • G06F13/1678Details of memory controller using bus width
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/42Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0658Controller construction arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0689Disk arrays, e.g. RAID, JBOD
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/13Linear codes
    • H03M13/15Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes
    • H03M13/151Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes using error location or error correction polynomials
    • H03M13/154Error and erasure correction, e.g. by using the error and erasure locator or Forney polynomial
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026PCI express

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Algebra (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Preparation Of Compounds By Using Micro-Organisms (AREA)
  • Lock And Its Accessories (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A computer system is disclosed. The computer system may include at least one Non-Volatile Memory Express (NVMe) Solid State Drive (SSD), a Field Programmable Gate Array (FPGA) to implement one or more functions supporting the NVMe SSD (e.g., data acceleration, deduplication, data integrity, data encryption, and data compression), and a peripheral component interconnect express (PCIe) switch. The PCIe switch may communicate with both the FPGA and the NVMe SSD.

Description

Computer system
Technical Field
The present inventive concept relates generally to computer systems and, more particularly, to erasure coding within a Peripheral Component Interconnect Express (PCIe) switch.
Background
Currently, most Non-Volatile Memory Express (NVMe) based Solid State Drives (SSDs) with Redundant Array of Independent Disks (RAID) protection are implemented via an external PCIe Add-In Card (AIC). To optimize bus bandwidth between a host Central Processing Unit (CPU) and an AIC RAID controller, the bus typically supports x16 PCIe lanes. However, due to physical limitations of the PCIe card standard form factor, each AIC RAID controller supports only a small number of U.2 connectors (currently the preferred connector for NVMe SSDs): typically only two or four.
To support up to 24 NVMe SSDs inside a 2U chassis, 6 AIC RAID controllers are required, resulting in 6 different RAID domains. This configuration increases the cost and complexity of managing 6 RAID domains. Furthermore, each AIC RAID controller currently costs approximately $400. Thus, for a complete RAID solution in a single 2U chassis, the AIC RAID controllers alone cost over $2,400, before even accounting for the cost of the NVMe SSDs.
The adoption of NVMe SSDs in the enterprise market is limited due to the lack of cost-effective RAID data protection for large data sets. Software RAID solutions are suitable for relatively small data sets, but not for big data.
There are other problems with using AIC RAID controllers:
1) As described above, having multiple RAID domains inside a chassis increases management complexity.
2) As a corollary, the chassis ends up without a single RAID domain spanning all of its drives, even though a single RAID domain would be preferable.
3) The Central Processing Unit (CPU) needs to support a large number of PCIe lanes: 16 PCIe lanes per AIC RAID controller × 6 AIC RAID controllers per chassis = 96 PCIe lanes for the AIC RAID controllers alone. Only the much more expensive high-end CPUs currently support that many PCIe lanes.
4) Since each AIC RAID controller may consume 25 watts, 6 AIC RAID controllers may increase the power consumption per chassis by up to 150 watts.
5) Chassis often have only a few PCIe slots, which may limit the number of AIC RAID controllers that can be added and indirectly limits the number of NVMe SSDs in the chassis that can be protected by RAID.
6) Software RAID solutions often support relatively few RAID levels and may increase CPU overhead.
7) When used over a network, SSD accesses may be slow due to the time required to send data accesses across the network. Furthermore, in some instances, networked storage may require a software RAID implementation, further increasing CPU overhead.
There remains a need for a way to support erasure coding of large numbers of storage devices without being limited by AIC RAID controllers and software RAID solutions.
[ objects of the invention ]
Exemplary embodiments of the present disclosure may provide a system for supporting data protection using erasure coding.
Disclosure of Invention
Example embodiments provide a computer system that may include a Non-Volatile Memory Express (NVMe) Solid State Drive (SSD), a Field Programmable Gate Array (FPGA) implementing functionality to support the NVMe SSD, and a peripheral component interconnect express (PCIe) switch. The functions supporting the NVMe SSD come from a group of functions including data acceleration, data deduplication, data integrity, data encryption, and data compression. The PCIe switch communicates with the FPGA and the NVMe SSD.
Another exemplary embodiment provides a computer system that may include a Non-Volatile Memory Express (NVMe) Solid State Drive (SSD) and a Field Programmable Gate Array (FPGA) including a first FPGA portion and a second FPGA portion. The first FPGA portion implements functionality to support the NVMe SSD. The second FPGA portion implements a peripheral component interconnect express (PCIe) switch. The functions supporting the NVMe SSD come from a group of functions including data acceleration, deduplication, data integrity, data encryption, and data compression. The PCIe switch communicates with the FPGA and the NVMe SSD. The FPGA and the NVMe SSD are located inside a common enclosure.
Yet another exemplary embodiment provides a computer system that may include a Non-Volatile Memory Express (NVMe) Solid State Drive (SSD) and a peripheral component interconnect express (PCIe) switch with erasure coding logic. The PCIe switch may include an external connector enabling the PCIe switch to communicate with a processor, at least one connector enabling the PCIe switch to communicate with the NVMe SSD, a Power Processing Unit (PPU) for configuring the PCIe switch, and an erasure coding controller including circuitry for applying an erasure coding scheme to data stored on the NVMe SSD.
[ Effect of the invention ]
According to embodiments of the invention, using a PCIe switch that includes look-aside erasure coding logic to move erasure coding closer to the storage devices may reduce the time required to move data back and forth. In addition, placing the erasure coding controller with the PCIe switch eliminates the need for expensive RAID plug-in cards and allows larger arrays (even spanning multiple chassis) to be used.
Drawings
Fig. 1 illustrates a machine including a peripheral component interconnect express (PCIe) switch having lookaside erasure coding logic, according to an embodiment of the inventive concept.
Figure 2 shows additional detail of the machine shown in figure 1.
Fig. 3 shows additional details of the machine shown in fig. 1, including a power board and a midplane connecting a PCIe switch having the lookaside erasure coding logic shown in fig. 1 to a storage device.
Fig. 4 illustrates the storage devices of fig. 3 implementing different erasure coding schemes.
FIG. 5 shows details of the PCIe switch shown in FIG. 1 with lookaside erasure coding logic.
Fig. 6 shows details of a PCIe switch having look-through erasure coding logic according to another embodiment of the inventive concept.
FIG. 7 illustrates a first topology using the PCIe switch with lookaside erasure coding logic shown in FIG. 1 according to one embodiment of the present inventive concept.
FIG. 8 illustrates a second topology using the PCIe switch with lookaside erasure coding logic shown in FIG. 1 according to another embodiment of the present inventive concept.
FIG. 9 illustrates a third topology using the PCIe switch with lookaside erasure coding logic shown in FIG. 1 according to yet another embodiment of the present inventive concept.
FIG. 10 illustrates a fourth topology using the PCIe switch with lookaside erasure coding logic shown in FIG. 1 according to yet another embodiment of the present inventive concept.
FIGS. 11A-11D illustrate a flow diagram of an exemplary process for the PCIe switch with lookaside erasure coding logic shown in FIG. 1 to support an erasure coding scheme in accordance with an embodiment of the present inventive concept.
Fig. 12A-12B illustrate a flow diagram of an exemplary process for a PCIe switch with lookaside erasure coding logic shown in fig. 1 to perform initialization, according to an embodiment of the inventive concept.
FIG. 13 illustrates a flow diagram of an exemplary process for the PCIe switch with lookaside erasure coding logic of FIG. 1 to incorporate a new storage device into an erasure coding scheme in accordance with an embodiment of the present inventive concept.
FIG. 14 illustrates a flow diagram of an exemplary process for a PCIe switch with lookaside erasure coding logic shown in FIG. 1 to handle failed storage in accordance with an embodiment of the inventive concept.
[ description of symbols ]
105: machine/host
110: processor
115: memory
120: memory controller
125, 320, 605, 1005: peripheral component interconnect express (PCIe) switch
130: storage device
130-1, 130-2, 130-3, 130-4, 130-5, 130-6: Solid State Drive (SSD)/storage device/physical storage device
135: device driver
205: clock
210: network connector
215: bus
220: user interface
225: input/output engine
305: midplane
310, 315: power board
325, 330: Baseboard Management Controller (BMC)
405, 410, 415: erasure coding scheme
505: connector
510-1, 510-2, 510-3, 510-4, 510-5, 510-6: PCIe-to-PCIe stack
515: PCIe switch core
520: Power Processing Unit (PPU)
525: probe logic
530: erasure coding controller
535-1, 535-2, 535-3, 535-4, 535-5, 535-6: capture interface
540: multiplexer
545: cache
550: write buffer
555: erasure coding enable signal
705: Field Programmable Gate Array (FPGA)
1103, 1106, 1109, 1112, 1115, 1118, 1121, 1124, 1127, 1130, 1133, 1136, 1139, 1145, 1148, 1151, 1154, 1160, 1163, 1205, 1210, 1215, 1220, 1225, 1235, 1240, 1305, 1310, 1315, 1405, 1410, 1415, 1420: blocks
1142, 1157, 1166, 1230, 1320, 1425: dashed lines
Detailed Description
Reference will now be made in detail to embodiments of the present inventive concept, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present concepts. It should be understood, however, that one of ordinary skill in the art may practice the inventive concept without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail as not to unnecessarily obscure aspects of the embodiments.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first module may be termed a second module, and, similarly, a second module may be termed a first module, without departing from the scope of the inventive concept.
The terminology used in the description of the inventive concept herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. The singular forms "a/an" and "the" as used in the description of the concepts of the present invention and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The components and features shown in the drawings are not necessarily to scale.
A Field Programmable Gate Array (FPGA) has sufficient intelligence, computing resources, and high-speed Input/Output (I/O) connections to perform Redundant Array of Independent Disks (RAID)/erasure code parity generation and data recovery when necessary. An FPGA + Solid State Drive (SSD) may require an embedded peripheral component interconnect express (PCIe) switch to support more co-controllers/co-processors, such as one or more SSDs, Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), etc. Multiple coprocessors also require more NAND flash channels.
Embodiments of the present invention support erasure coding within a PCIe switch inside the FPGA. Embodiments of the inventive concept may also enable a user to remotely configure a RAID engine (internal to the FPGA) through a Baseboard Management Controller (BMC). Users may provision RAID-on-a-Chip (RoC) or erasure coding controllers through standard interfaces such as PCIe (used as a control plane) or the System Management Bus (SMBus). Being able to configure storage in this manner may be useful for users who lease computing resources: when done, a user may wish to quickly destroy the data before the next user can use the same computing resources. In this case, the BMC may send an erase command to all embedded PCIe switches inside the multiple FPGA + SSDs. Upon receiving an erase command, the FPGA's RoC/erasure coding controller will erase both the data in the Logical Block Address (LBA) range specified by the command and the corresponding parity data.
Today, PCIe switches expose virtual switches or virtual groupings, where more than one switch is exposed to an administrator. These configurations are useful in a virtualized environment where the network, CPU, GPU, FPGA, and memory behind these virtual domains can be grouped together. In one embodiment, such virtual grouping may be applied to storage by creating RAID sub-groups for the virtualized environment that are exposed to a user group, or alternatively for tiered RAID groupings (e.g., RAID 10, RAID 50, RAID 60, etc.). These tiered RAID groups create small groups, and an additional RAID layer is applied on top to create a larger RAID solution. The virtual switches manage the smaller RAID groups, while the master switch manages the overall RAID configuration.
This solution provides important distinguishing features in enterprise and data center environments, since the data protection scheme and its management are brought closer to the storage. Embodiments of the inventive concept provide higher density and performance with lower power consumption.
The solution may consist of an embedded PCIe switch with an integrated RoC or erasure coding controller located in the data path between the host and the SSDs. The PCIe switch + RoC combination may be managed by the BMC for configuration and control, and an interface may be exposed to software for specific configuration before release to new users.
When operating in erasure coding/RAID mode, all incoming Non-Volatile Memory Express (NVMe) or NVMe over Fabrics (NVMe-oF) traffic to and from the embedded PCIe switch may be probed by the RoC or erasure coding controller (which may be referred to as a lookaside RoC or erasure coding controller). The RoC or erasure coding controller may determine whether the data in the traffic results in a cache hit in its local cache. If there is a cache hit, the transaction (read or write) need not be forwarded to the appropriate SSD. The requested read data may be provided directly from the RoC's cache. Write data will be updated directly in the RoC's local cache and marked as "modified" or "dirty" data.
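As a concrete illustration of the probe-and-cache behavior just described, the following Python sketch models the lookaside controller's cache check. It is a simplified simulation under assumed names (LookasideCache, forward_to_ssd), not the patent's implementation.

```python
# Simplified simulation of the lookaside RoC/erasure coding controller's
# cache probe (illustrative only; all names are hypothetical).
class LookasideCache:
    def __init__(self):
        self.lines = {}  # LBA -> [data, dirty-flag]

    def read(self, lba, forward_to_ssd):
        if lba in self.lines:              # cache hit: the transaction is
            return self.lines[lba][0]      # not forwarded to the SSD
        data = forward_to_ssd(lba)         # cache miss: forward downstream
        self.lines[lba] = [data, False]
        return data

    def write(self, lba, data):
        # Write data is updated directly in the local cache and marked
        # "modified"/"dirty"; it is flushed to the SSD later.
        self.lines[lba] = [data, True]
```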
For SSDs, parity may be distributed among the connected SSDs. For example, if RAID4 is selected, the last SSD may only be used to store parity, while the other SSDs are used to store data.
Virtual I/O addresses may be supported by having an external PCIe switch between the host and the SSD devices. In this case, the primary RoC, which is part of the host PCIe switch, may virtualize all SSD addresses. In other words, the addresses and devices are not visible to the host Operating System (OS). In such embodiments of the inventive concept, peer-to-peer transactions between at least two SSDs as peers are allowed and supported. This option may enhance some form of redundancy and/or availability of the SSDs by striping across more than one SSD. In this mode, the embedded RoC or erasure coding controller within the FPGA (if present) can be disabled. The only enabled RoC/erasure coding controller is located in the host PCIe switch.
If the storage device is operating in single device mode, all incoming NVMe/PCIe traffic may be forwarded to the SSD with the requested data.
If paired mode is enabled, the RoC/erasure coding controller can determine whether the address of the requested data belongs to its own Base Address Register (BAR) range. In this case, the transaction may be completed locally by the RoC. For write transactions, either a posted write buffer or a write cache (using some embedded Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM)) may be used. If there is a write cache hit (a previous write has occurred and the data is still stored in the write cache buffer), then the handling depends on the write cache policy. For example, if the cache policy is write-back, the write command will be completed from the RoC's cache and terminated there: the RoC may terminate the write command to the host once the write data has been successfully updated in its local cache. If the cache policy is write-through, the write command will complete only when the write data has been successfully transferred to the drive.
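The difference between the two cache policies can be summarized in a few lines. This is a hedged sketch of the completion rules stated above, with invented hook names; it is not the controller's actual firmware.

```python
# Write handling under the two cache policies described above (illustrative;
# write_to_drive and complete_to_host are hypothetical hooks).
def handle_write(cache, lba, data, policy, write_to_drive, complete_to_host):
    cache[lba] = data                 # update the RoC's local cache first
    if policy == "write-back":
        complete_to_host()            # terminate at the cache now; the
                                      # drive is updated later
    elif policy == "write-through":
        write_to_drive(lba, data)     # wait for the drive to accept the data
        complete_to_host()            # only then complete to the host
```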
The RoC may virtualize a stack of devices it claims and present them as a single device or as fewer devices, as a protection scheme against data or device failures. The data protection scheme can be distributed across a bank of devices so that when data is lost on any one device, it can be reconstructed from the other devices. RAID and Erasure Coding (EC) are common data protection schemes that use distributed algorithms to protect against such losses.
To virtualize the devices below the RoC, the devices may be terminated at the RoC and made invisible to the host. That is, a PCIe switch may be connected to all known devices, and the RoC may be connected to the switch. To manage the devices, the RoC may discover and configure individual devices through the PCIe switch. Alternatively, the RoC may be a pass-through in default/factory mode and allow host software to configure the RoC. The host software may be specifically customized to work with the PCIe switch + RoC hardware. Once configured, the RoC may terminate the devices and make them invisible to the host.
The PCIe switch + RoC may be configured in a variety of ways for RAID mode and EC mode. There may be additional PCIe switches downstream to create a larger fan-out configuration that supports more devices. In addition, more than one such combination of hardware may be associated together to form a larger setup. For example, two PCIe switch + RoC combinations may work together to form an alternative configuration. Alternatively, the two combinations may operate individually.
When the PCIe switch + RoC combinations work individually, the host instantiates each RoC and PCIe switch combination as a separate device. Here, the host may have a standard OS driver that sees the SSDs virtualized by each RoC. For example, assume that there are 6 SSDs aggregated under the PCIe switch and exposed to the host by the RoC as 1 SSD; a second RoC and PCIe switch combination may expose a similar setup to the host. The host then discovers 2 SSDs across the RoC controller devices (one per controller). Each RoC controller may expose separate device space for each exposed SSD. The host does not see the devices that back each exposed SSD. The RoC manages the hardware I/O paths through the PCIe switches.
This approach may be used in an active-passive setup, where the second controller is a backup path protecting against failure of the first controller path. The host actively uses only the first controller, and no I/O is sent to the second RoC controller. If an active-passive setup is used, the 2 RoC controllers may replicate data internally. This may be accomplished by the first (active) controller sending all writes to the second RoC controller, as in a RAID 1 data protection setting.
There may be a second active-passive setup in which the second RoC and PCIe switch have no SSDs behind them and serve only as a standby controller path. In this case, since the 2 RoC controllers relate to the same set of SSDs, no I/O needs to be sent between them. This is the standard active-passive setup.
The SSDs behind each RoC may also be uncoordinated with each other, in which case the 2 exposed SSDs are treated as separate SSDs, with no protection scheme shared between them.
In yet another usage, both paths may be used in an active-active setup. Such an arrangement may be used for load-balancing purposes. Here, the host may use both paths, distributing the I/O workload using a particular software layer. The two RoC controllers may coordinate their write operations between them to keep the two SSDs synchronized. That is, each SSD from each RoC controller may contain the same data, as in a RAID 1 setting.
In yet another configuration, the 2 RoC controllers communicate in a manner that keeps their I/O distributed in a custom setup. Here, the host uses only one RoC controller: the other RoC controller is connected to the first RoC controller. The first RoC controller may expose one or more virtual NVMe SSDs to the host. The 2 RoC controllers can be set to divide odd and even LBA space between them. Since NVMe uses a pull model for data from the device side, the host sends commands only to the SSD exposed by the first RoC controller. The first RoC controller may send a copy of the command to the second RoC controller via its side-channel connection. Each RoC controller may be set to service only odd or only even LBAs, stripes, zones, etc. This arrangement provides internal load balancing that need not be managed by the host and can be managed transparently by the RoC and PCIe switch combinations. Each RoC controller processes only odd or only even LBA ranges and satisfies requests into the host buffer. Since both RoC controllers access the host, each can fill in the data for its own odd or even portion.
For example, the host may send a command to the first RoC controller to read four consecutive LBAs: LBA 0, LBA 1, LBA 2, and LBA 3; the first RoC controller sends a copy to the second RoC controller. Then the first RoC controller reads the data for LBA 0 and LBA 2 from the first two SSDs on its PCIe switch, while the second RoC controller reads the data for LBA 1 and LBA 3 from the first two SSDs on its PCIe switch. The second RoC controller can then report to the first RoC controller that it has completed its operation, and the first RoC controller can then report to the host that the transaction is complete.
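The routing rule in this example reduces to a parity check on the LBA. The sketch below is illustrative only, assuming (as in the text) that the first controller services even LBAs and the second services odd LBAs.

```python
# Worked example of the odd/even LBA split (illustrative assumption:
# the first RoC controller services even LBAs, the second services odd).
def split_lbas(lbas):
    first = [lba for lba in lbas if lba % 2 == 0]    # first RoC controller
    second = [lba for lba in lbas if lba % 2 == 1]   # second RoC controller
    return first, second

first, second = split_lbas([0, 1, 2, 3])
assert first == [0, 2] and second == [1, 3]
# The second controller reports completion to the first, which then
# reports completion of the whole transaction to the host.
```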
Odd/even LBAs/stripes/zones are just examples; the same approach may be applied to other load-sharing usages.
Embodiments of the present inventive concept can support SSD failure, removal, and hot addition. The RoC in the PCIe switch needs to detect when an SSD is not working properly or has been removed from its slot. When the PCIe switch detects this, the RoC may begin a rebuild operation for the SSD that failed or was removed. The RoC may also handle any I/O operations during the rebuild period by prioritizing data from the associated stripe.
There are at least two ways to report SSD failures or removals to the RoC in the PCIe switch. In one embodiment of the inventive concept, all SSDs have a presence (PRSNT#) pin connected to the BMC. When an SSD is pulled out of the chassis, the BMC detects the removal. The BMC then reports the affected slot number to the RoC in the PCIe switch. The BMC may also periodically monitor the health of the SSDs. If the BMC detects any fatal error condition reported by an SSD, the BMC may decide to take this SSD out of service. The BMC may then report the failed slot number to the RoC so that a replacement SSD may be rebuilt.
In another embodiment of the inventive concept, the PCIe switch may be capable of supporting hot plug, in which all SSDs are connected by PCIe sideband signals and certain error conditions may be detected. The PCIe switch may detect when an SSD is pulled out or added, or when the PCIe link to an SSD is no longer connected. In such an error scenario, the RoC in the PCIe switch may isolate the failed SSD, or the BMC may isolate the failed SSD by disabling power to the failed drive, and rebuilding of the drive may begin immediately.
When asserted, the presence (PRSNT#) pin of each U.2 connector may indicate the presence of a new device in the chassis. This signal is connected to the PCIe switch and/or the BMC. The RoC can then configure new drives into its existing domain as appropriate, based on the current data protection policy.
All incoming traffic from the host needs to be forwarded to the probe P2P and address translation (physical-to-logical) logic. During PCIe enumeration, all configuration cycles for all ports need to be forwarded to the probe P2P logic. Depending on the selected mode of operation, the behavior of a PCIe switch with RoC is defined as follows:
[Table rendered as an image in the original: behavior of the PCIe switch with RoC in each operating mode.]
The RoC may also be located between, and in line with, the PCIe switch and the host processor. In such an embodiment of the inventive concept, the RoC may be referred to as a look-through RoC. When using a look-through RoC, if the PCIe switch operates like a normal PCIe switch, the RoC is disabled and becomes a retimer for all ports. In this case, all upstream ports are allowed to connect as in the normal use case.
If RoC is enabled, a small number of non-transparent bridge (NTB) ports will be connected to the host. In such a case, RoC may virtualize the incoming address as a logical address according to the selected RAID or erasure coding level.
Whether the RoC is a lookaside RoC or a look-through RoC, all incoming read/write memory requests may be checked against the RoC's local cache to determine a cache hit or a cache miss. If there is a cache hit, the requested read data may be provided by the RoC's local cache memory instead of the SSD. For a memory write hit, the write data may be updated in the cache memory immediately; the same write data may be updated to the SSD later. Such an implementation may reduce the total latency of memory writes, thereby improving system performance.
If there is a cache miss, the RoC controller may determine which SSD is the correct drive to access the data.
To address a PCIe device, it must be enabled by being mapped into the system's I/O port address space or memory-mapped address space. The system's firmware, device drivers, or operating system program the Base Address Registers (BARs) to inform the device of its address mapping by writing configuration commands to the PCI controller. Because all PCIe devices are in an inactive state upon system reset, they will not yet have addresses assigned through which the operating system or device drivers can communicate with them. A basic input/output system (BIOS) or operating system geographically addresses the PCIe slots (e.g., the first, second, or third PCIe slot on the motherboard) through the PCIe controller, using the Initialization Device Select (IDSEL) signal for each slot.
[Table rendered as an image in the original: PCI BAR bit definitions.]
Since the BIOS or operating system has no direct way to determine which PCIe slots have devices installed (nor which functions a device implements), the PCI bus is enumerated. Bus enumeration is carried out by attempting to read the Vendor Identification (VID) and Device Identification (DID) registers at function #0 of each combination of bus number and device number. Note that the device number, distinct from the DID, is merely the device's sequential number on that bus. Further, after a new bridge is detected, a new bus number is defined and device enumeration restarts at device number zero.
If no response is received from function #0 of a device, the bus master performs an abort and returns an all-ones value (FFFFFFFF in hexadecimal), which is an invalid VID/DID value. In this manner, the device driver learns that the specified bus/device/function (B/D/F) combination is not present. Thus, when a read of function #0 for a given bus/device causes a master (initiator) abort, the device driver can conclude that there is no working device at that bus/device number (a device is required to implement function number zero). In this case, reads of the remaining function numbers (1 to 7) are unnecessary, as they too will not exist.
When the read of the vendor ID register for a specified B/D/F combination succeeds, the device driver knows the device is present. The device driver may then write all 1's to the device's BAR and read back the device's requested memory size in encoded form. The design implies that all address space sizes are powers of 2 and are naturally aligned.
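The all-1's sizing handshake can be sketched as follows. This is a software simulation of the sequence described above, not real configuration-space access; the read/write callbacks are stand-ins.

```python
# Simulation of the BAR-sizing handshake described above (illustrative;
# read_bar/write_bar stand in for PCI configuration space accesses).
def bar_size(read_bar, write_bar):
    original = read_bar()
    write_bar(0xFFFFFFFF)                  # write all 1's to the BAR
    encoded = read_bar()                   # read back the encoded size
    write_bar(original)                    # restore the original mapping
    # Mask off the low type-flag bits of a memory BAR; the device hardwires
    # the low address bits to 0, so the size is a power of 2:
    return (~(encoded & 0xFFFFFFF0) & 0xFFFFFFFF) + 1

def fake_device(size):                     # a device requesting 'size' bytes
    state = {"bar": 0}
    read = lambda: state["bar"]
    write = lambda v: state.update(bar=v & ~(size - 1) & 0xFFFFFFF0)
    return read, write

read, write = fake_device(64 * 1024)
assert bar_size(read, write) == 64 * 1024  # a naturally aligned 64 KB region
```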
At this point, the BIOS or operating system may program the memory-mapped addresses and I/O port addresses into the device's BAR configuration registers. These addresses remain valid as long as the system stays powered on. Upon power-down, all of these settings are lost, and the process is repeated the next time the system powers back up. Since this entire process is fully automated, the user does not need to configure any newly added hardware manually by setting DIP switches on the card itself. This automatic device discovery and address space allocation is how plug and play is implemented.
If a PCIe-to-PCIe bridge is found, the system may assign a non-zero bus number to the secondary PCI bus beyond the bridge and then enumerate the devices on this secondary bus. If more PCIe bridges are found, discovery continues recursively until all possible domain/bus/device combinations have been scanned.
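A recursive enumeration loop following the rules above might look like the sketch below. It is a schematic simulation (the callbacks stand in for configuration-space mechanisms), not an actual OS implementation.

```python
# Schematic recursive bus enumeration (illustrative; read_vid_did,
# is_bridge, and secondary_bus are hypothetical stand-ins).
ABSENT = 0xFFFFFFFF                    # master abort reads back all 1's

def enumerate_bus(bus, read_vid_did, is_bridge, secondary_bus):
    found = []
    for device in range(32):           # up to 32 device numbers per bus
        vid_did = read_vid_did(bus, device, 0)   # probe function #0
        if vid_did == ABSENT:
            continue                   # absent: skip functions 1-7 as well
        found.append((bus, device, vid_did))
        if is_bridge(bus, device):     # bridge: recurse into its new bus
            found += enumerate_bus(secondary_bus(bus, device),
                                   read_vid_did, is_bridge, secondary_bus)
    return found
```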
Each non-bridge PCIe device function may implement up to 6 BARs, each of which may respond to different addresses in the I/O port and memory-mapped address spaces. Each BAR describes a region.
The PCIe device may also have an optional Read Only Memory (ROM) that may contain driver code or configuration information.
The BMC may configure the RoC settings directly. The BMC may have a hard-coded path or configurable settings by which a particular data protection scheme is applied. The latter may expose the interface to this configuration as a BIOS option, or additionally to software via a hardware-exposed interface. The hard-coded scheme may be built into the BIOS firmware and may still provide the option of enabling/disabling the protection.
To handle device failures, the BMC may detect through a control path when a drive goes bad or is removed. The BMC may also determine, through Self-Monitoring, Analysis and Reporting Technology (SMART), that a device is expected to fail soon. In these cases, the BMC may reconfigure the RoC hardware to enable the failure scenario, or alert the user to the scenario. The BMC touches only the control path, not the data path. When a new drive is inserted, the BMC may again intervene and configure the new drive as part of the protected group, or initiate a rebuild operation. The RoC hardware can handle the actual rebuild; the recovery path in this setup should have as little performance impact as possible while adding minimal latency to the data access path.
Fig. 1 illustrates a machine including a peripheral component interconnect express (PCIe) switch having lookaside erasure coding logic, according to an embodiment of the inventive concept. In FIG. 1, a machine 105 is shown. Machine 105 may include a processor 110. The processor 110 may be any kind of processor: for example, an Intel Xeon, Celeron, Itanium, or Atom processor, an Advanced Micro Devices (AMD) Opteron processor, an Advanced RISC Machine (ARM) processor, and the like. Although fig. 1 shows a single processor 110 in machine 105, machine 105 may include any number of processors, each of which may be a single-core or multi-core processor, and which may be mixed in any desired combination.
Machine 105 may also include a memory 115, and memory 115 may be managed by a memory controller 120. The Memory 115 may be any type of Memory, such as a flash Memory, a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), a Persistent Random Access Memory (Persistent Random Access Memory), a Ferroelectric Random Access Memory (FRAM), or a Non-Volatile Random Access Memory (NVRAM) such as a Magnetoresistive Random Access Memory (MRAM). Memory 115 may also be any desired combination of different memory types.
Machine 105 may also include a peripheral component interconnect express (PCIe) switch 125 having lookaside erasure coding logic. PCIe switch 125 can be any desired PCIe switch that supports lookaside erasure coding logic.
Machine 105 may also include a storage device 130, and storage device 130 may be controlled by a device driver 135. Storage device 130 may be any desired form of storage capable of communicating with PCIe switch 125. For example, the storage device 130 may be a Non-Volatile Memory Express (NVMe) Solid State Drive (SSD).
Although fig. 1 depicts machine 105 as a server, which may be a stand-alone server or a rack server, embodiments of the present inventive concept may include any desired type of machine 105 without limitation. For example, machine 105 may be replaced with a desktop computer (desktop computer) or a laptop computer (laptop computer) or any other machine that may benefit from embodiments of the present inventive concepts. Machines 105 may also include special purpose portable computing machines, tablet computers (tablets), smart phones, and other computing machines.
Figure 2 shows additional detail of the machine shown in figure 1. In FIG. 2, generally, machine 105 includes one or more processors 110, which one or more processors 110 may include a memory controller 120 and a clock 205, which clock 205 may be used to coordinate the operation of the components of machine 105. Processor 110 may also be coupled to memory 115, and memory 115 may include, for example, Random Access Memory (RAM), read-only memory (ROM), or other state-retaining media. The processor 110 may also be coupled to a storage device 130 and a network connector 210, the network connector 210 may be, for example, an ethernet connector or a wireless connector. The processor 110 may also be connected to a bus 215, and the bus 215 may be attached to a user interface 220 and input/output interface ports that may be managed using an input/output engine 225, among other components.
Fig. 3 shows additional details of machine 105 of fig. 1, including power boards and a midplane connecting PCIe switch 125 having the lookaside erasure coding logic of fig. 1 to the storage devices. In fig. 3, machine 105 may include a midplane 305 and power boards 310 and 315. Power board 310 may include PCIe switch 125 having lookaside erasure coding logic and baseboard management controller (BMC) 325, and power board 315 may include PCIe switch 320 having lookaside erasure coding logic and BMC 330. (Power boards 310 and 315 may also include additional components not shown in FIG. 3: FIG. 3 focuses on the elements most relevant to embodiments of the inventive concept.)
In some embodiments of the inventive concept, each of PCIe switches 125 and 320 with lookaside erasure coding logic may support up to 96 total PCIe lanes. PCIe switches 125 and 320 with lookaside erasure coding logic are connected to storage devices 130-1 through 130-6 using U.2 connectors, each U.2 connector supporting up to 4 PCIe lanes per device. Two x4 links are used per device (one x4 link per direction of communication), which means that each PCIe switch can support up to 96 ÷ 8 = 12 devices. Thus, FIG. 3 shows 12 storage devices 130-1 through 130-3 in communication with PCIe switch 125 having lookaside erasure coding logic, and 12 storage devices 130-4 through 130-6 in communication with PCIe switch 320 having lookaside erasure coding logic. The number of storage devices in communication with PCIe switches 125 and 320 is limited only by the number of PCIe lanes provided by the switches and the number of PCIe lanes used per storage device 130-1 through 130-6.
In some embodiments of the inventive concept, PCIe switches 125 and 320 with lookaside erasure coding logic may be implemented using custom circuitry. In other embodiments of the inventive concept, the PCIe switches 125 and 320 with the lookaside erasure coding logic may be implemented using a suitably programmed Field Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit (ASIC).
The BMCs 325 and 330 may be used to configure the storage devices 130-1 through 130-6. For example, BMCs 325 and 330 may initialize storage devices 130-1 through 130-6, erasing any data present on them: at startup, when storage devices 130-1 through 130-6 are added to the erasure coding scheme, or both. Alternatively, such functionality may be supported by a processor (processor 110 shown in fig. 1, or a local processor present (but not shown) on power boards 310 and 315). The BMCs 325 and 330 (or the processor 110 shown in FIG. 1, or a local processor present (but not shown) on power boards 310 and 315) may also be responsible for the initial configuration of the lookaside erasure coding logic of PCIe switches 125 and 320.
FIG. 3 shows an exemplary complete data protection setup for the two PCIe switches 125 and 320 with lookaside erasure coding logic: BMCs 325 and 330 may directly configure the lookaside erasure coding logic. BMCs 325 and 330 may have a hard-coded path or configurable settings by which a particular data protection scheme is applied. The latter may expose the interface to this configuration as a basic input/output system (BIOS) option, or to additional software via a hardware-exposed interface. The hard-coded scheme may be built into the BIOS firmware and may still provide the option of enabling/disabling the protection.
In the event of a storage device failure, BMCs 325 and 330 may detect via a control path when a storage device is going bad or has been removed. BMCs 325 and 330 may then reconfigure the lookaside erasure coding logic to enable the failure scenario. BMCs 325 and 330 may be connected to the control path, but not the data path. Similarly, when a new storage device is inserted, BMCs 325 and 330 may intervene and configure the new storage device as part of an established group, or initiate a rebuild operation. The lookaside erasure coding logic may handle the actual rebuild; ideally, the recovery path in this setup should minimize the performance impact on data access while reconstructing the data of the failed storage device from the remaining storage devices.
At this point, it makes sense to define the term "erasure coding". Erasure coding is intended to cover any desired manner of encoding data across multiple storage devices. Erasure coding may require at least two storage devices, or at least two portions of one storage device (e.g., a single housing containing two or more NAND flash channels), because if only one storage device is used, conventional data access techniques appropriate for that storage device may be used to store the data. In other words, erasure coding is defined to mean a way of storing data across two or more storage devices, two or more portions of a single storage device, or any combination thereof, in a manner that uses the storage devices more efficiently and/or provides data redundancy.
A Redundant Array of Independent Disks (RAID) represents a subset of erasure coding; or in other words, the RAID level represents a particular implementation of various erasure coding schemes. However, there may be other erasure coding schemes that may be defined beyond the traditional RAID level.
Typically, erasure coding (or RAID) is implemented using two or more physically distinct storage devices. In some embodiments of the inventive concept, however, a single case or housing may include multiple portions of storage that may be treated as separate storage devices for erasure coding purposes. For example, a single NVMe SSD case may include multiple NAND flash memory channels. For erasure coding purposes, each NAND flash channel can be considered a separate storage device, with data striped (or coded) across the various NAND flash channels. In some embodiments of the inventive concept, this makes it possible to implement erasure coding using a single storage device. In addition, the PCIe switch 125 with lookaside erasure coding logic may support error correction codes (either built into the PCIe switch 125 somewhere, or through additional logic) or other functions that may be used with a single storage device.
Fig. 4 illustrates the memory devices 130-1 through 130-6 of fig. 3 for implementing different erasure coding schemes. In FIG. 4, storage devices 130-1 through 130-6 may be used in a RAID0 configuration, as shown in erasure coding scheme 405. RAID0 stripes data across various storage devices. That is, the data is divided into logical units suitable for the storage devices, and each logical unit is written to a different storage device up to the number of storage devices in the array; after a logical unit of data has been written on all storage devices, the data is written again on the first storage device, and so on.
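The striping rule just described reduces to simple modular arithmetic. A minimal sketch (illustrative only):

```python
# RAID 0 striping: logical unit i lands on device (i mod n), row (i // n).
def raid0_map(logical_unit, n_devices):
    return logical_unit % n_devices, logical_unit // n_devices

# With the 12 storage devices of FIG. 4, units 0-11 fill row 0 across
# devices 0-11, and unit 12 wraps back to device 0 in row 1:
assert raid0_map(11, 12) == (11, 0)
assert raid0_map(12, 12) == (0, 1)
```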
RAID0 has advantages over using a single storage device alone, or even over an unorganized group of disks such as Just a Bunch of Disks (JBOD) or Just a Bunch of Flash (JBOF). Because the data is stored on multiple storage devices, it can be read and written more quickly, with the storage devices operating in parallel. Thus, for example, by dividing the data across the 12 storage devices 130-1 through 130-6 as shown in FIG. 4, each storage device need only read or write one twelfth of the total data, which is faster than one device reading or writing the entire data. The total capacity of the array may be calculated as the number of storage devices in the array multiplied by the capacity of the smallest storage device in the array. Thus, in FIG. 4, since the array includes 12 data storage devices, the total capacity of the array is 12 times the capacity of the smallest storage device in the array.
The disadvantage of RAID0 is that there is no protection against storage device failure: if any storage device in the array fails, data is lost. In fact, RAID0 may be considered higher-risk than JBOD or JBOF: by striping data across multiple storage devices, all data is lost if any individual storage device fails. (By contrast, while failure of a single storage device may result in some data loss in a JBOD or JBOF setting, not all data is necessarily lost.)
RAID0 does not include any redundancy, and is therefore not technically a redundant array of independent disks. But conventionally RAID0 is considered a RAID level, and RAID0 can certainly be considered an erasure coding scheme.
Erasure coding scheme 410 illustrates RAID5, which is a common RAID scheme. In RAID5, a parity block may be computed for the data stored on the other storage devices of each stripe. Thus, in FIG. 4, since the RAID5 array includes a total of 12 storage devices, 11 storage devices are used as data drives and 1 storage device is used as a parity drive. (RAID4, which is no longer used often, stores all parity information on a single drive.) The total capacity of an array of n storage devices may be calculated as n-1 times the capacity of the smallest storage device. Since each stripe includes one parity block, the erasure coding scheme 410 can tolerate the failure of up to one storage device and still be able to access all data (the data on the failed storage device can be recovered using the data on the functioning storage devices in conjunction with the parity blocks).
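For a single-parity scheme such as RAID4/RAID5, both parity generation and data recovery are XOR operations. The following sketch illustrates one stripe; it is an illustration of the general technique, not the controller's implementation.

```python
# XOR parity for one stripe: generation and single-failure recovery
# (illustrative sketch of the RAID 4/RAID 5 parity described above).
from functools import reduce

def xor_parity(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]  # data blocks of one stripe
parity = xor_parity(data)                        # stored on the parity device

# If any one data block is lost, XOR the survivors with the parity block
# to rebuild it:
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost]
assert xor_parity(survivors + [parity]) == data[lost]
```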
Note that RAID5 provides less total storage compared to RAID0, but provides some protection against storage device failure. This is an important tradeoff when deciding on RAID level: the relative importance of total storage capacity and redundancy.
Other RAID levels not shown in figure 4 may also be used as erasure coding schemes. For example, RAID6 uses two storage devices to store parity information, reducing the total storage capacity to n-2 times the capacity of the smallest storage device, but tolerating up to two simultaneous storage device failures. Hybrid schemes are also possible: for example, RAID 0+1, RAID 1+0, RAID 5+0, RAID 6+0, and other RAID schemes, each providing a different total storage capacity and storage device fault tolerance. For example, five of the storage devices 130-1 through 130-6 may form one RAID5 array, another five may form a second RAID5 array, and these two groups combined with the remaining two storage devices may form a larger RAID5 array. Alternatively, storage devices 130-1 through 130-6 may be divided into two groups, each implementing a RAID0 array, with the two groups acting as a larger RAID 1 array (thereby implementing a RAID 0+1 setup). It should be noted that RAID and erasure coding techniques may use fixed codes or rotating codes; the fixed code/parity drive notation above is used for illustration purposes only.
Erasure coding scheme 415 represents a more general description that applies to all RAID levels and any other desired erasure coding scheme. Considering the array of storage devices 130-1 through 130-6, the storage devices may be divided into two groups: one group for storing data and the other for storing code. The code may be parity information or any other desired encoding information that allows lost data to be recovered from a subset of the data group and some of the encodings in the code group. As shown in FIG. 4, erasure coding scheme 415 can include up to X data storage devices and Y code storage devices. Given any X surviving storage devices from the array, it should be possible to access or reconstruct the data of all X data storage devices. Thus, erasure coding scheme 415 can generally tolerate up to Y storage device failures in the array and still access all data stored in the array. In terms of capacity, the total capacity of erasure coding scheme 415 is X times the capacity of the smallest storage device.
Note that in the above discussion, the total capacity of each erasure coding scheme is expressed relative to the "capacity of the smallest storage device". For some erasure coding schemes, the storage devices may have different capacities and still be fully utilized. Other erasure coding schemes (e.g., RAID0 or RAID 1) expect all storage devices to have the same capacity and will discard any extra capacity a larger storage device may include. Thus, the phrase "capacity of the smallest storage device" should be understood as a relative phrase, and the total capacity provided by an array using a particular erasure coding scheme may be greater than the above formulas suggest.
Returning to FIG. 3, regardless of the particular erasure coding scheme used, the lookaside erasure coding logic of PCIe switches 125 and 320 effectively creates a new storage device from the physical storage devices 130-1 through 130-6. Since the storage presented by the erasure coding scheme does not physically exist, this new storage device can be considered a virtual storage device. And since this virtual storage device uses physical storage devices 130-1 through 130-6, physical storage devices 130-1 through 130-6 should be hidden from the host. After all, because the data stored on storage devices 130-1 through 130-6 may have been encoded in a manner unknown to the host, any attempt by the host to access blocks on storage devices 130-1 through 130-6 directly could be problematic.
To support the use of this virtual storage device, the PCIe switches 125 and/or 320 with lookaside erasure coding logic may inform the processor 110 of FIG. 1 of the capacity of the virtual storage device. For example, if storage devices 130-1 through 130-6 include five NVMe SSDs, each storing 1 Terabyte (TB) of data (for mathematical simplicity, 1 TB is considered to be 2^40 bytes rather than 10^12 bytes), and the erasure coding scheme implements a RAID5 array, the effective storage capacity of the virtual storage device is 4 TB. (Other embodiments using fewer or more storage devices for erasure coding, each storing less or more than 1 TB, may result in virtual storage devices having different capacities.) The PCIe switches 125 and/or 320 with lookaside erasure coding logic may report a total of 4 TB (or 2^42 bytes) of storage capacity to the processor 110. As further described below with reference to FIG. 5, the processor 110 of FIG. 1 may then write data to blocks in this virtual storage device, and the lookaside erasure coding logic may handle the actual storage of the data. For example, if the blocks on the NVMe SSDs are each 4 Kilobytes (KB), the processor 110 may request writes to logical blocks numbered 0 through 2^30 - 1.
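The arithmetic of this example can be verified with a short sketch (assuming, as above, 1 TB = 2^40 bytes and 4 KB blocks):

```python
TB = 2**40            # the document treats 1 TB as 2**40 bytes
BLOCK = 4 * 2**10     # 4 KB logical blocks

devices = 5                         # five 1 TB NVMe SSDs
data_devices = devices - 1          # RAID5 spends one device's worth on parity
virtual_bytes = data_devices * TB   # effective capacity of the virtual device
host_lbas = virtual_bytes // BLOCK  # number of host-visible logical blocks

print(virtual_bytes == 2**42)       # True: 4 TB
print(host_lbas == 2**30)           # True: host LBAs run 0 .. 2**30 - 1
```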
Alternatively, PCIe switches 125 and/or 320 having lookaside erasure coding logic may request a block of host memory addresses from the processor 110 shown in FIG. 1, which represents a method for communicating with virtual storage. When the processor 110 shown in FIG. 1 wants to read or write data, a transmission including the appropriate address within the host memory address block may be sent to the PCIe switches 125 and/or 320 with the lookaside erasure coding logic. This block of host memory addresses should be at least as large as the virtual storage implemented using the erasure coding scheme (and may be greater than the initial capacity of the virtual storage if it is expected that additional storage may be added to the erasure coding scheme during use).
FIG. 5 shows details of the PCIe switch 125 with lookaside erasure coding logic shown in FIG. 1. In FIG. 5, PCIe switch 125 with lookaside erasure coding logic may include various components, such as connectors 505, PCIe-to-PCIe stacks 510-1 through 510-6, PCIe switch core 515, and Power Processing Unit (PPU) 520. Connectors 505 enable PCIe switch 125 with lookaside erasure coding logic to communicate with various other components in machine 105 of FIG. 1, such as processor 110 of FIG. 1 and storage devices 130-1 through 130-6 of FIG. 3. One or more of the connectors 505 may be referred to as "external" connectors because they connect to upstream components (e.g., the processor 110 of FIG. 1); the remaining connectors 505 may be referred to as "internal" or "downstream" connectors because they connect to downstream devices (e.g., the storage devices 130-1 through 130-6 of FIG. 3). The PCIe-to-PCIe stacks 510-1 through 510-6 allow data exchange between PCIe devices. For example, storage device 130-1 of FIG. 3 may send data to storage device 130-3 of FIG. 3, or processor 110 of FIG. 1 may send a read or write request to one or more of storage devices 130-1 through 130-6 of FIG. 3. The PCIe-to-PCIe stacks 510-1 through 510-6 may include buffers to temporarily store data: for example, if the destination device for a particular transmission is currently busy, buffers in the PCIe-to-PCIe stacks 510-1 through 510-6 may hold the transmission until the destination device is idle. The PCIe switch core 515 operates to route data from one PCIe port to another. PPU 520 may act as a configuration hub, handling any configuration requests for PCIe switch 125 with lookaside erasure coding logic. Although FIG. 5 shows six PCIe-to-PCIe stacks 510-1 through 510-6, embodiments of the inventive concept may include any number of PCIe-to-PCIe stacks.
Before turning to the operation of the probing logic 525 and the erasure coding controller 530, it is helpful to understand that there are at least two different "addresses" for the data stored on the storage devices 130-1 through 130-6 of FIG. 3. On any storage device, data is written to a specific address associated with the hardware structure; this address can be considered a "physical" address. In the context of NVMe SSDs, the physical address is often referred to as the Physical Block Address (PBA).
Flash memory used in NVMe SSDs typically does not allow data to be rewritten in place. Instead, when data needs to be rewritten, the old data is invalidated and the new data is written to a new block elsewhere on the NVMe SSD. Thus, the PBA at which the data associated with a particular data structure (whether a file, an object, or any other data structure) is written may change over time.
In addition, there are other reasons for relocating data in flash memory. Data is typically erased from flash memory in larger units than are used when writing data to the flash memory. If valid data is stored elsewhere in the unit to be erased, this valid data must be written to another location in the flash memory before the unit can be erased. This erase process is commonly referred to as Garbage Collection, and the process of copying valid data out of the unit to be erased is referred to as programming. And Wear Leveling, a process that attempts to keep the wear on the cells in flash memory at roughly the same level, may also relocate data within the flash memory.
If the host had to be notified of the new storage location each time a particular data block moved, a significant burden would be placed on the host. Therefore, most flash memory devices inform the host of a Logical Block Address (LBA) at which data is stored and maintain a table (typically in a Flash Translation Layer (FTL)) that maps LBAs to PBAs. Then, rather than notifying the host of a new address, the flash memory can simply update the LBA-to-PBA mapping table in the FTL each time the data in question is moved to a new PBA. Thus, for each storage device, there may be both a PBA and an LBA associated with the data.
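A minimal sketch of such an LBA-to-PBA table follows; the names are illustrative only, and a real FTL must also track invalid blocks, free space, and wear:

```python
class FlashTranslationLayer:
    """Toy LBA-to-PBA map illustrating why relocation is invisible to the host."""

    def __init__(self):
        self.lba_to_pba = {}

    def write(self, lba, pba):
        self.lba_to_pba[lba] = pba     # old PBA (if any) is implicitly invalidated

    def relocate(self, lba, new_pba):
        # Garbage collection or wear leveling moved the data: only the
        # table changes; the host keeps using the same LBA.
        self.lba_to_pba[lba] = new_pba

    def resolve(self, lba):
        return self.lba_to_pba[lba]

ftl = FlashTranslationLayer()
ftl.write(7, 1024)
ftl.relocate(7, 4096)   # data moved internally; host still addresses LBA 7
print(ftl.resolve(7))   # 4096
```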
The concept of the virtual storage device presented by the lookaside erasure coding logic introduces a further level to this structure. Recall the example presented above with reference to FIG. 3, where the erasure coding scheme includes five 1 TB NVMe SSDs, each using blocks of size 4 KB. Each NVMe SSD may include LBAs numbered 0 through 2^28 - 1, but the virtual storage device presented to the host includes LBAs numbered 0 through 2^30 - 1.
Thus, the LBA range seen by the host may represent a combination of the LBA ranges of the various storage devices. To distinguish between the LBA range used by the host and the LBA ranges of the respective storage devices, the LBAs used by the host may be referred to as "host LBAs", "global LBAs", or "operating system (O/S)-aware LBAs", while the LBAs used by the storage devices may be referred to as "device LBAs", "local LBAs", or "post-RoC LBAs" (LBAs after RoC). The host LBA range may be divided among the various storage devices in any desired manner. For example, the host LBA range may be divided into contiguous blocks, with each such block allocated to a particular storage device. Under this scheme, host LBAs 0 through 2^28 - 1 may map to device LBAs 0 through 2^28 - 1 on storage device 130-1, host LBAs 2^28 through 2^29 - 1 may map to device LBAs 0 through 2^28 - 1 on storage device 130-2, and so on. Alternatively, particular bits in the host LBA may be used to determine the appropriate storage device and the device LBA at which to store the data: for example, the low-order bits in the host LBA may identify the device, and these bits may be stripped off to produce the device LBA used by the storage device. But no matter how host LBAs are mapped to device LBAs, there may be two, three, or possibly even more different addresses representing where the data is stored.
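The two divisions described above, contiguous ranges and low-order device-select bits, might be sketched as follows (assuming the 2^28-LBA devices of the running example and a power-of-two device count for the bit scheme; the helper names are hypothetical):

```python
LBAS_PER_DEVICE = 2**28   # per the 1 TB / 4 KB running example

def contiguous_map(host_lba):
    # One contiguous 2**28-LBA range per storage device.
    return host_lba // LBAS_PER_DEVICE, host_lba % LBAS_PER_DEVICE

def low_order_bits_map(host_lba, device_count=4):
    # Low-order bits select the device; stripping them yields the device LBA.
    bits = device_count.bit_length() - 1   # assumes a power-of-two device count
    return host_lba & (device_count - 1), host_lba >> bits

print(contiguous_map(2**28 + 5))     # (1, 5): second device, device LBA 5
print(low_order_bits_map(0b10110))   # (2, 5): device 2, device LBA 0b101
```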
Of course, the storage devices are not required to be homogeneous: they may be of different sizes and thus have different numbers of LBAs; they may even be of different device types, for example, mixing SSDs with hard disk drives.
Note that for simplicity of explanation, the term "device LBA" may be used even if the address provided to the storage device is not a logical block address (e.g., for a hard disk drive). If the "device LBA" is the actual address on the storage device where the data is stored, the storage device need not map the device LBA to a different address before accessing the data.
Returning now to FIG. 5, the probing logic 525 and erasure coding controller 530 act as the lookaside erasure coding logic of the PCIe switch 125 with lookaside erasure coding logic. The probing logic 525 may "probe" transmissions (e.g., by intercepting requests before they are delivered to their destination) and determine the appropriate destination using capture interfaces 535-1 through 535-6, which may pass transmissions to the probing logic 525 via multiplexer 540. As discussed above, the processor 110 only "sees" a virtual storage device of a given capacity (or a block of host memory addresses of a particular size) and issues commands to read or write data based on host LBAs (associated with the virtual storage device). The probing logic 525 may convert these host LBAs to device LBAs on one or more particular physical storage devices and change the transmission accordingly to direct the request. The probing logic 525 may manage this translation in any desired manner. For example, the probing logic 525 may include a table that maps a first range of host LBAs to storage device 130-1 of FIG. 3, a second range of host LBAs to storage device 130-2 of FIG. 3, and so on, where the device LBAs depend on factors relevant to how the lookaside erasure coding logic operates: the erasure coding scheme itself (e.g., the RAID level), the stripe size, the number of storage devices, and so forth. Alternatively, the probing logic 525 may use particular bits in the host LBA to decide which of the storage devices 130-1 through 130-6 of FIG. 3 stores the data in question: for example, if the array includes only two storage devices, the probing logic 525 may use the low-order bit (or some other bit in the logical block address) to determine whether data is to be written to the first storage device or the second storage device. (For example, since FIG. 3 shows a total of 24 storage devices 130-1 through 130-6, storage devices 130-1 through 130-6 may use bit values 00000 through 10111; bit values 11000 through 11111 should be avoided.) Embodiments of the inventive concept may use any other desired method to map a logical block address received from the host to a block address on the appropriate storage device.
As an example, consider the processor 110 of FIG. 1 sending a write request with enough data (after the erasure coding arithmetic) to fill an entire stripe across all of storage devices 130-1 through 130-6. The probing logic 525 may divide the data into separate logical units and, as discussed below, the erasure coding controller 530 may supplement or modify the data. The probing logic 525 may then generate a transmission carrying the appropriate data destined for each of storage devices 130-1 through 130-6.
Note that when the probing logic 525 replaces the original host LBA with a device LBA appropriate for the storage device in question, this device LBA need not be a physical block address. In other words, the device LBA used by the probing logic may itself be another logical block address. This arrangement enables the physical storage device to continue to manage its own data storage as appropriate. For example, if the physical storage device is an NVMe SSD, the SSD may move data around to perform garbage collection or wear leveling, using its flash translation layer to manage the association of the provided device LBA with the PBA on one of the NAND flash chips. Such operations may occur without the knowledge of the probing logic 525. On the other hand, if the storage device in question does not relocate data, the device LBA provided by the probing logic 525 may simply be a physical address on the storage device.
As described above, erasure coding controller 530 may implement the erasure coding scheme. Depending on the erasure coding scheme, erasure coding controller 530 may simply generate the appropriate parity data (e.g., when using a RAID5 or RAID6 erasure coding scheme) while leaving the original data (as provided by the processor 110 of FIG. 1) unchanged. However, in some embodiments of the inventive concept, erasure coding controller 530 may also modify the original data. For example, erasure coding controller 530 may apply error correction codes to the original data so that the blocks stored on the respective storage devices 130-1 through 130-6 of FIG. 3 may be read correctly even in the event of errors. Alternatively, erasure coding controller 530 may encrypt the data written to the storage devices 130-1 through 130-6 of FIG. 3, making that data unreadable without the encryption key (and worse, were the processor 110 of FIG. 1 to write data directly, the erasure coding controller 530 might consider the data on storage devices 130-1 through 130-6 corrupted). Alternatively, erasure coding controller 530 may introduce parity information (or similar types of information) into the data written to each of storage devices 130-1 through 130-6 of FIG. 3. The particular operations performed on the data by erasure coding controller 530 depend on the erasure coding scheme used.
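For the RAID5 case named above, the parity the erasure coding controller 530 generates is a byte-wise XOR across the data blocks of a stripe. A minimal sketch:

```python
def xor_parity(data_blocks):
    # RAID5-style parity block: byte-wise XOR across a stripe's data blocks.
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

stripe = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
print(xor_parity(stripe).hex())  # "152a": 01^04^10 = 15, 02^08^20 = 2a
```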
The probing logic 525 and the erasure coding controller 530 may be implemented in any desired manner. For example, they may be implemented using a processor with suitable software stored thereon. But since PCIe switches are typically implemented as hardware circuitry (which is generally faster than software running on a processor, and a device such as a PCIe switch typically need not implement a great deal of functionality), the probing logic 525 and erasure coding controller 530 may instead be implemented using suitable circuitry. Such circuitry may comprise an FPGA, an ASIC, or any other desired hardware implementation, programmed in a suitable manner.
In the most basic embodiment, the lookaside erasure coding logic can be implemented using only the probing logic 525 and the erasure coding controller 530. But including cache 545 and/or write buffer 550 in the lookaside erasure coding logic may provide significant benefits.
The cache 545 may store a subset of the data stored in the virtual storage device. Generally, the capacity of cache 545 is less than that of the total virtual storage device, but accesses to it are faster. Thus, by storing some data in cache 545, cache hits may give the virtual storage device faster performance than accessing the data from the underlying physical storage devices. For example, cache 545 may store the data most recently accessed from the virtual storage device, using any desired algorithm (e.g., a Least Recently Used or Least Frequently Used algorithm) to identify the data to be replaced as it ages. The cache 545 may be implemented using any desired memory structure, such as DRAM, SRAM, MRAM, or any other desired memory structure. Cache 545 may even be implemented with memory structures faster than conventional memory, such as those used in an L1 or L2 cache in a processor. Finally, although cache 545 is shown as part of PCIe switch 125 with lookaside erasure coding logic, cache 545 may also be stored in memory 115 of FIG. 1 and accessed from memory 115 by PCIe switch 125 with lookaside erasure coding logic.
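As one illustrative possibility (not mandated by any embodiment), a Least Recently Used policy for cache 545, keyed here by host LBA, might be sketched as:

```python
from collections import OrderedDict

class LruCache:
    # Toy LRU eviction, one of the policies cache 545 might use.
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, lba):
        if lba not in self.entries:
            return None                      # miss: read from the storage devices
        self.entries.move_to_end(lba)        # mark as most recently used
        return self.entries[lba]

    def put(self, lba, data):
        self.entries[lba] = data
        self.entries.move_to_end(lba)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False) # evict the least recently used entry

cache = LruCache(capacity=2)
cache.put(0, b"a"); cache.put(1, b"b")
cache.get(0)                 # LBA 0 becomes most recently used
cache.put(2, b"c")           # evicts LBA 1
print(cache.get(1))          # None
```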
Write buffer 550 provides a mechanism to speed up write requests. A write to a virtual storage device that uses erasure coding to span multiple physical storage devices may take longer than a similar write request to a single physical storage device. Performing the write may involve reading data from other storage devices in the same stripe, after which the new data may be merged and the merged data written back to the appropriate storage devices. Performing the merge may also involve calculating parity information or other code information. And the write request may be further delayed if the underlying physical storage devices are busy performing other operations, such as processing read requests. It may be undesirable to delay software running on the processor 110 of FIG. 1 while waiting for the write request to complete. Thus, rather than blocking software running on the processor 110 of FIG. 1, the write buffer 550 may temporarily store the data until the write to the underlying physical storage devices completes, while the probing logic 525 informs the software running on the processor 110 of FIG. 1 that the write request is complete. This approach is similar to a write-back cache policy (as opposed to a write-through cache policy, in which the write operation is completed before the software running on the processor 110 is notified that the write has completed). Like cache 545, write buffer 550 may be implemented using any desired memory structure, such as DRAM, SRAM, MRAM, or L1 or L2 cache structures, among other possibilities.
As part of performing a write operation, the lookaside erasure coding logic may check whether any of the data needed to complete the write operation is currently located in cache 545. For example, when the processor 110 of FIG. 1 sends a write request to the virtual storage device, the erasure coding scheme may require reading the entire stripe to calculate parity information or other code information. If some (or all) of this data resides in cache 545, the data may be accessed from cache 545 rather than read from the underlying physical storage devices. Additionally, the cache policy may suggest that data being written should also be cached in cache 545, in case the data is requested again in the near future.
Although FIG. 5 shows cache 545 and write buffer 550 as separate elements, embodiments of the inventive concept may combine the two into a single element (which may be referred to simply as a "cache"). In such embodiments of the inventive concept, the cache may include a bit indicating whether the data stored in an entry is "clean" or "dirty". "Clean" data has only been read, not modified, since it was last written to the underlying physical storage devices; "dirty" data has been modified since it was last written to the underlying physical storage devices. If the cache includes "dirty" data, the lookaside erasure coding logic may need to write that data back to the underlying storage devices when it is removed from the cache according to the cache policy. Additionally, embodiments of the inventive concept may include cache 545, write buffer 550, both (separate or combined into a single element), or neither.
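A sketch of such a combined cache, with a per-entry dirty bit and deferred write-back, might look like the following (all names hypothetical; eviction policy omitted for brevity):

```python
class CombinedCache:
    # Toy combined cache/write buffer with per-entry clean/dirty bits.
    def __init__(self, read_from_devices, write_to_devices):
        self.entries = {}                        # lba -> (data, dirty)
        self.read_from_devices = read_from_devices
        self.write_to_devices = write_to_devices

    def read(self, lba):
        if lba in self.entries:
            return self.entries[lba][0]
        data = self.read_from_devices(lba)
        self.entries[lba] = (data, False)        # clean: matches the devices
        return data

    def write(self, lba, data):
        # Acknowledge immediately; the devices are updated on flush/eviction.
        self.entries[lba] = (data, True)         # dirty

    def evict(self, lba):
        data, dirty = self.entries.pop(lba)
        if dirty:
            self.write_to_devices(lba, data)     # write back before discarding

backing = {}
cache = CombinedCache(backing.get, backing.__setitem__)
cache.write(3, b"x")
cache.evict(3)
print(backing[3])  # b'x'
```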
As discussed above, the lookaside erasure coding logic in the PCIe switch 125 with lookaside erasure coding logic may "create" a virtual storage device from the underlying physical storage devices, and it would be problematic if the processor 110 of FIG. 1 gained direct access to the physical storage devices 130-1 through 130-6 of FIG. 3. Thus, when the machine 105 of FIG. 1 initially starts (i.e., boots or powers up) and attempts to enumerate the various accessible PCIe devices, the PCIe switch 125 with lookaside erasure coding logic may determine that the lookaside erasure coding logic is to be used with its attached storage devices. In that case, the PCIe switch 125 with lookaside erasure coding logic should prevent enumeration of any PCIe devices downstream of it. By preventing such enumeration, the PCIe switch 125 with lookaside erasure coding logic can "create" the virtual storage device without concern that the processor 110 of FIG. 1 might directly access data on the storage devices 130-1 through 130-6 of FIG. 3 (which could corrupt data used in the erasure coding scheme). However, as discussed below with reference to FIGS. 9-10, there may be scenarios in which the PCIe switch 125 with lookaside erasure coding logic should allow downstream enumeration of PCIe devices.
The probing logic 525 may also pass configuration commands to the PPU 520. In this way, the probing logic 525 may also operate as a PCIe-to-PCIe stack for the purpose of connecting the PCIe switch core 515 with the PPU 520.
Finally, the probing logic 525 may receive an erasure coding enable signal 555 from the processor 110 of FIG. 1 (possibly through a pin on the PCIe switch 125 with lookaside erasure coding logic). The erasure coding enable signal 555 may be used to enable the erasure coding logic in the PCIe switch 125 with lookaside erasure coding logic.
FIG. 6 shows details of a PCIe switch with perspective erasure coding logic according to another embodiment of the inventive concept. As can be seen by comparing FIGS. 5 and 6, the primary difference between the PCIe switch 125 with lookaside erasure coding logic of FIG. 5 and the PCIe switch 605 with perspective erasure coding logic of FIG. 6 is where the erasure coding logic is placed. In the PCIe switch 125 with lookaside erasure coding logic of FIG. 5, the erasure coding logic sits to the "side" of the PCIe switch, while in the PCIe switch 605 with perspective erasure coding logic of FIG. 6, the erasure coding logic is "inline" with the PCIe switch.
There are technical advantages and disadvantages to using lookaside erasure coding logic as compared with perspective erasure coding logic. The lookaside erasure coding logic of FIG. 5 is the more complex implementation, because the probing logic 525 is required to intercept and manage the redirection of data from the host. In contrast, the perspective erasure coding logic of FIG. 6 is easier to implement, because all data between the host and the storage devices 130-1 through 130-6 of FIG. 3 passes through the erasure coding controller 530. On the other hand, the lookaside erasure coding logic introduces no additional latency to the operation of PCIe switch 125 when the erasure coding logic is disabled, whereas the perspective erasure coding logic of FIG. 6 can act as a PCIe endpoint and may buffer data between the host and the storage devices 130-1 through 130-6 of FIG. 3, which may increase communication latency. In the perspective erasure coding logic of FIG. 6, the erasure coding controller 530 can further include elements such as a Frame Buffer, a Route Table, Port Arbitration logic, and a Scheduler (not shown in FIG. 6): elements typically included within the PCIe switch core 515.
In addition, PCIe switches typically use the same number of ports for upstream (toward the host) traffic as for downstream (toward storage devices and other connected devices) traffic. For example, if PCIe switch 605 includes a total of 96 ports, typically 48 are used for upstream traffic and 48 for downstream traffic. However, with the perspective erasure coding logic of FIG. 6 enabled, erasure coding controller 530 may virtualize all downstream devices, in which case only 16, or possibly 32, upstream ports are typically required to communicate with the host. If PCIe switch 605 includes more ports than this, the additional ports may be used to connect additional downstream devices, which may increase the capacity of the virtual storage device. To this end, the erasure coding controller 530 of FIG. 6 may use a non-transparent bridge (NTB) port to communicate with the host.
FIG. 6 shows the perspective erasure coding logic included within PCIe switch 605, but embodiments of the inventive concept may separate the perspective erasure coding logic from the PCIe switch 605. For example, the perspective erasure coding logic may be implemented as a component separate from the PCIe switch 605, using an FPGA or ASIC.
However, although there are implementation and technical differences between the lookaside erasure coding logic of FIG. 5 and the perspective erasure coding logic of FIG. 6, functionally both achieve similar results. Thus, the lookaside erasure coding logic of FIG. 5 and the perspective erasure coding logic of FIG. 6 may be interchanged as desired. Any reference to lookaside erasure coding logic in this document is intended to also encompass perspective erasure coding logic.
FIGS. 7-10 show various topologies for using the PCIe switch 125 with lookaside erasure coding logic of FIG. 1. But regardless of the topology in use, the operation of the PCIe switch 125 with lookaside erasure coding logic of FIG. 1 is the same: to provide connectivity to various additional storage devices and to support erasure coding across those storage devices.
FIG. 7 illustrates a first topology using the PCIe switch 125 with lookaside erasure coding logic shown in FIG. 1 according to one embodiment of the present inventive concept. In FIG. 7, a PCIe switch 125 having lookaside erasure coding logic is shown, which may be implemented as a separate component of the machine 105 shown in FIG. 1. That is, the PCIe switch 125 having the lookaside erasure coding logic may be manufactured and sold separately from any other components, such as the processor 110 or the storage device 130 shown in FIG. 1.
A PCIe switch 125 with lookaside erasure coding logic may be connected to the storage device 130. In FIG. 7, PCIe switch 125 with lookaside erasure coding logic is shown connected to only a single storage device, which may not support erasure coding: erasure coding requires at least two storage devices (or at least two portions of a storage device) across which to perform striping, chunking, grouping, and the use of parity or code information. But even with a single storage device, PCIe switch 125 with lookaside erasure coding logic may provide some advantages. For example, PCIe switch 125 with lookaside erasure coding logic may support the use of error correction codes with storage device 130, or encrypt data stored on storage device 130, if storage device 130 does not provide these services natively.
Storage device 130 may also be connected to the FPGA 705. The FPGA 705 can support acceleration. In short, there may be scenarios where data needs to be processed and then discarded. Loading all such data into the processor 110 of FIG. 1 to perform the processing can be expensive and time consuming: calculations can be carried out more easily at locations closer to the data. The FPGA 705 can support such computations being performed closer to storage, eliminating the need to load the data into the processor 110 of FIG. 1 to perform the calculations: this concept is referred to as "acceleration". FPGA-based acceleration is discussed further in U.S. patent application No. 16/122,865, filed September 5, 2018, which claims the benefit of U.S. provisional patent application No. 62/642,568, filed March 13, 2018, U.S. provisional patent application No. 62/641,267, filed March 9, 2018, and U.S. provisional patent application No. 62/638,904, filed March 5, 2018 (all of which are incorporated herein by reference), and in U.S. patent application No. 16/124,179, filed September 6, 2018, and U.S. patent application No. 16/124,182, filed September 6, 2018 (both of which are continuations of U.S. patent application No. 16/122,865, filed September 5, 2018, and are incorporated herein by reference). Since the purpose of acceleration is to process data without transferring the data to the processor 110 of FIG. 1, FIG. 7 shows the FPGA 705 closer to the storage device 130. Note, however, that the particular arrangement shown in FIG. 7 is not required: the FPGA 705 may instead be located between the PCIe switch 125 with lookaside erasure coding logic and the storage device 130.
In addition to data acceleration, the FPGA 705 may provide other functions to support the storage device 130. For example, the FPGA 705 may implement deduplication functionality for the storage device 130, in an attempt to reduce the number of times the same data is stored on the storage device 130. The FPGA 705 can determine whether particular data is stored more than once on the storage device 130, establish associations between the various logical block addresses (or other information used by the host to identify the data) and where the data is stored on the storage device 130, and delete the additional copies.
Alternatively, the FPGA 705 may implement data integrity functions on the storage device 130, such as adding error correction codes to prevent data loss due to errors in the operation of the storage device 130, or end-to-end protection using the T10 DIF (Data Integrity Field) with a Cyclic Redundancy Check (CRC). In this way, the FPGA 705 may be able to detect when data on the storage device 130, or data in transit, is written or read erroneously, and recover the original data. Note that the FPGA 705 can implement data integrity functions without the host being aware that such functions are being provided: the host may see only the data itself and none of the error correction codes.
Alternatively, the FPGA 705 can implement data encryption functions on the storage device 130 to prevent unauthorized parties from accessing data on the storage device 130: without the appropriate encryption key, data returned from the FPGA 705 may be meaningless to the requestor. The host may provide the encryption key to be used when writing and reading data. Alternatively, the FPGA 705 can perform data encryption and decryption automatically: the FPGA 705 can store encryption keys (and can even generate encryption keys on behalf of a host) and determine the appropriate encryption key to use based on who requests the data.
Alternatively, the FPGA 705 can implement data compression functions on the storage device 130 to reduce the amount of space required to store data on the storage device 130. When writing data to the storage device 130, the FPGA 705 can compress the data provided by the host into a smaller amount of storage and then store the compressed data (along with any information needed to recover the original data when it is read from the storage device 130). When reading data from the storage device 130, the FPGA 705 can read the compressed data (and any information needed to recover the original data from the compressed data) and undo the compression to recover the original data.
Any desired implementation of deduplication, data integrity, data encryption, and data compression may be used. Embodiments of the inventive concept are not limited to a particular implementation of any of these functions.
The FPGA 705 can also implement any desired combination of functions on the storage device 130. For example, the FPGA 705 may implement both data compression and data integrity (since data compression may increase the sensitivity of the data to errors: a single error in data stored on the storage device 130 may render a large amount of data unavailable). Or the FPGA 705 can implement both data encryption and data compression (to protect the data while using as little storage for it as possible). Other combinations of two or more functions may also be provided by the FPGA 705.
In terms of overall operation, the FPGA 705 can read data from an appropriate source when performing any of these functions. Note that although the term "source" is a singular noun, embodiments of the inventive concept may read data from multiple sources (e.g., multiple storage devices) where appropriate. The FPGA 705 can then perform the appropriate operations on the data: data acceleration, data integrity, data encryption, and/or data compression. The FPGA 705 can then take the appropriate action with the result of the operation: for example, sending the result to the host 105 of FIG. 1 or writing the data to the storage device 130.
Although the above functions are described with reference to the FPGA 705 of FIG. 7, embodiments of the inventive concept can include these functions anywhere in a system that includes an FPGA. Furthermore, embodiments of the inventive concept may allow the FPGA 705 to access data from "remote" storage devices. For example, returning briefly to FIG. 3, assume that storage device 130-1 includes an FPGA similar to FPGA 705, while storage device 130-2 lacks such an FPGA. The FPGA included in storage device 130-1 may apply its functionality to storage device 130-2 by sending requests to storage device 130-2. For example, if the FPGA in storage device 130-1 provides data acceleration, the FPGA in storage device 130-1 can send a request to read data from storage device 130-2, perform the appropriate acceleration, and then send the results to the appropriate destination (e.g., the host 105 of FIG. 1).
In FIG. 7 (and in the topologies of FIGS. 8-10 below), PCIe switch 125 with lookaside erasure coding logic may be attached to devices that are not eligible for erasure coding. For example, the PCIe switch 125 with lookaside erasure coding logic may be attached to storage devices with built-in erasure coding functionality, or to devices that are not storage devices at all, such as the FPGA 705 of FIG. 7 or a Graphics Processing Unit (GPU). All such devices may be described as devices that are not eligible for erasure coding (or at least not eligible for the erasure coding provided by the PCIe switch 125 with lookaside erasure coding logic).
When a PCIe switch 125 with lookaside erasure coding logic is connected to a device that is not eligible for erasure coding, the system has various alternatives available. In one embodiment of the inventive concept, the presence of any device not eligible for erasure coding may cause the lookaside erasure coding logic of the PCIe switch 125 with lookaside erasure coding logic to be disabled. Thus, for example, if a PCIe switch 125 with lookaside erasure coding logic were connected to the FPGA 705 of FIG. 7, a GPU, or a storage device with native erasure coding logic, then none of the storage devices connected to the PCIe switch 125 with lookaside erasure coding logic could be used with erasure coding. Note that the decision to disable the lookaside erasure coding logic of one PCIe switch 125 with lookaside erasure coding logic does not necessarily extend to other PCIe switches with lookaside erasure coding logic in the same chassis or in other chassis. For example, FIG. 3 shows two PCIe switches 125 and 320 with lookaside erasure coding logic; one PCIe switch may enable its lookaside erasure coding logic while the other disables it.
Another embodiment of the inventive concept may disable devices that are not eligible for erasure coding, as if they were not connected to the PCIe switch 125 with lookaside erasure coding logic at all. In this embodiment of the inventive concept, the PCIe switch 125 with lookaside erasure coding logic may enable the lookaside erasure coding logic for storage device 130 and may disable any device not eligible for erasure coding, as if it were not connected to the PCIe switch 125 with lookaside erasure coding logic.
In yet another embodiment of the inventive concept, the PCIe switch 125 with lookaside erasure coding logic may enable the lookaside erasure coding logic for the storage devices that it covers, while still allowing other devices that are not eligible for erasure coding to be accessed. This embodiment of the inventive concept is the most complex to implement: the PCIe switch 125 with lookaside erasure coding logic needs to determine which devices are eligible for erasure coding and which are not, then analyze each transmission to determine whether its destination is the virtual storage device (in which case the transmission is intercepted by the lookaside erasure coding logic) or not (in which case the transmission is delivered to its original destination).
In embodiments of the inventive concept in which machine 105 does not ultimately provide the full functionality of the installed devices (i.e., embodiments in which erasure coding is disabled due to the presence of a device not eligible for erasure coding, or in which such a device is disabled by the PCIe switch 125 with lookaside erasure coding logic), machine 105 may notify the user of this fact. This notification may be provided by the processor 110 of FIG. 1, the BMC 325 of FIG. 3, or the PCIe switch 125 with lookaside erasure coding logic. In addition to informing the user that some functionality has been disabled, the notification may also tell the user how to reconfigure machine 105 to enable the additional functionality. For example, the notification may suggest that devices not eligible for erasure coding be connected to particular slots in midplane 305 of FIG. 3 (perhaps those slots connected to PCIe switch 320 with lookaside erasure coding logic) and that storage devices eligible for erasure coding be connected to other slots (e.g., those connected to PCIe switch 125 with lookaside erasure coding logic). In this manner, at least some of the storage devices eligible for erasure coding can benefit from an erasure coding scheme without access to the other devices being blocked.
FIG. 8 shows a second topology using the PCIe switch 125 with lookaside erasure coding logic of FIG. 1, according to another embodiment of the inventive concept. In FIG. 8, PCIe switch 125 with lookaside erasure coding logic may be located within FPGA 705: that is, the FPGA 705 may also implement the PCIe switch 125 with lookaside erasure coding logic. The FPGA 705 and PCIe switch 125 with lookaside erasure coding logic may then be connected to the storage devices 130-1 through 130-4. Although FIG. 8 shows the FPGA 705 and PCIe switch 125 with lookaside erasure coding logic connected to four storage devices 130-1 through 130-4, embodiments of the inventive concept may include any number of storage devices.
In general, the topology of FIG. 8 may be implemented within a single shell or enclosure containing all of the components shown (and SSDs 130-1 through 130-4 may be bare flash memory rather than self-contained SSDs). That is, the entire structure of FIG. 8 may be sold as a single unit rather than as separate components. Embodiments of the inventive concept may also include a riser card connected at one end to the machine 105 of FIG. 1 (possibly to the midplane 305 of FIG. 3) and having connectors (e.g., U.2, M.3, or SFF-TA-1008 connectors) at the other end for connecting to the storage devices 130-1 through 130-4. And although FIG. 8 shows PCIe switch 125 with lookaside erasure coding logic as part of FPGA 705, PCIe switch 125 with lookaside erasure coding logic may also be implemented as part of an intelligent SSD.
FIG. 9 shows a third topology using the PCIe switch 125 with lookaside erasure coding logic of FIG. 1, according to yet another embodiment of the inventive concept. In FIG. 9, two PCIe switches 125 and 320 with lookaside erasure coding logic are shown, with up to 24 storage devices 130-1 through 130-6 connected between them. As described above with reference to FIG. 3, each PCIe switch 125 and 320 with lookaside erasure coding logic may include 96 PCIe lanes, with four PCIe lanes used in each direction to communicate with one of the storage devices 130-1 through 130-6: each of PCIe switches 125 and 320 with lookaside erasure coding logic may then support up to 12 storage devices. To support erasure coding across storage devices attached to multiple PCIe switches 125 and 320 with lookaside erasure coding logic, one PCIe switch with lookaside erasure coding logic may be designated as responsible for erasure coding across all the devices, and its lookaside erasure coding logic may be enabled. The other PCIe switch 320 with lookaside erasure coding logic may operate purely as a PCIe switch, with its lookaside erasure coding logic disabled. Which PCIe switch handles erasure coding may be selected in any desired manner: for example, the two PCIe switches may negotiate this between themselves, or the PCIe switch that is enumerated first may be designated to handle erasure coding. The PCIe switch selected to handle erasure coding may then report the virtual storage device (spanning both PCIe switches), while the PCIe switch not handling erasure coding may report no downstream devices (to prevent the processor 110 of FIG. 1 from attempting to directly access a storage device that is part of the erasure coding scheme).
Note that while the PCIe switches 125 and 320 with lookaside erasure coding logic may both be located in the same chassis, they may also be located in different chassis. That is, the erasure coding scheme may span storage devices across multiple chassis. All that is required is that the PCIe switches in the various chassis be able to negotiate with each other which storage devices are to be part of the erasure coding scheme. Embodiments of the inventive concept are also not limited to two PCIe switches 125 and 320 with lookaside erasure coding logic: the storage devices included in the erasure coding scheme may be connected to any number of PCIe switches with lookaside erasure coding logic.
Host LBAs may be split across the PCIe switches 125 and 320 with lookaside erasure coding logic in any desired manner. For example, the least significant bit in a host LBA may be used to identify which PCIe switch 125 or 320 with lookaside erasure coding logic includes the storage device storing the data at that host LBA. With more than two PCIe switches with lookaside erasure coding logic, multiple bits may be used to determine which PCIe switch with lookaside erasure coding logic manages the storage device storing the data. Once the appropriate PCIe switch with lookaside erasure coding logic has been identified (and the probing logic 525 of FIG. 5 has modified the transmission), the transmission may be routed to that PCIe switch with lookaside erasure coding logic (assuming the destination of the transmission is not a storage device connected to the PCIe switch whose lookaside erasure coding logic is enabled).
In another embodiment of the inventive concept, instead of a single PCIe switch with lookaside erasure coding logic being responsible for virtualizing all the storage devices connected to both PCIe switches with lookaside erasure coding logic, each PCIe switch with lookaside erasure coding logic may create a separate virtual storage device (with a separate erasure coding domain). In this way, different, but smaller, erasure coding domains can be created for different customers.
FIG. 9 may also represent another embodiment of the inventive concept. Although FIG. 9 implies that only storage devices 130-1 through 130-6 are connected to the PCIe switches 125 and 320 with lookaside erasure coding logic and that all of storage devices 130-1 through 130-6 can be used with an erasure coding scheme, as discussed above, embodiments of the inventive concept are not so limited: the PCIe switches 125 and 320 with lookaside erasure coding logic may have devices connected to them that are not eligible for erasure coding. Such devices may be grouped under one PCIe switch with lookaside erasure coding logic, while the storage devices eligible for erasure coding are grouped under a different PCIe switch 125 with lookaside erasure coding logic. In this manner, the machine 105 of FIG. 1 may achieve the best overall functionality, with the lookaside erasure coding logic enabled in one (or some) of the PCIe switches and disabled in the other(s).
FIG. 10 shows a fourth topology using the PCIe switch 125 with lookaside erasure coding logic of FIG. 1, according to yet another embodiment of the inventive concept. In FIG. 10, in contrast to FIG. 9, PCIe switches 125, 320, and 1005 with lookaside erasure coding logic may be structured in a hierarchy. At the top of the hierarchy, PCIe switch 125 with lookaside erasure coding logic may manage erasure coding for all the storage devices in the hierarchy below it, and thus may enable its lookaside erasure coding logic. The PCIe switches 320 and 1005 with lookaside erasure coding logic, on the other hand, may disable their lookaside erasure coding logic (since their storage devices are managed by the lookaside erasure coding logic of PCIe switch 125 with lookaside erasure coding logic).
Although fig. 10 illustrates three PCIe switches 125, 320, and 1005 with lookaside erasure coding logic configured as a two-tier hierarchy, embodiments of the inventive concept are not limited in the number of PCIe switches included or their hierarchical arrangement. Thus, embodiments of the inventive concept may support any number of PCIe switches with lookaside erasure coding logic arranged in any desired hierarchy.
The embodiments of the inventive concept described above with reference to FIGS. 1-10 are directed to single-port storage devices. Embodiments of the inventive concept may be extended to dual-port storage devices, where one (or more) storage devices communicate with multiple PCIe switches with lookaside erasure coding logic. In such an embodiment of the inventive concept, if the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 is unable to communicate with the dual-port storage device, the PCIe switch 125 with lookaside erasure coding logic may send the transmission to the PCIe switch 320 with lookaside erasure coding logic, which attempts to communicate with the storage device. The PCIe switch 320 with lookaside erasure coding logic effectively acts as a bridge connecting the PCIe switch 125 with lookaside erasure coding logic to the storage device.
Embodiments of the inventive concept may also support detecting and handling storage device failures. For example, consider again FIG. 4 and assume that storage device 130-1 is malfunctioning. Storage device 130-1 may fail for any number of reasons: a power surge may have damaged its electronics, wiring (either within storage device 130-1 or in the connection between storage device 130-1 and the PCIe switch 125 with lookaside erasure coding logic) may have failed, storage device 130-1 may have detected too many errors and shut itself down, or storage device 130-1 may have failed for other reasons. Storage device 130-1 may also have been removed from its slot by the user (possibly to replace it with a newer, more reliable, or larger storage device). Whatever the reason, storage device 130-1 may become unavailable.
PCIe switch 125 with lookaside erasure coding logic may detect the failure of storage device 130-1 through the presence pins on the connector of storage device 130-1. If storage device 130-1 is removed from the chassis, or if storage device 130-1 has shut down, it may no longer assert its presence through the presence pins on the connector, which may trigger an interrupt in the PCIe switch 125 with lookaside erasure coding logic. Alternatively, the PCIe switch 125 with lookaside erasure coding logic (or the BMC 325 of FIG. 3) may send an occasional message to storage device 130-1 to check whether it is still active (a process sometimes referred to as a "heartbeat"): if storage device 130-1 does not respond to such a message, the PCIe switch 125 with lookaside erasure coding logic or the BMC 325 of FIG. 3 may conclude that storage device 130-1 has failed.
If (and when) storage device 130-1 fails, the PCIe switch 125 with lookaside erasure coding logic may manage the situation by using other means to access any data that would normally be requested from storage device 130-1. For example, if a mirror of storage device 130-1 exists, the PCIe switch 125 with lookaside erasure coding logic may request the data from the mirror of storage device 130-1. Alternatively, the PCIe switch 125 with lookaside erasure coding logic may request the rest of the stripe containing the desired data from the other storage devices in the array and use the erasure coding information to reconstruct the data from storage device 130-1. Other mechanisms may also exist by which the PCIe switch 125 with lookaside erasure coding logic may access data stored on the failed storage device 130-1.
Embodiments of the inventive concept may also support detecting and handling the insertion of new storage devices into the array. As with detecting a storage device failure, the PCIe switch 125 with lookaside erasure coding logic (or the BMC 325 of FIG. 3) may detect the insertion of a new storage device using the presence pins on the connector, by occasionally pinging the device to see what is connected, or by any other desired mechanism (and as with detecting a failed storage device, using the presence pins to detect a new storage device may trigger an interrupt in the PCIe switch 125 with lookaside erasure coding logic). When a new storage device is detected, the PCIe switch 125 with lookaside erasure coding logic may add this new storage device to the array. Adding a new storage device to the array does not necessarily involve changing the erasure coding scheme: such a change could require changing all of the data stored on the storage devices. (For example, consider a change from RAID5 to RAID6: each stripe would now require two parity blocks, which would need to rotate among the storage devices, requiring large amounts of data to be computed and moved.) But adding a new storage device to an existing erasure coding scheme may not require moving much data around. Thus, although adding a new storage device may not improve the array's tolerance of storage device failures, it may still increase the capacity of the virtual storage device.
If a failed storage device already exists in the array, the insertion of the new storage device may be used to rebuild the failed storage device. The erasure coding controller 530 of FIG. 5 can compute the data that was stored on the failed storage device and store this data at the appropriate block addresses on the replacement storage device. For example, the original data on the failed storage device may be computed from the data on the other storage devices (both the original data and the parity or code information), and parity or code information stored on the failed storage device may be recalculated from the original data on the other storage devices. (Of course, if a mirror of the failed storage device exists, the erasure coding controller 530 of FIG. 5 may simply have the data copied from the mirror onto the replacement storage device.)
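For XOR-based parity (the RAID5 case above), the rebuild computation is the same XOR used to generate the parity: XORing every surviving block of a stripe, data and parity alike, yields the lost block. A minimal sketch:

```python
def rebuild_lost_block(surviving_blocks):
    # XOR of all surviving blocks of the stripe (data plus parity)
    # reconstructs the block that was on the failed storage device.
    lost = bytearray(len(surviving_blocks[0]))
    for block in surviving_blocks:
        for i, byte in enumerate(block):
            lost[i] ^= byte
    return bytes(lost)

d0, d1 = b"\x0f\xf0", b"\x33\xcc"
parity = bytes(a ^ b for a, b in zip(d0, d1))       # stripe parity
print(rebuild_lost_block([d1, parity]) == d0)       # True: d0 recovered
```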
Rebuilding a failed storage device can be a time-consuming process. In some embodiments of the inventive concept, the rebuild may be performed as soon as the replacement storage device is installed. In other embodiments of the inventive concept, the erasure coding controller 530 of FIG. 5 may rebuild the storage device during idle time periods, to the extent that the rebuild can be carried out during idle time. If the virtual storage device is busy, the erasure coding controller 530 of FIG. 5 can defer rebuilding the replacement storage device until idle time occurs, reconstructing data from the failed storage device on demand in response to requests from the processor 110 of FIG. 1. (Of course, such reconstructed data may be written to the replacement storage device without waiting for the complete rebuild, so that this data need not be recalculated again later.)
Embodiments of the inventive concept may also support initialization of the storage device. When a new storage device is added to the array (either as a replacement storage device for a failed storage device or to increase the capacity of the virtual storage device), the new storage device may be initialized. Initialization may include preparing the storage device for an erasure coding scheme.
Initialization of the new storage device may also involve erasing existing data from the new storage device. For example, consider a scenario in which a particular storage device is leased to a customer. The lease for this customer has ended, and the storage device can be reused for a new customer. But the storage device may still have the original customer's data stored on it. To prevent later customers from gaining access to an earlier customer's data, the data on the storage device may be erased using any desired mechanism. For example, a table storing information about where data is stored may be erased. Or the data itself may be overwritten with new data (to defeat later attempts to recover any information that may have been deleted): the new data may use a pattern designed to help ensure that the original data cannot be recovered. For example, the United States Department of Defense (DoD) has published criteria for how to erase data to prevent its recovery: these criteria may be used to erase the old data on the storage device before the storage device is reused for a new customer.
Initialization is not limited to when a new storage device is hot-added to an existing array. Initialization may also occur when the storage device, the PCIe switch 125 with lookaside erasure coding logic, or the machine 105 of FIG. 1 as a whole is initially powered up.
FIGS. 11A-11D show a flowchart of an example procedure by which the PCIe switch 125 with lookaside erasure coding logic of FIG. 1 supports the erasure coding schemes 405, 410, and 415 of FIG. 4, according to an embodiment of the inventive concept. In FIG. 11A, at block 1103, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may be initialized (possibly by the BMC 325 of FIG. 3 or the processor 110 of FIG. 1). At block 1106, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may receive a transmission. This transmission may be a read or write request from the processor 110 of FIG. 1, a control transmission from the processor 110 of FIG. 1 or the BMC 325 of FIG. 3, or a transmission sent by the storage devices 130-1 through 130-6 of FIG. 3 in response to a read or write request from the processor 110 of FIG. 1.
At block 1109, the probing logic 525 of FIG. 5 may determine whether the transmission is a control transmission from the processor 110 of FIG. 1. If so, at block 1112, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may deliver the control transmission to the PPU 520 of FIG. 5, after which processing ends.
If the transmission is not a control transmission from the processor 110 of FIG. 1, then at block 1115 (FIG. 11B), the probing logic 525 of FIG. 5 may determine whether the transmission is a read or write request from the host. If it is not, then at block 1118, the probing logic 525 of FIG. 5 may replace the device LBA in the transmission with the host LBA appropriate for the host. The probing logic 525 of FIG. 5 may also modify the transmission so that it appears to come from the virtual storage device rather than from the physical storage device that stores the actual data. At block 1121, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may deliver the transmission to the processor 110 of FIG. 1, after which processing ends.
If, on the other hand, the transmission is a read or write request from the processor 110 of FIG. 1, then at block 1124 the detection logic 525 of FIG. 5 may determine whether the data in question is available in the cache 545 of FIG. 5 or the write buffer 550 of FIG. 5. If the data is available in the cache 545 of FIG. 5 or the write buffer 550 of FIG. 5, the erasure coding controller 530 of FIG. 5 may access the data from the appropriate location at block 1127 (FIG. 11C).
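A minimal sketch of the lookup in block 1124 follows. That the write buffer is consulted before the cache is an assumption made here (the write buffer would hold the most recently merged, dirty stripes); the dictionary-style containers are illustrative stand-ins for the cache 545 and the write buffer 550:

    def lookup(host_lba, write_buffer, cache):
        """Return locally held data for host_lba, or None to read from the drives."""
        if host_lba in write_buffer:
            return write_buffer[host_lba]  # newest (dirty) copy wins
        return cache.get(host_lba)         # None: not held locally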
If the data is not available in the cache 545 of FIG. 5 or the write buffer 550 of FIG. 5, then at block 1130 the detection logic 525 of FIG. 5 may modify the transmission to replace the host LBA provided by the host with the device LBA from which the storage device should read the data. The detection logic 525 of FIG. 5 may also modify the transmission to identify the appropriate storage device to receive the transmission. Next, at block 1133, the detection logic 525 may deliver the transmission to the appropriate storage device.
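The address translations in blocks 1130 and 1118 can be illustrated with a simple round-robin striping layout. This is only a sketch: the stripe unit size, the drive count, and the absence of parity rotation are all assumptions, and the function names are hypothetical.

    STRIPE_UNIT = 8        # assumed stripe unit size, in blocks
    NUM_DATA_DEVICES = 5   # e.g., six drives with one drive's worth of parity

    def host_to_device_lba(host_lba):
        """Map a host LBA on the virtual device to (device index, device LBA)."""
        unit = host_lba // STRIPE_UNIT
        offset = host_lba % STRIPE_UNIT
        device = unit % NUM_DATA_DEVICES
        device_lba = (unit // NUM_DATA_DEVICES) * STRIPE_UNIT + offset
        return device, device_lba

    def device_to_host_lba(device, device_lba):
        """Inverse mapping, as used when a storage device responds (block 1118)."""
        row = device_lba // STRIPE_UNIT
        offset = device_lba % STRIPE_UNIT
        return (row * NUM_DATA_DEVICES + device) * STRIPE_UNIT + offset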
Regardless of whether the data in question was accessed from the cache or read from a storage device, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 now has the required data. At this point, processing may diverge. If the transmission is a read request from the processor 110 of FIG. 1, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may return the data to the processor 110 of FIG. 1 at block 1136. As shown in block 1139, the detection logic 525 of FIG. 5 may also store the data in the cache 545 of FIG. 5; block 1139 is optional and may be omitted, as indicated by dashed line 1142. At this point, processing ends.
On the other hand, if the transmission from the processor 110 of FIG. 1 is a write request, then at block 1145 the erasure coding controller 530 of FIG. 5 may read the stripe across the storage devices 130-1 through 130-6 of FIG. 3. Block 1145 is effectively a repeat of blocks 1127, 1130, and 1133, and may not be required; it is included in FIG. 11C to emphasize that writing data to the virtual storage device may involve reading the entire stripe across the storage devices 130-1 through 130-6. At block 1148, the erasure coding controller 530 of FIG. 5 may merge the data received from the processor 110 of FIG. 1 with the stripe of data accessed from the cache or from the storage devices 130-1 through 130-6.
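As a concrete illustration of the read-merge step in blocks 1145 and 1148, the following sketch performs a read-modify-write of one stripe under an assumed single-parity (RAID-5-style) scheme. The XOR parity choice and the function names are illustrative stand-ins for whatever erasure coding scheme the switch is configured to use:

    from functools import reduce

    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

    def merge_write(stripe_data, new_data, index):
        """Replace one data block in a stripe and recompute parity.

        stripe_data: list of data blocks read from the drives (parity excluded).
        new_data: the block supplied by the host write request.
        index: position of the block being overwritten within the stripe.
        Returns (updated data blocks, new parity block).
        """
        updated = list(stripe_data)
        updated[index] = new_data
        parity = xor_blocks(updated)
        return updated, parity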
At this point, processing may diverge again, depending on whether the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 includes the write buffer 550 of FIG. 5. If it does, then at block 1151 (FIG. 11D) the erasure coding controller 530 of FIG. 5 may write the merged stripe of data to the write buffer 550 of FIG. 5 (marking this data as dirty, to be flushed later to the storage devices 130-1 through 130-6). Next, at block 1154, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may report completion of the write request to the processor 110 of FIG. 1. Note that block 1154 is appropriate if the write buffer 550 of FIG. 5 uses a write-back cache policy; if the write buffer 550 of FIG. 5 uses a write-through cache policy, block 1154 may be omitted, as indicated by dashed line 1157.
Finally, whether because the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 does not include the write buffer 550 of FIG. 5, or because data in the write buffer 550 of FIG. 5 is to be flushed to the storage devices 130-1 through 130-6 of FIG. 3, at block 1160 the erasure coding controller 530 of FIG. 5 may write the updated stripe back to the storage devices 130-1 through 130-6 of FIG. 3. Next, at block 1163, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may report completion of the write request to the processor 110 of FIG. 1. Note that if the merged data was already stored in the write buffer 550 of FIG. 5 and the write buffer 550 of FIG. 5 uses a write-back cache policy, then block 1163 is not required: the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 has already reported completion of the write request (at block 1154). In such a scenario, block 1163 may be omitted, as shown by dashed line 1166. At this point, processing ends.
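The difference between blocks 1151 through 1166 amounts to a write-back versus write-through policy decision. A minimal sketch of that dispatch, where flush_to_drives and report_completion are hypothetical callables standing in for the behavior the flowchart attributes to the switch:

    def handle_merged_stripe(stripe, flush_to_drives, report_completion,
                             write_buffer=None, write_back=False):
        if write_buffer is not None:
            write_buffer.append(stripe)   # block 1151: buffer the stripe, marked dirty
            if write_back:
                report_completion()       # block 1154: complete before the flush
                return                    # the stripe is flushed to the drives later
        flush_to_drives(stripe)           # block 1160: write the stripe to the drives
        report_completion()               # block 1163: complete after the flush

Called with write_back=True and a write buffer, this reports completion as soon as the stripe is buffered; otherwise completion is reported only after the stripe reaches the drives, matching the two paths in FIG. 11D.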
FIGS. 12A-12B illustrate a flowchart of an exemplary process by which the PCIe switch 125 with lookaside erasure coding logic of FIG. 1 performs initialization, according to an embodiment of the inventive concept. In FIG. 12A, at block 1205, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 determines whether the devices connected to the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 are only storage devices that may have erasure coding managed by the PCIe switch 125 with lookaside erasure coding logic of FIG. 3. If devices that are not storage devices, or that may not have erasure coding managed by the PCIe switch 125 with lookaside erasure coding logic of FIG. 3, are connected to the PCIe switch 125 with lookaside erasure coding logic of FIG. 3, then in some embodiments of the inventive concept the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may disable the lookaside erasure coding logic at block 1210, after which processing ends.
In other embodiments of the inventive concept, however, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may manage erasure coding even if devices that do not qualify for erasure coding are connected to the PCIe switch 125 with lookaside erasure coding logic of FIG. 3. In these embodiments of the inventive concept, or if only erasure-coding-eligible storage devices are connected to the PCIe switch 125 with lookaside erasure coding logic of FIG. 3, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may enable the lookaside erasure coding logic at block 1215. Next, at block 1220 (FIG. 12B), the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may be configured to use an erasure coding scheme (perhaps by the BMC 325 of FIG. 3 or the processor 110 of FIG. 1).
At block 1225, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may disable the devices that are not eligible for erasure coding. Note that block 1225 is optional, as shown by dashed line 1230: it may be that no erasure-coding-ineligible devices are connected to the PCIe switch 125 with lookaside erasure coding logic of FIG. 3, or the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may allow the processor 110 of FIG. 1 to access those erasure-coding-ineligible devices even though erasure coding is used for the other devices.
At block 1235, for any devices undergoing erasure coding, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may terminate enumeration downstream of the PCIe switch 125 with lookaside erasure coding logic of FIG. 3. At block 1240, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may report a virtual storage device to the processor 110 of FIG. 1, based on the storage devices 130-1 through 130-6 of FIG. 3 undergoing erasure coding. The PCIe switch 125 with lookaside erasure coding logic of FIG. 3 may also report to the processor 110 of FIG. 1 any other PCIe devices that may be enumerated. At this point, processing ends.
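For the reporting in block 1240, the capacity advertised for the virtual storage device depends on both the underlying drives and the erasure coding scheme. A sketch under an assumed RAID-5/6-style scheme (the function name and the uniform-stripe assumption are illustrative, not taken from the embodiment):

    def virtual_capacity(device_capacities, parity_devices=1):
        """Usable capacity of the virtual device under a single/double-parity scheme.

        Each drive contributes the smallest drive's capacity to the stripes;
        the equivalent of parity_devices drives is consumed by parity.
        """
        n = len(device_capacities)
        if n <= parity_devices:
            raise ValueError("not enough devices for the requested parity level")
        return min(device_capacities) * (n - parity_devices)

    # Example: six 1 TB drives with single parity expose 5 TB.
    print(virtual_capacity([10**12] * 6))  # 5000000000000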
FIG. 13 illustrates a flowchart of an exemplary process for the PCIe switch 125 with lookaside erasure coding logic of FIG. 1 to incorporate new storage devices into the erasure coding schemes 405, 410, and 415 of FIG. 4, according to an embodiment of the inventive concept. In FIG. 13, at block 1305, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 (or the BMC 325 of FIG. 3) may check for a new storage device. If a new storage device is detected, then at block 1310 the erasure coding controller 530 of FIG. 5 may add the new storage device to the array behind the virtual storage device. Finally, at block 1315, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 (or the BMC 325 of FIG. 3, or the processor 110 of FIG. 1) may initialize the new storage device. At this point, processing may end, or may return to block 1305 to check for additional new storage devices, as indicated by dashed line 1320.
FIG. 14 illustrates a flowchart of an exemplary process for the PCIe switch 125 with lookaside erasure coding logic of FIG. 1 to handle a failed storage device, according to an embodiment of the inventive concept. In FIG. 14, at block 1405, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 (or the BMC 325 of FIG. 3) may check for a failed (or removed) storage device. If a failed storage device is detected, then at block 1410 the erasure coding controller 530 of FIG. 5 may perform erasure coding recovery when a read request arrives that would have accessed data from the failed storage device. Such erasure coding recovery may involve reading, from the other storage devices, the stripe that includes the requested data, and computing the requested data from the remaining data in the stripe.
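The recovery in block 1410 can be illustrated under an assumed XOR single-parity scheme, in which the block lost with the failed drive equals the XOR of all surviving blocks in its stripe (data and parity alike). A minimal sketch; gathering the surviving blocks from the other drives is left to the caller:

    def recover_block(surviving_blocks):
        """Reconstruct the missing block of a stripe under XOR single parity."""
        missing = bytearray(len(surviving_blocks[0]))
        for block in surviving_blocks:
            for i, byte in enumerate(block):
                missing[i] ^= byte
        return bytes(missing)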
At block 1415, the PCIe switch 125 with lookaside erasure coding logic of FIG. 3 (or the BMC 325 of FIG. 3) may determine whether a replacement storage device has been added to the array behind the virtual storage device. If a replacement storage device has been added to the array behind the virtual storage device, then at block 1420 the erasure coding controller 530 of FIG. 5 may rebuild the failed storage device onto the replacement storage device. At this point, processing may end, or may return to block 1405 to check for additional failed storage devices, as indicated by dashed line 1425.
FIGS. 11A through 14 show some embodiments of the inventive concept. Those skilled in the art will recognize that other embodiments of the inventive concept are possible by changing the order of the blocks, by omitting blocks, or by including elements not shown in the figures. All such variations of the flowcharts, whether explicitly described or not, are considered embodiments of the inventive concept.
Embodiments of the inventive concept provide technical advantages over the prior art. Using PCIe switches with lookaside erasure coding logic moves erasure coding closer to the storage device, which reduces the time required to move data around. Removing erasure coding from the processor reduces the load on the processor, allowing the processor to execute more instructions for the application. By using a configurable erasure coding controller, any desired erasure coding scheme can be used, rather than the limited set of schemes supported by hardware and software erasure coding vendors. By placing the erasure coding controller with the PCIe switch, the need for expensive RAID plug-in cards is eliminated and even larger arrays spanning multiple chassis may be used.
The following discussion is intended to provide a brief, general description of a suitable machine or machines in which certain aspects of the present inventive concepts may be implemented. The one or more machines may be controlled, at least in part, by input from conventional input devices (e.g., keyboard, mouse, etc.) as well as by directives received from another machine, interaction with a Virtual Reality (VR) environment, biometric feedback, or other input signals. The term "machine" as used herein is intended to broadly encompass a single machine, a virtual machine, or a system of communicatively coupled machines, virtual machines, or devices operating together. Exemplary machines include computing devices such as personal computers, workstations, servers, portable computers, handheld devices, telephones, tablets, and transportation devices such as private or public transportation vehicles (e.g., cars, trains, taxis, etc.).
The one or more machines may include embedded controllers, such as programmable or non-programmable logic devices or arrays, Application Specific Integrated Circuits (ASICs), embedded computers, smart cards, and the like. The one or more machines may utilize one or more connections (e.g., through a network interface, modem, or other communicative coupling) with one or more remote machines. The machines may be interconnected by a physical and/or logical network, such as an intranet, the Internet, a local area network, a wide area network, and so on. Those skilled in the art will appreciate that network communications may utilize a variety of wired and/or wireless short-range or long-range carriers and protocols, including Radio Frequency (RF), satellite, microwave, Institute of Electrical and Electronics Engineers (IEEE) 802.11, Bluetooth®, optical, infrared, cable, laser, and the like.
Embodiments of the inventive concept may be described with reference to or in conjunction with associated data including functions, procedures, data structures, application programs, and so on, which when accessed by a machine result in the machine performing tasks or defining abstract data types or low-level hardware contexts. The associated data may be stored, for example, in volatile and/or non-volatile memory (e.g., RAM, ROM, etc.) or in other storage devices and their associated storage media, including hard drives, floppy disks, optical storage, magnetic tape, flash memory, memory sticks, digital video disks, biological memory, and the like. The associated data may be delivered over a transmission environment, including a physical network and/or a logical network, in the form of packets, serial data, parallel data, propagated signals, etc., and may be used in a compressed or encrypted format. The associated data may be used in a distributed environment and stored locally and/or remotely for access by the machine.
Embodiments of the inventive concepts may include a tangible, non-transitory, machine-readable medium comprising instructions executable by one or more processors, the instructions comprising instructions for performing the elements of the inventive concepts described herein.
The various operations of the methods described above may be performed by any suitable means capable of performing the described operations, such as various hardware and/or software components, circuits, and/or modules. The software may comprise an ordered listing of executable instructions for implementing logical functions, and may be embodied in any "processor-readable medium" for use by or in connection with an instruction execution system, apparatus, or device, such as a single-core or multi-core processor or a processor-containing system.
The blocks or steps of a method or algorithm and the functions described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a tangible, non-transitory computer-readable medium. A software module may reside in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Electrically Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), registers, a hard disk, a removable disk, a compact disk read only memory (CD ROM), or any other form of storage medium known in the art.
Having described and illustrated the principles of the inventive concept with reference to illustrated embodiments, it will be recognized that the illustrated embodiments may be modified in arrangement and detail, and may be combined in any desired manner, without departing from such principles. And although the foregoing discussion focuses on particular embodiments, other configurations are also contemplated. In particular, even though expressions such as "embodiments according to the inventive concept" are used herein, these phrases are intended to reference embodiment possibilities generally, and are not intended to limit the inventive concept to particular embodiment configurations. As used herein, these terms may reference the same embodiment or different embodiments that are combinable into other embodiments.
The foregoing illustrative embodiments should not be construed as limiting the inventive concepts of the present invention. Although a few embodiments have been described, those skilled in the art will readily appreciate that many modifications are possible in the embodiments without materially departing from the novel teachings and advantages of this disclosure. Accordingly, all such modifications are intended to be included within the scope of the present inventive concept as defined in the claims.
Embodiments of the inventive concept can be extended to the following statements, but are not limited thereto:
Statement 1. embodiments of the inventive concept include a peripheral component interconnect express (PCIe) switch having erasure coding logic, the PCIe switch having erasure coding logic comprising:
an external connector enabling the PCIe switch to communicate with the processor;
at least one connector enabling the PCIe switch to communicate with the at least one storage device;
a Power Processing Unit (PPU) to handle configuration of a PCIe switch;
an erasure coding controller comprising circuitry for applying an erasure coding scheme to data stored on the at least one storage device; and
probe logic comprising circuitry to intercept a data transmission received at the PCIe switch and modify the data transmission in response to an erasure coding scheme.
Statement 2. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 1, wherein the erasure coding logic includes at least one of lookaside erasure coding logic and perspective erasure coding logic.
Statement 3. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 1, wherein the at least one storage device comprises at least one non-volatile memory express (NVMe) Solid State Drive (SSD).
Statement 4. embodiments of the inventive concept include a PCIe switch having erasure coding logic in accordance with statement 3, wherein the probing logic is operable to intercept control transmissions received at the PCIe switch and forward the control transmissions to the PPU.
Statement 5. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the probing logic is operable to intercept a data transfer received at the PCIe switch from the host and replace a host Logical Block Address (LBA) used by the host with a device LBA used by the at least one NVMe SSD in the data transfer.
Statement 6. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 5, wherein the probe logic is further operable to direct data transmissions to the at least one NVMe SSD.
Statement 7. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the probing logic is operable to intercept a data transfer received at the PCIe switch from one of the at least one NVMe SSD and replace in the data transfer a device LBA used by the one of the at least one NVMe SSD with a host LBA used by the host.
Statement 8. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, the PCIe switch having erasure coding logic further comprising a cache.
Statement 9. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 8, wherein the probe logic is operable to return a response to the data transfer based at least in part on the presence in the cache of data requested in the data transfer from the host.
Statement 10. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein:
the PCIe switch is positioned in the chassis; and is
The chassis includes memory that is used by the erasure coding controller as external cache.
Statement 11. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, the PCIe switch having erasure coding logic further comprising a write buffer.
Statement 12 embodiments of the inventive concept include a PCIe switch with erasure coding logic according to statement 11, wherein:
the data transfer includes a write operation from the host; and is
The erasure coding controller is operable to complete the write operation after sending a response to the data transfer to the host.
Statement 13. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 11, wherein the erasure coding controller is operable to store data in write operations in the write buffer.
Statement 14. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the PCIe switch is operable to enable the erasure coding controller and the probing logic based at least in part on all of the at least one NVMe SSD being usable with the erasure coding controller.
Statement 15. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the PCIe switch is operable to disable the erasure coding controller and the probing logic based at least in part on the at least one NVMe SSD including built-in erasure coding functionality.
Statement 16. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 15, wherein the PCIe switch is operable to notify a user that the erasure coding controller and the probing logic are disabled based at least in part on the at least one NVMe SSD including built-in erasure coding functionality.
Statement 17. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the PCIe switch is operable to disable the erasure coding controller and the probing logic based at least in part on at least one non-storage device being connected to the PCIe switch using the at least one connector.
Statement 18. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 17, wherein the PCIe switch is operable to notify a user that the erasure coding controller and the probing logic are disabled based at least in part on the at least one non-storage device being connected to the PCIe switch using the at least one connector.
Statement 19. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the PCIe switch is operable to enable the erasure coding controller and the probing logic with the at least one NVMe SSD and prevent access to non-storage devices connected to the PCIe switch using the at least one connector.
Statement 20. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 19, wherein the PCIe switch is operable to notify a user that access to a non-storage device connected to the PCIe switch is blocked.
Statement 21. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the PCIe switch is operable to manage an erasure coding scheme on at least one additional NVMe SSD connected to a second PCIe switch using an erasure coding controller and probing logic.
Statement 22 embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 21, wherein the second PCIe switch is operable to disable the second erasure coding controller and the second probing logic in the second PCIe switch.
Statement 23. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 22, wherein:
the PCIe switch is located in the first chassis; and is
The second PCIe switch is located in the second chassis.
Statement 24. embodiments of the inventive concept include a PCIe switch with erasure coding logic according to statement 3, wherein the PCIe switch is implemented using a Field Programmable Gate Array (FPGA).
Statement 25. embodiments of the inventive concept include a PCIe switch with erasure coding logic according to statement 3, wherein:
the at least one NVMe SSD includes at least two NVMe SSDs; and is
The PCIe switch and the at least two NVMe SSDs are located inside a common housing.
Statement 26. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the PCIe switch and the at least one NVMe SSD are located in separate housings.
Statement 27. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein:
a PCIe switch operable to detect a failed NVMe SSD of the at least one NVMe SSD; and is
The erasure coding controller is operable to handle data transfers to cope with failed NVMe SSDs.
Statement 28. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 27, wherein the erasure coding controller is operable to perform erasure coding recovery of data stored on a failed NVMe SSD.
Statement 29. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 28, wherein the erasure coding controller is operable to reconstruct a replacement NVMe SSD for a failed NVMe SSD.
Statement 30. embodiments of the inventive concept include a PCIe switch with erasure coding logic according to statement 3, wherein:
the PCIe switch is operable to detect a new NVMe SSD; and is
The erasure coding controller is operable to use the new NVMe SSD as part of the erasure coding scheme.
Statement 31. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 30, wherein the erasure coding controller is operable to effect the capacity increase using the new NVMe SSD.
Statement 32. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 30, wherein the PCIe switch is operable to detect a new NVMe SSD connected to one of the at least one connector.
Statement 33. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 30, wherein the PCIe switch is operable to detect a new NVMe SSD through a message from a second PCIe switch.
Statement 34 embodiments of the inventive concept include a PCIe switch with erasure coding logic according to statement 33, wherein the new NVMe SSD is connected to a second connector on a second PCIe switch.
Statement 35. embodiments of the inventive concept include a PCIe switch with erasure coding logic according to statement 3, wherein the at least one connector includes a presence pin for detecting both failed NVMe SSDs and new NVMe SSDs.
Statement 36. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the PCIe switch is operable to present itself to the host as a single device and prevent downstream PCIe bus enumeration to the at least one NVMe SSD.
Statement 37 embodiments of the inventive concept include a PCIe switch having erasure coding logic in accordance with statement 36, wherein the PCIe switch is further operable to prevent downstream PCIe bus enumeration to a second PCIe switch downstream from the PCIe switch.
Statement 38 embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 36, wherein the PCIe switch is operable to virtualize the at least one NVMe SSD.
Statement 39. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the erasure coding controller is operable to initialize a new NVMe SSD connected to one of the at least one connector.
Statement 40. embodiments of the inventive concept include a PCIe switch having erasure coding logic in accordance with statement 39, wherein the erasure coding controller is operable to initialize a new NVMe SSD after a hot insertion event.
Statement 41. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 39, wherein the erasure coding controller is further operable to initialize the at least one NVMe SSD upon startup.
Statement 42. an embodiment of the inventive concept includes a PCIe switch having erasure coding logic according to statement 3, wherein the PCIe switch is part of a system that includes a Baseboard Management Controller (BMC) operable to initialize a new NVMe SSD connected to one of the at least one connector.
Statement 43. embodiments of the inventive concept include a PCIe switch with erasure coding logic according to statement 42, wherein the BMC is operable to initialize the at least one NVMe SSD upon boot up.
Statement 44. embodiments of the inventive concept include a PCIe switch having erasure coding logic according to statement 3, wherein the erasure coding controller comprises a stripe manager for striping data across the at least one NVMe SSD.
Statement 45 an embodiment of the inventive concept includes a method comprising:
receiving a transmission at a peripheral component interconnect express (PCIe) switch having erasure coding logic;
processing the transmission using probe logic in the erasure coding logic; and
the transmission is delivered to its destination through the PCIe switch.
Statement 46. embodiments of the inventive concept include a method according to statement 45, wherein the erasure coding logic includes at least one of a lookaside erasure coding logic and a perspective erasure coding logic.
Statement 47. embodiments of the inventive concept include a method according to statement 45, wherein:
processing the transmission using the probing logic in the erasure coding logic includes determining, by the probing logic, that the transmission includes a control transmission; and is
Delivering the transmission to its destination through the PCIe switch includes delivering the transmission to a Power Processing Unit (PPU).
Statement 48. embodiments of the inventive concept include a method according to statement 45, wherein processing the transmission using the probe logic in the erasure coding logic comprises processing the transmission using the probe logic based at least in part on the erasure coding logic being active.
Statement 49 embodiments of the inventive concept include a method according to statement 45, wherein:
receiving a transmission at a peripheral component interconnect express (PCIe) switch having erasure coding logic includes receiving a read request from a host;
processing the transfer with probe logic in the erasure coding logic includes replacing a host Logical Block Address (LBA) with a device LBA in the read request; and is
Delivering the transmission to its destination through the PCIe switch includes delivering a read request to a non-volatile memory express (NVMe) Solid State Drive (SSD).
Statement 50. embodiments of the inventive concept include a method according to statement 49, wherein processing the transmission using the probe logic in the erasure coding logic further comprises identifying the NVMe SSD to which the read request should be delivered.
Statement 51 embodiments of the inventive concept include a method according to statement 49, wherein:
processing the transmission using the probe logic in the erasure coding logic further includes accessing data requested by the host in the read request from the cache based at least in part on the data residing in the cache;
replacing the host Logical Block Address (LBA) with a device LBA in the read request includes replacing the host LBA with the device LBA in the read request based at least in part on the data not residing in the cache; and is
Delivering, by the PCIe switch, the transmission to its destination includes delivering the read request to the NVMe SSD based at least in part on the data not residing in the cache.
Statement 52. embodiments of the inventive concept include a method according to statement 45, wherein:
receiving a transmission at a peripheral component interconnect express (PCIe) switch having erasure coding logic includes receiving a write request from a host;
processing the transfer with the probe logic in the erasure coding logic includes replacing the host LBA with the device LBA in the write request; and is
Delivering the transmission to its destination through the PCIe switch includes delivering the write request to the NVMe SSD.
Statement 53. embodiments of the inventive concept include a method according to statement 52, wherein processing the transmission using the probe logic in the erasure coding logic further comprises identifying the NVMe SSD to which the write request should be delivered.
Statement 54 embodiments of the inventive concept include a method according to statement 52, the method further comprising:
reading a block stripe from at least one NVMe SSD;
merging the data in the write request with the block stripe to form an updated block stripe; and
writing the updated block stripe to the at least one NVMe SSD.
Statement 55 an embodiment of the inventive concept includes the method according to statement 54, wherein merging the data in the write request includes calculating additional data to be written to the at least one NVMe SSD in addition to the data in the write request.
Statement 56 embodiments of the inventive concept include a method according to statement 54, wherein:
the method further includes reading the block stripe from the cache based at least in part on the block stripe residing in the cache; and is
Reading the block stripe from the at least one NVMe SSD includes reading the block stripe from the at least one NVMe SSD based at least in part on the block stripe not residing in the cache.
Statement 57. embodiments of the inventive concept include a method according to statement 54, wherein writing the updated block stripe to the at least one NVMe SSD includes writing the updated block stripe to a write buffer.
Statement 58 embodiments of the inventive concept include a method according to statement 57, further comprising responding to the host that the write has completed after the updated block stripe is written to the write buffer and before the updated block stripe is written to the at least one NVMe SSD.
Statement 59 an embodiment of the inventive concept includes a method according to statement 45, wherein:
receiving a transmission at a peripheral component interconnect express (PCIe) switch having erasure coding logic includes receiving a response from the NVMe SSD;
processing the transfer with the probe logic in the erasure coding logic includes replacing the device LBA with the host LBA in the response; and is
Delivering the transmission to its destination through the PCIe switch includes delivering the response to the host.
Statement 60. embodiments of the inventive concept include a method according to statement 59, wherein processing the transfer using the probe logic in the erasure coding logic further comprises replacing an identifier of the NVMe SSD with an identifier of the virtual storage device.
Statement 61 embodiments of the inventive concept include a method according to statement 45, wherein delivering the transmission to its destination by the PCIe switch comprises delivering the transmission to a second PCIe switch to which an NVMe SSD is connected, the NVMe SSD being the destination.
Statement 62 embodiments of the inventive concept include a method according to statement 61, wherein the PCIe switch is located in a first chassis and the second PCIe switch is located in a second chassis.
Statement 63. embodiments of the inventive concept include a method according to statement 45, further comprising initializing at least one NVMe SSD connected to the PCIe switch for use with erasure coding.
Statement 64 embodiments of the inventive concept include a method according to statement 45, the method further comprising:
detecting that a new NVMe SSD is connected to the PCIe switch; and
adding the new NVMe SSD to the capacity of the virtual storage device.
Statement 65. embodiments of the inventive concept include a method according to statement 64, further comprising initializing a new NVMe SSD for use with erasure coding.
Statement 66 embodiments of the inventive concept include a method according to statement 45, the method further comprising:
detecting a failed NVMe SSD connected to the PCIe switch; and
erasure coding recovery is performed on the data stored on the failed NVMe SSD.
Statement 67. embodiments of the inventive concept include a method according to statement 66, the method further comprising:
detecting a replacement NVMe SSD for the failed NVMe SSD; and
the failed NVMe SSD is rebuilt using the replacement NVMe SSD.
Statement 68. embodiments of the inventive concept include a method according to statement 45, the method further comprising:
detecting that only NVMe SSDs without erasure coding functionality are connected to the PCIe switch; and
erasure coding logic in the PCIe switch is enabled.
Statement 69 embodiments of the inventive concept include a method according to statement 68, further comprising terminating PCIe bus enumeration downstream of the PCIe switch.
Statement 70. embodiments of the inventive concept include a method according to statement 68, further comprising reporting a virtual storage device to the host, the capacity of the virtual storage device based at least in part on the capacity of the NVMe SSD connected to the PCIe switch and the erasure coding scheme.
Statement 71 embodiments of the inventive concept include a method according to statement 45, the method further comprising:
detecting that at least one non-storage device or at least one NVMe SSD having erasure coding functionality is connected to a PCIe switch; and
erasure coding logic in the PCIe switch is disabled.
Statement 72 embodiments of the inventive concept include a method according to statement 45, the method further comprising:
detecting that at least one non-storage device or at least one NVMe SSD having erasure coding functionality is connected to a PCIe switch;
enabling erasure coding logic in the PCIe switch; and
disabling the at least one non-storage device or the at least one NVMe SSD having erasure coding functionality.
Statement 73 embodiments of the inventive concept include a method according to statement 72, the method further comprising terminating PCIe bus enumeration downstream of the PCIe switch.
Statement 74 embodiments of the inventive concept include a method according to statement 72, further comprising reporting a virtual storage device to the host, the capacity of the virtual storage device based at least in part on the capacity of the NVMe SSD connected to the PCIe switch and the erasure coding scheme.
Statement 75 embodiments of the inventive concept include a method according to statement 45, further comprising configuring a PCIe switch having erasure coding logic to use an erasure coding scheme.
Statement 76. embodiments of the inventive concept include the method of statement 75, wherein configuring the PCIe switch having erasure coding logic to use an erasure coding scheme comprises configuring the PCIe switch having erasure coding logic to use an erasure coding scheme using a Baseboard Management Controller (BMC).
Statement 77 an embodiment of the inventive concept includes an article comprising a non-transitory storage medium having stored thereon instructions that, when executed by a machine, cause:
receiving a transmission at a peripheral component interconnect express (PCIe) switch having erasure coding logic;
processing the transmission using probe logic in the erasure coding logic; and is
The transmission is delivered to its destination through the PCIe switch.
Statement 78 embodiments of the inventive concept include an article according to statement 77, wherein the erasure coding logic includes at least one of a lookaside erasure coding logic and a perspective erasure coding logic.
Statement 79 embodiments of the inventive concept include an article according to statement 77, wherein:
processing the transmission using the probing logic in the erasure coding logic includes determining, by the probing logic, that the transmission includes a control transmission; and is
Delivering the transmission to its destination through the PCIe switch includes delivering the transmission to a Power Processing Unit (PPU).
Statement 80. embodiments of the inventive concept include an article according to statement 77, wherein processing a transmission using probe logic in erasure coding logic comprises processing a transmission using probe logic based at least in part on erasure coding logic being active.
Statement 81 embodiments of the inventive concept include an article according to statement 77, wherein:
receiving a transmission at a peripheral component interconnect express (PCIe) switch having erasure coding logic includes receiving a read request from a host;
processing the transfer with probe logic in the erasure coding logic includes replacing a host Logical Block Address (LBA) with a device LBA in the read request; and is
Delivering the transmission to its destination through the PCIe switch includes delivering a read request to a non-volatile memory express (NVMe) Solid State Drive (SSD).
Statement 82. embodiments of the inventive concept include an article according to statement 81, wherein processing the transmission using the probe logic in the erasure coding logic further comprises identifying the NVMe SSD to which the read request should be delivered.
Statement 83. embodiments of the inventive concept include an article according to statement 81, wherein:
processing the transmission using the probe logic in the erasure coding logic further includes accessing data requested by the host in the read request from the cache based at least in part on the data residing in the cache;
replacing the host Logical Block Address (LBA) with a device LBA in the read request includes replacing the host LBA with the device LBA in the read request based at least in part on the data not residing in the cache; and
delivering, by the PCIe switch, the transmission to its destination includes delivering a read request to the NVMe SSD based at least in part on the data not residing in the cache.
Statement 84. embodiments of the inventive concept include an article according to statement 77, wherein:
receiving a transmission at a peripheral component interconnect express (PCIe) switch having erasure coding logic includes receiving a write request from a host;
processing the transfer with the probe logic in the erasure coding logic includes replacing the host LBA with the device LBA in the write request; and is
Delivering the transmission to its destination through the PCIe switch includes delivering the write request to the NVMe SSD.
Statement 85 embodiments of the inventive concept include an article according to statement 84, wherein processing the transmission using the probe logic in the erasure coding logic further comprises identifying an NVMe SSD to which the write request should be delivered.
Statement 86. embodiments of the inventive concept include an article according to statement 84, the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause:
reading a block stripe from at least one NVMe SSD;
merging the data in the write request with the block stripe to form an updated block stripe; and is
Writing the updated block stripe to the at least one NVMe SSD.
Statement 87 embodiments of the inventive concept include an article according to statement 86, wherein merging the data in the write request includes calculating additional data to be written to the at least one NVMe SSD in addition to the data in the write request.
Statement 88 embodiments of the inventive concept include an article according to statement 86, wherein:
the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause a block stripe to be read from a cache based, at least in part, on the block stripe residing in the cache; and is
Reading the block stripe from the at least one NVMe SSD includes reading the block stripe from the at least one NVMe SSD based at least in part on the block stripe not residing in the cache.
Statement 89 embodiments of the inventive concept include an article according to statement 86, wherein writing the updated block stripe to the at least one NVMe SSD includes writing the updated block stripe to a write buffer.
Statement 90. embodiments of the inventive concept include an article according to statement 89, the non-transitory storage medium having stored thereon further instructions that, when executed by the machine, cause responding to the host that the write has completed after the updated block stripe is written to the write buffer and before the updated block stripe is written to the at least one NVMe SSD.
Statement 91 an embodiment of the inventive concept includes an article according to statement 77, wherein:
receiving a transmission at a peripheral component interconnect express (PCIe) switch having erasure coding logic includes receiving a response from the NVMe SSD;
processing the transfer with the probe logic in the erasure coding logic includes replacing the device LBA with the host LBA in the response; and is
Delivering the transmission to its destination through the PCIe switch includes delivering the response to the host.
Statement 92. embodiments of the inventive concept include an article according to statement 91, wherein processing the transmission using the probe logic in the erasure coding logic further comprises replacing an identifier of the NVMe SSD with an identifier of the virtual storage device.
Statement 93 embodiments of the inventive concept include an article according to statement 77, wherein delivering the transmission to its destination through the PCIe switch comprises delivering the transmission to a second PCIe switch to which an NVMe SSD is connected, the NVMe SSD being the destination.
Statement 94 embodiments of the inventive concept include an article according to statement 93, wherein the PCIe switch is located in a first chassis and the second PCIe switch is located in a second chassis.
Statement 95 embodiments of the inventive concept include an article according to statement 77, the non-transitory storage medium having stored thereon further instructions that when executed by a machine result in initializing at least one NVMe SSD connected to the PCIe switch for use with erasure coding.
Statement 96. embodiments of the inventive concept include an article according to statement 77, the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause:
detecting that a new NVMe SSD is connected to the PCIe switch; and is
Adding the new NVMe SSD to the capacity of the virtual storage device.
Statement 97 embodiments of the inventive concept include an article according to statement 96, the non-transitory storage medium having stored thereon further instructions that when executed by a machine result in initializing a new NVMe SSD for use with erasure coding.
Statement 98 embodiments of the inventive concept include an article according to statement 77, the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause:
detecting a failed NVMe SSD connected to the PCIe switch; and is
Erasure coding recovery is performed on the data stored on the failed NVMe SSD.
Statement 99 embodiments of the inventive concept include an article according to statement 98, the non-transitory storage medium having stored thereon further instructions that when executed by a machine result in:
detecting a replacement NVMe SSD for the failed NVMe SSD; and is
The failed NVMe SSD is rebuilt using the replacement NVMe SSD.
Statement 100. embodiments of the inventive concept include an article according to statement 77, the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause:
detecting that only NVMe SSDs without erasure coding functionality are connected to the PCIe switch; and is
Erasure coding logic in the PCIe switch is enabled.
Statement 101 embodiments of the inventive concept include an article according to statement 100, the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause termination of PCIe bus enumeration downstream of the PCIe switch.
Statement 102 embodiments of the inventive concept include an article according to statement 100, the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause reporting of a virtual storage device to a host, the capacity of the virtual storage device being based at least in part on a capacity of an NVMe SSD connected to a PCIe switch and an erasure coding scheme.
Statement 103 embodiments of the inventive concept include an article according to statement 77, the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause:
detecting that at least one non-storage device or at least one NVMe SSD having erasure coding functionality is connected to a PCIe switch; and is
Erasure coding logic in the PCIe switch is disabled.
Statement 104 embodiments of the inventive concept include an article according to statement 77, the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause:
detecting that at least one non-storage device or at least one NVMe SSD having erasure coding functionality is connected to a PCIe switch;
enabling erasure coding logic in the PCIe switch; and is
Disabling the at least one non-storage device or the at least one NVMe SSD having erasure coding functionality.
Statement 105 embodiments of the inventive concept include an article in accordance with statement 104, the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause termination of PCIe bus enumeration downstream of the PCIe switch.
Statement 106 embodiments of the inventive concept include an article according to statement 104, the non-transitory storage medium having stored thereon further instructions that when executed by a machine result in reporting a virtual storage device to a host, the capacity of the virtual storage device being based at least in part on the capacity of an NVMe SSD connected to a PCIe switch and an erasure coding scheme.
Statement 107. embodiments of the inventive concept include an article according to statement 77, the non-transitory storage medium having stored thereon further instructions that, when executed by a machine, cause configuring a PCIe switch having erasure coding logic to use an erasure coding scheme.
Statement 108. embodiments of the inventive concept include an article in accordance with statement 107, wherein configuring a PCIe switch having erasure coding logic to use an erasure coding scheme comprises configuring a PCIe switch having erasure coding logic to use an erasure coding scheme using a Baseboard Management Controller (BMC).
Statement 109 embodiments of the inventive concept include a system comprising:
a non-volatile memory express (NVMe) Solid State Drive (SSD);
a Field Programmable Gate Array (FPGA) implementing one or more functions supporting NVMe SSD, the one or more functions including at least one of data acceleration, deduplication, data integrity, data encryption, and data compression; and
a peripheral component interconnect express (PCIe) switch;
wherein the PCIe switch communicates with the FPGA and the NVMe SSD.
Statement 110 embodiments of the inventive concept include a system according to statement 109, wherein the FPGA and NVMe SSD are located inside a common housing.
Statement 111 embodiments of the inventive concept include a system according to statement 110, wherein the PCIe switch is located outside a common enclosure that includes the FPGA and the NVMe SSD.
Statement 112 embodiments of the inventive concept include a system according to statement 109, wherein:
the PCIe switch is connected to the FPGA; and is
The FPGA is connected to the NVMe SSD.
Statement 113 embodiments of the inventive concept include a system according to statement 109, wherein:
the PCIe switch is connected to the NVMe SSD; and is
NVMe SSD is connected to FPGA.
Statement 114 embodiments of the inventive concept include a system according to statement 109, wherein the PCIe switch includes erasure coding logic, the erasure coding logic including an erasure coding controller.
Statement 115 embodiments of the inventive concept include a system according to statement 114, wherein the erasure coding logic includes at least one of a lookaside erasure coding logic and a perspective erasure coding logic.
Statement 116 embodiments of the inventive concept include a system according to statement 114, wherein the erasure coding logic is operable to return a response to a read request from the host based at least in part on the presence in the cache of data requested in the read request.
Statement 117 embodiments of the inventive concept include a system according to statement 116, wherein the erasure coding logic further includes a cache.
Statement 118 embodiments of the inventive concept include a system according to statement 116, wherein:
the PCIe switch is positioned in the chassis; and is
The chassis includes memory that is used as cache by erasure coding logic.
Statement 119 embodiments of the inventive concept include a system according to statement 114, wherein the erasure coding logic is operable to return a response to the write request to the host before the write request is completed.
Statement 120 embodiments of the inventive concept include a system according to statement 119, wherein:
the PCIe switch further comprises a write buffer; and is
The erasure coding controller is operable to store data in the write request in a write buffer.
Statement 121 embodiments of the inventive concept include a system according to statement 114, wherein the erasure coding logic comprises lookaside erasure coding logic comprising probing logic.
Statement 122 embodiments of the inventive concept include a system according to statement 114, wherein the erasure coding logic is operable to intercept a control transmission received at the PCIe switch and forward the control transmission to a Power Processing Unit (PPU).
Statement 123 embodiments of the inventive concept include a system according to statement 114, wherein the erasure coding logic is operable to intercept a data transfer received at the PCIe switch from the host and replace a host Logical Block Address (LBA) used by the host with a LBA used by the NVMe SSD in the data transfer.
Statement 124. embodiments of the inventive concept include a system according to statement 123, wherein the erasure coding logic is further operable to direct data transfer to the NVMe SSD.
Statement 125. embodiments of the inventive concept include a system according to statement 114, wherein the erasure coding logic is operable to intercept a data transfer received at the PCIe switch from the NVMe SSD and replace a device LBA used by the NVMe SSD with a host LBA used by the host in the data transfer.
Statement 126 embodiments of the inventive concept include a system according to statement 114, wherein the erasure coding logic defines a virtual storage across the NVMe SSD and the second NVMe SSD.
Statement 127. Embodiments of the inventive concept include a system according to statement 114, wherein the PCIe switch is operable to enable the erasure coding logic based at least in part on the NVMe SSD being able to be used with the erasure coding logic.
Statement 128. Embodiments of the inventive concept include a system according to statement 114, further comprising a second device connected to the PCIe switch having the erasure coding logic.
Statement 129. Embodiments of the inventive concept include a system according to statement 128, wherein the second device comprises at least one of a storage device, an SSD with a Field Programmable Gate Array (FPGA), and a Graphics Processing Unit (GPU).
Statement 130. Embodiments of the inventive concept include a system according to statement 128, wherein:
the second device is not capable of being used with the erasure coding logic; and
the PCIe switch is operable to disable the erasure coding logic based at least in part on the second device being unable to be used with the erasure coding logic.
Statement 131. Embodiments of the inventive concept include a system according to statement 128, wherein:
the second device is not capable of being used with the erasure coding logic; and
the PCIe switch is operable to enable the erasure coding logic based at least in part on the NVMe SSD being able to be used with the erasure coding logic and to enable access to the second device without using the erasure coding logic.
Statement 132. Embodiments of the inventive concept include a system according to statement 128, wherein:
the second device is not capable of being used with the erasure coding logic; and
the PCIe switch is operable to enable the erasure coding logic and disable access to the second device based at least in part on the NVMe SSD being able to be used with the erasure coding logic.
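Statements 127 through 132 enumerate policies for enabling or disabling the erasure coding logic depending on whether each attached device can be used with it. One hedged reading of those policies, with invented names throughout:

```python
# Hypothetical policy table for statements 127-132. Device, supports_ec,
# and the policy names are invented for illustration only.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    supports_ec: bool

def configure(devices, policy="expose"):
    capable   = [d for d in devices if d.supports_ec]
    incapable = [d for d in devices if not d.supports_ec]
    ec_enabled = bool(capable)            # statement 127
    accessible = list(devices)
    if incapable:
        if policy == "disable_ec":        # statement 130
            ec_enabled = False
        elif policy == "hide":            # statement 132
            accessible = capable
        # policy == "expose" (statement 131): EC stays on and incapable
        # devices remain reachable outside the erasure-coded volume.
    return ec_enabled, accessible

ssd = Device("nvme0", True)
gpu = Device("gpu0", False)
assert configure([ssd, gpu], "hide") == (True, [ssd])
assert configure([ssd, gpu], "disable_ec") == (False, [ssd, gpu])
```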
Statement 133. Embodiments of the inventive concept include a system comprising:
a non-volatile memory express (NVMe) Solid State Drive (SSD); and
a Field Programmable Gate Array (FPGA) including a first FPGA portion and a second FPGA portion, the first FPGA portion implementing one or more functions supporting the NVMe SSD, the one or more functions including at least one of data acceleration, deduplication, data integrity, data encryption, and data compression, and the second FPGA portion implementing a peripheral component interconnect express (PCIe) switch,
wherein the PCIe switch communicates with the FPGA and the NVMe SSD, and
wherein the FPGA and the NVMe SSD are located inside a common housing.
Statement 134. Embodiments of the inventive concept include a system according to statement 133, wherein the PCIe switch includes erasure coding logic that includes an erasure coding controller.
Statement 135. Embodiments of the inventive concept include a system according to statement 134, wherein the erasure coding logic defines a virtual storage device that spans at least two portions of the NVMe SSD.
Statement 136. Embodiments of the inventive concept include a system according to statement 134, wherein the erasure coding logic defines a virtual storage device spanning the NVMe SSD and a second NVMe SSD.
Statement 137. Embodiments of the inventive concept include a system according to statement 136, wherein the second NVMe SSD is located inside the common housing.
Statement 138. Embodiments of the inventive concept include a system according to statement 136, wherein the second NVMe SSD is located outside the common housing.
Statement 139. Embodiments of the inventive concept include a system according to statement 134, wherein the erasure coding logic includes at least one of lookaside erasure coding logic and look-through erasure coding logic.
Statement 140. Embodiments of the inventive concept include a system according to statement 134, wherein the erasure coding logic is operable to return a response to a read request from the host based at least in part on the presence in a cache of the data requested in the read request.
Statement 141. Embodiments of the inventive concept include a system according to statement 140, wherein the FPGA further includes the cache.
Statement 142. Embodiments of the inventive concept include a system according to statement 140, wherein:
the common housing is installed in a chassis; and
the chassis includes memory that is used as the cache by the erasure coding logic.
Statement 143. Embodiments of the inventive concept include a system according to statement 134, wherein the erasure coding logic is operable to return a response to a write request to the host before the write request is completed.
Statement 144. Embodiments of the inventive concept include a system according to statement 143, wherein:
the FPGA further includes a write buffer; and
the erasure coding controller is operable to store data in the write request in the write buffer.
Statement 145. Embodiments of the inventive concept include a system according to statement 134, wherein the erasure coding logic includes lookaside erasure coding logic that includes probing logic.
Statement 146. Embodiments of the inventive concept include a system according to statement 145, wherein the probing logic is operable to intercept a control transmission received at the PCIe switch and forward the control transmission to a Power Processing Unit (PPU).
Statement 147. Embodiments of the inventive concept include a system according to statement 134, wherein the erasure coding logic is operable to intercept a data transfer received at the PCIe switch from the host and replace a host Logical Block Address (LBA) used by the host with a device LBA used by the NVMe SSD in the data transfer.
Statement 148. Embodiments of the inventive concept include a system according to statement 147, wherein the erasure coding logic is further operable to direct the data transfer to the NVMe SSD.
Statement 149. Embodiments of the inventive concept include a system according to statement 134, wherein the erasure coding logic is operable to intercept a data transfer received at the PCIe switch from the NVMe SSD and replace a device LBA used by the NVMe SSD with a host LBA used by the host in the data transfer.
Statement 150. Embodiments of the inventive concept include a system according to statement 134, wherein the PCIe switch having the erasure coding logic is operable to enable the erasure coding logic based at least in part on the NVMe SSD being able to be used with the erasure coding logic.
Statement 151. Embodiments of the inventive concept include a system according to statement 134, wherein the PCIe switch having the erasure coding logic is operable to disable the erasure coding logic based at least in part on the NVMe SSD not being able to be used with the erasure coding logic.
Statement 152. Embodiments of the inventive concept include a system comprising:
a non-volatile memory express (NVMe) Solid State Drive (SSD); and
a peripheral component interconnect express (PCIe) switch having erasure coding logic, the PCIe switch comprising:
an external connector enabling the PCIe switch to communicate with a processor;
at least one connector enabling the PCIe switch to communicate with the NVMe SSD;
a Power Processing Unit (PPU) to configure the PCIe switch; and
an erasure coding controller comprising circuitry to apply an erasure coding scheme to data stored on the NVMe SSD.
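Statement 152's erasure coding controller "applies an erasure coding scheme" without fixing the scheme. A single-parity (XOR) code is the simplest instance and is sketched below purely for illustration; real controllers may use Reed-Solomon or other codes, and the function names here are invented.

```python
# Illustrative single-parity (RAID-4/5-style XOR) erasure coding, the
# simplest instance of the erasure coding scheme named in statement 152.
def encode_stripe(data_blocks):
    """Return the parity block for a stripe of equal-length data blocks."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover_block(surviving_blocks, parity):
    """Rebuild the one missing data block from the survivors plus parity."""
    return encode_stripe(list(surviving_blocks) + [parity])

stripe = [b"\x01\x02", b"\x03\x04"]
p = encode_stripe(stripe)                      # b'\x02\x06'
assert recover_block([stripe[0]], p) == stripe[1]
```

Single parity tolerates one device failure per stripe; codes with more parity blocks tolerate proportionally more, at the cost of extra computation in the controller circuitry.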
Statement 153. Embodiments of the inventive concept include a system according to statement 152, wherein:
the system further includes a second NVMe SSD; and
the PCIe switch having the erasure coding logic includes a second connector enabling the PCIe switch having the erasure coding logic to communicate with the second NVMe SSD.
Statement 154. Embodiments of the inventive concept include a system according to statement 152, wherein:
the system further comprises:
a second NVMe SSD; and
a second PCIe switch comprising:
a second external connector enabling the second PCIe switch to communicate with the processor;
a second connector enabling the second PCIe switch to communicate with the second NVMe SSD; and
a third connector enabling the second PCIe switch to communicate with the PCIe switch having the erasure coding logic; and
the PCIe switch having the erasure coding logic includes a fourth connector enabling the PCIe switch having the erasure coding logic to communicate with the second PCIe switch,
wherein the erasure coding scheme is applied to the data stored on the NVMe SSD and the second NVMe SSD.
Statement 155. Embodiments of the inventive concept include a system according to statement 154, wherein the second PCIe switch further comprises second erasure coding logic that is disabled.
Statement 156. Embodiments of the inventive concept include a system according to statement 152, wherein the erasure coding logic comprises at least one of lookaside erasure coding logic and look-through erasure coding logic.
Statement 157. Embodiments of the inventive concept include a system according to statement 152, wherein the erasure coding logic is operable to return a response to a read request from the host based at least in part on the presence in a cache of the data requested in the read request.
Statement 158. Embodiments of the inventive concept include a system according to statement 157, wherein the erasure coding logic further includes the cache.
Statement 159. Embodiments of the inventive concept include a system according to statement 157, wherein:
the PCIe switch having the erasure coding logic is installed in a chassis; and
the chassis includes memory that is used as the cache by the erasure coding logic.
Statement 160. Embodiments of the inventive concept include a system according to statement 152, wherein the erasure coding logic is operable to return a response to a write request to the host before the write request is completed.
Statement 161. Embodiments of the inventive concept include a system according to statement 160, wherein:
the PCIe switch having the erasure coding logic further comprises a write buffer; and
the erasure coding controller is operable to store data in the write request in the write buffer.
Statement 162. Embodiments of the inventive concept include a system according to statement 152, wherein the erasure coding logic comprises lookaside erasure coding logic that includes probing logic.
Statement 163. Embodiments of the inventive concept include a system according to statement 152, wherein the erasure coding logic is operable to intercept a control transmission received at the PCIe switch and forward the control transmission to the Power Processing Unit (PPU).
Statement 164. Embodiments of the inventive concept include a system according to statement 152, wherein the erasure coding logic is operable to intercept a data transfer received at the PCIe switch from the host and replace a host Logical Block Address (LBA) used by the host with a device LBA used by the NVMe SSD in the data transfer.
Statement 165. Embodiments of the inventive concept include a system according to statement 164, wherein the erasure coding logic is further operable to direct the data transfer to the NVMe SSD.
Statement 166. Embodiments of the inventive concept include a system according to statement 152, wherein the erasure coding logic is operable to intercept a data transfer received at the PCIe switch from the NVMe SSD and replace a device LBA used by the NVMe SSD with a host LBA used by the host in the data transfer.
Statement 167. Embodiments of the inventive concept include a system according to statement 152, wherein the erasure coding logic defines a virtual storage device spanning the NVMe SSD and a second NVMe SSD.
Statement 168. Embodiments of the inventive concept include a system according to statement 152, wherein the PCIe switch having the erasure coding logic is operable to enable the erasure coding logic based at least in part on the NVMe SSD being able to be used with the erasure coding logic.
Statement 169. Embodiments of the inventive concept include a system according to statement 152, further comprising a second device connected to the PCIe switch having the erasure coding logic.
Statement 170. Embodiments of the inventive concept include a system according to statement 169, wherein the second device comprises at least one of a storage device, an SSD with a Field Programmable Gate Array (FPGA), and a Graphics Processing Unit (GPU).
Statement 171. Embodiments of the inventive concept include a system according to statement 169, wherein:
the second device is not capable of being used with the erasure coding logic; and
the PCIe switch having the erasure coding logic is operable to disable the erasure coding logic based at least in part on the second device being unable to be used with the erasure coding logic.
Statement 172. Embodiments of the inventive concept include a system according to statement 169, wherein:
the second device is not capable of being used with the erasure coding logic; and
the PCIe switch having the erasure coding logic is operable to enable the erasure coding logic based at least in part on the NVMe SSD being able to be used with the erasure coding logic and to enable access to the second device without using the erasure coding logic.
Statement 173. Embodiments of the inventive concept include a system according to statement 169, wherein:
the second device is not capable of being used with the erasure coding logic; and
the PCIe switch having the erasure coding logic is operable to enable the erasure coding logic and disable access to the second device based at least in part on the NVMe SSD being able to be used with the erasure coding logic.
Accordingly, this detailed description and the accompanying material, with respect to the various permutations of the embodiments described herein, are intended to be illustrative only and should not be taken as limiting the scope of the inventive concept. The inventive concept is therefore intended to cover all such modifications as fall within the scope and spirit of the following claims and their equivalents.

Claims (20)

1. A computer system, comprising:
a non-volatile memory express solid state drive;
a field programmable gate array implementing one or more functions supporting the non-volatile memory express solid state drive, the one or more functions including at least one of data acceleration, deduplication, data integrity, data encryption, and data compression; and
a peripheral component interconnect express switch,
wherein the peripheral component interconnect express switch is in communication with the field programmable gate array and the non-volatile memory express solid state drive.
2. The computer system of claim 1, wherein the peripheral component interconnect express switch comprises erasure coding logic comprising an erasure coding controller.
3. The computer system of claim 2, wherein the erasure coding logic is operative to return a response to a read request from a host based on the presence in a cache of at least a portion of data requested in the read request.
4. The computer system of claim 2, wherein the erasure coding logic is operative to return a response to a write request to a host prior to completion of the write request.
5. The computer system of claim 2, wherein the erasure coding logic comprises lookaside erasure coding logic, the lookaside erasure coding logic comprising probing logic.
6. The computer system of claim 2, wherein the erasure coding logic is operative to intercept a data transfer received at the peripheral component interconnect express switch from a host and replace a host logical block address used by the host in the data transfer with a device logical block address of the non-volatile memory express solid state drive.
7. The computer system of claim 2, wherein the peripheral component interconnect express switch is operative to enable the erasure coding logic based at least in part on the non-volatile memory express solid state drive not including native erasure coding logic therein.
8. The computer system of claim 2, further comprising a second device connected to the peripheral component interconnect express switch having the erasure coding logic.
9. The computer system of claim 8, wherein:
the second device comprises at least one of a non-storage device and a storage device having native erasure coding logic; and
the peripheral component interconnect express switch is operative to disable the erasure coding logic based at least in part on the second device.
10. The computer system of claim 8, wherein:
the second device comprises at least one of a non-storage device and a storage device having native erasure coding logic; and
the peripheral component interconnect express switch is operative to enable the erasure coding logic based at least in part on the non-volatile memory express solid state drive not including native erasure coding logic, and to enable access to the second device without using the erasure coding logic.
11. A computer system, comprising:
a non-volatile memory express solid state drive; and
a field programmable gate array including a first field programmable gate array portion and a second field programmable gate array portion, the first field programmable gate array portion implementing one or more functions supporting the non-volatile memory express solid state drive, the one or more functions including at least one of data acceleration, deduplication, data integrity, data encryption, and data compression, and the second field programmable gate array portion implementing a peripheral component interconnect express switch,
wherein the peripheral component interconnect express switch is in communication with the field programmable gate array and the non-volatile memory express solid state drive, and
wherein the field programmable gate array and the non-volatile memory express solid state drive are located inside a common housing.
12. The computer system of claim 11, wherein the peripheral component interconnect express switch comprises erasure coding logic comprising an erasure coding controller.
13. The computer system of claim 12, wherein the erasure coding logic comprises at least one of lookaside erasure coding logic and look-through erasure coding logic.
14. The computer system of claim 12, wherein the erasure coding logic is operative to return a response to a read request from a host based at least in part on the presence in a cache of data requested in the read request.
15. The computer system of claim 12, wherein the erasure coding logic is operative to return a response to a write request to a host prior to completion of the write request.
16. The computer system of claim 12, wherein the erasure coding logic comprises lookaside erasure coding logic, the lookaside erasure coding logic comprising probing logic.
17. The computer system of claim 12, wherein the erasure coding logic is operative to intercept a data transfer received at the peripheral component interconnect express switch from a host and to replace in the data transfer a host logical block address used by the host with a device logical block address used by the non-volatile memory express solid state drive.
18. The computer system of claim 12, wherein the erasure coding logic is operative to intercept a data transfer received at the peripheral component interconnect express switch from the non-volatile memory express solid state drive and to replace in the data transfer a device logical block address used by the non-volatile memory express solid state drive with a host logical block address used by a host.
19. The computer system of claim 12, wherein the peripheral component interconnect express switch having the erasure coding logic is operative to enable the erasure coding logic based at least in part on the non-volatile memory express solid state drive not including native erasure coding logic therein.
20. The computer system of claim 12, wherein the peripheral component interconnect express switch having the erasure coding logic is operative to disable the erasure coding logic based at least in part on the non-volatile memory express solid state drive including native erasure coding logic therein.
CN201910951173.4A 2018-10-12 2019-10-08 Computer system Active CN111045597B (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US201862745261P 2018-10-12 2018-10-12
US62/745,261 2018-10-12
US16/207,080 2018-11-30
US16/207,080 US10635609B2 (en) 2018-03-02 2018-11-30 Method for supporting erasure code data protection with embedded PCIE switch inside FPGA+SSD
US16/226,629 2018-12-19
US16/226,629 US10838885B2 (en) 2018-03-02 2018-12-19 Method for supporting erasure code data protection with embedded PCIE switch inside FPGA+SSD
US16/260,087 2019-01-28
US16/260,087 US11860672B2 (en) 2018-03-02 2019-01-28 Method for supporting erasure code data protection with embedded PCIE switch inside FPGA+SSD

Publications (2)

Publication Number Publication Date
CN111045597A true CN111045597A (en) 2020-04-21
CN111045597B CN111045597B (en) 2024-08-20

Family

ID=70219044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910951173.4A Active CN111045597B (en) 2018-10-12 2019-10-08 Computer system

Country Status (4)

Country Link
JP (1) JP7370801B2 (en)
KR (1) KR20200041815A (en)
CN (1) CN111045597B (en)
TW (1) TWI791880B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102225577B1 (en) * 2020-08-21 2021-03-09 (주)테온 Method and device for distributed storage of data using hybrid storage
JP2023001494A (en) * 2021-06-21 2023-01-06 キオクシア株式会社 Memory system and control method
TWI784804B (en) * 2021-11-19 2022-11-21 群聯電子股份有限公司 Retiming circuit module, signal transmission system and signal transmission method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130232293A1 (en) * 2012-03-05 2013-09-05 Nguyen P. Nguyen High performance storage technology with off the shelf storage components
US8572320B1 (en) * 2009-01-23 2013-10-29 Cypress Semiconductor Corporation Memory devices and systems including cache devices for memory modules
US20140022849A1 (en) * 2012-06-20 2014-01-23 IISC8 Inc Solid State Drive Memory Device Comprising Secure Erase Function
US20140337540A1 (en) * 2013-05-08 2014-11-13 Lsi Corporation METHOD AND SYSTEM FOR I/O FLOW MANAGEMENT FOR PCIe DEVICES
US20160085458A1 (en) * 2014-09-23 2016-03-24 HGST Netherlands B.V. SYSTEM AND METHOD FOR CONTROLLING VARIOUS ASPECTS OF PCIe DIRECT ATTACHED NONVOLATILE MEMORY STORAGE SUBSYSTEMS
US20160259597A1 (en) * 2015-03-02 2016-09-08 Fred WORLEY Solid state drive multi-card adapter with integrated processing
WO2018086171A1 (en) * 2016-11-10 2018-05-17 苏州韦科韬信息技术有限公司 Pcie interface-based solid-state hard disk security system and method
US10007443B1 (en) * 2016-03-31 2018-06-26 EMC IP Holding Company LLC Host to device I/O flow
CN108334285A (en) * 2017-01-20 2018-07-27 三星电子株式会社 The method of storage system and operation storage system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819517A (en) * 2011-06-08 2012-12-12 鸿富锦精密工业(深圳)有限公司 PCIE (peripheral component interconnect-express) interface card
JP2014063497A (en) * 2012-09-21 2014-04-10 Plx Technology Inc Pci express switch with logical device capability
US8954657B1 (en) * 2013-09-27 2015-02-10 Avalanche Technology, Inc. Storage processor managing solid state disk array
US9336173B1 (en) * 2013-12-20 2016-05-10 Microsemi Storage Solutions (U.S.), Inc. Method and switch for transferring transactions between switch domains
TW201823916A (en) * 2016-12-27 2018-07-01 英業達股份有限公司 Server system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148227A (en) * 2020-09-25 2020-12-29 中国科学院空天信息创新研究院 Storage device and information processing method
CN112148227B (en) * 2020-09-25 2023-03-24 中国科学院空天信息创新研究院 Storage device and information processing method
CN112732477A (en) * 2021-04-01 2021-04-30 四川华鲲振宇智能科技有限责任公司 Method for fault isolation by out-of-band self-checking

Also Published As

Publication number Publication date
JP2020061149A (en) 2020-04-16
KR20200041815A (en) 2020-04-22
TWI791880B (en) 2023-02-11
JP7370801B2 (en) 2023-10-30
CN111045597B (en) 2024-08-20
TW202020675A (en) 2020-06-01

Similar Documents

Publication Publication Date Title
US10838885B2 (en) Method for supporting erasure code data protection with embedded PCIE switch inside FPGA+SSD
US11797181B2 (en) Hardware accessible external memory
US11360679B2 (en) Paging of external memory
CN111045597B (en) Computer system
US9684591B2 (en) Storage system and storage apparatus
TWI591512B (en) Storage system and method of storage protection
US8560772B1 (en) System and method for data migration between high-performance computing architectures and data storage devices
US8074017B2 (en) On-disk caching for raid systems
US20170220249A1 (en) Systems and Methods to Maintain Consistent High Availability and Performance in Storage Area Networks
US11782634B2 (en) Dynamic use of non-volatile ram as memory and storage on a storage system
JP2007524932A (en) Method, system, and program for generating parity data
US20240095196A1 (en) Method for supporting erasure code data protection with embedded pcie switch inside fpga+ssd

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant