WO2022007225A1 - Data storage method, storage system, storage device, and storage medium - Google Patents

Data storage method, storage system, storage device, and storage medium

Info

Publication number
WO2022007225A1
WO2022007225A1 (PCT/CN2020/120835, CN2020120835W)
Authority
WO
WIPO (PCT)
Prior art keywords
check
units
data
storage
medium layer
Prior art date
Application number
PCT/CN2020/120835
Other languages
English (en)
French (fr)
Inventor
付克博
陈亮
董如良
吴祥
罗小东
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP20944002.3A (EP4170499A4)
Publication of WO2022007225A1
Priority to US18/149,288 (US20230137007A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0616Improving the reliability of storage systems in relation to life time, e.g. increasing Mean Time Between Failures [MTBF]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0619Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0647Migration mechanisms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0688Non-volatile semiconductor memory arrays
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0689Disk arrays, e.g. RAID, JBOD
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/35Unequal or adaptive error protection, e.g. by providing a different level of protection according to significance of source information or by adapting the coding according to the change of transmission channel characteristics
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/37Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35
    • H03M13/373Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35 with erasure correction and erasure determination, e.g. for packet loss recovery or setting of erasures for the decoding of Reed-Solomon codes

Definitions

  • the embodiments of the present application relate to the technical field of data storage, and in particular, to a data storage method, a storage system, a storage device, and a storage medium.
  • EC erasure code
  • multi-level media storage system refers to a storage system including a variety of storage media with different performances.
  • DRAM dynamic random access memory
  • SCM storage class memory
  • Nand flash memory. Among these, DRAM has higher performance and lower latency, while the performance and latency of SCM are between those of DRAM and Nand flash memory.
  • Accordingly, DRAM and SCM are referred to as the high-performance media layer, and Nand flash is referred to as the low-performance media layer.
  • In the related art, the same erasure code ratio is used for data storage in both the high-performance media layer and the low-performance media layer, which leads to low storage space utilization.
  • Embodiments of the present application provide a data storage method, a storage system, a storage device, and a storage medium, which can improve the storage space utilization of the storage system.
  • the technical solution is as follows:
  • N is not equal to X
  • M can be equal to Y or not equal to Y.
  • the performance of the first medium layer is different from that of the second medium layer.
  • the performance of the first medium layer is higher than that of the second medium layer, and data is stored in the first medium layer according to the first erasure code ratio.
  • the data stored in the first medium layer is migrated to the second medium layer according to the second erasure code ratio.
  • the first medium layer may be a storage medium such as DRAM or SCM used as a memory
  • the second medium layer may be a persistent storage medium such as a solid state disk or a mechanical hard disk.
  • the erasure code ratio also changes as the data is migrated, so that the erasure code ratio matches the performance of the storage medium.
  • In one implementation, the process of storing data in the first medium layer according to the first erasure code ratio and then migrating the data stored in the first medium layer to the second medium layer according to the second erasure code ratio is a data reading process. That is, the data to be read is first stored in the first medium layer according to the first erasure code ratio; after the set condition is satisfied, the data stored in the first medium layer is read into the second medium layer, where it is stored according to the second erasure code ratio. After that, the data read into the second medium layer is fed back to the user.
  • the performance of the medium layer mentioned above is characterized by parameters such as the read/write speed and time delay of the medium layer.
  • the performance of the first medium layer is higher than that of the second medium layer, and the ratio of N to M is smaller than the ratio of X to Y, where N is less than X. That is to say, a smaller erasure code ratio is used to store data in the high-performance medium layer, and a larger erasure code ratio is used to store data in the low-performance medium layer. Since the input/output (IO) granularity received by the high-performance medium layer is relatively small, storing data with a smaller erasure code ratio makes it easier to fill up the stripes, which reduces write amplification and improves storage space utilization. In addition, using a larger erasure code ratio to store data in the low-performance medium layer reduces the proportion of redundant data in the storage space, thereby also improving storage space utilization.
  • IO input/output
  • the data stored in the first medium layer includes a plurality of sub-data, each sub-data corresponds to a first check matrix, and each first check matrix includes the N data units and the M check units. Thus, this implementation provides a specific data storage method.
  • the so-called check matrix is the set of all data units and check units contained in a stripe.
  • the matrix contains multiple elements, and each element is either a data unit or a check unit.
  • the first check matrix refers to a matrix corresponding to data stored in the first medium layer
  • the second check matrix refers to a matrix corresponding to data stored in the second medium layer.
  • the first medium layer stores 4 sub-data
  • each sub-data corresponds to a first check matrix
  • each first check matrix includes 6 data units and 2 check units.
  • the data stored in the first medium layer according to the first erasure code ratio can be directly merged into data that conforms to the second erasure code ratio and then stored in the second medium layer; that is, data can be merged or split between different media layers, saving the computing resources of the storage system.
  • calculating the Y check units includes calculating the Y check units according to the X data units.
  • When the Y check units are calculated from the X data units, the number of check units in the second check matrix is relatively flexible and is not limited by the first check matrix described above.
  • the calculating to obtain Y check units includes calculating the Y check units according to M check units included in each first check matrix, where Y is equal to the M.
  • the check units in the second check matrix are generated from the check units included in the first check matrices, which reduces the amount of computation compared with recalculating the Y check units from the X data units.
  • the Y check units in the second check matrix can be obtained by a simple XOR operation based on the M check units in each of the first check matrices, so the amount of calculation is small.
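  • As a concrete but non-normative illustration of the XOR aggregation just described, the following Python sketch XORs the corresponding check units of w first check matrices to obtain the check units of the second check matrix. The unit contents, sizes, and variable names are assumptions for illustration only.

```python
# Illustrative sketch only: the check unit p' (and likewise q') of the second
# check matrix is obtained by XOR-ing the corresponding check unit of each of
# the w first check matrices. Sizes and contents below are made-up examples.

def xor_units(units):
    """XOR a list of equally sized byte strings into a single unit."""
    out = bytearray(len(units[0]))
    for unit in units:
        for i, byte in enumerate(unit):
            out[i] ^= byte
    return bytes(out)

w, unit_size = 4, 8                                                  # assumed: 4 first check matrices
p_units = [bytes([n]) * unit_size for n in range(1, w + 1)]          # one check unit p per first matrix
q_units = [bytes([n * 16]) * unit_size for n in range(1, w + 1)]     # one check unit q per first matrix

p_prime = xor_units(p_units)   # check unit p' of the second check matrix
q_prime = xor_units(q_units)   # check unit q' of the second check matrix
print(p_prime.hex(), q_prime.hex())
```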
  • the value of Y is equal to the value of M in most cases, but it does not rule out that Y is not equal to M in some scenarios.
  • when M is greater than or equal to 2, the M check units include a check unit p and a check unit q, where the check unit p in each first check matrix is stored in a first storage node of the storage system and the check unit q in each first check matrix is stored in a second storage node of the storage system. The method further includes: obtaining the check unit p in each first check matrix from the first storage node, and obtaining the check unit q in each first check matrix from the second storage node. Calculating the Y check units according to the M check units included in each first check matrix specifically includes: calculating the Y check units according to the check unit p in each first check matrix and the check unit q in each first check matrix, where the Y check units include the check unit p' and the check unit q' of the second check matrix.
  • the first erasure code ratio or the second erasure code ratio is determined according to the topology of the storage system and the fault tolerance capability of the storage system, where the topology is used to indicate the number of storage nodes included in the storage system, and the fault tolerance capability is used to indicate the number of storage node failures that the storage system can tolerate.
  • the first erasure code ratio and the second erasure code ratio may both be determined according to the topology and fault tolerance capability of the storage system; alternatively, the first erasure code ratio may be determined according to the topology and fault tolerance capability and the second erasure code ratio derived from the first erasure code ratio; or the second erasure code ratio may be determined according to the topology and fault tolerance capability and the first erasure code ratio derived from the second erasure code ratio. This is not limited in this embodiment of the present application.
  • the number of storage node failures that the storage system can tolerate is equal to the number of check units corresponding to the first erasure code ratio, or to the number of check units corresponding to the second erasure code ratio.
  • the management node acquires the topology and fault tolerance capability of the storage system, and then determines the value range of the number of data units in the first erasure code ratio by the formula N ≤ (k*M) - M.
  • N is the number of data units corresponding to the first erasure code ratio.
  • k is the number of nodes included in the storage system.
  • when the security level is server-level security, the above-mentioned nodes refer to servers; when the security level is cabinet-level security, the above-mentioned nodes refer to cabinets.
  • M is the number of node failures that the storage system can tolerate, as indicated by the fault tolerance capability, that is, the number of check units in the first erasure code ratio.
  • write amplification means that the amount of data actually written by the storage node is greater than the amount of data received from the computing node.
  • the write amplification value corresponding to a first candidate erasure code ratio is equal to the ratio between the total number of data units and check units corresponding to that candidate ratio and its number of data units. For example, for an erasure code ratio of 6:2, the corresponding write amplification value is (6+2)/6.
  • optionally, the management node may acquire the topology and fault tolerance capability of the storage system, determine the value range of the number of data units in the second erasure code ratio by the formula X ≤ (i*Y) - Y, and then determine the second erasure code ratio according to the value range of the number of data units and Y.
  • X is the number of data units corresponding to the second erasure code ratio, and X is greater than N.
  • i is the number of child nodes of the nodes included in the storage system. When the security level is server-level security, i is the number of child nodes of the servers included in the storage system, where a child node of a server may refer to a physical hard disk attached to the server or a logical domain of hard disks.
  • when the security level is cabinet-level security, i is the number of child nodes of the cabinets included in the storage system, where the child nodes of a cabinet refer to the servers included in the cabinet.
  • Y is the number of node failures that the storage system can tolerate, as indicated by the fault tolerance capability, that is, Y is the number of check units corresponding to the second erasure code ratio.
  • Y and M may be equal or may not be equal, which is not limited in this embodiment of the present application.
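  • To make the two value-range formulas and the write amplification figure concrete, the sketch below evaluates them for one assumed topology. The values k = 4 storage nodes and M = Y = 2 check units follow the examples in this application; reading i as the total number of child nodes (4 servers x 4 hard-disk logical domains = 16) is an assumption chosen so that X can be greater than N.

```python
# Non-normative sketch of N <= (k*M) - M, X <= (i*Y) - Y and of the write
# amplification value (data units + check units) / data units.

def max_data_units(node_count, check_units):
    # upper bound on the number of data units given the node count and check units
    return node_count * check_units - check_units

def write_amplification(data_units, check_units):
    # e.g. (6 + 2) / 6 for a 6:2 erasure code ratio
    return (data_units + check_units) / data_units

k, M = 4, 2    # assumed: 4 storage nodes, 2 check units in the first erasure code ratio
i, Y = 16, 2   # assumed: 16 child nodes in total, 2 check units in the second ratio

n_max = max_data_units(k, M)   # 6  -> a 6:2 first ratio fits the bound
x_max = max_data_units(i, Y)   # 30 -> e.g. a 24:2 second ratio fits the bound
print("N <=", n_max, " write amplification of 6:2  =", round(write_amplification(6, M), 3))
print("X <=", x_max, " write amplification of 24:2 =", round(write_amplification(24, Y), 3))
```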
  • the security level may be obtained by the management node from user configuration, or may be determined by the management node from the topology of the storage system according to a preset principle, where the preset principle refers to a calculation principle that can guarantee the reliability of the storage system; this is not limited in this embodiment of the present application.
  • the security level includes server-level security, cabinet-level security, and the like. Among them, server-level security is used to indicate that the storage system can tolerate the failure of at most t servers. Cabinet-level security is used to indicate that the storage system can tolerate up to t cabinet failures.
  • the set condition includes that the available capacity of the first medium layer reaches a capacity threshold. Assuming that the performance of the first medium layer is higher than that of the second medium layer, the capacity of the first medium layer is often smaller than that of the second medium layer, and the first medium layer is also more precious than the second medium layer, because data stored in the first medium layer can be read and written faster and with lower latency. Therefore, when the available capacity of the first medium layer reaches the capacity threshold, the data in the first medium layer is migrated to the second medium layer, so that the first medium layer can make room for processing more data.
  • the so-called migration to the second medium layer means that the data is copied to the second medium layer, regardless of whether the data originally stored in the first medium layer is deleted immediately.
  • the set condition includes that an access frequency of the data stored in the first medium layer is lower than a heat threshold.
  • when the performance of the first medium layer is higher than that of the second medium layer, data with a high access frequency can remain in the first medium layer, while data with a low access frequency is migrated to the second medium layer. This not only saves the storage space of the high-performance storage medium, but also improves the read and write performance for frequently accessed data.
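  • The following is a minimal sketch of the two set conditions described above (capacity threshold and access-frequency threshold). The threshold values, field names, and the decision to combine the two conditions with a logical OR are illustrative assumptions, not part of the claimed method.

```python
# Minimal sketch of the migration trigger; all names and thresholds are assumptions.

def should_migrate(used_bytes, total_bytes, accesses_in_window,
                   capacity_threshold=0.8, heat_threshold=10):
    """Return True when data in the first medium layer should be migrated
    to the second medium layer."""
    capacity_reached = used_bytes / total_bytes >= capacity_threshold   # capacity condition
    data_is_cold = accesses_in_window < heat_threshold                  # access-frequency condition
    return capacity_reached or data_is_cold

# Example: the first medium layer is 85% full, so migration is triggered.
print(should_migrate(used_bytes=85, total_bytes=100, accesses_in_window=50))
```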
  • In a second aspect, a storage system is provided. The storage system includes a first medium layer and a second medium layer, and the performance of the first medium layer is different from that of the second medium layer. Data is stored in the first medium layer according to the first erasure code ratio;
  • when the set condition is satisfied, the data in the first medium layer is migrated to the second medium layer according to the second erasure code ratio.
  • the first erasure code ratio corresponds to N data units and M check units, N and M are integers greater than or equal to 1; the second erasure code ratio corresponds to X data units and Y check units, X and Y are integers greater than or equal to 1; and the ratio of N to M is not equal to the ratio of X to Y.
  • N is not equal to X
  • M can be equal to Y or not equal to Y.
  • the storage system provided by the second aspect can be either a distributed storage system or a centralized storage system.
  • the storage system includes multiple storage nodes.
  • Each storage node includes memory and hard disks.
  • the memory is the first medium layer, such as DRAM and/or SCM
  • the hard disk is the second medium layer, such as a solid state disk or a mechanical hard disk.
  • alternatively, the first medium layer may also be a solid-state disk, and in this case, the second medium layer refers to a mechanical hard disk or another storage medium with lower performance.
  • the memory included in each storage node serves as the first medium layer
  • the hard disk included in each storage node serves as the second medium layer
  • the performance of the first medium layer is higher than that of the second medium layer.
  • Any one of the multiple storage nodes can receive data sent by the host or client. Taking the first storage node as an example, the data is temporarily stored in the memory of the first storage node. When the amount of data in the memory of the node reaches the size of the data units included in the first check matrix, the first storage node divides the data into N data units according to the first erasure code ratio, calculates the M check units of the N data units, and then stores the N data units and the M check units in a distributed manner across the storage nodes (which may include the first storage node itself) of the storage system.
  • Similarly, the first storage node continues to receive data sent by the client, and when the data in its memory again reaches the size of the data units included in the first check matrix, it continues to distribute the N data units and M check units corresponding to that data across the storage nodes. When the set condition is satisfied, each storage node can migrate the data in its memory to the hard disk, which means that the data units contained in the memory of each storage node are migrated to the hard disk.
  • In addition, the first storage node may obtain the Y check units corresponding to these data units (for example, X data units) according to the second erasure code ratio, and write the Y check units to hard disks (which may be the local hard disk of the first storage node or the hard disks of other storage nodes), thereby realizing that the data is stored in the second medium layer according to the second erasure code ratio.
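  • The sketch below walks through the write path just described, assuming a 6:2 first erasure code ratio, 4-byte units, and 4 storage nodes: buffer incoming data in memory, cut a full stripe into N data units, compute M check units, and spread the N + M units across the nodes. The second check unit here is only a bytewise stand-in; a real 6:2 stripe would use a proper erasure code such as Reed-Solomon.

```python
# Illustrative sketch of building and placing one first-check-matrix stripe.
# All sizes, node counts, and the check-unit computation are assumptions.

N, M, UNIT_SIZE, NODES = 6, 2, 4, 4   # assumed first ratio 6:2, 4-byte units, 4 nodes

def xor_bytes(blocks):
    out = bytearray(UNIT_SIZE)
    for block in blocks:
        for idx, byte in enumerate(block):
            out[idx] ^= byte
    return bytes(out)

def build_first_stripe(buffered: bytes):
    """Split a full memory buffer into N data units and compute M check units."""
    data_units = [buffered[i * UNIT_SIZE:(i + 1) * UNIT_SIZE] for i in range(N)]
    p = xor_bytes(data_units)                                   # simple XOR parity
    q = xor_bytes(bytes(b ^ (i + 1) for b in u)                 # stand-in only, not real RS parity
                  for i, u in enumerate(data_units))
    return data_units + [p, q]

def place_units(units):
    """Round-robin the N + M units of the stripe across the storage nodes."""
    return {f"unit_{i}": f"storage_node_{i % NODES}" for i in range(len(units))}

stripe = build_first_stripe(bytes(range(N * UNIT_SIZE)))   # a full 24-byte memory buffer
print(place_units(stripe))
```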
  • The Y check units may be obtained by calculation from the X data units; in other implementations, the Y check units may be obtained by calculation from the M check units of the multiple first check matrices.
  • the performance of the first medium layer is higher than that of the second medium layer, and the ratio of N to M is smaller than the ratio of X to Y, where N is less than X. That is to say, a smaller erasure code ratio is used to store data in the high-performance medium layer, and a larger erasure code ratio is used to store data in the low-performance medium layer. Since the input/output (IO) granularity received by the high-performance medium layer is relatively small, storing data with a smaller erasure code ratio makes it easier to fill up the stripes, which reduces write amplification and improves storage space utilization. In addition, using a larger erasure code ratio to store data in the low-performance medium layer reduces the proportion of redundant data in the storage space, thereby also improving storage space utilization.
  • IO input/output
  • the data stored in the first medium layer includes a plurality of sub-data, each sub-data corresponds to a first check matrix, and each first check matrix includes the N data units and the M check units. That is, in this implementation manner, the check matrix is the set of all data units and check units included in a stripe, and the matrix includes multiple elements, each element being a data unit or a check unit.
  • the first check matrix refers to a matrix corresponding to data stored in the first medium layer
  • the second check matrix refers to a matrix corresponding to data stored in the second medium layer.
  • the first medium layer stores 4 sub-data
  • each sub-data corresponds to a first check matrix
  • each first check matrix includes 6 data units and 2 check units.
  • the data in the second medium layer corresponds to a second check matrix
  • the second check matrix includes the X data units and the Y check units, where the X data units come from the N data units included in each first check matrix, and X is an integer multiple of N.
  • the data units and check units in the multiple first check matrices are stored in a scattered manner in the memory of each storage node of the storage system. After the set condition is satisfied, each storage node stores the data units of the first check matrices held in its own memory to its own hard disk, and these data units form the X data units of the second check matrix.
  • a certain storage node among the multiple storage nodes can also generate the Y check units according to the check units in the first check matrices and store the Y check units in its own hard disk, so that the data stored in the first medium layer is stored in the second medium layer according to the second erasure code ratio.
  • the data stored in the first medium layer according to the first erasure code ratio can be directly merged into data that conforms to the second erasure code ratio and then stored in the second medium layer; that is, data can be merged or split between different media layers, saving the computing resources of the storage system.
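  • As a rough illustration of this merge (and of why it improves space utilization), the sketch below concatenates the data units of w = 4 first check matrices with an assumed 6:2 ratio into the X = 24 data units of a second check matrix with an assumed 24:2 ratio, and compares the redundancy overhead. The concrete ratios are assumptions consistent with the examples in this application.

```python
# Non-normative sketch: the X data units of the second check matrix are simply the
# data units of the w first check matrices, taken over without splitting them.

w, N, M, Y = 4, 6, 2, 2
X = w * N                                      # 24 data units in the second check matrix

first_matrices = [[f"s{s}_d{d}" for d in range(N)] for s in range(w)]   # labelled data units
second_data_units = [unit for stripe in first_matrices for unit in stripe]
assert len(second_data_units) == X

utilization_first = N / (N + M)    # 0.75 -> 25% of the space holds check data
utilization_second = X / (X + Y)   # ~0.92 -> ~8% of the space holds check data
print(len(second_data_units), round(utilization_first, 2), round(utilization_second, 2))
```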
  • the Y verification units are obtained by calculation according to the X data units.
  • a storage node in the storage system taking the first storage node as an example, the first storage node obtains X data units, and calculates corresponding Y check units according to the X data units.
  • the number of check units obtained by calculation is relatively flexible, that is, Y may be equal to the number M of check units in the first check matrix, or may not be equal to M.
  • the Y check units are obtained by calculation according to the M check units included in each first check matrix, and the Y is equal to the M.
  • the first storage node obtains the M check units included in each first check matrix and, based on the obtained check units, derives the Y check units through a simple XOR operation, so that the amount of calculation is reduced compared with computing the Y check units from the X data units.
  • the value of Y is equal to the value of M in most cases, but it is not excluded that Y is not equal to M in some scenarios.
  • the M check units of each first check matrix are stored in a first storage node of the storage system, and the first storage node is configured to obtain the M check units of each first check matrix and to calculate the Y check units according to the M check units included in each first check matrix.
  • the Y check units in the second check matrix are calculated by using the M check units of each first check matrix.
  • the ratio of the first erasure code or the ratio of the second erasure code is determined according to the topology structure of the storage system and the fault tolerance capability of the storage system, wherein the The topology structure is used to indicate the number of storage nodes included in the storage system, and the fault tolerance capability is used to indicate the number of storage nodes that the storage system tolerates errors.
  • the management node may determine the first erasure code ratio or the second erasure code ratio according to the topology and fault tolerance capability of the storage system.
  • the management node may be any storage node among multiple storage nodes in the storage system, or the management node is a node in the storage system that is independent of the storage nodes and used to manage each storage node.
  • the management node may obtain the first erasure code ratio and the second erasure code ratio during initialization, or may obtain the first erasure code ratio and the second erasure code ratio during the operation of the storage system. Erasure code ratio.
  • the first erasure code ratio and the second erasure code ratio may both be determined according to the topology and fault tolerance capability of the storage system; alternatively, the first erasure code ratio may be determined according to the topology and fault tolerance capability and the second erasure code ratio derived from the first erasure code ratio; or the second erasure code ratio may be determined according to the topology and fault tolerance capability and the first erasure code ratio derived from the second erasure code ratio. This is not limited in this embodiment of the present application.
  • the number of storage node failures that the storage system can tolerate is equal to the number of check units corresponding to the first erasure code ratio, or to the number of check units corresponding to the second erasure code ratio.
  • the management node obtains the topology and fault tolerance capability of the storage system, and then determines the value range of the number of data units in the first erasure code ratio by using the formula N ≤ (k*M) - M.
  • N is the number of data units corresponding to the first erasure code ratio.
  • k is the number of nodes included in the storage system.
  • when the security level is server-level security, the above-mentioned nodes refer to servers; when the security level is cabinet-level security, the above-mentioned nodes refer to cabinets.
  • M is the number of node failures that the storage system can tolerate, as indicated by the fault tolerance capability, that is, the number of check units in the first erasure code ratio.
  • the management node obtains the topology and fault tolerance capability of the storage system, determines the value range of the number of data units in the second erasure code ratio by using the formula X ≤ (i*Y) - Y, and then determines the second erasure code ratio according to the value range of the number of data units and Y.
  • X is the number of data units corresponding to the second erasure code ratio, and X is greater than N.
  • i is the number of child nodes of the nodes included in the storage system. When the security level is server-level security, i is the number of child nodes of the servers included in the storage system, where a child node of a server may refer to a physical hard disk attached to the server or a logical domain of hard disks.
  • when the security level is cabinet-level security, i is the number of child nodes of the cabinets included in the storage system, where the child nodes of a cabinet refer to the servers included in the cabinet.
  • Y is the number of node failures that the storage system can tolerate, as indicated by the fault tolerance capability, that is, Y is the number of check units corresponding to the second erasure code ratio.
  • Y and M may be equal or may not be equal, which is not limited in this embodiment of the present application.
  • the security level may be obtained by the management node after user configuration, or may also be determined by the management node according to the topology structure of the storage system according to a preset principle, wherein the preset principle It refers to a calculation principle that can guarantee the reliability of the storage system, which is not limited in this embodiment of the present application.
  • the security level includes server-level security, cabinet-level security, and the like. Among them, server-level security is used to indicate that the storage system can tolerate the failure of at most t servers. Cabinet-level security is used to indicate that the storage system can tolerate up to t cabinet failures.
  • the set condition includes that the available capacity of the first medium layer reaches a capacity threshold.
  • the set condition includes that an access frequency of the data stored in the first medium layer is lower than a heat threshold.
  • the storage system provided in the second aspect may also be a centralized storage system, such as a storage array or other storage device.
  • the storage array includes one or more controllers, and each controller is coupled to a disk enclosure.
  • the memory in each controller constitutes the first media layer
  • the hard disks in the hard disk enclosure constitute the second media layer.
  • Any controller can receive data sent by the host or client. Taking the first controller as an example, the data is temporarily stored in the memory of the first controller.
  • When the amount of data in the memory of the first controller reaches the size of the data units included in the first check matrix, the first controller divides the data into N data units according to the first erasure code ratio, calculates the M check units of the N data units, and then stores the N data units and M check units in a distributed manner across the controllers (which may include the first controller itself) of the storage system. Similarly, the first controller continues to receive data sent by the host, and when the data in its memory again reaches the size of the data units contained in the first check matrix, it continues to distribute the N data units and M check units corresponding to that data across the controllers.
  • each controller can migrate the data in its memory to the hard disk, which means that the data units contained in the memory of each controller are migrated to the hard disk.
  • the first controller can obtain the Y check units corresponding to these data units (for example, X data units) according to the second erasure code ratio and write the check units to the hard disks, thereby realizing that the data is stored in the second medium layer according to the second erasure code ratio.
  • A third aspect provides a storage device. The storage device may be a storage node in the distributed storage system provided in the second aspect, an independent server, or a hard disk enclosure, that is, a device that has both storage capability and computing capability. It includes a first medium layer, a second medium layer, and a processor.
  • the processor is used for: receiving data; storing the data in the first medium layer according to the first erasure code ratio; and, when the set condition is satisfied, acquiring the data from the first medium layer and migrating the data to the second medium layer according to the second erasure code ratio. The first erasure code ratio corresponds to N data units and M check units, N and M are integers greater than or equal to 1; the second erasure code ratio corresponds to X data units and Y check units, X and Y are integers greater than or equal to 1; and the ratio of N to M is not equal to the ratio of X to Y. Among them, N is not equal to X, and M can be equal to Y or not equal to Y.
  • the performance of the first medium layer is higher than that of the second medium layer, and the ratio of N to M is smaller than the ratio of X to Y.
  • the data stored in the first medium layer includes a plurality of sub-data, each sub-data corresponds to a first check matrix, and each first check matrix includes the N data units and the M check units.
  • migrating the data to the second medium layer according to the second erasure code ratio includes: obtaining the X data units from the N data units included in each first check matrix, where X is an integer multiple of N; calculating the Y check units to generate a second check matrix, where the second check matrix includes the X data units and the Y check units; and writing the second check matrix into the second medium layer.
  • calculating the Y check units includes calculating the Y check units according to the X data units.
  • the calculating to obtain Y check units includes calculating the Y check units according to M check units included in each first check matrix, where Y is equal to the M.
  • the set condition includes that the available capacity of the first medium layer reaches a capacity threshold.
  • the set condition includes that an access frequency of the data stored in the first medium layer is lower than a heat threshold.
  • a computer-readable storage medium is provided, where instructions are stored in the computer-readable storage medium; when the instructions are run on a computer, the computer is caused to execute the data storage method described in the first aspect.
  • a computer program product containing instructions that, when executed on a computer, cause the computer to execute the data storage method described in the first aspect above.
  • a data processing apparatus is provided. The data processing apparatus is located in the storage device described above; the storage device includes a first medium layer, a second medium layer, and a processor, and the performance of the first medium layer is different from that of the second medium layer.
  • the apparatus includes a storage module 1101 and a migration module 1102.
  • the storage module 1101 is configured to receive data and store the data in the first medium layer according to a first erasure code ratio.
  • the migration module 1102 is configured to acquire the data from the first medium layer when the set condition is satisfied, and migrate the data to the second medium layer according to the second erasure code ratio.
  • the first erasure code ratio corresponds to N data units and M check units, N and M are integers greater than or equal to 1; the second erasure code ratio corresponds to X data units and Y check units, X and Y are integers greater than or equal to 1; and the ratio of N to M is not equal to the ratio of X to Y.
  • N and M are integers greater than or equal to 1
  • X and Y are integers greater than or equal to 1
  • the ratio of N and M is not equal to the ratio of X and Y.
  • FIG. 1 is an architecture diagram of a storage system provided by an embodiment of the present application
  • FIG. 2 is an architecture diagram of another storage system provided by an embodiment of the present application.
  • FIG. 3 is a system architecture diagram of a storage device provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the distribution, in the first medium layer, of the units in a first check matrix provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of the distribution, in the first medium layer, of the units in the w first check matrices provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of a principle of obtaining a second check matrix according to merging w first check matrices according to an embodiment of the present application;
  • FIG. 9 is a schematic diagram of a check unit for obtaining a second check matrix provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of writing data in a second medium layer according to a second erasure code ratio provided by an embodiment of the present application;
  • FIG. 11 is a schematic structural diagram of a data storage device provided by an embodiment of the present application.
  • Application 101 is a general term for various applications presented to the user.
  • the client 102 is configured to receive a data access request triggered by the application 101 , interact with the storage node 20 , and send the data access request to the storage node 20 .
  • the client 102 is also configured to receive data from the storage node and forward the data to the application 101 .
  • the functions of the client 102 are implemented by the processor included in the computing node 10 running the program in the memory.
  • Client 102 may also be implemented by hardware components located within compute node 10 . Any client 102 in the computing node cluster can access any storage node 20 in the storage node cluster.
  • the storage node cluster includes one or more storage nodes 20 (three storage nodes 20 are shown in FIG. 1 , but not limited to three storage nodes 20 ), and each storage node 20 can be interconnected.
  • A storage node is, for example, a server, a desktop computer, a storage array controller, or a disk enclosure. Functionally, the storage node 20 is mainly used for computing or processing data.
  • the storage node 20 includes at least a processor, a memory and a network card.
  • SCM can provide faster read and write speeds than hard disks, but is slower than DRAM in terms of operation speed and cheaper than DRAM in cost. It should be noted that the processor can directly access the memory, for example, as shown in FIG. 2 , the processor can directly access the DRAM and the SCM.
  • the DRAM and the SCM are only exemplified in this embodiment, and in some possible cases, the memory may only include one of the DRAM and the SCM.
  • the memory may also include other random access memories, such as static random access memory (Static Random Access Memory, SRAM) and the like.
  • static random access memory Static Random Access Memory, SRAM
  • PROM Programmable Read Only Memory
  • EPROM Erasable Programmable Read Only Memory
  • the memory may also be a dual in-line memory module (Dual In-line Memory Module, DIMM for short), that is, a module composed of dynamic random access memory (DRAM).
  • DIMM Dual In-line Memory Module
  • the memory includes one type of memory as an example for description, but this does not constitute a limitation on the number of memory types included in the memory.
  • Hard disks read and write data slower than memory and are often used to store data persistently.
  • Taking the storage node 20 as an example, one or more hard disks are arranged inside it; alternatively, a hard disk enclosure (as shown in FIG. 2) is mounted outside the storage node 20, and multiple hard disks are arranged in the hard disk enclosure. Regardless of the deployment method, these hard disks can be regarded as hard disks included in the storage node 20.
  • the hard disk here may refer to a physical hard disk, or may refer to a logical domain or a fault domain including multiple physical hard disks, which is not limited in this embodiment of the present application.
  • the type of the physical hard disk is a solid-state hard disk, a mechanical hard disk, or other types of hard disks.
  • the memory and the hard disk included in the memory are two completely different storage media, and the performance of the two is completely different. Among them, compared with the hard disk, the data reading speed of the memory is faster and the delay is smaller, that is, the performance of the memory is higher than that of the hard disk. Based on this, in the embodiments of the present application, as shown in FIGS. 1 and 2 , the memory in each storage node 20 is referred to as the first medium layer, and the hard disk in each storage node 20 is referred to as the second medium layer. The performance of the first dielectric layer is higher than that of the second dielectric layer.
  • the network card is used to communicate with other storage nodes, or used to communicate with a disk enclosure coupled to the storage node.
  • the network card can directly access the memory of the storage node, as shown in Figure 2, the network card can directly access the DRAM and SCM.
  • FIG. 3 is a schematic structural diagram of another storage system provided by an embodiment of the present application.
  • the storage system shown in FIG. 3 is a storage array, and the storage array includes at least one controller (the controller 11 shown in FIG. 3 ) and a plurality of hard disks 22 .
  • the controller 11 is connected to a host (not shown in the figure) through a storage area network (English: storage area network, SAN).
  • Controller 11 may be a computing device such as a server, desktop computer, or the like.
  • An operating system and application programs are installed on the controller 11 .
  • the controller 11 can receive input-output (I/O) requests from the host.
  • the controller 11 may also store the data (if any) carried in the I/O request and write the data to the hard disk 22 .
  • the hard disk 22 is a mechanical hard disk or a solid-state hard disk, and the solid-state hard disk is a memory that uses a flash memory (English: flash memory) chip as its medium, also known as a solid state drive (SSD).
  • FIG. 3 is only an exemplary illustration.
  • the storage array may include two or more controllers.
  • the physical structure and function of each controller are similar to those of the controller 11, and this embodiment does not limit the connection mode between the controllers, or between any controller and the hard disk 22.
  • a hard disk may refer to a physical hard disk, or may refer to a logical domain or a fault domain including multiple physical hard disks, which is not limited in this embodiment of the present application.
  • the controller 11 includes an interface card 110 , a processor 112 and an interface card 113 .
  • the interface card 110 is used to communicate with the host, and the controller 11 receives the operation instruction of the host through the interface card 110 .
  • the processor 112 may be a central processing unit (English: central processing unit, CPU). In this embodiment of the present application, the processor 112 is configured to receive an I/O request from a host and process the I/O request.
  • the I/O request is a data write request or a data read request, and the processor 112 may also send the data in the write data request to the hard disk 22 .
  • the interface card 113 is used to communicate with the hard disk 22 , and the controller 11 sends a write data request (including data, logical address of the data, and virtual address of the data) to the hard disk 22 for storage through the interface card 113 .
  • the controller 11 further includes a memory 111 .
  • the memory 111 is used to temporarily store data received from the host or data read from the hard disk 22 .
  • the controller 11 may temporarily store the data in the multiple write data requests in the memory 111 .
  • When the capacity of the memory 111 reaches a certain threshold, the data stored in the memory 111, the virtual address of the data, and the logical address allocated for the data are sent to the hard disk 22.
  • the hard disk 22 stores the received data.
  • the memory 111 includes volatile memory, flash memory chips, or a combination thereof.
  • the volatile memory is, for example, random-access memory (English: random-access memory, RAM).
  • Flash memory chips are various machine-readable media that can store program codes, such as floppy disks, hard disks, and optical disks.
  • the memory 111 has a power protection function, which means that when the system is powered off and then powered on again, the data stored in the memory 111 will not be lost.
  • the memory 111 and the hard disk 22 included in the controller 11 are two completely different storage media. Among them, compared with the hard disk, the data reading speed of the memory is faster and the delay is smaller, that is, the performance of the memory is higher than that of the hard disk.
  • In this embodiment of the present application, the memory 111 with higher performance is referred to as the first medium layer,
  • and the plurality of hard disks 22 with lower performance relative to the memory are referred to as the second medium layer; that is, the performance of the first medium layer is higher than that of the second medium layer.
  • Alternatively, the memory 111 with higher performance may be referred to as the second medium layer,
  • and the plurality of hard disks 22 with lower performance relative to the memory may be referred to as the first medium layer.
  • In that case, the performance of the first medium layer is lower than that of the second medium layer.
  • Erasure coding is a data redundancy technology. Compared with the multi-copy strategy, erasure coding has higher disk utilization. For example, Reed-Solomon code is a common erasure code.
  • the erasure coding technology mainly encodes the original data to obtain redundancy through the erasure coding algorithm, and stores the data and redundancy together to achieve the purpose of fault tolerance.
  • the basic idea is to obtain m blocks of redundant elements (check units) by calculating n blocks of original data elements (data units), and the disk utilization rate is n/(n+m). For the elements of the n+m blocks, when any m block elements are in error (including the original data elements and redundant elements), the original n blocks of data elements can be recovered through the corresponding reconstruction algorithm.
  • the process of generating checksums is called encoding, and the process of recovering lost data elements is called decoding.
  • the ratio of erasure correction codes mentioned in this application refers to the ratio of the data element n to the redundant element m.
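  • To make the n/(n+m) utilization figure and the recovery idea concrete, here is a minimal single-parity example with n = 4 data elements and m = 1 check element. Plain XOR parity can recover exactly one lost element; codes such as Reed-Solomon are needed to tolerate m > 1 losses. All values are illustrative assumptions.

```python
# Minimal erasure-coding illustration: n = 4 data elements, m = 1 redundant element,
# disk utilization n/(n+m) = 0.8. Encoding builds the parity; decoding recovers a
# single lost element from the surviving elements.

def xor_all(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # n = 4 original data elements
parity = xor_all(data)                        # m = 1 redundant element (encoding)

lost_index = 2                                # simulate the loss of one data element
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
recovered = xor_all(survivors + [parity])     # decoding
assert recovered == data[lost_index]
print("utilization:", len(data) / (len(data) + 1), "recovered:", recovered)
```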
  • FIG. 4 is a flowchart of a data storage method provided by an embodiment of the present application. The method may be applied to the storage system shown in FIG. 1 or FIG. 2 , and may also be applied to the storage system shown in FIG. 3 . Referring to Figure 4, the method includes the following steps:
  • Step 401 In the first medium layer, store data according to the first erasure code ratio.
  • the storage system includes a variety of different storage media.
  • In the following, it is assumed that the storage system includes two different storage media and that the performance of the first medium layer is higher than that of the second medium layer;
  • the data storage method is explained by taking this case as an example.
  • For example, the first medium layer is DRAM and the second medium layer is a hard disk.
  • the so-called performance of the first medium layer is higher than that of the second medium layer means that the read and write speed of the first medium layer is faster and the time delay is smaller than that of the second medium layer.
  • the first erasure code ratio corresponding to the first medium layer and the second erasure code ratio corresponding to the second medium layer may be obtained respectively, and then In the first medium layer, data is stored according to the first erasure code ratio, and in the second medium layer, data is stored according to the second erasure code ratio.
  • the first erasure code ratio and the second erasure code ratio can be obtained by the management node in the storage system.
  • the management node is any storage node among multiple storage nodes in the storage system, or the management node is a node in the storage system that is independent of the storage nodes and used to manage each storage node.
  • the management node can obtain the first erasure code ratio and the second erasure code ratio during initialization, or during the operation of the storage system, which is not limited in the embodiments of the present application.
  • the first erasure code ratio and the second erasure code ratio can be obtained by the controller in the storage system. Next, an introduction is made by taking the management node acquiring the first erasure code ratio and the second erasure code ratio as an example.
  • in a possible implementation, the management node determines the first erasure code ratio or the second erasure code ratio according to the topology structure and fault tolerance capability of the storage system.
  • the first erasure code ratio and the second erasure code ratio may both be determined according to the topology structure and fault tolerance capability of the storage system; alternatively, the first erasure code ratio is determined according to the topology structure and fault tolerance capability and the second erasure code ratio is obtained from the first erasure code ratio, or the second erasure code ratio is determined according to the topology structure and fault tolerance capability and the first erasure code ratio is obtained from the second erasure code ratio, which is not limited in this embodiment of the present application.
  • the topology is used to indicate the number of storage nodes included in the storage system
  • the fault tolerance capability is used to indicate the number of storage nodes that the storage system tolerates errors.
  • the number of storage node failures that the storage system tolerates is equal to the number of check units corresponding to the first erasure code ratio, or to the number of check units corresponding to the second erasure code ratio.
  • the management node first obtains the topology of the storage system.
  • the management node may store the topology of the storage system, or receive the topology of the storage system sent by other devices, or receive the topology of the storage system input by the user.
  • the topology can indicate the composition of the storage system, for example, the number of storage nodes included in the storage system, and the number of child nodes included in each storage node.
  • the storage node is a server
  • the number of child nodes of the storage node refers to the number of physical hard disks included in the server or the number of hard disk logical domains obtained by dividing the physical hard disks included in the corresponding storage node.
  • the storage node is a cabinet
  • the number of child nodes of the storage node refers to the number of servers included in the cabinet.
  • a rack contains multiple servers.
  • for example, assume the storage system includes 4 servers, each server includes 60 physical hard disks, and every 15 physical hard disks are grouped into one hard disk logical domain. According to this topology, the storage system includes 4 storage nodes, and each server is one storage node. Each storage node includes 4 hard disk logical domains, that is, each storage node includes 4 child nodes.
  • in addition to obtaining the topology of the storage system, the management node also obtains the security level and the fault tolerance capability of the storage system.
  • a configuration interface is displayed on the management node, and the configuration interface includes a security level configuration item and a fault tolerance capability configuration option.
  • the user inputs the required security level in the security level configuration item, and inputs the number t of nodes allowed to fail in the fault tolerance configuration option, where t is an integer greater than or equal to 1.
  • the management node obtains the security level entered by the user and the number of error-allowed nodes t.
  • the security level includes server-level security, cabinet-level security, and the like. Among them, server-level security is used to indicate that the storage system can tolerate the failure of at most t servers.
  • Cabinet-level security is used to indicate that the storage system can tolerate up to t cabinet failures.
  • the management node may also determine the security level of the storage system according to a preset principle according to the topology of the storage system, wherein the preset principle refers to a calculation principle that can ensure the reliability of the storage system.
  • the fault tolerance capability of the storage system may also be a system default value, which is not limited in this embodiment of the present application.
  • after acquiring the topology structure, fault tolerance capability and security level of the storage system, the management node determines the value range of the number of data units in the first erasure code ratio through the following formula (1): N ≤ (k*M) - M.
  • N is the number of data units corresponding to the first erasure code ratio, and M is the number of node failures that the storage system tolerates as indicated by the fault tolerance capability, that is, the number of check units in the first erasure code ratio.
  • k is the number of nodes included in the storage system.
  • when the security level is server-level security, the above-mentioned nodes refer to servers, and when the security level is cabinet-level security, the above-mentioned nodes are cabinets.
  • after determining the value range of the number of data units in the first erasure code ratio, the management node determines a plurality of first candidate erasure code ratios according to the value range and M, where each candidate erasure code ratio corresponds to one value in the value range. Afterwards, the erasure code ratio with the smallest corresponding write amplification value is selected from the plurality of first candidate erasure code ratios as the first erasure code ratio.
  • write amplification means that the amount of data actually written by the storage node is greater than the amount of data received from the computing node.
  • write amplification is represented by a write amplification value.
  • for any first candidate erasure code ratio, the write amplification value corresponding to that ratio is equal to the ratio between the total number of data units and check units in the candidate ratio and its number of data units.
  • for example, the erasure code ratio 6:2 represents that every 6 data units correspond to 2 check units, so the write amplification value corresponding to this ratio is (6+2)/6. If the topology indicates 4 servers, the security level is server-level security and M = 2, formula (1) gives N ≤ 6; among the candidate ratios 6:2, 4:2 and 2:2, the ratio 6:2 has the smallest write amplification and is therefore selected as the first erasure code ratio.
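  • as a rough illustration of how formula (1) and the write-amplification criterion combine, the following sketch (a hypothetical helper, not code from this application) enumerates candidate values of N for the k = 4, M = 2 example and picks the candidate with the smallest write amplification:

```python
def first_ec_ratio(k, m, candidate_n=(2, 4, 6)):
    n_max = k * m - m                              # formula (1): N <= (k*M) - M
    valid = [n for n in candidate_n if 1 <= n <= n_max]
    # the write amplification of an N:M ratio is (N + M) / N
    best_n = min(valid, key=lambda n: (n + m) / n)
    return best_n, m

print(first_ec_ratio(k=4, m=2))                    # -> (6, 2), i.e. the 6:2 ratio
```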
  • in addition to the first erasure code ratio, the management node is further configured to obtain the second erasure code ratio according to the topology structure and fault tolerance capability of the storage system. Specifically, the management node determines the value range of the number of data units in the second erasure code ratio through the following formula (2): X ≤ (i*Y) - Y.
  • X is the number of data units corresponding to the second erasure code ratio, and X is greater than N. Y is the number of node failures that the storage system tolerates as indicated by the fault tolerance capability, that is, the number of check units in the second erasure code ratio.
  • i is the total number of child nodes of the nodes included in the storage system. When the security level is server-level security, i is the number of child nodes of the servers included in the storage system, where a child node of a server may refer to a physical hard disk or a hard disk logical domain attached to that server.
  • when the security level is cabinet-level security, i is the number of child nodes of the cabinets included in the storage system, where the child nodes of a cabinet refer to the servers included in the cabinet.
  • after determining the value range of the number of data units in the second erasure code ratio, the management node determines the second erasure code ratio according to that value range and Y.
  • for example, continuing with the storage system containing 4 servers, each server includes 4 child nodes, so the total number of child nodes of the 4 servers is 16. With Y = 2, formula (2) gives X ≤ (16*2) - 2, that is, X ≤ 30.
  • based on this value range and the reliability constraints of the system, the management node can select the number of data units to be 24. At this time, the second erasure code ratio is 24:2.
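  • a similar sketch for formula (2); here i is the total number of child nodes (for example, hard disk logical domains), and taking X = 24 inside the bound stands in for the reliability constraints that the text mentions but does not spell out:

```python
def second_ec_ratio(i, y, preferred_x=24):
    x_max = i * y - y                              # formula (2): X <= (i*Y) - Y
    # the exact choice of X within the bound is left to system reliability
    # constraints; here a caller-preferred value is used when it fits
    x = preferred_x if preferred_x <= x_max else x_max
    return x, y

print(second_ec_ratio(i=16, y=2))                  # -> (24, 2), i.e. the 24:2 ratio
```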
  • as can be seen from the above, N in the first erasure code ratio and X in the second erasure code ratio are not equal, and N is less than X.
  • M in the first erasure code ratio and Y in the second erasure code ratio may be equal or unequal.
  • in addition, the ratio of N to M is not equal to the ratio of X to Y.
  • in some possible implementations, after the first erasure code ratio N:M is determined in the manner described above, the management node determines the second erasure code ratio X:Y according to the first erasure code ratio and a preset w, where X is equal to w*N and Y is equal to or greater than M.
  • alternatively, after the second erasure code ratio X:Y is determined, the management node determines the first erasure code ratio N:M according to the second erasure code ratio and the preset w, where N is equal to X/w and M is equal to or less than Y.
  • after obtaining the first erasure code ratio and the second erasure code ratio, the management node calculates the ratio between the number X of data units in the second erasure code ratio and the number N of data units in the first erasure code ratio; this ratio is equal to the number w of first check matrices included in the data stored in the first medium layer according to the first erasure code ratio.
  • for example, when X = 24 and N = 6, the data stored in the first medium layer according to the first erasure code ratio includes w = 4 first check matrices. In other words, the preset w is the number of first check matrices included in the data stored in the first medium layer according to the first erasure code ratio.
  • after the first erasure code ratio, the second erasure code ratio and w are obtained, when a storage node subsequently receives a write data request sent by a computing node, the storage node writes data into the first medium layer according to the first erasure code ratio and w.
  • the write data request includes data to be written.
  • the target storage node receives the write data request sent by the computing node, and when the amount of received data to be written reaches the size of N data units, the target storage node divides the data to be written into N data units and generates M check units according to the N data units.
  • the N data units and the M check units constitute a sub-data, and the sub-data corresponds to a first check matrix, wherein the first check matrix includes the N data units and the M check units.
  • the target storage node stores the N data units and M check units contained in the first check matrix into the first medium layer of the storage system.
  • the target storage node continues to receive the write data request sent by the computing node, obtains another first parity check matrix according to the above method, and stores it in the first medium layer.
  • the subsequent step 402 can be executed.
  • the data to be written is divided into 6 data units, and 2 check units are generated according to the 6 data units.
  • that is, a first check matrix including the 6 data units and the 2 check units is generated.
  • the 8 units included in the first check matrix are stored in the memories of the storage nodes in the storage system.
  • the target storage node can place the check units included in each first check matrix on the same storage node, and distribute the data units included in each first check matrix among the storage nodes according to the principle of even distribution.
  • for example, the storage nodes included in the storage system are four servers, and the memories included in the four servers form the first medium layer.
  • taking one first check matrix as an example, the target storage node stores the 2 check units included in the check matrix in its own memory, and forwards the remaining 6 data units to the other 3 servers for storage, so that each of the other 3 servers stores 2 data units.
  • in this way, after the four first check matrices are stored in the first medium layer, the distribution of the 32 units included in the four check matrices on the storage nodes of the storage system is shown in FIG. 6.
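  • the placement policy above can be pictured with the following sketch, in which a buffer is split into N data units, M placeholder check units are derived, the check units stay on the target node and the data units are spread evenly over the other nodes; the XOR-based check units and the node names are illustrative assumptions only:

```python
def xor_unit(units):
    out = bytes(len(units[0]))
    for u in units:
        out = bytes(a ^ b for a, b in zip(out, u))
    return out

def build_first_check_matrix(buffer, n, m):
    size = len(buffer) // n
    data_units = [buffer[i * size:(i + 1) * size] for i in range(n)]
    check_units = [xor_unit(data_units)] * m       # placeholder parity, not Reed-Solomon
    return data_units, check_units

def place(data_units, check_units, target_node, other_nodes):
    layout = {target_node: list(check_units)}      # all check units stay on the target node
    for idx, unit in enumerate(data_units):        # data units spread evenly over the rest
        layout.setdefault(other_nodes[idx % len(other_nodes)], []).append(unit)
    return layout

data, checks = build_first_check_matrix(b"x" * 6 * 4096, n=6, m=2)
layout = place(data, checks, "server1", ["server2", "server3", "server4"])
print({node: len(units) for node, units in layout.items()})
# -> {'server1': 2, 'server2': 2, 'server3': 2, 'server4': 2}
```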
  • FIG. 5 and FIG. 6 are only possible distributions of each unit in a first parity check matrix given by way of example in the embodiment of the present application.
  • alternatively, the target storage node can distribute the units of a first check matrix evenly and randomly across the storage nodes according to the maximum number of units allowed to be placed on each storage node of the storage system and the number of units included in the first check matrix; that is, the check units are not required to be on the same storage node.
  • for example, one data unit and one check unit of a first check matrix may be stored on the target storage node, another check unit and one data unit may be stored on another storage node, and the remaining 4 data units are stored on the remaining two storage nodes.
  • as another example, the 4 check units p of the 4 first check matrices are stored on one storage node and the 4 check units q are stored on one storage node, where the storage node on which the 4 check units p are located may be different from the storage node on which the 4 check units q are located.
  • when the data storage method is applied to the storage system shown in FIG. 3, the controller in the storage array can determine the first erasure code ratio, the second erasure code ratio, and the number w of first check matrices included in the data stored in the first medium layer; the method for determining them refers to the method described above, and details are not described herein again in this embodiment of the present application.
  • after that, the controller may write data into the first medium layer according to the first erasure code ratio and w, where the first medium layer is the memory of the controller included in the storage system. It should be noted that the controller generates the w first check matrices with reference to the method described above and, according to the number of controllers included in the storage system and/or the number of memories included in each controller, distributes and stores the data units and check units included in each first check matrix into the memories of the respective controllers with reference to the foregoing method.
  • when the data written into the first medium layer according to the first erasure code ratio meets the set condition, the storage node or the controller writes the part of the data that meets the set condition into the second medium layer according to the second erasure code ratio.
  • the ratio of the second erasure code is X:Y, that is, the data written into the second medium layer includes X data units and Y check units.
  • the set condition includes that the available capacity of the first medium layer reaches a capacity threshold, or that the access frequency of the data stored in the first medium layer is lower than a heat threshold, or that one write cycle has been reached.
  • the set condition may also be other conditions, which are not limited in this embodiment of the present application.
  • the available capacity is equal to the total capacity of the first media layer minus the data volume of the stored data. When the available capacity of the first medium layer reaches the capacity threshold, the data in the first medium layer is migrated to the second medium layer, which helps the first medium layer to make room for processing more data.
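  • the set condition can be read as a predicate over the state of the first medium layer; the threshold values in the sketch below are placeholders, since the text only names the kinds of triggers:

```python
def set_condition_met(used_bytes, total_bytes, access_freq_per_s, seconds_since_flush,
                      capacity_threshold=0.2, heat_threshold=5.0, write_cycle_s=300.0):
    available = total_bytes - used_bytes
    return (available <= capacity_threshold * total_bytes      # available capacity trigger
            or access_freq_per_s < heat_threshold              # data has become "cold"
            or seconds_since_flush >= write_cycle_s)           # one write cycle reached
```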
  • the second media layer includes hard disks included in storage nodes in the storage system.
  • in a possible implementation, after the w first check matrices are assembled, that is, after the data units and check units included in the w first check matrices have been written into the first medium layer, X data units are obtained from the N data units contained in each of the w first check matrices, where X is an integer multiple of N; Y check units are obtained by calculation to generate a second check matrix, where the second check matrix includes the X data units and the Y check units; and the second check matrix is written into the second medium layer.
  • the above process can be implemented in the following different implementation manners.
  • in the first implementation manner, after the target storage node has assembled the w first check matrices through the above step 401, the target storage node calculates the Y check units of the second check matrix according to the w*N data units included in the w first check matrices. After that, the target storage node stores the calculated Y check units in the second medium layer. For the other storage nodes, each storage node stores the data units belonging to the w first check matrices that it holds into the second medium layer when the set condition is met. In this way, the w*N data units stored in the second medium layer are the X data units of the second check matrix, and the Y check units calculated by the target storage node are the Y check units included in the second check matrix.
  • when the target storage node stores the data units and the Y check units that it holds into the second medium layer, if the number of hard disks included in the second medium layer is greater than the total number of units included in the second check matrix, the target storage node selects Y hard disks from the plurality of hard disks it includes according to the number Y of calculated check units, and then writes the Y check units onto the selected hard disks, with one unit written on each hard disk.
  • if the target storage node also stores data units belonging to the w first check matrices, the target storage node selects a hard disk for each such data unit from the hard disks it includes and writes the data units onto the selected hard disks, where one unit is likewise written on each hard disk.
  • in another case, the storage node determines, according to the number of check units included in the second check matrix, the maximum number of units allowed to be distributed on each hard disk in the second medium layer. Afterwards, if the target storage node also stores data units belonging to the w first check matrices, it selects a plurality of hard disks from the hard disks it includes according to this maximum number, the number of data units it holds and Y, and then writes the stored data units and check units to the selected hard disks.
  • if the target storage node does not store data units belonging to the w first check matrices, it selects a plurality of hard disks from the hard disks it includes according to the maximum number of units and Y, so as to write the Y check units onto the selected hard disks.
  • it should be noted that a plurality of units of the second check matrix may be stored on one hard disk, as long as the number of units stored on that hard disk does not exceed the maximum number of units allowed to be distributed on it.
  • the data units belonging to the w first check matrices stored by a storage node itself can be written into the second medium layer with reference to the above method.
  • for example, the w first check matrices include a total of 24 data units, that is, the number of data units included in the second check matrix is 24, and the number of check units calculated by the target storage node is 2. Since the number of check units included in the second check matrix is 2, the maximum number of units allowed to be distributed on each hard disk logical domain is 2.
  • each storage node stores the 24 data units and the 2 check units into the second medium layer accordingly: for 3 of the 4 storage nodes, 2 units are stored in each hard disk logical domain of each of those storage nodes, that is, each of them stores a total of 8 units; for the remaining storage node, 2 units can be stored on one hard disk logical domain of that node, or one unit can be stored on each of two hard disk logical domains of that node.
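  • continuing the example, the first implementation manner can be sketched as follows: the data units of the w first check matrices are reused unchanged as the X data units, Y new check units are computed over all of them (a placeholder XOR stands in for the real code), and the 26 units are packed onto hard disk logical domains subject to the "at most 2 units per domain" rule; the domain names are illustrative assumptions:

```python
def xor_unit(units):
    out = bytes(len(units[0]))
    for u in units:
        out = bytes(a ^ b for a, b in zip(out, u))
    return out

def merge_first_matrices(first_matrices, y):
    # the data units of the w first check matrices become the X data units
    # of the second check matrix; only the Y check units are recomputed
    x_data = [u for data_units, _ in first_matrices for u in data_units]
    return x_data, [xor_unit(x_data)] * y

def pack_onto_domains(units, domains, max_per_domain):
    layout, d = {dom: [] for dom in domains}, 0
    for unit in units:
        while len(layout[domains[d]]) >= max_per_domain:
            d += 1                                 # move on to the next hard disk logical domain
        layout[domains[d]].append(unit)
    return layout

w_matrices = [([bytes([k]) * 4096 for k in range(6)], [b"p", b"q"]) for _ in range(4)]
x_data, y_checks = merge_first_matrices(w_matrices, y=2)        # 24 data + 2 check units
domains = [f"a{node}{dom}" for node in range(1, 5) for dom in range(1, 5)]   # 16 domains
layout = pack_onto_domains(x_data + y_checks, domains, max_per_domain=2)
assert len(x_data) == 24 and max(len(v) for v in layout.values()) <= 2
```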
  • in the second implementation manner, after the target storage node has assembled the w first check matrices, it obtains the Y check units of the second check matrix according to the w*M check units included in the w first check matrices. After that, the target storage node stores the calculated Y check units in the second medium layer. For the other storage nodes, each storage node stores the data units belonging to the w first check matrices that it holds into the second medium layer when the amount of data in its own cache reaches a certain threshold.
  • (1) when the check units of each first check matrix are stored on the target storage node, the target storage node obtains the w*M check units stored by itself, and then obtains the Y check units according to the w*M check units.
  • for example, the target storage node performs an XOR operation on the stored w check units p, or uses another calculation method, to obtain the check unit p' in the second check matrix, and obtains the check unit q' in the second check matrix by performing an XOR operation on the stored w check units q or by another calculation method. It can be seen that, in the embodiment of the present application, the check units of the second check matrix can be obtained by calculating directly on the M check units included in each first check matrix, instead of recalculating the check units from all the data units of the first check matrices, which reduces the amount of computation.
  • in addition, since the check units are stored together, the storage node can directly use the check units it holds to obtain the check units of the second check matrix; compared with the case where the check units are distributed and stored on the various storage nodes, there is no need to obtain check units across storage nodes, which reduces the amount of network forwarding.
  • FIG. 8 is a schematic diagram of a principle of obtaining a second check matrix according to merging w first check matrices according to an embodiment of the present application.
  • in FIG. 8, in the first first check matrix, p1 is the check unit p and q1 is the check unit q; in the second first check matrix, a7 to a12 are the 6 data units and the remaining two columns of elements, p2 and q2, are the 2 check units, and so on.
  • the elements of the first 6 columns of each first check matrix are taken out to form the 24 data units of the second check matrix.
  • an XOR operation is performed on the check units p, that is, p1 to p4, or another calculation method is used, to obtain the check unit p' in the second check matrix, and the check unit p' is placed in the second check matrix.
  • likewise, an XOR operation is performed on the check units q, that is, q1 to q4, or another calculation method is used, to obtain the check unit q' in the second check matrix.
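  • a sketch of this incremental calculation, under the simplifying assumption that the p and q columns are plain XOR parities (a real q column in a Reed-Solomon layout would require the matching finite-field arithmetic rather than a bare XOR):

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def incremental_check_units(p_units, q_units):
    # p' and q' of the second check matrix are derived directly from the w
    # stored p and q units, without touching the w*N data units
    return reduce(xor_bytes, p_units), reduce(xor_bytes, q_units)

p_cols = [bytes([i]) * 8 for i in (1, 2, 3, 4)]     # p1..p4 from the 4 first check matrices
q_cols = [bytes([i]) * 8 for i in (5, 6, 7, 8)]     # q1..q4
p_prime, q_prime = incremental_check_units(p_cols, q_cols)
```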
  • in another case, when the M check units included in each first check matrix are check units r, the target storage node performs an XOR operation on the stored w check units r and incrementally calculates the check unit r' in the second check matrix. After that, the target storage node obtains the data units of each first check matrix stored on itself and on the other storage nodes, and calculates the check unit p' and the check unit q' according to the obtained w*N data units.
  • the check unit r', the check unit p' and the check unit q' obtained by calculation are used as the Y check units in the second check matrix. It can be seen that, in this implementation, Y and M are not equal.
  • for example, the target storage node performs an XOR operation on the stored 3 check units r to obtain the check unit r' in the second check matrix.
  • the target storage node then obtains the 21 data units stored on itself and on the other storage nodes, and calculates the check unit p' and the check unit q' according to the 21 data units.
  • the Y check units included in the second check matrix are the generated check unit p', check unit q' and check unit r'.
  • after calculating the Y check units, the target storage node stores the data units it holds and the Y check units into the second medium layer with reference to the method described in the first implementation manner, and the other storage nodes store their respective data units into the second medium layer after the amount of data in their own caches reaches a certain threshold, which is not repeated in this embodiment of the present application.
  • (2) when the check units of the first check matrices are distributed and stored on different storage nodes, the target storage node obtains the stored check units from each storage node, and then obtains the Y check units according to the obtained w*M check units.
  • the manner in which the target storage node obtains the Y check units according to the acquired w*M check units refers to the implementation manner in the above (1), which is not repeated in this embodiment of the present application.
  • likewise, the target storage node stores the data units it holds and the Y check units into the second medium layer with reference to the method described in the first implementation manner, and the other storage nodes store their respective data units into the second medium layer after the amount of data in their own caches reaches a certain threshold; details are not described herein again in this embodiment of the present application.
  • in the third implementation manner, after the target storage node has assembled the w first check matrices, it writes the units belonging to the w first check matrices that it holds into the second medium layer, and each other storage node, after the amount of data in its own cache reaches a certain threshold, also writes the units belonging to the w first check matrices that it holds into the second medium layer. After that, the target storage node obtains the w*M check units that have been written into the second medium layer, calculates the Y check units according to these w*M check units, and writes the calculated Y check units into the second medium layer as the Y check units of the second check matrix.
  • when the check units are written into the second medium layer, one check unit may be written on each hard disk, so that the w*M check units are written to w*M hard disks.
  • alternatively, all the check units can be written to one hard disk.
  • after each storage node has written the data units and check units belonging to the w first check matrices that it holds into the second medium layer, the target storage node obtains the w*M check units from the second medium layer. If the w*M check units were written to w*M hard disks, the target storage node reads the w*M check units from those w*M hard disks. If all the check units were written to one hard disk, the target storage node obtains the w*M check units from that hard disk at one time, which reduces the number of network communications and saves bandwidth resources.
  • optionally, if the check units located in the same column are written to the same hard disk, the target storage node reads the check units located in the same column from each such hard disk, thereby obtaining the w*M check units; in this way, the number of network communications can also be reduced to a certain extent and bandwidth resources can be saved.
  • after obtaining the w*M check units, the target storage node calculates the Y check units according to the w*M check units with reference to the method described in the first implementation manner, and then writes the Y check units to Y hard disks respectively, with one check unit written on each hard disk.
  • it should be noted that the Y hard disks to which the Y check units are written are not among the X hard disks to which the data units are written.
  • optionally, with reference to the method described in the foregoing first implementation manner, each storage node can also write two or more units onto one hard disk, as long as the maximum number of units allowed to be stored on that hard disk is not exceeded.
  • it should be noted that the M check units included in each first check matrix may be stored on a hard disk under the same storage node, or, among the check units of the first check matrices, the check units located in the same column can be stored on a hard disk under one storage node, for example in the same hard disk logical domain under a storage node or on one physical hard disk under a storage node, so as to reduce the amount of network forwarding required when calculating the Y check units in the second check matrix.
  • FIG. 10 is a schematic diagram of writing data according to the second erasure code ratio in the second medium layer according to an embodiment of the present application.
  • the two check units are the check unit p and the check unit q respectively.
  • the second media layer includes 16 hard disk logical domains distributed on 4 storage nodes (a1 to a4), each storage node is distributed with 4 hard disk logical domains, and each hard disk logical domain includes 16 physical hard disks. Since the number of check cells is 2, a maximum of 2 cells in the second check matrix are allowed to be distributed on each hard disk logical domain.
  • two hard disk logical domains are first selected from the 16 hard disk logical domains.
  • the selected two hard disk logical domains are the hard disk logical domain a21 on storage node a2 and the hard disk logical domain a41 on storage node a4.
  • each storage node that stores a check unit p of a first check matrix writes the check unit p it holds into a physical hard disk of the hard disk logical domain a21; for example, the check units p of all the first check matrices are written into the first physical hard disk of the hard disk logical domain a21.
  • since each hard disk logical domain is allowed to store at most two units of the second check matrix, after the calculated check unit p' of the second check matrix is written to the hard disk logical domain a21, the hard disk logical domain a21 can further store at most one data unit of the second check matrix.
  • the target storage node or the storage node a4 can also perform an XOR operation on the four check units q stored therein, and incrementally calculate the check unit q' in the second check matrix. , and store it in the hard disk logical domain a41. In this way, the hard disk logical domain a41 can store only one data unit in the second parity check matrix at most.
  • the 24 data units are then distributed among the 4 storage nodes, which together include 16 hard disk logical domains, according to the principle that at most two units of the second check matrix are distributed on each hard disk logical domain.
  • the above describes the process of writing data into the first medium layer according to the first erasure code ratio, and migrating it to the second medium layer according to the second erasure code ratio, when the storage system is the storage system shown in FIG. 1 or FIG. 2.
  • when the storage system is the storage system shown in FIG. 3, the operations performed by the above-mentioned storage nodes can be performed by the controller, so that the data units and check units of the w first check matrices included in the data of the first medium layer are combined into a second check matrix and written into the second medium layer, which is not repeated in this embodiment of the present application.
  • in addition, if a fault occurs in the first medium layer, the data units and check units at the non-faulty positions are read from the first medium layer and used for reconstruction, so as to restore the data at the fault point in the first medium layer.
  • in the embodiment of the present application, the number of data units corresponding to the first erasure code ratio is smaller than the number of data units corresponding to the second erasure code ratio, that is, the first erasure code ratio is a small ratio and the second erasure code ratio is a large ratio.
  • data is stored in the first medium layer according to the first erasure code ratio, and data is stored in the second medium layer according to the second erasure code ratio, where the performance of the first medium layer is higher than that of the second medium layer.
  • that is to say, the high-performance medium layer uses a smaller erasure code ratio to store data, and the low-performance medium layer uses a larger erasure code ratio to store data.
  • since the IO granularity received by the high-performance medium layer is relatively small, when the high-performance medium layer stores data with a small erasure code ratio, a stripe can be filled whenever the size of the received data reaches the size of the N data units corresponding to that ratio (N data units and M check units form one stripe). Compared with a larger erasure code ratio, the smaller erasure code ratio makes it easier to fill up stripes, thereby reducing the amount of zero-padding in the stripes, reducing write amplification, and improving storage space utilization. For example, the erasure code ratio 6:2 is used to store data in the high-performance medium layer, while the larger ratio 24:2 used in the low-performance medium layer reduces the proportion of redundant data in the storage space and further improves storage space utilization.
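  • to make the utilization argument concrete, the following snippet compares the write amplification and the redundancy share of the 6:2 and 24:2 ratios used as examples in this description:

```python
def write_amplification(n, m):
    return (n + m) / n

def redundancy_share(n, m):
    return m / (n + m)

print(write_amplification(6, 2), redundancy_share(6, 2))     # 1.33..., 0.25
print(write_amplification(24, 2), redundancy_share(24, 2))   # 1.083..., ~0.077
```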
  • in addition, in the embodiment of the present application, data stored in the first medium layer according to the first erasure code ratio can be directly converted into data that conforms to the second erasure code ratio and then stored in the second medium layer; that is, the data of different medium layers can be converted into each other by calculation, which saves the computing resources of the storage system.
  • an embodiment of the present application provides a data storage device.
  • the device is applied to the storage device described above.
  • the storage device may be any storage node in the storage system shown in FIG. 1 or FIG. 2, or the storage array shown in FIG. 3, and so on.
  • the storage device includes the first medium layer and the second medium layer described above, and the performance of the first medium layer is different from that of the second medium layer.
  • the device includes:
  • the storage module 1101 is configured to receive data, and store the data in the first medium layer according to the first erasure code ratio. For details, reference may be made to step 401 in the foregoing embodiment.
  • the migration module 1102 is configured to acquire the data from the first medium layer when the set condition is satisfied, and migrate the data to the second medium layer according to the second erasure code ratio.
  • the first erasure code ratio corresponds to N data units and M check units, where N and M are integers greater than or equal to 1; the second erasure code ratio corresponds to X data units and Y check units, where X and Y are integers greater than or equal to 1; and the ratio of N to M is not equal to the ratio of X to Y.
  • the storage module 1101 and the migration module 1102 may be implemented by the processor in the storage device running program instructions in the memory.
  • optionally, the performance of the first medium layer is higher than that of the second medium layer, and the ratio of N to M is smaller than the ratio of X to Y.
  • the data stored in the first medium layer includes a plurality of sub-data, each sub-data corresponds to a first check matrix, and each first check matrix includes N data units and M check units.
  • the migration module 1102 is specifically configured to: obtain the X data units according to the N data units contained in each first check matrix, where X is an integer multiple of N; calculate Y check units to generate a second check matrix, where the second check matrix includes the X data units and the Y check units; and write the second check matrix into the second medium layer.
  • optionally, calculating the Y check units includes calculating the Y check units according to the X data units.
  • the migration module is further configured to: calculate Y check units according to M check units included in each first check matrix, where Y is equal to M.
  • optionally, the M check units of each first check matrix are stored in a first storage node of the storage system, and the device is further configured to obtain the M check units of each first check matrix from the first storage node.
  • optionally, the M check units include a check unit p and a check unit q, where the check unit p of each first check matrix is stored in a first storage node of the storage system and the check unit q of each first check matrix is stored in a second storage node of the storage system; the device is further configured to obtain the check unit p of each first check matrix from the first storage node and the check unit q of each first check matrix from the second storage node.
  • calculating the Y check units according to the M check units included in each first check matrix specifically includes: calculating the Y check units according to the check unit p of each first check matrix and the check unit q of each first check matrix, where the Y check units include the check unit p' and the check unit q' of the second check matrix.
  • optionally, the device is further configured to determine the first erasure code ratio or the second erasure code ratio according to the topology structure of the storage system and the fault tolerance capability of the storage system, where the topology structure is used to indicate the number of storage nodes included in the storage system and the fault tolerance capability is used to indicate the number of storage node failures that the storage system tolerates.
  • optionally, the set condition includes that the available capacity of the first medium layer reaches a capacity threshold.
  • the set condition includes that the access frequency of the data stored in the first medium layer is lower than a heat threshold.
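  • a minimal sketch of how the storage module 1101 and the migration module 1102 could cooperate, with in-memory lists standing in for the two medium layers and placeholder check units; apart from the module numbers 1101/1102, all names and the trigger used here (w first check matrices assembled) are illustrative assumptions:

```python
class DataStorageDevice:
    def __init__(self, n, m, x, y):
        self.n, self.m, self.x, self.y = n, m, x, y
        self.first_layer, self.second_layer = [], []       # stand-ins for memory / hard disks

    def storage_module_1101(self, data_units):
        # store data in the first medium layer at the first ratio N:M
        self.first_layer.append((data_units, ["check"] * self.m))
        if self.set_condition_met():
            self.migration_module_1102()

    def set_condition_met(self):
        return len(self.first_layer) * self.n >= self.x     # w first check matrices assembled

    def migration_module_1102(self):
        # migrate the data to the second medium layer at the second ratio X:Y
        x_data = [u for data_units, _ in self.first_layer for u in data_units]
        self.second_layer.append((x_data, ["check"] * self.y))
        self.first_layer.clear()

device = DataStorageDevice(n=6, m=2, x=24, y=2)
for i in range(4):
    device.storage_module_1101([f"d{i}{j}" for j in range(6)])
assert len(device.second_layer) == 1 and len(device.second_layer[0][0]) == 24
```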
  • to sum up, in the embodiment of the present application, the performance of the first medium layer and the performance of the second medium layer are different, and based on this, the first medium layer and the second medium layer store data according to different erasure code ratios. Since different erasure code ratios correspond to different write amplification and therefore to different storage space utilization, selecting different erasure code ratios for data storage according to the performance of each medium layer can better exploit the storage performance of the corresponding medium layer and effectively improve the utilization of storage space.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired (for example, coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless (for example, infrared, radio or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes an integration of one or more available media.
  • the available media may be magnetic media (for example, a floppy disk, hard disk or magnetic tape), optical media (for example, a Digital Versatile Disc (DVD)), or semiconductor media (for example, a Solid State Disk (SSD)), and the like.
  • references herein to "at least one" mean one or more, and "plurality" means two or more.
  • "/" indicates an "or" relationship between the associated objects; for example, A/B can mean A or B.
  • "and/or" in this document merely describes an association relationship between associated objects and indicates that three relationships are possible; for example, A and/or B can mean that A exists alone, A and B exist at the same time, or B exists alone.
  • words such as “first” and “second” are used to distinguish the same or similar items with basically the same function and effect. Those skilled in the art can understand that the words “first”, “second” and the like do not limit the quantity and execution order, and the words “first”, “second” and the like are not necessarily different.

Abstract

The embodiments of the present application disclose a data storage method, a storage system, a storage device and a storage medium, which belong to the technical field of data storage. In the embodiments of the present application, the performance of the first medium layer and the performance of the second medium layer are different; on this basis, data is stored in the first medium layer and in the second medium layer according to different erasure code ratios. Since different erasure code ratios correspond to different write amplification and therefore to different storage space utilization, selecting different erasure code ratios for data storage according to the performance of each medium layer can better exploit the storage performance of the corresponding medium layer and effectively improve storage space utilization.

需要说明的是,控制器11包括的内存111与硬盘22为完全不同的两种存储介质。其中,相较于硬盘,内存的数据读取速度更快,时延更小,也即,内存的性能高于硬盘的性能。在本申请实施例中,将性能更高的内存111称为第一介质层,将性能相对内存而言较低的多个硬盘22称为第二介质层,也即,第一介质层的性能高于第二介质层。或者,将性能更高的内存111称为第二介质层,将性能相对内存而言较低的多个硬盘22称为第一介质层。此时,第一介质层的性能低于第二介质层。
纠删码是一种数据冗余技术,相对于多副本策略,纠删码具有更高的磁盘利用率。例如Reed-Solomon码就是一种常见的纠删码。纠删码技术主要是通过纠删码算法将原始的数据进行编码得到冗余,并将数据和冗余一并存储起来,以达到容错的目的。其基本思想是将n块原始的数据元素(数据单元)通过一定的计算,得到m块冗余元素(校验单元),磁盘利用率为n/(n+m)。对于这n+m块的元素,当其中任意的m块元素出错(包括原始的数据元素和冗余元素)时,均可以通过对应的重构算法恢复出原来的n块数据元素。生成校验的过程被成为编码(encoding),恢复丢失的数据元素的过程被称为解码(decoding)。本申请所提到的纠删码配比是指数据元素n与冗余元素m的比值。接下来对本申请实施例提供的数据存储方法进行介绍。
图4是本申请实施例提供的一种数据存储方法的流程图。该方法可以应用于图1或图2所示的存储系统中,也可以应用于图3所示的存储系统中。参见图4,该方法包括以下步骤:
步骤401:在第一介质层中,按照第一纠删码配比存储数据。
由前述图1至图3中的介绍可知,存储系统中包括多种不同的存储介质,本申请实施例中以存储系统包括两种不同的存储介质且第一介质层的性能高于第二介质层的性能为例对该数据存储方法进行解释说明。例如,第一介质层是DRAM,第二介质层是硬盘。所谓第一介质层的性能高于第二介质层的性能是指第一介质层比第二介质层的读写速度更快、时延更小。
在本申请实施例中,针对第一介质层和第二介质层,可以分别获取第一介质层对应的第一纠删码配比和第二介质层对应的第二纠删码配比,进而在第一介质层中,按照第一纠删码配比存储数据,而在第二介质层中按照第二纠删码配比存储数据。
其中,如果该数据存储方法应用于图1或图2所示的存储系统中,则可以由存储系统中的管理节点获取第一纠删码配比和第二纠删码配比。其中,该管理节点为该存储系统的多个存储节点中的任一存储节点,或者是,该管理节点为该存储系统中独立于存储节点之外用于对各个存储节点进行管理的一个节点。另外,该管理节点可以在初始化时获取第一纠删码配比和第二纠删码配比,也可以在存储系统运行过程中获取第一纠删码配比和第二纠删码配比,本申请实施例对此不做限定。如果该数据存储方法应用于图3所示的存储系统中,则可以由存储系统中的控制器来获取第一纠删码配比和第二纠删码配比。接下来以管理节点获取第一纠删码配比和第二纠删码配比为例来进行介绍。
在一种可能的实现方式中,管理节点根据该存储系统的拓扑结构和容错能力确定第一纠删码配比或第二纠删码配比。其中,第一纠删码配比和第二纠删码配比可以均根据存储系统的拓扑结构和容错能力确定得到,或者,第一纠删码配比根据存储系统的拓扑结构和容错能力确定得到,第二纠删码配比根据第一纠删码配比获得,或者,第二纠删码配比根据存储系统的拓扑结构和容错能力确定得到,第一纠删码配比根据第二纠删码配比获得,本申请实施 例对此不作限定。另外,拓扑结构用于指示存储系统所包含的存储节点的数量,容错能力用于指示存储系统容忍出错的存储节点的数量。其中,该存储系统容忍出错的存储节点的数量等于第一纠删码配比对应的校验单元的数量,或者第二纠删码配比对应的校验单元的数量。
其中,管理节点首先获取存储系统的拓扑结构。示例性地,管理节点中可以存储有存储系统的拓扑结构,或者,接收由其他设备发送的该存储系统的拓扑结构,或者,接收用户输入的存储系统的拓扑结构。该拓扑结构能够指示存储系统的组成,例如,该存储系统内所包含的存储节点数量,每个存储节点包括的子节点的数量。其中,当存储节点为一台服务器时,存储节点的子节点的数量是指服务器包括的物理硬盘的数量或者对相应存储节点包括的物理硬盘进行划分得到的硬盘逻辑域的数量。当存储节点为一个机柜时,存储节点的子节点的数量是指机柜内包括的服务器的数量。通常,一个机柜包括多个服务器。
例如,假设存储系统内包括4个服务器,每个服务器包括60个物理硬盘,其中,每15个物理硬盘划分为1个硬盘逻辑域,则根据该拓扑结构可知,该存储系统包括4个存储节点,每个服务器即为一个存储节点。每个存储节点包括4个硬盘逻辑域,也即,每个存储节点包括的子节点的数量为4。
除了获取存储系统的拓扑结构，管理节点还获取该存储系统的安全级别和容错能力。在一种可能的实现方式中，管理节点上显示有配置界面，该配置界面包括安全级别配置项和容错能力配置选项。用户在该安全级别配置项中输入所需的安全级别，并在容错能力配置选项中输入允许出错的节点数量t，t为大于或等于1的整数。管理节点获取用户输入的安全级别和允许出错的节点数量t。其中，安全级别包括服务器级安全、机柜级安全等。其中，服务器级安全用于指示存储系统最多能够容忍t个服务器出现故障。机柜级安全用于指示该存储系统最多能够容忍t个机柜出现故障。可选地，管理节点也可以根据该存储系统的拓扑结构，按照预设原则确定该存储系统的安全级别，其中，该预设原则是指能够保证该存储系统的可靠性的计算原则，本申请实施例对此不作限定。另外，该存储系统的容错能力也可以是一个系统默认值，本申请实施例对此不作限定。
在获取到存储系统的拓扑结构、容错能力和安全级别之后,管理节点通过下述公式(1)确定第一纠删码配比中的数据单元的数量的取值范围。
N≤(k*M)-M      (1)
其中,N为第一纠删码配比对应的数据单元的数量。k为存储系统中包含的节点的数量,当安全级别为服务器级安全时,上述节点指服务器,当安全级别为机柜级安全时,上述节点为机柜。M为容错能力所指示的该存储系统能够容忍出错的节点的数量,也即,第一纠删码配比中的校验单元的数量。需要说明的是,M可以是默认值,也可以是由用户自定义的值,且M为大于或等于1的整数,例如,M=2。
在确定出第一纠删码配比中的数据单元的数量的取值范围之后,管理节点根据该取值范围和M,确定得到多个第一候选纠删码配比,每个候选纠删码配比对应所述取值范围的一个值。之后,从多个第一候选纠删码配比中选择对应的写放大值最小的纠删码配比作为第一纠删码配比。
其中，写放大是指存储节点实际写入的数据量大于从计算节点接收到的数据量。在本申请实施例中，写放大通过写放大值来表征，对于任一个第一候选纠删码配比而言，该第一候选纠删码配比对应的写放大值等于该第一候选纠删码配比的数据单元和校验单元的总数量与数据单元的数量之间的比值。例如，对于纠删码配比6:2而言，该纠删码配比用于表征每6个数据单元对应2个校验单元，如此，该纠删码配比对应的写放大值即为(6+2)/6。
示例性地,假设根据该存储系统的拓扑结构指示该存储系统包括4个服务器,用户输入的安全级别为服务器级安全,则k=4。假设该存储系统能够容忍出错的存储节点的数量为2个,也即,M=2。根据上述公式(1)可得,第一纠删码配比的数据单元的数量取值范围为N≤4*2-2,也即,N≤6。在确定第一纠删码配比的数据单元的数量取值范围之后,根据该取值范围和校验单元的数量可得到常用的多个第一候选纠删码配比分别为6:2、4:2和2:2。由于6:2这一配比是三个配比中写放大最小的,因此,将6:2这一配比作为第一纠删码配比。
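上述选择过程可用如下Python示意代码概括（候选配比列表、函数名均为说明自拟的假设）：先按公式(1)得到数据单元数量N的上限，再从候选配比中选出写放大值(N+M)/N最小的一个。

```python
def choose_first_ec_ratio(k, M, candidates=(2, 4, 6)):
    # k: 安全级别对应的节点数量(如服务器数量)；M: 可容忍出错的节点数，即校验单元数量
    n_max = k * M - M                                  # 公式(1)：N <= (k*M) - M
    feasible = [n for n in candidates if n <= n_max]
    # 写放大值 = (N + M) / N，选择写放大最小的候选配比
    best_n = min(feasible, key=lambda n: (n + M) / n)
    return best_n, M

print(choose_first_ec_ratio(k=4, M=2))                 # 输出(6, 2)，即第一纠删码配比为6:2
```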
除了第一纠删码配比之外,管理节点还用于根据该存储系统的拓扑结构和容错能力获取第二纠删码配比。具体的,管理节点通过下述公式(2)确定第二纠删码配比中的数据单元的数量取值范围。
X≤(i*Y)-Y       (2)
其中,X为第二纠删码配比对应的数据单元的数量,且X大于N。i为该存储系统包含的节点的子节点的数量,其中,当安全级别为服务器级安全时,i为该存储系统包含的服务器的子节点的数量,其中,服务器的子节点可以是指该服务器连接的物理硬盘或硬盘逻辑域。当安全级别为机柜级安全时,i为该存储系统包含的机柜的子节点的数量,其中,机柜的子节点是指该机柜包含的服务器的数量。Y为容错能力所指示的该存储系统能够容忍出错的节点的数量,也即,Y为第二纠删码配比对应的校验单元的数量。需要说明的是,Y可以是默认值,也可以是由用户自定义的值,且Y大于或等于1,例如,Y=2。另外,Y与M可以相等,也可以不相等,本申请实施例对此不作限定。还需要说明的是,安全级别可以由前述介绍的配置方式由用户进行配置,在这种情况下,管理节点直接获取用户配置的安全级别。或者,该安全级别也可以是管理节点根据该存储系统的拓扑结构按照预设原则确定得到的,其中,该预设原则是指能够保证该存储系统的可靠性的计算原则,本申请实施例对此不作限定。
在确定出第二纠删码配比中的数据单元的数量取值范围之后,管理节点根据该取值范围和Y,确定第二纠删码配比。
例如,仍以前述的包含4个服务器的存储系统为例,假设每个服务器包括4个硬盘逻辑域,则在安全级别为服务器级安全时,该存储系统中包含的每个服务器下有4个子节点,这样,4个服务器的子节点的总数量为16。假设容错能力所指示的该存储系统能够容忍出错的节点的数量为2,也即Y=2,则根据上述公式(2)可得,X≤(16*2)-2,也即,X≤30。根据该取值范围,考虑到系统可靠性约束机制,管理节点可选择数据单元的数量为24,此时,第二纠删码配比即为24:2。
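第二纠删码配比中数据单元数量X的上限可按公式(2)计算，如下面的示意代码所示（其中“系统可靠性约束”在示例中仅以直接给定X=24代替，实际取值方式取决于具体实现）：

```python
def second_ec_data_unit_bound(i, Y):
    # i: 安全级别对应节点的子节点总数(如16个硬盘逻辑域)；Y: 第二配比的校验单元数量
    return i * Y - Y                    # 公式(2)：X <= (i*Y) - Y

x_max = second_ec_data_unit_bound(i=16, Y=2)
print(x_max)                            # 输出30；结合可靠性约束，示例中取X=24，即配比24:2
```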
由上文可知,第一纠删码配比中的N和第二纠删码配比中的X不相等,且N小于X。另外,第一纠删码配比中的M和第二纠删码配比中的Y可以相等也可以不等。除此之外,N和M的比值不等于X和Y的比值。
上述介绍了分别根据存储系统的拓扑结构和容错能力确定第一纠删码配比和第二纠删码配比的实现过程。在一些可能的实现方式中,在参考上述方式确定得到第一纠删码配比之后,管理节点根据第一纠删码配比N:M和预设的w确定第二纠删码配比X:Y。其中,X等于w*N,Y等于M或大于M。或者,在参考上述方式确定得到第二纠删码配比之后,管理节点根据第二纠删码配比X:Y和预设的w确定第一纠删码配比N:M。其中,N等于X/w,M等于Y或 小于Y。
管理节点获得了第一纠删码配比和第二纠删码配比之后,计算第二纠删码配比中的数据单元的数量X和第一纠删码配比中的数据单元的数量N之间的比值,该比值就等于在第一介质层中按照第一纠删码配比存储的数据包括的第一校验矩阵的个数w。例如,当第二纠删码配比中的数据单元的数量X=24,第一纠删码配比中的数据单元的数量N=6,则可以确定在第一介质层中按照第一纠删码配比存储的数据包括的第一校验矩阵的个数w=4。由此可见,在前述根据第一纠删码配比获得第二纠删码配比或者是根据第二纠删码配比获得第一纠删码配比的实现方式中,预设的w实际上就是在第一介质层中按照第一纠删码配比存储的数据包括的第一校验矩阵的个数。
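w的计算本身只是一个简单的整除关系，示意如下（数值沿用上文示例）：

```python
X, N = 24, 6
w = X // N          # 在第一介质层中需要凑齐的第一校验矩阵个数
print(w)            # 输出4
```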
在获得第一纠删码配比、第二纠删码配比以及w之后,后续,当存储节点接收到计算节点发送的写数据请求时,存储节点按照第一纠删码配比和w在第一介质层中写入数据。其中,写数据请求包括待写入的数据。接下来以存储系统中的目标存储节点接收计算节点发送的写数据请求为例对该过程进行说明。
示例性地，目标存储节点接收计算节点发送的写数据请求，当接收的待写入数据的数据量达到N个数据单元的尺寸时，目标存储节点将这些待写入数据划分成N个数据单元，并根据该N个数据单元生成M个校验单元。所述N个数据单元和M个校验单元构成一个子数据，该子数据对应一个第一校验矩阵，其中，该第一校验矩阵包括该N个数据单元和M个校验单元。之后，目标存储节点将第一校验矩阵包含的N个数据单元和M个校验单元存储至该存储系统的第一介质层中。与此同时，目标存储节点继续接收计算节点发送的写数据请求，按照上述方式获得另一个第一校验矩阵，并存储至所述第一介质层。如此，当该目标存储节点按照上述方式将w个第一校验矩阵包括的数据单元和校验单元写入至第一介质层后，即可以执行后续步骤402。
例如,第一纠删码配比为6:2,也即,N=6,M=2,且w=4,当目标存储节点接收到计算节点发送的待写入数据的数量达到6个数据单元的尺寸时,将这些待写入数据划分成6个数据单元,根据这6个数据单元生成2个校验单元,之后,生成包括6个数据单元和2个校验单元的第一校验矩阵。将第一校验矩阵包括的8个单元存储至该存储系统中的各个存储节点的内存中。
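下面的Python示意代码演示目标存储节点凑满一个第一校验矩阵(6个数据单元+2个校验单元)的过程；其中数据单元尺寸UNIT_SIZE、类名FirstStripeBuilder均为说明自拟的假设，且为简化起见用同一个异或结果代替真实实现中互不相同的校验单元p和q：

```python
from functools import reduce

UNIT_SIZE = 8 * 1024                      # 假设的数据单元尺寸，仅为示例

def xor_bytes(blocks):
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

class FirstStripeBuilder:
    def __init__(self, n=6, m=2):
        self.n, self.m = n, m
        self.buf = bytearray()

    def append(self, data: bytes):
        # 缓存写数据请求携带的数据，每凑满N个数据单元便产出一个第一校验矩阵
        self.buf += data
        stripes = []
        while len(self.buf) >= self.n * UNIT_SIZE:
            units = [bytes(self.buf[i * UNIT_SIZE:(i + 1) * UNIT_SIZE]) for i in range(self.n)]
            del self.buf[:self.n * UNIT_SIZE]
            parity = [xor_bytes(units) for _ in range(self.m)]    # 示意：真实实现应生成m个不同的校验单元
            stripes.append(units + parity)                        # 一个第一校验矩阵：N个数据单元+M个校验单元
        return stripes
```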
具体的,目标存储节点可以将各个第一校验矩阵包括的校验单元分布在同一个存储节点上,对于各个第一校验矩阵包括的数据单元,则可按照平均分布的原则分布在各个存储节点中。
参见图5,假设该存储系统包括的存储节点为4个服务器,这4个服务器包括的内存为第一介质层。第一纠删码配比为6:2,也即,N=6,M=2,并且,w=4。由于M=2,所以每个服务器的内存中最多允许存储第一校验矩阵的8个单元中的2个单元,基于此,目标存储节点在获得一个第一校验矩阵之后,将第一校验矩阵包括的2个校验单元存储至自身内存中,而将剩余的6个数据单元转发至其他3个服务器中存储,例如,参见图5,在其他3个服务器中的每个服务器上各存储2个数据单元。如此,在将4个第一校验矩阵存储至第一介质层后,4个校验矩阵包含的32个单元在存储系统的存储节点的分布如图6所示。
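图5、图6所示的分布方式可以用下面的示意代码表达“校验单元集中存放在目标存储节点、数据单元平均分布到其余存储节点”的放置逻辑（节点编号、函数名均为示例假设）：

```python
def place_first_stripe(units, node_count=4, parity_node=0):
    # units: 一个第一校验矩阵的8个单元，前6个为数据单元，后2个为校验单元
    placement = {node: [] for node in range(node_count)}
    data_units, parity_units = units[:6], units[6:]
    placement[parity_node].extend(parity_units)               # 2个校验单元存放在同一个存储节点
    others = [n for n in range(node_count) if n != parity_node]
    for idx, unit in enumerate(data_units):                   # 6个数据单元平均分布到其余3个节点
        placement[others[idx % len(others)]].append(unit)
    return placement
```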
图5和图6仅是本申请实施例示例性的给出的一种第一校验矩阵中各个单元的可能分布。可选地,目标存储节点也可以根据存储系统每个存储节点最多允许分布的单元数量和第一校 验矩阵包括的单元的数量,随机将多个单元平均分布在各个存储节点中。也即,不限制校验单元在同一个存储节点上。例如,在上述的示例中,可以将一个第一校验矩阵中的1个数据单元和1个校验单元存储在目标存储节点上,另外一个校验单元和一个数据单元存储至另外一个存储节点中,这样,还剩余4个数据单元,这4个数据单元分别存储在剩余的两个存储节点上。进一步地,当第一校验矩阵中的两个校验单元为校验单元p和校验单元q时,4个第一校验矩阵中的4个校验单元p存储在一个存储节点中,4个校验单元q存储在一个存储节点中,且4个校验单元p所在的存储节点和4个校验单元q所在的存储节点可以不同。
上述介绍了存储系统为图1或图2所示的存储系统时，在第一介质层中按照第一纠删码配比写入数据的过程。可选地，当存储系统为图3所示的存储阵列时，可以由存储阵列中的控制器来确定第一纠删码配比、第二纠删码配比和第一介质层中存储的数据包括的第一校验矩阵的个数w，确定的方法参考前述介绍的方法，本申请实施例在此不再赘述。在确定第一纠删码配比、第二纠删码配比和w之后，控制器可以按照第一纠删码配比和w在第一介质层中写入数据，其中，第一介质层即为该存储系统包括的控制器的内存。需要说明的是，控制器参考前述介绍的方法生成w个第一校验矩阵。根据该存储系统包括的控制器的数量和/或每个控制器包括的内存的数量，参考前述的方法将每个第一校验矩阵包括的数据单元和校验单元分布存储至各个控制器的内存中。
步骤402:当设定条件满足时,将第一介质层中存储的数据按照第二纠删码配比迁移至第二介质层。
在按照第一纠删码配比在第一介质层中写入的数据达到设定条件时,存储节点或控制器将达到设定条件的这部分数据,按照第二纠删码配比写入至第二介质层。其中,第二纠删码配比为X:Y,也即,写至第二介质层中的数据包括X个数据单元和Y个校验单元。
其中,设定条件包括第一介质层的可用容量达到容量阈值,或者,设定条件包括第一介质层存储的数据的访问频率低于热度阈值,或者,设定条件包括到达了一个写入周期的结束时刻,当然,设定条件也可以是其他条件,本申请实施例对此不做限定。其中,可用容量等于第一介质层的总容量减去已存储的数据的数据量。当第一介质层的可用容量达到容量阈值时,将第一介质层中的数据迁移至第二介质层中,有利于第一介质层腾出空间处理更多的数据。
接下来仍以该数据存储方法应用于图1或图2所示的存储系统中为例来对本步骤进行说明。在该存储系统中,第二介质层包括该存储系统内的存储节点所包括的硬盘。
示例性地,在凑齐w个第一校验矩阵,也即在将w个第一校验矩阵包括的数据单元和校验单元写入至第一介质层后,根据第一介质层中存储的数据包括的w个第一校验矩阵中每个第一校验矩阵所包含的N个数据单元获得X个数据单元,X是N的整数倍;计算获得Y个校验单元以生成第二校验矩阵,第二校验矩阵包括X个数据单元和Y个校验单元;将第二校验矩阵写入第二介质层中。其中,上述过程可以通过以下几种不同的实现方式来实现。
第一种实现方式:目标存储节点在通过上述步骤401凑齐w个第一校验矩阵之后,根据w个第一校验矩阵包括的w×N个数据单元计算得到第二校验矩阵中的Y个校验单元。之后,目标存储节点将计算得到的Y个校验单元存储至第二介质层。对于其他存储节点而言,各个存储节点在满足设定条件时,将自身存储的属于w个第一校验矩阵的数据单元存储至第二介质层。如此,存储至第二介质层中的w×N个数据单元即为第二校验矩阵的X个数据单元, 目标存储节点计算得到的Y个校验单元即为第二校验矩阵包括的Y个校验单元。
其中,目标存储节点在将自身存储的数据单元和Y个校验单元存储至第二介质层中时,如果第二介质层包括的硬盘的数量大于第二校验矩阵所包含的单元的总数量,则目标存储节点根据计算得到的校验单元的数量Y,从自身所包括的多个硬盘中选择Y个硬盘。然后目标存储节点将Y个校验单元写入至选择的硬盘中,其中,每个硬盘上写入一个单元。可选地,如果目标存储节点上还存储有属于w个第一校验矩阵的数据单元,则目标存储节点从自身所包括的硬盘中为每个数据单元选择一个硬盘,并将数据单元写入至选择的硬盘上,其中,每个硬盘上也同样写入一个单元。
可选地,如果第二介质层包括的硬盘的数量不大于第二校验矩阵所包含的单元的总数量,则存储节点根据第二校验矩阵包括的校验单元的数量,确定第二介质层中每个硬盘上允许分布的最大单元数。之后,如果目标存储节点上还存储有属于w个第一校验矩阵的数据单元,则按照该最大单元数和自身存储的属于w个第一校验矩阵的数据单元的数量和Y从自身所包括的硬盘中选择多个硬盘,进而将存储的数据单元和校验单元写入至选择的多个硬盘中。当然,如果目标存储节点中未存储属于w个第一校验矩阵的数据单元,则按照最大单元数和Y从自身所包括的硬盘中选择多个硬盘,从而将Y个校验单元写入至选择的硬盘上。在这种情况下,一个硬盘上分布可能存储第二校验矩阵中的多个单元,但是,存储的单元的数量不超过硬盘允许分布的最大单元数。对于除目标存储节点之外的其他存储节点,均可参考上述方法将自身存储的属于w个第一校验矩阵的数据单元写入至第二介质层中。
例如,参见图7,假设第二介质层包括16个硬盘逻辑域,且16个硬盘逻辑域分属于4个存储节点,w个第一校验矩阵一共包括24个数据单元,也即,第二校验矩阵包括的数据单元的数量为24。目标存储节点计算得到的校验单元的数量为2。由于第二校验矩阵包括的校验单元的数量为2,因此可知每个硬盘逻辑域上允许分布的最大单元数为2,如此,各个存储节点在按照上述介绍的方式将24个数据单元和2个校验单元存储至第二介质层时,对于4个存储节点中的3个存储节点,这3个存储节点中每个存储节点的每个硬盘逻辑域内存储2个单元,也即,每个存储节点上一共存储8个单元,而对于另外一个存储节点,则可以在该存储节点的一个硬盘逻辑域上存储2个单元,或者是在该存储节点的两个硬盘逻辑域上各存储一个单元。
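上述放置约束可以用下面的Python示意代码表达（硬盘逻辑域编号、函数名均为示例假设，校验单元p'、q'先各自占用一个逻辑域，数据单元再在“每个逻辑域最多Y个单元”的限制下依次分布）：

```python
def place_second_stripe(data_ids, parity_ids, domain_count=16):
    max_per_domain = len(parity_ids)                 # 每个硬盘逻辑域最多分布的单元数等于校验单元数量
    placement = {d: [] for d in range(domain_count)}
    for j, p in enumerate(parity_ids):               # 校验单元p'、q'先各自放入一个硬盘逻辑域
        placement[j].append(p)
    d = 0
    for unit in data_ids:                            # 数据单元在不超过上限的前提下依次分布
        while len(placement[d]) >= max_per_domain:
            d = (d + 1) % domain_count
        placement[d].append(unit)
    return placement

layout = place_second_stripe([f"d{i}" for i in range(24)], ["p'", "q'"])
```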
第二种实现方式:目标存储节点在凑齐w个第一校验矩阵之后,根据w个第一校验矩阵包括的w×M个校验单元获得第二校验矩阵中的Y个校验单元。之后,目标存储节点将计算得到的Y个校验单元存储至第二介质层。对于其他存储节点而言,各个存储节点在自身缓存中的数据量达到一定阈值时,将自身存储的属于w个第一校验矩阵的数据单元存储至第二介质层。
对于第二种实现方式,分别针对以下几种不同的情况进行说明。
(1)当w个第一校验矩阵包括的所有校验单元均存储在目标存储节点中时,目标存储节点获取自身存储的w×M个校验单元,进而根据该w×M个校验单元获得Y个校验单元。
举例来说,当每个第一校验矩阵包括的M个校验单元分别为校验单元p和校验单元q时,目标存储节点对存储的w个校验单元p进行异或运算或者其他计算方式得到第二校验矩阵中的校验单元p’,对存储的w个校验单元q进行异或运算或者其他计算方式得到第二校验矩阵中的校验单元q’。由此可见,本申请实施例中,通过直接对各个第一校验矩阵包括的M个校 验单元进行计算即能够得到第二校验矩阵的校验单元,相较于根据w个第一校验矩阵中的所有数据单元重新计算校验单元,减少了计算量。并且,由于各个第一校验矩阵中的校验单元均存储在同一个存储节点上,因此,该存储节点能够直接获取存储的校验单元来获得第二校验矩阵中的校验单元,相较于将校验单元分布存储在各个存储节点中的情况,无需跨存储节点中获取校验单元,减少了网络转发量。
在计算获得Y个校验单元之后，目标存储节点参考前述第一种实现方式中介绍的方法将Y个校验单元存储至第二介质层中，或者，将自身存储的属于w个第一校验矩阵的数据单元和Y个校验单元存储至第二介质层中。其他存储节点在满足设定条件之后，将各自存储的数据单元存储至第二介质层中，如此，各个存储节点中存储的w个第一校验矩阵则合并为第二校验矩阵存储至了第二介质层中。
图8是本申请实施例示出的一种根据w个第一校验矩阵合并得到第二校验矩阵的原理示意图。如图8所示，w=4，第一个第一校验矩阵中的前六列元素a1至a6为6个数据单元，剩下两列元素p1和q1为2个校验单元。其中，p1为校验单元p，q1为校验单元q。同理，第二个第一校验矩阵中的a7至a12为6个数据单元，并且，剩下两列元素p2和q2为2个校验单元，以此类推。将每个第一校验矩阵的前6列元素取出，组成第二校验矩阵的24个数据单元。将每个第一校验矩阵中的校验单元p(也即p1至p4)取出，进行异或运算或者采用其他计算方式得到第二校验矩阵中的校验单元p'，将每个第一校验矩阵中的校验单元q(也即q1至q4)取出，进行异或运算或者采用其他计算方式得到第二校验矩阵中的校验单元q'。
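图8所示的合并过程可以用下面的Python示意代码表达（函数名为说明自拟，校验单元的合并仅以异或为例，对应文中“进行异或运算或者采用其他计算方式”）：取出每个第一校验矩阵的前N列数据单元组成第二校验矩阵的数据单元，再对同列的校验单元做异或得到p'和q'。

```python
from functools import reduce

def xor_bytes(blocks):
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

def merge_first_matrices(first_matrices, n=6, m=2):
    # first_matrices: w个第一校验矩阵，每个为[a1..a6, p, q]形式的单元列表
    data_units = [u for matrix in first_matrices for u in matrix[:n]]       # w*n=24个数据单元
    new_parity = [xor_bytes([matrix[n + j] for matrix in first_matrices])   # 同列校验单元异或
                  for j in range(m)]                                        # 得到校验单元p'和q'
    return data_units + new_parity                                          # 第二校验矩阵的X+Y个单元
```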
需要说明的是,图8示例性地说明了将w个第一校验矩阵合并为第二校验矩阵的过程,在一些可能的应用场景中,例如,当第一介质层的性能低于第二介质层的性能,将第一介质层中的数据读取到第二介质层的过程中,可能需要将第二校验矩阵拆分为w个第一校验矩阵,在这种情况下,只需将上述过程逆向执行即能够得到w个第一校验矩阵,本申请实施例对此不做限定。
可选地,当每个第一校验矩阵包括的M个校验单元为校验单元r时,目标存储节点对存储的w个校验单元r进行异或运算,增量计算得到第二校验矩阵中的校验单元r’,之后,目标存储节点可以获取自身以及其他存储节点上存储的各个第一校验矩阵中的数据单元,根据获取的w×N个数据单元计算得到校验单元p’和校验单元q’。将计算得到的校验单元r’、校验单元p’和校验单元q’作为第二校验矩阵中的Y个校验单元。由此可见,在该种实现方式中,Y和M不相等。并且,由于根据数据单元计算校验单元r’的过程较为复杂,因此,本申请实施例中根据各个第一校验矩阵中的校验单元r增量计算得到第二校验矩阵中的校验单元r’,减小了计算开销。另外,根据w×N个数据单元计算得到校验单元p’和校验单元q’,使得第二校验矩阵中包含了3个校验单元,提升了第二介质层中存储的数据的冗余度,使得容错能力得以提升。
例如，参见图9，假设有3个第一校验矩阵，每个第一校验矩阵包括7个数据单元和1个校验单元r，目标存储节点将存储的3个校验单元r进行异或运算得到第二校验矩阵中的校验单元r'。之后，目标存储节点获取自身及其他各个存储节点中存储的21个数据单元，根据这21个数据单元计算得到校验单元p'和校验单元q'。如此，第二校验矩阵包括的Y个校验单元即为生成的校验单元p'、校验单元q'和校验单元r'。由此可见，通过该种实现方式，能够提升第二介质层中存储的数据的冗余度，使得容错能力得以提升，同时还能够减少部分计算开销。
在计算获得Y个校验单元之后，目标存储节点同样参考前述第一种实现方式中介绍的方法将存储的数据单元和Y个校验单元存储至第二介质层中，其他存储节点在自身缓存中的数据量达到一定阈值之后，将各自存储的数据单元存储至第二介质层中，本申请实施例在此不再赘述。
(2)当各个第一校验矩阵包括的M个校验单元分散存储在不同的存储节点中时，目标存储节点从各个存储节点中获取存储的校验单元，进而根据获取到的w×M个校验单元获得Y个校验单元。其中，目标存储节点根据获取到的w×M个校验单元获得Y个校验单元的实现方式参考上述(1)中的实现方式，本申请实施例不再赘述。在获得Y个校验单元之后，目标存储节点参考前述第一种实现方式中介绍的方法将存储的数据单元和Y个校验单元存储至第二介质层中，其他存储节点在自身缓存中的数据量达到一定阈值之后，将各自存储的数据单元存储至第二介质层中，本申请实施例在此不再赘述。
第三种实现方式：目标存储节点在凑齐w个第一校验矩阵之后，将自身存储的属于w个第一校验矩阵的单元写入至第二介质层，其他各个存储节点在自身缓存中存储的数据量达到一定阈值之后，将自身存储的属于w个第一校验矩阵的单元也写入至第二介质层中。之后，目标存储节点获取写入至第二介质层中的w×M个校验单元，进而根据w×M个校验单元计算获得Y个校验单元，将计算得到的Y个校验单元作为第二校验矩阵的Y个校验单元写入至第二介质层中。
其中,如果第二介质层包括的硬盘的数量大于第二校验矩阵所包含的单元的总数量,则各个存储节点在将自身存储的属于w个第一校验矩阵的数据单元和校验单元写入至第二介质层时,对于数据单元,可以为每个数据单元选择一个硬盘,并将每个数据单元写入至为相应数据单元选择的硬盘中,其中,为不同的数据单元选择的硬盘也不同。这样,对于第二校验矩阵包括的X个数据单元,这X个数据单元将被写入至X个硬盘中。对于w个第一校验矩阵中的校验单元,各个存储节点可以将自身存储的校验单元存储至除上述的X个硬盘中的剩余硬盘中。
需要说明的是,在写入校验单元时,可以在每个硬盘上写入一个校验单元,这样,w×M个校验单元将被写入至w×M个硬盘中。或者,所有的校验单元可以写入至一个硬盘中。或者,可以在M个硬盘上写入w×M个校验单元,其中,M个硬盘中每个硬盘上写入的校验单元为第一校验矩阵中位于同一列的校验单元。例如,当M=2时,两个校验单元中的一个校验单元为校验单元p,另一个校验单元为校验单元q,则各个第一校验矩阵中的校验单元p写入至一个硬盘上,校验单元q写入至另一个硬盘上。
各个存储节点在将自身存储的属于w个第一校验矩阵的数据单元和校验单元写入至第二介质层中之后,目标存储节点从第二介质层中获取w×M个校验单元。其中,如果w×M个校验单元将被写入至w×M个硬盘中,则目标存储节点从这w×M个硬盘中读取w×M个校验单元。如果所有的校验单元被写入至一个硬盘中,则目标存储节点从该硬盘中一次性获取w×M个校验单元,这样,能够减少网络通信次数,节省带宽资源。如果w×M个校验单元被写入至M个硬盘中,且M个硬盘中每个硬盘上写入的校验单元为第一校验矩阵中位于同一列的校验单元,则目标存储节点从各个硬盘上读取位于相同列的校验单元,从而得到w×M个校验单元。如此,在一定程度上也可以减少网络通信次数,节省带宽资源。
在获取到w×M个校验单元之后，目标存储节点参考前述第一种实现方式中介绍的方法，根据该w×M个校验单元计算获得Y个校验单元，进而将这Y个校验单元分别写入至Y个硬盘，其中，每个硬盘上写入一个校验单元。并且，写入Y个校验单元的Y个硬盘不为前述写入数据单元的X个硬盘中的硬盘。
可选地，如果第二介质层包括的硬盘的数量不大于第二校验矩阵所包含的单元的总数量，则各个存储节点可参考前述第一种实现方式中介绍的方法，在一个硬盘上写入两个或两个以上的单元，只要不超出允许存储的最大单元数即可。同样的，在这种情况下，各个第一校验矩阵包括的M个校验单元可以存储在同一个存储节点下的硬盘上，或者，各个第一校验矩阵包括的M个校验单元中位于同一列的校验单元可以存储在一个存储节点下的硬盘中，例如，存储在一个存储节点下的同一个硬盘逻辑域上，或者是存储在一个存储节点下的一个物理硬盘上，以此来减少计算第二校验矩阵中Y个校验单元时所需的网络转发次数。
举例说明,图10是本申请实施例提供的一种在第二介质层中按照第二纠删码配比写入数据的示意图。如图10所示,第一校验矩阵的个数w=4,每个第一校验矩阵包括6个数据单元和2个校验单元。2个校验单元分别为校验单元p和校验单元q。第二介质层包括分布在4个存储节点(a1至a4)上的16个硬盘逻辑域,每个存储节点上分布有4个硬盘逻辑域,每个硬盘逻辑域内包括16个物理硬盘。由于校验单元的数量为2,因此,每个硬盘逻辑域上最多允许分布第二校验矩阵中的2个单元。基于此,首先从16个硬盘逻辑域中选择两个硬盘逻辑域,例如,选择的两个硬盘逻辑域分别为存储节点a2上的硬盘逻辑域a21和存储节点a4上的硬盘逻辑域a41。之后,存储有各个第一校验矩阵的校验单元p的各个存储节点将自身存储的校验单元p写入至硬盘逻辑域a21的物理硬盘中,例如,每个第一校验矩阵中的校验单元p均写入至硬盘逻辑域a21的第一个物理硬盘中。同理,存储有各个第一校验矩阵的校验单元q的各个存储节点将自身存储的校验单元q写入至硬盘逻辑域a41的第一个物理硬盘中。之后,对于硬盘逻辑域a21,目标存储节点或存储节点a2对写入至硬盘逻辑域a21上的4个第一校验矩阵中的4个校验单元p进行异或运算,得到第二校验矩阵中的校验单元p’,将该校验单元p’存储在硬盘逻辑域a21上。由于每个硬盘逻辑域上最多允许存储第二校验矩阵中的两个单元,所以,在将计算得到的第二校验矩阵中的校验单元p’写入至硬盘逻辑域a21上之后,该硬盘逻辑域a21上最多还能再存储第二校验矩阵中的一个数据单元。同理,对于硬盘逻辑域a41,同样可以由目标存储节点或存储节点a4对其中存储的4个校验单元q进行异或运算,增量计算得到第二校验矩阵中的校验单元q’,并将其存储在硬盘逻辑域a41中,如此,硬盘逻辑域a41上最多也只能存储第二校验矩阵中的一个数据单元。之后,对于每个第一校验矩阵包括的6个数据单元,各个存储节点根据每个硬盘逻辑域上最多分布第二校验矩阵中的两个单元的原则,将24个数据单元分布在4个存储节点包括的16个硬盘逻辑域上。
上述介绍了存储系统为图1或图2所示的存储系统时,在第一介质层中按照第一纠删码配比写入数据的过程。可选地,当存储系统为图3所示的存储阵列中,则上述存储节点执行的操作可以由控制器来执行,从而将第一介质层中数据包括的w个第一校验矩阵中的数据单元和校验单元合并为第二校验矩阵写入至第二介质层中,本申请实施例对此不再赘述。
在按照上述的数据存储方法存储数据后,当第一介质层包括的节点或第二介质层包括的硬盘发生故障时,如果数据已经存入至了第二介质层,也即,第二校验矩阵已经生成,则根据故障点的个数、故障位置以及第二校验矩阵中各个单元的分布位置,从第二介质层中读取 除故障点之外的其他位置上的数据单元和校验单元进行重构,从而恢复出故障点中的数据。可选地,如果数据已存入至第一介质层,但是还未存入至第二介质层,则根据故障点的个数、故障位置以及各个第一校验矩阵中各个单元的分布位置,从第一介质层中读取未发生故障的位置上的数据单元和校验单元进行重构,从而恢复出第一介质层中故障点中的数据。
在本申请实施例中,第一纠删码配比对应的数据单元的个数小于第二纠删码对应的数据单元的个数,也即,第一纠删码配比为小比例的配比,而第二纠删码配比为大比例的配比。在此基础上,在第一介质层中按照第一纠删码配比存储数据,在第二介质层中按照第二纠删码配比存储数据,而第一介质层的性能高于第二介质层,也就是说,在高性能介质层采用较小的纠删码配比来存储数据,而在低性能介质层采用较大的纠删码配比来存储数据。由于高性能介质层接收到的IO粒度比较小,所以,在高性能介质层通过较小的纠删码配比来存储数据时,每当接收到数据的尺寸达到该纠删码配比对应的N个数据单元的尺寸时,即可以凑满一个分条(N个数据单元和M个校验单元即可组成一个分条),相较于较大的纠删码配比而言,小的纠删码配比更容易凑满分条,从而使得分条中补0的数据量减少,降低了写放大,提高了存储空间利用率。例如,在高性能介质层中采用6:2的纠删码配比来存储数据,相较于采用24:2来存储数据,在指定的时间段内,根据接收到的小粒度的IO请求,凑齐6个数据单元比凑齐24个数据单元更为容易,这样,就不必在凑不齐24个数据单元时进行补0,也即,使得分条中补0的数据量减少,降低了分条中冗余数据量的占比,降低了写放大,提高了存储空间利用率。另外,在低性能介质层采用较大的纠删码配比来存储数据,能够减少冗余数据在存储空间中的占比,从而提高存储空间利用率。
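文中“小配比更容易凑满分条、减少补0数据量”的结论可以用下面的简单计算示意（数值仅为举例）：

```python
def zero_padding_units(received_units, n):
    # 已收到received_units个数据单元、配比中数据单元数量为n时，凑满一个分条需补0的单元数
    return (n - received_units % n) % n

print(zero_padding_units(6, 6))     # 输出0：6:2配比下恰好凑满分条
print(zero_padding_units(6, 24))    # 输出18：24:2配比下需补0的数据量明显更多
```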
另外,在本申请实施例中,在第一介质层中按照第一纠删码配比存储的一份数据能够直接转换为符合第二纠删码配比的一份数据,进而存储至第二介质层,也即,不同介质层的数据可以相互计算转换,节省了存储系统的计算资源。
上述实施例主要介绍了第一介质层性能高于第二介质层时,从第一介质层向第二介质层中写入数据的实现过程。当第一介质层性能低于第二介质层时,将第一介质层存储的数据迁移至第二介质层相当于是一个数据读取的过程,在这种情况下,第一纠删码配比中的N将大于第二纠删码配比中的X,相应地读取过程将是前述写入过程的逆过程,本申请实施例在此不再赘述。
接下来对本申请实施例提供的数据存储装置进行介绍。
参见图11,本申请实施例提供了一种数据存储装置,该装置应用于前述介绍的存储设备中,存储设备可以是图1或图2所示的存储系统中的任意一个存储节点,也可以是图3所示存储阵列等。该存储设备包括前面描述的第一介质层和第二介质层,第一介质层的性能不同于第二介质层的性能。该装置包括:
存储模块1101,用于接收数据,按照第一纠删码配比将所述数据存储在所述第一介质层中。具体可参考前述实施例中的步骤401。
迁移模块1102,用于当设定条件满足时,从所述第一介质层中获取所述数据,将所述数据按照第二纠删码配比迁移至所述第二介质层。其中,所述第一纠删码配比对应N个数据单元和M个校验单元,N和M为大于或等于1的整数,所述第二纠删码配比对应X个数据单元和Y个校验单元,X和Y为大于或等于1的整数,N和M的比值不等于X和Y的比值。 具体可参考前述实施例中的步骤402;存储模块1101和迁移模块1102可以由所述存储设备中的处理器运行存储器里的程序指令来实现。
可选地,第一介质层的性能高于第二介质层的性能,N和M的比值小于X和Y的比值。
可选地,第一介质层存储的数据包括多个子数据,每个子数据对应一个第一校验矩阵,每个第一校验矩阵包括N个数据单元和M个校验单元。
可选地,迁移模块1102具体用于:
根据每个第一校验矩阵所包含的N个数据单元获得X个数据单元,X是N的整数倍;
计算获得Y个校验单元以生成第二校验矩阵,第二校验矩阵包括X个数据单元和Y个校验单元;
将第二校验矩阵写入第二介质层中。
可选地,计算获得Y个校验单元包括根据X个数据单元计算获得Y个校验单元。
可选地,迁移模块具体还用于:根据每个第一校验矩阵所包含的M个校验单元计算Y个校验单元,Y等于M。
可选地,每个第一校验矩阵中的M个校验单元均存储在存储系统的第一存储节点中,该装置还用于从第一存储节点中获取每个第一校验矩阵中的M个校验单元。
可选地,当M大于或等于2时,M个校验单元包括校验单元p和校验单元q,其中,每个第一校验矩阵中的校验单元p均存储在存储系统的第一存储节点中,每个第一校验矩阵中的校验单元q均存储在存储系统的第二存储节点中,该装置还用于:
从第一存储节点获取每个第一校验矩阵中的校验单元p;
从第二存储节点获取每个第一校验矩阵中的校验单元q;
根据每个第一校验矩阵所包含的M个校验单元计算Y个校验单元具体包括：根据每个第一校验矩阵中的校验单元p和每个第一校验矩阵中的校验单元q计算获得Y个校验单元，Y个校验单元包括第二校验矩阵中的校验单元p'和校验单元q'。
可选地，该装置还用于：根据存储系统的拓扑结构和存储系统的容错能力确定第一纠删码配比或第二纠删码配比，其中，拓扑结构用于指示存储系统所包含的存储节点的数量，容错能力用于指示存储系统容忍出错的存储节点的数量。
可选地,设定条件包括第一介质层的可用容量达到容量阈值。
可选地,设定条件包括第一介质层存储的数据的访问频率低于热度阈值。
综上所述,在本申请实施例中,第一介质层和第二介质层的性能不同,基于此,在第一介质层和第二介质层中按照不同的纠删码配比来进行数据存储。由于不同的纠删码配比对应的写放大不同,所导致的存储空间利用率也不同,因此,根据介质层的性能的不同选取不同的纠删码配比进行数据存储能够更好的发挥相应介质层的存储性能,有效的提高存储空间利用率。
需要说明的是:上述实施例提供的数据存储装置在进行数据存储时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的数据存储装置与数据存储方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意结合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如:同轴电缆、光纤、数据用户线(Digital Subscriber Line,DSL))或无线(例如:红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如:软盘、硬盘、磁带)、光介质(例如:数字通用光盘(Digital Versatile Disc,DVD))、或者半导体介质(例如:固态硬盘(Solid State Disk,SSD))等。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
应当理解的是,本文提及的“至少一个”是指一个或多个,“多个”是指两个或两个以上。在本文的描述中,除非另有说明,“/”表示或的意思,例如,A/B可以表示A或B;本文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。
以上所述并不用以限制本申请实施例,凡在本申请实施例的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请实施例的保护范围之内。

Claims (28)

  1. 一种数据存储方法,其特征在于,所述方法应用于存储系统中,所述存储系统包括第一介质层和第二介质层,所述第一介质层的性能不同于所述第二介质层的性能,所述方法包括:
    在所述第一介质层中,按照第一纠删码配比存储数据;
    当设定条件满足时,将所述第一介质层中存储的所述数据按照第二纠删码配比迁移至所述第二介质层;
    其中,所述第一纠删码配比对应N个数据单元和M个校验单元,N和M为大于或等于1的整数,所述第二纠删码配比对应X个数据单元和Y个校验单元,X和Y为大于或等于1的整数,N和M的比值不等于X和Y的比值。
  2. 根据权利要求1所述的方法,其特征在于,所述第一介质层的性能高于所述第二介质层的性能,N和M的比值小于X和Y的比值。
  3. 根据权利要求2所述的方法,其特征在于,所述第一介质层存储的所述数据包括多个子数据,每个子数据对应一个第一校验矩阵,每个第一校验矩阵包括所述N个数据单元和所述M个校验单元;
    所述将所述第一介质层中存储的所述数据按照第二纠删码配比迁移至所述第二介质层包括:
    根据每个第一校验矩阵所包含的N个数据单元获得所述X个数据单元,所述X是所述N的整数倍;
    计算获得Y个校验单元以生成第二校验矩阵,所述第二校验矩阵包括所述X个数据单元和所述Y个校验单元;
    将所述第二校验矩阵写入所述第二介质层中。
  4. 根据权利要求3所述的方法,其特征在于,所述计算获得Y个校验单元包括根据所述X个数据单元计算获得所述Y个校验单元。
  5. 根据权利要求3所述的方法,其特征在于,所述计算获得Y个校验单元包括根据每个第一校验矩阵所包含的M个校验单元计算所述Y个校验单元,所述Y等于所述M。
  6. 根据权利要求5所述的方法,其特征在于,每个第一校验矩阵中的所述M个校验单元均存储在所述存储系统的第一存储节点中,所述方法还包括:从所述第一存储节点中获取每个第一校验矩阵中的M个校验单元。
  7. 根据权利要求5所述的方法,其特征在于,当M大于或等于2时,所述M个校验单元包括校验单元p和校验单元q,其中,每个第一校验矩阵中的校验单元p均存储在所述存储系统的第一存储节点中,每个第一校验矩阵中的校验单元q均存储在所述存储系统的第二存储节点中,所述方法还包括:
    从所述第一存储节点获取每个第一校验矩阵中的校验单元p;
    从所述第二存储节点获取每个第一校验矩阵中的校验单元q;
    所述根据每个第一校验矩阵所包含的M个校验单元计算所述Y个校验单元具体包括：根据每个第一校验矩阵中的校验单元p和每个第一校验矩阵中的校验单元q计算获得所述Y个校验单元，所述Y个校验单元包括所述第二校验矩阵中的校验单元p'和校验单元q'。
  8. 根据权利要求1-7任一所述的方法,其特征在于,所述方法还包括:根据所述存储系统的拓扑结构和所述存储系统的容错能力确定所述第一纠删码配比或所述第二纠删码配比,其中,所述拓扑结构用于指示所述存储系统所包含的存储节点的数量,所述容错能力用于指示所述存储系统容忍出错的存储节点的数量。
  9. 根据权利要求1-8任一所述的方法,其特征在于,所述设定条件包括所述第一介质层的可用容量达到容量阈值。
  10. 根据权利要求1-8任一所述的方法,其特征在于,所述设定条件包括所述第一介质层存储的数据的访问频率低于热度阈值。
  11. 一种存储系统,其特征在于,所述存储系统包括第一介质层和第二介质层,所述第一介质层的性能不同于所述第二介质层的性能;
    在所述第一介质层中,数据是按照第一纠删码配比存储的;
    当设定条件满足时,所述第一介质层中的所述数据被按照第二纠删码配比迁移至所述第二介质层;
    其中,所述第一纠删码配比对应N个数据单元和M个校验单元,N和M为大于或等于1的整数,所述第二纠删码配比对应X个数据单元和Y个校验单元,X和Y为大于或等于1的整数,N和M的比值不等于X和Y的比值。
  12. 根据权利要求11所述的系统,其特征在于,所述第一介质层的性能高于所述第二介质层的性能,N和M的比值小于X和Y的比值。
  13. 根据权利要求12所述的系统,其特征在于,所述第一介质层存储的所述数据包括多个子数据,每个子数据对应一个第一校验矩阵,每个第一校验矩阵包括所述N个数据单元和所述M个校验单元;
    所述第二介质层中的所述数据对应第二校验矩阵,所述第二校验矩阵包括所述X个数据单元和所述Y个校验单元;其中,所述X个数据单元来自每个第一校验矩阵所包含的N个数据单元,所述X是所述N的整数倍。
  14. 根据权利要求13所述的系统,其特征在于,所述Y个校验单元是根据所述X个数据单元计算获得的。
  15. 根据权利要求13所述的系统,其特征在于,所述Y个校验单元是根据每个第一校验矩阵所包含的M个校验单元计算获得的,Y等于M。
  16. 根据权利要求15所述的系统,其特征在于,每个第一校验矩阵中的所述M个校验单元均存储在所述存储系统的第一存储节点中,所述第一存储节点用于获取每个第一校验矩阵中的M个校验单元,并根据每个第一校验矩阵所包含的M个校验单元计算获得所述Y个校验单元。
  17. 根据权利要求11-16任一所述的系统,其特征在于,所述第一纠删码配比或所述第二纠删码配比是根据所述存储系统的拓扑结构和所述存储系统的容错能力确定的,其中,所述拓扑结构用于指示所述存储系统所包含的存储节点的数量,所述容错能力用于指示所述存储系统容忍出错的存储节点的数量。
  18. 根据权利要求11-17任一所述的系统,其特征在于,所述设定条件包括所述第一介质层的可用容量达到容量阈值。
  19. 根据权利要求11-17任一所述的系统,其特征在于,所述设定条件包括所述第一介质层存储的数据的访问频率低于热度阈值。
  20. 一种存储设备,其特征在于,所述存储设备包括第一介质层、第二介质层和处理器;
    所述处理器用于:
    接收数据;
    按照第一纠删码配比将所述数据存储在所述第一介质层中;
    当设定条件满足时,从所述第一介质层中获取所述数据,将所述数据按照第二纠删码配比迁移至所述第二介质层;
    其中,所述第一纠删码配比对应N个数据单元和M个校验单元,N和M为大于或等于1的整数,所述第二纠删码配比对应X个数据单元和Y个校验单元,X和Y为大于或等于1的整数,N和M的比值不等于X和Y的比值。
  21. 根据权利要求20的存储设备,其特征在于,所述第一介质层的性能高于所述第二介质层的性能,N和M的比值小于X和Y的比值。
  22. 根据权利要求21的存储设备,其特征在于,所述第一介质层存储的所述数据包括多个子数据,每个子数据对应一个第一校验矩阵,每个第一校验矩阵包括所述N个数据单元和所述M个校验单元;
    所述处理器在将所述数据按照第二纠删码配比迁移至所述第二介质层时具体用于:
    根据每个第一校验矩阵所包含的N个数据单元获得所述X个数据单元,所述X是所述N的整数倍;
    计算获得Y个校验单元以生成第二校验矩阵,所述第二校验矩阵包括所述X个数据单元和所述Y个校验单元;
    将所述第二校验矩阵写入所述第二介质层中。
  23. 根据权利要求22的存储设备,其特征在于,所述处理器在计算获得Y个校验单元时具体用于根据所述X个数据单元计算获得所述Y个校验单元。
  24. 根据权利要求22的存储设备,其特征在于,所述处理器在计算获得Y个校验单元时具体用于根据每个第一校验矩阵所包含的M个校验单元计算所述Y个校验单元,所述Y等于所述M。
  25. 根据权利要求24的存储设备,其特征在于,每个第一校验矩阵中的所述M个校验单元均存储在所述第二介质层的同一个硬盘或同一个硬盘逻辑域中。
  26. 根据权利要求20-25任一所述的存储设备,其特征在于,所述设定条件包括所述第一介质层的可用容量达到容量阈值。
  27. 根据权利要求20-25任一所述的存储设备,其特征在于,所述设定条件包括所述第一介质层存储的数据的访问频率低于热度阈值。
  28. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得所述计算机执行权利要求1-10任一所述的数据存储方法。
PCT/CN2020/120835 2020-07-10 2020-10-14 数据存储方法、存储系统、存储设备及存储介质 WO2022007225A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20944002.3A EP4170499A4 (en) 2020-07-10 2020-10-14 DATA STORAGE METHOD, STORAGE SYSTEM, STORAGE DEVICE AND STORAGE MEDIUM
US18/149,288 US20230137007A1 (en) 2020-07-10 2023-01-03 Data storage method, storage system, storage device, and storage medium

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202010661972.0 2020-07-10
CN202010661972 2020-07-10
CN202010851132.0A CN113918378A (zh) 2020-07-10 2020-08-21 数据存储方法、存储系统、存储设备及存储介质
CN202010851132.0 2020-08-21

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/149,288 Continuation US20230137007A1 (en) 2020-07-10 2023-01-03 Data storage method, storage system, storage device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022007225A1 true WO2022007225A1 (zh) 2022-01-13

Family

ID=79232468

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2020/120835 WO2022007225A1 (zh) 2020-07-10 2020-10-14 数据存储方法、存储系统、存储设备及存储介质
PCT/CN2021/105640 WO2022007968A1 (zh) 2020-07-10 2021-07-12 分条管理方法、存储系统、分条管理装置及存储介质

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/105640 WO2022007968A1 (zh) 2020-07-10 2021-07-12 分条管理方法、存储系统、分条管理装置及存储介质

Country Status (4)

Country Link
US (2) US20230137007A1 (zh)
EP (2) EP4170499A4 (zh)
CN (2) CN113918378A (zh)
WO (2) WO2022007225A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115686342A (zh) * 2021-07-22 2023-02-03 华为技术有限公司 存储系统中的数据存储方法以及装置
CN114816837B (zh) * 2022-06-28 2022-12-02 苏州浪潮智能科技有限公司 一种纠删码融合方法、系统、电子设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150236725A1 (en) * 2014-02-20 2015-08-20 Quantum Corporation Dynamically Configuring Erasure Code Redundancy And Distribution
US20150378820A1 (en) * 2014-06-25 2015-12-31 Quantum Corporation High Reliability Erasure Code Distribution
CN105487823A (zh) * 2015-12-04 2016-04-13 华为技术有限公司 一种数据迁移的方法及装置
CN110531936A (zh) * 2019-08-29 2019-12-03 西安交通大学 基于多种存储介质的分布式纠删码混合存储的林型存储结构及方法
CN111176880A (zh) * 2018-11-09 2020-05-19 杭州海康威视系统技术有限公司 磁盘分配方法、装置和可读存储介质

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577274B (zh) * 2012-07-31 2016-07-06 国际商业机器公司 管理存储器阵列的方法和装置
US9595979B2 (en) * 2015-01-20 2017-03-14 International Business Machines Corporation Multiple erasure codes for distributed storage
CN105630423B (zh) * 2015-12-25 2018-11-27 华中科技大学 一种基于数据缓存的纠删码集群存储扩容方法
US9817713B2 (en) * 2016-02-04 2017-11-14 International Business Machines Corporation Distributed cache system utilizing multiple erasure codes
EP3859507A1 (en) * 2016-04-07 2021-08-04 Huawei Technologies Co., Ltd. Method for processing stripe in storage device and storage device
CN106201766B (zh) * 2016-07-25 2018-03-20 深圳市中博科创信息技术有限公司 数据存储控制方法及数据服务器
CN107436733B (zh) * 2017-06-29 2020-11-06 华为技术有限公司 分片管理方法和分片管理装置
CN109189326B (zh) * 2018-07-25 2020-09-08 华为技术有限公司 分布式集群的管理方法和装置
CN110865901B (zh) * 2018-08-28 2021-05-04 华为技术有限公司 组建ec条带的方法和装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150236725A1 (en) * 2014-02-20 2015-08-20 Quantum Corporation Dynamically Configuring Erasure Code Redundancy And Distribution
US20150378820A1 (en) * 2014-06-25 2015-12-31 Quantum Corporation High Reliability Erasure Code Distribution
CN105487823A (zh) * 2015-12-04 2016-04-13 华为技术有限公司 一种数据迁移的方法及装置
CN111176880A (zh) * 2018-11-09 2020-05-19 杭州海康威视系统技术有限公司 磁盘分配方法、装置和可读存储介质
CN110531936A (zh) * 2019-08-29 2019-12-03 西安交通大学 基于多种存储介质的分布式纠删码混合存储的林型存储结构及方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4170499A4

Also Published As

Publication number Publication date
EP4174630A4 (en) 2023-11-29
US20230163789A1 (en) 2023-05-25
US20230137007A1 (en) 2023-05-04
EP4170499A4 (en) 2023-11-29
CN113918083A (zh) 2022-01-11
EP4170499A1 (en) 2023-04-26
CN113918378A (zh) 2022-01-11
WO2022007968A1 (zh) 2022-01-13
EP4174630A1 (en) 2023-05-03

Similar Documents

Publication Publication Date Title
US20210173741A1 (en) Distributed multi-level protection in a hyper-converged infrastructure
USRE48222E1 (en) Reconstruct reads in a raid array with dynamic geometries
US10001944B2 (en) Systems and methods for data organization in storage systems using large erasure codes
US10896089B2 (en) System level data-loss protection using storage device local buffers
US8560879B1 (en) Data recovery for failed memory device of memory device array
US8677063B2 (en) Parity declustered storage device array with partition groups
US11422703B2 (en) Data updating technology
US10387307B2 (en) Lock-free raid implementation in multi-queue architecture
US20230137007A1 (en) Data storage method, storage system, storage device, and storage medium
US11074124B2 (en) Method and system for enhancing throughput of big data analysis in a NAND-based read source storage
US20200341873A1 (en) Data access method, apparatus and computer program product
EP4145265A2 (en) Storage system
WO2023020136A1 (zh) 存储系统中的数据存储方法及装置
CN109840051B (zh) 一种存储系统的数据存储方法及装置
US20190042365A1 (en) Read-optimized lazy erasure coding
WO2024001863A1 (zh) 一种数据处理方法及相关设备
WO2023109571A1 (zh) 存储级内存、数据处理方法和处理器系统
CN111831230A (zh) 一种面向高性能计算及大数据的智能存储系统
Thomasian RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial
WO2024037010A1 (zh) 数据存储方法、装置及存储介质
Akutsu et al. MEC: Network optimized multi-stage erasure coding for scalable storage systems
Huang et al. Optimizing erasure-coded data archival for replica-based storage clusters
US12008267B2 (en) Storage management system and method
Pan et al. ACS: an alternate coding scheme to improve degrade read performance for SSD-based RAID5 systems
KR20230101405A (ko) 분산 파일 시스템에서 이레이저 코딩을 가속화하는 전자 장치 및 그 동작 방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20944002

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020944002

Country of ref document: EP

Effective date: 20230118

NENP Non-entry into the national phase

Ref country code: DE