WO2018176265A1 - Access method for distributed storage system, and related apparatus and related system - Google Patents

Access method for distributed storage system, and related apparatus and related system

Info

Publication number
WO2018176265A1
WO2018176265A1 (PCT/CN2017/078579)
Authority
WO
WIPO (PCT)
Prior art keywords
memory
storage node
state
data
storage
Prior art date
Application number
PCT/CN2017/078579
Other languages
English (en)
French (fr)
Inventor
Yang Dingguo (杨定国)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2017/078579 priority Critical patent/WO2018176265A1/zh
Priority to CN201780000202.8A priority patent/CN108934187B/zh
Priority to EP17904157.9A priority patent/EP3537687B1/en
Priority to JP2019521779A priority patent/JP6833990B2/ja
Priority to SG11201901608VA priority patent/SG11201901608VA/en
Publication of WO2018176265A1 publication Critical patent/WO2018176265A1/zh
Priority to US16/574,421 priority patent/US11307776B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F 11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F 11/1076 Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/062 Securing storage systems
    • G06F 3/0622 Securing storage systems in relation to access
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0653 Monitoring storage devices or systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • The present application relates to the field of computer technology, and in particular to an access method for a distributed storage system, and a related apparatus and related system.
  • A traditional network storage system uses a centralized storage server to store all data. Under this architecture, the storage server becomes a system performance bottleneck and a weak point for reliability and security, making it difficult to meet the needs of large-scale storage applications.
  • Distributed storage systems distribute data across multiple independent devices (i.e., storage nodes).
  • A distributed storage system adopts a scalable system structure and utilizes multiple storage nodes to share the storage load.
  • A distributed storage system is beneficial for improving system reliability, availability, access efficiency, and scalability.
  • Erasure code (English: erasure code, abbreviation: EC) technology can effectively improve system performance while ensuring system reliability. Other similar redundancy check techniques may also be used by distributed storage systems.
  • A stripe (for example, an EC stripe) includes N data strips and M check strips (for example, 2 data strips and 1 check strip). The M check strips are calculated based on the N data strips, and a data strip can be up to 1 Mb in length.
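As an illustrative sketch of the stripe structure above (not the patent's actual coding scheme), the following uses XOR parity, the simplest erasure code, for the N = 2, M = 1 case; real deployments typically use Reed-Solomon codes when M > 1. All function and variable names here are invented for the example.

```python
# Minimal stripe sketch: N = 2 data strips, M = 1 XOR check strip.

def make_check_strip(data_strips):
    """Compute one XOR check strip over equal-length strips."""
    check = bytearray(len(data_strips[0]))
    for strip in data_strips:
        for i, b in enumerate(strip):
            check[i] ^= b
    return bytes(check)

def recover_strip(surviving_strips):
    """Recover one lost strip: XOR of all surviving strips (data + check)."""
    return make_check_strip(surviving_strips)

data = [b"hello-strip-0000", b"world-strip-0001"]   # N = 2 data strips
check = make_check_strip(data)                       # M = 1 check strip
# If data[0] is lost, it can be rebuilt from data[1] and the check strip:
assert recover_strip([data[1], check]) == data[0]
```

With XOR parity, any single lost strip of the stripe is recoverable, which is the property the read and write flows below rely on.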
  • Embodiments of the present application provide a data access method and related apparatus and system.
  • An embodiment of the present application provides a data reading method, where the method is applied to a distributed storage system including m storage nodes, each storage node including at least one memory (the memory may be, for example, a hard disk or another form of memory), m being an integer greater than 1.
  • The method may include, for example: when the terminal requests to read first data from the distributed storage system, the first storage node of the m storage nodes receives a first read data request from the terminal.
  • The first read data request carries first location information, which describes the location of the first data in the data body to which it belongs.
  • The first storage node determines, according to the first location information, that the memory where the first data is located is a first memory and that the first memory belongs to the first storage node. The first storage node then determines the state of the first memory as currently recorded in the state information aggregate cached by the first storage node (the first memory may be, for example, in a trusted access state or an untrusted access state).
  • When the first storage node determines that the first memory is in the trusted access state, it reads the first data from the first memory and sends to the terminal a first read data response responding to the first read data request, the first read data response carrying the read first data.
  • The first data is part or all of the data in the strip to which it belongs.
  • The data body in each embodiment of the present application may be, for example, a file, an object, a file segment, a database record, a data segment, or the like.
  • The state information aggregate in the embodiments of the present application is used to record the states of hard disks and other memories. The format of the state information aggregate may be, for example, a file or another data format. If the format is a file, the state information aggregate may also be referred to as a "state file".
  • Since a state information aggregate for recording memory states is introduced into the distributed storage system, it can be used to record and manage the access states of the memories of the distributed storage system.
  • For a memory recorded in the state information aggregate as being in the trusted access state, the requested data can be read directly from it and fed back to the terminal, without the cumbersome step of first reading all strips of the related stripe to verify correctness, as in the traditional approach. Reducing the amount of data read helps reduce the memory load, which in turn helps improve system performance.
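The fast-path/slow-path distinction above can be sketched as follows. The state names, data layout, and function signature are assumptions made for illustration; the patent does not prescribe an implementation.

```python
TRUSTED = "trusted_access"
UNTRUSTED = "untrusted_access"

def handle_read(mem_id, offset, length, state_aggregate, memories):
    """memories: mem_id -> bytes; state_aggregate: mem_id -> access state.
    Returns (data, needed_stripe_verification)."""
    if state_aggregate.get(mem_id) == TRUSTED:
        # Fast path: read only the requested range, no stripe verification.
        return memories[mem_id][offset:offset + length], False
    # Slow path: an actual node would read every strip of the stripe and
    # run the check computation before trusting this data.
    return memories[mem_id][offset:offset + length], True

mems = {"disk-1": b"abcdefgh"}
states = {"disk-1": TRUSTED}
assert handle_read("disk-1", 2, 3, states, mems) == (b"cde", False)
states["disk-1"] = UNTRUSTED
assert handle_read("disk-1", 2, 3, states, mems) == (b"cde", True)
```

The point of the fast path is that it touches exactly one memory and reads exactly the requested bytes.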
  • The first storage node may first determine whether the state information aggregate has been cached. If it has been cached, the first storage node determines the state of the first memory currently recorded in the cached state information aggregate; if it has not been cached, and the first storage node is in a connected state with the management node that issues the state information aggregate, the first storage node first acquires the state information aggregate from the management node and caches it, and then determines the state of the first memory currently recorded in the cached state information aggregate.
  • Alternatively, the first storage node may first determine whether the state information aggregate has been cached. If it has been cached, the first storage node may determine whether the age of the cached state information aggregate exceeds a duration threshold (the duration threshold may be, for example, 2 minutes, 10 minutes, or 30 minutes).
  • If the age of the cached state information aggregate exceeds the duration threshold, and the first storage node is in a connected state with the management node that issues the state information aggregate, the first storage node first acquires the state information aggregate from the management node, updates the previously cached state information aggregate with the newly acquired one, and then determines the state of the first memory currently recorded in the cached state information aggregate. If the age of the cached state information aggregate does not exceed the duration threshold, or the first storage node is not in a connected state with the management node, the first storage node directly determines the state of the first memory currently recorded in the cached state information aggregate.
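The duration-threshold refresh logic above might look like the following sketch, assuming a seconds-based threshold and a simple dict cache (both illustrative choices, not the patent's implementation):

```python
import time

DURATION_THRESHOLD = 10 * 60  # e.g. 10 minutes, expressed in seconds

def get_state_aggregate(cache, fetch_from_manager, manager_connected, now=None):
    """cache: {'aggregate': ..., 'fetched_at': ...} or None if nothing cached.
    fetch_from_manager: callable returning a fresh aggregate from the manager."""
    now = time.time() if now is None else now
    if cache is None:
        if not manager_connected:
            return None                      # nothing cached, manager unreachable
        cache = {"aggregate": fetch_from_manager(), "fetched_at": now}
    elif now - cache["fetched_at"] > DURATION_THRESHOLD and manager_connected:
        # Cached copy is stale and the management node is reachable: refresh.
        cache = {"aggregate": fetch_from_manager(), "fetched_at": now}
    # Otherwise (fresh cache, or stale but manager unreachable): use the cache.
    return cache["aggregate"]
```

A stale-but-unreachable cache is still used, matching the text: the node only refreshes when the management node is connected.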
  • The method may further include: in the case that the first memory belongs to a second storage node of the m storage nodes, the first storage node determines that the first memory is in the trusted access state (that is, the state information aggregate cached by the first storage node records that the first memory is in the trusted access state), and the first storage node is in a connected state with the management node that issues the state information aggregate, the first storage node forwards the first read data request to the second storage node. (When the first storage node is connected to the management node, its cached state information aggregate is relatively likely to be the management node's latest state information aggregate, so the currently cached aggregate is relatively likely to be currently valid, and the determination that the first memory is in the trusted access state is likely to match the actual condition of the first memory.)
  • After receiving the first read data request from the first storage node, the second storage node reads the first data from the first memory and sends, to the first storage node or the terminal, a first read data response responding to the first read data request, the first read data response carrying the read first data.
  • The method may further include: in the case that the first memory belongs to a second storage node of the m storage nodes and is in the trusted access state (specifically, for example, the state information aggregate currently cached by the first storage node records that the first memory is in the trusted access state), the first storage node may add, to the first read data request, the identifier of the state information aggregate it currently caches, and send the first read data request carrying that identifier to the second storage node.
  • After receiving the first read data request, the second storage node may compare the identifier of the state information aggregate carried by the first read data request with the identifier of the state information aggregate it currently caches. When the two identifiers are the same, the state information aggregates currently cached by the first storage node and the second storage node are the same; in this case, the state information aggregate currently cached by the first storage node is very likely to be currently valid (and if the first storage node is also connected to the management node, this likelihood is further increased). The first storage node's determination, based on its currently cached state information aggregate, that the first memory is in the trusted access state is therefore most likely consistent with the actual condition of the first memory, so the second storage node reads the first data from the first memory and sends, to the first storage node or the terminal, a first read data response responding to the first read data request, the first read data response carrying the read first data.
  • The identifier of the state information aggregate includes, for example, a version number of the state information aggregate and/or a data digest of the state information aggregate. It can be understood that the identifier of a version of the state information aggregate may be any information that can represent that version; the version number and/or data digest are merely examples of such an identifier.
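One hedged way to realize such an identifier, assuming a JSON serialization and a SHA-256 digest (neither is mandated by the source; any version-representing information would do), is:

```python
import hashlib
import json

def aggregate_identifier(aggregate, version):
    """Identifier = version number plus a data digest of the aggregate."""
    digest = hashlib.sha256(
        json.dumps(aggregate, sort_keys=True).encode()).hexdigest()
    return {"version": version, "digest": digest}

def same_identifier(id_a, id_b):
    # Two nodes hold the same aggregate iff version and digest both match.
    return id_a == id_b

a = aggregate_identifier({"disk-1": "trusted_access"}, version=7)
b = aggregate_identifier({"disk-1": "trusted_access"}, version=7)
c = aggregate_identifier({"disk-1": "untrusted_access"}, version=8)
assert same_identifier(a, b) and not same_identifier(a, c)
```

Sorting the keys before hashing keeps the digest deterministic across nodes that serialize the same aggregate.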
  • The role of the management node is usually assumed by a specific node (for example, a specific storage node in the distributed storage system). In some special cases (for example, when the current management node fails or has insufficient processing capacity), the storage node that assumes the role of the management node is allowed to change. For example, storage node A assumes the role of the management node during a certain period of time (in this case the node identifier of the management node is the node identifier of storage node A), and later storage node B assumes the role (in which case the node identifier of the management node is the node identifier of storage node B). That is to say, because the storage node assuming the role of the management node may change in some special cases, the node identifier of the management node (i.e., the node identifier of the storage node assuming that role) may also change.
  • Each storage node in the distributed storage system can cache the node identifier it currently regards as that of the latest management node. When a storage node finds that the storage node assuming the role of the management node has changed, it updates its previously cached node identifier of the management node with the newly discovered one.
  • The method may further include: in the case that the first memory belongs to a second storage node of the m storage nodes, the first memory is in the trusted access state (specifically, for example, the state information aggregate cached by the first storage node records that the first memory is in the trusted access state), and the first storage node is in a connected state with the management node that issues the state information aggregate, the first storage node adds, to the first read data request, the node identifier of the management node that it currently caches, and sends the first read data request carrying that node identifier to the second storage node.
  • After the second storage node receives the first read data request from the first storage node (or after it receives the request and is itself in a connected state with the management node that issues the state information aggregate), the second storage node compares the node identifier of the management node carried by the first read data request with the node identifier of the management node it currently caches.
  • When the node identifier of the management node carried by the first read data request is the same as the node identifier of the management node cached by the second storage node, the first storage node and the second storage node regard the same storage node as the management node, and the storage node so regarded is very likely to be the current latest management node in the distributed storage system. In that case, the first storage node's determination, based on its cached state information aggregate, that the first memory is in the trusted access state is likely to be consistent with the actual condition of the first memory, so the second storage node reads the first data from the first memory and sends, to the first storage node or the terminal, a first read data response responding to the first read data request, the response carrying the read first data.
  • The method may further include: in the case that the first memory belongs to a second storage node of the m storage nodes, the first memory is in the trusted access state (specifically, for example, the state information aggregate cached by the first storage node records that the first memory is in the trusted access state), and the first storage node is in a connected state with the management node that issues the state information aggregate, the first storage node adds, to the first read data request, both the identifier of the state information aggregate it currently caches and the node identifier of the management node, and sends the first read data request carrying both identifiers to the second storage node.
  • After the second storage node receives the first read data request from the first storage node (or after it receives the request and is itself in a connected state with the management node that issues the state information aggregate), the second storage node compares the identifier of the state information aggregate carried by the first read data request with the identifier of the state information aggregate it currently caches, and compares the node identifier of the management node carried by the request with the node identifier of the management node it currently caches.
  • When the node identifiers of the management node are the same, the first storage node and the second storage node regard the same storage node as the management node, and that storage node is very likely to be the current latest management node in the distributed storage system. If, in addition, the identifiers of the state information aggregates are the same, the state information aggregates currently cached by the two storage nodes are the same, and the state information aggregate currently cached by the first storage node is all the more likely to be currently valid. The first storage node's determination, based on its currently cached state information aggregate, that the first memory is in the trusted access state is therefore most likely consistent with the actual condition of the first memory, so the second storage node reads the first data from the first memory and sends, to the first storage node or the terminal, a first read data response responding to the first read data request, the response carrying the read first data.
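The double check described above (aggregate identifier plus management-node identifier) could be sketched as below; the field names and the optional connectivity check are assumptions for illustration:

```python
def should_serve_forwarded_read(request, local_aggregate_id, local_mgmt_node_id,
                                mgmt_connected):
    """Decide whether a forwarded read can be served on the fast path."""
    # Both caches must agree before trusting the sender's state judgment.
    if request["aggregate_id"] != local_aggregate_id:
        return False
    if request["mgmt_node_id"] != local_mgmt_node_id:
        return False
    # Optionally also require a live connection to the management node,
    # which further raises confidence that the cached aggregate is valid.
    return mgmt_connected

req = {"aggregate_id": "v7:ab12", "mgmt_node_id": "node-B"}
assert should_serve_forwarded_read(req, "v7:ab12", "node-B", True)
assert not should_serve_forwarded_read(req, "v8:cd34", "node-B", True)
```

If either comparison fails, the receiving node would fall back to the verified (stripe-wide) read path rather than trust the sender's state.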
  • The method may further include: when the first storage node determines that the first memory is in the untrusted access state, the first storage node determines the stripe to which the first data belongs and the N memories in which the strips of that stripe are located.
  • The first storage node reads the strips of the stripe from the N memories and performs a check operation based on each read strip to obtain the first data (specifically, the first storage node first performs the check operation based on each read strip to obtain the strip to which the first data belongs, and then obtains the first data from that strip), and sends to the terminal a first read data response responding to the first read data request, the first read data response carrying the obtained first data.
  • In some possible embodiments, x memories among the N memories in which the strips of the stripe to which the first data belongs are located (the x memories including the first memory) belong to the first storage node, and the remaining N-x memories belong to y storage nodes different from the first storage node.
  • The method may further include: when the first storage node determines that the first memory is in the untrusted access state, determines that the other memories among the N memories (those not belonging to the first storage node) are in the trusted access state, and the first storage node is in a connected state with the management node that issues the state information aggregate, the first storage node sends, to the y storage nodes, a read data request carrying the stripe identifier of the stripe and the identifier of the state information aggregate cached by the first storage node.
  • After receiving the read data request from the first storage node, each of the y storage nodes may compare the identifier of the state information aggregate carried by the read data request with the identifier of the state information aggregate it currently caches. When the two identifiers are the same, the first storage node and the storage node receiving the read data request currently cache the same state information aggregate; in this case, the state information aggregate currently cached by the first storage node is very likely to be currently valid.
  • The first storage node's determination, based on its currently cached state information aggregate, that the first memory is in the untrusted access state is then most likely consistent with the actual condition of the first memory, so the receiving storage node reads the corresponding strips of the stripe from the corresponding memories it includes, and sends the read strips to the first storage node.
  • The first storage node performs the check operation based on the strips of the stripe collected from the y storage nodes and from the first storage node itself (excluding the strip to which the first data belongs) to obtain the first data, and sends to the terminal a first read data response responding to the first read data request, the first read data response carrying the obtained first data.
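The degraded read above, in which the requested strip is rebuilt from the remaining strips of its stripe, can be illustrated with XOR parity (a simplifying M = 1 assumption; the strip layout is also invented for the example):

```python
def xor_strips(strips):
    """XOR equal-length strips together."""
    out = bytearray(len(strips[0]))
    for s in strips:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def degraded_read(stripe, lost_index):
    """stripe: list of N data strips + 1 check strip; rebuild the lost strip
    from the surviving strips collected from the other memories."""
    survivors = [s for i, s in enumerate(stripe) if i != lost_index]
    return xor_strips(survivors)

d0, d1 = b"data-strip-zero!", b"data-strip-one!!"
stripe = [d0, d1, xor_strips([d0, d1])]       # 2 data strips + 1 check strip
assert degraded_read(stripe, 0) == d0          # strip 0 rebuilt without reading it
```

This is why the untrusted path is expensive: it must read every surviving strip of the stripe, not just the requested bytes.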
  • One of the m storage nodes serves as the management node.
  • The method may further include: when the state of the first memory changes from a first state to a second state, the first storage node sends a memory state change report to the management node, where the memory state change report indicates that the first memory is in the second state. The first state and the second state are different, and each of them is any one of the following states: an offline state, a data reconstruction state, and a trusted access state.
  • After receiving the memory state change report, the management node may update the state of the first memory recorded in the state information aggregate cached by the management node to the second state, and update the version number of the state information aggregate cached by the management node.
  • The management node sends the updated state information aggregate to the storage nodes other than the management node among the m storage nodes.
  • The first storage node updates its currently cached state information aggregate with the state information aggregate from the management node.
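The report-update-publish cycle above might be sketched as follows, with an in-process "push" standing in for the network distribution (all structures are illustrative assumptions):

```python
class ManagementNode:
    def __init__(self):
        self.aggregate = {}        # memory_id -> state
        self.version = 0
        self.subscribers = []      # caches held by the other storage nodes

    def on_state_change_report(self, memory_id, new_state):
        # Update the recorded state and bump the aggregate's version number.
        self.aggregate[memory_id] = new_state
        self.version += 1
        # Push the updated aggregate to every other storage node.
        for cache in self.subscribers:
            cache["aggregate"] = dict(self.aggregate)
            cache["version"] = self.version

mgmt = ManagementNode()
node_cache = {"aggregate": {}, "version": 0}
mgmt.subscribers.append(node_cache)
mgmt.on_state_change_report("disk-1", "data_reconstruction")
assert node_cache == {"aggregate": {"disk-1": "data_reconstruction"}, "version": 1}
```

Bumping the version on every change is what makes the identifier comparisons in the earlier bullets meaningful.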
  • In some possible embodiments, the method further includes: for example, when the terminal needs to request to write second data to the distributed storage system, the first storage node may receive a first write data request from the terminal, where the first write data request carries the second data and second location information, and the second location information describes the location of the second data in the data body to which it belongs.
  • The first storage node determines, based on the second location information, the W memories involved in writing the second data.
  • The first storage node divides the second data into W-T data strips and calculates T check strips from the W-T data strips.
  • The T check strips and the W-T data strips form one stripe including W strips; the W memories are in one-to-one correspondence with the W strips, T and W are positive integers, and T is smaller than W.
  • After receiving the write data request from the first storage node, each of the y2 storage nodes compares the identifier of the state information aggregate carried by the write data request with the identifier of the state information aggregate it currently caches. For example, when the two identifiers are the same, the storage node writes the corresponding strips of the stripe to the corresponding memories it includes; if the identifiers are different, it refuses to write the corresponding strips of the stripe to the corresponding memories it includes.
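The striping step above (divide the data into W-T data strips, derive T check strips, one strip per memory) can be sketched for T = 1 with XOR parity; the zero-padding rule and the T = 1 choice are assumptions made for brevity:

```python
def split_into_strips(data, n_data_strips):
    """Split data into equal-size strips, zero-padding the last one."""
    size = -(-len(data) // n_data_strips)            # ceiling division
    return [data[i * size:(i + 1) * size].ljust(size, b"\0")
            for i in range(n_data_strips)]

def build_stripe(data, w, t=1):
    """Return W strips: W-T data strips followed by T = 1 XOR check strip."""
    data_strips = split_into_strips(data, w - t)
    check = bytearray(len(data_strips[0]))
    for s in data_strips:
        for i, b in enumerate(s):
            check[i] ^= b
    return data_strips + [bytes(check)]              # W strips in total

stripe = build_stripe(b"second-data-payload", w=3, t=1)
assert len(stripe) == 3 and len(set(map(len, stripe))) == 1
```

Each of the W strips would then be sent to the storage node owning the corresponding memory, subject to the identifier check described above.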
  • The method may further include: for example, when the terminal needs to request to write third data to the distributed storage system, the first storage node receives a second write data request from the terminal, where the second write data request carries the third data and third location information, and the third location information describes the location of the third data in the data body to which it belongs.
  • The first storage node determines, based on the third location information, the W memories involved in writing the third data; the first storage node divides the third data into W-T data strips and calculates T check strips from the W-T data strips, where the T check strips and the W-T data strips form a stripe including W strips, the W memories are in one-to-one correspondence with the W strips, T and W are positive integers, and T is smaller than W.
  • The first storage node determines the states of the W memories currently recorded in the cached state information aggregate. When it determines that W1 memories among the W memories are in a non-offline state (and, for example, W2 memories among the W memories are in an offline state), and the first storage node is in a connected state with the management node that issues the state information aggregate, it sends a write data request carrying the strips of the stripe to the y2 storage nodes to which the W1 memories belong.
  • After receiving the write data request from the first storage node, each of the y2 storage nodes may write the corresponding strips of the stripe to the corresponding memories it includes.
• The method may further include: the first storage node receiving a third write data request from the terminal, where the third write data request carries fourth data and fourth location information, and the fourth location information is used to describe a location of the fourth data in the data volume to which it belongs. The first storage node determines, based on the fourth location information, the W memories involved in writing the fourth data; divides the fourth data into W-T data strips; and calculates T check strips from the W-T data strips, where the T check strips and the W-T data strips form a stripe comprising W strips, and the W memories correspond one-to-one with the W strips.
• T and W are positive integers, and T is less than W.
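The strip-and-check computation described above can be illustrated with a toy sketch (not the patent's algorithm; a single XOR check strip, i.e. T = 1, is assumed here, whereas practical systems typically use Reed-Solomon codes for T > 1):

```python
from functools import reduce

def make_stripe(data: bytes, w: int, t: int = 1) -> list[bytes]:
    """Divide data into W-T data strips and derive a T = 1 XOR check strip,
    yielding a stripe of W strips in total."""
    assert t == 1, "this toy sketch supports only a single XOR check strip"
    n_data = w - t
    strip_len = -(-len(data) // n_data)                # ceiling division
    padded = data.ljust(n_data * strip_len, b"\x00")   # zero-pad the tail
    strips = [padded[i * strip_len:(i + 1) * strip_len] for i in range(n_data)]
    # The check strip is the column-wise XOR of all data strips.
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))
    return strips + [parity]
```

With w = 5 and t = 1, an 8-byte payload yields four 2-byte data strips plus one check strip, one strip per memory.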
• The first storage node sends, to the y2 storage nodes to which the W1 memories belong, write data requests carrying the corresponding strips of the stripe and the node identifier of the management node that issues the state information aggregate.
• Each of the y2 storage nodes may compare the node identifier of the management node carried by the write data request with the node identifier of the management node currently cached by that storage node. In a case where the two node identifiers are the same, the storage node writes the corresponding strip of the stripe into the corresponding memory included in that storage node.
• In a case where the node identifier of the management node carried by the write data request is different from the node identifier of the management node currently cached by that storage node, the storage node may refuse to write the corresponding strip of the stripe into the corresponding memory.
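A minimal sketch of this acceptance check (all names are illustrative, not from the source): the receiving storage node compares the management-node identifier carried by a write data request against its own cached identifier and refuses the write when they differ:

```python
class ReceiverNodeSketch:
    """Toy model of a receiving storage node's write-acceptance check.
    The field names ('mgmt_node_id', 'strip_id', 'strip') are hypothetical."""

    def __init__(self, cached_mgmt_node_id: str):
        self.cached_mgmt_node_id = cached_mgmt_node_id
        self.memory: dict[str, bytes] = {}   # strip_id -> strip payload

    def handle_write(self, request: dict) -> bool:
        # Refuse the write when the sender's management-node identifier
        # differs from the one this node currently caches (stale view).
        if request["mgmt_node_id"] != self.cached_mgmt_node_id:
            return False
        self.memory[request["strip_id"]] = request["strip"]
        return True
```

The point of the check is to reject writes issued under an outdated view of which management node is authoritative.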
• In addition, after a memory comes back online, its data may be recovered through the reconstruction mechanism.
• In a case where W1 is smaller than W, that is, where W2 memories of the W memories currently recorded in the state information aggregate cached by the first storage node are in an offline state, the method may further include: the first storage node generating a first reconstruction log, where the first reconstruction log records a memory identifier of a second memory among the W2 memories and a strip identifier of the first strip, among the W strips, corresponding to the second memory; the stripe identifier of the stripe may also be recorded in the first reconstruction log.
• The second memory is any one of the W2 memories.
• After the second memory comes back online, the second storage node to which the second memory belongs collects the first reconstruction log generated while the second memory was offline.
• The second storage node acquires, from the first reconstruction log, the identifier of the first strip to be written into the second memory; determines the W-T memories in which the other W-T strips of the stripe, other than the first strip, are located; reads the W-T strips from the W-T memories; performs a check operation using the W-T strips to reconstruct the first strip to be written into the second memory; and writes the reconstructed first strip into the second memory.
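Under a single XOR check strip (an assumption for illustration; the patent does not fix the check algorithm), the check operation that reconstructs a missing strip from the other W-T strips reduces to a column-wise XOR:

```python
from functools import reduce

def reconstruct_strip(other_strips: list[bytes]) -> bytes:
    """With one XOR check strip, the column-wise XOR of all W strips is zero,
    so any single missing strip equals the XOR of the remaining W-T strips."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*other_strips))
```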
• The embodiment of the present application further provides a distributed storage system, where the distributed storage system includes m storage nodes, each of the storage nodes includes at least one memory, each of the memories includes a non-volatile storage medium, and m is an integer greater than one.
• The first storage node of the m storage nodes is configured to: receive a first read data request from the terminal, where the first read data request carries first location information, and the first location information is used to describe a location of the first data in the data volume to which it belongs; and determine, based on the first location information, a first memory in which the first data is located.
• In a case where the first memory belongs to the first storage node, the first data is read from the first memory, and a first read data response responding to the first read data request is sent to the terminal, where the first read data response carries the read first data.
  • the first data is, for example, part or all of the data in the strip to which the first data belongs.
• In some possible implementations, the distributed storage system further includes a management node for issuing the state information aggregate.
• The first storage node is further configured to: in a case where the first memory belongs to a second storage node among the m storage nodes, the first memory is in a trusted access state, and the first storage node is in a connected state with the management node that issues the state information aggregate, forward the first read data request to the second storage node.
• The second storage node is configured to: after receiving the first read data request from the first storage node, read the first data from the first memory, and send, to the first storage node or the terminal, a first read data response responding to the first read data request, where the first read data response carries the read first data.
• The first storage node is further configured to: in a case where the first memory belongs to a second storage node among the m storage nodes and the first memory is in a trusted access state, add, to the first read data request, the identifier of the state information aggregate currently cached by the first storage node, and send, to the second storage node, the first read data request to which the identifier of the state information aggregate has been added.
• The second storage node is configured to: after receiving the first read data request from the first storage node, compare the identifier of the state information aggregate carried by the first read data request with the identifier of the state information aggregate currently cached by the second storage node; and, in a case where the two identifiers are the same, read the first data from the first memory, and send, to the first storage node or the terminal, a first read data response responding to the first read data request, where the first read data response carries the read first data.
• The first storage node is further configured to: in a case where the first memory belongs to a second storage node among the m storage nodes and the first memory is in a trusted access state, add, to the first read data request, the identifier of the state information aggregate currently cached by the first storage node and the node identifier of the management node that issues the state information aggregate, and send, to the second storage node, the first read data request to which the identifier of the state information aggregate and the node identifier of the management node have been added.
• The second storage node is configured to: after receiving the first read data request from the first storage node, compare the identifier of the state information aggregate carried by the first read data request with the identifier of the state information aggregate currently cached by the second storage node, and compare the node identifier of the management node carried by the first read data request with the node identifier of the management node currently cached by the second storage node; and, in a case where both comparisons find the identifiers the same, read the first data from the first memory, and send, to the first storage node or the terminal, a first read data response responding to the first read data request, where the first read data response carries the read first data.
• The first storage node is further configured to: in a case where it determines that the first memory is in an untrusted access state, x memories among the N memories in which the N strips of the stripe to which the first data belongs are located belong to the first storage node, and the N-x memories among the N memories belong to y storage nodes different from the first storage node, send, to the y storage nodes, read data requests carrying the stripe identifier of the stripe and the identifier of the state information aggregate.
• Each of the y storage nodes is configured to: after receiving the read data request from the first storage node, compare the identifier of the state information aggregate carried by the read data request with the identifier of the state information aggregate currently cached by that storage node; and, in a case where the two identifiers are the same, read the corresponding strip of the stripe from the corresponding memory included in that storage node, and send the read strip of the stripe to the first storage node.
• The first storage node is further configured to: perform a check operation based on the strips of the stripe collected from the y storage nodes and from the first storage node itself to obtain the first data, and send, to the terminal, a first read data response responding to the first read data request, where the first read data response carries the obtained first data.
  • one of the m storage nodes is a management node.
• The first storage node is further configured to: when the state of the first memory changes from a first state to a second state, send a memory state change report to the management node, where the memory state change report indicates that the first memory is in the second state.
• The first state and the second state are different, and each of the first state and the second state is any one of the following states: an offline state, a data reconstruction state, and a trusted access state.
• The management node is configured to: after receiving the memory state change report from the first storage node, update the state of the first memory recorded in the state information aggregate cached by the management node to the second state, and update the version number of the state information aggregate cached by the management node; the management node sends the updated state information aggregate to the storage nodes other than the management node among the m storage nodes.
• The first storage node is further configured to update its currently cached state information aggregate with the state information aggregate from the management node.
• The other storage nodes likewise update their currently cached state information aggregates with the state information aggregate from the management node.
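The report-update-broadcast flow above might be sketched as follows (a toy model; the dict layout and version counter are assumptions, since the source does not specify data formats):

```python
class ManagementNodeSketch:
    """Toy management node: applies state change reports, bumps the
    aggregate's version number, and pushes the update to peers."""

    def __init__(self):
        self.aggregate = {"version": 0, "memory_states": {}}
        self.peers = []

    def on_state_change_report(self, memory_id: str, new_state: str):
        self.aggregate["memory_states"][memory_id] = new_state
        self.aggregate["version"] += 1
        for peer in self.peers:
            peer.update_aggregate({
                "version": self.aggregate["version"],
                "memory_states": dict(self.aggregate["memory_states"]),
            })

class PeerNodeSketch:
    """Toy peer: replaces its cached aggregate only with a newer version."""

    def __init__(self):
        self.cached = {"version": -1, "memory_states": {}}

    def update_aggregate(self, aggregate: dict):
        if aggregate["version"] > self.cached["version"]:
            self.cached = aggregate
```

The version number lets every node decide whether an incoming aggregate is newer than its cached one, which is also what makes the identifier comparisons on reads and writes meaningful.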
• The first storage node is further configured to receive a first write data request from the terminal, where the first write data request carries second data and second location information, and the second location information is used to describe the location of the second data in the data volume to which it belongs.
• The first storage node is further configured to: determine, based on the second location information, the W memories involved in writing the second data; divide the second data into W-T data strips; and calculate T check strips from the W-T data strips.
• The T check strips and the W-T data strips form a stripe including W strips, the W memories correspond one-to-one with the W strips, T and W are positive integers, and T is less than W.
• The first storage node is further configured to: in a case where the state information aggregate currently cached by the first storage node records that W1 memories of the W memories are in a non-offline state and W2 memories of the W memories are in an offline state, send, to the y2 storage nodes to which the W1 memories belong, write data requests carrying the corresponding strips of the stripe and the identifier of the state information aggregate.
• Each of the y2 storage nodes is configured to: after receiving the write data request from the first storage node, compare the identifier of the state information aggregate carried by the write data request with the identifier of the state information aggregate currently cached by that storage node; in a case where the two identifiers are the same, write the corresponding strip of the stripe into the corresponding memory included in that storage node; and, in a case where the two identifiers are different, refuse to write the corresponding strip of the stripe into the corresponding memory.
• The first storage node is further configured to receive a second write data request from the terminal, where the second write data request carries third data and third location information, and the third location information is used to describe the location of the third data in the data volume to which it belongs.
• The first storage node is further configured to: determine, based on the third location information, the W memories involved in writing the third data; divide the third data into W-T data strips; and calculate T check strips from the W-T data strips, where the T check strips and the W-T data strips form a stripe comprising W strips, the W memories correspond one-to-one with the W strips, T and W are positive integers, and T is smaller than W.
• The first storage node is further configured to: determine the states of the W memories as currently recorded in its cached state information aggregate; and, upon determining that W1 memories of the W memories are in a non-offline state (for example, and that W2 memories of the W memories are in an offline state), and that the first storage node is in a connected state with the management node that issues the state information aggregate, send, to the y2 storage nodes to which the W1 memories belong, write data requests carrying the corresponding strips of the stripe.
• Each of the y2 storage nodes is configured to: after receiving the write data request from the first storage node, write the corresponding strip of the stripe into the corresponding memory included in that storage node.
• The first storage node is further configured to receive a third write data request from the terminal, where the third write data request carries fourth data and fourth location information, and the fourth location information is used to describe the location of the fourth data in the data volume to which it belongs.
• The first storage node is further configured to: determine, based on the fourth location information, the W memories involved in writing the fourth data; divide the fourth data into W-T data strips; and calculate T check strips from the W-T data strips, where the T check strips and the W-T data strips form a stripe comprising W strips, the W memories correspond one-to-one with the W strips, T and W are positive integers, and T is smaller than W.
• The first storage node is further configured to: in a case where the state information aggregate currently cached by the first storage node records that W1 memories of the W memories are in a non-offline state (for example, and that W2 memories of the W memories are in an offline state), send, to the y2 storage nodes to which the W1 memories belong, write data requests carrying the corresponding strips of the stripe and the node identifier of the management node that issues the state information aggregate.
• Each of the y2 storage nodes is configured to: after receiving the write data request from the first storage node, compare the node identifier of the management node carried by the write data request with the node identifier of the management node currently cached by that storage node; in a case where the two node identifiers are the same, write the corresponding strip of the stripe into the corresponding memory included in that storage node; and, in a case where the two node identifiers are different, refuse to write the corresponding strip of the stripe into the corresponding memory.
• The first storage node is further configured to generate a first reconstruction log, where the first reconstruction log records a memory identifier of a second memory among the W2 memories and a strip identifier of the first strip, among the W strips, corresponding to the second memory; the stripe identifier of the stripe may also be recorded in the first reconstruction log.
• The second memory is any one of the W2 memories.
• After the second memory comes back online, the second storage node to which the second memory belongs collects the first reconstruction log generated while the second memory was offline.
• The second storage node acquires, from the first reconstruction log, the identifier of the first strip to be written into the second memory; determines the W-T memories in which the other W-T strips of the stripe, other than the first strip, are located; reads the W-T strips from the W-T memories; performs a check operation using the W-T strips to reconstruct the first strip to be written into the second memory; and writes the reconstructed first strip into the second memory.
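A hypothetical sketch of the reconstruction-log replay: entries recorded while the memory was offline identify the strips to rebuild, and each strip is recomputed from its peer strips (single XOR parity assumed; all field names are illustrative):

```python
from functools import reduce

def replay_reconstruction_log(log: list[dict], read_strip) -> dict[str, bytes]:
    """For each log entry, read the other W-T strips of the recorded stripe
    and XOR them to rebuild the strip destined for the recovered memory.
    'read_strip(stripe_id, strip_id)' stands in for a network read."""
    rebuilt = {}
    for entry in log:
        peers = [read_strip(entry["stripe_id"], sid) for sid in entry["peer_strips"]]
        rebuilt[entry["strip_id"]] = bytes(
            reduce(lambda a, b: a ^ b, col) for col in zip(*peers)
        )
    return rebuilt
```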
• The embodiment of the present application further provides a storage node, where the storage node is one of m storage nodes included in a distributed storage system, each of the storage nodes includes at least one memory, each of the memories includes a non-volatile storage medium, and m is an integer greater than one.
  • the storage node includes: a communication unit and a processing control unit.
• The communication unit is configured to receive a first read data request from the terminal, where the first read data request carries first location information, and the first location information is used to describe a location of the first data in the data volume to which it belongs.
• The processing control unit is configured to: in a case where the memory in which the first data is located, as determined based on the first location information, is a first memory, the first memory belongs to the storage node, and the state information aggregate currently records that the first memory is in a trusted access state, read the first data from the first memory.
• The communication unit is further configured to send, to the terminal, a first read data response responding to the first read data request, where the first read data response carries the read first data.
• The communication unit is further configured to: in a case where the first memory belongs to a second storage node among the m storage nodes, the first memory is in a trusted access state, and the storage node is in a connected state with the management node that issues the state information aggregate, forward the first read data request to the second storage node.
• The first read data request is used to trigger the second storage node to: after receiving the first read data request from the storage node, read the first data from the first memory, and send, to the storage node or the terminal, a first read data response responding to the first read data request, where the first read data response carries the read first data.
• The communication unit is further configured to: in a case where the first memory belongs to a second storage node among the m storage nodes and the first memory is in a trusted access state, add, to the first read data request, the identifier of the state information aggregate currently cached by the storage node, and send, to the second storage node, the first read data request to which the identifier of the state information aggregate has been added.
• The first read data request is used to trigger the second storage node to: after receiving the first read data request from the storage node, compare the identifier of the state information aggregate carried by the first read data request with the identifier of the state information aggregate currently cached by the second storage node; and, in a case where the two identifiers are the same, read the first data from the first memory, and send, to the storage node or the terminal, a first read data response responding to the first read data request, where the first read data response carries the read first data.
  • one of the m storage nodes is a management node.
• The communication unit is further configured to: when the state of the first memory changes from a first state to a second state, send a memory state change report to the management node, where the memory state change report indicates that the first memory is in the second state; the first state and the second state are different, and each of them is any one of the following states: an offline state, a data reconstruction state, and a trusted access state.
• The memory state change report is used to trigger the management node to: after receiving the memory state change report from the storage node, update the state of the first memory recorded in the state information aggregate cached by the management node to the second state, update the version number of the state information aggregate cached by the management node, and send the updated state information aggregate to the storage nodes other than the management node among the m storage nodes.
  • the processing control unit is configured to update its current cached state information aggregate by using a state information aggregate from the management node.
• The embodiment of the present application provides a data access method, which is applied to a first storage node, where the first storage node is located in a distributed storage system including m storage nodes, each of the storage nodes includes at least one memory, each of the memories includes a non-volatile storage medium, and m is an integer greater than one.
• The method includes: the first storage node receiving (from a terminal) a first read data request, where the first read data request carries first location information, and the first location information is used to describe a location of the first data in the data volume to which it belongs; and, in a case where the first storage node determines, based on the first location information, that the memory in which the first data is located is a first memory and the first memory belongs to the first storage node, when the state information aggregate currently records that the first memory is in a trusted access state, the first storage node reading the first data from the first memory and sending a first read data response responding to the first read data request, where the first read data response carries the read first data.
  • the embodiment of the present application provides a storage node, where the storage node is one of the m storage nodes included in the distributed storage system, and each of the storage nodes includes at least one memory.
  • Each of the memories includes a non-volatile storage medium, the m being an integer greater than one,
  • the storage node includes a processor and a communication interface coupled to each other; the processor is configured to perform some or all of the steps of the method performed by the first storage node (or other storage node) in the above aspects.
• The processor is configured to: receive a first read data request through the communication interface, where the first read data request carries first location information, and the first location information is used to describe a location of the first data in the data volume to which it belongs; and, in a case where the memory in which the first data is located is determined, based on the first location information, to be a first memory and the first memory belongs to the first storage node, when the state information aggregate records that the first memory is in a trusted access state, read the first data from the first memory, and send, through the communication interface, a first read data response responding to the first read data request, where the first read data response carries the read first data.
  • an embodiment of the present application provides a computer readable storage medium, where the computer readable storage medium stores program code.
  • the program code includes instructions for performing some or all of the steps of a method performed by any one of the storage nodes (eg, the first storage node or the second storage node) of the first aspect.
• An embodiment of the present application provides a computer program product comprising instructions; when the computer program product runs on a computer (such as a storage node), the computer is caused to perform some or all of the steps performed by any one of the storage nodes (for example, the first storage node or the second storage node) in the above aspects.
  • an embodiment of the present application provides a storage node, including: a processor, a communication interface, and a memory coupled to each other; the processor is configured to execute part or all of a method performed by any one of the foregoing aspects. step.
• The embodiment of the present application further provides a service system, which may include a distributed storage service system and a terminal, with a communication connection between the distributed storage service system and the terminal; the distributed storage service system is any one of the distributed storage service systems provided by the embodiments of the present application.
• FIG. 1-A and FIG. 1-B are schematic structural diagrams of some distributed storage systems according to an embodiment of the present application.
• FIG. 1-C is a schematic diagram of state transitions of a memory provided by an embodiment of the present application.
• FIG. 1-D is a schematic diagram of an organization form of a state information aggregate provided by an embodiment of the present application.
  • FIG. 1-E and FIG. 1-F are schematic diagrams showing the organization forms of some strips provided by examples in the embodiments of the present application.
  • FIGS. 2A and 2B are schematic diagrams showing the organization of some files provided by examples in the embodiment of the present application.
  • FIG. 3A is a schematic flowchart of a state update method according to an embodiment of the present application.
  • FIG. 3B and FIG. 3C are schematic diagrams showing the organization of some state information aggregates provided by examples in the embodiment of the present application.
• FIG. 4-A is a schematic flowchart of another state update method according to an embodiment of the present application.
  • FIG. 4B and FIG. 4C are schematic diagrams showing the organization forms of some state information aggregates provided by the embodiments of the present application.
  • FIG. 5-A is a schematic flowchart of another data access method provided by an example of the embodiment of the present application.
  • FIG. 5-B is a schematic diagram of an organization form of a reconstruction log provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart diagram of another data access method according to an embodiment of the present application.
• FIG. 7-A is a schematic flowchart of a data reconstruction method according to an embodiment of the present application.
  • FIGS. 7-B and 7-C are schematic diagrams showing the organization of some state information aggregates provided by examples in the embodiments of the present application.
  • FIG. 8 is a schematic flowchart diagram of another data access method according to an embodiment of the present application.
  • FIG. 9 is a schematic flowchart diagram of another data access method provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart diagram of another data access method according to an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a distributed storage system according to an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a storage node according to an embodiment of the present application.
  • FIG. 13 is a schematic diagram of another storage node provided by an embodiment of the present application.
  • a distributed storage system includes a plurality of storage nodes connected by an internetwork (in some scenarios, storage nodes may also be referred to as "storage service nodes").
  • the terminal can access the distributed storage system such as reading and writing through the network.
• The product form of the terminal in the embodiment of the present application may be, for example, a mobile internet device, a notebook computer, a server, a tablet computer, a palmtop computer, a desktop computer, a mobile phone, or another terminal device capable of initiating access requests such as read data requests or write data requests.
  • the terminal may also be replaced by a host.
  • the storage node includes one or more memories.
  • the memory in the embodiment of the present application includes a non-volatile storage medium, and the memory including the non-volatile storage medium may also be referred to as a non-volatile memory.
• The memory may be a storage medium such as a hard disk, a rewritable optical disc, or a solid state drive.
  • FIG. 1-A and FIG. 1-B are examples in which the memory is a hard disk.
  • the storage node may also include a network card, a memory, a processor, an expansion card, and the like.
  • the processor is interconnected to the memory, for example by an expansion card.
  • a storage node can be interconnected with other storage nodes or other external devices through its network card.
  • a network card can be thought of as a communication interface for a storage node.
• The processor may include several functional units, which may be hardware (for example, a field programmable gate array (FPGA) circuit, or a combination of a processor/FPGA circuit and other auxiliary hardware) or may be software.
• As shown in FIG. 1-B, functions such as a client, a data service (DS) unit, and a state manager (monitor) may run in the processor.
• The programs implementing these functions may reside in the memory, in which case these functions may also be considered to be located in the memory.
• These functions are located at the respective storage nodes, specifically in a combination of the processor and the memory of each storage node; for convenience, these modules are drawn inside the processor in FIG. 1-B.
• In some cases, the processor itself has a storage function, the program code can be directly burned into the processor, and a separate memory is then no longer a necessary device.
  • the client is a system input/output (English: input output, IO) entry of the corresponding storage node, and is mainly responsible for receiving an access request sent by another device (such as a terminal or other storage node), and the access request is also called an IO request.
  • the access request is, for example, a read data request, a write data request, or the like.
  • the DS unit is mainly responsible for receiving an access request sent by the client, reading and writing the local memory based on the access request, and returning the result of the read and write operation to the client.
  • the monitor can be responsible for monitoring the state of the memory contained in the corresponding storage node. Of course, the monitor can also be responsible for monitoring the state of the corresponding storage node.
  • a monitor can be deployed on each storage node in the distributed storage system, and the monitor in each storage node can be used to monitor information such as the states of the memories included in that storage node. Further, the monitors of all storage nodes can be connected together to form a monitor cluster. In a manner specified by an administrator or through an election algorithm, one primary monitor is elected from the monitor cluster, and the remaining monitors are secondary monitors.
  • the storage node where the primary monitor is located can be called the "management node". Each secondary monitor can, for example, send the current states of the memories included in its storage node to the primary monitor.
  • the primary monitor may generate a state information aggregate based on the collected information, such as the states of the memories included in each storage node.
  • the state information aggregate may be simply referred to as a "node map" or a "state file", for example.
  • Information such as the state of each memory is recorded in the state information aggregate. Further, a secondary monitor can, for example, send the current state of the storage node where it is located to the primary monitor.
  • the state information aggregate can further record the state of each storage node and the like.
  • the monitor is a function that the processor implements by running programs in the memory. In general, the monitor can be considered a function of the processor plus the memory, or a function of the storage node.
  • the functions of the primary monitor described below can be executed by the primary processor, by the primary processor plus memory, or by the primary storage node; the functions of a secondary monitor can be executed by a secondary processor, by a secondary processor plus memory, or by a secondary storage node.
  • the primary monitor can actively push the latest state information aggregate to other storage nodes, or other storage nodes can actively request the latest state information aggregate from the primary monitor.
  • the monitors on the other storage nodes can cache the received state information aggregate in their memories; the state information aggregate cached in the memory is the state information aggregate locally maintained by the corresponding storage node (corresponding monitor).
  • Whenever a secondary monitor receives a state information aggregate from the primary monitor, the secondary monitor can update its currently cached state information aggregate with the newly received one. That is, the state information aggregate currently kept by a secondary monitor is the latest one it has received.
  • Generally, the state information aggregate currently cached by a secondary monitor is the latest state information aggregate released by the primary monitor. However, if the secondary monitor is not in a connected state with the primary monitor, the secondary monitor may miss the latest state information aggregate released by the primary monitor; in that case, the state information aggregate currently cached by the secondary monitor may not be the latest one released by the primary monitor.
  • Of course, if the primary monitor does not release a new state information aggregate during the period in which the secondary monitor and the primary monitor are disconnected, the state information aggregate currently cached by the secondary monitor may still be the latest one released by the primary monitor.
  • Devices in a connected state can communicate directly or indirectly. For this reason, a storage node that is in a connected state with the management node can obtain the latest version of the state information aggregate from the management node.
  • the state of the memory includes an online (English: online) state and an offline (English: offline) state.
  • the online status includes: a data reconstruction state and a trusted access state.
  • the data reconstruction state may also be called the catchup (English: catchup) state;
  • the trusted access state may also be called the normal (English: normal) state.
  • the memory can enter the trusted access state after data reconstruction is completed; alternatively, in cases where data reconstruction is not needed, the memory can enter the trusted access state directly from the offline state. In general, both the offline state and the data reconstruction state can be considered untrusted access states.
  • the state of the memory reflects the validity of the data in the memory (ie, whether it is trusted).
  • When the memory is in the trusted access state, it means that the data allocated to this memory has all been successfully written to this memory. That is to say, a memory to which all the allocated data has been written is in the trusted access state.
  • the data on the memory in this state can be considered valid.
  • When the memory is in the data reconstruction state, the memory is reconstructing the relevant data lost while it was offline; in this case, the data on the memory may be invalid, that is, the memory is in an untrusted access state.
  • When the memory is in the offline state, the data on the memory can be considered invalid; that is, the memory is in an untrusted access state. It can be understood that an offline memory cannot be accessed: data can neither be read from the memory nor written to the memory.
  • Figure 1-C illustrates possible transitions between different states of the memory, which can be migrated from an online state to an offline state, or from an offline state to an online state.
  • the memory may be changed from an offline state to a data reconstruction state or a trusted access state, and may be changed from a data reconstruction state to a trusted access state, and may be changed from a data reconstruction state or a trusted access state to an offline state.
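  • The state transitions described above can be summarized as a small state machine. The following Python fragment is a minimal, illustrative sketch (the state names follow the text; the transition table is an assumption based on the description of FIG. 1-C):

```python
# Illustrative sketch only: the allowed memory state migrations as described.
OFFLINE, CATCHUP, NORMAL = "offline", "catchup", "normal"  # catchup = data reconstruction

ALLOWED_TRANSITIONS = {
    OFFLINE: {CATCHUP, NORMAL},  # offline -> data reconstruction, or directly trusted access
    CATCHUP: {NORMAL, OFFLINE},  # data reconstruction -> trusted access, or back to offline
    NORMAL:  {OFFLINE},          # trusted access -> offline
}

def transition(current, target):
    """Return the new state if the migration is allowed, else raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

def is_trusted(state):
    # Both the offline state and the data reconstruction state count as untrusted.
    return state == NORMAL
```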
  • A memory may go offline for a variety of reasons. For example, when a memory in a storage node cannot be detected by the storage node, or when the storage node receives a fault status report from a memory (the fault status report indicating that the memory is offline), the storage node can set the state of the corresponding memory to the offline state.
  • If the management node does not receive the heartbeat messages of a storage node multiple times in succession, this means that the storage node is in the offline state, and the management node can set all the memories included in the storage node to the offline state.
  • the state information aggregate can record the memory ID and state of each memory, and can further record the node ID and state of the storage node to which each memory belongs.
  • FIG. 1-D exemplifies a possible representation of the state information aggregate.
  • the state information aggregate includes the version number of the state information aggregate and the identifier and state of each memory (the memory is a hard disk in the figure), where the state is, for example, the offline state, the data reconstruction state, or the trusted access state.
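  • As a hedged illustration of the representation in FIG. 1-D, a state information aggregate could be modeled as follows in Python (the field names and version value are assumptions; only the structure, a version number plus per-memory states, follows the text):

```python
# Hypothetical shape of a state information aggregate ("node map").
state_aggregate = {
    "version": 7,  # version number of the aggregate itself (assumed value)
    "memories": {
        "Nd2-D1": "offline",  # offline state
        "Nd2-D2": "normal",   # trusted access state
        "Nd2-D3": "catchup",  # data reconstruction state
    },
}

def memory_state(aggregate, memory_id):
    """Look up the recorded state of one memory in the aggregate."""
    return aggregate["memories"][memory_id]
```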
  • the state information aggregate is also adaptively updated as memory states change. For example, when the management node (the primary monitor) receives a memory state change report (the memory state changes, for example, when the memory is pulled out or powers off) or determines that a storage node is offline (for example, when no heartbeat message is received from the corresponding storage node within a certain length of time), the management node considers that the state information aggregate update condition is satisfied and updates the state information aggregate. The updated state information aggregate corresponds to a new version number and records the changed memory state. The management node then pushes the updated state information aggregate to the other storage nodes (secondary monitors).
  • the version number of the state information aggregate can be generated by the primary monitor based on a global variable. Specifically, for example, each time the state information aggregate is updated, the version number is incremented by 1 or by 2 on the current basis. Therefore, different version numbers represent different state information aggregates, and by comparing the version numbers of two state information aggregates, it can be determined whether the two state information aggregates are the same.
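  • The version maintenance just described can be sketched as follows (illustrative Python; the increment step and dictionary layout are assumptions):

```python
# Sketch: the primary monitor bumps the version on every aggregate update,
# so comparing version numbers suffices to decide whether two aggregates match.
def update_aggregate(aggregate, memory_id, new_state, step=1):
    """Return a new aggregate recording the state change, with a bumped version."""
    return {
        "version": aggregate["version"] + step,  # +1 (or +2) per update, per the text
        "memories": dict(aggregate["memories"], **{memory_id: new_state}),
    }

def same_aggregate(a, b):
    # Different version numbers represent different state information aggregates.
    return a["version"] == b["version"]
```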
  • Various data bodies may be stored in the memories of the distributed storage system (a data body refers to data that can be stored by a memory, such as a file, an object, a record, or data in another format).
  • In consideration of system reliability and the like, different strips of the same data body are stored on different memories.
  • A redundancy check operation (such as an EC operation) is performed on the data strips to obtain the check strips.
  • the verification redundancy configuration (such as EC configuration) is used to describe the proportional relationship between the data strips and the check strips in the stripe.
  • the check redundancy configuration of a file can be expressed as A:T, where A is the number of data strips included in a stripe and T (T indicates the redundancy) is the number of check strips included in the stripe.
  • For example, if the check redundancy algorithm is an EC algorithm and the check redundancy configuration (i.e., the EC configuration) is 2:1, then 2 is the number of data strips included in the stripe (EC stripe) and 1 is the number of check strips included in the stripe; that is, in this case, the stripe includes 2 data strips and 1 check strip.
  • Figure 1-E illustrates a stripe comprising two data strips and one check strip.
  • Figure 1-F shows a stripe comprising a plurality of data strips and a plurality of check strips.
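  • The A:T relationship above can be stated compactly: a stripe holds A data strips plus T check strips, and an EC-style code with T check strips tolerates the loss of up to T strips. A minimal Python illustration (the function and field names are assumptions):

```python
def stripe_layout(a, t):
    """Describe a check redundancy configuration A:T as sketched above."""
    return {
        "data_strips": a,       # A data strips per stripe
        "check_strips": t,      # T check strips per stripe
        "total_strips": a + t,  # strips stored across distinct memories
        "max_lost": t,          # an EC-style code tolerates up to T lost strips
    }
```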
  • a distributed storage system can support one or more check redundancy configurations, and each check redundancy configuration can correspond to one or more check member groups, wherein each check member group includes multiple memories.
  • the memory included in different check member groups may be completely different; it may also be partially identical, that is, each memory may serve one or more check member groups.
  • FIG. 2-A is a schematic diagram of dividing a file into several file segments. Each file segment of a file may correspond to the same or a different check redundancy configuration, and the check member groups of different file segments may also be the same or different. In this case, a file segment can be treated as a file. For comparison, FIG. 2-B is a schematic diagram of a file that is not divided into file segments.
  • The distributed storage system includes m storage nodes; the m storage nodes may include, for example, storage nodes Nd1, Nd2, Nd3, ..., Nd4, where, for example, the storage node Nd1 is a management node. Embodiments in which the distributed storage system includes a larger or smaller number of storage nodes may be deduced by analogy.
  • The following example scheme mainly involves a non-management node actively triggering the management node to update and release the state information aggregate.
  • FIG. 3A is a schematic flowchart of a state update method according to an embodiment of the present application.
  • a state update method provided by an embodiment of the present application may include:
  • the storage node Nd2 transmits a memory state change report Report1 to the management node Nd1.
  • the memory status change report Report1 is used to indicate the latest status of the memory Nd2-D1 after the change.
  • the memory state change report Report1 may carry the memory identifier, the status identifier, and the like of the memory Nd2-D1. For example, when the latest state after the change of the memory Nd2-D1 is the offline state, the state identifier of the memory Nd2-D1 carried by the memory state change report Report1 indicates the offline state. For another example, when the latest state after the change of the memory Nd2-D1 is the data reconstruction state, the state identifier of the memory Nd2-D1 carried in the memory state change report Report1 indicates the data reconstruction state. For another example, when the latest state after the change of the memory Nd2-D1 is the trusted access state, the state identifier of the memory Nd2-D1 carried in the memory state change report Report1 indicates the trusted access state. And so on.
  • the management node Nd1 receives the memory state change report Report1 from the storage node Nd2.
  • the management node Nd1 updates its current cached state information aggregate based on the memory state change report Report1.
  • Specifically, the management node Nd1 updates the state of the memory Nd2-D1 recorded in the currently cached state information aggregate to the state of the memory Nd2-D1 indicated by the memory state change report Report1, and updates the version number in the state information aggregate (that is, the version number of this state information aggregate itself).
  • Specifically, the memory state change report Report1 may be sent by the secondary monitor located in the storage node Nd2 to the storage node Nd1, and the memory state change report Report1 from the storage node Nd2 may be received by the primary monitor located in the management node Nd1.
  • the state information aggregate before the management node Nd1 is updated, for example, as shown in FIG. 3-B.
  • the state information aggregate after the update of the management node Nd1 is exemplified, for example, in FIG. 3-C.
  • the actual application is not limited to such an example.
  • the management node Nd1 sends the updated latest state information aggregate to other storage nodes.
  • the storage nodes Nd2, Nd3, Nd4, and the like respectively receive the updated state information aggregate from the management node Nd1.
  • the storage nodes Nd2, Nd3, and Nd4 respectively update the state information aggregates of the respective current caches using the state information aggregates from the management node Nd1. This facilitates the relative synchronization of the state information aggregates buffered by the storage nodes (Nd1, Nd2, Nd3, Nd4, etc.).
  • The foregoing solution provides a mechanism in which a non-management node triggers the management node to update and release the state information aggregate. Specifically, when the state of a memory in a storage node changes, the storage node actively reports a memory state change report to the management node, which in turn triggers the management node to update and publish the state information aggregate based on this memory state change report.
  • This mechanism helps keep the state information aggregates of the storage nodes in the distributed storage system synchronized as much as possible, so that each storage node can relatively accurately know the state of each memory in the distributed storage system, laying a foundation for subsequent read and write operations based on memory state.
  • The distributed storage system may include m storage nodes; the m storage nodes may include, for example, storage nodes Nd1, Nd2, Nd3, ..., Nd4, where, for example, the storage node Nd1 is a management node. Embodiments in which the distributed storage system includes a larger or smaller number of storage nodes may be deduced by analogy.
  • In the following exemplary scheme, the management node updates and publishes the state information aggregate mainly based on the heartbeat monitoring results of other storage nodes.
  • FIG. 4-A is a schematic flowchart diagram of another state update method according to another embodiment of the present application.
  • another state update method provided by another embodiment of the present application may include:
  • the management node Nd1 receives a heartbeat message periodically sent by each storage node.
  • If the management node Nd1 does not receive the heartbeat message sent by the storage node Nd2 within a set duration (where the set duration is, for example, 1 minute, 5 minutes, 10 minutes, 20 minutes, or another duration), the storage node Nd2 is determined to be offline.
  • the management node Nd1 updates the state information aggregate currently cached by the management node Nd1 when it is determined that the storage node Nd2 is offline.
  • the management node Nd1 updates the state of all the memories included in the storage node Nd2 recorded in the currently cached state information aggregate to an offline state, and updates the aggregate version number in the state information aggregate.
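  • The heartbeat timeout determination described above can be sketched as a simple check over the last heartbeat seen from each node. The following Python fragment is illustrative only (the class and method names, and the injectable clock, are assumptions):

```python
import time

class HeartbeatMonitor:
    """Minimal sketch of offline detection by heartbeat timeout."""

    def __init__(self, timeout_seconds=300):  # e.g. 5 minutes, per the text
        self.timeout = timeout_seconds
        self.last_seen = {}  # node id -> timestamp of the last heartbeat

    def record_heartbeat(self, node_id, now=None):
        self.last_seen[node_id] = time.time() if now is None else now

    def offline_nodes(self, now=None):
        """Nodes whose last heartbeat is older than the set duration."""
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]
```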
  • Specifically, the primary monitor in the management node Nd1 receives the heartbeat messages periodically sent by the secondary monitor in each storage node.
  • For example, assume that the storage node Nd2 includes memories Nd2-D1, Nd2-D2, and Nd2-D3. Then, the state information aggregate before the update by the management node Nd1 is, for example, as shown in FIG. 4-B, and the state information aggregate after the update is, for example, as shown in FIG. 4-C; of course, practical applications are not limited to such example forms.
  • the management node Nd1 sends the updated latest state information aggregate to other storage nodes.
  • the storage nodes Nd3 and Nd4 and the like respectively receive the updated state information aggregate from the management node Nd1.
  • the storage nodes Nd3 and Nd4 respectively update the respective state information aggregates that are currently cached using the state information aggregate from the management node Nd1.
  • In this way, the state information aggregates cached by the management node Nd1 and the other storage nodes can be kept relatively synchronized; the storage node Nd2, however, cannot receive the latest state information aggregate released by the management node Nd1 because it is in an offline state.
  • the storage node Nd2 may actively request the latest state information aggregate from the management node to update the cached state information aggregate.
  • Alternatively, the storage node Nd2 may passively wait for the management node to release the latest state information aggregate again, and then update its cached state information aggregate by using the latest state information aggregate that is re-released by the management node.
  • The foregoing solution provides a mechanism in which the management node updates and publishes the state information aggregate based on the heartbeat monitoring results of other storage nodes. This mechanism helps keep the state information aggregates of the storage nodes in the distributed storage system synchronized as much as possible, so that each storage node can relatively accurately know the state of each memory in the distributed storage system, laying a foundation for subsequent read and write operations based on memory state.
  • the state update methods described in Figures 3-A and 4-A may all be referenced to the same distributed storage system. That is to say, the management node updates and publishes the state information aggregate based on the heartbeat monitoring results of other storage nodes, and the non-administrative node can also actively trigger the management node to update and release the state information aggregate.
  • A data access method is described below with reference to the accompanying drawings, mainly directed to a possible scenario in which a terminal writes data to the distributed storage system.
  • the data access method is applied to a distributed storage system.
  • the distributed storage system includes m storage nodes, and the m storage nodes include storage nodes Nd1 and Nd2.
  • the memory Nd2-D1, the memory Nd2-D2, and the memory Nd2-D3 are all located in the storage node Nd2.
  • Here, the corresponding check redundancy configuration is specifically 2:1 as an example; that is, the stripe includes two data strips and one check strip. Cases with other check redundancy configurations can be deduced by analogy.
  • FIG. 5-A is a schematic flowchart of a data access method according to an embodiment of the present application.
  • a data access method provided by an embodiment of the present application may include:
  • When the terminal needs to write data to the distributed storage system, the terminal sends a write data request carrying the data to be written to the distributed storage system.
  • the write data request is named Wq3 in this embodiment.
  • the storage node Nd1 receives the write data request Wq3 from the terminal.
  • the write data request Wq3 can carry the data Data1 (the data Data1 is the data to be written), the file identifier of the file to which the data Data1 belongs, and the location information of the data Data1.
  • the location information of the data Data1 is used to describe the location of the data Data1 in the file to which the data Data1 belongs.
  • the location information of the data Data1 includes, for example, the file offset address of the data Data1 (the file offset address of the data Data1 indicates the start position of the data Data1 in the file), the length of the data Data1, and the like.
  • the storage node Nd1 divides the data Data1 into two data strips (for convenience of reference in the following, the two data strips are named as the data strip Pd1 and the data strip Pd2 in this embodiment).
  • The storage node Nd1 calculates one check strip by using the two data strips (for convenience of description, the check strip is named the check strip Pj1 in this embodiment). This one check strip and the 2 data strips form a stripe comprising 3 strips.
  • the lengths of the data strip Pd1 and the data strip Pd2 may be equal or unequal.
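  • As a hedged illustration of the step above for the 2:1 configuration, the following Python sketch splits the data into two data strips and derives one check strip. Plain XOR parity stands in for the EC computation here (an assumption for illustration only); with 2:1, any single lost strip equals the XOR of the other two:

```python
def make_stripe(data: bytes):
    """Split data into 2 data strips and compute 1 XOR check strip."""
    half = (len(data) + 1) // 2
    pd1 = data[:half]
    pd2 = data[half:].ljust(half, b"\x00")   # pad so both strips have equal length
    pj1 = bytes(a ^ b for a, b in zip(pd1, pd2))  # check strip
    return pd1, pd2, pj1

def recover_pd1(pd2: bytes, pj1: bytes):
    # Any single lost strip equals the XOR of the remaining two strips.
    return bytes(a ^ b for a, b in zip(pd2, pj1))
```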
  • the storage node Nd1 determines the memory involved in the write data Data1 based on the location information of the data Data1.
  • the storage node Nd1 determines that the memory involved in the write data Data1 is the memory Nd2-D1, the memory Nd2-D2, and the memory Nd2-D3.
  • the above three memories are in one-to-one correspondence with the three strips.
  • the check strip Pj1 corresponds to the memory Nd2-D3
  • the data strip Pd1 corresponds to the memory Nd2-D1
  • the data strip Pd2 corresponds to the memory Nd2-D2.
  • the storage node Nd1 determines a state of the memory Nd2-D1, the memory Nd2-D2, and the memory Nd2-D3 currently recorded in the locally buffered state information aggregate.
  • The following description takes as an example the case in which the state information aggregate locally cached by the storage node Nd1 currently records that the memories Nd2-D2 and Nd2-D3 are in an online state (the trusted access state or the data reconstruction state) and the memory Nd2-D1 is in an offline state.
  • When it is determined that the memory Nd2-D2 and the memory Nd2-D3 are in an online state and the memory Nd2-D1 is in an offline state, the storage node Nd2 writes the data strip Pd2 into the memory Nd2-D2, and the storage node Nd2 writes the check strip Pj1 into the memory Nd2-D3.
  • When some memories are offline, data is still written normally to the remaining non-offline memories, because, according to the EC check algorithm, even if a certain number of strips are not successfully written, the unsuccessfully written strips can be recovered from the successfully written strips, so no data is lost. As long as the number of unsuccessfully written strips does not exceed the number of check strips, the EC check algorithm can operate normally. Of course, the same applies to other check algorithms.
  • the data strip Pd1 fails to be written into the memory Nd2-D1, so the storage node Nd1 or the storage node Nd2 generates the reconstruction log log1.
  • the reconstruction log log1 records the stripe identifier of the stripe (data strip Pd1) that failed to be written, and the memory identifier of the memory (memory Nd2-D1) that failed to be written.
  • the reconstruction log log1 can also record the stripe identifier of the stripe to which the write failed stripe (data strip Pd1) belongs.
  • the reconstruction log log1 is used to reconstruct the data strip Pd1 after the memory Nd2-D1 is re-online.
  • Figure 5-B illustrates a possible representation of a reconstructed log log1, although other manifestations may be used.
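  • The reconstruction log entry of FIG. 5-B could take a shape like the following (a hypothetical representation; the field names and the stripe identifier are assumptions for illustration):

```python
# Hypothetical shape of a reconstruction log: each entry records which strip
# failed to be written to which offline memory, so the strip can be rebuilt
# once that memory comes back online.
reconstruction_log = [
    {
        "memory_id": "Nd2-D1",    # memory the write failed on
        "strip_id": "Pd1",        # strip that failed to be written
        "stripe_id": "stripe-1",  # stripe the failed strip belongs to (assumed id)
    },
]

def strips_to_rebuild(log, memory_id):
    """Strips that must be reconstructed when memory_id comes back online."""
    return [entry["strip_id"] for entry in log if entry["memory_id"] == memory_id]
```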
  • Certainly, if the memory Nd2-D1 is in an online state, the storage node Nd2 can directly write the data strip Pd1 into the memory Nd2-D1; in this case, the storage node Nd1 or the storage node Nd2 does not need to perform the related steps of generating the reconstruction log log1.
  • the write data scenario that is mainly targeted by the foregoing solution is a possible scenario in which the memory corresponding to each strip in the stripe is located in the same storage node.
  • In the foregoing solution, a state information aggregate that records the states of the memories is introduced, so that the state information aggregate can be used to manage the access states of the memories in the distributed storage system.
  • For a memory in an online state, data can be written to the memory directly; for a memory in the offline state, there is no need to attempt the write first and generate the related reconstruction log only after the attempt fails; instead, the related reconstruction log can be generated directly.
  • In contrast, the prior art first attempts to write the data and then generates the related reconstruction log after the write attempt fails.
  • Therefore, the solution in this embodiment can reduce invalid write attempts, thereby improving data writing efficiency and helping improve system performance.
  • the distributed storage system includes m storage nodes, and the m storage nodes include storage nodes Nd2, Nd1, Nd3, and Nd4.
  • the memory Nd2-D1 belongs to the storage node Nd2, the memory Nd3-D1 belongs to the storage node Nd3, and the memory Nd4-D1 belongs to the storage node Nd4.
  • Here, the check redundancy configuration is specifically 2:1 as an example; that is, the stripe includes two data strips and one check strip. Cases with other check redundancy configurations can be deduced by analogy.
  • FIG. 6 is a schematic flowchart diagram of another data access method according to another embodiment of the present application.
  • another data access method provided by another embodiment of the present application may include:
  • Steps 601-602 are the same as steps 501-502; therefore, for the related description, refer to steps 501-502. Details are not described herein again.
  • the storage node Nd1 determines the memory involved in the write data Data1 based on the location information of the data Data1.
  • the storage node Nd1 determines that the memory involved in the write data Data1 is the memory Nd2-D1, the memory Nd3-D1, and the memory Nd4-D1.
  • the three memories (Nd2-D1, Nd3-D1, and Nd4-D1) are in one-to-one correspondence with the three strips.
  • the check strip Pj1 corresponds to the memory Nd4-D1
  • the data strip Pd1 corresponds to the memory Nd2-D1
  • the data strip Pd2 corresponds to the memory Nd3-D1.
  • The storage node Nd1 determines the states of the memory Nd2-D1, the memory Nd3-D1, and the memory Nd4-D1 currently recorded in the locally cached state information aggregate. The following is a description of the case in which the state information aggregate locally cached by the storage node Nd1 currently records that the memories Nd3-D1 and Nd4-D1 are in an online state and the memory Nd2-D1 is in an offline state.
  • Specifically, the storage node Nd1 may, when it is in a connected state with the management node, determine the states of the memory Nd2-D1, the memory Nd3-D1, and the memory Nd4-D1 currently recorded in the locally cached state information aggregate.
  • The storage node Nd1 sends a write data request Wq1 to the storage node Nd3 to which the memory Nd3-D1 belongs, and sends a write data request Wq2 to the storage node Nd4 to which the memory Nd4-D1 belongs.
  • the write data request Wq1 carries the data strip Pd2 and the version number of the state information aggregate currently cached by the storage node Nd1.
  • the write data request Wq1 may further carry the length of the data strip Pd2 (the length may be, for example, 110 KB), the file offset address of the data strip Pd2, the stripe identifier of the stripe to which the data strip Pd2 belongs, the strip identifier of the data strip Pd2, the file identifier of the file to which the data strip Pd2 belongs, and so on.
  • the write data request Wq2 may carry the check strip Pj1 and the version number of the state information aggregate currently stored locally by the storage node Nd1. Further, the write data request Wq2 may further carry the length of the check strip Pj1, the stripe identifier of the strip to which the check strip Pj1 belongs, the stripe identifier of the check strip Pj1, and the like.
  • By carrying, in the write data requests sent to other storage nodes, the version number of the state information aggregate that it caches, the storage node Nd1 makes it convenient for the other storage nodes to verify, by comparing version numbers, whether the state information aggregate cached by the storage node Nd1 is valid.
  • After receiving the write data request Wq1 from the storage node Nd1, the storage node Nd3 compares the version number of the state information aggregate carried in the write data request Wq1 with the version number of the state information aggregate currently cached by the storage node Nd3 itself.
  • Because the storage node Nd3 is in a connected state with the management node, the secondary monitor in the storage node Nd3 can be considered able to receive, in a timely manner, the latest version of the state information aggregate released by the primary monitor in the management node.
  • Therefore, the state information aggregate currently cached by the storage node Nd3 can be considered the latest version of the state information aggregate maintained by the primary monitor.
  • If the version number of the state information aggregate carried in the write data request Wq1 is the same as the version number of the state information aggregate currently cached by the storage node Nd3, it means that the state information aggregate currently cached by the storage node Nd1 is also the latest version maintained by the primary monitor.
  • Nd3-D1 when comparing the version number of the state information aggregate carried by the write data request Wq1 with the version number of the state information aggregate currently cached by itself, it can be considered that the state of Nd3-D1 buffered by Nd1 is not accurate. In other words, although the state of Nd3-D1 recorded in Nd1 is the trust access state, it cannot be ensured that the real state of Nd3-D1 is the trust access state. Since Nd3-D1 may be in an untrusted access state, storage node Nd3 may refuse to write data strip Pd2 to memory Nd3-D1.
• in most cases, the state information aggregate cached by a storage node is consistent with the state information aggregate most recently published by the management node.
• only when the communication quality between the management node and the storage node is extremely poor may the storage node fail to obtain the latest state information aggregate from the management node; that is, even when the storage node and the management node are in a connected state, there is still a small possibility that the state information aggregate cached by the storage node is not the latest state information aggregate published by the management node. Therefore, when the storage node and the management node are in a connected state, the version numbers of the currently cached state information aggregates are further compared, and only if the version numbers are the same are the respective currently cached state information aggregates considered valid. This approach can further improve reliability.
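The acceptance rule described above can be sketched as a small decision function (a minimal illustration only; the patent prescribes no code, and the function name `should_accept_write` and the refuse-when-disconnected policy are assumptions of this sketch):

```python
def should_accept_write(carried_version: int,
                        local_version: int,
                        connected_to_management_node: bool) -> bool:
    """Decide whether a storage node accepts a write request.

    The requester embeds the version number of its cached state
    information aggregate in the write request; the receiver compares
    it with its own cached version.  Only when both versions match and
    the receiver is connected to the management node can both cached
    aggregates be considered the latest published version.
    """
    if not connected_to_management_node:
        # The receiver's own aggregate may be stale; refuse to be safe.
        return False
    return carried_version == local_version

# Matching versions while connected -> accept; stale requester -> refuse.
print(should_accept_write(7, 7, True))   # True
print(should_accept_write(6, 7, True))   # False
```

A real implementation might instead queue or retry the request when the management node is unreachable; the hard refusal here just keeps the sketch minimal.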
• the storage node Nd3 sends, to the storage node Nd1, a write data response Wr1 for responding to the write data request Wq1, where the write data response Wr1 carries the write operation result for the data strip Pd2. Specifically, when the storage node Nd3 successfully writes the data strip Pd2 into the memory Nd3-D1, the write operation result is a write success; when the storage node Nd3 fails to write the data strip Pd2 into the memory Nd3-D1 (for example, the storage node Nd3 refuses to write the data strip into the memory Nd3-D1), the corresponding write operation result is a write failure.
• step 607: after receiving the write data request Wq2 from the storage node Nd1, the storage node Nd4 performs a version-number determination on the state information aggregate similar to that in step 606 (the differences being that the storage node Nd3 becomes the storage node Nd4, and the write data request Wq1 becomes the write data request Wq2) to determine whether to write the check strip Pj1 into the memory Nd4-D1.
• the specific implementation process of step 607 is therefore not described again here.
• step 606 and step 607 may be performed simultaneously, or either one may be performed first.
  • the storage node Nd1 receives the write data response Wr1 from the storage node Nd3.
  • the storage node Nd1 receives the write data response Wr2 from the storage node Nd4.
• if a write data response indicates that the corresponding write failed, the storage node Nd1 may resend the corresponding write data request to request a rewrite. For example, if the write data response Wr1 indicates that the corresponding write failed, the storage node Nd1 may resend the write data request Wq1 to the storage node Nd3.
• because the memory Nd2-D1 in the storage node Nd2 is in the offline state, the data strip Pd1 fails to be written into the memory Nd2-D1; the storage node Nd1 therefore generates a reconstruction log log1.
• the storage node Nd1 sends the reconstruction log log1 to a storage node into which a strip was written successfully (for example, the storage node Nd3 and/or the storage node Nd4).
• the storage node Nd3 and/or the storage node Nd4 can receive and store the reconstruction log log1, which is used to reconstruct the data strip Pd1 after the memory Nd2-D1 comes back online.
• the reconstruction log log1 records the strip identifier of the strip (the data strip Pd1) whose write failed and the memory identifier of the memory (the memory Nd2-D1) into which the write failed.
• the reconstruction log log1 may also record the stripe identifier of the stripe to which the write-failed strip (the data strip Pd1) belongs.
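A reconstruction log entry of this kind could be represented as follows (an illustrative sketch; the patent does not prescribe a concrete encoding, and all field names here are hypothetical):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ReconstructionLogEntry:
    strip_id: str    # identifier of the strip whose write failed (e.g. Pd1)
    memory_id: str   # identifier of the memory the write failed on (e.g. Nd2-D1)
    stripe_id: str   # optional: identifier of the stripe the strip belongs to

# log1 as described above: data strip Pd1 failed to be written to Nd2-D1.
log1 = ReconstructionLogEntry(strip_id="Pd1", memory_id="Nd2-D1", stripe_id="S1")
print(asdict(log1))
```

Because the entries are frozen (and therefore hashable), a set of entries collected from several storage nodes can later be deduplicated with `set(...)`, matching the deduplication step described for the reconstruction phase.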
• the write data scenario mainly targeted in the foregoing solution is a possible scenario in which the memories corresponding to the strips of a stripe are not located in the same storage node.
• because a state information aggregate for recording memory states is introduced, the state information aggregate can be used to record and manage the access states of the memories of the distributed storage system.
• this embodiment describes a method for writing data to be written.
• the memories corresponding to the strips of the stripe corresponding to the data to be written are not located in the same storage node, and the storage node Nd1 and the management node are in a connected state.
• the storage node Nd1 initially believes that the strips can be successfully written into the memories Nd3-D1 and Nd4-D1. In order to verify that the state information aggregate cached by the storage node Nd1 is valid, the storage node Nd1 carries the version number of its cached state information aggregate in the write data requests sent to the other storage nodes (the storage node Nd3 and the storage node Nd4), and each of those storage nodes compares the version number carried in the write data request with the version number of its own cached state information aggregate.
• if the version number of the state information aggregate carried in the write data request is the same as the version number of the currently cached state information aggregate, the state information aggregate cached by the storage node Nd1 is considered valid, and the memory states recorded in it should be accurate. In this case, the corresponding data write will usually succeed, so the data is written directly into the relevant memory. Conversely, the state information aggregate cached by the storage node Nd1 may be invalid, so the states of the related memories recorded in it are likely to be inaccurate; the corresponding data write is then likely to fail, so it is relatively reasonable to refuse the write.
  • the distributed storage system may include m storage nodes, wherein the m storage nodes include storage nodes Nd1, Nd2, Nd3, Nd4, and the like.
• the storage node Nd1 is the management node that publishes the state information aggregate (that is, the storage node where the primary monitor is located).
• the memory Nd2-D1 belongs to the storage node Nd2,
• the memory Nd3-D1 belongs to the storage node Nd3,
• and the memory Nd4-D1 belongs to the storage node Nd4.
  • Such a data reconstruction method can be performed, for example, after the scheme illustrated in the example of FIG. 6-A.
  • FIG. 7-A is a schematic flowchart of a data reconstruction method according to an embodiment of the present application.
  • a data reconstruction method provided by an embodiment of the present application may include:
• the storage node Nd2 may send a memory state change report P2 to the management node Nd1, where the memory state change report P2 may indicate that the state of the memory Nd2-D1 is changed from the offline state to the data reconstruction state.
• specifically, for example, the memory state change report P2 carries the memory identifier of the memory Nd2-D1 and a state identifier (the state identifier is the state identifier of the data reconstruction state, that is, the state indicated by the state identifier is the data reconstruction state).
• the management node Nd1 receives the memory state change report P2 from the storage node Nd2, and the management node Nd1 updates its cached state information aggregate. Specifically, the management node Nd1 updates the state of the memory Nd2-D1 recorded in the cached state information aggregate to the state of the memory Nd2-D1 indicated by the memory state change report P2, and updates the version number recorded in the state information aggregate.
• the memory state change report P2 may be sent by the monitor in the storage node Nd2 to the management node Nd1, and correspondingly, the primary monitor in the management node Nd1 receives the memory state change report P2 from the storage node Nd2.
• the state information aggregate before the management node Nd1 updates it is, for example, as shown in FIG. 3-C, and the state information aggregate after the update is, for example, as shown in FIG. 7-B. Of course, in actual application, the state information aggregate is not limited to such an exemplary form.
  • the storage node Nd1 sends the updated state information aggregate to other storage nodes.
• the storage nodes Nd2, Nd3, and Nd4 respectively receive the updated state information aggregate from the management node Nd1, and each of the storage nodes Nd2, Nd3, and Nd4 updates its currently cached state information aggregate with the state information aggregate from the management node Nd1. In this way, the state information aggregates maintained by the respective storage nodes (Nd1, Nd2, Nd3, and Nd4) can remain relatively synchronized.
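The update-and-publish step just described can be sketched like this (illustrative only; the class name `StateInfoAggregate`, the state strings, and the push-style `publish` helper are inventions of this sketch, not the patent's wording):

```python
class StateInfoAggregate:
    """Cached mapping from memory identifier to access state, plus a version."""
    def __init__(self):
        self.version = 0
        self.states = {}  # e.g. {"Nd2-D1": "offline"}

    def apply_report(self, memory_id, new_state):
        # On receiving a memory state change report, the management node
        # updates the recorded state and bumps the aggregate's version number.
        self.states[memory_id] = new_state
        self.version += 1

def publish(master, replicas):
    # The management node pushes its updated aggregate to the other storage
    # nodes so that all cached copies stay relatively synchronized.
    for r in replicas:
        r.version = master.version
        r.states = dict(master.states)

mgmt = StateInfoAggregate()
nd2_cache, nd3_cache = StateInfoAggregate(), StateInfoAggregate()
mgmt.apply_report("Nd2-D1", "data-reconstruction")   # report P2 arrives
publish(mgmt, [nd2_cache, nd3_cache])
print(nd3_cache.states["Nd2-D1"], nd3_cache.version)  # data-reconstruction 1
```

The monotonically increasing version number is what later lets a storage node cheaply check, by comparison alone, whether a peer's cached aggregate matches its own.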
• the management node Nd1 notifies the storage node Nd2 to perform data reconstruction for the memory Nd2-D1.
• the management node Nd1 can notify the storage node Nd2 to perform data reconstruction for the memory Nd2-D1 by sending the updated state information aggregate,
• or the management node Nd1 may notify the storage node Nd2 to perform data reconstruction for the memory Nd2-D1 through another message (for example, a data reconstruction notification message).
• the storage node Nd2 collects, from the storage node Nd3 and/or the storage node Nd4, the related reconstruction logs generated during the offline period of the memory Nd2-D1.
• the storage node Nd2 reconstructs data based on the related reconstruction logs for the offline period of the memory Nd2-D1.
• for example, the storage node Nd2 collects the reconstruction log log1 generated during the offline period of the memory Nd2-D1, where the reconstruction log log1 records the strip identifier of the strip (the data strip Pd1) whose write failed during the offline period of the memory Nd2-D1 and the memory identifier of the memory (the memory Nd2-D1) into which the write failed.
• the reconstruction log log1 may also record the stripe identifier of the stripe to which the write-failed strip (the data strip Pd1) belongs.
• the storage node Nd2 may determine the stripe to which the data strip Pd1 belongs according to the stripe identifier of the stripe to which the data strip Pd1 belongs, and further acquire the other strips of that stripe (the data strip Pd2 and the check strip Pj1). The storage node Nd2 performs a redundancy check operation based on the data strip Pd2 and the check strip Pj1 to reconstruct the data strip Pd1, and the storage node Nd2 writes the reconstructed data strip Pd1 into the memory Nd2-D1.
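The redundancy check operation can be illustrated with the simplest possible code, assuming an XOR parity scheme (as in RAID-5). The patent speaks only of a "redundancy check operation" and may equally use other codes, e.g. Reed-Solomon erasure codes; this sketch is not the patent's method:

```python
def xor_strips(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length strips byte by byte."""
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))

# A two-data-strip stripe: the parity Pj1 is the XOR of Pd1 and Pd2.
pd1 = b"\x01\x02\x03\x04"
pd2 = b"\x10\x20\x30\x40"
pj1 = xor_strips(pd1, pd2)

# Reconstruct the lost strip Pd1 from the surviving strip Pd2 and parity Pj1.
recovered = xor_strips(pd2, pj1)
assert recovered == pd1
print(recovered.hex())  # 01020304
```

With XOR parity, any one missing strip of the stripe can be recovered by XOR-ing all the surviving strips, which is exactly the property the reconstruction step relies on.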
• if the storage node Nd2 collects a plurality of reconstruction logs, the storage node Nd2 may first perform deduplication processing on the collected reconstruction logs, and then perform data reconstruction separately based on the deduplicated reconstruction logs.
• the manner of data reconstruction based on a reconstruction log can be as exemplified above.
• after completing the data reconstruction, the storage node Nd2 sets the state of the memory Nd2-D1 to the trusted access state. Further, the storage node Nd2 may send a memory state change report P3 to the management node Nd1, where the memory state change report P3 may indicate that the state of the memory Nd2-D1 is changed from the data reconstruction state to the trusted access state.
• the management node Nd1 receives the memory state change report P3 from the storage node Nd2, and the management node Nd1 updates its cached state information aggregate. Specifically, the management node Nd1 updates the state of the memory Nd2-D1 recorded in the cached state information aggregate to the state of the memory Nd2-D1 indicated by the memory state change report P3, and updates the version number recorded in the state information aggregate.
• the memory state change report P3 may be sent by the monitor in the storage node Nd2 to the management node Nd1, and correspondingly, the primary monitor in the management node Nd1 receives the memory state change report P3 from the storage node Nd2.
• the state information aggregate before the management node Nd1 updates it is, for example, as shown in FIG. 7-B, and the state information aggregate after the update is, for example, as shown in FIG. 7-C. Of course, in actual application, the state information aggregate is not limited to such an exemplary form.
  • the management node Nd1 sends the updated state information aggregate to other storage nodes (for example, Nd2, Nd3, Nd4, etc.).
• the storage nodes Nd2, Nd3, and Nd4 can respectively receive the updated state information aggregate from the management node Nd1, and each of the storage nodes Nd2, Nd3, and Nd4 can update its currently cached state information aggregate with the state information aggregate from the management node Nd1. In this way, the state information aggregates cached by the respective storage nodes (Nd1, Nd2, Nd3, and Nd4) can be kept as synchronized as possible.
• the above solution is mainly directed to a data reconstruction scenario after an offline memory comes back online, and gives a possible data reconstruction mechanism for this scenario.
• after the offline memory comes back online, the storage node where it is located notifies the management node, and the management node updates the state of the memory recorded by itself to the "online reconstruction state" (the data reconstruction state); the management node then publishes the latest state information aggregate to the other storage nodes, thereby notifying them that the state of this memory has been updated to the "online reconstruction state". The associated storage nodes then reconstruct the strips that previously failed to be written into this memory.
• after the reconstruction is completed, the storage node where the re-onlined memory is located updates the state of the re-onlined memory to the "trusted access state" and notifies the management node to likewise update the state of the re-onlined memory to the "trusted access state"; then, the management node again publishes the latest state information aggregate to the other storage nodes, thereby notifying the remaining storage nodes that the state of the re-onlined memory has been updated to the "trusted access state".
• because a state information aggregate for recording memory states is introduced, the state information aggregate can be used to record the access states of the memories of the distributed storage system, so that the storage nodes in the distributed storage system can accurately follow the state switching of a memory:
• switching from the "offline state" to the "online reconstruction state", and then from the "online reconstruction state" to the "trusted access state". This helps to reduce the chance of failed read/write attempts, and thus helps to improve system performance.
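The memory lifecycle described above forms a small state machine. A sketch of the legal transitions (illustrative; the dict encoding and the assumption that a trusted memory can only go back offline are inventions of this sketch, not claims of the patent):

```python
# Allowed state transitions for a memory, following the description above.
TRANSITIONS = {
    "offline": {"data-reconstruction"},          # memory comes back online
    "data-reconstruction": {"trusted-access"},   # reconstruction finished
    "trusted-access": {"offline"},               # memory fails / goes offline
}

def switch_state(current: str, target: str) -> str:
    """Return the new state, rejecting transitions the lifecycle forbids."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

s = "offline"
s = switch_state(s, "data-reconstruction")
s = switch_state(s, "trusted-access")
print(s)  # trusted-access
```

Encoding the transitions explicitly means a storage node can never, for example, mark a memory trusted without passing through the reconstruction state first, which is the invariant the read/write decisions rely on.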
  • the distributed storage system includes m storage nodes, and the m storage nodes include storage nodes Nd2, Nd1, Nd3, and Nd4.
  • the memory Nd2-D1 belongs to the storage node Nd2
  • the memory Nd3-D1 belongs to the storage node Nd3
  • the memory Nd4-D1 belongs to the storage node Nd4.
• the data body being a file is taken as an example for description.
  • FIG. 8 is a schematic flowchart diagram of a data access method according to an embodiment of the present application.
  • a data access method provided by an embodiment of the present application may include:
  • the terminal may send a read data request to the distributed storage system.
  • the read data request is named Rq1 in this embodiment.
  • the storage node Nd1 in the distributed storage system receives the read data request Rq1 from the terminal.
• the read data request Rq1 carries the file identifier of the file to which the data Data2 belongs (the data Data2 is the data to be read) and the location information of the data Data2.
• the location information of the data Data2 is used to describe the location of the data Data2 in the file to which it belongs.
• the location information of the data Data2 may include, for example, the file offset address of the data Data2 (the file offset address of the data Data2 indicates the start or end position of the data Data2 in the file), the length of the data Data2, and the like.
  • the storage node Nd1 determines the memory where the data Data2 is located based on the location information of the data Data2.
  • the storage node Nd1 determines that the memory in which the data Data2 is located is the memory Nd2-D1.
• when stored, the file to which the data Data2 belongs is divided into a plurality of data strips (and a number of check strips); one of the strips is stored in the memory Nd2-D1, and the data Data2 is part of that strip.
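One way such a location-to-memory mapping could work, assuming fixed-size strips laid out round-robin across memories (the patent does not specify the layout algorithm; the strip size, memory list, and `locate` function are purely illustrative):

```python
STRIP_SIZE = 110 * 1024  # bytes per data strip; 110 KB as in the example above
MEMORIES = ["Nd2-D1", "Nd3-D1", "Nd4-D1"]  # hypothetical placement order

def locate(offset: int):
    """Map a file offset to (strip index, memory holding that strip)."""
    strip_index = offset // STRIP_SIZE
    return strip_index, MEMORIES[strip_index % len(MEMORIES)]

print(locate(0))               # (0, 'Nd2-D1')
print(locate(STRIP_SIZE + 5))  # (1, 'Nd3-D1')
```

In a real system the placement is typically driven by distributed metadata or a placement function rather than a fixed modulo, but the input (file offset and length) and output (the memory holding the strip) are the same as in step 802.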
• the storage node Nd1 determines the state of the memory Nd2-D1 currently recorded in the locally cached state information aggregate.
  • the storage node Nd1 may be in a connected state with the management node, or may be in a disconnected state.
• when the state information aggregate currently cached by the storage node Nd1 records that the memory Nd2-D1 is in the trusted access state, step 804 is performed.
• when the state information aggregate currently maintained by the storage node Nd1 records that the memory Nd2-D1 is in the data reconstruction state or the offline state, step 807 is performed.
• the storage node Nd1 may add the version number of the state information aggregate currently cached locally by the storage node Nd1 to the read data request Rq1 to obtain the updated read data request Rq1.
  • the storage node Nd1 transmits an updated read data request Rq1 to the storage node Nd2 to which the memory Nd2-D1 belongs.
• the updated read data request Rq1 may carry the version number of the state information aggregate currently cached locally by the storage node Nd1. Further, the read data request Rq1 may also carry the location information of the data Data2, the file identifier of the file to which the data Data2 belongs, and the like.
• the storage node Nd1 carries the version number of its cached state information aggregate in the read data request sent to the other storage node mainly to facilitate the other storage node in verifying, by version-number comparison, whether the state information aggregate cached by the storage node Nd1 is valid.
• after receiving the updated read data request Rq1, the storage node Nd2 may compare the version number of the state information aggregate carried in the read data request Rq1 with the version number of the state information aggregate currently cached by itself.
• if the two version numbers are the same, and the storage node Nd2 and the management node are in a connected state (that is, the monitor in the storage node Nd2 is properly connected to the primary monitor in the management node), the state information aggregate currently cached by the storage node Nd2 is very likely the latest version of the state information aggregate maintained by the primary monitor, and the storage node Nd2 can read the data Data2 from the memory Nd2-D1.
• if the two version numbers are different, the storage node Nd2 may refuse to read the data from the memory Nd2-D1.
• if the storage node Nd2 and the management node are not in a connected state (that is, the monitor in the storage node Nd2 is not properly connected to the primary monitor in the management node), the state information aggregate currently cached by the storage node Nd2 may or may not be the latest version of the state information aggregate maintained by the primary monitor. Therefore, in this case, the storage node Nd2 may also refuse to read the data Data2 from the memory Nd2-D1; alternatively, the storage node Nd2 may still choose to read the data Data2 from the memory Nd2-D1.
• when the storage node Nd2 successfully reads the data Data2 from the memory Nd2-D1, the storage node Nd2 sends, to the storage node Nd1, a read data response Rr1 for responding to the read data request Rq1, where the read data response Rr1 carries the read data Data2.
• when the storage node Nd2 fails to read (or refuses to read) the data Data2, the read data response Rr1 may carry the read operation result (the read operation result being a read failure).
  • the storage node Nd1 receives the read data response Rr1 from the storage node Nd2. If the storage node Nd2 reads the data Data2 successfully, the storage node Nd1 can obtain the data Data2 from the read data response Rr1.
• the storage node Nd1 sends, to the terminal, a read data response Rr1 for responding to the read data request Rq1, where the read data response Rr1 carries the data Data2. This process then ends.
• because it is determined that the memory Nd2-D1 is in an untrusted access state, the storage node Nd1 determines the data strip Pd1 to which the data Data2 belongs, and the storage node Nd1 determines the stripe to which the data strip Pd1 belongs.
  • the storage node Nd1 determines the memory in which the data strip Pd2 and the check strip Pj1 are located.
  • the memory in which the data strip Pd2 is located is represented as a memory Nd3-D1, and the storage node to which the memory Nd3-D1 belongs is represented as a storage node Nd3.
  • the memory in which the check strip Pj1 is located is represented as a memory Nd4-D1, and the storage node to which the memory Nd4-D1 belongs is represented as a storage node Nd4.
• that is, the storage node Nd1 that received the read request determines the memories in which the remaining strips of the stripe are located, so that the remaining strips can subsequently be used, by means of a check algorithm, to recover the strip in which the data Data2 is located and thereby obtain the data Data2.
  • the storage node Nd1 sends a read data request Rq3 to the storage node Nd3 to which the memory Nd3-D1 belongs.
  • the storage node Nd1 transmits a read data request Rq4 to the storage node Nd4 to which the memory Nd4-D1 belongs.
• the read data request Rq3 carries the strip identifier of the data strip Pd2 and the version number of the state information aggregate currently cached by the storage node Nd1. Further, the read data request Rq3 may also carry the length of the data strip Pd2 (for example, 110 KB or another length), the file offset address of the data strip Pd2, the stripe identifier of the stripe to which the data strip Pd2 belongs, and the file identifier of the file to which the data strip Pd2 belongs, and may even carry the segment identifier of the file segment to which the data strip Pd2 belongs.
• the read data request Rq4 carries the strip identifier of the check strip Pj1 and the version number of the state information aggregate currently cached locally by the storage node Nd1. Further, the read data request Rq4 may also carry the length of the check strip Pj1 and the stripe identifier of the stripe to which the check strip Pj1 belongs.
• after receiving the read data request Rq3 from the storage node Nd1, the storage node Nd3 compares the version number of the state information aggregate carried in the read data request Rq3 with the version number of the state information aggregate currently cached by itself.
• consider the case where the version number of the state information aggregate carried in the read data request Rq3 is the same as the version number of the state information aggregate currently cached by the storage node Nd3, and the storage node Nd3 and the management node are in a connected state (that is, the monitor in the storage node Nd3 is properly connected to the primary monitor in the management node). In this case, the state information aggregate currently cached by the storage node Nd3 is the latest version of the state information aggregate maintained by the primary monitor, and because the carried version number matches, the state information aggregate currently cached by the storage node Nd1 is also the latest version maintained by the primary monitor. The state of the memory Nd3-D1 recorded in the state information aggregate currently cached by the storage node Nd1 (the trusted access state) is therefore accurate, and the storage node Nd3 reads the data strip Pd2 from the memory Nd3-D1.
• if the two version numbers are different, the storage node Nd3 may refuse to read the data strip Pd2 from the memory Nd3-D1.
• the storage node Nd3 sends, to the storage node Nd1, a read data response Rr3 for responding to the read data request Rq3, where the read data response Rr3 carries the read operation result for the data strip Pd2. Specifically, when the storage node Nd3 successfully reads the data strip Pd2 from the memory Nd3-D1, the read data response Rr3 carries a read operation result of read success together with the data strip Pd2.
• when the storage node Nd3 fails to read (or refuses to read) the data strip Pd2, the read data response Rr3 carries a read operation result of read failure.
• after receiving the read data request Rq4 from the storage node Nd1, the storage node Nd4 performs operations similar to those of the storage node Nd3 (for example, comparing the state information aggregate versions and sending a read data response Rr4 to the storage node Nd1). Since this step can refer to step 809, it is not described here.
  • the storage node Nd1 receives the read data response Rr3 from the storage node Nd3.
  • the storage node Nd1 receives the read data response Rr4 from the storage node Nd4.
• if a read data response indicates that the corresponding read failed, the storage node Nd1 may resend the corresponding read data request to request a re-read. For example, if the read data response Rr3 indicates that the corresponding read failed, the storage node Nd1 may resend the read data request Rq3 to the storage node Nd3 to re-request the data strip Pd2. For another example, if the read data response Rr4 indicates that the corresponding read operation failed, the storage node Nd1 may resend the read data request Rq4 to re-request the check strip Pj1. And so on.
• after the storage node Nd1 obtains the data strip Pd2 and the check strip Pj1, the storage node Nd1 performs a check operation by using the data strip Pd2 and the check strip Pj1 to obtain the data strip Pd1, and obtains the data Data2 from the data strip Pd1.
• the storage node Nd1 then sends, to the terminal, a read data response Rr1 for responding to the read data request Rq1, where the read data response Rr1 carries the data Data2.
• likewise, when the storage node Nd1 obtains the data Data2 from the read data response Rr1 from the storage node Nd2, the storage node Nd1 sends, to the terminal, a read data response Rr1 for responding to the read data request Rq1, where the read data response Rr1 carries the data Data2.
• because a state information aggregate for recording memory states is introduced into the distributed storage system, the state information aggregate can be used to record and manage the access states of the memories of the distributed storage system. Specifically, a mechanism is introduced in which the storage node that receives a read data request makes the read decision based on the result of comparing the version number of the state information aggregate carried in the read data request with the version number of the state information aggregate it currently caches; this helps to better determine, by version-number comparison, the validity of the state information aggregates used.
• in the trusted access case, the requested data can be read directly and fed back to the terminal, and because it is not necessary, as in the traditional method, to first read all the strips of the relevant stripe and perform the cumbersome correctness verification steps, the memory burden can be reduced by reducing the amount of data read, which in turn helps to improve system performance.
  • This embodiment describes a method of reading data to be read.
• the data to be read is a part of a strip or an entire strip. If the memory where the strip is located is trusted, the data to be read is returned directly to the requesting device (a terminal or a host); if the memory where the strip is located is not trusted (in the data reconstruction state or the offline state), a check calculation is performed on the other strips of the stripe in which the data to be read is located (that is, the strips other than the strip in which the data to be read is located), thereby obtaining the strip containing the data to be read, and then obtaining the data to be read from it.
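The two branches of the read path above can be sketched together as follows (assuming XOR parity for the check calculation; the function names, state strings, and data layout are inventions of this sketch, not the patent's method):

```python
def xor_all(strips):
    """XOR an iterable of equal-length strips together."""
    out = bytes(len(strips[0]))
    for s in strips:
        out = bytes(a ^ b for a, b in zip(out, s))
    return out

def read_data(strip_id, state, strips, parity):
    """Return the requested strip, recovering it via parity if untrusted.

    state   -- recorded access state of the memory holding strip_id
    strips  -- {strip_id: bytes} for the stripe's data strips
    parity  -- check strip of the stripe (XOR of all data strips)
    """
    if state == "trusted-access":
        return strips[strip_id]                 # direct read, no check needed
    # Untrusted (data-reconstruction/offline): recover from the other strips.
    others = [v for k, v in strips.items() if k != strip_id]
    return xor_all(others + [parity])

pd1, pd2 = b"\x0a\x0b", b"\x01\x02"
pj1 = xor_all([pd1, pd2])
assert read_data("Pd1", "trusted-access", {"Pd1": pd1, "Pd2": pd2}, pj1) == pd1
assert read_data("Pd1", "offline", {"Pd1": pd1, "Pd2": pd2}, pj1) == pd1
```

Note how the trusted branch touches one strip while the untrusted branch touches every other strip of the stripe, which is exactly why the embodiment stresses that the trusted case reduces the amount of data read.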
• the storage node Nd1 carries the version number of its cached state information aggregate in the read data requests sent to other storage nodes mainly to facilitate those storage nodes in verifying, by version-number comparison, whether the state information aggregate cached by the storage node Nd1 is valid.
• if the storage node Nd2 is connected to the management node normally, and the version number of the state information aggregate carried in the read data request Rq1 is the same as the version number of the state information aggregate currently cached by the storage node Nd2, the storage nodes Nd1 and Nd2 are considered to have the same, latest state information aggregates, and the storage node Nd2 can read the data Data2 from the memory Nd2-D1.
• if the storage node Nd2 is not connected to the management node, the state information aggregate currently cached by the storage node Nd2 may or may not be the latest version of the state information aggregate published by the management node. Because it is then difficult to determine the validity of the state information aggregate currently cached by the storage node Nd2, the storage node Nd2 can refuse to read the data Data2 from the memory Nd2-D1; alternatively, the storage node Nd2 may still read the data Data2 from the memory Nd2-D1.
  • the distributed storage system includes m storage nodes, wherein the m storage nodes include storage nodes Nd1, Nd2, Nd3, and Nd4.
  • the memory Nd2-D1 belongs to the storage node Nd2.
• the data body being a file is taken as an example for description.
  • FIG. 9 is a schematic flowchart diagram of another data access method according to another embodiment of the present application.
  • another data access method provided by another embodiment of the present application may include:
  • the terminal may send a read data request to the distributed storage system.
  • the read data request is named Rq1 in this embodiment.
  • the storage node Nd2 in the distributed storage system receives the read data request Rq1 from the terminal.
  • the related information carried by the read data request Rq1 in step 901 is the same as the related information carried by the read data request Rq1 in step 801.
  • the storage node Nd2 determines the memory where the data Data2 is located based on the location information of the data Data2.
  • a scheme for determining a memory in which data is stored based on location information has been described in other embodiments and will not be described here.
  • the storage node Nd2 determines that the memory in which the data Data2 is located is the memory Nd2-D1.
  • the storage node Nd2 determines the state of the memory Nd2-D1 recorded in the local cached state information aggregate.
• the state information aggregate currently cached by the storage node Nd2 records that the memory Nd2-D1 is in the trusted access state, and the storage node Nd2 and the management node are in a connected state (that is, the connection between the monitor in the storage node Nd2 and the primary monitor in the management node is normal, so the state information aggregate currently cached by the storage node Nd2 is the latest version of the state information aggregate published by the primary monitor).
• therefore, the storage node Nd2 can read the data Data2 from the memory Nd2-D1.
  • the storage node Nd2 transmits a read data response Rr1 for responding to the read data request Rq1 to the terminal, and the read data response Rr1 carries the data Data2.
  • the other strips of the stripe to which the data Data2 belongs may also be collected, and the data Data2 obtained by using those other strips. The concrete manner of implementation is not described here.
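The read decision of this embodiment (serve the read directly only when the node is connected to the management node and its cached aggregate marks the memory as trusted; otherwise fall back or refuse) can be sketched as follows. This is a minimal illustration: representing the cached state information aggregate as a plain mapping from memory identifiers to state strings is an assumption, not something this application specifies.

```python
# Minimal sketch of the read decision (data shapes are assumptions):
# direct read requires connection to the management node AND a cached
# aggregate that marks the memory as trusted.

def decide_read(memory_id, cached_aggregate, connected_to_management):
    state = cached_aggregate.get(memory_id)
    if connected_to_management and state == "trusted":
        return "read_directly"          # no full-stripe verification needed
    if state in ("data_reconstruction", "offline"):
        return "collect_other_strips"   # rebuild the data from the stripe
    return "refuse"                     # cached aggregate validity unknown

aggregate = {"Nd2-D1": "trusted"}
print(decide_read("Nd2-D1", aggregate, True))   # read_directly
print(decide_read("Nd2-D1", aggregate, False))  # refuse
```

The key property illustrated is that the connection state acts as a validity check on the cached aggregate: a trusted entry is only acted upon while the node can assume its cache is current.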
  • since a state information aggregate for recording the states of the memories is introduced into the distributed storage system, the state information aggregate can be used to record and manage the access states of the memories of the distributed storage system. Specifically, a mechanism is introduced in which the storage node that receives a read data request makes a read decision based on the state of the related memory recorded in its cached state information aggregate and on its connection state with the management node that publishes the state information aggregate.
  • this mechanism makes it easier to determine the validity of the state information aggregate that is used.
  • in the case of trusted access, the requested data can be read directly and fed back to the terminal, and since it is not necessary to first read all the strips of the relevant stripe and perform the cumbersome correctness-verification steps of the traditional method, the amount of data read is reduced, which helps to reduce the load on the memories and in turn helps to improve system performance.
  • when the data to be read (data Data2) is located in a memory (Nd2-D1) of the storage node (Nd2) that receives the read data request of the terminal/host, and the storage node (Nd2) and the management node are in a connected state, the memory states recorded in the state information aggregate locally cached by this storage node (Nd2) can be considered relatively reliable.
  • when the state of the memory (Nd2-D1) recorded in that state information aggregate is the trusted access state, the data to be read (data Data2) can be directly read from it and then returned to the terminal/host.
  • the distributed storage system includes m storage nodes, wherein the m storage nodes include storage nodes Nd2, Nd1, Nd3, and Nd4.
  • the memories Nd2-D1 belong to the storage node Nd2, the memories Nd3-D1 belong to the storage node Nd3, and the memories Nd4-D1 belong to the storage node Nd4.
  • the data body is taken as a file for description.
  • FIG. 10 is a schematic flowchart diagram of another data access method according to another embodiment of the present application.
  • another embodiment of the present application provides another data access method, which may include:
  • the storage node Nd1 determines the state of the memory Nd2-D1 recorded in the locally cached state information aggregate.
  • when the state information aggregate currently cached by the storage node Nd1 records that the memory Nd2-D1 is in the trusted access state, step 1004 is performed.
  • when the state information aggregate currently maintained by the storage node Nd1 records that the memory Nd2-D1 is in the data reconstruction state or the offline state, step 1007 is performed.
  • alternatively, the storage node Nd1 may forward the read data request Rq1 to the management node, or forward the read data request Rq1 to another storage node that is in a connected state with the management node.
  • the other storage node receiving Rq1 then determines the state of the memory Nd2-D1 recorded in its locally cached state information aggregate and performs the subsequent related operations accordingly; for the specific subsequent operations, refer to steps 1003 to 1011 of this embodiment. In this embodiment, the case where the storage node Nd1 and the management node are in a connected state is taken as an example for description.
  • since it is determined that the memory Nd2-D1 is in the trusted access state, the storage node Nd1 sends the read data request Rq1 to the storage node Nd2 to which the memory Nd2-D1 belongs.
  • after the storage node Nd2 receives the read data request Rq1 from the storage node Nd1, the storage node Nd2 can read the data Data2 from the memory Nd2-D1. The storage node Nd2 further sends to the storage node Nd1 a read data response Rr1 for responding to the read data request Rq1, wherein the read data response Rr1 carries the read data Data2.
  • the storage node Nd1 receives the read data response Rr1 from the storage node Nd2. If the storage node Nd2 read the data Data2 successfully, the storage node Nd1 can obtain the data Data2 from the read data response Rr1. After this step is completed, the process exits, and step 1007 and the subsequent steps are not executed.
  • the storage node Nd1 determines the data strip Pd1 to which the data Data1 belongs, and determines the stripe to which the data strip Pd1 belongs; the stripe includes the data strip Pd1.
  • the stripe further includes a data strip Pd2 and a check strip Pj1.
  • the storage node Nd1 determines the memory in which the data strip Pd2 and the check strip Pj1 are located.
  • the memory in which the data strip Pd2 is located is represented as a memory Nd3-D1
  • the storage node to which the memory Nd3-D1 belongs is represented as a storage node Nd3.
  • the memory in which the check strip Pj1 is located is represented as a memory Nd4-D1
  • the storage node to which the memory Nd4-D1 belongs is represented as a storage node Nd4.
  • the storage node Nd1 determines the states of the memories Nd3-D1 and Nd4-D1 recorded in the state information aggregate of its local cache.
  • if the state information aggregate currently maintained by the storage node Nd1 records that the memories Nd3-D1 and Nd4-D1 are in the untrusted access state, a data read failure may be directly fed back, or, after waiting for a set duration, the states of the memories Nd1-D1, Nd3-D1, and Nd4-D1 recorded in the currently cached state information aggregate may be re-examined.
  • the storage node Nd1 transmits a read data request Rq3 to the storage node Nd3 to which the memory Nd3-D1 belongs.
  • the storage node Nd1 transmits a read data request Rq4 to the storage node Nd4 to which the memory Nd4-D1 belongs.
  • the read data request Rq3 carries the version number of the stripe identifier of the data strip Pd2.
  • the read data request Rq3 may also carry the length of the data strip Pd2 (for example, 110 Kb), the file offset address of the data strip Pd2, the stripe identifier of the strip to which the data strip Pd2 belongs, and the file identifier of the file to which the data strip Pd2 belongs. It is even possible to carry the segment identifier of the file segment to which the data strip Pd2 belongs.
  • the read data request Rq4 carries the stripe identifier of the check strip Pj1. Further, the read data request Rq4 may also carry the length of the check strip Pj1 and the stripe identifier of the strip to which the check strip Pj1 belongs.
  • after the storage node Nd3 receives the read data request Rq3 from the storage node Nd1, the storage node Nd3 reads the data strip Pd2 from the memory Nd3-D1.
  • the storage node Nd3 sends to the storage node Nd1 a read data response Rr3 for responding to the read data request Rq3, and the read data response Rr3 carries the read operation result for the data strip Pd2. Specifically, when the storage node Nd3 successfully reads the data strip Pd2 from the memory Nd3-D1, the read operation result carried by the read data response Rr3 is read success, and the read data response Rr3 carries the data strip Pd2; when the storage node Nd3 does not successfully read the data strip Pd2 from the memory Nd3-D1, the read operation result carried by the read data response Rr3 is read failure.
  • after receiving the read data request Rq4 from the storage node Nd1, the storage node Nd4 reads the check strip Pj1 from the memory Nd4-D1.
  • the storage node Nd4 sends to the storage node Nd1 a read data response Rr4 for responding to the read data request Rq4, and the read data response Rr4 carries the read operation result for the check strip Pj1. Specifically, when the storage node Nd4 successfully reads the check strip Pj1 from the memory Nd4-D1, the read operation result carried by the read data response Rr4 is read success, and the read data response Rr4 carries the check strip Pj1; when the storage node Nd4 fails to read the check strip Pj1 from the memory Nd4-D1, the read operation result carried by the read data response Rr4 is read failure.
  • steps 1011 to 1012 are the same as steps 811 to 812.
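The check operation used in those steps, in which the strip containing the requested data is rebuilt from the surviving data strip and check strip, can be illustrated with a single XOR parity strip. This application does not fix the check algorithm, so treating Pj1 as the XOR of the data strips (i.e., T = 1) is an assumption for illustration; real systems often use more general erasure codes.

```python
# Illustration with one XOR check strip (T = 1, an assumption): the check
# strip Pj1 is the XOR of the data strips, so any single missing strip
# can be rebuilt by XOR-ing the surviving strips of the same stripe.
def xor_strips(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

pd1 = b"strip-Pd1-data!!"       # the strip holding the requested data
pd2 = b"strip-Pd2-data!!"
pj1 = xor_strips(pd1, pd2)      # check strip computed at write time

rebuilt_pd1 = xor_strips(pd2, pj1)  # check operation after reading Pd2, Pj1
assert rebuilt_pd1 == pd1
```

Once Pd1 is rebuilt this way, the data Data1 is extracted from it at the appropriate offset and returned to the terminal/host, as the embodiment describes.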
  • the storage node (Nd1) receives the read request, and the data to be read (data Data1) is located in a memory (Nd2-D1) of another storage node (Nd2). If Nd2 and the management node are in a connected state, and the state of the memory Nd2-D1 recorded in the state information aggregate locally cached by Nd2 is the trusted access state, the data to be read can be read directly from Nd2-D1. If Nd2 and the management node are not in a connected state, the memories (Nd3-D1 and Nd4-D1) in which the other strips of the stripe (the data strip Pd2 and the check strip Pj1) are located are determined.
  • if the memories (Nd3-D1 and Nd4-D1) in which the strips located in other memories (the data strip Pd2 and the check strip Pj1) reside are in a trusted access state, and the corresponding storage nodes are in a connected state with the management node, the strips located in other memories are read out, the strip (Pd1) containing the data to be read (data Data1) is obtained by the check algorithm, and the data to be read (data Data1) is obtained from Pd1 and returned to the terminal/host.
  • the state information aggregate can be used to record and manage the access states of the memories of the distributed storage system. Specifically, a mechanism is introduced in which a storage node that receives and relays a read data request makes a relay decision for the read data request based on the state of the related memory recorded in its cached state information aggregate and on its connection state with the management node that publishes the state information aggregate.
  • this mechanism makes it easier to determine the validity of the state information aggregate that is used.
  • in the case of such trusted access, the relevant storage node can be directly triggered to read the requested data and feed it back to the terminal, and since it is not necessary to first read all the relevant strips and perform the cumbersome correctness-verification steps of the traditional method, the reduced amount of data read helps to reduce the memory load, which in turn helps to improve system performance.
  • an embodiment of the present application provides a distributed storage system 1100, where the distributed storage system includes m storage nodes, each of the storage nodes includes at least one memory, each of the memories includes a non-volatile storage medium, and m is an integer greater than 1.
  • the first storage node 1110 of the m storage nodes is configured to: receive a first read data request from a terminal, where the first read data request carries first location information, and the first location information is used to describe a location of the first data in the data body to which it belongs; and determine, based on the first location information, a first memory in which the first data is located.
  • in a case where the first memory belongs to the first storage node, the first data is read from the first memory, and a first read data response for responding to the first read data request is sent to the terminal, the first read data response carrying the read first data.
  • the first data is, for example, part or all of the data in the strip to which the first data belongs.
  • the distributed storage system further includes a management node for publishing a collection of status information.
  • the first storage node is further configured to: in a case where the first memory belongs to a second storage node among the m storage nodes, the first memory is in a trusted access state, and the first storage node is in a connected state with the management node for publishing the state information aggregate, forward the first read data request to the second storage node.
  • the second storage node 1120 is configured to: after receiving the first read data request from the first storage node, read the first data from the first memory, and send, to the first storage node or the terminal, a first read data response for responding to the first read data request, the first read data response carrying the read first data.
  • the first storage node 1110 is further configured to: in a case where the first memory belongs to a second storage node among the m storage nodes and the first memory is in a trusted access state, add the identifier of the state information aggregate currently cached by the first storage node to the first read data request, and send, to the second storage node, the first read data request to which the identifier of the state information aggregate has been added.
  • the second storage node 1120 is configured to: after receiving the first read data request from the first storage node, compare the identifier of the state information aggregate carried by the first read data request with the identifier of the state information aggregate that it currently caches, and, in a case where the identifier of the state information aggregate carried by the first read data request is the same as the identifier of the state information aggregate that it currently caches, read the first data from the first memory, and send, to the first storage node or the terminal, a first read data response for responding to the first read data request, where the first read data response carries the read first data.
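The identifier comparison performed by the second storage node can be sketched as follows. Representing the state information aggregate identifier as a monotonically increasing version number is an assumption for illustration; the application only requires comparable identifiers.

```python
# Sketch: the receiving node serves a forwarded read only if the sender's
# aggregate identifier matches its own, so both nodes act on one shared
# view of the memory states (version numbers are an assumption here).
def serve_forwarded_read(request, local_aggregate_id, read_fn):
    if request["aggregate_id"] != local_aggregate_id:
        return {"status": "rejected"}     # views diverged: unsafe to serve
    return {"status": "ok", "data": read_fn(request["memory_id"])}

store = {"Nd2-D1": b"Data1-bytes"}
req = {"aggregate_id": 7, "memory_id": "Nd2-D1"}
print(serve_forwarded_read(req, 7, store.get))  # ok, data returned
print(serve_forwarded_read(req, 8, store.get))  # rejected: stale view
```

The same pattern generalizes to the variant below that additionally compares the node identifier of the management node, which guards against the two nodes following different (e.g., partitioned) management nodes.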
  • the first storage node 1110 is further configured to: in a case where the first memory belongs to a second storage node among the m storage nodes and the first memory is in a trusted access state, add the identifier of the state information aggregate currently cached by the first storage node and the node identifier of the management node for publishing the state information aggregate to the first read data request, and send, to the second storage node, the first read data request to which the identifier of the state information aggregate and the node identifier of the management node have been added.
  • the second storage node 1120 is configured to: after receiving the first read data request from the first storage node, compare the identifier of the state information aggregate carried by the first read data request with the identifier of the state information aggregate that it currently caches, and compare the node identifier of the management node carried by the first read data request with the node identifier of the management node that it currently caches.
  • in a case where the identifier of the state information aggregate carried by the first read data request is the same as the identifier of the state information aggregate that it currently caches, and the node identifier of the management node carried by the first read data request is the same as the node identifier of the management node that it currently caches, the first data is read from the first memory, and a first read data response for responding to the first read data request is sent to the first storage node or the terminal, the first read data response carrying the read first data.
  • the first storage node 1110 is further configured to: determine that the first memory is in an untrusted access state; and, in a case where, among the N memories in which the N strips of the stripe to which the first data belongs are located, x memories belong to the first storage node and the remaining N-x memories belong to y storage nodes different from the first storage node, send, to the y storage nodes, read data requests carrying the strip identifiers of the stripe and the identifier of the state information aggregate.
  • each of the y storage nodes is configured to: after receiving the read data request from the first storage node, compare the identifier of the state information aggregate carried by the read data request with the identifier of the state information aggregate that it currently caches, and, in a case where the identifier of the state information aggregate carried by the read data request is the same as the identifier of the state information aggregate that it currently caches, read the corresponding strip of the stripe from the corresponding memory that it includes, and send the read corresponding strip of the stripe to the first storage node.
  • the first storage node is further configured to: perform a check operation based on the strips of the stripe collected from the y storage nodes and the first storage node to obtain the first data, and send, to the terminal, a first read data response for responding to the first read data request, the first read data response carrying the obtained first data.
  • one of the m storage nodes is a management node.
  • the first storage node is further configured to: when the state of the first memory changes from a first state to a second state, send a memory state change report to the management node, where the memory state change report indicates that the first memory is in the second state.
  • the first state and the second state are different, and the first state and the second state include any one of the following states: an offline state, a data reconstruction state, and a trusted access state.
  • the management node is configured to: after receiving the memory state change report from the first storage node, update the state of the first memory recorded in the state information aggregate cached by the management node to the second state, and update the version number of the state information aggregate cached by the management node; the management node sends the updated state information aggregate to the storage nodes other than the management node among the m storage nodes.
  • the first storage node is further configured to update its current cached state information aggregate with the state information aggregate from the management node.
  • other storage nodes also update their current cached state information aggregates with the state information aggregate from the management node.
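The update-and-publish behavior just described (bump the aggregate's version number on a state change report, then push the new aggregate to every other storage node) might look like the following sketch. The class and attribute names are invented for illustration, and the direct in-process push stands in for whatever distribution protocol an implementation uses.

```python
# Sketch of the management node's publish path: on a memory state change
# report it updates its cached aggregate, bumps the version number, and
# pushes the updated aggregate to all subscribed storage nodes.
class ManagementNode:
    def __init__(self):
        self.aggregate = {}       # memory id -> state
        self.version = 0          # identifier of the aggregate
        self.storage_nodes = []   # nodes caching the aggregate

    def on_state_change_report(self, memory_id, new_state):
        self.aggregate[memory_id] = new_state
        self.version += 1
        for node in self.storage_nodes:     # publish updated aggregate
            node.cached_version = self.version
            node.cached_aggregate = dict(self.aggregate)

class NodeCache:                  # a storage node's cached view
    cached_version = 0
    cached_aggregate = {}

mgmt, nd1 = ManagementNode(), NodeCache()
mgmt.storage_nodes.append(nd1)
mgmt.on_state_change_report("Nd2-D1", "offline")
print(nd1.cached_version, nd1.cached_aggregate)
```

The version number produced here is what the earlier read and write paths compare as the "identifier of the state information aggregate".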
  • the first storage node is further configured to receive a first write data request from the terminal, where the first write data request carries second data and second location information, and the second location information is used to describe the location of the second data in the data body to which it belongs.
  • the first storage node is further configured to: determine, based on the second location information, W memories related to writing the second data; divide the second data into W-T data strips, and perform a calculation by using the W-T data strips to obtain T check strips, where the T check strips and the W-T data strips form a stripe comprising W strips, the W memories are in one-to-one correspondence with the W strips, T and W are positive integers, and T is smaller than W.
  • the first storage node is further configured to: in a case where the state information aggregate currently cached by the first storage node records that W1 memories among the W memories are in a non-offline state and W2 memories among the W memories are in an offline state, send, to the y2 storage nodes to which the W1 memories belong, write data requests carrying the corresponding strips of the stripe and the identifier of the state information aggregate.
  • each of the y2 storage nodes is configured to: after receiving the write data request from the first storage node, compare the identifier of the state information aggregate carried by the write data request with the identifier of the state information aggregate that it currently caches; in a case where the identifier of the state information aggregate carried by the write data request is the same as the identifier of the state information aggregate that it currently caches, write the corresponding strip of the stripe into the corresponding memory that it includes; and in a case where the identifier of the state information aggregate carried by the write data request is different from the identifier of the state information aggregate that it currently caches, refuse to write the corresponding strip of the stripe into the corresponding memory that it includes.
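The division of the write data into W-T data strips plus T check strips can be illustrated for T = 1 with an XOR check strip. The strip sizing, zero padding, and XOR code are assumptions for illustration; production systems typically use more general erasure codes with T > 1.

```python
# Sketch for T = 1: split the data into W-1 equal data strips (padded
# with zeros), then XOR them into one check strip -> a W-strip stripe.
def make_stripe(data: bytes, w: int) -> list:
    n = w - 1                                   # W-T data strips, T = 1
    size = -(-len(data) // n)                   # ceiling division
    strips = [data[i * size:(i + 1) * size].ljust(size, b"\0")
              for i in range(n)]
    check = strips[0]
    for s in strips[1:]:
        check = bytes(x ^ y for x, y in zip(check, s))
    return strips + [check]                     # W strips in total

stripe = make_stripe(b"abcdefgh", 3)            # 2 data strips + 1 check
print(stripe)  # [b'abcd', b'efgh', b'\x04\x04\x04\x0c']
```

Each of the W strips would then be dispatched to its corresponding memory, with the offline memories' strips withheld and logged as described in the embodiments below.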
  • the first storage node is further configured to receive a second write data request from the terminal, where the second write data request carries third data and third location information, and the third location information is used to describe the location of the third data in the data body to which it belongs.
  • the first storage node is further configured to: determine, based on the third location information, W memories related to writing the third data; divide the third data into W-T data strips, and perform a calculation by using the W-T data strips to obtain T check strips, where the T check strips and the W-T data strips form a stripe comprising W strips, the W memories are in one-to-one correspondence with the W strips, T and W are positive integers, and T is smaller than W.
  • the first storage node is further configured to: determine the states of the W memories recorded in its currently cached state information aggregate; and, in a case where it is determined that W1 memories among the W memories are in a non-offline state and W2 memories among the W memories are in an offline state, and the first storage node is in a connected state with the management node for publishing the state information aggregate, send, to the y2 storage nodes to which the W1 memories belong, write data requests carrying the corresponding strips of the stripe.
  • each of the y2 storage nodes is configured to, after receiving a write data request from the first storage node, write the stripe into a corresponding memory included therein Corresponding strips.
  • the first storage node is further configured to receive a third write data request from the terminal, where the third write data request carries fourth data and fourth location information, and the fourth location information is used to describe the location of the fourth data in the data body to which it belongs.
  • the first storage node is further configured to: determine, based on the fourth location information, W memories related to writing the fourth data; divide the fourth data into W-T data strips, and perform a calculation by using the W-T data strips to obtain T check strips, where the T check strips and the W-T data strips form a stripe comprising W strips, the W memories are in one-to-one correspondence with the W strips, T and W are positive integers, and T is smaller than W.
  • the first storage node is further configured to: in a case where the state information aggregate currently cached by the first storage node records that W1 memories among the W memories are in a non-offline state and W2 memories among the W memories are in an offline state, send, to the y2 storage nodes to which the W1 memories belong, write data requests carrying the corresponding strips of the stripe and the node identifier of the management node.
  • each of the y2 storage nodes is configured to: after receiving the write data request from the first storage node, compare the node identifier of the management node carried by the write data request with the node identifier of the management node that it currently caches; in a case where the node identifier of the management node carried by the write data request is the same as the node identifier of the management node that it currently caches, write the corresponding strip of the stripe into the corresponding memory that it includes; and in a case where the node identifier of the management node carried by the write data request is different from the node identifier of the management node that it currently caches, refuse to write the corresponding strip of the stripe into the corresponding memory that it includes.
  • the first storage node is further configured to generate a first reconstruction log, where the first reconstruction log records the memory identifier of a second memory among the W2 memories, and the first reconstruction log also records the strip identifier of the first strip, among the W strips, that corresponds to the second memory.
  • the second storage node to which the second memory belongs collects the first reconstruction log generated during the offline period of the second memory.
  • the second storage node acquires the identifier of the first strip to be written into the second memory recorded in the first reconstruction log; determines the W-T memories in which W-T strips of the stripe that includes the first strip are located; reads the W-T strips from the W-T memories; performs a check operation by using the W-T strips to reconstruct the first strip to be written into the second memory; and writes the reconstructed first strip into the second memory.
  • an embodiment of the present application further provides a storage node 1200, where the storage node is one of the m storage nodes included in a distributed storage system, and each of the storage nodes includes at least one memory.
  • Each of the memories includes a non-volatile storage medium, and m is an integer greater than one.
  • the storage node includes: a communication unit and a processing control unit.
  • the communication unit 1210 is configured to receive a first read data request from the terminal, where the first read data request carries first location information, and the first location information is used to describe the location of the first data in the data body to which it belongs.
  • a processing control unit 1220 is configured to: determine, based on the first location information, that the memory in which the first data is located is a first memory, the first memory belonging to the storage node; and, when the currently cached state information aggregate records that the first memory is in the trusted access state, read the first data from the first memory.
  • the communication unit 1210 is further configured to send, to the terminal, a first read data response for responding to the first read data request, where the first read data response carries the read first data.
  • the communication unit 1210 is further configured to: in a case where the first memory belongs to a second storage node among the m storage nodes, the first memory is in a trusted access state, and the storage node is in a connected state with the management node for publishing the state information aggregate, forward the first read data request to the second storage node.
  • the first read data request is used to trigger the second storage node to: read the first data from the first memory after receiving the first read data request from the storage node, and send, to the storage node or the terminal, a first read data response for responding to the first read data request, the first read data response carrying the read first data.
  • the communication unit is further configured to: in a case where the first memory belongs to a second storage node among the m storage nodes and the first memory is in a trusted access state, add the identifier of the state information aggregate currently cached by the storage node to the first read data request, and send, to the second storage node, the first read data request to which the identifier of the state information aggregate has been added.
  • the first read data request is used to trigger the second storage node to: after receiving the first read data request from the storage node, compare the identifier of the state information aggregate carried by the first read data request with the identifier of the state information aggregate that it currently caches; and, in a case where the identifier of the state information aggregate carried by the first read data request is the same as the identifier of the state information aggregate that it currently caches, read the first data from the first memory, and send, to the storage node or the terminal, a first read data response for responding to the first read data request, where the first read data response carries the read first data.
  • one of the m storage nodes is a management node.
  • the communication unit is further configured to: when the state of the first memory changes from a first state to a second state, send a memory state change report to the management node, where the memory state change report indicates that the first memory is in the second state; the first state and the second state are different, and each of the first state and the second state is any one of the following states: an offline state, a data reconstruction state, and a trusted access state.
  • the memory state change report is used to trigger the management node to: after receiving the memory state change report from the first storage node, update the state of the first memory recorded in the state information aggregate cached by the management node to the second state, update the version number of the state information aggregate cached by the management node, and send the updated state information aggregate to the storage nodes other than the management node among the m storage nodes.
  • the processing control unit is configured to update its current cached state information aggregate by using a state information aggregate from the management node.
  • an embodiment of the present application provides a storage node 1300, where the storage node is one of m storage nodes included in a distributed storage system, and each of the storage nodes includes at least one memory.
  • Each of the memories includes a non-volatile storage medium, and m is an integer greater than one.
  • the storage node includes a processor 1310 and a communication interface 1320 that are coupled to each other.
  • the processor is configured to perform some or all of the steps of the method performed by the first storage node or other storage node in the above method embodiments.
  • the memory 1330 is configured to store instructions and data
  • the processor 1310 is configured to execute the instructions
  • the communication interface 1320 is configured to communicate with other devices under the control of the processor 1310.
  • the processor 1310 may also be referred to as a central processing unit (CPU).
  • the components of the storage node in a particular application are coupled together, for example, via a bus system.
  • the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to the data bus.
  • various buses are labeled as bus system 1340 in the figure.
  • the method disclosed in the foregoing embodiment of the present application may be applied to the processor 1310 or implemented by the processor 1310.
  • the processor 1310 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 1310 or an instruction in a form of software.
  • the processor 1310 can be a general purpose processor, a digital signal processor, an application specific integrated circuit, an off-the-shelf programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 1310 can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application.
  • the general purpose processor may be a microprocessor, or the processor 1310 may be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly implemented by the hardware decoding processor, or may be performed by a combination of hardware and software modules in the decoding processor.
  • the software modules can be located in a storage medium well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers.
  • the storage medium is located in the memory 1330; for example, the processor 1310 reads the information in the memory 1330 and completes the steps of the above method in combination with its hardware.
  • the processor 1310 is configured to receive, via the communication interface 1320, a first read data request carrying first location information, where the first location information describes the position of first data within the data body to which it belongs; when the memory where the first data resides is determined, based on the first location information, to be a first memory belonging to the first storage node, and the state information aggregate currently records the first memory as being in the trusted access state, the processor reads the first data from the first memory and sends, via the communication interface, a first read data response to the first read data request, the response carrying the read first data.
  • the embodiment of the present application provides a computer readable storage medium, wherein the computer readable storage medium stores program code.
  • the program code includes instructions for performing some or all of the steps of a method performed by any one of the storage nodes (eg, the first storage node or the second storage node) of the first aspect.
  • embodiments of the present application provide a computer program product including instructions that, when the computer program product runs on a computer (such as a storage node), cause the computer to execute some or all of the steps of the method performed by any one of the storage nodes (e.g., the first storage node or the second storage node) in the foregoing aspects.
  • an embodiment of the present application further provides a service system, including a distributed storage service system and a terminal with a communication connection between them; the distributed storage service system is any of the distributed storage service systems provided by the embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer readable storage medium or transferred from one computer readable storage medium to another; for example, the computer instructions can be transferred from a website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means.
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that includes one or more available media.
  • the usable medium may be a magnetic medium (such as a floppy disk, hard disk, or magnetic tape), an optical medium (such as an optical disc), or a semiconductor medium (such as a solid state drive).
  • the disclosed apparatus may be implemented in other manners.
  • the device embodiments described above are merely illustrative; for example, the division into the above units is only a division by logical function, and other divisions are possible in actual implementation.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical or otherwise.
  • the units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the above integrated units if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-accessible memory.
  • the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory.
  • the software product includes a number of instructions for causing a computer device (which may be a personal computer, server, or network device, and in particular a processor in a computer device) to perform all or part of the steps of the above-described methods of the various embodiments of the present application.

Abstract

An access method for a distributed storage system, and related apparatus and systems. A data access method is applied to a distributed storage system that includes m storage nodes, including a first storage node, each storage node including at least one memory. The method includes: the first storage node receives a first read data request from a terminal; when the first storage node determines, based on first location information, that the memory where the first data resides is a first memory belonging to the first storage node, and the state information aggregate currently records the first memory as being in the trusted access state, the first storage node reads the first data from the first memory and sends the terminal a first read data response to the first read data request. Reducing the amount of data read helps lighten the load on the memories and thereby improve system performance.

Description

Access Method for a Distributed Storage System, and Related Apparatus and System

Technical Field

This application relates to the field of computer technology, and in particular to access methods for distributed storage systems and related apparatus and systems.

Background

Traditional network storage systems keep all data on a centralized storage server. Under this architecture the storage server becomes the system's performance bottleneck and a weak point for reliability and security, making it hard to meet the needs of large-scale storage applications. A distributed storage system instead spreads data across multiple independent devices (storage nodes). With a scalable architecture in which multiple storage nodes share the storage load, a distributed storage system improves reliability, availability, access efficiency, and scalability.

As storage capacity keeps growing, the numbers of storage nodes and hard disks in distributed storage systems keep expanding as well. More and more distributed storage systems use erasure code (EC) technology to scatter data across different hard disks to improve data reliability. EC can effectively improve system performance while preserving reliability as much as possible. Besides EC, other similar redundancy-check technologies may also be used by distributed storage systems.

A stripe (for example an EC stripe) consists of multiple strips, specifically N data strips and M parity strips (for example 2 data strips and 1 parity strip); the M parity strips are computed from the N data strips, and a data strip may be up to 1 MB long. In traditional schemes, even when a read operation targets only a small fraction of a stripe, every read must fetch all the data strips and parity strips of the stripe the target data belongs to, so that a correctness check can be performed to guarantee the target data is correct. Because stripes tend to be long, this greatly increases the disk burden and degrades system performance in scenarios such as random reads.
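As a concrete illustration of the 2:1 configuration mentioned above (not part of the patent text; the function name and the choice of XOR as the parity computation are assumptions for illustration), a stripe with 2 data strips and 1 parity strip can be sketched as:

```python
def make_stripe(data: bytes, strip_len: int):
    """Split data into 2 data strips and derive 1 XOR parity strip (2:1 sketch).

    Short input is zero-padded so both data strips have equal length,
    which byte-wise XOR parity requires.
    """
    d1 = data[:strip_len].ljust(strip_len, b"\x00")
    d2 = data[strip_len:2 * strip_len].ljust(strip_len, b"\x00")
    parity = bytes(a ^ b for a, b in zip(d1, d2))
    return d1, d2, parity
```

With XOR parity, any one lost strip equals the XOR of the two surviving strips, which is the redundancy property the full-stripe correctness check relies on.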
Summary

Embodiments of this application provide data access methods and related apparatus and systems.

In a first aspect, an embodiment of this application provides a data reading method applied to a distributed storage system that includes m storage nodes, where each storage node includes at least one memory (a hard disk or another form of memory, for example) and m is an integer greater than 1. The method may include: when a terminal needs to read first data from the distributed storage system, a first storage node among the m storage nodes receives a first read data request from the terminal. The first read data request carries first location information, which describes the position of the first data within the data body to which it belongs. When the first storage node determines, based on the first location information, that the memory where the first data resides is a first memory and the first memory belongs to the first storage node, the first storage node determines the state of the first memory currently recorded in the state information aggregate (the aggregate cached by the first storage node); the first memory may, for example, be in the trusted access state or a non-trusted access state. When the first memory is determined to be in the trusted access state, the first storage node reads the first data from the first memory and sends the terminal a first read data response to the first read data request, the response carrying the read first data. The first data is part or all of the data in the strip to which it belongs.

The data body in the embodiments of this application may be, for example, a file, an object, a file segment, a database record, or a data segment.

The state information aggregate in the embodiments of this application records hard-disk states and the like; its format may be a file or another data format. If the format is a file, the state information aggregate may also be called a "state file".

It can be seen that, because a state information aggregate recording memory states is introduced into the distributed storage system, the aggregate can be used to record and manage the access states of the system's memories. For a memory that the aggregate records as being in the trusted access state, the requested data can be read directly from it and returned to the terminal. In this trusted-access case, the cumbersome traditional step of first reading out all the strips of the relevant stripe for a correctness check is unnecessary, so the reduced read volume lightens the memory load and in turn improves system performance.
In addition, in some possible implementations, before the first storage node determines the state of the first memory currently recorded in the state information aggregate, it may first check whether it has cached a state information aggregate. If it has, it determines the first memory's state from the cached aggregate; if it has not, and the first storage node is connected to the management node that publishes state information aggregates, the first storage node first obtains an aggregate from the management node and caches it, then determines the first memory's state from the cached aggregate.

In addition, in some possible implementations, before determining the first memory's state, the first storage node may first check whether it has cached a state information aggregate and, if so, whether the age of the cached aggregate exceeds a duration threshold (for example 2, 10, or 30 minutes; generally, the longer ago the cached aggregate was obtained, the more likely it is stale). When the age exceeds the threshold and the first storage node is connected to the management node that publishes state information aggregates, the first storage node first obtains the aggregate from the management node, replaces the previously cached aggregate with the newly obtained one, and then determines the first memory's state from the cache. When the age does not exceed the threshold, or the first storage node is not connected to the management node, the first storage node determines the first memory's state from its currently cached aggregate.
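A minimal sketch of this cache-freshness decision (the function, the dict layout with an `obtained_at` timestamp, and the 10-minute threshold are illustrative assumptions, not from the patent):

```python
import time

THRESHOLD_S = 600  # duration threshold, e.g. 10 minutes

def choose_aggregate(cached, connected_to_mgmt, fetch_latest):
    """Return the aggregate to consult: refresh the cache only when it is
    missing, or older than the threshold while the management node is reachable."""
    if cached is None:
        return fetch_latest() if connected_to_mgmt else None
    age_s = time.time() - cached["obtained_at"]
    if age_s > THRESHOLD_S and connected_to_mgmt:
        return fetch_latest()
    return cached
```

Note that a stale cache is still used when the management node is unreachable, matching the text: the node falls back to whatever it currently caches.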
In addition, in some possible implementations, the method may further include: when the first memory belongs to a second storage node among the m storage nodes, the first memory is determined to be in the trusted access state (for example, the first storage node's cached state information aggregate records it as such), and the first storage node is connected to the management node that publishes state information aggregates (in which case the first storage node's cached aggregate is very likely the management node's latest published one and therefore very likely still valid, so the trusted-access determination very likely matches the first memory's actual situation), the first storage node forwards the first read data request to the second storage node. After receiving the first read data request from the first storage node, the second storage node reads the first data from the first memory and sends the first storage node or the terminal a first read data response to the first read data request, the response carrying the read first data.

In some possible implementations, the method may further include: when the first memory belongs to a second storage node among the m storage nodes and is in the trusted access state (for example, as recorded in the first storage node's currently cached state information aggregate), or when additionally the first storage node is connected to the management node that publishes state information aggregates (in which case it is very likely to hold the management node's latest published aggregate), the first storage node may add to the first read data request the identifier of the state information aggregate it currently caches, and send the request carrying that identifier to the second storage node.

After receiving the first read data request from the first storage node, the second storage node may compare the aggregate identifier carried in the request with the identifier of the aggregate it itself currently caches. If they are the same, the first and second storage nodes cache the same aggregate, so the first storage node's cached aggregate is very likely currently valid (and if the first storage node is also connected to the management node, more likely still, so its trusted-access determination very likely matches the first memory's actual situation); the second storage node then reads the first data from the first memory and sends the first storage node or the terminal a first read data response carrying the read first data.
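The receiving node's identifier check can be sketched as follows (the request/aggregate structure and field names are illustrative assumptions):

```python
def handle_read(request, local_aggregate, read_from_memory):
    """Serve the forwarded read only if both nodes cache the same aggregate
    version; otherwise reject so the requester can refresh its cache."""
    if request["aggregate_id"] != local_aggregate["version"]:
        return {"status": "rejected", "reason": "stale aggregate"}
    data = read_from_memory(request["memory_id"], request["location"])
    return {"status": "ok", "data": data}
```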
In the embodiments of this application, the identifier of a state information aggregate includes, for example, the aggregate's version number and/or a digest of its data. It can be understood that the identifier of a given version of the aggregate can be any information that represents that version; the version number and/or data digest are just possible examples.
Normally a particular node (for example a particular storage node in the distributed storage system) plays the role of management node, but in special cases (for example when the current management node fails or lacks processing capacity) the storage node playing that role may change. For example, storage node A may play the management-node role during one period (the management node's node identifier then being storage node A's identifier), and in another period storage node A may give up the role and storage node B take it over (the management node's node identifier then being storage node B's identifier). In other words, because the storage node acting as management node may change, the management node's node identifier may change as well. Each storage node in the distributed storage system may cache the node identifier of what it currently regards as the latest management node; when a storage node discovers that the role has moved, it updates its previously cached management-node identifier with the newly discovered one.

In some possible implementations, the method may further include: when the first memory belongs to a second storage node among the m storage nodes, the first memory is in the trusted access state (for example, as recorded in the first storage node's cached aggregate), and the first storage node is connected to the management node that publishes state information aggregates, the first storage node adds to the first read data request the node identifier of the management node it currently caches, and sends the request carrying that identifier to the second storage node.

Then, upon receiving the first read data request from the first storage node (or upon receiving it while the second storage node is itself also connected to the management node), the second storage node compares the management-node identifier carried in the request with the management-node identifier it caches. If they are the same, both nodes regard the same storage node as the management node, and that storage node is then very likely the system's current, latest management node; in that case, if the first storage node is connected to the management node, its cached aggregate is very likely currently valid, so its trusted-access determination very likely matches the first memory's actual situation. The second storage node then reads the first data from the first memory and sends the first storage node or the terminal a first read data response carrying the read first data.

In some possible implementations, the method may further include: when the first memory belongs to a second storage node among the m storage nodes and is in the trusted access state (for example, as recorded in the first storage node's cached aggregate), or when additionally the first storage node is connected to the management node, the first storage node adds to the first read data request both the identifier of its currently cached state information aggregate and the node identifier of the management node, and sends the request carrying both identifiers to the second storage node.

Then, upon receiving the first read data request from the first storage node (or upon receiving it while itself also connected to the management node), the second storage node compares the aggregate identifier in the request with its own cached aggregate identifier, and the management-node identifier in the request with its own cached management-node identifier. When both pairs match, the two nodes regard the same storage node as the management node and cache the same aggregate, so the first storage node's cached aggregate is very likely currently valid (and more likely still if either node is connected to the management node), and its trusted-access determination very likely matches the first memory's actual situation. The second storage node then reads the first data from the first memory and sends the first storage node or the terminal a first read data response carrying the read first data.
In some possible implementations, the method further includes: when the first storage node determines that the first memory is in a non-trusted access state, the first storage node determines the stripe to which the first data belongs, and determines the N memories where the stripe's strips reside.

When all N memories belong to the first storage node, the first storage node reads the stripe's strips from the N memories and performs a parity computation on them to obtain the first data (specifically, it first computes the strip to which the first data belongs from the read strips, then extracts the first data from that strip), and sends the terminal a first read data response to the first read data request, carrying the obtained first data.

In some possible implementations, when x of the N memories holding the strips of the stripe the first data belongs to (the x memories including the first memory) belong to the first storage node, and the other N-x memories belong to y storage nodes other than the first storage node, the method may further include: when the first storage node determines that the first memory is in a non-trusted access state and that the other memories among the N are in the trusted access state (or additionally that the first storage node is connected to the management node that publishes state information aggregates), the first storage node sends the y storage nodes a read data request carrying the stripe's stripe identifier and the identifier of the first storage node's cached state information aggregate. Accordingly, each of the y storage nodes, after receiving the read data request from the first storage node, compares the carried aggregate identifier with the identifier of its own currently cached aggregate. If they are the same, the first storage node and the receiving node cache the same aggregate, so the first storage node's cached aggregate is very likely currently valid (more so if it is also connected to the management node), and its non-trusted-access determination very likely matches the first memory's actual situation; the receiving node then reads the stripe's corresponding strip from its corresponding memory and sends the read strip to the first storage node.

Accordingly, the first storage node performs a parity computation on the stripe's strips collected from the y storage nodes and from itself (here excluding the strip the first data belongs to) to obtain the first data, and sends the terminal a first read data response carrying the obtained first data.
In some possible implementations, one of the m storage nodes is the management node. The method may further include: when the first memory's state changes from a first state to a second state, the first storage node sends the management node a memory state change report indicating that the first memory is in the second state; the first state and the second state differ, and each is one of the following states: offline state, data reconstruction state, and trusted access state.

Accordingly, after receiving the memory state change report from the first storage node, the management node may update the first memory's state recorded in its cached state information aggregate to the second state and update the version number of that aggregate. The management node then sends the updated aggregate to the storage nodes among the m storage nodes other than itself.

Accordingly, the first storage node updates its currently cached state information aggregate with the aggregate from the management node.
In some possible implementations, the method further includes: for example, when a terminal needs to write second data to the distributed storage system, the first storage node may receive a first write data request from the terminal carrying the second data and second location information, where the second location information describes the second data's position within the data body it belongs to. The first storage node determines, based on the second location information, the W memories involved in writing the second data. It splits the second data into W-T data strips and computes T parity strips from them. The T parity strips and W-T data strips form one stripe of W strips, the W memories correspond one-to-one with the W strips, and T and W are positive integers with T less than W.

The first storage node determines the states of the W memories currently recorded in its cached state information aggregate. When it determines that W1 of the W memories are in a non-offline state (and, for example, W2 of them are offline), it sends the y2 storage nodes owning the W1 memories write data requests carrying the stripe's strips and the aggregate's identifier.

Accordingly, each of the y2 storage nodes, after receiving the write data request from the first storage node, compares the aggregate identifier carried in the request with the identifier of its own currently cached aggregate: if they are the same, it writes the stripe's corresponding strip into its corresponding memory; if they differ, it refuses to write the corresponding strip into its corresponding memory.
As another example, in some possible implementations, the method may further include: when a terminal needs to write third data to the distributed storage system, the first storage node receives a second write data request from the terminal carrying the third data and third location information describing the third data's position within the data body it belongs to. The first storage node determines, based on the third location information, the W memories involved in writing the third data; splits the third data into W-T data strips; and computes T parity strips from them, the T parity strips and W-T data strips forming one stripe of W strips in one-to-one correspondence with the W memories, with T and W positive integers and T less than W.

The first storage node determines the states of the W memories currently recorded in its cached aggregate. When it determines that W1 of them are in a non-offline state (and, for example, W2 are offline), and the first storage node is connected to the management node that publishes state information aggregates, it sends the y2 storage nodes owning the W1 memories write data requests carrying the stripe's strips. Accordingly, each of the y2 storage nodes writes the stripe's corresponding strip into its corresponding memory after receiving the write data request from the first storage node.

As another example, in some possible implementations, the method may further include: the first storage node receives a third write data request from the terminal carrying fourth data and fourth location information describing the fourth data's position within the data body it belongs to. After determining, based on the fourth location information, the W memories involved in writing the fourth data, it splits the fourth data into W-T data strips and computes T parity strips from them, the T parity strips and W-T data strips forming a stripe of W strips in one-to-one correspondence with the W memories, with T and W positive integers and T less than W. When the first storage node's cached aggregate currently records W1 of the W memories as non-offline (and, for example, W2 as offline), and the first storage node is connected to the management node, it sends the y2 storage nodes owning the W1 memories write data requests carrying the stripe's strips and the node identifier of the management node.

Accordingly, each of the y2 storage nodes, after receiving the write data request from the first storage node, may compare the management-node identifier carried in the request with its own cached management-node identifier: if they are the same, it writes the stripe's corresponding strip into its corresponding memory; if they differ, it may refuse to write the corresponding strip into its corresponding memory.
In some possible implementations, when a strip assigned to a memory (for example a second memory) was not successfully written because that memory was offline at the time, the strip can be reconstructed through a reconstruction mechanism after the memory comes back online. For example, W1 is less than W, and the first storage node's cached state information aggregate currently records W2 of the W memories as offline. The method may further include: the first storage node generates a first reconstruction log, which records the memory identifier of the second memory among the W2 memories, further records the strip identifier of the first strip (the strip among the W strips corresponding to the second memory), and also records the stripe's stripe identifier; the second memory is any one of the W2 memories.

Accordingly, after the second memory comes back online, and the second storage node it belongs to has collected the first reconstruction log generated during the second memory's offline period, the second storage node obtains from the log the identifier of the first strip to be written to the second memory; determines the W-T memories holding the stripe's other W-T strips besides the first strip; reads those W-T strips from the W-T memories; performs a parity computation on them to reconstruct the first strip to be written to the second memory; and writes the reconstructed first strip into the second memory.
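For a 2:1 configuration with XOR parity (an illustrative assumption, since the patent does not fix the parity algorithm), reconstructing the missing strip from the surviving ones can be sketched as:

```python
def reconstruct_strip(surviving: list) -> bytes:
    """XOR the surviving strips of a stripe to recover the missing one.

    With plain XOR parity, the XOR of all strips in a stripe is zero,
    so any single missing strip equals the XOR of the survivors."""
    out = bytearray(len(surviving[0]))
    for strip in surviving:
        for i, b in enumerate(strip):
            out[i] ^= b
    return bytes(out)
```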
It can be seen that the schemes exemplified above provide several possible implementations that can, to a certain degree, verify the validity of the state information aggregate; they help satisfy verification needs at multiple reliability levels, and in turn help satisfy multiple read/write access-efficiency needs.
In a second aspect, an embodiment of this application further provides a distributed storage system comprising m storage nodes, where each storage node includes at least one memory, each memory contains a non-volatile storage medium, and m is an integer greater than 1.

A first storage node among the m storage nodes is configured to: receive a first read data request from a terminal, the request carrying first location information describing the position of first data within the data body it belongs to; and determine, based on the first location information, the first memory where the first data resides.

When the first memory belongs to the first storage node, and the state information aggregate currently records the first memory as being in the trusted access state, the first storage node reads the first data from the first memory and sends the terminal a first read data response to the first read data request, carrying the read first data. The first data is, for example, part or all of the data in the strip it belongs to.
In some possible implementations, the distributed storage system further includes a management node for publishing state information aggregates.

The first storage node is further configured to forward the first read data request to a second storage node when the first memory belongs to the second storage node among the m storage nodes, the first memory is in the trusted access state, and the first storage node is connected to the management node that publishes state information aggregates.

The second storage node is configured to, after receiving the first read data request from the first storage node, read the first data from the first memory and send the first storage node or the terminal a first read data response to the first read data request, carrying the read first data.

In some possible implementations, the first storage node is further configured to, when the first memory belongs to a second storage node among the m storage nodes and is in the trusted access state, add to the first read data request the identifier of the state information aggregate it currently caches, and send the request carrying that identifier to the second storage node.

The second storage node is configured to, after receiving the first read data request from the first storage node, compare the aggregate identifier carried in the request with the identifier of its own currently cached aggregate; when they are the same, it reads the first data from the first memory and sends the first storage node or the terminal a first read data response carrying the read first data.

In some possible implementations, the first storage node is further configured to, when the first memory belongs to a second storage node among the m storage nodes and is in the trusted access state, add to the first read data request both the identifier of its currently cached state information aggregate and the node identifier of the management node that publishes state information aggregates, and send the request carrying both identifiers to the second storage node.

Accordingly, the second storage node is configured to, after receiving the first read data request from the first storage node, compare the carried aggregate identifier with its own cached aggregate identifier and the carried management-node identifier with its own cached management-node identifier; when both pairs are the same, it reads the first data from the first memory and sends the first storage node or the terminal a first read data response carrying the read first data.
In some possible implementations, the first storage node is further configured to: when it determines that the first memory is in a non-trusted access state, and x of the N memories holding the strips of the stripe the first data belongs to are memories of the first storage node while the other N-x memories belong to y storage nodes other than the first storage node, send the y storage nodes a read data request carrying the stripe's stripe identifier and the identifier of the state information aggregate.

Each of the y storage nodes is configured to, after receiving the read data request from the first storage node, compare the carried aggregate identifier with the identifier of its own currently cached aggregate; when they are the same, it reads the stripe's corresponding strip from its corresponding memory and sends the read strip to the first storage node.

The first storage node is further configured to perform a parity computation on the stripe's strips collected from the y storage nodes and from itself to obtain the first data, and to send the terminal a first read data response to the first read data request, carrying the obtained first data.
In some possible implementations, for example, one of the m storage nodes is the management node. The first storage node is further configured to, when the first memory's state changes from a first state to a second state, send the management node a memory state change report indicating that the first memory is in the second state; the first state and the second state differ, and each is one of the following states: offline state, data reconstruction state, and trusted access state.

The management node is configured to, after receiving the memory state change report from the first storage node, update the first memory's state recorded in its cached state information aggregate to the second state, update that aggregate's version number, and send the updated aggregate to the storage nodes among the m storage nodes other than itself.

Accordingly, the first storage node is further configured to update its currently cached state information aggregate with the aggregate from the management node; the other storage nodes likewise update their currently cached aggregates with the aggregate from the management node.
In some possible implementations, the first storage node is further configured to receive a first write data request from the terminal carrying second data and second location information describing the second data's position within the data body it belongs to.

The first storage node is further configured to determine, based on the second location information, the W memories involved in writing the second data; to split the second data into W-T data strips; and to compute T parity strips from them. The T parity strips and W-T data strips form a stripe of W strips, the W memories correspond one-to-one with the W strips, and T and W are positive integers with T less than W.

The first storage node is further configured to, when its cached state information aggregate currently records W1 of the W memories as non-offline and W2 of them as offline, send the y2 storage nodes owning the W1 memories write data requests carrying the stripe's strips and the aggregate's identifier.

Accordingly, each of the y2 storage nodes is configured to, after receiving the write data request from the first storage node, compare the aggregate identifier carried in the request with the identifier of its own currently cached aggregate: when they are the same, it writes the stripe's corresponding strip into its corresponding memory; when they differ, it refuses to write the corresponding strip into its corresponding memory.
In some possible implementations, the first storage node is further configured to receive a second write data request from the terminal carrying third data and third location information describing the third data's position within the data body it belongs to.

The first storage node is further configured to determine, based on the third location information, the W memories involved in writing the third data; to split the third data into W-T data strips; and to compute T parity strips from them, the T parity strips and W-T data strips forming a stripe of W strips in one-to-one correspondence with the W memories, with T and W positive integers and T less than W.

The first storage node is further configured to determine the states of the W memories currently recorded in its cached aggregate and, when it determines that W1 of the W memories are in a non-offline state (and, for example, W2 of them are offline) and the first storage node is connected to the management node that publishes state information aggregates, to send the y2 storage nodes owning the W1 memories write data requests carrying the stripe's strips.

Accordingly, each of the y2 storage nodes is configured to write the stripe's corresponding strip into its corresponding memory after receiving the write data request from the first storage node.
In some possible implementations, the first storage node is further configured to receive a third write data request from the terminal carrying fourth data and fourth location information describing the fourth data's position within the data body it belongs to.

The first storage node is further configured to determine, based on the fourth location information, the W memories involved in writing the fourth data; to split the fourth data into W-T data strips; and to compute T parity strips from them, the T parity strips and W-T data strips forming a stripe of W strips in one-to-one correspondence with the W memories, with T and W positive integers and T less than W.

The first storage node is further configured to, when its cached state information aggregate currently records W1 of the W memories as non-offline (and, for example, W2 of them as offline) and the first storage node is connected to the management node that publishes state information aggregates, send the y2 storage nodes owning the W1 memories write data requests carrying the stripe's strips and the node identifier of the management node.

Accordingly, each of the y2 storage nodes is configured to, after receiving the write data request from the first storage node, compare the management-node identifier carried in the request with its own cached management-node identifier: when they are the same, it writes the stripe's corresponding strip into its corresponding memory; when they differ, it refuses to write the corresponding strip into its corresponding memory.

In some possible implementations, the first storage node is further configured to generate a first reconstruction log, which records the memory identifier of a second memory among the W2 memories, further records the strip identifier of the first strip (the strip among the W strips corresponding to the second memory), and also records the stripe's stripe identifier; the second memory is any one of the W2 memories.

Accordingly, for example, after the second memory comes back online, and the second storage node it belongs to has collected the first reconstruction log generated during the second memory's offline period, the second storage node obtains from the log the identifier of the first strip to be written to the second memory; determines the W-T memories holding the stripe's other W-T strips besides the first strip; reads those W-T strips from the W-T memories; performs a parity computation on them to reconstruct the first strip to be written to the second memory; and writes the reconstructed first strip into the second memory.
In a third aspect, an embodiment of this application further provides a storage node, which is one of the m storage nodes included in a distributed storage system, where each storage node includes at least one memory, each memory contains a non-volatile storage medium, and m is an integer greater than 1. The storage node includes a communication unit and a processing control unit.

The communication unit is configured to receive a first read data request from a terminal, carrying first location information describing the position of first data within the data body it belongs to.

The processing control unit is configured to, when the memory where the first data resides is determined, based on the first location information, to be a first memory belonging to the storage node, and the state information aggregate currently records the first memory as being in the trusted access state, read the first data from the first memory.

The communication unit is further configured to send the terminal a first read data response to the first read data request, carrying the read first data.
In some possible implementations, the communication unit is further configured to forward the first read data request to a second storage node when the first memory belongs to the second storage node among the m storage nodes, the first memory is in the trusted access state, and the storage node is connected to the management node that publishes state information aggregates.

The first read data request triggers the second storage node, after receiving it from the first storage node, to read the first data from the first memory and send the storage node or the terminal a first read data response to the first read data request, carrying the read first data.

In some possible implementations, the communication unit is further configured to, when the first memory belongs to a second storage node among the m storage nodes and is in the trusted access state, add to the first read data request the identifier of the state information aggregate currently cached by the first storage node, and send the request carrying that identifier to the second storage node.

The first read data request triggers the second storage node, after receiving it from the storage node, to compare the carried aggregate identifier with the identifier of its own currently cached aggregate and, when they are the same, to read the first data from the first memory and send the storage node or the terminal a first read data response to the first read data request, carrying the read first data.
In some possible implementations, one of the m storage nodes is the management node.

The communication unit is further configured to, when the first memory's state changes from a first state to a second state, send the management node a memory state change report indicating that the first memory is in the second state; the first state and the second state differ, and each is one of the following states: offline state, data reconstruction state, and trusted access state. The memory state change report triggers the management node, after receiving it from the first storage node, to update the first memory's state recorded in the management node's cached state information aggregate to the second state, update that aggregate's version number, and send the updated aggregate to the storage nodes among the m storage nodes other than itself.

The processing control unit is configured to update the currently cached state information aggregate with the aggregate from the management node.
In a fourth aspect, an embodiment of this application provides a data access method applied to a first storage node located in a distributed storage system that includes m storage nodes, where each storage node includes at least one memory, each memory contains a non-volatile storage medium, and m is an integer greater than 1.

The method includes: the first storage node receives a first read data request (from a terminal) carrying first location information that describes the position of first data within the data body it belongs to; when the first storage node determines, based on the first location information, that the memory where the first data resides is a first memory belonging to the first storage node, and the state information aggregate currently records the first memory as being in the trusted access state, the first storage node reads the first data from the first memory and sends a first read data response to the first read data request, carrying the read first data.
In a fifth aspect, an embodiment of this application provides a storage node, which is one of the m storage nodes included in a distributed storage system, where each storage node includes at least one memory, each memory contains a non-volatile storage medium, and m is an integer greater than 1.

The storage node includes a processor and a communication interface coupled to each other; the processor is configured to perform some or all of the steps of the methods performed by the first storage node (or another storage node) in the above aspects.

For example, the processor is configured to receive, via the communication interface, a first read data request carrying first location information that describes the position of first data within the data body it belongs to; when the memory where the first data resides is determined, based on the first location information, to be a first memory belonging to the first storage node, and the state information aggregate currently records the first memory as being in the trusted access state, to read the first data from the first memory and send, via the communication interface, a first read data response to the first read data request, carrying the read first data.
In a sixth aspect, an embodiment of this application provides a computer readable storage medium that stores program code. The program code includes instructions for performing some or all of the steps of the method performed by any one of the storage nodes (for example the first storage node or the second storage node) in the first aspect.

In a seventh aspect, an embodiment of this application provides a computer program product including instructions that, when run on a computer (for example a storage node), cause the computer to perform some or all of the steps of the methods performed by any one of the storage nodes (for example the first storage node or the second storage node) in the above aspects.

In an eighth aspect, an embodiment of this application provides a storage node including a processor, a communication interface, and a memory coupled to each other; the processor is configured to perform some or all of the steps of the methods performed by any one of the storage nodes in the above aspects.

In a ninth aspect, an embodiment of this application further provides a service system, which may include a distributed storage service system and a terminal with a communication connection between them; the distributed storage service system is any of the distributed storage service systems provided by the embodiments of this application.
Brief Description of the Drawings

Fig. 1-A and Fig. 1-B are architectural diagrams of some example distributed storage systems according to embodiments of this application;

Fig. 1-C is a diagram of example memory state transitions according to an embodiment of this application;

Fig. 1-D is a diagram of an example organization of a state information aggregate according to an embodiment of this application;

Fig. 1-E and Fig. 1-F are diagrams of example stripe organizations according to embodiments of this application;

Fig. 2-A and Fig. 2-B are diagrams of example file organizations according to embodiments of this application;

Fig. 3-A is a flowchart of a state update method according to an embodiment of this application;

Fig. 3-B and Fig. 3-C are diagrams of example organizations of state information aggregates according to embodiments of this application;

Fig. 4-A is a flowchart of another state update method according to an embodiment of this application;

Fig. 4-B and Fig. 4-C are diagrams of example organizations of state information aggregates according to embodiments of this application;

Fig. 5-A is a flowchart of another data access method according to an embodiment of this application;

Fig. 5-B is a diagram of an example organization of a reconstruction log according to an embodiment of this application;

Fig. 6 is a flowchart of another data access method according to an embodiment of this application;

Fig. 7-A is a flowchart of a data reconstruction method according to an embodiment of this application;

Fig. 7-B and Fig. 7-C are diagrams of example organizations of state information aggregates according to embodiments of this application;

Fig. 8 is a flowchart of another data access method according to an embodiment of this application;

Fig. 9 is a flowchart of another data access method according to an embodiment of this application;

Fig. 10 is a flowchart of another data access method according to an embodiment of this application;

Fig. 11 is a diagram of a distributed storage system according to an embodiment of this application;

Fig. 12 is a diagram of a storage node according to an embodiment of this application;

Fig. 13 is a diagram of another storage node according to an embodiment of this application.
Detailed Description

The terms "include" and "have" in the specification and claims of this application and in the above drawings, and any variants of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or other steps or units inherent to such processes, methods, products, or devices. In addition, the terms "first", "second", "third", "fourth", and so on are used to distinguish different objects rather than to describe a particular order.
Some possible architectures of distributed storage systems are first introduced below with reference to the drawings.

Referring to Fig. 1-A and Fig. 1-B, some technical solutions of embodiments of this application may be implemented based on distributed storage systems with the example architectures shown in Fig. 1-A or Fig. 1-B, or variants of them. A distributed storage system includes multiple storage nodes connected by an interconnection network (in some scenarios a storage node may also be called a "storage service node"). A terminal may access the distributed storage system over a network, for example for reads and writes.

It can be understood that the terminal in embodiments of this application may take the product form of, for example, a mobile internet device, notebook computer, server, tablet computer, palmtop computer, desktop computer, mobile phone, or any other terminal device capable of issuing data access requests such as read data requests or write data requests. In the embodiments of this application the terminal may also be replaced by a host.

A storage node includes one or more memories. A memory in embodiments of this application contains a non-volatile storage medium, and a memory containing a non-volatile storage medium may also be called a non-volatile memory. In this embodiment a memory may be, for example, a hard disk, rewritable optical disc, solid state drive, or other storage medium; Fig. 1-A and Fig. 1-B take hard disks as the example.

A storage node may further include a network interface card, main memory, a processor, an expansion card, and the like. The processor interconnects with the memories through the expansion card, for example. A storage node may interconnect with other storage nodes or other external devices through its network interface card, which can be regarded as a communication interface of the storage node.

The processor may contain several functional units, which may be hardware (for example a processor/field programmable gate array (FPGA) circuit, or a combination of a processor/FPGA circuit and other auxiliary hardware) or software. For example, in the architecture of Fig. 1-B, the processor may run functions such as a client, a data service (DS) unit, and a state manager (monitor). While these functions run, they can be regarded as modules in the processor; while they are not running, the programs implementing them may reside in main memory, in which case the modules can be regarded as residing in main memory. In general, these functions reside in each storage node, specifically in the combination of processor and main memory in each storage node; for convenience of description, Fig. 1-B draws these modules inside the processor. In some special cases, for example in a system on chip (SoC), the processor itself has storage capability and program code can be burned directly into the processor, in which case main memory is no longer a required component.

The client is a system input output (IO) entry point of its storage node; it mainly receives access requests (also called IO requests, for example read data requests and write data requests) sent by other devices (for example terminals or other storage nodes). The DS unit mainly receives access requests from the client, performs read and write operations on local memories based on them, and returns the results of those operations to the client. The monitor may be responsible for monitoring the states of the memories included in its storage node, and of course may also monitor the state of the storage node itself.
For example, a monitor may be deployed on each storage node of the distributed storage system; the monitor on each storage node then monitors information such as the states of that node's memories. Further, the monitors of all storage nodes may be connected together to form a monitor cluster, and one "master monitor" is elected from the cluster, either designated by an administrator or chosen by an election algorithm, the other monitors being "slave monitors". The storage node hosting the master monitor may be called the "management node". For example, each slave monitor may send the master monitor the current states of the memories of the storage node it resides on, and the master monitor may generate a state information aggregate based on the collected states of the memories of the storage nodes. The state information aggregate may, for example, be called a "nodemap" or "state file" for short; it records information such as the state of each memory. Further, a slave monitor may, for example, also send the master monitor the current state of its own storage node, and the state information aggregate may additionally record the state of each storage node. As noted earlier, the monitor is a function the processor implements by running a program in main memory; in general, the monitor can be regarded as a function of the processor plus main memory, or as a function of the storage node. Therefore, in what follows, the functions of the master monitor may be performed by the master storage node (or its processor plus main memory), and the functions of a slave monitor by a slave storage node (or its processor plus main memory).

The master monitor may actively push the latest state information aggregate to the other storage nodes, or the other storage nodes may actively request the latest aggregate from the master monitor. The monitor on another storage node (a slave monitor) may cache the received aggregate in its main memory; this cached aggregate is the aggregate maintained locally by the corresponding storage node (the corresponding monitor). Whenever a slave monitor receives a state information aggregate from the master monitor, it uses the newly received aggregate to update its currently cached one; that is, a slave monitor keeps its cached aggregate equal to the latest one it has received.

It can be understood that when a slave monitor is connected to the master monitor, its currently cached aggregate is normally the latest one published by the master monitor; when it is not connected, it may miss the master monitor's latest published aggregate, so its cached aggregate may not be the latest. Of course, if the master monitor published no new aggregate while the connection was down, the slave monitor's cached aggregate may still be the latest one published. Devices in the connected state can communicate directly or indirectly. For this reason, a storage node connected to the management node can obtain the latest version of the state information aggregate from the management node.
The states (access states) of a memory include the online state and the offline state. The online state comprises the data reconstruction state and the trusted access state. The data reconstruction state may also be called the catchup state; the trusted access state may also be called the normal state.

It can be understood that when data reconstruction is needed, a memory may enter the trusted access state after reconstruction completes, and in some cases where no reconstruction is needed a memory may, for example, move directly from the offline state to the trusted access state. In general, both the offline state and the data reconstruction state can be regarded as non-trusted access states.

A memory's state reflects the validity (that is, the trustworthiness) of its data. A memory in the trusted access state means that all data assigned to it was successfully written to it; in other words, a memory into which all its assigned data has been written is in the trusted access state, and the data on such a memory can be considered valid. When a memory is in the data reconstruction state, it is reconstructing the data lost during its offline period, so its data is likely not valid; that is, the memory is in a non-trusted access state. When a memory is offline, its data is not considered valid either; it is likewise in a non-trusted access state. It can be understood that an offline memory cannot be accessed: data can be neither read from it nor written to it.

Referring to Fig. 1-C, which illustrates the possible transitions between memory states: a memory can move from the online state to the offline state, and from the offline state to the online state. Specifically, for example, a memory can change from the offline state to the data reconstruction state or the trusted access state, from the data reconstruction state to the trusted access state, and from either the data reconstruction state or the trusted access state to the offline state.
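The transitions of Fig. 1-C can be captured by a small validity table (a sketch; the state names follow the text, the table and function are illustrative):

```python
# Allowed memory-state transitions per Fig. 1-C:
# offline -> catchup (data reconstruction) or normal (trusted access);
# catchup -> normal or offline; normal -> offline.
ALLOWED = {
    "offline": {"catchup", "normal"},
    "catchup": {"normal", "offline"},
    "normal": {"offline"},
}

def can_transition(old: str, new: str) -> bool:
    """True if a memory may move from state `old` to state `new`."""
    return new in ALLOWED.get(old, set())
```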
A memory going offline may have multiple causes. For example, when a storage node cannot detect one of its memories, or when it receives a fault state report from a memory (the report indicating that the memory is offline), the storage node may set the corresponding memory's state to offline. As another example, when the management node (master monitor) fails to receive a storage node's heartbeat messages several consecutive times, indicating that the storage node is offline, the management node may set all of that storage node's memories to the offline state.

The state information aggregate may record each memory's memory ID and state, and may further record the node ID and state of the storage node each memory belongs to. Referring to Fig. 1-D, which shows one possible form of the aggregate: it includes the aggregate's version number and the identifier and state of each memory (with hard disks as the example memories in the figure; states such as offline, data reconstruction, or trusted access).

Because memory states may change over time, the state information aggregate is updated accordingly. For example, when the management node (master monitor) receives a memory state change report (a memory's state changed, for example the memory was unplugged and powered off or powered back on) or determines that a storage node is offline (for example, no heartbeat message has been received from it within a certain period), the management node considers the aggregate's update condition satisfied and updates the aggregate; the updated aggregate corresponds to a new version number and records the changed memory state. The management node pushes the updated aggregate to the other storage nodes (slave monitors). The aggregate's version number may be generated by the master monitor from a global variable; for example, each update of the aggregate increments the version number by 1 or 2 on its current basis. Different version numbers therefore denote different aggregates, and comparing the version numbers of two aggregates determines whether they are the same.
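The update rule just described (a new version number on every aggregate update, with version comparison deciding whether two aggregates are identical) can be sketched as follows; the dict layout is an assumption for illustration:

```python
def update_aggregate(aggregate: dict, memory_id: str, new_state: str) -> dict:
    """Return a new aggregate recording `memory_id`'s changed state,
    with the aggregate's own version number incremented by 1."""
    updated = dict(aggregate)
    updated["states"] = dict(aggregate["states"], **{memory_id: new_state})
    updated["version"] = aggregate["version"] + 1
    return updated

def same_aggregate(a: dict, b: dict) -> bool:
    """Two aggregates are considered identical iff their version numbers match."""
    return a["version"] == b["version"]
```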
The memories of a distributed storage system may store various data bodies (a data body is data that can be stored by a memory, for example a file, object, record, or data in another format). Generally, for reasons such as system reliability, different strips of the same data body are stored on different memories. Performing redundancy-check computations (such as EC computations) on data strips yields parity strips. A parity redundancy configuration (for example an EC configuration) describes the ratio of data strips to parity strips in a stripe. For example, a file's parity redundancy configuration may be written A:T, where A is the number of data strips in a stripe and T (the redundancy) is the number of parity strips. Suppose the redundancy-check algorithm is the EC algorithm: with a parity redundancy configuration (EC configuration) of 2:1, a stripe (EC stripe) has 2 data strips and 1 parity strip, and these strips are stored on different memories, so 3 memories are needed to store the stripe's 2 data strips and 1 parity strip. With a configuration of 16:4, a stripe has 16 data strips and 4 parity strips, so 20 memories are needed to store the stripe's 20 strips; other cases follow by analogy. Fig. 1-E shows a stripe with 2 data strips and 1 parity strip; Fig. 1-F shows a stripe with multiple data strips and multiple parity strips.

Further, one distributed storage system may support one or more parity redundancy configurations, each corresponding to one or more parity member groups, where each parity member group includes multiple memories. The memories of different parity member groups may be completely different, or may partially overlap; that is, each memory may serve one or more parity member groups. For example, with a configuration of 2:1, each corresponding parity member group includes 3 (2+1) memories; with 4:2, each group includes 6 (4+2) memories; with 16:4, each group includes 20 (16+4) memories; and so on.

In addition, when some files are very long (for example longer than 1 GB), a file may be divided into several file segments for ease of management, for example segments no larger than 500 MB or 1 GB. Fig. 2-A is a diagram of dividing a file into several file segments. Each file segment of a file may have the same or a different parity redundancy configuration, and the parity member groups of different file segments may also be the same or different; in this case a file segment can be treated as a file. By contrast, Fig. 2-B is a diagram of a file not divided into file segments.
Some related methods of this application are described below through several embodiments.

Embodiment 1

A state update method applied to a distributed storage system is first introduced with reference to the drawings. For example, the distributed storage system includes m storage nodes, which may include storage nodes Nd2, Nd1, Nd3, ..., Nd4, with storage node Nd1 as the management node; implementations with more or fewer storage nodes follow by analogy. In the scheme exemplified below it is mainly a non-management node that actively triggers the management node to update and publish the state information aggregate.

Referring to Fig. 3-A, a flowchart of a state update method provided by an embodiment of this application, the method may include:

301. When the state of memory Nd2-D1 in storage node Nd2 changes, storage node Nd2 sends the management node Nd1 a memory state change report Report1, which indicates the latest, changed state of memory Nd2-D1.

Report1 may carry the memory identifier and a state identifier of memory Nd2-D1, among other information. For example, when the changed state of Nd2-D1 is the offline state, the state identifier carried in Report1 indicates the offline state; when it is the data reconstruction state, the state identifier indicates the data reconstruction state; when it is the trusted access state, the state identifier indicates the trusted access state; and so on.
302. The management node Nd1 receives the memory state change report Report1 from storage node Nd2 and updates its currently cached state information aggregate based on Report1.

Specifically, for example, Nd1 updates the state of memory Nd2-D1 recorded in its currently cached aggregate to the state indicated by Report1, and updates that aggregate's version number (that is, the aggregate's own version number).

For example, when a monitor is deployed in each storage node, the slave monitor in storage node Nd2 sends Report1 to Nd1, and correspondingly the master monitor in management node Nd1 receives Report1 from storage node Nd2.

The aggregate before Nd1's update is shown, for example, in Fig. 3-B, and the aggregate after the update in Fig. 3-C; practical applications are of course not limited to these example forms.

303. The management node Nd1 sends the updated, latest state information aggregate to the other storage nodes.

Accordingly, storage nodes Nd2, Nd3, Nd4, and so on each receive the updated aggregate from Nd1 and use it to update their own currently cached aggregates. This helps keep the aggregates cached by the storage nodes (Nd1, Nd2, Nd3, Nd4, and so on) relatively synchronized.

The above scheme has a non-management node actively trigger the management node to update and publish the state information aggregate: when the state of a memory in a storage node changes, that storage node actively reports a memory state change report to the management node, which then promptly updates and publishes the aggregate based on the report. This mechanism helps keep the storage nodes of the distributed storage system synchronized on the aggregate, lets each storage node know the states of the system's memories relatively accurately, and lays a foundation for subsequent state-based read and write operations.
实施例二
下面结合附图举例介绍另一种状态更新方法,其中,状态更新方法应用于分布式存储系统。例如分布式存储系统可包括m个存储节点,其中,m个存储节点例如可包括存储节点Nd2、Nd1、Nd3,……,Nd4,例如存储节点Nd1为管理节点,分布式存储系统包括更多数量或更少数量的存储节点的情况下的实施方式可以此类推。下面举例的方案主要由管理节点基于对其他存储节点的心跳监控结果来更新和发布状态信息集合体。
请参见图4-A,图4-A是本申请的另一个实施例所提供的另一种状态更新方法的流程示意图。参见图4-A,本申请的另一个实施例提供的另一种状态更新方法可以包括:
401、管理节点Nd1接收各存储节点周期性发送的心跳消息。
402、当管理节点Nd1在设定时长范围内(其中,设定时长范围例如可为5分钟、10分钟、20分钟、1分钟或其他时长)未接收到存储节点Nd2发送的心跳消息,那么可认定存储节点Nd2处于离线状态。管理节点Nd1在认定存储节点Nd2处于离线状态的情况下,更新管理节点Nd1当前缓存的状态信息集合体。
具体的,管理节点Nd1将其当前缓存的状态信息集合体中记录的存储节点Nd2所包括的所有存储器的状态均更新为离线状态,并且将这个状态信息集合体中的集合体版本号更新。
举例来说,在各存储节点中都部署了monitor的情况下,那么具体可由管理节点Nd1中的主monitor接收各存储节点中的从monitor周期性发送的心跳消息。
例如存储节点Nd2包括存储器Nd2-D1、Nd2-D2和Nd2-D3。那么管理节点Nd1更新前的状态信息集合体例如图4-B举例所示,而管理节点Nd1更新后的状态信息集合体例如图4-C举例所示,当然实际应用中并不限于这样的举例形式。
403、管理节点Nd1向其它的存储节点发送更新后的最新的状态信息集合体。
相应的,所述存储节点Nd3和Nd4等分别接收来自管理节点Nd1的更新后的状态信息集合体。所述存储节点Nd3和Nd4分别利用来自所述管理节点Nd1的状态信息集合体更新掉各自当前已缓存的状态信息集合体。这样管理节点Nd1与其他各存储节点(例如Nd3和Nd4)所缓存的状态信息集合体就可保持相对同步,而存储节点Nd2由于处于离线状态,因此就无法接收到管理节点Nd1发布的最新状态信息集合体,因此存储节点Nd2在离线期间就无法保持与管理节点之间的状态信息集合体同步。当存储节点Nd2重新上线之后,存储节点Nd2可以主动向管理节点请求最新的状态信息集合体来对其缓存的状态信息集合体进行更新,当然存储节点Nd2也可被动等待管理节点再次发布最新的状态信息集合体,利用管理节点再次发布的最新状态信息集合体来对其缓存的状态信息集合体进行更新。
上述方案提供了一种主要由管理节点基于对其他存储节点的心跳监控结果来更新和发布状态信息集合体的机制,这种机制有利于尽量使得分布式存储系统中的各存储节点同步状态信息集合体,进而使得各存储节点可以相对准确的知晓分布式存储系统中的各存储器的状态,为后续基于存储器的状态进行读写操作奠定了一定基础。
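步骤401~402中基于心跳超时判定节点离线并更新集合体的逻辑,可用如下草图示意(仅为假设性示意:函数名、入参结构与以秒计的设定时长均为举例,并非本申请限定的实现):

```python
# 示意:管理节点在设定时长内未收到某存储节点的心跳,则把该节点下
# 所有存储器的状态置为离线,并更新状态信息集合体的版本号。

TIMEOUT = 300  # 设定时长范围,例如5分钟(单位:秒)

def check_heartbeats(now, last_seen, node_disks, states, version):
    """last_seen: 节点 -> 最近一次心跳时刻; node_disks: 节点 -> 其包括的存储器列表"""
    for node, t in last_seen.items():
        if now - t > TIMEOUT:                 # 设定时长内未收到心跳,认定离线
            for disk in node_disks[node]:
                states[disk] = "offline"      # 该节点所有存储器均置为离线状态
            version += 1                      # 集合体版本号更新
    return states, version
```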
可以理解,在一些可能实施方式中,图3-A和图4-A所描述的状态更新方法可能均被引用于同一分布式存储系统。也就是说,管理节点基于对其他存储节点的心跳监控结果来更新和发布状态信息集合体,同时,非管理节点也可主动的触发管理节点更新和发布状态信息集合体。
实施例三
下面再结合附图举例介绍一种数据访问方法,主要针对终端向分布式存储系统写入数据的一种可能场景,这种数据访问方法应用于分布式存储系统。例如分布式存储系统包括m个存储节点,m个存储节点包括存储节点Nd1和Nd2。存储器Nd2-D1、存储器Nd2-D2和存储器Nd2-D3均位于存储节点Nd2之中。其中,本实施例中主要是以相应校验冗余配置具体为2:1为例来说明的,即分条包括2个数据条带和1个校验条带,其他校验冗余配置的情况可以此类推。
参见图5-A,图5-A是本申请一实施例提供的一种数据访问方法的流程示意图。如图5-A举例所示,本申请一实施例提供的一种数据访问方法可以包括:
501、当终端需向分布式存储系统写入数据,终端向分布式存储系统发送携带有待写入数据的写数据请求,为了下面便于引述,本实施例中把这个写数据请求命名为Wq3。本步骤中,存储节点Nd1接收来自终端的写数据请求Wq3。
写数据请求Wq3可携带数据Data1(数据Data1为待写入数据)、数据Data1所属文件的文件标识和数据Data1的位置信息。其中,数据Data1的位置信息用于描述数据Data1在数据Data1所属文件中的位置。数据Data1的位置信息例如包括数据Data1的文件偏移地址(数据Data1的文件偏移地址表示了数据Data1在文件中的起始位置)和数据Data1的长度等。
502、存储节点Nd1将数据Data1切分为2个数据条带(为了下面便于引述,本实施例中把这两个数据条带命名为数据条带Pd1和数据条带Pd2)。存储节点Nd1利用所述2个数据条带计算得到1个校验条带(为了下面便于引述,本实施例之中把这个校验条带命名为校验条带Pj1),所述1个校验条带和2个数据条带形成包括3个条带的1个分条。例如数据条带Pd1和数据条带Pd2的长度可相等或不等。
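步骤502中2:1配置下的切分与校验条带计算,可用如下草图示意(此处以按字节异或作为校验运算的极简替代,实际系统通常采用EC纠删码等校验算法;函数名与末尾补零对齐的做法均为举例假设):

```python
# 示意:把待写数据切分为2个数据条带,并计算1个校验条带,形成包括3个条带的分条。

def make_stripes(data: bytes):
    half = (len(data) + 1) // 2
    pd1 = data[:half]                                  # 数据条带Pd1
    pd2 = data[half:].ljust(half, b"\x00")             # 数据条带Pd2(长度不等时补零便于异或)
    pj1 = bytes(a ^ b for a, b in zip(pd1, pd2))       # 校验条带Pj1
    return pd1, pd2, pj1
```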
503、存储节点Nd1基于数据Data1的位置信息确定写数据Data1所涉及的存储器。
例如存储节点Nd1确定出写数据Data1涉及的存储器为存储器Nd2-D1、存储器Nd2-D2和存储器Nd2-D3。
一般来说,在为文件数据分配存储器时,通常是基于文件数据在文件中的位置来确定这段文件数据将具体分配给哪些存储器存储。因此,基于数据Data1的位置信息可确定写数据Data1所涉及的存储器。
上述3个存储器(Nd2-D1、Nd2-D2和Nd2-D3)与所述3个条带之间一一对应。具体例如校验条带Pj1对应存储器Nd2-D3,数据条带Pd1对应存储器Nd2-D1,数据条带Pd2对应存储器Nd2-D2。
504、存储节点Nd1确定其本地所缓存的状态信息集合体中当前所记录的所述存储器Nd2-D1、存储器Nd2-D2和存储器Nd2-D3的状态。
其中,下面以存储节点Nd1本地缓存的状态信息集合体中当前记录存储器Nd2-D2和Nd2-D3处于在线状态(信任访问状态或数据重构状态),而存储器Nd2-D1处于离线状态的情况为例来进行说明。
505、当确定出存储器Nd2-D2和存储器Nd2-D3处于在线状态且所述存储器Nd2-D1处于离线状态,那么,存储节点Nd2将数据条带Pd2写入存储器Nd2-D2,存储节点Nd2将校验条带Pj1写入存储器Nd2-D3。
本步骤中,之所以在存在离线存储节点情况下,仍然对余下的非离线存储节点正常写入数据,是由于:按照EC校验算法,即使有一定数量的条带没有成功写入,也可以通过写入成功的条带把写入不成功的条带重构恢复出来,并不会造成数据丢失。只要不成功写入的条带的数量不超过校验条带的数量,EC校验算法就可以正常运算。当然,对于其他校验算法也是类似的。
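上段"只要写入失败的条带数不超过校验条带数即可恢复"的性质,在2:1异或校验的简化模型下可直观验证如下(仅为示意,非本申请限定的校验算法):

```python
# 示意:对于按字节异或得到的校验条带,分条中任意一个条带
# 都等于其余两个条带的按字节异或,因此缺失一个条带仍可恢复。

def recover(stripe_a: bytes, stripe_b: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(stripe_a, stripe_b))
```

例如,数据条带Pd1写入失败时,可由写入成功的数据条带Pd2与校验条带Pj1恢复出Pd1。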
506、由于存储节点Nd2中的存储器Nd2-D1处于离线状态,数据条带Pd1写入存储器Nd2-D1失败,因此存储节点Nd1或存储节点Nd2生成重构日志log1。
其中,重构日志log1中记录写入失败的条带(数据条带Pd1)的条带标识、写入失败的存储器(存储器Nd2-D1)的存储器标识。重构日志log1中还可记录写入失败的条带(数据条带Pd1)所属分条的分条标识等。所述重构日志log1用于后续当存储器Nd2-D1重新上线之后重构数据条带Pd1。图5-B举例示出了一种重构日志log1的可能表现形式,当然也可能采用其他表现形式。
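重构日志log1所记录的内容,可用如下数据结构草图示意(字段名为本文举例所设的假设性名称,实际表现形式不限于此):

```python
# 示意:重构日志的一种可能数据结构,记录写入失败的条带标识、
# 写入失败的存储器标识以及该条带所属分条的分条标识。
from dataclasses import dataclass

@dataclass
class RebuildLog:
    stripe_unit_id: str   # 写入失败的条带的条带标识,例如数据条带Pd1
    disk_id: str          # 写入失败的存储器的存储器标识,例如Nd2-D1
    stripe_id: str        # 该条带所属分条的分条标识(假设性标识)

log1 = RebuildLog("Pd1", "Nd2-D1", "stripe-1")
```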
可以理解,如果存储器Nd2-D1也处于信任访问状态,那么存储节点Nd2可直接将数据条带Pd1写入存储器Nd2-D1。在这种情况下,存储节点Nd1或存储节点Nd2就无需执行生成重构日志log1的相关步骤了。
上述方案主要针对的写数据场景,是分条中的各条带所对应存储器均位于同一个存储节点的一种可能场景。在这种可能场景下,由于在分布式存储系统中引入了用于记录存储器状态的状态信息集合体,故而可以利用状态信息集合体来记录管理分布式存储系统的存储器的访问状态。对于处于在线状态的存储器,可直接向其中写入数据;对于处于离线状态的存储器,无需在尝试写入失败后再生成相关重构日志,而可以直接生成相关重构日志。在本实施例例举的场景下,现有技术是先尝试数据写入,在尝试写入失败之后再生成相关重构日志。本实施例方案可降低无效尝试,进而提升数据写入效率,有利于提升系统性能。
实施例四
下面再结合附图举例介绍另一种数据访问方法,主要针对终端向分布式存储系统写入数据的另一可能场景。在上一个实施例中,条带涉及的存储器位于同一个存储节点;在本实施例的场景中,这些存储器位于不同存储节点。
例如分布式存储系统包括m个存储节点,m个存储节点包括存储节点Nd2、Nd1、Nd3和Nd4等。存储器Nd2-D1属于存储节点Nd2,存储器Nd3-D1属于存储节点Nd3,存储器Nd4-D1属于存储节点Nd4。本实施例中主要以校验冗余配置具体为2:1为例,即分条包括2个数据条带和1个校验条带,其他校验冗余配置的情况可以此类推。
请参见图6,图6是本申请另一实施例提供的另一种数据访问方法的流程示意图。如图6举例所示,本申请另一个实施例提供的另一种数据访问方法可以包括:
601~602,其中,步骤601~602与步骤501~502相同,因此,相关描述可以相应参考步骤501~502,此处不再赘述。
603、存储节点Nd1基于数据Data1的位置信息确定写数据Data1所涉及的存储器。
例如存储节点Nd1确定出写数据Data1涉及的存储器为存储器Nd2-D1、存储器Nd3-D1和存储器Nd4-D1。其中,上述3个存储器(Nd2-D1、Nd3-D1和Nd4-D1)与所述3个条带之间一一对应。具体例如,校验条带Pj1对应存储器Nd4-D1,数据条带Pd1对应存储器Nd2-D1,数据条带Pd2对应存储器Nd3-D1。
604、存储节点Nd1确定其本地所缓存的状态信息集合体之中当前所记录的存储器Nd2-D1、存储器Nd3-D1和存储器Nd4-D1的状态。其中,下面以存储节点Nd1本地缓存的状态信息集合体中当前记录的是:存储器Nd3-D1和Nd4-D1处于在线状态,而存储器Nd2-D1处于离线状态的情况为例来进行说明。
具体的,存储节点Nd1可在其与管理节点之间处于连接态的情况下,确定其本地所缓存的状态信息集合体之中当前所记录的存储器Nd2-D1、存储器Nd3-D1和存储器Nd4-D1的状态。
605、当确定出存储器Nd3-D1和存储器Nd4-D1处于在线状态且所述存储器Nd2-D1处于离线状态,存储节点Nd1向所述存储器Nd3-D1所属的存储节点Nd3发送写数据请求Wq1,并且,存储节点Nd1向所述存储器Nd4-D1所属的存储节点Nd4发送写数据请求Wq2。
例如写数据请求Wq1携带数据条带Pd2和存储节点Nd1当前本地缓存的状态信息集合体的版本号。进一步的,写数据请求Wq1还可以携带数据条带Pd2的长度(长度例如可为110Kb)、数据条带Pd2的文件偏移地址、数据条带Pd2所属分条的分条标识、数据条带Pd2的条带标识、数据条带Pd2所属文件的文件标识等等。
又例如,写数据请求Wq2可携带校验条带Pj1和存储节点Nd1当前本地缓存的状态信息集合体的版本号。进一步的,写数据请求Wq2还可携带所述校验条带Pj1的长度、校验条带Pj1所属分条的分条标识、校验条带Pj1的条带标识等等。
可以理解,存储节点Nd1将自身缓存的状态信息集合体的版本号携带在发送给其他存储节点的写数据请求中,可以便于其他存储节点通过版本号比较来验证存储节点Nd1缓存的状态信息集合体是否有效。
606、存储节点Nd3在接收到来自存储节点Nd1的写数据请求Wq1之后,将写数据请求Wq1携带的状态信息集合体的版本号,与存储节点Nd3自身当前缓存的状态信息集合体的版本号进行比较。
在比较出写数据请求Wq1携带的状态信息集合体的版本号与存储节点Nd3自身当前缓存的状态信息集合体的版本号相同,并且存储节点Nd3与管理节点之间处于连接态的情况下(即存储节点Nd3中的从monitor与管理节点中的主monitor连接正常),可认为存储节点Nd3中的从monitor可以正常及时地接收到管理节点中的主monitor所发布的最新版本的状态信息集合体,因此这种情况下可认为存储节点Nd3当前缓存的状态信息集合体是主monitor维护的最新版本的状态信息集合体。而写数据请求Wq1携带的状态信息集合体的版本号与存储节点Nd3自身当前缓存的状态信息集合体的版本号相同,也就说明,存储节点Nd1当前缓存的状态信息集合体也是主monitor维护的最新版本的状态信息集合体,那么存储节点Nd1当前缓存的状态信息集合体中记录的存储器Nd3-D1的状态(信任访问状态)也应该是符合实际情况的。于是,存储节点Nd3将数据条带Pd2写入存储器Nd3-D1。
此外,在比较出写数据请求Wq1携带的状态信息集合体的版本号与自身当前缓存的状态信息集合体的版本号不同的情况下,可认为Nd1所缓存的Nd3-D1的状态并不准确,换句话说,虽然在Nd1中所记录的Nd3-D1的状态是信任访问状态,但是并不能确保Nd3-D1的真实状态是信任访问状态。由于Nd3-D1可能处于不信任的访问状态,因此存储节点Nd3可拒绝将数据条带Pd2写入存储器Nd3-D1。
特别说明的是,当存储节点与管理节点之间处于连接态,那么这个存储节点缓存的状态信息集合体是管理节点最新发布的状态信息集合体的可能性相对很大。然而,在一些比较特殊的情况下,即使存储节点与管理节点之间处于连接态,如果在管理节点发布状态信息集合体的那个时刻恰好存储节点出现了短暂性故障,或者恰好管理节点与存储节点之间的通信质量极差,那么就可能导致存储节点未能获得管理节点最新发布的状态信息集合体。也就是说,即使存储节点与管理节点之间处于连接态,也有很小的可能出现:存储节点缓存的状态信息集合体不是管理节点最新发布的状态信息集合体。因此,在存储节点与管理节点之间处于连接态的情况下进一步比较各自当前缓存的状态信息集合体的版本号,如果版本号相同,可认为各自当前缓存的状态信息集合体有效。这种做法可以进一步提高可靠性。
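步骤606中接收方结合版本号比较与连接态来决定是否执行写入的判断逻辑,可归纳为如下草图(仅为假设性示意,函数名与参数名均为举例):

```python
# 示意:接收写数据请求的存储节点,比较请求携带的状态信息集合体版本号
# 与本地缓存的版本号,并结合自身与管理节点之间的连接态决定是否写入。

def should_write(req_version: int, local_version: int, connected_to_manager: bool) -> bool:
    # 版本号相同且与管理节点处于连接态:认为双方缓存的集合体均为最新,允许写入
    if connected_to_manager and req_version == local_version:
        return True
    # 版本号不同(或无法确认本地集合体为最新):发起方缓存的集合体可能无效,拒绝写入
    return False
```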
进一步的,存储节点Nd3向存储节点Nd1发送用于响应写数据请求Wq1的写数据响应Wr1,写数据响应Wr1携带数据条带Pd2的写操作结果。具体的,当存储节点Nd3成功将数据条带Pd2写入存储器Nd3-D1,那么写操作结果为写入成功。当存储节点Nd3未成功将数据条带Pd2写入存储器Nd3-D1(例如存储节点Nd3拒绝将数据条带Pd2写入存储器Nd3-D1),那么相应的写操作结果为写入失败。
607、存储节点Nd4在接收到来自存储节点Nd1的写数据请求Wq2之后,进行和步骤606类似的状态信息集合体的版本号判断(区别在于:存储节点Nd3变为存储节点Nd4,写数据请求Wq1变为写数据请求Wq2),以确定是否把校验条带Pj1写入存储器Nd4-D1。步骤607的具体执行过程此处不再赘述。
步骤606和步骤607可同时执行,也可以任意一个先执行。
608、存储节点Nd1接收来自所述存储节点Nd3的所述写数据响应Wr1。所述存储节点Nd1接收来自所述存储节点Nd4的所述写数据响应Wr2。
进一步的,若写数据响应指示相应写入失败,那么存储节点Nd1可以重新发送相应写数据请求来请求重新写入。例如,如果写数据响应Wr1指示相应写入失败,那么存储节点Nd1可以重新向存储节点Nd3发送写数据请求Wq1。
609、由于存储节点Nd2中的存储器Nd2-D1处于离线状态,数据条带Pd1写入存储器Nd2-D1失败,因此存储节点Nd1生成重构日志log1,所述存储节点Nd1向成功写入条带的存储节点(例如所述存储节点Nd3和/或存储节点Nd4)发送重构日志log1。相应的,存储节点Nd3和/或存储节点Nd4可接收并且存储所述重构日志log1,所述重构日志log1用于后续当存储器Nd2-D1重新上线之后重构数据条带Pd1。
其中,重构日志log1中记录写入失败的条带(数据条带Pd1)的条带标识、写入失败的存储器(存储器Nd2-D1)的存储器标识。重构日志log1还可记录写入失败的条带(数据条带Pd1)所属分条的分条标识等。
上述方案中主要针对的写数据场景,是分条中的各条带所对应存储器并未位于同一个存储节点的一种可能场景,在这种可能的场景下,由于在分布式存储系统中,引入了用于记录存储器状态的状态信息集合体,故而可利用状态信息集合体来记录管理分布式存储系统的存储器的访问状态。
具体来说,本实施例介绍了对待写数据的写入方法,待写数据所对应分条中的各条带所对应存储器并未位于同一个存储节点。当存储节点Nd1与管理节点处于连接态,且存储节点Nd1缓存的状态信息集合体中记录存储器Nd3-D1和Nd4-D1在线,那么存储节点Nd1初步认为可向存储器Nd3-D1和Nd4-D1成功写入。为了验证存储节点Nd1缓存的状态信息集合体的有效性,存储节点Nd1将自身缓存的状态信息集合体的版本号携带在发送给其他存储节点(存储节点Nd3和存储节点Nd4)的写数据请求中,其他存储节点将写数据请求携带的状态信息集合体的版本号和自身缓存的状态信息集合体的版本号进行比较。当写数据请求携带的状态信息集合体的版本号与自身当前缓存的状态信息集合体的版本号相同,可认为存储节点Nd1所缓存的状态信息集合体有效,因此这个状态信息集合体中记录的相关存储器的状态应当准确,这个情况下进行相应数据写入通常成功,因此就直接向相关存储器写入数据。反之,可认为存储节点Nd1所缓存的状态信息集合体可能是无效的,因此这个状态信息集合体中记录的相关存储器的状态很可能不准确,这个情况下进行相应数据写入很可能失败,因此这个时候拒绝写入是相对合理的做法。
本实施例和前一实施例相似的效果是:对于处于离线状态的存储器,可直接生成相关重构日志,而无需按照传统方式在尝试写入失败之后再生成,进而有利于提升数据写入效率,有利于提升系统性能。
实施例五
下面再结合相关附图举例介绍一种数据重构方法,这种数据重构方法应用于分布式存储系统。例如分布式存储系统可包括m个存储节点,其中,所述m个存储节点包括存储节点Nd1、Nd2、Nd3和Nd4等。存储节点Nd1为发布状态信息集合体的管理节点(即为主monitor所在的存储节点)。其中,例如存储器Nd2-D1属于存储节点Nd2,存储器Nd3-D1属于存储节点Nd3,存储器Nd4-D1属于存储节点Nd4。这种数据重构方法例如可执行于图6举例所示方案之后。
参见图7-A,图7-A是本申请一实施例提供的一种数据重构方法的流程示意图。如图7-A举例所示,本申请的一实施例提供的一种数据重构方法可包括:
701、当存储节点Nd2确定存储器Nd2-D1上线之后,存储节点Nd2可向管理节点Nd1发送存储器状态变更报告P2,所述存储器状态变更报告P2可指示存储器Nd2-D1的状态由离线变更为数据重构状态。具体例如,存储器状态变更报告P2中可携带存储器Nd2-D1的存储器标识和状态标识(状态标识为数据重构状态的状态标识,即状态标识表示的状态为数据重构状态)。
702、管理节点Nd1接收来自存储节点Nd2的所述存储器状态变更报告P2,所述管理节点Nd1更新缓存的状态信息集合体。具体的,所述管理节点Nd1将其缓存的状态信息集合体中记录的存储器Nd2-D1的状态,更新为存储器状态变更报告P2所指示出的存储器Nd2-D1的状态,并且将这个状态信息集合体自身的版本号更新。
举例来说,在各存储节点中都部署了monitor的情况下,那么具体可以由存储节点Nd2中的从monitor向管理节点Nd1发送存储器状态变更报告P2,相应的可由所述管理节点Nd1中的主monitor接收来自存储节点Nd2的所述存储器状态变更报告P2。
举例来说,管理节点Nd1更新前的状态信息集合体例如图3-C举例所示,管理节点Nd1更新后的状态信息集合体例如图7-B举例所示,当然实际应用中,状态信息集合体并不限于这样的举例形式。
703、管理节点Nd1向其它存储节点发送更新后的状态信息集合体。
相应的,存储节点Nd2、Nd3和Nd4分别接收来自管理节点Nd1的更新后的状态信息集合体,所述存储节点Nd2、Nd3和Nd4用来自所述管理节点Nd1的状态信息集合体更新掉其当前缓存的状态信息集合体。这样,各个存储节点(Nd1、Nd2、Nd3和Nd4)所维护的状态信息集合体就可保持相对同步。
704、存储节点Nd1通知存储节点Nd2针对存储器Nd2-D1进行数据重构。
可以理解,管理节点Nd1可以通过发送更新的状态信息集合体来通知存储节点Nd2针对存储器Nd2-D1进行数据重构。或者,管理节点Nd1可通过其他消息(例如数据重构通知消息)来通知存储节点Nd2针对存储器Nd2-D1进行数据重构。
705、存储节点Nd2在接收到来自管理节点Nd1的针对存储器Nd2-D1进行数据重构的通知后,所述存储节点Nd2向所述存储节点Nd3和/或Nd4收集存储器Nd2-D1离线期间的相关重构日志。
706、存储节点Nd2基于收集到的存储器Nd2-D1离线期间的相关重构日志重构数据。
具体例如,存储节点Nd2收集到了存储器Nd2-D1离线期间的重构日志log1,重构日志log1记录了存储器Nd2-D1离线期间写入失败的条带(数据条带Pd1)的条带标识、写入失败的存储器(存储器Nd2-D1)的存储器标识。并且,所述重构日志log1还可以记录写入失败的条带(数据条带Pd1)所属分条的分条标识等。
其中,存储节点Nd2可以基于数据条带Pd1所属分条的分条标识,确定数据条带Pd1所属分条,进而获取数据条带Pd1所属分条的其它条带(数据条带Pd2和校验条带Pj1),存储节点Nd2基于数据条带Pd2和校验条带Pj1进行冗余校验运算以重构出数据条带Pd1,存储节点Nd2将重构出的数据条带Pd1写入存储器Nd2-D1。
可以理解,当存储节点Nd2收集到存储器Nd2-D1离线期间的多个相关重构日志,那么所述存储节点Nd2可先对收集到的这多个重构日志进行去重处理,而后基于去重后的各个重构日志分别进行数据重构。基于重构日志进行数据重构的方式可如上述举例。
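上段所述"先去重、再逐条重构"的处理,可用如下草图示意(仅为假设性示意:以条带标识与存储器标识作为去重键,字段名为本文举例所设):

```python
# 示意:对收集到的多个重构日志按(条带标识, 存储器标识)去重,
# 避免对同一写入失败的条带重复执行数据重构。

def dedup_logs(logs):
    seen, result = set(), []
    for log in logs:
        key = (log["stripe_unit"], log["disk"])
        if key not in seen:
            seen.add(key)
            result.append(log)
    return result
```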
707、当针对存储器Nd2-D1的数据重构完成后,存储节点Nd2将存储器Nd2-D1的状态设置为信任访问状态。进一步的,存储节点Nd2可向管理节点Nd1发送存储器状态变更报告P3,所述存储器状态变更报告P3可指示存储器Nd2-D1的状态由数据重构状态变更为信任访问状态。
708、管理节点Nd1接收来自存储节点Nd2的所述存储器状态变更报告P3,所述管理节点Nd1更新缓存的状态信息集合体。具体的,所述管理节点Nd1将其缓存的状态信息集合体中记录的存储器Nd2-D1的状态,更新为存储器状态变更报告P3所指示出的存储器Nd2-D1的状态,并且将这个状态信息集合体自身的版本号更新。
举例来说,在各存储节点中都部署了monitor的情况下,那么具体可以由存储节点Nd2中的从monitor向管理节点Nd1发送存储器状态变更报告P3,相应的可由所述管理节点Nd1中的主monitor接收来自存储节点Nd2的所述存储器状态变更报告P3。
例如,管理节点Nd1更新前的状态信息集合体例如图7-B举例所示,管理节点Nd1更新后的状态信息集合体例如图7-C举例所示,当然实际应用中,状态信息集合体并不限于这样的举例形式。
709、管理节点Nd1向其它存储节点(例如Nd2、Nd3和Nd4等)发送更新后的状态信息集合体。
相应的,存储节点Nd2、Nd3和Nd4可分别接收来自管理节点Nd1的更新后的状态信息集合体,存储节点Nd2、Nd3和Nd4可分别用来自管理节点Nd1的状态信息集合体更新掉其当前缓存的状态信息集合体。这样各个存储节点(Nd1、Nd2、Nd3和Nd4)所缓存的状态信息集合体就可尽量保持相对的同步。
上述方案主要针对的是离线存储器重新在线之后的一种数据重构场景,给出在这种可能的场景下的一种可能的数据重构机制。原本离线的存储器重新上线后,其存储节点通知管理节点,管理节点将自己记录的这个存储器的状态更新为"在线重构状态",然后管理节点向其他存储节点发布最新的状态信息集合体,进而通知其他存储节点这个存储器的状态已被更新为"在线重构状态"。接着,相关存储节点把之前未能写入这个存储器的条带进行重构。重构完成后,重新上线的存储器所在存储节点把这个重新上线的存储器的状态更新为"信任访问状态",并通知管理节点把这个重新上线的存储器的状态更新为"信任访问状态",接着,管理节点再次向其他存储节点发布最新的状态信息集合体,进而通知余下的存储节点这个重新上线的存储器的状态被更新为"信任访问状态"。
本实施例引入了用于记录存储器状态的状态信息集合体,故而可利用状态信息集合体来记录管理分布式存储系统的存储器的访问状态,便于分布式存储系统中的存储节点较为准确了解存储器的状态切换(从“离线状态”切换到“在线重构状态”,再从“在线重构状态”切换到“信任访问状态”),进而有利于降低尝试读写的失败几率,进而有利于提升系统性能。
实施例六
下面再结合相关附图举例介绍另一种数据访问方法,主要针对终端从分布式存储系统读取数据的一种可能的场景。这种数据访问方法应用于分布式存储系统,例如分布式存储系统包括m个存储节点,所述m个存储节点包括存储节点Nd2、Nd1、Nd3和Nd4等。存储器Nd2-D1属于所述存储节点Nd2,存储器Nd3-D1属于所述存储节点Nd3,存储器Nd4-D1属于所述存储节点Nd4。本实施例中以数据体的格式为文件为例进行描述。
请参见图8,图8是本申请的一实施例提供的一种数据访问方法的流程示意图。如图8所示,本申请的一实施例提供的一种数据访问方法可包括:
801、当终端需要从分布式存储系统读取数据,那么终端可向分布式存储系统发送读数据请求,为了下面便于引述,本实施例中把这个读数据请求命名为Rq1。本步骤中,例如分布式存储系统中的存储节点Nd1接收来自终端的读数据请求Rq1。
例如,读数据请求Rq1携带数据Data2(数据Data2为待读取数据)所属文件的文件标识和数据Data2的位置信息。其中,数据Data2的位置信息用于描述数据Data2在其所属文件中的位置。数据Data2的位置信息例如可包括数据Data2的文件偏移地址(数据Data2的文件偏移地址表示了数据Data2在文件中的起始位置或结束位置)和数据Data2的长度等。
802、存储节点Nd1基于数据Data2的位置信息确定数据Data2所在存储器。
例如存储节点Nd1确定出数据Data2所在存储器为存储器Nd2-D1。例如:Data2所属文件在存储时被划分为多个数据条带(以及若干校验条带),其中一个数据条带存储在存储器Nd2-D1,而所述数据Data2是这个条带的一部分。
803、存储节点Nd1确定其本地缓存的状态信息集合体之中当前记录的存储器Nd2-D1的状态。其中,本实施例中,存储节点Nd1与管理节点之间可能处于连接态,也可能未处于连接态。
例如,当存储节点Nd1当前缓存的状态信息集合体中记录:存储器Nd2-D1处于信任访问状态,那么执行步骤804。当存储节点Nd1当前维护的状态信息集合体记录:存储器Nd2-D1处于数据重构状态或离线状态,那么执行步骤807。
804、由于确定出存储器Nd2-D1处于信任访问状态,那么所述存储节点Nd1可在读数据请求Rq1中添加存储节点Nd1当前本地缓存的状态信息集合体的版本号以得到更新的读数据请求Rq1,所述存储节点Nd1向所述存储器Nd2-D1所属的存储节点Nd2发送更新的读数据请求Rq1。
也就是说,更新的读数据请求Rq1可携带存储节点Nd1当前本地缓存的状态信息集合体的版本号。进一步的,读数据请求Rq1还可携带数据Data2的位置信息,数据条带Pd2所属文件的文件标识等等。
可以理解,存储节点Nd1将自身缓存的状态信息集合体的版本号携带在发送给其他存储节点的读数据请求中,主要是为便于其他存储节点通过版本号比较来验证存储节点Nd1缓存的状态信息集合体是否有效。
805、存储节点Nd2在接收到来自存储节点Nd1的读数据请求Rq1之后,可将读数据请求Rq1携带的状态信息集合体的版本号,与其自身当前维护的状态信息集合体的版本号进行比较。
其中,在比较出读数据请求Rq1携带的状态信息集合体的版本号与自身当前缓存的状态信息集合体的版本号相同,并且存储节点Nd2与管理节点之间处于连接态的情况下(即存储节点Nd2中的从monitor与管理节点中的主monitor连接正常,在这种情况下,可以认为存储节点Nd2当前缓存的状态信息集合体极大可能是主monitor维护的最新版本的状态信息集合体),存储节点Nd2便可从存储器Nd2-D1读取数据Data2。此外,在比较出读数据请求Rq1携带的状态信息集合体的版本号与自身当前缓存的状态信息集合体的版本号不相同的情况下,存储节点Nd2则可拒绝从存储器Nd2-D1读取数据。此外,若存储节点Nd2与管理节点之间未处于连接态(即存储节点Nd2中的从monitor与管理节点中的主monitor之间连接异常,在这种情况下,存储节点Nd2当前缓存的状态信息集合体可能是、也可能不是主monitor维护的最新版本的状态信息集合体),那么在这种情况下,存储节点Nd2也可拒绝从存储器Nd2-D1读取数据。当然,在某些验证级别较低的场景下,也可选择无论存储节点Nd2与管理节点之间是否处于连接态,只要读数据请求Rq1携带的状态信息集合体的版本号与自身当前缓存的状态信息集合体的版本号相同,存储节点Nd2便可从存储器Nd2-D1读取数据Data2。
当存储节点Nd2成功从存储器Nd2-D1读取到数据Data2,存储节点Nd2向存储节点Nd1发送用于响应读数据请求Rq1的读数据响应Rr1,读数据响应Rr1携带读取到的数据Data2。当存储节点Nd2未成功从存储器Nd2-D1读取到数据Data2(例如存储节点Nd2拒绝从存储器Nd2-D1读取数据),那么读数据响应Rr1可携带读操作结果(读操作结果为读取失败)。
806、存储节点Nd1接收来自所述存储节点Nd2的所述读数据响应Rr1。若存储节点Nd2读取数据Data2成功,则所述存储节点Nd1可从读数据响应Rr1中获得数据Data2。
存储节点Nd1向终端发送用于响应读数据请求Rq1的读数据响应Rr1,读数据响应Rr1携带数据Data2。然后结束本流程。
807、由于确定出存储器Nd2-D1处于非信任访问状态,所述存储节点Nd1确定数据Data2所属的数据条带Pd1,所述存储节点Nd1确定数据条带Pd1所属分条,所述分条包括数据条带Pd1、数据条带Pd2和校验条带Pj1。所述存储节点Nd1确定数据条带Pd2和校验条带Pj1所在的存储器。
例如数据条带Pd2所在存储器表示为存储器Nd3-D1,存储器Nd3-D1所属存储节点表示为存储节点Nd3。例如校验条带Pj1所在存储器表示为存储器Nd4-D1,存储器Nd4-D1所属存储节点表示为存储节点Nd4。
本步骤中,由于待读数据Data2所属条带所在的存储器处于非信任访问状态,因此收到读请求的存储节点Nd1确定出分条中余下条带所在的存储器,以便后续利用余下条带使用校验算法恢复出Data2所在的条带,进而获得Data2。
808、存储节点Nd1向存储器Nd3-D1所属的存储节点Nd3发送读数据请求Rq3。存储节点Nd1向所述存储器Nd4-D1所属的所述存储节点Nd4发送读数据请求Rq4。
例如读数据请求Rq3携带数据条带Pd2的条带标识和存储节点Nd1当前缓存的状态信息集合体的版本号。进一步的,读数据请求Rq3还可携带数据条带Pd2的长度(长度如为110Kb或其他长度)、数据条带Pd2的文件偏移地址、数据条带Pd2所属分条的分条标识、数据条带Pd2所属文件的文件标识,甚至还可能携带数据条带Pd2所属文件片段的片段标识。
例如读数据请求Rq4携带校验条带Pj1的条带标识和存储节点Nd1当前本地缓存的状态信息集合体的版本号。进一步的,读数据请求Rq4还可携带校验条带Pj1的长度和校验条带Pj1所属分条的分条标识等。
809、存储节点Nd3在接收到来自存储节点Nd1的读数据请求Rq3之后,将读数据请求Rq3携带的状态信息集合体的版本号,与自身当前缓存的状态信息集合体的版本号进行比较。在比较出读数据请求Rq3携带的状态信息集合体的版本号与自身当前缓存的状态信息集合体的版本号相同,并且存储节点Nd3与管理节点之间处于连接态的情况下(即存储节点Nd3中的从monitor与管理节点中的主monitor连接正常,这种情况下可认为存储节点Nd3当前缓存的状态信息集合体是主monitor维护的最新版本的状态信息集合体;而读数据请求Rq3携带的状态信息集合体的版本号与存储节点Nd3自身当前缓存的状态信息集合体的版本号相同,也就说明,存储节点Nd1当前缓存的状态信息集合体也是主monitor维护的最新版本的状态信息集合体,那么存储节点Nd1当前缓存的状态信息集合体中记录的存储器Nd3-D1的状态(信任访问状态)也是准确的),存储节点Nd3从存储器Nd3-D1读取数据条带Pd2。此外,在比较出读数据请求Rq3携带的状态信息集合体的版本号与自身当前维护的状态信息集合体的版本号不相同的情况下,存储节点Nd3可拒绝从存储器Nd3-D1读取数据条带Pd2。
进一步,存储节点Nd3向存储节点Nd1发送用于响应读数据请求Rq3的读数据响应Rr3,读数据响应Rr3携带数据条带Pd2的读操作结果。具体的,当存储节点Nd3成功从存储器Nd3-D1读取到数据条带Pd2,那么,读数据响应Rr3携带的读操作结果为读取成功,并且读数据响应Rr3携带数据条带Pd2。当存储节点Nd3未成功从存储器Nd3-D1读取数据条带Pd2(例如存储节点Nd3拒绝从存储器Nd3-D1读取数据条带Pd2),那么读数据响应Rr3携带的读操作结果为读取失败。
810、存储节点Nd4在接收到来自存储节点Nd1的读数据请求Rq4之后,执行和存储节点Nd3类似操作(例如,进行状态信息集合体版本的比较,以及发送读数据响应Rr4给存储节点Nd1)。由于本步骤可以参考步骤809,故此处不做赘述。
811、存储节点Nd1接收来自存储节点Nd3的所述读数据响应Rr3。所述存储节点Nd1接收来自所述存储节点Nd4的所述读数据响应Rr4。
进一步的,若读数据响应指示相应读取失败,那么存储节点Nd1可以重新发送相应读数据请求来请求重新读取。例如,如果读数据响应Rr3指示相应读取失败,那么存储节点Nd1可重新向存储节点Nd3发送读数据请求来重新请求读取数据条带Pd2。又例如,如果读数据响应Rr4指示相应读取操作失败,那么存储节点Nd1可重新发送读数据请求Rq4来重新请求读取校验条带Pj1。以此类推。
812、当存储节点Nd1获得数据条带Pd2和校验条带Pj1,存储节点Nd1利用数据条带Pd2和校验条带Pj1进行校验运算以得到数据条带Pd1,从数据条带Pd1中获得数据Data2,存储节点Nd1向终端发送用于响应读数据请求Rq1的读数据响应Rr1,读数据响应Rr1携带数据Data2。
而当存储节点Nd1从来自所述存储节点Nd2的读数据响应Rr1中获得数据Data2,存储节点Nd1向终端发送用于响应读数据请求Rq1的读数据响应Rr1,读数据响应Rr1携带数据Data2。
可以看出,上述方案中,由于在分布式存储系统中,引入了用于记录存储器状态的状态信息集合体,故而可以利用状态信息集合体来记录管理分布式存储系统的存储器的访问状态。具体引入了接收读数据请求的存储节点,基于这个读数据请求携带的状态信息集合体的版本号和其自身当前缓存的状态信息集合体的版本号之间的比较结果来进行相应读数据决策的机制,通过比较版本号有利于较好确定所使用的状态信息集合体的有效性。对于验证发现处于信任访问状态的存储器,可直接从中读取被请求数据来向终端反馈,而在这种信任访问情况下,由于无需执行传统方式中先读取出相关分条的全部条带来进行正确性校验的繁琐步骤,通过减少数据读取量就有利于降低存储器负担,进而有利于提升系统性能。
本实施例介绍了对待读数据的读取方法,待读数据是条带的一部分或整个条带。如果条带所在的存储节点是可以信任的,那么直接返回待读数据给发出请求的设备(终端或者主机);如果条带所在的存储节点不可信(处于重构状态或者离线状态),那么通过待读数据所在的分条的其他条带(除了待读数据所在条带)进行校验计算,从而获得待读数据所在条带,以便从中获得待读数据。
存储节点Nd1将自身缓存的状态信息集合体的版本号携带在发送给其他存储节点的读数据请求中,主要是为便于其他存储节点通过版本号比较来验证存储节点Nd1缓存的状态信息集合体是否有效。当存储节点Nd2与管理节点连接正常,若读数据请求Rq1携带的状态信息集合体的版本号与存储节点Nd2自身当前缓存的状态信息集合体的版本号相同,那么可认为存储节点Nd1与Nd2缓存了相同的且最新的状态信息集合体,存储节点Nd2便可从存储器Nd2-D1读取数据Data2。此外,当存储节点Nd2与管理节点并未连接正常,存储节点Nd2当前缓存的状态信息集合体可能是、也可能不是管理节点发布的最新版本的状态信息集合体,由于难以确定存储节点Nd2当前缓存的状态信息集合体的有效性,存储节点Nd2便可拒绝从存储器Nd2-D1读取数据Data2。当然,在某些验证级别较低的场景下,也可选择无论存储节点Nd2与管理节点之间是否处于连接态,只要读数据请求Rq1携带的状态信息集合体的版本号与自身当前缓存的状态信息集合体的版本号相同(当两个版本号相同,存储节点Nd1所缓存状态信息集合体有效的可能性也比较大),存储节点Nd2便可从存储器Nd2-D1读取数据Data2。其中,由于引入了状态信息集合体有效性的一些验证机制,因此有利于较好确定所使用的状态信息集合体的有效性。
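本实施例的读取决策可用如下草图概括(仅为假设性示意,沿用前文以按字节异或代替实际校验算法的简化模型,函数名与参数名均为举例):

```python
# 示意:条带所在存储器可信则直接读取;否则用分条中其余两个条带
# 经校验运算(此处为异或的简化模型)恢复出待读数据所在的条带。

def read_stripe(state: str, direct_read, other_stripes):
    if state == "trusted":
        return direct_read()                       # 信任访问状态:直接读取
    a, b = other_stripes                           # 非信任:收集分条中其余条带
    return bytes(x ^ y for x, y in zip(a, b))      # 校验运算恢复目标条带
```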
实施例七
下面再结合附图举例介绍另一种数据访问方法。这种数据访问方法应用于分布式存储系统,例如分布式存储系统包括m个存储节点,其中,m个存储节点包括存储节点Nd1、Nd2、Nd3和Nd4等。存储器Nd2-D1属于存储节点Nd2。本实施例中以数据体为文件为例进行描述。
请参见图9,图9是本申请另一实施例提供的另一种数据访问方法的流程示意图。如图9所示,本申请的另一实施例提供的另一种数据访问方法可包括:
901、当终端需要从分布式存储系统读取数据,那么终端可向分布式存储系统发送读数据请求,为了下面便于引述,本实施例中把这个读数据请求命名为Rq1。本步骤中,例如分布式存储系统中的存储节点Nd2接收来自终端的读数据请求Rq1。
例如步骤901中的读数据请求Rq1携带的相关信息,与步骤801中的读数据请求Rq1携带的相关信息相同。
902、存储节点Nd2基于数据Data2的位置信息确定数据Data2所在存储器。根据位置信息确定一个数据所在的存储器的方案,其他实施例已有描述,所以此处不再描述。
例如存储节点Nd2确定出数据Data2所在存储器为存储器Nd2-D1。
903、存储节点Nd2确定其本地缓存的状态信息集合体之中记录的所述存储器Nd2-D1的状态。
例如,存储节点Nd2当前缓存的状态信息集合体记录存储器Nd2-D1处于信任访问状态。并且,存储节点Nd2与管理节点之间处于连接态(即,存储节点Nd2中的从monitor与管理节点中的主monitor连接正常,这种情况下可认为存储节点Nd2当前缓存的状态信息集合体是主monitor发布的最新版本的状态信息集合体)。
904、由于确定出存储器Nd2-D1处于信任访问状态,那么存储节点Nd2可从存储器Nd2-D1读取数据Data2。存储节点Nd2向终端发送用于响应读数据请求Rq1的读数据响应Rr1,读数据响应Rr1携带数据Data2。
此外,当确定出存储器Nd2-D1处于非信任访问状态,那么可参考图8举例所示场景的实施方式,收集数据Data2所属分条的其他条带,利用其他条带来得到数据Data2。具体实施方式此处不再赘述。
可以看出,上述方案中,由于在分布式存储系统中,引入了用于记录存储器状态的状态信息集合体,故而可以利用状态信息集合体来记录管理分布式存储系统的存储器的访问状态。具体引入了接收读数据请求的存储节点,基于其缓存的发布状态信息集合体中记录的相关存储器的状态,及其与发布状态信息集合体的管理节点之间的连接状态来进行相应读数据决策的机制,这样有利于较好确定所使用的状态信息集合体的有效性。对于验证处于信任访问状态的存储器,可直接从中读取被请求数据来向终端反馈,而在这种信任访问情况下,由于无需执行传统方式中先读取出相关分条的全部条带来进行正确性校验的繁琐步骤,通过减少数据读取量就有利于降低存储器负担,进而有利于提升系统性能。
本方案中,待读数据(数据Data2)位于接收到终端/主机的读数据请求的存储节点(Nd2)的存储器(Nd2-D1),这个存储节点(Nd2)和管理节点处于连接态,那么可以认为这个存储节点(Nd2)其本地缓存的状态信息集合体中记录的存储器状态是相对可靠的。当状态信息集合体中记录的这个存储器(Nd2-D1)是信任访问状态,那么可以直接从中读取待读数据(数据Data2),然后返回给终端/主机。
实施例八
下面再结合附图举例介绍另一种数据访问方法。数据访问方法应用于分布式存储系统,例如分布式存储系统包括m个存储节点,其中,m个存储节点包括存储节点Nd2、Nd1、Nd3和Nd4等。存储器Nd2-D1属于存储节点Nd2,存储器Nd3-D1属于存储节点Nd3,存储器Nd4-D1属于存储节点Nd4。本实施例中以数据体为文件为例进行描述。
参见图10,图10是本申请另一实施例提供的另一种数据访问方法的流程示意图。如图10所示,本申请的另一实施例提供的另一种数据访问方法可包括:
1001~1002,其中,步骤1001~1002与步骤801~802相同,因此,相关描述可以相应参考步骤801~802,此处不再赘述。
1003、当存储节点Nd1与管理节点之间处于连接态,存储节点Nd1确定其本地缓存的状态信息集合体中记录的存储器Nd2-D1的状态。
例如,当存储节点Nd1当前缓存的状态信息集合体记录存储器Nd2-D1处于信任访问状态,那么执行步骤1004。当存储节点Nd1当前维护的状态信息集合体记录存储器Nd2-D1处于数据重构状态或离线状态,那么执行步骤1007。
其中,当存储节点Nd1与管理节点之间未处于连接态,那么存储节点Nd1可将读数据请求Rq1转发给管理节点,或将读数据请求Rq1转发给与管理节点之间处于连接态的其他存储节点,由收到Rq1的其他存储节点确定其本地缓存的状态信息集合体中记录的存储器Nd2-D1的状态,并据此进行后续的相关操作。具体的后续操作可以参照本实施例步骤1003~步骤1011。本实施例中以存储节点Nd1与管理节点之间处于连接态的情况为例进行描述。
1004、由于确定出存储器Nd2-D1处于信任访问状态,那么存储节点Nd1向所述存储器Nd2-D1所属的存储节点Nd2发送读数据请求Rq1。
1005、存储节点Nd2在接收到来自存储节点Nd1的读数据请求Rq1后,存储节点Nd2可从存储器Nd2-D1读取数据Data2。存储节点Nd2进一步向存储节点Nd1发送用于响应读数据 请求Rq1的读数据响应Rr1,其中,读数据响应Rr1携带读取到的数据Data2。
1006、存储节点Nd1接收来自所述存储节点Nd2的所述读数据响应Rr1。若存储节点Nd2读取数据Data2成功,则所述存储节点Nd1可从读数据响应Rr1中获得数据Data2。执行完本步骤后退出流程,不再执行下面的步骤1007及其他步骤。
1007、由于确定出存储器Nd2-D1处于非信任访问状态,存储节点Nd1确定数据Data2所属的数据条带Pd1,存储节点Nd1确定数据条带Pd1所属分条,所述分条包括数据条带Pd1、数据条带Pd2和校验条带Pj1。所述存储节点Nd1确定数据条带Pd2和校验条带Pj1所在的存储器。
例如数据条带Pd2所在存储器表示为存储器Nd3-D1,存储器Nd3-D1所属存储节点表示为存储节点Nd3。例如校验条带Pj1所在存储器表示为存储器Nd4-D1,存储器Nd4-D1所属存储节点表示为存储节点Nd4。存储节点Nd1确定其本地缓存的状态信息集合体中记录的存储器Nd3-D1和Nd4-D1的状态。
下面以存储节点Nd1当前缓存的状态信息集合体记录:存储器Nd3-D1和Nd4-D1处于信任访问状态为例进行描述,当存储节点Nd1当前维护的状态信息集合体记录:存储器Nd3-D1和Nd4-D1处于非信任访问状态,那么可直接反馈数据读取失败,或者在等待设定时长之后,重新查看当前缓存的状态信息集合体记录的存储器Nd2-D1、Nd3-D1和Nd4-D1的状态。
1008、存储节点Nd1向存储器Nd3-D1所属的存储节点Nd3发送读数据请求Rq3。存储节点Nd1向所述存储器Nd4-D1所属的存储节点Nd4发送读数据请求Rq4。
例如读数据请求Rq3携带数据条带Pd2的条带标识。读数据请求Rq3还可以携带数据条带Pd2的长度(例如110Kb)、数据条带Pd2的文件偏移地址、数据条带Pd2所属分条的分条标识、数据条带Pd2所属文件的文件标识,甚至还可能携带数据条带Pd2所属文件片段的片段标识。
例如读数据请求Rq4携带校验条带Pj1的条带标识。进一步的,读数据请求Rq4还可携带校验条带Pj1的长度和校验条带Pj1所属分条的分条标识等。
1009、存储节点Nd3在接收到来自存储节点Nd1的读数据请求Rq3后,存储节点Nd3从存储器Nd3-D1读取数据条带Pd2。
进一步的,存储节点Nd3向存储节点Nd1发送用于响应读数据请求Rq3的读数据响应Rr3,读数据响应Rr3携带数据条带Pd2的读操作结果。具体的,当存储节点Nd3成功从存储器Nd3-D1读取到数据条带Pd2,那么,读数据响应Rr3携带的读操作结果为读取成功,并且读数据响应Rr3携带数据条带Pd2。当存储节点Nd3未成功从存储器Nd3-D1读取数据条带Pd2,那么读数据响应Rr3携带的读操作结果为读取失败。
1010、存储节点Nd4在接收到来自存储节点Nd1的读数据请求Rq4之后,存储节点Nd4从存储器Nd4-D1读取校验条带Pj1。
进一步的,存储节点Nd4向存储节点Nd1发送用于响应读数据请求Rq4的读数据响应Rr4,读数据响应Rr4携带校验条带Pj1的读操作结果。具体的,当存储节点Nd4成功从存储器Nd4-D1读取到校验条带Pj1,那么,读数据响应Rr4携带的读操作结果为读取成功,并且读数据响应Rr4携带校验条带Pj1。当存储节点Nd4未成功从存储器Nd4-D1读取到校验条带Pj1,那么读数据响应Rr4携带的读操作结果为读取失败。
1011~1012、步骤1011~1012与步骤811~812相同,因此,相关描述可以相应参考步骤811~812,此处不再赘述。
本实施例中,收到读请求的是一个存储节点(Nd1),需读取的数据(数据Data2)位于另外一个存储节点(Nd2)的存储器(Nd2-D1)。如果Nd2和管理节点处于连接态,且本地缓存的状态信息集合体中记录的存储器Nd2-D1的状态是信任访问状态,那么可以直接从Nd2-D1读出需要读取的数据。如果Nd2和管理节点不处于连接态,那么判断需要读取的数据所在的分条中,位于其他存储器的条带(数据条带Pd2和校验条带Pj1)所在的存储器(Nd3-D1和Nd4-D1)。如果位于其他存储器的条带(数据条带Pd2和校验条带Pj1)所在的存储器(Nd3-D1和Nd4-D1)的状态是信任访问状态,并且其所属存储节点和管理节点处于连接态,那么就读出位于其他存储器的条带,通过校验算法得到需要读取的数据(数据Data2)所在的条带(Pd1),从Pd1中得到需要读取的数据(数据Data2)返回给终端/主机。
可以看出,上述方案中,由于在分布式存储系统中,引入了用于记录存储器状态的状态信息集合体,故而可以利用状态信息集合体来记录管理分布式存储系统的存储器的访问状态。具体引入了接收和中转读数据请求的存储节点,基于其缓存的发布状态信息集合体中记录的相关存储器的状态,以及其与发布状态信息集合体的管理节点之间的连接状态来进行相应读数据请求中转决策的机制,这样有利于较好确定所使用的状态信息集合体的有效性。对于处于信任访问状态的存储器,可直接触发相关存储节点从中读取被请求数据来向终端反馈,而在这种信任访问情况下,由于无需执行传统方式中先读取出相关分条的全部条带来进行正确性校验的繁琐步骤,通过减少数据读取量就有利于降低存储器负担,进而有利于提升系统性能。
下面还提供用于实施上述方案的相关装置。
参见图11,本申请实施例提供一种分布式存储系统1100,所述分布式存储系统包括m个存储节点,其中,每个所述存储节点包括至少1个存储器,每个所述存储器包含非易失性存储介质,所述m为大于1的整数。所述m个存储节点之中的第一存储节点1110用于:接收来自终端的第一读数据请求,所述第一读数据请求携带有第一位置信息,其中,所述第一位置信息用于描述第一数据在其所属数据体中的位置;基于所述第一位置信息确定所述第一数据所在的第一存储器。
在所述第一存储器归属于所述第一存储节点的情况下,当状态信息集合体中当前记录所述第一存储器处于信任访问状态,从所述第一存储器读取所述第一数据,向所述终端发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有读取到的所述第一数据。所述第一数据例如为所述第一数据所属条带中的部分或全部数据。
在本申请一些可能实施方式中,所述分布式存储系统还包括用于发布状态信息集合体的管理节点。
其中,所述第一存储节点还用于,在所述第一存储器归属于所述m个存储节点之中的第二存储节点,并且所述第一存储器处于信任访问状态,并且所述第一存储节点与用于发布状态信息集合体的管理节点之间处于连接态的情况下,向所述第二存储节点转发所述第一读数据请求。
其中,所述第二存储节点1120用于,在接收到来自所述第一存储节点的所述第一读数据请求后,从所述第一存储器读取所述第一数据,向所述第一存储节点或所述终端发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有读取到的所述第一数据。
在本申请一些可能实施方式中,所述第一存储节点1110还用于,在所述第一存储器归属于所述m个存储节点之中的第二存储节点,并且,所述第一存储器处于信任访问状态的情况之下,在所述第一读数据请求中添加所述第一存储节点当前缓存的所述状态信息集合体的标识,向所述第二存储节点发送添加了所述状态信息集合体的标识的所述第一读数据请求。
其中,所述第二存储节点1120用于,接收到来自所述第一存储节点的所述第一读数据请求之后,将所述第一读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识进行比较,在比较出所述第一读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识相同的情况下,从所述第一存储器读取所述第一数据,向所述第一存储节点或所述终端发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有读取到的所述第一数据。
在本申请一些可能实施方式中,所述第一存储节点1110还用于,在所述第一存储器归属于所述m个存储节点之中的第二存储节点,并且,所述第一存储器处于信任访问状态的情况之下,在所述第一读数据请求中添加所述第一存储节点当前缓存的所述状态信息集合体的标识和用于发布状态信息集合体的管理节点的节点标识,向所述第二存储节点发送添加了所述状态信息集合体的标识和所述管理节点的节点标识的所述第一读数据请求。
相应的,所述第二存储节点1120用于,在接收到来自所述第一存储节点的所述第一读数据请求之后,将所述第一读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识进行比较,且将所述第一读数据请求携带的所述管理节点的节点标识与自身当前缓存的管理节点的节点标识进行比较,在比较出所述第一读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识相同,且比较出所述第一读数据请求携带的所述管理节点的节点标识与自身当前缓存的管理节点的节点标识相同的情况之下,从所述第一存储器读取所述第一数据,向所述第一存储节点或所述终端发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有读取到的所述第一数据。
在本申请一些可能实施方式中,所述第一存储节点1110还用于:在确定所述第一存储器处于非信任访问状态,并且所述第一数据所属的分条的各条带所在的N个存储器中的x个存储器属于所述第一存储节点,且所述N个存储器中的N-x个存储器属于不同于所述第一存储节点的y个存储节点的情况下,向所述y个存储节点发送携带所述分条的分条标识和所述状态信息集合体的标识的读数据请求。
所述y个存储节点中的每个存储节点用于,在接收到来自所述第一存储节点的读数据请求之后,将所述读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识进行比较,若比较出所述读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识相同,从其包括的相应存储器中读取所述分条的相应条带,向第一存储节点发送读取到的所述分条的相应条带。
所述第一存储节点还用于,基于从所述y个存储节点和所述第一存储节点收集到的所述分条的各条带进行校验运算以得到第一数据,向所述终端发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有得到的所述第一数据。
在本申请一些可能实施方式中,例如所述m个存储节点之中的其中1个存储节点为管理节点。其中,所述第一存储节点还用于,当所述第一存储器所处状态由第一状态变更为第二状态,向管理节点发送存储器状态变更报告,其中,所述存储器状态变更报告指示第一存储器处于第二状态。其中,所述第一状态和第二状态不同,所述第一状态和所述第二状态包括如下状态中的任意一个:离线状态、数据重构状态和信任访问状态。
其中,所述管理节点用于,在接收到了来自所述第一存储节点的存储器状态变更报告之后,将所述管理节点缓存的状态信息集合体中记录的所述第一存储器的状态更新为第二状态,并更新所述管理节点缓存的所述状态信息集合体的版本号;所述管理节点向所述m个存储节点中除所述管理节点之外的其他存储节点发送更新后的状态信息集合体。
相应的,所述第一存储节点还用于,用来自所述管理节点的状态信息集合体更新其当前缓存的状态信息集合体。相应的,其他存储节点也用来自所述管理节点的状态信息集合体更新其当前缓存的状态信息集合体。
在本申请一些可能实施方式中,所述第一存储节点还用于,接收来自终端的第一写数据请求,所述第一写数据请求携带有第二数据和第二位置信息,所述第二位置信息用于描述所述第二数据在其所属数据体中的位置。
其中,所述第一存储节点还用于,基于所述第二位置信息确定写所述第二数据涉及的W个存储器;将第二数据切分为W-T个数据条带,利用所述W-T个数据条带计算得到T个校验条带,其中,所述T个校验条带和W-T个数据条带形成包括W个条带的分条,所述W个存储器与所述W个条带之间一一对应,所述T和W为正整数且所述T小于所述W。
所述第一存储节点还用于,当所述第一存储节点缓存的状态信息集合体中当前记录所述W个存储器中的W1个存储器处于非离线状态且所述W个存储器中的W2个存储器处于离线状态,向所述W1个存储器所属的y2个存储节点发送携带所述分条的条带和所述状态信息集合体的标识的写数据请求。
相应的,所述y2个存储节点中的每个存储节点用于,在接收到来自所述第一存储节点的写数据请求之后,将所述写数据请求携带状态信息集合体的标识与自身当前缓存的状态信息集合体的标识进行比较,在比较出所述写数据请求携带的状态信息集合体的标识与自身当前缓存的状态信息集合体的标识相同的情况下,向其包括的相应存储器中写入所述分条的相应条带;在比较出所述写数据请求携带的状态信息集合体的标识与自身当前缓存的状态信息集合体的标识不相同的情况下,拒绝向其包括的相应存储器中写入所述分条的相应条带。
在本申请一些可能实施方式中,所述第一存储节点还用于,接收来自终端的第二写数据请求,所述第二写数据请求携带有第三数据和第三位置信息,所述第三位置信息用于描述所述第三数据在其所属数据体中的位置。
所述第一存储节点还用于,基于所述第三位置信息确定写所述第三数据涉及的W个存储器;将所述第三数据切分为W-T个数据条带,利用所述W-T个数据条带计算得到T个校验条带,其中,所述T个校验条带和W-T个数据条带形成包括W个条带的分条,所述W个存储器与所述W个条带之间一一对应,所述T和W为正整数且所述T小于所述W。
所述第一存储节点还用于,确定其缓存的状态信息集合体中当前记录的所述W个存储器所处状态,在确定出所述W个存储器中的W1个存储器处于非离线状态(例如并且所述W个存储器中的W2个存储器处于离线状态),并且所述第一存储节点与用于发布状态信息集合体的管理节点之间处于连接态的情况下,向所述W1个存储器所属的y2个存储节点发送携带所述分条的条带的写数据请求。
相应的,所述y2个存储节点中的每1个存储节点用于,在接收到来自所述第一存储节点的写数据请求之后,向其包括的相应存储器中写入所述分条的相应条带。
在本申请一些可能实施方式中,所述第一存储节点还用于,接收来自终端的第三写数据请求,所述第三写数据请求携带有第四数据和第四位置信息,所述第四位置信息用于描述所述第四数据在其所属数据体中的位置;
所述第一存储节点还用于,基于所述第四位置信息确定写所述第四数据涉及的W个存储器;将所述第四数据切分为W-T个数据条带,利用所述W-T个数据条带计算得到T个校验条带,其中,所述T个校验条带和W-T个数据条带形成包括W个条带的分条,所述W个存储器与所述W个条带之间一一对应,所述T和W为正整数且所述T小于所述W。
所述第一存储节点还用于,当所述第一存储节点缓存的状态信息集合体中当前记录所述W个存储器中的W1个存储器处于非离线状态(例如并且所述W个存储器中的W2个存储器处于离线状态),并且所述第一存储节点与用于发布状态信息集合体的管理节点之间处于连接态的情况下,向所述W1个存储器所属的y2个存储节点发送携带所述分条的条带和用于发布状态信息集合体的管理节点的节点标识的写数据请求。
相应的,所述y2个存储节点中的每个存储节点用于,在接收到来自所述第一存储节点的写数据请求之后,将所述写数据请求携带的管理节点的节点标识与自身当前缓存的管理节点的节点标识进行比较,在比较出所述写数据请求携带的管理节点的节点标识与自身当前缓存的管理节点的节点标识相同的情况下,向其包括的相应存储器中写入所述分条的相应条带;在比较出所述写数据请求携带的管理节点的节点标识与自身当前缓存的管理节点的节点标识不相同的情况下,拒绝向其包括的相应存储器中写入所述分条的相应条带。
在本申请一些可能实施方式中,第一存储节点还用于生成第一重构日志,所述第一重构日志中记录了所述W2个存储器中的第二存储器的存储器标识,所述第一重构日志中还记录了所述W个条带中的与所述第二存储器所对应的第一条带的条带标识,所述第一重构日志中还记录了所述分条的分条标识;其中,所述第二存储器为所述W2个存储器中的任意一个存储器。
相应的,例如当所述W2个存储器中的所述第二存储器重新上线之后,并且所述第二存储器所归属的第二存储节点收集到了在所述第二存储器离线期间产生的所述第一重构日志,第二存储节点获取第一重构日志中记录的需写入所述第二存储器的第一条带的标识;确定所述分条所包括的所述第一条带之外其他W-T个条带所在的W-T个存储器,从所述W-T个存储器中读取所述W-T个条带,利用所述W-T个条带进行校验运算以重构出需写入所述第二存储器的所述第一条带,将重构出的所述第一条带写入所述第二存储器。
可以理解,本实施例中的分布式存储系统1100的功能,可用于基于上述方法实施例的方案的来实施,一些未描述的部分可参考上述实施例。
参见图12,本申请实施例还提供一种存储节点1200,所述存储节点为分布式存储系统所包括的m个存储节点的其中1个存储节点,每个所述存储节点包括至少1个存储器,每个所述存储器包含非易失性存储介质,所述m为大于1的整数。所述存储节点包括:通信单元和处理控制单元。
所述通信单元1210,用于接收来自终端的第一读数据请求,所述第一读数据请求携带有第一位置信息,所述第一位置信息用于描述第一数据在其所属数据体中的位置。
处理控制单元1220,用于在基于所述第一位置信息确定所述第一数据所在的存储器为第一存储器,且所述第一存储器归属于所述存储节点的情况下,当状态信息集合体中当前记录所述第一存储器处于信任访问状态,从所述第一存储器读取所述第一数据。
其中,所述通信单元1210还用于,向所述终端发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有读取到的所述第一数据。
在本申请一些可能实施方式中,所述通信单元1210还用于,在所述第一存储器归属于所述m个存储节点之中的第二存储节点,且所述第一存储器处于信任访问状态,且所述存储节点与用于发布状态信息集合体的管理节点之间处于连接态的情况下,向所述第二存储节点转发所述第一读数据请求。
所述第一读数据请求用于触发所述第二存储节点在接收到了来自所述第一存储节点的所述第一读数据请求之后,从所述第一存储器读取所述第一数据,向所述存储节点或所述终端发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有读取到的所述第一数据。
在本申请一些可能实施方式中,所述通信单元还用于,在所述第一存储器归属于所述m个存储节点之中的第二存储节点,并且所述第一存储器处于信任访问状态的情况下,在所述第一读数据请求中添加所述第一存储节点当前缓存的所述状态信息集合体的标识,向所述第二存储节点发送添加了所述状态信息集合体的标识的所述第一读数据请求;
所述第一读数据请求用于触发所述第二存储节点接收到来自所述存储节点的所述第一读数据请求之后,将所述第一读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识进行比较,在比较出所述第一读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识相同的情况下,从所述第一存储器读取所述第一数据,向所述存储节点或所述终端发送用于响应所述第一读数据请求的第一读数据响应,其中,所述第一读数据响应携带有读取到的所述第一数据。
在本申请一些可能实施方式中,所述m个存储节点中其中1个存储节点为管理节点。
所述通信单元还用于,当所述第一存储器所处状态由第一状态变更为第二状态,向管理节点发送存储器状态变更报告,其中,所述存储器状态变更报告指示第一存储器处于第二状态;所述第一状态和第二状态不同,所述第一状态和所述第二状态包括如下状态中的任意一个:离线状态、数据重构状态和信任访问状态。
其中,所述存储器状态变更报告用于触发所述管理节点在接收到来自所述第一存储节点的存储器状态变更报告后,将所述管理节点缓存的状态信息集合体中记录的所述第一存储器的状态更新为第二状态,并且更新所述管理节点缓存的所述状态信息集合体的版本号。向所述m个存储节点中除所述管理节点之外的其他存储节点发送更新后的状态信息集合体。
所述处理控制单元用于,采用来自所述管理节点的状态信息集合体更新其当前缓存的状态信息集合体。
可以理解,本实施例中的存储节点1200的功能,可基于上述方法实施例的方案的来具体实现,一些未描述的部分可参考上述实施例。
参见图13,本申请实施例提供了一种存储节点1300,所述存储节点为分布式存储系统所包括的m个存储节点的其中1个存储节点,每个所述存储节点包括至少1个存储器,每个所述存储器包含非易失性存储介质,所述m为大于1的整数。
所述存储节点包括相互耦合的处理器1310和通信接口1320。处理器用于执行以上各方法实施例中由第一存储节点或其他存储节点所执行方法的部分或全部步骤。
所述存储器1330用于存储指令和数据,所述处理器1310用于执行所述指令,所述通信接口1320用于在所述处理器1310的控制下与其他设备进行通信。当所述处理器1310在执行所述指令时可根据所述指令执行本申请上述实施例中的任意一种方法中由任一存储节点执行的部分或全部步骤。
处理器1310还称为中央处理单元(英文:Central Processing Unit,缩写:CPU)。具体的应用中存储节点的各组件例如通过总线系统耦合在一起。其中,总线系统除了可包括数据总线之外,还可包括电源总线、控制总线和状态信号总线等。但是为清楚说明起见,在图中将各种总线都标为总线系统1340。其中,上述本申请实施例揭示的方法可应用于处理器1310中,或由处理器1310实现。其中,处理器1310可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可通过处理器1310中的硬件的集成逻辑电路或者软件形式的指令完成。上述处理器1310可以是通用处理器、数字信号处理器、专用集成电路、现成可编程门阵列或其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。处理器1310可实现或执行本申请实施例中公开的各方法、步骤及逻辑框图。通用处理器1310可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器、闪存、只读存储器、可编程只读存储器或者电可擦写可编程存储器、寄存器等等本领域成熟的存储介质之中。该存储介质位于存储器1330,例如处理器1310可读取存储器1330中的信息,结合其硬件完成上述方法的步骤。
例如所述处理器1310用于,通过所述通信接口1320接收第一读数据请求,所述第一读数据请求携带有第一位置信息,所述第一位置信息用于描述第一数据在其所属数据体中的位置;在基于所述第一位置信息确定所述第一数据所在的存储器为第一存储器,且所述第一存储器归属于所述第一存储节点的情况下,当状态信息集合体中当前记录所述第一存储器处于信任访问状态,从所述第一存储器读取所述第一数据,通过所述通信接口发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有读取到的所述第一数据。
本申请实施例提供一种计算机可读存储介质,其中,所述计算机可读存储介质存储了程序代码。所述程序代码包括用于执行第一方面中的任意一存储节点(例如第一存储节点或第二存储节点)所执行方法的部分或全部步骤的指令。
此外,本申请实施例提供了一种包括指令的计算机程序产品,当所述计算机程序产品在计算机(计算机例如为存储节点)上运行时,使得所述计算机执行以上各方面中由任意一存储节点(例如第一存储节点或第二存储节点等)所执行方法的部分或全部步骤。
此外,本申请实施例还提供一种业务系统,包括:分布式存储服务系统和终端,分布式存储服务系统和终端之间通信连接;所述分布式存储服务系统为本申请实施例提供的任意一种分布式存储服务系统。
在上述实施例中,可全部或部分地通过软件、硬件、固件或其任意组合来实现。当使用软件实现时,可全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如软盘、硬盘、磁带)、光介质(例如光盘)、或者半导体介质(例如固态硬盘)等。在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可能可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或者讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可位于一个地方,或者也可以分布到多个网络单元上。可根据实际的需要选择其中的部分或全部单元来实现本实施例方案的目的。
另外,在本申请各实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
上述集成的单元若以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可获取的存储器中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或者部分,可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干请求用以使得一台计算机设备(可以为个人计算机、服务器或者网络设备等,具体可以是计算机设备中的处理器)执行本申请的各个实施例上述方法的全部或部分步骤。
以上所述,以上实施例仅用以说明本申请的技术方案而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,然而本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (22)

  1. 一种数据访问方法,其特征在于,所述方法应用于分布式存储系统,所述分布式存储系统包括m个存储节点,所述m个存储节点包括第一存储节点,每个所述存储节点包括至少1个存储器,每个所述存储器包含非易失性存储介质,所述m为大于1的整数,所述方法包括:
    所述第一存储节点接收来自终端的第一读数据请求,其中,所述第一读数据请求携带有第一位置信息,所述第一位置信息用于描述第一数据在其所属数据体中的位置;
    在所述第一存储节点基于所述第一位置信息确定所述第一数据所在的存储器为第一存储器,且所述第一存储器归属于所述第一存储节点的情况下,当状态信息集合体中当前记录所述第一存储器处于信任访问状态,所述第一存储节点从所述第一存储器读取所述第一数据,向所述终端发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有读取到的所述第一数据。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    在所述第一存储器归属于所述m个存储节点之中的第二存储节点,并且所述第一存储器处于信任访问状态,并且所述第一存储节点与用于发布状态信息集合体的管理节点之间处于连接态的情况下,所述第一存储节点向所述第二存储节点转发所述第一读数据请求;
    所述第二存储节点在接收到来自所述第一存储节点的所述第一读数据请求之后,从所述第一存储器读取所述第一数据,向所述第一存储节点或所述终端发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有读取到的所述第一数据。
  3. 根据权利要求1所述的方法,其特征在于,所述方法还包括:在所述第一存储器归属于所述m个存储节点之中的第二存储节点,并且所述第一存储器处于信任访问状态的情况下,或者,在所述第一存储器归属于所述m个存储节点之中的第二存储节点,并且所述第一存储器处于信任访问状态,并且所述第一存储节点与用于发布状态信息集合体的管理节点之间处于连接态的情况下,所述第一存储节点在所述第一读数据请求中添加所述第一存储节点当前缓存的所述状态信息集合体的标识,向所述第二存储节点发送添加了所述状态信息集合体的标识的所述第一读数据请求;
    所述第二存储节点在接收到了来自所述第一存储节点的所述第一读数据请求之后,在比较出所述第一读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识相同的情况下,从所述第一存储器读取所述第一数据,向所述第一存储节点或所述终端发送用于响应所述第一读数据请求的第一读数据响应,所述第一读数据响应携带有读取到的所述第一数据。
  4. 根据权利要求1所述的方法,其特征在于,所述方法还包括:在所述第一存储器处于非信任访问状态的情况下,并且所述第一数据所属的分条的各条带所在的N个存储器中的x个存储器属于所述第一存储节点,且所述N个存储器中的N-x个存储器属于不同于所述第一存储节点的y个存储节点,所述第一存储节点向所述y个存储节点发送携带所述分条的分条标识和所述状态信息集合体的标识的读数据请求;
    所述y个存储节点中的每个存储节点在接收到来自所述第一存储节点的读数据请求之后,将所述读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识进行比较,若比较出所述读数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识相同,从其包括的相应存储器中读取所述分条的相应条带,向第一存储节点发送读取到的所述分条的相应条带;
    所述第一存储节点基于从所述y个存储节点和所述第一存储节点收集到的所述分条的各条带进行校验运算以得到第一数据,向所述终端发送用于响应所述第一读数据请求的第一读数据响应,其中,所述第一读数据响应携带有得到的所述第一数据。
  5. 根据权利要求1-4中任一项所述的方法,其特征在于,所述m个存储节点中其中1个存储节点为管理节点;
    所述方法还包括:当第一存储器所处状态由第一状态变更为第二状态,所述第一存储节点向管理节点发送存储器状态变更报告,其中,所述存储器状态变更报告指示第一存储器处于第二状态;其中,所述第一状态和第二状态不同,所述第一状态和所述第二状态包括如下状态中的任意一个:离线状态、数据重构状态和信任访问状态;
    所述管理节点在接收到来自所述第一存储节点的存储器状态变更报告后,将所述管理节点缓存的状态信息集合体中记录的所述第一存储器的状态更新为第二状态,并更新所述管理节点缓存的所述状态信息集合体的版本号;所述管理节点向所述m个存储节点中除所述管理节点之外的其他存储节点发送更新后的状态信息集合体;
    所述第一存储节点用来自所述管理节点的状态信息集合体更新其当前缓存的状态信息集合体。
  6. 根据权利要求1-5中任一项所述的方法,其特征在于,所述方法还包括:所述第一存储节点接收来自终端的第一写数据请求,所述第一写数据请求携带有第二数据和第二位置信息,所述第二位置信息用于描述所述第二数据在其所属数据体中的位置;
    所述第一存储节点基于所述第二位置信息确定写所述第二数据所涉及到的W个存储器;将所述第二数据切分为W-T个数据条带,利用所述W-T个数据条带计算得到T个校验条带,其中,所述T个校验条带和W-T个数据条带形成包括W个条带的分条,所述W个存储器与所述W个条带之间一一对应,所述T和W为正整数,所述T小于所述W;
    当所述第一存储节点缓存的状态信息集合体当前记录所述W个存储器中的W1个存储器处于非离线状态,向所述W1个存储器所属的y2个存储节点发送携带所述分条的条带和所述状态信息集合体的标识的写数据请求;
    所述y2个存储节点之中的每个存储节点在接收到来自所述第一存储节点的写数据请求之后,将所述写数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识进行比较,在比较出所述写数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识相同的情况下,向其包括的相应存储器中写入所述分条的相应条带;在比较出所述写数据请求携带的所述状态信息集合体的标识与自身当前缓存的状态信息集合体的标识不相同的情况下,拒绝向其包括的相应存储器中写入所述分条的相应条带。
  7. 根据权利要求1-5中任一项所述的方法,其特征在于,所述方法还包括:所述第一存储节点接收来自终端的第二写数据请求,所述第二写数据请求携带有第三数据和第三位置信息,所述第三位置信息用于描述所述第三数据在其所属数据体中的位置;
    所述第一存储节点基于所述第三位置信息确定所述第三数据所涉及的W个存储器之后;将所述第三数据切分为W-T个数据条带,利用所述W-T个数据条带计算得到T个校验条带,其中,所述T个校验条带和W-T个数据条带形成包括W个条带的分条,所述W个存储器与所述W个条带之间一一对应,所述T和W为正整数,所述T小于所述W;
    在所述第一存储节点缓存的状态信息集合体中当前记录所述W个存储器中的W1个存储器处于非离线状态,并且所述第一存储节点与用于发布状态信息集合体的管理节点之间处于连接态的情况之下,向所述W1个存储器所属的y2个存储节点发送携带有所述分条的条带的写数据请求;
    所述y2个存储节点之中的每个存储节点在接收到来自所述第一存储节点的写数据请求之后,向其包括的相应存储器中写入所述分条的相应条带。
  8. The method according to claim 6 or 7, wherein W1 is less than W, and the state information set cached by the first storage node currently records that W2 memories among the W memories are in an offline state;
    the method further comprises: the first storage node generates a first reconstruction log, wherein the first reconstruction log records a memory identifier of a second memory among the W2 memories, a strip identifier of a first strip that is among the W strips and corresponds to the second memory, and a stripe identifier of the stripe; the second memory is any one of the W2 memories;
    after the second memory among the W2 memories comes back online, and the second storage node to which the second memory belongs has collected the first reconstruction log generated while the second memory was offline, the second storage node obtains the identifier, recorded in the first reconstruction log, of the first strip to be written into the second memory; determines the W−T memories in which the other W−T strips of the stripe besides the first strip are located; reads the W−T strips from the W−T memories; performs a parity computation using the W−T strips to reconstruct the first strip to be written into the second memory; and writes the reconstructed first strip into the second memory.
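Outside the claims, the reconstruction-log replay of claim 8 can be sketched for the single-XOR-parity case; the log entry fields and strip identifiers are hypothetical:

```python
# Sketch of claim 8's log replay (single XOR parity assumed): while a
# memory was offline, the log recorded which strip of which stripe it
# missed; when the memory comes back online, the owning node rebuilds
# that strip from the other strips of the same stripe.

def xor_all(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

# Strips of one stripe; "s2" is the parity strip (s0 XOR s1).
stripe_strips = {"s0": b"\x01\x02", "s1": b"\x03\x04", "s2": b"\x02\x06"}
reconstruction_log = [
    {"memory_id": "mem2", "stripe_id": 42, "strip_id": "s1"},
]

rebuilt = {}
for entry in reconstruction_log:
    # Read the other strips of the stripe and recompute the missed one.
    others = [v for k, v in stripe_strips.items() if k != entry["strip_id"]]
    rebuilt[entry["strip_id"]] = xor_all(others)
```

The log thus replaces a full-stripe resynchronization with a targeted rebuild of only the strips the memory actually missed.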
  9. A distributed storage system, comprising m storage nodes, wherein each storage node comprises at least one memory, each memory contains a non-volatile storage medium, and m is an integer greater than 1;
    a first storage node among the m storage nodes is configured to:
    receive a first read data request from a terminal, wherein the first read data request carries first location information, and the first location information is used to describe the location of first data in the data body to which the first data belongs; and determine, based on the first location information, a first memory in which the first data is located;
    when the first memory belongs to the first storage node and a state information set currently records that the first memory is in a trusted access state, read the first data from the first memory, and send, to the terminal, a first read data response responding to the first read data request, wherein the first read data response carries the read first data.
  10. The distributed storage system according to claim 9, wherein the distributed storage system further comprises a management node used to publish a state information set;
    the first storage node is further configured to: when the first memory belongs to a second storage node among the m storage nodes, the first memory is in a trusted access state, and the first storage node is in a connected state with the management node used to publish the state information set, forward the first read data request to the second storage node;
    the second storage node is configured to: after receiving the first read data request from the first storage node, read the first data from the first memory, and send, to the first storage node or the terminal, a first read data response responding to the first read data request, wherein the first read data response carries the read first data.
  11. The distributed storage system according to claim 9, wherein
    the first storage node is further configured to: when the first memory belongs to a second storage node among the m storage nodes and the first memory is in a trusted access state, add an identifier of the state information set currently cached by the first storage node to the first read data request, and send, to the second storage node, the first read data request to which the identifier of the state information set has been added;
    the second storage node is configured to: after receiving the first read data request from the first storage node, compare the identifier of the state information set carried in the first read data request with the identifier of the state information set currently cached by the second storage node; when the two identifiers are the same, read the first data from the first memory, and send, to the first storage node or the terminal, a first read data response responding to the first read data request, wherein the first read data response carries the read first data.
  12. The distributed storage system according to claim 9, wherein
    the first storage node is further configured to: when determining that the first memory is in a non-trusted access state, x memories among the N memories in which the strips of the stripe to which the first data belongs are located belong to the first storage node, and the remaining N−x memories among the N memories belong to y storage nodes other than the first storage node, send, to the y storage nodes, read data requests carrying a stripe identifier of the stripe and the identifier of the state information set;
    each of the y storage nodes is configured to: after receiving the read data request from the first storage node, compare the identifier of the state information set carried in the read data request with the identifier of the state information set currently cached by that storage node; if the two identifiers are the same, read the corresponding strip of the stripe from the corresponding memory it includes, and send the read strip to the first storage node;
    the first storage node is further configured to perform a parity computation based on the strips of the stripe collected from the y storage nodes and from the first storage node to obtain the first data, and send, to the terminal, a first read data response responding to the first read data request, wherein the first read data response carries the obtained first data.
  13. The distributed storage system according to any one of claims 9 to 12, wherein one of the m storage nodes is a management node;
    the first storage node is further configured to: when the state of the first memory changes from a first state to a second state, send a memory state change report to the management node, wherein the memory state change report indicates that the first memory is in the second state; the first state and the second state are different, and each of the first state and the second state is any one of the following states: an offline state, a data reconstruction state, and a trusted access state;
    the management node is configured to: after receiving the memory state change report from the first storage node, update the state of the first memory recorded in the state information set cached by the management node to the second state, and update the version number of the state information set cached by the management node; the management node sends the updated state information set to the storage nodes other than the management node among the m storage nodes;
    the first storage node is further configured to update its currently cached state information set with the state information set from the management node.
  14. The distributed storage system according to any one of claims 9 to 12, wherein the first storage node is further configured to receive a first write data request from a terminal, wherein the first write data request carries second data and second location information, and the second location information is used to describe the location of the second data in the data body to which the second data belongs;
    the first storage node is further configured to: determine, based on the second location information, W memories involved in writing the second data; split the second data into W−T data strips, and compute T parity strips from the W−T data strips, wherein the T parity strips and the W−T data strips form a stripe including W strips, the W memories are in one-to-one correspondence with the W strips, T and W are positive integers, and T is less than W;
    the first storage node is further configured to: when the state information set cached by the first storage node currently records that W1 memories among the W memories are in a non-offline state, send, to y2 storage nodes to which the W1 memories belong, write data requests carrying the strips of the stripe and the identifier of the state information set;
    each of the y2 storage nodes is configured to: after receiving the write data request from the first storage node, compare the identifier of the state information set carried in the write data request with the identifier of the state information set currently cached by that storage node; when the two identifiers are the same, write the corresponding strip of the stripe into the corresponding memory it includes; and when the two identifiers are different, refuse to write the corresponding strip of the stripe into the corresponding memory it includes.
  15. The distributed storage system according to any one of claims 9 to 12, wherein the first storage node is further configured to receive a second write data request from a terminal, wherein the second write data request carries third data and third location information, and the third location information is used to describe the location of the third data in the data body to which the third data belongs;
    the first storage node is further configured to: determine, based on the third location information, W memories involved in writing the third data; split the third data into W−T data strips, and compute T parity strips from the W−T data strips, wherein the T parity strips and the W−T data strips form a stripe including W strips, the W memories are in one-to-one correspondence with the W strips, T and W are positive integers, and T is less than W;
    the first storage node is further configured to: determine the states of the W memories currently recorded in its cached state information set; and when determining that W1 memories among the W memories are in a non-offline state and that the first storage node is in a connected state with a management node used to publish the state information set, send, to y2 storage nodes to which the W1 memories belong, write data requests carrying the strips of the stripe;
    each of the y2 storage nodes is configured to: after receiving the write data request from the first storage node, write the corresponding strip of the stripe into the corresponding memory it includes.
  16. The distributed storage system according to claim 14 or 15, wherein W1 is less than W, and the state information set cached by the first storage node currently records that W2 memories among the W memories are in an offline state;
    the first storage node is further configured to generate a first reconstruction log, wherein the first reconstruction log records a memory identifier of a second memory among the W2 memories, a strip identifier of a first strip that is among the W strips and corresponds to the second memory, and a stripe identifier of the stripe; the second memory is any one of the W2 memories;
    after the second memory among the W2 memories comes back online, and the second storage node to which the second memory belongs has collected the first reconstruction log generated while the second memory was offline, the second storage node obtains the identifier, recorded in the first reconstruction log, of the first strip to be written into the second memory;
    determines the W−T memories in which the other W−T strips of the stripe besides the first strip are located, reads the W−T strips from the W−T memories, performs a parity computation using the W−T strips to reconstruct the first strip to be written into the second memory, and writes the reconstructed first strip into the second memory.
  17. A storage node, wherein the storage node is one of m storage nodes included in a distributed storage system, each storage node comprises at least one memory, each memory contains a non-volatile storage medium, and m is an integer greater than 1;
    the storage node comprises:
    a communication unit, configured to receive a first read data request from a terminal, wherein the first read data request carries first location information, and the first location information is used to describe the location of first data in the data body to which the first data belongs;
    a processing control unit, configured to: when it is determined, based on the first location information, that the memory in which the first data is located is a first memory, the first memory belongs to the storage node, and a state information set currently records that the first memory is in a trusted access state, read the first data from the first memory;
    the communication unit is further configured to send, to the terminal, a first read data response responding to the first read data request, wherein the first read data response carries the read first data.
  18. The storage node according to claim 17, wherein
    the communication unit is further configured to: when the first memory belongs to a second storage node among the m storage nodes, the first memory is in a trusted access state, and the storage node is in a connected state with a management node used to publish a state information set, forward the first read data request to the second storage node;
    the first read data request is used to trigger the second storage node, after receiving the first read data request from the storage node, to read the first data from the first memory and to send, to the storage node or the terminal, a first read data response responding to the first read data request, wherein the first read data response carries the read first data.
  19. The storage node according to claim 17, wherein the communication unit is further configured to: when the first memory belongs to a second storage node among the m storage nodes and the first memory is in a trusted access state, or when the first memory belongs to a second storage node among the m storage nodes, the first memory is in a trusted access state, and the storage node is in a connected state with a management node used to publish a state information set, add an identifier of the state information set currently cached by the storage node to the first read data request, and send, to the second storage node, the first read data request to which the identifier of the state information set has been added;
    the first read data request is used to trigger the second storage node, after receiving the first read data request from the storage node, to compare the identifier of the state information set carried in the first read data request with the identifier of the state information set currently cached by the second storage node, and, when the two identifiers are the same, to read the first data from the first memory and to send, to the storage node or the terminal, a first read data response responding to the first read data request, wherein the first read data response carries the read first data.
  20. The storage node according to any one of claims 17 to 19, wherein one of the m storage nodes is a management node;
    the communication unit is further configured to: when the state of the first memory changes from a first state to a second state, send a memory state change report to the management node, wherein the memory state change report indicates that the first memory is in the second state; the first state and the second state are different, and each of the first state and the second state is any one of the following states: an offline state, a data reconstruction state, and a trusted access state; the memory state change report is used to trigger the management node, after receiving the memory state change report from the storage node, to update the state of the first memory recorded in the state information set cached by the management node to the second state, to update the version number of the state information set cached by the management node, and to send the updated state information set to the storage nodes other than the management node among the m storage nodes;
    the processing control unit is configured to update the currently cached state information set with the state information set from the management node.
  21. A data access method, applied to a first storage node, wherein the first storage node is located in a distributed storage system including m storage nodes, each storage node comprises at least one memory, each memory contains a non-volatile storage medium, and m is an integer greater than 1; the method comprises:
    receiving, by the first storage node, a first read data request, wherein the first read data request carries first location information, and the first location information is used to describe the location of first data in the data body to which the first data belongs;
    when the first storage node determines, based on the first location information, that the memory in which the first data is located is a first memory, the first memory belongs to the first storage node, and a state information set currently records that the first memory is in a trusted access state, reading, by the first storage node, the first data from the first memory, and sending a first read data response responding to the first read data request, wherein the first read data response carries the read first data.
  22. A storage node, wherein the storage node is one of m storage nodes included in a distributed storage system, each storage node comprises at least one memory, each memory contains a non-volatile storage medium, and m is an integer greater than 1;
    the storage node comprises a processor and a communication interface coupled to each other;
    the processor is configured to: receive a first read data request through the communication interface, wherein the first read data request carries first location information, and the first location information is used to describe the location of first data in the data body to which the first data belongs; and, when it is determined, based on the first location information, that the memory in which the first data is located is a first memory, the first memory belongs to the storage node, and a state information set currently records that the first memory is in a trusted access state, read the first data from the first memory, and send, through the communication interface, a first read data response responding to the first read data request, wherein the first read data response carries the read first data.
PCT/CN2017/078579 2017-03-29 2017-03-29 Method for accessing distributed storage system, related apparatus, and related system WO2018176265A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
PCT/CN2017/078579 2017-03-29 2017-03-29 Method for accessing distributed storage system, related apparatus, and related system
CN201780000202.8A 2017-03-29 2017-03-29 Method for accessing distributed storage system, related apparatus, and related system
EP17904157.9A EP3537687B1 (en) 2017-03-29 2017-03-29 Access method for distributed storage system, related device and related system
JP2019521779A JP6833990B2 (ja) 2017-03-29 2017-03-29 分散型ストレージシステムにアクセスするための方法、関係する装置及び関係するシステム
SG11201901608VA SG11201901608VA (en) 2017-03-29 2017-03-29 Method for accessing distributed storage system, related apparatus, and related system
US16/574,421 US11307776B2 (en) 2017-03-29 2019-09-18 Method for accessing distributed storage system, related apparatus, and related system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/078579 2017-03-29 2017-03-29 Method for accessing distributed storage system, related apparatus, and related system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/574,421 Continuation US11307776B2 (en) 2017-03-29 2019-09-18 Method for accessing distributed storage system, related apparatus, and related system

Publications (1)

Publication Number Publication Date
WO2018176265A1 true WO2018176265A1 (zh) 2018-10-04

Family

ID=63673992

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/078579 WO2018176265A1 (zh) Method for accessing distributed storage system, related apparatus, and related system 2017-03-29 2017-03-29

Country Status (6)

Country Link
US (1) US11307776B2 (zh)
EP (1) EP3537687B1 (zh)
JP (1) JP6833990B2 (zh)
CN (1) CN108934187B (zh)
SG (1) SG11201901608VA (zh)
WO (1) WO2018176265A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6807457B2 * 2017-06-15 2021-01-06 Hitachi, Ltd. Storage system and storage system control method
CN109814803B * 2018-12-17 2022-12-09 Shenzhen Innovation Technology Co., Ltd. Method and apparatus for adaptive adjustment of fault tolerance in a distributed storage system
CN111367712A * 2018-12-26 2020-07-03 Huawei Technologies Co., Ltd. Data processing method and apparatus
CN111435323B * 2019-01-15 2023-06-20 Alibaba Group Holding Limited Information transmission method and apparatus, terminal, server, and storage medium
US11507545B2 (en) * 2020-07-30 2022-11-22 EMC IP Holding Company LLC System and method for mirroring a file system journal
US11669501B2 (en) 2020-10-29 2023-06-06 EMC IP Holding Company LLC Address mirroring of a file system journal
CN114816226A * 2021-01-29 2022-07-29 EMC IP Holding Company LLC Method, device, and computer program product for managing a storage system
TWI802035B * 2021-10-06 2023-05-11 MiTAC Computing Technology Corp. Server data backup control method
CN115190044B * 2022-06-28 2023-08-08 Ping An Bank Co., Ltd. Device connection status checking method, apparatus, device, and storage medium
CN116627359B * 2023-07-24 2023-11-14 Chengdu BIWIN Storage Technology Co., Ltd. Memory management method and apparatus, readable storage medium, and electronic device

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101615145A * 2009-07-24 2009-12-30 ZTE Corporation Method and apparatus for improving reliability of memory data caching
CN102111448A * 2011-01-13 2011-06-29 Huawei Technologies Co., Ltd. Data prefetching method, node, and system for a distributed hash table (DHT) storage system
CN105009085A * 2013-03-18 2015-10-28 Toshiba Corporation Information processing system, control program, and information processing device
WO2016065613A1 * 2014-10-31 2016-05-06 Huawei Technologies Co., Ltd. File access method, distributed storage system, and network device

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
US6952737B1 (en) * 2000-03-03 2005-10-04 Intel Corporation Method and apparatus for accessing remote storage in a distributed storage cluster architecture
DE10061998A1 2000-12-13 2002-07-18 Infineon Technologies Ag Cryptography processor
JP2004192305A 2002-12-11 2004-07-08 Hitachi Ltd iSCSI storage management method and management system
US7805633B2 (en) 2006-09-18 2010-09-28 Lsi Corporation Optimized reconstruction and copyback methodology for a disconnected drive in the presence of a global hot spare disk
US9298550B2 (en) * 2011-05-09 2016-03-29 Cleversafe, Inc. Assigning a dispersed storage network address range in a maintenance free storage container
EP3074907B1 (en) * 2013-11-27 2020-07-15 British Telecommunications public limited company Controlled storage device access
EP2933733A4 (en) * 2013-12-31 2016-05-11 Huawei Tech Co Ltd DATA PROCESSING METHOD AND DEVICE IN A DISTRIBUTED FILE STORAGE SYSTEM
CN104951475B * 2014-03-31 2018-10-23 China Telecom Corporation Limited Distributed file system and implementation method
US10230809B2 (en) * 2016-02-29 2019-03-12 Intel Corporation Managing replica caching in a distributed storage system
CN106406758B * 2016-09-05 2019-06-18 Huawei Technologies Co., Ltd. Data processing method based on a distributed storage system, and storage device


Non-Patent Citations (1)

Title
See also references of EP3537687A4 *

Also Published As

Publication number Publication date
JP6833990B2 (ja) 2021-02-24
JP2019532440A (ja) 2019-11-07
US20200012442A1 (en) 2020-01-09
EP3537687B1 (en) 2021-09-22
EP3537687A1 (en) 2019-09-11
CN108934187A (zh) 2018-12-04
US11307776B2 (en) 2022-04-19
CN108934187B (zh) 2020-08-25
SG11201901608VA (en) 2019-03-28
EP3537687A4 (en) 2019-09-25

Similar Documents

Publication Publication Date Title
WO2018176265A1 (zh) Method for accessing distributed storage system, related apparatus, and related system
JP6301318B2 (ja) Cache processing method, node, and computer-readable medium for a distributed storage system
ES2881606T3 (es) Geographically distributed file system using coordinated namespace replication
US9507678B2 (en) Non-disruptive controller replacement in a cross-cluster redundancy configuration
JP6202756B2 (ja) Assisted coherent shared memory
JP6404907B2 (ja) Efficient read replicas
CN103827843B (zh) Data writing method, apparatus, and system
WO2017113276A1 (zh) Method, apparatus, and system for data reconstruction in a distributed storage system
JP6264666B2 (ja) Data storage method, data storage apparatus, and storage device
CN106547859B (zh) Method and apparatus for storing data files in a multi-tenant data storage system
CN108616574B (zh) Storage method, device, and storage medium for management data
CN107526537B (zh) Method and system for locking a storage area in a storage system
US10387273B2 (en) Hierarchical fault tolerance in system storage
CN109582213B (zh) Data reconstruction method and apparatus, and data storage system
WO2018054079A1 (zh) File storage method, first virtual machine, and name node
CN105530294A (zh) Method for distributed storage of massive data
CN105278882A (zh) Disk management method for a distributed file system
KR101601877B1 (ko) Apparatus and method for client participation in data storage in a distributed file system
JP6376626B2 (ja) Data storage method, data storage apparatus, and storage device
CN109992447B (zh) Data replication method, apparatus, and storage medium
CN115470041A (zh) Data disaster recovery management method and apparatus
US9495292B1 (en) Cache management
CN116868173A (zh) Reducing the impact of network latency during recovery operations
US20130124797A1 (en) Virtual disks constructed from unused distributed storage
CN104298467A (zh) P2P cache file management method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17904157

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019521779

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2017904157

Country of ref document: EP

Effective date: 20190604

NENP Non-entry into the national phase

Ref country code: DE