CN113901024A - Data storage system, data storage method, readable medium, and electronic device

Info

Publication number: CN113901024A
Application number: CN202111131243.5A
Authority: CN (China)
Prior art keywords: data, metadata, stored, original data, metadata information
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: Wang Hongyan (王红岩), He Xiaochun (何小春)
Current assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd

Events:
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202111131243.5A
Publication of CN113901024A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/22 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a data storage system, a data storage method, a readable medium, and an electronic device, and relates to the technical field of data processing. The system comprises: a client, configured to acquire original data to be stored; a metadata management device, configured to persistently store metadata information corresponding to the original data, storing the metadata information to a hot volume if the original data is determined to be hot data and to a cold volume if the original data is determined to be cold data; and a data storage device, configured to store the original data, persistently storing it in a replica service module in multi-copy form if the metadata information is stored in the hot volume and storing it in an erasure code online storage engine if the metadata information is stored in the cold volume. The data storage system improves data read-write capability, reduces bandwidth occupation, lowers storage cost, and improves system performance.

Description

Data storage system, data storage method, readable medium, and electronic device
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data storage system, a data storage method, a computer-readable medium, and an electronic device.
Background
With the continuous progress of science and technology, more and more data needs to be processed, and the concept of the Data Lake has emerged accordingly. A data lake gathers all data together: it can be very large, the number of files can reach the billion level, and a single flat directory may hold millions of files. Part of the value of a data lake is that it gathers different kinds of data in one place; another part is that data analysis can be performed on it without a predefined model.
Currently, a typical data lake architecture mainly includes three layers: a storage layer, a data definition layer, and a computation layer. The lake storage layer is responsible for storing and managing mass data and serves various cloud computing engines such as Spark and Presto, while the data lake platform as a whole supports complex applications such as data dashboards, data reports, data mining, and machine learning. In these application scenarios, as data volume continuously grows and accumulates, different data can show great differences in access heat; in the related art, however, the same storage policy is adopted regardless of access heat, so cluster resources cannot be fully utilized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Disclosure of Invention
The present disclosure aims to provide a data storage system, a data storage method, a computer-readable medium, and an electronic device, so as to avoid, at least to a certain extent, the problems of low cluster resource utilization and poor system performance caused in the related art by adopting the same storage policy for data with different access heat.
According to a first aspect of the present disclosure, there is provided a data storage system comprising:
a client, configured to acquire original data to be stored;
a metadata management device, comprising a hot volume and a cold volume and configured to persistently store metadata information corresponding to the original data, wherein the metadata information is stored to the hot volume if the original data is determined to be hot data, and to the cold volume if the original data is determined to be cold data; and
a data storage device, comprising a replica service module and an erasure code online storage engine and configured to store the original data, wherein the original data is persistently stored in the replica service module in multi-copy form if the metadata information is stored in the hot volume, and is stored in the erasure code online storage engine if the metadata information is stored in the cold volume.
According to a second aspect of the present disclosure, there is provided a data storage method applied to the data storage system of the first aspect, the method including:
acquiring original data to be stored;
if the original data is determined to be hot data, storing metadata information corresponding to the original data to the hot volume, and persistently storing the original data in multi-copy form in the replica service module; and
if the original data is determined to be cold data, storing the metadata information to the cold volume, and storing the original data in the erasure code online storage engine.
According to a third aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the data storage method described above.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus, comprising:
a processor; and
a memory for storing one or more programs which, when executed by the processor, cause the processor to implement the data storage method described above.
According to the data storage system provided by one embodiment of the present disclosure, original data to be stored is acquired by the client; metadata information corresponding to the original data is then persistently stored by the metadata management device, which stores the metadata information to a hot volume if the original data is determined to be hot data and to a cold volume if it is determined to be cold data; finally, the original data is stored by a data storage device comprising a replica service module and an erasure code online storage engine, which persistently stores the original data in the replica service module in multi-copy form if the metadata information is stored in the hot volume, and stores it in the erasure code online storage engine if the metadata information is stored in the cold volume. On one hand, the metadata management device stores metadata with the volume as the granularity and sets hot and cold volumes for different application scenarios; when data is stored, the metadata is routed into the appropriate volume according to the data's access heat, which effectively improves metadata management efficiency and metadata write performance. On the other hand, hot data is persistently stored in multi-copy form in the replica service module, a storage unit with higher read-write efficiency, while cold data is stored in the erasure code online storage engine; this effectively guarantees both the stability and access efficiency of hot data and the continuous-write requirements of cold data, makes full use of the cluster's storage resources, and improves the read-write performance of the system.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 schematically illustrates a framework diagram of a data storage system in an exemplary embodiment of the disclosure;
FIG. 2 is a schematic diagram illustrating a hot volume mode in a data storage system in an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a cold volume mode in a data storage system in an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a cold volume mode in another data storage system in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a structural diagram of a metadata management apparatus in an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating a structure of a replica service module in an exemplary embodiment of the disclosure;
FIG. 7 is a schematic diagram illustrating an organization of a mapping relationship between volumes, data partitions, files, and slices in an exemplary embodiment of the disclosure;
FIG. 8 schematically illustrates a framework diagram of a computing acceleration device in an exemplary embodiment of the present disclosure;
FIG. 9 schematically illustrates a flow chart of a data storage method in an exemplary embodiment of the disclosure;
FIG. 10 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In the related art, in the traditional data lake solution of the Apache Hadoop cluster, the Hadoop Distributed File System (HDFS) has always been the mainstream storage standard; HDFS is one of the three major components of Hadoop and is the basis of data storage management in distributed computing. Under the traditional HDFS architecture, however, Hadoop scalability is limited to a certain extent, and performance bottlenecks and other problems easily occur. Because the metadata information (about 150 bytes) of each file, directory, and data block in HDFS must be stored in the memory of the NameNode that manages the file system namespace, for a huge cluster with a large number of files the memory severely limits horizontal scaling of the system. The Hadoop 2.x release introduced a multi-NameNode mechanism that allows the system to scale by adding NameNodes, but a system administrator must then maintain multiple NameNodes and load-balancing services, which invisibly increases operation costs.
The separation of storage and computation is a trend in cloud computing, as is the data lake. A prominent problem of the traditional fused computing-and-storage architecture is that storage and computation capacities do not match, so that when a cluster is expanded, storage and computation must be expanded together, wasting a large amount of resources. A data lake architecture that combines computation with separated HDFS storage faces the following problems: the performance and scalability of the HDFS NameNode are insufficient, a single cluster basically reaches its bottleneck at a scale above one billion files, and multiple HDFS clusters must be maintained in large-data-volume scenarios, bringing a large amount of operation and management work; the HDFS NameNode has a long failure-restart time with a large impact; and HDFS defaults to a replica strategy, its erasure coding (EC) is immature, and its storage cost is high.
With the advent of object storage, public cloud vendors have also introduced separated architectures that combine computation with object storage. Such an architecture makes good use of the massive capacity, low cost, and high reliability of object storage, but has the following problems: write operations are slow, mainly because metadata operations such as rename, list, and delete perform poorly on object storage; and query jobs frequently access data, occupying a large amount of bandwidth and affecting real-time queries. To solve the metadata and data access problems of the computation + object storage architecture, public cloud vendors' solutions are basically gateways that implement file semantics on top of object storage together with data cache acceleration at the client, but this approach cannot completely solve the core problems: high-performance metadata management, hot-cold data tiering capability, and multi-policy cache acceleration capability.
Based on one or more problems in the related art, the data storage system according to the exemplary embodiment of the present disclosure is first provided in the present exemplary embodiment, and is specifically described below with reference to fig. 1.
Fig. 1 schematically illustrates a framework diagram of a data storage system in an exemplary embodiment of the present disclosure.
Referring to fig. 1, a data storage system 100 may include a client 101, a metadata management apparatus 102, and a data storage apparatus 103. Wherein:
the Client 101(Client) may provide an Application Programming Interface (API), and may support multiple access protocols, for example, the access protocol may be HCFS, POSIX/FUSE, or S3, or may be other types of general access protocols, which is not limited in this example embodiment. Meanwhile, the client 101 may also support local metadata and data caching. In the data storage system 100, the client 101 may be mainly configured to receive a read-write request sent by a user, and obtain original data to be stored based on the read-write request.
The metadata management apparatus 102 (Name Space Service, i.e., unified namespace metadata management) is mainly responsible for managing and persisting the metadata information of each volume, such as its metadata partitions (Meta Partitions), where a volume (Volume) represents a logical management unit of a data set; the metadata management apparatus 102 manages the cluster's metadata with the volume as the granularity.
Specifically, according to the difference of the data access heat, the metadata management apparatus 102 may set a Hot volume (Hot mode) and a Cold volume (Cold mode) for persistently storing metadata information corresponding to the original data acquired by the client 101, where the metadata information may be stored in the Hot volume if the original data is determined to be Hot data, and the metadata information may be stored in the Cold volume if the original data is determined to be Cold data.
The data storage device 103 is mainly used for persistently storing the original data, and may specifically include a replica service module 1031 (Replica Node Service) and an erasure code online storage engine 1032 (Online-EC Storage Service).
The replica service module 1031 supports two usage scenarios. In a hot data scenario, it has data persistence capability and adopts a 3-replica data management mode. In a cold data scenario, it mainly serves as a high-performance read-acceleration cache, caching data blocks with an elastic number of copies (for example, 1 to 3), and its underlying storage medium may be a high-performance Solid State Disk (SSD).
Colloquially, the data storage system 100 may divide the file corresponding to the original data into a plurality of data blocks (Blocks); the data storage device 103 stores each data block on a different replica node (machine), and the metadata management device 102 records the storage location information of the data blocks, that is, on which replica node each data block is stored.
The erasure code online storage engine 1032 is mainly used in cold data scenarios; it supports multiple encoding and decoding modes, handles a large data storage scale, and is used for low-cost persistence of mass cold data. Its underlying storage medium may be a traditional hard disk (Hard Disk Drive, HDD).
Specifically, if the metadata information of the original data is stored in the hot volume, the original data is persistently stored in the replica service module in the form of multiple copies; if the metadata information is stored in the cold volume, the original data is persistently stored in the erasure code online storage engine.
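To make the routing concrete, the following is a minimal Go sketch of the hot/cold write path just described. It is an illustration only, not the patent's implementation: the types ReplicaService, ECService, and DataStore and their methods are hypothetical stand-ins for the replica service module 1031 and the erasure code online storage engine 1032.

```go
package main

import "fmt"

// Temperature of a dataset's volume, decided when the volume is created.
type Temperature int

const (
	Hot Temperature = iota
	Cold
)

// ReplicaService and ECService are illustrative stand-ins for the replica
// service module and the erasure code online storage engine.
type ReplicaService struct{ copies int }
type ECService struct{}

func (r *ReplicaService) Put(key string, data []byte) {
	fmt.Printf("replica service: persisted %d full copies of %q (%d bytes)\n",
		r.copies, key, len(data))
}

func (e *ECService) Put(key string, data []byte) {
	fmt.Printf("EC engine: encoded %q (%d bytes) into data + parity blocks\n",
		key, len(data))
}

// DataStore routes a write the way the data storage device 103 does: if the
// metadata went to the hot volume, persist full replicas; if it went to the
// cold volume, hand the data to the erasure coding engine.
type DataStore struct {
	replicas *ReplicaService
	ec       *ECService
}

func (s *DataStore) Write(key string, data []byte, temp Temperature) {
	if temp == Hot {
		s.replicas.Put(key, data)
	} else {
		s.ec.Put(key, data)
	}
}

func main() {
	store := &DataStore{replicas: &ReplicaService{copies: 3}, ec: &ECService{}}
	store.Write("daily_report.parquet", make([]byte, 1<<20), Hot)
	store.Write("archive_2019.parquet", make([]byte, 1<<20), Cold)
}
```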
FIG. 2 is a schematic diagram illustrating a hot volume mode in a data storage system according to an exemplary embodiment of the present disclosure.
Referring to fig. 2, in the read-write mode 200 of a hot volume, the metadata information of the original data acquired by the client 201 is stored on a metadata node 202 (Meta Node) of the metadata management apparatus, specifically in the configured hot volume, and the original data is then directly persisted on replica nodes 203 (Replica Nodes) in a multi-copy (for example, 3-copy) manner. The hot-volume read-write mode 200 is mainly suitable for temporary-data scenarios and high-performance read-write scenarios; of course, it may also be applied to other scenarios. This is only a schematic example and is not limited in this example embodiment.
Fig. 3 schematically illustrates a diagram of a cold volume mode in a data storage system according to an exemplary embodiment of the present disclosure.
Referring to fig. 3, in the read-write mode 300 of the cold volume, the metadata information of the original data acquired by the client 301 is stored on a metadata node 302 of the metadata management apparatus, specifically in the configured cold volume, and the original data is then written directly into the erasure code online storage engine 303.
Erasure Code (EC) is a coding technique: m parts of parity data are added to n parts of original data to form n + m parts in total, and the original data can be restored from any n of these parts. That is, as long as no more than m parts fail, the data can still be recovered from the remaining parts. Therefore, the original data may be persistently stored in the erasure code online storage engine 303 as 1.X copies (a 1.X copy is the backup formed by the data blocks plus the parity blocks; since the parity blocks are generally smaller in total volume than the data blocks, the combination can be regarded as 1.X copies of the data). For example, depending on the configured number of parity blocks, 1.3 copies or 1.5 copies may be persistently stored, or a backup formed by another number of parity blocks may be used, which is not particularly limited in this exemplary embodiment.
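The recovery property is easiest to see with the simplest possible erasure code, a single XOR parity block (the m = 1 case); production engines use Reed-Solomon-style codes with larger m, but the principle is the same. A minimal Go sketch, with illustrative block contents:

```go
package main

import "fmt"

// xorParity XORs n equally sized blocks into one parity block. With a single
// parity block (m = 1), any one lost block can be rebuilt from the others.
func xorParity(blocks [][]byte) []byte {
	parity := make([]byte, len(blocks[0]))
	for _, b := range blocks {
		for i, v := range b {
			parity[i] ^= v
		}
	}
	return parity
}

func main() {
	data := [][]byte{[]byte("AAAA"), []byte("BBBB"), []byte("CCCC"), []byte("DDDD")}
	parity := xorParity(data)

	// Simulate losing block 2. XORing the parity with every surviving data
	// block cancels them out, leaving exactly the lost block.
	survivors := [][]byte{data[0], data[1], data[3], parity}
	rebuilt := xorParity(survivors)
	fmt.Printf("rebuilt block 2: %q\n", rebuilt) // "CCCC"
}
```

With n = 4 data blocks and one parity block, the stored volume is 1.25 copies of the data instead of the 3 copies of full replication, which illustrates where the cost advantage of the 1.X-copy scheme above comes from.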
In another example embodiment, if the volume of cold data is large, zero or more cache copies may additionally be cached in the replica nodes 304 while the 1.X-copy original data is persistently stored in the erasure code online storage engine 303 (for example, the number of cache copies may be set in the range of 1 to 3, or in another range, customized according to the actual usage scenario; this example embodiment places no special limitation on this). The number of copies of the original data cached in the replica nodes 304 can follow various configurable strategies. This is suitable for write-little, read-much, large-data-volume, cost-sensitive business and backup scenarios; in such scenarios it can effectively relieve the heavy pressure on the erasure code online storage engine 303, improving the stability and optimizing the performance of the system.
In cold data application scenarios, some applications may be cost-sensitive yet have high write-performance requirements, and the cold-volume read-write mode 300 of fig. 3, with its strategy of writing directly and persistently into the erasure code online storage engine 303, may not completely satisfy them, because writing to the erasure code online storage engine 303 has higher latency than writing to the replica nodes 304, especially in small-file scenarios. Therefore, in another embodiment, a new cold-volume read-write mode is provided.
FIG. 4 is a schematic diagram illustrating a cold volume mode in another data storage system according to an exemplary embodiment of the present disclosure.
Referring to fig. 4, in the read-write mode 400 of the cold volume, the metadata information of the original data acquired by the client 401 is stored on a metadata node 402 of the metadata management apparatus, specifically in the configured cold volume. The data is first written into the replica nodes 403, and the write returns immediately upon success; the dirty data in the replica nodes 403 is then asynchronously written into the erasure code online storage engine 404 by a pre-constructed asynchronous sink service. This write strategy can effectively satisfy an application's high-performance write scenarios for cold data.
In the embodiment of fig. 4, a supporting asynchronous sink service is required and a dirty-data flag is introduced; meanwhile, if an already-sunk file is appended to, it must be sunk again, which adds a certain implementation complexity. Moreover, since replica nodes generally employ high-performance disks (such as SSDs), their capacity is generally limited; if the application-layer write speed exceeds the data-sink speed, the replica nodes easily fill up, making the service unwritable. Therefore, in an actual cold data application scenario, the cold-volume read-write mode 300 of fig. 3 or the cold-volume read-write mode 400 of fig. 4 may be selected according to specific requirements; this example embodiment places no special limitation on the usage scenarios of the two read-write modes.
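The write-back pattern of fig. 4 can be sketched in a few lines of Go. This is a toy under obvious simplifications, with all names illustrative: an in-memory map stands in for the replica-node cache and its dirty-data flags, a print stands in for the EC write, and the re-sinking of appended files and the backpressure concern discussed above are omitted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type sinkService struct {
	mu    sync.Mutex
	dirty map[string][]byte // written to the replica cache but not yet sunk
}

func newSinkService() *sinkService {
	s := &sinkService{dirty: make(map[string][]byte)}
	go s.drain() // the pre-constructed asynchronous sink service
	return s
}

// Write acknowledges as soon as the replica cache holds the data.
func (s *sinkService) Write(key string, data []byte) {
	s.mu.Lock()
	s.dirty[key] = data
	s.mu.Unlock()
	fmt.Println("ack (replica cache):", key)
}

// drain periodically persists dirty entries into the EC engine and clears
// their dirty flags.
func (s *sinkService) drain() {
	for {
		time.Sleep(100 * time.Millisecond)
		s.mu.Lock()
		for key := range s.dirty {
			fmt.Println("sunk to EC engine:", key)
			delete(s.dirty, key)
		}
		s.mu.Unlock()
	}
}

func main() {
	s := newSinkService()
	s.Write("cold/log-0001", []byte("payload"))
	time.Sleep(300 * time.Millisecond) // give the sink a chance to run
}
```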
In an example embodiment, the metadata management apparatus 102 in the data storage system 100 may manage metadata information in the form of volumes; one volume may represent one file system, and a single storage cluster may support multiple namespaces (Name Spaces, i.e., volumes).
Specifically, the metadata of one volume may be divided into a plurality of fixed-length metadata partitions (Meta Partitions, MPs). An MP is responsible for allocating the file identifier (Inode) corresponding to the original data and for persisting metadata information such as the location information of each data block in the file of the original data, file time, and file size; the metadata management apparatus 102 persistently stores the metadata information corresponding to the original data based on these metadata partitions.
In an example embodiment, the metadata management apparatus 102 may be composed of a plurality of metadata nodes (Meta Nodes), and a metadata partition may include at least two metadata nodes, where the at least two metadata nodes persistently store the metadata information corresponding to the original data in a master-slave mode (i.e., Master-Slave, or Leader-Follower, mode).
Specifically, when metadata information corresponding to the original data is persistently stored in master-slave mode among at least two metadata nodes in the metadata management apparatus 102, the multiple copies of the metadata information on the different metadata nodes may be kept consistent based on a data consistency protocol, for example the Raft consensus algorithm. Raft has a strong master node (i.e., the metadata node serving as master), which is generally fully responsible for receiving request commands from the client 101, replicating each command as a log entry to the slave nodes (i.e., the metadata nodes serving as slaves), and committing and executing the log command once it is confirmed safe. When the master fails, a new master is elected. With the help of the strong master, Raft decomposes the consistency problem into three sub-problems: master election, that is, electing a new master node when the existing one fails; log replication, that is, the master receives commands from the client 101, records them as logs, replicates the logs to the other metadata nodes in the cluster, and forces the other nodes' logs to agree with its own; and safety, that is, measures that ensure system correctness, such as guaranteeing that all state machines execute the same commands in the same order. The data consistency protocol may also be the Paxos algorithm, and of course other types of distributed consistency algorithms may also be used; this example embodiment is not particularly limited in this respect.
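The commit rule at the heart of the log-replication sub-problem, namely that an entry counts as committed once a majority of the cluster has stored it, can be shown with a toy Go sketch. This is deliberately not a full Raft (no terms, elections, or failure handling), and every name in it is illustrative:

```go
package main

import "fmt"

type node struct {
	name string
	log  []string
}

// replicate appends a command to the master's log, copies it to the slaves,
// and reports whether a majority of the cluster stored it (the commit rule).
func replicate(master *node, slaves []*node, cmd string) bool {
	master.log = append(master.log, cmd)
	acks := 1 // the master counts toward the majority
	for _, s := range slaves {
		// In a real protocol this is an RPC that can fail and be retried;
		// here every slave simply accepts the entry.
		s.log = append(s.log, cmd)
		acks++
	}
	return acks > (len(slaves)+1)/2
}

func main() {
	master := &node{name: "meta-node-501"}
	slaves := []*node{{name: "meta-node-502"}, {name: "meta-node-503"}}
	if replicate(master, slaves, "create /vol1/file.txt") {
		fmt.Println("entry committed on a majority; safe to apply")
	}
}
```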
Fig. 5 schematically shows a structural diagram of a metadata management apparatus in an exemplary embodiment of the present disclosure.
Referring to fig. 5, in this embodiment the metadata management apparatus 102 is described taking the Raft consensus algorithm as the data consistency protocol. The metadata management apparatus 102 may include a metadata node 501, a metadata node 502, and a metadata node 503, which together form a volume (Name Space). Specifically, the volume may be divided into fixed-length (for example, length 10000) metadata partitions 504 and 505, where metadata partition 504 manages data whose file identifiers fall in the range 10000-20000 and metadata partition 505 manages the range 1-10000. Of course, metadata partitions may also have other lengths and may be divided in a custom manner according to the actual scenario, which is not limited in this exemplary embodiment.
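Routing a file identifier to its metadata partition is then a simple range lookup. A minimal Go sketch following the fig. 5 example, where the half-open ranges and partition IDs are illustrative:

```go
package main

import "fmt"

// metaPartition owns the inodes in the half-open range [start, end).
type metaPartition struct {
	id         int
	start, end uint64
}

func locate(partitions []metaPartition, inode uint64) (metaPartition, bool) {
	for _, mp := range partitions {
		if inode >= mp.start && inode < mp.end {
			return mp, true
		}
	}
	return metaPartition{}, false
}

func main() {
	volume := []metaPartition{
		{id: 505, start: 1, end: 10000},     // fig. 5: MP 505 manages 1-10000
		{id: 504, start: 10000, end: 20000}, // fig. 5: MP 504 manages 10000-20000
	}
	if mp, ok := locate(volume, 12345); ok {
		fmt.Printf("inode 12345 -> metadata partition %d\n", mp.id)
	}
}
```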
For a single metadata partition, taking the metadata partition 504 as an example, its metadata may be stored in the form of 3 copies, specifically one copy on each of the metadata nodes 501, 502, and 503, with data consistency among the copies achieved through the data consistency protocol 506.
Specifically, among the metadata nodes 501, 502, and 503, the metadata node 501 is elected as the master node based on a consensus mechanism, with the metadata nodes 502 and 503 serving as slave nodes. The metadata node 501 receives commands from the client 101, records them as logs, replicates the logs to the metadata nodes 502 and 503, and forces their logs to agree with the master's, thereby achieving data consistency among the copies stored on the three metadata nodes, ensuring data security, and improving system stability.
High availability of the storage service is achieved by combining the multi-copy form within a metadata partition with the data consistency protocol, and optimal storage performance is achieved by combining a full-memory design with snapshot persistence to local disks.
Based on the above, the metadata management apparatus 102 in this exemplary embodiment is decentralized at the system-architecture level, can scale out fully with no single-point bottleneck, and can easily support a billion-level file scale in a single storage cluster, effectively improving the read-write performance of the data storage system, reducing its expansion cost, and improving its data security and stability.
In an example embodiment, the replica service module 1031 in the data storage system 100 may be composed of a plurality of replica nodes (Replica Nodes). A replica node is mainly responsible for the storage and persistence of file fragment data (extents) and carries an availability zone attribute (Availability Zone, AZ); different replica nodes may correspond to different availability zones. A region (Region) refers to a physical data center; each region is completely independent, which provides fault tolerance and stability to the greatest extent, and the region of a resource cannot be changed once the resource has been successfully created. Availability zones are physical areas within the same region whose power and networks are isolated from each other, so one availability zone is not affected by faults in the others. A region can contain multiple availability zones; different availability zones are physically isolated, but their internal networks interconnect, which guarantees the independence of each availability zone while providing low-cost, low-latency network connectivity.
Specifically, at least one replica node in the same availability zone constitutes a node set (Node Set), and at least two node management units belonging to different availability zones constitute a fault domain (as in Virtual SAN). A fault domain is a "management container" or grouping: by defining fault domains, servers in the same cabinet or the same area can be managed together. Hosts that may fail simultaneously can be grouped by creating a fault domain and assigning one or more hosts to it, and the failure of all hosts in a single fault domain is treated as a single failure. Once fault domains are established, multiple copies of the same object are never placed in the same fault domain, so fault domains can handle "rack-level" or "site-level" failures, further ensuring the security and stability of the data storage system.
Fig. 6 schematically shows a structural diagram of a replica service module in an exemplary embodiment of the disclosure.
Referring to fig. 6, in an embodiment, the data storage system 100 may include availability zones 601, 602, 603, and 604. Three replica nodes in availability zone 601 constitute node management unit 1; in availability zone 602, three replica nodes constitute node management unit 2 and another three constitute node management unit 4; in availability zone 603, three replica nodes constitute node management unit 3 and another three constitute node management unit 5; and three replica nodes in availability zone 604 constitute node management unit 6.
Specifically, node management units 1, 2, and 3, belonging respectively to availability zones 601, 602, and 603, constitute fault domain 605, and node management units 4, 5, and 6, belonging respectively to availability zones 602, 603, and 604, constitute fault domain 606.
Generally, multiple copies of the same original data are not stored in the same fault domain, which ensures that under no circumstance does a fault leave no readable copy: at least one copy always remains readable, further ensuring data security and the stability of the data storage system.
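A sketch of this placement rule in Go, choosing target nodes so that no two copies of an object share a fault domain; the node names and fault-domain IDs loosely follow fig. 6 but are otherwise illustrative:

```go
package main

import "fmt"

type replicaNode struct {
	name        string
	faultDomain int
}

// place picks one node per copy, skipping every fault domain already used,
// so that a single fault domain failing can never take out all copies.
func place(nodes []replicaNode, copies int) ([]replicaNode, error) {
	used := make(map[int]bool)
	var chosen []replicaNode
	for _, n := range nodes {
		if used[n.faultDomain] {
			continue
		}
		used[n.faultDomain] = true
		chosen = append(chosen, n)
		if len(chosen) == copies {
			return chosen, nil
		}
	}
	return nil, fmt.Errorf("only %d distinct fault domains, need %d", len(chosen), copies)
}

func main() {
	nodes := []replicaNode{
		{"az1-node1", 605}, {"az2-node2", 605}, {"az2-node4", 606},
		{"az3-node5", 606}, {"az4-node6", 606},
	}
	placed, err := place(nodes, 2)
	if err != nil {
		panic(err)
	}
	for _, n := range placed {
		fmt.Printf("copy placed on %s (fault domain %d)\n", n.name, n.faultDomain)
	}
}
```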
In an example embodiment, a replica node may include a plurality of data partitions (Data Partitions, DPs), and data partitions belonging to different replica nodes form one replica set (generally 3 replicas, though another number of replicas is also possible, which is not limited in this example embodiment).
Specifically, the original data is persistently stored in a replica set in multi-copy form, and a streaming replication protocol (Replication) is used among the copies in the replica set to maintain data consistency.
Fig. 7 schematically illustrates an organization diagram of a mapping relationship among volumes, data partitions, files, and slices in an exemplary embodiment of the present disclosure.
Referring to fig. 7, in an embodiment, a replica node in node management unit 4 (availability zone 2) may include a data partition 701, a replica node in node management unit 5 (availability zone 3) may include a data partition 702, and a replica node in node management unit 6 (availability zone 4) may include a data partition 703; the copies stored in the data partitions 701, 702, and 703 form a replica set 704 (dp2).
A single data partition is actually a directory on the local XFS file system; for example, the replica set 704 (dp2) corresponds to the directory "/data_2/" on the local disk 705. Each fragment of a file is stored, together with the file's information, as a file under that directory, for example extent1 to extent4 on the local disk 707, while the fragment metadata is persistently stored in the metadata management device, for example the fragment metadata information 706, i.e., "File -> [ext1, ext2, ext3]". Fragments have size and file-count limits, which avoids an excessive number of files under a single directory. The overall mapping relationship and organization among volumes, data partitions, files, and fragments are shown in fig. 7.
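The on-disk layout can be illustrated with a tiny Go program that resolves a file's fragment list to paths inside its data partition directory. The "/data_2" directory name follows the fig. 7 example; the fileMeta type and everything else here are illustrative:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// fileMeta mirrors what the metadata side keeps: the file name and the IDs
// of its fragments, e.g. File -> [ext1, ext2, ext3].
type fileMeta struct {
	name    string
	extents []int
}

// extentPath maps a fragment ID to its file inside the data partition
// directory on the local XFS file system.
func extentPath(dataPartitionDir string, extentID int) string {
	return filepath.Join(dataPartitionDir, fmt.Sprintf("extent%d", extentID))
}

func main() {
	meta := fileMeta{name: "warehouse/events.log", extents: []int{1, 2, 3}}
	for _, id := range meta.extents {
		// dp2 corresponds to /data_2 on the local disk of its replica node.
		fmt.Printf("%s -> %s\n", meta.name, extentPath("/data_2", id))
	}
}
```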
In an example embodiment, the data storage system may further include a computing acceleration device, which is mainly used for read acceleration of data.
Specifically, the computing acceleration device may include a primary acceleration unit and a secondary acceleration unit, and is configured to receive a data read request from the client and to read and load the accessed data from the primary acceleration unit and/or the secondary acceleration unit. The two units cooperate: depending on the actual situation, the system may choose to read and load the accessed data from the primary acceleration unit alone, or from both units, which is not limited in this example embodiment.
The primary acceleration unit may include a first computing node, which may be configured to cache accessed data in its local storage unit based on a cache eviction policy (LRU). Under the LRU eviction policy, when cache space is insufficient, data that has not been accessed for a long time is deleted and hot data is retained.
The primary acceleration unit can be regarded as a computing-side acceleration unit: based on the LRU eviction policy, its block cache component caches the application service's recently accessed data using the idle local memory and disk resources of the first computing node.
The secondary acceleration unit may include a second computing node, which may be bound to replica nodes belonging to the same availability zone and is mainly configured to cache one copy of the accessed data in each availability zone based on a multidimensional data eviction policy.
The secondary acceleration unit can be regarded as availability zone (AZ) acceleration, suitable for scenarios in which the computing cluster and the storage cluster are distributed across multiple availability zones and the computing cluster is highly dispersed. Because a replica node carries the AZ attribute, it can be bound to the AZ where a second computing node is located; for example, the data partitions (DPs) on replica nodes in 3 AZs respectively form a 3-copy set. Meanwhile, the data partitions can be given elastic copy-scaling capability, that is, the number of copies can be adjusted based on the size of the computing cluster. When a file is accessed, a cache copy can be stored in each AZ, and the replica nodes release storage space based on a multidimensional data eviction policy, for example an LRU policy based on upper and lower watermarks combined with an eviction policy based on file expiration time (Time To Live, TTL), thereby improving system performance.
The watermark-based LRU policy works as follows: when storage utilization reaches the upper space limit (high watermark), a round of LRU eviction is triggered and space is released until the lower space limit (low watermark) is reached. For example, suppose a replica node has 100 GB of total storage, the high watermark is set to 80%, and the low watermark to 20%; when storage usage exceeds 80 GB (100 × 80%), LRU eviction starts and storage space is released, and when usage falls below 20 GB (100 × 20%) the release ends. This effectively guarantees the read-write performance of the replica nodes.
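A minimal Go sketch of this watermark-driven eviction, using the 100 GB / 80% / 20% figures above. The cache structure is illustrative; a real replica node tracks bytes and access recency rather than whole fixed-size blocks:

```go
package main

import (
	"container/list"
	"fmt"
)

type cache struct {
	capacity, used int        // in GB
	high, low      float64    // watermarks, as fractions of capacity
	lru            *list.List // front = most recently used
	sizes          map[string]int
}

func newCache(capacity int, high, low float64) *cache {
	return &cache{capacity: capacity, high: high, low: low,
		lru: list.New(), sizes: map[string]int{}}
}

func (c *cache) Put(key string, size int) {
	c.lru.PushFront(key)
	c.sizes[key] = size
	c.used += size
	// Crossing the high watermark triggers a round of eviction that only
	// stops once usage has fallen to the low watermark.
	if float64(c.used) > c.high*float64(c.capacity) {
		for float64(c.used) > c.low*float64(c.capacity) && c.lru.Len() > 0 {
			oldest := c.lru.Back()
			k := oldest.Value.(string)
			c.lru.Remove(oldest)
			c.used -= c.sizes[k]
			delete(c.sizes, k)
			fmt.Printf("evicted %s, usage now %d GB\n", k, c.used)
		}
	}
}

func main() {
	c := newCache(100, 0.8, 0.2)
	for i := 0; i < 9; i++ {
		c.Put(fmt.Sprintf("block-%d", i), 10) // the 9th put crosses 80 GB
	}
}
```

Releasing all the way down to the low watermark, rather than stopping just below the high one, is a form of hysteresis: it prevents the node from evicting on every subsequent write once usage hovers near the limit.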
It should be noted that "primary acceleration unit" and "secondary acceleration unit", as well as "first computing node" and "second computing node", are used in this exemplary embodiment only to distinguish the different acceleration units and the computing nodes belonging to them; they carry no ordinal or quantitative meaning and should not impose any special limitation on this exemplary embodiment.
In an example embodiment, the computing acceleration device may include a preheating unit, which may be configured to cache at least one copy of the data an application is expected to access into the target availability zone before the application in the client runs. The preheating unit supports copy elasticity: one or more cache copies can be preheated, configured according to the application scenario. It supports directional preheating, that is, preheating the data to be accessed into the availability zone (AZ) designated by the computing cluster, since replica nodes carry the AZ attribute. It also supports a specified cooling time (TTL): after a computing node's task is finished, the data storage system actively reclaims the storage space based on the expiration time.
Fig. 8 schematically illustrates a frame diagram of a computing acceleration device in an exemplary embodiment of the present disclosure.
Referring to fig. 8, the data storage system may include availability zones 801, 802, 803, and 804, where each of availability zones 801, 802, and 803 includes a computing node and the erasure code online storage engine is deployed in availability zone 804. With the computing cluster and the storage cluster distributed across different availability zones, how to quickly deliver data to the computing engines (computing nodes) so that they operate efficiently, breaking through cross-machine-room bandwidth and latency bottlenecks, becomes a problem that urgently needs to be solved.
Specifically, the primary acceleration unit 805 accelerates at the computing side: based on the LRU eviction policy, its block cache component caches the application service's recently accessed data using the idle local memory and disk resources of the computing nodes, achieving first-level acceleration. When a file is accessed, the acceleration nodes in the secondary acceleration unit 806 may store a cache copy in each availability zone. Meanwhile, before the application in the client runs, the preheating unit may preheat data from the erasure code online storage engine in the other availability zone to the replica nodes of the secondary acceleration unit 806 in each availability zone, effectively improving data read efficiency and the read performance of the data storage system.
The present disclosure thus provides a new data lake storage and acceleration model. It achieves high-performance, highly reliable, and scalable metadata management; hot and cold volumes and hot-cold data tiering suited to different application scenarios; and a two-level directional acceleration capability for data lake computation. It thereby achieves a true balance of cost and performance and solves, at the architecture level, core problems of data lake scenarios such as mass data management, cost, and performance.
In summary, in this exemplary embodiment the client acquires the original data to be stored; the metadata management device then persistently stores the metadata information corresponding to the original data, storing it to a hot volume if the original data is determined to be hot data and to a cold volume if it is determined to be cold data; finally, the original data is stored by a data storage device comprising a replica service module and an erasure code online storage engine, being persistently stored in the replica service module in multi-copy form if the metadata information is in the hot volume and in the erasure code online storage engine if it is in the cold volume. On one hand, the metadata management device stores metadata with the volume as the granularity and sets hot and cold volumes for different application scenarios; when data is stored, the metadata is routed into the appropriate volume according to the data's access heat, which effectively improves metadata management efficiency and metadata write performance. On the other hand, hot data is persistently stored in multi-copy form in the replica service module, a storage unit with higher read-write efficiency, while cold data is stored in the erasure code online storage engine, which effectively guarantees the stability and access efficiency of hot data as well as the continuous-write requirements of cold data, makes full use of the cluster's storage resources, and improves the read-write performance of the system.
The embodiment of the present example also provides a data storage method, which is applied to the data storage system in the above embodiment. Fig. 9 schematically illustrates a flowchart of a data storage method in an exemplary embodiment of the disclosure, and referring to fig. 9, the method may include:
step S910, acquiring original data to be stored;
step S920, if it is determined that the original data is hot data, storing the metadata information to the hot volume and persistently storing the original data in multi-copy form in the replica service module;
step S930, if it is determined that the original data is cold data, storing the metadata information to the cold volume and storing the original data in the erasure code online storage engine;
step S940, if the metadata information is stored in the cold volume, caching the metadata information in the replica service module in the form of at least one copy.
For example, an access frequency threshold may be set: data whose access frequency is greater than or equal to the threshold is regarded as hot data, and data whose access frequency is below the threshold is regarded as cold data. For example, the threshold may be set to 10, that is, 10 stores and/or accesses within one hour. This is only an illustrative example and should not impose any special limitation on this example embodiment.
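A minimal Go sketch of this thresholding, assuming some per-hour access counter is available; the plain map below stands in for whatever counter the system actually maintains:

```go
package main

import "fmt"

// classify labels each data item hot or cold by comparing its access count
// over the last hour against the threshold (10 in the example above).
func classify(accessesLastHour map[string]int, threshold int) map[string]string {
	out := make(map[string]string)
	for key, n := range accessesLastHour {
		if n >= threshold {
			out[key] = "hot" // metadata goes to the hot volume
		} else {
			out[key] = "cold" // metadata goes to the cold volume
		}
	}
	return out
}

func main() {
	counts := map[string]int{"dashboard.parquet": 42, "archive-2019.tar": 1}
	for key, class := range classify(counts, 10) {
		fmt.Printf("%s -> %s\n", key, class)
	}
}
```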
The data storage system specifies whether a volume is a hot volume or a cold volume when the volume is created, and different volumes are created for different business scenarios: if the application scenario is a hot data scenario a hot volume may be created, and if it is a cold data scenario a cold volume may be created.
Specifically, in cold data application scenarios some applications may be cost-sensitive yet have high write-performance requirements, and a write policy that persists directly into the erasure code online storage engine may not completely satisfy them, because writing to the erasure code online storage engine has higher latency than writing to the replica nodes, especially in small-file scenarios. Therefore, in another embodiment, if the metadata information is stored on a cold volume, the metadata information may be cached in the replica service module in the form of at least one copy, for example 1 to 3 copies.
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the data storage method is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method, or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 1000 according to such an embodiment of the present disclosure is described below with reference to fig. 10. The electronic device 1000 shown in fig. 10 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the electronic device 1000 is embodied in the form of a general-purpose computing device. The components of the electronic device 1000 may include, but are not limited to: at least one processing unit 1010, at least one memory unit 1020, a bus 1030 connecting different system components (including the memory unit 1020 and the processing unit 1010), and a display unit 1040. The memory unit stores program code executable by the processing unit 1010 to cause the processing unit 1010 to perform the steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above in this specification. For example, the processing unit 1010 may execute step S910 shown in fig. 9, acquiring original data to be stored; step S920, if it is determined that the original data is hot data, storing the metadata information to the hot volume and persistently storing the original data in multi-copy form in the replica service module; step S930, if it is determined that the original data is cold data, storing the metadata information to the cold volume and storing the original data in the erasure code online storage engine; and step S940, if the metadata information is stored in the cold volume, caching the metadata information in the replica service module in the form of at least one copy.
The memory unit 1020 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) 1021 and/or a cache memory unit 1022, and may further include a read-only memory unit (ROM) 1023. The storage unit 1020 may also include a program/utility 1024 having a set (at least one) of program modules 1025, such program modules 1025 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
Bus 1030 may be any one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, and a local bus using any of a variety of bus architectures.
The electronic device 1000 may also communicate with one or more external devices 1070 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1000, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1000 to communicate with one or more other computing devices. Such communication may be through an Input Output (IO) interface 1050. Also, the electronic device 1000 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 1060. As shown, the network adapter 1060 communicates with the other modules of the electronic device 1000 over the bus 1030.
It should be understood that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1000, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others. Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product including program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device, for example, any one or more of the steps in fig. 3 to 8 may be performed.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing.
Furthermore, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computing device (e.g., through the Internet using an Internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles, including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (15)

1. A data storage system, comprising:
a client configured to acquire original data to be stored;
a metadata management device comprising a hot volume and a cold volume and configured to persistently store metadata information corresponding to the original data, wherein the metadata information is stored to the hot volume if the original data is determined to be hot data, and to the cold volume if the original data is determined to be cold data; and
a data storage device comprising a replica service module and an erasure-code online storage engine and configured to store the original data, wherein the original data is persistently stored in the replica service module in a multi-replica form if the metadata information is stored in the hot volume, and is stored in the erasure-code online storage engine if the metadata information is stored in the cold volume.
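As a concrete illustration of this write path, the following minimal Python sketch routes a write according to a hot/cold classification. It is an editorial aid rather than the patented implementation; every name in it (DataStorageSystem, the injected dependencies, the access-rate threshold) is an assumption.

```python
# A hypothetical write path for the claimed system: metadata is persisted
# to a hot or cold volume, and the payload is persisted either as multiple
# replicas (hot) or as erasure-coded stripes (cold). All names and the
# classification rule are assumptions for illustration only.

HOT_THRESHOLD = 10.0  # assumed accesses/day above which data counts as hot


class DataStorageSystem:
    def __init__(self, hot_volume, cold_volume, replica_service, ec_engine):
        self.hot_volume = hot_volume        # metadata volume for hot data
        self.cold_volume = cold_volume      # metadata volume for cold data
        self.replica_service = replica_service  # multi-replica payload store
        self.ec_engine = ec_engine          # erasure-code online storage engine

    def put(self, key: str, data: bytes, access_rate: float) -> None:
        metadata = {"key": key, "size": len(data)}
        if access_rate >= HOT_THRESHOLD:           # original data is hot
            self.hot_volume.persist(metadata)      # metadata -> hot volume
            self.replica_service.store(key, data)  # payload -> replicas
        else:                                      # original data is cold
            self.cold_volume.persist(metadata)     # metadata -> cold volume
            self.ec_engine.store(key, data)        # payload -> erasure coded
```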
2. The system of claim 1, wherein the metadata management device manages the metadata information in the form of volumes, each volume comprising a plurality of fixed-length metadata partitions, and the metadata management device persistently stores the metadata information corresponding to the original data based on the metadata partitions.
3. The system of claim 2, wherein the metadata management device is composed of a plurality of metadata nodes, the metadata partition comprises at least two metadata nodes, and the at least two metadata nodes persistently store the metadata information corresponding to the original data in a master-slave mode.
4. The system of claim 3, wherein multiple copies of the metadata information are maintained across the at least two metadata nodes based on a data consistency protocol.
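For readers unfamiliar with partitioned metadata services, the sketch below shows one way a fixed-length partition with a master node and slave nodes could acknowledge a write only after a majority of its nodes have persisted it. The names and the majority-quorum rule are assumed stand-ins for whatever data consistency protocol an implementation actually chooses.

```python
# Hypothetical metadata partition (names and quorum rule are assumptions):
# a volume holds a fixed number of fixed-length partitions; each partition
# is served by one master node and several slave nodes, and a write is
# acknowledged only once a majority of the partition's nodes persist it.

PARTITIONS_PER_VOLUME = 16  # assumed fixed partition count per volume


class MetadataPartition:
    def __init__(self, master, slaves):
        self.master = master    # node that accepts writes first
        self.slaves = slaves    # nodes that replicate from the master

    def write(self, entry) -> bool:
        self.master.persist(entry)  # master persists the entry first
        acks = sum(1 for slave in self.slaves if slave.replicate(entry))
        # Majority quorum over all nodes (the master counts as one ack).
        return 2 * (1 + acks) > 1 + len(self.slaves)


def partition_for(volume, key: str):
    # Route a metadata key to its partition within the volume by hashing.
    return volume[hash(key) % PARTITIONS_PER_VOLUME]
```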
5. The system of claim 1, wherein the replica service module is composed of a plurality of replica nodes, the replica nodes corresponding to different availability zones;
at least one replica node within the same availability zone forms a node management unit, and at least two node management units belonging to different availability zones form a fault domain.
6. The system of claim 5, wherein each replica node comprises a plurality of data partitions, data partitions belonging to different replica nodes form a replica set, the original data is persistently stored in the replica set in a multi-replica manner, and a streaming replication protocol is adopted among the replicas in the replica set to maintain data consistency.
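The zone-aware placement recited in claims 5 and 6 can be pictured with the following sketch. The ReplicaNode type, the three-replica default, and the primary-then-followers streaming loop are illustrative assumptions, not the patent's protocol.

```python
# Illustrative placement logic: a replica set is formed by picking one
# replica node from each of several distinct availability zones, so losing
# a single zone never loses all copies; writes then stream from the first
# replica (primary) to the remaining followers.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplicaNode:
    node_id: str
    zone: str                                   # availability zone of the node
    log: List[bytes] = field(default_factory=list)

    def append(self, record: bytes) -> None:
        self.log.append(record)                 # persist one replicated record


def build_replica_set(nodes: List[ReplicaNode], replicas: int = 3):
    """Choose `replicas` nodes, each from a distinct availability zone."""
    chosen, zones = [], set()
    for node in nodes:
        if node.zone not in zones:
            chosen.append(node)
            zones.add(node.zone)
            if len(chosen) == replicas:
                return chosen
    raise RuntimeError("not enough distinct availability zones")


def stream_write(replica_set: List[ReplicaNode], record: bytes) -> None:
    # A stand-in for the streaming replication protocol: the primary
    # appends first, then streams the record to each follower in order.
    primary, *followers = replica_set
    primary.append(record)
    for follower in followers:
        follower.append(record)
```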
7. The system of claim 1, wherein the data storage device is further configured to:
if the metadata information is stored in the cold volume, cache the metadata information in the replica service module in the form of at least one copy.
8. The system of any of claims 1 to 7, wherein the data storage system further comprises:
a computing acceleration device comprising a primary acceleration unit and a secondary acceleration unit and configured to receive a data read request from the client and to read and load the accessed data from the primary acceleration unit and/or the secondary acceleration unit.
9. The system of claim 8, wherein the primary acceleration unit comprises a first computing node configured to cache the accessed data in a local storage unit based on a cache eviction policy.
10. The system of claim 9, wherein the secondary acceleration unit comprises a second computing node bound to the replica nodes belonging to the same availability zone and configured to cache a copy of the accessed data in each availability zone based on a multidimensional data eviction policy.
11. The system of claim 8, wherein the computing acceleration device comprises a preheating unit configured to cache, before an application program in the client runs, at least one copy of the data to be accessed by the application program to a target availability zone.
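The two-level read acceleration of claims 8 to 11 resembles a classic tiered cache. The sketch below is a hedged approximation: TwoTierCache, the LRU first level built on OrderedDict, and the prewarm method are editorial inventions chosen to mirror the claimed primary unit, secondary unit, and preheating unit; the injected zone_cache and backend objects are hypothetical interfaces.

```python
# An assumed two-level read path: try the node-local LRU cache first, then
# the per-availability-zone cache, then the storage backend; prewarm() fills
# the zone cache before an application starts.

from collections import OrderedDict


class TwoTierCache:
    def __init__(self, backend, zone_cache, capacity=1024):
        self.l1 = OrderedDict()       # first-level cache on local storage
        self.zone_cache = zone_cache  # second-level cache shared per zone
        self.backend = backend        # durable storage (replicas / EC)
        self.capacity = capacity      # LRU capacity of the first level

    def read(self, key):
        if key in self.l1:                    # first-level hit
            self.l1.move_to_end(key)
            return self.l1[key]
        value = self.zone_cache.get(key)      # second-level lookup
        if value is None:
            value = self.backend.read(key)    # fall through to storage
            self.zone_cache.put(key, value)   # populate the zone cache
        self.l1[key] = value                  # promote into the local LRU
        if len(self.l1) > self.capacity:
            self.l1.popitem(last=False)       # evict least-recently-used
        return value

    def prewarm(self, keys):
        # Cache an application's expected working set in the target zone
        # before the application runs (cf. the preheating unit).
        for key in keys:
            self.zone_cache.put(key, self.backend.read(key))
```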
12. A data storage method applied to the data storage system of any one of claims 1 to 11, the method comprising:
acquiring original data to be stored;
if the original data is determined to be hot data, storing the metadata information to the hot volume, and persistently storing the original data in the replica service module in a multi-replica form;
and if the original data is determined to be cold data, storing the metadata information to the cold volume, and storing the original data in the erasure-code online storage engine.
13. The method of claim 12, further comprising:
and if the metadata information is stored in the cold volume, caching the metadata information in the replica service module in the form of at least one copy.
14. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of claim 12 or 13.
15. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of claim 12 or 13 via execution of the executable instructions.
CN202111131243.5A 2021-09-26 2021-09-26 Data storage system, data storage method, readable medium, and electronic device Pending CN113901024A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111131243.5A CN113901024A (en) 2021-09-26 2021-09-26 Data storage system, data storage method, readable medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111131243.5A CN113901024A (en) 2021-09-26 2021-09-26 Data storage system, data storage method, readable medium, and electronic device

Publications (1)

Publication Number Publication Date
CN113901024A true CN113901024A (en) 2022-01-07

Family

ID=79029383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111131243.5A Pending CN113901024A (en) 2021-09-26 2021-09-26 Data storage system, data storage method, readable medium, and electronic device

Country Status (1)

Country Link
CN (1) CN113901024A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114518848A (en) * 2022-02-15 2022-05-20 北京百度网讯科技有限公司 Hierarchical storage system, and method, apparatus, device, and medium for processing storage data
CN114265814A (en) * 2022-03-01 2022-04-01 天津安锐捷技术有限公司 Data lake file system based on object storage
CN114265814B (en) * 2022-03-01 2022-06-07 天津安锐捷技术有限公司 Data lake file system based on object storage
CN114879910A (en) * 2022-05-13 2022-08-09 苏州思萃工业互联网技术研究所有限公司 Distributed block storage bottom layer read-write system and method
CN114879910B (en) * 2022-05-13 2023-10-13 苏州思萃工业互联网技术研究所有限公司 Distributed block storage bottom layer read-write system and method
CN114710483A (en) * 2022-06-08 2022-07-05 杭州比智科技有限公司 Unified transmission method and system based on multi-cloud object storage
CN116112512A (en) * 2022-12-28 2023-05-12 中国人寿保险股份有限公司上海数据中心 Distributed storage system based on fault domain
CN116910310A (en) * 2023-06-16 2023-10-20 广东电网有限责任公司佛山供电局 Unstructured data storage method and device based on distributed database
CN116910310B (en) * 2023-06-16 2024-02-13 广东电网有限责任公司佛山供电局 Unstructured data storage method and device based on distributed database
CN117240873A (en) * 2023-11-08 2023-12-15 阿里云计算有限公司 Cloud storage system, data reading and writing method, device and storage medium
CN117240873B (en) * 2023-11-08 2024-03-29 阿里云计算有限公司 Cloud storage system, data reading and writing method, device and storage medium
CN118093596A (en) * 2024-04-18 2024-05-28 杭州拓数派科技发展有限公司 Heterogeneous storage-oriented cold and hot data management method and system

Similar Documents

Publication Publication Date Title
CN113901024A (en) Data storage system, data storage method, readable medium, and electronic device
US11153380B2 (en) Continuous backup of data in a distributed data store
JP6522812B2 (en) Fast Crash Recovery for Distributed Database Systems
US10579610B2 (en) Replicated database startup for common database storage
US10489422B2 (en) Reducing data volume durability state for block-based storage
US11775432B2 (en) Method and system for storage virtualization
US20170329528A1 (en) Efficient data volume replication for block-based storage
US10289496B1 (en) Parallel proxy backup methodology
CA2893304C (en) Data storage method, data storage apparatus, and storage device
US10852996B2 (en) System and method for provisioning slave storage including copying a master reference to slave storage and updating a slave reference
EP4139781B1 (en) Persistent memory architecture
US10244069B1 (en) Accelerated data storage synchronization for node fault protection in distributed storage system
US11262918B1 (en) Data storage system with uneven drive wear reduction
US11409454B1 (en) Container ownership protocol for independent node flushing
JP2021144748A (en) Distributed block storage system, method, apparatus, device, and medium
US11579983B2 (en) Snapshot performance optimizations
US20240103744A1 (en) Block allocation for persistent memory during aggregate transition
US12014201B2 (en) Policy enforcement and performance monitoring at sub-LUN granularity
US11875060B2 (en) Replication techniques using a replication log
US11586376B2 (en) N-way active-active storage configuration techniques
US11327895B1 (en) Protocol for processing requests that assigns each request received by a node a sequence identifier, stores data written by the request in a cache page block, stores a descriptor for the request in a cache page descriptor, and returns a completion acknowledgement of the request
US11681443B1 (en) Durable data storage with snapshot storage space optimization
US8356016B1 (en) Forwarding filesystem-level information to a storage management system
KR20150061314A (en) Method and System for recovery of iSCSI storage system used network distributed file system
US11899534B2 (en) Techniques for providing direct host-based access to backup data using a proxy file system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination