WO2023070935A1 - Data storage method and apparatus, and related device - Google Patents

Data storage method and apparatus, and related device

Info

Publication number
WO2023070935A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
copies
AZs
file system
partition
Prior art date
Application number
PCT/CN2021/142795
Other languages
French (fr)
Chinese (zh)
Inventor
Avaru Kanaka Kumar
Pankaj Kumar
Renuka Prasad Channabasappa
Kai Mo
Original Assignee
Huawei Cloud Computing Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co., Ltd.
Priority to CN202180006938.2A (published as CN116635831A)
Publication of WO2023070935A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation

Definitions

  • Embodiments of the present application relate to the technical field of databases, and in particular to a data storage method and apparatus, and a related device.
  • A data processing platform is used to provide users with data read and write services, such as data storage and data reading, and may include a distributed database and a file system.
  • A distributed database, such as an HBase database, usually includes a master node and multiple region server (RS) nodes.
  • The master node is used to assign to each RS node the partitions (regions) to which the data that the RS node is responsible for reading and writing belongs; the number of partitions assigned to each RS node may be one or more.
  • The RS node is used to write new data to the file system, or to feed back to the user the data that the user requests from the file system, so as to realize the data read and write services.
  • Embodiments of the present application provide a data storage method, so that when a disaster occurs, the quality of the data read and write services that the data processing platform provides to users can be maintained at a relatively high level.
  • The present application also provides corresponding apparatuses, computing devices, computer-readable storage media, and computer program products.
  • In a first aspect, an embodiment of the present application provides a data storage method, which can be applied to a data processing platform including multiple availability zones (AZs). When storing data, the data processing platform first obtains multiple copies of the data to be stored, and stores the multiple copies in different AZs of the data processing platform.
  • Because the data processing platform stores multiple copies of the data to be stored in different AZs, and the physical distance between different AZs is usually large, a natural disaster or other disaster that makes some AZs unavailable usually does not affect the other AZs. The data processing platform can therefore continue to provide users with data read and write services based on the copies stored in the AZs that are still running normally, preventing the disaster from reducing the quality of the platform's data read and write services. Moreover, based on this data storage method, the allowable service interruption time of the data processing platform can be 0, that is, the recovery time objective can reach 0; and the maximum data loss that the data processing platform can tolerate can also be 0, that is, the recovery point objective can be 0.
  • In addition, the data processing platform stores the multiple copies of the data to be stored in different AZs instead of in one AZ, so that after some AZs become unavailable, it is impossible for all copies of the data to be lost, which improves the reliability of the data stored on the data processing platform.
  • In a possible implementation, the data to be stored includes target data and/or the partition to which the target data belongs. When the data processing platform stores multiple copies of the data to be stored, it specifically stores multiple data copies of the target data in different AZs of the data processing platform, and/or stores multiple partition copies of the partition to which the target data belongs in different AZs of the data processing platform.
  • In this way, when some AZs are unavailable, the data processing platform can obtain the data copies and/or partition copies from other AZs, and continue to use them to provide users with data read and write services, improving the reliability of the data read and write services provided by the data processing platform.
  • In a possible implementation, the data to be stored includes the target data and the partition to which the target data belongs, the data processing platform includes a distributed database and a file system, the distributed database includes region server (RS) nodes under multiple AZs, and the file system includes data nodes under multiple AZs.
  • When the data processing platform stores multiple copies of the data to be stored in different AZs, it may specifically store the multiple partition copies to RS nodes under different AZs in the distributed database, and store the multiple data copies to data nodes under different AZs in the file system. In this way, the reliability of the data read and write services provided by the data processing platform can be improved.
  • The multiple AZs under the distributed database and the multiple AZs under the file system may overlap (all or some of the AZs may be the same); for example, the distributed database and the file system may span the same set of AZs.
  • Alternatively, the multiple AZs under the distributed database may not overlap with the multiple AZs under the file system; for example, the distributed database may include AZ1 to AZ5 while the file system includes AZ6 to AZ10. In this way, the reliability of the data read and write services provided by the data processing platform can be further improved.
  • In a possible implementation, when the data processing platform stores the multiple data copies to data nodes under different AZs in the file system, it can first obtain the physical distances between the different AZs in the file system, and, according to these physical distances, determine multiple first AZs in the file system whose mutual physical distances do not exceed a distance threshold (such as 40 kilometers), so that the data processing platform can store the multiple data copies to the data nodes under the multiple first AZs. In this way, when some AZs fail and the data processing platform reads data copies from other AZs, the other AZs are relatively close, so the data processing platform can read the data copies quickly and the latency of obtaining a data copy is kept low.
  • In a possible implementation, when the data processing platform stores the multiple data copies to data nodes under different AZs in the file system, it can obtain the availability of each AZ in the file system, which may be determined, for example, by the ratio of the available data nodes in the AZ to all data nodes in the AZ. The data processing platform can then, according to the availability of each AZ, store the multiple data copies to data nodes under multiple first AZs in the file system, where the file system further includes at least one second AZ with lower availability, that is, the availability of the second AZ is lower than that of the first AZs.
  • In this way, the data processing platform preferentially selects data nodes under the first AZs with higher availability to store the data copies, improving the reliability with which the data processing platform reads and writes data.
  • In a possible implementation, when the data processing platform stores the multiple data copies to data nodes under different AZs in the file system, it can obtain the availability of each AZ in the file system, determined for example by the ratio of available data nodes to all data nodes in the AZ, and, according to the availability of each AZ, store some of the multiple data copies to data nodes under multiple first AZs in the file system.
  • The file system also includes at least one second AZ whose availability is lower than that of the first AZs. When the availability of the at least one second AZ rises to a preset threshold, the data processing platform may store the remaining data copies to the data nodes under the at least one second AZ. In this way, when storing multiple data copies, it is possible to avoid migrating the data storage tasks of AZs with low availability to other AZs, and thus avoid increasing the load of the other AZs.
  • In a possible implementation, when the data processing platform stores the multiple partition copies to RS nodes under different AZs in the distributed database, it may specifically obtain allocation indication information for the multiple partition copies, where the allocation indication information indicates the proportions of the multiple partition copies to be stored in different AZs, so that the data processing platform can store the multiple partition copies to RS nodes under different AZs in the distributed database according to the allocation indication information.
  • The allocation indication information may be pre-configured by technicians or users, or may be automatically generated by the data processing platform. In this way, the data processing platform can implement cross-AZ storage of the multiple partition copies according to the allocation indication information.
  • Further, the allocation indication information may be determined according to the loads of the RS nodes under each AZ in the distributed database. In this way, when the data processing platform stores the multiple partition copies according to the allocation indication information, it can balance the loads of the multiple AZs.
  • In a possible implementation, the distributed database and the file system both include multiple target AZs, and the multiple target AZs already store the multiple partition copies of the partition to which the target data belongs. When storing the multiple data copies, the data processing platform can track the multiple target AZs that store the partition copies, and store the multiple data copies to data nodes under different target AZs.
  • In this way, the distributed database can read the target data based on a partition copy from a local data node (that is, a data node in the AZ where the partition copy is located), reducing the latency of reading data and improving the efficiency with which the data processing platform feeds data back to users.
  • In a possible implementation, the data processing platform can perform load balancing on the RS nodes under different AZs in the distributed database to adjust the number of partition copies stored in different AZs. In this way, load balancing across RS nodes is realized, and an excessive load on some RS nodes is prevented from degrading the overall quality of the data read and write services provided by the data processing platform.
  • In a second aspect, an embodiment of the present application provides a data storage device.
  • The device has functions corresponding to each implementation manner of the above first aspect.
  • This function may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • In a third aspect, the present application provides a computing device, where the computing device includes a processor and a memory.
  • the processor and the memory communicate with each other.
  • the processor is configured to execute instructions stored in the memory, so that the computing device executes the data storage method in the first aspect or any implementation manner of the first aspect.
  • In a fourth aspect, the present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a computing device, the computing device executes the data storage method in the first aspect or any implementation manner of the first aspect.
  • In a fifth aspect, the present application provides a computer program product containing instructions which, when run on a computing device, cause the computing device to execute the data storage method described in the first aspect or any implementation manner of the first aspect.
  • FIG. 1 is a schematic diagram of the architecture of an exemplary data processing platform 100 of the present application
  • FIG. 2 is a schematic structural diagram of a cluster 200 constructed across availability zones
  • FIG. 3 is a schematic flow diagram of a data storage method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of storing multiple data copies provided by the embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a data storage device provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a hardware structure of a computing device provided by an embodiment of the present application.
  • the data processing platform 100 includes a distributed database 101 and a file system 102 .
  • the file system 102 can be used to persistently store data in the form of files
  • the distributed database 101 can be used to manage the data in the file system 102, including reading, writing, and merging of data.
  • the file system 102 includes multiple data nodes (datanodes), and different data nodes may belong to different availability zones (availability zones, AZ).
  • For ease of illustration, the file system 102 is shown as including data nodes 1021 to 1024, where data nodes 1021 and 1022 belong to AZ1 and data nodes 1023 and 1024 belong to AZ2.
  • An AZ usually refers to a collection of one or more physical data centers with independent ventilation, fire protection, water, and electricity facilities; within an AZ, computing, network, storage, and other resources can be logically divided into multiple clusters.
  • the file system 102 may be, for example, a distributed file system (distributed file system, DFS), a Hadoop distributed file system (hadoop distributed file system, HDFS), etc., which are not limited in this embodiment.
  • The file system 102 may also include a named node (namenode), not shown in FIG. 1, which may also be called a master node and is used to manage the multiple data nodes, including managing the namespace and the metadata that records the data stored in each data node.
  • the distributed database 101 includes a master node 1011 and multiple partition server (region server, RS) nodes, and different RS nodes belong to different availability zones.
  • For ease of illustration, the distributed database 101 is shown as including RS node 1012 and RS node 1013, where RS node 1012 belongs to AZ1 and RS node 1013 belongs to AZ2.
  • The master node 1011 is used to divide the data managed by the distributed database 101 (that is, the data stored in the file system 102) to obtain multiple partitions, where each partition includes one or more data identifiers, and the data belonging to different partitions usually differs.
  • During specific partitioning, when the distributed database 101 manages a piece of data, part of the content of that piece of data can be used as its primary key, which uniquely identifies the piece of data in the distributed database 101. The master node 1011 can then divide the possible value range of the primary key into intervals, with each interval corresponding to one partition.
  • For example, assuming the value range of the primary key is [0, 1000000], the master node 1011 can divide it into 100 intervals: [0, 10000), [10000, 20000), ..., [980000, 990000), [990000, 1000000]. Each partition can be used to index 10,000 pieces of data, so based on the 100 partitions the distributed database 101 can manage 1 million pieces of data.
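  • As a hedged illustration of the interval division described above (the function name and the clamping of the closed last interval are our assumptions, not details from the patent), the mapping from a numeric primary key to a partition index can be sketched as follows:

```python
# Illustrative sketch: range-based partitioning of a primary key, following
# the example of 100 partitions of width 10000 over [0, 1000000].
PARTITION_WIDTH = 10_000
NUM_PARTITIONS = 100

def partition_for_key(primary_key: int) -> int:
    """Return the index of the partition whose interval contains the key."""
    if not 0 <= primary_key <= NUM_PARTITIONS * PARTITION_WIDTH:
        raise ValueError("primary key outside the managed range")
    # The last interval [990000, 1000000] is closed on the right, so clamp.
    return min(primary_key // PARTITION_WIDTH, NUM_PARTITIONS - 1)

assert partition_for_key(0) == 0           # falls in [0, 10000)
assert partition_for_key(15_000) == 1      # falls in [10000, 20000)
assert partition_for_key(1_000_000) == 99  # falls in [990000, 1000000]
```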
  • the master node 1011 can also achieve high availability through a distributed application coordination service (such as zookeeper service, etc.) 1014, as shown in FIG. 1 .
  • the master node 1011 is also used to allocate partitions for the RS nodes 1012 and 1013 , and the partitions assigned to each RS node can be maintained through the management table created by the master node 1011 .
  • RS node 1012 and RS node 1013 are respectively used to perform data read and write services for data belonging to different partitions. As shown in FIG. 1, RS node 1012 performs data read and write services for data belonging to partition 1 to partition N, while RS node 1013 performs data read and write services for data belonging to partition N+1 to partition M.
  • the master node 1011, the RS node 1012 and the RS node 1013 can all be implemented by hardware or software.
  • both the master node 1011 and multiple RS nodes may be physical servers in the distributed database 101 . That is, during actual deployment, at least one server in the distributed database 101 may be configured as the master node 1011, and other servers in the distributed database 101 may be configured as RS nodes.
  • the master node 1011 and each RS node are implemented by software.
  • the master node 1011 and multiple RS nodes may be processes or virtual machines running on one or more devices (such as servers, etc.).
  • the data processing platform 100 shown in FIG. 1 is only used as an exemplary illustration, and is not used to limit the specific implementation of the data processing platform.
  • For example, the distributed database 101 may include any number of master nodes and RS nodes, or the RS nodes in the distributed database 101 and the data nodes in the file system 102 may belong to different AZs, which is not limited in this application.
  • During actual application, the distributed database 101 can be respectively connected with the file system 102 and the client 103, for example through a network communication protocol such as the hypertext transfer protocol (HTTP).
  • During data writing, the client 103 can send a data write request to the RS node 1012, where the data write request carries the data to be written and the corresponding data processing operation (such as a write operation or a modification operation).
  • The RS node 1012 can parse the received data write request, generate a corresponding data processing record based on the data to be written and the data processing operation, and write the data processing record into a pre-created write-ahead log (WAL) file. After determining that the write to the WAL file succeeded, the RS node 1012 persistently stores the WAL file to the file system 102, and inserts the data to be written in the data processing record into the memory 1 of the RS node 1012.
  • Specifically, the RS node 1012 can first determine the primary key corresponding to the data processing record, and determine the partition into which the data processing record should be written according to the partition interval to which the value of the primary key belongs, so that the RS node 1012 can insert the data to be written in the data processing record into the storage area corresponding to that partition in the memory 1. The RS node 1012 can then feed back to the client 103 that the data write succeeded.
  • the RS node 1012 will write data in the memory 1 for one or more clients 103, so the amount of data temporarily stored in the memory 1 will continue to increase.
  • the RS node 1012 can persistently store the data in the memory 1 to the file system 102, such as persistently storing the data in the memory 1 to the data node 1021 under AZ1.
  • The RS node 1012 is also configured with a region store file for each partition. After persistently storing data to the file system 102, the RS node 1012 can add the files stored in the file system 102 for each partition to the region store file corresponding to that partition.
  • Specifically, the file name corresponding to each piece of data in the partition can be added to the directory of the region store file of the partition.
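  • The write path described above can be summarized with a minimal, hedged sketch: the WAL is appended first, then the memstore is updated, and the memstore is flushed to the file system once it grows. All class and method names below are illustrative assumptions, not APIs defined by the patent.

```python
# Minimal sketch of the WAL-then-memstore write path described above.
class RegionServer:
    def __init__(self, file_system, flush_threshold=3):
        self.wal = []                 # stands in for the persisted WAL file
        self.memstore = {}            # "memory 1" in the text: partition -> records
        self.file_system = file_system
        self.flush_threshold = flush_threshold

    def write(self, partition, key, value):
        record = (partition, key, value)
        self.wal.append(record)       # 1. persist the WAL entry first
        self.memstore.setdefault(partition, []).append((key, value))  # 2. memstore
        if sum(len(v) for v in self.memstore.values()) >= self.flush_threshold:
            self.flush()              # 3. persist memstore data to the file system
        return "write ok"             # 4. acknowledge the client

    def flush(self):
        for partition, records in self.memstore.items():
            # Each flush produces a store file that is added to the
            # partition's store-file directory, as the text describes.
            self.file_system.setdefault(partition, []).append(list(records))
        self.memstore.clear()

fs = {}
rs = RegionServer(fs)
rs.write("partition-1", 42, "hello")
```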
  • When the client 103 needs to read data, it can send a data read request to the RS node 1012, where the data read request carries the primary key of the data to be read.
  • After receiving the data read request, the RS node 1012 can determine the partition according to the value of the primary key corresponding to the data to be read, so that the RS node 1012 can, according to the region store file corresponding to that partition, find the data required by the client 103 from the data nodes under AZ1 and feed it back to the client 103.
  • When the data processing platform 100 is deployed locally, it can serve as a local resource that provides local data read and write services, through the distributed database 101 and the file system 102, to the clients accessing the data processing platform 100.
  • the data processing platform 100 can also be deployed on the cloud, and at this time, the distributed database 101 and the file system 102 can provide cloud services for reading and writing data to clients connected to the cloud.
  • However, when an AZ (such as AZ1 or AZ2) is struck by a natural disaster or another disaster, the disaster may physically damage some or all of the computing devices under that AZ, making the AZ unavailable. As a result, the partitions in the RS nodes under the AZ and/or the data stored in the data nodes under the AZ may be lost or become unreadable, which reduces the quality of the data read and write services of the data processing platform 100; that is, it becomes difficult for users to obtain the data stored in the AZ, affecting the user experience.
  • the embodiment of the present application provides a data storage method, aiming at allocating partitions and/or copies of data to different AZs, so as to improve the reliability of data reading and writing services provided by the data processing platform 100 .
  • the data processing platform 100 may acquire multiple copies of the data to be stored, and store the multiple copies of the data to be stored in different AZs of the data processing platform 100 .
  • the data to be stored may be data newly written by the user and/or the partition to which the data belongs, for example.
  • In this way, because the data processing platform 100 replicates the data to be stored and stores different copies in different AZs, when AZ1 (or AZ2) becomes unavailable due to a natural disaster or another disaster, AZ2 (or AZ1) is usually unaffected because the physical distance between different AZs is usually large. The data processing platform 100 can therefore continue to provide users with data read and write services based on the copies of the data stored in the AZ that is still running normally, preventing the disaster from reducing the quality of the data read and write services provided by the data processing platform 100.
  • Moreover, the allowable service interruption time of the data processing platform 100 can be 0, that is, the recovery time objective (RTO) can be 0; and the maximum data loss that the data processing platform 100 can tolerate can also be 0, that is, the recovery point objective (RPO) can be 0.
  • In addition, the data processing platform 100 stores multiple copies of the data to be stored in different AZs rather than in one AZ, so that after some AZs become unavailable, it is impossible for all copies of the data to be lost, which improves the reliability of data storage on the data processing platform 100.
  • During specific implementation, while the data processing platform 100 stores multiple copies of the data to be stored, the distributed database 101 obtains multiple partition copies of the partition to which the data belongs, for example by replicating the partition. The distributed database 101 then stores the multiple partition copies to RS nodes under different AZs included in the distributed database 101, for example storing some partition copies to the RS node 1012 under AZ1 and the remaining partition copies to the RS node 1013 under AZ2.
  • The file system 102 likewise obtains multiple data copies of the data; for example, the distributed database 101 may replicate the data and send the resulting copies to the file system 102, or the file system 102 may replicate the data sent by the distributed database 101 to obtain the multiple data copies. The file system 102 then stores the multiple data copies to data nodes under different AZs included in the file system 102, for example storing some data copies to the data nodes under AZ1 and the remaining data copies to the data nodes under AZ2.
  • Because the data processing platform 100 stores the multiple data copies in different AZs rather than in one AZ, when some AZs become unavailable, not all data copies of a piece of data can be lost, so the reliability of the data stored by the data processing platform 100 is improved. Similarly, because the data processing platform 100 stores the multiple partition copies in different AZs, when some AZs become unavailable, not all partition copies corresponding to a piece of data can be lost, so the quality of the data reads that the data processing platform 100 performs based on partition copies is maintained at a high level.
  • When the computing devices (including RS nodes, data nodes, etc.) under multiple AZs are deployed as a cluster, the data processing platform 100 can utilize the cluster to provide reliable data read and write services.
  • the deployed cluster can be shown in FIG. 2 , and the cluster 200 can include computing devices under AZ1 and AZ2.
  • the cluster 200 can include RS nodes 1012, data nodes 1021, and data nodes 1022 under AZ1.
  • In addition, AZ1 includes other nodes, such as a named node 1031 for updating the namespace of the cluster 200 and recording the metadata of the data stored in the data nodes; a journal node (journalnode) 1032 for storing and managing logs; a resource management device 1033 for managing the resources in AZ1; node management devices 1034 and 1035 for managing data nodes; and a master server 1036 (such as an HMaster) for monitoring and managing the RS node 1012. Moreover, a distributed application coordination service may also be configured in AZ1 to improve the high availability of AZ1. AZ2 has a similar configuration to AZ1, as shown in detail in FIG. 2.
  • During actual application, the computing devices under AZ1 and the computing devices under AZ2 can serve as active and standby for each other.
  • For example, when the named node 1031 under AZ1 fails, the named node 1041 under AZ2 can take over updating the namespace of the cluster 200, and so on.
  • It should be noted that the cluster 200 shown in FIG. 2 is deployed based on the data processing platform 100 shown in FIG. 1, and may further include the computing devices under AZ3 shown by dotted lines; this embodiment does not limit the specific architecture of a cluster deployed across AZs. Moreover, multiple clusters may be deployed across AZs in the data processing platform 100, with different clusters including computing devices under different AZs. For example, assuming the data processing platform 100 includes six AZs, namely AZ1 to AZ6, one cluster may be deployed based on AZ1 to AZ3 and another cluster based on AZ4 to AZ6, which is not limited in this embodiment.
  • Further, a standby cluster 300 may also be deployed for the cluster 200, so as to replicate and store the data stored in the cluster 200 (including partitions, data written to the file system 102, etc.), further improving the reliability of the data read and write services provided by the cluster 200.
  • an asynchronous replication method may be adopted between the cluster 200 and the cluster 300 to replicate data in the cluster 200 to the cluster 300 .
  • FIG. 3 is a schematic flowchart of a data storage method in an embodiment of the present application.
  • the method can be applied to the data processing platform 100 shown in FIG. 1 above, and specifically can be executed by the distributed database 101 and the file system 102 . Alternatively, the method may also be executed by a device separately configured in the data processing platform 100, which is not limited in this embodiment.
  • For ease of description, the following takes as an example the case where the data to be stored includes target data (such as the new data provided by the above-mentioned user) and the partition to which the target data belongs, and the distributed database 101 and the file system 102 execute the data storage method.
  • the data storage method shown in Figure 3 may specifically include:
  • the distributed database 101 acquires multiple data copies of the target data, and multiple partition copies of the partition to which the target data belongs.
  • the target data acquired by the distributed database 101 may be, for example, new data provided by the user to the data processing platform 100, or data generated by the data processing platform 100 based on the user's modification operation on the data.
  • In this embodiment, the data processing platform 100 can write the primary key contained in the target data into the partition, so that the target data can subsequently be managed based on the primary key recorded in the partition; and the data processing platform 100 can also persistently store the target data in the file system 102.
  • Therefore, the distributed database 101 can replicate the target data and the partition to which the target data belongs, so as to obtain multiple data copies of the target data (the target data itself can also be regarded as a data copy) and multiple partition copies (the partition to which the target data belongs can also be regarded as a partition copy).
  • The above takes the distributed database 101 replicating the partition as an example. In other embodiments, the distributed database 101 may execute the replication operation on the partition to which the target data belongs to obtain the multiple partition copies, and send the target data to the file system 102, so that the file system 102 executes the replication operation on the target data to obtain the multiple data copies. This embodiment does not limit this.
  • the distributed database 101 stores multiple partition copies to RS nodes under different AZs included in the distributed database 101.
  • If the distributed database 101 stored all partition copies of the target data in one AZ, then when that AZ became unavailable due to a disaster, it would be difficult for the distributed database 101 to use the partition copies in that AZ to provide users with data read and write services for the target data. Therefore, in this embodiment, the distributed database 101 stores the multiple partition copies in at least two AZs, so that even if one of the AZs is unavailable, the distributed database 101 can still manage the target data through the partition copies stored in the remaining AZs. In this way, the target data in the data processing platform 100 is prevented from becoming unreadable due to the unavailability of a single AZ, and AZ-level data fault tolerance is realized.
  • the number of partition copies allocated to different AZs may be the same or different.
  • For example, when the distributed database 101 includes 3 AZs and the number of partition copies of the partition to which the target data belongs is 3, the distributed database 101 can store the 3 partition copies in the 3 AZs respectively, with each AZ storing one partition copy.
  • Alternatively, when the distributed database 101 stores the 3 partition copies in AZ1 and AZ2 (and not in AZ3), one partition copy can be stored in AZ1 and two partition copies in AZ2, and so on.
  • this embodiment provides the following four implementations for storing multiple partition copies in different AZs:
  • In one implementation, the distributed database 101 may obtain allocation indication information for the multiple partition copies, where the allocation indication information may include AZ identifiers and copy-count proportions and is used to indicate the proportions of the multiple partition copies to be stored in different AZs. In this way, the distributed database 101 can store the multiple partition copies to RS nodes under different AZs in the distributed database according to the allocation indication information.
  • For example, assuming the number of partition copies is 4 and the allocation indication information is the allocation expression "REP: AZ1[0.5], AZ2[0.25], AZ3[0.25]", the distributed database 101 can store 2 (that is, 0.5*4) partition copies in AZ1, 1 (that is, 0.25*4) partition copy in AZ2, and 1 partition copy in AZ3. It is worth noting that among the multiple partition copies written to multiple AZs in the distributed database 101, one partition copy serves as the primary partition copy, based on which the distributed database 101 usually provides users with data read and write services, while the remaining partition copies serve as secondary partition copies, used to continue providing users with data read and write services when the primary partition copy becomes unreadable or its data is lost.
  • Moreover, the distributed database 101 may process the copies of multiple partitions in batches based on the above allocation indication information. For example, assuming there are currently 10 partitions and the number of copies of each partition is 4, then when the allocation indication information is "REP: AZ1[0.5], AZ2[0.25], AZ3[0.25]", the distributed database 101 can store 20 (that is, 0.5*4*10) partition copies in AZ1, 10 (that is, 0.25*4*10) partition copies in AZ2, and 10 partition copies in AZ3, so that each of the 3 AZs stores a copy of each partition.
  • During specific implementation, the allocation indication information may be determined by the distributed database 101 according to the loads of the RS nodes in each AZ. For example, assuming the load of RS node 1012 in AZ1 is 10%, the load of RS node 1013 in AZ2 is 30%, the load of RS node 1014 in AZ3 is 35%, and the number of copies of each partition is 4, the distributed database 101 can determine the allocation indication information to be "REP: AZ1[0.5], AZ2[0.25], AZ3[0.25]", that is, AZ1 is used to store 50% of the partition copies while AZ2 and AZ3 are each used to store 25% of the partition copies, so as to balance the loads among the RS nodes under different AZs.
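  • The allocation expression above lends itself to a short, hedged sketch: parse the "REP: AZ[ratio]" grammar shown in the example and turn the ratios into per-AZ copy counts. The parser and the rounding policy are our assumptions; only the expression format comes from the text.

```python
import re

# Sketch: parse an allocation expression such as
# "REP: AZ1[0.5], AZ2[0.25], AZ3[0.25]" into per-AZ copy counts.
def parse_allocation(expr: str) -> dict[str, float]:
    return {az: float(ratio)
            for az, ratio in re.findall(r"(\w+)\[([\d.]+)\]", expr)}

def copies_per_az(expr: str, total_copies: int) -> dict[str, int]:
    ratios = parse_allocation(expr)
    return {az: round(ratio * total_copies) for az, ratio in ratios.items()}

# 4 copies of one partition: AZ1 gets 2 (0.5*4), AZ2 and AZ3 get 1 each.
print(copies_per_az("REP: AZ1[0.5], AZ2[0.25], AZ3[0.25]", total_copies=4))
# -> {'AZ1': 2, 'AZ2': 1, 'AZ3': 1}
```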
  • the allocation instruction information may also be determined in other ways, such as manual configuration by a technician in advance, which is not limited in this embodiment.
  • During specific implementation, the distributed database 101 may include a master server (such as the master server 1036 in FIG. 2) and an AZ-aware balancer. The master server can perceive the valid RS nodes, for example determining them based on the heartbeat messages sent by the RS nodes, so that the master server can generate a network topology map based on the valid RS nodes.
  • In this way, when allocating partition copies, the master server can forward the allocation request to the AZ-aware balancer, and the AZ-aware balancer determines the AZs used to store the partition copies according to the AZ-based network topology map together with the above allocation indication information.
  • In addition, the distributed database 101 can also be provided with a default balancer, and, for partitions that do not need to be replicated and stored (for example, partitions of low importance), the distributed database 101 can use the default balancer to determine the AZ where the partition is stored.
  • The default balancer may, for example, determine the AZ for storing the partition through a random algorithm or a load balancing strategy.
  • During actual application, the master server, the AZ-aware balancer, and the default balancer can each be implemented by software (such as a process) or hardware (such as a separately configured computing device), which is not limited in this embodiment.
  • In another implementation, the distributed database 101 may first allocate a partition copy to the AZ where the partition was created, and, for the remaining unallocated partition copies, the distributed database 101 can determine the AZs where RS nodes with relatively small loads are located according to the current loads of the RS nodes in each AZ, so that the distributed database 101 can allocate the remaining partition copies to those AZs for storage. In this way, the distributed database 101 can flexibly allocate the storage locations of partition copies according to the load conditions of the RS nodes under the multiple AZs in each time period, so as to realize load balancing across RS nodes.
  • In another implementation, the distributed database 101 can determine the AZs where the partition copies are stored based on the availability of each AZ. Specifically, during the process of writing partition copies to multiple AZs, the distributed database 101 may first obtain the availability of each AZ, which may be determined, for example, by the ratio between the number of available data nodes in the AZ and the total number of data nodes in the AZ, and which reflects how available the AZ is. For example, assuming an AZ includes 10 data nodes and the number of available nodes is 8, the availability of the AZ may be 80% (that is, 8/10).
  • An available data node refers to a data node that can still read and write data normally. Correspondingly, when a data node is physically damaged or reads and writes data incorrectly, the data node can be determined to be faulty, that is, unavailable; or, when the amount of data stored in a data node is too large for it to store further data, the data node may also be determined to be unavailable.
  • Of course, whether a data node is available can also be defined in other ways; for example, when the data to be stored does not have permission to be written into a data node, that data node can be determined to be unavailable relative to the data to be stored.
  • In this way, the distributed database 101 can determine, according to the availability of each AZ, multiple first AZs whose availability is higher than a preset threshold and at least one second AZ whose availability is lower than the preset threshold, so that the distributed database 101 can write the multiple partition copies to the multiple first AZs. This reduces the number of new partition copies written to AZs with low availability, preventing the RS nodes under a low-availability AZ from becoming overloaded, or the amount of data they are responsible for from exceeding the maximum amount that can be stored in that AZ.
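  • The availability check just described can be illustrated with a small, hedged sketch: availability is the ratio of available data nodes to all data nodes, and AZs are split into "first" (at or above a threshold) and "second" (below) groups. The data structures and the 0.8 threshold are assumptions for illustration.

```python
# Sketch: compute per-AZ availability and split AZs into first/second groups.
def availability(available_nodes: int, total_nodes: int) -> float:
    return available_nodes / total_nodes

def split_azs(az_nodes: dict[str, tuple[int, int]], threshold: float = 0.8):
    first, second = [], []
    for az, (avail, total) in az_nodes.items():
        (first if availability(avail, total) >= threshold else second).append(az)
    return first, second

# AZ1: 8 of 10 nodes up (80%), AZ2: 9 of 10 (90%), AZ3: 4 of 10 (40%).
first_azs, second_azs = split_azs({"AZ1": (8, 10), "AZ2": (9, 10), "AZ3": (4, 10)})
print(first_azs, second_azs)  # ['AZ1', 'AZ2'] ['AZ3'] — copies go to the first AZs
```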
  • Further, when the distributed database 101 writes partition copies to the AZs, some AZs may have low availability or become unavailable. Based on this, the distributed database 101 can suspend writing partition copies to those AZs. Specifically, before the distributed database 101 writes a partition copy to one of the AZs, it may first obtain the availability of that AZ; if the availability is lower than the preset threshold, the distributed database 101 may refrain from writing the partition copy to that AZ. Instead, the master server can create a cache queue for the AZ, such as a region-in-transition (RIT) queue, mark the partition copy as unallocated, and write it into the cache queue. The distributed database 101 then continues writing partition copies to the next AZ.
  • During actual application, if the master server belongs to the AZ with low availability, the distributed database 101 can switch the identity of a master server in another AZ with high availability from "standby" to "primary", and create the cache queue for that AZ.
  • Moreover, the distributed database 101 can monitor the availability of the AZ, for example by configuring an RIT work (chore) node for monitoring; if the availability of the AZ rises above the preset threshold, the distributed database 101 can write the partition copies in the cache queue corresponding to the AZ into the RS nodes belonging to the AZ.
  • It should be noted that the distributed database 101 may write a secondary partition copy to the AZ after its availability increases. If the partition copy to be stored in the low-availability AZ is the primary partition copy, then, because the availability of that AZ is too low, the distributed database 101 can select a partition copy from another AZ with high availability to act as the primary partition copy, ensuring the high availability of the primary partition copy in the distributed database 101, while the partition copy in the low-availability AZ acts as a secondary partition copy.
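  • The deferral behaviour described above can be sketched as follows, under stated assumptions: a copy headed for a low-availability AZ is marked unallocated and parked in a per-AZ cache queue (the RIT queue in the text), then written out once availability recovers. The class names and polling style are our assumptions, not an HBase API. The file system applies the same deferral pattern to data copies, as described later.

```python
from collections import defaultdict, deque

# Sketch: defer partition-copy assignment to low-availability AZs instead of
# migrating the copies to other AZs.
class PartitionAssigner:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.rit_queues = defaultdict(deque)  # AZ -> pending partition copies
        self.assigned = defaultdict(list)     # AZ -> stored partition copies

    def assign(self, az: str, partition_copy: str, az_availability: float):
        if az_availability < self.threshold:
            self.rit_queues[az].append(partition_copy)  # defer, don't migrate
        else:
            self.assigned[az].append(partition_copy)

    def on_availability_recovered(self, az: str):
        # Called by a monitor (the "RIT chore" in the text) once the AZ's
        # availability rises back above the threshold.
        while self.rit_queues[az]:
            self.assigned[az].append(self.rit_queues[az].popleft())

assigner = PartitionAssigner()
assigner.assign("AZ2", "partition-7/copy-2", az_availability=0.4)  # queued
assigner.on_availability_recovered("AZ2")                          # flushed
print(assigner.assigned["AZ2"])  # ['partition-7/copy-2']
```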
  • Of course, the above implementations are merely examples, and the distributed database 101 may also store the multiple partition copies in different AZs based on other methods. For example, the distributed database 101 may combine the above implementations, selecting multiple AZs that have high availability and are physically close to each other to store the multiple partition copies.
  • the file system 102 stores multiple data copies to data nodes under different AZs in the file system 102 .
  • During specific implementation, the multiple data copies of the target data may be provided to the file system 102 by the distributed database 101, or the file system 102 may replicate the target data to obtain the multiple data copies. Similar to storing the partition copies, the file system 102 can store the obtained multiple data copies in different AZs, specifically in the data nodes under different AZs, with each AZ including at least one data copy. In this way, even if one of the AZs storing a data copy is unavailable, the distributed database 101 can read the target data from the remaining AZs, so as to avoid loss of the target data and achieve AZ-level data fault tolerance.
  • this embodiment provides the following four implementations for storing multiple copies of data in different AZs:
  • In one implementation, the file system 102 may also store the multiple data copies based on allocation indication information, where the allocation indication information may include AZ identifiers and copy-count proportions and is used to indicate the proportions of the multiple data copies to be stored in the data nodes under different AZs.
  • For example, assuming the number of data copies is 4 and the allocation indication information is the allocation expression "REP: AZ1[0.5], AZ2[0.25], AZ3[0.25]", the file system 102 can store 2 (that is, 0.5*4) data copies in AZ1, 1 (that is, 0.25*4) data copy in AZ2, and 1 data copy in AZ3. In this way, the file system 102 can store the multiple data copies to data nodes under different AZs in the file system 102 according to the allocation indication information.
  • The allocation indication information may be determined according to the loads of the data nodes under each AZ, or manually configured by a technician, which is not limited in this embodiment.
  • During specific implementation, a namenode remote procedure call server and an availability zones block placement policy (AZ BPP) node may be set in the file system 102.
  • In this way, when data copies need to be stored across AZs, the namenode remote procedure call server can instruct the AZ BPP node to execute the replication process for the target data, and the AZ BPP node can determine the multiple AZs and the number of copies of the target data to be stored in each AZ, so that the corresponding data storage process can be executed.
  • In addition, a default block placement policy node may also be set in the file system 102, and, for data that does not need to be replicated and stored (for example, data of low importance), the file system 102 can use the default block placement policy node to determine the AZ for storing the data.
  • The default block placement policy node may, for example, determine the AZ for storing the data through a random algorithm or a load balancing strategy.
  • the file system 102 can generate an AZ based network topology based on the data nodes under each AZ, so that the default block placement policy node can determine the AZ for storing data copies according to the network topology.
  • In another implementation, the file system 102 may determine the AZs for storing the data copies according to the physical distances between different AZs. Specifically, the file system 102 can obtain the physical distances between different AZs in the file system 102 and, according to these distances, store the multiple data copies in the data nodes under multiple first AZs in the file system 102 that are physically close to each other; that is, the physical distance between each first AZ and at least one other first AZ among the multiple first AZs does not exceed a distance threshold (such as 40 kilometers).
  • The file system 102 also includes data nodes under at least one second AZ, and the physical distance between the second AZ and each first AZ exceeds the distance threshold.
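  • A hedged sketch of this distance-based selection follows: grow a set of "first" AZs such that each selected AZ is within the distance threshold (40 km in the text's example) of at least one already-selected AZ. The distance table, the seed choice, and the greedy strategy are illustrative assumptions.

```python
# Sketch: select "first" AZs whose pairwise distances keep each AZ within the
# threshold of at least one other selected AZ.
def select_first_azs(distances: dict[frozenset, float], azs: list[str],
                     threshold_km: float = 40.0) -> list[str]:
    selected = [azs[0]]  # seed with an arbitrary AZ
    for az in azs[1:]:
        if any(distances[frozenset((az, s))] <= threshold_km for s in selected):
            selected.append(az)
    return selected

dist = {frozenset(("AZ1", "AZ2")): 25.0,
        frozenset(("AZ1", "AZ3")): 90.0,
        frozenset(("AZ2", "AZ3")): 70.0}
print(select_first_azs(dist, ["AZ1", "AZ2", "AZ3"]))  # ['AZ1', 'AZ2']; AZ3 too far
```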
  • In another implementation, the file system 102 can determine the AZs where the data copies are stored based on the availability of each AZ. Specifically, the file system 102 can first obtain the availability of each AZ, which may be determined, for example, by calculating the ratio between the number of available data nodes in the AZ and the total number of data nodes in the AZ, and which reflects how available the AZ is.
  • In this way, the file system 102 can, according to the availability of each AZ, store the multiple data copies in the data nodes under the multiple first AZs with relatively high availability in the file system 102, with at least one data copy stored in each first AZ.
  • The file system 102 also includes data nodes under at least one second AZ, and the availability of the second AZ is lower than that of the first AZs. In this way, the file system 102 can preferentially allocate the data copies to AZs with high availability for storage, improving the reliability of data reading and writing of the data processing platform 100.
  • Further, when the file system 102 writes data copies to the AZs, some AZs may have low availability or become unavailable. Based on this, the file system 102 can suspend writing data copies to those AZs. Specifically, before writing the multiple data copies to multiple AZs, the file system 102 can first obtain the availability of each AZ and, according to the availability of each AZ, store some of the multiple data copies in the data nodes under the first AZs with higher availability (higher than the preset threshold) in the file system 102, with at least one data copy stored in each first AZ.
  • For the second AZ whose availability is lower than the preset threshold, the file system 102 may temporarily refrain from writing a data copy to it. Then, when the availability of the second AZ rises to the preset threshold, the file system 102 stores the remaining unstored data copies in the second AZ. In this way, when the file system 102 stores multiple data copies, it avoids migrating the data storage tasks of a low-availability AZ to other AZs, thereby avoiding increasing the load of the other AZs.
  • For example, the file system 102 may sequentially write three data copies of data A into data node 1 under AZ1, data node 2 under AZ2, and data node 3 under AZ3. When AZ2 then has low availability or becomes unavailable, the file system 102 can suspend writing the copy of data B to AZ2, writing a copy of data C to AZ3 first. After the availability of AZ2 recovers, the file system 102 writes a copy of data B into AZ2 based on the data copy stored in AZ1 or AZ3.
  • Of course, the above implementations are merely examples, and the file system 102 can also store the multiple data copies in different AZs based on other methods. For example, the file system 102 can combine the above implementations, selecting multiple AZs that have high availability and are physically close to each other to store the multiple data copies.
  • the distributed database 101 and the file system 102 may include multiple identical AZs, or may not include the same AZ, which is not limited in this embodiment.
  • When the distributed database 101 and the file system 102 include multiple identical AZs, that is, when the multiple target AZs that store the multiple partition copies are also included in the file system 102, the file system 102 can track those target AZs during the process of storing the data copies, and store the multiple data copies in the data nodes under different target AZs.
  • For example, the file system 102 and the distributed database 101 may store the data copies and the partition copies in the same AZs based on the same allocation indication information.
  • In this way, the distributed database 101 can read the target data based on a partition copy from a local data node (that is, a data node in the AZ where the partition copy is located), reducing the latency of reading data and improving the efficiency with which the data processing platform 100 feeds the data back to users.
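  • The co-location just described can be sketched under stated assumptions: when choosing AZs for the data copies, prefer the target AZs that already hold partition copies, so reads can be served from a local data node. The ordering and fallback behaviour below are illustrative assumptions.

```python
# Sketch: place data copies in the AZs that already hold partition copies,
# falling back to the remaining AZs if more copies are needed.
def choose_data_copy_azs(partition_copy_azs: list[str],
                         all_azs: list[str], num_copies: int) -> list[str]:
    holders = set(partition_copy_azs)
    preferred = [az for az in all_azs if az in holders]      # local reads possible
    fallback = [az for az in all_azs if az not in holders]   # used only if needed
    return (preferred + fallback)[:num_copies]

# Partition copies live in AZ1 and AZ3, so data copies land there first.
print(choose_data_copy_azs(["AZ1", "AZ3"], ["AZ1", "AZ2", "AZ3"], num_copies=3))
# -> ['AZ1', 'AZ3', 'AZ2']
```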
  • When the distributed database 101 and the file system 102 do not include the same AZs, the distributed database 101 and the file system 102 can independently execute the copy storage processes.
  • In this embodiment, assuming the data processing platform 100 includes N AZs (N is a positive integer greater than 1), even if N-1 of the AZs are unavailable, the data processing platform 100 can still continue to provide users with data read and write services based on the partition copies and data copies stored in the Nth AZ.
  • Moreover, after the failed AZs recover, the data processing platform 100 can also automatically restore the data copies and partition copies stored in the remaining N-1 AZs based on the partition copies and data copies stored in the Nth AZ, so that no intervention by management personnel is needed and automatic processing after fault recovery is realized.
  • It should be noted that the above embodiment takes as an example the case where the target data and the partition to which the target data belongs are both replicated and stored across AZs. During actual application, for data of relatively high importance, the data processing platform 100 can store it in a manner similar to that described above, so as to improve the reliability of the data read and write services provided by the data processing platform 100; in practice, different data may have different importance to users.
  • Therefore, the data processing platform 100 can store a user's data A in the above-described manner, while when storing the user's data B, it does not need to replicate and store data B, or it can store multiple copies of data B within one AZ.
  • In addition, the data processing platform 100 may also periodically balance the loads of the RS nodes under each AZ. For example, the data processing platform 100 can periodically obtain the number of partition copies stored by the RS nodes under each AZ, so that it can perform load balancing on the RS nodes under different AZs in the distributed database 101 according to the number of partition copies on each RS node; specifically, this may be migrating the partition copies on some RS nodes to other RS nodes, so as to adjust the number of partition copies stored in different AZs and thereby reduce the difference in the number of partition copies stored in different AZs.
  • For example, assuming the numbers of partition copies stored in AZ1, AZ2, and AZ3 are 1, 2, and 3 respectively, the data processing platform 100 can migrate a partition copy on an RS node under AZ3 to an RS node under AZ1, so that the number of partition copies stored in each AZ is 2.
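  • A hedged sketch of this periodic balancing follows: move partition copies from the AZs holding the most to the AZs holding the fewest until the counts are nearly equal. The one-copy-per-step policy is an assumption; the example numbers mirror the AZ3-to-AZ1 migration above.

```python
# Sketch: rebalance partition-copy counts across AZs by migrating one copy at
# a time from the fullest AZ to the emptiest, until the spread is at most 1.
def rebalance(copy_counts: dict[str, int]) -> list[tuple[str, str]]:
    migrations = []
    while max(copy_counts.values()) - min(copy_counts.values()) > 1:
        src = max(copy_counts, key=copy_counts.get)
        dst = min(copy_counts, key=copy_counts.get)
        copy_counts[src] -= 1
        copy_counts[dst] += 1
        migrations.append((src, dst))  # migrate one partition copy src -> dst
    return migrations

counts = {"AZ1": 1, "AZ2": 2, "AZ3": 3}
print(rebalance(counts), counts)
# -> [('AZ3', 'AZ1')] {'AZ1': 2, 'AZ2': 2, 'AZ3': 2}
```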
  • the data to be stored includes the target data and the partition to which the target data belongs as an example. In other possible embodiments, the data to be stored may only be the target data.
  • At this time, the data processing platform 100 can store multiple data copies of the target data in different AZs through the file system 102, while the partition to which the target data belongs can be stored in a single AZ (that AZ may store one copy of the partition, or multiple copies of the partition at the same time), or in multiple AZs. Alternatively, the data to be stored may also be only the partition to which the target data belongs.
  • At this time, the data processing platform 100 can store the multiple partition copies in different AZs through the distributed database 101, while the target data can be stored in a single AZ under the file system 102 (that AZ may store one copy of the target data, or multiple copies of the target data at the same time), or in multiple AZs under the file system 102.
  • FIG. 5 is a schematic structural diagram of a data storage device provided by the present application.
  • the data storage device 500 can be applied to a data processing platform (such as the above-mentioned data processing platform 100 etc.), and the data processing platform includes multiple availability zones AZ.
  • The data storage device 500 includes:
  • an acquisition module 501, configured to acquire multiple copies of the data to be stored; and
  • a storage module 502, configured to store the multiple copies of the data to be stored in different AZs of the data processing platform.
  • In a possible implementation, the data to be stored includes target data and/or the partition to which the target data belongs, and the storage module 502 is configured to: store multiple data copies of the target data in different AZs of the data processing platform, and/or store multiple partition copies of the partition to which the target data belongs in different AZs of the data processing platform.
  • In a possible implementation, the data to be stored includes the target data and the partition to which the target data belongs, the data processing platform includes a distributed database and a file system, the distributed database includes region server RS nodes under multiple AZs, the file system includes data nodes under multiple AZs, and the storage module 502 is configured to: store the multiple partition copies to RS nodes under different AZs in the distributed database, and store the multiple data copies to data nodes under different AZs in the file system.
  • In a possible implementation, the storage module 502 is configured to: obtain the physical distances between different AZs in the file system, and store the multiple data copies to the data nodes under multiple first AZs in the file system according to the physical distances, where the physical distance between the multiple first AZs does not exceed a distance threshold.
  • In a possible implementation, the storage module 502 is configured to: obtain the availability of each AZ in the file system, and store the multiple data copies to the data nodes under multiple first AZs in the file system according to the availability of each AZ, where the file system further includes data nodes under at least one second AZ, and the availability of the second AZ is lower than the availability of the first AZ.
  • In a possible implementation, the storage module 502 is configured to: obtain the availability of each AZ in the file system, and store some of the multiple data copies to the data nodes under multiple first AZs in the file system according to the availability of each AZ, where the file system further includes data nodes under at least one second AZ, and the availability of the second AZ is lower than the availability of the first AZ; and, when the availability of the at least one second AZ rises to a preset threshold, store the other data copies of the multiple data copies to the data nodes under the at least one second AZ.
  • In a possible implementation, the storage module 502 is configured to:
  • acquire allocation indication information for the multiple partition copies, where the allocation indication information is used to indicate the proportions of the multiple partition copies to be stored in different AZs; and
  • store, according to the allocation indication information, the multiple partition copies to RS nodes under different AZs in the distributed database.
  • the allocation indication information is determined according to the load of the RS nodes under each AZ in the distributed database.
  • In a possible implementation, the distributed database and the file system both include multiple target AZs, the multiple target AZs store the multiple partition copies, and the storage module 502 is configured to: store the multiple data copies to data nodes under different target AZs.
  • the apparatus 500 further includes:
  • the load balancing module 503 is configured to perform load balancing on RS nodes under different AZs in the distributed database, so as to adjust the number of partition copies stored in different AZs.
  • It should be understood that the data storage device 500 corresponds to the method described in the embodiments of the present application, and the above and other operations and/or functions of the modules of the data storage device 500 respectively implement the corresponding processes of the method embodiment shown in FIG. 3; for brevity, details are not repeated here.
  • FIG. 6 provides a computing device.
  • the computing device 600 may be, for example, the device used to implement the functions of the data processing platform 100 in the foregoing embodiments, and may specifically be used to implement the functions of the data storage device 500 in the embodiment shown in FIG. 5.
  • the computing device 600 includes a bus 601 , a processor 602 and a memory 603 .
  • the processor 602 and the memory 603 communicate through the bus 601 .
  • the bus 601 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in FIG. 6, but this does not mean that there is only one bus or only one type of bus.
  • the processor 602 may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
  • the memory 603 may include volatile memory, such as random access memory (RAM).
  • the memory 603 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • Executable program codes are stored in the memory 603 , and the processor 602 executes the executable program codes to implement the data storage method performed by the aforementioned data processing platform 100 .
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that a computing device can access, or a data storage device, such as a data center, containing one or more available media.
  • the available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state drive), etc.
  • the computer-readable storage medium includes instructions for instructing a computing device to execute the above data storage method.
  • the embodiment of the present application also provides a computer program product.
  • the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions according to the embodiments of the present application are produced in whole or in part.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another, for example, from one website, computer, or data center to another website, computer, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or in a wireless manner (such as infrared, wireless, or microwave).
  • the computer program product may be a software installation package, which can be downloaded and executed on a computing device when any of the aforementioned data storage methods needs to be used.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a data storage method. The method can be applied to a data processing platform comprising a plurality of availability zones (AZs): the data processing platform obtains a plurality of copies of data to be stored, and stores the plurality of copies in different AZs of the data processing platform. In this way, when some of the AZs become unavailable due to natural disasters or other disasters, the disaster generally does not affect the other AZs, because the physical distance between different AZs is generally large; the data processing platform can therefore continue to provide a data read-write service for a user on the basis of the copies stored in the other, normally running AZs. This prevents the quality of the data read-write service provided by the data processing platform from being reduced by the occurrence of the disaster, and improves the data storage reliability of the data processing platform. In addition, the embodiments of the present application further provide a corresponding apparatus and a related device.

Description

A data storage method, apparatus, and related device
This application claims priority to the Indian patent application No. 202131049367, entitled "DATA MANAGEMENT SYSTEM AND METHOD", filed with the Indian Patent Office on October 28, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the technical field of databases, and in particular, to a data storage method, apparatus, and related device.
Background
The data processing platform is used to provide users with data reading and writing services, such as data storage and data reading, and may include a distributed database and a file system. A distributed database, such as an HBase database, usually includes a master node and multiple partition server (Region Server, RS) nodes. The master node is used to assign to each RS node the regions to which the data that the RS node is responsible for reading and writing belongs, and the number of regions assigned to each RS node may be one or more; the RS node is used to write new data to the file system, or to feed back the data in the file system requested by the user to the user, according to the regions assigned to it, so as to realize the data reading and writing service.
In actual applications, disasters such as natural disasters may reduce the quality of the data reading and writing services provided by the data processing platform for users, thereby affecting the user experience. Therefore, a data storage method is urgently needed so that, when a disaster occurs, the quality of the data reading and writing services provided by the data processing platform for users can be maintained at a relatively high level.
Summary
In view of this, the embodiments of the present application provide a data storage method, so that when a disaster occurs, the quality of the data reading and writing services provided by the data processing platform for users can be maintained at a relatively high level. The present application also provides corresponding apparatuses, computing devices, computer-readable storage media, and computer program products.
In a first aspect, an embodiment of the present application provides a data storage method. The method can be applied to a data processing platform including multiple availability zones (AZs), and when storing data, the data processing platform can first acquire multiple copies of the data to be stored and store the multiple copies of the data to be stored in different AZs of the data processing platform.
Since the data processing platform stores multiple copies of the data to be stored in different AZs, when some AZs become unavailable due to natural disasters or other disasters, the disaster usually does not affect the other AZs because the physical distance between different AZs is usually large, so the data processing platform can continue to provide users with data reading and writing services based on the copies of the data to be stored in the other, normally running AZs, thereby preventing the disaster from reducing the quality of the data reading and writing services provided by the data processing platform. Moreover, based on this data storage method, the allowable service interruption time of the data processing platform can be 0, that is, the recovery time objective can reach 0; and the maximum data loss that the data processing platform can tolerate can also be 0, that is, the recovery point objective can be 0.
In addition, the data processing platform stores the multiple copies of the data to be stored in different AZs rather than in a single AZ, so that the unavailability of some AZs does not cause data loss in all copies of a piece of data to be stored, which improves the reliability with which the data processing platform stores data.
In a possible implementation, the data to be stored includes target data and/or the partition to which the target data belongs. When storing the multiple copies of the data to be stored, the data processing platform may specifically store multiple copies of the target data in different AZs of the data processing platform, and/or store multiple partition copies of the partition to which the target data belongs in different AZs of the data processing platform. In this way, when some AZs are unavailable, the data processing platform can obtain copies of the target data and/or partition copies from other AZs, so as to continue to provide users with data reading and writing services based on them, which improves the reliability of the data reading and writing services provided by the data processing platform.
In a possible implementation, the data to be stored includes target data and the partition to which the target data belongs, and the data processing platform includes a distributed database and a file system, where the distributed database includes partition server RS nodes under multiple AZs and the file system includes data nodes under multiple AZs. When storing the multiple copies of the data to be stored in different AZs, the data processing platform may specifically store the multiple partition copies to RS nodes under different AZs in the distributed database, and store the multiple data copies to data nodes under different AZs in the file system. In this way, the reliability of the data reading and writing services provided by the data processing platform can be improved.
The multiple AZs under the distributed database and the multiple AZs under the file system may overlap (all or some of the AZs may be the same); for example, the distributed database and the file system may have the same multiple AZs. Alternatively, the multiple AZs under the distributed database may not overlap with the multiple AZs under the file system; for example, the distributed database includes AZ1 to AZ5 while the file system includes AZ6 to AZ10. In this way, the reliability of the data reading and writing services provided by the data processing platform can be further improved.
In a possible implementation, when storing the multiple data copies to data nodes under different AZs in the file system, the data processing platform may first obtain the physical distances between different AZs in the file system, and determine, according to these physical distances, multiple first AZs in the file system whose pairwise physical distances do not exceed a distance threshold (for example, 40 kilometers), so that the data processing platform can store the multiple data copies to data nodes under the multiple first AZs. In this way, when some AZs fail and the data processing platform reads data copies from other AZs, since those other AZs are relatively close, the data processing platform can read the data copies quickly, keeping the latency of acquiring the data copies at a low level.
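As a non-limiting illustration of selecting first AZs by physical distance, the following Java sketch shows one possible greedy selection given a pairwise distance table; all class, method, and variable names are invented for illustration and are not part of the embodiments.

```java
import java.util.*;

// Illustrative sketch: pick a set of AZs whose pairwise distances all stay
// within a threshold (e.g. 40 km), as candidates for storing data copies.
public class NearbyAzSelector {
    // distances[i][j] = physical distance in km between AZ i and AZ j
    static List<Integer> selectWithinThreshold(double[][] distances, double thresholdKm, int needed) {
        int n = distances.length;
        // Greedy: start from each AZ and grow a group whose members are all
        // mutually within the threshold; return the first group that is large enough.
        for (int seed = 0; seed < n; seed++) {
            List<Integer> group = new ArrayList<>();
            group.add(seed);
            for (int cand = 0; cand < n && group.size() < needed; cand++) {
                if (cand == seed) continue;
                boolean ok = true;
                for (int member : group) {
                    if (distances[member][cand] > thresholdKm) { ok = false; break; }
                }
                if (ok) group.add(cand);
            }
            if (group.size() >= needed) return group;
        }
        return Collections.emptyList(); // no qualifying group found
    }

    public static void main(String[] args) {
        double[][] d = {
            {0, 30, 120},
            {30, 0, 100},
            {120, 100, 0}
        };
        // AZ0 and AZ1 are 30 km apart; AZ2 is too far from both.
        System.out.println(selectWithinThreshold(d, 40.0, 2)); // [0, 1]
    }
}
```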
In a possible implementation, when storing the multiple data copies to data nodes under different AZs in the file system, the data processing platform may obtain the availability of each AZ in the file system, where the availability may be determined, for example, by the ratio of the data nodes available in the AZ to all data nodes in the AZ. The data processing platform can then, according to the availability of each AZ in the file system, store the multiple data copies to data nodes under multiple first AZs in the file system, where the file system also includes at least one second AZ with lower availability, that is, an AZ whose availability is lower than that of the first AZs. In this way, the data processing platform can preferentially select data nodes under first AZs with higher availability to store the data copies, which improves the reliability with which the data processing platform reads and writes data.
In a possible implementation, when storing the multiple data copies to data nodes under different AZs in the file system, the data processing platform may obtain the availability of each AZ in the file system (for example, the ratio of the data nodes available in the AZ to all data nodes in the AZ), and, according to this availability, store some of the multiple data copies to data nodes under multiple first AZs in the file system, where the file system also includes at least one second AZ whose availability is lower than that of the first AZs; when the availability of the at least one second AZ rises to a preset threshold, the data processing platform can store the remaining data copies to data nodes under the at least one second AZ. In this way, when storing multiple data copies, the data storage tasks that an AZ with lower availability should undertake are not migrated to other AZs, which avoids increasing the load of those other AZs.
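As a non-limiting illustration of deferring the copies destined for a low-availability second AZ, the following Java sketch queues the remaining copies until the second AZ's availability rises back to a preset threshold; all names are invented, and the monitoring callback is an assumed interface rather than anything defined in the embodiments.

```java
import java.util.*;

// Illustrative sketch: copies destined for a degraded (second) AZ are queued and
// written only once that AZ's availability climbs back to a preset threshold,
// instead of being diverted to (and further loading) the healthy first AZs.
public class DeferredPlacement {
    static final double THRESHOLD = 0.8;

    private final Map<String, List<String>> pending = new HashMap<>();

    void placeCopies(List<String> copies, List<String> firstAzs, String secondAz) {
        int i = 0;
        // Store one share of the copies on each healthy first AZ now ...
        for (; i < firstAzs.size() && i < copies.size(); i++) {
            System.out.println(copies.get(i) + " -> " + firstAzs.get(i));
        }
        // ... and queue the remainder for the degraded second AZ.
        pending.computeIfAbsent(secondAz, k -> new ArrayList<>())
               .addAll(copies.subList(i, copies.size()));
    }

    // Assumed monitoring callback reporting a fresh availability value for an AZ.
    void onAvailabilityUpdate(String az, double availability) {
        List<String> queued = pending.get(az);
        if (availability >= THRESHOLD && queued != null) {
            for (String copy : queued) System.out.println(copy + " -> " + az + " (deferred write)");
            pending.remove(az);
        }
    }

    public static void main(String[] args) {
        DeferredPlacement p = new DeferredPlacement();
        p.placeCopies(List.of("copy1", "copy2", "copy3"), List.of("AZ1", "AZ2"), "AZ3");
        p.onAvailabilityUpdate("AZ3", 0.5); // still degraded: copy3 stays queued
        p.onAvailabilityUpdate("AZ3", 0.9); // recovered: copy3 is written now
    }
}
```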
In a possible implementation, when storing the multiple partition copies to RS nodes under different AZs in the distributed database, the data processing platform may specifically acquire allocation indication information for the multiple partition copies, where the allocation indication information is used to indicate the proportions of the number of copies of the multiple partition copies stored in different AZs, so that the data processing platform can store the multiple partition copies to RS nodes under different AZs in the distributed database according to the allocation indication information. The allocation indication information may be configured in advance by a technician or user, or may be automatically generated by the data processing platform. In this way, the data processing platform can realize cross-AZ storage of the multiple partition copies according to the allocation indication information.
In a possible implementation, the allocation indication information may be determined according to the load of the RS nodes under each AZ in the distributed database. In this way, when the data processing platform stores the multiple partition copies according to the allocation indication information, the load of the multiple AZs can be balanced.
In a possible implementation, both the distributed database and the file system include multiple target AZs, and the multiple target AZs already store the multiple partition copies of the partition to which the target data belongs. When storing the multiple data copies, the data processing platform can track the multiple target AZs that store the multiple partition copies, and store the multiple data copies to data nodes under different target AZs. In this way, when providing data reading and writing services, the distributed database can, based on a partition copy, read the target data from a local data node (that is, a data node in the AZ where the partition copy is located), which reduces the latency of reading data and improves the efficiency with which the data processing platform feeds data back to users.
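As a non-limiting illustration of this co-location, the following Java sketch (all names invented) steers each data copy to one of the target AZs that already hold the partition copies, so that a read served from a partition copy hits a local data node.

```java
import java.util.*;

// Illustrative sketch: when the AZs holding the partition copies are known,
// the file-system data copies are steered to those same "target" AZs.
public class CoLocatedPlacement {
    public static void main(String[] args) {
        // AZs that already hold replicas of the partition the data belongs to.
        List<String> partitionReplicaAzs = List.of("AZ1", "AZ2", "AZ3");
        List<String> dataCopies = List.of("block-copy-0", "block-copy-1", "block-copy-2");

        // Place the i-th data copy in the AZ of the i-th partition replica.
        Map<String, String> placement = new LinkedHashMap<>();
        for (int i = 0; i < dataCopies.size(); i++) {
            placement.put(dataCopies.get(i), partitionReplicaAzs.get(i % partitionReplicaAzs.size()));
        }
        placement.forEach((copy, az) -> System.out.println(copy + " -> data node in " + az));
    }
}
```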
In a possible implementation, after storing the multiple partition copies to RS nodes under different AZs in the distributed database, the data processing platform may perform load balancing on the RS nodes under different AZs in the distributed database, so as to adjust the number of partition copies stored in different AZs. In this way, load balancing among different RS nodes is realized, preventing an excessive load on some RS nodes from degrading the quality of the data reading and writing services provided by the data processing platform as a whole.
In a second aspect, based on the same inventive concept as the method embodiments of the first aspect, an embodiment of the present application provides a data storage apparatus. The apparatus has the functions corresponding to the implementations of the first aspect. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions.
In a third aspect, the present application provides a computing device including a processor, a memory, and a display, where the processor and the memory communicate with each other. The processor is configured to execute instructions stored in the memory, so that the computing device executes the data storage method in the first aspect or any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing instructions that, when run on a computing device, cause the computing device to execute the data storage method described in the first aspect or any implementation of the first aspect.
In a fifth aspect, the present application provides a computer program product containing instructions that, when run on a computing device, cause the computing device to execute the data storage method described in the first aspect or any implementation of the first aspect.
On the basis of the implementations provided in the above aspects, the present application may further combine them to provide more implementations.
Brief Description of the Drawings
To more clearly illustrate the technical solutions in the embodiments of the present application, the following briefly introduces the drawings used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments described in the present application, and those of ordinary skill in the art can derive other drawings from these drawings.
FIG. 1 is a schematic architectural diagram of an exemplary data processing platform 100 of the present application;
FIG. 2 is a schematic structural diagram of a cluster 200 constructed across availability zones;
FIG. 3 is a schematic flowchart of a data storage method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of storing multiple data copies provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a data storage apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of the hardware structure of a computing device provided by an embodiment of the present application.
Detailed Description
The solutions in the embodiments provided in the present application are described below with reference to the drawings in the present application.
The terms "first", "second", and the like in the specification, claims, and drawings of the present application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the way in which objects with the same attribute are distinguished in the description of the embodiments of the present application.
FIG. 1 is a schematic architectural diagram of an exemplary data processing platform 100. As shown in FIG. 1, the data processing platform 100 includes a distributed database 101 and a file system 102, where the file system 102 can be used to persistently store data in the form of files, and the distributed database 101 can be used to manage the data in the file system 102, including reading, writing, and merging of data.
The file system 102 includes multiple data nodes (datanodes), and different data nodes may belong to different availability zones (AZs). In FIG. 1, the file system 102 includes data nodes 1021 to 1024 as an example, where the data node 1021 and the data node 1022 belong to AZ1, and the data node 1023 and the data node 1024 belong to AZ2. An AZ usually refers to a collection of one or more physical data centers with independent ventilation, fire control, water, and power supplies, and within an AZ, resources such as computing, network, and storage can be logically divided into multiple clusters. As some examples, the file system 102 may be a distributed file system (DFS), a Hadoop distributed file system (HDFS), etc., which is not limited in this embodiment. In practical applications, the file system 102 may also include a namenode (not shown in FIG. 1), which may also be called a master node and is used to manage the multiple data nodes, including managing the namespace of the multiple data nodes and recording the metadata of the data stored in each data node.
The distributed database 101 includes a master node 1011 and multiple partition server (region server, RS) nodes, and different RS nodes may belong to different availability zones. In FIG. 1, the distributed database 101 includes an RS node 1012 and an RS node 1013 as an example, where the RS node 1012 belongs to AZ1 and the RS node 1013 belongs to AZ2.
The master node 1011 is used to divide the data managed by the distributed database 101 (that is, the data stored in the file system 102) to obtain multiple partitions. Each partition includes the identifiers of one or more pieces of data, and the data belonging to different partitions usually differs. As an implementation example of partitioning, when managing each piece of data, the distributed database 101 may use part of the content of that piece of data as its primary key, which uniquely identifies the piece of data in the distributed database 101. The master node 1011 can then divide the possible value range of the primary key into intervals, each of which corresponds to one partition. For example, assuming that the value range of the primary key in the distributed database 101 is [0, 1000000], the master node 1011 can divide the value range of the primary key into 100 intervals, namely [0, 10000), [10000, 20000), ..., [980000, 990000), [990000, 1000000]. Each partition can be used to index 10,000 pieces of data; correspondingly, based on the 100 partitions, the distributed database 101 can manage 1,000,000 pieces of data. Further, the master node 1011 can also achieve high availability through a distributed application coordination service (such as a ZooKeeper service) 1014, as shown in FIG. 1.
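As a non-limiting illustration of the range partitioning in this example, the following Java sketch (all names invented) maps a primary key in [0, 1000000] to one of 100 partitions.

```java
// Illustrative sketch of the range partitioning described above: the primary-key
// space [0, 1000000] is cut into 100 equal intervals, and a key is routed to a
// partition by the interval its value falls into.
public class RangePartitioner {
    static final long MAX_KEY = 1_000_000L;
    static final int PARTITIONS = 100;
    static final long WIDTH = MAX_KEY / PARTITIONS; // 10,000 keys per partition

    // Returns the index of the partition responsible for the given primary key.
    static int partitionFor(long primaryKey) {
        if (primaryKey < 0 || primaryKey > MAX_KEY) throw new IllegalArgumentException("key out of range");
        // The last interval is closed ([990000, 1000000]), so clamp the boundary key.
        return (int) Math.min(primaryKey / WIDTH, PARTITIONS - 1);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(0));         // 0  -> [0, 10000)
        System.out.println(partitionFor(15_000));    // 1  -> [10000, 20000)
        System.out.println(partitionFor(1_000_000)); // 99 -> [990000, 1000000]
    }
}
```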
Meanwhile, the master node 1011 is also used to assign partitions to the RS node 1012 and the RS node 1013, and the partitions assigned to each RS node can be maintained through a management table created by the master node 1011. The RS node 1012 and the RS node 1013 are respectively used to perform data reading and writing services belonging to different partitions; as shown in FIG. 1, the RS node 1012 performs the data reading and writing services belonging to partition 1 to partition N, while the RS node 1013 performs the data reading and writing services belonging to partition N+1 to partition M.
For example, the master node 1011, the RS node 1012, and the RS node 1013 may each be implemented by hardware or software. When the master node 1011 and the RS nodes are implemented by hardware, the master node 1011 and the multiple RS nodes may each be physical servers in the distributed database 101; that is, during actual deployment, at least one server in the distributed database 101 may be configured as the master node 1011, and the other servers in the distributed database 101 may be configured as RS nodes. Alternatively, when the master node 1011 and the RS nodes are implemented by software, the master node 1011 and the multiple RS nodes may be processes or virtual machines running on one or more devices (such as servers).
It should be noted that the data processing platform 100 shown in FIG. 1 is only an example and is not intended to limit the specific implementation of the data processing platform. For example, in other possible data processing platforms 100, the distributed database 101 may include any number of master nodes and RS nodes, or the RS nodes in the distributed database 101 and the data nodes in the file system 102 may belong to different AZs, which is not limited in this application.
The distributed database 101 can be connected to the file system 102 and a client 103, for example, through a wireless communication protocol such as the HyperText Transfer Protocol (HTTP). When the client 103 needs to modify data or write new data through the RS node 1012, the client 103 can send a data write request to the RS node 1012, where the data write request carries the data to be written and the corresponding data processing operation (such as a write operation or a modification operation). The RS node 1012 can then parse the received data write request, generate a corresponding data processing record based on the data to be written and the data processing operation, and write the data processing record into a pre-created write ahead log (WAL) file. After determining that the write to the WAL file succeeded, the RS node 1012 persistently stores the WAL file to the file system 102. The RS node 1012 also inserts the data to be written in the data processing record into its memory 1. For example, the RS node 1012 can first determine the primary key corresponding to the data processing record, and determine, according to the partition interval to which the value of that key belongs, which partition the data processing record is written to, so that the RS node 1012 can insert the data to be written in the record into the storage area corresponding to that partition in memory 1; the RS node 1012 can then feed back to the client 103 that the data write succeeded.
Normally, the RS node 1012 writes data into memory 1 for one or more clients 103, so the amount of data temporarily stored in memory 1 keeps increasing. When the amount of data in memory 1 reaches a preset threshold, the RS node 1012 can persistently store the data in memory 1 to the file system 102, for example, to the data node 1021 under AZ1.
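As a non-limiting illustration of the write path described above, the following Java sketch appends each write to a WAL, inserts it into the partition's in-memory store, and flushes that store to the file system once a threshold is reached; all names are invented, and the flush threshold is deliberately tiny for the demo.

```java
import java.util.*;

// Illustrative sketch of the write path: WAL append first, then memstore insert,
// then a flush to the file system when the memstore reaches a threshold.
public class WritePath {
    static final int FLUSH_THRESHOLD = 3; // rows, kept tiny for the demo

    private final List<String> wal = new ArrayList<>();
    private final Map<Integer, List<String>> memstores = new HashMap<>();

    void write(int partition, String row) {
        wal.add(partition + ":" + row);                 // 1. append to the WAL first
        List<String> memstore = memstores.computeIfAbsent(partition, k -> new ArrayList<>());
        memstore.add(row);                              // 2. insert into the partition's memstore
        System.out.println("ack write of " + row);      // 3. acknowledge the client
        if (memstore.size() >= FLUSH_THRESHOLD) flush(partition, memstore);
    }

    private void flush(int partition, List<String> memstore) {
        // 4. persist the accumulated rows to the file system as a store file
        System.out.println("flush partition " + partition + " -> file system: " + memstore);
        memstore.clear();
    }

    public static void main(String[] args) {
        WritePath rs = new WritePath();
        rs.write(1, "row-a");
        rs.write(1, "row-b");
        rs.write(1, "row-c"); // third write crosses the threshold and triggers a flush
    }
}
```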
Further, the RS node 1012 is also configured with region store files for each partition. After data is persistently stored in the file system 102, the RS node 1012 can add the files in which each partition's data is stored in the file system 102 to the region store files corresponding to that partition; specifically, the file names corresponding to the data of the partition can be added under the directory of the region store files.
Correspondingly, when the client 103 needs to read data, it can send a data read request to the RS node 1012, where the data read request carries the primary key of the data to be read. After receiving the data read request, the RS node 1012 can determine the partition according to the value of the primary key of the data to be read, so that the RS node 1012 can look up the data required by the client 103 from the data nodes under AZ1 according to the region store files corresponding to that partition, and feed the data back to the client 103.
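As a non-limiting illustration of the read path described above, the following Java sketch routes a primary key to its partition and uses that partition's store-file list to decide which files in the file system to search; all names and store-file contents are invented.

```java
import java.util.*;

// Illustrative sketch of the read path: the primary key in the request selects
// a partition, and that partition's store-file list says which files to search.
public class ReadPath {
    // partition -> names of its store files in the file system (invented data)
    private final Map<Integer, List<String>> regionStoreFiles = Map.of(
            0, List.of("hfile-0001", "hfile-0002"),
            1, List.of("hfile-0003"));

    String read(long primaryKey) {
        int partition = (int) (primaryKey / 10_000);   // same range rule as the partitioner
        List<String> files = regionStoreFiles.getOrDefault(partition, List.of());
        // A real server would also consult the memstore and caches first.
        return "search " + files + " in partition " + partition + " for key " + primaryKey;
    }

    public static void main(String[] args) {
        System.out.println(new ReadPath().read(12_345)); // lands in partition 1
    }
}
```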
When the data processing platform 100 is deployed, it can serve as a local resource and provide local data reading and writing services, through the distributed database 101 and the file system 102, to clients accessing the data processing platform 100. Alternatively, the data processing platform 100 may be deployed in the cloud, in which case the distributed database 101 and the file system 102 can provide cloud services for reading and writing data to clients accessing the cloud.
In actual application scenarios, disasters such as natural disasters inevitably occur, and such a disaster may make some AZs (such as AZ1 or AZ2) unavailable, for example, by physically damaging some or all of the computing devices under the AZ. In this case, the partitions stored in the RS nodes under that AZ and/or the data stored in the data nodes under that AZ may be lost or become unreadable because of the AZ's unavailability, which reduces the quality of the data reading and writing services provided by the data processing platform 100; that is, it becomes difficult for users to obtain the data stored in that AZ, affecting the user experience.
Based on this, the embodiments of the present application provide a data storage method that aims to distribute copies of partitions and/or data to different AZs, so as to improve the reliability of the data reading and writing services provided by the data processing platform 100. In a specific implementation, for data to be stored, the data processing platform 100 can acquire multiple copies of the data to be stored and store them in different AZs of the data processing platform 100. For example, the data to be stored may be data newly written by a user and/or the partition to which the data belongs. Since the data processing platform 100 replicates the data to be stored and different copies are stored in different AZs, when AZ1 (or AZ2) becomes unavailable due to a natural disaster or another disaster, the disaster usually does not affect AZ2 (or AZ1) because the physical distance between different AZs is usually large. The data processing platform 100 can therefore continue to provide users with data reading and writing services based on the copies of the data to be stored in the normally running AZ, which prevents the disaster from reducing the quality of the data reading and writing services provided by the data processing platform 100. Moreover, the allowable service interruption time of the data processing platform 100, that is, the recovery time objective (RTO), can be 0; and the maximum data loss that the data processing platform 100 can tolerate, that is, the recovery point objective (RPO), can be 0.
In addition, the data processing platform 100 stores the multiple copies of the data to be stored in different AZs rather than in a single AZ, so that the unavailability of some AZs does not cause data loss in all copies of a piece of data to be stored, which improves the reliability with which the data processing platform 100 stores data.
Further, when the data to be stored is specifically new data provided by a user and the partition to which the new data belongs, in the process of storing the multiple copies of the data to be stored, the distributed database 101 acquires multiple partition copies of the partition to which the data belongs, for example, by replicating the partition. The distributed database 101 then stores the multiple partition copies to RS nodes under different AZs included in the distributed database 101, for example, storing some partition copies to the RS node 1012 under AZ1 and the remaining partition copies to the RS node 1013 under AZ2. In addition, the file system 102 also acquires multiple data copies of the data; for example, the distributed database 101 sends the data copies obtained by replicating the data to the file system 102, or the file system 102 replicates the data sent by the distributed database 101 to obtain multiple data copies. The file system 102 then stores the multiple data copies to data nodes under different AZs included in the file system 102, for example, storing some data copies to a data node under AZ1 and the remaining data copies to a data node under AZ2.
Since the data processing platform 100 stores the multiple data copies in different AZs rather than in a single AZ, the unavailability of some AZs does not cause data loss in all data copies of a piece of data, which improves the reliability with which the data processing platform 100 stores data. Similarly, the data processing platform 100 stores the multiple partition copies in different AZs, so that the unavailability of some AZs does not cause data loss in all partition copies corresponding to a piece of data, which keeps the quality with which the data processing platform 100 reads data based on the partition copies at a high level.
In actual applications, based on the data processing platform 100 shown in FIG. 1, the computing devices (including RS nodes, data nodes, etc.) under multiple AZs can be deployed as a cluster distributed across AZs, and the data processing platform 100 can use the cluster to provide reliable data reading and writing services. As an example, the deployed cluster may be as shown in FIG. 2. The cluster 200 may include the computing devices under AZ1 and AZ2. Specifically, in addition to the RS node 1012, the data node 1021, and the data node 1022 under AZ1, the cluster 200 also includes other nodes under AZ1, such as a namenode 1031 for updating the namespace of the cluster 200 and recording the metadata of the data stored in the data nodes, a journalnode 1032 for storing and managing logs, a resource management device 1033 for managing the resources in AZ1, node management devices 1034 and 1035 for managing the data nodes, and a master server 1036 (such as an HMaster) for monitoring and managing the RS node 1012. A distributed application coordination service may also be configured in AZ1 to improve the high availability of AZ1. Similarly, AZ2 has a configuration similar to that of AZ1, as shown in FIG. 2. The computing devices under AZ1 and the computing devices under AZ2 can serve as primary and backup for each other; for example, when the namenode 1031 under AZ1 fails, the namenode 1041 under AZ2 can update the namespace of the cluster 200.
It should be noted that the cluster 200 shown in FIG. 2 is deployed based on the data processing platform 100 shown in FIG. 1 as an example. In actual applications, the cluster 200 may also include computing devices under more AZs, such as the computing devices under AZ3 shown by the dotted line in FIG. 2; this embodiment does not limit the specific architecture of a cluster deployed across AZs. Moreover, multiple clusters may be deployed across AZs in the data processing platform 100, with different clusters including computing devices under different AZs. For example, assuming the data processing platform 100 includes six AZs, AZ1 to AZ6, one cluster may be deployed based on AZ1 to AZ3 and another cluster based on AZ4 to AZ6, which is not limited in this embodiment.
In addition, in a further possible implementation, a standby cluster 300 may also be deployed for the cluster 200 to replicate and store the data stored in the cluster 200 (including partitions, data written into the file system 102, etc.), which can further improve the reliability of the data reading and writing services provided by the cluster 200. For example, asynchronous replication may be adopted between the cluster 200 and the cluster 300 to replicate the data in the cluster 200 to the cluster 300.
Next, various non-limiting specific implementations of data storage are described in detail.
Referring to FIG. 3, which is a schematic flowchart of a data storage method in an embodiment of the present application. The method can be applied to the data processing platform 100 shown in FIG. 1 above, and can specifically be executed by the distributed database 101 and the file system 102. Alternatively, the method may be executed by a device separately configured in the data processing platform 100, which is not limited in this embodiment. For ease of description, the following takes as an example the case where the data to be stored includes target data (such as the new data provided by the user above) and the partition to which the target data belongs, and the data storage method is executed by the distributed database 101 and the file system 102. The data storage method shown in FIG. 3 may specifically include:
S301: The distributed database 101 acquires multiple data copies of the target data and multiple partition copies of the partition to which the target data belongs.
The target data acquired by the distributed database 101 may be, for example, new data provided by a user to the data processing platform 100, or data generated by the data processing platform 100 based on a user's modification operation on data. Usually, the data processing platform 100 writes the primary key contained in the target data into a partition, so that the target data can subsequently be managed based on the primary key recorded in that partition; the data processing platform 100 also persistently stores the target data in the file system 102.
In this embodiment, the distributed database 101 can replicate the target data and the partition to which the target data belongs, thereby obtaining multiple data copies of the target data (the target data itself can also be regarded as a data copy) and multiple partition copies (the partition to which the target data belongs can also be regarded as a partition copy).
It should be noted that this embodiment takes the distributed database 101 replicating the partition as an example. In other practical implementations, the distributed database 101 may perform the replication of the partition to which the target data belongs to obtain the multiple partition copies, and send the target data to the file system 102, so that the file system 102 performs the replication of the target data to obtain the multiple data copies. This embodiment does not limit this.
S302: The distributed database 101 stores the multiple partition copies to RS nodes under different AZs included in the distributed database 101.
If the distributed database 101 stored all partition copies of the partition to which the target data belongs in one AZ, then when that AZ became unavailable due to a disaster, it would be difficult for the distributed database 101 to use the partition copies in that AZ to provide users with data reading and writing services for the target data. Therefore, in this embodiment, the distributed database 101 stores the multiple partition copies in at least two AZs, so that even if one of the AZs is unavailable, the distributed database 101 can still manage the target data through the partition copies stored in the remaining AZs. In this way, the target data in the data processing platform 100 does not become unreadable because a single AZ is unavailable, realizing AZ-level data fault tolerance.
The numbers of partition copies allocated to different AZs may be the same or different. For example, when the distributed database 101 includes three AZs and the number of copies of the partition to which the target data belongs is 3, the distributed database 101 can store the three partition copies in the three AZs respectively, one per AZ. Alternatively, when the distributed database 101 stores three partition copies in AZ1 and AZ2 (and none in AZ3), it can store one partition copy in AZ1 and two partition copies in AZ2.
Exemplarily, this embodiment provides the following four implementations of storing multiple partition copies to different AZs:
In a first possible implementation, the distributed database 101 can acquire allocation indication information for the multiple partition copies. The allocation indication information can include AZ identifiers and copy-count proportions, and can be used to indicate the proportions of the number of the multiple partition copies stored in different AZs. The distributed database 101 can then, according to the allocation indication information, store the multiple partition copies to RS nodes under different AZs in the distributed database. For example, suppose the availability zones include AZ1, AZ2, and AZ3, the number of partition copies is 4, and the allocation indication information is specifically the allocation expression "REP:AZ1[0.5],AZ2[0.25],AZ3[0.25]". The distributed database 101 can then, according to this allocation expression, store 2 (that is, 0.5*4) partition copies in AZ1, 1 (that is, 0.25*4) partition copy in AZ2, and 1 partition copy in AZ3. It should be noted that among the multiple partition copies written into the multiple AZs by the distributed database 101, one partition copy serves as the primary partition copy, based on which the distributed database 101 usually provides users with data reading and writing services, while the remaining partition copies serve as secondary partition copies, which are used to continue providing users with data reading and writing services when the primary partition copy is unreadable or data loss occurs.
In actual applications, the distributed database 101 can batch-process the copies of multiple partitions based on the above allocation indication information. For example, suppose there are currently 10 partitions and the number of copies of each partition is 4. When the allocation indication information is specifically "REP:AZ1[0.5],AZ2[0.25],AZ3[0.25]", the distributed database 101 can store 20 (that is, 0.5*4*10) partition copies in AZ1, 10 (that is, 0.25*4*10) partition copies in AZ2, and 10 partition copies in AZ3, with all three AZs storing a copy of each partition.
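As a non-limiting illustration, the following Java sketch (all names invented) parses an allocation expression of the form used above and converts a total replica count into per-AZ quotas, reproducing both the single-partition and the batch example.

```java
import java.util.*;

// Illustrative sketch: parse an allocation expression of the form
// "REP:AZ1[0.5],AZ2[0.25],AZ3[0.25]" and turn a replica count into a
// per-AZ quota, as in the examples above.
public class AllocationExpression {
    static Map<String, Integer> quotas(String expr, int totalReplicas) {
        Map<String, Integer> result = new LinkedHashMap<>();
        String body = expr.substring(expr.indexOf(':') + 1); // drop the "REP:" prefix
        for (String part : body.split(",")) {
            String az = part.substring(0, part.indexOf('['));
            double share = Double.parseDouble(part.substring(part.indexOf('[') + 1, part.indexOf(']')));
            result.put(az, (int) Math.round(share * totalReplicas));
        }
        return result;
    }

    public static void main(String[] args) {
        // 4 replicas of one partition: {AZ1=2, AZ2=1, AZ3=1}
        System.out.println(quotas("REP:AZ1[0.5],AZ2[0.25],AZ3[0.25]", 4));
        // Batch case from the text: 10 partitions x 4 replicas = 40 in total
        System.out.println(quotas("REP:AZ1[0.5],AZ2[0.25],AZ3[0.25]", 40)); // {AZ1=20, AZ2=10, AZ3=10}
    }
}
```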
The allocation indication information may be determined, for example, by the distributed database 101 according to the load of the RS nodes in each AZ. For example, assuming that the load of the RS node 1012 in AZ1 is 10%, the load of the RS node 1013 in AZ2 is 30%, and the load of the RS node 1014 in AZ3 is 35%, and the number of copies of each partition is set to 4, the distributed database 101 can determine that the allocation indication information is specifically "REP:AZ1[0.5],AZ2[0.25],AZ3[0.25]"; that is, AZ1 is used to store 50% of the partition copies, and AZ2 and AZ3 each store 25%, so as to balance the load among the RS nodes under different AZs. Of course, the allocation indication information may also be determined in other ways, such as being manually configured in advance by a technician, which is not limited in this embodiment.
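The embodiments do not fix a formula for deriving the allocation indication information from the RS-node loads, so the following Java sketch is only one assumed heuristic (all names invented): each AZ's share is proportional to its remaining capacity (1 - load), which, as in the example above, gives the lightly loaded AZ1 the largest share.

```java
import java.util.*;

// Illustrative sketch: derive allocation shares from RS-node loads by giving
// each AZ a share proportional to its remaining capacity (1 - load).
public class LoadBasedAllocation {
    static Map<String, Double> shares(Map<String, Double> loads) {
        double totalHeadroom = loads.values().stream().mapToDouble(l -> 1.0 - l).sum();
        Map<String, Double> shares = new LinkedHashMap<>();
        loads.forEach((az, load) -> shares.put(az, (1.0 - load) / totalHeadroom));
        return shares;
    }

    public static void main(String[] args) {
        Map<String, Double> loads = new LinkedHashMap<>();
        loads.put("AZ1", 0.10);
        loads.put("AZ2", 0.30);
        loads.put("AZ3", 0.35);
        // AZ1 gets the largest share (~0.40 here), AZ2 and AZ3 smaller ones.
        System.out.println(shares(loads));
    }
}
```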
As an implementation example, the distributed database 101 may include a master server (such as the master server 1036 in FIG. 2) and an AZ aware balancer. The master server can perceive the valid RS nodes, for example, determining them based on the heartbeat messages sent by the RS nodes, so that the master server can generate a network topology map based on the valid RS nodes. In this way, when the master server receives an allocation request for multiple partition copies generated by the distributed database 101, the master server can forward the allocation request to the AZ aware balancer, and the AZ aware balancer determines the AZs for storing the partition copies according to the AZ based network topology and the above allocation indication information. Further, the distributed database 101 may also be provided with a default balancer, and for some partitions that do not need to be replicated and stored (for example, partitions of lower importance), the distributed database 101 may use the default balancer to determine the AZ storing the partition; the default balancer may, for example, determine the AZ of the storage partition through a random algorithm or a load balancing policy. The master server, the AZ aware balancer, and the default balancer can all be implemented by software (such as processes) or hardware (such as separately configured computing devices), which is not limited in this embodiment.
In a second possible implementation, for the multiple partition copies of one partition, if the number of partition copies is greater than the number of AZs used to store them, the distributed database 101 may first allocate one partition copy to each AZ. For the remaining unallocated partition copies, the distributed database 101 may then determine, according to the current load of the RS nodes in each AZ, the AZ whose RS nodes carry a relatively small load, and allocate the remaining partition copies to that AZ for storage. In this way, the distributed database 101 can flexibly allocate the storage locations of partition copies according to the load of the RS nodes in the multiple AZs during each time period, so as to achieve load balancing among the RS nodes.
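A minimal sketch of this second implementation, assuming per-AZ load figures are available as fractions, is as follows; place_partition_replicas is a hypothetical helper, not part of the embodiment.

    def place_partition_replicas(replicas: int, az_loads: dict) -> list:
        # One copy per AZ first, then each remaining copy goes to the AZ
        # whose RS nodes currently carry the least load.
        assert replicas >= len(az_loads)  # the text assumes more copies than AZs
        placement = list(az_loads)        # one partition copy per AZ
        for _ in range(replicas - len(az_loads)):
            placement.append(min(az_loads, key=az_loads.get))
        return placement

    # 4 copies over 3 AZs: the extra copy lands in the least-loaded AZ1
    print(place_partition_replicas(4, {"AZ1": 0.10, "AZ2": 0.30, "AZ3": 0.35}))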
In a third possible implementation, since in actual application scenarios some data nodes in an AZ may fail or be overloaded with stored data, affecting the AZ's ability to store further data, the distributed database 101 may determine the AZs that store the partition copies according to the availability of each AZ. Specifically, during the process of writing partition copies to multiple AZs, the distributed database 101 may first obtain the availability of each AZ. The availability may be determined, for example, by calculating the ratio of the number of available data nodes in the AZ to the total number of data nodes in the AZ, and it reflects the usable state of the AZ. For example, if an AZ includes 10 data nodes of which 8 are available, the availability of the AZ may be 80% (that is, 8/10). Here, an available data node is a data node that is still capable of reading and writing data normally; correspondingly, when a data node suffers physical damage or data read/write errors, it may be determined to be faulty, that is, unavailable, and when a data node has stored so much data that it cannot store any more, it may also be determined to be unavailable. In practical applications, whether a data node is available may also be defined in other ways; for example, when the data to be stored does not have permission to be written to the data node, the data node may be determined to be unavailable with respect to that data. Then, the distributed database 101 may determine, according to the availability of each AZ, multiple first AZs whose availability is higher than a preset threshold and at least one second AZ whose availability is lower than the preset threshold, and write the multiple partition copies to the multiple first AZs. In this way, the number of new partition copies written to AZs with low availability can be reduced, which prevents the RS nodes in a low-availability AZ from being overloaded to the point where the amount of data they are responsible for reading exceeds the maximum amount of data that the AZ can store.
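The availability computation and the threshold-based split into first and second AZs described above may be sketched as follows; the 0.7 threshold is illustrative, as the embodiment only requires some preset value.

    def availability(available_nodes: int, total_nodes: int) -> float:
        # Ratio of usable data nodes to all data nodes in the AZ.
        return available_nodes / total_nodes

    def split_azs(az_nodes: dict, threshold: float = 0.7):
        # Partition AZs into first AZs (above the preset threshold) and
        # second AZs (below it); az_nodes maps AZ -> (available, total).
        first = [az for az, (avail, total) in az_nodes.items()
                 if availability(avail, total) > threshold]
        second = [az for az in az_nodes if az not in first]
        return first, second

    # An AZ with 8 of its 10 data nodes usable has availability 0.8
    print(split_azs({"AZ1": (8, 10), "AZ2": (5, 10), "AZ3": (9, 10)}))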
In a fourth possible implementation, while the distributed database 101 is writing partition copies to the AZs, the availability of some AZs may decrease or they may become unavailable; on this basis, the distributed database 101 may suspend writing partition copies to those AZs. Specifically, before writing a partition copy to one of the AZs, the distributed database 101 may first obtain the availability of that AZ. If the availability of the AZ is lower than a preset threshold, the distributed database 101 may refrain from writing the partition copy to the AZ and may instead create a cache queue for the AZ through the master server, such as a region in transition (RIT) queue, mark the partition copy as unassigned, and write it to the cache queue. The distributed database 101 then continues writing partition copies to the next AZ. In practical applications, if the master server acting as the "primary" is located in the low-availability AZ, the distributed database 101 may switch a master server in another AZ with higher availability from the "standby" role to the "primary" role, and create the cache queue for the AZ. For an AZ that does not yet store its partition copy, the distributed database 101 can monitor the availability of that AZ, for example by configuring an RIT chore node, and if the availability of the AZ rises above the preset threshold, the distributed database 101 may write the partition copies in the cache queue corresponding to that AZ to the RS nodes belonging to that AZ. In practice, if the partition copy intended for the low-availability AZ is a secondary partition copy, the distributed database 101 may write that secondary partition copy to the AZ after its availability recovers. If the partition copy stored in the low-availability AZ is the primary partition copy, then because the availability of that AZ is too low, the distributed database 101 may select a secondary partition copy from another AZ with higher availability to serve as the primary partition copy, so as to guarantee the high availability of the primary partition copy in the distributed database 101, while the partition copy in the low-availability AZ serves as a secondary partition copy.
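A simplified sketch of this deferral mechanism, with the cache queue modeled as a per-AZ in-memory queue and the actual write operation abstracted as a callback, might look as follows; the class and method names are hypothetical.

    from collections import defaultdict, deque

    class ReplicaAssigner:
        # Replicas bound for an AZ whose availability has fallen below the
        # threshold are parked, unassigned, in a per-AZ cache queue (akin to
        # the RIT queue) and flushed once the AZ recovers.

        def __init__(self, threshold: float = 0.7):
            self.threshold = threshold
            self.pending = defaultdict(deque)  # AZ id -> queued partition copies

        def assign(self, az: str, replica: str, az_availability: float, write):
            if az_availability < self.threshold:
                self.pending[az].append(replica)  # mark as unassigned and queue
            else:
                write(az, replica)                # normal write path

        def on_availability_restored(self, az: str, write):
            # Invoked by a monitor (for example, an RIT chore) once the AZ recovers.
            while self.pending[az]:
                write(az, self.pending[az].popleft())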
It is worth noting that the above four implementations are merely exemplary. In practical applications, the distributed database 101 may also store the multiple partition copies in different AZs in other ways; for example, by combining the above implementations, it may select multiple AZs that both have high availability and are physically close to one another to store the multiple partition copies.
S303: The file system 102 stores the multiple data copies to data nodes in different AZs of the file system 102.
The multiple data copies of the target data may be provided to the file system 102 by the distributed database 101, or obtained by the file system 102 replicating the target data. Similar to the storage of the partition copies, the file system 102 may store the obtained data copies in different AZs, specifically on data nodes in different AZs, with each AZ holding at least one data copy. In this way, even if one of the AZs storing the data copies becomes unavailable, the distributed database 101 can still read the target data from the remaining AZs, thereby preventing loss of the target data and achieving AZ-level data fault tolerance.
Exemplarily, this embodiment provides the following four implementations for storing multiple data copies in different AZs:
In a first possible implementation, similar to the way the distributed database 101 stores multiple partition copies, the file system 102 may also store the multiple data copies based on allocation indication information. In this case, the allocation indication information may include AZ identifiers and copy-quantity proportions, and may be used to indicate the proportion of the data copies to be stored on data nodes in each AZ. For example, assuming that the availability zones include AZ1, AZ2, and AZ3, that the number of data copies is 4, and that the allocation indication information is the allocation expression "REP:AZ1[0.5],AZ2[0.25],AZ3[0.25]", then according to this expression, 2 (that is, 0.5*4) data copies are stored in AZ1, 1 (that is, 0.25*4) data copy is stored in AZ2, and 1 data copy is stored in AZ3. In this way, the file system 102 can store the multiple data copies on data nodes in different AZs of the file system 102 according to the allocation indication information. The allocation indication information may be determined according to the load of the data nodes in each AZ, or manually configured by a technician, which is not limited in this embodiment.
Exemplarily, the file system 102 may be provided with a namenode remote procedure call server and an availability zones block placement policy (AZ BPP) node. When the file system 102 has a replication task for the target data, the namenode remote procedure call server may instruct the AZ BPP node to execute the replication process for the target data; the AZ BPP node may determine, according to the allocation indication information, the multiple AZs that store the data copies and the number of copies of the target data stored in each AZ, so as to execute the corresponding data storage process. Further, the file system 102 may also be provided with a default block placement policy node, and for data that does not need replicated storage (for example, data of lower importance), the file system 102 may use this default block placement policy node to determine the AZ in which the data is stored, for example through a random algorithm or a load balancing policy. The file system 102 may generate an AZ based network topology from the data nodes in each AZ, so that the default block placement policy node can determine the AZs that store the data copies according to that topology.
In a second possible implementation, the file system 102 may determine the AZs used to store the data copies according to the physical distances between different AZs. Specifically, the file system 102 may obtain the physical distances between its AZs and, according to those distances, store the multiple data copies on data nodes in multiple first AZs of the file system 102 that are physically close to one another; that is, each first AZ is within a distance threshold of at least one of the other first AZs. Meanwhile, the file system 102 also includes data nodes in at least one second AZ, where the physical distance between the second AZ and each first AZ exceeds the distance threshold (for example, 40 kilometers). In this way, when some of the first AZs fail, the data processing platform 100 can read a copy of the target data from the remaining first AZs, and since the first AZs are physically close to one another, the latency of reading a data copy from another available first AZ usually differs little from the latency of reading it from the first AZ that failed.
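One way to realize this selection, sketched below, checks candidate AZ groups against the distance threshold. Note that the sketch applies the stricter all-pairs check, whereas the text only requires each first AZ to be within the threshold of at least one other first AZ; pick_close_azs and the sample distances are illustrative assumptions.

    from itertools import combinations

    def pick_close_azs(distances: dict, k: int, limit_km: float = 40.0):
        # Choose k first AZs whose pairwise distances all stay within the
        # threshold; distances maps frozenset({az_a, az_b}) -> kilometres.
        azs = {az for pair in distances for az in pair}
        for group in combinations(sorted(azs), k):
            if all(distances[frozenset((a, b))] <= limit_km
                   for a, b in combinations(group, 2)):
                return list(group)
        return None  # no sufficiently close group of k AZs exists

    d = {frozenset(("AZ1", "AZ2")): 12.0, frozenset(("AZ1", "AZ3")): 18.0,
         frozenset(("AZ2", "AZ3")): 25.0, frozenset(("AZ1", "AZ4")): 90.0,
         frozenset(("AZ2", "AZ4")): 85.0, frozenset(("AZ3", "AZ4")): 70.0}
    print(pick_close_azs(d, 3))  # -> ['AZ1', 'AZ2', 'AZ3']; AZ4 is a second AZ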
In a third possible implementation, since in actual application scenarios some data nodes in an AZ may fail or be overloaded with stored data, affecting the AZ's ability to store further data, the file system 102 may determine the AZs that store the data copies according to the availability of each AZ. Specifically, the file system 102 may first obtain the availability of each AZ, which may be determined, for example, by calculating the ratio of the number of available data nodes in the AZ to the total number of data nodes in the AZ, and which reflects the usable state of the AZ. After obtaining the availability of each AZ, the file system 102 may store the multiple data copies on data nodes in multiple first AZs of the file system 102 whose availability is relatively high, with at least one data copy stored in each first AZ. Meanwhile, the file system 102 also includes data nodes in at least one second AZ whose availability is lower than that of the first AZs. In this way, the file system 102 can preferentially allocate data copies to AZs with higher availability for storage, thereby improving the reliability of data reading and writing on the data processing platform 100.
In a fourth possible implementation, while the file system 102 is writing data copies to the AZs, the availability of some AZs may decrease or they may become unavailable; on this basis, the file system 102 may suspend writing data copies to those AZs. Specifically, before writing the multiple data copies to multiple AZs, the file system 102 may first obtain the availability of each AZ, and may then store some of the data copies on data nodes in first AZs of the file system 102 whose availability is high (above a preset threshold), with at least one data copy stored in each first AZ. For at least one AZ of the file system 102 whose availability is low (below the preset threshold), that is, a second AZ other than the first AZs, the file system 102 may temporarily refrain from writing data copies. Then, when the availability of the second AZ rises to the preset threshold, the file system 102 stores the remaining unstored data copies in that second AZ. In this way, when storing multiple data copies, the file system 102 avoids migrating the data storage tasks of a low-availability AZ to other AZs, and thus avoids increasing the load of those other AZs.
For example, as shown in FIG. 4, for data A, the file system 102 may write the 3 data copies of data A to data node 1 in AZ1, data node 2 in AZ2, and data node 3 in AZ3 in sequence. When storing the data copies corresponding to data B, after writing a copy of data B to AZ1, if the file system 102 determines that the availability of AZ2 is low (for example, AZ2 is unavailable), it may suspend writing the copy of data B to AZ2 and first write a copy of data C to AZ3. When the availability of AZ2 later rises to the preset threshold, the file system 102 writes a copy of data B to AZ2 based on the data copy stored in AZ1 or AZ3.
It is worth noting that the above four implementations are merely exemplary. In practical applications, the file system 102 may also store the multiple data copies in different AZs in other ways; for example, by combining the above implementations, it may select multiple AZs that both have high availability and are physically close to one another to store the multiple data copies.
In practical applications, the distributed database 101 and the file system 102 may share multiple AZs, or may not share any AZ, which is not limited in this embodiment. In a further possible implementation, when the distributed database 101 and the file system 102 include multiple identical AZs, the file system 102 may, while storing the data copies, track the multiple target AZs that store the partition copies (the file system 102 also includes these target AZs), so that the file system 102 can store the multiple data copies on data nodes in these different target AZs as well. In this case, the file system 102 and the distributed database 101 may store data copies and partition copies in the same AZs based on the same allocation indication information. Thus, when providing data read/write services, the distributed database 101 can read the target data from local data nodes (that is, in the AZ where the partition copy resides) based on the partition copy, which reduces read latency and improves the efficiency with which the data processing platform 100 returns data to the user. In addition, when the distributed database 101 and the file system 102 do not share any AZ, each may execute its copy storage process independently.
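The co-location behavior described here can be sketched as follows, with the tracked partition-to-AZ mapping passed in directly and the block write abstracted as a callback; both are assumptions made for illustration.

    def colocate_data_copies(partition_azs: dict, write_block):
        # The file system tracks the target AZs holding each partition's
        # copies and writes the corresponding data copies to data nodes in
        # those same AZs, so that reads can stay local to the partition copy.
        for partition, target_azs in partition_azs.items():
            for az in target_azs:
                write_block(partition, az)  # one data copy per tracked target AZ

    # Partition 'p1' has copies in AZ1..AZ3, so its data copies go there too.
    colocate_data_copies({"p1": ["AZ1", "AZ2", "AZ3"]},
                         lambda p, az: print("data copy of", p, "->", az))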
Moreover, assuming that the number of AZs storing the data copies and partition copies is N (where N is a positive integer greater than 1), even if as many as N-1 AZs crash, the data processing platform 100 can continue to provide data read/write services to users based on the partition copies and data copies stored in the Nth AZ. Furthermore, when the other N-1 AZs resume operation, the data processing platform 100 can automatically restore the data copies, partition copies, and other data stored in those N-1 AZs from the partition copies and data copies stored in the Nth AZ, thereby achieving automated handling after failure recovery without intervention by an administrator.
It is worth noting that this embodiment is described using the example of replicating the target data and the partition to which the target data belongs and storing them across AZs. Other data may be stored by the data processing platform 100 in a similar manner to improve the reliability of its data read/write services. In practical applications, users may attach different levels of importance to different data. For example, for a given user, data A may be highly important while data B is of low importance (even losing data B would not significantly affect the user). The data processing platform 100 may therefore store the user's data A in the manner described above, while for the user's data B it may skip replicated storage, or store the multiple data copies of data B in a single AZ.
Further, the data processing platform 100 may also periodically balance the load of the RS nodes in each AZ. For example, the data processing platform 100 may periodically obtain the number of partition copies stored by the RS nodes in each AZ, and then load-balance the RS nodes across the different AZs of the distributed database 101 according to those counts; specifically, it may migrate partition copies from some RS nodes to other RS nodes so as to adjust the number of partition copies stored in different AZs and thereby reduce the difference in partition-copy counts between AZs. For example, if the RS nodes in AZ1, AZ2, and AZ3 store 1, 2, and 3 partition copies respectively, the data processing platform 100 may migrate one partition copy from an RS node in AZ3 to an RS node in AZ1, so that each AZ stores 2 partition copies.
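A sketch of this periodic balancing step, reproducing the numeric example above, might migrate one replica at a time from the most loaded AZ to the least loaded until the counts differ by at most one; the greedy strategy is an assumption, as the embodiment does not prescribe a particular algorithm.

    def rebalance(az_replica_counts: dict) -> list:
        # Repeatedly move one partition copy from the AZ holding the most
        # copies to the AZ holding the fewest, until the spread is at most 1.
        # Returns the (source AZ, target AZ) migrations performed.
        counts = dict(az_replica_counts)
        moves = []
        while max(counts.values()) - min(counts.values()) > 1:
            src = max(counts, key=counts.get)
            dst = min(counts, key=counts.get)
            counts[src] -= 1
            counts[dst] += 1
            moves.append((src, dst))
        return moves

    # {'AZ1': 1, 'AZ2': 2, 'AZ3': 3} -> one move AZ3 -> AZ1 leaves every AZ with 2
    print(rebalance({"AZ1": 1, "AZ2": 2, "AZ3": 3}))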
It should be noted that this embodiment uses as an example the case where the data to be stored includes both the target data and the partition to which the target data belongs. In other possible embodiments, the data to be stored may be only the target data; in that case, the data processing platform 100 may store multiple copies of the target data in different AZs through the file system 102, while the partition to which the target data belongs may be stored in a single AZ (which may hold one copy of the partition data, or several copies at once) or in multiple AZs. Alternatively, the data to be stored may be only the partition to which the target data belongs; in that case, the data processing platform 100 may store multiple partition copies in different AZs through the distributed database 101, while the target data may be stored in a single AZ of the file system 102 (which may hold one copy of the target data, or several copies at once) or in multiple AZs of the file system 102.
The data storage method provided by the embodiments of the present application has been introduced above with reference to FIG. 1 to FIG. 4. Next, the data storage apparatus and the computing device provided by the embodiments of the present application for executing the above data storage method are introduced with reference to the accompanying drawings.
FIG. 5 is a schematic structural diagram of a data storage apparatus provided by the present application. The data storage apparatus 500 may be applied to a data processing platform (such as the data processing platform 100 described above) that includes multiple availability zones AZ. The data storage apparatus 500 includes:
    an acquisition module 501, configured to acquire multiple copies of data to be stored;
    a storage module 502, configured to store the multiple copies of the data to be stored in different AZs of the data processing platform.
In a possible implementation, the data to be stored includes target data and/or a partition to which the target data belongs, and the storage module 502 is configured to:
    store multiple data copies of the target data in different AZs of the data processing platform;
    and/or, store multiple partition copies of the partition to which the target data belongs in different AZs of the data processing platform.
In a possible implementation, the data to be stored includes the target data and the partition to which the target data belongs, the data processing platform includes a distributed database and a file system, the distributed database includes partition server RS nodes in multiple availability zones AZ, the file system includes data nodes in multiple AZs, and the storage module 502 is configured to:
    store the multiple partition copies to RS nodes in different AZs of the distributed database;
    store the multiple data copies to data nodes in different AZs of the file system.
In a possible implementation, the storage module 502 is configured to:
    obtain the physical distances between different AZs in the file system;
    determine multiple first AZs in the file system according to the physical distances between different AZs in the file system, where the physical distances between the multiple first AZs do not exceed a distance threshold;
    store the multiple data copies to data nodes in the multiple first AZs.
In a possible implementation, the storage module 502 is configured to:
    obtain the availability of each AZ in the file system;
    store the multiple data copies, according to the availability of each AZ in the file system, to data nodes in multiple first AZs of the file system, where the file system further includes data nodes in at least one second AZ, and the availability of the second AZ is lower than the availability of the first AZs.
In a possible implementation, the storage module 502 is configured to:
    obtain the availability of each AZ in the file system;
    store, according to the availability of each AZ in the file system, some of the multiple data copies to data nodes in a first AZ of the file system, where the file system further includes data nodes in at least one second AZ, and the availability of the second AZ is lower than the availability of the first AZ;
    when the availability of the at least one second AZ rises to a preset threshold, store other data copies of the multiple data copies to data nodes in the at least one second AZ.
In a possible implementation, the storage module 502 is configured to:
    obtain allocation indication information for the multiple partition copies, where the allocation indication information indicates the proportions of the multiple partition copies to be stored in different AZs;
    store the multiple partition copies, according to the allocation indication information, to RS nodes in different AZs of the distributed database.
In a possible implementation, the allocation indication information is determined according to the load of the RS nodes in each AZ of the distributed database.
In a possible implementation, the distributed database and the file system both include multiple target AZs, and the storage module 502 is configured to:
    track the multiple target AZs that store the multiple partition copies;
    store the multiple data copies to data nodes in different target AZs.
In a possible implementation, after the multiple partition copies are stored to RS nodes in different AZs of the distributed database, the apparatus 500 further includes:
    a load balancing module 503, configured to perform load balancing on the RS nodes in different AZs of the distributed database, so as to adjust the number of partition copies stored in different AZs.
The data storage apparatus 500 according to the embodiments of the present application may correspondingly perform the methods described in the embodiments of the present application, and the above and other operations and/or functions of the modules of the data storage apparatus 500 are respectively intended to implement the corresponding procedures of the method embodiment shown in FIG. 3; for brevity, details are not repeated here.
FIG. 6 provides a computing device. As shown in FIG. 6, the computing device 600 may be, for example, the device used to implement the functions of the data processing platform 100 in the foregoing embodiments, and may specifically be used to implement the functions of the data storage apparatus 500 in the embodiment shown in FIG. 5.
The computing device 600 includes a bus 601, a processor 602, and a memory 603. The processor 602 and the memory 603 communicate through the bus 601.
The bus 601 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 6, but this does not mean that there is only one bus or one type of bus.
The processor 602 may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), a digital signal processor (DSP), or other processors.
The memory 603 may include volatile memory, such as random access memory (RAM). The memory 603 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 603 stores executable program code, and the processor 602 executes the executable program code to perform the data storage method performed by the aforementioned data processing platform 100.
The embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium may be any usable medium that a computing device can store, or a data storage device such as a data center containing one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state drive). The computer-readable storage medium includes instructions that instruct a computing device to execute the above data storage method.
The embodiments of the present application also provide a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the procedures or functions according to the embodiments of the present application are produced in whole or in part.
The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center in a wired manner (for example, over coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, over infrared, radio, or microwave).
The computer program product may be a software installation package. When any of the aforementioned data storage methods needs to be used, the computer program product may be downloaded and executed on a computing device.
The descriptions of the procedures or structures corresponding to the above drawings each have their own emphasis. For a part that is not described in detail in one procedure or structure, reference may be made to the related descriptions of other procedures or structures.

Claims (23)

  1. A data storage method, characterized in that the method is applied to a data processing platform, the data processing platform comprises multiple availability zones AZ, and the method comprises:
    acquiring multiple copies of data to be stored;
    storing the multiple copies of the data to be stored in different AZs of the data processing platform.
  2. The method according to claim 1, characterized in that the data to be stored comprises target data and/or a partition to which the target data belongs, and the storing the multiple copies of the data to be stored in different AZs of the data processing platform comprises:
    storing multiple data copies of the target data in different AZs of the data processing platform;
    and/or, storing multiple partition copies of the partition to which the target data belongs in different AZs of the data processing platform.
  3. The method according to claim 2, characterized in that the data to be stored comprises the target data and the partition to which the target data belongs, the data processing platform comprises a distributed database and a file system, the distributed database comprises partition server RS nodes in multiple availability zones AZ, the file system comprises data nodes in multiple AZs, and the storing the multiple copies of the data to be stored in different AZs of the data processing platform comprises:
    storing the multiple partition copies to RS nodes in different AZs of the distributed database;
    storing the multiple data copies to data nodes in different AZs of the file system.
  4. The method according to claim 3, characterized in that the storing the multiple data copies to data nodes in different AZs of the file system comprises:
    obtaining physical distances between different AZs in the file system;
    determining multiple first AZs in the file system according to the physical distances between different AZs in the file system, wherein the physical distances between the multiple first AZs do not exceed a distance threshold;
    storing the multiple data copies to data nodes in the multiple first AZs.
  5. The method according to claim 3, characterized in that the storing the multiple data copies to data nodes in different AZs of the file system comprises:
    obtaining an availability of each AZ in the file system;
    storing the multiple data copies, according to the availability of each AZ in the file system, to data nodes in multiple first AZs of the file system, wherein the file system further comprises at least one second AZ, and the availability of the second AZ is lower than the availability of the first AZs.
  6. The method according to claim 3, characterized in that the storing the multiple data copies to data nodes in different AZs of the file system comprises:
    obtaining an availability of each AZ in the file system;
    storing, according to the availability of each AZ in the file system, some of the multiple data copies to data nodes in a first AZ of the file system, wherein the file system further comprises at least one second AZ, and the availability of the second AZ is lower than the availability of the first AZ;
    when the availability of the at least one second AZ rises to a preset threshold, storing other data copies of the multiple data copies to data nodes in the at least one second AZ.
  7. The method according to any one of claims 1 to 6, characterized in that the storing the multiple partition copies to RS nodes in different AZs of the distributed database comprises:
    obtaining allocation indication information for the multiple partition copies, wherein the allocation indication information is used to indicate the proportions of the multiple partition copies to be stored in different AZs;
    storing the multiple partition copies, according to the allocation indication information, to RS nodes in different AZs of the distributed database.
  8. The method according to claim 7, characterized in that the allocation indication information is determined according to the load of the RS nodes in each AZ of the distributed database.
  9. The method according to claim 3, characterized in that the distributed database and the file system both comprise multiple target AZs, and the storing the multiple data copies to data nodes in different AZs of the file system comprises:
    tracking the multiple target AZs that store the multiple partition copies;
    storing the multiple data copies to data nodes in different target AZs.
  10. The method according to any one of claims 3 to 9, characterized in that after the storing the multiple partition copies to RS nodes in different AZs of the distributed database, the method further comprises:
    performing load balancing on the RS nodes in different AZs of the distributed database, so as to adjust the number of partition copies stored in different AZs.
  11. A data storage apparatus, characterized in that the apparatus is applied to a data processing platform, the data processing platform comprises multiple availability zones AZ, and the apparatus comprises:
    an acquisition module, configured to acquire multiple copies of data to be stored;
    a storage module, configured to store the multiple copies of the data to be stored in different AZs of the data processing platform.
  12. The apparatus according to claim 11, characterized in that the data to be stored comprises target data and/or a partition to which the target data belongs, and the storage module is configured to:
    store multiple data copies of the target data in different AZs of the data processing platform;
    and/or, store multiple partition copies of the partition to which the target data belongs in different AZs of the data processing platform.
  13. The apparatus according to claim 12, characterized in that the data to be stored comprises the target data and the partition to which the target data belongs, the data processing platform comprises a distributed database and a file system, the distributed database comprises partition server RS nodes in multiple availability zones AZ, the file system comprises data nodes in multiple AZs, and the storage module is configured to:
    store the multiple partition copies to RS nodes in different AZs of the distributed database;
    store the multiple data copies to data nodes in different AZs of the file system.
  14. The apparatus according to claim 13, characterized in that the storage module is configured to:
    obtain physical distances between different AZs in the file system;
    determine multiple first AZs in the file system according to the physical distances between different AZs in the file system, wherein the physical distances between the multiple first AZs do not exceed a distance threshold;
    store the multiple data copies to data nodes in the multiple first AZs.
  15. The apparatus according to claim 13, characterized in that the storage module is configured to:
    obtain an availability of each AZ in the file system;
    store the multiple data copies, according to the availability of each AZ in the file system, to data nodes in multiple first AZs of the file system, wherein the file system further comprises at least one second AZ, and the availability of the second AZ is lower than the availability of the first AZs.
  16. The apparatus according to claim 13, characterized in that the storage module is configured to:
    obtain an availability of each AZ in the file system;
    store, according to the availability of each AZ in the file system, some of the multiple data copies to data nodes in a first AZ of the file system, wherein the file system further comprises at least one second AZ, and the availability of the second AZ is lower than the availability of the first AZ;
    when the availability of the at least one second AZ rises to a preset threshold, store other data copies of the multiple data copies to data nodes in the at least one second AZ.
  17. The apparatus according to any one of claims 11 to 16, characterized in that the storage module is configured to:
    obtain allocation indication information for the multiple partition copies, wherein the allocation indication information is used to indicate the proportions of the multiple partition copies to be stored in different AZs;
    store the multiple partition copies, according to the allocation indication information, to RS nodes in different AZs of the distributed database.
  18. The apparatus according to claim 17, characterized in that the allocation indication information is determined according to the load of the RS nodes in each AZ of the distributed database.
  19. The apparatus according to claim 13, characterized in that the distributed database and the file system both comprise multiple target AZs, and the storage module is configured to:
    track the multiple target AZs that store the multiple partition copies;
    store the multiple data copies to data nodes in different target AZs.
  20. The apparatus according to any one of claims 13 to 19, characterized in that after the multiple partition copies are stored to RS nodes in different AZs of the distributed database, the apparatus further comprises:
    a load balancing module, configured to perform load balancing on the RS nodes in different AZs of the distributed database, so as to adjust the number of partition copies stored in different AZs.
  21. A computing device, characterized in that it comprises a processor and a memory;
    wherein the processor is configured to execute instructions stored in the memory, so that the computing device performs the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, characterized in that it comprises instructions that, when run on a computing device, cause the computing device to perform the method according to any one of claims 1 to 10.
  23. A computer program product comprising instructions that, when run on a computing device, cause the computing device to perform the method according to any one of claims 1 to 10.
PCT/CN2021/142795 2021-10-28 2021-12-30 Data storage method and apparatus, and related device WO2023070935A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180006938.2A CN116635831A (en) 2021-10-28 2021-12-30 Data storage method and device and related equipment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202131049367 2021-10-28

Publications (1)

Publication Number Publication Date
WO2023070935A1 true WO2023070935A1 (en) 2023-05-04

Family

ID=86160103

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/142795 WO2023070935A1 (en) 2021-10-28 2021-12-30 Data storage method and apparatus, and related device

Country Status (2)

Country Link
CN (1) CN116635831A (en)
WO (1) WO2023070935A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117240873B (en) * 2023-11-08 2024-03-29 阿里云计算有限公司 Cloud storage system, data reading and writing method, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050223277A1 (en) * 2004-03-23 2005-10-06 Eacceleration Corporation Online storage system
CN104050102A (en) * 2014-06-26 2014-09-17 北京思特奇信息技术股份有限公司 Object storing method and device in telecommunication system
CN107943867A (en) * 2017-11-10 2018-04-20 中国电子科技集团公司第三十二研究所 High-performance hierarchical storage system supporting heterogeneous storage
CN111104069A (en) * 2019-12-20 2020-05-05 北京金山云网络技术有限公司 Multi-region data processing method and device of distributed storage system and electronic equipment
CN111782152A (en) * 2020-07-03 2020-10-16 深圳市欢太科技有限公司 Data storage method, data recovery device, server and storage medium

Also Published As

Publication number Publication date
CN116635831A (en) 2023-08-22

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202180006938.2

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21962276

Country of ref document: EP

Kind code of ref document: A1