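WO2023066222A1 - Data processing method, apparatus, electronic device, storage medium, and program product - Google Patents

Data processing method, apparatus, electronic device, storage medium, and program product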

Info

Publication number
WO2023066222A1
Authority
WO
WIPO (PCT)
Prior art keywords
partition
data
space
key
range
Prior art date
Application number
PCT/CN2022/125826
Other languages
English (en)
French (fr)
Inventor
熊亮春
潘安群
雷海林
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2023066222A1
Priority to US18/450,577 (US20230394024A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/23 Updating
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Definitions

  • the present application relates to the field of data processing and database technology, and specifically, the present application relates to a data processing method, device, electronic equipment, computer-readable storage medium and computer program product.
  • the data of a large table can be divided into multiple small subsets called partitions through a partition table.
  • the partition table can be divided into three types: range, list and hash.
  • data redistribution is generally completed by copying a partition table that needs to participate in redistribution, and performing data synchronization based on new data distribution rules in the copied table.
  • this method requires additional storage space to store a full amount of data, and it takes a long time to complete data redistribution, which may easily lead to business congestion.
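  • Embodiments of the present application provide a data processing method, apparatus, electronic device, computer-readable storage medium, and computer program product, which help to solve the problems that data redistribution occupies a large amount of storage space and that its long execution time leads to business congestion. The technical solutions are as follows:
  • a data processing method, executed in an electronic device, the method including:
  • acquiring data redistribution information, where the data redistribution information is used to represent a new partition plan for a partition table, and the partition table is a table for data distribution based on a partition key;
  • creating a corresponding partition space for each partition specified by the partition plan, and recording in each partition space the current data range of the partition corresponding to that partition space, where the data range includes one of a value range of the partition key and a value list of the partition key;
  • updating the data range recorded in each partition space based on the partition plan, and using the updated data range of each partition space to determine the correspondence between each partition space and the partitions of the partition table; and
  • updating the data distribution in the partitions of the partition table based on the correspondence.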
  • a data processing device includes:
  • An acquisition module configured to acquire data redistribution information, where the data redistribution information is used to represent a new partition plan for a partition table, where the partition table is a table for data distribution based on a partition key;
  • A creation module configured to create a corresponding partition space for each partition specified by the partition plan, and to record in each partition space the current data range of the partition corresponding to that partition space, where the data range includes one of a value range of the partition key and a value list of the partition key;
  • An update module configured to update the data range recorded in each partition space based on the partition plan, and use the updated data range of each partition space to determine the correspondence between each partition space and the partitions of the partition table;
  • a distribution module configured to update data distribution in partitions of the partition table based on the correspondence.
  • an electronic device is provided, and the electronic device includes:
  • one or more processors;
  • a memory; and
  • one or more computer programs, where the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to execute the above data processing method.
  • a computer-readable storage medium is provided, the computer storage medium is used for storing computer instructions, and when the computer instructions are run on a computer, the computer can execute the above data processing method.
  • a computer program product including a computer program or an instruction, and when the computer program or instruction is executed by a processor, the steps of the above data processing method are implemented.
  • FIG. 1 is a schematic diagram of performing data redistribution for a partition table in the related art
  • FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 3 is a schematic flow diagram of a data processing method provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of data redistribution for a partition table in a data processing method provided in an embodiment of the present application
  • FIG. 5 is a schematic diagram of data redistribution for a partition table in a data processing method provided in an embodiment of the present application
  • FIG. 6 is a flowchart of a task preparation stage in a data processing method provided in an embodiment of the present application.
  • FIG. 7 is a flow chart of the data movement stage in a data processing method provided in an embodiment of the present application.
  • FIG. 8 is a flow chart of using an old version partition table to perform read and write operations in a data processing method provided by an embodiment of the present application
  • FIG. 9 is a flow chart of reading and writing operations using a new version of the partition table in a data processing method provided by an embodiment of the present application.
  • Fig. 10a is a schematic diagram of executing the first step in an application example of a data processing method provided by the embodiment of the present application;
  • Fig. 10b is a schematic diagram of executing the second step in an application example of a data processing method provided by the embodiment of the present application;
  • Fig. 10c is a schematic diagram of executing the third step in an application example of a data processing method provided by the embodiment of the present application;
  • Fig. 10d is a schematic diagram of executing the fourth step in an application example of a data processing method provided by the embodiment of the present application;
  • Fig. 10e is a schematic diagram of executing the fifth step in an application example of a data processing method provided by the embodiment of the present application;
  • Fig. 10f is a schematic diagram of executing the sixth step in an application example of a data processing method provided by the embodiment of the present application;
  • FIG. 11 is a schematic structural diagram of a data processing device provided in an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • "connected" or "coupled" as used herein may include a wireless connection or wireless coupling.
  • the term "and/or" used herein indicates at least one of the items defined by the term; for example, "A and/or B" indicates implementation as "A", or implementation as "B", or implementation as "A and B".
  • Database: in short, a database can be regarded as an electronic file cabinet, that is, a place where electronic files are stored. Users can add, query, update, and delete the data in the files.
  • A database is a collection of data that is stored together in a certain way, can be shared by multiple users, has as little redundancy as possible, and is independent of the application program.
  • Database Management System (DBMS): a software system designed to manage databases, which generally provides basic functions such as storage, retrieval, security, and backup.
  • Database management systems can be classified according to the database model they support, such as relational or XML (Extensible Markup Language); according to the type of computer supported, such as server clusters or mobile phones; according to the query language used, such as SQL (Structured Query Language) or XQuery; according to their performance focus, such as maximum scale or maximum running speed; or in other ways.
  • some DBMSs are able to cross categories, e.g., by supporting multiple query languages at the same time.
  • Distributed database technology combines database technology and distributed technology. Specifically, it refers to a database technology that logically unifies data that is stored on geographically dispersed database nodes but belongs to the same system. It features both coordination between databases and distribution of data. Such a distributed database management system does not emphasize centralized control of the system, but rather the autonomy of each database node.
  • Partition table: used to divide the data of a large table into many small subsets called partitions.
  • the partition table can be the result of distributing data across the entire distributed database according to certain fields (partition keys) specified by the user when creating a table in the TDSQL distributed database. The methods of distributing data between nodes in the cluster according to these fields usually include the following: hash (the partition key of each record in the table is hashed to partition the table data), range (data is distributed to different cluster storage nodes according to the range of the partition key), and list (user data is distributed by key value according to a value list specified by the user).
  • Data rebalance: when the user modifies the data distribution rules of the partition table, the data needs to be redistributed according to the new rules so that the data storage conforms to the new distribution rules defined by the user.
  • Partition: the smallest unit used to actually store a certain range of data. It can be a file in the operating system and is used to store a specific range of the data of the table.
  • the name in different databases may be different (it can be a file of the operating system, or a part of the range in the file).
  • in the related-art scheme, the data range of every partition is treated as changed, although a user's redistribution operation often does not involve the data of all partitions and concentrates on only certain partitions.
  • implementing the above scheme requires enough additional storage space to hold a full copy of the data (or storage space equivalent to the partitions participating in redistribution), and also requires a background data-copy process whose duration the user can clearly perceive (the execution time of the data redistribution operation). The execution time is relatively long, and the redistribution process includes a period during which the table is locked. During this period user business is interrupted because the table is locked, so user business is blocked.
  • the present application proposes a data processing method, device, electronic equipment, computer-readable storage medium, and computer program product.
  • the present application achieves data redistribution by adding partition spaces, which does not require additional storage space to store a full copy of the data. In addition, the data redistribution of the present application does not require data synchronization for the full data of the partition table. This helps to save storage space, shorten the execution time of data redistribution, improve the efficiency of data processing, and reduce the occurrence of business congestion.
  • FIG. 2 is a schematic diagram of a system architecture to which a data processing method can be applied according to an embodiment of the present application.
  • the system architecture may include a terminal 100, a server 200 and a database 300.
  • the user can initiate data redistribution operations and data read and write operations through the terminal 100.
  • the terminal 100 can communicate with the server 200 through a network connection or the like.
  • the database 300 supports data services of the server 200.
  • the database 300 may correspond to multiple databases in different geographic spaces.
  • the embodiment of the present application provides a data processing method, for example, executed in an electronic device, where the electronic device is, for example, the server 200 or the database 300 .
  • the method includes the following steps S101-S104:
  • Step S101 Obtain data redistribution information, which is used to represent a new partition plan for a partition table, which is a table for data distribution based on a partition key.
  • the partition plan can specify the number of new partitions and the range of data to be stored in each partition.
  • the data range may include, for example, one of a value range of the partition key and a value list of the partition key.
  • the partition key may correspond to some fields of the data to be stored.
  • for range partitioning, partition selection can be performed according to the value range (key value range) given when the partition key is defined and the actual value of the partition key, and the data can then be stored in the corresponding partition.
  • for list partitioning, partition selection can be performed according to the value list given when the partition key is defined and the actual value of the partition key, and the data can then be stored in the corresponding partition.
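The two selection rules above can be illustrated with a minimal Python sketch. It assumes half-open value ranges such as [0,1000) and integer partition keys; the function names `select_range_partition` and `select_list_partition`, and the partition names, are hypothetical and do not come from the application.

```python
from bisect import bisect_right


def select_range_partition(boundaries, names, key):
    """Pick the partition whose half-open range [low, high) contains `key`.

    `boundaries` holds the sorted lower bounds plus the final upper bound,
    so len(boundaries) == len(names) + 1.
    """
    if not boundaries[0] <= key < boundaries[-1]:
        raise KeyError(f"key {key} is outside every partition range")
    return names[bisect_right(boundaries, key) - 1]


def select_list_partition(value_lists, key):
    """Pick the partition whose value list contains `key`."""
    for name, values in value_lists.items():
        if key in values:
            return name
    raise KeyError(f"key {key} is not in any value list")


# Illustrative ranges: [0,1000), [1000,5000), [5000,10000)
BOUNDS = [0, 1000, 5000, 10000]
NAMES = ["set 1", "set 2", "set 3"]
```

For instance, `select_range_partition(BOUNDS, NAMES, 1000)` selects "set 2", because 1000 is the inclusive lower bound of the second range.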
  • the data redistribution information can be obtained by analyzing the operation language submitted by the user.
  • the operation language can be a database model definition language DDL (Data Definition Language), and the DDL language can be used to describe real-world entities stored in the database.
  • the data redistribution information may include modifying the value range or value list of its partition key for at least one partition.
  • the user can adjust the data stored in each partition by modifying the value range or value list corresponding to the partition key, so as to redistribute the data stored in the partition table. From the redistribution operation submitted by the user, the latest data redistribution information for the partition table, that is, the data redistribution rules, can be obtained.
  • Step S102 Create a corresponding partition space for each partition specified in the partition plan, and record in each partition space the current data range of the partition corresponding to that partition space, where the data range includes one of the value range of the partition key and the value list of the partition key.
  • the partition table before data redistribution may include a two-layer management structure. As shown in Figure 4, the first layer represents the corresponding table T1, and the second layer represents the three partitions corresponding to table T1, namely partition 1, partition 2 and partition 3. It can be understood that each partition serves as a data storage unit (set) of table T1 for storing data. In other words, each partition can also be called a data storage unit. Different partitions may be distributed on different physical storage entities, for example. Specifically, the metadata information of the partition table may record the table structure of the current partition table, the storage relationship of data, and the like.
  • an intermediate management structure, namely a partition space (PS), is proposed.
  • that is, a partition management layer (the partition space) is added in the middle of the original two-layer management structure, and the data organization structure of the partition table changes accordingly.
  • adding the partition space maps the logical partition range or list value (the value range or value list of the partition key) to the physical partition. That is, the partition space decouples the partition definition of the partition table from the data storage of the actual partitions (in the original two-layer management structure, the partition definition is bound to the data storage of the actual partitions, for example those storing table T1).
  • the data range recorded in the partition space includes one of the value range of the partition key and the value list of the partition key; that is, the data range can represent the range of data stored in each partition in the partition table.
  • step S102 when step S102 is executed, a corresponding partition space is generated for each current partition in the partition table, and the data range of the partition is recorded in the partition space corresponding to each partition.
  • when step S102 is executed and the partition plan indicates that a new partition is to be established, a new partition is created, a corresponding partition space is generated for the new partition, and the data range corresponding to the data to be stored in the new partition is recorded in that partition space.
  • step S102 can add a partition space for each partition included in the original partition table, that is, one partition corresponds to one partition space. It can also add partition spaces based on the data redistribution information combined with the data organization structure of the original partition table. In the latter case, new partitions may need to be added during redistribution, and partition spaces can be added accordingly for the newly added partitions.
  • the partition space in the partition table has a corresponding relationship with the partitions, and the range of data actually stored in the corresponding partition can be known from the data range recorded in the partition space.
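The three-layer organization described for step S102 (table, partition space, partition) can be sketched as plain data structures. This is only an illustration under stated assumptions: the class names, the `PS<n>` naming scheme, and the use of half-open integer ranges are hypothetical choices, not the application's actual metadata format.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Physical storage unit (a "set"); holds the actual rows."""
    name: str
    rows: dict = field(default_factory=dict)  # partition key -> record


@dataclass
class PartitionSpace:
    """Intermediate layer: records a logical range [low, high) and the
    physical partitions currently mapped to it."""
    name: str
    low: int
    high: int
    partitions: list = field(default_factory=list)


@dataclass
class PartitionTable:
    name: str
    spaces: list = field(default_factory=list)


def add_partition_spaces(table_name, current_partitions):
    """Create one partition space per existing partition, recording the
    partition's current data range in the new space.

    `current_partitions` is a list of (partition_name, low, high) tuples.
    """
    table = PartitionTable(table_name)
    for index, (pname, low, high) in enumerate(current_partitions, start=1):
        space = PartitionSpace(f"PS{index}", low, high, [Partition(pname)])
        table.spaces.append(space)
    return table


table = add_partition_spaces(
    "t1", [("set 1", 0, 1000), ("set 2", 1000, 5000), ("set 3", 5000, 10000)]
)
```

Keeping the range on the partition space rather than on the partition is what lets the logical definition change later without touching stored rows.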
  • Step S103 Based on the partition plan, update the data range recorded in each partition space, and use the updated data range of each partition space to determine the corresponding relationship between each partition space and the partitions of the partition table.
  • step S103 may perform the following operations:
  • the data range is the range definition of each partition specified by the data redistribution information, which can be logically expressed as a partition range or a list value. For a given partition, it can be the value range or the value list of the partition key corresponding to that partition.
  • the data redistribution information may include only the definition of the data range of the modified partitions, that is, the logical expression of the scope of data that can be stored in the modified partitions.
  • the partition space is used to record the data range of the partition, when data redistribution is performed, the data range of the data to be stored in each partition will change, and the final change will be the data range determined based on the data redistribution information .
  • the corresponding relationship between each partition space and the partitions of the partition table can be determined based on the updated data range recorded in the partition space.
  • Step S104 updating the data distribution in the partitions of the partition table based on the corresponding relationship.
  • the updated correspondence between the partition spaces and the partitions reflects the logical correspondence.
  • the actual execution moves the data stored in the partitions, so that the data stored in each partition (data storage in the physical sense) meets the requirements of the data redistribution information set by the user.
  • the data involved in the data processing method provided by the embodiment of the present application can be saved on the blockchain.
  • Step S104 can perform the following operations:
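  • for each partition, if the maximum value of the partition key corresponding to the data range of the data currently stored in the partition is greater than the maximum value of the partition key recorded in the partition space corresponding to the partition, the first data of the partition to be changed is determined based on the maximum value of the partition key in the partition space corresponding to the partition, and a new first partition is created for the first data of the partition to be changed;
  • if the minimum value of the partition key corresponding to the data range of the data currently stored in the partition is less than the minimum value of the partition key recorded in the partition space corresponding to the partition, the second data of the partition to be changed is determined based on the minimum value of the partition key in the partition space corresponding to the partition, and a new second partition is created for the second data of the partition to be changed.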
  • the embodiment of the present application may perform the following operations:
  • the determining the second data of the partition to be changed based on the minimum value of the partition key in the partition space corresponding to the partition, and creating a new second partition for the second data of the partition to be changed includes:
  • step S104 may be implemented as:
  • the first data is the data whose partition key lies between the maximum value of the partition key recorded in the partition space corresponding to the partition and the maximum value of the partition key corresponding to the data range of the data currently stored in the partition.
  • the second data is the data whose partition key lies between the minimum value of the partition key corresponding to the data range of the data currently stored in the partition and the minimum value of the partition key recorded in the partition space corresponding to the partition;
  • adding at least one partition space in the partition table in step S102 includes the following steps S1021-S1022:
  • Step S1021 correspondingly increase the partition space for each partition in the partition table.
  • the partition space may be increased in a one-to-one correspondence to the partitions currently included in the partition table.
  • the embodiment of the present application can adjust the data distribution of the data actually stored in each partition within the value range or key value of the partition key covered by each original partition.
  • Step S1022 Based on the data redistribution information, add at least one group of corresponding partition spaces and empty data storage units in the partition table, and the data storage information recorded in the partition space is determined based on the data redistribution information.
  • if, according to the definition of partitions in the data redistribution information, the value range or key values of the partition key logically covered by the newly set partitions exceed the value range or key values of the partition key logically covered by the original partitions in the original partition table, at least one group of a partition space and a partition with a corresponding relationship can be added to the partition table. The newly added partition is empty, that is, no data is stored in it in the physical sense, and the data range recorded in the partition space corresponding to the newly added partition is determined based on the data redistribution information.
  • for example, if the value range of the partition key covered by the original partitions is [0,10000) and the data redistribution information sets the value range of the partition key covered by the partitions to [0,15000), a group consisting of a corresponding partition space and an empty partition can be added. In this case, the value range of the partition key in the data range recorded in the newly added partition space is [10000,15000).
  • the implementation of this step may refer to the schematic content of FIG. 10a.
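The [0,10000) to [0,15000) example for step S1022 amounts to computing which part of the new coverage is not covered by the original partitions. A minimal sketch follows, assuming half-open integer ranges; the function name `uncovered_ranges` is hypothetical, and handling an extension below the old lower bound is an added assumption (the text only describes extension above the old upper bound).

```python
def uncovered_ranges(old_cover, new_cover):
    """Return the half-open ranges covered by the new plan but not by the
    original partitions. Each returned range would need a new partition
    space paired with an empty partition.
    """
    old_low, old_high = old_cover
    new_low, new_high = new_cover
    extra = []
    if new_low < old_low:        # assumption: extension below the old range
        extra.append((new_low, old_low))
    if new_high > old_high:      # case described in the text
        extra.append((old_high, new_high))
    return extra


new_space_ranges = uncovered_ranges((0, 10000), (0, 15000))
# new_space_ranges == [(10000, 15000)]
```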
  • the provided data processing method may also include the following step S1010:
  • Step S1010 Create a partition space list based on the partition space.
  • the partition space list is used to record at least one of the following kinds of information: the data range recorded in each partition space, the correspondence between partition spaces and partitions, the name of each partition space, and the correspondence between partition spaces and the partition table during the data redistribution process.
  • a system table may be generated to accommodate the increased partition space in the partition table to store the management information of each partition space.
  • Table 1

      Table name   Space name   Ranges          Partition name
      test t1      PS1          [0,1000)        set 1
      test t1      PS2          [1000,5000)     set 2
      test t1      PS3          [5000,10000)    set 3
  • each partition space PS corresponds one-to-one to a data storage unit (set), and the added partition space can be used to record the logical expression (data range) of the data actually stored in each partition, that is, the value range of the partition key.
  • Table 1 is also the partition space list created in step S1010, which can record the table name of the partition table (such as test t1), the partition space name (space name, such as PS1), the value range of the partition key (data range, such as [0,1000)), the partition name (set name, such as set 1) and other information.
  • Table 2

      Table name   Space name   Ranges          Partition name
      test t1      PS1          [0,1500)        set 1
      test t1      PS2          [1000,1500)     set 2
      test t1      PS2          [1500,4500)     set 3
      test t1      PS3          [4500,7500)     set 4
      test t1      PS4          [7500,15000)    set 5
  • Table 2 corresponds to the information saved in the partition space list after data redistribution.
  • the new partition space PS4 and the partitions set 4 and set 5 are added; in terms of correspondence, the same partition space can correspond to two partitions (PS2 corresponds to set 2 and set 3).
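The partition space list can be pictured as a small system table whose rows are queried to find which partitions belong to a partition space. The sketch below copies the rows of Table 2 as listed; the tuple layout and the function name `partitions_by_space` are hypothetical and only illustrate the one-to-many correspondence between a partition space and its partitions.

```python
from collections import defaultdict

# Rows of the partition space list after redistribution, as listed in Table 2:
# (table name, space name, (low, high) half-open range, partition name)
PARTITION_SPACE_LIST = [
    ("test t1", "PS1", (0, 1500), "set 1"),
    ("test t1", "PS2", (1000, 1500), "set 2"),
    ("test t1", "PS2", (1500, 4500), "set 3"),
    ("test t1", "PS3", (4500, 7500), "set 4"),
    ("test t1", "PS4", (7500, 15000), "set 5"),
]


def partitions_by_space(rows):
    """Group partition names by the partition space they are mapped to."""
    grouped = defaultdict(list)
    for _table, space, _range, partition in rows:
        grouped[space].append(partition)
    return dict(grouped)


mapping = partitions_by_space(PARTITION_SPACE_LIST)
# mapping["PS2"] == ["set 2", "set 3"]
```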
  • step S103 based on the predefined data range corresponding to at least one partition in the data redistribution information, the correspondence between the partition space and the partition is updated, including the following steps A1-A4:
  • Step A1 Determine the predetermined data range of each partition in the partition table based on the data redistribution information.
  • the data redistribution information may include only the partition definitions that need to be adjusted, or may include the definition of every partition in the partition table after redistribution. These correspond to the following two possible embodiments (taking the value range of the partition key as the data range as an example):
  • the predefined data range of set 2 corresponds to [0,1000)
  • set 2 corresponds to [1000,5000)
  • the predefined data range of set 3 corresponds to [2500,5000)
  • the predefined data range of each partition in the partition table is as follows: the partition space of set 1 corresponds to [0,1000), the predefined data range of set 2 corresponds to [1000,2500), and the predefined data range of set 3 corresponds to [2500,5000), where set 3 can be a new partition.
  • in embodiment (2), the data range given in the data redistribution information can be used directly as the predefined data range of each partition. Continuing the example in embodiment (1), it can be obtained directly from the data redistribution information that the predefined data range of set 1 corresponds to [0,1000), the predefined data range of set 2 corresponds to [1000,2500), and the predefined data range of set 3 corresponds to [2500,5000).
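Step A1 for the case where the redistribution information lists only the changed partitions can be sketched as a merge of the current ranges with the changes. This is a hedged illustration: it assumes the current ranges are set 1 = [0,1000) and set 2 = [1000,5000), and that the redistribution information redefines set 2 and introduces set 3; the function name `predefined_ranges` is hypothetical.

```python
def predefined_ranges(current, changes):
    """Merge the ranges named in the redistribution information into the
    current per-partition ranges, ordered by lower bound.

    Partitions absent from `changes` keep their current range; partitions
    absent from `current` are treated as new partitions.
    """
    merged = dict(current)
    merged.update(changes)
    return dict(sorted(merged.items(), key=lambda item: item[1]))


current = {"set 1": (0, 1000), "set 2": (1000, 5000)}
changes = {"set 2": (1000, 2500), "set 3": (2500, 5000)}
result = predefined_ranges(current, changes)
# result == {"set 1": (0, 1000), "set 2": (1000, 2500), "set 3": (2500, 5000)}
```

When the redistribution information already defines every partition, the merge is unnecessary and `changes` can be used as the predefined ranges directly.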
  • Step A2 Adjust the partitions contained in the partition table based on the new data range of the partition.
  • the predefined data range corresponding to each partition may not match the data range of the data actually stored in the current partition in the partition table. In the case of a mismatch, the following situations may exist (taking the value range of the partition key as the data range as an example):
  • the minimum value of the value range of the partition key logically corresponding to the data actually stored in one or more partitions is smaller than the minimum value of the corresponding predefined data range
  • the minimum value of the total value range of the partition key logically corresponding to the data actually stored in all partitions is greater than the minimum value of the predefined data range of the partition.
  • Step A3 Update the data range recorded in each partition space based on the predefined data range.
  • the predefined data range of each partition determined in step A1 may directly replace the data range already recorded in each partition space. It can be understood that each partition determined in step A1 has a one-to-one correspondence with a partition space, and reference may be made to the schematic content in FIG. 10c.
  • Step A4 Based on the data range recorded after the update of each partition space, update the corresponding relationship between the partition space and the partitions included in the adjusted partition table.
  • the partitions are sorted based on the value range of the partition key or the size of the key value; in step A2, adjusting the partitions contained in the partition table based on the predefined data range includes the following step A21:
  • Step A21 Perform the following adjustment operation steps A211-A213 in sequence for each data storage unit included in the partition table:
  • Step A211 Compare the current data storage information of the partition with the predefined data storage information.
  • Step A212 If the maximum value of the partition key corresponding to the partition is greater than the maximum value of the partition key in the predefined data storage information, split the partition based on the maximum value of the partition key in the predefined data storage information.
  • Taking the value range [1000,5000) of the partition key corresponding to the partition as an example: if the value range of the partition key in the predefined data storage information corresponding to the partition is [1500,4500), then 5000 is greater than 4500, and the partition is split at the partition key value 4500.
  • Step A213 If the minimum value of the partition key corresponding to the partition is smaller than the minimum value of the partition key in the predefined data range, split the partition based on the minimum value of the partition key in the predefined data range.
  • Taking the value range [1000,5000) of the partition key corresponding to the partition as an example: if the value range of the partition key in the predefined data range corresponding to the partition is [1500,4500), then 1000 is less than 1500, and the partition is split at the partition key value 1500.
  • Steps A212 and A213 may be performed at the same time, only one of step A212 or A213 may be performed, or neither of them may be performed.
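  • The comparison in steps A211-A213 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `split_points` and the representation of a range as a half-open tuple `(low, high)` are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open interval [low, high) of partition-key values


def split_points(current: Range, predefined: Range) -> List[int]:
    """Return the key values at which the partition would be split.

    Hypothetical helper mirroring step A213 (minimum side) and
    step A212 (maximum side).
    """
    cur_low, cur_high = current
    new_low, new_high = predefined
    points = []
    # Step A213: current minimum is below the predefined minimum.
    if cur_low < new_low:
        points.append(new_low)
    # Step A212: current maximum is above the predefined maximum.
    if cur_high > new_high:
        points.append(new_high)
    return points


print(split_points((1000, 5000), (1500, 4500)))  # [1500, 4500]: both steps apply
print(split_points((0, 1000), (0, 1500)))        # []: neither step applies
```

  An empty result corresponds to the case where neither step A212 nor step A213 is performed.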
  • splitting the partition based on the maximum value of the partition key in the predefined data range in step A23 includes the following step A231:
  • Step A231 Generate an empty first partition based on the maximum value of the partition key in the predefined data range and the maximum value of the partition key corresponding to the partition.
  • For the newly generated set 5, the currently stored data is empty, but the value range of its logically corresponding partition key can be determined from the maximum value of set 2, 5000, and the maximum value of the partition key in the predefined data range, 4500; that is, the value range of the partition key corresponding to set 5 is [4500,5000).
  • For the newly generated set 6, the currently stored data is empty, but the value range of its logically corresponding partition key can be determined from the maximum value of set 3, 10000, and the maximum value of the partition key in the predefined data range, 7500; that is, the value range of the partition key corresponding to set 6 is [7500,10000).
  • splitting the partition based on the maximum value of the partition key in the predefined data range in step A23 also includes the following step A232:
  • Step A232 Mark the maximum split point in the partition based on the maximum value of the partition key in the predefined data range, and establish an association relationship between the partition and the first partition.
  • A maximum split point (max flag) can be marked in set 2 at the partition key value 4500, and a maximum split point (max flag) can be marked in set 3 at the partition key value 7500.
  • An association relationship between set 2 and set 5 can be established, as shown in Figure 10c. Since the value range of the partition key corresponding to the newly generated set 6 is part of the value range of the partition key corresponding to set 3 in the original partition table, an association relationship between set 3 and set 6 can also be established.
  • splitting the partition based on the minimum value of the partition key in the predefined data range in step A24 includes the following steps A241:
  • Step A241 Generate an empty second partition based on the minimum value of the partition key corresponding to the partition and the minimum value of the partition key in the predefined data range.
  • For the newly generated set 4, the value range of its logically corresponding partition key can be determined from the minimum value of set 2, 1000, and the minimum value of the partition key in the predefined data range, 1500; that is, the value range of the partition key corresponding to set 4 is [1000,1500).
  • splitting the partition based on the minimum value of the partition key in the predefined data range in step A24 includes the following step A242:
  • Step A242 mark the minimum split point in the partition based on the minimum value of the partition key in the predefined data range, and establish an association relationship between the partition and the second partition.
  • a minimum split point min flag can be marked in set 2 based on the partition key value 1500.
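  • The generation of empty partitions, the marking of split points, and the establishment of association relationships described above can be sketched as follows. The class `DataSet`, its fields, and the function `prepare_split` are hypothetical names introduced only for illustration; the embodiment does not prescribe this layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataSet:
    """Hypothetical model of a set: a key range plus the rows stored in it."""
    name: str
    low: int
    high: int                        # half-open range [low, high)
    rows: Dict[int, str] = field(default_factory=dict)
    min_flag: Optional[int] = None   # minimum split point
    max_flag: Optional[int] = None   # maximum split point
    links: List["DataSet"] = field(default_factory=list)


def prepare_split(old: DataSet, new_low: int, new_high: int,
                  low_name: str, high_name: str) -> List[DataSet]:
    """Create empty sets for the ranges cut off `old`, mark flags, add links.

    No rows are moved here; only the logical structure changes.
    """
    created = []
    if old.high > new_high:
        first = DataSet(high_name, new_high, old.high)   # empty first partition
        old.max_flag = new_high                          # mark maximum split point
        old.links.append(first)                          # association relationship
        created.append(first)
    if old.low < new_low:
        second = DataSet(low_name, old.low, new_low)     # empty second partition
        old.min_flag = new_low                           # mark minimum split point
        old.links.append(second)
        created.append(second)
    return created


set2 = DataSet("set 2", 1000, 5000, rows={1200: "a", 3000: "b", 4800: "c"})
new_sets = prepare_split(set2, 1500, 4500, low_name="set 4", high_name="set 5")
for s in new_sets:
    print(s.name, (s.low, s.high), len(s.rows))
print(set2.min_flag, set2.max_flag, len(set2.rows))
```

  In this sketch set 2 still holds all three rows after the split is prepared, which reflects that the new sets are empty until data is physically moved.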
  • updating the data distribution of the partition table based on the updated correspondence between the partition space and the partition in step S104 includes the following steps S1041-S1042:
  • Step S1041 Based on the updated correspondence between partition space and partitions, perform the following data movement steps B1-B2 for each partition that needs to be split:
  • Step B1 Move the data corresponding to the partition key between the maximum value of the partition key in the predefined data range and the maximum value of the partition key corresponding to the partition to the corresponding first partition.
  • The part of the data actually stored in set 2 that belongs to set 5 can be moved to set 5; the data actually stored in set 2 is reduced accordingly.
  • The part of the data actually stored in set 3 that belongs to set 6 can be moved to set 6; the data actually stored in set 3 is reduced accordingly.
  • Step B2 Move the data corresponding to the partition key between the minimum value of the partition key corresponding to the partition and the minimum value of the partition key in the predefined data range to the corresponding second partition.
  • The part of the data actually stored in set 2 that belongs to set 4 can be moved to set 4.
  • Steps B1 and B2 process only the data that needs to be moved; data that does not need to be moved stays in its current partition.
  • Although the data needs to be locked while it is being moved, the amount of data to be moved is small and the required execution time is short, so the lock is held only briefly. As a result, data movement in the embodiment of the present application has very little impact on ongoing read and write operations (for example, user traffic).
  • Step S1042 Delete the maximum split point, the minimum split point and the association relationship between each partition.
  • The maximum split point, the minimum split point, and the association relationships between partitions represent how data moves during data redistribution, and they allow data to be read and written while the redistribution is in progress.
  • the association between each split point and the partition can be deleted.
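  • Steps B1, B2 and S1042 can be sketched as follows. The class `DataSet` and the function `move_and_clean` are hypothetical, and treating the maximum split point as inclusive (keys at or above it are moved) is an assumption that follows from the half-open ranges used in the examples; locking is omitted from this sketch.

```python
from typing import Dict, Optional


class DataSet:
    """Hypothetical in-memory stand-in for a set."""

    def __init__(self, rows: Optional[Dict[int, str]] = None):
        self.rows: Dict[int, str] = dict(rows or {})
        self.min_flag: Optional[int] = None       # minimum split point
        self.max_flag: Optional[int] = None       # maximum split point
        self.first: Optional["DataSet"] = None    # first partition (keys >= max_flag)
        self.second: Optional["DataSet"] = None   # second partition (keys < min_flag)


def move_and_clean(old: DataSet) -> int:
    """Move rows outside the split points, then delete flags and links."""
    moved = 0
    for key in sorted(old.rows):                  # sorted() copies the keys, so pop is safe
        if old.max_flag is not None and key >= old.max_flag:
            old.first.rows[key] = old.rows.pop(key)     # step B1
            moved += 1
        elif old.min_flag is not None and key < old.min_flag:
            old.second.rows[key] = old.rows.pop(key)    # step B2
            moved += 1
    old.min_flag = old.max_flag = None            # step S1042: delete split points
    old.first = old.second = None                 # step S1042: delete associations
    return moved


set2 = DataSet({1200: "a", 3000: "b", 4800: "c"})
set4, set5 = DataSet(), DataSet()
set2.min_flag, set2.second = 1500, set4
set2.max_flag, set2.first = 4500, set5
print(move_and_clean(set2), sorted(set2.rows), sorted(set4.rows), sorted(set5.rows))
```

  Only the two rows outside [1500,4500) are touched; the row with key 3000 is never read or rewritten.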
  • the provided data processing method further includes the following steps C1-C2 before completing the data distribution update of the partition table:
  • Step C1 In response to a processing operation on target data, if it is determined that the key value of the partition key of the target data is greater than the key value of the partition key at the maximum split point of any partition, or less than the key value of the partition key at the minimum split point of any partition, query the target data in that partition.
  • The processing operation on the target data can be a data read or write operation. That is, while data redistribution is in progress, when a read or write operation on the target data is received from a user or server, the location where the target data is actually stored can be obtained from the marked split points and the association relationships established among the partitions. It can be understood that querying the target data in the partition in step C1 presupposes that the target data has a corresponding storage location in the partition table.
  • Step C2 If the query obtains the storage location corresponding to the target data, return the processing result for the target data; if the storage location is not found, determine the storage location of the target data in the first partition or the second partition corresponding to the partition, based on the association relationship of the partition, and return the processing result for the target data.
  • Case 1 The target data processing operation is performed using the old version of the table. If the storage location of the target data is not found in the old set (a partition contained in the original partition table), and the target data lies outside the split point flag marked in a partition, the target data may have been moved to a set generated by splitting. The corresponding newly generated set then needs to be searched to check whether the target data is stored in the partition table.
  • Case 2 The target data processing operation is performed using the new version of the table. If the storage location of the target data is not determined in any partition of the partition table whose data organization structure has been adjusted, and the set (the old set in the original partition table) has links to other sets (newly generated sets), as determined from the association relationships between partitions, the search for the storage location of the target data needs to continue in the corresponding set (newly generated set). The process of querying the target data in the newly generated set is as follows:
  • If the target data is not outside the flag of the corresponding set (the old set in the original partition table), the target data does not belong to the data range that the set needs to split off. It can then be determined that the target data does not exist in the partition table, and the corresponding processing result (the target data was not found in the partition table) can be returned directly;
  • If the target data is outside the flag range of the corresponding set, the corresponding split range is located and the query for the target data continues there. If the storage location of the target data is not found, the target data likewise does not exist in the partition table, and the corresponding processing result can be returned directly. If the data is found within the flag range (the data to be moved in the partition that needs to be split), the data can be locked and the processing result for the target data returned. If the data is found within the flag range but has been marked as deleted, the data may already have been migrated.
  • In that case, the data can be queried in the newly generated partition (that is, the first or second partition). Here, the delete mark indicates that the data to be moved in the set that needs to be split has been migrated to the new set; other methods can also be used to indicate that the data has been migrated.
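  • The lookup of steps C1-C2 for a request that starts from the old set can be sketched as follows. The names `DataSet` and `lookup` are hypothetical, locking is omitted, and comparing with `>=` at the maximum split point is an assumption based on the half-open ranges in the examples.

```python
from typing import Dict, Optional


class DataSet:
    """Hypothetical in-memory stand-in for a set."""

    def __init__(self, rows: Optional[Dict[int, str]] = None):
        self.rows: Dict[int, str] = dict(rows or {})
        self.min_flag: Optional[int] = None      # keys below this move to `second`
        self.max_flag: Optional[int] = None      # keys at or above this move to `first`
        self.first: Optional["DataSet"] = None
        self.second: Optional["DataSet"] = None


def lookup(old: DataSet, key: int) -> Optional[str]:
    """Find `key` while redistribution is still in progress."""
    if key in old.rows:                          # found in the partition itself
        return old.rows[key]
    # Not found: only keys outside a split point can have been moved.
    if old.max_flag is not None and key >= old.max_flag:
        return old.first.rows.get(key)           # follow the association
    if old.min_flag is not None and key < old.min_flag:
        return old.second.rows.get(key)
    return None                                  # not in the partition table


set2 = DataSet({3000: "b", 4800: "c"})           # key 1200 has already been moved
set4 = DataSet({1200: "a"})
set5 = DataSet()                                 # key 4800 has not been moved yet
set2.min_flag, set2.second = 1500, set4
set2.max_flag, set2.first = 4500, set5
print(lookup(set2, 1200), lookup(set2, 4800), lookup(set2, 2000), lookup(set2, 4900))
```

  A key inside the split points that is missing from the old set is reported as absent immediately, without visiting any other set.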
  • the table T1V0 represents an old version of the partition table. From FIG. 5 , the data organization structure of the partition table to be redistributed can be seen. Table T1V1 represents the new version of the partition table. From Figure 5, we can see the updated data organization structure of the partition table that has already performed data redistribution.
  • Partition set 1 needs to be changed from the original [0,1000) to [0,1500):
  • The stored data in [1000,1500) needs to be reassigned to partition set 1 from set 2, which corresponds to [1000,5000) in the management space PS2 of partition set 2. Therefore, the data of the original set 2, [1000,5000), is first decomposed into set 4 [1000,1500) and set 2' [1500,5000); set 4 is then moved from the management range PS2 of partition set 2 to the management range PS1 of partition set 1, which completes the redistribution process of partition set 1.
  • The set 2' [1500,5000) split from partition set 2 in step 1.2 is further split into two new sets: set 2" [1500,4500) and set 5 [4500,5000);
  • Partition set 4 is a newly added partition, so the corresponding information needs to be added to the data dictionary; in addition, a corresponding set, set 7 covering [10000,15000), is added, which completes the redistribution process of partition set 4.
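  • The range arithmetic in this example can be checked with a short sketch that intersects the old set ranges with the new partition ranges. The function `overlay` is hypothetical. The ranges [0,1500) and [1500,4500) come from the description above; the ranges used here for the third and fourth partitions, [4500,7500) and [7500,15000), are assumptions inferred from the split points 4500 and 7500 and from the total range [0,15000).

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open interval [low, high)


def overlay(old_sets: List[Range],
            new_partitions: List[Range]) -> List[Tuple[Range, Range]]:
    """Return (piece, owning new partition) for every non-empty intersection."""
    pieces = []
    for p_low, p_high in new_partitions:
        for s_low, s_high in old_sets:
            low, high = max(p_low, s_low), min(p_high, s_high)
            if low < high:                      # the two ranges overlap
                pieces.append(((low, high), (p_low, p_high)))
    return pieces


old_sets = [(0, 1000), (1000, 5000), (5000, 10000)]                 # set 1, set 2, set 3
new_partitions = [(0, 1500), (1500, 4500), (4500, 7500), (7500, 15000)]  # last two assumed
for piece, owner in overlay(old_sets, new_partitions):
    print(piece, "->", owner)
```

  Under these assumptions the pieces are [0,1000) and [1000,1500) for the first partition, [1500,4500) for the second, [4500,5000) and [5000,7500) for the third, and [7500,10000) for the fourth. The range [10000,15000) intersects no old set, which matches set 7 being added as an empty set.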
  • Step 1 The user performs data redistribution operations on the partition table through DDL statements.
  • Step 2 Read the definition in the data dictionary of the original partition table (that is, the definition of each partition, such as the value range of the partition key corresponding to partition 1) and create a new partition table structure. An integer can be used to represent the version number of the partition table, and the version number plus 1 indicates the version of the updated partition table. Specifically, the embodiment of the present application only adjusts the structure of the partition table; the original partition table used in the description corresponds to the old version of the table, and there is no need to create a new partition table.
  • Step 3 Traverse the original table partition structure.
  • Step 4 Determine whether the redistribution operation will generate a new partition space; if so, go to step 5; if not, return to step 3 and traverse the next partition space.
  • Step 5 Add an empty partition space to the new partition table structure.
  • Step 6 Determine whether all partition spaces have been traversed; if yes, go to step 7; if not, go back to step 3.
  • Step 7. Traversing the partition structure (that is, the partition space).
  • Step 8 Determine whether redistribution needs to be performed on the current partition space; if not, go to step 22 to process the next partition space; if yes, go to step 9.
  • Step 9 Determine whether the range needs to be extended; if so, a left-end partition and a right-end partition can be added (the list partition type is handled similarly to the range type, so the range partition is used for illustration in Figure 6). For a new range, an empty set can be added directly and registered with the partition space at the same time. A new range means that the original partition table contains no data in that range, so no subsequent data relocation is involved; once the partition space and range information in the data dictionary have been modified, the operation of adding a range is complete. If not, go to step 13.
  • Step 10 Determine whether to increase the left end range: if yes, go to step 12; if not, go to step 11.
  • Step 11 Add a right end set.
  • Step 12 Add a left end set.
  • Step 13 Determine whether the range needs to be split; if yes, split the left end partition and the right end partition; if not, go to step 22.
  • Step 14 Determine whether it is a left-end split; in either case, go to step 15.
  • Step 15 Find the split data boundary point (split point).
  • Step 16 Mark a flag at the split point (a left-end split is marked with the minimum split point min-flag, and a right-end split with the maximum split point max-flag).
  • Step 17 Create a new set.
  • the current set is empty without any stored data.
  • Step 18 Establish a connection from the original set to the new set.
  • Step 19 Establish a connection from the newly created set to the new set.
  • Step 20 Move the new set to the corresponding partition space and complete the registration.
  • Step 21 Register the split in the data movement task list.
  • Step 22 Process the next partition space.
  • Step 23 Determine whether all the partition spaces of the partition table have been processed; if not, return to step 7; if yes, enter step 24.
  • Step 24 Update the definition of each partition in the partition table so that the definition of the new version of the partition table becomes visible, and at the same time notify all nodes to update the definition (the notification may also be omitted).
  • Step 25 The redistribution preparation phase ends.
  • the data movement phase of the redistribution operation occurs after the preparation phase ends, and the corresponding preparation work has been completed.
  • the data redistribution of the partition table has been logically completed. However, in fact, the data has not been relocated, but it does not affect the user's reading and writing of data.
  • Step 1 Start the data movement task.
  • Step 2 Load the task list that needs to migrate data prepared in the first stage.
  • Step 3 Get a task and start the data migration task.
  • Step 4 Obtain a small batch of data, lock it, and start relocating it (the batch size can be a number of records specified by the user, or hard-coded). The embodiment of the present application relocates data in small batches so that user read and write transactions are blocked for as short a time as possible during data relocation.
  • Step 5 Determine whether the locking succeeded; if not, user transactions are being processed on this part of the data, so go to step 6; if so, go to step 7.
  • Step 6 Wait for a while and reacquire the task.
  • Step 7. Migrate the data to the new Set.
  • Relocation does not modify the data itself; it only moves the data from the original Set range to the new Set. This process does not modify the metadata.
  • Step 8 Submit the transaction and release the lock on the relocated data in step 4.
  • Step 9 Determine whether there is still data that needs to be relocated within the range; if not, go to step 10; if yes, jump to step 4 to relocate the next batch of data.
  • Step 10 Determine whether there are tasks that need to be relocated; if not, go to step 11; if yes, go to step 3 and continue to the next data migration task.
  • Step 11 Refresh the table structure of all nodes to ensure that the table structure of all nodes has been synchronized to the latest.
  • Step 12. Clean up all flags in the Set and links to other Sets (as shown in Figure 10f).
  • Step 13 The data migration is over.
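  • The batch loop of the data movement phase can be sketched as follows. This is a simplified illustration under stated assumptions: sets are plain dictionaries, a single `threading.Lock` stands in for the locks a real system would take on the batch rows, and the function `migrate_range` with its parameters `batch_size` and `max_retries` is hypothetical.

```python
import threading
from typing import Dict, List


def migrate_range(src: Dict[int, str], dst: Dict[int, str], low: int, high: int,
                  lock: threading.Lock, batch_size: int = 2,
                  max_retries: int = 100) -> int:
    """Move rows with low <= key < high from src to dst in small locked batches."""
    moved = 0
    retries = 0
    while True:
        # Obtain the next small batch of keys that still need to be moved.
        batch: List[int] = sorted(k for k in src if low <= k < high)[:batch_size]
        if not batch:                        # nothing left in this range
            return moved
        # Try to lock; acquire() with a timeout doubles as "wait for a while".
        if not lock.acquire(timeout=0.01):
            retries += 1
            if retries > max_retries:        # safety valve for this sketch only
                raise TimeoutError("could not lock batch")
            continue
        try:
            for key in batch:                # move rows without modifying them
                dst[key] = src.pop(key)
            moved += len(batch)
        finally:
            lock.release()                   # release the lock once the batch is done


set3 = {5200: "x", 7600: "y", 8100: "z", 9900: "w"}
set6: Dict[int, str] = {}
print(migrate_range(set3, set6, 7500, 10000, threading.Lock()))
print(sorted(set3), sorted(set6))
```

  Because the lock is released after every batch, a concurrent user transaction waits at most for one small batch rather than for the whole range.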
  • Step 1 Start read and write operations.
  • Step 2 Determine whether qualifying data is found; if not, proceed to step 3; if yes, proceed to step 6.
  • Step 3 Determine whether the data is outside the flag (data to be migrated) in step 17 of the first stage, that is, outside the split point marked in the set in step 16 of the first stage; if not, the data does not exist in the partition table, so jump directly to step 8; if so, follow the link (the association relationship between sets) to the connected Set and continue to search for the data.
  • Step 4 Determine whether the data is found; if not, the data does not exist in the partition table, so jump directly to step 8; if so, go to step 6.
  • Step 5 Lock the data according to the native logic.
  • Step 6 After processing, submit the transaction.
  • Step 7 The read and write operation ends.
  • Step 1 Start read and write operations.
  • Step 2 Determine whether the data is found; if not, go to step 3; if yes, go to step 7.
  • Step 3 Determine whether there are links pointing to other Sets on the Set; if not, it indicates that the data does not exist in the partition table; go to step 9; if yes, go to step 4.
  • Step 4 Find the corresponding Set according to the link.
  • Step 5 Determine whether the data is within the flag range of the corresponding Set. If the old Set carries the flag and the new Set has been located, then being able to reach the original Set from the new Set indicates that the data migration has not been completed. If not, there is no corresponding data (the data does not exist in the partition table), so jump to step 9; if so, jump to step 6.
  • Step 6 Determine whether qualifying data is found; if not, there is no corresponding data (the data does not exist in the partition table), so jump to step 9; if so, jump to step 7.
  • Step 7 Determine whether the data is marked as deleted; if it is, this part of the data has been migrated to the new set; if not, go to step 9.
  • Step 8 Return to the new set to find the data again.
  • Step 9 Lock according to the transaction type (such as reading and writing).
  • Step 10 After processing, submit the transaction
  • Step 11 The read and write operation ends.
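  • The read path that starts from the new version of the table can be sketched as follows. The names `DataSet`, `in_move_range` and `read_new_version`, as well as the row layout `(value, marked_deleted)`, are assumptions made for this illustration; locking and transaction submission are omitted.

```python
from typing import Dict, Optional, Tuple

Row = Tuple[str, bool]  # (value, marked_deleted): hypothetical row layout


class DataSet:
    """Hypothetical in-memory stand-in for a set."""

    def __init__(self, rows: Optional[Dict[int, Row]] = None):
        self.rows: Dict[int, Row] = dict(rows or {})
        self.link: Optional["DataSet"] = None   # new set -> old set it was split from
        self.min_flag: Optional[int] = None     # set on the old set only
        self.max_flag: Optional[int] = None


def in_move_range(old: DataSet, key: int) -> bool:
    """True if `key` lies in the part of `old` that is being moved out."""
    return ((old.max_flag is not None and key >= old.max_flag) or
            (old.min_flag is not None and key < old.min_flag))


def read_new_version(new_set: DataSet, key: int) -> Optional[str]:
    row = new_set.rows.get(key)                 # look in the new set first
    if row is not None:
        return row[0]
    old = new_set.link                          # is there a link to another set?
    if old is None or not in_move_range(old, key):
        return None                             # data does not exist in the table
    row = old.rows.get(key)                     # look in the old set
    if row is None:
        return None
    value, marked_deleted = row
    if marked_deleted:
        # The row was migrated in the meantime: search the new set again.
        moved = new_set.rows.get(key)
        return moved[0] if moved is not None else None
    return value


set2 = DataSet({1200: ("a", False), 3000: ("b", False), 4800: ("c", True)})
set2.min_flag, set2.max_flag = 1500, 4500
set4, set5 = DataSet(), DataSet({4800: ("c", False)})
set4.link = set5.link = set2
print(read_new_version(set4, 1200), read_new_version(set5, 4800),
      read_new_version(set5, 4900))
```

  In a single-threaded run the second search of the new set returns the same answer as the first; the branch matters only when a migration commits between the two lookups.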
  • The above-mentioned third stage and fourth stage are actually implemented on the same copy of the data. This embodiment describes the two stages separately so that the implementation process of operations on the target data during data redistribution can be understood more clearly.
  • Fig. 10a illustrates the first step in the implementation of data redistribution in the embodiment of the present application, which is to add the partition space.
  • table T1V1 represents a partition table whose data organization structure has been adjusted.
  • the new version of table T1 is invisible, that is, the user cannot view the table generated in this step.
  • the value range of the partition key covered in the original partition table is [0,10000)
  • the value range of the partition key covered in the new partition table is [0,15000)
  • set 7 and the partition space PS4 corresponding to set 7 are added.
  • Fig. 10b illustrates the second step in the implementation of data redistribution in the embodiment of the present application: traversing the partitions and considering whether to split (add a new set).
  • the value ranges of the three partition keys set 4, set 5, and set 6 are added.
  • set 4, set 5, and set 6 in Figure 10b are still empty partitions. Only the value range of the partition key is shown; the actually stored data is still in set 2 and set 3.
  • Figure 10d illustrates the fourth step in the implementation of data redistribution in the embodiment of the present application: updating the value range of the partition key of the data managed by each partition's sets in table T1 (that is, the value ranges shown in the double-dot-dash line box and the sets shown in the dotted line box in Figure 10d). In addition, the definition of table T1 can now be made visible to the user, that is, the user can view the table at this time.
  • At this point, the first stage (task preparation stage) shown in the above embodiment is complete. That is, after the DDL statement sent by the user has been executed, the user is informed that the redistribution succeeded, although the data has not actually been migrated (moved in the physical sense).
  • Figure 10e illustrates the fifth step in the implementation of data redistribution in the embodiment of the present application: the physical data movement process, in which data is moved according to the definition of each set in partition management (the value range of the partition key). Comparing the structures of the two tables in Figure 10e shows that the data actually stored in set 2" and set 3' has changed, and the data moved out of them has been moved into the sets associated with them.
  • Fig. 10f illustrates the sixth step in the implementation of data redistribution in the embodiment of the present application. Before the sixth step is executed, all nodes can be refreshed to the latest table definition version. At this point, the split points marked in the previous steps and the associations between sets can be removed.
  • The first table in Figure 10a corresponds to the partition table that needs to be redistributed, and the second table in Figure 10f corresponds to the partition table after redistribution; the two illustrate the data organization structure of the same table in different versions.
  • User business can be processed concurrently; that is, the data processing method provided in the embodiment of the present application can be implemented online and does not require business downtime. In addition, user business processing and data redistribution are completed on a single copy of the data, which helps to avoid the data inconsistency that new copies created for data redistribution could cause.
  • the data processing device 1100 may include: an acquisition module 1101 , an addition module 1102 , an update module 1103 , and a distribution module 1104 .
  • The obtaining module 1101 is used to obtain data redistribution information; the data redistribution information represents a new partition plan for a partition table, and the partition table is a table whose data is distributed based on a partition key.
  • The adding module 1102 is used to create a corresponding partition space for each partition specified in the partition plan, and to record in each partition space the current data range of the partition corresponding to that partition space; the data range includes one of the value range of the partition key and the partition key value list.
  • The update module 1103 is configured to update the data range recorded in each partition space based on the partition plan, and to use the updated data range of each partition space to determine the correspondence between each partition space and the partitions of the partition table.
  • The distribution module 1104 is configured to update the data distribution in the partitions of the partition table based on the correspondence.
  • the adding module 1102 is specifically used for:
  • the adding module 1102 is also used for:
  • The partition space list is used to record at least one of the following: the data range recorded in each partition space, the correspondence between partition spaces and partitions, the name of each partition space, and the correspondence between partition spaces and the partition table during the data redistribution process.
  • When the update module 1103 is used to update the data range recorded in each partition space based on the partition plan, and to use the updated data range of each partition space to determine the correspondence between each partition space and the partitions of the partition table, it is specifically used for:
  • The partitions are sorted based on the value range of the partition key or the size of the key value. When the update module 1103 is used to determine, based on the updated data range of each partition space, the data of the partition to be changed in the partition corresponding to each partition space, and to create a new partition for the data of the partition to be changed in each partition, it is specifically used for:
  • the first data of the partition to be changed is to create a new first partition for the first data of the partition to be changed in each partition;
  • If the minimum value of the partition key corresponding to the data range of the data currently stored in the partition is less than the minimum value of the partition key recorded in the partition space corresponding to the partition, the second data of the partition to be changed is determined based on the minimum value of the partition key in the partition space corresponding to the partition, and a new second partition is created for the second data of the partition to be changed.
  • When the update module 1103 is used to determine the first data of the partition to be changed based on the maximum value of the partition key in the partition space corresponding to the partition, and to create a new first partition for the first data of the partition to be changed, it is specifically used for:
  • When the update module 1103 is used to determine the second data of the partition to be changed based on the minimum value of the partition key in the partition space corresponding to the partition, and to create a new second partition for the second data of the partition to be changed, it is specifically used for:
  • When the distribution module 1104 is used to update the data distribution in the partitions of the partition table based on the correspondence, it is specifically used to:
  • The first data is the data corresponding to partition keys between the maximum value of the partition key recorded in the partition space corresponding to the partition and the maximum value of the partition key corresponding to the data range of the data currently stored in the partition.
  • The second data is the data corresponding to partition keys between the minimum value of the partition key corresponding to the data range of the data currently stored in the partition and the minimum value of the partition key recorded in the partition space corresponding to the partition;
  • the device 1100 further includes an online reading and writing module, which is used to perform:
  • If the key value of the partition key of the target data is greater than the key value of the partition key at the maximum split point of any partition, or smaller than the key value of the partition key at the minimum split point of any partition, query the target data in that partition;
  • If the query obtains the storage location corresponding to the target data, return the processing result for the target data; if the storage location is not found, determine the storage location of the target data in the first partition or the second partition corresponding to the partition, based on the association relationship of the partition, and return the processing result for the target data.
  • the device in the embodiment of the present application can execute the method provided in the embodiment of the present application, and its implementation principle is similar.
  • The actions performed by the modules of the device in the embodiments of the present application correspond to the steps in the methods of the embodiments of the present application. For a detailed functional description of each module of the device, reference may be made to the description of the corresponding method shown above, which is not repeated here.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods. Each data block contains a certain amount of processing data, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • The blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
  • An embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory.
  • The processor executes the computer program to implement the steps of the data processing method. Compared with the related art, the following can be achieved. The application operates on a partition table, which may include a physical table whose data is distributed based on a preset partition key.
  • The latest data redistribution information is obtained and used to adjust the data distribution of the partition table. On this basis, at least one partition space can be added to the partition table; each partition space corresponds to a partition in the partition table and can store the data range of that partition, where the data range includes the value range or the value list of the partition key. The added partition space decouples the partition definition of the partition table from the data storage of the actual partitions.
  • Based on the predefined data range of at least one partition in the data redistribution information, the correspondence between the current partition spaces and the partitions can then be updated.
  • After that, the data distribution in the partition table can be updated based on the newly obtained correspondence between partition spaces and partitions.
  • The application thus achieves data redistribution through the added partition space. In addition, it does not need to synchronize the full data of the partition table, which helps shorten the execution time of data redistribution, improve data processing efficiency, and reduce business blocking.
  • The electronic device 1200 shown in FIG. 12 includes a processor 1201 and a memory 1203. The processor 1201 is connected to the memory 1203, for example through a bus 1202.
  • The electronic device 1200 may further include a transceiver 1204, which may be used for data interaction between the electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that, in practical applications, the transceiver 1204 is not limited to one, and the structure of the electronic device 1200 does not limit the embodiments of the present application.
  • The processor 1201 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure.
  • The processor 1201 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
  • The bus 1202 may include a path for communicating information between the above components.
  • The bus 1202 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like.
  • The bus 1202 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 12, but this does not mean that there is only one bus or one type of bus.
  • The memory 1203 may be a ROM (Read-Only Memory) or another type of static storage device that can store static information and instructions; a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions; an EEPROM (Electrically Erasable Programmable Read-Only Memory); a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.); a magnetic disk storage medium or other magnetic storage device; or any other medium that can carry or store a computer program and can be read by a computer, without limitation.
  • The memory 1203 is used to store the computer programs for executing the embodiments of the present application, and their execution is controlled by the processor 1201.
  • The processor 1201 is configured to execute the computer program stored in the memory 1203 to implement the steps shown in the foregoing method embodiments.
  • Electronic devices include but are not limited to smart phones, tablet computers, laptops, smart speakers, smart watches, smart voice interaction devices, and vehicle-mounted terminals.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the steps and corresponding content of the aforementioned method embodiments can be realized.
  • An embodiment of the present application also provides a computer program product, including a computer program. When the computer program is executed by a processor, the steps and corresponding content of the aforementioned method embodiments can be realized.
  • It should be understood that although arrows indicate the operation steps in the flowcharts of the embodiments of the present application, the execution order of these steps is not limited to the order indicated by the arrows.
  • Unless explicitly stated otherwise herein, in some implementation scenarios the steps in each flowchart may be performed in other orders as required.
  • In addition, some or all of the steps in each flowchart may include multiple sub-steps or stages, depending on the actual implementation scenario. Some or all of these sub-steps or stages may be executed at the same time, or each may be executed at a different time. Where execution times differ, the execution order of these sub-steps or stages can be flexibly configured as required, which is not limited in the embodiments of the present application.

Abstract

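A data processing method, apparatus, electronic device, storage medium, and program product, relating to the fields of data processing and database technology and applicable to scenarios such as data storage, data query, and map data processing. The method includes: obtaining data redistribution information (S101); creating a corresponding partition space for each partition specified by the partition plan, and recording in each partition space the current data range of the partition corresponding to that partition space, the data range being one of a value range of the partition key and a value list of the partition key (S102); updating, based on the partition plan, the data range recorded in each partition space, and using the updated data range of each partition space to determine the correspondence between each partition space and the partitions of the partition table (S103); and updating the data distribution in the partitions of the partition table based on the correspondence (S104). The method can improve the processing efficiency of data redistribution.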

Description

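Data processing method, apparatus, electronic device, storage medium, and program product
This application claims priority to Chinese Patent Application No. 202111217539.9, filed with the China Patent Office on October 19, 2021 and entitled "Data processing method, apparatus, electronic device, storage medium, and program product", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the fields of data processing and database technology, and in particular to a data processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
In a distributed database, a partition table can divide the data of a large table into multiple smaller subsets called partitions. Based on how the data is distributed, partition tables can be classified into three types: range, list, and hash.
When the data distribution rules change, the stored data must be redistributed. In the related art, this is generally done by copying the partition table that is to be redistributed and synchronizing data into the copy according to the new distribution rules. However, this approach needs additional storage to hold a full copy of the data, and the redistribution takes a long time to execute, which easily blocks business operations.
Summary
The embodiments of the present application provide a data processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which help solve the problem that data redistribution occupies a large amount of storage and takes so long to execute that business is blocked. The technical solution is as follows: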
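According to one aspect of the embodiments of the present application, a data processing method is provided, executed in an electronic device, the method including:
obtaining data redistribution information, the data redistribution information representing a new partition plan for a partition table, the partition table being a table whose data is distributed based on a partition key;
creating a corresponding partition space for each partition specified by the partition plan, and recording in each partition space the current data range of the partition corresponding to that partition space, the data range being one of a value range of the partition key and a value list of the partition key; updating, based on the partition plan, the data range recorded in each partition space, and using the updated data range of each partition space to determine the correspondence between each partition space and the partitions of the partition table;
updating the data distribution in the partitions of the partition table based on the correspondence.
According to one aspect of the embodiments of the present application, a data processing apparatus is provided, the apparatus including:
an obtaining module, configured to obtain data redistribution information, the data redistribution information representing a new partition plan for a partition table, the partition table being a table whose data is distributed based on a partition key;
an adding module, configured to create a corresponding partition space for each partition specified by the partition plan, and to record in each partition space the current data range of the partition corresponding to that partition space, the data range being one of a value range of the partition key and a value list of the partition key;
an updating module, configured to update, based on the partition plan, the data range recorded in each partition space, and to use the updated data range of each partition space to determine the correspondence between each partition space and the partitions of the partition table;
a distribution module, configured to update the data distribution in the partitions of the partition table based on the correspondence.
According to one aspect of the embodiments of the present application, an electronic device is provided, the electronic device including:
one or more processors;
a memory;
one or more computer programs, stored in the memory and configured to be executed by the one or more processors, the one or more computer programs being configured to perform the above data processing method.
According to one aspect of the embodiments of the present application, a computer-readable storage medium is provided, the storage medium storing computer instructions which, when run on a computer, enable the computer to perform the above data processing method.
According to one aspect of the embodiments of the present application, a computer program product is provided, including a computer program or instructions which, when executed by a processor, implement the steps of the above data processing method.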
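Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below.
FIG. 1 is a schematic diagram of data redistribution performed on a partition table in the related art;
FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of data redistribution performed on a partition table in a data processing method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of data redistribution performed on a partition table in a data processing method provided by an embodiment of the present application;
FIG. 6 is a flowchart of the task preparation phase in a data processing method provided by an embodiment of the present application;
FIG. 7 is a flowchart of the data movement phase in a data processing method provided by an embodiment of the present application;
FIG. 8 is a flowchart of read/write operations using the old-version partition table in a data processing method provided by an embodiment of the present application;
FIG. 9 is a flowchart of read/write operations using the new-version partition table in a data processing method provided by an embodiment of the present application;
FIG. 10a is a schematic diagram of the first step in an application example of a data processing method provided by an embodiment of the present application;
FIG. 10b is a schematic diagram of the second step in the application example;
FIG. 10c is a schematic diagram of the third step in the application example;
FIG. 10d is a schematic diagram of the fourth step in the application example;
FIG. 10e is a schematic diagram of the fifth step in the application example;
FIG. 10f is a schematic diagram of the sixth step in the application example;
FIG. 11 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.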
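Detailed Description
The embodiments of the present application are described below with reference to the drawings. It should be understood that the implementations set out below with reference to the drawings are exemplary descriptions intended to explain the technical solutions of the embodiments, and do not limit those technical solutions.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural. It should be further understood that the terms "include" and "comprise" used in the embodiments of the present application mean that the corresponding features can be implemented as the presented features, information, data, steps, operations, elements, and/or components, without excluding implementation as other features, information, data, steps, operations, elements, components, and/or combinations thereof supported in the art. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or the two elements may be connected through an intermediate element. In addition, "connected" or "coupled" as used herein may include a wireless connection or wireless coupling. The term "and/or" as used herein indicates at least one of the items it qualifies; for example, "A and/or B" may be implemented as "A", as "B", or as "A and B".
To make the objectives, technical solutions, and advantages of the present application clearer, the implementations of the present application are described in further detail below with reference to the drawings.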
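The terms and related technologies involved in the present application are explained below:
Database: in short, a database can be regarded as an electronic filing cabinet, a place for storing electronic files, in which users can add, query, update, and delete data. A "database" is a collection of data that is stored together in a certain way, can be shared by multiple users, has as little redundancy as possible, and is independent of application programs.
A database management system (DBMS) is a software system designed to manage databases, generally providing basic functions such as storage, retrieval, security, and backup. DBMSs can be classified by the database model they support, such as relational or XML (Extensible Markup Language); by the type of computer they support, such as server clusters or mobile phones; by the query language used, such as SQL (Structured Query Language) or XQuery; by performance emphasis, such as maximum scale or maximum speed; or in other ways. Whatever the classification, some DBMSs span categories, for example by supporting multiple query languages at once.
Distributed database: distributed database technology combines database technology with distributed technology. Specifically, it brings together database nodes that are geographically dispersed but logically belong to the same system. It provides both coordination among databases and distribution of data. Such a distributed database management system emphasizes the autonomy of each database node rather than centralized control.
Partition table: used to divide the data of a large table into many smaller subsets called partitions. In the present application, a partition table may be the result, in a TDSQL distributed database, of distributing data across the whole distributed database according to certain fields (the partition key) that the user specifies when creating a table. Data is usually distributed among the nodes of the cluster by these fields in one of the following ways: hash (the partition key of each record is hashed to partition the table's data), range (data is distributed to different storage nodes of the cluster according to ranges of the partition key), and list (data is distributed by key value according to a list of values specified by the user). The embodiments of the present application concern range and list distribution.
Data redistribution (data rebalance): when a user modifies the data distribution rules of a partition table, the data must be redistributed according to the new rules so that its storage conforms to the newly defined rules.
Partition (set): the smallest unit that actually stores a certain range of data. It may be an operating system file (or a range within a file) that stores a specific range of data of one partition of a table. Different databases may use different names for it.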
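The technical solution for performing data redistribution on a partition table in the related art is described below with reference to FIG. 1. It involves the following steps:
(1) For table T1, which requires data redistribution, create a temporary table T1' with the structure newly defined by the user.
(2) Start synchronizing data to table T1'; after synchronization completes, start replaying (redo) the log.
(3) When the remaining log has shrunk to a relatively small range, lock the table and replay the rest of the log.
(4) Complete the rename and release the lock; table T1' has now become table T1, and the data has been redistributed.
In the above example the data ranges of all partitions change, but in practice a redistribution operation often does not involve the data of all partitions and is concentrated in only a few. The above solution requires sufficiently large additional storage to hold a full copy of the data (or storage comparable to that of the partitions involved in redistribution). It also requires background data-copy time that the user can clearly perceive (the execution time of the redistribution operation), which is long. Moreover, the table must be locked for part of the redistribution, during which user business is interrupted, causing it to block.
In view of at least one of the above technical problems or areas for improvement in the related art, the present application proposes a data processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product. Specifically, the present application achieves data redistribution through added partition spaces and does not need additional storage for a full copy of the data. In addition, it does not need to synchronize the full data of the partition table, which helps save storage, shorten the execution time of data redistribution, improve data processing efficiency, and reduce business blocking.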
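The technical solutions of the embodiments of the present application and their technical effects are described below through several exemplary implementations. It should be noted that the following implementations may refer to, draw on, or be combined with one another; identical terms, similar features, and similar implementation steps in different implementations are not described repeatedly.
FIG. 2 is a schematic diagram of a system architecture to which the data processing method provided by the embodiments of the present application can be applied. The system architecture may include a terminal 100, a server 200, and a database 300. Specifically, a user can initiate data redistribution operations and data read/write operations through the terminal 100; the terminal 100 can communicate with the server 200 through a network connection or similar means; and the database 300 supports the data services of the server 200. It can be understood that, in a distributed database, the database 300 may correspond to multiple databases in different geographic locations.
An embodiment of the present application provides a data processing method, executed for example in an electronic device such as the server 200 or the database 300. As shown in FIG. 3, the method includes the following steps S101-S104:
Step S101: obtain data redistribution information, which represents a new partition plan for a partition table, the partition table being a table whose data is distributed based on a partition key. For example, the partition plan may specify the new number of partitions and the data range each partition is to store. The data range may be, for example, one of a value range of the partition key and a value list of the partition key.
The partition key may correspond to certain fields of the data to be stored. Optionally, when the partition table uses range partitioning, a partition is selected according to the actual value and the value ranges (key ranges) given when the partition key was defined, and the data is stored in that partition. When the partition table uses list partitioning, a partition is selected according to the actual value and the value lists given when the partition key was defined, and the data is stored in that partition.
The data redistribution information can be obtained by parsing a statement submitted by the user. The statement may be written in a data definition language (DDL), which can be used to describe the real-world entities stored in the database. Optionally, the data redistribution information may include a modification of the value range or value list of the partition key for at least one partition.
Specifically, the user can adjust the data stored in each partition by modifying the value ranges and value lists of the partition key, thereby redistributing the data stored in the partition table. In response to a redistribution operation triggered by the user on the partition table, the latest data redistribution information for the partition table, that is, the data redistribution rule, can be obtained.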
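Step S102: create a corresponding partition space for each partition specified by the partition plan, and record in each partition space the current data range of the partition corresponding to that partition space, the data range being one of a value range of the partition key and a value list of the partition key.
Before data redistribution, the partition table may have a two-layer management structure. As shown in FIG. 4, the first layer corresponds to table T1, and the second layer consists of the three partitions of table T1: partition 1, partition 2, and partition 3. It can be understood that each partition stores data as a data storage unit (set) of table T1. In other words, each partition may also be called a data storage unit. Different partitions may, for example, be located on different physical storage entities. Specifically, the metadata of the partition table can record the current table structure, the storage relationships of the data, and so on.
Specifically, for performing data redistribution, the embodiments of the present application propose an intermediate management structure, the partition space (PS). As shown in FIG. 4, a partition management layer (the partition spaces) is added between the two original layers, and the data organization of the partition table changes accordingly. The partition space maps a logical partition range or list value (the value range or value list of the partition key) to physical partitions. In other words, the partition space decouples the partition definition of the partition table (which, in the original two-layer structure, is stored for example in table T1) from the data storage of the actual partitions. The data range recorded in a partition space is one of a value range of the partition key and a value list of the partition key; that is, the data range represents the range of data stored in each partition of the partition table.
In one embodiment, when step S102 is executed, a corresponding partition space is generated for each current partition of the partition table, and the data range of that partition is recorded in its partition space. In addition, when the partition plan indicates that a new partition is to be created, step S102 creates the new partition, generates a corresponding partition space for it, and records in that partition space the data range the new partition is to store.
In one embodiment, step S102 may add partition spaces one-to-one for the partitions of the original partition table, that is, one partition space per partition. It may also add partition spaces based on the data redistribution information together with the data organization of the original partition table; if the redistribution requires new partitions, partition spaces are added for them accordingly.
Specifically, once partition spaces have been added, each partition space corresponds to partitions of the partition table, and the range of data actually stored in a partition can be read from the data range recorded in its partition space.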
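The role of the partition space as a layer between the logical range and the physical sets can be sketched as follows. This is a minimal illustration only: the class `PartitionSpace`, the function `find_space`, and the use of integer half-open ranges are assumptions made for the example, not structures named in the application; the ranges are those of the original table in FIG. 4.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PartitionSpace:
    """Hypothetical record: a logical key range plus the physical sets behind it."""
    name: str
    key_range: Tuple[int, int]               # half-open range [low, high)
    sets: List[str] = field(default_factory=list)


def find_space(spaces: List[PartitionSpace], key: int) -> Optional[PartitionSpace]:
    """Return the partition space whose recorded range covers the key, if any."""
    for space in spaces:
        low, high = space.key_range
        if low <= key < high:
            return space
    return None


# One partition space per existing partition, as in step S102.
spaces = [
    PartitionSpace("PS1", (0, 1000), ["set 1"]),
    PartitionSpace("PS2", (1000, 5000), ["set 2"]),
    PartitionSpace("PS3", (5000, 10000), ["set 3"]),
]
```

Because the logical range lives in the partition space and the data lives in the sets, a later step can change `key_range` or the `sets` list without touching the stored rows.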
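Step S103: based on the partition plan, update the data range recorded in each partition space, and use the updated data range of each partition space to determine the correspondence between each partition space and the partitions of the partition table.
In one embodiment, step S103 may perform the following operations:
determine the new data range of each partition in the partition table based on the partition plan;
based on the new data ranges of the partitions, update the data range recorded in the partition space corresponding to each partition;
based on the updated data range of each partition space, determine the data whose partition is to change in the partition corresponding to each partition space, and create a new partition for that data in each partition;
determine the correspondence between the partitions created for the data whose partition is to change and the partition spaces, and determine the data ranges of those partitions.
In one embodiment, the data range is the range definition of each partition specified by the data redistribution information. Logically it can be expressed as a partition range or list values; for a given partition, it may be the value range or list values of the partition key for that partition.
Because the user may modify the definition of only one or several partitions of the partition table, the data redistribution information may include only the data range definitions of the modified partitions, that is, a logical expression of the range of data the modified partitions can store.
Specifically, since a partition space records the data range of a partition, the data range each partition is to store changes during redistribution, and the final result is the data range determined from the data redistribution information. The embodiments of the present application update the data ranges recorded in the partition spaces and then, based on the updated ranges, determine the correspondence between each partition space and the partitions of the partition table.
Step S104: update the data distribution in the partitions of the partition table based on the correspondence.
Specifically, the updated correspondence between partition spaces and partitions is a logical one. To redistribute the data of the partition table, the data stored in the partitions is actually moved, so that the data finally stored in each partition (data storage in the physical sense) meets the requirements of the data redistribution information set by the user.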
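The data involved in the data processing method provided by the embodiments of the present application may be stored on a blockchain.
In one embodiment, the partitions are ordered by the value range of the partition key or by key value. The following operations may be performed:
perform the following adjustment for each partition in the partition table, in order:
compare the data range of the data currently stored in the partition with the data range recorded in the partition space corresponding to the partition;
if the maximum partition-key value of the data range currently stored in the partition is greater than the maximum partition-key value recorded in the partition space corresponding to the partition, determine, based on that recorded maximum, the first data whose partition is to change, and create a new first partition for the first data;
if the minimum partition-key value of the data range currently stored in the partition is smaller than the minimum partition-key value recorded in the partition space corresponding to the partition, determine, based on that recorded minimum, the second data whose partition is to change, and create a new second partition for the second data.
In one embodiment, to determine the first data whose partition is to change and create a new first partition for it, the embodiments of the present application may perform the following operations:
mark a maximum split point in the partition based on the maximum partition-key value in the partition space corresponding to the partition;
create a new first partition for the first data whose partition is to change in each partition;
establish an association relationship between the partition and the first partition;
determining, based on the minimum partition-key value in the partition space corresponding to the partition, the second data whose partition is to change, and creating a new second partition for it, includes:
marking a minimum split point in the partition based on the minimum partition-key value in the partition space corresponding to the partition;
creating a new second partition for the second data whose partition is to change in each partition;
establishing an association relationship between the partition and the second partition.
In one embodiment, step S104 may be implemented as:
based on the correspondence, perform the following data movement for each partition that needs to be split:
move the first data to the corresponding first partition, the first data being the data whose partition keys lie between the maximum partition-key value recorded in the partition space corresponding to the partition and the maximum partition-key value of the data range currently stored in the partition;
move the second data to the corresponding second partition, the second data being the data whose partition keys lie between the minimum partition-key value of the data range currently stored in the partition and the minimum partition-key value in the partition space corresponding to the partition;
delete the mark of the maximum split point, the mark of the minimum split point, and the association relationships.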
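In an embodiment, adding at least one partition space to the partition table in step S102 includes the following steps S1021-S1022:
Step S1021: add a partition space for each partition in the partition table.
Specifically, partition spaces can be added one-to-one for the partitions currently in the partition table. In this case, the embodiments of the present application can adjust the distribution of the data actually stored in the partitions within the partition-key value ranges or key values covered by the original partitions.
Step S1022: based on the data redistribution information, add to the partition table at least one pair consisting of a partition space and a corresponding empty data storage unit, where the data storage information recorded in the partition space is determined from the data redistribution information.
Specifically, if the partition definitions in the current data redistribution information, that is, the partition-key value ranges or key values logically covered by the newly defined partitions, extend beyond those logically covered by the original partitions of the original partition table, at least one pair of a partition space and a corresponding partition can be added to the partition table. The added partition is empty, that is, it does not physically store any data, and the data range recorded in its partition space is determined from the data redistribution information. For example, if the partitions of the current partition table together cover the partition-key range [0,10000) and the data redistribution information specifies that the partitions together cover [0,15000), one pair of a partition space and an empty partition can be added, and the partition-key value range recorded in the added partition space is [10000,15000).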
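The [0,10000) to [0,15000) example can be checked with a short sketch. The function name `uncovered_ranges` and the representation of ranges as half-open integer tuples are assumptions made for illustration.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open range [low, high)


def uncovered_ranges(current: Range, planned: Range) -> List[Range]:
    """Return the parts of the planned overall range that the current table does not cover.

    Each returned range would get a new partition space and an empty partition.
    """
    cur_low, cur_high = current
    new_low, new_high = planned
    extra = []
    if new_low < cur_low:        # plan extends to the left of the current table
        extra.append((new_low, cur_low))
    if new_high > cur_high:      # plan extends to the right of the current table
        extra.append((cur_high, new_high))
    return extra
```

With the ranges from the example, the only uncovered range is [10000,15000), which matches the range recorded in the added partition space.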
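In one embodiment, the implementation of this step can be understood with reference to FIG. 10a.
In an embodiment, the data processing method may further include the following step S1010:
Step S1010: create a partition space list based on the partition spaces. During data redistribution, the partition space list records at least one of: the data range recorded in each partition space, the correspondence between partition spaces and partitions, the name of each partition space, and the correspondence between partition spaces and the partition table.
Specifically, a system table can be generated for the partition spaces added to the partition table to store their management information. This is explained with reference to FIG. 4 and Tables 1 and 2 below:
Table 1
Table name    Space name    Value range       Partition name
test t1       PS1           [0,1000)          set 1
test t1       PS2           [1000,5000)       set 2
test t1       PS3           [5000,10000)      set 3
Specifically, from the original partition table in FIG. 4 and Table 1 above, each partition space PS corresponds one-to-one to a data storage unit (set). The added partition spaces record the logical description (data range) of the data actually stored in each partition, that is, the value range of the partition key. Table 1 is the partition space list created in step S1010. It can record the table name (for example, test t1), the space name (for example, PS1), the value range of the partition key (data range, for example [0,1000)), and the partition name (set name, for example set 1); the correspondence among table name, space name, value range, and partition name can also be read from the table.
Table 2
Table name    Space name    Value range       Partition name
test t1       PS1           [0,1500)          set 1
test t1       PS2           [1000,1500)       set 2
test t1       PS2           [1500,4500)       set 3
test t1       PS3           [4500,7500)       set 4
test t1       PS4           [7500,15000)      set 5
From FIG. 4 and Table 2 above, Table 2 shows the information stored in the partition space list after data redistribution. Compared with Table 1, partition space PS4 and partitions set 4 and set 5 have been added; in terms of correspondence, one partition space can correspond to two partitions (PS2 corresponds to set 2 and set 3).
In the embodiments of the present application, Tables 1 and 2 show that the partition space list changes during data redistribution as the data ranges recorded in the partition spaces and the correspondence between partition spaces and partitions change. By querying the partition space list, the current progress of the redistribution can be determined quickly. In addition, after the redistribution has finished, the partition space list can be used to quickly verify how well the data organization of the redistributed partition table conforms to the data redistribution rule.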
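As a rough illustration of how such a list might be used for verification, the sketch below holds the rows of Table 1 as plain dictionaries and checks that the recorded ranges tile an overall range with no gaps or overlaps. The field names and the function `covers_without_gaps` are hypothetical; the application does not specify the schema of the system table.

```python
from typing import Dict, List, Tuple

# Rows of Table 1; field names are assumptions for this sketch.
space_list: List[Dict] = [
    {"table": "test t1", "space": "PS1", "range": (0, 1000), "set": "set 1"},
    {"table": "test t1", "space": "PS2", "range": (1000, 5000), "set": "set 2"},
    {"table": "test t1", "space": "PS3", "range": (5000, 10000), "set": "set 3"},
]


def covers_without_gaps(rows: List[Dict], overall: Tuple[int, int]) -> bool:
    """Check that the recorded half-open ranges exactly tile the overall range."""
    cursor = overall[0]
    for low, high in sorted(row["range"] for row in rows):
        if low != cursor:        # a gap or an overlap with the previous range
            return False
        cursor = high
    return cursor == overall[1]
```

A check of this kind is one way the list could confirm that the table still matches a given distribution rule; here the rows of Table 1 tile [0,10000) but not [0,15000).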
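In an embodiment, updating the correspondence between partition spaces and partitions in step S103, based on the predefined data range of at least one partition in the data redistribution information, includes the following steps A1-A4:
Step A1: determine the predefined data range of each partition in the partition table based on the data redistribution information.
Specifically, the data redistribution information may include only the definitions of the partitions to be adjusted, or it may include the definitions of all partitions of the redistributed partition table. These cases correspond to the following two possible embodiments (taking the value range of the partition key as the data storage information):
(1) When only the definitions of the partitions to be adjusted are included, that is, the logical data range of at least one partition, the data ranges of the partition spaces of the redistributed partition table can be updated from the data redistribution information in combination with what the partitions of the original partition table physically store. Specifically, the current logical data range of each partition is determined first, for example set 1 corresponds to [0,1000) and set 2 corresponds to [1000,5000). If, in the data redistribution information, the predefined data range of set 2 is [1000,2500) and that of set 3 is [2500,5000), then the predefined data ranges of the partitions are: set 1 corresponds to [0,1000), set 2 to [1000,2500), and set 3 to [2500,5000), where set 3 may be a newly added partition.
(2) When the definitions of all partitions of the adjusted partition table are included, that is, the logical data ranges of all partitions, these data ranges can be used directly as the predefined data ranges of the partitions. Continuing the example in embodiment (1), in embodiment (2) the data redistribution information directly gives [0,1000) for set 1, [1000,2500) for set 2, and [2500,5000) for set 3.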
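Case (1), where the redistribution information lists only the changed partitions, amounts to overlaying the changes on the current ranges. The sketch below reproduces the set 1 / set 2 / set 3 example; the function name `predefined_ranges` and the dictionary representation are assumptions for illustration.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # half-open range [low, high)


def predefined_ranges(current: Dict[str, Range], changes: Dict[str, Range]) -> Dict[str, Range]:
    """Overlay the changed partition definitions on the current ones.

    Partitions not mentioned in `changes` keep their current range;
    partitions that appear only in `changes` are newly added.
    """
    merged = dict(current)
    merged.update(changes)
    return merged


current = {"set 1": (0, 1000), "set 2": (1000, 5000)}
changes = {"set 2": (1000, 2500), "set 3": (2500, 5000)}
planned = predefined_ranges(current, changes)
```

In case (2) the redistribution information already plays the role of `planned`, so no merge is needed.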
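Step A2: adjust the partitions contained in the partition table based on the new data ranges of the partitions.
Specifically, the predefined data range of a partition may not match the data range of the data currently stored in that partition of the partition table. When they do not match, the following situations may arise (taking the value range of the partition key as the data range):
(1) for one or more partitions, the minimum of the partition-key value range logically corresponding to the data actually stored is smaller than the minimum of the corresponding predefined data range;
(2) for one or more partitions, the maximum of the partition-key value range logically corresponding to the data actually stored is greater than the maximum of the predefined data range of the corresponding partition;
(3) the maximum of the overall partition-key value range logically corresponding to the data actually stored in all partitions is smaller than the maximum of the predefined data ranges of the partitions;
(4) the minimum of the overall partition-key value range logically corresponding to the data actually stored in all partitions is greater than the minimum of the predefined data ranges of the partitions.
From these situations it follows that the data actually stored in the partitions needs to be adjusted, and that before the physically stored data is adjusted, the corresponding partitions must be laid out in the partition table. In other words, partitions may need to be added to or removed from the data organization of the original partition table.
Step A3: update the data range recorded in each partition space based on the predefined data ranges.
Specifically, the data range currently recorded in each partition space can be replaced directly with the predefined data range determined for the corresponding partition in step A1. It can be understood that the partitions determined in step A1 correspond one-to-one to the partition spaces; see FIG. 10c.
Step A4: based on the updated data range recorded in each partition space, update the correspondence between the partition spaces and the partitions contained in the adjusted partition table.
Specifically, before data is actually redistributed, the correspondence between the partition spaces and the data ranges contained in the partition table after step A2 has adjusted its data organization is adjusted according to the updated data ranges recorded in the partition spaces. That is, the data range actually managed by each partition space is adjusted; see FIG. 10c and FIG. 10d.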
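In an embodiment, the partitions are ordered by the value range of the partition key or by key value, and adjusting the partitions contained in the partition table based on the predefined data ranges in step A2 includes the following step A21:
Step A21: perform the following adjustment steps A211-A213 for each data storage unit in the partition table, in order:
Step A211: compare the current data storage information of the partition with the predefined data storage information.
Step A212: if the maximum partition-key value of the partition is greater than the maximum partition-key value in the predefined data storage information, split the partition at the maximum partition-key value in the predefined data storage information.
Specifically, take a partition whose partition-key value range is [1000,5000). If the partition-key value range in the predefined data storage information for the partition is [1500,4500), then 5000 is greater than 4500, and the partition is split at key value 4500.
Step A213: if the minimum partition-key value of the partition is smaller than the minimum partition-key value in the predefined data range, split the partition at the minimum partition-key value in the predefined data range.
Specifically, take a partition whose partition-key value range is [1000,5000). If the partition-key value range in the predefined data range for the partition is [1500,4500), then 1000 is smaller than 1500, and the partition is split at key value 1500.
Specifically, see FIG. 10b. It can be understood that, for any given partition, both steps A212 and A213 may be performed, only one of them may be performed, or neither may be performed.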
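In an embodiment, splitting the partition at the maximum partition-key value in the predefined data range in step A212 includes the following step A231:
Step A231: generate an empty first partition based on the maximum partition-key value in the predefined data range and the maximum partition-key value of the partition.
Specifically, as shown in FIG. 10b, the newly generated set 5 currently stores no data, but its logical partition-key value range can be determined from the maximum of set 2, 5000, and the maximum of the partition-key value range in the predefined data range, 4500; that is, the partition-key value range of set 5 is [4500,5000). Likewise, the newly generated set 6 currently stores no data, but its logical partition-key value range can be determined from the maximum of set 3, 10000, and the maximum of the partition-key value range in the predefined data range, 7500; that is, the partition-key value range of set 6 is [7500,10000).
In a possible embodiment, splitting the partition at the maximum partition-key value in the predefined data range in step A212 further includes the following step A232:
Step A232: mark a maximum split point in the partition based on the maximum partition-key value in the predefined data range, and establish an association relationship between the partition and the first partition.
Specifically, continuing the example in step A231, a maximum split point (max flag) can be marked in set 2 at partition-key value 4500, and a maximum split point (max flag) can be marked in set 3 at partition-key value 7500.
Because the partition-key value range of the newly generated set 5 is part of the partition-key value range of set 2 in the original partition table, an association relationship can be established between set 2 and set 5, as shown in FIG. 10c. Because the partition-key value range of the newly generated set 6 is part of the partition-key value range of set 3 in the original partition table, an association relationship can be established between set 3 and set 6.
In a feasible embodiment, splitting the partition at the minimum partition-key value in the predefined data range in step A213 includes the following step A241:
Step A241: generate an empty second partition based on the minimum partition-key value of the partition and the minimum partition-key value in the predefined data range.
Specifically, as shown in FIG. 10b, the newly generated set 4 currently stores no data, but its logical partition-key value range can be determined from the minimum of set 2, 1000, and the minimum of the partition-key value range in the predefined data range, 1500; that is, the partition-key value range of set 4 is [1000,1500).
In an embodiment, splitting the partition at the minimum partition-key value in the predefined data range in step A213 includes the following step A242:
Step A242: mark a minimum split point in the partition based on the minimum partition-key value in the predefined data range, and establish an association relationship between the partition and the second partition.
Specifically, continuing the example in step A241, a minimum split point (min flag) can be marked in set 2 at partition-key value 1500.
Because the partition-key value range of set 4 is part of the partition-key value range of set 2 in the original partition table, an association relationship can be established between set 2 and set 4.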
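The comparison and split decisions above can be condensed into one function. This is a sketch under the assumption that ranges are half-open integer tuples; the function `split_plan` and its dictionary keys are hypothetical names, while the numbers are those of the set 2 and set 3 examples.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # half-open range [low, high)


def split_plan(stored: Range, predefined: Range) -> Dict[str, object]:
    """Decide where a partition must be split and which empty partitions to create.

    Returns the split-point flags and the logical ranges of the new, still empty
    first partition (upper split) and second partition (lower split).
    """
    s_low, s_high = stored
    p_low, p_high = predefined
    plan: Dict[str, object] = {}
    if s_high > p_high:                       # split at the predefined maximum
        plan["max_flag"] = p_high
        plan["first_partition"] = (p_high, s_high)
    if s_low < p_low:                         # split at the predefined minimum
        plan["min_flag"] = p_low
        plan["second_partition"] = (s_low, p_low)
    return plan
```

For set 2, stored range [1000,5000) against predefined range [1500,4500) yields flags at 1500 and 4500 and the empty ranges [1000,1500) and [4500,5000), which correspond to set 4 and set 5 in FIG. 10b. A partition whose stored range already matches yields an empty plan, the case in which neither split is performed.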
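In an embodiment, updating the data distribution of the partition table in step S104, based on the updated correspondence between partition spaces and partitions, includes the following steps S1041-S1042:
Step S1041: based on the updated correspondence between partition spaces and partitions, perform the following data movement steps B1-B2 for each partition that needs to be split:
Step B1: move the data whose partition keys lie between the maximum partition-key value in the predefined data range and the maximum partition-key value of the partition to the corresponding first partition.
Specifically, as shown in FIG. 10e, the part of the data in set 2 that set 5 is to store can be moved to set 5, which reduces the data actually stored in set 2. The part of the data in set 3 that set 6 is to store can be moved to set 6, which reduces the data actually stored in set 3.
Step B2: move the data whose partition keys lie between the minimum partition-key value of the partition and the minimum partition-key value in the predefined data range to the corresponding second partition.
Specifically, as shown in FIG. 10e, the part of the data in set 2 that set 4 is to store can be moved to set 4.
In the embodiments of the present application, steps B1 and B2 process only the data that needs to be moved; data that does not need to move stays in its partition. Although data must be locked while it is being moved, the amount of data moved is small and the execution time is short, so the data is locked only briefly. As a result, data movement has very little impact on read/write operations in progress (which may be user business).
Step S1042: delete the maximum split points, the minimum split points, and the association relationships between partitions.
Specifically, the maximum split points, minimum split points, and association relationships between partitions represent how data moves during redistribution, and they allow data read/write operations to be executed while redistribution is in progress. Once redistribution has finished, that is, once the data actually stored in each partition has been adjusted according to the user-defined redistribution rule, the split points and the association relationships between partitions can be deleted.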
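Steps B1 and B2 can be pictured with in-memory dictionaries standing in for the sets. This is only a sketch: real sets are storage units with locking and transactions, and the function `move_split_data` and its parameters are hypothetical.

```python
from typing import Dict, Optional


def move_split_data(source: Dict[int, str],
                    first: Dict[int, str],
                    second: Dict[int, str],
                    min_flag: Optional[int] = None,
                    max_flag: Optional[int] = None) -> None:
    """Move rows outside [min_flag, max_flag) out of the source partition.

    Rows with key >= max_flag go to the first partition (step B1);
    rows with key < min_flag go to the second partition (step B2).
    Rows inside the flags are left where they are.
    """
    for key in sorted(source):               # sorted() copies the keys, so popping is safe
        if max_flag is not None and key >= max_flag:
            first[key] = source.pop(key)
        elif min_flag is not None and key < min_flag:
            second[key] = source.pop(key)


set2 = {1200: "a", 2000: "b", 4800: "c"}     # stored range [1000,5000)
set5: Dict[int, str] = {}                    # empty first partition, [4500,5000)
set4: Dict[int, str] = {}                    # empty second partition, [1000,1500)
move_split_data(set2, set5, set4, min_flag=1500, max_flag=4500)
```

Only the two rows outside [1500,4500) are touched; the row with key 2000 never moves, which is the property the paragraph above relies on for short lock times.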
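In an embodiment, before the update of the data distribution of the partition table is completed, the data processing method further includes the following steps C1-C2:
Step C1: in response to a processing operation on target data, if the key value of the partition key of the target data is greater than the key value at the maximum split point of any partition, or smaller than the key value at the minimum split point of any partition, query the target data in that partition.
Specifically, the processing operation on the target data may be a data read/write operation. That is, when a read/write operation on target data is received from a user or server during redistribution, the location where the target data is actually stored can be obtained from the marked split points and the established association relationships between partitions. It can be understood that the query in step C1 is performed on the premise that the target data has a corresponding storage location in the partition table.
Step C2: if the query obtains the storage location of the target data, return the processing result for the target data; if the storage location is not found, determine the storage location of the target data in the first partition or second partition associated with the partition, based on the association relationships of the partitions, and return the processing result for the target data.
Specifically, the flow of read/write operations during data redistribution is described in detail below:
First case: processing target data using the old version of the table. If the storage location of the target data is not found in the old set (a partition of the original partition table), and the target data lies outside the split point (flag) marked in a partition, the target data may have been moved to a set generated by the split. The corresponding newly generated set must then be traversed to determine whether the target data is stored in the partition table.
Second case: processing target data using the new version of the table. If the storage location of the target data is not found in any partition of the partition table whose data organization has been adjusted, and the set (an old set of the original partition table) has a link to another set (a newly generated set), as determined from the association relationships between partitions, the search for the storage location continues in the linked set (the newly generated set). The flow for querying the target data in the newly generated set is as follows:
If the target data is not outside the flag of the corresponding set (the old set of the original partition table), the target data does not fall within the range of data that the set needs to split off. It can then be concluded that the target data really is not in the partition table, and the corresponding result (target data not found in the partition table) can be returned directly.
If the target data is outside the flag range of the corresponding set, the corresponding split range is located and the query continues. If the storage location of the target data is not found, the target data is likewise not in the partition table, and the corresponding result can be returned directly. If the data is found within the flag (data to be moved in the partition being split), it can be locked and the processing result for the target data returned. If the data is found within the flag but has been marked as deleted, the data may have been migrated, and it can then be queried in the newly generated partition (the first or second partition). Here "mark delete" indicates that data to be moved in the partition being split has been migrated to the new set; other methods may also be used to indicate that data has been migrated.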
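The lookup described above, in which the old set is tried first and the linked set is consulted only for keys outside a split point, might be sketched as follows. The class `DataSet`, its attributes, and the function `lookup` are hypothetical names; locking and the mark-delete state are omitted for brevity.

```python
from typing import Dict, Optional


class DataSet:
    """Hypothetical stand-in for a set with split-point flags and links."""

    def __init__(self, rows: Dict[int, str],
                 min_flag: Optional[int] = None,
                 max_flag: Optional[int] = None) -> None:
        self.rows = dict(rows)
        self.min_flag = min_flag             # keys below this belong to `second`
        self.max_flag = max_flag             # keys at or above this belong to `first`
        self.first: Optional["DataSet"] = None
        self.second: Optional["DataSet"] = None


def lookup(old: DataSet, key: int) -> Optional[str]:
    """Find a row while redistribution is in progress."""
    if key in old.rows:                      # not migrated yet, or never needs to move
        return old.rows[key]
    if old.max_flag is not None and key >= old.max_flag and old.first is not None:
        return old.first.rows.get(key)       # may have been moved to the first partition
    if old.min_flag is not None and key < old.min_flag and old.second is not None:
        return old.second.rows.get(key)      # may have been moved to the second partition
    return None                              # inside the flags and absent: not in the table


set2 = DataSet({2000: "b", 4800: "c"}, min_flag=1500, max_flag=4500)
set2.first = DataSet({})                     # set 5, migration not done for key 4800
set2.second = DataSet({1200: "a"})           # set 4, key 1200 already migrated
```

A key inside the flags that is missing from the old set is reported as absent without following any link, which mirrors the early-return branch described above.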
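The execution logic of the data processing method provided by the embodiments of the present application is described below with reference to FIG. 5, from the perspective of the overall data organization of the partition table:
As shown in FIG. 5, table T1V0 is the old version of the partition table; FIG. 5 shows the data organization of the partition table before redistribution. Table T1V1 is the new version of the partition table; FIG. 5 shows the updated data organization of the partition table after redistribution.
Data organization change 1: partition set 1 needs to change from [0,1000) to [0,1500):
1.1. The management space PS1 of partition set 1 already contains set 1, whose range is [0,1000), so set 1 remains unchanged;
1.2. Part of the data stored in set 2, namely [1000,1500), must be transferred from the management space PS2 of partition set 2, which corresponds to [1000,5000), to partition set 1. The data [1000,5000) of the original partition set 2 is therefore first split into set 4 [1000,1500) and set 2' [1500,5000); set 4 is then moved from the management range PS2 of partition set 2 to the management range PS1 of partition set 1, which completes the redistribution of partition set 1.
Data organization change 2: partition set 3 changes from [5000,10000) to [4500,7500):
2.1. Set 2', which was split from partition set 2 in step 1.2 and corresponds to [1500,5000), is split into two new sets: set 2'' corresponding to [1500,4500) and set 5 corresponding to [4500,5000);
2.2. Set 5 is moved from the management range PS2 of partition set 2 to the management range PS3 of partition set 3, which completes the redistribution of partition set 2.
2.3. The partition previously corresponding to set 3 [5000,10000) is split into set 6 corresponding to [5000,7500) and set 3' corresponding to [7500,10000);
2.4. Set 3' is moved from the management range PS3 of partition set 3 to the management range PS4 of partition 4, which completes the redistribution of partition set 3.
Data organization change 3: partition set 4 is a newly added partition, and the corresponding information must be added to the data dictionary; in addition, a set 7 corresponding to [10000,15000) is added, which completes the redistribution of partition set 4.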
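The operations of the task preparation phase (first phase) of data redistribution on a partition table, in the data processing method provided by the embodiments of the present application, are described below with reference to FIG. 6:
Step 1. The user performs a data redistribution operation on the partition table through a DDL statement.
Step 2. Read the data dictionary definition of the original partition table (that is, the definition of each partition, such as the partition-key value range of partition 1) and create a new partition table structure (an integer may represent the version number of the partition table, with version number + 1 denoting the updated version). Specifically, the embodiments of the present application only adjust the structure of the partition table; the "original partition table" in this description is the old version of the same table, and no new partition table needs to be created.
Step 3. Traverse the partition structure of the original table.
Step 4. Determine whether the redistribution operation will produce a new partition space; if so, go to step 5; if not, return to step 3 and traverse the next partition space.
Step 5. Add an empty partition space to the new partition table structure.
Step 6. Determine whether all partition spaces have been traversed; if so, go to step 7; if not, return to step 3.
Step 7. Traverse the partition structure (that is, the partition spaces).
Step 8. Determine whether the current partition space needs redistribution; if not, jump to step 22 and process the next partition space; if so, go to step 9.
Step 9. Determine whether a range needs to be added; if so, a left-end partition and a right-end partition may be added (the list type is handled similarly to the range type, so FIG. 6 uses range partitions throughout). Adding a range means directly adding an empty set and registering it with the partition space. A newly added range normally indicates that the original partition table contains no data in that range, so no subsequent data migration is involved; once the partition space and range information in the data dictionary have been modified, the add-range operation is complete. If not, go to step 13.
Step 10. Determine whether a left-end range is being added; if so, go to step 12; if not, go to step 11.
Step 11. Add a right-end set.
Step 12. Add a left-end set.
Step 13. Determine whether a range needs to be split; if so, the partition may be split at its left end and its right end; if not, go to step 22.
Step 14. Determine whether this is a left-end split; in either case, go to step 15.
Step 15. Find the data boundary for the split (the split point).
Step 16. Place a flag on the split point (a left-end split gives the minimum split point, min-flag; a right-end split gives the maximum split point, max-flag).
Step 17. Create a new set; at this point the set is empty and stores no data.
Step 18. Establish a link from the original set to the new set.
Step 19. Establish a link from the newly created set back to the original set.
Step 20. Move the new set to the corresponding partition space and complete its registration.
Step 21. Register the split in the data movement task list.
Step 22. Process the next partition space.
Step 23. Determine whether all partition spaces of the partition table have been processed; if not, return to step 7; if so, go to step 24.
Step 24. Update the definition of each partition in the partition table so that the new-version partition table definition becomes visible, and notify all nodes to update the definition (the notification may also be omitted).
Step 25. The redistribution preparation phase ends.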
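The operations of the data movement phase (second phase) of data redistribution on a partition table, in the data processing method provided by the embodiments of the present application, are described below with reference to FIG. 7:
Specifically, the data movement phase of the redistribution operation takes place after the preparation phase, when the corresponding preparation work is complete. Once the task preparation phase has finished, the redistribution of the partition table is logically complete from the user's point of view; the data has not yet physically been migrated, but this does not affect the user's reads and writes.
Step 1. Start the data movement task.
Step 2. Load the list of data migration tasks prepared in the first phase.
Step 3. Take one task and start the data migration task.
Step 4. Acquire a small batch of data and lock it for migration (the batch size may be a user-specified number of records or hard-coded). The embodiments of the present application migrate data in small batches to minimize the time for which user read/write transactions are blocked during migration.
Step 5. Determine whether the lock was acquired; if not, a user transaction is processing this data, so go to step 6; if so, go to step 7.
Step 6. Wait for a period of time, then try to acquire the task again.
Step 7. Migrate the data to the new set. The migration does not modify the data in any way; it only moves the data from the range of the original set to the new set, and the metadata is not modified.
Step 8. Commit the transaction and release the lock placed on the migrated data in step 4.
Step 9. Determine whether the range still contains data to migrate; if not, go to step 10; if so, jump to step 4 and migrate the next batch.
Step 10. Determine whether there are more migration tasks; if not, go to step 11; if so, jump to step 3 and continue with the next data migration task.
Step 11. All nodes refresh the table structure to ensure that every node has synchronized to the latest table structure.
Step 12. Clear the flag marks and the links to other sets in all sets (as shown in FIG. 10f).
Step 13. Data migration ends.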
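The batch loop of steps 4 to 9 can be sketched with dictionaries standing in for the sets and a single `threading.Lock` standing in for the lock on the batch. All names here are hypothetical, and a real implementation would lock the rows of each batch inside a database transaction rather than use one process-wide lock.

```python
import threading
from typing import Dict, Iterable


def migrate_in_batches(source: Dict[int, str],
                       target: Dict[int, str],
                       keys: Iterable[int],
                       lock: threading.Lock,
                       batch_size: int = 2) -> int:
    """Move the given keys from source to target in small locked batches.

    Returns the number of batches committed.
    """
    pending = sorted(keys)
    committed = 0
    while pending:
        batch = pending[:batch_size]          # step 4: take a small batch
        if not lock.acquire(timeout=0.1):     # step 5: lock not obtained
            continue                          # step 6: wait (the timeout) and retry
        try:
            for key in batch:                 # step 7: move rows unchanged
                target[key] = source.pop(key)
        finally:
            lock.release()                    # step 8: commit and release the lock
        pending = pending[batch_size:]        # step 9: continue with the next batch
        committed += 1
    return committed
```

Keeping `batch_size` small bounds how long any single lock is held, which is the stated reason for migrating in small batches.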
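The operations of the read/write phase (third phase) during data redistribution on a partition table, in the data processing method provided by the embodiments of the present application, are described below with reference to FIG. 8:
Specifically, data is operated on using the old-version table structure (such as table T1V0 in FIG. 10a). This covers the period from the preparation work of the first phase until just before step 11 of the second phase (after data migration) refreshes the table structure. During this period some nodes may operate on data using the old table structure. For operations on newly added data ranges, however, the user can operate on data in the newly added range (for which a set must first be added) only after the new table structure has been refreshed at step 25 of the first phase.
Step 1. Start the read/write operation.
Step 2. Determine whether data meeting the conditions is found; if not, go to step 3; if so, go to step 6.
Step 3. Determine whether the data lies outside the flag marked in step 16 of the first phase (data to be migrated), that is, outside the split point marked in the set; if not, the data really is not in the partition table, so jump directly to step 8; if so, follow the link (the association relationship between sets) to the linked set and continue searching for the data.
Step 4. Determine whether the data is found; if not, the data really is not in the partition table, so jump directly to step 8; if so, go to step 6.
Step 5. Lock the data according to the native logic.
Step 6. After processing is complete, commit the transaction.
Step 7. The read/write operation ends.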
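The operations of the read/write phase (fourth phase) during data redistribution on a partition table, in the data processing method provided by the embodiments of the present application, are described below with reference to FIG. 9:
Specifically, data is read and written using the new-version table structure:
Step 1. Start the read/write operation.
Step 2. Determine whether the data is found; if not, go to step 3; if so, go to step 7.
Step 3. Determine whether the set has a link to another set; if not, the data is not in the partition table, so go to step 9; if so, go to step 4.
Step 4. Follow the link to the corresponding set.
Step 5. Determine whether the data lies within the flag range of the corresponding set. If the old set has a flag and leads to the new set, then being led back from the new set to the original set means the data has not finished migrating. If not, there is no corresponding data (the data is not in the partition table), so jump to step 9; if so, jump to step 6.
Step 6. Determine whether data meeting the conditions is found; if not, there is no corresponding data (the data is not in the partition table), so jump to step 9; if so, jump to step 7.
Step 7. Determine whether the data is marked "mark delete"; if it is, this part of the data has already been migrated to the new set. If not (the data is not marked for deletion), go to step 9; if so, go to step 8.
Step 8. Return to the new set and search for the data again.
Step 9. Lock according to the transaction type (such as read or write).
Step 10. After processing is complete, commit the transaction.
Step 11. The read/write operation ends.
In the embodiments of the present application, the third and fourth phases described above actually operate on the same copy of the data. Describing them as two separate phases makes it easier to understand how business operations on target data are carried out during data redistribution.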
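The data processing method provided by the embodiments of the present application is further described below with reference to the example shown in FIG. 10a to FIG. 10f (taking the value range of the partition key as the data range):
FIG. 10a illustrates the first step of data redistribution in the embodiments of the present application: adding partition spaces. Specifically, relative to table T1V0, table T1V1 represents the partition table whose data organization has been adjusted. While partition spaces are being added, the new version of table T1 is invisible, that is, the user cannot see the table T1V1 produced by this step. The partition-key value range covered by the original partition table is [0,10000), whereas in the V1 definition of the T1 table structure the partition-key value range covered by the partition table is [0,15000); accordingly, set 7 and its corresponding partition space PS4 are added.
FIG. 10b illustrates the second step: traversing the partitions and considering whether a split (a new set) is needed. In the example of FIG. 10b, three partition-key value ranges, set 4, set 5, and set 6, are added. At this point set 4, set 5, and set 6 in FIG. 10b are still empty partitions; what these three sets show is only the partition-key value ranges, and the data actually stored is still in set 2 and set 3.
FIG. 10c illustrates the third step: moving the partition-key value ranges established in the second step (not the actual data) to the correct partitions. What changes in the third step is the data ranges recorded in the partition spaces and the correspondence between the partition spaces and the sets.
FIG. 10d illustrates the fourth step. Specifically, the partition-key value ranges of the data managed by each partition set of table T1 are updated (that is, the sets shown in FIG. 10d with double-dot-dash frames and dashed frames are processed). In addition, the definition of table T1 can be made visible to the user, who can now view the table. The fourth step completes the first phase (task preparation phase) described in the embodiment above. That is, after the user sends the DDL statement and execution reaches this point, the user is told that redistribution succeeded, although no data has actually been migrated (moved in the physical sense).
FIG. 10e illustrates the fifth step. Specifically, physical data movement begins, and the data is moved according to the set definitions (partition-key value ranges) in the partition management layer. Comparing the structures of the two tables in FIG. 10e shows that the data actually stored in set 2'' and set 3' has changed: the part of their data that was moved out has been moved into the sets associated with them.
FIG. 10f illustrates the sixth step. Specifically, before the sixth step is executed, all nodes can be refreshed so that they reach the latest version of the table definition. The split points marked in earlier steps and the association relationships between sets can then be removed.
In the embodiments of the present application, among the exemplary changes in data organization from FIG. 10a to FIG. 10f, the first table in FIG. 10a corresponds to the partition table that needs redistribution and the second table in FIG. 10f corresponds to the partition table after redistribution; they illustrate the data organization of the same table at versions from different points in time. Specifically, user business can be processed concurrently while the six steps described above are executed. In other words, the data processing method provided by the embodiments of the present application can be carried out online, without taking the business offline. Furthermore, user business and data redistribution are processed on a single copy of the data, which helps avoid the data inconsistency that creating a new copy for redistribution could cause.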
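An embodiment of the present application provides a data processing apparatus. As shown in FIG. 11, the data processing apparatus 1100 may include an obtaining module 1101, an adding module 1102, an updating module 1103, and a distribution module 1104.
The obtaining module 1101 is configured to obtain data redistribution information, the data redistribution information representing a new partition plan for a partition table, the partition table being a table whose data is distributed based on a partition key. The adding module 1102 is configured to create a corresponding partition space for each partition specified by the partition plan, and to record in each partition space the current data range of the partition corresponding to that partition space, the data range being one of a value range of the partition key and a value list of the partition key. The updating module 1103 is configured to update, based on the partition plan, the data range recorded in each partition space, and to use the updated data range of each partition space to determine the correspondence between each partition space and the partitions of the partition table. The distribution module 1104 is configured to update the data distribution in the partitions of the partition table based on the correspondence.
In an embodiment, the adding module 1102 is specifically configured to:
generate a corresponding partition space for each current partition of the partition table, and record the data range of that partition in its partition space;
when the partition plan indicates that a new partition is to be created, create the new partition, generate a corresponding partition space for it, and record in that partition space the data range the new partition is to store.
In an embodiment, the adding module 1102 is further configured to:
create a partition space list based on the partition spaces, the partition space list being used during data redistribution to record at least one of: the data range recorded in each partition space, the correspondence between partition spaces and partitions, the name of each partition space, and the correspondence between partition spaces and the partition table.
In an embodiment, when updating, based on the partition plan, the data range recorded in each partition space and using the updated data range of each partition space to determine the correspondence between each partition space and the partitions of the partition table, the updating module 1103 is specifically configured to:
determine the new data range of each partition in the partition table based on the partition plan;
based on the new data ranges of the partitions, update the data range recorded in the partition space corresponding to each partition;
based on the updated data range of each partition space, determine the data whose partition is to change in the partition corresponding to each partition space, and create a new partition for that data in each partition;
determine the correspondence between the partitions created for the data whose partition is to change and the partition spaces, and determine the data ranges of those partitions.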
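In an embodiment, the partitions are ordered by the value range of the partition key or by key value. When determining, based on the updated data range of each partition space, the data whose partition is to change in the partition corresponding to each partition space and creating a new partition for that data, the updating module 1103 is specifically configured to:
perform the following adjustment for each partition in the partition table, in order:
compare the data range of the data currently stored in the partition with the data range recorded in the partition space corresponding to the partition;
if the maximum partition-key value of the data range currently stored in the partition is greater than the maximum partition-key value recorded in the partition space corresponding to the partition, determine, based on that recorded maximum, the first data whose partition is to change, and create a new first partition for the first data in each partition;
if the minimum partition-key value of the data range currently stored in the partition is smaller than the minimum partition-key value recorded in the partition space corresponding to the partition, determine, based on that recorded minimum, the second data whose partition is to change, and create a new second partition for the second data.
In an embodiment, when determining, based on the maximum partition-key value in the partition space corresponding to the partition, the first data whose partition is to change and creating a new first partition for it, the updating module 1103 is specifically configured to:
mark a maximum split point in the partition based on the maximum partition-key value in the partition space corresponding to the partition;
create a new first partition for the first data whose partition is to change in each partition;
establish an association relationship between the partition and the first partition.
When determining, based on the minimum partition-key value in the partition space corresponding to the partition, the second data whose partition is to change and creating a new second partition for it, the updating module 1103 is specifically configured to:
mark a minimum split point in the partition based on the minimum partition-key value in the partition space corresponding to the partition;
create a new second partition for the second data whose partition is to change in each partition;
establish an association relationship between the partition and the second partition.
In an embodiment, when updating the data distribution in the partitions of the partition table based on the correspondence, the distribution module 1104 is specifically configured to:
based on the correspondence, perform the following data movement for each partition that requires data transfer:
move the first data to the corresponding first partition, the first data being the data whose partition keys lie between the maximum partition-key value recorded in the partition space corresponding to the partition and the maximum partition-key value of the data range currently stored in the partition;
move the second data to the corresponding second partition, the second data being the data whose partition keys lie between the minimum partition-key value of the data range currently stored in the partition and the minimum partition-key value in the partition space corresponding to the partition;
delete the mark of the maximum split point, the mark of the minimum split point, and the association relationships.
In an embodiment, the apparatus 1100 further includes an online read/write module which, before the update of the data distribution of the partition table is completed, is configured to:
in response to a processing operation on target data, if the key value of the partition key of the target data is greater than the key value at the maximum split point of any partition, or smaller than the key value at the minimum split point of any partition, query the target data in that partition;
if the query obtains the storage location of the target data, return the processing result for the target data; if the storage location is not found, determine the storage location of the target data in the first partition or second partition associated with the partition, based on the association relationship of the partition, and return the processing result for the target data.
The apparatus of the embodiments of the present application can execute the method provided by the embodiments of the present application, and its implementation principle is similar. The actions performed by the modules of the apparatus correspond to the steps of the methods in the embodiments; for a detailed functional description of each module, refer to the description of the corresponding method above, which is not repeated here.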
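The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, each containing a certain amount of processing data that is used to verify the validity of its information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
An embodiment of the present application provides an electronic device including a memory, a processor, and a computer program stored on the memory, the processor executing the computer program to implement the steps of the data processing method. Compared with the related art, the following can be achieved. The application operates on a partition table, which may include a physical table whose data is distributed based on a preset partition key. The latest data redistribution information is obtained and used to adjust the data distribution of the partition table. On this basis, at least one partition space can be added to the partition table; each partition space corresponds to a partition in the partition table and can store the data range of that partition, where the data range includes the value range or the value list of the partition key. The added partition space decouples the partition definition of the partition table from the data storage of the actual partitions. Based on the predefined data range of at least one partition in the data redistribution information, the correspondence between the current partition spaces and the partitions can then be updated, after which the data distribution in the partition table can be updated based on the newly obtained correspondence. The application thus achieves data redistribution through the added partition space. In addition, it does not need to synchronize the full data of the partition table, which helps shorten the execution time of data redistribution, improve data processing efficiency, and reduce business blocking.
An optional embodiment provides an electronic device. As shown in FIG. 12, the electronic device 1200 includes a processor 1201 and a memory 1203. The processor 1201 is connected to the memory 1203, for example through a bus 1202. Optionally, the electronic device 1200 may further include a transceiver 1204, which may be used for data interaction between the electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that, in practical applications, the transceiver 1204 is not limited to one, and the structure of the electronic device 1200 does not limit the embodiments of the present application.
The processor 1201 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 1201 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 1202 may include a path for communicating information between the above components. The bus 1202 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 1202 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 12, but this does not mean that there is only one bus or one type of bus.
The memory 1203 may be a ROM (Read-Only Memory) or another type of static storage device that can store static information and instructions; a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions; an EEPROM (Electrically Erasable Programmable Read-Only Memory); a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.); a magnetic disk storage medium or other magnetic storage device; or any other medium that can carry or store a computer program and can be read by a computer, without limitation.
The memory 1203 is used to store the computer program for executing the embodiments of the present application, and its execution is controlled by the processor 1201. The processor 1201 is configured to execute the computer program stored in the memory 1203 to implement the steps shown in the foregoing method embodiments.
Electronic devices include but are not limited to smart phones, tablet computers, laptops, smart speakers, smart watches, smart voice interaction devices, and vehicle-mounted terminals.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps and corresponding content of the foregoing method embodiments can be realized.
An embodiment of the present application also provides a computer program product including a computer program; when the computer program is executed by a processor, the steps and corresponding content of the foregoing method embodiments can be realized.
It should be understood that although arrows indicate the operation steps in the flowcharts of the embodiments of the present application, the execution order of these steps is not limited to the order indicated by the arrows. Unless explicitly stated otherwise herein, in some implementation scenarios the steps in each flowchart may be performed in other orders as required. In addition, some or all of the steps in each flowchart may include multiple sub-steps or stages, depending on the actual implementation scenario. Some or all of these sub-steps or stages may be executed at the same time, or each may be executed at a different time. Where execution times differ, the execution order of these sub-steps or stages can be flexibly configured as required, which is not limited in the embodiments of the present application.
The above are only optional implementations of some implementation scenarios of the present application. It should be noted that, for those of ordinary skill in the art, other similar means of implementation based on the technical ideas of the present application, adopted without departing from the technical concept of the solution of the present application, also fall within the scope of protection of the embodiments of the present application.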

Claims (12)

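  1. A data processing method, performed in an electronic device, the method comprising:
    obtaining data redistribution information, the data redistribution information representing a new partition plan for a partitioned table, the partitioned table being a table whose data is distributed based on a partition key;
    creating a corresponding partition space for each partition specified by the partition plan, and recording, in each partition space, a current data range of the partition corresponding to that partition space, the data range comprising one of a value range of the partition key and a value list of the partition key;
    updating, based on the partition plan, the data range recorded in each partition space, and updating, by using the updated data range of each partition space, a correspondence between each partition space and the partitions of the partitioned table; and
    updating data distribution in the partitions of the partitioned table based on the correspondence.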
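  2. The method according to claim 1, wherein the creating a corresponding partition space for each partition specified by the partition plan, and recording, in each partition space, a current data range of the partition corresponding to that partition space comprises:
    generating a corresponding partition space for each current partition in the partitioned table, and recording the data range of each partition in the partition space corresponding to that partition; and
    when the partition plan indicates that a new partition is to be established, creating the new partition, generating a corresponding partition space for the new partition, and recording, in the partition space corresponding to the new partition, the data range corresponding to the data to be stored in the new partition.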
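  3. The method according to claim 1, further comprising:
    creating a partition space list based on the partition spaces, the partition space list being used to record, during data redistribution, at least one of the following: the data range recorded in each partition space, the correspondence between partition spaces and partitions, the name of each partition space, and the correspondence between partition spaces and the partitioned table.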
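  4. The method according to claim 1, wherein the updating, based on the partition plan, the data range recorded in each partition space, and determining, by using the updated data range of each partition space, a correspondence between each partition space and the partitions of the partitioned table comprises:
    determining a new data range of each partition in the partitioned table based on the partition plan;
    updating, based on the new data range of each partition, the data range recorded in the partition space corresponding to each partition;
    determining, based on the updated data range of each partition space, data whose partition is to be changed in the partition corresponding to each partition space, and creating a new partition for the data whose partition is to be changed in each partition; and
    determining a correspondence between the created partition corresponding to the data whose partition is to be changed and a partition space, and determining a data range of the partition corresponding to the data whose partition is to be changed.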
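  5. The method according to claim 4, wherein the partitions in the partitioned table are sorted based on the value range of the partition key or the size of the key values; and the determining, based on the updated data range of each partition space, data whose partition is to be changed in the partition corresponding to each partition space, and creating a new partition for the data whose partition is to be changed in each partition comprises:
    performing, in order, the following adjustment operations on each partition in the partitioned table:
    comparing the data range of the data currently stored in the partition with the data range recorded in the partition space corresponding to the partition;
    if the maximum value of the partition key in the data range of the data currently stored in the partition is greater than the maximum value of the partition key recorded in the partition space corresponding to the partition, determining, based on the maximum value of the partition key in the partition space corresponding to the partition, first data whose partition is to be changed, and creating a new first partition for the first data; and
    if the minimum value of the partition key in the data range of the data currently stored in the partition is less than the minimum value of the partition key recorded in the partition space corresponding to the partition, determining, based on the minimum value of the partition key in the partition space corresponding to the partition, second data whose partition is to be changed, and creating a new second partition for the second data.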
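  6. The method according to claim 5, wherein
    the determining, based on the maximum value of the partition key in the partition space corresponding to the partition, first data whose partition is to be changed, and creating a new first partition for the first data comprises:
    marking a maximum split point in the partition based on the maximum value of the partition key in the partition space corresponding to the partition;
    creating a new first partition for the first data whose partition is to be changed in each partition; and
    establishing an association relationship between the partition and the first partition; and
    the determining, based on the minimum value of the partition key in the partition space corresponding to the partition, second data whose partition is to be changed, and creating a new second partition for the second data comprises:
    marking a minimum split point in the partition based on the minimum value of the partition key in the partition space corresponding to the partition;
    creating a new second partition for the second data whose partition is to be changed in each partition; and
    establishing an association relationship between the partition and the second partition.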
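  7. The method according to claim 6, wherein the updating data distribution in the partitions of the partitioned table based on the correspondence comprises:
    performing, based on the correspondence, the following data movement operations on each partition that requires data transfer:
    moving first data to the corresponding first partition, the first data being the data corresponding to partition keys between the maximum value of the partition key recorded in the partition space corresponding to the partition and the maximum value of the partition key in the data range of the data currently stored in the partition;
    moving second data to the corresponding second partition, the second data being the data corresponding to partition keys between the minimum value of the partition key in the data range of the data currently stored in the partition and the minimum value of the partition key in the partition space corresponding to the partition; and
    deleting the mark of the maximum split point, the mark of the minimum split point, and the association relationships.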
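  8. The method according to claim 6, further comprising, before the updating of the data distribution of the partitioned table is completed:
    in response to a processing operation on target data, when it is determined that the key value of the partition key corresponding to the target data is greater than the key value of the partition key corresponding to the maximum split point of any partition, or less than the key value of the partition key corresponding to the minimum split point of any partition, querying the target data in that partition; and
    if a storage location corresponding to the target data is found, returning a processing result for the target data; if the storage location is not found, determining, based on the association relationship of the partition, the storage location corresponding to the target data in the first partition or the second partition corresponding to the partition, and returning the processing result for the target data.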
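  9. A data processing apparatus, comprising:
    an obtaining module, configured to obtain data redistribution information, the data redistribution information representing a new partition plan for a partitioned table, the partitioned table being a table whose data is distributed based on a partition key;
    an adding module, configured to create a corresponding partition space for each partition specified by the partition plan, and record, in each partition space, a current data range of the partition corresponding to that partition space, the data range comprising one of a value range of the partition key and a value list of the partition key;
    an updating module, configured to update, based on the partition plan, the data range recorded in each partition space, and determine, by using the updated data range of each partition space, a correspondence between each partition space and the partitions of the partitioned table; and
    a distribution module, configured to update data distribution in the partitions of the partitioned table based on the correspondence.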
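  10. An electronic device, comprising:
    one or more processors;
    a memory; and
    one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs being configured to perform the method according to any one of claims 1 to 8.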
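  11. A computer-readable storage medium, the storage medium being configured to store computer instructions which, when run on a computer, enable the computer to perform the method according to any one of claims 1 to 8.
  12. A computer program product, comprising a computer program or instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 8.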
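
The split-point handling recited in claims 5 to 8 can be sketched as follows. This is an illustrative reading only, not the claimed implementation: the `Partition` class and the functions `mark_splits`, `lookup`, and `move_data` are hypothetical names, partition keys are assumed to be integers, and the bounds recorded in a partition space are assumed to be inclusive.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Partition:
    name: str
    rows: Dict[int, str] = field(default_factory=dict)  # partition key -> row
    max_split: Optional[int] = None  # keys above this are pending a move to `first`
    min_split: Optional[int] = None  # keys below this are pending a move to `second`
    first: Optional["Partition"] = None   # associated new first partition
    second: Optional["Partition"] = None  # associated new second partition


def mark_splits(part: Partition, space_min: int, space_max: int) -> None:
    """Compare the stored range with the range recorded in the partition space."""
    if not part.rows:
        return
    if max(part.rows) > space_max:
        part.max_split = space_max
        part.first = Partition(part.name + "_hi")
    if min(part.rows) < space_min:
        part.min_split = space_min
        part.second = Partition(part.name + "_lo")


def lookup(part: Partition, key: int) -> Optional[str]:
    """While a move is in progress, query the partition first, then its associated partition."""
    if key in part.rows:
        return part.rows[key]
    if part.max_split is not None and key > part.max_split and part.first:
        return part.first.rows.get(key)
    if part.min_split is not None and key < part.min_split and part.second:
        return part.second.rows.get(key)
    return None


def move_data(part: Partition):
    """Move out-of-range rows, then delete the split marks and the associations."""
    if part.first is not None:
        for k in [k for k in part.rows if k > part.max_split]:
            part.first.rows[k] = part.rows.pop(k)
    if part.second is not None:
        for k in [k for k in part.rows if k < part.min_split]:
            part.second.rows[k] = part.rows.pop(k)
    first, second = part.first, part.second
    part.max_split = part.min_split = None
    part.first = part.second = None
    return first, second


p = Partition("p1", {5: "a", 20: "b", 35: "c"})
mark_splits(p, space_min=10, space_max=30)
p.first.rows[35] = p.rows.pop(35)   # simulate a move that is only partly done
found_moved = lookup(p, 35)         # served from the associated first partition
found_pending = lookup(p, 5)        # still served from the original partition
first, second = move_data(p)
```

Keeping the split marks and associations until the move completes is what lets reads and writes continue during redistribution in this sketch; once `move_data` clears them, each key is served by exactly one partition.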
PCT/CN2022/125826 2021-10-19 2022-10-18 Data processing method and apparatus, electronic device, storage medium, and program product WO2023066222A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/450,577 US20230394024A1 (en) 2021-10-19 2023-08-16 Data processing method and apparatus, electronic device, storage medium, and program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111217539.9 2021-10-19
CN202111217539.9A CN113641686B (zh) 2021-10-19 2021-10-19 Data processing method and apparatus, electronic device, storage medium, and program product

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/450,577 Continuation US20230394024A1 (en) 2021-10-19 2023-08-16 Data processing method and apparatus, electronic device, storage medium, and program product

Publications (1)

Publication Number Publication Date
WO2023066222A1 (zh) 2023-04-27

Family

ID=78427355

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125826 WO2023066222A1 (zh) 2021-10-19 2022-10-18 Data processing method and apparatus, electronic device, storage medium, and program product

Country Status (3)

Country Link
US (1) US20230394024A1 (zh)
CN (1) CN113641686B (zh)
WO (1) WO2023066222A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641686B (zh) * 2021-10-19 2022-02-15 Tencent Technology (Shenzhen) Co., Ltd. Data processing method and apparatus, electronic device, storage medium, and program product

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10157214B1 (en) * 2014-11-19 2018-12-18 Amazon Technologies, Inc. Process for data migration between document stores
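CN106415534B (zh) * 2015-05-31 2019-09-20 Huawei Technologies Co., Ltd. Method and device for partitioning associated tables in a distributed database
CN111767268A (zh) * 2020-06-23 2020-10-13 Ping An Puhui Enterprise Management Co., Ltd. Database table partitioning method and apparatus, electronic device, and storage medium
CN112765262A (zh) * 2019-11-05 2021-05-07 ZTE Corporation Data redistribution method, electronic device, and storage medium
CN113434470A (zh) * 2021-06-24 2021-09-24 Huayun Data Holding Group Co., Ltd. Data distribution method and apparatus, and electronic device
CN113641686A (zh) * 2021-10-19 2021-11-12 Tencent Technology (Shenzhen) Co., Ltd. Data processing method and apparatus, electronic device, storage medium, and program product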

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7877405B2 (en) * 2005-01-07 2011-01-25 Oracle International Corporation Pruning of spatial queries using index root MBRS on partitioned indexes
US9836492B1 (en) * 2012-11-01 2017-12-05 Amazon Technologies, Inc. Variable sized partitioning for distributed hash tables
JP6175958B2 (ja) * 2013-07-26 2017-08-09 Fujitsu Limited Memory dump method and program, and information processing apparatus
US10375164B1 (en) * 2013-12-30 2019-08-06 Emc Corporation Parallel storage system with burst buffer appliance for storage of partitioned key-value store across a plurality of storage tiers
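WO2017113276A1 (zh) * 2015-12-31 2017-07-06 Huawei Technologies Co., Ltd. Method, apparatus, and system for data reconstruction in a distributed storage system
CN108932256A (zh) * 2017-05-25 2018-12-04 ZTE Corporation Distributed data redistribution control method and apparatus, and data management server
CN111104057B (zh) * 2018-10-25 2022-03-29 Huawei Technologies Co., Ltd. Node expansion method in a storage system, and storage system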

Also Published As

Publication number Publication date
US20230394024A1 (en) 2023-12-07
CN113641686A (zh) 2021-11-12
CN113641686B (zh) 2022-02-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882828

Country of ref document: EP

Kind code of ref document: A1