CN113760902A

CN113760902A - Data splitting method, device, equipment, medium and program product

Info

Publication number: CN113760902A
Application number: CN202110236743.9A
Authority: CN
Inventors: 王思佳; 许海华; 王云博; 鲁大帅; 傅朋; 张师聪
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2021-03-03
Filing date: 2021-03-03
Publication date: 2021-12-07

Abstract

The application provides a data splitting method, apparatus, device, medium and program product. At least one data source node is determined in a database cluster to be split according to a preset requirement; the data to be split in the database cluster to be split is then sent to the data source node; the data to be split is split in the data source node according to a preset splitting mode to determine split data; and finally the split data is sent, through the data source node, to at least one target node of at least one target database cluster. The method addresses the technical problems of the prior art that data splitting involves changes in many places, that splitting code has low reusability, and that the business side has to modify its query code whenever the table-splitting logic changes. It achieves efficient data splitting without changing the table-splitting logic, high reusability of the splitting code, and no need for the business side to change its query code.

Description

Data splitting method, device, equipment, medium and program product

Technical Field

The present application relates to the field of computer databases, and in particular, to a data splitting method, apparatus, device, medium, and program product.

Background

With the continuous development of internet technology, internet services are growing rapidly. Most services cannot predict their later data scale when they go online, so most are developed with a single database instance cluster carrying the service. As the business grows, however, the data volume in some data tables becomes too large, which degrades the query and update performance of the system as a whole and in turn reduces the overall service throughput of the system.

To solve this problem, the prior art generally either splits a data table with a complex header structure into a plurality of data tables with simple header structures, or uses database replication technology to move a data table with a large data volume into a separate new database, thereby improving the overall operating efficiency of the system.

However, whether the table or the database is split, the above splitting methods generalize poorly and the splitting process is complex. When the service keeps growing, the same problem arises again at the next split, and the SQL (Structured Query Language) code that the business side uses to access the database must be modified at the same time. That is, the prior art has the technical problems that data splitting involves changes in many places, that splitting code has low reusability, and that the business side must modify its database access code at the same time.

Disclosure of Invention

The application provides a data splitting method, a device, equipment, a medium and a program product, which aim to solve the technical problems that in the prior art, data splitting involves multi-aspect changes, split codes are low in reusability, and a database access code needs to be modified on a business side at the same time.

In a first aspect, the present application provides a data splitting method, including:

determining at least one data source node in a database cluster to be split according to preset requirements, wherein the data source node is an intermediate medium for data copying and transferring between the database cluster to be split and a target database cluster;

sending the data to be split in the database cluster to be split to the data source node;

splitting the data to be split in the data source node according to a preset splitting mode to determine split data, wherein the preset splitting mode preserves the logical structure of the original data table, that is, the preset splitting mode splits the data without changing the header structure of the original data table;

and sending the split data to a target node of a target database cluster through the data source node.

In one possible design, the database cluster to be split includes at least one master-slave relationship node, and the data source node comprises at least one slave node of the master-slave relationship node.

In one possible design, the splitting the data to be split in the data source node according to a preset splitting manner to determine split data includes:

and determining the split data according to preset split parameters corresponding to preset key fields in the data to be split.

Optionally, the preset splitting parameters include a splitting interval range and at least one splitting value within the splitting interval range, and the determining the split data according to preset splitting parameters corresponding to preset key fields in the data to be split includes:

keeping the header structure of a data table fixed, and determining the split data according to the splitting interval range corresponding to the preset key field in the data table and the splitting value, wherein the split data has the same structure as the data table, and the data to be split comprises at least one data table.

In one possible design, the sending, by the data source node, the split data to a target node of a target database cluster includes:

creating a write queue for each target node;

reading the split data from the data source node by using a reading coroutine, and inserting the split data into the write-in queue;

and utilizing a write coroutine to sequentially copy the split data from the write queue to the corresponding target node.
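
The queue-and-coroutine design above can be illustrated with a minimal Python `asyncio` sketch. It is not the implementation of the application: the target nodes are stood in for by in-memory lists, and the names `reader`, `writer`, `copy_split_data` and the `route` callback are hypothetical helpers introduced only for illustration.

```python
import asyncio

async def reader(rows, route, queues):
    """Read coroutine: take split rows from the data source node and enqueue them."""
    for row in rows:
        await queues[route(row)].put(row)
    for queue in queues.values():
        await queue.put(None)  # sentinel: no more rows for this target node

async def writer(queue, target):
    """Write coroutine: copy rows from one write queue to its target node, in order."""
    while True:
        row = await queue.get()
        if row is None:
            break
        target.append(row)  # stand-in (assumption) for an INSERT on the target node

async def copy_split_data(rows, route, targets):
    # One write queue per target node, as in the design above.
    queues = {name: asyncio.Queue(maxsize=100) for name in targets}
    await asyncio.gather(
        reader(rows, route, queues),
        *(writer(queues[name], targets[name]) for name in targets),
    )

targets = {"node_a": [], "node_b": []}
rows = [{"id": i} for i in range(6)]
asyncio.run(copy_split_data(rows, lambda r: "node_a" if r["id"] < 3 else "node_b", targets))
```

Because each target node has its own queue and its own write coroutine, a slow target only delays its own queue, and rows reach each target in the order in which they were read.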

In one possible design, after the sending the data to be split in the database cluster to be split to the data source node, the method further includes:

and removing the data source node from the database cluster to be split so as to keep the total amount of data in the data source node unchanged.

In one possible design, after the sending, by the data source node, the split data to at least one target node of at least one target database cluster, the method further includes:

re-accessing the data source node into the database cluster to be split;

sending the updated data to be split to the data source node;

splitting the data to be split again according to a preset splitting mode to determine new split data;

and sending the new split data to the target node through the data source node in a filtering and copying mode, wherein the filtering and copying mode is used for filtering out the split data existing in the target node.
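
A filtering copy of this kind might look like the following sketch. It assumes that each record carries a unique key (here a hypothetical `id` field) by which records already present in the target node can be recognized; how the application actually identifies existing data is not stated in the text.

```python
def filtered_copy(new_split_rows, target_rows, key="id"):
    """Copy only the rows whose key is not yet present in the target node."""
    existing = {row[key] for row in target_rows}
    copied = []
    for row in new_split_rows:
        if row[key] in existing:
            continue  # already copied in an earlier pass: filter it out
        target_rows.append(row)
        existing.add(row[key])
        copied.append(row)
    return copied

target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
second_pass = [{"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
copied = filtered_copy(second_pass, target)
```

In this sketch only the record with `id` 3 is copied in the second pass. Records that were modified after the first pass would need an additional comparison of their contents, which is left out here.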

Optionally, the preset requirement includes: the data source node is a read-only node.

Further optionally, the read-only node is a read-only type slave node in the master-slave relationship node.

In one possible design, the sending the data to be split in the database cluster to be split to the data source node includes:

and sending the first data to be split meeting preset conditions to the data source node according to the attribute characteristics of the data to be split, wherein the data to be split comprises the first data to be split.

Optionally, the sending, according to the attribute feature of the data to be split, the first data to be split that meets a preset condition to the data source node according to a preset sending manner includes:

sending second data to be split and third data to be split to the data source node according to a preset sequence, wherein the attribute feature of the second data to be split is greater than or equal to a preset feature threshold, and the attribute feature of the third data to be split is smaller than the preset feature threshold;

the first data to be split comprises the second data to be split and the third data to be split.

Optionally, the sending the second data to be split and the third data to be split to the data source node according to a preset sequence includes:

one of the second data to be split and the third data to be split is sent to the data source node;

and after the data source node sends the split data corresponding to the second data to be split or the third data to be split to the target node, the other data to be split is sent to the data source node.

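In one possible design, the sending the data to be split in the database cluster to be split to the data source node includes:

arranging the data to be split according to the requirements of preset attribute characteristics to determine a data queue to be split;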

sending the data to be split to the data source node in batches according to the arrangement sequence of the data queue to be split; wherein,

and after the data of the previous batch is split in the data source node and the split data is sent to the target node through the data source node, sending the data of the next batch to the data source node.

In one possible design, when the number of the data source nodes is at least two, the sending the data to be split in the database cluster to be split to the data source nodes includes:

and distributing the data to be split to different data source nodes according to the attribute characteristics of the data to be split.

Optionally, the attribute features include a first attribute feature and a second attribute feature, and the distributing the data to be split to different data source nodes according to the attribute characteristics of the data to be split includes:

sending fourth data to be split to a first data source node, wherein a first attribute characteristic of the fourth data to be split meets a first characteristic requirement;

sending fifth data to be split to a second data source node, wherein a second attribute characteristic of the fifth data to be split meets a second characteristic requirement;

the data to be split comprises: the fourth data to be split and the fifth data to be split, where the data source node includes: the first data source node and the second data source node.
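
As a rough illustration of distributing data to several data source nodes, the sketch below assigns each data table to the first node whose feature requirement it meets. The attribute used (a row count), the threshold and the node names are assumptions made for the example; the text does not specify which attribute features are used.

```python
def distribute(items, requirements):
    """Assign each item to the first data source node whose requirement it meets."""
    assignment = {node: [] for node in requirements}
    for item in items:
        for node, meets in requirements.items():
            if meets(item):
                assignment[node].append(item["name"])
                break
        else:
            raise ValueError(f"no data source node accepts {item['name']}")
    return assignment

tables = [
    {"name": "orders", "rows": 9_000_000},
    {"name": "users", "rows": 40_000},
    {"name": "payments", "rows": 5_000_000},
]
# Hypothetical feature requirements: large tables go to one node, small ones to another.
requirements = {
    "source_node_1": lambda t: t["rows"] >= 1_000_000,
    "source_node_2": lambda t: t["rows"] < 1_000_000,
}
plan = distribute(tables, requirements)
```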

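In one possible design, after the sending the split data to a target node of a target database cluster through the data source node, the method further includes:

checking whether the data in the target node is correctly copied;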

if so, switching part or all of the services of the database cluster to be split into the target database cluster;

if not, correcting the corresponding problem data.
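
The check-then-switch step could be sketched as follows. The text does not say how correctness is checked; comparing an order-independent SHA-256 digest of the expected rows with that of the rows in the target node is one assumed approach, and `table_digest` and `check_and_switch` are hypothetical names.

```python
import hashlib

def table_digest(rows, key="id"):
    """Order-independent digest of a list of rows (each row is a dict)."""
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r[key]):
        digest.update(repr(sorted(row.items())).encode("utf-8"))
    return digest.hexdigest()

def check_and_switch(expected_rows, target_rows, key="id"):
    """Return ('switch', []) if the copy is verified, else ('repair', problem_keys)."""
    if table_digest(expected_rows, key) == table_digest(target_rows, key):
        return "switch", []
    have = {r[key]: r for r in target_rows}
    want = {r[key] for r in expected_rows}
    missing_or_wrong = [r[key] for r in expected_rows if have.get(r[key]) != r]
    unexpected = [k for k in have if k not in want]
    return "repair", missing_or_wrong + unexpected

expected = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
ok = check_and_switch(expected, list(reversed(expected)))
bad = check_and_switch(expected, [{"id": 1, "v": "a"}])
```

Only when the result is `switch` would services be moved to the target database cluster; otherwise the listed keys identify the problem data to correct.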

In a second aspect, the present application provides a data splitting apparatus, including:

a source node selection module, configured to determine at least one data source node in a database cluster to be split according to a preset requirement, wherein the data source node is an intermediate medium for data replication and transfer between the database cluster to be split and a target database cluster;

a data to be split preparation module, configured to send the data to be split in the database cluster to be split to the data source node;

a splitting module, configured to split the data to be split in the data source node according to a preset splitting mode so as to determine split data, wherein the preset splitting mode preserves the logical structure of the original data table;

the splitting module is further configured to send the split data to at least one target node of at least one target database cluster through the data source node.

In a possible design, the splitting module is specifically configured to determine the split data according to a preset splitting parameter corresponding to a preset key field in the data to be split.

Optionally, the preset splitting parameters include a splitting interval range and at least one splitting value within the splitting interval range, and the splitting module is specifically configured to:

keep the structure of a data table unchanged, and determine the split data according to the splitting interval range corresponding to the preset key field in the data table and the splitting value, wherein the split data has the same structure as the data table, and the data to be split comprises at least one data table.

In one possible design, the splitting module is further specifically configured to:

creating a write queue for each target node;

In one possible design, the splitting module is further configured to remove the data source node from the database cluster to be split, so that the total amount of data in the data source node remains unchanged.

In one possible design, the splitting module is further configured to:

re-accessing the data source node into the database cluster to be split;

sending the updated data to be split to the data source node;

In a possible design, the data to be split preparation module is configured to send, according to attribute features of the data to be split, first data to be split that meets a preset condition to the data source node in a preset sending manner, where the data to be split includes the first data to be split.

Optionally, the to-be-split data preparation module is specifically configured to:

In one possible design, the data to be split preparation module is configured to arrange the data to be split according to requirements of preset attribute characteristics to determine a data queue to be split;

the data to be split preparation module is further configured to send the data to be split to the data source node in batches according to the arrangement order of the data queue to be split; wherein,

In a possible design, when the number of the data source nodes is at least two, the data to be split preparation module is configured to distribute the data to be split to different data source nodes according to the attribute characteristics of the data to be split.

Optionally, the attribute features include a first attribute feature and a second attribute feature; the data to be split preparation module is configured to send fourth data to be split to a first data source node, wherein the first attribute feature of the fourth data to be split meets a first feature requirement;

the data to be split preparation module is further configured to send fifth data to be split to a second data source node, where a second attribute characteristic of the fifth data to be split meets a second characteristic requirement;

In one possible design, the data splitting apparatus further includes:

a checking module, configured to check whether the data in the target node is copied correctly;

if so, a switching module is used for switching part or all of the services of the database cluster to be split into the target database cluster;

if not, the splitting module is further used for correcting the corresponding problem data.

In a third aspect, the present application provides an electronic device comprising:

a memory for storing program instructions;

and a processor for calling and executing the program instructions in the memory to perform any one of the possible data splitting methods provided in the first aspect.

In a fourth aspect, the present application provides a storage medium, where a computer program is stored, where the computer program is used to execute any one of the possible data splitting methods provided in the first aspect.

In a fifth aspect, the present application further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements any one of the possible data splitting methods provided in the first aspect.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

Figs. 1a to 1c are schematic diagrams of an application scenario for maintaining a logical structure of a data table unchanged according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a data splitting method according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a database cluster to be split according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of another database cluster to be split according to an embodiment of the present application;

Fig. 5 is a diagram of a subsequence constructed from a logical sequence provided in an embodiment of the present application;

Fig. 6 is a schematic flowchart illustrating an implementation flow of S202 in the data splitting method shown in Fig. 2 according to an embodiment of the present application;

Fig. 7 is a schematic flow chart of another data splitting method provided in the present application;

Fig. 8 is a schematic structural diagram of a data splitting apparatus according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of an electronic device provided in the present application.

With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, including but not limited to combinations of embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any inventive step are within the scope of the present application.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

When most services go online, their later data scale cannot be predicted, so a single database instance cluster is used to carry the service. As the business develops and grows rapidly, the data volume of the tables becomes too large, which degrades query and update performance as a whole and in turn reduces the overall throughput of the system.

Splitting is then required, and database splitting can be implemented in various ways. The prior art splits database data in one of two ways. The first splits a data table with a complex header structure into a plurality of new data tables with simple header structures, for example splitting table t into tables t1 to t16 or more to provide services. The second moves a data table with a large data volume into a separate new database whose system version is newer, whose capacity is larger, and whose read and write capability for data tables with a large data volume is stronger.

However, a common problem in the prior art is that splitting code generalizes very poorly: developers need to rewrite the splitting code for each split, or modify existing splitting code by hand. Moreover, whether the table or the database is split, the SQL read-write code that the business side uses to access the database inevitably has to be modified accordingly. That is, the prior art has the technical problems that data splitting involves changes in many places, that splitting code has low reusability, and that the business side must modify its database access code at the same time.

In order to solve the technical problems, the invention idea of the application is as follows:

The data table is the basic element of a database, and when the business side accesses the database it is essentially performing a series of operations such as inserting, deleting, reading and updating on data tables. That is, if neither the access code of the business side nor the business logic is to change, the data table must not change.

However, the data size of the data table is too large, which affects the access efficiency of the service side and requires us to change the data table. This seems to be a contradictory unsolved problem.

After further analysis, the inventors of the present application found that those skilled in the art regard the above problem as unsolvable because of their habitual way of thinking about "modifying the data table". They habitually think of a table as a visual table: splitting it vertically or horizontally breaks it into a plurality of new tables, and the original table no longer exists.

The inventors of the present application broke through this cognitive inertia and started from the essence of the data table. A data table does not in fact depend on a physical table for its existence; a data table in a database is not the same thing as a table printed on paper. A data table is actually a collection of data with a predetermined logical structure. It is this logical structure, together with the inherent relations of the data itself, that determines whether a data element belongs to the collection that is the data table.

That is, wherever and in whatever form the data is stored, the data table is unchanged as long as its logical structure is not altered. The fundamental aim of splitting is not to cut up or rebuild the data table, but to let the business side locate data quickly when accessing it. As long as the table-splitting logic is unchanged, the business side naturally does not need to rewrite its access code.

The problem thus becomes how to keep the data stored in each new database consistent with the logical structure of the data table in the original database, i.e. the table-splitting logic, so that the logical data table never changes no matter how many times the data table is split. The method must also not be overly complicated, so as to avoid an excessive amount of computation.

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

First, we introduce an implementation that keeps the table-splitting logic, i.e., the logical structure of the data table, unchanged.

Fig. 1a to 1c are schematic diagrams of an application scenario in which a logical structure of a data table is kept unchanged according to an embodiment of the present application. As shown in fig. 1a, a data table 100 to be split contains a large number of data records, and a header structure can be extracted separately to form a first logic sequence 101, while the following data records are associated with a second logic sequence 102. The first logic sequence 101 and the second logic sequence 102 thus form a logical structure of the data table 100, or the data table 100 can be represented using the first logic sequence 101 and the second logic sequence 102.

It should be noted that the values of the elements in the first logic sequence 101 and the second logic sequence 102 may be coded values obtained through a preset coding manner, for example, coding through a hash algorithm. The access code of the service side can correspond to the first logic sequence 101 and the second logic sequence 102 at compile time.

Splitting the data table 100 may thus be converted into splitting the first logic sequence 101 and/or the second logic sequence 102, i.e., the first logic sequence 101 and/or the second logic sequence 102 may contain a plurality of subsequences.

Figs. 1b and 1c show two parts of the data table 100 stored in the new database after the first logic sequence 101 is left unchanged and the second logic sequence 102 is split horizontally. Note that these are two parts rather than two tables, since the present application splits a logical sequence rather than a table. This differs from the prior art, which, for example, gives each table a separate file; splitting tables in the prior art therefore increases the number of files, which is why the access code of the business side needs to be modified. The logic sequence of the present application is not a data table, but the table can be restored from it. In other words, the first logic sequence 101 and the second logic sequence 102 establish a virtual table, and the virtual table remains an integral whole no matter how the actual data storage is split.
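
The idea of a virtual table that stays whole while its storage is split can be pictured with the small Python sketch below. The class `VirtualTable` and its methods are hypothetical and only model the concept: one fixed header (the first logic sequence) and any number of stored parts that together restore the table.

```python
class VirtualTable:
    """A logical table: one fixed header plus any number of stored parts."""

    def __init__(self, header):
        self.header = tuple(header)  # header structure: never changes when splitting
        self.parts = []              # physical parts, e.g. one per target node

    def add_part(self, rows):
        for row in rows:
            if len(row) != len(self.header):
                raise ValueError("row does not match the table header")
        self.parts.append(list(rows))

    def rows(self):
        """Restore the full table by walking every part."""
        for part in self.parts:
            yield from part

    def select(self, column, value):
        """The same query works no matter how many parts exist."""
        index = self.header.index(column)
        return [row for row in self.rows() if row[index] == value]

table = VirtualTable(["id", "name"])
table.add_part([(1, "a"), (2, "b")])  # one stored part
table.add_part([(3, "c")])            # another stored part
found = table.select("id", 3)
```

The caller of `select` never refers to an individual part, which mirrors the point that the access code does not have to change when the storage is split again.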

Next, a detailed description will be given of the data splitting method that is applied to the above-described data table to keep the logical structure of the data table unchanged.

Fig. 2 is a schematic flow chart of a data splitting method according to an embodiment of the present application. As shown in fig. 2, the specific steps of the data splitting method include:

S201, in a database cluster to be split, determining at least one data source node according to preset requirements.

In this step, the data source node is an intermediary for data replication and transfer between the database cluster to be split and the target database cluster.

In this embodiment, the database cluster to be split includes at least one master-slave relationship node, and the data source node comprises at least one slave node of the master-slave relationship node.

In a possible embodiment, the preset requirement includes: the data source node is a read-only node, i.e., a node whose type is read-only. This avoids accidental deletion or modification of data caused by uncertain factors during data splitting.

Fig. 3 is a schematic structural diagram of a database cluster to be split according to an embodiment of the present application. As shown in Fig. 3, there is only one master-slave relationship node 300 in the database cluster to be split. The master node 301 and the slave node 302 store data in the same manner, but the slave node 302 is only responsible for providing services such as complex queries and data analysis and cannot perform write operations; any operation that writes or changes data must go through the master node 301. The master node 301 synchronizes data with the slave node 302 periodically or in real time to keep the data of the two nodes consistent. In principle, either the master node 301 or the slave node 302 may be selected as the data source node for data splitting. However, to avoid erroneous operations such as accidental deletion or modification of data in the original database cluster during splitting and copying, the slave node is generally selected in practice as the data source node of the data splitting method of this embodiment.

The database cluster to be split may also include multiple nodes with complex relationships among them, as shown in Fig. 4.

Fig. 4 is a schematic structural diagram of another database cluster to be split according to an embodiment of the present application. As shown in Fig. 4, the database cluster to be split includes: an independent node 401, a first master-slave relationship node 402, a second master-slave relationship node 403, and a third master-slave relationship node 404. The first master-slave relationship node 402 includes a first master node 4021 and a first slave node 4022; the second master-slave relationship node 403 includes a second master node 4031, a second slave node 4032 and a third slave node 4033; the third master-slave relationship node 404 includes a third master node 4041 and a fourth slave node 4042. In principle, any node may be selected as a data source node for data splitting. However, to avoid erroneous operations such as accidental deletion or modification of data in the original database cluster during splitting and copying, a read-only node is generally selected in practice as the data source node of the data splitting method of this embodiment. For example, if the second slave node 4032 and the fourth slave node 4042 are read-only slave nodes, either one of them or both may be selected as the data source node.

It can be understood that once a copy verification mechanism is added and the possibility of erroneous operations is thereby greatly reduced, a master node may also be selected as the data source node. Alternatively, a node holding non-critical data may be used directly as a data source node; for example, the independent node 401 may store some non-critical process-type data.

In general, there may be more than one data source node, so that data in a database cluster to be split can be classified and split, a plurality of split intermediate media interfaces are provided, and the whole splitting process is accelerated.
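
Selecting data source nodes under the read-only requirement can be sketched as below, using the topology of Fig. 4. The `Node` record and `pick_source_nodes` are hypothetical; the read-only flags of nodes 4032 and 4042 follow the example in the text, while the flags of the other nodes are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    role: str        # "master", "slave" or "independent"
    read_only: bool

def pick_source_nodes(cluster, limit=None):
    """Select read-only slave nodes as data source nodes (the preset requirement)."""
    chosen = [n for n in cluster if n.role == "slave" and n.read_only]
    return chosen if limit is None else chosen[:limit]

cluster = [
    Node("independent_401", "independent", False),
    Node("master_4021", "master", False),
    Node("slave_4022", "slave", False),   # assumed not read-only
    Node("master_4031", "master", False),
    Node("slave_4032", "slave", True),
    Node("slave_4033", "slave", False),   # assumed not read-only
    Node("master_4041", "master", False),
    Node("slave_4042", "slave", True),
]
sources = pick_source_nodes(cluster)
single = pick_source_nodes(cluster, limit=1)
```

Passing `limit=1` corresponds to choosing one of the two read-only slave nodes, while the default returns both so that splitting can proceed through several intermediaries in parallel.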

S202, sending the data to be split in the database cluster to be split to a data source node.

In this step, the data to be split in the database cluster to be split is copied to the data source node. If the split is a full replacement, that is, the database cluster to be split is entirely replaced by the new database cluster, all data of the database cluster to be split is copied to the data source node. If the split is a partial replacement, that is, some services remain in the database cluster to be split, only the data to be split corresponding to the migrated services is copied to the data source node.

It can be understood that, limited by the capacity and processing speed of the data source node, the data to be split may also be sent to the data source node in batches, and after the data of the previous batch is split to the new database cluster by the data source node, the data of the next batch to be split is copied to the data source node.
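
The batch-by-batch flow described above can be sketched as a simple loop in which the next batch is handed to the data source node only after the previous one has been split and sent. The function `split_in_batches` and the `split_and_send` callback are hypothetical names; the callback stands in for the splitting and copying performed through the data source node.

```python
def split_in_batches(pending_rows, batch_size, split_and_send):
    """Feed the data source node one batch at a time and return the rows processed."""
    sent = 0
    for start in range(0, len(pending_rows), batch_size):
        batch = pending_rows[start:start + batch_size]  # copy one batch to the source node
        split_and_send(batch)  # returns only after this batch has reached the targets
        sent += len(batch)
    return sent

log = []
total = split_in_batches(list(range(7)), 3, lambda batch: log.append(list(batch)))
```

Because `split_and_send` is called synchronously, the data source node never holds more than one batch, which is the point of batching when its capacity is limited.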

S203, splitting the data to be split in the data source node according to a preset splitting mode to determine the split data.

In this step, the preset splitting mode maintains the logical structure of the original data table, that is, the logic of the sub-tables of the database cluster to be split is kept unchanged, and the data tables in the data to be split are kept as a whole with unchanged logical structure in the form of a virtual table.

It can be understood that the business side can still insert, delete, query, and update records in the data table by using the same database access code, i.e. SQL code.

In this embodiment, specifically, as shown in fig. 1a to 1c, a splitting manner for constructing subsequences for the first logic sequence 101 and/or the second logic sequence 102 is provided.

Fig. 5 is a schematic diagram of constructing a subsequence of a logical sequence according to an embodiment of the present application. As shown in fig. 5, a logical sequence 500 is a continuous or discontinuous sequence of values from a start value 501 to an end value 502; for example, the start value 501 is [0X00] and the end value 502 is [0XFF]. At least one subsequence split value 503, e.g., [0X80], is chosen between the start value 501 and the end value 502 according to a predetermined algorithm, e.g., a hash algorithm, and the logical sequence 500 is split into two subsequences: a first subsequence running from the start value 501 to the subsequence split value 503, and a second subsequence running from the subsequence split value 503 to the end value 502.

It is understood that a subsequence may be further extended after the splitting to cover data records newly added to the data table. A subsequence can also be split further in the same way, which makes the data splitting code general and reusable.
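The construction of subsequences in fig. 5 can be sketched as a small range-splitting helper. The sketch makes one assumption that the text leaves open: ranges are treated as half-open, so the split value belongs to the second subsequence and the end value [0XFF] is represented by the exclusive bound 0x100. The function name `split_sequence` is hypothetical.

```python
def split_sequence(start, end, split_values):
    """Split the half-open range [start, end) at each split value."""
    bounds = [start, *sorted(split_values), end]
    if any(low >= high for low, high in zip(bounds, bounds[1:])):
        raise ValueError("split values must lie strictly between start and end")
    return list(zip(bounds, bounds[1:]))


# Logical sequence 0x00..0xFF with the subsequence split value 0x80.
first_level = split_sequence(0x00, 0x100, [0x80])

# The same helper splits a subsequence again, which is the reuse the text describes.
second_level = split_sequence(*first_level[1], [0xC0])
```

Here `first_level` holds the ranges 0x00 to 0x80 and 0x80 to 0x100, and `second_level` divides the upper range at 0xC0 without any change to the helper.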

Further, in a possible design, the splitting data may be determined according to a preset splitting parameter corresponding to a preset key field in the data to be split.

For example, the second logic sequence 102 may be all data in the original data table or may be partial data, so that the start value, the end value and the subsequence split value of the second logic sequence all need to be adjusted accordingly to obtain the subsequence, i.e., split data.
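Determining split data from a key field can be sketched as below. The text only says that a hash algorithm may be used, so the choice of the first byte of a SHA-256 digest as the position in the logical sequence is an assumption, as are the names `sequence_value` and `split_rows` and the sample `orders` table. Rows are moved unchanged, so every part keeps the structure of the original data table.

```python
import bisect
import hashlib


def sequence_value(key):
    """Map a key field value to a position in 0x00..0xFF (assumed mapping)."""
    return hashlib.sha256(str(key).encode("utf-8")).digest()[0]


def split_rows(rows, key_field, split_values):
    """Group rows by the subsequence that their key field value falls into."""
    ordered = sorted(split_values)
    parts = [[] for _ in range(len(ordered) + 1)]
    for row in rows:
        # bisect_right places a value equal to a split value in the upper part.
        index = bisect.bisect_right(ordered, sequence_value(row[key_field]))
        parts[index].append(row)
    return parts


orders = [{"order_id": i, "amount": 10 * i} for i in range(1, 9)]
low_half, high_half = split_rows(orders, "order_id", [0x80])
```

Adjusting the start value, end value, or split values only changes the arguments passed to `split_rows`; the rows and the queries issued against them stay the same.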

S204, sending the split data to a target node of the target database cluster through the data source node.

In this step, the target database cluster includes at least one new database cluster, each database cluster includes at least one node, and the target node may be any one or more nodes in the new database cluster.

In addition to the data of the original data table in the data to be split, the split data obtained in step S203 also includes a logical sequence, which keeps the table splitting logic of the data table unchanged. Different split data is placed into different nodes of the new database or the new database cluster according to a preset rule.

It should be noted that, because each target node holds a smaller amount of data after splitting, requests from the service side can be processed more efficiently, which improves the performance of the whole system. Of course, the new database or the new database cluster may also offer better data processing performance, so that it handles access requests from the service side faster for the same data volume.

Further, in a possible design, after the replication of the split data to the target node of the target database cluster is completed by the data source node, the method further includes:

verifying whether the data in the target node has been properly replicated;

and if not, correcting corresponding problem data in the database cluster to be split, and re-copying the problem data through the data source node.

The embodiment provides a data splitting method, which includes that at least one data source node is determined according to preset requirements in a database cluster to be split; then sending the data to be split in the database cluster to be split to a data source node; splitting data to be split in the data source node according to a preset splitting mode to determine split data; and finally, sending the split data to at least one target node of at least one target database cluster through the data source node. The method solves the technical problems that in the prior art, data splitting involves multi-aspect changes, the reusability of split codes is low, and the query codes need to be modified on a business side at the same time when the logic of the sub-table is changed. The method achieves the technical effects of efficient data splitting without logic change of the branch table, high reusability of split codes and no need of changing the query codes by a service end.

For ease of understanding, the following embodiments explain several possible implementations corresponding to step S202 based on the embodiment shown in fig. 2.

Fig. 6 is a schematic flowchart of an implementation flow of S202 in the data splitting method shown in fig. 2 according to the embodiment of the present application.

For S202, sending the data to be split in the database cluster to be split to the data source node, there are at least three possible parallel implementations: S601, S602-S603, and S604.

The first embodiment is as follows:

S601, according to the attribute characteristics of the data to be split, sending first data to be split that meets a preset condition to the data source node in a preset sending manner, where the data to be split includes the first data to be split.

The attribute features may include the data access frequency, the data type, the corresponding service classification, and the like. Different value ranges of the data records in different fields of each data table may also be chosen as the preset conditions corresponding to the preset attribute features. A person skilled in the art can select the attribute features and the corresponding preset conditions according to the requirements of a specific application scenario in order to screen out the data to be split, which is not limited in this embodiment.

Further, the first data to be split includes second data to be split and third data to be split, where the attribute feature of the second data to be split is greater than or equal to a preset feature threshold, and the attribute feature of the third data to be split is less than the preset feature threshold.

And sending the second data to be split and the third data to be split to the data source node according to a preset sequence.

Specifically, one of the second data to be split and the third data to be split is sent to the data source node first; after it has been split and copied, the other is sent to the data source node.

For example, in a database cluster to be split, data whose access frequency is greater than or equal to a preset frequency is determined to be the second data to be split, i.e., hotspot data. Because this data is accessed frequently and therefore has a large influence on the service side, it needs to be split into a new database cluster. Data whose access frequency is less than the preset frequency is determined to be the third data to be split, i.e., cold data. After the second data to be split has been split in the data source node and copied to the target nodes of the target database clusters, the third data to be split is sent to the data source node to be split and copied to the target nodes in the corresponding target database clusters.
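The ordering of hotspot data and cold data in the first embodiment can be sketched as below. The record layout, the `access_frequency` field, and the `split_and_copy` callback standing in for steps S203 and S204 are illustrative assumptions.

```python
def partition_by_frequency(records, threshold):
    """Return (hot, cold): hot records have access_frequency >= threshold."""
    hot = [r for r in records if r["access_frequency"] >= threshold]
    cold = [r for r in records if r["access_frequency"] < threshold]
    return hot, cold


def send_hot_then_cold(records, threshold, split_and_copy):
    """Hand the hotspot data to the source node first, then the cold data."""
    hot, cold = partition_by_frequency(records, threshold)
    for group in (hot, cold):
        if group:
            # Assumed to return only after the group is split and copied.
            split_and_copy(group)


sent = []
tables = [
    {"table": "orders", "access_frequency": 950},
    {"table": "audit_log", "access_frequency": 3},
    {"table": "cart", "access_frequency": 500},
]
send_hot_then_cold(tables, threshold=500, split_and_copy=sent.append)
```

After the call, `sent` contains the hotspot group (`orders`, `cart`) followed by the cold group (`audit_log`).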

The second embodiment is as follows:

S602, arranging the data to be split according to the requirements of preset attribute characteristics to determine a data queue to be split.

S603, sending the data to be split to the data source node in batches according to the arrangement sequence of the data queue to be split.

In this step, the data of the previous batch is split in the data source node, and after the split data is sent to the target node through the data source node, the data of the next batch is sent to the data source node.

For example, data to be split is classified according to service type and then arranged according to the operation time recorded in the operation log of the data in each service type. Data in the same time period is sent to a data source node as one batch. The data source node splits the data according to splitting parameters such as the subsequence split value, start value, and end value, and then sends the split data to each target node in each target database cluster through a plurality of parallel coroutines. After one batch has been split and copied, the next batch is processed, until all the data to be split has been split and copied through the data source nodes.
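The batching of the second embodiment can be sketched as below: records are ordered by service type and operation time, and records in the same time window form one batch. The field names, the integer `op_time` in seconds, and the `window_seconds` parameter are assumptions made for illustration.

```python
from itertools import groupby


def build_batches(records, window_seconds):
    """Order records by service type and operation time, then cut batches."""
    def batch_key(record):
        return (record["service_type"], record["op_time"] // window_seconds)

    ordered = sorted(records, key=lambda r: (r["service_type"], r["op_time"]))
    # groupby only merges neighbours, which is enough after sorting.
    return [list(group) for _, group in groupby(ordered, key=batch_key)]


def run_batches(batches, split_and_copy):
    """Process one batch at a time; the next starts after the previous ends."""
    for batch in batches:
        split_and_copy(batch)


log = [
    {"service_type": "catering", "op_time": 30, "row_id": 1},
    {"service_type": "digital", "op_time": 10, "row_id": 2},
    {"service_type": "catering", "op_time": 95, "row_id": 3},
    {"service_type": "catering", "op_time": 45, "row_id": 4},
]
batches = build_batches(log, window_seconds=60)
```

With a 60-second window the catering rows 1 and 4 share a batch, row 3 forms the next batch, and the digital row 2 forms the last one.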

The third embodiment is:

when the number of the data source nodes is at least two, the sending the data to be split in the database cluster to be split to the data source nodes comprises:

S604, distributing the data to be split to different data source nodes according to the attribute characteristics of the data to be split.

In this step, the attribute features include a first attribute feature and a second attribute feature; the data to be split includes fourth data to be split and fifth data to be split; and the data source nodes include a first data source node and a second data source node.

The method comprises the following specific steps:

S6041, sending fourth data to be split to a first data source node, wherein a first attribute characteristic of the fourth data to be split meets a first characteristic requirement;

S6042, sending fifth data to be split to a second data source node, wherein a second attribute characteristic of the fifth data to be split meets a second characteristic requirement.

For example, in a database to be split, sales data with monthly sales of more than 100,000 among the data related to the digital product business is to be split into a database cluster A and is sent to a first data source node for splitting and copying; data related to the catering service business whose access frequency is greater than or equal to a preset access threshold, such as 2,000 visits per day, is to be split into a database cluster B and is sent to a second data source node for splitting and copying.
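The distribution of the third embodiment can be sketched as a list of rules, each pairing a data source node with a feature requirement. The node names, field names, and thresholds below simply restate the example and are otherwise assumptions; records matching no rule stay in the database to be split.

```python
# Each rule pairs a data source node with the feature requirement it serves.
RULES = [
    ("source_node_1",
     lambda r: r["business"] == "digital_products" and r["monthly_sales"] > 100_000),
    ("source_node_2",
     lambda r: r["business"] == "catering" and r["daily_visits"] >= 2000),
]


def distribute(records, rules=RULES):
    """Assign each record to the first data source node whose rule matches."""
    assignment = {name: [] for name, _ in rules}
    for record in records:
        for name, matches in rules:
            if matches(record):
                assignment[name].append(record)
                break
    return assignment


records = [
    {"id": 1, "business": "digital_products", "monthly_sales": 150_000},
    {"id": 2, "business": "digital_products", "monthly_sales": 50_000},
    {"id": 3, "business": "catering", "daily_visits": 2000},
]
plan = distribute(records)
```

Record 1 goes to the first data source node, record 3 to the second, and record 2 is left unassigned.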

Furthermore, the data source node may also be responsible for some services in the database to be split, such as complex queries and data analysis, in which case data interaction exists between the data source node and other nodes. Because this interaction may affect splitting and copying, and in order to ensure that the data source node performs static splitting and static copying, after the data source node receives the data to be split it is necessary to disconnect the data source node from the other nodes and stop its external services. That is, the data source node is removed from the database cluster to be split, and its working state is changed to an unavailable state.

The following is a description of another data splitting method obtained by removing the data source node based on the embodiment shown in fig. 2.

Fig. 7 is a schematic flow chart of another data splitting method provided in the present application. As shown in fig. 7, the data splitting method includes the specific steps of:

S701, in the database cluster to be split, at least one data source node is determined according to preset requirements.

In this step, the database cluster to be split includes at least one master-slave relationship node, where the master-slave relationship node includes a master node and at least one slave node, and the data source node includes at least one slave node in the master-slave relationship node.

The preset requirements include: the data source node is a read-only node, and further, the read-only node is a read-only type slave node in the master-slave relationship node.

For a detailed description of this step, refer to S201 of the embodiment corresponding to fig. 2, which is not described herein again.

S702, sending the data to be split in the database cluster to be split to a data source node.

For a detailed explanation of this step, reference may be made to the embodiment shown in fig. 6, which is not described herein again.

S703, removing the data source node from the database cluster to be split.

In this step, the connection between the data source node and the other nodes in the database cluster to be split is cut off, and the type of the data source node is changed to an unavailable type, for example by marking the data source node with a NotServing identifier, so that the external service side can no longer access the database via the data source node. The subsequent splitting and copying are therefore performed statically, which avoids errors in the split or copied data.
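The removal in S703, together with the later re-connection in S706, can be sketched as a context manager that guarantees the node is restored even if splitting fails. The `SourceNode` class, the `Serving` state name, and the `peers` set are hypothetical; only the NotServing identifier comes from the text.

```python
from contextlib import contextmanager


class SourceNode:
    def __init__(self, name, peers):
        self.name = name
        self.state = "Serving"   # assumed name for the normal state
        self.peers = set(peers)  # nodes this node exchanges data with


@contextmanager
def detached(node):
    """Cut the node off from its peers and mark it NotServing for the block."""
    saved_peers = set(node.peers)
    node.peers.clear()
    node.state = "NotServing"
    try:
        yield node
    finally:
        # Re-connect the node to the cluster, as in step S706.
        node.peers.update(saved_peers)
        node.state = "Serving"


node = SourceNode("4032", peers={"4031", "4033"})
with detached(node):
    state_during_split = node.state  # static splitting and copying go here
```

Keeping the block short limits how long the node is unavailable to the service side.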

S704, keeping the header structure of the data table unchanged, and determining split data according to the split interval range and the split value corresponding to the preset key field in the data table.

In this step, the split data has the same structure as the data table, and the data to be split includes at least one data table.

S705, the split data is sent to at least one target node of at least one target database cluster through the data source node.

For the detailed explanation of steps S704-S705, refer to steps S203-S204 of the embodiment shown in fig. 2, which are not described herein again.

It should be noted that, in one possible design, data is copied or cloned into the target nodes in parallel, and the copying mode can be set through parameters. For example, multiple coroutines are set up to read data simultaneously from the source node in the unavailable state and write it to the write queues. By default, 10 parallel coroutines read the data in the source node and write it into the corresponding target nodes.
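The parallel copy with one write queue per target node can be sketched with Python's `asyncio`. In-memory queues and lists stand in for the source node and the target nodes, and the `route` callback, the `_DONE` sentinel, and the function name `clone` are assumptions; only the default of 10 coroutines is taken from the text.

```python
import asyncio

_DONE = object()  # sentinel telling a writer that its queue is finished


async def clone(rows, route, targets, workers=10):
    """Copy rows into per-target stores through per-target write queues."""
    read_queue = asyncio.Queue()
    for row in rows:
        read_queue.put_nowait(row)
    write_queues = {t: asyncio.Queue() for t in targets}
    stores = {t: [] for t in targets}  # stand-ins for the target nodes

    async def reader():
        # Each reader takes rows from the source and enqueues them by target.
        while not read_queue.empty():
            row = read_queue.get_nowait()
            await write_queues[route(row)].put(row)

    async def writer(target):
        # One writer per target drains that target's write queue.
        while True:
            row = await write_queues[target].get()
            if row is _DONE:
                return
            stores[target].append(row)

    writers = [asyncio.create_task(writer(t)) for t in targets]
    await asyncio.gather(*(reader() for _ in range(workers)))
    for queue in write_queues.values():
        await queue.put(_DONE)
    await asyncio.gather(*writers)
    return stores


copied = asyncio.run(
    clone(range(20), lambda r: "A" if r < 10 else "B", ["A", "B"])
)
```

The `workers` parameter plays the role of the configurable copy mode; a real deployment would replace the in-memory stores with writes to the target nodes.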

It should be further noted that, in a possible case, if the target node already contains dirty data whose correctness and validity cannot be determined, the dirty data is deleted or corrected while data is cloned or copied to the target node. Conversely, data should not be inserted into the target node before cloning, because it may be treated as dirty data and deleted.

S706, re-connecting the data source node to the database cluster to be split.

In this step, since data may be written to or updated in the database cluster to be split while steps S704 and S705 are being executed, the data source node needs to be re-connected to the database cluster to be split so that the data in each target node of each target database cluster can be checked for consistency with the data in the database cluster to be split.

In addition, in a possible design, if the data to be split in the database cluster to be split is transmitted in batches, this step can also ensure that the data source node can receive the data of the next batch.

Further, since the data source node also provides services, the time it spends in the unavailable state should be kept as short as possible. To this end, the amount of data split by the data source node each time and/or the amount of split data copied to each target node may be limited, and the splitting and copying may be completed in several batches while the service is idle, which minimizes the influence of the database split on the service side.

S707, sending the updated data to be split to the data source node.

This step can be regarded as a repetition of step S702, except that only the updated data is transmitted to the data source node.

S708, splitting the data to be split again according to a preset splitting mode to determine new split data.

S709, sending the new split data to the target node through the data source node in a filtering replication mode.

In this step, the filtering replication mode is used to filter out split data that already exists in the target node.

Specifically, filtering replication starts a log consumption service on the target node. The log consumption service reads the binlog from the source node, which records the operations performed on the data by the service side when accessing the database, and decides whether to execute each binlog entry according to the entry and the splitting parameters (such as the start value, end value, and subsequence split value of the logical sequence).

It should be noted that filtering replication remains on after cloning is completed, so that, ignoring replication delay, the data of the source database cluster (i.e., the database cluster to be split) and of the target database cluster stay consistent. Any update operation on the source database cluster can then be read on the corresponding target database cluster.
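The decision made by the log consumption service can be sketched as a filter over log entries. The entries below are plain dictionaries invented for illustration and do not reflect the real binlog format; the `to_sequence_value` callback, the `id` primary key, and the half-open range are likewise assumptions.

```python
def in_range(value, low, high):
    """True if value falls in the half-open range [low, high)."""
    return low <= value < high


def apply_filtered(log, table, key_field, low, high, to_sequence_value):
    """Apply only the log entries whose key maps into this target's range."""
    for event in log:
        row = event["row"]
        if not in_range(to_sequence_value(row[key_field]), low, high):
            continue  # entry belongs to another target node
        if event["op"] == "delete":
            table.pop(row["id"], None)
        else:
            # Insert and update are both upserts, so replaying is harmless.
            table[row["id"]] = row
    return table


log = [
    {"op": "insert", "row": {"id": 1, "user_id": 0x10}},
    {"op": "insert", "row": {"id": 2, "user_id": 0x90}},
    {"op": "update", "row": {"id": 2, "user_id": 0x90, "status": "paid"}},
    {"op": "delete", "row": {"id": 1, "user_id": 0x10}},
]
upper_target = apply_filtered(log, {}, "user_id", 0x80, 0x100, lambda v: v & 0xFF)
```

Because upserts and deletes by primary key are idempotent, entries that were already copied during cloning can be applied again without creating duplicates.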

S710, checking whether the data in the target node is copied correctly.

In this step, the data in the data table of the target node may be compared row by row with the data in the data to be split to determine whether they are consistent; if so, step S711 is executed, and if not, step S712 is executed.
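A row-by-row check of this kind can be sketched as below, assuming that each row carries a primary key named `id` and that `expected_rows` holds the rows of the data to be split that belong on the target node being checked. The function name and the problem labels are hypothetical.

```python
def find_mismatches(expected_rows, target_rows, pk="id"):
    """Compare rows by primary key and list every discrepancy found."""
    target_by_pk = {row[pk]: row for row in target_rows}
    expected_pks = set()
    problems = []
    for row in expected_rows:
        expected_pks.add(row[pk])
        copied = target_by_pk.get(row[pk])
        if copied is None:
            problems.append(("missing", row[pk]))
        elif copied != row:
            problems.append(("different", row[pk]))
    # Rows present only on the target are reported as well.
    problems.extend(
        ("unexpected", key) for key in target_by_pk if key not in expected_pks
    )
    return problems


expected = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}, {"id": 3, "qty": 1}]
on_target = [{"id": 1, "qty": 5}, {"id": 2, "qty": 8}, {"id": 9, "qty": 0}]
issues = find_mismatches(expected, on_target)
```

An empty result corresponds to the branch leading to S711; any reported problem corresponds to the branch leading to S712.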

S711, switching part or all services of the database cluster to be split into the target database cluster.

In this step, when it is determined that the service of the target database cluster is normal, the database cluster to be split, or some of its nodes, may be shut down, and the corresponding services may be switched to the target database cluster, i.e., the new database cluster obtained by splitting, which completes the splitting task.

S712, correcting the corresponding problem data.

Specifically, if the data in the database cluster to be split itself has a problem, such as a data type error or a wrong corresponding service type, the problem data is corrected in the database cluster to be split, and the correct data is copied again through the data source node.

If the error arose in the splitting and copying process, the correct data is simply copied again through the data source node.

Fig. 8 is a schematic structural diagram of a data splitting apparatus according to an embodiment of the present application. The data splitting apparatus 800 may be implemented by software, hardware, or a combination of both.

As shown in fig. 8, the data splitting apparatus 800 includes:

an obtaining module 801, configured to obtain first tag information corresponding to an article to be stored;

a source node selection module 802, configured to determine, according to preset requirements, at least one data source node in a to-be-split database cluster, where the data source node is an intermediary for data replication and transfer between the to-be-split database cluster and a target database cluster;

a to-be-split data preparing module 803, configured to send to-be-split data in the to-be-split database cluster to the data source node;

a splitting module 804, configured to split the data to be split in the data source node according to a preset splitting manner, so as to determine split data, where the preset splitting manner preserves the logical structure of the original data table;

the splitting module 804 is further configured to send the split data to at least one target node of at least one target database cluster through the data source node.

In a possible design, the splitting module 804 is specifically configured to determine the split data according to a preset splitting parameter corresponding to a preset key field in the data to be split.

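Optionally, the preset splitting parameters include a split interval range and at least one split value within the split interval range, and the splitting module 804 is specifically configured to keep the header structure of the data table unchanged and determine the split data according to the split interval range and the split value corresponding to the preset key field in the data table.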

In a possible design, the splitting module 804 is further specifically configured to:

creating a write queue for each target node;

In one possible design, the splitting module 804 is further configured to remove the data source node from the database cluster to be split, so that the total amount of data in the data source node remains unchanged.

In one possible design, the splitting module 804 is further configured to:

re-connecting the data source node to the database cluster to be split;

sending the updated data to be split to the data source node;

In a possible design, the to-be-split data preparing module 803 is configured to send, according to an attribute feature of the to-be-split data, first to-be-split data that meets a preset condition to the data source node in a preset sending manner, where the to-be-split data includes the first to-be-split data.

Optionally, the to-be-split data preparing module 803 is specifically configured to:

In a possible design, the to-be-split data preparing module 803 is configured to arrange the to-be-split data according to requirements of preset attribute characteristics to determine a to-be-split data queue;

the to-be-split data preparation module 803 is further configured to send the to-be-split data to the data source node in batches according to the arrangement order of the to-be-split data queues; wherein,

In a possible design, when the number of the data source nodes is at least two, the to-be-split data preparation module 803 is configured to distribute the to-be-split data to different data source nodes according to attribute characteristics of the to-be-split data.

Optionally, the attribute features include: the to-be-split data preparation module 803 is configured to send fourth to-be-split data to the first data source node, where the first attribute feature of the fourth to-be-split data meets the first feature requirement;

the to-be-split data preparation module 803 is further configured to send fifth to-be-split data to a second data source node, where a second attribute characteristic of the fifth to-be-split data meets a second characteristic requirement;

In one possible design, the data splitting apparatus further includes:

a checking module 805, configured to check whether the data in the target node has been copied correctly;

if so, then

A switching module 806, configured to switch a part or all of services of the database cluster to be split to the target database cluster;

if not, then

The splitting module 804 is further configured to correct the corresponding problem data.

It should be noted that the apparatus provided in the embodiment shown in fig. 8 can execute the method provided in any of the above method embodiments, and the specific implementation principle, technical features, term explanation and technical effects thereof are similar and will not be described herein again.

Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 9, the electronic device 900 may include: at least one processor 901 and a memory 902. Fig. 9 shows an electronic device with one processor as an example.

The memory 902 is used for storing programs. In particular, a program may include program code, and the program code includes computer operating instructions.

The memory 902 may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.

The processor 901 is configured to execute computer-executable instructions stored in the memory 902 to implement the methods described in the above method embodiments.

The processor 901 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.

Optionally, the memory 902 may be separate from or integrated with the processor 901. When the memory 902 is a device independent of the processor 901, the electronic device 900 may further include:

a bus 903 for connecting the processor 901 and the memory 902. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. Buses may be classified as address buses, data buses, control buses, and so on; this does not mean that there is only one bus or only one type of bus.

Alternatively, in a specific implementation, if the memory 902 and the processor 901 are integrated into a chip, the memory 902 and the processor 901 may complete communication through an internal interface.

An embodiment of the present application further provides a computer-readable storage medium, which may include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. In particular, the computer-readable storage medium stores program instructions for the methods in the above method embodiments.

An embodiment of the present application further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the method in the foregoing method embodiments.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A data splitting method is characterized by comprising the following steps:

in a database cluster to be split, determining at least one data source node according to preset requirements;

splitting the data to be split in the data source node according to a preset splitting mode to determine split data, wherein the preset splitting mode preserves the logical structure of an original data table;

2. The data splitting method according to claim 1, wherein the database cluster to be split includes at least one master-slave relationship node, the master-slave relationship node includes a master node and at least one slave node, and the data source node includes at least one slave node in the master-slave relationship node.

3. The data splitting method according to claim 2, wherein the splitting the data to be split in the data source node according to a preset splitting manner to determine split data includes:

4. The data splitting method according to claim 3, wherein the preset splitting parameters include: a split interval range and at least one split value within the split interval range, and the determining of the split data according to the preset splitting parameters corresponding to the preset key field in the data to be split comprises:

keeping a header structure of a data table unchanged, and determining the splitting data according to the splitting interval range corresponding to the preset key field in the data table and the splitting value, wherein the splitting data has the same structure as the data table, and the data to be split comprises at least one data table.

5. The data splitting method according to claim 1, wherein the sending, by the data source node, the split data to a target node of a target database cluster comprises:

creating a write queue for each target node;

6. The data splitting method according to any one of claims 1 to 5, further comprising, after the sending the data to be split in the database cluster to be split to the data source node:

7. The data splitting method according to claim 6, further comprising, after said sending the split data by the data source node into at least one target node of at least one target database cluster:

re-connecting the data source node to the database cluster to be split;

sending the updated data to be split to the data source node;

8. The data splitting method according to claim 7, wherein the preset requirements include: the data source node is a read-only node.

9. The data splitting method according to claim 8, wherein the read-only node is a slave node of a read-only type in a master-slave relationship node.

10. The data splitting method according to any one of claims 1 to 5, wherein the sending the data to be split in the database cluster to be split to the data source node comprises:

11. The data splitting method according to claim 10, wherein the sending, according to the attribute characteristics of the data to be split, the first data to be split that meets a preset condition to the data source node according to a preset sending manner includes:

12. The data splitting method according to claim 11, wherein the sending of the second data to be split and the third data to be split to the data source node in a preset order includes:

13. The data splitting method according to any one of claims 1 to 5, wherein the sending the data to be split in the database cluster to be split to the data source node comprises:

14. The data splitting method according to any one of claims 1 to 5, wherein when the number of the data source nodes is at least two, the sending the data to be split in the database cluster to be split to the data source nodes comprises:

15. The data splitting method according to claim 14, wherein the attribute features comprise: the distributing the data to be split to different data source nodes according to the attribute characteristics of the data to be split includes:

16. The data splitting method according to any one of claims 1-5, further comprising, after said sending the split data by the data source node into at least one target node of at least one target database cluster:

checking whether the data in the target node is correctly copied;

if not, correcting the corresponding problem data.

17. A data splitting apparatus, comprising:

the source node selection module is used for determining at least one data source node in the database cluster to be split according to preset requirements;

18. An electronic device, comprising: a processor and a memory; wherein,

the memory for storing a computer program for the processor;

the processor is configured to perform the data splitting method of any of claims 1 to 16 via execution of the computer program.

19. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the data splitting method according to any one of claims 1 to 16.

20. A computer program product comprising a computer program, wherein the computer program when executed by a processor implements the data splitting method of any of claims 1 to 16.