CN116521672A - Data processing method, device, equipment, system and storage medium - Google Patents

Data processing method, device, equipment, system and storage medium

Info

Publication number
CN116521672A
Authority
CN
China
Prior art keywords
column
copy
subtasks
task
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310269303.2A
Other languages
Chinese (zh)
Inventor
罗小兵
刘涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310269303.2A
Publication of CN116521672A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The disclosure provides a data processing method, apparatus, device, system and storage medium, and relates to the technical field of data processing, in particular to the technical fields of databases, distributed systems, information recommendation, application program development and the like. The specific implementation scheme is as follows: a metadata node in a distributed database system receives a column copy task submitted by a target computing node for a target database table, wherein the column copy task comprises a data definition language task for implementing a column copy function; the metadata node splits the column copy task into at least two subtasks according to a preset granularity, wherein different subtasks correspond to different row ranges in the target database table; and the metadata node distributes the at least two subtasks to at least two computing nodes based on preset concurrency logic, so as to instruct the at least two computing nodes to perform corresponding column copy operations for the row ranges corresponding to the received subtasks. With this technical scheme, column copy efficiency can be effectively improved, and the time during which normal business writes are blocked is reduced.

Description

Data processing method, device, equipment, system and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to the technical fields of databases, distributed systems, information recommendation, application program development, and the like.
Background
In databases, data is typically stored in units of tables, which typically include rows, which may also be referred to as records, and columns, which may also be referred to as fields. In some application scenarios, there are cases where data of a column is copied to another column, which may be referred to as column copying.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, device, system, and storage medium.
According to an aspect of the present disclosure, there is provided a data processing method applied to metadata nodes in a distributed database system, the distributed database system further including a plurality of computing nodes, the method including:
receiving a column copy task submitted by a target computing node and aiming at a target database table, wherein the column copy task comprises a data definition language task for realizing a column copy function;
splitting the column copy task into at least two subtasks according to a preset granularity, wherein different subtasks correspond to different row ranges in the target database table;
and distributing the at least two subtasks to at least two computing nodes based on preset concurrency logic to instruct the at least two computing nodes to perform corresponding column copy operations for the row ranges corresponding to the received subtasks.
According to another aspect of the present disclosure, there is provided a data processing method applied to a computing node in a distributed database system, the distributed database system including a metadata node and a plurality of computing nodes, the method including:
receiving a subtask sent by the metadata node, wherein the subtask is one of at least two subtasks distributed by the metadata node to at least two computing nodes; the at least two subtasks are obtained by the metadata node splitting a column copy task according to a preset granularity after receiving the column copy task submitted by a target computing node for a target database table; the column copy task comprises a data definition language task for implementing a column copy function; and different subtasks correspond to different row ranges in the target database table;
and performing corresponding column copy operation on the row range corresponding to the received subtask.
According to another aspect of the present disclosure, there is provided a data processing apparatus configured as a metadata node in a distributed database system, the distributed database system further including a plurality of computing nodes therein, the apparatus comprising:
a task receiving module, used for receiving a column copy task submitted by a target computing node for a target database table, wherein the column copy task comprises a data definition language task for implementing a column copy function;
a task splitting module, used for splitting the column copy task into at least two subtasks according to a preset granularity, wherein different subtasks correspond to different row ranges in the target database table;
a subtask distribution module, used for distributing the at least two subtasks to at least two computing nodes based on preset concurrency logic so as to instruct the at least two computing nodes to perform corresponding column copy operations for the row ranges corresponding to the received subtasks.
According to another aspect of the present disclosure, there is provided a data processing apparatus configured in a computing node in a distributed database system including a metadata node and a plurality of computing nodes therein, the apparatus comprising:
the subtask receiving module is used for receiving subtasks sent by the metadata node, wherein the subtasks are contained in at least two subtasks distributed to at least two computing nodes by the metadata node, the at least two subtasks are obtained by splitting the column copy tasks according to preset granularity after receiving the column copy tasks submitted by the target computing node and aiming at the target database table, the column copy tasks comprise data definition language tasks for realizing column copy functions, and different subtasks correspond to different row ranges in the target database table;
and a column copying module, used for performing a corresponding column copy operation for the row range corresponding to the received subtask.
According to another aspect of the present disclosure, there is provided an electronic device configured as a metadata node or a computing node in a distributed database system, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing methods described in embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a distributed database system including a metadata node for executing a data processing method applied to the metadata node according to an embodiment of the present disclosure, and a plurality of computing nodes, at least two of which are for executing the data processing method applied to the computing nodes.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any embodiment of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a data processing method provided in accordance with an embodiment of the present disclosure;
FIG. 2 is a flow chart of another data processing method provided in accordance with an embodiment of the present disclosure;
FIG. 3 is a flow chart of yet another data processing method provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a flow chart of yet another data processing method provided in accordance with an embodiment of the present disclosure;
FIG. 5 is a schematic illustration of an interaction process provided in accordance with an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a data processing apparatus provided according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of another data processing apparatus provided in accordance with an embodiment of the present disclosure;
FIG. 8 is a block diagram of an electronic device for implementing a data processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
For a better understanding of the embodiments of the present disclosure, the related art is described below. In the related art, a conventional database system (such as MySQL) uses an UPDATE statement to implement column copying. The execution of the statement is a transaction in the database. A database transaction is a sequence of database operations that access and possibly modify various data items; the operations are either all executed or all not executed, and form an indivisible unit of work. The execution of the UPDATE statement is synchronous and holds a lock throughout, which blocks normal business writes. In addition, the statement is executed through interaction between a single computing node and the storage node, which is inefficient.
In the embodiments of the present disclosure, a data definition language (DDL) task for implementing a column copy function (which may be denoted as a Column DDL task) is defined, so that a column copy operation can be split and executed concurrently by at least two computing nodes in a distributed database system, thereby improving column copy efficiency.
FIG. 1 is a flow chart of a data processing method provided in accordance with an embodiment of the present disclosure, which is applicable to cases where column copying is performed in a distributed database system. The method may be performed by a data processing apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device, and the electronic device may be configured as a metadata node in a distributed database system. Referring to FIG. 1, the method specifically includes the following:
S101, receiving a column copy task submitted by a target computing node for a target database table, wherein the column copy task comprises a data definition language task for implementing a column copy function;
S102, splitting the column copy task into at least two subtasks according to a preset granularity, wherein different subtasks correspond to different row ranges in the target database table;
S103, distributing the at least two subtasks to at least two computing nodes based on preset concurrency logic to instruct the at least two computing nodes to perform corresponding column copy operations for the row ranges corresponding to the received subtasks.
The data of a database table generally includes metadata (schema) and data (Data). The metadata node may be a node responsible for processing the metadata, and may be denoted as a meta node. In the embodiments of the present disclosure, the metadata node is also used for column copy task scheduling.
Illustratively, when there is a column copy demand, the client may send a column copy command (which may be referred to as a Column DDL command) for a target database table to a target computing node in the distributed database system. The target computing node may be any one of the computing nodes in the distributed database system, and the target database table may be one or more database tables. If there is one target database table, the data of the source column is copied into the destination column in the same database table; if there are multiple target database tables, the data of the source column in the source database table may be considered to be copied to the destination column in the destination database table. After receiving the column copy command, the target computing node generates a corresponding column copy task and submits the column copy task for the target database table to the metadata node.
Illustratively, after receiving the column copy task, the metadata node splits the column copy task into at least two subtasks (which may be denoted as Column DDL works) according to a preset granularity. The preset granularity may be set according to the actual situation. For example, it may be a data sharding unit (region) of the distributed database system, or it may be determined from the product of the total number of rows of the target database table and a preset proportional value. Different subtasks correspond to different row ranges in the target database table; that is, through task splitting, different subtasks perform column copying for different row ranges. For example, if the target database table in a column copy task contains 10,000 rows, every 2,000 rows may be assigned to one subtask.
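The row-range splitting described above can be sketched as follows. This is a minimal illustration only: the `SubTask` structure and the `split_column_copy_task` function are hypothetical names, and 1-based inclusive row numbering is an assumption, not something the disclosure specifies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubTask:
    """One slice of a column copy task (hypothetical structure)."""
    task_id: int
    start_row: int  # inclusive, 1-based (assumed convention)
    end_row: int    # inclusive


def split_column_copy_task(total_rows: int, rows_per_subtask: int) -> List[SubTask]:
    """Split rows 1..total_rows into consecutive, non-overlapping row ranges."""
    if total_rows <= 0 or rows_per_subtask <= 0:
        raise ValueError("total_rows and rows_per_subtask must be positive")
    subtasks = []
    for task_id, start in enumerate(range(1, total_rows + 1, rows_per_subtask)):
        end = min(start + rows_per_subtask - 1, total_rows)
        subtasks.append(SubTask(task_id, start, end))
    return subtasks


# The example from the text: 10,000 rows, 2,000 rows per subtask.
subtasks = split_column_copy_task(total_rows=10_000, rows_per_subtask=2_000)
for st in subtasks:
    print(st)
```

With these inputs the sketch yields five subtasks covering rows 1 to 2000, 2001 to 4000, and so on. A granularity derived from a proportional value would simply compute `rows_per_subtask` from the total row count before calling the function.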
Optionally, after splitting the column copy task into at least two subtasks according to a preset granularity, the method further includes: persistently storing the at least two subtasks. The advantage of this arrangement is that the persisted subtasks can be recovered when the metadata node fails, which guarantees normal execution of the column copy task and improves the success rate of task execution. Because the task splitting operation does not need to be executed again, task execution efficiency in the event of a metadata node failure is also improved.
For example, the preset concurrency logic may be to distribute the at least two subtasks evenly to all computing nodes in the distributed database system (or to the computing nodes currently in a normal operating state). Alternatively, one subtask is first allocated to each computing node; if unallocated subtasks remain, the next subtask is allocated to the computing node that first completes its allocated subtask, and so on, so that the remaining subtasks are allocated in turn in the order in which the computing nodes complete their subtasks.
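As a rough illustration of persisting and recovering subtasks, the sketch below writes subtask records to a JSON file and reads them back. The file name, the record fields, and the use of a local file are all assumptions made for the example; the disclosure does not say where or in what format the metadata node stores subtasks.

```python
import json
import os
import tempfile


def persist_subtasks(path, subtasks):
    """Write subtask records via a temp file and rename, so that a crash
    never leaves a partially written state file behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(subtasks, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def recover_subtasks(path):
    """Reload subtask records after a (simulated) metadata node restart."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


state_dir = tempfile.mkdtemp()
state_file = os.path.join(state_dir, "column_ddl_subtasks.json")  # hypothetical name

original = [
    {"task_id": 0, "start_row": 1, "end_row": 2000, "status": "pending"},
    {"task_id": 1, "start_row": 2001, "end_row": 4000, "status": "pending"},
]
persist_subtasks(state_file, original)
recovered = recover_subtasks(state_file)
print(recovered == original)
```

Because the recovered records already contain the row ranges, the split does not have to be recomputed after a restart, which is the benefit the paragraph above describes.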
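The two distribution strategies above can be sketched as follows. Both functions are hypothetical; the second one simulates completion order using an assumed per-node processing time, since the disclosure does not describe how completion is detected.

```python
import heapq
from collections import defaultdict


def distribute_evenly(subtask_ids, nodes):
    """Strategy 1: spread subtasks evenly over the nodes (round-robin)."""
    assignment = defaultdict(list)
    for i, subtask_id in enumerate(subtask_ids):
        assignment[nodes[i % len(nodes)]].append(subtask_id)
    return dict(assignment)


def distribute_on_completion(subtask_ids, seconds_per_subtask):
    """Strategy 2: every node gets one subtask first; each remaining subtask
    goes to whichever node finishes its current subtask earliest.

    seconds_per_subtask maps node name -> simulated time per subtask.
    """
    assignment = {node: [] for node in seconds_per_subtask}
    # Min-heap of (time at which the node becomes free, node name).
    free_at = [(0.0, node) for node in sorted(seconds_per_subtask)]
    heapq.heapify(free_at)
    for subtask_id in subtask_ids:
        now, node = heapq.heappop(free_at)
        assignment[node].append(subtask_id)
        heapq.heappush(free_at, (now + seconds_per_subtask[node], node))
    return assignment


even = distribute_evenly([0, 1, 2, 3, 4], ["cn1", "cn2"])
adaptive = distribute_on_completion([0, 1, 2, 3, 4], {"fast": 1.0, "slow": 2.5})
print(even)
print(adaptive)
```

In the simulated run the faster node ends up with three subtasks and the slower node with two, which shows how the second strategy adapts to uneven node speed while the first does not.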
For example, after receiving the subtasks allocated by the metadata node, the computing node may perform a corresponding column copy operation with respect to a row range corresponding to the subtasks. As an example, assuming that the row range corresponding to the subtask received by the current computing node is 1 to 2000 rows, the data in the source column in the range is copied to the target column, so as to implement column copy in the range, thereby completing execution of the subtask.
Optionally, the data in the database system may be stored in a storage node (which may be denoted as a store). When performing a column copy operation, the computing node may send a copy request to the storage node. The storage node locks the data in the row range included in the copy request, copies the column data, and unlocks the data in the row range after the column data copy is completed, so that normal business data writes can resume.
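The lock, copy, unlock sequence on the storage node can be sketched as below. The `StorageNode` class is a hypothetical in-memory stand-in, and it uses a single lock for all rows as a simplification; the text describes locking only the row range named in the copy request.

```python
import threading
from typing import Dict, List


class StorageNode:
    """Toy in-memory storage node; row numbers are 1-based (assumed)."""

    def __init__(self, rows: List[Dict[str, object]]):
        self.rows = rows
        # Simplification: one lock guards all rows. A real store would lock
        # only the requested row range so other ranges stay writable.
        self._lock = threading.Lock()

    def handle_copy_request(self, start_row: int, end_row: int,
                            source: str, dest: str) -> int:
        """Lock, copy source -> dest for rows start_row..end_row, then unlock."""
        with self._lock:  # the lock is released when this block exits
            for row in self.rows[start_row - 1:end_row]:
                row[dest] = row[source]
        return end_row - start_row + 1


store = StorageNode(
    [{"column1": f"v{i}", "column1_release": None} for i in range(1, 5)]
)
copied = store.handle_copy_request(2, 3, source="column1", dest="column1_release")
print(copied, [r["column1_release"] for r in store.rows])
```

Only rows 2 and 3 are changed by the request; rows outside the range keep their previous destination values.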
According to the data processing scheme provided by the embodiments of the present disclosure, a metadata node in a distributed database system receives a column copy task submitted by a target computing node for a target database table, the column copy task comprising a data definition language task for implementing a column copy function. The metadata node splits the column copy task into at least two subtasks according to a preset granularity, wherein different subtasks correspond to different row ranges in the target database table, and distributes the at least two subtasks to at least two computing nodes based on preset concurrency logic so as to instruct the at least two computing nodes to perform corresponding column copy operations for the row ranges corresponding to the received subtasks. With this technical scheme, column copying that was previously completed by executing a transaction is changed into a DDL task that implements the column copy function, and the metadata node splits and distributes the column copy task, so that the split subtasks can be executed concurrently across the computing nodes. This can effectively improve column copy efficiency, reduce the overhead of task execution, improve performance, effectively reduce the time consumed by column copying of large amounts of data, and shorten the period during which normal business writes are blocked.
In an alternative embodiment, the distributing the at least two subtasks to at least two computing nodes includes: receiving heartbeat requests sent by at least two computing nodes; and replying with heartbeat responses to the at least two computing nodes, wherein each heartbeat response includes at least one of the at least two subtasks. The advantage of this arrangement is that receiving a heartbeat request confirms whether a computing node is in a normal state: if a heartbeat request is received, the computing node that sent it can be considered to be currently normal and subtasks can be allocated to it, which guarantees the success rate of subtask execution.
In an alternative embodiment, the column copy task includes a preset column copy statement, where the preset column copy statement includes a source column, a destination column, and a copy condition, and the column copy operation is used to copy data in the source column that meets the copy condition to the destination column. The advantage of this arrangement is that conditional column copying can be implemented. Because the copy condition needs to be evaluated when column copying is performed, adopting the scheme provided by the embodiments of the present disclosure can greatly improve column copy efficiency.
For example, the copy condition may be set according to the actual situation. In some application scenarios, such as information recommendation scenarios, the information may include, for example, online delivered information (e.g., advertisements), videos, information, questions, or commodities. A preferred policy for information materials needs to be recommended, and multiple versions need to be set for a field of the same preferred policy (e.g., video definition): an experimental version may be used for low-traffic policy experiments, and a release version may be used for full-traffic policy application. Storing multiple versions of policy data enables flexible backup and safe rollback. After large-scale policy training is completed, data needs to be copied from the experimental version column to the release version column according to a condition; the corresponding operation in the database is the column copy function. In this copying process, the number of information materials is large, the fields are numerous, and the copy conditions are complex, so conventional column copy schemes can hardly meet actual requirements.
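Heartbeat-driven distribution can be sketched as follows. The `MetadataNode` class, its `on_heartbeat` method, and the response layout are hypothetical. The sketch also returns an empty subtask list once no subtasks remain, which is a choice made for the example rather than behaviour stated in the text.

```python
from collections import deque


class MetadataNode:
    """Hands out pending subtasks in replies to heartbeat requests."""

    def __init__(self, subtask_ids, max_per_heartbeat=1):
        self.pending = deque(subtask_ids)
        self.assigned = {}  # subtask id -> node that received it
        self.max_per_heartbeat = max_per_heartbeat

    def on_heartbeat(self, node_id):
        """A heartbeat shows the node is alive, so it may receive work."""
        batch = []
        while self.pending and len(batch) < self.max_per_heartbeat:
            subtask_id = self.pending.popleft()
            self.assigned[subtask_id] = node_id
            batch.append(subtask_id)
        return {"node_id": node_id, "subtasks": batch}


meta = MetadataNode([0, 1, 2])
responses = [meta.on_heartbeat(node) for node in ["cn1", "cn2", "cn1", "cn2"]]
for response in responses:
    print(response)
```

A node that stops sending heartbeats simply never receives further subtasks in this sketch, which mirrors the idea that only nodes known to be healthy are given work.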
Illustratively, a conventional database system such as MySQL implements conditional column copying via an UPDATE statement with a WHERE clause. The relevant Structured Query Language (SQL) statement may be, for example:
UPDATE TableName SET column1_release=column1, column2_release=column2 WHERE column1!=column1_release;
This statement copies the column1 and column2 fields of TableName to column1_release and column2_release, respectively, for rows that satisfy the condition column1!=column1_release.
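To make the effect of the conventional UPDATE statement concrete, the sketch below runs it against a small in-memory table. SQLite (via Python's standard `sqlite3` module) is used here only as a convenient stand-in for MySQL, and the table contents and column types are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TableName ("
    "id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT, "
    "column1_release TEXT, column2_release TEXT)"
)
conn.executemany(
    "INSERT INTO TableName VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a-new", "b-new", "a-old", "b-old"),  # column1 differs: copied
        (2, "same", "x-new", "same", "x-old"),    # column1 equal: skipped
    ],
)

# The conventional, single-transaction form of a conditional column copy.
conn.execute(
    "UPDATE TableName SET column1_release=column1, column2_release=column2 "
    "WHERE column1!=column1_release"
)
conn.commit()

rows = conn.execute(
    "SELECT id, column1_release, column2_release FROM TableName ORDER BY id"
).fetchall()
print(rows)
```

Row 2 is left untouched because its `column1` already equals `column1_release`, even though its `column2` values differ; the copy condition applies to the whole row, not per column.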
In the embodiment of the disclosure, the SQL grammar corresponding to the preset column copy statement may be expressed as follows:
ALTER TABLE TableName MODIFY COLUMN
SET assignment_list
[WHERE where_condition]
value:
{expr|DEFAULT}
assignment:
col_name=value
assignment_list:
assignment[,assignment]...
where assignment_list represents the list of items requiring column copying, and assignment represents one item requiring column copying, expressed in the form: column name = value of the column data, where the value of the column data is an expression.
Illustratively, to implement the function exemplified above, the preset column copy statement in the embodiments of the present disclosure may be expressed as:
ALTER TABLE TableName MODIFY COLUMN SET column1_release=column1, column2_release=column2 WHERE column1!=column1_release;
where TableName is the target database table; if there are multiple target database tables, a source table and a destination table can be added to the statement; column1!=column1_release represents the copy condition; column1 and column2 represent source columns; and column1_release and column2_release represent destination columns.
The split subtasks also comprise the preset column copy statement and a row range corresponding to the subtasks.
In an alternative embodiment, the column copy task includes a preset column copy statement, and the preset type log generation rule in the distributed database system includes that a preset type log is not generated for the preset column copy statement. The advantage of this arrangement is that the generation of redundant log data can be reduced, reducing the occupation of storage resources in the database system.
The preset type log may be, for example, binlog. binlog is a binary log that records database table structure changes and table data modifications.
In the related art, column copying is implemented by an UPDATE statement, and the UPDATE operation is recorded in the binlog, i.e., binlog data is generated.
In the embodiments of the present disclosure, the preset column copy statement is, for example, the ALTER TABLE MODIFY COLUMN statement described above, which is a custom statement newly added in the embodiments of the present disclosure. By setting the preset type log generation rule, the database system does not generate binlog when it determines that the current statement is the preset column copy statement, so that the generation of redundant log data can be reduced.
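A log generation rule of this kind might look like the sketch below. The function name and the regular expression are assumptions made for illustration: the disclosure does not describe how the database system recognises the preset column copy statement, and a real system would decide this in its SQL parser rather than by pattern matching on text.

```python
import re

# Matches the custom statement form "ALTER TABLE <name> MODIFY COLUMN SET ...".
_PRESET_COLUMN_COPY = re.compile(
    r"^\s*ALTER\s+TABLE\s+\S+\s+MODIFY\s+COLUMN\s+SET\b",
    re.IGNORECASE,
)


def should_generate_binlog(statement: str) -> bool:
    """Return False for the preset column copy statement, True otherwise."""
    return _PRESET_COLUMN_COPY.match(statement) is None


column_copy = (
    "ALTER TABLE TableName MODIFY COLUMN SET column1_release=column1 "
    "WHERE column1!=column1_release"
)
plain_update = "UPDATE TableName SET column1_release=column1"

print(should_generate_binlog(column_copy))
print(should_generate_binlog(plain_update))
```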
Fig. 2 is a flowchart of another data processing method according to an embodiment of the present disclosure, where the steps of receiving subtask execution information returned by at least two computing nodes are added by optimizing the foregoing embodiments, so that a metadata node can timely learn about the subtask execution situation, and further determine the overall execution situation of a column copy task.
Optionally, after the distributing the at least two subtasks to the at least two computing nodes, the method further comprises: and executing corresponding control operation in response to receiving the task control instruction sent by the target computing node, and returning an execution result to the target computing node. This has the advantage that the user can easily view or control the column copy task.
As shown in fig. 2, the method may include:
S201, receiving a column copy task submitted by a target computing node for a target database table.
The column copy task includes a data definition language task for implementing a column copy function. The column copy task includes a preset column copy statement, the preset column copy statement includes a source column, a destination column and a copy condition, and the preset type log generation rule in the distributed database system includes that a preset type log is not generated for the preset column copy statement.
S202, splitting the column copy task into at least two subtasks according to a preset granularity.
For example, the splitting is performed according to data sharding units, and different subtasks correspond to different row ranges in the target database table.
S203, distributing at least two subtasks to at least two computing nodes based on preset concurrency logic to instruct the at least two computing nodes to perform corresponding column copy operations for a row range corresponding to the received subtasks.
Optionally, the computing node may further split the subtask. For example, the computing node divides the row range corresponding to the received subtask to obtain at least two copy areas; then, for each of the at least two copy areas in sequence, it generates a copy request corresponding to the current copy area and sends the copy request to the storage node, to instruct the storage node to lock the current copy area in the copy request, copy the column data, and unlock the current copy area after the column data is copied. The advantage of this arrangement is that small-batch column copying can be implemented within a subtask, which shortens the data locking time and further reduces the blocking of business data writes.
S204, receiving sub-task execution information returned by at least two computing nodes.
The subtask execution information is used for determining the progress of the column copy task.
For example, the subtask execution information may include subtask execution completion information or subtask execution failure information, etc. The metadata node may determine the progress of the column copy task based on the number of received subtask execution completion information and the total number of split subtasks. When receiving the execution completion information of the sub-tasks corresponding to all the sub-tasks in the split at least two sub-tasks, the metadata node can determine that the column copy task is executed. If the metadata node receives the execution failure information of the subtask corresponding to a certain subtask, the subtask can be restarted or redistributed.
S205, corresponding control operation is executed in response to receiving the task control instruction sent by the target computing node, and an execution result is returned to the target computing node.
Wherein the task control instruction includes at least one of: viewing subtask allocation, viewing the progress of the column copy task, pausing the subtask, and restarting the subtask.
By way of example, subtask allocation conditions may include a compute node allocated for each subtask, a row range corresponding to each subtask, allocated subtasks, unallocated subtasks, and the like. The progress of the column copy task may include the number of completed sub-tasks, the ratio of the number of completed sub-tasks to the total number of sub-tasks, and the like.
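The progress bookkeeping described for S204 and the "view progress" control instruction of S205 can be sketched as follows. The `ColumnCopyTaskTracker` class and its status values are hypothetical, and returning a failed subtask to the pending state is only one of the options the text mentions (restart or redistribution).

```python
class ColumnCopyTaskTracker:
    """Tracks subtask execution information on the metadata node."""

    def __init__(self, subtask_ids):
        self.status = {subtask_id: "pending" for subtask_id in subtask_ids}

    def on_execution_info(self, subtask_id, succeeded):
        """Record completion, or mark a failed subtask for redistribution."""
        self.status[subtask_id] = "completed" if succeeded else "pending"

    def progress(self):
        """Answer a 'view progress' control instruction."""
        completed = sum(1 for s in self.status.values() if s == "completed")
        total = len(self.status)
        return {"completed": completed, "total": total, "ratio": completed / total}

    def is_finished(self):
        """The column copy task is done once every subtask has completed."""
        return all(s == "completed" for s in self.status.values())


tracker = ColumnCopyTaskTracker(range(5))
tracker.on_execution_info(0, succeeded=True)
tracker.on_execution_info(1, succeeded=True)
tracker.on_execution_info(2, succeeded=False)  # will be handed out again

snapshot = tracker.progress()
print(snapshot, tracker.is_finished())
```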
According to the data processing method provided by the embodiments of the present disclosure, column copying that was previously completed by executing a transaction is changed into a DDL task that implements the column copy function, and the metadata node splits and distributes the column copy task, so that the split subtasks can be executed concurrently across the computing nodes. This can effectively improve column copy efficiency, reduce the overhead of task execution, and improve performance. When a computing node executes a subtask, the data can be copied in small batches, which reduces the blocking of normal business writes and effectively reduces the time consumed by column copying of large amounts of data. In addition, no useless binlog data is generated. Experiments show that a column copy of trillion-level data takes less than 1 hour.
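The small-batch copying mentioned above, in which a computing node divides the row range of its subtask into copy areas and sends one copy request per area, can be sketched as below. The function names and the `send_copy_request` callback are hypothetical; the callback stands in for whatever request the computing node actually sends to the storage node.

```python
def split_into_copy_areas(start_row, end_row, area_size):
    """Divide an inclusive row range into consecutive copy areas."""
    areas = []
    start = start_row
    while start <= end_row:
        end = min(start + area_size - 1, end_row)
        areas.append((start, end))
        start = end + 1
    return areas


def run_subtask(start_row, end_row, area_size, send_copy_request):
    """Send one copy request per copy area, in sequence, so that only a
    small batch of rows is locked on the storage node at any time."""
    copied = 0
    for area_start, area_end in split_into_copy_areas(start_row, end_row, area_size):
        copied += send_copy_request(area_start, area_end)
    return copied


requests_seen = []


def fake_send_copy_request(area_start, area_end):
    """Stand-in for the storage node: record the request, report rows copied."""
    requests_seen.append((area_start, area_end))
    return area_end - area_start + 1


total_copied = run_subtask(1, 2000, area_size=500,
                           send_copy_request=fake_send_copy_request)
print(total_copied, requests_seen)
```

The area size trades lock duration against request count: smaller areas block business writes for less time per request but require more round trips to the storage node.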
FIG. 3 is a flow chart of yet another data processing method provided in accordance with an embodiment of the present disclosure that is applicable to cases where column copies are made in a distributed database system. The method may be performed by a data processing apparatus, which may be implemented in hardware and/or software, may be configured in an electronic device, which may be configured as a computing node in a distributed database system. Referring to fig. 3, the method specifically includes the following:
S301, receiving a subtask sent by a metadata node, where the subtask is one of at least two subtasks distributed by the metadata node to at least two computing nodes. The at least two subtasks are obtained by the metadata node splitting a column copy task according to a preset granularity after receiving the column copy task, which is submitted by a target computing node for a target database table. The column copy task includes a data definition language task for implementing a column copy function, and different subtasks correspond to different row ranges in the target database table.
S302, performing a corresponding column copy operation on the row range corresponding to the received subtask.
According to the data processing scheme provided by the embodiment of the present disclosure, column copying that was previously completed by executing a transaction is instead performed as a DDL task that implements the column copy function, and the metadata node splits and distributes the column copy task so that the resulting subtasks can be executed concurrently across the computing nodes. This can effectively improve column copy efficiency, reduce the overhead of task execution and improve performance, effectively reduce the time consumed by column copying over large amounts of data, and shorten the period during which normal service writes are blocked.
In an alternative embodiment, the method further comprises: sending subtask execution information to the metadata node, where the subtask execution information is used by the metadata node to determine the progress of the column copy task. The advantage is that the metadata node can learn the execution status of each subtask in time and thereby determine the overall execution status of the column copy task.
FIG. 4 is a flowchart of yet another data processing method provided in accordance with an embodiment of the present disclosure, optimized based on the various alternative embodiments described above, with the addition of steps associated with further splitting of sub-tasks by computing nodes.
As shown in fig. 4, the method may include:
S401, receiving a subtask sent by the metadata node.
The subtask is one of at least two subtasks distributed by the metadata node to at least two computing nodes. The at least two subtasks are obtained by the metadata node splitting a column copy task according to a preset granularity after receiving the column copy task, which is submitted by a target computing node for a target database table. The column copy task includes a data definition language task for implementing a column copy function, and different subtasks correspond to different row ranges in the target database table.
S402, dividing the row range corresponding to the received subtasks to obtain at least two copy areas.
For example, assuming that the currently received subtask corresponds to the row range 1 to 2000, the row range may be divided according to a preset division rule, for example by dividing it evenly into a preset number of copy areas, or by dividing it according to a preset number of rows, that is, each copy area contains at most the preset number of rows. For example, rows 1 to 2000 can be split into 10 copy areas of 200 rows each.
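The two division rules mentioned above (a preset number of copy areas, or a preset maximum number of rows per area) can be sketched as follows. This is an illustrative sketch under the assumption that row ranges are contiguous, inclusive integer intervals; the function names are hypothetical.

```python
from typing import List, Tuple

RowRange = Tuple[int, int]  # inclusive (first_row, last_row)


def split_by_count(start: int, end: int, num_areas: int) -> List[RowRange]:
    """Divide [start, end] as evenly as possible into num_areas copy areas."""
    total = end - start + 1
    if total <= 0 or num_areas <= 0:
        raise ValueError("row range and number of areas must be positive")
    num_areas = min(num_areas, total)  # never create empty areas
    base, extra = divmod(total, num_areas)
    areas = []
    lo = start
    for i in range(num_areas):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first areas
        areas.append((lo, lo + size - 1))
        lo += size
    return areas


def split_by_max_rows(start: int, end: int, max_rows: int) -> List[RowRange]:
    """Divide [start, end] into copy areas holding at most max_rows rows each."""
    if end < start or max_rows <= 0:
        raise ValueError("row range and max_rows must be positive")
    return [(lo, min(lo + max_rows - 1, end)) for lo in range(start, end + 1, max_rows)]
```

For the row range 1 to 2000, both `split_by_count(1, 2000, 10)` and `split_by_max_rows(1, 2000, 200)` yield ten areas of 200 rows, starting with rows 1 to 200 and ending with rows 1801 to 2000.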
S403, generating a copy request corresponding to the current copy area for each of the at least two copy areas in turn, and sending the copy request to the storage node to instruct the storage node to lock and copy the data for the current copy area in the copy request, and unlocking the current copy area after the data is copied.
As described above, assuming that the first copy area is 1 to 200 rows, a corresponding copy request is generated for the copy area, and after receiving the copy request, the storage node locks 1 to 200 rows of data and performs corresponding column data copy, while 201 to 2000 rows will not be locked, so that writing of service data can be performed normally. After the column data corresponding to the first copy area is copied, unlocking 1 to 200 rows, locking and column data copying are carried out on the next copy area, such as 201 to 400 rows, and in the copying process, 1 to 200 rows and 401 to 2000 rows cannot be locked, so that writing of service data can be normally carried out. And so on until the corresponding column data copy is completed for all copy areas.
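The lock, copy, unlock sequence described above can be illustrated with a toy in-memory storage node. This is a single-process sketch with hypothetical names (`StorageNode`, `copy_area`, `run_subtask`); a real storage node would use its own locking mechanism, which the disclosure does not detail here.

```python
from typing import Dict, List, Optional, Tuple

RowRange = Tuple[int, int]  # inclusive (first_row, last_row)


class StorageNode:
    """Toy storage node: rows are dicts keyed by row id (illustration only)."""

    def __init__(self, rows: Dict[int, dict]):
        self.rows = rows
        self.locked_area: Optional[RowRange] = None
        self.log: List[Tuple[str, RowRange]] = []

    def is_locked(self, row_id: int) -> bool:
        area = self.locked_area
        return area is not None and area[0] <= row_id <= area[1]

    def write(self, row_id: int, column: str, value) -> bool:
        """Service write: rejected only while the row's copy area is locked."""
        if self.is_locked(row_id):
            return False
        self.rows[row_id][column] = value
        return True

    def copy_area(self, area: RowRange, src: str, dst: str) -> None:
        """Lock one copy area, copy src into dst for its rows, then unlock."""
        self.locked_area = area
        self.log.append(("lock", area))
        try:
            for row_id in range(area[0], area[1] + 1):
                self.rows[row_id][dst] = self.rows[row_id][src]
        finally:
            self.locked_area = None
            self.log.append(("unlock", area))


def run_subtask(node: StorageNode, areas: List[RowRange], src: str, dst: str) -> None:
    """Process copy areas one at a time so only one small area is ever locked."""
    for area in areas:
        node.copy_area(area, src, dst)
```

Because only the current area is locked at any moment, writes to rows outside that area are accepted throughout the subtask.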
Optionally, the column copy task includes a preset column copy statement; the split subtasks also comprise the preset column copy sentences and row ranges corresponding to the subtasks; the copy request sent to the storage node includes the preset column copy statement and further includes an area range corresponding to the current copy area.
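One way to picture how the preset column copy statement travels from the task to the subtask and finally to the copy request is the set of records below. The class and field names are hypothetical, and the copy condition is kept as opaque text because the statement syntax is not specified in this passage.

```python
from dataclasses import dataclass
from typing import Tuple

RowRange = Tuple[int, int]  # inclusive (first_row, last_row)


@dataclass(frozen=True)
class ColumnCopyStatement:
    """Hypothetical structured form of the preset column copy statement."""
    table: str
    source_column: str
    destination_column: str
    copy_condition: str  # opaque text; the concrete syntax is an assumption


@dataclass(frozen=True)
class Subtask:
    statement: ColumnCopyStatement
    row_range: RowRange    # row range assigned by the metadata node


@dataclass(frozen=True)
class CopyRequest:
    statement: ColumnCopyStatement
    area_range: RowRange   # current copy area inside the subtask's row range


def make_copy_request(subtask: Subtask, area_range: RowRange) -> CopyRequest:
    """Build the request for one copy area, carrying the same statement along."""
    lo, hi = area_range
    if not (subtask.row_range[0] <= lo <= hi <= subtask.row_range[1]):
        raise ValueError("copy area must lie inside the subtask's row range")
    return CopyRequest(subtask.statement, area_range)
```

The statement is shared unchanged at every level; only the range narrows from the whole table to the subtask's row range to a single copy area.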
According to the data processing method provided by the embodiment of the disclosure, after the computing node receives the subtasks distributed by the metadata node, the row range of the subtasks is further divided, so that small batches of column copies can be realized, the data locking time is shortened, and the blocking to business data writing is further reduced.
FIG. 5 is a schematic diagram of an interaction process provided in accordance with an embodiment of the present disclosure. As shown in FIG. 5, upon receiving a column copy command, a computing node submits a column copy task to the metadata node. The metadata node generates subtasks at region granularity and persistently stores them so that they can be recovered if the metadata node fails. Following preset concurrency logic, the metadata node selects computing node instances and issues subtasks to them via heartbeats. After receiving a subtask, the computing node divides the data into batches according to the region range corresponding to the subtask, then generates copy requests one by one and sends them to the storage node. For each copy request, the storage node scans the data in order according to the request content, locks the data that meets the copy condition, copies the column data, and unlocks it, repeating this until the request has been processed. After the computing node confirms that all data within the region of the subtask has been processed, it returns to the metadata node information indicating that the subtask was executed successfully. After receiving execution-completion information for all subtasks, the metadata node determines that the entire column copy task is complete. The metadata node may also perform a corresponding control operation, such as checking subtask status, when it receives a control instruction.
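The heartbeat-driven dispatch in this interaction can be sketched as below. The sketch assumes, purely for illustration, that the preset concurrency logic is a cap on the number of in-flight subtasks per computing node; the class and method names are hypothetical, and persistent storage of subtasks is omitted.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set, Tuple

RowRange = Tuple[int, int]  # inclusive (first_row, last_row)


class MetadataNode:
    """Toy metadata node that hands out subtasks in response to heartbeats."""

    def __init__(self, row_ranges: List[RowRange], max_inflight_per_node: int = 1):
        self.pending: Deque[Tuple[int, RowRange]] = deque(enumerate(row_ranges))
        self.inflight: Dict[int, str] = {}  # subtask id -> computing node id
        self.done: Set[int] = set()
        self.total = len(row_ranges)
        self.max_inflight_per_node = max_inflight_per_node  # assumed concurrency rule

    def on_heartbeat(self, node_id: str) -> Optional[Tuple[int, RowRange]]:
        """Issue the next subtask to node_id, or None if none is available."""
        running = sum(1 for n in self.inflight.values() if n == node_id)
        if not self.pending or running >= self.max_inflight_per_node:
            return None
        subtask_id, row_range = self.pending.popleft()
        self.inflight[subtask_id] = node_id
        return subtask_id, row_range

    def on_subtask_success(self, subtask_id: int) -> None:
        """Record the success report returned by a computing node."""
        if self.inflight.pop(subtask_id, None) is not None:
            self.done.add(subtask_id)

    def task_complete(self) -> bool:
        """The column copy task is complete once every subtask has succeeded."""
        return len(self.done) == self.total
```

A computing node that is already at its cap receives nothing on its next heartbeat, and picks up further work only after reporting success for a running subtask.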
Fig. 6 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure, which is applicable to a case of performing column copying in a distributed database system. The apparatus may be used to perform the data processing methods in embodiments of the present disclosure, may be implemented in hardware and/or software, and may be configured in an electronic device that may be configured as a metadata node in a distributed database system. Referring to fig. 6, the data processing apparatus 600 specifically includes:
the task receiving module 601 is configured to receive a column copy task submitted by a target computing node for a target database table, where the column copy task includes a data definition language task for implementing a column copy function;
the task splitting module 602 is configured to split the column copy task into at least two subtasks according to a preset granularity, where different subtasks correspond to different row ranges in the target database table;
the subtask distribution module 603 is configured to distribute the at least two subtasks to at least two computing nodes based on preset concurrency logic, so as to instruct the at least two computing nodes to perform corresponding column copy operations for a row range corresponding to the received subtasks.
According to the data processing scheme provided by the embodiment of the present disclosure, column copying that was previously completed by executing a transaction is instead performed as a DDL task that implements the column copy function, and the metadata node splits and distributes the column copy task so that the resulting subtasks can be executed concurrently across the computing nodes. This can effectively improve column copy efficiency, reduce the overhead of task execution and improve performance, effectively reduce the time consumed by column copying over large amounts of data, and shorten the period during which normal service writes are blocked.
In an alternative embodiment, the apparatus further comprises: a storage module, configured to persistently store the at least two subtasks after the column copy task is split into the at least two subtasks according to the preset granularity.
In an alternative implementation, the column copy task includes a source column, a destination column, and a copy condition, and the column copy operation is used to copy the data in the source column that satisfies the copy condition into the destination column.
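The semantics of the column copy operation (copy the source column into the destination column only for rows that satisfy the copy condition) can be sketched in a few lines. Representing rows as dictionaries and the copy condition as a Python predicate is an assumption made for illustration; the column names in the usage note are placeholders.

```python
from typing import Callable, Iterable


def copy_column(rows: Iterable[dict], source: str, destination: str,
                condition: Callable[[dict], bool]) -> int:
    """Copy rows[source] into rows[destination] where condition(row) holds.

    Returns the number of rows that were copied.
    """
    copied = 0
    for row in rows:
        if condition(row):
            row[destination] = row[source]
            copied += 1
    return copied
```

For instance, calling `copy_column(rows, "old", "new", lambda r: r["active"])` would leave the destination column untouched in rows whose `active` field is false.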
In an alternative embodiment, the apparatus further comprises:
and the execution information receiving module is used for receiving the subtask execution information returned by the at least two computing nodes after the at least two subtasks are distributed to the at least two computing nodes, wherein the subtask execution information is used for determining the progress of the column copy task.
In an alternative embodiment, the apparatus further comprises:
the task control module is used for responding to the task control instruction sent by the target computing node after the at least two subtasks are distributed to the at least two computing nodes, executing corresponding control operation and returning an execution result to the target computing node;
wherein the task control instruction includes at least one of: viewing subtask allocation, viewing the progress of the column copy task, pausing the subtask, and restarting the subtask.
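A possible shape for the task control module's dispatch is sketched below. The instruction names mirror the four instructions listed above, but the enum, the state strings and the transition rules (pause only a running subtask, restart only a paused one) are assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, Optional


class ControlInstruction(Enum):
    VIEW_ALLOCATION = "view_allocation"
    VIEW_PROGRESS = "view_progress"
    PAUSE_SUBTASK = "pause_subtask"
    RESTART_SUBTASK = "restart_subtask"


class TaskControl:
    """Toy task control module; states and transitions are assumed."""

    def __init__(self, assignment: Dict[int, Optional[str]]):
        # assignment maps subtask id -> computing node name (None if unallocated)
        self.assignment = dict(assignment)
        self.state = {sid: ("running" if node else "unallocated")
                      for sid, node in assignment.items()}

    def handle(self, instruction: ControlInstruction, subtask_id: Optional[int] = None):
        """Execute one control instruction and return the execution result."""
        if instruction is ControlInstruction.VIEW_ALLOCATION:
            return dict(self.assignment)
        if instruction is ControlInstruction.VIEW_PROGRESS:
            done = sum(1 for s in self.state.values() if s == "completed")
            return {"completed": done, "total": len(self.state)}
        if subtask_id not in self.state:
            raise KeyError(subtask_id)
        if instruction is ControlInstruction.PAUSE_SUBTASK:
            if self.state[subtask_id] == "running":
                self.state[subtask_id] = "paused"
            return self.state[subtask_id]
        if instruction is ControlInstruction.RESTART_SUBTASK:
            if self.state[subtask_id] == "paused":
                self.state[subtask_id] = "running"
            return self.state[subtask_id]
        raise ValueError(f"unsupported instruction: {instruction}")
```

The value returned by `handle` stands in for the execution result that is sent back to the target computing node.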
FIG. 7 is a schematic diagram of another data processing apparatus provided in accordance with an embodiment of the present disclosure, which may be applicable to cases where column copies are made in a distributed database system. The apparatus may be used to perform the data processing methods of embodiments of the present disclosure, may be implemented in hardware and/or software, and may be configured in an electronic device that may be configured as a computing node in a distributed database system. Referring to fig. 7, the data processing apparatus 700 specifically includes:
a subtask receiving module 701, configured to receive a subtask sent by a metadata node, where the subtask is one of at least two subtasks distributed by the metadata node to at least two computing nodes, the at least two subtasks are obtained by the metadata node splitting, according to a preset granularity, a column copy task submitted by a target computing node for a target database table after receiving the column copy task, the column copy task includes a data definition language task for implementing a column copy function, and different subtasks correspond to different row ranges in the target database table;
a column copy module 702, configured to perform a corresponding column copy operation on the row range corresponding to the received subtask.
According to the data processing scheme provided by the embodiment of the present disclosure, column copying that was previously completed by executing a transaction is instead performed as a DDL task that implements the column copy function, and the metadata node splits and distributes the column copy task so that the resulting subtasks can be executed concurrently across the computing nodes. This can effectively improve column copy efficiency, reduce the overhead of task execution and improve performance, effectively reduce the time consumed by column copying over large amounts of data, and shorten the period during which normal service writes are blocked.
In an alternative embodiment, the column copy module includes:
the area dividing unit is configured to divide the row range corresponding to the received subtask to obtain at least two copy areas;
and the column copy unit is configured to, for each of the at least two copy areas in turn, generate a copy request corresponding to the current copy area and send the copy request to a storage node, so as to instruct the storage node to lock the current copy area specified in the copy request, copy the column data, and unlock the current copy area after the column data has been copied.
In an alternative embodiment, the apparatus further comprises:
and the execution information sending module is used for sending subtask execution information to the metadata node, wherein the subtask execution information is used for the metadata node to determine the progress of the column copy task.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a distributed database system, a readable storage medium, and a computer program product.
An embodiment of the present disclosure provides a distributed database system, including a metadata node and a plurality of computing nodes, where the metadata node is configured to execute a data processing method applied to the metadata node in the embodiment of the present disclosure, and at least two computing nodes in the plurality of computing nodes are configured to execute a data processing method applied to the computing nodes.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as a data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.
Artificial intelligence is the discipline that studies how to make a computer mimic certain human mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning), and it involves both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
Cloud computing refers to a technical system in which an elastically scalable pool of shared physical or virtual resources is accessed through a network, where the resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and can be deployed and managed in an on-demand, self-service manner. Cloud computing technology can provide efficient and powerful data processing capability for technical applications such as artificial intelligence and blockchain, and for model training.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions provided by the present disclosure are achieved, and no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (19)

1. A data processing method applied to metadata nodes in a distributed database system, the distributed database system further comprising a plurality of computing nodes, the method comprising:
receiving a column copy task submitted by a target computing node and aiming at a target database table, wherein the column copy task comprises a data definition language task for realizing a column copy function;
splitting the column copy task into at least two subtasks according to a preset granularity, wherein different subtasks correspond to different row ranges in the target database table;
and distributing the at least two subtasks to at least two computing nodes based on preset concurrency logic to instruct the at least two computing nodes to perform corresponding column copy operations for a row range corresponding to the received subtasks.
2. The method of claim 1, wherein the column copy task includes a preset column copy statement, the preset column copy statement includes a source column, a destination column, and a copy condition, and the column copy operation is used to copy data in the source column that satisfies the copy condition into the destination column.
3. The method of claim 1, wherein the column copy task includes a preset column copy statement, and wherein the preset type log generation rule in the distributed database system includes not generating a preset type log for the preset column copy statement.
4. The method of claim 1, further comprising, after the distributing the at least two sub-tasks to at least two computing nodes:
and receiving subtask execution information returned by the at least two computing nodes, wherein the subtask execution information is used for determining the progress of the column copy task.
5. The method of claim 4, further comprising, after said distributing said at least two sub-tasks to at least two computing nodes:
responding to the task control instruction sent by the target computing node, executing corresponding control operation, and returning an execution result to the target computing node;
wherein the task control instruction includes at least one of: viewing subtask allocation, viewing the progress of the column copy task, pausing the subtask, and restarting the subtask.
6. A data processing method applied to a computing node in a distributed database system, the distributed database system including a metadata node and a plurality of computing nodes, the method comprising:
receiving a subtask sent by a metadata node, wherein the subtask is one of at least two subtasks distributed by the metadata node to at least two computing nodes, the at least two subtasks are obtained by the metadata node splitting, according to a preset granularity, a column copy task submitted by a target computing node for a target database table after receiving the column copy task, the column copy task comprises a data definition language task for implementing a column copy function, and different subtasks correspond to different row ranges in the target database table;
and performing corresponding column copy operation on the row range corresponding to the received subtask.
7. The method of claim 6, wherein the performing a respective column copy operation for the row range corresponding to the received subtask comprises:
dividing the row range corresponding to the received subtask to obtain at least two copy areas;
and generating a copy request corresponding to the current copy area for each of the at least two copy areas in sequence, and sending the copy request to a storage node to instruct the storage node to lock and copy data for the current copy area in the copy request, and unlocking the current copy area after the data is copied.
8. The method of claim 6, further comprising:
and sending subtask execution information to the metadata node, wherein the subtask execution information is used for the metadata node to determine the progress of the column copy task.
9. A data processing apparatus configured as a metadata node in a distributed database system, the distributed database system further comprising a plurality of computing nodes therein, the apparatus comprising:
the task receiving module is used for receiving a column copy task aiming at a target database table, which is submitted by a target computing node, wherein the column copy task comprises a data definition language task for realizing a column copy function;
the task splitting module is used for splitting the column copy task into at least two subtasks according to a preset granularity, wherein different subtasks correspond to different row ranges in the target database table;
the subtask distribution module is used for distributing the at least two subtasks to at least two computing nodes based on preset concurrency logic so as to instruct the at least two computing nodes to perform corresponding column copy operation on a row range corresponding to the received subtasks.
10. The apparatus of claim 9, further comprising:
and the storage module is used for carrying out persistent storage on at least two subtasks after splitting the column copy task into at least two subtasks according to a preset granularity.
11. The apparatus of claim 9, wherein the column copy task includes a source column, a destination column, and a copy condition, the column copy operation to copy data in the source column that satisfies the copy condition into the destination column.
12. The apparatus of claim 9, further comprising:
and the execution information receiving module is used for receiving the subtask execution information returned by the at least two computing nodes after the at least two subtasks are distributed to the at least two computing nodes, wherein the subtask execution information is used for determining the progress of the column copy task.
13. The apparatus of claim 12, further comprising:
the task control module is used for responding to the task control instruction sent by the target computing node after the at least two subtasks are distributed to the at least two computing nodes, executing corresponding control operation and returning an execution result to the target computing node;
wherein the task control instruction includes at least one of: viewing subtask allocation, viewing the progress of the column copy task, pausing the subtask, and restarting the subtask.
14. A data processing apparatus configured as a compute node in a distributed database system, the distributed database system including a metadata node and a plurality of compute nodes, the apparatus comprising:
the subtask receiving module is used for receiving subtasks sent by the metadata node, wherein the subtasks are contained in at least two subtasks distributed to at least two computing nodes by the metadata node, the at least two subtasks are obtained by splitting the column copy tasks according to preset granularity after receiving the column copy tasks submitted by the target computing node and aiming at the target database table, the column copy tasks comprise data definition language tasks for realizing column copy functions, and different subtasks correspond to different row ranges in the target database table;
and the column copy module is used for performing a corresponding column copy operation on the row range corresponding to the received subtask.
15. The apparatus of claim 14, wherein the column copy module comprises:
the area dividing unit is used for dividing the row range corresponding to the received subtask to obtain at least two copy areas;
and the column copy unit is used for, for each of the at least two copy areas in turn, generating a copy request corresponding to the current copy area and sending the copy request to a storage node, so as to instruct the storage node to lock the current copy area specified in the copy request, copy the column data, and unlock the current copy area after the column data has been copied.
16. The apparatus of claim 14, further comprising:
and the execution information sending module is used for sending subtask execution information to the metadata node, wherein the subtask execution information is used for the metadata node to determine the progress of the column copy task.
17. An electronic device configured as a metadata node or a computing node in a distributed database system, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5 or 6-8.
18. A distributed database system comprising a metadata node for performing the method of any of claims 1-5 and a plurality of computing nodes, at least two of the plurality of computing nodes for performing the method of any of claims 6-8.
19. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202310269303.2A 2023-03-15 2023-03-15 Data processing method, device, equipment, system and storage medium Pending CN116521672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310269303.2A CN116521672A (en) 2023-03-15 2023-03-15 Data processing method, device, equipment, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310269303.2A CN116521672A (en) 2023-03-15 2023-03-15 Data processing method, device, equipment, system and storage medium

Publications (1)

Publication Number Publication Date
CN116521672A (en) 2023-08-01

Family

ID=87389376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310269303.2A Pending CN116521672A (en) 2023-03-15 2023-03-15 Data processing method, device, equipment, system and storage medium

Country Status (1)

Country Link
CN (1) CN116521672A (en)

Similar Documents

Publication Publication Date Title
US10180812B2 (en) Consensus protocol enhancements for supporting flexible durability options
US8595732B2 (en) Reducing the response time of flexible highly data parallel task by assigning task sets using dynamic combined longest processing time scheme
US10546021B2 (en) Adjacency structures for executing graph algorithms in a relational database
CN111506401B (en) Automatic driving simulation task scheduling method and device, electronic equipment and storage medium
US11409722B2 (en) Database live reindex
EP3825865A2 (en) Method and apparatus for processing data
CN112346834A (en) Database request processing method and device, electronic equipment and medium
CN112527474A (en) Task processing method and device, equipment, readable medium and computer program product
CN113364877A (en) Data processing method, device, electronic equipment and medium
US20130138418A1 (en) Modeling of Cross System Scenarios
US20170075943A1 (en) Maintaining in-memory database consistency by parallelizing persistent data and log entries
US10127270B1 (en) Transaction processing using a key-value store
CN111459882B (en) Namespace transaction processing method and device for distributed file system
US10970175B2 (en) Flexible per-request data durability in databases and other data stores
US11169855B2 (en) Resource allocation using application-generated notifications
US9910893B2 (en) Failover and resume when using ordered sequences in a multi-instance database environment
US11386153B1 (en) Flexible tagging and searching system
CN111782147A (en) Method and apparatus for cluster scale-up
CN111782341A (en) Method and apparatus for managing clusters
US20220244990A1 (en) Method for performing modification task, electronic device and readable storage medium
CN116521672A (en) Data processing method, device, equipment, system and storage medium
US11941055B2 (en) Method and apparatus for graph computing, electronic device and storage medium
CN116974983A (en) Data processing method, device, computer readable medium and electronic equipment
CN105518664B (en) Managing database nodes
US20160140117A1 (en) Asynchronous sql execution tool for zero downtime and migration to hana

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination