CN116954490A - Data processing method, device, electronic equipment and storage medium

Info

Publication number: CN116954490A
Application number: CN202211529384.7A
Authority: CN
Inventors: 王登山; 于华丽; 陈蒙; 余建涛; 唐暾
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-11-30
Filing date: 2022-11-30
Publication date: 2023-10-27

Abstract

The application discloses a data processing method, a device, electronic equipment and a storage medium. The method can be used for cloud data calculation and comprises the following steps: executing data reassignment processing based on the data reassignment task assigned by the driving node, and storing the data to be processed of the target data partition corresponding to the data reassignment task locally; and, based on the data reduction task distributed by the driving node, reading the reassignment data corresponding to the target data partition locally and executing data reduction processing to obtain a data reduction processing result. Because data is written into the data partitions sequentially during reassignment and is read locally during reduction processing, the execution efficiency of data reassignment and data reduction is improved.

Description

Data processing method, device, electronic equipment and storage medium

Technical Field

The present application relates to the field of cloud computing technologies, and in particular, to a data processing method, a data processing device, an electronic device, and a storage medium.

Background

In distributed computing, each computing node in each stage processes only part of the data of a task. If the next stage depends on all computing results of the previous stage, those results must be re-integrated and classified through a shuffle operation. In the prior art, a shuffle operation either relies on a third-party service for data management, or requires the running reassignment task to write randomly or the data reduction task to read randomly, which reduces the efficiency of data reassignment and data reduction.

Disclosure of Invention

The application provides a data processing method, a data processing device, electronic equipment and a storage medium, which improve the execution efficiency of data redistribution and data reduction.

In one aspect, the present application provides a data processing method, the method including:

receiving a data reassignment task assigned by a driving node; the data redistribution task corresponds to a target data partition, and is used for carrying out data redistribution processing on a plurality of pieces of data to be processed in the target data partition; the target data partition is at least one of a plurality of data partitions;

performing data redistribution based on the data characteristics of the plurality of items of data to be processed, and determining redistribution partition identifiers corresponding to the plurality of items of data to be processed respectively; the data partition corresponding to the reassignment partition identifier is one of the plurality of data partitions;

based on the reassignment partition identifiers corresponding to the multiple pieces of data to be processed, the reassignment data corresponding to the target data partition is stored locally;

receiving a data reduction task distributed by the driving node; the data reduction task corresponds to the target data partition;

and obtaining the reassignment data in the target data partition from the local, and carrying out data reduction processing on the reassignment data to obtain a data reduction processing result corresponding to the target data partition.

In another aspect, the present application provides a data processing method, the method including:

generating a data reassignment task; the data redistribution task corresponds to a target data partition, and is used for carrying out data redistribution processing on a plurality of pieces of data to be processed in the target data partition; the target data partition is at least one of a plurality of data partitions;

sending the data reassignment task to each data processing node so that each data processing node performs data reassignment based on the data characteristics of the plurality of items of data to be processed, and determining reassignment partition identifications corresponding to the plurality of items of data to be processed respectively; based on the reassignment partition identifiers corresponding to the multiple pieces of data to be processed, the reassignment data corresponding to the target data partition is stored locally; the data partition corresponding to the reassignment partition identifier is one of the plurality of data partitions;

generating a data reduction task; the data reduction task corresponds to a target data partition, and is used for carrying out data reduction processing on multiple items of reassigned data in the target data partition; the target data partition is at least one of a plurality of data partitions;

and sending the corresponding data reduction task to each data processing node, wherein the target data partition corresponding to the data reduction task running in a data processing node is the same as the target data partition corresponding to the data redistribution task running in that data processing node, so that each data processing node obtains the redistribution data corresponding to the target data partition locally, and performs data reduction processing on the redistribution data to obtain a data reduction processing result corresponding to the target data partition.

Another aspect provides a data processing apparatus, the apparatus comprising:

the first task receiving module is used for receiving the data reassignment task assigned by the driving node; the data redistribution task corresponds to a target data partition, and is used for carrying out data redistribution processing on a plurality of pieces of data to be processed in the target data partition; the target data partition is at least one of a plurality of data partitions;

the data redistribution module is used for carrying out data redistribution based on the data characteristics of the plurality of items of data to be processed, and determining redistribution partition identifications corresponding to the plurality of items of data to be processed respectively; the data partition corresponding to the reassignment partition identifier is one of the plurality of data partitions;

the reassignment data storage module is used for storing reassignment data corresponding to the target data partition locally based on reassignment partition identifiers corresponding to the multiple pieces of data to be processed respectively;

the second task receiving module is used for receiving the data reduction task distributed by the driving node; the data reduction task corresponds to the target data partition;

and the data reduction processing module is used for locally acquiring the reassigned data in the target data partition, and carrying out data reduction processing on the reassigned data to obtain a data reduction processing result corresponding to the target data partition.

Another aspect provides a data processing apparatus, the apparatus comprising:

the first task generating module is used for generating a data reassignment task; the data redistribution task corresponds to a target data partition, and is used for carrying out data redistribution processing on a plurality of pieces of data to be processed in the target data partition; the target data partition is at least one of a plurality of data partitions;

the first task sending module is used for sending the data reassignment task to each data processing node so that each data processing node can conduct data reassignment based on the data characteristics of the plurality of items of data to be processed, and the reassignment partition identifications corresponding to the plurality of items of data to be processed are determined; based on the reassignment partition identifiers corresponding to the multiple pieces of data to be processed, the reassignment data corresponding to the target data partition is stored locally; the data partition corresponding to the reassignment partition identifier is one of the plurality of data partitions;

the second task generating module is used for generating a data reduction task; the data reduction task corresponds to a target data partition, and is used for carrying out data reduction processing on multiple items of reassigned data in the target data partition; the target data partition is at least one of a plurality of data partitions;

and the second task sending module is used for sending the data reduction task corresponding to each data processing node, wherein the target data partition corresponding to the data reduction task running in the data processing node is the same as the target data partition corresponding to the data redistribution task running in the data processing node, so that each data processing node obtains the redistribution data corresponding to the target data partition from the local, and performs data reduction processing on the redistribution data to obtain a data reduction processing result corresponding to the target data partition.

In another aspect, an electronic device is provided, the electronic device comprising a processor and a memory, the memory storing at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by the processor to implement a data processing method as described above.

Another aspect provides a computer-readable storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement a data processing method as described above.

The application provides a data processing method, a device, electronic equipment and a storage medium. The data processing method comprises the following steps: executing data reassignment processing based on the data reassignment task assigned by the driving node, and storing the data to be processed of the target data partition corresponding to the data reassignment task locally; and, based on the data reduction task distributed by the driving node, reading the reassignment data corresponding to the target data partition locally and executing data reduction processing to obtain a data reduction processing result. Because data is written into the data partitions sequentially during reassignment and is read locally during reduction processing, the execution efficiency of data reassignment and data reduction is improved.

Drawings

In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of an application scenario of a data processing method according to an embodiment of the present application;

FIG. 2 is an interactive flow chart of a data processing method according to an embodiment of the present application;

FIG. 3 is a schematic diagram illustrating a change of a data partition after data redistribution in a data processing method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a data reduction task corresponding to each data processing node in a data processing method according to an embodiment of the present application;

FIG. 5 is a flowchart of storing local data in a data processing method according to an embodiment of the present application;

FIG. 6 is a flowchart of a method for sending first external data in a data processing method according to an embodiment of the present application;

FIG. 7 is a flowchart of a method for storing second external data in a data processing method according to an embodiment of the present application;

FIG. 8 is a schematic diagram of the task execution end prompt messages sent between reassignment subtasks in a data processing method according to an embodiment of the present application;

FIG. 9 is a flowchart of a data processing method on a data processing node side according to an embodiment of the present application;

FIG. 10 is a flow chart of a data processing method on a driving node side according to an embodiment of the present application;

FIG. 11 is a schematic diagram illustrating data storage of local data, first external data, and second external data when writing data according to an embodiment of the present application;

FIG. 12 is a schematic diagram of a data processing method for performing data processing in a shuffling operation of a distributed computing system according to an embodiment of the present application;

FIG. 13 is a schematic diagram of a structure of a data processing device on a data processing node side according to an embodiment of the present application;

FIG. 14 is a schematic structural diagram of a data processing apparatus at a driving node side according to an embodiment of the present application;

FIG. 15 is a schematic hardware structure of an apparatus for implementing the method provided by the embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present application more apparent. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

In the description of the present application, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. Moreover, the terms "first," "second," and the like, are used to distinguish between similar objects and do not necessarily describe a particular order or precedence. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein.

It will be appreciated that in the specific embodiments of the present application, related data such as user information is involved, and when the above embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.

Referring to FIG. 1, an application scenario of the data processing method provided by the embodiment of the present application is shown. The application scenario includes a driving node 110 and a data processing node 120. The driving node 110 distributes a data redistribution task to the data processing node 120. The data processing node 120 runs the data redistribution task to redistribute the data to be processed and write it into the corresponding target data partition, and the redistributed data in the target data partition is stored in the data processing node 120. The driving node 110 determines the data reduction task corresponding to each data processing node 120 based on the correspondence between the data redistribution task and the target data partition, and the data processing node 120 runs the data reduction task, reading the data locally to perform data reduction processing.

In the embodiment of the present application, the server includes a driving node 110 and a data processing node 120, and the server may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and an artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.

Referring to FIG. 2, a data processing method is shown. The method includes:

S210, a driving node generates a data redistribution task; the data redistribution task corresponds to the target data partition, and is used for carrying out data redistribution processing on a plurality of pieces of data to be processed in the target data partition; the target data partition is at least one of a plurality of data partitions;

In some embodiments, the data reassignment task may include a plurality of reassignment subtasks, the number of which is consistent with the number of data partitions included in the target data partition. That is, the data partitions and the reassignment subtasks are in one-to-one correspondence. A reassignment subtask may be a map task executed when the distributed computing engine performs a shuffle operation, and the data to be processed may be reassigned each time a map task is run. As shown in FIG. 3, the data partitions of the data to be processed may change after the data is redistributed. For example, n items of data to be processed may originally correspond to 3 data partitions and, after the data is redistributed, correspond to 5 data partitions.

S220, the driving node sends a data reassignment task to each data processing node;

In some embodiments, the data reassignment tasks sent to each data processing node may be randomly allocated. The data processing node corresponding to each data reassignment task may be determined by taking the remainder of dividing the task number identifier of the data reassignment task by the number of data processing nodes.
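The remainder-based assignment can be sketched as follows. This is a minimal illustration rather than the application's implementation; the function name `assign_node` and the zero-based numbering of tasks and nodes are assumptions made for the example.

```python
def assign_node(task_id: int, num_nodes: int) -> int:
    """Return the index of the data processing node for a reassignment task.

    Assumes tasks and nodes are numbered from 0 (hypothetical convention).
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return task_id % num_nodes


# Six reassignment tasks spread over three data processing nodes.
assignment = {task_id: assign_node(task_id, 3) for task_id in range(6)}
```

With this rule, consecutive task numbers rotate across the nodes, so the tasks are spread evenly without any coordination beyond knowing the node count.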

S230, the data processing node receives a data redistribution task distributed by the driving node;

In some embodiments, the target data partition corresponding to the data reassignment task received by the data processing node is consistent with the data partition to be processed by the data processing node. The data processing node may run a data redistribution task, so that each redistribution subtask in the data redistribution task performs data redistribution on the data to be processed in the corresponding data partition.

S240, the data processing node performs data redistribution based on the data characteristics of the plurality of items of data to be processed, and determines redistribution partition identifiers corresponding to the plurality of items of data to be processed respectively; the data partition corresponding to the reassigned partition identifier is one of a plurality of data partitions;

In some embodiments, the data processing node may classify the multiple pieces of data to be processed based on their data features, dividing the data to be processed that was originally in the same data partition into at least one data type, and determine the reassigned partition identifier corresponding to each piece of data to be processed based on the data type of that piece of data.
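One common way to turn a data feature into a partition identifier is to hash a record key and take the remainder by the number of partitions. The sketch below uses that technique as a stand-in: the text only says the identifier is derived from the data type, so the choice of a string key as the feature, the CRC32 hash, and the name `reassigned_partition_id` are all assumptions.

```python
import zlib


def reassigned_partition_id(key: str, num_partitions: int) -> int:
    """Map a record key (the assumed 'data feature') to a partition identifier."""
    # crc32 is deterministic across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical (key, value) records; records with the same key share a partition.
records = [("apple", 1), ("pear", 2), ("apple", 3)]
partition_ids = [reassigned_partition_id(key, 5) for key, _ in records]
```

Any deterministic classification works here, as long as every node computes the same identifier for the same feature.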

S250, the data processing node stores the reassigned data corresponding to the target data partition locally based on reassigned partition identifiers corresponding to the multiple pieces of data to be processed;

In some embodiments, reassigned data of the same type is stored in the same data partition. If the reassigned data belongs to a data partition of the local data processing node, it can be stored locally directly; if it does not, it can be sent to another data processing node. In addition, data to be processed that corresponds to a local data partition and is sent by other data processing nodes can be received.

S260, the driving node generates a data reduction task; the target data partition corresponding to the data reduction task running in the data processing node is the same as the target data partition corresponding to the data redistribution task running in the data processing node;

In some embodiments, the data reduction task is used to aggregate or sort the data in the corresponding data partition. The data reduction task corresponding to each data processing node is determined based on the target data partition corresponding to the data redistribution task running in that data processing node, so that the target data partition corresponding to the data reduction task is consistent with the target data partition corresponding to the data redistribution task.

The data reduction task may include a plurality of reduction processing sub-tasks, the number of reduction processing sub-tasks in the data reduction task being consistent with the number of data partitions included in the target data partition. That is, the data partitions and the reduction processing subtasks are in one-to-one correspondence.

Referring to FIG. 4, a schematic diagram of the data reduction task corresponding to each data processing node is shown. The driving node distributes data reduction tasks corresponding to each target data partition to the data processing nodes corresponding to the target data partitions based on the corresponding relation between the data processing nodes and the target data partitions.

The data reduction task may be a reduce task executed when the distributed computing engine performs a shuffle operation. Each time a reduce task is run, data may be read locally and data reduction processing such as data aggregation or sorting may be performed.
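The placement rule for data reduction tasks, where each task runs on the node that already stores its target data partition, can be sketched as below. The `ReduceTask` type, the function `plan_reduce_tasks`, and the example node and partition numbers are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReduceTask:
    partition_id: int  # target data partition of the reduction subtask
    node_id: int       # data processing node that will run it


def plan_reduce_tasks(partitions_by_node: dict[int, list[int]]) -> list[ReduceTask]:
    """Create one reduction subtask per partition, placed on the node that stored it."""
    tasks = []
    for node_id, partition_ids in sorted(partitions_by_node.items()):
        for partition_id in sorted(partition_ids):
            tasks.append(ReduceTask(partition_id=partition_id, node_id=node_id))
    return tasks


# Hypothetical layout: node 0 stored partitions 0 and 2, node 1 stored partition 1.
tasks = plan_reduce_tasks({0: [0, 2], 1: [1]})
```

Because the driving node reuses the partition-to-node correspondence from the reassignment stage, no reduction subtask needs to fetch its input from another node.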

S270, the driving node sends data reduction tasks corresponding to the target data partitions of each data processing node to each data processing node;

In some embodiments, the target data partition corresponding to the data reduction task sent by the driving node to each data processing node is consistent with the target data partition corresponding to the data redistribution task running in that data processing node.

S280, the data processing node receives a data reduction task distributed by the driving node; the data reduction task corresponds to a target data partition;

In some embodiments, the data processing node may run the data reduction task, so that each reduction processing subtask in the data reduction task performs data reduction processing on the data to be processed in the corresponding data partition.

S290, the data processing node locally acquires the reassignment data in the target data partition, and performs data reduction processing on the reassignment data to obtain a data reduction processing result corresponding to the target data partition.

In some embodiments, when the data processing node runs the data reduction task and each reduction processing subtask in the data reduction task processes the data in its corresponding data partition, the reassignment data can be obtained directly from the local storage of the data processing node to which the subtask is assigned.
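A reduction subtask that reads its partition locally and aggregates it might look like the following sketch. Summing values per key is only one example of aggregation; the in-memory dictionary `local_partitions` standing in for local storage and the function name `reduce_partition` are assumptions.

```python
from collections import defaultdict


def reduce_partition(records):
    """Aggregate the (key, value) pairs of one locally stored partition by summing values."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    # Return the keys in sorted order so the result is deterministic.
    return dict(sorted(totals.items()))


# Hypothetical local storage of this node: partition id -> reassignment data.
local_partitions = {0: [("a", 1), ("b", 2), ("a", 3)]}
result = reduce_partition(local_partitions[0])
```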

In some embodiments, referring to FIG. 5, the reassigned partition identifier includes a local partition identifier, where the local partition identifier is a partition identifier corresponding to a local target data partition;

based on the reassignment partition identifiers corresponding to the multiple pieces of data to be processed, storing reassignment data corresponding to the target data partition locally includes:

S510, determining local data from a plurality of pieces of data to be processed; the reassigned partition identifier corresponding to the local data is a local partition identifier;

S520, writing the local data into a target data partition;

S530, storing the reassigned data in the target data partition locally.

In some embodiments, the local data is data for which the reassignment partition identifier is determined while the local data processing node runs the data redistribution task, and which is written into the target data partition corresponding to the local data processing node. When the data processing node runs the data redistribution task to redistribute the plurality of pieces of data to be processed, the partition identifiers corresponding to the data partitions where the pieces of data to be processed are located can be updated to obtain the reassigned partition identifiers. When a reassigned partition identifier is the same as the partition identifier of the target data partition corresponding to the data processing node, the data to be processed corresponding to that reassigned partition identifier is determined to be local data, and the reassigned partition identifier is the local partition identifier. The local data can be written directly into the target data partition, and the reassigned data in the target data partition is stored locally.

When the data reassignment task includes a plurality of reassignment subtasks, the data to be processed in the data partition corresponding to each reassignment subtask can be traversed, and the reassignment partition identifier corresponding to the currently traversed data to be processed is determined. When the reassignment partition identifier is the same as the partition identifier of the data partition corresponding to the reassignment subtask performing the traversal, the currently traversed data to be processed is determined to be local data, and the reassignment partition identifier is determined to be the local partition identifier. The currently traversed data to be processed is written into the cache of the data partition corresponding to the local partition identifier. When the amount of data stored in the cache of the data partition reaches a preset data volume threshold, the reallocated data in the data partition is stored sequentially to a local disk of the data processing node.

Based on the reassignment partition identification, local data stored locally in the data processing node is obtained by screening from the data to be processed, and the local data in the data to be processed is directly written into the target data partition and stored locally in the data processing node, so that the local data can be quickly identified and directly stored, and the execution efficiency of the data reassignment task is improved.
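The cache-then-flush behaviour for a data partition can be sketched as a small buffer that appends to a local file once a size threshold is reached. The class name `PartitionBuffer`, the byte-based threshold, and the file layout are assumptions for illustration; the text specifies only that cached data is stored sequentially to the local disk when a preset data volume threshold is reached.

```python
import os
import tempfile


class PartitionBuffer:
    """Buffer records for one data partition and append them to a local file."""

    def __init__(self, path: str, threshold_bytes: int) -> None:
        self.path = path
        self.threshold_bytes = threshold_bytes
        self._chunks: list[bytes] = []
        self._size = 0

    def write(self, record: bytes) -> None:
        self._chunks.append(record)
        self._size += len(record)
        if self._size >= self.threshold_bytes:
            self.flush()

    def flush(self) -> None:
        if not self._chunks:
            return
        # Append mode keeps the on-disk write sequential.
        with open(self.path, "ab") as f:
            f.write(b"".join(self._chunks))
        self._chunks.clear()
        self._size = 0


path = os.path.join(tempfile.mkdtemp(), "partition-0.bin")
buf = PartitionBuffer(path, threshold_bytes=8)
buf.write(b"abcd")                     # 4 bytes cached, below the threshold
flushed_early = os.path.exists(path)   # nothing has been written to disk yet
buf.write(b"efgh")                     # 8 bytes cached, threshold reached, flushed
buf.flush()                            # flushes any remainder at the end of the task
with open(path, "rb") as f:
    contents = f.read()
```

Batching writes this way trades a little memory for fewer, larger sequential appends per partition.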

In some embodiments, referring to FIG. 6, the reassigned partition identifier includes an external partition identifier, where the external partition identifier is a partition identifier corresponding to an external target data partition;

after performing data redistribution based on the data characteristics of the plurality of pieces of data to be processed and determining the redistribution partition identifiers corresponding to the plurality of pieces of data to be processed respectively, the method further includes:

S610, determining first external data from a plurality of pieces of data to be processed; the reassigned partition identifier corresponding to the first external data is an external partition identifier;

S620, determining a first data processing node corresponding to the first external data;

S630, sending the first external data to the first data processing node, so that the first data processing node writes the first external data into the data partition corresponding to the external partition identifier.

In some embodiments, the first external data is data for which the reassignment partition identifier is determined while the local data processing node runs the data redistribution task, and which is written into a target data partition corresponding to a data processing node other than the local data processing node. When the data processing node runs the data redistribution task to redistribute the plurality of pieces of data to be processed, the partition identifiers corresponding to the data partitions where the pieces of data to be processed are located can be updated to obtain the reassigned partition identifiers. When a reassigned partition identifier is different from the partition identifier of the target data partition corresponding to the data processing node, the data to be processed corresponding to that reassigned partition identifier can be determined to be first external data, and the reassigned partition identifier is an external partition identifier. Based on the external partition identifier, the first data processing node corresponding to the first external data may be determined, and the first external data may be sent to the first data processing node. In the first data processing node, the first external data may be written into the data partition corresponding to the external partition identifier.

When the data reassignment task includes a plurality of reassignment subtasks, the data to be processed in the data partition corresponding to each reassignment subtask can be traversed, and the reassignment partition identifier corresponding to the currently traversed data to be processed is determined. When the reassignment partition identifier is different from the partition identifier of the data partition corresponding to the reassignment subtask performing the traversal, the currently traversed data to be processed is determined to be first external data, and the reassignment partition identifier is determined to be the external partition identifier. After the first external data is sent to the first data processing node, the first data processing node may write the first external data into the cache of the data partition corresponding to the external partition identifier. When the amount of data stored in the cache of the data partition reaches a preset data volume threshold, the reallocated data in the data partition is stored sequentially to a local disk of the first data processing node.

Based on the reassignment partition identification, first external data is obtained by screening from the data to be processed, and the first external data is sent to the corresponding first data processing node for data writing, so that non-local data can be rapidly identified and sent to the corresponding data processing node for storage, and accuracy of a data reassignment task is improved.
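The split between local data and first external data can be sketched as a single routing pass over the records. The function `route_records` and its callback parameters `partition_of` and `node_of_partition` are hypothetical names; the example setup with 4 partitions on 2 nodes is likewise an assumption.

```python
from collections import defaultdict


def route_records(records, local_partition_ids, partition_of, node_of_partition):
    """Split records into local data and first external data grouped by target node.

    partition_of and node_of_partition are caller-supplied functions (assumptions).
    """
    local = defaultdict(list)     # partition id -> records kept on this node
    outgoing = defaultdict(list)  # node id -> (partition id, record) pairs to send
    for record in records:
        pid = partition_of(record)
        if pid in local_partition_ids:
            local[pid].append(record)
        else:
            outgoing[node_of_partition(pid)].append((pid, record))
    return dict(local), dict(outgoing)


# Hypothetical setup: 4 partitions on 2 nodes; this node (node 0) owns partitions 0 and 2.
local, outgoing = route_records(
    records=[0, 1, 2, 3, 4],
    local_partition_ids={0, 2},
    partition_of=lambda value: value % 4,
    node_of_partition=lambda pid: pid % 2,
)
```

Grouping the outgoing records by destination node lets a sender batch everything bound for the same peer into one transfer.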

In some embodiments, referring to fig. 7, the reassignment data includes local data and second external data; the second external data is data sent by an external second data processing node, and the reassignment partition identifier corresponding to the second external data is a local partition identifier;

based on the reassignment partition identifiers corresponding to the multiple pieces of data to be processed, storing reassignment data corresponding to the target data partition locally, including:

S710, receiving second external data sent by a second data processing node, wherein the target data partition corresponding to the data redistribution task running in the second data processing node is different from the target data partition corresponding to the data redistribution task running in the local data processing node;

S720, storing the local data and the second external data locally.

In some embodiments, the second external data is external data of the second data processing node determined in the process of the second data processing node running the data reassignment task, and the data reassignment task running in the second data processing node is different from the data reassignment task running in the local data processing node, and the corresponding target data partition is also different.

The second data processing node sends the second external data to the local data processing node based on the reassigned partition identification corresponding to the second external data. The local data processing node receives second external data sent by the second data processing node, and the reassignment partition identifier corresponding to the second external data is the local partition identifier corresponding to the local data processing node.

And writing the local data and the second external data into a target data partition corresponding to the local data processing node, and storing the local data and the second external data as reallocation data of the target data partition locally.

After the data is redistributed, the data processing node corresponding to the reassigned data can be determined based on the partition identifier corresponding to the reassigned data and the number of data processing nodes. Specifically, the partition identifier is divided by the number of data processing nodes, and the remainder identifies the data processing node corresponding to the reassigned data.

In the case where the data reassignment task includes a plurality of reassignment subtasks, the reassignment subtask corresponding to the reassigned data may be determined based on the partition identifier corresponding to the reassigned data and the number of reassignment subtasks running in the data processing node. Specifically, the partition identifier is divided by the number of reassignment subtasks running in the data processing node, and the remainder identifies the reassignment subtask corresponding to the reassigned data.
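The two remainder operations described above can be written as a short sketch. The function names and the use of zero-based integer identifiers for partitions, nodes, and subtasks are assumptions made for illustration only.

```python
def node_for_partition(partition_id: int, num_nodes: int) -> int:
    """Index of the data processing node that corresponds to a partition:
    the remainder of the partition identifier divided by the node count."""
    return partition_id % num_nodes


def subtask_for_partition(partition_id: int, num_subtasks_on_node: int) -> int:
    """Index of the reassignment subtask, within the owning node, that
    corresponds to a partition: the remainder of the partition identifier
    divided by the number of subtasks running on that node."""
    return partition_id % num_subtasks_on_node
```

For example, with 6 partitions, 3 data processing nodes, and 2 reassignment subtasks per node, partition 4 maps to node index 1 and subtask index 0 under these assumptions.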

Based on the reassignment partition identification, second external data sent by the second data processing node is received, the second external data is written into the target data partition and stored in the data processing node, so that the data processing node can locally store all data to be processed of the same data partition, and the comprehensiveness of the data reassignment task is improved.

In some embodiments, after performing data redistribution based on the data features of the plurality of pieces of data to be processed and determining the redistribution partition identifiers corresponding to the plurality of pieces of data to be processed, the method further includes:

stopping the data writing operation of the target data partition under the condition that prompt information of ending execution of the associated data reassignment task is received; the associated data reassignment tasks are data reassignment tasks running in other data processing nodes than the local data processing node.

In some embodiments, when a reassignment subtask in the local data processing node finishes traversing the data to be processed in its corresponding data partition, it generates prompt information indicating that execution of the reassignment task has ended, and sends the prompt information to the other reassignment subtasks. Likewise, each reassignment subtask included in the associated data redistribution task generates such prompt information after it finishes traversing the data to be processed in its corresponding data partition, and the local data processing node can receive the prompt information sent by each of these reassignment subtasks. Referring to fig. 8, fig. 8 is a schematic diagram of sending task-end prompt information between reassignment subtasks. After its traversal is completed, reassignment subtask 1 running on data processing node 1 sends prompt information to the other reassignment subtasks, such as reassignment subtask 2 and reassignment subtask 3.

After the local data processing node receives the prompt information sent by all the reassignment subtasks in the associated data reassignment tasks, the data write stream of the target data partition is closed, and the data writing operation of the target data partition is stopped. Until the local data processing node has received the prompt information sent by all reassignment subtasks in the associated data reassignment tasks, the data write stream of the target data partition is kept open so that second external data sent by the second data processing node can still be written.

Under the condition that prompt information of ending execution of the associated data redistribution task is received, data writing of a target data partition of the local data processing node is stopped, missing of data to be processed, which can be written into the target data partition, in the associated data redistribution task can be avoided, and therefore accuracy and comprehensiveness of data redistribution processing are improved.
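The rule for keeping the write stream open can be sketched as a small bookkeeping object. The class name `WriteStreamGate`, the integer subtask identifiers, and the `on_end_prompt` method are hypothetical names introduced here; how prompt information is actually transported between nodes is not specified in this sketch.

```python
class WriteStreamGate:
    """Tracks which associated reassignment subtasks have not yet reported
    the end of their traversal, and keeps the partition's write stream open
    until every one of them has done so."""

    def __init__(self, peer_subtask_ids):
        self._pending = set(peer_subtask_ids)
        # With no peers there is nothing to wait for.
        self.open = bool(self._pending)

    def on_end_prompt(self, subtask_id) -> bool:
        """Record one end-of-execution prompt; return whether the write
        stream is still open afterwards. Duplicate prompts are ignored."""
        self._pending.discard(subtask_id)
        if not self._pending:
            self.open = False
        return self.open
```

A node in this sketch would continue to accept second external data while `open` is true and would close the partition file once it turns false.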

Referring to fig. 9, a data processing method is shown, which is applied to a data processing node side, and includes:

S910, receiving a data redistribution task distributed by a driving node; the data redistribution task corresponds to the target data partition, and is used for carrying out data redistribution processing on a plurality of pieces of data to be processed in the target data partition; the target data partition is at least one of a plurality of data partitions;

S920, carrying out data redistribution based on data characteristics of the plurality of items of data to be processed, and determining redistribution partition identifiers corresponding to the plurality of items of data to be processed respectively; the data partition corresponding to the reassigned partition identifier is one of a plurality of data partitions;

S930, based on the reassignment partition identifiers corresponding to the multiple pieces of data to be processed, the reassignment data corresponding to the target data partition is stored locally;

S940, receiving a data reduction task distributed by a driving node; the data reduction task corresponds to a target data partition;

S950, obtaining the reassignment data in the target data partition from the local, and performing data reduction processing on the reassignment data to obtain a data reduction processing result corresponding to the target data partition.

In some embodiments, each data processing node may run a data reassignment task and a data reduction task, with the target data partition corresponding to the data reassignment task and the target data partition corresponding to the data reduction task running on each data processing node being the same.

The data processing node runs the data reassignment task and redistributes the plurality of pieces of data to be processed based on their data characteristics, so that the reassigned partition identifier corresponding to each piece of data to be processed can be determined. Referring to fig. 10, fig. 10 is a schematic diagram illustrating how local data, first external data, and second external data are stored during data writing. The reassigned partition identifiers include local partition identifiers and external partition identifiers.

When a reassigned partition identifier is the same as the partition identifier of the target data partition corresponding to the data processing node, the data to be processed corresponding to that reassigned partition identifier is determined to be local data, and the reassigned partition identifier is a local partition identifier. The local data may be written directly into the local target data partition. When a reassigned partition identifier differs from the partition identifier of the target data partition corresponding to the data processing node, the data to be processed corresponding to that reassigned partition identifier may be determined to be first external data, and the reassigned partition identifier is an external partition identifier. The first data processing node corresponding to the external partition identifier is determined, and the first external data is sent to the first data processing node for data writing.

The data processing node also receives second external data sent by the second data processing node, where the reassigned partition identifier corresponding to the second external data matches the partition identifier of the target data partition corresponding to the local data processing node. Based on that reassigned partition identifier, the second external data is written into the local target data partition. The first data processing node and the second data processing node are both data processing nodes other than the local data processing node, and they may be the same data processing node or different data processing nodes.

After the data redistribution tasks corresponding to each data processing node are executed, the data processing nodes can report the corresponding relation between the target data partition and the data processing nodes to the driving nodes, and the driving nodes can distribute data reduction tasks according to the corresponding relation.

The data processing node receives the data reduction task distributed by the driving node and runs it. After querying the driving node for the data of the data partition to be processed by the data reduction task, the data processing node can determine that this data is stored locally. The reassigned data in the target data partition can therefore be obtained directly from the local data processing node and subjected to data reduction processing, so that the data reduction processing result corresponding to the target data partition is obtained.

When the reassigned data in the target data partition is obtained directly from the local data processing node, the data may be read using the file system cache and the memory mapping technology of the operating system.
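A local read through memory mapping can be sketched with Python's standard `mmap` module, which relies on the operating system's page cache. The function name `read_partition_file` and the one-file-per-partition layout are assumptions for illustration; the application does not specify a file format.

```python
import mmap
import os


def read_partition_file(path: str) -> bytes:
    """Read the reassigned data of one partition from local disk using a
    read-only memory mapping instead of explicit buffered reads."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return b""  # an empty file cannot be memory-mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # Slicing copies the mapped bytes so they outlive the mapping.
            return view[:]
```

A reduce task in this sketch would call the function on its own partition file and then decode and aggregate the returned bytes.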

The sequential writing of each data partition is executed through the corresponding relation between the data partition and the data processing node, and the reassignment data of each data partition is read through the corresponding relation between the data partition and the data processing node, so that the steps of data query and data pulling can be omitted, and the complex data index relation among reassigned data can be omitted, thereby improving the execution efficiency of data reassignment and data reduction processing, avoiding the dependence of the data processing process on third-party remote service, and further reducing the cost of operation and deployment.

Referring to fig. 11, a data processing method is shown, which is applied to a driving node side, and includes:

S1110, generating a data reassignment task;

S1120, sending a data reassignment task to each data processing node so that each data processing node performs data reassignment based on the data characteristics of the plurality of items of data to be processed, and determining reassignment partition identifications corresponding to the plurality of items of data to be processed respectively; based on the reassignment partition identifiers corresponding to the multiple pieces of data to be processed, reassignment data corresponding to the target data partitions are stored locally; the data partition corresponding to the reassigned partition identifier is one of a plurality of data partitions;

S1130, generating a data reduction task;

S1140, sending to each data processing node its corresponding data reduction task, wherein the target data partition corresponding to the data reduction task running in the data processing node is the same as the target data partition corresponding to the data redistribution task running in the data processing node, so that each data processing node obtains the redistribution data corresponding to the target data partition from the local, and performs data reduction processing on the redistribution data to obtain a data reduction processing result corresponding to the target data partition.

In some embodiments, the drive node may generate and assign data reassignment tasks and data reduction tasks to each data processing node. The data redistribution task corresponds to a target data partition, and after the data processing node executes the data redistribution task, the corresponding relationship between the target data partition and the data processing node can be sent to the driving node.

Based on the correspondence between the target data partitions and the data processing nodes, the driving node can allocate the data reduction task for each target data partition to the data processing node corresponding to that target data partition, so that the data reduction task and the data redistribution task running in the same data processing node correspond to the same target data partition.

The data redistribution task is generated, and is distributed through the corresponding relation between the data partition and the data processing node, so that the data redistribution task can be sequentially written based on the corresponding data partition, then a data reduction task is generated, and is distributed through the corresponding relation between the data partition and the data processing node, so that the data reduction task can be directly read from the local when the data is read, and the execution efficiency of the data redistribution and the data reduction is improved.
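The driving node's allocation of data reduction tasks from the reported correspondence can be sketched as follows. The dictionary shape `{partition_id: node_name}` and the function name `assign_reduce_tasks` are assumptions made for this example.

```python
def assign_reduce_tasks(partition_to_node):
    """Group partitions by the node that stored them, so that each reduce
    task is allocated to the node already holding that partition's data.

    partition_to_node: mapping reported by the nodes after redistribution,
    from partition identifier to node name.
    Returns a mapping from node name to the sorted partition identifiers
    whose reduce tasks that node should run.
    """
    tasks = {}
    for partition_id, node in sorted(partition_to_node.items()):
        tasks.setdefault(node, []).append(partition_id)
    return tasks
```

Because every reduce task lands on the node named in the correspondence, no partition data has to be pulled across the network in this sketch.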

The data processing method provided by the embodiment of the application can be applied to the shuffle operation of a distributed computing system. When the shuffle operation is performed, the map tasks (maptask) and the reduce tasks may be executed by the data processing method described above. In the process of executing the shuffle operation, the driving node generates as many data reassignment subtasks as there are data partitions and assigns at least one data reassignment subtask to each data processing node, where each data reassignment subtask is a maptask. The at least one data reassignment subtask assigned to a data processing node constitutes the data redistribution task corresponding to that data processing node.

The local data processing node receives the data redistribution task and the target data partition corresponding to the data redistribution task. And in the execution process of each data reassignment subtask, performing data traversing processing on the data to be processed in the data partition corresponding to the data reassignment subtask, performing data reassignment on the currently traversed data to be processed based on the data characteristics of the currently traversed data to be processed, determining the reassignment partition identification corresponding to the currently traversed data to be processed, and repeating the data traversing process until all the data to be processed in the data partition are traversed.

The reassigned partition identifier corresponding to the currently traversed data to be processed may be a local partition identifier or an external partition identifier, and the local partition identifier is a partition identifier of a target data partition corresponding to the local data processing node. Under the condition that the reassigned partition identifier is a local partition identifier, writing the currently traversed data to be processed into a cache corresponding to the data partition as local data, and storing the data in the cache into a local disk corresponding to a local data processing node when the data volume in the cache reaches a preset data volume threshold. Under the condition that the reassigned partition identifier is an external partition identifier, the currently traversed data to be processed is used as first external data, a first data processing node corresponding to the first external data is determined, the first external data is sent to the first data processing node, and the first data processing node writes data and stores the data into a local disk corresponding to the first data processing node.

In the process that the local data processing node stores the data to be processed in the local, second external data sent by the second data processing node can be received, and the reassigned partition identifier corresponding to the second external data is the local partition identifier. The local data processing node writes the second external data into a cache corresponding to the data partition, and stores the data in the cache into a local disk corresponding to the local data processing node when the data amount in the cache reaches a preset data amount threshold.
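The cache-then-flush behaviour described above can be sketched as a small writer that appends to one file per partition. The class name `PartitionWriter`, the byte-count threshold, and the append-only file layout are assumptions for illustration only.

```python
class PartitionWriter:
    """Buffers records for one data partition in memory and appends the
    buffer to the partition's file on local disk once the buffered amount
    reaches a preset threshold, so disk writes stay sequential."""

    def __init__(self, path: str, flush_threshold_bytes: int):
        self._file = open(path, "ab")  # append mode: sequential writes only
        self._threshold = flush_threshold_bytes
        self._buffer = bytearray()
        self.flush_count = 0

    def write(self, record: bytes) -> None:
        """Accept local data or second external data for this partition."""
        self._buffer += record
        if len(self._buffer) >= self._threshold:
            self.flush()

    def flush(self) -> None:
        """Append the buffered data to disk and empty the cache."""
        if self._buffer:
            self._file.write(self._buffer)
            self._file.flush()
            self._buffer.clear()
            self.flush_count += 1

    def close(self) -> None:
        """Flush any remainder and close the write stream."""
        self.flush()
        self._file.close()
```

In this sketch both locally produced records and records received from other nodes go through the same `write` call, so the partition file only ever grows at its end.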

Referring to fig. 12, fig. 12 is a schematic diagram of data processing in a shuffle operation of a distributed computing system. When the data processing node 1 runs the data redistribution task, it stores local data, sends the first external data corresponding to the data processing node 2 to the data processing node 2, receives the second external data sent by the data processing node 2, sends the first external data corresponding to the data processing node 3 to the data processing node 3, and receives the second external data sent by the data processing node 3. The data processing node 2 and the data processing node 3 send first external data and receive second external data in the same way when running their data redistribution tasks.

After its traversal is finished, each data redistribution subtask in the local data processing node generates prompt information indicating the end of the traversal and sends the prompt information to all other data redistribution subtasks. When a data redistribution subtask has received the end-of-traversal prompt information from all the other data redistribution subtasks, it may stop the data write operation of its corresponding data partition.

After the execution of the data redistribution task is finished, the driving node can generate a reduction processing subtask corresponding to each data partition, and distribute the reduction processing subtask to each corresponding data processing node based on the corresponding relation between the data partition and the data processing node, so that each data processing node receives the corresponding data reduction task. When the data reduction task in the data processing node executes the data reduction processing, the reassigned data can be directly read from the data processing node locally and subjected to the data reduction processing, so that a data reduction processing result is obtained.
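The overall flow, from redistribution on each node to local reduction, can be simulated in a single process as a rough sketch. Everything here is an assumption for illustration: nodes are modelled as list indices, network sends as direct list appends, the data feature as a byte-sum hash of a word, and the reduction as a word count.

```python
from collections import Counter, defaultdict


def run_shuffle(node_inputs, num_partitions):
    """Simulate the shuffle on len(node_inputs) nodes.

    node_inputs[i] is the list of words initially held by node i.
    Returns {partition_id: Counter} computed by the node owning each
    partition, using only that node's local store.
    """
    num_nodes = len(node_inputs)
    # stores[n][pid] models the partition files on node n's local disk.
    stores = [defaultdict(list) for _ in range(num_nodes)]

    # Redistribution phase: every node traverses its own data and routes
    # each record to the node that owns the reassigned partition.
    for words in node_inputs:
        for word in words:
            pid = sum(word.encode("utf-8")) % num_partitions
            owner = pid % num_nodes
            stores[owner][pid].append(word)

    # Reduction phase: each node reduces only partitions it stored locally.
    results = {}
    for store in stores:
        for pid, words in store.items():
            results[pid] = Counter(words)
    return results
```

With three nodes holding `["a", "b", "a"]`, `["b", "c"]`, and `["a", "c", "c"]` and three partitions, each word ends up in exactly one partition, and its count is produced by the node that owns that partition.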

The embodiment of the application provides a data processing method, which comprises the following steps: and executing data reassignment processing based on the data reassignment task assigned by the driving node, and storing the data to be processed of the target data partition corresponding to the data reassignment task locally. Based on the data reduction task distributed by the driving node, the reallocation data corresponding to the target data partition is read locally, and data reduction processing is executed to obtain a data reduction processing result. According to the method, the data partitions are sequentially written in during reassignment, and the data are locally read for processing during reduction processing, so that the execution efficiency of data reassignment and data reduction is improved, the dependence on third-party remote service is avoided, and the cost of operation and deployment is reduced.

The embodiment of the application also provides a data processing device, which is applied to a data processing node side, please refer to fig. 13, and the device includes:

a first task receiving module 1310, configured to receive a data redistribution task allocated by a driving node; the data redistribution task corresponds to the target data partition, and is used for carrying out data redistribution processing on a plurality of pieces of data to be processed in the target data partition; the target data partition is at least one of a plurality of data partitions;

the data reassignment module 1320 is configured to perform data reassignment based on data features of the plurality of items of data to be processed, and determine reassignment partition identifiers corresponding to the plurality of items of data to be processed respectively; the data partition corresponding to the reassigned partition identifier is one of a plurality of data partitions;

the reallocation data storage module 1330 is configured to store, locally, reallocation data corresponding to the target data partition based on the reallocation partition identifiers corresponding to the multiple items of data to be processed;

a second task receiving module 1340, configured to receive data reduction tasks allocated by the driving node; the data reduction task corresponds to a target data partition;

and the data reduction processing module 1350 is configured to obtain the reallocation data in the target data partition locally, and perform data reduction processing on the reallocation data to obtain a data reduction processing result corresponding to the target data partition.

In some embodiments, the reassigned partition identification includes a local partition identification, the local partition identification being a partition identification corresponding to the local target data partition;

the reassignment data storage module comprises:

a local data determining unit configured to determine local data from a plurality of items of data to be processed; the reassigned partition identifier corresponding to the local data is a local partition identifier;

the local data writing unit is used for writing the local data into the target data partition;

and the first reassignment data storage unit is used for storing reassignment data in the target data partition locally.

In some embodiments, the reassigned partition identification includes an external partition identification, the external partition identification being a partition identification corresponding to an external target data partition;

the reassignment data store module further comprises:

a first external data determination unit configured to determine first external data from a plurality of items of data to be processed; the reassigned partition identifier corresponding to the first external data is an external partition identifier;

the first data processing node determining unit is used for determining a first data processing node corresponding to the first external data;

and the first external data sending unit is used for sending the first external data to the first data processing node so that the first data processing node writes the first external data into the data partition corresponding to the external partition identifier.

In some embodiments, the reassignment data comprises local data and second external data; the second external data is data sent by an external second data processing node, and the reassignment partition identifier corresponding to the second external data is a local partition identifier;

the reassignment data store module further comprises:

the second external data receiving unit is used for receiving second external data sent by a second data processing node, wherein the target data partition corresponding to the data redistribution task running in the second data processing node is different from the target data partition corresponding to the data redistribution task running in the local data processing node;

and the first reassignment data storage unit is used for storing the local data and the second external data locally.

In some embodiments, the apparatus further comprises:

the write operation stopping module is used for stopping the data write operation of the target data partition under the condition that prompt information of ending execution of the associated data reassignment task is received; the associated data reassignment tasks are data reassignment tasks running in other data processing nodes than the local data processing node.

The embodiment of the present application further provides a data processing apparatus, which is applied to a driving node side, please refer to fig. 14, and the apparatus includes:

A first task generating module 1410, configured to generate a data reassignment task; the data redistribution task corresponds to the target data partition, and is used for carrying out data redistribution processing on a plurality of pieces of data to be processed in the target data partition; the target data partition is at least one of a plurality of data partitions;

the first task sending module 1420 is configured to send a data reassignment task to each data processing node, so that each data processing node performs data reassignment based on data features of a plurality of items of data to be processed, and determines reassignment partition identifiers corresponding to the plurality of items of data to be processed respectively; based on the reassignment partition identifiers corresponding to the multiple pieces of data to be processed, reassignment data corresponding to the target data partitions are stored locally; the data partition corresponding to the reassigned partition identifier is one of a plurality of data partitions;

a second task generating module 1430 for generating a data reduction task; the data reduction task corresponds to the target data partition, and is used for carrying out data reduction processing on multiple items of reassigned data in the target data partition; the target data partition is at least one of a plurality of data partitions;

The second task sending module 1440 is configured to send, to each data processing node, a data reduction task corresponding to each data processing node, where a target data partition corresponding to the data reduction task running in the data processing node is the same as a target data partition corresponding to a data redistribution task running in the data processing node, so that each data processing node obtains, locally, redistribution data corresponding to the target data partition, and performs data reduction processing on the redistribution data, to obtain a data reduction processing result corresponding to the target data partition.

The device provided in the above embodiment can execute the method provided in any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the method. For technical details not described in the above embodiments, reference may be made to the data processing method provided in any embodiment of the present application.

The present embodiment also provides a computer-readable storage medium having stored therein computer-executable instructions loaded by a processor and executing a data processing method of the present embodiment.

The present embodiments also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in the various alternative implementations of data processing described above.

The present embodiment also provides an electronic device, which includes a processor and a memory, where the memory stores a computer program adapted to be loaded by the processor and to perform a data processing method according to the present embodiment.

The device may be a computer terminal, a mobile terminal or a server, and the device may also participate in forming an apparatus or a system provided by an embodiment of the present application. As shown in fig. 15, the server 15 may include one or more processors 1502 (shown in the figure as 1502a, 1502b, ..., 1502n) (the processor 1502 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPLD), a memory 1504 for storing data, and a transmission device 1506 for communication functions. In addition, the server 15 may further include an input/output interface (I/O interface) and a network interface. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 15 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the server 15 may also include more or fewer components than shown in fig. 15, or have a different configuration than shown in fig. 15.

It should be noted that the one or more processors 1502 and/or other data processing circuits described above may be referred to herein generally as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuitry may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the server 15.

The memory 1504 may be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the method according to the embodiments of the present application. The processor 1502 executes the software programs and modules stored in the memory 1504 to perform various functional applications and data processing, i.e., to implement the data processing method described above. The memory 1504 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1504 may further include memory remotely located relative to the processor 1502, which may be connected to the server 15 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 1506 is configured to receive or transmit data via a network. A specific example of the network described above may include a wireless network provided by a communication provider of the server 15. In one example, the transmission device 1506 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices through a base station to communicate with the internet. In one example, the transmission device 1506 may be a radio frequency (Radio Frequency, RF) module configured to communicate wirelessly with the internet.

The present specification provides the method operation steps described in the embodiments or flowcharts, but more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only execution order. When executed in an actual system or terminal product, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in the context of parallel processors or multi-threaded processing).

The structures shown in this embodiment are only partial structures related to the present application and do not limit the apparatus to which the present application is applied; a specific apparatus may include more or fewer components than those shown, may combine some components, or may have a different arrangement of components. It should be understood that the methods, apparatuses, etc. disclosed in the embodiments may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the modules is merely a division by logical function, and other divisions are possible in an actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices, or unit modules.

Based on such understanding, the technical solution of the present application, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A method of data processing, the method comprising:

2. The data processing method according to claim 1, wherein the reassigned partition identifier includes a local partition identifier, the local partition identifier being a partition identifier corresponding to a local target data partition;

the storing the reassigned data corresponding to the target data partition locally based on the reassigned partition identifiers corresponding to the plurality of pieces of data to be processed includes:

determining local data from the plurality of items of data to be processed; the reassigned partition identifier corresponding to the local data is the local partition identifier;

writing the local data into the target data partition;

and storing the reassignment data in the target data partition locally.

3. The data processing method according to claim 2, wherein the reassigned partition identifier includes an external partition identifier, and the external partition identifier is a partition identifier corresponding to an external target data partition;

after the data reassignment is performed based on the data characteristics of the plurality of items of data to be processed and the reassigned partition identifiers corresponding to the plurality of items of data to be processed are determined, the method further comprises:

determining first external data from the plurality of items of data to be processed; the reassigned partition identifier corresponding to the first external data is the external partition identifier;

determining a first data processing node corresponding to the first external data;

and sending the first external data to the first data processing node so that the first data processing node writes the first external data into a data partition corresponding to the external partition identifier.

4. The data processing method according to claim 2, wherein the reassignment data includes the local data and second external data; the second external data is data sent by an external second data processing node, and a reassignment partition identifier corresponding to the second external data is the local partition identifier;

the storing the reassigned data corresponding to the target data partition locally based on the reassigned partition identifiers corresponding to the plurality of pieces of data to be processed respectively includes:

receiving second external data sent by the second data processing node, wherein the target data partition corresponding to a data reassignment task running in the second data processing node is different from the target data partition corresponding to a data reassignment task running in a local data processing node;

and storing the local data and the second external data locally.

5. The data processing method according to claim 1, wherein after the data reassignment is performed based on the data characteristics of the plurality of pieces of data to be processed and the reassigned partition identifiers corresponding to the plurality of pieces of data to be processed are determined, the method further comprises:

stopping the data writing operation of the target data partition under the condition that prompt information of ending execution of the associated data reassignment task is received; the associated data reassignment task is a data reassignment task running in a data processing node other than the local data processing node.

6. A method of data processing, the method comprising:

7. A data processing apparatus, the apparatus comprising:

8. A data processing apparatus, the apparatus comprising:

9. An electronic device comprising a processor and a memory, wherein the memory has stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by the processor to implement a data processing method according to any of claims 1-6.

10. A computer readable storage medium, characterized in that the storage medium comprises a processor and a memory, in which at least one instruction or at least one program is stored, which is loaded and executed by the processor to implement a data processing method according to any of claims 1-6.
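
By way of illustration only, the flow recited in claims 2 to 5 may be sketched as follows. The sketch is not the claimed implementation. It assumes that each data processing node owns exactly one target data partition, that the reassigned partition identifier is derived from the key of each item, that nodes are plain in-process objects rather than networked machines, and that data reduction means summing values per key. For simplicity, a node stops writing once every reassignment task, including its own, has reported completion. All names (partition_id, Node, run_shuffle) are hypothetical.

```python
from collections import defaultdict


def partition_id(key, num_partitions):
    """Hypothetical rule mapping a data feature (the key) to a partition identifier."""
    # The sum of UTF-8 bytes is stable across processes, unlike the built-in str hash.
    return sum(key.encode("utf-8")) % num_partitions


class Node:
    """One data processing node owning one target data partition (an assumption)."""

    def __init__(self, node_id, num_nodes):
        self.node_id = node_id                # doubles as the local partition identifier
        self.num_nodes = num_nodes
        self.peers = {}                       # external partition identifier -> owning Node
        self.partition = []                   # stands in for the locally stored partition
        self.pending = set(range(num_nodes))  # reassignment tasks not yet reported finished
        self.closed = False

    def write(self, record):
        if self.closed:
            raise RuntimeError("target data partition is closed for writing")
        self.partition.append(record)         # sequential append, never a random write

    def run_reassignment(self, records):
        for key, value in records:
            pid = partition_id(key, self.num_nodes)
            # Local data is written directly; external data is sent to the owning node.
            target = self if pid == self.node_id else self.peers[pid]
            target.write((key, value))
        # Tell every node (this one included) that this reassignment task has ended.
        for node in [self, *self.peers.values()]:
            node.on_task_finished(self.node_id)

    def on_task_finished(self, task_id):
        self.pending.discard(task_id)
        if not self.pending:
            self.closed = True                # stop the data writing operation

    def run_reduction(self):
        # Reduction reads only the local partition; summing per key is an assumption.
        totals = defaultdict(int)
        for key, value in self.partition:
            totals[key] += value
        return dict(totals)


def run_shuffle(inputs):
    """Run one reassignment task per node, in order, and return the nodes."""
    nodes = [Node(i, len(inputs)) for i in range(len(inputs))]
    for node in nodes:
        node.peers = {n.node_id: n for n in nodes if n is not node}
    for node, records in zip(nodes, inputs):
        node.run_reassignment(records)
    return nodes
```

In this sketch each node appends to its own partition and the reduction step reads only that partition, which mirrors the local sequential write and local read described above. A real deployment would replace the direct method calls between nodes with network transfers.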