CN115361382A

CN115361382A - Data processing method, device, equipment and storage medium based on data group

Info

Publication number: CN115361382A
Application number: CN202210956462.5A
Authority: CN
Inventors: 姚宏宇; 朱朝强; 蒲俊; 王东明
Original assignee: BEIJING YOYO TIANYU SYSTEM TECHNOLOGY CO LTD
Current assignee: BEIJING YOYO TIANYU SYSTEM TECHNOLOGY CO LTD
Priority date: 2022-08-10
Filing date: 2022-08-10
Publication date: 2022-11-18
Anticipated expiration: 2042-08-10
Also published as: CN115361382B

Abstract

The embodiment of the application provides a data processing method, a device, equipment and a computer readable storage medium based on a data group. The method comprises the steps of acquiring a data exchange task; splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output; constructing a data group based on the split plurality of subtasks; by the data group, stable operation of the data exchange task is realized. In this way, resource utilization rate can be greatly improved, whether tasks in each server survive can be timely ascertained, and task operation and state circulation efficiency are greatly improved.

Description

Data processing method, device, equipment and storage medium based on data group

Technical Field

Embodiments of the present application relate to the field of data processing, and in particular, to a data group-based data processing method, apparatus, device, and computer-readable storage medium.

Background

For a task group formed by splitting a large data exchange task into small tasks, the tasks need to be executed in different servers in a distributed manner and concurrently. In the running process, each task in the task group needs to be subjected to coordination and task state synchronous management.

At present, the common practice is that each task reports execution progress and state by itself. If the task does not report the progress or the state for a long time, the timing check program actively inquires the task execution server about the task condition. Or the task sends heartbeat to the checking program at regular time to inform the survival state of the task.

However, when the process executing the task or the server crashes or does not respond, a waiting time exists no matter whether the heartbeat is sent or the checking program actively inquires the progress and the state of the task, and the progress and the state of the task cannot be judged in time.

Disclosure of Invention

According to an embodiment of the application, a data group-based data processing scheme is provided.

In a first aspect of the present application, a data group-based data processing method is provided. The method comprises the following steps:

acquiring a data exchange task;

splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output;

constructing a data group based on the split plurality of subtasks; and realizing the stable operation of the data exchange task through the data group.

Further, the constructing a data group based on the split sub-tasks includes:

constructing, based on the split subtasks, a coordinator for sending, receiving and/or managing data, and a worker for sending or receiving data;

constructing a data group based on the coordinator and the worker.

Further, the stable operation of the data exchange task through the data group includes:

the group changes its own state according to the states of input and output.

Further, the states of the inputs and outputs include:

the initial state of the input is UNINITIALIZED; after starting, the input is initialized and joins the group; after the initialization succeeds, the state changes to STATE_SERVICE; after the initialization is finished and META information is received, the state changes to STATE_META_DONE; after data reception starts, the state changes to STATE_START; after data reception is finished, the state changes to STATE_DONE;

the initial state of the output is UNINITIALIZED; after starting, the output is initialized and joins the group; after the initialization succeeds, the state changes to STATE_SERVICE; after the META information is successfully sent, the state changes to STATE_META; after data transmission starts, the state changes to STATE_START; after data transmission is completed, the state changes to STATE_DONE.

Further, the group changing its own state according to the input and output states includes:

the initial state of the group is CLUSTER_UNINITIALIZED, and if all input and output members have joined the group, the state changes to CLUSTER_BOOTSTRAP_SERVICES;

if all the outputs are ready to send data, that is, the output state has changed to STATE_META, the group state changes to CLUSTER_RECEIVE_MATA, and the data are ready to be sent;

if the initialization of the output is completed and META information is received, the state of the input changes to STATE_META_DONE, the state of the group changes to CLUSTER_RUNNING, and the group starts to run;

if the input has finished receiving data and the input state has changed to STATE_DONE, the group state changes to CLUSTER_STOP.

Further, the method also includes:

during the running of the group, if any member goes offline or becomes abnormal, the group immediately receives the message, reports an exception indicating that the number of members does not match the expected number or that a member is abnormal, changes the state of the group to CLUSTER_KILL, and ends the running of the group.

Further, the splitting the data exchange task into a plurality of subtasks includes:

splitting the data exchange task into a plurality of subtasks through an ETL model;

wherein the ETL model comprises a data source, a source adapter, a target adapter, a converter and a data target;

if the data source and the data target are set to operate in a cross-domain mode and different specified container conditions are selected, the data exchange task is divided into a plurality of subtasks in a vertical dividing mode;

if the data source segmentation is set and the operation mode is single node, splitting the data exchange task into a plurality of subtasks in a horizontal splitting mode;

and if the conditions of horizontal splitting and vertical splitting are met simultaneously, splitting the data exchange task into a plurality of subtasks in a hybrid splitting mode.

In a second aspect of the present application, a data group-based data processing apparatus is provided. The device includes:

the acquisition module is used for acquiring a data exchange task;

the splitting module is used for splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output;

the processing module is used for constructing a data group based on the split plurality of subtasks; by the data group, stable operation of the data exchange task is realized.

In a third aspect of the present application, an electronic device is provided. The electronic device includes: a memory having a computer program stored thereon and a processor implementing the method as described above when executing the program.

In a fourth aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the method as according to the first aspect of the present application.

The data processing method based on the data group provided by the embodiment of the application obtains the data exchange task; splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output; constructing a data group based on the split plurality of subtasks; through the data group, stable operation of data exchange tasks is achieved, whether the tasks in the servers survive or not can be timely ascertained, and task operation and state transfer efficiency are greatly improved.

It should be understood that the statements described in this summary are not intended to limit the scope of the disclosure, or the various features described in this summary. Other features of the present application will become apparent from the following description.

Drawings

The above and other features, advantages and aspects of embodiments of the present application will become more apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:

fig. 1 shows a system architecture diagram in accordance with a method provided by an embodiment of the present application.

FIG. 2 shows a flow diagram of a data group-based data processing method according to an embodiment of the present application;

FIG. 3 shows an undivided task diagram according to an embodiment of the application;

FIG. 4 shows a schematic diagram of group partitioning according to an embodiment of the present application;

FIG. 5 illustrates a state flow diagram of inputs according to an embodiment of the application;

FIG. 6 shows a state flow diagram of an output according to an embodiment of the application;

FIG. 7 illustrates a state flow diagram for a group according to an embodiment of the present application;

FIG. 8 shows a block diagram of a data group-based data processing apparatus according to an embodiment of the present application;

fig. 9 shows a schematic structural diagram of a terminal device or a server suitable for implementing the embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

In addition, the term "and/or" herein is only one kind of association relationship describing the association object, and means that there may be three kinds of relationships, for example, a and/or B, and may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

Fig. 1 shows an exemplary system architecture 100 to which embodiments of the data group based data processing method or data group based data processing apparatus of the present application may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like. Various communication client applications, such as a model training application, a video recognition application, a web browser application, social platform software, etc., may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices with a display screen, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.

When the terminals 101, 102, 103 are hardware, a video capture device may also be installed thereon. The video capture device can be any equipment capable of capturing video, such as a camera, a sensor and the like. The user may capture video using a video capture device on the terminal 101, 102, 103.

The server 105 may be a server that provides various services, such as a background server that processes data displayed on the terminal devices 101, 102, 103. The background server may perform processing such as analysis on the received data, and may feed back a processing result (e.g., an identification result) to the terminal device.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In particular, in the case where the target data does not need to be acquired from a remote place, the above system architecture may not include a network, but only a terminal device or a server.

Fig. 2 is a flowchart of a data group-based data processing method according to an embodiment of the present application. As can be seen from fig. 2, the data processing method based on data groups of this embodiment includes the following steps:

S210, acquiring a data exchange task.

In this embodiment, an execution subject (for example, a server shown in fig. 1) for the data group-based data processing method may acquire the data exchange task by a wired manner or a wireless connection manner.

Further, the execution subject may acquire a data exchange task transmitted by an electronic device (for example, the terminal device shown in fig. 1) communicatively connected to it, or may acquire a data exchange task stored locally in advance.

S220, splitting the data exchange task into a plurality of subtasks, wherein each subtask includes an input and an output.

In some embodiments, the data exchange task is split into multiple subtasks by an ETL model. The ETL model comprises a data source, a source adapter, a target adapter, a converter and a data target; the data source is data obtained by extracting the data to be transmitted; the ETL model may include one or more converters, depending on the type of data source and data target.

In some embodiments, if the data source and the data target are both set to run across domains and different specified container conditions are selected, the data exchange task is split into a plurality of subtasks by vertical splitting;

if data source segmentation is set and the operation mode is single node, the data exchange task is split into a plurality of subtasks by horizontal splitting;

and if the conditions of horizontal splitting and vertical splitting are met simultaneously, the data exchange task is split into a plurality of subtasks by hybrid splitting.
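The choice among the three splitting manners can be sketched as a small decision function. This is only an illustration under stated assumptions: the names `SplitMode` and `choose_split_mode` and the two boolean parameters are hypothetical, and how each condition is evaluated from the task configuration is left out.

```python
from enum import Enum


class SplitMode(Enum):
    NONE = "none"              # neither condition holds; the task is not split
    VERTICAL = "vertical"
    HORIZONTAL = "horizontal"
    HYBRID = "hybrid"


def choose_split_mode(vertical_ok: bool, horizontal_ok: bool) -> SplitMode:
    """Pick a splitting manner from two precomputed conditions.

    vertical_ok:   source and target run cross-domain with different
                   specified container conditions (hypothetical flag).
    horizontal_ok: data source segmentation is set (hypothetical flag).
    """
    if vertical_ok and horizontal_ok:
        return SplitMode.HYBRID
    if vertical_ok:
        return SplitMode.VERTICAL
    if horizontal_ok:
        return SplitMode.HORIZONTAL
    return SplitMode.NONE
```

For instance, `choose_split_mode(True, True)` yields `SplitMode.HYBRID`, matching the rule that hybrid splitting applies when both conditions are met.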

In some embodiments, splitting the data exchange task into a plurality of subtasks by way of vertical splitting comprises:

when the data source and the data target are set to run across domains and different specified container conditions (different machine IP, different machine tag) are selected, the task can be split in this manner. The corresponding operation mode is a P2P cross-domain mode. After splitting, the plurality of subtasks together form one task, and they need to be scheduled simultaneously and run concurrently and cooperatively.

Specifically, before splitting, the source adapter + the converter + the target adapter form an ETM, and operate in a container, such as an ETR container. After vertical splitting, combining a source adapter, a converter and a distribution Output into an ETM, named as K1, and deploying in a first container; the distribution Input and the target adapter are merged into one ETM, named as K2, and deployed in the second container. That is, K1, K2 may run in two different containers, respectively, which may be across the network/domain in between.

Further, after vertical splitting, a data RPC sending/receiving step is added between K1 and K2, and the memory queue before splitting is replaced by RPC calling and a remote queue.

Further, regarding the operation mode, the task initialization sequence is K2 and then K1, that is, initialization is performed in reverse order (the running order can be ignored). In operation: the adapter of K1 reads data, the data are passed through memory to the converter for stream computation and then passed into the Distribute Output; the output packs, compresses and encrypts the data and transmits them to the Distribute Input of K2 through RPC (remote procedure call). The input of K2 receives the data and passes them through memory to the target adapter, which finally writes the data into the data target.
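The K1 to K2 data path described above can be sketched in a single process. This is a minimal sketch, not the described system: an in-process `queue.Queue` stands in for the RPC call and remote queue, JSON plus `zlib` stand in for packing and compression, encryption is omitted, and the names `pack`, `unpack`, `run_k1`, `run_k2` and the `None` end-of-stream marker are all assumptions introduced for illustration.

```python
import json
import queue
import zlib


def pack(rows):
    """Pack and compress one batch of rows (encryption omitted in this sketch)."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def unpack(payload):
    """Reverse of pack(): decompress and decode one batch."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


def run_k1(source_rows, convert, channel, batch_size=2):
    """K1: source adapter -> converter -> Distribute Output."""
    batch = []
    for row in source_rows:
        batch.append(convert(row))
        if len(batch) == batch_size:
            channel.put(pack(batch))
            batch = []
    if batch:
        channel.put(pack(batch))
    channel.put(None)  # end-of-stream marker, an assumption of this sketch


def run_k2(channel, data_target):
    """K2: Distribute Input -> target adapter -> data target."""
    while True:
        payload = channel.get()
        if payload is None:
            break
        data_target.extend(unpack(payload))


remote_queue = queue.Queue()  # stands in for the RPC call and remote queue
data_target = []
run_k1([1, 2, 3], lambda x: x * 10, remote_queue)
run_k2(remote_queue, data_target)
```

In a real deployment K1 and K2 would run in different containers and concurrently; here they run one after the other only to keep the sketch self-contained.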

Further, the sending end can monitor (optimize) the data by means of full-queue slowdown, empty packet detection and/or timeout retry and the like; the receiving end can monitor (optimize) the data by means of message idempotence, asynchronous receipt and/or timeout exit.
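Timeout retry is only safe when the receiver is idempotent, because a retried message may already have been applied. The sketch below illustrates that pairing under stated assumptions: the class `IdempotentReceiver`, the function `send_with_retry`, the message ID scheme and the simulated lost acknowledgement are all hypothetical and not taken from the described system.

```python
class IdempotentReceiver:
    """Applies each message ID at most once, so retries cannot duplicate rows."""

    def __init__(self):
        self.seen = set()
        self.rows = []

    def receive(self, msg_id, rows):
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.rows.extend(rows)
        return msg_id  # acknowledgement (receipt) for the sender


def send_with_retry(send, msg_id, rows, max_attempts=3):
    """Retry a send that raises TimeoutError, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(msg_id, rows)
        except TimeoutError:
            if attempt == max_attempts:
                raise


receiver = IdempotentReceiver()
calls = {"count": 0}


def flaky_send(msg_id, rows):
    """Delivers the message, but loses the first acknowledgement."""
    calls["count"] += 1
    ack = receiver.receive(msg_id, rows)
    if calls["count"] == 1:
        raise TimeoutError("acknowledgement lost")
    return ack


send_with_retry(flaky_send, 1, ["a", "b"])
```

The message is delivered twice but applied once, which is the property the receiving end needs when the sending end retries on timeout.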

With vertical splitting, the single-node performance bottleneck can be relieved at the physical level, and cross-network transmission can be performed at the network level when the data source and the data target both designate operation nodes.

In some embodiments, splitting the data exchange task into a plurality of subtasks by way of horizontal splitting comprises:

the purpose of horizontal splitting is to improve the concurrency of tasks and the operation efficiency without affecting the final operation result. Therefore, the split must cover all of the data and distribute it evenly. That is, the data source is segmented uniformly, and the shards are consumed using the read conditions of the data source. For example, for an original SQL statement, the segmented statements are SQL a: select from (original SQL) where rownum%2=0, and SQL b.

Specifically, the data source is split according to its attributes, that is, the fields it contains, to obtain the input condition of the source adapter. For example, after the data source is split, the following fields (input condition) are obtained: SELECT `id`, `name`, `mark`, `pid`, `order_idx`, `update_time`, `driver_class` FROM `e_desired_ds_type_2`, etc.

Further, interval value splitting, fixed value splitting, field value splitting and/or partition table splitting is performed on the input conditions;

interval value splitting applies to date and numeric fields; after splitting, different groups are formed using between conditions that together cover the whole interval. For example, if the interval of id is 1-60, it can be split into: id >= 1 and id < 30; id >= 30 and id <= 60.

Fixed value splitting uses equality conditions and typically acts on enumerated fields with a small range of values. For example: sex = 1; sex = 2.

Field value splitting, i.e., splitting on a numeric field, is typically used on an auto-increment primary key. Groups are formed using the modulus function of the database. For example: MOD(id, 2) = 0; MOD(id, 2) = 1.

Partition table splitting acts on partitioned tables and splits according to the number of partitions and the partition settings of the current database.
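Interval value splitting and field value (modulus) splitting can both be expressed as generators of WHERE predicates. The sketch below is an illustration only: the function names are hypothetical, it handles integer fields only, and it uses half-open ranges, so its boundaries need not coincide with the ones in the example above. What it preserves is the stated requirement of full coverage with no overlap.

```python
def interval_ranges(low, high, parts):
    """Split the closed integer interval [low, high] into half-open ranges."""
    if parts < 1 or high < low:
        raise ValueError("need parts >= 1 and high >= low")
    span = high - low + 1
    bounds = [low + (span * i) // parts for i in range(parts + 1)]
    # Empty ranges appear only when parts exceeds the interval size; drop them.
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]


def interval_predicates(field, low, high, parts):
    """Render each range as a predicate for the source adapter's read condition."""
    return [
        f"{field} >= {a} and {field} < {b}"
        for a, b in interval_ranges(low, high, parts)
    ]


def mod_predicates(field, shards):
    """One predicate per remainder, using the database modulus function."""
    if shards < 1:
        raise ValueError("need shards >= 1")
    return [f"MOD({field}, {shards}) = {r}" for r in range(shards)]
```

Calling `mod_predicates("id", 2)` reproduces the two modulus conditions given in the text, and `interval_ranges(1, 60, 2)` assigns every id from 1 to 60 to exactly one range.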

Based on the split data, a plurality of ETMs are constructed, which can run in parallel.

When the data is horizontally split, if a table clearing operation is set, the table clearing operation is extracted into an independent task and placed serially in front of the shard tasks. Otherwise every subtask would empty the table, which would result in the loss of already written data. The running order is: clear the table first; after the table clearing is finished, call each shard task in parallel.
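The ordering constraint (clear once, then run shards in parallel) can be sketched with a thread pool. This is a minimal sketch under stated assumptions: a Python list stands in for the target table, and the function name `run_horizontal`, its parameters and the lock are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_horizontal(target, shards, clear_first=True):
    """Clear the target once, then write all shards in parallel.

    `target` is a list standing in for the target table.
    Returns the total number of rows written.
    """
    lock = threading.Lock()

    if clear_first:
        # The table-clearing task runs serially, before any shard task starts.
        target.clear()

    def write_shard(rows):
        with lock:  # serialize writes to the shared list
            target.extend(rows)
        return len(rows)

    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        return sum(pool.map(write_shard, shards))


table = ["stale row"]
written = run_horizontal(table, [[2, 4], [1, 3]])
```

If each shard task cleared the table itself, a shard that started late would erase rows already written by the others, which is the data loss the text warns about.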

In some embodiments, splitting the data exchange task into a plurality of subtasks by way of hybrid splitting includes:

and if the conditions of vertical splitting and horizontal splitting are met simultaneously, splitting the data exchange task into a plurality of subtasks based on the characteristics of the vertical splitting and the horizontal splitting.

The hybrid splitting can satisfy both vertical splitting and horizontal splitting functions. That is, in an effective network environment, data distribution of any share is performed, and data fragmentation is performed in network transmission. Namely, when the data is split vertically, horizontal splitting is added, the original single data source node and the original single data target node are changed into multiple copies, namely the copies are changed from 1 × 1 to n × m, a group communication algorithm is added during running, and the state consistency of the 2 nodes is changed into the cooperative consistency of n + m copies.

Specifically, the task operation mode after hybrid splitting corresponds to a distributed operation mode. The distributed mode uses a group communication algorithm to ensure global transaction consistency, and the corresponding numbers of roles are n source nodes and m target nodes. At run time, a group algorithm is used to elect the master node, which acts as the coordinating node, performs group control, and directs the other nodes to perform a series of operations such as state changes and data transmission.

Unlike horizontal and vertical splitting, not only can the source step be split, but the converter and the data target can also specify any number of shards; the distributed operation mode distributes the data uniformly and shards it.
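The change from 1 × 1 to n × m can be made concrete by enumerating the members and links of one group. The sketch below is purely illustrative; the function name `build_group` and the member naming scheme are assumptions.

```python
from itertools import product


def build_group(n_sources, m_targets):
    """Return the n + m group members and the n x m output-to-input links."""
    outputs = [f"output-{i}" for i in range(1, n_sources + 1)]
    inputs = [f"input-{j}" for j in range(1, m_targets + 1)]
    members = outputs + inputs              # n + m members must stay consistent
    links = list(product(outputs, inputs))  # every output may send to every input
    return members, links


members, links = build_group(2, 3)
```

With n = 2 and m = 3 the group has 5 members to keep consistent and 6 possible data links, compared with 2 members and 1 link in the purely vertical case.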

In some embodiments, the ETM is generally equivalent to a subtask obtained by splitting the data exchange task.

S230, constructing a data group based on the split plurality of subtasks; by the data group, stable operation of the data exchange task is realized.

In some embodiments, the split subtasks each include one input (Distribute Input) and one output (Distribute Output), and different subtasks can be executed on different servers (containers).

The task before splitting can only be executed on one server. As shown in fig. 3, before splitting, a task is executed on one server and includes an input, two converters, and an output.

After splitting, a plurality of subtasks can be generated and executed on a plurality of servers. The following is illustrated by way of example:

specifically, as shown in fig. 4, after a task is split in the hybrid manner, 6 subtasks are obtained (the specific number of subtasks may be determined according to the task complexity: the higher the task complexity, the more subtasks are produced, which is not specifically limited herein). The 6 subtasks may each run on a different server, for example, on 6 servers; they may also be combined arbitrarily depending on the application scenario, for example, subtasks 1, 2 run on one server, and subtasks 3, 4, 5 run on another server.

The output of the previous subtask (Distribute Output) is used as the input of the next subtask (Distribute Input); for example, the output of subtask 1 is the input of subtask 2. That is, within one group, outputs send data and inputs receive data. The Distribute Input is used for receiving data and is the front component of each subtask (except for the Input), and the Distribute Output is used for sending data and is the rear component of each subtask (except for the Output). Group A is composed of Distribute Output A-1, Distribute Output A-2, Distribute Input A-1 and Distribute Input A-2, and group B is composed of Distribute Output B-1, Distribute Output B-2, Distribute Input B-1 and Distribute Input B-2.

Each group may have two roles: coordinators and workers. Usually, there is only one coordinator, which is responsible for sending or receiving data and for managing the whole group; the number of workers can be multiple, and the workers can be used for sending or receiving data.

Preferably, the first member to join the group will generally act as a coordinator. And then, the join and exit events of each member are notified to the coordinator, and the coordinator updates the state of the group according to the events and the states of the members. When all members join, the group establishment is successful.
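Group establishment as described above can be sketched with a small membership class. This is a minimal sketch under stated assumptions: the class name `Group`, its methods and the member names are hypothetical, and real join and exit events would arrive through a group communication layer rather than direct method calls.

```python
class Group:
    """Tracks membership; the first member to join becomes the coordinator."""

    def __init__(self, expected_members):
        self.expected_members = expected_members
        self.members = []
        self.coordinator = None

    def join(self, name):
        if name in self.members:
            return  # duplicate join events are ignored
        self.members.append(name)
        if self.coordinator is None:
            self.coordinator = name

    def leave(self, name):
        if name in self.members:
            self.members.remove(name)

    @property
    def established(self):
        """The group is established once all expected members have joined."""
        return len(self.members) == self.expected_members


group_a = Group(expected_members=4)
for member in ["Output A-1", "Output A-2", "Input A-1", "Input A-2"]:
    group_a.join(member)
```

Here `Output A-1` joins first and so acts as coordinator; the other three members are workers.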

Further, after the group is established, each Distribute Output sends the data of its subtask to all Distribute Inputs of the next subtask in the group. The Distribute Output is capable of flow control: when the processing speed of a given Distribute Input is higher, the speed of data transmission to it is increased.

All sub-tasks after distributed splitting of one task (data exchange task) are used as a group, and input and output are members of the group. When the group is successfully established, if one member crashes or is offline, the group immediately senses the crash and changes the group state according to the current member condition.

In some embodiments, the state of the input (Distribute Input) flows as shown in fig. 5:

the initial state of the input (Distribute Input) is UNINITIALIZED; after startup, the input is initialized and joins the group. The state changes to STATE_SERVICE after initialization succeeds. At this point, the input waits for the output initialization to complete. Once META information is received, the state changes to STATE_META_DONE and data reception begins; after data starts to be received, the state changes to STATE_START; the state changes to STATE_DONE after reception is completed.

Further, the state flow of the output (Distribute Output) is as shown in fig. 6:

the initial state of the output (Distribute Output) is UNINITIALIZED; after startup, the output is initialized and joins the group. The state changes to STATE_SERVICE after initialization succeeds. The output then sends META information. After the META information is successfully transmitted, the state changes to STATE_META and data transmission begins. After data starts to be sent, the state changes to STATE_START; the state changes to STATE_DONE after transmission is completed.
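The two member state flows can be sketched as linear state machines that differ only in their third state. The state names come from the text; the class `Member`, the tuple constants and the `advance` method are hypothetical, and the events that trigger each step are reduced to a single call.

```python
INPUT_FLOW = (
    "UNINITIALIZED",
    "STATE_SERVICE",     # initialization succeeded
    "STATE_META_DONE",   # META information received
    "STATE_START",       # data reception started
    "STATE_DONE",        # data reception finished
)

OUTPUT_FLOW = (
    "UNINITIALIZED",
    "STATE_SERVICE",     # initialization succeeded
    "STATE_META",        # META information sent
    "STATE_START",       # data transmission started
    "STATE_DONE",        # data transmission finished
)


class Member:
    """A Distribute Input or Distribute Output that only moves forward."""

    def __init__(self, flow):
        self.flow = flow
        self.index = 0

    @property
    def state(self):
        return self.flow[self.index]

    def advance(self):
        """Move to the next state; a finished member cannot advance further."""
        if self.index == len(self.flow) - 1:
            raise RuntimeError("member is already in its final state")
        self.index += 1
        return self.state


distribute_input = Member(INPUT_FLOW)
distribute_output = Member(OUTPUT_FLOW)
distribute_input.advance()   # input reaches STATE_SERVICE
distribute_output.advance()  # output reaches STATE_SERVICE
distribute_output.advance()  # output has sent META information
distribute_input.advance()   # input has received META information
```

Modelling each flow as an ordered tuple makes the forward-only property explicit: there is no operation that moves a member back to an earlier state.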

In some embodiments, while the input (Distribute Input) and the output (Distribute Output) are executing, the group changes its own state according to the states of the input and the output, and performs the corresponding processing;

the distributed task cluster state is as shown in table 1:

TABLE 1

Cluster change trigger conditions, as shown in table 2:

TABLE 2

It should be noted that the cluster state values can only change from small to large; they cannot be reversed or change out of order. In table 2, "+" indicates a role state or a cluster state whose change is triggered by the current step. After the master node determines that the roles of all group members are consistent, it triggers the condition for the next change.

Specifically, as shown in FIG. 7, the initial state of the group is CLUSTER_UNINITIALIZED. After the group is started, it waits for members to join, and the state changes to CLUSTER_BOOTSTRAP_PENDING.

If all input and output members (all subtasks) have joined the group, the state changes to CLUSTER_BOOTSTRAP_SERVICES. When all the outputs are ready to send data, that is, the output state has changed to STATE_META, the group state changes to CLUSTER_RECEIVE_MATA, and data is ready to be sent.

When the input is ready to receive data, that is, the input state has changed to STATE_META_DONE, the group state changes to CLUSTER_RUNNING, and the group starts to run;

if the output has finished sending data, that is, the output state is STATE_DONE, the group state changes to CLUSTER_RUNNING_DONE. At this time, the group waits for the input to finish receiving data;

if the input has finished receiving data, that is, the input state has changed to STATE_DONE (the entire data exchange process has ended), the group state changes to CLUSTER_STOP.

Further, during the whole group operation process, if any member goes offline or becomes abnormal, the group receives the message immediately, reports an exception indicating that the number of members does not match the expected number or that a member is abnormal, changes the group state to CLUSTER_KILL, and ends the group operation.
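The group state flow can be sketched as a function from member states to a group state, plus a guard that only lets the state move forward. The state names come from the text; everything else is an assumption of this sketch, in particular the numeric values of the enum (Table 1 is not reproduced here, so only their relative order is meaningful), the rank tables, and the names `derive_state`, `GroupState`, `update` and `member_lost`.

```python
from enum import IntEnum


class Cluster(IntEnum):
    # Numeric values are assumptions; only their order matters here.
    CLUSTER_UNINITIALIZED = 0
    CLUSTER_BOOTSTRAP_PENDING = 1
    CLUSTER_BOOTSTRAP_SERVICES = 2
    CLUSTER_RECEIVE_MATA = 3
    CLUSTER_RUNNING = 4
    CLUSTER_RUNNING_DONE = 5
    CLUSTER_STOP = 6
    CLUSTER_KILL = 7


INPUT_RANK = {"UNINITIALIZED": 0, "STATE_SERVICE": 1, "STATE_META_DONE": 2,
              "STATE_START": 3, "STATE_DONE": 4}
OUTPUT_RANK = {"UNINITIALIZED": 0, "STATE_SERVICE": 1, "STATE_META": 2,
               "STATE_START": 3, "STATE_DONE": 4}


def derive_state(expected, inputs, outputs):
    """Map the current input and output states of one group to a group state."""
    if not inputs or not outputs or len(inputs) + len(outputs) < expected:
        return Cluster.CLUSTER_BOOTSTRAP_PENDING
    in_min = min(INPUT_RANK[s] for s in inputs)     # slowest input
    out_min = min(OUTPUT_RANK[s] for s in outputs)  # slowest output
    if in_min == INPUT_RANK["STATE_DONE"]:
        return Cluster.CLUSTER_STOP
    if out_min == OUTPUT_RANK["STATE_DONE"]:
        return Cluster.CLUSTER_RUNNING_DONE
    if in_min >= INPUT_RANK["STATE_META_DONE"]:
        return Cluster.CLUSTER_RUNNING
    if out_min >= OUTPUT_RANK["STATE_META"]:
        return Cluster.CLUSTER_RECEIVE_MATA
    return Cluster.CLUSTER_BOOTSTRAP_SERVICES


class GroupState:
    def __init__(self, expected):
        self.expected = expected
        self.state = Cluster.CLUSTER_UNINITIALIZED

    def update(self, inputs, outputs):
        candidate = derive_state(self.expected, inputs, outputs)
        if candidate > self.state:  # state values only change from small to large
            self.state = candidate
        return self.state

    def member_lost(self):
        """A member went offline or became abnormal before the group stopped."""
        if self.state < Cluster.CLUSTER_STOP:
            self.state = Cluster.CLUSTER_KILL
        return self.state
```

Because `update` never lowers the state, a group that has reached CLUSTER_KILL stays there even if later member reports would otherwise map to an earlier state.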

According to the embodiment of the disclosure, the following technical effects are achieved:

in the group mode, the same group of tasks are in one group. After any task in the group is crashed or disconnected, the task state can be sensed in real time, the state consistency of task members in the group is ensured, and the task running and state transfer efficiency is greatly improved.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all alternative embodiments and that the acts and modules referred to are not necessarily required for the application.

The above is a description of embodiments of the method, and the embodiments of the apparatus are described further below.

Fig. 8 shows a block diagram of a data group-based data processing apparatus 800 according to an embodiment of the present application, and as shown in fig. 8, the apparatus 800 comprises:

an obtaining module 810, configured to obtain a data exchange task;

a splitting module 820, configured to split the data exchange task into a plurality of subtasks, where each subtask includes an input and an output;

a processing module 830, configured to construct a data group based on the split sub-tasks; and finishing data exchange through the data group.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the described module may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.

Fig. 9 shows a schematic structural diagram of a terminal device or a server suitable for implementing the embodiments of the present application.

As shown in fig. 9, the terminal device or server 900 includes a Central Processing Unit (CPU) 901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 910 as necessary, so that a computer program read from it can be installed into the storage section 908 as necessary.

In particular, the above method flow steps may be implemented as a computer software program according to embodiments of the present application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present application are executed when the computer program is executed by a Central Processing Unit (CPU) 901.

It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor. The names of these units or modules do not, in some cases, constitute a limitation on the units or modules themselves.

As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may be separate and not incorporated into the electronic device. The computer readable storage medium stores one or more programs that when executed by one or more processors perform the methods described herein.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the application referred to in the present application is not limited to the embodiments with a particular combination of the above-mentioned features, but also encompasses other embodiments with any combination of the above-mentioned features or their equivalents without departing from the spirit of the application. For example, a technical solution may be formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims

1. A method for data processing based on data groups, comprising:

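acquiring a data exchange task;

splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output;

constructing a data group based on the split plurality of subtasks; and realizing stable operation of the data exchange task through the data group.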

2. The method of claim 1, wherein constructing a data group based on the split plurality of subtasks comprises:

constructing the data group based on a coordinator and workers.

3. The method of claim 2, wherein the enabling stable operation of the data exchange task through the data group comprises:

the group changes its own state according to the states of input and output.

4. The method of claim 3, wherein the states of the inputs and outputs comprise:

the initial state of an input is UNINITIALIZED; after starting, the input performs initialization and joins the group; after the initialization succeeds, the state is changed to STATE_SERVICE; after the initialization is finished and META information is received, the state is changed to STATE_META_DONE; after data reception starts, the state is changed to STATE_START; after the data reception is finished, the state is changed to STATE_DONE;

the initial state of an output is UNINITIALIZED; after starting, the output performs initialization and joins the group; after the initialization succeeds, the state is changed to STATE_SERVICE; after the META information is sent successfully, the state is changed to STATE_META; after data transmission starts, the state is changed to STATE_START; after the data transmission is completed, the state is changed to STATE_DONE.
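As a non-limiting illustration, the input and output state sequences recited in claim 4 can be sketched as two small linear state machines. The state names are those of the claim, written as plain identifiers; the `Member` class, the `role` strings, and the `advance` method are hypothetical, and the sketch assumes that each member moves forward exactly one state at a time.

```python
from enum import Enum


class MemberState(Enum):
    UNINITIALIZED = "UNINITIALIZED"
    STATE_SERVICE = "STATE_SERVICE"
    STATE_META_DONE = "STATE_META_DONE"  # used by inputs only
    STATE_META = "STATE_META"            # used by outputs only
    STATE_START = "STATE_START"
    STATE_DONE = "STATE_DONE"


# Order of states for each role, following the claim text.
ORDERS = {
    "input": [
        MemberState.UNINITIALIZED,
        MemberState.STATE_SERVICE,    # initialization succeeded
        MemberState.STATE_META_DONE,  # META information received
        MemberState.STATE_START,      # data reception started
        MemberState.STATE_DONE,       # data reception finished
    ],
    "output": [
        MemberState.UNINITIALIZED,
        MemberState.STATE_SERVICE,    # initialization succeeded
        MemberState.STATE_META,       # META information sent
        MemberState.STATE_START,      # data transmission started
        MemberState.STATE_DONE,       # data transmission completed
    ],
}


class Member:
    """Hypothetical input or output member of a data group."""

    def __init__(self, role: str) -> None:
        if role not in ORDERS:
            raise ValueError(f"unknown role: {role}")
        self.role = role
        self._order = ORDERS[role]
        self.state = MemberState.UNINITIALIZED

    def advance(self) -> MemberState:
        """Move to the next state in this role's sequence."""
        index = self._order.index(self.state)
        if index == len(self._order) - 1:
            raise RuntimeError("member is already in STATE_DONE")
        self.state = self._order[index + 1]
        return self.state
```

Calling `advance()` four times on a `Member("input")` walks it from UNINITIALIZED to STATE_DONE; a fifth call raises an error, since no later state is recited.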

5. The method of claim 4, wherein the group changes its state according to the states of the input and output, comprising:

the initial state of the group is CLUSTER_UNINITIALIZED, and if all input and output members have joined the group, the state is changed to CLUSTER_BOOTSTRAP_SERVICES;

6. The method of claim 5, further comprising:

during operation of the group, if some members go offline or become abnormal, the group immediately receives a message, reports an exception indicating that the number of members does not match the expected number or that a member is abnormal, changes the group state to CLUSTER_KILL, and ends the group operation.
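By way of a hedged illustration only, the group-level behavior recited in claims 5 and 6 can be sketched as follows. Only the three group states named above are modeled; the `DataGroup` class, its method names, the member identifiers, and the wording of the error message are hypothetical, and the membership notifications are assumed to arrive as direct method calls rather than through any particular coordination service.

```python
from enum import Enum
from typing import Iterable, Optional, Set


class GroupState(Enum):
    CLUSTER_UNINITIALIZED = "CLUSTER_UNINITIALIZED"
    CLUSTER_BOOTSTRAP_SERVICES = "CLUSTER_BOOTSTRAP_SERVICES"
    CLUSTER_KILL = "CLUSTER_KILL"


class DataGroup:
    """Hypothetical group that tracks its input and output members."""

    def __init__(self, expected_members: Iterable[str]) -> None:
        self.expected: Set[str] = set(expected_members)
        self.joined: Set[str] = set()
        self.state = GroupState.CLUSTER_UNINITIALIZED
        self.error: Optional[str] = None

    def on_join(self, member_id: str) -> None:
        """Record a member joining; bootstrap once everyone is present."""
        if self.state is GroupState.CLUSTER_KILL:
            return  # the group has already ended
        if member_id not in self.expected:
            raise ValueError(f"unexpected member: {member_id}")
        self.joined.add(member_id)
        if (self.state is GroupState.CLUSTER_UNINITIALIZED
                and self.joined == self.expected):
            self.state = GroupState.CLUSTER_BOOTSTRAP_SERVICES

    def on_member_lost(self, member_id: str) -> None:
        """React to an offline or abnormal member by killing the group."""
        self.joined.discard(member_id)
        self.error = (
            f"expected {len(self.expected)} members, "
            f"found {len(self.joined)}"
        )
        self.state = GroupState.CLUSTER_KILL
```

For a group expecting one input and one output, the state stays CLUSTER_UNINITIALIZED until both have joined, and losing either member afterwards moves the group to CLUSTER_KILL with the mismatch recorded in `error`.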

7. The method of claim 1, wherein the splitting the data exchange task into a plurality of subtasks comprises:

wherein the ETL model comprises a data source, a source adapter, a target adapter, a translator, and a data target;

if the data source segmentation is set and the operation mode is a single node, splitting the data exchange task into a plurality of subtasks in a horizontal splitting mode;

8. A data group-based data processing apparatus, comprising:

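the acquisition module is used for acquiring a data exchange task;

the splitting module is used for splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output;

the processing module is used for constructing a data group based on the split plurality of subtasks, and for realizing stable operation of the data exchange task through the data group.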

9. An electronic device comprising a memory and a processor, the memory having a computer program stored thereon, wherein the processor, when executing the computer program, implements the method of any one of claims 1-7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.