CN115361382B - Data processing method, device, equipment and storage medium based on data group - Google Patents

Publication number
CN115361382B
Authority
CN
China
Prior art keywords
data
state
group
splitting
subtasks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210956462.5A
Other languages
Chinese (zh)
Other versions
CN115361382A (en
Inventor
姚宏宇
朱朝强
蒲俊
王东明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING YOYO TIANYU SYSTEM TECHNOLOGY CO LTD
Original Assignee
BEIJING YOYO TIANYU SYSTEM TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING YOYO TIANYU SYSTEM TECHNOLOGY CO LTD
Priority to CN202210956462.5A
Publication of CN115361382A
Application granted
Publication of CN115361382B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)

Abstract

The embodiments of the application provide a data group-based data processing method, apparatus, device and computer-readable storage medium. The method comprises: acquiring a data exchange task; splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output; constructing a data group based on the plurality of split subtasks; and running the data exchange task stably through the data group. In this way, resource utilization can be greatly improved, whether the tasks on each server are still alive can be ascertained promptly, and the efficiency of task running and state transitions is greatly improved.

Description

Data processing method, device, equipment and storage medium based on data group
Technical Field
Embodiments of the present application relate to the field of data processing, and in particular, to a data group-based data processing method, apparatus, device, and computer-readable storage medium.
Background
When a large data exchange task is split into small tasks, the resulting task group must be executed concurrently, in a distributed manner, on different servers. While the group is running, its tasks must be coordinated and their states kept synchronized.
At present, the common practice is for each task to report its own execution progress and state. If a task has not reported its progress or state for a long time, a scheduled checking program actively queries the server executing the task about the task's condition. Alternatively, the task sends heartbeats to the checking program at regular intervals to signal that it is alive.
However, when a process or server executing a task crashes or stops responding, there is a waiting period regardless of whether heartbeats are sent or the checking program actively queries the task, so the progress and state of the task cannot be determined in time.
Disclosure of Invention
According to an embodiment of the application, a data group-based data processing scheme is provided.
In a first aspect of the present application, a data group-based data processing method is provided. The method comprises the following steps:
acquiring a data exchange task;
splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output;
constructing a data group based on the split plurality of subtasks; and realizing the stable operation of the data exchange task through the data group.
Further, the constructing a data group based on the split sub-tasks includes:
based on the split subtasks, a coordinator for data transmission, reception and/or management is constructed; and, a worker for data transmission or reception;
based on the coordinators and workers, a data group is constructed.
Further, the stable operation of the data exchange task through the data group includes:
the group changes its own state according to the states of input and output.
Further, the states of the inputs and outputs include:
the initial state of the input is UNINITIALIZED; after starting, the input is initialized and joins the group; after the initialization succeeds, the state changes to STATE_SERVICE; after the initialization of the output is finished and META information is received, the state changes to STATE_META_DONE; after data reception starts, the state changes to STATE_START; after data reception is finished, the state changes to STATE_DONE;
the initial state of the output is UNINITIALIZED; after starting, the output is initialized and joins the group; after the initialization succeeds, the state changes to STATE_SERVICE; after the META information is sent successfully, the state changes to STATE_META; after data transmission starts, the state changes to STATE_START; after data transmission is completed, the state changes to STATE_DONE.
Further, the group changing its own state according to the states of input and output includes:
the initial state of the group is CLUSTER_UNINITIALIZED, and if all input and output members have joined the group, the state changes to CLUSTER_BOOTSTRAP_SERVICES;
if all the outputs are ready to send data, that is, the output state has changed to STATE_META, the group state changes to CLUSTER_RECEIVE_MATA, and the data are ready to be sent;
if the initialization of the output is completed and META information is received, so that the state of the input has changed to STATE_META_DONE, the state of the group changes to CLUSTER_RUNNING, and the group starts to operate;
if the input has finished receiving data, so that the input state has changed to STATE_DONE, the group state changes to CLUSTER_STOP.
Further, the method also includes:
while the group is running, if any member goes offline or becomes abnormal, the group immediately receives the message and reports an exception (either that the number of members does not match the expected number, or another abnormality), changes the state of the group to CLUSTER_KILL, and ends the running of the group.
Further, the splitting the data exchange task into a plurality of subtasks includes:
splitting the data exchange task into a plurality of subtasks through an ETL model;
wherein the ETL model comprises a data source, a source adapter, a target adapter, a converter and a data target;
if the data source and the data target are set to operate in a cross-domain mode and different specified container conditions are selected, the data exchange task is divided into a plurality of subtasks in a vertical dividing mode;
if the data source segmentation is set and the operation mode is single node, splitting the data exchange task into a plurality of subtasks in a horizontal splitting mode;
and if the conditions of horizontal splitting and vertical splitting are met simultaneously, splitting the data exchange task into a plurality of subtasks in a mixed splitting mode.
In a second aspect of the present application, a data group-based data processing apparatus is provided. The device includes:
the acquisition module is used for acquiring a data exchange task;
the splitting module is used for splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output;
the processing module is used for constructing a data group based on the split plurality of subtasks; by the data group, stable operation of the data exchange task is realized.
In a third aspect of the present application, an electronic device is provided. The electronic device includes: a memory having a computer program stored thereon and a processor implementing the method as described above when executing the program.
In a fourth aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the method as according to the first aspect of the present application.
According to the data processing method based on the data group, the data exchange task is obtained; splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output; constructing a data group based on the split plurality of subtasks; through the data group, the stable operation of the data exchange task is realized, whether the task in each server survives or not can be timely ascertained, and the task operation and state transfer efficiency is greatly improved.
It should be understood that what is described in this summary section is not intended to limit key or critical features of the embodiments of the application, nor is it intended to limit the scope of the application. Other features of the present application will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present application will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters denote like or similar elements, and wherein:
fig. 1 shows a system architecture diagram related to the method provided by the embodiment of the present application.
FIG. 2 shows a flow diagram of a data group-based data processing method according to an embodiment of the present application;
FIG. 3 illustrates a schematic diagram of an unsplit task according to an embodiment of the present application;
FIG. 4 shows a schematic diagram of group partitioning according to an embodiment of the present application;
FIG. 5 illustrates a state flow diagram of an input according to an embodiment of the present application;
FIG. 6 shows a state flow diagram of an output according to an embodiment of the application;
FIG. 7 illustrates a state flow diagram for a group according to an embodiment of the present application;
FIG. 8 shows a block diagram of a data group-based data processing apparatus according to an embodiment of the present application;
fig. 9 shows a schematic structural diagram of a terminal device or a server suitable for implementing the embodiments of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the data group based data processing method or the data group based data processing apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like. Various communication client applications, such as a model training application, a video recognition application, a web browser application, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices with a display screen, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not specifically limited herein.
When the terminals 101, 102, 103 are hardware, a video capture device may also be installed thereon. The video acquisition equipment can be various equipment capable of realizing the function of acquiring video, such as a camera, a sensor and the like. The user may capture video using a video capture device on the terminal 101, 102, 103.
The server 105 may be a server that provides various services, such as a background server that processes data displayed on the terminal devices 101, 102, 103. The background server may perform processing such as analysis on the received data, and may feed back a processing result (e.g., an identification result) to the terminal device.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not specifically limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In particular, in the case where the target data does not need to be acquired from a remote place, the above system architecture may not include a network but only a terminal device or a server.
Fig. 2 is a flowchart of a data group-based data processing method according to an embodiment of the present application. As can be seen from fig. 2, the data processing method based on data groups of this embodiment includes the following steps:
s210, acquiring a data exchange task.
In this embodiment, an execution subject (for example, a server shown in fig. 1) for the data group-based data processing method may acquire the data exchange task by a wired manner or a wireless connection manner.
Further, the execution main body may acquire a data exchange task transmitted by an electronic device (for example, the terminal device shown in fig. 1) communicatively connected to the execution main body, or may be a data exchange task stored locally in advance.
S220, splitting the data exchange task into a plurality of subtasks, wherein each subtask includes an input and an output.
In some embodiments, the data exchange task is split into multiple subtasks by an ETL model. The ETL model comprises a data source, a source adapter, a target adapter, a converter and a data target; the data source is the data obtained by extracting the data to be transmitted; the ETL model may include one or more converters, depending on the types of the data source and the data target.
In some embodiments, if the data source and the data target both set cross-domain operation and select different specified container conditions, the data exchange task is split into a plurality of subtasks in a vertical splitting manner;
if the data source segmentation is set and the operation mode is a single node, splitting the data exchange task into a plurality of subtasks in a horizontal splitting mode;
and if the conditions of horizontal splitting and vertical splitting are met simultaneously, splitting the data exchange task into a plurality of subtasks in a mixed splitting mode.
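The three conditions above can be read as a small decision rule. The following Python sketch is illustrative only: the `TaskConfig` fields and the `choose_split_mode` function are hypothetical names that do not appear in the application, and the two conditions are treated as independent flags because the text does not spell out how they combine in a real configuration.

```python
from dataclasses import dataclass
from enum import Enum


class SplitMode(Enum):
    NONE = "none"
    VERTICAL = "vertical"
    HORIZONTAL = "horizontal"
    HYBRID = "hybrid"


@dataclass
class TaskConfig:
    # All field names are hypothetical; the application does not name them.
    source_cross_domain: bool    # data source is set to run cross-domain
    target_cross_domain: bool    # data target is set to run cross-domain
    different_containers: bool   # source and target specify different containers
    source_segmentation: bool    # data source segmentation is configured
    single_node: bool            # operation mode is single node


def choose_split_mode(cfg: TaskConfig) -> SplitMode:
    """Map the configured conditions to one of the splitting modes."""
    vertical = (cfg.source_cross_domain and cfg.target_cross_domain
                and cfg.different_containers)
    horizontal = cfg.source_segmentation and cfg.single_node
    if vertical and horizontal:
        return SplitMode.HYBRID
    if vertical:
        return SplitMode.VERTICAL
    if horizontal:
        return SplitMode.HORIZONTAL
    return SplitMode.NONE
```

Returning `SplitMode.NONE` when neither condition holds is an assumption of the sketch; the text only describes the three splitting cases.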
In some embodiments, splitting the data exchange task into a plurality of subtasks by way of vertical splitting comprises:
vertical splitting may be used when both the data source and the data target are set to run cross-domain and different specified container conditions (different machine IP, different machine tag) are selected. The corresponding operation mode is a P2P cross-domain mode. After splitting, the plurality of subtasks together form one task and must be scheduled simultaneously and run concurrently and cooperatively.
Specifically, before splitting, the source adapter, the converter and the target adapter constitute one ETM, which runs in a container such as an ETR container. After vertical splitting, the source adapter, the converter and a Distribute Output are merged into one ETM, named K1, which is deployed in a first container; a Distribute Input and the target adapter are merged into another ETM, named K2, which is deployed in a second container. That is, K1 and K2 may run in two different containers, which may be separated by a network or domain boundary.
Further, after vertical splitting, a data RPC sending/receiving step is added between K1 and K2, and the memory queue before splitting is replaced by an RPC calling and remote queue.
Further, the tasks are initialized in the order K2, then K1, that is, in reverse order (the running order can be ignored). At run time, the source adapter of K1 reads data; the data flows through memory to the converter for calculation and is passed to the Distribute Output, which packs, compresses and encrypts the data and sends it to the Distribute Input of K2 through RPC (remote procedure call). The Distribute Input of K2 receives the data, passes it through memory to the target adapter, and the data is finally written to the data target.
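The K1 to K2 data path described above can be sketched in a few lines of Python. This is a simplified illustration under several assumptions: an in-process `queue.Queue` stands in for the RPC call and remote queue, JSON plus `zlib` stand in for the unspecified packing and compression formats, encryption is omitted, and the function names `k1_run` and `k2_run` are hypothetical.

```python
import json
import queue
import zlib

END_OF_STREAM = None  # sentinel marking the end of the data (an assumption)


def k1_run(rows, convert, remote_queue):
    """K1: source adapter -> converter -> Distribute Output."""
    for row in rows:                       # source adapter reads the data
        converted = convert(row)           # converter performs the calculation
        packed = json.dumps(converted).encode("utf-8")
        remote_queue.put(zlib.compress(packed))  # pack, compress, "send"
    remote_queue.put(END_OF_STREAM)


def k2_run(remote_queue, data_target):
    """K2: Distribute Input -> target adapter -> data target."""
    while True:
        payload = remote_queue.get()       # Distribute Input receives the data
        if payload is END_OF_STREAM:
            break
        row = json.loads(zlib.decompress(payload).decode("utf-8"))
        data_target.append(row)            # target adapter writes to the target


# Usage: the queue is created first, mirroring the K2-before-K1 initialization.
channel = queue.Queue()
written = []
k1_run([{"id": 1}, {"id": 2}], lambda r: {**r, "double": r["id"] * 2}, channel)
k2_run(channel, written)
```

In a real deployment K1 and K2 would run in separate containers and the queue would be replaced by an actual RPC transport.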
Further, data transmission can be monitored (optimized) by means such as slowing down when the queue is full, empty packet detection and/or timeout retry; the receiving end can monitor (optimize) the data through message idempotence, asynchronous receipts and/or timeout exit.
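Timeout retry on the sending side only works safely if the receiving side is idempotent, because a retried message may already have been delivered. The sketch below illustrates that interplay; the class and function names, the use of a message id, and the retry limit are all assumptions made for illustration and are not taken from the application.

```python
class IdempotentReceiver:
    """Accepts each message id once; duplicates are acknowledged but ignored."""

    def __init__(self):
        self.seen_ids = set()
        self.rows = []

    def receive(self, msg_id, row):
        if msg_id not in self.seen_ids:
            self.seen_ids.add(msg_id)
            self.rows.append(row)
        return msg_id  # receipt returned to the sender


def send_with_retry(send, msg_id, row, max_attempts=3):
    """Retry a send that times out, giving up after max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(msg_id, row)
        except TimeoutError:
            if attempt == max_attempts:
                raise


# Usage: the first receipt is "lost", so the sender retries a delivered message.
receiver = IdempotentReceiver()
attempts = {"count": 0}


def flaky_send(msg_id, row):
    attempts["count"] += 1
    receipt = receiver.receive(msg_id, row)
    if attempts["count"] == 1:
        raise TimeoutError("receipt lost in transit")
    return receipt


receipt = send_with_retry(flaky_send, 7, {"id": 7})
```

The receiver ends up with a single copy of the row even though it was delivered twice.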
With vertical splitting, the single-node performance bottleneck can be overcome at the physical level, and data can be transmitted across networks at the network level when both the data source and the data target designate their running nodes.
In some embodiments, splitting the data exchange task into a plurality of subtasks by way of horizontal splitting comprises:
the purpose of horizontal splitting is to increase task concurrency and improve running efficiency without affecting the final result. Therefore, the split must cover all of the data and distribute it evenly. That is, the data source is segmented uniformly, and the shards are consumed by using the read conditions of the data source, for example, original SQL, post-segmentation SQL a: select from (original SQL) where rownum%2=0, SQL b.
Specifically, the data source is split according to its attributes, that is, the fields it contains, to obtain the input conditions of the source adapter. For example, after the data source is split, the following fields (input conditions) are obtained: SELECT `id`, `name`, `mark`, `pid`, `order_idx`, `update_time`, `driver_class` FROM `e_default_ds_type_2`, etc.
Further, interval value splitting, fixed value splitting, field value splitting and/or partition table splitting is performed on the input conditions;
Interval value splitting splits date and numerical fields; after splitting, BETWEEN ... AND style ranges are used to form different groups that together cover the whole interval. For example, if the interval of id is 1-60, it can be split into: id >= 1 and id < 30; id >= 30 and id <= 60.
Fixed value splitting generally applies to enumeration fields with a small range of values. For example: sex = 1; and sex = 2.
Field value splitting, that is, splitting on a numerical field, is typically used on an auto-increment key. The groups are formed with the modulus function of the database. For example: MOD(id, 2) = 0, MOD(id, 2) = 1.
Partition table splitting applies to partitioned tables and splits according to the number of partitions and the partition settings of the current database.
A plurality of ETMs are then constructed based on the split data, and these ETMs can run in parallel.
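Interval value splitting and field value splitting both amount to generating one WHERE predicate per shard. The following Python sketch shows one way to generate such predicates so that every row is covered exactly once. The function names are hypothetical, the ranges are cut into equal-sized steps (so the boundaries differ slightly from the id example above), and the `SELECT * FROM (...) t` wrapper is an assumed SQL form that may need adjusting for a particular database.

```python
def interval_predicates(field, low, high, parts):
    """Cover the closed range [low, high] with contiguous, non-overlapping ranges."""
    total = high - low + 1
    step = -(-total // parts)  # ceiling division
    predicates = []
    start = low
    while start <= high:
        end = start + step
        if end > high:
            # The last range is closed so that `high` itself is included.
            predicates.append(f"{field} >= {start} AND {field} <= {high}")
        else:
            predicates.append(f"{field} >= {start} AND {field} < {end}")
        start = end
    return predicates


def modulo_predicates(field, parts):
    """One predicate per remainder, using the database MOD function."""
    return [f"MOD({field}, {parts}) = {i}" for i in range(parts)]


def shard_queries(original_sql, predicates):
    """Wrap the original query once per shard predicate."""
    return [f"SELECT * FROM ({original_sql}) t WHERE {p}" for p in predicates]
```

For example, `interval_predicates("id", 1, 60, 2)` yields two ranges that together cover ids 1 through 60, and each resulting query can feed one ETM.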
When a task is horizontally split and a table clearing operation is configured, the table clearing operation is extracted into a separate task that is chained serially in front of the shard tasks. Otherwise, every subtask would clear the table, and data already written would be lost. The running order is: clear the table -> after the table has been cleared, invoke the shard tasks produced by the split in parallel.
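The ordering constraint above (clear the table once, then run the shards in parallel) can be sketched as follows. This is a minimal illustration that uses a Python list as a stand-in for the target table and threads as a stand-in for separately scheduled shard tasks; `run_horizontal_task` and the other names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_horizontal_task(shards, clear_table, write_shard):
    """Run the table-clearing step serially, then all shard tasks in parallel."""
    clear_table()  # executed exactly once, before any shard writes
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        # list() forces completion and re-raises any exception from a shard.
        list(pool.map(write_shard, shards))


# Usage with an in-memory "table" that already holds stale data.
table = [99]
table_lock = threading.Lock()


def write_shard(shard_rows):
    with table_lock:
        table.extend(shard_rows)


run_horizontal_task([[1, 2], [3, 4]], table.clear, write_shard)
```

If each shard task cleared the table itself, a shard that starts late would erase rows already written by an earlier shard, which is the data loss the text warns about.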
In some embodiments, splitting the data exchange task into a plurality of subtasks by way of hybrid splitting includes:
if the conditions of vertical splitting and horizontal splitting are met simultaneously, the data exchange task is split into a plurality of subtasks based on the characteristics of both vertical splitting and horizontal splitting.
Hybrid splitting provides the functions of both vertical splitting and horizontal splitting. That is, in an available network environment, data can be distributed in any number of shares, and the data is sharded during network transmission. In other words, horizontal splitting is added on top of vertical splitting: the original single data source node and single data target node each become multiple copies, that is, 1 × 1 becomes n × m; a group communication algorithm is added at run time, and the state consistency between 2 nodes becomes cooperative consistency among n + m copies.
Specifically, the task running mode after hybrid splitting corresponds to a distributed running mode. The distributed mode uses a group communication algorithm to ensure global transaction consistency. The roles are n source nodes and m target nodes. At run time, a group algorithm is used to elect the master node; this coordinating node performs group control and directs the other nodes through operations such as state changes and data transmission.
Unlike horizontal and vertical splitting, hybrid splitting allows any number of shards to be specified for the converter and the data target in addition to the source step; the distributed running mode distributes the data evenly across the shards.
In some embodiments, the ETM is generally equivalent to a subtask obtained by splitting the data exchange task.
S230, constructing a data group based on the split plurality of subtasks; by the data group, stable operation of the data exchange task is realized.
In some embodiments, each split subtask includes one input (Distribute Input) and one output (Distribute Output), and different subtasks can be executed on different servers (containers).
The task before splitting can only be executed on one server. As shown in fig. 3, before splitting, a task including one input, two converters, and one output runs on one server.
The split will generate a plurality of subtasks which can be executed on a plurality of servers. The following is illustrated by way of example:
specifically, as shown in fig. 4, after a task is mixed and split, 6 subtasks are obtained (a specific split number of the subtasks may be determined according to a task complexity, and the higher the task complexity is, the more the split subtasks are, which is not specifically limited herein), where the 6 subtasks may be respectively run on different servers, for example, respectively run on 6 servers; any combination may be used, depending on the application scenario, for example, subtasks 1, 2 run on one server, and subtasks 3, 4, 5 run on another server.
The output (Distribute Output) of the previous subtask serves as the input (Distribute Input) of the next subtask; for example, the output of subtask 1 is the input of subtask 2. That is, within one group, the member that sends data is an output and the member that receives data is an input. The Distribute Input is used for receiving data and is the leading component of each subtask (except Input), and the Distribute Output is used for sending data and is the trailing component of each subtask (except Output). Distribute Output A-1, Distribute Output A-2, Distribute Input A-1 and Distribute Input A-2 form group A, and Distribute Output B-1, Distribute Output B-2, Distribute Input B-1 and Distribute Input B-2 form group B.
Each group may have two roles: coordinators and workers. Usually, there is only one coordinator, which is responsible for sending or receiving data and for managing the whole group; the number of workers can be multiple, and the workers can be used for sending or receiving data.
Preferably, the first member to join the group will generally act as a coordinator. And then, the join and exit events of each member are notified to the coordinator, and the coordinator updates the state of the group according to the events and the states of the members. When all members join, the group establishment is successful.
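The membership rules in the two paragraphs above (first member to join becomes the coordinator, the group is established once every expected member has joined) can be sketched as a small class. The `Group` class and its method names are hypothetical and only illustrate the bookkeeping; they say nothing about the actual group communication protocol.

```python
class Group:
    """Tracks membership; the first member to join acts as coordinator."""

    def __init__(self, expected_members):
        self.expected = set(expected_members)
        self.members = []        # join order is preserved
        self.coordinator = None

    def join(self, member):
        if member not in self.expected:
            raise ValueError(f"unexpected member: {member}")
        if member in self.members:
            raise ValueError(f"member already joined: {member}")
        self.members.append(member)
        if self.coordinator is None:
            self.coordinator = member  # first member to join coordinates

    def role(self, member):
        if member not in self.members:
            raise KeyError(member)
        return "coordinator" if member == self.coordinator else "worker"

    @property
    def established(self):
        # The group is established once every expected member has joined.
        return set(self.members) == self.expected


# Usage with the member names of group A from the example above.
group_a = Group(["Output A-1", "Output A-2", "Input A-1", "Input A-2"])
group_a.join("Output A-1")
group_a.join("Input A-1")
partially_joined = group_a.established
group_a.join("Output A-2")
group_a.join("Input A-2")
```

Here `Output A-1` becomes the coordinator simply because it joined first; which member joins first in practice depends on startup timing.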
Further, after the group is established, each Distribute Output sends the data of its subtask to all Distribute Inputs of the next subtask in the group. The Distribute Output is capable of flow control: when a Distribute Input processes data faster, the rate at which data is sent to it is increased.
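The application does not describe how the flow control is implemented. One possible approach, shown here purely as an assumed illustration, is least-backlog dispatch: each row goes to the Distribute Input with the fewest unprocessed rows, so inputs that drain their backlog faster naturally receive more data. The `dispatch` function and the `backlog` mapping are hypothetical.

```python
def dispatch(rows, backlog):
    """Assign each row to the input with the smallest pending backlog.

    `backlog` maps an input name to its count of unprocessed rows and is
    updated in place; ties are broken by name to keep the result stable.
    """
    assignment = []
    for row in rows:
        target = min(backlog, key=lambda name: (backlog[name], name))
        backlog[target] += 1
        assignment.append((target, row))
    return assignment


# Usage: "in-2" is slower and already has two rows pending.
pending = {"in-1": 0, "in-2": 2}
plan = dispatch(["a", "b", "c", "d"], pending)
```

In this run the faster input `in-1` receives three of the four rows before the slower `in-2` is given more work.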
All sub-tasks after distributed splitting of one task (data exchange task) are used as a group, and input and output are members of the group. When the group is successfully established, if one member crashes or is offline, the group immediately senses the crash and changes the group state according to the current member condition.
In some embodiments, the state flow of the input (Distribute Input) is as shown in fig. 5:
the initial state of the input (Distribute Input) is UNINITIALIZED; after startup, the input is initialized and joins the group. After the initialization succeeds, the state changes to STATE_SERVICE. At this point, the input waits for the output (Distribute Output) to finish initializing. Once META information is received, the state changes to STATE_META_DONE and the input is ready to receive data; after data reception starts, the state changes to STATE_START; after reception is completed, the state changes to STATE_DONE.
Further, the state flow of the output (Distribute Output) is as shown in fig. 6:
the initial state of the output (Distribute Output) is UNINITIALIZED; after startup, the output is initialized and joins the group. After the initialization succeeds, the state changes to STATE_SERVICE. The output then sends META information. After the META information is sent successfully, the state changes to STATE_META and the output is ready to send data. After data transmission starts, the state changes to STATE_START; after transmission is completed, the state changes to STATE_DONE.
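The two member state flows described above are linear, so they can be sketched as ordered sequences with a forward-only cursor. The state names below are taken from the text; the `Member` class and its `advance` method are hypothetical, and the sketch does not model the events (initialization, META exchange, data transfer) that trigger each step.

```python
INPUT_FLOW = ("UNINITIALIZED", "STATE_SERVICE", "STATE_META_DONE",
              "STATE_START", "STATE_DONE")
OUTPUT_FLOW = ("UNINITIALIZED", "STATE_SERVICE", "STATE_META",
               "STATE_START", "STATE_DONE")


class Member:
    """A Distribute Input or Distribute Output whose state only moves forward."""

    def __init__(self, name, flow):
        self.name = name
        self._flow = flow
        self._index = 0

    @property
    def state(self):
        return self._flow[self._index]

    def advance(self):
        """Move to the next state in the flow and return it."""
        if self._index == len(self._flow) - 1:
            raise RuntimeError(f"{self.name} is already in its final state")
        self._index += 1
        return self.state


# Usage: an output walks through its whole flow.
out_a1 = Member("Distribute Output A-1", OUTPUT_FLOW)
visited = [out_a1.state] + [out_a1.advance() for _ in range(4)]
```

The only difference between the two flows is the third state: an input reaches STATE_META_DONE when it has received the META information, while an output reaches STATE_META when it has sent it.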
In some embodiments, while the input (Distribute Input) and the output (Distribute Output) are executing, the group changes its own state according to the states of the input (Distribute Input) and the output (Distribute Output), and performs the corresponding processing;
the distributed task cluster states are as shown in Table 1:
[Table 1: distributed task cluster states. The table is an image in the original publication and is not reproduced here.]
The trigger conditions for cluster state changes are as shown in Table 2:
[Table 2: cluster change trigger conditions. The table is an image in the original publication and is not reproduced here.]
It should be noted that cluster state values can only change from small to large; they cannot be reversed or applied out of order. In Table 2, "" indicates the role state or cluster state whose change is triggered by the current step. After the master node determines that the roles of all group members are in a consistent state, it triggers the next change.
Specifically, as shown in FIG. 7, the initial state of the group is CLUSTER_UNINITIALIZED. After the group is started, it waits for members to join, and the state changes to CLUSTER_BOOTSTRAP_PENDING.
If all input and output members (all subtasks) have joined the group, the state changes to CLUSTER_BOOTSTRAP_SERVICES. When all the outputs are ready to send data, that is, the output state has changed to STATE_META, the group state changes to CLUSTER_RECEIVE_MATA, and data is ready to be sent.
When the input is ready to receive data, that is, the input state has changed to STATE_META_DONE, the group state changes to CLUSTER_RUNNING, and the group starts to operate;
if the output has finished sending data, that is, the output state is STATE_DONE, the group state changes to CLUSTER_RUNNING_DONE. At this point, the group waits for the input to finish receiving data;
if the input has finished receiving data, that is, the input state has changed to STATE_DONE (the entire data exchange process is finished), the group state changes to CLUSTER_STOP.
Further, throughout the running of the group, if a member goes offline or becomes abnormal, the group immediately receives the message, reports an exception (either that the number of members does not match the expected number, or another abnormality), changes the group state to CLUSTER_KILL, and ends the running of the group.
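The group state transitions described above can be sketched as a function that derives the next group state from the current member states. The state names come from the text; everything else is an assumption made for illustration: the numeric values (the actual values are in Table 1, which is not available here), the ranking of member states, the `next_group_state` function, and the simplification that only offline members (not other abnormalities) trigger CLUSTER_KILL.

```python
from enum import IntEnum


class Cluster(IntEnum):
    # Numeric values are assumed; they only encode the forward-only order.
    CLUSTER_UNINITIALIZED = 0
    CLUSTER_BOOTSTRAP_PENDING = 1
    CLUSTER_BOOTSTRAP_SERVICES = 2
    CLUSTER_RECEIVE_MATA = 3
    CLUSTER_RUNNING = 4
    CLUSTER_RUNNING_DONE = 5
    CLUSTER_STOP = 6
    CLUSTER_KILL = 7


# Progress rank of each member state (an assumed encoding).
_RANK = {"UNINITIALIZED": 0, "STATE_SERVICE": 1, "STATE_META": 2,
         "STATE_META_DONE": 2, "STATE_START": 3, "STATE_DONE": 4}


def next_group_state(current, expected, inputs, outputs):
    """inputs/outputs map the names of online members to their states."""
    if current in (Cluster.CLUSTER_STOP, Cluster.CLUSTER_KILL):
        return current  # terminal states never change
    online = set(inputs) | set(outputs)
    if online != set(expected):
        if current >= Cluster.CLUSTER_BOOTSTRAP_SERVICES:
            return Cluster.CLUSTER_KILL  # a member left an established group
        return Cluster.CLUSTER_BOOTSTRAP_PENDING  # still waiting for joins
    # The slowest member of each kind determines the group's progress.
    in_rank = min((_RANK[s] for s in inputs.values()), default=0)
    out_rank = min((_RANK[s] for s in outputs.values()), default=0)
    target = Cluster.CLUSTER_BOOTSTRAP_SERVICES
    if out_rank >= 2:                      # all outputs reached STATE_META
        target = Cluster.CLUSTER_RECEIVE_MATA
        if in_rank >= 2:                   # all inputs reached STATE_META_DONE
            target = Cluster.CLUSTER_RUNNING
            if out_rank >= 4:              # all outputs reached STATE_DONE
                target = Cluster.CLUSTER_RUNNING_DONE
                if in_rank >= 4:           # all inputs reached STATE_DONE
                    target = Cluster.CLUSTER_STOP
    return max(current, target)            # the state value never decreases
```

In this sketch the coordinator would call `next_group_state` whenever a member joins, leaves, or reports a state change; the `max` at the end enforces the rule that state values only change from small to large.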
According to the embodiments of the present disclosure, the following technical effects are achieved:
In group mode, the tasks of the same group are placed in one group. When any task in the group crashes or is disconnected, the change in task state is detected in real time, which keeps the states of the task members in the group consistent and greatly improves the efficiency of task operation and state transfer.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all alternative embodiments and that the acts and modules referred to are not necessarily required for the application.
The above is a description of method embodiments, and the embodiments of the present application are further described below by way of apparatus embodiments.
Fig. 8 shows a block diagram of a data group-based data processing apparatus 800 according to an embodiment of the present application, and as shown in fig. 8, the apparatus 800 comprises:
an obtaining module 810, configured to obtain a data exchange task;
a splitting module 820, configured to split the data exchange task into a plurality of subtasks, where each subtask includes an input and an output;
a processing module 830, configured to construct a data group based on the split subtasks, and to complete the data exchange through the data group.
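The division of work among the three modules can be sketched as follows. This is only an illustration: the class names, the dictionary-based task format, the `parts` parameter and the member naming scheme are all hypothetical, and the actual splitting logic of the splitting module 820 is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subtask:
    input: str   # member that receives data
    output: str  # member that sends data


class ObtainingModule:  # corresponds to module 810
    def obtain(self, name: str, parts: int) -> dict:
        # Hypothetical task description: a name and a number of parts.
        return {"name": name, "parts": parts}


class SplittingModule:  # corresponds to module 820
    def split(self, task: dict) -> List[Subtask]:
        # Each subtask includes one input and one output.
        return [
            Subtask(input=f"{task['name']}-in-{i}",
                    output=f"{task['name']}-out-{i}")
            for i in range(task["parts"])
        ]


class ProcessingModule:  # corresponds to module 830
    def build_group(self, subtasks: List[Subtask]) -> List[str]:
        # The data group contains every input and output of every subtask.
        members: List[str] = []
        for sub in subtasks:
            members.extend([sub.input, sub.output])
        return members
```

For example, a task named `job` with two parts yields two subtasks and a group with four members: `job-in-0`, `job-out-0`, `job-in-1` and `job-out-1`.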
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the described module may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Fig. 9 shows a schematic structural diagram of a terminal device or a server suitable for implementing the embodiments of the present application.
As shown in fig. 9, the terminal device or server 900 includes a Central Processing Unit (CPU) 901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as necessary, so that a computer program read from it can be installed into the storage section 908 as necessary.
In particular, the above method flow steps may be implemented as a computer software program according to embodiments of the present application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present application are executed when the computer program is executed by a Central Processing Unit (CPU) 901.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor. The names of these units or modules do not, in some cases, constitute a limitation on the units or modules themselves.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may be separate and not incorporated into the electronic device. The computer readable storage medium stores one or more programs that, when executed by one or more processors, perform the methods described herein.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the application referred to in the present application is not limited to the embodiments in which the above-mentioned features are combined in particular, and also encompasses other embodiments in which the above-mentioned features or their equivalents are combined arbitrarily without departing from the concept of the application. For example, the above features may be replaced with (but not limited to) features having similar functions as those described in this application.

Claims (9)

1. A data group-based data processing method, comprising:
acquiring a data exchange task;
splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output;
constructing a data group based on the split plurality of subtasks; the stable operation of the data exchange task is realized through the data group;
the splitting of the data exchange task into a plurality of subtasks comprises:
splitting the data exchange task into a plurality of subtasks through an ETL model;
wherein the ETL model comprises a data source, a source adapter, a target adapter, a converter and a data target;
if the data source and the data target are set to operate in a cross-domain mode and different specified container conditions are selected, the data exchange task is divided into a plurality of subtasks in a vertical dividing mode;
if the data source segmentation is set and the operation mode is single node, splitting the data exchange task into a plurality of subtasks in a horizontal splitting mode;
and if the conditions of horizontal splitting and vertical splitting are met simultaneously, splitting the data exchange task into a plurality of subtasks in a mixed splitting mode.
2. The method of claim 1, wherein constructing a data group based on the split plurality of subtasks comprises:
based on the split subtasks, a coordinator for transmitting, receiving and/or managing data is constructed; and, a worker for data transmission or reception;
based on the coordinators and workers, a data group is constructed.
3. The method of claim 2, wherein the enabling stable operation of the data exchange task through the data group comprises:
the group changes its own state according to the states of input and output.
4. The method of claim 3, wherein the states of the inputs and outputs comprise:
the initial state of the input is UNINITIALIZED; after starting, the input is initialized and joins the group; after the initialization is successful, the state is changed to STATE_SERVICE; after the initialization is finished and META information is received, the state is changed to STATE_META_DONE; after starting to receive data, the state is changed to STATE_START; after the data reception is finished, the state is changed to STATE_DONE;
the initial state of the output is UNINITIALIZED; after starting, the output is initialized and joins the group; after the initialization is successful, the state is changed to STATE_SERVICE; after the META information is successfully sent, the state is changed to STATE_META; after starting to transmit data, the state is changed to STATE_START; after the data transmission is completed, the state is changed to STATE_DONE.
5. The method of claim 4, wherein the group changes its state according to the states of the input and output, comprising:
the initial state of the group is CLUSTER_UNINITIALIZED, and if all input and output members have joined the group, the state is changed to CLUSTER_BOOTSTRAP_SERVICES;
if all the outputs are ready to send data and the output state is changed to STATE_META, the group state is changed to CLUSTER_RECEIVE_MATA, and the data is ready to be sent;
if the initialization of the input is completed and META information is received, so that the state of the input is changed to STATE_META_DONE, the state of the group is changed to CLUSTER_RUNNING, and the group starts to operate;
if the input has finished receiving data and the input state is changed to STATE_DONE, the group state is changed to CLUSTER_STOP.
6. The method of claim 5, further comprising:
during the operation of the group, if any member goes offline or becomes abnormal, the group immediately receives the message, reports an exception (either the number of members does not match the expected number, or a member is abnormal), changes the state of the group to CLUSTER_KILL, and ends the operation of the group.
7. A data group-based data processing apparatus, comprising:
the acquisition module is used for acquiring a data exchange task;
the splitting module is used for splitting the data exchange task into a plurality of subtasks, wherein each subtask comprises an input and an output; the splitting of the data exchange task into a plurality of subtasks comprises:
splitting the data exchange task into a plurality of subtasks through an ETL model;
wherein the ETL model comprises a data source, a source adapter, a target adapter, a converter and a data target;
if the data source and the data target are set to operate in a cross-domain mode and different specified container conditions are selected, the data exchange task is divided into a plurality of subtasks in a vertical dividing mode;
if the data source segmentation is set and the operation mode is single node, splitting the data exchange task into a plurality of subtasks in a horizontal splitting mode;
if the conditions of horizontal splitting and vertical splitting are met simultaneously, splitting the data exchange task into a plurality of subtasks in a mixed splitting mode;
the processing module is used for constructing a data group based on the split plurality of subtasks; by the data group, stable operation of the data exchange task is realized.
8. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, wherein the processor, when executing the computer program, implements the method of any of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
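The splitting conditions recited in claims 1 and 7 can be illustrated with a short sketch. The configuration fields and the function name `choose_split_mode` are hypothetical, and the `"none"` result for a task that meets neither condition is an assumption, since the claims do not state what happens in that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    # Hypothetical flags describing how the data exchange task is configured.
    cross_domain: bool          # data source and data target run cross-domain
    different_containers: bool  # different specified containers are selected
    source_segmented: bool      # data source segmentation is set
    single_node: bool           # operation mode is single node


def choose_split_mode(cfg: TaskConfig) -> str:
    """Pick a splitting mode from the conditions stated in the claims."""
    vertical = cfg.cross_domain and cfg.different_containers
    horizontal = cfg.source_segmented and cfg.single_node
    if vertical and horizontal:
        return "mixed"
    if vertical:
        return "vertical"
    if horizontal:
        return "horizontal"
    return "none"  # assumed fallback, not described in the claims
```

For instance, a configuration with all four flags set selects the mixed splitting mode, while one with only `source_segmented` and `single_node` set selects horizontal splitting.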
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210956462.5A CN115361382B (en) 2022-08-10 2022-08-10 Data processing method, device, equipment and storage medium based on data group

Publications (2)

Publication Number Publication Date
CN115361382A CN115361382A (en) 2022-11-18
CN115361382B true CN115361382B (en) 2023-03-31

Family

ID=84033333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210956462.5A Active CN115361382B (en) 2022-08-10 2022-08-10 Data processing method, device, equipment and storage medium based on data group

Country Status (1)

Country Link
CN (1) CN115361382B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116614512B (en) * 2023-02-21 2024-04-26 北京友友天宇系统技术有限公司 Method, device and equipment for managing strong consistency group view of distributed group communication

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966051A (en) * 2021-03-31 2021-06-15 陕西省大数据集团有限公司 Distributed data exchange system and method
CN113282649A (en) * 2020-02-19 2021-08-20 北京国双科技有限公司 Distributed task processing method and device and computer equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840149B (en) * 2019-02-14 2021-07-30 百度在线网络技术(北京)有限公司 Task scheduling method, device, equipment and storage medium
CN111400012A (en) * 2020-03-20 2020-07-10 中国建设银行股份有限公司 Data parallel processing method, device, equipment and storage medium
CN113962597A (en) * 2021-11-11 2022-01-21 北京锐安科技有限公司 Data analysis method and device, electronic equipment and storage medium
CN114780214B (en) * 2022-04-01 2024-01-09 中国电信股份有限公司 Task processing method, device, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant