CN117707779A - Data processing method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN117707779A
CN117707779A (application CN202311749345.2A; granted as CN117707779B)
Authority
CN
China
Prior art keywords
data
task
execution
processed
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311749345.2A
Other languages
Chinese (zh)
Other versions
CN117707779B (en)
Inventor
曹宸
Current Assignee
Shanghai Shuhe Information Technology Co Ltd
Original Assignee
Shanghai Shuhe Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Shuhe Information Technology Co Ltd filed Critical Shanghai Shuhe Information Technology Co Ltd
Priority to CN202311749345.2A priority Critical patent/CN117707779B/en
Publication of CN117707779A publication Critical patent/CN117707779A/en
Application granted granted Critical
Publication of CN117707779B publication Critical patent/CN117707779B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data processing method, a data processing device, a computer device, and a storage medium. The method comprises the following steps: acquiring data to be processed; pre-slicing the data to be processed to obtain slice execution data corresponding to a plurality of task slices; preempting the execution right of the current task slice according to the slice execution data of each task slice; and, when the execution right of the current task slice is successfully preempted, cutting and processing the data to be processed according to the slice execution data of the current task slice. With this method, no central node needs to be set up and no centralized state needs to be maintained, which simplifies the design and reduces the overhead and risk of state management.

Description

Data processing method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, computer device, and storage medium.
Background
With the rapid development of the internet and information technology, systems face massive data volumes and time-consuming task-processing demands. Conventional serial processing cannot meet these requirements, and distributed computing has therefore emerged.
Distributed computing is a method of parallel computing and task decomposition that uses multiple computers or computing resources. By dividing a task into a plurality of subtasks and distributing them to different computing nodes for processing, it achieves parallel execution of tasks and shared utilization of resources. However, many conventional distributed computing schemes adopt a centralized approach: a centralized state must be maintained, the central node's information must be synchronized to the executing nodes, and each executing node must keep a heartbeat connection with the central node and report its state to it; if the central node becomes unavailable, the entire service becomes unavailable.
Therefore, conventional distributed computing schemes require coordination between the central node and the executing nodes, and suffer from low data processing efficiency.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data processing method, apparatus, computer device, and storage medium.
A method of data processing, said method comprising:
acquiring data to be processed;
pre-slicing the data to be processed to obtain slice execution data corresponding to a plurality of task slices;
preempting the execution right of the current task slice according to the slice execution data of each task slice;
when the execution right of the current task slice is successfully preempted, cutting and processing the data to be processed according to the slice execution data of the current task slice.
In one embodiment, pre-slicing the data to be processed to obtain the slice execution data corresponding to a plurality of task slices includes:
performing the pre-slicing according to a preset number of slices to obtain the slice execution data of each task slice.
In one embodiment, the slice execution data of the current task slice includes a processing state, and preempting the execution right of the current task slice according to the slice execution data of each task slice includes:
preempting a preset distributed lock;
when the distributed lock is successfully preempted, determining the current task slice according to the processing state of each task slice;
modifying the processing state of the current task slice to in-processing, thereby preempting the execution right of the current task slice.
In one embodiment, the method is applied to a distributed computing node, and the method further includes:
reporting the node state to a preset health state table through a heartbeat thread;
periodically querying the health state table;
when it is found that a target node has not updated its node state within a preset time period, acquiring the processing state of the task slice being processed by the target node;
when the processing state of the task slice being processed by the target node is in-processing, resetting that processing state to the initial state.
In one embodiment, each task slice includes at least one data record, and before preempting the preset distributed lock, the method further includes:
querying whether an unprocessed data record exists in the current task slice;
when an unprocessed data record exists in the current task slice, preempting the distributed lock;
the slice execution data of the current task slice further includes identification information of the last data record of the previous cut, and cutting and processing the data to be processed according to the slice execution data of the current task slice includes:
when the distributed lock is successfully preempted, determining the query range of the current cut according to the identification information of the last data record of the previous cut of the current task slice;
querying a specified number of target data records from the data to be processed according to the query range of the current cut;
modifying the processing state of the target data records to in-processing, and releasing the distributed lock;
performing data processing on the target data records;
returning to the step of querying whether an unprocessed data record exists in the current task slice, until no unprocessed data record exists in the current task slice.
In one embodiment, the method further comprises:
when no unprocessed data record exists in the current task slice, querying whether an unprocessed task slice exists in the data to be processed;
when an unprocessed task slice exists in the data to be processed, acquiring the next task slice;
taking the next task slice as the current task slice and returning to the step of querying whether an unprocessed data record exists in the current task slice.
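The per-node execution loop described in the two embodiments above (drain the current task slice, then move to the next one) can be sketched as follows. This is a minimal illustration: the function names and the page-based fetch callback are assumptions for clarity, not part of the patent.

```python
def run_slices(slices, fetch_unprocessed, process):
    """Drain each task slice in turn: while a slice still has unprocessed
    records, fetch a page of them and process it, then move to the next slice."""
    for task_slice in slices:
        while True:
            page = fetch_unprocessed(task_slice)
            if not page:      # no unprocessed records left in this slice
                break         # move on to the next task slice
            process(page)
```

Here `fetch_unprocessed` stands in for the lock-protected query step described later; any empty result signals that the current slice is exhausted.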
In one embodiment, the method further comprises:
acquiring the total number of data records of the current data to be processed and a preset expected execution duration;
determining an expected rate according to the total number of data records and the expected execution duration;
acquiring the number of data records processed so far and the elapsed execution duration;
determining the current rate according to the number of data records processed so far and the elapsed execution duration;
adjusting the number of slices according to the expected rate and the current rate.
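One plausible way to turn the rates above into a slice-count adjustment is to scale the current count by the ratio of the expected rate to the observed rate. The patent does not fix the exact formula, so this is an illustrative sketch only:

```python
def adjust_slice_count(current_slices, total_records, expected_duration,
                       processed_records, elapsed):
    """Scale the slice count by expected_rate / current_rate.

    expected_rate = total records / expected execution duration
    current_rate  = records processed so far / elapsed duration
    (An assumed adjustment rule; the patent only says the count is
    adjusted according to the two rates.)"""
    expected_rate = total_records / expected_duration
    current_rate = processed_records / elapsed
    return max(1, round(current_slices * expected_rate / current_rate))
```

For example, if the cluster is running at half the expected rate, the slice count doubles so more threads can pick up work.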
A data processing apparatus, said apparatus comprising:
an acquisition module configured to acquire data to be processed;
a slicing module configured to pre-slice the data to be processed to obtain slice execution data corresponding to a plurality of task slices;
a preemption module configured to preempt the execution right of the current task slice according to the slice execution data of each task slice;
a processing module configured to cut and process the data to be processed according to the slice execution data of the current task slice when the execution right of the current task slice is successfully preempted.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring data to be processed;
pre-slicing the data to be processed to obtain slice execution data corresponding to a plurality of task slices;
preempting the execution right of the current task slice according to the slice execution data of each task slice;
when the execution right of the current task slice is successfully preempted, cutting and processing the data to be processed according to the slice execution data of the current task slice.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring data to be processed;
pre-slicing the data to be processed to obtain slice execution data corresponding to a plurality of task slices;
preempting the execution right of the current task slice according to the slice execution data of each task slice;
when the execution right of the current task slice is successfully preempted, cutting and processing the data to be processed according to the slice execution data of the current task slice.
The data processing method, apparatus, computer device, and storage medium acquire the data to be processed through any one of the distributed computing nodes, pre-slice the data to be processed to obtain a plurality of task slices, and let each node preempt the task slices at execution time in order to process them. Compared with the traditional approach of setting up a central node that distributes the tasks, no central node needs to be set up, which simplifies the design; no centralized state needs to be maintained, which saves resources; and the situation where the entire service becomes unavailable because the central node is unavailable cannot arise, which improves the runtime stability of the whole server cluster.
Drawings
FIG. 1 is a flow diagram of a data processing method in one embodiment;
FIG. 2 is a flow chart of the step of cutting and processing the data to be processed according to the slice execution data of the current task slice in one embodiment;
FIG. 3 is a functional block diagram of a data processing method in one embodiment;
FIG. 4 is a functional block diagram of the execution of a pre-slice in one embodiment;
FIG. 5 is a functional block diagram of the persistence of data to be processed in one embodiment;
FIG. 6 is a functional block diagram of cutting and processing the data to be processed according to the slice execution data of the current task slice in one embodiment;
FIG. 7 is a block diagram of a data processing apparatus in one embodiment;
FIG. 8 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a data processing method is provided, where the method is applied to a distributed computing node, and the method includes:
S11, acquiring data to be processed.
In the present application, the method is applied to any one node of the distributed computing nodes. Distributed computing nodes here refer to a cluster of multiple computers or computing resources. Illustratively, the distributed computing nodes may be a cluster of distributed servers, or a cluster composed of multiple pod units deployed on servers, where a pod is the smallest deployable unit on a server. That is, one server may host at least one pod, and the pods of multiple servers may together constitute the distributed computing nodes described above.
A node in the distributed computing nodes may therefore be a server or a pod on a server. The data to be processed refers to data issued by an upstream system for processing by the distributed computing nodes, and it comprises a plurality of data records. Specifically, the data to be processed may include one batch or multiple batches of data, each batch comprising a plurality of data records.
S12, pre-slicing the data to be processed to obtain slice execution data corresponding to a plurality of task slices.
Pre-slicing refers to dividing the data to be processed by execution resources, that is, dividing it according to the number of execution resources, where an execution resource is a node or an execution thread on a node. Specifically, the data may be processed by the execution threads configured on each pod, so when dividing the resources, the pre-slicing must take into account the number of execution threads per pod and the number of pods.
A task slice is one of the several task groups obtained after dividing the data to be processed. The slice execution data is the data related to the execution of each task slice; for example, it may include the number of task slices, the identification information of each task slice, the batch number to which each task slice belongs, and so on.
In this application, pre-slicing divides the data into multiple execution slices (the task slices described above). Unlike traditional slicing strategies that partition tasks by key-value range, hash function, or other partitioning rules over the data, pre-slicing here is based on a masterless model: each slice is only responsible for execution, and the data it executes comes from preempted tasks, until all of the data has been executed.
S13, preempting the execution right of the current task slice according to the slice execution data of each task slice.
In the present application, the execution right is the right to process a given task slice. Once the processing right of the current task slice has been preempted, the current task slice can be processed.
In the application, pre-slicing the data to be processed yields a plurality of task slices, each carrying slice identification information such as a slice sequence number, and each including a plurality of data records. The sequence numbers of the task slices are arranged in order and inserted into a pre-created data table. During data processing, the task slices are processed sequentially in that order, and the execution right of each task slice must be preempted before it is processed: when the current task slice is to be processed, its execution right is preempted according to the slice execution data of each task slice.
In the application, a masterless mode is adopted: different nodes compete with one another to preempt resources for processing.
S14, when the execution right of the current task slice is successfully preempted, cutting data out of the data to be processed according to the slice execution data of the current task slice and processing it.
In the present application, data cutting refers to filtering the data records of the current task slice out of the data to be processed so that they can be processed.
In the application, after the data to be processed has been pre-sliced, the slice execution data of each task slice is generated; it includes the identification information of each slice, such as the slice sequence number. A data table is created in advance to store the slice execution data of each task slice. For example, a table_queue table is created in advance to store the slice sequence numbers of the task slices and the identification information of the data records in each task slice.
During data processing, the task slices are processed sequentially according to their sequence numbers in the table_queue table, and the data records of the corresponding task slice are cut out of the data to be processed in turn for processing.
Many conventional technical solutions adopt a centralized approach: a central node is set up and is responsible for distributing tasks to each executing node. However, such solutions have at least the following drawbacks.
First, a centralized state must be maintained: the central node's information must be synchronized to the executing nodes, and each executing node must keep a heartbeat connection with the central node and report its state, which affects data processing efficiency.
Second, if the central node becomes unavailable, the entire service becomes unavailable, which affects the running stability of the whole server cluster.
Third, conventional approaches depend on additional middleware, such as ZooKeeper, to assist the execution process, which introduces unnecessary extra dependencies for lightweight use cases.
Fourth, the technical data of conventional solutions is stored in middleware; when problems arise or adjustments are needed, the data is not clear and transparent enough, which makes troubleshooting difficult.
In the execution process of the present application, by contrast, no central node needs to be set up and the roles of all nodes are equal: when one node fails, the remaining nodes continue to preempt the execution right and process the data, so the operation of the whole cluster is unaffected; nodes can be added at will to take over the functions of a failed node; the fault tolerance is high; no middleware is required; and resources are saved.
In one embodiment, pre-slicing the data to be processed to obtain the slice execution data corresponding to a plurality of task slices may include:
performing the pre-slicing according to a preset number of slices to obtain the slice execution data of each task slice.
For the execution threads, a preset value is read, typically 25 execution threads per pod; this parameter is adjustable and can be tuned dynamically for different execution loads to maximize resource utilization. Assuming that 8 pods are currently online, under the default setting there are 8 × 25 = 200 execution threads available to run slices. In this case, the number of slices should be set to a value smaller than 200.
In the present application, the number of slices is a preset, adjustable value. It may be set according to the total resources of the actual cluster, and when setting the number of slices, the thread count of each pod must be taken into account.
In this way, the data to be processed can be divided into a plurality of task slices in advance; the purpose of the division is to obtain the slice execution data, so that subsequent data processing can proceed according to it.
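The sizing rule above (slice count strictly below pods × threads-per-pod) can be sketched as follows; the function names are illustrative, and the 25-thread default is the value stated in the text:

```python
THREADS_PER_POD = 25  # default execution threads per pod, per the description above

def total_execution_threads(pod_count: int,
                            threads_per_pod: int = THREADS_PER_POD) -> int:
    """Total parallel execution slots available across the cluster."""
    return pod_count * threads_per_pod

def choose_slice_count(pod_count: int, desired: int) -> int:
    """Clamp a configured slice count to stay below the cluster's capacity."""
    return min(desired, total_execution_threads(pod_count) - 1)
```

With 8 pods online this gives 200 slots, so a requested slice count is capped at 199.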
In one embodiment, the method may further include:
persistently storing the data to be processed.
In the application, after the data to be processed is obtained from the upstream system, it is persisted into a local database table, the table_user_group table, which stores the execution detail records. Persistently storing the data to be processed specifically includes: processing the data, preparing the slice data source, and inserting the slice tasks.
Processing the data means that the acquired data to be processed may come from a file system, other tables, messages, and so on, in non-uniform formats; data processing converts it into a uniform format, which facilitates the subsequent execution of the slices.
During data processing, the files of the data source are classified and labeled, and the subsequent slice execution can pre-slice and execute the data according to the preset labels and groupings. In one possible embodiment, one-off data may be labeled and grouped into the same batch by batch number.
Inserting the slice tasks refers to inserting the slice execution data into the corresponding data table. Specifically, a table_queue table is created in advance to store the number of slices and the executable slices of the data cutting; for example, the table may record the sequence number of each task slice, the batch number it belongs to, its processing state, and so on. After all of the original data has been persisted, the executable slice data is inserted in batches according to the order of magnitude of the target file or a preset configuration.
In one embodiment, the table_queue table may be as shown in table 1 below:
TABLE 1
Task slice sequence number | Batch number | pod information | Processing state
0                          | 20230501     |                 | Initial state
1                          | 20230501     |                 | Initial state
2                          | 20230501     |                 | Initial state
3                          | 20230501     |                 | Initial state
As shown in Table 1, the expected slice data is inserted into the table_queue table during data synchronization. Each record in the table_queue table represents one executable slice; Table 1 shows that the batch with batch number 20230501 can be executed as 4 slices.
In the application, a table_batch table is also created in advance; it is a batch summary table used to record batch-level information such as the size of each batch. After pre-slicing, the slice execution data is updated into the table_queue table and the table_batch table to complete the insertion of the slice tasks. Once the tasks have been inserted, the nodes in the cluster can detect the newly inserted tasks from these two tables, which triggers the subsequent task preemption.
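The rows inserted into table_queue for one batch can be sketched as plain dictionaries matching the schema of Table 1; the helper name and dictionary keys are illustrative assumptions:

```python
def build_slice_tasks(batch_no: str, slice_count: int):
    """Build the rows inserted into table_queue for one batch.

    Matches Table 1: pod information is empty and the processing state
    is the initial state until some node preempts the slice."""
    return [{"seq": i, "batch": batch_no, "pod": None, "state": "INIT"}
            for i in range(slice_count)]
```

For the batch in Table 1, `build_slice_tasks("20230501", 4)` yields the four initial-state rows shown there.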
In a possible embodiment, the method may include:
acquiring the data to be processed;
processing the data to be processed;
dividing the processed data by resources (that is, the pre-slicing described above) to obtain the slice execution data;
inserting the slice execution data into a pre-created slice task table (that is, the table_queue table described above);
having each node monitor the slice task table and, when a newly added task is detected in it, preempt the execution right of the newly added task, where the newly added task refers to the newly added task slices and may comprise a plurality of task slices;
when a target node preempts the execution right, having the target node cut the newly added task slices out of the data to be processed according to the slice execution data corresponding to each task slice and process them. When the newly added task comprises a plurality of slices, the preemption and cutting are performed in sequence.
In one embodiment, the slice execution data of the current task slice includes a processing state, and preempting the execution right of the current task slice according to the slice execution data of each task slice may include:
preempting a preset distributed lock;
when the distributed lock is successfully preempted, determining the current task slice according to the processing state of each task slice;
modifying the processing state of the current task slice to in-processing, thereby preempting the execution right of the current task slice.
In the method, a distributed lock is preset and each node contends for it. When a target node acquires the distributed lock, its execution right over the current task slice is determined; once the processing state of the current task slice has been modified to in-processing, the execution right of the current task slice has been successfully preempted.
By setting a distributed lock, control of the execution right of each task slice is locked against concurrent access, so that resource contention is serialized and ordered. Because tasks are distributed by preemption, faster nodes process more data, which makes reasonable use of the resources. The traditional approach, by contrast, may leave tasks assigned to a slow node unfinished while faster nodes, having completed quickly, sit idle and waste resources.
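A minimal in-memory sketch of the preemption step, assuming a `threading.Lock` as a stand-in for the distributed lock and a list of dictionaries as a stand-in for the table_queue table (both assumptions for illustration; the patent's lock is distributed, not process-local):

```python
import threading

class SliceQueue:
    """In-memory stand-in for table_queue."""

    def __init__(self, slice_count: int):
        self._lock = threading.Lock()  # stands in for the distributed lock
        self.rows = [{"seq": i, "state": "INIT", "pod": None}
                     for i in range(slice_count)]

    def preempt(self, pod_id: str):
        """Acquire the lock, pick the first INIT slice, and mark it DOING.

        Returns the preempted slice's sequence number, or None if every
        slice has already been claimed."""
        with self._lock:
            for row in self.rows:
                if row["state"] == "INIT":
                    row["state"], row["pod"] = "DOING", pod_id
                    return row["seq"]
            return None
```

Because the state change happens while the lock is held, two nodes can never claim the same slice.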
In one embodiment, the method is applied to a distributed computing node, and the method may further include:
reporting the node state to a preset health state table through a heartbeat thread;
periodically querying the health state table;
when it is found that a target node has not updated its node state within the preset time period, acquiring the processing state of the task slice being processed by the target node;
when the processing state of the task slice being processed by the target node is in-processing, resetting that processing state to the initial state.
In the present application, the health state table records the running state of each node in each time period, for example, whether a fault has occurred. The node state refers to the running state of each node, such as whether it is faulty. The preset time period refers to a preset maximum heartbeat reporting time; its specific value can be set according to the actual situation and is not limited here.
In the present application, the processing state refers to the completion state of a slice. The processing state may be INIT (initial state), DOING (in processing), or DONE (processing complete). The processing state is recorded in advance in the table_queue table, and the processing state of the current task slice can be obtained by querying that table. Specifically, the table_queue table records the processing state of each task slice as well as the node information, such as the node identifier, of the node processing it.
In the application, the health state table is queried; when an abnormality is found, the abnormal time period of the abnormal node is acquired and the table_queue table is queried for that period. If the processing state of a task slice preempted by the abnormal node during the abnormal time period is found to be DOING, that processing state in the table_queue table is reset to the initial state.
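The INIT/DOING/DONE life cycle above, including the failure-reset path from DOING back to INIT, can be expressed as a small state machine; the transition table is an illustrative reading of the description, not text from the patent:

```python
from enum import Enum

class SliceState(Enum):
    INIT = "INIT"    # initial state, not yet preempted
    DOING = "DOING"  # preempted, in processing
    DONE = "DONE"    # processing complete

# DOING -> INIT is the reset path taken when a node's heartbeat expires.
ALLOWED = {
    SliceState.INIT: {SliceState.DOING},
    SliceState.DOING: {SliceState.DONE, SliceState.INIT},
    SliceState.DONE: set(),
}

def transition(current: SliceState, new: SliceState) -> SliceState:
    """Apply a state change, rejecting transitions the life cycle forbids."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

Note that INIT cannot jump directly to DONE: a slice must be preempted (DOING) first.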
In this application, the business engine sets up a heartbeat thread in each node (pod) via an SDK (Software Development Kit). The heartbeat thread currently has the following main functions:
reporting pod information;
clearing expired pods and compensating the execution state;
executing slice tasks.
Reporting pod information means that each pod registers itself in the table_worker_host table (corresponding to the health state table described above); the heartbeat thread may be set to run once every 10 seconds. The report is used for the subsequent failure handling and fault-tolerance compensation of the health-checked nodes.
In the application, when a server is restarted or goes down, an intermediate state often arises: for example, a task has just been marked DOING when the server goes down; because the subsequent execution never completes, the state cannot move to its final value and remains stuck at DOING. By setting up the heartbeat thread, after the pod information has been reported the heartbeat thread can monitor each pod; once a pod is found to have timed out without reporting its information, a reset operation is performed and its executing slices are reset to the INIT state, so the system's slices recover automatically without the intervention of operations personnel.
This embodiment enables failure handling and fault-tolerance compensation for the health-checked nodes.
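The stale-pod compensation described above can be sketched as a single sweep over the health table and the slice table. The 30-second timeout and all names here are assumptions for illustration (the text only says the reporting interval may be 10 seconds and the timeout is configurable):

```python
HEARTBEAT_TIMEOUT = 30.0  # assumed maximum heartbeat reporting interval, seconds

def reset_stale_slices(health_table, slice_rows, now,
                       timeout=HEARTBEAT_TIMEOUT):
    """Find pods whose last heartbeat is older than the timeout and reset
    their DOING slices back to INIT so other nodes can re-preempt them.

    health_table maps pod id -> last report timestamp (table_worker_host);
    slice_rows is the list of table_queue rows. Returns the stale pod ids."""
    stale = {pod for pod, last_seen in health_table.items()
             if now - last_seen > timeout}
    for row in slice_rows:
        if row["pod"] in stale and row["state"] == "DOING":
            row["state"], row["pod"] = "INIT", None
    return stale
```

Slices whose owners are still reporting are left untouched, so only genuinely stuck DOING slices are recycled.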
In one embodiment, referring to FIG. 2, each task slice includes at least one data record, and before preempting the preset distributed lock, the method further includes:
S31, querying whether an unprocessed data record exists in the current task slice;
S32, when an unprocessed data record exists in the current task slice, preempting the distributed lock.
The slice execution data of the current task slice further includes the identification information of the last data record of the previous cut, and cutting and processing the data to be processed according to the slice execution data of the current task slice includes:
S33, when the distributed lock is successfully preempted, determining the query range of the current cut according to the identification information of the last data record of the previous cut of the current task slice;
S34, querying a specified number of target data records from the data to be processed according to the query range of the current cut;
S35, modifying the processing state of the target data records to DOING, and releasing the distributed lock;
S36, performing data processing on the target data records;
and returning to the step of querying whether an unprocessed data record exists in the current task slice, until no unprocessed data record exists in the current task slice.
In the application, each task slice is executed by the nodes robbing the lock in a loop a plurality of times and slicing and processing the data until the slice is completed. After the current task slice has been executed, execution of the next task slice begins, which is likewise completed through multiple rounds of robbing and of data slicing and processing by the nodes; execution proceeds in this loop until no unprocessed task slice remains in the current batch of data to be processed.
The above-mentioned query range refers to the identification information of the first data record to query together with the identification information of the last data record to query, or alternatively the identification information of the first data record to query together with the number of data records per query. The number of records per query is the preset specified number. The identification information of a data record here may be its id number.
In the application, after the execution permission of the task slice is acquired, segmented queries are performed on the data to be processed. Once a task is robbed, the thread keeps running in a loop until all data within the current task slice has been executed. Each loop iteration is divided into two phases: a data query-and-lock phase and a data execution phase. To prevent different slices from operating on the same batch of data, a distributed lock may be used to control concurrency before the data query.
When querying, the specified query count must be set, along with the entry field of each query, namely the id of the starting data record, which designates the query range; this id is 0 on the first iteration of the loop and is thereafter set to the id of the last record of the previous iteration. After the specified number of records has been queried, the state field of those records in the table_user_group table is refreshed from INIT to DOING while the distributed lock is held, so the same data cannot be queried again later. Thus, through task robbing and data slicing, isolated operation among different slices is achieved.
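The query-and-lock loop described above can be sketched with SQLite standing in for the database and a thread lock standing in for the distributed lock; the table name table_user_group comes from the application, while the page size, ids and states are illustrative:

```python
import sqlite3
import threading

PAGE_SIZE = 3  # hypothetical "specified number" per query

def claim_next_page(conn, lock, last_id):
    """Query the next page of INIT records after last_id and flip them to DOING,
    all while holding the lock, so no other slice can claim the same records."""
    with lock:  # stands in for the distributed lock
        rows = conn.execute(
            "SELECT id FROM table_user_group "
            "WHERE id > ? AND state = 'INIT' ORDER BY id LIMIT ?",
            (last_id, PAGE_SIZE)).fetchall()
        ids = [r[0] for r in rows]
        conn.executemany("UPDATE table_user_group SET state = 'DOING' WHERE id = ?",
                         [(i,) for i in ids])
        conn.commit()
        return ids

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_user_group (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany("INSERT INTO table_user_group VALUES (?, 'INIT')",
                 [(i,) for i in range(1, 8)])
lock = threading.Lock()

last_id, pages = 0, []
while True:
    ids = claim_next_page(conn, lock, last_id)
    if not ids:
        break
    pages.append(ids)
    last_id = ids[-1]  # keyset cursor: next query starts after the last id seen

print(pages)  # → [[1, 2, 3], [4, 5, 6], [7]]
```

Because claimed records are flipped to DOING inside the lock, a second worker running the same loop would simply see no INIT rows in the claimed range.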
In one possible design, the distributed lock includes a first control lock and a second control lock, where the first control lock is used to lock the authority of the table_queue table and the second control lock is used to lock the authority of the table_user_group table. The table_queue table records the serial number of each task slice, the processing state of each task slice, and the data batch to which each task slice belongs. The table_user_group table records each data record in each slice and the processing state of each data record.
Further, robbing the preset distributed lock includes robbing the first control lock: when the first control lock is robbed, the table_queue table is queried and the processing state of the current task slice in the table_queue table is modified to processing, representing that the processing authority of the current task slice has been robbed, after which the first control lock is released. When the second control lock is robbed, the table_user_group table is queried to obtain the specified number of target data records for the current query, and the state of those target data records is modified to processing, representing that the execution authority of the target data records in the current data slice has been robbed, after which the second control lock is released.
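A minimal in-memory sketch of the first control lock's claim step; a thread lock stands in for the distributed lock, the list of dicts stands in for the table_queue table, and the state strings are illustrative (the second control lock would guard table_user_group pages in the same pattern):

```python
import threading

first_lock = threading.Lock()  # guards table_queue (task-slice claims)

# in-memory stand-in for the table_queue table
table_queue = [{"slice": 0, "state": "INIT"},
               {"slice": 1, "state": "INIT"}]

def claim_task_slice():
    """Under the first control lock, mark the next INIT task slice as DOING and
    return its number; the lock is held only for the query and the state update."""
    with first_lock:
        for row in table_queue:
            if row["state"] == "INIT":
                row["state"] = "DOING"
                return row["slice"]
    return None  # no claimable slice remains

print(claim_task_slice())  # → 0
print(claim_task_slice())  # → 1
print(claim_task_slice())  # → None
```

Since only a query and a state update happen inside the lock, the critical section stays short, matching the application's point that the locked operations are not time-consuming.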
In the present application, once the rob succeeds, an executing thread in a pod holds the execution authority of the current task slice; the corresponding data in the database is then marked as processing, and the pod information is updated in the table, representing that this server has robbed the executable task, as shown in table 2 below. Table 2 is the data structure of the table_queue table in one embodiment.
TABLE 2
As shown in table 2, the task slices are arranged and processed in order: when task slice 0 is locked, its state is modified to processing; when the next node robs the first control lock, the state of task slice 1 is modified in turn, and so on, so the slices are executed sequentially. After competing for a task, the heartbeat thread does not participate in the subsequent task execution; the actual slicing and execution of the data to be processed is carried out by an execution thread pool.
According to the method and the device, by setting the first control lock and the second control lock, concurrency can be effectively controlled and resource competition serialized; because the operations inside the lock are only state queries and updates, no excessively time-consuming behavior occurs, and resource starvation and waste are avoided.
In this application, executing the current task slice requires multiple loop iterations. Each iteration robs the distributed lock; when the lock is robbed, the corresponding data records are queried and processed according to the query range, and then the next iteration begins. The id of the last data record of the current query is recorded after each query, so the next query can continue from where the last one stopped, saving query time and improving data processing efficiency.
In one embodiment, with continued reference to fig. 2 above, the method may further include:
s37, inquiring whether unprocessed task slices exist in the data to be processed or not when the fact that unprocessed data records do not exist in the current task slices is inquired;
s38, when unprocessed task fragments exist in the data to be processed, acquiring the next task fragment;
s39, taking the next task slice as the current task slice, and returning to the step of inquiring whether unprocessed data records exist in the current task slice.
In the application, before each rob of the distributed lock, the current task slice is queried for unprocessed data records, and if any exist, the rob of the distributed lock proceeds. When no unprocessed data record is found in the current task slice, it is further queried whether an unprocessed task slice exists in the current batch of data to be processed; if so, processing of the next task slice begins, and if not, the flow ends.
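The two nested loops described here (outer loop over task slices, inner loop over data records) can be sketched as follows; the data structures are illustrative stand-ins for the application's tables, and each inner iteration stands in for one rob/query/process cycle:

```python
def run_batch(slices):
    """Outer loop over task slices, inner loop over unprocessed records,
    mirroring steps S37-S39: move to the next slice only once the current
    one has no unprocessed records left."""
    processed = []
    current = next((s for s in slices if s["pending"]), None)
    while current is not None:
        while current["pending"]:                        # records remain -> rob the lock
            processed.append(current["pending"].pop(0))  # one lock/query/process cycle
        current = next((s for s in slices if s["pending"]), None)  # next unprocessed slice
    return processed

slices = [{"pending": ["r1", "r2"]}, {"pending": ["r3"]}]
print(run_batch(slices))  # → ['r1', 'r2', 'r3']
```

The flow ends exactly when no slice in the batch still has pending records, matching the termination condition in the text.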
In one embodiment, the method further comprises:
acquiring the total number of data records of the current data to be processed and the preset expected execution duration;
determining an expected rate according to the total number of the data records and the expected execution duration;
Acquiring the number of the data records which are processed currently and the current execution time length;
determining a current rate according to the number of the data records which are processed currently and the current execution duration;
the number of slices is adjusted according to the desired rate and the current rate.
In the application, rate-based adjustment of the slice count refers to sensing, during execution, the execution rate and whether the expected duration can be met, so as to intervene and adjust.
The batch framework supports execution-expectation management: a user can input an expected execution duration. After the user inputs the expected duration, the batch processing framework initially inserts execution slices according to the number of slices set by the user, and during execution the heartbeat thread uses rate-based adjustment to evaluate whether the current number of slices meets the requirement of the expected rate.
In particular, the data to be processed comprises a plurality of data records. When processing the data to be processed, the processing state of each data record is recorded. The processing state of each data record is integrated into the above-described slice execution data of each slice. The total number of the data records of the current data to be processed refers to the total number of the data to be processed of the current batch. When data is processed, the data is processed in batches.
The above-described expected rate can be calculated by the following formula:
expected rate = total number of data records / expected execution duration
The current rate described above can be calculated by the following formula:
current rate = number of data records currently processed / current execution duration
The above adjustment of the slice number according to the expected rate and the current rate can be performed by the following formula:
adjusted number of slices = number of slices + number of slices × (expected rate / current rate)
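Read literally, the formula above would increase the slice count even when the current rate already matches the expected rate; a common proportional-scaling reading is adjusted slices = slices × (expected rate / current rate), which the sketch below assumes (all numbers are hypothetical):

```python
def adjust_slice_count(total_records, expected_duration,
                       done_records, elapsed, slices):
    """Proportional-scaling reading of the adjustment: if the current rate lags
    the expected rate, scale the slice count up by the same ratio (and down
    symmetrically), never dropping below one slice."""
    expected_rate = total_records / expected_duration  # records per second
    current_rate = done_records / elapsed
    return max(1, round(slices * expected_rate / current_rate))

# hypothetical batch: 1000 records expected to finish in 100 s;
# after 40 s only 200 records are done -> current rate is half the target
print(adjust_slice_count(1000, 100, 200, 40, 4))  # → 8
```

In this reading, a batch running at exactly the expected rate keeps its slice count unchanged.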
According to the method and the device, the data to be processed is received in batches, and the number of slices is adjusted in real time according to the total number of data to be processed, thereby improving the utilization rate of resources.
In one embodiment, after the target node performs data processing on the target data record, the method may further include:
the target node changes the processing state of the target data record into processing completion;
the method further comprises the steps of:
when no unprocessed data record exists in the current task segment, the target node changes the processing state of the current task segment into processing completion.
In the application, when no unprocessed data record remains in the current task segment, the current task segment is processed, the next task segment is processed, and the like until the data record of the current batch is processed.
In one embodiment, the method may further include:
when the node is newly added, the information of the newly added node is registered in the table_worker_host table.
The distributed masterless mode of the slice execution algorithm also performs concurrency control in task robbing and data slicing, so if resources become insufficient during batch processing, horizontal capacity expansion can be performed easily without affecting the operation of existing programs.
The method and the device can easily expand the capacity and the computing resources of the system by using the database to manage the task fragments. When more shards need to be processed, the need can be met by adding database servers or adding database shards, thereby realizing a highly scalable system.
In addition, the database provides a mechanism for transaction management and data consistency, so that the correct execution of the task fragments and the consistency of results are ensured, the transaction characteristics of the database can be utilized, the atomicity and consistency of the tasks are ensured, and the reliability of the system is improved.
Referring to fig. 3, fig. 3 is a schematic block diagram of a data processing method according to the present application in one embodiment. In fig. 3, the data processing method of the present application includes data persistence and data execution, where the persistence portion handles large amounts of external data, and the execution portion performs slicing and running on the content of the data set that has been persisted.
Referring to fig. 4, fig. 4 is a schematic block diagram illustrating the execution of the pre-slicing according to the present application in one embodiment. In fig. 4, the overall operation is divided into five parts: pod information reporting, data synchronization, data persistence, generation of slice execution data, and decentralized execution of slice tasks. In the figure, table_batch, table_queue, table_worker_host and table_user_group are the names of data tables. The table_batch table is used for storing the batch master table and recording information about the mass data, such as the batch number. The table_queue table is used for storing the number of slices and the execution slices of the data segmentation. The table_worker_host table is used for storing the worker machine table and the states of the execution hosts. The table_user_group table is used for storing execution detail records.
Referring to fig. 5, fig. 5 is a schematic block diagram illustrating persistence of data to be processed according to an embodiment of the present application. In fig. 5, persisting the data to be processed includes processing the data and marking the data, and storing the processed data in the local database table_user_group table.
Referring to fig. 6, fig. 6 is a schematic block diagram of data slicing and processing of the data to be processed according to the slice execution data of the current task slice in one embodiment. When the heartbeat thread of any pod triggers, executable data whose state is initial in the table_queue table is queried; to ensure the ordering of resource competition, a distributed lock is used to control concurrency so that resource competition is serialized, and the state is modified to DOING after the query. The INIT data in fig. 6 refers to data whose processing state is the initial state. DOING in fig. 6 refers to data whose processing state is in processing.
In one embodiment, as shown in fig. 7, there is provided a data processing apparatus including: an acquisition module 11, a slicing module 12, a robbery module 13 and a processing module 14, wherein:
an acquisition module 11, configured to acquire data to be processed;
the slicing module 12 is configured to perform pre-slicing processing on data to be processed to obtain slicing execution data corresponding to a plurality of task slices;
the robbing module 13 is used for robbing the executing authority of the current task slice according to the slice executing data of each task slice;
and the processing module 14 is used for carrying out data segmentation and processing on the data to be processed according to the sliced execution data of the current task slice when the execution right of the current task slice is robbed.
In one embodiment, the slicing module 12 may perform pre-slicing processing according to a preset number of slices, to obtain slice execution data of each task slice.
In one embodiment, the robbing module 13 may rob a preset distributed lock, and when the distributed lock is robbed, determine the current task slice according to the processing state of each task slice, and modify the processing state of the current task slice into processing to rob the execution authority of the current task slice.
In one embodiment, the above device is applied to a distributed computing node, the above robbery module 13 may report the node status to a preset health status table through a heartbeat thread, periodically query the health status table, acquire the processing status of the task slices processed by the target node when the target node is queried that the corresponding node status is not updated within a preset time period, and reset the processing status of the task slices processed by the target node to an initial status when the processing status of the task slices processed by the target node is in processing.
In one embodiment, each of the task slices includes at least one data record. Before the preset distributed lock is robbed, the robbing module 13 may further query whether an unprocessed data record exists in the current task slice and, when one exists, rob the distributed lock. The slice execution data of the current task slice further includes identification information of the last data record of the last data slice. When the distributed lock is robbed, the processing module 14 may determine the query range of the current data slice according to the identification information of the last data record of the last data slice of the current task slice, query a specified number of target data records from the data to be processed according to that query range, modify the processing state of the target data records to processing, release the distributed lock, perform data processing on the target data records, and return to the step of querying whether an unprocessed data record exists in the current task slice until no unprocessed data record remains in the current task slice.
In one embodiment, the processing module 14 may further query whether an unprocessed task partition exists in the to-be-processed data when it is queried that the unprocessed data record does not exist in the current task partition, and obtain a next task partition when the unprocessed task partition exists in the to-be-processed data, and return to the step of querying whether the unprocessed data record exists in the current task partition.
In one embodiment, the processing module 14 may further obtain the total number of data records of the current data to be processed and a preset expected execution duration, determine the expected rate according to the total number of data records and the expected execution duration, obtain the current number of data records processed and the current execution duration, determine the current rate according to the current number of data records processed and the current execution duration, and adjust the fragmentation number according to the expected rate and the current rate.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as the data to be processed. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data processing method.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of when executing the computer program: acquiring data to be processed; pre-slicing the data to be processed to obtain slicing execution data corresponding to a plurality of task slices; robbing the execution authority of the current task fragment according to the fragment execution data of each task fragment; when the execution right of the current task slice is robbed, the data to be processed are cut and processed according to the slice execution data of the current task slice.
In one embodiment, when the processor executes the computer program to implement the above-described step of pre-slicing the data to be processed to obtain slice execution data corresponding to a plurality of task slices, the following steps are specifically implemented:
and performing pre-segmentation processing according to the preset segmentation number to obtain segmentation execution data of each task segmentation.
In one embodiment, the executing data of the current task segment includes a processing state, and when the processor executes the computer program to perform the step of robbing the executing authority of the current task segment according to the executing data of each task segment, the following steps are specifically implemented:
Rob a preset distributed lock;
when the distributed lock is robbed, determining the current task fragment according to the processing state of each task fragment;
the processing state of the current task segment is modified into processing to rob the execution authority of the current task segment.
In one embodiment, the method described above is applied to a distributed computing node, and the processor, when executing the computer program, specifically further implements the following steps:
reporting the node state to a preset health state table through a heartbeat thread;
periodically inquiring a health state table;
when it is found that a target node has not updated its corresponding node state within a preset time period, acquiring the processing state of the task slices processed by the target node;
and when the processing state of the task slices processed by the target node is in processing, resetting the processing state of the task slices processed by the target node to an initial state.
In one embodiment, each task segment includes at least one data record, and before the processor executes the computer program to implement the preemption preset distributed lock step, the following steps are specifically implemented:
inquiring whether unprocessed data records exist in the current task partition;
When the unprocessed data record exists in the current task partition, robbing the distributed lock;
the above-mentioned sliced execution data of current task slicing further includes the identification information of the last data record of the last data slicing, and when the processor executes the computer program to implement the above-mentioned steps of data slicing and processing the data to be processed according to the sliced execution data of current task slicing, the following steps are specifically implemented:
when the distributed lock is robbed, determining the query range of the current data segmentation according to the identification information of the last data record of the last data segmentation of the current task segmentation;
inquiring target data records of a specified number from the data to be processed according to the inquiring range of the current data segmentation;
modifying the processing state of the target data record into processing, and releasing the distributed lock;
performing data processing on the target data record;
and returning to the step of inquiring whether unprocessed data records exist in the current task segment until no unprocessed data records exist in the current task segment.
In one embodiment, the processor, when executing the computer program, specifically further implements the steps of:
When no unprocessed data record exists in the current task segment, inquiring whether unprocessed task segments exist in the data to be processed or not;
when unprocessed task fragments exist in the data to be processed, acquiring the next task fragment;
and returning to the step of inquiring whether unprocessed data records exist in the current task fragment or not by taking the next task fragment as the current task fragment.
In one embodiment, the processor, when executing the computer program, specifically further implements the steps of:
acquiring the total number of data records of the current data to be processed and the preset expected execution duration;
determining an expected rate according to the total number of the data records and the expected execution duration;
acquiring the number of the data records which are processed currently and the current execution time length;
determining a current rate according to the number of the data records which are processed currently and the current execution duration;
the number of slices is adjusted according to the desired rate and the current rate.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring data to be processed; pre-slicing the data to be processed to obtain slicing execution data corresponding to a plurality of task slices; robbing the execution authority of the current task fragment according to the fragment execution data of each task fragment; when the execution right of the current task slice is robbed, the data to be processed are cut and processed according to the slice execution data of the current task slice.
In one embodiment, when the computer program is executed by the processor to implement the above-described step of pre-slicing the data to be processed to obtain slice execution data corresponding to a plurality of task slices, the following steps are specifically implemented:
and performing pre-segmentation processing according to the preset segmentation number to obtain segmentation execution data of each task segmentation.
In one embodiment, the above-mentioned current task segment execution data includes a processing state, and when the computer program is executed by the processor to implement the above-mentioned step of robbing the execution authority of the current task segment according to the segment execution data of each task segment, the following steps are specifically implemented:
rob a preset distributed lock;
when the distributed lock is robbed, determining the current task fragment according to the processing state of each task fragment;
the processing state of the current task segment is modified into processing to rob the execution authority of the current task segment.
In one embodiment, the method described above is applied to a distributed computing node, and the computer program when executed by a processor specifically implements the steps of:
reporting the node state to a preset health state table through a heartbeat thread;
Periodically inquiring a health state table;
when it is found that a target node has not updated its corresponding node state within a preset time period, acquiring the processing state of the task slices processed by the target node;
and when the processing state of the task slices processed by the target node is in processing, resetting the processing state of the task slices processed by the target node to an initial state.
In one embodiment, each task segment includes at least one data record, and before the computer program is executed by the processor to implement the preemption preset distributed lock step, the following steps are specifically implemented:
inquiring whether unprocessed data records exist in the current task partition;
when the unprocessed data record exists in the current task partition, robbing the distributed lock;
the above-mentioned sliced execution data of current task slicing further includes the identification information of the last data record of the last data slicing, and when the computer program is executed by the processor to implement the above-mentioned steps of data slicing and processing the data to be processed according to the sliced execution data of current task slicing, the following steps are specifically implemented:
when the distributed lock is robbed, determining the query range of the current data segmentation according to the identification information of the last data record of the last data segmentation of the current task segmentation;
Inquiring target data records of a specified number from the data to be processed according to the inquiring range of the current data segmentation;
modifying the processing state of the target data record into processing, and releasing the distributed lock;
performing data processing on the target data record;
and returning to the step of inquiring whether unprocessed data records exist in the current task segment until no unprocessed data records exist in the current task segment.
In one embodiment, the computer program when executed by the processor, specifically further performs the steps of:
when no unprocessed data record exists in the current task segment, inquiring whether unprocessed task segments exist in the data to be processed or not;
when unprocessed task fragments exist in the data to be processed, acquiring the next task fragment;
and returning to the step of inquiring whether unprocessed data records exist in the current task fragment or not by taking the next task fragment as the current task fragment.
In one embodiment, the computer program when executed by the processor, specifically further performs the steps of:
acquiring the total number of data records of the current data to be processed and the preset expected execution duration;
determining an expected rate according to the total number of the data records and the expected execution duration;
Acquiring the number of the data records which are processed currently and the current execution time length;
determining a current rate according to the number of the data records which are processed currently and the current execution duration;
the number of slices is adjusted according to the desired rate and the current rate.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above embodiments merely represent several implementations of the present application; their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (10)

1. A method of data processing, the method comprising:
acquiring data to be processed;
performing pre-slicing processing on the data to be processed to obtain slicing execution data corresponding to a plurality of task slices;
robbing the execution authority of the current task fragment according to the fragment execution data of each task fragment;
when the execution right of the current task fragment is robbed, the data to be processed are subjected to data segmentation and processing according to the fragment execution data of the current task fragment.
2. The method of claim 1, wherein the pre-slicing the data to be processed to obtain sliced execution data corresponding to a plurality of task slices, includes:
and performing pre-segmentation processing according to the preset segmentation number to obtain segmentation execution data of each task segmentation.
3. The method of claim 1, wherein the slice execution data of the current task slice includes a processing state, and wherein preempting the execution right of the current task slice according to the slice execution data of each task slice comprises:
preempting a preset distributed lock;
when the distributed lock is preempted, determining the current task slice according to the processing state of each task slice; and
modifying the processing state of the current task slice to "processing", thereby preempting the execution right of the current task slice.
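As an illustrative aside (not part of the claims), the lock-then-claim sequence of claim 3 can be sketched as follows. A `threading.Lock` stands in for a real distributed lock (in practice this would be, e.g., a lock held in a shared store); the class and field names are hypothetical.

```python
import threading


class SliceCoordinator:
    """Sketch of claim 3: a node preempts a shared lock, selects an
    unclaimed task slice by its processing state, marks it
    'processing' to take the execution right, then releases the lock."""

    def __init__(self, slices: list[dict]):
        self._lock = threading.Lock()  # stand-in for a distributed lock
        self.slices = slices           # each: {"slice_id", "state", ...}

    def claim_slice(self):
        # Try to preempt the lock; non-blocking so a losing node retries.
        if not self._lock.acquire(blocking=False):
            return None
        try:
            for s in self.slices:
                if s["state"] == "initial":
                    s["state"] = "processing"  # take the execution right
                    return s
            return None  # no unclaimed slice remains
        finally:
            self._lock.release()
```

Because the state change happens while the lock is held, two nodes can never claim the same slice.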
4. The method of claim 3, wherein the method is applied to a distributed computing node, the method further comprising:
reporting a node state to a preset health state table through a heartbeat thread;
periodically querying the health state table;
when it is found that the node state of a target node has not been updated within a preset time period, acquiring the processing state of the task slice being processed by the target node; and
when the processing state of the task slice being processed by the target node is "processing", resetting that processing state to an initial state.
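As an illustrative aside (not part of the claims), the heartbeat-based recovery of claim 4 can be sketched with a health table mapping node ids to last-seen timestamps. The timeout value and field names are hypothetical.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds without an update => node presumed dead


def report_heartbeat(health_table: dict, node_id: str, now: float) -> None:
    """Each node's heartbeat thread periodically writes its last-seen time."""
    health_table[node_id] = now


def recover_stale_slices(health_table: dict, slices: list[dict],
                         now: float) -> set:
    """Claim 4 sketch: find nodes whose heartbeat has not been updated
    within the timeout window and reset their 'processing' slices to
    the initial state so other nodes can re-claim them."""
    dead = {n for n, t in health_table.items() if now - t > HEARTBEAT_TIMEOUT}
    for s in slices:
        if s.get("owner") in dead and s["state"] == "processing":
            s["state"] = "initial"  # make the slice claimable again
            s["owner"] = None
    return dead
```

A slice owned by a live node is left untouched, so recovery only affects work orphaned by a failed node.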
5. The method of claim 3, wherein each task slice includes at least one data record, the method further comprising, before preempting the preset distributed lock:
querying whether an unprocessed data record exists in the current task slice; and
when an unprocessed data record exists in the current task slice, preempting the distributed lock;
wherein the slice execution data of the current task slice further includes identification information of the last data record of the previous data segment, and performing data segmentation and processing on the data to be processed according to the slice execution data of the current task slice comprises:
when the distributed lock is preempted, determining a query range of a current data segment according to the identification information of the last data record of the previous data segment of the current task slice;
querying a specified number of target data records from the data to be processed according to the query range of the current data segment;
modifying the processing state of the target data records to "processing", and releasing the distributed lock;
performing data processing on the target data records; and
returning to the step of querying whether an unprocessed data record exists in the current task slice, until no unprocessed data record is found in the current task slice.
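As an illustrative aside (not part of the claims), the segment-by-segment loop of claim 5 can be sketched as follows: each pass derives the query range from the id of the last record of the previous segment, fetches a fixed-size batch of target records, and repeats until the slice has no unprocessed records. The batch size and record layout are hypothetical, and the per-record "work" is elided to a state change.

```python
BATCH_SIZE = 3  # specified number of records fetched per data segment


def process_slice(records: list[dict], slice_meta: dict) -> int:
    """Claim 5 sketch: repeatedly cut a task slice's records into data
    segments, using the last record id of the previous segment as the
    lower bound of the next query range. Returns the processed count."""
    processed = 0
    while True:
        last_id = slice_meta["last_record_id"]
        # Query range of the current segment: records after the
        # previous segment's last record that are still unprocessed.
        pending = [r for r in records
                   if (last_id is None or r["id"] > last_id)
                   and r["state"] == "unprocessed"]
        if not pending:
            slice_meta["state"] = "done"  # no unprocessed record remains
            return processed
        batch = pending[:BATCH_SIZE]      # target records of this segment
        for r in batch:
            r["state"] = "processing"     # mark before doing the real work
            r["state"] = "done"           # (actual processing elided)
        slice_meta["last_record_id"] = batch[-1]["id"]
        processed += len(batch)
```

Tracking only the last processed id keeps the per-slice state tiny while still letting any node resume the slice mid-way.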
6. The method of claim 5, wherein the method further comprises:
when no unprocessed data record exists in the current task slice, querying whether an unprocessed task slice exists in the data to be processed;
when an unprocessed task slice exists in the data to be processed, acquiring the next task slice; and
taking the next task slice as the current task slice and returning to the step of querying whether an unprocessed data record exists in the current task slice.
7. The method of claim 2, wherein the method further comprises:
acquiring the total number of data records of the current data to be processed and a preset expected execution duration;
determining an expected rate according to the total number of data records and the expected execution duration;
acquiring the number of data records that have been processed and the current execution duration;
determining a current rate according to the number of data records that have been processed and the current execution duration; and
adjusting the slice count according to the expected rate and the current rate.
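As an illustrative aside (not part of the claims), claim 7's rate comparison can be sketched as computing the two rates and scaling the slice count by their ratio. The proportional-scaling policy is an assumption of this sketch; the claim only requires that the adjustment depend on the two rates.

```python
def adjust_slice_count(total_records: int, expected_duration: float,
                       done_records: int, elapsed: float,
                       current_count: int) -> int:
    """Claim 7 sketch: expected rate = total / expected duration,
    current rate = done so far / elapsed; scale the slice count by
    how far the current rate lags (or leads) the expected rate."""
    expected_rate = total_records / expected_duration
    current_rate = done_records / elapsed
    scaled = round(current_count * expected_rate / current_rate)
    return max(1, scaled)  # never drop below one slice
```

For instance, if processing runs at half the expected rate, this policy doubles the slice count so more nodes can work in parallel; if it runs twice as fast, the count is halved.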
8. A data processing apparatus, the apparatus comprising:
an acquisition module, configured to acquire data to be processed;
a slicing module, configured to perform pre-slicing processing on the data to be processed to obtain slice execution data corresponding to a plurality of task slices;
a preemption module, configured to preempt an execution right of a current task slice according to the slice execution data of each task slice; and
a processing module, configured to, when the execution right of the current task slice is preempted, perform data segmentation and processing on the data to be processed according to the slice execution data of the current task slice.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202311749345.2A 2023-12-19 2023-12-19 Data processing method, device, computer equipment and storage medium Active CN117707779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311749345.2A CN117707779B (en) 2023-12-19 2023-12-19 Data processing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117707779A true CN117707779A (en) 2024-03-15
CN117707779B CN117707779B (en) 2024-06-21

Family ID: 90149468

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090240664A1 (en) * 2008-03-20 2009-09-24 Schooner Information Technology, Inc. Scalable Database Management Software on a Cluster of Nodes Using a Shared-Distributed Flash Memory
CN110866062A (en) * 2018-08-09 2020-03-06 菜鸟智能物流控股有限公司 Data synchronization method and device based on distributed cluster
CN115809146A (en) * 2022-12-21 2023-03-17 山石网科通信技术股份有限公司 Timing task execution method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant