CN111190703A - Real-time data processing method and device, computer equipment and storage medium - Google Patents

Real-time data processing method and device, computer equipment and storage medium

Info

Publication number
CN111190703A
CN111190703A (application CN201911277999.3A)
Authority
CN
China
Prior art keywords
task
initial
data
tasks
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911277999.3A
Other languages
Chinese (zh)
Other versions
CN111190703B (en)
Inventor
陈金路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ping An Medical Health Technology Service Co Ltd
Original Assignee
Ping An Medical and Healthcare Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Medical and Healthcare Management Co Ltd filed Critical Ping An Medical and Healthcare Management Co Ltd
Priority to CN201911277999.3A priority Critical patent/CN111190703B/en
Publication of CN111190703A publication Critical patent/CN111190703A/en
Application granted granted Critical
Publication of CN111190703B publication Critical patent/CN111190703B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/465Distributed object oriented systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present application relates to the field of big data processing, and in particular to a real-time data processing method, apparatus, computer device, and storage medium. The method comprises the following steps: acquiring real-time data to be processed, and generating a directed acyclic graph according to the dependency relationships of the real-time data to be processed; splitting the directed acyclic graph into initial tasks, and obtaining the task amount of each split initial task; allocating the initial tasks to different execution machines for processing; judging whether the difference between the task amounts of the initial tasks is greater than a preset threshold; when initial tasks whose difference is greater than the preset threshold exist, splitting the initial task with the larger task amount among them into a plurality of transition tasks; allocating the transition tasks to different threads of the corresponding execution machine for execution; and storing the transaction objects produced by each execution machine into the distributed storage system. The method improves processing efficiency and resource utilization.

Description

Real-time data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of big data processing technologies, and in particular, to a real-time data processing method and apparatus, a computer device, and a storage medium.
Background
Real-time stream processing has become a basic requirement of today's large companies: log processing, daily PV (page view) and UV (unique visitor) statistics, top-N popular commodities, top-N active zones, and so on. Grasping the key indicators of the last few minutes or hours in real time lets a company react to changing conditions immediately, for example by personalizing recommendations, increasing the weight of a certain commodity, or reducing the weight of certain products. Spark Streaming is popular with companies because of its powerful offline batch processing capability and its ability to process real-time data in small batches.
When the data volume of a message is large, for example a single message reaching 500 KB or even 1 MB, Spark Streaming becomes heavily loaded while processing these messages and often blocks, so batches of data back up, the memory eventually overflows, and the task is terminated. Moreover, the service types of the messages are varied, and the same topic carries several of them; the stream processing program cannot process such messages in uniform batches, so data skew easily arises during processing. Some tasks then take a long time to process, the processing of the whole batch is prolonged, and resources are wasted.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a real-time data processing method, apparatus, computer device and storage medium capable of improving data processing efficiency and resource utilization.
A real-time data processing method, the method comprising:
acquiring real-time data to be processed, and generating a directed acyclic graph according to a dependence mode of the real-time data to be processed;
splitting the directed acyclic graph to obtain initial tasks, and obtaining the task quantity of each split initial task;
distributing the initial tasks to different execution machines for processing;
judging whether the difference value between the task quantities of each initial task is larger than a preset threshold value or not;
when the initial tasks with the difference values larger than the preset threshold value exist, splitting the initial tasks with larger task quantity in the initial tasks with the difference values larger than the preset threshold value to obtain a plurality of transition tasks;
distributing the transition tasks to different threads of corresponding execution machines respectively for execution;
and storing the transaction objects obtained by the execution of each execution machine into the distributed storage system.
In one embodiment, the splitting of the initial task with the larger task amount, among the initial tasks whose difference is greater than the preset threshold, to obtain a plurality of transition tasks includes:
acquiring an executive machine distributed by an initial task with a difference value larger than a preset threshold value, and acquiring the number of threads which can be set currently by the executive machine;
determining an initial task with a larger task amount in the initial tasks with the difference value larger than a preset threshold value;
and splitting the initial task with larger task amount according to the thread number to obtain a plurality of transition tasks.
In one embodiment, the allocating the initial task to different execution machines for processing includes:
assigning the initial task to different execution machines;
acquiring the current operating environment of an executing machine to acquire an executing method corresponding to the initial task;
acquiring current data in the initial task;
calling an intermediate task execution method in the execution methods to process the current data to obtain an intermediate task;
calling a target task execution method in the execution method to process the intermediate task to obtain a transaction object;
and obtaining next data in the initial task as current data according to an iteration method, and continuing to call an intermediate task execution method in the execution method to process the current data to obtain an intermediate task until the data processing in the initial task is completed.
In one embodiment, the acquiring the real-time data to be processed includes:
inquiring whether the real-time data to be processed is cached;
if the real-time data to be processed is cached, reading the real-time data to be processed from the cache;
and if the to-be-processed real-time data is not cached, acquiring the to-be-processed real-time data in a preset data set.
In one embodiment, the method further comprises:
initializing an object of a preset storage system to define a data storage class of the preset storage system;
consuming the data storage class of the preset storage system in a direct connection mode to establish a preset data set corresponding to a preset storage partition in the data storage class;
analyzing the structured data stored in each preset partition in the preset storage system as a transaction type object, and storing the transaction type object into a corresponding preset data set.
In one embodiment, before consuming the data storage class of the preset storage system in a direct connection manner, the method further includes:
judging whether a corresponding distributed storage system is defined;
if a corresponding distributed storage system is not defined, the distributed storage system is defined along with a transaction type object that corresponds to the structured data.
In one embodiment, the parsing the structured data stored in each preset partition in the preset storage system into a transaction type object includes:
acquiring the maximum reading number in the set preset time;
reading the structured data of which the number is less than or equal to the maximum reading number from each preset partition in the preset storage system within preset time;
and analyzing the structured data stored in each preset partition in the preset storage system as a transaction type object.
A real-time data processing apparatus, the apparatus comprising:
the graph generation module is used for acquiring real-time data to be processed and generating a directed acyclic graph according to the dependence mode of the real-time data to be processed;
the first splitting module is used for splitting the directed acyclic graph to obtain initial tasks and acquiring the task quantity of each split initial task;
the first allocation module is used for allocating the initial tasks to different execution machines for processing;
the first judgment module is used for judging whether the difference value between the task quantities of each initial task is larger than a preset threshold value or not;
the second splitting module is used for splitting the initial task with larger task amount in the initial tasks with the difference value larger than the preset threshold value to obtain a plurality of transition tasks when the initial tasks with the difference value larger than the preset threshold value exist;
the second distribution module is used for distributing the transition tasks to different threads of corresponding execution machines respectively for execution;
and the storage module is used for storing the transaction objects obtained by the execution of each execution machine into the distributed storage system.
A computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the above.
After the initial tasks are generated, they are allocated to different execution machines for execution. To avoid the data skew caused when one execution machine processes an initial task with a much larger task amount, the task amounts of the initial tasks can be compared in advance, and among the initial tasks whose difference exceeds the preset threshold, the one with the larger task amount is split into a plurality of transition tasks, which are processed by several threads opened in the corresponding execution machine. This reduces the execution time of the initial task with the larger task amount, improves execution efficiency, prevents the initial tasks with smaller task amounts from waiting for the larger one to finish, reduces the waste of resources, and improves resource utilization.
Drawings
FIG. 1 is a diagram illustrating an exemplary implementation of a method for real-time data processing;
FIG. 2 is a schematic flow chart diagram illustrating a method for real-time data processing in one embodiment;
FIG. 3 is a flowchart of the steps in one embodiment for a server to store data in kafka into rdd objects;
FIG. 4 is a block diagram of a real-time data processing apparatus according to an embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The real-time data processing method provided by the application can be applied to the application environment shown in fig. 1. The preset storage system 102 may acquire and store data from the data source database; the server 104 may consume the data in the preset storage system 102 in a direct connection manner and then store the acquired data into the distributed storage system 108 in real time.
For ease of understanding, the application environment in fig. 1 is described first. It comprises two processing flows: the first stores data from the preset storage system 102 into the preset data set, and the second lands data from the preset data set into the distributed storage system 108. The data storage class of the preset storage system 102 is consumed in a direct connection mode with a batch pull interval of 15 s; that is, within each 15 s window the server processes the data in the preset data set, places the result into the distributed storage system 108, and reads new data from the preset storage system 102 into the preset data set, so that data of multiple service types in the preset storage system 102 is landed in the distributed storage system 108 in real time. When the data in the preset data set is landed into the distributed storage system 108, some tasks carry a large amount of data and take a long time to process, which prolongs the processing of the whole batch and lowers processing efficiency.
Therefore, the server acquires the real-time data to be processed and generates the directed acyclic graph according to its dependency relationships. It then splits the directed acyclic graph into initial tasks, obtains the task amount of each, and allocates the initial tasks to different execution machines for processing. An initial task with a larger task amount needs to be divided further and allocated to different threads in its execution machine, which shortens its processing time: when initial tasks whose difference is greater than a preset threshold exist, the initial task with the larger task amount among them is split into a plurality of transition tasks; the transition tasks are allocated to different threads of the corresponding execution machine for execution; and the transaction objects produced by each execution machine are stored into the distributed storage system. In this way, the execution time of the initial task with the larger task amount is reduced, execution efficiency is improved, initial tasks with smaller task amounts no longer wait for the larger one to finish, the waste of resources is reduced, and resource utilization is improved.
The preset storage system 102 may be a Kafka storage system; the server 104 may be implemented as an independent server or as a server cluster composed of a plurality of servers, with Spark Streaming integrated on the server; and the distributed storage system 108 may be a Kudu distributed storage system.
In one embodiment, as shown in fig. 2, a real-time data processing method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
s202: and acquiring real-time data to be processed, and generating a directed acyclic graph according to the dependence mode of the real-time data to be processed.
Specifically, the server may generate a directed acyclic graph (a DAG) in the Spark Streaming module according to the task-dependent operation manner, that is, through the processing transformations on the preset data sets, the rdd objects. For example, if rdd3 depends on rdd2 and rdd2 depends on rdd1, a corresponding directed acyclic graph can be generated from this operation dependency relationship: all rdd objects are obtained to generate a corresponding number of nodes, and connecting lines between the nodes are established according to the dependency relationships of the rdd objects, where the direction of a line represents the operation dependency of the rdd. In the example above there are three nodes, representing rdd1, rdd2, and rdd3, with rdd3 pointing to rdd2 and rdd2 pointing to rdd1.
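As an illustrative sketch (not Spark's actual DAG code), the node-and-edge construction described above can be expressed as:

```python
def build_dag(dependencies):
    # dependencies: (child, parent) pairs; an edge child -> parent means
    # "child depends on parent", as in rdd3 -> rdd2 -> rdd1.
    nodes, edges = set(), {}
    for child, parent in dependencies:
        nodes.update((child, parent))
        edges.setdefault(child, []).append(parent)
    return {"nodes": nodes, "edges": edges}

dag = build_dag([("rdd3", "rdd2"), ("rdd2", "rdd1")])
```

With the example dependencies above, the graph has three nodes and two directed edges, rdd3 pointing to rdd2 and rdd2 pointing to rdd1.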
In one embodiment, acquiring the real-time data to be processed may include: inquiring whether the real-time data to be processed is cached; if it is cached, reading it from the cache; and if it is not cached, acquiring it from the preset data set. Before calculation, the Spark Streaming module first checks the cache level, that is, checks according to the task-dependent operation manner whether the data currently to be landed is cached; if so, the data is read directly, and if not, the data in the rdd object is acquired and the DAG is generated. The Spark Streaming module may further determine whether a checkpoint exists and, if so, read the checkpoint data; the checkpoint is used to read back the cached data after recovery from a failure or downtime.
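The lookup order just described (cache first, then checkpoint, then the preset data set) can be sketched as follows; the three arguments are assumed stand-ins for Spark's internal storage levels, not the module's real API:

```python
def get_pending_data(cache, checkpoint, dataset):
    # Read from the cache if present; otherwise fall back to checkpoint
    # data (used after failure or downtime recovery); otherwise acquire
    # the data from the preset data set (the rdd object).
    if cache is not None:
        return cache
    if checkpoint is not None:
        return checkpoint
    return dataset
```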
S204: splitting the directed acyclic graph to obtain initial tasks, and obtaining the task quantity of each split initial task.
Specifically, splitting the directed acyclic graph may yield a plurality of initial tasks. The splitting may be performed according to the number of data records: for example, split first by the branches of the directed acyclic graph and then by the number of records in each branch, for instance grouping data of the same type into one initial task, or averaging according to the number of records to be processed and the number of available execution machines.
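One of the splitting strategies mentioned above, averaging the records over the available execution machines, can be sketched as follows (a simplification; the patent also allows splitting by branch or by data type):

```python
def split_into_initial_tasks(records, num_machines):
    # Round-robin the records over the execution machines; each non-empty
    # chunk becomes one initial task.
    chunks = [[] for _ in range(num_machines)]
    for i, record in enumerate(records):
        chunks[i % num_machines].append(record)
    return [chunk for chunk in chunks if chunk]
```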
The task amount of an initial task is the size of the data to be processed by that task; for example, if the data related to one initial task is 1 MB, the task amount of that initial task is 1 MB.
S206: and distributing the initial task to different execution machines for processing.
Specifically, the execution machine is the execution mechanism in Spark configured to execute the initial task and obtain the corresponding transaction object, so that the transaction object can be landed in the corresponding Kudu table, that is, in the distributed storage system.
S208: and judging whether the difference value between the task quantities of each initial task is larger than a preset threshold value.
S210: and when the initial tasks with the difference values larger than the preset threshold value exist, splitting the initial tasks with larger task quantity in the initial tasks with the difference values larger than the preset threshold value to obtain a plurality of transition tasks.
S212: and distributing the excessive tasks to different threads of corresponding execution machines for execution.
Specifically, the idea of the present application is to prevent one initial task from being processed for a long time because of data skew, which would prolong the processing of the whole batch of initial tasks. The approach is as follows. The task amount of each split initial task is calculated first; since the splitting is done according to the number of records, the task amounts of the initial tasks differ. Before the execution machines process the data, the task amount of each initial task, measured by its data size, is obtained, and the differences between the task amounts are determined in order to judge whether the processing times of the initial tasks will differ greatly. If the difference is large, that is, there are initial tasks whose difference exceeds the preset threshold, the initial task with the large task amount is split into transition tasks, and each transition task is allocated to a newly created thread to be executed into a transaction object. This avoids the situation in which one thread in one execution machine spends a long time on a task with a large task amount while the other execution machines sit idle after finishing their own initial tasks.
In the calculation, the server may first obtain the task amount of each initial task and sort the tasks in descending or ascending order, then calculate the difference between the initial task with the largest task amount and the one with the smallest. If the difference is not greater than the preset threshold, the execution machines may process the initial tasks directly to obtain transaction objects, which are stored into the distributed storage system. If the difference between the largest and the smallest is greater than the preset threshold, the initial task with the largest task amount is split into a plurality of transition tasks, which are allocated to a plurality of threads in the execution machine for execution. After the largest initial task has been split, the server continues with the initial task that now has the largest task amount (the one previously ranked second), calculates its difference from the smallest, and repeats until the difference between the current largest and smallest initial tasks is no longer greater than the preset threshold; the resulting transition tasks and the remaining unsplit initial tasks are then executed.
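The sort-compare-split loop described above can be written as the following illustrative sketch (not the patent's actual implementation); `pieces` is an assumed stand-in for the thread count of the executor that will run the transition tasks:

```python
def balance_tasks(amounts, threshold, pieces=2):
    # Sort the task amounts in descending order, then repeatedly compare
    # the largest with the smallest; while the difference exceeds the
    # preset threshold, split the largest initial task into `pieces`
    # transition tasks and move on to the next-largest.
    tasks = sorted(amounts, reverse=True)
    transition = []
    while tasks and tasks[0] - tasks[-1] > threshold:
        largest = tasks.pop(0)
        transition.extend([largest / pieces] * pieces)
    return transition + tasks
```

For instance, with task amounts [100, 30, 20] and a threshold of 50, only the 100-unit task is split; 30 and 20 differ by less than the threshold and are left alone.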
S214: and storing the transaction objects obtained by the execution of each execution machine into the distributed storage system.
Specifically, because the Kudu tables in the distributed storage system and the transaction objects are predefined, and the correspondence between transaction objects and Kudu tables is established, that is, a transaction object of a given type corresponds to one Kudu table and so on, a transaction object can be stored directly into the corresponding Kudu table according to this correspondence.
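As a hypothetical illustration (the type and table names below are invented, not from the patent), the predefined correspondence reduces landing a batch to a lookup plus an insert:

```python
# Hypothetical mapping; the patent only states that each transaction-object
# type is predefined to correspond to one Kudu table.
TABLE_FOR_TYPE = {"order": "kudu_orders", "payment": "kudu_payments"}

def land(transaction_objects, storage):
    # Store every transaction object into the table its type maps to;
    # `storage` is a dict standing in for the distributed storage system.
    for obj in transaction_objects:
        storage.setdefault(TABLE_FOR_TYPE[obj["type"]], []).append(obj)
    return storage
```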
According to the real-time data processing method above, after the initial tasks are generated they are allocated to different execution machines for execution. To avoid the data skew caused when one execution machine processes an initial task with a much larger task amount, the task amounts of the initial tasks are compared in advance, and among the initial tasks whose difference exceeds the preset threshold, the one with the larger task amount is split into a plurality of transition tasks, which are processed by several threads opened in the corresponding execution machine. This reduces the execution time of the initial task with the larger task amount, improves execution efficiency, prevents the initial tasks with smaller task amounts from waiting for the larger one to finish, reduces the waste of resources, and improves resource utilization.
In one embodiment, splitting the initial task with the larger task amount, among the initial tasks whose difference is greater than the preset threshold, to obtain a plurality of transition tasks includes: acquiring the execution machine to which the initial task whose difference is greater than the preset threshold is allocated, and acquiring the number of threads the execution machine can currently set; determining the initial task with the larger task amount among those initial tasks; and splitting that initial task according to the thread number to obtain a plurality of transition tasks.
Specifically, when splitting a task with a large task amount, the number of threads that the current physical hardware can provide may be obtained, and the task is then split according to the number of remaining threads: for example, if the number of remaining available threads is n, the task is split into n + 1 transition tasks. Because the number of threads is limited by the physical hardware, the number of concurrent tasks in the thread pool needs to be set according to the actual conditions. The server may establish a thread pool, place all threads of an execution machine in the pool, and mark the state of each thread, such as available or unavailable, so that the number of threads that can be set is obtained from the thread states.
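A minimal sketch of this step, assuming Python's standard thread pool stands in for the execution machine's threads: with n free threads the oversized task becomes n + 1 transition tasks, each executed on its own thread.

```python
import concurrent.futures

def split_for_threads(task_records, free_threads):
    # Per the description above: with n free threads, the oversized task
    # is split into n + 1 transition tasks (round-robin over the records).
    pieces = free_threads + 1
    return [task_records[i::pieces] for i in range(pieces)]

def run_transition_tasks(transition_tasks, process):
    # Execute each transition task on its own thread of the thread pool.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(transition_tasks)) as pool:
        return list(pool.map(lambda task: [process(r) for r in task], transition_tasks))
```

`process` stands in for whatever per-record work the execution machine performs; `pool.map` preserves the order of the transition tasks in its results.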
In this embodiment, adding threads when processing the real-time data greatly improves the processing efficiency of the messages and avoids the long overall processing time otherwise caused by some tasks carrying a large amount of data.
In one embodiment, assigning the initial task to different execution machines for processing includes: distributing the initial tasks to different execution machines; acquiring the current operating environment of an execution machine to acquire an execution method corresponding to the initial task; acquiring current data in an initial task; calling an intermediate task execution method in the execution method to process the current data to obtain an intermediate task; calling a target task execution method in the execution method to process the intermediate task to obtain a transaction object; and obtaining next data in the initial task as current data according to the iteration method, and continuing to call an intermediate task execution method in the execution method to process the current data to obtain an intermediate task until the data processing in the initial task is completed.
Specifically, when the execution machine executes, it first obtains the operating environment and then processes the first record of the initial task or transition task: an intermediate task is generated from the first record through the run method, and the intermediate task is processed through the run method of the final task, ResultTask, to obtain the transaction object of the final task. The rdd iteration method is then called to process the second record of the initial task or transition task, and so on until all data has been processed, that is, until the current initial task or transition task is completed.
In practical application, when the execution machine executes, it first acquires the running environment and then calls the run method of a task to start executing. Two kinds of task exist during execution: ShuffleMapTask and ResultTask. Every intermediate stage executed in the directed acyclic graph (DAG) generates a ShuffleMapTask, while a ResultTask is generated for the partition of the final result. The execution machine computes through the run method of the generated intermediate ShuffleMapTask, or through the run method of the final ResultTask to obtain the final result, namely the transaction object DF. The computation proceeds through the iterator method of the rdd until all data processing is finished.
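The per-record loop described above can be sketched as follows; `shuffle_map` and `result_step` are assumed stand-ins for the run methods of ShuffleMapTask and ResultTask, not Spark's real API:

```python
def run_task(records, shuffle_map, result_step):
    # For each record: the ShuffleMapTask-style step produces an
    # intermediate result, and the ResultTask-style step turns it into a
    # transaction object; the iterator advances until all records are done.
    transaction_objects = []
    for record in iter(records):
        intermediate = shuffle_map(record)
        transaction_objects.append(result_step(intermediate))
    return transaction_objects
```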
In the embodiment, the data of the tasks are sequentially processed in an iterative manner, so that the processing sequence of the data is ensured, no confusion occurs, and the processing efficiency is improved.
In one embodiment, the real-time data processing method may further include: initializing an object of a preset storage system to define a data storage class of the preset storage system; consuming a data storage class of a preset storage system in a direct connection mode to establish a preset data set corresponding to a preset storage partition in the data storage class; analyzing the structured data stored in each preset partition in the preset storage system to be transaction type objects, and storing the transaction type objects into corresponding preset data sets.
In one embodiment, before consuming the data storage class of the preset storage system in a direct connection manner, the method further includes: judging whether a corresponding distributed storage system is defined; if a corresponding distributed storage system is not defined, the distributed storage system is defined along with a transaction type object, the transaction type object corresponding to the structured data.
In one embodiment, parsing the structured data stored in each preset partition in the preset storage system into transaction type objects includes: acquiring the set maximum reading number within a preset time; reading, within the preset time, structured data whose quantity is less than or equal to the maximum reading number from each preset partition in the preset storage system; and parsing the structured data stored in each preset partition in the preset storage system into transaction type objects.
Specifically, referring to fig. 3, fig. 3 is a flowchart illustrating the steps by which the server, in one embodiment, stores data from kafka into an rdd object, which may specifically include:
Firstly, initializing objects of the preset storage system to define data storage classes of the preset storage system, i.e., initializing a kafkaProgramm object to define the kafka parameters, including the serialization mode, bootstrap, topic, offset, and the like.
Secondly, consuming the data storage class of the preset storage system in a direct connection mode to establish a preset data set corresponding to a preset storage partition in the data storage class, i.e., the Spark Streaming module consumes the kafka topic in a direct connection mode.
Specifically, this step includes: the Spark Streaming module first obtains the number of partitions in the kafka topic and then establishes corresponding rdd partitions, i.e., one kafka partition corresponds to one rdd partition, so that the Spark Streaming module submits read requests with increased concurrency in an asynchronous acknowledgement mode to read the data in kafka into the partitions of the rdd. That is, a listener method is used on kafka and message callbacks are awaited asynchronously. If there are several kafka partitions, they correspond to as many rdd partitions and hence to as many concurrent threads, which increases the parallelism of Spark Streaming processing and improves processing efficiency. The kafka topic is consumed in a direct connection mode with a pull batch interval of 15s, and kafka is docked through the subscribe mode of spark, enable.
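The "one kafka partition, one rdd partition, one concurrent thread" idea above can be sketched in plain Python (no Kafka or Spark involved; the reader function is a hypothetical stand-in for pulling and deserializing a partition's messages):

```python
from concurrent.futures import ThreadPoolExecutor

def consume_partitions(partitions):
    """Read every partition concurrently, one worker per partition,
    mirroring the one-kafka-partition-to-one-rdd-partition mapping."""
    def read_partition(messages):
        # Stand-in for pulling and deserializing one partition's messages
        return [msg.upper() for msg in messages]
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # map() keeps results in partition order, like rdd partitions
        return list(pool.map(read_partition, partitions))

rdd_like = consume_partitions([["a", "b"], ["c"], ["d", "e"]])
```

Three input partitions yield three workers and three result partitions, which is why adding kafka partitions directly increases processing parallelism.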
Thirdly, parsing the structured data stored in each preset partition in the preset storage system into transaction type objects and storing the transaction type objects into the corresponding preset data sets, i.e., the Spark Streaming module parses the Json data stored in each kafka partition into transaction type objects and stores them into the corresponding rdd object.
The transaction type object here corresponds to the form in which the data are stored, i.e., to the kudu table. When the Spark Streaming module parses the Json data, it obtains a corresponding parsing method according to the message type of the Json data, then parses the Json data into the corresponding transaction type through that parsing method and stores it into the rdd object; the rdd object is a logical processing concept.
Preferably, before processing, the Spark Streaming module first defines a kudu table and then defines a transaction type object for storing the parsed Json messages, where the Json messages are sent to kafka for storage by other databases and the like. The Spark Streaming module determines whether the kudu table has been created; if not, it creates the table, and otherwise it starts processing, namely the second and third steps above.
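The check-then-create step can be sketched as follows; `FakeClient` is a hypothetical in-memory stand-in for a kudu client, used only to show the control flow, not a real kudu API:

```python
class FakeClient:
    """Hypothetical in-memory stand-in for a kudu client."""
    def __init__(self):
        self.tables = {}
    def table_exists(self, name):
        return name in self.tables
    def create_table(self, name, schema):
        self.tables[name] = schema
    def open_table(self, name):
        return self.tables[name]

def ensure_table(client, name, schema):
    # Create the table only if it does not exist yet; otherwise reuse it,
    # mirroring the "determine whether the kudu table is created" check
    if not client.table_exists(name):
        client.create_table(name, schema)
    return client.open_table(name)

client = FakeClient()
first = ensure_table(client, "transactions", {"id": "int64"})
second = ensure_table(client, "transactions", {"id": "int64"})
```

Calling `ensure_table` twice creates the table only once, so the check is safe to run at the start of every batch.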
In one embodiment, the maximum amount of data pulled per second by each partition may also be set, together with a backpressure mechanism. The Spark Streaming module acquires the real-time processing efficiency and adjusts the number of records pulled per second by each partition accordingly, the adjusted number never exceeding the set maximum. This avoids the situation in which, when spark submits a task, the initialization time is long, the amount of data pulled at once is large, and the processing pressure is high. A timeout may also be set so that spark is not prevented from pulling data by a kafka problem.
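A minimal sketch of such a rate adjustment follows, assuming efficiency is measured as the fraction of the last batch processed in time; the growth factor and the formula are illustrative assumptions, not Spark's actual backpressure algorithm:

```python
def next_pull_rate(current_rate, processed, pulled, max_rate):
    """Scale the per-partition pull rate by observed efficiency,
    never exceeding the configured maximum."""
    if pulled == 0:
        return current_rate                      # no signal: keep the rate
    efficiency = processed / pulled              # 1.0 means we kept up fully
    proposed = int(current_rate * efficiency * 1.1)  # grow a little when keeping up
    return max(1, min(proposed, max_rate))       # clamp to [1, max_rate]

# Only 80% of the last batch was processed in time, so the rate drops
rate = next_pull_rate(current_rate=1000, processed=800, pulled=1000, max_rate=1200)
```

The cap (`max_rate`) is what keeps the first batch after a long initialization from pulling an overwhelming amount of data.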
In the method, a direct connection mode is adopted, so parallelism can be increased by adding threads and processing efficiency improved. When spark, i.e., the server, pulls data from kafka, a backpressure mechanism is adopted and a maximum is set, so the number of records pulled per second by each partition can be adjusted dynamically, improving data processing efficiency.
It should be understood that although the various steps in the flowcharts of fig. 2-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in fig. 2-3 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; the order of their performance is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided a real-time data processing apparatus including: a graph generating module 100, a first splitting module 200, a first allocating module 300, a first determining module 400, a second splitting module 500, a second allocating module 600, and a storage module 700, wherein:
the graph generating module 100 is configured to obtain real-time data to be processed, and generate a directed acyclic graph according to a dependency manner of the real-time data to be processed.
The first splitting module 200 is configured to split the directed acyclic graph to obtain initial tasks, and obtain a task amount of each split initial task.
The first allocating module 300 is used for allocating the initial task to different execution machines for processing.
The first determining module 400 is configured to determine whether a difference between task amounts of each initial task is greater than a preset threshold.
The second splitting module 500 is configured to split the initial task with a larger task amount from the initial tasks with the difference values larger than the preset threshold value to obtain a plurality of transition tasks when the initial tasks with the difference values larger than the preset threshold value exist.
The second allocating module 600 is configured to allocate the transition tasks to different threads of the corresponding execution machines for execution.
The storage module 700 is configured to store the transaction object obtained by the execution of each execution machine into the distributed storage system.
In one embodiment, the second splitting module 500 comprises:
and the thread number acquisition unit is used for acquiring the execution machine allocated to the initial task of which the difference value is greater than the preset threshold value and acquiring the thread number which can be set currently by the execution machine.
And the determining unit is used for determining the initial task with larger task amount in the initial tasks with the difference value larger than the preset threshold value.
And the splitting unit is used for splitting the initial task with larger task amount according to the thread number to obtain a plurality of transition tasks.
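The splitting unit's behavior can be sketched as an even partition of a task's records across the available threads; this is a plain illustration, and the near-equal chunking rule is an assumption about how the split is carried out:

```python
def split_task(records, n_threads):
    """Split one large task into n_threads transition tasks of
    near-equal size (the first chunks absorb any remainder)."""
    size, extra = divmod(len(records), n_threads)
    chunks, start = [], 0
    for i in range(n_threads):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks

# A 10-record task split for an execution machine with 3 available threads
transition_tasks = split_task(list(range(10)), 3)
```

Each resulting chunk is one transition task, handed to one thread of the execution machine, so no thread is left idle while another carries the whole task.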
In one embodiment, the first assignment module 300 may include:
and the distribution unit is used for distributing the initial tasks to different execution machines.
And the execution method acquisition unit is used for acquiring the current operating environment of the execution machine so as to acquire the execution method corresponding to the initial task.
And the current data acquisition unit is used for acquiring current data in the initial task.
And the intermediate task generating unit is used for calling an intermediate task execution method in the execution methods to process the current data to obtain an intermediate task.
And the transaction object generating unit is used for calling a target task execution method in the execution methods to process the intermediate task to obtain a transaction object.
And the iteration unit is used for acquiring next data in the initial task as current data according to the iteration method, and continuing to call an intermediate task execution method in the execution method to process the current data to obtain an intermediate task until the data processing in the initial task is finished.
In one embodiment, the graph generation module 100 may include:
and the query unit is used for querying whether the real-time data to be processed is cached.
And the cache reading unit is used for reading the real-time data to be processed from the cache if the real-time data to be processed is cached.
And the data set reading unit is used for acquiring the to-be-processed real-time data in the preset data set if the to-be-processed real-time data is not cached.
In one embodiment, the real-time data processing apparatus may further include:
the initialization module is used for initializing the object of the preset storage system so as to define the data storage class of the preset storage system.
And the consumption module is used for consuming the data storage class of the preset storage system in a direct connection mode so as to establish a preset data set corresponding to the preset storage partition in the data storage class.
And the preset data set generating module is used for analyzing the structured data stored in each preset partition in the preset storage system into transaction type objects and storing the transaction type objects into the corresponding preset data sets.
In one embodiment, the real-time data processing apparatus may further include:
and the second judging module is used for judging whether the corresponding distributed storage system is defined.
A definition module to define the distributed storage system and a transaction type object if the corresponding distributed storage system is not defined, the transaction type object corresponding to the structured data.
In one embodiment, the preset data set generating module comprises:
and the threshold value acquisition unit is used for acquiring the maximum reading number in the set preset time.
And the data reading unit is used for reading the structured data of which the number is less than or equal to the maximum reading number from each preset partition in the preset storage system within preset time.
And the analysis unit is used for analyzing the structured data stored in each preset partition in the preset storage system into the transaction type object.
For specific limitations of the real-time data processing apparatus, reference may be made to the above limitations of the real-time data processing method, which are not described herein again. The respective modules in the above-mentioned real-time data processing apparatus may be wholly or partially implemented by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing real-time data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a real-time data processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when executing the computer program: acquiring real-time data to be processed, and generating a directed acyclic graph according to a dependence mode of the real-time data to be processed; splitting the directed acyclic graph to obtain initial tasks, and acquiring the task amount of each split initial task; distributing the initial tasks to different execution machines for processing; judging whether the difference value between the task amounts of the initial tasks is larger than a preset threshold value; when initial tasks with difference values larger than the preset threshold value exist, splitting the initial task with the larger task amount among them to obtain a plurality of transition tasks; distributing the transition tasks to different threads of the corresponding execution machines respectively for execution; and storing the transaction objects obtained by the execution of each execution machine into the distributed storage system.
In one embodiment, splitting the initial task with the larger task amount among the initial tasks with difference values larger than the preset threshold value to obtain a plurality of transition tasks, implemented when the processor executes the computer program, includes: acquiring the execution machine to which the initial task with the difference value larger than the preset threshold value is allocated, and acquiring the number of threads that the execution machine can currently set; determining the initial task with the larger task amount among the initial tasks with difference values larger than the preset threshold value; and splitting the initial task with the larger task amount according to the thread number to obtain a plurality of transition tasks.
In one embodiment, the allocation of the initial tasks to different execution machines for processing, as implemented by the processor executing the computer program, comprises: distributing the initial tasks to different execution machines; acquiring the current operating environment of an execution machine to acquire an execution method corresponding to the initial task; acquiring current data in an initial task; calling an intermediate task execution method in the execution method to process the current data to obtain an intermediate task; calling a target task execution method in the execution method to process the intermediate task to obtain a transaction object; and obtaining next data in the initial task as current data according to the iteration method, and continuing to call an intermediate task execution method in the execution method to process the current data to obtain an intermediate task until the data processing in the initial task is completed.
In one embodiment, the obtaining of the real-time data to be processed, which is performed by the processor when executing the computer program, comprises: inquiring whether the real-time data to be processed is cached; if the real-time data to be processed is cached, reading the real-time data to be processed from the cache; and if the to-be-processed real-time data is not cached, acquiring the to-be-processed real-time data in the preset data set.
In one embodiment, the processor, when executing the computer program, further performs the steps of: initializing an object of a preset storage system to define a data storage class of the preset storage system; consuming the data storage class of the preset storage system in a direct connection mode to establish a preset data set corresponding to a preset storage partition in the data storage class; and parsing the structured data stored in each preset partition in the preset storage system into transaction type objects and storing the transaction type objects into the corresponding preset data sets.
In one embodiment, before consuming the data storage class of the preset storage system in a direct connection manner when the processor executes the computer program, the method further includes: judging whether a corresponding distributed storage system is defined; if a corresponding distributed storage system is not defined, the distributed storage system is defined along with a transaction type object, the transaction type object corresponding to the structured data.
In one embodiment, parsing the structured data stored in each preset partition in the preset storage system into transaction type objects, implemented when the processor executes the computer program, includes: acquiring the set maximum reading number within a preset time; reading, within the preset time, structured data whose quantity is less than or equal to the maximum reading number from each preset partition in the preset storage system; and parsing the structured data stored in each preset partition in the preset storage system into transaction type objects.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor performs the steps of: acquiring real-time data to be processed, and generating a directed acyclic graph according to a dependence mode of the real-time data to be processed; splitting the directed acyclic graph to obtain initial tasks, and acquiring the task amount of each split initial task; distributing the initial tasks to different execution machines for processing; judging whether the difference value between the task amounts of the initial tasks is larger than a preset threshold value; when initial tasks with difference values larger than the preset threshold value exist, splitting the initial task with the larger task amount among them to obtain a plurality of transition tasks; distributing the transition tasks to different threads of the corresponding execution machines respectively for execution; and storing the transaction objects obtained by the execution of each execution machine into the distributed storage system.
In one embodiment, splitting the initial task with the larger task amount among the initial tasks with difference values larger than the preset threshold value to obtain a plurality of transition tasks, implemented when the computer program is executed by the processor, includes: acquiring the execution machine to which the initial task with the difference value larger than the preset threshold value is allocated, and acquiring the number of threads that the execution machine can currently set; determining the initial task with the larger task amount among the initial tasks with difference values larger than the preset threshold value; and splitting the initial task with the larger task amount according to the thread number to obtain a plurality of transition tasks.
In one embodiment, the allocation of the initial tasks to different execution machines for processing, as implemented by the computer program when executed by the processor, comprises: distributing the initial tasks to different execution machines; acquiring the current operating environment of an execution machine to acquire an execution method corresponding to the initial task; acquiring current data in an initial task; calling an intermediate task execution method in the execution method to process the current data to obtain an intermediate task; calling a target task execution method in the execution method to process the intermediate task to obtain a transaction object; and obtaining next data in the initial task as current data according to the iteration method, and continuing to call an intermediate task execution method in the execution method to process the current data to obtain an intermediate task until the data processing in the initial task is completed.
In one embodiment, the obtaining of the real-time data to be processed, which is implemented when the computer program is executed by the processor, comprises: inquiring whether the real-time data to be processed is cached; if the real-time data to be processed is cached, reading the real-time data to be processed from the cache; and if the to-be-processed real-time data is not cached, acquiring the to-be-processed real-time data in the preset data set.
In one embodiment, the computer program when executed by the processor further performs the steps of: initializing an object of a preset storage system to define a data storage class of the preset storage system; consuming the data storage class of the preset storage system in a direct connection mode to establish a preset data set corresponding to a preset storage partition in the data storage class; and parsing the structured data stored in each preset partition in the preset storage system into transaction type objects and storing the transaction type objects into the corresponding preset data sets.
In one embodiment, before the computer program is executed by the processor to consume the data storage class of the preset storage system in a direct connection manner, the method further includes: judging whether a corresponding distributed storage system is defined; if a corresponding distributed storage system is not defined, the distributed storage system is defined along with a transaction type object, the transaction type object corresponding to the structured data.
In one embodiment, parsing the structured data stored in each preset partition in the preset storage system into transaction type objects, implemented when the computer program is executed by the processor, includes: acquiring the set maximum reading number within a preset time; reading, within the preset time, structured data whose quantity is less than or equal to the maximum reading number from each preset partition in the preset storage system; and parsing the structured data stored in each preset partition in the preset storage system into transaction type objects.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several embodiments of the present application, and their description is specific and detailed, but should not therefore be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of real-time data processing, the method comprising:
acquiring real-time data to be processed, and generating a directed acyclic graph according to a dependence mode of the real-time data to be processed;
splitting the directed acyclic graph to obtain initial tasks, and obtaining the task quantity of each split initial task;
distributing the initial tasks to different execution machines for processing;
judging whether the difference value between the task quantities of each initial task is larger than a preset threshold value or not;
when the initial tasks with the difference values larger than the preset threshold value exist, splitting the initial tasks with larger task quantity in the initial tasks with the difference values larger than the preset threshold value to obtain a plurality of transition tasks;
distributing the transition tasks to different threads of the corresponding execution machines respectively for execution;
and storing the transaction objects obtained by the execution of each execution machine into the distributed storage system.
2. The method according to claim 1, wherein the splitting of the initial task with a larger task amount from the initial tasks with the difference values larger than the preset threshold value to obtain a plurality of transition tasks comprises:
acquiring an executive machine distributed by an initial task with a difference value larger than a preset threshold value, and acquiring the number of threads which can be set currently by the executive machine;
determining an initial task with a larger task amount in the initial tasks with the difference value larger than a preset threshold value;
and splitting the initial task with larger task amount according to the thread number to obtain a plurality of transition tasks.
3. The method of claim 1, wherein the assigning the initial task to different execution machines for processing comprises:
assigning the initial task to different execution machines;
acquiring the current operating environment of an executing machine to acquire an executing method corresponding to the initial task;
acquiring current data in the initial task;
calling an intermediate task execution method in the execution methods to process the current data to obtain an intermediate task;
calling a target task execution method in the execution method to process the intermediate task to obtain a transaction object;
and obtaining next data in the initial task as current data according to an iteration method, and continuing to call an intermediate task execution method in the execution method to process the current data to obtain an intermediate task until the data processing in the initial task is completed.
4. The method of claim 1, wherein the obtaining real-time data to be processed comprises:
inquiring whether the real-time data to be processed is cached;
if the real-time data to be processed is cached, reading the real-time data to be processed from the cache;
and if the to-be-processed real-time data is not cached, acquiring the to-be-processed real-time data in a preset data set.
5. The method of claim 4, further comprising:
initializing an object of a preset storage system to define a data storage class of the preset storage system;
consuming the data storage class of the preset storage system in a direct connection mode to establish a preset data set corresponding to a preset storage partition in the data storage class;
parsing the structured data stored in each preset partition in the preset storage system into transaction type objects, and storing the transaction type objects into corresponding preset data sets.
6. The method of claim 5, wherein before consuming the data storage class of the preset storage system by direct connection, further comprising:
judging whether a corresponding distributed storage system is defined;
if a corresponding distributed storage system is not defined, the distributed storage system is defined along with a transaction type object that corresponds to the structured data.
7. The method according to claim 5, wherein the parsing the structured data stored in each pre-defined partition in the pre-defined storage system into transaction type objects comprises:
acquiring the set maximum reading number within a preset time;
reading the structured data of which the number is less than or equal to the maximum reading number from each preset partition in the preset storage system within preset time;
and parsing the structured data stored in each preset partition in the preset storage system into transaction type objects.
8. A real-time data processing apparatus, characterized in that the apparatus comprises:
the graph generation module is used for acquiring real-time data to be processed and generating a directed acyclic graph according to the dependence mode of the real-time data to be processed;
the first splitting module is used for splitting the directed acyclic graph to obtain initial tasks and acquiring the task quantity of each split initial task;
the first allocation module is used for allocating the initial tasks to different execution machines for processing;
the first judgment module is used for judging whether the difference value between the task quantities of each initial task is larger than a preset threshold value or not;
the second splitting module is used for splitting the initial task with larger task amount in the initial tasks with the difference value larger than the preset threshold value to obtain a plurality of transition tasks when the initial tasks with the difference value larger than the preset threshold value exist;
the second distribution module is used for distributing the transition tasks to different threads of the corresponding execution machines respectively for execution;
and the storage module is used for storing the transaction objects obtained by the execution of each execution machine into the distributed storage system.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201911277999.3A 2019-12-11 2019-12-11 Real-time data processing method and device, computer equipment and storage medium Active CN111190703B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911277999.3A CN111190703B (en) 2019-12-11 2019-12-11 Real-time data processing method and device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111190703A true CN111190703A (en) 2020-05-22
CN111190703B CN111190703B (en) 2023-02-07

Family

ID=70711032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911277999.3A Active CN111190703B (en) 2019-12-11 2019-12-11 Real-time data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111190703B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148455A (en) * 2020-09-29 2020-12-29 星环信息科技(上海)有限公司 Task processing method, device and medium
CN113626207A (en) * 2021-10-12 2021-11-09 苍穹数码技术股份有限公司 Map data processing method, device, equipment and storage medium
WO2022083197A1 (en) * 2020-10-22 2022-04-28 北京锐安科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN115277221A (en) * 2022-07-29 2022-11-01 深圳市风云实业有限公司 Transmission method and isolation device based on transparent data landing and protocol isolation
CN116628428A (en) * 2023-07-24 2023-08-22 华能信息技术有限公司 Data processing method and system
CN116663860A (en) * 2023-07-27 2023-08-29 深圳昊通技术有限公司 Task allocation method and system for project demands and readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564988A (en) * 2018-03-20 2018-09-21 深圳中兴网信科技有限公司 Archives storage method, profile storage system based on OpenEHR
CN108829508A (en) * 2018-03-30 2018-11-16 北京趣拿信息技术有限公司 task processing method and device
WO2019019381A1 (en) * 2017-07-25 2019-01-31 平安科技(深圳)有限公司 Batch processing method and apparatus for insurance slip tasks, computer device and storage medium
CN109558237A (en) * 2017-09-27 2019-04-02 北京国双科技有限公司 A kind of task status management method and device
CN109783232A (en) * 2018-12-21 2019-05-21 王家万 Video data handling procedure, device and storage medium
CN109814986A (en) * 2017-11-20 2019-05-28 上海寒武纪信息科技有限公司 Task method for parallel processing, storage medium, computer equipment, device and system
CN110232087A (en) * 2019-05-30 2019-09-13 湖南大学 Big data increment iterative method, apparatus, computer equipment and storage medium
CN110321223A (en) * 2019-07-03 2019-10-11 湖南大学 The data flow division methods and device of Coflow work compound stream scheduling perception
WO2019218454A1 (en) * 2018-05-16 2019-11-21 平安科技(深圳)有限公司 Subscription report generation method and apparatus, computer device and storage medium



Also Published As

Publication number Publication date
CN111190703B (en) 2023-02-07

Similar Documents

Publication Publication Date Title
CN111190703B (en) Real-time data processing method and device, computer equipment and storage medium
CN109039937B (en) Dynamic current limiting method, dynamic current limiting device, computer equipment and storage medium
CN110232087B (en) Big data increment iteration method and device, computer equipment and storage medium
WO2019052225A1 (en) Open platform control method and system, computer device, and storage medium
CN111225050B (en) Cloud computing resource allocation method and device
CN108205469B (en) MapReduce-based resource allocation method and server
CN111800459A (en) Asynchronous processing method, device and system for download task and storage medium
CN112689007B (en) Resource allocation method, device, computer equipment and storage medium
CN114625507B (en) Task scheduling method, system, equipment and storage medium based on directed acyclic graph
CN112000465B (en) Method and system for reducing performance interference of delay sensitive program in data center environment
CN114595919A (en) Business process arranging method and device, computer equipment and storage medium
CN110502242B (en) Code automatic generation method and device, computer equipment and storage medium
CN112396480B (en) Order business data processing method, system, computer equipment and storage medium
CN112163734B (en) Cloud platform-based setting computing resource dynamic scheduling method and device
CN111506400A (en) Computing resource allocation system, method, device and computer equipment
WO2022161081A1 (en) Training method, apparatus and system for integrated learning model, and related device
CN115904729A (en) Method, device, system, equipment and medium for connection allocation
CN113641674B (en) Self-adaptive global sequence number generation method and device
CN112631771B (en) Parallel processing method of big data system
CN112667392B (en) Cloud computing resource allocation method and device, computer equipment and storage medium
CN111400368A (en) Log searching method and device of distributed server system
CN111290868B (en) Task processing method, device and system and flow engine
Menouer et al. Towards a parallel constraint solver for cloud computing environments
CN113282405B (en) Load adjustment optimization method and terminal
CN112435000B (en) Pending order notification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220524

Address after: 518048 China Aviation Center 2901, No. 1018, Huafu Road, Huahang community, Huaqiang North Street, Futian District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Ping An Medical and Health Technology Service Co., Ltd.

Address before: Room 12G, Area H, 666 Beijing East Road, Huangpu District, Shanghai 200001

Applicant before: PING AN MEDICAL AND HEALTHCARE MANAGEMENT Co.,Ltd.

GR01 Patent grant