CN116737349A - Stream data processing method, system and storage medium - Google Patents


Info

Publication number
CN116737349A
CN116737349A (application CN202311029714.0A)
Authority
CN
China
Prior art keywords
data
calculation
task
sub
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311029714.0A
Other languages
Chinese (zh)
Other versions
CN116737349B (en)
Inventor
赵丹怀
艾怀丽
孟浩
王一淳
陆田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Zijin Jiangsu Innovation Research Institute Co ltd
Original Assignee
China Mobile Zijin Jiangsu Innovation Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Zijin Jiangsu Innovation Research Institute Co ltd filed Critical China Mobile Zijin Jiangsu Innovation Research Institute Co ltd
Priority to CN202311029714.0A
Publication of CN116737349A
Application granted
Publication of CN116737349B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/56Provisioning of proxy services
    • H04L67/566Grouping or aggregating service requests, e.g. for unified processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/231Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2337Non-hierarchical techniques using fuzzy logic, i.e. fuzzy clustering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a streaming data processing method, system, and storage medium in the field of big data processing. The method comprises the following steps: monitoring a plurality of data sources and, for each acquisition task in the activated state, pushing data to a message queue; judging from the data in the message queue whether the required input data is available, and dividing each calculation task into a plurality of sub-calculation tasks that can run independently or in parallel; defining one or more input data packets and one or more output data packets for each calculation task according to its sub-calculation tasks; and monitoring whether each started sub-calculation task is completed: once a sub-calculation task completes, a notification is sent to a resource scheduling module, which, upon receiving the notification, reclaims the memory resources occupied by the completed sub-calculation task. Compared with the prior art, the method has obvious advantages in data security, rights management, complex stream-task processing, fault tolerance, and processing speed.

Description

Stream data processing method, system and storage medium
Technical Field
The present invention relates to the field of big data processing, and in particular, to a streaming data processing method, system and storage medium.
Background
Existing stream processing technologies, including Apache Flink and Apache Storm, have achieved notable success in the field of big data processing. However, they still face challenges in data security protection and complex stream-task handling.
Apache Flink excels in stream processing, offering high throughput and low latency. However, its data security and rights management capabilities are relatively weak: it lacks an effective security mechanism to protect sensitive data, which is particularly critical in big data environments. Moreover, while Flink supports some complex streaming tasks, it may suffer reduced efficiency when handling large-scale, complex streaming-data tasks.
Apache Storm is a JVM-based open-source distributed real-time stream-computing engine, widely applied to tasks such as real-time analysis, online machine learning, continuous computation, and distributed RPC. However, Storm's data security problems are also prominent. On the other hand, while Storm has unique advantages in processing large-scale stream data, its capability for complex stream-task processing is relatively weak.
The above problems may cause problems such as data leakage, difficulty in rights management, and reduced processing efficiency, which are particularly apparent in a big data environment.
Disclosure of Invention
Object of the invention: to provide a streaming data processing method, system, and storage medium that solve the above-mentioned problems in the prior art.
In a first aspect, a streaming data processing method is provided, which includes the following steps:
monitoring a plurality of data sources, creating an acquisition task for each data source, detecting the state of each acquisition task, and finding the tasks in the activated state; for an acquisition task in the activated state, pushing its data into the Kafka message queue so that data is acquired whenever new data arrives;
judging whether the required input data is available according to the data in the message queue Kafka, organizing a series of calculation rules and data streams into calculation task data packets to be stored in a memory, and dividing the calculation tasks into a plurality of independent or parallel sub-calculation tasks according to the minimum calculation rule factors;
defining one or more input data packets and one or more output data packets for each computing task according to the divided sub-computing tasks; verifying the execution order of the two calculation phases or whether the two calculation phases can be executed simultaneously;
monitoring and judging whether the currently started sub-computing task is completed or not: once a certain sub-computing task is completed, a notification is sent to a resource scheduling module; and after the resource scheduling module receives the notification, recovering the memory resources occupied by the completed sub-computing tasks.
In a further embodiment of the first aspect, the partitioning rule of the sub-computing task includes:
rule a, whether the input data required by one calculation task depends on the output results of other tasks or operations, if so, dividing the current calculation task into sub-tasks;
a rule b, whether the calculated amount or the processing time required for executing a certain calculation task is in a preset interval A or not;
a rule c, whether the number of resources for executing the computing task or operation is within a preset interval B;
before executing the calculation sub-tasks, the corresponding data are taken out from the receiver according to the rules a to c and the initial interval value, and the data are organized into a sub-calculation task.
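The three partition rules a to c can be sketched as a simple predicate. The class name, field names, and the preset intervals A and B below are illustrative assumptions, not values fixed by the patent:

```python
from dataclasses import dataclass, field

@dataclass
class CalcTask:
    name: str
    depends_on: list = field(default_factory=list)  # rule a: outputs of other tasks needed
    est_cost: float = 0.0                           # rule b: estimated compute/processing time
    est_resources: int = 0                          # rule c: resources needed for execution

COST_INTERVAL = (1.0, 10.0)     # preset interval A (rule b), assumed values
RESOURCE_INTERVAL = (1, 8)      # preset interval B (rule c), assumed values

def needs_split(task: CalcTask) -> bool:
    """A task is divided into sub-tasks if any of rules a-c flags it."""
    rule_a = bool(task.depends_on)
    rule_b = not (COST_INTERVAL[0] <= task.est_cost <= COST_INTERVAL[1])
    rule_c = not (RESOURCE_INTERVAL[0] <= task.est_resources <= RESOURCE_INTERVAL[1])
    return rule_a or rule_b or rule_c
```

A task depending on another task's output, or whose cost or resource estimate falls outside the preset intervals, would be split before the corresponding data is taken from the receiver.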
In a further embodiment of the first aspect, verifying the order of execution or whether two computing phases can be executed simultaneously further comprises:
checking the preceding condition of each calculation stage, namely the condition which needs to be met when the current calculation stage can run correctly; the conditions include: all input data is ready and all necessary streaming engine resources have been allocated to the current stage;
judging the parallelism of sub-calculation tasks: when all the preceding conditions are satisfied, judging whether the next calculation stage can run simultaneously with the calculation stage started currently; the judgment basis comprises: no data dependency relationship exists between the two phases, and the streaming engine resources are sufficient, so that the two phases can be operated simultaneously;
determining an initial interval value, and judging whether the subsequent batch processing time is greater than or equal to N times the initial interval value:
if the processing time is greater than or equal to N times the initial interval value, an A-level adjustment procedure is started: the interval value of the next batch is set to N times the current interval value; the corresponding data are taken out of the receiver according to the newly calculated interval value of the next batch, the data are calculated, and the processing time is recorded;
if the processing time is less than N times the initial interval value, a B-level adjustment procedure is started: the next batch interval value is set to a value within N times the current batch interval value, and the batch interval value is gradually reduced as the number of runs increases; the corresponding data are taken out of the receiver according to the newly calculated interval value of the next batch, the data are processed, and the processing time is recorded;
after it is confirmed that the two calculation phases can be executed simultaneously without affecting the overall operation result, the sub-calculation task executes both calculation phases in parallel.
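The A-level / B-level branching above might look like the following sketch. N is taken from the interval (1, 2) stated below; the shrink policy in the B-level case is an assumption, since the patent only requires the interval to decrease gradually as runs accumulate:

```python
N = 1.5  # assumed value from the stated interval (1, 2)

def next_interval(current_interval: float, processing_time: float,
                  shrink: float = 0.9) -> float:
    """Return the interval value for the next batch.

    A-level: if the batch took >= N * interval, enlarge the interval N-fold
    so the system can catch up with the incoming data.
    B-level: otherwise pick a smaller value within N times the current one
    (here: gently shrink, but never below the observed processing time).
    """
    if processing_time >= N * current_interval:
        return N * current_interval                         # A-level adjustment
    return max(processing_time, current_interval * shrink)  # B-level adjustment
```

With each batch, the newly computed interval determines how much data is taken from the receiver for the next round, and the recorded processing time feeds the next adjustment.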
In a further embodiment of the first aspect, the value of N lies in the interval (1, 2).
In a further embodiment of the first aspect, the streaming data processing method further includes:
when processing multi-task stream data, determining for each of two tasks the optimal interval value at which the data processing time equals the interval value; selecting, from the two optimal interval values, the one smaller than a predetermined value as the interval value actually used;
judging whether the interval value of two consecutive batches exceeds the larger optimal interval value: if it does, the interval value is adjusted;
the interval value of a new batch is determined based on the previous interval value.
In a further embodiment of the first aspect, in computing the sub-computing tasks, a parallel group computing strategy is employed, comprising:
setting the initial batch interval value T of the present group, adjusting T with an adjustment factor ρ to obtain the initial value of the first batch interval value t1, then calculating the first batch, and after the calculation is completed, recording the execution time of the first batch as p(t1);
using p(t1) as the initial value of the second batch interval value t2 of the present group; obtaining the final t2 value after t2 is adjusted, then calculating the second batch, and after the calculation is completed, recording the execution time of the second batch of the group as p(t2);
calculating the initial batch interval value T-next of the next group according to the following formula:
T-next = T + ρ * (p(t2) - p(t1));
wherein p(t2) and p(t1) are the execution times of the second batch and the first batch of the present group, respectively, and T is the initial batch interval value of the present group; the initial interval value T-next of the next group thus depends on the initial interval value of the present group and the difference between the execution times of the two batches.
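The group-interval update can be stated directly in code; here the group's initial batch interval value is written `t_init`, and `rho` is the adjustment factor ρ:

```python
def next_group_interval(t_init: float, p_t1: float, p_t2: float, rho: float) -> float:
    """T-next = T + rho * (p(t2) - p(t1)).

    If the second batch of the group ran slower than the first, the next
    group's initial interval grows; if it ran faster, the interval shrinks.
    """
    return t_init + rho * (p_t2 - p_t1)
```

The sign of p(t2) − p(t1) acts as a trend signal: a positive difference (slowing down) widens the next group's interval, a negative one tightens it.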
In a further embodiment of the first aspect, when calculating the sub-calculation task, the method further includes adopting a three-stage calculation method to improve data processing efficiency under a high concurrency situation, including three stages of fuzzy hierarchical clustering, coarse-granularity cluster tree adaptation and fine-granularity cluster scheduling.
In a further embodiment of the first aspect, the process of fuzzy hierarchical clustering includes:
for each arriving data point, calculating the membership of each cluster in real time;
updating the current clustering result according to the membership of the new data point;
performing multistage division on the data by utilizing the characteristics of a fuzzy hierarchical clustering algorithm to form a hierarchical structure;
processing the uncertainty and the ambiguity of the data by using the ambiguity of the fuzzy hierarchical clustering;
when there is a problem with the data, the part of the data is re-requested from the sender.
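The patent does not fix a membership formula for stage one; as an illustration, the standard fuzzy c-means membership (with fuzzifier m, an assumption here) for a one-dimensional point could be computed as follows:

```python
def memberships(point: float, centers: list, m: float = 2.0) -> list:
    """Degree of membership of `point` in each cluster center; sums to 1.

    Standard fuzzy c-means membership: u_i = 1 / sum_j (d_i / d_j)^(2/(m-1)),
    where d_i is the distance from the point to center i.
    """
    d = [abs(point - c) for c in centers]
    if any(di == 0 for di in d):
        # point coincides with a center: crisp membership in that cluster
        return [1.0 if di == 0 else 0.0 for di in d]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d[i] / d[j]) ** exp for j in range(len(centers)))
            for i in range(len(centers))]
```

Each arriving data point would update the current clustering result through these memberships, with the degree of ambiguity controlled by m.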
In a further embodiment of the first aspect, optimizing the data structure using a coarse-grained cluster tree adaptation algorithm based on fuzzy hierarchical clustering comprises:
according to the distribution characteristics of given data, firstly generating an initial cluster tree;
dynamically adjusting the structure of the clustering tree according to the real-time change of the data by a coarse-granularity clustering tree adaptation algorithm; when new data points arrive, the algorithm performs merging, splitting and moving operations on the cluster tree according to the characteristics of the data points.
In a further embodiment of the first aspect, the data processing using a fine-grained clustered scheduling algorithm, based on forming an optimized data structure, comprises:
assigning a processing weight to each cluster according to its size, complexity, and processing requirements; wherein the processing weight is positively correlated with the cluster's size, complexity, and processing requirements;
when processing data, processing the data in sequence from high to low according to the weight of the cluster;
in the processing process, continuously monitoring the processing state of each cluster and calculating the service condition of resources; if the processing progress of a cluster is found to fall behind, or the utilization rate of a computing resource is higher than a threshold, dynamically adjusting the processing weight to balance the processing load.
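A minimal sketch of weight-ordered processing and load rebalancing; the boost factor and all names are assumptions:

```python
import heapq

def schedule(clusters: list) -> list:
    """clusters: list of (name, weight). Return names in weight order,
    highest first, using a max-heap (negated weights)."""
    heap = [(-w, name) for name, w in clusters]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order

def rebalance(weights: dict, lagging: str, boost: float = 1.5) -> dict:
    """Raise the weight of a cluster whose progress has fallen behind
    (or whose resource usage exceeds the threshold), so it is processed
    sooner in the next round."""
    return {c: (w * boost if c == lagging else w) for c, w in weights.items()}
```

In use, the scheduler would alternate `schedule` with monitoring: when a cluster lags, `rebalance` adjusts its weight before the next scheduling pass.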
In a further embodiment of the first aspect, the streaming data processing method further includes: and constructing a workflow, and customizing the operator behaviors and the dependency relationship between operators.
In a further embodiment of the first aspect, the workflow construction process includes:
initializing: creating an empty workflow for storing operator instances to be added;
adding operator instance: the user selects the required operator and adds it to the workflow; each operator instance has a unique id and a set of parameters in the form of key value pairs, and the user defines the operator behavior by filling in the parameters;
Connection operator instance: the user defines the dependency relationship between operator instances, and the operator instances with front and back dependency relationships are connected together through a connection class; in the process, each connection is automatically filled in to generate input and output, so that a complete data processing flow is formed;
save and load: when the workflow configuration needs to be saved or backed up, the workflow is described and configured with a JSON file, and the JSON configuration object is serialized into a string; when the workflow is needed again, the string is loaded back into memory; once the configuration is loaded into memory, the streaming engine executes the workflow in accordance with it;
executing a workflow: when the workflow setting is completed, the user manually or periodically executes the entire workflow, and the streaming engine executes each task instance in accordance with the predetermined and program calculated order and dependency.
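The initialize / add / connect / save-and-load cycle above could be sketched as follows; all class and field names are assumptions:

```python
import json

class Workflow:
    def __init__(self):
        self.operators = []      # initialization: empty workflow
        self.connections = []

    def add_operator(self, op_id: str, op_type: str, params: dict):
        """Each operator instance has a unique id and key-value parameters."""
        self.operators.append({"id": op_id, "type": op_type, "params": params})

    def connect(self, src_id: str, dst_id: str):
        """The connection implicitly makes src's output dst's input."""
        self.connections.append({"from": src_id, "to": dst_id})

    def to_json(self) -> str:
        """Serialize the configuration to a string for saving or backup."""
        return json.dumps({"operators": self.operators,
                           "connections": self.connections})

    @classmethod
    def from_json(cls, s: str) -> "Workflow":
        """Load a saved configuration back into memory."""
        wf = cls()
        cfg = json.loads(s)
        wf.operators, wf.connections = cfg["operators"], cfg["connections"]
        return wf
```

A streaming engine could then walk `connections` to derive the execution order and dependencies before running each operator instance.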
In a further embodiment of the first aspect, the streaming data processing method further includes: the data is protected by adopting a method combining an asymmetric encryption algorithm and a symmetric encryption algorithm:
firstly, a receiver generates a pair of asymmetrically encrypted keys, namely a public key and a private key; the receiver reserves a private key, and the public key is sent to the sender;
When a sender needs to transmit data, firstly generating a symmetric encryption key, and encrypting the data by using the key; then, the sender encrypts the symmetrically encrypted key with the received public key of the receiver, and then sends the encrypted key and the encrypted data to the receiver;
after receiving the data, the receiver firstly decrypts the encrypted symmetric encryption key by using the private key of the receiver, and recovers the original symmetric encryption key; the receiver then decrypts the data using the symmetric encryption key, recovering the original data.
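The hybrid scheme can be illustrated end to end. A production system would use RSA-OAEP and AES from a vetted cryptography library; the textbook toy RSA pair (p=61, q=53) and XOR stream below are stand-ins chosen only to keep the protocol flow runnable and self-contained:

```python
# Toy RSA key pair: receiver publishes (N_RSA, E), keeps D private.
N_RSA, E, D = 3233, 17, 2753    # 3233 = 61 * 53; 17 * 2753 = 1 mod phi(3233)

def xor_cipher(data: bytes, key: int) -> bytes:
    """Toy symmetric cipher standing in for AES."""
    return bytes(b ^ key for b in data)

# Sender: generate a symmetric key, encrypt the data with it,
# then wrap the key with the receiver's public key.
sym_key = 42                            # would be randomly generated in practice
ciphertext = xor_cipher(b"stream data", sym_key)
wrapped_key = pow(sym_key, E, N_RSA)    # encrypt key with public key

# Receiver: unwrap the symmetric key with the private key,
# then decrypt the data with the recovered symmetric key.
recovered_key = pow(wrapped_key, D, N_RSA)
plaintext = xor_cipher(ciphertext, recovered_key)
```

The design choice mirrors the description above: the slow asymmetric operation touches only the short key, while the bulk data is handled by the fast symmetric cipher.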
In a further embodiment of the first aspect, the streaming data processing method further includes: packaging data and calculation logic into data packets, wherein the data packets are cached in a shared queue and flow in the calculation process;
encapsulating the data and computational logic associated therewith into a data packet;
the encapsulated data packet is placed in a shared queue; this queue is shared by all computing resources;
the computing resource takes out the data packet from the shared queue according to the need;
after the calculation is completed, the result data and the next round of calculation logic of the data are packaged into a new data packet, and the new data packet is put into a sharing queue again; the calculation is driven by the data and fed back to the data to form a closed loop.
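The packet-in-shared-queue closed loop might be sketched as follows, with each packet carrying its data together with the logic to apply and the next round's logic (all names are illustrative):

```python
from queue import Queue

shared_queue = Queue()   # shared by all computing resources

def double(x):
    """Stage-1 logic carried inside the packet."""
    return x * 2

def done(x):
    """Terminal logic: no further rounds."""
    return None

# Encapsulate data together with its calculation logic into a packet.
shared_queue.put({"data": 3, "logic": double, "next_logic": done})

results = []
while not shared_queue.empty():
    packet = shared_queue.get()          # a computing resource takes a packet
    result = packet["logic"](packet["data"])
    if result is not None:
        # Result data plus next round's logic become a new packet
        # that re-enters the shared queue: calculation driven by data.
        shared_queue.put({"data": result, "logic": packet["next_logic"],
                          "next_logic": done})
    else:
        results.append(packet["data"])   # loop closed for this packet
```

In a real engine the loop body would run concurrently on many workers pulling from the same queue; the single-threaded version here only shows the data-driven feedback structure.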
In a further embodiment of the first aspect, the streaming data processing method further includes: establishing a strict access control mechanism, and limiting that only authorized users can access and process stream data; through authentication, authorization and rights management, it is ensured that only legitimate users can acquire and manipulate stream data.
In a second aspect, a streaming data processing system is presented, the system comprising:
the data acquisition unit is used for monitoring a plurality of data sources, creating an acquisition task aiming at the data sources, detecting the state of the acquisition task and finding out the task in an activated state; for the acquisition task in an activated state, pushing the data into a message queue Kafka so as to acquire the data when new data exists;
the computing task partition unit is used for judging whether the required input data is available according to the data in the message queue Kafka, organizing a series of computing rules and data streams into computing task data packets, storing the computing task data packets in a memory, and dividing the computing task into a plurality of independent or parallel executable sub-computing tasks according to the minimum computing rule factors;
the sub-calculation task execution unit is used for defining one or more input data packets and one or more output data packets for each calculation task according to the divided sub-calculation tasks; verifying the execution order of the two calculation phases or whether the two calculation phases can be executed simultaneously;
The sub-calculation task calculation unit is used for monitoring and judging whether the currently started sub-calculation task is completed or not: once a certain sub-computing task is completed, a notification is sent to a resource scheduling module; and after the resource scheduling module receives the notification, recovering the memory resources occupied by the completed sub-computing tasks.
In a third aspect, a computer readable storage medium is provided, in which at least one executable instruction is stored, which when run on an electronic device, causes the electronic device to perform the operations of the streaming data processing method according to the first aspect.
The beneficial effects are that:
the stream data processing method provided by the invention supports complex stream tasks, and can maintain high-efficiency processing speed and accuracy no matter the size of the task. The design can optimize the algorithm and architecture for processing the complex data stream, and can effectively utilize hardware resources, thereby realizing the efficient processing of large-scale and complex stream data tasks.
The method has stronger fault tolerance and higher processing speed. For big data processing tasks, the engine can rapidly process data regardless of the data volume, and meanwhile, through a built-in fault tolerance mechanism, the continuity and accuracy of data processing can be ensured even if part of nodes are in fault.
In summary, the method provided by the application has obvious advantages in the aspects of data security, authority management, complex stream task processing, fault tolerance capability, processing speed and the like, and provides an effective solution for large-scale and complex stream data processing.
Drawings
Fig. 1 is a flowchart of a streaming data processing method according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present application. It will be apparent, however, to one skilled in the art that the application may be practiced without one or more of these details. In other instances, well-known features have not been described in detail in order to avoid obscuring the application.
Research has found that existing stream processing technologies may cause problems such as data leakage, difficult rights management, and reduced processing efficiency, and these problems are particularly apparent in big data environments. Therefore, the present application encapsulates data and calculation logic into subtask data packets according to data-flow theory; the packets are cached in a shared queue and flow through the calculation process. Computing resources are not allocated to each calculation link in advance; instead, they acquire data packets from the shared queue and compute according to the data and calculation logic in each packet. Execution is prioritized intelligently, and the result data together with the next round of calculation logic become a new data packet that re-enters the shared queue. By implementing an efficient built-in security mechanism, fine-grained rights management, and comprehensive support for complex stream tasks, the efficiency and security of stream data processing are improved.
Example 1:
the following embodiment discloses the detailed steps of the method, as shown in fig. 1, and the streaming data processing method disclosed in the embodiment includes the following steps:
step one: development data acquisition plug-in
And determining a data source and creating an acquisition task according to the service requirement. Aiming at different data sources, customizing and developing a data acquisition plug-in, configuring acquisition task information and generating corresponding data acquisition interface rules. Each acquisition task corresponds to one data acquisition interface, and a user can add different types of data acquisition interface information by configuring acquisition task information.
After loading the data acquisition interface rule, the task in the activated state is found out by detecting the state of the acquisition task. For the acquisition task in the active state, the monitoring data source plug-in is arranged to send data to the message queue Kafka so as to acquire the data when new data exists.
Step two: sub-computing task (task) partitioning into data packets
According to the data pushed in step one, whether the required input data is available is judged; a series of calculation rules and data streams are organized into calculation task (Task) data packets stored in memory, and each calculation task is divided into a plurality of sub-calculation tasks that can execute independently or in parallel according to the minimum calculation rule factors.
Step three: sub-computing task (task) intelligent execution
According to step two, each sub-calculation task comprises calculation rules, data streams, and calculation time. Each calculation task defines one or more input data packets and one or more output data packets, and is designed to determine the execution order, or whether two calculation phases can be executed simultaneously.
Step four: intelligent sub-computing task computation
Intelligent sub-calculation task computation is responsible for monitoring and judging whether a currently started sub-calculation task has completed. Once a sub-calculation task completes its calculation, this module sends a notification to the resource scheduling module. Upon receiving this notification, the resource scheduling module reclaims the memory resources occupied by the completed sub-calculation task.
Step five: parallel group computing design
The method for adjusting the batch interval value comprises the steps of taking two continuous batches as one group, and calculating the batch initial interval value of the next group.
1. Set the initial batch interval value t0 of the group, then use an adjustment factor ρ to adjust t0 into the initial value of the first batch interval value t1 of the group. The first batch is then calculated, and after the calculation completes, the execution time of the first batch in the group is recorded as p(t1).
2. Next, p(t1) is used as the initial value of the second batch interval value t2 of the group. Adjust t2 to obtain the final t2 value, then calculate the second batch, and after the calculation completes record the execution time of the second batch of the group as p(t2).
3. Finally, the initial batch interval value of the next group, t0-next, is calculated according to the following formula:
t0-next = t0 + ρ·(p(t2) − p(t1))
where p(t2) and p(t1) are the execution times of the second and first batches of the group, respectively, and t0 is the initial batch interval value of the present group. The initial interval value t0-next of the next group therefore depends on the initial interval value of the present group and the difference between the execution times of its two batches.
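The group-based interval update above can be expressed as a short function (an illustrative sketch; the function and parameter names are not part of the patent):

```python
def next_group_interval(t0: float, rho: float, p_t1: float, p_t2: float) -> float:
    """Initial interval of the next group: t0_next = t0 + rho * (p(t2) - p(t1)).

    t0   : initial batch interval value of the current group
    rho  : adjustment factor
    p_t1 : execution time of the group's first batch
    p_t2 : execution time of the group's second batch
    """
    return t0 + rho * (p_t2 - p_t1)

# If the second batch ran slower than the first, the interval grows;
# if it ran faster, the interval shrinks.
print(next_group_interval(2.0, 0.5, 1.0, 1.8))  # 2.0 + 0.5 * 0.8 = 2.4
```

The sign of p(t2) − p(t1) drives the adjustment: a widening execution time signals that the interval is too small for the current load.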
Through this calculation, the parallel computing process can adjust itself, improving processing efficiency. Large-scale data streams can thus be processed efficiently and at high speed.
Step six: data reception calculation
After data is received, the patent adopts a unique three-stage calculation method to improve the data processing efficiency under the high concurrency situation. Specifically, the method comprises three stages of fuzzy hierarchical clustering, coarse-granularity cluster tree adaptation and fine-granularity cluster scheduling.
Step seven: built-in generic data operator behavior
It is important to integrate versatility and intelligence in a streaming engine to enable it to accommodate a variety of different data processing requirements. By providing a set of common data processing operators, the streaming engine can conveniently process elements in a data stream. These operators are predefined and are selected and combined by the user as the case may be to achieve the desired data processing logic. By the method, the user does not need to write and register functions, and the flow of data processing is simplified.
Meanwhile, the streaming engine also supports user-defined functions to meet more personalized and complex data processing requirements. The user can write the custom function according to the business logic and the requirement of the user and register the custom function into the stream engine. In this way, the user can perform more flexible and customized processing on the data stream according to the actual situation. The support of the user-defined function enables the streaming engine to have higher expansibility and adaptability, and can cope with various complex data processing scenes.
Step eight: setting workflow rule Unit
The user sets a rule unit according to own requirements, and self-defines operator behaviors and dependency relations among operators. A customized Workflow (Workflow) is built according to its own data processing logic and business processes. When constructing a Workflow (Workflow), a user can define the dependency relationship between operators, and ensure the correct flow of data in the Workflow (Workflow). For example, some operators may need to execute after other operators are completed, or some operators may need to rely on the output results of other operators. By defining these dependencies, the user can ensure the correct order and consistency of data processing.
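Enforcing the operator dependencies described above amounts to a topological ordering of the workflow graph. The sketch below uses Kahn's algorithm, a conventional technique not named by the patent; operator names are hypothetical.

```python
from collections import deque

def execution_order(deps: dict) -> list:
    """Topologically sort operators so every operator runs only after the
    operators whose output it depends on (Kahn's algorithm)."""
    indegree = {op: len(d) for op, d in deps.items()}
    downstream = {op: [] for op in deps}
    for op, d in deps.items():
        for dep in d:
            downstream[dep].append(op)
    ready = deque(op for op, n in indegree.items() if n == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in downstream[op]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle in workflow dependencies")
    return order

# 'aggregate' depends on the outputs of 'map' and 'filter', which follow 'source'
workflow = {"source": set(), "map": {"source"}, "filter": {"source"},
            "aggregate": {"map", "filter"}}
order = execution_order(workflow)
print(order)  # 'source' first, 'aggregate' last
```

A cycle check is included because a user-defined Workflow with circular dependencies cannot guarantee the correct flow of data.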
Step nine: parallel computing encryption algorithm using GPU
The streaming engine, when processing data, needs to transfer data from a data source or task node to other processing nodes or data stores. In this process, how to secure the data is very important. The patent adopts a method combining an asymmetric encryption algorithm and a symmetric encryption algorithm to protect data.
In this method, a pair of asymmetrically encrypted keys, namely a public key and a private key, is first generated by a receiving party. The receiver retains the private key and the public key is sent to the sender.
When the sender (the previous subtask) needs to transmit data, it first generates a symmetric encryption key and encrypts the data with this key. The sender then encrypts this symmetric key with the receiver's public key that it received earlier, and sends the encrypted key to the receiver together with the encrypted data.
After receiving the data, the receiver firstly uses the private key to decrypt the encrypted symmetric encryption key and recovers the original symmetric encryption key. The receiver then decrypts the data using the symmetric encryption key, recovering the original data.
Since only the receiving party has the private key, only the receiving party can decrypt the symmetrically encrypted key, so that only the receiving party can interpret the data. This method effectively prevents data from being stolen by a third party during transmission. By using the GPU to perform parallel computation, the operation efficiency of the cryptographic algorithm is greatly improved. Meanwhile, the method can effectively process a large amount of data due to the high efficiency of the symmetric encryption algorithm.
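The envelope-encryption protocol above can be sketched as follows. Note the primitives are deliberately toy stand-ins (XOR instead of AES, and the "asymmetric" key pair modelled as callables held by the receiver) chosen only to keep the sketch self-contained; they are NOT secure and do not represent the patent's actual RSA/AES choice.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR with a repeating key (illustrative only)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

class Receiver:
    def __init__(self):
        self._secret = secrets.token_bytes(16)            # plays the private key

    def public_encrypt(self, session_key: bytes) -> bytes:
        """Plays the public-key operation handed out to senders."""
        return xor_bytes(session_key, self._secret)

    def decrypt(self, wrapped_key: bytes, ciphertext: bytes) -> bytes:
        session_key = xor_bytes(wrapped_key, self._secret)  # unwrap with private key
        return xor_bytes(ciphertext, session_key)           # symmetric decrypt

def send(receiver: Receiver, plaintext: bytes):
    session_key = secrets.token_bytes(16)            # fresh symmetric key per message
    ciphertext = xor_bytes(plaintext, session_key)   # encrypt the payload
    wrapped_key = receiver.public_encrypt(session_key)  # wrap the key for the receiver
    return wrapped_key, ciphertext                   # both travel on the wire

rx = Receiver()
wrapped, ct = send(rx, b"stream record")
print(rx.decrypt(wrapped, ct))
```

The structure mirrors the patent's flow exactly: a per-transfer symmetric key encrypts the bulk data, and only that short key is protected by the asymmetric step, which is what makes GPU-parallel symmetric encryption of large volumes practical.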
Step ten: streaming data processing
According to the theory of data flow, in step one, data and calculation logic are packaged into data packets, and the data packets are cached in a shared queue and flow in the calculation process.
Step eleven: access control
The streaming engine establishes a strict access control mechanism that limits access and processing of streaming data to only authorized users. Through authentication, authorization and rights management, it is ensured that only legitimate users can acquire and manipulate stream data. This mechanism includes three main parts of authentication, authorization and rights management:
(1) And (3) identity authentication: the streaming engine performs authentication by the user's username and password. Only authenticated users can further access and process the streaming data. In addition, in order to improve the security, two-factor verification codes are adopted for identity verification.
(2) Authorization: even if the user passes the authentication, all stream data cannot be accessed and processed at will. The streaming engine authorizes each user, deciding which streaming data they can access and process. Authorization is based on factors such as the role, responsibility, need of the user, etc., ensuring that the user can only access and process the stream data they are authorized to.
(3) Rights management: rights management is the last line of defense of the access control mechanism. Through rights management, the streaming engine can precisely control a user's specific operation permissions, such as viewing, modifying, deleting, and exporting, when accessing and processing stream data. An operation succeeds only if it conforms to the user's permission settings.
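The three-part access control check can be sketched as a single guard function. The data model (a `User` with per-stream rights) is an assumption made for illustration; the patent does not prescribe a concrete schema.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    authenticated: bool = False                 # identity verification passed
    streams: set = field(default_factory=set)   # streams the user is authorized for
    rights: dict = field(default_factory=dict)  # per-stream operation rights

def allowed(user: User, stream: str, op: str) -> bool:
    """Apply authentication, authorization, and rights management in sequence."""
    if not user.authenticated:                  # (1) identity authentication
        return False
    if stream not in user.streams:              # (2) authorization per stream
        return False
    return op in user.rights.get(stream, set())  # (3) operation rights

alice = User("alice", authenticated=True,
             streams={"billing"}, rights={"billing": {"view", "export"}})
print(allowed(alice, "billing", "view"))    # permitted
print(allowed(alice, "billing", "delete"))  # denied: no delete right
print(allowed(alice, "audit", "view"))      # denied: not authorized for stream
```

The ordering matters: each layer only runs if the previous one passed, so rights management is indeed the last line of defense.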
Example 2:
based on the above embodiment 1, this embodiment further discloses a detailed design step of the sub-calculation task (task), as follows:
1. Algorithm judgment: whether the input data required by a computing Task depends on the output results of other tasks or operations. If such a dependency exists, the Task segmentation module divides the computing Task into multiple independent sub-computing Tasks, and the execution order among the computing partitions is determined according to the data dependencies.
2. Algorithm judgment: the amount of computation or processing time required to execute a given computing Task. Complex tasks are split as far as possible into multiple simple sub-computing Tasks.
3. Algorithm judgment: the quantity of resources available in the streaming engine for executing a computing Task or operation, including computing resources (e.g., CPU, GPU) and storage resources (e.g., memory, hard disk). The available resources are allocated appropriately to each sub-computing task by the resource scheduling module.
4. Before the sub-computing tasks are executed, the corresponding data are taken out of the receiver according to the above design steps and the initial interval value t0, and the data are organized into one batch (one sub-computing task). The calculation time t required to process this batch of data is recorded. According to the specific requirements of the task and the performance of the streaming engine, the initial interval value t0 is dynamically adjusted to better match the characteristics of the computing Task and the running conditions of the streaming engine.
Example 3:
the embodiment further discloses a specific flow of intelligent execution of the sub-computing task (task) based on embodiment 1 or embodiment 2:
first, the precondition of each calculation stage is checked, i.e. the conditions that this calculation stage must satisfy in order to run correctly. These conditions include: all input data is ready, all necessary streaming engine resources have been allocated to this stage, and so on.
Secondly, judging the parallelism of the sub-calculation tasks (task): when all of the preconditions have been met, the module determines whether the next computing phase can run concurrently with the currently initiated computing phase. The judgment basis comprises: there is no data dependency between the two phases, the streaming engine resources are sufficient, the two phases can be run simultaneously, etc.
It is then determined whether the processing time of the next batch is greater than or equal to 1.5 times the initial interval value t0.
If the processing time is greater than or equal to 1.5 times the initial interval value (≥ 1.5·t0), the program intelligently performs an A-level adjustment: the next batch interval value is set to 1.5 times the current interval value. According to the newly calculated next batch interval value, the corresponding data are taken out of the receiver, the data are calculated, and the processing time is recorded.
If the processing time is less than 1.5 times the initial interval value (< 1.5·t0), the program intelligently performs a B-level adjustment: the next batch interval value is set to a value within 1.5 times the current batch interval value, and the batch interval value gradually decreases as the number of runs increases. According to the newly calculated next batch interval value, the corresponding data are taken out of the receiver, processed, and the processing time is recorded.
Once it is confirmed that the two phases can be executed simultaneously without affecting the overall operation result, the sub-computing Tasks execute both phases at the same time.
When all sub-computation Task (Task) input data packets of one computation Task (Task) receive data, the operation starts to execute the computation Task (Task), and after the execution is finished, the result is sent to the next computation Task (Task) connected to the output data packet. By constructing the data transmission relation among the computing tasks (tasks), the parallel division and intelligent execution of the computing tasks (tasks) are realized.
Example 4:
the embodiment further discloses a detailed process of calculating the intelligent sub-calculation task in the fourth step based on the above embodiments.
When processing multi-task stream data, step three determines the optimal interval value of each of the two tasks such that the data processing time equals the interval value. The smaller of the two optimal interval values is selected as the interval value actually used, and it is checked whether the interval value of two consecutive batches exceeds the larger optimal interval value. If so, the interval value must be adjusted to prevent degradation of processing efficiency. The new batch interval value is calculated from the previous interval value and a set of parameters (the run count i of a given adjustment, the iteration count j, and a constant κ between 0.5 and 1.0):
First, the run count i of the given adjustment and the iteration count j are determined.
Then, a constant κ between 0.5 and 1.0 is selected. This value may need to be adjusted according to the actual situation to ensure optimal data processing efficiency.
Using the formula t-next = κ^j · t-prev, a new batch interval value is calculated from the interval value of the previous batch.
The calculation Task (Task) is performed according to the new batch interval value, and the running time of the Task is recorded. This time can be used to evaluate the efficiency of the current interval value and make adjustments if necessary.
To optimize data processing efficiency, overly long processing times caused by too large a batch interval value must be avoided, as must overly frequent processing tasks caused by too small a value. The interval is dynamically adjusted according to the actual data processing conditions, ensuring both the efficiency and the accuracy of data processing.
The detailed design steps are as follows:
when the processing time of the current batch is greater than or equal to 1.5 times the current batch interval value, the interval value of the next batch is set to 1.5 times the current batch interval value. The parallel computing engine considers the batch interval value too small and increases it according to the interval coefficient to improve processing efficiency.
The next batch interval value is set to κ^j times the interval value of the previous batch, i.e. t-next = κ^j · t-prev, where j is the current number of adjustments (a natural number) and κ is a constant between 0.5 and 1.0. The batch interval value thus set has a certain elasticity and can be fine-tuned according to the values of κ and j.
If, after some adjustment run, the batch running time becomes less than or equal to the current batch interval value, the batch interval value is no longer adjusted; the current batch interval value is used as the interval value for subsequent batches until the data in all caches have been computed, and it is carried forward as the starting interval value of the next step. At this stage the streaming engine considers the current batch interval value optimized; no further adjustment is needed, and data processing can continue according to this value.
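One adjustment step of this scheme can be sketched as below. This is a reading of the patent's A-level/B-level rules, not a definitive implementation: in particular, modelling the B-level decrease as κ^j times the previous interval is an assumption drawn from the κ and j description above.

```python
def next_interval(interval: float, proc_time: float, j: int, kappa: float = 0.8):
    """One adjustment step of the batch interval (kappa in (0.5, 1.0) assumed).

    A-level: processing time >= 1.5 * interval -> grow the interval by 1.5x.
    B-level: otherwise shrink it, by kappa**j of the previous value, so later
             runs make ever finer adjustments; stop once proc_time fits.
    Returns (new_interval, new_j).
    """
    if proc_time >= 1.5 * interval:
        return 1.5 * interval, j        # A-level: interval was too small
    new = (kappa ** j) * interval       # B-level: gentle decrease
    if proc_time <= new:                # already optimized: stop adjusting
        return new, j
    return new, j + 1

iv, j = 2.0, 1
iv, j = next_interval(iv, proc_time=3.5, j=j)  # A-level fires: 1.5 * 2.0 = 3.0
print(iv)
iv, j = next_interval(iv, proc_time=1.0, j=j)  # B-level: 0.8 * 3.0 = 2.4
print(iv)
```

The key property is the feedback loop: each batch's measured processing time immediately shapes the interval of the next batch.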
Example 5:
the embodiment further discloses a detailed process of step six data receiving calculation:
the data is first processed using a fuzzy hierarchical clustering algorithm. The method comprises the following specific steps:
(1) for each arriving data point, its membership to each cluster is calculated in real time. The step can timely respond to new data in the data stream, and the instantaneity of the clustering result is maintained.
(2) The current clustering result is updated according to the membership degrees of the new data point. The arrival of each data point may therefore change the clustering result, ensuring that the clustering adjusts along with the data flow.
(3) And carrying out multistage division on the data by utilizing the characteristics of a fuzzy hierarchical clustering algorithm to form a hierarchical structure. The hierarchical structure not only captures the global structure of the data, but also shows the nuances inside the data, and enhances the interpretation of the data.
(4) The ambiguity of the fuzzy hierarchical clustering is utilized to process the uncertainty and the ambiguity of the data. When the data has ambiguity or uncertainty, the fuzzy hierarchical clustering can still give an effective clustering result, and the processing robustness is ensured.
(5) When there is a problem with the data, such as a data loss or error, the system may re-request the portion of the data from the sender. The mechanism ensures the integrity and accuracy of the data and provides a reliable data source for subsequent processing and analysis.
The fuzzy hierarchical clustering algorithm is designed as follows:
(1) the goal of fuzzy hierarchical clustering is to find a membership matrix U = [u_ij] and a group of cluster centers V = [v_1, …, v_c] that minimize the objective function:
J(U, V) = Σ_{i=1..n} Σ_{j=1..c} u_ij^m · ‖x_i − v_j‖²
where m > 1 is the fuzziness exponent.
1) The membership matrix U and the cluster centers V are initialized randomly.
2) In each iteration, the cluster centers V are updated according to the current membership matrix U:
v_j = Σ_{i=1..n} u_ij^m · x_i / Σ_{i=1..n} u_ij^m
3) The membership matrix U is then updated according to the new cluster centers V:
u_ij = 1 / Σ_{k=1..c} ( ‖x_i − v_j‖ / ‖x_i − v_k‖ )^{2/(m−1)}
Steps 2) and 3) are repeated until the objective function J(U, V) converges or the maximum number of iterations is reached.
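The alternating update can be sketched in pure Python for one-dimensional data. This is a minimal fuzzy c-means sketch (the standard algorithm the text describes), not the patent's hierarchical variant; the fuzziness exponent m = 2 and the fixed iteration count are assumptions.

```python
import random

def fuzzy_c_means(xs, c=2, m=2.0, iters=50, seed=0):
    """Minimal 1-D fuzzy c-means: returns (U, V) after alternating updates."""
    rng = random.Random(seed)
    n = len(xs)
    U = []                                   # membership matrix, rows sum to 1
    for _ in range(n):
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    V = [0.0] * c                            # cluster centers
    for _ in range(iters):
        for j in range(c):                   # update centers from memberships
            num = sum((U[i][j] ** m) * xs[i] for i in range(n))
            den = sum(U[i][j] ** m for i in range(n))
            V[j] = num / den
        for i in range(n):                   # update memberships from centers
            for j in range(c):
                dij = abs(xs[i] - V[j]) or 1e-12
                U[i][j] = 1.0 / sum(
                    (dij / (abs(xs[i] - V[k]) or 1e-12)) ** (2.0 / (m - 1.0))
                    for k in range(c))
    return U, V

xs = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]        # two well-separated groups
U, V = fuzzy_c_means(xs)
print(sorted(V))                              # centers settle near each group
```

Each center update is the u^m-weighted mean of all points, and each membership update normalizes inverse distance ratios, exactly mirroring the formulas in steps 2) and 3).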
On the basis of fuzzy hierarchical clustering, the application further proposes to optimize the data structure by adopting a coarse-granularity cluster tree adaptation algorithm. The specific flow is as follows:
(1) An initial cluster tree is first generated based on the distribution characteristics of the given data. The structure of the cluster tree reflects the real distribution characteristics of the data as far as possible, and provides a basis for subsequent optimization.
(2) The structure of the cluster tree can be dynamically adjusted according to the real-time change of the data through a coarse-grained cluster tree adaptation algorithm. When new data points arrive, the algorithm performs necessary adjustment on the cluster tree, such as merging, splitting, moving and the like, according to the characteristics of the data points. Therefore, the structure of the clustering tree can reflect the latest state of the data in real time, and the accuracy and the instantaneity of the clustering result are ensured.
(3) Through the above steps, a large number of highly dimensional, dynamically changing data streams can be efficiently processed. Regardless of the scale of the data and the change of the data, the clustering result conforming to the actual distribution of the data can be quickly obtained through a coarse-granularity clustering tree adaptation algorithm, so that the requirement of large data processing is met.
The coarse-grained cluster tree adaptation algorithm is designed as follows:
(1) given a data packet X = {x1, x2, …, xn}, find a cluster tree T such that a defined cost function J(T) is minimized. The total distance from the data points to their cluster centers is:
J(T) = Σ_{c∈T} Σ_{x∈c} ‖x − μ_c‖²
where c denotes one cluster in the cluster tree T, and μ_c denotes the center of cluster c.
(2) Initial clustering: the fuzzy hierarchical clustering algorithm is applied to the data packet X to generate an initial cluster tree T0.
(3) Coarse-granularity adaptation: the initial cluster tree T0 is adjusted (by merging, splitting, and moving clusters) to generate a new cluster tree T1 that minimizes the cost function J(T).
(4) Result output: the adapted cluster tree T1 is output.
On the basis of the optimized data structure formed in the first two stages, a fine-grained cluster scheduling algorithm is used for data processing. The method comprises the following specific steps:
(1) And allocating a processing weight to each cluster according to the size, complexity and processing requirement of the cluster. Large, complex, high processing-demanding clusters will get higher processing weights, while small, simple, low processing-demanding clusters will get lower processing weights.
(2) When processing data, the clustering clusters with high weights are processed preferentially. In this way, the most important and urgent data can be ensured to be processed in time.
(3) During the processing, the processing state of each cluster and the use condition of the computing resource are continuously monitored. If the processing progress of a cluster is found to fall behind, or the utilization rate of a computing resource is too high, the processing weight is dynamically adjusted to balance the processing load and optimize the resource usage.
(4) The fine-grained cluster scheduling algorithm is designed as follows:
(1) given a series of clusters C = {c1, c2, …, cn}, where each cluster ci has a size si and a processing requirement di, the goal is to find a scheduling order O = {o1, o2, …, on} such that the total processing time T is minimized. The total processing time function used is:
T = Σ_{i=1..n} (α·s_i + β·d_i)
where α and β are weight coefficients for cluster size and processing requirements.
(2) The fine-granularity cluster scheduling process comprises the following steps:
(3) Priority calculation: for each cluster ci, its priority is calculated from its size si and processing requirement di:
p_i = α·s_i + β·d_i
(4) Sorting and scheduling: the clusters are sorted by priority p_i in descending order to generate the scheduling sequence O.
(5) Result output: the scheduling sequence O is output.
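The priority-and-sort scheduling can be sketched directly. The linear priority p = α·s + β·d is an assumption consistent with the description that large, complex, high-demand clusters receive higher weights; names and weights are illustrative.

```python
def schedule(clusters, alpha=1.0, beta=1.0):
    """Order clusters by priority p_i = alpha*s_i + beta*d_i, highest first.

    clusters: list of (name, size, demand) tuples; alpha/beta weight cluster
    size against processing requirement.
    """
    def priority(c):
        _, s, d = c
        return alpha * s + beta * d
    return [name for name, _, _ in sorted(clusters, key=priority, reverse=True)]

clusters = [("small", 1, 1), ("urgent", 2, 9), ("big", 8, 2)]
print(schedule(clusters))  # highest-priority cluster is processed first
```

In the full system the weights would be re-evaluated during processing, since the scheduler dynamically rebalances when a cluster falls behind or a resource is saturated.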
Through the calculation of the three stages, the data processing efficiency under the high concurrency situation can be effectively improved, and meanwhile, the accuracy of data processing can be ensured. The advantage of this approach is that it can handle both large-scale data and adapt to dynamic changes in the data.
Example 6:
based on the above embodiments, the mentioned built-in generic data operators may be of the following types:
1. conversion operator: for converting and mapping elements in the data stream. The map operator may apply a function to each element to generate a new element; the filter operator may filter elements according to conditions; the flatMap operator may map one element into a plurality of elements, etc.
2. Aggregation operator: for aggregating elements in a data stream to produce a single result. The sum operator may sum the digital elements in the stream; the count operator may calculate the number of elements in the stream; the min and max operators may find the minimum and maximum elements in the stream, etc.
3. Grouping and partitioning operators: for grouping or partitioning elements in a data stream according to a certain attribute. The groupBy operator may group elements according to a certain attribute; the keyBy operator may partition elements according to a key, etc.
4. Time window operator: for grouping and aggregating elements in a data stream according to time. For example, a rolling window operator may group elements according to a fixed length time window; the sliding window operator may group elements according to a specified sliding interval, etc.
5. Join and merge operators: for merging or concatenating multiple data streams. For example, the union operator may merge multiple data streams into one; the connect and coFlatMap operators may join multiple data streams together for operations, etc.
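Minimal versions of several of these built-in operators can be written over plain Python iterables. This sketch is illustrative of the operator semantics only; a count-based window stands in for the time-based windows, and the function names are not the engine's actual API.

```python
from itertools import islice

def map_op(stream, fn):      return (fn(x) for x in stream)      # transform each element
def filter_op(stream, pred): return (x for x in stream if pred(x))
def flat_map(stream, fn):    return (y for x in stream for y in fn(x))

def group_by(stream, key):
    """Group elements of the stream by a key attribute."""
    groups = {}
    for x in stream:
        groups.setdefault(key(x), []).append(x)
    return groups

def tumbling_window(stream, size):
    """Group elements into fixed-size windows (count-based stand-in for time)."""
    it = iter(stream)
    while chunk := list(islice(it, size)):
        yield chunk

data = [1, 2, 3, 4, 5]
doubled = list(map_op(data, lambda x: x * 2))
evens = list(filter_op(data, lambda x: x % 2 == 0))
pairs = list(flat_map([1, 2], lambda x: [x, -x]))   # one element -> many
windows = list(tumbling_window(data, 2))
print(doubled, evens, pairs, windows)
```

Because each operator takes a stream and returns a stream (or grouping), users can compose them freely, which is exactly what makes a small predefined operator set expressive.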
Example 7:
the embodiment further discloses a feasible detailed processing step of stream data processing:
1. Encapsulate the data and its associated computing logic into a data packet according to steps one and two. The data packet design should accommodate enough data and the corresponding computation logic to ensure the integrity of the data and the correctness of the computation.
2. The encapsulated packet will be placed in a shared queue. This queue is shared by all computing resources so that any resource that needs to be computed can fetch a packet from this queue.
The computing resource takes a data packet out of the shared queue as needed and, following the three-stage calculation method of step six, determines its priority and performs the calculation.
3. After the calculation is completed, the result data and the next round of calculation logic of the data are packaged into a new data packet, and the new data packet is put into the sharing queue again. The calculation is not only driven by the data, but also can be fed back to the data to form a closed loop.
Data-driven calculation is thus realized, parallel efficiency is exploited to the greatest extent, and calculation speed is improved. Meanwhile, since computing resources are always allocated on demand, resource waste is avoided and the overall efficiency of the streaming engine is improved.
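The closed loop of steps 1–3 can be sketched with a shared queue whose packets carry their own next-stage logic. The packet representation (a `(payload, logic)` pair, where a tuple result means "re-queue for another round") is an assumption made to keep the sketch small.

```python
from queue import Queue, Empty

def run_engine(initial_packets):
    """Data-driven loop: packets carry their compute logic; intermediate
    results are re-packaged and re-queued, forming the closed loop."""
    shared = Queue()                            # the shared queue of step ten
    for p in initial_packets:
        shared.put(p)
    results = []
    while True:
        try:
            data, logic = shared.get_nowait()   # (payload, next computation)
        except Empty:
            break                               # all cached data computed
        out = logic(data)
        if isinstance(out, tuple) and callable(out[1]):
            shared.put(out)                     # another round of computation
        else:
            results.append(out)                 # final result leaves the loop
    return results

# two-stage pipeline: square, then negate
stage2 = lambda x: -x
stage1 = lambda x: (x * x, stage2)              # output feeds the next stage
print(run_engine([(2, stage1), (3, stage1)]))
```

Any idle computing resource could run the same take-compute-requeue loop against the one shared queue, which is what lets the design allocate resources only when there is data to drive them.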
Example 8:
the embodiment further discloses a feasible workflow rule unit setting workflow step:
1. initializing: an empty Workflow (Workflow) is created for storing the operator instance to be added.
2. Adding operator instance: the user selects the required operator and adds it to the Workflow (Workflow). Each operator instance has a unique id and a set of parameters in the form of key-value pairs that the user can customize the operator's behavior by filling in.
3. Connection operator instance: the user can define the dependency relationship between operator instances, and the operator instances with front-back dependency relationship are connected together through the connection class. In this process, each connection automatically fills in the generated inputs and outputs, forming a complete data processing flow.
4. Designing operators: each operator has its own corresponding conversion interface. This interface defines the behavior of the operator, such as filtering, aggregation, and ordering operations.
5. Preservation and loading: when a save or backup Workflow (Workflow) configuration is required, the Workflow (Workflow) is described and configured using JSON files, and JSON configuration objects are serialized into strings, which are then loaded from memory when required. Once these configurations are loaded into memory, the flow engine executes a Workflow (Workflow) in accordance with these configurations.
6. Executing Workflow (Workflow): when the Workflow (Workflow) is set up, the user manually or periodically executes the entire Workflow (Workflow), and the streaming engine executes each Task (Task) instance in a predetermined and programmed sequence and dependency.
7. The user views and manipulates the operator class and the connection class in the setup rules interface. The page supports a drag operation so that the user can conveniently place the operator in the appropriate position.
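The save-and-load mechanism of step 5 can be sketched with the standard `json` module. The configuration schema shown (operator ids, key-value parameters, connections) is a hypothetical minimal shape, not the patent's actual file format.

```python
import json

# Hypothetical minimal Workflow description: operator instances with unique
# ids, key-value parameters, and the connections between them.
workflow = {
    "operators": [
        {"id": "src", "type": "source", "params": {"topic": "events"}},
        {"id": "flt", "type": "filter", "params": {"expr": "value > 0"}},
    ],
    "connections": [{"from": "src", "to": "flt"}],
}

saved = json.dumps(workflow)     # serialize the configuration object to a string
restored = json.loads(saved)     # load it back before the engine executes it
print(restored == workflow)      # the round trip preserves the configuration
```

Serializing to a string in this way is what allows a Workflow configuration to be backed up, versioned, and re-executed later exactly as designed.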
Example 9:
the embodiment provides a streaming data processing system, which comprises a data acquisition unit, a calculation task partition unit, a sub-calculation task execution unit and a sub-calculation task calculation unit.
The data acquisition unit is used for monitoring a plurality of data sources, creating an acquisition task aiming at the data sources, detecting the state of the acquisition task, and finding out the task in an activated state; for the acquisition task in an activated state, pushing the data into a message queue Kafka so as to acquire the data when new data exists;
The computing task partition unit is used for judging whether the required input data is available according to the data in the message queue Kafka, organizing a series of computing rules and data streams into computing task data packets, storing the computing task data packets in a memory, and dividing the computing task into a plurality of independent or parallel executable sub-computing tasks according to the minimum computing rule factors;
the sub-calculation task execution unit is used for defining one or more input data packets and one or more output data packets for each calculation task according to the divided sub-calculation tasks; verifying the execution order of the two calculation phases or whether the two calculation phases can be executed simultaneously;
the sub-calculation task calculation unit is used for monitoring and judging whether the currently started sub-calculation task is completed or not: once a certain sub-computing task is completed, a notification is sent to a resource scheduling module; and after the resource scheduling module receives the notification, recovering the memory resources occupied by the completed sub-computing tasks.
Example 10:
the present embodiment proposes a computer readable storage medium, in which at least one executable instruction is stored, which when executed on an electronic device, causes the electronic device to perform the operations of the streaming data processing method described in the above embodiments 1 to 8.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories. The computer may be a variety of computing devices including smart terminals and servers.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (17)

1. A method of streaming data processing comprising the steps of:
monitoring a plurality of data sources, creating an acquisition task aiming at the data sources, detecting the state of the acquisition task, and finding out the task in an activated state; for the acquisition task in an activated state, pushing the data into a message queue Kafka so as to acquire the data when new data exists;
judging whether the required input data is available according to the data pushed into the message queue Kafka, organizing a series of calculation rules and data streams into calculation task data packets to be stored in a memory, and dividing the calculation tasks into a plurality of independent or parallel sub-calculation tasks according to the minimum calculation rule factors;
Defining one or more input data packets and one or more output data packets for each computing task according to the divided sub-computing tasks; verifying the execution order of the two calculation phases or whether the two calculation phases can be executed simultaneously;
monitoring and judging whether the currently started sub-computing task is completed or not: once a certain sub-computing task is completed, a notification is sent to a resource scheduling module; and after the resource scheduling module receives the notification, recovering the memory resources occupied by the completed sub-computing tasks.
2. The streaming data processing method according to claim 1, wherein the division rules for sub-calculation tasks comprise:
rule a: whether the input data required by a calculation task depend on the output results of other tasks or operations; if so, the current calculation task is divided into sub-tasks;
rule b: whether the calculation amount or the processing time required to execute a calculation task falls within a preset interval A;
rule c: whether the number of resources required to execute the calculation task or operation falls within a preset interval B;
before a sub-calculation task is executed, the corresponding data are taken out of the receiver according to rules a to c and the initial interval value, and the data are organized into a sub-calculation task.
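Division rules a to c amount to a three-way predicate. A minimal illustrative sketch, assuming (since the claim leaves the direction open) that a cost or resource count falling outside the preset interval triggers division; the interval bounds and function names are hypothetical:

```python
# Hypothetical illustration of division rules a-c: a task is split into
# sub-tasks when it depends on another task's output (rule a), or its
# estimated cost falls outside preset interval A (rule b), or the
# resources it needs fall outside preset interval B (rule c).

INTERVAL_A = (0.1, 5.0)    # assumed acceptable processing-time range, seconds
INTERVAL_B = (1, 8)        # assumed acceptable number of resource units

def should_divide(depends_on_other_output, est_cost, resource_count):
    if depends_on_other_output:                                   # rule a
        return True
    if not (INTERVAL_A[0] <= est_cost <= INTERVAL_A[1]):          # rule b
        return True
    if not (INTERVAL_B[0] <= resource_count <= INTERVAL_B[1]):    # rule c
        return True
    return False
```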
3. The streaming data processing method according to claim 1, wherein verifying the execution order of the two calculation phases or whether the two calculation phases can be executed simultaneously, further comprises:
checking the precondition of each calculation stage, namely the conditions that must be met for the current calculation stage to run correctly; the conditions comprise: all input data are ready, and all necessary streaming-engine resources have been allocated to the current stage;
judging the parallelism of sub-calculation tasks: when all preconditions are satisfied, judging whether the next calculation stage can run simultaneously with the currently started calculation stage; the judgment criteria comprise: no data dependency exists between the two phases, and the streaming-engine resources are sufficient for the two phases to run simultaneously;
determining an initial interval value, and judging whether the subsequent batch processing time is greater than or equal to N times the initial interval value:
if the processing time is greater than or equal to N times the initial interval value, an A-stage adjustment procedure is started: the interval value of the next batch is set to N times the current interval value; the corresponding data are taken out of the receiver according to the newly calculated next-batch interval value, the data are calculated, and the processing time is recorded;
if the processing time is less than N times the initial interval value, a B-stage adjustment procedure is started: the interval value of the next batch is set to a value within N times the current batch interval value, and the batch interval value is gradually reduced as the number of runs increases; the corresponding data are taken out of the receiver according to the newly calculated next-batch interval value, the data are processed, and the processing time is recorded;
upon confirmation that the two calculation phases can be executed simultaneously without affecting the overall operation result, the sub-calculation task executes both calculation phases simultaneously.
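The A-stage/B-stage adjustment of claim 3 can be condensed into a single function. This is a sketch under stated assumptions: the B-stage reduction is modelled with a hypothetical decay factor `shrink`, which the claim does not specify:

```python
# Sketch of the claim-3 interval adjustment. A-stage: processing time has
# reached N times the interval, so the next interval becomes N times the
# current one. B-stage: the next interval stays within N times the current
# one and decreases as run_count grows (shrink < 1 is an assumption).

def next_interval(current_interval, processing_time, n, run_count, shrink=0.9):
    if processing_time >= n * current_interval:
        # A-stage adjustment: next batch interval = N * current interval.
        return n * current_interval
    # B-stage adjustment: a value within N times the current interval,
    # gradually reduced as the number of runs increases.
    return n * current_interval * (shrink ** (run_count + 1))
```

With N in (1, 2) as claim 4 requires, the A-stage grows the interval when batches overrun, while the B-stage shrinks it over successive runs when there is slack.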
4. The streaming data processing method according to claim 3, wherein the value of N lies in the interval (1, 2).
5. The streaming data processing method according to claim 1, further comprising:
when processing multi-task stream data, determining, for each of two tasks, an optimal interval value at which the data processing time equals the interval value; selecting the smaller of the two optimal interval values as the interval value actually used;
determining whether the interval value of two consecutive batches exceeds the larger optimal interval value: if so, the interval value is adjusted;
the interval value of a new batch is determined based on the previous interval value.
6. The streaming data processing method according to claim 1, wherein in calculating the sub-calculation tasks, a parallel group calculation strategy is adopted, comprising:
setting an initial batch interval value t0 of the group; using an adjustment factor ρ, t0 is adjusted to obtain the initial value of the first batch interval value t1; the first batch is then calculated, and after the calculation is completed, the execution time of the first batch is recorded as p(t1);
using p(t1) as the initial value of the second batch interval value t2 of the group; the final value of t2 is obtained after t2 is adjusted, the second batch is then calculated, and after the calculation is completed, the execution time of the second batch of the group is recorded as p(t2);
calculating the initial batch interval value t0-next of the next group according to the following formula:
t0-next = t0 + ρ * (p(t2) - p(t1));
wherein p(t2) and p(t1) are the execution times of the second batch and the first batch of the group, respectively, and t0 is the initial batch interval value of the present group; the initial interval value t0-next of the next group thus depends on the initial interval value of the present group and the difference between the execution times of the two batches.
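The recurrence of claim 6 is a one-line update. Written directly from the formula (symbol names chosen here for readability, since the originals are rendered as images in the source):

```python
# Claim-6 recurrence: the next group's initial batch interval t0_next is
# the current group's t0 plus rho times the difference between the two
# batch execution times p(t2) and p(t1).

def next_group_interval(t0, rho, p_t1, p_t2):
    return t0 + rho * (p_t2 - p_t1)
```

If the second batch ran slower than the first (p(t2) > p(t1)), the next group's interval grows; if it ran faster, the interval shrinks.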
7. The streaming data processing method according to claim 1, wherein, when calculating the sub-calculation tasks, the method further comprises adopting a three-stage calculation method to improve data processing efficiency under high concurrency, the three stages comprising fuzzy hierarchical clustering, coarse-granularity cluster-tree adaptation and fine-granularity cluster scheduling.
8. The streaming data processing method according to claim 7, wherein the process of fuzzy hierarchical clustering comprises:
for each arriving data point, calculating its membership in each cluster in real time;
updating the current clustering result according to the memberships of the new data point;
performing multi-level division of the data by means of the fuzzy hierarchical clustering algorithm to form a hierarchical structure;
handling the uncertainty and ambiguity of the data by means of the fuzziness of the fuzzy hierarchical clustering;
when a problem with the data is found, re-requesting the affected part of the data from the sender.
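The per-point step of claim 8 resembles the standard fuzzy C-means membership computation. The sketch below assumes the fuzzifier m = 2 and a simple incremental centre update; the patent does not fix a particular membership formula, so this is an illustration, not the claimed algorithm:

```python
# For each arriving data point, compute its membership in every cluster
# using the standard fuzzy C-means membership formula (fuzzifier m
# assumed = 2), then nudge each cluster centre toward the point in
# proportion to membership. One-dimensional points for brevity.

def memberships(point, centres, m=2.0):
    dists = [abs(point - c) for c in centres]
    if any(d == 0 for d in dists):                 # point sits on a centre
        return [1.0 if d == 0 else 0.0 for d in dists]
    inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def update_centres(point, centres, lr=0.1):
    # Incremental update of the current clustering result for a new point.
    u = memberships(point, centres)
    return [c + lr * ui * (point - c) for c, ui in zip(centres, u)]
```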
9. The streaming data processing method according to claim 8, wherein optimizing the data structure based on fuzzy hierarchical clustering by adopting a coarse-grained cluster tree adaptation algorithm comprises:
according to the distribution characteristics of given data, firstly generating an initial cluster tree;
dynamically adjusting the structure of the clustering tree according to the real-time change of the data by a coarse-granularity clustering tree adaptation algorithm; when new data points arrive, the algorithm performs merging, splitting and moving operations on the cluster tree according to the characteristics of the data points.
10. The streaming data processing method according to claim 9, wherein the data processing using a fine-grained cluster scheduling algorithm based on forming an optimized data structure comprises:
distributing a processing weight to each cluster according to the size, complexity and processing requirement of the cluster, wherein the processing weight is positively correlated with the size, complexity and processing requirement of the cluster;
when processing data, processing the clusters in order of weight from high to low;
during processing, continuously monitoring the processing state of each cluster and the usage of computing resources; if the processing progress of a cluster is found to lag behind, or the utilization rate of a computing resource exceeds a threshold, dynamically adjusting the processing weights to balance the processing load.
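The weight-based scheduling of claim 10 can be sketched as follows. The product weight `size * complexity * demand` is one possible positively-correlated weight; the actual formula is not specified in the claim:

```python
# Fine-grained cluster scheduling sketch: weight each cluster by
# size * complexity * demand (an assumed positively-correlated formula),
# process clusters in descending weight order, and boost the weight of a
# lagging cluster to rebalance the load.

def cluster_weight(size, complexity, demand):
    return size * complexity * demand

def processing_order(clusters):
    # clusters: dict name -> (size, complexity, demand)
    return sorted(clusters, key=lambda k: cluster_weight(*clusters[k]),
                  reverse=True)

def rebalance(weights, lagging, boost=1.5):
    # Dynamically raise the weight of a cluster whose progress lags behind.
    weights = dict(weights)
    weights[lagging] *= boost
    return weights
```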
11. The streaming data processing method according to claim 1, further comprising: and constructing a workflow, and customizing the operator behaviors and the dependency relationship between operators.
12. The method for processing streaming data according to claim 11, wherein the process of constructing the workflow comprises:
initializing: creating an empty workflow for storing operator instances to be added;
adding operator instances: the user selects the required operators and adds them to the workflow; each operator instance has a unique id and a set of parameters in the form of key-value pairs, and the user defines the operator behavior by filling in the parameters;
connecting operator instances: the user defines the dependency relationships between operator instances, and operator instances with upstream-downstream dependencies are connected together through a connection class; in this process, each connection automatically fills in the generated input and output, forming a complete data processing flow;
saving and loading: the workflow is described and configured with a JSON file; when the workflow configuration needs to be saved or backed up, the JSON configuration object is serialized into a string, and the string is later loaded back into memory when needed; once the configuration is loaded into memory, the streaming engine executes the workflow according to the configuration;
executing the workflow: when the workflow setup is completed, the user executes the entire workflow manually or periodically, and the streaming engine executes each task instance according to the predetermined, program-calculated order and dependencies.
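The workflow construction of claim 12 maps naturally onto a small class with JSON round-tripping and a topological execution order. Class and field names here are assumptions for illustration:

```python
import json

# Sketch of the claim-12 workflow: operator instances with a unique id
# and key-value parameters, connections expressing dependencies, JSON
# (de)serialization, and a program-calculated execution order.

class Workflow:
    def __init__(self):
        self.operators = {}        # id -> parameter dict
        self.connections = []      # (upstream_id, downstream_id)

    def add_operator(self, op_id, **params):
        self.operators[op_id] = params

    def connect(self, upstream, downstream):
        self.connections.append((upstream, downstream))

    def to_json(self):
        return json.dumps({"operators": self.operators,
                           "connections": self.connections})

    @classmethod
    def from_json(cls, text):
        cfg = json.loads(text)
        wf = cls()
        wf.operators = cfg["operators"]
        wf.connections = [tuple(c) for c in cfg["connections"]]
        return wf

    def execution_order(self):
        # Topological order over the dependency graph (Kahn's algorithm).
        indeg = {op: 0 for op in self.operators}
        for _, dst in self.connections:
            indeg[dst] += 1
        ready = [op for op, d in indeg.items() if d == 0]
        order = []
        while ready:
            op = ready.pop(0)
            order.append(op)
            for src, dst in self.connections:
                if src == op:
                    indeg[dst] -= 1
                    if indeg[dst] == 0:
                        ready.append(dst)
        return order
```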
13. The streaming data processing method according to claim 1, further comprising: the data is protected by adopting a method combining an asymmetric encryption algorithm and a symmetric encryption algorithm:
firstly, a receiver generates a pair of asymmetrically encrypted keys, namely a public key and a private key; the receiver reserves a private key, and the public key is sent to the sender;
when a sender needs to transmit data, it first generates a symmetric encryption key and encrypts the data with that key; the sender then encrypts the symmetric encryption key with the received public key of the receiver, and sends the encrypted key together with the encrypted data to the receiver;
after receiving the data, the receiver first decrypts the encrypted symmetric encryption key with its own private key, recovering the original symmetric encryption key; the receiver then decrypts the data with the symmetric encryption key, recovering the original data.
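The envelope scheme of claim 13 can be demonstrated end to end with deliberately toy primitives: textbook RSA with tiny hardcoded primes for the asymmetric step and a SHA-256 keystream XOR for the symmetric step. This is NOT usable cryptography; it only traces the claimed key flow (in practice, e.g. RSA-OAEP plus AES-GCM would be used):

```python
import hashlib

# Toy illustration of the hybrid (envelope) scheme: the receiver's
# asymmetric pair protects a symmetric key, which in turn protects the
# data. Tiny textbook RSA and a SHA-256 keystream stand in for real
# algorithms; do not use this for actual security.

# Receiver's toy RSA key pair (p=61, q=53 -> n=3233, e=17, d=2753).
N_PUB, E_PUB, D_PRIV = 3233, 17, 2753

def keystream_xor(key_int, data):
    # Symmetric step: XOR the data with a SHA-256-derived keystream.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hashlib.sha256(f"{key_int}:{counter}".encode()).digest()
        out.extend(block)
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

# Sender side: generate a symmetric key, encrypt the data with it, then
# encrypt the symmetric key with the receiver's public key.
sym_key = 1234                          # must be < N_PUB for toy RSA
ciphertext = keystream_xor(sym_key, b"stream record")
wrapped_key = pow(sym_key, E_PUB, N_PUB)

# Receiver side: unwrap the symmetric key with the private key, then
# decrypt the data with it.
recovered_key = pow(wrapped_key, D_PRIV, N_PUB)
plaintext = keystream_xor(recovered_key, ciphertext)
```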
14. The streaming data processing method according to claim 1, further comprising: packaging data and calculation logic into data packets, wherein the data packets are cached in a shared queue and flow in the calculation process;
encapsulating the data and computational logic associated therewith into a data packet;
the encapsulated data packet is placed in a shared queue; this queue is shared by all computing resources;
the computing resource takes out the data packet from the shared queue according to the need;
after the calculation is completed, the result data and the next round of calculation logic for the data are packaged into a new data packet, which is put back into the shared queue; the calculation is thus driven by the data and feeds back into the data, forming a closed loop.
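The data-driven closed loop of claim 14 can be sketched with a shared queue of (data, remaining-logic) packets; the packet layout and function names are hypothetical:

```python
import queue

# Sketch of the claim-14 loop: each packet bundles data with the
# computation stages still to apply; a worker pulls a packet, runs the
# next stage, and re-enqueues the result with the remaining logic until
# no stages remain.

shared_queue = queue.Queue()           # shared by all computing resources

def double(x):
    return 2 * x

def increment(x):
    return x + 1

# A packet is (data, [remaining computation stages]).
shared_queue.put((5, [double, increment]))

results = []
while not shared_queue.empty():
    data, stages = shared_queue.get()
    fn, rest = stages[0], stages[1:]
    result = fn(data)
    if rest:
        # Feed the result and the next round's logic back into the queue.
        shared_queue.put((result, rest))
    else:
        results.append(result)
```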
15. The streaming data processing method according to claim 1, further comprising: establishing a strict access control mechanism, and limiting that only authorized users can access and process stream data; through authentication, authorization and rights management, it is ensured that only legitimate users can acquire and manipulate stream data.
16. A streaming data processing system, comprising:
the data acquisition unit is used for monitoring a plurality of data sources, creating an acquisition task for each data source, detecting the states of the acquisition tasks, and finding the tasks in an activated state; for an acquisition task in the activated state, acquiring the data when new data exists and pushing the data into a message queue Kafka;
the computing task partition unit is used for judging, according to the data in the message queue Kafka, whether the required input data are available; organizing a series of computing rules and data streams into computing task data packets stored in a memory, and dividing the computing task into a plurality of sub-computing tasks that can be executed independently or in parallel according to minimum computing rule factors;
the sub-computing task execution unit is used for defining one or more input data packets and one or more output data packets for each computing task according to the divided sub-computing tasks, and verifying the execution order of two computing phases, or whether the two computing phases can be executed simultaneously;
the sub-computing task computation unit is used for monitoring and judging whether a currently started sub-computing task is completed: once a sub-computing task is completed, a notification is sent to a resource scheduling module; after receiving the notification, the resource scheduling module reclaims the memory resources occupied by the completed sub-computing task.
17. A computer-readable storage medium, wherein at least one executable instruction is stored in the storage medium, which, when executed on an electronic device, causes the electronic device to perform the operations of the streaming data processing method according to any one of claims 1 to 15.
CN202311029714.0A 2023-08-16 2023-08-16 Stream data processing method, system and storage medium Active CN116737349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311029714.0A CN116737349B (en) 2023-08-16 2023-08-16 Stream data processing method, system and storage medium


Publications (2)

Publication Number Publication Date
CN116737349A true CN116737349A (en) 2023-09-12
CN116737349B CN116737349B (en) 2023-11-03

Family

ID=87911898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311029714.0A Active CN116737349B (en) 2023-08-16 2023-08-16 Stream data processing method, system and storage medium

Country Status (1)

Country Link
CN (1) CN116737349B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118409884A (en) * 2024-06-27 2024-07-30 杭州海康威视数字技术股份有限公司 Distributed data transmission method, device, system, equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677489A (en) * 2016-03-04 2016-06-15 山东大学 System and method for dynamically setting batch intervals under disperse flow processing model
CN106648904A (en) * 2017-01-09 2017-05-10 大连理工大学 Self-adaptive rate control method for stream data processing
US20180331824A1 (en) * 2015-11-20 2018-11-15 Genetec Inc. Secure layered encryption of data streams
CN110321223A (en) * 2019-07-03 2019-10-11 湖南大学 The data flow division methods and device of Coflow work compound stream scheduling perception
CN110362315A (en) * 2019-07-17 2019-10-22 中国工商银行股份有限公司 Software systems dispatching method and device based on DAG
CN111683069A (en) * 2020-05-28 2020-09-18 杭州绿度信息技术有限公司 Customized communication protocol and service method based on netty framework
CN113946431A (en) * 2021-12-22 2022-01-18 北京瑞莱智慧科技有限公司 Resource scheduling method, system, medium and computing device
CN113961438A (en) * 2021-10-25 2022-01-21 哈尔滨工业大学 Multi-granularity and multi-hierarchy based historical behavior abnormal user detection system, method, equipment and storage medium
CN114826656A (en) * 2022-03-02 2022-07-29 国家电网有限公司大数据中心 Trusted data link transmission method and system
CN115499244A (en) * 2022-11-16 2022-12-20 江花集团有限公司 Streaming data safe transmission and storage method based on data lake
CN116415206A (en) * 2023-06-06 2023-07-11 中国移动紫金(江苏)创新研究院有限公司 Operator multiple data fusion method, system, electronic equipment and computer storage medium
CN116501805A (en) * 2023-06-29 2023-07-28 长江三峡集团实业发展(北京)有限公司 Stream data system, computer equipment and medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FENGQI GUO et al.: "Enhancing Spatial Debris Material Classifying through a Hierarchical Clustering-Fuzzy C-Means Integration Approach", Applied Sciences, pages 1-16 *
SU, PAN et al.: "A hierarchical fuzzy cluster ensemble approach and its application to big data clustering", Journal of Intelligent and Fuzzy Systems, vol. 28, no. 6, pages 2409-2421 *
XU, LI: "Research on Cluster Ensemble Algorithms Based on Granular Computing" (in Chinese), China Doctoral Dissertations Full-text Database, Information Science and Technology, pages 138-28 *
XU, JI: "Research on Multi-Granularity Analysis of Big Data Based on Density Peaks" (in Chinese), China Doctoral Dissertations Full-text Database, Information Science and Technology, pages 138-21 *


Also Published As

Publication number Publication date
CN116737349B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
US20240348442A1 (en) Maintaining blocks of a blockchain in a partitioned blockchain network
EP3588295B1 (en) Self-managed intelligent elastic cloud stack
Huang et al. Elastic resource allocation against imbalanced transaction assignments in sharding-based permissioned blockchains
CN116737349B (en) Stream data processing method, system and storage medium
CN103699606A (en) Large-scale graphical partition method based on vertex cut and community detection
CN106130960B (en) Judgement system, load dispatching method and the device of steal-number behavior
Akhatov et al. Mechanisms of information reliability in bigdata and blockchain technologies
Mennes et al. GRECO: A distributed genetic algorithm for reliable application placement in hybrid clouds
CN103294558B (en) A kind of MapReduce dispatching method supporting dynamic trust evaluation
Lu et al. A multi-task oriented framework for mobile computation offloading
CN114610474A (en) Multi-strategy job scheduling method and system in heterogeneous supercomputing environment
CN104199912A (en) Task processing method and device
CN115277692B (en) Automatic operation and maintenance method, device and system for edge network computing terminal equipment
Ren et al. Joint optimization of VNF placement and flow scheduling in mobile core network
US11861386B1 (en) Application gateways in an on-demand network code execution system
US10824481B2 (en) Partial synchronization between compute tasks based on threshold specification in a computing system
CN117251889B (en) Block chain consensus method, related device and medium
CN116703601A (en) Data processing method, device, equipment and storage medium based on block chain network
CN116582407A (en) Containerized micro-service arrangement system and method based on deep reinforcement learning
Yin et al. An optimal image storage strategy for container-based edge computing in smart factory
CN110120959A (en) Big data method for pushing, device, system, equipment and readable storage medium storing program for executing
He et al. An efficient multi-keyword search scheme over encrypted data in multi-cloud environment
Ravikumar et al. Staleness and stagglers in distibuted deep image analytics
Wu et al. Improved simulated annealing algorithm for task allocation in real-time distributed systems
CN109286661A (en) A kind of data processing method of enterprise-level PaaS platform automatically dispose

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant