CN116841752A - Data analysis and calculation system based on distributed real-time calculation framework

Info

Publication number: CN116841752A
Application number: CN202311109401.6A
Authority: CN
Inventors: 张文博; 李平; 陈昌龙
Original assignee: Hangzhou Instan Information Technology Co ltd
Current assignee: Hangzhou Instan Information Technology Co ltd
Priority date: 2023-08-31
Filing date: 2023-08-31
Publication date: 2023-10-03
Anticipated expiration: 2043-08-31
Also published as: CN116841752B

Abstract

The invention discloses a data analysis and calculation system based on a distributed real-time calculation framework. The system comprises a distributed real-time computing framework and a business application. The framework comprises: a request module that generates a write request and a read request according to streaming data; a filtering module that generates a write instruction when the write request meets a preset write condition, and a read instruction when the read request meets a preset read condition; and a calculation module. The calculation module comprises: an extraction unit that extracts a plurality of calculation parameters from the streaming data according to the type of the aggregation field; a first calculation unit that substitutes each calculation parameter into a primary calculation formula in an operator template to obtain a plurality of primary data, and divides each primary data into a plurality of intermediate data according to a fine granularity; and a second calculation unit that, after receiving the read instruction, screens the intermediate data to obtain a plurality of screened data and substitutes each screened data into a convergence formula to obtain final data. The invention reduces the calculation cost and improves the calculation speed.

Description

Data analysis and calculation system based on distributed real-time calculation framework

Technical Field

The invention relates to the technical field of business data processing, in particular to a data analysis and calculation system based on a distributed real-time calculation framework.

Background

At present, most data analysis and calculation systems rely on distributed computing frameworks such as Spark and Flink to carry out data analysis and calculation tasks. When using such a framework, developers write a distributed computing task as part of big data task development, execute it in the framework (for example Spark or Flink), and write the calculation result into a database for the business system to use. For safety data analysis, existing data analysis and calculation systems need to deploy a large operating architecture. In addition, the calculation of real-time streaming data in a business application must be actively initiated, so an additional scheduling service connection has to be configured. As a result, both the operating cost and the amount of computation in the data calculation process are large.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a data analysis and calculation system based on a distributed real-time calculation framework, which reduces the calculation cost and improves the calculation speed.

In order to achieve the above purpose, the present invention provides the following technical solutions: a data analysis computing system based on a distributed real-time computing framework, comprising a distributed real-time computing framework and a business application, the distributed real-time computing framework being connected to the business application and comprising:

the request module is used for generating a writing request and a reading request according to the streaming data of the business application;

the filtering module is connected with the request module and is used for generating a writing instruction when the writing request meets preset writing conditions and generating a reading instruction when the reading request meets preset reading conditions;

the calculation module is connected with the filtering module and comprises:

the extraction unit is used for extracting a plurality of calculation parameters from the streaming data according to the type of the aggregation field after receiving the writing instruction;

the first calculation unit is connected with the extraction unit and is used for substituting each calculation parameter into a primary calculation formula in the operator template to obtain a plurality of primary data, and dividing each primary data into a plurality of intermediate data according to fine granularity;

the second calculation unit is connected with the first calculation unit and is used for screening each intermediate data according to a preset screening condition in the operator template to obtain a plurality of screening data after receiving the reading instruction, and substituting each screening data into a preset convergence formula to obtain final data.

Further, the write request includes an aggregation field and a calculation target of the streaming data, the read request includes the aggregation field, the write condition includes a field condition and a limiting condition, and the read condition includes a field condition;

the filter module includes:

the analysis unit is used for analyzing the write-in request to obtain the aggregation field and the data content, and analyzing the read-out request to obtain the aggregation field;

and the generation unit is connected with the analysis unit and is used for generating the writing instruction when the aggregation field in the writing request meets the field condition and the calculation target meets the limiting condition, and generating the reading instruction when the aggregation field in the reading request meets the field condition.

Further, the distributed real-time computing framework and the business application are integrated to run on nodes, and a plurality of the nodes are connected with a node task balancing module, wherein the node task balancing module comprises: the node counting unit is used for counting the number of the nodes and generating a task adjustment instruction when the number of the nodes changes;

and the task balancing unit is connected with the node counting unit and is used for summarizing the tasks on the current nodes into the total tasks according to the task adjustment instructions and redistributing the total tasks according to the number of the current nodes.

Further, the computing module further includes:

the storage unit is used for storing a plurality of operator templates, and each operator template is matched with the corresponding type of the aggregation field;

and the matching unit is respectively connected with the storage unit and the first computing unit and is used for matching the types of the aggregation fields in the storage unit to obtain the corresponding operator templates.

Further, the fine granularity is a temporal granularity.

Further, a read interface and a write interface are configured on the distributed real-time computing framework, and the distributed real-time computing framework obtains the read request through the read interface and obtains the write request through the write interface.

The invention has the beneficial effects that:

For streaming data generated by the business application in a safety data analysis scenario, the invention obtains the read request and the write request directly through interfaces, so no additional scheduling service connection needs to be configured; this enables analysis and calculation of the streaming data while reducing the cost of using distributed computing. Meanwhile, the streaming data undergoes primary calculation with the primary calculation formula when it is written, and convergence calculation with the convergence formula only when it is read. Unlike the prior art, in which calculation must be actively initiated, this reduces the overall calculation cost, saves the computing and storage cost of large volumes of real-time calculation, and lets the business application complete real-time calculation tasks quickly and conveniently.

Drawings

FIG. 1 is a schematic diagram of a data analysis computing system based on a distributed real-time computing framework in accordance with the present invention.

Reference numerals: 1. distributed real-time computing framework; 2. business application; 3. request module; 4. filtering module; 41. parsing unit; 42. generating unit; 5. calculation module; 51. extraction unit; 52. first calculation unit; 53. second calculation unit; 54. storage unit; 55. matching unit; 6. node task balancing module; 61. node statistics unit; 62. task balancing unit.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It will be understood that when an element is referred to as being "fixed to" another element, it can be directly on the other element or intervening elements may also be present. When a component is considered to be "connected" to another component, it can be directly connected to the other component or intervening components may also be present. When an element is referred to as being "disposed on" another element, it can be directly on the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like are used herein for illustrative purposes only.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Referring to fig. 1, the present embodiment provides a data analysis computing system based on a distributed real-time computing framework, which includes a distributed real-time computing framework 1 and a service application 2, wherein the distributed real-time computing framework 1 is connected with the service application 2, and includes:

a request module 3, configured to generate a write request and a read request according to the streaming data of the service application 2;

a filtering module 4, connected to the request module 3, configured to generate a write instruction when the write request satisfies a preset write condition, and generate a read instruction when the read request satisfies a preset read condition;

a calculation module 5, connected to the filtering module 4, comprising:

the extracting unit 51 is configured to extract a plurality of calculation parameters from the streaming data according to the type of the aggregation field after receiving the write instruction;

a first calculation unit 52, connected to the extraction unit 51, configured to substitute each calculation parameter into a primary calculation formula in the operator template to obtain a plurality of primary data, and divide each primary data into a plurality of intermediate data according to a fine granularity;

the second calculating unit 53 is connected to the first calculating unit 52, and is configured to, after receiving the reading instruction, screen each intermediate data according to a preset screening condition in the operator template to obtain a plurality of screening data, and then substitute each screening data into a preset convergence formula to obtain final data.

Specifically, in this embodiment, when service data needs to be written into or read from the distributed real-time computing framework 1, the request module 3 generates a write request or a read request according to the streaming data and sends it to the filtering module 4. The filtering module 4 determines whether the write request satisfies the write condition and whether the read request satisfies the read condition; it generates a write instruction when the write request satisfies the write condition, and a read instruction when the read request satisfies the read condition. The write instruction and the read instruction each contain a calculation task, and each calculation task is matched with an operator template. After the extracting unit 51 receives the write instruction, the corresponding operator template is obtained by matching against the calculation task in the write instruction. The extracting unit 51 then extracts a plurality of calculation parameters from the streaming data according to the type of the aggregation field. The calculation parameters are substituted into the primary calculation formula in the operator template to obtain primary data, and the primary data is then aggregated into fine-grained intermediate results to form the intermediate data. After receiving the read instruction, the second calculation unit 53 obtains the corresponding operator template by matching against the calculation task in the read instruction, screens the intermediate data according to the screening condition in the operator template, and substitutes the resulting screened data into the convergence formula to obtain the final data.
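The write path and read path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `OperatorTemplate`, `Pipeline`, `write`, and `read`, the record keys `key`, `ts`, and `amount`, and the in-memory dictionary standing in for storage are all hypothetical. The sketch also assumes a one-minute time granularity and an additive intermediate result, and uses a "sum over the last 5 minutes" operator as the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class OperatorTemplate:
    """Hypothetical operator template holding the three pieces named in the text."""
    primary: Callable[[dict], float]          # primary calculation formula
    screen: Callable[[int, int], bool]        # screening condition: (bucket, current bucket) -> keep?
    converge: Callable[[List[float]], float]  # convergence formula


class Pipeline:
    """Write path pre-aggregates per time bucket; read path screens and converges."""

    def __init__(self, template: OperatorTemplate,
                 write_condition: Callable[[dict], bool],
                 granularity_s: int = 60) -> None:
        self.template = template
        self.write_condition = write_condition  # stands in for the filtering module
        self.granularity_s = granularity_s      # fine granularity, in seconds
        # (aggregation key, bucket index) -> intermediate data
        self.intermediate: Dict[Tuple[str, int], float] = defaultdict(float)

    def write(self, record: dict) -> bool:
        """Return True if a write instruction was generated and applied."""
        if not self.write_condition(record):
            return False
        bucket = int(record["ts"]) // self.granularity_s
        # Assumption: the intermediate result is additive within a bucket.
        self.intermediate[(record["key"], bucket)] += self.template.primary(record)
        return True

    def read(self, key: str, now_ts: float) -> float:
        """Screen the stored intermediate data, then apply the convergence formula."""
        current = int(now_ts) // self.granularity_s
        screened = [value for (k, bucket), value in self.intermediate.items()
                    if k == key and self.template.screen(bucket, current)]
        return self.template.converge(screened)


# Example template: sum of `amount` over the five most recent minute buckets.
sum_last_5_min = OperatorTemplate(
    primary=lambda record: float(record["amount"]),
    screen=lambda bucket, current: current - 5 < bucket <= current,
    converge=sum,
)
pipeline = Pipeline(sum_last_5_min, write_condition=lambda record: "amount" in record)
```

In this arrangement the expensive per-record work happens once on write, and a read only touches the handful of buckets that pass the screening condition.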

In this technical scheme, for streaming data generated by the service application 2 in a safety data analysis scenario, the read request and the write request are obtained directly through interfaces, so no additional scheduling service connection needs to be configured; this enables analysis and calculation of the streaming data while reducing the cost of using distributed computing. Meanwhile, the streaming data undergoes primary calculation with the primary calculation formula when it is written, and convergence calculation with the convergence formula only when it is read. Unlike the prior art, in which calculation must be actively initiated, this reduces the overall calculation cost, saves the computing and storage cost of large volumes of real-time calculation, and lets the service application 2 complete real-time calculation tasks quickly and conveniently.

The functions realized by the invention also comprise:

1. Constructing a lightweight distributed real-time computing framework 1

The invention provides a computing framework designed and packaged to meet the needs of real-time analysis of time-series streaming data. Unlike the conventional real-time computing frameworks on the market, which have extensive functionality and numerous modules, the distributed real-time computing framework 1 addresses the multi-value real-time calculation of streaming data and adopts a fixed calculation mode and flow. Business personnel can therefore understand the distributed real-time computing framework 1 quickly and use it easily. Real-time computing developers can likewise complete design and development work quickly within the constraints of the framework. Moreover, because the calculation mode and flow are fixed, the distributed real-time computing framework 1 can support deeper performance optimization and feature support, and achieves a level of performance that general-purpose real-time computing frameworks cannot reach.

2. Supporting operator template encapsulation

In practice, drawing on the industry experience of data analysts, more than twenty real-time calculation operators have been designed and packaged into operator templates, so that only the corresponding calculation parameters need to be configured at use. In one embodiment, a standard deviation operator is used. To use it, a dimension field A for the calculation and a field B (a numeric value) on which the standard deviation is calculated are configured; in addition, a filtering condition can be configured through the filtering module 4 to select from the streaming data the data set that participates in the calculation. Once configured, the distributed real-time computing framework 1 groups the streaming data that satisfies the filtering condition by the dimension of field A, calculates the variance from the accumulated values of field B, and stores the calculation result in the database in real time.
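As an illustration of how such a standard deviation operator might be configured, the sketch below groups records by a dimension field and accumulates the count, sum, and sum of squares of a value field. The class names, the `record_filter` parameter, and the record keys are hypothetical, and the running-sums formula is one common way to accumulate a population variance, chosen here for brevity; the source does not specify the formula the framework uses.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class StdDevOperatorConfig:
    """Hypothetical configuration: only the parameters named in the text."""
    dimension_field: str   # field A: records are grouped by its value
    value_field: str       # field B: numeric value whose deviation is computed
    record_filter: Callable[[dict], bool] = lambda record: True  # optional filtering condition


class StdDevOperator:
    def __init__(self, config: StdDevOperatorConfig) -> None:
        self.config = config
        # group -> [count, sum, sum of squares]
        self._acc: Dict[object, List[float]] = {}

    def write(self, record: dict) -> None:
        """Accumulate one record if it passes the configured filter."""
        if not self.config.record_filter(record):
            return
        group = record[self.config.dimension_field]
        x = float(record[self.config.value_field])
        acc = self._acc.setdefault(group, [0, 0.0, 0.0])
        acc[0] += 1
        acc[1] += x
        acc[2] += x * x

    def read(self, group: object) -> Optional[float]:
        """Population standard deviation for one group, or None if no data."""
        acc = self._acc.get(group)
        if acc is None:
            return None
        n, total, total_sq = acc
        mean = total / n
        # Clamp at zero to absorb floating-point rounding error.
        variance = max(total_sq / n - mean * mean, 0.0)
        return math.sqrt(variance)


config = StdDevOperatorConfig(
    dimension_field="account",
    value_field="amount",
    record_filter=lambda record: record.get("type") == "trade",
)
operator = StdDevOperator(config)
```

The sum-of-squares form can lose precision when values are large relative to their spread; an incremental update of the mean and squared deviations is a more stable alternative.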

Because the distributed real-time computing framework 1 is lightweight and flexible, users can conveniently and quickly carry out secondary development, packaging of general capabilities, and similar work on the framework. Once encapsulated, an operator can be applied to multiple business scenarios through simple configuration. With the operator template capability, the business purpose can be achieved in most scenarios through configuration alone, which effectively improves the portability and flexibility of configuration.

The data analysis and calculation system can rapidly complete the application of real-time calculation, and can play a good role in small-scale service nodes or large-scale distributed clusters.

Small-scale service node: on a virtual machine with 8 threads and 16 GB of memory, more than 15,000 real-time calculations per second can be completed at a scale of about 10 million distinct values, with 99% of calculation response times within 50 milliseconds.

Large-scale distributed cluster: on 21 virtual machines, each with 32 threads and 64 GB of memory, more than 200,000 real-time calculations per second can be completed at a scale of about 400 million distinct values, with 99% of calculation response times within 50 milliseconds.

Because a different operator template is invoked for each calculation task, the primary calculation formula and the convergence formula referenced by each calculation task also differ.

Embodiment one:

The calculation task is configured as "number of login failures in the last 5 minutes", and it is calculated as follows:

The filtering module 4 screens the data in the write instruction transmitted by the service application 2. If the "login success flag" field is missing, or the value of the login success flag is "failure", the calculation condition is not met, and the process terminates and returns; otherwise, the subsequent steps are performed.

The extracting unit 51 extracts the necessary write data fields from the complete streaming data according to the write instruction, such as the login account, the login success flag, and the login time, as calculation parameters.

The first calculation unit 52, following the count operator template, reads the intermediate calculation data for the current minute, adds the current result to it, and writes the sum back to storage, which completes the write calculation. The fine granularity in this process is one minute.

On reading, the second calculation unit 53 builds a query condition from the "last 5 minutes" in the calculation task, queries all intermediate data within the last 5 minutes, and performs the final aggregation by summing them. The resulting value is the configured "number of login failures in the last 5 minutes", that is, the final data.
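The write and read steps of this embodiment can be sketched with per-minute counters. The class name `MinuteCounter`, its method names, and the use of Unix timestamps in seconds are assumptions made for illustration; the sketch also assumes every record passed to `write` has already been accepted by the filtering module.

```python
from collections import Counter

GRANULARITY_S = 60  # fine granularity of this embodiment: one minute


class MinuteCounter:
    """Hypothetical count operator with per-minute intermediate data."""

    def __init__(self) -> None:
        # (login account, minute index) -> count accumulated in that minute
        self.buckets: Counter = Counter()

    def write(self, account: str, login_ts: float) -> None:
        """Read the current minute's intermediate count, add one, store it back."""
        minute = int(login_ts) // GRANULARITY_S
        self.buckets[(account, minute)] += 1

    def read(self, account: str, now_ts: float, window_minutes: int = 5) -> int:
        """Sum the intermediate counts of the most recent `window_minutes` buckets."""
        current = int(now_ts) // GRANULARITY_S
        first = current - window_minutes + 1
        # Counter returns 0 for missing keys, so empty minutes need no special case.
        return sum(self.buckets[(account, m)] for m in range(first, current + 1))


counter = MinuteCounter()
```

Because the window is aligned to minute buckets, it covers the current, partially elapsed minute plus the four preceding full minutes; a finer granularity would tighten that approximation at the cost of more intermediate data.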

Embodiment two:

The calculation task is configured as "transaction amount variance in the last 24 hours", and it is calculated as follows:

The filtering module 4 screens the data in the write instruction transmitted by the business application 2. If the data type is not the transaction type, or the transaction amount field is missing, or the value of the transaction amount is not numeric, the calculation condition is not met, and the process terminates and returns; otherwise, the subsequent steps are performed.

The extracting unit 51 extracts the "transaction type" and the "transaction amount" from the complete streaming data as calculation parameters according to the write instruction.

The first calculation unit 52 calculates in an incremental mode, so that it does not need to fetch all values for every calculation. In each calculation, the accumulated variance result is updated from the current value together with the previously stored variance and mean, and the stored result is replaced with the updated one. This variance result is the intermediate data, and the fine granularity in this process is one hour.

On reading, the second calculation unit 53 queries all variance results within the last 24 hours and then combines them by formula, on a principle similar to the incremental variance, to obtain the total variance result.
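The source does not give the incremental formula, so the sketch below substitutes two well-known techniques that fit the description: Welford's online update for the per-hour intermediate state, and the pairwise merge of (count, mean, sum of squared deviations) attributed to Chan et al. for combining hourly results at read time. The names `VarState`, `merge`, and `HourlyVariance` are hypothetical, and population variance is assumed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class VarState:
    """Intermediate data for one hour: count, mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x: float) -> None:
        """Welford's update: fold one value in without revisiting earlier values."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def merge(a: VarState, b: VarState) -> VarState:
    """Combine two partial states into the state of their union."""
    n = a.n + b.n
    if n == 0:
        return VarState()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return VarState(n, mean, m2)


class HourlyVariance:
    def __init__(self) -> None:
        self.buckets: Dict[int, VarState] = {}  # hour index -> intermediate state

    def write(self, amount: float, ts: float) -> None:
        hour = int(ts) // 3600
        self.buckets.setdefault(hour, VarState()).add(float(amount))

    def read(self, now_ts: float, window_hours: int = 24) -> Optional[float]:
        """Population variance over the last `window_hours` hourly buckets."""
        current = int(now_ts) // 3600
        total = VarState()
        for hour in range(current - window_hours + 1, current + 1):
            if hour in self.buckets:
                total = merge(total, self.buckets[hour])
        return total.m2 / total.n if total.n else None


variance_op = HourlyVariance()
```

Storing the count and mean alongside the squared deviations is what makes the hourly results combinable; a bare variance value per hour would not be enough to reconstruct the variance of the whole window.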

In this embodiment, referring to fig. 1, the write request includes an aggregation field and a calculation target of the streaming data, the read request includes an aggregation field, the write condition includes a field condition and a constraint condition, and the read condition includes a field condition;

the filter module 4 includes:

a parsing unit 41, configured to parse the write request to obtain an aggregation field and data content, and parse the read request to obtain an aggregation field;

a generating unit 42, connected to the parsing unit 41, configured to generate a write instruction when the aggregation field in the write request satisfies the field condition and the calculation target satisfies the constraint condition, and to generate a read instruction when the aggregation field in the read request satisfies the field condition.

In the first embodiment, the field condition is that the aggregation field includes a "login success flag" field, and the constraint condition is that the value of the "login success flag" is not "failure". In the second embodiment, the field condition is that the aggregation field includes a "transaction amount" field, and the constraint condition is that the value of the "transaction amount" is of a numeric type and the data type is the "transaction" type.
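The parsing and generating steps can be sketched as a small function factory, shown here with the conditions of the second embodiment. The request layout (a `data` dictionary), the field names `transaction_amount` and `data_type`, and the `WriteInstruction` type are assumptions for illustration only.

```python
from dataclasses import dataclass
from numbers import Real
from typing import Callable, Optional


@dataclass(frozen=True)
class WriteInstruction:
    """Hypothetical write instruction: the calculation task plus the accepted record."""
    task: str
    record: dict


def make_write_filter(task: str,
                      field_condition: Callable[[dict], bool],
                      constraint: Callable[[dict], bool]
                      ) -> Callable[[dict], Optional[WriteInstruction]]:
    """Build a filter that emits a write instruction only if both conditions hold."""
    def generate(request: dict) -> Optional[WriteInstruction]:
        record = request.get("data", {})  # parsing step: pull the payload out
        # The field condition is checked first, so the constraint may assume the field exists.
        if field_condition(record) and constraint(record):
            return WriteInstruction(task, record)
        return None  # condition not met: terminate and return
    return generate


def is_numeric(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


transaction_filter = make_write_filter(
    task="transaction amount variance in the last 24 hours",
    field_condition=lambda r: "transaction_amount" in r,
    constraint=lambda r: r.get("data_type") == "transaction"
    and is_numeric(r["transaction_amount"]),
)
```

A read filter would follow the same shape with only the field condition, since the read condition in the text has no constraint condition.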

The distributed real-time computing framework 1 and the business application 2 are integrated and run on nodes, a plurality of nodes are connected with a node task balancing module 6, and the node task balancing module 6 comprises:

a node statistics unit 61, configured to count the number of each node, and generate a task adjustment instruction when the number of nodes changes;

the task balancing unit 62 is connected to the node statistics unit 61, and is configured to aggregate the tasks on the current nodes into a task total according to the task adjustment instruction, and redistribute the task total according to the number of the current nodes.

Specifically, in the present embodiment, a lightweight centreless distributed cluster architecture is constructed. The architecture may consist of several nodes, and the distributed real-time computing framework 1 and the business application 2 of this embodiment run integrated on those nodes. There is no explicit central node in the architecture; all nodes are equal. The node task balancing module 6 balances the task allocation across the nodes as follows. The node statistics unit 61 senses changes in the number of nodes in the cluster through network intercommunication. When the number of nodes increases, the node statistics unit 61 generates a task adjustment instruction and sends it to the task balancing unit 62; the task balancing unit 62 then pools the tasks on the current nodes into a task total according to the task adjustment instruction and redistributes the task total according to the current number of nodes. Similarly, a decrease in the number of nodes indicates that nodes have gone offline from the cluster; the node statistics unit 61 again generates a task adjustment instruction and sends it to the task balancing unit 62, which pools and redistributes the tasks in the same way.
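The pooling and redistribution step can be sketched as follows. The source does not say how tasks are divided among nodes, so round-robin over sorted node and task names is an assumption of this sketch, as are the names `NodeStatistics` and `rebalance` and the use of string identifiers.

```python
from typing import Dict, Iterable, List


class NodeStatistics:
    """Tracks the node count and reports when it changes."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self.count = len(set(nodes))

    def changed(self, nodes: Iterable[str]) -> bool:
        """Return True (a task adjustment is needed) if the node count differs."""
        new_count = len(set(nodes))
        has_changed = new_count != self.count
        self.count = new_count
        return has_changed


def rebalance(assignments: Dict[str, List[str]],
              live_nodes: Iterable[str]) -> Dict[str, List[str]]:
    """Pool every task into one total, then deal it out across the live nodes."""
    nodes = sorted(set(live_nodes))
    if not nodes:
        raise ValueError("at least one live node is required")
    # Pool the tasks from all previous owners, including nodes that went offline.
    total = sorted(task for tasks in assignments.values() for task in tasks)
    balanced: Dict[str, List[str]] = {node: [] for node in nodes}
    for index, task in enumerate(total):
        balanced[nodes[index % len(nodes)]].append(task)
    return balanced
```

Sorting both lists makes the result deterministic, so in a cluster without a central node every node that sees the same membership could compute the same assignment independently.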

In this embodiment, referring to fig. 1, the calculation module 5 further includes:

the storage unit 54 stores a plurality of operator templates, and each operator template is matched with the type of the corresponding aggregation field;

the matching unit 55 is respectively connected with the storage unit 54 and the first computing unit 52, and is configured to match the types of the aggregation fields in the storage unit 54 to obtain corresponding operator templates.
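One way to picture the storage unit and the matching unit is a registry keyed by the type of the aggregation field. The type labels `login` and `amount`, the `OperatorTemplate` fields, and the `TemplateStore` class are hypothetical; the source does not describe how field types are represented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class OperatorTemplate:
    """Hypothetical template: a name, a primary formula, and a convergence formula."""
    name: str
    primary: Callable[[float], float]
    converge: Callable[[List[float]], float]


class TemplateStore:
    """Plays the role of the storage unit: templates keyed by aggregation field type."""

    def __init__(self) -> None:
        self._templates: Dict[str, OperatorTemplate] = {}

    def register(self, field_type: str, template: OperatorTemplate) -> None:
        self._templates[field_type] = template

    def match(self, field_type: str) -> OperatorTemplate:
        """Plays the role of the matching unit: look up the template for a field type."""
        try:
            return self._templates[field_type]
        except KeyError:
            raise LookupError(
                f"no operator template for aggregation field type {field_type!r}"
            ) from None


store = TemplateStore()
# Each event contributes 1; reading sums the contributions (a count).
store.register("login", OperatorTemplate("count", lambda value: 1.0, sum))
# Each event contributes its own value; reading takes the largest one.
store.register("amount", OperatorTemplate("max", lambda value: value, max))
```

Failing loudly on an unknown field type keeps a misconfigured calculation task from silently producing no data.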

Preferably, the fine granularity is a temporal granularity.

Preferably, the distributed real-time computing framework 1 is configured with a read interface and a write interface, and the distributed real-time computing framework 1 obtains a read request through the read interface and obtains a write request through the write interface.

Specifically, in this embodiment, the data interaction between the distributed real-time computing framework 1 and the service application 2 is directly implemented through the read interface and the write interface, so that no additional scheduling service is required to be configured, and the deployment cost of distributed computing is reduced.

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims

1. A data analysis computing system based on a distributed real-time computing framework, comprising a distributed real-time computing framework and a business application, wherein the distributed real-time computing framework is connected with the business application and comprises:

the calculation module is connected with the filtering module and comprises:

2. The distributed real-time computing framework based data analysis computing system of claim 1, wherein: the write request comprises an aggregation field and a calculation target of the streaming data, the read request comprises the aggregation field, the write condition comprises a field condition and a limiting condition, and the read condition comprises a field condition;

the filter module includes:

the generation unit is connected with the analysis unit and is used for generating the writing instruction when the aggregation field in the writing request meets the field condition and the calculation target meets the limiting condition, and generating the reading instruction when the aggregation field in the reading request meets the field condition.

3. The distributed real-time computing framework based data analysis computing system of claim 1, wherein: the distributed real-time computing framework and the business application are integrated and run on nodes, a plurality of the nodes are connected with a node task balancing module, and the node task balancing module comprises: the node counting unit is used for counting the number of the nodes and generating a task adjustment instruction when the number of the nodes changes;

4. The distributed real-time computing framework based data analysis computing system of claim 1, wherein: the computing module further includes:

5. The distributed real-time computing framework based data analysis computing system of claim 1, wherein: the fine granularity is a temporal granularity.

6. The distributed real-time computing framework based data analysis computing system of claim 1, wherein: the distributed real-time computing framework is provided with a reading interface and a writing interface, and the distributed real-time computing framework obtains the reading request through the reading interface and obtains the writing request through the writing interface.