CN114238463A

CN114238463A - Calculation engine control method and device for distributed index calculation

Info

Publication number: CN114238463A
Application number: CN202111376612.7A
Authority: CN
Inventors: 林思; 张亚; 李佳臻; 胡耀辉
Original assignee: Bank of Communications Co Ltd
Current assignee: Bank of Communications Co Ltd
Priority date: 2021-11-19
Filing date: 2021-11-19
Publication date: 2022-03-25

Abstract

The invention relates to a calculation engine control method and device for distributed index calculation. The method comprises the following steps: S1, a calculation front-end service module receives an index formula, parses it, assembles a workflow and sends the workflow to a scheduling component module; S2, the scheduling component module parses the dependency relationships of the workflow and sends them to a calculation engine module; S3, the calculation engine module distributes and executes calculation tasks by task type according to the dependency relationships of the workflow, and sends the calculation results to the calculation front-end service module; and S4, the calculation front-end service module performs verification based on the index lineage (blood relationship) recorded from the workflow and the calculation results sent by the calculation engine module, and finally generates the target index. Compared with the prior art, the method is easy to develop and learn, flexible and highly extensible, lets business personnel see the processing relationship between indexes and detail data in a timely manner, and makes it convenient for them to report abnormal changes and analyze data.

Description

Calculation engine control method and device for distributed index calculation

Technical Field

The invention relates to the technical field of distributed computing, in particular to a computing engine control method and device for distributed index computing.

Background

With big data technology now prevalent, indexes serve as measures of business performance and play an important supporting role in enterprise management decisions, strategic planning and the like. However, in the face of enterprises' massive index calculation requirements, no distributed computing engine dedicated to indexes exists on the market. The prior art shows the following development trends:

1. Fine-grained recalculation of indexes with minimal impact. Recalculating part of the index values in real time is a common business scenario; for example, when the data of a base table is reloaded, the affected indexes must be identified and a real-time recalculation triggered;

2. Automatic maintenance of the dependency and lineage relationships between indexes and data and among indexes, which avoids the errors and omissions of manual maintenance;

3. Distributed computing. In the face of massive data, distributed computing is indispensable;

4. Lightweight design that is easy to develop, maintain and extend;

5. Appropriate use of in-memory computation to improve calculation performance.

In current index calculation, SQL-based ETL is one of the commonly used technical schemes. ETL refers to the extraction (Extract), transformation (Transform) and loading (Load) of data, a process for loading data into a data warehouse in an integrated manner. A common ETL scheme at present uses SQL (Structured Query Language) as the development language and combines a relational database, a non-relational database and the MapReduce model into a data processing solution.

First, the SQL-based ETL scheme depends heavily on multiple components, each of which must be maintained by a team proficient in it, so operation and maintenance costs are high. Second, SQL is weak at code reuse; a single piece of logic often confronts developers with hundreds of lines of SQL, so development, testing and maintenance costs are high. Third, the scheme lacks a native ability to inspect lineage; data consistency can only be ensured through the experience of data developers, so the cost of maintaining data reliability is high.

Data storage based on a single-node relational database is another common approach to index calculation. A relational database organizes data using the relational model, storing real-world entities and the relationships between them. A common scheme is to store data in a relational database such as MySQL, MariaDB or DB2, usually in a single-node deployment, which has the advantages of low operation and maintenance cost and being business-friendly.

However, storing data in a relational database is at a disadvantage in throughput and in processing high-dimensional and massive data; data warehousing in the big data context emphasizes massive data analysis, and this technology cannot meet that requirement well. Moreover, a traditional relational database is difficult to scale out in parallel, whereas online analytical processing (OLAP) today emphasizes data value mining and requires computing capability for complex analysis of massive data, so this technical scheme falls short in that respect as well.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, namely high operation and maintenance cost, poor data reliability and weakness in processing high-dimensional and massive data, and provides a calculation engine control method and device for distributed index calculation.

The purpose of the invention can be realized by the following technical scheme:

A calculation engine control method for distributed index calculation comprises the following steps:

S1, the calculation front-end service module receives an index formula, parses it, assembles a workflow and sends the workflow to the scheduling component module;

S2, the scheduling component module parses the dependency relationships of the workflow and sends them to the calculation engine module;

S3, the calculation engine module distributes and executes calculation tasks by task type according to the dependency relationships of the workflow, and sends the calculation results to the calculation front-end service module;

S4, the calculation front-end service module performs verification based on the index lineage (blood relationship) recorded from the workflow and the calculation results sent by the calculation engine module, and finally generates the target index.

The invention depends only on Java and a relational database, and does not depend on any other middleware.

The calculation front-end service module logically decomposes the index formula to generate and store an execution plan.

By parsing the index formula, the calculation front-end service module resolves the concrete calculation process into a directed acyclic graph.
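As a minimal sketch of how index formulas might be turned into a dependency graph, the Python below parses each formula and records which names it references. The `target = expression` syntax, the index names and the helper names `parse_formulas` and `is_acyclic` are assumptions made for illustration; the formula language of the invention is not reproduced here, and Python's `ast` module merely stands in for the formula parser.

```python
import ast


def parse_formulas(formulas):
    """Map each target index to the set of names its formula references.

    Assumes the hypothetical syntax "target = arithmetic expression".
    """
    graph = {}
    for line in formulas:
        target, expr = (part.strip() for part in line.split("=", 1))
        tree = ast.parse(expr, mode="eval")
        # Every bare name in the expression is treated as a dependency.
        graph[target] = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return graph


def is_acyclic(graph):
    """Depth-first search reporting whether the dependency graph has no cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}

    def visit(node):
        color[node] = GREY
        for dep in graph.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:  # back edge: the formulas depend on each other
                return False
            if state == WHITE and not visit(dep):
                return False
        color[node] = BLACK
        return True

    return all(visit(n) for n in graph if color.get(n, WHITE) == WHITE)
```

A set of formulas whose graph fails the `is_acyclic` check could be rejected before any workflow is assembled; whether the invention performs such a check is not stated in the source.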

The calculation engine module adopts a formula language as a calculation task submission interface.

Further, the calculation engine module completes the loading, data fetching, calculation and storage of the basic data according to the formula language and the workflow.

Further, the calculation engine module adopts a distributed database as a storage medium of the index library.

The calculation engine module supports custom functions for specific, complex data retrieval requirements, so as to meet most business requirements.

The functions of the calculation front-end service module include workflow splitting and lineage (blood relationship) tracking.

Workflow splitting specifically means that a group of index formulas is parsed, according to the syntax tree, into a plurality of calculation tasks that are assembled into a workflow with dependencies; calculation tasks with no dependency relationship between them are calculated in a distributed manner by the calculation engine module.
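The grouping of independent tasks can be illustrated with a topological sort. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9 or later) to split a dependency map into levels, where every task in a level has all of its prerequisites in earlier levels and could therefore be dispatched in parallel. The function name `split_into_levels` and the sample index names are hypothetical, and the invention's own splitting algorithm may differ.

```python
from graphlib import TopologicalSorter


def split_into_levels(dependencies):
    """Group tasks into levels; tasks within one level do not depend on each other.

    `dependencies` maps a task name to the set of tasks it depends on.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make output stable
        levels.append(ready)
        sorter.done(*ready)  # unlocks the tasks that waited on this level
    return levels


# Hypothetical example: margin depends on profit, which depends on base data.
example = {"profit": {"revenue", "cost"}, "margin": {"profit", "revenue"}}
```

For the hypothetical `example` map, the base data `cost` and `revenue` form the first level, `profit` the second and `margin` the third.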

Lineage tracking specifically means that, when a formula is parsed, the dependency relationships between indexes, between indexes and reports, and between reports and indexes are stored in a lineage table of the database. After data is modified and indexes are being recalculated, these relationships are used to trace the dependencies among source indexes, indexes and reports; the dependencies are searched iteratively to generate a recalculation workflow.
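A lineage ("blood relationship") table and the repeated dependency search can be sketched with an in-memory SQLite database and a recursive query. The table layout (`upstream`, `downstream`), the function names and the sample rows are assumptions for illustration only; the source does not specify the schema or the database product used for the lineage table.

```python
import sqlite3


def build_lineage(edges):
    """Create a hypothetical lineage table holding (upstream, downstream) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lineage (upstream TEXT, downstream TEXT)")
    conn.executemany("INSERT INTO lineage VALUES (?, ?)", edges)
    return conn


def affected_by(conn, changed):
    """Return every index or report that depends, directly or not, on `changed`."""
    rows = conn.execute(
        """
        WITH RECURSIVE affected(name) AS (
            SELECT downstream FROM lineage WHERE upstream = ?
            UNION  -- UNION (not UNION ALL) drops duplicates, so cycles terminate
            SELECT l.downstream FROM lineage AS l
            JOIN affected AS a ON l.upstream = a.name
        )
        SELECT name FROM affected
        """,
        (changed,),
    ).fetchall()
    return {name for (name,) in rows}


# Hypothetical lineage: a base table feeds indexes, which feed reports.
EDGES = [
    ("base_table", "revenue"),
    ("revenue", "profit"),
    ("cost", "profit"),
    ("profit", "report_A"),
    ("headcount", "report_B"),
]
```

The set returned by `affected_by` is the part of the graph that a recalculation workflow would need to cover; ordering those items for execution is a separate step.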

The calculation engine module is the execution module for concrete calculation tasks, and each calculation unit is based on the actor model of distributed computing.

Further, the actor model consists of a state, a behavior and a mailbox. The state is the variable information of an actor object; it is managed by the actor itself, which avoids problems such as locks and memory atomicity in a concurrent environment. The behavior specifies the calculation logic of the actor, and the actor's state is changed by the messages it receives. The mailbox is the communication bridge between actors; internally it stores messages in a FIFO message queue, into which senders put messages and from which the receiver takes them.
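The three parts of an actor can be shown in a few lines. This is a minimal single-process sketch using a thread and `queue.Queue`, not the invention's Java implementation; the class name `CounterActor` and the `("add", n)` / `("get", reply_queue)` message shapes are hypothetical.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state, a behavior, and a FIFO mailbox."""

    def __init__(self):
        self._mailbox = queue.Queue()  # mailbox: FIFO message queue
        self._total = 0                # state: touched only by the actor thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Senders only enqueue messages; they never touch the state directly."""
        self._mailbox.put(message)

    def _run(self):
        """Behavior: process one message at a time, in arrival order."""
        while True:
            message = self._mailbox.get()
            if message is None:  # hypothetical stop signal
                break
            kind, payload = message
            if kind == "add":
                self._total += payload
            elif kind == "get":
                payload.put(self._total)  # reply through the caller's queue

    def stop(self):
        self._mailbox.put(None)
        self._thread.join()


def demo():
    actor = CounterActor()
    for value in (1, 2, 3):
        actor.send(("add", value))
    reply = queue.Queue()
    actor.send(("get", reply))
    result = reply.get(timeout=5)
    actor.stop()
    return result
```

Because only the actor's own thread reads or writes `_total`, no lock around the state is needed; the mailbox is the only shared structure.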

The calculation engine module supports dynamic scaling: by adopting a microservice, stateless-service design pattern, the amount of computing resources can be adjusted dynamically according to task load. Meanwhile, the work node responsible for task distribution refreshes the machine load of each entry in the task node list on every heartbeat, and distributes tasks dynamically according to the load of the task nodes, thereby achieving load balancing.
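Heartbeat-driven distribution can be sketched as a small dispatcher that keeps the latest reported load per task node and picks the least-loaded one. The class name `Dispatcher`, the numeric load value and the `heartbeat_timeout` rule for ignoring silent nodes are assumptions added for illustration; the source states only that load is refreshed by heartbeat and that tasks are distributed according to it.

```python
class Dispatcher:
    """Tracks task-node load from heartbeats and picks the least-loaded node."""

    def __init__(self, heartbeat_timeout=10.0):
        # Assumption: a node that has been silent this long is skipped.
        self.heartbeat_timeout = heartbeat_timeout
        self.nodes = {}  # node id -> (reported load, time of last heartbeat)

    def heartbeat(self, node_id, load, now):
        """Refresh the load of one entry in the task node list."""
        self.nodes[node_id] = (load, now)

    def pick_node(self, now):
        """Return the live node with the lowest load (ties broken by node id)."""
        live = {
            node: load
            for node, (load, seen) in self.nodes.items()
            if now - seen <= self.heartbeat_timeout
        }
        if not live:
            raise RuntimeError("no live task node")
        return min(live, key=lambda node: (live[node], node))
```

Timestamps are passed in explicitly so the behavior is easy to test; a real dispatcher would more likely read a monotonic clock.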

During verification, step S4 includes the following process: a group of check formulas is configured for the indexes and reports in the index check module; after the indexes are generated, the check formulas are run against them, and if a check formula does not pass, a reminder that the index calculation task has failed is triggered.
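The verification step can be sketched as a set of named predicates evaluated over the generated index values. The check names, the accounting identity used as an example and the `alert` callback are hypothetical; the source does not describe the syntax of its check formulas.

```python
import math


def run_checks(indexes, checks):
    """Return the names of the check formulas that do not pass."""
    return [name for name, check in checks.items() if not check(indexes)]


def verify_or_alert(indexes, checks, alert):
    """Run all checks; call `alert` with a reminder message if any fails."""
    failed = run_checks(indexes, checks)
    if failed:
        alert("index calculation task failed checks: " + ", ".join(failed))
    return not failed


# Hypothetical check formulas configured for a group of indexes.
CHECKS = {
    "balance": lambda v: math.isclose(v["assets"], v["liabilities"] + v["equity"]),
    "non_negative_assets": lambda v: v["assets"] >= 0,
}
```

Passing `alert` as a parameter keeps the sketch independent of any particular notification channel.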

An apparatus using the calculation engine control method for distributed index calculation includes a memory and a processor, and the control method is stored in the memory in the form of a computer program and executed by the processor, and when executed, implements steps S1 to S4.

Compared with the prior art, the invention has the following beneficial effects:

1. Lineage analysis. When actually producing reports and checking details, business personnel want to know the processing relationship between indexes and detail data, how the data was generated, and so on. To meet this requirement, the traditional approach usually requires head-office and branch technical and business personnel to configure the different indexes separately. The invention automatically generates the relationship between indexes and details through the lineage tracking of the calculation front-end service module and presents it to the business department, meeting the needs of business personnel.

2. Data generation. After learning the relationship between the relevant indexes and the details through lineage analysis, business personnel need to correct the detail data and finally generate a report. At present, the business side modifies detail and report data at the same time, with no verification between the two. Regenerating a report from corrected details in the traditional SQL manner amounts to regenerating all the data of the report, which performs poorly and is neither efficient nor intuitive. In the invention, the precompilation parser maps the detail layer directly to the index layer: after detail data is modified, business personnel are shown which reports or data are affected; once they confirm, a calculation flow for those details is generated, and only the indexes affected by the modified detail items are reprocessed by in-memory computation to regenerate the report, leaving the other indexes of the report untouched.

3. Intelligent translation. Through the precompilation parser, the invention can directly convert the DFL language into requirement terms readable by business personnel, explaining to the user how an index is derived. This helps business personnel understand the meaning and processing flow of an index, and also lets them check whether the data retrieval criteria are consistent with the requirement.

4. Abnormal-change tracing. If report-layer data changes abnormally, the invention analyzes the root cause of the abnormal change using an iterative algorithm that follows the chained dependencies within the system, making it convenient for business personnel to report abnormal changes and analyze data.

5. Easy to develop and easy to learn. Although the syntax of traditional SQL is relatively simple, its code is hard to read and costly to maintain; these disadvantages are especially evident when complex business logic requires many joins and deep nesting.

6. Flexible and highly extensible. When handling specific irregular requirements, the traditional SQL approach usually spends a great deal of time on data reorganization, processing and reprocessing. Based on in-program in-memory computation and a microservice extension form, the invention supports custom functions: when a custom extension is needed, a function package can be loaded dynamically without restarting the main application.
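Loading a function package without restarting the host application can be illustrated with a small registry that imports a module by name at runtime. This is a Python sketch under stated assumptions, not the invention's Java mechanism: `FunctionRegistry` and the alias `t_median` are hypothetical, and the standard-library `statistics` module stands in for a user-supplied function package.

```python
import importlib


class FunctionRegistry:
    """Maps formula-level function names to callables loaded at runtime."""

    def __init__(self):
        self._functions = {}

    def register(self, name, func):
        self._functions[name] = func

    def load(self, module_name, attribute, alias=None):
        """Import a module by name while the application keeps running."""
        module = importlib.import_module(module_name)
        self.register(alias or attribute, getattr(module, attribute))

    def call(self, name, *args):
        if name not in self._functions:
            raise KeyError(f"unknown custom function: {name}")
        return self._functions[name](*args)


def demo():
    registry = FunctionRegistry()
    # `statistics.median` stands in for a function from a custom package.
    registry.load("statistics", "median", alias="t_median")
    return registry.call("t_median", [1, 3, 2])
```

The microservice side of the extension mechanism (deploying the package as a separate service) is outside the scope of this sketch.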

Drawings

FIG. 1 is a schematic structural view of the present invention;

FIG. 2 is a schematic structural view of the present invention;

FIG. 3 is a schematic structural diagram of the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.

Example one

As shown in FIG. 1, a calculation engine control method for distributed index calculation specifically includes the following steps:

The calculation front-end service module logically decomposes the index formula to generate and store an execution plan.

By parsing the index formula, the calculation front-end service module resolves the concrete calculation process into a directed acyclic graph.

In this embodiment, the open-source scheduling system DolphinScheduler is used as the scheduling component module, with a customized API for interacting with the formula calculation engine. The directed acyclic graph generated by the calculation front-end service module is imported into the scheduling component through an interface; the scheduling component calls the calculation engine distribution module to distribute calculation tasks according to the dependency relationships, and the distribution module then calls the corresponding calculation engine execution units as each task requires. Huawei GaussDB is used as the index repository: an execution unit loads the data a task needs from the repository for calculation, and the front-end service assembles the results and writes the newly generated indexes into the repository. The verification module then verifies the generated indexes against the configured rules and raises an alarm if verification fails.

In this embodiment, the grammar rules of the formula language are as follows:

t_[access function]([table name or index]"[value field], #[dimension], [judgment condition]([value])), $[alias field], [actual field], [normal field], [judgment condition]([value]).

The calculation engine module completes the loading, retrieval, calculation and storage of basic data according to the formula language and the workflow.

The calculation engine module adopts a distributed database as the storage medium of the index library.

The calculation engine module supports custom functions for specific, complex data retrieval requirements, so as to meet most business requirements.

The functions of the calculation front-end service module include workflow splitting and lineage (blood relationship) tracking.

Workflow splitting means that a group of index formulas is parsed, according to the syntax tree, into a plurality of calculation tasks that are assembled into a workflow with dependencies; calculation tasks with no dependency relationship between them are calculated in a distributed manner by the calculation engine module. As shown in FIG. 2, nodes such as 4, 5, 6, 2 and 3 are calculated in parallel. Each task is an atomic calculation unit consisting of one atomic formula, and the calculation engine module is responsible for calculating it.
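Running the independent tasks of a workflow side by side can be sketched with a thread pool. In this illustration, threads stand in for the distributed execution units, each atomic task is a plain function of the values computed so far, and the levels are assumed to have been derived from the dependency graph beforehand; `run_workflow`, the task names and the sample data are hypothetical and do not correspond to the node numbers of FIG. 2.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(levels, tasks, base_data):
    """Execute tasks level by level; tasks within one level run concurrently.

    `levels` is a list of lists of task names, ordered so that every task's
    inputs are produced by base data or by an earlier level.
    """
    values = dict(base_data)
    with ThreadPoolExecutor() as pool:
        for level in levels:
            # Each task receives its own snapshot, so tasks share no state.
            futures = {name: pool.submit(tasks[name], dict(values)) for name in level}
            for name, future in futures.items():
                values[name] = future.result()  # re-raises any task error
    return values


# Hypothetical atomic formulas, one per task.
TASKS = {
    "profit": lambda v: v["revenue"] - v["cost"],
    "tax": lambda v: v["revenue"] * 0.5,
    "margin": lambda v: v["profit"] / v["revenue"],
}
LEVELS = [["profit", "tax"], ["margin"]]
```

Here `profit` and `tax` have no dependency on each other and are submitted together, while `margin` waits for the level that produces `profit`.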

As shown in FIG. 3, lineage tracking specifically means that, when a formula is parsed, the dependency relationships between indexes, between indexes and reports, and between reports and indexes are stored in a lineage table of the database. After data is modified and indexes are being recalculated, these relationships are used to trace the dependencies among source indexes, indexes and reports; the dependencies are searched iteratively to generate a recalculation workflow.

The actor model consists of a state, a behavior and a mailbox. The state is the variable information of an actor object; it is managed by the actor itself, which avoids problems such as locks and memory atomicity in a concurrent environment. The behavior specifies the calculation logic of the actor, and the actor's state is changed by the messages it receives. The mailbox is the communication bridge between actors; internally it stores messages in a FIFO message queue, into which senders put messages and from which the receiver takes them.

The calculation engine module supports dynamic scaling: by adopting a microservice, stateless-service design pattern, the amount of computing resources can be adjusted dynamically according to task load. Meanwhile, the work node responsible for task distribution refreshes the machine load of each entry in the task node list on every heartbeat, and distributes tasks dynamically according to the load of the task nodes, thereby achieving load balancing.

During verification, step S4 includes the following process: a group of check formulas is configured for the indexes and reports in the index check module; after the indexes are generated, the check formulas are run against them, and if a check formula does not pass, a reminder that the index calculation task has failed is triggered.

An apparatus using a calculation engine control method for distributed index calculation includes a memory and a processor, the control method being stored in the memory in the form of a computer program and executed by the processor, when executed, implementing steps S1-S4.

Example two

In this embodiment, in step S1 the calculation front-end service module receives the index definition through a graphical interaction mode with the same expressive capability; the remainder is the same as in Example one.

In addition, it should be noted that the specific embodiments described in this specification may use different names, and the above description is only an illustration of the structure of the invention. All equivalent or simple changes made according to the structure, characteristics and principles of the invention are included in its scope of protection. Those skilled in the art may make various modifications or additions to the described embodiments, or adopt similar methods, without departing from the scope of the invention as defined in the appended claims.

Claims

1. A calculation engine control method for distributed index calculation is characterized by specifically comprising the following steps of:

2. The method for controlling a calculation engine used for distributed index calculation according to claim 1, wherein the calculation engine module adopts a formula language as a calculation task submission interface.

3. The calculation engine control method for distributed index calculation according to claim 2, wherein the calculation engine module completes the loading, retrieval, calculation and storage of basic data according to the formula language and the workflow.

4. The calculation engine control method for distributed index calculation according to claim 3, wherein the calculation engine module employs a distributed database as a storage medium of an index library.

5. The calculation engine control method for distributed index calculation according to claim 1, wherein the functions of the calculation front-end service module include workflow splitting and lineage (blood relationship) tracking.

6. The calculation engine control method for distributed index calculation according to claim 5, wherein workflow splitting specifically means that a group of index formulas is parsed, according to the syntax tree, into a plurality of calculation tasks that are assembled into a workflow with dependencies, and calculation tasks with no dependency relationship between them are calculated in a distributed manner by the calculation engine module.

7. The calculation engine control method for distributed index calculation according to claim 5, wherein lineage tracking specifically means that, when a formula is parsed, the dependency relationships between indexes, between indexes and reports, and between reports and indexes are stored in a lineage table of the database; after data is modified and indexes are being recalculated, these relationships are used to trace the dependencies among source indexes, indexes and reports, and the dependencies are searched iteratively to generate a recalculation workflow.

8. The calculation engine control method for distributed index calculation according to claim 1, wherein the calculation engine module is the execution module for concrete calculation tasks, and each calculation unit is based on the actor model of distributed computing.

9. The calculation engine control method for distributed index calculation according to claim 1, wherein during verification step S4 includes the following process: a group of check formulas is configured for the indexes and reports in the index check module; after the indexes are generated, the check formulas are run against them, and if a check formula does not pass, a reminder that the index calculation task has failed is triggered.

10. An apparatus using the calculation engine control method for distributed index calculation according to claim 1, comprising a memory and a processor, wherein the control method is stored in the memory in the form of a computer program and executed by the processor, and when executed implements steps S1-S4.