CN116541413A

CN116541413A - Method and system for optimizing FlinkSQL repeated consumption data source data

Info

Publication number: CN116541413A
Application number: CN202310593793.1A
Authority: CN
Inventors: 李琳; 尹春光; 张璐波; 从光辉
Original assignee: Tianyi Electronic Commerce Co Ltd
Current assignee: Tianyi Electronic Commerce Co Ltd
Priority date: 2023-05-24
Filing date: 2023-05-24
Publication date: 2023-08-04

Abstract

The invention discloses a method and a system for optimizing the repeated consumption of data source data by FlinkSQL, and relates to the technical field of financial payment. The method comprises the following steps: when a Flink Client submits a Flink Job, acquiring the generated StreamGraph; analyzing whether the generated StreamGraph meets the pruning conditions and, if so, performing pruning optimization on the StreamGraph to generate a new StreamGraph, and submitting the new StreamGraph to the cluster to run. The invention avoids repeated consumption of data source data when FlinkSQL runs multiple SQL statements simultaneously.

Description

Method and system for optimizing FlinkSQL repeated consumption data source data

Technical Field

The invention relates to the technical field of financial payment, in particular to a method and a system for optimizing FlinkSQL repeated consumption data source data.

Background

Flink is a distributed real-time computing framework known in the industry for low processing latency, high throughput, and exactly-once semantics. When a Flink job is submitted for execution, a StreamGraph is generated; the StreamGraph is automatically optimized into a JobGraph, which is then submitted to the JobManager to run. A Flink job executes as a directed acyclic graph topology with one or more start nodes.

When a Flink program has one Source and one Sink, the generated topology has one start node; when a Flink program has n Sources, the generated topology has n start nodes. However, if the n Sources consume the same data source and the same topic (for a Kafka data source), the data source data is consumed repeatedly: while the Flink job runs, n consumer threads in the TaskManager each consume the same data source data. In this case the load on the data source service, including memory and network, is multiplied. If several such repeated-consumption tasks run in the cluster, the stability of the data source service is seriously affected, which reduces the stability of the whole real-time computing data link and prevents the system from achieving high throughput and low processing latency.

Disclosure of Invention

In order to overcome the above problems, or at least partially solve them, the embodiments of the invention provide a method and a system for optimizing the repeated consumption of data source data by FlinkSQL, which avoid repeated consumption of data source data when FlinkSQL runs multiple SQL statements simultaneously, and thereby improve the execution efficiency of the Flink job, increase the processing throughput of the system, and reduce the processing latency.

Embodiments of the present invention are implemented as follows:

In a first aspect, an embodiment of the present invention provides a method for optimizing the repeated consumption of data source data by FlinkSQL, including the following steps:

when a Flink Client submits a Flink Job, acquiring a generated StreamGraph;

analyzing whether the generated StreamGraph meets the pruning conditions and, if so, performing pruning optimization on the StreamGraph to generate a new StreamGraph, and submitting the new StreamGraph to the cluster to run.

In order to solve the problems in the prior art, the invention dynamically prunes and merges the Source nodes of the topology graph generated by FlinkSQL. Pruning and merging optimizes the topology and reduces the load on the data source node. Because the Flink task no longer consumes the data source repeatedly, the slow data transmission and excessive number of data source connections caused by repeated consumption are avoided, the execution efficiency of the Flink job is improved, the processing throughput of the system is increased, and the processing latency is reduced.

Based on the first aspect, in some embodiments of the present invention, the method for analyzing whether the generated StreamGraph meets pruning conditions includes the following steps:

analyzing the topology in the generated StreamGraph, and judging, based on the topology, whether the pruning conditions are met.

Based on the first aspect, in some embodiments of the present invention, the method for determining whether a pruning condition is satisfied based on a topology structure includes the following steps:

judging whether the number of Source nodes in the topological structure is larger than 1, if so, judging whether the corresponding consumed data sources are the same, and if so, meeting pruning conditions; otherwise, pruning conditions are not satisfied.
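The pruning condition above can be sketched in a few lines. This is a minimal illustration on plain Python data, not Flink's API: the function name `meets_pruning_condition`, the mapping from Source name to an identity tuple, and the example Kafka values are all hypothetical. It also assumes that "the consumed data sources are the same" is satisfied as soon as at least two Source nodes share an identity, which the text does not spell out.

```python
from collections import Counter
from typing import Hashable, Mapping


def meets_pruning_condition(source_identity: Mapping[str, Hashable]) -> bool:
    """Return True when there is more than one Source node and at least
    two of them consume the same data source.

    `source_identity` maps a Source node name to a hashable description of
    what it consumes (hypothetical: a (connection, topic, group) tuple).
    """
    if len(source_identity) <= 1:
        return False
    counts = Counter(source_identity.values())
    # Assumption: one shared identity is enough to trigger pruning.
    return any(n > 1 for n in counts.values())


same = {
    "Source1": ("kafka:9092", "orders", "g1"),
    "Source2": ("kafka:9092", "orders", "g1"),
}
different = {
    "Source1": ("kafka:9092", "orders", "g1"),
    "Source2": ("kafka:9092", "refunds", "g1"),
}
print(meets_pruning_condition(same), meets_pruning_condition(different))
```

A stricter reading, in which every Source must consume the same data source, would replace the `any(...)` check with `len(counts) == 1`.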

Based on the first aspect, in some embodiments of the present invention, the method for pruning and optimizing the StreamGraph to generate a new StreamGraph and submitting the new StreamGraph to the cluster for operation includes the following steps:

traversing all Source nodes, retaining the first Source node, adding to the first Source node a StreamEdge that points to the out-degree (downstream) node of the second Source node, and deleting the StreamNode of the second Source node together with its StreamEdge;

repeating this cycle until all Source nodes have been traversed and the corresponding operations completed, thereby generating a new topology graph; then continuing to submit the newly generated Transformation set to generate a JobGraph, and submitting the JobGraph to the cluster to run.
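The traversal above can be sketched on a toy graph. The sketch below does not use Flink's StreamGraph, StreamNode or StreamEdge classes; it models the graph as a dict of Source identities plus a list of `(upstream, downstream)` edges, and the name `prune_duplicate_sources` is hypothetical. Dropping identical edges after the merge is an addition of this sketch, not something the text specifies.

```python
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[str, str]  # (upstream node, downstream node)


def prune_duplicate_sources(
    source_identity: Dict[str, Hashable], edges: List[Edge]
) -> Tuple[Dict[str, Hashable], List[Edge]]:
    """Merge Source nodes that consume the same data source.

    The first Source seen for each identity is retained. Each later
    duplicate has its outgoing edges re-attached to the retained Source
    and is then removed from the graph.
    """
    kept_for_identity: Dict[Hashable, str] = {}
    replacement: Dict[str, str] = {}
    for name, identity in source_identity.items():  # insertion order
        kept = kept_for_identity.setdefault(identity, name)
        if kept != name:
            replacement[name] = kept

    new_edges: List[Edge] = []
    for upstream, downstream in edges:
        # Sources have no incoming edges, so only the upstream end changes.
        edge = (replacement.get(upstream, upstream), downstream)
        if edge not in new_edges:
            new_edges.append(edge)

    new_sources = {
        name: identity
        for name, identity in source_identity.items()
        if name not in replacement
    }
    return new_sources, new_edges


sources = {"Source1": "orders", "Source2": "orders"}
edges = [
    ("Source1", "Calc1"), ("Calc1", "Sink1"),
    ("Source2", "Calc2"), ("Calc2", "Sink2"),
]
pruned_sources, pruned_edges = prune_duplicate_sources(sources, edges)
print(pruned_sources)
print(pruned_edges)
```

After the call, only `Source1` remains and it feeds both `Calc1` and `Calc2`, so a single consumer serves both downstream branches.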

In a second aspect, an embodiment of the present invention provides a system for optimizing the repeated consumption data source data of FlinkSQL, including an acquisition module and a pruning module, where:

the acquisition module is used for acquiring the generated StreamGraph when the Flink Client submits the Flink Job;

the pruning module is used for analyzing whether the generated StreamGraph meets the pruning conditions and, if so, performing pruning optimization on the StreamGraph to generate a new StreamGraph, and submitting the new StreamGraph to the cluster to run.

In order to solve the problems in the prior art, the system, through the cooperation of the acquisition module, the pruning module and other modules, dynamically prunes and merges the Source nodes of the topology graph generated by FlinkSQL. Pruning and merging optimizes the topology and reduces the load on the data source node. Because the Flink task no longer consumes the data source repeatedly, the slow data transmission and excessive number of data source connections caused by repeated consumption are avoided, the execution efficiency of the Flink job is improved, the processing throughput of the system is increased, and the processing latency is reduced.

In a third aspect, embodiments of the present application provide an electronic device comprising a memory for storing one or more programs, and a processor. When the one or more programs are executed by the processor, the method of any one of the first aspect described above is implemented.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as in any of the first aspects described above.

The embodiment of the invention has at least the following advantages or beneficial effects:

The embodiments of the invention provide a method and a system for optimizing the repeated consumption of data source data by FlinkSQL, which dynamically prune and merge the Source nodes of the topology graph generated by FlinkSQL. Pruning and merging optimizes the topology and reduces the load on the data source node. Because the Flink task no longer consumes the data source repeatedly, the slow data transmission and excessive number of data source connections caused by repeated consumption are avoided, the execution efficiency of the Flink job is improved, the processing throughput of the system is increased, and the processing latency is reduced.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for optimizing the repeated consumption of data source data by FlinkSQL in accordance with an embodiment of the invention;

FIG. 2 is a schematic diagram of a topology of an acquired StreamGraph according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a topology structure without pruning operation according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of pruning operation according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a cyclic reciprocating pruning operation according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a topology structure generated after pruning operation is completed in an embodiment of the present invention;

FIG. 7 is a schematic block diagram of a system for optimizing the repeated consumption of data source data by FlinkSQL in accordance with an embodiment of the invention;

FIG. 8 is a block diagram of an electronic device according to an embodiment of the present invention.

Reference numerals illustrate: 100. an acquisition module; 200. pruning module; 101. a memory; 102. a processor; 103. a communication interface.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In the description of the embodiments of the present invention, "plurality" means at least 2.

Examples:

As shown in FIGS. 1-6, in a first aspect, an embodiment of the present invention provides a method for optimizing the repeated consumption of data source data by FlinkSQL, including the following steps:

S1, when a Flink Client submits a Flink Job, acquiring the generated StreamGraph;

S2, analyzing whether the generated StreamGraph meets the pruning conditions and, if so, performing pruning optimization on the StreamGraph to generate a new StreamGraph, and submitting the new StreamGraph to the cluster to run.
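The two steps can be read as a small hook placed between graph generation and cluster submission. The sketch below shows only that control flow, with the condition check, the pruning and the submission passed in as callables. The name `submit_with_pruning` and the toy graph representation (a list of data source names) are assumptions for illustration and do not correspond to Flink's client API.

```python
from typing import Callable, List, TypeVar

G = TypeVar("G")


def submit_with_pruning(
    stream_graph: G,
    meets_pruning_condition: Callable[[G], bool],
    prune: Callable[[G], G],
    submit: Callable[[G], None],
) -> G:
    """Take the generated graph (step 1), prune it when the pruning
    condition holds, and submit whichever graph results (step 2)."""
    if meets_pruning_condition(stream_graph):
        stream_graph = prune(stream_graph)
    submit(stream_graph)
    return stream_graph


# Toy stand-ins: the "graph" is just the list of data sources it reads.
submitted: List[List[str]] = []
result = submit_with_pruning(
    ["orders", "orders", "refunds"],
    meets_pruning_condition=lambda g: len(g) > 1 and len(set(g)) < len(g),
    prune=lambda g: list(dict.fromkeys(g)),  # keep first occurrence of each
    submit=submitted.append,
)
print(result)
```

When the condition is not met, the original graph is submitted unchanged, which matches the "no operation" branch described in the embodiments.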

Further, the method comprises the steps of: analyzing the topology in the generated StreamGraph, and judging, based on the topology, whether the pruning conditions are met.

Further, the method comprises the steps of: judging whether the number of Source nodes in the topological structure is larger than 1, if so, judging whether the corresponding consumed data sources are the same, and if so, meeting pruning conditions; otherwise, pruning conditions are not satisfied.

Further, the method comprises the steps of: traversing all Source nodes, retaining the first Source node, adding to the first Source node a StreamEdge that points to the out-degree (downstream) node of the second Source node, and deleting the StreamNode of the second Source node together with its StreamEdge; repeating this cycle until all Source nodes have been traversed and the corresponding operations completed, thereby generating a new topology graph; then continuing to submit the newly generated Transformation set to generate a JobGraph, and submitting the JobGraph to the cluster to run.

In some embodiments of the present invention, the topology in the StreamGraph is analyzed. As shown in FIG. 2, if the number of Source nodes is greater than 1 and the consumed data sources are the same, the pruning operation is started; otherwise, as shown in FIG. 3, no operation is performed. The specific pruning operation is shown in FIG. 4: all Source nodes are traversed, the first Source node is retained, a StreamEdge pointing to the out-degree (downstream) node of the second Source node is added to the first Source node, and the StreamNode of the second Source node and its StreamEdge are deleted. This cycle is repeated, as shown in FIG. 5, until all Source nodes have been traversed and the corresponding operations completed; the resulting topology graph is shown in FIG. 6. The newly generated Transformation set is then submitted to generate a JobGraph, which is submitted to the cluster to run.

For ease of understanding, further examples are illustrated herein:

As shown in FIG. 3, the developed Flink job has 4 independent topology graphs containing 4 Source nodes, Source = {Source1, Source2, Source3, Source4}, where Source1 = Source2, Source1 = Source3, and Source1 = Source4;

Source1 = Source2 means that the two Source nodes have the same consumer attributes, i.e., the data source connection information is the same, and the consumer's consumption topic, consumer group, etc. are the same;
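One way to make this equality concrete is a value object whose equality and hash are derived from the consumer attributes. The sketch below is an assumption-laden illustration: the class name `ConsumerAttributes`, its three fields and the example values are hypothetical, and a real implementation may need further attributes covered by the "etc." above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerAttributes:
    """Hypothetical identity of a Kafka-style Source node."""

    connection: str  # data source connection information
    topic: str       # consumption topic
    group_id: str    # consumer group


source1 = ConsumerAttributes("kafka-a:9092", "payments", "job-42")
source2 = ConsumerAttributes("kafka-a:9092", "payments", "job-42")
other = ConsumerAttributes("kafka-a:9092", "payments", "other-group")

# Frozen dataclasses compare field by field and are hashable, so equal
# Sources collapse to one entry when used as set members or dict keys.
print(source1 == source2, source1 == other, len({source1, source2, other}))
```

Because the object is hashable, it can serve directly as the key that groups Source nodes consuming the same data source.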

As shown in FIG. 4, all Source nodes are traversed; a StreamEdge pointing to the out-degree (downstream) node of the Source2 node is added to the Source1 node, and then the StreamNode and the StreamEdge of the Source2 node are deleted;

As shown in FIG. 5, the traversal continues and the Source3 node is found; a StreamEdge pointing to the out-degree (downstream) node of the Source3 node is added to the Source1 node, and the StreamNode and the StreamEdge of the Source3 node are deleted;

The traversal continues in this way over all Source nodes, and the topology graphs are finally merged into one complete directed acyclic graph, as shown in FIG. 6.
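The four-Source example can be traced step by step. The sketch below yields the graph after each duplicate Source is folded into Source1, which roughly corresponds to the intermediate states of FIGS. 4 and 5 and the final state of FIG. 6. The function name `merge_trace` and the downstream node names `Sink1` to `Sink4` are hypothetical, since the figures are not reproduced here, and the sketch assumes every listed Source consumes the same data source.

```python
from typing import Iterator, List, Tuple

Edge = Tuple[str, str]  # (upstream node, downstream node)


def merge_trace(
    sources: List[str], edges: List[Edge]
) -> Iterator[Tuple[List[str], List[Edge]]]:
    """Yield (remaining sources, edges) after each duplicate Source is
    merged into the first one.

    Assumes every entry in `sources` consumes the same data source.
    """
    kept = sources[0]
    remaining = list(sources)
    current = list(edges)
    for duplicate in sources[1:]:
        # Re-attach the duplicate's downstream nodes to the kept Source.
        current = [(kept if up == duplicate else up, down) for up, down in current]
        # Then delete the duplicate Source node itself.
        remaining.remove(duplicate)
        yield list(remaining), list(current)


sources = ["Source1", "Source2", "Source3", "Source4"]
edges = [(f"Source{i}", f"Sink{i}") for i in range(1, 5)]
steps = list(merge_trace(sources, edges))
for remaining, current in steps:
    print(remaining, current)
```

Three merge steps are produced for four Sources; after the last one, Source1 is the only start node and has an edge to each of the four downstream nodes.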

As shown in FIG. 7, in a second aspect, an embodiment of the present invention provides a system for optimizing the repeated consumption of data source data by FlinkSQL, including an acquisition module 100 and a pruning module 200, where:

the acquisition module 100 is configured to acquire the generated StreamGraph when a Flink Client submits a Flink Job;

the pruning module 200 is configured to analyze whether the generated StreamGraph meets pruning conditions, and if so, perform pruning optimization on the StreamGraph to generate a new StreamGraph, and submit the new StreamGraph to the cluster for operation.

In order to solve the problems in the prior art, the system, through the cooperation of the acquisition module 100, the pruning module 200 and other modules, dynamically prunes and merges the Source nodes of the topology graph generated by FlinkSQL. Pruning and merging optimizes the topology and reduces the load on the data source node. Because the Flink task no longer consumes the data source repeatedly, the slow data transmission and excessive number of data source connections caused by repeated consumption are avoided, the execution efficiency of the Flink job is improved, the processing throughput of the system is increased, and the processing latency is reduced.

As shown in FIG. 8, in a third aspect, an embodiment of the present application provides an electronic device, which includes a memory 101 for storing one or more programs, and a processor 102. When the one or more programs are executed by the processor 102, the method of any one of the first aspect described above is implemented.

The electronic device further includes a communication interface 103. The memory 101, the processor 102 and the communication interface 103 are electrically connected to each other, directly or indirectly, to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 101 may be used to store software programs and modules, and the processor 102 executes the software programs and modules stored in the memory 101 to perform various functional applications and data processing. The communication interface 103 may be used for communication of signaling or data with other node devices.

The memory 101 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), etc.

The processor 102 may be an integrated circuit chip with signal processing capabilities. The processor 102 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other manners. The above-described method and system embodiments are merely illustrative; for example, the flow charts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, systems and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by the processor 102, implements the method of any one of the first aspect described above. The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

The above is only a preferred embodiment of the present invention and is not intended to limit the present invention; for those skilled in the art, various modifications and variations can be made to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. A method for optimizing the repeated consumption of data source data by FlinkSQL, comprising the steps of:

when a Flink Client submits a Flink Job, acquiring a generated StreamGraph;

2. The method for optimizing the repeated consumption of data source data by FlinkSQL according to claim 1, wherein the method for analyzing whether the generated StreamGraph satisfies pruning conditions comprises the following steps:

3. The method for optimizing the repeated consumption of data source data by FlinkSQL according to claim 2, wherein the topology-based method for judging whether the pruning condition is satisfied comprises the following steps:

4. A method for optimizing the repeated consumption of data source data by FlinkSQL according to claim 3, wherein said method for pruning the StreamGraph to generate a new StreamGraph and submitting the new StreamGraph to the cluster for operation comprises the following steps:

5. The system for optimizing the FlinkSQL repeated consumption data source data is characterized by comprising an acquisition module and a pruning module, wherein:

6. An electronic device, comprising:

a memory for storing one or more programs;

a processor;

the method of any of claims 1-4 is implemented when the one or more programs are executed by the processor.

7. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method according to any of claims 1-4.