CN112506737A - Method for monitoring health condition of flink real-time operation through full link - Google Patents
Method for monitoring health condition of flink real-time operation through full link
- Publication number
- CN112506737A (application CN202011391805.5A)
- Authority
- CN
- China
- Prior art keywords
- task
- operator
- flink
- dag
- acquiring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a method for full-link monitoring of the health of flink real-time jobs, which comprises the following steps: 1) start the Prometheus and pushgateway services and configure flink-conf.yaml; 2) write the flink task, configure the LatencyMarker-related parameters, and generate the JSON describing the overall DAG graph from the JobGraph; 3) the foreground page parses the JSON and displays the task's DAG graph; 4) submit the task to the yarn cluster; 5) the server side acquires the task's metrics and pushes them to the front end. With this method, the page quickly locates which subtask of which operator is back-pressured and to what degree; data skew can be judged from the data volumes of each operator's subtasks shown on the page; the amount of data each subtask of each operator processes in the current time unit can be checked; and the average and maximum latency of each operator's subtask data, measured from the source to that operator, can be checked.
Description
Technical Field
The invention relates to a method for full-link monitoring of the health of flink real-time jobs.
Background
After a flink real-time task is started, uneven data distribution during data flow may cause task back pressure, data skew and data latency, so that downstream services receive data late or lose data; in that case it is necessary to analyze which subtask of which sub-DAG is responsible. Likewise, a coding mistake by a data developer can corrupt downstream business data, and if the task is complex the user cannot tell at which step the error occurred; a task analysis system is then needed to find out which subtask of which sub-DAG shows abnormally reduced data volume or other anomalies.
At present, neither flink itself nor the surrounding tools can analyze, operator by operator, the anomalies inside a chained sub-DAG graph, so the anomalies of each step of a real-time task cannot be monitored, problems cannot be found and repaired quickly, and data anomalies result.
Disclosure of Invention
In view of the above deficiencies of the prior art, the present invention provides a method for full-link monitoring of the health of flink real-time jobs.
A method for full-link monitoring of the health of flink real-time jobs comprises the following steps:
1) start the Prometheus and pushgateway services and configure flink-conf.yaml, so that all monitoring metrics of the flink task's DAG graph are collected;
2) write the flink task, configure the LatencyMarker-related parameters, and generate the JSON describing the overall DAG graph from the JobGraph;
3) the foreground page parses the JSON and displays the task's DAG graph;
4) submit the task to the yarn cluster;
5) the server side acquires the task's metrics and pushes them to the front end; the metrics required by the task are polled every 2 seconds and pushed into the constructed DAG graph, so that the running state of the task can be observed.
The step 1) comprises the following steps:
1) install, configure and start the pushgateway and Prometheus services;
2) configure the flink-conf.yaml file and point the metrics reporter at the pushgateway service address.
The step 2) comprises the following steps:
1) write the flink business processing code;
2) obtain the JobGraph description of the task by calling the written program, from which the set encapsulating the task's DAG graph can be obtained;
3) define a LatencyMarkerInfo class to encapsulate the data in the collection;
4) parse the set of the encapsulated DAG graph in the JobGraph: first acquire the set encapsulating all sub-DAG graphs; then acquire the id and name of each sub-DAG graph; then acquire the set of operator ids of the other sub-DAG graphs feeding its upstream and downstream inputs; then iterate over the ids and names of the operators chained together inside each sub-DAG graph; finally package this information into a LatencyMarkerInfo instance and convert it to JSON.
The step 3) comprises the following steps:
1) acquire the JSON data converted from the DAG in the JobGraph;
2) parse the JSON and display the task's DAG graph on the page: all outermost DAG graphs are traversed first.
The step 4) comprises the following steps:
1) submit the task to the yarn cluster and let it run normally in the cluster;
2) query whether Prometheus can acquire the task's metrics normally; if so, the task started normally; if not, check whether any of the other steps went wrong.
The step 5) comprises the following steps:
1) acquire the average latency from each operator's subtask to the source through the Prometheus metrics;
2) acquire the maximum latency from each operator's subtask to the source through the Prometheus metrics;
3) acquire the amount of data flowing into and out of each operator's subtask through the Prometheus metrics;
4) acquire the amount of data flowing into and out of each operator's subtask per unit time through the Prometheus metrics;
5) acquire the back-pressure degree of each operator's subtask through the Prometheus metrics, and display different colors on the page according to the returned values;
6) once the front-end page receives the captured metric data, it is displayed on the page in real time.
The beneficial effects of the invention are as follows: the invention parses the flink JobGraph to generate the DAG graph of the whole task, including the sub-DAG graphs that are not chained together, and combines this with flink's LatencyMarker mechanism, which periodically injects latency markers into the stream, thereby achieving full-link, real-time monitoring of the flink job and presenting it visually at the front end. The monitored metrics comprise the average latency from each operator's subtask to the source, the maximum latency from each operator's subtask to the source, the amount of data flowing into and out of each operator's subtask per unit time, and the back-pressure degree of each operator's subtask.
The method can quickly locate, on the page, which subtask of which operator is back-pressured and to what degree; judge whether data skew exists from the data volumes of each operator's subtasks shown on the page; check the amount of data each subtask of each operator processes in the current time unit; and check the average and maximum latency of each operator's subtask data from the source to that operator.
Drawings
A more complete understanding of the present invention, and the attendant advantages and features thereof, will be more readily understood by reference to the following detailed description when considered in conjunction with the accompanying drawings wherein:
fig. 1 is a schematic diagram of the effect of the present invention.
Detailed Description
In order to make the contents of the present invention more comprehensible, the present invention is further described below with reference to the accompanying drawings. The invention is of course not limited to this particular embodiment, and general alternatives known to those skilled in the art are also covered by the scope of the invention.
Examples
As shown in fig. 1, the present invention provides a method for full-link monitoring of the health of flink real-time jobs, which comprises the following steps:
1) start the Prometheus and pushgateway services and configure flink-conf.yaml, so that all monitoring metrics of the flink task's DAG graph are collected;
2) write the flink task, configure the LatencyMarker-related parameters, and generate the JSON describing the overall DAG graph from the JobGraph;
3) the foreground page parses the JSON and displays the task's DAG graph;
4) submit the task to the yarn cluster;
5) the server side acquires the task's metrics and pushes them to the front end; the metrics required by the task are polled every 2 seconds and pushed into the constructed DAG graph, so that the running state of the task can be observed.
The step 1) comprises the following steps:
1) install, configure and start the pushgateway and Prometheus services;
2) configure the flink-conf.yaml file and point the metrics reporter at the pushgateway service address, as sketched below.
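As an illustration only, a flink-conf.yaml fragment for this step could look like the following; the host, port and job name are placeholder values, and the reporter keys are those of flink's bundled PrometheusPushGatewayReporter:

```yaml
# Report all flink metrics to the Prometheus pushgateway (illustrative values).
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.example.com   # pushgateway service address (placeholder)
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: flink-monitoring
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: false
metrics.reporter.promgateway.interval: 2 SECONDS
# Emit LatencyMarkers every second so the latency metrics used in step 5) exist at runtime.
metrics.latency.interval: 1000
```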
The step 2) comprises the following steps:
1) write the flink business processing code;
2) the written program obtains the JobGraph description of the task when called, from which the set encapsulating the task's DAG graph can be obtained;
3) define a LatencyMarkerInfo class to encapsulate the data in the collection;
4) parse the set of the encapsulated DAG graph in the JobGraph: first acquire the set encapsulating all sub-DAG graphs; then acquire the id and name of each sub-DAG graph; then acquire the set of operator ids of the other sub-DAG graphs feeding its upstream and downstream inputs; then iterate over the ids and names of the operators chained together inside each sub-DAG graph; finally package this information into a LatencyMarkerInfo instance and convert it to JSON (see the sketch after this list).
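A minimal sketch of this step, assuming a hypothetical LatencyMarkerInfo POJO and Jackson for the JSON conversion; the field names and class layout are illustrative and not necessarily those used by the invention:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.runtime.jobgraph.JobEdge;
import org.apache.flink.runtime.jobgraph.JobGraph;
import org.apache.flink.runtime.jobgraph.JobVertex;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.ArrayList;
import java.util.List;

public class DagJsonBuilder {

    /** Hypothetical POJO describing one sub-DAG graph (one chained JobVertex) of the task. */
    public static class LatencyMarkerInfo {
        public String vertexId;                           // id of the sub-DAG graph
        public String vertexName;                         // name of the sub-DAG graph; contains the chained operator names
        public List<String> inputIds = new ArrayList<>(); // ids of the upstream sub-DAG graphs
    }

    public static String buildDagJson(StreamExecutionEnvironment env) throws Exception {
        // Emit LatencyMarkers every 1000 ms so per-operator latency metrics exist at runtime (step 2 of the method).
        env.getConfig().setLatencyTrackingInterval(1000L);

        // Obtain the JobGraph description of the task; its vertices are the sub-DAG graphs after operator chaining.
        // Note: in some flink versions getStreamGraph() consumes the registered transformations,
        // so build this JSON before executing the job.
        JobGraph jobGraph = env.getStreamGraph().getJobGraph();

        List<LatencyMarkerInfo> dag = new ArrayList<>();
        for (JobVertex vertex : jobGraph.getVertices()) {
            LatencyMarkerInfo info = new LatencyMarkerInfo();
            info.vertexId = vertex.getID().toString();
            info.vertexName = vertex.getName();
            // The upstream sub-DAG graphs are reachable through the vertex's input edges.
            for (JobEdge input : vertex.getInputs()) {
                info.inputIds.add(input.getSource().getProducer().getID().toString());
            }
            dag.add(info);
        }
        // Convert the collected description to JSON for the front-end page (step 3 of the method).
        return new ObjectMapper().writeValueAsString(dag);
    }
}
```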
The step 3) comprises the following steps:
1) acquire the JSON data converted from the DAG in the JobGraph;
2) parse the JSON and display the task's DAG graph on the page: all outermost DAG graphs are traversed first (an illustrative sample of the JSON follows this list).
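Purely for illustration, the JSON produced by a sketch such as the one above might look as follows (ids and names are hypothetical); the page traverses the array, draws one node per sub-DAG graph, and derives the edges from inputIds:

```json
[
  { "vertexId": "a1b2", "vertexName": "Source: Kafka",  "inputIds": [] },
  { "vertexId": "c3d4", "vertexName": "Map -> Filter",  "inputIds": ["a1b2"] },
  { "vertexId": "e5f6", "vertexName": "Sink: MySQL",    "inputIds": ["c3d4"] }
]
```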
The step 4) comprises the following steps:
1) submit the task to the yarn cluster and let it run normally in the cluster;
2) query whether Prometheus can acquire the task's metrics normally; if so, the task started normally; if not, check whether any of the other steps went wrong.
The step 5) comprises the following steps:
1) acquire the average latency from each operator's subtask to the source through the Prometheus metrics;
2) acquire the maximum latency from each operator's subtask to the source through the Prometheus metrics;
3) acquire the amount of data flowing into and out of each operator's subtask through the Prometheus metrics;
4) acquire the amount of data flowing into and out of each operator's subtask per unit time through the Prometheus metrics;
5) acquire the back-pressure degree of each operator's subtask through the Prometheus metrics, and display different colors on the page according to the returned values;
6) once the front-end page receives the captured metric data, it is displayed on the page in real time; a server-side polling sketch follows this list.
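The following is a minimal server-side sketch of the polling loop, using Java 11's HttpClient against Prometheus's standard /api/v1/query endpoint. The metric names follow flink's usual Prometheus naming (numRecordsIn, numRecordsOutPerSecond, isBackPressured, and the latency histogram), but the exact exported names and labels depend on the flink version and metric scope configuration, so they should be treated as assumptions; pushToFrontend is a placeholder for the real push channel (e.g. a WebSocket):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MetricsPoller {

    private static final String PROMETHEUS = "http://prometheus.example.com:9090"; // illustrative address
    private final HttpClient client = HttpClient.newHttpClient();

    /** Runs one instant query against Prometheus and returns the raw JSON response body. */
    private String query(String promQl) throws Exception {
        String url = PROMETHEUS + "/api/v1/query?query="
                + URLEncoder.encode(promQl, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Polls the per-operator metrics every 2 seconds and pushes them to the front-end DAG. */
    public void start(String jobName) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Average latency from the source to each operator subtask (metric name assumed).
                String avgLatency = query("avg_over_time(flink_taskmanager_job_latency_source_id_operator_id_operator_subtask_index_latency{job_name=\"" + jobName + "\"}[1m])");
                // Records flowing into and out of each operator subtask, total and per second.
                String recordsIn = query("flink_taskmanager_job_task_operator_numRecordsIn{job_name=\"" + jobName + "\"}");
                String recordsOutRate = query("flink_taskmanager_job_task_operator_numRecordsOutPerSecond{job_name=\"" + jobName + "\"}");
                // Back-pressure flag per subtask; the page maps the returned value to a color.
                String backPressure = query("flink_taskmanager_job_task_isBackPressured{job_name=\"" + jobName + "\"}");
                pushToFrontend(avgLatency, recordsIn, recordsOutRate, backPressure);
            } catch (Exception e) {
                e.printStackTrace(); // keep polling even if one round fails
            }
        }, 0, 2, TimeUnit.SECONDS);
    }

    /** Placeholder: the real system would parse the responses and push them to the DAG page. */
    private void pushToFrontend(String... payloads) {
        for (String p : payloads) {
            System.out.println(p);
        }
    }

    public static void main(String[] args) {
        new MetricsPoller().start("my-flink-job");
    }
}
```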
The flink JobGraph is parsed to generate the DAG graph of the whole task, including the sub-DAG graphs that are not chained together, and this is combined with flink's LatencyMarker mechanism, which periodically injects latency markers into the stream, thereby achieving full-link, real-time monitoring of the flink job and presenting it visually at the front end. The monitored metrics comprise the average latency from each operator's subtask to the source, the maximum latency from each operator's subtask to the source, the amount of data flowing into and out of each operator's subtask per unit time, and the back-pressure degree of each operator's subtask.
Firstly, which subtask of which operator is back-pressured, and to what degree, can be quickly located on the page; whether data skew exists is judged from the data volumes of each operator's subtasks on the page; the amount of data each subtask of each operator processes in the current time unit can be checked; and the average and maximum latency of each operator's subtask data from the source to that operator can be checked.
Claims (6)
1. A method for full-link monitoring of the health of flink real-time jobs, characterized by comprising the following steps:
1) starting the Prometheus and pushgateway services and configuring flink-conf.yaml; collecting all monitoring metrics in the flink task's DAG graph;
2) writing the flink task, configuring the LatencyMarker-related parameters, and generating the JSON describing the overall DAG graph from the JobGraph;
3) parsing the JSON and displaying the task's DAG graph on the foreground page;
4) submitting the task to a yarn cluster;
5) acquiring, by the server side, the task's metrics and pushing them to the front end; polling the metrics required by the task every 2 seconds and pushing them into the constructed DAG graph, so that the running state of the task can be observed.
2. The method for full-link monitoring of flink real-time job health as claimed in claim 1, wherein said step 1) comprises the steps of:
1) installing, configuring and starting the pushgateway and Prometheus services;
2) configuring the flink-conf.yaml file and pointing the metrics reporter at the pushgateway service address.
3. The method for full-link monitoring of flink real-time job health as claimed in claim 1, wherein said step 2) comprises the steps of:
1) writing the flink business processing code;
2) obtaining the JobGraph description of the task by calling the written program, from which the set encapsulating the task's DAG graph can be obtained;
3) defining a LatencyMarkerInfo class to encapsulate the data in the collection;
4) parsing the set of the encapsulated DAG graph in the JobGraph: firstly acquiring the set encapsulating all sub-DAG graphs; then acquiring the id and name of each sub-DAG graph; then acquiring the set of operator ids of the other sub-DAG graphs feeding its upstream and downstream inputs; then iterating over the ids and names of the operators chained together inside each sub-DAG graph; finally packaging the information into a LatencyMarkerInfo instance and converting it to JSON.
4. The method for full-link monitoring of flink real-time job health as claimed in claim 1, wherein said step 3) comprises the steps of:
1) acquiring the JSON data converted from the DAG in the JobGraph;
2) parsing the JSON and displaying the task's DAG graph on the page: all outermost DAG graphs are traversed first.
5. The method for full-link monitoring of flink real-time job health as claimed in claim 1, wherein said step 4) comprises the steps of:
1) submitting the task to a yarn cluster and running it normally in the cluster;
2) querying whether Prometheus can acquire the task's metrics normally; if so, the task started normally; if not, checking whether any of the other steps went wrong.
6. The method for full-link monitoring of flink real-time job health as claimed in claim 1, wherein said step 5) comprises the steps of:
1) acquiring the average latency from each operator's subtask to the source through the Prometheus metrics;
2) acquiring the maximum latency from each operator's subtask to the source through the Prometheus metrics;
3) acquiring the amount of data flowing into and out of each operator's subtask through the Prometheus metrics;
4) acquiring the amount of data flowing into and out of each operator's subtask per unit time through the Prometheus metrics;
5) acquiring the back-pressure degree of each operator's subtask through the Prometheus metrics, and displaying different colors on the page according to the returned values;
6) displaying the captured metric data on the page in real time once the front-end page receives it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011391805.5A CN112506737A (en) | 2020-12-02 | 2020-12-02 | Method for monitoring health condition of flink real-time operation through full link |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011391805.5A CN112506737A (en) | 2020-12-02 | 2020-12-02 | Method for monitoring health condition of flink real-time operation through full link |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112506737A true CN112506737A (en) | 2021-03-16 |
Family
ID=74968542
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011391805.5A Pending CN112506737A (en) | 2020-12-02 | 2020-12-02 | Method for monitoring health condition of flink real-time operation through full link |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112506737A (en) |
- 2020-12-02 CN CN202011391805.5A patent/CN112506737A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110704474A (en) * | 2019-09-24 | 2020-01-17 | 杭州玳数科技有限公司 | Real-time SQL extension processing method and device based on Flink |
CN111782371A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Stream type computing method and device based on DAG interaction |
Non-Patent Citations (2)
Title |
---|
寰宇001: "Using Prometheus to Monitor Flink", CSDN *
浪尖聊大数据: "Flink in Monitoring Systems: Practice and Application", CSDN *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107729214B (en) | Visual distributed system real-time monitoring operation and maintenance method and device | |
US11657309B2 (en) | Behavior analysis and visualization for a computer infrastructure | |
US10732618B2 (en) | Machine health monitoring, failure detection and prediction using non-parametric data | |
JP5267736B2 (en) | Fault detection apparatus, fault detection method, and program recording medium | |
US20090112932A1 (en) | Visualizing key performance indicators for model-based applications | |
EP3148116A1 (en) | Information system fault scenario information collection method and system | |
US20140282421A1 (en) | Distributed software validation | |
US20140298093A1 (en) | User operation history for web application diagnostics | |
CN107896172B (en) | Monitoring fault processing method and device, storage medium and electronic equipment | |
CN107766208B (en) | Method, system and device for monitoring business system | |
CN104991853A (en) | Method and apparatus for outputting early warning information | |
CN100549975C (en) | Computer maintenance support system and analysis server | |
CN114024834B (en) | Fault positioning method and device, electronic equipment and readable storage medium | |
CN110765189A (en) | Exception management method and system for Internet products | |
CN104407959A (en) | Application based monitoring method and monitoring device | |
CN108897669B (en) | Application monitoring method and device | |
CN108665237B (en) | Method for establishing automatic inspection model and positioning abnormity based on business system | |
US20190384691A1 (en) | Methods for providing an enterprise synthetic monitoring framework | |
CN112506737A (en) | Method for monitoring health condition of flink real-time operation through full link | |
CN117251353A (en) | Monitoring method, system and platform for civil aviation weak current system | |
CN112749140A (en) | MySQL slow SQL log real-time collection and optimization method and system | |
US10055277B1 (en) | System, method, and computer program for performing health checks on a system including a plurality of heterogeneous system components | |
CN106598770B (en) | Native layer exception reporting processing method and device in Android system | |
KR20030056301A (en) | System hindrance integration management method | |
CN113553272A (en) | Interface abnormity monitoring method, device, medium and computer program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 311121 room 102-1 / F, room 102-2 / F, building 6, Haizhi center, 2301 yuhangtang Road, Cangqian street, Yuhang District, Hangzhou, Zhejiang Province Applicant after: HANGZHOU DAISHU TECHNOLOGY Co.,Ltd. Address before: 310030 8F, building 2, Hangzhou Internet innovation and entrepreneurship Park, 176 Zixia street, Xihu District, Hangzhou City, Zhejiang Province Applicant before: HANGZHOU DAISHU TECHNOLOGY Co.,Ltd. |
|
CB02 | Change of applicant information | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20210316 |
|
RJ01 | Rejection of invention patent application after publication |