CN112506737A - Method for monitoring health condition of flink real-time operation through full link - Google Patents
Method for monitoring health condition of flink real-time operation through full link
- Publication number
- CN112506737A (application CN202011391805.5A)
- Authority
- CN
- China
- Prior art keywords
- task
- operator
- flink
- dag
- acquiring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a method for full-link monitoring of the health of flink real-time jobs, which comprises the following steps: 1) start the Prometheus and pushgateway services and configure flink-conf.yaml; 2) write the flink task, configure the LatencyMarker-related parameters, and generate the JSON describing the overall DAG graph from the JobGraph; 3) the foreground page parses the JSON and displays the task's DAG graph; 4) submit the task to the yarn cluster; 5) the server side acquires the task's metrics and pushes them to the front end. With this method, the page quickly locates which subtask of which operator is back-pressured and to what degree; data skew can be judged from the data volumes of each operator's subtasks shown on the page; the amount of data each subtask of each operator processes in the current time unit can be checked; and the average and maximum latency of each operator's subtask data, measured from the source to that operator, can be checked.
Description
Technical Field
The invention relates to a method for full-link monitoring of the health of flink real-time jobs.
Background
After a flink real-time task is started, uneven data distribution during data flow may cause task back pressure, data skew and data latency, so that downstream services receive data late or lose data; in that case it is necessary to analyze which subtask of which sub-DAG is responsible. Likewise, a coding mistake by a data developer can corrupt downstream business data, and if the task is complex the user cannot tell at which step the error occurred; a task analysis system is then needed to find out which subtask of which sub-DAG shows abnormally reduced data volume or other anomalies.
At present, neither flink itself nor the surrounding tools can analyze, operator by operator, the anomalies inside a chained sub-DAG graph, so the anomalies of each step of a real-time task cannot be monitored, problems cannot be found and repaired quickly, and data anomalies result.
Disclosure of Invention
In view of the above deficiencies of the prior art, the present invention provides a method for full-link monitoring of the health of flink real-time jobs.
A method for full-link monitoring of the health of flink real-time jobs comprises the following steps:
1) start the Prometheus and pushgateway services and configure flink-conf.yaml, so that all monitoring metrics of the flink task's DAG graph are collected;
2) write the flink task, configure the LatencyMarker-related parameters, and generate the JSON describing the overall DAG graph from the JobGraph;
3) the foreground page parses the JSON and displays the task's DAG graph;
4) submit the task to the yarn cluster;
5) the server side acquires the task's metrics and pushes them to the front end; the metrics required by the task are polled every 2 seconds and pushed into the constructed DAG graph, so that the running state of the task can be observed.
The step 1) comprises the following steps:
1) install, configure and start the pushgateway and Prometheus services;
2) configure the flink-conf.yaml file and point the metrics reporter at the pushgateway service address.
The step 2) comprises the following steps:
1) write the flink business processing code;
2) obtain the JobGraph description of the task by calling the written program, from which the set encapsulating the task's DAG graph can be obtained;
3) define a LatencyMarkerInfo class to encapsulate the data in the collection;
4) parse the set of the encapsulated DAG graph in the JobGraph: first acquire the set encapsulating all sub-DAG graphs; then acquire the id and name of each sub-DAG graph; then acquire the set of operator ids of the other sub-DAG graphs feeding its upstream and downstream inputs; then iterate over the ids and names of the operators chained together inside each sub-DAG graph; finally package this information into a LatencyMarkerInfo instance and convert it to JSON.
The step 3) comprises the following steps:
1) acquire the JSON data converted from the DAG in the JobGraph;
2) parse the JSON and display the task's DAG graph on the page: all outermost DAG graphs are traversed first.
The step 4) comprises the following steps:
1) submit the task to the yarn cluster and let it run normally in the cluster;
2) query whether Prometheus can acquire the task's metrics normally; if so, the task started normally; if not, check whether any of the other steps went wrong.
The step 5) comprises the following steps:
1) acquire the average latency from each operator's subtask to the source through the Prometheus metrics;
2) acquire the maximum latency from each operator's subtask to the source through the Prometheus metrics;
3) acquire the amount of data flowing into and out of each operator's subtask through the Prometheus metrics;
4) acquire the amount of data flowing into and out of each operator's subtask per unit time through the Prometheus metrics;
5) acquire the back-pressure degree of each operator's subtask through the Prometheus metrics, and display different colors on the page according to the returned values;
6) once the front-end page receives the captured metric data, it is displayed on the page in real time.
The beneficial effects of the invention are as follows: the invention parses the flink JobGraph to generate the DAG graph of the whole task, including the sub-DAG graphs that are not chained together, and combines this with flink's LatencyMarker mechanism, which periodically injects latency markers into the stream, thereby achieving full-link, real-time monitoring of the flink job and presenting it visually at the front end. The monitored metrics comprise the average latency from each operator's subtask to the source, the maximum latency from each operator's subtask to the source, the amount of data flowing into and out of each operator's subtask per unit time, and the back-pressure degree of each operator's subtask.
The method can quickly locate, on the page, which subtask of which operator is back-pressured and to what degree; judge whether data skew exists from the data volumes of each operator's subtasks shown on the page; check the amount of data each subtask of each operator processes in the current time unit; and check the average and maximum latency of each operator's subtask data from the source to that operator.
Drawings
A more complete understanding of the present invention, and the attendant advantages and features thereof, will be more readily understood by reference to the following detailed description when considered in conjunction with the accompanying drawings wherein:
fig. 1 is a schematic diagram of the effect of the present invention.
Detailed Description
In order to make the contents of the present invention more comprehensible, the present invention is further described below with reference to the accompanying drawings. The invention is of course not limited to this particular embodiment, and general alternatives known to those skilled in the art are also covered by the scope of the invention.
Examples
As shown in fig. 1, the present invention provides a method for full-link monitoring of the health of flink real-time jobs, which comprises the following steps:
1) start the Prometheus and pushgateway services and configure flink-conf.yaml, so that all monitoring metrics of the flink task's DAG graph are collected;
2) write the flink task, configure the LatencyMarker-related parameters, and generate the JSON describing the overall DAG graph from the JobGraph;
3) the foreground page parses the JSON and displays the task's DAG graph;
4) submit the task to the yarn cluster;
5) the server side acquires the task's metrics and pushes them to the front end; the metrics required by the task are polled every 2 seconds and pushed into the constructed DAG graph, so that the running state of the task can be observed.
The step 1) comprises the following steps:
1) install, configure and start the pushgateway and Prometheus services;
2) configure the flink-conf.yaml file and point the metrics reporter at the pushgateway service address, as sketched below.
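As an illustration only, a flink-conf.yaml fragment for this step could look like the following; the host, port and job name are placeholder values, and the reporter keys are those of flink's bundled PrometheusPushGatewayReporter:

```yaml
# Report all flink metrics to the Prometheus pushgateway (illustrative values).
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.example.com   # pushgateway service address (placeholder)
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: flink-monitoring
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: false
metrics.reporter.promgateway.interval: 2 SECONDS
# Emit LatencyMarkers every second so the latency metrics used in step 5) exist at runtime.
metrics.latency.interval: 1000
```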
The step 2) comprises the following steps:
1) write the flink business processing code;
2) the written program obtains the JobGraph description of the task when called, from which the set encapsulating the task's DAG graph can be obtained;
3) define a LatencyMarkerInfo class to encapsulate the data in the collection;
4) parse the set of the encapsulated DAG graph in the JobGraph: first acquire the set encapsulating all sub-DAG graphs; then acquire the id and name of each sub-DAG graph; then acquire the set of operator ids of the other sub-DAG graphs feeding its upstream and downstream inputs; then iterate over the ids and names of the operators chained together inside each sub-DAG graph; finally package this information into a LatencyMarkerInfo instance and convert it to JSON (see the sketch after this list).
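A minimal sketch of this step, assuming a hypothetical LatencyMarkerInfo POJO and Jackson for the JSON conversion; the field names and class layout are illustrative and not necessarily those used by the invention:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.runtime.jobgraph.JobEdge;
import org.apache.flink.runtime.jobgraph.JobGraph;
import org.apache.flink.runtime.jobgraph.JobVertex;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.ArrayList;
import java.util.List;

public class DagJsonBuilder {

    /** Hypothetical POJO describing one sub-DAG graph (one chained JobVertex) of the task. */
    public static class LatencyMarkerInfo {
        public String vertexId;                           // id of the sub-DAG graph
        public String vertexName;                         // name of the sub-DAG graph; contains the chained operator names
        public List<String> inputIds = new ArrayList<>(); // ids of the upstream sub-DAG graphs
    }

    public static String buildDagJson(StreamExecutionEnvironment env) throws Exception {
        // Emit LatencyMarkers every 1000 ms so per-operator latency metrics exist at runtime (step 2 of the method).
        env.getConfig().setLatencyTrackingInterval(1000L);

        // Obtain the JobGraph description of the task; its vertices are the sub-DAG graphs after operator chaining.
        // Note: in some flink versions getStreamGraph() consumes the registered transformations,
        // so build this JSON before executing the job.
        JobGraph jobGraph = env.getStreamGraph().getJobGraph();

        List<LatencyMarkerInfo> dag = new ArrayList<>();
        for (JobVertex vertex : jobGraph.getVertices()) {
            LatencyMarkerInfo info = new LatencyMarkerInfo();
            info.vertexId = vertex.getID().toString();
            info.vertexName = vertex.getName();
            // The upstream sub-DAG graphs are reachable through the vertex's input edges.
            for (JobEdge input : vertex.getInputs()) {
                info.inputIds.add(input.getSource().getProducer().getID().toString());
            }
            dag.add(info);
        }
        // Convert the collected description to JSON for the front-end page (step 3 of the method).
        return new ObjectMapper().writeValueAsString(dag);
    }
}
```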
The step 3) comprises the following steps:
1) acquire the JSON data converted from the DAG in the JobGraph;
2) parse the JSON and display the task's DAG graph on the page: all outermost DAG graphs are traversed first (an illustrative sample of the JSON follows this list).
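Purely for illustration, the JSON produced by a sketch such as the one above might look as follows (ids and names are hypothetical); the page traverses the array, draws one node per sub-DAG graph, and derives the edges from inputIds:

```json
[
  { "vertexId": "a1b2", "vertexName": "Source: Kafka",  "inputIds": [] },
  { "vertexId": "c3d4", "vertexName": "Map -> Filter",  "inputIds": ["a1b2"] },
  { "vertexId": "e5f6", "vertexName": "Sink: MySQL",    "inputIds": ["c3d4"] }
]
```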
The step 4) comprises the following steps:
1) submit the task to the yarn cluster and let it run normally in the cluster;
2) query whether Prometheus can acquire the task's metrics normally; if so, the task started normally; if not, check whether any of the other steps went wrong.
The step 5) comprises the following steps:
1) acquire the average latency from each operator's subtask to the source through the Prometheus metrics;
2) acquire the maximum latency from each operator's subtask to the source through the Prometheus metrics;
3) acquire the amount of data flowing into and out of each operator's subtask through the Prometheus metrics;
4) acquire the amount of data flowing into and out of each operator's subtask per unit time through the Prometheus metrics;
5) acquire the back-pressure degree of each operator's subtask through the Prometheus metrics, and display different colors on the page according to the returned values;
6) once the front-end page receives the captured metric data, it is displayed on the page in real time; a server-side polling sketch follows this list.
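The following is a minimal server-side sketch of the polling loop, using Java 11's HttpClient against Prometheus's standard /api/v1/query endpoint. The metric names follow flink's usual Prometheus naming (numRecordsIn, numRecordsOutPerSecond, isBackPressured, and the latency histogram), but the exact exported names and labels depend on the flink version and metric scope configuration, so they should be treated as assumptions; pushToFrontend is a placeholder for the real push channel (e.g. a WebSocket):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MetricsPoller {

    private static final String PROMETHEUS = "http://prometheus.example.com:9090"; // illustrative address
    private final HttpClient client = HttpClient.newHttpClient();

    /** Runs one instant query against Prometheus and returns the raw JSON response body. */
    private String query(String promQl) throws Exception {
        String url = PROMETHEUS + "/api/v1/query?query="
                + URLEncoder.encode(promQl, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Polls the per-operator metrics every 2 seconds and pushes them to the front-end DAG. */
    public void start(String jobName) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Average latency from the source to each operator subtask (metric name assumed).
                String avgLatency = query("avg_over_time(flink_taskmanager_job_latency_source_id_operator_id_operator_subtask_index_latency{job_name=\"" + jobName + "\"}[1m])");
                // Records flowing into and out of each operator subtask, total and per second.
                String recordsIn = query("flink_taskmanager_job_task_operator_numRecordsIn{job_name=\"" + jobName + "\"}");
                String recordsOutRate = query("flink_taskmanager_job_task_operator_numRecordsOutPerSecond{job_name=\"" + jobName + "\"}");
                // Back-pressure flag per subtask; the page maps the returned value to a color.
                String backPressure = query("flink_taskmanager_job_task_isBackPressured{job_name=\"" + jobName + "\"}");
                pushToFrontend(avgLatency, recordsIn, recordsOutRate, backPressure);
            } catch (Exception e) {
                e.printStackTrace(); // keep polling even if one round fails
            }
        }, 0, 2, TimeUnit.SECONDS);
    }

    /** Placeholder: the real system would parse the responses and push them to the DAG page. */
    private void pushToFrontend(String... payloads) {
        for (String p : payloads) {
            System.out.println(p);
        }
    }

    public static void main(String[] args) {
        new MetricsPoller().start("my-flink-job");
    }
}
```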
The flink JobGraph is parsed to generate the DAG graph of the whole task, including the sub-DAG graphs that are not chained together, and this is combined with flink's LatencyMarker mechanism, which periodically injects latency markers into the stream, thereby achieving full-link, real-time monitoring of the flink job and presenting it visually at the front end. The monitored metrics comprise the average latency from each operator's subtask to the source, the maximum latency from each operator's subtask to the source, the amount of data flowing into and out of each operator's subtask per unit time, and the back-pressure degree of each operator's subtask.
Firstly, which subtask of which operator is back-pressured, and to what degree, can be quickly located on the page; whether data skew exists is judged from the data volumes of each operator's subtasks on the page; the amount of data each subtask of each operator processes in the current time unit can be checked; and the average and maximum latency of each operator's subtask data from the source to that operator can be checked.
Claims (6)
1. A method for full-link monitoring of the health of flink real-time jobs, characterized by comprising the following steps:
1) starting the Prometheus and pushgateway services and configuring flink-conf.yaml; collecting all monitoring metrics in the flink task's DAG graph;
2) writing the flink task, configuring the LatencyMarker-related parameters, and generating the JSON describing the overall DAG graph from the JobGraph;
3) parsing the JSON and displaying the task's DAG graph on the foreground page;
4) submitting the task to a yarn cluster;
5) acquiring, by the server side, the task's metrics and pushing them to the front end; polling the metrics required by the task every 2 seconds and pushing them into the constructed DAG graph, so that the running state of the task can be observed.
2. The method for full-link monitoring of flink real-time job health as claimed in claim 1, wherein said step 1) comprises the steps of:
1) installing, configuring and starting the pushgateway and Prometheus services;
2) configuring the flink-conf.yaml file and pointing the metrics reporter at the pushgateway service address.
3. The method for full-link monitoring of flink real-time job health as claimed in claim 1, wherein said step 2) comprises the steps of:
1) writing the flink business processing code;
2) obtaining the JobGraph description of the task by calling the written program, from which the set encapsulating the task's DAG graph can be obtained;
3) defining a LatencyMarkerInfo class to encapsulate the data in the collection;
4) parsing the set of the encapsulated DAG graph in the JobGraph: firstly acquiring the set encapsulating all sub-DAG graphs; then acquiring the id and name of each sub-DAG graph; then acquiring the set of operator ids of the other sub-DAG graphs feeding its upstream and downstream inputs; then iterating over the ids and names of the operators chained together inside each sub-DAG graph; finally packaging the information into a LatencyMarkerInfo instance and converting it to JSON.
4. The method for full-link monitoring of flink real-time job health as claimed in claim 1, wherein said step 3) comprises the steps of:
1) acquiring the JSON data converted from the DAG in the JobGraph;
2) parsing the JSON and displaying the task's DAG graph on the page: all outermost DAG graphs are traversed first.
5. The method for full-link monitoring of flink real-time job health as claimed in claim 1, wherein said step 4) comprises the steps of:
1) submitting the task to a yarn cluster and running it normally in the cluster;
2) querying whether Prometheus can acquire the task's metrics normally; if so, the task started normally; if not, checking whether any of the other steps went wrong.
6. The method for full-link monitoring of flink real-time job health as claimed in claim 1, wherein said step 5) comprises the steps of:
1) acquiring the average latency from each operator's subtask to the source through the Prometheus metrics;
2) acquiring the maximum latency from each operator's subtask to the source through the Prometheus metrics;
3) acquiring the amount of data flowing into and out of each operator's subtask through the Prometheus metrics;
4) acquiring the amount of data flowing into and out of each operator's subtask per unit time through the Prometheus metrics;
5) acquiring the back-pressure degree of each operator's subtask through the Prometheus metrics, and displaying different colors on the page according to the returned values;
6) displaying the captured metric data on the page in real time once the front-end page receives it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011391805.5A CN112506737A (en) | 2020-12-02 | 2020-12-02 | Method for monitoring health condition of flink real-time operation through full link |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011391805.5A CN112506737A (en) | 2020-12-02 | 2020-12-02 | Method for monitoring health condition of flink real-time operation through full link |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112506737A true CN112506737A (en) | 2021-03-16 |
Family
ID=74968542
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011391805.5A Pending CN112506737A (en) | 2020-12-02 | 2020-12-02 | Method for monitoring health condition of flink real-time operation through full link |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112506737A (en) |
- 2020-12-02 CN CN202011391805.5A patent/CN112506737A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110704474A (en) * | 2019-09-24 | 2020-01-17 | 杭州玳数科技有限公司 | Real-time SQL extension processing method and device based on Flink |
CN111782371A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Stream type computing method and device based on DAG interaction |
Non-Patent Citations (2)
Title |
---|
寰宇001: "Using Prometheus to Monitor Flink", CSDN *
浪尖聊大数据: "Flink in Monitoring Systems: Practice and Application", CSDN *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107729214B (en) | Visual distributed system real-time monitoring operation and maintenance method and device | |
US11657309B2 (en) | Behavior analysis and visualization for a computer infrastructure | |
US10732618B2 (en) | Machine health monitoring, failure detection and prediction using non-parametric data | |
JP5267736B2 (en) | Fault detection apparatus, fault detection method, and program recording medium | |
US20090112932A1 (en) | Visualizing key performance indicators for model-based applications | |
EP3148116A1 (en) | Information system fault scenario information collection method and system | |
US20140282421A1 (en) | Distributed software validation | |
US20140298093A1 (en) | User operation history for web application diagnostics | |
CN107896172B (en) | Monitoring fault processing method and device, storage medium and electronic equipment | |
CN107766208B (en) | Method, system and device for monitoring business system | |
CN104991853A (en) | Method and apparatus for outputting early warning information | |
CN100549975C (en) | Computer maintenance support system and analysis server | |
CN114024834B (en) | Fault positioning method and device, electronic equipment and readable storage medium | |
CN110765189A (en) | Exception management method and system for Internet products | |
CN104407959A (en) | Application based monitoring method and monitoring device | |
CN108897669B (en) | Application monitoring method and device | |
CN108665237B (en) | Method for establishing automatic inspection model and positioning abnormity based on business system | |
US20190384691A1 (en) | Methods for providing an enterprise synthetic monitoring framework | |
CN112506737A (en) | Method for monitoring health condition of flink real-time operation through full link | |
CN117251353A (en) | Monitoring method, system and platform for civil aviation weak current system | |
CN112749140A (en) | MySQL slow SQL log real-time collection and optimization method and system | |
US10055277B1 (en) | System, method, and computer program for performing health checks on a system including a plurality of heterogeneous system components | |
CN106598770B (en) | Native layer exception reporting processing method and device in Android system | |
KR20030056301A (en) | System hindrance integration management method | |
CN113553272A (en) | Interface abnormity monitoring method, device, medium and computer program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 311121 room 102-1 / F, room 102-2 / F, building 6, Haizhi center, 2301 yuhangtang Road, Cangqian street, Yuhang District, Hangzhou, Zhejiang Province Applicant after: HANGZHOU DAISHU TECHNOLOGY Co.,Ltd. Address before: 310030 8F, building 2, Hangzhou Internet innovation and entrepreneurship Park, 176 Zixia street, Xihu District, Hangzhou City, Zhejiang Province Applicant before: HANGZHOU DAISHU TECHNOLOGY Co.,Ltd. |
|
CB02 | Change of applicant information | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20210316 |
|
RJ01 | Rejection of invention patent application after publication |