WO2017167021A1 - Abnormality monitoring method and apparatus - Google Patents

Abnormality monitoring method and apparatus

Info

Publication number
WO2017167021A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
abnormal
time
alarm
running
Prior art date
Application number
PCT/CN2017/076891
Other languages
English (en)
French (fr)
Inventor
陈磊
Original Assignee
阿里巴巴集团控股有限公司
陈磊
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司, 陈磊
Publication of WO2017167021A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/16 Threshold monitoring

Definitions

  • the present application relates to communication technologies, and in particular, to an abnormality monitoring method and apparatus.
  • in prior-art task monitoring schemes, the user basically has to configure complicated information, including an alarm trigger condition, an alarm time, an alarm object, and an alarm mode. The task running process is monitored based on this configuration, and when a task that meets the alarm trigger condition is found, an alarm is sent to the configured alarm object, in the configured alarm mode, at the configured alarm time. Because the alarm time is pre-configured, flexibility is poor and alarms are easily late or unnecessary, resulting in poor alarm accuracy.
  • the present application provides an abnormality monitoring method and device for improving the flexibility of an abnormal task alarm, reducing the probability of occurrence of an untimely or unnecessary alarm, and improving the accuracy of the alarm.
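  • an abnormality monitoring method is provided, including:
  • determining an abnormal task in a task scheduling system according to a preset reference task in the task scheduling system;
  • determining a latest start time for re-running the abnormal task according to a preset reference completion time of the reference task;
  • performing alarm processing on the abnormal task according to the latest start time for re-running the abnormal task and the current time.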
  • an abnormality monitoring apparatus including:
  • An abnormal task determining module configured to determine an abnormal task in the task scheduling system according to a preset reference task in the task scheduling system
  • a latest time determining module configured to determine a latest start time of re-running the abnormal task according to a preset reference completion time of the reference task
  • the alarm processing module is configured to perform alarm processing on the abnormal task according to the latest start time and current time of re-running the abnormal task.
  • as can be seen from the above technical solution, the present application presets a reference task in the task scheduling system and its reference completion time.
  • during task scheduling, the abnormal task is determined according to the reference task, and the latest start time for re-running the abnormal task is then determined according to the reference completion time of the reference task.
  • alarm processing is performed on the abnormal task according to that latest start time and the current time, instead of having to be performed when a pre-configured alarm time arrives as in the prior art. This is more flexible, helps reduce the probability of late or unnecessary alarms, and improves alarm accuracy.
  • FIG. 1 is a schematic flowchart diagram of an abnormality monitoring method according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of task dependency relationships in a task scheduling system according to another embodiment of the present application.
  • FIG. 3 is a schematic diagram of task dependency relationships in a task scheduling system according to another embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an abnormality monitoring apparatus according to another embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an abnormality monitoring apparatus according to another embodiment of the present disclosure.
  • a task scheduling system refers to a system that schedules execution of a series of instructions or tasks in a manner and time set in advance.
  • in existing task scheduling systems, a task monitoring scheme is generally adopted so that abnormal tasks can be discovered in time.
  • in existing task monitoring schemes, the user basically has to configure complicated information, including alarm trigger conditions, alarm time, alarm object, and alarm mode. The task running process is monitored based on this configuration, and when a task that meets the alarm trigger condition is found, an alarm is sent to the configured alarm object, in the configured alarm mode, at the configured alarm time. Because the alarm time is pre-configured, flexibility is poor and alarms are easily late or unnecessary, resulting in poor alarm accuracy.
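  • to address this technical problem, the present application provides a solution whose main principle is: pre-configure the reference task in the task scheduling system and its reference completion time, determine the abnormal task according to the reference task, determine the latest start time for re-running the abnormal task according to the reference completion time of the reference task, and then perform alarm processing on the abnormal task according to that latest start time and the current time.
  • alarm processing no longer has to be performed when a pre-configured alarm time arrives as in the prior art. This is more flexible, helps reduce the probability of late or unnecessary alarms, and improves alarm accuracy.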
  • the technical solution provided by the present application is applicable to the task scheduling system, and is preferably applicable to the offline task scheduling system in the data warehouse development process, but is not limited thereto.
  • the tasks scheduled in an offline task scheduling system are offline tasks, as opposed to online or real-time tasks. They are mainly tasks whose results do not need to be applied to the online business system immediately; instead, the data obtained after a series of asynchronous processing steps is fed back into the online business system.
  • FIG. 1 is a schematic flowchart diagram of an abnormality monitoring method according to an embodiment of the present application. As shown in Figure 1, the method includes:
  • the embodiment provides an abnormality monitoring method, which can be executed by the abnormality monitoring device, so as to more flexibly perform alarm processing on abnormal tasks, reduce the probability of occurrence of an untimely or unnecessary alarm, and improve the accuracy of the alarm.
  • in a task scheduling system, tasks have upstream and downstream dependencies: a downstream task can be executed only after its upstream tasks have finished executing.
  • An example of the dependency between tasks in the task scheduling system is shown in FIG. 2 .
  • as shown in FIG. 2, the task scheduling system includes task A, task B, task C, task D, task E, and task F.
  • Task B and Task C depend on Task A.
  • Task A is the upstream task of Task B and Task C.
  • Task B and Task C are the downstream tasks of Task A.
  • Task F depends on Task A and Task C, so Task A and Task C are the upstream tasks of Task F, and Task F is a downstream task of Task A and Task C. Task D and Task E depend on Task A and Task B, so Task A and Task B are the upstream tasks of Task D and Task E, and Task D and Task E are downstream tasks of Task A and Task B.
  • task A is a direct upstream task of task B and task C
  • task B and task C are direct downstream tasks of task A
  • task A is an indirect upstream task of task D
  • task D, task E and task F are indirect downstream tasks of task A.
  • no further distinction is drawn here between direct and indirect upstream and downstream tasks.
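The dependency relationships of FIG. 2 can be sketched as a small graph. The snippet below is only an illustration: the direct edges (A to B and C, B to D and E, C to F) are an assumption inferred from the direct and indirect relationships listed above, and the names `DIRECT_UPSTREAM`, `all_upstream` and `all_downstream` are hypothetical.

```python
from collections import deque

# Assumed direct edges for FIG. 2: each task maps to the tasks it directly depends on.
DIRECT_UPSTREAM = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B"],
    "E": ["B"],
    "F": ["C"],
}


def all_upstream(task, direct_upstream):
    """Return every direct or indirect upstream task of `task`."""
    seen = set()
    queue = deque(direct_upstream[task])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(direct_upstream[current])
    return seen


def all_downstream(task, direct_upstream):
    """Return every task that depends on `task`, directly or indirectly."""
    return {t for t in direct_upstream if task in all_upstream(t, direct_upstream)}
```

With these assumed edges, `all_upstream("D", DIRECT_UPSTREAM)` contains both the direct upstream task B and the indirect upstream task A, which matches the description that direct and indirect relationships are treated alike.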
  • the present embodiment presets the reference task in the task scheduling system and its reference completion time, and uses them as the baseline for abnormality monitoring.
  • abnormal task monitoring and alarm processing are carried out on the basis of this baseline.
  • the reference completion time of the reference task is its latest completion time: the reference task must be completed before the reference completion time, otherwise serious adverse consequences may follow, for example the entire task scheduling system may report an error, or the normal operation of an online business system that depends on the task scheduling system may be affected.
  • the reference task may be determined according to the importance of each task in the task scheduling system, for example by taking a task whose importance meets a set condition (for example, the most important task) as the reference task.
  • the reference task may also be determined according to the dependency relationships between the tasks in the task scheduling system, for example by taking a task whose numbers of upstream and downstream tasks meet a set condition (for example, the largest numbers, or numbers greater than a specified value) as the reference task. If a task has many upstream tasks and many downstream tasks, it is relatively central and has a large impact, so it must be completed before its latest completion time. Setting it as a reference task therefore helps ensure that more tasks run on time.
  • the reference completion time of the reference task can be determined according to how the reference task is used. For example, if the online business system needs to use the data calculated by the reference task at 9:00 every morning, the reference completion time can be set to 9:00, meaning that the reference task must be completed before 9:00 every day. As another example, if the relevant personnel need to view a report generated from the data calculated by the reference task at 10:00 every morning, the reference completion time can be set to 10:00, meaning that the reference task must be completed before 10:00.
  • the present embodiment does not limit the number of reference tasks; there may be one or more.
  • different reference completion times can be set for different reference tasks, or the same reference completion time can be set for several of them. As shown in FIG. 2, task D and task E in the box are set as reference tasks, and both need to be completed before 6:00 in the morning, so the same reference completion time, for example 6:00, can be set for both.
  • the abnormal task in the task scheduling system can be determined according to the dependency relationship between the reference task and other tasks in the task scheduling system.
  • the abnormality monitoring apparatus may determine, according to the dependency relationships between the reference task and other tasks in the task scheduling system, the tasks that have a dependency relationship with the reference task as the tasks to be monitored, and then monitor the running process of the tasks to be monitored to obtain, as abnormal tasks, those whose running state is abnormal.
  • the tasks that have a dependency relationship with the reference task include its upstream tasks and its downstream tasks. However, it is the upstream tasks that directly affect the start time and completion time of the reference task, whereas the downstream tasks have relatively little effect on it, so the downstream tasks can be ignored.
  • the abnormality monitoring device may therefore determine the tasks that the reference task depends on in the task scheduling system as the tasks to be monitored, and then monitor their running process to obtain, as abnormal tasks, those whose running state is abnormal.
  • the number of tasks to be monitored is relatively small, which is beneficial to saving various resources consumed by monitoring and improving the efficiency of discovering abnormal tasks.
  • the abnormality monitoring device can derive backward all the upstream tasks of the reference task according to the dependency relationships between the tasks, and thereby automatically monitor all upstream tasks of the reference task.
  • there is no need to configure a trigger condition, an alarm time, and the like for every upstream task as in the prior art. This approach requires little configuration information and covers a wide monitoring range, and is particularly suitable for task scheduling systems with a large number of tasks.
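A minimal sketch of this backward derivation follows, under stated assumptions: the direct edges are those assumed for FIG. 2 with task D and task E as reference tasks, and the run-state labels (`"ok"`, `"error"`, `"slow"`) and function names are hypothetical, not taken from the application.

```python
# Assumed direct edges for FIG. 2: each task maps to the tasks it directly depends on.
DIRECT_UPSTREAM = {
    "A": [], "B": ["A"], "C": ["A"],
    "D": ["B"], "E": ["B"], "F": ["C"],
}

# Hypothetical labels for an abnormal running state.
ABNORMAL_STATES = {"error", "slow"}


def tasks_to_monitor(reference_tasks, direct_upstream):
    """Collect every task that any reference task depends on, directly or indirectly."""
    monitored = set()
    stack = [u for ref in reference_tasks for u in direct_upstream[ref]]
    while stack:
        task = stack.pop()
        if task not in monitored:
            monitored.add(task)
            stack.extend(direct_upstream[task])
    return monitored


def abnormal_tasks(monitored, run_state):
    """Return the monitored tasks whose reported state is abnormal, in sorted order."""
    return sorted(t for t in monitored if run_state.get(t) in ABNORMAL_STATES)
```

Only the reference tasks are configured here; the monitored set is derived from the graph. A slow task C, for instance, would not be reported when D and E are the reference tasks, because neither of them depends on C.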
  • the abnormal task refers to a task to be monitored whose operating state is abnormal.
  • an abnormal running state is defined relative to a normal running state.
  • a normal state condition indicating a normal running state may be preset. Based on this, the running process of the task to be monitored can be monitored to determine whether its running state meets the normal state condition; if it does, the running state of the task to be monitored is determined to be normal; if it does not, the running state is determined to be abnormal, and the task to be monitored is regarded as an abnormal task. Alternatively,
  • an abnormal state condition indicating an abnormal running state may be preset. Based on this, the running process of the task to be monitored can be monitored to determine whether its running state meets the abnormal state condition; if it does not, the running state of the task to be monitored is determined to be normal; if it does, the running state is determined to be abnormal, and the task to be monitored is regarded as an abnormal task.
  • the normal state condition indicating the normal running state and the abnormal state condition indicating the abnormal running state may also be set at the same time.
  • the abnormal state condition includes at least one of the following:
  • Running error: a task whose run fails with an error is an abnormal task.
  • Slow running speed: a task that runs slowly is an abnormal task.
  • the abnormality monitoring device may acquire an abnormal task by using at least one of the following operations, as follows:
  • the abnormality monitoring device may acquire, among the tasks to be monitored, a task whose running time meets a specified duration condition as a task whose running speed is slow, that is, an abnormal task.
  • the foregoing specified duration condition includes but is not limited to at least one of the following conditions:
  • the running time of the task to be monitored is greater than a preset duration threshold.
  • the running time of the task to be monitored exceeds its average running time over a specified time period by a specified ratio.
  • the duration threshold may be adaptively set according to the application scenario and task attributes, and may be, for example, 1 hour, 30 minutes, or 2 hours.
  • the specified time period and the specified ratio may also be adaptively set according to the application scenario and task attributes. For example, the specified time period may be 10 days, 15 days, or 1 month, and the specified ratio may be 30%, 20%, or 15%, or even a range such as 15%-30%.
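The duration conditions can be illustrated with a short check. This is a sketch under assumptions: the function name `is_slow`, its parameters, and the default values (a 60-minute threshold and a 30% ratio, both taken from the example values above) are hypothetical, and the history list stands in for the running times recorded over the specified time period.

```python
def is_slow(elapsed_min, history_min, duration_threshold_min=60, ratio=0.3):
    """Return True if a running time meets at least one assumed duration condition.

    elapsed_min: running time of the task being monitored, in minutes.
    history_min: running times (minutes) recorded over the specified time period.
    """
    # Condition 1: running time is greater than the preset duration threshold.
    if elapsed_min > duration_threshold_min:
        return True
    # Condition 2: running time exceeds the historical average by the specified ratio.
    if history_min:
        average = sum(history_min) / len(history_min)
        if elapsed_min > average * (1 + ratio):
            return True
    return False
```

For a task that usually takes 30 minutes, a 40-minute run is flagged by the ratio condition even though it is well under the absolute threshold, which is why the two conditions are useful together.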
  • abnormal tasks in the task scheduling system can be determined.
  • the abnormal task refers to a task with an abnormality, so it needs to be re-run.
  • since the reference task depends on the abnormal task, and the reference task must be completed before its reference completion time, the abnormal task cannot be re-run at an arbitrary time. Its re-run needs to start before a certain latest time, so that the reference task that depends on it can still be completed before the reference completion time.
  • the abnormality monitoring device can determine the latest start time for re-running the abnormal task according to the preset reference completion time of the reference task.
  • the abnormality monitoring device may work backward from the dependency relationship between the reference task and the abnormal task, the reference completion time of the reference task, the average running time of the reference task, and the average running time of the abnormal task, thereby determining the latest start time for re-running the abnormal task.
  • the task scheduling system includes tasks and dependencies between tasks.
  • the task scheduling system includes task A, task B, task C, task D, task E, and task F.
  • task B is the direct downstream task of task A
  • task C, task D and task E are the direct downstream tasks of task B, respectively
  • task F is the direct downstream task of task E.
  • task C and task D are set as one group of reference tasks with a reference completion time of 6:00, meaning that both task C and task D need to be completed before 6:00; task E and task F are set as another group of reference tasks with a reference completion time of 5:00, meaning that both task E and task F need to be completed before 5:00.
  • the average running time of each task is as follows: task E, 0.5 hours; task F, 20 minutes; task C, 1.5 hours;
  • task D, 2 hours; task B, 2 hours; and task A, 10 minutes.
  • taking task A as the abnormal task, the abnormality monitoring device may work backward along the dependencies from the reference tasks using the known information above: it first determines the latest completion time of task B, the downstream task of abnormal task A, and then, based on the latest completion time of task B, determines the latest start time for re-running abnormal task A.
  • the time margin of task A, that is, the time difference between the latest start time of task A and the current time, can then be obtained. For example, if the current time is 1:00, the time margin of task A is 50 minutes.
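The backward calculation for the FIG. 3 example can be sketched as follows. The recurrence used here is an assumption rather than a rule quoted from the application: a task's latest completion time is taken as the earliest of its own reference completion time (if it is a reference task) and the latest start times of its direct downstream tasks. Times are minutes after midnight, and all names are hypothetical.

```python
from functools import lru_cache

# FIG. 3 data from the description: direct downstream edges, average running
# times in minutes, and reference completion times in minutes after midnight.
DOWNSTREAM = {"A": ["B"], "B": ["C", "D", "E"], "C": [], "D": [], "E": ["F"], "F": []}
AVG_RUNTIME_MIN = {"A": 10, "B": 120, "C": 90, "D": 120, "E": 30, "F": 20}
REFERENCE_DEADLINE_MIN = {"C": 360, "D": 360, "E": 300, "F": 300}


@lru_cache(maxsize=None)
def latest_finish(task):
    """Latest completion time of `task` that still lets every reference task finish on time."""
    candidates = [latest_start(d) for d in DOWNSTREAM[task]]
    if task in REFERENCE_DEADLINE_MIN:
        candidates.append(REFERENCE_DEADLINE_MIN[task])
    if not candidates:
        raise ValueError(f"task {task} constrains no reference task")
    return min(candidates)


def latest_start(task):
    """Latest start time: latest completion time minus the average running time."""
    return latest_finish(task) - AVG_RUNTIME_MIN[task]


def fmt(minutes):
    """Format minutes after midnight as H:MM."""
    return f"{minutes // 60}:{minutes % 60:02d}"
```

Under this assumption task D is the tightest constraint on task B (6:00 minus 2 hours gives 4:00), so task B must start by 2:00 and task A by 1:50. With a current time of 1:00 that leaves the 50-minute margin stated in the example.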
  • after determining the latest start time for re-running the abnormal task, the abnormality monitoring device can flexibly perform alarm processing on the abnormal task according to the latest start time and the current time.
  • if the latest start time is close to the current time, alarm processing can be performed on the abnormal task immediately so that it is handled in time; if the latest start time is far from the current time, alarm processing can be performed later.
  • the alarm is thus raised at a reasonable time, which reduces the disturbance that alarms cause to the user and reduces unnecessary alarms.
  • the key to alarm handling of abnormal tasks is to determine the abnormal alarm time.
  • the abnormality monitoring device mainly determines the abnormal alarm time according to the latest start time and the current time of re-running the abnormal task, and then performs alarm processing on the abnormal task when the abnormal alarm time arrives.
  • the latest start time for re-running the abnormal task and the current time are the main factors affecting the abnormal alarm time. There are of course other factors, such as the time period in which alarms must be raised promptly and the abnormality type of the abnormal task.
  • a time range in which alarms must be raised promptly can be specified in advance; it is referred to as the specified time range.
  • the specified time range can be working hours, such as 9:00-20:00.
  • the abnormality monitoring device can determine whether the current time is within the specified time range. If it is, the current time is taken as the abnormal alarm time and alarm processing is performed on the abnormal task when that time arrives, that is, immediately. If it is not, the abnormal alarm time is determined according to the abnormality type of the abnormal task and the latest start time for re-running it, and alarm processing is performed on the abnormal task when the abnormal alarm time arrives.
  • the following takes as an example the case where the abnormality types of an abnormal task include running error and slow running speed.
  • if the abnormality type of the abnormal task is a running error, it can be determined whether the latest start time for re-running the abnormal task is later than a preset first time. If it is later than the preset first time, a second time that is later than the current time but earlier than the first time is set as the abnormal alarm time. If it is earlier than or equal to the preset first time, the current time is set as the abnormal alarm time, that is, alarm processing is performed on the abnormal task immediately.
  • performing alarm processing at the second time is equivalent to delaying the alarm, which helps avoid the user's rest time and reduces disturbance to the user; in the long run it is also equivalent to widening the time interval between two alarms.
  • a wider interval helps reduce the number of alarms and save resources, while taking the current time as the abnormal alarm time allows a timely alarm and avoids problems caused by a late alarm.
  • the values of the first time and the second time are not limited in this embodiment and may be adaptively set according to the application scenario.
  • the preset first time may be 11:00.
  • the second time may be 9:00, but is not limited thereto.
  • if the abnormality type of the abnormal task is slow running speed, it can be determined whether the time difference between the latest start time for re-running the abnormal task and the current time is greater than a preset time difference threshold. If it is greater than the threshold, a third time that is earlier than the latest start time for re-running the abnormal task is set as the abnormal alarm time. If it is less than or equal to the threshold, the current time is set as the abnormal alarm time.
  • taking a third time earlier than the latest start time as the abnormal alarm time is equivalent to delaying the alarm, which helps avoid the user's rest time and reduces disturbance to the user; in the long run it is equivalent to widening the time interval between two alarms, which helps reduce the number of alarms and save resources. Taking the current time as the abnormal alarm time allows a timely alarm and avoids problems caused by a late alarm.
  • the value of the time difference threshold is not limited in this embodiment, and may be adaptively set according to an application scenario.
  • the time difference threshold may be 2 hours, but is not limited thereto.
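The alarm-time decision described above can be summarized in one function. This is a sketch, not the application's implementation: the function and parameter names are hypothetical, the defaults reuse the example values from the text (9:00-20:00, first time 11:00, second time 9:00, threshold 2 hours), the first and second times are interpreted as their next occurrence after the current time, and the third time is assumed to be the latest start time minus the time difference threshold.

```python
from datetime import datetime, time, timedelta
from enum import Enum


class AbnormalType(Enum):
    RUN_ERROR = "run_error"
    SLOW = "slow"


def _next_occurrence(now, clock_time):
    """Next datetime strictly after `now` at which the wall clock shows `clock_time`."""
    candidate = datetime.combine(now.date(), clock_time)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def decide_alarm_time(now, latest_start, abnormal_type,
                      range_start=time(9, 0), range_end=time(20, 0),
                      first_time=time(11, 0), second_time=time(9, 0),
                      diff_threshold=timedelta(hours=2)):
    """Return the abnormal alarm time; returning `now` means alarm immediately."""
    # Inside the specified time range: alarm immediately.
    if range_start <= now.time() <= range_end:
        return now
    if abnormal_type is AbnormalType.RUN_ERROR:
        # Delay to the second time only if the re-run can still start after the first time.
        if latest_start > _next_occurrence(now, first_time):
            return _next_occurrence(now, second_time)
        return now
    # Slow running speed: delay only if there is more slack than the threshold.
    if latest_start - now > diff_threshold:
        # Assumption: the third time is the latest start time minus the threshold.
        return latest_start - diff_threshold
    return now
```

For example, a running error detected at 2:00 whose re-run may start as late as 12:00 is reported at 9:00, whereas the same error with a latest start time of 4:00 is reported immediately.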
  • the alarm object and the alarm mode can be set in advance.
  • the alarm object mainly refers to a responsible person or a person in charge who needs to process an abnormal task, for example, the alarm object can be configured in the duty table.
  • the alarm mode includes at least one of the following: a voice alarm, a short message alarm, an email alarm, an alarm light, and an instant communication alarm.
  • the above-mentioned alarm processing on the abnormal task specifically consists of alerting the corresponding responsible person, according to the pre-configured duty table, through the configured alarm mode, for example by sending a text message or email to the responsible person's terminal device, or by giving the responsible person a voice prompt.
  • the abnormality monitoring device can flexibly determine the abnormal alarm time according to the latest start time for re-running the abnormal task and the current time, which makes it possible to perform alarm processing on the abnormal task at an appropriate time, without having to perform alarm processing when a pre-configured alarm time arrives as in the prior art.
  • this is more flexible: alarms can be timely and unnecessary alarms can be reduced, which lowers the probability of late or unnecessary alarms and improves alarm accuracy, making this a more intelligent alarm solution.
  • FIG. 4 is a schematic structural diagram of an abnormality monitoring apparatus according to another embodiment of the present application. As shown in FIG. 4, the apparatus includes: an abnormal task determining module 41, a latest time determining module 42, and an alarm processing module 43.
  • the abnormal task determining module 41 is configured to determine an abnormal task in the task scheduling system according to a preset reference task in the task scheduling system.
  • the latest time determining module 42 is configured to determine a latest start time for re-running the abnormal task according to a preset reference completion time of the reference task.
  • the alarm processing module 43 is configured to perform alarm processing on the abnormal task according to the latest start time and the current time of re-running the abnormal task.
  • an implementation structure of the abnormal task determining module 41 includes: a monitoring task determining unit 411 and an abnormal task acquiring unit 412.
  • the monitoring task determining unit 411 is configured to determine, as the task to be monitored, the task that the benchmark task depends on in the task scheduling system;
  • the abnormal task obtaining unit 412 is configured to acquire a task whose operating state is abnormal in the task to be monitored as an abnormal task.
  • abnormal task obtaining unit 412 is specifically configured to perform at least one of the following operations:
  • the abnormal task acquisition unit 412 obtains a task whose running speed is slow in the task to be monitored as an abnormal task
  • the abnormal task acquiring unit 412 is specifically configured to:
  • the specified duration condition includes at least one of the following:
  • an implementation structure of the alarm processing module includes: a first alarm processing unit 431 and a second alarm processing unit 432.
  • the first alarm processing unit 431 is configured to perform alarm processing on the abnormal task immediately when the current time is within the specified time range.
  • the second alarm processing unit 432 is configured to, when the current time is not within the specified time range, determine the abnormal alarm time according to the abnormality type of the abnormal task and the latest start time for re-running the abnormal task, and perform alarm processing on the abnormal task when the abnormal alarm time arrives.
  • the second alarm processing unit 432 is specifically configured to:
  • when the abnormality type of the abnormal task is a running error and the latest start time for re-running the abnormal task is later than the preset first time,
  • set a second time that is later than the current time but earlier than the first time as the abnormal alarm time;
  • or, when the latest start time for re-running the abnormal task is earlier than or equal to the first time, set the current time as the abnormal alarm time;
  • when the abnormality type of the abnormal task is slow running speed and the time difference between the latest start time for re-running the abnormal task and the current time is greater than the preset time difference threshold, set a third time that is earlier than the latest start time for re-running the abnormal task
  • as the abnormal alarm time; or, when that time difference is less than or equal to the time difference threshold, set the current time as the abnormal alarm time.
  • the abnormality monitoring apparatus determines an abnormal task according to a preset reference task during task scheduling, and further determines the latest start time for re-running the abnormal task according to the preset reference completion time of the reference task. It performs alarm processing on the abnormal task according to that latest start time and the current time, instead of performing alarm processing when a pre-configured alarm time arrives as in the prior art. This is more flexible, helps reduce the probability of late or unnecessary alarms, and improves alarm accuracy.
  • the abnormality monitoring device provided in this embodiment only needs to preset the reference task and its reference completion time.
  • the abnormality monitoring device provided in this embodiment can derive backward all upstream tasks of the reference task according to the dependency relationships between the reference task and other tasks in the task scheduling system,
  • and then automatically monitor all upstream tasks of the reference task, instead of configuring a trigger condition, an alarm time, and so on for every upstream task as in the prior art.
  • it therefore requires little configuration information and covers a wide monitoring range, and is especially suitable for task scheduling systems with a large number of tasks.
  • the aforementioned program can be stored in a computer readable storage medium.
  • when executed, the program performs the steps of the foregoing method embodiments; the foregoing storage medium includes media that can store program code, such as ROM, RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)

Abstract

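The present application provides an abnormality monitoring method and apparatus. The abnormality monitoring method includes: determining an abnormal task in a task scheduling system according to a preset reference task in the task scheduling system; determining a latest start time for re-running the abnormal task according to a preset reference completion time of the reference task; and performing alarm processing on the abnormal task according to the latest start time for re-running the abnormal task and the current time. The present application can improve the flexibility of alarms on abnormal tasks, reduce the probability of late or unnecessary alarms, and improve alarm accuracy.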

Description

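Abnormality monitoring method and apparatus
Technical Field
The present application relates to communication technologies, and in particular to an abnormality monitoring method and apparatus.
Background
In the era of big data, data is analyzed and used ever more widely. However, because of the sheer volume of data and the complexity of the collection process, instability and errors are inevitable; in distributed systems in particular, failures and retries are hard to avoid. When a problem occurs, a timely or even advance warning can greatly reduce the losses caused by data errors.
In a task scheduling system, a task monitoring scheme is generally adopted so that abnormal tasks can be discovered in time. In prior-art task monitoring schemes, the user basically has to configure complicated information, including alarm trigger conditions, alarm time, alarm object, and alarm mode. The task running process is monitored based on this configuration, and when a task that meets the alarm trigger condition is found, an alarm is sent to the configured alarm object, in the configured alarm mode, at the configured alarm time. Because the alarm time is pre-configured, flexibility is poor and alarms are easily late or unnecessary, resulting in poor alarm accuracy.
Summary
The present application provides an abnormality monitoring method and apparatus for improving the flexibility of alarms on abnormal tasks, reducing the probability of late or unnecessary alarms, and improving alarm accuracy.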
为达到上述目的,本申请的实施例采用如下技术方案:
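In a first aspect, an abnormality monitoring method is provided, including:
determining an abnormal task in a task scheduling system according to a reference task preset in the task scheduling system;
determining the latest start time for rerunning the abnormal task according to a preset reference completion time of the reference task;
performing alarm processing on the abnormal task according to the latest start time for rerunning the abnormal task and the current time.
In a second aspect, an abnormality monitoring apparatus is provided, including:
an abnormal task determination module, configured to determine an abnormal task in a task scheduling system according to a reference task preset in the task scheduling system;
a latest time determination module, configured to determine the latest start time for rerunning the abnormal task according to a preset reference completion time of the reference task;
an alarm processing module, configured to perform alarm processing on the abnormal task according to the latest start time for rerunning the abnormal task and the current time.
As can be seen from the above technical solutions, the present application presets a reference task in the task scheduling system and its reference completion time. During task scheduling, an abnormal task is determined according to the reference task, the latest start time for rerunning the abnormal task is then determined according to the reference completion time of the reference task, and alarm processing is performed on the abnormal task according to that latest start time and the current time, rather than necessarily when a pre-configured alarm time arrives as in the prior art. This is more flexible, helps reduce the probability of late or unnecessary alarms, and improves alarm accuracy.
The above is only an overview of the technical solution of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the contents of the specification, and so that the above and other objectives, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are set out below.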
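Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the present application. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a schematic flowchart of an abnormality monitoring method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of task dependencies in a task scheduling system provided by another embodiment of the present application;
FIG. 3 is a schematic diagram of task dependencies in a task scheduling system provided by yet another embodiment of the present application;
FIG. 4 is a schematic structural diagram of an abnormality monitoring apparatus provided by yet another embodiment of the present application;
FIG. 5 is a schematic structural diagram of an abnormality monitoring apparatus provided by yet another embodiment of the present application.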
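Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
A task scheduling system is a system that schedules and executes a series of instructions or tasks in a predefined manner and at predefined times. In existing task scheduling systems, a task monitoring scheme is generally used so that abnormal tasks can be found in time. In existing task monitoring schemes, the user essentially has to configure a large amount of information, including alarm trigger conditions, alarm time, alarm recipients, and alarm methods. The running of tasks is monitored based on this configuration, and when a task meeting an alarm trigger condition is found, an alarm is sent to the configured recipients, by the configured method, at the configured alarm time. Because the alarm time is configured in advance, this approach is inflexible and tends to produce late or unnecessary alarms, so alarm accuracy is poor.
To address the above technical problem, the present application provides a solution whose main principle is as follows: a reference task in the task scheduling system and its reference completion time are configured in advance; an abnormal task is determined according to the reference task; the latest start time for rerunning the abnormal task is determined according to the reference completion time of the reference task; and alarm processing is then performed on the abnormal task according to that latest start time and the current time. Alarm processing no longer has to wait for a pre-configured alarm time as in the prior art, which is more flexible, helps reduce the probability of late or unnecessary alarms, and improves alarm accuracy.
It is worth noting that the technical solution provided in the present application is applicable to task scheduling systems, and preferably, but not exclusively, to offline task scheduling systems used in data warehouse development. The tasks scheduled in an offline task scheduling system are offline tasks, as opposed to online or real-time tasks: they are tasks whose results do not need to be applied to the online business system immediately, but are fed back to the online business system after a series of asynchronous processing steps.
The following embodiments of the present application use an offline task scheduling system as an example. However, based on the technical teaching given in the following embodiments, those skilled in the art can readily apply the technical solution of the present application to an online task scheduling system.
The technical solution of the present application is described in detail below with reference to specific implementations and the accompanying drawings.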
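FIG. 1 is a schematic flowchart of an abnormality monitoring method provided by an embodiment of the present application. As shown in FIG. 1, the method includes:
101. Determine an abnormal task in the task scheduling system according to a reference task preset in the task scheduling system.
102. Determine the latest start time for rerunning the abnormal task according to a preset reference completion time of the reference task.
103. Perform alarm processing on the abnormal task according to the latest start time for rerunning the abnormal task and the current time.
This embodiment provides an abnormality monitoring method that can be executed by an abnormality monitoring apparatus. It performs alarm processing on abnormal tasks more flexibly, reduces the probability of late or unnecessary alarms, and improves alarm accuracy.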
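In a task scheduling system, tasks have upstream and downstream dependencies: a downstream task can run only after its upstream tasks have finished. FIG. 2 shows an example of the dependencies between tasks in a task scheduling system. The task scheduling system shown in FIG. 2 includes task A, task B, task C, task D, task E, and task F. Task B and task C depend on task A, so task A is an upstream task of task B and task C, and task B and task C are downstream tasks of task A. Similarly, task F depends on task A and task C, so task A and task C are upstream tasks of task F, and task F is a downstream task of task A and task C. Task D and task E depend on task A and task B, so task A and task B are upstream tasks of task D and task E, and task D and task E are downstream tasks of task A and task B.
It is worth noting that the upstream and downstream relationships shown in FIG. 2 include both direct and indirect relationships. For example, task A is a direct upstream task of task B and task C, and task B and task C are direct downstream tasks of task A, whereas task A is an indirect upstream task of task D, task E, and task F, and task D, task E, and task F are indirect downstream tasks of task A. The embodiments of the present application do not distinguish between direct and indirect upstream and downstream tasks.
Because tasks in a task scheduling system have upstream and downstream dependencies, this embodiment presets a reference task in the task scheduling system and its reference completion time, uses the reference task and its reference completion time as the baseline for abnormality monitoring, and carries out abnormal task monitoring and alarm processing against that baseline.
The reference completion time of a reference task is the latest time by which the reference task must complete. In other words, the reference task must be guaranteed to complete before the reference completion time; otherwise there will be serious adverse consequences, for example the entire task scheduling system may report errors, or the online business system that depends on the task scheduling system may be unable to run normally.
Optionally, the reference task can be determined according to the importance of each task in the task scheduling system, for example by taking a task whose importance meets a certain condition (such as being the most important) as the reference task. Alternatively, the reference task can be determined according to the dependencies between the tasks in the task scheduling system, for example by taking a task whose numbers of upstream tasks and downstream tasks both meet a certain condition (such as being the largest, or exceeding a specified number) as the reference task. If a task has many upstream tasks and many downstream tasks, it is a core task with a wide impact, so it is necessary to ensure that it completes before its latest completion time; setting it as the reference task therefore helps ensure that more tasks run on time.
Accordingly, once the reference task has been determined, its reference completion time can be determined according to how the reference task is used. For example, if the online business system needs to use the data computed by the reference task at 9:00 every morning, the reference completion time of the reference task can be set to 9:00, which means the reference task must complete before 9:00 every day. As another example, if staff need to view a report generated from the data computed by the reference task at 10:00 every morning, the reference completion time of the reference task can be set to 10:00, which means the reference task must complete before 10:00.
Note that this embodiment does not limit the number of reference tasks; there may be one or several. When there are several reference tasks, different reference tasks can be given different reference completion times or the same reference completion time. As shown in FIG. 2, task D and task E, which are enclosed in boxes, are set as reference tasks. Both need to complete before 6:00 in the morning, so the same reference completion time, for example 6:00, can be set for both.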
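After the reference task and its reference completion time have been set, the abnormal tasks in the task scheduling system can be determined according to the dependencies between the reference task and the other tasks in the task scheduling system.
In an optional implementation, the abnormality monitoring apparatus can, according to the dependencies between the reference task and the other tasks in the task scheduling system, take the tasks that have a dependency relationship with the reference task as the tasks to be monitored. It then monitors the running of the tasks to be monitored and takes those whose running state is abnormal as abnormal tasks.
Further, the tasks that have a dependency relationship with the reference task include its upstream tasks and its downstream tasks. However, it is the upstream tasks that directly affect the start time and completion time of the reference task, while the downstream tasks have relatively little effect on it and can therefore be ignored. On this basis, the abnormality monitoring apparatus can take the tasks in the task scheduling system on which the reference task depends as the tasks to be monitored, monitor their running, and take those whose running state is abnormal as abnormal tasks. In this implementation the number of tasks to be monitored is relatively small, which saves the resources consumed by monitoring and makes abnormal tasks faster to find. In addition, only the reference task needs to be preset: the abnormality monitoring apparatus can derive all upstream tasks of the reference task from the dependencies between tasks and then monitor all of them automatically, instead of configuring trigger conditions, alarm times, and so on for every upstream task as in the prior art. This approach needs little configuration yet monitors a wide range, and is especially suitable for task scheduling systems with a large number of tasks.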
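The derivation of the tasks to be monitored can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the edge map `DOWNSTREAM` encodes the direct dependencies implied by the FIG. 2 description (A feeds B and C, B feeds D and E, C feeds F), and the function names are hypothetical rather than taken from the application.

```python
from collections import deque

# Direct "upstream -> downstream" edges, assumed from the FIG. 2 description.
DOWNSTREAM = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def build_upstream_index(downstream):
    """Invert the edge map so each task lists its direct upstream tasks."""
    upstream = {task: [] for task in downstream}
    for parent, children in downstream.items():
        for child in children:
            upstream[child].append(parent)
    return upstream


def tasks_to_monitor(reference_tasks, downstream):
    """Return every direct or indirect upstream task of the reference tasks."""
    upstream = build_upstream_index(downstream)
    seen = set()
    queue = deque(reference_tasks)
    while queue:
        task = queue.popleft()
        for parent in upstream[task]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(tasks_to_monitor(["D", "E"], DOWNSTREAM)))  # ['A', 'B']
```

With D and E as reference tasks, only A and B are monitored; C and F are skipped because neither reference task depends on them.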
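In the above process of obtaining abnormal tasks, an abnormal task is a task to be monitored whose running state is abnormal. An abnormal running state is defined relative to a normal running state.
In an optional implementation, a normal-state condition representing a normal running state can be preset. On this basis, the running of a task to be monitored can be monitored to determine whether its running state meets the normal-state condition. If it does, the running state of the task is determined to be normal; if it does not, the running state is determined to be abnormal and the task is treated as an abnormal task.
Alternatively, in another optional implementation, an abnormal-state condition representing an abnormal running state can be preset. On this basis, the running of a task to be monitored can be monitored to determine whether its running state meets the abnormal-state condition. If it does not, the running state of the task is determined to be normal; if it does, the running state is determined to be abnormal and the task is treated as an abnormal task.
Of course, in other optional implementations, both a normal-state condition representing a normal running state and an abnormal-state condition representing an abnormal running state can be set at the same time.
Further optionally, the abnormal-state condition includes at least one of the following:
Running error: a task that fails while running is an abnormal task;
Slower running speed: a task whose running speed has dropped is an abnormal task.
Based on the above abnormal-state conditions, the abnormality monitoring apparatus can obtain abnormal tasks through at least one of the following operations:
obtaining, from the tasks to be monitored, the tasks that failed while running as abnormal tasks; and
obtaining, from the tasks to be monitored, the tasks whose running speed has dropped as abnormal tasks.
Further, whether a task's running speed has dropped can be determined from its running duration. Specifically, the abnormality monitoring apparatus can obtain, from the tasks to be monitored, the tasks whose running duration meets a specified duration condition as tasks whose running speed has dropped, that is, as abnormal tasks.
Optionally, the specified duration condition includes, but is not limited to, at least one of the following:
Greater than a preset duration threshold: the running duration of a task to be monitored must exceed the preset duration threshold for the task to be treated as running slowly;
Exceeding the average running duration over a specified period by a specified proportion: the running duration of a task to be monitored must exceed the average running duration over the specified period by the specified proportion for the task to be treated as running slowly.
The duration threshold can be set according to the application scenario, task attributes, and so on; for example, it can be 1 hour, 30 minutes, or 2 hours. Likewise, the specified period and the specified proportion can be set according to the application scenario, task attributes, and so on; for example, the specified period can be 10 days, 15 days, or 1 month, and the specified proportion can be 30%, 20%, or 15%, or even a range such as 15% to 30%.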
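The abnormal-state conditions above can be sketched as a small classifier. This is a minimal Python illustration, not the application's implementation: the function name, the status strings `"ok"` and `"error"`, and the default values (a 1-hour threshold and a 30% proportion, both taken from the examples in the text) are assumptions.

```python
from statistics import mean


def is_abnormal(status, runtime_min, history_min,
                max_runtime_min=60, slow_ratio=0.30):
    """Classify one run of a monitored task.

    status          -- "ok" or "error" (hypothetical status strings)
    runtime_min     -- running duration of the current run, in minutes
    history_min     -- durations of earlier runs within the specified period
    max_runtime_min -- preset duration threshold (1 hour here)
    slow_ratio      -- specified proportion over the average (30% here)

    Returns "error", "slow", or None when the run looks normal.
    """
    if status == "error":
        return "error"
    # Condition 1: running duration exceeds the preset duration threshold.
    if runtime_min > max_runtime_min:
        return "slow"
    # Condition 2: running duration exceeds the historical average
    # by more than the specified proportion.
    if history_min and runtime_min > mean(history_min) * (1 + slow_ratio):
        return "slow"
    return None
```

For a task that usually takes 10 minutes, a 14-minute run is flagged as slow (more than 30% above the average), while a 12-minute run is not.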
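Through the above operations, the abnormal tasks in the task scheduling system can be determined. An abnormal task is a task in which an abnormality has occurred, so it needs to be rerun. Moreover, because the reference task depends on the abnormal task and must complete before its reference completion time, the abnormal task cannot be rerun at an arbitrary time: the rerun must start before a certain latest time so that the reference task that depends on it can still complete before the reference completion time. On this basis, the abnormality monitoring apparatus can determine the latest start time for rerunning the abnormal task according to the preset reference completion time of the reference task.
Specifically, the abnormality monitoring apparatus can work backward from the dependency between the reference task and the abnormal task, the reference completion time of the reference task, the average running duration of the reference task, and the average running duration of the abnormal task to determine the latest start time for rerunning the abnormal task.
As an example, suppose a task scheduling system contains the tasks and dependencies shown in FIG. 3. The task scheduling system includes task A, task B, task C, task D, task E, and task F. Task B is a direct downstream task of task A; task C, task D, and task E are each direct downstream tasks of task B; and task F is a direct downstream task of task E. In addition, in the task scheduling system shown in FIG. 3, task C and task D are set as one group of reference tasks with a reference completion time of 6:00, which means that task C and task D both need to complete before 6:00. Task E and task F are set as another group of reference tasks with a reference completion time of 5:00, which means that task E and task F both need to complete before 5:00.
Besides the above information, the average running duration of each task is also known: 0.5 hours for task E, 20 minutes for task F, 1.5 hours for task C, 2 hours for task D, 2 hours for task B, and 10 minutes for task A.
Suppose task A is found to be an abnormal task. The abnormality monitoring apparatus can then use the known information above to work backward along the dependencies, starting from the reference tasks. It first determines the latest completion time of task B, the downstream task of abnormal task A, and then uses the latest completion time of task B to determine the latest start time for rerunning abnormal task A.
Specifically, for task E and task F, task F runs after task E, so for both to complete before the reference completion time, the latest start time is the reference completion time of task E and task F minus the average running durations of task E and task F, that is, 5:00 - 20 minutes - 0.5 hours = 4:10. This latest start time is also the latest completion time of task B as calculated from task E and task F, namely 4:10;
For task C, to complete before the reference completion time, its latest start time is the reference completion time of task C minus the average running duration of task C, that is, 6:00 - 1.5 hours = 4:30. This latest start time is also the latest completion time of task B as calculated from task C, namely 4:30;
For task D, to complete before the reference completion time, its latest start time is the reference completion time of task D minus the average running duration of task D, that is, 6:00 - 2 hours = 4:00. This latest start time is also the latest completion time of task B as calculated from task D, namely 4:00;
Taking the earliest of these three values, the latest completion time of task B is 4:00;
Next, because task B needs to complete before 4:00, its latest start time is the latest completion time of task B minus the average running duration of task B, that is, 4:00 - 2 hours = 2:00. The latest start time of task B is also the latest completion time of task A;
Because task A needs to complete before 2:00, its latest start time is the latest completion time of task A minus the average running duration of task A, that is, 2:00 - 10 minutes = 1:50.
Of course, if the current time is known, the time margin of task A can also be calculated, that is, the difference between the latest start time of task A and the current time. For example, if the current time is 1:00, the time margin of task A is 50 minutes.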
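The backward calculation in the FIG. 3 example can be reproduced with a short Python sketch. The dependency edges, average running durations, and reference completion times come from the example above; the function names and the representation of times as minutes after midnight are assumptions made for illustration.

```python
# FIG. 3 example, with times expressed in minutes after midnight.
DOWNSTREAM = {"A": ["B"], "B": ["C", "D", "E"], "C": [], "D": [], "E": ["F"], "F": []}
AVG_RUNTIME = {"A": 10, "B": 120, "C": 90, "D": 120, "E": 30, "F": 20}
REFERENCE_DEADLINE = {"C": 6 * 60, "D": 6 * 60, "E": 5 * 60, "F": 5 * 60}


def latest_finish(task, cache=None):
    """Latest time `task` may finish without delaying any reference task.

    Returns None when no reference task depends on `task`.
    """
    if cache is None:
        cache = {}
    if task not in cache:
        limits = []
        # A reference task is bounded by its own reference completion time.
        if task in REFERENCE_DEADLINE:
            limits.append(REFERENCE_DEADLINE[task])
        # Every task must finish before each direct downstream task has to start.
        for child in DOWNSTREAM[task]:
            child_start = latest_start(child, cache)
            if child_start is not None:
                limits.append(child_start)
        cache[task] = min(limits) if limits else None
    return cache[task]


def latest_start(task, cache=None):
    """Latest finish time minus the task's average running duration."""
    finish = latest_finish(task, cache)
    return None if finish is None else finish - AVG_RUNTIME[task]


def hhmm(minutes):
    """Format minutes after midnight as H:MM."""
    return f"{minutes // 60}:{minutes % 60:02d}"


print(hhmm(latest_start("A")))  # 1:50
```

The sketch takes the minimum over all downstream constraints, which is why task B's latest completion time is 4:00 (from task D) rather than 4:10 or 4:30. With a current time of 1:00, the time margin for task A is `latest_start("A") - 60`, or 50 minutes.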
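Once the latest start time for rerunning the abnormal task has been determined, the abnormality monitoring apparatus can flexibly perform alarm processing on the abnormal task according to that latest start time and the current time.
For example, if the latest start time is close to the current time, alarm processing can be performed on the abnormal task immediately so that the abnormal task can be handled in time. If the latest start time is far from the current time, alarm processing can be performed somewhat later so that the alarm is raised at a reasonable time, which disturbs the user less and reduces unnecessary alarms.
The key to performing alarm processing on an abnormal task is determining the abnormality alarm time. The abnormality monitoring apparatus determines the abnormality alarm time mainly according to the latest start time for rerunning the abnormal task and the current time, and then performs alarm processing on the abnormal task when the abnormality alarm time arrives.
The latest start time for rerunning the abnormal task and the current time are the main factors affecting the abnormality alarm time, but there are other factors as well, such as the time period in which alarms must be raised promptly and the anomaly type of the abnormal task. For some application scenarios, a time range in which alarms must be raised promptly can be specified in advance; this is referred to as the specified time range. The specified time range can be working hours, such as 9:00 to 20:00.
Based on the above, the abnormality monitoring apparatus can determine whether the current time is within the specified time range. If so, the current time is used as the abnormality alarm time and alarm processing is performed on the abnormal task when the abnormality alarm time arrives, that is, immediately. If not, the abnormality alarm time can be determined according to the anomaly type of the abnormal task and the latest start time for rerunning the abnormal task, and alarm processing is performed on the abnormal task when the abnormality alarm time arrives.
Optionally, take the case where the anomaly types of an abnormal task include running error and slower running speed.
If the anomaly type of the abnormal task is a running error, it can be determined whether the latest start time for rerunning the abnormal task is later than a preset first time. If so, a second time that is later than the current time but earlier than the first time is set as the abnormality alarm time. If not, that is, the latest start time for rerunning the abnormal task is earlier than or equal to the preset first time, the current time is set as the abnormality alarm time and alarm processing is performed on the abnormal task immediately. Performing alarm processing only when the second time arrives amounts to a delayed alarm: it helps avoid the user's rest time, disturbs the user less, and in the long run widens the interval between two alarms, which reduces the number of alarms and saves resources. Using the current time as the abnormality alarm time gives a timely alarm and avoids the problems caused by late alarms.
Note that this embodiment does not limit the values of the first time and the second time; they can be set according to the application scenario. For example, the preset first time can be 11:00, and accordingly, if the current time is before 9:00, the second time can be, but is not limited to, 9:00.
If the anomaly type of the abnormal task is slower running speed, it can be determined whether the difference between the latest start time for rerunning the abnormal task and the current time is greater than a preset time-difference threshold. If so, a third time that precedes the latest start time for rerunning the abnormal task by the time-difference threshold is set as the abnormality alarm time. If not, that is, the difference is less than or equal to the preset time-difference threshold, the current time is set as the abnormality alarm time. Using the third time as the abnormality alarm time amounts to a delayed alarm: it helps avoid the user's rest time, disturbs the user less, and in the long run widens the interval between two alarms, which reduces the number of alarms and saves resources. Using the current time as the abnormality alarm time gives a timely alarm and avoids the problems caused by late alarms.
Note that this embodiment does not limit the value of the time-difference threshold; it can be set according to the application scenario. For example, the time-difference threshold can be, but is not limited to, 2 hours.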
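The rules for choosing the abnormality alarm time can be summarized in a Python sketch. It is a sketch under assumptions rather than the application's implementation: the function name and the labels `"error"` and `"slow"` are hypothetical, the defaults (9:00 to 20:00, first time 11:00, second time 9:00, threshold 2 hours) are the example values from the text, and the first and second times are assumed to fall on the calendar day of the latest start time.

```python
from datetime import datetime, time, timedelta


def alarm_time(now, latest_start, anomaly_type,
               range_start=time(9, 0), range_end=time(20, 0),
               first_time=time(11, 0), second_time=time(9, 0),
               gap_threshold=timedelta(hours=2)):
    """Pick when to raise the alarm for one abnormal task.

    now, latest_start -- datetime objects
    anomaly_type      -- "error" or "slow" (hypothetical labels)
    """
    # Inside the specified time range: alarm immediately.
    if range_start <= now.time() <= range_end:
        return now

    if anomaly_type == "error":
        # Assumption: first/second time are on the day of the latest start time.
        first = datetime.combine(latest_start.date(), first_time)
        second = datetime.combine(latest_start.date(), second_time)
        # Delay only if the second time really lies between now and the first time.
        if latest_start > first and now < second < first:
            return second
        return now

    if anomaly_type == "slow":
        # Delay until `gap_threshold` before the latest start time, if that is later than now.
        if latest_start - now > gap_threshold:
            return latest_start - gap_threshold
        return now

    raise ValueError(f"unknown anomaly type: {anomaly_type}")
```

For instance, a running error detected at 2:00 with a latest rerun start of 12:00 is reported at 9:00, whereas the same error with a latest rerun start of 10:00 is reported immediately.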
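Further, the alarm recipients and alarm methods can be set in advance. An alarm recipient is mainly the person responsible for handling the abnormal task; for example, alarm recipients can be configured in a duty roster. The alarm method includes at least one of the following: voice alarm, SMS alarm, email alarm, alarm light, and instant messaging alarm. On this basis, performing alarm processing on an abnormal task specifically means alerting the responsible person by the configured alarm method according to the pre-configured duty roster, for example by sending an SMS or email to the responsible person's terminal device, or by giving the responsible person a voice prompt.
As can be seen from the above, the abnormality monitoring apparatus can flexibly determine the abnormality alarm time according to the latest start time for rerunning the abnormal task and the current time, which helps perform alarm processing on the abnormal task at an appropriate time rather than necessarily when a pre-configured alarm time arrives as in the prior art. This is more flexible: alarms can be raised promptly while unnecessary alarms are reduced, which helps lower the probability of late or unnecessary alarms and improves alarm accuracy. It is an intelligent alarm scheme.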
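FIG. 4 is a schematic structural diagram of an abnormality monitoring apparatus provided by yet another embodiment of the present application. As shown in FIG. 4, the apparatus includes an abnormal task determination module 41, a latest time determination module 42, and an alarm processing module 43.
The abnormal task determination module 41 is configured to determine an abnormal task in the task scheduling system according to a reference task preset in the task scheduling system.
The latest time determination module 42 is configured to determine the latest start time for rerunning the abnormal task according to a preset reference completion time of the reference task.
The alarm processing module 43 is configured to perform alarm processing on the abnormal task according to the latest start time for rerunning the abnormal task and the current time.
In an optional implementation, as shown in FIG. 5, one implementation structure of the abnormal task determination module 41 includes a monitored task determination unit 411 and an abnormal task obtaining unit 412.
The monitored task determination unit 411 is configured to take the tasks in the task scheduling system on which the reference task depends as the tasks to be monitored;
The abnormal task obtaining unit 412 is configured to obtain, from the tasks to be monitored, the tasks whose running state is abnormal as abnormal tasks.
Further, the abnormal task obtaining unit 412 is specifically configured to perform at least one of the following operations:
obtaining, from the tasks to be monitored, the tasks that failed while running as abnormal tasks;
obtaining, from the tasks to be monitored, the tasks whose running speed has dropped as abnormal tasks.
Still further, when obtaining the tasks whose running speed has dropped as abnormal tasks, the abnormal task obtaining unit 412 is specifically configured to:
obtain, from the tasks to be monitored, the tasks whose running duration meets a specified duration condition as abnormal tasks, where the specified duration condition includes at least one of the following:
being greater than a preset duration threshold;
exceeding the average running duration over a specified period by a specified proportion.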
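In an optional implementation, as shown in FIG. 5, one implementation structure of the alarm processing module 43 includes a first alarm processing unit 431 and a second alarm processing unit 432.
The first alarm processing unit 431 is configured to perform alarm processing on the abnormal task immediately when the current time is within the specified time range.
The second alarm processing unit 432 is configured to, when the current time is not within the specified time range, determine the abnormality alarm time according to the anomaly type of the abnormal task and the latest start time for rerunning the abnormal task, and perform alarm processing on the abnormal task when the abnormality alarm time arrives.
Further, the second alarm processing unit 432 is specifically configured to:
if the anomaly type of the abnormal task is a running error, set a second time that is later than the current time but earlier than a preset first time as the abnormality alarm time when the latest start time for rerunning the abnormal task is later than the first time, or set the current time as the abnormality alarm time when the latest start time for rerunning the abnormal task is earlier than or equal to the first time;
if the anomaly type of the abnormal task is slower running speed, set a third time that precedes the latest start time for rerunning the abnormal task by a preset time-difference threshold as the abnormality alarm time when the difference between that latest start time and the current time is greater than the time-difference threshold, or set the current time as the abnormality alarm time when the difference is less than or equal to the time-difference threshold.
During task scheduling, the abnormality monitoring apparatus provided in this embodiment determines an abnormal task according to a preset reference task, then determines the latest start time for rerunning the abnormal task according to the preset reference completion time of the reference task, and performs alarm processing on the abnormal task according to that latest start time and the current time, rather than necessarily when a pre-configured alarm time arrives as in the prior art. This is more flexible, helps reduce the probability of late or unnecessary alarms, and improves alarm accuracy.
In addition, with the abnormality monitoring apparatus provided in this embodiment, only the reference task and its reference completion time need to be preset. The apparatus can derive all upstream tasks of the reference task from the dependencies between the reference task and the other tasks in the task scheduling system and then monitor all of them automatically, instead of configuring trigger conditions, alarm times, and so on for every upstream task as in the prior art. This approach needs little configuration yet monitors a wide range, and is especially suitable for task scheduling systems with a large number of tasks.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium. When run, the program performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some or all of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.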

Claims (12)

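  1. An abnormality monitoring method, characterized by comprising:
    determining an abnormal task in a task scheduling system according to a reference task preset in the task scheduling system;
    determining the latest start time for rerunning the abnormal task according to a preset reference completion time of the reference task;
    performing alarm processing on the abnormal task according to the latest start time for rerunning the abnormal task and the current time.
  2. The method according to claim 1, characterized in that determining an abnormal task in the task scheduling system according to a reference task preset in the task scheduling system comprises:
    taking the tasks in the task scheduling system on which the reference task depends as tasks to be monitored;
    obtaining, from the tasks to be monitored, a task whose running state is abnormal as the abnormal task.
  3. The method according to claim 2, characterized in that obtaining, from the tasks to be monitored, a task whose running state is abnormal as the abnormal task comprises at least one of the following operations:
    obtaining, from the tasks to be monitored, a task that failed while running as the abnormal task;
    obtaining, from the tasks to be monitored, a task whose running speed has dropped as the abnormal task.
  4. The method according to claim 3, characterized in that obtaining, from the tasks to be monitored, a task whose running speed has dropped as the abnormal task comprises:
    obtaining, from the tasks to be monitored, a task whose running duration meets a specified duration condition as the abnormal task, wherein the specified duration condition includes at least one of the following:
    being greater than a preset duration threshold;
    exceeding the average running duration over a specified period by a specified proportion.
  5. The method according to any one of claims 1 to 4, characterized in that performing alarm processing on the abnormal task according to the latest start time for rerunning the abnormal task and the current time comprises:
    if the current time is within a specified time range, performing alarm processing on the abnormal task immediately;
    if the current time is not within the specified time range, determining an abnormality alarm time according to the anomaly type of the abnormal task and the latest start time for rerunning the abnormal task, and performing alarm processing on the abnormal task when the abnormality alarm time arrives.
  6. The method according to claim 5, characterized in that determining an abnormality alarm time according to the anomaly type of the abnormal task and the latest start time for rerunning the abnormal task comprises:
    if the anomaly type of the abnormal task is a running error, setting a second time that is later than the current time but earlier than a preset first time as the abnormality alarm time when the latest start time for rerunning the abnormal task is later than the first time, or setting the current time as the abnormality alarm time when the latest start time for rerunning the abnormal task is earlier than or equal to the first time;
    if the anomaly type of the abnormal task is slower running speed, setting a third time that precedes the latest start time for rerunning the abnormal task by a preset time-difference threshold as the abnormality alarm time when the difference between the latest start time for rerunning the abnormal task and the current time is greater than the time-difference threshold, or setting the current time as the abnormality alarm time when the difference between the latest start time for rerunning the abnormal task and the current time is less than or equal to the time-difference threshold.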
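  7. An abnormality monitoring apparatus, characterized by comprising:
    an abnormal task determination module, configured to determine an abnormal task in a task scheduling system according to a reference task preset in the task scheduling system;
    a latest time determination module, configured to determine the latest start time for rerunning the abnormal task according to a preset reference completion time of the reference task;
    an alarm processing module, configured to perform alarm processing on the abnormal task according to the latest start time for rerunning the abnormal task and the current time.
  8. The apparatus according to claim 7, characterized in that the abnormal task determination module comprises:
    a monitored task determination unit, configured to take the tasks in the task scheduling system on which the reference task depends as tasks to be monitored;
    an abnormal task obtaining unit, configured to obtain, from the tasks to be monitored, a task whose running state is abnormal as the abnormal task.
  9. The apparatus according to claim 8, characterized in that the abnormal task obtaining unit is specifically configured to perform at least one of the following operations:
    obtaining, from the tasks to be monitored, a task that failed while running as the abnormal task;
    obtaining, from the tasks to be monitored, a task whose running speed has dropped as the abnormal task.
  10. The apparatus according to claim 9, characterized in that the abnormal task obtaining unit is specifically configured to:
    obtain, from the tasks to be monitored, a task whose running duration meets a specified duration condition as the abnormal task, wherein the specified duration condition includes at least one of the following:
    being greater than a preset duration threshold;
    exceeding the average running duration over a specified period by a specified proportion.
  11. The apparatus according to any one of claims 7 to 10, characterized in that the alarm processing module comprises:
    a first alarm processing unit, configured to perform alarm processing on the abnormal task immediately when the current time is within a specified time range;
    a second alarm processing unit, configured to, when the current time is not within the specified time range, determine an abnormality alarm time according to the anomaly type of the abnormal task and the latest start time for rerunning the abnormal task, and perform alarm processing on the abnormal task when the abnormality alarm time arrives.
  12. The apparatus according to claim 11, characterized in that the second alarm processing unit is specifically configured to:
    if the anomaly type of the abnormal task is a running error, set a second time that is later than the current time but earlier than a preset first time as the abnormality alarm time when the latest start time for rerunning the abnormal task is later than the first time, or set the current time as the abnormality alarm time when the latest start time for rerunning the abnormal task is earlier than or equal to the first time;
    if the anomaly type of the abnormal task is slower running speed, set a third time that precedes the latest start time for rerunning the abnormal task by a preset time-difference threshold as the abnormality alarm time when the difference between the latest start time for rerunning the abnormal task and the current time is greater than the time-difference threshold, or set the current time as the abnormality alarm time when the difference between the latest start time for rerunning the abnormal task and the current time is less than or equal to the time-difference threshold.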
PCT/CN2017/076891 2016-03-28 2017-03-16 Abnormality monitoring method and apparatus WO2017167021A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610184288.1 2016-03-28
CN201610184288.1A CN107241205A (zh) 2016-03-28 2016-03-28 Abnormality monitoring method and apparatus

Publications (1)

Publication Number Publication Date
WO2017167021A1 true WO2017167021A1 (zh) 2017-10-05

Family

ID=59963429

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/076891 WO2017167021A1 (zh) 2016-03-28 2017-03-16 Abnormality monitoring method and apparatus

Country Status (3)

Country Link
CN (1) CN107241205A (zh)
TW (1) TW201737084A (zh)
WO (1) WO2017167021A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
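CN108011782A (zh) * 2017-12-06 2018-05-08 北京百度网讯科技有限公司 Method and apparatus for pushing alarm information
CN110113201A (zh) * 2019-04-30 2019-08-09 平安科技(深圳)有限公司 Monitoring data processing method and apparatus, and monitoring system
CN110348718A (zh) * 2019-06-28 2019-10-18 北京淇瑀信息科技有限公司 Financial service indicator monitoring method and apparatus, and electronic device
CN111010292A (zh) * 2019-11-26 2020-04-14 苏宁云计算有限公司 Offline task delay alarm system and method, and computer system
CN112817686A (zh) * 2019-11-15 2021-05-18 北京百度网讯科技有限公司 Method, apparatus, and device for detecting virtual machine anomalies, and computer storage medium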

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
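CN110245127A (zh) * 2019-06-12 2019-09-17 成都九洲电子信息系统股份有限公司 Data migration method based on process control
CN111324650A (zh) * 2020-02-16 2020-06-23 广州信安数据有限公司 Method for real-time evaluation and early warning of task processing efficiency, computer-readable storage medium, and enterprise data management system
CN111427748B (zh) * 2020-03-31 2023-06-23 携程计算机技术(上海)有限公司 Task alarm method, system, device, and storage medium
CN111858065B (zh) * 2020-07-28 2023-02-03 中国平安财产保险股份有限公司 Data processing method, device, storage medium, and apparatus
CN112328377B (zh) * 2020-11-04 2022-04-19 北京字节跳动网络技术有限公司 Baseline monitoring method and apparatus, readable medium, and electronic device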

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070283351A1 (en) * 2006-05-31 2007-12-06 Degenaro Louis R Unified job processing of interdependent heterogeneous tasks
CN101110041A (zh) * 2007-08-23 2008-01-23 南京联创科技股份有限公司 Method for group task management
CN101425024A (zh) * 2008-10-24 2009-05-06 中国移动通信集团山东有限公司 Multi-task processing method and apparatus
CN102004973A (zh) * 2010-12-30 2011-04-06 用友软件股份有限公司 Task formulation method and apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103034554B (zh) * 2012-12-30 2015-11-18 焦点科技股份有限公司 ETL scheduling system and method with error-correcting restart and automatically determined startup

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
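CN108011782A (zh) * 2017-12-06 2018-05-08 北京百度网讯科技有限公司 Method and apparatus for pushing alarm information
CN108011782B (zh) * 2017-12-06 2020-10-16 北京百度网讯科技有限公司 Method and apparatus for pushing alarm information
CN110113201A (zh) * 2019-04-30 2019-08-09 平安科技(深圳)有限公司 Monitoring data processing method and apparatus, and monitoring system
CN110113201B (zh) * 2019-04-30 2022-12-23 平安科技(深圳)有限公司 Monitoring data processing method and apparatus, and monitoring system
CN110348718A (zh) * 2019-06-28 2019-10-18 北京淇瑀信息科技有限公司 Financial service indicator monitoring method and apparatus, and electronic device
CN110348718B (zh) * 2019-06-28 2023-11-14 北京淇瑀信息科技有限公司 Service indicator monitoring method and apparatus, and electronic device
CN112817686A (zh) * 2019-11-15 2021-05-18 北京百度网讯科技有限公司 Method, apparatus, and device for detecting virtual machine anomalies, and computer storage medium
CN112817686B (zh) * 2019-11-15 2023-07-25 北京百度网讯科技有限公司 Method, apparatus, and device for detecting virtual machine anomalies, and computer storage medium
CN111010292A (zh) * 2019-11-26 2020-04-14 苏宁云计算有限公司 Offline task delay alarm system and method, and computer system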

Also Published As

Publication number Publication date
CN107241205A (zh) 2017-10-10
TW201737084A (zh) 2017-10-16

Similar Documents

Publication Publication Date Title
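WO2017167021A1 (zh) Abnormality monitoring method and apparatus
US9720761B2 (en) System fault detection and processing method, device, and computer readable storage medium
CN110661659B (zh) Alarm method, apparatus, and system, and electronic device
CN110309024B (zh) Data processing system and method for executing data processing tasks thereby
CN103034554B (zh) ETL scheduling system and method with error-correcting restart and automatically determined startup
EP2940596B1 (en) Data acquisition method and device
EP2490090A1 (en) Wind turbine generator system fault processing method and system
CN107878366B (zh) Method and apparatus for controlling the working state of a new energy vehicle controller
EP3932025B1 (en) Computing resource scheduling method, scheduler, internet of things system, and computer readable medium
CN112631761B (zh) Task scheduling monitoring method and apparatus
US11770199B2 (en) Traffic data self-recovery processing method, readable storage medium, server and apparatus
CN106598740B (zh) System and method for limiting the CPU utilization of a multithreaded program
WO2018149396A1 (zh) Service process handling method, readable storage medium, terminal device, and apparatus
WO2019047565A1 (zh) Task processing method and apparatus, computer device, and storage medium
TW201737215A (zh) Abnormality monitoring and alarm method and apparatus
CN102521098A (zh) Method and apparatus for handling CPU crash monitoring
CN110224848B (zh) Alarm broadcasting method and apparatus
CN110941525A (zh) Data backlog early-warning method and apparatus
CN112200505A (zh) Cross-service-system process monitoring apparatus and method, and corresponding device and storage medium
WO2022247219A1 (zh) Information backup method, device, and platform
CN109947015B (zh) Task execution method and main controller
CN115934480B (zh) Task monitoring method, system, and apparatus, and computer-readable storage medium
CN104765648A (zh) Method and apparatus for detecting problem nodes based on a real-time computing system
CN101662382A (zh) Method and system for suppressing the reporting of oscillating alarms in a network management system
CN115629903A (zh) Task delay monitoring method, apparatus, device, and storage medium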

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17773051

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17773051

Country of ref document: EP

Kind code of ref document: A1