CN116483663A - Abnormality warning method and device for platform - Google Patents

Abnormality warning method and device for platform

Info

Publication number
CN116483663A
CN116483663A
Authority
CN
China
Prior art keywords
platform
information
data processing
abnormal
abnormality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310528732.7A
Other languages
Chinese (zh)
Inventor
魏海
王永伟
丁冬超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310528732.7A
Publication of CN116483663A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F 11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F 11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/30 Monitoring
    • G06F 11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the present specification provide an abnormality warning method and device for a platform. In the abnormality warning method, if abnormality information indicating that a data processing task running on the platform is abnormal is monitored, an abnormality cause is determined; whether a condition for alerting a user corresponding to the data processing task is met is judged according to the abnormality cause; and if the condition for alerting the user corresponding to the data processing task is not met, alarm suppression is performed on the abnormality information.

Description

Abnormality warning method and device for platform
Technical Field
Embodiments of the present disclosure relate generally to the field of intelligent operation and maintenance technologies, and in particular, to an anomaly alarm method and apparatus for a platform.
Background
With the rapid development of internet and big data technologies, various platforms (such as computing platforms, development and operation platforms, and data service platforms) have also developed rapidly. With the help of such platforms, various data processing tasks can be executed efficiently. In order to discover and handle abnormal data processing tasks in time, a suitable task alerting mechanism is required to accurately notify users of task anomalies.
Disclosure of Invention
In view of the foregoing, embodiments of the present specification provide an anomaly alerting method and apparatus for a platform. With the method and apparatus, unnecessary alarms sent to the user can be reduced and the effectiveness of the alarms improved.
According to an aspect of embodiments of the present specification, there is provided an abnormality alert method for a platform, including: determining an abnormality cause in response to monitoring abnormality information indicating that an abnormality occurs in a data processing task running on the platform; judging, according to the abnormality cause, whether a condition for alerting a user corresponding to the data processing task is met; and if the condition for alerting the user corresponding to the data processing task is not met, performing alarm suppression on the abnormality information.
According to another aspect of embodiments of the present specification, there is provided an abnormality alert device for a platform, including: an abnormality cause determination unit configured to determine an abnormality cause in response to monitoring abnormality information indicating that an abnormality occurs in a data processing task running on the platform; an alarm condition judging unit configured to judge, according to the abnormality cause, whether a condition for alerting a user corresponding to the data processing task is met; and an alarm suppression unit configured to perform alarm suppression on the abnormality information if the condition for alerting the user corresponding to the data processing task is not met.
According to still another aspect of embodiments of the present specification, there is provided an abnormality alert device for a platform, including: at least one processor, and a memory coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the anomaly alert method for a platform as described above.
According to another aspect of the embodiments of the present specification, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the abnormality alert method for a platform as described above.
According to another aspect of embodiments of the present specification, there is provided a computer program product comprising a computer program that is executed by a processor to implement the anomaly alert method for a platform as described above.
Drawings
A further understanding of the nature and advantages of the present description may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.
FIG. 1 illustrates an exemplary architecture of an anomaly alert method and apparatus for a platform according to an embodiment of the present specification.
Fig. 2 shows a flowchart of one example of an anomaly alert method for a platform according to an embodiment of the present specification.
Fig. 3 is a flowchart showing an example of a process of judging whether a condition for alerting a user corresponding to a data processing task is satisfied according to an embodiment of the present specification.
Fig. 4 is a flowchart showing still another example of a process of judging whether a condition for alerting a user corresponding to a data processing task is satisfied according to an embodiment of the present specification.
Fig. 5 shows a signaling diagram of still another example of an anomaly alert method for a platform according to an embodiment of the present specification.
FIG. 6 illustrates a block diagram of one example of an anomaly alerting device for a platform in accordance with an embodiment of the present specification.
Fig. 7 is a block diagram showing one example of an alarm condition judging unit in the abnormality alarm device for a platform according to an embodiment of the present specification.
Fig. 8 is a block diagram showing still another example of the alarm condition judging unit in the abnormality alarm device for a platform according to the embodiment of the present specification.
FIG. 9 shows a block diagram of one example of an anomaly alerting device for a platform in accordance with an embodiment of the present specification.
FIG. 10 illustrates a block diagram of yet another example of an anomaly alerting device for a platform in accordance with an embodiment of the present specification.
FIG. 11 shows a schematic diagram of an anomaly alerting device for a platform of an embodiment of the present specification.
Detailed Description
The subject matter described herein will be discussed below with reference to example embodiments. It should be appreciated that these embodiments are discussed only to enable a person skilled in the art to better understand and thereby practice the subject matter described herein, and are not limiting of the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the embodiments herein. Various examples may omit, replace, or add various procedures or components as desired. In addition, features described with respect to some examples may be combined in other examples as well.
As used herein, the term "comprising" and variations thereof are open-ended terms, meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other definitions, whether explicit or implicit, may be included below. Unless the context clearly indicates otherwise, the definition of a term is consistent throughout this specification.
In this specification, the term "data processing task running on a platform" may include various data processing tasks depending on resources of the platform. In one example, a user of the platform may provide data to be processed to the platform. The platform can process data according to various predefined or user-defined data processing modes by means of computing resources, storage resources and the like, so as to obtain data processing results, and the data processing results are fed back to the user of the platform. It can be understood that whether the "data processing task running on the platform" can be successfully executed depends on the data provided by the user of the platform and the designated data processing mode, and also depends on whether the platform itself is normally running.
An abnormality alert method and apparatus for a platform according to embodiments of the present specification will be described in detail with reference to the accompanying drawings.
FIG. 1 illustrates an exemplary architecture 100 of an anomaly alert method and apparatus for a platform according to an embodiment of the present specification.
In Fig. 1, a network 110 interconnects the terminal devices 121 and 122, the platform 130, and the abnormality alert device 150. A network 160 interconnects the platform 130, the maintenance end device 140, and the abnormality alert device 150.
The networks 110 and 160 may be any type of network capable of interconnecting network entities, and each may be a single network or a combination of networks. In terms of coverage, the networks 110 and 160 may be local area networks (LANs), wide area networks (WANs), and the like. In terms of carrier medium, the networks 110 and 160 may be wired networks, wireless networks, etc. In terms of data switching technology, the networks 110 and 160 may be circuit-switched networks, packet-switched networks, or the like.
Terminal devices 121, 122 may be any type of electronic computing device capable of connecting to network 110, accessing servers or websites on network 110, processing data or signals, and the like. For example, the terminal devices 121, 122 may be desktop computers, notebook computers, tablet computers, smart phones, etc. It will be appreciated that there may be a different number of terminal devices connected to the network 110.
In one embodiment, the terminal devices 121, 122 may be used by a user. Terminal devices 121, 122 may interact with platform 130. For example, the terminal devices 121, 122 may transmit a message input by a user to the platform 130 and receive a response associated with the above-described message from the platform 130. Herein, a "message" may refer to any input information, such as raw data from user input. Accordingly, the response associated with the above-described message may also refer to various information such as a data processing result corresponding to the input original data.
Platform 130 may be any of a variety of platforms capable of large-scale data processing. In one example, the platform 130 may be a high-performance computing (HPC) platform. In one example, the platform 130 may be a cloud computing platform such as ODPS (Open Data Processing Service). In one example, the platform 130 may be a data platform such as HBase or Lindorm. In one example, the platform 130 may be any of a variety of development and operation platforms. In one example, the platform 130 may include a compute engine 131, a scheduling system 132, a storage engine 133, and the like. The compute engine 131 may perform various data computing tasks. The scheduling system 132 may be responsible for scheduling various resources. The storage engine 133 may perform various database operations.
The maintenance end device 140 may be used by the operation and maintenance personnel of the platform 130. The maintenance end device 140 may interact with the platform 130 to support normal operation of the computing engine 131, the scheduling system 132, the storage engine 133, etc. on the platform 130.
The abnormality alert device 150 may be used for performing an abnormality alert for a data processing task running on the platform 130, and may send alert information to a designated device according to an abnormality cause, so as to improve the pertinence of the alert. Alternatively, the abnormality alert device 150 may be integrated with the platform 130. For a specific description of the abnormality alert device 150, reference may be made to the following description of the various embodiments.
It should be appreciated that all network entities shown in fig. 1 are exemplary and that any other network entity may be involved in architecture 100, depending on the particular application requirements.
FIG. 2 illustrates a flow chart of an anomaly alert method 200 for a platform according to an embodiment of the present specification.
As shown in FIG. 2, at 210, a cause of an anomaly is determined in response to monitoring anomaly information indicating an anomaly in a data processing task running on a platform.
In this embodiment, it may be determined whether abnormality information indicating that a data processing task running on the platform is abnormal has been monitored. If such abnormality information is monitored, the cause of the abnormality indicated by the abnormality information can be determined.
In the field of intelligent operation and maintenance technology, various monitoring tasks for monitoring data processing tasks running on a platform can be generally predefined. If the monitoring task is abnormal, abnormal information for indicating that the data processing task is abnormal can be generated. In one example, the anomaly information may be used to indicate that the monitored custom task triggered an alarm. In one example, the exception information may be used to indicate that cluster resource usage is abnormal (e.g., memory usage or CPU usage is greater than a corresponding preset threshold). In one example, the anomaly information may be used to indicate that the number of instances that successfully run within a time window (e.g., per hour) fluctuates by more than a preset magnitude threshold. In one example, a monitoring task may be directly created to monitor a data processing task running on a platform, so that when an abnormality occurs in the monitoring task, abnormality information indicating that the abnormality occurs in the data processing task may be generated. In one example, the monitoring tasks may be created and performed by an alarm system of the platform itself. Therefore, whether the alarm system of the platform generates abnormal information for indicating that the data processing task running on the platform is abnormal or not can be monitored.
In one example, the monitoring tasks may include a baseline (BaseLine) monitoring task. For a task added to the baseline, if it is judged, according to the task's running status, that the estimated completion time of the baseline task may exceed a predetermined committed time, abnormality information indicating that the baseline task is abnormal may be generated. In one example, the monitoring tasks may include a data verification monitoring task. If the data content is found not to meet the data requirements during execution of the data processing task, abnormality information indicating that the data verification monitoring task is abnormal may be generated. The data requirements may include, for example, that the data table is not empty, that the primary key is not repeated, that the character types are all acceptable types, and the like. In one example, the above monitoring tasks may perform monitoring by timed polling (e.g., at 23:00 every day). In one example, the above monitoring tasks may perform monitoring in response to a trigger condition being met.
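As a rough illustration of the kinds of checks described above, the following Python sketch tests cluster resource usage against preset thresholds and flags an excessive fluctuation in the number of successfully run instances per time window. The thresholds and function names are illustrative assumptions and are not prescribed by this specification.

```python
# Illustrative sketch only: thresholds and names are assumptions.
def resource_usage_abnormal(mem_usage: float, cpu_usage: float,
                            mem_threshold: float = 0.9,
                            cpu_threshold: float = 0.9) -> bool:
    """Flag an anomaly when memory or CPU usage exceeds its preset threshold."""
    return mem_usage > mem_threshold or cpu_usage > cpu_threshold

def instance_count_abnormal(count_this_window: int, count_last_window: int,
                            max_fluctuation: float = 0.3) -> bool:
    """Flag an anomaly when the per-window count of successfully run instances
    fluctuates by more than a preset magnitude threshold."""
    if count_last_window == 0:
        return count_this_window > 0
    change = abs(count_this_window - count_last_window) / count_last_window
    return change > max_fluctuation
```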
In one example, the abnormality cause may be determined from the abnormality information monitored above. In one example, the abnormality cause may be used to indicate whether the platform caused the alarm. For example, if the abnormality information indicates that cluster resource usage is abnormal, the abnormality cause may indicate that the platform caused the alarm. For example, if the abnormality information indicates that the data verification task is abnormal, the abnormality cause may indicate that a non-platform factor caused the alarm. For example, if the abnormality information indicates that the baseline task is abnormal, whether the abnormality cause indicates that the platform caused the alarm may be determined according to the log analysis result of the baseline task.
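The mapping from monitored abnormality information to an abnormality cause might look like the following sketch. The category names, the AnomalyInfo fields, and the keyword-based log check are assumptions made only for illustration, since the specification leaves the concrete log-analysis rule open.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Cause(Enum):
    PLATFORM = auto()        # the platform itself caused the alarm
    NON_PLATFORM = auto()    # user data or configuration caused the alarm

@dataclass
class AnomalyInfo:
    kind: str                # e.g. "cluster_resource", "data_verification", "baseline_task"
    log_text: str = ""       # task execution log, used for baseline-task anomalies

def determine_cause(info: AnomalyInfo) -> Cause:
    if info.kind == "cluster_resource":
        return Cause.PLATFORM
    if info.kind == "data_verification":
        return Cause.NON_PLATFORM
    if info.kind == "baseline_task":
        # Stand-in for the log analysis mentioned above: treat mentions of
        # platform components as evidence that the platform caused the alarm.
        platform_markers = ("scheduler", "compute engine", "storage engine")
        if any(marker in info.log_text.lower() for marker in platform_markers):
            return Cause.PLATFORM
        return Cause.NON_PLATFORM
    return Cause.NON_PLATFORM
```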
At 220, it is determined whether a condition for alerting a user corresponding to the data processing task is satisfied based on the cause of the anomaly.
In this embodiment, the user corresponding to the data processing task may be the user who is to receive the alert, typically a user of the platform. In one example, the user corresponding to the data processing task may be the creator of the data processing task. In one example, the user corresponding to the data processing task may be the recipient of the alert information corresponding to the data processing task. The relevant information of the alert recipient (such as a mobile phone number, an email address, an instant messaging account, etc.) may be provided by the creator of the task when creating the data processing task.
In one example, if the cause of the anomaly is used to indicate that the platform is causing an alarm, then the condition of alerting the user corresponding to the data processing task is not met.
Optionally, with continued reference to fig. 3, fig. 3 illustrates a flowchart of one example of a determination process 300 of whether a condition for alerting a user corresponding to a data processing task is met, according to an embodiment of the present disclosure.
At 310, it is determined whether a task self-healing policy is satisfied according to the cause of the anomaly.
In this embodiment, the task self-healing policy may be obtained in advance. Optionally, the task self-healing policy may also be updated as maintenance data accumulates. In one example, if the abnormality cause indicates that a non-platform factor caused the alarm, the task self-healing policy is not satisfied. In one example, if the abnormality cause indicates that the platform caused the alarm, relevant information of the abnormality is obtained and whether the task self-healing policy is satisfied is determined according to the obtained relevant information. For example, whether the task self-healing policy is satisfied may be determined based on whether the obtained relevant information hits the self-healing knowledge base. For example, whether the task self-healing policy is satisfied may be determined according to whether the occurrence time of the abnormality is within a period of time in which re-running is allowed.
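A minimal sketch of the policy check in step 310 could look as follows; the knowledge-base lookup, the error codes, and the allowed rerun window are placeholders standing in for the examples given above, not a definitive implementation.

```python
from datetime import datetime, time

def self_healing_policy_satisfied(platform_caused: bool,
                                  error_code: str,
                                  occurred_at: datetime,
                                  self_healing_knowledge_base: set,
                                  rerun_window: tuple = (time(0, 0), time(6, 0))) -> bool:
    """Return True if the task self-healing policy is satisfied."""
    if not platform_caused:
        # Anomalies not caused by the platform are not self-healed here.
        return False
    if error_code not in self_healing_knowledge_base:
        # The related information must hit the self-healing knowledge base.
        return False
    start, end = rerun_window
    # Re-running is allowed only within the permitted period of time.
    return start <= occurred_at.time() <= end
```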
If yes, the following steps 320-330 are performed.
At 320, a self-healing operation is performed on the data processing task.
In this embodiment, the self-healing operation may be performed on the data processing task according to the self-healing policy. In one example, the failed data processing task may be re-executed. In one example, the failed data processing task may be allocated more resources (e.g., memory) and then re-executed.
At 330, in response to the data processing task returning from the abnormal state to the normal state, it is determined that the condition for alerting the user corresponding to the data processing task is not satisfied.
In this embodiment, it may be determined whether the data processing task has been restored from the abnormal state to the normal state after the above self-healing operation is performed. In one example, if the failed data processing task yields a processing result after the self-healing operation, it may be determined that the data processing task has been restored from the abnormal state to the normal state. In one example, if, after the self-healing operation, the execution time of a previously slow data processing task matches the normal execution time (e.g., the average running time of the periodic task over a recent period), it may be determined that the data processing task has been restored from the abnormal state to the normal state.
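Steps 320 to 330 could be sketched as below. The Task interface, the extra-memory knob, and the 1.5x runtime tolerance are assumptions used only to make the example concrete; the specification does not name an API.

```python
import statistics

def self_heal_and_check(task, extra_memory_gb: int = 0,
                        recent_runtimes: list = None) -> bool:
    """Re-run the task (optionally with more memory) and report whether it has
    been restored from the abnormal state to the normal state."""
    if extra_memory_gb:
        task.request_memory(task.memory_gb + extra_memory_gb)  # allocate more resources
    result = task.rerun()                                      # re-execute the failed task
    if not result.succeeded:
        return False
    if recent_runtimes:
        # A previously slow task counts as recovered only if this run's time
        # matches the normal execution time of the periodic task.
        average = statistics.mean(recent_runtimes)
        return result.runtime_seconds <= 1.5 * average
    return True
```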
In this way, part of the alarm information can be filtered out according to the abnormality cause and the self-healing policy, which helps avoid alarm storms and enables more accurate and refined anomaly alerting.
Optionally, with continued reference to fig. 4, fig. 4 shows a flowchart of yet another example of a determination process 400 of whether a condition for alerting a user corresponding to a data processing task is met according to an embodiment of the present disclosure.
At 410, it is determined whether a task self-healing policy is satisfied based on the cause of the anomaly.
In this embodiment, reference may be made to the description of step 310 above.
If not, the following steps 420-430 are performed.
At 420, a determination is made as to whether the cause of the anomaly indicates that the platform caused the anomaly.
If so, at 430, it is determined that the condition for alerting the user corresponding to the data processing task is not satisfied.
Alternatively, if yes, steps 440-460 may also continue to be performed.
At 440, the anomaly occurrence location of the platform is determined.
In this embodiment, the position on the platform at which the abnormality indicated by the abnormality information occurs, that is, the abnormality occurrence position, may be further determined. In one example, information related to the abnormality (e.g., task execution log information) may be obtained, and the specific position of the abnormality on the platform may be determined according to the analysis result of that information. In one example, the abnormality information may also include an abnormality category code. The abnormality occurrence position of the platform may be determined by identifying the abnormality category code according to a predetermined correspondence.
In this embodiment, the abnormality occurrence position of the platform may be associated with the architecture of the platform, and may be, for example, a particular module. In one example, the abnormality occurrence position of the platform may include, but is not limited to, at least one of: the scheduling system, the alarm system, the data verification system, the compute engine (e.g., MaxCompute), and the storage engine. The abnormality occurrence position of the platform may be further refined to at least one of: a data synchronization tool (e.g., DataX) in the storage engine, a distributed transaction manager (e.g., DTM) in the storage engine, and an operational data store (ODS, Operational Data Store) in the storage engine. Alternatively, the abnormality occurrence position of the platform may also include other positions not listed.
At 450, maintenance personnel information is obtained from the platform that matches the location of the anomaly occurrence.
In this embodiment, the platform may store maintenance personnel information corresponding to the respective parts of the platform. In one example, the maintenance personnel information may include a maintainer's mobile phone number, email address, instant messaging account, and the like. In one example, the maintenance personnel matched to the scheduling system, the alarm system, and the data verification system may be, for example, the platform maintenance responsible personnel. In one example, the maintenance personnel matched to the compute engine may be, for example, computing module maintenance personnel. In one example, the maintenance personnel matched to the storage engine may be, for example, storage module maintenance personnel. Alternatively, the maintenance personnel matched to the data synchronization tool may be, for example, data synchronization maintenance personnel; the maintenance personnel matched to the distributed transaction manager may be, for example, transaction management maintenance personnel; and the maintenance personnel matched to the operational data store may be, for example, operational data store maintenance personnel.
Optionally, in one example, the maintenance personnel information may further include the on-duty time of each maintainer. Thus, the matched maintenance personnel information can be determined by combining the abnormality occurrence position with the abnormality occurrence time.
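Steps 440 to 450 might be implemented along the following lines. The routing table, the role names, and the shift format are illustrative assumptions; the specification only requires that the maintainer match the abnormality occurrence position (and optionally the occurrence time).

```python
from datetime import datetime

# Hypothetical mapping from abnormality occurrence positions to maintainer roles.
MAINTAINER_ROUTING = {
    "scheduling_system": "platform_maintenance",
    "alarm_system": "platform_maintenance",
    "data_verification_system": "platform_maintenance",
    "compute_engine": "computing_module_maintenance",
    "storage_engine": "storage_module_maintenance",
    "storage_engine.data_sync": "data_sync_maintenance",
    "storage_engine.dtm": "transaction_mgmt_maintenance",
    "storage_engine.ods": "operational_data_store_maintenance",
}

def match_maintainer(location: str, occurred_at: datetime, duty_roster: dict):
    """Return the on-duty maintainer record matching the abnormality location,
    or None if no role or shift matches."""
    role = (MAINTAINER_ROUTING.get(location)
            or MAINTAINER_ROUTING.get(location.split(".")[0]))
    if role is None:
        return None
    for shift in duty_roster.get(role, []):
        # Each shift is assumed to look like
        # {"contact": "...", "start_hour": 0, "end_hour": 12}.
        if shift["start_hour"] <= occurred_at.hour < shift["end_hour"]:
            return shift
    return None
```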
At 460, an anomaly alert message is sent to the matched maintenance personnel.
In this embodiment, the abnormality alert information may be sent to the corresponding maintenance personnel according to the matched maintenance personnel information determined in step 450. In one example, the abnormality alert information may include abnormality basic information. The abnormality basic information may include, for example, but is not limited to, at least one of the following: an alert object, an alert reason, and related logs. In one example, the alert object may be, for example, the object targeted by the monitoring task. The alert reason may include, but is not limited to, at least one of: an abnormal data verification result, a baseline early warning, a baseline task event warning (such as a task error or a task slowing down), a task event warning (such as a task error, a task slowing down, a task not completing, or a task running timeout), and abnormal resource utilization (such as memory utilization or CPU load). The related logs may include the running logs that the abnormality relates to.
In this way, when the abnormality cause indicates that the platform caused the abnormality, it is determined that the condition for alerting the user corresponding to the data processing task is not met, so that abnormalities caused by the platform are not sent to that user. This spares the user alarm information they cannot act on and improves the pertinence and effectiveness of the alerts. In addition, the alarm information can be sent to the matched maintenance personnel according to the abnormality occurrence position on the platform, ensuring accurate delivery of the alarm information.
Returning to fig. 2, if the determination is negative, at 230, alarm suppression is performed on the anomaly information.
In this embodiment, if the condition of alerting the user corresponding to the data processing task is not satisfied, alert suppression is performed on the abnormal information, that is, the abnormal information is not sent to the user corresponding to the data processing task.
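Putting Fig. 3 and Fig. 4 together, the decision behind steps 220 and 230 reduces to something like the sketch below. It is a simplification, and the parameter names are assumptions rather than terms defined by this specification.

```python
def user_alert_condition_met(platform_caused: bool,
                             self_healing_policy_satisfied: bool,
                             recovered_after_self_heal: bool) -> bool:
    """Return True only if the user corresponding to the data processing task
    should be alerted; otherwise the abnormality information is alarm-suppressed."""
    if self_healing_policy_satisfied and recovered_after_self_heal:
        return False   # Fig. 3: the task healed itself, suppress (step 330)
    if platform_caused:
        return False   # Fig. 4: platform-caused, alert maintainers instead (step 430)
    return True
```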
Referring now to fig. 5, fig. 5 shows a signaling diagram of yet another example of an anomaly alerting method 500 for a platform in accordance with an embodiment of the present specification.
At 510, it is monitored whether exception information is generated indicating that an exception has occurred to a data processing task running on the platform.
At 520, a cause of the anomaly is determined in response to monitoring anomaly information indicating that the data processing task running on the platform is anomalous.
Alternatively, the abnormality information may include information indicating at least one of: abnormal data verification result, baseline early warning, baseline task event warning, task event warning and abnormal resource utilization rate.
At 530, it is determined whether a condition for alerting a user corresponding to the data processing task is satisfied based on the cause of the anomaly.
If not, at 540, alarm suppression is performed on the anomaly information.
It should be noted that the steps 510-540 may refer to the relevant descriptions of the steps 210-230 in the embodiment of fig. 2.
If so, the following steps 550-580 are performed.
At 550, the anomaly alert information is sent to the user corresponding to the data processing task.
In this embodiment, for the user corresponding to the data processing task, reference may be made to the related description of 220 in the embodiment of Fig. 2. In one example, the abnormality alert information may be sent to the user corresponding to the data processing task via SMS, email, an instant messaging tool, or the like.
Optionally, the abnormality alert information may include at least one of: abnormality basic information, abnormality impact assessment information, abnormality root cause analysis information, and abnormality handling recommendation information. In one example, for the abnormality alert information, reference may be made to the description of 460 in the embodiment of Fig. 4. In one example, the abnormality impact assessment information may be used to indicate the scope of impact of the abnormality, e.g., whether the current task's progress affects upstream and downstream tasks. In one example, the abnormality may be analyzed using various root cause analysis (Root Cause Analysis) tools to obtain the abnormality root cause analysis information. In one example, the abnormality handling recommendation information may be generated by querying a preset knowledge base for a handling approach that matches the abnormality.
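The alert payload described above can be pictured as a simple record such as the following; the field names are assumptions, and the specification does not prescribe a concrete format.

```python
from dataclasses import dataclass, field

@dataclass
class AnomalyAlert:
    alert_object: str                                  # the object the monitoring task watches
    alert_reason: str                                  # e.g. "baseline early warning", "task error"
    related_logs: list = field(default_factory=list)   # running logs the abnormality relates to
    impact_assessment: str = ""                        # e.g. affected upstream/downstream tasks
    root_cause_analysis: str = ""                      # output of a root cause analysis tool
    handling_recommendation: str = ""                  # matched from a preset knowledge base
```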
At 560, feedback information indicating that the user cannot handle the abnormality themselves is received from the user corresponding to the data processing task in response to the abnormality alert information.
In this embodiment, if the user corresponding to the data processing task determines that the problem indicated by the abnormality alert information cannot be handled by the user, the user may send feedback information indicating that the abnormality cannot be handled by the user.
At 570, emergency treatment information is generated from the anomaly alert information.
In this embodiment, if the feedback information is received, emergency processing information may be generated according to the abnormality alert information. In one example, the emergency processing information may include the abnormality basic information. Optionally, the emergency processing information may further include more detailed information related to the abnormality, such as logs from a period before and after the abnormality, records of similar historical abnormalities, and the corresponding handling records. Optionally, the emergency processing information may further include the degree of urgency of the abnormality. The degree of urgency may be determined, for example, based on the abnormality impact assessment information.
At 580, the emergency treatment information is reported to the emergency treatment platform.
In this embodiment, the emergency processing platform may include various platforms that support exception handling. Alternatively, the emergency processing platform may be integrated with the platform. In one example, the emergency processing platform may include an emergency flow management module, an emergency operation and maintenance execution module, an emergency history management module, and the like. In one example, the emergency flow management module may match a corresponding emergency handling policy based on the reported emergency processing information. The emergency operation and maintenance execution module may execute the matched emergency handling policy, such as escalating to the corresponding technicians or automatically executing a matched anomaly recovery policy. The emergency history management module may record each emergency handling process. Optionally, the emergency history management module may also analyze the history records to provide data support for optimizing emergency processing.
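Steps 560 to 580 might be sketched as below. The HTTP endpoint, the payload fields, and the use of a plain POST are assumptions, since the specification does not define the interface of the emergency processing platform.

```python
import json
import urllib.request

def report_emergency(basic_info: dict, history: list, urgency: str,
                     endpoint: str = "https://emergency.example.internal/report") -> bool:
    """Assemble emergency processing information from the abnormality alert and
    report it to the emergency processing platform."""
    payload = {
        "basic_info": basic_info,   # alert object, alert reason, related logs
        "history": history,         # similar past abnormalities and their handling records
        "urgency": urgency,         # derived from the abnormality impact assessment
    }
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # The emergency flow management module is expected to match a
        # corresponding emergency handling policy on receipt.
        return response.status == 200
```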
In this way, a full-flow exception handling service can be provided to the user corresponding to the data processing task: on top of filtering out inappropriate alarm information (such as abnormalities caused by the platform), more effective alarm information is provided to the user, and the user can be assisted in handling the abnormality.
With the abnormality alert method for a platform described in Figs. 1 to 5, the abnormality cause can be determined from the detected abnormality information, and alarm suppression is performed when it is judged, according to the abnormality cause, that the condition for alerting the user corresponding to the data processing task is not met. By distinguishing alarm information with different causes, the number of unnecessary alarms sent to users is reduced and the pertinence and effectiveness of the alerts are improved, so that users can focus on the problems that truly need to be solved, which helps computing tasks proceed smoothly.
Fig. 6 shows a block diagram of one example of an anomaly alert device 600 for a platform according to an embodiment of the present specification. The apparatus embodiment may correspond to the method embodiments shown in fig. 2-5, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 6, the abnormality alert device 600 for a platform may include an abnormality cause determination unit 610, an alert condition judgment unit 620, and an alert suppression unit 630.
An abnormality cause determination unit 610 configured to determine an abnormality cause in response to monitoring abnormality information indicating that an abnormality occurs in a data processing task running on the platform. The operation of the abnormality cause determination unit 610 may refer to the operation of 210 described above in fig. 2.
And an alarm condition judging unit 620 configured to judge whether a condition for alarming to a user corresponding to the data processing task is satisfied according to the abnormality cause. The operation of the alarm condition judging unit 620 may refer to the operation of 220 described above with reference to fig. 2.
Alternatively, referring to fig. 7 below, fig. 7 shows a block diagram of one example of an alarm condition judging unit 700 in an abnormality alarm device for a platform according to an embodiment of the present specification.
As shown in fig. 7, the alarm condition judging unit 700 may include: the self-healing condition judging module 710 is configured to judge whether the task self-healing policy is satisfied according to the abnormal reason; a self-healing execution module 720 configured to execute a self-healing operation on the data processing task in response to the task self-healing policy being satisfied; and an alarm condition judging module 730 configured to determine that a condition for alerting a user corresponding to the data processing task is not satisfied in response to the data processing task being restored from the abnormal state to the normal state.
The operation of the self-healing condition judging module 710, the self-healing executing module 720, and the alarm condition judging module 730 described above may refer to the judging process 310-330 of whether the condition of alarming to the user corresponding to the data processing task is satisfied or not, which is described above with reference to fig. 3.
Alternatively, referring to fig. 8 below, fig. 8 shows a block diagram of still another example of an alarm condition judging unit 800 in an abnormality alarm device for a platform according to an embodiment of the present specification.
As shown in fig. 8, the alarm condition judging unit 800 may include: a self-healing condition judging module 810 configured to judge whether a task self-healing policy is satisfied according to the abnormality cause; an anomaly cause location module 820 configured to determine, in response to the task self-healing policy not being satisfied, whether the anomaly cause indicates that the platform caused the anomaly; an alarm condition determining module 830, configured to determine, in response to the abnormality cause indicating that the platform causes the abnormality, that a condition for alerting a user corresponding to the data processing task is not satisfied.
The operation of the self-healing condition determining module 810, the abnormality cause locating module 820, and the alarm condition determining module 830 described above may refer to the determining process 410-430 described above with reference to fig. 4 as to whether the condition for alerting the user corresponding to the data processing task is satisfied.
Returning to fig. 6, the alarm suppression unit 630 is configured to perform alarm suppression on the abnormal information if a condition of performing an alarm on the user corresponding to the data processing task is not satisfied. The operation of the alarm suppression unit 630 may refer to the operation of 230 described above with respect to fig. 2.
With continued reference to fig. 9, fig. 9 shows a block diagram of one example of an anomaly alerting device 900 for a platform in accordance with an embodiment of the present specification.
As shown in fig. 9, the abnormality alert device 900 for a platform may include: an abnormality cause determination unit 910, an alarm condition judgment unit 920, an alarm suppression unit 930, and a first alarm unit 940.
In this embodiment, the first alarm unit 940 is configured to: determine an abnormality occurrence position of the platform in response to the abnormality cause indicating that the platform causes the abnormality; acquire maintainer information matched with the abnormality occurrence position from the platform; and send abnormality alert information to the matched maintenance personnel. The operation of the first alarm unit 940 may refer to the operations of 440-460 described above with respect to Fig. 4.
It should be noted that the operations of the abnormality cause determining unit 910, the alarm condition judging unit 920, and the alarm suppressing unit 930 may refer to the descriptions of the abnormality cause determining unit 610, the alarm condition judging unit 800, and the alarm suppressing unit 630 in the foregoing embodiments, and are not repeated here.
With continued reference to fig. 10, fig. 10 shows a block diagram of yet another example of an anomaly alerting device 1000 for a platform in accordance with an embodiment of the present specification.
As shown in fig. 10, an abnormality alert device 1000 for a platform may include: an abnormality cause determination unit 1010, an alarm condition judgment unit 1020, an alarm suppression unit 1030, a second alarm unit 1040, and an emergency information reporting unit 1050.
In this embodiment, the second alarm unit 1040 is configured to send abnormal alarm information to the user corresponding to the data processing task if the condition of alerting the user corresponding to the data processing task is satisfied.
An emergency information reporting unit 1050 configured to generate emergency processing information according to the abnormal alarm information in response to receiving feedback information corresponding to the abnormal alarm information indicating that self-processing is impossible; and reporting the emergency treatment information to an emergency treatment platform.
It should be noted that the operations of the second alarm unit 1040 and the emergency information reporting unit 1050 may refer to the operations 540 to 580 described above with reference to fig. 5. The operations of the abnormality cause determining unit 1010, the alarm condition judging unit 1020, and the alarm suppressing unit 1030 may refer to the related descriptions of the abnormality cause determining unit, the alarm condition judging unit, and the alarm suppressing unit in the foregoing embodiments of fig. 6 to 9, and are not repeated here.
Embodiments of an abnormality alert method and apparatus for a platform according to embodiments of the present specification are described above with reference to fig. 1 to 10.
The abnormality alert device for a platform in the embodiments of the present specification may be implemented by hardware, by software, or by a combination of hardware and software. Taking a software implementation as an example, the device in a logical sense is formed by the processor of the device where it is located reading corresponding computer program instructions from storage into memory and running them. In the embodiments of the present specification, the abnormality alert device for a platform may be implemented using an electronic device, for example.
Fig. 11 shows a schematic diagram of an anomaly alerting device 1100 for a platform in an embodiment of the present specification.
As shown in Fig. 11, an abnormality alert device 1100 for a platform may include at least one processor 1110, a memory (e.g., a non-volatile memory) 1120, an internal memory 1130, and a communication interface 1140, and the at least one processor 1110, the memory 1120, the internal memory 1130, and the communication interface 1140 are connected together via a bus 1150. The at least one processor 1110 executes at least one computer-readable instruction (i.e., an element described above as being implemented in software) stored or encoded in the memory.
In one embodiment, computer-executable instructions are stored in memory that, when executed, cause at least one processor 1110 to: determining an abnormality cause in response to monitoring abnormality information for indicating that an abnormality occurs in a data processing task running on the platform; judging whether a condition for alarming a user corresponding to the data processing task is met or not according to the abnormal reason; and if the condition of alarming the user corresponding to the data processing task is not met, alarming and suppressing the abnormal information.
It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1110 to perform the various operations and functions described above in connection with fig. 1-5 in various embodiments of the present specification.
According to one embodiment, a program product, such as a computer readable medium, is provided. The computer-readable medium may have instructions (i.e., the elements described above implemented in software) that, when executed by a computer, cause the computer to perform the various operations and functions described above in connection with fig. 1-5 in various embodiments of the present specification.
In particular, a system or apparatus provided with a readable storage medium having stored thereon software program code implementing the functions of any of the above embodiments may be provided, and a computer or processor of the system or apparatus may be caused to read out and execute instructions stored in the readable storage medium.
In this case, the program code itself read from the readable medium may implement the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.
Computer program code required for the operation of portions of the present specification may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python, conventional programming languages such as C, Visual Basic 2003, Perl, COBOL 2002, PHP, and ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages. The program code may execute on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or for use as a service, such as software as a service (SaaS).
Examples of readable storage media include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or cloud by a communications network.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Not all steps or units in the above-mentioned flowcharts and system configuration diagrams are necessary, and some steps or units may be omitted according to actual needs. The order of execution of the steps is not fixed and may be determined as desired. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by multiple physical entities, or may be implemented jointly by some components in multiple independent devices.
The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The alternative implementation manner of the embodiment of the present disclosure has been described in detail above with reference to the accompanying drawings, but the embodiment of the present disclosure is not limited to the specific details of the foregoing implementation manner, and various simple modifications may be made to the technical solution of the embodiment of the present disclosure within the scope of the technical concept of the embodiment of the present disclosure, and all the simple modifications belong to the protection scope of the embodiment of the present disclosure.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. An anomaly alert method for a platform, comprising:
determining an abnormality cause in response to monitoring abnormality information for indicating that an abnormality occurs in a data processing task running on the platform;
judging whether a condition for alarming a user corresponding to the data processing task is met or not according to the abnormal reason; and
and if the condition for alerting the user corresponding to the data processing task is not met, performing alarm suppression on the abnormality information.
2. The method of claim 1, wherein the determining whether the condition for alerting the user corresponding to the data processing task is satisfied according to the abnormality cause comprises:
judging whether a task self-healing strategy is met according to the abnormal reason;
in response to the task self-healing policy being satisfied,
executing self-healing operation on the data processing task; and
and responding to the data processing task to recover from the abnormal state to the normal state, and determining that the condition for alarming the user corresponding to the data processing task is not met.
3. The method of claim 1, wherein the determining whether the condition for alerting the user corresponding to the data processing task is satisfied according to the abnormality cause comprises:
judging whether a task self-healing strategy is met according to the abnormal reason;
in response to not satisfying the task self-healing policy,
judging whether the abnormality cause indicates that the platform causes the abnormality; and
and responding to the abnormality cause to indicate that the platform causes the abnormality, and determining that the condition for alarming the user corresponding to the data processing task is not met.
4. A method as claimed in claim 3, wherein the method further comprises:
in response to the cause of the anomaly indicating that the platform caused the anomaly,
determining an abnormal occurrence position of the platform;
acquiring maintainer information matched with the abnormal occurrence position from the platform; and
and sending abnormal alarm information to the matched maintenance personnel.
5. The method of any one of claims 1 to 4, wherein the method further comprises:
if the condition of alarming to the user corresponding to the data processing task is met,
sending abnormal alarm information to a user corresponding to the data processing task;
in response to receiving feedback information corresponding to the anomaly alert information indicating failure to process by itself,
generating emergency processing information according to the abnormal alarm information; and
and reporting the emergency treatment information to an emergency treatment platform.
6. The method of claim 5, wherein the anomaly alert information comprises at least one of: abnormal basic information, abnormal influence evaluation information, abnormal root cause analysis information and abnormal treatment recommendation information;
the anomaly information includes information indicating at least one of: abnormal data verification result, baseline early warning, baseline task event warning, task event warning and abnormal resource utilization rate.
7. An anomaly alert device for a platform, comprising:
an abnormality cause determination unit configured to determine an abnormality cause in response to monitoring abnormality information indicating that an abnormality occurs in a data processing task running on the platform;
an alarm condition judging unit configured to judge whether a condition for giving an alarm to a user corresponding to the data processing task is satisfied according to the abnormality cause;
and the alarm suppression unit is configured to perform alarm suppression on the abnormal information if the condition of alarming to the user corresponding to the data processing task is not met.
8. The apparatus of claim 7, wherein the alarm condition judging unit comprises:
the self-healing condition judging module is configured to judge whether a task self-healing strategy is met according to the abnormal reasons;
a self-healing execution module configured to execute a self-healing operation on the data processing task in response to the task self-healing policy being satisfied;
and the alarm condition judging module is configured to respond to the data processing task to recover from the abnormal state to the normal state and determine that the condition for alarming the user corresponding to the data processing task is not met.
9. The apparatus of claim 7, wherein the alarm condition judging unit comprises:
the self-healing condition judging module is configured to judge whether a task self-healing strategy is met according to the abnormal reasons;
an anomaly cause location module configured to determine, in response to the task self-healing policy not being satisfied, whether the anomaly cause indicates that the platform caused the anomaly;
and the alarm condition judging module is configured to respond to the abnormality reason to indicate that the platform causes the abnormality and determine that the condition for alarming the user corresponding to the data processing task is not met.
10. The apparatus of claim 9, wherein the apparatus further comprises:
a first warning unit configured to: determine an abnormality occurrence position of the platform in response to the abnormality cause indicating that the platform causes the abnormality; acquire maintainer information matched with the abnormality occurrence position from the platform; and send abnormal alarm information to the matched maintenance personnel.
11. The apparatus of any of claims 7 to 10, wherein the apparatus further comprises:
the second alarm unit is configured to send abnormal alarm information to the user corresponding to the data processing task if the condition of alarming to the user corresponding to the data processing task is met;
an emergency information reporting unit configured to generate emergency processing information according to the abnormal alarm information in response to receiving feedback information corresponding to the abnormal alarm information indicating that self-processing is impossible; and reporting the emergency treatment information to an emergency treatment platform.
12. An anomaly alert device for a platform, comprising: at least one processor, a memory coupled with the at least one processor, and a computer program stored on the memory, the at least one processor executing the computer program to implement the anomaly alert method for a platform of any one of claims 1 to 6.
CN202310528732.7A 2023-05-09 2023-05-09 Abnormality warning method and device for platform Pending CN116483663A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310528732.7A CN116483663A (en) 2023-05-09 2023-05-09 Abnormality warning method and device for platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310528732.7A CN116483663A (en) 2023-05-09 2023-05-09 Abnormality warning method and device for platform

Publications (1)

Publication Number Publication Date
CN116483663A true CN116483663A (en) 2023-07-25

Family

ID=87225082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310528732.7A Pending CN116483663A (en) 2023-05-09 2023-05-09 Abnormality warning method and device for platform

Country Status (1)

Country Link
CN (1) CN116483663A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117057783A (en) * 2023-10-09 2023-11-14 巴斯夫一体化基地(广东)有限公司 Method and apparatus for determining maintenance routes within a plant

Similar Documents

Publication Publication Date Title
CN110661659B (en) Alarm method, device and system and electronic equipment
CN110224858B (en) Log-based alarm method and related device
US9009307B2 (en) Automated alert management
CN112162878A (en) Database fault discovery method and device, electronic equipment and storage medium
JP4892367B2 (en) Abnormal sign detection system
AU2007261542B2 (en) Method and system for monitoring non-occurring events
CN109861856B (en) Method and device for notifying system fault information, storage medium and computer equipment
US10896073B1 (en) Actionability metric generation for events
CN111475369A (en) Log monitoring adding method and device, computer equipment and storage medium
CN116483663A (en) Abnormality warning method and device for platform
CN111934913A (en) Intelligent network management system
CN111510339A (en) Industrial Internet data monitoring method and device
CN115001989A (en) Equipment early warning method, device, equipment and readable storage medium
CN112910733A (en) Full link monitoring system and method based on big data
CN110677271B (en) Big data alarm method, device, equipment and storage medium based on ELK
CN111062503B (en) Power grid monitoring alarm processing method, system, terminal and storage medium
CN112565228A (en) Client network analysis method and device
CN111949421A (en) SDK calling method and device, electronic equipment and computer readable storage medium
CN111367934A (en) Data consistency checking method, device, server and medium
CN115102838B (en) Emergency processing method and device for server downtime risk and electronic equipment
CN114679295B (en) Firewall security configuration method and device
CN109508356B (en) Data abnormality early warning method, device, computer equipment and storage medium
CN114296979A (en) Method and device for detecting abnormal state of Internet of things equipment
CN113342596A (en) Distributed monitoring method, system and device for equipment indexes
EP4091084A1 (en) Endpoint security using an action prediction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination