CN117743087A - Method and device for monitoring sub-health of equipment - Google Patents

Publication number
CN117743087A
Authority
CN
China
Prior art keywords
log
sub
health
risk
condition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311745637.9A
Other languages
Chinese (zh)
Inventor
田英轩
裘雪敬
戴朗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

Embodiments of the invention provide a method and a device for monitoring sub-health of equipment. The method comprises: periodically collecting logs of a target device through a monitoring platform to obtain a log package; matching the log package against a preset sub-health risk log to obtain a matching result, wherein the matching result indicates whether a sub-health log exists in the log package; and, when the matching result indicates that a sub-health log exists, determining the sub-health type of the target device according to the sub-health log. The invention solves the problem in the related art that sub-health of a device is difficult to detect before the device fails, and thereby achieves the effect of detecting sub-health of a device in time, before the device fails.

Description

Method and device for monitoring sub-health of equipment
Technical Field
The embodiment of the invention relates to the field of communication, in particular to a method and a device for monitoring sub-health of equipment.
Background
Storage products have extremely high requirements for reliability, performance and the like, and are widely used across industries as an important part of infrastructure. However, as the capabilities of storage systems keep growing, their complexity increases as well, and the hidden anomalies or hidden dangers (sub-health) caused by that complexity become increasingly difficult to identify effectively in a single test.
Meanwhile, laboratories and data centers involve more and more storage environments. During operation and maintenance and during testing, risks must be identified quickly and accurately across a large number of environments, and even identified and handled by early warning, so that the operation and maintenance management of the upper-layer services of storage products can be satisfied.
Because storage products occupy a critical position in the infrastructure and are both demanding and complex, operating, maintaining and testing them is a significant challenge. How to monitor more efficiently and discover problems faster has become an efficiency problem that must be solved. The difficulties are mainly manifested as follows:
1. there are many storage devices, the storage systems are complex, and the scenarios are numerous, so there are very many observation points or monitoring points; it is difficult to cover them completely in a single operation and maintenance pass or test case, and the efficiency is unacceptable;
2. hidden dangers need to be discovered earlier than alarms or other detection would reveal them, that is, before the equipment becomes abnormal. Identifying the sub-health stage of a storage device in advance is the key, so that risks that would otherwise be exposed only after long-term operation can be eliminated in a short time.
Therefore, a solution is needed that can detect storage anomalies comprehensively and identify risks as early as possible. The solution also needs to be automated and to run in the background, so as to avoid greatly increasing the cost borne by operation and maintenance personnel or testers.
At present, traditional monitoring mainly manages equipment by defining special observation points in test cases or by routine inspection. The two techniques are as follows:
1. Special observation points: this scheme defines special observation points in a traditional test case and checks and verifies them item by item to identify and judge problems.
This is the most commonly used test and inspection method and is simple to implement. However, the comprehensiveness and accuracy of the observation points are entirely limited by the ability of the designer, and hidden dangers introduced by modifications through a butterfly effect cannot be prevented (the observation points of a use case are limited in scope and focus on strongly related functions; given the number of use cases, the test cost of extending the observation points grows exponentially, and a single use case cannot observe all observation points of the whole storage system).
2. Routine inspection: this scheme is commonly used in unified operation and maintenance management of multiple devices. All storage devices are checked in a unified way through tools, scripts and the like to identify whether a system is abnormal.
This scheme avoids observing each device or each use case independently: devices can be checked against general observation items and processed in parallel.
However, this scheme and the special observation point scheme share two technical defects:
A. spatially, they require operating the equipment and human effort: every device must be accessed and have the relevant commands executed on it, and these commands may compete with or otherwise affect the upper-layer services of the device. Operation and maintenance personnel and testers must also pay attention or operate throughout, which consumes operation and maintenance and testing cost for the whole process;
B. in time, they can only observe problems and cannot give early warning: problems that already exist on the equipment can be identified from the observation points or the inspection method, but points that are not yet problems and only carry hidden dangers (such as accumulating abnormal codes or trace resource leakage) cannot be warned about effectively.
There is currently no effective solution to the above problems.
Disclosure of Invention
The embodiment of the invention provides a method and a device for monitoring sub-health of equipment, which at least solve the problem that the sub-health of the equipment is difficult to find before the equipment fails in the related technology.
According to one embodiment of the present invention, there is provided a method of monitoring sub-health of a device, comprising: periodically collecting logs of a target device through a monitoring platform to obtain a log package; matching the log package against a preset sub-health risk log to obtain a matching result, wherein the matching result indicates whether a sub-health log exists in the log package; and, if the matching result indicates that a sub-health log exists, determining the sub-health type of the target device according to the sub-health log.
In one exemplary embodiment, determining the sub-health type of the target device according to the sub-health log includes: determining that the sub-health type of the target device is an internal concurrency bottleneck when the sub-health log indicates that the number of retry error reports reaches a first preset value; determining that the sub-health type of the target device is a data verification abnormality when the sub-health log indicates that a check value is wrong; and determining that the sub-health type of the target device is resource leakage when the sub-health log indicates that resources continuously decrease.
In an exemplary embodiment, the method further comprises: sending a first prompt message when the number of internal concurrency bottlenecks reaches a second preset value; and sending a second prompt message when the number of data verification abnormalities reaches a third preset value.
In an exemplary embodiment, the method further comprises: under the condition that the sub-health log indicates that resources are continuously reduced, calling the resource usage record of the target equipment through a preset instruction; and sending a third prompt message under the condition that the resource usage record indicates that the duration of the continuous reduction of the resource of the target equipment is greater than or equal to a preset duration threshold.
In an exemplary embodiment, the method further comprises: under the condition that the sub-health log indicates that resources are continuously reduced, calling the resource usage record of the target equipment through a preset instruction; and sending a fourth prompting message under the condition that the resource usage record indicates that the resource usage rate of the target equipment is larger than or equal to a preset usage rate threshold value within a preset time range.
In an exemplary embodiment, in a case where the matching result indicates that the sub-health log exists, determining the sub-health type of the target device according to the sub-health log includes: under the condition that the matching result indicates that the sub-health log exists, matching the sub-health log with a preset white list; and determining the sub-health type of the target equipment according to the sub-health log under the condition that the sub-health log does not exist in the white list.
In an exemplary embodiment, matching the log package with a preset sub-health risk log to obtain a matching result includes: the following operations are performed on each log in the log package, and the log when the following operations are performed is called a current log: matching the current log with each risk item in the sub-health risk log, wherein a plurality of risk items are recorded in the sub-health risk log, and the risk items are used for indicating that sub-health risks exist in the log matched with the risk items; and under the condition that the current log is matched with the current risk item in the sub-health risk log, determining that the current log has sub-health risk indicated by the current risk item.
According to another embodiment of the present invention, there is provided an apparatus for monitoring sub-health of a device, including: an acquisition module, configured to periodically collect logs of a target device through a monitoring platform to obtain a log package; a matching module, configured to match the log package against a preset sub-health risk log to obtain a matching result, wherein the matching result indicates whether a sub-health log exists in the log package; and a determining module, configured to determine the sub-health type of the target device according to the sub-health log when the matching result indicates that a sub-health log exists.
According to yet another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program when executed by a processor implements the steps of the method as described in any of the above.
According to a further embodiment of the invention, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the invention, logs of the target device are periodically collected through the monitoring platform to obtain a log package; the log package is matched against a preset sub-health risk log to obtain a matching result, wherein the matching result indicates whether a sub-health log exists in the log package; and, when the matching result indicates that a sub-health log exists, the sub-health type of the target device is determined according to the sub-health log. This solves the problem that sub-health of a device is difficult to detect before the device fails, and achieves the effect of detecting sub-health of a device in time, before the device fails.
Drawings
FIG. 1 is a block diagram of a hardware structure of a mobile terminal of a method for monitoring sub-health of a device according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of monitoring device sub-health in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment in accordance with the present invention;
FIG. 4 is a monitoring flow chart according to an embodiment of the invention;
FIG. 5 is a block diagram of an apparatus for monitoring sub-health of a device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
This application mainly addresses the technical problems encountered when monitoring storage products and provides an optimized technical solution. Because storage products occupy a critical position in the infrastructure and are both demanding and complex, operating, maintaining and testing them is a significant challenge, and how to monitor more efficiently and discover problems faster has become an efficiency problem that must be solved.
The application aims to overcome the limitations of existing schemes in time and space: through the design and implementation of this scheme, the spatial limitation is overcome and operation and maintenance cost is reduced, while the timeliness of early warning is improved so that problems are prevented before they occur.
The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal or similar computing device. Taking the mobile terminal as an example, fig. 1 is a block diagram of a hardware structure of the mobile terminal according to a method for monitoring sub-health of a device according to an embodiment of the present invention. As shown in fig. 1, a mobile terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, wherein the mobile terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting of the structure of the mobile terminal described above. For example, the mobile terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store computer programs, such as software programs of application software and modules, such as computer programs corresponding to the methods of monitoring the subhealth of the device in the embodiments of the present invention, and the processor 102 executes the computer programs stored in the memory 104 to perform various functional applications and data processing, i.e., to implement the methods described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a method for monitoring sub-health of a device is provided, which runs on the mobile terminal described above. FIG. 2 is a flowchart of the method according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:
step S202, periodically collecting logs of a target device through a monitoring platform to obtain a log package;
step S204, matching the log package against a preset sub-health risk log to obtain a matching result, wherein the matching result indicates whether a sub-health log exists in the log package;
step S206, when the matching result indicates that a sub-health log exists, determining the sub-health type of the target device according to the sub-health log.
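The three steps above can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the `RiskItem` structure, the function names, the keyword-containment matching and the sample log lines are all assumptions introduced for the example, and `collect_log_package` only stands in for the periodic collection that a real monitoring platform would perform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskItem:
    """One entry of the preset sub-health risk log (hypothetical structure)."""
    keyword: str          # text whose presence marks a log line as a sub-health log
    sub_health_type: str  # sub-health type indicated by a matching line


def collect_log_package(device_logs):
    """S202 stand-in: a real platform would pull these logs from the device on a timer."""
    return list(device_logs)


def match_log_package(log_package, risk_items):
    """S204: return every (log line, risk item) pair found in the package."""
    return [(line, item)
            for line in log_package
            for item in risk_items
            if item.keyword in line]


def determine_sub_health_types(matches):
    """S206: derive the sub-health types from the matched sub-health logs."""
    return sorted({item.sub_health_type for _, item in matches})


# Hypothetical risk items and log lines, for illustration only.
RISK_ITEMS = [
    RiskItem("retry failed", "internal concurrency bottleneck"),
    RiskItem("checksum mismatch", "data verification abnormality"),
]
package = collect_log_package([
    "10:00:01 INFO heartbeat ok",
    "10:00:02 WARN checksum mismatch on block 7",
])
matches = match_log_package(package, RISK_ITEMS)
print(determine_sub_health_types(matches))
```

An empty list of matches corresponds to a matching result indicating that no sub-health log exists, in which case step S206 is skipped.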
Optionally, the above steps may be executed by a background processor or another device with similar processing capability, or by a machine that integrates at least an image acquisition device and a data processing device, where the image acquisition device may include a graphics acquisition module such as a camera, and the data processing device may include a terminal such as a computer or a mobile phone, but is not limited thereto.
Through the above steps, logs of the target device are periodically collected through the monitoring platform to obtain a log package; the log package is matched against a preset sub-health risk log to obtain a matching result, wherein the matching result indicates whether a sub-health log exists in the log package; and, when the matching result indicates that a sub-health log exists, the sub-health type of the target device is determined according to the sub-health log. This solves the problem that sub-health of a device is difficult to detect before the device fails, and achieves the effect of detecting sub-health of a device in time, before the device fails.
In one exemplary embodiment, determining the sub-health type of the target device according to the sub-health log includes: determining that the sub-health type of the target device is an internal concurrency bottleneck when the sub-health log indicates that the number of retry error reports reaches a first preset value; determining that the sub-health type of the target device is a data verification abnormality when the sub-health log indicates that a check value is wrong; and determining that the sub-health type of the target device is resource leakage when the sub-health log indicates that resources continuously decrease.
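The three conditions in this embodiment can be illustrated with a short sketch. It assumes the sub-health log has already been summarized into a retry error count, a flag for a wrong check value, and a series of available-resource samples; these inputs, the function names, and the default first preset value of 3 are hypothetical and chosen only for the example.

```python
from enum import Enum


class SubHealthType(Enum):
    INTERNAL_CONCURRENCY_BOTTLENECK = "internal concurrency bottleneck"
    DATA_VERIFICATION_ABNORMALITY = "data verification abnormality"
    RESOURCE_LEAKAGE = "resource leakage"


def is_continuously_decreasing(samples):
    """True when each sample of the available resource is lower than the previous one."""
    return len(samples) >= 2 and all(b < a for a, b in zip(samples, samples[1:]))


def classify(retry_error_count, check_value_wrong, available_resource_samples,
             first_preset_value=3):
    """Return the sub-health types indicated by the summarized sub-health log."""
    types = []
    if retry_error_count >= first_preset_value:
        types.append(SubHealthType.INTERNAL_CONCURRENCY_BOTTLENECK)
    if check_value_wrong:
        types.append(SubHealthType.DATA_VERIFICATION_ABNORMALITY)
    if is_continuously_decreasing(available_resource_samples):
        types.append(SubHealthType.RESOURCE_LEAKAGE)
    return types
```

Returning a list rather than a single value is a design choice of the sketch: one log package may show more than one kind of sub-health at the same time.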
In an exemplary embodiment, the method further comprises: sending a first prompt message when the number of internal concurrency bottlenecks reaches a second preset value; and sending a second prompt message when the number of data verification abnormalities reaches a third preset value.
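A minimal sketch of this counting rule follows. The type labels, the function name, and the default second and third preset values (5 and 2) are assumptions for illustration; the returned strings stand in for whatever mail or instant message a real platform would send.

```python
from collections import Counter

BOTTLENECK = "internal concurrency bottleneck"
VERIFICATION = "data verification abnormality"


def prompts_for(detected_types, second_preset_value=5, third_preset_value=2):
    """Decide which prompt messages to send from the detections seen so far."""
    counts = Counter(detected_types)  # missing types count as 0
    prompts = []
    if counts[BOTTLENECK] >= second_preset_value:
        prompts.append("first prompt message")
    if counts[VERIFICATION] >= third_preset_value:
        prompts.append("second prompt message")
    return prompts
```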
In an exemplary embodiment, the method further comprises: under the condition that the sub-health log indicates that resources are continuously reduced, calling the resource usage record of the target equipment through a preset instruction; and sending a third prompt message under the condition that the resource usage record indicates that the duration of the continuous reduction of the resource of the target equipment is greater than or equal to a preset duration threshold.
In an exemplary embodiment, the method further comprises: under the condition that the sub-health log indicates that resources are continuously reduced, calling the resource usage record of the target equipment through a preset instruction; and sending a fourth prompting message under the condition that the resource usage record indicates that the resource usage rate of the target equipment is larger than or equal to a preset usage rate threshold value within a preset time range.
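The two resource-related embodiments above can be sketched together. The sketch assumes the resource usage record returned by the preset instruction is a time-ordered list of samples; the `Sample` fields, the function names, and the reading of "within a preset time range" as "within the most recent `time_range` seconds" are assumptions made for the example.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    timestamp: float   # seconds
    available: float   # amount of the resource still available
    usage_rate: float  # fraction of the resource in use, 0.0 to 1.0


def decreasing_duration(record):
    """Seconds covered by the most recent run of strictly decreasing availability."""
    if len(record) < 2:
        return 0.0
    end = start = record[-1].timestamp
    # Walk backwards over consecutive pairs until the decrease stops.
    for prev, cur in zip(reversed(record[:-1]), reversed(record[1:])):
        if cur.available < prev.available:
            start = prev.timestamp
        else:
            break
    return end - start


def resource_prompts(record, duration_threshold, usage_threshold, time_range):
    """Return the prompts triggered by the resource usage record."""
    prompts = []
    if not record:
        return prompts
    if decreasing_duration(record) >= duration_threshold:
        prompts.append("third prompt message")
    latest = record[-1].timestamp
    recent = [s for s in record if latest - s.timestamp <= time_range]
    if any(s.usage_rate >= usage_threshold for s in recent):
        prompts.append("fourth prompt message")
    return prompts
```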
In an exemplary embodiment, in a case where the matching result indicates that the sub-health log exists, determining the sub-health type of the target device according to the sub-health log includes: under the condition that the matching result indicates that the sub-health log exists, matching the sub-health log with a preset white list; and determining the sub-health type of the target equipment according to the sub-health log under the condition that the sub-health log does not exist in the white list.
In an exemplary embodiment, matching the log package with a preset sub-health risk log to obtain a matching result includes: the following operations are performed on each log in the log package, and the log when the following operations are performed is called a current log: matching the current log with each risk item in the sub-health risk log, wherein a plurality of risk items are recorded in the sub-health risk log, and the risk items are used for indicating that sub-health risks exist in the log matched with the risk items; and under the condition that the current log is matched with the current risk item in the sub-health risk log, determining that the current log has sub-health risk indicated by the current risk item.
The method and the system need to ensure efficiency and therefore need to be fully automatic; they must not interfere with daily operation and maintenance or testing or increase its cost, and therefore need to run in the background and work with a notification mechanism; and they need to monitor 'sub-health' rather than 'failure', and therefore cannot rely on alarms and need to use logs or operation and maintenance commands.
The distinction between sub-health and failure is explained as follows:
when a hard disk has bad blocks, the system can tolerate and repair them within a certain proportion; this means the system has a sub-health hidden danger, but if it is not handled in time and keeps deteriorating, it eventually becomes a failure of the whole hard disk;
when the system has trace leakage, or retries caused by bottlenecks such as internal resource concurrency, the effect on the system is only a risk and belongs to sub-health; but if the sub-health is not handled in time, the amount leaked or the time consumed by retries eventually affects the service and becomes a failure.
The core of the scheme is to build an 'automatic, background, intelligent, trigger-based, early-warning' sub-health monitoring system. Storage devices are connected to the system, the system automatically and periodically fetches the storage logs in the background, and risks are matched according to preset 'sub-health' monitoring methods.
For example, a list of sub-health logs is monitored, that is, which log print represents which sub-health risk. How each type of print is handled can also be specified: a print may trigger early warning and problem location immediately, or a judgment rule such as 'detected XX times within XX time' may be defined.
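A judgment rule of the form 'detected XX times within XX time' can be sketched as a sliding-window counter. The class name, its parameters, and the use of timestamps in seconds are assumptions for illustration; the source does not specify how the rule is implemented.

```python
from collections import deque


class WindowedCounter:
    """Report when an event has been seen max_count times within window_seconds."""

    def __init__(self, max_count, window_seconds):
        self.max_count = max_count
        self.window_seconds = window_seconds
        self._times = deque()

    def record(self, timestamp):
        """Record one occurrence; return True once the rule is satisfied."""
        self._times.append(timestamp)
        # Drop occurrences that have fallen out of the time window.
        while timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.max_count


# Hypothetical rule: the same sub-health print 3 times within 60 seconds.
rule = WindowedCounter(max_count=3, window_seconds=60)
results = [rule.record(t) for t in (0, 10, 100, 110, 120)]
```

In this example only the last occurrence satisfies the rule, because the first two prints are more than 60 seconds older than the later ones.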
If a risk is detected, the person responsible for the environment is notified by instant message or mail to locate and check it (in some scenarios the issue can be raised automatically). The monitoring needs to be able to find hidden dangers inside the equipment, such as accumulated error counts and abnormal resources, and to expose them at the hidden danger stage rather than after they have deteriorated into externally visible phenomena, as shown in FIG. 3.
The overall idea of the scheme is to establish a background monitoring platform that analyzes the logs of storage and other devices to identify risks, analyzes hidden danger logs or command query results to predict risks, and notifies operation and maintenance personnel of the anomalies or hidden dangers found by alarm mail or message. Running in the background avoids interfering with the equipment, and trigger-based reminders spare operation and maintenance personnel from watching the equipment or the monitoring platform for long periods. The monitoring platform periodically collects the log packages of all environments and matches the log records in each log package, which is essentially a search that checks whether the monitored log records are present in the log package.
To provide this capability, the monitoring platform scheme mainly comprises the following three key technical points.
Key technical point one, background log collection analysis
The unified platform for background log collection and analysis periodically collects the logs of registered devices and analyzes them according to the monitoring methods. Many monitoring methods resemble regular expressions; for example, if 'sig XX fdsa YY' appears, an internal process reset is considered to have occurred. The log is then parsed according to that rule: XX and YY may be arbitrary information in the log, the keywords sig and fdsa in the log record are matched, and XX can be extracted as the specific sig type to support problem location.
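The 'sig XX fdsa YY' rule can be written as an actual regular expression. The sketch below uses Python's standard `re` module; the group names, the assumption that XX and YY contain no whitespace, and the sample log line are hypothetical.

```python
import re

# 'sig XX fdsa YY': XX and YY are arbitrary tokens, sig and fdsa are fixed keywords.
PROCESS_RESET = re.compile(r"\bsig\s+(?P<sig>\S+)\s+fdsa\s+(?P<detail>\S+)")


def parse_process_reset(line):
    """Return the extracted fields if the line indicates an internal process reset."""
    match = PROCESS_RESET.search(line)
    return match.groupdict() if match else None


# Hypothetical log line, for illustration only.
fields = parse_process_reset("2023-12-18 10:00:01 sig 11 fdsa worker-3")
```

The `sig` group carries the specific sig type that the text mentions as useful for locating the problem.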
The system obtains device logs in the background according to a timing policy, supporting both one-click log export and pulling debugging logs through the test mode provided by the device. Device operation and maintenance personnel or testers do not perceive the monitoring system during daily work (apart from the device access records), and it adds no extra workload.
The analysis engine of the monitoring platform scans all logs according to the monitoring methods to find problems in them, or invokes operation and maintenance commands to query hidden dangers. After a risk is identified, it is matched against a white list to reduce false alarms, and the problem is then reported to the operation and maintenance personnel of the device by mail or instant message, triggering manual confirmation and location of the problem.
An early warning is not necessarily a real problem; it may reflect normal behavior of the system. For example, suppose the rule is that a resource whose usage exceeds 80% may have a leakage problem and raises an early warning. If a particular resource is special in that its usage is directly 100%, it triggers the rule without actually leaking, which is a false positive. That resource is then put on the white list, and when the system detects that it exceeds the threshold, the false positive is suppressed automatically. Other log monitoring follows the same logic: even though an anomaly defined by a rule is detected, items on the white list are excluded because they are false positives rather than real risks.
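The white list example can be sketched as follows, using the 80% rule from the text. The resource names, the function name, and the representation of usage as a fraction are assumptions for illustration.

```python
def leak_warnings(usage_by_resource, whitelist, threshold=0.80):
    """Return resources that exceed the usage rule and are not white-listed."""
    return sorted(name
                  for name, usage in usage_by_resource.items()
                  if usage > threshold and name not in whitelist)


# Hypothetical resources: cache_pool always runs at 100% and is white-listed.
usage = {"cache_pool": 1.00, "session_table": 0.85, "io_buffers": 0.40}
warnings = leak_warnings(usage, whitelist={"cache_pool"})
```

Here only `session_table` is reported: `cache_pool` exceeds the threshold but is excluded by the white list, and `io_buffers` is below the threshold.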
Key technical point II, hidden danger analysis and early warning
This technical point extends the scanning and analysis capability: besides scanning and reporting existing alarms and problem logs, the platform also provides risk early warning.
First, a hidden danger log system is established for each feature and module to form a monitoring baseline, and a white list mechanism is established from the conclusions of daily analysis. For example, if a certain type of processing anomaly affects the system only after accumulating 10 times, but the system prints a log record each time the anomaly is triggered, that log record can be included in the monitoring baseline as a hidden danger log. Logs that are triggered manually, or that necessarily appear in specific environments, can be included in the white list to reduce false alarms.
There are two hidden danger judging mechanisms. In a stably running environment, risk can be judged by log level: if Error logs are not allowed, any Error log found is reported without analyzing its specific content. In a test environment, specific log content is matched against the monitoring methods to identify risks intelligently (a hidden danger log is a log whose printing means the software or hardware has a problem or risk, not a fault log; this avoids misjudging the prints produced by fault tests, such as error codes, error retries, or abnormal log records without a threshold). The hidden danger logs in this method are the log records corresponding to sub-health risks; their appearance in any environment indicates that a risk exists, whether or not test operations such as fault injection have been performed on the environment.
The Error log rule is the other hidden danger judging logic. It refers to the level of a log record: in principle, an environment that runs stably (without any manually injected faults or similar operations) should not produce any Error-level log (an error report, not a plain record). If an Error log appears, there is a problem inside the system, and the log no longer needs to be matched against specific sub-health logs. This rule is only suitable for stably running environments without manual test actions.
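The two judging mechanisms can be contrasted in a short sketch. It reads the text as follows, which is an interpretation rather than something the source states as an algorithm: content matching against hidden danger logs applies in every environment, and the Error-level rule applies in addition only when the environment is stable. The record format, environment labels, and keywords are hypothetical.

```python
def find_risks(log_records, environment, hidden_danger_keywords):
    """Return the messages judged risky for the given environment.

    log_records is a list of (level, message) pairs; environment is
    "stable" or "test" (labels assumed for this sketch).
    """
    risks = []
    for level, message in log_records:
        # Mechanism 1: in a stable environment any Error-level log is reported as is.
        if environment == "stable" and level == "ERROR":
            risks.append(message)
        # Mechanism 2: content matching against hidden danger logs, in any environment.
        elif any(keyword in message for keyword in hidden_danger_keywords):
            risks.append(message)
    return risks


# Hypothetical records and keywords, for illustration only.
records = [
    ("ERROR", "disk retry failed"),
    ("INFO", "checksum mismatch on block 7"),
    ("INFO", "heartbeat ok"),
]
keywords = ["checksum mismatch"]
```

With these inputs, the Error record is reported only for the stable environment, which matches the point that Error prints caused by fault tests should not be misjudged in a test environment.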
Key technical point three, command registration extension mechanism
The technical point is that the capability of command and echo analysis is increased on the basis of the log analysis capability of the monitoring platform, the possible problem of insufficient log or insufficient timeliness is solved, but the access to equipment is increased, and the background is still not needed to be operated or executed by testers.
Taking the problem of resource leakage as an example, the problem that the equipment may print in mass and possibly wash out the history log is solved, the resource calling or the utilization rate can hardly be early-warned through the log, and certain commands need to be executed to check the utilization condition of the memory or various resources to carry out comparison and judgment. For example, an operation in the running process applies for 2M memory but releases only 1M after running out, 1M is leaked, and no one can apply for 1M to be equivalent to being lost. If more such operations, say 1000 times, are used here, this results in a 1G memory reduction, which can be significant and even result in an insufficient memory system reset. In addition to code level testing or scan protection, the resources of the device in daily use often continue to change and it is difficult to directly determine whether a leak exists.
Resource leakage is caused by problems in applying for and releasing resources. These operations are often performed frequently within a given function. On the one hand, printing a log for each operation would affect the calling performance of the function; on the other hand, application and release may occur tens of thousands of times in a short period, so printing one record per operation would produce tens of thousands of records in a short time, and other system logs (that is, the ordinary logs) would be almost entirely crowded out or overwritten.
The platform supports registering resource query commands and a resource monitoring list, and establishes a detection and early-warning method. The current occupation or utilization of each resource is collected through debugging commands, and periodic trend comparisons are made to predict risk. For example, if the current resource utilization is 50% and consuming the remaining 50% within 10 years is regarded as possibly abnormal, the corresponding amount leaked every two weeks is taken as the growth-trend criterion. If the growth exceeds this criterion, or exceeds it for X consecutive periods, a leakage risk is considered to exist and an abnormality is reported. The criterion and the threshold can be customized and adjusted at any time. In the simplest form, the proportion of resources in use is checked periodically; if the available memory keeps decreasing, a leak is likely.
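The trend comparison described above can be sketched in Python as follows. The function names, the assumption of 26 two-week periods per year, the sample values and the choice of X = 3 are illustrative only and are not taken from the embodiment.

```python
def baseline_growth_per_period(remaining_fraction, horizon_years, periods_per_year=26):
    """Growth per period that would consume the remaining resource within the horizon."""
    return remaining_fraction / (horizon_years * periods_per_year)


def leak_suspected(samples, baseline, consecutive_periods):
    """samples: utilization fractions, one per period, oldest first.

    Returns True when utilization grew by more than the baseline for at
    least `consecutive_periods` periods in a row.
    """
    streak = 0
    for previous, current in zip(samples, samples[1:]):
        streak = streak + 1 if current - previous > baseline else 0
        if streak >= consecutive_periods:
            return True
    return False


# 50% remaining, 10-year horizon: roughly 0.19 percentage points per two weeks.
baseline = baseline_growth_per_period(0.50, 10)
steady = [0.500, 0.501, 0.500, 0.501, 0.500]
leaking = [0.500, 0.505, 0.510, 0.516, 0.521]
print(leak_suspected(steady, baseline, consecutive_periods=3))
print(leak_suspected(leaking, baseline, consecutive_periods=3))
```

Requiring several consecutive periods above the baseline is one way to tolerate normal fluctuation in daily resource usage; both the baseline and the number of periods remain adjustable, as the text notes.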
Fig. 4 is a schematic diagram of the overall flow. The scheme performs log analysis and early warning entirely in the background, so the daily operation of the equipment is not affected and the workload of operation and maintenance personnel is not increased, while various problems and even hidden dangers (sub-health) can be discovered in advance through the early-warning mechanism and the command registration mechanism.
1. Background intelligent analysis capability: observation points that would otherwise be checked on a single device during operation, maintenance and testing are converted into background analysis of logs, which effectively improves the sufficiency of observation and reduces cost;
2. Sub-health early-warning analysis capability: an early-warning baseline can be formulated from logs, commands and the like, so that hidden dangers of the equipment are analyzed and judged in advance, the sub-health state of the equipment is exposed early, and, by handling it in advance as required, the risk incurred when the equipment actually triggers a problem is reduced;
3. Triggered notification capability: personnel do not need to watch the monitoring platform at all times; the platform intelligently locates and judges risks according to the analysis results and, once a problem is confirmed, sends an email or message to operation and maintenance personnel to prompt them to intervene;
The advantages of the invention are mainly reflected in two aspects, space and time. In space, combined with the actual situation, the scheme effectively reduces operation and maintenance cost while ensuring that the observation points are sufficient. In time, hidden dangers of the equipment can be discovered earlier, and the triggered alarm mechanism frees up the energy of operation and maintenance personnel, effectively improving the effect and efficiency of operation, maintenance and testing.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
In this embodiment, a device for monitoring sub-health of equipment is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, which have been described and will not be repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 5 is a block diagram of an apparatus for monitoring sub-health of equipment according to an embodiment of the present invention. As shown in Fig. 5, the apparatus includes: an acquisition module 52, configured to periodically collect logs of the target equipment through the monitoring platform to obtain a log package; a matching module 54, configured to match the log package with a preset sub-health risk log to obtain a matching result, where the matching result is used to indicate whether a sub-health log exists in the log package; and a determining module 56, configured to determine the sub-health type of the target equipment according to the sub-health log if the matching result indicates that the sub-health log exists.
In an exemplary embodiment, the foregoing apparatus is further configured to determine that the sub-health type of the target device is an internal concurrency bottleneck if the sub-health log indicates that the number of retries of the error reporting reaches a first preset value; determining that the sub-health type of the target equipment is data verification abnormality under the condition that the sub-health log indicates that the verification value is wrong; and determining the sub-health type of the target equipment as resource leakage under the condition that the sub-health log indicates that resources continuously decrease.
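The type determination described above can be sketched in Python. The `SubHealthLog` fields, the value chosen for `FIRST_PRESET_VALUE` and the order in which the three conditions are checked are assumptions for illustration; the text does not state how a log meeting several conditions at once is handled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SubHealthType(Enum):
    INTERNAL_CONCURRENCY_BOTTLENECK = "internal concurrency bottleneck"
    DATA_VERIFICATION_ABNORMALITY = "data verification abnormality"
    RESOURCE_LEAKAGE = "resource leakage"


@dataclass
class SubHealthLog:
    # Hypothetical summary fields extracted from a matched sub-health log.
    error_retry_count: int = 0
    check_value_wrong: bool = False
    resources_continuously_decreasing: bool = False


FIRST_PRESET_VALUE = 3  # assumed retry threshold, configurable in practice


def determine_sub_health_type(log: SubHealthLog) -> Optional[SubHealthType]:
    """Map the indications in a sub-health log to a sub-health type."""
    if log.error_retry_count >= FIRST_PRESET_VALUE:
        return SubHealthType.INTERNAL_CONCURRENCY_BOTTLENECK
    if log.check_value_wrong:
        return SubHealthType.DATA_VERIFICATION_ABNORMALITY
    if log.resources_continuously_decreasing:
        return SubHealthType.RESOURCE_LEAKAGE
    return None


print(determine_sub_health_type(SubHealthLog(error_retry_count=4)))
print(determine_sub_health_type(SubHealthLog(check_value_wrong=True)))
```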
In an exemplary embodiment, the foregoing apparatus is further configured to send a first prompting message if the internal concurrent bottleneck number reaches a second preset value; and sending a second prompt message under the condition that the number of times of data verification abnormality reaches a third preset value.
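A minimal Python sketch of the count-based prompting follows. The threshold values, the string labels and the function name `prompts_to_send` are assumptions; the text only requires that each count be compared with its own preset value.

```python
from collections import Counter

SECOND_PRESET_VALUE = 5  # assumed threshold for internal concurrency bottlenecks
THIRD_PRESET_VALUE = 2   # assumed threshold for data verification abnormalities


def prompts_to_send(detected_types):
    """detected_types: sub-health type names determined so far, one per occurrence."""
    counts = Counter(detected_types)
    prompts = []
    if counts["internal concurrency bottleneck"] >= SECOND_PRESET_VALUE:
        prompts.append("first prompt message")
    if counts["data verification abnormality"] >= THIRD_PRESET_VALUE:
        prompts.append("second prompt message")
    return prompts


history = ["internal concurrency bottleneck"] * 5 + ["data verification abnormality"]
print(prompts_to_send(history))
```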
In an exemplary embodiment, the foregoing apparatus is further configured to, in a case where the sub-health log indicates that resources continue to decrease, invoke, by a preset instruction, a resource usage record of the target device; and sending a third prompt message under the condition that the resource usage record indicates that the duration of the continuous reduction of the resource of the target equipment is greater than or equal to a preset duration threshold.
In an exemplary embodiment, the foregoing apparatus is further configured to, in a case where the sub-health log indicates that resources continue to decrease, invoke, by a preset instruction, a resource usage record of the target device; and sending a fourth prompting message under the condition that the resource usage record indicates that the resource usage rate of the target equipment is larger than or equal to a preset usage rate threshold value within a preset time range.
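The two checks on the resource usage record (duration of continuous reduction, and usage rate within a preset time range) can be sketched in Python. The record layout of `(timestamp_seconds, used_percent)` pairs, the function names and all threshold values are assumptions; rising usage is treated here as the resource continuing to decrease.

```python
def shrinking_duration(records):
    """Seconds covered by the most recent run of samples in which usage kept rising.

    records: list of (timestamp_seconds, used_percent), oldest first.
    """
    if len(records) < 2:
        return 0
    end = start = records[-1][0]
    # Walk backwards over consecutive (previous, current) pairs.
    for (t_prev, used_prev), (_, used_cur) in zip(
        reversed(records[:-1]), reversed(records[1:])
    ):
        if used_cur <= used_prev:
            break
        start = t_prev
    return end - start


def prompts_for_record(records, now, duration_threshold, usage_threshold, time_range):
    """Return the prompt messages triggered by a resource usage record."""
    prompts = []
    if shrinking_duration(records) >= duration_threshold:
        prompts.append("third prompt message")
    recent = [used for t, used in records if 0 <= now - t <= time_range]
    if any(used >= usage_threshold for used in recent):
        prompts.append("fourth prompt message")
    return prompts


records = [(0, 40), (3600, 45), (7200, 50), (10800, 55)]
print(prompts_for_record(records, now=10800, duration_threshold=7200,
                         usage_threshold=50, time_range=3600))
```

The two conditions are evaluated independently in this sketch, so a single record can trigger either prompt or both.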
In an exemplary embodiment, the foregoing apparatus is further configured to match the sub-health log with a preset whitelist if the matching result indicates that the sub-health log exists; and determining the sub-health type of the target equipment according to the sub-health log under the condition that the sub-health log does not exist in the white list.
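The white-list step can be sketched in Python as a filter applied before type determination. Representing white-list entries as substrings, and the example entry itself, are assumptions for illustration.

```python
# Hypothetical white-list entry: a log fragment known to be harmless.
WHITELIST = ["retry during scheduled failover drill"]


def logs_needing_classification(sub_health_logs, whitelist=WHITELIST):
    """Keep only the sub-health logs that match no white-list entry."""
    return [
        log for log in sub_health_logs
        if not any(entry in log for entry in whitelist)
    ]


matched = [
    "WARN retry during scheduled failover drill, attempt 3",
    "WARN checksum mismatch on block 17",
]
print(logs_needing_classification(matched))
```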
In an exemplary embodiment, the above apparatus is further configured to perform the following operation on each log in the log packet, where the log when performing the following operation is referred to as a current log: matching the current log with each risk item in the sub-health risk log, wherein a plurality of risk items are recorded in the sub-health risk log, and the risk items are used for indicating that sub-health risks exist in the log matched with the risk items; and under the condition that the current log is matched with the current risk item in the sub-health risk log, determining that the current log has sub-health risk indicated by the current risk item.
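The per-log matching against risk items can be sketched in Python. Modelling a risk item as a name plus a regular expression, and the two example items, are assumptions; the text does not specify how risk items are stored or compared.

```python
import re
from typing import NamedTuple


class RiskItem(NamedTuple):
    name: str            # the sub-health risk the item indicates
    pattern: re.Pattern  # hypothetical matching rule for the item


def match_log_package(log_package, risk_items):
    """Return (log, risk name) pairs for every log that matches a risk item."""
    findings = []
    for current_log in log_package:
        for current_item in risk_items:
            if current_item.pattern.search(current_log):
                findings.append((current_log, current_item.name))
    return findings


risk_items = [
    RiskItem("error retry", re.compile(r"retry \d+ failed")),
    RiskItem("check value error", re.compile(r"checksum mismatch")),
]
log_package = [
    "INFO heartbeat ok",
    "WARN retry 3 failed for request 42",
    "WARN checksum mismatch on block 17",
]
for log, risk in match_log_package(log_package, risk_items):
    print(risk, "<-", log)
```

An empty result corresponds to a matching result indicating that no sub-health log exists in the log package.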
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.
Embodiments of the present invention also provide a computer readable storage medium having a computer program stored therein, wherein the computer program when executed by a processor implements the steps of the method described in any of the above.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or various other media capable of storing a computer program.
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.
For specific examples of this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations; they are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of monitoring the sub-health of a device, comprising:
the method comprises the steps of regularly collecting logs of target equipment through a monitoring platform to obtain log packages;
matching the log package with a preset sub-health risk log to obtain a matching result, wherein the matching result is used for indicating whether the sub-health log exists in the log package;
and if the matching result indicates that the sub-health log exists, determining the sub-health type of the target equipment according to the sub-health log.
2. The method of claim 1, wherein determining the sub-health type of the target device from the sub-health log comprises:
determining that the sub-health type of the target equipment is an internal concurrency bottleneck under the condition that the sub-health log indicates that the number of times of retrying error reporting reaches a first preset value;
determining that the sub-health type of the target equipment is data verification abnormality under the condition that the sub-health log indicates that the verification value is wrong;
and determining the sub-health type of the target equipment as resource leakage under the condition that the sub-health log indicates that resources continuously decrease.
3. The method according to claim 2, wherein the method further comprises:
sending a first prompt message under the condition that the internal concurrent bottleneck times reach a second preset value;
and sending a second prompt message under the condition that the number of times of data verification abnormality reaches a third preset value.
4. The method according to claim 2, wherein the method further comprises:
under the condition that the sub-health log indicates that resources are continuously reduced, calling the resource usage record of the target equipment through a preset instruction;
and sending a third prompt message under the condition that the resource usage record indicates that the duration of the continuous reduction of the resource of the target equipment is greater than or equal to a preset duration threshold.
5. The method according to claim 2, wherein the method further comprises:
under the condition that the sub-health log indicates that resources are continuously reduced, calling the resource usage record of the target equipment through a preset instruction;
and sending a fourth prompting message under the condition that the resource usage record indicates that the resource usage rate of the target equipment is larger than or equal to a preset usage rate threshold value within a preset time range.
6. The method according to any one of claims 1 to 5, wherein, in case the matching result indicates that the sub-health log exists, determining the sub-health type of the target device from the sub-health log comprises:
under the condition that the matching result indicates that the sub-health log exists, matching the sub-health log with a preset white list;
and determining the sub-health type of the target equipment according to the sub-health log under the condition that the sub-health log does not exist in the white list.
7. The method according to any one of claims 1 to 5, wherein matching the log package with a preset sub-health risk log to obtain a matching result includes:
the following operations are performed on each log in the log package, and the log when the following operations are performed is called a current log:
matching the current log with each risk item in the sub-health risk log, wherein a plurality of risk items are recorded in the sub-health risk log, and the risk items are used for indicating that sub-health risks exist in the log matched with the risk items;
and under the condition that the current log is matched with the current risk item in the sub-health risk log, determining that the current log has sub-health risk indicated by the current risk item.
8. An apparatus for monitoring the sub-health of a device, comprising:
the acquisition module is used for periodically collecting logs of the target equipment through the monitoring platform to obtain a log package;
the matching module is used for matching the log package with a preset sub-health risk log to obtain a matching result, wherein the matching result is used for indicating whether the sub-health log exists in the log package;
and the determining module is used for determining the sub-health type of the target equipment according to the sub-health log when the matching result indicates that the sub-health log exists.
9. A computer readable storage medium, characterized in that a computer program is stored in the computer readable storage medium, wherein the computer program, when being executed by a processor, implements the steps of the method according to any of the claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of the claims 1 to 7.
CN202311745637.9A 2023-12-18 2023-12-18 Method and device for monitoring sub-health of equipment Pending CN117743087A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311745637.9A CN117743087A (en) 2023-12-18 2023-12-18 Method and device for monitoring sub-health of equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311745637.9A CN117743087A (en) 2023-12-18 2023-12-18 Method and device for monitoring sub-health of equipment

Publications (1)

Publication Number Publication Date
CN117743087A true CN117743087A (en) 2024-03-22

Family

ID=90253936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311745637.9A Pending CN117743087A (en) 2023-12-18 2023-12-18 Method and device for monitoring sub-health of equipment

Country Status (1)

Country Link
CN (1) CN117743087A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination