CN117271267A - Remote monitoring system and method for server hardware - Google Patents

Publication number
CN117271267A
Authority
CN
China
Prior art keywords
band
monitoring
event
target server
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311345832.2A
Other languages
Chinese (zh)
Inventor
钟阳
曾泓瀚
邵伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuxi Semiconductor Shenzhen Co ltd
Original Assignee
Fuxi Semiconductor Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuxi Semiconductor Shenzhen Co ltd filed Critical Fuxi Semiconductor Shenzhen Co ltd
Priority to CN202311345832.2A priority Critical patent/CN117271267A/en
Publication of CN117271267A publication Critical patent/CN117271267A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3055Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3089Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a remote monitoring system and a remote monitoring method for server hardware. Monitoring data of a target server, comprising in-band monitoring data and out-of-band monitoring data, are acquired and stored in real time. When the target server is in a hardware-level abnormal state and a user is detected to have input an operation instruction for remotely controlling the target server, the in-band and out-of-band monitoring data of the target server within the state anomaly period are acquired, and in-band risk features and out-of-band risk features are extracted from them. An associated risk intensity analysis is performed on the remote operation instruction, the in-band risk features and the out-of-band risk features, and whether to send the remote operation instruction is determined according to the result of that analysis, which can improve the safety of the server.

Description

Remote monitoring system and method for server hardware
Technical Field
The invention relates to the technical field of server hardware monitoring, in particular to a remote monitoring system and a remote monitoring method for server hardware.
Background
A BMC (Baseboard Management Controller) system is a dedicated system for monitoring and managing the hardware condition of a server, such as its health, performance, functions and space. It is a low-power microcontroller system, integrated on the server mainboard, that is independent of the software and hardware implementing the server's own functions. More specifically, the BMC system is responsible for recording server information, monitoring server status, remotely controlling the server, and maintenance and management. The BMC system exposes an external access interface through the IPMI (Intelligent Platform Management Interface) protocol, through which a user or an external program can conveniently monitor and manage the server. Although the environment of a modern server room is much better than that of a traditional one, conditions such as temperature, noise and air quality are still unsuitable for operation and maintenance personnel to work inside for long periods; the BMC system lets them monitor and maintain the server remotely, away from the server room. An existing BMC system provides status data such as the temperature, voltage and fan speed of the server hardware, together with a function menu for remotely operating the server. When the server hardware fails, however, directly performing remote control is not appropriate in every case, and an improper operation is likely to cause secondary or even permanent damage to the server.
Disclosure of Invention
Based on the above problems, the invention provides a remote monitoring system and a monitoring method for server hardware, which can improve the security of a server.
In view of this, a first aspect of the present invention proposes a remote monitoring system for server hardware, comprising a plurality of target servers as monitoring targets and a monitoring server connected to the target servers for remotely monitoring the target servers, the target servers comprising a business service subsystem for running a business service program and a baseboard management subsystem for monitoring the target servers, the business service subsystem comprising a first processing unit, a first storage unit, a first power supply unit and a first communication unit, the baseboard management subsystem comprising a second processing unit, a second storage unit, a second power supply unit, a second communication unit and a sensing unit provided on each device of the business service subsystem for acquiring status data of each device of the business service subsystem, the status data includes temperature data and/or voltage data of each device of the business service subsystem, the first processing unit is used for running an in-band monitoring program stored in the first storage unit to acquire in-band monitoring data of the target server, the second processing unit is used for running an out-of-band monitoring program stored in the second storage unit to acquire out-of-band monitoring data of the target server through the sensing unit, the monitoring server includes a third processing unit, a third storage unit, a third power supply unit and a third communication unit, the monitoring server establishes communication connection with the first communication unit and the second communication unit through the third communication unit to acquire the in-band monitoring data and the out-of-band monitoring data, the third processing unit is configured to:
acquiring and storing monitoring data of a target server in real time, wherein the monitoring data comprises in-band monitoring data and out-of-band monitoring data of the target server;
judging whether the target server is in an abnormal state of a hardware level according to the monitoring data;
when the target server is in an abnormal state of a hardware level and an operation instruction for remotely controlling the target server is detected to be input by a user, determining a state abnormal period of the target server;
acquiring in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
extracting in-band risk features and out-of-band risk features from in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
carrying out associated risk intensity analysis on the remote operation instruction, the in-band risk feature and the out-of-band risk feature;
and determining whether to send the remote operation instruction according to the analysis result of the association risk intensity analysis.
A second aspect of the present invention proposes a method for remote monitoring of server hardware, comprising:
acquiring and storing monitoring data of a target server in real time, wherein the monitoring data comprises in-band monitoring data and out-of-band monitoring data of the target server;
judging whether the target server is in an abnormal state of a hardware level according to the monitoring data;
when the target server is in an abnormal state of a hardware level and an operation instruction for remotely controlling the target server is detected to be input by a user, determining a state abnormal period of the target server;
acquiring in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
extracting in-band risk features and out-of-band risk features from in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
carrying out associated risk intensity analysis on the remote operation instruction, the in-band risk feature and the out-of-band risk feature;
and determining whether to send the remote operation instruction according to the analysis result of the association risk intensity analysis.
Further, in the above method for remote monitoring of server hardware, the step of determining whether the target server is in an abnormal state of a hardware level according to the monitoring data specifically includes:
acquiring a preconfigured abnormal monitoring period;
periodically reading out-of-band monitoring data of the target server in the last abnormal monitoring period;
judging whether an abnormal event of a hardware level exists in out-of-band monitoring data in a previous abnormal monitoring period, wherein the abnormal event of the hardware level comprises an automatic restarting event, an automatic shutdown event, a dead halt event and/or an out-of-band monitoring index super-threshold event;
and when one or more hardware-level abnormal events exist in the out-of-band monitoring data in the previous abnormal monitoring period, determining that the target server is in an abnormal state.
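The periodic check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `OobEvent` record, the event-kind labels and the function name are all hypothetical, and the out-of-band data are assumed to be already available as a list of timestamped events.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Event kinds named in the text; the string labels themselves are hypothetical.
HARDWARE_EVENT_KINDS = {
    "auto_restart",
    "auto_shutdown",
    "hang",
    "index_over_threshold",
}


@dataclass(frozen=True)
class OobEvent:
    """One out-of-band event record (hypothetical structure)."""
    kind: str
    time: datetime


def is_hardware_abnormal(events, now, period):
    """Return True if any hardware-level event lies in the last monitoring period."""
    window_start = now - period
    return any(
        e.kind in HARDWARE_EVENT_KINDS and window_start <= e.time <= now
        for e in events
    )
```

A scheduler would call `is_hardware_abnormal` once per preconfigured monitoring period for each target server.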
Further, in the above method for remote monitoring of server hardware, the step of determining the abnormal state period of the target server specifically includes:
determining a time at which a user input of an operation instruction for remotely controlling the target server is detected as an instruction input time;
determining a critical in-band event that causes the target server to be in an abnormal state at a hardware level;
acquiring the occurrence time of the critical in-band event, and determining the occurrence time of the critical in-band event as a first occurrence time;
traversing the in-band monitoring data of the target server by taking the first occurrence time as a starting point;
determining associated events associated with the critical in-band event in the in-band monitoring data;
when the number of the associated events is equal to 1, determining the occurrence time of the associated event as a second occurrence time;
when the number of the associated events is greater than 1, determining the earliest occurrence time among the associated events as the second occurrence time;
determining the period between the second occurrence time and the instruction input time as the state anomaly period of the target server.
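The selection of the second occurrence time can be sketched as below. The sketch assumes the associated events have already been identified (the text does not specify how association is decided) and that at least one exists; the function name and arguments are hypothetical.

```python
from datetime import datetime


def state_anomaly_period(associated_times, instruction_time):
    """Return (start, end) of the state anomaly period.

    associated_times: occurrence times of the associated in-band events.
    instruction_time: time at which the user's remote instruction was detected.
    """
    if not associated_times:
        raise ValueError("at least one associated event is required")
    # With one associated event, min() returns it; with several, the earliest.
    second_time = min(associated_times)
    return second_time, instruction_time
```

Using `min` covers both the single-event and the multi-event case in one expression.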
Further, in the above method for remote monitoring of server hardware, the step of determining a critical in-band event that causes the target server to be in an abnormal state at a hardware level specifically includes:
determining the abnormal event of the hardware layer as a target event;
reading historical in-band and out-of-band data of the same type of target servers, wherein the same type of target servers comprise servers with the same hardware configuration scheme and running the same type of business service programs in the same type of operating systems;
judging whether the target event exists in the historical out-of-band data;
when the target event exists in the historical out-of-band data, acquiring a fourth occurrence time of each occurrence of the target event from the historical out-of-band data;
determining the time period of one abnormality monitoring period immediately before each fourth occurrence time as a statistical period;
determining an in-band event that is present in every statistical period as the critical in-band event.
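One way to read these steps is as a set intersection over the statistical periods. The sketch below assumes that reading, i.e. that an event counts as critical only if it appears in the window before every historical occurrence of the target event. The data layout (lists of `(time, event_name)` tuples) and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def critical_in_band_events(in_band_history, target_times, period):
    """Return the in-band event names common to all statistical periods.

    in_band_history: list of (time, event_name) from same-type servers.
    target_times: "fourth occurrence times" of the target hardware event.
    period: length of one abnormality monitoring period.
    """
    common = None
    for t in target_times:
        # Events observed in the monitoring period just before occurrence t.
        window = {name for ts, name in in_band_history if t - period <= ts < t}
        common = window if common is None else common & window
    return common or set()
```

If the target event never occurred in the history, the function returns an empty set.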
Further, in the above method for remote monitoring of server hardware, the step of extracting in-band risk features and out-of-band risk features from in-band monitoring data and out-of-band monitoring data of the target server in the abnormal state period specifically includes:
screening a read-write event list from the in-band monitoring data of the target server in the state anomaly period, and screening an out-of-band monitoring index super-threshold event list from the out-of-band monitoring data of the target server in the state anomaly period, wherein the read-write event list comprises events in which an application program running in the target server reads or writes files or databases, and the out-of-band monitoring index super-threshold event list comprises super-threshold events of any out-of-band monitoring index of any component in the target server;
judging whether the read-write events in the read-write event list and the out-of-band monitoring index super-threshold events in the out-of-band monitoring index super-threshold event list have a persistence feature;
when any read-write event in the read-write event list has the persistence feature, determining the associated parameters of the corresponding read-write event as the in-band risk feature;
and when any out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list has the persistence feature, determining the associated parameters of the corresponding out-of-band monitoring index super-threshold event as the out-of-band risk feature.
Further, in the above method for remote monitoring of server hardware, the step of determining whether the read-write event in the read-write event list and the out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list have the persistence feature specifically includes:
acquiring the associated parameters of each read-write event in the read-write event list;
estimating the duration $\Delta tp_{rw,irw}$ of each read-write event according to the associated parameters, wherein $irw \in [1, n_{rwe}]$ and $n_{rwe}$ is the number of read-write events in the read-write event list;
calculating the time difference $\Delta td_{rw,irw}$ between the occurrence time of each read-write event and the current time;
when $\Delta tp_{rw,irw} > \Delta td_{rw,irw}$, determining the corresponding read-write event as a read-write event having the persistence feature.
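The comparison of estimated duration against elapsed time can be sketched as follows. The text does not say which associated parameters are used to estimate the duration; the sketch assumes, purely for illustration, a total byte count and a throughput. `RwEvent` and both function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RwEvent:
    """A read-write event with hypothetical associated parameters."""
    start: datetime
    total_bytes: int          # assumed parameter: amount of data to transfer
    bytes_per_second: float   # assumed parameter: observed throughput


def estimated_duration(event):
    """Estimate the event duration (delta tp) from its parameters."""
    return timedelta(seconds=event.total_bytes / event.bytes_per_second)


def rw_event_is_persistent(event, now):
    """True if the event is expected to still be running at time `now`."""
    elapsed = now - event.start            # delta td
    return estimated_duration(event) > elapsed
```

An event that is still expected to be in progress is the kind that a remote power operation could interrupt, which is why it is carried forward as a risk feature.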
Further, in the above method for remote monitoring of server hardware, the step of determining whether the read-write event in the read-write event list and the out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list have the persistence feature specifically includes:
acquiring the associated parameters of each out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list;
constructing an index value change curve of each out-of-band monitoring index from key-value pairs consisting of each monitored value of the index and the corresponding monitoring time;
determining the super-threshold periods $\Delta th_{iet,ith}$ on the index value change curve of each out-of-band monitoring index, wherein $iet \in [1, n_{ete}]$, $ith \in [1, n_{th,iet}]$, $n_{ete}$ is the number of out-of-band monitoring index super-threshold events in the out-of-band monitoring index super-threshold event list, and $n_{th,iet}$ is the number of super-threshold periods on the index value change curve of the $iet$-th out-of-band monitoring index super-threshold event;
calculating the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index:
obtaining the preconfigured super-threshold tolerance duration $\Delta tx_{iec,iin}$ of the corresponding component, wherein $iec \in [1, n_{ec}]$, $iin \in [1, n_{ind}]$, $n_{ec}$ is the number of monitored components on the baseboard of the target server, and $n_{ind}$ is the number of monitoring indexes of the $iec$-th component on the baseboard of the target server;
when $\Delta tp_{th,iet} > \sigma \cdot \Delta tx_{iec,iin}$, determining the corresponding out-of-band monitoring index super-threshold event as an out-of-band monitoring index super-threshold event having the persistence feature, wherein $\sigma \in (0, 1)$ is a preconfigured tolerance-duration ratio coefficient.
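The out-of-band persistence test can be sketched as below. Two assumptions are made that the text does not confirm: the super-threshold duration is taken as the sum of all super-threshold periods on the curve, and only an upper limit is checked rather than a full standard range. Function and parameter names are hypothetical, and times are plain seconds.

```python
def super_threshold_periods(samples, upper):
    """Return the lengths of the periods during which the value exceeds `upper`.

    samples: time-ordered list of (t_seconds, value) pairs, i.e. the
    key-value pairs that make up the index value change curve.
    """
    periods = []
    start = None
    for t, value in samples:
        if value > upper and start is None:
            start = t                      # curve crosses above the limit
        elif value <= upper and start is not None:
            periods.append(t - start)      # curve falls back below the limit
            start = None
    if start is not None:
        # Still above the limit at the last sample: close the open period there.
        periods.append(samples[-1][0] - start)
    return periods


def index_event_is_persistent(samples, upper, tolerance, sigma):
    """Compare the summed super-threshold time with sigma * tolerance."""
    total = sum(super_threshold_periods(samples, upper))  # assumed: a plain sum
    return total > sigma * tolerance
```

A smaller `sigma` makes the check stricter, flagging an index as persistent after a smaller share of the component's tolerance duration has been used up.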
Further, in the above remote monitoring method for server hardware, after the step of estimating the duration $\Delta tp_{rw,irw}$ of each read-write event according to the associated parameters, the method further comprises:
configuring the duration $\Delta tp_{rw,irw}$ of each read-write event as the feature duration of the corresponding in-band risk feature;
and after the step of calculating the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index, the method further comprises:
configuring the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index as the feature duration of the corresponding out-of-band risk feature.
Further, in the foregoing remote monitoring method of server hardware, the step of performing associated risk intensity analysis on the remote operation instruction, the in-band risk feature and the out-of-band risk feature specifically includes:
configuring a first feature risk intensity coefficient $\mu^{inter}_{iinter,ire}$ between each in-band risk feature and each remote operation instruction, wherein $iinter \in [1, n_{int}]$, $ire \in [1, n_{re}]$, $n_{int}$ is the number of in-band risk features preset in the system, and $n_{re}$ is the number of remote control instructions preset in the system;
configuring a second feature risk intensity coefficient $\mu^{outer}_{iouter,ire}$ between each out-of-band risk feature and each remote operation instruction, wherein $iouter \in [1, n_{out}]$ and $n_{out}$ is the number of out-of-band risk features preset in the system;
calculating the associated risk intensity of the remote operation instruction, the in-band risk feature and the out-of-band risk feature according to the feature risk intensity coefficients and the corresponding feature durations:
wherein $\Delta tp_{iinter}$ is the feature duration of the corresponding in-band risk feature and $\Delta tp_{iouter}$ is the feature duration of the corresponding out-of-band risk feature.
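The exact formula for the associated risk intensity is not reproduced in this text. The sketch below shows one possible combination, assumed only for illustration: each extracted feature contributes its coefficient for the given instruction multiplied by its feature duration, and the contributions are summed. The dictionary layout, the names and the zero default for unconfigured pairs are all hypothetical.

```python
def associated_risk_intensity(instruction, in_band, out_of_band, mu_inter, mu_outer):
    """Combine coefficients and feature durations into one risk value.

    in_band / out_of_band: dict mapping feature name -> feature duration (seconds).
    mu_inter / mu_outer: dict mapping (feature name, instruction) -> coefficient.
    The weighted sum used here is an assumption, not the patent's formula.
    """
    risk = 0.0
    for feature, duration in in_band.items():
        risk += mu_inter.get((feature, instruction), 0.0) * duration
    for feature, duration in out_of_band.items():
        risk += mu_outer.get((feature, instruction), 0.0) * duration
    return risk
```

Keying the coefficients by (feature, instruction) mirrors the two-index coefficients in the text: the same feature can be harmless for one remote instruction and risky for another.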
The invention provides a remote monitoring system and a remote monitoring method for server hardware. Monitoring data of a target server, comprising in-band monitoring data and out-of-band monitoring data, are acquired and stored in real time. When the target server is in a hardware-level abnormal state and a user is detected to have input an operation instruction for remotely controlling the target server, the in-band and out-of-band monitoring data of the target server within the state anomaly period are acquired, and in-band risk features and out-of-band risk features are extracted from them. An associated risk intensity analysis is performed on the remote operation instruction, the in-band risk features and the out-of-band risk features, and whether to send the remote operation instruction is determined according to the result of that analysis, which can improve the safety of the server.
Drawings
FIG. 1 is a schematic diagram of a remote monitoring system for server hardware according to one embodiment of the present invention;
fig. 2 is a flowchart of a remote monitoring method for server hardware according to an embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
In the description of the present invention, the term "plurality" means two or more, unless explicitly defined otherwise. The orientation or positional relationships indicated by the terms "upper", "lower", etc. are based on the orientation or positional relationships shown in the drawings. They are used merely for convenience and simplicity of description, do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. The terms "coupled," "mounted," "secured," and the like are to be construed broadly: a connection may, for example, be fixed, detachable or integral, and may be direct or indirect through an intermediate medium. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first", "second", etc. may explicitly or implicitly include one or more such features.
In the description of this specification, the terms "one embodiment," "some implementations," "particular embodiments," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
A remote monitoring system for server hardware and a monitoring method thereof according to some embodiments of the present invention are described below with reference to the accompanying drawings.
As shown in fig. 1, a first aspect of the present invention proposes a remote monitoring system for server hardware, including a plurality of target servers as monitoring targets and a monitoring server connected to the target servers for remotely monitoring them. Each target server includes a business service subsystem for running a business service program and a baseboard management subsystem for monitoring the target server. The business service subsystem includes a first processing unit, a first storage unit, a first power supply unit and a first communication unit. The baseboard management subsystem includes a second processing unit, a second storage unit, a second power supply unit, a second communication unit, and a sensing unit disposed on each device of the business service subsystem to acquire status data of that device, the status data comprising temperature data and/or voltage data. The first processing unit runs an in-band monitoring program stored in the first storage unit to acquire in-band monitoring data of the target server, and the second processing unit runs an out-of-band monitoring program stored in the second storage unit to acquire out-of-band monitoring data of the target server through the sensing unit. The monitoring server includes a third processing unit, a third storage unit, a third power supply unit and a third communication unit, and establishes communication connections with the first communication unit and the second communication unit through the third communication unit to acquire the in-band monitoring data and the out-of-band monitoring data.
Specifically, the target servers may be any servers installed centrally in a machine-room environment to provide business services to the internet, such as web servers, database servers and cloud computing servers. Typically, a large number of servers are installed in a machine room, and operation and maintenance personnel monitor the operation of the target servers through a monitoring server located outside the machine room. In the technical solution of the invention, an in-band monitoring program runs in the business service subsystem of the target server to generate the in-band monitoring data of the target server. In-band monitoring data are monitoring data obtained through the operating system of the business service subsystem and the management software (namely the in-band monitoring program) running in that operating system. They mainly concern the software level and comprise data directly related to the running state of the operating system and of application programs, such as operating system information, process information and business service program information. Meanwhile, the target server is a server integrated with a BMC (Baseboard Management Controller) system (namely the baseboard management subsystem), and the monitoring server obtains the out-of-band monitoring data of the target server through this BMC system. Out-of-band monitoring data are monitoring data collected not through the operating system of the business service subsystem but by the embedded operating system of the baseboard management subsystem through the sensing unit. They mainly concern the hardware level and comprise the hardware state of the target server (temperature, voltage, fan speed and the like) and low-level information such as hardware events reported over an external management bus.
The baseboard management subsystem and the business service subsystem are independent of each other at both the physical level and the software level. The baseboard management subsystem is implemented on an independent embedded operating system and has its own processing unit, storage unit, power supply unit and communication unit, so that its normal operation is not affected by faults in the operating system, application programs or hardware of the business service subsystem.
As shown in fig. 2, the third processing unit is configured to:
acquiring and storing monitoring data of a target server in real time, wherein the monitoring data comprises in-band monitoring data and out-of-band monitoring data of the target server;
judging whether the target server is in an abnormal state of a hardware level according to the monitoring data;
when the target server is in an abnormal state of a hardware level and an operation instruction for remotely controlling the target server is detected to be input by a user, determining a state abnormal period of the target server;
acquiring in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
extracting in-band risk features and out-of-band risk features from in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
carrying out associated risk intensity analysis on the remote operation instruction, the in-band risk feature and the out-of-band risk feature;
and determining whether to send the remote operation instruction according to the analysis result of the association risk intensity analysis.
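The final decision step can be sketched as a simple gate. The text only says that sending depends on the analysis result; comparing the risk intensity with a configured maximum, and passing instructions through when no hardware-level abnormality is present, are both assumptions of this sketch. The function and parameter names are hypothetical.

```python
def should_send(hardware_abnormal: bool, risk_intensity: float, max_risk: float) -> bool:
    """Decide whether a remote operation instruction is forwarded to the target server."""
    # Outside an abnormal state the text prescribes no analysis; letting the
    # instruction through is an assumption of this sketch.
    if not hardware_abnormal:
        return True
    # Assumed decision rule: block the instruction when the risk is too high.
    return risk_intensity <= max_risk
```

In a fuller system, a blocked instruction would presumably be reported back to the operator together with the risk features that caused it.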
Specifically, a hardware-level abnormal state means that one or more devices in the business service subsystem of the target server are not operating within their normal index range, including but not limited to one or more devices being in an inactive, over-temperature or over-voltage state. A hardware-level abnormal state generally indicates that the target server has a serious fault, which may be caused by a hardware fault or an application fault. Such a state generally cannot recover by itself and seriously affects the ability of the business service subsystem to provide business services.
Further, in the remote monitoring system of server hardware, in the step of determining whether the target server is in an abnormal state of a hardware level according to the monitoring data, the third processing unit is configured to:
acquiring a preconfigured abnormal monitoring period;
periodically reading out-of-band monitoring data of the target server in the last abnormal monitoring period;
judging whether an abnormal event of a hardware level exists in out-of-band monitoring data in a previous abnormal monitoring period, wherein the abnormal event of the hardware level comprises an automatic restarting event, an automatic shutdown event, a dead halt event and/or an out-of-band monitoring index super-threshold event;
and when one or more hardware-level abnormal events exist in the out-of-band monitoring data in the previous abnormal monitoring period, determining that the target server is in an abnormal state.
It should be noted that the target server provides business services to the internet, usually commercial ones. It is generally given a good hardware configuration and machine-room environment and is monitored and maintained over the long term by dedicated operation and maintenance personnel, so its business service subsystem is generally stable. The anomaly monitoring period can therefore be configured as a relatively long period: for each target server, anomaly monitoring and analysis may be performed in units of several hours or even days. Otherwise, with a large number of target servers, high-frequency monitoring and analysis could overload the monitoring server.
In the above embodiment, the out-of-band monitoring indexes comprise the monitoring indexes of the monitored components in the target server, such as temperature and voltage indexes. The monitoring indexes of a monitored component depend on the type of sensor used and may also be, for example, vibration, noise or rotational speed indexes, which are not enumerated here. An out-of-band monitoring index super-threshold event is an event in which the value of a monitoring index of any monitored component on the baseboard, as monitored by the baseboard management subsystem through the sensing unit, exceeds a preset standard range.
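Detecting super-threshold events from sensor readings can be sketched as a range check per component and index. The component names, index names and range values below are invented for illustration only, as is the `(component, index, value)` reading layout.

```python
# Hypothetical preset standard ranges, keyed by (component, monitoring index).
STANDARD_RANGES = {
    ("cpu0", "temperature_c"): (5.0, 85.0),
    ("psu0", "voltage_v"): (11.4, 12.6),
}


def super_threshold_events(readings):
    """Return the readings whose value lies outside the preset standard range.

    readings: iterable of (component, index, value) tuples.
    """
    events = []
    for component, index, value in readings:
        low, high = STANDARD_RANGES[(component, index)]
        if not (low <= value <= high):
            events.append((component, index, value))
    return events
```

Both limits of the range are checked, so an under-voltage reading is flagged in the same way as an over-temperature one.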
Further, in the above remote monitoring system of server hardware, in the step of determining the state anomaly period of the target server, the third processing unit is configured to:
determining a time at which a user input of an operation instruction for remotely controlling the target server is detected as an instruction input time;
determining a critical in-band event that causes the target server to be in an abnormal state at a hardware level;
acquiring the occurrence time of the key in-band event, and determining the occurrence time of the key in-band event as a first occurrence time;
Traversing the in-band monitoring data of the target server by taking the first occurrence time as a starting point;
determining an associated event associated with the key event in the in-band monitoring data;
when the number of the association events is equal to 1, determining the occurrence time of the association events as a second occurrence time;
when the number of the associated events is greater than 1, determining the occurrence time corresponding to the associated event with the smallest occurrence time in the associated events as a second occurrence time;
a period of time between the second occurrence time and the instruction input time is determined as a state anomaly period of the target server.
In some embodiments of the present invention, before the step of determining the associated event associated with the critical event in the in-band monitoring data, the third processing unit is configured to:
judging whether related events related to the key events exist in-band monitoring data within a preset duration range, wherein the preset duration is a preset maximum in-band monitoring data traversal duration;
and when the association event does not exist in the in-band monitoring data within the preset duration range, determining a time period between the first occurrence time and the instruction input time as a state abnormality period of the target server.
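The determination of the state anomaly period can be sketched as follows. This is only an illustrative sketch under stated assumptions: the text does not specify the traversal direction or how association is decided, so the sketch assumes a backward traversal from the first occurrence time limited by the maximum traversal duration, and takes the association test as a caller-supplied predicate `is_associated` (a hypothetical name).

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable, Tuple


def state_anomaly_period(
    first_occurrence: datetime,
    instruction_input: datetime,
    in_band_log: Iterable[Tuple[datetime, str]],
    is_associated: Callable[[str], bool],
    max_traversal: timedelta,
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the state anomaly period.

    in_band_log holds (occurrence time, event name) pairs.
    Assumption: traversal runs backward from first_occurrence and stops
    after max_traversal, the preset maximum traversal duration.
    """
    window_start = first_occurrence - max_traversal
    associated_times = [
        t for t, name in in_band_log
        if window_start <= t <= first_occurrence and is_associated(name)
    ]
    if associated_times:
        # One associated event: its time; several: the smallest time.
        start = min(associated_times)
    else:
        # No associated event within the window: use the first occurrence time.
        start = first_occurrence
    return start, instruction_input
```

Taking `min` covers both the single-event and the multi-event case, since the minimum of a one-element list is that element.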
Similarly, because the physical environment is relatively stable, under well-maintained conditions a fault of the server at the hardware level is, with high probability, caused by a fault of an application program; in the technical solution of the foregoing embodiment, the state anomaly period of the target server is therefore determined by determining the key in-band event. In some embodiments of the present invention, before the step of determining a critical in-band event that causes the target server to be in an abnormal state at a hardware level, the third processing unit is configured to:
judging whether a key in-band event which causes the target server to be in an abnormal state of a hardware layer exists in the previous abnormal monitoring period;
when the key in-band event does not exist in the previous abnormality monitoring period, acquiring the occurrence time of the abnormality event of the hardware level in the previous abnormality monitoring period;
when the number of the abnormal events of the hardware level in the previous abnormal monitoring period is equal to 1, determining the occurrence time of the abnormal events as a third occurrence time;
when the number of the abnormal events of the hardware level in the previous abnormal monitoring period is larger than 1, determining the occurrence time of the abnormal event of the hardware level with the minimum occurrence time in the previous abnormal monitoring period as a third occurrence time;
A period of time between the third occurrence time and the instruction input time is determined as a state anomaly period of the target server.
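The fallback described above, used when no key in-band event exists in the previous anomaly monitoring period, can be sketched briefly. The function name and argument layout are hypothetical; the sketch assumes the occurrence times of the hardware-level abnormal events have already been collected.

```python
from datetime import datetime
from typing import Sequence, Tuple


def fallback_anomaly_period(
    hardware_event_times: Sequence[datetime],
    instruction_input: datetime,
) -> Tuple[datetime, datetime]:
    """Return (third occurrence time, instruction input time)."""
    if not hardware_event_times:
        raise ValueError("no hardware-level abnormal event in the previous period")
    # One event: its time; several events: the smallest occurrence time.
    third_occurrence = min(hardware_event_times)
    return third_occurrence, instruction_input
```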
Further, in the remote monitoring system of server hardware, in the step of determining a critical in-band event that causes the target server to be in an abnormal state of a hardware level, the third processing unit is configured to:
determining the abnormal event of the hardware layer as a target event;
reading historical in-band and out-of-band data of the same type of target servers, wherein the same type of target servers comprise servers with the same hardware configuration scheme and running the same type of business service programs in the same type of operating systems;
judging whether the target event exists in the historical out-of-band data;
when the target event exists in the historical out-of-band data, acquiring a fourth occurrence time of each occurrence of the target event from the historical out-of-band data;
determining a time period in an abnormality monitoring period before each of the fourth occurrence times as a statistical time period;
an in-band event that is concurrently present within the statistical period is determined as the critical in-band event.
Specifically, servers with the same hardware configuration scheme have a higher similarity in failure type. In the technical solution of the foregoing embodiment, the same hardware configuration scheme means that the main components are of the same brand and model, the main components including a motherboard and the processor, memory, hard disk, graphics card, and the like mounted on the motherboard. It should be noted that the same hardware configuration scheme referred to in the present invention does not extend to specific specification parameters, such as the capacity of the memory or the hard disk. The same type of operating system mainly refers to the platform type and version, such as WINDOWS SERVER or LINUX SERVER. It should be understood that, even among LINUX SERVER systems, operating systems of different LINUX branches, such as CENTOS SERVER, DEBIAN SERVER, RED HAT SERVER, and UBUNTU SERVER, also differ considerably in fault type, so operating systems of different LINUX branches are also regarded as different types of operating systems. Business service programs of the same type are those that provide the same type of business service, such as WEB services or database services; the business service programs may of course be further subdivided, for example by regarding only the same service program provided by the same developer as business service programs of the same type, which is not described in detail herein.
In the foregoing embodiment, when the target event occurs a plurality of times in the historical out-of-band data, there are correspondingly a plurality of fourth occurrence times.
In some embodiments of the present invention, the in-band monitoring program classifies in-band events into four levels: normal, warning, error, and fatal error. In the step of determining an in-band event that is concurrently present within the statistical period as the critical in-band event, specifically, an in-band event whose event level is error or fatal error and that is concurrently present within the statistical period is determined as the critical in-band event.
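The identification of the critical in-band event from historical data can be illustrated with the following sketch. It assumes that "concurrently present" means the in-band event appears in every statistical period, and that each statistical period is one anomaly monitoring period ending at a fourth occurrence time; the `InBandEvent` record and the level labels are hypothetical.

```python
from datetime import datetime, timedelta
from typing import FrozenSet, NamedTuple, Sequence, Set


class InBandEvent(NamedTuple):
    name: str
    level: str             # "normal", "warning", "error" or "fatal"
    occurred_at: datetime


def critical_in_band_events(
    history: Sequence[InBandEvent],
    fourth_times: Sequence[datetime],
    period: timedelta,
    levels: FrozenSet[str] = frozenset({"error", "fatal"}),
) -> Set[str]:
    """Names of in-band events present in every statistical period."""
    per_period = []
    for t4 in fourth_times:
        start = t4 - period  # statistical period is [t4 - period, t4)
        names = {
            e.name for e in history
            if start <= e.occurred_at < t4 and e.level in levels
        }
        per_period.append(names)
    if not per_period:
        # The target event never occurred in the historical out-of-band data.
        return set()
    return set.intersection(*per_period)
```

Restricting `levels` to error and fatal error follows the embodiment that filters by event level; passing all four levels would reproduce the unfiltered variant.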
Further, in the above-described remote monitoring system for server hardware, in the step of extracting an in-band risk feature and an out-of-band risk feature from in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly period, the third processing unit is configured to:
screening a read-write event list from the in-band monitoring data of the target server in the state anomaly period, and screening an out-of-band monitoring index super-threshold event list from the out-of-band monitoring data of the target server in the state anomaly period, wherein the read-write event list comprises events of reading or writing files or databases by an application program running in the target server, and the out-of-band monitoring index super-threshold event list comprises super-threshold events of any out-of-band monitoring index of any component in the target server;
judging whether the read-write events in the read-write event list and the out-of-band monitoring index super-threshold events in the out-of-band monitoring index super-threshold event list have a persistence feature;
when any read-write event in the read-write event list has a persistence feature, determining the associated parameter of the corresponding read-write event as the in-band risk feature;
and when any out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list has a persistence characteristic, determining the associated parameter of the corresponding out-of-band monitoring index super-threshold event as the out-of-band risk characteristic.
Specifically, the read-write event refers to a read-write operation performed on a file, or on an internal or external database stored in the system, by the operating system in the business service subsystem of the target server or by an application program running in that operating system. The out-of-band monitoring index super-threshold event refers to an event in which the value of a monitoring index of any monitored component on the substrate, as monitored by the substrate management subsystem of the target server through the sensing unit arranged on the substrate, exceeds a preset standard range. In the technical solution of the present invention, a read-write event having the persistence feature is one for which the duration of the corresponding file or data read or write operation is longer than a preset time, and an out-of-band monitoring index super-threshold event having the persistence feature is one for which the index value of the corresponding out-of-band monitoring index remains above the threshold for longer than a preset time.
The associated parameters of the read-write event include, but are not limited to, the occurrence time, the read-write mode, the associated application program, and the data source information or target data information (including the type, name, path, number, size, and the like of the data source or target data). That is, when the read-write mode of the read-write event is the read mode, the read-write event should contain the data source information, and when the read-write mode is the write mode, the read-write event should contain the target data information. The associated parameters of the out-of-band monitoring index super-threshold event include, but are not limited to, the type of the out-of-band monitoring index, the information of the monitored component corresponding to the out-of-band monitoring index, and key-value pair information consisting of the monitoring values of the out-of-band monitoring index and the corresponding monitoring times.
Further, in the above remote monitoring system for server hardware, in the step of determining whether the read-write event in the read-write event list and the out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list have the persistence feature, the third processing unit is configured to:
acquiring the associated parameters of each read-write event in the read-write event list;
estimating the duration $\Delta tp_{rw,irw}$ of each read-write event according to the associated parameters, wherein $irw \in [1, n_{rwe}]$ and $n_{rwe}$ is the number of read-write events in the read-write event list;
calculating the time difference $\Delta td_{rw,irw}$ between the occurrence time of each read-write event and the current time;
when $\Delta tp_{rw,irw} > \Delta td_{rw,irw}$, determining the corresponding read-write event as a read-write event having the persistence feature.
In the foregoing embodiment, the step of estimating the duration $\Delta tp_{rw,irw}$ of each read-write event according to the associated parameters specifically calculates the duration $\Delta tp_{rw,irw}$ from the size of the data source or target data corresponding to the read-write event and the read-write speed.
In the step of calculating the time difference $\Delta td_{rw,irw}$ between the occurrence time of each read-write event and the current time, the current time refers to the time as of which it is judged whether the read-write events in the read-write event list and the out-of-band monitoring index super-threshold events in the out-of-band monitoring index super-threshold event list have the persistence feature.
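The persistence judgment for read-write events can be sketched as follows. The estimated duration is computed as data size divided by read-write speed, as stated in the foregoing embodiment; the `ReadWriteEvent` record, its field names, and the units (bytes and bytes per second) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ReadWriteEvent:
    occurred_at: datetime
    size_bytes: int            # size of the data source or target data
    speed_bytes_per_s: float   # assumed read-write speed


def estimated_duration_s(event: ReadWriteEvent) -> float:
    """Estimated duration of the operation: size divided by speed."""
    return event.size_bytes / event.speed_bytes_per_s


def is_persistent(event: ReadWriteEvent, now: datetime) -> bool:
    """Persistent when the estimated duration exceeds the elapsed time."""
    elapsed_s = (now - event.occurred_at).total_seconds()
    return estimated_duration_s(event) > elapsed_s
```

In other words, an operation is treated as persistent when, by the estimate, it would still be in progress at the time of the judgment.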
Further, in the above remote monitoring system for server hardware, in the step of determining whether the read-write event in the read-write event list and the out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list have the persistence feature, the third processing unit is configured to:
Acquiring the associated parameter of each out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list;
constructing an index value change curve of each out-of-band monitoring index according to the monitoring value of each out-of-band monitoring index and the key value pair consisting of the corresponding monitoring time;
determining the super-threshold periods $\Delta th_{iet,ith}$ on the index value change curve of each out-of-band monitoring index, wherein $iet \in [1, n_{ete}]$, $ith \in [1, n_{th,iet}]$, $n_{ete}$ is the number of out-of-band monitoring index super-threshold events in the out-of-band monitoring index super-threshold event list, and $n_{th,iet}$ is the number of super-threshold periods on the index value change curve of the $iet$-th out-of-band monitoring index super-threshold event;
calculating the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index:
obtaining the pre-configured super-threshold tolerance duration $\Delta tx_{iec,iin}$ of the corresponding component, wherein $iec \in [1, n_{ec}]$, $iin \in [1, n_{ind}]$, $n_{ec}$ is the number of monitored components on the substrate of the target server, and $n_{ind}$ is the number of monitoring indexes of the $iec$-th component on the substrate of the target server;
when $\Delta tp_{th,iet} > \sigma \cdot \Delta tx_{iec,iin}$, determining the corresponding out-of-band monitoring index super-threshold event as an out-of-band monitoring index super-threshold event having the persistence feature, wherein $\sigma \in (0, 1)$ is a pre-configured tolerance duration ratio coefficient.
Specifically, the super-threshold period $\Delta th_{iet,ith}$ is the duration of the $ith$-th period, on the index value change curve of the $iet$-th out-of-band monitoring index, in which consecutive index values exceed the preset threshold. The super-threshold tolerance duration $\Delta tx_{iec,iin}$ is the duration for which the $iec$-th component on the substrate of the target server can tolerate continuous operation while its $iin$-th out-of-band monitoring index is in the super-threshold state. In other embodiments of the present invention, the tolerance duration ratio coefficient $\sigma$ is a dynamic coefficient, which may be calculated dynamically according to the ratio between the durations of the super-threshold periods and of the non-super-threshold periods on the index value change curve of the out-of-band monitoring index super-threshold event.
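The persistence judgment for out-of-band super-threshold events can be illustrated with the sketch below. Two points are assumptions and are not confirmed by the text: the super-threshold duration is taken as the sum of all super-threshold periods on the curve, and only an upper threshold is checked, whereas the text speaks of a standard range. The function names and the sample layout of (time in seconds, value) pairs are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (monitoring time in seconds, index value)


def super_threshold_periods(samples: Sequence[Sample], upper: float) -> List[float]:
    """Durations of runs of consecutive samples above the upper threshold.

    A run lasts from its first over-threshold sample to the next sample
    back within range, or to the last sample if the curve ends above it.
    """
    periods: List[float] = []
    run_start = None
    for t, value in samples:
        if value > upper:
            if run_start is None:
                run_start = t
        elif run_start is not None:
            periods.append(t - run_start)
            run_start = None
    if run_start is not None:
        periods.append(samples[-1][0] - run_start)
    return periods


def is_persistent_over_threshold(
    samples: Sequence[Sample],
    upper: float,
    tolerance_s: float,
    sigma: float,
) -> bool:
    """Compare the summed super-threshold duration with sigma * tolerance."""
    if not 0 < sigma < 1:
        raise ValueError("sigma must lie in the open interval (0, 1)")
    # Assumption: the super-threshold duration is the sum of all periods.
    total = sum(super_threshold_periods(samples, upper))
    return total > sigma * tolerance_s
```

Because sigma is below 1, an event is flagged before the component's full tolerance duration is used up, which leaves a safety margin.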
Further, in the above remote monitoring system for server hardware, after the step of estimating the duration $\Delta tp_{rw,irw}$ of each read-write event according to the associated parameters, the third processing unit is configured to:
configuring the duration $\Delta tp_{rw,irw}$ of each read-write event as the feature duration of the corresponding in-band risk feature;
and, after the step of calculating the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index, the third processing unit is further configured to:
configuring the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index as the feature duration of the corresponding out-of-band risk feature.
Further, in the remote monitoring system of server hardware, in the step of performing the associated risk intensity analysis on the remote operation instruction, the in-band risk feature and the out-of-band risk feature, the third processing unit is configured to:
configuring a first feature risk intensity coefficient $\mu inter_{iinter,ire}$ between each in-band risk feature and each remote operation instruction, wherein $iinter \in [1, n_{int}]$, $ire \in [1, n_{re}]$, $n_{int}$ is the number of in-band risk features preset in the system, and $n_{re}$ is the number of remote operation instructions preset in the system;
configuring a second feature risk intensity coefficient $\mu outer_{iouter,ire}$ between each out-of-band risk feature and each remote operation instruction, wherein $iouter \in [1, n_{out}]$ and $n_{out}$ is the number of out-of-band risk features preset in the system;
calculating the associated risk intensity of the remote operation instruction, the in-band risk feature and the out-of-band risk feature according to the feature risk intensity coefficients and the corresponding feature durations:
wherein $\Delta tp_{iinter}$ is the feature duration of the corresponding in-band risk feature, and $\Delta tp_{iouter}$ is the feature duration of the corresponding out-of-band risk feature.
Specifically, a feature risk intensity coefficient is preset between each risk feature, including the in-band risk features and the out-of-band risk features, and each remote operation instruction preset in the system. The step of determining whether to send the remote operation instruction according to the analysis result of the associated risk intensity analysis specifically includes:
when the associated risk intensity is smaller than a preset risk intensity threshold, sending the remote operation instruction;
and when the associated risk intensity is greater than or equal to the preset risk intensity threshold, not sending the remote operation instruction.
In some embodiments of the present invention, when the associated risk intensity is greater than or equal to a preset risk intensity threshold, the remote operation instruction is not sent to the target server, and an operation risk prompt is generated in a user operation interface of the monitoring server.
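The associated risk intensity analysis and the sending decision can be sketched as follows. The exact formula is not reproduced in this text, so the sketch assumes, purely for illustration, a weighted sum of feature durations, each weighted by the feature risk intensity coefficient configured for the given instruction. The dictionary layout and function names are hypothetical.

```python
from typing import Dict, Tuple

# (risk feature name, remote operation instruction) -> preset coefficient
Coefficients = Dict[Tuple[str, str], float]


def associated_risk_intensity(
    instruction: str,
    in_band_durations: Dict[str, float],
    out_of_band_durations: Dict[str, float],
    mu_inter: Coefficients,
    mu_outer: Coefficients,
) -> float:
    """Assumed form: sum of coefficient * feature duration over all features."""
    risk = sum(
        mu_inter[(feature, instruction)] * duration
        for feature, duration in in_band_durations.items()
    )
    risk += sum(
        mu_outer[(feature, instruction)] * duration
        for feature, duration in out_of_band_durations.items()
    )
    return risk


def should_send(risk: float, threshold: float) -> bool:
    """Send only when the risk is strictly below the preset threshold."""
    return risk < threshold
```

For example, a forced reboot might carry a high coefficient for an ongoing database write, so that even a short remaining write duration blocks the instruction, while a read-only status query might carry a coefficient near zero.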
As shown in fig. 2, a second aspect of the present invention proposes a remote monitoring method for server hardware, including:
acquiring and storing monitoring data of a target server in real time, wherein the monitoring data comprises in-band monitoring data and out-of-band monitoring data of the target server;
judging whether the target server is in an abnormal state of a hardware level according to the monitoring data;
when the target server is in an abnormal state of a hardware level and an operation instruction for remotely controlling the target server is detected to be input by a user, determining a state abnormal period of the target server;
acquiring in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
Extracting in-band risk features and out-of-band risk features from in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
carrying out associated risk intensity analysis on the remote operation instruction, the in-band risk feature and the out-of-band risk feature;
and determining whether to send the remote operation instruction according to the analysis result of the association risk intensity analysis.
Specifically, the abnormal state at the hardware level means that one or more devices in the business service subsystem of the target server are not operating within their normal index ranges, including but not limited to one or more devices being in an inactive state, an over-temperature state, or an over-voltage state. An abnormal state at the hardware level generally indicates that the target server has a serious fault, which may be caused by a hardware fault or by an application fault; such a state generally cannot recover by itself and seriously affects the ability of the business service subsystem to provide business services.
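The overall flow of the method can be summarized with the following sketch, in which each step is passed in as a callable so that the control flow stays visible. All function and parameter names are hypothetical. The sketch also assumes that an instruction is sent unchanged when no hardware-level abnormal state is found, which the text does not state explicitly.

```python
from typing import Any, Callable, Dict, Tuple

Period = Tuple[float, float]  # (start, end) of the state anomaly period


def decide_remote_instruction(
    instruction: str,
    monitoring_data: Any,
    is_hardware_abnormal: Callable[[Any], bool],
    anomaly_period: Callable[[Any, str], Period],
    extract_features: Callable[[Any, Period], Tuple[Dict, Dict]],
    risk_intensity: Callable[[str, Dict, Dict], float],
    risk_threshold: float,
) -> bool:
    """Return True when the remote operation instruction should be sent."""
    if not is_hardware_abnormal(monitoring_data):
        # Assumption: without a hardware-level anomaly no analysis is needed.
        return True
    period = anomaly_period(monitoring_data, instruction)
    in_band, out_of_band = extract_features(monitoring_data, period)
    risk = risk_intensity(instruction, in_band, out_of_band)
    return risk < risk_threshold
```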
Further, in the above method for remote monitoring of server hardware, the step of determining whether the target server is in an abnormal state of a hardware level according to the monitoring data specifically includes:
Acquiring a preconfigured abnormal monitoring period;
periodically reading out-of-band monitoring data of the target server in the last abnormal monitoring period;
judging whether an abnormal event of a hardware level exists in out-of-band monitoring data in a previous abnormal monitoring period, wherein the abnormal event of the hardware level comprises an automatic restarting event, an automatic shutdown event, a dead halt event and/or an out-of-band monitoring index super-threshold event;
and when one or more hardware-level abnormal events exist in the out-of-band monitoring data in the previous abnormal monitoring period, determining that the target server is in an abnormal state.
It should be noted that the target server provides business services over the internet, typically commercial services, and is therefore generally equipped with good hardware and a good machine room environment and is monitored and maintained over the long term by dedicated operation and maintenance personnel. The business service subsystem of the target server is consequently relatively stable, and the anomaly monitoring period may be configured as a longer period; for example, anomaly monitoring and analysis may be performed for each target server in units of several hours or even days. Otherwise, when the number of target servers is large, high-frequency monitoring and analysis may overload the monitoring server.
In the above embodiment, the out-of-band monitoring index includes monitoring indexes of the monitored components in the target server, such as a temperature index and a voltage index, and the monitoring indexes of the monitored components may be different according to the type of the sensor used, for example, the monitoring indexes may be vibration indexes, noise indexes, rotational speed indexes, and the like, which are not described herein. The out-of-band monitoring index super-threshold event refers to an event that the value of the monitoring index of any monitored component on the substrate monitored by the substrate management subsystem through the sensing unit exceeds a preset standard range.
Further, in the above method for remote monitoring of server hardware, the step of determining the abnormal state period of the target server specifically includes:
determining a time at which a user input of an operation instruction for remotely controlling the target server is detected as an instruction input time;
determining a critical in-band event that causes the target server to be in an abnormal state at a hardware level;
acquiring the occurrence time of the key in-band event, and determining the occurrence time of the key in-band event as a first occurrence time;
traversing the in-band monitoring data of the target server by taking the first occurrence time as a starting point;
Determining an associated event associated with the key event in the in-band monitoring data;
when the number of the association events is equal to 1, determining the occurrence time of the association events as a second occurrence time;
when the number of the associated events is greater than 1, determining the occurrence time corresponding to the associated event with the smallest occurrence time in the associated events as a second occurrence time;
a period of time between the second occurrence time and the instruction input time is determined as a state anomaly period of the target server.
In some embodiments of the present invention, before the step of determining the associated event associated with the critical event in the in-band monitoring data, the method further includes:
judging whether related events related to the key events exist in-band monitoring data within a preset duration range, wherein the preset duration is a preset maximum in-band monitoring data traversal duration;
and when the association event does not exist in the in-band monitoring data within the preset duration range, determining a time period between the first occurrence time and the instruction input time as a state abnormality period of the target server.
Similarly, because the physical environment is relatively stable, under well-maintained conditions a fault of the server at the hardware level is, with high probability, caused by a fault of an application program; in the technical solution of the foregoing embodiment, the state anomaly period of the target server is therefore determined by determining the key in-band event. In some embodiments of the present invention, before the step of determining the critical in-band event that causes the target server to be in the abnormal state at the hardware level, the method further includes:
judging whether a key in-band event which causes the target server to be in an abnormal state of a hardware layer exists in the previous abnormal monitoring period;
when the key in-band event does not exist in the previous abnormality monitoring period, acquiring the occurrence time of the abnormality event of the hardware level in the previous abnormality monitoring period;
when the number of the abnormal events of the hardware level in the previous abnormal monitoring period is equal to 1, determining the occurrence time of the abnormal events as a third occurrence time;
when the number of the abnormal events of the hardware level in the previous abnormal monitoring period is larger than 1, determining the occurrence time of the abnormal event of the hardware level with the minimum occurrence time in the previous abnormal monitoring period as a third occurrence time;
A period of time between the third occurrence time and the instruction input time is determined as a state anomaly period of the target server.
Further, in the above method for remote monitoring of server hardware, the step of determining a critical in-band event that causes the target server to be in an abnormal state at a hardware level specifically includes:
determining the abnormal event of the hardware layer as a target event;
reading historical in-band and out-of-band data of the same type of target servers, wherein the same type of target servers comprise servers with the same hardware configuration scheme and running the same type of business service programs in the same type of operating systems;
judging whether the target event exists in the historical out-of-band data;
when the target event exists in the historical out-of-band data, acquiring a fourth occurrence time of each occurrence of the target event from the historical out-of-band data;
determining a time period in an abnormality monitoring period before each of the fourth occurrence times as a statistical time period;
an in-band event that is concurrently present within the statistical period is determined as the critical in-band event.
Specifically, servers with the same hardware configuration scheme have a higher similarity in failure type. In the technical solution of the foregoing embodiment, the same hardware configuration scheme means that the main components are of the same brand and model, the main components including a motherboard and the processor, memory, hard disk, graphics card, and the like mounted on the motherboard. It should be noted that the same hardware configuration scheme referred to in the present invention does not extend to specific specification parameters, such as the capacity of the memory or the hard disk. The same type of operating system mainly refers to the platform type and version, such as WINDOWS SERVER or LINUX SERVER. It should be understood that, even among LINUX SERVER systems, operating systems of different LINUX branches, such as CENTOS SERVER, DEBIAN SERVER, RED HAT SERVER, and UBUNTU SERVER, also differ considerably in fault type, so operating systems of different LINUX branches are also regarded as different types of operating systems. Business service programs of the same type are those that provide the same type of business service, such as WEB services or database services; the business service programs may of course be further subdivided, for example by regarding only the same service program provided by the same developer as business service programs of the same type, which is not described in detail herein.
In the foregoing embodiment, when the target event occurs a plurality of times in the historical out-of-band data, there are correspondingly a plurality of fourth occurrence times.
In some embodiments of the present invention, the in-band monitoring program classifies in-band events into four levels: normal, warning, error, and fatal error. In the step of determining an in-band event that is concurrently present within the statistical period as the critical in-band event, specifically, an in-band event whose event level is error or fatal error and that is concurrently present within the statistical period is determined as the critical in-band event.
Further, in the above method for remote monitoring of server hardware, the step of extracting in-band risk features and out-of-band risk features from in-band monitoring data and out-of-band monitoring data of the target server in the abnormal state period specifically includes:
screening a read-write event list from the in-band monitoring data of the target server in the state anomaly period, and screening an out-of-band monitoring index super-threshold event list from the out-of-band monitoring data of the target server in the state anomaly period, wherein the read-write event list comprises events of reading or writing files or databases by an application program running in the target server, and the out-of-band monitoring index super-threshold event list comprises super-threshold events of any out-of-band monitoring index of any component in the target server;
judging whether the read-write events in the read-write event list and the out-of-band monitoring index super-threshold events in the out-of-band monitoring index super-threshold event list have a persistence feature;
when any read-write event in the read-write event list has a persistence feature, determining the associated parameter of the corresponding read-write event as the in-band risk feature;
and when any out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list has a persistence characteristic, determining the associated parameter of the corresponding out-of-band monitoring index super-threshold event as the out-of-band risk characteristic.
Specifically, the read-write event refers to a read-write operation performed on a file, or on an internal or external database stored in the system, by the operating system in the business service subsystem of the target server or by an application program running in that operating system. The out-of-band monitoring index super-threshold event refers to an event in which the value of a monitoring index of any monitored component on the substrate, as monitored by the substrate management subsystem of the target server through the sensing unit arranged on the substrate, exceeds a preset standard range. In the technical solution of the present invention, a read-write event having the persistence feature is one for which the duration of the corresponding file or data read or write operation is longer than a preset time, and an out-of-band monitoring index super-threshold event having the persistence feature is one for which the index value of the corresponding out-of-band monitoring index remains above the threshold for longer than a preset time.
The associated parameters of the read-write event include, but are not limited to, the occurrence time, the read-write mode, the associated application program, and the data source information or target data information (including the type, name, path, number, size, and the like of the data source or target data). That is, when the read-write mode of the read-write event is the read mode, the read-write event should contain the data source information, and when the read-write mode is the write mode, the read-write event should contain the target data information. The associated parameters of the out-of-band monitoring index super-threshold event include, but are not limited to, the type of the out-of-band monitoring index, the information of the monitored component corresponding to the out-of-band monitoring index, and key-value pair information consisting of the monitoring values of the out-of-band monitoring index and the corresponding monitoring times.
Further, in the above method for remote monitoring of server hardware, the step of determining whether the read-write event in the read-write event list and the out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list have the persistence feature specifically includes:
acquiring the associated parameters of each read-write event in the read-write event list;
estimating the duration $\Delta tp_{rw,irw}$ of each read-write event according to the associated parameters, wherein $irw \in [1, n_{rwe}]$ and $n_{rwe}$ is the number of read-write events in the read-write event list;
calculating the time difference $\Delta td_{rw,irw}$ between the occurrence time of each read-write event and the current time;
when $\Delta tp_{rw,irw} > \Delta td_{rw,irw}$, determining the corresponding read-write event as a read-write event with the persistence feature.
In the foregoing embodiment, the step of estimating the duration $\Delta tp_{rw,irw}$ of each read-write event according to the associated parameters specifically comprises calculating $\Delta tp_{rw,irw}$ from the size of the data source or target data corresponding to the read-write event and the read-write speed.
In the step of calculating the time difference $\Delta td_{rw,irw}$ between the occurrence time of each read-write event and the current time, the current time refers to the cut-off time at which it is determined whether the read-write events in the read-write event list and the out-of-band monitoring index super-threshold events in the out-of-band monitoring index super-threshold event list have the persistence feature.
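The persistence check for read-write events described above can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the function and parameter names are hypothetical, the duration is assumed to be the data size divided by a constant read-write speed, and all times are assumed to be in seconds.

```python
def estimate_duration(size_bytes: int, speed_bytes_per_s: float) -> float:
    """Estimated duration (delta tp) of a read-write event: size / speed."""
    if speed_bytes_per_s <= 0:
        raise ValueError("read-write speed must be positive")
    return size_bytes / speed_bytes_per_s


def is_persistent_rw(occurred_at: float, cutoff: float,
                     size_bytes: int, speed_bytes_per_s: float) -> bool:
    """Return True when the event is still expected to be running at the cut-off time.

    delta tp is the estimated duration of the operation; delta td is the time
    elapsed between the occurrence time and the cut-off ("current") time.
    """
    delta_tp = estimate_duration(size_bytes, speed_bytes_per_s)
    delta_td = cutoff - occurred_at
    return delta_tp > delta_td
```

For example, a 2000-byte write at 100 bytes per second that started 10 seconds before the cut-off has an estimated duration of 20 seconds, so it is treated as persistent; the same write started 30 seconds before the cut-off is not.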
Further, in the above method for remote monitoring of server hardware, the step of determining whether the read-write event in the read-write event list and the out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list have the persistence feature specifically includes:
acquiring the associated parameters of each out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list;
constructing an index value change curve of each out-of-band monitoring index according to the key-value pairs consisting of the monitoring values of the index and the corresponding monitoring times;
determining the super-threshold periods $\Delta th_{iet,ith}$ on the index value change curve of each out-of-band monitoring index, wherein $iet \in [1, n_{ete}]$, $ith \in [1, n_{th,iet}]$, $n_{ete}$ is the number of out-of-band monitoring index super-threshold events in the out-of-band monitoring index super-threshold event list, and $n_{th,iet}$ is the number of super-threshold periods on the index value change curve of the $iet$-th out-of-band monitoring index super-threshold event;
calculating the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index:
obtaining the pre-configured super-threshold tolerance duration $\Delta tx_{iec,iin}$ of the corresponding component, wherein $iec \in [1, n_{ec}]$, $iin \in [1, n_{ind}]$, $n_{ec}$ is the number of monitored components on the substrate of the target server, and $n_{ind}$ is the number of monitoring indexes of the $iec$-th component on the substrate of the target server;
when $\Delta tp_{th,iet} > \sigma \cdot \Delta tx_{iec,iin}$, determining the corresponding out-of-band monitoring index super-threshold event as an out-of-band monitoring index super-threshold event with the persistence feature, wherein $\sigma \in (0, 1)$ is a pre-configured tolerance duration duty ratio coefficient.
Specifically, the super-threshold period $\Delta th_{iet,ith}$ is the duration of the $ith$-th period in which the index value continuously exceeds the preset threshold on the index value change curve of the $iet$-th out-of-band monitoring index super-threshold event. The super-threshold tolerance duration $\Delta tx_{iec,iin}$ is the length of time for which the $iec$-th component on the substrate of the target server can tolerate continuous operation while its $iin$-th out-of-band monitoring index is in the super-threshold state. In other embodiments of the present invention, the tolerance duration duty ratio coefficient $\sigma$ is a dynamic coefficient, which may be calculated dynamically from the ratio of the durations of the super-threshold periods to those of the non-super-threshold periods on the index value change curve of the out-of-band monitoring index super-threshold event.
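The text does not reproduce the formula for the super-threshold duration. The sketch below shows one plausible reading, under stated assumptions: the duration is taken to be the sum of the super-threshold periods on the curve, only an upper threshold is checked, a period runs from the first exceeding sample to the next sample back within the threshold (or to the last sample if the curve ends while still exceeding), and all function and parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (monitoring time in seconds, monitoring value)


def super_threshold_periods(samples: Sequence[Sample], threshold: float) -> List[float]:
    """Durations of the contiguous runs in which the value exceeds the threshold.

    `samples` must be sorted by monitoring time.
    """
    periods: List[float] = []
    start = None
    for t, value in samples:
        if value > threshold and start is None:
            start = t                       # a super-threshold period begins
        elif value <= threshold and start is not None:
            periods.append(t - start)       # the period ends at this sample
            start = None
    if start is not None:                   # still exceeding at the last sample
        periods.append(samples[-1][0] - start)
    return periods


def is_persistent_super_threshold(samples: Sequence[Sample], threshold: float,
                                  tolerance_s: float, sigma: float) -> bool:
    """Assumed check: summed super-threshold duration > sigma * tolerance duration."""
    if not 0.0 < sigma < 1.0:
        raise ValueError("sigma must lie in (0, 1)")
    total = sum(super_threshold_periods(samples, threshold))
    return total > sigma * tolerance_s
```

With samples `[(0, 70), (10, 90), (20, 95), (30, 70), (40, 92), (50, 93)]` and a threshold of 85, the sketch finds two periods of 20 s and 10 s. If the omitted formula instead uses, say, the longest period, only the aggregation in `is_persistent_super_threshold` would change.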
Further, in the remote monitoring method of server hardware, after the step of estimating the duration $\Delta tp_{rw,irw}$ of each read-write event according to the associated parameters, the method further includes:
configuring the duration $\Delta tp_{rw,irw}$ of each read-write event as the feature duration of the corresponding in-band risk feature;
after the step of calculating the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index, the method further includes:
configuring the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index as the feature duration of the corresponding out-of-band risk feature.
Further, in the foregoing remote monitoring method of server hardware, the step of performing associated risk intensity analysis on the remote operation instruction, the in-band risk feature and the out-of-band risk feature specifically includes:
configuring a first feature risk intensity coefficient $\mu inter_{iinter,ire}$ between each in-band risk feature and each remote operation instruction, wherein $iinter \in [1, n_{int}]$, $ire \in [1, n_{re}]$, $n_{int}$ is the number of in-band risk features preset in the system, and $n_{re}$ is the number of remote operation instructions preset in the system;
configuring a second feature risk intensity coefficient $\mu outer_{iouter,ire}$ between each out-of-band risk feature and each remote operation instruction, wherein $iouter \in [1, n_{out}]$ and $n_{out}$ is the number of out-of-band risk features preset in the system;
calculating the associated risk intensity of the remote operation instruction, the in-band risk feature and the out-of-band risk feature according to the feature risk intensity coefficients and the corresponding feature durations:
wherein $\Delta tp_{iinter}$ is the feature duration of the corresponding in-band risk feature, and $\Delta tp_{iouter}$ is the feature duration of the corresponding out-of-band risk feature.
Specifically, a feature risk intensity coefficient is preset between each risk feature (including the in-band risk features and the out-of-band risk features) and each remote operation instruction preset in the system. The step of determining whether to send the remote operation instruction according to the analysis result of the associated risk intensity analysis specifically includes:
when the associated risk intensity is smaller than a preset risk intensity threshold, sending the remote operation instruction;
when the associated risk intensity is greater than or equal to the preset risk intensity threshold, not sending the remote operation instruction.
In some embodiments of the present invention, when the associated risk intensity is greater than or equal to a preset risk intensity threshold, the remote operation instruction is not sent to the target server, and an operation risk prompt is generated in a user operation interface of the monitoring server.
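The text does not reproduce the formula for the associated risk intensity. The following sketch assumes, purely for illustration, that it is a weighted sum of the feature durations, with the preset feature risk intensity coefficients as weights; the function names, the dictionary layout keyed by `(feature, instruction)`, and the example feature and instruction names are all hypothetical.

```python
from typing import Mapping, Tuple

Coefficients = Mapping[Tuple[str, str], float]  # (risk feature, instruction) -> coefficient


def associated_risk_intensity(instruction: str,
                              in_band_durations: Mapping[str, float],
                              out_of_band_durations: Mapping[str, float],
                              mu_inter: Coefficients,
                              mu_outer: Coefficients) -> float:
    """Assumed form: sum of coefficient * feature duration over all risk features."""
    risk = 0.0
    for feature, duration in in_band_durations.items():
        risk += mu_inter[(feature, instruction)] * duration
    for feature, duration in out_of_band_durations.items():
        risk += mu_outer[(feature, instruction)] * duration
    return risk


def should_send(risk: float, risk_threshold: float) -> bool:
    """Send the remote operation instruction only when risk < threshold."""
    return risk < risk_threshold
```

For instance, with a coefficient of 0.5 for a 20-second in-band feature and 0.25 for a 40-second out-of-band feature, the assumed sum is 20.0; the instruction would be sent for a threshold of 25.0 and withheld, with an operation risk prompt shown to the user, for a threshold of 20.0.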
It should be noted that in this document relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Embodiments in accordance with the present invention, as described above, are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention and various modifications as are suited to the particular use contemplated. The invention is limited only by the claims and the full scope and equivalents thereof.

Claims (10)

1. A remote monitoring system of server hardware, characterized by comprising a plurality of target servers as monitoring targets and a monitoring server connected with the target servers for remotely monitoring the target servers, wherein the target servers comprise a business service subsystem for running business service programs and a baseboard management subsystem for monitoring the target servers, the business service subsystem comprises a first processing unit, a first storage unit, a first power supply unit and a first communication unit, the baseboard management subsystem comprises a second processing unit, a second storage unit, a second power supply unit, a second communication unit and a sensing unit, the sensing unit is arranged on each device of the business service subsystem to acquire state data of each device of the business service subsystem, the state data includes temperature data and/or voltage data of each device of the business service subsystem, the first processing unit is used for running an in-band monitoring program stored in the first storage unit to acquire in-band monitoring data of the target server, the second processing unit is used for running an out-of-band monitoring program stored in the second storage unit to acquire out-of-band monitoring data of the target server through the sensing unit, the monitoring server includes a third processing unit, a third storage unit, a third power supply unit and a third communication unit, the monitoring server establishes communication connection with the first communication unit and the second communication unit through the third communication unit to acquire the in-band monitoring data and the out-of-band monitoring data, and the third processing unit is configured to:
Acquiring and storing monitoring data of a target server in real time, wherein the monitoring data comprises in-band monitoring data and out-of-band monitoring data of the target server;
judging whether the target server is in an abnormal state of a hardware level according to the monitoring data;
when the target server is in an abnormal state of a hardware level and an operation instruction for remotely controlling the target server is detected to be input by a user, determining a state abnormal period of the target server;
acquiring in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
extracting in-band risk features and out-of-band risk features from in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
carrying out associated risk intensity analysis on the remote operation instruction, the in-band risk feature and the out-of-band risk feature;
and determining whether to send the remote operation instruction according to the analysis result of the association risk intensity analysis.
2. A method for remote monitoring of server hardware, comprising:
acquiring and storing monitoring data of a target server in real time, wherein the monitoring data comprises in-band monitoring data and out-of-band monitoring data of the target server;
Judging whether the target server is in an abnormal state of a hardware level according to the monitoring data;
when the target server is in an abnormal state of a hardware level and an operation instruction for remotely controlling the target server is detected to be input by a user, determining a state abnormal period of the target server;
acquiring in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
extracting in-band risk features and out-of-band risk features from in-band monitoring data and out-of-band monitoring data of the target server in the state anomaly time period;
carrying out associated risk intensity analysis on the remote operation instruction, the in-band risk feature and the out-of-band risk feature;
and determining whether to send the remote operation instruction according to the analysis result of the association risk intensity analysis.
3. The method for remotely monitoring hardware of a server according to claim 2, wherein the step of determining whether the target server is in an abnormal state of a hardware level according to the monitoring data specifically comprises:
acquiring a preconfigured abnormal monitoring period;
periodically reading out-of-band monitoring data of the target server in the last abnormal monitoring period;
Judging whether an abnormal event of a hardware level exists in out-of-band monitoring data in a previous abnormal monitoring period, wherein the abnormal event of the hardware level comprises an automatic restarting event, an automatic shutdown event, a dead halt event and/or an out-of-band monitoring index super-threshold event;
and when one or more hardware-level abnormal events exist in the out-of-band monitoring data in the previous abnormal monitoring period, determining that the target server is in an abnormal state.
4. A method for remote monitoring of server hardware according to claim 3, wherein the step of determining the status anomaly period of the target server specifically comprises:
determining a time at which a user input of an operation instruction for remotely controlling the target server is detected as an instruction input time;
determining a critical in-band event that causes the target server to be in an abnormal state at a hardware level;
acquiring the occurrence time of the critical in-band event, and determining the occurrence time of the critical in-band event as a first occurrence time;
traversing the in-band monitoring data of the target server by taking the first occurrence time as a starting point;
determining an associated event associated with the critical in-band event in the in-band monitoring data;
when the number of the associated events is equal to 1, determining the occurrence time of the associated event as a second occurrence time;
when the number of the associated events is greater than 1, determining the occurrence time of the associated event with the earliest occurrence time among the associated events as a second occurrence time;
a period of time between the second occurrence time and the instruction input time is determined as a state anomaly period of the target server.
5. The method for remote monitoring of server hardware according to claim 4, wherein the step of determining a critical in-band event that causes the target server to be in an abnormal state at a hardware level comprises:
determining the abnormal event of the hardware layer as a target event;
reading historical in-band and out-of-band data of the same type of target servers, wherein the same type of target servers comprise servers with the same hardware configuration scheme and running the same type of business service programs in the same type of operating systems;
judging whether the target event exists in the historical out-of-band data;
when the target event exists in the historical out-of-band data, acquiring a fourth occurrence time of each occurrence of the target event from the historical out-of-band data;
Determining a time period in an abnormality monitoring period before each of the fourth occurrence times as a statistical time period;
an in-band event that is concurrently present within the statistical period is determined as the critical in-band event.
6. The method for remote monitoring of server hardware according to claim 2, wherein the step of extracting in-band risk features and out-of-band risk features from in-band monitoring data and out-of-band monitoring data of the target server during the state anomaly period specifically comprises:
screening out a read-write event list from the in-band monitoring data of the target server in the state anomaly period, and screening out an out-of-band monitoring index super-threshold event list from the out-of-band monitoring data of the target server in the state anomaly period, wherein the read-write event list comprises events of reading or writing files or databases by an application program running in the target server, and the out-of-band monitoring index super-threshold event list comprises super-threshold events of any out-of-band monitoring index of any component in the target server;
judging whether the read-write events in the read-write event list and the out-of-band monitoring index super-threshold events in the out-of-band monitoring index super-threshold event list have a persistence feature;
When any read-write event in the read-write event list has a persistence feature, determining the associated parameter of the corresponding read-write event as the in-band risk feature;
and when any out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list has a persistence characteristic, determining the associated parameter of the corresponding out-of-band monitoring index super-threshold event as the out-of-band risk characteristic.
7. The method for remote monitoring of server hardware according to claim 6, wherein the step of determining whether the read-write event in the read-write event list and the out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list have a persistence feature specifically comprises:
acquiring the associated parameters of each read-write event in the read-write event list;
estimating the duration $\Delta tp_{rw,irw}$ of each read-write event according to the associated parameters, wherein $irw \in [1, n_{rwe}]$ and $n_{rwe}$ is the number of read-write events in the read-write event list;
calculating the time difference $\Delta td_{rw,irw}$ between the occurrence time of each read-write event and the current time;
when $\Delta tp_{rw,irw} > \Delta td_{rw,irw}$, determining the corresponding read-write event as a read-write event with the persistence feature.
8. The method for remote monitoring of server hardware according to claim 7, wherein the step of determining whether the read-write event in the read-write event list and the out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list have a persistence feature specifically comprises:
acquiring the associated parameters of each out-of-band monitoring index super-threshold event in the out-of-band monitoring index super-threshold event list;
constructing an index value change curve of each out-of-band monitoring index according to the key-value pairs consisting of the monitoring values of the index and the corresponding monitoring times;
determining the super-threshold periods $\Delta th_{iet,ith}$ on the index value change curve of each out-of-band monitoring index, wherein $iet \in [1, n_{ete}]$, $ith \in [1, n_{th,iet}]$, $n_{ete}$ is the number of out-of-band monitoring index super-threshold events in the out-of-band monitoring index super-threshold event list, and $n_{th,iet}$ is the number of super-threshold periods on the index value change curve of the $iet$-th out-of-band monitoring index super-threshold event;
calculating the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index:
obtaining the pre-configured super-threshold tolerance duration $\Delta tx_{iec,iin}$ of the corresponding component, wherein $iec \in [1, n_{ec}]$, $iin \in [1, n_{ind}]$, $n_{ec}$ is the number of monitored components on the substrate of the target server, and $n_{ind}$ is the number of monitoring indexes of the $iec$-th component on the substrate of the target server;
when $\Delta tp_{th,iet} > \sigma \cdot \Delta tx_{iec,iin}$, determining the corresponding out-of-band monitoring index super-threshold event as an out-of-band monitoring index super-threshold event with the persistence feature, wherein $\sigma \in (0, 1)$ is a pre-configured tolerance duration duty ratio coefficient.
9. The method for remote monitoring of server hardware according to claim 8, wherein after the step of estimating the duration $\Delta tp_{rw,irw}$ of each read-write event according to the associated parameters, the method further comprises:
configuring the duration $\Delta tp_{rw,irw}$ of each read-write event as the feature duration of the corresponding in-band risk feature;
after the step of calculating the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index, the method further comprises:
configuring the super-threshold duration $\Delta tp_{th,iet}$ of each out-of-band monitoring index as the feature duration of the corresponding out-of-band risk feature.
10. The method for remote monitoring of server hardware according to claim 9, wherein the step of performing associated risk intensity analysis on the remote operation instruction, the in-band risk feature and the out-of-band risk feature specifically comprises:
configuring a first feature risk intensity coefficient $\mu inter_{iinter,ire}$ between each in-band risk feature and each remote operation instruction, wherein $iinter \in [1, n_{int}]$, $ire \in [1, n_{re}]$, $n_{int}$ is the number of in-band risk features preset in the system, and $n_{re}$ is the number of remote operation instructions preset in the system;
configuring a second feature risk intensity coefficient $\mu outer_{iouter,ire}$ between each out-of-band risk feature and each remote operation instruction, wherein $iouter \in [1, n_{out}]$ and $n_{out}$ is the number of out-of-band risk features preset in the system;
calculating the associated risk intensity of the remote operation instruction, the in-band risk feature and the out-of-band risk feature according to the feature risk intensity coefficients and the corresponding feature durations:
wherein $\Delta tp_{iinter}$ is the feature duration of the corresponding in-band risk feature, and $\Delta tp_{iouter}$ is the feature duration of the corresponding out-of-band risk feature.
CN202311345832.2A 2023-10-16 2023-10-16 Remote monitoring system and method for server hardware Pending CN117271267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311345832.2A CN117271267A (en) 2023-10-16 2023-10-16 Remote monitoring system and method for server hardware

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311345832.2A CN117271267A (en) 2023-10-16 2023-10-16 Remote monitoring system and method for server hardware

Publications (1)

Publication Number Publication Date
CN117271267A true CN117271267A (en) 2023-12-22

Family

ID=89214368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311345832.2A Pending CN117271267A (en) 2023-10-16 2023-10-16 Remote monitoring system and method for server hardware

Country Status (1)

Country Link
CN (1) CN117271267A (en)

Similar Documents

Publication Publication Date Title
US11422595B2 (en) Method and system for supervising a health of a server infrastructure
US6904391B2 (en) System and method for interpreting sensor data utilizing virtual sensors
JP5186211B2 (en) Health monitoring technology and application server control
US10519960B2 (en) Fan failure detection and reporting
US20150127814A1 (en) Monitoring Server Method
US20190101876A1 (en) Machine diagnostics based on overall system energy state
US8055928B2 (en) Method for characterizing the health of a computer system power supply
US9021317B2 (en) Reporting and processing computer operation failure alerts
US11030038B2 (en) Fault prediction and detection using time-based distributed data
US8195340B1 (en) Data center emergency power management
WO2015023201A2 (en) Method and system for determining hardware life expectancy and failure prevention
CN109040277A (en) A kind of long-distance monitoring method and device of server
CN114328102A (en) Equipment state monitoring method, device, equipment and computer readable storage medium
CN115658408A (en) Sensor state detection method and device and readable storage medium
CN117318297A (en) Alarm threshold setting method, system, equipment and medium based on state monitoring
CN111625386A (en) Monitoring method and device for power-on overtime of system equipment
CN115794588A (en) Memory fault prediction method, device and system and monitoring server
CN108899059B (en) Detection method and equipment for solid state disk
CN114676019A (en) Method, device, equipment and storage medium for monitoring state of central processing unit
US10176033B1 (en) Large-scale event detector
US10067549B1 (en) Computed devices
CN116225812B (en) Baseboard management controller system operation method, device, equipment and storage medium
CN111338891A (en) Fan stability testing method and device
CN117271267A (en) Remote monitoring system and method for server hardware
CN110873613A (en) Method and device for processing machine room abnormity based on temperature monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination