CN113268389A - Abnormal node monitoring method and device, electronic equipment and readable storage medium - Google Patents

Abnormal node monitoring method and device, electronic equipment and readable storage medium

Info

Publication number
CN113268389A
Authority
CN
China
Prior art keywords
node
parameter
determining
abnormal
utilization rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110645106.7A
Other languages
Chinese (zh)
Inventor
焦玉楼
吴晓斌
周圆
苟小刚
陈宁
岳勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Xuani Technology Co ltd
Original Assignee
Wuxi Xuani Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Xuani Technology Co ltd
Priority to CN202110645106.7A
Publication of CN113268389A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2236 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to the technical field of data processing and discloses a method, a device, electronic equipment, and a readable storage medium for monitoring abnormal nodes. The method comprises: obtaining, for each node, each parameter value of an operating parameter within a preset time period; and determining the operating state of each node according to the parameter values of that node's operating parameter, where the operating state indicates whether the node is operating abnormally. Because each node's operating state is determined from its own parameter values within the preset time period, the abnormal nodes in the cluster can be identified, which makes it convenient for a user to subsequently handle them.

Description

Abnormal node monitoring method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for monitoring an abnormal node, an electronic device, and a readable storage medium.
Background
With the continuous progress of science and technology, data processing clusters are widely used for data processing. A data processing cluster usually includes a management server and hundreds of node servers. The management server distributes data processing tasks to the node servers so that they process the data in parallel, which can effectively improve data processing efficiency.
During data processing, if a node server becomes abnormal, operation and maintenance personnel usually have to check the node servers one by one through the management server to find the abnormal node. This approach is inefficient, and abnormal nodes are easily missed.
Therefore, how to monitor abnormal nodes so as to improve the efficiency of detecting them is a problem to be solved.
Disclosure of Invention
The embodiment of the application aims to provide a method, a device, an electronic device and a readable storage medium for monitoring abnormal nodes, which are used for improving the detection efficiency of the abnormal nodes when the abnormal nodes are detected.
In a first aspect, an embodiment of the present application provides a method for monitoring an abnormal node, where the method includes:
obtaining each parameter value of the operating parameter of each node within a preset time period; and
determining the operating state of each node according to the parameter values of that node's operating parameter, where the operating state indicates whether the node is operating abnormally.
In this implementation, whether each node is operating abnormally can be determined from the parameter values of its operating parameter within the preset time period, so that a user can be alerted to abnormal nodes and the efficiency of detecting abnormal nodes is improved.
With reference to the first aspect, in one embodiment, the operating parameter includes at least a utilization rate of the CPU.
With reference to the first aspect, in an implementation manner, determining an operating state of each node according to each parameter value of the operating parameter of each node includes:
performing the following steps for each node:
determining the average of the node's utilization rates at the time points within the preset time period; and
if the average utilization rate is not lower than a first preset threshold, determining that the node is operating normally; otherwise, determining that the node is operating abnormally.
In this implementation, the average of each node's utilization rates at the time points within the preset time period reflects the node's overall operating condition during that period. Comparing this average with the utilization rate threshold for normal operation therefore makes it possible to judge whether the node is abnormal.
With reference to the first aspect, in an implementation manner, determining an operating state of each node according to each parameter value of the operating parameter of each node includes:
performing the following steps for each node:
determining the node's utilization rate at each time point within the preset time period; and
if at least a specified number of these utilization rates are lower than a second preset threshold, determining that the node is operating abnormally; otherwise, determining that the node is operating normally.
In this implementation, each node's utilization rates at multiple time points within the preset time period reflect the node's operating condition at those time points, which in turn reflects its overall operating condition during the period. Comparing the utilization rate at each time point with the utilization rate threshold for normal operation therefore makes it possible to judge whether the node is abnormal.
With reference to the first aspect, in an implementation manner, before obtaining respective parameter values of the operating parameter of each node within a preset time period, the method further includes:
if it is determined that a parameter reporting message sent by any node has been received, obtaining the node identification information, the parameter value corresponding to the operating parameter, and the parameter acquisition time from the parameter reporting message; and
storing the received node identification information, parameter value, and parameter acquisition time in association with one another using a specified data structure.
In this implementation, the content and reporting time of the messages sent by each node are stored in the specified data structure, so that node abnormality can subsequently be judged from the content and reporting time of those messages.
With reference to the first aspect, in an embodiment, after determining the operation state of the corresponding node, the method further includes:
determining an abnormality level according to the obtained utilization rates; and
displaying the operating state of the node in the display style corresponding to the abnormality level.
In this implementation, nodes with different abnormality levels are displayed distinctly, so that a user can handle node abnormalities in an orderly way according to their levels.
In a second aspect, an embodiment of the present application provides an abnormal node monitoring apparatus, where the apparatus includes:
an acquisition unit, configured to obtain each parameter value of the operating parameter of each node within a preset time period; and
a determining unit, configured to determine the operating state of each node according to the parameter values of that node's operating parameter, where the operating state indicates whether the node is operating abnormally.
With reference to the second aspect, in one embodiment, the operating parameter includes at least a utilization of the CPU.
With reference to the second aspect, in an embodiment, the determining unit is specifically configured to:
perform the following steps for each node:
determine the average of the node's utilization rates at the time points within the preset time period; and
if the average utilization rate is not lower than a first preset threshold, determine that the node is operating normally; otherwise, determine that the node is operating abnormally.
With reference to the second aspect, in another embodiment, the determining unit is specifically configured to:
perform the following steps for each node:
determine the node's utilization rate at each time point within the preset time period; and
if at least a specified number of these utilization rates are lower than a second preset threshold, determine that the node is operating abnormally; otherwise, determine that the node is operating normally.
With reference to the second aspect, in an embodiment, the determining unit is further configured to:
if it is determined that a parameter reporting message sent by any node has been received, obtain the node identification information, the parameter value corresponding to the operating parameter, and the parameter acquisition time from the parameter reporting message; and
store the received node identification information, parameter value, and parameter acquisition time in association with one another using a specified data structure.
With reference to the second aspect, in an embodiment, the determining unit is further configured to:
determine an abnormality level according to the obtained utilization rates; and
display the operating state of the node in the display style corresponding to the abnormality level.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a processor, a memory, and a bus, where the processor is connected to the memory through the bus, and the memory stores computer-readable instructions that, when executed by the processor, implement the method provided by any one of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps in the method as provided in any of the embodiments of the first aspect.
Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart of a method for monitoring an abnormal node according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a node display style provided in an embodiment of the present application;
Fig. 3 is a block diagram of an abnormal node monitoring apparatus according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
First, some terms referred to in the embodiments of the present application will be described to facilitate understanding by those skilled in the art.
The terminal equipment: may be a mobile terminal, a fixed terminal, or a portable terminal such as a mobile handset, station, unit, device, multimedia computer, multimedia tablet, internet node, communicator, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, personal communication system device, personal navigation device, personal digital assistant, audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, gaming device, or any combination thereof, including the accessories and peripherals of these devices, or any combination thereof. It is also contemplated that the terminal device can support any type of interface to the user (e.g., wearable device), and the like.
A server: may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and big data and artificial intelligence platforms.
In an application scenario such as image rendering, a management server in a cluster is generally used to distribute rendering tasks to a plurality of nodes, and each node that receives a rendering task uses its Central Processing Unit (CPU) to render. During rendering, however, some nodes may become abnormal at a certain stage, so that the task cannot proceed normally. Operation and maintenance personnel then need to find the abnormal node as soon as possible and handle the exception to ensure that the task proceeds normally.
Conventionally, operation and maintenance personnel find abnormal nodes by checking each node one by one through the management server in the cluster. This manual approach is inefficient, and abnormal nodes are easily missed.
It should be noted that the method, the apparatus, the electronic device, and the readable storage medium for monitoring the abnormal node provided in the present application may be used in a scene where a rendering task is performed, and may also be used in a scene where other data processing tasks are performed, but the present application is not limited thereto.
Referring to fig. 1, fig. 1 is a flowchart of a method for monitoring an abnormal node according to an embodiment of the present application, in the embodiment of the present application, an execution main body of the method may be an electronic device, and optionally, the electronic device may be a server or a terminal device, but the present application is not limited thereto.
As an example, the specific implementation flow of the method shown in fig. 1 is as follows:
Step 110: Obtain each parameter value of the operating parameter of each node within a preset time period.
Specifically, before step 110 is executed, the parameter values of the operating parameter reported by each node are received.
The parameter values reported by the nodes may be received through the following steps:
S1101: If it is determined that a parameter reporting message sent by any node has been received, obtain the node identification information, the parameter value corresponding to the operating parameter, and the parameter acquisition time from the parameter reporting message.
Specifically, each node may periodically or in real time send a parameter reporting message to the management server.
The parameter reporting message at least includes a parameter value corresponding to the current operating parameter, and optionally, may further include node identification information and parameter acquisition time.
As an embodiment, each node sends a parameter reporting message to the management server every minute. When the management server determines that it has received a parameter reporting message from any node, it obtains from the message the node's identifier (ID), the node's current CPU utilization rate, and the time at which that utilization rate was acquired.
It should be noted that the embodiment of the present application is described only by taking as an example the case where each node sends a parameter reporting message to the management server every minute. In practical applications, the reporting interval may also be 2 minutes, 30 seconds, or 50 seconds, or the nodes may send parameter reporting messages to the management server in real time, but the present application is not limited thereto.
In the embodiment of the present application, the node ID is used only as an example of the node identification information. In practical applications, the node identification information may also be the name of the node, but the present application is not limited thereto.
In the embodiment of the present application, the CPU utilization rate of the node is used only as an example of the operating parameter. In practical applications, the operating parameter may also be the memory utilization rate of the node or the read-write rate of the node's hard disk, but the present application is not limited thereto.
In the embodiment of the present application, the parameter acquisition time is described only by taking as an example the time at which the management server obtains the parameter value corresponding to the operating parameter. In practical applications, the parameter acquisition time may also be the time at which the node sends the parameter reporting message, but the present application is not limited thereto.
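As an illustration of the parameter reporting message described above, the following is a minimal sketch only. The application does not specify a message format or transport, so the JSON encoding, the class name `ParameterReport`, and all field names are assumptions introduced here for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ParameterReport:
    """One parameter reporting message (hypothetical field names)."""
    node_id: str            # node identification information
    cpu_utilization: float  # parameter value of the operating parameter, in percent
    acquired_at: str        # parameter acquisition time, e.g. "10:01"


def encode_report(report: ParameterReport) -> str:
    """Serialize a report on the node side (JSON is an assumed encoding)."""
    return json.dumps(asdict(report))


def decode_report(payload: str) -> ParameterReport:
    """Parse a received payload on the management server side."""
    return ParameterReport(**json.loads(payload))


# A node builds and encodes a message; the management server decodes it.
message = encode_report(ParameterReport("node-17", 91.0, "10:05"))
received = decode_report(message)
```

In this sketch the three fields correspond to the node identification information, the parameter value, and the parameter acquisition time that the management server extracts in S1101.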
S1102: Store the received node identification information, the parameter value corresponding to the operating parameter, and the parameter acquisition time in association with one another using a specified data structure.
As an embodiment, the management server uses a key-value pair set (map) data structure: the received node ID is stored in the set as the key, the corresponding value is the current utilization rate, and the time at which that utilization rate was acquired is stored in the set in association with it.
It should be noted that the embodiment of the present application is described only by taking the map data structure as an example. In practical applications, the specified data structure may also be a stack, a queue, or a linked list, but the present application is not limited thereto.
In the implementation process, the content and the reporting time of the reporting information sent by each node are stored through the specified data structure, so that the node abnormity can be conveniently judged according to the content and the reporting time of the reporting information in the follow-up process.
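The associated storage in S1102 can be sketched as follows. This is one possible layout under stated assumptions, not the layout the application prescribes: the dictionary `samples`, the function `store_report`, and the choice to keep a list of (acquisition time, utilization rate) pairs per node ID are all hypothetical.

```python
from collections import defaultdict

# Assumed layout: node ID -> list of (acquisition time, CPU utilization in percent).
samples = defaultdict(list)


def store_report(node_id, acquired_at, utilization):
    """Store a reported value and its acquisition time under the node's ID."""
    samples[node_id].append((acquired_at, utilization))


store_report("node-17", "10:01", 82.0)
store_report("node-17", "10:02", 97.0)
store_report("node-42", "10:01", 95.0)
```

Keeping every report for a node under a single key means that all values for one node in a given time period can later be looked up by ID alone.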
As an embodiment, when step 110 is executed, the parameter values of the operating parameter of each node within the preset time period may be obtained from the set.
Specifically, the CPU utilization rates of each node within the preset time period are obtained from the set.
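Retrieving the values that fall within the preset time period might look like the sketch below. The function name `values_in_window`, the helper `at`, and the half-open interval (now - window, now] are assumptions; the application only states that the values within the preset time period are obtained from the set.

```python
from datetime import datetime, timedelta


def at(hhmm):
    """Helper for the example: parse an 'HH:MM' string into a datetime."""
    return datetime.strptime(hhmm, "%H:%M")


def values_in_window(node_samples, now, window):
    """Return utilization rates acquired in the interval (now - window, now]."""
    start = now - window
    return [value for acquired_at, value in node_samples if start < acquired_at <= now]


# Stored (acquisition time, utilization rate) pairs for one node.
node_samples = [(at("09:59"), 70.0), (at("10:01"), 82.0),
                (at("10:02"), 97.0), (at("10:06"), 93.0)]
recent = values_in_window(node_samples, now=at("10:06"), window=timedelta(minutes=6))
# recent == [82.0, 97.0, 93.0]; the 09:59 sample lies outside the last 6 minutes
```

With one report per minute, this interval choice yields the six samples from 10:01 to 10:06 when evaluated at 10:06.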
Step 120: Determine the operating state of each node according to the parameter values of that node's operating parameter.
In one embodiment, when step 120 is performed, any one or a combination of the following modes may be employed:
Mode 1: Judge whether each node is operating abnormally according to the node's average utilization rate.
Specifically, for each node, the following steps may be performed:
Step a1: Determine the average of the node's utilization rates at the time points within the preset time period.
Specifically, the management server uses the node's ID to obtain from the set the node's utilization rate at each time point within the preset time period, and determines the average of these utilization rates.
Step b1: If the average utilization rate is not lower than a first preset threshold, determine that the node is operating normally; otherwise, determine that the node is operating abnormally.
That is, an average utilization rate at or above the first preset threshold indicates that the node is operating normally, and an average below it indicates that the node is operating abnormally.
As an embodiment, the management server uses the node's ID to obtain from the set the node's utilization rate for each minute of the last 6 minutes, and determines the average of these utilization rates. If the average is not lower than the first preset threshold, the node is operating normally; if the average is lower than the first preset threshold, the node is operating abnormally.
For example, assume that the first preset threshold is 90%. At 10:06, the management server uses the ID of a certain node to obtain from the set the node's utilization rates: 82% at 10:01, 97% at 10:02, 92% at 10:03, 89% at 10:04, 91% at 10:05, and 93% at 10:06.
The average utilization rate over the last 6 minutes is then (82% + 97% + 92% + 89% + 91% + 93%)/6 ≈ 90.67%. Since 90.67% > 90%, it is determined that the node is operating normally.
As another example, assume that the first preset threshold is still 90%, that the utilization rate obtained at 10:06 is 85%, and that the utilization rates at the other time points are unchanged. The average utilization rate of the node over the last 6 minutes is then (82% + 97% + 92% + 89% + 91% + 85%)/6 ≈ 89.33%. Since 89.33% < 90%, it is determined that the node is operating abnormally.
Thus, whether each node in the cluster is abnormal can be determined through step a1 and step b1.
In the embodiment of the present application, only the preset time period is described as 6 minutes, but in practical applications, the preset time period may be 7 minutes, 10 minutes, or other times, but the present application is not limited thereto. In the embodiment of the present application, only the first preset threshold is taken as an example to be described as 90%, and in practical applications, the first preset threshold may be 95%, 98%, or another value, but the present application is not limited thereto.
In the implementation process, the average value of the utilization rate of each node at each time point in the preset time period is determined, so that the overall operation condition of each node in the preset time period can be effectively reflected, and further, the utilization rate average value is compared with the utilization rate threshold value under the normal condition, so that whether the corresponding node is abnormal or not can be effectively judged.
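The average-based judgment can be sketched as below. The function name `is_normal_by_average` and its parameter names are hypothetical; the 90% default and the sample values are taken from the examples in this description.

```python
from statistics import mean


def is_normal_by_average(utilizations, first_threshold=90.0):
    """Return True if the average utilization rate is not lower than the threshold."""
    if not utilizations:
        raise ValueError("no utilization rates in the preset time period")
    return mean(utilizations) >= first_threshold


# Utilization rates (percent) from 10:01 to 10:06.
first_case = is_normal_by_average([82, 97, 92, 89, 91, 93])   # average ≈ 90.67 -> True
second_case = is_normal_by_average([82, 97, 92, 89, 91, 85])  # average ≈ 89.33 -> False
```

How a node with no reported values in the period should be treated is not stated in the description; raising an error here is only one possible choice.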
Mode 2: Judge whether each node is operating abnormally according to the number of the node's utilization rates that are lower than a second preset threshold within the preset time period.
Specifically, for each node, the following steps may be performed:
Step a2: Determine the node's utilization rate at each time point within the preset time period.
Specifically, the management server uses the node's ID to obtain from the set the node's utilization rate at each time point within the preset time period.
Step b2: If at least a specified number of these utilization rates are lower than a second preset threshold, determine that the node is operating abnormally; otherwise, determine that the node is operating normally.
That is, if the number of the node's utilization rates within the preset time period that are lower than the second preset threshold reaches the specified number, the node is operating abnormally; otherwise, the node is operating normally.
As an embodiment, the management server uses the node's ID to obtain from the set the node's utilization rate for each minute of the last 6 minutes.
For example, assume that the specified number is 5 and the second preset threshold is 95%. At 10:06, the management server uses the ID of a certain node to obtain from the set the node's utilization rates: 82% at 10:01, 94% at 10:02, 92% at 10:03, 89% at 10:04, 91% at 10:05, and 93% at 10:06. Within the preset time period 10:01 to 10:06, the utilization rates at 6 time points are lower than the second preset threshold of 95%, so it is determined that the node is operating abnormally.
If the utilization rate at 10:06 is 97%, the utilization rates at the other time points are unchanged, and the second preset threshold is still 95%, then within the preset time period 10:01 to 10:06 the utilization rates at 5 time points are lower than the second preset threshold of 95%, so it is determined that the node is operating abnormally.
If the utilization rate at 10:06 is 97%, the utilization rate at 10:03 is 98%, the utilization rates at the other time points are unchanged, and the second preset threshold is still 95%, then within the preset time period 10:01 to 10:06 the utilization rates at only 4 time points are lower than the second preset threshold of 95%, so it is determined that the node is operating normally.
In the embodiments of the present application, the specified number is described only by taking 5 as an example; in practical applications, the specified number may be 3, 4, or 6, and the present application is not limited thereto. Likewise, the second preset threshold is described only by taking 95% as an example; in practical applications, the second preset threshold may be 97%, 98%, or another value, and the present application is not limited thereto.
In the implementation process, the utilization rates of each node at multiple time points in the preset time period reflect the operating condition of the node at those time points, and together they reflect the overall operating condition of the node in the preset time period. Comparing the utilization rate of each node at each time point with the utilization rate threshold for the normal condition therefore makes it possible to judge effectively whether the corresponding node is abnormal.
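The count-based judgment can be sketched in Python as follows. This is a minimal illustration, not the implementation of the application: the function name `is_abnormal_by_count` and its parameter names are hypothetical, and "the specified number" is read as "at least the specified number", which is what the three worked examples above imply.

```python
from typing import Sequence


def is_abnormal_by_count(
    utilizations: Sequence[float],
    threshold: float = 95.0,      # second preset threshold, in percent
    specified_number: int = 5,    # specified number from the example
) -> bool:
    """Return True when at least `specified_number` samples are below `threshold`."""
    below = sum(1 for u in utilizations if u < threshold)
    return below >= specified_number


# Utilization rates (percent) at 10:01 through 10:06 from the first example.
samples = [82, 94, 92, 89, 91, 93]
print(is_abnormal_by_count(samples))
```

Replacing the last sample with 97 still leaves five samples below the threshold, while additionally replacing the 10:03 sample with 98 leaves only four, which falls under the specified number.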
The third method: judging whether the operating state of each node is abnormal according to the standard deviation of the utilization rate of each node.
Specifically, for each node, the following steps may be performed:
a3: determining the standard deviation of the utilization rates of one node at the time points in the preset time period.
Specifically, the management server obtains from the set, according to the ID of the node, the utilization rate of the corresponding node at each time point in the preset time period, and determines the standard deviation of these utilization rates.
b3: if the standard deviation of the utilization rate is not lower than a third preset threshold, determining that the node operates abnormally; otherwise, determining that the node operates normally.
Specifically, if the standard deviation of the utilization rate of the node is not lower than the third preset threshold, the node is operating abnormally; otherwise, the node is operating normally.
As an example, assume that the third preset threshold is 10%. At 10:06, the management server acquires from the set, according to the ID of a certain node, the utilization rates of the corresponding node: 82% at 10:01, 97% at 10:02, 92% at 10:03, 89% at 10:04, 91% at 10:05, and 93% at 10:06.
Further, the standard deviation of the utilization rate in the last 6 minutes is determined to be 4.57%. Since 4.57% < 10%, the node is operating normally.
In the embodiment of the present application, the third preset threshold is described only by taking 10% as an example. In practical applications, the third preset threshold may be 15%, 20%, or another value, and the first preset threshold, the second preset threshold, and the third preset threshold may be the same or different; the present application is not limited thereto.
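The standard-deviation judgment can be sketched as follows. The function name `is_abnormal_by_stddev` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption inferred from the 4.57% figure in the example; the sample standard deviation of the same six values would be about 5.01%.

```python
from statistics import pstdev
from typing import Sequence


def is_abnormal_by_stddev(
    utilizations: Sequence[float],
    threshold: float = 10.0,  # third preset threshold, in percentage points
) -> bool:
    """Return True when the population standard deviation is not lower than `threshold`."""
    return pstdev(utilizations) >= threshold


# Utilization rates (percent) at 10:01 through 10:06 from the example.
samples = [82, 97, 92, 89, 91, 93]
print(round(pstdev(samples), 2))       # 4.57
print(is_abnormal_by_stddev(samples))  # False
```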
In the implementation process, whether each node operates abnormally is determined according to the parameter values of the operating parameter of that node in the preset time period, which improves the efficiency of detecting abnormal nodes.
Furthermore, the operating state of each node can be displayed to the user.
Specifically, when the operating state of each node is displayed, the following steps may be adopted:
S1201: determining the abnormal level according to the obtained utilization rates.
Specifically, the abnormal level may be determined in any one or a combination of the following ways:
The first method: setting the abnormal level according to the range to which the average value of the utilization rates of each node at the time points in the preset time period belongs.
As an embodiment, when the average utilization rate of a node is in [60%, 90%), the abnormal level of the node is set to level 3; when the average utilization rate is in [30%, 60%), the abnormal level of the node is set to level 2; when the average utilization rate is in [0%, 30%), the abnormal level of the node is set to level 1; and when the average utilization rate is 90% or above, the operating state of the node is normal. For example, when the average utilization rate of a node in the last 6 minutes is 90.67%, the operating state of the node is determined to be normal because 90.67% > 90%; when the average utilization rate of the node in the last 6 minutes is 89.33%, the abnormal level of the node is determined to be level 3 because 60% < 89.33% < 90%.
Further, the abnormal level of each node is determined according to the range to which the average utilization rate of that node in the preset time period belongs.
It should be noted that the range of the average utilization rate corresponding to each abnormal level may be adjusted according to actual situations, and is not limited herein.
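The mapping from average utilization rate to abnormal level can be sketched as follows. The function name `level_from_mean` and the convention of returning `None` for a normal node are assumptions made for illustration; the ranges are the example ranges given above.

```python
from statistics import mean
from typing import Optional, Sequence


def level_from_mean(utilizations: Sequence[float]) -> Optional[int]:
    """Map the average utilization (percent) to an abnormal level; None means normal."""
    avg = mean(utilizations)
    if avg >= 90:
        return None  # 90% or above: operating normally
    if avg >= 60:
        return 3     # [60%, 90%)
    if avg >= 30:
        return 2     # [30%, 60%)
    return 1         # [0%, 30%)


# These six samples average 90.67%, so the node is treated as normal.
print(level_from_mean([82, 97, 92, 89, 91, 93]))
# Hypothetical samples chosen to average 89.33%, which falls in [60%, 90%).
print(level_from_mean([82, 89, 92, 89, 91, 93]))
```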
The second method: setting the abnormal level according to the number of utilization rates of each node that are lower than the second preset threshold in the preset time period.
As an embodiment, when the number of a node's utilization rates lower than the second preset threshold in the preset time period is 5, the abnormal level is set to level 3; when that number is 6, the abnormal level is set to level 2; when that number is 7, the abnormal level is set to level 1; and when that number is lower than 5, the node is operating normally.
Further, the abnormal level of each node is determined according to the number of utilization rates of that node that are lower than the second preset threshold in the preset time period.
It should be noted that the correspondence between the number of utilization rates of each node lower than the second preset threshold in the preset time period and the abnormal level may be adjusted according to actual situations, and is not limited herein.
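The count-based level assignment can be sketched as follows. The function name `level_from_count` is hypothetical, and treating counts above 7 as level 1 is an assumption, since the example only lists the counts 5, 6, and 7.

```python
from typing import Optional, Sequence


def level_from_count(
    utilizations: Sequence[float],
    threshold: float = 95.0,  # second preset threshold, in percent
) -> Optional[int]:
    """Map the number of samples below `threshold` to an abnormal level; None means normal."""
    below = sum(1 for u in utilizations if u < threshold)
    if below < 5:
        return None  # fewer than 5: operating normally
    if below == 5:
        return 3
    if below == 6:
        return 2
    return 1         # 7 in the example; larger counts are assumed to be level 1 as well


print(level_from_count([82, 94, 92, 89, 91, 93]))  # six samples below 95%
```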
The third method: setting the abnormal level according to the range to which the standard deviation of the utilization rates of each node at the time points in the preset time period belongs.
As an embodiment, when the standard deviation of the utilization rate of a node is in [10%, 30%), the abnormal level of the node is set to level 3; when the standard deviation is in [30%, 60%), the abnormal level of the node is set to level 2; when the standard deviation is 60% or above, the abnormal level of the node is set to level 1; and when the standard deviation is below 10%, the node is in a normal operating state. For example, when the standard deviation of the utilization rate of a node is 4.57%, the node is determined to be operating normally; when the standard deviation of the utilization rate of the node is 20.17%, the abnormal level of the node is determined to be level 3.
Further, the abnormal level of each node is determined according to the range to which the standard deviation of the utilization rate of that node in the preset time period belongs.
It should be noted that the correspondence between the range to which the standard deviation of the utilization rates of each node at the time points in the preset time period belongs and the abnormal level may be adjusted according to actual situations, and is not limited herein.
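The mapping from standard deviation to abnormal level can be sketched as follows. The function name `level_from_stddev` is hypothetical; it takes an already computed standard deviation in percentage points and applies the example ranges given above.

```python
from typing import Optional


def level_from_stddev(stddev: float) -> Optional[int]:
    """Map a utilization standard deviation to an abnormal level; None means normal."""
    if stddev < 10:
        return None  # below 10%: operating normally
    if stddev < 30:
        return 3     # [10%, 30%)
    if stddev < 60:
        return 2     # [30%, 60%)
    return 1         # 60% or above


print(level_from_stddev(4.57))   # None (normal)
print(level_from_stddev(20.17))  # 3
```

Since utilization rates lie between 0% and 100%, their population standard deviation cannot exceed 50%, so with these example ranges level 1 would only be reached after the ranges are adjusted.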
In the embodiment of the present application, the abnormal level is described only by taking level 1, level 2, and level 3 as an example; in practical applications, the abnormal level may also be expressed as level A, level B, and level C, or in another form.
In the foregoing, a numerically lower abnormal level indicates a more severe abnormality of the node, and such a node needs to be handled with higher priority.
S1202: displaying the operating state of the node according to the display style corresponding to the abnormal level.
As an embodiment, if the abnormal level is level 3, the corresponding display style is blue and not highlighted; if the abnormal level is level 2, the corresponding display style is yellow and highlighted; if the abnormal level is level 1, the corresponding display style is red and highlighted; and if the operating state of the node is normal, the corresponding display style is green and not highlighted.
Further, according to the abnormal level of each node, the operating state of the corresponding node is displayed on a display interface in the corresponding display style.
In the embodiment of the present application, the display style is described only by taking color and highlighting as an example. In practical applications, the display style may also be a progress bar whose length indicates whether the abnormal level is high or low, or the abnormal level of a node may be indicated by other display styles, which is not limited in this application.
As another embodiment, the word "abnormal" may be displayed on the display interface for an abnormal node and the word "normal" for a normal node, together with the average CPU utilization rate of the node within the preset time period, as shown in fig. 2. Fig. 2 is a schematic diagram of a node display style provided in the embodiment of the present application. In fig. 2, abnormal nodes are marked "abnormal", normal nodes are marked "normal", and the nodes are displayed in ascending order of their average CPU utilization rate within the preset time period.
In the implementation process, the operating state and the abnormal level of each node are displayed on the display interface, so that a user can conveniently inspect the corresponding node and/or handle its abnormality according to the operating state of each node. In addition, by displaying nodes with different abnormal levels in distinct styles, the user can handle the abnormal conditions of the nodes in an orderly manner according to their abnormal levels.
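The display step can be sketched as follows. This is only an illustration of the style mapping and the ascending ordering described above: the `NodeStatus` class, the `STYLES` table, the `render` function, and the text layout of each row are all hypothetical, and a real display interface would render these styles graphically.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class NodeStatus:
    node_id: str
    average_utilization: float  # average CPU utilization in percent
    level: Optional[int]        # abnormal level 1-3, or None when normal


# (color, highlighted) per abnormal level, following the example embodiment.
STYLES: Dict[Optional[int], Tuple[str, bool]] = {
    3: ("blue", False),
    2: ("yellow", True),
    1: ("red", True),
    None: ("green", False),
}


def render(nodes: List[NodeStatus]) -> List[str]:
    """Return one text row per node, ordered by ascending average utilization."""
    rows = []
    for node in sorted(nodes, key=lambda n: n.average_utilization):
        color, highlighted = STYLES[node.level]
        label = "normal" if node.level is None else "abnormal"
        suffix = " highlighted" if highlighted else ""
        rows.append(
            f"{node.node_id} {label} avg={node.average_utilization:.2f}% {color}{suffix}"
        )
    return rows


for row in render([NodeStatus("node-b", 90.67, None), NodeStatus("node-a", 45.0, 2)]):
    print(row)
```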
Referring to fig. 3, fig. 3 is a block diagram of an abnormal node monitoring apparatus according to an embodiment of the present disclosure, and the abnormal node monitoring apparatus 300 shown in fig. 3 corresponds to the method shown in fig. 1 and includes functional modules capable of implementing the method shown in fig. 1.
In some possible embodiments, the apparatus 300 for abnormal node monitoring shown in fig. 3 includes:
an obtaining unit 310, configured to obtain each parameter value of the operating parameter of each node within a preset time period; and
a determining unit 320, configured to determine the operating state of each node according to the parameter values of the operating parameter of that node, where the operating state indicates whether the corresponding node operates abnormally.
In some possible embodiments, the operating parameter includes at least the utilization rate of the CPU.
In some possible embodiments, the determining unit is specifically configured to:
perform the following steps for each node:
determining the average value of the utilization rates of the node at the time points in the preset time period; and
if the average utilization rate is not lower than a first preset threshold, determining that the node operates normally; otherwise, determining that the node operates abnormally.
In some possible embodiments, the determining unit is specifically configured to:
perform the following steps for each node:
determining the utilization rate of the node at each time point in the preset time period; and
if the specified number of the utilization rates are lower than a second preset threshold, determining that the node operates abnormally; otherwise, determining that the node operates normally.
In some possible embodiments, the determining unit is further configured to:
if it is determined that a parameter reporting message sent by any node is received, acquire the node identification information, the parameter value corresponding to the operating parameter, and the parameter acquisition time in the parameter reporting message; and
store the received node identification information, the parameter value corresponding to the operating parameter, and the parameter acquisition time in association with one another using a specified data structure.
In some possible embodiments, the determining unit is further configured to:
determine the abnormal level according to the obtained utilization rates; and
display the operating state of the node according to the display style corresponding to the abnormal level.
It should be noted that the abnormal node monitoring apparatus 300 shown in fig. 3 can implement the processes of the method related to abnormal node monitoring in the embodiment of the method in fig. 1. The operation and/or function of each module in the abnormal node monitoring apparatus 300 is respectively to implement the corresponding flow in the method embodiment in fig. 1. Reference may be made specifically to the description of the above method embodiments, and a detailed description is appropriately omitted herein to avoid redundancy.
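The associated storage of reported parameters and their retrieval for a preset time period can be sketched as follows. The source does not say which data structure is the "specified data structure", so the dictionary keyed by node ID is an assumption, as are the class name `ParameterStore`, its method names, the 10:00 reading, and the date used in the usage lines.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import DefaultDict, List, Tuple


class ParameterStore:
    """Stores (acquisition time, CPU utilization) samples in association with a node ID."""

    def __init__(self) -> None:
        self._samples: DefaultDict[str, List[Tuple[datetime, float]]] = defaultdict(list)

    def store_report(self, node_id: str, collected_at: datetime, cpu_utilization: float) -> None:
        # Associate the reported value and its acquisition time with the node ID.
        self._samples[node_id].append((collected_at, cpu_utilization))

    def values_in_window(self, node_id: str, end: datetime, minutes: int = 6) -> List[float]:
        # Return the values acquired in the interval (end - minutes, end], oldest first.
        start = end - timedelta(minutes=minutes)
        samples = sorted(self._samples.get(node_id, []))
        return [value for collected_at, value in samples if start < collected_at <= end]


store = ParameterStore()
readings = [80, 82, 94, 92, 89, 91, 93]  # 10:00 through 10:06; the 10:00 value is hypothetical
for minute, value in enumerate(readings):
    store.store_report("node-1", datetime(2021, 6, 9, 10, minute), float(value))

window = store.values_in_window("node-1", datetime(2021, 6, 9, 10, 6))
print(window)  # the six values acquired from 10:01 to 10:06
```

The list returned by `values_in_window` is what a determining unit would then compare against the first, second, or third preset threshold.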
Referring to fig. 4, fig. 4 is a schematic view of an electronic device according to an embodiment of the present disclosure, where the electronic device 400 shown in fig. 4 may include: at least one processor 410, such as a CPU, at least one communication interface 420, at least one memory 430, and at least one communication bus 440. Wherein the communication bus 440 is used to enable direct connection communication of these components. In this embodiment, the communication interface 420 of the device in this application is used for performing signaling or data communication with other node devices. The memory 430 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). The memory 430 may optionally be at least one memory device located remotely from the aforementioned processor. The memory 430 stores computer readable instructions, which when executed by the processor 410, cause the electronic device to perform the method processes described above with reference to fig. 1.
The embodiment of the application provides a readable storage medium, on which a computer program is stored, and when the computer program is executed by a server, the steps in the method shown in fig. 1 are implemented.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The system embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method of abnormal node monitoring, the method comprising:
respectively acquiring each parameter value of the operating parameter of each node in a preset time period; and
respectively determining the operating state of the corresponding node according to each parameter value of the operating parameter of each node, wherein the operating state is used for indicating whether the corresponding node operates abnormally.
2. The method of claim 1, wherein the operating parameter includes at least central processing unit (CPU) utilization rate.
3. The method of claim 2, wherein determining the operating state of the corresponding node according to each parameter value of the operating parameter of each node comprises:
performing the following steps for each node respectively:
determining the average value of the utilization rates of one node at the time points in the preset time period; and
if the average utilization rate is not lower than a first preset threshold, determining that the one node operates normally; otherwise, determining that the one node operates abnormally.
4. The method of claim 2, wherein determining the operating state of the corresponding node according to each parameter value of the operating parameter of each node comprises:
performing the following steps for each node respectively:
determining the utilization rate of one node at each time point in the preset time period; and
if a specified number of the utilization rates are lower than a second preset threshold, determining that the one node operates abnormally; otherwise, determining that the one node operates normally.
5. The method according to any one of claims 1-4, wherein before the respectively acquiring each parameter value of the operating parameter of each node in a preset time period, the method further comprises:
if it is determined that a parameter reporting message sent by any node is received, acquiring node identification information, a parameter value corresponding to the operating parameter, and a parameter acquisition time in the parameter reporting message; and
storing the received node identification information, the parameter value corresponding to the operating parameter, and the parameter acquisition time in association with one another using a specified data structure.
6. The method according to any one of claims 2-4, wherein after the determining the operating state of the corresponding node, the method further comprises:
determining an abnormal level according to the obtained utilization rates; and
displaying the operating state of the one node according to the display style corresponding to the abnormal level.
7. An apparatus for abnormal node monitoring, the apparatus comprising:
an acquisition unit, configured to respectively acquire each parameter value of the operating parameter of each node in a preset time period; and
a determining unit, configured to determine the operating state of the corresponding node according to each parameter value of the operating parameter of each node, wherein the operating state is used for indicating whether the corresponding node operates abnormally.
8. The apparatus of claim 7, wherein the operating parameter includes at least central processing unit (CPU) utilization rate.
9. An electronic device, comprising:
a processor, a memory, and a bus, the processor being connected to the memory through the bus, the memory storing computer readable instructions which, when executed by the processor, implement the method of any one of claims 1-6.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a server, implements the method of any one of claims 1-6.
CN202110645106.7A 2021-06-09 2021-06-09 Abnormal node monitoring method and device, electronic equipment and readable storage medium Pending CN113268389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110645106.7A CN113268389A (en) 2021-06-09 2021-06-09 Abnormal node monitoring method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110645106.7A CN113268389A (en) 2021-06-09 2021-06-09 Abnormal node monitoring method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113268389A true CN113268389A (en) 2021-08-17

Family

ID=77234794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110645106.7A Pending CN113268389A (en) 2021-06-09 2021-06-09 Abnormal node monitoring method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113268389A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106487612A (en) * 2016-11-01 2017-03-08 广东浪潮大数据研究有限公司 A kind of server node monitoring method, monitoring server and system
US20180074878A1 (en) * 2016-09-14 2018-03-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for monitoring robot operating system
CN110347561A (en) * 2019-06-11 2019-10-18 平安科技(深圳)有限公司 Monitoring alarm method and terminal device
CN111290917A (en) * 2020-02-26 2020-06-16 深圳市云智融科技有限公司 YARN-based resource monitoring method and device and terminal equipment
CN111865720A (en) * 2020-07-20 2020-10-30 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing request

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination