CN111694705A - Monitoring method, device, equipment and computer readable storage medium - Google Patents


Publication number
CN111694705A
Authority
CN
China
Prior art keywords
preset
judging
operation data
data
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910199173.3A
Other languages
Chinese (zh)
Inventor
李冬峰
李彦良
刘荣明
王哲涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201910199173.3A
Publication of CN111694705A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3051: Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a monitoring method, a monitoring device, monitoring equipment and a computer readable storage medium, wherein the method comprises the following steps: acquiring operation data of a first system and a second system that share resources; judging the operation data according to a preset judgment rule to determine whether the first system and the second system have a fault; and taking corresponding measures according to the judgment result so that the first system and the second system operate normally. In this way, the operating states of the two resource-sharing systems can be monitored in real time, system problems can be found and solved as early as possible, and the operational safety of the systems is improved while cost is saved.

Description

Monitoring method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a monitoring method, apparatus, device, and computer-readable storage medium.
Background
With the development of science and technology, electronic commerce has gradually entered users' lives. To support the service requirements of numerous users, existing e-commerce websites generally adopt a plurality of distributed systems to support services, with different distributed systems processing different services. Because they process different services, the times at which the different distributed systems are busy also differ. For example, in practical applications, a Kubernetes system can be adopted to carry the main business of users' online shopping, while a Hadoop system is adopted to clean, convert and process mass data and to generate the basic data required by systems such as search recommendation, artificial intelligence, unbounded retail and face recognition. Owing to users' shopping habits, the main load on the Kubernetes system falls between 9:00 and 24:00 each day. From 0:00 to 8:00, 80% of the resources of the Kubernetes system are idle, whereas the Hadoop system needs to provide data services 24 hours a day. Moreover, with the rapid development and expansion of services, the Hadoop big data system has more and more data to process, and huge funds are spent each year to expand the existing big data computing and storage capacity, thereby causing resource waste.
In order to solve the above technical problem, the prior art proposes a method for transferring services of the Hadoop system to the Kubernetes system for processing, so as to realize resource sharing.
However, when this method is used for service processing, the service and hardware conditions of the two systems cannot be monitored, so the current health condition of the systems cannot be monitored in real time.
Disclosure of Invention
The invention provides a monitoring method, a monitoring device, monitoring equipment and a computer readable storage medium, which are used for solving the technical problems that the existing resource sharing method cannot monitor the service and hardware conditions of two systems of resource sharing, and further cannot monitor the current health condition of the systems in real time.
A first aspect of the present invention provides a monitoring method, including:
acquiring operation data of a first system and a second system that share resources;
judging the operation data according to a preset judgment rule to determine whether the first system and the second system have a fault;
and taking corresponding measures according to the judgment result so that the first system and the second system operate normally.
Another aspect of the present invention provides a monitoring apparatus, comprising:
the acquisition module is used for respectively acquiring the running data of a first system and a second system for resource sharing;
the judging module is used for judging the operating data according to a preset judging rule so as to determine whether the first system and the second system have faults or not;
and the processing module is used for taking corresponding measures according to the judgment result so as to ensure that the first system and the second system operate normally.
Yet another aspect of the present invention provides monitoring equipment comprising: a memory and a processor;
the memory is used for storing instructions executable by the processor;
wherein the processor is configured to perform the monitoring method as described above.
Yet another aspect of the present invention is to provide a computer-readable storage medium having stored therein computer-executable instructions for implementing the monitoring method as described above when executed by a processor.
The monitoring method, the monitoring device, the monitoring equipment and the computer readable storage medium provided by the invention acquire operation data of a first system and a second system that share resources; judge the operation data according to a preset judgment rule to determine whether the first system and the second system have a fault; and take corresponding measures according to the judgment result so that the first system and the second system operate normally. In this way, the operating states of the two resource-sharing systems can be monitored in real time, system problems can be found and solved as early as possible, and the operational safety of the systems is improved while cost is saved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art according to the drawings.
FIG. 1 is a schematic diagram of a network architecture on which the present invention is based;
FIG. 2 is a schematic flowchart of a monitoring method according to a first embodiment of the present invention;
FIG. 3 is a schematic flowchart of a monitoring method according to a second embodiment of the present invention;
FIG. 4 is a schematic flowchart of a monitoring method according to a third embodiment of the present invention;
FIG. 5 is a schematic flowchart of a monitoring method according to a fourth embodiment of the present invention;
FIG. 6 is a schematic flowchart of a monitoring method according to a fifth embodiment of the present invention;
FIG. 7 is a schematic flowchart of a monitoring method according to a sixth embodiment of the present invention;
FIG. 8 is a schematic flowchart of a monitoring method according to a seventh embodiment of the present invention;
FIG. 9 is a schematic flowchart of a monitoring method according to an eighth embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a monitoring device according to a ninth embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a monitoring device according to a tenth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained based on the embodiments of the present invention fall within the scope of the present invention.
With the development of science and technology, electronic commerce has gradually entered users' lives. To support the service requirements of numerous users, existing e-commerce websites generally adopt a plurality of distributed systems to support services, with different distributed systems processing different services. Because they process different services, the times at which the different distributed systems are busy also differ, which results in resource waste. In order to solve this technical problem, the prior art proposes a method for transferring services of the Hadoop system to the Kubernetes system for processing, so as to realize resource sharing. However, when this method is used for service processing, the service and hardware conditions of the two systems cannot be monitored, so the current health condition of the systems cannot be monitored in real time. In order to solve this further problem, the invention provides a monitoring method, a monitoring device, monitoring equipment and a computer-readable storage medium.
It should be noted that the monitoring method, apparatus, device, and computer-readable storage medium provided in the present application may be applied in a scenario of testing application software in various scenarios.
Fig. 1 is a schematic diagram of a network architecture on which the present invention is based. As shown in fig. 1, the network architecture at least includes: a monitoring device 1, a first system 2 and a second system 3. The monitoring device 1 is in communication connection with the first system 2 and the second system 3, respectively, so as to obtain the operation data of the first system 2 and the second system 3. The monitoring device 1 may be written in languages such as C/C++, Java, Shell or Python; the first system 2 and the second system 3 may be server clusters in which a large amount of data is stored.
Fig. 2 is a schematic flow chart of a monitoring method according to an embodiment of the present invention, and as shown in fig. 2, the monitoring method includes:
step 101, respectively obtaining operation data of a first system and a second system for resource sharing.
The execution subject of the present embodiment is a monitoring device. In this embodiment, the first system and the second system may be distributed service systems or distributed databases. The first system and the second system can share resources, and specifically, the first system can deploy a task which is currently running to the second system to run, so that extra cost caused by capacity expansion can be avoided, and cost is saved. In order to ensure that both systems can operate stably, both systems need to be monitored. Specifically, the operation data of the first system and the second system may be acquired respectively. The operation data may include hardware operation data and software operation data.
step 102, judging the operation data according to a preset judgment rule to determine whether the first system and the second system have a fault.
In this embodiment, the monitoring device is preset with a plurality of judgment rules, so that the collected operation data of the first system and the second system can be judged according to the judgment rules, and whether the first system and the second system are in failure or not is determined. Specifically, the judgment rule may be preset and stored in the monitoring device, and as an implementable manner, the judgment rule may also be set by the user according to the current requirement, and the judgment rule set by the user is taken as the current preset judgment rule. Specifically, the monitoring device may interact with a user, and receive a determination rule input by the user from a preset interface.
step 103, taking corresponding measures according to the judgment result so that the first system and the second system operate normally.
In this embodiment, after the operation data is determined according to the preset determination rule, it may be determined whether the first system and the second system are currently in failure. Therefore, in order to ensure that the first system and the second system can stably operate, corresponding measures can be taken according to the judgment result. Specifically, if it is detected that the current first system and the current second system are not in fault, the first system and the second system may be continuously monitored, and if it is detected that the current first system and the current second system are in fault, the fault may be processed according to a preset processing method, so that the first system and the second system can stably operate, and thus, the service processing efficiency can be improved.
For example, the first system may be a Hadoop system and the second system may be a Kubernetes system, where the Kubernetes system carries the main business of users' online shopping, and the Hadoop system is used to clean, convert and process mass data and to generate the basic data required by systems such as search recommendation, artificial intelligence, unbounded retail and face recognition. Owing to users' shopping habits, the main load on the Kubernetes system falls between 9:00 and 24:00 each day. From 0:00 to 8:00, 80% of the resources of the Kubernetes system are idle, whereas the Hadoop system needs to provide data services 24 hours a day. Moreover, with the rapid development and expansion of services, the Hadoop big data system has more and more data to process, and huge funds are spent each year to expand the existing big data computing and storage capacity, thereby causing resource waste. Therefore, in order to save cost, part of the tasks on the Hadoop system can be deployed to the Kubernetes system for processing. To ensure that the two systems can operate stably, the operation data of the two systems can be collected respectively, the operation data of the Hadoop system and the Kubernetes system is judged according to a preset judgment rule to determine whether either system currently has a fault, and corresponding processing measures are taken according to the judgment result so that the Hadoop system and the Kubernetes system can operate normally.
In the monitoring method provided by this embodiment, the operation data of the first system and the second system for resource sharing are obtained respectively; judging the operating data according to a preset judgment rule to determine whether the first system and the second system have faults or not; and taking corresponding measures according to the judgment result so as to enable the first system and the second system to normally operate. Therefore, the running states of the two systems of resource sharing can be monitored in real time, the system problems can be found and solved as soon as possible, and the running safety of the system is improved on the basis of saving the cost.
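As a non-limiting sketch, steps 101 to 103 can be pictured as one monitoring pass over both systems. Every name below (`SystemMonitor`, `monitor_once`, the `storage_used_ratio` key, the 0.9 rule) is hypothetical and chosen only for illustration; the embodiment does not prescribe a data model, and the acquisition functions here return fixed sample values instead of calling a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical data model: operation data is a flat mapping of metric names to values.
RunData = Dict[str, float]
# A judgment rule returns True when the operation data indicates a fault.
Rule = Callable[[RunData], bool]


@dataclass
class SystemMonitor:
    name: str
    acquire: Callable[[], RunData]             # step 101: obtain operation data
    rules: List[Rule]                          # step 102: preset judgment rules
    on_fault: Callable[[str, RunData], None]   # step 103: corresponding measure

    def check_once(self) -> bool:
        data = self.acquire()
        faulty = any(rule(data) for rule in self.rules)
        if faulty:
            self.on_fault(self.name, data)
        return faulty


def monitor_once(monitors: List[SystemMonitor]) -> Dict[str, bool]:
    """Run one pass over every monitored system and report which ones are faulty."""
    return {m.name: m.check_once() for m in monitors}


# Illustrative usage with made-up sample values.
handled: List[str] = []
storage_rule: Rule = lambda d: d["storage_used_ratio"] > 0.9
monitors = [
    SystemMonitor("hadoop", lambda: {"storage_used_ratio": 0.95},
                  [storage_rule], lambda name, data: handled.append(name)),
    SystemMonitor("kubernetes", lambda: {"storage_used_ratio": 0.40},
                  [storage_rule], lambda name, data: handled.append(name)),
]
result = monitor_once(monitors)
```

Keeping acquisition, rules and measures as separate callables mirrors the acquisition, judging and processing modules described for the monitoring device, and lets a user-supplied rule replace a default one without touching the loop.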
Fig. 3 is a schematic flow chart of a monitoring method according to a second embodiment of the present invention, where on the basis of any of the foregoing embodiments, the method includes:
step 201, respectively acquiring running data of a first system and a second system for resource sharing through preset calling interfaces in the first system and the second system;
step 202, judging the operating data according to a preset judgment rule to determine whether the first system and the second system have faults;
step 203, taking corresponding measures according to the judgment result so that the first system and the second system operate normally.
In this embodiment, in order to ensure that both the two systems can operate stably, the two systems need to be monitored to respectively acquire the operation data of the first system and the second system. Specifically, the operation data of the first system and the second system may be obtained through a preset call interface in the first system and the second system.
Continuing the above example, the Resource Manager API provided by the Hadoop system may be called to obtain the storage resource and computing resource conditions of the Hadoop system. On the one hand, hardware operation data of the two systems can be obtained. Specifically, for the Hadoop system, the org.apache.hadoop.fs.FileSystem interface provided by Hadoop can be called to obtain the FsStatus object provided by the FileSystem; the FsStatus.getCapacity() method is called to obtain the total space size, the FsStatus.getUsed() method is called to obtain the used space size, and the FsStatus.getRemaining() method is called to obtain the remaining space size, thereby obtaining the hardware operation data of the Hadoop system. In addition, hardware operation data of the Kubernetes system can be acquired. Specifically, the following interfaces opened by Kubernetes can be called: the Container Runtime Interface (CRI), to acquire computing resource information of the Kubernetes system; the Container Network Interface (CNI), to acquire network resource information of the Kubernetes system; and the Container Storage Interface (CSI), to acquire storage resource information of the Kubernetes system.
On the other hand, software operation data of the Hadoop system and the Kubernetes system can be obtained. It should be noted that a computing task of the Hadoop system follows the same calculation process when it runs on the Kubernetes system as when it runs on the Hadoop system. The only difference lies in how the calculation results are stored: non-final result data generated by the computing task, such as intermediate data, transition data and temporary data, is stored in the container of the Kubernetes system, that is, in the local storage of Docker, and occupies Docker storage resources; the final result of the computing task needs to be stored and retained in the HDFS of the Hadoop system, so as to avoid data loss caused by resource recovery in the Kubernetes system. Therefore, the monitoring device can acquire the data of all computing tasks running in the Hadoop system and the Kubernetes system in the same way as it acquires the data of computing tasks of the Hadoop system. Specifically, an interface preset in the Hadoop system may be called to obtain the data of tasks running in the Hadoop system. Optionally, the operation data of currently running tasks may be obtained, the corresponding task operation data may be obtained according to the running time and end time of the task, or the corresponding task operation data may be obtained according to the task identifier. The software operation data may also be obtained in other ways, which is not limited herein.
In the monitoring method provided by this embodiment, the operation data of the first system and the second system for resource sharing are respectively obtained through the preset calling interfaces in the first system and the second system, so that the operation data of the first system and the second system can be obtained, and a basis is provided for subsequent system maintenance.
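The three optional ways of selecting task operation data mentioned above (currently running tasks, tasks within a time range, and a task with a given identifier) can be sketched as simple filters. This is only an illustration over in-memory records: the `TaskRecord` fields and the function names are hypothetical, treating the time criterion as a start/end window is an assumption, and no real Hadoop or Kubernetes interface is called.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TaskRecord:
    # Hypothetical record layout; the source does not define one.
    task_id: str
    system: str              # e.g. "hadoop" or "kubernetes"
    start: float             # start time, epoch seconds
    end: Optional[float]     # end time, or None while the task is still running


def running_tasks(records: List[TaskRecord]) -> List[TaskRecord]:
    """Tasks that have not finished yet."""
    return [r for r in records if r.end is None]


def tasks_in_window(records: List[TaskRecord],
                    window_start: float, window_end: float) -> List[TaskRecord]:
    """Finished tasks that started and ended inside [window_start, window_end]."""
    return [r for r in records
            if r.end is not None and r.start >= window_start and r.end <= window_end]


def task_by_id(records: List[TaskRecord], task_id: str) -> Optional[TaskRecord]:
    """The first task with the given identifier, or None if there is none."""
    return next((r for r in records if r.task_id == task_id), None)


# Illustrative usage with made-up records.
records = [
    TaskRecord("t1", "hadoop", 100.0, 200.0),
    TaskRecord("t2", "kubernetes", 150.0, None),
    TaskRecord("t3", "kubernetes", 300.0, 450.0),
]
```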
Further, on the basis of any of the above embodiments, the method comprises:
respectively acquiring hardware operating data of a first system and a second system for resource sharing;
judging the operating data according to a preset judgment rule to determine whether the first system and the second system have faults or not;
and taking corresponding measures according to the judgment result so as to enable the first system and the second system to normally operate.
It can be understood that a hardware problem in the system may slow down operation and lower task completion efficiency. Therefore, the hardware operation data of the first system and the second system that share resources may be obtained respectively. The hardware operation data includes, but is not limited to, the total space size, the used space size, the remaining space size, and the like.
According to the monitoring method provided by the embodiment, the hardware running data of the first system and the second system for resource sharing are respectively obtained, so that a basis is provided for subsequent judgment and maintenance of the running state of the system, and the problem of low task completion efficiency caused by insufficient hardware resources is avoided.
Further, on the basis of any of the above embodiments, the method comprises:
respectively acquiring task operation data of a first system and a second system for resource sharing;
judging the operating data according to a preset judgment rule to determine whether the first system and the second system have faults or not;
and taking corresponding measures according to the judgment result so as to enable the first system and the second system to normally operate.
In this embodiment, since a software failure may cause a task to be in a halt state and unable to be completed, in order to ensure that two systems sharing resources can operate normally, the monitoring device may also monitor software operating states of the first system and the second system. Specifically, task operation data of the first system and the second system may be acquired respectively.
According to the monitoring method provided by the embodiment, the task operation data of the first system and the second system for resource sharing are respectively obtained, so that a basis is provided for subsequent judgment and maintenance of the system operation state, and the problem of low task completion efficiency caused by software failure is avoided.
Fig. 4 is a schematic flow chart of a monitoring method according to a third embodiment of the present invention, where on the basis of any one of the above embodiments, as shown in fig. 4, the method includes:
step 301, respectively acquiring operation data of a first system and a second system for resource sharing;
step 302, judging the operation data at a preset period according to a preset judgment rule to determine whether the first system and the second system have a fault;
step 303, taking corresponding measures according to the judgment result so that the first system and the second system operate normally.
In this embodiment, in order to ensure that the first system and the second system sharing resources can stably operate, after the operation data of the first system and the second system are respectively obtained, the obtained operation data may be determined according to a preset period and a preset determination rule, so as to determine whether the first system and the second system are currently operating normally. The preset period may be a default period of the system, or may be determined by the operation and maintenance staff according to historical experience and current requirements, which is not limited herein. For example, if the task executed by the first system and the second system is more important, a shorter period may be set to determine the operating states of the first system and the second system, and if the task executed by the first system and the second system is less important, a longer period may be set to determine the operating states of the first system and the second system in order to save cost.
According to the monitoring method provided by the embodiment, the operation data is judged according to a preset judgment rule according to a preset period to determine whether the first system and the second system have faults, so that the operation states of the first system and the second system can be accurately determined, and the first system and the second system can be ensured to stably operate.
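A minimal sketch of judging at a preset period follows. The helper name `run_periodically`, the fixed number of rounds and the injectable `sleep` parameter are assumptions made so that the example terminates and can be exercised without real waiting; an actual monitoring device would more likely run until stopped.

```python
import time
from typing import Callable, List


def run_periodically(judge: Callable[[], bool],
                     period_s: float,
                     rounds: int,
                     sleep: Callable[[float], None] = time.sleep) -> List[bool]:
    """Call judge() `rounds` times, waiting period_s seconds between calls.

    Returns the fault verdict of each round in order.
    """
    results: List[bool] = []
    for i in range(rounds):
        results.append(judge())
        if i < rounds - 1:       # no wait is needed after the final round
            sleep(period_s)
    return results


# Illustrative usage: a fake sleep records the waits instead of blocking.
waits: List[float] = []
samples = iter([False, True, False])
verdicts = run_periodically(lambda: next(samples), period_s=60.0, rounds=3,
                            sleep=waits.append)
```

A shorter `period_s` corresponds to the more important tasks described above, and a longer one trades detection latency for lower monitoring cost.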
Fig. 5 is a schematic flow chart of a monitoring method according to a fourth embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 5, the method includes:
step 401, respectively acquiring hardware operating data of a first system and a second system for resource sharing;
step 402, calculating the storage resource occupancy rates of the first system and the second system according to the hardware operation data of the first system and the second system;
step 403, judging the storage resource occupancy rates according to a preset judgment rule, and determining whether the number of consecutive times that the storage resource occupancy rates of the first system and the second system exceed a preset proportional threshold exceeds a preset first threshold, so as to determine whether the first system and the second system have a fault;
step 404, taking corresponding measures according to the judgment result so that the first system and the second system operate normally.
In this embodiment, after the hardware operation data of the first system and the second system is obtained, the current storage resource utilization rates of the first system and the second system may be determined according to the hardware operation data, and it can be understood that if the current storage resource utilization rate is higher, the problems of a slower operation speed and a lower task completion efficiency may be caused, so that the current hardware operation data may be determined according to a preset determination rule. Specifically, the determination rule may be to determine whether the number of times that the occupancy rates of the storage resources of the first system and the second system continuously exceed the preset proportional threshold exceeds a preset first threshold, for example, if the number of times that the occupancy rates of the storage resources of the first system and/or the second system continuously exceed 90% exceeds three times, it may be determined that the first system and/or the second system currently has a fault. The ratio threshold and the first threshold may be default thresholds of the system, or may be determined by the operation and maintenance staff, and the present invention is not limited herein.
According to the monitoring method provided by the embodiment, the storage resource occupancy rates of the first system and the second system are calculated according to the hardware operation data of the first system and the second system, the storage resource occupancy rates are judged according to the preset judgment rule, and whether the times that the storage resource occupancy rates of the first system and the second system continuously exceed the preset proportional threshold exceeds the preset first threshold is determined, so as to determine whether the first system and the second system are in fault, so that whether the first system and the second system are in fault at present can be accurately determined, a basis is provided for subsequent maintenance, and further, the stable operation of the first system and the second system can be ensured on the basis of saving cost.
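The judgment rule of this embodiment can be sketched as a counter of consecutive exceedances. The class and function names are hypothetical; the defaults of 90% and three follow the example in the description, and reading "exceeds three times" as a streak strictly greater than three is an assumption about the intended comparison.

```python
from dataclasses import dataclass


def occupancy(used_bytes: int, capacity_bytes: int) -> float:
    """Storage resource occupancy rate: used space divided by total space."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    return used_bytes / capacity_bytes


@dataclass
class ConsecutiveExceedanceRule:
    ratio_threshold: float = 0.90   # preset proportional threshold
    count_threshold: int = 3        # preset first threshold
    streak: int = 0                 # current run of consecutive exceedances

    def observe(self, ratio: float) -> bool:
        """Record one sample; return True once the streak exceeds count_threshold."""
        if ratio > self.ratio_threshold:
            self.streak += 1
        else:
            self.streak = 0         # any sample at or below the threshold resets the run
        return self.streak > self.count_threshold


# Illustrative usage with made-up occupancy samples for one system.
rule = ConsecutiveExceedanceRule()
verdicts = [rule.observe(r) for r in
            [0.95, 0.92, 0.91, 0.85, 0.93, 0.94, 0.95, 0.96]]
```

One rule instance would be kept per monitored system, so that a dip below the threshold on one system does not reset the streak of the other.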
Fig. 6 is a schematic flow chart of a monitoring method according to a fifth embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 6, the method includes:
step 501, respectively acquiring task operation data of a first system and a second system for resource sharing;
step 502, determining task completion rates of the first system and the second system according to the task operation data;
step 503, judging the task completion rates according to a preset judgment rule, and determining whether the task completion rates of the first system and the second system are lower than a preset second threshold, so as to determine whether the first system and the second system have a fault; and/or,
step 504, determining task completion time of the first system and the second system according to the task operation data;
step 505, judging the task completion time according to a preset judgment rule, and determining whether the task completion time of the first system and the second system exceeds a preset third threshold value to determine whether the first system and the second system have a fault;
step 506, taking corresponding measures according to the judgment result so as to enable the first system and the second system to normally operate.
In this embodiment, after the task operation data of the first system and the second system is acquired, the current task completion rate and/or task completion time may be determined according to the task operation data. It can be understood that a task completion rate lower than a preset threshold indicates that the software of the current system has failed; likewise, a task completion time exceeding a preset threshold indicates that the software of the current system has failed. Therefore, the task completion rates and/or task completion times of the first system and the second system can be determined according to the task operation data. The task completion rates are judged according to a preset judgment rule to determine whether they are lower than a preset second threshold, and/or the task completion times are judged according to the preset judgment rule to determine whether they exceed a preset third threshold, so as to determine whether the first system and the second system have a fault.
In the monitoring method provided by this embodiment, the task completion rates of the first system and the second system are determined according to the task operation data and judged according to a preset judgment rule to determine whether they are lower than a preset second threshold; and/or the task completion times of the first system and the second system are determined according to the task operation data and judged according to the preset judgment rule to determine whether they exceed a preset third threshold. In this way, whether the first system and the second system have a fault can be accurately determined, which provides a basis for subsequent maintenance, so that stable operation of the first system and the second system can be ensured while saving cost.
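The two threshold checks of this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the `TaskRecord` shape, the function names, the default threshold values, and the choice to treat an empty task list as fully complete are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRecord:
    """One entry of task operation data (hypothetical shape)."""
    completed: bool
    duration_s: float  # time the task took, in seconds


def completion_rate(tasks: List[TaskRecord]) -> float:
    """Fraction of tasks that completed; 1.0 for an empty list (assumption)."""
    if not tasks:
        return 1.0
    return sum(1 for t in tasks if t.completed) / len(tasks)


def judge_tasks(tasks: List[TaskRecord],
                second_threshold: float = 0.9,    # assumed minimum completion rate
                third_threshold: float = 600.0    # assumed maximum duration, seconds
                ) -> Dict[str, bool]:
    """Apply the completion-rate check and the completion-time check."""
    rate_fault = completion_rate(tasks) < second_threshold
    time_fault = any(t.duration_s > third_threshold for t in tasks)
    return {"rate_fault": rate_fault, "time_fault": time_fault}
```

Either flag being true would be treated as a software fault of the system whose task data was judged; the two checks can also be used separately, matching the "and/or" in steps 503 and 505.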
Fig. 7 is a schematic flow chart of a monitoring method according to a sixth embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 7, the method includes:
step 601, respectively acquiring hardware operation data of a first system and a second system for resource sharing;
step 602, calculating storage resource occupancy rates of the first system and the second system according to hardware operation data of the first system and the second system;
step 603, judging the storage resource occupancy rate according to a preset judgment rule, and determining whether the number of times that the storage resource occupancy rates of the first system and the second system continuously exceed a preset proportional threshold exceeds a preset first threshold, so as to determine whether the first system and the second system have a fault;
step 604, if the number of times that the occupancy rates of the storage resources of the first system and the second system continuously exceed a preset proportional threshold exceeds a preset first threshold, determining current idle cluster nodes in the first system and the second system;
step 605, processing the current task through the currently running cluster nodes in the first system and the second system and the idle cluster nodes.
In this embodiment, if the problem of the current system is a hardware problem, it can be solved by expanding onto idle nodes. Specifically, if it is determined that the number of times that the storage resource occupancy rates of the first system and the second system continuously exceed the preset proportional threshold exceeds the preset first threshold, the currently idle cluster nodes in the first system and the second system are determined, and the current task is processed through the currently running cluster nodes in the first system and the second system together with the idle cluster nodes, so that the operation problem caused by high storage resource occupancy can be solved. For example, all nodes under the Hadoop cluster can first be listed, the currently idle nodes are added to the Hadoop cluster nodes, and the Hadoop cluster configuration file is updated so that the newly added nodes can execute tasks. In addition, if nodes with low operating efficiency exist among the existing nodes, those nodes can be restarted. The Resource Manager (RM) is then called to check whether the resource adjustment has taken effect. It should be noted that the storage resource may be understood as memory and hard disk, and the computing resource may be understood as the CPU. Essentially, the storage and computing resources are both located on a server (a host used for storage and computing), so the two go together: when current resources are insufficient, both are increased at the same time; when current resources are redundant, both are removed at the same time.
In the monitoring method provided in this embodiment, if the number of times that the occupancy rates of the storage resources of the first system and the second system continuously exceed the preset proportional threshold exceeds the preset first threshold, the current idle cluster nodes in the first system and the second system are determined, and the current task is processed through the current running cluster nodes in the first system and the second system and the idle cluster nodes. Therefore, the self-healing of the system operation fault can be realized, and the stable operation of the system is guaranteed.
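The decision logic of steps 602 to 605 can be sketched as below. The sketch only models the judgment and the resulting node set; it does not call Hadoop or the Resource Manager. The function names, the default thresholds, and the reading of "continuously exceed" as the run of most recent consecutive samples above the proportional threshold are assumptions made for illustration.

```python
from typing import List, Sequence


def occupancy(used: float, total: float) -> float:
    """Storage resource occupancy rate as a fraction of capacity."""
    return used / total


def trailing_streak(samples: Sequence[float], ratio_threshold: float) -> int:
    """Count how many of the most recent samples in a row exceed the threshold."""
    streak = 0
    for value in reversed(samples):
        if value > ratio_threshold:
            streak += 1
        else:
            break
    return streak


def plan_nodes(samples: Sequence[float],
               running_nodes: Sequence[str],
               idle_nodes: Sequence[str],
               ratio_threshold: float = 0.8,   # assumed proportional threshold
               first_threshold: int = 3        # assumed first threshold (count)
               ) -> List[str]:
    """Return the nodes that should process the current task."""
    if trailing_streak(samples, ratio_threshold) > first_threshold:
        # Sustained high occupancy: expand onto the idle nodes.
        return list(running_nodes) + list(idle_nodes)
    return list(running_nodes)
```

Requiring a sustained run rather than a single sample above the threshold avoids expanding the cluster on a momentary spike.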
Fig. 8 is a schematic flow chart of a monitoring method according to a seventh embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 8, the method includes:
step 701, respectively acquiring task operation data of a first system and a second system for resource sharing;
step 702, determining task completion rates of the first system and the second system according to the task operation data;
step 703, judging the task completion rate according to a preset judgment rule, and determining whether the task completion rates of the first system and the second system are lower than a preset second threshold value, so as to determine whether the first system and the second system have a fault; and/or,
step 704, determining task completion time of the first system and the second system according to the task operation data;
step 705, judging the task completion time according to a preset judgment rule, and determining whether the task completion time of the first system and the second system exceeds a preset third threshold value to determine whether the first system and the second system have a fault;
step 706, if the task completion rates of the first system and the second system are lower than a preset second threshold, sending prompt information to operation and maintenance personnel, so that the operation and maintenance personnel can perform manual operation and maintenance according to the prompt information and the operation data; and/or,
step 707, if the task completion time of the first system and the second system exceeds a preset third threshold, sending prompt information to the operation and maintenance personnel, so that the operation and maintenance personnel perform manual operation and maintenance according to the prompt information and the operation data.
In this embodiment, if it is detected that the task completion rates of the first system and the second system are lower than a preset second threshold and/or the task completion times of the first system and the second system exceed a preset third threshold, the software of the system is determined to have failed. In this case, prompt information needs to be sent to the operation and maintenance personnel; the prompt information includes the failure time and failure details, so that the operation and maintenance personnel can perform operation and maintenance in time and the first system and the second system can operate normally.
In the monitoring method provided by the embodiment, if the task completion rates of the first system and the second system are lower than a preset second threshold, prompt information is sent to operation and maintenance personnel, so that the operation and maintenance personnel perform manual operation and maintenance according to the prompt information and the operation data; and/or if the task completion time of the first system and the second system exceeds a preset third threshold, sending prompt information to operation and maintenance personnel so that the operation and maintenance personnel can carry out manual operation and maintenance according to the prompt information and the operation data. Therefore, the first system and the second system can be ensured to operate normally.
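A prompt message carrying the failure time and failure details, as described for this embodiment, might be assembled as in the following sketch. The field names, the message wording, and the `send` callback (standing in for whatever channel reaches the operation and maintenance personnel) are hypothetical.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


def build_prompt(system: str, metric: str, value: float, threshold: float,
                 when: Optional[datetime] = None) -> Dict[str, str]:
    """Build prompt information containing failure time and failure details."""
    when = when or datetime.now(timezone.utc)
    return {
        "system": system,
        "failure_time": when.isoformat(),
        "failure_details": f"{metric}={value} violated threshold {threshold}",
    }


def notify(msg: Dict[str, str], send: Callable[[str], None]) -> None:
    """Deliver the prompt through a caller-supplied channel (assumed interface)."""
    send(f"[{msg['failure_time']}] {msg['system']}: {msg['failure_details']}")
```

Passing the delivery channel in as a function keeps the sketch independent of any particular mail, SMS, or chat service.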
Fig. 9 is a schematic flow chart of a monitoring method according to an eighth embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 9, the method further includes:
step 801, respectively acquiring running data of a first system and a second system for resource sharing;
step 802, generating a cluster state diagram according to the operation data and a preset statistical template, so that the operation and maintenance personnel can timely know the operation states of the first system and the second system according to the cluster state diagram.
In this embodiment, statistical templates may be prestored in the monitoring device, including but not limited to bar charts, pie charts, line charts, pictograms, and the like. After the operation data of the first system and the second system is collected, it can be filled into a statistical template to generate a cluster state diagram, so that operation and maintenance personnel can visually determine the operation state of the current system.
According to the monitoring method provided by the embodiment, the cluster state diagram is generated according to the operation data and the preset statistical template, so that the operation and maintenance personnel can timely know the operation states of the first system and the second system according to the cluster state diagram, the operation and maintenance personnel can visually determine the operation state of the current system, and the user experience is improved.
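As a rough illustration of filling operation data into a statistical template, the sketch below uses a plain-text bar chart as the template. The function names, the text layout, and the use of occupancy fractions between 0 and 1 as input are assumptions; an actual device would more likely render a graphical chart.

```python
from typing import Dict


def bar_template(label: str, value: float, width: int = 20) -> str:
    """Render one bar of a text bar chart for a value in the range 0..1."""
    value = min(max(value, 0.0), 1.0)          # clamp out-of-range input
    filled = round(value * width)
    return f"{label:<8}|{'#' * filled}{'.' * (width - filled)}| {value:.0%}"


def render_cluster_state(data: Dict[str, float], width: int = 20) -> str:
    """Fill the operation data of each system into the bar template."""
    return "\n".join(bar_template(name, value, width) for name, value in data.items())


if __name__ == "__main__":
    print(render_cluster_state({"first": 0.5, "second": 1.0}, width=10))
```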
Fig. 10 is a schematic structural diagram of a monitoring device according to a ninth embodiment of the present invention, and as shown in fig. 10, the monitoring device includes:
an obtaining module 91, configured to obtain operation data of a first system and a second system that perform resource sharing, respectively;
a judging module 92, configured to judge the operating data according to a preset judgment rule, so as to determine whether the first system and the second system have a fault;
a processing module 93, configured to take corresponding measures according to the judgment result, so that the first system and the second system operate normally.
The monitoring device provided in this embodiment respectively obtains the operation data of the first system and the second system for resource sharing; judges the operation data according to a preset judgment rule to determine whether the first system and the second system have a fault; and takes corresponding measures according to the judgment result so that the first system and the second system operate normally. Therefore, the running states of the two resource-sharing systems can be monitored in real time, system problems can be found and solved as early as possible, and the operational safety of the systems is improved while saving cost.
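The division of the device into an obtaining module, a judging module, and a processing module can be sketched as three pluggable callables. The class name `Monitor`, the callable signatures, and the system names "first" and "second" are hypothetical and chosen only for the example.

```python
from typing import Any, Callable, Dict, Sequence


class Monitor:
    """Wires together the obtaining, judging, and processing roles."""

    def __init__(self,
                 obtain: Callable[[str], Any],           # obtaining module
                 judge: Callable[[Any], bool],           # judging module
                 process: Callable[[str, Any], None]):   # processing module
        self.obtain = obtain
        self.judge = judge
        self.process = process

    def run_once(self, systems: Sequence[str] = ("first", "second")) -> Dict[str, bool]:
        """Obtain data for each system, judge it, and act on any fault."""
        results: Dict[str, bool] = {}
        for name in systems:
            data = self.obtain(name)
            faulty = self.judge(data)
            if faulty:
                self.process(name, data)
            results[name] = faulty
        return results
```

Calling `run_once` on a timer would correspond to judging the operation data according to a preset period.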
Further, on the basis of any of the above embodiments, the obtaining module includes:
a first obtaining unit, configured to respectively obtain the operation data of the first system and the second system for resource sharing through preset calling interfaces in the first system and the second system.
Further, on the basis of any of the above embodiments, the obtaining module includes:
a second obtaining unit, configured to respectively obtain hardware operation data of the first system and the second system for resource sharing.
Further, on the basis of any of the above embodiments, the obtaining module includes:
a third obtaining unit, configured to respectively obtain task operation data of the first system and the second system for resource sharing.
Further, on the basis of any of the above embodiments, the judging module includes:
a first judging unit, configured to judge the operation data according to a preset judgment rule at a preset period.
Further, on the basis of any of the above embodiments, the judging module includes:
a computing unit, configured to calculate the storage resource occupancy rates of the first system and the second system according to the hardware operation data of the first system and the second system;
a second judging unit, configured to judge the storage resource occupancy rate according to a preset judgment rule and determine whether the number of times that the storage resource occupancy rates of the first system and the second system continuously exceed a preset proportional threshold exceeds a preset first threshold.
Further, on the basis of any of the above embodiments, the judging module includes:
a first determining unit, configured to determine task completion rates of the first system and the second system according to the task operation data;
a third judging unit, configured to judge the task completion rate according to a preset judgment rule and determine whether the task completion rates of the first system and the second system are lower than a preset second threshold; and/or,
a second determining unit, configured to determine task completion time of the first system and the second system according to the task operation data;
a fourth judging unit, configured to judge the task completion time according to a preset judgment rule and determine whether the task completion time of the first system and the second system exceeds a preset third threshold.
Further, on the basis of any of the above embodiments, the processing module includes:
a third determining unit, configured to determine the currently idle cluster nodes in the first system and the second system if the number of times that the storage resource occupancy rates of the first system and the second system continuously exceed a preset proportional threshold exceeds a preset first threshold;
a first processing unit, configured to process the current task through the currently running cluster nodes in the first system and the second system and the idle cluster nodes.
Further, on the basis of any of the above embodiments, the processing module includes:
a second processing unit, configured to send prompt information to operation and maintenance personnel if the task completion rates of the first system and the second system are lower than a preset second threshold, so that the operation and maintenance personnel can perform manual operation and maintenance according to the prompt information and the operation data; and/or,
a third processing unit, configured to send prompt information to the operation and maintenance personnel if the task completion time of the first system and the second system exceeds a preset third threshold, so that the operation and maintenance personnel can perform manual operation and maintenance according to the prompt information and the operation data.
Further, on the basis of any of the above embodiments, the apparatus further includes:
a generating module, configured to generate a cluster state diagram according to the operation data and a preset statistical template, so that the operation and maintenance personnel can timely know the operation states of the first system and the second system according to the cluster state diagram.
Fig. 11 is a schematic structural diagram of a monitoring device provided in a tenth embodiment of the present invention, and as shown in fig. 11, the monitoring device includes: a memory 111 and a processor 112;
the memory 111 is configured to store instructions executable by the processor 112;
wherein the processor 112 is configured to execute the monitoring method according to any of the above embodiments.
The present invention further provides a computer-readable storage medium in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions implement the monitoring method according to any one of the above embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above; and the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (22)

1. A method of monitoring, comprising:
respectively acquiring running data of a first system and a second system for resource sharing;
judging the operating data according to a preset judgment rule to determine whether the first system and the second system have faults or not;
and taking corresponding measures according to the judgment result so as to enable the first system and the second system to normally operate.
2. The method of claim 1, wherein the obtaining the operation data of the first system and the second system for resource sharing respectively comprises:
and respectively acquiring the running data of the first system and the second system for resource sharing through preset calling interfaces in the first system and the second system.
3. The method of claim 1, wherein the obtaining the operation data of the first system and the second system for resource sharing respectively comprises:
hardware operation data of a first system and hardware operation data of a second system which share resources are respectively obtained.
4. The method of claim 1, wherein the obtaining the operation data of the first system and the second system for resource sharing respectively comprises:
task operation data of a first system and a second system for resource sharing are respectively obtained.
5. The method according to claim 1, wherein the determining the operation data according to a preset determination rule comprises:
and judging the operating data according to a preset judgment rule according to a preset period.
6. The method according to claim 3, wherein the determining the operation data according to a preset determination rule comprises:
calculating the storage resource occupancy rates of the first system and the second system according to the hardware operation data of the first system and the second system;
and judging the storage resource occupancy rate according to a preset judgment rule, and determining whether the number of times that the storage resource occupancy rates of the first system and the second system continuously exceed a preset proportional threshold exceeds a preset first threshold.
7. The method according to claim 4, wherein the determining the operation data according to a preset determination rule comprises:
determining task completion rates of the first system and the second system according to the task operation data;
judging the task completion rate according to a preset judgment rule, and determining whether the task completion rates of the first system and the second system are lower than a preset second threshold value; and/or,
determining task completion time of the first system and the second system according to the task operation data;
and judging the task completion time according to a preset judgment rule, and determining whether the task completion time of the first system and the second system exceeds a preset third threshold value.
8. The method according to claim 6, wherein taking corresponding measures according to the judgment result comprises:
if the number of times that the storage resource occupancy rates of the first system and the second system continuously exceed a preset proportional threshold exceeds a preset first threshold, determining the currently idle cluster nodes in the first system and the second system;
and processing the current task through the current running cluster nodes in the first system and the second system and the idle cluster nodes.
9. The method according to claim 7, wherein taking corresponding measures according to the judgment result comprises:
if the task completion rates of the first system and the second system are lower than a preset second threshold value, sending prompt information to operation and maintenance personnel so that the operation and maintenance personnel can carry out manual operation and maintenance according to the prompt information and the operation data; and/or,
and if the task completion time of the first system and the second system exceeds a preset third threshold, sending prompt information to operation and maintenance personnel so that the operation and maintenance personnel can carry out manual operation and maintenance according to the prompt information and the operation data.
10. The method according to any one of claims 1 to 9, wherein after the obtaining the operation data of the first system and the second system for resource sharing respectively, further comprises:
and generating a cluster state diagram according to the operation data and a preset statistical template so that the operation and maintenance personnel can timely know the operation states of the first system and the second system according to the cluster state diagram.
11. A monitoring device, comprising:
the acquisition module is used for respectively acquiring the running data of a first system and a second system for resource sharing;
the judging module is used for judging the operating data according to a preset judging rule so as to determine whether the first system and the second system have faults or not;
and the processing module is used for taking corresponding measures according to the judgment result so as to ensure that the first system and the second system operate normally.
12. The apparatus of claim 11, wherein the obtaining module comprises:
the first obtaining unit is used for respectively obtaining the running data of the first system and the second system for resource sharing through preset calling interfaces in the first system and the second system.
13. The apparatus of claim 11, wherein the obtaining module comprises:
and the second acquisition unit is used for respectively acquiring hardware operating data of the first system and the second system for resource sharing.
14. The apparatus of claim 11, wherein the obtaining module comprises:
and the third acquisition unit is used for respectively acquiring task operation data of the first system and the second system for resource sharing.
15. The apparatus of claim 11, wherein the determining module comprises:
and the first judgment unit is used for judging the operating data according to a preset judgment rule according to a preset period.
16. The apparatus of claim 13, wherein the determining module comprises:
the computing unit is used for computing the storage resource occupancy rates of the first system and the second system according to the hardware operation data of the first system and the second system;
and the second judging unit is used for judging the storage resource occupancy rate according to a preset judging rule and determining whether the number of times that the storage resource occupancy rates of the first system and the second system continuously exceed a preset proportional threshold exceeds a preset first threshold.
17. The apparatus of claim 14, wherein the determining module comprises:
a first determining unit, configured to determine task completion rates of the first system and the second system according to the task operation data;
the third judging unit is used for judging the task completion rate according to a preset judging rule and determining whether the task completion rates of the first system and the second system are lower than a preset second threshold value or not; and/or,
the second determining unit is used for determining task completion time of the first system and the second system according to the task operation data;
and the fourth judging unit is used for judging the task completion time according to a preset judging rule and determining whether the task completion time of the first system and the second system exceeds a preset third threshold value.
18. The apparatus of claim 16, wherein the processing module comprises:
a third determining unit, configured to determine a currently idle cluster node in the first system and the second system if the number of times that the occupancy rates of the storage resources of the first system and the second system continuously exceed a preset proportional threshold exceeds a preset first threshold;
and the first processing unit is used for processing the current task through the currently running cluster nodes in the first system and the second system and the idle cluster nodes.
19. The apparatus of claim 17, wherein the processing module comprises:
the second processing unit is used for sending prompt information to operation and maintenance personnel if the task completion rates of the first system and the second system are lower than a preset second threshold value, so that the operation and maintenance personnel can carry out manual operation and maintenance according to the prompt information and the operation data; and/or,
and the third processing unit is used for sending prompt information to the operation and maintenance personnel if the task completion time of the first system and the second system exceeds a preset third threshold value, so that the operation and maintenance personnel can carry out manual operation and maintenance according to the prompt information and the operation data.
20. The apparatus of any one of claims 11-19, further comprising:
and the generating module is used for generating a cluster state diagram according to the operating data and a preset statistical template so that the operation and maintenance personnel can timely know the operating states of the first system and the second system according to the cluster state diagram.
21. A monitoring device, comprising: a memory and a processor;
the memory is configured to store instructions executable by the processor;
wherein the processor is configured to perform the monitoring method of any one of claims 1-10.
22. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, implement the monitoring method of any one of claims 1-10.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910199173.3A CN111694705A (en) 2019-03-15 2019-03-15 Monitoring method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910199173.3A CN111694705A (en) 2019-03-15 2019-03-15 Monitoring method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111694705A true CN111694705A (en) 2020-09-22

Family

ID=72475449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910199173.3A Pending CN111694705A (en) 2019-03-15 2019-03-15 Monitoring method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111694705A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118991A (en) * 2021-11-12 2022-03-01 百果园技术(新加坡)有限公司 Third-party system monitoring system, method, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324539A (en) * 2013-06-24 2013-09-25 浪潮电子信息产业股份有限公司 Job scheduling management system and method
CN105718351A (en) * 2016-01-08 2016-06-29 北京汇商融通信息技术有限公司 Hadoop cluster-oriented distributed monitoring and management system
CN106815119A (en) * 2016-12-20 2017-06-09 曙光信息产业(北京)有限公司 The hardware monitoring device of server
CN106888254A (en) * 2017-01-20 2017-06-23 华南理工大学 A kind of exchange method between container cloud framework based on Kubernetes and its each module
CN108255661A (en) * 2016-12-29 2018-07-06 北京京东尚科信息技术有限公司 A kind of method and system for realizing Hadoop cluster monitorings
CN108881446A (en) * 2018-06-22 2018-11-23 深源恒际科技有限公司 A kind of artificial intelligence plateform system based on deep learning
CN109117259A (en) * 2018-07-25 2019-01-01 北京京东尚科信息技术有限公司 Method for scheduling task, platform, device and computer readable storage medium
CN109271233A (en) * 2018-07-25 2019-01-25 上海数耕智能科技有限公司 The implementation method of Hadoop cluster is set up based on Kubernetes
CN109413125A (en) * 2017-08-18 2019-03-01 北京京东尚科信息技术有限公司 The method and apparatus of dynamic regulation distributed system resource


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination