CN111736948A

CN111736948A - Cloud computing platform automation operation and maintenance system and method, terminal device and storage medium

Info

Publication number: CN111736948A
Application number: CN202010430955.6A
Authority: CN
Inventors: 王洋
Original assignee: Inesa R&D Center
Current assignee: Inesa R&D Center
Priority date: 2020-05-20
Filing date: 2020-05-20
Publication date: 2020-10-02
Anticipated expiration: 2040-05-20
Also published as: CN111736948B

Abstract

The invention relates to an automated operation and maintenance system and method for a cloud computing platform, a terminal device and a storage medium. The system comprises an operation and maintenance control subsystem, a nova-computer service module and a nova-api service module. The operation and maintenance control subsystem selectively kills the processes identified under different conditions and, when some of the virtual machines need to be live-migrated automatically, sends the corresponding request to the nova-api service module. The nova-computer service module is deployed on a host of the cloud computing platform; it marks and scores all processes of the host, configures resource overcommitment for the host on the cloud computing platform, and executes requests from the nova-api service module. The nova-api service module receives requests from the operation and maintenance control subsystem and notifies the nova-computer service module to execute them. Compared with the prior art, the invention improves the utilization of cloud resources and keeps the system running stably.

Description

Cloud computing platform automation operation and maintenance system and method, terminal device and storage medium

Technical Field

The invention relates to the technical field of cloud computing, in particular to an automatic operation and maintenance system and method for a cloud computing platform, terminal equipment and a storage medium.

Background

In a cloud computing platform, a large number of user virtual machines run on a cluster of hosts. Most virtual machines have low resource occupancy, so to save resources the platform often overcommits the CPU and memory of the host, that is, it schedules more virtual machine resources than the host physically has. Under overcommitment, however, user virtual machines intermittently hit resource consumption peaks and occupy large amounts of CPU and memory. This squeezes host resources, which often causes memory errors or CPU performance degradation on the host and affects the normal operation of user virtual machines.

The resource occupancy of user virtual machines is often uncontrollable. If the CPU and memory of the host are not overcommitted, a large amount of resources sits idle. If they are overcommitted, excessive demand intermittently squeezes host resources and degrades the user experience.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides an automatic operation and maintenance system, method, terminal equipment and storage medium for a cloud computing platform.

The purpose of the invention can be realized by the following technical scheme:

An automated operation and maintenance system for a cloud computing platform is applied in a cloud computing platform environment and comprises an operation and maintenance control subsystem, a nova-computer service module and a nova-api service module, wherein:

the operation and maintenance control subsystem is used for selectively killing the corresponding processes identified under different conditions, and for sending a corresponding request to the nova-api service module when automated live migration needs to be performed on some of the virtual machines;

the nova-computer service module is deployed on a host of the cloud computing platform, and is used for marking and scoring all processes of the host, for configuring resource overcommitment for the host on the cloud computing platform, and for executing requests from the nova-api service module;

and the nova-api service module is used for receiving the request from the operation and maintenance control subsystem and informing the nova-computer service module to execute the request.
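The request path described above (control subsystem to nova-api to nova-computer) can be pictured with a minimal in-process Python sketch. The class names `Request`, `ApiService` and `ComputeService`, their methods, and the host and virtual machine identifiers are all hypothetical stand-ins invented for illustration; they are not the OpenStack API, and the source does not specify the message format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Request:
    """A request issued by the control subsystem (hypothetical format)."""
    action: str  # "delete" or "live_migrate"
    vm_id: str   # identifier of the target virtual machine
    host: str    # host on which the virtual machine currently runs


@dataclass
class ComputeService:
    """Stand-in for the per-host service that executes requests."""
    host: str
    executed: List[Request] = field(default_factory=list)

    def execute(self, request: Request) -> None:
        # A real service would act on the hypervisor; here we only record.
        self.executed.append(request)


@dataclass
class ApiService:
    """Stand-in for the API service: receives a request and notifies
    the compute service on the target host to execute it."""
    computes: Dict[str, ComputeService] = field(default_factory=dict)

    def register(self, compute: ComputeService) -> None:
        self.computes[compute.host] = compute

    def submit(self, request: Request) -> None:
        self.computes[request.host].execute(request)


api = ApiService()
compute = ComputeService(host="host-1")
api.register(compute)

# The control subsystem would issue requests such as these.
api.submit(Request("delete", "vm-a", "host-1"))
api.submit(Request("live_migrate", "vm-b", "host-1"))
```

The control subsystem never talks to the compute service directly in this sketch, which mirrors the division of roles in the text.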

Further, the cloud computing platform is OpenStack.

Further, the operation and maintenance control subsystem further comprises a monitoring subsystem, wherein:

the monitoring subsystem is used for monitoring the memory and CPU usage of the host and for recording the host's CPU and memory utilization together with the corresponding marks and scores.

The invention also provides an operation and maintenance control method based on the cloud computing platform automation operation and maintenance system, which comprises the following steps:

step 1: configuring the nova-computer service module on a host of the cloud computing platform to mark all processes of the host and score them in order of importance;

step 2: configuring the nova-computer service module to overcommit the resources of the host on the cloud computing platform, wherein the overcommit ratios for the CPU and the memory of the host are set according to the historical average load on the cloud computing platform;

step 3: the monitoring subsystem in the operation and maintenance control subsystem monitors the CPU and memory utilization of the processes on the host together with their marks and scores;

step 4: based on the monitored CPU and memory utilization of the processes on the host and their marks and scores, the operation and maintenance control subsystem selectively kills the corresponding processes identified under different conditions and sends a request to delete the virtual machine on the host to the nova-api service module, and the nova-computer service module executes the deletion task, so as to ensure the normal operation of the important processes of the host;

step 5: based on the monitored CPU and memory utilization of the processes on the host and their marks and scores, the operation and maintenance control subsystem performs automated live migration on some of the virtual machines on the host by sending a live migration request to the nova-api service module, and the nova-computer service module executes the live migration request, so as to reduce the system load.

Further, the step 3 specifically includes:

the monitoring subsystem monitors the memory and CPU usage of the host and records the host's memory and CPU utilization; at the same time it records the memory and CPU utilization of every process on the host, and then calculates and records the memory determination score and the CPU determination score of each process.

Further, the CPU determination score of each process is given by the formula:

Cn = Sn * RCn

where Cn is the CPU determination score of the process numbered n, Sn is the mark score of the process numbered n, and RCn is the CPU utilization of the process numbered n.

The memory determination score for each process is described by the formula:

Mn＝Sn*RMn

in the formula, Mn is a memory determination score of a process with a process number n, and RMn is a memory usage rate of the process with the process number n.
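As a hedged illustration of the two formulas above, the following Python sketch computes Cn and Mn for a few processes. The `ProcessSample` record, its field names, the sample values, and the choice to express utilization as a fraction between 0 and 1 are assumptions made for this example; the source only defines Sn, RCn, RMn, Cn and Mn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessSample:
    """One monitored process (hypothetical record layout)."""
    number: int      # process number n
    score: float     # Sn: mark score of the process
    cpu_rate: float  # RCn: CPU utilization, assumed to be a fraction 0..1
    mem_rate: float  # RMn: memory utilization, assumed to be a fraction 0..1


def cpu_score(p: ProcessSample) -> float:
    """CPU determination score: Cn = Sn * RCn."""
    return p.score * p.cpu_rate


def mem_score(p: ProcessSample) -> float:
    """Memory determination score: Mn = Sn * RMn."""
    return p.score * p.mem_rate


# Made-up sample data for three processes.
samples = [
    ProcessSample(number=1, score=0, cpu_rate=0.50, mem_rate=0.40),
    ProcessSample(number=2, score=10, cpu_rate=0.25, mem_rate=0.10),
    ProcessSample(number=3, score=4, cpu_rate=0.50, mem_rate=0.25),
]

cpu_scores = {p.number: cpu_score(p) for p in samples}
mem_scores = {p.number: mem_score(p) for p in samples}
```

Because the score is a product, a process marked 0 always has a determination score of 0, however much CPU or memory it uses.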

Further, in step 4, the process by which the operation and maintenance control subsystem selectively kills the corresponding processes identified under different conditions, based on the monitored CPU and memory utilization of the processes on the host and their marks and scores, specifically includes:

setting the basic operation and maintenance intervention thresholds of the CPU and the memory as Tc and Tm, and the advanced operation and maintenance intervention thresholds as TC and TM;

if TC > RC > Tc, killing the process numbered mc on the host, wherein mc is the number of the process whose CPU determination score equals Max(C1, C2, …, Cn), C1, C2, …, Cn are the CPU determination scores of processes 1 to n, RC is the CPU resource utilization of the host, and Tc is the basic CPU operation and maintenance intervention threshold;

if TM > RM > Tm, killing the process numbered mm on the host, wherein mm is the number of the process whose memory determination score equals Max(M1, M2, …, Mn), M1, M2, …, Mn are the memory determination scores of processes 1 to n, RM is the memory resource utilization of the host, and Tm is the basic memory operation and maintenance intervention threshold.
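The kill rule can be sketched in Python as follows, assuming that C1…Cn and M1…Mn are the per-process determination scores Cn = Sn * RCn and Mn = Sn * RMn. The function name, parameter names and threshold values are hypothetical; the same function serves the CPU case (RC, Tc, TC) and the memory case (RM, Tm, TM).

```python
from typing import Dict, Optional


def pick_process_to_kill(host_rate: float,
                         basic_threshold: float,
                         advanced_threshold: float,
                         scores: Dict[int, float]) -> Optional[int]:
    """Return the number of the process with the highest determination
    score when basic_threshold < host_rate < advanced_threshold.

    Returns None when the host utilization is outside that band or when
    no scores are available. Ties go to the first process encountered.
    """
    if not scores:
        return None
    if not (basic_threshold < host_rate < advanced_threshold):
        return None
    return max(scores, key=scores.get)


# Example with assumed thresholds Tc = 0.80 and TC = 0.95.
cpu_scores = {1: 0.0, 2: 2.5, 3: 2.0}
victim = pick_process_to_kill(0.85, 0.80, 0.95, cpu_scores)
print(victim)  # 2
```

The sketch only selects a candidate. Issuing the deletion request through nova-api is outside its scope.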

Further, in step 5, the process by which the operation and maintenance control subsystem performs automated live migration on some of the virtual machines on the host, based on the monitored CPU and memory utilization of the processes on the host and their marks and scores, specifically includes:

if TC < RC, performing live migration on the virtual machine numbered tmc on the host, wherein tmc is the number of the virtual machine process whose CPU determination score equals Max(Ct1, Ct2, …, CtN), and Ct1, Ct2, …, CtN are the CPU determination scores of the processes t1 to tN corresponding to the virtual machines on the host;

if TM < RM, performing live migration on the virtual machine numbered tmm on the host, wherein tmm is the number of the virtual machine process whose memory determination score equals Max(Mt1, Mt2, …, MtN), and Mt1, Mt2, …, MtN are the memory determination scores of the processes t1 to tN corresponding to the virtual machines on the host.
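A comparable Python sketch for the live migration rule is given below. It assumes that Ct1…CtN and Mt1…MtN are the determination scores of the processes that correspond to virtual machines; the function name, parameters and sample values are hypothetical.

```python
from typing import Dict, Iterable, Optional


def pick_vm_to_migrate(host_rate: float,
                       advanced_threshold: float,
                       scores: Dict[int, float],
                       vm_process_numbers: Iterable[int]) -> Optional[int]:
    """Return the virtual machine process number with the highest
    determination score when host_rate exceeds advanced_threshold.

    Only processes listed in vm_process_numbers (t1..tN) are candidates.
    Returns None if the threshold is not exceeded or nothing qualifies.
    """
    if host_rate <= advanced_threshold:
        return None
    vm_scores = {n: scores[n] for n in vm_process_numbers if n in scores}
    if not vm_scores:
        return None
    return max(vm_scores, key=vm_scores.get)


# Example: processes 3 and 4 are virtual machines; assumed TC = 0.95.
cpu_scores = {1: 0.0, 2: 2.5, 3: 2.0, 4: 1.0}
target = pick_vm_to_migrate(0.97, 0.95, cpu_scores, [3, 4])
print(target)  # 3
```

Process 2 has the highest score overall but is not a virtual machine process, so it is not a migration candidate here.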

The invention also provides a terminal device, which comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the processor realizes the operation and maintenance control method based on the cloud computing platform automation operation and maintenance system when executing the computer program.

The invention also provides a computer readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the steps of the operation and maintenance control method based on the cloud computing platform automation operation and maintenance system.

Compared with the prior art, the invention has the following advantages:

(1) The automated operation and maintenance system and method, terminal device and storage medium of the invention are applied in a cloud computing environment. They configure resource overcommitment for the host so that platform resources are not wasted, while avoiding the squeezing of host resources and improving the user experience.

(2) In the invention, the monitoring subsystem monitors the CPU utilization of the host. When the CPU or memory load is too high, a process of low importance and high utilization is selected and killed according to the CPU and memory utilization of the processes on the host and their marks and scores. The request to delete the virtual machine is sent to the nova-api service and the deletion task is executed by the nova-computer service, thereby ensuring the normal operation of the important processes of the host, in particular the processes of user virtual machines.

(3) When the resource occupation of user virtual machine processes is too high and far exceeds that of the other processes on the host, the invention performs automated live migration on some of the virtual machines to reduce the system load. The live migration request is sent to the nova-api service and executed by the nova-computer service, which improves the utilization of cloud resources and keeps the system running stably.

Drawings

FIG. 1 is a system architecture diagram of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

The invention aims to provide an automatic operation and maintenance system and method for a cloud computing platform, which can improve the resource utilization rate of the cloud platform, ensure the stable operation of the system and optimize the user experience.

The system architecture of the invention is shown in FIG. 1. Based on this architecture, an automated operation and maintenance control system and method for an OpenStack cloud computing platform are provided and applied in an OpenStack environment, wherein:

the system comprises an operation and maintenance control subsystem, a nova-computer service module and a nova-api service module, wherein:

the nova-computer service module is deployed on a host of the cloud computing platform and is used for marking and scoring all processes of the host, for configuring resource overcommitment for the host on the cloud computing platform, and for executing requests from the nova-api service module;

the nova-api service module is used for receiving a request from the operation and maintenance control subsystem and informing the nova-computer service module to execute the request;

the operation and maintenance control subsystem further comprises a monitoring subsystem, wherein:

The method comprises the following steps of,

(1) Configuring the nova-computer service on a host of the cloud platform to mark and score all processes of the host. Processes are scored according to importance: important processes receive low scores and unimportant processes receive high scores.

(2) Configuring the nova-computer service to overcommit the resources of the host of the cloud platform; the overcommit ratios for the CPU and the memory can be set according to the historical average load on the platform.

(3) The monitoring subsystem monitors the CPU utilization of the host. When the CPU or memory load is too high, a process of low importance and high utilization is selected and killed according to the CPU and memory utilization of the processes on the host and their marks and scores. The request to delete the virtual machine is sent to the nova-api service and the deletion task is executed by the nova-computer service, so as to ensure the normal operation of the important processes of the host, in particular the processes of user virtual machines.

(4) When the resource occupation of user virtual machine processes is too high and far exceeds that of the other processes on the host, automated live migration is performed on some of the virtual machines to reduce the system load. The live migration request is sent to the nova-api service and executed by the nova-computer service.

Detailed Description of Embodiment(s) of the Invention

(1) In a cloud platform environment, a system administrator marks and scores all processes (process numbers 1 … N) on a host, and the mark value of the process numbered n is denoted Sn. Important processes score low, for example 0, and unimportant processes score high, for example 10. In principle, all processes corresponding to user virtual machines should be marked 0, although a virtual machine that the system administrator does not consider important may also be given a high score.

(2) The nova-computer service overcommits the resources of the host of the cloud platform; the overcommit ratios for the CPU and the memory can be set according to the historical average load on the platform. A commonly used memory overcommit ratio is 1.5, and a commonly used CPU overcommit ratio is 3. This improves the utilization of system resources. In other words, when the cloud platform schedules, it treats the host as having 1.5 times its actual memory and 3 times its actual CPU capacity, so the host can carry more virtual machine load. Because the memory and CPU resources of user virtual machines usually do not reach 100% utilization, the normal operation of the host is not affected.
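The arithmetic behind these ratios can be shown with a short Python sketch. The host size (32 cores, 128 GiB) and the function name are assumptions chosen for illustration; only the ratios 3 and 1.5 come from the text. In OpenStack Nova, ratios of this kind are normally set through the `cpu_allocation_ratio` and `ram_allocation_ratio` configuration options, which the source does not name.

```python
def schedulable_capacity(physical: float, overcommit_ratio: float) -> float:
    """Capacity the scheduler assumes: physical capacity times the ratio."""
    if physical < 0 or overcommit_ratio <= 0:
        raise ValueError("physical must be >= 0 and overcommit_ratio > 0")
    return physical * overcommit_ratio


# Hypothetical host: 32 physical CPU cores and 128 GiB of RAM.
vcpus = schedulable_capacity(32, 3.0)     # CPU overcommit ratio from the text
ram_gib = schedulable_capacity(128, 1.5)  # memory overcommit ratio from the text
print(vcpus, ram_gib)  # 96.0 192.0
```

With these assumed figures the scheduler would place up to 96 virtual CPUs and 192 GiB of virtual machine memory on a host that physically has 32 cores and 128 GiB.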

(3) The monitoring subsystem monitors the memory and CPU usage of the host and records the host's CPU and memory utilization as RC and RM. At the same time it records the CPU and memory utilization of all processes (numbered 1 … N) on the host: the CPU utilization of the process numbered n is denoted RCn, and its memory utilization is denoted RMn. The CPU determination score Cn = Sn * RCn and the memory determination score Mn = Sn * RMn of each process are recorded.

(4) The operation and maintenance control subsystem records the process numbers t1, t2, …, tN corresponding to the virtual machines on the host, and then determines the ratio of the CPU consumption of the virtual machines on the host to the total CPU consumption of the host, and the ratio of the memory consumption of the virtual machines on the host to the total memory consumption of the host.

(5) Setting the basic operation and maintenance intervention thresholds of the CPU and the memory as Tc and Tm, and the advanced operation and maintenance intervention thresholds as TC and TM;

if TM > RM > Tm, killing the process numbered mm on the host, wherein mm is the number of the process whose memory determination score equals Max(M1, M2, …, Mn), M1, M2, …, Mn are the memory determination scores of processes 1 to n, RM is the memory resource utilization of the host, and Tm is the basic memory operation and maintenance intervention threshold;

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An automatic operation and maintenance system of a cloud computing platform is applied to the environment of the cloud computing platform, and is characterized by comprising an operation and maintenance control subsystem, a nova-computer service module and a nova-api service module, wherein:

2. The cloud computing platform automation operation and maintenance system of claim 1, wherein the cloud computing platform is OpenStack.

3. The cloud computing platform automation operation and maintenance system of claim 1, wherein the operation and maintenance control subsystem further comprises a monitoring subsystem, wherein:

4. The operation and maintenance control method of the cloud computing platform automation operation and maintenance system according to claim 1, characterized by comprising the following steps:

5. The operation and maintenance control method based on the cloud computing platform automation operation and maintenance system according to claim 4, wherein the step 3 specifically includes:

6. The operation and maintenance control method based on the cloud computing platform automation operation and maintenance system according to claim 5, wherein the CPU determination score of each process is given by the formula:

Cn＝Sn*RCn

The memory determination score for each process is described by the formula:

Mn＝Sn*RMn

7. The operation and maintenance control method based on the cloud computing platform automation operation and maintenance system according to claim 4, wherein in step 4 the process by which the operation and maintenance control subsystem selectively kills the corresponding processes identified under different conditions, based on the monitored CPU and memory utilization of the processes on the host and their marks and scores, specifically comprises:

8. The operation and maintenance control method based on the cloud computing platform automation operation and maintenance system according to claim 4, wherein in step 5 the process by which the operation and maintenance control subsystem performs automated live migration on some of the virtual machines on the host, based on the monitored CPU and memory utilization of the processes on the host and their marks and scores, specifically comprises:

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor implements the operation and maintenance control method of the cloud computing platform-based automation operation and maintenance system according to any one of claims 4 to 8 when executing the computer program.

10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the steps of the operation and maintenance control method of the cloud computing platform-based automation operation and maintenance system according to any one of claims 4 to 8.