CN116501460A

CN116501460A - Cloud host dynamic migration monitoring and early warning method

Info

Publication number: CN116501460A
Application number: CN202310320979.XA
Authority: CN
Inventors: 林德生; 郑生华
Original assignee: China Youke Communication Technology Co ltd
Current assignee: China Youke Communication Technology Co ltd
Priority date: 2023-03-29
Filing date: 2023-03-29
Publication date: 2023-07-28

Abstract

The invention relates to a cloud host dynamic migration monitoring and early warning method comprising the following steps. S1: model the relationship between cloud hosts and host machines. S2: collect cloud resource pool relation data, deploy the integrated acquisition service, and establish a cross monitoring matrix. S3: monitor and judge the running state of the host machine; when the host machine is abnormal, proceed to step S4. S4: perform online migration of the cloud host and start the cloud host migration state monitoring function. S5: monitor the cloud host's current host information, running state and key service state, and perform cloud host state fault judgment according to the monitoring results. S6: perform fault handling. S7: the integrated acquisition service switches to monitoring the new host machine and repeats steps S3 to S6. The method monitors the dynamic migration process of the cloud host, so that faults arising during dynamic migration are discovered and reported promptly and accurately.

Description

Cloud host dynamic migration monitoring and early warning method

Technical Field

The invention relates to the technical field of cloud computing, in particular to a cloud host dynamic migration monitoring and early warning method.

Background

With the rapid development of cloud computing, many telecom operators have moved a large number of services to the cloud. A cloud computing environment pools large amounts of physical and virtual resources. To keep hosted services running stably when those resources fail, platforms such as VMware and OpenStack provide cloud host dynamic migration technology, which allows a cloud host to be migrated automatically before its host machine fails or when the host machine performs poorly. Dynamic migration (Live Migration), also called online migration (Online Migration), is the process of moving a cloud host from one physical host machine to another while the services on the cloud host keep running normally, so that the physical server can be taken offline for maintenance or upgrade without affecting tenants. In some special cases, however, migration does not succeed: the cloud host may not be migrated at all, may not be running after migration, or its business service process may have stopped. In addition, the cloud hosts that run the monitoring collection nodes are themselves dynamic, and a failure of their external network link or exhaustion of their own resources can lead to misjudged faults, producing false alarms and reducing alarm reliability. Therefore, to keep services running and improve customer satisfaction, it is necessary to monitor accurately whether the cloud host migrates successfully when a host machine fails and, if it does not, to notify the relevant operation and maintenance personnel promptly so that they can migrate it manually and handle the fault quickly.

Disclosure of Invention

The invention aims to provide a cloud host dynamic migration monitoring and early warning method, which can monitor the cloud host dynamic migration process, so as to timely and accurately discover faults occurring during the cloud host dynamic migration and perform early warning.

In order to achieve the above purpose, the invention adopts the following technical scheme: a cloud host dynamic migration monitoring and early warning method comprises the following steps:

step S1: modeling a relationship between a cloud host and a host machine;

step S2: collecting cloud resource pool relation data, deploying integrated acquisition service, and establishing a cross monitoring matrix;

step S3: monitoring and judging the running state of the host machine; when the host machine is abnormal, the step S4 is entered;

step S4: performing online migration of the cloud host, and starting a cloud host migration state monitoring function;

step S5: monitoring current host information of the cloud host, monitoring the running state of the cloud host, monitoring the key service state of the cloud host, and performing cloud host state fault judgment according to the monitoring result;

step S6: performing fault treatment;

step S7: the integrated acquisition service monitors the new host instead, and repeats the steps S3-S6.

Further, in the step S1, in the modeling process, a cloud host, a host entity table and an attribute table are defined, and system metadata is registered.

Further, the step S2 specifically includes the following steps:

step S201: connecting to the cloud resource pool API and collecting computing resource relation data for the target cloud host, including its current host machine, the resources of that host machine, and relation data for all host machines in the resource pool;

step S202: deploying the integrated acquisition service on cloud hosts in different network domains or network segments, and establishing a cross monitoring matrix.

Further, the step S3 specifically includes the following steps:

step S301: monitoring the network quality of the host machine; the integrated acquisition service probes the target host machine over ICMP, monitors its network quality in real time, and judges whether any index exceeds its threshold; the network monitoring indexes comprise network connectivity, network delay and packet loss rate; the monitored data comprise an index value, a timestamp and an abnormal state value, which is set to 1 when abnormal and 0 when normal;

step S302: monitoring the hardware state of the host machine; the integrated acquisition service collects the hardware health state of the target host machine through IPMI and monitors whether it has a hardware fault; the hardware monitoring indexes comprise the power supply state, current state, voltage state, fan state, processor state, memory state, temperature state and event log state; for each index the state value is set to 1 when abnormal and 0 when normal;

step S303: performing host state fault judgment; the index data collected by the cross monitoring matrix in step S301 and step S302 are combined with a logical AND operation according to their time correlation to determine the final value of each index and thereby judge whether the host machine is abnormal, and the fault judgment result is sent to the control center; the control center starts the cloud host migration automatically according to the host state fault judgment result, and when that result is abnormal, step S4 is entered.

Further, in step S4, the cloud host online migration is performed by using the cloud host automatic migration technology, and the cloud host migration status monitoring function is started, that is, step S5 is entered.

Further, the step S5 specifically includes the following steps:

step S501: if the cloud host is migrated successfully, its current host machine is a new host machine, so whether migration completed can be judged from the current host machine; the integrated acquisition service queries the current host machine of the cloud host through the cloud resource pool API; if the current host machine is still the failed host machine, the cloud host has not been migrated, a timestamped abnormal data record indicating that the cloud host was not migrated is generated and the state value is set to 1; otherwise the state value is set to 0;

step S502: if the cloud host is migrated successfully, it remains powered on and running; when step S501 finds that the cloud host has moved to a new host machine, the integrated acquisition service probes the cloud host over ICMP; if the probe fails, the cloud host may not be powered on, a timestamped abnormal data record is generated and the state value is set to 1; otherwise the state value is set to 0;

step S503: if the cloud host is migrated successfully, the business service is not affected; the ports of the key services are probed; if a probe is abnormal, a timestamped abnormal data record indicating that the service is not started is generated and the state value is set to 1; otherwise the state value is set to 0;

step S504: the cloud host migration state monitoring results of step S501, step S502 and step S503 are combined with a logical AND operation and the result is sent to the control center; the control center locates the specific link that failed according to the cloud host state fault judgment result, and when that result is abnormal, fault handling is performed, that is, step S6 is entered.

Further, the step S6 specifically includes the following steps:

step S601: handling the fault in which the cloud host is not migrated; when step S504 judges that the cloud host has not been migrated or is not powered on, the fault is identified as the responsibility of the cloud resource pool administrator, who is notified to carry out the subsequent fault handling operations, including manual migration and manual power-on; the tenant of the cloud host is notified at the same time and informed of the fault cause, so that the fault is discovered before the customer notices it;

step S602: handling the fault in which a key process is not started after the cloud host is migrated; when step S504 judges that the cloud host has been migrated but the service process is not started, the tenant is notified to start it manually or to invoke an automated operations script.

Compared with the prior art, the invention has the following beneficial effects: the invention can monitor the host state and the cloud host state in the cloud host dynamic migration process, thereby analyzing and diagnosing faults generated during the cloud host dynamic migration timely and accurately, enhancing the efficiency of service fault positioning and processing in the cloud computing environment, improving the service quality of the service and improving the customer satisfaction.

Drawings

FIG. 1 is a flow chart of a method implementation of an embodiment of the present invention.

Fig. 2 is a schematic diagram of a cross-monitoring matrix according to an embodiment of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

As shown in fig. 1, this embodiment provides a cloud host dynamic migration monitoring and early warning method, which includes the following steps:

Step S1: modeling the relationship between the cloud host and the host machine, defining the cloud host and host machine entity tables and attribute tables, and registering system metadata.

Step S2: collecting cloud resource pool relation data, deploying the integrated acquisition service, and establishing a cross monitoring matrix, as shown in FIG. 2.

In this embodiment, the step S2 specifically includes the following steps:

Step S201: collecting cloud resource pool relation data. The cloud resource pool API is connected to, and computing resource relation data for the target cloud host are collected, including its current host machine, the resources of that host machine, and relation data for all host machines in the resource pool.

Step S202: establishing a cross monitoring matrix. The integrated acquisition service itself runs on cloud hosts, and because those cloud hosts are dynamic, a failure of their external network connection can cause faults to be misjudged, which introduces interference and reduces alarm reliability. The cross monitoring matrix is therefore established by deploying the integrated acquisition service on cloud hosts in different network domains or network segments.
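As a rough illustration of step S202, the sketch below builds a cross monitoring matrix in which every monitored host machine is probed by one collector from each network segment. The patent does not specify a data structure, so the class `CrossMonitoringMatrix`, the nested `Collector` type and the one-collector-per-segment rule are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the patent does not prescribe this data structure.
final class CrossMonitoringMatrix {
    // One integrated acquisition service instance running on a cloud host.
    static final class Collector {
        final String name;
        final String segment; // network domain or segment the collector lives in

        Collector(String name, String segment) {
            this.name = name;
            this.segment = segment;
        }
    }

    // Maps each target host machine to one collector per distinct segment, so a
    // link failure in a single segment cannot distort every probe of a target.
    static Map<String, List<String>> build(List<Collector> collectors, List<String> targets) {
        Map<String, String> onePerSegment = new LinkedHashMap<>();
        for (Collector c : collectors) {
            onePerSegment.putIfAbsent(c.segment, c.name); // keep the first collector seen per segment
        }
        Map<String, List<String>> matrix = new LinkedHashMap<>();
        for (String target : targets) {
            matrix.put(target, new ArrayList<>(onePerSegment.values()));
        }
        return matrix;
    }
}
```

Each row of the resulting map lists the vantage points whose results are later combined during fault judgment.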

Step S3: monitoring and judging the running state of the host machine; when the host is abnormal, the process proceeds to step S4.

In this embodiment, the step S3 specifically includes the following steps:

Step S301: monitoring the network quality of the host machine.

The integrated acquisition service probes the target host machine over ICMP, monitors its network quality in real time, and judges whether any index exceeds its threshold. The network monitoring indexes comprise network connectivity, network delay and packet loss rate; the monitored data comprise an index value, a timestamp and an abnormal state value, which is set to 1 when abnormal and 0 when normal.

Step S302: monitoring the hardware state of the host machine.

The integrated acquisition service collects the hardware health state of the target host machine through IPMI and monitors whether it has a hardware fault. The hardware monitoring indexes cover eight hardware states: power supply, current, voltage, fan, processor, memory, temperature and event log. For each, the state value is set to 1 when abnormal and 0 when normal.

Step S303: performing host state fault judgment.

The index data collected by the cross monitoring matrix in step S301 and step S302 are combined with a logical AND operation according to their time correlation to determine the final value of each index and thereby judge whether the host machine is abnormal, and the fault judgment result is sent to the control center. The control center starts the cloud host migration automatically according to the host state fault judgment result, and when that result is abnormal, step S4 is entered.

Step S4: performing online migration of the cloud host.

The cloud host is migrated online using the cloud host automatic migration technology, and the cloud host migration state monitoring function is started, that is, step S5 is entered.

Step S5: monitoring current host information of the cloud host, monitoring the running state of the cloud host, monitoring the key service state of the cloud host, and performing cloud host state fault judgment according to the monitoring result.

In this embodiment, the step S5 specifically includes the following steps:

Step S501: monitoring the current host information of the cloud host.

If the cloud host is migrated successfully, its current host machine is a new host machine, so whether migration completed can be judged from the current host machine. The integrated acquisition service queries the current host machine of the cloud host through the cloud resource pool API. If the current host machine is still the failed host machine, the cloud host has not been migrated; a timestamped abnormal data record indicating that the cloud host was not migrated is generated and the state value is set to 1. Otherwise the state value is set to 0.

Step S502: monitoring the running state of the cloud host.

If the cloud host is migrated successfully, it remains powered on and running. When step S501 finds that the cloud host has moved to a new host machine, the integrated acquisition service probes the cloud host over ICMP. If the probe fails, the cloud host may not be powered on; a timestamped abnormal data record is generated and the state value is set to 1. Otherwise the state value is set to 0.

Step S503: monitoring the key service state of the cloud host.

If the cloud host is migrated successfully, the business service is not affected. The ports of the key services are probed. If a probe is abnormal, a timestamped abnormal data record indicating that the service is not started is generated and the state value is set to 1. Otherwise the state value is set to 0.

Step S504: cloud host cross-matrix fault judgment.

The cloud host migration state monitoring results of step S501, step S502 and step S503 are combined with a logical AND operation and the result is sent to the control center. The control center locates the specific link that failed according to the cloud host state fault judgment result, and when that result is abnormal, fault handling is performed, that is, step S6 is entered.

Step S6: performing fault handling.

In this embodiment, the step S6 specifically includes the following steps:

Step S601: handling the fault in which the cloud host is not migrated.

When step S504 judges that the cloud host has not been migrated or is not powered on, the fault is identified as the responsibility of the cloud resource pool administrator, who is notified to carry out the subsequent fault handling operations, including manual migration and manual power-on. The tenant of the cloud host is notified at the same time and informed of the fault cause, so that the fault is discovered before the customer notices it.

Step S602: handling the fault in which a key process is not started after the cloud host is migrated.

When step S504 judges that the cloud host has been migrated but the service process is not started, the tenant is notified to start it manually or to invoke an automated operations script.

The process according to the invention is further illustrated by the following example.

In this embodiment, the cloud host dynamic migration monitoring method includes:

Step S1: modeling the cloud host and host machine organization, defining the cloud host and host machine entity tables and attribute tables, and registering system metadata.

Entity attribute data:

Table 1 Metadata entity example data

Entity ID	Entity Java class name	Entity name
cm_01_37_03_01	HostConfEntity	Host machine entity
cm_01_37_05_01	VmConfEntity	Cloud host entity

Table 2 Host machine entity attribute example data

Sequence number	Owning entity ID	Java member name	Data type	Attribute name	Attribute table field
1	cm_01_37_03_01	cloudCenterId	Integer	Cloud center number	cm_01_37_01_01_01
2	cm_01_37_03_01	cloudPoolId	Integer	Cloud resource pool number	cm_01_37_02_01_01
3	cm_01_37_03_01	hostId	Integer	Host machine number	cm_01_37_03_01_01
4	cm_01_37_03_01	resourceType	String	Resource category	cm_01_37_03_01_02
5	cm_01_37_03_01	hostName	String	Host machine name	cm_01_37_03_01_03
6	cm_01_37_03_01	resourceRemark	String	Resource description	cm_01_37_03_01_04
7	cm_01_37_03_01	uuId	String	UUID	cm_01_37_03_01_05
8	cm_01_37_03_01	hostIp	Integer	Management IP	cm_01_37_03_01_06
9	cm_01_37_03_01	softwareVersion	String	Virtualization software version	cm_01_37_03_01_07
10	cm_01_37_03_01	totalMemory	Integer	Total memory	cm_01_37_03_01_08
11	cm_01_37_03_01	cpuModel	String	CPU model	cm_01_37_03_01_09
12	cm_01_37_03_01	cpuNums	Integer	Number of CPUs	cm_01_37_03_01_10
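The attribute rows of Table 2 map naturally onto a plain Java class. The sketch below is illustrative only: the class names and entity IDs come from Table 1 and the field names and types from Table 2, but the package-private field layout is a simplification, and the `VmConfEntity` fields and the `runsOn` helper are assumptions, since the cloud host attributes of Table 3 are not reproduced in the text.

```java
// Sketch of the host machine entity; names and types follow Tables 1 and 2.
final class HostConfEntity {
    static final String ENTITY_ID = "cm_01_37_03_01";

    Integer cloudCenterId;   // cm_01_37_01_01_01: cloud center number
    Integer cloudPoolId;     // cm_01_37_02_01_01: cloud resource pool number
    Integer hostId;          // cm_01_37_03_01_01: host machine number
    String resourceType;     // cm_01_37_03_01_02: resource category
    String hostName;         // cm_01_37_03_01_03: host machine name
    String resourceRemark;   // cm_01_37_03_01_04: resource description
    String uuId;             // cm_01_37_03_01_05: UUID
    Integer hostIp;          // cm_01_37_03_01_06: management IP (Integer, as listed in Table 2)
    String softwareVersion;  // cm_01_37_03_01_07: virtualization software version
    Integer totalMemory;     // cm_01_37_03_01_08: total memory
    String cpuModel;         // cm_01_37_03_01_09: CPU model
    Integer cpuNums;         // cm_01_37_03_01_10: number of CPUs
}

// Assumed shape of the cloud host entity; only the entity ID is from Table 1.
final class VmConfEntity {
    static final String ENTITY_ID = "cm_01_37_05_01";

    String vmName;   // assumed attribute
    Integer hostId;  // assumed link to HostConfEntity.hostId (the current host machine)

    // True when this cloud host is currently placed on the given host machine.
    boolean runsOn(HostConfEntity host) {
        return hostId != null && hostId.equals(host.hostId);
    }
}
```

Keeping the current host machine as a reference on the cloud host entity is one simple way to represent the relationship that step S1 models.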

Table 3 cloud host entity attribute example data

Step S2: deploying the integrated acquisition service

Step S201: collecting cloud resource pool relation data. The VMware cloud resource pool is accessed through the VMware vSphere API (vSphere SDK for Java), and the monitored cloud host and the other host machines in its cluster are collected and stored. The implementation is as follows:

1) Connect to vCenter and instantiate a vCenter service instance:

ServiceInstance si = VCenterConnectService.getServiceInstance(vcenterId);

2) Acquire the data center:

Datacenter datacenter = (Datacenter) new InventoryNavigator(si.getRootFolder()).searchManagedEntity("Datacenter", datacenterName);

3) Acquire the project folder of the cloud host:

Folder project = (Folder) new InventoryNavigator(datacenter.getVmFolder()).searchManagedEntity("Folder", projectName);

4) Acquire the cloud host object by cloud host name:

VirtualMachine vm = (VirtualMachine) new InventoryNavigator(project).searchManagedEntity("VirtualMachine", vmName);

5) Acquire the real-time running state of the cloud host and obtain its current host machine:

ManagedObjectReference hostInfo = vm.getRuntime().getHost();

HostSystem host = new HostSystem(si.getServerConnection(), hostInfo);

String hostname = host.getName();

Step S3: host machine running state monitoring

Step S301: monitoring the network quality of the host machine. The current host machine is probed over ICMP to monitor whether it has a network fault. The monitoring indexes comprise network connectivity, network delay and packet loss rate; the data comprise an index value, a timestamp and an abnormal state value, which is set to 1 when abnormal and 0 when normal.

The ICMP probe returns its result in the following form:

10 packets transmitted, 10 received, 0% packet loss, time 9007ms

rtt min/avg/max/mdev = 0.545/0.639/0.762/0.062 ms

when the packet loss rate is 100% or the average delay exceeds 0.3
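A minimal sketch of how such a probe summary could be turned into the 0/1 abnormal state value, assuming Linux-style ping output in ASCII. The class `PingResultParser` and its method are hypothetical, the delay threshold is a caller-supplied parameter in milliseconds rather than a value taken from the text, and treating unparseable output as abnormal is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that derives the S301 state value from a ping summary.
final class PingResultParser {
    private static final Pattern LOSS =
        Pattern.compile("(\\d+(?:\\.\\d+)?)%\\s*packet loss");
    private static final Pattern RTT =
        Pattern.compile("min/avg/max/mdev\\s*=\\s*[\\d.]+/([\\d.]+)/");

    // Returns 1 (abnormal) when every packet was lost or the average round-trip
    // time exceeds maxAvgRttMs, and 0 (normal) otherwise.
    static int abnormalFlag(String pingOutput, double maxAvgRttMs) {
        Matcher loss = LOSS.matcher(pingOutput);
        if (!loss.find()) {
            return 1; // assumption: a missing summary line counts as abnormal
        }
        if (Double.parseDouble(loss.group(1)) >= 100.0) {
            return 1; // network not connected
        }
        Matcher rtt = RTT.matcher(pingOutput);
        if (!rtt.find()) {
            return 1; // assumption: no latency statistics counts as abnormal
        }
        return Double.parseDouble(rtt.group(1)) > maxAvgRttMs ? 1 : 0;
    }
}
```

A collector would attach a timestamp to the returned flag before reporting it, as the step describes.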

Step S302: monitoring the hardware state of the host machine. IPMI commands are issued to the current host machine by polling to collect its hardware health state. The collected hardware states comprise temperature, fan, voltage, current, processor, memory and power supply; whether a hardware fault has occurred is judged from them, with the state value set to 1 when abnormal and 0 when normal.

An example temperature collection result follows. Whether a fault has occurred is judged from the state value in the third column; any value other than ns, lnc, unc or ok is abnormal.

CPU 1 Temp | 01h | ns | 3.1 | Disabled

CPU 2 Temp | 02h | ns | 3.2 | Disabled

IOH 2 Temp | 0Dh | ns | 7.1 | Disabled

Ambient Temp | 0Eh | ok | 7.1 | 17 degrees C
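The third-column rule can be sketched as follows. The class `IpmiSensorCheck` and its methods are hypothetical; the set of normal status codes is taken from the text above, while treating unparseable lines as abnormal and flagging the hardware as faulty when any single sensor is abnormal are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper that derives the S302 state value from IPMI sensor lines.
final class IpmiSensorCheck {
    // Status codes the text treats as normal; everything else is abnormal.
    private static final Set<String> NORMAL =
        new HashSet<>(Arrays.asList("ns", "lnc", "unc", "ok"));

    // Reads the third pipe-separated column of one sensor line and returns
    // 1 (abnormal) or 0 (normal).
    static int abnormalFlag(String sensorLine) {
        String[] cols = sensorLine.split("\\|");
        if (cols.length < 3) {
            return 1; // assumption: an unparseable line counts as abnormal
        }
        String status = cols[2].trim().toLowerCase(Locale.ROOT);
        return NORMAL.contains(status) ? 0 : 1;
    }

    // Returns 1 when any non-blank sensor line is abnormal, 0 otherwise.
    static int anyAbnormal(List<String> sensorLines) {
        for (String line : sensorLines) {
            if (!line.trim().isEmpty() && abnormalFlag(line) == 1) {
                return 1;
            }
        }
        return 0;
    }
}
```

The same check applies unchanged to fan, voltage and power supply sensor lines, since only the status column is inspected.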

Step S303: host machine fault judgment.

The host state data collected by the matrix are evaluated. The result data for each index collected by the matrix in step S301 and step S302 are combined with a logical AND operation according to their time correlation to determine the final value of the index and confirm whether the host machine is abnormal, and the result is sent to the control center. When the host fault judgment result is abnormal, the control center automatically starts the cloud host migration module.
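One way to read this combination step is as a logical AND over the 0/1 flags that different collectors report for the same index within the same time window: an index is abnormal only if every vantage point agrees, which filters out a collector whose own network link has failed. The sketch below follows that interpretation. The class `HostFaultJudge`, the fixed window, the "no data means normal" rule and the "any abnormal index makes the host machine abnormal" rule are all assumptions, not details given in the text.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of host fault judgment over cross-matrix samples.
final class HostFaultJudge {
    static final class Sample {
        final String collector;
        final String index;      // e.g. "connectivity", "fan"
        final long timestampMs;
        final int abnormal;      // 1 = abnormal, 0 = normal

        Sample(String collector, String index, long timestampMs, int abnormal) {
            this.collector = collector;
            this.index = index;
            this.timestampMs = timestampMs;
            this.abnormal = abnormal;
        }
    }

    // ANDs the flags of all samples for one index inside [windowStartMs, windowStartMs + windowMs).
    static int finalValue(List<Sample> samples, String index, long windowStartMs, long windowMs) {
        int result = 1;
        boolean seen = false;
        for (Sample s : samples) {
            boolean inWindow = s.timestampMs >= windowStartMs
                && s.timestampMs < windowStartMs + windowMs;
            if (inWindow && s.index.equals(index)) {
                result &= s.abnormal;
                seen = true;
            }
        }
        return seen ? result : 0; // assumption: no data in the window is not a fault
    }

    // Assumption: the host machine is abnormal when any index has final value 1.
    static boolean hostAbnormal(List<Sample> samples, long windowStartMs, long windowMs) {
        Set<String> indexes = new TreeSet<>();
        for (Sample s : samples) {
            indexes.add(s.index);
        }
        for (String index : indexes) {
            if (finalValue(samples, index, windowStartMs, windowMs) == 1) {
                return true;
            }
        }
        return false;
    }
}
```

With two collectors in different segments, a single dissenting "normal" report is enough to suppress the alarm for that index.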

Step S4: cloud host automatic migration

The cloud host is migrated online using the cloud host automatic migration technology, and the cloud host migration state monitoring function is started. Automatic migration is triggered through the vMotion interface of the VMware vSphere API exposed by the VMware cloud resource pool, and the cloud host migration state monitoring task is started.

Step S5: monitoring cloud host migration status

Step S501: monitoring the current host information of the cloud host. If the VMware vMotion live migration succeeds, the current host machine of the cloud host should be the new host machine, so whether migration completed can be judged from the current host machine. The integrated acquisition service queries the current host machine of the cloud host through the cloud resource pool API. If it is still the failed host machine, the cloud host has not been migrated automatically; a timestamped abnormal data record indicating that the cloud host was not migrated is generated and the state value is set to 1. Otherwise the state value is set to 0.
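The comparison in step S501 can be sketched as below. The class `MigrationHostCheck` is hypothetical, and the `currentHostLookup` function merely stands in for the cloud resource pool API query; treating a failed lookup (a null result) as "not migrated" is an assumption.

```java
import java.util.function.Function;

// Hypothetical sketch of the S501 check.
final class MigrationHostCheck {
    // Returns 1 when the cloud host is still on the failed host machine (or its
    // current host machine cannot be determined), and 0 when it has moved elsewhere.
    static int notMigratedFlag(String vmName, String failedHostName,
                               Function<String, String> currentHostLookup) {
        String currentHost = currentHostLookup.apply(vmName);
        if (currentHost == null) {
            return 1; // assumption: an unknown placement is reported as abnormal
        }
        return currentHost.equals(failedHostName) ? 1 : 0;
    }
}
```

In a VMware deployment the lookup could wrap the host name query shown in step S201; here it is injected so the comparison stays independent of any particular SDK.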

Step S502: monitoring the running state of the cloud host. When migration succeeds, the cloud host remains powered on and running. When step S501 finds that the cloud host has moved to a new host machine, the integrated acquisition service probes the cloud host over ICMP. If the probe fails, the cloud host may not be powered on; a timestamped abnormal data record is generated and the state value is set to 1. Otherwise the state value is set to 0.

Step S503: monitoring the key service state of the cloud host. If the cloud host migrates seamlessly, the business service is not affected. The ports of the key services are therefore probed. If a probe is abnormal, a timestamped abnormal data record indicating that the service is not started is generated and the state value is set to 1. Otherwise the state value is set to 0.

The specific implementation is as follows: the connection state of the target address and service port (HTTP service port or UDP service port) is collected with the ncat command.

An example of an abnormal result:

curl: (7) Failed connect to IP:port; Connection timed out

If an abnormality is found, a timestamped abnormal data record indicating that the service is not started is generated and the state value is set to 1; otherwise the state value is set to 0.
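For illustration, the same port check can be approximated in Java with a plain TCP connect and a timeout. This is a stand-in for the command-line probe described above, not the patent's implementation: the class `ServicePortProbe` is hypothetical, only TCP ports are covered (UDP services would need a protocol-specific probe), and any I/O failure is treated as "service not started".

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch of the S503 key service port probe (TCP only).
final class ServicePortProbe {
    // Returns 0 when a TCP connection to host:port succeeds within timeoutMs,
    // and 1 when the connection is refused, unreachable or times out.
    static int abnormalFlag(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return 0;
        } catch (IOException e) {
            return 1;
        }
    }
}
```

A successful connect only shows that something is listening on the port; an HTTP service could additionally be checked with a request if a stronger health signal is needed.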

Step S504: cloud host cross-matrix fault judgment. The collection results for the cloud host migration state from steps S501, S502 and S503 are combined with a logical AND operation according to their time correlation, and the result is sent to the control center. When the cloud host state fault judgment result is abnormal, the control center locates the specific link that failed and calls the fault handling module.
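A possible reading of this step, sketched under stated assumptions: for each of the three checks, the flags reported by the different collectors are ANDed into one state value, and the control center then walks the checks in order to find the first failed link. The class `MigrationFaultJudge`, the `Fault` enum and the ordering rule are hypothetical and not specified in the text.

```java
// Hypothetical sketch of S504: combine collector flags, then locate the failed link.
final class MigrationFaultJudge {
    enum Fault { NONE, NOT_MIGRATED, NOT_POWERED_ON, SERVICE_NOT_STARTED }

    // ANDs the 0/1 flags reported by several collectors for one check;
    // returns 0 when no flags are supplied.
    static int combine(int... flags) {
        if (flags.length == 0) {
            return 0;
        }
        int result = 1;
        for (int flag : flags) {
            result &= flag;
        }
        return result;
    }

    // Takes the combined state values of S501, S502 and S503 and returns the
    // first failed link; later checks only matter once the earlier ones pass.
    static Fault locate(int notMigrated, int notPoweredOn, int serviceNotStarted) {
        if (notMigrated == 1) {
            return Fault.NOT_MIGRATED;
        }
        if (notPoweredOn == 1) {
            return Fault.NOT_POWERED_ON;
        }
        if (serviceNotStarted == 1) {
            return Fault.SERVICE_NOT_STARTED;
        }
        return Fault.NONE;
    }
}
```

For example, `locate(combine(0, 0), combine(1, 1), combine(1, 1))` reports that the cloud host moved but is not powered on.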

Step S6: alert notification and handling

Step S601: handling the fault in which the cloud host is not migrated. When step S504 determines that the cloud host has not been migrated or is not powered on, the fault is identified as the responsibility of the cloud resource pool administrator, who is notified to carry out the subsequent fault handling operations, such as manual migration or manual power-on. The tenant of the cloud host is notified at the same time and informed of the fault cause, so that the fault is discovered before the customer notices it.

Step S602: handling the fault in which a key process is not started after the cloud host is migrated. When step S504 determines that the cloud host has been migrated but the service process is not started, the tenant is notified to start it manually or to invoke an automated operations script.
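The routing in steps S601 and S602 amounts to a small lookup from fault kind to the parties who must be notified. The sketch below is a hypothetical rendering: the class `FaultNotifier` and its enums are invented for illustration, and the actual notification channel is left out.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of who is notified for each located fault.
final class FaultNotifier {
    enum Kind { NONE, NOT_MIGRATED, NOT_POWERED_ON, SERVICE_NOT_STARTED }
    enum Recipient { ADMINISTRATOR, TENANT }

    static Set<Recipient> recipients(Kind kind) {
        switch (kind) {
            case NOT_MIGRATED:
            case NOT_POWERED_ON:
                // S601: the resource pool administrator handles the fault and
                // the tenant is informed of the cause.
                return EnumSet.of(Recipient.ADMINISTRATOR, Recipient.TENANT);
            case SERVICE_NOT_STARTED:
                // S602: the tenant restarts the service or runs a script.
                return EnumSet.of(Recipient.TENANT);
            default:
                return EnumSet.noneOf(Recipient.class);
        }
    }
}
```

Keeping the routing in one place makes it easy to extend, for example by adding an on-call operator for a new fault kind.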

Step S7: the integrated acquisition service instead monitors the new host machine and repeats the steps S3-S6.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the invention in any way, and any person skilled in the art may make modifications or alterations to the disclosed technical content to the equivalent embodiments. However, any simple modification, equivalent variation and variation of the above embodiments according to the technical substance of the present invention still fall within the protection scope of the technical solution of the present invention.

Claims

1. The cloud host dynamic migration monitoring and early warning method is characterized by comprising the following steps of:

step S1: modeling a relationship between a cloud host and a host machine;

step S6: performing fault treatment;

2. The method for monitoring and early warning of cloud host dynamic migration according to claim 1, wherein in the step S1, in the modeling process, a cloud host, a host entity table and an attribute table are defined, and system metadata is registered.

3. The cloud host dynamic migration monitoring and early warning method according to claim 1, wherein the step S2 specifically includes the following steps:

4. The cloud host dynamic migration monitoring and early warning method according to claim 1, wherein the step S3 specifically includes the following steps:

5. The method for monitoring and early warning of cloud host dynamic migration according to claim 1, wherein in step S4, cloud host online migration is performed by using a cloud host automatic migration technology, and a cloud host migration status monitoring function is started, namely step S5 is entered.

6. The cloud host dynamic migration monitoring and early warning method according to claim 1, wherein the step S5 specifically includes the following steps:

7. The cloud host dynamic migration monitoring and early warning method according to claim 6, wherein the step S6 specifically includes the following steps: