CN116627702A - Method and device for restarting virtual machine in downtime - Google Patents


Info

Publication number
CN116627702A
Authority
CN
China
Prior art keywords
core cpu
monitoring
virtual
virtual machine
cpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310631940.XA
Other languages
Chinese (zh)
Inventor
燕飞祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou DPTech Technologies Co Ltd
Original Assignee
Hangzhou DPTech Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou DPTech Technologies Co Ltd
Priority to CN202310631940.XA
Publication of CN116627702A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0712 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a virtual computing platform, e.g. logically partitioned systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a method and a device for restarting a virtual machine after downtime. The method comprises the following steps: adding a virtual peripheral to the virtual machine; monitoring a multi-core CPU of the virtual machine based on the virtual peripheral and a plurality of monitoring functions, wherein the plurality of monitoring functions comprise a watchdog function, a dog feeding function and a dog clearing function, and the multi-core CPU comprises a core CPU, a control core CPU and a data core CPU; when the core CPU is abnormal, initiating a monitoring interrupt request through the virtual peripheral; when the control core CPU or the data core CPU is abnormal, initiating a monitoring interrupt request through the core CPU; and recording abnormal information and restarting according to the monitoring interrupt request. The method and the device can save the abnormal information and restart the device in time when the virtual machine goes down because of a problem such as a kernel deadlock or an infinite loop, thereby ensuring the safe operation of the system.

Description

Method and device for restarting virtual machine in downtime
Technical Field
The disclosure relates to the field of computer information processing, and in particular to a method and a device for restarting a virtual machine after downtime.
Background
QEMU is an open-source full-virtualization solution that runs in user space and can virtualize a complete operating system on an Intel x86 machine. As a user-mode tool, QEMU is responsible for virtualizing the devices other than the CPU and memory, and for creating and invoking the various virtual devices. QEMU supports a variety of architectures and operating systems. In addition to simulating desktop computers and server systems, it can also simulate embedded systems; in this usage it is known as the QEMU embedded system simulator.
The QEMU embedded system simulator is based on the principle that an embedded system running environment comprising a processor, a memory, a peripheral device and the like is simulated on a host. In this simulator environment, embedded system image files may be loaded and run for software development and testing. The simulator can greatly simplify the development process of the embedded system and improve the development efficiency and the software quality. QEMU is widely used in various fields today.
However, QEMU runs in user space and cannot perceive certain problems in the kernel mode of the Linux operating system. It cannot determine whether the kernel is in an abnormal state, for example whether a kernel infinite loop or a kernel deadlock has occurred, and such kernel-mode problems can crash the system. QEMU currently cannot automatically record abnormal information and restart the operating system under these conditions, so until the problem is discovered manually the device remains down and service remains interrupted.
Therefore, a new method and device for restarting the virtual machine in downtime are needed.
The above information disclosed in the background section is only for enhancement of understanding of the background of the application and therefore it may contain information that does not form the prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the above, the application provides a method and a device for restarting a virtual machine after downtime, which can save abnormal information and restart the device in time when the virtual machine goes down because of a problem such as a kernel deadlock or an infinite loop, thereby ensuring the safe operation of the system.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of the present application, a method for restarting a virtual machine in downtime is provided, the method comprising: adding virtual peripheral equipment in the virtual machine; monitoring a multi-core CPU of the virtual machine based on a virtual peripheral and a plurality of monitoring functions, wherein the plurality of monitoring functions comprise a watchdog function, a dog feeding function and a dog clearing function, and the multi-core CPU comprises a core CPU, a control core CPU and a data core CPU; when the core CPU is abnormal, initiating a monitoring interrupt request through the virtual peripheral; when the control core CPU or the data core CPU is abnormal, initiating a monitoring interrupt request through the core CPU; recording abnormal information according to the monitoring interrupt request and restarting.
In an exemplary embodiment of the present application, adding a virtual peripheral in a virtual machine includes: modifying a kernel configuration file in the process of manufacturing a virtual image file of the virtual machine; and adding the virtual peripheral in the hardware equipment based on the modified kernel configuration file.
In an exemplary embodiment of the present application, adding the virtual peripheral device in the hardware device based on the modified kernel configuration file includes: installing the virtual image file in the virtual machine through a KVM virtual machine; and adding the virtual peripheral in the installation process.
In an exemplary embodiment of the present application, monitoring a multicore CPU of a virtual machine based on a virtual peripheral and a plurality of monitoring functions includes: enabling the virtual peripheral in a virtual machine to monitor a core CPU; the plurality of monitoring functions are arranged in the core CPU to monitor the control core CPU and the data core CPU.
In one exemplary embodiment of the present application, enabling the virtual peripheral in a virtual machine to monitor a core CPU includes: enabling the i6300esb virtual peripheral in the virtual machine; the i6300esb virtual peripheral turns on the inject-nmi function to monitor the core CPU.
In one exemplary embodiment of the present application, the i6300esb virtual peripheral turning on the inject-nmi function to monitor the core CPU comprises: the i6300esb virtual peripheral, with the inject-nmi function selected, acquires a monitored health state variable of the core CPU; monitoring is performed based on the monitored health state variable of the core CPU.
In an exemplary embodiment of the present application, the setting of the plurality of monitoring functions in the core CPU to monitor the control core CPU, the data core CPU includes: initializing monitoring health state variables of a control core CPU and a data core CPU in the starting process of the virtual machine; the watchdog function, the watchdog feeding function and the watchdog clearing function are registered to monitor the control core CPU and the data core CPU.
In one exemplary embodiment of the present application, registering and enabling the watchdog function, the watchdog feeding function, and the watchdog clearing function to monitor the control core CPU and the data core CPU includes: starting a first preset kernel thread in the core CPU during startup of the virtual machine; initializing the monitored health state variables of the control core and the data core by means of the first preset kernel thread; after the initialization is completed, the first preset kernel thread starts a second preset kernel thread; and checking the monitored health state variables of the control core CPU and the data core CPU through the second preset kernel thread.
In an exemplary embodiment of the present application, checking the monitored health state variables of the control core CPU and the data core CPU through the second preset kernel thread includes: generating a control core count value through a global variable preset in the control core CPU; generating a data core count value from the utilization rate of the data core CPU; the second preset kernel thread acquires the control core count value and the data core count value of the control core CPU and the data core CPU respectively; and checking the monitored health state variables of the control core CPU and the data core CPU according to the control core count value and the data core count value.
According to an aspect of the present application, a downtime restarting device of a virtual machine is provided, the device includes: the installation module is used for adding virtual peripherals into the virtual machine; the monitoring module is used for monitoring a multi-core CPU of the virtual machine based on the virtual peripheral and a plurality of monitoring functions, wherein the plurality of monitoring functions comprise a watchdog function, a dog feeding function and a dog clearing function, and the multi-core CPU comprises a core CPU, a control core CPU and a data core CPU; the peripheral module is used for initiating a monitoring interrupt request through the virtual peripheral when the core CPU is abnormal; the core module is used for initiating a monitoring interrupt request through the core CPU when the control core CPU or the data core CPU is abnormal; and the restarting module is used for recording abnormal information according to the monitoring interrupt request and restarting.
According to an aspect of the present application, there is provided an electronic device including: one or more processors; a storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the methods as described above.
According to an aspect of the application, a computer-readable medium is proposed, on which a computer program is stored, which program, when being executed by a processor, implements a method as described above.
According to the downtime restarting method and device of the virtual machine, a virtual peripheral is added to the virtual machine; a multi-core CPU of the virtual machine is monitored based on the virtual peripheral and a plurality of monitoring functions, wherein the plurality of monitoring functions comprise a watchdog function, a dog feeding function and a dog clearing function, and the multi-core CPU comprises a core CPU, a control core CPU and a data core CPU; when the core CPU is abnormal, a monitoring interrupt request is initiated through the virtual peripheral; when the control core CPU or the data core CPU is abnormal, a monitoring interrupt request is initiated through the core CPU; and abnormal information is recorded and a restart is performed according to the monitoring interrupt request. In this way, abnormal information can be saved in time and the device restarted when the virtual machine goes down because of a problem such as a kernel deadlock or an infinite loop, thereby ensuring the safe operation of the system.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are only some embodiments of the present application and other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flow chart illustrating a method of downtime restarting a virtual machine, according to an example embodiment.
FIG. 2 is a flow chart illustrating a method of downtime restarting a virtual machine, according to an example embodiment.
Fig. 3 is a flowchart illustrating a method of restarting a virtual machine at downtime, according to another example embodiment.
Fig. 4 is a flowchart illustrating a method of restarting a virtual machine at downtime, according to another example embodiment.
Fig. 5 is a block diagram illustrating a downtime restarting apparatus of a virtual machine, according to an example embodiment.
Fig. 6 is a block diagram of an electronic device, according to an example embodiment.
Fig. 7 is a block diagram of a computer-readable medium shown according to an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another element. Accordingly, a first component discussed below could be termed a second component without departing from the teachings of the present inventive concept. As used herein, the term "and/or" includes any one of the associated listed items and all combinations of one or more.
Those skilled in the art will appreciate that the drawings are schematic representations of example embodiments and that the modules or flows in the drawings are not necessarily required to practice the application and therefore should not be taken to limit the scope of the application.
The technical abbreviations involved in the present application are explained as follows:
NFVI (network function virtualization infrastructure solution): is a set of resources used to host and connect virtual functions. Specifically, NFVI is a cloud data center that contains servers, virtualization hypervisors, operating systems, virtual machines, virtual switches, and network resources.
KVM: a module dedicated to providing virtualization functions. It is mainly responsible for virtualizing the CPU and memory and for managing the drivers of virtual hardware devices, and it virtualizes the CPU with high efficiency.
Watchdog: also known as a watchdog timer, is an electronic or software timer used to detect and recover from computer failures. Watchdog timers are widely used in computers to facilitate automatic correction of temporary hardware failures and to prevent errant or malicious software from disrupting the operation of the system. During normal operation, the computer periodically restarts the watchdog timer to prevent it from expiring or timing out. If the computer fails to restart the watchdog due to a hardware failure or a program error, the timer will expire and generate a timeout signal. This timeout signal is used to initiate corrective action. Corrective action typically includes placing the computer and associated hardware in a secure state and invoking a computer restart. Microcontrollers typically include an integrated, on-chip watchdog. In other computers, the watchdog may reside in a nearby chip that is directly connected to the CPU, or it may reside on an external expansion card of the computer/chassis.
Periodically restarting a watchdog timer is commonly referred to as feeding the dog, and the function that does so is the dog feeding function. Feeding is typically accomplished by writing to a watchdog control port or setting a specific bit in a register. In addition, some tightly coupled watchdog timers are restarted by executing a special machine language instruction. For example, in a Linux operating system, a user space program feeds the watchdog by interacting with the watchdog device driver, typically by writing a zero character to /dev/watchdog or by invoking the KEEPALIVE ioctl. Some watchdog timers only allow feeding within a specific time window. The window is typically relative to the previous feeding time; if the watchdog is not fed within that window, this is treated as a fault and triggers corrective action. The function that clears the watchdog timer signal is called the dog clearing function.
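The user space feeding described above can be sketched as follows. This is a minimal illustration rather than the application's implementation: the function name `feed_loop` and its parameters are hypothetical, and the final write of the character `V` relies on the "magic close" convention of the Linux watchdog driver API, which only takes effect on drivers that support it.

```python
import os
import time


def feed_loop(device_path, feeds, interval_s):
    """Feed a Linux-style watchdog device a fixed number of times.

    device_path is normally /dev/watchdog; any writable file works for a dry run.
    """
    fd = os.open(device_path, os.O_WRONLY)
    try:
        for _ in range(feeds):
            # Any write to the device restarts the countdown ("feeds the dog").
            os.write(fd, b"\0")
            time.sleep(interval_s)
        # Assumption: the driver honours the "magic close" character 'V',
        # so closing the device afterwards does not leave the timer armed.
        os.write(fd, b"V")
    finally:
        os.close(fd)
```

On a real system the interval must be shorter than the watchdog timeout; if the loop stops running, for example because the process can no longer get CPU time, the timer expires and the corrective action fires.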
The watchdog timer is called enabled when running and disabled when idle. After power up, the watchdog may be unconditionally enabled or may be initially disabled, requiring an external signal to enable it. In the latter case, the enable signal may be generated automatically by hardware or under software control. When automatically generated, the enable signal is typically from a computer reset signal. In some systems, the reset signal is used directly to enable the watchdog.
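The enable, feed, and timeout behaviour described in the last three paragraphs can be modelled in a few lines. The class below is only an illustrative software model with a caller-supplied clock; the name `WatchdogTimer` and its methods are hypothetical and do not correspond to any driver interface.

```python
class WatchdogTimer:
    """Toy model of a watchdog timer; time values are passed in explicitly."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.enabled = False      # initially disabled, waiting for an enable signal
        self._last_fed = 0.0

    def enable(self, now):
        # Enabling starts the countdown from the current time.
        self.enabled = True
        self._last_fed = now

    def disable(self):
        self.enabled = False

    def feed(self, now):
        # Feeding restarts the countdown; it has no effect while disabled.
        if self.enabled:
            self._last_fed = now

    def expired(self, now):
        # A disabled timer never expires; an enabled one expires once
        # timeout_s has elapsed since the last feed.
        return self.enabled and (now - self._last_fed) >= self.timeout_s
```

For example, with a 30 second timeout, a timer enabled at t=0 and fed at t=20 has not expired at t=40 but has expired at t=50.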
Through research, the applicant found the following prior-art scheme. A virtual watchdog is started when the virtual machine starts; after the virtual machine has started, a high-precision timer is created and started; a kernel monitoring thread and a user state monitoring thread are created, where the kernel monitoring thread detects the kernel fault heartbeat based on netlink. When the user state monitoring thread, which feeds the dog from user state, detects that the kernel monitoring thread has found a heartbeat abnormality, the user state monitoring thread closes the watchdog and records a kernel fault information log. The user state monitoring thread then opens the watchdog again, and if the user state feeding abnormality lasts longer than the preset feeding abnormality time, a user state fault information log is recorded and the watchdog triggers a system fault reset and restart.
However, with the prior-art method, if the kernel deadlocks or enters an infinite loop, the user process cannot obtain CPU time and therefore cannot close the watchdog, so the subsequent steps may never be carried out.
In view of the defects in the prior art, the application provides a downtime restarting method for a virtual machine. In practical application, after a device goes down because of a kernel infinite loop or a kernel deadlock, the method records the abnormal information of the virtual machine device and restarts it, so that service is restored in time.
The following describes the present application in detail with reference to specific examples.
FIG. 1 is a flow chart illustrating a method of downtime restarting a virtual machine, according to an example embodiment. The downtime restarting method 10 of the virtual machine at least includes steps S102 to S110.
As shown in fig. 1, in S102, a virtual peripheral is added in a virtual machine. The kernel configuration file can be modified in the process of manufacturing the virtual image file of the virtual machine; and adding the virtual peripheral in the hardware equipment based on the modified kernel configuration file.
More specifically, the virtual image file may be installed in the virtual machine by a KVM virtual machine; and adding the virtual peripheral in the installation process.
In S104, the multi-core CPU of the virtual machine is monitored based on the virtual peripheral and a plurality of monitoring functions, where the plurality of monitoring functions include a watchdog function, a dog feeding function, and a dog clearing function, and the multi-core CPU includes a core CPU, a control core CPU, and a data core CPU.
To improve forwarding performance, in the embodiment of the present application all vCPUs are divided into control cores and data cores. The control cores run user state processes and the control plane processing flows (such as issuing and processing table entries), while the data cores mainly run the packet forwarding and DPI processing flows.
In the present application, a core CPU may also be included, and the core CPU may be core 0 (CPU0). The computing cores in a CPU are generally numbered: the two cores of a dual-core CPU are called CPU0 and CPU1, and likewise the cores of a quad-core CPU are CPU0, CPU1, CPU2 and CPU3. In the present application, the core CPU is the core of the virtual system.
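As an illustration of the role split just described, the sketch below assigns numbered vCPUs to the three roles. It rests on assumptions the text does not state: the three roles are treated as disjoint, CPU0 is taken as the core CPU, and the number of control cores is a caller-chosen parameter. The function name `partition_vcpus` is hypothetical.

```python
def partition_vcpus(vcpu_count, control_count):
    """Split vCPU numbers 0..vcpu_count-1 into core, control and data roles.

    Assumption: CPU0 is the core CPU, the next control_count vCPUs are
    control cores, and all remaining vCPUs are data cores.
    """
    if vcpu_count < 3:
        raise ValueError("need at least one vCPU per role")
    if not 1 <= control_count <= vcpu_count - 2:
        raise ValueError("control_count must leave at least one data core")
    return {
        "core": [0],
        "control": list(range(1, 1 + control_count)),
        "data": list(range(1 + control_count, vcpu_count)),
    }
```

With four vCPUs and one control core, this yields CPU0 as the core CPU, CPU1 as the control core, and CPU2 and CPU3 as data cores.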
In one embodiment, the virtual peripheral may be enabled in a virtual machine to monitor a core CPU.
In one embodiment, the plurality of monitoring functions may also be provided in the core CPU to monitor the control core CPU, the data core CPU.
The specific content of "monitor the multicore CPU of the virtual machine based on the virtual peripheral and a plurality of monitor functions" is described in detail in the corresponding embodiment of fig. 3.
In S106, when an exception occurs in the core CPU, a monitoring interrupt request is initiated through the virtual peripheral. The health of the core CPU may be monitored by the inject-nmi function of the i6300esb watchdog, a virtual hardware device of the KVM.
In S108, when an abnormality occurs in the control core CPU or the data core CPU, a monitoring interrupt request is initiated by the core CPU. A watchdog kernel thread running on the core CPU uses the watchdog enabling, watchdog disabling, and dog feeding functions of the i6300esb device and is responsible for monitoring the health status of the other cores.
In practical application, when the device is operating normally, the dog feeding function is executed at regular intervals. When a CPU is found to be abnormal, for example because a core is stuck in an infinite loop or a deadlock and cannot yield CPU time for the counting to run, the monitoring thread jumps out of the cyclic monitoring state and begins the operations that precede the restart.
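One way to picture this check is a monitor that compares each core's count value with the value seen on the previous pass and flags a core whose count has stopped advancing. The sketch below is an assumption-laden illustration: the class name `HeartbeatMonitor`, the `max_missed` threshold, and the dictionary interface are all hypothetical, and the application does not specify how many stalled passes count as a fault.

```python
class HeartbeatMonitor:
    """Flag cores whose count value has not changed for several checks."""

    def __init__(self, core_ids, max_missed):
        self.max_missed = max_missed
        self._last = {core: None for core in core_ids}
        self._missed = {core: 0 for core in core_ids}

    def check(self, counts):
        """counts maps core id -> current count value; returns abnormal core ids."""
        abnormal = []
        for core, value in counts.items():
            if value == self._last[core]:
                # The core did not get CPU time to advance its counter.
                self._missed[core] += 1
            else:
                self._missed[core] = 0
                self._last[core] = value
            if self._missed[core] >= self.max_missed:
                abnormal.append(core)
        return abnormal
```

A non-empty result corresponds to leaving the cyclic monitoring state and starting the pre-restart work; while the result is empty, the monitor keeps feeding the dog.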
For the control cores and data cores, i.e., the cores other than the core CPU, the core CPU initiates an NMI interrupt request; once it is triggered, the abnormal information is recorded to a file, and the device is restarted after the recording is finished. For an abnormality of the core CPU itself, the virtual peripheral i6300esb initiates an NMI interrupt request, the abnormal information is recorded to a file, and the device is restarted.
In S110, the abnormal information is recorded and the device is restarted according to the monitoring interrupt request. The abnormal information is recorded before the restart. With the method of the application, no matter which core in the virtual system suffers an infinite loop or a deadlock, the device can record the current abnormal information and then restart automatically.
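The division of responsibility in S106 to S110 can be summarised as a small dispatch routine. This is a schematic model only: the names `nmi_source` and `handle_fault`, the log format, and the `restart` callback are hypothetical, and a real implementation would run in an NMI handler inside the guest kernel rather than in Python.

```python
def nmi_source(core_kind):
    """Return which component raises the NMI for a faulty core of this kind."""
    if core_kind == "core":
        # The core CPU cannot report on itself, so the virtual peripheral does.
        return "i6300esb"
    if core_kind in ("control", "data"):
        return "core CPU"
    raise ValueError("unknown core kind: %r" % (core_kind,))


def handle_fault(core_kind, core_id, log, restart):
    """Record the abnormal information first, then restart (S110 ordering)."""
    source = nmi_source(core_kind)
    log.append("NMI from %s: %s core %d abnormal" % (source, core_kind, core_id))
    restart()
    return source
```

The ordering matters: the record is written before `restart` is called, so the abnormal information survives the reboot.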
According to the downtime restarting method of the virtual machine, virtual peripheral equipment is added into the virtual machine; monitoring a multi-core CPU of the virtual machine based on a virtual peripheral and a plurality of monitoring functions, wherein the plurality of monitoring functions comprise a watchdog function, a dog feeding function and a dog clearing function, and the multi-core CPU comprises a core CPU, a control core CPU and a data core CPU; when the core CPU is abnormal, initiating a monitoring interrupt request through the virtual peripheral; when the control core CPU or the data core CPU is abnormal, initiating a monitoring interrupt request through the core CPU; according to the mode of recording the abnormal information and restarting the monitoring interrupt request, the abnormal information can be stored in time when the virtual machine encounters a downtime problem such as a core deadlock, a dead loop and the like, and the equipment is restarted, so that the safe operation of the system is ensured.
It should be clearly understood that the present application describes how to make and use specific examples, but the principles of the present application are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
FIG. 2 is a flow chart illustrating a method of downtime restarting a virtual machine, according to an example embodiment. The flow 20 shown in fig. 2 is a detailed description of S102 "add virtual peripheral in virtual machine" in the flow shown in fig. 1.
As shown in fig. 2, in S202, a kernel configuration file is modified in the process of creating a virtual image file of the virtual machine. For example, the i6300esb function may be turned on in the kernel when an NFV ISO image is made.
More specifically, the i6300esb function may be enabled in the kernel of the virtual machine. In one embodiment, this is done by setting CONFIG_I6300ESB_WDT=y in the kernel configuration file and replacing the watchdog enabling, dog feeding, and dog clearing functions currently used in the code with the corresponding functions in i6300esb.c.
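Before building the image, it can be useful to confirm that the option really is set in the kernel configuration file. The helper below is a minimal sketch for that check; the function name `config_value` is hypothetical, and the sample text in the usage note is made up for illustration.

```python
def config_value(config_text, option):
    """Return the value assigned to a kernel config option, or None if unset."""
    for line in config_text.splitlines():
        line = line.strip()
        # Unset options appear as comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key == option:
            return value
    return None
```

For instance, `config_value(text, "CONFIG_I6300ESB_WDT")` returns `"y"` when the driver is built into the kernel, as the embodiment requires.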
In S204, the virtual image file is installed in the virtual machine through KVM. Preferably, the QEMU version of the virtual machine is 2.12 or higher.
In S206, the virtual peripheral is added during installation. An i6300esb watchdog virtual hardware device can be added during installation with the inject-nmi action selected; once this configuration is complete, the virtual device is installed.
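When the virtual machine is launched directly with QEMU, the watchdog device and its action are given on the command line. The sketch below builds those arguments; it assumes that the NMI start function named in the text corresponds to QEMU's `inject-nmi` watchdog action, and that the `-device i6300esb` and `-watchdog-action` options are available in the QEMU version in use. The function name `watchdog_args` is hypothetical.

```python
# Watchdog actions documented for QEMU's -watchdog-action option.
KNOWN_ACTIONS = {"reset", "shutdown", "poweroff", "inject-nmi", "pause", "debug", "none"}


def watchdog_args(action="inject-nmi"):
    """Return QEMU command-line arguments that add an i6300esb watchdog."""
    if action not in KNOWN_ACTIONS:
        raise ValueError("unsupported watchdog action: %r" % (action,))
    return ["-device", "i6300esb", "-watchdog-action", action]
```

These arguments would be appended to the rest of the QEMU invocation; management layers such as libvirt expose the same choice through their own configuration rather than raw flags.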
Fig. 3 is a flowchart illustrating a method of restarting a virtual machine at downtime, according to another example embodiment. The process 30 shown in fig. 3 is a detailed description of S104 "monitor the multicore CPU of the virtual machine based on the virtual peripheral and the plurality of monitor functions" in the process shown in fig. 1.
As shown in FIG. 3, in S302, a plurality of monitoring functions are set. The watchdog enable function, dog feeding function and dog clearing function of the i6300esb virtual peripheral can be set; the watchdog enable function, dog feeding function and dog clearing function in the core CPU can also be set.
In S304, the virtual peripheral is enabled in the virtual machine to monitor the core CPU. The i6300esb virtual peripheral is enabled in the virtual machine, and the i6300esb virtual peripheral turns on the inject-nmi function to monitor the core CPU.
More specifically, the i6300esb virtual peripheral turning on the inject-nmi function to monitor the core CPU includes: the i6300esb virtual peripheral, with the inject-nmi function selected, acquires the monitored health state variable of the core CPU; and monitoring is performed based on the monitored health state variable of the core CPU.
In S306, the control core CPU and the data core CPU are monitored by the core CPU. During startup of the virtual machine, the monitored health state variables of the control core CPU and the data core CPU are initialized; the watchdog enable function, dog feeding function and dog clearing function are enabled and registered to monitor the control core CPU and the data core CPU.
FIG. 4 is a flowchart illustrating a method for restarting a virtual machine on downtime, according to another example embodiment. The flow 40 shown in FIG. 4 is a detailed description of S306, "enabling and registering the watchdog enable function, dog feeding function and dog clearing function to monitor the control core CPU and the data core CPU", in the flow shown in FIG. 3.
As shown in FIG. 4, in S402, a first preset kernel thread is started in the core CPU during startup of the virtual machine.
In S404, the monitored health state variables of the control cores and the data cores are initialized by the first preset kernel thread. During startup of the virtual machine, the first preset kernel thread (watchdog_init) initializes the monitored health state variables of the control cores and the data cores, enables the watchdog and registers the dog feeding function; this kernel thread runs on the CPU0 control core.
In S406, after the initialization is completed, the first preset kernel thread starts the second preset kernel thread. Once this series of watchdog initialization steps is complete, the first preset kernel thread starts the second preset kernel thread (watchdog_loop), which begins health state detection for the control cores other than CPU0 and for the data cores.
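The two-stage startup in S402 to S406 can be sketched in user-space Python as follows. This is only an analogy for the kernel threads described above: `WatchdogState`, the `feed` callback and the `loop_body` parameter are hypothetical names introduced for illustration, and ordinary Python threads stand in for kernel threads.

```python
import threading
from typing import Callable, Dict, Iterable


class WatchdogState:
    """Health-state variables shared by the two threads (illustrative)."""

    def __init__(self, control_cores: Iterable[int], data_cores: Iterable[int]):
        self.control_cores = list(control_cores)
        self.data_cores = list(data_cores)
        self.last_tick: Dict[int, int] = {}
        self.enabled = False
        self.feed: Callable[[], None] = lambda: None


def watchdog_init(state: WatchdogState,
                  feed: Callable[[], None],
                  loop_body: Callable[[WatchdogState], None]) -> threading.Thread:
    """First thread: initialise state, enable the watchdog, start the loop thread."""
    # CPU0 runs this thread itself, so it is not in the monitored set.
    for cpu in state.control_cores + state.data_cores:
        if cpu != 0:
            state.last_tick[cpu] = 0
    state.enabled = True   # stands in for enabling the watchdog
    state.feed = feed      # stands in for registering the feed function
    loop = threading.Thread(target=loop_body, args=(state,), name="watchdog_loop")
    loop.start()           # second thread starts only after initialisation
    return loop
```

Starting the loop thread from inside the initialisation routine guarantees that the health state variables exist before the first check runs.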
In S408, the monitored health state variables of the control core CPU and the data core CPU are checked by the second preset kernel thread.
In one embodiment, checking the monitored health state variables of the control core CPU and the data core CPU by the second preset kernel thread includes: generating a control core count value from a global variable preset in the control core CPU; generating a data core count value from the utilization of the data core CPU; acquiring, by the second preset kernel thread, the control core count value of the control core CPU and the data core count value of the data core CPU; and checking the monitored health state variables of the control core CPU and the data core CPU according to the control core count value and the data core count value.
In the second preset kernel thread, a custom calculation of a data core count value (tick value) is added for each data core CPU in order to calculate the CPU utilization.
In the present application, the data core count value can be used as the criterion for the health state of a data core CPU: at each check, the current tick value is compared with the previously acquired one, and if the two values are not equal, the CPU is running normally.
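The comparison rule just described (a tick value that changes between checks means the core is running) can be sketched as follows. How the tick itself is derived from CPU utilization is not specified here, so the sketch takes the sampled tick values as input; `TickMonitor` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Optional


class TickMonitor:
    """Remember the last tick seen per data core and flag cores that stall."""

    def __init__(self, cpus: Iterable[int]):
        self.last: Dict[int, Optional[int]] = {cpu: None for cpu in cpus}

    def stalled(self, current: Dict[int, int]) -> List[int]:
        """Return the cores whose tick did not change since the previous check."""
        bad = []
        for cpu, tick in current.items():
            prev = self.last[cpu]
            # An unchanged tick means the core made no progress in this interval.
            if prev is not None and tick == prev:
                bad.append(cpu)
            self.last[cpu] = tick
        return bad


if __name__ == "__main__":
    mon = TickMonitor([1, 2])
    mon.stalled({1: 10, 2: 5})          # first sample only records a baseline
    print(mon.stalled({1: 11, 2: 5}))   # core 2 did not advance
```

The first sample never reports a stall because there is no earlier value to compare against.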
In another embodiment, because the control core CPUs and the data core CPUs serve different functions, the tick calculation need not be performed on the control core CPUs. Instead, a kernel thread may be started on each control core other than CPU0 by starting a work queue with INIT_WORK, and this thread is woken up each time the health state of the CPU is checked. In this thread, a global variable may be set for each CPU as its control core count value and incremented on each wakeup; if the control core count values acquired on successive checks are not equal, the CPU is running normally.
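A simplified model of the control core scheme described above: each monitored control core has a worker that increments its own counter when woken, and the checker treats an unchanged counter as a stall. The classes and the `hung` flag are hypothetical stand-ins for a per-CPU kernel work item, and the wakeup is modelled as a synchronous call, which a real kernel work queue would not be.

```python
from typing import Dict, List


class ControlCoreWorker:
    """Stand-in for the per-CPU thread woken at every health check."""

    def __init__(self, cpu: int):
        self.cpu = cpu
        self.count = 0      # the per-CPU global count value
        self.hung = False   # set to True to model a deadlocked core

    def wake(self) -> None:
        # A hung core never gets to run the worker, so the count stays put.
        if not self.hung:
            self.count += 1


def check_control_cores(workers: List[ControlCoreWorker],
                        previous: Dict[int, int]) -> List[int]:
    """Wake every worker, then report cores whose count did not change."""
    stalled = []
    for worker in workers:
        worker.wake()
        if previous.get(worker.cpu) == worker.count:
            stalled.append(worker.cpu)
        previous[worker.cpu] = worker.count
    return stalled
```

CPU0 is deliberately left out of the worker list, since the text assigns its monitoring to the virtual peripheral rather than to this counter.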
In a specific application scenario, to trigger a data core exception, a global variable A is set in the packet receive/transmit flow to trigger a while infinite loop; to trigger a CPU0 control core exception, a global variable B is set in the watchdog_init function to call a thread that puts the core into an infinite loop. After the device starts normally, the global variables A and B are assigned values through custom commands to trigger the respective core exceptions. In the test, the device is provided with one control core and one data core.
In the experiment, after the data core CPU entered an infinite loop, the device was able to record the exception information and restart. When the core CPU was put into an infinite loop, the device was likewise able to record the exception information and restart.
Those skilled in the art will appreciate that all or part of the steps implementing the above-described embodiments are implemented as a computer program executed by a CPU. When the computer program is executed by the CPU, it performs the functions defined by the above-described method provided by the present application. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
Furthermore, it should be noted that the above-described figures are merely illustrative of the processes involved in the method according to the exemplary embodiment of the present application, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.
FIG. 5 is a block diagram illustrating a downtime restarting apparatus of a virtual machine, according to an example embodiment. As shown in FIG. 5, the downtime restarting apparatus 50 of the virtual machine includes: an installation module 502, a monitoring module 504, a peripheral module 506, a core module 508 and a restarting module 510.
The installation module 502 is configured to add a virtual peripheral to the virtual machine; the installation module 502 is further configured to modify a kernel configuration file while the virtual image file of the virtual machine is being built, and to add the virtual peripheral in the hardware equipment based on the modified kernel configuration file.
The monitoring module 504 is configured to monitor a multi-core CPU of the virtual machine based on the virtual peripheral and a plurality of monitoring functions, where the plurality of monitoring functions include a watchdog function, a dog feeding function and a dog clearing function, and the multi-core CPU includes a core CPU, a control core CPU and a data core CPU; the monitoring module 504 is further configured to enable the virtual peripheral in the virtual machine to monitor the core CPU, and to set the plurality of monitoring functions in the core CPU to monitor the control core CPU and the data core CPU.
The peripheral module 506 is configured to initiate a monitoring interrupt request through the virtual peripheral when an exception occurs in the core CPU.
The core module 508 is configured to initiate a monitoring interrupt request through the core CPU when the control core CPU or the data core CPU is abnormal.
The restarting module 510 is configured to record exception information according to the monitoring interrupt request and restart.
According to the downtime restarting apparatus of the virtual machine of the present application, a virtual peripheral is added to the virtual machine; a multi-core CPU of the virtual machine is monitored based on the virtual peripheral and a plurality of monitoring functions, where the plurality of monitoring functions include a watchdog function, a dog feeding function and a dog clearing function, and the multi-core CPU includes a core CPU, a control core CPU and a data core CPU; when the core CPU is abnormal, a monitoring interrupt request is initiated through the virtual peripheral; when the control core CPU or the data core CPU is abnormal, a monitoring interrupt request is initiated through the core CPU; and exception information is recorded according to the monitoring interrupt request and a restart is performed. In this way, when the virtual machine encounters a downtime problem such as a kernel deadlock or an infinite loop, the exception information can be saved in time and the device restarted, ensuring safe operation of the system.
Fig. 6 is a block diagram of an electronic device, according to an example embodiment.
An electronic device 600 according to this embodiment of the application is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in FIG. 6, the electronic device 600 is in the form of a general purpose computing device. Components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
The storage unit 620 stores program code that is executable by the processing unit 610, such that the processing unit 610 performs the steps according to the various exemplary embodiments of the present application described in this specification. For example, the processing unit 610 may perform the steps shown in FIGS. 1, 2, 3 and 4.
The storage unit 620 may include readable media in the form of volatile storage units, such as a random access memory (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 600' (e.g., keyboard, pointing device, Bluetooth device, etc.), with devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., routers, modems, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through a network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 over the bus 630. It should be appreciated that, although not shown, other hardware and/or software modules may be used in connection with the electronic device 600, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, as shown in fig. 7, the technical solution according to the embodiment of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, or a network device, etc.) to perform the above-described method according to the embodiment of the present application.
In general, the present disclosure mainly addresses the problem that, when a deadlock or an infinite loop occurs in kernel mode, user processes cannot obtain the CPU and cannot close the watchdog, so that subsequent processes may be unable to run. To this end, when the NFV ISO image is built, the i6300esb function is enabled in the kernel. The watchdog kernel thread running on CPU0 uses the watchdog enable, watchdog close and dog feeding functions of the i6300esb device and is responsible for monitoring the health state of the other cores, while the inject-nmi function of the KVM i6300esb watchdog virtual hardware device is used to monitor the health state of CPU0. Whichever core experiences a kernel infinite loop or deadlock, the device records the current exception information and then restarts automatically. Thus, when the device goes down because of a kernel infinite loop or a kernel deadlock, the exception is recorded and the device is restarted, so that service is restored in time.
Specifically, the i6300esb function is enabled in the kernel as follows: CONFIG_I6300ESB_WDT=y is set in the kernel configuration file, and the watchdog enable, dog feeding and dog clearing functions currently used in the code are replaced with the corresponding functions in i6300esb.c. To improve forwarding performance, all vCPUs in this scheme are divided into control cores and data cores: the control cores run user-mode processes and control-plane processing (such as delivering and processing table entries), while the data cores mainly run packet forwarding and DPI processing.
Installation is performed through KVM, with a QEMU version of 2.12 or higher. During installation, an i6300esb watchdog virtual hardware device is added with its action set to inject-nmi; once configuration is complete, the device can be installed. During device startup, the watchdog_init function, which runs as a kernel thread on the CPU0 control core, initializes the monitored health state variables of the control cores and the data cores, enables the watchdog and registers the dog feeding function. After this series of initialization steps is complete, watchdog_init starts the kernel thread watchdog_loop, which begins health state detection for the control cores other than CPU0 and for the data cores.
In watchdog_loop, a custom tick value calculation is added for each data core in order to calculate the CPU utilization. This tick value is used here as the criterion for the health state of a data core: if the tick values obtained on successive checks are not equal, the CPU is running normally. Because the control cores and the data cores serve different functions, no tick calculation is added on the control cores. Instead, a kernel thread is started on each control core other than CPU0 by starting a work queue with INIT_WORK, and the thread is woken up each time the health state of the CPU is checked. In this thread, a global variable is set for each CPU and incremented; this value is also called a tick, and if the tick values obtained on successive checks are not equal, the CPU is running normally.
While the device is running normally, the dog feeding function is executed periodically. When a CPU is found to be abnormal, for example when a kernel infinite loop or deadlock prevents the CPU from yielding time to the thread that calculates its tick value, the once-per-second state monitoring loop is exited and the pre-restart operations begin. For the control cores other than CPU0 and for the data cores, CPU0 initiates an NMI interrupt request, which triggers recording of the exception information to a file; after the recording is complete, the device is restarted. For the CPU0 control core, the NMI interrupt request is initiated by the i6300esb virtual peripheral, the exception information is recorded to a file, and the device is restarted.
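The division of responsibility described above (CPU0 raises the NMI for other cores; the virtual peripheral raises it when CPU0 itself stops feeding) can be illustrated with a small time-stepped simulation. Everything here is a toy model under stated assumptions: `VirtualWatchdog`, `first_nmi`, the one-second step and the timeout value are hypothetical, and detection of a stalled non-CPU0 core is treated as immediate rather than taking a full check interval.

```python
from typing import Optional, Tuple


class VirtualWatchdog:
    """Toy model of a countdown watchdog such as the i6300esb device."""

    def __init__(self, timeout: int):
        self.timeout = timeout
        self.remaining = timeout
        self.enabled = False

    def enable(self) -> None:
        self.enabled = True
        self.remaining = self.timeout

    def feed(self) -> None:
        self.remaining = self.timeout   # restart the countdown

    def clear(self) -> None:
        self.enabled = False            # stop the countdown

    def advance(self, seconds: int) -> bool:
        """Let time pass; return True if the watchdog has expired."""
        if not self.enabled:
            return False
        self.remaining -= seconds
        return self.remaining <= 0


def first_nmi(seconds: int, timeout: int,
              cpu0_hang_at: Optional[int] = None,
              other_hang: Optional[Tuple[int, int]] = None):
    """Return (time, initiator, failed_cpu) for the first NMI, or None.

    other_hang is an optional (cpu, time) pair for a core other than CPU0.
    """
    wd = VirtualWatchdog(timeout)
    wd.enable()
    for t in range(1, seconds + 1):
        cpu0_alive = cpu0_hang_at is None or t < cpu0_hang_at
        if cpu0_alive:
            wd.feed()  # periodic feeding while CPU0 still runs
            if other_hang is not None and t >= other_hang[1]:
                return (t, "cpu0", other_hang[0])   # CPU0 raises the NMI
        if wd.advance(1):
            return (t, "i6300esb", 0)               # device injects the NMI
    return None
```

In this model a healthy system never produces an NMI, a stalled data or control core is reported by CPU0, and a stalled CPU0 is caught only after the watchdog timeout elapses without feeding.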
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform the following functions: adding a virtual peripheral to the virtual machine; monitoring a multi-core CPU of the virtual machine based on the virtual peripheral and a plurality of monitoring functions, wherein the plurality of monitoring functions comprise a watchdog function, a dog feeding function and a dog clearing function, and the multi-core CPU comprises a core CPU, a control core CPU and a data core CPU; when the core CPU is abnormal, initiating a monitoring interrupt request through the virtual peripheral; when the control core CPU or the data core CPU is abnormal, initiating a monitoring interrupt request through the core CPU; and recording exception information according to the monitoring interrupt request and restarting.
Those skilled in the art will appreciate that the modules may be distributed across devices as described in the embodiments, or may, with corresponding changes, be located in one or more devices different from those of the embodiments. The modules of the above embodiments may be combined into one module, or may be further split into a plurality of sub-modules.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
The exemplary embodiments of the present application have been particularly shown and described above. It is to be understood that this application is not limited to the precise arrangements, instrumentalities or implementations described herein; on the contrary, the application is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. The downtime restarting method of the virtual machine is characterized by comprising the following steps of:
adding virtual peripheral equipment in the virtual machine;
monitoring a multi-core CPU of the virtual machine based on a virtual peripheral and a plurality of monitoring functions, wherein the plurality of monitoring functions comprise a watchdog function, a dog feeding function and a dog clearing function, and the multi-core CPU comprises a core CPU, a control core CPU and a data core CPU;
when the core CPU is abnormal, initiating a monitoring interrupt request through the virtual peripheral;
when the control core CPU or the data core CPU is abnormal, initiating a monitoring interrupt request through the core CPU;
recording abnormal information according to the monitoring interrupt request and restarting.
2. The method of claim 1, wherein adding virtual peripherals in the virtual machine comprises:
modifying a kernel configuration file in the process of manufacturing a virtual image file of the virtual machine;
and adding the virtual peripheral in the hardware equipment based on the modified kernel configuration file.
3. The method of claim 2, wherein adding the virtual peripheral in the hardware device based on the modified kernel configuration file comprises:
installing the virtual image file in the virtual machine through a KVM virtual machine;
and adding the virtual peripheral in the installation process.
4. The method of claim 1, wherein monitoring the multicore CPU of the virtual machine based on the virtual peripheral and a plurality of monitoring functions comprises:
enabling the virtual peripheral in a virtual machine to monitor a core CPU;
the plurality of monitoring functions are arranged in the core CPU to monitor the control core CPU and the data core CPU.
5. The method of claim 4, wherein enabling the virtual peripheral in a virtual machine to monitor a core CPU comprises:
enabling the i6300esb virtual peripheral in the virtual machine;
the i6300esb virtual peripheral turns on the inject-nmi function to monitor the core CPU.
6. The method of claim 5, wherein the i6300esb virtual peripheral turning on the inject-nmi function to monitor the core CPU comprises:
the i6300esb virtual peripheral, with the inject-nmi function selected, acquires a monitored health state variable of the core CPU;
monitoring is performed based on the monitored health state variable of the core CPU.
7. The method of claim 4, wherein setting the plurality of monitoring functions in the core CPU to monitor the control core CPU, the data core CPU, comprises:
initializing monitoring health state variables of a control core CPU and a data core CPU in the starting process of the virtual machine;
the watchdog function, the watchdog feeding function and the watchdog clearing function are registered to monitor the control core CPU and the data core CPU.
8. The method of claim 7, wherein registering enabling the watchdog function, the watchdog feed function, and the watchdog clear function to monitor the control core CPU, the data core CPU comprises:
starting a first preset kernel thread in a core CPU in the starting process of the virtual machine;
initializing monitoring health state variables of a control core and a data core based on a first preset kernel thread;
after the initialization is completed, the first preset kernel thread pulls up the second preset kernel thread;
and detecting the monitoring health state variable processes of the control core CPU and the data core CPU through a second preset kernel thread.
9. The method of claim 8, wherein monitoring health state variable process detection of the control core CPU and the data core CPU by the second preset kernel thread comprises:
generating a control core count value through a global variable preset in a control core CPU;
generating a data core count value through the utilization rate of the data core CPU;
the second preset kernel thread respectively acquires control count values and data count values of a control core CPU and a data core CPU;
and detecting the monitoring health state variable processes of the control core CPU and the data core CPU according to the control count value and the data count value.
10. A downtime restarting device of a virtual machine, characterized by comprising:
the installation module is used for adding virtual peripherals into the virtual machine;
the monitoring module is used for monitoring a multi-core CPU of the virtual machine based on the virtual peripheral and a plurality of monitoring functions, wherein the plurality of monitoring functions comprise a watchdog function, a dog feeding function and a dog clearing function, and the multi-core CPU comprises a core CPU, a control core CPU and a data core CPU;
the peripheral module is used for initiating a monitoring interrupt request through the virtual peripheral when the core CPU is abnormal;
the core module is used for initiating a monitoring interrupt request through the core CPU when the control core CPU or the data core CPU is abnormal;
and the restarting module is used for recording abnormal information according to the monitoring interrupt request and restarting.
CN202310631940.XA 2023-05-30 2023-05-30 Method and device for restarting virtual machine in downtime Pending CN116627702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310631940.XA CN116627702A (en) 2023-05-30 2023-05-30 Method and device for restarting virtual machine in downtime

Publications (1)

Publication Number Publication Date
CN116627702A true CN116627702A (en) 2023-08-22

Family

ID=87613121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310631940.XA Pending CN116627702A (en) 2023-05-30 2023-05-30 Method and device for restarting virtual machine in downtime

Country Status (1)

Country Link
CN (1) CN116627702A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination