TWI746512B

TWI746512B - Physical machine fault classification processing method and device, and virtual machine recovery method and system

Info

Publication number: TWI746512B
Application number: TW106104781A
Authority: TW
Inventors: 張文
Original assignee: 香港商阿里巴巴集團服務有限公司
Priority date: 2016-03-10
Filing date: 2017-02-14
Publication date: 2021-11-21
Also published as: WO2017152763A1; CN107179957B; TW201738747A; CN107179957A

Abstract

Embodiments of the present invention provide a fault classification processing method and device for physical machines in a cluster, and a virtual machine recovery method and system. The physical machine fault classification processing method includes: obtaining a physical machine fault information list from a physical machine fault information storage center; if a physical machine fault caused by a network attack is detected in the list, triggering a security attack protection center outside the cluster to handle it; if a software or hardware fault that the physical machine cannot repair by itself is detected in the list, sending, to the faulty physical machine, an instruction to shut that machine down; and migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through a virtualization interface. By identifying fine-grained faults quickly and accurately across multiple physical machine fault scenarios and handling each class in a targeted way, the embodiments achieve fast and highly reliable physical machine damage repair, so that the virtual machine services running on those machines recover quickly.

Description

Physical machine fault classification processing method and device, and virtual machine recovery method and system

The present invention relates to the field of communication technology, and in particular to a physical machine fault classification processing method and device applied to a virtualized cluster system, and to a virtual machine recovery method and system.

With the rapid development of computer technology, increasing attention has been paid to reducing energy consumption and improving resource utilization, and the cloud computing model emerged in response. Cloud computing abstracts all computers into specific computing resources and provides those resources to users, instead of directly providing one or more computers in the traditional way. The greatest benefit of the cloud computing model is that users can request resources according to their own needs, which avoids unnecessary waste and improves resource utilization.

In a cloud computing environment, virtualized clustering is one of the key technologies. A virtualized cluster combines multiple virtualized servers into an organic whole, which yields high computing speed and raises the overall computing capacity of the virtualized system. The cluster manages the servers in a unified manner, uses virtualization to abstract physical resources into a large pool of storage, computing, network, and other resources, and provides virtual machines to users by allocating resources on request.

As virtualized clusters grow in scale, the probability of physical machine failures caused by software and hardware problems within the cluster also grows. A physical machine failure directly affects the virtual machine services running on it. To keep virtual machine services running normally, a failure of the hosting physical machine must be discovered in time and handled promptly so that the services are restored; otherwise, virtual machine users are affected by the failure and business continuity cannot be guaranteed. The prior art can monitor physical machine status periodically and, when a physical machine fails, either shut down and then restart the virtual machines on it, or shut down the faulty physical machine and migrate its virtual machines to other physical machines in the cluster.

However, physical machine failures usually have different causes and show many different symptoms. The prior art neither classifies physical machine failures finely nor handles each class in a targeted way, so in real commercial use it produces many false positives and missed detections and cannot achieve high availability (HA) of the virtual machines after a physical machine fails.

Therefore, how to classify and repair physical machine faults more accurately, efficiently, and in a targeted manner has become a technical problem that those skilled in the art urgently need to solve.

In view of the above problems, embodiments of the present invention are proposed to provide a physical machine fault classification processing method and device, and a virtual machine recovery method and system, applied to a virtualized cluster system, that overcome the above problems or at least partially solve them.

The present invention discloses a fault classification processing method for physical machines in a cluster, including: obtaining a physical machine fault information list from a physical machine fault information storage center; if a physical machine fault caused by a network attack is detected in the physical machine fault information list, triggering a security attack protection center outside the cluster to handle it; if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, sending, to the faulty physical machine, an instruction to shut that machine down; and migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through a virtualization interface.

The present invention also discloses a fault classification processing device for physical machines in a cluster, including: an acquisition module, configured to obtain a physical machine fault information list from a physical machine fault information storage center; a first processing module, configured to trigger a security attack protection center outside the cluster to handle the fault if a physical machine fault caused by a network attack is detected in the list; and a second processing module, which further includes a shutdown processing unit, configured to send, to the faulty physical machine, an instruction to shut that machine down if a software or hardware fault that the physical machine cannot repair by itself is detected in the list, and a migration processing unit, configured to migrate the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through a virtualization interface.

The present invention also discloses a virtual machine recovery method applied to a virtualized cluster system, the method including: a physical machine in the virtualized cluster system autonomously detecting its own fault status; if the physical machine autonomously detects a software or hardware fault that it can repair through fault tolerance, repairing the fault through fault tolerance; if the physical machine autonomously detects a software or hardware fault that it can repair by restarting, repairing the fault by restarting the physical machine; obtaining a physical machine fault information list from a physical machine fault information storage center; if a physical machine fault caused by a network attack is detected in the list, triggering a security attack protection center outside the cluster to handle it; if a software or hardware fault that the physical machine cannot repair by itself is detected in the list, sending, to the faulty physical machine, an instruction to shut that machine down; and migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through a virtualization interface.

Correspondingly, the present invention discloses a virtual machine recovery system, including: a physical machine damage repair device, applied on a physical machine in the virtualized cluster system to autonomously detect that machine's own fault status, to repair through fault tolerance any software or hardware fault that the machine can repair that way, and to repair by restarting the physical machine any software or hardware fault that a restart can repair; a physical machine fault information storage center, configured to aggregate all reported physical machine fault information into a physical machine fault information list; and a physical machine fault classification processing device, configured to obtain the list from the storage center, to trigger a security attack protection center outside the cluster if a physical machine fault caused by a network attack is detected in the list, and, if a software or hardware fault that the physical machine cannot repair by itself is detected in the list, to send the faulty physical machine an instruction to shut that machine down and to migrate the virtual machines on it to other healthy physical machines in the cluster system through a virtualization interface.

According to the specific embodiments provided herein, the present invention discloses the following technical effects. In a large-scale cloud computing cluster, the embodiments identify fine-grained faults quickly and accurately across multiple physical machine fault scenarios and handle each class in a targeted way, thereby achieving fast and highly reliable physical machine damage repair and ensuring rapid recovery of the virtual machine services running on those machines.

Further, in the embodiments, a physical machine autonomously detects its own fault status, and faults that the machine can repair by itself are classified and repaired in a targeted way; faults that the machine cannot repair by itself are classified and repaired in a targeted way by a physical machine fault classification processing module outside the cluster. This effectively reduces false positives and missed detections of physical machine faults and makes automatic virtual machine recovery safer, more stable, and faster.

In addition, for physical machine faults that the machine cannot repair by itself, the embodiments can shut down the faulty physical machine not only through the out-of-band management module on that machine but also through the physical machine fault classification processing module outside the cluster, which instructs the faulty machine to shut itself down. This compensates for the fact that the availability of shutdown operations invoked through the out-of-band management module does not reach commercial standards, and it also ensures that automated isolation of the physical machine is effective.

Moreover, the embodiments also consider the possibility of large-scale physical machine failures within a large cloud computing cluster: they judge whether the number of faulty physical machines amounts to a machine-room-level failure and apply different repair approaches accordingly. In particular, large-scale physical machine failures are repaired manually, which avoids the degradation of system performance that frequent migration of virtual machines from the faulty machines would cause.
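The scale check described above can be pictured with a minimal Python sketch. The source does not say how many faulty machines count as a machine-room-level failure, so the threshold value and the function name `choose_repair_mode` are assumptions made purely for illustration.

```python
# Hypothetical threshold: the source does not state how many faulty machines
# amount to a machine-room-level failure.
ROOM_LEVEL_THRESHOLD = 20


def choose_repair_mode(faulty_hosts, threshold=ROOM_LEVEL_THRESHOLD):
    """Return "manual" for a large-scale failure, otherwise "automatic"."""
    # Count each machine once, even if it was reported several times.
    distinct_faulty = set(faulty_hosts)
    if len(distinct_faulty) >= threshold:
        return "manual"
    return "automatic"
```

Routing the large-scale case to manual handling keeps the automatic path from starting a wave of migrations when many machines fail at once.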

Of course, a product implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.

110-130‧‧‧Steps

210-260‧‧‧Steps

310-390‧‧‧Steps

500‧‧‧Physical machine damage repair device

510‧‧‧Autonomous detection module

511‧‧‧Detection unit

512‧‧‧Reporting unit

520‧‧‧Autonomous processing module

521‧‧‧Fault tolerance unit

522‧‧‧Restart unit

523‧‧‧Shutdown unit

530‧‧‧Out-of-band management module

600‧‧‧Physical machine fault classification processing device

610‧‧‧Acquisition module

620‧‧‧First processing module

630‧‧‧Second processing module

640‧‧‧Third processing module

650‧‧‧Fourth processing module

660‧‧‧Fifth processing module

670‧‧‧Determination module

700‧‧‧Virtualized cluster system

710‧‧‧Physical machine damage repair device

711‧‧‧Autonomous detection module

712‧‧‧Autonomous processing module

713‧‧‧Out-of-band management module

720‧‧‧Physical machine fault classification processing device

721‧‧‧Acquisition module

722‧‧‧First processing module

723‧‧‧Second processing module

724‧‧‧Third processing module

725‧‧‧Fourth processing module

726‧‧‧Fifth processing module

727‧‧‧Determination module

730‧‧‧Physical machine fault information storage center

740‧‧‧Security attack protection center

750‧‧‧Physical machine fault detection module

760‧‧‧Physical machine fault collection module

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of the steps of an embodiment of a cluster physical machine fault classification processing method of the present invention; FIG. 2 is a flowchart of the steps of another embodiment of a cluster physical machine fault classification processing method of the present invention; FIG. 3 is a flowchart of the steps of an embodiment of a virtual machine recovery method of the present invention; FIG. 4 is a flowchart of the steps of another embodiment of a virtual machine recovery method of the present invention; FIG. 5 is a structural block diagram of an embodiment of a physical machine damage repair device of the present invention; FIG. 6 is a structural block diagram of an embodiment of a cluster physical machine fault classification processing device of the present invention; FIG. 7 is a structural block diagram of an embodiment of a virtual machine recovery system of the present invention.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.

To make the above objects, features, and advantages of the present invention easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

To make the embodiments easier to understand, several elements involved in their description are introduced first:

A. Cloud computing

Cloud computing is a model for adding, using, and delivering related services based on Internet technology; it is a server cluster that practices distributed computing across all the servers in use. In other words, cloud computing provides a virtualized, elastic resource platform that dynamically supplies hardware, software, and data sets on demand.

B. Virtual cluster

Performing cluster management on a cloud computing platform constitutes a virtual cluster. A virtual cluster uses virtualization technology to virtualize multiple computing nodes, building a large-scale cluster system similar to a physical cluster. In other words, a virtual cluster is a system that connects multiple homogeneous or heterogeneous computers that cooperate to complete specific tasks.

C. Physical machine

The computers in a virtual cluster system that cooperate to complete specific tasks are the cluster's physical computers, called cluster physical machines for short. One or more virtual computers can be simulated on a single physical machine.

D. Virtual machine

Virtual machine software can simulate one or more virtual computers on a single physical machine. These virtual machines work like real computers: operating systems and applications can be installed on them, and they can access network resources. To an application running in a virtual machine, the virtual machine appears to be a real computer.

The embodiments of the present invention can be applied in a large-scale cloud computing virtualized cluster system. Physical machines in the cluster system autonomously detect their own fault status, so that faults a physical machine can repair by itself are classified and repaired in a targeted way; faults a physical machine cannot repair by itself are classified and repaired in a targeted way by a physical machine fault classification processing module outside the cluster. This effectively reduces false positives and missed detections of physical machine faults and makes automatic virtual machine recovery safer, more stable, and faster.

The physical machine failure symptoms that affect the running and management of virtual machines can be summarized as follows:

1. The physical machine's network is unreachable

The main causes include: physical machine crash, network card abnormality, uplink switch failure, hardware abnormality, kernel module abnormality, physical machine restart, and network distributed denial-of-service (DDoS) attacks.

2. Physical machine packet loss

The main causes include: high physical machine load, switchover of uplink network devices, and network DDoS attacks.

3. Physical machine hardware system failure

For example, failures of the physical machine's disk, memory, or central processing unit (CPU).

4. The physical machine software is abnormal

For example, operating-system-level software abnormalities in the physical machine's file system, virtualization-related modules, or operating system kernel modules.

5. The remote access channel of the physical machine is blocked

The main causes include: network packet loss, system service abnormality, and file system abnormality.

6. Abnormal physical machine performance

For example, this may show up as stalled input/output (I/O) or high load on the physical machine. The main causes include: physical machine hardware failure, physical machine kernel module abnormality, and abnormal user-mode processes on the physical machine.

As can be seen, these failure symptoms are not fixed: within a certain period they can turn into one another, and they can even be correlated and intertwined. Moreover, the same symptom may have different underlying causes, so the repair approach for a faulty physical machine must be chosen case by case. For example, a physical machine whose network is unreachable because of a network DDoS attack must be treated differently from one whose network is unreachable because the machine has crashed. If the virtual machines on a physical machine are migrated to other physical machines while that machine is under a network DDoS attack, a domino effect follows and the fault spreads: the other physical machines are attacked and become unavailable one after another, which may ultimately flood the network devices of the whole cluster and put every physical machine in the cluster at risk of failure.

Based on the above analysis of failure symptoms and the underlying causes of the abnormalities, the embodiments classify physical machine failures into the following categories:

A. Software and hardware faults that the physical machine can repair by itself through fault tolerance

For example, failure of a disk that stores data, abnormality of a virtualization-related kernel module, or abnormality of a file system that stores data.

B. Software and hardware faults that the physical machine can repair by itself through a restart

For example, abnormalities such as a read-only root file system, network card driver abnormalities that a restart can repair, and operating system kernel module abnormalities.

C. Software and hardware faults that the physical machine cannot repair by itself

For example, physical machine crash, CPU abnormality, memory abnormality, power module abnormality, and other hardware problems.

In addition, faults of unknown cause are also included, for example in system load, system networking, or hardware. Although the root cause of such faults is hard to find, their symptoms are clear, mainly: network packet loss on the physical machine, abnormal access to the physical machine's management channel, and abnormal physical machine performance.

D. Physical machine faults caused by network attacks on the physical machine

For example, a network DDoS-type security attack that causes heavy packet loss or even makes the network unreachable. The symptoms of this fault type mainly include: unreachable physical machine network, network packet loss, and an unreachable management channel.
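As a rough illustration of the four categories, the following Python sketch encodes them as an enumeration and maps a few of the example causes named in the text to their category. The identifiers (`FaultCategory`, `EXAMPLE_CAUSES`, `who_handles`) and the string labels are hypothetical; the source defines only the categories themselves.

```python
from enum import Enum


class FaultCategory(Enum):
    SELF_REPAIR_BY_FAULT_TOLERANCE = "A"
    SELF_REPAIR_BY_RESTART = "B"
    NOT_SELF_REPAIRABLE = "C"
    NETWORK_ATTACK = "D"


# Example causes taken from the text; the string keys are illustrative labels.
EXAMPLE_CAUSES = {
    "data_disk_failure": FaultCategory.SELF_REPAIR_BY_FAULT_TOLERANCE,
    "root_fs_read_only": FaultCategory.SELF_REPAIR_BY_RESTART,
    "machine_crash": FaultCategory.NOT_SELF_REPAIRABLE,
    "cpu_abnormal": FaultCategory.NOT_SELF_REPAIRABLE,
    "ddos_attack": FaultCategory.NETWORK_ATTACK,
}


def who_handles(category):
    """Categories A and B are repaired on the machine; C and D are handled from outside."""
    local = {
        FaultCategory.SELF_REPAIR_BY_FAULT_TOLERANCE,
        FaultCategory.SELF_REPAIR_BY_RESTART,
    }
    return "physical machine itself" if category in local else "outside the cluster"
```

Keying the decision on the category rather than on the symptom reflects the point made earlier: the same symptom, such as an unreachable network, can belong to category C or to category D.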

Therefore, by identifying fine-grained faults quickly and accurately across multiple physical machine fault scenarios and handling each class in a targeted way, the embodiments achieve fast and highly reliable physical machine damage repair and ensure rapid recovery of the virtual machine services running on those machines. For example, the embodiments can finish recovering the virtual machines on a faulty physical machine in a little over ten minutes, with virtual machine functionality meeting a commercial availability standard of more than 99.95%.

Embodiment 1

Referring to FIG. 1, a flowchart of the steps of an embodiment of a cluster physical machine fault classification processing method of the present invention is shown. The method can be applied to a virtualized cluster system and may specifically include the following steps. Step 110: obtain a physical machine fault information list from the physical machine fault information storage center. It should be noted that the list includes physical machine fault information that a physical machine fault detection module outside the cluster detects at the faulty physical machine and reports to the storage center, and physical machine fault information that a physical machine fault collection module outside the cluster collects from the faulty physical machine and reports to the storage center.
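To make the data flow of step 110 concrete, here is a minimal in-memory sketch in Python. The source does not specify the record layout or any storage interface, so `FaultRecord`, `FaultInfoStore`, and their fields are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    host: str      # identifier of the faulty physical machine
    symptom: str   # illustrative label, e.g. "network_unreachable"
    source: str    # "detection" or "collection" module


class FaultInfoStore:
    """In-memory stand-in for the fault information storage center."""

    def __init__(self):
        self._records = []

    def report(self, record):
        # Both the detection module and the collection module report here.
        self._records.append(record)

    def fault_list(self):
        # Return a copy so callers cannot mutate the stored records.
        return list(self._records)
```

A classification device would call `fault_list()` and then inspect each record, regardless of which module reported it.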

Step 120: if a physical machine fault caused by a network attack is detected in the physical machine fault information list, trigger the security attack protection center outside the cluster to handle it. In practice, once triggered, the security attack protection center starts a security scrubbing procedure, such as traffic scrubbing, so that the faulty physical machine returns to health. It should be noted that a physical machine whose network is unreachable because of a network DDoS attack must be treated differently from one whose network is unreachable because the machine has crashed. If the virtual machines on a physical machine are migrated to other physical machines while that machine is under a network DDoS attack, a domino effect follows and the fault spreads: the other physical machines are attacked and become unavailable one after another, which may ultimately flood the network devices of the whole cluster and put every physical machine in the cluster at risk of failure.
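The ordering constraint in step 120 can be sketched as follows. This is an illustrative Python fragment, not the patented implementation: how an attack is recognized is not described at this point in the source, so `under_attack` is simply an input, and the two callables are hypothetical hooks standing in for the security attack protection center and the migration path of step 130.

```python
def handle_unreachable_host(host, under_attack, trigger_protection, migrate_vms):
    """Handle a host whose network is unreachable.

    The attack check comes first, so virtual machines are never moved
    off a host that is currently being attacked.
    """
    if under_attack:
        # Hand the host to the protection center (e.g. traffic scrubbing).
        trigger_protection(host)
        return "protection_triggered"
    # Otherwise follow the shutdown-and-migrate path for the host.
    migrate_vms(host)
    return "migrated"
```

Returning early in the attack branch is the design point: it prevents the domino effect described above, where migrated virtual machines draw the attack onto healthy hosts.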

Step 130: if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, an instruction to shut down the faulty physical machine is sent to the faulty physical machine, and the virtual machines on the faulty physical machine are migrated through a virtualization interface to other healthy physical machines in the cluster system. Preferably, if such a fault is detected in the physical machine fault information list, the instruction sent to the faulty physical machine directs it to shut itself down autonomously, or the faulty physical machine is shut down through the out-of-band management module on that physical machine.

It should be noted that the types of software and hardware faults that the physical machine cannot repair by itself may include: a physical machine crash, a physical machine CPU abnormality, a physical machine memory abnormality, and abnormalities of other hardware such as the physical machine power module. Faults of this type directly make the physical machine unavailable and can only be repaired by replacing a hardware module. Therefore, in the embodiment of the present invention, the faulty physical machine is first isolated from the cluster, and hardware replacement or maintenance is then performed on it.
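
The branching in steps 110 to 130 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation claimed in this document: the `FaultKind` labels, the `FaultRecord` type, the host names, and the action strings are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class FaultKind(Enum):
    """Hypothetical labels for the two fault classes of steps 120 and 130."""
    NETWORK_ATTACK = auto()  # for example a DDoS attack (step 120)
    UNREPAIRABLE = auto()    # crash, CPU, memory, power module (step 130)


@dataclass(frozen=True)
class FaultRecord:
    host: str
    kind: FaultKind


def plan_actions(fault_list):
    """Map each fault record to the ordered actions described in the text."""
    actions = []
    for record in fault_list:
        if record.kind is FaultKind.NETWORK_ATTACK:
            # Do not migrate VMs off a host under attack: per the text,
            # migration would spread the fault to other hosts.
            actions.append((record.host, "trigger_security_center"))
        elif record.kind is FaultKind.UNREPAIRABLE:
            # Isolate the host first, then move its VMs to healthy hosts.
            actions.append((record.host, "shutdown_host"))
            actions.append((record.host, "migrate_vms"))
    return actions
```

The function only produces a plan; in a real system each action string would be replaced by a call into the security attack protection center, the shutdown path, or the virtualization interface.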

In addition, for software and hardware faults that the physical machine cannot repair by itself, the out-of-band management and control system on a traditional physical machine usually has an availability of about 90% or even lower, owing to hardware failure rates and cost. Under the commercial availability requirement of at least 99.95% for the cloud computing service itself, the total permitted unavailability over a whole year is 262.8 minutes. If a faulty physical machine cannot be repaired in time, a single physical machine fault directly costs tens of minutes of manual handling. The availability of the out-of-band management and control system in the prior art therefore cannot match the fault recovery service-level agreement (SLA) of a commercial cloud computing service. The technical solution provided by the embodiment of the present invention improves on the traditional out-of-band management and control system: when the availability of the out-of-band management module falls short of the commercial standard, the physical machine fault classification processing module outside the cluster can issue an instruction that directs the faulty physical machine to shut itself down, and the same module then migrates the virtual machines on the faulty physical machine through the virtualization interface to other healthy physical machines in the cluster system. This greatly shortens the repair time of the faulty physical machine and thereby improves the commercial availability of the system.
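
The 262.8-minute figure follows directly from the 99.95% requirement. The short calculation below reproduces it, assuming a 365-day year; the function name `downtime_budget_minutes` is introduced here only for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(availability_percent):
    """Minutes of permitted unavailability per year at the given availability."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


print(round(downtime_budget_minutes(99.95), 1))  # 262.8
```

With a budget this small, a single incident that takes tens of minutes of manual handling consumes a large share of the yearly allowance, which is the motivation given above for automating shutdown and migration.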

In a large-scale cloud computing cluster, the embodiment of the present invention identifies faults quickly, accurately, and at fine granularity across multiple physical machine fault scenarios and handles each class in a targeted manner, thereby achieving fast and highly reliable repair of damaged physical machines and ensuring the rapid recovery of the virtual machine services running on them.

Example two

Referring to FIG. 2, a flowchart of the steps of another embodiment of a cluster physical machine fault classification processing method of the present invention is shown, which may specifically include the following steps: Step 210: obtain a physical machine fault information list from a physical machine fault information storage center. It should be noted that the physical machine fault information list includes: physical machine fault information detected at a faulty physical machine by a physical machine fault detection module outside the cluster and reported to the physical machine fault information storage center, and physical machine fault information collected from a faulty physical machine by a physical machine fault collection module outside the cluster and reported to the physical machine fault information storage center.

Step 220: if a physical machine fault caused by a network attack is detected in the physical machine fault information list, a security attack protection center outside the cluster is triggered to handle it. It can be understood that, in practical applications, once the security attack protection center outside the cluster is triggered, it starts a security scrubbing procedure, such as traffic scrubbing, so that the faulty physical machine returns to health. It should be noted that a physical machine whose network is unreachable because of a network DDoS attack must be treated differently from one whose network is unreachable because the machine has crashed. If the virtual machines on a physical machine are migrated to other physical machines while that machine is under a DDoS attack, a domino effect results and the fault spreads: the other physical machines are attacked in turn and become unavailable, which may eventually cause flooding of the network equipment of the entire cluster and put every physical machine in the cluster at risk of failure.

Step 230: if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, an instruction to shut down the faulty physical machine is sent to the faulty physical machine, and the virtual machines on the faulty physical machine are migrated through a virtualization interface to other healthy physical machines in the cluster system. Preferably, if such a fault is detected in the physical machine fault information list, the instruction sent to the faulty physical machine directs it to shut itself down autonomously, or the faulty physical machine is shut down through the out-of-band management module on that physical machine.

Step 240: if it is detected in the physical machine fault information list that the network of a physical machine is completely unreachable and the outage has lasted for a preset time, it is determined whether the number of physical machines with an unreachable network exceeds a preset number. If so, operation and maintenance personnel are notified to repair the fault manually; otherwise, the virtual machines on the faulty physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster system. The preset time can be set according to the actual situation to a suitable period such as 3 minutes or 5 minutes.

It should be noted that, when the network of a physical machine is detected to be completely unreachable and the outage has lasted for the preset time, the embodiment of the present invention further checks whether the number of faulty physical machines with an unreachable network exceeds the number of physical machines in one rack or the number of physical machines connected downstream of one switch. If it does, the situation is regarded as a cluster-scale network failure, and operation and maintenance personnel are alerted by telephone to repair it manually; it is no longer handled automatically. The reason is that, for a large-scale physical machine failure, isolating the physical machines and migrating their virtual machines would shut down a large number of physical machines. Once the machine-room equipment (network equipment, power equipment, and so on) recovers, those physical machines must be restarted and the virtual machines restored, and this series of operations would directly double the manual handling time or more, greatly lengthening the period during which the virtual machines are unavailable. The method provided by the embodiment of the present invention therefore handles this type of physical machine failure separately, which greatly shortens the repair time of the faulty physical machines, greatly shortens the time during which the virtual machines on them are unavailable, and improves the commercial availability of the system.
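
The decision in step 240 can be sketched as a small function. This is an illustration only: the function name, the returned action strings, and the default of 40 hosts for the preset number (standing in for the size of one rack or the number of hosts under one switch) are assumptions made for the example, while the 3-minute default comes from the text.

```python
def handle_network_outage(down_hosts, outage_seconds,
                          preset_seconds=180, preset_count=40):
    """Decide how to treat hosts whose network is completely unreachable.

    preset_count is a hypothetical stand-in for the number of hosts in one
    rack or under one switch; 40 is an arbitrary illustrative value.
    """
    if outage_seconds < preset_seconds:
        return "wait"             # outage has not yet lasted the preset time
    if len(down_hosts) > preset_count:
        return "alert_operators"  # cluster-scale failure: manual repair only
    return "migrate_vms"          # isolated failure: migrate automatically
```

Keeping the scale check ahead of migration reflects the reasoning above: automatic isolation of a whole rack would shut down many machines that become usable again as soon as the shared network or power equipment recovers.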

Preferably, the method according to the embodiment of the present invention may further include: Step 250: if it is detected in the physical machine fault information list that the network of a physical machine was unreachable but recovered before the outage reached the preset time, and it is determined that the outage was caused by a restart of the physical machine, it is determined whether the physical machine is currently healthy. If it is healthy, the virtual machines on the physical machine are restarted through the virtualization interface; if it is unhealthy, the virtual machines on the faulty physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster.

Step 260: if it is detected in the physical machine fault information list that the network of a physical machine is unstable and the instability has lasted for the preset time, an instruction is sent to the faulty physical machine directing it to shut itself down autonomously, or the faulty physical machine is shut down through the out-of-band management module on that physical machine; and the virtual machines on the faulty physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster system.

It should be noted that cases in which the network of a physical machine is unstable and the instability lasts for the preset time are mainly physical machine faults with unknown causes, for example faults related to system load, the system network, or hardware. Although the root cause of such faults is hard to determine, their symptoms are clear, mainly: network packet loss on the physical machine, abnormal access to the physical machine management channel, and abnormal performance of the physical machine. All such faults can be handled in the same way: an instruction is sent to the faulty physical machine directing it to shut itself down autonomously, or the faulty physical machine is shut down through the out-of-band management module on that physical machine, and the virtual machines on the faulty physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster system.
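
Steps 250 and 260 can be sketched in the same style. The function names and action strings below are hypothetical; where the text does not prescribe an action (a brief outage that was not caused by a restart, or instability shorter than the preset time), the sketch returns `None` rather than guessing.

```python
from typing import Optional


def handle_brief_outage(caused_by_reboot: bool, host_healthy: bool) -> Optional[str]:
    """Step 250: the network recovered before the preset time elapsed."""
    if not caused_by_reboot:
        return None  # the text does not prescribe an action for this case
    return "restart_vms" if host_healthy else "migrate_vms"


def handle_unstable_network(unstable_seconds: float,
                            preset_seconds: float = 180) -> Optional[str]:
    """Step 260: sustained instability is handled by shutdown plus migration."""
    if unstable_seconds < preset_seconds:
        return None  # instability has not yet lasted the preset time
    return "shutdown_host_then_migrate_vms"
```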

Preferably, in one embodiment of the present invention, the healthy physical machines are determined as follows: all physical machines in the cluster are matched against the physical machine fault information list, and the physical machines for which no match is found are determined to be healthy physical machines.
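
Determining the healthy machines amounts to excluding every host that appears in the fault list. A minimal sketch, assuming hosts are identified by plain strings (the identifiers and the function name are illustrative):

```python
def healthy_machines(cluster_hosts, fault_list_hosts):
    """Return the cluster hosts that do not appear in the fault list, in order."""
    faulty = set(fault_list_hosts)  # set lookup keeps the matching step cheap
    return [host for host in cluster_hosts if host not in faulty]


print(healthy_machines(["pm-1", "pm-2", "pm-3"], ["pm-2"]))  # ['pm-1', 'pm-3']
```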

In addition, for physical machine faults that the physical machine cannot repair by itself, the embodiment of the present invention can shut down the faulty physical machine not only through the out-of-band management module on that machine but also by having the physical machine fault classification processing module outside the cluster instruct the faulty physical machine to shut itself down. This compensates for the fact that the availability of shutdown operations invoked through the out-of-band management module cannot reach the commercial standard, and it also ensures the effectiveness of automated physical machine isolation.

Example three

Referring to FIG. 3, a schematic diagram of an embodiment of a virtual machine recovery method of the present invention is shown, which may specifically include the following steps: Step 310: the physical machines in the virtualized cluster system autonomously detect their own fault status. Preferably, each physical machine can check its own fault status periodically at a fixed interval, for example once every 30 seconds.

Step 320: if the physical machine autonomously detects a software or hardware fault that it can repair by fault tolerance, the fault is repaired in a fault-tolerant manner. It can be understood that the software and hardware faults that the physical machine can repair by fault tolerance, as described in the embodiment of the present invention, may include: failure of a disk that stores data, abnormality of a virtualization-related kernel module, abnormality of a file system that stores data, and so on. For example, for the failure of a disk that stores data, the fault-tolerant repair is to isolate the disk first and then rely on the cluster's distributed storage of multiple copies of the data, so that the data on that disk is automatically replicated to other healthy disks. This effectively ensures that isolating the faulty disk does not affect the stable operation of the system. Similarly, when a file system that stores data is damaged, fault-tolerant repair can be achieved by isolating the disk on which that file system is mounted.

Step 330: if the physical machine autonomously detects a software or hardware fault that it can repair by restarting, the fault is repaired by restarting the physical machine. It can be understood that the software and hardware faults that the physical machine can repair by restarting, as described in the embodiment of the present invention, may include: abnormalities such as the root file system becoming read-only, network card driver abnormalities that a restart can repair, operating system kernel module abnormalities, and so on. All such software and hardware faults can be repaired by restarting the physical machine.
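
The local self-check of steps 310 to 330 can be pictured as a lookup from fault type to repair mode. The fault labels below are hypothetical names derived from the examples in the text, and the `escalate` outcome is an assumption standing in for faults that are left to the fault classification processing module outside the cluster.

```python
# Hypothetical fault labels, grouped by the repair mode the text assigns them.
FAULT_TOLERANT = {
    "data_disk_failure",
    "virtualization_kernel_module_error",
    "data_filesystem_error",
}
RESTART_REPAIRABLE = {
    "root_filesystem_read_only",
    "nic_driver_error",
    "os_kernel_module_error",
}


def local_repair_action(fault):
    """Choose the repair mode for a fault found by the host's own self-check."""
    if fault in FAULT_TOLERANT:
        return "repair_fault_tolerant"  # step 320, e.g. isolate the bad disk
    if fault in RESTART_REPAIRABLE:
        return "restart_host"           # step 330
    return "escalate"                   # not repairable locally
```

A host would run this check on a fixed interval, for example every 30 seconds as suggested in step 310.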

Step 340: obtain a physical machine fault information list from the physical machine fault information storage center. It should be noted that the list is obtained from the physical machine fault information storage center by the physical machine fault classification processing module. The physical machine fault information list includes: physical machine fault information detected at a faulty physical machine by a physical machine fault detection module outside the cluster and reported to the physical machine fault information storage center, and physical machine fault information collected from a faulty physical machine by a physical machine fault collection module outside the cluster and reported to the physical machine fault information storage center.

Step 350: if a physical machine fault caused by a network attack is detected in the physical machine fault information list, a security attack protection center outside the cluster is triggered to handle it. It can be understood that, in practical applications, once the security attack protection center outside the cluster is triggered, it starts a security scrubbing procedure, such as traffic scrubbing, so that the faulty physical machine returns to health. It should be noted that a physical machine whose network is unreachable because of a network DDoS attack must be treated differently from one whose network is unreachable because the machine has crashed. If the virtual machines on a physical machine are migrated to other physical machines while that machine is under a DDoS attack, a domino effect results and the fault spreads: the other physical machines are attacked in turn and become unavailable, which may eventually cause flooding of the network equipment of the entire cluster and put every physical machine in the cluster at risk of failure.

Step 360: if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, an instruction is sent to the faulty physical machine directing it to shut itself down autonomously, or the faulty physical machine is shut down through the out-of-band management module on that physical machine; and the virtual machines on the faulty physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster system.

It should be noted that the types of software and hardware faults that the physical machine cannot repair by itself may include: a physical machine crash, a physical machine CPU abnormality, a physical machine memory abnormality, and abnormalities of other hardware such as the physical machine power module. Faults of this type directly make the physical machine unavailable and can only be repaired by replacing a hardware module. Therefore, in the embodiment of the present invention, the faulty physical machine is first isolated from the cluster, and hardware replacement or maintenance is then performed on it.

In addition, for software and hardware faults that the physical machine cannot repair by itself, the out-of-band management and control system on a traditional physical machine usually has an availability of about 90% or even lower, owing to hardware failure rates and cost. Under the commercial availability requirement of at least 99.95% for the cloud computing service itself, the total permitted unavailability over a whole year is 262.8 minutes. If a faulty physical machine cannot be repaired in time, a single physical machine fault directly costs tens of minutes of manual handling. The availability of the out-of-band management and control system in the prior art therefore cannot match the fault recovery service-level agreement (SLA) of a commercial cloud computing service. The technical solution provided by the embodiment of the present invention improves on the traditional out-of-band management and control system: when the availability of the out-of-band management module falls short of the commercial standard, the physical machine fault classification processing module outside the cluster can issue an instruction that directs the faulty physical machine to shut itself down, and the same module then migrates the virtual machines on the faulty physical machine through the virtualization interface to other healthy physical machines in the cluster system. This greatly shortens the repair time of the faulty physical machine and thereby improves the commercial availability of the system.

Preferably, the method according to the embodiment of the present invention may further include: Step 370: if it is detected in the physical machine fault information list that the network of a physical machine is completely unreachable and the outage has lasted for a preset time, it is determined whether the number of physical machines with an unreachable network exceeds a preset number. If so, operation and maintenance personnel are notified to repair the fault manually; otherwise, the virtual machines on the faulty physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster system.

The preset time can be set according to the actual situation to a suitable period such as 3 minutes or 5 minutes.

It should be noted that, when the network of a physical machine is detected to be completely unreachable and the outage has lasted for the preset time, the embodiment of the present invention further checks whether the number of faulty physical machines with an unreachable network exceeds the number of physical machines in one rack or the number of physical machines connected downstream of one switch. If it does, the situation is regarded as a cluster-scale network failure, and operation and maintenance personnel are alerted by telephone to repair it manually; it is no longer handled automatically. The reason is that, for a large-scale physical machine failure, isolating the physical machines and migrating their virtual machines would shut down a large number of physical machines. Once the machine-room equipment (network equipment, power equipment, and so on) recovers, those physical machines must be restarted and the virtual machines restored, and this series of operations would directly double the manual handling time or more, greatly lengthening the period during which the virtual machines are unavailable. The method provided by the embodiment of the present invention therefore handles this type of physical machine failure separately, which greatly shortens the repair time of the faulty physical machines, greatly shortens the time during which the virtual machines on them are unavailable, and improves the commercial availability of the system.

Preferably, the method according to the embodiment of the present invention may further include: Step 380: if it is detected in the physical machine fault information list that the network of a physical machine was unreachable but recovered before the outage reached the preset time, and it is determined that the outage was caused by a restart of the physical machine, it is determined whether the physical machine is currently healthy. If it is healthy, the virtual machines on the physical machine are restarted through the virtualization interface; if it is unhealthy, the virtual machines on the faulty physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster.

Preferably, the method according to the embodiment of the present invention may further include: Step 390: if it is detected in the physical machine fault information list that the network of a physical machine is unstable and the instability has lasted for the preset time, an instruction is sent to the faulty physical machine directing it to shut itself down autonomously, or the faulty physical machine is shut down through the out-of-band management module on that physical machine; and the virtual machines on the faulty physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster system.

It should be noted that cases in which the network of a physical machine is unstable and the instability lasts for the preset time are mainly physical machine faults with unknown causes, for example faults related to system load, the system network, or hardware. Although the root cause of such faults is hard to determine, their symptoms are clear, mainly: network packet loss on the physical machine, abnormal access to the physical machine management channel, and abnormal performance of the physical machine. All such faults can be handled in the same way: an instruction is sent to the faulty physical machine directing it to shut itself down autonomously, or the faulty physical machine is shut down through the out-of-band management module on that physical machine, and the virtual machines on the faulty physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster system.

Preferably, in one embodiment of the present invention, the healthy physical machines are determined as follows: all physical machines in the cluster are matched against the physical machine fault information list, and the physical machines for which no match is found are determined to be healthy physical machines.

In addition, the embodiment of the present invention also takes into account the possibility of large-scale physical machine failures within a large-scale cloud computing cluster: it determines whether the number of faulty physical machines amounts to a machine-room-level failure and adopts a different, targeted repair approach accordingly. In particular, large-scale physical machine failures are repaired manually, which effectively prevents frequent migration of the virtual machines on the faulty physical machines from degrading system performance.

Example four

Referring to FIG. 4, a schematic diagram of an embodiment of another virtual machine recovery method of the present invention is shown, which may specifically include the following steps: the physical machine fault detection module checks the network status of every physical machine in the cluster every 30 seconds and updates the physical machine fault information storage center; each physical machine in the cluster system autonomously detects its own faults, and the results are written to the physical machine fault information storage center through the physical machine fault collection module.

For software and hardware faults that the physical machine can repair by fault tolerance, the physical machine repairs the fault itself in a fault-tolerant manner. For software and hardware faults that the physical machine can repair by restarting, the physical machine repairs the fault itself by restarting. For software and hardware faults that the physical machine cannot repair by itself, the physical machine is shut down.

The physical machine fault classification processing module obtains the physical machine fault information list from the physical machine fault information storage center every minute and determines whether the list is empty. If it is empty, the module returns to the loop. Otherwise, it determines whether the list contains a physical machine failure caused by a network attack; if so, it triggers processing by the security attack protection center outside the cluster. Otherwise, it determines whether the list contains a software or hardware fault that the physical machine cannot repair itself; if so, it sends an instruction to the failed physical machine instructing it to shut itself down, or shuts it down through the out-of-band management module on the physical machine, and then migrates the virtual machines on the failed physical machine to other healthy physical machines in the cluster system through the virtualization interface.
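One pass of this decision order can be sketched as follows. This is only an illustration under stated assumptions: the record type, the fault labels, and the action strings are hypothetical, and the sketch follows a literal reading in which a detected attack ends the pass before any shutdown or migration is considered.

```python
from dataclasses import dataclass
from enum import Enum


class Fault(Enum):
    NETWORK_ATTACK = "network_attack"
    UNREPAIRABLE = "unrepairable"
    OTHER = "other"


@dataclass(frozen=True)
class FaultRecord:
    host: str
    fault: Fault


def classify_once(fault_list):
    """Run one pass over the fault list; return (host, action) pairs."""
    if not fault_list:
        return []  # nothing to do until the next polling cycle

    attacked = [r.host for r in fault_list if r.fault is Fault.NETWORK_ATTACK]
    if attacked:
        # Hand off to the protection center; do not migrate VMs under attack.
        return [(host, "trigger_protection_center") for host in attacked]

    actions = []
    for r in fault_list:
        if r.fault is Fault.UNREPAIRABLE:
            actions.append((r.host, "shut_down"))    # self-shutdown or out-of-band
            actions.append((r.host, "migrate_vms"))  # via the virtualization interface
    return actions
```

A caller would invoke `classify_once` once per polling interval (one minute in the text) and carry out the returned actions.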

If the physical machine fault information list contains no physical machine failure caused by a network attack, the module further determines whether the list contains a physical machine whose network is completely unreachable and has been unreachable for a preset time, for example 3 minutes. If so, it determines whether the number of unreachable physical machines exceeds a preset number, for example the number of physical machines in one rack or the number of physical machines connected under one switch. If the preset number is exceeded, the failure is treated as a cluster-scale network failure: operation and maintenance personnel are alerted by telephone to repair it manually, and no automatic processing is performed. Otherwise, the virtual machines on the failed physical machine are migrated to other healthy physical machines in the cluster system through the virtualization interface.
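The threshold check described above can be sketched as a small function. The default of 180 seconds follows the 3-minute example in the text. The default of 40 hosts is an assumed stand-in for "the machines in one rack or under one switch", and the function name and return values are likewise hypothetical.

```python
def handle_unreachable(outages, min_outage_s=180, max_hosts=40):
    """Decide between manual repair and automatic migration.

    outages maps host name -> seconds the host has been fully unreachable.
    """
    # Only hosts unreachable for at least the preset time count as failed.
    down = sorted(h for h, secs in outages.items() if secs >= min_outage_s)
    if not down:
        return ("none", [])
    if len(down) > max_hosts:
        # Cluster-scale network failure: alert operators, do not migrate.
        return ("alert_operators", down)
    return ("migrate_vms", down)
```

For example, `handle_unreachable({"pm1": 200, "pm2": 10})` returns `("migrate_vms", ["pm1"])`, because only `pm1` has been unreachable for the preset time and a single host is below the threshold.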

The module also determines whether the physical machine fault information list shows that a physical machine's network was unreachable but returned to normal before the outage reached the preset time, and whether the outage was caused by a restart of the physical machine. If both hold, it determines whether the physical machine is currently healthy. If it is healthy, the virtual machines on the physical machine are restarted through the virtualization interface; if it is unhealthy, the virtual machines on the failed physical machine are migrated to other healthy physical machines in the cluster through the virtualization interface.
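A minimal sketch of this branch follows, assuming the caller already knows the outage duration, whether a restart caused it, and the current health of the host. The function name, parameters, and return strings are illustrative and not taken from the specification.

```python
def handle_transient_outage(outage_s, caused_by_reboot, host_healthy, max_outage_s=180):
    """Return the action for a host whose network dropped and then recovered."""
    if outage_s >= max_outage_s or not caused_by_reboot:
        return "not_applicable"  # handled by another branch of the classifier
    # The host restarted and came back: its VMs are down and need recovery.
    return "restart_vms" if host_healthy else "migrate_vms"
```

Restarting the virtual machines in place avoids a migration when the host has recovered on its own. Migration is the fallback only when the host is still unhealthy after the restart.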

If the physical machine fault information list shows that a physical machine's network is unstable and the instability has lasted for a preset time, an instruction is sent to the failed physical machine instructing it to shut itself down, or the failed physical machine is shut down through the out-of-band management module on it. The virtual machines on the failed physical machine are then migrated to other healthy physical machines in the cluster system through the virtualization interface.

It should be noted that network instability lasting for the preset time is mainly the result of physical machine failures with unknown causes, for example system load, system network, or hardware faults. Although the root cause of such failures is hard to identify, their symptoms are clear, chiefly: network packet loss on the physical machine, abnormal access to the physical machine's management channel, and abnormal performance usage on the physical machine. The same handling can be applied to all failures of this kind.

Further, in the embodiments of the present invention, each physical machine autonomously detects its own fault status and applies targeted, classified repair to the faults it can repair itself. Faults that the physical machine cannot repair itself receive targeted, classified repair from the physical machine fault classification processing module outside the cluster. This effectively reduces misjudged and missed physical machine failures and makes automatic virtual machine recovery safer, more stable, and faster.

It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations. Those skilled in the art should understand, however, that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.

Embodiment five

Referring to FIG. 5, a structural block diagram of an embodiment of a physical machine damage repair device of the present invention is shown. The physical machine damage repair device 500 is applied to a physical machine in a virtualized cluster system and may specifically include an autonomous detection module (selfChecker) 510 and an autonomous processing module (selfHandler) 520. The autonomous detection module 510 specifically includes a detection unit 511 for autonomously detecting the fault status of the physical machine itself. Preferably, the detection unit 511 may perform this detection periodically at a fixed interval, for example once every 30 seconds.

The autonomous processing module 520 specifically includes a fault-tolerant unit 521, configured to repair, in a fault-tolerant manner, a software or hardware fault that the detection unit 511 detects and that the physical machine can repair through fault tolerance. It can be understood that such faults may include: failure of a data storage disk, an abnormal virtualization-related kernel module, an abnormal data storage file system, and so on. For example, for a failed data disk, the fault-tolerant repair first isolates the disk and then relies on the cluster's distributed storage mechanism, which keeps multiple copies of the data, to automatically replicate the data on that disk to other healthy disks. This ensures that isolating the failed disk does not affect stable operation of the system. Similarly, damage to a data storage file system can be repaired in a fault-tolerant manner by isolating the disk on which that file system is mounted.

A restart unit 522, configured to repair, by restarting the physical machine, a software or hardware fault that the detection unit 511 detects and that the physical machine can repair by restarting.

It can be understood that the software and hardware faults that the physical machine can repair by restarting may include: abnormalities such as the root file system becoming read-only, network card driver abnormalities that a restart can repair, operating system kernel module abnormalities, and so on. Such faults can all be repaired by restarting the physical machine.

Preferably, the autonomous processing module 520 may further include a shutdown unit 523. If the detection unit 511 detects a software or hardware fault that the physical machine cannot repair itself, the shutdown unit 523 shuts down the failed physical machine according to an instruction from the physical machine fault classification processing module outside the cluster, or the failed physical machine is shut down through the out-of-band management module 530 on the physical machine. The physical machine fault classification processing module outside the cluster then migrates the virtual machines on the failed physical machine to other healthy physical machines in the cluster system through the virtualization interface.

It should be noted that, for software and hardware faults that the physical machine cannot repair itself, the out-of-band management and control system on a traditional physical machine typically has an availability of around 90% or lower, owing to hardware failure rates and cost. A cloud computing service that must meet a commercial availability of at least 99.95% can be unavailable for only 262.8 minutes in total over a full year. If a failed physical machine cannot be repaired promptly, a single physical machine failure directly incurs tens of minutes of manual processing time. The availability of out-of-band management and control systems in the prior art therefore cannot match the failure recovery service-level agreement (SLA) of a commercial cloud computing service. The technical solution provided by the embodiments of the present invention improves on the traditional out-of-band management and control system. When the availability of the out-of-band management module 530 falls short of the commercial standard, the physical machine fault classification processing module outside the cluster can instruct the failed physical machine to shut itself down and then migrate the virtual machines on it to other healthy physical machines in the cluster system through the virtualization interface. This greatly shortens the repair time of failed physical machines and thereby improves the commercial availability of the system.
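The 262.8-minute figure follows directly from the 99.95% availability requirement over a 365-day year. The short calculation below reproduces it; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability):
    """Minutes per year a service may be unavailable at a given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability)


budget = downtime_budget_minutes(0.9995)
print(f"{budget:.1f} minutes per year")  # 262.8 minutes per year
```

Against this budget, a single manual intervention of 30 minutes (an illustrative value within the "tens of minutes" mentioned above) would consume roughly 11% of the yearly allowance.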

Preferably, the autonomous detection module 510 may further include a reporting unit 512. When the detection unit 511 autonomously detects a physical machine failure caused by a network attack, the reporting unit 512 reports the physical machine fault information to the physical machine fault information storage center through the physical machine fault collection module, and the physical machine fault classification processing module outside the cluster triggers processing by the security attack protection center outside the cluster.

Once triggered, the security attack protection center outside the cluster starts a security cleaning procedure, such as traffic scrubbing, so that the failed physical machine returns to health. It should be noted that a physical machine whose network is unreachable because of a network DDoS attack must be handled differently from one whose network is unreachable because the machine has crashed. If the virtual machines on a physical machine under DDoS attack were migrated to other physical machines, a domino effect would follow and widen the failure: other physical machines would be attacked in turn and become unavailable, which could eventually flood the network equipment of the whole cluster and put every physical machine in the cluster at risk of failure.

Embodiment six

Referring to FIG. 6, a structural block diagram of an embodiment of a cluster physical machine fault classification processing device of the present invention is shown. The physical machine fault classification processing device 600 may specifically include the following modules. An acquisition module 610, configured to obtain a physical machine fault information list from the physical machine fault information storage center. It should be noted that the physical machine fault information list includes: physical machine fault information detected at failed physical machines by the physical machine fault detection module outside the cluster and reported to the physical machine fault information storage center, and physical machine fault information collected from failed physical machines by the physical machine fault collection module outside the cluster and reported to the physical machine fault information storage center.

A first processing module 620, configured to trigger processing by the security attack protection center outside the cluster if a physical machine failure caused by a network attack is detected in the physical machine fault information list. It can be understood that, in practice, once triggered, the security attack protection center outside the cluster starts a security cleaning procedure, such as traffic scrubbing, so that the failed physical machine returns to health. It should be noted that a physical machine whose network is unreachable because of a network DDoS attack must be handled differently from one whose network is unreachable because the machine has crashed. If the virtual machines on a physical machine under DDoS attack were migrated to other physical machines, a domino effect would follow and widen the failure: other physical machines would be attacked in turn and become unavailable, which could eventually flood the network equipment of the whole cluster and put every physical machine in the cluster at risk of failure.

A second processing module 630, which further includes: a shutdown processing unit, configured to send the failed physical machine an instruction to shut down if a software or hardware fault that the physical machine cannot repair itself is detected in the physical machine fault information list, where preferably the instruction may instruct the failed physical machine to shut itself down, or the failed physical machine may be shut down through the out-of-band management module on it; and a migration processing unit, configured to migrate the virtual machines on the failed physical machine to other healthy physical machines in the cluster system through the virtualization interface.

Preferably, the physical machine fault classification processing device 600 may further include a third processing module 640, which specifically includes: a notification processing unit, configured to notify operation and maintenance personnel to repair manually if the physical machine fault information list shows that a physical machine's network is completely unreachable, the outage has lasted for the preset time, and the number of unreachable physical machines exceeds one; and a migration processing unit, configured to migrate the virtual machines on the failed physical machine to other healthy physical machines in the cluster system through the virtualization interface if the physical machine fault information list shows that a physical machine's network is completely unreachable, the outage has lasted for the preset time, and the number of unreachable physical machines does not exceed the preset number.

Preferably, the physical machine fault classification processing device 600 may further include a fourth processing module 650, which specifically includes: a restart processing unit, configured to restart the virtual machines on the physical machine through the virtualization interface if the physical machine fault information list shows that a physical machine's network was unreachable but returned to normal before the outage reached the preset time, the outage is determined to have been caused by a restart of the physical machine, and the physical machine is determined to be currently healthy; and a migration processing unit, configured to migrate the virtual machines on the failed physical machine to other healthy physical machines in the cluster through the virtualization interface if the same conditions hold but the physical machine is determined to be currently unhealthy.

Preferably, the physical machine fault classification processing device 600 may further include a fifth processing module 660, which specifically includes: a shutdown processing unit, configured to send an instruction to the failed physical machine instructing it to shut itself down, or to shut it down through the out-of-band management module on the physical machine, if the physical machine fault information list shows that a physical machine's network is unstable and the instability has lasted for a preset time; and a migration processing unit, configured to migrate the virtual machines on the failed physical machine to other healthy physical machines in the cluster system through the virtualization interface.

It should be noted that network instability lasting for the preset time is mainly the result of physical machine failures with unknown causes, for example system load, system network, or hardware faults. Although the root cause of such failures is hard to identify, their symptoms are clear, chiefly: network packet loss on the physical machine, abnormal access to the physical machine's management channel, and abnormal performance usage on the physical machine. The same handling can be applied to all failures of this kind: an instruction is sent to the failed physical machine instructing it to shut itself down, or it is shut down through the out-of-band management module on the physical machine, and the virtual machines on it are migrated to other healthy physical machines in the cluster system through the virtualization interface.

Preferably, the physical machine fault classification processing device 600 may further include a determining module 670, configured to match all physical machines in the cluster against the physical machine fault information list and to determine the physical machines with no match to be healthy physical machines.
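The matching performed by the determining module can be read as a set difference between the cluster inventory and the hosts named in the fault list. The sketch below assumes both are available as collections of host names; the function name is hypothetical.

```python
def healthy_machines(cluster_hosts, fault_list_hosts):
    """Return cluster hosts that do not appear in the fault list, sorted."""
    return sorted(set(cluster_hosts) - set(fault_list_hosts))
```

For example, `healthy_machines(["pm1", "pm2", "pm3"], ["pm2"])` yields `["pm1", "pm3"]`, which are the candidate migration targets.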

Further, in the embodiments of the present invention, each physical machine autonomously detects its own fault status and applies targeted, classified repair to the faults it can repair itself. Faults that the physical machine cannot repair itself receive targeted, classified repair from the physical machine fault classification processing module outside the cluster. This effectively reduces misjudged and missed physical machine failures and makes automatic virtual machine recovery safer, more stable, and faster.

Embodiment seven

Referring to FIG. 7, an architecture diagram of an embodiment of a virtual machine recovery system of the present invention is shown. The virtual machine recovery system includes: a physical machine damage repair device 710, applied to each physical machine in the virtualized cluster system 700; a physical machine fault classification processing device 720; and a physical machine fault information storage center 730. The physical machine damage repair device 710 may specifically include an autonomous detection module 711 and an autonomous processing module 712. The autonomous detection module 711 is configured to autonomously detect the fault status of the physical machine itself. The autonomous processing module 712 is configured to repair in a fault-tolerant manner if the autonomous detection module 711 detects a software or hardware fault that the physical machine can repair through fault tolerance, and to repair by restarting the physical machine if the autonomous detection module 711 detects a software or hardware fault that the physical machine can repair by restarting.

Preferably, if the autonomous detection module 711 detects a software or hardware fault that the physical machine cannot repair itself, the autonomous processing module 712 may also shut down the failed physical machine according to an instruction from the physical machine fault classification processing module 720 outside the cluster, or the failed physical machine may be shut down through the out-of-band management module 713 on the physical machine. The physical machine fault classification processing module 720 outside the cluster then migrates the virtual machines on the failed physical machine to other healthy physical machines in the cluster system through the virtualization interface.

Preferably, when the autonomous detection module 711 autonomously detects a physical machine failure caused by a network attack, it may also report the physical machine fault information to the physical machine fault information storage center 730 through the physical machine fault collection module 760, and the physical machine fault classification processing module 720 outside the cluster triggers processing by the security attack protection center 740 outside the cluster.

It should be noted that, in another embodiment of the present invention, the autonomous detection module 711 and the autonomous processing module 712 may be software modules deployed on each physical machine in the cluster and started automatically when the physical machine boots. Their operation does not depend on the file system and relies only on the CPU and memory.

The physical machine fault information storage center 730 is configured to aggregate all reported physical machine fault information into a physical machine fault information list. The list includes: physical machine fault information detected at failed physical machines by the physical machine fault detection module 750 outside the cluster and reported to the physical machine fault information storage center 730, and physical machine fault information collected from failed physical machines by the physical machine fault collection module 760 outside the cluster and reported to the physical machine fault information storage center 730.
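The aggregation role of the storage center can be illustrated with a small in-memory stand-in. The specification describes the storage center elsewhere as a database system, so the class below is purely a sketch: its name, methods, and the `"probe"` and `"collector"` source labels are assumptions.

```python
from collections import defaultdict


class FaultStore:
    """In-memory stand-in for the fault information storage center."""

    def __init__(self):
        self._faults = defaultdict(set)  # host -> set of (source, fault) pairs

    def report(self, source, host, fault):
        """Record a fault reported by the 'probe' or 'collector' source."""
        self._faults[host].add((source, fault))

    def fault_list(self):
        """Return the merged list as sorted (host, source, fault) tuples."""
        return sorted(
            (host, source, fault)
            for host, pairs in self._faults.items()
            for source, fault in pairs
        )
```

Storing reports in a set per host means a fault reported repeatedly by the same source appears once in the merged list.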

The physical machine fault classification processing device 720 is configured to obtain the physical machine fault information list from the physical machine fault information storage center 730 through the acquisition module 721. If a physical machine failure caused by a network attack is detected in the list, the first processing module 722 triggers processing by the security attack protection center 740 outside the cluster. If a software or hardware fault that the physical machine cannot repair itself is detected in the list, the second processing module 723 sends an instruction to the failed physical machine instructing it to shut itself down, or shuts it down through the out-of-band management module 713 on the physical machine, and migrates the virtual machines on the failed physical machine to other healthy physical machines in the cluster system through the virtualization interface.

Preferably, the physical machine fault classification processing device 720 may further include a third processing module 724. If the physical machine fault information list shows that a physical machine's network is completely unreachable and the outage has lasted for the preset time, the third processing module 724 determines whether the number of unreachable physical machines exceeds the preset number. If so, it notifies operation and maintenance personnel to repair manually; otherwise, it migrates the virtual machines on the failed physical machine to other healthy physical machines in the cluster system through the virtualization interface.

Preferably, the physical machine fault classification processing device 720 may further include a fourth processing module 725. If the physical machine fault information list shows that a physical machine's network was unreachable but returned to normal before the outage reached the preset time, and the outage is determined to have been caused by a restart of the physical machine, the fourth processing module 725 determines whether the physical machine is currently healthy. If it is healthy, the virtual machines on the physical machine are restarted through the virtualization interface; if it is unhealthy, the virtual machines on the failed physical machine are migrated to other healthy physical machines in the cluster through the virtualization interface.

Preferably, the physical machine fault classification processing device 720 may further include a fifth processing module 726. If the physical machine fault information list shows that a physical machine's network is unstable and the instability has lasted for the preset time, the module sends an instruction to the faulty physical machine so that it either shuts itself down autonomously or is shut down through the out-of-band management module on that physical machine, and migrates the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.
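One way to combine the two shutdown paths with the subsequent migration is sketched below. The specification only says that either path may be used; trying autonomous shutdown first and falling back to out-of-band shutdown, and skipping migration when neither succeeds, are assumptions of this sketch. The three callables are hypothetical stand-ins for the real interfaces.

```python
def shut_down_then_migrate(host, self_shutdown, oob_shutdown, migrate):
    """Shut a faulty host down, then migrate its virtual machines.

    self_shutdown / oob_shutdown: callables returning True on success (assumed).
    migrate: callable that moves the host's virtual machines elsewhere (assumed).
    """
    method = "autonomous" if self_shutdown(host) else None
    if method is None and oob_shutdown(host):
        # Assumed fallback: use the out-of-band management module.
        method = "out_of_band"
    if method is None:
        # Assumption: do not migrate while the host may still be running its VMs.
        return {"shutdown": None, "migrated": False}
    migrate(host)
    return {"shutdown": method, "migrated": True}
```

Usage: `shut_down_then_migrate("pm-3", try_self_shutdown, try_oob_shutdown, migrate_vms)`, where the three function names are placeholders for cluster-specific code.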

Preferably, the physical machine fault classification processing device 720 may further include a determining module 727, which matches all physical machines in the cluster against the physical machine fault information list and determines the physical machines that are not matched to be healthy physical machines.
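The matching performed by the determining module amounts to a set difference. A minimal sketch, assuming each fault record is a dictionary with a hypothetical `"host"` key:

```python
def healthy_hosts(cluster_hosts, fault_list):
    """Return the cluster hosts that do not appear in the fault information list."""
    faulty = {record["host"] for record in fault_list}
    # Preserve the cluster's own ordering of hosts.
    return [host for host in cluster_hosts if host not in faulty]
```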

It should be noted that, for the specific structures of the physical machine damage repair device 710 and the physical machine fault classification processing device 720, please refer to the detailed description of the foregoing embodiments, which is not repeated here.

It should be noted that, in another embodiment of the present invention, the physical machine fault classification processing device 720, the physical machine fault detection module 750, and the physical machine fault collection module 760 in the virtual machine recovery system are all software modules deployed on physical machines outside the virtualized cluster system 700. They may each be deployed independently on different physical machines, or be deployed together on the same physical machine. In addition, the physical machine fault information storage center 730 is a database system deployed outside the virtualized cluster system 700. The security attack protection center 740 may directly adopt an existing security attack protection system. The embodiments of the present invention do not limit this.

The embodiments of the present invention have the following advantages: in a large-scale cloud computing cluster, they identify multiple physical machine fault scenarios quickly, accurately, and at fine granularity, and handle each category in a targeted way. This achieves fast and highly reliable physical machine damage repair and ensures rapid recovery of the virtual machine services running on those machines.

Further, in the embodiments of the present invention, the physical machine autonomously detects its own fault status and applies targeted, classified repair to the faults that it can repair by itself. Faults that the physical machine cannot repair by itself are handled, again by targeted classification and repair, by the physical machine fault classification processing module outside the cluster. This effectively reduces misjudged and missed physical machine faults and makes automatic virtual machine recovery safer, more stable, and faster.
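The split between self-repair on the physical machine and handling outside the cluster can be sketched as a single local decision. The flags `fault_tolerable` and `reboot_fixable` and the returned action names are hypothetical; how a host actually classifies its own faults is not shown here.

```python
def local_repair_step(fault):
    """Pick the host-side action for a self-detected fault (fault is an assumed dict)."""
    if fault.get("fault_tolerable"):
        # Repairable on the host by fault-tolerant means.
        return "repair_fault_tolerant"
    if fault.get("reboot_fixable"):
        # Repairable by restarting the physical machine.
        return "reboot_host"
    # Everything else is left to the classification module outside the cluster.
    return "escalate_to_external_classifier"
```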

Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding description of the method embodiments.

The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a device, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing computer-usable program code.

In a typical configuration, the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the methods, terminal devices (systems), and computer program products according to the embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce a device for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction device. The instruction device implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps are executed on the computer or other programmable terminal device to produce computer-implemented processing. The instructions executed on the computer or other programmable terminal device thereby provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.

Finally, it should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.

The foregoing has described in detail the physical machine damage repair method and device applied to a virtualized cluster system, the clustered physical machine fault classification processing method and device, and the virtual machine recovery method and system provided by the present invention. Specific examples have been used to explain the principles and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. For those of ordinary skill in the art, changes may be made to the specific implementation and scope of application in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

A clustered physical machine fault classification processing method, characterized in that it comprises: obtaining a physical machine fault information list from a physical machine fault information storage center; if a physical machine fault caused by a network attack is detected in the physical machine fault information list, triggering processing by a security attack protection center outside the cluster; if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, sending an instruction to shut down the faulty physical machine to the faulty physical machine; and migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through a virtualization interface.

The method according to claim 1, wherein the method further includes: if it is detected in the physical machine fault information list that a physical machine's network is completely unreachable and the outage duration has reached a preset time, determining whether the number of physical machines with unreachable networks exceeds a preset number; if so, notifying operation and maintenance personnel to repair manually; otherwise, migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.

The method according to claim 1, wherein the method further includes: if it is detected in the physical machine fault information list that a physical machine's network was unreachable but returned to normal before the outage duration reached a preset time, and it is determined that the outage was caused by a restart of the physical machine, determining whether the physical machine is currently healthy; if it is healthy, restarting the virtual machines on the physical machine through the virtualization interface; and if it is unhealthy, migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster through the virtualization interface.

The method according to claim 1, wherein the method further includes: if it is detected in the physical machine fault information list that a physical machine's network is unstable and the instability has lasted for a preset time, sending an instruction to the faulty physical machine to instruct the faulty physical machine to shut itself down autonomously, or shutting down the faulty physical machine through an out-of-band management module on the physical machine; and migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.

The method according to claim 1, wherein the step of sending an instruction to shut down the faulty physical machine to the faulty physical machine, if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, includes: sending the instruction to the faulty physical machine to instruct the faulty physical machine to shut itself down autonomously, or shutting down the faulty physical machine through an out-of-band management module on the physical machine.

The method according to claim 1, wherein the healthy physical machine is determined as follows: matching all physical machines in the cluster against the physical machine fault information list; and determining the physical machines that are not matched to be the healthy physical machines.

The method according to claim 1, wherein the physical machine fault information list includes: physical machine fault information that a physical machine fault detection module outside the cluster detects from the faulty physical machine and reports to the physical machine fault information storage center, and physical machine fault information that a physical machine fault collection module outside the cluster collects from the faulty physical machine and reports to the physical machine fault information storage center.

A clustered physical machine fault classification processing device, characterized by comprising: an acquisition module, configured to obtain a physical machine fault information list from a physical machine fault information storage center; a first processing module, configured to trigger processing by a security attack protection center outside the cluster if a physical machine fault caused by a network attack is detected in the physical machine fault information list; and a second processing module, which further comprises: a shutdown processing unit, configured to send an instruction to shut down the faulty physical machine to the faulty physical machine if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list; and a migration processing unit, configured to migrate the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through a virtualization interface.

The device according to claim 8, wherein the device further includes a third processing module, and the third processing module includes: a notification processing unit, configured to notify operation and maintenance personnel to repair manually if it is detected in the physical machine fault information list that a physical machine's network is completely unreachable and the outage duration has reached a preset time, and the number of physical machines with unreachable networks exceeds a preset number; and a migration processing unit, configured to migrate the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface if it is detected in the physical machine fault information list that a physical machine's network is completely unreachable and the outage duration has reached the preset time, and the number of physical machines with unreachable networks does not exceed the preset number.

The device according to claim 8, wherein the device further includes a fourth processing module, and the fourth processing module includes: a restart processing unit, configured to restart the virtual machines on the physical machine through the virtualization interface when it is detected in the physical machine fault information list that a physical machine's network was unreachable but returned to normal before the outage duration reached a preset time, it is determined that the outage was caused by a restart of the physical machine, and the physical machine is determined to be currently healthy; and a migration processing unit, configured to migrate the virtual machines on the faulty physical machine to other healthy physical machines in the cluster through the virtualization interface when it is detected in the physical machine fault information list that a physical machine's network was unreachable but returned to normal before the outage duration reached the preset time, it is determined that the outage was caused by a restart of the physical machine, and the physical machine is determined to be currently unhealthy.

The device according to claim 8, wherein the device further includes a fifth processing module, and the fifth processing module includes: a shutdown processing unit, configured to, if it is detected in the physical machine fault information list that a physical machine's network is unstable and the instability has lasted for a preset time, send an instruction to the faulty physical machine to instruct the faulty physical machine to shut itself down autonomously, or shut down the faulty physical machine through an out-of-band management module on the physical machine; and a migration processing unit, configured to migrate the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through a virtualization interface.

The device according to any one of claims 8-11, wherein the shutdown processing unit is configured to: if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, send an instruction to shut down the faulty physical machine to the faulty physical machine to instruct the faulty physical machine to shut itself down autonomously, or shut down the faulty physical machine through an out-of-band management module on the physical machine.

The device according to any one of claims 8-11, wherein the device further includes: a determining module, configured to match all physical machines in the cluster against the physical machine fault information list and determine the physical machines that are not matched to be the healthy physical machines.

The device according to claim 8, wherein the physical machine fault information list includes: physical machine fault information that a physical machine fault detection module outside the cluster detects from the faulty physical machine and reports to the physical machine fault information storage center, and physical machine fault information that a physical machine fault collection module outside the cluster collects from the faulty physical machine and reports to the physical machine fault information storage center.

A virtual machine recovery method, characterized in that it is applied to a virtualized cluster system and includes: a physical machine in the virtualized cluster system autonomously detecting its own fault status; if a software or hardware fault that the physical machine can repair by fault tolerance is autonomously detected, repairing it by fault-tolerant means; if a software or hardware fault that the physical machine can repair by restarting is autonomously detected, repairing it by restarting the physical machine; obtaining a physical machine fault information list from a physical machine fault information storage center; if a physical machine fault caused by a network attack is detected in the physical machine fault information list, triggering processing by a security attack protection center outside the cluster; if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, sending an instruction to shut down the faulty physical machine to the faulty physical machine; and migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through a virtualization interface.

The method according to claim 15, wherein the method further includes: if it is detected in the physical machine fault information list that a physical machine's network is completely unreachable and the outage duration has reached a preset time, determining whether the number of physical machines with unreachable networks exceeds a preset number; if so, notifying operation and maintenance personnel to repair manually; otherwise, migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.

The method according to claim 15, wherein the method further includes: if it is detected in the physical machine fault information list that a physical machine's network was unreachable but returned to normal before the outage duration reached a preset time, and it is determined that the outage was caused by a restart of the physical machine, determining whether the physical machine is currently healthy; if it is healthy, restarting the virtual machines on the physical machine through the virtualization interface; and if it is unhealthy, migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster through the virtualization interface.

The method according to claim 15, wherein the method further includes: if it is detected in the physical machine fault information list that a physical machine's network is unstable and the instability has lasted for a preset time, sending an instruction to the faulty physical machine to instruct the faulty physical machine to shut itself down autonomously, or shutting down the faulty physical machine through an out-of-band management module on the physical machine; and migrating the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.

The method according to claim 15, wherein the healthy physical machine is determined as follows: matching all physical machines in the cluster against the physical machine fault information list; and determining the physical machines that are not matched to be the healthy physical machines.

The method according to claim 15, wherein the step of obtaining the physical machine fault information list from the physical machine fault information storage center includes: a physical machine fault classification processing module obtaining the physical machine fault information list from the physical machine fault information storage center.

The method according to claim 15, wherein the step of sending an instruction to shut down the faulty physical machine to the faulty physical machine, if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, includes: sending the instruction to the faulty physical machine to instruct the faulty physical machine to shut itself down autonomously, or shutting down the faulty physical machine through an out-of-band management module on the physical machine.

The method according to claim 15, wherein the physical machine fault information list includes: physical machine fault information that a physical machine fault detection module outside the cluster detects from the faulty physical machine and reports to the physical machine fault information storage center, and physical machine fault information that a physical machine fault collection module outside the cluster collects from the faulty physical machine and reports to the physical machine fault information storage center.

A virtual machine recovery system, characterized in that the system includes: a physical machine damage repair device, applied to a physical machine in a virtualized cluster system, configured to autonomously detect the fault status of the physical machine itself, to repair by fault-tolerant means if a software or hardware fault that the physical machine can repair by fault tolerance is autonomously detected, and to repair by restarting the physical machine if a software or hardware fault that the physical machine can repair by restarting is autonomously detected; a physical machine fault information storage center, configured to integrate all reported physical machine fault information into a physical machine fault information list; and a physical machine fault classification processing device, configured to obtain the physical machine fault information list from the physical machine fault information storage center, to trigger processing by a security attack protection center outside the cluster if a physical machine fault caused by a network attack is detected in the physical machine fault information list, and, if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, to send an instruction to shut down the faulty physical machine to the faulty physical machine and migrate the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through a virtualization interface.

The system according to claim 23, wherein the physical machine fault classification processing device is further configured to: if it is detected in the physical machine fault information list that a physical machine's network is completely unreachable and the outage duration has reached a preset time, determine whether the number of physical machines with unreachable networks exceeds a preset number; if so, notify operation and maintenance personnel to repair manually; otherwise, migrate the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.

The system according to claim 23, wherein the physical machine fault classification processing device is further configured to: if it is detected in the physical machine fault information list that a physical machine's network was unreachable but returned to normal before the outage duration reached a preset time, and it is determined that the outage was caused by a restart of the physical machine, determine whether the physical machine is currently healthy; if it is healthy, restart the virtual machines on the physical machine through the virtualization interface; and if it is unhealthy, migrate the virtual machines on the faulty physical machine to other healthy physical machines in the cluster through the virtualization interface.

The system according to claim 23, wherein the physical machine fault classification processing device is further configured to: if it is detected in the physical machine fault information list that a physical machine's network is unstable and the instability has lasted for a preset time, send an instruction to the faulty physical machine to instruct the faulty physical machine to shut itself down autonomously, or shut down the faulty physical machine through an out-of-band management module on the physical machine; and migrate the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.

The system according to claim 23, wherein the physical machine fault classification processing device is further configured to: match all physical machines in the cluster against the physical machine fault information list, and determine the physical machines that are not matched to be healthy physical machines.

The system according to claim 23, wherein the physical machine fault classification processing device is further configured to: send an instruction to shut down the faulty physical machine to the faulty physical machine, so as to instruct the faulty physical machine to shut itself down autonomously, or shut down the faulty physical machine through the out-of-band management module on the physical machine.
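The two shutdown paths can be sketched as below. The claim presents them as alternatives; trying the in-band instruction first and falling back to the out-of-band module is an assumption made for illustration, as are the callable parameters, which stand in for whatever transport a real deployment would use.

```python
from typing import Callable


def shut_down_faulty_machine(
    send_shutdown_instruction: Callable[[], bool],
    oob_power_off: Callable[[], bool],
) -> str:
    """Try an in-band shutdown first, then the out-of-band module (assumed order)."""
    if send_shutdown_instruction():
        return "in_band"
    # The machine did not shut itself down, so use out-of-band management.
    if oob_power_off():
        return "out_of_band"
    return "failed"
```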

The system according to claim 23, wherein the physical machine fault information list includes: physical machine fault information that a physical machine fault detection module outside the cluster detects from the faulty physical machine and reports to the physical machine fault information storage center, and physical machine fault information that the physical machine fault collection module outside the cluster collects from the faulty physical machine and reports to the physical machine fault information storage center.

The system according to claim 23, wherein the physical machine damage repair device includes: an autonomous detection module, including a detection unit configured to autonomously detect the failure dynamics of the physical machine itself; and an autonomous processing module, including: a fault-tolerance unit configured to, if the detection unit detects a software or hardware fault that the physical machine itself can repair through fault tolerance, repair the fault by fault-tolerant means; and a restart unit configured to, if the detection unit detects a software or hardware fault that the physical machine itself can repair by restarting, repair the fault by restarting the physical machine.

The system according to claim 30, wherein the autonomous processing module further includes: a shutdown unit configured to, if the detection unit detects a software or hardware fault that the physical machine itself cannot repair, shut down the faulty physical machine according to an instruction from the physical machine fault classification processing module outside the cluster or through the out-of-band management module on the physical machine, whereupon the physical machine fault classification processing module outside the cluster migrates the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.

The system according to claim 30, wherein the autonomous detection module further includes: a reporting unit configured to, when the autonomous detection module autonomously detects that the physical machine has failed due to a network attack, report the physical machine fault information to the physical machine fault information storage center through the physical machine fault collection module, whereupon the physical machine fault classification processing module outside the cluster triggers the security attack protection center outside the cluster to handle it.
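The division of work among the fault-tolerance, restart, shutdown, and reporting units can be pictured as a lookup from fault category to on-machine action. The enum `FaultKind`, its member names, and the action labels below are hypothetical and serve only to illustrate the dispatch.

```python
from enum import Enum


class FaultKind(Enum):
    FAULT_TOLERABLE = "fault_tolerable"        # repairable by fault tolerance
    RESTART_REPAIRABLE = "restart_repairable"  # repairable by restarting
    UNREPAIRABLE = "unrepairable"              # not repairable on the machine
    NETWORK_ATTACK = "network_attack"          # failure caused by an attack


# On-machine action per category; migration and attack handling happen
# outside the cluster and are only triggered by these actions.
LOCAL_ACTIONS = {
    FaultKind.FAULT_TOLERABLE: "repair_by_fault_tolerance",
    FaultKind.RESTART_REPAIRABLE: "restart_physical_machine",
    FaultKind.UNREPAIRABLE: "shut_down_then_external_migration",
    FaultKind.NETWORK_ATTACK: "report_to_storage_center",
}


def autonomous_action(kind: FaultKind) -> str:
    """Return the action the physical machine takes for a detected fault."""
    return LOCAL_ACTIONS[kind]
```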