WO2012137239A1 - Computer system

Info

Publication number: WO2012137239A1
Application number: PCT/JP2011/001992
Authority: WO
Inventors: 成昊金; 英児西島
Original assignee: 株式会社日立製作所
Priority date: 2011-04-04
Filing date: 2011-04-04
Publication date: 2012-10-11

Abstract

Provided is a computer system that, when a problem occurs in an operating system running on a virtual machine monitor, can obtain and save extensive dump information, including the memory content of the virtual machine monitor itself that may be needed to analyze the problem, without stopping the system. The computer system, in which a guest machine operates on a host machine using a virtualization system, is provided with: a probe insertion process unit for inserting a probe point from the host machine into the guest machine; a probe point management table for associating probe points with corresponding dump scenarios; a dump scenario management table for limiting dump locations for each scenario; and a guest machine problem detection program for detecting the occurrence of a problem in the guest machine. When a problem occurs, the memory content of the virtual machine monitor is dumped without stopping the system, and the memory region subject to the dump is limited according to the type of problem.

Description

Computer system

The present invention relates to an information processing apparatus having a virtualization function and a failure analysis information acquisition method applied to the apparatus.

Due to recent improvements in semiconductor technology, hardware resources such as CPUs and I/O devices have improved remarkably in performance and fallen in price. As a result, the resource utilization rate during system operation has dropped, and resources are wasted. In addition, the growing space needed to install equipment in facilities such as data centers has become a pressing concern, and systems supporting virtualization have been developed as a way to make effective use of otherwise wasted resources. Such a system is controlled by virtualization software called a virtual machine monitor.

Well-known virtualization software includes the commercial software VMware (registered trademark) and the KVM feature of the open-source Linux kernel.

Using such virtualization software, a virtualization system can easily be built on a computer such as a personal computer.

In this virtualization system, a plurality of operating systems can simultaneously use hardware resources such as a CPU, a memory, and an I/O device. As a result, hardware resources can be used effectively and facilities can be consolidated, which saves installation space.

In the virtualization system, multiple guest machines operate on the host machine.

Patent Document 1 discloses a method in which the contents (data) of the memory for a guest machine operating on a virtual machine monitor (corresponding to a host machine) are periodically acquired and stored, and, when a failure occurs, the memory contents are recovered using the most recently stored contents. This method turns off the writable bit in the page table, generates a memory protection interrupt when the processor tries to update the page data, and, in the interrupt handling routine, saves the page contents to a memory area reserved in advance.

Patent Document 1: JP2009-245216

However, the method of Patent Document 1 does not consider the case where information needed to analyze a failure in the operating system of a guest machine running on the virtual machine monitor (corresponding to the host machine) resides in the virtual machine monitor's memory, so the information needed for failure analysis may not be fully obtained. That is, the method of Patent Document 1 stores only the contents of the memory for the guest machine. When a failure occurs in the guest machine, the contents of the guest machine memory can therefore be analyzed, but the contents of the host machine memory, whose data is changed by the processing of the guest machine, are not saved and cannot be analyzed.

In a virtualized system, even when the failure occurs in the operating system of a guest machine running on the virtual machine monitor, not only the guest machine memory data but also the virtual machine monitor memory data must be acquired and stored for failure analysis.

Here, acquiring and saving the memory contents of the virtual machine monitor using the conventional memory dump technology (Non-Patent Document 1) requires stopping the virtual machine monitor. This in turn requires stopping the other guest machines operating on the virtual machine monitor, which imposes a large overhead on them.

Therefore, when a failure occurs in a guest machine, a technology is provided to acquire and store information necessary for failure analysis of the guest machine without giving a large overhead to other guest machines.

The computer system has a plurality of guest machines, each of which is a virtual computer having an operating system, and a host machine that has an operating system different from those of the guest machines and controls the guest machines. The computer system further includes a memory and a processor that executes the operating system of each guest machine and the operating system of the host machine. The memory includes a host machine memory area allocated to the host machine and a data save memory area. Data stored in the host machine memory area is updated by processing executed by the guest machines. When a failure occurs in one of the guest machines, the host machine sets the area in the host machine memory area associated with the failure as a dump acquisition area and prohibits writing to the dump acquisition area. When the host machine receives a write request to the dump acquisition area from a guest machine other than the one where the failure occurred, the host machine copies the data stored in the dump acquisition area to the data save memory area, then permits writing to the dump acquisition area and writes the data according to the write request.

In the virtualization system, when a failure occurs in a guest machine, the information necessary for analyzing the failure can be acquired and saved while reducing the overhead imposed on the guest machines other than the failed one.

FIG. 1 is a diagram showing a configuration example of the computer system of this embodiment.
FIG. 2 is a diagram showing a configuration example of the computer of this embodiment.
FIG. 3 is a diagram showing a configuration example of the probe point management table of this embodiment.
FIG. 4 is a diagram showing an example of the probe point insertion process of this embodiment.
FIG. 5 is a diagram showing a configuration example of the dump scenario management table of this embodiment.
FIG. 6 is a diagram showing a configuration example of the dump area management table of this embodiment.
FIG. 7 is a diagram showing an example of host OS dump locations at the time of a guest OS failure in this embodiment.
FIG. 8 is a diagram showing a configuration example of the dump request bitmap table of this embodiment.
FIG. 9 is a diagram explaining an example of the overall nonstop dump processing flow of this embodiment.
FIG. 10 is a diagram explaining an example of the error detection method and the processing after error occurrence in this embodiment.
FIG. 11 is a diagram explaining an example of the probe handler process of this embodiment.
FIG. 12 is a diagram explaining an example of the memory saving fault process of this embodiment.
FIG. 13 is a diagram explaining an example of the dump writing process of this embodiment.

---System configuration---
Embodiments will be described in detail below with reference to the drawings. FIG. 1 is a configuration diagram of a computer system 10 according to this embodiment. FIG. 2 is a diagram showing in detail the configuration in the host computer (also called host machine) 110 of the computer system 10.

The computer system 10 has a configuration in which a memory 190, one or more processors 170, and a communication interface 180 are connected by a communication path such as a bus. In addition, the computer system 10 may include an output device such as a display and an input device such as a keyboard. An external storage device 200 is connected to the communication interface 180.

The memory 190 of the computer system 10 stores the programs to be executed by the processor 170, and is provided with a save memory section 160, described later.

The processor 170 is a multi-core processor equipped with a plurality of processor cores (hereinafter also referred to as CPU cores or simply cores).

The computer system (hereinafter also referred to as a physical machine) 10 creates a virtual environment on the physical machine by executing a virtualization program (also referred to as a virtualization mechanism) 113, and configures one or more guest machines (also called virtual machines) on the physical machine.
For this reason, the memory 190 of the computer system 10 stores a host cluster program 112, a host OS (Operating System) 111, the virtualization program 113, and, for each guest machine, an application program 152, a guest cluster program, and a guest OS 150.

The functional configuration realized by the host cluster program 112, the host OS 111, and the virtualization program 113 is called the host machine 110, and the functional configuration realized by the application program 152, the guest cluster program, and the guest OS 150 prepared for each virtual machine is called a guest machine 140. In the following description, the execution of processing may be described with a program as the subject. This indicates that the processing is performed by the processor 170 executing that program.

The save memory section 160 is used as a save destination for memory contents to be described later. The save memory section 160 is outside the memory area used by the host machine and the guest machine, and is reserved in advance when the system is booted. The save memory section 160 may be an external memory (SRAM) instead of the memory mounted in the computer system 10.

In this embodiment, it is assumed that two guest machines 140A and 140B are configured by the virtualization mechanism 113 of the host machine 110, and that the save memory partition 160 is provided on the system memory 190. Guest machine A 140A and guest machine B 140B have the same configuration and include guest OS 150A and guest OS 150B, and application program A 152A and application program B 152B, respectively. However, this is only an example: the number of guest machines is not limited to two, and the guest machines do not necessarily have the same configuration and may be configured differently. The following description uses guest machine A 140A, but guest machine B 140B has the same functional configuration.

FIG. 2 shows a detailed configuration example of the host machine 110.

The host OS 111 includes: a probe point management table 120 (details will be described later with reference to FIG. 3) for managing the memory address of each probe point to be inserted into guest machine A 140A and the scenario number used at the time of dumping; a dump scenario management table 121 for managing the dump target areas (details will be described later with reference to FIG. 5); a dump area management table 122 for managing the save memory partition 160 (details will be described later with reference to FIG. 6); a dump request bitmap table 123 (details will be described later with reference to FIG. 8); and programs for executing the probe handler process 124 (details will be described later with reference to FIG. 11), the memory saving fault process 125 for performing dump acquisition (details will be described later with reference to FIG. 12), and the dump writing process 126 (details will be described later with reference to FIG. 13).

The host cluster program 112 includes a probe insertion program 130 (details will be described later with reference to FIG. 4) for inserting a probe (also referred to as probe code) into guest machine A 140A, and a guest machine failure detection program 131 for detecting that a failure has occurred in guest machine A 140A.

There are the following two methods for the host machine 110 to detect a failure of the guest machine A 140A using the guest machine failure detection program 131.

In pattern 1, the guest OS 150A executes the probe code inserted into it by the probe insertion program 130, and guest machine A 140A thereby notifies the host machine 110 of the failure, so that the host machine 110 detects the abnormality.

In pattern 2, the host machine 110 performs a liveness check such as a heartbeat on guest machine A 140A at fixed intervals, and detects an abnormality when the check fails (there is no response from guest machine A 140A).
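Pattern 2 can be illustrated with a minimal sketch. The class `GuestMonitor`, its methods, and the timeout value are hypothetical names chosen for illustration; the description does not specify the heartbeat interval or protocol.

```python
from dataclasses import dataclass, field


@dataclass
class GuestMonitor:
    """Hypothetical host-side liveness tracker (not taken from the description)."""
    timeout: float  # seconds without a response before a guest is treated as failed
    last_seen: dict = field(default_factory=dict)

    def record_response(self, guest_id: str, now: float) -> None:
        # Called whenever a guest answers a heartbeat.
        self.last_seen[guest_id] = now

    def failed_guests(self, now: float) -> list:
        # A guest whose last response is older than the timeout is reported.
        return sorted(g for g, t in self.last_seen.items() if now - t > self.timeout)


monitor = GuestMonitor(timeout=3.0)
monitor.record_response("140A", now=0.0)
monitor.record_response("140B", now=0.0)
monitor.record_response("140B", now=4.0)  # only guest B keeps responding
failed = monitor.failed_guests(now=5.0)
```

In this sketch only guest 140A is reported at time 5.0, because guest 140B answered within the assumed timeout.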

FIG. 3 is a diagram illustrating a configuration example of the probe point management table 120. In the probe point management table 120, pairs of probe point addresses and scenario numbers are registered.

The probe point address stores a memory address value on the guest machine A140A where the probe is inserted. The probe point is set for each subsystem (virtual functional configuration) of the guest machine.

The virtualization mechanism 113 of the host machine 110 has a conversion table between memory addresses on the host machine and addresses on the guest machine. When the host machine 110 executes the probe insertion program 130, it uses the conversion table to convert the probe point address specified in the probe point management table 120 into a memory address on the host machine 110, and then inserts the probe into guest machine A 140A.

The scenario number stores an identification number of a scenario managed by a dump scenario management table 121 described later.

In this embodiment, the probe point management table 120 stores the probe insertion address and scenario number in advance. After the guest machine A 140A loads the guest OS 150A onto the memory, the guest OS 150A notifies the host OS 111 of the end of loading, and stops operating until a notification is received from the host machine 110. The host OS 111 that has received the notification of completion of loading refers to the probe point management table 120, executes the probe insertion program 130, and performs probe insertion processing. After the probe insertion process is completed, the guest OS 150A resumes operation.

Note that the probe insertion process can also be performed on the guest machine A 140A, and the probe insertion program 130 and the dump scenario management table 121 may be provided on the guest OS 150A of the guest machine A 140A. In this case, after the guest machine A 140A loads the guest OS 150A, the probe insertion program 130 is executed to perform the probe insertion process.

However, when guest machine A 140A performs the probe insertion process itself, it executes the probe insertion program 130 without stopping its own operation. Therefore, if an error occurs before the probe insertion process is completed, a dump cannot be taken.

Further, the probe handler process described later includes processing that is executed by transferring control from the guest OS 150A to the host OS 111. Therefore, when guest machine A 140A performs the probe insertion process, control transitions between the guest OS 150A and the host OS 111 occur frequently, resulting in overhead.

When the probe point management table 120 is updated while the system is operating, the host machine 110 can execute the probe insertion program 130 in response to the update and thereby insert a new probe into the guest machine while the system is running.

FIG. 4 is a diagram for explaining processing for inserting a probe in accordance with the description of the probe point management table 120. This process is performed by executing the probe insertion program 130.

As described above, the probe insertion program 130 inserts the probe into the guest machine according to the description of the probe point management table 120 (S401). The probe point described in the probe point management table 120 is, specifically, a memory address on guest machine A 140A managed by the guest OS 150A, as described above. The probe insertion program 130 on the host OS 111 converts the probe point from the guest machine address to the host machine address using the host machine address / guest machine address conversion table, and inserts the probe at that address. When there are a plurality of guest machines 140, the address value on the host machine of the probe point is obtained for each guest machine, and the probe is inserted at the obtained address for each guest machine.

When there are a plurality of probes described in the probe point management table, the insertion process is repeatedly executed until all the probes are inserted (S402).
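The insertion loop of S401 and S402 might look roughly like the following sketch. It assumes that a probe is a single-byte x86 breakpoint instruction (the description later states that executing a probe raises an int3 interrupt) and that each guest's memory is a contiguous window in host memory; the table contents, the `insert_probes` function, and the translation lambdas are hypothetical.

```python
INT3 = 0xCC  # x86 breakpoint opcode, assumed here as the probe code

# Hypothetical probe point management table: (guest address, scenario number).
PROBE_POINTS = [(0x1000, 1), (0x2040, 2)]


def insert_probes(probe_points, guest_to_host, host_memory):
    """Insert a probe at every table entry for every guest.

    guest_to_host maps a guest id to a function translating a guest
    address into a host address (standing in for the conversion table).
    Returns the overwritten bytes so they could be restored later.
    """
    saved = {}
    for guest_id, translate in guest_to_host.items():
        for guest_addr, _scenario in probe_points:  # repeat for all entries (S402)
            host_addr = translate(guest_addr)        # guest address -> host address
            saved[(guest_id, guest_addr)] = host_memory[host_addr]
            host_memory[host_addr] = INT3            # insert the probe (S401)
    return saved


host_memory = bytearray(0x10000)
# Assumed layout: each guest occupies a contiguous window of host memory.
guest_to_host = {
    "140A": lambda a: 0x4000 + a,
    "140B": lambda a: 0x8000 + a,
}
saved = insert_probes(PROBE_POINTS, guest_to_host, host_memory)
```

Keeping the overwritten bytes is a design choice of this sketch, not something the description requires; it would allow a probe to be removed again.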

FIG. 5 is a diagram illustrating a configuration example of the dump scenario management table 121. In the dump scenario management table 121, dump locations are registered for each scenario number. A dump location is information specifying a memory area that is a dump acquisition target (hereinafter also referred to as a dump target); at least one data area in the kernel memory space used by a kernel subsystem, within the memory space of the host OS 111, is specified as the dump target area. When a failure is detected because a probe inserted by the host machine 110 has been executed on the guest machine, the host machine 110 refers to the probe point management table 120 and obtains the scenario number corresponding to the probe point at which the probe was inserted. It then refers to the dump scenario management table 121 and sets the memory area specified by that scenario number as the dump target area.

In the present embodiment, the dump target area is a memory area reserved for the host machine 110. When a guest machine fails, the processing of that guest machine stops, but other guest machines configured on the same host machine can continue to operate. If the memory area for the host machine 110 is then rewritten by the processing of a guest machine that continues to operate, the data that was stored in the host machine memory when the failure occurred can no longer be recovered and cannot be used for analysis. To solve this problem, this embodiment provides a technique for saving the data stored in that area to the save memory section 160 before the memory area for the host machine 110 is rewritten by the processing of another guest machine that continues to operate.

Note that dumping of the memory area reserved for the guest machine A 140A is performed by kdump (see Non-Patent Document 2), which is a conventional dump technology.

The host machine 110 may detect a failure of guest machine A 140A by means other than the execution of a probe by guest machine A 140A (for example, by the method of pattern 2 described above). In this case, it is desirable to dump all data areas in the kernel memory space used by the kernel subsystems in the memory space of the host machine 110. Therefore, "no applicable scenario" is prepared as the scenario for this case. When the identification number meaning "no applicable scenario" is selected as the scenario number, the dump target area is the entire data area of the kernel memory space, and the code area is excluded from the dump target. This is because the data area is changed by system operation, whereas the code area does not change.

According to the probe point management table 120, the scenario number is selected depending on which kernel subsystem (virtual functional configuration of the guest machine) the bug function (probe code) was executed in. However, a failure does not always involve the execution of a bug function. In that case, the host machine 110 cannot know which subsystem guest machine A 140A was executing when it failed. Therefore, when the host machine 110 detects a failure of the guest OS 150A by means other than the execution of a probe inserted into the guest OS 150A by the probe insertion program 130, the case is treated as "no applicable scenario" and the data areas used by all subsystems are dumped.
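The two-step lookup and its fallback can be sketched as follows. The table contents, subsystem names, address ranges, and the value chosen for `NO_SCENARIO` are hypothetical; only the lookup order (probe point to scenario number, scenario number to dump locations, otherwise the whole kernel data area) follows the description.

```python
NO_SCENARIO = 0  # hypothetical identifier meaning "no applicable scenario"

# Hypothetical tables; subsystem names and address ranges are illustrative only.
PROBE_POINT_TABLE = {0x1000: 1, 0x2040: 2}  # probe address -> scenario No.
DUMP_SCENARIO_TABLE = {                      # scenario No. -> dump locations
    1: [("network", 0x7000, 0x2000), ("memory", 0xA000, 0x1000)],
    2: [("memory", 0xA000, 0x1000)],
}
# Entire kernel data area; the code area is deliberately not listed.
KERNEL_DATA_AREA = [("kernel-data", 0x6000, 0x8000)]


def dump_targets(reported_addr):
    """Return (scenario number, dump target areas) for a detected failure.

    reported_addr is None when the failure was detected without a probe,
    for example by a missed heartbeat.
    """
    scenario = PROBE_POINT_TABLE.get(reported_addr, NO_SCENARIO)
    if scenario == NO_SCENARIO:
        return NO_SCENARIO, KERNEL_DATA_AREA
    return scenario, DUMP_SCENARIO_TABLE[scenario]


probe_case = dump_targets(0x1000)
heartbeat_case = dump_targets(None)
```

An address that matches no registered probe point falls through to the same "no applicable scenario" branch as a failure detected without a probe.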

FIG. 6 is a diagram illustrating a configuration example of the dump area management table 122. The dump area management table 122 holds a dump factor, which stores the identification information of the guest machine in which the failure occurred among the operating guest machines; a scenario number, which stores the scenario number associated in the probe point management table 120 with the address of the probe executed when the guest machine 140 reported the failure; and a dump time, which stores the time at which the failure occurred.

The scenario identification information (scenario No.) stored as the scenario number holds pointers to the dump location, which indicates the dump target area specified by the scenario, and to the dump destination offset, which indicates the memory area where the dump is stored. The dump location is equivalent to the dump location of the dump scenario management table 121. The dump destination offset is a value representing the difference from the head address of the save memory partition 160, and data is saved in the memory area starting at the address indicated by the dump destination offset.

Note that the value of the dump destination offset is set so that the address value obtained by adding the data area length represented by the dump location to the address value of the dump destination offset falls within the area of the save memory partition 160.
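The constraint on the dump destination offset amounts to a simple bounds check, sketched below. The function `next_dump_offset` and the sequential allocation policy are assumptions made for illustration; the description only requires that the offset plus the data area length stay inside the save memory partition.

```python
def next_dump_offset(used, area_length, partition_size):
    """Return the dump destination offset for an area of area_length bytes.

    used is the number of bytes already occupied in the save memory
    partition (sequential allocation is an assumption of this sketch).
    Raises MemoryError if offset + length would leave the partition.
    """
    offset = used
    if offset + area_length > partition_size:
        raise MemoryError("dump area does not fit in the save memory partition")
    return offset


first = next_dump_offset(0, 4096, 8192)
second = next_dump_offset(first + 4096, 4096, 8192)
```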

FIG. 7 is a diagram illustrating an example of dump locations in the memory area of the host machine 110 when a failure occurs in the guest machine A 140A.

The dump locations are described using an example in which guest machine A 140A executes a probe inserted in a code section that runs when a fatal bug exists in the network subsystem in its kernel space (for example, the BUG_ON() function used in the Linux OS), and the host machine 110 thereby detects the failure.

When a failure occurs in the memory area 702 used by the network subsystem of guest machine A 140A, the memory area used by the network subsystem of the host machine 110, and more specifically the contents of the data area 703a within it, becomes important. This is because, for example, when guest machine A 140A transmits a packet, the processing is performed via the network subsystem of the host machine 110.

In addition, since the memory subsystem is always in use, it is desirable to include the memory subsystem as a dump target location regardless of the subsystem in which the failure occurred.

Therefore, if a failure occurs in the network subsystem of guest machine A 140A, the data area 703a of the network subsystem and the data area 703b of the memory subsystem of the host machine 110 are set as dump target locations. Conventional dumping provides no way to dump a specific part of the host machine according to the part of the guest machine that failed. The entire data area used by the host machine is dumped, so unneeded parts are dumped as well, and the dump processing imposes overhead on the entire system.

On the other hand, with this method, the dump target location can be kept to the minimum necessary, and the overhead due to dump processing can be minimized.

Although FIG. 7 illustrates the case where a failure occurs in the memory area 702 used by the network subsystem of the guest machine, the same applies when a failure occurs in the usage area of another subsystem.

FIG. 8 is a diagram illustrating a configuration example of the dump request bitmap table 123. The dump request bitmap table 123 stores the page numbers (page frame numbers) of all pages used by the host machine 110, together with dump request flag information indicating whether each page is requested to be dumped (is a dump acquisition target). Here, a page is the unit in which the host machine 110 manages the memory area it uses. That is, the host machine 110 manages the memory area it uses as a collection of pages.

In this embodiment, the dump target area is write-protected, and the page fault exception raised by a write request to the write-protected area triggers a process similar to the conventional COW (copy-on-write) technique, by which the dump is obtained (details will be described later).

UNIX-like operating systems employ the COW method for memory management, so page fault exceptions occur frequently during normal operation.

Therefore, when a page fault exception occurs, the dump request flag is used to judge whether the page was write-protected in order to acquire a dump or for other memory management reasons.

In the present embodiment, the value of the dump request flag is 1 when there is a dump request (when the page is a dump acquisition target), and 0 when there is no dump request (when the page is not a dump acquisition target).

FIG. 9 is a diagram for explaining the processing flow of the entire nonstop dump.

The guest machine failure detection program 131 of the host cluster program 112 detects an error in guest machine A 140A based on the notification from guest machine A 140A, which has executed the probe (S501).

The host OS 111 analyzes the failure information provided by guest machine A 140A and determines whether the failure was detected through the execution of a probe (S502). Specifically, the information notified from guest machine A 140A is the head address of the bug code in the memory area managed by guest machine A 140A. On receiving it, the host machine 110 calculates the corresponding memory address on the host machine 110 using the conversion table between memory addresses on the host machine 110 and memory addresses on guest machine A 140A, and determines from the probe point management table 120 which probe was executed. When the address does not correspond to any entry in the probe point management table 120, it is determined that no scenario in the dump scenario management table 121 applies.

In the case of failure detection due to execution of a probe, the host OS 111 executes probe handler processing 124, and first identifies which probe was executed on the guest machine A 140A from the failure information.

The host OS 111 then refers to the dump scenario management table 121 and identifies the dump location (that is, the dump target area) associated with the scenario number that corresponds to the probe point of this probe. Next, the host OS 111 prohibits writing to the part of the host machine memory that includes the identified dump location (S504). Specifically, writing is prohibited for all pages that include the identified dump location (dump target area).

If it is determined in S502 that the failure detection is triggered by something other than the execution of the probe, the entire data area in the memory area used by the host machine 110 is write-protected (S503). Specifically, writing is prohibited for all pages included in the data area.
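Because protection is applied per page, a dump location given as a byte range has to be widened to every page it overlaps. The following sketch shows that calculation; the 4096-byte page size, the function names, and the example ranges are assumptions, and a Python set stands in for the page table protection bits.

```python
PAGE_SIZE = 4096  # assumed page size; the description does not state one


def pages_covering(start, length, page_size=PAGE_SIZE):
    """Return the page frame numbers of every page that overlaps
    the byte range [start, start + length)."""
    if length <= 0:
        return []
    first = start // page_size
    last = (start + length - 1) // page_size
    return list(range(first, last + 1))


def protect(dump_locations, write_protected):
    # Write-protect each page that contains any part of a dump location.
    for start, length in dump_locations:
        write_protected.update(pages_covering(start, length))
    return write_protected


# The first range straddles a page boundary, so two pages are protected for it.
protected = protect([(0x7F00, 0x200), (0xA000, 0x1000)], set())
```

For the whole-data-area case of S503, the same helper would simply be called with the single range that describes the data area.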

When a write request is issued for a page that has been write-protected by the processing of S503 or S504, the MMU (memory management unit) included in the computer system 10 generates a page fault exception.

This write request is issued not by the failed guest machine A 140A but by another guest machine such as guest machine B 140B. Normally, guest machine A 140A, in which the failure occurred, does not issue write requests because its processing is interrupted after the failure. However, another guest machine B 140B running on the same host machine 110 as guest machine A 140A can continue its processing regardless of the failure of guest machine A 140A. Therefore, a write request to the host machine memory area may be issued by the other guest machine B 140B.

Normally, when a page fault exception occurs under the COW (copy-on-write) method, the kernel allocates a new physical page, stores the data to be written there, and changes the mapping of the written area so that it corresponds to the new physical page.

However, in this embodiment, when a page fault exception occurs, the host OS 111 executes the memory saving fault process 125. By the memory saving fault process 125, first, a physical page is secured in the saving memory partition 160, and the data currently stored in the page to be written is saved (S505). After that, the page to be written, which has been set to write-inhibited by the processing of S503 or S504, is changed to writable, and the data of the page is updated according to the write request (S506).

The data saved in the save memory partition is the data that was stored in the memory area of the host machine 110 when the error occurred. By using this data as the dump, a dump can be obtained without stopping the system (or the processing of the other guest machines).

FIG. 10 is a diagram for explaining the error detection method in S501 of FIG. 9 and the flow of the probe handler process 124 that occurs after detection. As a premise of this process, it is assumed that the probe point insertion process of FIG. 4 has already been executed and a probe has been inserted into the guest machine A140A.

When the probe inserted into the guest machine A 140A is executed (S507), an int3 interrupt is generated, and control is transferred to the host machine 110 (S508). Upon receiving this interrupt, the host OS 111 on the host machine 110 calls a probe handler (S509), and executes a probe handler process 124 described later (S510). After the execution of the probe handler process 124 is completed and the control right is returned from the probe handler (S511), the control returns to the guest OS (S512).

FIG. 11 is a diagram for explaining the probe handler process 124 executed in S510 of FIG. The moment of entering S510 (that is, the moment when the probe handler process 124 is trajected) is the state where the CPU core executing the guest OS 150A has reached the probe point, that is, in the memory area of the guest OS 150A executing the core. An error has occurred.

However, in the multi-core processor, according to the description of the dump scenario management table 121 in S514 of the probe handler process, a specific memory area is prohibited from being written, and a process of setting the dump request flag of the corresponding dump request bitmap table 123 is executed. In the middle, another CPU core executes another guest OS 140B, and writing may occur in the specific memory area. As a result, there is a possibility that the memory contents at the time of the error cannot be saved and the memory contents are overwritten by writing.

Therefore, in the probe handler process 124, all the cores other than the core that executed the probe are first temporarily put to sleep (S513). Next, the scenario identification number (scenario No) associated with the probe point of the probe executed in S507 is identified from the probe point management table 120. Then, referring to the dump scenario management table 121, the dump location linked to the scenario identification number is identified, the page including this dump location (dump target area) is identified, and this page is write-protected.

Then, “1” is set in the dump request flag of the dump request bitmap table 123 for the write-protected page (S514).
Finally, all the cores that were put to sleep in S513 (that is, all the cores other than the core performing the probe handler process) are woken from sleep (S515). Thereby, the preparation for executing the nonstop dump is completed.
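
The sequence S513 to S515 can be illustrated with a minimal sketch. It is not the implementation of the embodiment: the class `HostState`, the function `probe_handler`, and the use of plain dictionaries and sets in place of the tables 120, 121 and 123 are all assumptions made for illustration, and sleeping a core is modeled only as membership in a set.

```python
from dataclasses import dataclass, field


@dataclass
class HostState:
    """Hypothetical stand-in for the host-side tables (not from the source)."""
    probe_point_table: dict      # probe point -> scenario No (table 120)
    dump_scenario_table: dict    # scenario No -> list of page numbers (table 121)
    dump_request_bitmap: dict = field(default_factory=dict)  # page -> 0/1 (table 123)
    write_protected: set = field(default_factory=set)        # write-protected pages
    sleeping_cores: set = field(default_factory=set)
    log: list = field(default_factory=list)


def probe_handler(state, probe_point, current_core, all_cores):
    """Prepare a nonstop dump for the probe that was just executed."""
    # S513: put every core except the one that executed the probe to sleep,
    # so that no other guest can write while protection is being set up.
    state.sleeping_cores = set(all_cores) - {current_core}
    state.log.append("S513")

    # S514: probe point -> scenario No -> pages; write-protect each page
    # and set its dump request flag to 1.
    scenario_no = state.probe_point_table[probe_point]
    for page in state.dump_scenario_table[scenario_no]:
        state.write_protected.add(page)
        state.dump_request_bitmap[page] = 1
    state.log.append("S514")

    # S515: wake the cores that were put to sleep in S513.
    state.sleeping_cores.clear()
    state.log.append("S515")
    return scenario_no
```

In this sketch the order of the three steps is the point: protection and flag setting happen only while the other cores are asleep, which is what closes the window described for the multi-core case.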

FIG. 12 is a diagram for explaining the details of the memory saving fault process 125. When the host machine 110 accepts a data write request and the write target page is in a write prohibited state, the memory saving fault process 125 is executed. As described above, the write request is issued from a guest machine other than the guest machine in which the failure has occurred.

When there is a write to the write-protected area (S516), the host OS 111 refers to the dump request bitmap table 123 and determines whether the dump request flag “1” is set for the write target page (S517).

If Yes, that is, if the dump request flag “1” is set, the host OS determines that the data stored in the page is a dump target, and saves the data stored in the write target page by copying the page to the saving memory partition 160 (S520).

Further, the dump area management table 122 is created. The identification information of the guest machine 140 that executed the probe is registered in “dump factor”, and the identification number of the scenario identified in the above process is registered in “scenario No”. Furthermore, the current time information is written as “dump time”, the dump source page identification information as “dump location”, and the start address information of the copy destination area (the offset of that start address from the start address of the saving memory partition 160) as “dump destination offset” (S521).

After saving the data, the host OS clears (sets to “0”) the dump request flag in the dump request bitmap table 123 corresponding to the saved page (S522), makes the page writable, and writes the data to the page according to the write request so that the written contents are reflected (S523).

If No, that is, if the dump request flag is “0”, the host OS determines that the page is not a dump target. In this case, the host OS executes normal page fault processing. For example, if the host OS is an operating system that uses the COW (copy-on-write) method, the data stored in the write-protected area is copied to a newly secured page frame, and the newly secured page frame is made writable and the data is written to it (S519).

By this memory saving fault processing, the memory contents immediately after the error occurrence (data stored in the page) are dumped in the saving memory partition 160.
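
The branch on the dump request flag (S517) and the save-then-write order (S520 to S523) can be sketched as follows. This is an illustrative model under stated assumptions, not the embodiment's code: the function `memory_saving_fault`, the dictionary `host_memory`, the `bytearray` standing in for the saving memory partition 160, and the 4096-byte `PAGE_SIZE` are all hypothetical. The normal page fault path (S519) is reduced to a plain write.

```python
import time

PAGE_SIZE = 4096  # assumed page size; the source does not state one


def memory_saving_fault(page_no, new_data, host_memory, dump_request_bitmap,
                        write_protected, save_partition, dump_area_table,
                        dump_factor, scenario_no):
    """Handle a write to a write-protected page. Returns True if the old
    page contents were saved before the write was applied."""
    if dump_request_bitmap.get(page_no, 0) == 1:            # S517: flag set?
        offset = len(save_partition)
        save_partition.extend(host_memory[page_no])         # S520: copy old data
        dump_area_table.append({                            # S521: record entry
            "dump factor": dump_factor,
            "scenario No": scenario_no,
            "dump time": time.time(),
            "dump location": page_no,
            "dump destination offset": offset,
        })
        dump_request_bitmap[page_no] = 0                    # S522: clear flag
        saved = True
    else:
        # S519: normal page fault processing (for example COW) would run
        # here; this sketch simplifies it to the plain write below.
        saved = False
    write_protected.discard(page_no)                        # make page writable
    host_memory[page_no] = new_data                         # S523: apply write
    return saved
```

Because the flag is cleared in S522, a second write to the same page takes the No branch, so each dump target page is copied to the saving area at most once.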

FIG. 13 is a diagram for explaining the details of the dump writing process 126, which writes the dump saved in the saving memory partition 160 by the memory saving fault process 125 to the external storage device 200.

In the present embodiment, the dump is written to the external storage device 200 connected using the communication interface 180 provided in the computer system 10.

First, the host OS identifies the scenario number from the dump area management table 122 (S524). Then, the host OS refers to the dump scenario management table 121 and identifies the dump location associated with the scenario No acquired in S524, thereby identifying the page including the dump location (S525).

Next, the host OS reads the value of the dump request flag in the dump request bitmap table 123 for the page identified in S525 (S526), and determines whether the dump request flag is set (that is, whether the value of the flag is “1”) (S527).

In this embodiment, a specific area is write-protected based on the scenario, and the data stored in a page is dumped to the save memory partition 160, page by page, when a write request is made for that page. If there is no write request, the data of the page is not dumped to the save memory partition 160.

Therefore, a page for which no write request has yet been made, and which has therefore not been dumped to the save memory partition 160, needs to be written directly from the corresponding page of the host machine memory to the external storage device 200. The value of the dump request flag is checked to determine whether this is the case.

When the dump request flag is set, the page has not been dumped to the save memory partition 160. Therefore, as described above, the data is written from the page in the host machine memory to the external storage device 200, writing to the area is enabled (S529), and the dump request flag in the dump request bitmap table 123 is cleared (set to “0”) (S530).

When the dump request flag is not set, a write request has already been generated for the page, and the data of the page has been dumped to the save memory partition 160. Therefore, the data is written out from the save memory partition 160 to the external storage device 200.

This process is performed for all pages specified in S525 (S531), and the process ends when data is written to the external storage device 200 for all pages.
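
The per-page choice of data source in the dump writing process can be sketched as below. The sketch is an assumption-laden model, not the embodiment itself: the function `write_dump` and its parameters are hypothetical, the external storage device 200 is modeled as a returned dictionary, the scenario No is passed in as if already obtained in S524, and dump area entries are looked up by their “dump location” field.

```python
def write_dump(scenario_no, dump_scenario_table, dump_request_bitmap,
               dump_area_table, host_memory, save_partition,
               write_protected, page_size=4096):
    """Collect the dump for one scenario. Returns a dict mapping each page
    number to the bytes written out for it (a stand-in for device 200)."""
    external_storage = {}
    # S525: pages that contain the dump locations of this scenario.
    for page_no in dump_scenario_table[scenario_no]:        # S531: all pages
        if dump_request_bitmap.get(page_no, 0) == 1:        # S526, S527
            # Flag still set: nobody wrote to the page, so host memory
            # still holds the contents from the time of the error.
            external_storage[page_no] = bytes(host_memory[page_no])  # S529
            write_protected.discard(page_no)                # enable writing
            dump_request_bitmap[page_no] = 0                # S530
        else:
            # Flag cleared: the page was overwritten after being saved,
            # so the saved copy is read from the saving area instead.
            entry = next(e for e in dump_area_table
                         if e["dump location"] == page_no)
            off = entry["dump destination offset"]
            external_storage[page_no] = bytes(save_partition[off:off + page_size])
    return external_storage
```

Either way the bytes written out are those the page held at the time of the failure, which is why the flag alone is enough to decide where to read from.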

The data writing destination in the dump writing process 126 is not limited to the external storage device 200 but may be the save memory partition 160. In this case, the processing of S528 is eliminated, and the external storage device 200 of S529 is changed to the save memory partition 160.

Although the invention has been described concretely above based on the embodiment, the invention is not limited to it and can be modified in various ways without departing from its gist.
In particular, the system configuration uses a virtual environment as an embodiment, but can be applied to a non-virtual environment by replacing a guest machine with a process.

10 Computer system
110 Host machine
111 Host OS
112 Host cluster program
120 Probe point management table
121 Dump scenario management table
122 Dump area management table
123 Dump request bitmap table
124 Probe handler process
125 Memory saving fault process
126 Dump writing process
130 Probe insertion program
131 Guest machine failure detection program
140 Guest machine
150 Guest OS
160 Memory partition for saving
170 Processor
180 Communication interface

Claims

A computer system comprising a plurality of guest machines, each of which is a virtual computer having an operating system,
and a host machine that has an operating system different from those of the plurality of guest machines and controls the plurality of guest machines, the computer system further comprising:
a processor that executes the operating system of each of the plurality of guest machines and the operating system of the host machine, and a memory, wherein
the memory includes a host machine memory area allocated to the host machine, and a data saving memory area,
the data stored in the host machine memory area is updated by processes executed by the guest machines,
when a failure occurs in one of the guest machines, the host machine sets the area in the host machine memory area associated with the failure as a dump acquisition area and prohibits writing to the dump acquisition area, and
when a write request for the dump acquisition area is issued by a process of a guest machine other than the guest machine in which the failure has occurred, the host machine copies the data stored in the dump acquisition area to the data saving memory area, and then permits writing to the dump acquisition area and writes data according to the write request.
The computer system according to claim 1, wherein
the host machine inserts, for each functional configuration of the guest machine, a probe code to be executed by the guest machine when a failure occurs in that functional configuration, and
when the probe code is executed by the guest machine, the host machine sets, based on a notification from the guest machine, the area in the host machine memory area that the host machine manages in association with the probe code as the dump acquisition area.
The computer system according to claim 2, wherein
the area in the host machine memory area managed by the host machine in association with the probe code is an area in the host machine memory area accessed by the functional configuration of the guest machine whose failure triggers execution of the probe code.
The computer system according to claim 2, wherein
when the host machine detects a failure of the guest machine by means other than execution of the probe code by the guest machine, the host machine sets the entire data area in the host machine memory area as the dump acquisition area.
The computer system according to claim 1, wherein
the host machine manages the host machine memory area as a plurality of pages,
when a failure occurs in the guest machine, the host machine prohibits writing to the plurality of pages including the dump acquisition area,
when a write request to a page for which writing has been prohibited is issued by a process of a guest machine other than the failed guest machine, the host machine copies the data stored in the page to the data saving memory area, and then permits writing to the page and updates the page according to the write request, and
the host machine records the start address of the copy destination and the identification number of the page.
The computer system according to claim 5, wherein
upon receiving a dump write request, the host machine checks, for each of the plurality of pages including the dump acquisition area, whether the identification number of the page is recorded in association with the start address of a copy destination; when the identification number is recorded in association with the start address of the copy destination, the host machine reads data corresponding to the size of the page from the start address; when the identification number of the page is not recorded in association with the start address of a copy destination, the host machine reads data from the page in the host machine memory area; and the host machine writes the read data to a recording medium.
The computer system according to claim 1, wherein
the host machine sends a signal for alive monitoring to a guest machine and detects a failure of the guest machine when a response to the signal is not received.
The computer system according to claim 2, wherein
the processor is a multi-core processor having a plurality of core processors,
when the host machine prohibits writing to the dump acquisition area, the host machine stops the processing of the core processors other than the core processor that executed the probe code and then prohibits writing to the dump acquisition area, and
the host machine resumes the processing of the stopped core processors after writing to the dump acquisition area has been prohibited.