WO2010116405A1

WO2010116405A1 - Calculation system provided with nonvolatile main memory

Info

Publication number: WO2010116405A1
Application number: PCT/JP2009/001589
Authority: WO
Inventors: 顕義橋本; 崇仁中村; 正法高田; 政弘新井; 健太郎島田; 淳江端
Original assignee: 株式会社日立製作所
Priority date: 2009-04-06
Filing date: 2009-04-06
Publication date: 2010-10-14

Abstract

A mismatch between the state of a nonvolatile main memory and the state of a peripheral device is prevented. An operating system (OS) detects a transition to a stable state, which is a state in which no command transmitted to the peripheral device is pending, and stores stable state information in the main memory. The stable state information includes the data stored in the main memory in the stable state and the instruction address at the time of the transition to the stable state.

Description

Computer system with non-volatile main memory

The present invention relates to a computer system having a processor and a main memory.

Conventionally, in the field of computer systems, techniques for preventing data from being corrupted by an unexpected power failure have been studied. The most widespread is transaction processing, which is disclosed in detail in Patent Document 1. There are various ways to implement transaction processing, but the key is journaling. A journal is a history of data updates. For example, when data stored in the secondary storage is updated in multiple steps, the update history (journal) is first written to the secondary storage, and then the data is updated and written to the secondary storage. Finally, the success of the update is recorded (this record is called a “checkpoint”). With this method, even if the update is interrupted by a power failure, the journal remains, so the state before the start of the update can be restored, and the same update can then simply be performed again. In this way, consistency is guaranteed at the level of files stored in the secondary storage.
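The journaling sequence described above can be illustrated with a minimal Python sketch. It is not taken from Patent Document 1; the class `JournaledStore` and its method names are hypothetical, and the journal here simply records the pre-update values so that an interrupted update can be rolled back.

```python
_MISSING = object()  # marks a key that did not exist before the update


class JournaledStore:
    """Toy model of secondary storage protected by a journal (illustrative only)."""

    def __init__(self, data=None):
        self.data = dict(data or {})  # persistent data
        self.journal = None           # persistent journal; None means "no update in flight"

    def begin_update(self, keys):
        # Step 1: write the journal (pre-update values) before touching the data.
        self.journal = {k: self.data.get(k, _MISSING) for k in keys}

    def write(self, key, value):
        # Step 2: update the data itself.
        self.data[key] = value

    def checkpoint(self):
        # Step 3: record that the update succeeded; the journal is no longer needed.
        self.journal = None

    def recover(self):
        # After a power failure: a surviving journal means the update was interrupted,
        # so restore the state before the update started. Returns True if it rolled back.
        if self.journal is None:
            return False
        for key, old in self.journal.items():
            if old is _MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.journal = None
        return True
```

After `recover()` returns `True`, the caller would perform the same multi-step update again from the beginning.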

Once consistency at the file level had been assured in this way, demand began to grow for saving the system state (the state at the system level), including intermediate results of operations in the main memory, and for keeping that system state consistent. One example is Non-Patent Document 1, which defines the power states of a computer system. Among them is the “S4 state”, in which the CPU context, the contents of the main memory (the data stored in the main memory), and so on are all stored in the secondary storage and most components are powered off. When an operation such as pressing a key on the keyboard is performed, the computer system returns from the “S4 state” to the “operating state G0”, and its state returns to the state immediately before the transition to the “S4 state”. This technique is called “hibernation”. However, because this technique stores the entire contents of the main memory in the secondary storage, its execution time has grown as main memory capacity has increased. As a result, it cannot be executed frequently, and recovery from a power failure takes a long time. This makes hibernation effectively unusable as a technique for maintaining system state consistency. Furthermore, because hibernation is originally intended to save power, the stored system state is erased when the operating state is restored. Therefore, even if recovery after one power failure is possible, the system state that should be restored has already been erased when the next power failure occurs, so the system cannot be restored to the state before that failure. Hibernation therefore cannot serve as a safeguard against power failures.

On the other hand, there is also a technique that preserves the system state and prepares for a power failure by using “logical partitioning”. This is the technique disclosed in Patent Document 2. Logical partitioning is a technique in which software called a “VMM (Virtual Machine Monitor)”, running on a computer system, makes a single computer system appear to operate as multiple virtual computer systems. A virtual computer generated by the VMM is called a “VM (Virtual Machine)”. An OS (Operating System) running on a VM cannot distinguish the VM from a physical computer, because the VMM generates virtual hardware in the main memory. The VMM can therefore save the state of the VM held in the main memory to the secondary storage regardless of the state of the VM. The state of the VM stored in the secondary storage is hereinafter referred to as a “VM state file”. The VMM can store any number of VM state files at any time, so it is possible to return to the VM state at any point in the past. Patent Document 2 can thus be said to apply logical partitioning to prepare for a power failure.

However, the logical partitioning technology has a problem of performance degradation. Specifically, the following examples exist.
(1) When an OS or application accesses hardware (here, virtual hardware implemented by the VMM) or tries to execute a privileged instruction of the CPU, the CPU raises an exception and transfers processing to the VMM. The VMM emulates the hardware behavior and returns processing to the OS or application. Because the VMM software runs for every such emulation, the performance degradation is significant. In addition, such hardware accesses by the OS and applications occur frequently, which further increases the degradation.
(2) The VMM saves the VM state held in the main memory to the secondary storage. The overhead of doing so is also large.

As described above, the problem of Patent Document 2 is performance deterioration during normal operation and system state storage.

The root of these problems is that, in current computer systems, the only nonvolatile medium is the secondary storage. To solve this, research is underway on nonvolatile semiconductor devices that can be used as main memory. A typical example is Patent Document 3. The semiconductor device described in Patent Document 3 is called MRAM (Magnetic Random Access Memory). MRAM is a semiconductor device with three characteristics: it can be randomly accessed, it is nonvolatile, and its response time is shorter than that of DRAM (Dynamic RAM). Therefore, if the degree of integration of MRAM improves in the future, it may come to be used as main memory. MRAM is expected to solve the problems related to preserving the state of a computer system described so far.

U.S. Pat. No. 7,673,330
U.S. Pat. No. 6,795,966
U.S. Pat. No. 7,147,472

However, the above problem cannot be solved only by adopting a nonvolatile and randomly accessible semiconductor device such as MRAM as the main memory. The reason will be described below.

As shown in FIG. 8, the OS sends a command (denoted as “cmd” in the figure) to the secondary storage (4516). The secondary storage (4516) executes the command and reports the command execution completion to the OS.

Generally, the response time of the secondary storage (4516) is very long compared with the processing time of the CPU (4501). For this reason, the OS normally executes other processing while the secondary storage (4516) executes the command.

Therefore, as shown in FIG. 8, the OS registers each command for which a command execution completion report has not yet been received from the secondary storage (4516) in the command completion waiting queue (805) in the nonvolatile main memory (4502). In FIG. 8, the OS has registered cmd0 (806) and cmd1 (807) in the command completion waiting queue (805).

Suppose that a power failure occurs in the state shown in FIG. 8, and that the computer system (4500) is then restarted. At this time, the state of the computer system is as shown in FIG. That is, since the main memory (4502) is nonvolatile, cmd0 (806) and cmd1 (807) remain registered by the OS in the command completion waiting queue (805). In the secondary storage (4516), on the other hand, cmd0 (806) and cmd1 (807) received from the OS have been lost due to the power failure. The secondary storage (4516) therefore never transmits a command execution completion report, yet the OS keeps waiting for the completion reports of cmd0 (806) and cmd1 (807). Since the reports never arrive, the OS determines that a failure has occurred in the secondary storage (4516).

As described above, if the main memory is merely made nonvolatile, a power failure or the like may cause a mismatch between the state of the peripheral device (for example, the secondary storage) and the state of the main memory. For this reason, the computer system cannot resume operation after a power failure.
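The mismatch can be reproduced with a small Python model. This is only an illustrative sketch: the classes `SecondaryStorage` and `NonvolatileMainMemory` and the function `is_consistent` are hypothetical names, and the device is assumed to keep its in-flight commands in volatile memory.

```python
class SecondaryStorage:
    """Peripheral device whose in-flight commands are lost on power failure."""

    def __init__(self):
        self.executing = []  # volatile: commands received but not yet completed

    def receive(self, cmd):
        self.executing.append(cmd)

    def power_cycle(self):
        self.executing.clear()  # in-flight commands disappear


class NonvolatileMainMemory:
    """Main memory whose contents survive a power failure."""

    def __init__(self):
        self.completion_wait_queue = []  # commands the OS is still waiting on


def is_consistent(memory, storage):
    # Consistent only if every command the OS waits for is still known to the device.
    return all(cmd in storage.executing for cmd in memory.completion_wait_queue)


memory = NonvolatileMainMemory()
storage = SecondaryStorage()
for cmd in ("cmd0", "cmd1"):
    storage.receive(cmd)
    memory.completion_wait_queue.append(cmd)

before = is_consistent(memory, storage)  # both sides know cmd0 and cmd1
storage.power_cycle()                    # power failure: only the device forgets
after = is_consistent(memory, storage)   # the OS now waits for reports that never come
```

In this model the queue in main memory still holds cmd0 and cmd1 after the power cycle, while the device holds nothing, which is exactly the situation in which the OS would wait indefinitely.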

The OS detects a transition to a stable state, in which none of the commands sent to the peripheral device is in progress, and stores stable state information in the nonvolatile main memory. The stable state information includes the data stored in the nonvolatile main memory in the stable state and the instruction address at the time of the transition to the stable state. The nonvolatile medium used as the main memory may be any medium.

The stored stable state information can be used in several ways. For example, after a failure such as a power failure, the system can be recovered using the stable state information. Alternatively, debugging can be performed with reference to the stable state information.

It is possible to prevent inconsistency between the state of the nonvolatile main memory and the state of the peripheral device.

FIG. 1 is a block diagram of a computer system according to Embodiment 1 of the present invention.
FIG. 2 is a functional block diagram of the OS (117).
FIG. 3 is a functional block diagram of the I/O controller (103).
FIG. 4 is an explanatory diagram of the main memory setting register group (306).
FIG. 5 is an explanatory diagram of the flag area (114) in the firmware storage memory (104).
FIG. 6 is a flowchart of the address translation mechanism of the I/O controller (103).
FIG. 7 is a flowchart showing the operation of the firmware.
FIG. 8 is a diagram for explaining the operation between the OS (117) and the secondary storage (106).
FIG. 9 is a diagram illustrating an example of processing between the OS and the secondary storage.
FIG. 10 shows an example of the nonvolatile main memory and the secondary memory when the power failure occurs in FIG.
FIG. 11 is a diagram illustrating the “stable state”.
FIG. 12 is a diagram illustrating an example of a method for storing the “stable state”.
FIG. 13 is a diagram illustrating an example of a method for storing the “stable state”.
FIG. 14 is a diagram illustrating an example of a method for storing the “stable state”.
FIG. 15 is a diagram illustrating an example of a method for storing the “stable state”.
FIG. 16 is a diagram illustrating an example of a trigger for starting the “stable state” saving process.
FIG. 17 is a diagram illustrating an example of a trigger for starting the “stable state” saving process.
FIG. 18 is a flowchart for explaining processing in which the OS (117) stores the “stable state”.
FIG. 19 is a flowchart for explaining processing in which the OS (117) stores the “stable state”.
FIG. 20 is a flowchart for explaining processing in which the OS (117) stores the “stable state”.
FIG. 21 is a flowchart for explaining processing in which the OS (117) stores the “stable state”.
FIG. 22 is a schematic diagram for explaining the resumption of operation after storing the “stable state”.
FIG. 23 is a schematic diagram for explaining the resumption of operation after storing the “stable state”.
FIG. 24 is a flowchart of the failure recovery program (203) for resuming operation after saving the “stable state”.
FIG. 25 is a flowchart of the failure recovery program (203) for resuming operation after saving the “stable state”.
FIG. 26 is a flowchart of the failure recovery program (203) for resuming operation after saving the “stable state”.
FIG. 27 is a schematic diagram for explaining network communication processing of the computer system (100).
FIG. 28 is a flowchart of network communication resumption processing of the failure recovery program (203).
FIG. 29 is a block diagram of a computer system (2900) according to the second embodiment of the present invention.
FIG. 30 is a diagram of the flag area (2902) of the firmware storage memory (104).
FIG. 31 is a functional block diagram of the OS (2903).
FIG. 32 is an explanatory diagram of the page management information (3103) and the page table (3104).
FIG. 33 is an explanatory diagram of the page table pointer (3001) and the check code (3105).
FIG. 34 is a schematic diagram illustrating the process of copying the page table (3104).
FIG. 35 is a flowchart for explaining processing in which the OS (2903) stores the “stable state”.
FIG. 36 is a flowchart showing firmware processing.
FIG. 37 is a flowchart showing processing when the computer system (2900) is restarted.
FIG. 38 is a schematic diagram illustrating the process of updating the page (3102).
FIG. 39 is a ladder chart explaining the process of updating the page (3102).
FIG. 40 is a flowchart for explaining the processing of the OS (2903) that makes an old generation page unused.
FIG. 41 is a flowchart for explaining the process of writing an old generation page to the secondary storage (106).
FIG. 42 is a diagram showing an example of the data structure of the write-once file system.
FIG. 43 shows the data structure of sub-inode (4206).
FIG. 44 is a flowchart showing the operation of the OS when the user requests the OS (2903) to open the file.
FIG. 45A is a diagram illustrating a state before the “stable state” is stored for the first time.
FIG. 45B is an explanatory diagram of processing when the “stable state” is stored for the first time.
FIG. 45C is an explanatory diagram of processing when the “stable state” is stored for the second time.
FIG. 45D is an explanatory diagram of processing when the “stable state” is stored after the third time.

100 ... computer system

Hereinafter, some embodiments of the present invention will be described with reference to the drawings. In the following description, saving (storing) stable state information may be referred to simply as saving (storing) the “stable state”. It is also assumed that the peripheral device connected to the computer system is a secondary storage. The secondary storage may be built into the computer system or may exist outside it. The secondary storage may be a drive such as a hard disk drive, or a storage system (for example, a disk array device) provided with a plurality of storage media.

FIG. 1 shows a computer system (100) according to the first embodiment of the present invention.

The computer system (100) consists of a CPU (Central Processing Unit) (101), a nonvolatile main memory (102), an I/O controller (103), a firmware storage memory (104), an I/O bus (105), an HBA (Host Bus Adapter) (107), and a NIC (Network Interface Card) (109).

CPU (101) is in charge of calculation.

The nonvolatile main memory (102) stores an OS (117) and an application (not shown). The nonvolatile main memory (102) is divided into two areas, referred to as area 1 (110) and area 2 (111). The “stable state” is stored in one of the two areas, and at restart after a power failure, the OS (117) in the area in which the “stable state” is saved is activated.

The I/O controller (103) is connected to the CPU (101) and the nonvolatile main memory (102), and controls the nonvolatile main memory (102) and the data transfer between these elements.

The firmware storage memory (104) is a nonvolatile storage medium that stores the firmware executed by the CPU (101) when the computer system (100) is started. The firmware storage memory (104) contains a firmware storage area (112) and a flag area (113) unique to the present embodiment.

The I/O bus (105) is a data transfer medium between the HBA (107) and NIC (109) and the I/O controller (103).

The HBA (107) controls the secondary storage (106) according to the instruction of the OS (117).

NIC (109) communicates with other computer systems via LAN (Local Area Network) (108) in accordance with instructions of OS (117). Communication between computer systems may be performed via another type of network instead of the LAN (108).

FIG. 2 shows the configuration of the OS (117).

The OS (117) includes a kernel (201), a device driver (202), an OS failure recovery program (203) specific to this embodiment that is executed at restart after power-off, a device driver list (204) that lists the device drivers (202) incorporated in the OS (117), and a device recovery program (205) that performs failure recovery processing specific to each device driver. The OS (117) also holds a recovery stack pointer (206) indicating the instruction address immediately before the stored “stable state”, and a system state variable (207) indicating whether the computer system (100) is in the “stable state” or in an unstable state.

FIG. 3 shows the configuration of the I/O controller (103).

The I/O controller (103) includes a CPU interface controller (301) that controls communication with the CPU (101), a main memory interface controller (302) that controls the main memory (102), a firmware interface controller (303) that controls communication with the firmware storage memory (104), an I/O bus interface controller (304) that controls communication with the I/O bus (105), and a routing control unit (305) that arbitrates data transfer among the interfaces connected to the I/O controller (103). The I/O controller (103) further includes a main memory setting register group (306) that stores state variables for storing the “stable state” in the nonvolatile main memory (102), and a data transfer engine unit (307) that executes data transfer within the nonvolatile main memory (102) and between the nonvolatile main memory (102) and the I/O bus (105).

FIG. 4 shows the main memory setting register group (306).

The operation mode register (401) indicates the mode of writing to area 1 (110) and area 2 (111). At the first startup after the computer system (100) has been shut down normally, data must be written to both area 1 (110) and area 2 (111). Once the “stable state” is stored in area 1 (110), the OS (117) proceeds with computation using area 2 (111). Accordingly, the main memory interface controller (302) has a mode in which data is written to both area 1 (110) and area 2 (111) and a mode in which data is written to only one of the areas. Here, an area that does not hold the “stable state” and is currently in use is referred to as an “active area”. The operation mode register (401) thus has two modes: “0: write only to active area” and “1: write to all areas”.

The active area register (402) indicates a currently active area.

The area size register (403) indicates the size of the divided area.

The stable area register (404) indicates an area where “stable state” is stored.

The area number register (405) indicates the total number of areas. In this embodiment, the nonvolatile main memory (102) is divided into two areas. However, it goes without saying that a different number of divisions is also within the scope of the present invention.
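The register group can be summarized as a small data structure. The following Python sketch is illustrative only: the class name, field names, default values, and the helper `write_targets` are assumptions, and areas are assumed to be numbered from 1 (area 1, area 2).

```python
from dataclasses import dataclass

# Values of the operation mode register (401) as listed in the description.
WRITE_ACTIVE_ONLY = 0  # "0: write only to active area"
WRITE_ALL_AREAS = 1    # "1: write to all areas"


@dataclass
class MainMemorySettingRegisters:
    operation_mode: int = WRITE_ALL_AREAS  # operation mode register (401)
    active_area: int = 1                   # active area register (402)
    area_size: int = 0                     # area size register (403)
    stable_area: int = 0                   # stable area register (404)
    area_count: int = 2                    # area number register (405)


def write_targets(regs):
    """Return the area numbers (1-based, an assumption) that a write would reach."""
    if regs.operation_mode == WRITE_ALL_AREAS:
        return list(range(1, regs.area_count + 1))
    return [regs.active_area]
```

For example, once the “stable state” has been stored in area 1, a configuration with `operation_mode=WRITE_ACTIVE_ONLY`, `active_area=2`, and `stable_area=1` would direct all writes to area 2 only.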

FIG. 5 shows the flag area (114).

The shutdown completion flag (501) indicates whether the previous run of the computer system (100) ended in a normal shutdown or in a power failure. 0xFF is the initial value. 0x00 indicates that the computer system (100) was shut down normally. 0xEE indicates that the computer system stopped due to a power failure and that the “stable state” is stored in the nonvolatile main memory (102). Any other value indicates that a failure has occurred. It goes without saying that the essence of the present invention does not change even if the meanings assigned to the flag values are changed.

The stable area number flag (502) stores a numerical value indicating the latest stable area.

The OS failure recovery program vector (503) stores the instruction address of the OS failure recovery program (203).

The area size (504) stores the size of the area obtained by dividing the nonvolatile main memory (102).

The number of areas (505) indicates the number of areas in the nonvolatile main memory (102).

Next, FIG. 6 shows the procedure by which the I/O controller (103) transmits an address to the nonvolatile main memory (102).

Step (601): The I/O controller (103) receives the physical address a(p) transmitted by the CPU (101) or another element.

Step (602): The I/O controller (103) refers to the active area register (402). Let that value be n.

Step (603): The I/O controller (103) determines whether or not the value of the active area register (402) is zero. If the value of the active area register (402) is 0, the I/O controller (103) transmits an address to both of the divided areas (step (604) is performed). On the other hand, if the value of the active area register (402) is not 0, the I/O controller (103) transmits an address only to the active area (step (610) is performed).

Step (604): The I/O controller (103) refers to the area size register (403). Let this value be s.

Step (605): The I/O controller (103) calculates the address ns + a(p).

Step (606): The I/O controller (103) transmits the above address to the nonvolatile main memory (102).

Step (607): The I/O controller (103) increments n.

Step (608): The I/O controller (103) determines whether n is greater than 2. If n is greater than 2, step (609) is performed, and if n is 2 or less, step (605) is performed. In this embodiment, since the nonvolatile main memory (102) is divided into two, the I/O controller (103) compares n with 2. When the nonvolatile main memory (102) is divided into m areas (m is an integer of 2 or more), the I/O controller (103) compares n with m.

Step (609): This is the end of the process.

Step (610): The I/O controller (103) refers to the area size register (403). Let this value be s.

Step (611): The I/O controller (103) calculates the address (n-1)s + a(p).

Step (612): The I/O controller (103) transmits the address to the nonvolatile main memory (102).

In this way, the I/O controller (103) transmits the address to the nonvolatile main memory (102). In the present embodiment, the operation of the I/O controller (103) is expressed as a software flowchart, but it goes without saying that the operation is actually a hardware operation equivalent to FIG.
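The address calculation of steps (601) to (612) can be sketched in Python as follows. This is an illustrative model, not the hardware itself. The function name `target_addresses` is hypothetical, and the sketch assumes that a broadcast write produces exactly one address per area (offsets 0, s, ..., (m-1)s), which is one reading of the loop in steps (605) to (608).

```python
def target_addresses(a_p, active_area, area_size, area_count=2):
    """Return the addresses sent to the nonvolatile main memory for address a_p.

    a_p         -- physical address a(p) received from the CPU (step 601)
    active_area -- value n of the active area register (step 602); 0 means "all areas"
    area_size   -- value s of the area size register (steps 604 and 610)
    area_count  -- number of areas m (2 in the described embodiment)
    """
    if active_area == 0:
        # Broadcast: compute n*s + a(p) for each area in turn (steps 605 to 608).
        # Assumption: the loop covers exactly area_count areas.
        return [n * area_size + a_p for n in range(area_count)]
    # Single write: the active area n (1-based) starts at (n-1)*s (step 611).
    return [(active_area - 1) * area_size + a_p]
```

With `area_size=0x1000`, address `0x10` maps to `0x10` in area 1 and `0x1010` in area 2, so a broadcast write reaches both locations while a write with `active_area=2` reaches only `0x1010`.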

Next, the operation of the firmware will be described with reference to FIG.

Step 701: The user turns on the power.

Step 702: The firmware initializes the CPU (101).

Step 703: The firmware refers to the shutdown completion flag (501).

Step 704: The firmware operates in three ways depending on the value of the shutdown completion flag (501).

Step 705: If the shutdown completion flag (501) is 0xFF or 0x00, the computer system (100) was shut down normally last time, so the firmware initializes the hardware of the computer system (100) and reads the OS (117) from the secondary storage (106). First, the firmware initializes the I/O controller (103).

Step 706: The firmware initializes the interface between the I/O controller (103) and the nonvolatile main memory (102).

Step 707: The firmware clears the nonvolatile main memory (102) to zero.

Step 708: The firmware sets the main memory setting register group (306). It sets half the capacity of the nonvolatile main memory (102) in the area size register (403), 0 in the stable area register (404), and 2 in the area number register (405) (if the number of divisions is m, the register is set to m). It also sets 0 in the operation mode register (401) and 0 in the active area register (402).

Step 709: The firmware initializes the I/O bus (105).

Step 710: The firmware initializes the HBA (107) and NIC (109).

Step 711: The firmware loads the OS boot program from the secondary storage (106) to the nonvolatile main memory (102).

Step 712: The firmware jumps to the OS startup program.

Step 713: The OS startup program starts running.

Step 714: When the shutdown completion flag (501) is 0xEE, a power failure occurred while the “stable state” was stored. The firmware initializes the hardware and then calls the OS failure recovery program (203). First, the firmware initializes the I/O controller (103).

Step 715: The firmware initializes the interface between the I/O controller (103) and the nonvolatile main memory (102).

Step 716: The firmware reads the stable area number flag (502) and writes the value in the stable area register (404).

Step 717: The firmware calculates the size of the area from the value of the number of areas (505).

Step 718: The firmware sets the area size in the area size register (403).

Step 719: The firmware initializes the I/O bus (105).

Step 720: The firmware initializes the HBA (107) and NIC (109).

Step 721: The firmware reads the OS failure recovery program vector (503) and jumps to the vector.

Step 722: The OS failure recovery program (203) takes over the processing. The operation of the OS failure recovery program (203) will be described after the “stable state” saving process is described.

Step 723: Any value of the shutdown completion flag (501) other than 0x00, 0xEE, and 0xFF is invalid, so fault processing is started. First, the firmware initializes the I/O bus (105).

Step 724: The firmware initializes a VGA (Video Graphics Array) device (not shown).

Step 725: The firmware displays an error on the console.
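The three-way branch of step 704 can be sketched in Python as follows. This is only an illustration of the decision, not firmware code. The constant and function names are hypothetical, and `area_size_from_count` assumes that the area size of step 717 is the total capacity divided evenly by the number of areas (505).

```python
# Values of the shutdown completion flag (501) given in the description.
FLAG_INITIAL = 0xFF
FLAG_NORMAL_SHUTDOWN = 0x00
FLAG_STABLE_STATE_SAVED = 0xEE


def select_boot_path(shutdown_flag):
    """Map the shutdown completion flag to one of the three firmware paths."""
    if shutdown_flag in (FLAG_INITIAL, FLAG_NORMAL_SHUTDOWN):
        return "normal_boot"   # steps 705 to 713: initialize and load the OS
    if shutdown_flag == FLAG_STABLE_STATE_SAVED:
        return "recovery"      # steps 714 to 722: jump to the OS failure recovery program
    return "fault"             # steps 723 to 725: report an error on the console


def area_size_from_count(total_capacity, area_count):
    # Assumption: areas are equal in size, so the size is capacity / number of areas.
    return total_capacity // area_count
```

A caller would first read the flag from the flag area, call `select_boot_path`, and on the recovery path use `area_size_from_count` to restore the area size register before jumping to the recovery vector.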

Next, the “stable state” will be described with reference to the drawings.

FIGS. 10 and 11 represent divided areas n (110) (111) of the nonvolatile main memory (102). The nonvolatile main memory (102) includes a file system layer (801) and a device driver layer (802) of the OS (117).

The file system layer (801) has a device driver transmission waiting queue (803). The file system layer (801) writes the contents of the nonvolatile main memory (102) to the secondary storage (106) in response to a swap-out, an explicit instruction from an application, or an instruction from the user. The device driver (202) performs the actual write operation. The device driver transmission waiting queue (803) is the queue through which the file system layer (801) requests the device driver (202) to write data. In FIGS. 10 and 11, cmd4 (810) and cmd5 (811) are registered in the device driver transmission waiting queue (803).

The device driver layer (802) has a command transmission waiting queue (804) and a command completion waiting queue (805). cmd2 (808) and cmd3 (809) are registered in the command transmission waiting queue (804). In FIG. 11, the command completion waiting queue (805) is empty, so the system can be said to be in the “stable state”.

In FIG. 10, the secondary storage (106) reports the completion of execution of cmd0 (806) and cmd1 (807). Consider the case where the OS (117) did not send cmd2 (808) to the secondary storage (106) in this state. This state is shown in FIG.

In FIG. 11, the command completion waiting queue (805) is empty. Naturally, the command execution queue (812) in the secondary storage (106), which holds the commands being executed, is also empty. Therefore, the state of the nonvolatile main memory (102) and the state of the secondary storage (106) are consistent. Even if a power failure occurs in this state, the computer system (100) can resume operation, because the states of the OS (117) and the secondary storage (106) still match at restart after the power failure. As a result, the OS (117) does not need to wait for a command execution completion report from the secondary storage (106).

From the above, the “stable state” can be defined as the state in which no command transmitted to the peripheral device (the secondary storage (106) in this embodiment) is in progress; in other words, the state in which all commands transmitted to the peripheral device have been completed.

Next, a method for storing the “stable state” will be described. It has already been explained that FIG. 11 represents the “stable state”. There are multiple ways to store the “stable state”. Each method will be described below.

One of the methods for storing the “stable state” will be described with reference to FIGS. 12 and 13.

As shown in FIG. 12, the OS (117) retrieves cmd2 (808) and cmd3 (809) from the command transmission waiting queue (804) and registers them in the device driver transmission waiting queue (803). The state of the computer system (100) then transitions to the state shown in FIG. Since the state shown in FIG. 13 is equivalent to an initialized device driver layer (802), this method has the advantage that the OS failure recovery program (203) is easy to design.

Depending on the specifications of the HBA (107), FIG. 11 may not represent the “stable state”. The explanation so far has assumed that the OS (117) actively retrieves a command from the command transmission waiting queue (804) and sends it to the secondary storage (106) via the HBA (107). However, some HBAs themselves take a command out of the command transmission waiting queue (804) in the nonvolatile main memory (102) and transmit it to the secondary storage (106). In this case, the method described with FIG. 12, in which the OS (117) extracts the commands from the command transmission waiting queue (804) and registers them in the device driver transmission waiting queue (803), is the means of saving the “stable state”. Thus, the method of FIG. 12 can store the “stable state” regardless of the type of HBA.
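The queue manipulation of FIG. 12 can be sketched in Python as follows. This is an illustrative model only. The function name `save_by_requeue` is hypothetical, and placing the returned commands at the front of the device driver transmission waiting queue (so that their original order is kept) is an assumption; the description does not state where in the queue they are registered.

```python
from collections import deque


def save_by_requeue(driver_tx_wait_queue, cmd_tx_wait_queue):
    """Move unsent commands from queue (804) back to queue (803), emptying (804)."""
    # The unsent commands were issued before those still waiting in the file system
    # layer, so they are put back at the front, last command first, to keep order.
    while cmd_tx_wait_queue:
        driver_tx_wait_queue.appendleft(cmd_tx_wait_queue.pop())


# State corresponding to FIG. 11: the completion waiting queue is already empty.
driver_tx_wait_queue = deque(["cmd4", "cmd5"])  # device driver transmission waiting queue (803)
cmd_tx_wait_queue = deque(["cmd2", "cmd3"])     # command transmission waiting queue (804)

save_by_requeue(driver_tx_wait_queue, cmd_tx_wait_queue)
```

After the call, the device driver layer holds no commands at all, which corresponds to the initialized state described for FIG. 13.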

In addition, there is another method for storing the “stable state”. This will be described with reference to FIG. 14 and FIG.

In FIG. 14, a command save queue (1401) exists in the device driver layer (802). When saving the “stable state”, the OS (117) moves the commands registered in the command transmission waiting queue (804) to the command save queue (1401). As a result, as shown in FIG. 15, no command is registered in the command transmission waiting queue (804), so the HBA (107) does not transmit any command to the secondary storage (106), and the computer system (100) is in the “stable state”.

With this method, the save destination differs from that of the method of FIG. In the method of FIG. 14, the design changes are confined to the device driver (202), so this method can be said to be easy to implement.
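The method of FIG. 14 can be sketched in Python as follows. This is an illustrative model only. The class `DeviceDriverLayer` and its method names are hypothetical, and `resume_unsent_commands` is an assumed counterpart for returning the saved commands to the transmission waiting queue once operation continues.

```python
from collections import deque


class DeviceDriverLayer:
    """Toy model of the device driver layer (802) with a command save queue."""

    def __init__(self):
        self.cmd_tx_wait_queue = deque()          # command transmission waiting queue (804)
        self.cmd_completion_wait_queue = deque()  # command completion waiting queue (805)
        self.cmd_save_queue = deque()             # command save queue (1401)

    def park_unsent_commands(self):
        # Move every unsent command to the save queue so the HBA finds nothing to send.
        while self.cmd_tx_wait_queue:
            self.cmd_save_queue.append(self.cmd_tx_wait_queue.popleft())

    def resume_unsent_commands(self):
        # Hypothetical counterpart: put the saved commands back in their original order.
        while self.cmd_save_queue:
            self.cmd_tx_wait_queue.append(self.cmd_save_queue.popleft())

    def is_stable(self):
        # No command is in progress and none can be picked up by the HBA.
        return not self.cmd_completion_wait_queue and not self.cmd_tx_wait_queue
```

Because only the device driver layer's own queues are touched, the file system layer is unaffected, which reflects the remark that the design changes stay within the device driver.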

So far, we have described how to store the “stable state”. Next, the trigger for starting the “stable state” saving process will be described.

There are multiple types of triggers for starting the “stable state” saving process. For example, there is an opportunity either periodically when receiving an instruction from the user, or when the state of the command completion waiting queue (805) becomes a predetermined state. “When the state of the command completion waiting queue (805) becomes a predetermined state” is, for example, when “unstable state” is detected or when “stable state” is detected. An “unstable state” is a state that is not a “stable state”.

Fig. 16 shows that "stable state" is stored when "unstable state" is detected.

The direction from the left to the right of the drawing represents the passage of time. A thick line (1601) indicates that the computer system (100) is in an “unstable state”. A thick line (1062) indicates that the computer system (100) is in a “stable state”.

First, commands cmd1 (1603), cmd2 (1604), and cmd3 (1605) are registered in the command completion waiting queue (805). Next, cmd1 (1603) is completed, and a transition is made to the state in which commands cmd2 (1604) and cmd3 (1605) are registered in the command completion waiting queue (805). Further, the processing of cmd2 (1604) and cmd3 (1605) is completed, the command completion waiting queue (805) becomes empty, and the computer system (100) transitions to the “stable state” (the OS (117) detects the “stable state”).

However, OS (117) does not save the “stable state” and continues operation.

The OS (117) then saves the “stable state”, triggered by the event that cmd4 (1605) is registered in the command completion waiting queue (1606). That is, the OS (117) stores the “stable state” when it detects the “unstable state”.

The “stable state” stored at this opportunity is closer to the present than the “stable state” stored at the opportunity shown in FIG. 17 (when the “stable state” is detected). Therefore, the failure recovery time is shorter than in the case of FIG. 17. However, in order to save the “stable state” when the “unstable state” is detected, the OS (117) must recognize its own state. Therefore, the OS (117) must identify whether it is in the “stable state” or the “unstable state” by using the system state variable (207). As another method of detecting the transition from the “stable state” to the “unstable state”, a device driver (not shown) that is part of the OS (117) can check whether the command completion waiting queue (1606) is empty before it sends a command to the secondary storage (106).

On the other hand, FIG. 17 shows that the “stable state” is saved when the command completion waiting queue (805) becomes empty, that is, the “stable state” is saved when the “stable state” is detected.

The “stable state” stored at this opportunity is further in the past than the “stable state” stored at the opportunity shown in FIG. 16. Therefore, the failure recovery time is longer than in the case of FIG. 16. However, since the OS (117) does not need to recognize its own state, the system state variable (207) is unnecessary. As another method by which the OS (117) detects the transition to the “stable state”, the device driver (802) that is a part of the OS (117) can check whether the command completion waiting queue (1606) is empty when it receives a command completion report from the secondary storage (106).

So far, the trigger for starting the “stable state” storage process has been described. Next, the “stable state” storage process will be described with reference to FIGS. 18 to 21. There are multiple “stable state” storage processes.

FIG. 18 is a flowchart of processing for storing the “stable state” for the first time after the computer system (100) is started.

Step (1801): “Stable state” storage process starts. As described above, there are a plurality of types of triggers for starting this process, but it goes without saying that any of them may be used.

Step (1802): The OS (117) does not send a new command to the secondary storage (106).

Step (1803): The OS (117) checks whether there is a command waiting for completion. This step is necessary if the start of the “stable state” storage process is a user instruction or periodic, but this step may not be required for the trigger described in FIGS. 16 and 17.

Step (1804): The OS (117) saves the recovery stack pointer (206). That is, the state (context) immediately before the start of the “stable state” saving process is saved at a predetermined address that the OS failure recovery program (203) can recognize. The “context” includes information indicating the state of a process (for example, how far the program has executed). Most modern operating systems execute one program for a few milliseconds, interrupt it, and then execute another program. From the human point of view, multiple programs appear to be executed in parallel. This is called “multitasking”. Accordingly, when the OS (117) switches from one program to another, it stores in the nonvolatile main memory (102) the information necessary to resume the former program (how far the program has executed, the data being calculated, and so on). This information is called the context. The operation in which the OS (117) interrupts execution of one program and starts executing another is called a “context switch”.

Step (1805): The OS (117) flushes the cache of the CPU (101). In this embodiment, the cache of the CPU (101) is volatile. Therefore, in order to save the state of the computer system (100), it is necessary to flush the cache of the CPU (101) and save its contents in the nonvolatile main memory (102).

Step (1806): The OS (117) sets “1” in the stable region register (404) of the I / O controller (103). This indicates that the “stable state” is stored in area 1.

Step (1807): The OS (117) writes the instruction address of the OS failure recovery program (203) into the OS failure recovery program vector (503) in the firmware storage memory (104).

Step (1808): The OS (117) sets “0” in the operation mode register. This prevents the I / O controller (103) from rewriting the area 1 (110) in which the “stable state” is stored.

Step (1809): The OS (117) sets “2” in the active area register. As a result, the OS (117) uses only the area 2 (111).

Step (1810): The OS (117) sets the stable region number flag (502) to “1”.

Step (1811): The OS (117) sets 0xEE to the shutdown completion flag (501).

Step (1812): The OS (117) checks whether the secondary storage (106) responds within a specified time to a command that the OS (117) has transmitted to it. If the response does not arrive within the specified time, the OS (117) executes failure processing, for example by transmitting the same command again. If the specified time has not yet elapsed, the OS (117) waits for the response from the secondary storage (106).

Step (1814): The process ends.

Until the start of the first “stable state” saving process, both area 1 and area 2 are used as “active areas”. After this process ends, area 1 is the “stable area” and area 2 is the “active area”. That is, data such as the latest context is no longer stored in area 1.
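The register and flag updates of steps (1803) to (1811) can be summarized in a short sketch. It is a simplified model under stated assumptions: the class names, field names, and initial register values are hypothetical, and the recovery stack pointer save and cache flush are represented only by log entries.

```python
from dataclasses import dataclass


@dataclass
class IOController:
    """Hypothetical model of the registers of the I/O controller (103)."""
    stable_area_register: int = 0      # (404); 0 = no stable area yet (assumption)
    active_area_register: int = 0      # 0 = both areas active (assumption)
    operation_mode_register: int = 1   # non-zero = stable area writable (assumption)


@dataclass
class FirmwareFlags:
    """Hypothetical model of the flag area in the firmware storage memory (104)."""
    shutdown_completion_flag: int = 0x00   # (501)
    stable_area_number_flag: int = 0       # (502)
    os_recovery_vector: int = 0            # (503)


def first_stable_state_save(ioc, flags, pending_commands, recovery_entry, log):
    """Follow steps (1803) to (1811); return False if commands are pending."""
    if pending_commands:                       # step (1803)
        return False
    log.append("save recovery stack pointer")  # step (1804), modeled as a log entry
    log.append("flush CPU cache")              # step (1805), modeled as a log entry
    ioc.stable_area_register = 1               # step (1806)
    flags.os_recovery_vector = recovery_entry  # step (1807)
    ioc.operation_mode_register = 0            # step (1808): protect area 1
    ioc.active_area_register = 2               # step (1809): use only area 2
    flags.stable_area_number_flag = 1          # step (1810)
    flags.shutdown_completion_flag = 0xEE      # step (1811)
    return True


ioc = IOController()
flags = FirmwareFlags()
log = []
saved = first_stable_state_save(ioc, flags, [], 0x1000, log)
```

The order matters in this sketch: the recovery vector is written before the shutdown completion flag is set to 0xEE, so the flag is only set once everything needed for recovery is in place.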

FIG. 19 is a flowchart of processing for storing the “stable state” again when the “stable state” has already been stored. In the following description, when step (x) is the same as step (y), it is abbreviated as “step (x) = step (y)”.

Step (1901) = Step (1801), Step (1902) = Step (1802), Step (1903) = Step (1803), Step (1904) = Step (1804), and Step (1905) = Step (1805).

Step (1906): The OS (117) copies the contents of the current active area (for example, all data stored in the area 2) to the stable area (for example, the area 1). This copying may be performed by the CPU (101), but is preferably performed by the data transfer engine unit (307) built in the I / O controller (103). The data transfer engine unit (307) is a so-called DMA (Direct Memory Access) engine. The OS (117) instructs the data transfer engine unit (307) about the transfer source address, the transfer destination address, and the transfer size (not shown). The data transfer engine unit (307) copies data for the designated transfer size from the designated transfer source address to the designated transfer destination address. As a result, the data in the nonvolatile main memory (102) can be copied at a higher speed than when the software performs copying.

Step (1907): The OS (117) sets the stable area number in the stable area register (404). If the stable region is unchanged, this processing is not necessary.

Step (1908) = Step (1807) and Step (1909) = Step (1808).

Step (1910): The OS (117) sets the active area number in the active area register (402). If the active area is unchanged, this process is not necessary.

Step (1912) = Step (1811), Step (1913) = Step (1812), and Step (1914) = Step (1813). Then, the process ends at step (1915).
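The copy of step (1906) can be illustrated with a minimal sketch of a transfer described by a source address, a destination address, and a size. The function name `dma_copy`, the 16-byte memory, and the area layout are assumptions made for illustration; a real data transfer engine unit (307) would perform this in hardware.

```python
def dma_copy(memory: bytearray, src: int, dst: int, size: int) -> None:
    """Copy `size` bytes from offset `src` to offset `dst` within `memory`."""
    if min(src, dst, size) < 0:
        raise ValueError("addresses and size must be non-negative")
    if src + size > len(memory) or dst + size > len(memory):
        raise ValueError("transfer exceeds the memory range")
    # Slicing creates a temporary copy, so overlapping ranges are handled safely.
    memory[dst:dst + size] = memory[src:src + size]


AREA_SIZE = 8                       # hypothetical size of each area
AREA1_BASE, AREA2_BASE = 0, AREA_SIZE
memory = bytearray(2 * AREA_SIZE)   # stands in for the nonvolatile main memory (102)
memory[AREA2_BASE:AREA2_BASE + AREA_SIZE] = b"ACTIVE!!"

# Copy the active area (area 2) to the stable area (area 1).
dma_copy(memory, AREA2_BASE, AREA1_BASE, AREA_SIZE)
```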

Next, FIG. 20 shows a flowchart of the “stable state” saving process corresponding to FIG. 12 and FIG. FIG. 20 shows a process of storing the “stable state” for the first time after the computer system (100) is activated.

Step (2001) = Step (1801).

Step (2002): The OS (117) moves the command in the command transmission waiting queue (804) to the device driver transmission waiting queue (803).

Thereafter, the same processing as that after step (1803) in FIG. 18 is performed. Step (2003) = Step (1803), Step (2004) = Step (1804), Step (2005) = Step (1805), Step (2006) = Step (1806), Step (2007) = Step (1807), Step (2008) = Step (1808), Step (2009) = Step (1809), Step (2010) = Step (1810), Step (2011) = Step (1811), Step (2012) = Step (1812), and Step (2013) = Step (1813). Then, the process ends at step (2014).

FIG. 21 is a flowchart of processing for storing the “stable state” again when the “stable state” has already been stored.

Step (2101) = Step (1901), Step (2102) = Step (2002), Step (2103) = Step (1903), Step (2104) = Step (1904), Step (2105) = Step (1905), Step (2106) = Step (1906), Step (2107) = Step (1907), Step (2108) = Step (1908), Step (2109) = Step (1909), Step (2110) = Step (1910), Step (2111) = Step (1911), Step (2112) = Step (1912), Step (2113) = Step (1913), and Step (2114) = Step (1914). Then, the process ends at step (2115).

Next, the process of restarting from the “stable state” after a power failure occurs will be described.

FIG. 22 shows an outline of the restart process after the “stable state” storage process of FIG. 12 is executed.

In order to create the “stable state”, the OS (117) has moved the commands cmd2 (808) and cmd3 (809), which were registered in the command transmission queue (804) of the device driver layer (802), to the device driver transmission waiting queue (803) of the file system layer (801). Therefore, the device driver layer (802) is in its initial state, and the restart process only needs to execute the initialization process again.

On the other hand, FIG. 23 shows an outline of the restart process after the “stable state” storage process of FIGS. 14 and 15 is executed.

The OS (117) returns the commands cmd2 (808) and cmd3 (809) saved in the command save queue (1401) to the command transmission queue (804). If the HBA (107) is of a specification that acquires commands from the command transmission queue (804) by itself, the head pointer of the command transmission queue (804) must additionally be written to a register (not shown) of the HBA (107). This is because the HBA (107) does not know which address of the command transmission queue (804) is the head address of the commands not yet transmitted.

In the case of the “stable state” saving process described with reference to FIG. 11, the initialization of the device driver may consist only of the initial setting of the HBA (107). If the HBA (107) is of a specification that acquires commands from the command transmission queue (804) by itself, the head pointer of the command transmission queue (804) must additionally be written to a register (not shown) of the HBA (107).

Fig. 24 shows the restart process from the "stable state".

Step (2401): Processing starts.

Step (2402): The OS failure recovery program (203) refers to the device driver list (204).

Step (2403): The OS failure recovery program (203) can know the instruction address of the device recovery program (205) for each device from the device driver list (204). Therefore, the OS failure recovery program (203) executes the device recovery program (205) for each device including the HBA (107) and the NIC (109).

Step (2404): When the recovery process of each device is completed, the OS failure recovery program (203) jumps to the location indicated by the recovery stack pointer (206) to return to the “stable state”.

Step (2405): The OS (117) restarts the operation from the “stable state”.
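The control flow of steps (2402) to (2405) can be sketched as a loop over the device driver list followed by a jump back to the saved context. The sketch is only an assumption-laden model: the device driver list is represented as a list of (name, function) pairs, and the jump to the saved context is represented by an ordinary function call.

```python
def os_failure_recovery(device_driver_list, resume_from_stable_state):
    """Run each device recovery program, then resume from the stable state."""
    for _device_name, device_recovery_program in device_driver_list:
        device_recovery_program()        # step (2403)
    return resume_from_stable_state()    # steps (2404) and (2405)


log = []


def resume():
    # Stands in for jumping to the context saved via the recovery stack pointer.
    log.append("resumed from stable state")
    return "running"


device_driver_list = [
    ("HBA", lambda: log.append("HBA recovered")),
    ("NIC", lambda: log.append("NIC recovered")),
]
state = os_failure_recovery(device_driver_list, resume)
```

The point of the ordering is that no device is used by the resumed OS before its recovery program has run.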

Next, FIG. 25 shows a flowchart of the HBA (107) device recovery program when the “stable state” saving process of FIGS. 14 and 15 is executed.

Step (2501): The HBA device recovery program is called from the OS failure recovery program (203).

Step (2502): The HBA device recovery program restores the command transmission queue (804). That is, the commands (cmd2 (808) and cmd3 (809) shown in FIGS. 14 and 15) saved in the command save queue (1401) in FIG. 15 are moved to the command transmission queue (804).

Step (2503): The HBA failure recovery program sets an initial value in the register of HBA (107).

Step (2504): If the HBA (107) is of a specification that retrieves commands from the command transmission queue (804) by itself and sends them to the secondary storage (106), the HBA device recovery program sets the head address of the commands in the command transmission queue (804) in a register (not shown) of the HBA (107).

Step (2505): The failure recovery process of the HBA (107) is completed, and the control of the CPU (101) is transferred from the HBA failure recovery program to the OS failure recovery program (203).
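Steps (2502) to (2504) can be sketched as follows. The function name, the dictionary used for the HBA registers, the register names `control` and `queue_head`, and the use of a queue index instead of a memory address are all hypothetical simplifications.

```python
from collections import deque


def hba_device_recovery(save_queue, transmission_queue, hba_registers,
                        hba_fetches_commands):
    """Restore unsent commands and reinitialize the (modeled) HBA registers."""
    # Step (2502): move saved commands back, preserving their order.
    while save_queue:
        transmission_queue.append(save_queue.popleft())
    # Step (2503): set initial register values (the value 0 is an assumption).
    hba_registers.clear()
    hba_registers["control"] = 0
    # Step (2504): only needed when the HBA fetches commands by itself.
    if hba_fetches_commands:
        hba_registers["queue_head"] = 0   # index of the first unsent command


save_queue = deque(["cmd2", "cmd3"])      # command save queue (1401)
transmission_queue = deque()              # command transmission queue (804)
registers = {"stale": 123}
hba_device_recovery(save_queue, transmission_queue, registers, True)
```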

Next, FIG. 26 shows the operation of the HBA failure recovery program when the “stable state” described with reference to FIGS. 11 and 12 is stored.

Step (2601): The OS failure recovery program (203) calls the HBA failure recovery program.

Step (2602): The HBA failure recovery program executes the same HBA initialization process as during normal startup. In the processing described with reference to FIG. 12, the OS (117) returns the commands from the device driver layer (802) to the file system layer (801). Therefore, the part of the nonvolatile main memory (102) used by the HBA device driver has been returned to its initial state. Furthermore, according to the schedule of the OS (117), the commands registered in the device driver transmission waiting queue (803) of the file system layer (801) are moved to the device driver layer (802). Therefore, the HBA failure recovery program only needs to initialize the HBA (107). Likewise, the processing described with reference to FIG. 11 assumes a specification in which the OS (117) transmits the commands in the command transmission queue (804) to the secondary storage (106) via the HBA (107), so only the HBA (107) needs to be initialized. If the HBA (107) itself reads the commands registered in the command transmission queue (804) and sends them to the secondary storage (106), the “stable state” saving process described with reference to FIG. 11 cannot be applied. Note, therefore, that no HBA failure recovery program exists for that case.

Step (2603): The failure recovery process of the HBA (107) is finished, and the control of the CPU (101) is transferred from the HBA failure recovery program to the OS failure recovery program (203).

So far, the operation of the OS failure recovery program (203) has been described. It should be noted that “journaling” in the sense of transaction processing is realized as a natural consequence. In the example of FIG. 23, when the OS (117) resumes operation from the “stable state”, the OS (117) transmits the command cmd1 (807) to the secondary storage (106). If a power failure occurs thereafter, the computer system (100) returns to the “stable state”. At this time, the command cmd1 (807) that the OS (117) had transmitted to the secondary storage (106) is returned to the state in which it is registered in the command transmission queue (804). Then, the OS (117) transmits the command cmd1 (807) to the secondary storage (106) again. Thus, unlike the conventional journaling method, one of the features of this embodiment is that no extra resources are required. Further, in the conventional transaction processing technique, the OS (117) stores a journal file, which is a history of command transmission to the secondary storage (106), in the secondary storage (106). In this embodiment, however, the OS (117) holds the information corresponding to the journal file in the nonvolatile main memory (102). Therefore, the small overhead of journaling is also a feature of this embodiment.

So far, the “stable state” storage processing and recovery processing of the computer system (100) have been described from the viewpoint of the relationship between the secondary storage (106) and the nonvolatile main memory (102). They also need to be explained from the viewpoint of the NIC (109) and the LAN (108). In general, the computer system (100) (that is, the OS (117)) can continue operation even when the communication of the LAN (108) is interrupted. This is because the communication of the LAN (108) has no state (it has a state in the TCP protocol layer, but appears stateless from the upper layers). Therefore, it is not necessary to save a “stable state” for it. The NIC (109) recovery process only needs to initialize the NIC (109) or the TCP / IP stack of the OS (117).

However, when the computer system (100) functions as a file server, some care is required. FIG. 27 shows a schematic diagram of LAN (108) communication when the computer system (100) is a file server. The nonvolatile main memory (102) includes an NFS (Network File System) layer (2701) and a TCP / IP layer (2702). Outside the computer system is a client (2703) connected via the LAN (108).

The TCP / IP layer (2702) has a TCP / IP layer reception queue (2704). In FIG. 27, the command cmd4 (2705) and data “dat4” (2706) transmitted by the client (2703) are registered in the TCP / IP layer reception queue (2704). The TCP / IP layer (2702) also has a TCP / IP transmission queue (2711). In FIG. 27, a command completion report message cmp2 (2712) is registered in it.

The NFS layer (2701), which is an upper layer of the TCP / IP layer (2702), has an NFS layer reception queue (2707). In FIG. 27, the command cmd3 (2708) is registered in it. Further, the NFS layer (2701) has an NFS layer reception completion queue (2709). This is a queue for registering commands whose reception processing has been completed in the NFS layer (2701). For example, when the client (2703) sends a write command to the computer system (100), the computer system (100), which has the nonvolatile main memory (102), can send a write completion report to the client (2703) as soon as the reception by the NFS layer (2701) is completed. A command for which the reception processing of the NFS layer (2701) is completed is registered in the NFS layer reception completion queue (2709) by the OS (117). In FIG. 27, the command cmd1 (2714) and data dat1 (2715) are registered in it. The computer system (100) also has an NFS layer transmission queue (2710) for transmitting completion messages and the like to the client (2703). Finally, FIG. 27 shows a state in which the completion message cmp1 (2713), indicating that the OS (117) has completed processing of the command cmd1 (2714), has reached the client (2703).

After a power failure occurs in the state of FIG. 27, the OS failure recovery program (203) initializes all the queues of FIG. 27 except the NFS layer reception completion queue (2709). On the other hand, the OS (117) must without fail write the received data dat1 (2715) to the secondary storage (106). This is because the computer system (100) has already reported the normal end of the command cmd1 (2714) to the client (2703).

Based on the above points, the processing of the OS failure recovery program (203) related to the NFS layer (2701) is shown in FIG.

Step (2801): Processing starts.

Step (2802): The OS failure recovery program (203) checks whether there is a command or data in the NFS layer reception completion queue (2709).

Step (2803): If a command or data exists in the NFS layer reception completion queue (2709), the OS failure recovery program (203) writes the data in the secondary storage (106). The OS failure recovery program (203) does not need to write directly to the secondary storage (106), but only instructs the file system layer (801) to write the data to the secondary storage (106). Here, the processing of the OS failure recovery processing program (203) is finished, and the OS (117) executes normal NFS processing from the next time.

Step (2804): Check whether the OS (117) has received the NFS command. If no NFS command is received, monitoring continues.

Step (2805): When the OS (117) receives the NFS command, it executes the command and returns to step (2804).
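The recovery rule for the NFS-related queues can be sketched as follows: every queue of FIG. 27 is initialized except the NFS layer reception completion queue (2709), whose contents are handed to the file system layer for writing. The dictionary keys and the callback `write_to_file_system` are hypothetical names introduced for this sketch.

```python
from collections import deque


def recover_nfs_layer(queues, write_to_file_system):
    """Initialize all queues except the reception completion queue, then
    write out every command already acknowledged to the client."""
    for name, queue in queues.items():
        if name != "nfs_reception_completion":
            queue.clear()
    completed = queues["nfs_reception_completion"]
    while completed:                      # steps (2802) and (2803)
        command, data = completed.popleft()
        write_to_file_system(command, data)


queues = {
    "tcpip_reception": deque([("cmd4", "dat4")]),
    "tcpip_transmission": deque(["cmp2"]),
    "nfs_reception": deque(["cmd3"]),
    "nfs_transmission": deque(),
    "nfs_reception_completion": deque([("cmd1", "dat1")]),
}
written = []
recover_nfs_layer(queues, lambda command, data: written.append((command, data)))
```

Commands such as cmd3 and cmd4 can be discarded in this model because the client has not yet received a completion report for them, whereas cmd1 has been acknowledged and must reach the secondary storage.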

As described above with reference to the drawings, storing the “stable state” of the computer system (100) in the nonvolatile main memory (102) enables the computer system (100) to be restarted consistently after a power failure. Compared with the conventional technique of storing the system state in the secondary storage (106), the state storage overhead of this embodiment is small, and the performance degradation during normal operation is small. Furthermore, since the overhead of state storage is small, a computer system to which this embodiment is applied can store the state at short intervals. Therefore, when the computer system (100) is restarted after a power failure, it can return to a state that is less far in the past than with the prior art. In other words, the time required for rollback and rollforward in the sense of transaction processing is shortened. Further, the first embodiment does not require the secondary storage (106) area and the CPU cycles needed for journaling in transaction processing, yet achieves an effect equivalent to that of journaling.

In the first embodiment, the method of dividing the nonvolatile main memory (102) into two areas and storing the “stable state” in one of them has been described. However, the effect of the present invention is not impaired even when the nonvolatile main memory (102) is divided into any integer number of areas of two or more. In this case, the following operation is possible. For example, as shown in FIG. 45A, when the nonvolatile main memory (102) is divided into three, the OS (117) first stores data in all of the areas 1 to 3 (that is, the areas 1 to 3 are all active areas). Next, when saving the “stable state”, as shown in FIG. 45B, the OS (117) saves the “stable state” in any one of the active areas 1 to 3 (for example, the area 1) (that is, one of the active areas 1 to 3 becomes a stable area). Thereafter, when saving the “stable state” again, as shown in FIG. 45C, the OS (117) saves the “stable state” in any one of the active areas 2 to 3 (for example, the area 2) (that is, one of the active areas 2 to 3 becomes a stable area). Thereafter, when the “stable state” is saved yet again, as shown in FIG. 45D, the OS (117) copies the contents of the current active area to one of the other two stable areas.

Also, when the nonvolatile main memory (102) is divided into m, it is preferable that all m areas have the same capacity, but the capacity of at least one area may be different from the capacity of other areas.

In the first embodiment, a plurality of “stable state” storage processes are described, but two or more of them may be combined.

In addition, the OS (117) can select whether to save the “stable state” when the “unstable state” is detected or when the “stable state” is detected (that is, it can switch dynamically between the method described with reference to FIG. 16 and the method described with reference to FIG. 17). For example, a specific file system such as the Linux (registered trademark) /proc file system allows a user to read and write the values of variables inside the OS (117), and the behavior of the OS (117) can be changed by changing the value of a variable inside the OS (117) that is exposed in the /proc file system. If this is used, the trigger for storing the “stable state” can be changed during operation. The specific operation is as follows. The OS (117) incorporates a variable indicating the timing for saving the “stable state”. This is called a “timing variable”. When the “timing variable” is a first value (for example, “0”), the OS (117) saves the “stable state” when transitioning from the “stable state” to the “unstable state”. Conversely, when the “timing variable” is a second value (for example, “1”), the OS (117) saves the “stable state” when transitioning to the “stable state”. The OS (117) is designed to operate as described below. The OS (117) refers to the “timing variable” when transmitting a command to the secondary storage (106). If the “timing variable” is the first value (for example, “0”), the OS (117) checks the command completion waiting queue (805). If the command completion waiting queue (805) is empty, the “stable state” saving process is started. If the command completion waiting queue (805) is not empty, the OS (117) does nothing. Thereafter, the OS (117) transmits the command to the secondary storage (106). On the other hand, if the “timing variable” is the second value (for example, “1”), the OS (117) proceeds to the next process without doing anything.
The OS (117) refers to the “timing variable” after receiving the command completion report from the secondary storage (106). If the “timing variable” is the second value (for example, “1”), the OS (117) checks the command completion waiting queue (805). If the command completion waiting queue (805) is empty according to the received command completion report, the OS (117) stores the “stable state”. If the command completion queue (805) is not empty, nothing is done and the process proceeds to the next process. On the other hand, if the “timing variable” is the first value (eg, “0”), the OS (117) receives the command completion report from the secondary storage (106), and proceeds to the next processing without doing anything.
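The behavior controlled by the “timing variable” can be sketched as follows. The class `Os`, its method names, and the list `saves` (which merely records which event triggered each save) are hypothetical; the actual saving process is not modeled.

```python
SAVE_ON_UNSTABLE = 0   # first value: save when leaving the stable state
SAVE_ON_STABLE = 1     # second value: save when entering the stable state


class Os:
    """Hypothetical model of the trigger logic in the OS (117)."""

    def __init__(self, timing_variable):
        self.timing_variable = timing_variable
        self.completion_waiting = []   # command completion waiting queue (805)
        self.saves = []                # events that triggered a save

    def send_command(self, cmd):
        # Checked before transmission: an empty queue means the system is
        # still stable and is about to become unstable.
        if self.timing_variable == SAVE_ON_UNSTABLE and not self.completion_waiting:
            self.saves.append(f"before {cmd}")
        self.completion_waiting.append(cmd)

    def on_completion(self, cmd):
        self.completion_waiting.remove(cmd)
        # Checked after the completion report: an empty queue means the
        # system has just become stable.
        if self.timing_variable == SAVE_ON_STABLE and not self.completion_waiting:
            self.saves.append(f"after {cmd}")


def run(os_model):
    os_model.send_command("cmd1")
    os_model.send_command("cmd2")
    os_model.on_completion("cmd1")
    os_model.on_completion("cmd2")
    os_model.send_command("cmd3")
    return os_model.saves


saves_unstable = run(Os(SAVE_ON_UNSTABLE))
saves_stable = run(Os(SAVE_ON_STABLE))
```

For the same command sequence, the first setting saves immediately before cmd1 and cmd3 are sent, while the second saves once, right after cmd2 completes.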

In the first embodiment, the method in which the nonvolatile main memory (102) is divided and the OS stores the “stable state” in the one area has been described. In the second embodiment, an implementation example in which the nonvolatile main memory (102) is not divided will be described. At this time, differences from the first embodiment will be mainly described, and description of common points with the first embodiment will be omitted or simplified.

FIG. 29 shows a computer system (2900) according to the second embodiment of the present invention.

The differences from the first embodiment are the I / O controller (2901), the flag area (2902) in the firmware storage memory (104), and the OS (2903).

Fig. 30 shows the flag area (2902) in the firmware storage memory (104).

The flag area (2902) has a shutdown completion flag (501), a page table pointer (3001), and an OS failure recovery program vector (503). The page table pointer (3001) is a pointer to a page table (explained later) which is management information of the nonvolatile main memory (102). The shutdown completion flag (501) and the OS failure recovery program vector (503) are the same as in the first embodiment.

Next, Fig. 31 shows the structure of OS (2903).

The OS (2903) has a kernel (3101), a device driver (202), an OS failure recovery program (203), recovery stack pointers (3107) and (3108), and a system state variable (207). The kernel (3101) has a device driver list (204) as in the first embodiment.

The characteristic point of the second embodiment is the structure of the memory management information held by the kernel (3101). The computer system manages the main memory in units of fixed sizes called pages. This size depends on the specifications of the CPU (101). The page size is often 4KiB. Here, “KiB” indicates 1024 bytes. In the second embodiment, there is page management information (3103) for each page (3102). The kernel (3101) manages the page table (3104), the page table pointer (3001), the check code (3105), the page management information list (3109), and the free page list (3110).

The page table (3104) is a list indicating physical page numbers corresponding to virtual addresses. At first glance, it seems that only one page table (3104) is required, but one feature of the present embodiment is that the kernel (3101) has a plurality of page tables (3104).

The page table pointer (3001) is a list of pointers to each page table (3104).

The check code (3105) is an error detection / correction code for checking whether the page table (main memory management information) is updated correctly when it is updated.

The page management information list (3109) is a list of used page management information (3103).

The free page list (3110) is a list of page management information (3103) of unused pages.

OS (2903) manages the stable state storage time table (3111). The stable state storage time table (3111) stores the time (3113) when the “stable state” of each generation is stored.

FIG. 32 shows page management information (3103) and a page table (3104).

The page management information (3103) has a page number (3201), a page attribute (3202), a page physical address (3203), and a reference count (3204).

The page number (3201) is an identification number for uniquely identifying the page (3102).

The page attribute (3202) is information indicating the types of access (for example, read or write) permitted for the corresponding page.

Page physical address (3203) indicates the physical address of the page.

Reference count (3204) indicates the number of page tables (3104) pointing to the page.

Next, the page table (3104) will be described. The page table (3104) has a generation number (3205), a virtual address (3206), a page number (3207), and a page attribute (3208).

The generation number (3205) is a numerical value indicating the “stable state” generation of the nonvolatile main memory (102). Although details will be described later, a page table (3104) is newly generated every time the OS (2903) stores the “stable state”. Therefore, a plurality of “stable states” may exist in the nonvolatile main memory (102). The generation number (3205) is a number assigned to each of the plurality of “stable states”.

The virtual address (3206) is an address in the process space sent by a program (application) running on the OS (2903).

The page number (3207) indicates the number of the page (3102) corresponding to the virtual address (3206).

The page attribute (3208) indicates the type of operation permitted for the page. For example, READ ONLY, READ enabled, or WRITE enabled.

The above virtual address (3206), page number (3207), and page attribute (3208) are collectively referred to as a page table entry (3209). The page table (3104) is a list of page table entries (3209). The conversion between a virtual address and a physical address using the page table (3104) is executed by an address translation function of the CPU (101) called the TLB (Translation Lookaside Buffer). The address conversion may instead be performed by the OS (2903). Further, a physical address may be set in the virtual address (3206) of a page table entry (3209). In this way, an area of the nonvolatile main memory (102) that is not a target of virtual storage can also be placed under the management of the page table and can be a target of “stable state” storage. In the description of the second embodiment, the above data structure is taken as an example, but the present invention is not limited to this data structure and can be applied to other data structures.
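The data structures of FIG. 32 and a software address translation can be sketched as follows. The Python class layout, the 4 KiB page size constant, the string-valued attributes, and the linear search are assumptions made for illustration; an actual CPU performs this lookup in hardware.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # 4 KiB pages (assumed here)


@dataclass
class PageManagementInfo:          # (3103)
    page_number: int               # (3201)
    page_attribute: str            # (3202)
    page_physical_address: int     # (3203)
    reference_count: int           # (3204)


@dataclass
class PageTableEntry:              # (3209)
    virtual_address: int           # (3206), page aligned
    page_number: int               # (3207)
    page_attribute: str            # (3208)


@dataclass
class PageTable:                   # (3104)
    generation_number: int         # (3205)
    entries: list = field(default_factory=list)


def translate(table, pages, virtual_address):
    """Return the physical address for `virtual_address`, or raise KeyError."""
    offset = virtual_address % PAGE_SIZE
    base = virtual_address - offset
    for entry in table.entries:
        if entry.virtual_address == base:
            return pages[entry.page_number].page_physical_address + offset
    raise KeyError(f"no mapping for virtual address {virtual_address:#x}")


pages = {7: PageManagementInfo(7, "WRITE", 0x20000, 1)}
table = PageTable(0, [PageTableEntry(0x4000, 7, "WRITE")])
physical = translate(table, pages, 0x4010)
```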

Figure 33 shows the contents of the page table pointer (3001).

The 0th generation page table pointer (3301), the (G-1) generation page table pointer (3302), and the G generation page table pointer (3303) have a double linked list configuration. Each page table pointer points to the corresponding generation page table (3104). The check code (3105) also has a double linked list structure of check codes (3304) to (3306) of each generation.

FIG. 34 shows the change in the data structure when the OS (2903) stores the “stable state”.

When saving the “stable state”, the OS (2903) first creates a new page table (3401) by copying the page table (3104). Then, the OS (2903) adds a new page table pointer (3402) to the page table pointer (3001). Then, the OS (2903) increments the generation number (3205) of the new page table (3401). Then, the OS (2903) starts using the new page table (3401). At this time, the old and new page tables point to the same page management information (3103). Therefore, the OS (2903) increments the reference count (3204) of each page management information (3103). In the first embodiment, it is necessary to copy all the contents of the active area to the “stable state” area, but in the method of the second embodiment, the OS (2903) changes only the management information.
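The management-information-only update of FIG. 34 can be sketched as follows. The representation of a page table as a dictionary, the list `page_tables` standing in for the page table pointer (3001), and the dictionary of reference counts are hypothetical simplifications.

```python
def save_stable_state(page_tables, reference_counts):
    """Create the next-generation page table without copying any page data."""
    current = page_tables[-1]
    new_table = {
        "generation": current["generation"] + 1,
        # Copy only the mapping (virtual address -> page number).
        "entries": dict(current["entries"]),
    }
    # Old and new tables now share every page, so each page gains a reference.
    for page_number in new_table["entries"].values():
        reference_counts[page_number] += 1
    page_tables.append(new_table)   # corresponds to adding a page table pointer
    return new_table


page_tables = [{"generation": 0, "entries": {0x1000: 3, 0x2000: 5}}]
reference_counts = {3: 1, 5: 1}
new_table = save_stable_state(page_tables, reference_counts)
```

Only the mapping and the counters change in this model; the page contents themselves are untouched, which is why the save is cheaper than the area copy of the first embodiment.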

FIG. 35 shows a flowchart of the process by which the OS (2903) saves the “stable state”.

Step (3501): Processing starts. The trigger for starting the process may be any of the multiple types of triggers described in the first embodiment.

Step (3502): The OS (2903) suppresses transmission of a new command to the secondary storage (106). For this step, any of the several methods described in the first embodiment may be selected.

Step (3503): The OS (2903) checks whether there is an incomplete command among the commands transmitted to the secondary storage (106). This step is not necessary when the OS (2903) selects a method of saving the “stable state” when there is no unfinished command.

Step (3504): The OS (2903) stores the recovery stack pointers (3107) to (3108). The recovery stack pointers (3107) to (3108) are held in the form of a doubly linked list. Therefore, the OS (2903) adds a new recovery stack pointer to the doubly linked list of the recovery stack pointers (3107) to (3108).

Step (3505): The OS (2903) flushes the cache. This step is the same as in the first embodiment.

Step (3506): The OS (2903) copies the page table (3104).

Step (3507): The OS (2903) increments the number of generations of the new page table (3104) (the number of generations for the copied page table). For example, if the number of generations is P (P is an integer), the number of generations is set to (P + 1) by this step.

Step (3508): The OS (2903) increments the reference count (3204) of the page management information (3103) of the page in use.

Step (3509): The OS (2903) sets the page attribute (3205) of the page in use to READ ONLY.

Step (3510): The OS (2903) saves the OS failure recovery program vector (503) in the firmware storage memory (104).

Step (3511): The OS (2903) adds a pointer to the latest generation page table (that is, a newly generated page table) to the page table pointer (3001).

Step (3512): The OS (2903) sets 0xEE to the shutdown completion flag (501) in the firmware storage memory (104).

Step (3513): The OS (2903) generates a check code (3105) based on the newly generated page table (3104), the newly generated page table pointer (3001), and the page management information (3103), and saves the check code (3105). If a power failure occurs while the OS (2903) is executing this processing, the changes to these pieces of information end halfway and consistency among them cannot be maintained. If the check code (3105) exists, however, the consistency among these pieces of information can be checked by using it. That is, the check code (3105) is a code used to check the consistency among these pieces of information. The effect of the check code (3105) does not depend on the type of code used, such as a checksum or CRC (Cyclic Redundancy Check).

Step (3514) = Step (1813) and Step (3515) = Step (1814).

Step (3516): Processing ends.
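For illustration only, the management-information part of FIG. 35 (Steps (3506) to (3509), (3511), and (3513)) can be sketched as follows. The class names, the use of a Python list in place of the doubly linked list of page table pointers, and the choice of CRC32 as the check code are assumptions made for this sketch. Command suppression, the cache flush, and the firmware flags are omitted.

```python
import copy
import zlib
from dataclasses import dataclass, field

READ_ONLY, READ_WRITE = "READ_ONLY", "READ_WRITE"


@dataclass
class PageInfo:                  # stands in for page management information (3103)
    ref_count: int = 1           # reference count (3204)
    attribute: str = READ_WRITE  # page attribute (3205)


@dataclass
class PageTable:                 # stands in for a page table (3104)
    generation: int
    entries: dict                # virtual page -> physical page number


@dataclass
class Memory:
    page_info: dict                                  # physical page -> PageInfo
    generations: list = field(default_factory=list)  # simplified page table pointer (3001)
    check_code: int = 0                              # check code (3105)


def compute_check_code(mem):
    """CRC32 over a canonical text form of the management information (assumed format)."""
    tables = [(t.generation, sorted(t.entries.items())) for t in mem.generations]
    info = sorted((p, i.ref_count, i.attribute) for p, i in mem.page_info.items())
    return zlib.crc32(repr((tables, info)).encode())


def save_stable_state(mem):
    current = mem.generations[-1]
    new_table = copy.deepcopy(current)               # Step (3506): copy the page table
    new_table.generation = current.generation + 1    # Step (3507): P -> P + 1
    for page in set(current.entries.values()):
        mem.page_info[page].ref_count += 1           # Step (3508)
        mem.page_info[page].attribute = READ_ONLY    # Step (3509)
    mem.generations.append(new_table)                # Step (3511)
    mem.check_code = compute_check_code(mem)         # Step (3513)
    return new_table


mem = Memory(page_info={7: PageInfo(), 9: PageInfo()},
             generations=[PageTable(0, {0: 7, 1: 9})])
new = save_stable_state(mem)
```

No page contents are copied in this sketch; only the table, the reference counts, and the attributes change, which is the point made about the second embodiment.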

With reference to FIG. 36, the operation of the firmware when the computer system (2900) is started will be described.

Step (3601) = Step (701), Step (3602) = Step (702), Step (3603) = Step (703), Step (3604) = Step (704), Step (3605) = Step (705), Step (3606) = Step (706), Step (3607) = Step (707), Step (3608) = Step (709), Step (3609) = Step (710), Step (3610) = Step (711), Step (3611) = Step (712), Step (3612) = Step (713), Step (3613) = Step (714), Step (3614) = Step (715), Step (3615) = Step (719), Step (3616) = Step (720), Step (3617) = Step (721), Step (3618) = Step (722), Step (3619) = Step (723), Step (3620) = Step (724), and Step (3621) = Step (725).

FIG. 37 shows a flowchart of the OS failure recovery program (3106).

Step (3701): This step is the step immediately after the firmware processing is completed and jumps to the OS failure recovery program (3106).

Step (3702): In the second embodiment, a plurality of “stable states” of the nonvolatile main memory (102) can be maintained. Therefore, the user can specify the generation of the nonvolatile main memory (102). For example, the OS failure recovery program (3106) can display a screen for accepting designation of the generation of the nonvolatile main memory (102) on a console (not shown). Therefore, in this step, the OS failure recovery program (3106) determines whether or not the user has specified the generation of the nonvolatile main memory (102).

Step (3702): When the generation of the nonvolatile main memory (102) is not designated by the user, the OS failure recovery program (3106) selects and refers to the page table pointer (3301) of the latest generation of the nonvolatile main memory (102) (the G-th generation in the second embodiment).

Step (3703): The OS failure recovery program (3106) reads the check code (3105).

Step (3704): The OS failure recovery program (3106) generates the check code again from the page table (3104), the page table pointer (3001), and the page management information (3103), and compares it with the check code (3105) read in Step (3703). A mismatch between the two check codes means that a power failure occurred while the management information of the nonvolatile main memory was being changed.

Step (3705): If the two check codes match, the OS (2903) uses the G-th generation page table pointer (3301).

Step (3706): If the two check codes do not match, the OS (2903) uses the (G-1) generation page table pointer (3302).

Step (3707): The OS failure recovery program (3106) refers to the page table pointer (3001) of the generation designated by the user (X generation in this embodiment).

Step (3708): The OS (2903) selects and references the Xth generation page table pointer (3301).

Step (3709): The OS failure recovery program (3106) regenerates a check code from the page table (3104), the page table pointer (3001), and the page management information (3103), and compares it with the check code (3105) read in Step (3703).

Step (3710): This step is the same as Step (3805). However, the determination is made with respect to the Xth generation page table (3104).

Step (3711): This step is the same as Step (3805). However, the Xth generation page table (3104) is used.

Step (3712): This step is the same as Step (3805). However, the (X-1)th generation page table pointer (3001) is used.

Step (3813): Subsequent processes are the same as those in FIGS.
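For illustration only, the generation selection of FIG. 37 can be sketched as follows: the check code is regenerated for the chosen generation and compared with the stored one, and on a mismatch the preceding generation is used. Representing each generation's management information as a byte string, using CRC32, and raising an error when no preceding generation exists are assumptions made for this sketch.

```python
import zlib


def check_code(management_info: bytes) -> int:
    """Assumed check code: CRC32 of the serialized management information."""
    return zlib.crc32(management_info)


def select_generation(generations, stored_codes, requested=None):
    """Return the generation number to resume from.

    generations:  list of bytes, indexed by generation number
    stored_codes: check codes saved when each generation was created
    requested:    user-designated generation X, or None for the latest (G)
    """
    x = len(generations) - 1 if requested is None else requested
    if check_code(generations[x]) == stored_codes[x]:
        return x                      # codes match: use generation X (or G)
    if x == 0:
        # Not covered by the text; treated as an error in this sketch.
        raise ValueError("no earlier generation to fall back to")
    return x - 1                      # mismatch: use generation X-1 (or G-1)


gens = [b"table-0", b"table-1", b"table-2"]
codes = [check_code(g) for g in gens]
latest_ok = select_generation(gens, codes)

gens[2] = b"table-2-torn"             # simulate an update cut off by a power failure
after_failure = select_generation(gens, codes)
user_choice = select_generation(gens, codes, requested=1)
```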

Next, the processing performed when the application or the OS (2903) changes a page (3102) will be described using FIG.

In the initial state of FIG. 38, the page table (3104) and the new page table (3401) point to the same page. When the application and/or the OS (2903) tries to change the page (3801), the OS (2903) takes out an unused page, copies the content of the page (3801) to be changed to the page that has been taken out, and sets the copy destination page as a new page (3802). Then, the OS (2903) changes the pointer to the page (3801) in the new page table to a pointer to the new page (3802).

FIG. 39 shows a ladder chart of the page rewriting process.

Step (3901): Application or OS (2903) tries to update page (3801). At this time, the TLB of the CPU (101) refers to the page table entry (3209) of the page (3801) and determines whether or not the page can be rewritten. At this time, as described in FIG. 35, the page attribute is READ ONLY. Therefore, the CPU (101) generates a page fault exception (3902).

Step (3903): The CPU (101) calls a page fault handler in the OS.

Step (3904): The page fault handler is called.

Step (3905): The page fault handler takes out one page from the unused pages and copies the contents of the page (3801) to it. Then, the page fault handler sets the page attribute to READ/WRITE.

Step (3906): The page fault handler changes the pointer to the page (3801) of the new page table to the pointer to the new page (3802).

Step (3907): Processing of the page fault handler ends, and the application or OS (2903) resumes operation.

Step (3908): The application updates the new page (3802). The application or the OS (2903) does not notice that a different page (3802) has been updated in place of the page (3801).
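For illustration only, the page rewriting sequence of FIG. 39 can be sketched as a copy-on-write handler. The `Memory` class, the `PageFault` exception that stands in for the CPU exception, and the 4-byte pages are assumptions made for this sketch. Reference counts are left out because the steps above do not mention them.

```python
READ_ONLY, READ_WRITE = "READ_ONLY", "READ_WRITE"


class PageFault(Exception):
    """Stands in for the page fault exception (3902) raised by the CPU."""


class Memory:
    def __init__(self, num_pages):
        self.pages = [bytearray(4) for _ in range(num_pages)]
        self.attr = [READ_WRITE] * num_pages
        self.free = list(range(num_pages))   # unused pages

    def alloc(self):
        return self.free.pop(0)


def raw_write(mem, table, vpage, data):
    ppage = table[vpage]
    if mem.attr[ppage] == READ_ONLY:         # Step (3901): attribute check
        raise PageFault(vpage)
    mem.pages[ppage][:] = data


def page_fault_handler(mem, table, vpage):
    old = table[vpage]
    new = mem.alloc()                        # Step (3905): take out an unused page
    mem.pages[new][:] = mem.pages[old]       # copy the contents of the old page
    mem.attr[new] = READ_WRITE
    table[vpage] = new                       # Step (3906): repoint the new page table


def write(mem, table, vpage, data):
    try:
        raw_write(mem, table, vpage, data)
    except PageFault:
        page_fault_handler(mem, table, vpage)
        raw_write(mem, table, vpage, data)   # Steps (3907) and (3908): retry the update


mem = Memory(4)
p = mem.alloc()
mem.pages[p][:] = b"old!"
old_table = {0: p}
new_table = dict(old_table)                  # page table copied when the state was saved
mem.attr[p] = READ_ONLY
write(mem, new_table, 0, b"new!")
```

After the write, the old page table still reaches the unchanged page, so the saved “stable state” is preserved while the caller sees only its own update.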

If the OS (2903) keeps multiple page tables, the capacity of the nonvolatile main memory (102) eventually becomes insufficient. This is because the OS (2903) adds data rather than updating the data on the nonvolatile main memory (102) in place. Therefore, to secure the capacity of the nonvolatile main memory (102), it is necessary to return a page table (3104) with a small generation number (that is, an old one) and the pages (3102) it points to to the unused state. The operation will be described with reference to FIG.

Step (4001): This step is the start of processing by the OS (2903). This process is started by at least one of various triggers, such as an instruction from the user, the free capacity of the nonvolatile main memory (102) falling below a certain value, the elapse of a certain time interval, and the number of generations exceeding a certain value.

Step (4002): The OS (2903) refers to the page table (3104) of the oldest generation.

Step (4003): The OS (2903) refers to the page management information (3103) of each page (3102) pointed to by the oldest generation page table (3104), and reads the reference count (3204).

Step (4004): The OS (2903) checks whether the reference count (3204) is equal to 1.

Step (4005): If the reference count (3204) is 1, the page (3102) may be released because no other page table (3104) uses it. The OS (2903) sets the reference count (3204) of the page (3102) to 0 and registers the page in the free page list (3110).

Step (4006): The OS (2903) determines whether all pages registered in the page table (3104) have been confirmed. When confirmation for all pages is completed, the process proceeds to step (4007). If confirmation has not been completed for all pages, the process proceeds to step (4003).

Step (4007): Processing ends.

The process of FIG. 40 may be repeated until a predetermined condition is satisfied. For example, if the process of FIG. 40 was started because the free capacity of the nonvolatile main memory (102) had fallen below a certain value, and the free capacity is still below that value after the process has been executed, the process of FIG. 40 is executed again.
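For illustration only, the release of the oldest generation in FIG. 40 can be sketched as follows. The dictionaries and the function name are assumptions made for this sketch. Decrementing the count of a page that is still shared is also an assumption, since the steps above describe only the case in which the reference count is 1.

```python
def release_oldest_generation(page_tables, ref_count, free_list):
    """Free the pages that only the oldest page table refers to.

    page_tables: list of dicts (virtual page -> physical page), oldest first
    ref_count:   dict, physical page -> reference count (3204)
    free_list:   list standing in for the free page list (3110)
    """
    oldest = page_tables.pop(0)          # Step (4002): oldest generation
    for page in oldest.values():         # Steps (4003) and (4006): visit every page
        if ref_count[page] == 1:         # Step (4004)
            ref_count[page] = 0          # Step (4005): release the page
            free_list.append(page)
        else:
            ref_count[page] -= 1         # assumption: drop the oldest table's reference
    return free_list


tables = [{0: 7, 1: 9}, {0: 7, 1: 12}]   # page 9 was replaced by page 12 after a save
refs = {7: 2, 9: 1, 12: 1}
free_pages = release_oldest_generation(tables, refs, [])
```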

In FIG. 40, the OS (2903) virtually erases the data by setting the old generation page to the “unused” state. On the other hand, a process of writing the contents of the old generation page (3102) to the secondary storage (106) is also conceivable. FIG. 41 is a flowchart of the processing.

Step (4101) = Step (4001).

Step (4102): The OS (2903) selects the oldest page table (3104).

Step (4103): The OS (2903) writes the selected page table (3104) and the page (3102) pointed to to the secondary storage (106).

Step (4104) = Step (4003), Step (4105) = Step (4004), Step (4106) = Step (4005), Step (4107) = Step (4006). Then, the process ends at step (4108).

As described above, in the second embodiment, the OS (2903) newly adds updated pages (3102) to the nonvolatile main memory (102) that is in the “stable state”. Since the entire nonvolatile main memory (102) is not changed at once, the updated range is small when viewed from the entire nonvolatile main memory (102), even though the OS (2903) adds updated pages (3102). Therefore, it can be said that the OS (2903) uses the capacity of the nonvolatile main memory (102) efficiently. In the method of the first embodiment, when the nonvolatile main memory is divided into m areas, the capacity required for the nonvolatile main memory (102) is more than m times the capacity required for one active area. In the method of the second embodiment, the capacity required for the nonvolatile main memory (102) may be approximately the same as the capacity required for one active area. Further, in the method of the first embodiment, it is necessary to copy the entire contents of the active area to another area, but in the method of the second embodiment, only the management information in the nonvolatile main memory (102) needs to be updated. For this reason, the “stable state” storage operation hardly affects the performance of the computer system (2900).

In Embodiments 1 and 2, the “stable state” of the nonvolatile main memory (102) is saved, and the computer system (2900) resumes operation while ensuring consistency when a failure occurs. In the third embodiment, the past state of the file stored in the secondary storage (106) is also extracted.

Some file systems that store file change history are known. For example, Apple (registered trademark) HFS+ stores a file change history, and the user can read out a desired file as of an arbitrary point in the past. Combining the present technology with such a so-called write-once file system allows the computer system (2900) to resume operation from a certain state in the past.

FIG. 42 shows the management information of the write-once file system assumed in the third embodiment.

The OS (2903) stores this management information in the secondary storage (106). The inode map (4201) is a set of pointers (4202) to inodes. Here, an “inode” is management information in which information related to a file (creation date, update time, access permission information, etc.) is stored. inode0 (4203) is the 0th inode in the secondary storage (106). The inode0 (4203) includes an inode number (4205) and a sub-inode pointer (4206). The inode number (4205) is a number uniquely assigned to the inode (4203). The sub-inode (4208) is file management information at a certain point in the past. Since a write-once file system is assumed in the third embodiment, a plurality of inodes (4203) may exist for one file. However, the inode (4203) is typically designed to be stored with a specific size in a specific area of the secondary storage (106), whereas the management information must be variable in size when the OS (2903) saves the change history. Therefore, the inode (4203) has a fixed size and holds only a pointer to the actual management information, and the actual file information is saved in a data structure of indefinite address and indefinite size called a sub-inode (4206). The sub-inode (4206) has an inode number, an update time, a version number, and a pointer to a file block (4207) that holds the contents of the file. Furthermore, the sub-inodes (4208) are connected to each other in a doubly linked list.

FIG. 43 shows the sub-inode (4206).

The sub-inode (4206) has an inode number (4301), an update time (4302), a version (4303), and a pointer to a file block (4304) (the forward pointer and the backward pointer are omitted).

The inode number (4301) is a number uniquely assigned to the inode (4203). The sub-inode (4206) holds the same number.

The update time (4302) indicates the time when the version of the file was updated.

The version (4303) is a number uniquely assigned to the sub-inode (4208). Each time the file is updated, the number increases by one. The update timing of the file and the “stable state” storage timing of the nonvolatile main memory (102) may or may not match. In the third embodiment, when the file is updated, the OS (2903) duplicates the latest sub-inode (4206). The OS (2903) then changes the pointer to the file block (4207) in the duplicated sub-inode (4206), increments the version number, and adds the duplicated sub-inode (4206) to the tail of the doubly linked list of sub-inodes (4206). The sub-inode pointer (4206) of the inode (4203) points to the head (the oldest sub-inode (4206)) and the tail (the newest sub-inode (4206)) of the doubly linked list of sub-inodes (4206).
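For illustration only, the sub-inode list and the update operation described above can be sketched as follows. The class and field names, the string used in place of a file block pointer, and the choice of 0 as the first version number are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class SubInode:
    """Hypothetical stand-in for a sub-inode."""
    inode_number: int
    update_time: float
    version: int
    file_block: str                      # stands in for the pointer to a file block
    prev: Optional["SubInode"] = None    # backward pointer
    next: Optional["SubInode"] = None    # forward pointer


@dataclass(eq=False)
class Inode:
    """Fixed-size inode holding only pointers to the head and tail sub-inodes."""
    inode_number: int
    head: Optional[SubInode] = None      # oldest sub-inode
    tail: Optional[SubInode] = None      # newest sub-inode


def update_file(inode, new_block, now):
    """Append a new version: duplicate the latest sub-inode with a new block pointer."""
    latest = inode.tail
    version = 0 if latest is None else latest.version + 1
    sub = SubInode(inode.inode_number, now, version, new_block, prev=latest)
    if latest is None:
        inode.head = sub
    else:
        latest.next = sub
    inode.tail = sub
    return sub


inode = Inode(5)
first = update_file(inode, "block-A", 100.0)
second = update_file(inode, "block-B", 200.0)
```

Older sub-inodes are never modified apart from their forward pointer, so every earlier version of the file remains reachable from the head of the list.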

Next, FIG. 44 shows a flowchart of processing performed when the user instructs the OS (2903) to open a certain file in the “stable state” of the computer system (2900) at a certain time t in the past.

Step (4401): The user instructs the OS (2903) to open the file.

Step (4402): The OS (2903) obtains the storage time (3113) of the “stable state” from which operation was resumed. The OS (2903) then adds, to the storage time (3113), the time elapsed since operation was resumed from the “stable state”. The result is set as the variable t.

Step (4403): The OS (2903) takes out the latest sub-inode (4206). In the third embodiment, the version is V.

Step (4404): The OS (2903) substitutes V for the variable v indicating the version.

Step (4405): The OS (2903) extracts the update time T(v) of version v from the sub-inode (4206).

Step (4406): The OS (2903) determines whether t >= T(v) (that is, whether the user-specified time t is the same as or later than the update time T(v)).

Step (4407): If t >= T(v), the OS (2903) opens the file of version v, which was the latest version at the time of operation.

Step (4408): Conversely, if t < T(v), the file of version v did not exist at the time of operation. The OS (2903) then determines whether v is 0.

Step (4409): If v is 0, it means that the file did not exist during operation. Therefore, the OS (2903) reports to the user “File not Found”.

Step (4410): If v is not 0, v is decremented and the process returns to step (4405).

Step (4411): The process ends.

Note that this flowchart handles the case in which the “stable state” storage time of the nonvolatile main memory (102) does not match the file update time.
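For illustration only, the version search of FIG. 44 can be sketched as follows. Holding the update times T(v) in a list indexed by version, returning `None` for “File not Found”, and the function names are assumptions made for this sketch. At least one version is assumed to exist.

```python
def resume_time(storage_time, elapsed):
    """Step (4402): t = storage time of the stable state + time elapsed since resuming."""
    return storage_time + elapsed


def select_version(update_times, t):
    """Return the version to open at time t, or None if the file did not exist yet.

    update_times[v] is the update time T(v) of version v.
    """
    v = len(update_times) - 1        # Steps (4403) and (4404): start from the latest version V
    while True:
        if t >= update_times[v]:     # Steps (4405) and (4406)
            return v                 # Step (4407): open version v
        if v == 0:                   # Step (4408)
            return None              # Step (4409): "File not Found"
        v -= 1                       # Step (4410)


T = [10.0, 20.0, 30.0]
t = resume_time(storage_time=18.0, elapsed=7.0)
chosen = select_version(T, t)
```

Because the comparison uses the update time of each version rather than the time at which the “stable state” was saved, the search works even when the two do not coincide.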

As described above, by combining the “stable state” storage method and the write-once file system, the user can resume operation from a past state (of the nonvolatile main memory (102) and the files) of the computer system (2900). Thus, the state of the entire computer system (including the main memory and the file system) can be saved and operation can be resumed without using a logical partitioning technique. The performance of this embodiment may be compared with that of a general logical partitioning technique. The performance here refers to two kinds of performance: normal operation performance and state saving performance. Note that the present invention can also be applied to a write-once file system having a data structure different from that shown in FIG.

Claims

A computer system connected to a peripheral device,
A processor that executes an operating system (OS) that sends commands to the peripheral device;
A nonvolatile main memory connected to the processor,
The OS performs the following processes (A) and (B):
(A) detecting a transition to a stable state in which none of the commands transmitted to the peripheral device is in progress;
(B) storing stable state information, which is information including data stored in the main memory in the stable state and an instruction address when transitioning to the stable state, in the main memory;
Computer system.
An I / O controller connected to the processor and the main memory and responsible for controlling the main memory, controlling communication between the processor and the main memory, and controlling communication between the peripheral device and the main memory;
The peripheral device is a secondary storage;
The command is an I / O command to the secondary storage,
The main memory has a plurality of areas of equal capacity;
One of the plurality of areas stores the latest OS information that is information currently used by the OS,
The OS has a write-once file system that saves file change history,
The OS causes the I / O controller to copy the latest OS information from the area where the latest OS information is stored to any other of the plurality of areas as the process of (B),
After the failure occurs, the OS starts up using the stable state information from the area where the stable state information desired by the user is stored,
When opening a user-desired file, the OS compares the copy time, which is the time when the stable state information was copied, with the change history of the file, and controls file output based on the comparison result.
The computer system according to claim 1.
An I / O controller connected to the processor and the main memory and responsible for controlling the main memory, controlling communication between the processor and the main memory, and controlling communication between the peripheral device and the main memory;
The peripheral device is a secondary storage;
The command is an I / O command to the secondary storage,
There is a page table which is information including attributes for each page,
On the OS, an application that updates the page registered in the page information operates,
When the OS performs the process (A), in the process (B), the page table is copied, and the attribute of the page registered in the copy source page table is updated to READ ONLY,
When the application or the OS tries to update a page registered in the page information, the processor generates a page fault and calls a page fault handler.
The called OS performs the following processing (C) to (I):
(C) secure unused pages;
(D) Copy the data stored in the page to be updated by the application or the OS to the reserved page;
(E) register the reserved page in the page table;
(F) referencing the oldest page table among the page tables stored in the main memory;
(G) For each page registered in the oldest page table, it is determined whether or not it has been registered in another page table;
(H) Set pages not registered in other page tables to unused;
(I) Write the data in the page registered in the referenced page table to the secondary storage.
The application or the OS updates the secured page,
After a failure occurs, the OS uses the copy source page table included in the stable state information.
The computer system according to claim 1.
An I / O controller connected to the processor and the main memory for controlling the main memory;
The main memory has a plurality of areas of equal capacity;
During the period from when the OS is started until the process of (A) is performed, when the I / O controller receives data to be written to the main memory from the OS, the I / O controller writes the data to all of the plurality of areas. In this case, all the areas are areas for storing the latest OS information, which is the information currently used by the OS,
When the processes (A) and (B) are performed, the OS instructs the I / O controller to write data to a specific area of the plurality of areas. The area where the data is written is an area for storing the latest OS information, and any other area is an area for storing the stable state information.
The computer system according to claim 1.
The OS instructs the I / O controller to copy the latest OS information from the area where the latest OS information is stored to another area as the process of (B).
The computer system according to claim 4.
There is a page table which is information including attributes for each page,
On the OS, an application that updates the page registered in the page information operates,
When the OS performs the process (A), in the process (B), the page table is copied, and the attribute of the page registered in the copy source page table is updated to READ ONLY,
When the application or the OS tries to update a page registered in the page information, the processor generates a page fault and calls a page fault handler.
The called OS performs the following processing (C) to (E):
(C) secure unused pages;
(D) Copy the data stored in the page to be updated by the application or the OS to the reserved page;
(E) register the reserved page in the page table;
The application or the OS updates the secured page,
After a failure occurs, the OS uses the copy source page table included in the stable state information.
The computer system according to claim 1, 4 or 5.
The OS performs the following processes (F) to (H):
(F) referencing the oldest page table among the page tables stored in the main memory;
(G) For each page registered in the oldest page table, it is determined whether or not it has been registered in another page table;
(H) Set pages not registered in other page tables to unused.
The computer system according to claim 6.
The peripheral device is a secondary storage;
The OS further performs the following processing (I):
(I) Write the data in the page registered in the referenced page table to the secondary storage.
The computer system according to claim 7.
The OS executes the processes (F) to (H) and / or the process (I) when the free space in the main memory falls below a predetermined value.
The computer system according to claim 7 or 8.
The OS has a write-once file system that saves file change history,
When the OS is restarted using the stable state information desired by the user and a file is opened, the OS compares the storage time of the stable state information with the change history of the file and controls file output based on the comparison result.
The computer system according to claim 1.
In the file output control, the OS selects the newest file version among the file versions older than the time obtained by adding the elapsed time since restarting the operation to the storage time of the stable state information, and opens the file of the selected version,
The computer system according to claim 10.
If there is no file version older than the time obtained by adding the elapsed time since restarting the operation to the storage time of the stable state information, the OS responds that the file does not exist.
The computer system according to claim 10.
In accordance with a stable state saving instruction from the user, the OS performs the process of (A) by suppressing the transmission of new commands to the peripheral device and monitoring the completion of all commands to the peripheral device, and performs the process of (B) immediately after the process of (A).
The computer system according to claim 1.
The OS has a state variable indicating whether or not it is in a stable state, and performs the process of (B) when it is detected that a transition from a stable state to an unstable state is performed.
The computer system according to claim 1.
A computer program executed by a computer system that is connected to a peripheral device and has a nonvolatile main memory,
Sending a command to the peripheral device;
Detecting a transition to a stable state in which none of the commands sent to the peripheral device is in progress;
Storing stable state information, which is information including data stored in the main memory in the stable state and an instruction address when transitioning to the stable state, in the main memory;
A computer program that causes the computer system to execute the above.