WO2006082657A1 - Multi-CPU computer and method of restarting system - Google Patents
Multi-CPU computer and method of restarting system
- Publication number
- WO2006082657A1 (PCT/JP2005/001770)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cpu
- error
- operating system
- processing
- error information
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0721—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
- G06F11/0724—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU] in a multiprocessor or a multi-core unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0775—Content or structure details of the error report, e.g. specific table structure, specific error fields
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0778—Dumping, i.e. gathering error/state information after a fault for later diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0784—Routing of error reports, e.g. with a specific transmission path or data flow
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1438—Restarting or rejuvenating
Definitions
- UNIX registered trademark
- IA server machines equipped with Intel microprocessors
- IA servers have improved performance through the use of multiprocessors. As for servers used in mission-critical systems, it is important to improve not only performance but also reliability. In particular, when a fatal hardware error occurs, it is necessary to prevent system runaway and data corruption. Therefore, when a fatal hardware error occurred, the system was stopped urgently.
- However, the suspension period cannot be prolonged. Therefore, even when the system is stopped abruptly due to a hardware error, it is an important requirement that a multiprocessor server isolate only the part where the error occurred and restart the system quickly and automatically.
- An example of a hardware error that occurs in the server is a continuous occurrence of multi-bit errors in a CPU (Central Processing Unit) cache.
- the CPU in which the error occurred sets error information in a register.
- the CPU then generates a trap and notifies the operating system of the error.
- the operating system executes trap processing on the CPU in which an error has occurred.
- In the trap processing, the operating system refers to the hardware register and obtains the error information.
- Panic processing is an emergency stop and restart of the system to prevent system runaway and data corruption.
- the CPU displays / records error information, performs file system synchronization processing, collects a memory dump, and then restarts the system.
- FIG. 8 is a diagram showing a conventional restart method when an error occurs.
- the server 900 has a CPU 910 and a CPU 920.
- the CPU number of CPU 910 is “CPU # 0”, and the CPU number of CPU 920 is “CPU # 1”.
- the processes executed by the CPU 910 and the CPU 920 include a process executed by the hardware logic circuit 901 and a process executed by the operating system 902.
- the CPU 910 and the CPU 920 have error notification circuits 911 and 921 as processing functions executed by the hardware logic circuit 901. Further, the error notification circuits 911 and 921 notify the operating system of information indicating a hardware error that has occurred in the CPU.
- the notification of error information to the operating system is a process of passing error information to a process that performs error processing of the operating system.
- the CPU 910, 920 sets error information in a predetermined register when a hardware error occurs, and generates a trap.
- the error information includes the error type, the CPU number, and the address of the data where the error occurred. The error information is notified when a process running under the operating system refers to the register that stores it.
- the processing functions executed in accordance with the operating system 902 include trap processing functions 912 and 922 and panic processing functions 913 and 923.
- the trap processing functions 912 and 922 are functions for acquiring an error type, a CPU number, an address, and the like by referring to a register in which error information is stored.
- Panic processing functions 913 and 923 are functions that display and record error information, synchronize file systems, collect memory dumps, and restart the system.
- the error notification circuit 911 of the CPU 910 notifies the operating system 902 of the error information.
- trap processing is performed by the trap processing function 912, which the CPU 910 executes according to the operating system 902, and information such as the error type, CPU number, and address is acquired by the operating system 902.
- the panic processing function 913 displays and records error information, performs file system synchronization processing, collects a memory dump, and then restarts the system.
- the diagnostic processor that has collected the fault information from the faulty processor notifies the host processor of the fault occurrence, and the host processor initializes the faulty processor and restarts it, thereby returning the faulty processor to the operating state.
- Patent Document 2 Japanese Patent Document 2
- As a failure information collection technique for when a failure occurs in a multi-CPU system, there is a technique for shortening the failure information collection time by having a plurality of processors collect failure information in parallel.
- the processor that has detected the occurrence of a fault instructs other processors to collect fault information, and each processor that receives the instruction collects fault information (see, for example, Patent Document 3).
- Patent Document 1 Japanese Patent Laid-Open No. 4-340631
- Patent Document 2 Japanese Patent Laid-Open No. 2-71336
- Patent Document 3 Japanese Patent Laid-Open No. 11-338838
- the diagnostic processor collects fault information from other processors, and the host processor initializes and restarts the faulty processor.
- each processor is operating individually and can be restarted independently.
- many multi-CPU computers run multiple CPUs with a common operating system. In such a multi-CPU computer, there is data shared by multiple CPUs, and restarting a single CPU requires processing that ensures data consistency. Therefore, it is difficult to apply the technique described in Patent Document 2 to a multi-CPU computer in which multiple CPUs operate with a common operating system.
- failure information is collected by a processor different from the processor in which the failure has occurred, so failure information can be collected by a normal processor.
- the system is restarted by the failed processor. As a result, the restart process is executed on a processor that is not operating normally, and the system may not be restarted correctly. If the restart fails, the system downtime will be prolonged and operational efficiency will deteriorate.
- the present invention has been made in view of these points, and an object of the present invention is to provide a multi-CPU computer and a system restart method with which, even when a fatal CPU error occurs, error processing can be reliably executed and the system can be restarted.
- Means for solving the problem
- the present invention provides a multi-CPU computer equipped with a plurality of CPUs operating on a common operating system 4 as shown in FIG.
- the multi-CPU computer has a nonvolatile storage device 1, a first CPU 2, and a second CPU 3.
- the first CPU 2 includes a first error notification circuit 2a that notifies error information to other CPUs when a hardware error occurs.
- the second CPU 3 incorporates a second error notification circuit 3a that acquires the error information notified from the first CPU 2 and notifies the operating system 4 of the error information.
- the storage processing of the fault information including the error information to the storage device and the restart processing of the system are executed according to the operating system 4.
- when a hardware error occurs in the first CPU, the first error notification circuit incorporated in the first CPU notifies the second CPU of error information.
- the second error notification circuit incorporated in the second CPU acquires the error information notified from the first CPU and notifies the operating system of the error information.
- when the error information has been notified to the operating system by the second error notification circuit, the second CPU, according to the operating system, stores the failure information including the error information in a nonvolatile storage device and executes the process of restarting the system. A system restart method with these steps is provided.
- in the present invention, error information is received from the CPU in which the hardware error has occurred, and another CPU stores the fault information and restarts the system. As a result, even if a fatal error occurs in one CPU, the processing from fault information storage up to the system restart can be executed reliably.
- FIG. 1 is a diagram showing an outline of the present embodiment.
- FIG. 2 is a diagram showing an example of a hardware configuration of a server used for implementing the present invention.
- FIG. 3 is a block diagram showing the main functions of the server.
- FIG. 4 is a diagram showing the relationship between the CPU error notification circuit and the error handling function of the operating system.
- FIG. 5 is a diagram showing an example data structure of error information.
- FIG. 6 is a sequence diagram showing a case where error processing is normally executed by another CPU.
- FIG. 7 is a sequence diagram showing a case where error processing by another CPU fails.
- FIG. 8 is a diagram showing a conventional restart method when an error occurs.
- FIG. 1 is a diagram showing an outline of the present embodiment.
- FIG. 1 shows an outline of the functions of the multi-CPU computer according to the present embodiment.
- the multi-CPU computer has a storage device 1, a first CPU 2, and a second CPU 3.
- the first CPU 2 and the second CPU 3 operate on a common operating system 4.
- the storage device 1 is non-volatile and can retain data even when the power is shut off.
- a magnetic storage device such as a hard disk drive can be used.
- the first CPU 2 incorporates a first error notification circuit 2a that notifies other CPUs of error information when a hardware error occurs.
- hardware errors include cache memory multi-bit errors.
- the error information includes, for example, the error type, the CPU number of the CPU in which the error has occurred, and the address of the data in which the error has occurred.
- the second CPU 3 incorporates a second error notification circuit 3 a that acquires error information notified from the first CPU 2 and notifies the operating system 4 of the error information.
- the second CPU 3 executes, according to the operating system 4, processing for storing the failure information including the error information in the storage device 1 (step S1) and system restart processing (step S2).
- the failure information can include, for example, memory dump information in addition to error information.
- when a hardware error occurs in the first CPU 2, the error information is notified to the second CPU 3 by the first error notification circuit 2a of the first CPU 2. Then, the error information notified from the first CPU 2 is acquired by the second error notification circuit 3a of the second CPU 3, and the error information is notified to the operating system 4. Then, according to the operating system 4, the second CPU 3 executes processing for storing fault information including error information in the storage device 1 (step S1) and system restart processing (step S2). This restarts the entire multi-CPU computer.
- the first CPU 2 can stop the processing it executes for a certain period of time in accordance with the operating system 4 after notifying the error information. By temporarily stopping the processing of the CPU in which the error has occurred in this way, it is possible to prevent the failed first CPU 2 from affecting the normal processing of the second CPU 3. As a result, the error processing by the second CPU 3 can be performed reliably.
- FIG. 1 shows the case where an error occurs in the first CPU 2 and the error processing is executed in the second CPU 3.
- each CPU can incorporate both the first error notification circuit 2a and the second error notification circuit 3a. This makes it possible for other CPUs to perform error handling regardless of which CPU generates the error.
- the details of the embodiment of the present invention will be described below by taking as an example a multi-CPU computer in which every CPU can execute error processing based on error information from other CPUs.
- FIG. 2 is a diagram illustrating a hardware configuration example of a server used in the present embodiment.
- the server 100 is a UNIX server, for example, and has a plurality of CPUs 110, 120, 130, and 140. Each of the CPUs 110, 120, 130, and 140 is assigned a CPU number that uniquely identifies it within the server 100.
- the CPU number of the CPU 110 is “CPU # 0”.
- the CPU number of CPU120 is “CPU # 1”.
- the CPU number of the CPU 130 is “CPU # 2”.
- the CPU number of the CPU 140 is “CPU # 3”.
- hard disk drive (HDD: Hard Disk Drive)
- the shared memory 101 temporarily stores at least a part of the operating system program and the application programs to be executed by the CPUs 110, 120, 130, and 140. In addition, various data necessary for processing by the CPUs 110, 120, 130, and 140 are stored in the shared memory 101.
- the HDD 102 stores an operating system and application programs.
- the communication interface 103 is connected to the network 10. The communication interface 103 transmits / receives data to / from other computers via the network 10.
- a monitor 11 is connected to the graphic processing device 104.
- the graphic processing device 104 displays an image on the screen of the monitor 11 according to instructions from the CPUs 110, 120, 130, and 140.
- a keyboard 12 and a mouse 13 are connected to the input interface 105.
- the input interface 105 transmits signals sent from the keyboard 12 and mouse 13 to the CPUs 110, 120, 130, and 140 via the system bus 106.
- FIG. 3 is a block diagram showing the main functions of the server.
- the server 100 has a function realized by the hardware logic circuit 100a and a function realized by the CPUs 110, 120, 130, and 140 executing software such as the operating system 200.
- the hardware function is shown in the upper part, and the software function is shown in the lower part.
- the functions of the hardware logic circuit 100a are mainly the processing operation function of each of the CPUs 110, 120, 130, and 140, the data storage function of the shared memory 101, and the data storage function of the HDD 102.
- An error notification circuit 111, 121, 131, 141 is provided for each CPU 110, 120, 130, 140.
- the error notification circuits 111, 121, 131, and 141 are processing functions that notify error information to the operating system 200 and exchange error information with other CPUs.
- inter-CPU communication technology using the inter-CPU communication area 101a of the shared memory 101 is described in, for example, Japanese Patent Laid-Open Nos. 6-243104, 6-243101, and 6-332864.
- the panic processing unit 220 includes an error information display/recording unit 221, a file system synchronization unit 222, a memory dump unit 223, and a system restart unit 224.
- the error information display / recording unit 221 displays error information and performs recording processing on the HDD 102.
- the file system synchronization unit 222 performs processing such as checking file system consistency and correcting inconsistencies.
- the memory dump unit 223 performs data dump processing in the shared memory 101.
- the system restart unit 224 performs system restart processing.
- Other functions of the operating system 200 include a file management unit 240, a memory management unit 241, a process management unit 242, an interrupt processing unit 243, a system call 244, a driver 245, a scheduler 246, a shell 247, a daemon 248, a command processor 249, a library 250, and the like.
- Each function of the operating system 200 is realized individually on each of the CPUs 110, 120, 130, and 140 when that CPU executes the program.
- FIG. 4 is a diagram showing the relationship between the CPU error notification circuit and the operating system error processing function.
- FIG. 4 shows the CPU 110 and the CPU 120 and the error notification processing in the operating systems 201 and 202 executed by the CPUs 110 and 120.
- the error information 31 of an error occurring in the CPU 110 is notified to the operating system 202 executed by the CPU 120 via the error notification circuit 121 of the CPU 120, and is also notified to the operating system 201 executed by the CPU 110.
- the error information 32 of an error occurring in the CPU 120 is notified to the operating system 201 executed by the CPU 110 via the error notification circuit 111 of the CPU 110, and is also notified to the operating system 202 executed by the CPU 120.
- the trap processing unit 211 receives error information of an error that has occurred in the CPU 110. In that case, the trap processing unit 211 temporarily stops the process executed by the CPU 110. When stopping the processing of the CPU 110, the trap processing unit 211 can use the function if the hardware has a function of temporarily stopping the operation of the CPU, for example. In addition, the trap processing unit 211 can stop other processing in the CPU 110 by executing simple loop processing with software.
- the processing of the CPU 110 is temporarily stopped in order to preserve the information present at the time of the error. That is, if the CPU 110 continues normal operation after an error occurs, information in memory that is useful for identifying the cause of the error may be overwritten with other information. Therefore, by temporarily stopping the processing of the CPU 110, it is possible to obtain accurate information about the state when the error occurred. In addition, stopping the failed CPU 110 makes it possible to execute error processing stably in the CPU 120.
- the trap processing unit 211 executes trap processing in two cases: when it receives error information of another CPU 120 from the error notification circuit 111 of the CPU 110, and when it receives error information of the CPU 110 and resumes processing after temporarily stopping it. Specifically, the trap processing unit 211 refers to a predetermined register in the CPU 110 and acquires an error type, a CPU number, an address, and the like. The trap processing unit 211 passes the error information to the panic processing unit 231 after completing the trap processing.
- the panic processing unit 231 performs a panic process.
- the error information display/recording unit 221 displays the error information on the monitor and stores the error information in the HDD 102.
- the file system synchronization unit 222 synchronizes the file system with the actual file contents (updates the structure data of the file system held in the HDD 102 in synchronization with the actual file update).
- the memory dump unit 223 performs dump processing of the contents of the shared memory 101 (stores the contents of the shared memory 101 in the HDD 102).
- the system restart unit 224 restarts the entire system of the server 100.
- the operating system 202 executed by the CPU 120 also has the same processing function as the operating system 201 executed by the CPU 110.
- FIG. 5 is a diagram illustrating an example data structure of error information.
- Error information 31 includes error type, CPU number, address, and so on.
- the error type is represented by an identification code that indicates the type of error that occurred.
- the CPU number is the identification number of the CPU where the error occurred.
- the address is an address of data in which an error has occurred.
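The three fields of the error information can be pictured as a small record. The following is a minimal sketch only: the class names, the numeric code for the error type, and the sample address are hypothetical and do not come from the source, which names only the fields themselves and the cache multi-bit error as an example.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    # Hypothetical identification code; the source gives only the
    # cache multi-bit error as an example of a hardware error.
    CACHE_MULTI_BIT = 1


@dataclass(frozen=True)
class ErrorInfo:
    error_type: ErrorType  # identification code for the kind of error
    cpu_number: int        # CPU in which the error occurred (0 for "CPU # 0")
    address: int           # address of the data in which the error occurred


# Illustrative record for a cache multi-bit error on CPU # 0.
info = ErrorInfo(ErrorType.CACHE_MULTI_BIT, cpu_number=0, address=0x1F40)
```

The record is frozen so that it cannot be modified after the error is captured, mirroring the goal of preserving the state at the time of the error.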
- FIG. 6 is a sequence diagram showing a case where error processing is normally executed by another CPU.
- the error notification circuit 111 of the CPU 110 searches for another normal CPU (step S11). For example, when a fatal error such as a cache multi-bit error occurs in the CPU 110, the error notification circuit 111 searches for a normal CPU. Specifically, the error notification circuit 111 selects, as the normal CPU, the CPU with the smallest CPU number from among the CPUs in which no error has been detected. Which CPUs have not detected an error can be determined by storing information on the status of each CPU (whether or not it is operating normally) in the shared memory 101 and referring to that status.
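The selection rule in step S11 can be sketched in software as follows. In the source this is done by the error notification circuit in hardware; the function name and the dictionary that stands in for the per-CPU status held in the shared memory 101 are assumptions made for illustration.

```python
from typing import Dict, Optional


def select_normal_cpu(status: Dict[int, bool], failing_cpu: int) -> Optional[int]:
    """Return the smallest CPU number marked as operating normally.

    `status` maps a CPU number to True when that CPU is operating normally.
    The failing CPU is never selected. Returns None if no normal CPU exists.
    """
    candidates = [cpu for cpu, normal in status.items()
                  if normal and cpu != failing_cpu]
    return min(candidates) if candidates else None


# Hypothetical status table: CPU # 0 has failed, CPUs # 1 to # 3 are normal.
cpu_status = {0: False, 1: True, 2: True, 3: True}
selected = select_normal_cpu(cpu_status, failing_cpu=0)
```

With this status table the sketch selects CPU # 1, which matches the example in the text where the CPU 110 hands its error to the CPU 120.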
- the error notification circuit 111 of the CPU 110 notifies error information to the CPU 120 selected in step S 11 (step S 12). That is, the error notification circuit 111 writes error information in the inter-CPU communication area 101a of the shared memory 101, and the error notification circuit 121 of the CPU 120 reads the error information. As a result, the CPU 120 is notified of the occurrence of an error by the CPU 110.
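The write-then-read exchange through the inter-CPU communication area in step S12 can be modeled as a per-CPU mailbox. This is a toy sketch under stated assumptions: the class `InterCpuArea`, its slot layout, and the lock are hypothetical stand-ins, and the source does not describe how the area 101a is organized.

```python
import threading
from typing import Dict, Optional, Tuple

# (error type, CPU number, address), following the fields named in the text.
ErrorRecord = Tuple[str, int, int]


class InterCpuArea:
    """Toy stand-in for the inter-CPU communication area 101a."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._slots: Dict[int, ErrorRecord] = {}

    def write(self, target_cpu: int, record: ErrorRecord) -> None:
        # The CPU in which the error occurred leaves a record for the target CPU.
        with self._lock:
            self._slots[target_cpu] = record

    def read(self, cpu: int) -> Optional[ErrorRecord]:
        # The target CPU takes its record; the slot is emptied afterwards.
        with self._lock:
            return self._slots.pop(cpu, None)


area = InterCpuArea()
area.write(target_cpu=1, record=("cache multi-bit error", 0, 0x1F40))
received = area.read(cpu=1)
```

Emptying the slot on read is a design choice of this sketch so that the same error is not processed twice.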
- the error notification circuit 111 of the CPU 110 notifies error information of an error that has occurred in the CPU 110 to the operating system 201 executed by the CPU 110 (step S13). Specifically, the error notification circuit 111 stores error information such as an error type, the CPU number of the CPU in which the error has occurred, and an address in a predetermined register. After that, the error notification circuit 111 generates a trap (activates the trap processing unit 211 of the operating system 201). The trap processing unit 211 of the operating system 201 refers to the contents of the register in which the error information is written. As a result, the error information is notified to the operating system 201.
- the trap processing unit 211 suspends normal processing of the CPU 110 (all processing except the minimum processing for resuming the stopped processing) (step S14).
- the error notification circuit 121 of the CPU 120 notifies the error information of the CPU 110 to the operating system 202 executed by the CPU 120 (step S15). This is a process in which the normal CPU 120 sets error information such as the error type, the CPU number of the CPU in which the error occurred, and the address in the register, generates a trap, and notifies the operating system of the occurrence of the error.
- a panic process is performed by the operating system 202 (step S17).
- each processing function in the panic processing unit 232 performs the following processing.
- the error information display/recording unit displays error information of the CPU 110 and records it.
- the file system synchronization unit performs file system synchronization processing.
- the memory dump unit collects the memory dump.
- the system restart unit performs system restart processing after the completion of other panic processing. As a result, the server 100 is shut down and then restarted.
- the error processing is executed by the other CPU 120, so that it is possible to reliably collect error information and a memory dump and restart the system.
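The ordering of the panic processing, with the restart performed only after the other steps have completed, can be sketched as below. The function `run_panic` and the logging callbacks are hypothetical placeholders; a real operating system would perform the actual display, synchronization, dump, and restart operations.

```python
from typing import Callable, List


def run_panic(steps: List[Callable[[], None]], restart: Callable[[], None]) -> None:
    """Run every panic step in order, then restart the system last."""
    for step in steps:
        step()
    restart()


# Placeholder actions that only record the order in which they ran.
log: List[str] = []
run_panic(
    steps=[
        lambda: log.append("display/record error information"),
        lambda: log.append("synchronize file system"),
        lambda: log.append("collect memory dump"),
    ],
    restart=lambda: log.append("restart system"),
)
```

Keeping the restart as a separate argument makes it impossible for the sketch to restart before the error information and memory dump have been saved.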
- the CPU 120 that was requested to perform the processing may not be able to execute error processing for some reason. In that case, the CPU 110 itself continues error processing.
- FIG. 7 is a sequence diagram illustrating a case where error processing by another CPU has failed.
- the panic process (step S17) in the CPU 120 has failed.
- the processing from step S11 to step S17 is the same as in FIG.
- the trap processing unit 211 in the operating system 201 of the CPU 110 resumes the processing in the CPU 110 after a predetermined time has elapsed (step S18).
- trap processing is performed by the trap processing unit 211 of the operating system 201 executed by the CPU 110 (step S19). Further, panic processing is performed by the panic processing unit 231 (step S20). As a result, the server 100 is restarted.
- in the conventional technology, post-processing such as recording error information is performed by the CPU in which the error occurred. According to the present embodiment, however, another normal CPU performs the post-processing for the CPU in which the error occurred. By adopting this method, the reliability of the system can be improved.
- the failed CPU can be replaced early, and the problem of the system going down repeatedly due to errors in the same CPU can be prevented. In addition, it is possible to prevent file corruption and data corruption caused by the inability to execute file system synchronization.
- in the present embodiment, error processing such as trap processing and panic processing is executed on a CPU in which no error has been detected. However, an error in another CPU may actually have been triggered by a failure in a CPU in which no error has been detected. In such a case, the error is detected on a normal CPU, trap processing and panic processing are performed on the failed CPU, and the system may hang up.
- therefore, as a safeguard, trap processing and panic processing are also executed on the CPU that detected the error after a certain time has elapsed. This ensures that error information display/recording, file system synchronization, memory dump collection, and system restart are performed.
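The safeguard described here, in which the CPU that detected the error waits for a predetermined time and then performs the error processing itself, can be sketched with a timed wait. This is an illustration under assumptions: threads stand in for CPUs, and `handle_error`, the callbacks, and the wait time are hypothetical. In the real system a successful panic on the normal CPU restarts the machine, so the waiting CPU never resumes.

```python
import threading
from typing import Callable, List


def handle_error(delegate: Callable[[], None],
                 fallback: Callable[[], None],
                 wait_seconds: float) -> str:
    """Ask another worker to process the error; fall back after a timeout."""
    done = threading.Event()

    def worker() -> None:
        try:
            delegate()
        except Exception:
            return  # delegated processing failed, so completion is never signaled
        done.set()

    threading.Thread(target=worker, daemon=True).start()
    # The CPU that detected the error pauses here for the predetermined time.
    if done.wait(timeout=wait_seconds):
        return "handled by normal CPU"
    fallback()  # resume and run trap and panic processing locally
    return "handled by failing CPU"


def broken_panic() -> None:
    raise RuntimeError("panic processing failed")


events: List[str] = []
result_ok = handle_error(lambda: events.append("panic on CPU # 1"),
                         lambda: events.append("panic on CPU # 0"), 0.2)
result_fail = handle_error(broken_panic,
                           lambda: events.append("panic on CPU # 0"), 0.2)
```

The first call models FIG. 6, where the normal CPU completes the processing, and the second models FIG. 7, where it fails and the original CPU takes over.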
- the above processing functions can be realized by a computer.
- a program that describes the processing contents of the functions realized on the server based on the operating system is provided.
- the program describing the processing contents can be recorded on a computer-readable recording medium.
- Examples of the computer-readable recording medium include a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory.
- Magnetic recording devices include hard disk drives (HDD), flexible disks (FD), and magnetic tapes.
- Optical discs include a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), and a CD-R (Recordable)/RW (Rewritable).
- Magneto-optical recording media include MO (Magneto-Optical disk).
- To distribute the program, portable recording media such as CD-ROMs on which the program is recorded are sold. It is also possible to store the program in a storage device of a server computer and transfer the program to other computers via the network.
- a computer that executes a program stores, for example, a program recorded on a portable recording medium or a program transferred from the server computer in its own storage device. Then, the computer reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to the program. The computer can also execute processing according to the received program sequentially, each time the program is transferred from the server computer.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computer Networks & Wireless Communication (AREA)
- Debugging And Monitoring (AREA)
- Hardware Redundancy (AREA)
- Retry When Errors Occur (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2005/001770 WO2006082657A1 (ja) | 2005-02-07 | 2005-02-07 | Multi-CPU computer and method of restarting system |
JP2007501491A JP4489802B2 (ja) | 2005-02-07 | 2005-02-07 | Multi-CPU computer and method of restarting system |
US11/879,390 US7716520B2 (en) | 2005-02-07 | 2007-07-17 | Multi-CPU computer and method of restarting system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2005/001770 WO2006082657A1 (ja) | 2005-02-07 | 2005-02-07 | Multi-CPU computer and method of restarting system |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/879,390 Continuation US7716520B2 (en) | 2005-02-07 | 2007-07-17 | Multi-CPU computer and method of restarting system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2006082657A1 (ja) | 2006-08-10 |
Family ID: 36777052
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2005/001770 WO2006082657A1 (ja) | 2005-02-07 | 2005-02-07 | Multi-CPU computer and method of restarting system |
Country Status (3)
Country | Link |
---|---|
US (1) | US7716520B2 (ja) |
JP (1) | JP4489802B2 (ja) |
WO (1) | WO2006082657A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009205362A (ja) * | 2008-02-27 | 2009-09-10 | Nec Corp | Computer device, method for continuing operation of computer device, and program |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102004052576A1 (de) * | 2004-10-29 | 2006-05-04 | Advanced Micro Devices, Inc., Sunnyvale | Parallel processing mechanism for multiprocessor systems |
US20100088542A1 (en) * | 2008-10-06 | 2010-04-08 | Texas Instruments Incorporated | Lockup recovery for processors |
JP2010231619A (ja) * | 2009-03-27 | 2010-10-14 | Renesas Electronics Corp | Information processing apparatus |
CN102971715B (zh) * | 2010-07-06 | 2015-07-08 | 三菱电机株式会社 | Processor device and program |
US8850262B2 (en) | 2010-10-12 | 2014-09-30 | International Business Machines Corporation | Inter-processor failure detection and recovery |
US8645969B2 (en) | 2011-08-19 | 2014-02-04 | Qualcomm Incorporated | Method for dynamic discovery of processors and processor capabilities |
US20150006978A1 (en) * | 2012-02-13 | 2015-01-01 | Mitsubishi Electric Corporation | Processor system |
US9104575B2 (en) | 2012-08-18 | 2015-08-11 | International Business Machines Corporation | Reduced-impact error recovery in multi-core storage-system components |
CN103839016A (zh) * | 2012-11-21 | 2014-06-04 | 鸿富锦精密工业(武汉)有限公司 | Computer with CPU protection function |
US20160301562A1 (en) * | 2013-11-15 | 2016-10-13 | Nokia Solutions And Networks Oy | Correlation of event reports |
WO2017052548A1 (en) * | 2015-09-24 | 2017-03-30 | Hewlett Packard Enterprise Development Lp | Failure indication in shared memory |
US10387260B2 (en) * | 2015-11-26 | 2019-08-20 | Ricoh Company, Ltd. | Reboot system and reboot method |
US10990468B2 (en) * | 2016-03-14 | 2021-04-27 | Hitachi, Ltd. | Computing system and error handling method for computing system |
US10536859B2 (en) | 2017-08-15 | 2020-01-14 | Charter Communications Operating, Llc | Methods and apparatus for dynamic control and utilization of quasi-licensed wireless spectrum |
US10459782B2 (en) * | 2017-08-31 | 2019-10-29 | Nxp Usa, Inc. | System and method of implementing heartbeats in a multicore system |
US10966073B2 (en) | 2017-11-22 | 2021-03-30 | Charter Communications Operating, Llc | Apparatus and methods for premises device existence and capability determination |
US11307921B2 (en) * | 2017-12-08 | 2022-04-19 | Apple Inc. | Coordinated panic flow |
US11475723B2 (en) * | 2017-12-29 | 2022-10-18 | Robert Bosch Gmbh | Determining a fault in an electronic controller |
US11129171B2 (en) | 2019-02-27 | 2021-09-21 | Charter Communications Operating, Llc | Methods and apparatus for wireless signal maximization and management in a quasi-licensed wireless system |
US11374779B2 (en) | 2019-06-30 | 2022-06-28 | Charter Communications Operating, Llc | Wireless enabled distributed data apparatus and methods |
US11182222B2 (en) * | 2019-07-26 | 2021-11-23 | Charter Communications Operating, Llc | Methods and apparatus for multi-processor device software development and operation |
US11528748B2 (en) | 2019-09-11 | 2022-12-13 | Charter Communications Operating, Llc | Apparatus and methods for multicarrier unlicensed heterogeneous channel access |
US11368552B2 (en) * | 2019-09-17 | 2022-06-21 | Charter Communications Operating, Llc | Methods and apparatus for supporting platform and application development and operation |
US11026205B2 (en) | 2019-10-23 | 2021-06-01 | Charter Communications Operating, Llc | Methods and apparatus for device registration in a quasi-licensed wireless system |
US11457485B2 (en) | 2019-11-06 | 2022-09-27 | Charter Communications Operating, Llc | Methods and apparatus for enhancing coverage in quasi-licensed wireless systems |
US11363466B2 (en) | 2020-01-22 | 2022-06-14 | Charter Communications Operating, Llc | Methods and apparatus for antenna optimization in a quasi-licensed wireless system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0348940A (ja) * | 1989-07-18 | 1991-03-01 | Nec Corp | Electronic computer system |
JPH04340631A (ja) * | 1991-05-17 | 1992-11-27 | Mitsubishi Electric Corp | Distributed processing system |
JP2000311155A (ja) * | 1999-04-27 | 2000-11-07 | Seiko Epson Corp | Multiprocessor system and electronic device |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0271336A (ja) | 1988-09-06 | 1990-03-09 | Nec Corp | Processor fault state monitoring system |
JPH06243104A (ja) | 1993-02-10 | 1994-09-02 | Fujitsu Ltd | Inter-CPU communication system in a multiprocessor system |
JPH06243101A (ja) | 1993-02-10 | 1994-09-02 | Fujitsu Ltd | Inter-CPU communication system in a multiprocessor system |
JPH06332864A (ja) | 1993-05-27 | 1994-12-02 | Fujitsu Ltd | Inter-CPU communication system in a multiprocessor system |
US6199179B1 (en) * | 1998-06-10 | 2001-03-06 | Compaq Computer Corporation | Method and apparatus for failure recovery in a multi-processor computer system |
JPH11338838A (ja) | 1998-05-22 | 1999-12-10 | Nagano Nippon Denki Software Kk | Method and system for parallel dump collection of failure information in a multiprocessor system |
US6675324B2 (en) * | 1999-09-27 | 2004-01-06 | Intel Corporation | Rendezvous of processors with OS coordination |
US6516429B1 (en) * | 1999-11-04 | 2003-02-04 | International Business Machines Corporation | Method and apparatus for run-time deconfiguration of a processor in a symmetrical multi-processing system |
US6622260B1 (en) * | 1999-12-30 | 2003-09-16 | Suresh Marisetty | System abstraction layer, processor abstraction layer, and operating system error handling |
US6725317B1 (en) * | 2000-04-29 | 2004-04-20 | Hewlett-Packard Development Company, L.P. | System and method for managing a computer system having a plurality of partitions |
US7082610B2 (en) * | 2001-06-02 | 2006-07-25 | Redback Networks, Inc. | Method and apparatus for exception handling in a multi-processing environment |
US6851071B2 (en) * | 2001-10-11 | 2005-02-01 | International Business Machines Corporation | Apparatus and method of repairing a processor array for a failure detected at runtime |
US7257734B2 (en) * | 2003-07-17 | 2007-08-14 | International Business Machines Corporation | Method and apparatus for managing processors in a multi-processor data processing system |
2005
- 2005-02-07: WO, application PCT/JP2005/001770, publication WO2006082657A1 (ja), status: Application Discontinuation
- 2005-02-07: JP, application JP2007501491A, publication JP4489802B2 (ja), status: Active
2007
- 2007-07-17: US, application US11/879,390, publication US7716520B2 (en), status: Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
JPWO2006082657A1 (ja) | 2008-06-26 |
US20080010506A1 (en) | 2008-01-10 |
US7716520B2 (en) | 2010-05-11 |
JP4489802B2 (ja) | 2010-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4489802B2 (ja) | | Multi-CPU computer and method of restarting system |
US7853825B2 (en) | | Methods and apparatus for recovering from fatal errors in a system |
JP4117262B2 (ja) | | Method, medium, and system for replacing a failed processor |
US8413133B2 (en) | | Software update management apparatus and software update management method |
US6978398B2 (en) | | Method and system for proactively reducing the outage time of a computer system |
TWI337304B (en) | | Method for fast system recovery via degraded reboot |
Ruprecht et al. | | VM live migration at scale |
TWI554875B (zh) | | Predicting, diagnosing, and recovering from application failures based on resource access patterns |
US7752495B2 (en) | | System and method for predictive processor failure recovery |
US20100325471A1 (en) | | High availability support for virtual machines |
TW200414041A (en) | | Method and system for maintaining firmware versions in a data processing system |
JP2011060055A (ja) | | Virtual computer system, virtual machine recovery processing method, and program therefor |
US20150309883A1 (en) | | Recording Activity of Software Threads in a Concurrent Software Environment |
JP4903244B2 (ja) | | Computer system and failure recovery method |
JP2009211517A (ja) | | Virtual computer redundancy system |
JP2010086364A (ja) | | Information processing apparatus, operating state monitoring apparatus, and method |
JP2007133544A (ja) | | Failure information analysis method and apparatus for implementing the same |
JP3030658B2 (ja) | | Computer system with power failure countermeasures and operating method thereof |
US8977896B1 (en) | | Maintaining data integrity in data migration operations using per-migration device error flags |
US8555105B2 (en) | | Fallover policy management in high availability systems |
JP4992740B2 (ja) | | Multiprocessor system, failure detection method, and failure detection program |
JP5716830B2 (ja) | | Information processing apparatus, method, and program |
JP2007080012A (ja) | | Restart method, system, and program |
US20070234114A1 (en) | | Method, apparatus, and computer program product for implementing enhanced performance of a computer system with partially degraded hardware |
JP4945774B2 (ja) | | Disk array device and method for collecting failure information data of a transport control processor core |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
WWE | WIPO information: entry into national phase | Ref document number: 2007501491; Country of ref document: JP |
WWE | WIPO information: entry into national phase | Ref document number: 11879390; Country of ref document: US |
NENP | Non-entry into the national phase | Ref country code: DE |
WWP | WIPO information: published in national office | Ref document number: 11879390; Country of ref document: US |
122 | EP: PCT application non-entry in European phase | Ref document number: 05709822; Country of ref document: EP; Kind code of ref document: A1 |
WWW | WIPO information: withdrawn in national office | Ref document number: 5709822; Country of ref document: EP |