EP0920661A1 - Slave dsp reboots stalled master cpu

Info

Publication number
EP0920661A1
EP0920661A1 EP98905551A EP98905551A EP0920661A1 EP 0920661 A1 EP0920661 A1 EP 0920661A1 EP 98905551 A EP98905551 A EP 98905551A EP 98905551 A EP98905551 A EP 98905551A EP 0920661 A1 EP0920661 A1 EP 0920661A1
Authority
EP
European Patent Office
Prior art keywords
master
slave
processor
stalled
master processor
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP98905551A
Other languages
German (de)
French (fr)
Inventor
Steven Taylor Pancoast
Paul Davis Foster
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Koninklijke Philips NV
Original Assignee
Koninklijke Philips Electronics NV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1997-06-23
Filing date
1998-03-12
Publication date
1999-06-09
Application filed by Koninklijke Philips Electronics NV

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/24Resetting means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Retry When Errors Occur (AREA)
  • Hardware Redundancy (AREA)
  • Power Sources (AREA)
  • Multi Processors (AREA)

Abstract

A digital home entertainment system comprises one or more slave processors, e.g, DSPs, for processing specific tasks, and a master processor, e.g., a CPU, for control of the system. The slave processor is capable of rebooting the master processor if the master processor has stalled. This slave-controlled rebooting avoids manual cold rebooting of the system and is particularly advantageous in open-architecture multimedia systems with asynchronously cooperating components.

Description

SLAVE DSP REBOOTS STALLED MASTER CPU
The invention relates to an information processing system, in particular to a consumer electronics home entertainment system, with a slave processor for processing a task, and a master processor for control of the system, as defined in the precharacterizing part of claim 1.
A typical multi-processor system with a hierarchical organization comprises one or more slave processors, each processing a specific task, and a master processor driving one or more slave processors and controlling the system. For example, the system is a multimedia consumer apparatus or a digital entertainment system, wherein the master is a CPU and the slaves are DSPs. One of the slaves processes audio data; another slave processes video data.
Suppose that, in the multi-processor system, the master encounters an event that causes the master to stop working. For example, the master has lost its stack, or its memory is completely filled. The result is that the entire system stops functioning and has to be rebooted, typically by hand. The power has to be turned off before the system can start afresh (i.e., a cold reboot). Accordingly, the complete state of the system may be lost. Fortunately, typical consumer apparatus of the type described, which are commercially available now or in the near future, are well protected against such disasters, since all processes active on the system have been well tried out by the OEM during the system's development.
An object of the invention is therefore to provide an asynchronous multiprocessor system that limits the rather disastrous consequences of a stalled master or a "hung" system. It is another object to render a cold reboot power sequence unnecessary, which is of particular interest to apparatus meant for the consumer market. If anything goes wrong with the master, the consumer is the sole external agent available to jumpstart the system. Most consumers are not, and should not need to be, particularly knowledgeable about the nuances of system architectures beyond where to put the power plug and how to operate the remote. It is clear that having to do a cold reboot now and again is not only particularly unattractive, but a major drawback.
Consumer apparatus have become increasingly sophisticated. Modular configurations and open architectures are believed to form the paradigm for such apparatus. The inventors have realized, however, that failure of the master may occur more frequently in such an architecture, typically when its components are cooperating asynchronously. The reason for this is the following. An open architecture system can be modified and extended at will. Future functionalities, presently unknown, or customized functionalities, will be added to the existing system as after-market add-ons. Proper functioning under each and every circumstance can no longer be guaranteed, simply because many of the possible processes could not have been contemplated in advance by the manufacturer, let alone tried out in the development phase.
To this end, the invention provides an information processing system with a slave processor for processing a task, and a master processor for control of the system. The slave processor is operative to reboot the master processor if the master processor has stalled. The slave detects that the master has stalled by monitoring the master's heart beat. The automatic reboot performed by the slave allows the temporarily inactivated master to start working again without intervention of the user. Preferably, periodic state savings are performed, allowing the system to pick up at the last known valid checkpoint. Since almost all of the stalls occur as a consequence of mutually interfering events in the asynchronous system, restarting the master at the last known valid checkpoint solves the problem in most cases. The system according to the invention is particularly useful in a consumer apparatus, wherein the DSP of the slave is capable of rebooting the master's CPU without user intervention.
These and other aspects of the invention will be explained in further detail by way of example and with reference to the accompanying drawing.
Fig. 1 is a block diagram of a system according to the invention.
Fig. 1 is a block diagram of a multiprocessor system 100 according to the invention. System 100 has a master processor 102 and one or more slave processors 104 and 106. Master 102 drives the entire system 100. Slaves 104 and 106 each process a respective specific task under control of master 102. The system is, for example, a multimedia system with an open architecture, wherein master 102 is a CPU and slaves 104 and 106 are DSPs (digital signal processors). Slaves 104 and 106 may communicate with each other. Master 102 has program memory 108. Slave 104 has program memory 110, and slave 106 has program memory 112.
Master 102 sends a data stream 118 to slave 104 in which a special command occurs periodically, the sole purpose of which is to notify slave 104 that master 102 is still running. The special command is commonly referred to as a "heart beat". Typically, a heart beat is sent once every second. Slave 104 has a fail-safe timer 114. Upon receipt of a heart beat, slave 104 resets its timer 114. Timer 114 expires after, say, 2 seconds, which is substantially longer than the time period between two successive heart beats. When master 102 stalls, slave 104 no longer receives the heart beat, and timer 114 expires. This confirms that master 102 has become inert. Slave 104 then resets master 102.
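The heart beat supervision described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names `FailSafeTimer` and `simulate`, the simulated clock, and the 0.25-second polling tick are hypothetical and do not come from the application; only the 1-second heart beat period and the 2-second expiry are taken from the text.

```python
from dataclasses import dataclass

HEARTBEAT_PERIOD_S = 1.0   # from the description: one heart beat per second
TIMER_EXPIRY_S = 2.0       # from the description: timer 114 expires after, say, 2 seconds


@dataclass
class FailSafeTimer:
    """Hypothetical model of timer 114: expires if not reset within expiry_s."""
    expiry_s: float
    last_reset_s: float = 0.0

    def reset(self, now_s: float) -> None:
        self.last_reset_s = now_s

    def expired(self, now_s: float) -> bool:
        return now_s - self.last_reset_s >= self.expiry_s


def simulate(stall_at_s: float, end_s: float, tick_s: float = 0.25):
    """Return the simulated time at which the slave resets the master, or None.

    The master emits a heart beat at every multiple of HEARTBEAT_PERIOD_S
    until it stalls at stall_at_s; the slave polls its timer every tick_s.
    """
    timer = FailSafeTimer(TIMER_EXPIRY_S)
    steps = int(end_s / tick_s)
    for i in range(steps + 1):
        now = i * tick_s
        master_alive = now < stall_at_s
        is_heartbeat_instant = (now % HEARTBEAT_PERIOD_S) == 0.0
        if master_alive and is_heartbeat_instant:
            timer.reset(now)   # heart beat received: restart the countdown
        elif timer.expired(now):
            return now         # no heart beat for 2 s: reset the master
    return None
```

In this model, a master that stalls at t = 3.5 s sent its last heart beat at t = 3 s, so the timer expires and the reset is issued at t = 5 s. A master that never stalls never triggers a reset, because the gap between heart beats stays below the expiry interval.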
In a first embodiment, the reset brings master 102 back to the very beginning of the program(s) it was supposed to execute, and master 102 starts anew from scratch. In a second embodiment, master 102 is coupled to a checkpoint memory 116, typically a magnetic disk. Prior to the failure, master 102 has been updating checkpoint memory 116 with the system state every N heart beats. For example, every 10 heart beats the content of the registers (not shown) of master 102, including the I/O registers and control registers, and the content of memory 108 are stored in memory 116. Memory 116 thus periodically stores a snapshot of the system, i.e., all information that unambiguously defines the state of system 100 and that is necessary to restore that state. Now, when slave 104 notes that master 102 has stalled, slave 104 sends a reset to master 102. When master 102 starts to reboot, it sends a response to slave 104 indicating that it has rebooted. This response enables slave 104 to issue a command to master 102 to fetch the last valid state recorded in memory 116, to reload the registers and memory 108 with this valid state, and to start executing the program code from there on.
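The two embodiments can be contrasted in a short sketch. It assumes a purely in-memory stand-in for checkpoint memory 116 and a toy state consisting of a program counter and a list. The classes `MasterState`, `CheckpointMemory` and `Master` and their methods are hypothetical illustrations, not an interface defined by the application. Only the interval of 10 heart beats comes from the text.

```python
import copy
from dataclasses import dataclass, field

CHECKPOINT_EVERY_N = 10  # from the description: "every 10 heart beats"


@dataclass
class MasterState:
    """Hypothetical stand-in for the registers and program memory 108."""
    registers: dict = field(default_factory=dict)
    memory: list = field(default_factory=list)


class CheckpointMemory:
    """Hypothetical model of checkpoint memory 116 holding the last valid snapshot."""

    def __init__(self):
        self._snapshot = None

    def save(self, state: MasterState) -> None:
        self._snapshot = copy.deepcopy(state)

    def load(self):
        return copy.deepcopy(self._snapshot)


class Master:
    def __init__(self, checkpoint: CheckpointMemory):
        self.state = MasterState(registers={"pc": 0}, memory=[])
        self.checkpoint = checkpoint
        self.heartbeats_sent = 0

    def work_and_heartbeat(self) -> None:
        """Do one unit of work, emit one heart beat, checkpoint every N beats."""
        self.state.registers["pc"] += 1
        self.state.memory.append(self.state.registers["pc"])
        self.heartbeats_sent += 1
        if self.heartbeats_sent % CHECKPOINT_EVERY_N == 0:
            self.checkpoint.save(self.state)

    def reset(self) -> None:
        """First embodiment: start again from the very beginning."""
        self.state = MasterState(registers={"pc": 0}, memory=[])
        self.heartbeats_sent = 0

    def restore(self) -> bool:
        """Second embodiment: reload the last valid state, if one was saved."""
        snapshot = self.checkpoint.load()
        if snapshot is None:
            return False
        self.state = snapshot
        return True
```

If the master stalls after 25 heart beats, a plain reset returns it to its initial state, whereas a reset followed by `restore()` resumes from the snapshot taken at heart beat 20, so at most the work of the last N heart beats is lost.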
Slave 104 can notify the user, for example, via a short message on the system's display (not shown), that a problem occurred, has been solved, and that system operation has been resumed.
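The recovery exchange between slave 104 and master 102, ending with the message to the user, can be traced as a sequence of messages. The sketch below is an assumption-laden illustration: the message names, the `RESUMED` acknowledgement, the functions `master_respond` and `recover_master`, and the use of a list as the display are all hypothetical. The application only specifies the reset, the master's response that it rebooted, the command to fetch the last valid state, and the notification.

```python
from enum import Enum, auto


class Msg(Enum):
    RESET = auto()      # slave -> master: reset issued after timer 114 expires
    REBOOTED = auto()   # master -> slave: response indicating it rebooted
    RESTORE = auto()    # slave -> master: fetch last valid state from memory 116
    RESUMED = auto()    # master -> slave: hypothetical acknowledgement, not in the text


def master_respond(msg: Msg) -> Msg:
    """Hypothetical master-side reply to each slave message."""
    if msg is Msg.RESET:
        return Msg.REBOOTED
    if msg is Msg.RESTORE:
        return Msg.RESUMED
    raise ValueError(f"unexpected message for master: {msg}")


def recover_master(display: list) -> list:
    """Run the slave side of the recovery and return the message trace."""
    trace = []
    reply = master_respond(Msg.RESET)
    trace += [Msg.RESET, reply]
    if reply is Msg.REBOOTED:
        # Only after the master reports the reboot does the slave ask for a restore.
        reply = master_respond(Msg.RESTORE)
        trace += [Msg.RESTORE, reply]
    if reply is Msg.RESUMED:
        display.append("A problem occurred and was solved; operation has resumed.")
    return trace
```

Calling `recover_master([])` yields the trace RESET, REBOOTED, RESTORE, RESUMED. Making the restore command wait for the reboot response mirrors the ordering in the description, where that response is what enables the slave to issue the command.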

Claims

CLAIMS:
1. An information processing system with a slave processor for processing a task, and a master processor for control of the system, wherein the slave processor is operative to reboot the master processor if the master processor has stalled.
2. The system of claim 1, wherein:
- the master processor periodically sends a heart beat to the slave processor;
- the slave processor has a timer with a maximum timing interval substantially longer than a time period between two successive heart beats;
- the slave processor resets the timer upon receipt of the heart beat; and
- the slave processor starts rebooting the master upon expiry of the maximum timing interval.
3. The system of claim 2, wherein:
- the system comprises a checkpoint memory for periodically saving a valid current state of the system;
- upon detection of the master processor having stalled, the slave processor issues a command to the master processor to re-establish the last valid state of the system as saved in the checkpoint memory.
EP98905551A 1997-06-23 1998-03-12 Slave dsp reboots stalled master cpu Withdrawn EP0920661A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US880387 1986-06-30
US88038797A 1997-06-23 1997-06-23
PCT/IB1998/000332 WO1998059288A1 (en) 1997-06-23 1998-03-12 Slave dsp reboots stalled master cpu

Publications (1)

Publication Number Publication Date
EP0920661A1 (en) 1999-06-09

Family

ID=25376154

Family Applications (1)

Application Number Title Priority Date Filing Date
EP98905551A Withdrawn EP0920661A1 (en) 1997-06-23 1998-03-12 Slave dsp reboots stalled master cpu

Country Status (4)

Country Link
EP (1) EP0920661A1 (en)
JP (1) JP2000516745A (en)
KR (1) KR100518478B1 (en)
WO (1) WO1998059288A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7428655B2 (en) 2004-09-08 2008-09-23 Hewlett-Packard Development Company, L.P. Smart card for high-availability clustering
JP2007034479A (en) * 2005-07-25 2007-02-08 Nec Corp Operation system device, standby system device, operation/standby system, operation system control method, standby system control method, and operation system/standby system control method
US7917812B2 (en) 2006-09-30 2011-03-29 Codman Neuro Sciences Sárl Resetting of multiple processors in an electronic device
KR101132389B1 (en) * 2007-04-09 2012-04-03 엘지엔시스(주) Apparatus and method of structuralizing checkpoint memory based dispersion data structure
KR100928187B1 (en) * 2007-11-30 2009-11-25 한국전기연구원 Fault-safe structure of dual processor control unit
US8117494B2 (en) * 2009-12-22 2012-02-14 Intel Corporation DMI redundancy in multiple processor computer systems
JP5494134B2 (en) * 2010-03-31 2014-05-14 株式会社リコー Control apparatus and control method
KR102031576B1 (en) * 2019-05-21 2019-10-14 주식회사 우리기술 A controller of a distributed control system having an abnormal task monitoring function
KR102220389B1 (en) * 2019-11-28 2021-02-24 주식회사 한화 Apparatus and method for performing real-time synchronization using fpga

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04240946A (en) * 1991-01-25 1992-08-28 Nec Eng Ltd Data communication system
JPH09128269A (en) * 1995-10-31 1997-05-16 Fujitsu Ltd Abnormality display system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO9859288A1 *

Also Published As

Publication number Publication date
KR100518478B1 (en) 2005-10-05
KR20000068286A (en) 2000-11-25
WO1998059288A1 (en) 1998-12-30
JP2000516745A (en) 2000-12-12

Similar Documents

Publication Publication Date Title
US7975188B2 (en) Restoration device for BIOS stall failures and method and computer program product for the same
US6393560B1 (en) Initializing and restarting operating systems
AU2010307632B2 (en) Microcomputer and operation method thereof
KR20170120559A (en) Watchdog timer
WO1998059288A1 (en) Slave dsp reboots stalled master cpu
JPH07219809A (en) Apparatus and method for data processing
JPH0527880A (en) System restart device
TWI461905B (en) Computing device capable of remote crash recovery, method for remote crash recovery of computing device, and computer readable medium
JPH10187454A (en) Bios reloading system
US20010054130A1 (en) Computing machine with hard stop-tolerant disk file management system
JPH11175108A (en) Duplex computer device
JP2000039928A (en) Microcomputer
JP3476667B2 (en) Redundant controller
JPH09190360A (en) Microcomputer and its ranaway monitoring processing method
JPH02281367A (en) Sate reproducing method
JPH0721091A (en) Service interruption processing method for electronic computer
JPS58195968A (en) Re-execution controlling system
JPH0120775B2 (en)
JP2003067221A (en) Watchdog timer control circuit
JPH1027040A (en) Computer resetting system
JPH03244045A (en) Microcomputer circuit
JP3473002B2 (en) Data processing device and data processing method
JPH0395634A (en) Restart control system for computer system
JPH0833839B2 (en) Data processing device
JPS6249426A (en) Automatic operation control device

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE FR GB

17P Request for examination filed

Effective date: 19990630

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20070801