WO1997045804A1 - Audio data management for interactive applications - Google Patents

Audio data management for interactive applications

Info

Publication number
WO1997045804A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
stack
stage
virtual environment
data
Application number
PCT/IB1997/000359
Other languages
French (fr)
Inventor
Richard David Gallery
Timothy Stuart Owlett
Original Assignee
Philips Electronics N.V.
Philips Norden Ab
Application filed by Philips Electronics N.V., Philips Norden Ab filed Critical Philips Electronics N.V.
Priority to EP97908458A priority Critical patent/EP0847562A1/en
Priority to JP9541885A priority patent/JPH11511268A/en
Publication of WO1997045804A1 publication Critical patent/WO1997045804A1/en

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/20 Input arrangements for video game devices
    • A63F 13/50 Controlling the output signals based on the game progress
    • A63F 13/54 Controlling the output signals based on the game progress involving acoustic signals, e.g. for simulating revolutions per minute [RPM] dependent engine sounds in a driving game or reverberation against a virtual wall
    • A63F 2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/50 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by details of game servers
    • A63F 2300/53 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by details of game servers details of basic data processing
    • A63F 2300/538 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by details of game servers details of basic data processing for performing operations on behalf of the game client, e.g. rendering
    • A63F 2300/60 Methods for processing data by generating or executing the game program
    • A63F 2300/6063 Methods for processing data by generating or executing the game program for sound processing
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B 5/00 Electrically-operated educational appliances
    • G09B 5/06 Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • G09B 5/062 Combinations of audio and printed presentations, e.g. magnetically striped cards, talking books, magnetic tapes with printed texts thereon


Abstract

An arrangement of apparatus is provided for the handling of audio in relation to an evaluation procedure (30) for a virtual world or environment (which may or may not be visually displayed). For each instant in time, for which the state of the virtual world is evaluated, modelled characters and objects within the virtual world may generate various audio events, such that the audio-state of the world changes on each world state evaluation. This information is passed to an audio management procedure (32) via a stack memory (34), with the audio management procedure (32) taking the information from the stack (34) to generate loudness and scaling factors for spatial localisation, and other audio effects. This information is then passed to an audio generation board (36) which, in conjunction with an audio store (38), synthesises digital audio signals to play out. The use of the audio stack (34) enables efficient passing of the audio event data between the world update process (30) and the audio management process (32), using a relatively small number of common functions for all types of audio.

Description

DESCRIPTION
AUDIO DATA MANAGEMENT FOR INTERACTIVE APPLICATIONS
The present invention relates to interactive entertainment apparatuses and in particular to the handling of audio data in such a system. Additionally, but not exclusively, the invention relates to the handling of audio data in conjunction with graphic or video data.
An example of an interactive audio system is described in US Patent 4,846,693 (Baer), with an animated toy figure such as a teddy bear or doll being coupled to a video animation unit presenting an image of a second figure on a television screen or other display. A scripted conversation occurs between the figures via a speaker within the toy and a speaker at the display, with appropriate controlled animation of each figure in synchronism with the audio. A user input enables a limited amount of user interaction with the screened figure (for example answering multiple choice questions); an arrangement of one or more position sensors is also suggested for triggering phrases such as "turn me over" or "pick me up" from the toy figure. A further example is given in US Patent 4,305,131 (Best), which describes a story or game arrangement with a branching narrative structure. At the branch points, an on-screen character (as a filmed sequence or animated sprite) directly asks a question of the viewer, to which question there are a few permitted answers. The possible answers are presented to the user via a display on a hand-held unit, which unit also includes voice recognition circuitry to detect the user speaking the answers. The audio data is stored independently of the video data to allow for flexibility in re-use, although a cueing circuit is provided to maintain synchronism between the two at run time.
A drawback of such systems is their reliance on scripting, with the audio being tied to other features such as the scripted conversations in the Baer reference and the branching narrative structure of Best. This can lead to a lack of immersion for the user, with only a limited number of possible responses and these being predictable: this becomes particularly noticeable in the case of an audio-only application such as an electronic storybook, where there is no video to distract from repetitive audio. A further problem with such systems is their lack of flexibility and/or capacity for editing. In both Baer and Best, the audio data is stored in a ROM device in its scripted form and, even if held in small, individually addressed audio files (for example tracks of an audio compact disc), there is no simple way to introduce new audio segments or sound effects, or to substitute one existing segment for another.
It is therefore an object of the present invention to provide a fast and flexible means for the handling of audio data, and in particular to provide an apparatus configuration for handling such data as a part of a real-time interactive system.
In accordance with the present invention there is provided an interactive audio entertainment apparatus comprising: a first memory holding data defining a virtual environment and objects within said virtual environment; a processor coupled with the memory and arranged to generate and periodically update a model of the virtual environment and objects therein, and to generate indicators to respective predetermined audio segments to be generated in response to respective predetermined conditions or conjunctions of conditions occurring in the virtual environment; a stack memory coupled to receive and sequentially store said indicators from the processor; and an audio management stage arranged to, on each periodic update of the virtual environment, pull the stack contents, and initiate the generation of the respectively indicated audio segment or segments.
It should be noted that the term "objects" used herein refers to any feature of the virtual environment, rather than a software construction packaging data or procedures operating on such data. The objects may be "solid" features of the virtual environment, such as animated characters, furniture or buildings, or they may be of a more abstract nature such as temperature or time. Consequently, the above-mentioned conditions or conjunctions of conditions triggering audio segment generation may, for example, be a car hitting a tree (as modelled within the virtual environment) or a dog beginning to bark if it is night and hotter than a predetermined temperature.
The stack of indicators transferring data from the processor to the audio management stage provides great flexibility for operation of the system: to substitute an audio segment, it is only necessary to change the indicator against which that segment is recorded, and adding additional audio segments only involves increasing the number of indicators rather than re-recording entire scripted passages to accommodate a small change to, for example, a simple phrase within.
The apparatus suitably includes an audio reproduction stage, coupled to the audio management stage, and operable to generate the said predetermined audio segments when initiated by said audio management stage, and an audio data memory may be provided coupled with the audio reproduction stage. In such an arrangement, the memory may hold individually addressable audio data segments defining respective ones of the said audio segments, and may be accessed following initiation of the said audio reproduction stage by the audio management stage. Each of the indicators generated by the processor may include an identifier for the location of the respective audio data segments within such an audio data memory, with the different audio data segments stored as respective numbered files in the audio data memory, and the processor being arranged to specify at least the file number as part of a common format for indicators within the stack memory.
The audio reproduction stage may suitably have a capability for applying one or more signal processing techniques to audio signals, and the data passed to the audio reproduction stage from the audio management stage may include specification of one or more of such techniques to be applied. In such an arrangement, the particular audio segment identifier may have a null value or carry some further code indicating that the processing technique is to be applied globally or just to audio originating from particular localised areas. This would avoid the need for treatments such as echo to be individually specified for all audio segments originated, for example, within a dungeon scenario. Additionally, the audio reproduction stage may be operable to output audio signals on two or more channels with different signal processing applied for each channel.
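By way of illustration only, the following C sketch shows one way such an effect-only indicator might be encoded: a null audio segment identifier tells the reproduction stage that the entry carries only signal-processing details, here a reverb to be applied within a bounded region. The names (FILE_NONE, FX_REVERB, effect_entry) and the region representation are assumptions, not taken from the patent.

    /* Sketch: an effect-only indicator with a null file identifier.
     * All names and the region encoding are illustrative assumptions. */
    #include <stdio.h>

    #define FILE_NONE 0            /* null file id: no segment, effect only  */
    #define FX_REVERB 1            /* e.g. echo treatment for a dungeon area */

    struct effect_entry {
        int   file_num;            /* FILE_NONE => effect-only entry         */
        int   effect;              /* which processing technique to apply    */
        float region[2][3];        /* min/max corners of the affected volume */
    };

    int main(void) {
        struct effect_entry dungeon_echo = {
            .file_num = FILE_NONE,
            .effect   = FX_REVERB,
            .region   = {{0.0f, 0.0f, 0.0f}, {40.0f, 5.0f, 40.0f}},
        };
        printf("apply effect %d within region (file %d = none)\n",
               dungeon_echo.effect, dungeon_echo.file_num);
        return 0;
    }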
User operable input means may be provided coupled to the processor whereby a user is enabled to alter one or more of said respective predetermined conditions pertaining to at least one virtual environment object: in other words, the user may trigger various audio events by deliberately or accidentally setting up the condition or conjunction of conditions within the virtual environment with which the particular audio segment is associated.
Whilst applicable to purely audio applications such as talking books, the apparatus may suitably include a further store holding geometric and surface data describing a physical appearance for the modelled virtual environment and each of the objects therein, together with rendering means operable to periodically generate images of the virtual world from at least one viewpoint therein.
Further features and advantages of the present invention will become apparent from a reading of the following description of preferred embodiments of the present invention, given by way of example only, and with reference to the following drawings, in which: Figure 1 is a simplified representation of an interactive entertainment system which may suitably embody the present invention;
Figure 2 schematically illustrates audio data management according to the present invention;
Figure 3 is a block schematic diagram of an interactive entertainment apparatus embodying the present invention;
Figure 4 is a flowchart illustrating the relative order of functions performed by the apparatus of Figure 3;
Figure 5 represents exemplary contents for three successive entries in the audio stack of Figures 2 and 3; and
Figure 6 is a flowchart illustrating operation of the audio management stage of Figure 2.
Figure 1 shows a simulation of a virtual world, using real time graphics and audio, as an example of virtual reality. It is a multi-user interactive world, with two users A and B shown. A video display 10 presents a representation of a virtual world to the users, with the representation being produced by control and rendering apparatus 12. Each user has a respective input device 14 coupled with the control apparatus 12, and receives audio feedback from the simulation via speakers 16 (a quadraphonic arrangement being shown). The virtual world in this example takes the form of a room within which two animated characters 18, 20 appear, each animated character being controlled by a respective user A, B via their respective input devices 14. In addition to the animated characters 18, 20, the virtual world features a number of other modelled objects (including a table 22, vase 24, gun 26, and door 28) which may be moved or otherwise interacted with by the characters, together with "invisible" objects such as time or temperature which may also affect or initiate interaction.
Many variations on the arrangement of Figure 1 will be apparent to the skilled person. For example, the video display may comprise an autostereo display to provide a three-dimensional (3D) image of the virtual world to the users. In a modification, a multiple view 'autostereo' display may be provided such that, at their respective positions relative to the LCD screen, the users are presented with respective images of the virtual world, suitably from the viewpoint of their character. Alternatively, rather than a single or multiple view screen, each user may be provided with a stereoscopic head-mounted display (HMD) unit, with which one or more of the speakers 16 may be integral.
The form of user input device (UID) may also be subject to variations, from a simple manually operated unit 14 as shown, to full-body suits detecting the user's compound motions and, via the control and rendering stage 12, reproducing these as corresponding movements of the user's respective character 18, 20. The present invention is particularly concerned with the efficient handling of audio for such simulations, where various audio events are generated (leading to output of respective audio segments via speakers 16) in response to particular events within the virtual world, suitably as part of a periodic world evaluation procedure which takes account of user input and other factors to determine not only how the visual representation of the virtual world should be updated, but also what sounds should be generated to accompany the images.
Figure 2 schematically illustrates the handling of audio in relation to the world evaluation procedure 30. For each instant in time, for which the state of the virtual world is evaluated, the modelled characters and other objects may generate or trigger various audio events, such that the audio-state of the world changes on each world state evaluation. This information is then passed to the audio management procedure 32 via a stack memory 34 as will be described. The audio management procedure 32 takes the information from the stack 34 to generate loudness and scaling factors for spatial localisation, and other audio effects. This information is then passed to an audio generation board 36 which, in conjunction with an audio store 38, synthesises digital audio signals to play out.
The use of the audio stack 34 enables efficient passing of the audio event data between the world update process 30 and the audio management process 32, using a relatively small number of common functions for all types of audio, with software functions, during the world evaluation process, passing audio data into the audio stack 34, and the stack being read repeatedly during the audio management procedure 32. This data from the stack is then used to calculate relative positions, loudnesses and other factors for audio processing.
An apparatus implementation of the handling scheme of Figure 2 is schematically illustrated in Figure 3: for reasons of clarity, the features for only a single user input and video output are shown, although it will be well understood how these should be replicated for multiple users and/or stereoscopic video output. Where appropriate, reference numerals from Figures 1 and 2 are used to identify corresponding or directly equivalent features.
The apparatus is based around a coupled pair of processor stages respectively handling the world (including video) 42 and audio 32 management, with the world processor 42 passing audio event data to the audio processor 32 via stack memory 34. The two processors 42, 32 are coupled to a suitable clock signal source 44 as will be readily understood.
The world processor 42 is coupled to a random access memory (RAM) 46 containing data defining the virtual world and characters and objects therein, with the contents of RAM 46 being updated during world evaluation (30; Fig.2) to reflect the current position, orientation, etc. for the characters and objects which may have changed in response to user input to the world processor via UID 14. The contents of RAM 46 are periodically read by rendering stage 40 under control of the world processor 42 and used to generate the image or images of the virtual world from global or user viewpoints for subsequent display.
The audio processor 32 is coupled to receive function calls from the world processor and to pull data entries from the stack 34. Having calculated the audio processing required, for example to scale the volume for a particular audio event in relation to the relative positions within the virtual world of each user and the source of that audio event, the audio processor outputs the data to audio generation stage 36. The data may comprise a digitised audio signal, or it may simply comprise an index term identifying where the particular audio signal file or other data is stored in an audio data read-only memory (ROM) coupled with the audio generation stage. In either case, the data from the audio processor is accompanied by the details of the signal processing to be applied within audio generation unit 36 and in fact, as mentioned beforehand, the data from the audio processor may comprise only signal processing details for global or localised effects.
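As a hedged illustration of the volume scaling just described, the C fragment below attenuates a 0-255 loudness value with the distance between an audio event's source and a listener in the virtual world. The inverse-distance law and all names here are assumptions made for the sketch, not details fixed by the patent.

    /* Sketch: distance-based loudness scaling; the 1/d law is assumed. */
    #include <math.h>
    #include <stdio.h>

    struct vec3 { float x, y, z; };

    static float vdistance(struct vec3 a, struct vec3 b) {
        float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return sqrtf(dx * dx + dy * dy + dz * dz);
    }

    /* Scale a 0..255 base loudness by listener distance; clamp so a source
     * co-located with the listener plays at full volume. */
    static int scaled_loudness(int base, struct vec3 source, struct vec3 listener) {
        float d = vdistance(source, listener);
        if (d < 1.0f) d = 1.0f;              /* avoid divide-by-zero / boost */
        return (int)(base / d);
    }

    int main(void) {
        struct vec3 gun = {3.f, 0.f, 2.f}, user_a = {0.f, 0.f, 0.f};
        printf("loudness at listener: %d\n", scaled_loudness(255, gun, user_a));
        return 0;
    }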
A linear arrangement for the world evaluation routine as applied by the apparatus of Figure 3 is represented by the flowchart of Figure 4. The first step 61 is to take account of input data received since the previous world evaluation. Secondly, each object within the virtual world is processed in turn (step 62) to determine if it has moved or how it should be moved, and what if any audio should be generated as accompaniment. In the third step 63, the audio management (as at 32; Fig.2) determines the processing required to be applied to generate spatial or other effects, for passing to the audio generation stage. The fourth stage 64 is to update the geometry database, that is to say the data model of the virtual world and its contents as held in RAM 46 (Fig.3), and the fifth stage 65 is to render an image of the virtual world from the or each selected viewpoint. As will be readily understood, with an asynchronous virtual environment, the steps of Figure 4 may not be performed in the order shown or may be performed concurrently.
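The five steps of Figure 4 can be read as a per-frame loop. The C sketch below fixes only their relative order; the function names are placeholder stubs standing in for the real input, object, audio, geometry and rendering stages, and (as noted above) an asynchronous implementation might reorder or overlap them.

    /* Sketch of the Figure 4 world evaluation routine as a per-frame loop. */
    #include <stdio.h>

    enum { NUM_OBJECTS = 2 };

    static void read_user_input(void)    { puts("step 61: process input since last evaluation"); }
    static void evaluate_object(int i)   { printf("step 62: evaluate object %d\n", i); }
    static void manage_audio(void)       { puts("step 63: audio management (32) determines processing"); }
    static void update_geometry_db(void) { puts("step 64: update world model in RAM 46"); }
    static void render_viewpoints(void)  { puts("step 65: render image(s) of the virtual world"); }

    void world_evaluation(void) {
        read_user_input();                       /* step 61 */
        for (int i = 0; i < NUM_OBJECTS; i++)
            evaluate_object(i);                  /* step 62: movement + audio events */
        manage_audio();                          /* step 63 */
        update_geometry_db();                    /* step 64 */
        render_viewpoints();                     /* step 65 */
    }

    int main(void) { world_evaluation(); return 0; }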
Referring again to the scenario of Figure 1, there are two human-controlled objects (characters 18, 20) in the world, both of which are walking and the first of which is firing the gun (object 26). During the world evaluation process, the object evaluation routines 62 start by taking the first object - character 18. As the walking flag is set for this object, which indicates that an audio "walk-event" has occurred and a footfall should sound, the world processor pushes the audio stack with an entry. Also, as the explosion flag is true, the explosion graphic has started and the explosion audio should be played; consequently, the audio stack is pushed with another entry. The second object is then processed and, as before, the walking flag is true and the animation frame is in the right place. The audio stack is then pushed for a third time to identify the footfall sounds for character 20.
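A minimal sketch of the object evaluation just described, under assumed names: when an object's walking or firing flag is set, an entry for the corresponding audio file is pushed. The flag names, file numbers and push_stack_entry helper are illustrative, not the patent's own identifiers.

    /* Sketch: flags set on an object cause audio stack pushes. */
    #include <stdio.h>

    struct obj { int walking; int firing; float pos[3]; };

    #define FILE_FOOTFALL  1   /* illustrative file numbers in ROM 38 */
    #define FILE_EXPLOSION 2

    static void push_stack_entry(int file_num, int loudness, const float pos[3]) {
        printf("push: file %d, loudness %d at (%.1f, %.1f, %.1f)\n",
               file_num, loudness, pos[0], pos[1], pos[2]);
    }

    static void evaluate_object_audio(const struct obj *o) {
        if (o->walking)                      /* walk-event: footfall sound  */
            push_stack_entry(FILE_FOOTFALL, 128, o->pos);
        if (o->firing)                       /* explosion graphic has begun */
            push_stack_entry(FILE_EXPLOSION, 255, o->pos);
    }

    int main(void) {
        struct obj character18 = {1, 1, {1.f, 0.f, 2.f}};  /* walking + firing */
        struct obj character20 = {1, 0, {4.f, 0.f, 3.f}};  /* walking only     */
        evaluate_object_audio(&character18);               /* two entries      */
        evaluate_object_audio(&character20);               /* third entry      */
        return 0;
    }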
At this stage, the audio stack contains three entries as shown in Figure 5, where the data held for each entry is as follows:
file_num: an identifying integer for the particular desired audio file stored within the audio file ROM 38 (Fig.3);
loudness: an integer value, suitably ranging between 0 and 255, specifying the relative loudness for an audio file playback at the location within the virtual world which led to its generation;
posn: the position of the sound source specified in three dimensions in the virtual world by an appropriate co-ordinate system;
obj_num: a flag which indicates whether or not the audio is to be muted for its source location on playback to a level equivalent to that at which the audio is heard by other characters following attenuation due to their separation from the source;
dist_scale: a flag indicating whether or not the loudness of an audio segment should be attenuated with distance from the source;
stop: a flag indicating whether the file should stop playing, particularly for use with looped audio files (see below);
localise: a flag indicating whether or not an audio file should be spatially localised, such that factors such as direction of origin relative to an observer are taken into account when calculating relative attenuations for the different audio channels;
self_audio: a flag which, if set, means the audio file may only be heard by the character/object generating it (i.e. for a character thinking);
loop: a flag which indicates the current audio file is to be repeated as soon as it has completed (to avoid having to continually re-specify the one file) until the 'stop' flag is set.
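One possible C layout for such a stack entry is sketched below. The patent fixes the fields and their meanings but not a concrete representation, so the types chosen here are assumptions.

    /* Sketch of a stack entry; field types are assumed, not specified. */
    #include <stdio.h>

    typedef struct {
        int   file_num;     /* index of the audio file in ROM 38            */
        int   loudness;     /* 0..255 relative loudness at the source       */
        float posn[3];      /* source position in world co-ordinates        */
        int   obj_num;      /* see text: muting relative to source location */
        int   dist_scale;   /* attenuate with distance from the source?     */
        int   stop;         /* stop a (typically looped) file               */
        int   localise;     /* apply directional spatial localisation?      */
        int   self_audio;   /* audible only to the generating object        */
        int   loop;         /* repeat until 'stop' is set                   */
    } audio_stack_entry;

    int main(void) {
        audio_stack_entry e = { .file_num = 1, .loudness = 128,
                                .posn = {1.f, 0.f, 2.f}, .dist_scale = 1,
                                .localise = 1, .loop = 1 };
        printf("entry: file %d, loudness %d, %zu bytes\n",
               e.file_num, e.loudness, sizeof e);
        return 0;
    }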
When the audio management stage 32 is called, then, as shown by the flowchart of Figure 6, it repeatedly pulls this audio stack until it is empty (step 70 "Pull Stack" and step 73 "Stack Empty?"). Note that pulling the stack does not necessarily require the removal of the data therefrom: identifying that data to the audio management stage achieves the desired function without requiring copying of the data. For each stack entry, the audio management stage 32 parses the stack data (step 71) and uses the information contained to generate four loudness values for the respective speakers 16 in the quadraphonic set-up. As shown by step 72 ("More Users?"), the parsing may be a two or more pass process by which the audio management stage 32 uses the information to generate respective loudness values for each of the two users A, B, if these are provided with individual speakers. These loudness values take into account distance and direction and, depending on the information contained in the audio stack, different aural effects can be used within the system. On completion of the processing by audio processor 32, the derived data is output to the audio generation stage 36.
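The pull-parse-output loop of Figure 6 might look as follows in C: pull entries until the stack is empty and, for each, derive one loudness value per speaker in the quadraphonic set-up. The speaker positions and the simple inverse-distance panning rule are illustrative assumptions.

    /* Sketch of the Figure 6 loop: one loudness per quad speaker per entry. */
    #include <math.h>
    #include <stdio.h>

    #define NUM_SPEAKERS 4
    #define STACK_MAX    64

    struct entry { int loudness; float pos[3]; int dist_scale; };

    static struct entry stack[STACK_MAX];
    static int num_entries = 0;

    /* Assumed quadraphonic speaker layout around the listening position. */
    static const float speaker_pos[NUM_SPEAKERS][3] = {
        {-1, 0, -1}, {1, 0, -1}, {-1, 0, 1}, {1, 0, 1}
    };

    static void audio_management(void) {
        while (num_entries > 0) {                    /* steps 70/73: pull until empty */
            struct entry e = stack[--num_entries];
            for (int s = 0; s < NUM_SPEAKERS; s++) { /* step 71: parse and scale      */
                float dx = e.pos[0] - speaker_pos[s][0];
                float dz = e.pos[2] - speaker_pos[s][2];
                float d = sqrtf(dx * dx + dz * dz);
                int loud = (e.dist_scale && d > 1.0f) ? (int)(e.loudness / d)
                                                      : e.loudness;
                printf("speaker %d: loudness %d\n", s, loud); /* to stage 36 */
            }
        }
    }

    int main(void) {
        stack[num_entries++] = (struct entry){ 200, {3.f, 0.f, 2.f}, 1 };
        audio_management();
        return 0;
    }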
The control software contains a few basic functions for handling the stack and its contents, the functions being called externally. A global variable num_entries, which references the stack array, always points to the next free entry. An upper limit on this value suitably provides a control condition limiting the number of audio events which may be handled for each refresh operation.
The first of the functions is push_stack, which is used to copy data into the next free entry in the stack as indicated by num_entries. The second function, pull_stack, is principally for diagnostic and analytical purposes and copies stack entries to other storage areas whilst decrementing the value of num_entries. The third function, show_stack, is again for diagnostic and analytical purposes and results in the output of a record of current stack contents without affecting those contents. A further routine, initialise_stack, is suitably also provided for calling at start-up to ensure that all relevant variables and sections of memory are initialised to zero. This routine might suitably be used in conjunction with an initialising set of stack entries which, rather than identifying audio segments or effects, just identify features such as initial character/object locations within the virtual environment in order to provide a more complete and self-referential data structure.
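A minimal sketch of the four stack-handling functions follows, assuming a fixed-size array whose capacity realises the per-refresh event limit mentioned above; the patent describes only their behaviour, so the bodies below are one plausible realisation, with a trimmed entry type for brevity.

    /* Sketch: push_stack, pull_stack, show_stack, initialise_stack. */
    #include <stdio.h>
    #include <string.h>

    #define STACK_MAX 64   /* upper limit: caps audio events per refresh */

    typedef struct { int file_num; int loudness; } stack_entry; /* trimmed */

    static stack_entry stack[STACK_MAX];
    static int num_entries = 0;        /* always points to the next free entry */

    int push_stack(const stack_entry *e) {
        if (num_entries >= STACK_MAX) return -1;   /* per-refresh limit hit */
        stack[num_entries++] = *e;
        return 0;
    }

    /* Diagnostic: copy the top entry out, decrementing num_entries. */
    int pull_stack(stack_entry *out) {
        if (num_entries == 0) return -1;
        *out = stack[--num_entries];
        return 0;
    }

    /* Diagnostic: print current contents without modifying them. */
    void show_stack(void) {
        for (int i = 0; i < num_entries; i++)
            printf("%d: file %d, loudness %d\n", i, stack[i].file_num,
                   stack[i].loudness);
    }

    /* Start-up: zero all relevant variables and sections of memory. */
    void initialise_stack(void) {
        memset(stack, 0, sizeof stack);
        num_entries = 0;
    }

    int main(void) {
        initialise_stack();
        push_stack(&(stack_entry){ .file_num = 7, .loudness = 200 });
        show_stack();
        return 0;
    }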
In summary, we have described an arrangement of apparatus provided for the handling of audio in relation to an evaluation procedure for a virtual world or environment (which may or may not be visually displayed). For each instant in time, for which the state of the virtual world is evaluated, modelled characters and objects within the virtual world may generate various audio events, such that the audio-state of the world changes on each world state evaluation. This information is passed to an audio management procedure via a stack memory, with the audio management procedure taking the information from the stack to generate loudness and scaling factors for spatial localisation, and other audio effects. This information is then passed to an audio generation board which, in conjunction with an audio store, synthesises digital audio signals to play out. The use of the audio stack enables efficient passing of the audio event data between the world update process and the audio management process, using a relatively small number of common functions for all types of audio.
From reading the present disclosure, other modifications will be apparent to persons skilled in the art. Such modifications may involve other features which are already known in the field of audio signal handling and processing apparatuses and component parts thereof and which may be used instead of or in addition to features already described herein. Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure of the present application also includes any novel feature or any novel combination of features disclosed herein either explicitly or implicitly, whether or not it relates to the same invention as presently claimed in any claim and whether or not it mitigates any or all of the same technical problems as does the present invention. The applicants hereby give notice that new claims may be formulated to such features and/or combinations of such features during the prosecution of the present application or of any further application derived therefrom.

Claims

1. An interactive audio entertainment apparatus comprising: a first memory holding data defining a virtual environment and objects within said virtual environment; a processor coupled with the memory and arranged to generate and periodically update a model of the virtual environment and objects therein, and to generate indicators to respective predetermined audio segments to be generated in response to respective predetermined conditions or conjunctions of conditions occurring in the virtual environment; a stack memory coupled to receive and sequentially store said indicators from the processor; and an audio management stage arranged to, on each periodic update of the virtual environment, pull the stack contents, and initiate the generation of the respectively indicated audio segment or segments.
2. Apparatus as claimed in Claim 1, further comprising an audio reproduction stage, coupled to the audio management stage, and operable to generate the said predetermined audio segments when initiated by said audio management stage.
3. Apparatus as claimed in Claim 2, further comprising an audio data memory coupled with the audio reproduction stage, said memory holding individually addressable audio data segments defining respective ones of said audio segments, and being accessed following initiation of said audio reproduction stage by said audio management stage.
4. Apparatus as claimed in Claim 3, wherein each indicator generated by the processor includes an identifier for the location of the respective audio data segments within the audio data memory.
5. Apparatus as claimed in Claim 4, wherein the different audio data segments are stored as respective numbered files in the audio data memory, and the processor is arranged to specify at least the file number as part of a common format for indicators within the stack memory.
6. Apparatus as claimed in Claim 2, wherein the audio reproduction stage is operable to selectively apply one or more signal processing techniques to audio signals, and the data passed to the audio reproduction stage from the audio management stage includes specification of one or more of said techniques to be applied.
7. Apparatus as claimed in Claim 6, wherein the audio reproduction stage is operable to output audio signals on two or more channels with different signal processing applied for each channel.
8. Apparatus as claimed in Claim 1, further comprising user operable input means coupled to said processor whereby a user is enabled to alter one or more of said respective predetermined conditions pertaining to at least one virtual environment object.
9. Apparatus as claimed in Claim 1, further comprising a further store holding geometric and surface data describing a physical appearance for the modelled virtual environment and each of the objects therein, the apparatus further comprising rendering means operable to periodically generate images of the virtual environment from at least one viewpoint therein.
PCT/IB1997/000359 1996-05-29 1997-04-07 Audio data management for interactive applications WO1997045804A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP97908458A EP0847562A1 (en) 1996-05-29 1997-04-07 Audio data management for interactive applications
JP9541885A JPH11511268A (en) 1996-05-29 1997-04-07 Voice data management for conversation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB9611144.8A GB9611144D0 (en) 1996-05-29 1996-05-29 Audio data management for interactive applications
GB9611144.8 1996-05-29

Publications (1)

Publication Number Publication Date
WO1997045804A1 true WO1997045804A1 (en) 1997-12-04

Family

ID=10794427

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB1997/000359 WO1997045804A1 (en) 1996-05-29 1997-04-07 Audio data management for interactive applications

Country Status (5)

Country Link
EP (1) EP0847562A1 (en)
JP (1) JPH11511268A (en)
KR (1) KR19990035937A (en)
GB (1) GB9611144D0 (en)
WO (1) WO1997045804A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE460020T1 (en) * 2000-09-13 2010-03-15 Stratosaudio Inc SYSTEM AND METHOD FOR ORDERING AND PROVIDING MEDIA CONTENT, USING ADDITIONAL DATA TRANSMITTED IN A BROADCAST SIGNAL

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4305131A (en) * 1979-02-05 1981-12-08 Best Robert M Dialog between TV movies and human viewers
US4846693A (en) * 1987-01-08 1989-07-11 Smith Engineering Video based instructional and entertainment system using animated figure

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8977375B2 (en) 2000-10-12 2015-03-10 Bose Corporation Interactive sound reproducing
US9223538B2 (en) 2000-10-12 2015-12-29 Bose Corporation Interactive sound reproducing
US10140084B2 (en) 2000-10-12 2018-11-27 Bose Corporation Interactive sound reproducing
US10481855B2 (en) 2000-10-12 2019-11-19 Bose Corporation Interactive sound reproducing
WO2008117285A2 (en) * 2007-03-26 2008-10-02 Ronen Zeev Levy Virtual communicator for interactive voice guidance
WO2008117285A3 (en) * 2007-03-26 2009-02-26 Ronen Zeev Levy Virtual communicator for interactive voice guidance

Also Published As

Publication number Publication date
EP0847562A1 (en) 1998-06-17
GB9611144D0 (en) 1996-07-31
KR19990035937A (en) 1999-05-25
JPH11511268A (en) 1999-09-28


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP KR

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

WWE Wipo information: entry into national phase

Ref document number: 1997908458

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 1019980700598

Country of ref document: KR

ENP Entry into the national phase

Ref country code: JP

Ref document number: 1997 541885

Kind code of ref document: A

Format of ref document f/p: F

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWP Wipo information: published in national office

Ref document number: 1997908458

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1019980700598

Country of ref document: KR

WWW Wipo information: withdrawn in national office

Ref document number: 1997908458

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1019980700598

Country of ref document: KR