WO2013030623A1

WO2013030623A1 - An audio scene mapping apparatus

Info

Publication number: WO2013030623A1
Application number: PCT/IB2011/053794
Authority: WO
Inventors: Juha Petteri Ojanpera
Original assignee: Nokia Corporation
Priority date: 2011-08-30
Filing date: 2011-08-30
Publication date: 2013-03-07

Abstract

An apparatus comprising: a receiver configured to receive at least one audio signal from at least one recording apparatus; and an audio scene event detector configured to determine for the at least one recording apparatus at least one audio event indicator dependent on an at least one audio signal modulation energy value.

Description

AN AUDIO SCENE MAPPING APPARATUS

Field of the Application The present application relates to apparatus for the processing of audio and additionally video signals. The invention further relates to, but is not limited to, apparatus for processing audio and additionally video signals from mobile devices. Background of the Application

Viewing recorded or streamed audio-video or audio content is well known. Commercial broadcasters covering an event often have more than one recording device (video-camera/microphone) and a programme director will select a 'mix' where an output from a recording device or combination of recording devices is selected for transmission.

Multiple 'feeds' may be found in sharing services for video and audio signals (such as those employed by YouTube). Such systems, which are known and are widely used to share user generated content recorded and uploaded or up- streamed to a server and then downloaded or down-streamed to a viewing/listening user. Such systems rely on users recording and uploading or up- streaming a recording of an event using the recording facilities at hand to the user. This may typically be in the form of the camera and microphone arrangement of a mobile device such as a mobile phone.

Often the event is attended and recorded from more than one position by different recording users at the same time. The viewing/listening end user may then select one of the up-streamed or uploaded data to view or listen.

Summary of the Application Aspects of this application thus provide an audio source classification process whereby multiple devices can be present and recording audio signals and a server can classify and select from these audio sources suitable signals from the uploaded data.

There is provided according to the application an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to with the at least one processor cause the apparatus to at least perform: receiving at least one audio signal from at least one recording apparatus; and determining for the at least one recording apparatus at least one audio event indicator dependent on an at least one audio signal modulation energy value.

Determining for the at least one recording apparatus at least one audio event indicator may cause the apparatus to perform: determining at least one audio event for the at least one recording apparatus; and determining at least one audio event for an audio space comprising the at least one recording apparatus, wherein the at least one audio event for the audio space is selected from the at least one audio event for the at least one recording apparatus.

Determining at least one audio event for an audio space comprising the at least one recording apparatus may further cause the apparatus to perform: sorting the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and filtering the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria.

The at least one audio event criteria may comprise an audio event time index.

Determining at least one audio event for the at least one recording apparatus further may cause the apparatus to perform: sorting the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and filtering the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria. The at least one audio event criteria may comprise an audio event time index.

Determining for the at least one recording apparatus at least one audio event indicator may cause the apparatus to perform: determining an at least one audio event period for the at least one audio event for an audio space comprising the at least one recording apparatus; and determining a recording apparatus indicator associated with the audio event, wherein the audio event indicator comprises the at least one audio event period and recording apparatus indicator associated with the audio event.

Determining at least one audio event for the at least one recording apparatus may further cause the apparatus to perform: determining a modulation energy value signal for an audio signal associated with the at least one recording apparatus; determining a difference modulation energy value signal from the modulation energy value signal; and filtering the difference modulation energy value signal to generate at least one audio event dependent on the difference modulation energy value. Filtering the difference modulation energy value signal to generate at least one audio event may cause the apparatus to perform: sorting the difference modulation energy value signal according to difference modulation energy value to form a difference modulation energy vector; determining a time index difference value for neighbouring pairs of difference modulation energy vector items; and removing an item from the difference modulation energy vector when the time index difference value is less than a determined threshold.

The apparatus may be further caused to perform determining at least one modulation energy value for at least one audio signal from at least one recording apparatus, wherein determining at least one modulation energy value may cause the apparatus to further perform: transforming the at least one audio signal from at least one recording apparatus from the time domain to the frequency domain; scaling the frequency domain values with respect to the time domain at least one audio signal; converting the scaled frequency domain values into a logarithmic domain scaled frequency domain; grouping the logarithmic scaled frequency domain values according to frequency ranges; summing the logarithmic scaled frequency domain values within frequency range groups; and determining modulation energy value as a standard deviation of the summed frequency range groups.

According to a second aspect there is provided a method comprising: receiving at least one audio signal from at least one recording apparatus; and determining for the at least one recording apparatus at least one audio event indicator dependent on an at least one audio signal modulation energy value.

Determining for the at least one recording apparatus at least one audio event indicator may comprise: determining at least one audio event for the at least one recording apparatus; and determining at least one audio event for an audio space comprising the at least one recording apparatus, wherein the at least one audio event for the audio space is selected from the at least one audio event for the at least one recording apparatus. Determining at least one audio event for an audio space comprising the at least one recording apparatus may further comprise: sorting the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and filtering the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria.

The at least one audio event criteria may comprise an audio event time index.

Determining at least one audio event for the at least one recording apparatus may further comprise: sorting the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and filtering the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria. The at least one audio event criteria may comprise an audio event time index.

Determining for the at least one recording apparatus at least one audio event indicator may comprise: determining an at least one audio event period for the at least one audio event for an audio space comprising the at least one recording apparatus; and determining a recording apparatus indicator associated with the audio event, wherein the audio event indicator comprises the at least one audio event period and recording apparatus indicator associated with the audio event. Determining at least one audio event for the at least one recording apparatus may further comprise: determining a modulation energy value signal for an audio signal associated with the at least one recording apparatus; determining a difference modulation energy value signal from the modulation energy value signal; and filtering the difference modulation energy value signal to generate at least one audio event dependent on the difference modulation energy value.

Filtering the difference modulation energy value signal to generate at least one audio event may comprise: sorting the difference modulation energy value signal according to difference modulation energy value to form a difference modulation energy vector; determining a time index difference value for neighbouring pairs of difference modulation energy vector items; and removing an item from the difference modulation energy vector when the time index difference value is less than a determined threshold. The method may further comprise determining at least one modulation energy value for at least one audio signal from at least one recording apparatus, wherein determining at least one modulation energy value may comprise: transforming the at least one audio signal from at least one recording apparatus from the time domain to the frequency domain; scaling the frequency domain values with respect to the time domain at least one audio signal; converting the scaled frequency domain values into a logarithmic domain scaled frequency domain; grouping the logarithmic scaled frequency domain values according to frequency ranges; summing the logarithmic scaled frequency domain values within frequency range groups; and determining modulation energy value as a standard deviation of the summed frequency range groups.

According to a third aspect of the application there is provided an apparatus comprising: a receiver configured to receive at least one audio signal from at least one recording apparatus; and an audio scene event detector configured to determine for the at least one recording apparatus at least one audio event indicator dependent on an at least one audio signal modulation energy value. The audio scene event detector may comprise: a recording processor configured to determine at least one audio event for the at least one recording apparatus; and an audio space processor configured to determine at least one audio event for an audio space comprising the at least one recording apparatus, wherein the at least one audio event for the audio space is selected from the at least one audio event for the at least one recording apparatus.

The audio space processor may further comprise: a sorter configured to sort the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and an event filter configured to filter the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria.

The at least one audio event criteria may comprise an audio event time index. The recording processor may comprise: a sorter configured to sort the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and an event filter configured to filter the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria.

The at least one audio event criteria may comprise an audio event time index. The audio space processor may further comprise: an event position tester configured to determine an at least one audio event period for the at least one audio event for an audio space comprising the at least one recording apparatus; and a recoding apparatus selector configured to determine a recording apparatus indicator associated with the audio event, wherein the audio event indicator comprises the at least one audio event period and recording apparatus indicator associated with the audio event.

The recording processor may comprise: a modulation energy determiner configured to determine a modulation energy value signal for an audio signal associated with the at least one recording apparatus; an event signal converter configured to determine a difference modulation energy value signal from the modulation energy value signal; and an event signal filter configured to filter the difference modulation energy value signal to generate at least one audio event dependent on the difference modulation energy value.

The event signal filter may comprise: a sorter configured to sort the difference modulation energy value signal according to difference modulation energy value to form a difference modulation energy vector; a time difference determiner configured to determine a time index difference value for neighbouring pairs of difference modulation energy vector items; and a filter configured to remove an item from the difference modulation energy vector when the time index difference value is less than a determined threshold. The apparatus may further comprise a modulation energy determiner, the modulation energy determiner further comprising: a time to frequency domain transformer configured to transform the at least one audio signal from at least one recording apparatus from the time domain to the frequency domain; a scaler configured to scale the frequency domain values with respect to the time domain at least one audio signal; a log domain converter configured to convert the scaled frequency domain values into a logarithmic domain scaled frequency domain; a frequency bin selector configured to group the logarithmic scaled frequency domain values according to frequency ranges; a summer configured to sum the logarithmic scaled frequency domain values within frequency range groups; and a standard deviation determiner configured to determine the modulation energy value as a standard deviation of the summed frequency range groups. According to a fourth aspect of the application there is provided an apparatus comprising: means for receiving at least one audio signal from at least one recording apparatus; and means for determining for the at least one recording apparatus at least one audio event indicator dependent on an at least one audio signal modulation energy value.

The means for determining for the at least one recording apparatus at least one audio event indicator may comprise: means for determining at least one audio event for the at least one recording apparatus; and means for determining at least one audio event for an audio space comprising the at least one recording apparatus, wherein the at least one audio event for the audio space is selected from the at least one audio event for the at least one recording apparatus.

The means for determining at least one audio event for an audio space comprising the at least one recording apparatus may further comprise: means for sorting the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and means for filtering the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria. The at least one audio event criteria may comprise an audio event time index.

The means for determining at least one audio event for the at least one recording apparatus may further comprise: means for sorting the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and means for filtering the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria.

The at least one audio event criteria may comprise an audio event time index. The means for determining for the at least one recording apparatus at least one audio event indicator may comprise: means for determining an at least one audio event period for the at least one audio event for an audio space comprising the at least one recording apparatus; and means for determining a recording apparatus indicator associated with the audio event, wherein the audio event indicator comprises the at least one audio event period and recording apparatus indicator associated with the audio event. The means for determining at least one audio event for the at least one recording apparatus may further comprise: means for determining a modulation energy value signal for an audio signal associated with the at least one recording apparatus; means for determining a difference modulation energy value signal from the modulation energy value signal; and means for filtering the difference modulation energy value signal to generate at least one audio event dependent on the difference modulation energy value.

The means for filtering the difference modulation energy value signal to generate at least one audio event may comprise: means for sorting the difference modulation energy value signal according to difference modulation energy value to form a difference modulation energy vector; means for determining a time index difference value for neighbouring pairs of difference modulation energy vector items; and means for removing an item from the difference modulation energy vector when the time index difference value is less than a determined threshold.

The apparatus may further comprise means for determining at least one modulation energy value for at least one audio signal from at least one recording apparatus, wherein the means for determining at least one modulation energy value may comprise: means for transforming the at least one audio signal from at least one recording apparatus from the time domain to the frequency domain; means for scaling the frequency domain values with respect to the time domain at least one audio signal; means for converting the scaled frequency domain values into a logarithmic domain scaled frequency domain; means for grouping the logarithmic scaled frequency domain values according to frequency ranges; means for summing the logarithmic scaled frequency domain values within frequency range groups; and means for determining modulation energy value as a standard deviation of the summed frequency range groups.

A computer program product stored on a medium may cause an apparatus to perform the method as described herein. An electronic device may comprise apparatus as described herein.

A chipset may comprise apparatus as described herein.

Embodiments of the present application aim to address problems associated with the state of the art.

Summary of the Figures

For better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:

Figure 1 shows schematically a multi-user free-viewpoint service sharing system which may encompass embodiments of the application;

Figure 2 shows schematically an apparatus suitable for being employed in embodiments of the application;

Figure 3 shows schematically an audio scene mapping system according to some embodiments of the application;

Figure 4 shows schematically the audio scene event detector as shown in Figure 3 in further detail;

Figure 5 shows a flow diagram of the operation of the audio scene event detector recording processor according to some embodiments of the application; Figure 6 shows a flow diagram of the operation of the audio scene event detector audio space processor in further detail according to some embodiments of the application;

Figure 7 shows schematically the modulation energy determiner as shown in Figure 4 in further detail according to some embodiments of the application;

Figure 8 shows a flow diagram of the operation of the modulation energy determiner in further detail according to some embodiments;

Figure 9 shows schematically an example of the converter according to some embodiments of the application;

Figure 10 shows a flow diagram of the operation of the converter according to some embodiments of the application;

Figure 11 shows the event signal filter in further detail according to some embodiments of the application;

Figure 12 shows a flow diagram of the operation of the event signal filter according to some embodiments of the application;

Figure 13 shows schematically the audio scene event signal pre-processor as shown in Figure 4 according to some embodiments of the application;

Figure 14 shows a flow diagram of the operation of the audio signal event signal pre-processor according to some embodiments of the application;

Figure 15 shows schematically an example of the intermediate event position determiner according to some embodiments of the application;

Figure 16 shows a flow diagram of the operation of the intermediate event position determiner according to some embodiments of the application;

Figure 17 shows schematically the audio scene event position and recording device selection list generator recording in further detail according to some embodiments of the application;

Figure 18 shows a flow diagram of the operation of the audio scene event position and recording device selection list generator according to some embodiments; and

Figure 19 shows an example timeline of audio clips and event priority values.

Embodiments of the Application The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective audio scene classification and furthermore selection. In the following examples audio signals and audio capture uploading and downloading is described. However it would be appreciated that in some embodiments the audio signal/audio capture, uploading and downloading is one part of an audio-video system.

The concept of this application is related to assisting in the production of immersive person-to-person communication and can include video (and in some embodiments synthetic or computer generated content). Maturing three- dimensional audio-visual rendering and capture technology allows more natural communications to be generated. For example an all-3D experience can be created which generates opportunity for new businesses through novel product categories.

In order to provide the best listening or viewing experience, the received or down- mixed signal at the receiver device should aim to contain the interesting events or moments which are occurring within the audio visual scene being monitored or listened to. These events or moments can be characterised, for example by the position of the events in a timeline and the recording device to be selected or employed for a given duration within these events. The recording position of the event in some embodiments describes the start of the event in the audio visual scene and in some embodiments the identity of the recording device or recording source to be used for the down-mixed or received signal.

The concept of this application is thus to provide an enabler for locating the events and recording devices to be employed or selected in the audio visual scene for a given time. Although a random selection or fixed interval selection could be implemented, this would allow minimal control on the scene composition and the perceptual quality and viewing experience of the resulting down-mixed signal could be compromised. Furthermore although it may be possible to use enhancing sensor data such as compass data, gyroscope data or accelerometer data to detect events, the recording audio source could, for example by suddenly changing the compass data, produce a false real world event detection.

With respect to Figure 19, an illustrated example of the concept of the application is shown. In this figure there are three separate audio source recordings, clip A 1601 , clip B 1603, and clip C 1605, each of which have a duration of 0 to TD and creates an audio space for the event. The event detection analysis according to the concept of the application outputs five event locations to 161 1 , ti 1613, t₂ 1615, t₃ 1617, and t₄ 1619 which in this example define time instances where events occur and are determined to have impact on the audio space. For each event location a position of the event within the timeline and a recording device selection list in a preferred order is created.

Using the event location data, the down-mix or received signal can be created for the end user to enhance the listening/viewing experience. For example as shown in Figure 19, the event data output in some embodiments of the application and according to this example can be of the following format: to = 00:00:30.000 - A, B, C;

ti = 00:00:59.000 - c, A, B;

t₂ = 00:01 :10.500 - c, B, A;

t₃ = 00:01 :25.200 - A, B, C;

= 00:02:02.100 - B, A, C.

Each event data entry describes the position of the event, for example in hours:minutes:seconds. milliseconds format followed by the preferred recording device selection list for that particular event. Thus on a timeline, such as shown in Figure 19 1607, a down-mixed signal using only a single source shows that from to to ti clip A recording source is selected 1623, from ti to t₂ clip C is selected 1625, from t₂ to t₃ clip A is selected 1627, and from t₃ to t₄ clip B is selected 1629. It is understood that in some embodiments the event data may fail to output or show any event data preferred source information for time periods from 0 to to and from t₄ to TD. Therefore in some embodiments the recording source can select any recording source according to a suitable method. For example in some embodiments the selection criteria for the recording sources is to select the same recording source as the previous period, following period or to randomly select from the recording source selection list. For example as shown in Figure 19, the recording source from 0 to to is set to Clip C 1621 , and from t₄ to TD the recording source is set to Clip A 1631.

With respect to Figure 1 an overview of a suitable system within which embodiments of the application can be located is shown. The audio space 1 can have located within it at least one recording or capturing device or apparatus 19 which are arbitrarily positioned within the audio space to record suitable audio scenes. The apparatus 19 shown in Figure 1 are represented as microphones with a polar gain pattern 101 showing the directional audio capture gain associated with each apparatus. The apparatus 19 in Figure 1 are shown such that some of the apparatus are capable of attempting to capture the audio scene or activity 103 within the audio space. The activity 103 can be any event the user of the apparatus wishes to capture. For example the event could be a music event or audio of a news worthy event. The apparatus 19 although being shown having a directional microphone gain pattern 101 would be appreciated that in some embodiments the microphone or microphone array of the recording apparatus 19 has a omnidirectional gain or different gain profile to that shown in Figure 1.

Each recording apparatus 19 can in some embodiments transmit or alternatively store for later consumption the captured audio signals via a transmission channel 107 to an audio scene server 109. The recording apparatus 19 in some embodiments can encode the audio signal to compress the audio signal in a known way in order to reduce the bandwidth required in "uploading" the audio signal to the audio scene server 109. The recording apparatus 19 in some embodiments can be configured to estimate and upload via the transmission channel 107 to the audio scene server 109 an estimation of the location and/or the orientation or direction of the apparatus. The position information can be obtained, for example, using GPS coordinates, cell-ID or a-GPS or any other suitable location estimation methods and the orientation/direction can be obtained, for example using a digital compass, accelerometer, or gyroscope information.

In some embodiments the recording apparatus 19 can be configured to capture or record one or more audio signals for example the apparatus in some embodiments have multiple microphones each configured to capture the audio signal from different directions. In such embodiments the recording device or apparatus 19 can record and provide more than one signal from different the direction/orientations and further supply position/direction information for each signal. With respect to the application described herein an audio or sound source can be defined as each of the captured or audio recorded signal. In some embodiments each audio source can be defined as having a position or location which can be an absolute or relative value. For example in some embodiments the audio source can be defined as having a position relative to a desired listening location or position. Furthermore in some embodiments the audio source can be defined as having an orientation, for example where the audio source is a beamformed processed combination of multiple microphones in the recording apparatus, or a directional microphone. In some embodiments the orientation may have both a directionality and a range, for example defining the 3dB gain range of a directional microphone.

The capturing and encoding of the audio signal and the estimation of the position/direction of the apparatus is shown in Figure 1 by step 1001. The uploading of the audio and position/direction estimate to the audio scene server 109 is shown in Figure 1 by step 1003. The audio scene server 109 furthermore can in some embodiments communicate via a further transmission channel 111 to a listening device 113.

In some embodiments the listening device 113, which is represented in Figure 1 by a set of headphones, can prior to or during downloading via the further transmission channel 11 1 select a listening point, in other words select a position such as indicated in Figure 1 by the selected listening point 105. In such embodiments the listening device 1 13 can communicate via the further transmission channel 11 1 to the audio scene server 109 the request.

The selection of a listening position by the listening device 1 13 is shown in Figure 1 by step 1005.

The audio scene server 109 can as discussed above in some embodiments receive from each of the recording apparatus 19 an approximation or estimation of the location and/or direction of the recording apparatus 19. The audio scene server 109 can in some embodiments from the various captured audio signals from recording apparatus 19 produce a composite audio signal representing the desired listening position and the composite audio signal can be passed via the further transmission channel 1 11 to the listening device 113.

The generation or supply of a suitable audio signal based on the selected listening position indicator is shown in Figure 1 by step 1007. In some embodiments the listening device 113 can request a multiple channel audio signal or a mono-channel audio signal. This request can in some embodiments be received by the audio scene server 109 which can generate the requested multiple channel data. The audio scene server 109 in some embodiments can receive each uploaded audio signal and can keep track of the positions and the associated direction/orientation associated with each audio source. In some embodiments the audio scene server 109 can provide a high level coordinate system which corresponds to locations where the uploaded/upstreamed content source is available to the listening device 1 13. The "high level" coordinates can be provided for example as a map to the listening device 1 13 for selection of the listening position. The listening device (end user or an application used by the end user) can in such embodiments be responsible for determining or selecting the listening position and sending this information to the audio scene server 109. The audio scene server 109 can in some embodiments receive the selection/determination and transmit the downmixed signal corresponding to the specified location to the listening device. In some embodiments the listening device/end user can be configured to select or determine other aspects of the desired audio signal, for example signal quality, number of channels of audio desired, etc. In some embodiments the audio scene server 109 can provide in some embodiments a selected set of downmixed signals which correspond to listening points neighbouring the desired location/direction and the listening device 113 selects the audio signal desired.

In this regard reference is first made to Figure 2 which shows a schematic block diagram of an exemplary apparatus or electronic device 10, which may be used to record (or operate as a recording device 19) or listen (or operate as a listening device 113) to the audio signals (and similarly to record or view the audio-visual images and data). Furthermore in some embodiments the apparatus or electronic device can function as the audio scene server 109.

The electronic device 10 may for example be a mobile terminal or user equipment of a wireless communication system when functioning as the recording device or listening device 1 13. In some embodiments the apparatus can be an audio player or audio recorder, such as an MP3 player, a media recorder/player (also known as an MP4 player), or any suitable portable device suitable for recording audio or audio/video camcorder/memory audio or video recorder.

The apparatus 10 can in some embodiments comprise an audio subsystem. The audio subsystem for example can comprise in some embodiments a microphone or array of microphones 11 for audio signal capture. In some embodiments the microphone or array of microphones can be a solid state microphone, in other words capable of capturing audio signals and outputting a suitable digital format signal. In some other embodiments the microphone or array of microphones 1 1 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone. The microphone 1 1 or array of microphones can in some embodiments output the audio captured signal to an analogue-to-digital converter (ADC) 14.

In some embodiments the apparatus can further comprise an analogue-to-digital converter (ADC) 14 configured to receive the analogue captured audio signal from the microphones and outputting the audio captured signal in a suitable digital form. The analogue-to-digital converter 14 can be any suitable analogue-to- digital conversion or processing means.

In some embodiments the apparatus 10 audio subsystem further comprises a digital-to-analogue converter 32 for converting digital audio signals from a processor 21 to a suitable analogue format. The digital-to-analogue converter (DAC) or signal processing means 32 can in some embodiments be any suitable DAC technology.

Furthermore the audio subsystem can comprise in some embodiments a speaker 33. The speaker 33 can in some embodiments receive the output from the digital- to-analogue converter 32 and present the analogue audio signal to the user. In some embodiments the speaker 33 can be representative of a headset, for example a set of headphones, or cordless headphones. Although the apparatus 10 is shown having both audio capture and audio presentation components, it would be understood that in some embodiments the apparatus 10 can comprise one or the other of the audio capture and audio presentation parts of the audio subsystem such that in some embodiments of the apparatus the microphone (for audio capture) or the speaker (for audio presentation) are present.

In some embodiments the apparatus 10 comprises a processor 21. The processor 21 is coupled to the audio subsystem and specifically in some examples the analogue-to-digital converter 14 for receiving digital signals representing audio signals from the microphone 1 1 , and the digital-to-analogue converter (DAC) 12 configured to output processed digital audio signals. The processor 21 can be configured to execute various program codes. The implemented program codes can comprise for example audio classification and audio scene mapping code routines. In some embodiments the program codes can be configured to perform audio scene event detection and device selection indicator generation, wherein the audio scene server 109 can be configured to determine events from multiple received audio recordings to assist the user in selecting an audio recording which is meaningful and does not require the listener to carry out undue searching of all of the audio recordings.

In some embodiments the apparatus further comprises a memory 22. In some embodiments the processor is coupled to memory 22. The memory can be any suitable storage means. In some embodiments the memory 22 comprises a program code section 23 for storing program codes implementable upon the processor 21. Furthermore in some embodiments the memory 22 can further comprise a stored data section 24 for storing data, for example data that has been encoded in accordance with the application or data to be encoded via the application embodiments as described later. The implemented program code stored within the program code section 23, and the data stored within the stored data section 24 can be retrieved by the processor 21 whenever needed via the memory-processor coupling. In some further embodiments the apparatus 10 can comprise a user interface 15. The user interface 15 can be coupled in some embodiments to the processor 21. In some embodiments the processor can control the operation of the user interface and receive inputs from the user interface 15. In some embodiments the user interface 15 can enable a user to input commands to the electronic device or apparatus 10, for example via a keypad, and/or to obtain information from the apparatus 10, for example via a display which is part of the user interface 15. The user interface 15 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the apparatus 10 and further displaying information to the user of the apparatus 10.

In some embodiments the apparatus further comprises a transceiver 13, the transceiver in such embodiments can be coupled to the processor and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver 13 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.

The coupling can, as shown in Figure 1 , be the transmission channel 107 (where the apparatus is functioning as the recording device 19 or audio scene server 109) or further transmission channel 1 1 1 (where the device is functioning as the listening device 1 13 or audio scene server 109). The transceiver 13 can communicate with further devices by any suitable known communications protocol, for example in some embodiments the transceiver 13 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).

In some embodiments the apparatus comprises a position sensor 16 configured to estimate the position of the apparatus 10. The position sensor 16 can in some embodiments be a satellite positioning sensor such as a GPS (Global Positioning System), GLONASS or Galileo receiver.

In some embodiments the positioning sensor can be a cellular ID system or an assisted GPS system. In some embodiments the apparatus 10 further comprises a direction or orientation sensor. The orientation/direction sensor can in some embodiments be an electronic compass, accelerometer, a gyroscope or be determined by the motion of the apparatus using the positioning estimate.

It is to be understood again that the structure of the electronic device 10 could be supplemented and varied in many ways. Furthermore it could be understood that the above apparatus 10 in some embodiments can be operated as an audio scene server 109. In some further embodiments the audio scene server 109 can comprise a processor, memory and transceiver combination. With respect to Figure 3 an overview of the application according to some embodiments is shown with respect to the audio scene server 109 and listening device 1 13. Furthermore the operation of the audio scene server 109 according to some embodiments is shown with respect to Figure 8. As described herein, the audio scene server 109 is configured to receive the various recording capture or audio scene capture 19 sources with their uploaded audio signals. This is shown with respect to Figure 3 by the input to the audio scene server 109 of the sensor data from the capture sources and the recorded data or audio data from the capture or recording device sources. In some embodiments the audio signals and/or capture device (recording apparatus) orientation indicators can be received at some means for receiving such as a receiver, or receiver portion of a transceiver.

In some embodiments the audio scene server 109 can comprise an audio scene event detector or determiner 203, or means for determining an audio scene event from the recording apparatus, which is configured to receive the sensor data from the capture devices. The audio scene event detector or determiner 203 can be configured to determine the audio scene events and furthermore audio scene event capture device/recording device selection indicators. The output of the audio scene event detector or determiner 203 can be passed to the down mixer 205, and furthermore to the end user or listening device 113 and in some embodiments the Tenderer 209 associated with the listening device 113.

In some embodiments the audio scene server 109 can comprise a down mixer 205. The down mixer 205 can be configured in some embodiments to receive audio data such as recording source audio data, event position information and audio source selection information or indicators. The selection performed by the down mixer 205 thus can use the event information in the preparation for the composition of the audio sources used in the down mixing operation.

In some embodiments the down mixer 205 can be configured to use the selected audio sources to generate a signal suitable for transmitting on the transmission channel 111 to the listening device. For example in some embodiments the down mixer 205 can receive multiple audio source signals, and dependent on the event information, select and generate a multi-channel or single (mono) channel simulating the effect of being at the desired listening position and in a format suitable for listening to by the listening device 113. For example where the listening device is a stereo headset, the down mixer 205 can be configured to generate a suitable stereo signal.

Furthermore in some embodiments the listening device 1 13 can comprise a Tenderer 209. The renderer 209 can be configured to receive the down mixed output signal via the transmission channel 11 1 and generate a rendered signal suitable for the listening device 1 13 end user. For example in some embodiments, the Tenderer 209 can be configured to decode the encoded audio signal output by the down mixer 205 in a format suitable for presentation to a stereo headset or headphones or speakers.

With respect to Figure 4 the audio scene event detector 203 is shown in further detail. The audio scene event detector 203 can in some embodiments comprise a recording processor 301 , or means for determining at least one audio event for the at least one recording apparatus, configured to process each individual audio source input or recording source and outputs a series of event filtered signals. These event filtered signals can then be output to an audio space processor 303. The operation of the recording processor 301 is shown in further detail with respect to Figure 5.

The recording processor 301 can in some embodiments comprise a modulation energy determiner 311 or means for determining a modulation energy value signal for an audio signal associated with the at least one recording apparatus. The modulation energy determiner 31 1 can be configured to receive audio data from audio sources. The audio data can be encapsulated or encoded in any suitable format. In the following example each audio source/audio recording is in the form of a time domain digitally sampled audio signal.

The operation of receiving the audio data from audio sources is shown in Figure 5 by step 401.

Furthermore the recording processor 301 is configured to process the received audio signals for each audio source x_m , where m is the source indicator.

The operation of performing the recording processing or processing each audio source is shown in Figure 5 by step 403. In some embodiments the recording processor 301 comprises a modulation energy determiner 31 1 configured to process the audio data from each audio source x_m and determine a modulation energy value mf_m. The modulation energy mf_m can be defined as the energy of the signal over time with a high frequency resolution. In some embodiments the frequency resolution of the modulation energy signal can be set to be equal or less than 1 Hz.

The determination of the modulation energy value mf_m for each recording or source is shown in Figure 5 by step 405. With respect to Figure 7 a modulation energy determiner 31 1 according to some embodiments of the application is shown in further detail. Furthermore with respect to Figure 8 the operation of the modulation energy determiner 31 1 shown in Figure 7 is further described.

In some embodiments the modulation energy determiner 31 1 can be configured to receive the audio signal for the m^th source. The operation of receiving the audio signal for the m^th source is shown in Figure 8 by step 501 .

Furthermore the modulation energy determiner 31 1 in some embodiments can comprise a Time-to-Frequency Domain transformer 1401 or means for transforming the at least one audio signal from at least one recording apparatus from the time domain to the frequency domain.

The Time-to-Frequency Domain Transformer can be mathematically summarised by the following equation:

X_m [bin,l] = TF{x_MT ) where m is the recording source index, bin is the frequency bin index, / is time frame index, T is the hop size between successive segments, and TFQ the time- to-frequency operator. In some embodiments the Discrete Fourier Transform (DFT) is used as the TF operator, mathematically represented as follows: ^k,₆,^)=∑( (^»)-^^/-^,¾'"^'") tw(n) = win(n) - x_m(n + l - T), 0 < n < N

2 * 71 ' bin

where w_btn = , N_f \s the size of the TF() operator transform, win(n) is a N- point analysis window, such as sinusoidal, Hanning, Hamming, Welch, Bartlett, Kaiser or Kaiser-Bessel Derived (KBD) window. To obtain continuity and smooth Fourier coefficients over time, the hop size is in some embodiments set to T=N/2, that is, the previous and current signal segments are 50% overlapping.

It would be understood that in some other embodiments of the application the Time-to-Frequency Domain Transformer implementation can be any suitable time-to-frequency domain transformation, such as for example, provided by a Fast Fourier Transformer (FFT), Discrete Cosine Transformer (DCT), Modified Discrete Cosine Transformer (MDCT), Modified Discrete Sine Transformer (MDST), Quadrature Mirror Filter (QMF), or Complex Valued QMF. The Time-to- Frequency Domain Transformer 1401 can furthermore be calculated on a frame- by-frame basis where the size of the frame is of a suitably short duration. For example in some embodiments the duration of each frame can be 20 milliseconds and typically less than 50 milliseconds. The size of the TF() operator transform can in some embodiments be defined according to the following equation:

where

returns the largest integer not greater than y. It would be understood that in some embodiments a computationally efficient implementation would have a size of the transform (TF()) to be a power of 2. However it would be understood that in some embodiments the transform is not limited to this. Furthermore in some embodiments the received time domain signals can be decimated in order to reduce the sampling rate to a lower rate. For example, an initial 16 KHz sampling rate can be reduced to 8 KHz which in turn reduces the size of the Time-to-Frequency Operator Transform. Thus for example at 8 KHz the size of the Operator Transform can be N_f = 4096.

The Time-to-Frequency Domain Transformer 1401 can then output the frequency domain values to a scaler 1403.

The operation of time-to-frequency domain transforming the audio signals is shown in Figure 8 by step 503.

In some embodiments the modulation energy determiner 311 comprises a scaler/log converter 1403 or means for scaling the frequency domain values with respect to the time domain at least one audio signal and means for converting the scaled frequency domain values into a logarithmic domain scaled frequency domain. The scaler/log converter 1403 can in some embodiments be configured to receive the frequency domain values and scale and convert to the logarithmic domain (log domain), the frequency domain signal values. The scaling and log domain conversion can be used in some embodiments to smooth out the short term characteristics of the signal and enable the highlighting of relevant segments.

The scaling of the frequency domain samples in some embodiments can be with reference to the corresponding time domain signal.

The operation of scaling the frequency domain samples with the time domain signal is shown in Figure 8 by step 505. The conversion to the log domain, for example, can be performed using the following equation: iogio(| , V])

The log domain converted frequency domain values can be output by the scaler/log converter 1403 to a filter or bin selector 1405. Furthermore the log domain conversion of the scaled frequency domain samples is shown in Figure 8 by step 507.

In some embodiments the modulation energy determiner 31 1 comprises a filter/bin selector 1405 or means for grouping the logarithmic scaled frequency domain values according to frequency ranges. The filter/bin selector 1405 can be configured to receive the output of the scaler/log converter 1403 and output a series of frequency bin ranges b(k) from which the filtered or bin allocated frequency components can be further processed. The filtering or allocation of frequency components to bin or groups b(k) can be made by any suitable allocation process. For example in some embodiments a first bin or group are the frequency components from 0 to 500 Hz and a second bin from 500 to 1000 Hz. In some embodiments there can be Bk numbers of bins and the bins can be overlapping or contiguous, linear or non-linear in distribution. For example in some embodiments the bin indices can be determined according to the following equation. f{k)- 2 - N_j

b{k) f 0≤k < Bk

Fs where f(k) describes the frequency for the k bin index and Bk is the number of bin indices. In some embodiments, the following values are used: f(0)=1500, f(1 )=2000, Bk = 2. The allocated bin values can then be passed to the sum determiner 1407.

The selection of frequency bin ranges from selected frequency domain samples is shown in Figure 8 by step 509. In some embodiments the modulation energy determiner 31 1 can comprise a sum determiner 1407 or means for summing the logarithmic scaled frequency domain values within frequency range groups. The sum determiner 1407 can in some embodiments be configured to receive the bin index values and sum the scaled frequency domain values for each 'bin' of frequencies. These summed values can then in some embodiments be passed to a standard deviation determiner 1409.

The summing of the selected frequency bin range value samples is shown in Figure 8 by step 51 1.

Furthermore in some embodiments the modulation energy determiner 311 comprises a standard deviation determiner 1409 or means for determining modulation energy value as a standard deviation of the summed frequency range groups. In some embodiments the standard deviation determiner 1409 can be configured to calculate the standard deviation of the summed values.

The output of the standard deviation determiner 1409 can then be output from the modulation energy determiner as the modulation energy value. The operation of determining the modulation energy by standard deviation of sum selected frequency bin range values is shown in Figure 8 by step 513. Thus for example the modulation energy following the filtering or 'binning' of the frequency representations can be shown with regards to the following pseudo code.

where mfRes describes the sampling resolution of the modulation energy signal, tRes is the time resolution of the frames according to tRes = N / (2 * Fs), L is the number of time frames present for the signal, and twNext describes the time period the modulation energy signal is covering at each time instant. In some embodiments, the following values are used: Fs = 8000Hz, N=256, twNext=2.5seconds, mfRes = 0.25 seconds. Furthermore in some embodiments the value of xData_m \n line 12 of the pseudo code is calculated according to the following: xData_m (tStart,tEnd) - std(cSnm)

Bk-\

cSum(k) = mfs_m (bk, k), tStart < k < tEnd

bk=0 where std(y) calculated the standard deviation of y according to

y length(y) and where length(y) returns the number of samples present in vector y.

The recording processor 301 can in some embodiments comprise a converter 313. The converter 313 can be configured to convert the modulation energy signal to an event signal.

The conversion of modulation energy signals (mf_m) to event signals (eData) is shown in Figure 5 by step 407.

With respect to Figure 9, the converter 313 is shown in further detail. Furthermore with respect to Figure 10 the operation of the converter 313 is shown in further detail.

The converter 313 can be configured to receive the modulation energy signal from the modulation energy determiner 31 1 for each of the recordings.

The operation of receiving the modulation energy signal is shown in Figure 10 by step 701.

The converter 313 can in some embodiments comprise a difference determiner 601. The difference determiner 301 can be configured to determine the difference between neighbouring modulation energy samples. This can be represented mathematically as follows: eData(0) = 0

eDataii) = mf_m (i) - mf_m (z - 1), 1 < i < length(mf_m ) where mf_m is the modulation energy, The operation of determining the difference between neighbouring modulation energy samples or instances as shown in Figure 10 by step 703.

The output of the converter/difference determiner 601 can be passed to the event signal filter.

In some embodiments the recording processor 301 can comprise an event signal filter 315. The event signal filter 315 can be configured to process the event signal to determine significant occurrences within the signal and output these 'events'.

The operation of filtering the event signal is shown in Figure 5 by step 409.

The event signal filter 315 according to some embodiments is shown in further detail in Figure 1 1. Furthermore the operation of the event signal filter 315 according to some embodiments is shown in further detail with regards to Figure 12.

The event signal filter 315 can in some embodiments comprise a sorter 81 1 or means for sorting. The sorter 81 1 can be configured to receive the event signal generated by the converter 313 and sort the vector of values into a descending order.

The sorter can then pass these to a time difference determiner 813. The operation of sorting the difference values into decreasing order vectors is shown in Figure 12 by step 901. In some embodiments the event signal filter 315, or means for filtering can comprise a time difference determiner 813 or means for determining a time index difference value for neighbouring pairs of difference modulation energy vector items. The time difference determiner can in some embodiments be configured to determine the temporal difference between determined "event" or "candidate event". The time difference between these neighbouring vector items can then be passed to a vector item filter 815.

The operation of determining the time difference between neighbouring vector items is shown in Figure 12 by step 903.

The event signal filter 315 can in some embodiments further comprise a vector item filter 815 or means for removing an item from the difference modulation energy vector when the time index difference value is less than a determined threshold. The vector item filter 815 can be configured to filter vector items where the time difference between items is less than a determined threshold value.

These can then be passed to an item spacing checker 817. The operation of filtering the vector item where the time difference between items is less than a determined threshold value is shown in Figure 12 by step 905.

In some embodiments the event signal filter 315 comprises an item spacing checker 817. The item spacing checker is configured to check the remaining items to determine whether the items have enough 'distance in time', in other words whether there is sufficient temporal spacing between the remaining event signal items. Where the time difference separation is acceptable then the item spacing checker can pass these values to be stored or to be passed to the audio space processor 303. Where the item space difference is not sufficient then the items can be passed back to the sorter 81 1 to be further processed.

The operation of checking the time difference separation between 'items' or event signal determined events is shown in Figure 12 by step 907. The outputting of data when the separation being acceptable or suitable is shown in Figure 12 by step 909. Where the event signal item time difference separation is not acceptable the operation can then loop back to the operation of sorting the remaining difference values into decreasing order vectors as shown in Figure 12 step 901.

The operation of the event signal filter 315 can be summarised in some embodiments according to the following pseudo code:

where [yData, yldx] = sort(y, 'descend') sorts vector y into descending order and returns the sorted vector in yData and the corresponding vector indices in yldx. In line 5, the time difference of two neighboring vector items is calculated. If the time difference is below tThr, the event signal is set to zero to signal that the particular event is too close to a previous event (line 7). According to some embodiments the event signal filter filters the events as it is not practical to have events too close each other and therefore filtering is applied to the event signal. The event signal is sorted again in line 10 and the steps are repeated (lines 2 - 1 1 ) until neighboring events have enough distance in time. Finally, in line 13, the event data for the m recording source is determined. In the current implementation the following values are used : tThr = 3 seconds, nLoop = 13.

The operation of outputting the event signal per audio source is shown in Figure 5 by step 41 1 .

In some embodiments the audio scene event detector 203 comprises an audio space processor 303 or means for determining at least one audio event for an audio space comprising the at least one recording apparatus. The audio space processor 303 is configured to combine the event data from multiple recording sources within a defined audio space or scene to determine the event data for the audio scene. Thus in some embodiments of the application the audio space processor 303 receives each of the recording event signals for an audio space and produces the audio event position and recording device selection list output.

The operation of the audio space processor is further described with respect to the flow diagram shown in Figure 6.

In some embodiments the audio space processor 303 comprises an audio scene event signal pre-processor 321 . The audio scene event signal pre-processor 321 is configured to receive the audio space event signals from each of the recording sources within the audio scene area and output a filtered audio event list. The definition of audio scene area can be any suitable grouping or selection of audio sources such as for example defining the audio space by determination of similar audio signals, audio signal correlation, sensor correlation, and sensor information matching.

With respect to Figure 13 an example of the audio scene event signal preprocessor 321 according to some embodiments of the application is shown. Furthermore with respect to Figure 14 a flow diagram shows the operation of the audio scene event signal pre-processor 321 according to some embodiments. In some embodiments the audio scene event signal pre-processor 321 comprises an event sorter 1001 or means for sorting. The audio scene event signal preprocessor sorter 1001 is configured to receive the audio scene event data from each audio source within the audio event region.

The operation of receiving the audio scene event data from each audio source is shown in Figure 14 as step 1101.

The event signal data can in some embodiments be received in the format shown below: audioSceneEventData - audioSceneEventLength

The sorter 1001 can be configured to sort the audio scene event data into a decreasing order vector. This operation can be similar to that carried out by the sorter 811 of the event signal filter 315 but with respect to the audio scene rather than the recording source. This sorted data can then be passed to an audio space event filter 1003. The operation of sorting the audio scene event data is shown in Figure 14 by step 1 103.

In some embodiments the audio scene event signal pre-processor 321 can further comprise an audio space event filter 1003, or means for filtering the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria. The event filter 1003 can be configured to check across recording sources within a defined audio space to remove events which are 'too close' to each other. The at least one audio event criteria in some embodiments as described herein can comprise an audio event time index.

The operation of checking whether the recording sources have events which are too close to each other and filtering these is shown in Figure 14 by step 1105. The output filtered event can then be passed on to an intermediate event position determiner 323. In some embodiments. the pre-processing in terms of sorting and event filtering can be performed according to the following pseudo code:

The sorting and event filtering can be considered to operate in a similar manner to that as shown in the event signal filter 315. However with the difference that the time difference of the events is now checked across different recording sources.

The time_and_clip() function as described herein with respect the pseudocode returns the time index and the corresponding recording source index for a specified input. As the event data from different recording sources are in an appended format, the actual time index from the sorted data vector can in some embodiments be first decoded before it is used in the calculations. The time and clip function as described herein can, for example, be performed in some embodiments according to the following pseudo code:

1 [timelndex, cliplndex] = time_and_clip(ilndex, eventLen, nOutEvents)

2

3 nClips = length(eventLen);

4 oLen = minimum(length(ilndex), nOutEvents);

5

6 timelndex = zero_vector_of_size(oLen);

7 cliplndex = zero_vector_of_size(oLen);

8

9 for idx = 0 to oLen - 1

10

11 clipOffsetStart = 0; tlndex = 0; clndex = 0; targetldx = ilndex(idx);

12 for clipldx=0 to nClips - 1

13

14 clipOffsetEnd = clipOffsetEnd + eventLen(clipldx);

15 if targetldx >= clipOffsetStart && targetldx < clipOffsetEnd

16

17 clndex = clipldx;

18 tlndex = targetldx - clipOffsetStart;

19 break;

20 End

21

22 clipOffsetStart = clipOffsetEnd;

23 End

24

25 timelndex(idx) = tlndex;

26 cliplndex(idx) = clndex;

27 End

28 End The operation of pre-processing the filter event signal across the time source spectrum is shown in Figure 6 by step 453.

Furthermore in some embodiments the audio space processor 303 can comprise an intermediate event position determiner 323 or means for determining an at least one audio event period for the at least one audio event for an audio space comprising the at least one recording apparatus. The intermediate event position determiner 323 can be configured to receive the filtered audio scene event signal and generate or determine intermediate event position values.

Figure 15, for example, shows an example of an intermediate event position determiner 323 in further detail. Furthermore with respect to Figure 16 the operation of the intermediate event position determiner 323 is shown in further detail.

The intermediate event position determiner 323 in some embodiments comprises a sorter 1201 or means for sorting. The sorter 1201 can be configured to receive the pre-processed audio scene event signal. The operation of receiving the pre-processed audio scene event signal is shown in Figure 16 by step 1301.

The sorter can then be configured to sort the audio scene event signal into a descending order and perform a time and clip operation on the sorted signal for a number of intermediate event positions. These intermediate event positions can be defined by a variable 'nOutlntermediate'.

The operation of sorting the audio scene event signal into descending order is shown in Figure 16 by step 1303.

The sorter 1201 can then output the sorted data to a filter 1203. The intermediate event position determiner 323 can comprise in some embodiments, a filter 1203 or means for filtering. The filter 1203 can be configured to sort the time index for a defined number of event signal positions into an ascending order. In other words for a time sample arranging the event signal positions into time-wise ordering.

The sorting of the time index for a defined number of event signal positions into an ascending order is shown in Figure 16 by step 1305. Furthermore the filter 1203 can be configured in some embodiments to determine intermediate positions for event data by extracting a portion of, and filtering the, sorted audio scene event signal and time index.

The operation of determining the intermediate position for event data by extracting portions of and filtering the sorted audio scene event signal and time index is shown in Figure 16 by step 1307.

The output of these intermediate positions can then be output to an audio scene event position and recording device selection list generator 325.

In some embodiments the operation of the intermediate event position determiner 323, including the sorting and filtering operations can be shown with respect to a pseudo code as shown here:

1 [sData, sldx] = sort(audioSceneEventData, 'descend') # Descending order

[timelndex, clipldx] = time_and_clip(sldx, eventLens, nOutlntermediate)

[timelndex, tldx] = sort(timelndex, 'ascend') # Ascending order

2 For loop = 0 to nLoop - 1

3

4 For t = 1 to length(tldx) - 1

If tldx(t) > -1 and tldx(t-1 ) > -1

7 diff =

tldx{t)- tldx(t - 1 8 If diff < tThr tlndex = tldxit - 1) + ^ audioSceneEventLength{i)

9 audioSceneEventData(tlndex) = -1

10 Endif

Endif

11 Endfor

[sData, sldx] = sort(audioSceneEventData, 'descend') # Descending order

[timelndex, clipldx] = time_and_clip(sldx, eventLens, nOut)

12 [timelndex, tldx] = sort(timelndex, 'ascend') # Ascending order

13 Endfor

The operation of determining the intermediate event positions is shown in Figure 6 by step 455. The audio space processor 303 can furthermore comprise an audio scene event position and recording device selection list generator 325. The audio scene event position and recording device selection list generator 325 is shown in further detail with respect to Figure 17. Furthermore the operation of the audio scene event position and recording device selection list generator 325 is described in further detail with respect to the flow diagram shown in Figure 18.

In some embodiments the audio scene event position and recording device selection list generator 325 comprises an event position tester 1451 or means for determining an event period. The event position tester 1451 can be configured to receive the intermediate event position determined event signal and test the entire timeline for event positions. The event position tester 1451 can for example test or check the entire timeline for event positions.

The operation of checking the entire timeline for event positions is shown in step 1501 of Figure 18. The event position tester 1451 can then furthermore perform a 'legal time' index check on any of the detected event values, wherein the time index of the event position candidate is checked whether it has a valid or legal value. The legal time index check is shown in Figure 18 by step 1503.

The event position tester 1451 can then further determine the event position. An example of determining the event position is shown herein with respect to the pseudocode.

The determination of the event position is shown in Figure 18 by step 1505.

The event position tester 1451 can then further be configured in some embodiments to determine or check whether previous event positions have been detected.

The operation of detecting previous positions is shown in Figure 18 by step 1507.

Where no previous position has been detected the event position tester 1451 can be configured to store the event position in the vector [event pos] as the first event position.

The operation of storing the event position as a first event position is shown in Figure 18 as step 1509.

Furthermore a recording device selector 1453 or means for determining a recording apparatus indicator associated with the audio event can then determine and store a preferred audio source. The recording device selector 1453 operation of determining and storing the preferred audio source can be shown in Figure 18 by step 151 1. Where previous event positions have been detected the event position tester 1451 can be configured to determine whether or not the time difference between the new location and the previous location exceeds a determined threshold. The determination of whether the time difference exceeds the threshold is shown in Figure 18 by step 1510. Where the threshold is exceeded then the event position tester 1451 can be configured to append the stored event position vector [event pos] with the new location. The operation of appending the stored event position vector is shown in Figure 18 by step 1512.

A recording device selector 1453 can furthermore then determine the preferred audio source and store this as part of the event source list vector [event source list].

The storing and determination of the preferred audio source of the event is shown in Figure 18 by step 1513. The event position tester 1451 can then furthermore determine whether or not the entire timeline has been scanned.

The operation of checking the entire timeline as being scanned is shown in Figure 18 by step 1515.

Where the entire timeline has been scanned the event position tester 1451 can be configured to end the sequence.

The operation of ending the sequence is shown in Figure 18 by step 1519.

Where the entire timeline has not been scanned the threshold can be increased by the event position tester 1451. The operation of incrementing the threshold is shown in Figure 18 by step 1517.

After the threshold is increased the operation can then pass back to the original opening operation of checking the entire timeline for event positions.

The event position and preferred recording source selection list can be summarised as the following pseudo code:

1 nFoundldx = 0

2 while nFoundldx < nOutlntermediate - 1

3

4 nFound = 0 ; nFoundldx = 0 nCurr = 0 ;

5 for n=0 to length (tldx) - 1

6

7 if nFound < nOut

8 t = timelndex (nCurr) ;

9 if t not equal to -1

10

11 if nFoundldx > 0

12

13 diff = abs (reportRes * (t - timelndex (nFoundldx) )) ;

14

15 if diff >= tThr

16 eventPos (nFound) = mfRes - t

17 eventSourceList (nFound) = clipSelectionList_t

18 nFound = nFound + 1 ; nFoundldx = nCurr

19 Endif

20 Else

21 eventPos (nFound) = mfRes - t

22

eventSourceList (nFound) = clipSelectionList_t

23 nFound = nFound + 1; nFoundldx = nCurr;

24 Endif

25 Endif

26 Endif

27 nCurr = nCurr + 1;

28 Endfor 29

30 tThr = tThr + 0.5;

31 Endwhile where nOut defines the number of output events to be extracted and nOut > nOutlntermediate. The number of events in some embodiments can depend on the duration of the audio scene, for example, for a 3 min duration, it may be specified that there would be at maximum 30 event positions which in theory means an event for every 6 seconds.

In line 2 of the pseudocode the entire time line check (step 1501 ) is shown where a check is made to make sure that the entire timeline of the audio scene is used for the event extraction. Line 9 of the pseudocode shows the legal or valud time index check (step 1503) where the time index is checked as being legal, that is, not excluded in the previous filtering operations. Line 1 1 of the pseudocode shows when the event position tester determines whether there are event positions already located. If event positions are already located, the time difference of the new location and the previously stored location is calculated and if the time difference exceeds the time threshold, the new location is appended to the stored event positions vector eventPos (as shown in line 16). In line 17 of the pseudocode the preferred recording source selection list is determined and stored to vector eventSourceList. If there were no event positions stored yet, the new location is set to appear as the first event position in the event positions vector in line 21. Similarly, the preferred recording source selection list is stored to vector eventSourceList.

In case the event positions for the audio scene do not cover the entire timeline, the time difference threshold is increased (as shown in line 30 of the pseudocode) and the process is started again (lines 2 - 32).

The preferred recording source selection list (lines 17 and 22) can in some embodiments be determined as follows strengthData = [eventData₀ (t) ... eventData_M__x (t )]

[eventStrength, sourceldx] = sort(strengthData, descend') clipSelectionList _t = sourceIdx(m), 0≤m < M

In some embodiments also the strength of the events (eventStrength) can be incorporated to the event data. The strength of the event may be used, for example, when the same recording source populates the preferred recording source selection list as the first selection in neighboring event positions and more recording source variation is needed for the downmixed signal(s). In this case, the strength of the sources may also be used to check the deviation of the other sources in the selection list with respect to the first source selection in the list. If the deviation is small, for example, within some threshold (-1dB etc), also the second (or third, fourth, etc) source from the selection list may be used to enable more variation in the downmixed signal(s).

The determination of the audio scene event positions is shown in Figure 6 by step 457 and the operation of determining the recording device selection list is shown in Figure 6 by step 459.

Although the above has been described with regards to audio signals, or audiovisual signals it would be appreciated that embodiments may also be applied to audio-video signals where the audio signal components of the recorded data are processed in terms of the determining of the base signal and the determination of the time alignment factors for the remaining signals and the video signal components may be synchronised using the above embodiments of the invention. In other words the video parts may be synchronised using the audio synchronisation information.

It shall be appreciated that the term user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers. Furthermore elements of a public land mobile network (PLMN) may also comprise apparatus as described above.

In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.

The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples. Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims

CLAIMS:

1. Apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to with the at least one processor cause the apparatus to at least perform:

receiving at least one audio signal from at least one recording apparatus; and

determining for the at least one recording apparatus at least one audio event indicator dependent on an at least one audio signal modulation energy value.

2. The apparatus as claimed in claim 1 , wherein determining for the at least one recording apparatus at least one audio event indicator cause the apparatus to perform:

determining at least one audio event for the at least one recording apparatus; and

determining at least one audio event for an audio space comprising the at least one recording apparatus, wherein the at least one audio event for the audio space is selected from the at least one audio event for the at least one recording apparatus.

3. The apparatus as claimed in claim 2, wherein determining at least one audio event for an audio space comprising the at least one recording apparatus further causes the apparatus to perform:

sorting the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and

filtering the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria.

4. The apparatus as claimed in claim 3, wherein the at least one audio event criteria comprises an audio event time index.

5. The apparatus as claimed in claims 2 to 4, wherein determining at least one audio event for the at least one recording apparatus further causes the apparatus to perform:

6. The apparatus as claimed in claim 5, wherein the at least one audio event criteria comprises an audio event time index.

7. The apparatus as claimed in claims 2 to 6, wherein determining for the at least one recording apparatus at least one audio event indicator cause the apparatus to perform:

determining an at least one audio event period for the at least one audio event for an audio space comprising the at least one recording apparatus; and determining a recording apparatus indicator associated with the audio event, wherein the audio event indicator comprises the at least one audio event period and recording apparatus indicator associated with the audio event.

8. The apparatus as claimed in claims 2 to 7, wherein determining at least one audio event for the at least one recording apparatus further causes the apparatus to perform:

determining a modulation energy value signal for an audio signal associated with the at least one recording apparatus;

determining a difference modulation energy value signal from the modulation energy value signal; and

filtering the difference modulation energy value signal to generate at least one audio event dependent on the difference modulation energy value.

9. The apparatus as claimed in claim 8, wherein filtering the difference modulation energy value signal to generate at least one audio event causes the apparatus to perform:

sorting the difference modulation energy value signal according to difference modulation energy value to form a difference modulation energy vector; determining a time index difference value for neighbouring pairs of difference modulation energy vector items; and

removing an item from the difference modulation energy vector when the time index difference value is less than a determined threshold.

10. The apparatus as claimed in claims 1 to 9, further caused to perform determining at least one modulation energy value for at least one audio signal from at least one recording apparatus, wherein determining at least one modulation energy value causes the apparatus to further perform:

transforming the at least one audio signal from at least one recording apparatus from the time domain to the frequency domain;

scaling the frequency domain values with respect to the time domain at least one audio signal;

converting the scaled frequency domain values into a logarithmic domain scaled frequency domain;

grouping the logarithmic scaled frequency domain values according to frequency ranges;

summing the logarithmic scaled frequency domain values within frequency range groups; and

determining modulation energy value as a standard deviation of the summed frequency range groups.

1 1. A method comprising:

receiving at least one audio signal from at least one recording apparatus; and

12. The method as claimed in claim 1 1 , wherein determining for the at Ieast one recording apparatus at Ieast one audio event indicator comprises:

determining at Ieast one audio event for the at Ieast one recording apparatus; and

determining at Ieast one audio event for an audio space comprising the at Ieast one recording apparatus, wherein the at Ieast one audio event for the audio space is selected from the at Ieast one audio event for the at Ieast one recording apparatus.

13. The method as claimed in claim 12, wherein determining at Ieast one audio event for an audio space comprising the at Ieast one recording apparatus further comprises:

sorting the at Ieast one audio event for the at Ieast one recording apparatus according to at Ieast one audio event criteria; and

filtering the at Ieast one audio event for the at Ieast one recording apparatus dependent on the at Ieast one audio event criteria.

14. The method as claimed in claim 13, wherein the at Ieast one audio event criteria comprises an audio event time index.

15. The method as claimed in claims 12 to 14, wherein determining at Ieast one audio event for the at Ieast one recording apparatus further comprises:

16. The method as claimed in claim 15, wherein the at Ieast one audio event criteria comprises an audio event time index.

17. The method as claimed in claims 12 to 16, wherein determining for the at Ieast one recording apparatus at Ieast one audio event indicator comprise: determining an at least one audio event period for the at least one audio event for an audio space comprising the at least one recording apparatus; and determining a recording apparatus indicator associated with the audio event, wherein the audio event indicator comprises the at least one audio event period and recording apparatus indicator associated with the audio event.

18. The method as claimed in claims 12 to 17, wherein determining at least one audio event for the at least one recording apparatus further comprises:

19. The method as claimed in claim 18, wherein filtering the difference modulation energy value signal to generate at least one audio event comprises: sorting the difference modulation energy value signal according to difference modulation energy value to form a difference modulation energy vector; determining a time index difference value for neighbouring pairs of difference modulation energy vector items; and

20. The method as claimed in claims 11 to 19, further comprising determining at least one modulation energy value for at least one audio signal from at least one recording apparatus, wherein determining at least one modulation energy value comprises:

scaling the frequency domain values with respect to the time domain at least one audio signal; converting the scaled frequency domain values into a logarithmic domain scaled frequency domain;

21. An apparatus comprising:

a receiver configured to receive at least one audio signal from at least one recording apparatus; and

an audio scene event detector configured to determine for the at least one recording apparatus at least one audio event indicator dependent on an at least one audio signal modulation energy value.

22. The apparatus as claimed in claim 21 , wherein the audio scene event detector comprises:

a recording processor configured to determine at least one audio event for the at least one recording apparatus; and

an audio space processor configured to determine at least one audio event for an audio space comprising the at least one recording apparatus, wherein the at least one audio event for the audio space is selected from the at least one audio event for the at least one recording apparatus.

23. The apparatus as claimed in claim 22, wherein the audio space processor further comprises:

a sorter configured to sort the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and

an event filter configured to filter the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria.

24. The apparatus as claimed in claim 23, wherein the at least one audio event criteria comprises an audio event time index.

25. The apparatus as claimed in claims 22 to 24, wherein the recording processor comprises:

26. The apparatus as claimed in claim 25, wherein the at least one audio event criteria comprises an audio event time index.

27. The apparatus as claimed in claims 22 to 26, wherein the audio space processor further comprises:

an event position tester configured to determine an at least one audio event period for the at least one audio event for an audio space comprising the at least one recording apparatus; and

a recoding apparatus selector configured to determine a recording apparatus indicator associated with the audio event, wherein the audio event indicator comprises the at least one audio event period and recording apparatus indicator associated with the audio event.

28. The apparatus as claimed in claims 22 to 27, wherein the recording processor comprises:

a modulation energy determiner configured to determine a modulation energy value signal for an audio signal associated with the at least one recording apparatus;

an event signal converter configured to determine a difference modulation energy value signal from the modulation energy value signal; and

an event signal filter configured to filter the difference modulation energy value signal to generate at least one audio event dependent on the difference modulation energy value.

29. The apparatus as claimed in claim 28, wherein the event signal filter comprises:

a sorter configured to sort the difference modulation energy value signal according to difference modulation energy value to form a difference modulation energy vector;

a time difference determiner configured to determine a time index difference value for neighbouring pairs of difference modulation energy vector items; and

a filter configured to remove an item from the difference modulation energy vector when the time index difference value is less than a determined threshold.

30. The apparatus as claimed in claims 21 to 29, further comprising a modulation energy determiner, the modulation energy determiner further comprising:

a time to frequency domain transformer configured to transform the at least one audio signal from at least one recording apparatus from the time domain to the frequency domain;

a scaler configured to scale the frequency domain values with respect to the time domain at least one audio signal;

a log domain converter configured to convert the scaled frequency domain values into a logarithmic domain scaled frequency domain;

a frequency bin selector configured to group the logarithmic scaled frequency domain values according to frequency ranges;

a summer configured to sum the logarithmic scaled frequency domain values within frequency range groups; and

a standard deviation determiner configured to determine the modulation energy value as a standard deviation of the summed frequency range groups.

31. An apparatus comprising:

means for receiving at least one audio signal from at least one recording apparatus; and means for determining for the at least one recording apparatus at least one audio event indicator dependent on an at least one audio signal modulation energy value.

32. The apparatus as claimed in claim 31 , wherein the means for determining for the at least one recording apparatus at least one audio event indicator comprises:

means for determining at least one audio event for the at least one recording apparatus; and

means for determining at least one audio event for an audio space comprising the at least one recording apparatus, wherein the at least one audio event for the audio space is selected from the at least one audio event for the at least one recording apparatus.

33. The apparatus as claimed in claim 32, wherein the means for determining at least one audio event for an audio space comprising the at least one recording apparatus further comprises:

means for sorting the at least one audio event for the at least one recording apparatus according to at least one audio event criteria; and

means for filtering the at least one audio event for the at least one recording apparatus dependent on the at least one audio event criteria.

34. The apparatus as claimed in claim 33, wherein the at least one audio event criteria comprises an audio event time index.

35. The apparatus as claimed in claims 32 to 34, wherein the means for determining at least one audio event for the at least one recording apparatus further comprises:

The apparatus as claimed in claim 35, wherein the at least one audio event a comprises an audio event time index.

37. The apparatus as claimed in claims 32 to 36, wherein the means for determining for the at least one recording apparatus at least one audio event indicator comprise:

means for determining an at least one audio event period for the at least one audio event for an audio space comprising the at least one recording apparatus; and

means for determining a recording apparatus indicator associated with the audio event, wherein the audio event indicator comprises the at least one audio event period and recording apparatus indicator associated with the audio event.

38. The apparatus as claimed in claims 32 to 37, wherein the means for determining at least one audio event for the at least one recording apparatus further comprises:

means for determining a modulation energy value signal for an audio signal associated with the at least one recording apparatus;

means for determining a difference modulation energy value signal from the modulation energy value signal; and

means for filtering the difference modulation energy value signal to generate at least one audio event dependent on the difference modulation energy value.

39. The apparatus as claimed in claim 38, wherein the means for filtering the difference modulation energy value signal to generate at least one audio event comprises:

means for sorting the difference modulation energy value signal according to difference modulation energy value to form a difference modulation energy vector;

means for determining a time index difference value for neighbouring pairs of difference modulation energy vector items; and means for removing an item from the difference modulation energy vector when the time index difference value is less than a determined threshold.

40. The apparatus as claimed in claims 31 to 39, further comprising means for determining at least one modulation energy value for at least one audio signal from at least one recording apparatus, wherein the means for determining at least one modulation energy value comprises:

means for transforming the at least one audio signal from at least one recording apparatus from the time domain to the frequency domain;

means for scaling the frequency domain values with respect to the time domain at least one audio signal;

means for converting the scaled frequency domain values into a logarithmic domain scaled frequency domain;

means for grouping the logarithmic scaled frequency domain values according to frequency ranges;

means for summing the logarithmic scaled frequency domain values within frequency range groups; and

means for determining modulation energy value as a standard deviation of the summed frequency range groups.

41. A computer program product stored on a medium for causing an apparatus to perform the method of any of claims 11 to 20.

42. An electronic device comprising apparatus as claimed in claims 1 to 10 and 21 to 40.

43. A chipset comprising apparatus as claimed in claims 1 to 10 and 21 to 40.