EP2774391A1 - Audio scene rendering by aligning series of time-varying feature data - Google Patents

Audio scene rendering by aligning series of time-varying feature data

Info

Publication number
EP2774391A1
Authority
EP
European Patent Office
Prior art keywords
correlation
series
pair
resolution
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP11875048.8A
Other languages
German (de)
French (fr)
Other versions
EP2774391A4 (en)
Inventor
Juha Petteri OJANPERÄ
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Oyj
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Oyj filed Critical Nokia Oyj
Publication of EP2774391A1 publication Critical patent/EP2774391A1/en
Publication of EP2774391A4 publication Critical patent/EP2774391A4/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S3/002 Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • H04S3/004 For headphones
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/03 Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1

Definitions

  • This relates to aligning series of time-varying feature data.
  • Captured signals are transmitted and stored at a rendering location, from where an end user can select a listening point based on their preference from the reconstructed audio space.
  • a first aspect of the invention provides a method comprising:
  • each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series;
  • Receiving basis vectors relating to each of at least three series of time-varying feature data may comprise receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and wherein performing multiple correlations may comprise performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
  • performing a correlation for the first pair of said at least three series at at least the two different resolutions may comprise performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
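  • The coarse-to-fine correlation described above can be sketched as follows. This is an illustrative sketch only, not the claimed method: it assumes NumPy, uses the normalised peak of a full cross-correlation as the degree-of-correlation metric, and assumes an integer downsampling factor between the two resolutions; the names `best_lag` and `align_pair` and the 0.9 threshold are hypothetical.

```python
import numpy as np

def best_lag(a, b):
    """Return (lag, peak): the lag of a relative to b at the maximum of
    their normalised cross-correlation, and that maximum value."""
    a0, b0 = a - a.mean(), b - b.mean()
    c = np.correlate(a0, b0, mode="full")
    denom = np.linalg.norm(a0) * np.linalg.norm(b0)
    if denom:
        c = c / denom
    lag = int(np.argmax(c)) - (len(b) - 1)
    return lag, float(c.max())

def align_pair(coarse_a, coarse_b, fine_a, fine_b, factor, threshold=0.9):
    """Correlate at the coarse resolution first; refine at the fine
    resolution only if the coarse peak fails the criterion."""
    lag, peak = best_lag(coarse_a, coarse_b)
    if peak >= threshold:          # coarse result meets the criterion
        return lag * factor        # express the lag in fine-resolution units
    # otherwise perform the second, higher-resolution correlation
    # (the specification additionally describes windowing this search
    # around the coarse alignment value)
    return best_lag(fine_a, fine_b)[0]
```

Here the second, higher-resolution correlation is attempted only when the coarse result fails the criterion, mirroring the conditional refinement described above.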
  • Determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion may comprise calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
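  • The specification does not define the histogram metric in this passage; one plausible reading, sketched below, is to estimate a lag per segment of the signal, build a histogram of the per-segment lags, and accept the correlation only if the dominant lag wins a sufficient share of the estimates. The function names and the 0.5 threshold are illustrative assumptions.

```python
from collections import Counter

def histogram_metric(lags):
    """Share of per-segment lag estimates that agree with the most
    common lag; a crude confidence measure for a correlation result."""
    if not lags:
        return 0.0
    dominant = Counter(lags).most_common(1)[0][1]
    return dominant / len(lags)

def correlation_accepted(lags, threshold=0.5):
    """Accept the correlation only if the dominant lag accounts for a
    sufficient fraction of the per-segment estimates."""
    return histogram_metric(lags) >= threshold
```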
  • Performing the second correlation for the first pair of said at least three series at the second resolution may comprise performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
  • Performing multiple correlations may comprise performing the first correlation for the first pair of said at least three series at the first resolution and also performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
  • This method may comprise assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
  • the method may comprise receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and wherein performing multiple correlations for the first pair may comprise performing a third correlation at the third resolution.
  • the method may comprise selecting a series of time-varying data as a reference series and calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
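  • Selecting a reference series and computing an alignment value per remaining series might, for example, be done with a full cross-correlation, as in this sketch (NumPy assumed; `alignment_values` is a hypothetical name, and the specification's multi-resolution procedure is omitted for brevity):

```python
import numpy as np

def alignment_values(series, ref_index=0):
    """Pick one series as reference and return, for every other series,
    a delay (in samples) relative to that reference, taken from the
    peak of a full cross-correlation."""
    ref = np.asarray(series[ref_index], dtype=float)
    delays = {}
    for i, s in enumerate(series):
        if i == ref_index:
            continue
        s = np.asarray(s, dtype=float)
        c = np.correlate(s, ref, mode="full")
        # positive delay: series i lags the reference
        delays[i] = int(np.argmax(c)) - (len(ref) - 1)
    return delays
```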
  • a second aspect of the invention provides apparatus comprising:
  • the means for receiving basis vectors relating to each of at least three series of time-varying feature data may comprise means for receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and wherein the means for performing multiple correlations may comprise means for performing a first correlation for a first pair of said at least three series at the first resolution and also means for performing a second correlation for the first pair of said at least three series at the second resolution.
  • the means for performing a correlation for the first pair of said at least three series at at least the two different resolutions may comprise means for performing the first correlation for the first pair of said at least three series at a first resolution, means for determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and means for performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
  • the means for determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion may comprise means for calculating a histogram metric from the first correlation and means for comparing the histogram metric to a threshold.
  • the means for performing the second correlation for the first pair of said at least three series at the second resolution may comprise means for performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
  • the means for performing multiple correlations may comprise means for performing the first correlation for the first pair of said at least three series at the first resolution and also means for performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
  • the apparatus may comprise means for assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
  • the apparatus may comprise means for receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and wherein the means for performing multiple correlations for the first pair may comprise means for performing a third correlation at the third resolution.
  • the apparatus may comprise means for selecting a series of time-varying data as a reference series and means for calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
  • the apparatus may comprise means for calculating the basis vectors relating the at least three series of time-varying feature data.
  • the apparatus may include at least one server.
  • the apparatus may include at least one server and a data-to-audio transducing device.
  • the apparatus may include at least one server, plural audio-to-data transducing devices and a data-to-audio transducing device.
  • the apparatus may be a system comprising at least one server and plural audio-to- data transducing devices.
  • a third aspect of the invention provides a computer program comprising instructions that, when executed by a computer apparatus, control it to perform a method as recited above.
  • a fourth aspect of the invention provides a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform a method comprising:
  • each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series;
  • the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform receiving basis vectors relating to each of at least three series of time-varying feature data by receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and to perform performing multiple correlations by performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
  • the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform performing a correlation for the first pair of said at least three series at at least the two different resolutions by performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
  • the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion by calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
  • the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform performing the second correlation for the first pair of said at least three series at the second resolution by performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
  • the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform performing multiple correlations by performing the first correlation for the first pair of said at least three series at the first resolution and also performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
  • the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform a method additionally comprising assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
  • the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform a method additionally comprising receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and to perform performing multiple correlations for the first pair by performing a third correlation at the third resolution.
  • the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform a method additionally comprising selecting a series of time-varying data as a reference series and calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
  • a fifth aspect of the invention provides apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:
  • each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series;
  • the computer-readable code when executed may control the at least one processor to receive basis vectors relating to each of at least three series of time-varying feature data by receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and to perform performing multiple correlations by performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
  • the computer-readable code when executed may control the at least one processor to perform a correlation for the first pair of said at least three series at at least the two different resolutions by performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and by performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
  • the computer-readable code when executed may control the at least one processor to determine whether a metric of the degree of correlation of the first correlation meets a predetermined criterion by calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
  • the computer-readable code when executed may control the at least one processor to perform performing the second correlation for the first pair of said at least three series at the second resolution by performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
  • the computer-readable code when executed may control the at least one processor to perform performing multiple correlations by performing the first correlation for the first pair of said at least three series at the first resolution and also to perform the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
  • the computer-readable code when executed may control the at least one processor to assign a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
  • the computer-readable code when executed may control the at least one processor to receive basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and to perform performing multiple correlations for the first pair by performing a third correlation at the third resolution.
  • the computer-readable code when executed may control the at least one processor to select a series of time-varying data as a reference series and calculate an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
  • Figure 1 shows an audio scene with N capturing devices;
  • Figure 2 is a block diagram of an end-to-end system embodying aspects of the invention;
  • Figure 3 shows details of some aspects of the Figure 2 system;
  • Figure 4 shows a high-level flowchart illustrating operation of aspects of embodiments of the invention;
  • Figure 5 shows a flowchart of calculating a basis vector from Figure 4; and
  • Figure 6 shows a flowchart for aligning multi-user content from Figure 4.
  • Figures 1 and 2 illustrate a system in which embodiments of the invention can be implemented.
  • a system 10 consists of N devices 11 that are arbitrarily positioned within the audio space to record an audio scene. In these Figures, there are shown four areas of audio activity 12. The captured signals are then transmitted (or alternatively stored for later consumption) so an end user can select a listening point 13 based on his/her preference from a reconstructed audio space.
  • a rendering part then provides one or more downmixed signals corresponding to the selected listening point.
  • microphones of the devices 11 are shown to have a directional beam, but embodiments of the invention use microphones having any form of suitable beam. Furthermore, the microphones do not necessarily employ a similar beam, but microphones with different beams may be used.
  • the downmixed signal(s) may be a mono, stereo or binaural signal, or may consist of more than two channels, for instance four or six channels.
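  • The specification does not prescribe any particular downmix. As a purely illustrative sketch, a naive downmix might simply average groups of input channels (NumPy assumed; `downmix` is a hypothetical name):

```python
import numpy as np

def downmix(channels, n_out=2):
    """Downmix an (n_in, samples) array to n_out channels by averaging
    groups of input channels; e.g. 4 -> 2 averages channel pairs,
    and n_out=1 yields a mono signal."""
    channels = np.asarray(channels, dtype=float)
    n_in = channels.shape[0]
    groups = np.array_split(np.arange(n_in), n_out)
    return np.stack([channels[g].mean(axis=0) for g in groups])
```

A real downmix would normally apply channel gains and possibly spatial processing; plain averaging is used here only to keep the example self-contained.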
  • Each recording device 11 records the audio scene and uploads/upstreams (either in real time or non-real time) the recorded content to an audio server 14 via a channel 15.
  • the upload/upstream process also provides positioning information about where the audio is being recorded, and the recording direction/orientation.
  • a recording device 11 may record one or more audio signals. If a recording device 11 records (and provides) more than one signal, the direction/ orientation of these signals may be different.
  • the position information may be obtained, for example, using GPS coordinates, Cell-ID or A-GPS. Recording direction/orientation may be obtained, for example, using compass, accelerometer or gyroscope information.
  • the server 14 receives each uploaded signal and keeps track of the positions and the associated directions/ orientations.
  • the audio scene server 14 may provide high-level coordinates, which correspond to locations where user-uploaded/upstreamed content is available for listening, to an end user device 17. These high-level coordinates may be provided, for example, as a map to the end user device 17 for selection of the listening position.
  • the end user device 17, or an application used by the end user device, is responsible for determining the listening position and sending this information to the audio scene server 14.
  • the audio scene server 14 transmits the downmixed signal corresponding to the specified location to the end user device 17.
  • the audio server 14 may provide a selected set of downmixed signals that correspond to the listening point, and the user of the end user device 17 selects the downmixed signal to which he/she wants to listen.
  • a media format encapsulating the signals or a set of signals may be formed and transmitted to the end user devices 17.
  • the downmixed signals here can be audio only content or content where audio is accompanied with video content.
  • Embodiments of this specification relate to immersive person-to-person communication, also including video and possibly synthetic content.
  • Maturing 3D audio-visual rendering and capture technology facilitates a new dimension of natural communication.
  • An 'all-3D' experience is created that brings a rich experience to users and brings opportunity to new businesses through novel product categories.
  • the local device clocks of different users would normally need to agree to within at least a few tens of milliseconds before content from multiple users can be jointly processed.
  • Figure 3 shows a schematic block diagram of a system 10 according to embodiments of the invention. Reference numerals are retained from Figures 1 and 2 for like elements.
  • multiple end user recording devices 11 are connected to a server 14 by a first transmission channel or network 15.
  • the user devices 11 are used for detecting an audio scene for recording.
  • the user devices 11 may record audio and store it locally for uploading later. Alternatively, they may transmit the audio in real time, in which case they may or may not also store a local copy.
  • the user devices 11 are referred to as recording devices 11 because they record audio, although they may not permanently store the audio locally.
  • the server 14 is connected to listening user devices 17 via a second transmission channel 18.
  • the first and second channels 15 and 18 may be the same channel, different channels or networks, or different channels within a single network.
  • the listening user devices 17 may also be termed consuming devices on the basis that audio content is consumed at those devices 17.
  • Each of the recording devices 11 is a communications device equipped with a microphone.
  • Each device 11 may for instance be a mobile phone, smartphone, laptop computer, tablet computer, PDA, personal music player, video camera, stills camera or dedicated audio recording device, for instance a dictaphone or the like.
  • the recording device 11 includes a number of components including a processor 20 and a memory 21.
  • the processor 20 and the memory 21 are connected to the outside world by an interface 22.
  • At least one microphone 23 is connected to the processor 20.
  • the microphone 23 may be directional. If there are multiple microphones 23, they may have different orientations of sensitivity.
  • the memory 21 may be a non-volatile memory such as read-only memory (ROM), a hard disk drive (HDD) or a solid-state drive (SSD).
  • the memory 21 stores, amongst other things, an operating system 24 and at least one software application 25.
  • the memory 21 is used for the temporary storage of data as well as permanent storage. Alternatively, there may be separate memories for temporary and non-temporary storage.
  • the operating system 24 may contain code which, when executed by the processor 20 in conjunction with the memory 21, controls operation of each of the hardware components of the device 11.
  • the one or more software applications 25 and the operating system 24 together cause the processor 20 to operate in such a way as to achieve required functions.
  • the functions include processing audio data, and may include recording it. As is explained below, the functions may also include processing audio data to derive basis vectors therefrom.
  • the audio server 14 includes a processor 30, a memory 31 and an interface 32. Within the memory 31 are stored an operating system 34 and one or more software applications 35.
  • the memory 31 may be a non-volatile memory such as read-only memory (ROM), a hard disk drive (HDD) or a solid-state drive (SSD).
  • the memory 31 stores, amongst other things, an operating system 34 and at least one software application 35.
  • the memory 31 is used for the temporary storage of data as well as permanent storage. Alternatively, there may be separate memories for temporary and non-temporary storage.
  • the operating system 34 may contain code which, when executed by the processor 30 in conjunction with the memory 31, controls operation of each of the hardware components of the server 14.
  • the one or more software applications 35 and the operating system 34 together cause the processor 30 to operate in such a way as to achieve required functions.
  • the functions may include processing received audio data to derive basis vectors therefrom.
  • the functions may also include processing basis vectors to derive alignment information therefrom.
  • the functions may also include processing alignment information and audio to render audio therefrom.
  • a processor 40 is connected to a memory 41 and to an interface 42.
  • An operating system 44 is stored in the memory, along with one or more software applications 45.
  • the memory 41 may be a non-volatile memory such as read-only memory (ROM), a hard disk drive (HDD) or a solid-state drive (SSD).
  • the memory 41 stores, amongst other things, an operating system 44 and at least one software application 45.
  • the memory 41 is used for the temporary storage of data as well as permanent storage. Alternatively, there may be separate memories for temporary and non-temporary storage.
  • the operating system 44 may contain code which, when executed by the processor 40 in conjunction with the memory 41, controls operation of each of the hardware components of the listening user device 17.
  • the one or more software applications 45 and the operating system 44 together cause the processor 40 to operate in such a way as to achieve required functions.
  • the functions may include processing audio data to derive basis vectors therefrom.
  • the functions may also include processing basis vectors to derive alignment information therefrom.
  • the functions may also include processing alignment information and audio to render audio therefrom.
  • Each of the user devices 11, the audio server 14 and the listening user devices 17 operates according to the operating system and software applications that are stored in its respective memory. Where in the following one of these devices is said to achieve a certain operation or provide a certain function, this is achieved by the software and/or the operating system stored in the memories unless otherwise stated.
  • Audio recorded by a recording device 11 is a time-varying series of data.
  • the audio may be represented in raw form, as samples. Alternatively, it may be represented in a non-compressed format or a compressed format, for instance as provided by a codec.
  • The choice of codec for a particular implementation of the system may depend on a number of factors. Suitable codecs may include codecs that operate according to the audio interchange file format, pulse-density modulation, pulse-amplitude modulation, direct stream transfer, or free lossless audio coding, or any of a number of other coding principles. Coded audio represents a time-varying series of data in some form.
  • Figure 4 is a flowchart illustrating steps that occur in some part of the system 10 of Figure 3.
  • In step 4.1 of Figure 4, the basis vectors are calculated for each series of content data.
  • In step 4.2, the content from various users, i.e. different series of data, is aligned using the calculated basis vectors.
  • In step 4.3, the multi-user content is rendered for end user consumption.
  • the rendering part may include various processing such as audio mixing, view switching, or joint processing of multi-user content. The exact details for the rendering part are outside the scope of this specification but what is common to all of the available rendering processing methods is that the multi-user content is assumed to be in synchronization.
  • the steps of Figure 4 may be implemented in various ways in the end-to-end context.
  • step 4.1 (creating basis vectors) is performed in the recording devices 11, and steps 4.2 (aligning content) and 4.3 (rendering content) are performed in the server 14.
  • basis vectors are transmitted from the recording devices 11 to the server 14 through the first channel 15 and the rendered content is transmitted from the server 14 to the consuming devices 17 through the second channel 18.
  • the basis vector calculation step 4.1 is performed in the recording devices 11, the aligning step 4.2 is performed in the audio server 14 and the rendering step 4.3 is performed in the consuming devices 17.
  • basis vectors are transmitted from the recording devices 11 to the audio server 14 through the channel 15 and alignment results are transmitted from the server 14 to the consuming devices 17.
  • basis vectors are created in step 4.1 of the recording devices 11 and the aligning step 4.2 and the rendering step 4.3 are performed in the consuming devices 17.
  • basis vectors are transmitted from the recording devices 11 to the consuming devices 17, and this transmission may or may not be via the audio server 14.
  • each of the basis vector calculation step 4.1, the alignment step 4.2 and the rendering step 4.3 are performed by the audio server 14.
  • raw or compressed audio data is received from the recording devices 11 and rendered content is transmitted from the server 14 to the consuming device 17.
  • the basis vector calculating step 4.1 and the alignment step 4.2 are performed at the audio server 14 and the rendering step 4.3 is performed at the end user device 17.
  • raw or compressed audio data is transmitted from the recording devices 11 to the audio server 14 and alignment results are transmitted from the audio server 14 to the consuming device 17.
  • basis vector calculating step 4.1 is performed at the audio server 14 and the alignment step 4.2 and the rendering step 4.3 are performed at the consuming device 17.
  • basis vectors are transmitted from the audio server to the receiving device 17 for processing.
  • the basis vector calculation step 4.1, the alignment step 4.2 and the rendering step 4.3 are all performed at the consuming device 17.
  • raw or coded audio data is transmitted from the recording devices 11 to the consuming device 17, optionally via the server 14, for processing thereat.
  • alignment results may be transmitted back to the server 14 for storage and for possible later use, and/or for use by other consuming devices 17.
  • Where basis vectors are calculated by the recording devices 11, raw or coded audio content still needs to be transmitted from the recording devices 11 to the server 14 and/or the consuming devices 17.
  • Where the rendering step 4.3 is performed at the audio server 14, raw or coded audio directly from one of the recording devices 11 is not usually provided to any of the consuming devices 17.
  • In step 5.1, feature data is calculated from the content data.
  • the feature data is then converted in step 5.2 to basis vectors that describe the content data in the feature domain. Specific examples of performing steps 5.1 and 5.2 will now be explained in detail.
  • the feature data is calculated in step 5.1 using a transform operator, as follows.
  • Each recording source audio signal is first transformed to a frequency domain
  • transform (TF) operator is applied to each signal segment according to:
  • TF(x_m, l, k) = Σ_{n=0}^{N−1} win(n) · x_m(n + l·T) · e^{−j2πkn/N}    (1)
  • Equation (1) is calculated on a frame-by-frame basis where the size of a frame is of short duration, for example, 20ms (typically less than 50ms).
  • each series of audio data is converted to another domain, here the frequency domain.
  • a transform operator, for instance a discrete Fourier transform, is then applied.
  • the transform operator is applied to each signal segment.
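The framewise windowed transform described above can be sketched as follows. This is a minimal illustration, not code from the patent; the frame length, hop size and Hann window are assumed example values.

```python
import numpy as np

def framewise_dft(x, frame_len=512, hop=256):
    """Transform a signal to the frequency domain frame by frame,
    applying a window to each short segment (cf. Equation (1))."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = []
    for l in range(n_frames):
        seg = x[l * hop:l * hop + frame_len]   # segment offset l*hop, cf. l*T
        frames.append(np.fft.rfft(win * seg))  # windowed DFT of the segment
    return np.array(frames)  # shape: (n_frames, frame_len // 2 + 1)

x = np.sin(2 * np.pi * 440 / 8000 * np.arange(8000))  # 1 s of 440 Hz at 8 kHz
X = framewise_dft(x)
```

With 512-sample frames and a 256-sample hop this corresponds to roughly 32 ms frames at 16 kHz, i.e. within the short-duration range the text describes.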
  • Feature data may also be calculated, for example, based on harmonic ratio of the audio signal, low energy ratio, audio beating, or various MPEG-7 defined audio descriptors such as AudioSpectrumSpreadType.
  • the basis vectors may be calculated from multiple feature data instances. This can increase robustness in the content alignment operation.
  • the feature data is then converted to basis vectors according to the following pseudo-code:
  • sRes describes the sampling resolution of the basis vector
  • L is the number of time frames present for the signal
  • twNext describes the time period each element in the basis vector is covering at each time instant.
  • the number of window elements is set to a function of the sampling resolution of the basis vectors and the time resolution of frames.
  • the number of elements is then set to a function of the number of time frames present for the series and the number of window elements.
  • a start time is set to a function of time and the number of window elements and end time is set to a function of start time, the time period each element in the basis vector is covering at each time instant and the number of window elements. This is performed for each time frame.
  • the basis vector value is determined by applying a function to the data between the start time and end time and assigning the result of applying the function to basis vector index t.
  • the frequency domain in the sampling period is calculated to be the sum of all frequency bins that make up X at any instant k between start and end times.
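The conversion of the framewise feature data into a basis vector, as described above, might look like the sketch below. The names `frame_dur` and `s_res` stand in for the frame time resolution and sRes, and summing bin magnitudes over the window is one plausible choice of the applied function; both are assumptions of this example.

```python
import numpy as np

def basis_vector(X, frame_dur=0.02, s_res=0.5):
    """Collapse framewise spectra X (n_frames x n_bins) into a basis
    vector where each element covers s_res seconds of the signal."""
    num_win = int(round(s_res / frame_dur))  # window elements: frames per element
    num_elem = X.shape[0] // num_win         # number of elements in the vector
    bv = np.empty(num_elem)
    for t in range(num_elem):
        start, end = t * num_win, t * num_win + num_win  # start/end times
        # each element sums the magnitudes of all bins between start and end
        bv[t] = np.abs(X[start:end]).sum()
    return bv

X = np.random.default_rng(0).standard_normal((100, 33))
bv = basis_vector(X)   # 100 frames of 20 ms collapse into 0.5 s elements
```

Basis vectors at the coarser resolutions discussed in this section can be obtained by doubling `s_res`.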
  • Equation (6) is repeated for each element t. In the present embodiments, the binIdx for the TF() operator is set to:
  • binIdx(k) = (f(k) / Fs) · N
  • Fs is the sampling rate of the source signals and f() describes the frequencies to be covered, both in Hz.
  • a frequency bin index, which depends on an integer k, is calculated as a function of the reciprocal of the sampling rate and the frequencies covered for a period N.
  • the basis vectors are calculated for multiple different resolutions.
  • the three different resolutions relate to three different time periods.
  • the largest time period is a factor greater than the second largest, which is the same factor greater than the smallest.
  • the largest time period is twice the second largest time period, which is twice the smallest time period, at least approximately.
  • Figure 6 shows a flowchart for determining the alignment between the multi-user recorded content data. First, for each content data pair, the correlation of the pair is calculated (step 6.1). Next, correlation metrics are determined for the pair to assess whether the content pair is correlated and the degree of the correlation in step 6.2.
  • If the metrics indicate that the pair is correlated but the degree of correlation is not strong enough (step 6.3), the basis vectors are changed to another resolution and the correlations are recalculated with the new basis vectors (step 6.4). The steps are repeated until correlation is found or until all of the different-resolution basis vectors have been processed. Finally, in step 6.5 of Figure 6, the relative time differences between the multi-user content are determined from the correlation metrics. Next, the elements of Figure 6 are explained in more detail.
  • the correlation of the data pair (x,y) for index k is a function of the sign variable and the cross-correlation xC.
  • the cross-correlation is a function of normalized length of data pair (x,y) at index k and the ratio of the cross-correlation value.
  • the numerator of the cross-correlation value is calculated to be the sum of the product of x and y at indices defined by k and the length of the data pair.
  • the denominator of the cross- correlation value is the root of the product of the sum of the delayed data vector y squared and the sum of the data vector x squared.
  • The correlation in Equation (9) is calculated twice in order to determine whether it is the content x that needs to be delayed with respect to content y or vice versa in order to achieve synchronization.
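The two-way normalized cross-correlation described around Equation (9) can be sketched as below. This is an illustrative Python sketch, not the patent's implementation; `best_lag` and its sign convention are inventions of this example.

```python
import numpy as np

def norm_xcorr(x, y, k):
    """Normalized cross-correlation of x against y delayed by k samples
    (cf. Equation (9)); the overlap length shrinks as k grows."""
    n = min(len(x), len(y) - k)
    xs, ys = x[:n], y[k:k + n]
    num = np.dot(xs, ys)                               # numerator: sum of products
    den = np.sqrt(np.sum(xs ** 2) * np.sum(ys ** 2))   # denominator: energy product root
    return num / den if den > 0 else 0.0

def best_lag(x, y, max_lag):
    """Evaluate the correlation both ways to decide which signal needs
    to be delayed; a positive result means y starts later than x."""
    fwd = [norm_xcorr(x, y, k) for k in range(max_lag)]
    bwd = [norm_xcorr(y, x, k) for k in range(max_lag)]
    return (int(np.argmax(fwd)) if max(fwd) >= max(bwd)
            else -int(np.argmax(bwd)))

rng = np.random.default_rng(1)
a = rng.standard_normal(400)
b = np.concatenate([np.zeros(7), a])   # b starts 7 samples later than a
lag = best_lag(a, b, 20)
```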
  • Equation (12) is repeated for each index k.
  • the lagVal describes the average alignment needed for the content pair based on the multiple instances of the basis vector and stdVal describes the deviation of the alignment in the basis vector domain.
  • a low deviation indicates that the alignment value has converged towards a certain value that most likely corresponds to the time difference between the content pair.
  • the mean of the lag values is a function of the reciprocal of the number of lag values and the sum of all the lag values over the domain k.
  • the standard deviation of the lag values is the root of the product of the reciprocal of the number of lag values and the sum of the lag values minus their average, all squared.
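As a worked example of the lagVal and stdVal metrics described above (the numbers are invented for illustration):

```python
import numpy as np

# lagVal: average alignment over multiple basis-vector instances;
# stdVal: its deviation -- a low deviation means the estimate converged.
lags = np.array([12.0, 11.0, 13.0, 12.0])
lag_val = lags.sum() / len(lags)                              # the mean
std_val = np.sqrt(((lags - lag_val) ** 2).sum() / len(lags))  # the deviation
```

Here the four instances cluster tightly around 12 elements, so the small deviation signals a converged alignment estimate.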
  • in step 6.3 of Figure 6 the correlation metrics are assessed to see whether the content pair is correlated or whether a new set of basis vectors should be taken into use to test whether a weak correlation can be converted into a strong correlation.
  • start time 1 is zeroed if the average alignment is less than zero and is set to a value which is a function of the average alignment and basis vector resolution if the average alignment is greater than or equal to zero.
  • the basis vector resolution defines the time period each element in the basis vector is covering.
  • start time 2 is zeroed if the average alignment is greater than or equal to zero and is set to a value which is a function of the average alignment and the basis vector resolution (res) if the average alignment is less than zero.
  • End time 1 is formulated in the same way as start time 1, except instead of being zeroed it is set to be equal to the length when the average alignment is less than zero.
  • End time 2 is formulated in the same way as start time 2, except instead of being zeroed it is set to be equal to the length when the average alignment is greater than or equal to zero.
  • a weak correlation triggers a switch of the resolution and at the same time the calculation window is positioned around the alignment value of the weak correlation.
  • the calculation window is limited to ±15 seconds around the first resolution alignment value and ±10 seconds around the second resolution alignment value. In this way, either the weak correlation is turned into a strong correlation, indicating that the content data pair is correlated, or the weak correlation is confirmed by the multi-resolution calculations and hence the content data pair is not correlated.
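The positioning of the calculation window around a weak-correlation alignment value can be sketched as follows. This is a simplified illustration: the patent's start/end time formulation distinguishes the sign of the average alignment, which is collapsed here into simple clamping, and `win_seconds` is an assumed parameter name.

```python
def calc_window(lag_val, res, length, win_seconds=15):
    """Limit the correlation search to a window around the alignment
    value found at the previous (weakly correlating) resolution.
    res is the time period one basis-vector element covers, in seconds."""
    half = int(win_seconds / res)    # window half-size in vector elements
    center = int(round(lag_val))
    start = max(0, center - half)    # clamp to the vector's bounds
    end = min(length, center + half)
    return start, end

# ±15 s window around a first-resolution lag of 40 elements of 0.5 s each
start, end = calc_window(40, 0.5, length=200)
```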
  • the correlation is calculated for multiple resolutions in a parallel manner, that is, calculations are performed, for example, both for lines 6 and 12 even if a strong correlation is already available for the first resolution. In these embodiment variations, all calculations for the content pair are required to indicate a strong correlation before the content pair is assigned a correlated value.
  • the switching logic between resolutions may be improved by also employing the histogram of the correlation results. In such variations of the embodiments the pseudo-code 3 is used. Pseudo-code 3:
  • the new logic uses histogram metrics to assess the correlation in the content data pair.
  • the histogram metrics are histRatio, and histVal.
  • the histVal describes the alignment value in the histogram domain and histRatio describes the ratio of the histVal items with respect to the total number of items available.
  • the alignment value in the histogram domain may be determined, for example, by defining 3 time resolutions that have a width of 0.5s, 1s, and 1.5s, and determining the histogram distribution of the alignment values for each of the resolutions.
  • the histVal is the histogram item that gets most hits, and the histRatio is the corresponding ratio value.
  • the final value for the pseudo-code 3 is the one that maximizes the ratio over all 3 time resolution widths.
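A sketch of the histVal/histRatio computation over the three resolution widths is given below. It is illustrative only: pseudo-code 3 is not reproduced in this text, and binning the alignment values by rounding is an assumption of this example.

```python
import numpy as np

def hist_metrics(lags, widths=(0.5, 1.0, 1.5)):
    """Histogram the per-instance alignment values at several bin
    widths; histVal is the most-hit bin and histRatio the share of
    items falling in it. The width-maximizing ratio is kept."""
    best = (0.0, None)
    for w in widths:
        bins = np.round(np.asarray(lags) / w).astype(int)  # quantize lags
        vals, counts = np.unique(bins, return_counts=True)
        ratio = counts.max() / len(lags)                   # share of top bin
        if ratio > best[0]:
            best = (ratio, float(vals[counts.argmax()] * w))
    hist_ratio, hist_val = best
    return hist_val, hist_ratio

# four instances agree near 12 s; one outlier at 30 s
hv, hr = hist_metrics([11.8, 12.1, 12.2, 30.0, 11.9])
```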
  • the final step is then to determine the relative time differences between the multi-user content data in step 6.5 of Figure 6.
  • 0 ≤ m, n < M, where M is the number of content data items to be aligned and the corrMatrix holds the alignment value for the mth and nth content data pair that is determined according to pseudo-code 2.
  • the alignment values of different data pairs are adjusted so that common reference index is used for all pairs. This adjustment process is as follows
  • W(i, j) = invalid_value, 0 ≤ i < M, 0 ≤ j < M
  • the matrix W is first initialized to default values and then the difference of the matrix entries at indices (i,j) and (i, refIdx) is calculated where valid.
  • corrMatrix(i, refIdx) ≠ not_correlated
  • the mean alignment value is calculated for each column in the alignment matrix according to:
  • index (refldx, i) where only valid matrix entries are taken into account.
  • the final output for index (refldx, i) is always the difference with respect to the first column in the corresponding row for the matrix mOut.
  • Equation (18) is repeated for 0 ≤ i < M. Furthermore, Equations (16)-(18) are repeated for 0 ≤ refIdx < M.
  • the final alignment is the mean value of the previously calculated mean values for the different reference matrix entries.
  • Equation (19) is repeated for 0 ≤ i < M. It may be advantageous to position the alignment values such that they are with respect to the content that appears first in the timeline.
  • the minimum alignment value is determined and all the values are adjusted with respect to this. In this way, the content that appears first in the timeline is assigned alignment value zero and the rest of the content items are delayed with respect to this one.
  • the alignment value for each vector element is re-positioned as a difference of the corresponding matrix element and the minimum value of the alignment vector. Repositioning is determined only for valid vector entries.
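The repositioning with respect to the timeline-first content can be sketched as below. Using NaN as the invalid-entry marker is an assumption of this example, not a convention from the patent.

```python
import numpy as np

# Re-position the alignment values so the content that appears first
# in the timeline gets value zero and the rest are delayed relative to
# it; only valid (non-NaN) entries are re-positioned.
align = np.array([3.5, -2.0, np.nan, 0.5])
valid = ~np.isnan(align)
align_out = align.copy()
align_out[valid] = align[valid] - np.nanmin(align)  # subtract the minimum
```

Here the second content item appears first in the timeline, so it is assigned zero and the others become positive delays relative to it.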
  • the multi-user content is now in synchronization according to the alignOut values and the content can be jointly processed for various rendering and analysis purposes.
  • An advantage achieved by the above-described embodiments is that they do not require any special timecodes, clappers or any other special preparations for the content alignment.
  • the above-described embodiments, as well as other embodiments operating on the same principles, can operate at low computational complexity, thereby enabling tens of content items to be aligned simultaneously.
  • only a certain portion of the basis vectors may be considered in the alignment at a time. This may be particularly advantageous for content items which have different durations, for example if content A has a duration of 1 min and content B has a duration of 10 min.
  • the processing flow is as per pseudo-code 4 below
  • the A and B content are both split into smaller duration segments as defined by refDuration, dstDuration, refAdvance, and dstAdvance. These small duration segments are aligned until alignment between the segments is found or all segments have been processed. The switch to smaller duration segments takes place if alignment is not found using the basic alignment setup with the original durations.
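The segment-splitting strategy of pseudo-code 4 might be sketched as follows. This is illustrative only: pseudo-code 4 itself is not reproduced here, and the enumeration order of the candidate offsets is an assumption.

```python
def segment_pairs(ref_len, dst_len, ref_dur, dst_dur, ref_adv, dst_adv):
    """Split two content items of different durations into smaller
    segments; yields (ref_start, dst_start) offsets in seconds to try
    until an alignment is found. The parameter names loosely follow
    refDuration/dstDuration/refAdvance/dstAdvance."""
    pairs = []
    r = 0
    while r + ref_dur <= ref_len:          # slide over the reference content
        d = 0
        while d + dst_dur <= dst_len:      # slide over the destination content
            pairs.append((r, d))
            d += dst_adv
        r += ref_adv
    return pairs

# 1 min content A against 10 min content B, 60 s segments, 60 s advances
pairs = segment_pairs(60, 600, 60, 60, 60, 60)
```

For the 1 min / 10 min example above, the single reference segment is tried against each of the ten destination segments in turn.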
  • while the above embodiments relate to series of time-varying data that represent audio, the scope of the invention is not limited to this.
  • the invention is applicable also to processing video and other such time-varying series of data including static images, where spatial resolutions (width & height) can be considered to be time-varying.


Abstract

Apparatus comprises means for receiving basis vectors relating to each of at least three series of time-varying feature data; means for performing multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series; and means for aligning each of the at least three series of time-varying feature data.

Description

AUDIO SCENE RENDERING BY ALIGNING SERIES OF TIME-VARYING FEATURE DATA
Field
This relates to aligning series of time-varying feature data.
Background
It is known to distribute devices around an audio space and use them to record an audio scene. Captured signals are transmitted and stored at a rendering location, from where an end user can select a listening point based on their preference from the reconstructed audio space.
This type of system presents numerous technical challenges.
Summary
A first aspect of the invention provides a method comprising:
receiving basis vectors relating to each of at least three series of time-varying feature data;
performing multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series; and
aligning each of the at least three series of time-varying feature data.
Receiving basis vectors relating to each of at least three series of time-varying feature data may comprise receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and wherein performing multiple correlations may comprise performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution. Here, performing a correlation for the first pair of said at least three series at at least the two different resolutions may comprise performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion. Determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion may comprise calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
Performing the second correlation for the first pair of said at least three series at the second resolution may comprise performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
Performing multiple correlations may comprise performing the first correlation for the first pair of said at least three series at the first resolution and also performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations. This method may comprise assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
The method may comprise receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and wherein performing multiple correlations for the first pair may comprise performing a third correlation at the third resolution.
The method may comprise selecting a series of time-varying data as a reference series and calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
A second aspect of the invention provides apparatus comprising:
means for receiving basis vectors relating to each of at least three series of time-varying feature data; means for performing multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series; and
means for aligning each of the at least three series of time-varying feature data.
The means for receiving basis vectors relating to each of at least three series of time-varying feature data may comprise means for receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and wherein the means for performing multiple correlations may comprise means for performing a first correlation for a first pair of said at least three series at the first resolution and also means for performing a second correlation for the first pair of said at least three series at the second resolution.
The means for performing a correlation for the first pair of said at least three series at at least the two different resolutions may comprise means for performing the first correlation for the first pair of said at least three series at a first resolution, means for determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and means for performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion. The means for determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion may comprise means for calculating a histogram metric from the first correlation and means for comparing the histogram metric to a threshold.
The means for performing the second correlation for the first pair of said at least three series at the second resolution may comprise means for performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
The means for performing multiple correlations may comprise means for performing the first correlation for the first pair of said at least three series at the first resolution and also means for performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
The apparatus may comprise means for assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion. The apparatus may comprise means for receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and wherein the means for performing multiple correlations for the first pair may comprise means for performing a third correlation at the third resolution.
The apparatus may comprise means for selecting a series of time-varying data as a reference series and means for calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
The apparatus may comprise means for calculating the basis vectors relating the at least three series of time-varying feature data.
The apparatus may include at least one server.
The apparatus may include at least one server and a data-to-audio transducing device. The apparatus may include at least one server, plural audio-to-data transducing devices and a data-to-audio transducing device.
The apparatus may be a system comprising at least one server and plural audio-to- data transducing devices.
A third aspect of the invention provides a computer program comprising instructions that when executed by a computer apparatus control it to perform a method as recited above.
A fourth aspect of the invention provides a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform a method comprising:
receiving basis vectors relating to each of at least three series of time-varying feature data;
performing multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series; and
aligning each of the at least three series of time-varying feature data.
The computer-readable code, when executed by computing apparatus, may cause the computing apparatus to perform receiving basis vectors relating to each of at least three series of time-varying feature data by receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and to perform performing multiple correlations by performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
The computer-readable code, when executed by computing apparatus, may cause the computing apparatus to perform performing a correlation for the first pair of said at least three series at at least the two different resolutions by performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
The computer-readable code, when executed by computing apparatus, may cause the computing apparatus to perform determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion by calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
The computer-readable code, when executed by computing apparatus, may cause the computing apparatus to perform performing the second correlation for the first pair of said at least three series at the second resolution by performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
The computer-readable code, when executed by computing apparatus, may cause the computing apparatus to perform performing multiple correlations by performing the first correlation for the first pair of said at least three series at the first resolution and also performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
The computer-readable code, when executed by computing apparatus, may cause the computing apparatus to perform a method additionally comprising assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
The computer-readable code, when executed by computing apparatus, may cause the computing apparatus to perform a method additionally comprising receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and to perform performing multiple correlations for the first pair by performing a third correlation at the third resolution.
The computer-readable code, when executed by computing apparatus, may cause the computing apparatus to perform a method additionally comprising selecting a series of time-varying data as a reference series and calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
A fifth aspect of the invention provides apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:
to receive basis vectors relating to each of at least three series of time- varying feature data;
to perform multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series; and
to align each of the at least three series of time-varying feature data.
The computer-readable code when executed may control the at least one processor to receive basis vectors relating to each of at least three series of time-varying feature data by receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and to perform performing multiple correlations by performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
The computer-readable code when executed may control the at least one processor to perform a correlation for the first pair of said at least three series at at least the two different resolutions by performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and by performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
The computer-readable code when executed may control the at least one processor to determine whether a metric of the degree of correlation of the first correlation meets a predetermined criterion by calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
The computer-readable code when executed may control the at least one processor to perform performing the second correlation for the first pair of said at least three series at the second resolution by performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
The computer-readable code when executed may control the at least one processor to perform performing multiple correlations by performing the first correlation for the first pair of said at least three series at the first resolution and also to perform the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations. The computer-readable code when executed may control the at least one processor to assign a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
The computer-readable code when executed may control the at least one processor to receive basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and to perform performing multiple correlations for the first pair by performing a third correlation at the third resolution. The computer-readable code when executed may control the at least one processor to select a series of time-varying data as a reference series and calculate an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings.
Brief Description of the Drawings
Figure 1 shows audio scene with N capturing devices;
Figure 2 is a block diagram of an end-to-end system embodying aspects of the invention; Figure 3 shows details of some aspects of the Figure 2 system;
Figure 4 shows a high level flowchart illustrating operation of aspects of embodiments of the invention;
Figure 5 shows a flowchart of calculating a basis vector from Figure 4; and
Figure 6 shows a flowchart for aligning multi-user content from Figure 4.
Detailed Description of Embodiments
Figures 1 and 2 illustrate a system in which embodiments of the invention can be implemented. A system 10 consists of N devices 11 that are arbitrarily positioned within the audio space to record an audio scene. In these Figures, there are shown four areas of audio activity 12. The captured signals are then transmitted (or alternatively stored for later consumption) so an end user can select a listening point 13 based on his/her preference from a reconstructed audio space. A rendering part then provides one or more downmixed signals from the multiple recordings that correspond to the selected listening point. In Figure 1, microphones of the devices 11 are shown to have a directional beam, but embodiments of the invention use microphones having any form of suitable beam. Furthermore, the microphones do not necessarily employ a similar beam, but microphones with different beams may be used. The downmixed signal(s) may be a mono, stereo, binaural signal or may consist of more than two channels, for instance four or six channels.
In an end-to-end system context, the framework operates as follows. Each recording device 11 records the audio scene and uploads/upstreams (either in real-time or non real-time) the recorded content to an audio server 14 via a channel 15. The upload/upstream process also provides positioning information about where the audio is being recorded and the recording direction/orientation. A recording device 11 may record one or more audio signals. If a recording device 11 records (and provides) more than one signal, the direction/orientation of these signals may be different. The position information may be obtained, for example, using GPS coordinates, Cell-ID or A-GPS. Recording direction/orientation may be obtained, for example, using compass, accelerometer or gyroscope information.
Ideally, there are many users/devices 11 recording an audio scene at different positions but in close proximity. The server 14 receives each uploaded signal and keeps track of the positions and the associated directions/orientations.
Initially, the audio scene server 14 may provide high level coordinates, which correspond to locations where user uploaded/upstreamed content is available for listening, to an end user device 17. These high level coordinates may be provided, for example, as a map to the end user device 17 for selection of the listening position. The end user device 17, or e.g. an application used by the end user device, is responsible for determining the listening position and sending this information to the audio scene server 14. Finally, the audio scene server 14 transmits the downmixed signal corresponding to the specified location to the end user device 17. Alternatively, the audio server 14 may provide a selected set of downmixed signals that correspond to the listening point and the end user device 17 selects the downmixed signal to which he/she wants to listen. Furthermore, a media format encapsulating the signals or a set of signals may be formed and transmitted to the end user devices 17. The downmixed signals here can be audio only content or content where audio is accompanied with video content.
Embodiments of this specification relate to immersive person-to-person communication, including also video and possibly synthetic content. Maturing 3D audio-visual rendering and capture technology facilitates a new dimension of natural communication. An 'all-3D' experience is created that brings a rich experience to users and brings opportunity to new businesses through novel product categories. To be able to jointly utilize the multi-user recorded content for various content processing methods, such as audio mixing from multiple users 11 and view switching from one user 11 to another, the content between different users must employ a common time base. The local device clocks of different users would normally need to agree to within a few tens of milliseconds before content from multiple users can be jointly processed. If the clocks of different user devices 11 (and, hence, the timestamps of the creation time of the content itself) are not in synchronization, content processing methods fail (as they produce a poor quality signal/content from the multi-user recorded content). It is an aim of embodiments of this specification to align multi-user recorded content efficiently for content processing purposes.
Figure 3 shows a schematic block diagram of a system 10 according to embodiments of the invention. Reference numerals are retained from Figures 1 and 2 for like elements.
In Figure 3, multiple end user recording devices 11 are connected to a server 14 by a first transmission channel or network 15. The user devices 11 are used for detecting an audio scene for recording. The user devices 11 may record audio and store it locally for uploading later. Alternatively, they may transmit the audio in real time, in which case they may or may not also store a local copy. The user devices 11 are referred to as recording devices 11 because they record audio, although they may not permanently store the audio locally.
The server 14 is connected to listening user devices 17 via a second transmission channel 18. The first and second channels 15 and 18 may be the same channel or network, different channels or networks, or different channels within a single network. The listening user devices 17 may also be termed consuming devices, on the basis that audio content is consumed at those devices 17.
Each of the recording devices 11 is a communications device equipped with a microphone. Each device 11 may for instance be a mobile phone, smartphone, laptop computer, tablet computer, PDA, personal music player, video camera, stills camera or dedicated audio recording device, for instance a dictaphone or the like. The recording device 11 includes a number of components including a processor 20 and a memory 21. The processor 20 and the memory 21 are connected to the outside world by an interface 22. At least one microphone 23 is connected to the processor 20. The microphone 23 may be directional. If there are multiple microphones 23, they may have different orientations of sensitivity.
The memory 21 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 21 stores, amongst other things, an operating system 24 and at least one software application 25. The memory 21 is used for the temporary storage of data as well as permanent storage. Alternatively, there may be separate memories for temporary and non-temporary storage. The operating system 24 may contain code which, when executed by the processor 20 in conjunction with the memory 21, controls operation of each of the hardware components of the device 11. The one or more software applications 25 and the operating system 24 together cause the processor 20 to operate in such a way as to achieve required functions. In this case, the functions include processing audio data, and may include recording it. As is explained below, the functions may also include processing audio data to derive basis vectors therefrom.
The audio server 14 includes a processor 30, a memory 31 and an interface 32. Within the memory 31 are stored an operating system 34 and one or more software applications 35.
The memory 31 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 31 stores, amongst other things, an operating system 34 and at least one software application 35. The memory 31 is used for the temporary storage of data as well as permanent storage. Alternatively, there may be separate memories for temporary and non-temporary storage. The operating system 34 may contain code which, when executed by the processor 30 in conjunction with the memory 31, controls operation of each of the hardware components of the server 14.
The one or more software applications 35 and the operating system 34 together cause the processor 30 to operate in such a way as to achieve required functions. In this case, the functions may include processing received audio data to derive basis vectors therefrom. The functions may also include processing basis vectors to derive alignment information therefrom. The functions may also include processing alignment information and audio to render audio therefrom.
Within the listening user devices 17, a processor 40 is connected to a memory 41 and to an interface 42. The memory 41 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 41 stores, amongst other things, an operating system 44 and at least one software application 45. The memory 41 is used for the temporary storage of data as well as permanent storage. Alternatively, there may be separate memories for temporary and non-temporary storage. The operating system 44 may contain code which, when executed by the processor 40 in conjunction with the memory 41, controls operation of each of the hardware components of the listening user device 17.
The one or more software applications 45 and the operating system 44 together cause the processor 40 to operate in such a way as to achieve required functions. In this case, the functions may include processing audio data to derive basis vectors therefrom. The functions may also include processing basis vectors to derive alignment information therefrom. The functions may also include processing alignment information and audio to render audio therefrom.
Each of the user devices 11, the audio server 14 and the listening user devices 17 operates according to the operating system and software applications that are stored in the respective memories thereof. Where, in the following, one of these devices is said to achieve a certain operation or provide a certain function, this is achieved by the software and/or the operating system stored in the memories unless otherwise stated.
Audio recorded by a recording device 11 is a time-varying series of data. The audio may be represented in raw form, as samples. Alternatively, it may be represented in an uncompressed format or a compressed format, for instance as provided by a codec. The choice of codec for a particular implementation of the system may depend on a number of factors. Suitable codecs may include codecs that operate according to the audio interchange file format, pulse-density modulation, pulse-amplitude modulation, direct stream transfer, or free lossless audio coding, or any of a number of other coding principles. Coded audio represents a time-varying series of data in some form.
High level operation of embodiments of the invention will now be described with reference to Figure 4, which is a flowchart illustrating steps that occur in some part of the system 10 of Figure 3.
First, in step 4.1 of Figure 4, the basis vectors are calculated for each series of content data. Next, in step 4.2, the content from various users, i.e. different series of data, is aligned using the calculated basis vectors. Finally, in step 4.3, the multi-user content is rendered for end user consumption. The rendering part may include various processing such as audio mixing, view switching, or joint processing of multi-user content. The exact details of the rendering part are outside the scope of this specification, but what is common to all of the available rendering processing methods is that the multi-user content is assumed to be in synchronization. The steps of Figure 4 may be implemented in various ways in the end-to-end context.
The steps shown in the flowchart of Figure 4 are performed in different parts of the system 10 in different embodiments.
In a first embodiment, step 4.1 (creating basis vectors) is performed in the recording devices 11, and steps 4.2 (aligning content) and 4.3 (rendering content) are performed in the server 14. In this first embodiment, basis vectors are transmitted from the recording devices 11 to the server 14 through the first channel 15 and the rendered content is transmitted from the server 14 to the consuming devices 17 through the second channel 18.
In a second embodiment, the basis vector calculation step 4.1 is performed in the recording devices 11, the aligning step 4.2 is performed in the audio server 14 and the rendering step 4.3 is performed in the consuming devices 17. In this embodiment, basis vectors are transmitted from the recording devices 11 to the audio server 14 through the channel 15 and alignment results are transmitted from the server 14 to the consuming devices 17.
In a third embodiment, basis vectors are created in step 4.1 of the recording devices 11 and the aligning step 4.2 and the rendering step 4.3 are performed in the consuming devices 17. In this embodiment, basis vectors are transmitted from the recording devices 11 to the consuming devices 17, and this transmission may or may not be via the audio server 14.
In a fourth embodiment, each of the basis vector calculation step 4.1, the alignment step 4.2 and the rendering step 4.3 are performed by the audio server 14. In this case, raw or compressed audio data is received from the recording devices 11 and rendered content is transmitted from the server 14 to the consuming device 17.
In a fifth embodiment, the basis vector calculating step 4.1 and the alignment step 4.2 are performed at the audio server 14 and the rendering step 4.3 is performed at the end user device 17. In this embodiment, raw or compressed audio data is transmitted from the recording devices 11 to the audio server 14 and alignment results are transmitted from the audio server 14 to the consuming device 17.
In a sixth embodiment, basis vector calculating step 4.1 is performed at the audio server 14 and the alignment step 4.2 and the rendering step 4.3 are performed at the consuming device 17. In this embodiment, basis vectors are transmitted from the audio server to the receiving device 17 for processing.
In a seventh embodiment, the basis vector calculation step 4.1, the alignment step 4.2 and the rendering step 4.3 are all performed at the consuming device 17. Here, raw or coded audio data is transmitted from the recording devices 11 to the consuming device 17, optionally via the server 14, for processing thereat. In the sixth and seventh embodiments, in which the alignment is performed at the consuming device 17, alignment results may be transmitted back to the server 14 for storage and for possible later use, and/or for use by other consuming devices 17. Of course, even where basis vectors are calculated by the recording devices 11, raw or coded audio content needs to be transmitted from the recording devices 11 to the server 14 and/or the consuming devices 17. In embodiments in which the rendering step 4.3 is performed at the audio server 14, raw or coded audio directly from one of the recording devices 11 is not usually provided to any of the consuming devices 17.
The basis vector calculation step 4.1 and the alignment step 4.2 are described in more detail below. Figure 5 shows a flow chart for determining the basis vectors for each content data item. First, in step 5.1, feature data is calculated from the content data. The feature data is then converted in step 5.2 to basis vectors that describe the content data in the feature domain. Specific examples for performing steps 5.1 and 5.2 will now be explained in detail. The feature data is calculated in step 5.1 using a transform operator as follows.
Each recording source audio signal is first transformed to a frequency domain representation. The transform (TF) operator is applied to each signal segment according to:

X_m[bin, l] = TF(x_m, bin, l, T)    (1)

where m is the recording source index, bin is the frequency bin index, l is the time frame index, T is the hop size between successive segments, and TF() is the time-to-frequency operator. In the present embodiments, the Discrete Fourier Transform (DFT) is used as the transform operator as follows:

TF(x_m,n,l) = Σ_{n=0}^{N-1} win(n) · x_m(n + l·T) · e^(−j·w_bin·n)    (2)

w_bin = (2 · π · bin) / N    (3)

where N is the size of the TF() operator transform and win(n) is an N-point analysis window, such as a sinusoidal, Hanning, Hamming, Welch, Bartlett, Kaiser or Kaiser-Bessel Derived (KBD) window. To obtain continuity and smooth Fourier coefficients over time, the hop size is set to T = N/2, that is, the previous and current signal segments are 50% overlapping. Naturally, the frequency domain representation may also be obtained using the DCT, MDCT/MDST, QMF, complex-valued QMF or any other transform that provides a frequency domain representation. Equation (1) is calculated on a frame-by-frame basis where a frame is of short duration, for example 20 ms (typically less than 50 ms).
To summarise, each series of audio data is converted to another domain, here the frequency domain, by applying a transform operator, for instance a discrete Fourier transform, to each signal segment.
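The windowed-DFT feature extraction of Equations (1)-(3) can be sketched as follows. This is a minimal NumPy illustration assuming a Hanning window and hop T = N/2; the function name and the test signal are illustrative, not taken from the specification.

```python
import numpy as np

def stft_features(x, N=1024):
    """Hanning-windowed DFT frames with 50% overlap (hop T = N/2),
    per Equations (1)-(3). Returns a (frames x bins) magnitude array."""
    T = N // 2
    win = np.hanning(N)
    n_frames = (len(x) - N) // T + 1
    X = np.empty((n_frames, N // 2 + 1))
    for l in range(n_frames):
        seg = win * x[l * T : l * T + N]     # win(n) * x_m(n + l*T)
        X[l] = np.abs(np.fft.rfft(seg))      # |X_m[bin, l]|
    return X

# 1 second of a 440 Hz tone sampled at 8 kHz
Fs = 8000
t = np.arange(Fs) / Fs
X = stft_features(np.sin(2 * np.pi * 440 * t))
```

Each row of X is one frame of magnitudes |X_m[bin, l]|; with T = N/2 successive frames overlap by 50%, as required for smooth coefficients over time.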
This transform based feature data extraction is just one example of the various possible implementation choices. Feature data may also be calculated, for example, based on harmonic ratio of the audio signal, low energy ratio, audio beating, or various MPEG-7 defined audio descriptors such as AudioSpectrumSpreadType. The basis vectors may be calculated from multiple feature data instances. This can increase robustness in the content alignment operation.
The feature data is then converted to basis vectors according to the following pseudo-code:

Pseudo-code 1:

1  nWindowElements = floor(sRes / tRes + 0.5)
2  nElements = floor(L / nWindowElements + 0.5)
3
4  For t = 0 to nElements - 1
5
6    tStart = t * nWindowElements
7    tEnd = tStart + twNext * nWindowElements
8    If tEnd > L
9      tEnd = L
10
11   bV_bx,m(t) = fD_bx,m(tStart, tEnd)
12 Endfor

where sRes describes the sampling resolution of the basis vector, tRes is the time resolution of the frames according to tRes = N / (2 · Fs), L is the number of time frames present for the signal, and twNext describes the time period each element in the basis vector is covering at each time instant.
To summarise, the number of window elements is set to a function of the sampling resolution of the basis vectors and the time resolution of the frames. The number of elements is then set to a function of the number of time frames present for the series and the number of window elements. For each element, a start time is set to a function of the element index and the number of window elements, and an end time is set to a function of the start time, the time period each element in the basis vector covers at each time instant, and the number of window elements. The basis vector value is then determined by applying a function to the data between the start time and the end time and assigning the result to basis vector index t.

The value of fD_bx,m in line 11 is calculated according to:

fD_bx,m(tStart, tEnd) = Σ_{bk=2·bx}^{2·bx+1} X_m(binIdx(bk), k),  tStart ≤ k < tEnd    (6)

where binIdx() returns the frequency bin indices to be included in the calculation and Bk is the number of bin indices to be used. To summarise, each basis vector element is the sum of all frequency bins that make up X at every instant k between the start and end times.
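Pseudo-code 1 together with the fD calculation of line 11 can be sketched as follows. This is a hedged NumPy illustration in which X is assumed to be a (frames x bins) magnitude spectrogram and binIdx is passed in as a function; the exact data layout and the unit of twNext are assumptions, since the text does not fix them.

```python
import numpy as np

def basis_vector(X, bx, sRes, tRes, L, twNext, binIdx):
    """Sketch of Pseudo-code 1 plus the fD sum: each basis-vector
    element sums the spectrogram X (frames x bins, assumed layout)
    over a time window and over bins binIdx(2*bx) and binIdx(2*bx+1)."""
    nWindowElements = int(sRes / tRes + 0.5)
    nElements = int(L / nWindowElements + 0.5)
    bV = np.zeros(nElements)
    for t in range(nElements):
        tStart = t * nWindowElements
        tEnd = min(int(tStart + twNext * nWindowElements), L)
        for bk in (2 * bx, 2 * bx + 1):          # the two bins of element bx
            bV[t] += X[tStart:tEnd, binIdx(bk)].sum()
    return bV
```

Because successive windows can overlap (tEnd reaches past the next tStart when twNext > 1), consecutive basis-vector elements share spectral energy, which smooths the feature series.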
Equation (6) is repeated for 0 ≤ bx < Bk/2. In the present embodiments, the binIdx for the TF() operator is set to:

binIdx(k) = floor((f(k) · 2 · N) / Fs),  0 ≤ k < Bk    (7)

f(k) = 50 · (k + 1),  k = 0, 1, 2, ..., Bk − 1    (8)

where Fs is the sampling rate of the source signals and f() describes the frequencies to be covered, both in Hz. As can be seen, the frequencies covered by the basis vectors have a width of 50 Hz and the first frequency is set to 50 Hz, that is, f = 50, 100, 150, 200, 250, ..., 50·Bk.
To summarise, a frequency bin index, which depends on an integer k, is calculated as a function of the frequency to be covered, the transform size N and the reciprocal of the sampling rate.
The basis vectors are calculated for multiple different resolutions. In the present embodiments, the basis vectors are calculated for three different resolutions with the following parameters:

bV0_bx,m(t): bvRes0 = 0.25 seconds, sRes = bvRes0, twNext = 2.5 s
bV1_bx,m(t): bvRes1 = 0.064 seconds, sRes = bvRes1, twNext = 1.0 s
bV2_bx,m(t): bvRes2 = 0.125 seconds, sRes = bvRes2, twNext = 1.5 s
In summary, the three different resolutions relate to three different time periods. The largest time period is a factor greater than the second largest, which is the same factor greater than the smallest. In this example, the largest time period is twice the second largest time period, which is twice the smallest time period, at least approximately.

Figure 6 shows a flowchart for determining the alignment between the multi-user recorded content data. First, for each content data pair, the correlation of the pair is calculated (step 6.1). Next, correlation metrics are determined for the pair to assess whether the content pair is correlated and the degree of the correlation (step 6.2). If the metrics indicate that the pair is correlated but the degree of correlation is not strong enough (step 6.3), the basis vectors are changed to another resolution and the calculations are repeated with the new basis vectors (step 6.4). The steps are repeated until correlation is found or until all of the different resolution basis vectors have been processed. Finally, in step 6.5 of Figure 6, the relative time differences between the multi-user content are determined from the correlation metrics. Next, the elements of Figure 6 are explained in more detail.
The correlation for some arbitrary content data pair (x, y) is calculated according to:

xCorr_x,y(k) = sign(xC_k) · xC_k,  0 ≤ k < xLen
xLen = length(x), yLen = length(y)

xC_k = ((min(xLen, yLen) − k) / yLen) · (sum0 / sqrt(sum1 · energy))

sum0 = Σ_{l=0}^{min(xLen,yLen)−k−1} x[l] · y[l + k]
sum1 = Σ_{l=0}^{min(xLen,yLen)−k−1} x[l]²
energy = Σ_{l=0}^{min(xLen,yLen)−k−1} y[l + k]²    (9)

where length() returns the size of the specified vector and min() returns the minimum of the specified values. In summary, the correlation of the data pair (x, y) for lag index k is a function of the sign variable and the cross-correlation xC. The cross-correlation is a function of the normalised overlap length of the data pair (x, y) at index k and the ratio of the cross-correlation value. The numerator of the cross-correlation value is calculated as the sum of the products of x and y at indices defined by k and the length of the data pair. The denominator of the cross-correlation value is the root of the product of the sum of the delayed data vector y squared and the sum of the data vector x squared. The correlation in Equation (9) is calculated twice in order to determine whether it is content x that needs to be delayed with respect to content y, or vice versa, in order to achieve synchronization. The output vector from step 6.1 is therefore:

corrXY_x,y = [xCorr_x,y  xCorr_y,x]    (10)
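The one-directional correlation of Equation (9) can be sketched as follows; a minimal NumPy illustration in which the variable names follow the equation and the overlap handling is a plausible reading of the normalisation terms.

```python
import numpy as np

def xcorr(x, y):
    """Sketch of Equation (9): length-normalised cross-correlation where
    y is delayed by k elements relative to x, with the sign retained."""
    xLen, yLen = len(x), len(y)
    out = np.zeros(xLen)
    for k in range(xLen):
        n = min(xLen, yLen) - k                          # overlap length at lag k
        if n <= 0:
            break
        sum0 = np.dot(x[:n], y[k:k + n])                 # numerator
        denom = np.sqrt(np.dot(x[:n], x[:n]) *
                        np.dot(y[k:k + n], y[k:k + n]))  # sqrt(sum1 * energy)
        if denom > 0:
            out[k] = (n / yLen) * sum0 / denom
    return out
```

Per Equation (10), the correlation is run both ways, xcorr(x, y) and xcorr(y, x), so that the direction with the stronger peak tells which content must be delayed.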
The alignment between the content pair is determined by the maximum correlation in the correlation output as follows:

xyVal = max(corrXY_x,y)    (11)

where max() returns the index that holds the maximum absolute value in the specified vector. Equation (11) is calculated for each element within the basis vectors according to:

xyVal_bx = sampRes_bx · max(corrXY_bx,x,y)    (12)

where sampRes_bx describes the resolution of the basis vector in the time domain for the bx'th element within the basis vector matrix. Equation (12) is repeated for 0 ≤ bx < Bk/2.
Next, correlation metrics are determined for the content pair according to:

lagVal = mean(xyVal), stdVal = std(xyVal)    (13)

where:

mean(z) = (1 / length(z)) · Σ_{k=0}^{length(z)−1} z[k]

std(z) = sqrt((1 / length(z)) · Σ_{k=0}^{length(z)−1} (z[k] − mean(z))²)    (14)
The lagVal describes the average alignment needed for the content pair based on the multiple instances of the basis vector, and stdVal describes the deviation of the alignment in the basis vector domain. Typically, a low deviation indicates that the alignment value has converged towards a certain value that most likely corresponds to the time difference between the content pair.
To summarise, the mean of z is a function of the reciprocal of the length of z and the sum of all z[k] over the index k. The standard deviation of z is the root of the product of the reciprocal of the length of z and the sum of the squared differences between z[k] and the mean of z.
Then, in step 6.3 of Figure 6, the correlation metrics are assessed to see whether the content pair is correlated or whether a new set of basis vectors should be taken into use to test whether a weak correlation can be converted into a strong correlation.
The calculations for Figure 6 start with the first resolution within bV0_bx,m for each content pair. Let sampRes_bx = bvRes0, and let the content data pair (x, y) be bV0_fIdx,bx,m1 and bV0_fIdx,bx,m2, respectively. The steps 6.1-6.4 can be summarised into the following pseudo-code, as shown in pseudo-code 2.
Pseudo-code 2:

1  CorrAlign(m1, m2)
2  {
3    tVal = 0
4    outVal = not_correlated
5
6    Set basis functions 0)
7    Calculate correlation and correlation metrics - Equations (9)-(13)
8    If stdVal < 0.2s
9      outVal = lagVal
10   Else
11   {
12     Set basis functions 1)
13     Calculate correlation and correlation metrics - Equations (9)-(13)
14     lagVal = lagVal + tVal
15     If stdVal < 0.2s
16       outVal = lagVal
17     Else
18     {
19       Set basis functions 2)
20       Calculate correlation and correlation metrics - Equations (9)-(13)
21       lagVal = lagVal + tVal
22       If stdVal < 0.2s
23         outVal = lagVal
24     }
25   }
26   Return outVal
27 }

where lines 6, 12, and 19 are determined according to:

Line 6: Set basis function 0):

fIdx = 0
x = bV_fIdx,bx,m1(t)
y = bV_fIdx,bx,m2(t)

Line 12: Set basis function 1):

fIdx = fIdx + 1
x = bV_fIdx,bx,m1(t), start1 ≤ t < end1
y = bV_fIdx,bx,m2(t), start2 ≤ t < end2

start1 = floor((lagVal − 15s) / bvRes_fIdx + 0.5), if lagVal >= 0; 0, otherwise
end1   = floor((lagVal + 15s) / bvRes_fIdx + 0.5), if lagVal >= 0; length(bV_fIdx,bx,m1), otherwise
start2 = floor((|lagVal| − 15s) / bvRes_fIdx + 0.5), if lagVal < 0; 0, otherwise
end2   = floor((|lagVal| + 15s) / bvRes_fIdx + 0.5), if lagVal < 0; length(bV_fIdx,bx,m2), otherwise

tVal = start1 · bvRes_fIdx, if lagVal >= 0; −start2 · bvRes_fIdx, otherwise

Line 19: Set basis function 2):

fIdx = fIdx + 1
x = bV_fIdx,bx,m1(t), start1 ≤ t < end1
y = bV_fIdx,bx,m2(t), start2 ≤ t < end2

start1 = floor((lagVal − 10s) / bvRes_fIdx + 0.5), if lagVal >= 0; 0, otherwise
end1   = floor((lagVal + 10s) / bvRes_fIdx + 0.5), if lagVal >= 0; length(bV_fIdx,bx,m1), otherwise
start2 = floor((|lagVal| − 10s) / bvRes_fIdx + 0.5), if lagVal < 0; 0, otherwise
end2   = floor((|lagVal| + 10s) / bvRes_fIdx + 0.5), if lagVal < 0; length(bV_fIdx,bx,m2), otherwise

tVal = start1 · bvRes_fIdx, if lagVal >= 0; −start2 · bvRes_fIdx, otherwise

In all three cases the following is also used: sampRes_bx = bvRes_fIdx.
To summarise, start time 1 is zeroed if the average alignment is less than zero, and is set to a value which is a function of the average alignment and the basis vector resolution if the average alignment is greater than or equal to zero. The basis vector resolution defines the time period each element in the basis vector is covering. Conversely, start time 2 is zeroed if the average alignment is greater than or equal to zero, and is set to a value which is a function of the average alignment and the basis vector resolution if the average alignment is less than zero. End time 1 is formulated in the same way as start time 1, except that instead of being zeroed it is set equal to the length when the average alignment is less than zero. End time 2 is formulated in the same way as start time 2, except that instead of being zeroed it is set equal to the length when the average alignment is greater than or equal to zero.
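The window positioning described above can be sketched as follows. This illustrates only one signal's window (start1/end1 and tVal for the ±15 s or ±10 s cases); the mirrored window for the other signal, driven by a negative lagVal, follows the same pattern. The function name and the clamping at zero are assumptions consistent with the summary above.

```python
def refine_window(lagVal, bvRes, vec_len, half_window=15.0):
    """Position the calculation window (in basis-vector elements) around
    a weak-correlation alignment value lagVal (seconds). Returns
    (start, end, tVal); tVal is the time offset the window introduces."""
    if lagVal >= 0:
        start = max(int((lagVal - half_window) / bvRes + 0.5), 0)
        end = int((lagVal + half_window) / bvRes + 0.5)
        tVal = start * bvRes
    else:
        # this signal keeps its full window; the other signal's window
        # is narrowed instead
        start, end, tVal = 0, vec_len, 0.0
    return start, min(end, vec_len), tVal
```

Because the refined correlation is computed inside this narrower window, its lag is relative to the window start; adding tVal back (line 14 of pseudo-code 2) restores an absolute alignment value.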
As can be seen, a weak correlation triggers a switch of resolution, and at the same time the calculation window is positioned around the alignment value of the weak correlation. In the present embodiments, the calculation window is limited to ±15 seconds around the first resolution alignment value and ±10 seconds around the second resolution alignment value. In this way, either the weak correlation is turned into a strong correlation, indicating that the content data pair is correlated, or the weak correlation is confirmed by the multi-resolution calculations and hence the content data pair is not correlated. In some variations of the embodiments, the correlation is calculated for multiple resolutions in a parallel manner, that is, calculations are performed, for example, for both lines 6 and 12 even if a strong correlation is already available at the first resolution. In these variations, all calculations for the content pair are required to indicate a strong correlation before the content pair is assigned a correlated value. Furthermore, in some variations of the embodiments, the switching logic between resolutions may be improved by also employing the histogram of the correlation results. In such variations, pseudo-code 3 is used.

Pseudo-code 3:
1  CorrAlign(m1, m2)
2  {
3    Execute lines 3-7 from pseudo-code 2
4
5    If stdVal < 0.2s
6      outVal = lagVal
7    Else if histRatio > 0.9
8      outVal = histVal
9    Else
10   {
11     Execute lines 12-14 from pseudo-code 2
12     If stdVal < 0.2s
13       outVal = lagVal
14     Else if histRatio > 0.8
15       outVal = histVal
16     Else
17     {
18       Execute lines 19-21 from pseudo-code 2
19       If stdVal < 0.2s
20         outVal = lagVal
21       Else if histRatio > 0.7
22         outVal = histVal
23     }
24   }
25   Return outVal
26 }

where lines 7-8, 14-15, and 21-22 introduce the new switching logic. The new logic uses histogram metrics to assess the correlation in the content data pair. The histogram metrics are histRatio and histVal. The histVal describes the alignment value in the histogram domain, and histRatio describes the ratio of the histVal items with respect to the total number of items available. The alignment value in the histogram domain may be determined, for example, by defining 3 time resolutions that have widths of 0.5 s, 1 s, and 1.5 s, and determining the histogram distribution (of xyVal_bx) for each of the resolutions. The histVal is the histogram item that gets the most hits, and the histRatio is the corresponding ratio value. The final value for pseudo-code 3 is the one that maximizes the ratio over all 3 time resolution widths.
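The histogram metrics histVal and histRatio can be sketched as follows. The binning scheme (bin placement and the use of the bin centre as histVal) is an assumption, since the text only fixes the bin widths.

```python
import numpy as np

def hist_metrics(xy_vals, width):
    """Bin per-element alignment values into bins of the given width
    (e.g. 0.5 s, 1 s or 1.5 s) and return (histVal, histRatio): the
    centre of the most popular bin and its share of all values."""
    lo, hi = min(xy_vals), max(xy_vals)
    n_bins = int((hi - lo) / width) + 1
    counts, edges = np.histogram(xy_vals, bins=n_bins,
                                 range=(lo, lo + n_bins * width))
    best = int(np.argmax(counts))
    return edges[best] + width / 2.0, counts[best] / len(xy_vals)
```

Running this for each of the three widths and keeping the result with the largest histRatio mirrors the "maximizes the ratio over all 3 time resolution widths" rule of pseudo-code 3.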
The final step is then to determine the relative time differences between the multi-user content data in step 6.5 of Figure 6. For this purpose, an alignment matrix is calculated for each content pair according to:

corrMatrix_m,n = CorrAlign(m, n),  0 ≤ m < M, 0 ≤ n < M    (15)

where M is the number of content data items to be aligned and corrMatrix holds the alignment value for the m'th and n'th content data pair, determined according to pseudo-code 2. Next, the alignment values of different data pairs are adjusted so that a common reference index is used for all pairs. This adjustment process is as follows:

IV_i,j = invalid_value,  0 ≤ i < M, 0 ≤ j < M

IV_i,j = corrMatrix_i,j − corrMatrix_i,refIdx,  0 ≤ i < M, 0 ≤ j < M    (16)
In summary, the matrix IV is first initialised to default values and then the difference of the matrix entries at indices (i, j) and (i, refIdx) is calculated where valid.
Equation (16) changes the reference to matrix index (i, refIdx) for each matrix element. Equation (16) is determined only when:

corrMatrix_i,j != not_correlated
corrMatrix_i,refIdx != not_correlated
Next, the mean alignment value is calculated for each column in the alignment matrix according to:

sum_mOut_refIdx,i = Σ_{j=0}^{M−1} { IV_j,i, if IV_j,i != invalid_value; 0, otherwise }

count_mOut_refIdx,i = Σ_{j=0}^{M−1} { 1, if IV_j,i != invalid_value; 0, otherwise }    (17)

mOut_refIdx,i = { sum_mOut_refIdx,i / count_mOut_refIdx,i, if count_mOut_refIdx,i > 0; invalid_value, otherwise }

mOut_refIdx,i = mOut_refIdx,i − mOut_refIdx,0,  if mOut_refIdx,i != invalid_value    (18)
In summary, a mean value is calculated for each index (refIdx, i), where only valid matrix entries are taken into account. The final output for index (refIdx, i) is always the difference with respect to the first column in the corresponding row of the matrix mOut.
Equation (18) is repeated for 0 ≤ i < M. Furthermore, Equations (16)-(18) are repeated for 0 ≤ refIdx < M.
Next, for each content data item, the final alignment is calculated from the mean values according to:

sum_alignOut_i = Σ_{refIdx=0}^{M−1} { mOut_refIdx,i, if mOut_refIdx,i != invalid_value; 0, otherwise }

count_alignOut_i = Σ_{refIdx=0}^{M−1} { 1, if mOut_refIdx,i != invalid_value; 0, otherwise }

alignOut_i = { sum_alignOut_i / count_alignOut_i, if count_alignOut_i > 0; invalid_value, otherwise }    (19)

In summary, the final alignment is the mean value of the previously calculated mean values for the different reference matrix entries.
Equation (19) is repeated for 0 ≤ i < M. It may be advantageous to position the alignment values such that they are with respect to the content that appears first in the timeline. For this purpose, the minimum alignment value is determined and all the values are adjusted with respect to it. In this way, the content that appears first in the timeline is assigned alignment value zero and the rest of the content is delayed with respect to it. The re-positioning is calculated according to:

minAlign = min(alignOut_i),  alignOut_i != invalid_value

alignOut_i = alignOut_i − minAlign,  0 ≤ i < M, alignOut_i != invalid_value    (20)

where min() returns the minimum value of the specified vector.
In summary, the alignment value for each vector element is re-positioned as a difference of the corresponding matrix element and the minimum value of the alignment vector. Repositioning is determined only for valid vector entries.
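Equations (16)-(20) can be sketched end-to-end as follows. In this hedged NumPy illustration, NaN stands in for both invalid_value and not_correlated, so that nanmean naturally skips invalid entries, and corrMatrix[m, n] is assumed to hold the lag of content n relative to content m.

```python
import numpy as np

def final_alignment(corrMatrix):
    """Turn the MxM pairwise alignment matrix into one offset per
    content item, referenced to the item that appears first in the
    timeline. NaN marks invalid / uncorrelated entries."""
    M = corrMatrix.shape[0]
    mOut = np.full((M, M), np.nan)
    for refIdx in range(M):
        IV = corrMatrix - corrMatrix[:, [refIdx]]  # Equation (16)
        mOut[refIdx] = np.nanmean(IV, axis=0)      # Equations (17)-(18): column means
        mOut[refIdx] -= mOut[refIdx, 0]            # reference the first column
    alignOut = np.nanmean(mOut, axis=0)            # Equation (19)
    return alignOut - np.nanmin(alignOut)          # Equation (20): re-position

# Pairwise lags consistent with true offsets [0, 1, 3] seconds
d = np.array([0.0, 1.0, 3.0])
corrMatrix = d[None, :] - d[:, None]
offsets = final_alignment(corrMatrix)
```

With a fully consistent matrix the averaging is redundant, but when some pairs are uncorrelated (NaN) the per-reference means let the remaining pairs fill in the missing offsets.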
The multi-user content is now in synchronization according to the alignOut values, and the content can be jointly processed for various rendering and analysis purposes.
An advantage achieved by the above-described embodiments is that they do not require any special timecodes, clappers or any other special preparations for the content alignment. The above-described embodiments, as well as other embodiments operating on the same principles, can operate at low computational complexity, thereby enabling tens of content items to be aligned simultaneously.
In variations of these embodiments, only a certain portion of the basis vectors may be considered in the alignment at a time. This may be particularly advantageous for content items which have different durations, for example if content A has a duration of 1 min and content B has a duration of 10 min. In these variations, the processing flow is as per pseudo-code 4 below.
The A and B content are both split into smaller-duration segments as defined by refDuration, dstDuration, refAdvance, and dstAdvance. These small-duration segments are aligned until alignment between the segments is found or all segments have been processed. The switch to smaller-duration segments takes place if alignment is not found using the basic alignment setup with the original durations.
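Pseudo-code 4 itself is not reproduced here; the segment-splitting behaviour it describes can be sketched as follows, where the parameter names come from the text and align() is an assumed helper that returns a lag on success and None when no strong correlation is found.

```python
def align_with_segments(ref, dst, align, refDuration, dstDuration,
                        refAdvance, dstAdvance):
    """Slide fixed-duration windows over both contents until the
    alignment helper reports success; returns (ref_offset, dst_offset,
    lag) for the first matching segment pair, or None."""
    r = 0
    while r + refDuration <= len(ref):
        d = 0
        while d + dstDuration <= len(dst):
            lag = align(ref[r:r + refDuration], dst[d:d + dstDuration])
            if lag is not None:
                return r, d, lag
            d += dstAdvance
        r += refAdvance
    return None
```

The returned segment offsets must be added to the segment-level lag to recover the alignment between the full-duration contents.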
Although the above embodiments relate to series of time-varying data that represent audio, the scope of the invention is not limited to this. For instance, the invention is also applicable to processing video and other such time-varying series of data, including static images, where spatial resolutions (width and height) can be considered to be time-varying.

Claims

1. A method comprising:
receiving basis vectors relating to each of at least three series of time-varying feature data;
performing multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series; and
aligning each of the at least three series of time-varying feature data.
2. A method as claimed in claim 1, wherein receiving basis vectors relating to each of at least three series of time-varying feature data comprises receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and wherein performing multiple correlations comprises performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
3. A method as claimed in claim 2, wherein performing a correlation for the first pair of said at least three series at at least the two different resolutions comprises performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
4. A method as claimed in claim 3, wherein determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion comprises calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
5. A method as claimed in claim 3 or claim 4, wherein performing the second correlation for the first pair of said at least three series at the second resolution comprises performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
6. A method as claimed in claim 2, wherein performing multiple correlations comprises performing the first correlation for the first pair of said at least three series at the first resolution and also performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
7. A method as claimed in claim 6, comprising assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
8. A method as claimed in any of claims 2 to 7, comprising receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and wherein performing multiple correlations for the first pair comprises performing a third correlation at the third resolution.
9. A method as claimed in any preceding claim, comprising selecting a series of time-varying data as a reference series and calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
10. Apparatus comprising:
means for receiving basis vectors relating to each of at least three series of time-varying feature data;
means for performing multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series; and
means for aligning each of the at least three series of time-varying feature data.
11. Apparatus as claimed in claim 10, wherein the means for receiving basis vectors relating to each of at least three series of time-varying feature data comprises means for receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and wherein the means for performing multiple correlations comprises means for performing a first correlation for a first pair of said at least three series at the first resolution and also means for performing a second correlation for the first pair of said at least three series at the second resolution.
12. Apparatus as claimed in claim 11, wherein the means for performing a correlation for the first pair of said at least three series at at least the two different resolutions comprises means for performing the first correlation for the first pair of said at least three series at a first resolution, means for determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and means for performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
13. Apparatus as claimed in claim 12, wherein the means for determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion comprises means for calculating a histogram metric from the first correlation and means for comparing the histogram metric to a threshold.
14. Apparatus as claimed in claim 12 or claim 13, wherein the means for performing the second correlation for the first pair of said at least three series at the second resolution comprises means for performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
15. Apparatus as claimed in claim 11, wherein the means for performing multiple correlations comprises means for performing the first correlation for the first pair of said at least three series at the first resolution and also means for performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
16. Apparatus as claimed in claim 15, comprising means for assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
17. Apparatus as claimed in any of claims 10 to 16, comprising means for receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and wherein the means for performing multiple correlations for the first pair comprises means for performing a third correlation at the third resolution.
18. Apparatus as claimed in any of claims 10 to 17, comprising means for selecting a series of time-varying data as a reference series and means for calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
19. Apparatus as claimed in any of claims 10 to 18, comprising means for calculating the basis vectors relating to the at least three series of time-varying feature data.
20. Apparatus as claimed in any of claims 10 to 19, wherein the apparatus is at least one server.
21. Apparatus as claimed in any of claims 10 to 19, wherein the apparatus is at least one server and a data-to-audio transducing device.
22. Apparatus as claimed in any of claims 10 to 19, wherein the apparatus is at least one server, plural audio-to-data transducing devices and a data-to-audio transducing device.
23. Apparatus as claimed in any of claims 10 to 19, wherein the apparatus is a system comprising at least one server and plural audio-to-data transducing devices.
24. A computer program comprising instructions that when executed by a computer apparatus control it to perform the method of any of claims 1 to 9.
25. A non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform a method comprising:
receiving basis vectors relating to each of at least three series of time-varying feature data;
performing multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series; and
aligning each of the at least three series of time-varying feature data.
26. A non-transitory computer-readable storage medium as claimed in claim 25 having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform receiving basis vectors relating to each of at least three series of time-varying feature data by receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and to perform performing multiple correlations by performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
27. A non-transitory computer-readable storage medium as claimed in claim 26 having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform performing a correlation for the first pair of said at least three series at at least the two different resolutions by performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
28. A non-transitory computer-readable storage medium as claimed in claim 27 having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion by calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
29. A non-transitory computer-readable storage medium as claimed in claim 27 having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform performing the second correlation for the first pair of said at least three series at the second resolution by performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
30. A non-transitory computer-readable storage medium as claimed in claim 26 having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform performing multiple correlations by performing the first correlation for the first pair of said at least three series at the first resolution and also performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
31. A non-transitory computer-readable storage medium as claimed in claim 30 having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform a method additionally comprising assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
32. A non-transitory computer-readable storage medium as claimed in claim 26 having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform a method additionally comprising receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and to perform performing multiple correlations for the first pair by performing a third correlation at the third resolution.
33. A non-transitory computer-readable storage medium as claimed in claim 25 having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform a method additionally comprising selecting a series of time-varying data as a reference series and calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
34. Apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:
to receive basis vectors relating to each of at least three series of time- varying feature data;
to perform multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series; and
to align each of the at least three series of time-varying feature data.
35. Apparatus as claimed in claim 34 wherein the computer-readable code when executed controls the at least one processor to receive basis vectors relating to each of at least three series of time-varying feature data by receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and to perform performing multiple correlations by performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
36. Apparatus as claimed in claim 35 wherein the computer-readable code when executed controls the at least one processor to perform a correlation for the first pair of said at least three series at at least the two different resolutions by performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and by performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
37. Apparatus as claimed in claim 36 wherein the computer-readable code when executed controls the at least one processor to determine whether a metric of the degree of correlation of the first correlation meets a predetermined criterion by calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
38. Apparatus as claimed in claim 36 wherein the computer-readable code when executed controls the at least one processor to perform performing the second correlation for the first pair of said at least three series at the second resolution by performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
39. Apparatus as claimed in claim 35 wherein the computer-readable code when executed controls the at least one processor to perform performing multiple correlations by performing the first correlation for the first pair of said at least three series at the first resolution and also to perform the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
40. Apparatus as claimed in claim 39 wherein the computer-readable code when executed controls the at least one processor to assign a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
41. Apparatus as claimed in claim 35 wherein the computer-readable code when executed controls the at least one processor to receive basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and to perform performing multiple correlations for the first pair by performing a third correlation at the third resolution.
42. Apparatus as claimed in claim 34 wherein the computer-readable code when executed controls the at least one processor to select a series of time-varying data as a reference series and calculate an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
EP11875048.8A 2011-10-31 2011-10-31 Audio scene rendering by aligning series of time-varying feature data Withdrawn EP2774391A4 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2011/054832 WO2013064860A1 (en) 2011-10-31 2011-10-31 Audio scene rendering by aligning series of time-varying feature data

Publications (2)

Publication Number Publication Date
EP2774391A1 true EP2774391A1 (en) 2014-09-10
EP2774391A4 EP2774391A4 (en) 2016-01-20

Family

ID=48191434

Family Applications (1)

Application Number Title Priority Date Filing Date
EP11875048.8A Withdrawn EP2774391A4 (en) 2011-10-31 2011-10-31 Audio scene rendering by aligning series of time-varying feature data

Country Status (2)

Country Link
EP (1) EP2774391A4 (en)
WO (1) WO2013064860A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015092492A1 (en) 2013-12-20 2015-06-25 Nokia Technologies Oy Audio information processing
KR102633077B1 (en) * 2015-06-24 2024-02-05 소니그룹주식회사 Device and method for processing sound, and recording medium
WO2019002179A1 (en) * 2017-06-27 2019-01-03 Dolby International Ab Hybrid audio signal synchronization based on cross-correlation and attack analysis
CN110741435B (en) 2017-06-27 2021-04-27 杜比国际公司 Method, system, and medium for audio signal processing

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB8826927D0 (en) * 1988-11-17 1988-12-21 British Broadcasting Corp Aligning two audio signals in time for editing
US7660424B2 (en) * 2001-02-07 2010-02-09 Dolby Laboratories Licensing Corporation Audio channel spatial translation
GB2391322B (en) * 2002-07-31 2005-12-14 British Broadcasting Corp Signal comparison method and apparatus
US7948557B2 (en) * 2005-06-22 2011-05-24 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for generating a control signal for a film event system
WO2010125228A1 (en) * 2009-04-30 2010-11-04 Nokia Corporation Encoding of multiview audio signals
US9008321B2 (en) * 2009-06-08 2015-04-14 Nokia Corporation Audio processing
KR101612704B1 (en) * 2009-10-30 2016-04-18 삼성전자 주식회사 Apparatus and Method To Track Position For Multiple Sound Source
US20130226324A1 (en) * 2010-09-27 2013-08-29 Nokia Corporation Audio scene apparatuses and methods
EP2666162A1 (en) * 2011-01-20 2013-11-27 Nokia Corp. An audio alignment apparatus

Also Published As

Publication number Publication date
WO2013064860A1 (en) 2013-05-10
EP2774391A4 (en) 2016-01-20


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20140502

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: NOKIA TECHNOLOGIES OY

RA4 Supplementary search report drawn up and despatched (corrected)

Effective date: 20151221

RIC1 Information provided on ipc code assigned before grant

Ipc: H04S 3/02 20060101ALI20151215BHEP

Ipc: H04R 29/00 20060101AFI20151215BHEP

Ipc: H04W 56/00 20090101ALI20151215BHEP

Ipc: G11B 27/10 20060101ALI20151215BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20160722