EP2774391A1 - Audio scene rendering by aligning series of time-varying feature data - Google Patents
- Publication number
- EP2774391A1 (application EP11875048.8A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- correlation
- series
- pair
- resolution
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S3/00—Systems employing more than two channels, e.g. quadraphonic
- H04S3/002—Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
- H04S3/004—For headphones
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/03—Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
Definitions
- This invention relates to aligning series of time-varying feature data.
- Captured signals are transmitted and stored at a rendering location, from where an end user can select a listening point based on their preference from the reconstructed audio space.
- A first aspect of the invention provides a method comprising: receiving basis vectors relating to each of at least three series of time-varying feature data; and performing multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series.
- Receiving basis vectors relating to each of at least three series of time-varying feature data may comprise receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and wherein performing multiple correlations may comprise performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
- performing a correlation for the first pair of said at least three series at at least the two different resolutions may comprise performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
- Determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion may comprise calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
- Performing the second correlation for the first pair of said at least three series at the second resolution may comprise performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
- Performing multiple correlations may comprise performing the first correlation for the first pair of said at least three series at the first resolution and also performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
- This method may comprise assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
- the method may comprise receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and wherein performing multiple correlations for the first pair may comprise performing a third correlation at the third resolution.
- the method may comprise selecting a series of time-varying data as a reference series and calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
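By way of illustration only, a minimal Python sketch of the structure recited above might look as follows. None of the names come from the claims, and the use of a normalized cross-correlation peak and of series 0 as the common reference are assumptions:

```python
import numpy as np

def correlate_pair(bx, by):
    """Peak of the normalized cross-correlation of two basis vectors,
    returned as (lag, score); one such correlation per pair of series."""
    xc, yc = bx - bx.mean(), by - by.mean()
    c = np.correlate(xc, yc, mode="full")
    c = c / (np.linalg.norm(xc) * np.linalg.norm(yc) + 1e-12)
    k = int(np.argmax(np.abs(c)))
    return k - (len(by) - 1), float(np.abs(c[k]))

def link_series(basis_vectors):
    """Perform multiple correlations, each for a pair of series.
    Correlating every series against series 0 links all of the (at
    least three) series together through the common reference."""
    assert len(basis_vectors) >= 3
    return {(0, i): correlate_pair(basis_vectors[0], basis_vectors[i])
            for i in range(1, len(basis_vectors))}
```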
- A second aspect of the invention provides apparatus comprising: means for receiving basis vectors relating to each of at least three series of time-varying feature data; and means for performing multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series.
- the means for receiving basis vectors relating to each of at least three series of time-varying feature data may comprise means for receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and wherein the means for performing multiple correlations may comprise means for performing a first correlation for a first pair of said at least three series at the first resolution and also means for performing a second correlation for the first pair of said at least three series at the second resolution.
- the means for performing a correlation for the first pair of said at least three series at at least the two different resolutions may comprise means for performing the first correlation for the first pair of said at least three series at a first resolution, means for determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and means for performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
- the means for determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion may comprise means for calculating a histogram metric from the first correlation and means for comparing the histogram metric to a threshold.
- the means for performing the second correlation for the first pair of said at least three series at the second resolution may comprise means for performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
- the means for performing multiple correlations may comprise means for performing the first correlation for the first pair of said at least three series at the first resolution and also means for performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
- the apparatus may comprise means for assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
- the apparatus may comprise means for receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and wherein the means for performing multiple correlations for the first pair may comprise means for performing a third correlation at the third resolution.
- the apparatus may comprise means for selecting a series of time-varying data as a reference series and means for calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
- the apparatus may comprise means for calculating the basis vectors relating to the at least three series of time-varying feature data.
- the apparatus may include at least one server.
- the apparatus may include at least one server and a data-to-audio transducing device.
- the apparatus may include at least one server, plural audio-to-data transducing devices and a data-to-audio transducing device.
- the apparatus may be a system comprising at least one server and plural audio-to-data transducing devices.
- A third aspect of the invention provides a computer program comprising instructions that when executed by a computer apparatus control it to perform a method as recited above.
- A fourth aspect of the invention provides a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by computing apparatus, causes the computing apparatus to perform a method comprising: receiving basis vectors relating to each of at least three series of time-varying feature data; and performing multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series.
- the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform receiving basis vectors relating to each of at least three series of time-varying feature data by receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and to perform performing multiple correlations by performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
- the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform performing a correlation for the first pair of said at least three series at at least the two different resolutions by performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
- the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion by calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
- the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform performing the second correlation for the first pair of said at least three series at the second resolution by performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
- the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform performing multiple correlations by performing the first correlation for the first pair of said at least three series at the first resolution and also performing the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
- the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform a method additionally comprising assigning a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
- the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform a method additionally comprising receiving basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and to perform performing multiple correlations for the first pair by performing a third correlation at the third resolution.
- the computer-readable code when executed by computing apparatus, may cause the computing apparatus to perform a method additionally comprising selecting a series of time-varying data as a reference series and calculating an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
- A fifth aspect of the invention provides apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to receive basis vectors relating to each of at least three series of time-varying feature data; and to perform multiple correlations, each correlation being for a pair of said at least three series, the correlations linking together each of the at least three series.
- the computer-readable code when executed may control the at least one processor to receive basis vectors relating to each of at least three series of time-varying feature data by receiving basis vectors relating to each of said at least three series of time-varying feature data at at least first and second different resolutions, and to perform performing multiple correlations by performing a first correlation for a first pair of said at least three series at the first resolution and also performing a second correlation for the first pair of said at least three series at the second resolution.
- the computer-readable code when executed may control the at least one processor to perform a correlation for the first pair of said at least three series at at least the two different resolutions by performing the first correlation for the first pair of said at least three series at a first resolution, determining whether a metric of the degree of correlation of the first correlation meets a predetermined criterion, and by performing a second correlation for the first pair of said at least three series at a second resolution, higher than the first resolution, only if the metric of the degree of correlation does not meet a predetermined criterion.
- the computer-readable code when executed may control the at least one processor to determine whether a metric of the degree of correlation of the first correlation meets a predetermined criterion by calculating a histogram metric from the first correlation and comparing the histogram metric to a threshold.
- the computer-readable code when executed may control the at least one processor to perform performing the second correlation for the first pair of said at least three series at the second resolution by performing correlation within a calculation window that frames an alignment value of the result of the first correlation.
- the computer-readable code when executed may control the at least one processor to perform performing multiple correlations by performing the first correlation for the first pair of said at least three series at the first resolution and also to perform the second correlation for the first pair of said at least three series at the second resolution without conducting any determination as to whether a metric of degree of correlation meets a predetermined criterion between performing the first and second correlations.
- the computer-readable code when executed may control the at least one processor to assign a correlated value to correlation of the first pair only if degrees of correlation of both the first and second correlations each meet a respective predetermined criterion.
- the computer-readable code when executed may control the at least one processor to receive basis vectors relating to each of the at least three series of time-varying feature data also at a third resolution, different to the first and second resolutions, and to perform performing multiple correlations for the first pair by performing a third correlation at the third resolution.
- the computer-readable code when executed may control the at least one processor to select a series of time-varying data as a reference series and calculate an alignment value for each of the other series of time-varying data that is indicative of a delay between the respective series and the reference series.
- Figure 1 shows an audio scene with N capturing devices;
- Figure 2 is a block diagram of an end-to-end system embodying aspects of the invention;
- Figure 3 shows details of some aspects of the Figure 2 system;
- Figure 4 shows a high level flowchart illustrating operation of aspects of embodiments of the invention;
- Figure 5 shows a flowchart of the basis vector calculation step of Figure 4; and
- Figure 6 shows a flowchart of the multi-user content alignment step of Figure 4.
- Figures 1 and 2 illustrate a system in which embodiments of the invention can be implemented.
- A system 10 consists of N devices 11 that are arbitrarily positioned within the audio space to record an audio scene. In these Figures, four areas of audio activity 12 are shown. The captured signals are then transmitted (or alternatively stored for later consumption) so that an end user can select a listening point 13 based on his or her preference from the reconstructed audio space.
- A rendering part then provides one or more downmixed signals corresponding to the selected listening point.
- The microphones of the devices 11 are shown as having a directional beam, but embodiments of the invention may use microphones having any form of suitable beam. Furthermore, the microphones need not all employ a similar beam; microphones with different beams may be used.
- The downmixed signal(s) may be a mono, stereo or binaural signal, or may consist of more than two channels, for instance four or six channels.
- Each recording device 11 records the audio scene and uploads/upstreams (either in real time or non-real time) the recorded content to an audio server 14 via a channel 15.
- The upload/upstream process also provides positioning information about where the audio is being recorded, together with the recording direction/orientation.
- A recording device 11 may record one or more audio signals. If a recording device 11 records (and provides) more than one signal, the directions/orientations of these signals may differ.
- The position information may be obtained, for example, using GPS coordinates, Cell-ID or A-GPS. Recording direction/orientation may be obtained, for example, using compass, accelerometer or gyroscope information.
- The server 14 receives each uploaded signal and keeps track of the positions and the associated directions/orientations.
- The audio scene server 14 may provide high level coordinates, which correspond to locations where user uploaded/upstreamed content is available for listening, to an end user device 17. These high level coordinates may be provided, for example, as a map to the end user device 17 for selection of the listening position.
- The end user device 17, or e.g. an application running on the end user device, is responsible for determining the listening position and for sending this information to the audio scene server 14.
- the audio scene server 14 transmits the downmixed signal corresponding to the specified location to the end user device 17.
- Alternatively, the audio server 14 may provide a selected set of downmixed signals that correspond to the listening point, and the end user selects, via the end user device 17, the downmixed signal to which he or she wants to listen.
- a media format encapsulating the signals or a set of signals may be formed and transmitted to the end user devices 17.
- The downmixed signals here can be audio-only content, or content where the audio is accompanied by video content.
- Embodiments of this specification relate to immersive person-to-person communication, including also video and possibly synthetic content.
- Maturing 3D audio-visual rendering and capture technology facilitates a new dimension of natural communication.
- An 'all-3D' experience is created that brings a rich experience to users and brings opportunities to new businesses through novel product categories.
- For content from multiple users to be jointly processed, the local device clocks of the different users would normally need to agree to within a few tens of milliseconds.
- Figure 3 shows a schematic block diagram of a system 10 according to embodiments of the invention. Reference numerals are retained from Figures 1 and 2 for like elements.
- multiple end user recording devices 11 are connected to a server 14 by a first transmission channel or network 15.
- the user devices 11 are used for detecting an audio scene for recording.
- the user devices 11 may record audio and store it locally for uploading later. Alternatively, they may transmit the audio in real time, in which case they may or may not also store a local copy.
- the user devices 11 are referred to as recording devices 11 because they record audio, although they may not permanently store the audio locally.
- the server 14 is connected to listening user devices 17 via a second transmission channel 18.
- The first and second channels 15 and 18 may be the same channel or network, different channels or networks, or different channels within a single network.
- the listening user devices 17 may also be termed consuming devices on the basis that audio content is consumed at those devices 17.
- Each of the recording devices 11 is a communications device equipped with a microphone.
- Each device 11 may for instance be a mobile phone, smartphone, laptop computer, tablet computer, PDA, personal music player, video camera, stills camera or dedicated audio recording device, for instance a dictaphone or the like.
- the recording device 11 includes a number of components including a processor 20 and a memory 21.
- the processor 20 and the memory 21 are connected to the outside world by an interface 22.
- At least one microphone 23 is connected to the processor 20.
- the microphone 23 may be directional. If there are multiple microphones 23, they may have different orientations of sensitivity.
- The memory 21 may be a non-volatile memory such as read-only memory (ROM), a hard disk drive (HDD) or a solid-state drive (SSD).
- the memory 21 stores, amongst other things, an operating system 24 and at least one software application 25.
- the memory 21 is used for the temporary storage of data as well as permanent storage. Alternatively, there may be separate memories for temporary and non-temporary storage.
- The operating system 24 may contain code which, when executed by the processor 20 in conjunction with the memory 21, controls operation of each of the hardware components of the device 11.
- the one or more software applications 25 and the operating system 24 together cause the processor 20 to operate in such a way as to achieve required functions.
- the functions include processing audio data, and may include recording it. As is explained below, the functions may also include processing audio data to derive basis vectors therefrom.
- the audio server 14 includes a processor 30, a memory 31 and an interface 32. Within the memory 31 are stored an operating system 34 and one or more software applications 35.
- The memory 31 may be a non-volatile memory such as read-only memory (ROM), a hard disk drive (HDD) or a solid-state drive (SSD).
- the memory 31 stores, amongst other things, an operating system 34 and at least one software application 35.
- the memory 31 is used for the temporary storage of data as well as permanent storage. Alternatively, there may be separate memories for temporary and non-temporary storage.
- The operating system 34 may contain code which, when executed by the processor 30 in conjunction with the memory 31, controls operation of each of the hardware components of the server 14.
- the one or more software applications 35 and the operating system 34 together cause the processor 30 to operate in such a way as to achieve required functions.
- the functions may include processing received audio data to derive basis vectors therefrom.
- the functions may also include processing basis vectors to derive alignment information therefrom.
- the functions may also include processing alignment information and audio to render audio therefrom.
- In the listening user device 17, a processor 40 is connected to a memory 41 and to an interface 42.
- An operating system 44 is stored in the memory, along with one or more software applications 45.
- The memory 41 may be a non-volatile memory such as read-only memory (ROM), a hard disk drive (HDD) or a solid-state drive (SSD).
- the memory 41 stores, amongst other things, an operating system 44 and at least one software application 45.
- the memory 41 is used for the temporary storage of data as well as permanent storage. Alternatively, there may be separate memories for temporary and non-temporary storage.
- The operating system 44 may contain code which, when executed by the processor 40 in conjunction with the memory 41, controls operation of each of the hardware components of the listening user device 17.
- the one or more software applications 45 and the operating system 44 together cause the processor 40 to operate in such a way as to achieve required functions.
- the functions may include processing audio data to derive basis vectors therefrom.
- the functions may also include processing basis vectors to derive alignment information therefrom.
- the functions may also include processing alignment information and audio to render audio therefrom.
- Each of the user devices 11, the audio server 14 and the listening user devices 17 operates according to the operating system and software applications that are stored in its respective memory. Where in the following one of these devices is said to achieve a certain operation or provide a certain function, this is achieved by the software and/or the operating system stored in the memories unless otherwise stated.
- Audio recorded by a recording device 11 is a time-varying series of data.
- The audio may be represented in raw form, as samples. Alternatively, it may be represented in a non-compressed format or a compressed format, for instance as provided by a codec.
- The choice of codec for a particular implementation of the system may depend on a number of factors. Suitable codecs may include codecs that operate according to the Audio Interchange File Format, pulse-density modulation, pulse-amplitude modulation, Direct Stream Transfer or Free Lossless Audio Codec (FLAC), or any of a number of other coding principles. Coded audio represents a time-varying series of data in some form.
- Figure 4 is a flowchart illustrating steps that occur in some part of the system 10 of Figure 3.
- In step 4.1 of Figure 4, basis vectors are calculated for each series of content data.
- In step 4.2, the content from the various users, i.e. the different series of data, is aligned using the calculated basis vectors.
- In step 4.3, the multi-user content is rendered for end user consumption.
- The rendering part may include various processing, such as audio mixing, view switching, or joint processing of multi-user content. The exact details of the rendering part are outside the scope of this specification, but what is common to all of the available rendering processing methods is that the multi-user content is assumed to be in synchronization.
- the steps of Figure 4 may be implemented in various ways in the end-to-end context.
- In a first implementation, step 4.1 (creating basis vectors) is performed in the recording devices 11, and steps 4.2 (aligning content) and 4.3 (rendering content) are performed in the server 14.
- basis vectors are transmitted from the recording devices 11 to the server 14 through the first channel 15 and the rendered content is transmitted from the server 14 to the consuming devices 17 through the second channel 18.
- In a second implementation, the basis vector calculation step 4.1 is performed in the recording devices 11, the aligning step 4.2 is performed in the audio server 14 and the rendering step 4.3 is performed in the consuming devices 17.
- basis vectors are transmitted from the recording devices 11 to the audio server 14 through the channel 15 and alignment results are transmitted from the server 14 to the consuming devices 17.
- In a third implementation, basis vectors are created in step 4.1 in the recording devices 11, and the aligning step 4.2 and the rendering step 4.3 are performed in the consuming devices 17.
- basis vectors are transmitted from the recording devices 11 to the consuming devices 17, and this transmission may or may not be via the audio server 14.
- In a fourth implementation, each of the basis vector calculation step 4.1, the alignment step 4.2 and the rendering step 4.3 is performed by the audio server 14.
- raw or compressed audio data is received from the recording devices 11 and rendered content is transmitted from the server 14 to the consuming device 17.
- In a fifth implementation, the basis vector calculating step 4.1 and the alignment step 4.2 are performed at the audio server 14 and the rendering step 4.3 is performed at the end user device 17.
- raw or compressed audio data is transmitted from the recording devices 11 to the audio server 14 and alignment results are transmitted from the audio server 14 to the consuming device 17.
- In a sixth implementation, the basis vector calculating step 4.1 is performed at the audio server 14 and the alignment step 4.2 and the rendering step 4.3 are performed at the consuming device 17.
- basis vectors are transmitted from the audio server to the receiving device 17 for processing.
- In a seventh implementation, the basis vector calculation step 4.1, the alignment step 4.2 and the rendering step 4.3 are all performed at the consuming device 17.
- raw or coded audio data is transmitted from the recording devices 11 to the consuming device 17, optionally via the server 14, for processing thereat.
- alignment results may be transmitted back to the server 14 for storage and for possible later use, and/or for use by other consuming devices 17.
- Even in implementations in which basis vectors are calculated by the recording devices 11, raw or coded audio content needs to be transmitted from the recording devices 11 to the server 14 and/or the consuming devices 17, since the audio itself is needed for rendering.
- Where the rendering step 4.3 is performed at the audio server 14, raw or coded audio directly from one of the recording devices 11 is not usually provided to any of the consuming devices 17.
- In step 5.1 of Figure 5, feature data is calculated from the content data.
- The feature data is then converted in step 5.2 to basis vectors that describe the content data in the feature domain. Specific examples of performing steps 5.1 and 5.2 will now be explained in detail.
- The feature data is calculated in step 5.1 using a transform operator as follows.
- Each recording source audio signal is first transformed to a frequency domain
- A time-frequency transform (TF) operator is applied to each signal segment according to:

  $$X_m(j, n) = \mathrm{TF}\left(x_m, j, n\right) = \sum_{l=0}^{N-1} \mathrm{win}(l)\, x_m(n \cdot T + l)\, e^{-i 2 \pi j l / N} \qquad (1)$$

  where win() is an analysis window, N is the frame size, T is the advance between successive frames, j is the frequency bin index, n is the frame index and m indexes the recording source. Equation (1) is calculated on a frame-by-frame basis, where a frame is of short duration, for example 20 ms (typically less than 50 ms).
- Thus, each series of audio data is converted to another domain, here the frequency domain: a transform operator, for instance a discrete Fourier transform, is applied to each signal segment.
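As an illustration, a minimal sketch of this frame-wise transform in Python, assuming a Hann analysis window and a real-input DFT; the frame and hop sizes are illustrative values consistent with the short frame durations mentioned above:

```python
import numpy as np

def tf_features(x, n_fft=1024, hop=480, fs=48000):
    """Frame-wise magnitude spectra |X_m(j, n)| in the spirit of
    Equation (1).  At fs = 48 kHz, hop = 480 gives a 10 ms frame
    advance and n_fft = 1024 covers roughly 21 ms."""
    win = np.hanning(n_fft)
    n_frames = max(0, 1 + (len(x) - n_fft) // hop)
    X = np.empty((n_fft // 2 + 1, n_frames))
    for n in range(n_frames):
        seg = win * x[n * hop : n * hop + n_fft]
        X[:, n] = np.abs(np.fft.rfft(seg))  # keep magnitudes as features
    return X
```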
- Feature data may also be calculated, for example, based on harmonic ratio of the audio signal, low energy ratio, audio beating, or various MPEG-7 defined audio descriptors such as AudioSpectrumSpreadType.
- the basis vectors may be calculated from multiple feature data instances. This can increase robustness in the content alignment operation.
- The feature data is then converted to basis vectors according to pseudo-code 1, in which:
- sRes describes the sampling resolution of the basis vector
- L is the number of time frames present for the signal
- twNext describes the time period each element in the basis vector is covering at each time instant.
- the number of window elements is set to a function of the sampling resolution of the basis vectors and the time resolution of frames.
- the number of elements is then set to a function of the number of time frames present for the series and the number of window elements.
- For each time frame, a start time is set to a function of the time index and the number of window elements, and an end time is set to a function of the start time, the time period each basis vector element covers, and the number of window elements.
- The basis vector value is determined by applying a function to the data between the start time and the end time, and assigning the result to basis vector index t.
- The basis vector value for the sampling period is calculated as the sum of all frequency bins that make up X at each instant k between the start and end times.
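One plausible reading of pseudo-code 1, based on the parameter descriptions above, sketched in Python. sRes, L and twNext come from the text; the exact window arithmetic is an assumption, while the per-element sum over frequency bins follows the description:

```python
import numpy as np

def basis_vector(X, s_res, tw_next, t_res):
    """X: magnitude TF matrix (bins x frames) from Equation (1).
    s_res:   sampling resolution of the basis vector, in seconds
    tw_next: time period covered by each basis vector element, in seconds
    t_res:   time resolution of one frame, in seconds"""
    n_win = max(1, int(round(s_res / t_res)))  # frames per vector element
    L = X.shape[1]                             # number of time frames
    n_elem = L // n_win                        # elements in the basis vector
    b_vec = np.zeros(n_elem)
    for t in range(n_elem):
        start = t * n_win
        end = min(L, start + max(n_win, int(round(tw_next / t_res))))
        b_vec[t] = X[:, start:end].sum()       # sum of all frequency bins
    return b_vec
```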
- Equation (6) is repeated for each element of the basis vector. In the present embodiments, the binIdx for the TF() operator is set to:

  $$\mathrm{binIdx}(k) = \frac{f(k) \cdot 2N}{F_s}$$

  where Fs is the sampling rate of the source signals and f() describes the frequencies to be covered, both in Hz.
- Thus, a frequency bin index, which depends on an integer k, is calculated as a function of the reciprocal of the sampling rate and the frequencies covered, for a period N.
- the basis vectors are calculated for multiple different resolutions.
- the three different resolutions relate to three different time periods.
- The largest time period is a factor greater than the second largest, which is the same factor greater than the smallest: in the present embodiments the largest time period is twice the second largest time period, which is twice the smallest time period, at least approximately.
- Figure 6 shows a flowchart for determining the alignment between the multi-user recorded content data. First, for each content data pair, the correlation of the pair is calculated (step 6.1). Next, correlation metrics are determined for the pair to assess whether the content pair is correlated, and the degree of the correlation, in step 6.2.
- In step 6.3, if the metrics indicate that the pair is correlated but the degree of correlation is not strong enough, the basis vectors are changed to another resolution and the calculations are repeated with the new basis vectors (step 6.4). These steps are repeated until correlation is found or until all of the different resolution basis vectors have been processed. Finally, in step 6.5 of Figure 6, the relative time differences between the multi-user content are determined from the correlation metrics. Next, the elements of Figure 6 are explained in more detail.
- The correlation of the data pair (x, y) for index k is a function of a sign variable and the cross-correlation xC.
- The cross-correlation is a function of the normalized length of the data pair (x, y) at index k and the ratio forming the cross-correlation value.
- The numerator of the cross-correlation value is calculated as the sum of the product of x and y at indices defined by k and the length of the data pair.
- The denominator of the cross-correlation value is the root of the product of the sum of the delayed data vector y squared and the sum of the data vector x squared.
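Taken together, these statements describe a normalized cross-correlation; a plausible form consistent with them (the summation limits and normalization are assumptions) is:

$$xC(k) = \frac{\displaystyle\sum_{n} x(n)\, y(n+k)}{\sqrt{\displaystyle\left(\sum_{n} y(n+k)^{2}\right) \left(\sum_{n} x(n)^{2}\right)}}$$

with the sums running over the normalized length of the data pair at lag k, and with the sign variable recording which of x and y is the delayed vector.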
- The correlation in Equation (9) is calculated twice, in order to determine whether it is the content x that needs to be delayed with respect to the content y, or vice versa, in order to achieve synchronization.
- Equation (12) is repeated for each instance of the basis vector.
- The lagVal describes the average alignment needed for the content pair, based on the multiple instances of the basis vector, and stdVal describes the deviation of the alignment in the basis vector domain.
- A low deviation indicates that the alignment value has converged towards a certain value that most likely corresponds to the time difference between the content pair.
- In other words, the mean lagVal is the reciprocal of the number of alignment values multiplied by the sum of the alignment values over the domain k, and the standard deviation stdVal is the square root of the product of the reciprocal of the number of alignment values and the sum of the squared differences between each alignment value and the mean:

  $$\mathrm{lagVal} = \frac{1}{K} \sum_{k=0}^{K-1} \mathrm{lag}(k), \qquad \mathrm{stdVal} = \sqrt{\frac{1}{K} \sum_{k=0}^{K-1} \left(\mathrm{lag}(k) - \mathrm{lagVal}\right)^2}$$
- In step 6.3 of Figure 6, the correlation metrics are assessed to see whether the content pair is correlated, or whether a new set of basis vectors should be taken into use to test whether a weak correlation can be converted into a strong correlation.
- Start time 1 is zeroed if the average alignment is less than zero, and is set to a value which is a function of the average alignment and the basis vector resolution if the average alignment is greater than or equal to zero.
- The basis vector resolution defines the time period each element in the basis vector covers.
- Start time 2 is zeroed if the average alignment is greater than or equal to zero, and is set to a value which is a function of the average alignment and the basis vector resolution if the average alignment is less than zero.
- End time 1 is formulated in the same way as start time 1, except that instead of being zeroed it is set equal to the length when the average alignment is less than zero.
- End time 2 is formulated in the same way as start time 2, except that instead of being zeroed it is set equal to the length when the average alignment is greater than or equal to zero.
- A weak correlation triggers a switch of resolution, and at the same time the calculation window is positioned around the alignment value of the weak correlation.
- The calculation window is limited to ±15 seconds around the first resolution alignment value and ±10 seconds around the second resolution alignment value. In this way, either the weak correlation is turned into a strong correlation, indicating that the content data pair is correlated, or the weak correlation is confirmed by the multi-resolution calculations and hence the content data pair is not correlated.
- In some embodiment variations, the correlation is calculated for multiple resolutions in a parallel manner; that is, calculations are performed, for example, for both lines 6 and 12 of the pseudo-code even though a strong correlation would already be available for the first resolution. In these embodiment variations, all calculations for the content pair are required to indicate a strong correlation before the content pair is assigned a correlated value.
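A Python sketch of the resolution-switching logic of steps 6.2 to 6.4, assuming the basis vectors for a content pair are supplied coarsest resolution first. The ±15 s and ±10 s window limits follow the text, while the strong-correlation threshold is illustrative:

```python
import numpy as np

def xcorr_lag(x, y, lo=None, hi=None):
    """Best lag of y relative to x by normalized cross-correlation,
    optionally restricted to the lag window [lo, hi]."""
    xc, yc = x - x.mean(), y - y.mean()
    c = np.correlate(xc, yc, mode="full")
    c = c / (np.linalg.norm(xc) * np.linalg.norm(yc) + 1e-12)
    lags = np.arange(-(len(y) - 1), len(x))
    keep = np.ones(len(lags), dtype=bool)
    if lo is not None and hi is not None:
        mask = (lags >= lo) & (lags <= hi)
        if mask.any():
            keep = mask                        # frame the calculation window
    k = int(np.argmax(np.abs(c[keep])))
    return int(lags[keep][k]), float(np.abs(c[keep][k]))

def multires_align(pairs, elem_dur, strong=0.7, win_s=(15.0, 10.0)):
    """pairs[r]: (x, y) basis vectors of one content pair at resolution r,
    coarsest first.  elem_dur[r]: seconds per element at resolution r.
    Returns (correlated, lag_in_seconds)."""
    lo = hi = None
    lag_s = 0.0
    for r, (x, y) in enumerate(pairs):
        lag, score = xcorr_lag(x, y, lo, hi)
        lag_s = lag * elem_dur[r]
        if score >= strong:
            return True, lag_s                 # strong correlation found
        if r + 1 < len(pairs):                 # weak: frame the finer pass
            half = win_s[min(r, len(win_s) - 1)]
            lo = int((lag_s - half) / elem_dur[r + 1])
            hi = int((lag_s + half) / elem_dur[r + 1])
    return False, lag_s                        # weak correlation confirmed
```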
- The switching logic between resolutions may be improved by also employing the histogram of the correlation results. In such variations of the embodiments, pseudo-code 3 is used.
- The new logic uses histogram metrics to assess the correlation in the content data pair.
- The histogram metrics are histRatio and histVal.
- The histVal describes the alignment value in the histogram domain, and histRatio describes the ratio of the histVal items with respect to the total number of items available.
- The alignment value in the histogram domain may be determined, for example, by defining three time resolutions that have widths of 0.5 s, 1 s and 1.5 s, and determining the histogram distribution of the alignment values for each of the resolutions.
- The histVal is the histogram item that gets the most hits, and the histRatio is the corresponding ratio value.
- The final value for pseudo-code 3 is the one that maximizes the ratio over all three time resolution widths.
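A sketch of these histogram metrics, assuming the per-instance alignment values are binned at each of the three widths and the width whose winning bin has the highest hit ratio is kept; the binning details are assumptions:

```python
import numpy as np

def hist_metrics(lag_values, widths_s=(0.5, 1.0, 1.5)):
    """Return histVal (centre of the modal histogram bin) and histRatio
    (its share of all items), maximized over the bin widths."""
    lag_values = np.asarray(lag_values, dtype=float)
    best_ratio, best_val = 0.0, 0.0
    for w in widths_s:
        lo, hi = lag_values.min(), lag_values.max()
        edges = np.arange(lo, hi + w, w)
        if len(edges) < 2:                     # all values in a single bin
            edges = np.array([lo, lo + w])
        counts, edges = np.histogram(lag_values, bins=edges)
        k = int(np.argmax(counts))
        ratio = counts[k] / len(lag_values)
        if ratio > best_ratio:                 # keep the best width
            best_ratio = ratio
            best_val = 0.5 * (edges[k] + edges[k + 1])
    return {"histVal": best_val, "histRatio": best_ratio}
```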
- The final step is then to determine the relative time differences between the multi-user content data, in step 6.5 of Figure 6.
- An M × M matrix corrMatrix, where M is the number of content data to be aligned, holds the alignment value for the mth and nth content data pair as determined according to pseudo-code 2.
- The alignment values of the different data pairs are adjusted so that a common reference index is used for all pairs. This adjustment process is as follows.
- W(i, j) = invalid_value, 0 ≤ i < M, 0 ≤ j < M
- The matrix W is first initialized to default values, and then the difference of the matrix entries at indices (i, j) and (i, refIdx) is calculated where valid, that is, where corrMatrix(i, refIdx) ≠ not_correlated.
- The mean alignment value is then calculated for each column in the alignment matrix, for index (refIdx, i), where only valid matrix entries are taken into account.
- The final output for index (refIdx, i) is always the difference with respect to the first column in the corresponding row of the matrix mOut.
- Equation (18) is repeated for 0 ≤ i < M. Furthermore, Equations (16)-(18) are repeated for 0 ≤ refIdx < M.
- The final alignment is the mean value of the previously calculated mean values for the different reference matrix entries.
- Equation (19) is repeated for 0 ≤ i < M. It may be advantageous to position the alignment values such that they are with respect to the content that appears first in the timeline.
- The minimum alignment value is determined and all the values are adjusted with respect to this. In this way, the content that appears first in the timeline is assigned alignment value zero and the rest of the content is delayed with respect to it.
- The alignment value for each vector element is re-positioned as the difference between the corresponding matrix element and the minimum value of the alignment vector. Repositioning is determined only for valid vector entries.
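A Python sketch of this reference-index averaging and repositioning, assuming pairs that were found not correlated are marked NaN in corrMatrix; the NaN-aware means stand in for the "only valid entries" rule:

```python
import numpy as np

def align_out(corr_matrix):
    """corr_matrix: M x M pairwise alignment values (NaN = not correlated).
    Averages the alignment over every choice of reference index, then
    re-positions so the earliest content gets alignment value zero."""
    M = corr_matrix.shape[0]
    per_ref = np.full((M, M), np.nan)
    for ref in range(M):
        col = corr_matrix[:, ref]
        if np.isnan(col).all():
            continue                           # no valid entries for ref
        W = corr_matrix - col[:, None]         # (i, j) minus (i, refIdx)
        col_mean = np.nanmean(W, axis=0)       # mean per column
        per_ref[ref] = col_mean - col_mean[0]  # relative to first column
    align = np.nanmean(per_ref, axis=0)        # mean over all references
    return align - np.nanmin(align)            # earliest content -> zero
```

For example, with corrMatrix filled for M = 4 contents, align_out returns one delay per content, with zero assigned to the content that appears first in the timeline.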
- The multi-user content is now in synchronization according to the alignOut values, and the content can be jointly processed for various rendering and analysis purposes.
- An advantage achieved by the above-described embodiments is that they do not require any special timecodes, clappers or any other special preparations for the content alignment.
- The above-described embodiments, as well as other embodiments operating on the same principles, can operate at low computational complexity, thereby enabling tens of content items to be aligned simultaneously.
- In some embodiments, only a certain portion of the basis vectors may be considered in the alignment at a time. This may be particularly advantageous for content items which have different durations, for example if content A has a duration of 1 min and content B has a duration of 10 min.
- In this case, the processing flow is as per pseudo-code 4 below.
- The A and B content are both split into smaller duration segments, as defined by refDuration, dstDuration, refAdvance and dstAdvance. These small duration segments are aligned until alignment between the segments is found or all segments have been processed. The switch to smaller duration segments takes place if alignment is not found using the basic alignment setup with the original durations.
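A sketch of this segment-based fallback, reusing xcorr_lag from the multi-resolution sketch above. The parameter names refDuration, dstDuration, refAdvance and dstAdvance follow the text; their values and the strong-correlation threshold are illustrative:

```python
def segment_align(ref_bv, dst_bv, fs_bv=2.0, ref_duration=60.0,
                  dst_duration=60.0, ref_advance=30.0, dst_advance=30.0,
                  strong=0.7):
    """Split both basis vectors into short segments and align segment
    pairs until a strong correlation is found.  fs_bv: basis vector
    elements per second."""
    r_len, d_len = int(ref_duration * fs_bv), int(dst_duration * fs_bv)
    r_hop = max(1, int(ref_advance * fs_bv))
    d_hop = max(1, int(dst_advance * fs_bv))
    for r0 in range(0, max(1, len(ref_bv) - r_len + 1), r_hop):
        for d0 in range(0, max(1, len(dst_bv) - d_len + 1), d_hop):
            lag, score = xcorr_lag(ref_bv[r0:r0 + r_len],
                                   dst_bv[d0:d0 + d_len])
            if score >= strong:
                # segment offsets plus intra-segment lag give the overall
                # alignment, in seconds
                return (d0 - r0 + lag) / fs_bv
    return None                                # no segment pair aligned
```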
- While the above embodiments relate to series of time-varying data that represent audio, the scope of the invention is not limited to this.
- The invention is applicable also to processing video and other such time-varying series of data, including static images, where the spatial resolutions (width and height) can be considered to be time-varying.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2011/054832 WO2013064860A1 (en) | 2011-10-31 | 2011-10-31 | Audio scene rendering by aligning series of time-varying feature data |
Publications (2)
Publication Number | Publication Date |
---|---|
EP2774391A1 true EP2774391A1 (en) | 2014-09-10 |
EP2774391A4 EP2774391A4 (en) | 2016-01-20 |
Family
ID=48191434
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP11875048.8A Withdrawn EP2774391A4 (en) | 2011-10-31 | 2011-10-31 | Audio scene rendering by aligning series of time-varying feature data |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP2774391A4 (en) |
WO (1) | WO2013064860A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015092492A1 (en) | 2013-12-20 | 2015-06-25 | Nokia Technologies Oy | Audio information processing |
KR102633077B1 (en) * | 2015-06-24 | 2024-02-05 | 소니그룹주식회사 | Device and method for processing sound, and recording medium |
WO2019002179A1 (en) * | 2017-06-27 | 2019-01-03 | Dolby International Ab | Hybrid audio signal synchronization based on cross-correlation and attack analysis |
CN110741435B (en) | 2017-06-27 | 2021-04-27 | 杜比国际公司 | Method, system, and medium for audio signal processing |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB8826927D0 (en) * | 1988-11-17 | 1988-12-21 | British Broadcasting Corp | Aligning two audio signals in time for editing |
US7660424B2 (en) * | 2001-02-07 | 2010-02-09 | Dolby Laboratories Licensing Corporation | Audio channel spatial translation |
GB2391322B (en) * | 2002-07-31 | 2005-12-14 | British Broadcasting Corp | Signal comparison method and apparatus |
US7948557B2 (en) * | 2005-06-22 | 2011-05-24 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Apparatus and method for generating a control signal for a film event system |
WO2010125228A1 (en) * | 2009-04-30 | 2010-11-04 | Nokia Corporation | Encoding of multiview audio signals |
US9008321B2 (en) * | 2009-06-08 | 2015-04-14 | Nokia Corporation | Audio processing |
KR101612704B1 (en) * | 2009-10-30 | 2016-04-18 | 삼성전자 주식회사 | Apparatus and Method To Track Position For Multiple Sound Source |
US20130226324A1 (en) * | 2010-09-27 | 2013-08-29 | Nokia Corporation | Audio scene apparatuses and methods |
EP2666162A1 (en) * | 2011-01-20 | 2013-11-27 | Nokia Corp. | An audio alignment apparatus |
- 2011-10-31: EP application EP11875048.8A filed (published as EP2774391A4); status: withdrawn
- 2011-10-31: PCT application PCT/IB2011/054832 filed (published as WO2013064860A1); status: application filing
Also Published As
Publication number | Publication date |
---|---|
WO2013064860A1 (en) | 2013-05-10 |
EP2774391A4 (en) | 2016-01-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
| | 17P | Request for examination filed | Effective date: 20140502 |
| | AK | Designated contracting states | Kind code of ref document: A1. Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| | DAX | Request for extension of the European patent (deleted) | |
| | RAP1 | Party data changed (applicant data changed or rights of an application transferred) | Owner name: NOKIA TECHNOLOGIES OY |
| | RA4 | Supplementary search report drawn up and despatched (corrected) | Effective date: 20151221 |
| | RIC1 | Information provided on IPC code assigned before grant | IPC: H04S 3/02 (2006.01) ALI; H04R 29/00 (2006.01) AFI; H04W 56/00 (2009.01) ALI; G11B 27/10 (2006.01) ALI |
| | STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
| | 18D | Application deemed to be withdrawn | Effective date: 20160722 |