WO2016139392A1 - Apparatus and method for assisting the synchronisation of audio or video signals from multiple sources

Apparatus and method for assisting the synchronisation of audio or video signals from multiple sources

Info

Publication number
WO2016139392A1
WO2016139392A1 PCT/FI2016/050103 FI2016050103W WO2016139392A1 WO 2016139392 A1 WO2016139392 A1 WO 2016139392A1 FI 2016050103 W FI2016050103 W FI 2016050103W WO 2016139392 A1 WO2016139392 A1 WO 2016139392A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
detectable
audio
signal
value
Prior art date
Application number
PCT/FI2016/050103
Other languages
English (en)
Inventor
Sujeet Shyamsundar Mate
Jussi LEPPÄNEN
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2016139392A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/242Synchronization processes, e.g. processing of PCR [Program Clock References]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/21805Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/2368Multiplexing of audio and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/2665Gathering content from different sources, e.g. Internet and satellite
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/27Server based end-user applications
    • H04N21/274Storing end-user multimedia data in response to end-user request, e.g. network recorder
    • H04N21/2743Video hosting of uploaded data from client
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/422Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/422Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/4223Cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/434Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N21/4341Demultiplexing of audio and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/482End-user interface for program selection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/325Synchronizing two or more audio tracks or files according to musical features or musical timings

Definitions

  • the present invention relates to apparatus to assist the synchronisation of audio or video signal processing from multiple sources.
  • the invention further relates to, but is not limited to, apparatus in mobile devices to assist the synchronisation of audio or video signal processing from multiple sources.
  • User generated content recorded and uploaded (or up-streamed) to a server relies on members of the public recording and uploading (or up-streaming) a recording of an event using the recording facilities at hand.
  • This may typically be in the form of the camera and microphone arrangement of a mobile device such as a mobile phone.
  • the mobile device or mobile phone may be considered to be an example of a capture apparatus or device. Often an event may be attended and recorded from more than one position by different recording users.
  • the director or end user may then select one of the up-streamed or uploaded recordings to view or listen to. Furthermore where there are multiple recordings of the same event it may be possible to improve the quality of a single recording. However directing and editing user generated content can be difficult as the user generated content recordings are typically made in an unsynchronised manner. In other words each user may be recording using different sample frequencies, and/or encoding the recording at different bit rates, and/or even using different encoding formats. Furthermore even in 'real-time' streaming situations different users may be up-streaming over different parts of the network, or using different network parameters, resulting in differing latency.
  • an apparatus comprising: an audio analyser configured to determine a spectral flatness value associated with a captured audio signal associated with an audio scene and compare the spectral flatness value against a threshold value; a detectable audio signal generator configured to generate a detectable audio signal when the spectral flatness value is less than the threshold value; and an audio output configured to output the detectable audio signal when the spectral flatness value is less than the threshold value.
  • the detectable audio signal may be an ultrasound signal.
  • the apparatus may further comprise an apparatus analyser configured to determine at least one parameter associated with a further apparatus, and wherein the detectable audio signal generator may be further configured to generate a detectable audio signal based on the at least one parameter.
  • the at least one parameter may comprise a further apparatus microphone sensitivity
  • the detectable audio signal generator may be further configured to control the intensity of the detectable audio signal based on the further apparatus microphone sensitivity.
  • the at least one parameter may comprise a further apparatus microphone frequency sensitivity
  • the detectable audio signal generator may be further configured to control the frequency range of the detectable audio signal based on the further apparatus microphone frequency sensitivity.
  • the at least one parameter may comprise a further apparatus distance, and the detectable audio signal generator may be further configured to control the intensity of the detectable audio signal based on the further apparatus distance.
  • the at least one parameter may comprise a camera pose estimate, and the detectable audio signal generator may be further configured to control the intensity of the detectable audio signal based on the camera pose estimate.
  • the apparatus may further comprise an audio recorder configured to record and/or encode a captured audio signal associated with an audio scene.
  • the apparatus may further comprise: a receiver configured to receive at least one further audio signal captured by a further apparatus configured to monitor the audio scene; an alignment generator configured to determine at least one time indicator value for each of the at least one further audio signal and the audio signal; and a synchronizer configured to synchronize at least one of the at least one further audio signal and the audio signal stream to another signal stream based on the at least one time indicator value.
  • the apparatus may further comprise a filter configured to filter the detectable audio signal component from the at least one further audio signal and the audio signal.
  • the alignment generator may be configured to generate the at least one time indicator for the at least one further audio signal and the audio signal based on the correlation between the at least one further audio signal and the audio signal.
  • the at least one time indicator may comprise the ratio of the variance and mean values of the correlation between the at least one further audio signal and the audio signal.
  • a method comprising: determining at an apparatus a spectral flatness value associated with a captured audio signal associated with an audio scene and comparing the spectral flatness value against a threshold value; generating a detectable audio signal when the spectral flatness value is less than the threshold value; and outputting from the apparatus the detectable audio signal when the spectral flatness value is less than the threshold value.
  • the detectable audio signal may be an ultrasound signal.
  • the method may further comprise determining at least one parameter associated with a further apparatus, and wherein the generating a detectable audio signal may further comprise generating a detectable audio signal based on the at least one parameter.
  • the at least one parameter may comprise a further apparatus microphone sensitivity, and generating a detectable audio signal further may comprise controlling the intensity of the detectable audio signal based on the further apparatus microphone sensitivity.
  • the at least one parameter may comprise a further apparatus microphone frequency sensitivity, and generating a detectable audio signal further may comprise controlling the intensity of the detectable audio signal based on the further apparatus microphone frequency sensitivity.
  • the at least one parameter may comprise a distance between the apparatus and further apparatus, and generating a detectable audio signal further may comprise controlling the intensity of the detectable audio signal based on the distance.
  • the at least one parameter may comprise a camera pose estimate, and generating a detectable audio signal further may comprise controlling the intensity of the detectable audio signal based on the camera pose estimate.
  • the method may further comprise recording and/or encoding a captured audio signal associated with an audio scene.
  • the method may further comprise: receiving at least one further audio signal captured by the further apparatus configured to monitor the audio scene; determining at least one time indicator value for each of the at least one further audio signal and the audio signal; and synchronizing at least one of the at least one further audio signal and the audio signal stream to another signal stream based on the at least one time indicator value.
  • the method may further comprise filtering the detectable audio signal component from the at least one further audio signal and the audio signal.
  • the determining at least one time indicator value may comprise generating the at least one time indicator for the at least one further audio signal and the audio signal based on the correlation between the at least one further audio signal and the audio signal.
  • the at least one time indicator may comprise the ratio of the variance and mean values of the correlation between the at least one further audio signal and the audio signal.
  • an apparatus comprising at least one processor and at least one memory including computer program code for one or more programs, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: determine a spectral flatness value associated with a captured audio signal associated with an audio scene and compare the spectral flatness value against a threshold value; generate a detectable audio signal when the spectral flatness value is less than the threshold value; and output the detectable audio signal when the spectral flatness value is less than the threshold value.
  • the apparatus may be further caused to determine at least one parameter associated with a further apparatus, and wherein the generating a detectable audio signal may further cause the apparatus to generate a detectable audio signal based on the at least one parameter.
  • the at least one parameter may comprise a further apparatus microphone sensitivity, and generating a detectable audio signal further may cause the apparatus to control the intensity of the detectable audio signal based on the further apparatus microphone sensitivity.
  • the at least one parameter may comprise a further apparatus microphone frequency sensitivity, and generating a detectable audio signal further may cause the apparatus to control the intensity of the detectable audio signal based on the further apparatus microphone frequency sensitivity.
  • the at least one parameter may comprise a distance between the apparatus and further apparatus, and generating a detectable audio signal further may cause the apparatus to control the intensity of the detectable audio signal based on the distance.
  • the at least one parameter may comprise a camera pose estimate, and generating a detectable audio signal further may cause the apparatus to control the intensity of the detectable audio signal based on the camera pose estimate.
  • the apparatus may be further caused to record and/or encode a captured audio signal associated with an audio scene.
  • the apparatus may further be caused to: receive at least one further audio signal captured by the further apparatus configured to monitor the audio scene; determine at least one time indicator value for each of the at least one further audio signal and the audio signal; and synchronize at least one of the at least one further audio signal and the audio signal stream to another signal stream based on the at least one time indicator value.
  • the apparatus may further be caused to filter the detectable audio signal component from the at least one further audio signal and the audio signal.
  • the determining at least one time indicator value may cause the apparatus to generate the at least one time indicator for the at least one further audio signal and the audio signal based on the correlation between the at least one further audio signal and the audio signal.
  • the at least one time indicator may comprise the ratio of the variance and mean values of the correlation between the at least one further audio signal and the audio signal.
  • an apparatus comprising: means for determining a spectral flatness value associated with a captured audio signal associated with an audio scene and for comparing the spectral flatness value against a threshold value; means for generating a detectable audio signal when the spectral flatness value is less than the threshold value; and means for outputting the detectable audio signal when the spectral flatness value is less than the threshold value.
  • the apparatus may further comprise means for determining at least one parameter associated with a further apparatus, and wherein the means for generating a detectable audio signal may further comprise means for generating a detectable audio signal based on the at least one parameter.
  • the at least one parameter may comprise a further apparatus microphone sensitivity, and the means for generating a detectable audio signal further may comprise means for controlling the intensity of the detectable audio signal based on the further apparatus microphone sensitivity.
  • the at least one parameter may comprise a further apparatus microphone frequency sensitivity, and the means for generating a detectable audio signal further may comprise means for controlling the intensity of the detectable audio signal based on the further apparatus microphone frequency sensitivity.
  • the at least one parameter may comprise a distance between the apparatus and further apparatus, and the means for generating a detectable audio signal further may comprise means for controlling the intensity of the detectable audio signal based on the distance.
  • the at least one parameter may comprise a camera pose estimate, and the means for generating a detectable audio signal further may comprise means for controlling the intensity of the detectable audio signal based on the camera pose estimate.
  • the apparatus may further comprise means for recording and/or encoding a captured audio signal associated with an audio scene.
  • the apparatus may further comprise: means for receiving at least one further audio signal captured by the further apparatus configured to monitor the audio scene; means for determining at least one time indicator value for each of the at least one further audio signal and the audio signal; and means for synchronizing at least one of the at least one further audio signal and the audio signal stream to another signal stream based on the at least one time indicator value.
  • the apparatus may further comprise means for filtering the detectable audio signal component from the at least one further audio signal and the audio signal.
  • the means for determining at least one time indicator value may comprise means for generating the at least one time indicator for the at least one further audio signal and the audio signal based on the correlation between the at least one further audio signal and the audio signal.
  • the at least one time indicator may comprise the ratio of the variance and mean values of the correlation between the at least one further audio signal and the audio signal.
  • An electronic device may comprise apparatus as described herein.
  • a chipset comprising apparatus as described herein.
  • FIG. 1 shows schematically an electronic device suitable for being employed in embodiments of the application
  • Figure 2 shows schematically a multi-user free-viewpoint sharing services system which may encompass embodiments of the application
    • Figure 3 shows schematically a network orientated view of the system shown in Figure 2 within which embodiments of the application may be implemented;
  • Figure 4 shows schematically a method of operation of the system shown in Figure 2 within which embodiments of the application may be implemented;
  • Figure 5 shows a schematic view of the capture apparatus shown in Figure 3 in further detail
  • FIGS. 6a and 6b show schematic views of the passive and active audio sensing operations
  • Figure 7 shows a schematic view of an alignment mode selection example
  • Figure 8 shows a schematic view of a location based alignment mode selection example
  • Figure 9 shows a schematic view of the server shown in Figure 3 in further detail
    • Figure 10 shows schematically a method of operation of the server shown in Figure 9 according to embodiments of the application;
    • Figure 11 shows schematically the synchronisation of signals in embodiments of the application.
  • Figure 12 shows schematically a method of operation of the server shown in Figure 9 according to further embodiments of the application.
  • the concept as described with regards to embodiments shown herein is that at least one of the content capture apparatus within the monitored audio scene is configured to emit or insert a detectable audio signal.
  • the detectable audio signal may comprise audio transients detectable by the other capture apparatus within the audio scene.
  • the detectable audio signal is configured to be detectable by the other content capture apparatus but not significantly disturb the audio scene.
  • the detectable audio signal is an ultrasound audio signal which has a frequency band or range above a typical hearing range.
  • the content capture apparatus may be configured to analyse the audio scene and generate the detectable audio signal based on the analysis of the audio scene.
  • the content capture apparatus may be configured to capture and analyse audio signals associated with the audio scene and determine when to insert the detectable audio signal.
  • the audio scene may be analysed and determined to be unsuitable for performing audio alignment causing the content capture apparatus to insert the detectable audio signal into the audio scene such that a combined audio scene and detectable audio signal is able to be synchronised.
  • the content capture apparatus may be configured to analyse the number and configuration of other capture apparatus within the audio scene and modify the detectable audio signal based on this analysis. For example the content capture apparatus may be configured to modify or change the intensity of the detectable audio signal based on the distances between the content capture apparatus in the audio scene. In some embodiments the content capture apparatus may be configured to similarly change the intensity of the detectable audio signal based on the microphone characteristics of the other content capture apparatus within the audio scene. Furthermore in some embodiments the distances between devices and/or any potential acoustic obstructions between devices may be determined and the detectable audio signal insertion controlled based on these factors.
  • the audio signal recorded by the other content capture apparatus in the environment would thus comprise a combination of the audio sources forming the original audio scene and the detectable audio signals (the active ultrasound signal).
  • the inserted ultrasound signal may be used to determine the delay between the captured audio signals.
  • Figure 1 shows a schematic block diagram of an exemplary apparatus or electronic device 10, which may be used to record or listen to the audio signals and similarly to record or view the audio-visual images and data.
  • the electronic device 10 may for example be a mobile terminal or user equipment of a wireless communication system.
  • the electronic device 10 may comprise an audio subsystem 11.
  • the audio subsystem may comprise a microphone(s) or inputs for microphones for audio signal capture and a loudspeaker(s) or outputs for loudspeaker(s) or headphones for audio signal output.
  • the audio subsystem 11 may be linked via an audio analogue-to-digital converter (ADC) and digital-to-analogue converter (DAC) 14 to a processor 21.
  • the electronic device 10 may further comprise a video subsystem 33.
  • the video subsystem 33 may comprise a camera or input for a camera for image or moving image capture and a display or output for a display for video signal output.
  • the video subsystem 33 may also be linked via a video analogue-to-digital converter (ADC) and digital-to-analogue converter (DAC) 32 to the processor 21.
  • the processor 21 may be further linked to a transceiver (TX/RX) 13, to a user interface (UI) 15 and to a memory 22.
  • the processor 21 may be configured to execute various program codes.
  • the implemented program code may comprise audio and/or video encoding code routines.
  • the implemented program code 23 may further comprise an audio and/or video decoding code.
  • the implemented program code 23 may further comprise a detectable audio signal insertion and control code.
  • the implemented program code 23 may be stored for example in the memory 22 for retrieval by the processor 21 whenever needed.
  • the memory 22 may further provide a section 24 for storing data, for example data identifying and quantifying other audio capture devices within range of the inserted audio signal.
  • the code may in embodiments of the invention be implemented in hardware or firmware.
  • the user interface 15 may enable a user to input commands to the electronic device 10, for example via a touch interface or keypad, and/or to obtain information from the electronic device 10, for example via the display.
  • the transceiver 13 enables a communication with other electronic devices, for example via a wireless communication network.
  • the transceiver 13 may in some embodiments of the invention be configured to communicate to other electronic devices by a wired connection.
  • a user of the electronic device 10 may use the microphone 11 for audio signal capture.
  • the captured audio signal may further be transmitted to some other electronic device or apparatus or be stored in the data section 24 of the memory 22.
  • a corresponding application may be activated to perform the transmission or storage of the audio signal by the user via the user interface 15.
  • This application, which may be run by the processor 21, causes the processor 21 to execute the encoding code stored in the memory 22.
  • the user of the device may use the camera or video sub-system input for video signal capture of video images that are to be transmitted to some other electronic device or apparatus or to be stored in the data section 24 of the memory 22.
  • a corresponding application may similarly be activated to perform the transmission or storage of the video signal by the user via the user interface 15.
  • This application, which may be run by the processor 21, causes the processor 21 to execute the encoding code stored in the memory 22.
  • the audio analogue-to-digital converter 14 may convert the input analogue audio signal into a digital audio signal and provide the digital audio signal to the processor 21.
  • the video analogue-to-digital converter may convert an input analogue video signal into a digital signal format and provide the digital video signal to the processor 21 .
  • the processor 21 may then process the digital audio signal and/or digital video signal in the same way as described with reference to the description hereafter.
  • the resulting audio and/or video bit stream is provided to the transceiver 13 for transmission to another electronic device.
  • the coded data could be stored in the data section 24 of the memory 22, for instance for a later transmission or for a later presentation by the same electronic device 10.
  • the electronic device 10 may also receive a bit stream with correspondingly encoded data from another electronic device via the transceiver 13.
  • the processor 21 may execute decoding program code stored in the memory 22.
  • the processor 21 may therefore decode the received data, and provide the decoded data to either of the audio or video subsystems such as the audio DAC 14 or the video digital-to-analogue converter 32.
  • the audio and/or video digital-to-analogue converter 14, 32 may convert the digital decoded data into analogue data and output the analogue audio signal to the loudspeakers 11, or the analogue video signal to the display 33.
  • in some embodiments the display and/or loudspeakers are themselves digital in operation, in which case the digital audio signal may be passed directly to the loudspeakers 11 and the digital video signal may be passed directly to the display 33.
  • Execution of the decoding program code for audio and/or video signals may be triggered as well by an application that has been called by the user via the user interface 15.
  • the received encoded data could also be stored in the data section 24 of the memory 22 instead of being immediately presented via the loudspeakers 11 and display 33, for instance for enabling a later presentation or forwarding to a further electronic device (not shown).
  • the loudspeakers 11 may be supplemented with or replaced by a headphone set which may communicate with the electronic device 10 or apparatus wirelessly, for example by a Bluetooth profile to communicate via the transceiver 13, or using a conventional wired connection.
  • Figure 2 shows a schematic overview of the system which may incorporate embodiments of the application.
  • Figure 2 shows a plurality of capture apparatus (also known as recording electronic devices or recording apparatus) 210, which may be apparatus 10 such as shown in Figure 1 , configured to record or capture an activity 171 from various angles or directions as shown in Figure 2 by an associated beam 121 .
  • the recording apparatus or capture apparatus 210 closest to the activity 171 are shown in Figure 2 as recording apparatus 210a to 210g.
  • Each of the closest recording apparatus 210a to 210g has an associated beam 121a to 121g.
  • Each of these capture apparatus 210 may then upload or upstream the recorded signals.
  • Figure 2 shows an arrow 191 representing the recorded signals which may be sent over a transmission channel 101 to a server 103.
  • the server 103 may then process the received recorded signals and transmit signal data associated with a 'selected viewpoint', which may be a single recorded or synthesized signal, via a second transmission channel 105 to an end user or viewing apparatus or device 201a.
  • the capture apparatus 210 configured to transmit the recording may be only a capture apparatus and the end user apparatus 201 configured to receive the recorded or synthesized signals associated with the selected viewpoint may be a viewing or listening apparatus only. However in other embodiments the capture apparatus 210 and/or end user apparatus 201 may each have both recording and viewing/listening capacity.
  • Figure 3 shows schematically a system suitable for implementing the embodiments of the application and Figure 4 shows a flow diagram of the operations of the system shown in Figure 3.
  • the content is purely audio content. However it is understood that the same methods may be applied to audio-video content.
  • the system within which embodiments may operate may comprise capture apparatus 210, an uploading or upstreaming network/transmission channel 101, a server or network apparatus 103, a downloading or downstreaming network/transmission channel 105, and end user apparatus 201.
  • the example shown in Figure 3 shows two capture apparatus 210, a first capture apparatus 210a and an n'th capture apparatus 210n. It is understood that the two capture apparatus are shown only as an example of the possible number of capture devices within the audio scene.
  • the capture apparatus 210 may be connected to the server 103 via the uplink network/transmission channel 101.
  • the content capture apparatus in some embodiments thus comprises the audio subsystem 11 in the form of microphone(s) or inputs for microphones for audio signal capture.
  • the audio signals captured by the audio subsystem may be passed to an audio analyser 215 and further to an encoder/recorder 211.
  • the content capture apparatus may comprise an encoder/recorder 211.
  • the encoder/recorder 211 may be configured to record and encode content in the form of a recorded signal.
  • the recorded signal may be the audio signal captured by the audio subsystem microphone or microphone array.
  • the encoder/recorder 211 may also perform encoding on the recorded signal data according to any suitable encoding methodology to generate an encoded audio (or video, or audio-video) signal.
  • The operation of recording the content to form a recorded signal is shown in Figure 4 by step 401.
  • Figure 3 shows that more than one capture apparatus 210 may be within the audio scene. Individual capture apparatus may be configured to capture the event and generate audio signals associated with the capture apparatus's position and recording capability.
  • the first capture apparatus 210a carries out the recording operation as step 401a;
  • a second capture apparatus carries out the recording operation as step 401b;
  • the n'th capture apparatus 210n carries out the recording operation as step 401n.
  • the capture apparatus 210 may also comprise an up-loader or transmitter 213 which formats the audio signal for transmission over the uplink network/transmission channel 101. Furthermore in some embodiments the transmitter 213 may encode positional information to assist the server in locating the captured audio signal. This audio signal and positional data 191 may be transmitted over the uplink transmission channel 101.
  • the uploading (transmission) of the content data and optionally the positional data is shown in Figure 4 by step 403. Figure 3 furthermore shows that more than one capture apparatus 210 may upload the audio signals.
  • the first capture apparatus 210a carries out the uploading of first capture apparatus audio signal (and possibly positional) data 191a as step 403a;
  • a second capture apparatus carries out the uploading operation of second capture apparatus audio signal (and possibly positional) data 191b as step 403b;
  • the n'th capture apparatus 210n carries out the uploading operation of n'th capture apparatus audio signal (and possibly positional) data 191n as step 403n.
  • the uplink network/transmission channel 101 may be a single network, for example a cellular communications link between the capture apparatus and the server.
  • the communications channel may operate or span across multiple networks, for example the data may pass over a wireless communications link to an internet gateway in the wireless communications system and then pass over an internet protocol related physical link to the server.
  • the uplink network/transmission channel 101 may be a simplex network or part of a duplex or half-duplex network.
  • the uplink network/communications channel 101 may comprise any one of a cellular communication network such as a third generation cellular communication system, a Wi-Fi communications network, or any suitable wireless or wired communication link.
  • the recording and uploading operations may occur concurrently or substantially concurrently so that the information received at the server may be considered to be real time or streamed data.
  • the uploading operation may be carried out at a time substantially later than the recording operation and the information may be considered to be uploaded data.
  • the content capture apparatus 210 may furthermore comprise an audio analyser 215.
  • the audio analyser 215 may be configured to receive the captured audio signals and divide the audio signals into frames.
  • the frames are equal length frames.
  • each frame may be 10 seconds long, but may be less than or more than 10 seconds long in order to provide enough information for the analysis described herein.
  • the audio analyser 215 may furthermore determine for each frame the spectral flatness of the signal.
  • the spectral flatness of the signal can be determined by dividing the geometric mean of the power spectrum by the arithmetic mean of the power spectrum, for example by using the following equation:

    Flatness = ( ∏_{n=0}^{N−1} x(n) )^{1/N} / ( (1/N) · ∑_{n=0}^{N−1} x(n) )

  • where x(n) represents the magnitude of frequency spectrum bin number n and N is the number of spectrum bins.
  • the flatness value may indicate the noise-like nature of the captured audio signal.
  • a high value (a value close to 1.0) means that the signal is noise-like and is likely to be difficult to align.
  • a low value (a value close to 0) means that the signal has information in specific spectral bands which can be used to align the audio signals according to the methods described herein.
  • the audio analyser 215 may furthermore be configured to compare the spectral flatness value to a threshold value. When the flatness value is above a threshold value then a signal can be passed to an alignment signal generator 219 indicating that the detectable audio signal is to be generated and inserted into the audio scene. Otherwise when the audio analyser 215 determines that the spectral flatness value is less than the threshold then a signal or indicator can be passed to the alignment signal generator 219 indicating that no detectable audio signal is to be generated and inserted into the audio scene.
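As an illustrative aside, a minimal sketch of the frame-wise flatness analysis and threshold comparison described above is given here (Python with NumPy). The 10 second frame length follows the example given earlier; the FFT-based magnitude spectrum and the 0.5 threshold are illustrative assumptions rather than values taken from the patent, and insertion is flagged when the flatness is above the threshold, following the description of the audio analyser 215.

```python
import numpy as np

def spectral_flatness(frame: np.ndarray) -> float:
    """Geometric mean of the magnitude spectrum divided by its arithmetic mean."""
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-12   # small offset avoids log(0)
    geometric_mean = np.exp(np.mean(np.log(spectrum)))
    arithmetic_mean = np.mean(spectrum)
    return float(geometric_mean / arithmetic_mean)

def needs_detectable_signal(audio: np.ndarray, sample_rate: int,
                            frame_seconds: float = 10.0,
                            flatness_threshold: float = 0.5) -> list[bool]:
    """Split the captured audio into frames and flag each frame whose flatness
    is above the threshold (noise-like, hard to align passively)."""
    frame_len = int(frame_seconds * sample_rate)
    decisions = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        flatness = spectral_flatness(audio[start:start + frame_len])
        decisions.append(flatness > flatness_threshold)
    return decisions
```

A frame flagged True would, in the terms used above, trigger the alignment signal generator 219 to insert the detectable audio signal for that period.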
  • the content capture apparatus 210 comprises an apparatus analyser 217.
  • the apparatus analyser 217 may be configured to receive information about the other capture apparatus in the audio scene. This information may be received from a server configured to store information on the capture apparatus in the audio scene. For example as described above the encoder/recorder may transmit with the audio signal positional information identifying the position of the capture apparatus to the server. The server may then pass this information to a capture apparatus which requests the positional information of other capture apparatus.
  • the capture apparatus information may be obtained on a peer-to-peer basis from other capture apparatus.
  • the apparatus analyser 217 may be configured to receive directly positional information from other capture apparatus also within the audio scene.
  • the capture apparatus information may as described herein be positional information and/or may be information about the capacity of the other apparatus.
  • the apparatus analyser 217 may be configured to receive information such as the number of microphones, the spatial sensitivity of the microphones, or the spectral frequency sensitivity of the microphones. This information may in some embodiments be passed to the alignment signal generator 219.
  • the content and audio signal alignment may be performed more efficiently if the location information (for example using GPS, indoor positioning, digital compass, gyroscope etc.) about the capture apparatus location is known while performing the alignment processing. For example where the capture apparatus in the audio scene are known to be separated by more than a threshold distance with respect to each other then either the captured audio scene is likely to be very different for the different capture apparatus or the microphone sensitivity of the capture apparatus is likely to result in poor signal capture quality, and result in a failure to achieve audio-based alignment. In such embodiments, when the separation is greater than the threshold then the apparatus analyser 217 may be configured to control the alignment signal generator 219 to not generate the detectable audio signal (even when the audio analyser 215 determines that the audio scene is 'flat').
  • the apparatus analyser 217 may be configured to generate an indicator which is passed to the uploader/transmitter 213 to indicate to the server that the uploaded content is not suitable for audio alignment when there are no other capture apparatus within the threshold distance.
  • the computing resources for attempting audio alignment for such temporal segments are saved, resulting in improved efficiency.
  • the location and/or positioning of the capture apparatus may be determined by deriving camera pose estimate values (CPE) from the camera images.
  • Camera pose estimate values are external or extrinsic parameters associated with the camera.
  • pose parameters may be the camera position and orientation estimate values and may be determined from multiple images after determining any intrinsic or internal camera parameters.
  • Intrinsic or internal camera parameters may for example be parameters such as the focal length, the optical centre, the aspect ratio associated with the camera.
  • the pose estimate may be determined in some embodiments by applying an analytic or geometric method where, given that the image sensor (camera) is calibrated, the mapping from 3D points in the scene to 2D points in the image is known.
  • the projected image of the object on the camera image is a well-known function of the object's pose.
  • the pose may then be estimated from a set of control points on the object, typically corners or other feature points.
  • the pose estimates may be determined by applying genetic algorithms where the pose forms the genetic representation and the error between the projection of the object control points and the image is the fitness function.
  • the pose estimate may be determined using Learning-based methods where the system learns the mapping from 2D image features to pose transformation.
  • an approximate positional estimate may be obtained by comparing the visual content in the image or video with a globally registered image database.
  • in some embodiments it may be determined whether the capture apparatus are within the same audio scene region, for example whether the capture apparatus are within the same room and therefore, although separated by a distance, capable of capturing the same audio scene, or whether, although separated by a shorter distance, they are located in different rooms and therefore not capable of capturing the same audio scene.
  • Furthermore the operation of analysing the other apparatus in the audio scene and generating suitable control signals based on the analysis is shown in Figure 4 by step 453.
  • the capture apparatus 210 may comprise an alignment signal generator 219.
  • the alignment signal generator 219 may be configured to receive the capture apparatus information with respect to other apparatus in the audio scene from the apparatus analyser 217 and furthermore the output of the audio analyser 215.
  • the alignment signal generator 219 may then be configured to selectively generate a suitable alignment signal.
  • the alignment signal as described herein may be a predetermined or known audio signal.
  • the audio signal may in some embodiments furthermore be an ultrasound audio signal.
  • the audio signal may have a spectral frequency range above the range of the typical hearing range.
  • the detectable audio signal may be a predetermined audio signal within the frequency range between 16.5 kHz and 19.5 kHz.
  • the alignment signal generator 219 may be configured to control the intensity or the power of the detectable audio signal based on the output from the apparatus analyser 217.
  • the amplitude or power of the detectable audio signal may be dependent on the distance between the capture apparatus.
  • the frequency range of the detectable audio signal may be dependent on the spectral sensitivity of microphones in the other capture apparatus, such that the generated audio signal may be within the frequency range detected by the microphones of the other capture apparatus but above the typical listener's hearing range.
  • the alignment signal generator 219 may be configured to generate an indicator which is passed to the uploader/transmitter 213 to indicate to the server whether the detectable audio signal has been generated.
  • the server on receiving this indicator may be configured to actively filter and process the detectable audio signal part of the captured audio signal rather than processing all of the captured audio signal.
  • the operation of generating a detectable audio signal controlled by the apparatus information and the analysis of the captured audio signal is shown in Figure 4 by step 455.
  • the alignment signal generator 219 may furthermore be configured to output the detectable audio signal to the audio subsystem and in particular the loudspeaker transducer and thus broadcast the detectable audio signal to any capture apparatus within the audio scene.
  • The operation of transmitting the detectable audio signal to the other devices within the audio scene is shown in Figure 4 by step 457.
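For illustration only, the following sketch generates a detectable signal in the 16.5 kHz to 19.5 kHz band mentioned above and scales its level with an estimated distance to the other capture apparatus. The use of a linear chirp and the simple distance-proportional amplitude compensation are assumptions; the description above only requires a predetermined signal in that band whose intensity is controlled by the apparatus analysis. A capture sample rate of 48 kHz is assumed so that the band lies below Nyquist.

```python
import numpy as np

def generate_detectable_signal(sample_rate: int = 48000,
                               duration_s: float = 0.5,
                               low_hz: float = 16500.0,
                               high_hz: float = 19500.0,
                               distance_m: float = 1.0,
                               reference_amplitude: float = 0.1) -> np.ndarray:
    """Generate a linear chirp sweeping 16.5-19.5 kHz; the amplitude grows with
    the estimated distance to the furthest capture apparatus (illustrative
    compensation for propagation loss, clipped to full scale)."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    # instantaneous frequency sweeps linearly from low_hz to high_hz
    phase = 2 * np.pi * (low_hz * t + (high_hz - low_hz) * t**2 / (2 * duration_s))
    amplitude = min(1.0, reference_amplitude * max(distance_m, 1.0))
    return amplitude * np.sin(phase)
```

The returned samples would then be written to the loudspeaker path of the audio subsystem, as described for the alignment signal generator 219 above.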
  • in Figure 6a a scenario is shown where only the ambient audio scene sources are captured by the capture apparatus or devices.
  • the ambient audio scene source 171 is shown generating an audio signal 1601 which is detected and captured by the capture apparatus (Device 1) 210a and a further capture apparatus (Device 2) 210b.
  • This scenario may represent the situation where the captured ambient audio scene source is determined to have a spectral flatness value greater than a determined threshold and thus no detectable audio signal is added to the audio scene.
  • Figure 6b shows a scenario where the detectable audio signal is added to the ambient audio scene source.
  • the ambient audio scene source 171 is shown generating an audio signal 1601 which is detected and captured by the capture apparatus (Device 1) 210a and a further capture apparatus (Device 2) 210b.
  • This scenario may represent the situation where the captured ambient audio scene source is determined by the further capture apparatus to have a spectral flatness value less than a determined threshold and thus would produce poor quality audio synchronisation data.
  • the further capture apparatus (Device 2) 210b is then configured to generate a detectable audio signal in the form of an ultrasound signal which is emitted by the further capture apparatus (the ultrasound signal being emitted by the further capture apparatus is shown in Figure 6b by reference 1605).
  • the emitted ultrasound signals 1603 can then be captured along with the ambient audio scene audio signals at the capture apparatus (Device 1 ) 210a.
  • the insertion of the detectable audio signal may be controlled based on the analysis of the audio scene source audio signal. This control may be performed over time, in other words the detectable audio signal may be switched on and off based on the audio scene source audio signal.
  • Figure 7 shows an audio signal 1700 timeline and an associated suitability determination 1701 and an alignment mode determination 1702. This example shows three sections or periods where the detectable audio signal is switched on and off.
  • a third period 1715 when based on the analysis of the audio signal the suitability determination indicates that the audio signal is again suitable for audio alignment and a passive audio alignment mode is selected (in other words no detectable audio signal is generated or inserted).
  • control of the generation and transmission of the detectable audio signal is determined based on positioning information (for example capture apparatus GPS positioning information, indoor positioning information etc.).
  • the audio scene captured by the separated capture apparatus is likely to be significantly different and therefore result in poor audio signal synchronisation even when a detectable audio signal is generated and broadcast.
  • This is shown for example in Figure 8 wherein the locations or positions of a first capture apparatus 210a and a second capture apparatus 210b are shown. Furthermore the regions within which the capture apparatus are separated by less than and by more than a threshold distance are shown in Figure 8.
  • a first region or temporal segment 1801 is shown when the first capture apparatus 210a and the second capture apparatus 210b are separated by less than a threshold distance and as such are suitable for audio alignment using either the passive or active modes of alignment.
  • the first region or temporal segment 1801 is then followed by a second region or temporal segment 1803 when the first capture apparatus 210a and the second capture apparatus 210b are separated by more than the threshold distance and as such are not suitable for audio alignment using either the passive or active modes of alignment.
  • the server may receive an indicator which indicates that the audio positioning synchronisation method will not be successful and therefore is not attempted.
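The passive/active/not-attempted decision described in the preceding paragraphs could be prototyped along the following lines. The threshold values and the AlignmentMode names are illustrative assumptions, not values or identifiers from the patent.

```python
from enum import Enum

class AlignmentMode(Enum):
    PASSIVE = "passive"              # ambient audio is distinctive enough
    ACTIVE = "active"                # insert the detectable (ultrasound) signal
    NOT_ATTEMPTED = "not_attempted"  # apparatus too far apart, skip alignment

def select_alignment_mode(separation_m: float,
                          spectral_flatness: float,
                          distance_threshold_m: float = 50.0,
                          flatness_threshold: float = 0.5) -> AlignmentMode:
    """Flag far-apart devices so the server does not attempt audio alignment;
    otherwise let the flatness of the captured scene decide between passive
    alignment and active signal insertion."""
    if separation_m > distance_threshold_m:
        return AlignmentMode.NOT_ATTEMPTED
    if spectral_flatness > flatness_threshold:
        return AlignmentMode.ACTIVE
    return AlignmentMode.PASSIVE
```

The NOT_ATTEMPTED outcome corresponds to the indicator sent to the server so that synchronisation is not attempted for that temporal segment.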
  • the server 103 in Figure 3 is shown in further detail in Figure 9 and the operation of the server described with reference to Figure 4 is described in further detail in Figure 10. Where the same (or similar) components or operations are described the same reference number may be used.
  • the server 103 may comprise a receiver or buffer 221 which may receive the recorded signal data (and in some embodiments the positioning data) 191 from the uplink network/communications channel.
  • the receiver/buffer 221 may be any suitable receiver for receiving the recorded signal data (and in some embodiments the positioning data) according to the format used on the uplink network/communications channel 101.
  • the receiver/buffer 221 may be configured to output the received recorded signal data and in some embodiments the positioning data to the synchronizer 223.
  • the buffering may enable the server 103 to receive the recorded signal data from the capture apparatus 210 for the time reference required.
  • This buffering may therefore be short term buffering, for example in real-time or near real-time streaming of the recorded signal data the buffering or storage of the recorded signal data may be in the order of seconds and the receiver/buffer may use solid state memory to buffer the recorded signal data.
  • the receiver/buffer 221 may store the recorded signal data in a long term storage device, for example using magnetic media such as a RAID (Redundant Array of Independent Disks) storage. This long term storage may thus store the recorded signal data for an amount of time to enable several different capture devices to upload the recorded signal data at their convenience.
  • the buffering may further comprise filtering of the audio signals.
  • the filtering may for example be performed to select the detectable audio signal components from the captured audio signal.
  • the filtering may be to focus the analysis on the detectable audio signal components introduced during the 'active audio alignment' modes of operation.
  • the filtering is configured to reduce the effect of the audio signals where it is determined that the audio signal is likely to produce a poor result. For example where the captured audio signal is determined not to be within the audio scene region because of the positional information then the captured audio signal may be 'removed' or filtered from the synchronisation operation.
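One way the filtering step that selects the detectable audio signal components might be realised is a simple band-pass filter over the 16.5 kHz to 19.5 kHz band used in the earlier example, assuming SciPy is available and the capture sample rate (e.g. 48 kHz) places that band below Nyquist. This is a sketch, not the patent's filtering method.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def extract_detectable_band(audio: np.ndarray, sample_rate: int,
                            low_hz: float = 16500.0,
                            high_hz: float = 19500.0,
                            order: int = 6) -> np.ndarray:
    """Band-pass filter the captured audio so that mainly the inserted
    detectable-signal band remains for the alignment analysis."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass",
                 fs=sample_rate, output="sos")
    return sosfiltfilt(sos, audio)  # zero-phase filtering preserves timing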
  • defining the vector b_i as the received and buffered i'th recorded signal data for a time period of length T seconds, and the sample rate of the i'th recorded signal data as S Hz, the number of time samples N within b_i may then be defined by the following equation: N = T · S.
  • the receiving/buffering operation is shown in both the system operation as shown in Figure 4 and the server operation as shown in Figure 10 as step 405.
  • the server 103 further comprises a synchronizer 223.
  • the synchronizer 223 receives at least two independently recorded signal data 191a, 191b, ..., 191n from the receiver/buffer and outputs at least two synchronized recorded signal data signals.
  • the synchronizer 223 does so by variable length framing of the recorded signal data, selecting a base recorded signal data and then aligning the remainder of the recorded signal data with the base recorded signal.
  • the at least two synchronized recorded signal data are then passed to the processor/transmitter 227 for further processing.
  • The synchronization operation is shown in Figure 4 by step 407.
  • the synchronizer 223 may comprise a variable length framer 301 .
  • the variable length framer may receive the at least two recorded signal data values 191 from the receiver/buffer 221 .
  • the variable length framer 301 may generate framed recorded signal values, by generating a single sample value from a first number of recorded signal data sample values.
  • An example of the variable length framer 301 carrying out variable length framing may be according to the following equation: vlf_i,j(k) = Σ_{h=0}^{f_j−1} |b_i(k·f_j + h)|^(1+vlf_idx)
  • vlf_i,j(k) is the output sample value for the first number of recorded signal data samples for the i'th recorded signal data, f_j is the first number (otherwise known as the input mapping size), and b_i(k·f_j+h) is the input sample value for the (k·f_j+h)'th sample.
  • k·f_j defines the first input sample index
  • k·f_j+f_j−1 the last input sample index.
  • the index k defines the output sample or variable frame index.
  • the index vlf_idx indicates the run time mode for the variable length framing.
  • the value of vlf_idx is set to 0 where the frame duration corresponding to f_j is less than 2 ms, otherwise the value of vlf_idx is set to 1.
  • where vlf_idx is set to 0 the amplitude envelope calculation path may be selected, otherwise the energy envelope calculation path may be used. In other words, for small input mapping sizes it is more advantageous to track the amplitude envelope than the energy envelope. This may improve the resilience to false synchronization results.
  • variable length framer 301 may then repeat the operation of variable length framing for each of the number of signals identified for the selected space to generate an output for each of the recorded signals so that the output samples for each of the recorded signals have the same number of sample values for the same time period.
  • the operation of the variable length framer 301 may be such that in embodiments all of the recorded signal data are variable length framed in a serial format, in other words one after another. In some embodiments the operation of the variable length framer 301 may be such that more than one of the recorded signal data may be processed at the same time or substantially at the same time to speed up the variable length processing for the time period in question.
  • the output of the variable length framer 301 may be passed to the indicator selector 303.
  • the operation of variable length framing is shown in Figure 10 by step 4071 .
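  Purely as an illustrative aid, and not as part of the described embodiments, the following Python sketch shows one possible reading of the variable length framing above. The function name variable_length_frame, the use of NumPy and the exact handling of a trailing partial frame are assumptions; the 2 ms amplitude/energy switch follows the reconstruction of the equation given earlier.

```python
import numpy as np

def variable_length_frame(b_i, f_j, sample_rate):
    """Collapse each run of f_j input samples into one output value (a sketch).

    For short input mapping sizes (frame duration < 2 ms) the amplitude
    envelope is tracked (exponent 1); otherwise the energy envelope is
    tracked (exponent 2).
    """
    vlf_idx = 0 if (f_j / sample_rate) < 0.002 else 1
    num_frames = len(b_i) // f_j                        # drop any trailing partial frame
    frames = np.abs(b_i[:num_frames * f_j]).reshape(num_frames, f_j)
    return np.sum(frames ** (1 + vlf_idx), axis=1)      # one value per variable length frame
```

  For example, with S = 48000 Hz and f_j = 96 each output value would summarise 2 ms of audio, so the amplitude envelope path would still just be selected under the assumed threshold.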
  • the synchronizer 223 may also comprise an indicator selector 303 configured to receive the variable length framed sample values for each of the selected space of recorded signal data and generate a time alignment indicator for each recorded data signal.
  • the indicator selector 303 may for example generate the time alignment indicator tInd for the i'th signal and for all variable time frame sample values j from 0 to M using the following equation.
  • tInd_i,j(k) = argmax_τ xCorr_τ(vlf_i,j, vlf_k,j), 0 ≤ i < U, 0 ≤ k < U, 0 ≤ j < M
  • max_τ maximises the correlation between the given signals with respect to the delay τ.
  • This maximisation function locates the delay τ where the signals are best time aligned.
  • the correlation function may in embodiments of the invention be defined as xCorr_τ(vlf_i,j, vlf_k,j) = Σ_n vlf_i,j(n)·vlf_k,j(n+τ), with the delay restricted to |τ| ≤ T_upper·S/f_j.
  • T_upper defines the upper limit for the delay in seconds.
  • the upper limit may be set to two seconds as this has been found to be a fair value for the delay in practical recording and networking conditions.
  • tCorr_i,j(k) = max_τ xCorr_τ(vlf_i,j, vlf_k,j), 0 ≤ i < U, 0 ≤ k < U, 0 ≤ j < M may provide the correlation value.
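  A minimal sketch of this delay search, assuming NumPy's full cross-correlation; the helper name time_alignment_indicator, the symmetric ±T_upper search window and the returned lag sign convention are illustrative assumptions rather than the described method.

```python
import numpy as np

def time_alignment_indicator(vlf_i, vlf_k, f_j, sample_rate, t_upper=2.0):
    """Return (best_lag_in_frames, correlation_value) for two framed signals."""
    max_lag = int(t_upper * sample_rate / f_j)          # delay limit expressed in frames
    corr = np.correlate(vlf_i, vlf_k, mode="full")      # correlation at every possible lag
    lags = np.arange(-(len(vlf_k) - 1), len(vlf_i))     # lag value for each correlation entry
    mask = np.abs(lags) <= max_lag                      # restrict the search to |tau| <= T_upper
    best = np.argmax(corr[mask])
    return int(lags[mask][best]), float(corr[mask][best])
```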
  • the indicator selector 303 may then pass the generated time alignment indicator (tInd) values to the base signal determiner 305.
  • the synchronizer 223 may also comprise a base signal determiner 305 which may be configured to receive the time alignment indicator values from the indicator selector 303 and indicate which of the received recorded signal data is suitable to synchronize the remainder of the recorded signal data to.
  • the base signal determiner 305 may first generate a series of time aligned indicators from the time alignment indicator values.
  • the time aligned indicators may be a time aligned index average, a time aligned index variance and a time aligned index ratio. The average and variance may be generated by the base signal determiner 305 according to the following equations.
  • tIndAve_i,k = (1/M) Σ_{j=0}^{M−1} tInd_i,k(j), 0 ≤ i < U, 0 ≤ k < U
  • tIndVar_i,k = (1/M) Σ_{j=0}^{M−1} (tInd_i,k(j) − tIndAve_i,k)², 0 ≤ i < U, 0 ≤ k < U
  • the base signal determiner 305 may sort the indicator tIndRatio in increasing order of importance. For example the base signal determiner 305 may sort the indicator tIndRatio so that the ratio value having the smallest value appears first, the ratio value having the second smallest value appears second and so on. The base signal determiner 305 may output the sorted indicator as the ratio vector tIndRatioSorted. The base signal determiner 305 may also record the order of the time indicator values tIndRatio by generating an index tIndRatioSortedIndex which contains the corresponding original position indices for the sorted result.
  • the base signal determiner 305 may generate a vector with the values [ 2, 5, ...].
  • the determination of the base signal is shown in Figure 10 by step 4075.
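  The following Python sketch illustrates one way the base signal could be chosen from the indicators. The exact definition of tIndRatio is not reproduced in the text above, so the ratio used here (variance relative to the magnitude of the average) and the summed-score selection rule are assumptions, as are the function and variable names.

```python
import numpy as np

def choose_base_signal(t_ind):
    """Pick a base signal from per-pair delay indicators (a sketch).

    t_ind is an ndarray of shape (U, U, M): delay estimates between every
    pair of signals i, k for each of the M input mapping sizes.  The ratio
    definition below is an assumption; the described apparatus only needs
    some per-pair consistency measure.
    """
    t_ind_ave = t_ind.mean(axis=2)                         # cf. tIndAve_i,k
    t_ind_var = t_ind.var(axis=2)                          # cf. tIndVar_i,k
    t_ind_ratio = t_ind_var / (np.abs(t_ind_ave) + 1e-9)   # assumed consistency ratio
    # A signal whose delay estimates to all other signals are most
    # consistent is a reasonable base: rank candidates, smallest score first.
    scores = t_ind_ratio.sum(axis=1)
    sorted_index = np.argsort(scores)                      # cf. tIndRatioSortedIndex
    base_signal_idx = int(sorted_index[0])
    time_align = t_ind_ave[base_signal_idx]                # alignment factors vs. the base
    return base_signal_idx, time_align
```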
  • the base signal determiner 305 may then pass the base signal indicator value base_signal_idx and also the time alignment factor values time_align for the remaining recorded signals to the signal synchronizer 307.
  • the synchronizer 223 may also comprise a signal synchronizer 307 configured to receive the recorded signals via the receiver/buffer 221 and the base signal indicator value and the time alignment factor values for the remaining recorded signals.
  • the signal synchroniser 307 may then synchronize the recorded signals by adding the time alignment value to the current time indices of each of the signals.
  • the synchronizer 223 may be configured to receive from the capture apparatus an indicator of the quality of the captured audio signal and be able to bias the alignment of the content based on the quality indicator. For example the synchronizer 223 may determine that the current audio signal being analysed has an associated indicator indicating the audio signal is poor quality and use previously determined delay values until the audio signal quality improves.
  • the quality of the audio signal may for example be determined by the content apparatus audio analyser and be based on the spectral flatness and/or the power level of the audio signal.
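  As an illustration of such a quality measure, the sketch below flags a frame as "usable" from its spectral flatness and power. The thresholds (0.5 flatness, −40 dB power), the frame-wise FFT approach and the function name are assumptions, not values taken from the described embodiments.

```python
import numpy as np

def audio_quality_indicator(frame, flatness_threshold=0.5, power_threshold_db=-40.0):
    """Crude per-frame quality flag from spectral flatness and power (a sketch)."""
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-12
    # Spectral flatness: geometric mean over arithmetic mean of the magnitude spectrum.
    flatness = np.exp(np.mean(np.log(spectrum))) / np.mean(spectrum)
    power_db = 10.0 * np.log10(np.mean(np.square(frame)) + 1e-12)
    # Noise-like (very flat) spectra and very quiet frames are treated as poor quality.
    return bool(flatness < flatness_threshold and power_db > power_threshold_db)
```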
  • Figure 11 shows four recorded signals. These recorded signals may be a first signal (signal 1) 501, a second signal (signal 2) 503, a third signal (signal 3) 505 and a fourth signal (signal 4) 507.
  • the signal synchronizer 307 may receive a base signal indicator value base_signal_idx with a value of 3 561, and furthermore receive time_align values for the first signal Time_align(1) 551, the second signal Time_align(2), the third signal Time_align(3) which is equal to zero, and the fourth signal Time_align(4).
  • the signal synchronizer 307 may delay the first signal 501 by the Time_align(1) 551 value to generate a synchronized first signal 511.
  • the signal synchronizer 307 may delay the second signal 503 by the Time_align(2) 553 value to generate a synchronized second signal 513.
  • the signal synchronizer 307 may also delay the fourth signal 507 by the Time_align(4) 557 value to generate a synchronized fourth signal 517.
  • the synchronized recorded data signals may then be output to the processor/transmitter 227.
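  A minimal sketch of applying such alignment values, assuming the delays are supplied in variable length frames and realised by padding with silence or trimming leading samples; the function name and the frames-to-samples conversion via f_j are assumptions.

```python
import numpy as np

def synchronize_to_base(signals, base_signal_idx, time_align_frames, f_j):
    """Shift every signal so that it lines up with the base signal (a sketch).

    time_align_frames holds one delay per signal relative to the base; the
    base signal itself has a delay of zero.
    """
    synced = []
    for i, signal in enumerate(signals):
        delay_samples = int(round(time_align_frames[i] * f_j))    # frames -> samples
        if i == base_signal_idx or delay_samples == 0:
            synced.append(np.asarray(signal))
        elif delay_samples > 0:
            # Positive delay: prepend silence so the signal starts later.
            synced.append(np.concatenate([np.zeros(delay_samples), signal]))
        else:
            # Negative delay: drop leading samples so the signal starts earlier.
            synced.append(np.asarray(signal)[-delay_samples:])
    return synced
```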
  • the apparatus of the server may be considered to comprise a frame value generator which may generate for each of at least two signal streams, at least one signal value for a frame of audio signal values from the signal stream.
  • the same server apparatus may also comprise an alignment generator to determine at least one indicator value for each of the at least two signal streams dependent on the at least one signal value for a frame of audio signal values for the signal stream. Furthermore the server apparatus may comprise a synchronizer to synchronize at least one signal stream to another signal stream dependent on the indicator values.
  • the server 103 may comprise a viewpoint receiver/buffer 225.
  • the viewpoint receiver/buffer 225 may be configured to receive from the end user apparatus 201 data in the form of positional or recording viewpoint information signal - in other words the apparatus may communicate a request to hear or view the event from a specific capture apparatus or from a specified position.
  • although the term viewpoint is used, it would be understood that this applies to audio only data as well as audio-visual data.
  • the data may indicate for selection or synthesis a specific capture apparatus from which audio or audio-visual recorded signal data is to be selected or a position such as a longitude and latitude or other geographical co-ordinate system.
  • the viewpoint selection data may be received from the end user apparatus via the downlink network/transmission channel 105.
  • the downlink network/transmission channel 105 may be a single network, for example a cellular communications link between the end user apparatus 201 and the server 103, or may be a channel operating across multiple channels, for example the data may pass over a wireless communications link to an internet gateway in the wireless communications system and then pass over an internet protocol related physical link to the server 103.
  • the viewpoint selection is shown in Figure 4 by step 408.
  • the downlink network/communications channel 105 may also comprise any one of a cellular communication network such as a third generation cellular communication system, a Wi-Fi communications network, or any suitable wireless or wired communication link.
  • the uplink network/communications channel 101 and the downlink network/communications channel 105 are the same network/communications channel.
  • the uplink network/communications channel 101 and the downlink network/communications channel 105 share parts of the same network/communications channel.
  • the downlink network/communication channel 105 may be a pair of simplex channels, or a duplex or half duplex channel, configured to carry information to and from the server either at the same time or substantially at the same time.
  • the processor/transmitter 227 may comprise a viewpoint synthesizer or selector signal processor 309.
  • the viewpoint synthesizer or selector signal processor 309 may receive the viewpoint selection information from any end user apparatus and then select or synthesize suitable audio or audio-visual data to be sent to the end user apparatus to provide the end user apparatus 201 with the content experience desired.
  • the signal processor 309 selects the synchronized recorded signal data from the recording apparatus indicated.
  • the signal processor 309 selects the synchronized recorded signal data which is positioned and/or directed closest to the desired position/direction.
  • where a specific location/direction is specified, a synthesis of more than one nearby synchronized recorded signal data may be generated.
  • the signal processor 309 may generate a weighted average of the synchronized recorded signal data near the specific location/direction to provide an estimate of the audio or audio-visual data which might have been recorded at the specified position.
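  One possible form of such a weighted average is sketched below using inverse-distance weights; the weighting scheme, the 2-D coordinate representation and the function name are assumptions, and the sketch assumes equal-length, already synchronized signals.

```python
import numpy as np

def synthesize_viewpoint(synced_signals, positions, listener_position):
    """Estimate the signal at an unrecorded position by distance-weighted averaging (a sketch)."""
    positions = np.asarray(positions, dtype=float)       # (num_signals, 2), e.g. x/y in metres
    listener = np.asarray(listener_position, dtype=float)
    distances = np.linalg.norm(positions - listener, axis=1)
    weights = 1.0 / (distances + 1e-3)                   # closer recordings dominate the mix
    weights /= weights.sum()
    stacked = np.stack([np.asarray(s, dtype=float) for s in synced_signals])
    return np.average(stacked, axis=0, weights=weights)  # weighted mix of nearby recordings
```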
  • the signal processor 309 may compensate for the missing or corrupted recorded signal data by synthesizing the recorded signal data from the synchronized recorded signal data from neighbouring recording apparatus 210.
  • the signal processor 309 may in some embodiments determine the nearby and neighbouring recording apparatus 210 and further identify the closest recording apparatus to the desired position by using the positional data provided by the capture apparatus.
  • the output of the signal processor 309 in the form of desired (in other words selected recorded or synthesized) signal data 195 may be passed to the transmitter/buffer 311.
  • the selection/processing of the recorded signal data is shown in Figure 4 by step 409.
  • the processor/transmitter 227 may further comprise a transmitter/buffer configured to transmit the desired signal data 195 via the downlink network/transmission channel 105 which has been described previously.
  • the server 103 may therefore be connected via the downlink network/transmission channel 105 to end user apparatus (or devices) 201 configured to generate viewpoint or selection information and receive the desired signal data associated with the viewpoint or selection information.
  • end user apparatus 201 may receive signal data from the server 103 and transmit data to the server 103 via the downlink network/transmission channel 105.
  • the end user apparatus 201 such as the first end user apparatus 201a may comprise a viewpoint selector and transmitter 231a.
  • the viewpoint selector and transmitter 231 a may use the user interface 15 where the end user apparatus may be the apparatus shown in Figure 1 to allow the user to specify the desired viewing position and/or desired capture apparatus.
  • the viewpoint selector and transmitter 231 a may then encode this information in order that it may be transmitted via the downlink network/communications channel 105 to the server 103.
  • the end user apparatus 201 such as the first end user apparatus 201a may also comprise a receiver 233a configured to receive the desired signal data as described above via the down-link network/communications channel 105.
  • the receiver may decode the transmitted desired signal (in other words a selected synchronized recorded signal or synthesized signal from the synchronized recorded signals) to generate content data in a format suitable for viewing.
  • the end user apparatus 201 such as the first end user apparatus 201a may also comprise a viewer 235a configured to display or output the desired signal data as described above.
  • where the end user apparatus 201 is the apparatus shown in Figure 1, the audio stream may be processed by the audio ADC/DAC 14 and then passed to the loudspeaker 11, and the video stream may be processed by the video ADC/DAC 32 and output via the display 33.
  • the viewing/listening of the desired signal data is shown in Figure 4 by step 415.
  • the apparatus in the form of the end user apparatus may be considered to comprise an input selector configured to select a display variable.
  • the display variable may be an indication of at least one of a recording apparatus, a recording location, which may or may not be marked as a recording apparatus location, and a recording direction or orientation.
  • the apparatus in the form of the end user apparatus may furthermore be summarised as being considered to comprise a transmitter configured to transmit the display variable to a further apparatus, wherein the further apparatus may be the server as described previously. Furthermore the same apparatus may be considered to comprise a receiver configured to receive a signal stream from the server apparatus, wherein the signal stream comprises at least one signal stream received from a recording apparatus synchronized with respect to a further signal stream received from a further recording apparatus.
  • the same, end user, apparatus may also be summarized as comprising a display for displaying the signal stream.
  • End users may in embodiments select between recorded signal data from different capture apparatus with improved timing and cueing performance as the recorded signal data is synchronized.
  • the generation of synthesized signal data using the synchronized recorded signal data allows the end user to experience the content from locations not originally recorded or improve on the recorded data from a single source to allow for deficiencies in the original signal data - such as loss of recorded signal data due to network issues, failure to record due to partial or total device failure, or poorly recorded signal data due to interference or noise.
  • In Figure 12 the operation of further embodiments of the server with respect to buffering and synchronization of the recorded signal data is shown. In these embodiments, rather than synchronizing the recorded signal data using a single time alignment indicator, further time alignment indicators may be generated for further time instances or periods.
  • the buffer/receiver 221 may receive the recorded signal data streams in step 405.
  • the buffered recorded signal may be defined as b_n,i where the subindex n describes the time instant from which the recorded signal is buffered.
  • the subindex n is an element of the set G, in other words the number of different time instants to be used to determine the base signal.
  • the starting position for each buffering location may be described by Tloc_n, that is the signal is buffered starting from Tloc_n − T seconds.
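  A small sketch of extracting these sub-period buffers, assuming Tloc_n values are given in seconds from the start of the recording; the function and variable names are illustrative only.

```python
def buffer_sub_periods(b_i, t_locs, period_t, sample_rate):
    """Extract the sub-period buffers b_n,i ending at each Tloc_n (a sketch).

    Each slice covers the period_t seconds ending at Tloc_n, i.e. it starts
    at Tloc_n - period_t, as described above.
    """
    period_samples = int(period_t * sample_rate)
    sub_buffers = {}
    for n, t_loc in enumerate(t_locs):
        end = int(t_loc * sample_rate)
        start = max(0, end - period_samples)      # clamp at the start of the recording
        sub_buffers[n] = b_i[start:end]
    return sub_buffers
```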
  • variable length framer 301 may perform a variable length framing operation in step 4071 on each of the sub-periods using the previously described methods.
  • the indicator selector 303 may calculate the time alignment indicators in step 4073 by determining the time index average, the time index variance and the time index ratio for all the sub-periods, applying the equations described above separately for each sub-period.
  • the base signal determiner 305 may, in addition to the determination of the base signal and the generation of the time alignment factors, carry out an additional precursor step and make a decision on whether to include a new time instant or period in the calculations, for example by evaluating expressions over the time alignment indicators which, when satisfied, cause a new time location to be added.
  • the base signal determiner 305 may make the above decision subject to a condition which limits the number of new time instants to be added to some predefined threshold, to prevent a potential infinite loop of iterations being carried out.
  • The decision of whether a new time location is to be added is shown in Figure 12 by step 701.
  • where a new time location is to be added, the base signal determiner 305 may add a new time period to G; in other words the process performs another check at a different time than before and the loop passes back to step 407.
  • This addition of a new time instant to G can be seen in Figure 12 as step 703.
  • the base signal determiner 305 may then perform the operation of determining the base signal based on the indicators as described previously. The determination of the base signal is shown in Figure 12 by step 4075.
  • base signal determiner 305 may also determine the time alignment factors for the remaining signals as described previously and shown in Figure 12 in step 4077.
  • the signal synchronizer 307 may then use this base signal determination and the time alignment factors for the remaining recorded signals to synchronize the recorded signals as described previously and shown in Figure 12 in step 4079.
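  For illustration, the Figure 12 loop could be structured as in the sketch below. The exact decision expressions are not reproduced in the text above, so an indicator-agreement callable and a cap on the number of extra time instants stand in for them; the function names and parameters are assumptions.

```python
def synchronize_with_refinement(buffers, analyse, indicators_agree, candidate_t_locs, max_extra=3):
    """Add analysis time instants until the alignment indicators agree (a sketch).

    `analyse` frames the buffers and returns per-pair indicators for the
    current set of time instants; `indicators_agree` stands in for the
    decision expressions of step 701.
    """
    g = [candidate_t_locs[0]]                      # start with a single time instant
    indicators = analyse(buffers, g)
    extra = 0
    while (not indicators_agree(indicators)
           and extra < max_extra                   # predefined threshold prevents an endless loop
           and len(g) < len(candidate_t_locs)):
        g.append(candidate_t_locs[len(g)])         # step 703: add a new time instant to G
        indicators = analyse(buffers, g)           # re-run framing and indicator selection (step 407)
        extra += 1
    return indicators                              # base signal choice and alignment follow (steps 4075-4079)
```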
  • the loop is disabled or not present and time alignment indicators are determined for at least two of the sub-sets of the total time periods using the equations described above in order to improve the synchronization between recorded signals as the indicators are determined for different time periods.
  • embodiments may also be applied to audio-video signals, where the audio signal components of the recorded data are processed to determine the base signal and the time alignment factors for the remaining signals, and the video signal components may then be synchronised using the above embodiments of the invention.
  • the video parts may be synchronised using the audio synchronisation information.
  • user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.
  • elements of a public land mobile network (PLMN) may also comprise apparatus as described above.
  • PLMN public land mobile network
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

Abstract

The invention concerns an apparatus comprising: an audio analyser configured to determine a spectral flatness value associated with a captured audio signal associated with an audio scene and to compare the spectral flatness value to a threshold value; a detectable audio signal generator configured to generate a detectable audio signal when the spectral flatness value is below the threshold value; and an audio output configured to output the detectable audio signal when the spectral flatness value is below the threshold value.
PCT/FI2016/050103 2015-03-03 2016-02-18 Appareil et procédé pour aider à la synchronisation de signaux audio ou vidéo provenant de plusieurs sources WO2016139392A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1503537.1A GB2536203A (en) 2015-03-03 2015-03-03 An apparatus
GB1503537.1 2015-03-03

Publications (1)

Publication Number Publication Date
WO2016139392A1 true WO2016139392A1 (fr) 2016-09-09

Family

ID=52876390

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2016/050103 WO2016139392A1 (fr) 2015-03-03 2016-02-18 Appareil et procédé pour aider à la synchronisation de signaux audio ou vidéo provenant de plusieurs sources

Country Status (2)

Country Link
GB (1) GB2536203A (fr)
WO (1) WO2016139392A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070136053A1 (en) * 2005-12-09 2007-06-14 Acoustic Technologies, Inc. Music detector for echo cancellation and noise reduction
US20120265859A1 (en) * 2011-04-14 2012-10-18 Audish Ltd. Synchronized Video System
US20130121662A1 (en) * 2007-05-31 2013-05-16 Adobe Systems Incorporated Acoustic Pattern Identification Using Spectral Characteristics to Synchronize Audio and/or Video
US20140081987A1 (en) * 2012-09-19 2014-03-20 Nokia Corporation Methods and apparatuses for time-stamping media for multi-user content rendering
US20140192200A1 (en) * 2013-01-08 2014-07-10 Hii Media Llc Media streams synchronization

Also Published As

Publication number Publication date
GB2536203A (en) 2016-09-14
GB201503537D0 (en) 2015-04-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16706392

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16706392

Country of ref document: EP

Kind code of ref document: A1