GB2488772A - Creating annotated recordings from two devices using audio synchronisation - Google Patents

Creating annotated recordings from two devices using audio synchronisation

Info

Publication number
GB2488772A
GB2488772A GB1103789.2A GB201103789A GB2488772A GB 2488772 A GB2488772 A GB 2488772A GB 201103789 A GB201103789 A GB 201103789A GB 2488772 A GB2488772 A GB 2488772A
Authority
GB
United Kingdom
Prior art keywords
recording
audio
annotations
time
annotated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1103789.2A
Other versions
GB201103789D0 (en)
Inventor
Roger Cecil Ferry Tucker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SONOCENT Ltd
Original Assignee
SONOCENT Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SONOCENT Ltd filed Critical SONOCENT Ltd
Priority to GB1103789.2A priority Critical patent/GB2488772A/en
Publication of GB201103789D0 publication Critical patent/GB201103789D0/en
Publication of GB2488772A publication Critical patent/GB2488772A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/102Programmed access in sequence to addressed parts of tracks of operating record carriers
    • G11B27/105Programmed access in sequence to addressed parts of tracks of operating record carriers of operating discs

Landscapes

  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

A method of providing an annotated recording comprising receiving from a first audio recording device 103 a first audio recording of an event or presentation, receiving from a note taking device 104 a second audio recording of the same event or presentation with time-sensitive user annotations, wherein the second recording is used to provide a synchronization reference such that an annotated audio stream 106 can be generated from the first recording and from the annotations associated with the second recording. The annotations and first recording can be synchronized without the need for time-stamping or clock synchronization between devices. The second recording may be of poor quality containing local noise and the first recording of good quality. Correlation using pattern matching of the first and second audio recordings can be used to calculate a time offset between the two recordings. The first audio recording may mostly replace the second audio recording.

Description

Creating Annotated Recordings from Two Devices using Audio Synchronisation
Description
Background to the Invention
When attending a lecture, seminar, meeting or interview (an event), a person (the user) may want to record it and also make time-sensitive annotations synchronised with the recording using a computing device (the note-taking device). Annotations might be text, writing, scribbles, sketches, highlights, marks, section breaks, images, slides - any time-sensitive media that enhances the recording. There are various hardware and software products available that allow this (e.g. Livescribe, Audio Notetaker).
The problem is that mobile computing devices do not usually have good sound capture, especially at a distance from the speaker, so if the note-taking device is also used for recording, the quality of the recording is poor. Also, any local sounds - tapping on the device or talking, or (if the device is a phone) answering a call or text message - add further local noise to the recording.
A way round the problem of poor audio quality is to use a separate recording device (the recording device) whose microphone can be placed near the speaker(s), or near a loudspeaker if a public address system is in use. Afterwards, the good recording from the recording device can be combined with the time-sensitive annotations from the note-taking device, along with other available material (such as the presentation slides), for a complete set of audio-visual notes.
Because of the time-sensitive nature of the annotations, it is important that the annotations and recording be correctly synchronised. To do this, the relative start time of each data stream needs to be known to the process doing the combining. The current way of ensuring this is to time-stamp each data stream (e.g. US patent application US 2008/0276159 A1). However, for this to work, the clocks of the recording device and the note-taking device must either be exactly synchronised, or have a known relative offset.
Time-stamping is unsatisfactory for these reasons:
1. The two devices may not both be accessible to the user, so the clock offset cannot be known. For instance, an attendee may make annotations on their personal mobile device but obtain the audio recording from the host organisation.
2. Even if both capture devices are accessible to the user, clocks drift apart quite quickly, so they would need to be frequently synchronised, and in practice this is likely to be overlooked. Also, it is not uncommon for users to skip re-setting the time and date after batteries have run out, giving totally spurious times and dates to recordings.
It would therefore be desirable to have a system which doesn't require any kind of time-stamping.
Statement of Invention
The present invention proposes a method of providing an annotated recording, the method comprising: receiving from an audio recording device a first recording of an event; receiving from a computing device a second recording of the event or presentation, wherein the second recording is associated with time-sensitive user annotations; and using the second recording to provide a synchronisation reference such that an annotated audio stream can be generated from the first recording and the annotations associated with the second recording.
This invention provides a method for synchronising the annotations and recording without the need for time-stamping or clock synchronisation. It makes use of the audio-recording capability of the note-taking device. Although this audio is most likely of poor quality, contains local noise, and will have different start and end times from the good-quality recording made by the recording device, it will usually be sufficiently similar to the good-quality recording to allow the time offset of the two audio streams to be calculated using carefully applied pattern-matching techniques.
The method allows a good-quality annotated recording to be produced with no special setup and with standard, ubiquitous devices like laptops, netbooks, mobile phones and digital recorders.
Because no time-stamp or clock synchronisation is required, the devices can be owned by different people, and the time and date do not need to be entered correctly on the devices, or maintained accurately.
The method may mostly replace the second recording with the first recording.
The method may include both the first and second recordings in the annotated audio stream through the use of two-channels of audio.
The method may allow the user to select which parts of each recording to include in the final annotated audio stream.
The method may use an automatic process to select which of the two streams to include in the final annotated stream at any given point in the event.
The method may allow the second recording not to be contiguous but consist of two or more recordings with gaps in between.
The method may allow the second recording to be parameterised, so it cannot be played as audio but can still be used as a synchronisation reference.
The method may utilise a noise model in the synchronisation process to allow for local noise in the second recording.
The method may use time-stamps to detect that two recordings probably relate to the same event, without the user having to explicitly select both the recordings.
The method may receive a first recording which also has time-sensitive annotations, which it merges with the time-sensitive annotations associated with the second recording.
The method may process time-sensitive annotations which comprise continuous media.
The method may process time-sensitive annotations which comprise discrete media.
The method may receive a second recording with associated time-sensitive user annotations which has been generated by a computing device having audio recording capabilities.
The method may receive a second recording with associated time-sensitive user annotations which has been generated by combining the outputs of two devices using the present invention.
A computer program may execute the method.
A web service may execute the method.
Brief Description of Drawings
Figure 1 is a schematic showing the scenario for creating an annotated recording using two devices, where the present invention relates to the Combine process 105.
Figure 2 gives examples on a time-axis of the two streams 201, 206 input to the Combine process and the stream 207 output from the process. It also shows how the presentation slides 208 can be added to further enhance the final annotated recording 209.
Figure 3 shows two other ways that the two input streams, now 301 and 302, might be combined - with either a mix of the audio 303 or both audio streams 304 ending up in the output.
Figure 4 shows how the recording, now 401, might include video, which then becomes part of the output along with either the first 403 or the second 404 audio stream.
Figure 5 shows that if the recording device 103 is the presenter's computer, the presenter's slides can be captured along with their audio 501 and merged with the user's annotations 502 for the final output 503.
Figure 6 shows how video, of the screen or the presenter or both, can also be captured 601 and included in the final output 603.
Figure 7 shows how the presenter can capture their presentation on their computer 102 and combine it with a separate audio recording 101 to produce a high-quality capture 103.
Figure 8 shows that the note-taking device need not capture all the audio 802; as long as there is enough for synchronisation purposes, the output 803 will be unaffected.
Figure 9 gives an overview of how the time offset between the two audio streams can be calculated.
Figure 10 gives details of how a valid correlation lag can be found.
Detailed Description
Referring to Figure 1, when attending an event, a user 102 makes time-sensitive annotations using a note-taking device 104, and in addition the event is recorded using a recording device 103. The user may be in charge of the recording device, which they place near the presenter 101, or the organisers or another attendee may be in charge of it. An advantage of the current invention is that it doesn't matter how the event is recorded.
The annotations from the note-taking device are merged with the recording from the recording device using a synchronise-and-replace process 105, which is the method of the current invention.
The two are combined to form the final annotated recording 106.
Referring to Figure 2, the recording from the recording device 201 is shown on a time-axis as a series of utterances (shaded) and pauses (solid). In the same way some example annotations from the note-taking device are also shown on a time-axis 206; highlights of important spoken phrases 203, textual notes associated with a particular point in the event 204, and markers indicating every time the presentation slide changes 205.
According to the present invention, the note-taking device also makes an audio recording 202 which is included with the annotations. When the recording 201 is combined with the annotations 206 in the synchronisation process 105, the audio 202 is used to provide a synchronisation reference for the recording 201, which is then combined with the annotations to provide a final annotated recording 207 containing the annotations 203, 204, 205 and the recording 201.
In the example annotations 206, because the user has marked the slide changes, the presentation slides 208, once available, can be inserted into the annotated recording at the correct places to provide a final annotated recording complete with synchronised slides 209.
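As a purely illustrative sketch (in Python, which the patent does not specify), the slide-change markers 205 can be paired with the presentation slides 208 once they become available; the function name and example values below are hypothetical:

def attach_slides(slide_change_times, slide_files):
    """Pair the n-th slide-change marker with the n-th slide image.

    slide_change_times: times (seconds) at which the user marked a slide change.
    slide_files: slide image paths, in presentation order.
    Returns a list of (time, slide) annotations for the annotated recording.
    """
    # zip() pairs up only as many entries as both lists provide.
    pairs = zip(slide_change_times, slide_files)
    return [{"time": t, "type": "slide", "media": f} for t, f in pairs]

# Example: three slide changes marked at 12 s, 95 s and 210 s.
slide_annotations = attach_slides([12.0, 95.0, 210.0],
                                  ["slide1.png", "slide2.png", "slide3.png"])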
The synchronisation of the recording 201 with the audio recorded by the note-taking device 202 uses carefully-applied pattern matching techniques.
Pattern-Matching Techniques
The pattern matching should take account of the possible differences between the two audio streams 201 and 202. These differences come from at least the following factors:
1. Different input devices, one or both of which may have only basic-quality audio input whose intended use is only voice communication with a close-talking speaker. Different frequency responses, different amounts and types of circuit noise, and different amounts of non-linear distortion are all likely.
2. A different acoustic path to the microphone. The audio 202 from the note-taking device 104 is likely to have reverberation and frequency distortion.
3. Local noise in the audio 202 from talking, typing, tapping and mechanical noise from handling the device.
Preferably the pattern matching technique should include a noise model to explicitly take account of difference number 3.
Pattern matching can either use basic correlation, or employ more sophisticated techniques such as stochastic modelling (e.g. Hidden Markov Models) or neural networks to better incorporate a noise model. Those skilled in the art will know how to apply such techniques to this problem.
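As one possible illustration of the basic-correlation option (not the only approach the method allows), each recording can be reduced to a per-frame energy envelope at the same frame rate and the two envelopes cross-correlated; the lag of the correlation peak then gives the offset in frames. This Python sketch assumes numpy arrays of frame energies and omits refinements such as the noise model:

import numpy as np

def estimate_offset_frames(energy_good, energy_note):
    """Estimate the offset between two recordings from their energy envelopes.

    Returns the lag k (in frames) such that frame n of the note-taking
    recording corresponds roughly to frame n + k of the good-quality recording.
    """
    a = np.asarray(energy_good, dtype=float)
    b = np.asarray(energy_note, dtype=float)

    # Normalise each envelope so level differences between devices
    # do not dominate the correlation.
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)

    # Full cross-correlation over all possible lags; take the strongest one.
    corr = np.correlate(a, b, mode="full")
    return int(np.argmax(corr)) - (len(b) - 1)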
Efficiencies
Figure 8 shows that instead of recording the whole event on the note-taking device, just a part of the audio can be recorded, to avoid using up disk space and battery unnecessarily. Preferably at least 2 minutes of audio is recorded at least 3 times during the event, preferably around the time the user makes an annotation, as shown in 803.
Instead of storing the raw audio on the note-taking device, a parameterisation of the audio can be stored to further save on disk space.
Neither efficiency measure need affect the final annotated recording 803 in any way.
Variations on basic usage
Two important variations are shown in Figures 4 and 5, which show how the recording device may also have annotations, which are then merged with the annotations from the note-taking device.
The figures also show how annotations from either device may include continuous media streams such as video as well as discrete media such as text, images and event times.
In Figure 4, the recording device captures both audio and video. The audio part of the recording 401 is synchronised with the note-taking device's audio, then both audio and video are included in the final annotated recording 403. Alternatively, the roles of the two devices can be reversed so that the audio 401 is discarded after synchronisation and only the video is included in the final annotated recording 404.
In some circumstances, the annotated recording 402 could itself be the result 106 of applying the present invention to the output of two devices. For instance, if a good-quality recorder was used for audio and a video recorder with audio capability was used for video, the audio would be combined with the user annotations first, as in Figure 2, producing an annotated recording 207 which then becomes the input annotated recording 402 of Figure 4, producing the combined audio-video annotated recording 404.
Figure 5 shows another scenario where the recording device is the presenter's computer, and the recording 501 has the presenter's slides time-synchronised to it, and these are merged with the user annotations 502 to produce a final annotated recording complete with synchronised slides 503. As in the previous paragraph, the annotated recording 502 could itself be the result of combining another audio recording and the user's annotations using the present invention.
Figure 6 combines the variations of Figures 4 and 5 to show how, in addition to the presenter's slides, or perhaps as an alternative to them, the presenter's computer screen may be captured as a video along with the audio recording 601 and the resulting video included in the final annotated recording 603. In addition to the screen capture and slide capture, a camera attached to the PC may add an additional video stream to the recording. Because all these media streams are recorded on the same computer, they can all be synchronised to the user's annotations via the audio synchronisation, as they all use the same clock.
The invention can be applied in a different configuration to improve the quality of the presenter's own capture of their presentation. In this case the presenter is also the user, and the annotations are the slides and slide transition times captured, probably automatically, on the presenter's own computer. This is demonstrated in Figure 7. The presenter has used their computer as the note-taking device to capture the slides and slide timings 702 (they could also add screen capture and video as in Figure 6), and an independent good-quality audio recorder (perhaps with a microphone clipped to their lapel) as the recording device. The good-quality recording 701 then replaces the audio from their computer to create a final high-quality capture of their presentation.
From these examples it can be seen that the invention can be applied in many different ways. In all cases an extra audio stream is recorded whose primary purpose is synchronising annotations captured on that device with a recording made on a different device.
Making use of the extra audio stream
Although the primary purpose of the extra audio stream is to provide synchronisation, it does not have to be discarded in its entirety.
Referring to Figure 8, which once more shows the recording 801 from the recording device and the annotations 803 from the note-taking device, parts of the audio 802 may be included in the final annotated recording where beneficial, as shown in 804. For example, if there are questions from the audience, the audio 802 from the note-taking device may capture a question more clearly than the recording 801. The places where the audio 802 is used might be user-selectable or automatic (based, for example, on low energy in the recording 801). Another use for the audio 802 arises if the recording device 103 is wireless: the recording 801 may then contain drop-outs, in which case the audio 802 can be used instead.
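The automatic option could, as one illustration, compare short segments of the aligned recordings and fall back to the note-taking device's audio wherever the main recording is nearly silent (a drop-out or an unamplified audience question). This Python sketch assumes both recordings are already time-aligned mono numpy arrays at the same sample rate; the segment length and threshold are illustrative values, not taken from the patent:

import numpy as np

def choose_per_segment(good, local, sample_rate, seg_seconds=1.0, dropout_db=-45.0):
    """Use the good-quality recording by default, but substitute the
    note-taking device's audio wherever the good recording is nearly silent."""
    seg = int(seg_seconds * sample_rate)
    out = np.copy(good)
    for start in range(0, len(good), seg):
        chunk = good[start:start + seg]
        rms = np.sqrt(np.mean(chunk ** 2) + 1e-12)
        level_db = 20 * np.log10(rms + 1e-12)
        if level_db < dropout_db:
            # Main recording effectively silent here: fall back to local audio,
            # copying only as many samples as the local recording provides.
            n = min(len(chunk), max(0, len(local) - start))
            out[start:start + n] = local[start:start + n]
    return out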
In some circumstances it may be beneficial to keep the note-taking device's audio 802 in its entirety through the use of two-channel audio 805.
There may be more than one recording device 103, each producing a recording 801. Each recording 801 in turn can be pattern-matched with the note-taking device's audio 802, and once synchronised, the recordings can be combined either by selecting one at any given moment (manually or automatically) or simply by producing a multi-track recording.
Automatic detection of two recordings
Although the method is designed to work without time-stamped media, in many situations the media will be time-stamped (e.g. through the "Date Modified" property of the files) and the clocks synchronised to within at least a few minutes of each other. In this case, the user can request the system to scan all uploaded media and find the recording which most likely covers the same event. This saves the user having to explicitly remember which files go together.
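As a hedged illustration of such a scan, each file's recording interval can be approximated from its "Date Modified" time minus its duration, and the candidate whose interval overlaps the notes file the most suggested to the user. Obtaining audio durations is assumed to be handled elsewhere; the function and its parameters are hypothetical:

import os

def find_matching_recording(notes_path, notes_duration, candidates):
    """Guess which uploaded recording covers the same event as the notes file,
    using only file modification times and durations.

    `candidates` is a list of (path, duration_seconds).  Each file's recording
    interval is approximated as [mtime - duration, mtime], since the file is
    normally written when recording stops.  Returns the best-overlapping path,
    or None if nothing overlaps.
    """
    notes_end = os.path.getmtime(notes_path)
    notes_start = notes_end - notes_duration

    best, best_overlap = None, 0.0
    for path, duration in candidates:
        end = os.path.getmtime(path)
        start = end - duration
        overlap = min(notes_end, end) - max(notes_start, start)
        if overlap > best_overlap:
            best, best_overlap = path, overlap
    return best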
Example of the Invention
The embodiment described here is a software program which runs on a computer or as a web service. The note-taking device is a mobile smart phone with a touch screen, such as the Apple iPhone, a BlackBerry, or one of the many models which run the Android operating system. The smart phone runs an application which records and allows the user to simultaneously take time-synchronised notes, and stores these in a file. The recording device is a digital recorder such as the Olympus DS-40, which can record many hours of audio, stored in a file using a compressed audio format such as Windows Media Audio (wma). However, the recording device could equally well be another mobile phone.
The user first of all requests the program to open the file from the note-taking device. If the audio has been stored in a playable format, the user can then listen to the audio to see if the quality is sufficient, and if it is not, they can then select a "replace audio" command. Alternatively if the audio has been stored in a parameterised format and can't be played, the user would instead select an "add audio" command. In either case, the program requests an audio file, and the user specifies the audio file from the recording device.
The program then analyses the two audio files and calculates the time offset (negative or positive) between the recording device recording and the note-taking device recording. This is described below and in Figures 9 and 10. Once this offset is known, the audio in the note-taking device file is removed, the recording device recording is added in its place at the calculated offset, and the new combined audio and notes are presented to the user. The user can then keep the new audio and save the file, or undo the change if for any reason they are unhappy with the replacement.
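The replacement step might, for example, be implemented by trimming or padding the start of the recording device's audio so that it lines up with the annotation timeline. The sign convention and helper below are assumptions made for illustration, not the patent's exact procedure:

import numpy as np

def align_replacement_audio(good, sample_rate, offset_seconds):
    """Shift the good-quality recording onto the note-taking device's timeline.

    `offset_seconds` is the calculated start-time difference: positive means the
    good recording started that many seconds before the note-taking recording,
    so its beginning is trimmed; negative means it started later, so silence is
    padded at the front.  The annotations themselves are left untouched.
    """
    shift = int(round(offset_seconds * sample_rate))
    if shift >= 0:
        return good[shift:]            # drop audio recorded before the notes began
    return np.concatenate([np.zeros(-shift, dtype=good.dtype), good])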
This embodiment can easily be extended to any of the previously described variations and efficiencies by someone skilled in the art.
Example of Synchronisation Process
In this embodiment we use correlation to calculate the offset.
The audio is parameterised before performing the correlation process, into energy values computed over Hamming-windowed frames 16 ms in length. In this embodiment these energy values are scalar (one per frame), but increased accuracy could be obtained using a vector of energy values, each of which covers a different frequency band.
Before calculating the energy values, a high-pass filter with a cut-off around 300 Hz is applied, so that low-frequency noise or hum in either recording system does not affect the correlation.
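A minimal sketch of this parameterisation follows; the 16 ms Hamming-windowed frames and 300 Hz cut-off come from the description, while the filter order and the use of non-overlapping frames are assumptions:

import numpy as np
from scipy.signal import butter, sosfilt

def frame_energies(audio, sample_rate, frame_ms=16, highpass_hz=300):
    """Parameterise a recording as one energy value per Hamming-windowed frame,
    after high-pass filtering to remove low-frequency noise or hum."""
    # 4th-order Butterworth high-pass applied as second-order sections.
    sos = butter(4, highpass_hz, btype="highpass", fs=sample_rate, output="sos")
    filtered = sosfilt(sos, audio)

    frame_len = int(sample_rate * frame_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = len(filtered) // frame_len
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = filtered[i * frame_len:(i + 1) * frame_len] * window
        energies[i] = np.sum(frame ** 2)
    return energies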
We implement the local-noise model via a pre-processing step which uses an energy detector to identify local-noise frames. The threshold is set so that 15% of the frames in a 2-minute window are identified as noise. This simple approach is effective in most cases, taking out the worst of the local noise if there is a lot, and not removing too much of the desired signal if there is none. Clearly the detector can be improved by someone skilled in the art by incorporating expected time and frequency characteristics of the local noise.
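One way this detector could look in code, assuming that the most energetic frames within each 2-minute window are the ones treated as local noise (taps and nearby talking tend to be loudest at the note-taking device's own microphone):

import numpy as np

def flag_local_noise(energies, frame_ms=16, window_minutes=2, noise_fraction=0.15):
    """Mark the frames most likely to contain local noise.

    Within each 2-minute window the threshold is set so that the loudest 15%
    of frames are flagged, matching the description above.
    """
    frames_per_window = int(window_minutes * 60 * 1000 / frame_ms)
    is_noise = np.zeros(len(energies), dtype=bool)
    for start in range(0, len(energies), frames_per_window):
        window = energies[start:start + frames_per_window]
        threshold = np.percentile(window, 100 * (1 - noise_fraction))
        is_noise[start:start + len(window)] = window > threshold
    return is_noise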
The time offset between the two streams is computed using the straightforward algorithms described in Figures 9 and 10.
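Figures 9 and 10 are not reproduced in this text. Purely as an illustration of finding and validating a correlation lag (not the procedure of Figure 10), the noise-flagged frames can be suppressed before correlating, and the best lag accepted only if its peak stands out clearly from the rest of the correlation function:

import numpy as np

def find_valid_lag(energy_good, energy_note, noise_mask, min_peak_ratio=3.0):
    """Cross-correlate the two energy sequences, ignoring flagged noise frames,
    and accept the best lag only if its peak clearly dominates."""
    note = np.where(noise_mask, 0.0, energy_note)   # suppress local-noise frames

    a = (energy_good - energy_good.mean()) / (energy_good.std() + 1e-9)
    b = (note - note.mean()) / (note.std() + 1e-9)
    corr = np.correlate(a, b, mode="full")

    peak_index = int(np.argmax(corr))
    peak = corr[peak_index]
    background = np.mean(np.abs(corr)) + 1e-9
    if peak / background < min_peak_ratio:
        return None                                  # no reliable alignment found
    return peak_index - (len(b) - 1)                 # lag in frames (note-frame n ~ good-frame n + lag)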
GB1103789.2A 2011-03-05 2011-03-05 Creating annotated recordings from two devices using audio synchronisation Withdrawn GB2488772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1103789.2A GB2488772A (en) 2011-03-05 2011-03-05 Creating annotated recordings from two devices using audio synchronisation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1103789.2A GB2488772A (en) 2011-03-05 2011-03-05 Creating annotated recordings from two devices using audio synchronisation

Publications (2)

Publication Number Publication Date
GB201103789D0 GB201103789D0 (en) 2011-04-20
GB2488772A true GB2488772A (en) 2012-09-12

Family

ID=43923267

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1103789.2A Withdrawn GB2488772A (en) 2011-03-05 2011-03-05 Creating annotated recordings from two devices using audio synchronisation

Country Status (1)

Country Link
GB (1) GB2488772A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5729741A (en) * 1995-04-10 1998-03-17 Golden Enterprises, Inc. System for storage and retrieval of diverse types of information obtained from different media sources which includes video, audio, and text transcriptions
EP1729173A2 (en) * 2005-05-27 2006-12-06 Telegraf ApS System for generating synchronized add-on information
US20070260457A1 (en) * 1993-03-24 2007-11-08 Engate Incorporated Audio And Video Transcription System For Manipulating Real-Time Testimony
WO2007141204A1 (en) * 2006-06-02 2007-12-13 Anoto Ab System and method for recalling media
JP2008145977A (en) * 2006-12-13 2008-06-26 Yamaha Corp Content reproducing device
US7617445B1 (en) * 2001-03-16 2009-11-10 Ftr Pty. Ltd. Log note system for digitally recorded audio

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070260457A1 (en) * 1993-03-24 2007-11-08 Engate Incorporated Audio And Video Transcription System For Manipulating Real-Time Testimony
US5729741A (en) * 1995-04-10 1998-03-17 Golden Enterprises, Inc. System for storage and retrieval of diverse types of information obtained from different media sources which includes video, audio, and text transcriptions
US7617445B1 (en) * 2001-03-16 2009-11-10 Ftr Pty. Ltd. Log note system for digitally recorded audio
EP1729173A2 (en) * 2005-05-27 2006-12-06 Telegraf ApS System for generating synchronized add-on information
WO2007141204A1 (en) * 2006-06-02 2007-12-13 Anoto Ab System and method for recalling media
JP2008145977A (en) * 2006-12-13 2008-06-26 Yamaha Corp Content reproducing device

Also Published As

Publication number Publication date
GB201103789D0 (en) 2011-04-20

Similar Documents

Publication Publication Date Title
CN112400325B (en) Data driven audio enhancement
EP3659344B1 (en) Calibration system for audience response capture and analysis of media content
Mostefa et al. The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms
US8560309B2 (en) Remote conferencing center
US8205148B1 (en) Methods and apparatus for temporal alignment of media
US20220343918A1 (en) Systems and methods for live broadcasting of context-aware transcription and/or other elements related to conversations and/or speeches
US9438993B2 (en) Methods and devices to generate multiple-channel audio recordings
US9202469B1 (en) Capturing noteworthy portions of audio recordings
US11869508B2 (en) Systems and methods for capturing, processing, and rendering one or more context-aware moment-associating elements
WO2016119370A1 (en) Method and device for implementing sound recording, and mobile terminal
KR20080097361A (en) Creating annotated recordings and transcripts of presentations using a mobile device
EP3646323B1 (en) Hybrid audio signal synchronization based on cross-correlation and attack analysis
US20150279424A1 (en) Sound quality of the audio portion of audio/video files recorded during a live event
WO2016197708A1 (en) Recording method and terminal
CN108320761B (en) Audio recording method, intelligent recording device and computer readable storage medium
US10381046B2 (en) Apparatus and methods for recording audio and video
US9612519B2 (en) Method and system for organising image recordings and sound recordings
Koenig et al. Forensic authentication of digital audio and video files
GB2488772A (en) Creating annotated recordings from two devices using audio synchronisation
JP2013131871A (en) Editing device, remote controller, television receiver, specific audio signal, editing system, editing method, program, and recording medium
US20240087574A1 (en) Systems and methods for capturing, processing, and rendering one or more context-aware moment-associating elements
US20210360316A1 (en) Systems and methods for providing survey data
Roux Still recording African music in the field
Moreau et al. Data Collection for the CHIL CLEAR 2007 Evaluation Campaign.
JP5258070B2 (en) Information processing apparatus, meeting information generation method, and recording medium

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)