METHOD AND APPARATUS FOR MEASURING AUDIO/VIDEO SYNC DELAY
In the field of providing audio-visual content it is common to provide that content as digital data. A common requirement is to pass the digital data through one or more encoding processes, for example prior to the broadcast transmission of the digital audio-visual data, such as the broadcast of a television programme. The coding processes habitually involve data compression and the use of digital audio filters to process the audio signal. The encoding process may also typically involve multiplexing a plurality of separate data streams together. Since the audio data will commonly be processed differently from the video data, and each of the different stages in the encoding process can potentially introduce a time delay to the digital data signal, the overall encoding process can potentially introduce a loss of synchronisation between the audio and video data. This will be most noticeable as a loss of lip-sync in video footage of speaking characters. The human brain can perceive even quite small time delays between the video and audio data, with the circumstances in which the audio signal leads the video signal being the most noticeable. For this reason, the applicable encoding and transmission standards stipulate maximum time delays between the audio and video data. For example, according to some standards the audio signal must not lead the corresponding video signal by more than 40 ms.
It is therefore advantageous to be able to determine in advance the precise amount of time delay any given encoding system is likely to introduce between the audio and video signals of an audio-visual programme.
According to a first aspect of the present invention there is provided a method of determining the delay between an audio and visual signal, the method comprising:
providing a video signal having a plurality of sequential timestamps visually encoded thereon and providing an audio signal having a corresponding plurality of timestamps audibly encoded thereon, the audio and video signals being synchronised to one another;
encoding the audio and video signals to generate digitally encoded audio and video data streams;
analysing the encoded video and audio data streams to extract each of the audibly and visually encoded timestamps; and
measuring any delay between the time of receipt of corresponding video and audio timestamps.
In preferred embodiments the audio and video timestamps are encoded as a binary code, for example Gray code.
Each visually encoded timestamp preferably comprises a plurality of display segments, the colour or shade of each segment being representative of a binary state. Preferably the display segments comprise a portion of a macro block.
Each audibly encoded timestamp preferably comprises an audio tone having a plurality of predetermined frequency components, the presence of a frequency component being representative of a binary state.
Preferably each encoded time stamp comprises a frame count.
According to a second aspect of the present invention there is provided apparatus for determining the delay between a digitally encoded audio and video signal, the video signal having a plurality of sequential timestamps visually encoded thereon and the audio signal having a corresponding plurality of timestamps audibly encoded thereon, the audio and video signals being synchronised to one another, the apparatus comprising:
a video timestamp detector arranged to detect each of the timestamps encoded in the encoded video signal, decode the timestamp and provide a first time signal representative of the actual time of receipt of the video timestamp;
an audio timestamp detector arranged to detect each of the timestamps encoded in the encoded audio signal, decode the timestamp and provide a second time signal representative of the actual time of receipt of the audio timestamp; and
a timestamp comparator arranged to receive the first and second time signals and measure any delay between their time of receipt.
Embodiments of the present invention will now be described below, by way of illustrative example only, with reference to the accompanying figures, of which:
Figure 1 schematically illustrates the timing and duration of audio and video events included in a possible test signal;
Figure 2 schematically illustrates a time delay analysis system for determining the time delays between the audio and video signals shown in Figure 1;
Figure 3 schematically illustrates the relative timings of an audio and video signal as shown in Figure 1 in which there is a delay between the audio and video signals;
Figure 4 schematically illustrates a method of visually encoding a time stamp according to an embodiment of the present invention;
Figure 5 schematically illustrates a method of audibly encoding a time stamp according to an embodiment of the present invention; and
Figure 6 schematically illustrates a time delay analysis system according to an embodiment of the present invention for determining the time delays between audio and video signals having time stamps encoded therein of the kind illustrated in Figures 4 & 5.
According to a time analysis scheme detailed in the applicant's co-pending patent application of the same title, any time delay introduced between audio and video data by an encoding process performed on the originally available audio and video data is determined utilising a predetermined video sequence having known timing properties. The video/audio data sequence is provided in either an uncompressed data format or in a standard encoded data format, such as for example MPEG-2 video or audio. The predetermined audio/video sequence comprises a series of visible "flashes" having a predetermined duration and time interval between each flash. The sequence also comprises a corresponding number of audible tones whose duration and time interval between tones exactly corresponds to the occurrences of the visible flashes. An example of an appropriate timing diagram for the visible and audible signals is schematically illustrated in Figure 1.
In Figure 1 the upper signal trace 2 represents the binary levels for the visible signal, with the signal being either totally black or totally white in visible appearance. The lower signal trace 4 represents the audible signal, with the upper signal level representing the production of an audible tone and the lower signal level representing the absence of a tone. As can be seen from Figure 1, after an initial time period of 1 unit, for example 1 second, during which neither a visible flash nor an audible tone is produced, a visible flash and audible tone of 1 unit duration is subsequently produced. This is followed by a further time period during which no visible flash or audible tone is produced, this second time period having a duration of 2 units. This is then followed by the production of a visible flash and audible tone having a duration of 2 units, followed by a period of no visible flash or audible tone of duration 3 units, and so on in the sequence illustrated in Figure 1. In total the sequence comprises five periods during which a visible flash and audible tone are produced, each period lasting one time unit longer than the preceding period, with correspondingly increasing time periods in between during which no visible flash or audible tone is produced. In the example shown, therefore, the entire sequence lasts for a total of 30 time units, which will typically be 30 seconds. The entire sequence preferably continually repeats.
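The timing of the sequence just described can be sketched as follows. This is an illustrative model only, not taken from the patent text: an off-period of n units is followed by an on-period of n units, for n from 1 to 5, giving the 30-unit repeating cycle of Figure 1.

```python
def test_schedule(max_n=5):
    """Return (state, start_time, duration) tuples for one cycle of the
    Figure 1 test sequence. Times are in abstract units, typically seconds."""
    events = []
    t = 0
    for n in range(1, max_n + 1):
        events.append(("off", t, n))   # no flash, no tone, for n units
        t += n
        events.append(("on", t, n))    # flash + tone together, for n units
        t += n
    return events

schedule = test_schedule()
total = sum(duration for _, _, duration in schedule)
print(total)  # 30 time units per cycle, as stated in the text
```

The pairing of identical durations on both the "on" and "off" sides is what makes the total come to 2 x (1 + 2 + 3 + 4 + 5) = 30 units.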
In a preferred arrangement of this analysis scheme, the visible flash is produced in at least one macroblock, or an integer multiple thereof, located at the top left-hand corner of the display screen. Preferably a 4x4 array of blocks, i.e. 32x32 pixels, is used to encode the visible flash. This location is carefully chosen since, due to the scanning method of generating a displayed image, as will be appreciated by those skilled in the art, the digital data representing this part of the display screen will occur very early in the relevant data stream and will consequently practically always be correctly encoded. The selection of the visible flash as a 32x32 pixel area will also tend to ensure the correct encoding of this video data. Similarly, the use of only black and white shades for the visible flash will maximise the likelihood of the video data being correctly encoded, since these are "basic" digital values unlikely to be corrupted by the encoding process. In a similar fashion, the audio tone is provided as a tone with only a single frequency component, for example at 10 kHz, or some other single frequency. Since only a single frequency component is utilised for the audio tone, it should be faithfully encoded by any audio encoder included within the encoding system under test.
Further visual data may be provided to the user, for example a larger visual representation of the visible flash, such as a series of rotating circular segments, each segment being representative of a single time unit such that a complete sequence requires a full "revolution" through the multiple segments. It will of course be appreciated that such visual enhancements are merely for the convenience of the human operator and are not a necessary part of the present invention.
According to this analysis scheme, the predetermined audio-visual sequence is passed through the encoding system under test and the encoded digital data stream is subsequently analysed. The analysis process comprises detecting one or both of the beginning and end of one of the visible flashes by detecting the point in time within the encoded data stream at which the 32x32 pixel block changes from "black" to "white" or vice versa. The time at which this occurs is accurate to within the duration of 1 frame of visual data, since the display is only refreshed every frame; a typical frame rate is 25 frames per second. Concurrently, the encoded audio signal is analysed to determine one or both of the beginning and end of the audio tones. A preferred method of detecting the beginning or end of the audio tone is to detect the sharply rising or falling amplitude of the tone as each transition from "tone" to "no tone" or vice versa occurs. The analysis process can thus determine any time delay between the video and audio "events" (an event being a rising or falling audio or video signal edge). In preferred embodiments any determined delay that falls outside a predetermined set of parameters, such as those set by one or more transmission standards, causes an alert to be automatically generated.
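The edge-based measurement described above can be illustrated with a minimal sketch. The function names and the sampled 0/1 representation are assumptions of mine, not the patent's: each track is reduced to a binary level per sample, edges mark the "events", and corresponding edges are differenced.

```python
def edges(samples):
    """Return (index, direction) for each transition in a 0/1 sample list.
    direction is +1 for a rising edge, -1 for a falling edge."""
    return [(i, samples[i] - samples[i - 1])
            for i in range(1, len(samples))
            if samples[i] != samples[i - 1]]

def edge_delays(video, audio):
    """Pair corresponding video and audio edges in order of occurrence and
    return the audio-minus-video offset for each pair (positive = audio lags)."""
    return [a[0] - v[0] for v, a in zip(edges(video), edges(audio))]

video = [0, 1, 1, 0, 0, 0]
audio = [0, 0, 1, 1, 0, 0]   # same flash/tone, but audio lags by one sample
print(edge_delays(video, audio))  # [1, 1]
```

Note the limitation the text goes on to discuss: a delay estimate is only produced when an edge occurs, not continuously.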
A system in accordance with the applicant's co-pending analysis scheme for determining any loss of video/audio synchronisation in an encoded data stream is schematically represented in Figure 2. A predetermined video stream 10 as described above, in an unencoded state, is stored on a data storage medium such as a hard disk 12 and is provided as an input to the encoding system 14 to be tested. The encoding system will generally output an encoded data stream that can be decomposed into separate video 16 and audio 18 streams.
Each of the video and audio streams is provided as an input to an analysis engine 20 and input to separate video and audio event detection units 22, 24. It will be appreciated that although the video and audio streams are shown in Figure 2 as discrete inputs to the analysis engine, the decomposition of the encoded data stream provided by the encoding system 14 may equally be accomplished within the analysis engine, for example by means of a wrapper demux. Each event detection unit is arranged to detect the relevant video or audio 'events' of the encoded test data stream, these being the beginnings or ends, or both, of the visible 'flashes' and audio tones as discussed above in relation to Figure 1, and to provide an output signal indicative of when each event occurs. The output signals from each of the audio and video event detection units 22, 24 are provided to a time comparison unit 26 that is arranged to measure any time interval present between the output signals from the event detection units and thus any time interval, be it lag or lead, between the occurrences of the audio and video 'events'. This time interval data is provided from the time comparison unit 26 to an output interface unit 28 that is arranged to provide the time interval data to an appropriate user interface. Preferably the output interface unit 28 is also arranged to compare any time delay between the audio and video signals with defined maximum permitted delays that may be stored in a further data storage area 30 or may be stored internally to the output interface unit. If a detected time interval exceeds a predefined value then the output interface unit may be arranged to provide an alarm signal.
As previously mentioned with reference to Figure 1, the sequence of visible flashes and audible tones comprises 'events' with increasing time intervals between each 'event'. This ensures that should the time delay between the video and audio signals be great enough for one of the video events to coincide with an audio event, this 'false' synchronisation, which would not cause the analysis engine to generate a report or alarm, will not be maintained at the next occurrence of a video and audio event. This is schematically illustrated in Figure 3, in which the upper trace 32 represents the video event signal and the lower trace 34 represents the audio event signal. As can be seen from Figure 3, the audio event signal has been delayed relative to the video signal by the encoding process by a time period of 3 time units, say seconds, as represented by arrow A. Consequently, the beginning of the second video event 36 occurs at the same time as the beginning of the first audio event 38. If these are the first video and audio events detected by the analysis engine then a false report of synchronisation between the audio and video streams may be provided. However, at the beginning of the next video event 40 it can be seen that the audio stream is out of synchronisation, since the events are not evenly spaced apart and do not have a constant duration. Consequently the analysis engine is able to determine that in fact the video and audio streams are not in synchronisation. If the analysis engine detects both the beginning and end of the video and audio events then the loss of synchronisation will be detected sooner, since the end of the first audio event 38 will occur before the end of the second video event 36, even though the beginnings of both events coincided. In this instance the loss of synchronisation between the audio and video streams is detected by the analysis engine within one time unit, for example one second.
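The false-coincidence scenario of Figure 3 can be checked numerically. This sketch uses my own model of the sequence (onset times derived from the Figure 1 schedule of increasing durations), not values from the patent: because event spacings increase, a delayed audio track can align with the video track at one event but never at the next.

```python
def onsets(max_n=5):
    """Start times of the 'on' events in one cycle of the test sequence,
    where an off-period of n units precedes each n-unit on-period."""
    times, t = [], 0
    for n in range(1, max_n + 1):
        t += n           # off-period of n units
        times.append(t)  # the 'on' event begins here
        t += n           # on-period of n units
    return times

video = onsets()                # [1, 4, 9, 16, 25]
audio = [t + 3 for t in video]  # encoder delays the audio by 3 units

# The first audio onset coincides with the second video onset (a false match)...
print(video[1] == audio[0])  # True
# ...but the increasing spacing means no subsequent pair lines up.
print(video[2] == audio[1])  # False
```

This is exactly why the sequence uses growing intervals rather than a regular beep: a regular pattern delayed by a whole period would look perfectly synchronised forever.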
However, it will be appreciated that using the scheme described above any loss in synchronisation can only be determined, and the delay measured, at best when an audio or video event occurs, and, as described above in relation to Figure 3, the time required to determine a loss in synchronisation can be multiple time units if a mismatch between video and audio events occurs due to a gross loss of synchronisation. This is wasteful of system resources, since each second of audio-visual data will typically comprise 25 frames of data; in other words, the same data is being processed 25 times a second. According to embodiments of the present invention an analysis scheme is provided that allows improved determination of the time delay between audio and video data streams. This is accomplished by providing a predetermined audio-visual test sequence, to be encoded in the encoding system under test, that includes audio and visual data allowing each frame to be individually identified.
In a preferred embodiment the visual encoding is accomplished using a pattern of black and white squares to represent a binary code. A preferred binary code is Gray code, since a well known property of Gray code is that when presented in sequence only one bit of the binary word changes at a time. An example of a possible sequence of black and white squares is illustrated in Figure 4, in which the sequences of squares for three consecutive frames of audio-visual data are shown. The first 5-square sequence represents the Gray code 00101, which is decimal 7, and hence is used to identify the frame as frame number 7. The second and third sequences represent 00100 and 01100, decimal 8 and 9 respectively. It will be appreciated that a 5 bit word has been illustrated in Figure 4 for the purposes of clarity only, and any length of word may be selected depending on the number of frames in the test sequence. As with the scheme described above, in embodiments of the present invention the individual squares are encoded as discrete blocks or integer parts of a macroblock, such as a 2x2 array of macroblocks, i.e. 32x32 pixels, so as to facilitate the reliable, error-free encoding of the sequence of squares, thus reliably maintaining the encoded frame identification code.
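The one-bit-change property that motivates the choice of Gray code can be sketched as follows. This uses the standard binary-reflected construction as an illustration only; the specific bit patterns shown in Figure 4 may follow a different Gray-code convention, so the codewords printed here should not be read as reproducing the figure.

```python
def to_gray(n):
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def from_gray(g):
    """Invert the binary-reflected Gray code."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Each frame number maps to a pattern of squares: 0 = black, 1 = white.
for frame in (7, 8, 9):
    print(frame, format(to_gray(frame), "05b"))

# Consecutive frame numbers always differ in exactly one square:
assert bin(to_gray(8) ^ to_gray(9)).count("1") == 1
```

The single-bit change between consecutive frames matters here because it minimises the visual difference between successive codewords, reducing the chance that an encoder distorts the pattern at a frame boundary.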
The audio signal is also encoded with a timing sequence that serves to identify which frame of the video signal the particular section of audio data should be synchronised to. In a preferred embodiment of the present invention this is accomplished by the inclusion of an audio tone that is made up of a number of separate discrete frequencies, each frequency representing one bit in the data word, in an analogous manner to each square of the video code representing a single bit of the Gray code. This allows the encoded tone to be analysed using Fourier analysis techniques to determine the presence or otherwise of the individual frequency components and thus the binary code represented by the tone. An example of the frequency analysis of such an encoded tone is schematically illustrated in Figure 5. The horizontal axis represents the frequency of the detected frequency components, in kHz, whilst the vertical axis represents the power of each component. In the example illustrated in Figure 5 two frequency components 40 are shown, at 9 kHz and 12 kHz. If the selected code is a 5 bit code with the most significant bit represented by the frequency component centred at 3 kHz and the least significant bit at 15 kHz, then the frequency spectrum shown in Figure 5 is taken to represent the binary word 00101, or decimal 7 in Gray code. Care must be taken in the selection of the frequency components used to represent individual bits of the encoded timing word, since it is common practice for audio encoders to discard certain frequency components of a signal based on an analysis of which frequencies the human ear will and will not be able to hear. The frequency components selected for the timing word must therefore be such that they will not be discarded by such encoding techniques.
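A minimal sketch of this multi-tone encoding and its Fourier-style detection follows. The sample rate, window length, and exact bit-to-frequency mapping are my assumptions for illustration, not figures from the patent; the detection step computes one DFT bin per candidate frequency and thresholds its power, which is the essence of the analysis described above.

```python
import math

RATE = 48000                                   # assumed sample rate (Hz)
BIT_FREQS = [3000, 6000, 9000, 12000, 15000]   # assumed mapping, MSB first

def encode_word(bits, n=480):
    """Sum the tones for the '1' bits over n samples (10 ms at 48 kHz,
    an integer number of cycles for every bit frequency)."""
    return [sum(math.sin(2 * math.pi * f * t / RATE)
                for f, b in zip(BIT_FREQS, bits) if b)
            for t in range(n)]

def decode_word(samples):
    """Correlate against each bit frequency (one DFT bin per bit) and
    threshold the measured power to recover the binary word."""
    powers = []
    for f in BIT_FREQS:
        re = sum(s * math.cos(2 * math.pi * f * t / RATE)
                 for t, s in enumerate(samples))
        im = sum(s * math.sin(2 * math.pi * f * t / RATE)
                 for t, s in enumerate(samples))
        powers.append(re * re + im * im)
    limit = max(powers) / 2   # crude threshold; assumes at least one '1' bit
    return [1 if p > limit else 0 for p in powers]

word = [0, 0, 1, 0, 1]
print(decode_word(encode_word(word)))  # [0, 0, 1, 0, 1]
```

Choosing a window that holds an integer number of cycles of every bit frequency makes the tones mutually orthogonal, so each correlation responds only to its own bit.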
In other embodiments of the present invention the audio code may be encoded as a series of short audio tones in a predetermined time interval, each tone in the series representing a bit within the timing word and the presence or absence of a tone representing the binary state of the bit. For example, a series of eight audio tones may be used to represent an 8 bit binary word. The frequency of the audio tones may be pre-selected to facilitate their detection.
Figure 6 schematically illustrates an analysis engine according to embodiments of the present invention for analysing a test audio-visual data stream of the format described above after it has been encoded by an encoding system under test. The basic components are the same as for the system illustrated in Figure 2. The individual video and audio data streams 616, 618 are provided as inputs to respective time code detection units 622, 624. Each time code detection unit is arranged to identify and decode the embedded video and audio time codes. Consequently, in preferred embodiments the video time code detection unit 622 will be arranged to locate the sequence of coded black and white squares and determine the binary code represented by the particular sequence, thus identifying the individual frame number. The point in time at which each frame is received is also determined. Equally, the audio time code detection unit 624 is preferably arranged to perform the necessary frequency analysis on the embedded audio time code to determine which frequency components are present and thus the binary code represented. The relevant output signals from the time code detection units are provided to a time comparison unit 626 that is arranged to determine any time delay between the audio and video data streams on the basis of the time of receipt of corresponding portions of the data streams, as identified by the relevant embedded time codes. Any time delay is provided as an input to a report and/or alert unit 628 that is arranged to determine if the time delay exceeds certain predetermined parameters that may, for example, be stored as a look-up table in local data storage 630.
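The comparison performed by the time comparison unit can be sketched as follows. The data structures and the 40 ms limit used here are my illustrative assumptions (the 40 ms figure echoes the example standard mentioned in the introduction); the point is that detections are matched by decoded frame number, giving a per-frame delay rather than a per-event one.

```python
MAX_LEAD_MS = 40  # assumed limit: audio must not lead video by more than 40 ms

def frame_delays(video_detections, audio_detections):
    """Each detections dict maps decoded frame number -> arrival time in ms.
    Returns per-frame audio-minus-video delay (negative = audio leads)."""
    return {frame: audio_detections[frame] - video_detections[frame]
            for frame in video_detections
            if frame in audio_detections}

def alerts(delays, max_lead_ms=MAX_LEAD_MS):
    """Frame numbers where the audio leads the video beyond the limit."""
    return [frame for frame, d in delays.items() if d < -max_lead_ms]

video = {7: 1000, 8: 1040, 9: 1080}   # frames arriving at 25 fps
audio = {7: 955, 8: 995, 9: 1035}     # audio leads by 45 ms throughout
d = frame_delays(video, audio)
print(d)          # {7: -45, 8: -45, 9: -45}
print(alerts(d))  # [7, 8, 9]
```

Because every frame carries its own time code, a delay reading is available for each frame received, which is the resolution improvement the final paragraph claims over the event-based scheme.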
Since each frame of the encoded audio-visual data stream is individually identified by its respective embedded time code, the analysis engine is capable of determining the relative positions of the separate audio and video data streams to within a single frame and can provide audio/video time delay information for each frame, as opposed to delay information only for each video/audio 'event' as is the case with the scheme discussed in the applicant's co-pending application. Consequently, the apparatus and method of the present invention provide the delay information more quickly and with improved resolution.