GB2375907A - An automated recognition system - Google Patents

An automated recognition system

Info

Publication number
GB2375907A
Authority
GB
United Kingdom
Prior art keywords
programme
signal
cue
recognition system
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0111748A
Other versions
GB0111748D0 (en)
Inventor
Andrew John Murphy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
British Broadcasting Corp
Original Assignee
British Broadcasting Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by British Broadcasting Corp filed Critical British Broadcasting Corp
Priority to GB0111748A priority Critical patent/GB2375907A/en
Publication of GB0111748D0 publication Critical patent/GB0111748D0/en
Publication of GB2375907A publication Critical patent/GB2375907A/en
Withdrawn legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04H BROADCAST COMMUNICATION
    • H04H60/00 Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/56 Arrangements characterised by components specially adapted for monitoring, identification or recognition covered by groups H04H60/29-H04H60/54
    • H04H60/58 Arrangements characterised by components specially adapted for monitoring, identification or recognition covered by groups H04H60/29-H04H60/54 of audio
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/08 Systems for the simultaneous or sequential transmission of more than one television signal, e.g. additional information signals, the signals occupying wholly or partially the same frequency band, e.g. by time division
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/121 Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
    • G10H2240/131 Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
    • G10H2240/141 Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/131 Mathematical functions for musical analysis, processing, synthesis or composition
    • G10H2250/215 Transforms, i.e. mathematical transforms into domains appropriate for musical signal processing, coding or compression
    • G10H2250/235 Fourier transform; Discrete Fourier Transform [DFT]; Fast Fourier Transform [FFT]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/131 Mathematical functions for musical analysis, processing, synthesis or composition
    • G10H2250/261 Window, i.e. apodization function or tapering function amounting to the selection and appropriate weighting of a group of samples in a digital signal within some chosen time interval, outside of which it is zero valued
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04H BROADCAST COMMUNICATION
    • H04H20/00 Arrangements for broadcast or for distribution combined with broadcast
    • H04H20/12 Arrangements for observation, testing or troubleshooting
    • H04H20/14 Arrangements for observation, testing or troubleshooting for monitoring programmes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04H BROADCAST COMMUNICATION
    • H04H20/00 Arrangements for broadcast or for distribution combined with broadcast
    • H04H20/28 Arrangements for simultaneous broadcast of plural pieces of information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04H BROADCAST COMMUNICATION
    • H04H60/00 Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/09 Arrangements for device control with a direct linkage to broadcast information or to broadcast space-time; Arrangements for control of broadcast-related services
    • H04H60/13 Arrangements for device control affected by the broadcast information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04H BROADCAST COMMUNICATION
    • H04H60/00 Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/27 Arrangements for recording or accumulating broadcast information or broadcast-related information

Abstract

An automated recognition system is used to recognise an audio cue, such as a piece of music or 'jingle', and to generate useful ancillary data therefrom. The cue identifier 10 has a library 12 storing information about different audio cues. A received signal is applied to an ADC 20 which samples overlapping windows of about the expected jingle length, e.g. 2 seconds. A frame generator 24 generates for each window a plurality of overlapping frames which are fed to an FFT processor 26 to estimate the frequency content of the frames. This data is filtered 28, the logarithm taken 30, and applied to a matrix generator 32. A processor 34 forms a covariance matrix which is compared with those in the library 12. The system may be used in a network or transmitter station (Fig. 3) to add a data signal to the signal, or at a receiver (Fig. 4) to control a recorder. The system can be used with TV rather than radio broadcasting.

Description

AN AUTOMATED RECOGNITION SYSTEM

This invention relates to an automated recognition system.
In particular the invention relates to an automated recognition system for recognising that a portion of a signal matches a known programme cue associated with a programme of interest contained in the signal and for generating a data signal containing information about the programme of interest.
The increasing use of digital technology in the broadcasting of both television and radio services means that broadcasters now have the capability to transmit additional data as well as programme material. This additional data can be used to provide listeners and viewers with extra information about the services that they are receiving. For example, Radio Data System (RDS) and Digital Audio Broadcasting (DAB) enabled radios can display the type of programme which is being received; the programme type may be flagged as a news programme or travel announcement, for instance. In the case of digital television (DTV), viewers are able to obtain schedule information about the programmes that are currently being broadcast and those that are due to be broadcast next.
However, these systems rely on the programme provider transmitting the appropriate information in a data signal which is added to the programme signal before it is broadcast. Control information transmitted by broadcasters is used to control the display of information on appropriately adapted receivers. Schedule and programme-type information which is currently made available by broadcasters is based purely on the programme schedule and the time of day. The schedule for a particular day is entered into a computer which has been programmed to generate event flags, leading to the generation of a data signal containing appropriate programme information, at the time indicated by the schedule. The known system does not make any allowance for programmes starting early or late and is passive rather than reactive. The information provided is often, therefore, out of date and does not reflect the actual transmission.
We have appreciated that the service to users would be improved if the information provided were generated with reference to the actual broadcast signal rather than a theoretical schedule. We have perceived a number of ways of generating such real-time, programme-signal-related information. For example, a newsreader could press a "news bulletin" switch as they began to read the news bulletin. Depressing the switch would generate an event flag which could be transmitted with the programme signal to allow an adapted receiver to display up-to-date information. However, this approach would directly increase the workload of the programme presenter, and there is a marked reluctance by broadcasters to introduce any system which increases presenter workload. Another problem with this method is that the function of the news bulletin switch is not directly related to the audio of the programme and it would be easy for the presenter to forget to press the switch.
Furthermore, this type of system would require an investment in additional infrastructure by the broadcast station to support the switching and routing of the control signals carrying the event flags around the broadcast centre from the studios to the transmission system.
Another solution would be to place an inaudible code, or watermark, which the receiver could decode, within an audio cue or jingle preceding a programme. The receiver could then display the information contained in the watermark.
The problem with this technique is that there is a conflict between making the watermark inaudible and making it robust enough to be detected reliably. Furthermore, a jingle may come from any one of a variety of sources and it is not possible to rely on every instance being watermarked as required. This is especially true in the case of archive material such as repeat broadcasts of previously recorded programmes. To overcome this would require changes in ways of working and would introduce further production steps. This solution would, therefore, be unattractive.
It is standard practice for programme makers and broadcasters to place an audio cue, such as a piece of music or jingle, immediately before a particular programme to alert the user that the particular programme is coming up next. The user learns to recognise the audio cue and to associate it with the particular programme which usually follows. Audio cues are therefore distinctive and easily recognisable from one another by the human ear. We have appreciated that the audio programme cue could be used by an automated system to insert up-to-date information about the programmes in an audio signal. By providing an automated system capable of recognising a programme cue occurring in a signal, information about the programme being broadcast can be generated automatically.
The present invention relates to such an automated recognition system and overcomes the problems associated with providing programme information on an inflexible, schedule-driven basis.
According to the present invention there is provided a recognition system as defined in claim 1. The dependent claims define preferred features of the invention.
The recognition system is automated and preferably uses the audio signal and stored information on particular programmes which are to be identified to generate a data signal. The data signal is generated from the actual programme signal and, therefore, reflects true information on the current programme. Preferably the data signal is added to the audio signal using standard techniques and the resulting signal broadcast. Receivers adapted to use information transmitted under that standard will automatically benefit from up-to-date information on the broadcast. Preferably the data signal is generated in accordance with an industry standard such as RDS or DAB.
Preferably the identification of the programme from the programme cue is performed by detecting a match between the frequency content of a portion of the signal and the frequency content of the programme cue associated with a given programme of interest.
The recognition system may be used in diverse applications, such as a broadcast audio marking system allowing broadcasters to generate automatically a data signal containing up-to-date information on the programme content of their signal. The data in the data signal is generated by reference to the programme signal and is therefore correct even if the programme is being aired outside its scheduled time. Preferably the data signal is generated in substantially real-time.
The recognition system may also be used in receivers to display information on a given programme even when the broadcast signal does not include such information. A database of information on the given programmes and the programme cues associated with the given programmes may be created by the user or by a third party and used by the receiver with the recognition system to display information on the current programme.
Embodiments of the invention will now be described by way of example with reference to the accompanying drawings in which:
Figure 1 is a block diagram showing an audio recognition system in accordance with an embodiment of the invention; Figure 2 is a diagram showing how an audio signal is processed by the recognition system of Figure 1; Figure 3 is a block diagram of an application of the recognition system in a network station environment; and Figure 4 is a block diagram of an application of the recognition system in a receiver.
The audio recognition system is designed to detect a match between an input audio signal and one or more known audio cues stored for recognition purposes. Any method of characterising a signal which makes it discernible from a different signal could be used to detect the presence of an audio cue in an audio signal. For example, characterisation could be based on acoustical features of the signal, such as pitch or tempo, or on statistical characterisation, such as correlation. However, it is not sufficient simply to compare sample values of the audio signal with a stored version of the audio cue, because such a system would fail to operate adequately in the presence of changes in amplitude, the use of audio bit-rate reduction or sample-by-sample synchronisation issues between the stored audio cue and the detected audio signal from the programme audio feed.
There has been substantial research activity in the field of speech recognition. In the presently preferred embodiment of the invention, recognition of an audio cue in an audio signal is based on detecting a match between the frequency content of the audio signal and the frequency content of the audio cue. A paper titled "A method for direct audio search with applications to indexing and retrieval", published in Proceedings ICASSP 2000 by Johnson and Woodland, describes a technique for searching audio data to find a match in the data for a given piece of voice audio using a cepstral parameterisation of the audio and a covariance-based distance metric. However, we have found that the technique described is not particularly effective for recognising a general music-based audio cue, such as a piece of music or jingle marking the beginning or end of a particular programme.
Tools for measuring the resemblance between audio signals are also known. US 5,469,529 by Bimbot et al describes a process for measuring the resemblance between sound samples using the covariance matrices of a test phase signal and a learning phase signal. Arithmetic, geometric and harmonic means of the covariance matrices are calculated and used to determine sphericity functions from which it is possible to deduce a resemblance measurement between the test and learning phase signals. A preferred embodiment of the present invention makes use of the Arithmetic Harmonic Sphericity measurement (AHS) to determine whether or not there is a match between a portion of the audio signal and a stored audio cue.
The presently preferred audio recognition system will now be described with reference to Figures 1 and 2. The recognition system is implemented on a processor or as a program in a computer. It comprises a signal receiver, a programme cue identifier and a data signal generator. The programme cue identifier 10 is shown in Figure 1 and comprises a database, or library, 12, a processor 14 and a distance measurement processor 16. The database 12 stores information on the audio cues which the system is required to recognise. Rather than store the raw audio cue signal, the audio cue is pre-processed to determine an individual determining characteristic, or signature. This helps to minimise the amount of real-time data processing performed by the recognition system. Pre-processing may be performed by the recognition system itself, or the signature of the audio cue may be generated by a compatible system and simply downloaded into the store of the recognition system. It may use a clean, noise-free version of the audio cue or an audio cue recorded from a previous broadcast signal. The audio cue and programme information may be stored on a CD or minidisc for use by the recognition system. Information concerning the programme, such as the programme type, with which the audio cues are associated is also stored.
The processor 14 processes the signal in a number of steps which will now be described. The signal on which information is required is received by the signal receiver as an input to the recognition system.
Analogue audio signals must be converted to digital data before further processing. The raw signal is received by the signal receiver and fed into an analogue-to-digital converter (ADC) 20 which samples the audio signal and outputs the sampled signal to a window generator 22. A sample rate of 16 kHz has been found to work well, providing a reasonable balance between capturing a good proportion of the frequency information in the signal and keeping the amount of data to be processed to a minimum. If the audio signal is digital, sample rate conversion is required if the sample rate of the digital signal differs from that used to generate the audio cue signature.
A standard length audio cue is used for the identification process. The presently preferred embodiment of the invention uses an audio cue length of 2 seconds. A window generator 22 segments the sampled audio signal into windows corresponding in length to the length of the audio cues to be recognised. The window generator 22 produces overlapping windows 42 from the audio signal 40, as shown in Figure 2. There is a trade-off between the amount of overlap between the windows and the amount of processing. Missing the start of the audio cue may result in failing to recognise the audio cue. However, moving the window by a single sample each time results in very high processing overheads and slow processing time. We have found that a suitable overlap is one second.
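By way of illustration, this segmentation step might be sketched as follows (a minimal Python/NumPy sketch assuming a mono signal already sampled at 16 kHz; the function and constant names are illustrative and not taken from the patent):

```python
import numpy as np

SAMPLE_RATE = 16_000   # 16 kHz sampling, as suggested in the description
WINDOW_SECONDS = 2.0   # window length matching the expected audio-cue length
OVERLAP_SECONDS = 1.0  # one-second overlap between successive windows

def segment_into_windows(audio: np.ndarray) -> np.ndarray:
    """Split a mono signal into overlapping windows of the audio-cue length."""
    window_len = int(WINDOW_SECONDS * SAMPLE_RATE)          # 32 000 samples
    hop = window_len - int(OVERLAP_SECONDS * SAMPLE_RATE)   # 16 000 samples
    if len(audio) < window_len:
        return np.empty((0, window_len))
    starts = range(0, len(audio) - window_len + 1, hop)
    return np.stack([audio[s:s + window_len] for s in starts])
```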
Each overlapping window is then further segmented by the frame generator 24 to generate a series of overlapping frames 44 (see Figure 2). Each frame is preferably 400 samples long, with the overlap between frames being 160 samples. The overlapping frames are fed to a Fast Fourier Transform (FFT) processor 26 to estimate the frequency contents of the frames. The data is then filtered using a 12-band mel-frequency bank filter 28. This generates 12 coefficients corresponding to the energy contained within each frequency bank of the frame. The logarithms of the coefficients are calculated by the log processor 30. Next a frame feature vector is generated. The first 12 entries of the frame feature vector are the 12 log values of the coefficients. The remaining 6 entries are generated by taking the difference between the first 6 log values of the current frame and the first 6 log values of the next frame. These values relate to the low frequency energy content of the frames. No difference values can be obtained for the last frame in a given window. The feature vector for the last frame is therefore only 12 entries long and is discarded before the matrix of feature vectors is assembled. The feature vectors for each frame corresponding to a single window are assembled in a matrix by matrix generator 32, with frame feature vectors forming the rows of the matrix 50. The matrix is processed by the covariance matrix processor 34 to form a covariance matrix 52. The covariance matrix is unique to the particular window of data signal and can be compared to the covariance matrix of known audio cues to find a match, thereby identifying the window of data as the audio cue. The covariance matrix is therefore a suitable signature of the signal which may be used for recognition purposes.
The covariance matrix of the audio cue is determined in a similar manner except that windowing of the audio cue is not required. The covariance matrix is stored as the signature of the audio cue.
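A minimal sketch of this signature computation, under the stated parameters (400-sample frames overlapping by 160 samples, a 12-band mel filter bank, 12 log energies plus six difference values, covariance of the resulting feature matrix). The mel filter construction below is one conventional choice, since the patent does not specify the exact filter shapes, and all function names are illustrative:

```python
import numpy as np

FRAME_LEN = 400       # 25 ms frames at 16 kHz, per the description
FRAME_OVERLAP = 160   # overlap between successive frames (hop of 240 samples)
N_BANDS = 12          # 12-band mel-frequency filter bank
SAMPLE_RATE = 16_000

def mel_filterbank(frame_len: int, n_bands: int, sr: int) -> np.ndarray:
    """Triangular mel-spaced filters over the one-sided FFT bins
    (one conventional construction; the exact shapes are not specified)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_bands + 2))
    bins = np.floor((frame_len + 1) * edges_hz / sr).astype(int)
    fb = np.zeros((n_bands, frame_len // 2 + 1))
    for b in range(n_bands):
        lo, mid, hi = bins[b], bins[b + 1], bins[b + 2]
        for k in range(lo, mid):
            fb[b, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[b, k] = (hi - k) / max(hi - mid, 1)
    return fb

def window_signature(window: np.ndarray) -> np.ndarray:
    """Covariance-matrix signature of one window of signal.  A stored audio
    cue is processed in the same way, but without the windowing step."""
    hop = FRAME_LEN - FRAME_OVERLAP
    starts = range(0, len(window) - FRAME_LEN + 1, hop)
    frames = np.stack([window[s:s + FRAME_LEN] for s in starts])

    # Estimate the frequency content of each frame and apply the filter bank.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    fb = mel_filterbank(FRAME_LEN, N_BANDS, SAMPLE_RATE)
    log_energy = np.log(spectrum @ fb.T + 1e-10)          # (n_frames, 12)

    # 12 log energies plus 6 differences against the next frame's first
    # 6 values; the final frame has no "next" frame and is discarded.
    deltas = log_energy[:-1, :6] - log_energy[1:, :6]     # (n_frames - 1, 6)
    features = np.hstack([log_energy[:-1], deltas])       # rows = frame vectors

    return np.cov(features, rowvar=False)                 # 18 x 18 signature
```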
In practice, the covariance matrix estimated for a window of data from the audio signal and the covariance matrix of the audio cue may differ slightly. A method of measuring how much the covariance matrices differ is therefore required. Once the difference has been quantified, a threshold is applied to the measurement and, if the measurement is below the threshold, a match is detected between the window of audio signal and the audio cue.
Various distance measures could be used. The presently preferred measurement is the arithmetic harmonic sphericity measure, which is described in US 5,469,529 and in the paper "Text-free Speaker Recognition using an Arithmetic Harmonic Sphericity Measure" by F. Bimbot and L. Mathan, published in Proc. Eurospeech 1993, pp. 169-172. The arithmetic and harmonic mean values of the eigenvalues of the covariance matrix of the window of audio signal are compared with those of the covariance matrices of the known audio cues. When the values differ by less than a threshold value, a match is detected between the audio signal and the audio cue and a data signal is generated by the data signal generator of the recognition system, which can be used by an ancillary system to indicate information concerning the audio signal. The data signal includes the information on the programme associated with the matching audio cue stored in the database. The data signal is generated in accordance with the RDS standard.
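A sketch of the comparison step. The formula below is one common statement of the arithmetic-harmonic sphericity measure (the log of the ratio between the arithmetic and harmonic means of the eigenvalues of the product of one covariance matrix with the inverse of the other); the threshold value is purely illustrative and would need tuning:

```python
import numpy as np

def ahs_distance(window_cov: np.ndarray, cue_cov: np.ndarray) -> float:
    """Arithmetic-harmonic sphericity measure between two covariance matrices.

    With M = cue_cov @ inv(window_cov), this is log(arithmetic mean of the
    eigenvalues of M / harmonic mean of the eigenvalues of M).  The value is
    zero when the matrices are proportional and grows as they diverge.
    """
    n = window_cov.shape[0]
    arithmetic = np.trace(cue_cov @ np.linalg.inv(window_cov)) / n
    harmonic = n / np.trace(window_cov @ np.linalg.inv(cue_cov))
    return float(np.log(arithmetic / harmonic))

def is_match(window_cov: np.ndarray, cue_cov: np.ndarray,
             threshold: float = 1.0) -> bool:
    """Declare a match when the measured difference falls below a threshold
    (the value 1.0 here is illustrative only)."""
    return ahs_distance(window_cov, cue_cov) < threshold
```

In practice the threshold would be tuned against examples of the cue under typical broadcast conditions (level changes, bit-rate reduction), which the covariance-based signature is intended to tolerate.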
The recognition system of the present invention could be put to a number of different uses. For example, Figure 3 shows an application of the recognition system in a broadcast radio environment 60. The audio signal generated by a network or station is continuously analysed by comparing it to any number of audio cues or jingles corresponding to given programmes. Information on the given programmes and on the corresponding audio cues is stored in the template database 12. A data signal containing information on the given programme is generated when an audio cue associated with the given programme is detected.
The audio signal is fed directly to the transmission system 62 and also to the input of the recognition system, which looks for a match with audio cues stored in the template database 12. When a match is detected, appropriate event flags are generated and a data signal incorporating programme-associated data (PAD) is sent to the transmission system. The PAD is added to the audio signal in accordance with the particular standard being used, e.g. RDS, and the combined signal transmitted. For example, if the audio cue corresponds to that played at the start of a traffic announcement, a traffic announcement flag is generated 68 and the PAD sent to the transmission system 62 in the data signal includes a Traffic Programme Identification Code. Receivers capable of detecting PAD signals may use the data signal to display additional information or provide additional functionality to the user, such as changing channel to listen to a traffic announcement being broadcast on another channel.
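A sketch of how the recognition output might drive the event flags and PAD described above, reusing segment_into_windows, window_signature and ahs_distance from the earlier sketches. The template structure, cue names and PAD fields are hypothetical and are not taken from the patent or from any RDS/DAB library:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class CueTemplate:
    name: str                  # e.g. "traffic_announcement_jingle" (hypothetical)
    signature: np.ndarray      # pre-computed covariance matrix of the cue
    pad: dict = field(default_factory=dict)  # programme-associated data to emit

def monitor(audio: np.ndarray, templates: list, emit_pad, threshold: float = 1.0):
    """Compare each window of the broadcast feed against every stored cue and
    hand the associated PAD (e.g. a traffic-programme flag) to the
    transmission system when a match is detected."""
    for window in segment_into_windows(audio):
        sig = window_signature(window)
        for tpl in templates:
            if ahs_distance(sig, tpl.signature) < threshold:
                emit_pad(tpl.name, tpl.pad)   # e.g. add an RDS data group
```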
The recognition system may also be used to generate scheduling information 66, such as a 'start' flag for a particular programme, by detecting from the audio signal a portion of signal corresponding to the audio cue of that particular programme. Once a start flag has been generated, an end flag could be generated on the repeated occurrence of the audio cue or on the occurrence of a second, play-out, audio cue. Once the recognition system has detected a match between a portion of the audio signal and a stored audio cue it could generate a Programme-Type (PTY) data flag 64, allowing programme type data stored by the system to be included in the data signal and broadcast in a combined data and programme signal. Event flags could also be used to update the now-and-next schedule information for an electronic programme guide (EPG) and to indicate artist and title information as a dynamic label which could be used by a Digital Audio Broadcasting (DAB) receiver to display to the user the artist and title of the song currently playing on that channel.
The recognition system may also be used to allow stations to perform trail auditing by determining if and when certain programmes (including advertisements or news bulletins) are broadcast. Studio operators may be obliged by the station to play particular advertisements or news bulletins at a given time or at a regular frequency.
However, operators may choose not to comply or simply forget to play the programmes. By including suitable audio cues in the template database 12, the recognition system may be used to detect if and when a particular programme is played. By generating a suitable event flag, a trail auditor 70 monitoring the event flags can maintain a database showing when the particular programmes were played. The station manager could then use the database to assess whether or not the studio operator is fulfilling his obligations to play the programmes and to take corrective action if necessary.
The recognition system may also be used to allow automated recording of selected programmes. Frequently stations have Internet sites which contain, amongst other things, information extracted from the programmes, such as news bulletins, that they are broadcasting. News bulletins are particularly prone to scheduling disruption as stations react to live stories or sports events, causing the news bulletin to be displaced within the schedule. It has therefore been difficult to automate recording of the news bulletins for subsequent playback on the website.
The recognition system may be used to overcome this problem by detecting the start of a news bulletin and generating a particular event flag. An automated recording system 76 monitoring the output of the recognition system would then start recording when the news bulletin event flag is detected. The automatically recorded broadcast may then be provided for output via the station's website 78 reducing or eliminating the need for specially recorded bulletins.
In order to provide a suitable quality of recording, a time delay unit 74 may be connected between the audio signal path and the automated recording system 76.
Although the recognition system processes the audio signal rapidly, any delay in the generation of the event flag would result in the beginning of the programme failing to be recorded. In order to avoid the possibility of losing the initial part of the programme, the automated recording system 76 may be configured to record a delayed version of the audio signal provided by a time delay unit 74. The effort involved in providing the updated information on a website using the recognition system is far less than that currently required.
An advantage of the recognition system is that it is self-standing and does not require any input other than the signal to be broadcast; specifically, it does not require any control signals to operate. It may, therefore, be located physically close to the transmission system to avoid additional large-scale data switching and routing infrastructure within the station. The recognition system is not overly affected by bit-rate reduction or any form of computer-controlled playout system because it recognises the audio cue or jingle based purely on audio characteristics. It is independent of the particular physical media.
A further use of the recognition system is in receivers to provide information to users on programmes for which the broadcaster has not used RDS or a similar standard to provide this information in a combined data and programme signal. The information can be automatically called up from a database by a receiver using the recognition system.
Figure 4 shows a receiver 80 adapted to include the recognition system. The receiver 80 receives a broadcast audio signal which is output to the loudspeaker 90. The received signal is also fed to the recognition system where it is processed as described above and checked for a match with any of the audio cues stored in the template database 12. When a match with a particular audio cue is detected, an event flag is set. The event flag may be a programme start flag which, when detected by the recording system 82, causes the recording system 82 to start recording the signal, or programme-type data (PTY) which, when detected by a text display 88, is displayed.
The recording system 82 may be programmed by the user, using a user input device 86, to record a particular programme whose associated audio cue has been identified in the template database 12. The audio cue may be assigned a particular code and, when the programme starting with that audio cue is to be recorded, the code is entered on the recording system 82. The recording system 82 would be activated when it detected an event flag associated with that code. Either the length of the programme could be specified by the user or the recording system 82 may look for an end flag to stop the recording.
Population of the database could be performed by the manufacturer, or a facility to allow the user to build up the database could be incorporated. The template database 12 could be populated by downloading, from the particular broadcaster or from a centrally provided database, the signature of the audio cue for the particular programme.
Alternatively, the audio cue itself could be downloaded, pre-processed and stored. Alternatively, the user could input to the recognition system a sample of the audio cue recorded off-air and use the recognition system to process the signal and store in the database the appropriate audio cue information. The system could be adapted to allow subsequent episodes of programmes to be recorded automatically. The recognition-based recording would have the benefit of starting at the beginning of the desired programme regardless of whether or not the programme was broadcast at its scheduled time. The system does not require the broadcaster to adapt the signal to include additional data which can be detected and used to control the recording system 82, nor does it impose any particular information format onto broadcasters; the added functionality is generated entirely by the receiver using characteristics of the transmitted signal to deliver additional functionality to the user. Thus the receiver 80 may be used to record programmes and indicate information about programmes from broadcasters who are not equipped to transmit this information or who transmit such information in a format which is not supported by the receiver 80.
It should be noted that the features described by reference to particular figures and at different points of the description may be used in combinations other than those particularly described or shown. All such modifications are encompassed within the scope of the invention as set forth in the following claims.
With respect to the above description, it is to be realized that equivalent apparatus and methods are deemed readily apparent to one skilled in the art, and all equivalent apparatus and methods to those illustrated in the drawings and described in the specification are intended to be encompassed by the present invention.
Therefore, the foregoing is considered as illustrative only of the principles of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation shown and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.
For example, the recognition system is not intended to be restricted to programme cue recognition in audio broadcasts but could be used in a television network, using the audio track of the signal to recognise audio cues and insert programme information for transmission by the station. A further example would be using the recognition system to recognise video cues in video signals rather than relying on recognition via the audio track.

Claims (21)

  1. An automated recognition system for use with a signal including programmes and associated programme cues, the system comprising: a signal receiver, for receiving a signal including programme content and associated programme cues; a programme cue identifier, coupled to the signal receiver, for identifying portions of the signal as programme cues; a data signal generator, coupled to the programme cue identifier, for generating a data signal containing data related to a given programme when the programme cue associated with the given programme is identified.
  2. An automated recognition system according to claim 1, wherein the data signal generator generates a data signal in substantially real-time.
  3. An automated recognition system according to either claim 1 or claim 2, wherein the signal is an audio signal and the programme cue is an audio cue.
  4. An automated recognition system according to claim 3, wherein the programme cue identifier includes a frequency processor for determining the frequency content of each signal portion and the programme cue identifier uses the frequency content of each signal portion to identify portions of the signal as audio cues.
  5. An automated recognition system according to claim 4, wherein the programme cue identifier includes a filter whose frequency response imitates the frequency response of the human ear for filtering the frequency content of the signal portions prior to identification of portions of the signal as audio cues.
  6. An automated recognition system according to claim 4, wherein the programme cue identifier further comprises a frame generator for generating a plurality of overlapping frames from the signal portions, the frequency processor processing each of the frames to determine the frequency content of each frame and the programme cue identifier using the frequency content of each frame to identify the signal portion as audio cues.
  7. An automated recognition system according to any of the preceding claims, further comprising a store for storing data concerning programme cues and wherein the programme cue identifier includes a comparator for comparing each signal portion with the stored data.
  8. An automated recognition system according to claim 7, wherein the data stored by the store is related to the frequency content of the programme cues.
  9. An automated recognition system according to claim 7, wherein the store stores data on the programmes associated with the stored programme cues for generation of the data signal.
  10. An automated recognition system according to any of the preceding claims, further comprising a window generator, coupled to the signal receiver, for segmenting the signal into portions having a length substantially the same as the programme cues.
  11. An automated recognition system according to claim 5, wherein the programme cue identifier includes a logarithm processor for estimating the logarithm of the filtered data and a covariance matrix generator, coupled to the logarithm estimator, for generating a covariance matrix of the logarithm data, the covariance matrix of the signal portion being used by the programme cue identifier to identify the signal portion as an audio cue.
  12. An automated recognition system according to claim 11, wherein the programme cue identifier includes a sphericity measurement processor for quantifying the difference between the covariance matrix of the signal portion and covariance matrices of the programme cues, the programme cue identifier thresholding the quantified difference to determine whether the signal portion matches an audio cue.
  13. An automated recognition system according to any of the preceding claims, wherein the data signal generator generates a data signal complying with one of the following Standards: Radio Data System Standard EN 50097, Digital Audio Broadcasting Standard ETS 300 401, Digital Video Broadcasting TR 101 200.
  14. A broadcast marking system comprising a signal generator for generating a signal including programmes and associated programme cues, an automated recognition system according to any of claims 1 to 13, coupled to the signal generator, and a transmission system, coupled to the signal generator and to the automated recognition system, for adding the signal from the signal generator and the data signal from the automated recognition system and transmitting the combined signal.
  15. A receiver for displaying programme information from a signal containing programmes and associated programme cues, the receiver comprising an automated recognition system according to any of claims 1 to 13 and a display, coupled to the automated recognition system, the display using the data signal generated by the automated recognition system to display data on the programme being received.
  16. A method of generating a data signal containing information on a programme of interest contained in a signal, the programme being preceded in the signal by a programme cue, the method comprising: identifying portions of a signal as programme cues; and automatically generating a data signal containing data related to a given programme when the programme cue associated with the given programme is identified.
  17. A method of generating a data signal according to claim 16, wherein the data signal is generated in substantially real-time.
  18. A method of generating a data signal according to either claim 16 or claim 17, further including the steps of: storing data on at least one programme cue associated with a given programme; storing for each identified programme cue data concerning the given programme; segmenting the audio signal into windows of substantially the same length as the programme cues; comparing the segmented signal with the programme cues to determine whether there is a match; using the stored data on the given programme in the data signal when the programme cue associated with the given programme is identified.
  19. An automated recognition system substantially as hereinbefore described with reference to any of Figures 1 and 2 of the accompanying drawings.
  20. A broadcast marking system substantially as hereinbefore described with reference to Figure 3 of the accompanying drawings.
  21. A receiver for displaying programme information substantially as hereinbefore described with reference to Figure 4 of the accompanying drawings.
GB0111748A 2001-05-14 2001-05-14 An automated recognition system Withdrawn GB2375907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0111748A GB2375907A (en) 2001-05-14 2001-05-14 An automated recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0111748A GB2375907A (en) 2001-05-14 2001-05-14 An automated recognition system

Publications (2)

Publication Number Publication Date
GB0111748D0 GB0111748D0 (en) 2001-07-04
GB2375907A true GB2375907A (en) 2002-11-27

Family

ID=9914601

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0111748A Withdrawn GB2375907A (en) 2001-05-14 2001-05-14 An automated recognition system

Country Status (1)

Country Link
GB (1) GB2375907A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004084443A1 (en) * 2003-03-17 2004-09-30 Philips Intellectual Property & Standards Gmbh Method for remote control of an audio device
WO2005041455A1 (en) * 2002-12-20 2005-05-06 Koninklijke Philips Electronics N.V. Video content detection
EP2406903A1 (en) * 2009-03-11 2012-01-18 Ravosh Samari Digital signatures
EP2485415A1 (en) * 2006-12-18 2012-08-08 UBC Media Group PLC Digital broadcast system
WO2019104889A1 (en) * 2017-12-03 2019-06-06 厦门声连网信息科技有限公司 Sound processing system and method, sound recognition device and sound receiving device
US10595053B2 (en) 2003-07-11 2020-03-17 Gracenote, Inc. Method and device for generating and detecting a fingerprint functioning as a trigger marker in a multimedia signal
US10880023B2 (en) 2018-08-03 2020-12-29 Gracenote, Inc. Vehicle-based media system with audio advertisement and external-device action synchronization feature
US11055346B2 (en) 2018-08-03 2021-07-06 Gracenote, Inc. Tagging an image with audio-related metadata
US11395048B2 (en) * 2020-03-03 2022-07-19 The Nielsen Company (Us), Llc Timely addition of human-perceptible audio to mask an audio watermark

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4230990A (en) * 1979-03-16 1980-10-28 Lert John G Jr Broadcast program identification method and system
EP0210609A2 (en) * 1985-07-29 1987-02-04 A.C. Nielsen Company Broadcast program identification method and apparatus
US5283639A (en) * 1989-10-23 1994-02-01 Esch Arthur G Multiple media delivery network method and apparatus
US6002443A (en) * 1996-11-01 1999-12-14 Iggulden; Jerry Method and apparatus for automatically identifying and selectively altering segments of a television broadcast signal in real-time
WO2001004792A1 (en) * 1999-07-09 2001-01-18 Koninklijke Philips Electronics N.V. Method and apparatus for linking a video segment to another video segment or information source

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4230990A (en) * 1979-03-16 1980-10-28 Lert John G Jr Broadcast program identification method and system
US4230990C1 (en) * 1979-03-16 2002-04-09 John G Lert Jr Broadcast program identification method and system
EP0210609A2 (en) * 1985-07-29 1987-02-04 A.C. Nielsen Company Broadcast program identification method and apparatus
US5283639A (en) * 1989-10-23 1994-02-01 Esch Arthur G Multiple media delivery network method and apparatus
US6002443A (en) * 1996-11-01 1999-12-14 Iggulden; Jerry Method and apparatus for automatically identifying and selectively altering segments of a television broadcast signal in real-time
WO2001004792A1 (en) * 1999-07-09 2001-01-18 Koninklijke Philips Electronics N.V. Method and apparatus for linking a video segment to another video segment or information source

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005041455A1 (en) * 2002-12-20 2005-05-06 Koninklijke Philips Electronics N.V. Video content detection
WO2004084443A1 (en) * 2003-03-17 2004-09-30 Philips Intellectual Property & Standards Gmbh Method for remote control of an audio device
US10595053B2 (en) 2003-07-11 2020-03-17 Gracenote, Inc. Method and device for generating and detecting a fingerprint functioning as a trigger marker in a multimedia signal
US11641494B2 (en) 2003-07-11 2023-05-02 Roku, Inc. Method and device for generating and detecting a fingerprint functioning as a trigger marker in a multimedia signal
US11109074B2 (en) 2003-07-11 2021-08-31 Roku, Inc. Method and device for generating and detecting a fingerprint functioning as a trigger marker in a multimedia signal
EP2485415A1 (en) * 2006-12-18 2012-08-08 UBC Media Group PLC Digital broadcast system
EP2406903A1 (en) * 2009-03-11 2012-01-18 Ravosh Samari Digital signatures
EP2406903A4 (en) * 2009-03-11 2013-01-16 Ravosh Samari Digital signatures
WO2019104889A1 (en) * 2017-12-03 2019-06-06 厦门声连网信息科技有限公司 Sound processing system and method, sound recognition device and sound receiving device
US11055346B2 (en) 2018-08-03 2021-07-06 Gracenote, Inc. Tagging an image with audio-related metadata
US11799574B2 (en) 2018-08-03 2023-10-24 Gracenote, Inc. Vehicle-based media system with audio ad and navigation-related action synchronization feature
US10887031B2 (en) 2018-08-03 2021-01-05 Gracenote, Inc. Vehicle-based media system with audio ad and navigation-related action synchronization feature
US11362747B2 (en) 2018-08-03 2022-06-14 Gracenote, Inc. Vehicle-based media system with audio ad and visual content synchronization feature
US11941048B2 (en) 2018-08-03 2024-03-26 Gracenote, Inc. Tagging an image with audio-related metadata
US11444711B2 (en) 2018-08-03 2022-09-13 Gracenote, Inc. Vehicle-based media system with audio ad and navigation-related action synchronization feature
US11531700B2 (en) 2018-08-03 2022-12-20 Gracenote, Inc. Tagging an image with audio-related metadata
US11581969B2 (en) 2018-08-03 2023-02-14 Gracenote, Inc. Vehicle-based media system with audio ad and visual content synchronization feature
US11929823B2 (en) 2018-08-03 2024-03-12 Gracenote, Inc. Vehicle-based media system with audio ad and visual content synchronization feature
US10880023B2 (en) 2018-08-03 2020-12-29 Gracenote, Inc. Vehicle-based media system with audio advertisement and external-device action synchronization feature
US10931390B2 (en) 2018-08-03 2021-02-23 Gracenote, Inc. Vehicle-based media system with audio ad and visual content synchronization feature
US11902632B2 (en) 2020-03-03 2024-02-13 The Nielsen Company (Us), Llc Timely addition of human-perceptible audio to mask an audio watermark
US11632596B2 (en) 2020-03-03 2023-04-18 The Nielsen Company (Us), Llc Timely addition of human-perceptible audio to mask an audio watermark
US11395048B2 (en) * 2020-03-03 2022-07-19 The Nielsen Company (Us), Llc Timely addition of human-perceptible audio to mask an audio watermark

Also Published As

Publication number Publication date
GB0111748D0 (en) 2001-07-04

Similar Documents

Publication Publication Date Title
US20100319015A1 (en) Method and system for removing advertising content from television or radio content
US9563699B1 (en) System and method for matching a query against a broadcast stream
JP4528763B2 (en) Real-time recording agent for streaming data from the Internet
US6088455A (en) Methods and apparatus for selectively reproducing segments of broadcast programming
CA3008502C (en) Methods, apparatus and articles of manufacture to provide secondary content in association with primary broadcast media content
US9653094B2 (en) Methods and systems for performing signal analysis to identify content types
US20140214190A1 (en) Method and System for Content Sampling and Identification
EP2602630A2 (en) Method of characterizing the overlap of two media segments
US20080154401A1 (en) Method and System For Content Sampling and Identification
US20210065743A1 (en) Audio encoding for functional interactivity
KR102454002B1 (en) Signal processing method for investigating audience rating of media, and additional information inserting apparatus, media reproducing apparatus, aduience rating determining apparatus for the same method
JP7332112B2 (en) Method, computer readable storage medium and apparatus for identification of local commercial insertion opportunities
US20030167174A1 (en) Automatic audio recorder-player and operating method therefor
US20070199037A1 (en) Broadcast program content retrieving and distributing system
KR20020000563A (en) Broadcast speech recognition system for keyword monitoring
US8473294B2 (en) Skipping radio/television program segments
GB2375907A (en) An automated recognition system
US20060058997A1 (en) Audio signal identification method and system
JPH11122204A (en) Broadcasting recognition system using sound signal, sound element generation device used for the same and broadcasting recognition device
RU2307472C2 (en) Method for identification of changes in broadcasted database
WO2011041008A1 (en) Method for removing advertising content
US8774954B2 (en) Processing data supplementary to audio received in a radio buffer
US20120136466A1 (en) System and method for identifying a broadcast source of ambient audio
EP2549476B1 (en) Audio encoding technique
WO2006001335A1 (en) Fm multiplex broadcasting system, fm multiplex broadcasting method, and receiver

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)