AU736314B2 - Audio spotting advisor - Google Patents

Audio spotting advisor

Info

Publication number
AU736314B2
AU736314B2 AU27742/00A AU2774200A
Authority
AU
Australia
Prior art keywords
audio signal
audio
amplitude
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU27742/00A
Other versions
AU2774200A (en)
Inventor
Stephen Robert Bruce
John Richard Windle
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AUPP9838A external-priority patent/AUPP983899A0/en
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU27742/00A priority Critical patent/AU736314B2/en
Publication of AU2774200A publication Critical patent/AU2774200A/en
Application granted granted Critical
Publication of AU736314B2 publication Critical patent/AU736314B2/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Description

S&F Ref: 498316
AUSTRALIA
PATENTS ACT 1990 COMPLETE SPECIFICATION FOR A STANDARD PATENT
ORIGINAL
Name and Address of Applicant: Canon Kabushiki Kaisha, 30-2, Shimomaruko 3-chome, Ohta-ku, Tokyo 146, Japan
Actual Inventor(s): Stephen Robert Bruce, John Richard Windle
Address for Service: Spruson & Ferguson, St Martins Tower, 31 Market Street, Sydney NSW 2000
Invention Title: Audio Spotting Advisor
ASSOCIATED PROVISIONAL APPLICATION DETAILS: [33] Country: AU; [31] Applic. No(s): PP9838; [32] Application Date: 19 Apr 1999
The following statement is a full description of this invention, including the best method of performing it known to me/us:-

AUDIO SPOTTING ADVISOR

Field of the Invention

The present invention relates to the editing and indexing of video information and, in particular, to using an audio component of the video information to assist in the editing or indexing task.
Background

When editing raw video information to produce the final production, the editor is required to mark "in" and "out" points in the raw footage to indicate those sections of the video clips forming the footage that are to be used in the final production. This typically requires that the editor view much, and usually all, of the footage to determine the relevant or important sections thereof. In a number of cases, these areas can often be predicted by one skilled in the editing art, although the exact edit points still need to be identified.
The indexing of an audio track forming part of a video sequence often involves identifying significant words that give information regarding the content of the visual track. A simple method of extracting such keywords requires performing the task manually and entering the keywords into a database that associates the keywords with the video data.
An alternative method is to pass the audio track through voice recognition software, where the recognised words are extracted and added to form a database as with the above-noted manual solution. An example of this is disclosed in WO 97/0021634, which is directed to the discrimination between speech and music components in an audio signal. One problem associated with the approach of that document is that the entirety of the audio track needs to be processed, including those sections that contain no voice signals. This can be costly in processing power and in the time taken to process audio information.
Summary of the Invention

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more deficiencies of existing arrangements.
The present invention relates generally to a method and apparatus for indexing speech information in an audio signal and the consequential selection of in and out points in an editing or indexing system. The selection may be provided as advice to a user for manual editing or as advice to an automatic editing and indexing arrangement. The method takes advantage of the fact that the audio track has a volume level which may contain implicit information about the circumstances being recorded.
In accordance with one aspect of the present invention, there is disclosed a method of indexing speech information in an audio signal, said method comprising the steps of: examining said audio signal by applying heuristic rules to an amplitude of said audio signal, to mark significant audio events; performing speech analysis on said audio signal for each said significant audio event to detect the presence of speech; and for each said significant audio event for which speech was detected, performing keyword detection on said audio signal for a predetermined plurality of keywords; wherein where any one keyword is located within said audio signal, associating with said audio signal, data indicative of the presence of said one keyword at a time said keyword is present in said audio signal.
In a specific implementation, the audio volume level is tracked to watch for any changes that are recognised as significant. When these events are noted, time codes associated with the audio events are recorded. The recorded time codes are then used to present those marked sections of the original video for further processing, whether by automatic or manual processes. In a specific implementation, the significant volume events are recognised from a table of heuristic rules developed from a manual analysis of a sample of home (amateur) video recordings.
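By way of illustration only, the following Python sketch (not part of the patent) shows how a detected audio event, expressed as a sample offset, might be recorded as a video time code; the 48 kHz sample rate and 25 fps frame rate are assumed values.

```python
# A minimal sketch, not from the patent: mapping a detected audio event
# (a sample offset) to an hh:mm:ss:ff video time code.
SAMPLE_RATE = 48000   # audio samples per second (assumed)
FRAME_RATE = 25       # video frames per second (assumed)

def sample_to_timecode(sample_index: int) -> str:
    """Convert an audio sample offset into an hh:mm:ss:ff time code."""
    total_seconds, remainder = divmod(sample_index, SAMPLE_RATE)
    frames = remainder * FRAME_RATE // SAMPLE_RATE
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"

# Example: an event detected 90.5 seconds into a clip.
print(sample_to_timecode(int(90.5 * SAMPLE_RATE)))  # -> 00:01:30:12
```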
Brief Description of the Drawings

A number of embodiments of the present invention will now be described with reference to the drawings in which:

Fig. 1 is a schematic block diagram representation of a general purpose computer and associated video editing components with which the preferred embodiment may be implemented;
Fig. 2 is a flowchart of the method steps of the preferred embodiment;
Fig. 3 is a data flow diagram of the preferred embodiment;
Fig. 4 is an example of an audio signal; and
Fig. 5 depicts an audio processing arrangement useful in the preferred embodiment.
Detailed Description

The preferred embodiment of the present invention is described as a computer application program hosted on the Windows™ operating system developed by Microsoft Corporation. However, those skilled in the art will recognise that the described embodiment may be implemented on computer systems hosted by other operating systems; for example, the preferred embodiment can be performed on computer systems running UNIX™, OS/2™ or DOS™. The application program has a user interface that includes menu items and controls that respond to mouse and keyboard operations. The application program also has the ability to transmit and receive data to a connected digital communications network (for example the "Internet").
The preferred embodiment can be practised using a conventional general-purpose (host) computer system, such as the computer system 40 shown in Fig. 1, wherein the application program discussed above and to be described with reference to the other drawings is implemented as software executed on the computer system 40. The computer system 40 comprises a computer module 41, input devices such as a keyboard 42 and mouse 43, output devices including a printer 57 and an audio-visual output device 56 such as may be formed by a video display and loudspeaker arrangement. A Modulator-Demodulator (Modem) transceiver device 52 is used by the computer module 41 for communicating to and from a communications network 59, for example connectable via a telephone line or other functional medium. The modem 52 can be used to obtain access to the Internet, and other network systems. An Ethernet connection may alternatively be used.
The computer module 41 typically includes at least one processor unit 45, a memory unit 46, for example formed from semiconductor random access memory (RAM) and read only memory (ROM), input/output interfaces including a video interface 47, and an I/O interface 48 for the keyboard 42, the mouse 43 and optionally a joystick (not illustrated). A storage device 49 is provided and typically includes a hard disk drive 53 and a floppy disk drive 54. A CD-ROM drive 55 is typically provided as a non-volatile source of data. The components 45 to 49 and 53 to 55 of the computer module 41 typically communicate via an interconnected bus 50 and in a manner which results in a conventional mode of operation of the computer system 40 known to those in the relevant art. Examples of computers on which the embodiments can be practised include IBM-PCs and compatibles, Sun Sparcstations or alike computer systems evolved therefrom. Typically, the application program of the preferred embodiment is resident on the hard disk drive 53 and read and controlled using the processor 45. Intermediate storage of the program and any data fetched from the network may be accomplished using the semiconductor memory 46, possibly in concert with the hard disk drive 53. In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk, or alternatively could be read by the user from the network via the modem device 52.
As seen in Fig. 1, the modem device 52 allows for connection to the network which may act as a source of digital video information including both video images and an accompanying audio track. Alternatively, a video input interface 90 may be provided which includes an input 91 configured to receive digital video information, for example from a digital video camera 10, or a series of analog inputs 92 configured to receive video information 93 and audio information 94, each in an analog format, from a device such as an analog video cassette recorder 95. The signals 93 and 94 are input to respective analog-to-digital converters 96 and 97, the outputs of which are, like the digital input 91, applied to the system bus 50 via an isolating buffer 74. Through the provision of the arrangement shown in Fig. 1 and known software applications, video sequences comprising images and audio tracks may be stored, edited and reproduced via the output interface 47 and the audio-video output device 56.
Digital cameras, such as the camera 10, are known to provide for meta data to be associated with the particular audio-visual sequence being recorded. Such meta data usually includes time data such as a real-time commencement of a clip, together with a timer number for each frame. Meta data may also include specific indicators entered by the operator of the camera or added manually in a post-production environment. With video sequences obtained from an analog source, such as the VCR 95, meta data can be added through analysis, for example being performed by the computer system 40.

In devising the present invention, the present inventors considered a number of home (amateur) videos in order to identify those circumstances that influence the presence of speech and its ability to be perceived in a traditional fashion. From their examinations, the present inventors sought to develop rules that may be used to assist in the specific identification of speech among the entire content of the audio track. Whilst the present inventors have specified two significant heuristic rules which are used in the preferred embodiment, it should not be considered that these are the only rules that may be applied. An observation common to both rules is that statistical methods exist that can indicate the probability of an audio signal containing voice information, such being discussed in the aforementioned speech/music discrimination system.
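The patent relies on, but does not reproduce, the cited speech/music discrimination statistics. As a purely hypothetical stand-in, the sketch below scores a section on one classic cue, the burstiness of short-term energy; the window size and the scoring itself are illustrative assumptions.

```python
import numpy as np

def voice_likelihood(section: np.ndarray, sample_rate: int = 48000) -> float:
    """Crude, hypothetical voice-presence score in [0, 1].

    Speech alternates voiced (high energy) and unvoiced or silent
    stretches, so a high variance of short-term energy is weak evidence
    of voice.  This is an illustrative stand-in, not the discriminator
    the patent cites.
    """
    win = int(0.02 * sample_rate)                  # 20 ms analysis windows
    n = len(section) // win
    if n < 2:
        return 0.0
    frames = section[: n * win].reshape(n, win).astype(float)
    energy = (frames ** 2).mean(axis=1)
    # Coefficient of variation: near 0 for steady noise, higher for speech.
    var_ratio = energy.std() / (energy.mean() + 1e-12)
    return float(min(1.0, var_ratio))
```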
A First Rule devised by the present inventors is based on the following observations: when a person recording the video talks, that person's voice is usually recorded relatively loud on the recording medium (video tape) as a consequence of that person's relatively close proximity to the microphone, usually integrated with the video camera; the person recording the video will often comment on events that are considered to be significant as they are being recorded; and it is easier to reduce background sounds (ie. noise), and hence process the audio track by seeking to identify words in the audio track, if the voice is significantly above the level of the background (noise).
The First Rule is practically resolved in such a fashion that if a section of the audio track is detected as having sufficient volume change, according to a threshold set by the user and/or an ambient volume level, then that section should be further checked for the presence of voice information. If there is a strong statistical likelihood of voice information in that section, then the section should be further processed to determine if any keywords can be extracted from the audio section.
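A minimal sketch of how the First Rule might be resolved digitally, assuming a per-window RMS level, a median-based ambient estimate and a user-set decibel threshold (all three are assumptions, not details from the patent):

```python
import numpy as np

def mark_loud_sections(audio: np.ndarray, sample_rate: int,
                       threshold_db: float = 10.0) -> list[tuple[int, int]]:
    """Return (start, end) sample ranges whose level exceeds the ambient
    level by threshold_db; candidates for the voice and keyword checks."""
    audio = np.asarray(audio, dtype=float)
    win = int(0.1 * sample_rate)                    # 100 ms windows
    n = len(audio) // win
    rms = np.sqrt((audio[: n * win].reshape(n, win) ** 2).mean(axis=1))
    ambient = np.median(rms)                        # crude ambient estimate
    loud = 20 * np.log10((rms + 1e-12) / (ambient + 1e-12)) > threshold_db
    sections, start = [], None
    for i, flag in enumerate(loud):
        if flag and start is None:
            start = i                               # section opens
        elif not flag and start is not None:
            sections.append((start * win, i * win)) # section closes
            start = None
    if start is not None:
        sections.append((start * win, n * win))
    return sections
```

The returned ranges would then feed the statistical voice check and keyword extraction described in the text.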
A Second Rule is based upon the following observations: many formal occasions include speeches which are generally preceded by periods of relatively low audio level (background noise) which is indicative of respectful silence; once the speech commences, it will usually be the only source of audio information; and some formal occasions can be identified by certain keywords in the speech.
The Second Rule resolves in practice that if the user has identified the video clip as a formal occasion, then the audio track is able to be searched for periods of quiet, bordered by levels close to silence. Those audio sections identified are then statistically examined for voice presence and processed for keywords. The keywords found are then compared, using a thesaurus, with various collections of words associated with different formal events in an attempt to identify the event. Examples of such keywords might include the phrase "husband and wife" in respect to a wedding, and identification of such keywords may assist in distinguishing words of the marriage celebrant which might be later followed by the words of the "best" man whilst making a speech.
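A corresponding sketch for the Second Rule, scanning a per-window RMS track for a sustained near-silent stretch followed by a rise in level; the silence ratio and minimum quiet duration are illustrative assumptions, not values from the patent:

```python
import numpy as np

def mark_post_silence_speech(rms: np.ndarray,
                             silence_ratio: float = 0.1,
                             min_quiet_windows: int = 20) -> list[int]:
    """Given a per-window RMS track, return window indices where the
    level rises after a sustained near-silent stretch -- candidate
    starts of a formal speech.  Thresholds are illustrative only."""
    silence_level = silence_ratio * float(rms.max())
    candidates, quiet_run = [], 0
    for i, level in enumerate(rms):
        if level <= silence_level:
            quiet_run += 1                  # still inside the quiet period
        else:
            if quiet_run >= min_quiet_windows:
                candidates.append(i)        # speech likely begins here
            quiet_run = 0
    return candidates
```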
Fig. 2 depicts a flow diagram of a particular embodiment 100 in which an audio input signal 102 and any associated meta data is downloaded to an editing system such as the arrangement shown in Fig. 1 or a dedicated hardware device. Specific method steps of the process 100 may be performed either entirely by software, entirely by hardware, or by a combination of both hardware and software. Initially at step 104, amplitude analysis is performed on the input audio signal to mark segments thereof as potentially containing voice information. Such marking can include modifying the meta data. At step 106 that follows, the heuristic rules mentioned above are applied to the amplitude analysis of the marked segment from step 104. When a match with the heuristic rules is found, the segments that match are processed at step 108 to perform keyword extraction. Where any keywords are extracted during the process of step 108, the extracted keywords are applied in step 110 as meta data to the raw video footage. Particularly, the application of the keywords as meta data involves associating the keywords with the time code which forms part of the meta data associated with a digital video visual signal, as discussed above.
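Steps 104 to 110 amount to a three-stage pipeline. The hypothetical glue below wires the stages together; the three stage functions are assumed interfaces standing in for whatever amplitude, speech and keyword analyses an implementation supplies, not APIs defined by the patent:

```python
# Hypothetical glue for the Fig. 2 pipeline (steps 104-110).  The three
# stage functions are assumed interfaces, not APIs defined by the patent.
def index_audio(audio, sample_rate, keywords,
                amplitude_stage, speech_stage, keyword_stage):
    meta_data = []
    # Steps 104/106: amplitude analysis plus heuristic rules mark segments.
    for start, end in amplitude_stage(audio, sample_rate):
        segment = audio[start:end]
        # Step 106 -> 108: only statistically voice-like segments go on.
        if not speech_stage(segment, sample_rate):
            continue
        # Step 108: keyword extraction on the surviving segment.
        for word, offset in keyword_stage(segment, sample_rate, keywords):
            # Step 110: associate each hit with its position in the footage.
            meta_data.append({"keyword": word, "sample": start + offset})
    return meta_data
```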
A further embodiment of the present invention is partly performed in real-time during recording of the video by the camera. This is seen in Fig. 3 for a process 120 which includes a digital video camera 130 where an optical signal is detected and converted via an analog-to-digital converter 132 into a video signal component 134, and an audio signal is converted using an analog-to-digital converter 136 into an audio component 138. The camera 130 also includes a timer 140 which drives a frame clock 142 that provides a frame clock signal 146 associated with the audio 138 and video 134 components. As illustrated, the digitised audio signal 138 is input to an accumulator 144 which is also input with the frame clock signal 146. In this fashion, the individual audio samples 138 over a single frame can be accumulated in the accumulator 144. On completion of each frame, as identified by the frame clock signal 146, the output of the accumulator 144 is combined with the frame clock signal 146 to provide meta data 148. The meta data 148 is combined with the audio 138 and video 134 to provide a composite digital video signal 150 which is typically recorded on to a recording medium 152. In this fashion, the meta data 148 includes a value for each frame representative of the average volume level of the audio signal 138. Because each frame is of the same duration, there is no need to further process the raw accumulated value as this, of itself, is indicative of an average over that individual unit of time (ie. a frame period).
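Because every frame has the same duration, the accumulator's raw per-frame sum already serves as the average volume value carried in the meta data 148. A digital sketch of that accumulation, assuming 48 kHz audio against a 25 fps frame clock (values not given in the patent):

```python
import numpy as np

SAMPLES_PER_FRAME = 48000 // 25   # 48 kHz audio, 25 fps clock (assumed)

def per_frame_volume(audio: np.ndarray) -> np.ndarray:
    """Accumulate |sample| over each frame period, mimicking the
    accumulator 144.  No division is needed: every frame has the same
    duration, so the raw sum already behaves as a per-frame average."""
    n = len(audio) // SAMPLES_PER_FRAME
    frames = audio[: n * SAMPLES_PER_FRAME].reshape(n, SAMPLES_PER_FRAME)
    return np.abs(frames).sum(axis=1)
```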
Fig. 3 also indicates certain processing based upon the composite signal 150 that may occur either within the camera 130 or within an off-line post-processing environment, such as may be achieved with a purpose built editing apparatus or using a general purpose computer such as that described above. In particular, as illustrated, the composite signal 150 may be extracted from the recording medium 152 or directly from the output 150, whereupon it is divided into its component parts of meta data 148, audio 138 and video 134. The meta data 148, and in particular the average amplitude components, are subject to amplitude analysis in an analyser 154 according to the various heuristic rules mentioned above to identify those parts of the audio 138 that include volume fluctuations that may be potentially significant. The analyser 154, having identified significant volume fluctuations, instigates a check of the audio signal 138 in a process 156 to determine if the volume fluctuations arise from the presence of voice data. Such a check will occur over a number of video frames since almost all words and utterances exceed a single frame period. If voice data is detected, the audio is transferred to a keyword extraction process 158 which extracts any keywords 160, which are combined with the meta data 148 to provide expanded meta data 162. The expanded meta data 162 is combined with the audio and video components 138 and 134 to provide an enhanced composite signal 164 which may then be stored or otherwise processed 166. Typically the expanded meta data 162, including keyword identification, is stored in the composite signal immediately before the utterance of the keyword. This permits the expanded meta data 162 to be searched to provide for ready reproduction of the audio-visual information immediately upon its location being identified.
Advantageous results are obtained using the arrangements of Figs. 1, 2 and 3 in that the identification of specific keywords forms part of the meta data associated with the raw video footage. As a consequence, that meta data provides for specific segments of the raw video footage to be readily accessible for playback and/or editing so that salient features of the raw video footage may be extracted in an automated fashion or in a manual fashion whilst substantially minimising the manual effort required compared to previous arrangements.
Fig. 4 depicts an amplitude-time plot that illustrates one form of amplitude detection that may be used in an embodiment of the present invention. The amplitude scale has indicated upon it two relative amplitude levels, A1 and A2. The level A2 is provided to indicate a relative background level of noise which is desired to be discriminated against as being something not containing voice signals desired to be interpreted. The relative level A1 is provided as being indicative of audio signals which may contain speech levels desired to be identified. As illustrated, Fig. 4 contains a number of sections, a first section 200 for example being indicative of background noise in a meeting room or as the speaker assumes the podium. The section 202 is also indicative of background noise but during a period of respectful silence before the speaker commences the speech. A period 204 includes areas of high amplitude indicative of the speaker making his speech. It is noted that some areas within the period 204 are of lower amplitude, representing extended pauses and the like within the speech. The conclusion of the speech is marked by a period 206 approaching the background noise and then a period 208 of thunderous applause from the audience. The levels are relative since their absolute values are not necessarily indicative for speech identification and determination.
Of significance, however, is the relative difference between the levels A1 and A2, which may be set as desired dependent upon the particular levels of noise and recording levels obtained. In some instances the levels A1 and A2 may be close to coincident, but in most instances the difference between levels A1 and A2 will be clearly determinable.
Fig. 5 depicts a specific arrangement for the detection of amplitude and discriminating possibly wanted signals from unwanted background noise. Significantly, the arrangement of Fig. 5 represents a combination of filtering and comparison processes which may be performed either in hardware using analog electronics, or alternatively digitally, either in software or hardware digital signal processing.
As seen in Fig. 5, the processing arrangement 210 commences with an audio input 212 which may be either an analog or digital signal as appropriate. The audio signal 212 is then converted into a magnitude form 214, typically by full wave rectification. The magnitude indicative output 216 is then provided to a pair of filters 218 and 220 each having respective filter time constants τ1 and τ2. For example, the first time constant τ1 may be set to filter out all relatively high frequency transients, to thereby be indicative of a background level of noise that is substantially constant over a predetermined period of time set by the period τ1. The second filter 220 may also allow passage of signals but of higher frequency content, thereby being indicative of transients such as high level speech which may be desired for detection. In each instance, the filters 218 and 220 may be implemented by low pass filters. The output of the filters 218 and 220 may be directly compared in a comparator 222 to provide a meta data signal 230 indicative of those instances when any relatively high frequency transients exceed a longer term average provided from the first filter 218. In an additional or alternative fashion, each of the filtered signals may be provided to a subtraction circuit 226 which determines the absolute difference between the two filtered signals. The output of the circuit 226 is provided to a threshold circuit 224 which determines if the relative difference exceeds a predetermined value. If so, a meta data signal 228 is provided which predicts the likelihood of speech existing in the audio input 212.
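A digital rendering of the Fig. 5 arrangement might use two one-pole low-pass filters on the rectified signal, a slow one (τ1) tracking the background level and a fast one (τ2) tracking transients, with the comparator output 230 and thresholded difference output 228 derived from them. The time constants and threshold below are assumed values, not figures from the patent:

```python
import numpy as np

def spotting_signals(audio: np.ndarray, sample_rate: int,
                     tau1: float = 1.0, tau2: float = 0.02,
                     threshold: float = 0.05):
    """One-pole low-pass pair on the rectified input, after Fig. 5.

    Returns (signal_230, signal_228): the comparator output and the
    thresholded absolute-difference output.  tau1, tau2 and threshold
    are illustrative values, not specified by the patent."""
    magnitude = np.abs(audio.astype(float))       # full-wave rectification 214
    a1 = np.exp(-1.0 / (tau1 * sample_rate))      # slow filter 218 coefficient
    a2 = np.exp(-1.0 / (tau2 * sample_rate))      # fast filter 220 coefficient
    slow = np.empty_like(magnitude)
    fast = np.empty_like(magnitude)
    s = f = 0.0
    for i, m in enumerate(magnitude):
        s = a1 * s + (1 - a1) * m                 # long-term (background) level
        f = a2 * f + (1 - a2) * m                 # short-term (transient) level
        slow[i], fast[i] = s, f
    signal_230 = fast > slow                      # comparator 222
    signal_228 = np.abs(fast - slow) > threshold  # subtractor 226 + threshold 224
    return signal_230, signal_228
```

In the analog alternative the text names, the same pair of time constants would simply be RC smoothing networks feeding a comparator.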
It will be apparent from the foregoing that an amplitude-based mechanism utilising heuristic rules may be used to identify the presence of voice in an audio track, such indications then being used to extract particular keywords from the voice to provide meta data useable in editing to combine a composite video signal from which the audio is sourced.
Further, whilst the preferred embodiment is described as a computer program product, other configurations may be implemented. For example, hardware implementations including combinations of analog and digital circuitry may be formed.
Alternatively, implementation in an application specific integrated circuit (ASIC) can simplify application in a variety of circumstances.
The foregoing describes only a number of embodiments of the present invention and modifications can be made thereto without departing from the scope of the present invention.
In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including" and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have corresponding meanings.

Claims (4)

1. A method of indexing speech information in an audio signal, said method comprising the steps of: examining said audio signal by applying heuristic rules to an amplitude of said audio signal, to mark significant audio events; performing speech analysis on said audio signal for each said significant audio event to detect the presence of speech; and for each said significant audio event for which speech was detected, performing keyword detection on said audio signal for a predetermined plurality of keywords; wherein where any one keyword is located within said audio signal, associating with said audio signal, data indicative of the presence of said one keyword at a time said keyword is present in said audio signal.
2. A method according to claim 1, wherein said audio signal is associated with data comprising at least time code information related to a real-time reproduction of said audio signal.
3. A method according to claim 1 or 2, wherein said heuristic rules comprise distinguishing portions of said audio signal of relatively high amplitude from portions of relatively low amplitude, said high amplitude portions being marked as significant audio events.
4. A method according to claim 3, wherein said heuristic rules further comprise detecting in said portions periods of relatively high rate of change of said amplitude, and marking the same as significant events.

5. A method according to any one of the preceding claims, wherein said heuristic rules comprise detecting a low amplitude first portion of said audio signal compared to an amplitude of a portion preceding said first portion, and marking a second portion of relatively high amplitude following said first portion as a significant event.

6. A method according to any one of the preceding claims, wherein said audio signal forms part of a video sequence including a visual signal.

7. A method of indexing speech information substantially as described herein with reference to any one of the embodiments as illustrated in the drawings.

8. Apparatus configured to perform the method of any one of claims 1 to 7.

9. A computer program product including a computer readable medium having a series of instructions configured to perform the method of any one of claims 1 to 7.

10. A computer readable medium having a series of instructions configured to perform the method of any one of claims 1 to 7.

11. Apparatus for indexing speech information in an audio signal, said apparatus comprising: means for examining an amplitude of said audio signal to mark significant events therein, said means for examining applying at least a plurality of heuristic rules to said audio signal; means for analysing said audio signal at each said significant event to detect the presence of speech; keyword detection means for identifying one or more keywords in the detected speech; and data means for marking said audio signal with associated data where a keyword is identified.

Dated this Thirteenth Day of April 2000
CANON KABUSHIKI KAISHA
Patent Attorneys for the Applicant
SPRUSON & FERGUSON
AU27742/00A 1999-04-19 2000-04-13 Audio spotting advisor Ceased AU736314B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU27742/00A AU736314B2 (en) 1999-04-19 2000-04-13 Audio spotting advisor

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
AUPP9838 1999-04-19
AUPP9838A AUPP983899A0 (en) 1999-04-19 1999-04-19 Audio spotting advisor
AU27742/00A AU736314B2 (en) 1999-04-19 2000-04-13 Audio spotting advisor

Publications (2)

Publication Number Publication Date
AU2774200A AU2774200A (en) 2000-10-26
AU736314B2 true AU736314B2 (en) 2001-07-26

Family

ID=25620378

Family Applications (1)

Application Number Title Priority Date Filing Date
AU27742/00A Ceased AU736314B2 (en) 1999-04-19 2000-04-13 Audio spotting advisor

Country Status (1)

Country Link
AU (1) AU736314B2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0649144A1 (en) * 1993-10-18 1995-04-19 International Business Machines Corporation Automatic indexing of audio using speech recognition
EP0877378A2 (en) * 1997-05-08 1998-11-11 British Broadcasting Corporation Method of and apparatus for editing audio or audio-visual recordings
EP0902431A2 (en) * 1997-09-12 1999-03-17 Philips Patentverwaltung GmbH System for editing of digital video and audio information

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0649144A1 (en) * 1993-10-18 1995-04-19 International Business Machines Corporation Automatic indexing of audio using speech recognition
EP0877378A2 (en) * 1997-05-08 1998-11-11 British Broadcasting Corporation Method of and apparatus for editing audio or audio-visual recordings
EP0902431A2 (en) * 1997-09-12 1999-03-17 Philips Patentverwaltung GmbH System for editing of digital video and audio information

Also Published As

Publication number Publication date
AU2774200A (en) 2000-10-26

Similar Documents

Publication Publication Date Title
EP1547060B1 (en) System and method for generating an audio thumbnail of an audio track
CN103035247B (en) Based on the method and device that voiceprint is operated to audio/video file
US6697564B1 (en) Method and system for video browsing and editing by employing audio
US10977299B2 (en) Systems and methods for consolidating recorded content
EP1692629B1 (en) System & method for integrative analysis of intrinsic and extrinsic audio-visual data
US7184955B2 (en) System and method for indexing videos based on speaker distinction
JP4600828B2 (en) Document association apparatus and document association method
EP1531478A1 (en) Apparatus and method for classifying an audio signal
US7521622B1 (en) Noise-resistant detection of harmonic segments of audio signals
US20080187231A1 (en) Summarization of Audio and/or Visual Data
WO2007004110A2 (en) System and method for the alignment of intrinsic and extrinsic audio-visual information
US7010485B1 (en) Method and system of audio file searching
WO2008050718A1 (en) Right information extracting device, right information extracting method and program
US9058384B2 (en) System and method for identification of highly-variable vocalizations
US7349477B2 (en) Audio-assisted video segmentation and summarization
Venkatesh et al. Artificially synthesising data for audio classification and segmentation to improve speech and music detection in radio broadcast
JP3437617B2 (en) Time-series data recording / reproducing device
Pfeiffer Pause concepts for audio segmentation at different semantic levels
US7680654B2 (en) Apparatus and method for segmentation of audio data into meta patterns
AU736314B2 (en) Audio spotting advisor
JPH08249343A (en) Device and method for speech information acquisition
JP2022150777A (en) Section extraction device and program
Besacier et al. Video story segmentation with multi-modal features: experiments on TRECvid 2003
Li et al. Caption-aided speech detection in videos
Saraceno et al. Audio-visual processing for scene change detection

Legal Events

Date Code Title Description
FGA Letters patent sealed or granted (standard patent)