Phoneme Extraction System
Field of the Invention
The present invention relates to a phoneme extraction system and refers particularly, though not exclusively, to such a system for producing timed phonemes.
Background to the Invention
Historically, the storage and distribution of sound and images (as opposed to text files) has proved to be a major challenge, both technically and economically. Indeed, digital video was hardly conceivable thirty years ago. Around 1970, however, came an explosion of interest in the field of image processing, and about a decade later the first mature sound and image data compression techniques became available.
MPEG 1, 2 and 4 have since become the primary industry-standard formats that define schemas for the compression and transport of multimedia files. It is thanks to these techniques that it is now possible to access the plethora of digital video and audio media available via DVDs or the Internet.
The compression/decompression standards ("CODECs") provide for the processing, delivery and storage of content in digital format. Each CODEC's role in today's communication realm may be summarised as follows:
MPEG 1 - CD/Internet/MP3 (1-8 Mb/sec)
MPEG 2 - DVD/Satellite (4-84 Mb/sec)
MPEG 4 - Internet/GPRS/G3 (>1.5 Mb/sec)
Although these existing MPEG standards have served the industry well, they are all merely means to efficiently store and process media. This is why a second coding generation has evolved, known as MPEG-7 (formerly known as the Multimedia Content Description Interface).
With the new techniques of MPEG-7, video is not processed simply in terms of pixels and frames, but of its content. For these techniques to be efficient, however, semantically meaningful information must be extracted from the input stream with the minimum human input possible. Therefore, advanced multimedia analysis is needed.
MPEG-7 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), the committee that also developed the successful standards known as MPEG-1 (1992) and MPEG-2 (1994), and the MPEG-4 standard (Version 1 in 1998, and version 2 in 1999).
The MPEG-1 and MPEG-2 standards have enabled the production of widely adopted commercial products, such as Video CD, MP3, digital audio broadcasting (DAB), DVD, digital television (DVB and ATSC), and many video-on-demand trials and commercial services.
MPEG-4 is the first real multimedia representation standard, allowing interactivity and a combination of natural and synthetic material, coded in the form of objects (it models audio-visual data as a composition of these objects). MPEG-4 provides the standardised technological elements enabling the integration of the production, distribution and content access paradigms of the fields of interactive multimedia, mobile multimedia, interactive graphics and enhanced digital television.
MPEG-7 offers a comprehensive set of audio-visual Description Tools to describe multimedia content. Both human users and automatic systems that process audio-visual information are within the scope of MPEG-7. The metadata elements and their structure and relationships, that are defined by the standard in the form of Descriptors and Description Schemes used to create descriptions, will form the basis for applications enabling the needed effective and efficient access (search, filtering, browsing and processing) to multimedia content.
MPEG-7 Description Tools allow descriptions (i.e., a set of instantiated Description Schemes and their corresponding Descriptors at the users' will) of content that may include:
• Information describing the creation and production processes of the content (director, title, short feature movie).
• Information related to the usage of the content (copyright pointers, usage history, broadcast schedule).
• Information of the storage features of the content (storage format, encoding).
• Structural information on spatial, temporal or spatio-temporal components of the content (scene cuts, segmentation in regions, region motion tracking).
• Information about low level features in the content (colours, textures, sound timbres, melody description).
• Conceptual information of the reality captured by the content (objects and events, interactions among objects).
• Information about how to browse the content in an efficient way (summaries, variations, spatial and frequency sub-bands)
• Information about collections of objects.
• Information about the interaction of the user with the content (user preferences, usage history).
MPEG-7 data may be physically located with the associated audio-visual material, in the same data stream, or on the same storage system. The descriptions could also be somewhere else. For cases when the content and its descriptions are not co-located, the standard provides mechanisms that link the multimedia material and their MPEG-7 descriptions.
MPEG-7 addresses many different applications in many different environments, which means that it needs to provide a flexible and extensible framework for describing audiovisual data. Therefore, MPEG-7 does not define a monolithic system for content description but rather a set of methods and tools for the different viewpoints of the description of audio-visual content.
MPEG-7 also uses XML as the language of choice for the textual representation of content description, as XML Schema has been the base for the DDL (Description Definition Language) that is used for the syntactic definition of MPEG-7 Description Tools and for allowing extensibility of Description Tools (either new MPEG-7 Description Tools or application-specific Description Tools). Considering the popularity of XML, its use will facilitate interoperability with other metadata standards.
The main elements of the MPEG-7's standard are:
• Description Tools: Descriptors (D), that define the syntax and the semantics of each feature (metadata element); and Description Schemes (DS), that specify the structure and semantics of the relationships between their components, that may be both Descriptors and Description Schemes;
• Description Definition Language (DDL) to define the syntax of the MPEG-7 Description Tools and to allow the creation of new Description Schemes and to allow the extension and modification of existing Description Schemes;
• System tools, to support binary coded representation for efficient storage and transmission, transmission mechanisms (both for textual and binary formats), multiplexing of descriptions, synchronisation of descriptions with content, management and protection of intellectual property in MPEG-7 descriptions, etc.
MPEG-7 addresses applications that can be stored (on-line or off-line) or streamed (e.g. broadcast, push models on the Internet), and can operate in both real-time and non-real-time environments. A 'real-time environment' in this context means that the description is generated while the content is being captured.
Definitions
Throughout this specification a reference to a "machine" (and its grammatical variants) is to be taken as including a reference to one or more of: server, desktop computer, personal computer, laptop computer, notebook computer, tablet computer, and personal digital assistant.
Throughout this specification a reference to a phrase is to be taken as including a reference to a clause.
Summary of the Invention
According to one aspect of the invention there is provided a process for the extraction of phonemes from an audio track and a script, the script corresponding to the audio track, including the steps:
(a) breaking the script into phonemes;
(b) using a speech recognition process on the audio track to produce a processed audio track;
(c) generating a phrase of the script using the phonemes;
(d) conducting a recognition process to recognise the phrase in the processed audio track; and
(e) extracting the phonemes from an output of the recognition process.
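The steps (a) to (e) above may be sketched in outline as follows. This is a minimal illustrative sketch only: the toy grapheme-to-phoneme table and the stand-in recogniser are assumptions for illustration and do not represent any particular text-to-speech or speech recognition engine.

```python
# (a) break the script into phonemes using a toy grapheme-to-phoneme table;
# a real system would use a text-to-speech engine for this step
G2P = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def script_to_phonemes(script):
    # the script is broken into words before the phonemes are obtained
    return [(word, G2P.get(word, [])) for word in script.lower().split()]

# (b)-(d) a stand-in for the speech recognition process: each recognised
# word with a start time in seconds (hard-coded purely for illustration)
def recognise(audio_track):
    return {"hello": 0.00, "world": 0.62}

# (e) extract timed phonemes by matching script words to recognised times
def extract_timed_phonemes(script, audio_track):
    recognised = recognise(audio_track)
    timed = []
    for word, phonemes in script_to_phonemes(script):
        if word in recognised:
            timed.append((word, phonemes, recognised[word]))
    return timed

print(extract_timed_phonemes("Hello world", audio_track=None))
```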
The phonemes may be obtained using a text-to-speech process, and the script may be broken into words before the phonemes are obtained. Any unrecognised phrases in the audio track after the recognition process may be subsequently processed in the same manner with reset threshold and accuracy of the speech recognition process. Prior to the speech recognition process, parameters and next phrase logic of the speech recognition process may be set.
The audio track may be analysed to produce a time location for words in the audio track. The speech recognition process on the script may also produce a description of words and phoneme length timings. The description of words, phonemes and phoneme length timings may be further processed with a timer input to produce a timed word/phoneme list. The timer input may be obtained from the audio track.
The audio track may be part of an audio/video input and the audio/video input may be first converted to be MPEG1/MPEG2 compliant before separating the audio track from the audio/video input.
The timed phonemes may be subjected to further processing to provide a speaker phoneme set. The speech recognition process may also produce marked-up phonemes. The further processing may be a phoneme analysis. The phoneme analysis may be an MPEG 7 word/phoneme analysis using sound spectrum and sound waveforms. The
audio track may also be input into an MPEG 7 audio low level descriptors extraction process and an MPEG 7 emotive analysis for extraction of emotion-related descriptors.
In another aspect of the present invention there is provided a computer usable medium comprising a computer program code that is configured to cause one or more processors to execute one or more functions to perform the steps and functions as described above.
Description of the Drawings
In order that the invention may be readily understood and put into practical effect, there shall now be described by way of non-limitative example only a preferred embodiment of the present invention, the description being with reference to the accompanying illustrative drawings, in which:
Figure 1 is an overall chart of a preferred embodiment of the present invention;
Figure 2 is a flow chart for the initialising process of Figure 1;
Figure 3 is a flow chart for a preferred process for phoneme extraction;
Figure 4 is a flow chart for a preferred process for timed phoneme extraction;
Figure 5 is a flow chart for audio low level descriptors extraction;
Figure 6 is a flow chart of emotive analysis;
Figure 7 is a flow chart of word/phoneme analysis; and
Figure 8 is a flow chart of the recognition process.
Description of the Preferred Embodiment
Referring to Figure 1, there is illustrated the overall process. At 101 the machine of a user connects to the relevant web page to initiate the process. However, the process may be performed on a stand-alone machine.
The audio and, if possible, text files requiring processing for which a timing track is required are selected and at 102 sent to the server operating the web page, for processing. The server performs the necessary processing in 103 and sends the timing track (and other requested outputs) to the machine of the user.
The first stage 101 of the process is illustrated in more detail in Figure 2. As is shown, each of the following is selected:
(a) the script in text format - 104;
(b) the voice track (audio file) - 105;
(c) the number of words to be recognised at the one time - 106;
(d) the output type and the output parameters. This may be either or both of words and phonemes - 107; and
(e) the output time units as either or both of milliseconds and frames - 108.
Upon the five selections being made, at 109 the process is initiated. At 110 a preliminary check is made to ensure all parameters are correct. If not, at 111 an incorrect parameter message is generated and sent to the user's machine. The five input parameters (a) to (e) can then be reselected, reset or the error corrected. If all parameters are correct, at 112 all data is sent for processing and the input stage ends.
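The preliminary parameter check at 110 may be sketched as follows; the field names and validation rules are illustrative assumptions rather than part of the specification.

```python
# assumed valid values for selections (d) and (e)
VALID_OUTPUT_TYPES = {"words", "phonemes", "words+phonemes"}
VALID_TIME_UNITS = {"milliseconds", "frames", "milliseconds+frames"}

def check_parameters(params):
    """Check the five input selections (a)-(e); return a list of errors."""
    errors = []
    if not params.get("script_path", "").endswith(".txt"):
        errors.append("script must be a text file")
    if not params.get("audio_path"):
        errors.append("voice track not selected")
    if not isinstance(params.get("words_at_once"), int) or params["words_at_once"] < 1:
        errors.append("number of words to recognise must be a positive integer")
    if params.get("output_type") not in VALID_OUTPUT_TYPES:
        errors.append("invalid output type")
    if params.get("time_units") not in VALID_TIME_UNITS:
        errors.append("invalid time units")
    return errors  # an empty list means all parameters are correct
```

If the returned list is non-empty, an incorrect-parameter message (111) can be sent back to the user's machine for reselection.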
To now refer to Figure 3, there is illustrated in more detail the process for phoneme extraction 103 of Figure 1.
The process commences with a detailed check at 113 to determine if all parameters are correct. This is a more complete test than that conducted at 110 and is also to ensure the transmission from the sender's machine to the server has been error-free. If there is an error, an error message is generated at 114, sent to the sending machine, and all processing stops.
The script is processed at 115 to break it down into individual words. If there is an error, an error message is generated at 114, sent to the sending machine, and all processing stops.
At 116 the words of the script are broken into phonemes using a text-to-speech engine such as, for example, "SAPI 5.1" as available from Microsoft Corp., or "Viavoice" as available from IBM Corp. If there is an error, an error message is generated at 114, sent to the sending machine, and all processing stops.
A speech recognition engine is used at 117 to analyse the audio/voice track 105. The parameters for the analysis are first set, and the logic for the next phrase of the audio/voice track 105 is also set. A suitable speech recognition engine would be, for example, "SAPI 5.1" as available from Microsoft Corp., or "Viavoice" as available from IBM Corp. At 118, if processing is at the commencement of the audio track and script, the first phrase of the script is generated from the output of 116. For subsequent processing, the next phrase of the script is generated. At the end of the script there will be nothing left, so processing is diverted at 119 to the next major processing stage. Otherwise, it proceeds to 120 where the present voice track position is set.
The phonemes used and their relative timings come from the script. The phoneme timing, and time location in the audio track of the phoneme come from the audio track. By analysing on a phrase-by-phrase basis from both the script and the audio track, the phonemes can be timed for duration and location with a relatively high degree of accuracy. The overall length of the audio track is known and this will be the overall duration of the script. Therefore, it is possible to estimate an approximate time in the duration of the script where a particular phrase might appear.
Therefore, in process step 121, the current time location is recognised from the audio track as is the current script phrase as appearing on the audio track using speech recognition. By recognising phonemes of the script phrase on the audio track and matching with the phonemes of that phrase from the script, a "most likely position" for that script phrase can be determined and thus its likely time location on the audio track. Phonemes missing from the script may be obtained from the audio track using the voice properties of the speaker as obtained from the closely matched complete set, or by synthesizing the missing phonemes that don't have a close match. Manual intervention may be required to correct errors depending on the subject of the text, number of slang expressions used, and the diction on the audio track.
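The approximate-time estimate described above may be illustrated as follows, assuming (purely for illustration) that a phrase's fractional character offset within the script approximates its fractional time position in the audio track.

```python
# Since the script spans the whole audio track, a phrase's fractional
# position in the script gives an approximate time at which to look for
# it in the audio. This is a rough first estimate only.
def estimate_phrase_time(script, phrase, audio_duration_s):
    offset = script.index(phrase)       # character offset of the phrase
    fraction = offset / len(script)     # fractional position in the script
    return fraction * audio_duration_s  # approximate time in the audio

script = "the quick brown fox jumps over the lazy dog"
t = estimate_phrase_time(script, "jumps", audio_duration_s=10.0)
print(round(t, 2))  # the phrase sits a little before mid-script
```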
The word times are then updated in 122 using the results of the recognition process, and the next phrase of the script prepared. The process then loops back to 118 and repeats until there is nothing left in the script.
When there is nothing left, the process is diverted at 119 and passes to a second stage of processing. Here, any unrecognised phrases in the script are re-examined. If there are no unrecognised phrases, the output from 122 is directly to 129 where the timed word/phoneme list is generated.
If there are unrecognised phrases in the script (normally due to the speech recognition engine producing an unusable result for a phrase), at 123 the speech recognition engine threshold and accuracy are set. The first (or, for subsequent phrases, the next) phrase of the script is generated. If there are no more unrecognised phrases, at 125 the process is diverted to 129 to generate the timed word/phoneme list.
If there is an unrecognised phrase, the voice track position is set at 126, as before. The recognition process of 121 is re-performed at 127 but with the new thresholds and accuracy. Again, at 128 the recogniser results are used to update the word times, and the next unrecognised phrase prepared. The process loops back to 124.
When all processing has completed, the output is generated at 129 and the process ends.
The recognition process at 121 is further illustrated in Figure 8.
At the beginning 159 of the recognition process, audio 167 is input at 163 for processing in accordance with the MPEG 7 sound classification model audio classifier. This extracts and outputs at 171 the timed sound classifications. Sound classifications include non-word sounds such as, for example, inhalation, laughter, natural sounds, noises, background sounds and noises, music, and so forth.
Audio 166 is input at 117 as described above for speech recognition analysis preferably using the Hidden Markov Method ("HMM") for raw recognition and timing. The output 172 is an aligned raw phoneme list.
The text 165 is input for processing as described above for text-to-speech (phoneme) and the output 173 is a normally timed phonetic text representation.
The audio and text are input at 164 into process 121 as described above for phrase-segmented, time-assisted speech recognition. The output 174 is a verified, aligned word/phoneme list.
Outputs 171, 172 and 173 are input to process 168 for initial text alignment for words and the sound classifications. The output 175 from process 168 is estimated word timings and partial phoneme timings. This is also input into 121 for its output 174. The output 175 is also input to process 169, as is output 174, for final text alignment (including some or all of the sound classifications). The output 176 from 169 is a complete timed/word/phoneme list. The recognition process ends at 170.
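The final text alignment at 169 may be sketched as a time-ordered merge of the verified word/phoneme list with the timed sound classifications; the tuple shapes below are illustrative assumptions.

```python
# Merge the verified, aligned word/phoneme list (output 174) with the
# timed sound classifications (output 171) into one complete timed list,
# ordered by start time.
def final_alignment(word_phonemes, sound_classifications):
    # both inputs: lists of (start_time_s, label) tuples
    return sorted(word_phonemes + sound_classifications, key=lambda e: e[0])

words = [(0.0, "HH"), (0.1, "AH"), (0.9, "W")]
sounds = [(0.5, "inhalation")]  # a non-word sound classification
print(final_alignment(words, sounds))
```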
To now refer to Figure 4, there is shown the initial processing of an audio/video input 8 as received in the machine operating the system of the present invention. At input 8 there is an audio/video input and a script corresponding to the audio track of the audio/video input as a text input in digital form. The audio/video input is preferably in source language. This is process 101 and 102 of Figure 1; and steps 104 to 112 of Figure 3.
The audio/video input is split into two processing streams 4, 6. In stream 4 the audio/video 10 is first transcoded at 12 to MPEG1/MPEG2, if not already in the appropriate format. If in the appropriate format, transcoding is not necessary. Audio/video separation takes place and the video is output at 14 and by-passes all subsequent processing at this stage. The audio/video separation is a standard technique using known applications/engines.
In stream 6, the script undergoes a track reading process at 18. These are process steps 113 to 122 of Figure 3. Also input at 18 is a time input to enable each of the phrases, words and phonemes to be timed for likely duration.
The time input may be in seconds down to a preferred level of three decimal places (i.e. milliseconds). The output from process 18 is timed phonemes. It may also include marked-up phonemes and descriptions of the words. Preferably, all output to 20 is in MPEG 7 representation. In the stage of processing at 20 the phoneme timings are completed in accordance with process steps 123 to 128 of Figure 3 to give a speaker phoneme set that is as complete as possible. The output 30 from process 20 is passed to data storage 22 for storage and to be used later in other processes.
The audio output 32 from process 12, and the outputs from process steps 18, 20, are all passed for three parallel enhancement process steps.
In the first enhancement step 24, the audio output 32 from process 12 is passed for MPEG 7 audio low level descriptors ("LLD") extraction. LLDs may include such characteristics as the spectrum of the sound, silence, and so forth. The LLD analysis may proceed in parallel to the track reading process 18, or may be subsequent to that process. If subsequent, the output from process 18 may be also input to process 24.
The audio output 32 from process 12 and the result 34 of the track reading process 18 are both input to the second enhancement step 26. In this step MPEG 7 emotive analysis is performed by extracting emotion-related descriptors from the audio. An emotive analysis based on MPEG 7 LLDs is then performed.
The third enhancement step 28 has as its input the text and phoneme timing output 30 of process 20, the audio output 32 from process 12, and a timer 36. The timer 36 may be in seconds down to a preferred level of three decimal places (i.e. milliseconds) and may start from the commencement of the audio track, or otherwise as prescribed or required. The audio track 32 is combined with the output 30 and the timer 36, and analysed and described into a timed word/phoneme list. The output 30 gives the phonemes and their likely duration. The audio track 32 and the timer 36 give the time location of the phonemes. By matching the timed phonemes and the audio track 32, based on a likely match, a timed word/phoneme list can be prepared. This is output 38 from process 28 and is stored in data storage 22 for later use. The outputs 40, 42 from processes 26, 24 respectively are also stored in data storage 22 for subsequent use. The processing then ends.
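The combination of the phoneme durations (output 30) with the timer 36 to produce a timed word/phoneme list may be sketched as follows; the phoneme durations are assumed inputs.

```python
# Walk the phoneme list, accumulating the timer value to assign each
# phoneme a start and end time to millisecond (three decimal place)
# precision, as described for the timer 36.
def build_timed_list(phonemes_with_durations, start_time_s=0.0):
    timed, t = [], start_time_s
    for phoneme, duration_s in phonemes_with_durations:
        timed.append({"phoneme": phoneme,
                      "start": round(t, 3),
                      "end": round(t + duration_s, 3)})
        t += duration_s
    return timed

print(build_timed_list([("HH", 0.08), ("AH", 0.12), ("L", 0.07), ("OW", 0.15)]))
```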
Figure 5 is a flow chart of the audio LLD extraction process 24 of Figure 4. Here, the spectrum, and the logarithm of the spectrum, are calculated at 130. Some or all of a large number of low level descriptors are then determined either in parallel (as shown) or sequentially. These include, but are not limited to:
131 audio spectrum envelope descriptor;
132 audio spectrum centroid descriptor;
133 audio spectrum spread descriptor;
134 audio spectrum flatness descriptor;
135 audio spectrum basis descriptor;
136 audio waveform descriptor;
137 audio power descriptor;
138 silence descriptor;
139 audio spectrum projection descriptor;
140 audio fundamental frequency descriptor;
141 audio harmonicity descriptor;
142 audio spectrum flatness type descriptor; log attack time descriptor;
143 harmonic spectral centroid descriptor;
144 harmonic spectral deviation descriptor;
145 harmonic spectral spread descriptor;
146 harmonic spectral variation descriptor;
147 spectral centroid descriptor; and
148 temporal centroid descriptor.
The result is then output as is described above.
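By way of illustration only, two of the simpler descriptors listed above, the audio power and a spectral centroid, may be computed from a raw sample buffer as follows. A real MPEG 7 extractor would window the signal and follow the normative descriptor definitions; this sketch uses a naive DFT for clarity.

```python
import cmath, math

def dft(samples):
    # naive discrete Fourier transform; O(n^2), for illustration only
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def audio_power(samples):
    # mean square of the samples
    return sum(s * s for s in samples) / len(samples)

def spectral_centroid(samples, sample_rate):
    # magnitude-weighted mean frequency over the first half of the spectrum
    spectrum = [abs(x) for x in dft(samples)[: len(samples) // 2]]
    total = sum(spectrum)
    if total == 0:
        return 0.0
    freqs = [k * sample_rate / len(samples) for k in range(len(spectrum))]
    return sum(f * m for f, m in zip(freqs, spectrum)) / total

# a pure 1 kHz tone: the centroid should sit at about 1 kHz
sr = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(64)]
print(round(spectral_centroid(tone, sr)))
```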
In Figure 6 is illustrated the process for emotive high level descriptors extraction. This is process step 26 in Figure 4. Here, four different processes are conducted in parallel:
149 the fundamental frequency (pitch change) is measured along each phoneme/word as a function of time;
150 audio amplitude changes detection (loudness change) is measured along each phoneme/word as a function of time;
151 rhythm detection (auto-correlation of the audio power) is measured along each phoneme/word as a function of time; and
152 the spectral slope is measured along each phoneme/word as a function of time.
The outputs of all four are then combined at 154 to create prosodic descriptors.
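The combination at 154 may be sketched by reducing per-phoneme pitch (149) and loudness (150) tracks to simple slope descriptors; the input series and descriptor names are illustrative assumptions.

```python
def slope(values):
    # least-squares slope of a series sampled at unit time intervals:
    # a positive slope indicates a rise along the phoneme/word
    n = len(values)
    mx, my = (n - 1) / 2, sum(values) / n
    num = sum((x - mx) * (y - my) for x, y in zip(range(n), values))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

def prosodic_descriptors(pitch_track, loudness_track):
    # reduce the per-phoneme tracks to simple change descriptors
    return {"pitch_change": slope(pitch_track),
            "loudness_change": slope(loudness_track)}

# rising pitch (Hz) and falling loudness over one phoneme
print(prosodic_descriptors([120, 125, 130, 135], [0.5, 0.45, 0.4, 0.35]))
```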
The final enhancement process 28 of Figure 4 is illustrated in Figure 7 - the phoneme/word analysis. This may be performed using HLDs. The phoneme segmentation of the audio track 155 and the diphone segmentation of the audio track 156 are combined, and at 157 the phoneme and diphone data are extracted and subjected to post-processing. Post-processing may include one or more of normalisation of amplitude via peak or RMS methodologies, noise reduction, and frequency filtering (either adaptive or manually set). The phoneme/diphone data descriptors are then encoded and passed for storage.
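The RMS amplitude normalisation mentioned above may be sketched as follows; the target level is an assumed parameter. Peak normalisation would instead divide by the maximum absolute sample value.

```python
import math

def rms_normalise(samples, target_rms=0.1):
    # scale the buffer so its root-mean-square level matches the target
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)  # silence: nothing to normalise
    gain = target_rms / rms
    return [s * gain for s in samples]

quiet = [0.01, -0.02, 0.015, -0.005]
normalised = rms_normalise(quiet)
rms_after = math.sqrt(sum(s * s for s in normalised) / len(normalised))
print(round(rms_after, 3))  # the target level is reached
```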
The present invention also extends to a computer usable medium comprising a computer program code that is configured to cause one or more processors to execute one or more functions to perform the steps and functions described above.
Whilst there has been described in the foregoing description a preferred embodiment of the present invention, it will be understood by those skilled in the technology concerned that many variations or modifications in details of design, construction or operation may be made without departing from the present invention.
The present invention extends to all features disclosed both individually, and in all possible permutations and combinations.