WO2013086575A1 - Speech visualisation tool - Google Patents

Speech visualisation tool

Info

Publication number: WO2013086575A1
Authority: WO (WIPO, PCT)
Prior art keywords: speech, visualisation, user, rules, pronunciation
Application number: PCT/AU2012/001531
Other languages: French (fr)
Inventor: Phan Thi My Ngoc Nguyen
Original assignee: Phan Thi My Ngoc Nguyen
Priority claimed from: AU2011905232A
Application filed by: Phan Thi My Ngoc Nguyen
Publication of: WO2013086575A1

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00 Teaching not covered by other main groups of this subclass
    • G09B19/04 Speaking
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Definitions

  • the present invention relates to language learning systems, and in particular language learning systems and methods that include a speech visualisation tool.
  • a difficult part of learning any new language is learning pronunciation or speech sounds (phonemes).
  • the language student may be aware of how a word should sound but often has difficulty creating the appropriate phoneme or even knowing how to do so. Unless the language teacher can provide accurate feedback to the individual student on how to produce a specific phoneme, this difficulty may persist and hinder the student's ability to communicate in the new language, let alone become a fluent speaker.
  • Speech involves various voice and non-voice related actions.
  • Non-voice related actions include movements of speech organs (e.g. the tongue), muscle movements and facial expressions.
  • a pronunciation model can be built based on linguistics, phonetics, phonology, phonetic interpretation or other language structures.
  • US Patent Application No. 10/372,828 describes a language learning system and method including a visualised pronunciation "suggestion" means.
  • a user can capture his or her own voice and generate a sound wave diagram of this using the system.
  • the system calculates a difference between the user's voice and a recorded voice.
  • the difference may be a similarity value expressed as a percentage.
  • the system and method rely on sound wave diagrams of full-sentence voice samples.
  • the disadvantage of this system is that while it suggests to users how to adjust tones and stresses, this may not assist a user who is experiencing difficulty in pronouncing a particular phoneme and struggling to produce the correct sound (pronunciation) in the first place.
  • the system identifies the positions and movements of speech organs, to help enhance a user's awareness of different sounds.
  • the movements can be displayed by showing one or more speech organs from one or more views.
  • the system also captures sound waveforms during speech and allows comparison between a user's speech and pre-recorded speech.
  • US Patent Application No. 6,397,185 discloses a voice and pronunciation training system that can provide visual feedback during speech training by capturing sound waves and computing metrics for comparison with a native speaker's pre-recorded sound waves.
  • the visual feedback includes a stress/rhythm mode in which each syllable is represented as a step, where the width of each step reflects the duration of the corresponding syllable and the height of each step represents the energy in the vowel of the corresponding syllable.
  • US 2008/0274443 describes a foreign language teaching system and method that includes a visual representation of foreign language words as a series of tonal and rhythmic shapes and colours. The system relies on voice signals and extracts data from the voice signals regarding the frequency of sounds within a spoken syllable.
  • an over-emphasised "B” sound might be displayed as a peak in lower frequency sounds in a graphical representation of the sound (depicted as coloured segments of a circle, the colours corresponding to different frequencies), or an under-emphasised “S” sound may be displayed as a peak in higher frequency sounds (different segments of the circle).
  • This system suffers the problem that although it provides visual feedback to a user about an overemphasis in a sound frequency or difference in tone or rhythm compared with a native speaker, the learner still needs to understand how to translate those metrics into information about how to make a desired sound.
  • US 2007/0055523 links pronunciation features with corresponding muscle movements and diagram representations. However, it is difficult to train the eye to associate certain movements of the speech organs with certain sounds.
  • the visualisation also shows areas of formation in the oral cavity relating to the various sounds - marked by different colours in the cavity.
  • the colour selected can change to correspond with the sound - for example, a more intense and saturated colour corresponds with short sounds and a less intense and lighter colour corresponds with a long vowel sound.
  • the system of US 6,151,577 provides visual feedback to the user about the non-visible involvement of speech organs in the formation of a sound.
  • this system is primarily geared to creating a visual link between sound and letters and focuses on individual sounds. As such, it does not address the difficulty of understanding the creation of phonemes in context - that is, linked to other phonemes. Important contextual information including word rhythm and linked phonemes is not addressed.
  • a speech visualisation tool comprising:
  • the speech visualisation tool is able to derive a pronunciation match corresponding to a user's input and to locate one or more speech visualisation images corresponding to the pronunciation match; said speech visualisation images being viewable such that speech visualisation assistance is provided for facilitating the learning of pronunciation.
  • a speech visualisation assistance method performed by programming instructions and including the steps of:
  • a computer program product comprising a computer usable medium having computer readable program code means embodied therein for providing speech visualisation assistance for a learner of a language, the computer readable program code means in said computer program product comprising programming instructions for:
  • a speech visualisation system including:
  • FIGURE 1 is a flowchart showing a method for speech visualisation assistance according to an embodiment of the invention.
  • FIGURE 2 is an exemplary user interface of the speech visualisation system according to an embodiment of the invention.
  • FIGURE 3 is the exemplary user interface of Figure 2 containing a visual representation of a pronunciation match in the exemplary form of an animated reproduction of the letter "m”.
  • FIGURE 4 shows exemplary speech visualisation images according to an embodiment of the invention.
  • FIGURE 4A shows an exemplary animated 3D visual representation of a front view of facial movement information during speech.
  • the visual representation is an animated image.
  • the series of images depicted shows exemplary animated movements (of the face) during the speech sound /m/.
  • FIGURE 4B shows an exemplary animated 3D visual representation of a perspective view of facial movement information during speech accompanied by the animated letter of Figure 3.
  • the visual representation is an animated image.
  • the series of images depicted shows exemplary animated movements (of the face and of the letter) depicted in the animation.
  • the facial movements and animated letter movements reflect the speech sound /m/.
  • FIGURE 4C shows an exemplary visual representation of a side view of speech organ movement during speech accompanied by the animated letter of Figure 3.
  • the visual representation is an animated image.
  • the series of images depicted shows exemplary animated movements of the speech organs depicted in the animation.
  • the speech organ movements and animated letter movements reflect the speech sound /m/.
  • FIGURE 5 shows an exemplary speech visualisation image depicting a stress pattern corresponding to a pronunciation match according to an embodiment of the invention.
  • FIGURE 5A shows an exemplary stress pattern for a word - here, "pronunciation”.
  • Figure 5B shows an exemplary stress pattern for a sentence or paragraph.
  • FIGURE 6 is an exemplary schematic diagram of a speech visualisation system according to an embodiment of the invention.
  • the invention provides a new or alternative speech visualisation tool, method and system for use in learning languages, and in particular for assisting to improve pronunciation.
  • the speech visualisation tool includes:
  • a display means for displaying one or more speech visualisation images that correspond to the pronunciation match.
  • the speech visualisation tool further includes a user interface in communication with the display means, to allow a user to communicate with the speech visualisation tool, such as allowing the tool to receive input from a user or allowing a user to view (including navigating) speech visualisation images that correspond to a pronunciation match.
  • the speech visualisation tool further includes an inference engine (a computer program designed to produce reasoning based on rules) to derive a pronunciation match by executing the pronunciation rules on a user's input.
  • the inference engine may form part of the rule base or be discrete from the rule base.
  • Mastering pronunciation of a new language requires a learner to understand and master the rhythm of that language. This is because every language has its own rhythm - different stress and length patterns of sounds associated with syllables, words, phrases and sentences. The pattern of strong and weak (volume), and short and long sounds makes up the rhythm of a language. Vary the pattern of the speech sounds and the language can become difficult to understand.
  • words can be categorised as either "content" words or "function" words (function words are also referred to as "structure" words).
  • Content words have the most stress in English and play an important role in the meaning of a sentence.
  • Function words are weaker and shorter. These words are less important in expressing the meaning of a sentence.
  • Content words include:
  • Function words include:
  • auxiliary verbs (also called helper verbs) - they modify the main verb in a sentence to change the tense, e.g. may, do, have, have been
  • prepositions words expressing spatial relations e.g. in, toward, to, under; or words that play a semantic / syntactic role e.g. for, of
  • a language may be timed according to the length and number of syllables (syllable-timed languages). In these languages the length of time needed to say something depends on how many syllables there are, as all syllables are the same length.
  • In other languages, the amount of time needed to say something depends on the number of stressed syllables, not the number of syllables per se. These are referred to as stressed-timed languages.
  • An example of a stressed-timed language is the spoken English of the UK, North America and Australia (hereafter, simply referred to as "English").
  • a word or sentence may include a mixture of stressed and unstressed syllables.
  • the speaking of the stressed syllables is what contributes to the rhythm.
  • the stressed syllables are spaced equally apart in time. If there are unstressed syllables between stressed syllables, the unstressed syllables are spoken faster, to keep the same rhythm. If there are no unstressed syllables, the stressed syllables are stretched out to fill the same amount of time.
  • Stressed syllables are louder, longer and higher-pitched than unstressed syllables. Changing the pattern of stressed syllables can change the meaning of the word, e.g. "record" can be pronounced with stress on the first syllable (the noun) or on the second syllable (the verb). Similarly, "present" can be pronounced with stress on either syllable. Where the stress is placed also affects the pronunciation of the main vowel in the unstressed syllable. This reduction (shortening) of the vowel sound in an unstressed syllable is the key to stress timing. In English, the reduced vowel sound in unstressed syllables is often the sound "uh".
  • the unstressed syllable incorporates the schwa sound.
  • the same practice of replacing unstressed vowels with schwa also occurs in connected speech sounds. Examples include:
  • a speech visualisation assistance method 100 as performed by the speech visualisation tool is shown.
  • a user enters input into the speech visualisation tool, which in one embodiment is a computer program that takes user input, such as a question or another way of seeking information about pronunciation - see step 110.
  • the input may be entered by typing a letter into a keyboard, selecting a character (e.g. a consonant from the phonetic alphabet) on the user interface or by any suitable means (e.g. clicking, tapping, swiping or speaking into a microphone).
  • the user input may be entered using any suitable form of input device (e.g. computer keyboard, touchscreen or microphone, smartphone, tablet, personal digital assistant or any other mobile device or device with processing capacity, e.g. a game console). This may take the form of a user being presented with options for selection - for example, a visual representation of the International Phonetic Alphabet (IPA) - and selecting the desired option.
  • Figure 2 contains an exemplary user interface 200 of the speech visualisation tool.
  • the interface 200 includes selection options 210, in this example, each of the 44 characters of the IPA.
  • the selection options 210 are categorised according to the following main groups:
  • the user interface 200 of the speech visualisation tool may be viewable on a display means that forms part of an input device (e.g. a computer screen, smartphone or tablet).
  • the user interface 200 also shows a visual representation of a pronunciation match 220 for the sound "m".
  • the visual representation 220 is depicted as a 3D representation of the sound "m” that corresponds to the alphabet letter "m”.
  • the visual representation 220 could also be an animated reproduction of the letter "m” or any other image.
  • the animated movements of the visual representation 220 reflect the characteristics of the corresponding speech sound (phoneme) for the letter "m" - namely:
  • the phonetic letter "m” is a consonant categorised in the IPA as part of a group that requires vibration sounds from start to end.
  • the animation reflects this in a “wobble” but also reflects the movement of the lips together, through a “bending” action of the "m” - bringing top and bottom closer. A “snapshot” of this can be seen in Figure 3.
  • Different vibratory frequencies and patterns are reflected in the animation, as are differences in strength, length and pitch.
  • the speech visualisation tool accesses a knowledge base of pronunciation rules and executes one or more pronunciation rules on the input received from the user to derive a pronunciation match (step 120). For example, if a user enters the consonant "k", the /k/ sounds in "kit" and "skill" are actually pronounced differently. The /k/ sound is aspirated in "kit" but unaspirated in "skill". In English, these different sounds are often unnoticed by native speakers and do not change the meaning of the word if one were to be substituted for the other.
  • the knowledge base may contain sets of pronunciation rules according to the language being learned. Hence, user input would indicate in a first step the desired language to apply.
  • the speech visualisation tool can be tailored for a single specified language, in which case all the pronunciation rules for the tool would be applicable to the specified language.
  • the knowledge base includes a plurality of pronunciation rules, not each of which will be relevant to a user. Once a user has input a query (e.g. by selecting the letter "m"), the speech visualisation tool will search the knowledge base for the pronunciation rules that are relevant to that input.
  • the speech visualisation tool may perform this step by presenting the relevant pronunciation rules to an inference engine.
  • the rule base itself can derive a pronunciation match.
  • the inference engine may be discrete from the rule base or indeed part of the rule base.
  • the pronunciation rules include one or more rules from the following groups:
  • phonemic rules to identify one or more phonemes in the user input (e.g. in the example above, if the speech visualisation tool is for English, the /k/ sound will be a single phoneme; if it is for Icelandic, the /k/ sound will be two sounds (phonemes));
  • word stress pattern rules to identify a stressed word of a sentence
  • rule exception rules to identify circumstances where exceptions to one or more of the above rules are relevant to the user input.
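By way of illustration only, the following Python sketch shows the kind of context-sensitive phonemic rule referred to above, using the aspirated /k/ example discussed earlier. The function name and the simplified heuristic are assumptions made for the example, not part of the disclosed rule base.

```python
# Illustrative only: a simplified, context-sensitive phonemic rule of the kind
# described above, deciding whether /k/ is aspirated.  Real pronunciation rules
# would cover far more context than this toy check.

def k_is_aspirated(word: str) -> bool:
    """English /k/ is aspirated word-initially ("kit") but not after /s/ ("skill")."""
    word = word.lower()
    i = word.find("k")
    if i < 0:
        return False                 # no /k/ in the word at all
    return not (i > 0 and word[i - 1] == "s")


assert k_is_aspirated("kit") is True
assert k_is_aspirated("skill") is False
```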
  • the speech visualisation tool takes the relevant pronunciation rules and executes them to deliver to the user a pronunciation match corresponding to the user's input (step 120 in Figure 1).
  • pronunciation rules including word stress pattern rules
  • the speech visualisation tool takes into account the rhythm of a language, e.g. the stressed-timed pattern of spoken English, or the syllable-timed pattern of Chinese. This is an advantage because the rhythm of a language affects pronunciation and meaning and is important to mastering a new language.
  • the speech visualisation tool analyses and extracts the appropriate word stress pattern for a particular word or sentence based on a series of pronunciation rules.
  • the rules categorise words as "content” or "function” words.
  • the rhythm of a language is important to enable the learner to differentiate the appropriate pronunciation of words that can vary meaning according to pronunciation, e.g. "record" as a noun versus "record" as a verb.
  • the speech visualisation tool takes this information then matches it against an image repository to find one or more speech visualisation images that correspond to the pronunciation match (step 130).
  • the speech visualisation tool selects speech visualisation images corresponding to the pronunciation match, which are made available for viewing (including navigation) by a user (step 140).
  • the user may view and/or navigate the speech visualisation image(s) that correspond(s) to the pronunciation match on a display means such as a computer display, a smartphone, a tablet or any other device with processing capacity.
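The following Python sketch loosely illustrates how steps 110 to 140 could fit together as a pipeline: a rule base derives a pronunciation match from the user input, an image repository is searched for matching images, and the result is handed to a display step. All class names, data structures and the single toy rule are invented for this sketch and are not taken from the disclosure.

```python
# Hypothetical sketch of the method of Figure 1 (steps 110-140).
# Names and data structures are illustrative only, not taken from the patent.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PronunciationMatch:
    phonemes: list          # e.g. ["/m/"] or a fuller phonetic transcription
    stress_pattern: list    # indices of stressed syllables, if any


@dataclass
class Rule:
    applies_to: Callable[[str], bool]            # does this rule fire for the input?
    derive: Callable[[str], PronunciationMatch]  # the match the rule produces


class RuleBase:
    def __init__(self, rules):
        self.rules = rules

    def derive_match(self, user_input: str) -> Optional[PronunciationMatch]:
        # Step 120: execute the relevant pronunciation rules on the input.
        for rule in self.rules:
            if rule.applies_to(user_input):
                return rule.derive(user_input)
        return None


class ImageRepository:
    def __init__(self, images):
        # phoneme -> list of image identifiers (front/perspective/side views etc.)
        self.images = images

    def find(self, match: PronunciationMatch) -> list:
        # Step 130: locate the speech visualisation images for the match.
        found = []
        for phoneme in match.phonemes:
            found.extend(self.images.get(phoneme, []))
        return found


def assist(user_input, rule_base, repository, display):
    # Step 110: receive input; 120: derive match; 130: search; 140: display.
    match = rule_base.derive_match(user_input)
    if match is None:
        return
    display(repository.find(match))


# Example usage with a single toy rule for the sound /m/.
rules = RuleBase([Rule(lambda s: s == "m",
                       lambda s: PronunciationMatch(["/m/"], []))])
repo = ImageRepository({"/m/": ["m_front.anim", "m_side.anim", "m_letter.anim"]})
assist("m", rules, repo, display=print)
```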
  • the speech visualisation images may be from one or more of the following groups:
  • Figure 4A shows an exemplary animated 3D visual representation of a front view 400 of facial movement information during speech.
  • the front view 400 is an animated image.
  • the series of images depicted in Figure 4A shows exemplary animated movements (of the face) during the speech sound /m/.
  • the animated front view image 400 could be any other visual representation of facial movements. Facial information provides important cues to sound reproduction.
  • Figure 4B contains an exemplary animated 3D visual representation of a perspective view 410 of facial movement information during speech.
  • the visual representation is an animated image.
  • the perspective view animation 410 is accompanied by the animated letter of Figure 3.
  • the series of images depicted in Figure 4B shows exemplary animated movements reflecting pronunciation of the speech sound /m/.
  • the use of different perspectives (e.g. front, perspective and side views) gives the learner a fuller picture of the movements involved in producing the sound.
  • the lips can be seen in Figures 4A and 4B to start in a closed position during production of the sound /m/. During the sound, the lips clamp together and roll over to become slightly “pursed”. At the end of the sound, the lips return to a more "relaxed” closed position. This is depicted in the animations, which can be replayed or played in slow motion for the learner to see the detail.
  • Figure 4C shows an exemplary visual representation of a side view of speech organ movement 420 during speech.
  • the visual representation is an animated image. As depicted, this side view 420 is also accompanied by the same animated letter of Figure 3.
  • the views 400, 410 and 420 are available to view simultaneously or individually.
  • the animations 400, 410 and 420, as well as the letter animation of Figure 3, can be played and re-played simultaneously or individually.
  • the speech organ movements and animated letter movements reflect the speech sound /m/, in the same manner as described above.
  • the speech analysis tool assists language learners to "read" lip movements and to understand the relationship between movements and the sounds produced. Lip and other speech organ movements required to pronounce a word correctly can differ according to context. Some of these are visible (e.g. the /e/ sound in "thief" and "speech") and understanding these visible cues is part of pronouncing the word correctly.
  • the speech visualisation tool assists learners to understand these differences because the speech visualisation images depicted are automatically selected by the system depending on the relevant context.
  • visualisation images also depict non-visible speech organ movements e.g. airflow in the vocal tract. This is relevant, for example, to correctly pronounce the difference between an aspirated or unaspirated consonant (e.g. the /k/ sound as discussed above).
  • the speech visualisation tool also provides speech visualisation images to assist in understanding word stress patterns relevant to pronunciation of a word (see Figure 5A) and word stress patterns relevant to pronunciation of sentences and paragraphs (see Figure 5B).
  • the series of images depicted in Figure 5A provides an exemplary animated image (stepping from top to bottom) of a word stress pattern 500.
  • Step 510 shows a visual representation of a word
  • the animated image 520 of the phonetic symbols may appear in a different colour to differentiate the spelling from the phonetic symbols.
  • the animated image 520 is accompanied by a speech sound.
  • the relevant phoneme or syllable is highlighted as the corresponding speech sound is played.
  • the learner can associate the highlighted phonetic symbol with the corresponding speech sound.
  • Step 530 shows different formatting applied to the speech sound so that the word stress pattern is depicted.
  • the stress is on the /ˈeɪ/ sound, as shown by the heightened size of the /ˈeɪ/ in step 540 followed by compression of the /ˈeɪ/ in step 550.
  • the changing heights in the symbols are a visual cue to the learner that this syllable is stressed.
  • the animated image then returns to its original state 520 in the final step 560.
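The animation steps described above can be pictured with a small, purely textual Python sketch: each frame highlights the syllable currently being sounded, and the stressed syllable is briefly enlarged and then compressed. The frame representation and the rough syllable split used below are assumptions for illustration only.

```python
# A rough, text-only sketch of the animation steps 520-560 described above.
# The frame format and syllable split are invented for illustration.

def stress_animation_frames(syllables, stressed_index):
    frames = []
    for i, syl in enumerate(syllables):
        marked = list(syllables)
        marked[i] = f"[{syl}]"              # highlight the syllable being sounded
        frames.append(" ".join(marked))
        if i == stressed_index:
            marked[i] = f"[{syl.upper()}]"  # "heightened" stressed syllable (step 540)
            frames.append(" ".join(marked))
            marked[i] = f"[{syl}]"          # compressed back down (step 550)
            frames.append(" ".join(marked))
    frames.append(" ".join(syllables))      # return to the original state (step 560)
    return frames


# "pronunciation", with main stress on the fourth syllable ("a").
for frame in stress_animation_frames(["pro", "nun", "ci", "a", "tion"], 3):
    print(frame)
```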
  • Similar visual cues are provided to assist the learner with word stress patterns of sentences and paragraphs.
  • in Figure 5B, a visual representation of the word "pronunciation" is provided.
  • a series of animated phonetic symbols is shown.
  • the relevant word or word part is highlighted (e.g. by a change of colour) as it is spoken.
  • the colour then reverts to the original colour at the end of the animation. This progression is depicted schematically in steps 610 to 640 of Figure 5B.
  • the speech visualisation tool is a software tool (computer program) that performs a speech visualisation assistance method within a speech visualisation system 800.
  • the speech visualisation system 800 includes:
  • the visualisation tool further includes a display means.
  • the speech visualisation system 800 includes a display means including a user interface
  • a processing means (processor) 820 for operating the speech visualisation tool; and (c) an input means to enable a user to communicate with the speech visualisation system, e.g. by inputting data or navigating one or more speech visualisation images.
  • the computer readable storage medium 810 can be memory in a storage medium such as a storage disk, a computer, a server, a network, or the cloud.
  • the system 800 includes access to a computer network 870
  • the processing means 820 may be local or remote, via a network.
  • a user may use a computer keyboard, smartphone, tablet, personal digital assistant or other mobile device or device with processing capacity (e.g. a game console) as an input device 830 to input data (e.g. a syllable, word or sequence of words).
  • the speech visualisation system 800 also includes display means 880 so that a user is able to view the speech visualisation images derived by the speech visualisation tool as a pronunciation match to the user's input.
  • An exemplary display means 880 is depicted in Figure 6 as a screen connected to a computer, but the display means could also take one or more of the following forms:
  • a mobile device including: i. a smartphone ii. a tablet iii. a personal digital assistant;
  • the speech visualisation system 800 further includes a user interface 890 in communication with the display means 880, to allow a user to communicate with the speech visualisation tool within the speech visualisation system, such as allowing the tool to receive input from a user or allowing the user to view and navigate speech visualisation images corresponding to the pronunciation match derived by the speech visualisation system.
  • the speech visualisation tool includes computer readable program code
  • an input device 830 e.g. a keyboard, smartphone, tablet, personal digital assistant or other mobile device or device with processing capacity, e.g. a game console
  • the step of deriving a pronunciation match is achieved by operating an inference engine 850 (a computer program designed to produce reasoning based on rules) on a user's input.
  • a rule base itself can derive a pronunciation match. Accordingly, the rule base and inference engine may be one and the same, or discrete components of the speech visualisation system;
  • the matched speech visualisation image includes a viseme corresponding to the pronunciation match, wherein the viseme includes speech organ and facial movement information corresponding to one or more speech sounds.
  • the pronunciation rules include one or more rules from the following groups:
  • word stress pattern rules to identify a stressed word of a phrase arising in the input received from the user
  • word stress pattern rules to identify a stressed word of a sentence arising in the input received from the user
  • rule exception rules to identify circumstances where exceptions to one or more of the above rules arise in the input received from the user.
  • the speech visualisation images in the image repository 860 include one or more of the following:
  • the speech visualisation tool assists learners to understand how speech organ movements translate to differences in sound. This is achieved by the expert system first analysing the input to determine the appropriate speech sound (taking into account the context and the applicable language rhythm and then, if it is a stress-timing language such as English, the applicable word stress pattern) and then matching the pronunciation to a plurality of images from various perspectives, including animations that can be played, replayed, viewed in slow motion and/or in close up (by zooming into the image).
  • the speech visualisation images also include animations to assist with understanding word stress patterns - stresses within words and also stressed words in sentences and paragraphs.
  • the image repository contains a collection of speech visualisation images.
  • the appropriate image or images are selected by the rule base (and/or inference engine), only after the appropriate pronunciation rules have been applied and a pronunciation match identified. If a pronunciation match relates to a word of multiple phonemes or to a phrase or sentence, the speech visualisation tool selects the appropriate speech visualisation image for each phoneme and combines those images. Where a linked phoneme affects the appearance of another phoneme, the speech visualisation tool takes this into account in the speech visualisation image presented as a match to the user input (e.g. the /e/ in thief compared with sheep).
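As a hedged sketch of that selection-and-combination step, the following Python fragment picks one image per phoneme and lets a linked (neighbouring) phoneme override the default choice, in the spirit of the "thief" versus "sheep" example. The repository contents, phoneme keys and override table are invented for the example, not taken from the disclosure.

```python
# Illustrative sketch only: one image per phoneme, with a linked (neighbouring)
# phoneme allowed to override the default image for the next phoneme.

DEFAULT_IMAGES = {"θ": "th_side.anim", "iː": "ee_spread.anim", "f": "f_side.anim",
                  "ʃ": "sh_side.anim", "p": "p_side.anim"}

# Context overrides: (previous_phoneme, phoneme) -> image
LINKED_OVERRIDES = {("ʃ", "iː"): "ee_neutral.anim"}   # /iː/ after /ʃ/ as in "sheep"


def images_for(phonemes):
    selected = []
    prev = None
    for p in phonemes:
        selected.append(LINKED_OVERRIDES.get((prev, p), DEFAULT_IMAGES[p]))
        prev = p
    return selected


print(images_for(["θ", "iː", "f"]))   # "thief": default spread-lip /iː/ image
print(images_for(["ʃ", "iː", "p"]))   # "sheep": linked /ʃ/ changes the /iː/ image
```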
  • the speech visualisation tool includes appropriate pronunciation rules that take this into account in deriving a pronunciation match (e.g. "from" in the sentence examples "Where are you from?" and "I'm from Perth"), and appropriate rules for selecting speech visualisation images (e.g. animated phonetic symbols, speech organ movements) to correspond to the pronunciation match.
  • a difference in meaning affects the pronunciation of a word (e.g. "record" as a noun versus "record" as a verb)
  • the speech visualisation tool includes appropriate pronunciation rules that take this into account in deriving a pronunciation match (e.g. based on whether the word is a noun or a verb, which can be identified through a series of appropriate questions).
  • the speech visualisation tool can operate on any device with processing capacity, including a computer, a smart phone, a tablet, or any other portable device with a processor and visual display (e.g. a game console).
  • the invention thus provides a speech visualisation tool that overcomes the problem of other language learning systems and methods by providing pronunciation information derived from pronunciation rules (including word stress pattern rules appropriate for the applicable language rhythm) together with corresponding speech visualisation images.
  • the speech visualisation tool converts this pronunciation information (e.g. the phonetic translation of a user input) into visual tools (a plurality of speech visualisation images) to assist the learner to translate that information into movements required to produce a specific speech sound. This is achieved through the inclusion of an expert system in the speech visualisation tool. This is an advantage because other language learning systems have limited or no ability to take into account context, or rely on a limited selection of pre-recorded audio samples.
  • the invention provides a speech visualisation tool for use in learning a language.
  • the invention is not restricted to this particular field of use and that it is not limited to particular embodiments or applications described herein.

Abstract

A speech visualisation tool that assists learners to learn pronunciation based on pronunciation rules (including word pattern rules appropriate for the applicable language rhythm), and to learn how certain movements of the speech organs relate to speech sounds by providing speech visualisation images corresponding to a pronunciation match based on user input.

Description

TITLE: SPEECH VISUALISATION TOOL TECHNICAL FIELD
The present invention relates to language learning systems, and in particular language learning systems and methods that include a speech visualisation tool. COPYRIGHT NOTICE
This document is subject to copyright. The reproduction, communication and distribution of this document is not permitted without prior consent from the copyright owner, other than as permitted under section 226 of the Patents Act 1990.
BACKGROUND
A difficult part of learning any new language is learning pronunciation or speech sounds (phonemes). The language student may be aware of how a word should sound but often has difficulty creating the appropriate phoneme or even knowing how to do so. Unless the language teacher can provide accurate feedback to the individual student on how to produce a specific phoneme, this difficulty may persist and hinder the student's ability to communicate in the new language, let alone become a fluent speaker.
Speech involves various voice and non-voice related actions. Non-voice related actions include movements of speech organs (e.g. the tongue), muscle movements and facial expressions. A pronunciation model can be built based on linguistics, phonetics, phonology, phonetic interpretation or other language structures.
Various systems and methods have been developed to assist with learning pronunciation. Most focus on the capture and replay of sound waves. For example, US Patent Application No. 10/372,828 describes a language learning system and method including a visualised pronunciation "suggestion" means. A user can capture his or her own voice and generate a sound wave diagram of this using the system. The system calculates a difference between the user's voice and a recorded voice. The difference may be a similarity value expressed as a percentage. By providing sound waves of the user's voice and of the pre-recorded language samples for comparison, a user can observe differences in tones and stresses from the sound wave diagrams so that the user can then adjust the tones and stresses in the next practice. The system and method rely on sound wave diagrams of full-sentence voice samples. The disadvantage of this system is that while it suggests to users how to adjust tones and stresses, this may not assist a user who is experiencing difficulty in pronouncing a particular phoneme and struggling to produce the correct sound (pronunciation) in the first place.
US 2007/0055523 describes a pronunciation training system that links
pronunciation features with corresponding muscle movements and diagram representations. The system identifies the positions and movements of speech organs, to help enhance a user's awareness of different sounds. The movements can be displayed by showing one or more speech organs from one or more views. The system also captures sound waveforms during speech and allows comparison between a user's speech and pre-recorded speech.
However, mastering pronunciation of a new language also requires a learner to understand and master the rhythm of that language. This is because every language has its own rhythm - different stress and length patterns of sounds associated with syllables, words, phrases and sentences.
US Patent Application No. 6,397,185 discloses a voice and pronunciation training system that can provide visual feedback during speech training by capturing sound waves and computing metrics for comparison with a native speaker's pre-recorded sound waves. The visual feedback includes a stress/rhythm mode in which each syllable is represented as a step, where the width of each step reflects the duration of the corresponding syllable and the height of each step represents the energy in the vowel of the corresponding syllable. US 2008/0274443 describes a foreign language teaching system and method that includes a visual representation of foreign language words as a series of tonal and rhythmic shapes and colours. The system relies on voice signals and extracts data from the voice signals regarding the frequency of sounds within a spoken syllable. For example, an over-emphasised "B" sound might be displayed as a peak in lower frequency sounds in a graphical representation of the sound (depicted as coloured segments of a circle, the colours corresponding to different frequencies), or an under-emphasised "S" sound may be displayed as a peak in higher frequency sounds (different segments of the circle). This system suffers the problem that although it provides visual feedback to a user about an overemphasis in a sound frequency or difference in tone or rhythm compared with a native speaker, the learner still needs to understand how to translate those metrics into information about how to make a desired sound.
A collective disadvantage of these language learning systems is that they require a trained "ear" for hearing the difference between a native speaker and a non-native speaker, and experience in translating that difference into certain sounds.
US 2007/0055523 links pronunciation features with corresponding muscle movements and diagram representations. However, it is difficult to train the eye to associate certain movements of the speech organs with certain sounds.
Many sounds involve movements in the mouth or in the throat. The difference between voiced and non-voiced consonants is invisible to the eye. Thus there is no visible difference in movement of the lips between "hat" and "at", or between "bat" and "pat": E.B. Nitchie, Lip-Reading: principles and practise, Digireads.com, 2007. Speech sounds (phonemes) formed within the throat or by the back of the tongue and soft palate (k, g, ng) are also not visible.
US 6,151,577 describes a system for phonological training that includes an animated reproduction of speech organs such as the lip movements and the oral cavity. The system provides an image of a phoneme along a time axis, giving various parts of the sound configuration different space depending on the duration of use. The visualisation also shows areas of formation in the oral cavity relating to the various sounds - marked by different colours in the cavity. The colour selected can change to correspond with the sound - for example, a more intense and saturated colour corresponds with short sounds and a less intense and lighter colour corresponds with a long vowel sound. In this manner, the system of US 6,151,577 provides visual feedback to the user about the non-visible involvement of speech organs in the formation of a sound. However, this system is primarily geared to creating a visual link between sound and letters and focuses on individual sounds. As such, it does not address the difficulty of understanding the creation of phonemes in context - that is, linked to other phonemes. Important contextual information including word rhythm and linked phonemes is not addressed.
One speech organ movement may modify the appearance of another connected with it in a word. Sounds pronounced alone (out of context) tend to be "mouthed" or exaggerated, resulting in mispronunciation. Nitchie provides the example of the long e sound. The movements involved in making that sound in "thief" appear as a drawing back of the corners of the mouth but the same sound in "sheep" does not involve the same lip movement. Another example is the "th" sound. While represented by the same combination of letters, these represent different phonemes depending on the context (e.g. the "th" sound is different in "thin" than in "this"). Focusing on the phoneme alone without context can give rise to mispronunciation in real speech.
It would be useful to have a tool that assists in teaching how to make the appropriate speech sound in context (e.g. words or sentences) by focusing on the movement of the required speech organs and how a movement required to create a speech sound can be affected by a linked movement to create the next speech sound in context (e.g. linked phonemes as in "thief"). The prior art language learning systems discussed above suffer the common difficulty of focusing on sound wave comparisons or the relative position of speech organs in generating sound waves. While listening to audio recordings of speech and repeating spoken sounds assists the learning process, this is not sufficient to learn word rhythm. Learning word rhythm also needs to involve an understanding of word stress patterns:
(a) understanding the appropriate word stress pattern for a particular word or sentence, e.g. "record" as a noun versus "record" as a verb;
(b) predicting word stress by understanding / recognising word stress pattern rules for categories of words or categories of word combinations, e.g.
i. for words ending in "-ic", the main stress occurs before the "-ic" (electric, terrific);
ii. for compound nouns (high jump, greenhouse, spaceship, underground), the main stress is on the first word.
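A minimal sketch of how the two example rules above could be encoded is given below. Real word stress rules would need syllabification and many more categories and exceptions, so the function and its return values are illustrative assumptions only.

```python
# A toy encoding of the two word-stress rules given as examples above.
# Purely illustrative; not a disclosed rule set.

def predict_stress(word: str, is_compound_noun: bool = False) -> str:
    word = word.lower()
    if is_compound_noun:
        return "main stress on the first word"              # e.g. HIGH jump, GREENhouse
    if word.endswith("ic"):
        return "main stress on the syllable before '-ic'"   # e.g. elecTRic, terRIfic
    return "no rule matched"


print(predict_stress("electric"))
print(predict_stress("greenhouse", is_compound_noun=True))
```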
What is needed is a language learning tool that takes into account the rhythm of a language, e.g. the stressed-timed pattern of spoken English, or the syllable-timed pattern of Chinese. Prior art language learning systems do not specifically address this aspect of mastering a language.
It would also be useful if the language learning tool could assist language learners to "read" lip movements (or other visible speech movements) to assist in the understanding of what sounds are revealed by such movements, as a basis to then understanding how to form a desired speech sound using a series of linked movements. This component is missing from prior art language learning systems.
It is an object of the present invention to provide a speech visualisation tool that assists learners to learn pronunciation based on pronunciation rules (including word pattern rules appropriate for the applicable language rhythm), and to learn how certain movements of the speech organs relate to speech sounds. SUMMARY
According to an aspect of the invention there is provided a speech visualisation tool comprising:
(a) a rule base of pronunciation rules, wherein said rule base applies one or more pronunciation rules to a user's input, said input entered into said speech visualisation tool using an input means such that a pronunciation match to the user's input can be derived;
(b) a searchable repository of speech visualisation images, said repository searchable by said speech visualisation tool to locate a speech visualisation image corresponding to the pronunciation match;
(c) a display means in communication with the rule base, said display means being capable of displaying one or more speech visualisation images that correspond to said pronunciation match,
such that the speech visualisation tool is able to derive a pronunciation match corresponding to a user's input and to locate one or more speech visualisation images corresponding to the pronunciation match; said speech visualisation images being viewable such that speech visualisation assistance is provided for facilitating the learning of pronunciation.
According to another aspect of the invention there is provided a speech visualisation assistance method performed by programming instructions and including the steps of:
(a) receiving an input from a user;
(b) executing said one or more pronunciation rules on the input received from the user to derive a pronunciation match;
(c) searching a repository of speech visualisation images to locate one or more speech visualisation images corresponding to the pronunciation match;
(d) communicating said one or more speech visualisation images to a display means such that said speech visualisation images are viewable by a user of the speech visualisation tool.
According to yet another aspect of the invention there is provided a computer program product comprising a computer usable medium having computer readable program code means embodied therein for providing speech visualisation assistance for a learner of a language, the computer readable program code means in said computer program product comprising programming instructions for:
(a) receiving an input from a user;
(b) executing said one or more pronunciation rules on the input received from the user to derive a pronunciation match;
(c) searching a repository of speech visualisation images to locate one or more speech visualisation images corresponding to the pronunciation match;
(d) communicating said one or more speech visualisation images to a display means such that said speech visualisation images are viewable to a user of the speech visualisation tool.
According to a further aspect of the invention there is provided a speech visualisation system including:
(a) a speech visualisation tool;
(b) programming instructions residing in computer readable storage medium performing a speech visualisation assistance method;
(c) a processing means for operating the programming instructions.
DETAILED DESCRIPTION
The invention thus provides a speech visualisation tool that assists learners to learn pronunciation based on word pattern rules appropriate for the applicable language rhythm, and to learn how certain movements of the speech organs relate to speech sounds. For a better understanding of the invention and to show how it may be performed, a preferred embodiment will now be described, by way of non-limiting example only, with reference to the accompanying drawings and example. FIGURE 1 is a flowchart showing a method for speech visualisation assistance according to an embodiment of the invention.
FIGURE 2 is an exemplary user interface of the speech visualisation system according to an embodiment of the invention.
FIGURE 3 is the exemplary user interface of Figure 2 containing a visual representation of a pronunciation match in the exemplary form of an animated reproduction of the letter "m". FIGURE 4 shows exemplary speech visualisation images according to an embodiment of the invention.
FIGURE 4A shows an exemplary animated 3D visual representation of a front view of facial movement information during speech. The visual representation is an animated image. The series of images depicted shows exemplary animated movements (of the face) during the speech sound /m/.
FIGURE 4B shows an exemplary animated 3D visual representation of a perspective view of facial movement information during speech accompanied by the animated letter of Figure 3. The visual representation is an animated image. The series of images depicted shows exemplary animated movements (of the face and of the letter) depicted in the animation. The facial movements and animated letter movements reflect the speech sound /m/.
FIGURE 4C shows an exemplary visual representation of a side view of speech organ movement during speech accompanied by the animated letter of Figure 3. The visual representation is an animated image. The series of images depicted shows exemplary animated movements of the speech organs depicted in the animation. The speech organ movements and animated letter movements reflect the speech sound /m/. FIGURE 5 shows an exemplary speech visualisation image depicting a stress pattern corresponding to a pronunciation match according to an embodiment of the invention.
FIGURE 5A shows an exemplary stress pattern for a word - here, "pronunciation". Figure 5B shows an exemplary stress pattern for a sentence or paragraph.
FIGURE 6 is an exemplary schematic diagram of a speech visualisation system according to an embodiment of the invention.
The elements of the invention are now described under the following headings.
Detailed description of a preferred embodiment
The invention provides a new or alternative speech visualisation tool, method and system for use in learning languages, and in particular for assisting to improve pronunciation.
In an embodiment, the speech visualisation tool includes:
(a) a rule base or knowledge base of pronunciation rules, the rules being executable on a user's input to find (derive, calculate) a pronunciation match to the user's input;
(b) a searchable repository of speech visualisation images, the repository being searchable by the speech visualisation tool to locate a speech visualisation image corresponding to the pronunciation match;
(c) a display means for displaying one or more speech visualisation images that correspond to the pronunciation match.
The speech visualisation tool further includes a user interface in communication with the display means, to allow a user to communicate with the speech visualisation tool, such as allowing the tool to receive input from a user or allowing a user to view (including navigating) speech visualisation images that correspond to a pronunciation match.
In an embodiment, the speech visualisation tool further includes an inference engine (a computer program designed to produce reasoning based on rules) to derive a pronunciation match by executing the pronunciation rules on a user's input. The inference engine may form part of the rule base or be discrete from the rule base.
Mastering pronunciation of a new language requires a learner to understand and master the rhythm of that language. This is because every language has its own rhythm - different stress and length patterns of sounds associated with syllables, words, phrases and sentences. The pattern of strong and weak (volume), and short and long sounds makes up the rhythm of a language. Vary the pattern of the speech sounds and the language can become difficult to understand.
Taking spoken English as an example, words can be categorised as either "content" words or "function" words (function words are also referred to as "structure" words). Content words have the most stress in English and play an important role in the meaning of a sentence. Function words are weaker and shorter. These words are less important in expressing the meaning of a sentence.
Content words include:
(a) nouns (person, place, animal, thing, abstract idea)
(b) verbs (doing words e.g. to write)
(c) adjectives and adverbs (describing words) and
(d) pronouns (words that substitute for nouns)
Function words include:
(a) auxiliary verbs (also called helper verbs - they modify the main verb in a sentence to change the tense, e.g. may, do, have, have been)
(b) prepositions (words expressing spatial relations e.g. in, toward, to, under; or words that play a semantic / syntactic role e.g. for, of)
(c) conjunctions (joining words)
(d) determiners (e.g. the, that)
(e) possessive adjectives (e.g. my, your).
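As a rough illustration of this categorisation, the following Python sketch marks the content (stressed) words of a sentence using a simple function-word lookup. The word list and function name are assumptions for the example; a real rule base would use proper part-of-speech analysis rather than a fixed list.

```python
# Illustrative only: find the content words of a sentence by filtering out a
# small, hand-picked set of function words.

FUNCTION_WORDS = {"a", "an", "the", "that", "this", "my", "your", "and", "or",
                  "but", "in", "on", "to", "of", "for", "from", "under",
                  "am", "is", "are", "was", "were", "do", "have", "may", "been"}


def content_words(sentence: str) -> list:
    words = sentence.lower().replace("?", "").replace(".", "").split()
    return [w for w in words if w not in FUNCTION_WORDS]


print(content_words("Where are you from?"))   # -> ['where', 'you']
```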
An additional component of the rhythm of a language is timing. A language may be timed according to the length and number of syllables (syllable-timed languages). In these languages the length of time needed to say something depends on how many syllables there are, as all syllables are the same length.
In other languages, the amount of time needed to say something depends on the number of stressed syllables, not the number of syllables per se. These are referred to as stressed-timed languages. An example of a stressed-timed language is the spoken English of the UK, North America and Australia (hereafter, simply referred to as "English").
In stressed-timed languages a word or sentence may include a mixture of stressed and unstressed syllables. The speaking of the stressed syllables is what contributes to the rhythm. The stressed syllables are spaced equally apart in time. If there are unstressed syllables between stressed syllables, the unstressed syllables are spoken faster, to keep the same rhythm. If there are no unstressed syllables, the stressed syllables are stretched out to fill the same amount of time. For example, "one, two, three, four" (four syllables, all stressed) takes the same amount of time to say as "one and two and three and four" (eight syllables: four stressed and four unstressed). This is achieved by reducing the vowel sounds in unstressed syllables, allowing the syllables to be spoken more quickly.
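A loose numerical sketch of this stress timing is given below: stressed syllables are spaced one "beat" apart, and any unstressed syllables between two stressed ones share that beat between them. The beat length and the function are illustrative assumptions, not a disclosed algorithm.

```python
# Hedged sketch of stress timing: each stressed syllable starts a new "beat",
# and the unstressed syllables that follow it share that beat.

def stress_timed_durations(stress_flags, beat=1.0):
    """stress_flags: one boolean per syllable (True = stressed)."""
    durations = [0.0] * len(stress_flags)
    i = 0
    while i < len(stress_flags):
        # the run from this syllable up to (not including) the next stressed one
        j = i + 1
        while j < len(stress_flags) and not stress_flags[j]:
            j += 1
        run = list(range(i, j))
        for k in run:
            durations[k] = beat / len(run)   # the whole run fits into one beat
        i = j
    return durations


# e.g. "one two three four" versus "one (and) two (and) three (and) four"
print(stress_timed_durations([True, True, True, True]))
print(stress_timed_durations([True, False, True, False, True, False, True]))
```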
Apart from vowel-shortening, stressed and unstressed syllables differ in the following features:
(a) loudness or strength of the sound (e.g. intensity and power of the sound);
(b) length (of time taken to speak the syllable - this is reflected by the clarity of the vowel component - i.e. whether it is long and full, or short and reduced);
(c) vibratory frequency and pattern;
(d) pitch.
Stressed syllables are louder, longer and higher-pitched than unstressed syllables. Changing the pattern of stressed syllables can change the meaning of the word, e.g. "record" can be pronounced with stress on the first syllable (the noun) or on the second syllable (the verb). Similarly, "present" can be pronounced with stress on either syllable. Where the stress is placed also affects the pronunciation of the main vowel in the unstressed syllable. This reduction (shortening) of the vowel sound in an unstressed syllable is the key to stress timing. In English, the reduced vowel sound in unstressed syllables is often the sound "uh". This is the single most common sound in English, representing about 30% of sounds made during spoken English. Its importance is signified by the fact that the sound has its own name: "schwa" - indicated by the phonetic symbol ə. The sound schwa can be represented by any vowel (a, e, i, o, u) in English. Consider the vowel sounds in the following examples:
(a) the "a" in ba//oon
(b) the "e" degree
(c) the "i" in pencil
(d) the "o" in symbol
(e) the "u" in support
In each of the examples above, the unstressed syllable incorporates the schwa sound. The same practice of replacing unstressed vowels with schwa also occurs in connected speech sounds. Examples include:
(a) Where are you from?
(b) I'm from Perth.
The "from" is pronounced differently in example (a) than in example (b). This is because in the first sentence, the word "from" is stressed, resulting in the "o" sound being pronounced clearly. In the second example, the word "Perth" is the stressed word in the sentence. The word "from" is unstressed in this example, resulting in the pronunciation of the "o" changing to a shorter vowel sound (schwa).
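The vowel reduction described above can be sketched, very roughly, as replacing the vowel of an unstressed word with schwa. The simplified transcription and the function below are assumptions for illustration only.

```python
# Loose illustration of vowel reduction: the vowel of a word is replaced by
# schwa ("ə") when the word is unstressed in its sentence.

VOWELS = "aeiou"


def reduce_to_schwa(word: str, stressed: bool) -> str:
    if stressed:
        return word                       # full vowel kept, e.g. "from" in "Where are you from?"
    return "".join("ə" if ch in VOWELS else ch for ch in word)


print(reduce_to_schwa("from", stressed=True))    # from  (stressed: clear "o")
print(reduce_to_schwa("from", stressed=False))   # frəm  (unstressed: schwa)
```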
Other major languages have a different pattern or rhythm of speech. Each syllable receives the same amount of stress as the others in a word or sentence.
Understanding the rhythm of a language assists learners to understand how to pronounce words like a native speaker, and hence improve their pronunciation.
A common problem in learning English is native speakers of syllable-timed languages intuitively giving equal stress (strength, length and pitch) to all syllables - influenced by the rhythm of their native tongue (e.g. Chinese). In syllable-timed languages, all "vowel sounds" are spoken clearly and with equal stress. The term "vowel" is a speech sound made with an open vocal tract - i.e. without any closure or constriction along the vocal tract including in the mouth. This phonetic definition of a vowel is broader than the letters "a, e, i, o and u", as referred to as "vowels" in English.
While listening to audio recordings of speech and repeating spoken sounds assists the learning process, this is not sufficient to learn word rhythm. Learning word rhythm also needs to involve an understanding of word stress patterns:
(a) understanding the appropriate word stress pattern for a particular word or sentence, e.g. "record" as a noun versus "record" as a verb;
(b) predicting word stress by understanding / recognising word stress pattern rules for categories of words or categories of word combinations, e.g.
i. for words ending in "-ic", the main stress occurs before the "-ic" (electric, terrific);
ii. for compound nouns (high jump, greenhouse, spaceship, underground), the main stress is on the first word.
Referring to Figure 1, a speech visualisation assistance method 100 as performed by the speech visualisation tool is shown. A user enters input into the speech visualisation tool, which in one embodiment is a computer program that takes user input, such as a question or another way of seeking information about pronunciation - see step 110.
The input may be entered by typing a letter on a keyboard, selecting a character (e.g. a consonant from the phonetic alphabet) on the user interface or by any other suitable means (e.g. clicking, tapping, swiping or speaking into a microphone). The user input may be entered using any suitable form of input device (e.g. computer keyboard, touchscreen, microphone, smartphone, tablet, personal digital assistant or any other mobile device or device with processing capacity, e.g. a game console). This may take the form of a user being presented with options for selection - for example, a visual representation of the International Phonetic Alphabet (IPA) - and selecting the desired option.
Figure 2 contains an exemplary user interface 200 of the speech visualisation tool. The interface 200 includes selection options 210, in this example, each of the 44 characters of the IPA. The selection options 210 are categorised according to the following main groups:
(a) vowels: short, long, diphthongs (double vowels)
(b) consonants: voiceless, voiced, others.
However, other suitable categorisation of selection options 210 could also be used. The user interface 200 of the speech visualisation tool may be viewable on a display means that forms part of an input device (e.g. a computer screen, smartphone or tablet).
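Purely by way of example, the following sketch (Python) shows one possible data structure for grouping the selection options 210 into the categories described above. The membership lists are abbreviated illustrations only and do not reproduce the full 44-character set used by the tool.

```python
# Illustrative sketch only: grouping IPA selection options for the user
# interface. The membership lists are abbreviated examples, not the full
# 44-character set.
SELECTION_OPTIONS = {
    "vowels": {
        "short":      ["ɪ", "e", "æ", "ʌ", "ɒ", "ʊ", "ə"],
        "long":       ["iː", "ɜː", "ɑː", "ɔː", "uː"],
        "diphthongs": ["eɪ", "aɪ", "ɔɪ", "aʊ", "əʊ"],
    },
    "consonants": {
        "voiceless": ["p", "t", "k", "f", "θ", "s", "ʃ"],
        "voiced":    ["b", "d", "g", "v", "ð", "z", "ʒ", "m"],
        "others":    ["h", "l", "r", "w", "j"],
    },
}

def options_for(group: str, subgroup: str) -> list[str]:
    """Return the selectable characters for one category of the interface."""
    return SELECTION_OPTIONS[group][subgroup]

print(options_for("consonants", "voiced"))  # e.g. the group containing "m"
```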
An advantage of basing the speech visualisation tool on the IPA is that all words can be translated into a phonetic equivalent using the IPA. Pronunciation rules will determine how the word is translated - for example, if a vowel is to be replaced by schwa. Once a pronunciation match is derived, this can be expressed by a combination of appropriate IPA characters corresponding to the user input. The precise combination of IPA letters may differ depending on context (described below). By way of example only, the user interface 200 also shows a visual representation of a pronunciation match 220 for the sound "m". In this example, the visual representation 220 is depicted as a 3D representation of the sound "m" that corresponds to the alphabet letter "m". The visual representation 220 could also be an animated reproduction of the letter "m" or any other image. The animated movements of the visual representation 220 reflect the characteristics of the corresponding speech sound (phoneme) for the letter "m" - namely:
i. strength (e.g. intensity and power) of the sound;
ii. length;
iii. vibratory frequency and pattern; and
iv. pitch.
For example, the phonetic letter "m" is a consonant categorised in the IPA as part of a group that requires vibration sounds from start to end. The animation reflects this in a "wobble" but also reflects the movement of the lips together, through a "bending" action of the "m" - bringing top and bottom closer. A "snapshot" of this can be seen in Figure 3. Different vibratory frequencies and patterns are reflected in the animation, as are differences in strength, length and pitch.
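A minimal sketch of how the characteristics listed above might be recorded per phoneme is given below (Python). The numeric values, field names and the description of the lip action are assumptions made for illustration; they are not taken from the disclosed tool.

```python
# Illustrative sketch only: per-phoneme animation parameters. The numeric
# values and field names are invented for this example.
from dataclasses import dataclass

@dataclass
class PhonemeAnimation:
    symbol: str
    strength: float      # relative intensity/power of the sound
    length_ms: int       # duration of the animated sound
    vibration_hz: float  # vibratory frequency driving the "wobble"
    pitch: str           # e.g. "low", "mid", "high"
    lip_action: str      # gross articulatory gesture shown by the letter

# /m/ is voiced (vibration from start to end) and made with the lips
# together, so the animated letter "wobbles" and "bends" top towards bottom.
M_ANIMATION = PhonemeAnimation(
    symbol="m", strength=0.4, length_ms=250,
    vibration_hz=8.0, pitch="mid", lip_action="close-and-purse",
)
print(M_ANIMATION)
```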
Returning to Figure 1, after receiving user input, the speech visualisation tool accesses a knowledge base of pronunciation rules and executes one or more pronunciation rules on the input received from the user to derive a pronunciation match (step 120). For example, if a user enters the consonant "k", the /k/ sounds in "kit" and "skill" are actually pronounced differently: the /k/ sound is aspirated in "kit" but unaspirated in "skill". In English, these different sounds often go unnoticed by native speakers and do not change the meaning of the word if one were to be substituted for the other. Hence, for English both the unaspirated and aspirated /k/ sounds are considered the same phoneme (speech sound) and the speech visualisation tool would include appropriate pronunciation rules to reflect this. In other languages, however, the aspirated and unaspirated /k/ sounds are perceived as different sounds (e.g. in Icelandic) and their use changes the meaning of the word. Accordingly, the knowledge base for Icelandic would reflect this difference.
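The following sketch (Python) illustrates, under assumed rule names and transcriptions, how such a language-dependent treatment of the aspirated and unaspirated /k/ sounds might be encoded in the knowledge base.

```python
# Illustrative sketch only: language-dependent treatment of aspirated and
# unaspirated /k/. The rule keys and transcriptions are assumptions.
PHONEME_RULES = {
    "english": {
        # [kʰ] and [k] are variants of the single phoneme /k/
        "k_initial": "/k/",   # "kit"   -> aspirated, same phoneme
        "k_after_s": "/k/",   # "skill" -> unaspirated, same phoneme
    },
    "icelandic": {
        # aspiration is contrastive, so the two sounds are distinct phonemes
        "k_aspirated":   "/kʰ/",
        "k_unaspirated": "/k/",
    },
}

def phoneme_for(language: str, context: str) -> str:
    """Look up the phoneme assigned to a /k/-type sound in a given context."""
    return PHONEME_RULES[language][context]

print(phoneme_for("english", "k_after_s"))      # -> /k/
print(phoneme_for("icelandic", "k_aspirated"))  # -> /kʰ/
```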
The knowledge base may contain sets of pronunciation rules according to the language being learned. Hence, user input would indicate in a first step the desired language to apply. Alternatively, the speech visualisation tool can be tailored for a single specified language, in which case all the pronunciation rules for the tool would be applicable to the specified language.
The knowledge base includes a plurality of pronunciation rules, not all of which will be relevant to a given user input. Once a user has input a query (e.g. by selecting the letter "m"), the speech visualisation tool will search the knowledge base for pronunciation rules relevant to the user input (in this simplified example, relevant to "m") and apply only the relevant rules to the input. In an embodiment, the speech visualisation tool may perform this step by presenting the relevant pronunciation rules to an inference engine. Alternatively, the rule base itself can derive a pronunciation match. Accordingly, the inference engine may be discrete from the rule base or indeed part of the rule base.
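By way of a simplified illustration only, the sketch below (Python) shows one way the step of selecting and executing only the relevant rules might look. The rule representation (a relevance test paired with an action) is an assumption made for this example, not the actual structure of the rule base or inference engine.

```python
# Illustrative sketch only: selecting and executing the pronunciation rules
# relevant to a user input. Rule structure and rule content are assumptions.
from typing import Callable

Rule = tuple[Callable[[str], bool], Callable[[str], str]]  # (is_relevant, apply)

KNOWLEDGE_BASE: list[Rule] = [
    (lambda q: q == "m", lambda q: "/m/ - voiced bilabial nasal"),
    (lambda q: q.endswith("ic"), lambda q: "stress the syllable before '-ic'"),
    (lambda q: " " in q, lambda q: "apply sentence-level stress rules"),
]

def derive_pronunciation_match(user_input: str) -> list[str]:
    """Apply only those rules whose relevance test matches the input."""
    relevant = [apply for is_relevant, apply in KNOWLEDGE_BASE
                if is_relevant(user_input)]
    return [apply(user_input) for apply in relevant]

print(derive_pronunciation_match("m"))         # only the /m/ rule fires
print(derive_pronunciation_match("electric"))  # only the '-ic' stress rule fires
```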
The pronunciation rules include one or more rules from the following groups:
(a) phonemic rules to identify one or more phonemes in the user input (e.g. in the example above, if the speech visualisation tool is for English, the /k/ sound will be a single phoneme; if it is for Icelandic, the aspirated and unaspirated /k/ sounds will be two different phonemes);
(b) word stress pattern rules to identify a stressed syllable in a word of two or more syllables;
(c) word stress pattern rules to identify a stressed part of a compound word;
(d) word stress pattern rules to identify a stressed word of a phrase;
(e) word stress pattern rules to identify a stressed word of a sentence; and
(f) rule exception rules to identify circumstances where exceptions to one or more of the above rules are relevant to the user input.
The speech visualisation tool takes the relevant pronunciation rules and executes them to deliver to the user a pronunciation match corresponding to the user's input (step 120 in Figure 1). By taking into account pronunciation rules (including word stress pattern rules), the speech visualisation tool takes into account the rhythm of a language, e.g. the stress-timed pattern of spoken English or the syllable-timed pattern of Chinese. This is an advantage because the rhythm of a language affects pronunciation and meaning and is important to mastering a new language.
Using the example of a language learning tool for English, the speech visualisation tool analyses and extracts the appropriate word stress pattern for a particular word or sentence based on a series of pronunciation rules. The rules categorise words as "content" or "function" words.
Where a word could be in either category, the rules perform further analysis to determine which category is appropriate for the context. Consider the word "have" in the following examples:
"Have you been swimming?" (stressed word in italics)
"I have"
In the first example, "have" is an auxiliary (function) word and is unstressed, whereas in the short reply "I have" it carries the meaning of the answer and is stressed. Hence the application of pronunciation rules can be performed in iterations.
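A minimal sketch of this two-pass (iterative) classification is shown below (Python). The function word list and the "last word of a short reply" heuristic are assumptions made for the example.

```python
# Illustrative sketch only: iterative classification of a word as a
# "content" (stressed) or "function" (unstressed) word depending on its
# role in the sentence. The word list and heuristic are assumptions.
FUNCTION_WORDS = {"have", "are", "from", "you", "been", "i'm", "a", "the"}

def is_stressed(word: str, sentence: str) -> bool:
    words = [w.strip("?.!,").lower() for w in sentence.split()]
    word = word.lower()
    if word not in FUNCTION_WORDS:
        return True                 # content words are normally stressed
    # Second pass: a function word left as the final word of a short reply
    # (e.g. "I have") takes the stress after all.
    return words[-1] == word

print(is_stressed("have", "Have you been swimming?"))  # False -> weak form
print(is_stressed("have", "I have"))                   # True  -> strong form
```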
The rhythm of a language is important to enable the learner to differentiate the appropriate pronunciation of words that can vary in meaning according to pronunciation, e.g. REcord or reCORD.
Returning again to Figure 1, after deriving a suitable pronunciation match for the user input (step 120), the speech visualisation tool takes this information and matches it against an image repository to find one or more speech visualisation images that correspond to the pronunciation match (step 130).
The speech visualisation tool selects speech visualisation images corresponding to the pronunciation match, which are made available for viewing (including navigation) by a user (step 140). The user may view and/or navigate the speech visualisation image(s) that correspond(s) to the pronunciation match on a display means such as a computer display, a smartphone, a tablet or any other device with processing capacity.
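The following sketch (Python) illustrates steps 130 and 140 in simplified form: the derived pronunciation match is used as a key into an image repository and the matching images are collected for display. The repository keys and file names are invented for illustration.

```python
# Illustrative sketch only (steps 130 and 140): looking up the speech
# visualisation images corresponding to a derived pronunciation match.
# Repository keys and file names are invented for the example.
IMAGE_REPOSITORY = {
    "/m/": ["m_letter_animation.gif", "m_face_front.gif",
            "m_face_perspective.gif", "m_speech_organs_side.gif"],
    "/k/": ["k_letter_animation.gif", "k_airflow_aspirated.gif"],
}

def images_for_match(pronunciation_match: list[str]) -> list[str]:
    """Collect, in order, every image matching a phoneme in the match."""
    images = []
    for phoneme in pronunciation_match:
        images.extend(IMAGE_REPOSITORY.get(phoneme, []))
    return images

# The matched images are then made available for viewing and navigation.
print(images_for_match(["/m/"]))
```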
The speech visualisation images may be from one or more of the following groups:
(a) an animated letter corresponding to a phonetic symbol, wherein the animated letter has characteristics reflecting one or more of:
i. strength (e.g. intensity and power) of the sound;
ii. length;
iii. vibratory frequency and pattern; and
iv. pitch
of a speech sound in the pronunciation match (see Figures 2 and 3);
(b) one or more visual representations of facial movement information during speech, including one or more of a 2D diagram, a 3D diagram, an animated reproduction of facial movement - see Figure 4;
(c) one or more visual representations of speech organ movement during speech, including one or more of a 2D diagram, a 3D diagram, an animated reproduction of speech organ movement;
(d) an animated reproduction of facial movement information during speech;
(e) an animated reproduction of speech organ movement during speech;
(f) a visual representation of a word stress pattern corresponding to the pronunciation match.
Referring to Figure 4, exemplary speech visualisation images are shown. Figure 4A shows an exemplary animated 3D visual representation of a front view 400 of facial movement information during speech. The front view 400 is an animated image. The series of images depicted in Figure 4A shows exemplary animated movements (of the face) during the speech sound /m/. The animated front view image 400 could be any other visual representation of facial movements. Facial information provides important cues to sound reproduction.
Figure 4B contains an exemplary animated 3D visual representation of a perspective view 410 of facial movement information during speech. The visual representation is an animated image. As depicted, the perspective view animation 410 is accompanied by the animated letter of Figure 3. The series of images depicted in Figure 4B shows exemplary animated movements corresponding to the pronunciation of the speech sound /m/. The use of different perspectives (e.g. front, perspective and side views) may provide further visual cues to the speech organ movements required to produce the sound /m/ (here, the focus is on the movement of the lips).
The lips can be seen in Figures 4A and 4B to start in a closed position during production of the sound /m/. During the sound, the lips clamp together and roll over to become slightly "pursed". At the end of the sound, the lips return to a more "relaxed" closed position. This is depicted in the animations, which can be replayed or played in slow motion for the learner to see the detail.
Figure 4C shows an exemplary visual representation of a side view of speech organ movement 420 during speech. The visual representation is an animated image. As depicted, this side view 420 is also accompanied by the same animated letter of Figure 3. The views 400, 410 and 420 are available to view simultaneously or individually. The animations 400, 410 and 420, as well as the letter animation of Figure 3, can be played and re-played simultaneously or individually. The speech organ movements and animated letter movements correspond to the speech sound /m/, in the same manner as described above.
By providing animations of lip movements (e.g. as in Figure 4B), the speech visualisation tool assists language learners to "read" lip movements and to understand the relationship between movements and the sounds produced. Lip and other speech organ movements required to pronounce a word correctly can differ according to context. Some of these are visible (e.g. the /e/ sound in "thief" and "speech") and understanding these visible cues is part of pronouncing the word correctly. The speech visualisation tool assists learners to understand these differences because the speech visualisation images depicted are automatically selected by the system depending on the relevant context. The speech visualisation images also depict non-visible speech organ movements, e.g. airflow in the vocal tract. This is relevant, for example, to correctly pronouncing the difference between an aspirated and unaspirated consonant (e.g. the /k/ sound as discussed above).
The speech visualisation tool also provides speech visualisation images to assist in understanding word stress patterns relevant to pronunciation of a word (see Figure 5A) and word stress patterns relevant to pronunciation of sentences and paragraphs (see Figure 5B). For example, the series of images depicted in Figure 5A provides an exemplary animated image (stepping from top to bottom) of a word stress pattern 500. Step 510 shows a visual representation of the word
"pronunciation" with a corresponding animated image (labelled 520) of the phonetic characters required to make the appropriate sounds. In this example, the phonetic equivalent 520 of "pronunciation" is the same as appears in a standard English dictionary.
The animated image 520 may appear in a different colour to differentiate the spelling from the phonetic symbols. The animated image of the phonetic symbols
520 is accompanied by a speech sound. As the sound and accompanying animation progress, the relevant phoneme (or syllable) is highlighted so that the learner can associate the highlighted phonetic symbol with the corresponding speech sound.
This highlighting may be achieved by any suitable method - e.g. a change in colour, italics or bold. Step 530 shows different formatting applied to the speech sound /ˈeɪ/ as a visual cue to the learner when the speech sound corresponds to that phoneme. As the animation continues, the word stress pattern is depicted. In this example, the stress is on the /ˈeɪ/ sound, as shown by the heightened size of the /ˈeɪ/ in step 540 followed by compression of the /ˈeɪ/ in step 550.
The changing heights of the symbols are a visual cue to the learner that this syllable is stressed. The animated image then returns to its original state 520 in the final step 560. Similar visual cues are provided to assist the learner with word stress patterns of sentences and paragraphs. For example, in Figure 5B, a visual representation of the word "pronunciation" is provided. A series of animated phonetic symbols is shown. As the animation progresses, the relevant word (or word part) changes to a different colour to signal to a learner where to put the stress. The colour then reverts to the original colour at the end of the animation. This progression is depicted schematically in steps 610 to 640 of Figure 5B.
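The sketch below (Python) illustrates, in simplified textual form, the highlighting timeline described for steps 520 to 560. The phonetic transcription of "pronunciation" and the per-step effects are assumptions made for the example.

```python
# Illustrative sketch only: a highlighting timeline for the animated
# phonetic transcription of "pronunciation" (cf. steps 520-560). The
# transcription and step effects are assumptions for the example.
TRANSCRIPTION = ["prə", "ˌnʌn", "si", "ˈeɪ", "ʃən"]

TIMELINE = [
    ("520", None,  "show phonetic symbols in a contrasting colour"),
    ("530", "ˈeɪ", "highlight the syllable as its sound is played"),
    ("540", "ˈeɪ", "enlarge the stressed syllable (heightened size)"),
    ("550", "ˈeɪ", "compress the stressed syllable back down"),
    ("560", None,  "return the animation to its original state"),
]

def render(step: str) -> str:
    """Render one animation step as text, marking the highlighted syllable."""
    _, target, effect = next(t for t in TIMELINE if t[0] == step)
    shown = [f"[{s}]" if s == target else s for s in TRANSCRIPTION]
    return f"step {step}: {' '.join(shown)}  ({effect})"

for step, _, _ in TIMELINE:
    print(render(step))
```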
Referring to Figure 6, the speech visualisation tool is a software tool (computer program) that performs a speech visualisation assistance method within a speech visualisation system 800.
The speech visualisation system 800 includes:
(a) a speech visualisation tool (computer program product) residing in
computer readable storage medium (memory) 810 for performing a speech visualisation assistance method. In an embodiment, the speech
visualisation tool further includes a display means. Alternatively, the speech visualisation system 800 includes a display means including a user interface
(as depicted in Figure 6);
(b) a processing means (processor) 820 for operating the speech visualisation tool; and
(c) an input means to enable a user to communicate with the speech visualisation system, e.g. by inputting data or navigating one or more speech visualisation images.
The computer readable storage medium 810 can be memory in a storage medium such as a storage disk, a computer, a server, a network, or the cloud. In an embodiment, the system 800 includes access to a computer network 870
(including the internet or the cloud). The processing means 820 may be local or remote, via a network. A user may use a computer keyboard, smartphone, tablet, personal digital assistant or other mobile device or device with processing capacity (e.g. a game console) as an input device 830 to input data (e.g. a syllable, word or sequence of words). The speech visualisation system 800 also includes a display means 880 so that a user is able to view the speech visualisation images derived by the speech visualisation tool as a pronunciation match to the user's input. An exemplary display means 880 is depicted in Figure 6 as a screen connected to a computer, but the display means could also take one or more of the following forms:
(a) a mobile device, including:
i. a smartphone
ii. a tablet
iii. a personal digital assistant;
(b) any other device with processing capacity (e.g. a game console). Referring to Figure 6, the speech visualisation system 800 further includes a user interface 890 in communication with the display means 880, to allow a user to communicate with the speech visualisation tool within the speech visualisation system, such as allowing the tool to receive input from a user or allowing the user to view and navigate speech visualisation images corresponding to the pronunciation match derived by the speech visualisation system. The speech visualisation tool includes computer readable program code
(programming instructions) for performing a speech visualisation assistance method, the method including the steps of:
(a) receiving user input through a user interface of an input device 830 (e.g. a keyboard, smartphone, tablet, personal digital assistant or other mobile device or device with processing capacity, e.g. a game console), to allow a user to communicate with the speech visualisation system;
(b) accessing a rule base or knowledge base of pronunciation rules 840 to derive a pronunciation match by executing the pronunciation rules on a user's input. The pronunciation match reflects an applicable language rhythm based on pronunciation rules. In an alternative embodiment, the step of deriving a pronunciation match is achieved by operating an inference engine 850 (a computer program designed to produce reasoning based on rules) on a user's input. However, a rule base itself can derive a pronunciation match. Accordingly, the rule base and inference engine may be one and the same, or discrete components of the speech visualisation system;
(c) accessing and searching an image repository 860 containing a plurality of speech visualisation images to match one or more speech visualisation images to the derived pronunciation match. The matched speech visualisation image includes a viseme corresponding to the pronunciation match, wherein the viseme includes speech organ and facial movement information corresponding to one or more speech sounds.
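A high-level sketch (Python) of how the components described in steps (a) to (c) might be wired together is given below. The function names and callable interfaces are assumptions made for illustration; they are not the actual program structure of the speech visualisation tool.

```python
# Illustrative sketch only: wiring the components of system 800 together.
# Component names and callable interfaces are assumptions for the example.
from typing import Callable

def make_system(
    derive_match: Callable[[str], list[str]],       # rule base 840 / inference engine 850
    find_images: Callable[[list[str]], list[str]],  # image repository 860
    show: Callable[[list[str]], None],              # display means 880 / interface 890
) -> Callable[[str], None]:
    """Return a function implementing steps (a)-(c) for one user input."""
    def assist(user_input: str) -> None:
        match = derive_match(user_input)   # (b) derive the pronunciation match
        images = find_images(match)        # (c) locate matching visualisation images
        show(images)                       # make the images viewable/navigable
    return assist

# Minimal stand-ins so the sketch runs end to end:
assist = make_system(
    derive_match=lambda q: ["/m/"] if q == "m" else [],
    find_images=lambda m: [f"{p.strip('/')}_animation.gif" for p in m],
    show=print,
)
assist("m")   # (a) user input -> prints ['m_animation.gif']
```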
The pronunciation rules include one or more rules from the following groups:
(a) phonemic rules to.identify one or more phonemes arising in the input received from the user;
(b) word stress pattern rules to identify a stressed syllable in a word of two or more syllables arising in the input received from the user;
(c) word stress pattern rules to identify a stressed part of a compound word arising in the input received from the user;
(d) word stress pattern rules to identify a stressed word of a phrase arising in the input received from the user;
(e) word stress pattern rules to identify a stressed word of a sentence arising in the input received from the user;
(f) rule exception rules to identify circumstances where exceptions to one or more of the above rules arise in the input received from the user.
The speech visualisation images in the image repository 860 include one or more of the following:
(a) an animated letter corresponding to a phonetic symbol, wherein the
animated letter has characteristics reflecting one or more of:
i. strength (e.g. intensity and power) of the sound;
ii. length;
iii. vibratory frequency and pattern; and
iv. pitch
of a speech sound in the pronunciation match (see Figures 2 and 3);
(b) one or more visual representations of facial movement information during speech, including one or more of a 2D diagram, a 3D diagram, an animated reproduction of facial movement - see Figure 4;
(c) one or more visual representations of speech organ movement during speech, including one or more of a 2D diagram, a 3D diagram, an animated reproduction of speech organ movement;
(d) an animated reproduction of facial movement information during speech;
(e) an animated reproduction of speech organ movement during speech;
(f) a visual representation of a word stress pattern corresponding to the pronunciation match.
The speech visualisation tool assists learners to understand how speech organ movements translate to differences in sound. This is achieved by the expert system first analysing the input to determine the appropriate speech sound (taking into account the context and the applicable language rhythm and then, if it is a stress-timed language such as English, the applicable word stress pattern) and then matching the pronunciation to a plurality of images from various perspectives, including animations that can be played, replayed, viewed in slow motion and/or in close up (by zooming into the image). The speech visualisation images also include animations to assist with understanding word stress patterns - stresses within words and also stressed words in sentences and paragraphs.
The image repository contains a collection of speech visualisation images. The appropriate image or images are selected by the rule base (and/or inference engine) system, only after the appropriate pronunciation rules have been applied and a pronunciation match identified. If a pronunciation match relates to a word of multiple phonemes or to a phrase or sentence, the speech visualisation tool selects the appropriate speech visualisation image for each phoneme and combines those images. Where a linked phoneme affects the appearance of another phoneme, the speech visualisation tool takes this into account in the speech visualisation image presented as a match to the user input (e.g. the /e/ in thief compared with sheep).
Where a difference in word stress affects the pronunciation of a word (or part of a word), the speech visualisation tool includes appropriate pronunciation rules that take this into account in deriving a pronunciation match (e.g. "from" in the sentence examples "Where are you from?" and "I'm from Perth"), and appropriate rules for selecting speech visualisation images (e.g. animated phonetic symbols, speech organ movements) to correspond to the pronunciation match. Where a difference in meaning affects the pronunciation of a word (e.g. REcord or reCORD), the speech visualisation tool includes appropriate pronunciation rules that take this into account in deriving a pronunciation match (e.g. based on whether the word is a noun or a verb, which can be identified through a series of appropriate questions). The speech visualisation tool can operate on any device with processing capacity, including a computer, a smart phone, a tablet, or any other portable device with a processor and visual display (e.g. a game console). The invention thus provides a speech visualisation tool that overcomes the problems of other language learning systems and methods by providing
pronunciation assistance to learners that takes into account context, including rhythm and word stress patterns. The speech visualisation tool translates this pronunciation information (e.g. the phonetic translation of a user input) into visual tools (a plurality of speech visualisation images) to assist the learner to translate that information into the movements required to produce a specific speech sound. This is achieved through the inclusion of an expert system in the speech visualisation tool. This is an advantage because other language learning systems have limited or no ability to take into account context, or rely on a limited selection of pre-recorded audio samples.
The invention provides a speech visualisation tool for use in learning a language. However, it will be appreciated that the invention is not restricted to this particular field of use and that it is not limited to particular embodiments or applications described herein.
"Comprises/comprising" when used in this specification is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof. Thus, unless the context clearly requires otherwise, throughout the description and the claims, the words 'comprise', 'comprising', and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in the sense of "including, but not limited to".

Claims

1. A speech visualisation tool comprising:
(a) a rule base of pronunciation rules, wherein said rule base applies one or more pronunciation rules to a user's input, said input entered into said speech visualisation tool using an input means such that a
pronunciation match to the user's input can be derived;
(b) a searchable repository of speech visualisation images, said repository searchable by said speech visualisation tool to locate a speech visualisation image corresponding to the pronunciation match;
(c) a display means in communication with the rule base, said display
means being capable of displaying one or more speech visualisation images that correspond to said pronunciation match,
such that the speech visualisation tool is able to derive a pronunciation match corresponding to a user's input and to locate one or more speech visualisation images corresponding to the pronunciation match, said speech visualisation images being viewable such that speech visualisation assistance is provided for facilitating the learning of pronunciation.
2. The speech visualisation tool of claim 1 further comprising programming
instructions to perform a speech visualisation assistance method, the method including the steps of:
(a) receiving an input from a user;
(b) executing said one or more pronunciation rules on the input received from the user to derive a pronunciation match;
(c) searching a repository of speech visualisation images to locate one or more speech visualisation images corresponding to the pronunciation match;
(d) communicating said one or more speech visualisation images to a display means such that said speech visualisation images are viewable by a user of the speech visualisation tool.
3. The speech visualisation tool of claim 1 or claim 2, wherein the one or more speech visualisation images includes one or more of:
(a) one or more phonetic symbols;
(b) an animated reproduction of a user's input that reflects one or more characteristics of a corresponding speech sound;
(c) a visual representation of facial movement information during speech;
(d) a visual representation of speech organ movement during speech.
4. The speech visualisation tool of any one of the preceding claims wherein the visual representation of facial movement during speech includes one or more of:
(a) a 2D image
(b) a 3D image
(c) an animated reproduction,
of facial movement during speech.
5. The speech visualisation tool of any one of the preceding claims wherein the visual representation of speech organ movement includes one or more of:
(a) a 2D image
(b) a 3D image
(c) an animated reproduction,
of speech organ movement during speech.
6. The speech visualisation tool of any one of claim 3 to claim 5 wherein the animated reproduction of a user's input that reflects one or more characteristics of a corresponding speech sound includes one or more of:
(a) an animated image of a relevant word stress pattern;
(b) one or more animated phonetic symbols corresponding to a user's input.
7. The speech visualisation tool of claim 6 wherein the one or more animated phonetic symbols include characteristics reflecting one or more of:
(a) strength (e.g. intensity and power);
(b) length;
(c) vibratory frequency and pattern; and
(d) pitch
of a speech sound in the pronunciation match.
8. The speech visualisation tool of any one of the preceding claims wherein the pronunciation rules include one or more rules from the following groups:
(a) phonemic rules to identify one or more phonemes arising in said user's input;
(b) word stress pattern rules to identify a stressed syllable in a word of two or more syllables arising in said user's input;
(c) word stress pattern rules to identify a stressed part of a compound word arising in said user's input;
(d) word stress pattern rules to identify a stressed word of a phrase arising in said user's input;
(e) word stress pattern rules to identify a stressed word of a sentence arising in said user's input;
(f) rule exception rules to identify circumstances where exceptions to one or more of the above rules arise in said user's input.
9. The speech visualisation tool of any one of the preceding claims further
including an input means in communication with the inference engine.
10. The speech visualisation tool of claim 9 wherein said input means includes an input device from one or more of the following groups:
(a) a computer;
(b) a mobile device, including:
i. a smartphone
ii. a tablet
iii. a personal digital assistant;
(c) any other device with processing capacity.
11. The speech visualisation tool of claim 9 or claim 10 wherein the input means communicates with the display means.
12. The speech visualisation tool of claim 11 wherein the display means includes a device from one or more of the following groups:
(a) a computer;
(b) a mobile device, including:
i. a smartphone
ii. a tablet
iii. a personal digital assistant;
(c) any other device with processing capacity.
13. A speech visualisation assistance method performed by programming
instructions and including the steps of:
(a) receiving an input from a user;
(b) executing said one or more pronunciation rules on the input received from the user to derive a pronunciation match;
(c) searching a repository of speech visualisation images to locate one or more speech visualisation images corresponding to the pronunciation match;
(d) communicating said one or more speech visualisation images to a display means such that said speech visualisation images are viewable by a user of the speech visualisation tool.
14. The speech visualisation assistance method of claim 13 wherein the
pronunciation rules include one or more rules from the following groups:
(a) phonemic rules to identify one or more phonemes arising in the input received from the user;
(b) word stress pattern rules to identify a stressed syllable in a word of two or more syllables arising in the input received from the user;
(c) word stress pattern rules to identify a stressed part of a compound word arising in the input received from the user;
(d) word stress pattern rules to identify a stressed word of a phrase arising in the input received from the user;
(e) word stress pattern rules to identify a stressed word of a sentence arising in the input received from the user;
(f) rule exception rules to identify circumstances where exceptions to one or more of the above rules arise in the input received from the user.
15. The speech visualisation assistance method of claim 13 or claim 14 wherein said speech visualisation images include a viseme, said viseme including speech organ and facial movement information corresponding to one or more speech sounds.
16. A computer program product comprising a computer usable medium having computer readable program code means embodied therein for providing speech visualisation assistance for a learner of a language, the computer readable program code means in said computer program product comprising programming instructions for:
(a) receiving an input from a user;
(b) executing said one or more pronunciation rules on the input received from the user to derive a pronunciation match;
(c) searching a repository of speech visualisation images to locate one or more speech visualisation images corresponding to the pronunciation match;
(d) communicating said one or more speech visualisation images to a display means such that said speech visualisation images are viewable by a user of the speech visualisation tool.
17. A computer program product according to claim 16, wherein the pronunciation rules include one or more rules from the following groups:
(a) phonemic rules to identify one or more phonemes arising in the input received from the user;
(b) word stress pattern rules to identify a stressed syllable in a word of two or more syllables arising in the input received from the user;
(c) word stress pattern rules to identify a stressed part of a compound word arising in the input received from the user;
(d) word stress pattern rules to identify a stressed word of a phrase arising in the input received from the user;
(e) word stress pattern rules to identify a stressed word of a sentence arising in the input received from the user;
(f) rule exception rules to identify circumstances where exceptions to one or more of the above rules arise in the input received from the user.
18. A computer program product according to claim 16 or 17, wherein the speech visualisation images include a visual representation of the pronunciation match, wherein the visual representation includes a representation from one or more of the following groups:
(a) an animated letter corresponding to a phonetic symbol, wherein the animated letter has characteristics reflecting one or more of:
i. strength;
ii. length;
iii. vibratory frequency;
iv. vibratory pattern; and
v. pitch
of a speech sound in the pronunciation match;
(b) one or more visual representations of facial movement information during speech, including one or more of a 2D diagram, a 3D diagram, an animated reproduction of facial movement;
(c) one or more visual representations of speech organ movement during speech, including one or more of a 2D diagram, a 3D diagram, an animated reproduction of speech organ movement;
(d) an animated reproduction of facial movement information during speech;
(e) an animated reproduction of speech organ movement during speech;
(f) a visual representation of a word stress pattern corresponding to the pronunciation match.
19. A computer program product according to any one of claim 16 to claim 18, wherein said input from a user is receivable from an input device from one or more of the following groups:
(a) a computer;
(b) a mobile device, including:
i. a smartphone
ii. a tablet
iii. a personal digital assistant;
(c) any other device with processing capacity.
20. A computer program product according to any one of claim 16 to claim 19 wherein the display means is a visual display for one or more of the following devices:
(a) a computer;
(b) a mobile device, including:
i. a smartphone
ii. a tablet
iii. a personal digital assistant;
(c) any other device with processing capacity.
21. A speech visualisation system including:
(a) a speech visualisation tool according to any one of claim 1 to claim 12;
(b) programming instructions residing in computer readable storage
medium for performing a speech visualisation assistance method according to any one of claim 13 to claim 15;
(c) a processing means for operating the programming instructions.
22. The speech visualisation system of claim 21 wherein the programming
instructions include a computer program product according to any one of claim 17 to claim 19.
23. A speech visualisation tool as hereinbefore described by reference to the accompanying drawings.
24. A speech visualisation assistance method as hereinbefore described by
reference to the accompanying drawings.
25. A computer program product as hereinbefore described by reference to the accompanying drawings.
26. A speech visualisation system as hereinbefore described by reference to the accompanying drawings.
PCT/AU2012/001531 2011-12-15 2012-12-12 Speech visualisation tool WO2013086575A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
AU2011905232 2011-12-15
AU2011905232A AU2011905232A0 (en) 2011-12-15 Speech visualisation tool
AU2012100262A AU2012100262B4 (en) 2011-12-15 2012-03-09 Speech visualisation tool
AU2012100262 2012-03-09

Publications (1)

Publication Number Publication Date
WO2013086575A1 true WO2013086575A1 (en) 2013-06-20

Family

ID=46605538

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2012/001531 WO2013086575A1 (en) 2011-12-15 2012-12-12 Speech visualisation tool

Country Status (2)

Country Link
AU (1) AU2012100262B4 (en)
WO (1) WO2013086575A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113302672A (en) * 2018-12-13 2021-08-24 方正熊猫有限公司 Speed-variable speech sounding machine

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112086094B (en) * 2020-08-21 2023-03-14 广东小天才科技有限公司 Method for correcting pronunciation, terminal equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5880788A (en) * 1996-03-25 1999-03-09 Interval Research Corporation Automated synchronization of video image sequences to new soundtracks
US6389396B1 (en) * 1997-03-25 2002-05-14 Telia Ab Device and method for prosody generation at visual synthesis
US20080033712A1 (en) * 2006-08-04 2008-02-07 Kuo-Ping Yang Method of learning a second language through the guidance of pictures

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05289607A (en) * 1992-04-06 1993-11-05 Yoshio Hayama Language learning tool
CN101727765A (en) * 2009-11-03 2010-06-09 无敌科技(西安)有限公司 Face simulation pronunciation system and method thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5880788A (en) * 1996-03-25 1999-03-09 Interval Research Corporation Automated synchronization of video image sequences to new soundtracks
US6389396B1 (en) * 1997-03-25 2002-05-14 Telia Ab Device and method for prosody generation at visual synthesis
US20080033712A1 (en) * 2006-08-04 2008-02-07 Kuo-Ping Yang Method of learning a second language through the guidance of pictures

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113302672A (en) * 2018-12-13 2021-08-24 方正熊猫有限公司 Speed-variable speech sounding machine
US11694680B2 (en) 2018-12-13 2023-07-04 Learning Squared, Inc. Variable-speed phonetic pronunciation machine

Also Published As

Publication number Publication date
AU2012100262B4 (en) 2012-05-24
AU2012100262A4 (en) 2012-04-12

Similar Documents

Publication Publication Date Title
US7280964B2 (en) Method of recognizing spoken language with recognition of language color
US6963841B2 (en) Speech training method with alternative proper pronunciation database
US7149690B2 (en) Method and apparatus for interactive language instruction
Jeffries Discovering language: The structure of modern English
US10978045B2 (en) Foreign language reading and displaying device and a method thereof, motion learning device based on foreign language rhythm detection sensor and motion learning method, electronic recording medium, and learning material
US9520068B2 (en) Sentence level analysis in a reading tutor
KR20150076128A (en) System and method on education supporting of pronunciation ussing 3 dimensional multimedia
KR20140071070A (en) Method and apparatus for learning pronunciation of foreign language using phonetic symbol
US20120164609A1 (en) Second Language Acquisition System and Method of Instruction
Demenko et al. The use of speech technology in foreign language pronunciation training
Carey An L1-specific CALL pedagogy for the instruction of pronunciation with Korean learners of English
US20090291419A1 (en) System of sound representaion and pronunciation techniques for english and other european languages
AU2012100262B4 (en) Speech visualisation tool
Xu Language technologies in speech-enabled second language learning games: From reading to dialogue
Kolesnikova Linguistic Support of a CAPT System for Teaching English Pronunciation to Mexican Spanish Speakers.
Ashby Sound Foundations. What's' General'in Applied Phonetics?
Faraj The Effectiveness of Learning Activities in Pronouncing the English Vowel Schwa by EFL College Level Students
Sousa Exploration of Audio Feedback for L2 English Prosody Training
CN115293618A (en) Pronunciation training method and device, electronic equipment and storage medium
Sobkowiak PRONUNCIATION IN EFL CALL
CN115273898A (en) Pronunciation training method and device, electronic equipment and storage medium
Handley et al. An Introduction to Speech Technology in Language Learning
Handley Evaluating text-to-speech (TTS) synthesis for use in computer-assisted language learning (CALL)
KR20080087694A (en) System and method for animation-based simulation of vocal organs to support pronunciation teaching
Setter RICHARD CAULDWELL, Streaming Speech: Listening and Pronunciation for Advanced Learners of English. Birmingham: Speech in Action, 2002. Windows CD-ROM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12857944

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: COMMUNICATION NOT DELIVERED. NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.01.2015)

122 Ep: pct application non-entry in european phase

Ref document number: 12857944

Country of ref document: EP

Kind code of ref document: A1