AU2008100836B4 - Real-time realistic natural voice(s) for simulated electronic games - Google Patents

Real-time realistic natural voice(s) for simulated electronic games

Info

Publication number
AU2008100836B4
Authority
AU
Australia
Prior art keywords
unit
voice
speech
articulators
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2008100836A
Other versions
AU2008100836A4 (en)
Inventor
Diana Ford
Violet Ford
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Machinima Pty Ltd
Original Assignee
Machinima Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2007904687A external-priority patent/AU2007904687A0/en
Application filed by Machinima Pty Ltd filed Critical Machinima Pty Ltd
Priority to AU2008100836A priority Critical patent/AU2008100836B4/en
Application granted granted Critical
Publication of AU2008100836A4 publication Critical patent/AU2008100836A4/en
Publication of AU2008100836B4 publication Critical patent/AU2008100836B4/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A63F2300/6063 Methods for processing data by generating or executing the game program for sound processing
    • A63F2300/6081 Methods for processing data by generating or executing the game program for sound processing generating an output signal, e.g. under timing constraints, for spatialization
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A63F2300/66 Methods for processing data by generating or executing the game program for rendering three dimensional images
    • A63F2300/6607 Methods for processing data by generating or executing the game program for rendering three dimensional images for animating game characters, e.g. skeleton kinematics

Description

(1) Field of the innovation

The present innovation relates to physiological sound production from voice typesets, but more particularly, it relates to an apparatus and a method to produce and transform unique voices using 3D shaped models to act as substitutes for voice actors.

(2) Description of prior art

There is a proposed model for integrating a 3D face and vocal tract model (e.g. see reference 1: Extensive infrastructure for a dynamic 3D face and vocal-tract model, Vogt, F., Fels, S. S. et al., University of British Columbia, Canada). There is a proposed text-to-speech package, Gnuspeech, released under the GNU Project - Free Software Foundation, based on real-time, articulatory, speech-synthesis-by-rules (see reference 2). There is a proposed time-varying three-dimensional model of the vocal tract (e.g. see reference 3: A time-varying three-dimensional model of the vocal tract, Lu, X. et al., University of Auckland, New Zealand). There are proposed patents (e.g. see US patent documents: System and method for accented modification of a language model, January 1, 2008, patent number 7315811, and Speech synthesis apparatus and speech synthesis method, March 25, 2008, patent number 7349847). Also, there are two applications currently on the market, HL2's Faceposer and Moviestorm, that offer lip-syncing for simulated electronic game characters to increase the realism of the gameplay sound-recording experience (see references 6 and 7).

The articulatory speech synthesizer disclosed in reference 1 is a proposed articulatory experimental framework extended to include faces, where the input will be a state vector of the shape and the acoustic output will depend on airflow through the vocal tract, integrating results from aerodynamics. Fig. A is a block diagram showing the structure of the articulatory speech synthesis apparatus disclosed in the paper. The speech synthesis apparatus includes a graphic user interface 100, a geometry model 101, a simulator engine 102, a synthesis engine 103, a numerics engine 104 and a graphics rendering engine 105. The graphic user interface 100 allows for human interaction with the software, the geometry unit 101 allows manipulation of the animation data, the simulator engine 102 is responsible for the updates to those animations, the synthesis engine 103 is responsible for generating the speech, the numerics engine 104 separates the numeric algorithms from the others for optimisation, and the graphics rendering engine 105 is responsible for rendering all graphics.

The speech synthesiser disclosed in reference 2 uses area functions of the vocal tract to create a digital filter for speech synthesis. Reference 3, Gnuspeech, as a text-to-speech package, is modelled on one person's vocal tract. Reference 4 refers to two patents: one customises a language model through the addition to a database of different pronunciations that are specific to the speaker, and the other is an apparatus and method to modify sound so as to add emotions and other feelings to make it more realistic.

Fig. Patent #7349847 shows the process utilised by one of the embodiments of the latter patent described above. The wave file of the voice actor 108 is first imported. The voice is analysed 109. Prosody information is generated 110. The information data is compared to default voice elements stored in a database 111. A ratio is extracted based on this comparison, to be applied to the prosody information to modify the voice 112. This new voice information can now be stored in the database 113.
Speech is in this way extrapolated 114.

References 6 and 7 relate to the two applications. The first offers a keyboard for manually typing in the phonemes to generate the 3D face frames; one then needs to export these frames and import them into an editing program to add the intended sound. The second offers an easier way to lip-sync by generating the appropriate phonetics from the actual sound file in one step, thereby reducing the procedure required to lip-sync audio to video.

Fig. HL2 shows the process utilised by the mainstream electronic game HL2 to generate lip-synced frames. The user types in the phonemes associated with the text/audio file they wish to use 118. 2D frames are generated and stored 119. These are mapped to the different phonemes 120. The user can export the silent clip to another program for ADR 121.

Fig. Moviestorm shows the process utilised in a mainstream application to record electronic video games. Moviestorm allows the user to first import the audio file of the voice actor 124; the phonemes and frames are automatically generated 125 and 126 respectively. The phonemes are mapped to the audio file 127. Playback is now possible 128 with the audio file lip-synced to the frames.

Summary of the Innovation:

However, the speech synthesis apparatuses and methods disclosed in references 1 to 5 have a problem in that they do not attempt to categorise voices or to produce multiple unique lip-synced natural voices. They are not able to generate narration and lines of dialogue according to user specifications. They are not able to generate unique voices from text only.

References 1, 2 and 3 relate to three proposed models that are articulatory based and thus rely on MRI and other X-ray data to map out the shape and type of vocal tract unique to one person, thereby limiting the articulation to the person that the imaging data belongs to. The possibilities for modification of the voice extend to the loudness, pitch, tone etc., the physical qualities of the audio wave, but do not extend to modification of the voice to an extent that it produces two different unique voices that can be mapped to multiple different people. These models are focused on articulation and less on transforming the qualities of the voice depending on the emotional state of the person. These are articulatory based, whereas this apparatus is both articulatory and formant based.

The two mentioned patents, number 7315811 and number 7349847, relate to formant based synthesisers. Both innovations propose the modification of sounds to different voices based on already stored default speech information from a database, thereby offering a semi-true transformation of the sound produced, and do not offer true transformations. It is semi-true because the extrapolation and generation of the prosody information required to synthesise the speech relies on a ratio obtained by comparing the default modal speech elements already stored in the database and then using the ratio to alter the newly generated phoneme(s).

Reference 5 relates to products on the market at the moment in 2008. Both packages are not speech synthesisers but are more focused on producing lip-synced images using the phonemes of an imported sound file/text. Even though both offer a natural and intelligible outcome if utilised, when it comes to using the HL2 application, the lip-syncing still relies on dubbing the voice over in post-production.
It does not take place in real time, so the user is always dealing with the tedious, time-consuming process of dubbing over again and again to get the sound to match the visual frames and to ensure the words, sentences etc. spoken are not too slow or too fast. The second package, Moviestorm, continues to be in development; it is now in beta stage, and even though it offers a more expedient way to produce lip-synced 3D images from imported audio files, it remains incapable of generating text to lip-synced 3D images. Both applications at this stage rely on voice actors to produce professional quality, intelligible and natural voices.

Accordingly, in order to offer a solution to the above mentioned shortfalls of the prior art, an object of the present innovation is to provide a speech synthesis apparatus and method which focuses on the physiology of sound production, which involves respiration, phonemation (from the word phoneme), resonation and articulation, in order to generate and modify multiple natural voices in real time and offer a means to lip-sync them simultaneously.

Ultimately, there is an infinite number of unique voices out there because of the infinite genetic make-up; however, obtaining MRI imaging material for each person to map out their unique vocal tract and vocal fold information is implausible, so the innovation focuses on providing a finite number of voice possibilities from an infinite number of arrangements of the different units responsible for generating intelligible voices.

The innovation proposes instituting typesets of different voices. Their initial point of discrimination will be the GUI selection criteria of age, gender, body type (fat, thin, posture etc.) of the speaker, country of origin (to determine the closest accent), and modal language selected (English, French etc., which will determine which phoneme dictionary to use). For example, the body type is important because it determines how freely the air is disseminated to the resonators and articulators to produce a certain voice type, which can again be discriminated from its counterpart's voice if the body's posture is altered, as altering the chest position affects the capacity for air that can be generated, inevitably altering the amount and rate of air that can be disseminated to those bodily organs (i.e. unit articulators and resonators) to generate unique natural and intelligible voice(s).

It is important here to distinguish between the external workings and the internal workings of the apparatus and method proposed. The internal workings describe the apparatus and method of the synthesiser, while the external workings comprise the elements that constitute the GUI unit 135, see Figs 1-5. It is important to note that the arrangements and groupings of the external workings can easily be changed to other combinations without departing from the spirit of the innovation.

For the internal workings, the drive to generate sound will start with the respiration unit, which will ultimately have an effect on all three other units: the phonemation, the resonation and the articulation units. In this instance, it is the phonemation unit (from phoneme) that will be responsible for establishing the articulator units' minimum and maximum range for each of the geometrical shapes that comprise them, including but not limited to lips, nose, vocal tract, hyoid bone, cheeks, soft and hard palate etc.
In turn this will determine the voice typesets. For example, a breathy voice will generate fewer pressure units in the respiration unit, and thereby the finite number of possible articulator geometric shapes will be determined accordingly so as to still allow recognisable speech to be produced without distortion, deformation and/or loss of sound quality. The possible combinations of fundamental frequency for a breathy voice will also be finite, regardless of the ability of the resonation unit to deform the sounds further and produce unique frequency ranges depending on the phonemation sounds supplied by the phonemation unit.

The user will thus be able to select from a finite number of voice types, which are categorised according to the amount and rate of air the respiration unit releases to affect the glottis state. The user can then proceed to pre-configure the articulator units and change their surface, shape and size using a scale bar. This will further alter the tone, pitch, loudness etc. (voice quality) of the voice frequencies that will be generated. Sound in this context will be used interchangeably with voice or audio to mean intelligible voice.

These maximum and minimum possible ranges for deformation(s) of a sub-articulation unit, or a combination of sub-articulation units thereof, will enable the processing of only intelligible sound. The respiration unit, the articulators unit and the resonation/phonation unit will act as limiters in the first instance until their desired voice settings are achieved. For each possible geometry configuration of these articulators, or a combination thereof, there will be a newly established minimum and maximum range depending on voice type.

Once the settings are pre-configured to favour a certain type of voice, some frequencies will be more favoured than others while some will be restricted altogether. These settings will in effect produce accents. For example, if the natural voice selected in the first instance is a breathy voice and the text input data is English, then naturally the spoken words will be pronounced with an accent because of the placed limits on the sound frequencies that are possible using the breathy voice, while if the language selected is Hindi, which naturally uses breathy voicing in its phonemes, and the spoken sentence is in Hindi, then the output will sound proficient in that language. The innovation thus acknowledges that accents are produced because certain vowels and consonants are not pronounced using the right configuration of the articulator units. To reiterate, the theory is that the voice typeset influences how the vowels and consonants (vocal sounds) are formed and thereby pronounced.

Within these settings there will be room to further add emotions to the spoken voice by configuring the face unit. For example, one will be able to tense certain muscles in the face, add tears to the eyes etc. to produce visual emotional states that can then be translated to representations of voice information and added onto the voice. In this way, emotion is introduced into the voice output.

The phonemation unit will be responsible for analysing user text and/or audio files for further manipulation and then extrapolation of speech into the desired voice. The phonemation unit will enable the user to use samples of their own imported audio files to generate unique voices if they so wish.
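As an illustration of the limiter behaviour just described, the following minimal sketch shows one way a voice typeset could carry a glottal state, a pressure figure from the respiration unit, and per-articulator minimum/maximum ranges that clamp any user setting to the intelligible band. All names and numeric values are assumptions made for the example; the innovation does not prescribe a particular data layout.

```python
from dataclasses import dataclass

# Sketch of a voice typeset acting as a "limiter": each articulator parameter is
# clamped to the intelligible range defined for that voice type. All unit names
# and numeric ranges are illustrative assumptions, not values from the patent.

@dataclass
class VoiceTypeset:
    name: str                  # e.g. "breathy", "modal", "stiff"
    glottal_state: str         # glottis state that drives the typeset
    pressure_units: float      # air pressure supplied by the respiration unit
    articulator_ranges: dict   # articulator -> (min, max) geometric parameter

BREATHY = VoiceTypeset(
    name="breathy",
    glottal_state="open",
    pressure_units=4.0,        # fewer pressure units than a modal voice (assumed)
    articulator_ranges={
        "lips_aperture": (0.20, 0.80),
        "jaw_opening":   (0.10, 0.70),
        "tongue_height": (0.15, 0.85),
    },
)

def clamp_articulator(typeset: VoiceTypeset, articulator: str, value: float) -> float:
    """Limit a requested articulator setting to the intelligible range
    established for the selected voice typeset (the 'limiter' behaviour)."""
    lo, hi = typeset.articulator_ranges[articulator]
    return max(lo, min(hi, value))

# Example: a user drags the scale bar past the breathy-voice limit.
print(clamp_articulator(BREATHY, "lips_aperture", 0.95))  # -> 0.8
```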
Up until this point the method and apparatus deal with four units and their interactions to synthesise sound. The innovation is however not limited to sound. The geometry unit 136 that works with the articulation unit enables the manipulation of 3D models. Thus, the frame generator unit that is part of the articulators unit works directly with the phonemation unit to extract information about the voice, such as voice characteristics, duration etc., and based on the information obtained it can map out the voice output to 3D models. The import/export function that also forms part of the articulators unit enables different 3D models to be imported and exported using 3D software packages that are familiar and known to those in the art. The render unit 133 will then be responsible for bringing the voice and images together for editing and finalising the product. It will also be possible to generate radio frequencies using the render unit 133, to be included to distinguish machine voices from real people's voices.

Note that the present innovation can be realised not only as a speech synthesis apparatus and method for simulated electronic games and for other simulated applications, both online and offline, but also as a categorising tool, by mapping different voice typesets for natural voices, and as a recording tool for sound. For more details see the Industry Applicability section.

Brief Description of the drawings:

Fig. A is a block diagram showing the structure of an articulatory speech synthesis apparatus.

Fig. Patent #7349847 is the flowchart of operation of the speech synthesis method and apparatus of patent number 7349847.

Fig. HL2 is the flowchart of operation of the lip-syncing method and apparatus for an application that forms part of the HL2 game.

Fig. Moviestorm is the flowchart of operation of the lip-syncing method and apparatus for an application that promotes using a game engine to record simulated game-like movies.

Fig. B is a block diagram showing the structure of the speech synthesis apparatus disclosed in the patent.

Fig. C is the flowchart of operation of the speech synthesis method and apparatus according to the basic embodiment (without lip-syncing).

Fig. D or 9 is a block diagram of the glottis used to formulate the different functions that determine the exact units of air pressure needed to flow through it to maintain each one of its different states.

Fig. E is a block diagram showing the structure of the articulation unit according to the first embodiment.

Fig. F is a block diagram showing the structure of the phonemation unit.

Fig. G is a block diagram showing the structure of the waveform analyser unit that is part of the phonemation unit.

Fig. H is a block diagram showing the structure of the phonation and resonation unit.

Fig. I is the flowchart of operation of the speech synthesis method and apparatus according to the first embodiment.

Fig. J is a block diagram showing the structure of the articulation unit according to the second embodiment.

Fig. K is the flowchart of operation of the speech synthesis method and apparatus according to the second embodiment.

Fig. L is a block diagram showing the structure of the articulation unit according to the third embodiment.

Fig. GUI is a block diagram showing one possible structure of the GUI unit 135 for the articulation unit, respiration unit and resonation unit respectively.
No attempt has been made to map out the GUI for the phonemation and phonation units because these will be apparent to those familiar with the art, whilst the GUI for the three described units may not be immediately apparent.

Fig. 1 is a block diagram showing the structure of the GUI. The upper level is divided into the voice unit and the head unit.

Fig. 2 is a block diagram showing the structure of the GUI. The 'voice' level is divided into the pump unit, the vibrators unit and the articulators unit.

Fig. 3 is a block diagram showing the structure of the GUI, the head unit level.

Fig. 4 is a block diagram showing the structure of the GUI, the articulator unit, sub-level of the voice unit.

Fig. 5 is a block diagram showing the structure of the GUI sub-level of the articulator unit, the lips.

Detailed Description:

The details herein make reference to the figures and drawings, which show the embodiment by way of illustration. It is described in detail to enable those skilled in the art to practise the innovation. It should be understood that other embodiments may be realised and that logical and mechanical changes may be made without departing from the spirit and scope of the innovation. The detailed description herein is for illustration purposes and not for limitation purposes. For example, it will be apparent to those skilled in the art that there is more than one correct mathematical formula to calculate voice frequency, or more than one way to use calculus equations to calculate the area of a geometric shape to establish the fundamental frequency of a voice. Furthermore, conventional functional aspects of the apparatus (and components of the individual operating units of the apparatus and method) may not be described in detail herein because many alternative or additional functional relationships may be present in a practical synthesiser apparatus. The connecting lines shown in the various figures are intended to represent exemplary functional relationships between the units.

The word 'voice(s)' herein implies articulatory human speech not restricted to one type of language. 'Simulated electronic games' is used to refer simultaneously to both online and offline games, connected or not via a network.

The innovation thus relies on evaluating voices according to 1) glottal opening/closing and 2) selected vocal property criteria such as tone, pitch, loudness etc. Once the first, broadest classification is made (1), further sub-types (2) can be introduced based on range, weight, tessitura, timbre, transition parts (i.e. breaks and lifts in the voice), speech level and registration, but not limited to only these.

The first embodiment (without lip-syncing):

Fig. B is a block diagram showing the structure of the speech synthesis apparatus disclosed in the patent. The respiration unit 129 supplies a predetermined airflow amount and rate that affects the articulation unit 131, the phonemation unit 130 and the phonation/resonation unit 132. The phonemation unit 130 supplies the text to be spoken or predetermines a custom voice to be used to generate sound. The articulation unit 131 houses the most important articulators that help change and shape sound, such as the nose, jaw, lips etc., depending on their configurations. The phonation/resonation unit enhances sound as well as synthesising it. In the first embodiment, creating a unique voice is a two-stage process.
First, the user chooses a voice typeset, or a combination thereof, that depends mainly on information and functions supplied by the respiration unit 129. The respiration unit now affects the articulators' configuration and the phonemation process; for example, with limited air and depending on the number of words in the text, the speed at which the text has to be output as intelligible speech can be calculated. The user is able to further select voice subtype(s) and/or a combination thereof by being able to alter the geometric shape, size, surface etc. (properties of the articulators' geometric mesh). This will alter the sound qualities. The sound qualities are also affected by the resonation unit and the phonation process itself.

The four units work together with the GUI unit 135, geometry unit 136, add/store unit 137, render unit 133 and analysis unit 134. The structure of each of these will be examined in detail next, but the simple embodiment is about demonstrating the top-level structure by which sound is produced. The three units, the respiration unit 129, the phonemation unit 130 and the articulation unit 131, provide the configurations that enable the phonation unit to synthesise the sound. The settings of the phonemation unit 130 and articulation unit 131, and of the resonation apparatuses that form part of the phonation and resonation unit 132, are dependent on the respiration unit 129 configurations. When the respiration unit settings are set, a main voice type is selected. Each voice type has further settings specific only to it that can be found again in all four units discussed herein, which can be selected from the GUI 135. The add/store unit 137 makes it possible to add multiple voices and store their configuration settings in a database.

One element of the render unit 133 is the privacy unit 174. The privacy unit is able to generate a radio frequency wave that does not register with the human ear but does register with any electronic voice-analysing device. The render unit 133 will be able to embed the generated radio frequency wave, of a limited duration, along with the phonation produced. This process will ensure no human voice is compromised and used without permission, so as to prevent, for example, fraudulent transactions, using the method and apparatus described herein.
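The privacy unit 174 is described above as embedding a signal that is inaudible to humans but detectable by voice-analysing equipment. As an illustrative stand-in only (the text speaks of a radio frequency wave; this sketch instead mixes a low-amplitude near-ultrasonic audio tone into the rendered voice), one possible shape of such an embed/detect pair is sketched below, with all parameter values assumed.

```python
import numpy as np

# Illustrative sketch only: mix a short, quiet near-ultrasonic tone (19.5 kHz,
# assumed) into the rendered voice and later check for energy in that band.
# This is a stand-in for the privacy unit 174, not the innovation's method.

def embed_marker(voice, sample_rate=44100, marker_hz=19500.0,
                 duration_s=0.5, amplitude=0.01):
    n = min(len(voice), int(sample_rate * duration_s))
    t = np.arange(n) / sample_rate
    marker = amplitude * np.sin(2 * np.pi * marker_hz * t)
    out = voice.astype(float).copy()
    out[:n] += marker                      # marker rides on the start of the clip
    return np.clip(out, -1.0, 1.0)

def detect_marker(audio, sample_rate=44100, marker_hz=19500.0):
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), 1.0 / sample_rate)
    in_band = spectrum[(freqs > marker_hz - 50) & (freqs < marker_hz + 50)].mean()
    ref_band = spectrum[(freqs > marker_hz - 550) & (freqs < marker_hz - 450)].mean()
    return in_band > 3 * ref_band          # marker clearly above a neighbouring band (assumed ratio)
```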
The analysis unit 134 will be able to gather data about the voice output and issue a detailed or short summary report about the type of sound, its pitch, tonal qualities etc. This will enable modifications to be made to the generated voice if the user is not happy with the voice quality output. The information can then be stored in a database. This sound analysis is common in the area of the art, available both for free from various Internet web pages and widely distributed commercially, and so it does not warrant further description here. See software like TrueRTA for an example of a real-time sound analyser.

One element of the geometry unit 136 is the import/export unit. The import/export unit provides the ability to import 3D models designed in 3D software, including but not limited to Maya, MilkShape and 3D Studio Max, and to package face models to be exported for further modification and publishing on the internet, for example. This is highly common in the art and the process is apparent to those skilled in it, but to give an example, The Sims 2's Body Shop application allows one to import models and package them for export to enable the production of unique-looking characters.

Fig. C is the flowchart of operation of the speech synthesis method and apparatus according to the first, basic embodiment. Choose a voice type 140, then choose a sub voice type 141, then generate the voice using the phonation and articulation units 142. A more detailed flowchart of operation is included in Fig. I after a detailed description of each one of the four units, the articulation unit 131, respiration unit 129, phonemation unit 130 and phonation/resonation unit 132, is provided.

Fig. D or 9 is a block diagram of the glottis. Different functions are formulated by the respiration unit that determine the exact units of air pressure needed to flow through it to maintain each one of its different states. The diagram shows the glottis from closed to open. The respiration unit is responsible for generating x cm of pressure units to affect the glottis state. The glottis is found in the resonation unit, where the vocal folds are also found. An open glottis will generate voiceless 46, breathy voice 45 and slack voice 44. The maximum vibration occurs at the modal voice 43, then stiff voice 42, creaky voice and finally closed glottis 41. The tension of the vocal cords and the aperture of the arytenoid cartilages are regulated by the simulator unit, part of the phonation and resonation unit 132, which will ensure that the right function(s) are used to calculate the amount of air pressure as well as the rate at which it is to be applied to the glottis. Functions that work with information such as standard threshold pressure will be apparent to anyone familiar with the art.
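By way of illustration of the functions the respiration unit could formulate for Fig. D, the sketch below maps each glottal state to an assumed aperture and an assumed range of pressure units; the numbers are not taken from the innovation and serve only to show the lookup.

```python
# Illustrative mapping from glottal state to the air pressure range (in arbitrary
# "pressure units") needed to maintain it, following the ordering of Fig. D from
# open glottis to closed glottis. All numeric values are assumptions.

GLOTTAL_STATES = {
    "voiceless": {"aperture": 1.00, "pressure_units": (1.0, 3.0)},   # 46
    "breathy":   {"aperture": 0.80, "pressure_units": (2.0, 4.0)},   # 45
    "slack":     {"aperture": 0.60, "pressure_units": (3.0, 5.0)},   # 44
    "modal":     {"aperture": 0.40, "pressure_units": (5.0, 8.0)},   # 43, maximum vibration
    "stiff":     {"aperture": 0.25, "pressure_units": (6.0, 9.0)},   # 42
    "creaky":    {"aperture": 0.10, "pressure_units": (7.0, 10.0)},
    "closed":    {"aperture": 0.00, "pressure_units": (0.0, 0.0)},   # 41, no airflow
}

def pressure_for_state(state: str, rate: float = 1.0) -> float:
    """Return the mid-range pressure the respiration unit should supply to hold
    the requested glottal state, scaled by an airflow-rate factor."""
    lo, hi = GLOTTAL_STATES[state]["pressure_units"]
    return rate * (lo + hi) / 2.0

print(pressure_for_state("breathy"))  # -> 3.0
```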
The state of the glottis, i.e. the type of voice selected by the user, restricts the articulators' configuration, and thereby, when speech is synthesised and the phonemation unit 130 supplies the phoneme information to the phonation/resonation unit 132, the articulators' restricted movements may place a restraint on the ability of the phonation/resonation unit 132 to realise a native speaker's voice as the preferred accent; instead it may generate another accent as a direct result of the geometric configuration of a sub-articulator or a combination thereof. The word 'accent' is here taken to refer to a +/- ratio that is applied to the accepted pronunciation of phonemes in a particular language to create a modified pronunciation based on the physiological configuration of the human body, specifically the articulators, because they are responsible for producing intelligible voices in humans. For example, if the voice selected is stiff voice, and one language that uses stiff voice is Thai, then the phonemation unit's 130 ideal configuration will be the Thai language, and mapping English phonemes, for example, to the stiff voice will generate an accented voice output that can be classified as Thai. These voice types will be further mapped out to the likely country of origin of a person, which will further enable the analysis unit 134 to make an educated guess as to the likely country of origin of the speaker, if not defined at the top-level GUI stage, to help with selecting a main voice type of choice.

Fig. E is a block diagram showing the structure of the articulation unit according to the first embodiment. It maps out the articulator units, known also as the limiters. The limiters are divided further into lips 144, nose 145, jaw 146, teeth 147, cheeks 148, hard and soft palate 149, vocal tract 150, hyoid bone 152 and tongue 151. These can act independently of each other or work together to create recognisable speech sounds. It is important that the minimum and maximum thresholds of these limiters are set depending on the selected voice type, to ensure that they produce intelligible speech sounds regardless of the voice type selected. Synthesising the sound at different geometrical shapes of the limiter units, independently of each other and then working together dependently, and mapping out the results accordingly, is one method that can be used to obtain the minimum and maximum levels at which the limiter units can be set whilst still maintaining the ability to produce intelligible sound. The same process as the one discussed herein will need to be applied across all voice types to determine the minimum and maximum range for the limiter unit(s).

Another means of obtaining the data is to use the area function of the geometric mesh of the vocal tract using the calculating area of tract unit 153, which calculates the area of the vocal tract by finding the remaining area after subtracting all the other limiter units' areas from the whole. This needs to be done for each voice type in order to extract the different frequencies of the different phonemes for a set language such as English. The calculating frequency unit 154 is then used to calculate the different frequencies that the vocal tract(s) can generate under different voice type conditions. Those intelligible frequencies will need to be mapped out for each voice type so that at each end, where they are defined as non-intelligible, they will serve as the new minimum and maximum threshold limits for each limiter unit for the different voice types, which will then be displayed, selected and applied using the scale bar unit 155. It is assumed that the shape of the vocal tract in this embodiment can be changed and shaped by deforming the shapes of the different limiter units, such as the lips and tongue, because their area is either maximised or minimised, thereby either reducing or increasing the area of the vocal tract. Other, more appropriate means commonly used in the art may be used to calculate and obtain the minimum/maximum threshold results of the limiter units, by performing different tests that will generate output data to enable the marking of the non-intelligible threshold areas for each limiter unit under different voice types, without departing from the spirit of the innovation.

The scale bar will further affect the basic product of phonation and affect tonal qualities, because altering the size, shape, thickness of walls or surface of a limiter unit, or a combination of limiter units thereof, will change how the different frequencies are expressed when the air travels through them in the phonation and resonation stage. For example, the softer the surface, the more varied the fundamental frequencies that can penetrate through, in contrast to hard surfaces, which restrict the fundamental frequencies passing through them to one frequency or multiples of the same frequency. Vocal resonation will be expanded on in Fig. H.
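As a simplified example of the kind of computation the calculating area of tract unit 153 and calculating frequency unit 154 could perform, the sketch below uses the textbook approximation of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1)c / 4L. This well-known simplification stands in for the innovation's area-function method, which is not spelled out; the speed-of-sound constant and the example length are assumptions.

```python
import math  # imported for completeness; only arithmetic is used below

# Approximate the vocal tract as a uniform tube closed at the glottis and open
# at the lips; its resonances (formants) are F_n = (2n - 1) * c / (4 * L).

SPEED_OF_SOUND = 35000.0  # cm/s in warm, humid air (approximate)

def tube_formants(tract_length_cm: float, n_formants: int = 3) -> list:
    return [(2 * n - 1) * SPEED_OF_SOUND / (4.0 * tract_length_cm)
            for n in range(1, n_formants + 1)]

def tract_length_from_area(volume_cm3: float, mean_area_cm2: float) -> float:
    """If the articulation unit reports the tract volume and a mean
    cross-sectional area, an effective length follows as volume / area."""
    return volume_cm3 / mean_area_cm2

# Example: a 17.5 cm tract (typical adult male) gives formants near 500, 1500, 2500 Hz.
print([round(f) for f in tube_formants(17.5)])
```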
Fig. F is a block diagram showing the structure of the phonemation unit. Phonemation herein refers to the process of mapping phonemes to the text and extracting or generating information based on the language selected. For example, if it is English, then the phoneme voice characteristics and duration extracted using the extraction unit will be according to the English phoneme dictionary and no other.

First, the determine user-input unit 156 selects the type of input the user makes, i.e. text based, or a sound file of various acceptable formats, i.e. wav file, AIFF file etc. Second, the phonemes and other useful information such as duration are extracted via the extraction unit 157; the information is then passed on to the script analyser unit 159, and a representation of it is also stored in the store unit 161. The script analyser unit 159 asks for more user input to establish more parameters that will affect the quality of the vocal sound produced, for example increased stress on certain words and decreased stress on others, speech levels, pauses/breaks in the speech patterns with their durations etc., but not limited to these. The unit is also responsible for calculating a representation of the data to be stored and then retrieved when needed. The adjustment unit 160 makes the necessary adjustments to the extracted information by comparing the two representation files and, for the affected phonemes and information, keeping the modified new version and discarding the counterparts that have not been modified, whilst retaining the remaining extracted information that has not been modified by the script analyser unit. This is one method by which the adjustment unit can modulate and effect changes to information already extracted from the text data, but other functions and methods that can easily be utilised without detracting from the spirit of the innovation will be apparent to those skilled in the art.

For example, suppose a phoneme 'at' is mapped on a graph with decibels on the y-axis to determine loudness, and the same phoneme 'at', as modified by the script analyser unit 159, is mapped on the same graph. The adjustment unit 160 will then detect that 'at' has been modified, calculate a representation of the latter graph information to store, and discard the former 'at' graph, because it has registered that the script analyser unit has modified the phoneme 'at'.
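A minimal sketch of the adjustment unit 160 behaviour described in the 'at' example above: keep the script-analyser-modified representation of a phoneme where one exists, otherwise keep the originally extracted one. The field names are assumptions.

```python
# Sketch of the adjustment unit 160: extracted phoneme information is kept
# unless the script analyser unit 159 produced a modified version, in which case
# the modified representation replaces the original. Field names are assumed.

def adjust(extracted: dict, modified_by_script_analyser: dict) -> dict:
    adjusted = {}
    for phoneme, info in extracted.items():
        if phoneme in modified_by_script_analyser:
            adjusted[phoneme] = modified_by_script_analyser[phoneme]  # keep modified version
        else:
            adjusted[phoneme] = info                                  # keep original
    return adjusted

extracted = {
    "at": {"duration_ms": 120, "loudness_db": 60},
    "th": {"duration_ms": 90,  "loudness_db": 58},
}
modified = {
    "at": {"duration_ms": 150, "loudness_db": 66},  # stress increased by the script analyser
}
print(adjust(extracted, modified))
```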
In the case of an imported sound file, the waveform analyser unit 162 will analyse the imported file to determine whether the sample supplied is sufficient to extract all the appropriate information needed to generate the same unique sound, or whether more information is required. The phonemes and other prosody information will be analysed by the prosody generation unit 163 and then compared with the stored parameters for the minimum prosody information and phonemes required to synthesise speech using the same voice. These thresholds are widely accessible and known to those familiar with the art. The decision unit 164 will issue the decision and the user will have two choices: synthesise text using this sound file, in which case the user will be directed to the text analyser unit 157 for more text data; or have the sound file converted to text via the convert to text unit 165 so that it can be run through the script analyser unit 159. If the voice sample does not meet the minimum requirements to enable automatic prosody generation, because, for example, the file is only of a short duration, then the decision unit 164 notifies the waveform analyser unit 162 and the process repeats. When the convert to text unit 165 and adjustment unit 160 have finished executing, the process can proceed to the phonation and resonation unit 132, Fig. B, for execution.

There are a number of different processes that can be used to convert audio to text, which will be familiar to those specialising in the art. The convert to text unit 165 will simply utilise one of the methods used by some of the popular transcribers on the market, such as Dragon NaturallySpeaking 9.

Fig. G is a block diagram showing the structure of the waveform analyser unit that is part of the phonemation unit. Furthermore, what needs to be determined is whether the speech sound can be extracted from other background sounds if background sounds are present, i.e. the sound is determined to be a mix of voice with other diegetic and/or non-diegetic sound such as sound effects, music etc. Sometimes the range of the different frequencies is wide, as for example when a humming sound is recorded with the speech. The frequency analyser unit 166 analyses the different frequencies to identify intelligible speech frequencies by running a comparison against the female/male fundamental frequency thresholds. The speech frequencies for both females and males cover a range, and so other criteria will need to be set to help with the process of identifying human speech, along with the ability to map the different sounds onto different timeline units to enable extraction of the speech only. The speech extractor unit 167 will then extract the voice and the rest of the sounds will be discarded. The process may need to be repeated more than once, with custom input of frequency numbers, to extract the desired frequency and avoid things like clipping of certain frequencies etc. This is the responsibility of the custom frequency extractor unit 168. Those involved in the art will appreciate that these 'if.. elseif.. else/for' conditional loop statements are widely used in the art to loop through a set of criteria to get a match to the desired results. The applicability of this process is to enable the extrapolation of intelligible voices from electronic devices such as mobile phones, TV, radio etc., and not to limit it to only voices obtained under optimum conditions.
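A crude sketch of the comparison performed by the frequency analyser unit 166 and speech extractor unit 167 follows: frames whose dominant frequency falls inside typical male or female fundamental-frequency ranges (commonly cited as roughly 85-180 Hz and 165-255 Hz; these figures are not from the innovation) are kept, the rest discarded. A practical extractor would need considerably more than this.

```python
import numpy as np

# Crude sketch: frame the signal, estimate each frame's dominant frequency, and
# keep only frames whose dominant frequency falls in typical male/female F0
# ranges. Ranges and frame length are assumptions for illustration.

MALE_F0 = (85.0, 180.0)
FEMALE_F0 = (165.0, 255.0)

def dominant_frequency(frame: np.ndarray, sample_rate: int) -> float:
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])

def extract_speech_frames(signal: np.ndarray, sample_rate: int,
                          frame_len: int = 2048) -> np.ndarray:
    kept = []
    for start in range(0, len(signal) - frame_len, frame_len):
        frame = signal[start:start + frame_len]
        f0 = dominant_frequency(frame, sample_rate)
        if MALE_F0[0] <= f0 <= MALE_F0[1] or FEMALE_F0[0] <= f0 <= FEMALE_F0[1]:
            kept.append(frame)           # likely voiced speech; keep
        # else: discarded as background sound (music, humming, effects)
    return np.concatenate(kept) if kept else np.array([])
```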
Fig. H is a block diagram showing the structure of the phonation and resonation unit. The phonation and resonation unit 132 is responsible both for synthesising the sound and for enhancing the sound one last time before it is synthesised. The phonation and resonation unit 132 is where all the pre-configured settings from the respiration unit 129, Fig. B, the phonemation unit 130, Fig. B, and the articulation unit 131, Fig. B, are executed. Phonation refers to modifying the airstream. In effect this is the simulator unit, where the aerodynamic theory familiar to all those in the art is taken into account to produce sound. The larynx unit 169 and the pharynx unit 170 are the most important resonators and thus are included in this unit separately from the articulation unit 131, Fig. B, because altering the larynx's and/or pharynx's shape, size, position(s) and degree of adjustability (i.e. their cavity) can produce other possible sound types that enhance the voice, for example the focalised voice, the harsh voice and the strident voice. The most important thing for producing intelligible speech is to measure the areas of the vocal tract and the larynx correctly to reproduce more accurate reflective frequencies. This will be apparent to those skilled in the art and is covered in a variety of reference material; see references 1, 2, 4 and 5. One way this can be achieved, however, is as proposed in the paper 'A time-varying three-dimensional model of the vocal tract': cross-sectional areas can be calculated as long as one minimises the residual area between the different cross-sections. These calculations can be finalised using the measure unit 173, which liaises with the articulation unit 131 to achieve the desired result.

The Accent Generator

In another embodiment, the method and apparatus described herein can act as an accent converter, where the phoneme voice characteristics and duration are extracted from one voice type and the same phoneme voice characteristics and duration are extracted from a second voice type. A ratio is obtained by subtracting the second phoneme of the second voice type from the first phoneme of the first voice type; this ratio can be applied to any voice to obtain an accent or a hybrid of accents. This can also be achieved from the measure unit 173 in the phonation and resonation unit 132, which stores the representations of predicted phoneme frequencies before or while a voice is synthesised in the same unit. For example, a fluent Thai speaker is likely to use a stiff voice or a combination thereof; therefore, by obtaining the voice characteristics and duration of the phonemes and storing them in a database, and subtracting that information from the voice characteristics and duration of a second person speaking English using a breathy voice type (breathy voice is attributed to the Hindi language), we obtain a ratio. This ratio can be applied to a third, modal voice type that favours the English accent to produce a hybrid of a soft Hindi accent with a stronger Thai accent, or, if applied to a fourth voice type, the breathy voice, which favours the Hindi accent, then the accent generated will be a straight Thai accent. Based on having two controlled dependent variables, in this instance the first and second voice typesets, one can generate hybrid accents. For more information on obtaining ratios from different phoneme information, refer to patent number 7349847 B2 entitled Speech synthesis apparatus and speech synthesis method (see diagrams 24-28; fourth embodiment).
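A minimal sketch of the accent generator follows: the per-phoneme characteristics of the second voice type are subtracted from those of the first, and the stored difference (the 'ratio' of the text) is applied, at full or partial strength, to a third voice to produce a straight or hybrid accent. Phonemes, parameter names and values are illustrative assumptions.

```python
# Sketch of the accent generator: per-phoneme characteristics of a second voice
# type are subtracted from those of a first, and the difference is applied to a
# third voice. All phoneme sets and numeric values are illustrative assumptions.

def accent_difference(voice_a: dict, voice_b: dict) -> dict:
    """Subtract voice B's phoneme characteristics from voice A's."""
    return {
        phoneme: {key: voice_a[phoneme][key] - voice_b[phoneme][key]
                  for key in voice_a[phoneme]}
        for phoneme in voice_a
    }

def apply_accent(target_voice: dict, diff: dict, strength: float = 1.0) -> dict:
    """Shift a third voice's phoneme characteristics by the stored difference;
    a partial strength yields a hybrid accent."""
    return {
        phoneme: {key: target_voice[phoneme][key] + strength * diff[phoneme][key]
                  for key in target_voice[phoneme]}
        for phoneme in target_voice
    }

stiff_thai    = {"aa": {"f0_hz": 150.0, "duration_ms": 110.0}}
breathy_hindi = {"aa": {"f0_hz": 130.0, "duration_ms": 140.0}}
modal_english = {"aa": {"f0_hz": 120.0, "duration_ms": 120.0}}

diff = accent_difference(stiff_thai, breathy_hindi)
print(apply_accent(modal_english, diff))         # full shift toward the Thai accent
print(apply_accent(modal_english, diff, 0.5))    # hybrid, softer accent
```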
Fig. I is the flowchart of operation of the speech synthesis method and apparatus according to the first embodiment. The process starts 175; a user selects a voice type 176; the user then further selects a sub-type of the chosen voice 177; the user inputs text data or a sound file and sets further criteria on the text by setting breaks in the script etc. 178. The user then sets the larynx and pharynx settings to enhance the sound even further 179. The user performs a render if they would like to have the apparatus generate a radio frequency sound wave with the output speech 180. The speech is produced 180. The user may select to issue a speech report 181.

Another two embodiments will be discussed herein: one that allows for emotions to be applied, and another in which lip-syncing is possible. Hereafter, the second embodiment of the present innovation is explained in detail with reference to the diagrams.

Fig. J is a block diagram showing the structure of the articulation unit according to the second embodiment. The deformation unit 184, the face unit 183 and the rate of change unit @ T1, T2... 185 are in this embodiment added to the articulation unit 131. The deformation unit 184 will allow the user to deform the geometric meshes of the articulator units as well as the face unit 183. The area of the vocal tract unit will be calculated taking into account the areas of all the geometric shapes and not just the articulator units. The measure unit 173 can map out a prediction frequency, and a ratio can in this way be established between the default prediction frequency and the prediction frequency after the deformations are applied. The ratio and the function representations can be stored in the store unit 186 and reused under a default emotion name the user assigns to them (i.e. sad, happy etc.). The rate of change unit 185 calculates how fast a phoneme changes when the deformations are applied, with respect to multiple times taken from the area units. This rate of change can be used to map out the area functions of the geometric meshes when different voice types are selected, thereby generating voices with emotions. Other methodologies to achieve the same results are possible and will be known to those familiar with the art.

Fig. K is the flowchart of operation of the speech synthesis method and apparatus according to the second embodiment. The process is essentially the same as the one outlined in Fig. I with an added step, deform face 190, which is fitted between inputting the text data and/or sound file 189 and determining the resonator settings 191.

Fig. L is a block diagram showing the structure of the articulation unit according to the third embodiment, with an added frame generator unit 194. In the final embodiment presented herein, the user is able to perform lip-sync operations to match the voice output to 3D models. The frame generator unit 194 works directly with the phonemation unit 130 to generate frames based on the phoneme information, such as prosody, voice characteristics and duration. The frame generator will store representations of the frames in the store unit 194. There is a relationship between each of the limiter units, e.g. lips, nose etc., and the face unit, and the frame generator unit 194, even though the relationship is not marked by arrows, for the sake of the readability of the diagrams in Fig. L. The process of generating frames is common and will be known to those familiar with the art, and will not be covered herein. It should be understood that many modifications will be readily apparent to those skilled in the art, and this method and apparatus is intended to cover any adaptations or variations thereof. The innovation should be limited only by the claims and equivalents thereof.

The render unit 133, Fig. B, will then bring the frames and sound together in a timeline setting to enable further editing and matching of lip movement to sound if required. It will be apparent to those in the art how to generate a timeline to allow for editing, like WYSIWYG visual/sound editing software such as Final Cut Pro, Avid or Pro Tools, to name a few.
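As an illustration of how the frame generator unit 194 and the render unit 133 timeline could line mouth shapes up with the synthesised audio, the sketch below converts per-phoneme durations into frame-index spans at a chosen frame rate; the viseme names, durations and frame rate are assumptions made for the example.

```python
# Sketch of lip-sync frame scheduling: each phoneme's duration (as supplied by
# the phonemation unit) becomes a span of frame indices on the render timeline,
# each labelled with an assumed mouth shape (viseme).

def schedule_frames(phonemes, frame_rate=25.0):
    """phonemes: list of (phoneme, viseme, duration_seconds) in speech order.
    Returns a list of (frame_index, viseme) assignments for the timeline."""
    timeline = []
    t = 0.0
    for phoneme, viseme, duration in phonemes:
        start = int(round(t * frame_rate))
        end = int(round((t + duration) * frame_rate))
        for frame in range(start, max(end, start + 1)):  # at least one frame per phoneme
            timeline.append((frame, viseme))
        t += duration
    return timeline

utterance = [("h", "open", 0.08), ("e", "wide", 0.12),
             ("l", "tongue_up", 0.10), ("ou", "round", 0.20)]
for frame, viseme in schedule_frames(utterance):
    print(frame, viseme)
```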
Fig. GUI is a block diagram showing one possible structure of the Graphical User Interface (GUI) unit 135 for the articulation unit 131, the respiration unit 129 and part of the phonation/resonation unit 132 respectively. No attempt has been made to map out the GUI for the phonemation unit 130 because it will be apparent to those familiar with the art, whereas the GUI for the three units mentioned above may not be immediately apparent.

One instance of the GUI will comprise the lips, nose, jaw, teeth, cheeks, palate, vocal tract, skull and hyoid bone, part of the articulators 195. The vibrators 201 will comprise shape, size, muscle thickness, false folds and larynx. The pump 200 comprises lung capacity and size. These are directly mapped to the articulation unit 131, resonation unit 132 and respiration unit 129 respectively. Then there are the modifiers 196 that allow the user to alter the articulators' shapes, sizes etc. once they have already been set. The scale bar 197 affects the tone, pitch, loudness etc. of the output by altering the parameters of the shape, size, surface etc. of the articulators. The save 198 allows the configurations to be stored.

Fig. 1 is a block diagram showing the structure of the GUI. The upper level is divided into the voice unit 5 and the head unit 6. The voice unit 5 is further subdivided into the pump unit 7, the vibrators unit 8 and the articulators unit 9. The head unit 6 will focus on producing emotions by deforming aspects of the face, i.e. making it more tense etc.

Fig. 2 is a block diagram showing the structure of the GUI. The 'voice' level is divided into the pump unit, the vibrators unit and the articulators unit. The user can make their way directly to the voice unit 5, where they can pre-configure the settings of the pump unit, the vibrators unit and the articulators unit before they make their way to the head unit.

Fig. 3 is a block diagram showing the structure of the GUI, the head unit level. Within these settings there will be room to further modify the fundamental frequency of the voice produced using the head unit, which will enable one to tense certain muscles in the face, add tears to the eyes etc. to introduce emotion into the voice output.

Fig. 4 is a block diagram showing the structure of the GUI, the articulator unit 9, sub-level of the voice unit 5. The hard palate 10, soft palate 11, tongue 12, lips 13 and teeth 14 will be able to be selected and further modified.

Fig. 5 is a block diagram showing the structure of the GUI sub-level of the articulator unit 9, the lips 13. As the sliding scale is adjusted, the lip size and shape change; for example, observe the dotted line for lip type 21. Once the lip settings are pre-configured to favour a certain type of voice, the pronunciation of a phoneme using the thin lips will vary from the pronunciation of the phoneme using the thicker lips. This way of typecasting accents based on articulator structures assumes that the more favourable phoneme pronunciations are biological rather than environmental. This is not to suggest that different people with different articulator configurations cannot learn different accents; it does, however, assume that the configuration of the articulators will favour certain voices over others, so certain accents are easier for some people to pick up than others, and in this application particularly the emphasis is on the fact that different voice types, which depend on articulator configurations, promote different accents.

It will be apparent to those skilled in the art that various modifications and variations to the currently proposed detailed descriptions are possible without departing from the scope and spirit of the innovation. It is intended that the innovation realise all the functions recited in the claims even if they are not explicitly described herein.
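To illustrate the lips sub-level of Fig. 5, the sketch below maps a single slider value onto lip mesh parameters by interpolating each parameter between assumed minimum and maximum values, so thin and thick lip configurations sit at the two ends of one scale bar. The parameter names and ranges are assumptions made for the example.

```python
# Sketch of the scale bar 197 at the lips sub-level (Fig. 5): a slider value in
# [0, 1] interpolates each lip mesh parameter between assumed bounds.

LIP_RANGES = {             # (value at slider = 0.0, value at slider = 1.0)
    "thickness_mm":  (4.0, 14.0),
    "aperture":      (0.2, 0.9),
    "protrusion_mm": (0.0, 6.0),
}

def lips_from_slider(slider: float) -> dict:
    slider = max(0.0, min(1.0, slider))           # keep within the scale bar
    return {name: lo + slider * (hi - lo) for name, (lo, hi) in LIP_RANGES.items()}

print(lips_from_slider(0.1))   # thin-lip configuration
print(lips_from_slider(0.9))   # thick-lip configuration
```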
Industry applicability:

The apparatus and method disclosed herein have wide applicability, including but not limited to the entertainment industry, where they may be embedded in simulated electronic games on the market, both online and offline via a network. It is becoming more and more popular to produce 2D/3D representations of a work to map out contingency factors and to test its audience reactions and appeal in rehearsals and in realising different visions. Certainly, the application will be useful in script readings instead of hiring voice actors, for example. It can also be used to analyse actors' voices to determine their more favoured accent. It can be used to better understand the intelligibility of different accented voices in multicultural societies. By providing an understanding of different types of voices, people are better able to assess what accent they would like to pick up and the steps and processes associated with changing accents. The method and apparatus thus have a wide scope of applicability in the education and entertainment sectors of our society.

Claims (5)

1. A speech synthesis apparatus or system for synthesising speech comprising: respiration, phonemation (from phoneme), articulation and resonation units that work as pre-configuration and settings unit(s) for a desired natural sounding voice; the respiration unit for selecting surface area, capacity and breathing rate and determining the glottis state; the articulation unit for calculating the size and shape of the articulators; the phonemation unit for analysing, extracting, adjusting, converting and storing text and audio voice files; and the resonation unit for synthesising the voice in real time.
2. A method for synthesising a natural sounding voice using the speech synthesis apparatus of claim 1 and including the steps of: typing in text or uploading a sound file; carrying out audio and/or text analysis to extrapolate prosody information and other speech information, wherein the information is used to modify the articulators in the first instance and set predefined voice frequency levels; selecting the surface area, capacity and breathing rate that regulates the glottis state(s); modifying the articulators' geometric shape and size accordingly, within predefined ranges, to minimise distortions and ensure the end result is an intelligible voice; and synthesising the voice in real time. The four units herein work together, or in combination of parts thereof, to analyse and pass on the information supplied by the user to the resonation unit, which is then responsible for synthesising and producing unique intelligible voices in real time.
3. The speech synthesis apparatus according to claim 1, wherein the articulation unit includes the important and necessary articulators to generate an intelligible voice, including but not limited to lips, nose, palate, tongue, hyoid bone, skull, teeth, cheeks, jaw and vocal tract.
4. The speech synthesis apparatus according to claim 1, wherein the respiration unit includes and is not limited to the glottis, responsible for the unique voice types.
5. A speech synthesis apparatus and method as herein described with reference to the accompanying diagrams.
AU2008100836A 2007-08-30 2008-09-01 Real-time realistic natural voice(s) for simulated electronic games Ceased AU2008100836B4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2008100836A AU2008100836B4 (en) 2007-08-30 2008-09-01 Real-time realistic natural voice(s) for simulated electronic games

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
AU2007904687 2007-08-30
AU2007904687A AU2007904687A0 (en) 2007-08-30 Real-Time Realistic Natural Voice(s) for Simulated Electronic Games
AU2008100836A AU2008100836B4 (en) 2007-08-30 2008-09-01 Real-time realistic natural voice(s) for simulated electronic games

Publications (2)

Publication Number Publication Date
AU2008100836A4 AU2008100836A4 (en) 2008-10-09
AU2008100836B4 true AU2008100836B4 (en) 2009-07-16

Family

ID=39865827

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2008100836A Ceased AU2008100836B4 (en) 2007-08-30 2008-09-01 Real-time realistic natural voice(s) for simulated electronic games

Country Status (1)

Country Link
AU (1) AU2008100836B4 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0896322A2 (en) * 1997-08-05 1999-02-10 AT&T Corp. Method and system for aligning natural and synthetic video to speech synthesis
US20040064321A1 (en) * 1999-09-07 2004-04-01 Eric Cosatto Coarticulation method for audio-visual text-to-speech synthesis
EP1113417A2 (en) * 1999-12-28 2001-07-04 Sony Corporation Apparatus, method and recording medium for speech synthesis

Also Published As

Publication number Publication date
AU2008100836A4 (en) 2008-10-09

Similar Documents

Publication Publication Date Title
JP4539537B2 (en) Speech synthesis apparatus, speech synthesis method, and computer program
US5940797A (en) Speech synthesis method utilizing auxiliary information, medium recorded thereon the method and apparatus utilizing the method
JP4125362B2 (en) Speech synthesizer
JP4355772B2 (en) Force conversion device, speech conversion device, speech synthesis device, speech conversion method, speech synthesis method, and program
CN108053814B (en) Speech synthesis system and method for simulating singing voice of user
JP2003114693A (en) Method for synthesizing speech signal according to speech control information stream
JP2005516262A (en) Speech synthesis
JP2003084800A (en) Method and apparatus for synthesizing emotion conveyed on sound
Umbert et al. Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges
Keller The analysis of voice quality in speech processing
JP2003186379A (en) Program for voice visualization processing, program for voice visualization figure display and for voice and motion image reproduction processing, program for training result display, voice-speech training apparatus and computer system
WO2013018294A1 (en) Speech synthesis device and speech synthesis method
JP2010014913A (en) Device and system for conversion of voice quality and for voice generation
KR20080013524A (en) Voice color conversion system using glottal waveform
JP3513071B2 (en) Speech synthesis method and speech synthesis device
KR100754430B1 (en) Voice-based automatic lip-synchronization animation apparatus, Voice-based automatic lip-synchronization animation method, and storage medium
JP4808641B2 (en) Caricature output device and karaoke device
AU2008100836B4 (en) Real-time realistic natural voice(s) for simulated electronic games
JP7069386B1 (en) Audio converters, audio conversion methods, programs, and recording media
Burkhardt et al. How should Pepper sound-Preliminary investigations on robot vocalizations
JP2009216724A (en) Speech creation device and computer program
JP2006030609A (en) Voice synthesis data generating device, voice synthesizing device, voice synthesis data generating program, and voice synthesizing program
JP3785892B2 (en) Speech synthesizer and recording medium
Theobald Audiovisual speech synthesis
Howard The vocal tract organ and the vox humana organ stop

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK21 Patent ceased section 101c(b)/section 143a(c)/reg. 9a.4 - examination under section 101b had not been carried out within the period prescribed
NB Applications allowed - extensions of time section 223(2)

Free format text: THE TIME IN WHICH TO GAIN CERTIFICATION HAS BEEN EXTENDED TO 17 JUL 2009.

FF Certified innovation patent
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry