US20170301340A1 - Method and apparatus for designating a soundalike voice to a target voice from a database of voices - Google Patents
- Publication number
- US20170301340A1 (U.S. application Ser. No. 15/473,103)
- Authority
- US
- United States
- Prior art keywords
- voice
- database
- voices
- soundalike
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/12—Score normalisation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Definitions
- FIG. 5 illustrates a flow diagram of the soundalike system creating a mathematical model for the database at the cluster level and calculating the i-vector of the target voice.
- Group I-Vector 160 selects a single cluster of voices and then selects the MFCCs from all of the voices within the selected cluster.
- The feature vectors, or MFCCs, are combined using any number of mathematical combinations. In the preferred embodiment, Group I-Vector 160 simply creates the Matrix 160a by stacking the vectors, although other combinations such as summation, averages, means, etc. can be applied.
- A universal background model algorithm is applied to the Matrix 160a, and Group I-Vector 160 calculates the i-vector of the selected cluster. The result is the mathematical model of the selected cluster. Group I-Vector 160 repeats this process for each cluster in the Database 125.
- The Target Voice Calculator 170 extracts the MFCCs of the target voice over a plurality of frames, each frame being approximately 20 ms, the length of a phoneme. The MFCCs are calculated over a sliding window equal in length to a single Frame 130a.
- The Target i-Vector 165 is calculated by applying the universal background model to the MFCCs of the Voice 120b. Calculating an i-vector is within the scope of someone skilled in the art of speech synthesis.
- FIG. 6 illustrates a flow diagram of Group Selector 175 determining which group contains the soundalike voice.
- Group Selector 175 calculates the Euclidean distance between the i-vector of each group and the Target i-Vector 165, then selects the group with the smallest Euclidean distance to the Target i-Vector 165.
- Individual I-Vector 180 selects each Voice 120b within Group 175a and calculates the i-vector of each Voice 120b.
- Voice Selector 190 compares the i-vector of each voice in Group 175a with the Target i-Vector 165 and selects the voice with the closest i-vector, i.e. the smallest Euclidean distance to the target voice, as the soundalike voice.
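The two selection stages above reduce to nearest-neighbor search under Euclidean distance: first the nearest group i-vector, then the nearest voice i-vector within that group. A minimal sketch, with i-vectors modeled as plain lists of floats; the dictionary layout and names are illustrative assumptions, not from the patent:

```python
# Sketch of the final selection steps: pick the group whose i-vector is
# closest to the target, then the voice within that group whose i-vector
# is closest.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_soundalike(target_ivec, group_ivecs, voice_ivecs_by_group):
    # Group Selector 175: nearest group i-vector to the target.
    group = min(group_ivecs,
                key=lambda g: euclidean(group_ivecs[g], target_ivec))
    # Voice Selector 190: nearest voice i-vector within the chosen group.
    voices = voice_ivecs_by_group[group]
    return min(voices, key=lambda v: euclidean(voices[v], target_ivec))
```

In a real system the probability-score variant mentioned above could replace the Euclidean distance, but the structure of the search is the same.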
Abstract
Description
- This patent application claims priority from U.S. Provisional Patent Application No. 62/314,759, filed on Mar. 29, 2016 in the U.S. Patent and Trademark Office, the disclosure of which is incorporated herein by reference in its entirety.
- Embodiments herein relate to a method and apparatus for exemplary speech synthesis.
- Typically, speech synthesis is accomplished through the use of a speech synthesizer which generates speech through one or more pre-programmed voices.
- Embodiments of the present application relate to speech synthesis using a voice that is similar to the target speaker's voice.
- Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.
- A text-to-speech system, or engine, is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech.
- To synthesize a target speaker's voice, a TTS must first train on the target voice. The speaker speaks many hours of utterances spanning all the possible language information, e.g. phonemes, diphones, triphones, etc. For optimal training, the speaker reads these utterances from text provided to him or her. The speaker reads an utterance and an Automatic Speech Recognizer (ASR) converts the audio into text. This text is matched with the actual text provided to the speaker, and label matching is done to check the correctness and the quality of the utterance. Further processing is done on the audio to get the right sampling rate and a noise-free audio signal. This is done for all audio and, once ready, the audio is supplied to an algorithm to build a model based on distinctive characteristics (features) such as pitch, vocal tract information, formants, etc. These features are extracted and a mathematical (probabilistic) model is constructed based on well-known algorithms.
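The label-matching check described above can be sketched as a comparison between the prompt text and the ASR transcript. A minimal illustration; the word-accuracy measure and the 0.9 threshold are assumptions made for the sketch, not taken from the patent:

```python
# Sketch of label matching: score how much of the prompt text the ASR
# transcript reproduced, and accept or reject the utterance accordingly.

def word_accuracy(prompt: str, transcript: str) -> float:
    """Fraction of prompt words the transcript reproduced in order."""
    prompt_words = prompt.lower().split()
    transcript_words = transcript.lower().split()
    m, n = len(prompt_words), len(transcript_words)
    # Longest-common-subsequence length via simple dynamic programming.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if prompt_words[i] == transcript_words[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / m if m else 1.0

def utterance_ok(prompt: str, transcript: str, threshold: float = 0.9) -> bool:
    """Accept the recording only if the transcript is close enough."""
    return word_accuracy(prompt, transcript) >= threshold
```

A production pipeline would likely use a proper word-error-rate computation with substitutions, insertions, and deletions, but the accept/reject structure is the same.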
- When an incoming target voice is to be synthesized, the target voice is received by an ASR which outputs text. The text is broken down into units present in the model trained earlier, and the closest unit is obtained along with the audio part of that unit with prosody. This is done for all the units in the input string and, once the audio is attached to each unit, a stitching process is performed to combine these audio parts along with the units into an audio clip, which must sound natural, as if an actual human were talking.
- The problem inherent in training a TTS is that a TTS requires the speaker to spend dozens of hours, if not more, to properly train the TTS. Specifically, the TTS needs enough speech to adequately synthesize the target speaker's voice.
- The solution herein is to select a voice, aka the soundalike voice, from a database of voices, wherein the soundalike voice is substantially similar to the target voice and use the soundalike voice to build the TTS voice, i.e. train the TTS.
- The computer system described herein is optimally configured to determine which voice from the database of voices is the most similar to the target voice.
- The ideal database will have voices in the language of the target speaker; a range of voices (pitch, gender, accent, etc.) is preferable. For example, a database containing a statistically significant distribution of voices is more likely to contain a good match to a speaker with a deep male voice than a database of primarily soprano female voices. This is because the identity of the target speaker is often unknown and thus a wide range of voices is more likely to find a good match. However, even when the identity of the speaker is partially known (e.g. gender), a wide distribution of voices in the database is still optimal. However, on occasion it is preferable to have a database with a narrow distribution of voices. This can occur when the target voice is constrained, e.g. male tenors.
- Optimally a database should contain at least 200 voices, each voice having spoken 200 sentences of 5 to 6 seconds duration per sentence. Thus a database will have 200,000 to 240,000 seconds or approximately 55 to 66 hours of voice data.
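The sizing above is straightforward to verify:

```python
# Database sizing check: 200 voices, 200 sentences each, 5-6 s per sentence.
voices, sentences = 200, 200
low_seconds = voices * sentences * 5    # 200,000 seconds
high_seconds = voices * sentences * 6   # 240,000 seconds
low_hours = low_seconds / 3600          # ~55.6 hours
high_hours = high_seconds / 3600        # ~66.7 hours
```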
- FIG. 1 is a schematic diagram of the soundalike computer system.
- FIG. 2 illustrates a high-level flow diagram of the soundalike selection process.
- FIG. 3 illustrates a flow diagram of training the Database 125.
- FIG. 4 illustrates K-means clustering.
- FIG. 5 illustrates a flow diagram of the soundalike system creating a mathematical model for the database at the cluster level and calculating the i-vector of the target voice.
- FIG. 6 illustrates a flow diagram of Group Selector 175 determining which group contains the soundalike voice.
- FIG. 1 illustrates a block diagram for selecting a voice, from a database of voices, which is substantially similar to a target voice.
- The soundalike system in FIG. 1 may be implemented as a computer system 110: a computer comprising several modules, i.e. computer components embodied as software modules, hardware modules, or a combination of software and hardware modules, whether separate or integrated, working together to form an exemplary computer system. The computer components may be implemented as a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), which performs certain tasks. A unit or module may advantageously be configured to reside on an addressable storage medium and configured to execute on one or more processors or microprocessors. A unit or module may thus include, by way of example, components such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided for in the components and units may be combined into fewer components, units, or modules, or further separated into additional components, units, or modules.
- Input 120 is a module configured to receive the Voice 120b from an Audio Source 120a. The Audio Source 120a may be one of several sources including, but not limited to, a Human 121 speaking, Streamed Speech 122, or, preferentially, the Database 125 containing human speech, aka voices; it may also be a live person speaking into a microphone, synthesized speech, streamed speech, etc.
- DB Trainer 130 is a module configured to train a database by extracting the Mel Frequency Cepstral Coefficients (MFCCs) from the Voice 120b in Audio Source 120a and using the extracted MFCCs to create the DB Model 130a of the database.
- Individual Voice Modeler 140 is a module configured to build a mathematical model of each individual voice obtained from Audio Source 120a.
- Voice Clusterer 150 is a module configured to cluster, i.e. classify, voices from Audio Source 120a into two or more groups, the Groups 150a, by characteristics inherent in each voice, including, but not limited to, gender, pitch, and speed.
- Group I-Vector 160 is a module configured to calculate a single i-vector for each Group 150a.
- Target Voice Calculator 170 is a module configured to calculate the i-vector of the target voice, the Target i-Vector 170a.
- Group Selector 175 is a module configured to select the Group 150a closest to the Target i-Vector 170a, e.g. the group with the smallest Euclidean distance to the Target i-Vector 170a or the highest probability score.
- Individual i-Vector 180 is a module configured to calculate the i-vector of each Voice 180a within the selected Group 150a.
- Voice Selector 190 is a module configured to select the voice with the smallest Euclidean distance between the Target i-Vector 170a and the Voice 180a.
- FIG. 2 illustrates a high-level flow diagram of the soundalike selection process. At step 210, the soundalike system trains the database. At step 220, the soundalike system builds mathematical models of each voice within the database. At step 230, the soundalike system groups the voices, i.e. creates clusters, based on similarities between the voices, e.g. pitch, speed, etc. At step 240, the soundalike system creates mathematical models of each cluster. At step 260, the soundalike system selects the cluster most likely to contain the soundalike voice. At step 270, the soundalike system selects the voice from within the selected cluster that is closest to the target voice.
FIG. 3 illustrates a flow diagram of training theDatabase 125. Atstep 310, theInput 120 received the Voice 120 b fromDatabase 125. TheDatabase 125 should contain enough Voice 120 b to be statistically significant. OptimallyDatabase 125 should contain at least 300 voices, each voice having spoken 300 sentences of 5 to 6 seconds duration per sentence. ThusDatabase 125 will have 300,000 to 340,000 seconds or approximately 55 to 66 hours of voice data. - The
Database 125 needs to be trained. Training a database means building a mathematical model to represent database. In speech synthesis, the ultimate result of training for soundalike is creating i-vectors for the cluster and speaker level. This is a final low dimension representation of a speaker. AtStep 320, theDB Trainer 130 divides the human speech into a plurality of frames, Frames 130 a, each Frame 130 a being generally the length of a single phoneme or 30 milliseconds. Atstep 325,DB Trainer 130 calculates N Mel Frequency Cepstral Coefficients, or MFCCS, for each Frame 130 a which corresponds to the number of features extracted, i.e. the number of features in the target voice such as pitch, speed, etc., which will matched against the voices in theDatabase 125. In the preferred embodiment,DB Trainer 130 calculates 42 MFCCs per Frame 130 a over a sliding window equal which increments by ½ the length of Frame 130 a. - At
step 330, the DB Trainer 130 uses the extracted MFCCs from Database 125 to create UBM 130 b, a universal background model of the Database 125. Creating a universal background model is within the scope of one skilled in the art of speech synthesis. The UBM 130 b results in three matrices: the Weight 135 a, the Means 135 b, and the Variance 135 c. - Subsequent to modeling the
Database 125, each Voice 120 b must be modeled. At step 340, the Individual Voice Modeler 140 builds a mathematical model for each Voice 120 b using a Maximum A Posteriori, or MAP, algorithm, which combines the UBM 130 b with the extracted MFCCs from each Voice 120 b. Building a mathematical model of a single voice using a Maximum A Posteriori algorithm is within the ordinary scope of one skilled in the art of speech synthesis. - In another embodiment,
Individual Voice Modeler 140 creates a mathematical model of each voice directly using the universal background model. Building individual voice mathematical models using the universal background model algorithm is within the scope of one skilled in the art of speech synthesis. -
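The MAP adaptation of step 340 can be illustrated with a minimal sketch under the standard GMM-UBM formulation: the UBM's component means are interpolated toward the statistics of one speaker's MFCC frames. This is a rough illustration, not the patent's implementation; the function name, the diagonal-covariance assumption, and the relevance factor are illustrative.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_vars, features, relevance=16.0):
    """Relevance-MAP adaptation of UBM component means toward one speaker's
    MFCC features (illustrative sketch; diagonal covariances assumed)."""
    C, D = ubm_means.shape
    T = features.shape[0]
    # Log-likelihood of each frame under each diagonal Gaussian component.
    log_lik = np.empty((T, C))
    for c in range(C):
        diff = features - ubm_means[c]
        log_lik[:, c] = (np.log(ubm_weights[c])
                         - 0.5 * np.sum(np.log(2 * np.pi * ubm_vars[c]))
                         - 0.5 * np.sum(diff ** 2 / ubm_vars[c], axis=1))
    # Posterior responsibilities (softmax over components, per frame).
    log_post = log_lik - log_lik.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Zeroth- and first-order Baum-Welch statistics.
    n = post.sum(axis=0)              # (C,) occupancy per component
    f = post.T @ features             # (C, D) weighted feature sums
    # MAP interpolation between speaker statistics and the UBM prior.
    alpha = (n / (n + relevance))[:, None]
    speaker_means = f / np.maximum(n, 1e-10)[:, None]
    return alpha * speaker_means + (1 - alpha) * ubm_means
```

A large relevance factor keeps the adapted model close to the UBM (little speaker data trusted); a small one lets the speaker's own frames dominate.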
FIG. 4 illustrates K-means clustering. Applying a clustering algorithm is within the scope of one skilled in the art of speech synthesis. In the preferred embodiment, the clustering algorithm is a k-means algorithm. K-means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid. K-means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters. - There is no well-defined value for “k”, but experimentally, between 40 and 50 clusters is ideal for a database containing millions of voices.
-
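The alternation described above can be sketched in a few lines. This is a minimal k-means, not the patent's implementation; the `init` and `seed` parameters are illustrative conveniences.

```python
import numpy as np

def kmeans(points, k, iters=100, init=None, seed=0):
    """Plain k-means: alternate (1) assigning points to the nearest centroid
    and (2) recomputing centroids, until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = (np.asarray(init, dtype=float) if init is not None
                 else points[rng.choice(len(points), size=k, replace=False)])
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # (1) assign each point to the nearest current centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (2) recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

With k=2 and two well-separated groups of voice models (analogous to FIG. 4's male/female example), the two recovered clusters correspond to the two groups.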
FIG. 4 illustrates a sample of k=2, i.e. two clusters (e.g. male and female voices). - Once the number of clusters has been determined, the soundalike system builds a cluster model. A cluster model is a mathematical representation of each cluster within the selected database. A cluster model allows all of the voices within the cluster to be represented with a single mathematical model.
-
FIG. 5 illustrates a flow diagram of the soundalike system creating a mathematical model for the database at the cluster level and calculating the i-vector of the target voice. At step 510, Group I-Vector 160 selects a single cluster of voices. At step 520, Group I-Vector 160 selects the MFCCs from all of the voices within the selected cluster. At step 530, the feature vectors, or MFCCs, are combined together using any number of mathematical combinations. In the preferred embodiment, at step 530, Group I-Vector 160 simply creates the Matrix 160 a by stacking the vectors, although other combinations such as summation, averages, means, etc. can be applied. A universal background model algorithm is applied to the Matrix 160 a. At step 540, Group I-Vector 160 calculates the i-vector of the selected cluster. The result is the mathematical model of the selected cluster. Group I-Vector 160 repeats this process for each cluster in Database 125. - At
step 550, the Target Voice Selector 170 extracts the MFCCs of the target voice over a plurality of frames, each frame being approximately 20 milliseconds, the length of a phoneme. In the preferred embodiment, the MFCCs are calculated over a sliding window equal in length to a single Frame 130 a. - At
Step 560, the Target i-Vector 165 is calculated by applying the universal background model to the MFCCs of the target voice. Calculating an i-vector is within the scope of one skilled in the art of speech synthesis. -
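The patent treats i-vector extraction as known art. For concreteness, the classical point estimate is w = (I + TᵗΣ⁻¹N T)⁻¹ TᵗΣ⁻¹F, where N and F are the zeroth- and first-order Baum-Welch statistics of the utterance against the UBM. The sketch below assumes a total-variability matrix T trained in a separate step not shown in the excerpt; all variable names are illustrative.

```python
import numpy as np

def extract_ivector(T_matrix, ubm_means, ubm_vars, post, features):
    """Classical i-vector point estimate w = (I + T' S^-1 N T)^-1 T' S^-1 F,
    given frame-level UBM posteriors `post` (T_frames x C) and a pre-trained
    total-variability matrix T_matrix ((C*D) x R). Illustrative sketch."""
    C, D = ubm_means.shape
    R = T_matrix.shape[1]                                  # i-vector dimensionality
    # Baum-Welch statistics: zeroth order N_c, centered first order F_c.
    n = post.sum(axis=0)                                   # (C,)
    f = post.T @ features - n[:, None] * ubm_means         # (C, D)
    inv_var = 1.0 / ubm_vars                               # diagonal covariances
    # Accumulate (I + T' S^-1 N T) and T' S^-1 F over components.
    A = np.eye(R)
    b = np.zeros(R)
    for c in range(C):
        Tc = T_matrix[c * D:(c + 1) * D]                   # (D, R) block for component c
        A += n[c] * Tc.T @ (inv_var[c][:, None] * Tc)
        b += Tc.T @ (inv_var[c] * f[c])
    return np.linalg.solve(A, b)                           # the i-vector
```

The same routine serves both levels described above: fed the pooled statistics of a cluster it yields a cluster-level i-vector, and fed one utterance's statistics it yields a speaker- or target-level i-vector.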
FIG. 6 illustrates a flow diagram of Group Selector 175 determining which group contains the soundalike voice. At step 610, Group Selector 175 calculates the Euclidean distance between the i-vector of each group and the Target I-Vector 165. At Step 620, Group Selector 175 selects the Group with the lowest Euclidean distance to the Target I-Vector 165. - Once the Group 175 a has been selected, the i-vectors of each individual voice must be calculated.
- At
step 630, Individual I-Vector 180 selects each Voice 120 b within Group 175 a. At step 640, Individual I-Vector 180 calculates the i-vector of each Voice 120 b. - At
step 650, Voice Selector 190 compares the i-vector of each voice in Group 175 a with the Target I-Vector 165 and selects the voice with the closest i-vector as the soundalike voice. In the preferred embodiment of the invention, the soundalike system selects the Voice 120 b with the smallest Euclidean distance to the target voice as the soundalike voice.
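Steps 610 through 650 reduce to two nested nearest-neighbor searches under Euclidean distance, which can be sketched as follows (the data layout and function name are illustrative, not from the patent):

```python
import numpy as np

def pick_soundalike(target_iv, cluster_ivs, voice_ivs_by_cluster):
    """Two-stage selection: first pick the cluster whose i-vector is nearest
    the target (steps 610-620), then pick the nearest voice inside that
    cluster (steps 630-650). Returns (cluster index, voice index)."""
    # Nearest cluster by Euclidean distance between i-vectors.
    cluster_idx = int(np.argmin(np.linalg.norm(cluster_ivs - target_iv, axis=1)))
    # Nearest voice within the chosen cluster.
    voices = voice_ivs_by_cluster[cluster_idx]      # (num_voices, dim)
    voice_idx = int(np.argmin(np.linalg.norm(voices - target_iv, axis=1)))
    return cluster_idx, voice_idx
```

Searching only the winning cluster is what makes the scheme scale: distances are computed against 40 to 50 cluster i-vectors plus the voices of one cluster, rather than against every voice in the database.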
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/473,103 US10311855B2 (en) | 2016-03-29 | 2017-03-29 | Method and apparatus for designating a soundalike voice to a target voice from a database of voices |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662314759P | 2016-03-29 | 2016-03-29 | |
US15/473,103 US10311855B2 (en) | 2016-03-29 | 2017-03-29 | Method and apparatus for designating a soundalike voice to a target voice from a database of voices |
Publications (2)
Publication Number | Publication Date |
---|---|
US20170301340A1 true US20170301340A1 (en) | 2017-10-19 |
US10311855B2 US10311855B2 (en) | 2019-06-04 |
Family
ID=60038499
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10311855B2 (en) * | 2016-03-29 | 2019-06-04 | Speech Morphing Systems, Inc. | Method and apparatus for designating a soundalike voice to a target voice from a database of voices |
US10418024B1 (en) * | 2018-04-17 | 2019-09-17 | Salesforce.Com, Inc. | Systems and methods of speech generation for target user given limited data |
WO2021034786A1 (en) * | 2019-08-21 | 2021-02-25 | Dolby Laboratories Licensing Corporation | Systems and methods for adapting human speaker embeddings in speech synthesis |
US11615777B2 (en) * | 2019-08-09 | 2023-03-28 | Hyperconnect Inc. | Terminal and operating method thereof |
US11887579B1 (en) * | 2022-09-28 | 2024-01-30 | Intuit Inc. | Synthetic utterance generation |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6253179B1 (en) * | 1999-01-29 | 2001-06-26 | International Business Machines Corporation | Method and apparatus for multi-environment speaker verification |
US20030014250A1 (en) * | 1999-01-26 | 2003-01-16 | Homayoon S. M. Beigi | Method and apparatus for speaker recognition using a hierarchical speaker model tree |
US20100114572A1 (en) * | 2007-03-27 | 2010-05-06 | Masahiro Tani | Speaker selecting device, speaker adaptive model creating device, speaker selecting method, speaker selecting program, and speaker adaptive model making program |
US8160877B1 (en) * | 2009-08-06 | 2012-04-17 | Narus, Inc. | Hierarchical real-time speaker recognition for biometric VoIP verification and targeting |
US20140142944A1 (en) * | 2012-11-21 | 2014-05-22 | Verint Systems Ltd. | Diarization Using Acoustic Labeling |
US20140358541A1 (en) * | 2013-05-31 | 2014-12-04 | Nuance Communications, Inc. | Method and Apparatus for Automatic Speaker-Based Speech Clustering |
US20150025887A1 (en) * | 2013-07-17 | 2015-01-22 | Verint Systems Ltd. | Blind Diarization of Recorded Calls with Arbitrary Number of Speakers |
US20150340039A1 (en) * | 2009-11-12 | 2015-11-26 | Agnitio Sl | Speaker recognition from telephone calls |
US9336782B1 (en) * | 2015-06-29 | 2016-05-10 | Vocalid, Inc. | Distributed collection and processing of voice bank data |
US20170076727A1 (en) * | 2015-09-15 | 2017-03-16 | Kabushiki Kaisha Toshiba | Speech processing device, speech processing method, and computer program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10311855B2 (en) | Method and apparatus for designating a soundalike voice to a target voice from a database of voices | |
US11373633B2 (en) | Text-to-speech processing using input voice characteristic data | |
US11062694B2 (en) | Text-to-speech processing with emphasized output audio | |
Nishimura et al. | Singing Voice Synthesis Based on Deep Neural Networks. | |
US5745873A (en) | Speech recognition using final decision based on tentative decisions | |
US10276149B1 (en) | Dynamic text-to-speech output | |
Stolcke et al. | Highly accurate phonetic segmentation using boundary correction models and system fusion | |
US20160379638A1 (en) | Input speech quality matching | |
US20180247640A1 (en) | Method and apparatus for an exemplary automatic speech recognition system | |
US10497362B2 (en) | System and method for outlier identification to remove poor alignments in speech synthesis | |
US20050180547A1 (en) | Automatic identification of telephone callers based on voice characteristics | |
US10068565B2 (en) | Method and apparatus for an exemplary automatic speech recognition system | |
US10008216B2 (en) | Method and apparatus for exemplary morphing computer system background | |
US20070294082A1 (en) | Voice Recognition Method and System Adapted to the Characteristics of Non-Native Speakers | |
Chang et al. | An elitist approach to articulatory-acoustic feature classification | |
Aihara et al. | Exemplar-based emotional voice conversion using non-negative matrix factorization | |
JP2001166789A (en) | Method and device for voice recognition of chinese using phoneme similarity vector at beginning or end | |
Dhanalakshmi et al. | Intelligibility modification of dysarthric speech using HMM-based adaptive synthesis system | |
US11929058B2 (en) | Systems and methods for adapting human speaker embeddings in speech synthesis | |
Mullah et al. | Development of an HMM-based speech synthesis system for Indian English language | |
Ezzine et al. | Moroccan dialect speech recognition system based on cmu sphinxtools | |
Bunnell et al. | The ModelTalker system | |
Li et al. | Improving mandarin tone mispronunciation detection for non-native learners with soft-target tone labels and blstm-based deep models | |
He et al. | Fast model selection based speaker adaptation for nonnative speech | |
Wang et al. | An experimental analysis on integrating multi-stream spectro-temporal, cepstral and pitch information for mandarin speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: SPEECH MORPHING SYSTEMS, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:YASSA, FATHY;REAVES, BENJAMIN;MOHAN, SANDEEP;SIGNING DATES FROM 20170630 TO 20170703;REEL/FRAME:042932/0396 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FEPP | Fee payment procedure |
Free format text: SURCHARGE FOR LATE PAYMENT, SMALL ENTITY (ORIGINAL EVENT CODE: M2554); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2551); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY Year of fee payment: 4 |