CN112599113B - Dialect voice synthesis method, device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN112599113B
CN112599113B (application CN202011611428.1A)
Authority
CN
China
Prior art keywords
pronunciation
dialect
vector
determining
input text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011611428.1A
Other languages
Chinese (zh)
Other versions
CN112599113A (en)
Inventor
梁光
舒景辰
吴雨璇
杨惠
周鼎皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd filed Critical Beijing Dami Technology Co Ltd
Priority to CN202011611428.1A priority Critical patent/CN112599113B/en
Publication of CN112599113A publication Critical patent/CN112599113A/en
Application granted granted Critical
Publication of CN112599113B publication Critical patent/CN112599113B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335 Pitch control
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 Detection of language
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L2013/105 Duration
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/90 Pitch determination of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiments of the invention provide a dialect speech synthesis method and apparatus, an electronic device, and a readable storage medium, relating to the field of computer technology.

Description

Dialect voice synthesis method, device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a dialect speech synthesis method, a dialect speech synthesis apparatus, an electronic device, and a readable storage medium.
Background
At present, machine-synthesized speech is used in a variety of scenarios, such as online education and video dubbing and narration; it saves labor costs and adds interest.
However, current machine-synthesized speech sounds too stiff, so its similarity to a human voice is low.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a dialect speech synthesis method, apparatus, electronic device, and readable storage medium capable of synthesizing speech with a high similarity to a human voice.
In a first aspect, there is provided a dialect speech synthesis method, the method being applied to an electronic device, the method comprising:
acquiring an input text;
determining a pronunciation vector of at least one word in the input text, wherein the pronunciation vector at least comprises prosodic information of the corresponding word;
determining the pronunciation duration and dialect tone corresponding to each pronunciation vector, wherein the pronunciation duration characterizes how long the pronunciation lasts and the dialect tone characterizes the pitch of the pronunciation; and
synthesizing the speech corresponding to the input text based on the pronunciation vector, the pronunciation duration and the dialect tone.
Optionally, the determining the pronunciation vector of at least one word in the input text includes:
vectorizing at least one word in the input text and determining the pronunciation vector of the at least one word in the input text.
Optionally, the determining the pronunciation vector of at least one word in the input text includes:
determining pinyin information of at least one word in the input text based on a preset correspondence between characters and pinyin; and
vectorizing the pinyin information and determining the pronunciation vector of the pinyin information.
Optionally, the determining the pronunciation duration corresponding to each pronunciation vector includes:
taking each pronunciation vector as input to a pre-trained pronunciation duration prediction model and obtaining the pronunciation duration output by the model for each pronunciation vector.
Optionally, the determining the dialect tone corresponding to each pronunciation vector includes:
taking each pronunciation vector as input to a pre-trained dialect tone prediction model and obtaining the dialect tone output by the model for each pronunciation vector, wherein the dialect tone prediction model is pre-trained at least on training samples with dialect tone labels.
Optionally, the synthesizing the speech corresponding to the input text based on the pronunciation vector, the pronunciation duration and the dialect tone includes:
taking the pronunciation vector, the pronunciation duration and the dialect tone as inputs to a pre-trained speech synthesis model to obtain the synthesized spectrum output by the speech synthesis model; and
determining the synthesized speech corresponding to the input text through a vocoder and the synthesized spectrum.
In a second aspect, there is provided a dialect speech synthesis apparatus, the apparatus being applied to an electronic device, the apparatus comprising:
the acquisition module is used for acquiring an input text;
a first determining module, configured to determine a pronunciation vector of at least one word in the input text, where the pronunciation vector includes at least prosodic information of a corresponding word;
the second determining module is used for determining the pronunciation duration and dialect tone corresponding to each pronunciation vector, wherein the pronunciation duration characterizes how long the pronunciation lasts and the dialect tone characterizes the pitch of the pronunciation; and
the synthesis module is used for synthesizing the speech corresponding to the input text based on the pronunciation vector, the pronunciation duration and the dialect tone.
Optionally, the first determining module is specifically configured to:
vectorize at least one word in the input text and determine the pronunciation vector of the at least one word in the input text.
Optionally, the first determining module is specifically configured to:
determine pinyin information of at least one word in the input text based on a preset correspondence between characters and pinyin; and
vectorize the pinyin information and determine the pronunciation vector of the pinyin information.
Optionally, the second determining module is specifically configured to:
take each pronunciation vector as input to a pre-trained pronunciation duration prediction model and obtain the pronunciation duration output by the model for each pronunciation vector.
Optionally, the second determining module is specifically configured to:
take each pronunciation vector as input to a pre-trained dialect tone prediction model and obtain the dialect tone output by the model for each pronunciation vector, wherein the dialect tone prediction model is pre-trained at least on training samples with dialect tone labels.
Optionally, the synthesis module is specifically configured to:
take the pronunciation vector, the pronunciation duration and the dialect tone as inputs to a pre-trained speech synthesis model to obtain the synthesized spectrum output by the speech synthesis model; and
determine the synthesized speech corresponding to the input text through a vocoder and the synthesized spectrum.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, the memory storing one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method as described in the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement a method according to the first aspect.
According to the embodiments of the invention, the prosodic information in the pronunciation vectors and the pronunciation duration of each vector give the synthesized speech common human speech habits such as pauses and lengthened sounds; adding the dialect tone unique to a dialect (i.e., the dialect's characteristic pronunciation) then brings the synthesized speech still closer to the way humans speak. As a result, the synthesized speech determined from the pronunciation vectors, pronunciation durations, and dialect tones has a high similarity to a human voice.
Drawings
The above and other objects, features and advantages of embodiments of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a dialect speech synthesis method provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a process for determining synthesized speech according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another process for determining synthesized speech according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a pronunciation duration prediction model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a feedforward network module according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a length adjuster according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a phoneme duration predictor according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a dialect pitch prediction process according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a pitch predictor provided by an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a dialect speech synthesis apparatus according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention is described below based on embodiments, but it is not limited to these embodiments. In the following detailed description, certain specific details are set forth; those skilled in the art will, however, fully understand the invention without some of these details. Well-known methods, procedures, flows, components, and circuits are not described in detail so as not to obscure the essence of the invention.
Moreover, those of ordinary skill in the art will appreciate that the drawings are provided herein for illustrative purposes and that the drawings are not necessarily drawn to scale.
Unless the context clearly requires otherwise, the words "comprise," "comprising," and the like in the description are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, it is the meaning of "including but not limited to".
In the description of the present invention, it should be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
At present, machine-synthesized speech is used in a variety of scenarios, such as online education, video dubbing, and narration. In an online education scenario in particular, an online education platform can construct a virtual character and configure machine-synthesized speech for it; the platform can then display the voiced virtual character on the student-side display interface to implement functions such as machine roll call, saving labor costs and adding interest.
Similarly, machine-synthesized speech can be applied to video dubbing, narration, and similar functions, which the embodiments of the present invention do not describe in detail here.
However, in the related art the pronunciation of synthesized speech is too stiff, so it sounds noticeably different from speech uttered by a person, which may degrade the user experience.
To make machine-synthesized speech more similar to a real person's voice, an embodiment of the invention provides a dialect speech synthesis method that can be applied to an electronic device. The electronic device may be a smartphone, a tablet computer, or a personal computer (PC); it may also be a single server, a server cluster deployed in a distributed manner, or a cloud server.
Specifically, as shown in fig. 1, the method may include the following steps:
in step 100, input text is obtained.
In step 200, a pronunciation vector of at least one word in the input text is determined.
The pronunciation vector comprises at least prosodic information of the corresponding word; specifically, the prosodic information can characterize the rhythm of the pronunciation. Optionally, in embodiments of the invention, speech with prosody can be synthesized by adding pauses after words.
In step 300, the pronunciation duration and dialect tone corresponding to each pronunciation vector are determined.
The pronunciation duration characterizes how long the pronunciation lasts, and the dialect tone characterizes the pitch of the pronunciation.
In step 400, the synthesized speech corresponding to the input text is synthesized based on the pronunciation vectors, pronunciation durations, and dialect tones.
According to the embodiments of the invention, the prosodic information in the pronunciation vectors and the pronunciation duration of each vector give the synthesized speech common human speech habits such as pauses and lengthened sounds; adding the dialect tone unique to a dialect (i.e., the dialect's characteristic pronunciation) then brings the synthesized speech still closer to the way humans speak. As a result, the synthesized speech determined from the pronunciation vectors, pronunciation durations, and dialect tones has a high similarity to a human voice.
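To make the overall flow concrete, the following minimal Python (PyTorch-style) sketch strings steps 100 to 400 together. All component names (frontend, duration_model, pitch_model, acoustic_model, vocoder) are hypothetical stand-ins for the models described below, not APIs defined by the patent:

```python
import torch

def synthesize_dialect_speech(text: str, frontend, duration_model,
                              pitch_model, acoustic_model, vocoder) -> torch.Tensor:
    # Step 200: pronunciation vectors carrying prosodic (pause) information.
    pron_vectors = frontend(text)                        # (T, hidden_dim)
    # Step 300: pronunciation duration and dialect tone per vector.
    durations = duration_model(pron_vectors)             # (T,) frames per token
    dialect_tone = pitch_model(pron_vectors)             # (T,) pitch per token
    # Step 400: synthesized spectrum from the three features, then a waveform.
    mel = acoustic_model(pron_vectors, durations, dialect_tone)  # (frames, n_mels)
    return vocoder(mel)                                  # raw audio samples
```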
In the embodiment of the invention, the electronic device can acquire a piece of input text and then determine the synthesized speech corresponding to it. The input text may be manually entered text or text recognized by a preset speech recognition algorithm.
For example, in an online education scenario the electronic device may be an online education platform whose database pre-stores student rosters entered by staff. When an online class starts, the platform can fetch the relevant part of the roster (the student list for that class) from the database and use it as the input text from which the synthesized speech is determined.
After the electronic device obtains the input text, a pronunciation vector for at least one word in the input text may be determined.
In one embodiment, step 200 may be performed as: vectorizing at least one word in the input text and determining the pronunciation vector of the at least one word.
In practical applications, at least one word in the input text may first be embedded (Embedding), and the embedded vector is then used as the pronunciation vector.
Embedding maps high-dimensional raw data (images, text, etc.) onto a low-dimensional manifold (Manifold) so that the data become separable after the mapping; this mapping process is called embedding, for example word embedding, or embedding a sentence composed of words into a representation vector. In the embodiment of the invention, the object of the embedding is a word in the input text.
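As an illustration of this embedding step, here is a minimal PyTorch sketch; the toy vocabulary and the embedding dimension of 256 are assumptions for illustration, not values from the patent:

```python
import torch
import torch.nn as nn

# Toy character vocabulary; a real system builds this from the training corpus.
vocab = {"<pad>": 0, "你": 1, "今": 2, "天": 3, "吃": 4, "饭": 5, "了": 6, "吗": 7}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=256)

ids = torch.tensor([[vocab[ch] for ch in "你今天吃饭了吗"]])  # shape (1, 7)
pron_vectors = embedding(ids)                                 # shape (1, 7, 256)
```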
In addition, the pronunciation vector contains prosodic information of the corresponding word. In the embodiment of the invention, a blank sound (i.e., a pause) of predetermined duration can be added after the word to which a pronunciation vector corresponds, so as to synthesize speech with prosody, i.e., speech that is closer to human speech.
For example, suppose the input text is "你今天吃饭了吗" ("Have you eaten today?"). If there were no pauses between the words of the corresponding synthesized speech, the speech would sound quite stiff.
Further, as shown in fig. 2, fig. 2 is a schematic diagram of a process for determining synthesized speech according to an embodiment of the present invention; the diagram includes input text a and synthesized speech b.
In synthesizing speech for input text a, i.e., "你今天吃饭了吗", the embodiment of the invention can first determine, for each word in input text a, a pronunciation vector carrying prosodic information; then determine the pronunciation duration and dialect tone corresponding to each pronunciation vector; and finally perform speech synthesis based on the pronunciation vectors, pronunciation durations, and dialect tones to obtain synthesized speech b.
In synthesized speech b, the characters "你", "天", "饭", and "吗" are followed by blank sounds of predetermined duration, while "今", "吃", and "了" are not. By adding blank sounds of predetermined duration in this way, synthesized speech b contains several continuously pronounced short texts (the underlined texts in fig. 2) with a pause between them.
That is, in synthesized speech b the short texts are "你" ("you"), "今天" ("today"), "吃饭" ("eaten"), and "了吗" (the question particle). Blank sounds of predetermined duration are added between these four short texts and after the last one ("了吗"), so the pronunciation of synthesized speech b has prosody and is closer to human speech.
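A minimal sketch of this pause insertion, assuming a hypothetical blank-sound token with id 0 and illustrative short-text boundaries:

```python
PAUSE_ID = 0  # assumed id of a blank sound of predetermined duration

def add_pauses(token_ids, boundary_positions):
    """Insert PAUSE_ID after every index listed in boundary_positions."""
    out = []
    for i, tok in enumerate(token_ids):
        out.append(tok)
        if i in boundary_positions:
            out.append(PAUSE_ID)
    return out

# "你 | 今天 | 吃饭 | 了吗": the short texts end at indices 0, 2, 4 and 6.
print(add_pauses([1, 2, 3, 4, 5, 6, 7], {0, 2, 4, 6}))
# [1, 0, 2, 3, 0, 4, 5, 0, 6, 7, 0]
```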
In another embodiment, step 200 may also be performed as: determining pinyin information of at least one word in the input text based on a preset correspondence between characters and pinyin, and vectorizing the pinyin information to determine its pronunciation vector.
Specifically, in the embodiment of the invention the correspondence between characters and pinyin can be preset based on tools such as a dictionary. After the input text is received, the pinyin of each character in it can be determined; each character's pinyin is then embedded separately to obtain a feature vector for each pinyin, and that feature vector serves as the pronunciation vector of the corresponding character.
For example, as shown in fig. 3, fig. 3 is a schematic diagram of another process for determining synthesized speech according to an embodiment of the present invention; the diagram includes input text a, synthesized speech b, and pinyin text c.
In synthesizing speech for input text a, "你今天吃饭了吗", the embodiment of the invention can first determine the pinyin of each word in input text a based on the preset correspondence, obtaining pinyin text c, in which each pinyin corresponds to the pronunciation of one word in input text a.
Then, from each pinyin in pinyin text c, the embodiment of the invention can determine the pronunciation vector carrying prosodic information that corresponds to that pinyin, determine the pronunciation duration and dialect tone from the pronunciation vectors, and perform speech synthesis based on the pronunciation vectors, durations, and tones to obtain synthesized speech b.
Because the correspondence between characters and pinyin is established from tools such as a dictionary, the pronunciation vectors determined from pinyin represent the characters' pronunciations more accurately, which makes the pronunciation of the synthesized speech more accurate.
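A minimal sketch of the pinyin variant, assuming a tiny hand-written character-to-pinyin table (derived from a dictionary in practice) and an illustrative embedding size:

```python
import torch
import torch.nn as nn

# Preset character-to-pinyin correspondence; the tone-numbered spellings and
# this tiny table are illustrative, not data from the patent.
char2pinyin = {"你": "ni3", "今": "jin1", "天": "tian1",
               "吃": "chi1", "饭": "fan4", "了": "le5", "吗": "ma5"}
pinyin_vocab = {p: i for i, p in enumerate(sorted(set(char2pinyin.values())))}
embedding = nn.Embedding(len(pinyin_vocab), 256)

text = "你今天吃饭了吗"
pinyin_ids = torch.tensor([[pinyin_vocab[char2pinyin[ch]] for ch in text]])
pron_vectors = embedding(pinyin_ids)  # (1, 7, 256): one vector per character
```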
It should also be noted that after the electronic device determines the pronunciation vectors, it can determine the pronunciation duration and dialect tone corresponding to each pronunciation vector.
Specifically, determining the pronunciation duration may be performed as: taking each pronunciation vector as input to a pre-trained pronunciation duration prediction model and obtaining the pronunciation duration the model outputs for each vector.
In one implementation, as shown in fig. 4, fig. 4 is a schematic diagram of a pronunciation duration prediction model according to an embodiment of the present invention; the diagram includes the pronunciation duration prediction model 41, its input (the pronunciation vectors), its output (the pronunciation durations), and the position encoding.
The pronunciation vectors are those determined in step 200 for at least one word of the input text, and the position encoding represents the position in the input text of the word to which each pronunciation vector corresponds. The pronunciation duration prediction model 41 comprises an N-layer feed-forward network module (Feed-Forward Transformer Block) 411, a length regulator (Length Regulator) 412, an N-layer feed-forward network module 413, and a linear layer (Linear Layer) 414.
In the embodiment of the invention, the pronunciation vectors, as the input of the pronunciation duration prediction model 41, are first summed with the position encoding and then fed into the N-layer feed-forward network module 411.
The feed-forward network module processes its input with an attention mechanism. Specifically, as shown in fig. 5, the feed-forward network module 51 comprises a multi-head attention module (multi-head attention) 511, a sum-and-normalize module (Add & Norm) 512, a one-dimensional convolutional network (Conv1D) 513, and another sum-and-normalize module 514. Multi-head attention maps a query and a set of key-value pairs to an output, where queries, keys, values, and outputs are all vectors; the output is computed as a weighted sum of the values, with the weight of each value given by a compatibility function of the query and the corresponding key. Add & Norm adds the input and output of the previous layer and normalizes the sum, and Conv1D performs the one-dimensional convolution operation.
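For illustration, a minimal PyTorch sketch of one such feed-forward network module follows; the hidden size, head count, and kernel width are assumptions:

```python
import torch
import torch.nn as nn

class FeedForwardTransformerBlock(nn.Module):
    """Multi-head attention + Add & Norm + Conv1D + Add & Norm, as in fig. 5."""
    def __init__(self, d_model=256, n_heads=2, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, T, d_model)
        attn_out, _ = self.attn(x, x, x)        # multi-head self-attention
        x = self.norm1(x + attn_out)            # Add & Norm
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1D
        return self.norm2(x + conv_out)         # Add & Norm
```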
The length regulator 412 resolves the mismatch in length between the phoneme sequence and the spectrogram sequence in the feed-forward network module 411. Specifically, as shown in fig. 6, the length regulator 61 comprises phoneme a, phoneme b, phoneme c, phoneme d, a phoneme duration predictor 611, and a mel-spectrogram sequence length regulation unit (LR).
The input of the length regulator 61 is the phonemes (phoneme a, phoneme b, phoneme c, and phoneme d), each of which initially has the same fixed pronunciation duration.
The length regulator 61 then feeds each phoneme into the phoneme duration predictor 611, which predicts a duration for each phoneme, namely D (duration) in fig. 6. As shown there, D = [2, 2, 3, 1]: the values correspond in order to phoneme a, phoneme b, phoneme c, and phoneme d; each value is the duration of the corresponding phoneme, i.e., the factor by which that phoneme is to be extended.
Specifically, regarding the phoneme duration predictor 611: as shown in fig. 7, fig. 7 is a schematic diagram of the phoneme duration predictor 71 according to an embodiment of the present invention, showing both the inference workflow and the training flow of the phoneme duration predictor 71.
The phoneme duration predictor 71 comprises a one-dimensional convolution plus normalization layer (Conv1D + Norm) 711, a second one-dimensional convolution plus normalization layer 712, and a linear output layer (Linear Layer) 713. When the phoneme duration predictor 71 receives an input phoneme, it determines the corresponding duration through the one-dimensional convolution operations and the linear operation.
In one implementation, a pre-trained autoregressive model (autoregressive transformer text-to-speech, autoregressive transformer TTS) 714 can be used as a teacher model. During training, phonemes are input to the teacher model to obtain the speech it outputs, from which a duration extractor 715 obtains the duration A corresponding to each phoneme. When the phoneme duration predictor being trained outputs a duration B, duration A serves as the label for duration B, and the phoneme duration predictor 71 is updated by back-propagation through a loss function 716 until its parameters converge. The loss function 716 may be a root-mean-square error function or another suitable loss function; the embodiment of the present invention does not limit this.
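A minimal PyTorch sketch of the phoneme duration predictor and one distillation-style training step follows; the layer sizes are assumptions, and mean squared error stands in for whichever regression loss is chosen:

```python
import torch
import torch.nn as nn

class PhonemeDurationPredictor(nn.Module):
    """Two Conv1D + Norm blocks followed by a linear output layer, as in fig. 7."""
    def __init__(self, d_model=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.linear = nn.Linear(d_model, 1)

    def forward(self, x):                       # x: (batch, T, d_model)
        x = self.norm1(torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2))
        x = self.norm2(torch.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2))
        return self.linear(x).squeeze(-1)       # (batch, T) predicted durations

# Durations A extracted from the teacher model label the predictor's output B.
predictor = PhonemeDurationPredictor()
hidden = torch.randn(1, 4, 256)                 # phonemes a, b, c, d
duration_a = torch.tensor([[2., 2., 3., 1.]])   # from the duration extractor
loss = nn.functional.mse_loss(predictor(hidden), duration_a)
loss.backward()                                 # one back-propagation step
```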
In fig. 6, after the phoneme duration predictor 611 outputs D, the length regulator 61 determines the mel-spectrogram sequence with the length regulation unit, combining D, the hyperparameter α, and phonemes a, b, c, and d. Here α controls the overall length of the mel-spectrogram sequence and thereby the speaking rate: for example, α = 1 represents normal speed, α = 1.3 a slower speed, and α = 0.5 a faster speed.
In summary, in fig. 6 the length regulator 61 can determine, from the input phonemes, mel-spectrogram sequences in which the individual phonemes have different lengths.
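A minimal sketch of the length regulation unit, reusing the D = [2, 2, 3, 1] example above; rounding α-scaled durations and keeping a minimum of one frame are assumptions about how fractional values are handled:

```python
import torch

def length_regulate(hidden, durations, alpha=1.0):
    """Repeat each phoneme's hidden state round(d * alpha) times (LR in fig. 6)."""
    scaled = torch.clamp(torch.round(durations.float() * alpha).long(), min=1)
    return torch.repeat_interleave(hidden, scaled, dim=0)

hidden = torch.randn(4, 256)             # phonemes a, b, c, d
d = torch.tensor([2, 2, 3, 1])           # D from the phoneme duration predictor
print(length_regulate(hidden, d).shape)          # torch.Size([8, 256])
print(length_regulate(hidden, d, 1.3).shape)     # torch.Size([11, 256]), slower
```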
Further, as shown in fig. 4, after the length regulator 412 outputs the mel-spectrogram sequences with differing phoneme lengths, the pronunciation duration prediction model 41 finally outputs the pronunciation duration corresponding to each pronunciation vector after computation through the position encoding, the N-layer feed-forward network module 413, and the linear layer 414.
On the other hand, the electronic device can also determine the dialect tone from the pronunciation vectors. Specifically, determining the dialect tone may be performed as: taking each pronunciation vector as input to a pre-trained dialect tone prediction model and obtaining the dialect tone the model outputs for each vector.
The dialect tone prediction model is pre-trained at least on training samples with dialect tone labels.
As shown in fig. 8, fig. 8 is a schematic diagram of the dialect tone prediction process according to an embodiment of the present invention; the diagram includes the dialect tone prediction model 81, the pronunciation vectors, and the dialect tones.
The dialect tone prediction model 81 comprises an N-layer feed-forward network module 811, a pitch predictor (pitch predictor) 812, a repeat layer (repeat) 813, an N-layer feed-forward network module 814, and a fully connected layer (Fully connected layer, FC layer) 815.
The pitch predictor 812 predicts the pitch corresponding to a pronunciation vector. In the embodiment of the invention, the dialect tone prediction model 81 can be pre-trained with dialect-tone-labeled training samples so that the pitch predictor 812 accurately predicts the pitch corresponding to each pronunciation vector.
In practical applications, the pronunciation of a dialect differs from that of Mandarin; that is, changing the way Mandarin is pronounced can give Mandarin speech a dialect flavor.
As shown in fig. 9, fig. 9 is a schematic diagram of the pitch predictor according to an embodiment of the present invention; the diagram includes a one-dimensional convolutional network 911, a one-dimensional convolutional network 912, a fully connected layer 913, a one-dimensional convolutional network 914, and a loss function 915.
The pitch predictor receives the output of the feed-forward network module as its input. The input passes through the one-dimensional convolutional network 911, the one-dimensional convolutional network 912, the fully connected layer 913, and the one-dimensional convolutional network 914 in turn, and the output of network 914 is then summed with the pitch predictor's input to determine the pitch corresponding to the pronunciation vector.
During training (the dashed lines in fig. 9), the model parameters of each layer of the pitch predictor can be adjusted based on the output of the fully connected layer 913, preset labels, and a preset loss function 915 until the model parameters of every layer of the pitch predictor converge.
Further, after the pitch predictor outputs the pitch, the dialect tone prediction model processes that pitch through the repeat layer, the feed-forward network module, and the fully connected layer, thereby determining the dialect tone (i.e., the dialect pitch corresponding to the pronunciation vector).
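A minimal PyTorch sketch of the pitch predictor's inference path in fig. 9 follows (the training branch through the loss function is omitted); the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class PitchPredictor(nn.Module):
    """Conv1D x2, a fully connected layer, Conv1D, then a residual sum with
    the predictor's own input, following fig. 9."""
    def __init__(self, d_model=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad)
        self.fc = nn.Linear(d_model, d_model)
        self.conv3 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad)

    def forward(self, x):                        # x: (batch, T, d_model)
        h = torch.relu(self.conv1(x.transpose(1, 2)))
        h = torch.relu(self.conv2(h)).transpose(1, 2)
        h = self.fc(h)
        h = self.conv3(h.transpose(1, 2)).transpose(1, 2)
        return x + h                             # sum with the predictor's input
```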
Taken together with figs. 8 and 9, the dialect tone prediction model essentially predicts the dialect pronunciation of the word corresponding to a pronunciation vector, and that dialect pronunciation is characterized by the pitch unique to the dialect. In other words, through the dialect tone prediction model the embodiment of the invention can determine the dialect version of the pronunciation of every word in the input text, so that the final synthesized speech better matches the way humans commonly speak.
After the electronic device determines the pronunciation vectors, pronunciation durations, and dialect tones, it can determine the synthesized speech from them, which may be specifically performed as: taking the pronunciation vectors, pronunciation durations, and dialect tones as inputs to a pre-trained speech synthesis model to obtain the synthesized spectrum the model outputs, and then determining the synthesized speech corresponding to the input text through a vocoder and the synthesized spectrum.
The speech synthesis model may be a spectrogram prediction network (Tacotron2); Tacotron2 is an end-to-end speech synthesis model based on deep learning, has good synthesis capability, and can be used to synthesize the spectrum.
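As a sketch of this back end, the snippet below inverts a mel spectrogram to a waveform with librosa's Griffin-Lim-based mel inversion, standing in for the vocoder; the spectrogram here is random dummy data rather than the output of a real synthesis model:

```python
import numpy as np
import librosa

def mel_to_waveform(mel: np.ndarray, sr: int = 22050) -> np.ndarray:
    # mel: (n_mels, frames) power mel spectrogram from the synthesis model.
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)

mel = np.abs(np.random.randn(80, 200)).astype(np.float32)  # dummy spectrum
audio = mel_to_waveform(mel)
print(audio.shape)  # roughly frames * hop_length samples
```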
In the embodiment of the invention, combining the pronunciation vectors, pronunciation durations, and dialect tones makes it possible to synthesize highly realistic dialect speech, i.e., synthesized speech that differs little from speech uttered by a real person, thereby improving the listening experience.
Based on the same technical concept, an embodiment of the present invention also provides a dialect speech synthesis apparatus. As shown in fig. 10, the apparatus comprises an acquisition module 101, a first determining module 102, a second determining module 103, and a synthesis module 104.
The obtaining module 101 is configured to obtain an input text.
The first determining module 102 is configured to determine a pronunciation vector of at least one word in the input text, where the pronunciation vector includes at least prosodic information of the corresponding word.
The second determining module 103 is configured to determine the pronunciation duration and dialect tone corresponding to each pronunciation vector, where the pronunciation duration characterizes how long the pronunciation lasts and the dialect tone characterizes the pitch of the pronunciation.
The synthesis module 104 is configured to synthesize the speech corresponding to the input text based on the pronunciation vectors, pronunciation durations, and dialect tones.
Optionally, the first determining module 102 is specifically configured to: vectorize at least one word in the input text and determine the pronunciation vector of the at least one word.
Optionally, the first determining module 102 is specifically configured to: determine pinyin information of at least one word in the input text based on a preset correspondence between characters and pinyin, vectorize the pinyin information, and determine the pronunciation vector of the pinyin information.
Optionally, the second determining module 103 is specifically configured to: take each pronunciation vector as input to a pre-trained pronunciation duration prediction model and obtain the pronunciation duration output by the model for each pronunciation vector.
Optionally, the second determining module 103 is specifically configured to: take each pronunciation vector as input to a pre-trained dialect tone prediction model and obtain the dialect tone output by the model for each pronunciation vector; the dialect tone prediction model is pre-trained at least on training samples with dialect tone labels.
Optionally, the synthesis module 104 is specifically configured to: take the pronunciation vectors, pronunciation durations, and dialect tones as inputs to a pre-trained speech synthesis model, obtain the synthesized spectrum output by the model, and determine the synthesized speech corresponding to the input text through a vocoder and the synthesized spectrum.
According to the embodiments of the invention, the prosodic information in the pronunciation vectors and the pronunciation duration of each vector give the synthesized speech common human speech habits such as pauses and lengthened sounds; adding the dialect tone unique to a dialect (i.e., the dialect's characteristic pronunciation) then brings the synthesized speech still closer to the way humans speak. As a result, the synthesized speech determined from the pronunciation vectors, pronunciation durations, and dialect tones has a high similarity to a human voice.
Fig. 11 is a schematic diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 11, the electronic device is a general-purpose computing device whose hardware structure includes at least a processor 111 and a memory 112, connected by a bus 113. The memory 112 stores instructions or programs executable by the processor 111. The processor 111 may be a standalone microprocessor or a set of one or more microprocessors; it processes data and controls other devices by executing the instructions stored in the memory 112, thereby executing the method flows of the embodiments of the present invention described above. The bus 113 connects these components together and connects them to a display controller 114, a display device, and input/output (I/O) devices 115. The input/output (I/O) devices 115 may be a mouse, keyboard, modem, network interface, touch input device, motion-sensing input device, printer, or other devices known in the art. Typically, the input/output devices 115 are connected to the system through an input/output (I/O) controller 116.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus (device) or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each of the flows in the flowchart may be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the present invention is directed to a non-volatile storage medium storing a computer readable program for causing a computer to perform some or all of the method embodiments described above.
That is, those skilled in the art will understand that all or part of the steps of the above method embodiments may be completed by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (12)

1. A method of dialect speech synthesis, the method comprising:
acquiring an input text;
determining a pronunciation vector of at least one word in the input text, wherein the pronunciation vector at least comprises prosodic information of the corresponding word;
determining the pronunciation duration and dialect tone corresponding to each pronunciation vector, wherein the pronunciation duration characterizes how long the pronunciation lasts and the dialect tone characterizes the pitch of the pronunciation; and
synthesizing synthesized speech corresponding to the input text based on the pronunciation vector, the pronunciation duration and the dialect tone;
the determining the pronunciation duration corresponding to each pronunciation vector includes:
taking each pronunciation vector as input to a pre-trained pronunciation duration prediction model and obtaining the pronunciation duration output by the model for each pronunciation vector.
2. The method of claim 1, wherein said determining a pronunciation vector for at least one word in said input text comprises:
vectorizing at least one word in the input text and determining the pronunciation vector of the at least one word in the input text.
3. The method of claim 1, wherein said determining a pronunciation vector for at least one word in said input text comprises:
determining pinyin information of at least one word in the input text based on a preset correspondence between characters and pinyin; and
vectorizing the pinyin information and determining the pronunciation vector of the pinyin information.
4. The method of claim 1, wherein said determining the dialect tone for each of said pronunciation vectors comprises:
taking each pronunciation vector as input to a pre-trained dialect tone prediction model and obtaining the dialect tone output by the model for each pronunciation vector, wherein the dialect tone prediction model is pre-trained at least on training samples with dialect tone labels.
5. The method of any of claims 1-4, wherein the synthesizing the synthesized speech corresponding to the input text based on the pronunciation vector, the pronunciation duration, and the dialect tone comprises:
taking the pronunciation vector, the pronunciation duration and the dialect tone as inputs to a pre-trained speech synthesis model to obtain the synthesized spectrum output by the speech synthesis model; and
determining the synthesized speech corresponding to the input text through a vocoder and the synthesized spectrum.
6. A dialect speech synthesis apparatus, the apparatus comprising:
the acquisition module is used for acquiring an input text;
a first determining module, configured to determine a pronunciation vector of at least one word in the input text, where the pronunciation vector includes at least prosodic information of a corresponding word;
the second determining module is used for determining the pronunciation duration and dialect tone corresponding to each pronunciation vector, wherein the pronunciation duration characterizes how long the pronunciation lasts and the dialect tone characterizes the pitch of the pronunciation; and
the synthesis module is used for synthesizing the synthesized speech corresponding to the input text based on the pronunciation vector, the pronunciation duration and the dialect tone;
the second determining module is specifically configured to:
take each pronunciation vector as input to a pre-trained pronunciation duration prediction model and obtain the pronunciation duration output by the model for each pronunciation vector.
7. The apparatus of claim 6, wherein the first determining module is specifically configured to:
vectorize at least one word in the input text and determine the pronunciation vector of the at least one word in the input text.
8. The apparatus of claim 6, wherein the first determining module is specifically configured to:
determine pinyin information of at least one word in the input text based on a preset correspondence between characters and pinyin; and
vectorize the pinyin information and determine the pronunciation vector of the pinyin information.
9. The apparatus of claim 6, wherein the second determining module is specifically configured to:
take each pronunciation vector as input to a pre-trained dialect tone prediction model and obtain the dialect tone output by the model for each pronunciation vector, wherein the dialect tone prediction model is pre-trained at least on training samples with dialect tone labels.
10. The apparatus according to any one of claims 6 to 9, wherein the synthesis module is specifically configured to:
take the pronunciation vector, the pronunciation duration and the dialect tone as inputs to a pre-trained speech synthesis model to obtain the synthesized spectrum output by the speech synthesis model; and
determine the synthesized speech corresponding to the input text through a vocoder and the synthesized spectrum.
11. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-5.
12. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-5.
CN202011611428.1A 2020-12-30 2020-12-30 Dialect voice synthesis method, device, electronic equipment and readable storage medium Active CN112599113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011611428.1A CN112599113B (en) 2020-12-30 2020-12-30 Dialect voice synthesis method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011611428.1A CN112599113B (en) 2020-12-30 2020-12-30 Dialect voice synthesis method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112599113A (en) 2021-04-02
CN112599113B (en) 2024-01-30 (granted)

Family

ID=75206504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011611428.1A Active CN112599113B (en) 2020-12-30 2020-12-30 Dialect voice synthesis method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112599113B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113178186B (en) * 2021-04-27 2022-10-18 湖南师范大学 Dialect voice synthesis method and device, electronic equipment and storage medium
CN113314092A (en) * 2021-05-11 2021-08-27 北京三快在线科技有限公司 Method and device for model training and voice interaction
CN114783406B (en) * 2022-06-16 2022-10-21 深圳比特微电子科技有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN116415582B (en) * 2023-05-24 2023-08-25 中国医学科学院阜外医院 Text processing method, text processing device, computer readable storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013156472A (en) * 2012-01-31 2013-08-15 Mitsubishi Electric Corp Speech synthesizer and speech synthesis method
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111899719A (en) * 2020-07-30 2020-11-06 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio
CN112086086A (en) * 2020-10-22 2020-12-15 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160365087A1 (en) * 2015-06-12 2016-12-15 Geulah Holdings Llc High end speech synthesis


Also Published As

Publication number Publication date
CN112599113A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
US11769483B2 (en) Multilingual text-to-speech synthesis
CN112599113B (en) Dialect voice synthesis method, device, electronic equipment and readable storage medium
US11990118B2 (en) Text-to-speech (TTS) processing
JP6777768B2 (en) Word vectorization model learning device, word vectorization device, speech synthesizer, their methods, and programs
US11443733B2 (en) Contextual text-to-speech processing
JP2022107032A (en) Text-to-speech synthesis method using machine learning, device and computer-readable storage medium
EP2958105B1 (en) Method and apparatus for speech synthesis based on large corpus
US11763797B2 (en) Text-to-speech (TTS) processing
KR20220000391A (en) Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature
US20090157408A1 (en) Speech synthesizing method and apparatus
US9798653B1 (en) Methods, apparatus and data structure for cross-language speech adaptation
JP2023505670A (en) Attention-Based Clockwork Hierarchical Variational Encoder
KR102528019B1 (en) A TTS system based on artificial intelligence technology
JP2015041081A (en) Quantitative f0 pattern generation device, quantitative f0 pattern generation method, model learning device for f0 pattern generation, and computer program
CN116453502A (en) Cross-language speech synthesis method and system based on double-speaker embedding
CN112735379B (en) Speech synthesis method, device, electronic equipment and readable storage medium
JP7357518B2 (en) Speech synthesis device and program
CN114492382A (en) Character extraction method, text reading method, dialog text generation method, device, equipment and storage medium
KR102532253B1 (en) A method and a TTS system for calculating a decoder score of an attention alignment corresponded to a spectrogram
KR102503066B1 (en) A method and a TTS system for evaluating the quality of a spectrogram using scores of an attention alignment
KR102418465B1 (en) Server, method and computer program for providing voice reading service of story book
US20240153486A1 (en) Operation method of speech synthesis system
Georgila 19 Speech Synthesis: State of the Art and Challenges for the Future
KR20240014250A (en) A method and a TTS system for calculating an encoder score of an attention alignment corresponded to a spectrogram
Baloyi A text-to-speech synthesis system for Xitsonga using hidden Markov models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant