CN114944146A - Voice synthesis method and device

Voice synthesis method and device

Info

Publication number
CN114944146A
Authority
CN
China
Prior art keywords
feature vector
phoneme
text
target
preset
Prior art date
Legal status
Pending
Application number
CN202210410324.7A
Other languages
Chinese (zh)
Inventor
应以勒
Current Assignee
Beijing Eswin Computing Technology Co Ltd
Original Assignee
Beijing Eswin Computing Technology Co Ltd
Application filed by Beijing Eswin Computing Technology Co Ltd filed Critical Beijing Eswin Computing Technology Co Ltd
Priority to CN202210410324.7A
Publication of CN114944146A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech synthesis method and apparatus, relates to the technical field of text-to-speech conversion, and can improve the sound quality of synthesized speech. The main technical scheme of the application is as follows: performing phoneme conversion processing on a target text to obtain a phoneme sequence corresponding to the target text; inputting the phoneme sequence into a preset speech synthesis model for processing and predicting an output target Mel frequency spectrum, wherein the preset speech synthesis model comprises a feature adjustment module used to adjust the speech rate feature, pitch feature and volume feature of each phoneme during speech synthesis; and determining the audio data corresponding to the target text according to the target Mel frequency spectrum. The method and apparatus are mainly applied to high-quality speech synthesis based on a target text.

Description

Voice synthesis method and device
Technical Field
The present application relates to the field of text-to-speech technology, and in particular, to a speech synthesis method and apparatus.
Background
Speech synthesis, also called text-to-speech (TTS), is a technology that generates artificial speech by mechanical and electronic means; it can convert arbitrary text information into standard, fluent speech and read it aloud in real time, which amounts to fitting a machine with an artificial mouth.
At present, deep learning technology is applied in the field of text-to-speech conversion: with a constructed deep learning model, the phonemes corresponding to a text can be processed directly, an acoustic feature spectrum (generally a mel spectrum) is generated in an end-to-end learning manner, and the synthesized audio is finally obtained through vocoder processing. Compared with traditional methods, the deep learning approach omits steps such as text preprocessing, word segmentation, part-of-speech tagging, phonetic annotation and prosody-level prediction, greatly improves the speech synthesis effect and reduces the complexity of the algorithm.
However, limited by the quantity and quality of the training data required for model training and by the great difficulty of model design, existing deep learning methods still do not achieve ideal sound quality for the synthesized speech.
Disclosure of Invention
In view of this, the present application provides a speech synthesis method and apparatus. Its main purpose is to add a feature adjustment module to a preset speech synthesis model and use that module to adjust the speech rate feature, pitch feature and volume feature of each phoneme during speech synthesis. The feature adjustment module improves the model's ability to extract and learn features from data, makes the model algorithm adjustable, and thereby improves the speech synthesis effect.
In order to achieve the above purpose, the present application mainly provides the following technical solutions:
a first aspect of the present application provides a speech synthesis method, including:
performing phoneme conversion processing on a target text to obtain a phoneme sequence corresponding to the target text;
inputting the phoneme sequence into a preset speech synthesis model for processing, and predicting and outputting a target Mel frequency spectrum, wherein the preset speech synthesis model comprises a characteristic adjusting module which is used for adjusting the speech speed characteristic, the tone characteristic and the volume characteristic of the phoneme in the speech synthesis process;
and determining audio data corresponding to the target text according to the target Mel frequency spectrum.
In some variations of the first aspect of the present application, the feature adjusting module includes a speech rate adjusting module, a fundamental frequency adjusting module, and an energy adjusting module, and the processing the phoneme sequence by using a preset speech synthesis model to predict and output a target mel-frequency spectrum includes:
processing the phoneme sequence by utilizing an encoder of the preset speech synthesis model to obtain a first text feature vector corresponding to the phoneme sequence;
inputting the first text feature vector into the speech rate adjusting module, and outputting a duration feature vector corresponding to each phoneme after the speech rate adjustment;
inputting the first text feature vector into the fundamental frequency adjusting module, outputting the fundamental frequency feature vectors corresponding to the phonemes after the pitch adjustment, and forming a second text feature vector;
inputting the second text feature vector into the energy adjustment module, outputting energy feature vectors corresponding to the phonemes after volume adjustment, and forming a third text feature vector;
according to the duration feature vector corresponding to each phoneme after the speech rate adjustment, lengthening the third text feature vector until the vector length is equal to the length of the Mel frequency spectrum, and obtaining a fourth text feature vector;
and processing the fourth text feature vector by using a decoder of the preset speech synthesis model to obtain a target Mel frequency spectrum.
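For illustration, the following minimal Python sketch outlines the processing flow described above (encoder, speech rate / fundamental frequency / energy adjustment, lengthening, decoder). The attribute names and call interfaces are assumptions made for this sketch, not the disclosed implementation.

```python
# Hedged sketch of the model flow; every attribute name and interface here is
# an assumption for illustration only.
def predict_mel(phoneme_sequence, model):
    first = model.encoder(phoneme_sequence)            # first text feature vector
    durations = model.speech_rate_adjuster(first)      # adjusted per-phoneme duration feature vector
    second = first + model.f0_adjuster(first)          # pitch adjustment -> second text feature vector
    third = second + model.energy_adjuster(second)     # volume adjustment -> third text feature vector
    fourth = model.length_regulate(third, durations)   # lengthened to the Mel-frequency-spectrum length
    return model.decoder(fourth)                       # target Mel frequency spectrum
```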
In some modified embodiments of the first aspect of the present application, the inputting the first text feature vector into the speech rate adjustment module, and outputting a duration feature vector corresponding to each phoneme after the speech rate adjustment includes:
carrying out duration prediction processing on the first text feature vector by using the speech speed adjusting module to obtain duration feature vectors corresponding to the phonemes;
and multiplying the preset duration distribution weight by the duration feature vector corresponding to each phoneme to obtain the adjusted duration feature vector corresponding to each phoneme.
In some modified embodiments of the first aspect of the present application, the inputting the first text feature vector into the fundamental frequency adjustment module, outputting the fundamental frequency feature vector corresponding to each phoneme after pitch adjustment, and forming a second text feature vector includes:
performing fundamental frequency prediction processing on the first text feature vector by using the fundamental frequency adjusting module to obtain a fundamental frequency feature vector corresponding to each phoneme;
multiplying the preset fundamental frequency distribution weight by the fundamental frequency feature vector corresponding to each phoneme to obtain a target fundamental frequency feature vector corresponding to each phoneme;
adding the target fundamental frequency feature vector corresponding to each phoneme with the first text feature vector to obtain a feature vector corresponding to each phoneme after the tone adjustment;
and forming a second text feature vector according to the feature vector corresponding to each phoneme after the tone adjustment.
In some modified embodiments of the first aspect of the present application, the inputting the second text feature vector into the energy adjustment module, outputting energy feature vectors corresponding to the phonemes after volume adjustment, and forming a third text feature vector includes:
performing energy prediction processing on the second text feature vector by using the energy adjusting module to obtain energy feature vectors corresponding to the phonemes;
multiplying the energy feature vectors corresponding to the phonemes by a preset energy distribution weight to obtain target energy feature vectors corresponding to the phonemes;
adding the target energy feature vector corresponding to each phoneme with the second text feature vector to obtain a feature vector corresponding to each phoneme after volume adjustment;
and forming a third text feature vector according to the feature vector corresponding to each phoneme after the volume adjustment.
In some modified embodiments of the first aspect of the present application, the obtaining a fourth text feature vector by performing lengthening processing on the third text feature vector until a vector length is equal to a length of a mel-frequency spectrum according to the duration feature vector corresponding to each phoneme after the speech rate adjustment includes:
for the same phoneme, establishing a mapping relation between the duration feature vector corresponding to each phoneme after the speech rate adjustment and each feature vector in the third text feature vector;
calculating the integral ratio of the duration feature vectors of each phoneme after the speech rate adjustment;
determining a target integer ratio between the lengths of the required feature vectors corresponding to the phonemes according to the integer ratio;
and according to the mapping relation and the target integer ratio, for each phoneme, increasing the vector length in a mode of copying the feature vector of the phoneme until the sum of the vector sequence lengths corresponding to the phonemes reaches the length of the Mel frequency spectrum to obtain a fourth text feature vector consisting of the phonemes.
In some variations of the first aspect of the present application, after the encoder that utilizes the preset speech synthesis model processes the phoneme sequence to obtain a first text feature vector corresponding to the phoneme sequence, and before the first text feature vector is input to the speech rate adjustment module and the fundamental frequency adjustment module for processing, the method further includes:
selecting a target speaker characteristic vector from a preset speaker characteristic vector set;
adding the target speaker feature vector to the first text feature vector.
In some variations of the first aspect of the present application, the method further comprises:
collecting sound sample data;
converting the sound sample data into Mel frequency spectrum data;
inputting the Mel frequency spectrum data into a preset speaker classification model, and outputting audio vector data corresponding to different individuals;
averaging the audio vector data corresponding to each individual to obtain speaker characteristic vectors corresponding to different individuals;
and forming a preset speaker feature vector set according to the speaker feature vectors corresponding to the different individuals.
A second aspect of the present application provides a speech synthesis apparatus, comprising:
the phoneme conversion unit is used for carrying out phoneme conversion processing on the target text to obtain a phoneme sequence corresponding to the target text;
the model processing unit is used for inputting the phoneme sequence into a preset speech synthesis model for processing and predicting and outputting a target Mel frequency spectrum, wherein the preset speech synthesis model comprises a characteristic adjusting module which is used for adjusting the speech speed characteristic, the tone characteristic and the volume characteristic of the phoneme in the speech synthesis process;
and the determining unit is used for determining the audio data corresponding to the target text according to the target Mel frequency spectrum.
In some modified embodiments of the second aspect of the present application, the feature adjusting module comprises a speech rate adjusting module, a fundamental frequency adjusting module and an energy adjusting module, and the model processing unit comprises:
the encoding processing module is used for processing the phoneme sequence by using an encoder of the preset speech synthesis model to obtain a first text feature vector corresponding to the phoneme sequence;
the speech speed adjusting module is used for inputting the first text feature vector into the speech speed adjusting module and outputting the duration feature vector corresponding to each phoneme after the speech speed adjustment;
the fundamental frequency adjusting module is used for inputting the first text feature vector into the fundamental frequency adjusting module, outputting the fundamental frequency feature vectors corresponding to the phonemes after the pitch adjustment, and forming a second text feature vector;
the energy adjusting module is used for inputting the second text feature vector into the energy adjusting module, outputting energy feature vectors corresponding to the phonemes after volume adjustment and forming a third text feature vector;
the lengthening processing module is used for lengthening the third text feature vector according to the duration feature vector corresponding to each phoneme after the speech rate adjustment until the vector length is equal to the length of the Mel frequency spectrum, so as to obtain a fourth text feature vector;
and the decoding processing module is used for processing the fourth text feature vector by using a decoder of the preset speech synthesis model to obtain a target Mel frequency spectrum.
In some modified embodiments of the second aspect of the present application, the speech rate adjustment module includes:
the prediction processing submodule is used for carrying out duration prediction processing on the first text feature vector by utilizing the speech rate adjusting module to obtain a duration feature vector corresponding to each phoneme;
and the adjusting processing submodule is used for multiplying the preset duration distribution weight by the duration feature vector corresponding to each phoneme to obtain the adjusted duration feature vector corresponding to each phoneme.
In some modified embodiments of the second aspect of the present application, the fundamental frequency adjustment module comprises:
the prediction processing submodule is used for carrying out fundamental frequency prediction processing on the first text feature vector by utilizing the fundamental frequency adjusting module to obtain a fundamental frequency feature vector corresponding to each phoneme;
the adjusting processing submodule is used for multiplying the preset fundamental frequency distribution weight by the fundamental frequency feature vector corresponding to each phoneme to obtain a target fundamental frequency feature vector corresponding to each phoneme;
the addition processing submodule is used for adding the target fundamental frequency feature vector corresponding to each phoneme with the first text feature vector to obtain a feature vector corresponding to each phoneme after the tone adjustment;
and the forming submodule is used for forming a second text feature vector according to the feature vector corresponding to each phoneme after the tone adjustment.
In some variations of the second aspect of the present application, the energy adjustment module comprises:
the prediction processing sub-module is used for performing energy prediction processing on the second text feature vector by using the energy adjusting module to obtain energy feature vectors corresponding to all phonemes;
the adjustment processing submodule is used for multiplying the energy characteristic vectors corresponding to the phonemes according to preset energy distribution weights to obtain target energy characteristic vectors corresponding to the phonemes;
the addition processing submodule is used for adding the target energy characteristic vector corresponding to each phoneme with the second text characteristic vector to obtain a characteristic vector corresponding to each phoneme after volume adjustment;
and the forming submodule is used for forming a third text feature vector according to the feature vector corresponding to each phoneme after the volume adjustment.
In some variations of the second aspect of the present application, the elongation treatment module comprises:
the establishing submodule is used for establishing a mapping relation between the duration feature vector corresponding to each phoneme after the speed of speech is adjusted and each feature vector in the third text feature vector for the same phoneme;
the calculation submodule is used for calculating the integral ratio between the duration feature vectors of the phonemes after the speech speed adjustment;
the determining submodule is used for determining a target integer ratio between the lengths of the required feature vectors corresponding to the phonemes according to the integer ratio;
and the elongation processing submodule is used for increasing the vector length of each phoneme in a mode of copying the feature vector of the phoneme according to the mapping relation and the target integer ratio until the sum of the vector sequence lengths corresponding to the phonemes reaches the length of the Mel frequency spectrum to obtain a fourth text feature vector consisting of the phonemes.
In some variations of the second aspect of the present application, the model processing unit further comprises:
the selecting module is used for selecting a target speaker characteristic vector from a preset speaker characteristic vector set;
and the adding module is used for adding the target speaker characteristic vector to the first text characteristic vector.
In some variations of the second aspect of the present application, the apparatus further comprises:
the acquisition unit is used for acquiring sound sample data;
a spectrum conversion unit for converting the sound sample data into mel spectrum data;
the processing unit is used for inputting the Mel frequency spectrum data into a preset speaker classification model and outputting audio vector data corresponding to different individuals;
the computing unit is used for obtaining speaker characteristic vectors corresponding to different individuals by averaging the audio vector data corresponding to each individual;
and the composition unit is used for composing a preset speaker characteristic vector set according to the speaker characteristic vectors corresponding to the different individuals.
A third aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech synthesis method as described above.
The present application provides, in a fourth aspect, an electronic device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the speech synthesis method as described above when executing the computer program.
By means of the technical scheme, the technical scheme provided by the application at least has the following advantages:
the application provides a speech synthesis method and a speech synthesis device, and for a target text to be processed, the target text is firstly converted into a phoneme sequence, and then the phoneme sequence is input into a preset speech synthesis model. The method adds the characteristic adjusting module in the preset speech synthesis model, so that the speech speed characteristic, the tone characteristic and the volume characteristic of the phoneme are adjusted in the speech synthesis process by using the characteristic adjusting module, a target Mel frequency spectrum is predicted and output through the processing of the preset speech synthesis model, and final audio data can be obtained according to the Mel frequency spectrum. In the application, the feature adjusting module is added in the preset speech synthesis model, so that the feature extracting and learning capacity of the model to data is improved by using the feature adjusting module, the adjustability of a model algorithm is realized, and the speech synthesis effect is improved. Compared with the prior art, the method solves the technical problem that the tone quality effect of the synthesized voice is poor due to too many model limiting factors and great design difficulty adopted by the existing deep learning method.
The above description is only an overview of the technical solutions of the present application, and the present application may be implemented in accordance with the content of the description so as to make the technical means of the present application more clearly understood, and the detailed description of the present application will be given below in order to make the above and other objects, features, and advantages of the present application more clearly understood.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present application;
FIG. 2 is a flow chart of another speech synthesis method provided by the embodiments of the present application;
FIG. 3 is a flowchart of a specific implementation method for constructing a set of predetermined speaker feature vectors according to an embodiment of the present application;
fig. 4 is a block diagram illustrating a speech synthesis apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of another speech synthesis apparatus according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the present application provides a speech synthesis method that adds a feature adjustment module to a preset speech synthesis model, so that the speech rate feature, pitch feature and volume feature of each phoneme are adjusted by the feature adjustment module during speech synthesis; this makes the model algorithm adjustable and improves the sound quality of the synthesized speech. As shown in fig. 1, the embodiment of the present application provides the following specific steps:
101. and performing phoneme conversion processing on the target text to obtain a phoneme sequence corresponding to the target text.
A phoneme is the smallest speech unit divided according to the natural attributes of speech. Taking international phonetic notation as an example, the pinyin "hao" is usually split into the initial and the final of the Chinese pronunciation, giving the phonemes "h" and "ao".
In an embodiment of the present application, an open-source tool (e.g., the pypinyin package) may be used to convert the target text into the corresponding pinyin. For example, inputting a target text such as "好啊" ("good") yields the pinyin [hao3 a5]; the tones of the Chinese pronunciation are also taken into account during conversion, with the numbers 1-4 representing the four tones and 5 representing the neutral tone. The pinyin is then split according to splitting rules pre-stored in the phoneme dictionary file to obtain the corresponding phonemes: since the object to be split is pinyin, the rule "separate the initial and final of the Chinese pronunciation" is selected from the phoneme dictionary file, and the pinyin [hao3 a5] is converted into the phoneme sequence [h, ao3, a5].
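A minimal sketch of this conversion is shown below. pypinyin is the open-source tool named above; the small initial/final split table is a hypothetical stand-in for the phoneme dictionary file and only covers this example.

```python
# Sketch of text -> pinyin -> phoneme conversion; SPLIT_TABLE is illustrative only.
from pypinyin import lazy_pinyin, Style

SPLIT_TABLE = {"hao": ("h", "ao"), "a": ("", "a")}  # initial/final split rules (example only)

def text_to_phonemes(text):
    phonemes = []
    for syllable in lazy_pinyin(text, style=Style.TONE3):   # e.g. "hao3", "a"
        tone = syllable[-1] if syllable[-1].isdigit() else "5"  # 5 marks the neutral tone
        base = syllable.rstrip("0123456789")
        initial, final = SPLIT_TABLE.get(base, ("", base))
        if initial:
            phonemes.append(initial)
        phonemes.append(final + tone)
    return phonemes

print(text_to_phonemes("好啊"))  # ['h', 'ao3', 'a5']
```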
102. And inputting the phoneme sequence into a preset speech synthesis model for processing, and predicting and outputting a target Mel frequency spectrum.
The preset speech synthesis model provided by the embodiment of the application at least comprises a feature adjusting module, and the feature adjusting module is used for adjusting the speech speed feature, the tone feature and the volume feature of the phoneme in the speech synthesis process.
The structure of the feature adjustment module consists of two identical one-dimensional convolution modules connected in series (each comprising a one-dimensional convolution layer, a ReLU activation layer, a layer_norm regularization layer and a dropout layer) followed by a linear output layer module. During its processing, the feature adjustment module can adaptively adjust the speech rate feature, pitch feature and volume feature of each phoneme, so that the speech rate, pitch and volume of the synthesized speech can be adjusted to meet different requirements. This amounts to making the model algorithm adjustable, which improves the feature extraction and learning capability during model processing and makes it easy to adapt to various speech synthesis service requirements.
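A PyTorch sketch of this structure is shown below: two identical one-dimensional convolution blocks (Conv1d + ReLU + LayerNorm + Dropout) followed by a linear output layer. The channel sizes, kernel size and dropout rate are assumptions for illustration, not values taken from the disclosure.

```python
# Sketch of the feature-adjustment module structure described above.
import torch
import torch.nn as nn

class FeatureAdjuster(nn.Module):
    def __init__(self, hidden=256, kernel_size=3, dropout=0.5):
        super().__init__()
        def conv_block():
            return nn.ModuleDict({
                "conv": nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2),
                "norm": nn.LayerNorm(hidden),
                "drop": nn.Dropout(dropout),
            })
        self.block1 = conv_block()
        self.block2 = conv_block()
        self.linear = nn.Linear(hidden, 1)  # one scalar per phoneme (duration, pitch or energy)

    def forward(self, x):                                     # x: (batch, phonemes, hidden)
        for block in (self.block1, self.block2):
            y = block["conv"](x.transpose(1, 2)).transpose(1, 2)  # convolve over the phoneme axis
            y = torch.relu(y)
            y = block["norm"](y)
            x = block["drop"](y)
        return self.linear(x).squeeze(-1)                     # (batch, phonemes)
```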
In the embodiment of the present application, the acoustic characteristic spectrum, generally a mel spectrum, processed and output by the preset speech synthesis model is used as a target mel spectrum and applied to subsequent acquisition of final audio data.
103. And determining audio data corresponding to the target text according to the target Mel frequency spectrum.
In the embodiment of the application, the target Mel frequency spectrum is input into a vocoder for processing, and the corresponding speech audio data can be output. The vocoder may be, but is not limited to: a neural-network-based high-speed audio synthesis vocoder (WaveRNN), a generative adversarial network vocoder for conditional waveform synthesis (MelGAN), or an efficient vocoder with multi-scale and multi-period discriminators (HiFi-GAN).
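The following end-to-end sketch ties steps 101-103 together, reusing the text_to_phonemes sketch above. `synthesis_model` and `vocoder` stand in for the preset speech synthesis model and for any of the vocoders named above; their call interfaces are assumptions.

```python
# Hedged end-to-end sketch of steps 101-103.
def synthesize(text, synthesis_model, vocoder):
    phonemes = text_to_phonemes(text)   # step 101: target text -> phoneme sequence
    mel = synthesis_model(phonemes)     # step 102: predict the target Mel frequency spectrum
    return vocoder(mel)                 # step 103: Mel frequency spectrum -> audio samples
```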
The embodiment of the application provides a speech synthesis method. For a target text to be processed, the embodiment first converts the target text into a phoneme sequence and then inputs the phoneme sequence into a preset speech synthesis model. A feature adjustment module is added to the preset speech synthesis model so that the speech rate feature, pitch feature and volume feature of each phoneme are adjusted during speech synthesis; the preset speech synthesis model predicts and outputs a target Mel frequency spectrum, from which the final audio data can be obtained. By adding the feature adjustment module, the embodiment improves the model's ability to extract and learn features from data, makes the model algorithm adjustable, and improves the speech synthesis effect. Compared with the prior art, this solves the technical problem that the sound quality of synthesized speech is poor because existing deep learning methods are subject to too many model constraints and great design difficulty.
In order to explain the foregoing embodiment in more detail, the embodiment of the present application further provides another speech synthesis method, and as shown in fig. 2, the embodiment of the present application provides the following specific steps:
201. and performing phoneme conversion processing on the target text to obtain a phoneme sequence corresponding to the target text.
202. And processing the phoneme sequence by using an encoder of a preset speech synthesis model to obtain a first text feature vector corresponding to the phoneme sequence.
In an embodiment of the present application, the preset speech synthesis model at least includes an encoder, a feature adjustment module, and a decoder, where the feature adjustment module is configured to adjust a speech rate feature, a pitch feature, and a volume feature of each phoneme included in the phoneme sequence.
When processing a phoneme sequence with the encoder, each phoneme contained in the phoneme sequence first needs to be converted into a uniquely corresponding number. For example, for the phoneme sequence [h, ao3, a5], denoting h as 2, ao3 as 4 and a5 as 7 gives the sequence [2, 4, 7]; the number sequence has the same length as the phoneme sequence, and each number has a one-to-one mapping to the corresponding phoneme: "2" corresponds to "h", "4" corresponds to "ao3", and "7" corresponds to "a5".
Specifically, the conversion of the phoneme sequence into the corresponding number sequence may adopt, for example (but not limited to), the following implementation method:
in the embodiment of the present application, in the "phone dictionary file", n phones and corresponding relations between numbers 1 and n (where n is a positive integer) may also be stored, but it should be noted that, in the embodiment of the present application, taking international phonetic symbols as an example, pinyin is separated according to initials and finals used for chinese pronunciation, and the obtained phones may carry different tones, so for the same finals, if different tones are carried, the obtained phones are also different, for example, "ao 1", "ao 2", "ao 3", "ao 4" and "ao 5", and thus, for such different phones, the numbers matched in the phone dictionary file are also different. Therefore, after determining each phoneme included in the phoneme sequence, it is possible to obtain which digit each phoneme corresponds to by querying the "correspondence between phonemes and digits" in the phoneme dictionary file, and further convert the phoneme sequence into a digit sequence.
After the digital sequence converted from the phoneme sequence is obtained, the digital sequence is processed by using a word embedding layer (which is also equivalent to a first layer of the model) of an encoder, so as to convert each phoneme in the phoneme sequence into a corresponding vector representation, and then a corresponding text feature vector, namely a text feature vector corresponding to the phoneme sequence, can be formed based on the vector representations.
Illustratively, for the number sequence [2, 4, 7] corresponding to the phoneme sequence, the word embedding layer of the encoder turns each number into a unique vector representation. For example, assuming the vector dimension is 2, "2" may become [2, 3], so the word embedding layer converts the number sequence into the sequence [[2, 3], [1, 4], [3, 3]]; the encoder then further processes these vector representations to obtain a text feature vector, e.g. [[1, 2], [2, 3], [3, 4]].
Further explanation of the word embedding layer: during training, the model randomly initializes the word embedding layer (generally with normally distributed random numbers), which contains a matrix of m rows and n columns, where m is the dimension of the vectors obtained from the numbers and n is the number of all phonemes. For the number sequence [2, 4, 7], for example, the vectors in the 2nd, 4th and 7th columns of the matrix are selected during word embedding processing. Because the word embedding layer is trained from randomly initialized weights, the matrix differs between training runs; what matters is only that the trained word embedding layer realizes a "number-to-vector" dictionary relationship, so that it outputs a vector for each number. As exemplified above, assuming the vector dimension is 2, the word embedding layer turns "2" into [2, 3], and the number sequence [2, 4, 7] is finally converted into the sequence [[2, 3], [1, 4], [3, 3]].
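The lookup can be sketched with a standard embedding table, as below; the table size and the dimension of 2 are illustrative assumptions, and the actual values are random until the layer is trained.

```python
# Sketch of the word-embedding lookup: each phoneme number selects one vector
# from a randomly initialized table.
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=100, embedding_dim=2)  # n phonemes x dimension m
ids = torch.tensor([2, 4, 7])
vectors = embedding(ids)  # shape (3, 2): one unique vector per number, e.g. [[2, 3], [1, 4], [3, 3]]
```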
Further, explanation is made on the encoding process: after the sequence is input into the encoder, certain operation is carried out on the sequence and parameters in the encoder, the operation rule is manually designed according to the model structure, and the text feature vector of encoding processing is output after operation.
Regarding the encoder parameters: when the model is trained, they are first given random initial values and are then continuously adjusted and corrected according to the difference between the measured data given during training (i.e. the Mel frequency spectrum of real speech) and the Mel frequency spectrum predicted by the whole model. Through continuous training the difference becomes smaller and approaches a minimum, at which point training ends and better encoder parameters are obtained.
For example, for the sequence [[2, 3], [1, 4], [3, 3]], assume the manually designed rule in the encoder is: multiply each number of each vector by the number in the corresponding dimension of every other vector, add the results, and finally subtract a. If the value of a obtained by training the model is 5, the first vector is processed into [2x1+2x3-5, 3x4+3x3-5] = [3, 16]; similarly, the other two vectors become [1x2+1x3-5, 4x3+4x3-5] = [0, 19] and [3x2+3x1-5, 3x3+3x4-5] = [4, 16], so the result of the encoder processing is [[3, 16], [0, 19], [4, 16]], taken as the text feature vector.
It should be noted that, in order to distinguish the text feature vector obtained in step 202 from the text feature vectors obtained in other subsequent steps, the text feature vector obtained in step 202 is identified as the first text feature vector. And the text feature vectors are identified by the words "second", "third" and "fourth" in the following description, which are used to distinguish text feature vectors obtained at different processing stages.
203. And selecting a target speaker characteristic vector from the preset speaker characteristic vector set, and adding the target speaker characteristic vector into the first text characteristic vector.
The preset speaker feature vector set comprises a plurality of target speaker feature vectors obtained by processing through a preset speaker classification model. The preset speaker classification model is a classifier model trained by utilizing a large number of human voice audio samples and consists of a three-layer Long-Short Term Memory network (LSTM) structure and a linear layer module.
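A sketch of that classifier structure is given below: three LSTM layers plus a linear layer that outputs one audio vector (speaker feature vector) per utterance. The mel, hidden and embedding sizes are assumptions for illustration.

```python
# Sketch of the preset speaker classification model structure described above.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, embed_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.linear = nn.Linear(hidden, embed_dim)

    def forward(self, mel):              # mel: (batch, frames, n_mels)
        _, (h, _) = self.lstm(mel)       # h: (num_layers, batch, hidden)
        return self.linear(h[-1])        # (batch, embed_dim) speaker feature vector
```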
For example, in the embodiment of the present application, a specific implementation method for constructing a preset speaker feature vector set may include the following steps S301 to S305, as shown in fig. 3, which are specifically explained as follows:
s301, collecting sound sample data.
S302, converting the voice sample data into Mel frequency spectrum data.
And S303, inputting the Mel frequency spectrum data into a preset speaker classification model, and outputting audio vector data corresponding to different individuals.
In the embodiment of the application, a large amount of human voice audio data are classified by using a trained preset speaker classification model, and the model is set to output audio vector data, namely, the voice feature vector of a speaker. Then the classification process using the predetermined speaker classification model outputs a plurality of data sets, and each data set includes sound feature vectors corresponding to the same person.
S304, averaging the audio vector data corresponding to each individual to obtain speaker feature vectors corresponding to different individuals.
S305, forming a preset speaker feature vector set according to the speaker feature vectors corresponding to different individuals.
An averaging operation may be performed within each data set to obtain an average sound feature vector, which is used as the target speaker feature vector representing that data set. Based on the plurality of data sets, a plurality of target speaker feature vectors can then be obtained to form the speaker feature vector set; in short, each target speaker feature vector is equivalent to a feature representing the voice of one person.
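A minimal sketch of steps S304-S305 is shown below; `vectors_by_speaker` (a mapping from speaker id to the list of audio vectors produced by the classifier) is an assumed intermediate, not a structure named in the disclosure.

```python
# Sketch of averaging each individual's audio vectors into one speaker feature vector.
import torch

def build_speaker_vector_set(vectors_by_speaker):
    return {speaker: torch.stack(vectors).mean(dim=0)
            for speaker, vectors in vectors_by_speaker.items()}
```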
For example, the preset speaker feature vector set may contain three target speaker feature vectors: one representing the voice features of a child A, one representing the voice features of an adult B, and one representing the voice features of an elderly grandmother C. This is only an illustrative example, and the embodiment does not limit the number of target speaker feature vectors contained in the preset speaker feature vector set. It should be noted, however, that as the number of target speaker feature vectors increases, the vectors should differ from one another, and the differences should be large, so that more diverse target speaker feature vectors can be selected from the set, which indirectly yields more diverse voices.
Next, after the feature vector of the target speaker is selected from the preset speaker feature vector set, the feature vector of the target speaker is added to the first text feature vector, and the explanation is as follows:
In the embodiment of the present application, the purpose of adding the target speaker feature vector to the first text feature vector is to let the subsequently synthesized speech be influenced by the target speaker feature vector and present a variety of vocal effects. In this way, the speech synthesized by the deep learning method provided by the embodiment of the application does not merely present the voice of a speaker represented by a random vector; it can present voices with various characteristics according to the requirements of different speech synthesis scenarios, thereby improving the sound quality of the synthesized speech.
For example, if the selected target speaker feature vector is [1, 0] and the first text feature vector is [[1, 2], [2, 3], [3, 4]] (as exemplified in step 202), the two are added: [[1, 2], [2, 3], [3, 4]] + [1, 0] = [[2, 2], [3, 3], [4, 4]]. The resulting vector is used as the new first text feature vector and goes on to the other processing steps in the preset speech synthesis model.
It should be noted that this step 203 is a preferred solution provided in the embodiment of the present application, that is, a method of adding a target speaker feature vector to a first text feature vector, so that the final synthesized speech can present diverse speaker voices, and the speech synthesis effect is improved.
204a, carrying out duration prediction processing on the first text feature vector by using a speech rate adjusting module in the feature adjusting module to obtain a duration feature vector corresponding to each phoneme.
205a, multiplying the preset duration distribution weight by the duration feature vector corresponding to each phoneme to obtain the adjusted duration feature vector corresponding to each phoneme.
In the above steps 204a to 205a, the first text feature vector is input into the speech rate adjustment module, and the duration feature vector corresponding to each phoneme after the speech rate adjustment is output, and the specific explanation is as follows:
The speech rate adjustment module is used to adjust the speech rate feature of each phoneme; the speech rate feature is embodied as the duration feature of the phoneme.
Taking the new first text feature vector [[2, 2], [3, 3], [4, 4]] obtained in step 203 as an example, it is input into the speech rate adjustment module for processing. During processing, a duration feature is first predicted for each phoneme; together these form a duration feature vector with the same length as the first text feature vector, assumed here to be [1, 2, 1]. The duration feature of each phoneme is then adjusted according to the preset duration distribution weight, which is set in advance according to the actual duration adjustment requirement; multiplying the duration feature vector of each phoneme by the preset duration distribution weight completes the adjustment of the duration feature vector of each phoneme.
It should be noted that the preset duration distribution weight amounts to a weight coefficient used when adjusting the duration feature vectors, and the coefficient may be greater than 1 or less than 1. A coefficient greater than 1 lengthens the duration (i.e. slows down the speech rate of the phoneme), and a coefficient less than 1 shortens the duration (i.e. speeds up the speech rate of the phoneme); correspondingly, a coefficient of 1 leaves the duration feature vector unchanged. If multiplying a duration feature by the weight coefficient yields a decimal, a rounding operation is also needed.
Exemplarily, for the duration feature vector [1, 2, 1], assuming the weight coefficient is 1.6:
1.6 x [1, 2, 1] = [1.6, 3.2, 1.6], so the values are rounded down to integers, giving [1, 3, 1]. The purpose of adding the rounding operation in the embodiment of the present application is to facilitate the subsequent operation of "lengthening the third text feature vector according to the duration feature vector corresponding to each phoneme until the vector length equals the length of the Mel frequency spectrum to obtain a fourth text feature vector".
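The following sketch reproduces this duration adjustment with the running example; flooring is used here because it matches the rounded result given above.

```python
# Sketch of steps 204a-205a: scale predicted durations by the preset weight, then round down.
import torch

durations = torch.tensor([1.0, 2.0, 1.0])   # predicted duration feature vector
weight = 1.6                                 # preset duration distribution weight (>1 slows the speech rate)
adjusted = torch.floor(weight * durations)   # tensor([1., 3., 1.])
```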
204b, performing fundamental frequency prediction processing on the first text feature vector by using a fundamental frequency adjusting module in the feature adjusting module to obtain a fundamental frequency feature vector corresponding to each phoneme.
205b, multiplying the preset fundamental frequency distribution weight by the fundamental frequency feature vector corresponding to each phoneme to obtain a target fundamental frequency feature vector corresponding to each phoneme.
206b, adding the target fundamental frequency feature vector corresponding to each phoneme with the first text feature vector to obtain the feature vector corresponding to each phoneme after the tone adjustment, and forming a second text feature vector.
In the above steps 204b-206b, the first text feature vector is input into the fundamental frequency adjustment module, the feature vectors corresponding to the phonemes after the pitch adjustment are output, and a second text feature vector is formed. The specific explanation is as follows:
The fundamental frequency is the lowest oscillation frequency of a freely oscillating system, i.e. the lowest frequency in a complex wave. The fundamental frequency adjustment module is used to adjust the fundamental frequency feature of each phoneme; the fundamental frequency feature is embodied as the pitch feature of the phoneme.
Taking the new first text feature vector [[2, 2], [3, 3], [4, 4]] obtained in step 203 as an example, it is input into the fundamental frequency adjustment module for processing. During processing, a fundamental frequency feature vector is first predicted for each phoneme; together these have the same length as the first text feature vector and are assumed here to be [[1, 3], [2, 2], [0, 1]]. The fundamental frequency feature vector of each phoneme is then adjusted according to the preset fundamental frequency distribution weight, which is set in advance according to the actual fundamental frequency adjustment requirement; multiplying the fundamental frequency feature vector of each phoneme by the preset fundamental frequency distribution weight gives the target fundamental frequency feature vector of each phoneme, completing the adjustment of the fundamental frequency feature vector of each phoneme.
It should be noted that what is given by using the preset fundamental frequency distribution weight is equivalent to a weight coefficient used when adjusting different fundamental frequency feature vectors, and the weight coefficient may be greater than 1 or less than 1; if the weight coefficient selection is greater than 1, the fundamental frequency is adjusted to be increased (namely, the pitch of the phoneme is adjusted to be increased), and if the weight coefficient selection is less than 1, the fundamental frequency is adjusted to be decreased (namely, the pitch of the phoneme is adjusted to be decreased); accordingly, the weight is selected to be 1, which means that the fundamental frequency of the phoneme is not adjusted and changed.
Illustratively, for the fundamental frequency feature vectors [[1, 3], [2, 2], [0, 1]], assuming a weight coefficient of 1.5, then 1.5 x [[1, 3], [2, 2], [0, 1]] = [[1.5, 4.5], [3, 3], [0, 1.5]]. Assuming a weight coefficient of 1, the fundamental frequency feature vectors are unchanged.
After the target fundamental frequency feature vector of each phoneme is obtained by multiplying the fundamental frequency feature vector of each phoneme by the preset fundamental frequency distribution weight, it is added to the first text feature vector itself: 1 x [[1, 3], [2, 2], [0, 1]] + [[2, 2], [3, 3], [4, 4]] = [[3, 5], [5, 5], [4, 5]]. The vector sequence length is unchanged, the feature vector of each phoneme after pitch adjustment is obtained, and these form the second text feature vector.
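The same arithmetic can be sketched as follows with the running example; the values and the weight of 1 are taken from the example above.

```python
# Sketch of steps 204b-206b: scale the predicted fundamental-frequency vectors
# by the preset weight and add them onto the first text feature vector.
import torch

first_text = torch.tensor([[2., 2.], [3., 3.], [4., 4.]])  # first text feature vector
f0 = torch.tensor([[1., 3.], [2., 2.], [0., 1.]])          # predicted fundamental-frequency feature vectors
weight = 1.0                                                # preset fundamental-frequency distribution weight
second_text = weight * f0 + first_text                      # [[3, 5], [5, 5], [4, 5]]
```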
207b, the energy prediction processing is carried out on the second text feature vector by utilizing an energy adjusting module in the feature adjusting module, so as to obtain energy feature vectors corresponding to the phonemes.
208b, multiplying the energy feature vectors corresponding to the phonemes according to the preset energy distribution weight to obtain target energy feature vectors corresponding to the phonemes.
209b, adding the target energy feature vector corresponding to each phoneme with the second text feature vector to obtain the feature vector corresponding to each phoneme after volume adjustment, and forming a third text feature vector.
In the above steps 207b-209b, the second text feature vector is input to the energy adjustment module, the energy feature vectors corresponding to the phonemes after the volume adjustment are output, and a third text feature vector is formed, and the specific explanation is as follows:
The energy adjustment module is used to adjust the energy feature of each phoneme; the energy feature is embodied as the volume feature of the phoneme.
Taking the second text feature vector [[3, 5], [5, 5], [4, 5]] obtained in step 206b as an example, it is input into the energy adjustment module for processing. During processing, an energy feature vector is first predicted for each phoneme; together these have the same length as the second text feature vector and are assumed here to be [[1, 0], [0, 2], [1, 1]]. The energy feature vector of each phoneme is then adjusted according to the preset energy distribution weight, which is set in advance according to the actual energy adjustment requirement; multiplying the energy feature vector of each phoneme by the preset energy distribution weight gives the target energy feature vector of each phoneme, completing the adjustment of the energy feature vector of each phoneme.
It should be noted that the preset energy distribution weight amounts to a weight coefficient used when adjusting the energy feature vectors, and the coefficient may be greater than 1 or less than 1. A coefficient greater than 1 increases the energy (i.e. raises the volume of the phoneme), and a coefficient less than 1 decreases the energy (i.e. lowers the volume of the phoneme); correspondingly, a coefficient of 1 means the volume of the phoneme is not adjusted.
Illustratively, for an energy feature vector [ [1,0], [0,2], [1,1] ], assuming a weight coefficient of 1, the energy feature vector is not adjusted and changed, and further, it is added to the second text feature vector itself, then:
1 x [[1, 0], [0, 2], [1, 1]] + [[3, 5], [5, 5], [4, 5]] = [[4, 5], [5, 7], [5, 6]]. The vector sequence length is unchanged, the feature vector of each phoneme after volume adjustment is obtained, and these form the third text feature vector.
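The corresponding sketch for the energy adjustment, with the values and the weight of 1 taken from the example above:

```python
# Sketch of steps 207b-209b: scale the predicted energy vectors by the preset
# weight and add them onto the second text feature vector.
import torch

second_text = torch.tensor([[3., 5.], [5., 5.], [4., 5.]])  # second text feature vector
energy = torch.tensor([[1., 0.], [0., 2.], [1., 1.]])       # predicted energy feature vectors
weight = 1.0                                                 # preset energy distribution weight
third_text = weight * energy + second_text                   # [[4, 5], [5, 7], [5, 6]]
```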
Next, in this embodiment of the present application, according to the duration feature vector corresponding to each phoneme after the speech rate adjustment, the third text feature vector is elongated until the vector length is equal to the length of the mel-frequency spectrum, so as to obtain a fourth text feature vector, where the specific implementation steps are as follows:
firstly, the duration feature vectors corresponding to the phonemes after the speech rate adjustment are preprocessed in the steps 206a to 207a as follows:
206a, calculating the integral ratio between the duration feature vectors of the phonemes after the speech rate adjustment.
207a, determining a target integer ratio between the lengths of the required feature vectors corresponding to the phonemes according to the integer ratio.
In the embodiment of the present application, as in steps 204a to 205a, the vector sequence composed of the duration features of the phonemes is [1, 2, 1]. Assuming the weight coefficient of the preset duration distribution weight is 1, i.e. no speech rate adjustment is performed, the duration feature of each phoneme after the speech rate adjustment remains [1, 2, 1] by default. The integer ratio between the duration features of the phonemes is calculated to be 1:2:1, and from this integer ratio the target integer ratio between the required feature vector lengths corresponding to the phonemes is determined to be 1:2:1.
Secondly, after the preprocessing of the duration feature vectors corresponding to the phonemes after the speech rate adjustment is completed, the following operation of obtaining a fourth text feature vector is performed, as shown in steps 210 to 211.
210. And for the same phoneme, establishing a mapping relation between the duration characteristic vector after the speech rate adjustment and each characteristic vector in the third text characteristic vector.
In the embodiment of the present application, take the duration feature vector [1, 2, 1] after the speech rate adjustment and the third text feature vector [[4, 5], [5, 7], [5, 6]] obtained in steps 207b-209b, which contains the three feature vectors [4, 5], [5, 7] and [5, 6]. The mapping relationship between the duration features and the feature vectors in the third text feature vector is then established as: [1] and [4, 5]; [2] and [5, 7]; [1] and [5, 6].
211. And according to the mapping relation and the target integer ratio, increasing the vector length of each phoneme in a mode of copying the feature vector of the phoneme until the sum of the vector sequence lengths corresponding to the phonemes reaches the length of the Mel frequency spectrum to obtain a fourth text feature vector consisting of the phonemes.
In the embodiment of the present application, the mapping relationship and the target integer ratio obtained in the above steps are applied, by way of example, as follows:
According to the target integer ratio 1:2:1, in terms of the required vector lengths, the second vector length should be twice the first vector length and twice the third vector length. Each vector within the third text feature vector [[4,5], [5,7], [5,6]] is therefore adjusted as follows: the first vector [4,5] is kept unchanged, the second vector [5,7] is doubled in length (that is, copied once), and the third vector [5,6] is kept unchanged, giving an adjusted third text feature vector [[4,5], [5,7], [5,7], [5,6]]. This is equivalent to increasing the vector length by copying the feature vector of the phoneme, and one pass of the elongation processing on the third text feature vector is completed.
In addition, it should be noted that, in the process of performing the elongation processing on the third text feature vector, the length of the finally elongated third text feature vector must reach the length of the mel spectrum, so that the mel spectrum can subsequently be processed to obtain the audio data.
Therefore, if the length of the mel spectrum cannot be reached after one pass of elongation processing, multiple passes are needed. Specifically, it is only necessary to ensure that the ratio between the counts of the feature vectors originally present in the third text feature vector remains equal to the target integer ratio, namely that the counts of the feature vectors [4,5], [5,7] and [5,6] originally present in the third text feature vector keep the integer ratio 1:2:1.
In the embodiment of the application, which feature vector in the third text feature vector is copied is thus decided according to the mapping relationship and the target integer ratio, and the vector length is increased by copying the feature vectors of the phonemes until the sum of the lengths of the vector sequences corresponding to the phonemes reaches the length of the mel spectrum; once that length is reached, these feature vectors form the fourth text feature vector.
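A minimal sketch of this copy-based elongation is given below, assuming the target integer ratio is known and the mel-spectrum length is an exact multiple of one elongation pass; the function name and the mel length of 8 frames are assumptions for illustration:

```python
import numpy as np

def elongate(third_text_features, target_ratio, mel_length):
    """Copy each phoneme's feature vector according to the target integer ratio,
    repeating whole passes until the sequence length equals mel_length."""
    expanded = np.repeat(third_text_features, target_ratio, axis=0)  # one elongation pass
    assert mel_length % len(expanded) == 0, "mel_length assumed to be a multiple of one pass"
    passes = mel_length // len(expanded)
    return np.tile(expanded, (passes, 1))  # fourth text feature vector

third = np.array([[4, 5], [5, 7], [5, 6]])
fourth = elongate(third, target_ratio=[1, 2, 1], mel_length=8)
print(fourth.shape)  # (8, 2): the count ratio of [4,5], [5,7], [5,6] rows stays 1:2:1
```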
212. And processing the fourth text feature vector by using the decoder of the preset speech synthesis model to obtain a target Mel frequency spectrum, and determining audio data corresponding to the target text according to the target Mel frequency spectrum.
In the embodiment of the present application, the fourth text feature vector is processed by the decoder in the preset speech synthesis model to obtain an acoustic feature spectrum, which is generally a mel spectrum, and the mel spectrum is input to a vocoder for processing, so that the corresponding human voice audio data can be output. The vocoder may be, but is not limited to: a neural-network-based high-speed audio synthesis vocoder (WaveRNN), a generative adversarial network vocoder for conditional waveform synthesis (MelGAN), or an efficient vocoder with multi-scale and multi-period discriminators (HiFi-GAN).
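This decoding stage can likewise be sketched as two callables, a decoder mapping the fourth text feature vector to the target mel spectrum and a vocoder mapping the mel spectrum to audio samples; the stand-in lambdas below exist only so the sketch runs and are not the APIs of actual WaveRNN, MelGAN or HiFi-GAN implementations:

```python
import numpy as np

def synthesize_audio(fourth_text_features, decoder, vocoder):
    """Decode the fourth text feature vector into a target mel spectrum, then
    convert that mel spectrum into audio samples with a vocoder."""
    target_mel = decoder(fourth_text_features)   # (mel_length, n_mels) target mel spectrum
    audio = vocoder(target_mel)                  # 1-D array of audio samples
    return audio

# Placeholders standing in for trained decoder and vocoder networks.
dummy_decoder = lambda feats: np.random.randn(len(feats), 80)
dummy_vocoder = lambda mel: np.random.randn(len(mel) * 256)

audio = synthesize_audio(np.zeros((8, 2)), dummy_decoder, dummy_vocoder)
print(audio.shape)
```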
Further, as an implementation of the methods shown in fig. 1, fig. 2, and fig. 3, an embodiment of the present application provides a speech synthesis apparatus. The apparatus embodiment corresponds to the method embodiment described above; for ease of reading, the details of the method embodiment are not repeated here one by one, but it should be clear that the apparatus in this embodiment can correspondingly implement all of the contents of the method embodiment. The apparatus is applied to optimizing the quality of speech synthesized by a deep learning method and, as shown in fig. 4, comprises the following units (a minimal structural sketch is given after the unit list below):
a phoneme conversion unit 31, configured to perform phoneme conversion processing on a target text to obtain a phoneme sequence corresponding to the target text;
a model processing unit 32, configured to input the phoneme sequence into a preset speech synthesis model for processing, and predict and output a target mel spectrum, where the preset speech synthesis model includes a feature adjustment module, and the feature adjustment module is configured to adjust a speech rate feature, a pitch feature, and a volume feature of a phoneme in a speech synthesis process;
the determining unit 33 is configured to determine, according to the target mel spectrum, audio data corresponding to the target text.
Further, as shown in fig. 5, the feature adjusting module includes a speech rate adjusting module, a fundamental frequency adjusting module and an energy adjusting module, and the model processing unit 32 includes:
the encoding processing module 321 is configured to process the phoneme sequence by using an encoder of the preset speech synthesis model to obtain a first text feature vector corresponding to the phoneme sequence;
a speech rate adjusting module 322, configured to input the first text feature vector into the speech rate adjusting module, and output a duration feature vector corresponding to each phoneme after the speech rate adjustment;
a fundamental frequency adjusting module 323, configured to input the first text feature vector into the fundamental frequency adjusting module, output fundamental frequency feature vectors corresponding to the phonemes after the pitch adjustment, and form a second text feature vector;
an energy adjusting module 324, configured to input the second text feature vector into the energy adjusting module, output energy feature vectors corresponding to the phonemes after volume adjustment, and form a third text feature vector;
an elongation processing module 325, configured to perform elongation processing on the third text feature vector until a vector length is equal to a length of a mel spectrum according to the duration feature vector corresponding to each phoneme after the speech rate adjustment, to obtain a fourth text feature vector;
and the decoding processing module 326 is configured to process the fourth text feature vector by using the decoder of the preset speech synthesis model to obtain a target mel spectrum.
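Assuming each of the modules 321 to 326 listed above is available as a callable, their flow could be summarized by the following sketch; the function signature is illustrative only, not a prescribed interface:

```python
def model_processing_unit(phoneme_seq, encoder, rate_mod, f0_mod, energy_mod, regulate, decoder):
    """Sketch of the flow through modules 321-326: encode, adjust speech rate,
    pitch and volume, elongate to the mel length, then decode."""
    first = encoder(phoneme_seq)          # 321: first text feature vector
    durations = rate_mod(first)           # 322: adjusted duration feature vectors
    second = f0_mod(first)                # 323: second text feature vector (pitch adjusted)
    third = energy_mod(second)            # 324: third text feature vector (volume adjusted)
    fourth = regulate(third, durations)   # 325: fourth text feature vector (elongated)
    return decoder(fourth)                # 326: target mel spectrum
```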
Further, as shown in fig. 5, the speech rate adjustment module 322 includes:
the prediction processing sub-module 3221 is configured to perform duration prediction processing on the first text feature vector by using the speech rate adjustment module to obtain a duration feature vector corresponding to each phoneme;
the adjusting processing sub-module 3222 is configured to multiply the duration feature vector corresponding to each phoneme by a preset duration allocation weight to obtain a duration feature vector corresponding to each adjusted phoneme.
Further, as shown in fig. 5, the fundamental frequency adjusting module 323 includes:
the prediction processing sub-module 3231 is configured to perform fundamental frequency prediction processing on the first text feature vector by using the fundamental frequency adjustment module to obtain a fundamental frequency feature vector corresponding to each phoneme;
the adjustment processing submodule 3232 is configured to multiply the baseband feature vectors corresponding to the phonemes by a preset baseband distribution weight to obtain a target baseband feature vector corresponding to each phoneme;
the adding processing sub-module 3233 is configured to add the target fundamental frequency feature vector corresponding to each phoneme to the first text feature vector to obtain a feature vector corresponding to each phoneme after pitch adjustment;
and a forming sub-module 3234, configured to form a second text feature vector according to the feature vector corresponding to each phoneme after the pitch adjustment.
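A minimal sketch of sub-modules 3231 to 3234 follows, with a stand-in fundamental frequency predictor; the names and values are placeholders for illustration only:

```python
import numpy as np

def adjust_pitch(first_text_features, f0_predictor, f0_weight):
    """Sub-modules 3231-3234: predict per-phoneme fundamental frequency features,
    multiply by the preset distribution weight, and add the result to the
    first text feature vector to form the second text feature vector."""
    f0 = f0_predictor(first_text_features)   # fundamental frequency feature vectors
    target_f0 = f0_weight * f0               # target fundamental frequency feature vectors
    return first_text_features + target_f0   # second text feature vector

# Stand-in predictor so the sketch runs.
dummy_f0_predictor = lambda feats: np.ones_like(feats)
second = adjust_pitch(np.zeros((3, 2)), dummy_f0_predictor, f0_weight=1.0)
```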
Further, as shown in fig. 5, the energy adjustment module 324 includes:
the prediction processing submodule 3241 is configured to perform energy prediction processing on the second text feature vector by using the energy adjustment module to obtain an energy feature vector corresponding to each phoneme;
the adjustment processing submodule 3242 is configured to multiply the energy feature vectors corresponding to the phonemes by a preset energy distribution weight to obtain target energy feature vectors corresponding to the phonemes;
the adding processing submodule 3243 is configured to add the target energy feature vector corresponding to each phoneme to the second text feature vector to obtain a feature vector corresponding to each phoneme after volume adjustment;
and the forming sub-module 3244 is configured to form a third text feature vector according to the feature vector corresponding to each phoneme after the volume adjustment.
Further, as shown in fig. 5, the elongation processing module 325 includes:
the establishing submodule 3251 is configured to establish, for the same phoneme, a mapping relationship between the duration feature vector corresponding to each phoneme after the speech rate adjustment and each feature vector in the third text feature vector;
the calculating submodule 3252 is configured to calculate an integer ratio between duration feature vectors of each phoneme after the speech rate adjustment;
a determining submodule 3253, configured to determine, according to the integer ratio, a target integer ratio between the lengths of the required feature vectors corresponding to the phonemes;
and the elongation processing submodule 3254 is configured to increase the vector length of each phoneme in a manner of copying the feature vector of the phoneme according to the mapping relationship and the target integer ratio until the sum of the vector sequence lengths corresponding to the phonemes reaches the length of the mel spectrum, so as to obtain a fourth text feature vector consisting of the phonemes.
Further, as shown in fig. 5, the model processing unit 32 further includes:
a selecting module 327, configured to select a target speaker feature vector from a set of preset speaker feature vectors;
an adding module 328 is configured to add the target speaker feature vector to the first text feature vector.
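A minimal sketch of the selecting module 327 and the adding module 328 is given below; the patent only specifies that the target speaker feature vector is added to the first text feature vector, so the broadcasting over phonemes and the dictionary-style speaker set are assumptions of this sketch:

```python
import numpy as np

def add_speaker_embedding(first_text_features, speaker_set, speaker_id):
    """Select the target speaker feature vector from the preset set and add it
    to the first text feature vector (broadcast over the phoneme axis)."""
    target_speaker = speaker_set[speaker_id]
    return first_text_features + target_speaker

speaker_set = {"speaker_a": np.array([0.1, -0.2])}   # preset speaker feature vector set
conditioned = add_speaker_embedding(np.zeros((3, 2)), speaker_set, "speaker_a")
```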
Further, as shown in fig. 5, the apparatus further includes:
a collecting unit 34 for collecting voice sample data;
a spectrum conversion unit 35, configured to convert the sound sample data into mel spectrum data;
the processing unit 36 is configured to input the mel-frequency spectrum data into a preset speaker classification model, and output audio vector data corresponding to different individuals;
the calculating unit 37 is configured to obtain speaker feature vectors corresponding to different individuals by averaging audio vector data corresponding to each individual;
and a forming unit 38, configured to form a preset speaker feature vector set according to the speaker feature vectors corresponding to the different individuals.
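Units 34 to 38 can be summarized by the following sketch of building the preset speaker feature vector set; the mel transform and speaker classification model here are crude stand-ins so the code runs, not the models referenced by the embodiment:

```python
import numpy as np
from collections import defaultdict

def build_speaker_set(samples, to_mel, classify_speaker):
    """Convert each sound sample to mel-spectrum data, map it to an audio vector
    with a speaker classification model, then average the vectors per individual."""
    vectors = defaultdict(list)
    for speaker_id, waveform in samples:
        mel = to_mel(waveform)                              # mel-spectrum data
        vectors[speaker_id].append(classify_speaker(mel))   # audio vector data
    # Average per individual to obtain the speaker feature vectors.
    return {sid: np.mean(v, axis=0) for sid, v in vectors.items()}

# Stand-ins so the sketch runs.
dummy_mel = lambda wav: np.abs(np.fft.rfft(wav))[:80]
dummy_classifier = lambda mel: mel[:4]
samples = [("a", np.random.randn(1024)), ("a", np.random.randn(1024)), ("b", np.random.randn(1024))]
preset_speaker_set = build_speaker_set(samples, dummy_mel, dummy_classifier)
```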
To sum up, the embodiments of the present application provide a speech synthesis method and apparatus. For a target text to be processed, the target text is first converted into a phoneme sequence, and the phoneme sequence is then input into a preset speech synthesis model. In the preset speech synthesis model, the phoneme sequence is processed by an encoder to obtain a corresponding first text feature vector; in the embodiment of the application, the target speaker feature vector is further added to the first text feature vector, and the result is processed by the feature adjustment module, so that the speech rate feature, the pitch feature and the volume feature of each phoneme can be adjusted and the diversity of the synthesized sound features can be increased.
The speech synthesis device provided by the embodiment of the application comprises a processor and a memory, wherein the phoneme conversion unit, the model processing unit, the determination unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. There may be one or more kernels. By adjusting kernel parameters, the feature adjustment module is added and used for adjusting the speech rate feature, the pitch feature and the volume feature of the phonemes in the process of performing speech synthesis with the preset speech synthesis model, so that the feature adjustment module improves the model's capability of feature extraction and learning from data while making the model algorithm adjustable, thereby improving the speech synthesis effect.
An embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the speech synthesis method as described above.
An embodiment of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the speech synthesis method as described above when executing the computer program.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of speech synthesis, the method comprising:
performing phoneme conversion processing on a target text to obtain a phoneme sequence corresponding to the target text;
inputting the phoneme sequence into a preset speech synthesis model for processing, and predicting and outputting a target Mel frequency spectrum, wherein the preset speech synthesis model comprises a characteristic adjusting module which is used for adjusting the speech speed characteristic, the tone characteristic and the volume characteristic of the phoneme in the speech synthesis process;
and determining audio data corresponding to the target text according to the target Mel frequency spectrum.
2. The method of claim 1, wherein the feature adjusting module comprises a speech rate adjusting module, a fundamental frequency adjusting module and an energy adjusting module, and the processing the phoneme sequence by using a preset speech synthesis model to predict and output a target Mel frequency spectrum comprises:
processing the phoneme sequence by using an encoder of the preset speech synthesis model to obtain a first text feature vector corresponding to the phoneme sequence;
inputting the first text feature vector into the speech speed adjusting module, and outputting a duration feature vector corresponding to each phoneme after the speech speed adjustment;
inputting the first text feature vector into the fundamental frequency adjusting module, outputting the fundamental frequency feature vectors corresponding to the phonemes after the pitch adjustment, and forming a second text feature vector;
inputting the second text feature vector into the energy adjustment module, outputting energy feature vectors corresponding to the phonemes after volume adjustment, and forming a third text feature vector;
according to the duration feature vector corresponding to each phoneme after the speech rate adjustment, lengthening the third text feature vector until the vector length is equal to the length of a Mel frequency spectrum, and obtaining a fourth text feature vector;
and processing the fourth text feature vector by using a decoder of the preset speech synthesis model to obtain a target Mel frequency spectrum.
3. The method according to claim 2, wherein said inputting the first text feature vector into the speech rate adjustment module and outputting the duration feature vector corresponding to each phoneme after the speech rate adjustment comprises:
carrying out duration prediction processing on the first text feature vector by using the speech speed adjusting module to obtain duration feature vectors corresponding to the phonemes;
and multiplying the preset duration distribution weight by the duration feature vector corresponding to each phoneme to obtain the adjusted duration feature vector corresponding to each phoneme.
4. The method of claim 2, wherein inputting the first text feature vector into the fundamental frequency adjustment module, outputting the pitch-adjusted fundamental frequency feature vector corresponding to each phoneme, and forming a second text feature vector comprises:
performing fundamental frequency prediction processing on the first text feature vector by using the fundamental frequency adjusting module to obtain a fundamental frequency feature vector corresponding to each phoneme;
multiplying the preset fundamental frequency distribution weight by the fundamental frequency feature vector corresponding to each phoneme to obtain a target fundamental frequency feature vector corresponding to each phoneme;
adding the target fundamental frequency feature vector corresponding to each phoneme with the first text feature vector to obtain a feature vector corresponding to each phoneme after the tone adjustment;
and forming a second text feature vector according to the feature vector corresponding to each phoneme after the tone adjustment.
5. The method of claim 2, wherein inputting the second text feature vector into the energy adjustment module, outputting energy feature vectors corresponding to the phonemes after volume adjustment, and forming a third text feature vector comprises:
performing energy prediction processing on the second text feature vector by using the energy adjusting module to obtain energy feature vectors corresponding to the phonemes;
multiplying the energy feature vectors corresponding to the phonemes by a preset energy distribution weight to obtain target energy feature vectors corresponding to the phonemes;
adding the target energy feature vector corresponding to each phoneme with the second text feature vector to obtain a feature vector corresponding to each phoneme after volume adjustment;
and forming a third text feature vector according to the feature vector corresponding to each phoneme after the volume adjustment.
6. The method according to claim 2, wherein said performing, according to the duration feature vector corresponding to each phoneme after the speech rate adjustment, elongation processing on the third text feature vector until a vector length is equal to a length of a mel-frequency spectrum to obtain a fourth text feature vector comprises:
for the same phoneme, establishing a mapping relation between the duration feature vector corresponding to each phoneme after the speech rate adjustment and each feature vector in the third text feature vector;
calculating the integral ratio of the duration feature vectors of each phoneme after the speech rate adjustment;
determining a target integer ratio between the lengths of the required feature vectors corresponding to the phonemes according to the integer ratio;
and according to the mapping relation and the target integer ratio, for each phoneme, increasing the vector length in a mode of copying the feature vector of the phoneme until the sum of the vector sequence lengths corresponding to the phonemes reaches the length of the Mel frequency spectrum, and obtaining a fourth text feature vector consisting of the phonemes.
7. The method according to claim 2, wherein after the encoder using the preset speech synthesis model processes the phoneme sequence to obtain a first text feature vector corresponding to the phoneme sequence, and before the first text feature vector is input to the speech rate adjustment module and the fundamental frequency adjustment module for processing, the method further comprises:
selecting a target speaker characteristic vector from a preset speaker characteristic vector set;
adding the target speaker feature vector to the first text feature vector.
8. The method according to any one of claims 1-7, further comprising:
collecting sound sample data;
converting the sound sample data into Mel frequency spectrum data;
inputting the Mel frequency spectrum data into a preset speaker classification model, and outputting audio vector data corresponding to different individuals;
averaging the audio vector data corresponding to each individual to obtain speaker characteristic vectors corresponding to different individuals;
and forming a preset speaker feature vector set according to the speaker feature vectors corresponding to the different individuals.
9. A speech synthesis apparatus, characterized in that the apparatus comprises:
the phoneme conversion unit is used for carrying out phoneme conversion processing on the target text to obtain a phoneme sequence corresponding to the target text;
the model processing unit is used for inputting the phoneme sequence into a preset speech synthesis model for processing and predicting and outputting a target Mel frequency spectrum, wherein the preset speech synthesis model comprises a characteristic adjusting module which is used for adjusting the speech speed characteristic, the tone characteristic and the volume characteristic of the phoneme in the speech synthesis process;
and the determining unit is used for determining the audio data corresponding to the target text according to the target Mel frequency spectrum.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the speech synthesis method according to any one of claims 1-8.
CN202210410324.7A 2022-04-19 2022-04-19 Voice synthesis method and device Pending CN114944146A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210410324.7A CN114944146A (en) 2022-04-19 2022-04-19 Voice synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210410324.7A CN114944146A (en) 2022-04-19 2022-04-19 Voice synthesis method and device

Publications (1)

Publication Number Publication Date
CN114944146A true CN114944146A (en) 2022-08-26

Family

ID=82907032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210410324.7A Pending CN114944146A (en) 2022-04-19 2022-04-19 Voice synthesis method and device

Country Status (1)

Country Link
CN (1) CN114944146A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115798455A (en) * 2023-02-07 2023-03-14 深圳元象信息科技有限公司 Speech synthesis method, system, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination