CN117746831A - Emotion controllable speech synthesis method and system based on few samples of specific characters


Info

Publication number
CN117746831A
Authority
CN
China
Prior art keywords
emotion
voice
features
network
specific person
Prior art date
Legal status
Pending
Application number
CN202311682364.8A
Other languages
Chinese (zh)
Inventor
杨捍
马军
郭先会
汪淼
曾宇龙
王海兮
庄祖江
Current Assignee
Shenzhen Wanglian Anrui Network Technology Co ltd
CETC 30 Research Institute
Original Assignee
Shenzhen Wanglian Anrui Network Technology Co ltd
CETC 30 Research Institute
Priority date
Filing date
Publication date
Application filed by Shenzhen Wanglian Anrui Network Technology Co ltd and CETC 30 Research Institute
Priority to CN202311682364.8A
Publication of CN117746831A
Legal status: Pending (current)


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention belongs to the technical field of speech synthesis and discloses an emotion-controllable speech synthesis method and system based on few samples of a specific person. Selected speech audio of the specific person is input into a data automation processing model to form trainable data for that person; for this trainable data, a voiceprint extraction module and an emotion feature extraction module extract the corresponding voiceprint features and emotion features, respectively; the phoneme sequence features fused with the emotion features are embedded into different networks; and the whole end-to-end speech synthesis training and inference process is carried out, with the extracted speaker features embedded into different networks, to obtain emotional speech synthesis with a specified emotion and a specified speaker. The invention adopts an end-to-end speech synthesis pipeline and, based on the data automation processing module, forms an automated training pipeline that enables rapid response.

Description

Emotion controllable speech synthesis method and system based on few samples of specific characters
Technical Field
The invention belongs to the technical field of speech synthesis, and particularly relates to an emotion controllable speech synthesis method and system based on a few samples of specific characters.
Background
The prior art provides speech synthesis techniques and emotion synthesis techniques. However, existing speech synthesis technology depends on large amounts of high-quality data recorded in a recording studio: each specific person must record a large amount of corresponding data, which is time-consuming and labor-intensive. Likewise, existing emotion synthesis technology for a specific person still requires that person to cooperate in recording several preset emotions, which is also time-consuming and labor-intensive; the synthesized emotion types are limited to the recorded data, and no emotion synthesis capability is available when the specific person cannot cooperate in recording.
From the above analysis, the problems and defects of the prior art are as follows: existing speech synthesis technology depends on recording a large amount of high-quality data for each specific person, which requires a great deal of time and effort; in addition, existing speech synthesis technology uses two stages, first obtaining a mel spectrogram from text and then reconstructing the mel spectrogram into a waveform, so the synthesis speed is slow.
Existing emotion synthesis technology likewise depends on recording a large amount of emotion data from the corresponding person, and cannot synthesize emotional speech for a specific person when that person cannot cooperate in recording.
Disclosure of Invention
In order to overcome the problems in the related art, the disclosed embodiments of the present invention provide an emotion-controllable speech synthesis method and system based on few samples of a specific person. The method is particularly applied to speech synthesis with controllable emotion under few-sample data conditions for a specific person.
The technical solution is as follows: an emotion-controllable speech synthesis method based on few samples of a specific person comprises the following steps:
S1, inputting the selected speech audio of the specific person into a data automation processing model to form trainable data for the specific person;
S2, for the trainable data of the specific person, extracting the corresponding voiceprint features and emotion features with a voiceprint extraction module and an emotion feature extraction module, respectively; embedding the emotion features into the phoneme-encoded features, fusing the emotion features with the phoneme sequence features, and obtaining the emotion-fused phoneme sequence features by addition;
S3, embedding the obtained emotion-fused phoneme sequence features into a prior encoder, and embedding the voiceprint features into a random duration prediction module, a posterior encoder, a flow network, and a HiFiGAN vocoder;
S4, training a speaker voiceprint extraction network on data from a large number of speakers based on a pre-trained voiceprint extraction module, so that it can distinguish the timbre features of different persons independently of the spoken content; training the emotion extraction module in advance on a large amount of emotion recognition data so that it has emotion feature extraction capability; finally, using the extracted voiceprint features and emotion features for the whole end-to-end speech synthesis training and, at inference time, synthesizing emotional speech with a specified emotion and a specified speaker according to different reference-audio emotions and different specific-person voices.
In step S1, the processing flow of the data automation processing model includes:
the voice of the specific person is first denoised to remove background sound and noise and resampled at a fixed sampling rate; the speech is then cut into segments according to VAD silence detection; language detection is performed on the cut segments and different language models are selected for speech recognition; trainable data for the specific person is finally obtained.
Further, denoising the specific person's voice to remove background sound and noise and resampling at a fixed sampling rate comprises the following steps: resampling the collected audio of the specific person's speech to single-channel 16000 Hz; and denoising the resampled audio with a deep learning algorithm to remove noise and background interference, so that only clean human-voice audio is obtained.
Further, the cutting of speech segments according to VAD silence detection includes:
detecting silent intervals in the resampled, denoised speech of the specific person by setting a decibel threshold; cutting the speech into segments of random length according to the silent intervals obtained by VAD silence detection, so that a small number of speech segments for the specific person are finally obtained, with each segment no longer than 10 seconds;
the language detection of the cut speech includes the following steps: performing language detection on the resampled, denoised, and cut speech, namely marking a Chinese speech label if the segment is recognized as Chinese, an English speech label if it is recognized as English, and a Chinese-English mixed label if it is recognized as mixed Chinese and English;
the selecting of different language models for speech recognition includes the following steps: selecting the corresponding speech recognition model according to the label of each segment after language detection, namely a Chinese speech recognition model for a Chinese label, an English speech recognition model for an English label, and a Chinese-English mixed speech recognition model for a Chinese-English mixed label; performing speech recognition on the segments to obtain the text corresponding to each segment; and passing the obtained text through the text front end of the corresponding language to obtain the corresponding phoneme sequence.
In step S2, the emotion features have the same dimensions as the features output by the phoneme encoding module; the two matrices of the same dimensions are added point by point at matching positions to obtain the emotion-fused phoneme sequence features; the matrix dimensions are (b, t, h), where b is the number of samples in a training batch, t is the length of the input sequence, and h is the hidden-layer dimension.
In step S4, in the whole end-to-end speech synthesis training process, the network structure adopts a VAE, a flow network, HiFiGAN, a random duration prediction network, and a speaker voiceprint extraction network; in addition, a bidirectional flow-network loss is added to constrain the features before and after the flow-network transformation.
Further, a KL divergence is used to constrain the prior distribution, obtained through the prior encoder and a linear layer, and the posterior distribution, obtained by passing the linear spectrum through the posterior encoder and the flow network; the linear spectrum passes through the posterior encoder to obtain posterior features, and the HiFiGAN vocoder reconstructs these posterior features into a speech waveform; when applying the KL divergence constraint, the bidirectional flow-network loss is used:
L_kl = log q_Φ(z|x_lin) − log p_Φ(z|c_text, A)
where x_lin denotes the linear-scale spectrogram of the speech, z denotes the latent distribution, c_text and A denote the input phoneme sequence and the durations of the corresponding phonemes, q_Φ denotes the posterior network, p_Φ denotes the prior network, and L_kl denotes the difference between the outputs of the prior network and the posterior network.
Further, during training, the durations corresponding to the text phonemes are estimated by monotonic alignment search and used as ground-truth labels for training the random duration predictor; during training, the emotion feature extractor and the speaker feature extractor do not participate in parameter updates, and respectively compute the emotion features and the speaker features from the speaker audio.
Further, the inference process includes: according to the input text, the reference-audio emotion, and the specified speaker, duration predictions for the text phonemes are obtained by combining the speaker features; the prior distribution from the linear layer is expanded according to the duration predictions; the expanded features are passed through the inverse transformation of the flow network to obtain the inverse-transformed features; and finally the HiFiGAN vocoder performs waveform reconstruction from the inverse-transformed features to obtain emotional speech synthesis with the specified emotion and the specified speaker.
Another object of the present invention is to provide an emotion controllable speech synthesis system based on a specific person with few samples, which implements the emotion controllable speech synthesis method based on a specific person with few samples, the system comprising:
the data automation processing model module, configured to input the selected speech audio of the specific person into the data automation processing model to form trainable data for the specific person;
the emotion-fused phoneme sequence feature obtaining module, configured to extract, for the trainable data of the specific person, the corresponding voiceprint features and emotion features with the voiceprint extraction module and the emotion feature extraction module, respectively; to embed the emotion features into the phoneme-encoded features and fuse them with the phoneme sequence features; the emotion features have the same dimensions as the output of the phoneme encoding module, and the emotion-fused phoneme sequence features are obtained by addition;
the different-network embedding module, configured to embed the obtained emotion-fused phoneme sequence features into the prior encoder, and to embed the voiceprint features into the random duration prediction module, the posterior encoder, the flow network, and the HiFiGAN vocoder;
the emotion speech synthesis module, configured to train the speaker voiceprint extraction network on data from a large number of speakers based on the pre-trained voiceprint extraction module, so that the timbre features of different persons can be distinguished independently of the spoken content; to perform the whole end-to-end speech synthesis training and inference process; and to embed the extracted speaker features into different networks to obtain emotional speech synthesis with a specified emotion and a specified speaker.
Combining all the above technical solutions, the advantages and positive effects of the invention are as follows: high-quality emotion-controllable speech synthesis is realized with only a small number of samples of a specific person. Based on a pre-trained voiceprint extraction model, an emotion extraction model, a flow-network-based VAE, a HiFiGAN network, and the like, the invention realizes high-quality emotion-controllable speech synthesis for a specific person with few sample data. An end-to-end speech synthesis pipeline is adopted, and an automated training pipeline enabling rapid response is formed based on the data automation processing module.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure;
FIG. 1 is a flowchart of an emotion controllable speech synthesis method based on a few samples of a specific person according to an embodiment of the present invention;
FIG. 2 is a process flow diagram of a data automation processing model provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a phoneme sequence feature obtained by fusing emotion features according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of training a speaker voiceprint extraction network on a large number of speaker data based on a pre-training voiceprint extraction module provided by an embodiment of the present invention;
FIG. 5 is a flowchart of an overall end-to-end speech synthesis training provided by an embodiment of the present invention;
FIG. 6 is a flow chart of the overall end-to-end speech synthesis reasoning provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of an emotion controllable speech synthesis system based on a few samples of a specific person according to an embodiment of the present invention;
In the figure: 1, data automation processing model module; 2, emotion-fused phoneme sequence feature obtaining module; 3, different-network embedding module; 4, emotion speech synthesis module.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit or scope of the invention, which is therefore not limited to the specific embodiments disclosed below.
The emotion-controllable speech synthesis method provided by the embodiment of the invention targets speech synthesis for a specific person under few-sample conditions, where conventionally a large amount of data must be collected in a professional recording studio and the synthesized speech emotion is limited to the recorded data. Using a pre-trained voiceprint extraction module and emotion extraction module, and building on a VAE with a flow network and the HiFiGAN adversarial neural network, it realizes emotion-controllable speech synthesis for a specific person under few-sample conditions.
Embodiment 1 of the present invention provides an emotion controllable speech synthesis method based on a few samples of a specific person, including:
S1, inputting the selected speech audio of the specific person into the data automation processing model to form trainable data for the specific person.
S2, for the trainable data of the specific person, extracting the corresponding voiceprint features and emotion features with the voiceprint extraction module and the emotion feature extraction module, respectively; embedding the emotion features into the phoneme-encoded features and fusing them with the phoneme sequence features; obtaining the emotion-fused phoneme sequence features by addition.
S3, embedding the obtained emotion-fused phoneme sequence features into the prior encoder, and embedding the voiceprint features into the random duration prediction module, the posterior encoder, the flow network, and the HiFiGAN vocoder.
S4, training the speaker voiceprint extraction network on data from a large number of speakers based on the pre-trained voiceprint extraction module, so that it can distinguish the timbre features of different persons independently of the spoken content; training the emotion extraction module in advance on a large amount of emotion recognition data so that it has emotion feature extraction capability; finally, using the extracted voiceprint features and emotion features for the whole end-to-end speech synthesis training and, at inference time, synthesizing emotional speech with a specified emotion and a specified speaker according to different reference-audio emotions and different specific-person voices.
In step S1 of the embodiment of the present invention, the selected speech audio of the specific person is input into the data automation processing model to form trainable data for the specific person.
The process flow of the data automation processing model is shown in fig. 2, and specifically includes:
First, background sound, noise, and other interference are removed from the speech by denoising, and the voice of the specific person is resampled at a fixed sampling rate: the collected audio of the specific person is resampled to single-channel 16000 Hz audio, and the resampled audio is denoised with a deep learning algorithm to remove noise, background sound, and other interference, so that clean human-voice audio is obtained.
It can be appreciated that the deep learning algorithm mainly adopts an audio denoising algorithm based on a U-Net structure. The model employed uses an encoder-decoder architecture with skip connections, optimized in the time and frequency domains with multiple loss functions.
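As a rough illustration of this preprocessing step, the sketch below resamples to single-channel 16000 Hz with librosa and hands the waveform to a denoiser; `unet_denoiser` is a hypothetical placeholder for the U-Net-based denoising model described above, not an API of any particular library.

```python
# Minimal sketch of the resample-and-denoise step, assuming librosa/soundfile
# are available; `unet_denoiser` is a hypothetical stand-in for the U-Net-based
# denoising model described in the text.
import librosa
import soundfile as sf

def preprocess_audio(in_path: str, out_path: str, unet_denoiser) -> None:
    # Load as mono and resample to the fixed 16000 Hz rate used by the pipeline.
    wav, sr = librosa.load(in_path, sr=16000, mono=True)
    # Remove background sound and noise with the (assumed) U-Net denoiser.
    clean = unet_denoiser(wav, sr)
    sf.write(out_path, clean, sr)
```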
For VAD silence detection, silent intervals in the resampled, denoised speech of the specific person are detected by setting a decibel threshold. The speech is then cut into segments of random length according to the silent intervals obtained by VAD silence detection, finally yielding a small number of speech segments for the specific person, with each segment no longer than 10 seconds.
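A minimal sketch of this step is given below, using librosa's decibel-threshold silence splitting as the VAD; the `top_db` value and the fixed-length fallback for long spans are assumptions, since the text only specifies a decibel threshold and a 10-second cap.

```python
# Decibel-threshold silence detection and segment cutting, assuming the 16 kHz
# mono waveform from the previous step; top_db is an assumed threshold value.
import librosa

def cut_segments(wav, sr=16000, top_db=30, max_len_s=10.0):
    segments = []
    max_len = int(max_len_s * sr)
    # librosa.effects.split returns the non-silent intervals, i.e. the
    # complement of the silent intervals found with the decibel threshold.
    for start, end in librosa.effects.split(wav, top_db=top_db):
        # Keep every clip within 10 seconds, splitting longer spans as needed.
        for s in range(start, end, max_len):
            segments.append(wav[s:min(s + max_len, end)])
    return segments
```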
Language detection is then performed on each resampled, denoised, and cut segment: a Chinese speech label is assigned if the segment is recognized as Chinese, an English speech label if it is recognized as English, and a Chinese-English mixed label if it is recognized as mixed Chinese and English.
Different language models are then selected for speech recognition: the corresponding speech recognition model is chosen according to each segment's label after language detection, namely a Chinese speech recognition model for a Chinese label, an English speech recognition model for an English label, and a Chinese-English mixed speech recognition model for a Chinese-English mixed label. Speech recognition is performed on the segments to obtain the corresponding text, and the obtained text is processed by the text front end of the corresponding language to obtain the corresponding phoneme sequence.
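The routing logic can be sketched as follows; `detect_language`, the per-language ASR models, and `text_to_phonemes` are hypothetical placeholders for the language-detection, speech-recognition, and text front-end components named above.

```python
# Sketch of label-based model selection: language detection -> matching ASR
# model -> language-specific text front end. All callables are placeholders.
def transcribe_segment(wav, sr, detect_language, asr_models, text_to_phonemes):
    lang = detect_language(wav, sr)              # "zh", "en", or "zh-en" label
    text = asr_models[lang].transcribe(wav)      # pick the matching ASR model
    phonemes = text_to_phonemes(text, lang)      # corresponding text front end
    return text, phonemes
```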
Trainable data for the specific person is ultimately obtained.
In step S2 of the embodiment of the present invention, for the trainable data of the specific person, the voiceprint extraction module and the emotion feature extraction module are used to extract the corresponding voiceprint features and emotion features, respectively. The emotion features are embedded into the phoneme-encoded features and fused with the phoneme sequence features. The flow is shown in fig. 3. The emotion features have the same dimensions as the features output by the phoneme encoding module, and the emotion-fused phoneme sequence features are obtained by addition: the two matrices of the same dimensions are added point by point at matching positions to obtain the fused features. The matrix dimensions are (b, t, h), where b is the number of samples in a training batch, t is the length of the input sequence, and h is the hidden-layer dimension.
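A minimal sketch of this fusion step is shown below; treating the emotion embedding as a single utterance-level vector that is expanded along the time axis is an assumption made here for illustration.

```python
# Point-wise fusion of emotion features with the phoneme-encoder output, both
# shaped (b, t, h) as described above.
import torch

b, t, h = 8, 120, 192                           # batch size, sequence length, hidden size
phoneme_feats = torch.randn(b, t, h)            # output of the phoneme encoder
emotion_vec = torch.randn(b, h)                 # utterance-level emotion embedding (assumed)

# Expand the emotion embedding to (b, t, h) so both matrices share the same
# dimensions, then add them point by point at matching positions.
emotion_feats = emotion_vec.unsqueeze(1).expand(b, t, h)
fused = phoneme_feats + emotion_feats
assert fused.shape == (b, t, h)
```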
In step S3 of the embodiment of the present invention, the obtained emotion-fused phoneme sequence features are embedded into the prior encoder, and the voiceprint features are embedded into the random duration prediction module, the posterior encoder, the flow network, and the HiFiGAN vocoder; the corresponding emotion is synthesized according to the reference-audio emotion features, realizing emotion-controllable speech synthesis without emotion data from the specific person.
In step S4 of the embodiment of the present invention, the speaker voiceprint extraction network is trained on data from a large number of speakers using the pre-trained voiceprint extraction module (speaker feature extraction module), so that it can distinguish the timbre features of different persons independently of the spoken content, as shown in fig. 4; the whole end-to-end speech synthesis training and inference process is performed, and the extracted speaker features are embedded into different networks to obtain emotional speech synthesis with a specified emotion and a specified speaker. The voiceprint extraction module is used to reduce the data dependence on the specific person.
It will be appreciated that end-to-end speech synthesis generally refers to generating speech directly from text, without generating intermediate acoustic features such as a mel spectrogram as in the two-stage speech synthesis route. The end-to-end training and inference processes are separate: the network is trained first and then used for inference. End-to-end speech synthesis has the technical advantage of fast synthesis speed.
In the embodiment of the invention, the whole end-to-end speech synthesis training flow is shown in fig. 5. The network structure adopts a VAE, a flow network, HiFiGAN, a random duration prediction network, and a speaker voiceprint extraction network. In addition, a bidirectional flow-network loss is added to constrain the features before and after the flow-network transformation.
A KL divergence is used to constrain the prior distribution, obtained through the prior encoder and a linear layer, and the posterior distribution, obtained by passing the linear spectrum through the posterior encoder and the flow network. The linear spectrum passes through the posterior encoder to obtain posterior features, and the HiFiGAN vocoder reconstructs these posterior features into a speech waveform. When applying the KL divergence constraint, the bidirectional flow-network loss is used:
L_kl = log q_Φ(z|x_lin) − log p_Φ(z|c_text, A)
where x_lin denotes the linear-scale spectrogram of the speech, z denotes the latent distribution, c_text and A denote the input phoneme sequence and the durations of the corresponding phonemes, q_Φ denotes the posterior network, p_Φ denotes the prior network, and L_kl denotes the difference between the outputs of the prior network and the posterior network.
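Under the common assumption that both distributions are diagonal Gaussians (as in VITS-style models), the KL term can be sketched as follows; the tensor names and the masking convention are assumptions, not taken from the patent.

```python
# Sketch of L_kl for diagonal-Gaussian prior and posterior: m_p/logs_p are the
# prior mean and log-std from the prior encoder + linear layer, z_p/logs_q come
# from the posterior encoder pushed through the flow network.
import torch

def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
    # Element-wise log q(z) - log p(z), evaluated at the posterior sample z_p.
    kl = logs_p - logs_q - 0.5
    kl = kl + 0.5 * (z_p - m_p) ** 2 * torch.exp(-2.0 * logs_p)
    # Average over valid (non-padded) positions only.
    return torch.sum(kl * z_mask) / torch.sum(z_mask)
```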
During training, the durations corresponding to the text phonemes are estimated by monotonic alignment search and used as ground-truth labels for training the random duration predictor. During training, the emotion feature extractor and the speaker feature extractor do not participate in parameter updates; they respectively compute the emotion features and the speaker features from the speaker audio.
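The conversion of a monotonic alignment into duration labels can be sketched as below. The hard alignment tensor and the squared-error objective are simplifications assumed here for illustration; the random duration predictor described above is trained with a likelihood-based objective rather than a plain regression loss.

```python
# Turning a hard monotonic alignment into per-phoneme duration labels, and a
# simplified (regression-style) duration loss for illustration.
import torch

def durations_from_alignment(attn: torch.Tensor) -> torch.Tensor:
    # attn has shape (b, t_text, t_frames) with 0/1 entries from monotonic
    # alignment search; summing over frames counts frames per phoneme.
    return attn.sum(dim=-1)                       # (b, t_text)

def duration_loss(log_dur_pred, durations, text_mask):
    # Train in the log domain against the MAS-derived labels (an assumption).
    target = torch.log(durations.clamp(min=1).float())
    return torch.sum((log_dur_pred - target) ** 2 * text_mask) / torch.sum(text_mask)
```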
By way of example, the entire end-to-end speech synthesis training process of fig. 5 includes:
The text is first converted into its corresponding phonemes.
Emotion features are extracted by the emotion module, and speaker features are extracted by the voiceprint model.
The emotion features and the text phoneme features are fused by addition, and the voiceprint features are embedded into the different modules.
Duration labels for the corresponding text are obtained through monotonic alignment search, and the random duration predictor is trained to obtain the duration loss.
Through the bidirectional flow loss, the KL divergence is computed in both directions: between the prior-encoder distribution and the distribution of the posterior-encoder output after the flow network, and between the prior-encoder output after the inverse flow network and the posterior-encoder output, giving the final bidirectional loss.
The features from the posterior encoder are fed to HiFiGAN for waveform training, obtaining the GAN loss.
Once all the loss values are obtained, end-to-end training is started (end-to-end here means that, unlike the usual setup, the acoustic model and the vocoder do not need to be trained as two separate modules); in the scheme of the invention they are trained together, shortening both training time and inference time.
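How the loss terms listed above might be combined into the end-to-end objective is sketched below; the individual weights and the exact set of terms are assumptions for illustration, not values disclosed in the patent.

```python
# Sketch of the combined generator objective: duration loss, bidirectional
# flow-network KL loss, and the GAN-side losses from HiFiGAN training.
def total_generator_loss(loss_dur, loss_kl_fwd, loss_kl_bwd,
                         loss_adv, loss_fm, loss_recon,
                         w_dur=1.0, w_kl=1.0, w_fm=2.0, w_recon=45.0):
    loss_kl_bidir = loss_kl_fwd + loss_kl_bwd     # bidirectional flow-network loss
    return (w_dur * loss_dur
            + w_kl * loss_kl_bidir
            + loss_adv
            + w_fm * loss_fm
            + w_recon * loss_recon)
```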
The inference process is shown in fig. 6; the monotonic alignment search module, the bidirectional flow-network loss, and the posterior encoder module are no longer needed, as these exist only during training. At inference time, according to the input text, the reference-audio emotion, and the specified speaker, duration predictions for the text phonemes are obtained by combining the speaker features; the prior distribution from the linear layer is expanded according to the duration predictions; the expanded features are passed through the inverse transformation of the flow network to obtain the inverse-transformed features; and finally the HiFiGAN vocoder performs waveform reconstruction from the inverse-transformed features to obtain emotional speech synthesis with the specified emotion and the specified speaker.
By way of example, the reasoning process in FIG. 6 includes:
The audio of the designated speaker is input and the speaker features are extracted.
The designated reference emotion audio is input, and the emotion features of that audio are extracted.
Following the flow chart, the speaker features are embedded into the different modules, and the emotion features and the text phoneme features are fused and then sent to the prior encoder.
According to the durations predicted by the random duration module, the prior-encoded features are expanded to the corresponding durations.
The expanded features are sent to the inverse flow network to obtain the features used for waveform reconstruction.
Based on the output features of the inverse flow network, the HiFiGAN vocoder is used to reconstruct the waveform.
The result is synthesized speech for the given text with the specified person and the specified emotion.
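The inference path of fig. 6 can be stitched together as in the sketch below; every module name (`voiceprint_extractor`, `phoneme_encoder`, `expand_by_duration`, `flow`, `hifigan`, and so on) is a hypothetical placeholder for the corresponding component described above.

```python
# End-to-end inference sketch: speaker and emotion embeddings, duration
# prediction, prior expansion, inverse flow transform, waveform reconstruction.
import torch

@torch.no_grad()
def synthesize(phonemes, speaker_wav, emotion_wav, modules):
    spk = modules["voiceprint_extractor"](speaker_wav)            # speaker embedding
    emo = modules["emotion_extractor"](emotion_wav)               # emotion embedding
    h_text = modules["phoneme_encoder"](phonemes) + emo.unsqueeze(1)  # feature fusion
    m_p, logs_p = modules["prior_projection"](h_text)             # prior distribution
    dur = modules["duration_predictor"](h_text, spk)              # predicted durations
    m_p, logs_p = modules["expand_by_duration"](m_p, logs_p, dur) # length regulation
    z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p)         # sample from the prior
    z = modules["flow"](z_p, spk, reverse=True)                   # inverse flow transform
    return modules["hifigan"](z, spk)                             # waveform reconstruction
```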
Based on this embodiment, with pre-trained emotion feature extraction and speaker feature extraction, no separate emotion or voiceprint training is needed for different emotions and different speakers. Based on a base model for the relevant language trained in advance, only a few minutes of audio of the specific person need to be collected, automatically processed, and used for fine-tuning training, finally yielding few-sample emotion-controllable speech synthesis for the specific task. The whole scheme can be fully automated without manual intervention, and an emotional speech synthesis model for the specific person is finally obtained.
According to this embodiment, there is no need for time- and energy-consuming high-quality recording-studio recordings of the specific person's speech data; the problem that some specific persons' speech data is difficult to collect, making training impossible, is solved; and ultimately the technique of the present application enables high-quality speech synthesis for a specific person with very little data. The scheme can also realize emotional speech synthesis with a designated reference emotion, achieving multi-emotion speech synthesis for the specific person. Through the technical route of this scheme, recording large amounts of speech data and emotion data for specific persons can be avoided, and few-sample emotional speech synthesis for any specific person is realized. It has extremely high economic and practical value.
Even with a small model, the technology of this scheme can synthesize speech from only a small amount of speech data; at the same time, by designating a reference emotion, emotional speech of the specific person corresponding to that reference emotion can be synthesized, so the synthesis effect does not depend on a large amount of the specific person's data or a large amount of the specific person's emotion data, realizing few-sample emotion-controllable speech synthesis for any specific person. This fills a gap in the prior art.
Traditional speech synthesis schemes require extensive recording-studio recording, which consumes a great deal of time and energy; the present scheme solves this problem as described above, requiring only a small amount of speech data and a designated reference emotion to synthesize emotional speech of the specific person.
Embodiment 2: as shown in fig. 7, the emotion-controllable speech synthesis system based on few samples of a specific person provided by the embodiment of the present invention includes:
the data automation processing model module 1 is used for inputting the selected specific speaking audio into the data automation processing model to form trainable data of a specific person.
The phoneme sequence feature obtaining module 2 after the emotion feature is fused is used for respectively extracting corresponding voiceprint features and emotion features by utilizing the voiceprint extracting module and the emotion feature extracting module aiming at trainable data of a specific person. Embedding emotion features into the features after phoneme coding, and fusing the emotion features and phoneme sequence features; the emotion characteristics are consistent with the characteristics after the phoneme coding module in dimension, and the phoneme sequence characteristics after the emotion characteristics are fused are obtained through addition.
Different network modules 3 are embedded for embedding the obtained phoneme sequence features fused with emotion features into a priori encoder, and voiceprint features into a random duration prediction module, a posterior encoder and a streaming network, and a HiFiGAN vocoder.
The emotion voice synthesis module 4 is configured to train the speaker voiceprint extraction network on a large number of speaker data based on the pre-trained voiceprint extraction module, so that the speaker voiceprint extraction network can distinguish tone features of different persons irrelevant to speaking content, perform the whole end-to-end voice synthesis training and reasoning process, and embed the extracted speaker features into different networks to obtain emotion voice synthesis with specified emotion and a specified speaker.
Each of the foregoing embodiments emphasizes particular aspects; for parts not described or illustrated in one embodiment, reference may be made to the related descriptions of the other embodiments.
The information interaction and execution processes between the above devices/units are based on the same conception as the method embodiments of the present invention; for their specific functions and technical effects, reference may be made to the method embodiment section, which will not be repeated here.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention. For specific working processes of the units and modules in the system, reference may be made to corresponding processes in the foregoing method embodiments.
Based on the technical solutions described in the embodiments of the present invention, the following application examples may be further proposed.
According to an embodiment of the present application, the present invention also provides a computer apparatus, including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, which when executed by the processor performs the steps of any of the various method embodiments described above.
Embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the respective method embodiments described above.
The embodiment of the invention also provides an information data processing terminal, which is used for providing a user input interface to implement the steps in the method embodiments when being implemented on an electronic device, and the information data processing terminal is not limited to a mobile phone, a computer and a switch.
The embodiment of the invention also provides a server, which is used for realizing the steps in the method embodiments when being executed on the electronic device and providing a user input interface.
Embodiments of the present invention also provide a computer program product which, when run on an electronic device, causes the electronic device to perform the steps of the method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application implements all or part of the flow of the method of the above embodiments, which may be accomplished by a computer program instructing the related hardware; the computer program may be stored in a computer readable storage medium, and when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing device/terminal apparatus, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunications signals, and software distribution media, such as a USB flash drive, removable hard disk, magnetic disk, or optical disk.
To further demonstrate the positive effects of the above embodiments, the present invention was based on the above technical solutions to perform the following experiments.
Existing speech synthesis technology depends on a large amount of high-quality data recorded in a recording studio, and each specific person must record a large amount of corresponding data, which is time-consuming and labor-intensive. Likewise, existing emotion synthesis technology for a specific person still requires that person to cooperate in recording several preset emotions, which is also time-consuming and labor-intensive; the synthesized emotion types are limited to the recorded data, and there is no emotion synthesis capability when the specific person cannot cooperate in recording. In addition, existing speech synthesis technology uses two stages, obtaining a mel spectrogram from text and then reconstructing the mel spectrogram into a waveform, and this process is slow.
In the invention, first, speech is obtained directly from text through an end-to-end training and inference mode, avoiding the two-stage process from text to mel spectrogram and from mel spectrogram to speech; this simplifies training and inference, reduces accumulated error, and synthesizes higher-quality speech waveforms.
Second, through the pre-trained voiceprint extraction model and the voiceprint embedding module, the voice of the specific person can ultimately be synthesized from only one sentence or a few minutes of that person's speech data. This avoids the drawback of requiring extensive recordings from the specific person, and brings clear economic benefit and a wider range of applications. Even when specific persons cannot cooperate in recording, one sentence or a few minutes of their speech data is enough, based on this scheme, to achieve speech synthesis for those persons.
Third, a speech emotion recognition model is pre-trained and the extracted emotion features are fused with the phoneme text features, so that the finally synthesized speech carries the emotion captured by those emotion features.
The invention can ultimately achieve speech synthesis for the specific person with any designated emotion while requiring only one sentence or a few minutes of that person's speech data. This avoids recording large amounts of the specific person's speech data and emotion data, which consumes time and energy, and at the same time broadens the application space of the system: even when the specific person cannot cooperate in recording, speech synthesis for that person can be realized with only a small amount of data based on the scheme of the invention.
While the invention has been described with respect to what is presently considered to be the most practical and preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (10)

1. An emotion controllable speech synthesis method based on few samples of a specific person, characterized by comprising the following steps:
S1, inputting the selected speech audio of the specific person into a data automation processing model to form trainable data for the specific person;
S2, for the trainable data of the specific person, extracting the corresponding voiceprint features and emotion features with a voiceprint extraction module and an emotion feature extraction module, respectively; embedding the emotion features into the phoneme-encoded features, fusing the emotion features with the phoneme sequence features, and obtaining the emotion-fused phoneme sequence features by addition;
S3, embedding the obtained emotion-fused phoneme sequence features into a prior encoder, and embedding the voiceprint features into a random duration prediction module, a posterior encoder, a flow network, and a HiFiGAN vocoder;
S4, training a speaker voiceprint extraction network on data from a large number of speakers based on a pre-trained voiceprint extraction module, so that it can distinguish the timbre features of different persons independently of the spoken content; training the emotion extraction module in advance on a large amount of emotion recognition data so that it has emotion feature extraction capability; finally, using the extracted voiceprint features and emotion features for the whole end-to-end speech synthesis training and, at inference time, synthesizing emotional speech with a specified emotion and a specified speaker according to different reference-audio emotions and different specific-person voices.
2. The emotion-controllable speech synthesis method based on the case of a small sample of a specific person according to claim 1, wherein in step S1, the processing flow of the data automation processing model includes:
the voice of the specific person is first denoised to remove background sound and noise and resampled at a fixed sampling rate; the speech is then cut into segments according to VAD silence detection; language detection is performed on the cut segments and different language models are selected for speech recognition; trainable data for the specific person is finally obtained.
3. The emotion controllable speech synthesis method based on few samples of a specific person according to claim 2, wherein denoising the specific person's voice to remove background sound and noise and resampling at a fixed sampling rate comprises: resampling the collected audio of the specific person's speech to single-channel 16000 Hz; and denoising the resampled audio with a deep learning algorithm to remove noise and background interference, so that only clean human-voice audio is obtained.
4. The method for emotion controllable speech synthesis based on few samples of a specific person according to claim 2, wherein said performing speech segment clipping based on VAD silence detection comprises:
detecting silent intervals in the resampled, denoised speech of the specific person by setting a decibel threshold; cutting the speech into segments of random length according to the silent intervals obtained by VAD silence detection, so that a small number of speech segments for the specific person are finally obtained, with each segment no longer than 10 seconds;
the language detection of the cut speech includes the following steps: performing language detection on the resampled, denoised, and cut speech, namely marking a Chinese speech label if the segment is recognized as Chinese, an English speech label if it is recognized as English, and a Chinese-English mixed label if it is recognized as mixed Chinese and English;
the selecting of different language models for speech recognition includes the following steps: selecting the corresponding speech recognition model according to the label of each segment after language detection, namely a Chinese speech recognition model for a Chinese label, an English speech recognition model for an English label, and a Chinese-English mixed speech recognition model for a Chinese-English mixed label; performing speech recognition on the segments to obtain the text corresponding to each segment; and passing the obtained text through the text front end of the corresponding language to obtain the corresponding phoneme sequence.
5. The emotion controllable speech synthesis method based on few samples of a specific person according to claim 1, wherein in step S2, the emotion features have the same dimensions as the features output by the phoneme encoding module; the two matrices of the same dimensions are added point by point at matching positions to obtain the emotion-fused phoneme sequence features; the matrix dimensions are (b, t, h), where b is the number of samples in a training batch, t is the length of the input sequence, and h is the hidden-layer dimension.
6. The emotion controllable speech synthesis method based on few samples of a specific person according to claim 1, wherein in step S4, in the whole end-to-end speech synthesis training flow, the network structure adopts a VAE, a flow network, HiFiGAN, a random duration prediction network, and a speaker voiceprint extraction network; in addition, a bidirectional flow-network loss is added to constrain the features before and after the flow-network transformation.
7. The emotion controllable speech synthesis method based on few samples of a specific person according to claim 6, wherein a KL divergence is used to constrain the prior distribution, obtained through the prior encoder and a linear layer, and the posterior distribution, obtained by passing the linear spectrum through the posterior encoder and the flow network; the linear spectrum passes through the posterior encoder to obtain posterior features, and the HiFiGAN vocoder reconstructs these posterior features into a speech waveform; when applying the KL divergence constraint, the bidirectional flow-network loss is used:
L_kl = log q_Φ(z|x_lin) − log p_Φ(z|c_text, A)
where x_lin denotes the linear-scale spectrogram of the speech, z denotes the latent distribution, c_text and A denote the input phoneme sequence and the durations of the corresponding phonemes, q_Φ denotes the posterior network, p_Φ denotes the prior network, and L_kl denotes the difference between the outputs of the prior network and the posterior network.
8. The emotion controllable speech synthesis method based on few samples of a specific person according to claim 7, wherein during training, the durations corresponding to the text phonemes are estimated by monotonic alignment search and used as ground-truth labels for training the random duration predictor; during training, the emotion feature extractor and the speaker feature extractor do not participate in parameter updates, and respectively compute the emotion features and the speaker features from the speaker audio.
9. The emotion controllable speech synthesis method based on few samples of a specific person according to claim 1, wherein the inference process includes: according to the input text, the reference-audio emotion, and the specified speaker, duration predictions for the text phonemes are obtained by combining the speaker features; the prior distribution from the linear layer is expanded according to the duration predictions; the expanded features are passed through the inverse transformation of the flow network to obtain the inverse-transformed features; and finally the HiFiGAN vocoder performs waveform reconstruction from the inverse-transformed features to obtain emotional speech synthesis with the specified emotion and the specified speaker.
10. An emotion controllable speech synthesis system based on a small sample of a specific person, characterized in that the system implements the emotion controllable speech synthesis method based on a small sample of a specific person as claimed in any one of claims 1 to 9, and comprises:
the data automation processing model module (1), configured to input the selected speech audio of the specific person into the data automation processing model to form trainable data for the specific person;
the emotion-fused phoneme sequence feature obtaining module (2), configured to extract, for the trainable data of the specific person, the corresponding voiceprint features and emotion features with the voiceprint extraction module and the emotion feature extraction module, respectively; to embed the emotion features into the phoneme-encoded features and fuse them with the phoneme sequence features; the emotion features have the same dimensions as the output of the phoneme encoding module, and the emotion-fused phoneme sequence features are obtained by addition;
the different-network embedding module (3), configured to embed the obtained emotion-fused phoneme sequence features into the prior encoder, and to embed the voiceprint features into the random duration prediction module, the posterior encoder, the flow network, and the HiFiGAN vocoder;
and the emotion speech synthesis module (4), configured to train the speaker voiceprint extraction network on data from a large number of speakers based on the pre-trained voiceprint extraction module, so that the timbre features of different persons can be distinguished independently of the spoken content; to perform the whole end-to-end speech synthesis training and inference process; and to embed the extracted speaker features into different networks to obtain emotional speech synthesis with a specified emotion and a specified speaker.
CN202311682364.8A 2023-12-08 2023-12-08 Emotion controllable speech synthesis method and system based on few samples of specific characters Pending CN117746831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311682364.8A CN117746831A (en) 2023-12-08 2023-12-08 Emotion controllable speech synthesis method and system based on few samples of specific characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311682364.8A CN117746831A (en) 2023-12-08 2023-12-08 Emotion controllable speech synthesis method and system based on few samples of specific characters

Publications (1)

Publication Number Publication Date
CN117746831A true CN117746831A (en) 2024-03-22

Family

ID=90250010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311682364.8A Pending CN117746831A (en) 2023-12-08 2023-12-08 Emotion controllable speech synthesis method and system based on few samples of specific characters

Country Status (1)

Country Link
CN (1) CN117746831A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination