CN111312208A - Speaker-independent neural network vocoder system - Google Patents
- Publication number
- CN111312208A CN111312208A CN202010158293.1A CN202010158293A CN111312208A CN 111312208 A CN111312208 A CN 111312208A CN 202010158293 A CN202010158293 A CN 202010158293A CN 111312208 A CN111312208 A CN 111312208A
- Authority
- CN
- China
- Prior art keywords
- timbre
- feature
- neural network
- acoustic
- waveform
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Electrophonic Musical Instruments (AREA)
Abstract
The invention discloses a speaker-independent neural network vocoder system, which comprises the following steps: S1, a timbre feature extraction module receives an acoustic feature M and performs timbre feature extraction on it to obtain timbre feature information S, where the acoustic feature may be a mel spectrogram, a mel cepstrum, or a linear magnitude spectrum; and S2, a waveform generation module receives the acoustic feature M and the timbre feature S output by the timbre extraction module, and performs waveform generation processing to obtain a speech waveform W. The invention solves the problems that each single-timbre vocoder system can serve only one specific timbre, that service deployment and operation costs are high, that a new vocoder system must be trained from scratch whenever a new timbre is encountered, that training takes a long time, and that training requires a large amount of recorded data of a given timbre.
Description
Technical Field
The invention relates to the technical field of neural networks, and in particular to a speaker-independent neural network vocoder system.
Background
With the rapid development of neural network technology, the quality of speech synthesis has also improved rapidly. Realistic speech synthesis has been applied to news broadcasting, audiobooks, voice assistants, intelligent customer service, virtual characters, voice cloning, and more. As artificial intelligence technology advances and application scenarios multiply, expectations for speech synthesis keep rising: not only must the synthesized speech sound realistic, it should also cover a wide variety of timbres. This poses many challenges for the development and deployment of speech synthesis technology.
The current mainstream speech synthesis stack comprises three subsystems: a front-end system (converting text into phonemes), a back-end system (converting phonemes into acoustic features), and a vocoder system (converting acoustic features into audio). Among these, the vocoder plays a decisive role in the sound quality of the synthesized speech. In recent years, with the success of neural-network vocoders such as WaveNet, SampleRNN, and WaveRNN, existing single-timbre vocoder systems have become able to synthesize speech comparable to real recordings. However, such single-timbre vocoders can each synthesize only one timbre and cannot support high-quality synthesis of multiple timbres with a single system. In application scenarios with high timbre-diversity requirements (e.g., audiobooks, voice cloning), a very large number of vocoder systems is therefore needed. As the number of systems grows, so does the hardware needed for service deployment, greatly increasing operating costs. Moreover, the vocoder for each timbre must be trained to convergence on several hours of recordings of that timbre before it can synthesize speech. The time required varies with the training hardware but is typically 2-7 days. This is a major obstacle to multi-timbre, high-quality speech synthesis, especially in scenarios where sufficient training recordings cannot be obtained.
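The three-subsystem split described above (front end: text to phonemes; back end: phonemes to acoustic features; vocoder: acoustic features to audio) can be sketched as a toy pipeline. Every function body below is an illustrative placeholder standing in for a real subsystem, not an actual synthesis component:

```python
def frontend(text: str) -> list[str]:
    # Text -> phonemes (toy: one "phoneme" per alphabetic character).
    return [c for c in text.lower() if c.isalpha()]

def backend(phonemes: list[str]) -> list[list[float]]:
    # Phonemes -> acoustic feature frames (toy: one 80-dim frame per phoneme).
    return [[float(ord(p))] * 80 for p in phonemes]

def vocoder(features: list[list[float]]) -> list[float]:
    # Acoustic features -> audio samples (toy: 200 samples per 12.5 ms frame,
    # matching an 80 Hz frame rate upsampled to 16000 Hz).
    return [sum(f) / len(f) for f in features for _ in range(200)]

audio = vocoder(backend(frontend("Hi")))  # 2 phonemes -> 400 samples
```

The vocoder stage, the focus of this patent, is where the acoustic features finally become waveform samples.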
In summary, current vocoder systems have the following disadvantages in multi-timbre application scenarios:
1. Each single-timbre vocoder system can serve only one specific timbre, so service deployment and operation costs are high.
2. Each new timbre requires training a new vocoder system from scratch, which takes a long time (typically 2-7 days).
3. Training requires a large amount of recorded data of the given timbre (generally 3 hours or more).
Disclosure of Invention
The invention aims to solve the problems that each single-timbre vocoder system can serve only one specific timbre, that service deployment and operation costs are high, that a new vocoder system must be trained from scratch for each new timbre, that training takes a long time, and that training requires a large amount of recorded data of a given timbre.
To achieve this purpose, the invention adopts the following technical scheme: a speaker-independent neural network vocoder system, comprising the following steps:
S1, a timbre feature extraction module receives an acoustic feature M and performs timbre feature extraction on it to obtain timbre feature information S, where the acoustic feature may be a mel spectrogram, a mel cepstrum, or a linear magnitude spectrum;
S2, a waveform generation module receives the acoustic feature M and the timbre feature S output by the timbre extraction module, and performs waveform generation processing to obtain a speech waveform W.
2. The speaker-independent neural network vocoder system of claim 1, wherein in S1 the acoustic feature may be selected from a mel spectrogram, a mel cepstrum, and a linear magnitude spectrum.
3. The speaker-independent neural network vocoder system of claim 1, wherein in S1 a traditional timbre feature extraction module extracts a traditional timbre feature sp from the input acoustic feature M, the traditional timbre feature being selected from the fundamental frequency (F0), a voiced/unvoiced flag, a magnitude spectrum envelope, linear prediction coefficients, or line spectrum pairs;
a feature mapping network module maps the traditional timbre feature sp output by the traditional timbre feature extraction module into an abstract timbre feature S, and the timbre feature mapping network may be formed by a residual network or a bidirectional recurrent neural network.
4. The speaker-independent neural network vocoder system of claim 1, wherein in S2 the acoustic feature M and the timbre feature S are upsampled to the sampling rate of the audio waveform; for example, if the audio waveform's sampling rate is 16000 Hz and the acoustic feature's frame rate is 80 Hz with a frame duration of 12.5 ms, the acoustic feature and the timbre feature are upsampled from 80 Hz by a factor of 200 to obtain the 16000 Hz-sampled acoustic feature M1 and timbre feature S1;
the upsampled acoustic feature M1 and timbre feature S1 are input to neural network layer 1, which outputs feature M2; the operation is then repeated N times, with the output feature Mi of the previous neural network layer and the timbre feature S1 input to the next neural network layer i, which outputs feature Mi+1. Each neural network layer may be implemented as a CNN or a unidirectional RNN;
a DNN layer converts the output feature MN+1 of neural network layer N into the speech waveform W.
Compared with the prior art, the invention has the following beneficial effects: an independent timbre feature extraction module extracts the timbre features of the target speaker, and these timbre features are continuously injected into each processing network of the waveform generation module. This enhances the robustness of the waveform generation module and enables it to synthesize, at high quality, sounds of different timbres from acoustic features.
The speaker-independent neural network vocoder system can synthesize voices both inside and outside the training data set, with a synthesis quality close to real recordings. Because the system is independent of the target speaker, there is no need to collect large amounts of speech data for a new timbre or to train a new model: a single pre-trained speaker-independent neural network vocoder can be applied to the new timbre. This greatly reduces the time and hardware costs of multi-timbre speech synthesis scenarios.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is an overall schematic view of the invention;
FIG. 2 is a schematic diagram of the processing details of the timbre feature extraction module;
FIG. 3 is a schematic diagram of the processing details of the waveform generation module.
In the figures: timbre feature extraction module 101, waveform generation module 102, traditional timbre feature extraction module 01, feature mapping network module 02, and upsampling processing module 03.
Detailed Description
The following description illustrates the embodiments of the present invention by way of specific examples; other advantages and effects of the invention will be readily apparent to those skilled in the art from the disclosure herein.
Please refer to FIGS. 1 to 3. It should be understood that the structures, ratios, and sizes shown in the drawings are provided only to accompany the disclosure of the specification so that it can be understood and read by those skilled in the art; they are not intended to limit the conditions under which the invention can be implemented and thus carry no limiting technical significance. Any structural modification, change of proportion, or adjustment of size that does not affect the efficacy or purpose of the invention shall still fall within the scope of the invention. In addition, terms such as "upper", "lower", "left", "right", "middle", and "one" are used in this specification for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantive changes to the technical content, shall also be regarded as within the implementable scope of the invention.
The invention provides the following technical scheme: a speaker-independent neural network vocoder system comprising a timbre feature extraction module 101 and a waveform generation module 102, with a processing procedure of the following two steps:
the timbre feature extraction module 101 receives the acoustic feature M, performs timbre feature extraction on it, and outputs the timbre feature information S; in this example the acoustic feature is a mel spectrogram, but it is not limited to the mel spectrogram;
the waveform generation module 102 receives the acoustic feature M and the timbre feature S output by the timbre feature extraction module 101, performs waveform generation processing, and outputs the speech waveform W.
In step 1, the processing details of the timbre feature extraction module 101 are shown in FIG. 2:
the traditional timbre feature extraction module 01 receives the acoustic feature M and outputs the traditional timbre feature sp; in this example the fundamental frequency F0 and the magnitude spectrum envelope are used as the traditional timbre features, but the choice is not limited to these two features;
the feature mapping network module 02 receives the traditional timbre feature sp output by the traditional timbre feature extraction module 01 and maps it into the abstract timbre feature S; the feature mapping network in this example is implemented as a 5-layer residual network, but is not limited to this implementation.
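A minimal sketch of the kind of residual mapping network described in this embodiment (a stack of residual blocks mapping traditional timbre features sp to abstract timbre features S) could look like the following; the dimensions, random weights, and ReLU choice are assumptions, not the patent's trained network:

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

class ResidualMappingNetwork:
    """Map traditional timbre features sp (frames x dim) to abstract
    timbre features S through a stack of residual blocks (5 here)."""
    def __init__(self, dim: int, num_blocks: int = 5, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.weights = [rng.standard_normal((dim, dim)) * 0.01
                        for _ in range(num_blocks)]
    def __call__(self, sp: np.ndarray) -> np.ndarray:
        h = sp
        for w in self.weights:
            h = h + relu(h @ w)  # residual connection: x + F(x)
        return h

# sp: 10 frames of 2 traditional features (e.g. F0 plus one envelope value)
s = ResidualMappingNetwork(dim=2)(np.ones((10, 2)))
```

The residual connections let the network refine sp incrementally while preserving its shape, which is convenient since S must stay frame-aligned with M.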
In step 2, the processing details of the waveform generation module 102 are shown in FIG. 3:
the upsampling processing module 03 receives the acoustic feature M and the abstract timbre feature S output by the feature mapping network module 02, and raises the sampling rate of both features by a factor of 200 to the sampling rate of the audio waveform; in this example the speech audio's sampling rate is 16000 Hz and the acoustic and timbre features' frame rate is 80 Hz with a frame duration of 12.5 ms, but the scheme is not limited to these parameters;
neural network layer 1 receives the upsampled acoustic feature M1 and timbre feature S1 and outputs the feature M2; the operation is then repeated N times: the output feature Mi+1 of neural network layer i and the timbre feature S1 output by the upsampling processing module 03 are input to the next neural network layer i+1, which outputs the feature Mi+2. In this embodiment each neural network layer is implemented as a CNN and the number of repetitions N is 10, but the scheme is not limited to these parameters;
the DNN network layer 07 receives the output feature MN+1 of neural network layer N (06) and processes it into the output speech waveform W.
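Under the stated example parameters (16000 Hz audio, 80 Hz frames of 12.5 ms, upsampling factor 200, N = 10 layers), the waveform generation module's data flow might be sketched as follows. Plain matrix layers with random weights stand in for the CNN layers and the final DNN layer; every name and dimension here is an assumption rather than the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample(features: np.ndarray, factor: int = 200) -> np.ndarray:
    """Raise 80 Hz frame-rate features (12.5 ms frames) to the 16000 Hz
    audio rate by sample-and-hold repetition of each frame."""
    return np.repeat(features, factor, axis=0)

class WaveformGenerator:
    """N conditioned layers, each fed the previous layer's output plus S1,
    then a final DNN layer that emits one waveform sample per row."""
    def __init__(self, dim: int = 80, n_layers: int = 10):
        self.layers = [(rng.standard_normal((dim, dim)) * 0.05,
                        rng.standard_normal((dim, dim)) * 0.05)
                       for _ in range(n_layers)]
        self.out = rng.standard_normal((dim, 1)) * 0.05
    def __call__(self, m: np.ndarray, s: np.ndarray) -> np.ndarray:
        m1, s1 = upsample(m), upsample(s)   # (frames * 200, dim) each
        h = m1
        for wm, ws in self.layers:          # timbre S1 re-injected per layer
            h = np.tanh(h @ wm + s1 @ ws)
        return (h @ self.out).ravel()       # speech waveform W

gen = WaveformGenerator()
w = gen(np.random.rand(8, 80), np.random.rand(8, 80))  # 8 frames -> 1600 samples
```

Re-injecting S1 into every layer, rather than only at the input, is the conditioning pattern the description credits for the module's robustness across timbres.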
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.
Claims (4)
1. A speaker-independent neural network vocoder system, comprising the steps of:
S1, a timbre feature extraction module receives an acoustic feature M and performs timbre feature extraction on it to obtain timbre feature information S;
S2, a waveform generation module receives the acoustic feature M and the timbre feature S output by the timbre extraction module, and performs waveform generation processing to obtain a speech waveform W.
2. The speaker-independent neural network vocoder system of claim 1, wherein in S1 the acoustic feature may be selected from a mel spectrogram, a mel cepstrum, and a linear magnitude spectrum.
3. The speaker-independent neural network vocoder system of claim 1, wherein in S1 the timbre feature extraction module comprises a traditional timbre feature extraction module, which extracts a traditional timbre feature sp from the input acoustic feature M, the traditional timbre feature being selected from the fundamental frequency (F0), a voiced/unvoiced flag, a magnitude spectrum envelope, linear prediction coefficients, or line spectrum pairs;
a feature mapping network module maps the traditional timbre feature sp output by the traditional timbre feature extraction module into an abstract timbre feature S, and the timbre feature mapping network may be formed by a residual network or a bidirectional recurrent neural network.
4. The speaker-independent neural network vocoder system of claim 1, wherein in S2 the acoustic feature M and the timbre feature S are upsampled to the sampling rate of the audio waveform; for example, if the audio waveform's sampling rate is 16000 Hz and the acoustic feature's frame rate is 80 Hz with a frame duration of 12.5 ms, the acoustic feature and the timbre feature are upsampled from 80 Hz by a factor of 200 to obtain the 16000 Hz-sampled acoustic feature M1 and timbre feature S1;
the upsampled acoustic feature M1 and timbre feature S1 are input to neural network layer 1, which outputs feature M2; the operation is then repeated N times, with the output feature Mi of the previous neural network layer and the timbre feature S1 input to the next neural network layer i, which outputs feature Mi+1;
each neural network layer may be implemented as a CNN or a unidirectional RNN;
a DNN layer converts the output feature MN+1 of neural network layer N into the speech waveform W.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010158293.1A CN111312208A (en) | 2020-03-09 | 2020-03-09 | Speaker-independent neural network vocoder system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111312208A true CN111312208A (en) | 2020-06-19 |
Family
ID=71147968
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010158293.1A Pending CN111312208A (en) | 2020-03-09 | 2020-03-09 | Speaker-independent neural network vocoder system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111312208A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105788608A (en) * | 2016-03-03 | 2016-07-20 | 渤海大学 | Chinese initial consonant and compound vowel visualization method based on neural network |
CN107610707A (en) * | 2016-12-15 | 2018-01-19 | 平安科技(深圳)有限公司 | A kind of method for recognizing sound-groove and device |
US20180130474A1 (en) * | 2015-06-19 | 2018-05-10 | Google Llc | Speech recognition with acoustic models |
CN108615525A (en) * | 2016-12-09 | 2018-10-02 | 中国移动通信有限公司研究院 | A kind of audio recognition method and device |
CN110033755A (en) * | 2019-04-23 | 2019-07-19 | 平安科技(深圳)有限公司 | Phoneme synthesizing method, device, computer equipment and storage medium |
JP6578544B1 (en) * | 2019-06-14 | 2019-09-25 | 株式会社テクノスピーチ | Audio processing apparatus and audio processing method |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111883106A (en) * | 2020-07-27 | 2020-11-03 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method and device |
CN111883106B (en) * | 2020-07-27 | 2024-04-19 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method and device |
CN112133278A (en) * | 2020-11-20 | 2020-12-25 | 成都启英泰伦科技有限公司 | Network training and personalized speech synthesis method for personalized speech synthesis model |
CN112133278B (en) * | 2020-11-20 | 2021-02-05 | 成都启英泰伦科技有限公司 | Network training and personalized speech synthesis method for personalized speech synthesis model |
CN112365877A (en) * | 2020-11-27 | 2021-02-12 | 北京百度网讯科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
CN113724683A (en) * | 2021-07-23 | 2021-11-30 | 阿里巴巴达摩院(杭州)科技有限公司 | Audio generation method, computer device, and computer-readable storage medium |
CN113724683B (en) * | 2021-07-23 | 2024-03-22 | 阿里巴巴达摩院(杭州)科技有限公司 | Audio generation method, computer device and computer readable storage medium |
WO2023083252A1 (en) * | 2021-11-11 | 2023-05-19 | 北京字跳网络技术有限公司 | Timbre selection method and apparatus, electronic device, readable storage medium, and program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20200619 |