CN111402923B - Emotion voice conversion method based on wavenet - Google Patents
- Publication number: CN111402923B (application CN202010229173.6A)
- Authority: CN (China)
- Prior art keywords: voice, emotion, mel spectrum, files, group
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/30 (under G10L25/00, G10L25/27): Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique, using neural networks
- G10L15/063 (under G10L15/00, G10L15/06): Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L25/24 (under G10L25/00, G10L25/03): Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/63 (under G10L25/00, G10L25/48, G10L25/51): Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
(All classifications fall under G: Physics; G10: Musical instruments, acoustics; G10L: Speech analysis or synthesis, speech recognition, speech or voice processing, speech or audio coding or decoding.)
Abstract
The invention discloses a WaveNet-based emotion voice conversion method comprising the steps of: obtaining voice files to form a corpus; dividing the voice data into neutral voice files and emotional voice files, and placing voices with the same content in the same group; extracting the pitch frequency acoustic feature of each voice file; preprocessing the voice files to obtain the mel-spectrum features of each group; performing dynamic-programming feature-point alignment on the mel-spectrum features of each group; constructing an emotion mel-spectrum conversion model; constructing a WaveNet speech synthesis model; and using the emotion mel-spectrum conversion model as the forward network and the WaveNet speech synthesis model as the backward network to output the final emotional voice file. The method offers high reliability, good accuracy, and high efficiency.
Description
Technical Field
The invention belongs to the field of voice data processing, and in particular relates to a WaveNet-based emotion voice conversion method.
Background
With the development of economic and artificial-intelligence technology, people's daily lives have become increasingly rich. Giving a machine the ability to perceive and express emotion as humans do is key to achieving harmonious human-computer interaction. Speech processing technology has improved remarkably in recent years, but current computers possess only logical reasoning ability; endowing them with emotional expression would enable harmonious human-computer interaction and dispense with indirect communication tools such as the keyboard and mouse. Future communication with robots would then no longer be limited to neutral speech: people could converse with computers by voice, with emotion. In the field of video art, converting the emotion of a person's voice can also greatly raise the quality of work such as dubbing. Research on voice emotion conversion therefore has great significance, whether the target is a robot or a human.
Existing emotion voice conversion generally adopts one of the following methods:
1. Purely manual conversion: a professional dubbing actor imitates the neutral voice and reproduces it with emotion. The accuracy depends on the actor's skill, and a great deal of time is required, so efficiency is low.
2. Parallel training with a machine-learning regression model: the conversion is learned by training in parallel on the acoustic features of neutral and emotional voices. Although this method is accurate, training is extremely slow; it places high demands on the training corpus and needs a large amount of training data. In addition, such methods generally adopt high-dimensional acoustic features to retain the acoustic information of the original voice, and the higher the dimensionality, the longer the training time.
3. Non-parallel training with a machine-learning regression model: unlike the second method, neutral and emotional voices with different spoken content are fed to the conversion model as training samples. Training is fast, but because the range of target parameters during training is too large, the output accuracy is low and the quality of the output emotional voice is poor.
Disclosure of Invention
The invention aims to provide a WaveNet-based emotion voice conversion method with high reliability, good accuracy, and high efficiency.
The WaveNet-based emotion voice conversion method provided by the invention comprises the following steps:
S1, acquire voice files to form a corpus;
S2, divide the voice data in the corpus obtained in step S1 into neutral voice files and emotional voice files, and place voices with the same spoken content in the same group;
S3, extract the pitch frequency acoustic feature from the voice files grouped in step S2;
S4, preprocess the voice files grouped in step S2 to obtain the mel-spectrum features of each group;
S5, perform dynamic-programming feature-point alignment on the mel-spectrum features of each group obtained in step S4, so that the point pairs on the shortest path of each group are mapped to each other and used as training pairs;
S6, construct the emotion mel-spectrum conversion model;
S7, construct the WaveNet speech synthesis model;
S8, use the emotion mel-spectrum conversion model obtained in step S6 as the forward network and the WaveNet speech synthesis model constructed in step S7 as the backward network to output the final emotional voice file.
In step S2, the voice data in the corpus obtained in step S1 are divided into neutral voice files and emotional voice files, and voices with the same content are placed in the same group; specifically, the grouping proceeds as follows:
A. extract a number of emotionally colored voice files with the same content as the training set;
B. obtain the text of each sentence with an ASR tool;
C. using the text obtained in step B, place the voice files that share the same text but differ in emotion, together with the corresponding neutral voice file, into one group;
D. arrange the groups of files as rows to form a training matrix, one group per row.
Step S3, extracting the pitch frequency acoustic feature from the voice files grouped in step S2, specifically comprises the following steps:
a. take the training matrix obtained in step S2 row by row;
b. input the data of the training matrix, row by row, into a vocoder;
c. group the pitch frequencies output by the vocoder in step b so that the pitch frequencies of the voice files with the same text but different emotions and the pitch frequency of the corresponding neutral voice file form one group, thereby obtaining the pitch frequency feature matrix.
In step S4, the voice files grouped in step S2 are preprocessed to obtain the mel-spectrum features of each group, specifically as follows:
(1) take the training matrix obtained in step S2 row by row;
(2) sample the voice file corresponding to each row of the training matrix at a set sampling frequency, and compress it by μ-law companding;
(3) frame the compressed voice file obtained in step (2);
(4) window the framed voice file obtained in step (3);
(5) perform spectrum analysis on the windowed voice file obtained in step (4) to obtain the corresponding spectrum data;
(6) apply mel filtering to the spectrum data obtained in step (5);
(7) save the mel-filtered spectrum data from step (6), thereby obtaining the mel-spectrum features of each group of voice files.
In step S5, dynamic-programming feature-point alignment is performed on the mel-spectrum features of each group of voice files obtained in step S4, so that the point pairs on the shortest path of each group are mapped to each other and used as training pairs; specifically:
1) let X be the neutral-voice mel-spectrum sequence and Y the emotional-voice mel-spectrum sequence;
2) build the Euclidean distance matrix of the two sequences;
3) find the shortest path from the top-left element of the matrix to the bottom-right element;
4) take the coordinates on the path found in step 3) and mark them as parallel corresponding points.
In step S6, the emotion mel-spectrum conversion model is constructed by taking the parallel corresponding points obtained in step S5 as input data and training a CNN model on them, thereby obtaining the final emotion mel-spectrum conversion model.
The WaveNet speech synthesis model of step S7 is constructed as follows:
I. adopt the following causal prediction formula:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})
where x_i is the value of the pitch frequency feature at time point i, T is the current time point, and p(x) is the prediction probability at the current time point;
II. adopt a dilated causal convolution model:
for a speech signal, the current predicted value is strongly related to the outputs at previous time points, because semantics links consecutive words closely. Causal convolution exploits this property when predicting the current node: the output at the current time point is built on the input of the current node and the outputs of the previous time nodes. In conventional causal convolution, the receptive field grows only linearly with the number of network layers: each connecting layer passes the output of one node to the next as input, adding one node to the receptive field per layer, until the predicted value emerges at the top layer. Dilated convolution builds on conventional causal convolution but does not predict from consecutive nodes; it skips nodes with a per-layer dilation, so that the number of nodes in the receptive field equals 2 raised to the power of the number of layers. Because the receptive field grows exponentially with depth, the output at the current time node is related to the outputs of far more previous time nodes, which yields a better prediction effect.
III. residual skip connection:
connect the outputs of every few layers, and apply residual compensation to the input through a 1x1 convolution kernel; finally, decompose the per-channel 1x1 convolution kernels into several 3x3 convolution kernels;
IV. adopt the following formula as the condition-input model:
z = tanh(W_f * x + V_f^T h) ⊙ σ(W_g * x + V_g^T h)
where * denotes convolution, ⊙ element-wise multiplication, σ the sigmoid function, h the final emotion mel-spectrum feature obtained in step S6, and x the pitch frequency feature matrix obtained in step S3.
In step S8, the emotion mel-spectrum conversion model obtained in step S6 is used as the forward network and the WaveNet speech synthesis model constructed in step S7 as the backward network to output the final emotional voice file; specifically:
i. input the neutral voice;
ii. use the pitch frequency feature matrix obtained in step S3 as an input feature of the subsequent WaveNet model;
iii. use the mel-spectrum features obtained in step S4 as input features of the subsequent emotion mel-spectrum conversion model;
iv. convert the neutral mel-spectrum data into emotional mel-spectrum data with the final emotion mel-spectrum conversion model obtained in step S6;
v. input the pitch frequency feature matrix obtained in step S3 as the basic input and the emotional mel-spectrum data obtained in step iv as the conditional input into the WaveNet speech synthesis model obtained in step S7, thereby obtaining the final emotional voice file.
In the WaveNet-based emotion voice conversion method provided by the invention, a dynamic-programming algorithm matches and aligns the neutral and emotional mel spectra, which improves the parallel correspondence of the mel-spectrum features and lets low-dimensional features with less information yield high-precision predictions. In the feature maps of the convolutional neural network, the mappings of channel correlation and spatial correlation can be completely decoupled, so every 1x1 convolution kernel can be connected while the corresponding 3x3 convolution kernels realize the fully decoupled computation of channel correlation. Combined with WaveNet's residual skip-connection mechanism, the convolution scheme is optimized and the residual compensation mechanism retained, which improves prediction accuracy. With the mel spectrum and pitch frequency as the feature inputs of the synthesized voice and with the WaveNet model, the dilated convolutions that enlarge the receptive field and the causal prediction mechanism greatly improve the prediction accuracy of the low-dimensional features, and experiments show that the optimized WaveNet reaches training convergence faster. The method therefore offers high reliability, good accuracy, and high efficiency.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of quantization rules in the method of the present invention.
FIG. 3 is a schematic diagram of a rule for determining the shortest distance in the method of the present invention.
FIG. 4 is a schematic diagram of the dilated causal convolution model in the method of the present invention.
Fig. 5 is a schematic diagram of an improved residual skip connection in the method of the present invention.
FIG. 6 is a graph showing the comparison of the prediction accuracy of the method of the present invention and the prior art method.
Detailed Description
The flow of the method of the present invention is shown schematically in FIG. 1. The WaveNet-based emotion voice conversion method provided by the invention comprises the following steps:
S1, acquire voice files to form a corpus;
S2, divide the voice data in the corpus obtained in step S1 into neutral voice files and emotional voice files, and place voices with the same content in the same group; specifically:
A. extract a number of emotionally colored voice files with the same content as the training set;
B. obtain the text of each sentence with an ASR tool;
C. using the text obtained in step B, place the voice files that share the same text but differ in emotion, together with the corresponding neutral voice file, into one group;
D. arrange the groups of files as rows to form a training matrix, one group per row;
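The grouping in steps A to D can be sketched as follows; the tuple layout and file names are illustrative assumptions, and in the actual method the transcripts would come from the ASR tool of step B:

```python
from collections import defaultdict

def build_training_matrix(files):
    """Group utterances that share the same transcript into one row:
    each row holds the neutral file plus its emotional variants."""
    groups = defaultdict(dict)
    for text, emotion, path in files:
        groups[text][emotion] = path
    return [groups[text] for text in sorted(groups)]  # one group per row

rows = build_training_matrix([
    ("hello world", "neutral", "n_001.wav"),   # hypothetical file names
    ("hello world", "happy",   "h_001.wav"),
    ("good morning", "neutral", "n_002.wav"),
    ("good morning", "sad",     "s_002.wav"),
])
```

Each row of `rows` then corresponds to one row of the training matrix described above.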
S3, extract the pitch frequency acoustic feature from the voice files grouped in step S2; specifically:
a. take the training matrix obtained in step S2 row by row;
b. input the data of the training matrix, row by row, into a vocoder;
c. group the pitch frequencies output by the vocoder in step b so that the pitch frequencies of the voice files with the same text but different emotions and the pitch frequency of the corresponding neutral voice file form one group, thereby obtaining the pitch frequency feature matrix;
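As a non-limiting illustration of the pitch-frequency extraction in steps a to c, the sketch below estimates F0 for one synthetic voiced frame by autocorrelation; this is a stand-in for the vocoder analysis the method actually uses, and all numbers are illustrative:

```python
import numpy as np

def estimate_f0(frame, sr, f_min=60.0, f_max=400.0):
    """Autocorrelation pitch estimate for a single voiced frame:
    the lag of the autocorrelation peak inside the plausible pitch
    range gives the period, whose inverse is F0."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / f_max), int(sr / f_min)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 8000
t = np.arange(int(0.032 * sr)) / sr        # one 32 ms frame
frame = np.sin(2 * np.pi * 200.0 * t)      # synthetic 200 Hz tone
f0 = estimate_f0(frame, sr)
```

For the 200 Hz test tone the estimate lands on the true pitch; a production vocoder would add voicing decisions and frame-level smoothing.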
S4, preprocess the voice files grouped in step S2 to obtain the mel-spectrum features of each group; specifically:
(1) take the training matrix obtained in step S2 row by row;
(2) sample the voice file corresponding to each row of the training matrix at a set sampling frequency, and compress it by μ-law piecewise-linear companding;
In a specific implementation, the sampling frequency is set to 44 kHz and μ to 255. The quantization rule (shown in FIG. 2) is as follows: the range of the sampled data is uniformly divided into 255 interval values x, and the mapped value y is obtained from the interval in which a sample falls and the curve in the figure; y is the quantized sample value. The larger a sample value is, the further along the curve its interval lies, the smaller the corresponding slope, and the closer together the mapped y values are. Small signal values are thus mapped finely and spread apart, while large signal values are mapped coarsely and pushed together, which suits the fact that speech analysis must concentrate on small signal values;
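A minimal sketch of the μ-law companding and quantization described above, with μ = 255 as in the description; the NumPy implementation below uses the continuous μ-law curve rather than a piecewise-linear segment approximation:

```python
import numpy as np

MU = 255  # quantization parameter used in the description

def mu_law_encode(x, mu=MU):
    """Compress samples in [-1, 1] and quantize to mu + 1 integer levels.
    Small amplitudes get finer resolution, matching the curve in FIG. 2."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(q, mu=MU):
    """Invert the companding: integer level back to a sample in [-1, 1]."""
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.array([-1.0, -0.01, 0.0, 0.01, 1.0])
q = mu_law_encode(x)
```

A round trip through encode and decode recovers small amplitudes with much finer error than uniform 8-bit quantization would.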
(3) frame the compressed voice file obtained in step (2), with a frame length of 32 ms and a frame shift of 20 ms;
(4) window the framed voice file obtained in step (3), with the window length set to 16;
(5) perform spectrum analysis on the windowed voice file obtained in step (4) to obtain the corresponding spectrum data; an FFT may be used, with FFT length N = 256;
(6) apply mel filtering to the spectrum data obtained in step (5);
in a specific setting, the sampling frequency is f_s = 8000 Hz and the lowest frequency of the filter range is f_l = 0; by the Nyquist sampling theorem, the highest frequency of the filter range is f_h = f_s/2 = 4000 Hz. The number of filters is set to M = 24. The value of each mel filter can be calculated from the transfer function of the corresponding band-pass filter, and together the filters form a mel filter bank;
(7) save the mel-filtered spectrum data from step (6), thereby obtaining the mel-spectrum features of each group of voice files;
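The framing, windowing, FFT, and mel-filtering pipeline of steps (3) to (7) can be sketched as below, using the stated parameters (f_s = 8000 Hz, 32 ms frames, 20 ms shift, N = 256, M = 24 filters); the Hann window and triangular filter shapes are common choices assumed here, not mandated by the description:

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=24, n_fft=256, sr=8000, f_lo=0.0, f_hi=4000.0):
    """Triangular band-pass filters spaced evenly on the mel scale
    (M = 24, f_h = f_s/2 per the description)."""
    pts = mel_to_hz(np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(x, sr=8000, frame_len=256, hop=160, n_mels=24):
    """Frame (32 ms), Hann-window, FFT (N = 256), then mel-filter."""
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    fb = mel_filterbank(n_mels, frame_len, sr)
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=frame_len, axis=1)) ** 2
    return power @ fb.T          # shape (n_frames, n_mels)

x = np.sin(2 * np.pi * 1000 * np.arange(8000) / 8000)  # 1 s of a 1 kHz tone
mel = mel_spectrogram(x)
```

For the 1 kHz test tone the energy concentrates in the mel band covering 1 kHz, as expected.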
S5, perform dynamic-programming feature-point alignment on the mel-spectrum features of each group of voice files obtained in step S4, so that the point pairs on the shortest path of each group are mapped to each other and used as training pairs; specifically:
1) let X be the neutral-voice mel-spectrum sequence and Y the emotional-voice mel-spectrum sequence;
2) build the Euclidean distance matrix of the two sequences (shown in FIG. 3(a));
3) find the shortest path from the top-left element of the matrix to the bottom-right element (as shown in FIG. 3(b));
4) take the coordinates on the path found in step 3) and mark them as parallel corresponding points; as shown in FIG. 3, the X-Y corresponding pairs (0,0), (1,1), (2,2), (3,3), (4,4), (5,5), (6,6) and (6,7) are obtained in sequence, and these coordinates are the parallel corresponding points fed in during training;
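The dynamic-programming alignment of steps 1) to 4) can be sketched as follows; the two toy sequences are illustrative, not the mel spectra of FIG. 3, so the resulting path differs from the pairs listed above:

```python
import numpy as np

def dtw_align(X, Y):
    """Dynamic-programming alignment of two mel-spectrum sequences.
    Returns the warping path: index pairs (i, j) on the shortest path
    from the top-left to the bottom-right of the distance matrix."""
    n, m = len(X), len(Y)
    d = np.array([[np.linalg.norm(x - y) for y in Y] for x in X])
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j - 1],
                                            D[i - 1, j], D[i, j - 1])
    path, i, j = [], n, m        # backtrack from (n, m) to (1, 1)
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda p: D[p])
    path.append((0, 0))
    return path[::-1]

X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([[0.0], [1.0], [2.0], [2.0], [3.0]])
pairs = dtw_align(X, Y)
```

Each returned pair maps one frame of the neutral sequence to one frame of the emotional sequence, exactly the parallel corresponding points used as training pairs.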
S6, construct the emotion mel-spectrum conversion model: the parallel corresponding points obtained in step S5 are used as input data and fed into a CNN model for training, yielding the final emotion mel-spectrum conversion model; in a specific implementation, the mel-spectrum dimension is set to 256 and 4 convolution modules are used;
S7, construct the WaveNet speech synthesis model; specifically:
I. adopt the following causal prediction formula:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})
where x_i is the value of the pitch frequency feature at time point i, T is the current time point, and p(x) is the prediction probability at the current time point;
II. adopt a dilated causal convolution model (shown in FIG. 4), in which the lowest layer is the input layer and the uppermost layer is the output layer, and the receptive field over the input layer is enlarged exponentially with depth.
For a speech signal, the current predicted value is strongly related to the outputs at previous time points, because semantics links consecutive words closely. Causal convolution exploits this property when predicting the current node: the output at the current time point is built on the input of the current node and the outputs of the previous time nodes. In conventional causal convolution, the receptive field grows only linearly with the number of network layers: each connecting layer passes the output of one node to the next as input, adding one node to the receptive field per layer, until the predicted value emerges at the top layer. Dilated convolution builds on conventional causal convolution but does not predict from consecutive nodes; it skips nodes with a per-layer dilation, so that the number of nodes in the receptive field equals 2 raised to the power of the number of layers. Because the receptive field grows exponentially with depth, the output at the current time node is related to the outputs of far more previous time nodes, which yields a better prediction effect;
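The exponential growth of the receptive field described above can be checked with a short sketch; kernel size 2 and per-layer dilation doubling are assumed, matching the usual WaveNet configuration:

```python
def receptive_field(num_layers, kernel_size=2):
    """Receptive field (in input samples) of a stack of dilated causal
    convolutions whose dilation doubles per layer: 1, 2, 4, ..."""
    rf = 1
    for layer in range(num_layers):
        rf += (kernel_size - 1) * 2 ** layer
    return rf

dilated = receptive_field(4)     # 4 layers, dilations 1, 2, 4, 8
undilated = 1 + 4                # a plain causal stack adds one sample per layer
```

With 4 layers the dilated stack already sees 2**4 = 16 input samples, versus 5 for the undilated stack, and 10 layers reach 1024 samples.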
III. adopt residual skip connections:
connect the outputs of every few layers, and apply residual compensation to the input through a 1x1 convolution kernel; finally, decompose the per-channel 1x1 convolution kernels into several 3x3 convolution kernels, as shown in FIG. 5;
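A minimal NumPy sketch of the residual skip connection described above, with random stand-in weights rather than trained parameters; on a single time axis a 1x1 convolution reduces to a per-timestep matrix multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """A 1x1 convolution is a per-timestep linear map over channels."""
    return x @ w                      # (T, C_in) @ (C_in, C_out)

def residual_block(x, w_filter, w_res, w_skip):
    """One residual block: a nonlinearity, then two 1x1 convolutions;
    the residual branch adds the block input back (residual
    compensation), the skip branch is summed across blocks later."""
    z = np.tanh(conv1x1(x, w_filter))
    residual = x + conv1x1(z, w_res)  # same shape as x, feeds next block
    skip = conv1x1(z, w_skip)         # collected for the output layers
    return residual, skip

T, C = 8, 4
x = rng.standard_normal((T, C))
res, skip = residual_block(x, rng.standard_normal((C, C)),
                           rng.standard_normal((C, C)),
                           rng.standard_normal((C, C)))
```

With all weights zero the residual branch passes the input through unchanged, which is exactly the compensation property the connection provides.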
IV. adopt the following formula as the condition-input model:
z = tanh(W_f * x + V_f^T h) ⊙ σ(W_g * x + V_g^T h)
where * denotes convolution, ⊙ element-wise multiplication, σ the sigmoid function, h the final emotion mel-spectrum feature obtained in step S6, and x the pitch frequency feature matrix obtained in step S3;
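The condition-input model can be sketched as a gated activation in which the converted emotion mel spectrum h conditions the pitch-frequency input x; the weights below are random stand-ins for learned convolution kernels, and the per-timestep 1x1 form of the products is an assumption:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_activation(x, h, Wf, Wg, Vf, Vg):
    """Conditional gating: both the pitch-frequency input x and the
    emotion mel-spectrum condition h feed the filter (tanh) and gate
    (sigmoid) branches, which are multiplied element-wise."""
    return np.tanh(x @ Wf + h @ Vf) * sigmoid(x @ Wg + h @ Vg)

rng = np.random.default_rng(1)
T, Cx, Ch, C = 6, 3, 24, 8        # 24 mel channels as in the description
x = rng.standard_normal((T, Cx))
h = rng.standard_normal((T, Ch))
z = gated_activation(
    x, h,
    rng.standard_normal((Cx, C)), rng.standard_normal((Cx, C)),
    rng.standard_normal((Ch, C)), rng.standard_normal((Ch, C)))
```

Because tanh is bounded by 1 and the sigmoid gate lies in (0, 1), the gated output is always bounded, which keeps the conditioned activations stable.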
S8, use the emotion mel-spectrum conversion model obtained in step S6 as the forward network and the WaveNet speech synthesis model constructed in step S7 as the backward network to output the final emotional voice file; specifically:
i. input the neutral voice;
ii. use the pitch frequency feature matrix obtained in step S3 as an input feature of the subsequent WaveNet model;
iii. use the mel-spectrum features obtained in step S4 as input features of the subsequent emotion mel-spectrum conversion model;
iv. convert the neutral mel-spectrum data into emotional mel-spectrum data with the final emotion mel-spectrum conversion model obtained in step S6;
v. input the pitch frequency feature matrix obtained in step S3 as the basic input and the emotional mel-spectrum data obtained in step iv as the conditional input into the WaveNet speech synthesis model obtained in step S7, thereby obtaining the final emotional voice file.
With this method, matching and aligning the neutral and emotional mel spectra by dynamic programming improves the parallel correspondence of the mel-spectrum features, so that low-dimensional features with less information can yield high-precision predictions. The mappings of channel correlation and spatial correlation in the CNN feature maps can be completely decoupled: every 1x1 convolution kernel can be connected, and the corresponding 3x3 convolution kernels realize the fully decoupled computation of channel correlation. Combined with WaveNet's residual skip-connection mechanism, the convolution scheme is optimized while the residual compensation mechanism is retained; the data show that the accuracy of Xception combined with the WaveNet residual mechanism is improved (shown in FIG. 6(a)). In addition, the mel spectrum and pitch frequency are used as the feature inputs of the synthesized voice; in a conventional model such low-dimensional features would improve efficiency but lose much of the original information and reduce accuracy, whereas with the WaveNet model the dilated convolutions that enlarge the receptive field and the causal prediction mechanism greatly improve the prediction accuracy of the low-dimensional features. Moreover, experiments show that the optimized WaveNet reaches training convergence faster; as shown in FIG. 6(b), it converges faster than the original version.
Claims (1)
1. A wave net-based emotion voice conversion method comprises the following steps:
s1, acquiring a voice file to form a corpus;
s2, dividing the voice data in the corpus obtained in the step S1 into a neutral voice file and an emotion voice file, and dividing voices with the same content into the same group; the method comprises the following steps:
A. extracting a plurality of emotionally colored voice files with the same content as a training set;
B. obtaining the text of each sentence with an ASR tool;
C. for the text obtained in step B, grouping the voice files that share the same text but differ in emotion together with the corresponding neutral voice file;
D. arranging the groups of files as rows to form a training matrix, one group per row;
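The grouping in steps A-D can be sketched as follows; the file names, the `transcripts` mapping, and the helper `build_training_matrix` are hypothetical illustrations, since the patent does not fix a data format:

```python
from collections import defaultdict

def build_training_matrix(transcripts):
    """transcripts: {wav_path: (text, emotion_label)} -> list of rows,
    one row per distinct text, with the neutral recording first."""
    groups = defaultdict(list)
    for path, (text, emotion) in transcripts.items():
        groups[text].append((emotion, path))
    matrix = []
    for text, files in groups.items():
        # put the neutral recording first, emotional variants after it
        files.sort(key=lambda e: (e[0] != "neutral", e[0]))
        matrix.append([p for _, p in files])
    return matrix

demo = {
    "a_neu.wav": ("hello", "neutral"),
    "a_ang.wav": ("hello", "angry"),
    "a_hap.wav": ("hello", "happy"),
    "b_neu.wav": ("bye", "neutral"),
    "b_sad.wav": ("bye", "sad"),
}
for row in build_training_matrix(demo):
    print(row)
```

Each printed row is one group (one row of the training matrix), with the neutral file leading.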
S3, extracting the acoustic feature pitch frequency from the voice files grouped in step S2; specifically:
a. taking the training matrix obtained in step S2 row by row;
b. inputting the data of the training matrix into a vocoder row by row;
c. grouping the pitch frequencies output by the vocoder in step b into the pitch frequencies of the voice files with the same text but different emotions and the pitch frequencies of the corresponding neutral voice files, thereby obtaining a pitch-frequency feature matrix;
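As a rough stand-in for the vocoder analysis in steps a-c, a naive per-frame autocorrelation pitch estimator might look like the following; this is an illustrative substitute, not the vocoder the patent relies on:

```python
import numpy as np

def frame_f0(frame, sr, fmin=60.0, fmax=400.0):
    """Naive autocorrelation pitch estimate for one frame: the strongest
    autocorrelation lag inside the plausible pitch-period range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag search range in samples
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 200.0 * t)          # 200 Hz test tone
print(round(frame_f0(tone, sr)))
```

Running the estimator over every frame of every file in a matrix row, grouped as in step c, would yield the pitch-frequency feature matrix.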
S4, preprocessing the voice files grouped in step S2 to obtain the mel spectrum features of each group of voice files; specifically, the mel spectrum features of each group are obtained by the following steps:
(1) taking the training matrix obtained in step S2 row by row;
(2) sampling the voice file corresponding to each row of the training matrix at a set sampling frequency, and applying μ-law thirteen-segment piecewise-linear companding;
(3) framing the compressed voice file obtained in step (2);
(4) windowing the framed voice file obtained in step (3);
(5) performing spectral analysis on the windowed voice file obtained in step (4) to obtain the corresponding spectral data;
(6) applying mel filtering to the spectral data obtained in step (5);
(7) storing the mel-filtered spectral data from step (6), thereby obtaining the mel spectrum features of each group of voice files;
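Steps (2)-(6) can be sketched with numpy. The FFT size, hop length, and mel-band count below are illustrative assumptions, and the smooth μ-law curve stands in for the thirteen-segment piecewise-linear approximation named in step (2):

```python
import numpy as np

def mu_law(x, mu=255.0):
    # step (2): mu-law companding (smooth form; the thirteen-segment line
    # is a piecewise-linear approximation of this curve)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters spaced evenly on the mel scale
    mels = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    hz = 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(wav, sr, n_fft=512, hop=128, n_mels=40):
    wav = mu_law(wav)                                   # step (2)
    n_frames = 1 + (len(wav) - n_fft) // hop            # step (3): framing
    win = np.hanning(n_fft)                             # step (4): windowing
    frames = np.stack([wav[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))          # step (5): spectrum
    return spec @ mel_filterbank(n_mels, n_fft, sr).T   # step (6): mel filter

sr = 16000
wav = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
mel = mel_spectrogram(wav, sr)
print(mel.shape)
```

Storing one such (frames x mel-bands) array per file, step (7), gives the per-group mel spectrum features.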
S5, for the mel spectrum features of each group of voice files obtained in step S4, performing dynamic-programming feature-point alignment so that the point pairs on the shortest path for each group of voice files are mapped to each other and used as training pairs; specifically:
1) let X be the neutral-voice mel spectrum sequence and Y the emotional-voice mel spectrum sequence;
2) build the Euclidean distance matrix of the two sequences;
3) find the shortest distance from the top-left element of the matrix to the bottom-right element;
4) take the coordinates on the path corresponding to the shortest distance found in step 3) and mark them as parallel corresponding points;
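The alignment in steps 1)-4) is classic dynamic time warping. A minimal sketch, with toy one-dimensional "mel" frames standing in for real spectra:

```python
import numpy as np

def dtw_align(X, Y):
    """Align a neutral mel sequence X with an emotional mel sequence Y
    (both frames x mel bins); return the frame pairs on the shortest path."""
    n, m = len(X), len(Y)
    # step 2): Euclidean distance matrix between every frame pair
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):             # step 3): accumulate shortest cost
        for j in range(1, m + 1):
            cost[i, j] = D[i - 1, j - 1] + min(cost[i - 1, j],
                                               cost[i, j - 1],
                                               cost[i - 1, j - 1])
    # step 4): backtrack the path from bottom-right to top-left
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([[0.0], [0.0], [1.0], [2.0]])
print(dtw_align(X, Y))
```

Each returned pair (neutral frame index, emotional frame index) is one parallel corresponding point, i.e. one training pair for step S6.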
S6, constructing an emotion mel spectrum conversion model: the parallel corresponding points obtained in step S5 are used as input data for training a CNN network model, yielding the final emotion mel spectrum conversion model;
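The channel/spatial decoupling described in the specification (a 3x3 kernel per channel for spatial correlation, followed by a 1x1 cross-channel kernel, in the style of xception) can be sketched in numpy; the shapes and kernels below are illustrative, not the patented network:

```python
import numpy as np

def separable_conv(x, depthwise, pointwise):
    """Depthwise-separable convolution: x (C, H, W), depthwise (C, 3, 3),
    pointwise (C_out, C). Spatial and channel correlations are handled by
    two fully decoupled stages."""
    C, H, W = x.shape
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    spatial = np.zeros_like(x)
    for c in range(C):                      # spatial correlation, per channel
        for i in range(H):
            for j in range(W):
                spatial[c, i, j] = np.sum(pad[c, i:i + 3, j:j + 3]
                                          * depthwise[c])
    # cross-channel correlation via the 1x1 kernel (a matrix multiply)
    return np.einsum("oc,chw->ohw", pointwise, spatial)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
dw = rng.normal(size=(4, 3, 3))
pw = rng.normal(size=(16, 4))
y = separable_conv(x, dw, pw)
print(y.shape)
```

With a center-delta depthwise kernel and an identity pointwise kernel, the layer reduces to the identity, which makes the decoupling easy to sanity-check.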
S7, constructing the wavenet voice synthesis model; the construction comprises the following steps:
I. adopting the following formula as the causal prediction formula:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t-1})
where x_i is the value of the pitch-frequency feature at time point i, t is the current time point, and p(x) is the prediction probability at the current time point;
II. adopting a dilated causal convolution model:
III. adopting residual skip connections:
connecting the outputs of every few layers and applying residual compensation to the input through a 1x1 convolution kernel; finally decomposing the per-channel 1x1 convolution kernels into a plurality of 3x3 convolution kernels;
IV. adopting the following formula as the conditional input model:
p(x | h) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t-1}, h)
where h is the output of the final emotion mel spectrum conversion model obtained in step S6, and x is the pitch-frequency feature matrix obtained in step S3;
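Steps I-IV can be sketched together as one block: a causal convolution with dilation, a gated activation that mixes in the condition h, and a residual (jump) connection back onto the input. This is a minimal single-channel numpy sketch with illustrative weights, not the full patented network:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution, kernel size 2: the output at time t sees
    only x[t] and x[t - dilation] (steps I-II)."""
    shifted = np.concatenate([np.zeros(dilation), x[:-dilation]])
    return w[0] * shifted + w[1] * x

def residual_block(x, h, w_f, w_g, dilation):
    """Steps III-IV: gated activation tanh(.) * sigmoid(.) with the
    condition h added to both branches, plus the residual connection."""
    f = causal_dilated_conv(x, w_f, dilation) + h
    g = causal_dilated_conv(x, w_g, dilation) + h
    z = np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))
    return x + z, z          # residual output, skip output

x = np.linspace(-1, 1, 16)
h = 0.1 * np.ones(16)        # condition: converted emotional mel feature
out, skip = residual_block(x, h, np.array([0.5, 0.5]),
                           np.array([0.3, 0.7]), dilation=2)
print(out.shape, skip.shape)
```

Causality is easy to check: perturbing a future sample leaves all earlier outputs unchanged, which is exactly what the causal prediction formula requires.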
S8, using the emotion mel spectrum conversion model obtained in step S6 as the forward network and the wavenet voice synthesis model constructed in step S7 as the backward network, outputting the final emotion voice file; specifically, the final emotion voice file is output by the following steps:
i. inputting neutral voice;
ii. using the pitch-frequency feature matrix obtained in step S3 as an input feature of the subsequent wavenet model;
iii. using the mel spectrum features obtained in step S4 as input features of the subsequent emotion mel spectrum conversion model;
iv. converting the neutral mel spectrum data into emotional mel spectrum data with the final emotion mel spectrum conversion model obtained in step S6;
v. taking the pitch-frequency feature matrix obtained in step S3 as the basic input and the emotional mel spectrum data obtained in step iv as the conditional input, and feeding them together into the wavenet voice synthesis model obtained in step S7, thereby obtaining the final emotion voice file.
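The inference flow of steps i-v is plain data plumbing; in the sketch below every callable is a hypothetical placeholder for the models and extractors built in steps S3-S7:

```python
def convert_emotion(neutral_wav, extract_f0, extract_mel,
                    mel_converter, wavenet_synth):
    """Steps i-v: neutral waveform in, emotional waveform out."""
    f0 = extract_f0(neutral_wav)             # step ii: pitch features
    neutral_mel = extract_mel(neutral_wav)   # step iii: mel features
    emo_mel = mel_converter(neutral_mel)     # step iv: S6 conversion model
    return wavenet_synth(f0, cond=emo_mel)   # step v: S7 wavenet model

# Trivial stand-ins just to show the data flow end to end:
wav = [0.0, 0.1, 0.2]
out = convert_emotion(
    wav,
    extract_f0=lambda w: [200.0] * len(w),
    extract_mel=lambda w: [[abs(s)] for s in w],
    mel_converter=lambda m: [[v * 1.2 for v in row] for row in m],
    wavenet_synth=lambda f0, cond: [f * c[0] for f, c in zip(f0, cond)],
)
print(out)
```

Swapping the lambdas for the trained S6 conversion network and the S7 wavenet (with the pitch matrix as basic input and the converted mel spectrum as conditional input) gives the full pipeline.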
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010229173.6A CN111402923B (en) | 2020-03-27 | 2020-03-27 | Emotion voice conversion method based on wavenet |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111402923A CN111402923A (en) | 2020-07-10 |
CN111402923B true CN111402923B (en) | 2023-11-03 |
Family
ID=71429205
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010229173.6A Active CN111402923B (en) | 2020-03-27 | 2020-03-27 | Emotion voice conversion method based on wavenet |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111402923B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101666930B1 (en) * | 2015-04-29 | 2016-10-24 | Seoul National University R&DB Foundation | Target speaker adaptive voice conversion method using deep learning model and voice conversion device implementing the same |
CN109767778A (en) * | 2018-12-27 | 2019-05-17 | Army Engineering University of PLA | A voice conversion method fusing Bi-LSTM and WaveNet |
CN110033755A (en) * | 2019-04-23 | 2019-07-19 | Ping An Technology (Shenzhen) Co., Ltd. | Speech synthesis method, apparatus, computer device and storage medium |
KR102057927B1 (en) * | 2019-03-19 | 2019-12-20 | Humelo Inc. | Apparatus for synthesizing speech and method thereof |
CN110619867A (en) * | 2019-09-27 | 2019-12-27 | 百度在线网络技术(北京)有限公司 | Training method and device of speech synthesis model, electronic equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2489473B (en) * | 2011-03-29 | 2013-09-18 | Toshiba Res Europ Ltd | A voice conversion method and system |
Non-Patent Citations (4)
Title |
---|
Heejin Choi et al. Emotional Speech Synthesis for Multi-Speaker Emotional Dataset Using WaveNet Vocoder. 2019 IEEE International Conference on Consumer Electronics. 2019, pp. 1-2. * |
Zhaojie Luo et al. Emotional Voice Conversion Using Dual Supervised Adversarial Networks With Continuous Wavelet Transform F0 Features. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2019, pp. 1535-1538. * |
Speech Emotion Recognition Based on Long Short-Term Memory and Convolutional Neural Networks; Lu Guanming et al.; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) (No. 05); pp. 63-68 * |
Zhang Yaqiang. Emotional Speech Synthesis Based on Transfer Learning and Self-Learned Emotion Representation. China Master's Theses Full-text Database. 2019, pp. 44-48. * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110992987B (en) | Parallel feature extraction system and method for general specific voice in voice signal | |
CN108597539B (en) | Speech emotion recognition method based on parameter migration and spectrogram | |
CN108597541B (en) | Speech emotion recognition method and system for enhancing anger and happiness recognition | |
CN112466326B (en) | Voice emotion feature extraction method based on transducer model encoder | |
EP0342630A2 (en) | Speech recognition with speaker adaptation by learning | |
CN111402928B (en) | Attention-based speech emotion state evaluation method, device, medium and equipment | |
WO2022141842A1 (en) | Deep learning-based speech training method and apparatus, device, and storage medium | |
CN110634476B (en) | Method and system for rapidly building robust acoustic model | |
CN109902164B (en) | Method for solving question-answering of open long format video by using convolution bidirectional self-attention network | |
CN113140220B (en) | Lightweight end-to-end speech recognition method based on convolution self-attention transformation network | |
CN113987179A (en) | Knowledge enhancement and backtracking loss-based conversational emotion recognition network model, construction method, electronic device and storage medium | |
CN112259119B (en) | Music source separation method based on stacked hourglass network | |
CN107316635A (en) | Audio recognition method and device, storage medium, electronic equipment | |
CN111061951A (en) | Recommendation model based on double-layer self-attention comment modeling | |
CN113539232A (en) | Muslim class voice data set-based voice synthesis method | |
CN116486794A (en) | Chinese-English mixed speech recognition method | |
CN111724809A (en) | Vocoder implementation method and device based on variational self-encoder | |
CN111402923B (en) | Emotion voice conversion method based on wavenet | |
CN113808581A (en) | Chinese speech recognition method for acoustic and language model training and joint optimization | |
CN111951778B (en) | Method for emotion voice synthesis by utilizing transfer learning under low resource | |
CN116758451A (en) | Audio-visual emotion recognition method and system based on multi-scale and global cross attention | |
CN112380874B (en) | Multi-person-to-speech analysis method based on graph convolution network | |
CN115171878A (en) | Depression detection method based on BiGRU and BiLSTM | |
CN115222059A (en) | Self-distillation model compression algorithm based on high-level information supervision | |
CN115019785A (en) | Streaming voice recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||