CN111402923B - Emotion voice conversion method based on wavenet - Google Patents
- Publication number: CN111402923B (application CN202010229173.6A)
- Authority: CN (China)
- Prior art keywords: voice, emotion, mel spectrum, files, group
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/30 (under G10L25/00, G10L25/27): Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique, using neural networks
- G10L15/063 (under G10L15/00, G10L15/06): Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L25/24 (under G10L25/00, G10L25/03): Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/63 (under G10L25/00, G10L25/48, G10L25/51): Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
(All classifications fall under G: Physics; G10: Musical instruments, acoustics; G10L: Speech analysis or synthesis, speech recognition, speech or voice processing, speech or audio coding or decoding.)
Abstract
The invention discloses a WaveNet-based emotion voice conversion method comprising the steps of: obtaining voice files to form a corpus; dividing the voice data into neutral voice files and emotional voice files, and placing voices with the same content in the same group; extracting the pitch frequency acoustic feature of each voice file; preprocessing the voice files to obtain the mel-spectrum features of each group; performing dynamic-programming feature-point alignment on the mel-spectrum features of each group; constructing an emotion mel-spectrum conversion model; constructing a WaveNet speech synthesis model; and using the emotion mel-spectrum conversion model as the forward network and the WaveNet speech synthesis model as the backward network to output the final emotional voice file. The method offers high reliability, good accuracy, and high efficiency.
Description
Technical Field
The invention belongs to the field of voice data processing, and in particular relates to a WaveNet-based emotion voice conversion method.
Background
With the development of economic and artificial-intelligence technology, people's daily lives have become increasingly rich. Giving a machine the ability to perceive and express emotion as humans do is key to achieving harmonious human-computer interaction. Speech processing technology has improved remarkably in recent years, but current computers possess only logical reasoning ability; endowing them with emotional expression would enable harmonious human-computer interaction and dispense with indirect communication tools such as the keyboard and mouse. Future communication with robots would then no longer be limited to neutral speech: people could converse with computers by voice, with emotion. In the field of video art, converting the emotion of a person's voice can also greatly raise the quality of work such as dubbing. Research on voice emotion conversion therefore has great significance, whether the target is a robot or a human.
Existing emotion voice conversion generally adopts one of the following methods:
1. Purely manual conversion: a professional dubbing actor imitates the neutral voice and reproduces it with emotion. The accuracy depends on the actor's skill, and a great deal of time is required, so efficiency is low.
2. Parallel training with a machine-learning regression model: the conversion is learned by training in parallel on the acoustic features of neutral and emotional voices. Although this method is accurate, training is extremely slow; it places high demands on the training corpus and needs a large amount of training data. In addition, such methods generally adopt high-dimensional acoustic features to retain the acoustic information of the original voice, and the higher the dimensionality, the longer the training time.
3. Non-parallel training with a machine-learning regression model: unlike the second method, neutral and emotional voices with different spoken content are fed to the conversion model as training samples. Training is fast, but because the range of target parameters during training is too large, the output accuracy is low and the quality of the output emotional voice is poor.
Disclosure of Invention
The invention aims to provide a WaveNet-based emotion voice conversion method with high reliability, good accuracy, and high efficiency.
The WaveNet-based emotion voice conversion method provided by the invention comprises the following steps:
S1, acquire voice files to form a corpus;
S2, divide the voice data in the corpus obtained in step S1 into neutral voice files and emotional voice files, and place voices with the same spoken content in the same group;
S3, extract the pitch frequency acoustic feature from the voice files grouped in step S2;
S4, preprocess the voice files grouped in step S2 to obtain the mel-spectrum features of each group;
S5, perform dynamic-programming feature-point alignment on the mel-spectrum features of each group obtained in step S4, so that the point pairs on the shortest path of each group are mapped to each other and used as training pairs;
S6, construct the emotion mel-spectrum conversion model;
S7, construct the WaveNet speech synthesis model;
S8, use the emotion mel-spectrum conversion model obtained in step S6 as the forward network and the WaveNet speech synthesis model constructed in step S7 as the backward network to output the final emotional voice file.
In step S2, the voice data in the corpus obtained in step S1 are divided into neutral voice files and emotional voice files, and voices with the same content are placed in the same group; specifically, the grouping proceeds as follows:
A. extract a number of emotionally colored voice files with the same content as the training set;
B. obtain the text of each sentence with an ASR tool;
C. using the text obtained in step B, place the voice files that share the same text but differ in emotion, together with the corresponding neutral voice file, into one group;
D. arrange the groups of files as rows to form a training matrix, one group per row.
Step S3, extracting the pitch frequency acoustic feature from the voice files grouped in step S2, specifically comprises the following steps:
a. take the training matrix obtained in step S2 row by row;
b. input the data of the training matrix, row by row, into a vocoder;
c. group the pitch frequencies output by the vocoder in step b so that the pitch frequencies of the voice files with the same text but different emotions and the pitch frequency of the corresponding neutral voice file form one group, thereby obtaining the pitch frequency feature matrix.
In step S4, the voice files grouped in step S2 are preprocessed to obtain the mel-spectrum features of each group, specifically as follows:
(1) take the training matrix obtained in step S2 row by row;
(2) sample the voice file corresponding to each row of the training matrix at a set sampling frequency, and compress it by μ-law companding;
(3) frame the compressed voice file obtained in step (2);
(4) window the framed voice file obtained in step (3);
(5) perform spectrum analysis on the windowed voice file obtained in step (4) to obtain the corresponding spectrum data;
(6) apply mel filtering to the spectrum data obtained in step (5);
(7) save the mel-filtered spectrum data from step (6), thereby obtaining the mel-spectrum features of each group of voice files.
In step S5, dynamic-programming feature-point alignment is performed on the mel-spectrum features of each group of voice files obtained in step S4, so that the point pairs on the shortest path of each group are mapped to each other and used as training pairs; specifically:
1) let X be the neutral-voice mel-spectrum sequence and Y the emotional-voice mel-spectrum sequence;
2) build the Euclidean distance matrix of the two sequences;
3) find the shortest path from the top-left element of the matrix to the bottom-right element;
4) take the coordinates on the path found in step 3) and mark them as parallel corresponding points.
In step S6, the emotion mel-spectrum conversion model is constructed by taking the parallel corresponding points obtained in step S5 as input data and training a CNN model on them, thereby obtaining the final emotion mel-spectrum conversion model.
The WaveNet speech synthesis model of step S7 is constructed as follows:
I. adopt the following causal prediction formula:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})
where x_i is the value of the pitch frequency feature at time point i, T is the current time point, and p(x) is the prediction probability at the current time point;
II. adopt a dilated causal convolution model:
for a speech signal, the current predicted value is strongly related to the outputs at previous time points, because semantics links consecutive words closely. Causal convolution exploits this property when predicting the current node: the output at the current time point is built on the input of the current node and the outputs of the previous time nodes. In conventional causal convolution, the receptive field grows only linearly with the number of network layers: each connecting layer passes the output of one node to the next as input, adding one node to the receptive field per layer, until the predicted value emerges at the top layer. Dilated convolution builds on conventional causal convolution but does not predict from consecutive nodes; it skips nodes with a per-layer dilation, so that the number of nodes in the receptive field equals 2 raised to the power of the number of layers. Because the receptive field grows exponentially with depth, the output at the current time node is related to the outputs of far more previous time nodes, which yields a better prediction effect.
III. residual skip connection:
connect the outputs of every few layers, and apply residual compensation to the input through a 1x1 convolution kernel; finally, decompose the per-channel 1x1 convolution kernels into several 3x3 convolution kernels;
IV. adopt the following formula as the condition-input model:
z = tanh(W_f * x + V_f^T h) ⊙ σ(W_g * x + V_g^T h)
where * denotes convolution, ⊙ element-wise multiplication, σ the sigmoid function, h the final emotion mel-spectrum feature obtained in step S6, and x the pitch frequency feature matrix obtained in step S3.
In step S8, the emotion mel-spectrum conversion model obtained in step S6 is used as the forward network and the WaveNet speech synthesis model constructed in step S7 as the backward network to output the final emotional voice file; specifically:
i. input the neutral voice;
ii. use the pitch frequency feature matrix obtained in step S3 as an input feature of the subsequent WaveNet model;
iii. use the mel-spectrum features obtained in step S4 as input features of the subsequent emotion mel-spectrum conversion model;
iv. convert the neutral mel-spectrum data into emotional mel-spectrum data with the final emotion mel-spectrum conversion model obtained in step S6;
v. input the pitch frequency feature matrix obtained in step S3 as the basic input and the emotional mel-spectrum data obtained in step iv as the conditional input into the WaveNet speech synthesis model obtained in step S7, thereby obtaining the final emotional voice file.
In the WaveNet-based emotion voice conversion method provided by the invention, a dynamic-programming algorithm matches and aligns the neutral and emotional mel spectra, which improves the parallel correspondence of the mel-spectrum features and lets low-dimensional features with less information yield high-precision predictions. In the feature maps of the convolutional neural network, the mappings of channel correlation and spatial correlation can be completely decoupled, so every 1x1 convolution kernel can be connected while the corresponding 3x3 convolution kernels realize the fully decoupled computation of channel correlation. Combined with WaveNet's residual skip-connection mechanism, the convolution scheme is optimized and the residual compensation mechanism retained, which improves prediction accuracy. With the mel spectrum and pitch frequency as the feature inputs of the synthesized voice and with the WaveNet model, the dilated convolutions that enlarge the receptive field and the causal prediction mechanism greatly improve the prediction accuracy of the low-dimensional features, and experiments show that the optimized WaveNet reaches training convergence faster. The method therefore offers high reliability, good accuracy, and high efficiency.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of quantization rules in the method of the present invention.
FIG. 3 is a schematic diagram of a rule for determining the shortest distance in the method of the present invention.
FIG. 4 is a schematic diagram of the dilated causal convolution model in the method of the present invention.
Fig. 5 is a schematic diagram of an improved residual skip connection in the method of the present invention.
FIG. 6 is a graph showing the comparison of the prediction accuracy of the method of the present invention and the prior art method.
Detailed Description
The flow of the method of the present invention is shown schematically in FIG. 1. The WaveNet-based emotion voice conversion method provided by the invention comprises the following steps:
S1, acquire voice files to form a corpus;
S2, divide the voice data in the corpus obtained in step S1 into neutral voice files and emotional voice files, and place voices with the same content in the same group; specifically:
A. extract a number of emotionally colored voice files with the same content as the training set;
B. obtain the text of each sentence with an ASR tool;
C. using the text obtained in step B, place the voice files that share the same text but differ in emotion, together with the corresponding neutral voice file, into one group;
D. arrange the groups of files as rows to form a training matrix, one group per row;
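The grouping in steps A to D can be sketched as follows; the tuple layout and file names are illustrative assumptions, and in the actual method the transcripts would come from the ASR tool of step B:

```python
from collections import defaultdict

def build_training_matrix(files):
    """Group utterances that share the same transcript into one row:
    each row holds the neutral file plus its emotional variants."""
    groups = defaultdict(dict)
    for text, emotion, path in files:
        groups[text][emotion] = path
    return [groups[text] for text in sorted(groups)]  # one group per row

rows = build_training_matrix([
    ("hello world", "neutral", "n_001.wav"),   # hypothetical file names
    ("hello world", "happy",   "h_001.wav"),
    ("good morning", "neutral", "n_002.wav"),
    ("good morning", "sad",     "s_002.wav"),
])
```

Each row of `rows` then corresponds to one row of the training matrix described above.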
S3, extract the pitch frequency acoustic feature from the voice files grouped in step S2; specifically:
a. take the training matrix obtained in step S2 row by row;
b. input the data of the training matrix, row by row, into a vocoder;
c. group the pitch frequencies output by the vocoder in step b so that the pitch frequencies of the voice files with the same text but different emotions and the pitch frequency of the corresponding neutral voice file form one group, thereby obtaining the pitch frequency feature matrix;
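As a non-limiting illustration of the pitch-frequency extraction in steps a to c, the sketch below estimates F0 for one synthetic voiced frame by autocorrelation; this is a stand-in for the vocoder analysis the method actually uses, and all numbers are illustrative:

```python
import numpy as np

def estimate_f0(frame, sr, f_min=60.0, f_max=400.0):
    """Autocorrelation pitch estimate for a single voiced frame:
    the lag of the autocorrelation peak inside the plausible pitch
    range gives the period, whose inverse is F0."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / f_max), int(sr / f_min)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 8000
t = np.arange(int(0.032 * sr)) / sr        # one 32 ms frame
frame = np.sin(2 * np.pi * 200.0 * t)      # synthetic 200 Hz tone
f0 = estimate_f0(frame, sr)
```

For the 200 Hz test tone the estimate lands on the true pitch; a production vocoder would add voicing decisions and frame-level smoothing.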
S4, preprocess the voice files grouped in step S2 to obtain the mel-spectrum features of each group; specifically:
(1) take the training matrix obtained in step S2 row by row;
(2) sample the voice file corresponding to each row of the training matrix at a set sampling frequency, and compress it by μ-law piecewise-linear companding;
In a specific implementation, the sampling frequency is set to 44 kHz and μ to 255. The quantization rule (shown in FIG. 2) is as follows: the range of the sampled data is uniformly divided into 255 interval values x, and the mapped value y is obtained from the interval in which a sample falls and the curve in the figure; y is the quantized sample value. The larger a sample value is, the further along the curve its interval lies, the smaller the corresponding slope, and the closer together the mapped y values are. Small signal values are thus mapped finely and spread apart, while large signal values are mapped coarsely and pushed together, which suits the fact that speech analysis must concentrate on small signal values;
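A minimal sketch of the μ-law companding and quantization described above, with μ = 255 as in the description; the NumPy implementation below uses the continuous μ-law curve rather than a piecewise-linear segment approximation:

```python
import numpy as np

MU = 255  # quantization parameter used in the description

def mu_law_encode(x, mu=MU):
    """Compress samples in [-1, 1] and quantize to mu + 1 integer levels.
    Small amplitudes get finer resolution, matching the curve in FIG. 2."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(q, mu=MU):
    """Invert the companding: integer level back to a sample in [-1, 1]."""
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.array([-1.0, -0.01, 0.0, 0.01, 1.0])
q = mu_law_encode(x)
```

A round trip through encode and decode recovers small amplitudes with much finer error than uniform 8-bit quantization would.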
(3) frame the compressed voice file obtained in step (2), with a frame length of 32 ms and a frame shift of 20 ms;
(4) window the framed voice file obtained in step (3), with the window length set to 16;
(5) perform spectrum analysis on the windowed voice file obtained in step (4) to obtain the corresponding spectrum data; an FFT may be used, with FFT length N = 256;
(6) apply mel filtering to the spectrum data obtained in step (5);
in a specific setting, the sampling frequency is f_s = 8000 Hz and the lowest frequency of the filter range is f_l = 0; by the Nyquist sampling theorem, the highest frequency of the filter range is f_h = f_s/2 = 4000 Hz. The number of filters is set to M = 24. The value of each mel filter can be calculated from the transfer function of the corresponding band-pass filter, and together the filters form a mel filter bank;
(7) save the mel-filtered spectrum data from step (6), thereby obtaining the mel-spectrum features of each group of voice files;
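The framing, windowing, FFT, and mel-filtering pipeline of steps (3) to (7) can be sketched as below, using the stated parameters (f_s = 8000 Hz, 32 ms frames, 20 ms shift, N = 256, M = 24 filters); the Hann window and triangular filter shapes are common choices assumed here, not mandated by the description:

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=24, n_fft=256, sr=8000, f_lo=0.0, f_hi=4000.0):
    """Triangular band-pass filters spaced evenly on the mel scale
    (M = 24, f_h = f_s/2 per the description)."""
    pts = mel_to_hz(np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(x, sr=8000, frame_len=256, hop=160, n_mels=24):
    """Frame (32 ms), Hann-window, FFT (N = 256), then mel-filter."""
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    fb = mel_filterbank(n_mels, frame_len, sr)
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=frame_len, axis=1)) ** 2
    return power @ fb.T          # shape (n_frames, n_mels)

x = np.sin(2 * np.pi * 1000 * np.arange(8000) / 8000)  # 1 s of a 1 kHz tone
mel = mel_spectrogram(x)
```

For the 1 kHz test tone the energy concentrates in the mel band covering 1 kHz, as expected.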
S5, perform dynamic-programming feature-point alignment on the mel-spectrum features of each group of voice files obtained in step S4, so that the point pairs on the shortest path of each group are mapped to each other and used as training pairs; specifically:
1) let X be the neutral-voice mel-spectrum sequence and Y the emotional-voice mel-spectrum sequence;
2) build the Euclidean distance matrix of the two sequences (shown in FIG. 3(a));
3) find the shortest path from the top-left element of the matrix to the bottom-right element (as shown in FIG. 3(b));
4) take the coordinates on the path found in step 3) and mark them as parallel corresponding points; as shown in FIG. 3, the X-Y corresponding pairs (0,0), (1,1), (2,2), (3,3), (4,4), (5,5), (6,6) and (6,7) are obtained in sequence, and these coordinates are the parallel corresponding points fed in during training;
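The dynamic-programming alignment of steps 1) to 4) can be sketched as follows; the two toy sequences are illustrative, not the mel spectra of FIG. 3, so the resulting path differs from the pairs listed above:

```python
import numpy as np

def dtw_align(X, Y):
    """Dynamic-programming alignment of two mel-spectrum sequences.
    Returns the warping path: index pairs (i, j) on the shortest path
    from the top-left to the bottom-right of the distance matrix."""
    n, m = len(X), len(Y)
    d = np.array([[np.linalg.norm(x - y) for y in Y] for x in X])
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j - 1],
                                            D[i - 1, j], D[i, j - 1])
    path, i, j = [], n, m        # backtrack from (n, m) to (1, 1)
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda p: D[p])
    path.append((0, 0))
    return path[::-1]

X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([[0.0], [1.0], [2.0], [2.0], [3.0]])
pairs = dtw_align(X, Y)
```

Each returned pair maps one frame of the neutral sequence to one frame of the emotional sequence, exactly the parallel corresponding points used as training pairs.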
S6, construct the emotion mel-spectrum conversion model: the parallel corresponding points obtained in step S5 are used as input data and fed into a CNN model for training, yielding the final emotion mel-spectrum conversion model; in a specific implementation, the mel-spectrum dimension is set to 256 and 4 convolution modules are used;
S7, construct the WaveNet speech synthesis model; specifically:
I. adopt the following causal prediction formula:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})
where x_i is the value of the pitch frequency feature at time point i, T is the current time point, and p(x) is the prediction probability at the current time point;
II. adopt a dilated causal convolution model (shown in FIG. 4), in which the lowest layer is the input layer and the uppermost layer is the output layer, and the receptive field over the input layer is enlarged exponentially with depth.
For a speech signal, the current predicted value is strongly related to the outputs at previous time points, because semantics links consecutive words closely. Causal convolution exploits this property when predicting the current node: the output at the current time point is built on the input of the current node and the outputs of the previous time nodes. In conventional causal convolution, the receptive field grows only linearly with the number of network layers: each connecting layer passes the output of one node to the next as input, adding one node to the receptive field per layer, until the predicted value emerges at the top layer. Dilated convolution builds on conventional causal convolution but does not predict from consecutive nodes; it skips nodes with a per-layer dilation, so that the number of nodes in the receptive field equals 2 raised to the power of the number of layers. Because the receptive field grows exponentially with depth, the output at the current time node is related to the outputs of far more previous time nodes, which yields a better prediction effect;
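The exponential growth of the receptive field described above can be checked with a short sketch; kernel size 2 and per-layer dilation doubling are assumed, matching the usual WaveNet configuration:

```python
def receptive_field(num_layers, kernel_size=2):
    """Receptive field (in input samples) of a stack of dilated causal
    convolutions whose dilation doubles per layer: 1, 2, 4, ..."""
    rf = 1
    for layer in range(num_layers):
        rf += (kernel_size - 1) * 2 ** layer
    return rf

dilated = receptive_field(4)     # 4 layers, dilations 1, 2, 4, 8
undilated = 1 + 4                # a plain causal stack adds one sample per layer
```

With 4 layers the dilated stack already sees 2**4 = 16 input samples, versus 5 for the undilated stack, and 10 layers reach 1024 samples.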
III. adopt residual skip connections:
connect the outputs of every few layers, and apply residual compensation to the input through a 1x1 convolution kernel; finally, decompose the per-channel 1x1 convolution kernels into several 3x3 convolution kernels, as shown in FIG. 5;
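A minimal NumPy sketch of the residual skip connection described above, with random stand-in weights rather than trained parameters; on a single time axis a 1x1 convolution reduces to a per-timestep matrix multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """A 1x1 convolution is a per-timestep linear map over channels."""
    return x @ w                      # (T, C_in) @ (C_in, C_out)

def residual_block(x, w_filter, w_res, w_skip):
    """One residual block: a nonlinearity, then two 1x1 convolutions;
    the residual branch adds the block input back (residual
    compensation), the skip branch is summed across blocks later."""
    z = np.tanh(conv1x1(x, w_filter))
    residual = x + conv1x1(z, w_res)  # same shape as x, feeds next block
    skip = conv1x1(z, w_skip)         # collected for the output layers
    return residual, skip

T, C = 8, 4
x = rng.standard_normal((T, C))
res, skip = residual_block(x, rng.standard_normal((C, C)),
                           rng.standard_normal((C, C)),
                           rng.standard_normal((C, C)))
```

With all weights zero the residual branch passes the input through unchanged, which is exactly the compensation property the connection provides.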
IV. adopt the following formula as the condition-input model:
z = tanh(W_f * x + V_f^T h) ⊙ σ(W_g * x + V_g^T h)
where * denotes convolution, ⊙ element-wise multiplication, σ the sigmoid function, h the final emotion mel-spectrum feature obtained in step S6, and x the pitch frequency feature matrix obtained in step S3;
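The condition-input model can be sketched as a gated activation in which the converted emotion mel spectrum h conditions the pitch-frequency input x; the weights below are random stand-ins for learned convolution kernels, and the per-timestep 1x1 form of the products is an assumption:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_activation(x, h, Wf, Wg, Vf, Vg):
    """Conditional gating: both the pitch-frequency input x and the
    emotion mel-spectrum condition h feed the filter (tanh) and gate
    (sigmoid) branches, which are multiplied element-wise."""
    return np.tanh(x @ Wf + h @ Vf) * sigmoid(x @ Wg + h @ Vg)

rng = np.random.default_rng(1)
T, Cx, Ch, C = 6, 3, 24, 8        # 24 mel channels as in the description
x = rng.standard_normal((T, Cx))
h = rng.standard_normal((T, Ch))
z = gated_activation(
    x, h,
    rng.standard_normal((Cx, C)), rng.standard_normal((Cx, C)),
    rng.standard_normal((Ch, C)), rng.standard_normal((Ch, C)))
```

Because tanh is bounded by 1 and the sigmoid gate lies in (0, 1), the gated output is always bounded, which keeps the conditioned activations stable.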
S8, use the emotion mel-spectrum conversion model obtained in step S6 as the forward network and the WaveNet speech synthesis model constructed in step S7 as the backward network to output the final emotional voice file; specifically:
i. input the neutral voice;
ii. use the pitch frequency feature matrix obtained in step S3 as an input feature of the subsequent WaveNet model;
iii. use the mel-spectrum features obtained in step S4 as input features of the subsequent emotion mel-spectrum conversion model;
iv. convert the neutral mel-spectrum data into emotional mel-spectrum data with the final emotion mel-spectrum conversion model obtained in step S6;
v. input the pitch frequency feature matrix obtained in step S3 as the basic input and the emotional mel-spectrum data obtained in step iv as the conditional input into the WaveNet speech synthesis model obtained in step S7, thereby obtaining the final emotional voice file.
With this method, matching and aligning the neutral and emotional mel spectra by dynamic programming improves the parallel correspondence of the mel-spectrum features, so that low-dimensional features with less information can yield high-precision predictions. The mappings of channel correlation and spatial correlation in the CNN feature maps can be completely decoupled: every 1x1 convolution kernel can be connected, and the corresponding 3x3 convolution kernels realize the fully decoupled computation of channel correlation. Combined with WaveNet's residual skip-connection mechanism, the convolution scheme is optimized while the residual compensation mechanism is retained; the data show that the accuracy of Xception combined with the WaveNet residual mechanism is improved (shown in FIG. 6(a)). In addition, the mel spectrum and pitch frequency are used as the feature inputs of the synthesized voice; in a conventional model such low-dimensional features would improve efficiency but lose much of the original information and reduce accuracy, whereas with the WaveNet model the dilated convolutions that enlarge the receptive field and the causal prediction mechanism greatly improve the prediction accuracy of the low-dimensional features. Moreover, experiments show that the optimized WaveNet reaches training convergence faster; as shown in FIG. 6(b), it converges faster than the original version.
Claims (1)
1. A wave net-based emotion voice conversion method comprises the following steps:
s1, acquiring a voice file to form a corpus;
s2, dividing the voice data in the corpus obtained in the step S1 into a neutral voice file and an emotion voice file, and dividing voices with the same content into the same group; the method comprises the following steps:
A. extracting a plurality of emotionally colored voice files with the same content as a training set;
B. obtaining the text of each sentence with an ASR tool;
C. for the text obtained in step B, grouping the voice files that share the same text but differ in emotion together with the corresponding neutral voice file;
D. arranging the groups of files as rows to form a training matrix, one group per row;
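The grouping in steps A-D can be sketched as follows; the file names, the `transcripts` mapping, and the helper `build_training_matrix` are hypothetical illustrations, since the patent does not fix a data format:

```python
from collections import defaultdict

def build_training_matrix(transcripts):
    """transcripts: {wav_path: (text, emotion_label)} -> list of rows,
    one row per distinct text, with the neutral recording first."""
    groups = defaultdict(list)
    for path, (text, emotion) in transcripts.items():
        groups[text].append((emotion, path))
    matrix = []
    for text, files in groups.items():
        # put the neutral recording first, emotional variants after it
        files.sort(key=lambda e: (e[0] != "neutral", e[0]))
        matrix.append([p for _, p in files])
    return matrix

demo = {
    "a_neu.wav": ("hello", "neutral"),
    "a_ang.wav": ("hello", "angry"),
    "a_hap.wav": ("hello", "happy"),
    "b_neu.wav": ("bye", "neutral"),
    "b_sad.wav": ("bye", "sad"),
}
for row in build_training_matrix(demo):
    print(row)
```

Each printed row is one group (one row of the training matrix), with the neutral file leading.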
S3, extracting the acoustic feature pitch frequency from the voice files grouped in step S2; specifically:
a. taking the training matrix obtained in step S2 row by row;
b. inputting the data of the training matrix into a vocoder row by row;
c. grouping the pitch frequencies output by the vocoder in step b into the pitch frequencies of the voice files with the same text but different emotions and the pitch frequencies of the corresponding neutral voice files, thereby obtaining a pitch-frequency feature matrix;
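As a rough stand-in for the vocoder analysis in steps a-c, a naive per-frame autocorrelation pitch estimator might look like the following; this is an illustrative substitute, not the vocoder the patent relies on:

```python
import numpy as np

def frame_f0(frame, sr, fmin=60.0, fmax=400.0):
    """Naive autocorrelation pitch estimate for one frame: the strongest
    autocorrelation lag inside the plausible pitch-period range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag search range in samples
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 200.0 * t)          # 200 Hz test tone
print(round(frame_f0(tone, sr)))
```

Running the estimator over every frame of every file in a matrix row, grouped as in step c, would yield the pitch-frequency feature matrix.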
S4, preprocessing the voice files grouped in step S2 to obtain the mel spectrum features of each group of voice files; specifically, the mel spectrum features of each group are obtained by the following steps:
(1) taking the training matrix obtained in step S2 row by row;
(2) sampling the voice file corresponding to each row of the training matrix at a set sampling frequency, and applying μ-law thirteen-segment piecewise-linear companding;
(3) framing the compressed voice file obtained in step (2);
(4) windowing the framed voice file obtained in step (3);
(5) performing spectral analysis on the windowed voice file obtained in step (4) to obtain the corresponding spectral data;
(6) applying mel filtering to the spectral data obtained in step (5);
(7) storing the mel-filtered spectral data from step (6), thereby obtaining the mel spectrum features of each group of voice files;
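Steps (2)-(6) can be sketched with numpy. The FFT size, hop length, and mel-band count below are illustrative assumptions, and the smooth μ-law curve stands in for the thirteen-segment piecewise-linear approximation named in step (2):

```python
import numpy as np

def mu_law(x, mu=255.0):
    # step (2): mu-law companding (smooth form; the thirteen-segment line
    # is a piecewise-linear approximation of this curve)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters spaced evenly on the mel scale
    mels = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    hz = 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(wav, sr, n_fft=512, hop=128, n_mels=40):
    wav = mu_law(wav)                                   # step (2)
    n_frames = 1 + (len(wav) - n_fft) // hop            # step (3): framing
    win = np.hanning(n_fft)                             # step (4): windowing
    frames = np.stack([wav[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))          # step (5): spectrum
    return spec @ mel_filterbank(n_mels, n_fft, sr).T   # step (6): mel filter

sr = 16000
wav = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
mel = mel_spectrogram(wav, sr)
print(mel.shape)
```

Storing one such (frames x mel-bands) array per file, step (7), gives the per-group mel spectrum features.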
S5, for the mel spectrum features of each group of voice files obtained in step S4, performing dynamic-programming feature-point alignment so that the point pairs on the shortest path for each group of voice files are mapped to each other and used as training pairs; specifically:
1) let X be the neutral-voice mel spectrum sequence and Y the emotional-voice mel spectrum sequence;
2) build the Euclidean distance matrix of the two sequences;
3) find the shortest distance from the top-left element of the matrix to the bottom-right element;
4) take the coordinates on the path corresponding to the shortest distance found in step 3) and mark them as parallel corresponding points;
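The alignment in steps 1)-4) is classic dynamic time warping. A minimal sketch, with toy one-dimensional "mel" frames standing in for real spectra:

```python
import numpy as np

def dtw_align(X, Y):
    """Align a neutral mel sequence X with an emotional mel sequence Y
    (both frames x mel bins); return the frame pairs on the shortest path."""
    n, m = len(X), len(Y)
    # step 2): Euclidean distance matrix between every frame pair
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):             # step 3): accumulate shortest cost
        for j in range(1, m + 1):
            cost[i, j] = D[i - 1, j - 1] + min(cost[i - 1, j],
                                               cost[i, j - 1],
                                               cost[i - 1, j - 1])
    # step 4): backtrack the path from bottom-right to top-left
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([[0.0], [0.0], [1.0], [2.0]])
print(dtw_align(X, Y))
```

Each returned pair (neutral frame index, emotional frame index) is one parallel corresponding point, i.e. one training pair for step S6.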
S6, constructing an emotion mel spectrum conversion model: the parallel corresponding points obtained in step S5 are used as input data for training a CNN network model, yielding the final emotion mel spectrum conversion model;
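The channel/spatial decoupling described in the specification (a 3x3 kernel per channel for spatial correlation, followed by a 1x1 cross-channel kernel, in the style of xception) can be sketched in numpy; the shapes and kernels below are illustrative, not the patented network:

```python
import numpy as np

def separable_conv(x, depthwise, pointwise):
    """Depthwise-separable convolution: x (C, H, W), depthwise (C, 3, 3),
    pointwise (C_out, C). Spatial and channel correlations are handled by
    two fully decoupled stages."""
    C, H, W = x.shape
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    spatial = np.zeros_like(x)
    for c in range(C):                      # spatial correlation, per channel
        for i in range(H):
            for j in range(W):
                spatial[c, i, j] = np.sum(pad[c, i:i + 3, j:j + 3]
                                          * depthwise[c])
    # cross-channel correlation via the 1x1 kernel (a matrix multiply)
    return np.einsum("oc,chw->ohw", pointwise, spatial)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
dw = rng.normal(size=(4, 3, 3))
pw = rng.normal(size=(16, 4))
y = separable_conv(x, dw, pw)
print(y.shape)
```

With a center-delta depthwise kernel and an identity pointwise kernel, the layer reduces to the identity, which makes the decoupling easy to sanity-check.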
S7, constructing the wavenet voice synthesis model; the construction comprises the following steps:
I. adopting the following formula as the causal prediction formula:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t-1})
where x_i is the value of the pitch-frequency feature at time point i, t is the current time point, and p(x) is the prediction probability at the current time point;
II. adopting a dilated causal convolution model:
III. adopting residual skip connections:
connecting the outputs of every few layers and applying residual compensation to the input through a 1x1 convolution kernel; finally decomposing the per-channel 1x1 convolution kernels into a plurality of 3x3 convolution kernels;
IV. adopting the following formula as the conditional input model:
p(x | h) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t-1}, h)
where h is the output of the final emotion mel spectrum conversion model obtained in step S6, and x is the pitch-frequency feature matrix obtained in step S3;
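Steps I-IV can be sketched together as one block: a causal convolution with dilation, a gated activation that mixes in the condition h, and a residual (jump) connection back onto the input. This is a minimal single-channel numpy sketch with illustrative weights, not the full patented network:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution, kernel size 2: the output at time t sees
    only x[t] and x[t - dilation] (steps I-II)."""
    shifted = np.concatenate([np.zeros(dilation), x[:-dilation]])
    return w[0] * shifted + w[1] * x

def residual_block(x, h, w_f, w_g, dilation):
    """Steps III-IV: gated activation tanh(.) * sigmoid(.) with the
    condition h added to both branches, plus the residual connection."""
    f = causal_dilated_conv(x, w_f, dilation) + h
    g = causal_dilated_conv(x, w_g, dilation) + h
    z = np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))
    return x + z, z          # residual output, skip output

x = np.linspace(-1, 1, 16)
h = 0.1 * np.ones(16)        # condition: converted emotional mel feature
out, skip = residual_block(x, h, np.array([0.5, 0.5]),
                           np.array([0.3, 0.7]), dilation=2)
print(out.shape, skip.shape)
```

Causality is easy to check: perturbing a future sample leaves all earlier outputs unchanged, which is exactly what the causal prediction formula requires.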
S8, using the emotion mel spectrum conversion model obtained in step S6 as the forward network and the wavenet voice synthesis model constructed in step S7 as the backward network, outputting the final emotion voice file; specifically, the final emotion voice file is output by the following steps:
i. inputting neutral voice;
ii. using the pitch-frequency feature matrix obtained in step S3 as an input feature of the subsequent wavenet model;
iii. using the mel spectrum features obtained in step S4 as input features of the subsequent emotion mel spectrum conversion model;
iv. converting the neutral mel spectrum data into emotional mel spectrum data with the final emotion mel spectrum conversion model obtained in step S6;
v. taking the pitch-frequency feature matrix obtained in step S3 as the basic input and the emotional mel spectrum data obtained in step iv as the conditional input, and feeding them together into the wavenet voice synthesis model obtained in step S7, thereby obtaining the final emotion voice file.
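The inference flow of steps i-v is plain data plumbing; in the sketch below every callable is a hypothetical placeholder for the models and extractors built in steps S3-S7:

```python
def convert_emotion(neutral_wav, extract_f0, extract_mel,
                    mel_converter, wavenet_synth):
    """Steps i-v: neutral waveform in, emotional waveform out."""
    f0 = extract_f0(neutral_wav)             # step ii: pitch features
    neutral_mel = extract_mel(neutral_wav)   # step iii: mel features
    emo_mel = mel_converter(neutral_mel)     # step iv: S6 conversion model
    return wavenet_synth(f0, cond=emo_mel)   # step v: S7 wavenet model

# Trivial stand-ins just to show the data flow end to end:
wav = [0.0, 0.1, 0.2]
out = convert_emotion(
    wav,
    extract_f0=lambda w: [200.0] * len(w),
    extract_mel=lambda w: [[abs(s)] for s in w],
    mel_converter=lambda m: [[v * 1.2 for v in row] for row in m],
    wavenet_synth=lambda f0, cond: [f * c[0] for f, c in zip(f0, cond)],
)
print(out)
```

Swapping the lambdas for the trained S6 conversion network and the S7 wavenet (with the pitch matrix as basic input and the converted mel spectrum as conditional input) gives the full pipeline.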
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010229173.6A CN111402923B (en) | 2020-03-27 | 2020-03-27 | Emotion voice conversion method based on wavenet |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111402923A CN111402923A (en) | 2020-07-10 |
CN111402923B true CN111402923B (en) | 2023-11-03 |
Family
ID=71429205
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010229173.6A Active CN111402923B (en) | 2020-03-27 | 2020-03-27 | Emotion voice conversion method based on wavenet |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111402923B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101666930B1 (en) * | 2015-04-29 | 2016-10-24 | Seoul National University R&DB Foundation | Target speaker adaptive voice conversion method using deep learning model and voice conversion device implementing the same |
CN109767778A (en) * | 2018-12-27 | 2019-05-17 | Army Engineering University of PLA | A voice conversion method fusing Bi-LSTM and WaveNet |
CN110033755A (en) * | 2019-04-23 | 2019-07-19 | Ping An Technology (Shenzhen) Co., Ltd. | Speech synthesis method, apparatus, computer device and storage medium |
KR102057927B1 (en) * | 2019-03-19 | 2019-12-20 | Humelo Inc. | Apparatus for synthesizing speech and method thereof |
CN110619867A (en) * | 2019-09-27 | 2019-12-27 | 百度在线网络技术(北京)有限公司 | Training method and device of speech synthesis model, electronic equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2489473B (en) * | 2011-03-29 | 2013-09-18 | Toshiba Res Europ Ltd | A voice conversion method and system |
Non-Patent Citations (4)
Title |
---|
Heejin Choi et al. Emotional Speech Synthesis for Multi-Speaker Emotional Dataset Using WaveNet Vocoder. 2019 IEEE International Conference on Consumer Electronics. 2019, pp. 1-2. * |
Zhaojie Luo et al. Emotional Voice Conversion Using Dual Supervised Adversarial Networks With Continuous Wavelet Transform F0 Features. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2019, pp. 1535-1538. * |
Speech Emotion Recognition Based on Long Short-Term Memory and Convolutional Neural Networks; Lu Guanming et al.; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) (No. 05); pp. 63-68 * |
Zhang Yaqiang. Emotional Speech Synthesis Based on Transfer Learning and Self-Learned Emotion Representation. China Master's Theses Full-text Database. 2019, pp. 44-48. * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110992987B (en) | Parallel feature extraction system and method for general specific voice in voice signal | |
CN108597539B (en) | Speech emotion recognition method based on parameter migration and spectrogram | |
CN108597541B (en) | Speech emotion recognition method and system for enhancing anger and happiness recognition | |
CN112466326B (en) | Voice emotion feature extraction method based on transducer model encoder | |
EP0342630A2 (en) | Speech recognition with speaker adaptation by learning | |
CN111402928B (en) | Attention-based speech emotion state evaluation method, device, medium and equipment | |
WO2022141842A1 (en) | Deep learning-based speech training method and apparatus, device, and storage medium | |
CN110634476B (en) | Method and system for rapidly building robust acoustic model | |
CN109902164B (en) | Method for solving question-answering of open long format video by using convolution bidirectional self-attention network | |
CN113140220B (en) | Lightweight end-to-end speech recognition method based on convolution self-attention transformation network | |
CN113987179A (en) | Knowledge enhancement and backtracking loss-based conversational emotion recognition network model, construction method, electronic device and storage medium | |
CN112259119B (en) | Music source separation method based on stacked hourglass network | |
CN107316635A (en) | Audio recognition method and device, storage medium, electronic equipment | |
CN111061951A (en) | Recommendation model based on double-layer self-attention comment modeling | |
CN113539232A (en) | Muslim class voice data set-based voice synthesis method | |
CN116486794A (en) | Chinese-English mixed speech recognition method | |
CN111724809A (en) | Vocoder implementation method and device based on variational self-encoder | |
CN111402923B (en) | Emotion voice conversion method based on wavenet | |
CN113808581A (en) | Chinese speech recognition method for acoustic and language model training and joint optimization | |
CN111951778B (en) | Method for emotion voice synthesis by utilizing transfer learning under low resource | |
CN116758451A (en) | Audio-visual emotion recognition method and system based on multi-scale and global cross attention | |
CN112380874B (en) | Multi-person-to-speech analysis method based on graph convolution network | |
CN115171878A (en) | Depression detection method based on BiGRU and BiLSTM | |
CN115222059A (en) | Self-distillation model compression algorithm based on high-level information supervision | |
CN115019785A (en) | Streaming voice recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||