CN111402923B - Emotion voice conversion method based on wavenet - Google Patents

Emotion voice conversion method based on wavenet

Info

Publication number
CN111402923B
Authority
CN
China
Prior art keywords
voice
emotion
mel spectrum
files
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010229173.6A
Other languages
Chinese (zh)
Other versions
CN111402923A (en)
Inventor
白杨
陈明义
吴国彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202010229173.6A priority Critical patent/CN111402923B/en
Publication of CN111402923A publication Critical patent/CN111402923A/en
Application granted granted Critical
Publication of CN111402923B publication Critical patent/CN111402923B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention discloses a wavenet-based emotion voice conversion method. The method comprises: obtaining voice files to form a corpus; dividing the voice data into neutral voice files and emotion voice files, and putting voices with the same content into the same group; extracting the pitch frequency acoustic feature of each voice file; preprocessing the voice files to obtain the Mel spectrum features of each group of voice files; performing dynamic-programming feature point alignment on the Mel spectrum features of each group of voice files; constructing an emotion Mel spectrum conversion model; constructing a wavenet speech synthesis model; and using the emotion Mel spectrum conversion model as the forward network and the wavenet speech synthesis model as the backward network to output the final emotion voice file. The method offers high reliability, good accuracy and high efficiency.

Description

Emotion voice conversion method based on wavenet
Technical Field
The invention belongs to the field of voice data processing, and particularly relates to a wavenet-based emotion voice conversion method.
Background
With the development of economic and artificial-intelligence technology, people's entertainment life is becoming increasingly rich. Giving machines the ability to perceive and express emotion the way humans do is the key to harmonious human-computer interaction. Speech processing technology has improved remarkably in recent years, but current computers possess only logical reasoning capability; if computers were given the ability to express emotion, truly harmonious human-computer interaction could be achieved, and indirect communication tools such as the keyboard and the mouse could be dispensed with. Future communication with robots would no longer be limited to neutral speech: people could communicate with a computer by voice, with emotion. In the field of film and television art, converting the emotion of a person's voice can greatly raise the quality of work such as dubbing. Conversion of voice emotion therefore has great research significance, whether the target is a robot or a human.
Existing emotion voice conversion technology generally adopts one of the following methods:
1. Purely manual conversion: a professional voice actor imitates the neutral speech and reproduces it with the desired emotion. The accuracy depends on the actor's skill, and a great deal of time is required, so the efficiency is low.
2. Parallel training with a machine-learning regression model: the conversion is learned by training in parallel on the acoustic features of paired neutral and emotional speech. The method is accurate but trains extremely slowly; it places high demands on the training corpus and requires a large amount of training data. In addition, it generally relies on high-dimensional acoustic features to retain the acoustic information of the original speech, and the higher the dimensionality, the longer the training takes.
3. Non-parallel training with a machine-learning regression model: building on the second method, neutral and emotional speech whose spoken content differs are fed into the conversion model as training samples. Training is fast, but because the range of the target parameters during training is too large, the output accuracy is low and the quality of the output emotional speech is poor.
Disclosure of Invention
The invention aims to provide a wavenet-based emotion voice conversion method that is highly reliable, accurate and efficient.
The emotion voice conversion method based on the wavenet provided by the invention comprises the following steps:
s1, acquiring a voice file to form a corpus;
s2, dividing the voice data in the corpus obtained in the step S1 into a neutral voice file and an emotion voice file, and dividing voices with the same speaking content into the same group;
s3, extracting the pitch frequency acoustic feature from the voice files grouped in step S2;
s4, preprocessing the voice files grouped in the step S2, so as to obtain the Mel spectrum characteristics of each group of voice files;
s5, performing dynamic-programming feature point alignment on the Mel spectrum characteristics of each group of voice files obtained in step S4, so that the pairs of points on the shortest path of each group of voice files are mapped to each other and used as training pairs;
s6, constructing an emotion Mel spectrum conversion model;
s7, constructing a wavenet voice synthesis model;
s8, using the emotion Mel spectrum conversion model obtained in step S6 as the forward network and the wavenet speech synthesis model constructed in step S7 as the backward network, and outputting the final emotion voice file.
In step S2, the voice data in the corpus obtained in step S1 are divided into neutral voice files and emotion voice files, and voices with the same content are put into the same group; specifically, the grouping is performed as follows:
A. extracting a plurality of emotionally coloured voice files with the same content as the training set;
B. obtaining text information of each sentence by adopting an ASR tool;
C. for the text information obtained in step B, grouping the voice files that have the same text but different emotions together with the corresponding neutral voice file;
D. arranging the groups of files as the rows of a training matrix, one group per row.
In step S3, the pitch frequency acoustic feature is extracted from the voice files grouped in step S2, specifically by the following steps:
a. dividing the training matrix obtained in the step S2 into rows;
b. inputting data of the training matrix into a vocoder decoder in units of rows;
c. grouping the pitch frequencies output by the vocoder decoder in step b, so that the pitch frequencies corresponding to the voice files with the same text but different emotions form a group together with the pitch frequency corresponding to the neutral voice file, thereby obtaining a pitch frequency feature matrix.
In step S4, the voice files grouped in step S2 are preprocessed to obtain the Mel spectrum characteristics of each group of voice files, specifically by the following steps:
(1) Dividing the training matrix obtained in the step S2 into rows;
(2) Sampling the voice file corresponding to each row of the training matrix, row by row, at a set sampling frequency, and applying μ-law compression (13-segment polyline approximation);
(3) Carrying out framing treatment on the compressed voice file obtained in the step (2);
(4) Windowing the framed voice file obtained in the step (3);
(5) Performing spectrum analysis on the windowed voice file obtained in the step (4), so as to obtain corresponding spectrum data;
(6) Carrying out Mel filtering treatment on the frequency spectrum data obtained in the step (5);
(7) And (3) saving the spectrum data subjected to the Mel filtering processing in the step (6), thereby obtaining Mel spectrum characteristics of each group of voice files.
In step S5, dynamic-programming feature point alignment is performed on the Mel spectrum characteristics of each group of voice files obtained in step S4, so that the pairs of points on the shortest path of each group are mapped to each other and used as training pairs; specifically, the alignment is performed by the following steps:
1) Setting a neutral voice Mel spectrum sequence as X and an emotion voice Mel spectrum sequence as Y;
2) Establishing Euclidean distance matrixes of two sequences;
3) Finding the shortest distance from the element in the top left corner of the matrix to the element in the bottom right corner of the matrix;
4) And 3) acquiring coordinates on the path corresponding to the shortest distance in the step 3), and marking the coordinates as parallel corresponding points.
In step S6, the emotion Mel spectrum conversion model is constructed by taking the parallel corresponding points obtained in step S5 as input data and feeding them into a CNN model for training, thereby obtaining the final emotion Mel spectrum conversion model.
The construction of the wavenet speech synthesis model in step S7 is specifically implemented by adopting the following steps:
i, adopting the following formula as a causal prediction formula:
in which x is i Values for the pitch frequency feature i time points; t is the current time point; p (x) is the prediction probability of the current time point;
II, adopting an extended causal convolution model:
For speech signals, the current predicted value depends strongly on the outputs at earlier time points, because semantics ties successive words together. Causal convolution exploits this property when predicting the current node: the output of the node at the current time point is built from the input of the current node and the outputs of the nodes at earlier time points. In conventional causal convolution, the receptive field grows linearly with the number of layers: the number of nodes in the receptive field equals the number of network layers, since each connecting layer simply passes the output of one node to the next node as input until the prediction is produced at the highest layer. Dilated (extended) convolution builds on conventional causal convolution but does not predict from consecutive nodes; instead, several nodes jointly produce the current predicted value as a group, and the receptive field grows exponentially: the number of nodes in the receptive field equals 2 raised to the power of the number of layers. Because the receptive field of each layer is multiplied in this way, the output of the current time node is correlated with the outputs of many more earlier time nodes, and a better prediction effect is achieved.
III, residual skip connections:
connecting the outputs of every few layers, and applying residual compensation to the input through 1x1 convolution kernels; the 1x1 convolution kernels of the channels are finally decomposed into several 3x3 convolution kernels;
IV, adopting the following formula as the conditional input model:
p(x | h) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1}, h)
wherein h is the final emotion Mel spectrum feature obtained in step S6 and x is the pitch frequency feature matrix obtained in step S3.
In step S8, the emotion Mel spectrum conversion model obtained in step S6 is used as the forward network and the wavenet speech synthesis model constructed in step S7 is used as the backward network, and the final emotion voice file is output; specifically, the final emotion voice file is output by the following steps:
i, inputting neutral voice;
ii, using the pitch frequency feature matrix obtained in step S3 as an input feature of the subsequent wavenet model;
iii, using the Mel spectrum characteristics obtained in step S4 as input features of the subsequent emotion Mel spectrum conversion model;
iv, converting neutral Mel spectrum data into emotion Mel spectrum data according to the final emotion Mel spectrum conversion model obtained in the step S6;
v. taking the pitch frequency feature matrix obtained in step S3 as the basic input and the emotion Mel spectrum data obtained in step iv as the conditional input, and feeding both into the wavenet speech synthesis model obtained in step S7, thereby obtaining the final emotion voice file.
In the wavenet-based emotion voice conversion method provided by the invention, a dynamic-programming algorithm is used to match and align the neutral Mel spectrum with the emotion Mel spectrum, which improves the quality of the parallel correspondence between the Mel spectrum features and allows low-dimensional features carrying less information to yield high-precision predictions. In the feature mapping of the convolutional neural network, the mappings of channel correlation and spatial correlation are fully decoupled, so that every 1x1 convolution kernel can be connected while the corresponding 3x3 convolution kernels realise the fully decoupled computation of channel correlation. Combined with the residual skip connection mechanism of wavenet, the convolution scheme is optimised while the residual compensation mechanism is retained, which improves prediction accuracy. With the Mel spectrum and the pitch frequency as the feature inputs of the synthesised speech, the wavenet model, thanks to the enlarged receptive field of the dilated convolution and the causal prediction mechanism, greatly improves the prediction accuracy of the low-dimensional features, and experiments show that the optimised wavenet reaches training convergence faster. The method is therefore highly reliable, accurate and efficient.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of quantization rules in the method of the present invention.
FIG. 3 is a schematic diagram of a rule for determining the shortest distance in the method of the present invention.
FIG. 4 is a schematic representation of an extended causal convolution model in the method of the present invention.
Fig. 5 is a schematic diagram of an improved residual skip connection in the method of the present invention.
FIG. 6 is a graph showing the comparison of the prediction accuracy of the method of the present invention and the prior art method.
Detailed Description
A schematic process flow diagram of the method of the present invention is shown in fig. 1: the emotion voice conversion method based on the wavenet provided by the invention comprises the following steps:
s1, acquiring a voice file to form a corpus;
s2, dividing the voice data in the corpus obtained in the step S1 into a neutral voice file and an emotion voice file, and dividing voices with the same content into the same group; the method comprises the following steps:
A. extracting a plurality of emotionally coloured voice files with the same content as the training set;
B. obtaining text information of each sentence by adopting an ASR tool;
C. for the text information obtained in step B, grouping the voice files that have the same text but different emotions together with the corresponding neutral voice file;
D. dividing a plurality of groups of files into rows to form a training matrix; one group is a row;
s3, extracting the pitch frequency acoustic feature from the voice files grouped in step S2; the extraction comprises the following steps:
a. dividing the training matrix obtained in the step S2 into rows;
b. inputting data of the training matrix into a vocoder decoder in units of rows;
c. grouping the pitch frequencies output by the vocoder decoder in step b, so that the pitch frequencies corresponding to the voice files with the same text but different emotions form a group together with the pitch frequency corresponding to the neutral voice file, thereby obtaining a pitch frequency feature matrix;
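By way of illustration (not part of the original disclosure), step S3 can be sketched as follows in Python; the WORLD vocoder accessed through the pyworld package, the soundfile loader and the helper names are assumptions, since the patent only refers to a generic vocoder decoder.

```python
# A minimal sketch of step S3 (pitch frequency extraction), assuming the WORLD
# vocoder ("pyworld") as the analysis tool; the patent itself only says "vocoder".
import numpy as np
import pyworld as pw
import soundfile as sf

def extract_f0(wav_path):
    """Return the frame-wise pitch frequency (F0) contour of one voice file."""
    x, sr = sf.read(wav_path)          # x: float64 samples, sr: sampling rate
    if x.ndim > 1:                     # mix down to mono if needed
        x = x.mean(axis=1)
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.dio(x, sr)              # coarse F0 estimation
    f0 = pw.stonemask(x, f0, t, sr)    # refinement step
    return f0

def f0_matrix_for_group(group_wavs):
    """One row of the pitch frequency feature matrix: the neutral file plus the
    emotional files with the same text, zero-padded to a common length."""
    contours = [extract_f0(p) for p in group_wavs]
    width = max(len(c) for c in contours)
    return np.stack([np.pad(c, (0, width - len(c))) for c in contours])
```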
s4, preprocessing the voice files grouped in the step S2, so as to obtain the Mel spectrum characteristics of each group of voice files; specifically, the mel spectrum characteristics of each group of voice files are obtained by the following steps:
(1) Dividing the training matrix obtained in the step S2 into rows;
(2) Sampling the voice file corresponding to each row of the training matrix, row by row, at a set sampling frequency, and applying μ-law compression (13-segment polyline approximation);
In a specific implementation, the sampling frequency is set to 44 kHz and μ is set to 255. The quantization rule is as follows (as shown in FIG. 2): the range of the sampled data is divided uniformly into 255 interval values x, and the mapped value y is obtained from the interval in which a sample falls and the curve in the figure; y is the quantized sample value. The larger the sample value, the further along the axis its interval lies, the smaller the corresponding slope, and the closer together the mapped y values become. Small signal values are therefore mapped in a finer, more spread-out way, while large signal values are mapped more coarsely and approximately, which matches the requirement that the small signal values of a speech signal be analysed with particular emphasis (a code sketch of this preprocessing chain is given after step (7) below);
(3) Framing the compressed voice file obtained in step (2); the frame length is 32 ms and the frame shift is 20 ms;
(4) Windowing the framed voice file obtained in the step (3); the window length is set to 16;
(5) Performing spectrum analysis on the windowed voice file obtained in step (4), thereby obtaining the corresponding spectrum data; FFT processing may be adopted, with FFT length N = 256;
(6) Carrying out Mel filtering treatment on the frequency spectrum data obtained in the step (5);
In a specific setting, the sampling frequency is f_s = 8000 Hz and the lowest frequency of the filter range is f_l = 0; according to the Nyquist sampling theorem, the highest frequency of the filter range is f_h = f_s/2 = 4000 Hz. The number of filters is set to M = 24. The value of each Mel filter can be calculated from the transfer function of the corresponding band-pass filter, and the filters together form a Mel filter bank;
(7) Storing the spectrum data subjected to the Mel filtering processing in the step (6), so as to obtain Mel spectrum characteristics of each group of voice files;
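The preprocessing chain of step S4 (μ-law compression, framing, windowing, FFT, Mel filtering) might be sketched as below using the parameter values quoted above; the use of librosa for the Mel filter bank, the Hann window and the helper names are assumptions rather than details taken from the patent, and the 8000 Hz rate quoted for the Mel filter bank is used throughout the sketch for simplicity even though 44 kHz is quoted for the initial sampling.

```python
# A minimal sketch of the step S4 preprocessing chain, assuming librosa for the
# Mel filter bank and a Hann window; parameters follow the values quoted above.
import numpy as np
import librosa

MU = 255          # mu-law parameter
N_FFT = 256       # FFT length
N_MELS = 24       # number of Mel filters
SR = 8000         # rate used for the Mel filter bank (so f_h = 4000 Hz)

def mu_law_compress(x, mu=MU):
    """Mu-law companding of samples normalised to [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mel_spectrum(samples, sr=SR, frame_ms=32, shift_ms=20):
    """Frame, window, FFT and Mel-filter one compressed utterance."""
    frame_len = int(sr * frame_ms / 1000)       # 256 samples at 8 kHz
    hop_len = int(sr * shift_ms / 1000)         # 160 samples at 8 kHz
    assert len(samples) >= frame_len, "utterance shorter than one frame"
    window = np.hanning(frame_len)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=N_FFT, n_mels=N_MELS)

    n_frames = 1 + (len(samples) - frame_len) // hop_len
    feats = []
    for i in range(n_frames):
        frame = samples[i * hop_len : i * hop_len + frame_len] * window
        spec = np.abs(np.fft.rfft(frame, n=N_FFT))   # magnitude spectrum
        feats.append(mel_fb @ spec)                  # Mel filtering
    return np.stack(feats)                           # shape (frames, N_MELS)

# usage: mel = mel_spectrum(mu_law_compress(waveform))
```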
s5, performing dynamic-programming feature point alignment on the Mel spectrum characteristics of each group of voice files obtained in step S4, so that the pairs of points on the shortest path of each group of voice files are mapped to each other and used as training pairs; the alignment is performed by the following steps:
1) Setting a neutral voice Mel spectrum sequence as X and an emotion voice Mel spectrum sequence as Y;
2) Establishing two sequences of Euclidean distance matrices (shown in FIG. 3 (a));
3) Finding the shortest distance from the element in the top left most corner of the matrix to the element in the bottom right most corner of the matrix (as shown in fig. 3 (b));
4) Acquiring the coordinates on the path corresponding to the shortest distance in step 3) and marking them as parallel corresponding points; as shown in FIG. 3, the X-Y corresponding pairs (0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6) and (6, 7) are obtained in sequence; these coordinates are the parallel corresponding points fed in during training;
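A minimal sketch of the dynamic-programming alignment of step S5 follows; the frame-wise Euclidean distance and the backtracking rule are the usual DTW choices and are assumptions where the patent leaves the details open.

```python
# A minimal DTW sketch for step S5: align a neutral Mel spectrum sequence X with
# an emotional Mel spectrum sequence Y and return the parallel corresponding points.
import numpy as np

def dtw_pairs(X, Y):
    """X: (n, d) neutral Mel frames, Y: (m, d) emotional Mel frames.
    Returns the list of (i, j) index pairs on the shortest path."""
    n, m = len(X), len(Y)
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # Euclidean matrix

    # accumulated cost from the top-left element to every cell
    acc = np.full((n, m), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                 # vertical step
                acc[i, j - 1] if j > 0 else np.inf,                 # horizontal step
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # diagonal step
            )
            acc[i, j] = dist[i, j] + best_prev

    # backtrack from the bottom-right element to recover the shortest path
    i, j, path = n - 1, m - 1, [(n - 1, m - 1)]
    while (i, j) != (0, 0):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((c for c in candidates if c[0] >= 0 and c[1] >= 0),
                   key=lambda c: acc[c])
        path.append((i, j))
    return path[::-1]   # e.g. [(0, 0), (1, 1), ..., (6, 6), (6, 7)] as in FIG. 3
```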
s6, constructing an emotion Mel spectrum conversion model; the parallel corresponding points obtained in the step S5 are used as input data and are input into a CNN network model for training, so that a final emotion Mel spectrum conversion model is obtained; in specific implementation, the Mel spectrum dimension is set to 256, and 4 convolution modules are used;
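The step S6 conversion model is described only as a CNN operating on 256-dimensional Mel spectra with 4 convolution modules; the sketch below is one speculative PyTorch reading of that description, with the kernel sizes, channel counts and training loss being assumptions.

```python
# A speculative PyTorch sketch of the step S6 conversion model: 4 convolution
# modules mapping neutral Mel frames (256-dim) to emotional Mel frames.
import torch
import torch.nn as nn

class MelConversionCNN(nn.Module):
    def __init__(self, mel_dim=256, hidden=256, n_modules=4):
        super().__init__()
        blocks = []
        in_ch = mel_dim
        for _ in range(n_modules):
            blocks += [nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1),
                       nn.BatchNorm1d(hidden),
                       nn.ReLU()]
            in_ch = hidden
        self.body = nn.Sequential(*blocks)
        self.head = nn.Conv1d(hidden, mel_dim, kernel_size=1)

    def forward(self, neutral_mel):          # neutral_mel: (batch, mel_dim, frames)
        return self.head(self.body(neutral_mel))

# training pairs come from the DTW-aligned parallel corresponding points of step S5;
# an L1/L2 loss between predicted and target emotional Mel frames is an assumption.
```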
s7, constructing a wavenet voice synthesis model; the construction method comprises the following steps:
I, adopting the following formula as the causal prediction formula:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})
wherein x_i is the value of the pitch frequency feature at time point i, T is the current time point, and p(x) is the prediction probability at the current time point;
II. Use of extended causal convolution model (as shown in FIG. 4):
wherein the lowest layer is the input layer and the uppermost layer is the output layer; the receptive field is enlarged in an exponentially increasing manner;
For speech signals, the current predicted value depends strongly on the outputs at earlier time points, because semantics ties successive words together. Causal convolution exploits this property when predicting the current node: the output of the node at the current time point is built from the input of the current node and the outputs of the nodes at earlier time points. In conventional causal convolution, the receptive field grows linearly with the number of layers: the number of nodes in the receptive field equals the number of network layers, since each connecting layer simply passes the output of one node to the next node as input until the prediction is produced at the highest layer. Dilated (extended) convolution builds on conventional causal convolution but does not predict from consecutive nodes; instead, several nodes jointly produce the current predicted value as a group, and the receptive field grows exponentially: the number of nodes in the receptive field equals 2 raised to the power of the number of layers. Because the receptive field of each layer is multiplied in this way, the output of the current time node is correlated with the outputs of many more earlier time nodes, which yields a better prediction (a code sketch of items II-IV is given after item IV below);
III, adopting residual skip connections:
connecting the outputs of every few layers, and applying residual compensation to the input through 1x1 convolution kernels; the 1x1 convolution kernels of the channels are finally decomposed into several 3x3 convolution kernels, as shown in FIG. 5;
IV, adopting the following formula as the conditional input model:
p(x | h) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1}, h)
wherein h is the emotion Mel spectrum feature produced by the final emotion Mel spectrum conversion model obtained in step S6, and x is the pitch frequency feature matrix obtained in step S3;
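As an illustration of items II-IV, the sketch below shows a dilated causal convolution whose receptive field doubles with each layer, wrapped in one conditional residual block that gates the dilated output with the emotion Mel spectrum condition h and applies 1x1 residual and skip projections. The PyTorch framing, the gated activation, the channel widths and the layer count are assumptions; the patent only gives the high-level description and formulas, and its further decomposition of the 1x1 kernels into 3x3 kernels is noted in a comment but not implemented here.

```python
# A speculative PyTorch sketch of items II-IV: dilated causal convolution,
# residual/skip connections, and conditioning on the emotion Mel spectrum h.
import torch
import torch.nn as nn
import torch.nn.functional as F

def causal_pad(x, dilation):
    """Left-pad so the output at time t depends only on inputs at times <= t."""
    return F.pad(x, (dilation, 0))

class ConditionalResidualBlock(nn.Module):
    """One wavenet-style block: gated dilated causal convolution with condition h,
    a 1x1 residual projection and a 1x1 skip projection."""
    def __init__(self, channels, cond_dim, dilation):
        super().__init__()
        self.dilation = dilation
        self.filter_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.cond_filter = nn.Conv1d(cond_dim, channels, 1)  # h -> filter branch
        self.cond_gate = nn.Conv1d(cond_dim, channels, 1)    # h -> gate branch
        self.res_proj = nn.Conv1d(channels, channels, 1)     # residual compensation
        self.skip_proj = nn.Conv1d(channels, channels, 1)    # skip connection
        # the patent further decomposes these 1x1 kernels into several 3x3 kernels
        # (xception-style); plain 1x1 projections are kept here for brevity.

    def forward(self, x, h):                 # x: (B, C, T), h: (B, cond_dim, T)
        xp = causal_pad(x, self.dilation)
        z = torch.tanh(self.filter_conv(xp) + self.cond_filter(h)) * \
            torch.sigmoid(self.gate_conv(xp) + self.cond_gate(h))
        return x + self.res_proj(z), self.skip_proj(z)

# a stack whose dilation doubles per layer: with 8 layers and kernel size 2 the
# receptive field is 1 + (1 + 2 + 4 + ... + 128) = 256 time steps (2**8).
blocks = nn.ModuleList(
    ConditionalResidualBlock(channels=64, cond_dim=256, dilation=2 ** i)
    for i in range(8)
)
```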
s8, using the emotion Mel spectrum conversion model obtained in step S6 as the forward network and the wavenet speech synthesis model constructed in step S7 as the backward network to output the final emotion voice file; specifically, the final emotion voice file is output by the following steps:
i, inputting neutral voice;
ii, using the pitch frequency feature matrix obtained in step S3 as an input feature of the subsequent wavenet model;
iii, using the Mel spectrum characteristics obtained in step S4 as input features of the subsequent emotion Mel spectrum conversion model;
iv, converting neutral Mel spectrum data into emotion Mel spectrum data according to the final emotion Mel spectrum conversion model obtained in the step S6;
v. taking the pitch frequency feature matrix obtained in step S3 as the basic input and the emotion Mel spectrum data obtained in step iv as the conditional input, and feeding both into the wavenet speech synthesis model obtained in step S7, thereby obtaining the final emotion voice file.
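Putting steps i-v together, the conversion-time data flow might look like the following sketch; the callable interfaces (f0_extractor, mel_extractor, conversion_model, wavenet_synthesize) are hypothetical and not names used by the patent.

```python
# A high-level sketch of the step S8 inference chain; all processing stages are
# passed in as callables, so the interfaces below are assumptions for illustration.
def convert_neutral_to_emotional(neutral_wav_path,
                                 f0_extractor,        # step S3 pitch extraction
                                 mel_extractor,       # step S4 Mel preprocessing
                                 conversion_model,    # step S6 Mel conversion model
                                 wavenet_synthesize): # step S7 wavenet synthesiser
    """Step S8 inference chain: neutral speech in, emotional speech out."""
    f0 = f0_extractor(neutral_wav_path)               # step ii: pitch features
    neutral_mel = mel_extractor(neutral_wav_path)     # step iii: neutral Mel spectrum
    emotional_mel = conversion_model(neutral_mel)     # step iv: Mel conversion
    # step v: pitch matrix as basic input, emotional Mel spectrum as condition input
    # (tensor conversion and batching details are omitted for brevity)
    return wavenet_synthesize(f0, condition=emotional_mel)
```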
By matching and aligning the neutral Mel spectrum with the emotion Mel spectrum, the method improves the quality of the parallel correspondence between the Mel spectrum features, so that low-dimensional features carrying less information can still yield high-precision predictions. The mappings of channel correlation and spatial correlation in the feature mapping of the convolutional neural network are fully decoupled: every 1x1 convolution kernel can be connected, and the corresponding 3x3 convolution kernels realise the fully decoupled computation of channel correlation. Combined with the residual skip connection mechanism of wavenet, the convolution scheme is optimised while the residual compensation mechanism is retained; the data show that the accuracy of xception combined with the wavenet residual mechanism is improved (as shown in FIG. 6 (a)). In addition, the Mel spectrum and the pitch frequency are used as the feature inputs of the synthesised speech; in a traditional model such low-dimensional features would improve efficiency but lose a great deal of the original information and reduce accuracy, whereas with the wavenet model the enlarged receptive field of the dilated convolution and the causal prediction mechanism greatly improve the prediction accuracy of the low-dimensional features. Moreover, experiments show that the optimised wavenet reaches the training convergence state faster: as shown in FIG. 6 (b), the optimised wavenet converges faster than the original version.

Claims (1)

1. A wavenet-based emotion voice conversion method, comprising the following steps:
s1, acquiring a voice file to form a corpus;
s2, dividing the voice data in the corpus obtained in the step S1 into a neutral voice file and an emotion voice file, and dividing voices with the same content into the same group; the method comprises the following steps:
A. extracting a plurality of emotionally coloured voice files with the same content as the training set;
B. obtaining text information of each sentence by adopting an ASR tool;
C. for the text information obtained in step B, grouping the voice files that have the same text but different emotions together with the corresponding neutral voice file;
D. dividing a plurality of groups of files into rows to form a training matrix; one group is a row;
s3, extracting the pitch frequency acoustic feature from the voice files grouped in step S2; the extraction comprises the following steps:
a. dividing the training matrix obtained in the step S2 into rows;
b. inputting data of the training matrix into a vocoder decoder in units of rows;
c. grouping the pitch frequencies output by the vocoder decoder in step b, so that the pitch frequencies corresponding to the voice files with the same text but different emotions form a group together with the pitch frequency corresponding to the neutral voice file, thereby obtaining a pitch frequency feature matrix;
s4, preprocessing the voice files grouped in the step S2, so as to obtain the Mel spectrum characteristics of each group of voice files; specifically, the mel spectrum characteristics of each group of voice files are obtained by the following steps:
(1) Dividing the training matrix obtained in the step S2 into rows;
(2) Sampling the voice file corresponding to each row of the training matrix, row by row, at a set sampling frequency, and applying μ-law compression (13-segment polyline approximation);
(3) Carrying out framing treatment on the compressed voice file obtained in the step (2);
(4) Windowing the framed voice file obtained in the step (3);
(5) Performing spectrum analysis on the windowed voice file obtained in the step (4), so as to obtain corresponding spectrum data;
(6) Carrying out Mel filtering treatment on the frequency spectrum data obtained in the step (5);
(7) Storing the spectrum data subjected to the Mel filtering processing in the step (6), so as to obtain Mel spectrum characteristics of each group of voice files;
s5, aiming at the Mel spectrum characteristics of each group of voice files obtained in the step S4, carrying out dynamic programming characteristic point alignment, so that two points of the shortest path corresponding to each group of voice files are mapped and used as training pairs; the method comprises the following steps of:
1) Setting a neutral voice Mel spectrum sequence as X and an emotion voice Mel spectrum sequence as Y;
2) Establishing Euclidean distance matrixes of two sequences;
3) Finding the shortest distance from the element in the top left corner of the matrix to the element in the bottom right corner of the matrix;
4) Acquiring coordinates on the path corresponding to the shortest distance in the step 3), and marking the coordinates as parallel corresponding points;
s6, constructing an emotion Mel spectrum conversion model; the parallel corresponding points obtained in the step S5 are used as input data and are input into a CNN network model for training, so that a final emotion Mel spectrum conversion model is obtained;
s7, constructing a wavenet voice synthesis model; the construction method comprises the following steps:
I, adopting the following formula as the causal prediction formula:
p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})
wherein x_i is the value of the pitch frequency feature at time point i, T is the current time point, and p(x) is the prediction probability at the current time point;
II, adopting an extended causal convolution model:
III, adopting residual skip connections:
connecting the outputs of every few layers, and applying residual compensation to the input through 1x1 convolution kernels; the 1x1 convolution kernels of the channels are finally decomposed into several 3x3 convolution kernels;
IV, adopting the following formula as the conditional input model:
p(x | h) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1}, h)
wherein h is the emotion Mel spectrum feature produced by the final emotion Mel spectrum conversion model obtained in step S6, and x is the pitch frequency feature matrix obtained in step S3;
s8, using the emotion Mel spectrum conversion model obtained in step S6 as the forward network and the wavenet speech synthesis model constructed in step S7 as the backward network to output the final emotion voice file; specifically, the final emotion voice file is output by the following steps:
i, inputting neutral voice;
ii, using the pitch frequency feature matrix obtained in step S3 as an input feature of the subsequent wavenet model;
iii, taking the Mel spectrum characteristics obtained in the step S4 as input characteristics of a later emotion Mel spectrum conversion model;
iv, converting neutral Mel spectrum data into emotion Mel spectrum data according to the final emotion Mel spectrum conversion model obtained in the step S6;
v. taking the pitch frequency feature matrix obtained in step S3 as the basic input and the emotion Mel spectrum data obtained in step iv as the conditional input, and feeding both into the wavenet speech synthesis model obtained in step S7, thereby obtaining the final emotion voice file.
CN202010229173.6A 2020-03-27 2020-03-27 Emotion voice conversion method based on wavenet Active CN111402923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010229173.6A CN111402923B (en) 2020-03-27 2020-03-27 Emotion voice conversion method based on wavenet


Publications (2)

Publication Number Publication Date
CN111402923A CN111402923A (en) 2020-07-10
CN111402923B true CN111402923B (en) 2023-11-03

Family

ID=71429205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010229173.6A Active CN111402923B (en) 2020-03-27 2020-03-27 Emotion voice conversion method based on wavenet

Country Status (1)

Country Link
CN (1) CN111402923B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2489473B (en) * 2011-03-29 2013-09-18 Toshiba Res Europ Ltd A voice conversion method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101666930B1 (en) * 2015-04-29 2016-10-24 서울대학교산학협력단 Target speaker adaptive voice conversion method using deep learning model and voice conversion device implementing the same
CN109767778A (en) * 2018-12-27 2019-05-17 中国人民解放军陆军工程大学 A kind of phonetics transfer method merging Bi-LSTM and WaveNet
KR102057927B1 (en) * 2019-03-19 2019-12-20 휴멜로 주식회사 Apparatus for synthesizing speech and method thereof
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110619867A (en) * 2019-09-27 2019-12-27 百度在线网络技术(北京)有限公司 Training method and device of speech synthesis model, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Heejin Choi et al. Emotional Speech Synthesis for Multi-Speaker Emotional Dataset Using WaveNet Vocoder. 2019 IEEE International Conference on Consumer Electronics, 2019, pp. 1-2. *
Zhaojie Luo et al. Emotional Voice Conversion Using Dual Supervised Adversarial Networks With Continuous Wavelet Transform F0 Features. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, pp. 1535-1538. *
Lu Guanming et al. Speech emotion recognition based on long short-term memory and convolutional neural networks. Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), No. 05, pp. 63-68. *
Zhang Yaqiang. Emotional speech synthesis based on transfer learning and self-learned emotion representation. China Master's Theses Full-text Database, 2019, pp. 44-48. *

Also Published As

Publication number Publication date
CN111402923A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN110992987B (en) Parallel feature extraction system and method for general specific voice in voice signal
CN108597539B (en) Speech emotion recognition method based on parameter migration and spectrogram
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
EP0342630A2 (en) Speech recognition with speaker adaptation by learning
CN111402928B (en) Attention-based speech emotion state evaluation method, device, medium and equipment
WO2022141842A1 (en) Deep learning-based speech training method and apparatus, device, and storage medium
CN110634476B (en) Method and system for rapidly building robust acoustic model
CN109902164B (en) Method for solving question-answering of open long format video by using convolution bidirectional self-attention network
CN113140220B (en) Lightweight end-to-end speech recognition method based on convolution self-attention transformation network
CN113987179A (en) Knowledge enhancement and backtracking loss-based conversational emotion recognition network model, construction method, electronic device and storage medium
CN112259119B (en) Music source separation method based on stacked hourglass network
CN107316635A (en) Audio recognition method and device, storage medium, electronic equipment
CN111061951A (en) Recommendation model based on double-layer self-attention comment modeling
CN113539232A (en) Muslim class voice data set-based voice synthesis method
CN116486794A (en) Chinese-English mixed speech recognition method
CN111724809A (en) Vocoder implementation method and device based on variational self-encoder
CN111402923B (en) Emotion voice conversion method based on wavenet
CN113808581A (en) Chinese speech recognition method for acoustic and language model training and joint optimization
CN111951778B (en) Method for emotion voice synthesis by utilizing transfer learning under low resource
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
CN112380874B (en) Multi-person-to-speech analysis method based on graph convolution network
CN115171878A (en) Depression detection method based on BiGRU and BiLSTM
CN115222059A (en) Self-distillation model compression algorithm based on high-level information supervision
CN115019785A (en) Streaming voice recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant