CN114023343A - Voice conversion method based on semi-supervised feature learning - Google Patents
Voice conversion method based on semi-supervised feature learning
- Publication number
- CN114023343A (application CN202111277502.5A)
- Authority
- CN
- China
- Prior art keywords
- encoder
- voice
- network
- speaker
- voice data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Abstract
The invention provides a voice conversion method based on semi-supervised feature learning. First, the voice data in the training set are preprocessed with the open-source audio library librosa to obtain an expanded set of acoustic feature segments, and acoustic features characterizing speaker identity are extracted in advance by an encoder trained with the generalized end-to-end loss. Next, a voice conversion network is constructed, consisting of a variational autoencoder, a decoder and a post-network, and is trained on the constructed data set, with the network loss constrained by a mean square error under the supervision information. Finally, the trained network processes the source and target speech data to be converted and outputs the converted speech. Building on the variational autoencoder structure, the invention introduces semi-supervised feature learning, extracts speaker identity information accurately, solves voice conversion among multiple speakers on non-parallel corpus data, and generalizes well.
Description
Technical Field
The invention belongs to the field of deep learning, and particularly relates to a voice conversion method based on semi-supervised feature learning.
Background
Voice conversion aims to convert the timbre of a source speaker into that of a target speaker while keeping the linguistic content of the source speech unchanged. A typical voice conversion system works as follows: given one utterance from the target speaker and one from the source speaker, the system automatically extracts the linguistic content from the source data, extracts the speaker embedding (i.e., the speaker's timbre information) from the target data, and recombines the two to generate new target speech. In voice conversion, data recorded by different speakers uttering different content constitute non-parallel corpus data. Because such data is cheap, easy to collect and close to real application scenarios, it is widely used in voice conversion. Since the linguistic content of the target and source speakers differs in non-parallel data, the source timbre must be converted to the target timbre while the content is preserved. Two challenges therefore arise when converting with non-parallel corpus data: first, a model built on a non-parallel corpus can hardly learn an accurate mapping from the source speaker to the target speaker, which degrades the conversion quality; second, for conversion among multiple speakers, if the speech of the speaker under test never appeared in the training set, the converted speech scores poorly in both naturalness and similarity. These two points are the urgent problems of non-parallel voice conversion.
To learn the mapping between the source and target speakers accurately, Takuhiro et al. first introduced the CycleGAN (cycle-consistent generative adversarial network) method into the voice conversion task in "T. Kaneko and H. Kameoka, CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks, European Signal Processing Conference (EUSIPCO), pp. 2100-2104, 2018". The cycle-consistent generative adversarial network alleviates the difficulty of mapping between the source and target domains. The method uses the adversarial loss and the cycle-consistency loss as criteria to convert the source speaker's timbre into the target speaker's.
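The cycle-consistency criterion at the core of this line of work can be sketched in a few lines of numpy; the forward and backward mappings below are illustrative stand-ins, not the actual CycleGAN-VC conversion networks.

```python
import numpy as np

def cycle_consistency_loss(x, g_xy, g_yx):
    """L1 cycle-consistency term: mapping a source feature to the target
    domain and back should reproduce the original input."""
    return float(np.mean(np.abs(g_yx(g_xy(x)) - x)))

# A pair of mutually inverse toy mappings yields zero cycle loss.
x = np.linspace(-1.0, 1.0, 8)
loss = cycle_consistency_loss(x, lambda v: 2.0 * v, lambda v: v / 2.0)
```

In training, this term is minimized jointly with the adversarial loss so that the learned mappings stay content-preserving.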
To achieve voice conversion between arbitrary speakers, Chou et al. proposed a variational-autoencoder-based method in "Ju-chieh Chou, Cheng-chieh Yeh, and Hung-yi Lee, One-Shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization, in Proc. Interspeech, 2019". Taking the mean square error as the criterion, the method adopts a variational autoencoder structure: two encoders at the input end extract, through unsupervised learning, the linguistic content and the speaker embedding from speech respectively, and a decoder at the output end combines them to generate new speech, i.e., the voice of the target speaker. Because the encoders learn through training to separate the speech content from the speaker embedding, given target and source speech data the content encoder automatically extracts the content representation from the source speech, the speaker encoder extracts the speaker embedding from the target speech, and the decoder combines the two representations into new speech data. This method extends to voice conversion among multiple speakers.
The two methods above solve part of the non-parallel voice conversion problem but remain limited. The first can learn the source-to-target mapping accurately, but it usually handles conversion between only two speakers, its training is complex and prone to vanishing gradients, and it does not extend to conversion among multiple speakers. In the second, although the variational autoencoder is structurally simple and easy to implement, the content representation it extracts still contains a small amount of speaker embedding information, so the similarity of the conversion result is poor.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a voice conversion method based on semi-supervised feature learning. First, the voice data in the training set are preprocessed with the open-source audio library librosa to obtain an expanded set of acoustic feature segments, and acoustic features characterizing speaker identity are extracted in advance by an encoder trained with the generalized end-to-end loss. Next, a voice conversion network consisting of a variational autoencoder, a decoder and a post-network is constructed and trained on the constructed data set, with the network loss constrained by a mean square error under the supervision information. Finally, the trained network processes the source and target speech to be converted and outputs the converted speech. Building on the variational autoencoder structure, the invention introduces semi-supervised feature learning, extracts speaker identity information accurately, solves voice conversion among multiple speakers on non-parallel corpus data, and generalizes well.
A voice conversion method based on semi-supervised feature learning is characterized by comprising the following steps:
step 1: preprocess each piece of voice data in the training set with the open-source audio library librosa: read in the voice data; apply pre-emphasis, windowing and framing to each piece; take the short-time Fourier transform of each frame to convert it from a time-domain signal to a frequency-domain signal; screen the frequency-domain data to retain speech segments of the required length; all preprocessed speech segments in the training set form the acoustic feature segment set;
randomly select fewer than half of the speakers in the training set, input their voice data to an encoder designed with the generalized end-to-end loss, and extract acoustic features characterizing speaker identity; the encoder consists of a long short-term memory (LSTM) network layer and a linear layer, where the input, output and hidden dimensions of the LSTM layer are 80, 256 and 256 respectively, the input and output dimensions of the linear layer are both 256, the activation function of the linear layer is the ReLU function, and the encoder is trained under the end-to-end loss constraint;
step 2: construct a voice conversion network comprising a variational autoencoder, a decoder and a post-network, where the variational autoencoder has two branches, a speaker encoder and a content encoder; the speaker encoder consists of two LSTM layers with 768 units each and extracts speaker identity information from the input speech; the content encoder consists of three 5×1 convolutional layers, two bidirectional LSTM layers with 32 units each and one instance normalization layer, and extracts the speech content representation from the input speech; the decoder consists of three 5×1 convolutional layers and three LSTM layers with 1024 units each, and produces new speech data from the speaker identity information extracted by the speaker encoder and the content representation extracted by the content encoder; the post-network consists of five 5×1 convolutional layers, extracts a residual signal from the decoder output and adds it to the decoder output to obtain the reconstructed speech data;
the loss function of the voice conversion network is set as follows:
L=Lcon+Lspe+Lreco (1)
wherein L represents the total loss of the network, LconRepresents content encoder loss, LspeIndicating a loss of speaker identity information, LrecoExpressing the self-weight loss, and respectively calculating according to the following formulas:
wherein, E [. C]Which represents the mathematical expectation of the calculation,which represents the output of the decoder, and,is expressed as inputOutput of the temporal content encoder, ZcIn the representationOutput of the capacity encoder, ZsiRepresents the output of the speaker's encoder,representing the speaker identity information extracted by a generalized end-to-end method, i representing the speaker serial number, x representing the initial input voice data of the network, Es(x) Representing the output of the speaker's encoder when the input is x, D (E)s(x),Zc) Denotes a reaction of Es(x) And ZcInputting the output after the decoder;
the specific processing procedure of the example normalization layer is as follows:
first, the mean value of each channel represented by the speech content is calculated as follows:
wherein, mucMean of the c-th channel, W the array dimension of each channel, Mc[ω]Represents the ω -th element in the c-th channel; c is 1,2, …, C and C represents the number of channels;
then, the variance of each channel is calculated as follows:
wherein σcThe variance of the channel c is shown, epsilon represents an adjusting parameter, and the value range is (0, 1);
finally, the channel array M is expressed by the following formulacEach element in (a) is normalized:
wherein, M'c[ω]Represents the value of the ω -th element in the normalized c-th channel; c1, 2, …, C, ω 1,2, …, W;
step 3: set the network parameters: the batch size for data reading is 32, the initial learning rate is 0.001, and the number of training iterations is 500,000; input the speech segments in the acoustic feature segment set obtained in step 1 into the voice conversion network constructed in step 2 for training, obtaining the trained voice conversion network;
step 4: input the source and target speech data to be converted into the trained voice conversion network obtained in step 3; the speaker encoder extracts the speaker features from the target speech, the content encoder extracts the speech content representation from the source speech, and the decoder and post-network output the converted speech data.
The beneficial effects of the invention are as follows. Building on the variational autoencoder structure, it introduces semi-supervised feature learning and combines it with the simple structure of the variational autoencoder. The speaker embedding extracted by the model pre-trained under the generalized end-to-end loss serves as the supervision information for the speaker encoder in the voice conversion network, with the mean square error loss as the constraint, so the speaker encoder can extract speaker identity information accurately. Because the content encoder adds an instance normalization layer to remove speaker information, it extracts more accurate speech content information. The invention solves voice conversion among multiple speakers on non-parallel corpus data; using the pre-extracted speaker timbre information to supervise the features extracted by the speaker encoder lets the encoder extract timbre information accurately and complete conversion with higher similarity, extends to timbre conversion among multiple speakers, and converts well even for speakers absent from the training data. The trained network can change the speaker identity independently and can also perform conversion when speaker data is scarce.
Drawings
FIG. 1 is a flow chart of the speech conversion method based on semi-supervised feature learning according to the present invention.
Detailed Description
The present invention will be further described with reference to the drawings and embodiments; the scope of the invention includes, but is not limited to, the following embodiments. As shown in FIG. 1, the speech conversion method based on semi-supervised feature learning provided by the invention is implemented as follows:
1. extracting acoustic features of source and target speech data
Each piece of voice data in the training set is preprocessed with the open-source audio library librosa: the voice data is read in; each piece undergoes pre-emphasis, windowing and framing; each frame is converted from a time-domain signal to a frequency-domain signal by the short-time Fourier transform; the frequency-domain data is screened to retain speech segments of the required length; and all preprocessed segments in the training set form the acoustic feature segment set. The library librosa is described in detail in "Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, Oriol Nieto, librosa: Audio and Music Signal Analysis in Python, Proceedings of the 14th Python in Science Conference, 2015".
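As an illustration of this preprocessing, the sketch below reproduces the pipeline with plain numpy instead of librosa (librosa's `effects.preemphasis` and `stft` cover the same ground); the frame length, hop size, pre-emphasis coefficient and minimum segment length are illustrative assumptions, not values fixed by the text.

```python
import numpy as np

def preprocess(wave, frame_len=400, hop=160, min_frames=128):
    """Pre-emphasis, framing, Hann windowing, magnitude STFT, and length
    screening. A numpy approximation of the librosa-based preprocessing;
    the parameter values here are illustrative choices.
    """
    # Pre-emphasis: y[t] = x[t] - 0.97 * x[t-1]
    emphasized = np.append(wave[0], wave[1:] - 0.97 * wave[:-1])
    # Framing
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing + short-time Fourier transform (time -> frequency domain)
    window = np.hanning(frame_len)
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1))
    # Screening: keep only utterances long enough to yield a full segment
    return spectrum if n_frames >= min_frames else None

# A 2-second waveform at 16 kHz passes the length screen
spec = preprocess(np.random.randn(32000))
```

Segments that survive the screen are what populate the acoustic feature segment set.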
No more than half of the speakers are randomly selected from the training set, and an encoder is designed with the generalized end-to-end loss method: it consists of an LSTM layer with input dimension 80, output dimension 256 and hidden dimension 256, followed by a linear layer with input and output dimension 256. The activation function of the linear layer is the ReLU function. The selected speakers' data is fed to the encoder, which, under the end-to-end loss constraint, extracts speaker embeddings that characterize speaker identity; in subsequent processing these acoustic features serve as the supervision information for the speaker encoder, guiding it to produce correct speaker identity information. The generalized end-to-end loss method is described in detail in "Wan L., Wang Q., Papir A., and Moreno I.L., Generalized End-to-End Loss for Speaker Verification, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879-4883, 2018".
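The core of the generalized end-to-end loss can be sketched as follows: each utterance embedding is pulled toward its own speaker's centroid and pushed away from other speakers' centroids through a softmax over scaled cosine similarities. This is a minimal numpy sketch; the scale `w` and offset `b` are learned parameters in the original method but fixed here, and the LSTM encoder that produces the embeddings is omitted.

```python
import numpy as np

def ge2e_loss(emb, w=10.0, b=-5.0):
    """Generalized end-to-end (GE2E) softmax loss.

    emb: array of shape (N_speakers, M_utterances, D); embeddings are
    L2-normalized below. Returns the mean per-utterance loss.
    """
    N, M, D = emb.shape
    emb = emb / np.linalg.norm(emb, axis=2, keepdims=True)
    centroids = emb.mean(axis=1)
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    loss = 0.0
    for j in range(N):
        for i in range(M):
            # The true speaker's centroid excludes the utterance itself
            c_self = (emb[j].sum(axis=0) - emb[j, i]) / (M - 1)
            c_self = c_self / np.linalg.norm(c_self)
            sims = w * centroids @ emb[j, i] + b      # similarity to all centroids
            sims[j] = w * c_self @ emb[j, i] + b      # self-excluding own centroid
            # Softmax cross-entropy against the true speaker j
            loss += np.log(np.exp(sims).sum()) - sims[j]
    return loss / (N * M)
```

Embeddings that cluster tightly by speaker drive this loss toward zero, which is what makes the resulting vectors usable as speaker identity supervision.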
2. Building a voice conversion network
The voice conversion network comprises a variational autoencoder, a decoder and a post-network. The variational autoencoder has two branches, a speaker encoder and a content encoder. The speaker encoder consists of two LSTM layers with 768 units each; it extracts speaker identity information from the input speech, and its output is a 256-dimensional vector. The content encoder consists of three 5×1 convolutional layers, two bidirectional LSTM layers with 32 units each and one instance normalization layer; it extracts the speech content representation from the input speech, and its output is a 64-dimensional vector. The decoder consists of three 5×1 convolutional layers and three LSTM layers with 1024 units each; the speaker identity information extracted by the speaker encoder and the content representation extracted by the content encoder are fed to the decoder to obtain new speech data. The post-network consists of five 5×1 convolutional layers; it extracts a residual signal from the decoder output and adds it to the decoder output to enrich the spectral details and generate a high-quality spectrogram, i.e., the reconstructed speech data.
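The layer configuration just described can be collected into a single sketch. The field names are illustrative; the numbers are taken from the text, and reading the 64-dimensional content output as two directions × 32 BiLSTM units is an inference, not something the text states.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VCNetConfig:
    """Layer configuration of the voice conversion network as described
    in the text; field names are illustrative."""
    # Speaker encoder: 2 LSTM layers, 768 units each -> 256-d embedding
    speaker_lstm_layers: int = 2
    speaker_lstm_units: int = 768
    speaker_embedding_dim: int = 256
    # Content encoder: 3 conv (5x1) + 2 BiLSTM (32 units) + instance norm
    content_conv: List[Tuple[int, int]] = field(default_factory=lambda: [(5, 1)] * 3)
    content_bilstm_layers: int = 2
    content_bilstm_units: int = 32
    content_dim: int = 64  # assumed: 2 directions x 32 units
    # Decoder: 3 conv (5x1) + 3 LSTM (1024 units)
    decoder_conv: List[Tuple[int, int]] = field(default_factory=lambda: [(5, 1)] * 3)
    decoder_lstm_layers: int = 3
    decoder_lstm_units: int = 1024
    # Post-network: 5 conv (5x1); its residual output is added back
    # to the decoder output to form the reconstructed speech
    postnet_conv: List[Tuple[int, int]] = field(default_factory=lambda: [(5, 1)] * 5)

cfg = VCNetConfig()
```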
The loss function of the voice conversion network is set as follows:

L = L_con + L_spe + L_reco (8)

where L denotes the total network loss, L_con the content encoder loss, L_spe the speaker identity loss, and L_reco the self-reconstruction loss, computed respectively as:

L_con = E[ || E_c(x̂) − Z_c ||₂² ] (9)

L_spe = E[ || Z_si − Z̃_si ||₂² ] (10)

L_reco = E[ || x̂ − x ||₂² ], with x̂ = D(E_s(x), Z_c) (11)

where E[·] denotes the mathematical expectation, x̂ the output of the decoder, E_c(x̂) the output of the content encoder when its input is x̂, Z_c the output of the content encoder, Z_si the output of the speaker encoder, and Z̃_si the speaker identity information extracted by the generalized end-to-end method, which serves as the supervision information of the speaker encoder; the mean square error between Z_si and Z̃_si is computed so as to maximize their similarity, with i the speaker index. x denotes the initial input speech data of the network, E_s(x) the output of the speaker encoder when the input is x, and D(E_s(x), Z_c) the output of the decoder when E_s(x) and Z_c are fed to it.
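Assuming all three terms take the mean-square-error form that the surrounding text prescribes for the supervision constraint (the exact norms used in the patent's formulas are not fully legible), the total loss can be sketched as:

```python
import numpy as np

def total_loss(x, x_hat, z_c, z_c_hat, z_s, z_s_tilde):
    """Total loss L = L_con + L_spe + L_reco, all in MSE form (assumed).

    x         : network input speech features
    x_hat     : decoder/post-network output
    z_c       : content encoder output for x
    z_c_hat   : content encoder output when fed x_hat
    z_s       : speaker encoder output
    z_s_tilde : pre-extracted GE2E speaker embedding (supervision)
    """
    l_con = np.mean((z_c_hat - z_c) ** 2)    # content consistency
    l_spe = np.mean((z_s - z_s_tilde) ** 2)  # semi-supervised speaker term
    l_reco = np.mean((x_hat - x) ** 2)       # self-reconstruction
    return l_con + l_spe + l_reco
```

Perfect reconstruction with matching content and speaker representations drives the loss to zero; any mismatch in one branch contributes additively.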
The instance normalization layer normalizes the content extracted by the content encoder to remove the small amount of speaker information it contains. The specific procedure is as follows:

first, compute the mean of each channel of the speech content representation:

μ_c = (1/W) Σ_{ω=1}^{W} M_c[ω] (12)

where μ_c denotes the mean of the c-th channel, W the array dimension of each channel, and M_c[ω] the ω-th element of the c-th channel; c = 1, 2, …, C, where C is the number of channels;

then, compute the standard deviation of each channel:

σ_c = sqrt( (1/W) Σ_{ω=1}^{W} (M_c[ω] − μ_c)² + ε ) (13)

where σ_c denotes the standard deviation of the c-th channel and ε is an adjusting parameter with value range (0, 1);

finally, normalize each element of the channel array M_c:

M′_c[ω] = (M_c[ω] − μ_c) / σ_c (14)

where M′_c[ω] denotes the value of the ω-th element of the c-th channel after normalization; c = 1, 2, …, C, ω = 1, 2, …, W.
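This normalization writes directly in numpy; `eps` plays the role of the adjusting parameter ε:

```python
import numpy as np

def instance_norm(M, eps=1e-5):
    """Instance normalization over the W (position) axis of each channel:
    subtract the per-channel mean and divide by the per-channel standard
    deviation regularized by eps. M has shape (C, W)."""
    mu = M.mean(axis=1, keepdims=True)                           # mu_c
    sigma = np.sqrt(((M - mu) ** 2).mean(axis=1, keepdims=True) + eps)
    return (M - mu) / sigma                                      # M'_c[w]
```

After normalization every channel has zero mean and (up to the eps regularization) unit standard deviation, which strips the channel-wise statistics carrying speaker information.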
3. Network training
The network parameters are set as follows: the batch size for data reading is 32, the initial learning rate is 0.001, and the number of training iterations is 500,000. The speech segments in the acoustic feature segment set obtained in step 1 are input into the voice conversion network constructed in step 2 for training, yielding the trained voice conversion network.
4. Speech conversion
The source and target speech data to be converted are input into the trained voice conversion network obtained in step 3; the speaker encoder extracts the speaker features from the target speech, the content encoder extracts the speech content representation of the source speech, and the decoder and post-network output the converted speech data.
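The conversion data flow can be sketched end to end with random linear maps standing in for the trained encoders, decoder and post-network. Everything here is a placeholder; only the wiring mirrors the text: content from the source, a time-averaged speaker embedding from the target, and a residual post-network addition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in networks: random linear maps used only to show the data flow.
W_c = rng.standard_normal((80, 64))        # content encoder stub
W_s = rng.standard_normal((80, 256))       # speaker encoder stub
W_d = rng.standard_normal((64 + 256, 80))  # decoder stub
W_p = rng.standard_normal((80, 80))        # post-network stub

def convert(source_frames, target_frames):
    """Conversion step: content from the source, speaker identity from
    the target, recombined by the decoder, refined by the post-network."""
    z_c = source_frames @ W_c                    # per-frame content
    z_s = (target_frames @ W_s).mean(axis=0)     # utterance-level speaker embedding
    z = np.concatenate([z_c, np.tile(z_s, (len(z_c), 1))], axis=1)
    decoded = z @ W_d
    return decoded + decoded @ W_p               # residual post-network addition

out = convert(rng.standard_normal((100, 80)), rng.standard_normal((60, 80)))
```

The output keeps the source utterance's frame count while the speaker embedding, broadcast over time, carries the target timbre.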
To verify the effectiveness of the method, experiments were conducted on the open-source speech data set AISHELL-3, which contains 174 speakers with about 300 utterances each, each lasting 3-8 seconds. The 174 speakers are split into 75 whose timbre information is extracted in advance, 75 whose speaker information is extracted during training, and 24 whose speech is used for testing. The objective evaluation index of the conversion result is the MCD (Mel-Cepstral Distortion), which measures the distortion between the speech before and after conversion. The subjective evaluation uses the MOS (mean opinion score): recruited volunteers score the results, first for similarity on a 1-5 scale (1 completely dissimilar, 5 completely similar) and second for naturalness on a 1-5 scale (1 poor naturalness, 5 excellent sound quality). These two evaluations are important indicators of the performance of a voice conversion system. The comparison methods are the CycleGAN (cycle-consistent generative adversarial network) method and the Seq2seqVC (variational autoencoder) method; the evaluation results are shown in Table 1.
TABLE 1
| Method | Naturalness score | Similarity score | MCD (dB) |
| --- | --- | --- | --- |
| CycleGAN method | 3.8 | 3.9 | 3.3 |
| Seq2seqVC method | 3.6 | 3.5 | 2.8 |
| Method of the invention | 3.9 | 4.1 | 3.2 |
Claims (1)
1. A voice conversion method based on semi-supervised feature learning is characterized by comprising the following steps:
step 1: preprocess each piece of voice data in the training set with the open-source audio library librosa: read in the voice data; apply pre-emphasis, windowing and framing to each piece; take the short-time Fourier transform of each frame to convert it from a time-domain signal to a frequency-domain signal; screen the frequency-domain data to retain speech segments of the required length; all preprocessed speech segments in the training set form the acoustic feature segment set;
randomly select fewer than half of the speakers in the training set, input their voice data to an encoder designed with the generalized end-to-end loss, and extract acoustic features characterizing speaker identity; the encoder consists of a long short-term memory (LSTM) network layer and a linear layer, where the input, output and hidden dimensions of the LSTM layer are 80, 256 and 256 respectively, the input and output dimensions of the linear layer are both 256, the activation function of the linear layer is the ReLU function, and the encoder is trained under the end-to-end loss constraint;
step 2: construct a voice conversion network comprising a variational autoencoder, a decoder and a post-network, where the variational autoencoder has two branches, a speaker encoder and a content encoder; the speaker encoder consists of two LSTM layers with 768 units each and extracts speaker identity information from the input speech; the content encoder consists of three 5×1 convolutional layers, two bidirectional LSTM layers with 32 units each and one instance normalization layer, and extracts the speech content representation from the input speech; the decoder consists of three 5×1 convolutional layers and three LSTM layers with 1024 units each, and produces new speech data from the speaker identity information extracted by the speaker encoder and the content representation extracted by the content encoder; the post-network consists of five 5×1 convolutional layers, extracts a residual signal from the decoder output and adds it to the decoder output to obtain the reconstructed speech data;
the loss function of the voice conversion network is set as follows:
L=Lcon+Lspe+Lreco (1)
wherein L represents the total loss of the network, LconRepresents content encoder loss, LspeIndicating a loss of speaker identity information, LrecoExpressing the self-weight loss, and respectively calculating according to the following formulas:
wherein, E [. C]Which represents the mathematical expectation of the calculation,which represents the output of the decoder, and,is expressed as inputOutput of time content encoderOut, ZcRepresenting the output of a content encoder, ZsiRepresents the output of the speaker's encoder,representing the speaker identity information extracted by a generalized end-to-end method, i representing the speaker serial number, x representing the initial input voice data of the network, Es(x) Representing the output of the speaker's encoder when the input is x, D (E)s(x),Zc) Denotes a reaction of Es(x) And ZcInputting the output after the decoder;
the specific processing procedure of the example normalization layer is as follows:
first, the mean value of each channel represented by the speech content is calculated as follows:
wherein, mucMean of the c-th channel, W the array dimension of each channel, Mc[ω]Represents the ω -th element in the c-th channel; c is 1,2, …, C and C represents the number of channels;
then, the variance of each channel is calculated as follows:
wherein σcThe variance of the channel c is shown, epsilon represents an adjusting parameter, and the value range is (0, 1);
finally, the channel array M is expressed by the following formulacEach element in (a) is normalized:
wherein M isc′[ω]Represents the second in the c channel after normalizationOmega element values; c1, 2, …, C, ω 1,2, …, W;
and step 3: setting the network parameters, including a batch size of 32 for data reading, an initial learning rate of 0.001 and 500,000 network training iterations; the voice segments in the acoustic feature segment set obtained in step 1 are input into the voice conversion network constructed in step 2 for training, to obtain a trained voice conversion network;
and step 4: inputting the source voice data and the target voice data to be converted into the trained voice conversion network obtained in step 3; the speaker encoder extracts the speaker characteristics from the target voice data, the content encoder extracts the voice content representation of the source voice data, and the converted voice data is output through the decoder and the post-network.
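Step 4's data flow can be sketched with every trained module replaced by a hypothetical stand-in callable; the names, shapes, and arithmetic below are illustrative only, not the patent's networks:

```python
import numpy as np

rng = np.random.default_rng(7)

def speaker_encoder(mel):   # stand-in: target speaker -> identity vector
    return mel.mean(axis=1)[:48]

def content_encoder(mel):   # stand-in: source speech -> content representation
    return mel[:32, :]

def decoder(spk, content):  # stand-in: combine identity + content -> voice frames
    T = content.shape[1]
    return np.vstack([content, np.tile(spk[:, None], (1, T))])

def postnet(y):             # stand-in: residual refinement of decoder output
    return 0.1 * (y.mean(axis=0, keepdims=True) - y)

source_mel = rng.standard_normal((80, 120))  # speech whose content is kept
target_mel = rng.standard_normal((80, 90))   # speech whose speaker identity is taken

spk = speaker_encoder(target_mel)        # identity from the TARGET speaker
content = content_encoder(source_mel)    # content from the SOURCE utterance
decoded = decoder(spk, content)
converted = decoded + postnet(decoded)   # decoder output plus post-network residual
```

The key point of step 4 is the cross-pairing: identity comes from the target utterance while content comes from the source utterance, so the converted output keeps the source's words in the target's voice.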
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111277502.5A CN114023343B (en) | 2021-10-30 | 2021-10-30 | Voice conversion method based on semi-supervised feature learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114023343A true CN114023343A (en) | 2022-02-08 |
CN114023343B CN114023343B (en) | 2024-04-30 |
Family
ID=80059050
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114283824A (en) * | 2022-03-02 | 2022-04-05 | 清华大学 | Voice conversion method and device based on cyclic loss |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111712874A (en) * | 2019-10-31 | 2020-09-25 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for determining sound characteristics |
WO2020205233A1 (en) * | 2019-03-29 | 2020-10-08 | Google Llc | Direct speech-to-speech translation via machine learning |
Non-Patent Citations (1)
Title |
---|
HUANG, Guojie; JIN, Hui; YU, Yibiao: "Enhanced variational autoencoder for non-parallel corpus voice conversion", Signal Processing, no. 10, 25 October 2018 (2018-10-25) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||