CN114023343A - Voice conversion method based on semi-supervised feature learning - Google Patents

Voice conversion method based on semi-supervised feature learning

Info

Publication number
CN114023343A
CN114023343A (application number CN202111277502.5A)
Authority
CN
China
Prior art keywords
encoder
voice
network
speaker
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111277502.5A
Other languages
Chinese (zh)
Other versions
CN114023343B (en)
Inventor
李学龙
张强
陈穆林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202111277502.5A
Publication of CN114023343A
Application granted
Publication of CN114023343B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants
    • G10L 21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants
    • G10L 21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 - Adapting to target pitch
    • G10L 2021/0135 - Voice conversion or morphing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a voice conversion method based on semi-supervised feature learning. First, the voice data in the training set are preprocessed with the open-source audio library librosa to obtain an extended set of acoustic feature segments, and acoustic features representing speaker identity are extracted in advance with an encoder trained under the generalized end-to-end (GE2E) loss. Then, a voice conversion network is constructed, comprising a variational autoencoder, a decoder and a post-network, and the network is trained on the constructed data set, with the network loss defined as mean-square-error constraints under the supervision information. Finally, the trained network processes the source and target speech data to be converted and outputs the converted speech. Built on the variational autoencoder structure, the invention introduces semi-supervised feature learning, accurately extracts speaker identity information, solves voice conversion among multiple speakers on non-parallel corpus data, and has good generalization ability.

Description

Voice conversion method based on semi-supervised feature learning
Technical Field
The invention belongs to the field of deep learning, and particularly relates to a voice conversion method based on semi-supervised feature learning.
Background
Voice conversion aims to convert the timbre of a source speaker into that of a target speaker while keeping the linguistic content of the source speech unchanged. The most common setting for a voice conversion system is: given only one piece of target-speaker speech and one piece of source-speaker speech, the system automatically extracts the linguistic content from the source data, extracts the speaker embedding (i.e., the speaker timbre information) from the target data, and recombines the two to generate new target speech. In voice conversion, data that comes from different speakers and differs in linguistic content is non-parallel corpus data. Because it is cheap, easy to collect and close to real application scenarios, non-parallel corpus data is widely used in voice conversion. Since the linguistic content of the target and source speakers differs in non-parallel data, the timbre of the source speaker must be converted into that of the target speaker while the linguistic content is preserved. Conversion with non-parallel corpus data therefore faces two challenges. First, a model built on a non-parallel corpus has difficulty accurately learning the mapping from the source speaker to the target speaker, which degrades the conversion quality. Second, for conversion among multiple speakers, if the speaker under test never appeared in the training data, the converted speech is poor in both naturalness and similarity. These two points are the problems that non-parallel voice conversion urgently needs to solve.
To accurately learn the mapping between the source and target speakers, Takuhiro Kaneko et al. first introduced the CycleGAN (cycle-consistent generative adversarial network) method into the voice conversion task in "T. Kaneko and H. Kameoka, CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks, European Signal Processing Conference (EUSIPCO), pp. 2100-2104, 2018". The cycle-consistent generative adversarial network alleviates the difficulty of mapping the source domain to the target domain. The method uses the adversarial loss and the cycle-consistency loss as criteria to realize timbre conversion from the source speaker to the target speaker.
To achieve voice conversion between arbitrary speakers, Chou et al. proposed a voice conversion method using a variational autoencoder in "Ju-chieh Chou, Cheng-chieh Yeh, and Hung-yi Lee, One-Shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization, in Proc. Interspeech, 2019". Taking the mean square error as the criterion and using a variational autoencoder structure, the method employs two encoders at the input side to extract, through unsupervised learning, the linguistic content and the speaker embedding from the speech respectively, and a decoder at the output side combines them to generate new speech, namely the speech of the target speaker. Because the encoders learn through training to separate the linguistic content from the speaker-embedding information, given target-speaker speech and source-speaker speech, the content encoder automatically extracts the content representation from the source speech, the speaker encoder automatically extracts the speaker-embedding representation from the target speech, and the decoder combines the two representations into new speech data. This method can be extended to voice conversion among multiple speakers.
The above two methods solve some of the problems of non-parallel voice conversion, but limitations remain. The first method can accurately learn the mapping from the source speaker to the target speaker, but it usually only handles conversion between two speakers, its training process is complex, vanishing gradients readily occur, and it cannot be extended to conversion among multiple speakers. In the second method, although the variational autoencoder is simple in structure and easy to implement, the linguistic content representation it extracts still contains a small amount of speaker-embedding information, so the similarity of the final conversion result is poor.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a voice conversion method based on semi-supervised feature learning. First, the voice data in the training set are preprocessed with the open-source audio library librosa to obtain an extended set of acoustic feature segments, and acoustic features representing speaker identity are extracted in advance with an encoder trained under the generalized end-to-end (GE2E) loss. Then, a voice conversion network is constructed, comprising a variational autoencoder, a decoder and a post-network, and the network is trained on the constructed data set, with the network loss defined as mean-square-error constraints under the supervision information. Finally, the trained network processes the source and target speech data to be converted to obtain the converted speech. Built on the variational autoencoder structure, the invention introduces semi-supervised feature learning, accurately extracts speaker identity information, solves voice conversion among multiple speakers on non-parallel corpus data, and has good generalization ability.
A voice conversion method based on semi-supervised feature learning is characterized by comprising the following steps:
Step 1: preprocess each piece of voice data in the training set with the open-source audio library librosa: read in the voice data; apply pre-emphasis, windowing and framing to each piece of voice data; apply a short-time Fourier transform to each frame so that the voice data is converted from a time-domain signal into a frequency-domain signal; screen the frequency-domain voice data to obtain speech segments of the required length; all preprocessed speech segments in the training set form the acoustic feature segment set;
Randomly select fewer than half of the speakers in the training set, input their voice data to an encoder designed with the generalized end-to-end loss, and extract acoustic features representing speaker identity information; the encoder consists of a long short-term memory (LSTM) network layer and a linear layer, the input, output and hidden dimensions of the LSTM layer are 80, 256 and 256 respectively, the input and output dimensions of the linear layer are both 256, the activation function of the linear layer is the ReLU function, and the encoder is constrained by the generalized end-to-end loss;
Step 2: construct a voice conversion network comprising a variational autoencoder, a decoder and a post-network, wherein the variational autoencoder has two branches, a speaker encoder and a content encoder; the speaker encoder consists of two LSTM layers with 768 units and extracts speaker identity information from the input voice data; the content encoder consists of three 5×1 convolutional layers, two bidirectional LSTM layers with 32 units and one instance normalization layer, and extracts the speech content representation from the input voice data; the decoder consists of three 5×1 convolutional layers and three LSTM layers with 1024 units, and the speaker identity information extracted by the speaker encoder and the linguistic content representation extracted by the content encoder are fed into the decoder to obtain new voice data; the post-network consists of five 5×1 convolutional layers, extracts a residual signal from the decoder output, and adds the extracted signal to the decoder output to obtain the reconstructed voice data;
the loss function of the voice conversion network is set as follows:
$L = L_{con} + L_{spe} + L_{reco}$ (1)

where $L$ denotes the total network loss, $L_{con}$ the content encoder loss, $L_{spe}$ the speaker identity loss, and $L_{reco}$ the self-reconstruction loss, computed respectively as:

$L_{con} = \mathbb{E}\left[\left\| E_c(\hat{X}) - Z_c \right\|_2^2\right]$ (2)

$L_{spe} = \mathbb{E}\left[\left\| Z_{s_i} - \tilde{Z}_{s_i} \right\|_2^2\right]$ (3)

$L_{reco} = \mathbb{E}\left[\left\| D(E_s(x), Z_c) - x \right\|_2^2\right]$ (4)

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation, $\hat{X}$ denotes the output of the decoder, $E_c(\hat{X})$ denotes the output of the content encoder when its input is $\hat{X}$, $Z_c$ denotes the output of the content encoder, $Z_{s_i}$ denotes the output of the speaker encoder, $\tilde{Z}_{s_i}$ denotes the speaker identity information extracted by the generalized end-to-end method, $i$ denotes the speaker index, $x$ denotes the initial input voice data of the network, $E_s(x)$ denotes the output of the speaker encoder when the input is $x$, and $D(E_s(x), Z_c)$ denotes the output of the decoder when $E_s(x)$ and $Z_c$ are fed into it;
the specific processing procedure of the example normalization layer is as follows:
first, the mean value of each channel represented by the speech content is calculated as follows:
Figure BDA0003330170180000041
wherein, mucMean of the c-th channel, W the array dimension of each channel, Mc[ω]Represents the ω -th element in the c-th channel; c is 1,2, …, C and C represents the number of channels;
then, the variance of each channel is calculated as follows:
Figure BDA0003330170180000042
wherein σcThe variance of the channel c is shown, epsilon represents an adjusting parameter, and the value range is (0, 1);
finally, the channel array M is expressed by the following formulacEach element in (a) is normalized:
Figure BDA0003330170180000043
wherein, M'c[ω]Represents the value of the ω -th element in the normalized c-th channel; c1, 2, …, C, ω 1,2, …, W;
Step 3: set the network parameters, including a batch size of 32 for reading in the data, an initial learning rate of 0.001 and 500,000 training iterations of the network; input the speech segments of the acoustic feature segment set obtained in step 1 into the voice conversion network constructed in step 2 for training to obtain the trained voice conversion network;
Step 4: input the source voice data and the target voice data to be converted into the trained voice conversion network obtained in step 3; the speaker encoder extracts the speaker features from the target voice data, the content encoder extracts the speech content representation from the source voice data, and the converted voice data is output through the decoder and the post-network.
The invention has the following beneficial effects. Based on the variational autoencoder structure, semi-supervised feature learning is introduced and combined with the simple structure of the variational autoencoder. The speaker embedding extracted by the pretrained model with the generalized end-to-end loss serves as the supervision information for the speaker encoder in the voice conversion network, and the mean square error loss is used as the constraint, so the speaker encoder can accurately extract speaker identity information. Because the content encoder adds an instance normalization layer to remove speaker information, it can extract more accurate speech content information. The invention solves voice conversion among multiple speakers on non-parallel corpus data; by using the pre-extracted speaker timbre information to supervise the features extracted by the speaker encoder, the speaker encoder can extract timbre information accurately and complete voice conversion with higher similarity, extend timbre conversion to multiple speakers, and achieve good conversion even for speakers that are not in the training data. The trained network can change the speaker identity independently and can also perform conversion when little data from the speaker is available.
Drawings
FIG. 1 is a flow chart of the speech conversion method based on semi-supervised feature learning according to the present invention.
Detailed Description
The present invention will be further described with reference to the drawings and an embodiment; the invention includes, but is not limited to, the following embodiment. As shown in FIG. 1, the present invention provides a voice conversion method based on semi-supervised feature learning, implemented as follows:
1. extracting acoustic features of source and target speech data
Each piece of voice data in the training set is preprocessed with the open-source audio library librosa: the voice data is read in; each piece of voice data undergoes pre-emphasis, windowing and framing; each frame is transformed by the short-time Fourier transform so that the voice data is converted from a time-domain signal into a frequency-domain signal; the frequency-domain voice data is screened to obtain speech segments of the required length; and all preprocessed speech segments in the training set form the acoustic feature segment set. The open-source audio library librosa is described in detail in "Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, Oriol Nieto, librosa: Audio and Music Signal Analysis in Python. Proceedings of the 14th Python in Science Conference, 2015".
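As a concrete illustration of this preprocessing pipeline, the following is a minimal sketch built on librosa. The sampling rate, FFT size, hop length and segment length used here are illustrative assumptions; only the 80-dimensional mel feature (matching the encoder input dimension described in the next paragraph) comes from the text.

```python
import numpy as np
import librosa

def preprocess(wav_path, sr=16000, n_fft=1024, hop=256, n_mels=80, seg_len=128):
    """Read one utterance and return fixed-length 80-dim log-mel segments."""
    y, _ = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y)                    # pre-emphasis
    spec = np.abs(librosa.stft(y, n_fft=n_fft,            # windowing, framing and
                               hop_length=hop))           # short-time Fourier transform
    mel = librosa.feature.melspectrogram(S=spec ** 2, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel).T                   # (frames, 80)
    # screen out utterances that are too short, then cut fixed-length segments
    return [logmel[i:i + seg_len]
            for i in range(0, len(logmel) - seg_len + 1, seg_len)]
```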
No more than half of the speakers are randomly selected from the training set, and an encoder is designed with the generalized end-to-end loss method; the encoder consists of an LSTM layer with an input dimension of 80, an output dimension of 256 and a hidden dimension of 256, followed by a linear layer with input and output dimensions of 256. The activation function of the linear layer is the ReLU function. The selected speaker data is fed into the encoder, which, under the constraint of the generalized end-to-end loss, extracts from the speech of these speakers a speaker embedding representing speaker identity. These acoustic features serve in the subsequent processing as the supervision information for the speaker encoder, guiding it to produce correct speaker identity information. The generalized end-to-end loss method is described in detail in "Wan L., Wang Q., Papir A., and Moreno I. L., Generalized end-to-end loss for speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879-4883, 2018".
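The reference encoder described above might look as follows in PyTorch. The layer sizes (80, 256, 256) and the ReLU linear layer follow the text; taking the last-frame LSTM state and L2-normalizing the embedding are assumptions drawn from the cited GE2E paper, and the GE2E loss itself is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerVerificationEncoder(nn.Module):
    """GE2E-style reference encoder: one LSTM (80 -> 256) plus a 256 -> 256 linear layer."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mels, hidden_size=hidden, batch_first=True)
        self.linear = nn.Linear(hidden, hidden)

    def forward(self, mels):                     # mels: (batch, frames, 80)
        out, _ = self.lstm(mels)
        emb = F.relu(self.linear(out[:, -1]))    # last-frame state -> 256-d embedding
        return F.normalize(emb, dim=-1)          # unit-length speaker embedding
```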
2. Building a voice conversion network
The voice conversion network comprises a variational autoencoder, a decoder and a post-network. The variational autoencoder has two branches, a speaker encoder and a content encoder. The speaker encoder consists of two LSTM layers with 768 units, extracts speaker identity information from the input voice data, and outputs a 256-dimensional vector. The content encoder consists of three 5×1 convolutional layers, two bidirectional LSTM layers with 32 units and one instance normalization layer, extracts the speech content representation from the input voice data, and outputs a 64-dimensional vector. The decoder consists of three 5×1 convolutional layers and three LSTM layers with 1024 units; the speaker identity information extracted by the speaker encoder and the linguistic content representation extracted by the content encoder are fed into the decoder to obtain new voice data. The post-network consists of five 5×1 convolutional layers; it extracts a residual signal from the decoder output and adds the extracted signal to the decoder output so as to enrich the details of the speech spectrum and generate a high-quality speech spectrogram, namely the reconstructed voice data.
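A hedged PyTorch sketch of the four modules is given below. Layer types, counts and unit sizes follow the text; the convolution channel widths, the batch normalization and activation inside the convolution stacks, the 768-to-256 and 1024-to-80 projections, and the concatenation of the speaker embedding with the content code are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_stack(in_dim, channels, n_layers):
    """Stack of 5x1 convolutions (assumed channel width and activation)."""
    layers = []
    for i in range(n_layers):
        layers += [nn.Conv1d(in_dim if i == 0 else channels, channels,
                             kernel_size=5, padding=2),
                   nn.BatchNorm1d(channels), nn.ReLU()]
    return nn.Sequential(*layers)

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, 768, num_layers=2, batch_first=True)
        self.proj = nn.Linear(768, 256)           # assumed projection to the 256-d output

    def forward(self, mels):                      # (B, T, 80)
        out, _ = self.lstm(mels)
        return self.proj(out[:, -1])              # (B, 256) speaker identity vector

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, channels=512):
        super().__init__()
        self.convs = conv_stack(n_mels, channels, 3)
        self.blstm = nn.LSTM(channels, 32, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.inorm = nn.InstanceNorm1d(64)        # removes residual speaker statistics

    def forward(self, mels):                      # (B, T, 80)
        h = self.convs(mels.transpose(1, 2)).transpose(1, 2)
        h, _ = self.blstm(h)                      # (B, T, 64) content representation
        return self.inorm(h.transpose(1, 2)).transpose(1, 2)

class Decoder(nn.Module):
    def __init__(self, n_mels=80, channels=512):
        super().__init__()
        self.convs = conv_stack(64 + 256, channels, 3)   # content code + speaker embedding
        self.lstm = nn.LSTM(channels, 1024, num_layers=3, batch_first=True)
        self.proj = nn.Linear(1024, n_mels)

    def forward(self, content, spk_emb):          # (B, T, 64), (B, 256)
        spk = spk_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        h = torch.cat([content, spk], dim=-1)
        h = self.convs(h.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        return self.proj(h)                       # coarse mel prediction (B, T, 80)

class PostNet(nn.Module):
    def __init__(self, n_mels=80, channels=512):
        super().__init__()
        layers, in_ch = [], n_mels
        for i in range(5):
            out_ch = n_mels if i == 4 else channels
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2)]
            if i < 4:
                layers += [nn.BatchNorm1d(out_ch), nn.Tanh()]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, mel):                       # residual refinement of the decoder output
        return mel + self.net(mel.transpose(1, 2)).transpose(1, 2)
```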
The loss function of the voice conversion network is set as follows:
$L = L_{con} + L_{spe} + L_{reco}$ (8)

where $L$ denotes the total network loss, $L_{con}$ the content encoder loss, $L_{spe}$ the speaker identity loss, and $L_{reco}$ the self-reconstruction loss, computed respectively as:

$L_{con} = \mathbb{E}\left[\left\| E_c(\hat{X}) - Z_c \right\|_2^2\right]$ (9)

$L_{spe} = \mathbb{E}\left[\left\| Z_{s_i} - \tilde{Z}_{s_i} \right\|_2^2\right]$ (10)

$L_{reco} = \mathbb{E}\left[\left\| D(E_s(x), Z_c) - x \right\|_2^2\right]$ (11)

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation, $\hat{X}$ denotes the output of the decoder, $E_c(\hat{X})$ denotes the output of the content encoder when its input is $\hat{X}$, $Z_c$ denotes the output of the content encoder, and $Z_{s_i}$ denotes the output of the speaker encoder. $\tilde{Z}_{s_i}$ denotes the speaker identity information extracted by the generalized end-to-end method; it serves as the supervision information for the speaker encoder, and the mean square error between the two is computed so that their similarity is maximized; $i$ denotes the speaker index. $x$ denotes the initial input voice data of the network, $E_s(x)$ denotes the output of the speaker encoder when the input is $x$, and $D(E_s(x), Z_c)$ denotes the output of the decoder when $E_s(x)$ and $Z_c$ are fed into it.
The instance normalization layer applies instance normalization to the representation extracted by the content encoder in order to remove the small amount of speaker information it still contains. The specific procedure is as follows:

First, the mean of each channel of the speech content representation is calculated:

$\mu_c = \frac{1}{W}\sum_{\omega=1}^{W} M_c[\omega]$ (12)

where $\mu_c$ denotes the mean of the $c$-th channel, $W$ denotes the array dimension of each channel, and $M_c[\omega]$ denotes the $\omega$-th element of the $c$-th channel; $c = 1, 2, \ldots, C$, where $C$ denotes the number of channels.

Then, the standard deviation of each channel is calculated:

$\sigma_c = \sqrt{\frac{1}{W}\sum_{\omega=1}^{W}\left(M_c[\omega] - \mu_c\right)^2 + \epsilon}$ (13)

where $\sigma_c$ denotes the standard deviation of the $c$-th channel and $\epsilon$ denotes an adjustment parameter with value range (0, 1).

Finally, each element of the channel array $M_c$ is normalized as:

$M_c'[\omega] = \dfrac{M_c[\omega] - \mu_c}{\sigma_c}$ (14)

where $M_c'[\omega]$ denotes the value of the $\omega$-th element of the normalized $c$-th channel; $c = 1, 2, \ldots, C$, $\omega = 1, 2, \ldots, W$.
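A minimal transcription of equations (12)-(14) follows; the default value of the adjustment parameter is illustrative.

```python
import torch

def instance_norm(m, eps=1e-5):
    """Per-channel instance normalization of a content representation m of shape (C, W)."""
    mu = m.mean(dim=1, keepdim=True)                                      # eq. (12)
    sigma = torch.sqrt(((m - mu) ** 2).mean(dim=1, keepdim=True) + eps)   # eq. (13)
    return (m - mu) / sigma                                               # eq. (14)
```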
3. Network training
The network parameters are set, including a batch size of 32 for reading in the data, an initial learning rate of 0.001 and 500,000 training iterations. The speech segments of the acoustic feature segment set obtained in step 1 are fed into the voice conversion network constructed in step 2 for training, yielding the trained voice conversion network.
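A hedged training-loop sketch for this step is shown below. The batch size, learning rate and iteration count come from the text; the optimizer choice, the dataset yielding (segment, speaker index) pairs, and the per-speaker table of pre-extracted GE2E embeddings are assumptions.

```python
import torch

def train(dataset, spk_enc, cont_enc, dec, postnet, ge2e_embs, device="cuda"):
    """dataset yields (mel segment, speaker index); ge2e_embs: (num_speakers, 256) tensor."""
    modules = [spk_enc, cont_enc, dec, postnet]
    for m in modules:
        m.to(device)
    params = [p for m in modules for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=1e-3)                      # initial learning rate 0.001
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

    step, it = 0, iter(loader)
    while step < 500_000:                                        # 500,000 iterations
        try:
            x, spk_idx = next(it)
        except StopIteration:
            it = iter(loader)
            x, spk_idx = next(it)
        loss = conversion_losses(x.to(device), spk_enc, cont_enc, dec, postnet,
                                 ge2e_embs[spk_idx].to(device))  # loss from the sketch above
        opt.zero_grad()
        loss.backward()
        opt.step()
        step += 1
```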
4. Speech conversion
The source voice data and the target voice data to be converted are input into the trained voice conversion network obtained in step 3; the speaker encoder extracts the speaker features from the target voice data, the content encoder extracts the speech content representation of the source voice data, and the converted voice data is output through the decoder and the post-network.
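Concretely, the conversion step could look like the sketch below: timbre comes from the target utterance, content from the source utterance. Converting the output mel-spectrogram back to a waveform requires a vocoder, which the text does not specify.

```python
import torch

@torch.no_grad()
def convert(src_mel, tgt_mel, spk_enc, cont_enc, dec, postnet):
    """src_mel, tgt_mel: (B, T, 80) mel-spectrograms of source and target utterances."""
    z_c = cont_enc(src_mel)           # linguistic content of the source speaker
    z_s = spk_enc(tgt_mel)            # identity embedding of the target speaker
    return postnet(dec(z_c, z_s))     # converted mel-spectrogram
```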
To verify the effectiveness of the method, experiments were carried out on the open-source speech data set AISHELL-3, which contains 174 speakers with about 300 utterances per speaker; each utterance lasts between 3 and 8 seconds. The 174 speakers are divided into 75 speakers used to extract speaker timbre information in advance, 75 speakers used to extract speaker information during training, and 24 speakers whose voice data is used for testing. The objective evaluation metric of the conversion results is MCD (mel-cepstral distortion), i.e., the distortion between the speech before and after conversion is measured to evaluate performance. The subjective evaluation mainly uses the MOS (mean opinion score): volunteers score the results, first for similarity on a 1-5 scale, where 1 means completely dissimilar and 5 means completely similar, and second for naturalness on a 1-5 scale, where 1 means poor naturalness and 5 means excellent sound quality. These two evaluations are important indicators of the performance of a voice conversion system. The compared methods are the CycleGAN (cycle-consistent generative adversarial network) method and the Seq2seqVC (variational autoencoder) method; the evaluation results are shown in Table 1.
TABLE 1

Method                      Naturalness score    Similarity score    MCD (dB)
CycleGAN method             3.8                  3.9                 3.3
Seq2seqVC method            3.6                  3.5                 2.8
Method of the invention     3.9                  4.1                 3.2
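For reference, a common formulation of the mel-cepstral distortion reported in Table 1 is sketched below. The patent does not give its exact computation; frame alignment (e.g., by dynamic time warping) and mel-cepstrum extraction are assumed to be done beforehand.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_conv):
    """mc_ref, mc_conv: (frames, D) aligned mel-cepstral coefficients, 0th coefficient excluded."""
    diff = mc_ref - mc_conv
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))             # MCD in dB (lower is better)
```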

Claims (1)

1. A voice conversion method based on semi-supervised feature learning is characterized by comprising the following steps:
Step 1: preprocess each piece of voice data in the training set with the open-source audio library librosa: read in the voice data; apply pre-emphasis, windowing and framing to each piece of voice data; apply a short-time Fourier transform to each frame so that the voice data is converted from a time-domain signal into a frequency-domain signal; screen the frequency-domain voice data to obtain speech segments of the required length; all preprocessed speech segments in the training set form the acoustic feature segment set;
Randomly select fewer than half of the speakers in the training set, input their voice data to an encoder designed with the generalized end-to-end loss, and extract acoustic features representing speaker identity information; the encoder consists of a long short-term memory (LSTM) network layer and a linear layer, the input, output and hidden dimensions of the LSTM layer are 80, 256 and 256 respectively, the input and output dimensions of the linear layer are both 256, the activation function of the linear layer is the ReLU function, and the encoder is constrained by the generalized end-to-end loss;
Step 2: construct a voice conversion network comprising a variational autoencoder, a decoder and a post-network, wherein the variational autoencoder has two branches, a speaker encoder and a content encoder; the speaker encoder consists of two LSTM layers with 768 units and extracts speaker identity information from the input voice data; the content encoder consists of three 5×1 convolutional layers, two bidirectional LSTM layers with 32 units and one instance normalization layer, and extracts the speech content representation from the input voice data; the decoder consists of three 5×1 convolutional layers and three LSTM layers with 1024 units, and the speaker identity information extracted by the speaker encoder and the linguistic content representation extracted by the content encoder are fed into the decoder to obtain new voice data; the post-network consists of five 5×1 convolutional layers, extracts a residual signal from the decoder output, and adds the extracted signal to the decoder output to obtain the reconstructed voice data;
the loss function of the voice conversion network is set as follows:
$L = L_{con} + L_{spe} + L_{reco}$ (1)

where $L$ denotes the total network loss, $L_{con}$ the content encoder loss, $L_{spe}$ the speaker identity loss, and $L_{reco}$ the self-reconstruction loss, computed respectively as:

$L_{con} = \mathbb{E}\left[\left\| E_c(\hat{X}) - Z_c \right\|_2^2\right]$ (2)

$L_{spe} = \mathbb{E}\left[\left\| Z_{s_i} - \tilde{Z}_{s_i} \right\|_2^2\right]$ (3)

$L_{reco} = \mathbb{E}\left[\left\| D(E_s(x), Z_c) - x \right\|_2^2\right]$ (4)

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation, $\hat{X}$ denotes the output of the decoder, $E_c(\hat{X})$ denotes the output of the content encoder when its input is $\hat{X}$, $Z_c$ denotes the output of the content encoder, $Z_{s_i}$ denotes the output of the speaker encoder, $\tilde{Z}_{s_i}$ denotes the speaker identity information extracted by the generalized end-to-end method, $i$ denotes the speaker index, $x$ denotes the initial input voice data of the network, $E_s(x)$ denotes the output of the speaker encoder when the input is $x$, and $D(E_s(x), Z_c)$ denotes the output of the decoder when $E_s(x)$ and $Z_c$ are fed into it;
the specific processing procedure of the example normalization layer is as follows:
first, the mean value of each channel represented by the speech content is calculated as follows:
Figure FDA0003330170170000025
wherein, mucMean of the c-th channel, W the array dimension of each channel, Mc[ω]Represents the ω -th element in the c-th channel; c is 1,2, …, C and C represents the number of channels;
then, the variance of each channel is calculated as follows:
Figure FDA0003330170170000026
wherein σcThe variance of the channel c is shown, epsilon represents an adjusting parameter, and the value range is (0, 1);
finally, the channel array M is expressed by the following formulacEach element in (a) is normalized:
Figure FDA0003330170170000027
wherein M isc′[ω]Represents the second in the c channel after normalizationOmega element values; c1, 2, …, C, ω 1,2, …, W;
Step 3: set the network parameters, including a batch size of 32 for reading in the data, an initial learning rate of 0.001 and 500,000 training iterations of the network; input the speech segments of the acoustic feature segment set obtained in step 1 into the voice conversion network constructed in step 2 for training to obtain the trained voice conversion network;
Step 4: input the source voice data and the target voice data to be converted into the trained voice conversion network obtained in step 3; the speaker encoder extracts the speaker features from the target voice data, the content encoder extracts the speech content representation from the source voice data, and the converted voice data is output through the decoder and the post-network.
CN202111277502.5A 2021-10-30 2021-10-30 Voice conversion method based on semi-supervised feature learning Active CN114023343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111277502.5A CN114023343B (en) 2021-10-30 2021-10-30 Voice conversion method based on semi-supervised feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111277502.5A CN114023343B (en) 2021-10-30 2021-10-30 Voice conversion method based on semi-supervised feature learning

Publications (2)

Publication Number Publication Date
CN114023343A true CN114023343A (en) 2022-02-08
CN114023343B CN114023343B (en) 2024-04-30

Family

ID=80059050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111277502.5A Active CN114023343B (en) 2021-10-30 2021-10-30 Voice conversion method based on semi-supervised feature learning

Country Status (1)

Country Link
CN (1) CN114023343B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283824A (en) * 2022-03-02 2022-04-05 清华大学 Voice conversion method and device based on cyclic loss

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111712874A (en) * 2019-10-31 2020-09-25 支付宝(杭州)信息技术有限公司 System and method for determining sound characteristics
WO2020205233A1 (en) * 2019-03-29 2020-10-08 Google Llc Direct speech-to-speech translation via machine learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020205233A1 (en) * 2019-03-29 2020-10-08 Google Llc Direct speech-to-speech translation via machine learning
CN111712874A (en) * 2019-10-31 2020-09-25 支付宝(杭州)信息技术有限公司 System and method for determining sound characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄国捷; 金慧; 俞一彪: "Non-parallel Corpus Voice Conversion Using an Enhanced Variational Autoencoder" (增强变分自编码器做非平行语料语音转换), Journal of Signal Processing (信号处理), no. 10, 25 October 2018 (2018-10-25) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283824A (en) * 2022-03-02 2022-04-05 清华大学 Voice conversion method and device based on cyclic loss

Also Published As

Publication number Publication date
CN114023343B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN109036382B (en) Audio feature extraction method based on KL divergence
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN108520753B (en) Voice lie detection method based on convolution bidirectional long-time and short-time memory network
CN102968990B (en) Speaker identifying method and system
CN111462769B (en) End-to-end accent conversion method
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
WO2012075641A1 (en) Device and method for pass-phrase modeling for speaker verification, and verification system
CN111785285A (en) Voiceprint recognition method for home multi-feature parameter fusion
Dubey et al. Non-intrusive speech quality assessment using several combinations of auditory features
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN102789779A (en) Speech recognition system and recognition method thereof
CN107293306A (en) A kind of appraisal procedure of the Objective speech quality based on output
KR102272554B1 (en) Method and system of text to multiple speech
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN110136746B (en) Method for identifying mobile phone source in additive noise environment based on fusion features
Rudresh et al. Performance analysis of speech digit recognition using cepstrum and vector quantization
CN114023343B (en) Voice conversion method based on semi-supervised feature learning
Koizumi et al. Miipher: A robust speech restoration model integrating self-supervised speech and text representations
CN114283822A (en) Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient
Mengistu Automatic text independent amharic language speaker recognition in noisy environment using hybrid approaches of LPCC, MFCC and GFCC
CN112927723A (en) High-performance anti-noise speech emotion recognition method based on deep neural network
CN115472168B (en) Short-time voice voiceprint recognition method, system and equipment for coupling BGCC and PWPE features
Afshan et al. Attention-based conditioning methods using variable frame rate for style-robust speaker verification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant