WO2021128256A1 - Voice conversion method, apparatus, device, and storage medium - Google Patents

Voice conversion method, apparatus, device, and storage medium

Info

Publication number
WO2021128256A1
WO2021128256A1 (PCT/CN2019/129115)
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
training
frequency spectrum
speaker
conversion model
Prior art date
Application number
PCT/CN2019/129115
Other languages
English (en)
French (fr)
Inventor
赵之源
黄东延
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司
Priority to CN201980003287.4A (patent CN111247585B)
Priority to PCT/CN2019/129115
Publication of WO2021128256A1

Links

Classifications

    • G PHYSICS > G10 MUSICAL INSTRUMENTS; ACOUSTICS > G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/013 Adapting to target pitch (under G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility; G10L21/003 Changing voice quality, e.g. pitch or formants; G10L21/007 characterised by the process used)
    • G10L19/16 Vocoder architecture (under G10L19/00 Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; G10L19/04 using predictive techniques)
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L2021/0135 Voice conversion or morphing (under G10L21/013 Adapting to target pitch)

Definitions

  • This application relates to the field of signal processing, and in particular to a voice conversion method, device, equipment, and storage medium.
  • With the development of technology, voice conversion has become increasingly mature; timbre conversion can be realized through a voice conversion model, which has a wide range of application scenarios.
  • However, existing voice conversion models only support conversion for a single speaker.
  • In a first aspect, an embodiment of the present application provides a voice conversion method, which includes: obtaining source audio data; receiving a selected target speaker number and the speaker number corresponding to the source audio data; preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data; taking the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of a voice conversion model to obtain the frequency spectrum of the target speaker output by the voice conversion model;
  • the frequency spectrum of the target speaker is converted into the voice of the target speaker through the vocoder.
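  • As a minimal illustration (not part of the application itself), the sketch below shows how these steps could be wired together in code; the helper names `preprocess`, `vocoder_synthesize`, and the `model` callable are assumptions made only for this sketch:

```python
# Illustrative end-to-end pipeline for the claimed method; every name here is
# an assumption made for the sketch, not part of the application.
import numpy as np

def convert_voice(source_wav: np.ndarray, src_speaker_id: int, tgt_speaker_id: int,
                  model, preprocess, vocoder_synthesize) -> np.ndarray:
    """Obtain source audio, take the two speaker numbers, preprocess to a
    spectrum, run the voice conversion model, and vocode the result."""
    mel_src = preprocess(source_wav)                          # spectrum of the source audio
    mel_tgt = model(mel_src, src_speaker_id, tgt_speaker_id)  # target speaker's spectrum
    return vocoder_synthesize(mel_tgt)                        # target speaker's voice
```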
  • the voice conversion model includes:
  • the affine matrix is used to encode the input target speaker number and the speaker number corresponding to the source audio data into a speaker vector.
  • the encoder is used to obtain a feature vector from the speaker vector and the frequency spectrum corresponding to the source audio data, and the decoder is used to obtain the frequency spectrum of the target speaker according to the feature vector and the speaker vector.
  • the training steps of the speech conversion model are as follows:
  • the training sample set includes multiple training samples, each training sample includes: the training target speaker number, the speaker number corresponding to the training audio data, the frequency spectrum corresponding to the training audio data, and the frequency spectrum of the training target speaker;
  • the weight parameters of the voice conversion model are updated according to the comparison result of the actual output of the voice conversion model and the expected output, and a trained voice conversion model is obtained.
  • the weight parameters of the voice conversion model are updated according to the comparison result of the actual output of the voice conversion model and the expected output to obtain a trained voice conversion model, including:
  • when the loss value does not meet the preset convergence condition, the weight parameters are updated according to the loss value, the next training sample is obtained, and the step of using the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model and the frequency spectrum of the training target speaker as the expected output is re-entered.
  • When the calculated loss value meets the preset convergence condition, the training is stopped, and a trained voice conversion model is obtained.
  • In one embodiment, calculating the loss value from the difference between the actual output of the voice conversion model and the expected output includes: calculating a first difference between the actually output frequency spectrum and the frequency spectrum of the training target speaker; inputting the actually output frequency spectrum into a phoneme recognizer to obtain predicted phoneme information and comparing it with the phoneme information corresponding to the source audio data to calculate a second difference;
  • the loss value is obtained from the first difference and the second difference.
  • In one embodiment, the training process of the decoder includes: obtaining a preset training target spectrum frame and a preset average spectrum frame of the training target speaker; obtaining a preset probability and determining the reference frame corresponding to each spectrum frame according to the preset probability; outputting the corresponding spectrum frame according to the training target spectrum frame when the reference frame corresponding to the decoder's output spectrum frame is the training target spectrum frame;
  • when the reference frame corresponding to the decoder's output spectrum frame is the average spectrum frame, the corresponding spectrum frame is output according to the average spectrum frame.
  • In one embodiment, preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data includes: removing the blank parts at the beginning and end of the source audio data, applying pre-emphasis, and performing a short-time Fourier transform to obtain a first spectrum, and passing the first spectrum through a mel filter bank to obtain a mel spectrum.
  • In a second aspect, an embodiment of the present application provides a voice conversion apparatus, which includes: an obtaining module, used to obtain source audio data;
  • the receiving module is used to receive the selected target speaker number and the speaker number corresponding to the source audio data;
  • the processing module is used to preprocess the source audio data to obtain the frequency spectrum corresponding to the source audio data
  • the frequency spectrum conversion module is used to take the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model to obtain the target speaker's frequency spectrum output by the voice conversion model;
  • the voice generation module is used to convert the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
  • In a third aspect, an embodiment of the present application provides a voice conversion device, including a memory and a processor, where a computer program is stored in the memory.
  • When the computer program is executed by the processor, the processor performs the following steps: obtaining source audio data; receiving the selected target speaker number and the speaker number corresponding to the source audio data; preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data; taking the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model to obtain the frequency spectrum of the target speaker output by the voice conversion model;
  • the frequency spectrum of the target speaker is converted into the voice of the target speaker through the vocoder.
  • In a fourth aspect, an embodiment of the present application provides a storage medium storing a computer program; when the computer program is executed by a processor, the processor performs the following steps: obtaining source audio data; receiving the selected target speaker number and the speaker number corresponding to the source audio data; preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data; taking the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model to obtain the frequency spectrum of the target speaker output by the voice conversion model;
  • the frequency spectrum of the target speaker is converted into the voice of the target speaker through the vocoder.
  • In the embodiments of the present application, the speakers are numbered.
  • In the actual conversion process, the frequency spectrum of the target speaker to be converted is controlled by the number, which achieves multi-speaker-to-multi-speaker voice conversion and improves applicability.
  • FIG. 1 is a flowchart of a voice conversion method in an embodiment of this application
  • Figure 2 is a training flowchart of a voice conversion model in an embodiment of the application
  • FIG. 3 is a flowchart of obtaining a loss value in an embodiment of the application
  • FIG. 4 is a flowchart of the decoder referring to a target spectrum frame in an embodiment of this application;
  • FIG. 5 is a flowchart of obtaining a frequency spectrum corresponding to source audio data in an embodiment of the application
  • FIG. 6 is a specific schematic diagram of the generation stage of voice conversion in an embodiment of this application.
  • FIG. 7 is a specific schematic diagram of the training phase of voice conversion in an embodiment of this application.
  • FIG. 8 is a schematic structural diagram of a voice conversion device in an embodiment of this application.
  • Fig. 9 is a schematic diagram of the internal structure of a voice conversion device in an embodiment of the application.
  • a voice conversion method is proposed, and the method includes:
  • Step 102 Obtain source audio data.
  • The source audio data refers to the audio that needs to undergo voice conversion. For example, suppose an utterance 'a' spoken by speaker 'A' currently needs to be converted into the same utterance 'a' spoken by speaker 'B'; the utterance refers to the spoken content, that is, the text information in the audio.
  • The audio data to which the utterance 'a' spoken by speaker 'A' belongs is the source audio data.
  • Step 104 Receive the selected target speaker number and the speaker number corresponding to the source audio data.
  • The number is a code assigned to each speaker and represents the speaker's timbre; different numbers represent different timbres.
  • The target speaker number is the number of the speaker whose timbre the audio needs to be converted to, such as 'B' above; the speaker number corresponding to the source audio data refers to the number of the speaker whose timbre is contained in the source audio data, that is, the timbre to be converted, such as 'A' above.
  • Step 106 Preprocess the source audio data to obtain a frequency spectrum corresponding to the source audio data.
  • The source audio data is a time-domain signal, i.e., a waveform of sound amplitude varying with time.
  • Speech features cannot be extracted and analyzed directly from the time-domain signal; therefore, the time-domain signal is converted into a frequency-domain signal through preprocessing to obtain the frequency spectrum corresponding to the source audio data.
  • Step 108 Use the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model, and obtain the frequency spectrum of the target speaker output by the voice conversion model.
  • The voice conversion model refers to a virtual program model that can convert an input frequency spectrum into the target frequency spectrum. Therefore, during conversion, the frequency spectrum corresponding to the source audio data, the target speaker number, and the speaker number corresponding to the source audio data are input to obtain the frequency spectrum of the target speaker.
  • The speaker numbers before and after conversion are input so that the speaker is treated as a variable feature; when a specific speaker needs to be designated, the output is generated based on that number.
  • Step 110 Convert the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
  • A vocoder is a speech analysis and synthesis system based on a model of the speech signal. Only model parameters are used in transmission; during encoding and decoding, model parameter estimation and speech synthesis techniques are used. It is also called a speech analysis-synthesis system or a speech band compression system, and is a powerful tool for compressing the communication band and for secure communication. After the frequency spectrum of the target speaker is obtained, the vocoder converts the spectrum into the corresponding voice.
  • The vocoder may be WORLD, Griffin-Lim, WaveNet, etc.; a minimal vocoding sketch follows below.
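  • As a hedged sketch of the vocoding step, the snippet below reconstructs audio from a mel spectrum with librosa's Griffin-Lim based mel inversion; it assumes the model outputs a power-scale mel spectrogram and uses illustrative frame parameters (WORLD or WaveNet vocoders would require their own toolchains):

```python
# Sketch: mel spectrum -> waveform via Griffin-Lim (one possible vocoder choice).
import numpy as np
import librosa

def vocoder_synthesize(mel_power: np.ndarray, sr: int = 22050,
                       n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """mel_power: (n_mels, frames) power-scale mel spectrogram (assumed format)."""
    # librosa inverts the mel filter bank and runs Griffin-Lim internally.
    return librosa.feature.inverse.mel_to_audio(
        mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length)
```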
  • By numbering the speakers, the frequency spectrum of the target speaker to be converted is controlled by the number during actual conversion, which achieves multi-speaker-to-multi-speaker voice conversion and improves applicability.
  • the voice conversion model includes:
  • the affine matrix is used to encode the input target speaker number and the speaker number corresponding to the source audio data into a speaker vector
  • the encoder is used to obtain a feature vector from the speaker vector and the frequency spectrum corresponding to the source audio data;
  • the decoder is used to obtain the frequency spectrum of the target speaker according to the feature vector and the speaker vector.
  • The affine matrix refers to a Speaker Embedding; the Speaker Embedding stores the correspondence between each speaker and the frequency spectrum.
  • the specific architecture of the encoder is CNN + Bi-LSTM + Linear Projection;
  • the specific architecture of the decoder is Pre-Net + Attention + LSTM + Post-Net.
  • The specific execution flow inside the voice conversion model includes: the target speaker number and the speaker number corresponding to the source audio data are input into the Speaker Embedding to obtain the corresponding speaker vector; the frequency spectrum is input into the encoder and passed through a CNN (Convolutional Neural Network), the speaker vector is fed into the Bi-LSTM, and the speech feature vector is obtained through a linear projection; the feature vector is then input into the decoder and passed through the PreNet, with the speaker vector fed into the Attention module and into the unidirectional LSTM.
  • Finally, the frequency spectrum of the target speaker corresponding to the speaker vector is output through a CNN (Convolutional Neural Network).
  • Feeding the speaker vector into intermediate stages of the encoder and decoder carries the number variable through the encoding and decoding process, so that the corresponding frequency spectrum is finally output according to that variable.
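  • The PyTorch sketch below illustrates one way this flow (speaker embedding, a CNN + Bi-LSTM + linear-projection encoder, and a decoder conditioned on the speaker vector) could be assembled; the layer sizes, the way the two speaker numbers are combined into one vector, and the heavily simplified decoder (no attention) are assumptions for illustration only, not the patented implementation:

```python
# Simplified sketch of the described model with speaker conditioning.
import torch
import torch.nn as nn

class VoiceConversionModel(nn.Module):
    def __init__(self, n_mels=80, n_speakers=10, spk_dim=64, hidden=256):
        super().__init__()
        self.spk_embedding = nn.Embedding(n_speakers, spk_dim)   # the "affine matrix"
        # Encoder: CNN + Bi-LSTM + Linear Projection
        self.enc_cnn = nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2)
        self.enc_lstm = nn.LSTM(hidden + spk_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.enc_proj = nn.Linear(2 * hidden, hidden)
        # Decoder (greatly simplified): Pre-Net + LSTM + Post-Net, no attention here.
        self.pre_net = nn.Linear(hidden + spk_dim, hidden)
        self.dec_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.post_net = nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2)

    def forward(self, mel, src_id, tgt_id):
        # mel: (batch, frames, n_mels); src_id, tgt_id: (batch,) speaker numbers
        spk = self.spk_embedding(src_id) + self.spk_embedding(tgt_id)  # speaker vector (sum is an assumption)
        spk_seq = spk.unsqueeze(1).expand(-1, mel.size(1), -1)
        h = torch.relu(self.enc_cnn(mel.transpose(1, 2))).transpose(1, 2)
        h, _ = self.enc_lstm(torch.cat([h, spk_seq], dim=-1))
        feat = self.enc_proj(h)                                        # feature vector
        d = torch.relu(self.pre_net(torch.cat([feat, spk_seq], dim=-1)))
        d, _ = self.dec_lstm(d)
        return self.post_net(d.transpose(1, 2)).transpose(1, 2)       # target-speaker mel
```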
  • the training steps of the speech conversion model are as follows:
  • Step 202 Obtain a training sample set.
  • the training sample set includes multiple training samples.
  • Each training sample includes: the training target speaker number, the speaker number corresponding to the training audio data, the frequency spectrum corresponding to the training audio data, and the frequency spectrum of the training target speaker.
  • The training sample set contains the numbers and frequency spectra of different speakers. For example, suppose the utterance 'a' spoken by speaker 'A' needs to be converted into the utterance 'a' spoken by speaker 'B'. The frequency spectrum corresponding to the utterance 'a' spoken by speaker 'A' is the frequency spectrum corresponding to the training audio data, the frequency spectrum corresponding to the utterance 'a' spoken by speaker 'B' is the frequency spectrum of the training target speaker, and 'A' and 'B' are, respectively, the speaker number corresponding to the training audio data and the training target speaker number.
  • The purpose of sample training is to let the voice conversion model fit, from a large amount of data, the parameters for converting voice features within the sample population, so that in subsequent production the voice features can be converted according to the fitted parameters.
  • Step 204 Use the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model, and use the frequency spectrum of the training target speaker as the desired output.
  • As in the above example, the frequency spectrum corresponding to the utterance 'a' spoken by speaker 'A' and the numbers 'A' and 'B' are used as the input, the frequency spectrum corresponding to the utterance 'a' spoken by speaker 'B' is used as the expected output, and the voice conversion model refers to the expected output when producing the corresponding spectrum.
  • Step 206 Update the weight parameters of the voice conversion model according to the comparison result of the actual output of the voice conversion model and the expected output to obtain a trained voice conversion model.
  • After the actual output is obtained in training, it is compared and analyzed against the expected output, and the weight parameters of the voice conversion model are updated, optimizing the model.
  • By training the voice conversion model with the preset inputs and expected outputs and then generating output with the trained model, higher conversion accuracy and better results are obtained.
  • the weight parameters of the voice conversion model are updated according to the comparison result of the actual output of the voice conversion model and the expected output to obtain a trained voice conversion model, including:
  • the loss value is calculated according to the comparison difference between the actual output of the voice conversion model and the expected output.
  • the weight parameter of the voice conversion model is updated according to the loss value.
  • The next training sample is obtained, and the step of using the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model and the frequency spectrum of the training target speaker as the expected output is re-entered.
  • When the calculated loss value meets the preset convergence condition, the training is stopped, and a trained voice conversion model is obtained.
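  • A compact sketch of this update loop, assuming a PyTorch-style model and optimizer, an illustrative `compute_loss` helper (see the loss sketch further below), and a `converged` callable standing in for the preset convergence condition; all names here are assumptions:

```python
# Illustrative training loop: compute the loss, update the weights while the
# convergence condition is not met, then stop and keep the trained model.
def train(model, optimizer, samples, compute_loss, converged):
    while True:
        for sample in samples:   # numbers, training spectrum, target spectrum, phonemes
            pred_mel = model(sample.train_mel, sample.src_id, sample.tgt_id)
            loss = compute_loss(pred_mel, sample)
            if converged(loss.item()):      # preset convergence condition met
                return model                # trained voice conversion model
            optimizer.zero_grad()
            loss.backward()                 # update weight parameters from the loss
            optimizer.step()
```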
  • the loss value refers to how much the actual output is distorted compared to the expected output.
  • the specific loss value may refer to the comparison difference between the actual output frequency spectrum and the expected output frequency spectrum, and may also include other differences.
  • the loss value is calculated according to the comparison difference between the actual output of the voice conversion model and the expected output, including:
  • Step 302 Calculate the first difference between the frequency spectrum actually output by the voice conversion model and the frequency spectrum of the target speaker.
  • the loss value specifically includes two parts, one is the first difference between the actual output spectrum and the target spectrum, and the other is the second difference between the predicted phoneme information and the source phoneme information.
  • Step 304 Input the frequency spectrum actually output by the voice conversion model into the phoneme recognizer to obtain predicted phoneme information, and compare it with the phoneme information corresponding to the source audio data to calculate a second difference.
  • the phoneme recognizer refers to a virtual program module that can take the frequency spectrum as an input and output the phoneme information in the frequency spectrum.
  • the specific phoneme recognizer uses the CTC algorithm (Connectionist Temporal Classification), and its internal architecture is Linear Projection+CTC Loss.
  • the source phoneme information is obtained based on source audio data extraction, and the specific phoneme information refers to a phoneme vector formed by phoneme encoding.
  • CTC is used in the training phase.
  • Step 306 Obtain a loss value according to the first difference and the second difference.
  • the loss value is specifically obtained by adding the first difference value and the second difference value.
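  • A hedged sketch of this two-part loss, assuming an L1 distance for the spectrum term and PyTorch's `nn.CTCLoss` over the phoneme recognizer's log-probabilities for the phoneme term (the exact distance measures are not spelled out in the text):

```python
# Sketch: loss = spectrum difference + CTC phoneme difference, summed as described.
import torch.nn as nn
import torch.nn.functional as F

ctc = nn.CTCLoss(blank=0)

def conversion_loss(pred_mel, target_mel, phoneme_log_probs, phoneme_targets,
                    input_lengths, target_lengths):
    # First difference: actually output spectrum vs. the training target speaker's spectrum.
    first = F.l1_loss(pred_mel, target_mel)
    # Second difference: predicted phonemes vs. the source phoneme labels.
    # phoneme_log_probs is expected in (time, batch, n_phonemes) log-softmax form.
    second = ctc(phoneme_log_probs, phoneme_targets, input_lengths, target_lengths)
    return first + second
```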
  • the training process of the decoder includes:
  • Step 402 Obtain a preset training target frequency spectrum frame and a preset average frequency spectrum frame of the training target speaker.
  • When an existing decoder outputs a spectrum in the training phase, it outputs with reference to a given preset target spectrum, and each output frame refers to the corresponding frame of the target spectrum.
  • In the actual generation process, however, there is no target spectrum to refer to, so the results obtained in the training phase deviate from those obtained in the generation phase. If the target spectrum frame is referenced exclusively, the generation phase cannot achieve results as good as the training phase; if the target spectrum frame is not referenced at all, the model is difficult to converge. Therefore, through internal control of the decoder, a reference probability is set so that the target spectrum frames are randomly distributed among the reference frames; by not referencing the target spectrum frame exclusively, the results obtained in the generation phase are made close to the real situation.
  • the above-mentioned training target spectrum frame refers to each frame in the spectrum of the target speaker, and the average spectrum frame of the training target speaker refers to the average value of the spectrum frames in all spectrums corresponding to the target speaker.
  • Step 404 Obtain a preset probability, and determine a reference frame corresponding to each spectrum frame according to the preset probability.
  • The preset probability is set in advance and is controlled by the teacher forcing rate and the speaker global mean frame in the decoder.
  • When the decoder outputs the spectrum, the corresponding spectrum frame is referenced according to the preset probability.
  • In one embodiment, the preset probability is 0.5; of course, the preset probability can also take other values.
  • Step 406 When the reference frame corresponding to the output spectrum frame of the decoder is the training target spectrum frame, output the corresponding spectrum frame according to the training target spectrum frame.
  • Step 408 When the reference frame corresponding to the output spectrum frame of the decoder is an average spectrum frame, output the corresponding spectrum frame according to the average spectrum frame.
  • Using the teacher forcing rate and the speaker global mean frame to control the probability of referring to the target spectrum frame means the target spectrum frame is not referenced exclusively, which brings training closer to the actual generation behavior and reduces the deviation in results caused by the mismatch between the training and generation processes.
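  • A minimal sketch of this reference-frame selection, assuming a teacher forcing rate of 0.5 as in the example above (the function name is illustrative):

```python
# Sketch: with a preset probability, the decoder's reference is the true target
# spectrum frame; otherwise it is the speaker's global mean spectrum frame.
import random

def pick_reference_frame(target_frame, speaker_mean_frame, teacher_forcing_rate=0.5):
    if random.random() < teacher_forcing_rate:
        return target_frame        # reference = training target spectrum frame
    return speaker_mean_frame      # reference = average spectrum frame
```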
  • preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data includes:
  • Step 502: Remove the blank parts at the beginning and end of the source audio data, apply pre-emphasis, and perform a short-time Fourier transform to obtain the first spectrum.
  • Removing the blank audio portions from the source audio data helps the Attention module learn alignment better; pre-emphasis boosts the high-frequency content of the audio and filters out some noise; the STFT (short-time Fourier transform) converts the waveform from the time domain to the frequency domain to obtain the first spectrum, which makes it easier to extract speech features.
  • Step 504 Pass the first spectrum through the mel filter bank to obtain the mel spectrum.
  • The frequency scale of the first spectrum does not match the human ear's perception, so the first spectrum is passed through the mel filter bank to obtain the mel spectrum, whose frequency scale matches human hearing.
  • In the mel filter bank, the filters in the low-frequency range are denser with larger thresholds, while the filters in the high-frequency range are sparser with smaller thresholds.
  • Through preprocessing, the source audio data is filtered, denoised, and converted to the frequency domain, so that the spectrum entering the voice conversion model is clean and accurate, which improves the accuracy of the voice conversion; a sketch of this chain follows below.
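  • A sketch of this preprocessing chain with librosa; the trim threshold, pre-emphasis coefficient, and frame parameters (n_fft, hop length, 80 mel bands) are assumptions for illustration:

```python
# Sketch: trim leading/trailing silence, pre-emphasize, STFT, then mel filter bank.
import numpy as np
import librosa

def preprocess(wav: np.ndarray, sr: int = 22050, n_fft: int = 1024,
               hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    wav, _ = librosa.effects.trim(wav, top_db=30)        # drop blank parts at both ends
    wav = librosa.effects.preemphasis(wav, coef=0.97)    # add high-frequency emphasis
    first_spectrum = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length)) ** 2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # denser at low frequencies
    return mel_fb @ first_spectrum                       # mel spectrum, shape (n_mels, frames)
```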
  • The generation stage of voice conversion specifically includes: preprocessing the source audio data to obtain the mel spectrum of the source speaker, and inputting the mel spectrum of the source speaker, the target speaker number, and the speaker number corresponding to the source audio data into the voice conversion model to obtain the mel spectrum of the target speaker.
  • Specifically, the target speaker number and the speaker number corresponding to the source audio data are input into the Speaker Embedding to obtain the corresponding speaker vector.
  • The frequency spectrum is input into the encoder and passed through a CNN (Convolutional Neural Network); the speaker vector is fed into the Bi-LSTM (Bi-directional Long Short-Term Memory, a context-modeling model with forward and backward directions), and the speech feature vector is obtained through a linear projection.
  • the obtained Mel spectrum of the target speaker is converted into the voice of the target speaker through a vocoder.
  • The training phase of voice conversion specifically includes: preprocessing the training audio data to obtain the mel spectrum of the training speaker, and inputting the mel spectrum of the training speaker, the training target speaker number, and the speaker number corresponding to the training audio data into the voice conversion model to obtain the mel spectrum of the training target speaker.
  • The frequency spectrum is input into the encoder and passed through a CNN; the training speaker vector is fed into the Bi-LSTM, and the speech feature vector is obtained through a linear projection.
  • The feature vector is input into the decoder and passed through the PreNet; the training speaker vector is fed into the Attention module and into the unidirectional LSTM.
  • Finally, the frequency spectrum of the training target speaker corresponding to the training speaker vector is output through a CNN.
  • the present application provides a voice conversion device, which includes:
  • the obtaining module 802 is used to obtain source audio data
  • the receiving module 804 is configured to receive the selected target speaker number and the speaker number corresponding to the source audio data
  • the processing module 806 is configured to preprocess the source audio data to obtain a frequency spectrum corresponding to the source audio data
  • the frequency spectrum conversion module 808 is configured to use the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model, and obtain the frequency spectrum of the target speaker output by the voice conversion model;
  • the voice generation module 810 is used to convert the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
  • the voice conversion model includes: an affine matrix, an encoder, and a decoder.
  • the affine matrix is used to encode the input target speaker number and the speaker number corresponding to the source audio data into a speaker vector
  • the encoder is used to obtain the feature vector according to the speaker vector and the frequency spectrum corresponding to the source audio data;
  • the decoder is used to obtain the frequency spectrum of the target speaker according to the feature vector and the speaker vector.
  • the frequency spectrum conversion module is also used to obtain a training sample set.
  • the training sample set includes multiple training samples.
  • Each training sample includes: a training target speaker number, the speaker number corresponding to the training audio data, the frequency spectrum corresponding to the training audio data, and the frequency spectrum of the training target speaker.
  • The training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data are used as the input of the voice conversion model, and the frequency spectrum of the training target speaker is used as the expected output;
  • the weight parameters of the voice conversion model are updated according to the comparison between the actual output and the expected output to obtain a trained voice conversion model.
  • The spectrum conversion module is also used to: calculate the loss value according to the difference between the actual output of the voice conversion model and the expected output; when the loss value does not meet the preset convergence condition, update the weight parameters of the voice conversion model according to the loss value; obtain the next training sample and re-enter the step of using the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model and the frequency spectrum of the training target speaker as the expected output; and stop the training when the calculated loss value meets the preset convergence condition, obtaining the trained voice conversion model.
  • The spectrum conversion module is also used to: calculate the first difference between the frequency spectrum actually output by the voice conversion model and the frequency spectrum of the training target speaker; input the frequency spectrum actually output by the voice conversion model into the phoneme recognizer to obtain predicted phoneme information and compare it with the phoneme information corresponding to the source audio data to calculate the second difference; and obtain the loss value from the first difference and the second difference.
  • The spectrum conversion module is also used to: obtain the preset training target spectrum frame and the preset average spectrum frame of the training target speaker; obtain the preset probability and determine the reference frame corresponding to each spectrum frame according to the preset probability; output the corresponding spectrum frame according to the training target spectrum frame when the reference frame corresponding to the decoder's output spectrum frame is the training target spectrum frame; and output the corresponding spectrum frame according to the average spectrum frame when the reference frame corresponding to the decoder's output spectrum frame is the average spectrum frame.
  • The processing module is also used to remove the blank parts at the beginning and end of the source audio data, apply pre-emphasis, and perform a short-time Fourier transform to obtain the first spectrum, and to pass the first spectrum through the mel filter bank to obtain the mel spectrum.
  • the present application provides a voice conversion device, and the internal structure diagram of the voice conversion device is shown in FIG. 9.
  • the voice conversion device includes a processor, a memory, and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the voice conversion device stores an operating system, and may also store a computer program.
  • the processor can implement the voice conversion method.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute the voice conversion method.
  • the device may include more or fewer parts than shown in the figures, or combine certain parts, or have a different arrangement of parts.
  • a voice conversion method provided may be implemented in the form of a computer program, and the computer program may run on the voice conversion device as shown in FIG. 9.
  • the memory of the voice conversion device can store various program modules that make up a voice conversion device. For example, the acquisition module 802, the receiving module 804, the processing module 806, the spectrum conversion module 808, and the speech generation module 810.
  • a voice conversion device includes a processor and a memory, and a computer program is stored in the memory.
  • When the computer program is executed by the processor, the processor performs the following steps: obtaining source audio data; receiving the selected target speaker number and the speaker number corresponding to the source audio data; preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data; taking the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model, and obtaining the frequency spectrum of the target speaker output by the voice conversion model; and converting the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
  • the voice conversion model includes: an affine matrix, an encoder, and a decoder.
  • the affine matrix is used to encode the input target speaker number and the speaker number corresponding to the source audio data into a speaker vector
  • the encoder is used to obtain the feature vector according to the speaker vector and the frequency spectrum corresponding to the source audio data;
  • the decoder is used to obtain the frequency spectrum of the target speaker according to the feature vector and the speaker vector.
  • The training steps of the voice conversion model are as follows: obtain a training sample set, where the training sample set includes multiple training samples and each training sample includes: a training target speaker number, the speaker number corresponding to the training audio data, the frequency spectrum corresponding to the training audio data, and the frequency spectrum of the training target speaker; use the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model, and use the frequency spectrum of the training target speaker as the expected output; and update the weight parameters of the voice conversion model according to the comparison between the actual output of the voice conversion model and the expected output, to obtain a trained voice conversion model.
  • Updating the weight parameters of the voice conversion model according to the comparison between the actual output and the expected output to obtain the trained voice conversion model includes: calculating the loss value from the difference between the actual output of the voice conversion model and the expected output; when the loss value does not meet the preset convergence condition, updating the weight parameters of the voice conversion model according to the loss value; obtaining the next training sample and re-entering the step of using the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model and the frequency spectrum of the training target speaker as the expected output; and stopping the training when the calculated loss value meets the preset convergence condition, obtaining the trained voice conversion model.
  • Calculating the loss value from the difference between the actual output of the voice conversion model and the expected output includes: calculating the first difference between the frequency spectrum actually output by the voice conversion model and the frequency spectrum of the training target speaker; inputting the frequency spectrum actually output by the voice conversion model into the phoneme recognizer to obtain predicted phoneme information and comparing it with the phoneme information corresponding to the source audio data to calculate the second difference; and obtaining the loss value from the first difference and the second difference.
  • The training process of the above voice conversion model includes: obtaining a preset training target spectrum frame and a preset average spectrum frame of the training target speaker; obtaining a preset probability and determining the reference frame corresponding to each spectrum frame according to the preset probability; outputting the corresponding spectrum frame according to the training target spectrum frame when the reference frame corresponding to the decoder's output spectrum frame is the training target spectrum frame; and outputting the corresponding spectrum frame according to the average spectrum frame when the reference frame corresponding to the decoder's output spectrum frame is the average spectrum frame.
  • Preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data includes: removing the blank parts at the beginning and end of the source audio data, applying pre-emphasis, and performing a short-time Fourier transform to obtain the first spectrum; and passing the first spectrum through the mel filter bank to obtain the mel spectrum.
  • The present application provides a storage medium storing a computer program.
  • When the computer program is executed by a processor, the processor performs the following steps: obtaining source audio data; receiving the selected target speaker number and the speaker number corresponding to the source audio data; preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data; taking the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model, and obtaining the frequency spectrum of the target speaker output by the voice conversion model; and converting the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
  • the voice conversion model includes: an affine matrix, an encoder, and a decoder.
  • the affine matrix is used to encode the input target speaker number and the speaker number corresponding to the source audio data into a speaker vector
  • the encoder is used to obtain the feature vector according to the speaker vector and the frequency spectrum corresponding to the source audio data;
  • the decoder is used to obtain the frequency spectrum of the target speaker according to the feature vector and the speaker vector.
  • The training steps of the voice conversion model are as follows: obtain a training sample set, where the training sample set includes multiple training samples and each training sample includes: a training target speaker number, the speaker number corresponding to the training audio data, the frequency spectrum corresponding to the training audio data, and the frequency spectrum of the training target speaker; use the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model, and use the frequency spectrum of the training target speaker as the expected output; and update the weight parameters of the voice conversion model according to the comparison between the actual output of the voice conversion model and the expected output, to obtain a trained voice conversion model.
  • Updating the weight parameters of the voice conversion model according to the comparison between the actual output and the expected output to obtain the trained voice conversion model includes: calculating the loss value from the difference between the actual output of the voice conversion model and the expected output; when the loss value does not meet the preset convergence condition, updating the weight parameters of the voice conversion model according to the loss value; obtaining the next training sample and re-entering the step of using the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model and the frequency spectrum of the training target speaker as the expected output; and stopping the training when the calculated loss value meets the preset convergence condition, obtaining the trained voice conversion model.
  • Calculating the loss value from the difference between the actual output of the voice conversion model and the expected output includes: calculating the first difference between the frequency spectrum actually output by the voice conversion model and the frequency spectrum of the training target speaker; inputting the frequency spectrum actually output by the voice conversion model into the phoneme recognizer to obtain predicted phoneme information and comparing it with the phoneme information corresponding to the source audio data to calculate the second difference; and obtaining the loss value from the first difference and the second difference.
  • The training process of the voice conversion model includes: obtaining a preset training target spectrum frame and a preset average spectrum frame of the training target speaker; obtaining a preset probability and determining the reference frame corresponding to each spectrum frame according to the preset probability; outputting the corresponding spectrum frame according to the training target spectrum frame when the reference frame corresponding to the decoder's output spectrum frame is the training target spectrum frame; and outputting the corresponding spectrum frame according to the average spectrum frame when the reference frame corresponding to the decoder's output spectrum frame is the average spectrum frame.
  • Preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data includes: removing the blank parts at the beginning and end of the source audio data, applying pre-emphasis, and performing a short-time Fourier transform to obtain the first spectrum; and passing the first spectrum through the mel filter bank to obtain the mel spectrum.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A voice conversion method, a voice conversion apparatus, a device, and a storage medium. The method includes: obtaining source audio data (102); receiving a selected target speaker number and the speaker number corresponding to the source audio data (104); preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data (106); taking the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of a voice conversion model, and obtaining the frequency spectrum of the target speaker output by the voice conversion model (108); and converting the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder (110). The target speakers are numbered, and during actual conversion the frequency spectrum of the target speaker to be converted is controlled by the number, which achieves multi-speaker-to-multi-speaker voice conversion and improves applicability.

Description

Voice conversion method, apparatus, device, and storage medium
Technical Field
This application relates to the field of signal processing, and in particular to a voice conversion method, apparatus, device, and storage medium.
Background Art
With the development of technology, voice conversion technology has become increasingly mature; timbre conversion can be realized through a voice conversion model, which has a wide range of application scenarios.
Technical Problem
However, existing voice conversion models only support conversion for a single speaker.
Technical Solution
In view of this, it is necessary to provide a voice conversion method, apparatus, device, and storage medium to address the above problem.
In a first aspect, an embodiment of the present application provides a voice conversion method, the method including:
obtaining source audio data;
receiving a selected target speaker number and the speaker number corresponding to the source audio data;
preprocessing the source audio data to obtain a frequency spectrum corresponding to the source audio data;
taking the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of a voice conversion model, and obtaining the frequency spectrum of the target speaker output by the voice conversion model;
converting the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
In one embodiment, the voice conversion model includes:
an affine matrix, an encoder, and a decoder; the affine matrix is used to encode the input target speaker number and the speaker number corresponding to the source audio data into a speaker vector, the encoder is used to obtain a feature vector from the speaker vector and the frequency spectrum corresponding to the source audio data, and the decoder is used to obtain the frequency spectrum of the target speaker from the feature vector and the speaker vector.
In one embodiment, the training steps of the voice conversion model are as follows:
obtaining a training sample set, where the training sample set includes multiple training samples, and each training sample includes: a training target speaker number, the speaker number corresponding to the training audio data, the frequency spectrum corresponding to the training audio data, and the frequency spectrum of the training target speaker;
using the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model, and using the frequency spectrum of the training target speaker as the expected output;
updating the weight parameters of the voice conversion model according to the comparison between the actual output of the voice conversion model and the expected output, to obtain a trained voice conversion model.
In one embodiment, updating the weight parameters of the voice conversion model according to the comparison between the actual output of the voice conversion model and the expected output to obtain a trained voice conversion model includes:
calculating a loss value from the difference between the actual output of the voice conversion model and the expected output;
when the loss value does not meet a preset convergence condition, updating the weight parameters of the voice conversion model according to the loss value;
obtaining the next training sample, and re-entering the step of using the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model and the frequency spectrum of the training target speaker as the expected output, until the calculated loss value meets the preset convergence condition, at which point the training is stopped and the trained voice conversion model is obtained.
In one embodiment, calculating the loss value from the difference between the actual output of the voice conversion model and the expected output includes:
calculating a first difference between the frequency spectrum actually output by the voice conversion model and the frequency spectrum of the training target speaker;
inputting the frequency spectrum actually output by the voice conversion model into a phoneme recognizer to obtain predicted phoneme information, and comparing it with the phoneme information corresponding to the source audio data to calculate a second difference;
obtaining the loss value from the first difference and the second difference.
In one embodiment, the training process of the decoder includes:
obtaining a preset training target spectrum frame and a preset average spectrum frame of the training target speaker;
obtaining a preset probability, and determining the reference frame corresponding to each spectrum frame according to the preset probability;
when the reference frame corresponding to a spectrum frame output by the decoder is the training target spectrum frame, outputting the corresponding spectrum frame according to the training target spectrum frame;
when the reference frame corresponding to a spectrum frame output by the decoder is the average spectrum frame, outputting the corresponding spectrum frame according to the average spectrum frame.
In one embodiment, preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data includes:
removing the blank parts at the beginning and end of the source audio data, applying pre-emphasis, and performing a short-time Fourier transform to obtain a first spectrum;
passing the first spectrum through a mel filter bank to obtain a mel spectrum.
In a second aspect, an embodiment of the present application provides a voice conversion apparatus, the apparatus including:
an obtaining module, configured to obtain source audio data;
a receiving module, configured to receive a selected target speaker number and the speaker number corresponding to the source audio data;
a processing module, configured to preprocess the source audio data to obtain the frequency spectrum corresponding to the source audio data;
a spectrum conversion module, configured to take the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model, and obtain the frequency spectrum of the target speaker output by the voice conversion model;
a voice generation module, configured to convert the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
In a third aspect, an embodiment of the present application provides a voice conversion device, including a memory and a processor, where a computer program is stored in the memory, and when the computer program is executed by the processor, the processor performs the following steps:
obtaining source audio data;
receiving a selected target speaker number and the speaker number corresponding to the source audio data;
preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data;
taking the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model, and obtaining the frequency spectrum of the target speaker output by the voice conversion model;
converting the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
In a fourth aspect, an embodiment of the present application provides a storage medium storing a computer program, and when the computer program is executed by a processor, the processor performs the following steps:
obtaining source audio data;
receiving a selected target speaker number and the speaker number corresponding to the source audio data;
preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data;
taking the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model, and obtaining the frequency spectrum of the target speaker output by the voice conversion model;
converting the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
Beneficial Effects
Implementing the embodiments of the present application provides the following beneficial effects:
With the above voice conversion method, apparatus, device, and storage medium, the speakers are numbered; during actual conversion, the frequency spectrum of the target speaker to be converted is controlled by the number, which achieves multi-speaker-to-multi-speaker voice conversion and improves applicability.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art may still derive other drawings from these drawings without creative effort.
In the drawings:
FIG. 1 is a flowchart of a voice conversion method in an embodiment of the present application;
FIG. 2 is a training flowchart of a voice conversion model in an embodiment of the present application;
FIG. 3 is a flowchart of obtaining a loss value in an embodiment of the present application;
FIG. 4 is a flowchart of the decoder referring to a target spectrum frame in an embodiment of the present application;
FIG. 5 is a flowchart of obtaining the frequency spectrum corresponding to source audio data in an embodiment of the present application;
FIG. 6 is a schematic diagram of the generation stage of voice conversion in an embodiment of the present application;
FIG. 7 is a schematic diagram of the training phase of voice conversion in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a voice conversion apparatus in an embodiment of the present application;
FIG. 9 is a schematic diagram of the internal structure of a voice conversion device in an embodiment of the present application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments. Apparently, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
As shown in FIG. 1, in one embodiment, a voice conversion method is proposed, and the method includes:
Step 102: Obtain source audio data.
The source audio data refers to the audio that needs to undergo voice conversion. For example, suppose an utterance 'a' spoken by speaker 'A' currently needs to be converted into the same utterance 'a' spoken by speaker 'B'; the utterance refers to the spoken content, that is, the text information in the audio. The audio data to which the utterance 'a' spoken by speaker 'A' belongs is the source audio data.
Step 104: Receive a selected target speaker number and the speaker number corresponding to the source audio data.
The number is a code assigned to each speaker and represents the speaker's timbre; different numbers represent different timbres. The target speaker number is the number of the speaker whose timbre the audio needs to be converted to, such as 'B' above; the speaker number corresponding to the source audio data is the number of the speaker whose timbre is contained in the source audio data, that is, the timbre to be converted, such as 'A' above.
Step 106: Preprocess the source audio data to obtain the frequency spectrum corresponding to the source audio data.
The source audio data is a time-domain signal, i.e., a waveform of sound amplitude varying with time; speech features cannot be extracted and analyzed from the time-domain signal, so the time-domain signal is converted into a frequency-domain signal through preprocessing to obtain the frequency spectrum corresponding to the source audio data.
Step 108: Take the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model, and obtain the frequency spectrum of the target speaker output by the voice conversion model.
The voice conversion model refers to a virtual program model that can convert an input frequency spectrum into a target frequency spectrum. During conversion, the frequency spectrum corresponding to the source audio data, the target speaker number, and the speaker number corresponding to the source audio data are input to obtain the frequency spectrum of the target speaker. The speaker numbers before and after conversion are input so that the speaker is treated as a variable feature; when a specific speaker needs to be designated, the output is generated based on that number.
Step 110: Convert the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
A vocoder is a speech analysis and synthesis system based on a model of the speech signal. Only model parameters are used in transmission; during encoding and decoding, model parameter estimation and speech synthesis techniques are used. It is also called a speech analysis-synthesis system or a speech band compression system, and is a powerful tool for compressing the communication band and for secure communication. After the frequency spectrum of the target speaker is obtained, the vocoder converts the spectrum into the corresponding voice. The vocoder may be WORLD, Griffin-Lim, WaveNet, etc.
By numbering the speakers, the frequency spectrum of the target speaker to be converted is controlled by the number during actual conversion, which achieves multi-speaker-to-multi-speaker voice conversion and improves applicability.
In one embodiment, the voice conversion model includes:
an affine matrix, an encoder, and a decoder; the affine matrix is used to encode the input target speaker number and the speaker number corresponding to the source audio data into a speaker vector, the encoder is used to obtain a feature vector from the speaker vector and the frequency spectrum corresponding to the source audio data, and the decoder is used to obtain the frequency spectrum of the target speaker from the feature vector and the speaker vector.
The affine matrix refers to a Speaker Embedding, which stores the correspondence between each speaker and the frequency spectrum; the specific architecture of the encoder is CNN + Bi-LSTM + Linear Projection, and the specific architecture of the decoder is Pre-Net + Attention + LSTM + Post-Net.
The specific execution flow inside the voice conversion model includes:
The target speaker number and the speaker number corresponding to the source audio data are input into the Speaker Embedding to obtain the corresponding speaker vector. The frequency spectrum is input into the encoder and passed through a CNN (Convolutional Neural Network); the speaker vector is input into the Bi-LSTM (Bi-directional Long Short-Term Memory, a model for context modeling that includes forward and backward directions), and the speech feature vector is obtained through a linear projection. The feature vector is then input into the decoder and passed through the PreNet; the speaker vector is input into the Attention module and into the unidirectional LSTM (Long Short-Term Memory). Finally, the frequency spectrum of the target speaker corresponding to the speaker vector is output through a CNN.
Feeding the speaker vector into intermediate stages of the encoder and decoder carries the number variable through the encoding and decoding process, so that the corresponding frequency spectrum is finally output according to that variable.
As shown in FIG. 2, in one embodiment, the training steps of the voice conversion model are as follows:
Step 202: Obtain a training sample set; the training sample set includes multiple training samples, and each training sample includes: a training target speaker number, the speaker number corresponding to the training audio data, the frequency spectrum corresponding to the training audio data, and the frequency spectrum of the training target speaker.
The training sample set contains the numbers and frequency spectra of different speakers. For example, suppose the utterance 'a' spoken by speaker 'A' needs to be converted into the utterance 'a' spoken by speaker 'B'. The frequency spectrum corresponding to the utterance 'a' spoken by speaker 'A' is the frequency spectrum corresponding to the training audio data, the frequency spectrum corresponding to the utterance 'a' spoken by speaker 'B' is the frequency spectrum of the training target speaker, and 'A' and 'B' are, respectively, the speaker number corresponding to the training audio data and the training target speaker number.
The purpose of sample training is to let the voice conversion model fit, from a large amount of data, the parameters for converting voice features within the sample population, so that in subsequent production the voice features can be converted according to the fitted parameters; the more training samples there are, the larger the sample population and the more likely it is to cover new speech input during actual production.
Step 204: Use the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model, and use the frequency spectrum of the training target speaker as the expected output.
As in the above example, the frequency spectrum corresponding to the utterance 'a' spoken by speaker 'A' and the numbers 'A' and 'B' are used as the input, and the frequency spectrum corresponding to the utterance 'a' spoken by speaker 'B' is used as the expected output; the voice conversion model refers to the expected output when producing the corresponding spectrum.
Step 206: Update the weight parameters of the voice conversion model according to the comparison between the actual output of the voice conversion model and the expected output, to obtain a trained voice conversion model.
After the actual output is obtained in training, it is compared and analyzed against the expected output, the weight parameters of the voice conversion model are updated, and the model is optimized.
By training the voice conversion model with the preset inputs and expected outputs and generating output with the trained model, higher conversion accuracy and better results are obtained.
In one embodiment, updating the weight parameters of the voice conversion model according to the comparison between the actual output and the expected output to obtain a trained voice conversion model includes:
calculating a loss value from the difference between the actual output of the voice conversion model and the expected output;
when the loss value does not meet the preset convergence condition, updating the weight parameters of the voice conversion model according to the loss value;
obtaining the next training sample and re-entering the step of using the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model and the frequency spectrum of the training target speaker as the expected output, until the calculated loss value meets the preset convergence condition, at which point the training is stopped and the trained voice conversion model is obtained.
The loss value indicates how much the actual output is distorted relative to the expected output; specifically, it may be the difference between the actually output spectrum and the expected output spectrum, and it may also include other differences. During training, all training samples in the training sample set are trained in repeated cycles, the loss value is calculated for each round, and it is checked whether the loss value meets the preset convergence condition; when the loss value is detected to meet the preset convergence condition, the training is completed and the trained voice conversion model is obtained.
Through repeated training in which the weight parameters are adjusted according to the loss value in every round until the loss value converges, the training is judged complete and a trained voice conversion model is obtained; generating output with the trained model yields higher conversion accuracy and better results.
As shown in FIG. 3, in one embodiment, calculating the loss value from the difference between the actual output of the voice conversion model and the expected output includes:
Step 302: Calculate a first difference between the frequency spectrum actually output by the voice conversion model and the frequency spectrum of the target speaker.
The loss value consists of two parts: one is the first difference between the actually output spectrum and the target spectrum, and the other is the second difference between the predicted phoneme information and the source phoneme information.
Step 304: Input the frequency spectrum actually output by the voice conversion model into a phoneme recognizer to obtain predicted phoneme information, and compare it with the phoneme information corresponding to the source audio data to calculate a second difference.
The phoneme recognizer is a virtual program module that takes a frequency spectrum as input and outputs the phoneme information in the spectrum; the specific phoneme recognizer uses the CTC algorithm (Connectionist Temporal Classification), and its internal architecture is Linear Projection + CTC Loss. The source phoneme information is extracted from the source audio data; the specific phoneme information is a phoneme vector formed by phoneme encoding. CTC is used in the training phase.
Step 306: Obtain the loss value from the first difference and the second difference.
Specifically, the loss value is obtained by adding the first difference and the second difference.
By introducing the CTC algorithm and computing the phoneme-information difference, the voice conversion model is helped to align and converge faster during training, which improves the training speed.
As shown in FIG. 4, in one embodiment, the training process of the decoder includes:
Step 402: Obtain a preset training target spectrum frame and a preset average spectrum frame of the training target speaker.
When an existing decoder outputs a spectrum in the training phase, it outputs with reference to a given preset target spectrum, and each output frame refers to the corresponding frame of the target spectrum. In the actual generation process, however, there is no target spectrum to refer to, so the results obtained in the training phase deviate from those obtained in the generation phase. If the target spectrum frame is referenced exclusively, the generation phase cannot achieve results as good as the training phase; if the target spectrum frame is not referenced at all, the model is difficult to converge. Therefore, through internal control of the decoder, a reference probability is set so that the target spectrum frames are randomly distributed among the reference frames; by not referencing the target spectrum frame exclusively, the results obtained in the generation phase are made close to the real situation.
The training target spectrum frame above refers to each frame in the spectrum of the target speaker, and the average spectrum frame of the training target speaker refers to the average of the spectrum frames over all spectra corresponding to the target speaker.
Step 404: Obtain a preset probability, and determine the reference frame corresponding to each spectrum frame according to the preset probability.
The preset probability is set in advance and is controlled by the teacher forcing rate and the speaker global mean frame in the decoder; when the decoder outputs the spectrum, the corresponding spectrum frame is referenced according to the preset probability. In one embodiment, the preset probability is 0.5; of course, the preset probability can also take other values.
Step 406: When the reference frame corresponding to a spectrum frame output by the decoder is the training target spectrum frame, output the corresponding spectrum frame according to the training target spectrum frame.
Specifically, according to the preset probability, when the reference frame determined for the currently output spectrum frame is the training target spectrum frame, the output is generated according to the training target spectrum frame.
Step 408: When the reference frame corresponding to a spectrum frame output by the decoder is the average spectrum frame, output the corresponding spectrum frame according to the average spectrum frame.
When the reference frame determined for the currently output spectrum frame is the average spectrum frame, the output is generated according to the average spectrum frame.
By introducing the teacher forcing rate and the speaker global mean frame to control the probability of referring to the target spectrum frame, the target spectrum frame is not referenced exclusively, which is closer to the actual generation behavior and reduces the deviation in results caused by the mismatch between training and generation.
As shown in FIG. 5, in one embodiment, preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data includes:
Step 502: Remove the blank parts at the beginning and end of the source audio data, apply pre-emphasis, and perform a short-time Fourier transform to obtain a first spectrum.
Removing the blank audio portions from the source audio data helps the Attention module learn alignment better; pre-emphasis boosts the high-frequency content of the audio and filters out some noise; the STFT (short-time Fourier transform) converts the waveform from the time domain to the frequency domain to obtain the first spectrum, which makes it easier to extract speech features.
Step 504: Pass the first spectrum through a mel filter bank to obtain a mel spectrum.
The frequency scale of the first spectrum does not match the human ear's perception, so the first spectrum is passed through the mel filter bank to obtain the mel spectrum, whose frequency scale matches human hearing. In the mel filter bank, the filters in the low-frequency range are denser with larger thresholds, while the filters in the high-frequency range are sparser with smaller thresholds.
Through preprocessing, the source audio data is filtered, denoised, and converted to the frequency domain, so that the spectrum entering the voice conversion model is clean and accurate, which improves the accuracy of the voice conversion.
As shown in FIG. 6, in one embodiment, the generation stage of voice conversion specifically includes: preprocessing the source audio data to obtain the mel spectrum of the source speaker, and inputting the mel spectrum of the source speaker, the target speaker number, and the speaker number corresponding to the source audio data into the voice conversion model to obtain the mel spectrum of the target speaker. Specifically, the target speaker number and the speaker number corresponding to the source audio data are input into the Speaker Embedding to obtain the corresponding speaker vector. The frequency spectrum is input into the encoder and passed through a CNN (Convolutional Neural Network); the speaker vector is input into the Bi-LSTM (Bi-directional Long Short-Term Memory, a model for context modeling that includes forward and backward directions), and the speech feature vector is obtained through a linear projection. The feature vector is then input into the decoder and passed through the PreNet; the encoded vector is input into the Attention module, the speaker vector is input into the unidirectional LSTM, and finally the frequency spectrum of the target speaker corresponding to the speaker vector is output through a CNN. The obtained mel spectrum of the target speaker is converted into the voice of the target speaker through a vocoder.
As shown in FIG. 7, in one embodiment, the training phase of voice conversion specifically includes: preprocessing the training audio data to obtain the mel spectrum of the training speaker, and inputting the mel spectrum of the training speaker, the training target speaker number, and the speaker number corresponding to the training audio data into the voice conversion model to obtain the mel spectrum of the training target speaker. Specifically, the training target speaker number and the speaker number corresponding to the training audio data are input into the Speaker Embedding to obtain the corresponding training speaker vector. The frequency spectrum is input into the encoder and passed through a CNN; the training speaker vector is input into the Bi-LSTM, and the speech feature vector is obtained through a linear projection. The feature vector is then input into the decoder and passed through the PreNet; the training speaker vector is input into the Attention module and into the unidirectional LSTM. Finally, the frequency spectrum of the training target speaker corresponding to the training speaker vector is output through a CNN. The obtained mel spectrum of the training target speaker is input into the CTC module to obtain predicted phoneme information, which is compared with the source phoneme information to obtain a phoneme comparison error; combined with the spectrum comparison error, this is back-propagated to update the weight parameters of the voice conversion model. In addition, the obtained mel spectrum of the training target speaker is converted into the voice of the training target speaker through a vocoder.
As shown in FIG. 8, the present application provides a voice conversion apparatus, which includes:
an obtaining module 802, configured to obtain source audio data;
a receiving module 804, configured to receive a selected target speaker number and the speaker number corresponding to the source audio data;
a processing module 806, configured to preprocess the source audio data to obtain the frequency spectrum corresponding to the source audio data;
a spectrum conversion module 808, configured to take the target speaker number, the speaker number corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as the input of the voice conversion model, and obtain the frequency spectrum of the target speaker output by the voice conversion model;
a voice generation module 810, configured to convert the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
In one embodiment, the voice conversion model includes: an affine matrix, an encoder, and a decoder; the affine matrix is used to encode the input target speaker number and the speaker number corresponding to the source audio data into a speaker vector, the encoder is used to obtain a feature vector from the speaker vector and the frequency spectrum corresponding to the source audio data, and the decoder is used to obtain the frequency spectrum of the target speaker from the feature vector and the speaker vector.
In one embodiment, the spectrum conversion module is further configured to obtain a training sample set, where the training sample set includes multiple training samples and each training sample includes: a training target speaker number, the speaker number corresponding to the training audio data, the frequency spectrum corresponding to the training audio data, and the frequency spectrum of the training target speaker; to use the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model and the frequency spectrum of the training target speaker as the expected output; and to update the weight parameters of the voice conversion model according to the comparison between the actual output of the voice conversion model and the expected output, to obtain a trained voice conversion model.
In one embodiment, the spectrum conversion module is further configured to calculate a loss value from the difference between the actual output of the voice conversion model and the expected output; when the loss value does not meet the preset convergence condition, update the weight parameters of the voice conversion model according to the loss value; and obtain the next training sample and re-enter the step of using the training speaker number, the speaker number corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as the input of the voice conversion model and the frequency spectrum of the training target speaker as the expected output, until the calculated loss value meets the preset convergence condition, at which point the training is stopped and the trained voice conversion model is obtained.
In one embodiment, the spectrum conversion module is further configured to calculate a first difference between the frequency spectrum actually output by the voice conversion model and the frequency spectrum of the training target speaker; input the frequency spectrum actually output by the voice conversion model into the phoneme recognizer to obtain predicted phoneme information, and compare it with the phoneme information corresponding to the source audio data to calculate a second difference; and obtain the loss value from the first difference and the second difference.
In one embodiment, the spectrum conversion module is further configured to obtain a preset training target spectrum frame and a preset average spectrum frame of the training target speaker; obtain a preset probability, and determine the reference frame corresponding to each spectrum frame according to the preset probability; when the reference frame corresponding to a spectrum frame output by the decoder is the training target spectrum frame, output the corresponding spectrum frame according to the training target spectrum frame; and when the reference frame corresponding to a spectrum frame output by the decoder is the average spectrum frame, output the corresponding spectrum frame according to the average spectrum frame.
In one embodiment, the processing module is further configured to remove the blank parts at the beginning and end of the source audio data, apply pre-emphasis, and perform a short-time Fourier transform to obtain a first spectrum, and to pass the first spectrum through a mel filter bank to obtain a mel spectrum.
In one embodiment, the present application provides a voice conversion device, whose internal structure is shown in FIG. 9. The voice conversion device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the voice conversion device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the voice conversion method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the voice conversion method. A person skilled in the art can understand that the structure shown in FIG. 9 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the voice conversion device to which the solution is applied; a specific voice conversion device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the voice conversion method provided can be implemented in the form of a computer program, and the computer program can run on the voice conversion device shown in FIG. 9. The memory of the voice conversion device can store the program modules that make up a voice conversion apparatus, for example, the obtaining module 802, the receiving module 804, the processing module 806, the spectrum conversion module 808, and the voice generation module 810.
一种语音转换设备,包括处理器和存储器,存储器中储存有计算机程序,计算机程序被处理器执行时,使得处理器执行如下步骤:获取源音频数据;接收选择的目标说话人编号和源音频数据对应的说话人编号;对源音频数据进行预处理,得到与源音频数据对应的频谱;将目标说话人编号、源音频数据对应的说话人编号和源音频数据对应的频谱作为语音转换模型的输入,获取语音转换模型输出的目标说话人的频谱;通过声码器将目标说话人的频谱转换为目标说话人的语音。
In one embodiment, the voice conversion model includes an affine matrix, an encoder, and a decoder. The affine matrix is configured to encode the input target speaker ID and the speaker ID corresponding to the source audio data into a speaker vector, the encoder is configured to obtain a feature vector according to the speaker vector and the frequency spectrum corresponding to the source audio data, and the decoder is configured to obtain the frequency spectrum of the target speaker according to the feature vector and the speaker vector.
In one embodiment, the training steps of the voice conversion model are as follows: acquiring a training sample set including a plurality of training samples, each training sample including a training target speaker ID, a speaker ID corresponding to training audio data, a frequency spectrum corresponding to the training audio data, and a frequency spectrum of the training target speaker; taking the training target speaker ID, the speaker ID corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as inputs of the voice conversion model, and taking the frequency spectrum of the training target speaker as the expected output; and updating the weight parameters of the voice conversion model according to the comparison result between the actual output and the expected output of the voice conversion model, to obtain a trained voice conversion model.
In one embodiment, updating the weight parameters of the voice conversion model according to the comparison result between the actual output and the expected output of the voice conversion model, to obtain the trained voice conversion model, includes: calculating a loss value according to the comparison difference between the actual output and the expected output of the voice conversion model; when the loss value does not meet a preset convergence condition, updating the weight parameters of the voice conversion model according to the loss value; and acquiring the next training sample and re-entering the step of taking the training target speaker ID, the speaker ID corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as inputs of the voice conversion model and taking the frequency spectrum of the training target speaker as the expected output, until the calculated loss value meets the preset convergence condition, at which point training stops and the trained voice conversion model is obtained.
In one embodiment, calculating the loss value according to the comparison difference between the actual output and the expected output of the voice conversion model includes: calculating a first difference between the frequency spectrum actually output by the voice conversion model and the frequency spectrum of the training target speaker; inputting the frequency spectrum actually output by the voice conversion model into a phoneme recognizer to obtain predicted phoneme information, comparing it with the phoneme information corresponding to the source audio data, and calculating a second difference; and obtaining the loss value according to the first difference and the second difference.
In one embodiment, the training process of the above voice conversion model includes: acquiring preset training target spectrum frames and a preset average spectrum frame of the training target speaker; acquiring a preset probability and determining, according to the preset probability, the reference frame corresponding to each spectrum frame; when the reference frame corresponding to a spectrum frame output by the decoder is a training target spectrum frame, outputting the corresponding spectrum frame according to the training target spectrum frame; and when the reference frame corresponding to a spectrum frame output by the decoder is the average spectrum frame, outputting the corresponding spectrum frame according to the average spectrum frame.
In one embodiment, preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data includes: removing the blank portions at the beginning and end of the source audio data, performing pre-emphasis, and performing a short-time Fourier transform to obtain a first frequency spectrum; and passing the first frequency spectrum through a Mel filter bank to obtain a Mel frequency spectrum.
In one embodiment, the present application provides a storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps: acquiring source audio data; receiving a selected target speaker ID and a speaker ID corresponding to the source audio data; preprocessing the source audio data to obtain a frequency spectrum corresponding to the source audio data; taking the target speaker ID, the speaker ID corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as inputs of a voice conversion model, and obtaining the frequency spectrum of the target speaker output by the voice conversion model; and converting the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
In one embodiment, the voice conversion model includes an affine matrix, an encoder, and a decoder. The affine matrix is configured to encode the input target speaker ID and the speaker ID corresponding to the source audio data into a speaker vector, the encoder is configured to obtain a feature vector according to the speaker vector and the frequency spectrum corresponding to the source audio data, and the decoder is configured to obtain the frequency spectrum of the target speaker according to the feature vector and the speaker vector.
In one embodiment, the training steps of the voice conversion model are as follows: acquiring a training sample set including a plurality of training samples, each training sample including a training target speaker ID, a speaker ID corresponding to training audio data, a frequency spectrum corresponding to the training audio data, and a frequency spectrum of the training target speaker; taking the training target speaker ID, the speaker ID corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as inputs of the voice conversion model, and taking the frequency spectrum of the training target speaker as the expected output; and updating the weight parameters of the voice conversion model according to the comparison result between the actual output and the expected output of the voice conversion model, to obtain a trained voice conversion model.
In one embodiment, updating the weight parameters of the voice conversion model according to the comparison result between the actual output and the expected output of the voice conversion model, to obtain the trained voice conversion model, includes: calculating a loss value according to the comparison difference between the actual output and the expected output of the voice conversion model; when the loss value does not meet a preset convergence condition, updating the weight parameters of the voice conversion model according to the loss value; and acquiring the next training sample and re-entering the step of taking the training target speaker ID, the speaker ID corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as inputs of the voice conversion model and taking the frequency spectrum of the training target speaker as the expected output, until the calculated loss value meets the preset convergence condition, at which point training stops and the trained voice conversion model is obtained.
In one embodiment, calculating the loss value according to the comparison difference between the actual output and the expected output of the voice conversion model includes: calculating a first difference between the frequency spectrum actually output by the voice conversion model and the frequency spectrum of the training target speaker; inputting the frequency spectrum actually output by the voice conversion model into a phoneme recognizer to obtain predicted phoneme information, comparing it with the phoneme information corresponding to the source audio data, and calculating a second difference; and obtaining the loss value according to the first difference and the second difference.
In one embodiment, the training process of the voice conversion model includes: acquiring preset training target spectrum frames and a preset average spectrum frame of the training target speaker; acquiring a preset probability and determining, according to the preset probability, the reference frame corresponding to each spectrum frame; when the reference frame corresponding to a spectrum frame output by the decoder is a training target spectrum frame, outputting the corresponding spectrum frame according to the training target spectrum frame; and when the reference frame corresponding to a spectrum frame output by the decoder is the average spectrum frame, outputting the corresponding spectrum frame according to the average spectrum frame.
In one embodiment, preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data includes: removing the blank portions at the beginning and end of the source audio data, performing pre-emphasis, and performing a short-time Fourier transform to obtain a first frequency spectrum; and passing the first frequency spectrum through a Mel filter bank to obtain a Mel frequency spectrum.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as the combinations of these technical features are not contradictory, they shall be regarded as falling within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they shall not therefore be construed as limiting the scope of the present patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

  1. A voice conversion method, characterized in that the method comprises:
    acquiring source audio data;
    receiving a selected target speaker ID and a speaker ID corresponding to the source audio data;
    preprocessing the source audio data to obtain a frequency spectrum corresponding to the source audio data;
    taking the target speaker ID, the speaker ID corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as inputs of a voice conversion model, and obtaining a frequency spectrum of a target speaker output by the voice conversion model; and
    converting the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
  2. The method according to claim 1, characterized in that the voice conversion model comprises:
    an affine matrix, an encoder, and a decoder, wherein the affine matrix is configured to encode the input target speaker ID and the speaker ID corresponding to the source audio data into a speaker vector, the encoder is configured to obtain a feature vector according to the speaker vector and the frequency spectrum corresponding to the source audio data, and the decoder is configured to obtain the frequency spectrum of the target speaker according to the feature vector and the speaker vector.
  3. The method according to claim 2, characterized in that the training steps of the voice conversion model are as follows:
    acquiring a training sample set, the training sample set including a plurality of training samples, each training sample including: a training target speaker ID, a speaker ID corresponding to training audio data, a frequency spectrum corresponding to the training audio data, and a frequency spectrum of the training target speaker;
    taking the training target speaker ID, the speaker ID corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as inputs of the voice conversion model, and taking the frequency spectrum of the training target speaker as an expected output; and
    updating weight parameters of the voice conversion model according to a comparison result between an actual output and the expected output of the voice conversion model, to obtain the trained voice conversion model.
  4. The method according to claim 3, characterized in that updating the weight parameters of the voice conversion model according to the comparison result between the actual output and the expected output of the voice conversion model, to obtain the trained voice conversion model, comprises:
    calculating a loss value according to a comparison difference between the actual output and the expected output of the voice conversion model;
    when the loss value does not meet a preset convergence condition, updating the weight parameters of the voice conversion model according to the loss value; and
    acquiring a next training sample and re-entering the step of taking the training target speaker ID, the speaker ID corresponding to the training audio data, and the frequency spectrum corresponding to the training audio data as inputs of the voice conversion model and taking the frequency spectrum of the training target speaker as the expected output, until the calculated loss value meets the preset convergence condition, at which point training stops and the trained voice conversion model is obtained.
  5. The method according to claim 4, characterized in that calculating the loss value according to the comparison difference between the actual output and the expected output of the voice conversion model comprises:
    calculating a first difference between the frequency spectrum actually output by the voice conversion model and the frequency spectrum of the training target speaker;
    inputting the frequency spectrum actually output by the voice conversion model into a phoneme recognizer to obtain predicted phoneme information, comparing it with the phoneme information corresponding to the source audio data, and calculating a second difference; and
    obtaining the loss value according to the first difference and the second difference.
  6. The method according to claim 2, characterized in that the training process of the decoder comprises:
    acquiring preset training target spectrum frames and a preset average spectrum frame of the training target speaker;
    acquiring a preset probability, and determining, according to the preset probability, a reference frame corresponding to each spectrum frame;
    when the reference frame corresponding to a spectrum frame output by the decoder is a training target spectrum frame, outputting the corresponding spectrum frame according to the training target spectrum frame; and
    when the reference frame corresponding to a spectrum frame output by the decoder is the average spectrum frame, outputting the corresponding spectrum frame according to the average spectrum frame.
  7. The method according to claim 1, characterized in that preprocessing the source audio data to obtain the frequency spectrum corresponding to the source audio data comprises:
    removing blank portions at the beginning and end of the source audio data, performing pre-emphasis, and performing a short-time Fourier transform to obtain a first frequency spectrum; and
    passing the first frequency spectrum through a Mel filter bank to obtain a Mel frequency spectrum.
  8. A voice conversion apparatus, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire source audio data;
    a receiving module, configured to receive a selected target speaker ID and a speaker ID corresponding to the source audio data;
    a processing module, configured to preprocess the source audio data to obtain a frequency spectrum corresponding to the source audio data;
    a spectrum conversion module, configured to take the target speaker ID, the speaker ID corresponding to the source audio data, and the frequency spectrum corresponding to the source audio data as inputs of a voice conversion model, and obtain a frequency spectrum of a target speaker output by the voice conversion model; and
    a voice generation module, configured to convert the frequency spectrum of the target speaker into the voice of the target speaker through a vocoder.
  9. A voice conversion device, comprising a memory and a processor, the memory storing a computer program, characterized in that the computer program, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
  10. A storage medium storing a computer program, characterized in that the computer program, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
PCT/CN2019/129115 2019-12-27 2019-12-27 语音转换方法、装置、设备及存储介质 WO2021128256A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980003287.4A CN111247585B (zh) 2019-12-27 2019-12-27 语音转换方法、装置、设备及存储介质
PCT/CN2019/129115 WO2021128256A1 (zh) 2019-12-27 2019-12-27 语音转换方法、装置、设备及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/129115 WO2021128256A1 (zh) 2019-12-27 2019-12-27 语音转换方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021128256A1 true WO2021128256A1 (zh) 2021-07-01

Family

ID=70864468

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/129115 WO2021128256A1 (zh) 2019-12-27 2019-12-27 语音转换方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN111247585B (zh)
WO (1) WO2021128256A1 (zh)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808595A (zh) * 2020-06-15 2021-12-17 颜蔚 一种从源说话人到目标说话人的声音转换方法及装置
CN111428867B (zh) * 2020-06-15 2020-09-18 深圳市友杰智新科技有限公司 基于可逆分离卷积的模型训练方法、装置和计算机设备
CN111862934B (zh) * 2020-07-24 2022-09-27 思必驰科技股份有限公司 语音合成模型的改进方法和语音合成方法及装置
CN111883149B (zh) * 2020-07-30 2022-02-01 四川长虹电器股份有限公司 一种带情感和韵律的语音转换方法及装置
CN112164407A (zh) * 2020-09-22 2021-01-01 腾讯音乐娱乐科技(深圳)有限公司 音色转换方法及装置
CN112562728A (zh) * 2020-11-13 2021-03-26 百果园技术(新加坡)有限公司 生成对抗网络训练方法、音频风格迁移方法及装置
CN112382297A (zh) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 用于生成音频的方法、装置、设备和介质
CN112509550A (zh) * 2020-11-13 2021-03-16 中信银行股份有限公司 语音合成模型训练方法、语音合成方法、装置及电子设备
CN112634919B (zh) * 2020-12-18 2024-05-28 平安科技(深圳)有限公司 语音转换方法、装置、计算机设备及存储介质
CN112634920B (zh) * 2020-12-18 2024-01-02 平安科技(深圳)有限公司 基于域分离的语音转换模型的训练方法及装置
CN112712789B (zh) * 2020-12-21 2024-05-03 深圳市优必选科技股份有限公司 跨语言音频转换方法、装置、计算机设备和存储介质
WO2022133630A1 (zh) * 2020-12-21 2022-06-30 深圳市优必选科技股份有限公司 跨语言音频转换方法、计算机设备和存储介质
CN112712812B (zh) * 2020-12-24 2024-04-26 腾讯音乐娱乐科技(深圳)有限公司 音频信号生成方法、装置、设备以及存储介质
CN112767912A (zh) * 2020-12-28 2021-05-07 深圳市优必选科技股份有限公司 跨语言语音转换方法、装置、计算机设备和存储介质
CN112863529B (zh) * 2020-12-31 2023-09-22 平安科技(深圳)有限公司 基于对抗学习的说话人语音转换方法及相关设备
CN113178200B (zh) * 2021-04-28 2024-03-01 平安科技(深圳)有限公司 语音转换方法、装置、服务器及存储介质
CN113345454B (zh) * 2021-06-01 2024-02-09 平安科技(深圳)有限公司 语音转换模型的训练、应用方法、装置、设备及存储介质
CN113611324B (zh) * 2021-06-21 2024-03-26 上海一谈网络科技有限公司 一种直播中环境噪声抑制的方法、装置、电子设备及存储介质
CN114283824B (zh) * 2022-03-02 2022-07-08 清华大学 一种基于循环损失的语音转换方法及装置
CN115064177A (zh) * 2022-06-14 2022-09-16 中国第一汽车股份有限公司 基于声纹编码器的语音转换方法、装置、设备及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160358597A1 (en) * 2014-03-04 2016-12-08 Tribune Digital Ventures, Llc Real Time Popularity Based Audible Content Acquisition
CN107481735A (zh) * 2017-08-28 2017-12-15 中国移动通信集团公司 一种转换音频发声的方法、服务器及计算机可读存储介质
CN108847249A (zh) * 2018-05-30 2018-11-20 苏州思必驰信息科技有限公司 声音转换优化方法和系统
CN108922543A (zh) * 2018-06-11 2018-11-30 平安科技(深圳)有限公司 模型库建立方法、语音识别方法、装置、设备及介质
CN109308892A (zh) * 2018-10-25 2019-02-05 百度在线网络技术(北京)有限公司 语音合成播报方法、装置、设备及计算机可读介质
CN110085254A (zh) * 2019-04-22 2019-08-02 南京邮电大学 基于beta-VAE和i-vector的多对多语音转换方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107464569A (zh) * 2017-07-04 2017-12-12 清华大学 声码器
CN110136690B (zh) * 2019-05-22 2023-07-14 平安科技(深圳)有限公司 语音合成方法、装置及计算机可读存储介质
CN110223705B (zh) * 2019-06-12 2023-09-15 腾讯科技(深圳)有限公司 语音转换方法、装置、设备及可读存储介质
CN110600047B (zh) * 2019-09-17 2023-06-20 南京邮电大学 基于Perceptual STARGAN的多对多说话人转换方法


Also Published As

Publication number Publication date
CN111247585B (zh) 2024-03-29
CN111247585A (zh) 2020-06-05

Similar Documents

Publication Publication Date Title
WO2021128256A1 (zh) 语音转换方法、装置、设备及存储介质
CN111048064B (zh) 基于单说话人语音合成数据集的声音克隆方法及装置
CN111247584B (zh) 语音转换方法、系统、装置及存储介质
CN109767756B (zh) 一种基于动态分割逆离散余弦变换倒谱系数的音声特征提取算法
US11393452B2 (en) Device for learning speech conversion, and device, method, and program for converting speech
Huang et al. Refined wavenet vocoder for variational autoencoder based voice conversion
CN111261145B (zh) 语音处理装置、设备及其训练方法
WO2023001128A1 (zh) 音频数据的处理方法、装置及设备
Du et al. A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement
WO2023116660A2 (zh) 一种模型训练以及音色转换方法、装置、设备及介质
CN112562655A (zh) 残差网络的训练和语音合成方法、装置、设备及介质
JP2023546098A (ja) オーディオ生成器ならびにオーディオ信号生成方法およびオーディオ生成器学習方法
CN111724809A (zh) 一种基于变分自编码器的声码器实现方法及装置
Yang et al. Adversarial feature learning and unsupervised clustering based speech synthesis for found data with acoustic and textual noise
CN112908293B (zh) 一种基于语义注意力机制的多音字发音纠错方法及装置
US20240127832A1 (en) Decoder
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
Zhang et al. AccentSpeech: learning accent from crowd-sourced data for target speaker TTS with accents
CN112185342A (zh) 语音转换与模型训练方法、装置和系统及存储介质
Li et al. Frame-level specaugment for deep convolutional neural networks in hybrid ASR systems
JPWO2007037359A1 (ja) 音声符号化装置および音声符号化方法
CN113436607A (zh) 一种快速语音克隆方法
Li et al. A Two-stage Approach to Quality Restoration of Bone-conducted Speech
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
Nikitaras et al. Fine-grained noise control for multispeaker speech synthesis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19957236

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19957236

Country of ref document: EP

Kind code of ref document: A1