WO2022121180A1 - Model training method and apparatus, speech conversion method, device and storage medium - Google Patents
Model training method and apparatus, speech conversion method, device and storage medium
- Publication number
- WO2022121180A1 (international application PCT/CN2021/084219)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- network
- mel spectrum
- label
- output
- Prior art date
Classifications (all under G10L — speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding)
- G10L25/24 — speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L21/013 — changing voice quality, e.g. pitch or formants: adapting to target pitch
- G10L25/30 — speech or voice analysis techniques characterised by the analysis technique, using neural networks
- G10L2021/0135 — voice conversion or morphing
Definitions
- the present application relates to the field of speech processing, and in particular to a training method and apparatus for a speech conversion model, a speech conversion method, a computer device and a storage medium.
- the present application provides a training method and apparatus for a speech conversion model, a speech conversion method, a device and a storage medium, so as to lower the audio-corpus requirements for building the model and reduce the complexity of model construction.
- the present application provides a method for training a speech conversion model, the method comprising:
- acquire sample audio and convert the sample audio into a sample Mel spectrum, where the sample audio includes unlabeled audio and labeled audio; collect noise audio, and jointly input the noise audio and the sample Mel spectrum into a generation network to obtain an output Mel spectrum, where the noise audio is unlabeled audio; input the output Mel spectrum into a discriminant network to obtain the type probability of the output Mel spectrum and the label of the output Mel spectrum; perform alternate iterative training on the generation network and the discriminant network according to the type probability of the output Mel spectrum and the label of the output Mel spectrum, and use the trained generation network as a speech conversion model to complete the model training.
- the present application provides a voice conversion method, the method comprising:
- the present application also provides a training device for a speech conversion model, the device comprising:
- a sample acquisition module is used to acquire sample audio, and convert the sample audio into a sample Mel spectrum, where the sample audio includes unlabeled audio and labeled audio;
- a noise acquisition module is used to collect noise audio, and to jointly input the noise audio and the sample Mel spectrum into a generation network to obtain an output Mel spectrum, where the noise audio is unlabeled audio;
- a discriminant output module is used to input the output Mel spectrum into the discriminant network to obtain the type probability of the output Mel spectrum and the label of the output Mel spectrum; a model training module is used to perform alternate iterative training on the generation network and the discriminant network according to the type probability of the output Mel spectrum and the label of the output Mel spectrum, and to use the trained generation network as a speech conversion model to complete the model training.
- the present application also provides a voice conversion device, the device comprising:
- the data acquisition module is used to obtain the audio data to be converted and the user's target conversion label; the audio conversion module is used to input the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data; wherein the pre-trained speech conversion model is a generation network trained by the above-mentioned training method of the speech conversion model.
- the present application also provides a computer device, the computer device comprising a memory and a processor; the memory is used to store computer-readable instructions; the processor is used to execute the computer-readable instructions and, when executing them, to implement the following steps:
- acquire sample audio and convert the sample audio into a sample Mel spectrum, where the sample audio includes unlabeled audio and labeled audio; collect noise audio, and jointly input the noise audio and the sample Mel spectrum into a generation network to obtain an output Mel spectrum, where the noise audio is unlabeled audio; input the output Mel spectrum into a discriminant network to obtain the type probability of the output Mel spectrum and the label of the output Mel spectrum; perform alternate iterative training on the generation network and the discriminant network according to the type probability of the output Mel spectrum and the label of the output Mel spectrum, and use the trained generation network as a speech conversion model to complete the model training.
- the present application also provides a computer device, the computer device comprising a memory and a processor; the memory is used to store computer-readable instructions; the processor is used to execute the computer-readable instructions and, when executing them, to implement the following steps:
- the present application also provides a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the processor implements the following steps:
- acquire sample audio and convert the sample audio into a sample Mel spectrum, where the sample audio includes unlabeled audio and labeled audio; collect noise audio, and jointly input the noise audio and the sample Mel spectrum into a generation network to obtain an output Mel spectrum, where the noise audio is unlabeled audio; input the output Mel spectrum into a discriminant network to obtain the type probability of the output Mel spectrum and the label of the output Mel spectrum; perform alternate iterative training on the generation network and the discriminant network according to the type probability of the output Mel spectrum and the label of the output Mel spectrum, and use the trained generation network as a speech conversion model to complete the model training.
- the present application also provides a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the processor implements the following steps:
- the present application discloses a training method, device, speech conversion method, device and storage medium for a speech conversion model.
- acquire sample audio including labeled audio and unlabeled audio and convert it into a sample Mel spectrum; then collect noise audio and input the noise audio and the sample Mel spectrum together into the generation network to obtain the output Mel spectrum; then input the output Mel spectrum into the discriminant network to obtain the type probability and label of the output Mel spectrum; finally, perform alternate iterative training on the generation network and the discriminant network according to the type probability and label of the output Mel spectrum, and use the trained generation network as a speech conversion model to complete the model training.
- FIG. 1 is a schematic flowchart of a training method of a speech conversion model provided by an embodiment of the present application
- FIG. 2 is a schematic flowchart of a voice conversion method provided by an embodiment of the present application.
- FIG. 3 is a schematic block diagram of an apparatus for training a speech conversion model provided by an embodiment of the present application
- FIG. 4 is a schematic block diagram of a voice conversion apparatus provided by an embodiment of the present application.
- FIG. 5 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
- Embodiments of the present application provide a training method, apparatus, voice conversion method, device, and storage medium for a speech conversion model.
- the training method can train the speech conversion model based on a generative adversarial network. By training the discriminant network so that it outputs the label of the input Mel spectrum, the model can be trained with only a small amount of labeled audio, which reduces the difficulty of obtaining sample audio, lowers the audio-corpus requirements for training the speech conversion model, and reduces the complexity of model construction.
- FIG. 1 is a schematic flowchart of a training method for a speech conversion model provided by an embodiment of the present application.
- the training method of the speech conversion model performs alternate iterative training on the generation network and the discriminant network, and uses the trained generation network as the speech conversion model.
- the training method of the speech conversion model specifically includes steps S101 to S104.
- the sample audio includes unlabeled audio and labeled audio, where labeled audio refers to audio with a definite label.
- for example, the labels corresponding to the audio may be man, woman, little girl, little boy, and so on; audio with a definite label is called labeled audio.
- unlabeled audio means that the audio itself has no corresponding label; for such audio, the label is set to unknown. That is, unlabeled audio refers to audio whose label is unknown, indicating that the audio has no definite label.
- the sample audio can be obtained in various ways, for example, a web crawler can be used to obtain the sample audio from the network, and so on.
- the sample audio is converted into a sample Mel spectrum using a Mel filter, and each sample Mel spectrum carries the label of its corresponding audio.
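As an illustration of the Mel-filter conversion described above (the patent does not specify filterbank parameters, so the FFT size, sample rate, and 80-band layout below are assumptions), a minimal numpy sketch of mapping a linear power spectrogram to a log-Mel spectrum:

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, sr=22050):
    """Build a triangular mel filterbank of shape (n_mels, n_fft//2 + 1)."""
    n_bins = n_fft // 2 + 1
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_bins))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope of triangle i
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope of triangle i
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def to_mel_spectrum(power_spectrogram, fb):
    """Map a linear power spectrogram (frames x bins) to a log-mel spectrum."""
    return np.log(power_spectrogram @ fb.T + 1e-6)

# Usage: 100 STFT frames of a 1024-point FFT.
spec = np.abs(np.random.randn(100, 513)) ** 2
mel = to_mel_spectrum(spec, mel_filterbank())
print(mel.shape)  # (100, 80)
```

The label of each source audio clip would simply be carried alongside its Mel matrix.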
- S102 Collect noise audio, and jointly input the noise audio and the sample Mel spectrum into a generation network to obtain an output Mel spectrum, where the noise audio is unlabeled audio.
- the generation network is used to generate the noise mel spectrum corresponding to the noise audio according to the collected noise audio.
- the structure of the generation network may include a preprocessing layer, a downsampling layer, a bottleneck layer and an upsampling layer.
- the preprocessing layer consists of convolutional layers, batch normalization layers and nonlinear affine transformation layers;
- the downsampling layer consists of several convolutional layers and batch normalization layers;
- the bottleneck layer consists of convolutional layers with residual connections;
- the upsampling layer consists of dilated convolution and batch normalization layers.
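The four-stage generator structure above can be illustrated with a toy, single-channel numpy sketch; a real implementation would use multi-channel (dilated) convolutions with learned weights in a deep-learning framework, so every layer below is only a stand-in for the named stage:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """'Same'-padded 1-D convolution along the time axis, single channel."""
    pad = len(w) // 2
    xp = np.pad(x, pad)
    return np.array([xp[t:t + len(w)] @ w for t in range(len(x))])

def batch_norm(x, eps=1e-5):
    """Normalize to zero mean, unit variance (no learned scale/shift here)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def generator(noise, mel, k=5):
    """Toy stand-in for the four stages of the generation network."""
    w = rng.standard_normal(k) / k
    x = np.tanh(batch_norm(conv1d(noise + mel, w)))  # preprocessing: conv + BN + nonlinearity
    x = batch_norm(conv1d(x, w))[::2]                # downsampling: conv + BN, stride 2
    x = x + batch_norm(conv1d(x, w))                 # bottleneck: residual convolution
    x = np.repeat(x, 2)                              # upsampling back to input length
    x = batch_norm(conv1d(x, w))                     # (dilated conv approximated by plain conv)
    return x

T = 64
out = generator(rng.standard_normal(T), rng.standard_normal(T))
print(out.shape)  # (64,)
```

The point of the sketch is the shape flow: the input is compressed by the downsampling stage, transformed in the bottleneck, and restored to the original frame count by the upsampling stage.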
- a noise audio is randomly collected, where the collected noise audio needs to obey a prior probability distribution, which can be a uniform distribution or a Gaussian distribution. Then the label of the collected noise audio is set to unknown, the noise audio and the sample Mel spectrum are jointly input into the generation network, and the noise audio is processed by the generation network to obtain the output Mel spectrum.
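Sampling the noise audio from the prior and marking it unlabeled might look like the following sketch (the frame and feature dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_noise(n_frames, n_mels=80, prior="gaussian"):
    """Draw noise features from a prior distribution; the label is 'unknown'."""
    if prior == "gaussian":
        z = rng.standard_normal((n_frames, n_mels))
    elif prior == "uniform":
        z = rng.uniform(-1.0, 1.0, size=(n_frames, n_mels))
    else:
        raise ValueError(f"unsupported prior: {prior}")
    return z, "unknown"

z, label = sample_noise(100, prior="uniform")
print(z.shape, label)  # (100, 80) unknown
```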
- the obtained output mel spectrum includes both the sample mel spectrum corresponding to the sample audio and the noise mel spectrum corresponding to the noise audio.
- the types of the output mel spectrum include sample mel spectrum and noise mel spectrum, and the type probability of the output mel spectrum specifically refers to the probability that the output mel spectrum is the sample mel spectrum.
- the discriminant network is used to judge the probability that the input output Mel spectrum is the sample Mel spectrum, and to determine the predicted label corresponding to the output Mel spectrum.
- the backbone network of the discriminant network can be composed of several nonlinear affine transformations and convolutional layers; the last layer is a linear mapping for binary classification and multi-classification. The outputs of the discriminant network are the probability that the input output Mel spectrum is a sample Mel spectrum, and the predicted label of the output Mel spectrum.
- the output Mel spectrum produced by the generation network is used as the input of the discriminant network, and the discriminant network predicts the probability that the output Mel spectrum is a sample Mel spectrum and the predicted label of the output Mel spectrum.
- the generation network and the discriminant network are alternately and iteratively trained; once both are trained, the discriminant network is no longer used, and the trained generation network is used as the speech conversion model to complete the training of the speech conversion model.
- the discriminant network is optimized first. At the beginning of the training, the discriminant network can easily distinguish the noise mel-spectrum and the sample mel-spectrum from the output mel-spectrum.
- at the start of training, the noise Mel spectrum generated by the generation network from the noise audio deviates greatly from the sample Mel spectrum.
- then the generation network is optimized so that its loss function gradually decreases. In this process, the binary classification ability of the discriminant network gradually improves, and so does its discrimination accuracy for the output Mel spectrum produced by the generation network.
- the generative network tries to generate noise Mel spectra as close to the real data as possible to deceive the discriminant network, while the discriminant network tries to distinguish the sample Mel spectrum from the noise Mel spectrum generated by the generative network, so that the generative network and the discriminant network form a dynamic game process.
- the discriminant network cannot determine whether the output Mel spectrum is a sample Mel spectrum or a noise Mel spectrum, it means that the generation network has been trained, and the trained generation network is used as a speech conversion model.
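The alternating optimization and its stopping condition (the discriminator can no longer tell the two kinds of spectra apart) can be sketched as a training-loop skeleton; the update functions here are stubs, and the decaying-accuracy stand-in only simulates the discriminator gradually losing its edge:

```python
def train_gan(d_step, g_step, d_accuracy, max_iters=1000, threshold=0.5):
    """Alternate discriminator/generator updates until the discriminator's
    accuracy on generated spectra is indistinguishable from chance."""
    for it in range(max_iters):
        d_step()                                   # optimize the discriminant network first
        g_step()                                   # then optimize the generation network
        if abs(d_accuracy() - threshold) < 0.01:   # D outputs ~0.5: training done
            return it + 1
    return max_iters

# Toy stand-ins: each generator update pulls D's accuracy toward 0.5.
state = {"acc": 0.95, "d_updates": 0, "g_updates": 0}

def d_step():
    state["d_updates"] += 1

def g_step():
    state["g_updates"] += 1
    state["acc"] = 0.5 + (state["acc"] - 0.5) * 0.9

iters = train_gan(d_step, g_step, lambda: state["acc"])
print(iters, state["d_updates"] == state["g_updates"])  # 37 True
```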
- the method further includes: when the accuracy of the predicted label of the output Mel spectrum output by the discrimination network reaches a preset value, inputting the sample Mel spectrum of the unlabeled audio into the The discriminant network uses the obtained predicted label as the label of the unlabeled audio.
- the resulting output Mel spectrum also carries the label of its corresponding audio.
- the sample mel spectrum of unlabeled audio is input into the discriminant network, so that the discriminant network predicts the label corresponding to the sample mel spectrum of unlabeled audio, and the predicted predicted label is used as the label of unlabeled audio.
- the unlabeled audio becomes labeled audio according to the predicted label, and its label is the predicted label.
- the unlabeled audio, now carrying its predicted label, can rejoin the training of the discriminant network, so that the discriminant network can learn label classification even when few labeled samples are available.
- the method includes: adjusting the speech rate of the sample audio to obtain speed-regulated sample audio, and converting the speed-regulated sample audio into a speed-regulated Mel spectrum; the discriminant network is then trained on the speed-regulated Mel spectrum so that it outputs the speech rate corresponding to the speed-regulated Mel spectrum.
- the speech rate of the sample audio is adjusted to obtain the speed-regulated sample audio; for example, it can be adjusted to 0.9 times, 1.0 times, and 1.1 times the original rate. Then the speed-regulated sample audio is converted into a speed-regulated Mel spectrum using a Mel filter, and the discriminant network is trained with the speed-regulated Mel spectrum so that it outputs the corresponding speech rate.
- the discriminant network can recognize the speech rate, which can improve the training stability of the adversarial generation network and reduce the training error caused by different speech rates in the sample audio.
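A simple way to realize a 0.9x/1.0x/1.1x speech-rate adjustment is to resample the waveform by linear interpolation (a deliberately naive sketch: unlike a phase-vocoder time-stretch, plain resampling also shifts the pitch):

```python
import numpy as np

def adjust_speech_rate(audio, rate):
    """Resample a 1-D signal by linear interpolation.
    rate > 1.0 speeds speech up (fewer samples); rate < 1.0 slows it down."""
    n_out = int(round(len(audio) / rate))
    query = np.linspace(0.0, len(audio) - 1.0, n_out)
    return np.interp(query, np.arange(len(audio)), audio)

audio = np.sin(np.linspace(0, 20 * np.pi, 1000))
for rate in (0.9, 1.0, 1.1):
    print(rate, len(adjust_speech_rate(audio, rate)))
```

Each speed-regulated signal would then be passed through the same Mel filter as the original sample audio, with the applied rate used as the training target for the discriminant network's speech-rate output.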
- performing alternate iterative training on the generation network and the discriminant network according to the type probability of the output Mel spectrum includes: calculating the value of the type loss function of the generation network and the value of the type loss function of the discriminant network according to the type probability of the output Mel spectrum; performing alternate iterative training on the generation network and the discriminant network according to these two values; and completing the training of the generation network when the type probability output by the discriminant network reaches a preset value.
- the value of the type loss function of the generation network and the value of the type loss function of the discriminant network are calculated; then, according to these values, the network parameters of the generation network and the discriminant network are adjusted and the two networks are iteratively trained, gradually reducing the value of the type loss function of the generation network while the binary classification ability of the discriminant network improves, thereby ensuring that the noise Mel spectrum generated by the generation network from the noise audio is similar to the sample Mel spectrum.
- the preset value may be 0.5.
- when the preset value is 0.5 and the type probability output by the discriminant network reaches it, the discriminant network cannot judge whether the Mel spectrum generated by the generation network is a noise Mel spectrum or a sample Mel spectrum, indicating that the generation network has been trained.
- the formula for the type loss function of the generative network can be as follows:
- L_G1 = -E_{x~p(x), c~p(c)}[log(D(G(x,c), c))]
- the formula of the type loss function of the discriminative network can be as follows:
- L_D1 = -E_{(y,c)~p(y,c)}[log(D(y,c))] - E_{x~p(x), c~p(c)}[log(1 - D(G(x,c), c))]
- L G1 represents the type loss function of the generation network
- L D1 represents the type loss function of the discriminant network
- D(G(x,c), c) represents the probability that the discriminant network judges the Mel spectrum G(x,c), generated from the noise input x with label c, to be a sample Mel spectrum
- D(y,c) represents the probability that the discriminant network judges the sample Mel spectrum y with label c to be a sample Mel spectrum.
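The two type-loss formulas can be evaluated directly from the discriminant network's probability outputs; a numpy sketch with made-up scores:

```python
import numpy as np

def type_loss_g(d_fake):
    """L_G1 = -E[log D(G(x,c), c)]: the generator wants D to score fakes near 1."""
    return -np.mean(np.log(d_fake + 1e-12))

def type_loss_d(d_real, d_fake):
    """L_D1 = -E[log D(y,c)] - E[log(1 - D(G(x,c), c))]."""
    return (-np.mean(np.log(d_real + 1e-12))
            - np.mean(np.log(1.0 - d_fake + 1e-12)))

d_real = np.array([0.9, 0.8])  # D's scores on sample Mel spectra
d_fake = np.array([0.1, 0.2])  # D's scores on generated Mel spectra
print(round(type_loss_d(d_real, d_fake), 4))  # 0.3285
print(round(type_loss_g(d_fake), 4))          # 1.956
```

As the generator improves, d_fake rises toward 0.5, driving L_G1 down; at equilibrium D cannot separate the two populations.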
- performing alternate iterative training on the generation network and the discriminant network according to the type probability and predicted label of the output Mel spectrum includes: if it is determined according to the type probability of the output Mel spectrum that the audio corresponding to the output Mel spectrum is sample audio, and the predicted label of the output Mel spectrum differs from the label of the sample audio, the error is included in the label loss function of the discriminant network; if it is determined according to the type probability of the output Mel spectrum that the audio corresponding to the output Mel spectrum is noise audio, and the predicted label of the output Mel spectrum differs from the label of the noise audio, the error is included in the label loss function of the generation network; the generation network is then iteratively trained according to its label loss function, and the discriminant network is iteratively trained according to its label loss function.
- the label loss functions of the generation network and the discriminant network are determined according to the predicted label, so as to optimize both networks and enable the generation network to generate audio with a specific label.
- the formula for the label loss function of the generative network can be as follows:
- L_G2 = -E_{x~p(x), c~p(c)}[log p(c | G(x,c))]
- the formula of the label loss function of the discriminant network can be as follows:
- L_D2 = -E_{(y,c)~p(y,c)}[log p(c | y)]
- L G2 represents the label loss function of the generation network
- L D2 represents the label loss function of the discriminant network
- p(c | G(x,c)) represents the probability that the discriminant network predicts the label c for the Mel spectrum G(x,c) generated from the noise input x with label c
- p(c | y) represents the probability that the discriminant network predicts the label c for the sample Mel spectrum y.
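The label losses are negative log-likelihoods of the intended label under the discriminant network's predicted label distribution; a numpy sketch with hypothetical probabilities:

```python
import numpy as np

def label_loss(pred_probs, true_labels):
    """Negative log-likelihood of the true label under the discriminator's
    predicted label distribution (each row of pred_probs sums to 1)."""
    rows = np.arange(len(true_labels))
    return -np.mean(np.log(pred_probs[rows, true_labels] + 1e-12))

# Two spectra, three candidate labels (e.g. man / woman / little girl).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(round(label_loss(probs, labels), 4))  # 0.2899
```

Applied to sample Mel spectra this quantity would be charged to the discriminant network (L_D2); applied to generated spectra and their intended labels it would be charged to the generation network (L_G2).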
- in the training method of the speech conversion model provided by the above embodiment, sample audio including labeled audio and unlabeled audio is acquired and converted into a sample Mel spectrum; noise audio is then collected and input, together with the sample Mel spectrum, into the generation network to obtain the output Mel spectrum; the output Mel spectrum is input into the discriminant network to obtain its type probability and label; finally, the generation network and the discriminant network are alternately and iteratively trained according to the type probability and label of the output Mel spectrum, and the trained generation network is used as the speech conversion model to complete the model training.
- FIG. 2 is a schematic flowchart of a voice conversion method provided by an embodiment of the present application.
- the voice conversion method includes steps S201 to S202.
- the audio to be converted refers to the audio that the user needs to convert
- the target conversion tag refers to the tag when the audio to be converted is converted.
- the audio to be converted is audio of a woman's voice
- the target conversion tag is a girl.
- the pre-trained speech conversion model is a generation network obtained by training using any one of the speech conversion model training methods provided in the above embodiments.
- the purpose of voice conversion is achieved, and the user experience is improved.
- FIG. 3 is a schematic block diagram of an apparatus for training a speech conversion model, which is further provided by an embodiment of the present application.
- the apparatus for training a speech conversion model is used to execute the aforementioned training method for a speech conversion model.
- the training device of the speech conversion model may be configured in a server or a terminal.
- the server may be an independent server or a server cluster.
- the terminal may be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device.
- the apparatus 300 for training a speech conversion model includes: a sample acquisition module 301 , a noise acquisition module 302 , a discrimination output module 303 and a model training module 304 .
- the sample acquisition module 301 is configured to acquire sample audio, convert the sample audio into sample mel spectrum, and the sample audio includes unlabeled audio and labeled audio.
- the noise collection module 302 is configured to collect noise audio, and jointly input the noise audio and the sample mel spectrum into a generation network to obtain an output mel spectrum, where the noise audio is unlabeled audio.
- the discriminant output module 303 is configured to input the output mel spectrum into a discriminant network to obtain the type probability of the output mel spectrum and the label of the output mel spectrum.
- a model training module 304 configured to perform alternate iterative training on the generation network and the discriminant network according to the type probability of the output mel spectrum and the label of the output mel spectrum, and use the trained generation network as a speech Convert the model and complete the model training.
- FIG. 4 is a schematic block diagram of a voice conversion apparatus further provided by an embodiment of the present application, and the voice conversion apparatus is configured to execute the aforementioned voice conversion method.
- the voice conversion device may be configured in a server or a terminal.
- the server may be an independent server or a server cluster.
- the terminal may be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device.
- the voice conversion apparatus 400 includes: a data acquisition module 401 and an audio conversion module 402 .
- the data acquisition module 401 is used to acquire the audio data to be converted and the target conversion label of the user;
- the audio conversion module 402 is used to input the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data; wherein the pre-trained speech conversion model is a generation network trained by the above-mentioned training method of the speech conversion model.
- Both the above-mentioned training apparatus for the speech conversion model and the speech conversion apparatus may be implemented in the form of computer-readable instructions, and the computer-readable instructions may be executed on the computer equipment as shown in FIG. 5 .
- FIG. 5 is a schematic structural block diagram of a computer device provided by an embodiment of the present application.
- the computer device can be a server or a terminal.
- the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a computer-readable storage medium and an internal memory.
- the computer-readable storage medium can be non-volatile or volatile, and the computer-readable storage medium can store an operating system and computer-readable instructions.
- the computer-readable instructions when executed, can cause the processor to execute any method of training a speech conversion model and a speech conversion method.
- the processor is used to provide computing and control capabilities to support the operation of the entire computer equipment.
- the internal memory provides an environment for running the computer-readable instructions in the computer-readable storage medium.
- the processor can execute any method for training a speech conversion model and a speech conversion method.
- the network interface is used for network communication, such as sending assigned tasks.
- FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer equipment to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
- the processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
- the general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
- the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
- acquiring sample audio and converting the sample audio into sample mel spectra, the sample audio including unlabeled audio and labeled audio; collecting noise audio, which is unlabeled, and inputting the noise audio together with the sample mel spectra into a generative network to obtain output mel spectra; inputting the output mel spectra into a discriminant network to obtain a type probability and a predicted label for each output mel spectrum; and
- alternately and iteratively training the generative network and the discriminant network according to the type probabilities and predicted labels of the output mel spectra, and taking the trained generative network as the speech conversion model to complete model training.
- the processor is further configured to:
- the processor is configured to implement:
- when implementing the alternate iterative training of the generative network and the discriminant network according to the type probabilities of the output mel spectra, the processor is configured to implement:
- computing the value of the type loss function of the generative network and the value of the type loss function of the discriminant network from the type probabilities of the output mel spectra; alternately and iteratively training the generative network and the discriminant network according to those two values; and completing the training of the generative network when the type probability output by the discriminant network reaches a preset value.
- when implementing the alternate iterative training of the generative network and the discriminant network according to the type probabilities and predicted labels of the output mel spectra, the processor is configured to implement:
- if the type probability of an output mel spectrum indicates that its corresponding audio is sample audio and its predicted label differs from the label of that sample audio, counting the error into the label loss function of the discriminant network; if the type probability indicates that the corresponding audio is noise audio and the predicted label differs from the label of that noise audio, counting the error into the label loss function of the generative network; and iteratively training the generative network with its label loss function and the discriminant network with its label loss function.
- the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps: acquiring the user's audio data to be converted and a target conversion label; and inputting the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data.
- the embodiments of the present application further provide a computer-readable storage medium storing computer-readable instructions; a processor executes the computer-readable instructions to implement any of the training methods for a speech conversion model and speech conversion methods provided in the embodiments of the present application.
- the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
- the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk equipped on the computer device, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Telephonic Communication Services (AREA)
- Machine Translation (AREA)
Abstract
A training method for a speech conversion model and a speech conversion method. The training method includes: acquiring sample audio and converting the sample audio into sample mel spectra (S101); collecting noise audio and inputting the noise audio together with the sample mel spectra into a generative network to obtain output mel spectra (S102); inputting the output mel spectra into a discriminant network to obtain a type probability and a label for each output mel spectrum (S103); and alternately and iteratively training the generative network and the discriminant network according to the type probabilities and labels of the output mel spectra, and taking the trained generative network as the speech conversion model (S104), so as to lower the audio-corpus requirements for building the model and reduce the complexity of model construction.
Description
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on December 11, 2020, with application number 202011446585.1 and the title "Model training method and apparatus, speech conversion method, device, and storage medium", the entire contents of which are incorporated herein by reference.
This application relates to the field of speech processing, and in particular to a training method and apparatus for a speech conversion model, a speech conversion method, a device, and a storage medium.
With the development of speech conversion technology, its applications have broadened steadily; for example, it can be used to dub film and television works, or to generate varied synthesis results in speech synthesis. The inventors realized that most existing speech conversion systems perform conversion with generative adversarial networks, in which every audio corpus item must carry a corresponding label; in multi-speaker conversion, a speaker label must be attached to every audio clip, which makes model construction highly complex.
How to lower the audio-corpus requirements for building the model and reduce the complexity of model construction has therefore become a pressing problem.
This application provides a training method and apparatus for a speech conversion model, a speech conversion method, a device, and a storage medium, so as to lower the audio-corpus requirements for building the model and reduce the complexity of model construction.
In a first aspect, this application provides a training method for a speech conversion model, the method comprising:
acquiring sample audio and converting the sample audio into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio; collecting noise audio, which is unlabeled, and inputting the noise audio together with the sample mel spectra into a generative network to obtain output mel spectra; inputting the output mel spectra into a discriminant network to obtain a type probability and a label for each output mel spectrum; and alternately and iteratively training the generative network and the discriminant network according to the type probabilities and labels of the output mel spectra, and taking the trained generative network as the speech conversion model, completing model training.
In a second aspect, this application provides a speech conversion method, the method comprising:
acquiring a user's audio data to be converted and a target conversion label; inputting the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data, wherein the pre-trained speech conversion model is a generative network trained with the above training method for a speech conversion model.
In a third aspect, this application further provides a training apparatus for a speech conversion model, the apparatus comprising:
a sample acquisition module for acquiring sample audio and converting it into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio; a noise collection module for collecting noise audio, which is unlabeled, and inputting it together with the sample mel spectra into a generative network to obtain output mel spectra; a discriminant output module for inputting the output mel spectra into a discriminant network to obtain a type probability and a label for each output mel spectrum; and a model training module for alternately and iteratively training the generative network and the discriminant network according to the type probabilities and labels of the output mel spectra, and taking the trained generative network as the speech conversion model, completing model training.
In a fourth aspect, this application further provides a speech conversion apparatus, the apparatus comprising:
a data acquisition module for acquiring a user's audio data to be converted and a target conversion label; and an audio conversion module for inputting the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data, wherein the pre-trained speech conversion model is a generative network trained with the above training method for a speech conversion model.
In a fifth aspect, this application further provides a computer device comprising a memory and a processor; the memory is configured to store computer-readable instructions, and the processor is configured to execute the computer-readable instructions and, when doing so, implement the following steps:
acquiring sample audio and converting the sample audio into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio; collecting noise audio, which is unlabeled, and inputting the noise audio together with the sample mel spectra into a generative network to obtain output mel spectra; inputting the output mel spectra into a discriminant network to obtain a type probability and a label for each output mel spectrum; and alternately and iteratively training the generative network and the discriminant network according to the type probabilities and labels of the output mel spectra, and taking the trained generative network as the speech conversion model, completing model training.
In a sixth aspect, this application further provides a computer device comprising a memory and a processor; the memory is configured to store computer-readable instructions, and the processor is configured to execute the computer-readable instructions and, when doing so, implement the following steps:
acquiring a user's audio data to be converted and a target conversion label; and inputting the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data.
In a seventh aspect, this application further provides a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, cause the processor to implement the following steps:
acquiring sample audio and converting the sample audio into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio; collecting noise audio, which is unlabeled, and inputting the noise audio together with the sample mel spectra into a generative network to obtain output mel spectra; inputting the output mel spectra into a discriminant network to obtain a type probability and a label for each output mel spectrum; and alternately and iteratively training the generative network and the discriminant network according to the type probabilities and labels of the output mel spectra, and taking the trained generative network as the speech conversion model, completing model training.
In an eighth aspect, this application further provides a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, cause the processor to implement the following steps:
acquiring a user's audio data to be converted and a target conversion label; and inputting the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data.
This application discloses a training method and apparatus for a speech conversion model, a speech conversion method, a device, and a storage medium. Sample audio comprising labeled and unlabeled audio is acquired and converted into sample mel spectra; noise audio is collected and input together with the sample mel spectra into a generative network to obtain output mel spectra; the output mel spectra are fed into a discriminant network to obtain their type probabilities and labels; finally, the generative and discriminant networks are trained alternately and iteratively from those type probabilities and labels, and the trained generative network serves as the speech conversion model, completing model training. Because the discriminant network supplies labels for the output mel spectra, only a small amount of labeled audio is needed when training the generative and discriminant networks, which lowers the audio-corpus requirements for training the speech conversion model and reduces the complexity of model construction.
To describe the technical solutions of the embodiments more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of this application; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a training method for a speech conversion model provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of a speech conversion method provided by an embodiment of this application;
FIG. 3 is a schematic block diagram of a training apparatus for a speech conversion model provided by an embodiment of this application;
FIG. 4 is a schematic block diagram of a speech conversion apparatus provided by an embodiment of this application;
FIG. 5 is a schematic structural block diagram of a computer device provided by an embodiment of this application.
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of this application.
The flowcharts shown in the drawings are illustrative only; they need not include every item or operation/step, nor must the steps be executed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change depending on the circumstances.
It should be understood that the terms used in this specification are for describing particular embodiments only and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" used in this specification and the appended claims refers to and encompasses any and all possible combinations of one or more of the associated listed items.
The embodiments of this application provide a training method and apparatus for a speech conversion model, a speech conversion method, a device, and a storage medium. The training method can train the speech conversion model on a generative adversarial network; by training the discriminant network so that it can output labels for the input mel spectra, only a small amount of labeled audio is needed for training, which eases the collection of sample audio, lowers the audio-corpus requirements for training the speech conversion model, and reduces the complexity of model construction.
Some embodiments of this application are described in detail below with reference to the drawings. The embodiments below, and the features within them, may be combined with one another where no conflict arises.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a training method for a speech conversion model provided by an embodiment of this application. The training method alternately and iteratively trains a generative network and a discriminant network, and takes the trained generative network as the speech conversion model.
As shown in FIG. 1, the training method specifically comprises steps S101 to S104.
S101: Acquire sample audio and convert the sample audio into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio.
The sample audio comprises unlabeled audio and labeled audio. Labeled audio is audio with a definite label; for example, the label corresponding to an audio clip may be man, woman, girl, boy, and so on, and audio carrying such a definite label is called labeled audio.
Unlabeled audio is audio that has no corresponding label of its own; the label of such audio is set to "unknown". In other words, unlabeled audio is audio whose label is "unknown", indicating that the audio has no definite label.
Sample audio can be obtained in many ways, for example by crawling it from the Internet with a web crawler. Each acquired sample audio clip is converted into a sample mel spectrum with a mel filterbank, and every sample mel spectrum carries the label of its corresponding audio.
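The mel conversion step above can be sketched end-to-end in plain NumPy. This is an illustrative implementation only, not the patent's: the 16 kHz sample rate, 1024-point FFT, 256-sample hop, and 80 mel bands are common defaults assumed here, and production code would typically use a library such as librosa or torchaudio instead.

```python
import numpy as np

def hz_to_mel(f):
    # Hz -> mel scale (O'Shaughnessy formula)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def mel_spectrogram(wave, sr=16000, n_fft=1024, hop=256, n_mels=80):
    # Framed FFT (short-time Fourier transform), then mel projection
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2    # (frames, bins)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T    # (frames, mels)
    return np.log(mel + 1e-10)                           # log-mel spectrum

# Example: one second of a 440 Hz tone as a stand-in for sample audio
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = mel_spectrogram(wave)
print(spec.shape)  # (n_frames, n_mels)
```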
S102: Collect noise audio and input the noise audio together with the sample mel spectra into a generative network to obtain output mel spectra, the noise audio being unlabeled.
The generative network generates, from the collected noise audio, the noise mel spectrum corresponding to that audio. In a specific implementation, the structure of the generative network may comprise a preprocessing layer, a downsampling layer, a bottleneck layer, and an upsampling layer.
The preprocessing layer consists of a convolutional layer, a batch normalization layer, and a nonlinear affine transformation layer; the downsampling layer consists of several convolutional and batch layers; the bottleneck layer consists of convolutions with residual connections; and the upsampling layer consists of dilated convolutions and batch normalization layers.
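The layer recipe above can be illustrated with a toy NumPy forward pass. Everything here is a hedged sketch: the channel counts, kernel sizes, and tanh nonlinearity are assumptions, the "batch norm" is a per-channel normalization stand-in, and a real implementation would use a deep-learning framework with learned parameters and the full down/upsampling stack.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    # x: (channels_in, time), w: (channels_out, channels_in, k); 'same' padding
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    out = np.zeros((c_out, t))
    for o in range(c_out):
        for i in range(c_in):
            for j in range(k):
                out[o] += w[o, i, j] * xp[i, j:j + t]
    return out

def batch_norm(x, eps=1e-5):
    # Normalize each channel over time (stand-in for batch normalization)
    return (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + eps)

def generator(noise, mel, w1, w2):
    # Pre-processing: stack noise with the conditioning mel spectrum,
    # then conv -> batch norm -> nonlinearity; a residual bottleneck follows.
    h = np.concatenate([noise, mel], axis=0)       # (2, T)
    h = np.tanh(batch_norm(conv1d(h, w1)))          # conv + norm + nonlinearity
    h = h + np.tanh(batch_norm(conv1d(h, w2)))      # residual "bottleneck" block
    return h                                        # toy "output mel spectrum"

T = 32
noise = rng.normal(size=(1, T))     # noise drawn from a Gaussian prior
mel = rng.normal(size=(1, T))       # conditioning sample mel (one band here)
w1 = rng.normal(size=(4, 2, 3)) * 0.1
w2 = rng.normal(size=(4, 4, 3)) * 0.1
out = generator(noise, mel, w1, w2)
print(out.shape)
```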
A noise audio clip is sampled at random; the sampled noise must follow a prior probability distribution, such as a uniform or Gaussian distribution. Its label is then set to "unknown", and it is input as unlabeled audio, together with the sample mel spectra, into the generative network, which processes the noise audio and produces the output mel spectra.
Because the generative network's input is both noise audio and sample mel spectra, the resulting output mel spectra include both the sample mel spectra corresponding to the sample audio and the noise mel spectra corresponding to the noise audio.
S103: Input the output mel spectra into a discriminant network to obtain a type probability and a predicted label for each output mel spectrum.
The types of output mel spectra are sample mel spectra and noise mel spectra; the type probability of an output mel spectrum specifically means the probability that it is a sample mel spectrum.
The discriminant network judges the probability that an input output mel spectrum is a sample mel spectrum, and determines the predicted label corresponding to that output mel spectrum.
In a specific implementation, the backbone of the discriminant network may consist of several nonlinear affine transformations and convolutional layers, with a final layer of binary-classification and multi-classification linear mappings; the discriminant network's outputs are the probability that the input output mel spectrum is a sample mel spectrum and the predicted label of the output mel spectrum.
The output mel spectra produced by the generative network are fed into the discriminant network, which yields, for each, the predicted probability of being a sample mel spectrum and the predicted label.
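The two output heads just described can be sketched minimally as follows, under assumed dimensions (a 16-dimensional pooled feature and 4 hypothetical speaker labels); the backbone of affine and convolutional layers is abstracted into a ready-made feature vector.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def discriminator(mel_feat, w_bin, w_cls):
    # A shared feature vector feeds two linear heads:
    #  - binary head: probability the input is a sample (real) mel spectrum
    #  - multi-class head: predicted label distribution over speakers
    type_prob = sigmoid(w_bin @ mel_feat)    # scalar in (0, 1)
    label_probs = softmax(w_cls @ mel_feat)  # sums to 1
    return float(type_prob), label_probs

feat = rng.normal(size=16)               # stand-in for the backbone's features
w_bin = rng.normal(size=16) * 0.1
w_cls = rng.normal(size=(4, 16)) * 0.1   # 4 hypothetical labels
p_real, labels = discriminator(feat, w_bin, w_cls)
pred_label = int(np.argmax(labels))      # the "predicted label"
print(p_real, pred_label)
```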
S104: Alternately and iteratively train the generative network and the discriminant network according to the type probabilities and predicted labels of the output mel spectra, and take the trained generative network as the speech conversion model, completing model training.
Based on the discriminant network's predicted probability that each output mel spectrum is a sample mel spectrum, together with its predicted label, the generative and discriminant networks are trained alternately and iteratively. Once both networks are trained, the discriminant network is no longer used; the trained generative network is taken as the speech conversion model, completing its training.
With limited training data, fully optimizing the discriminant network first would cause overfitting and prevent the final model from converging; the optimization of the generative and discriminant networks must therefore alternate during training.
During alternating training, the discriminant network is optimized first. At the start of training it can easily distinguish noise mel spectra from sample mel spectra among the output mel spectra, which shows that the noise mel spectra the generative network initially produces from noise audio deviate greatly from the sample mel spectra. The generative network is then optimized so that its loss gradually decreases; in this process the discriminant network's binary-classification ability also improves, as does its accuracy in judging the generative network's outputs. The generative network tries to produce noise mel spectra as close to real data as possible in order to fool the discriminant network, while the discriminant network tries its best to separate the sample mel spectra from the generated noise mel spectra, so the two networks form a dynamic game.
Finally, when the discriminant network can no longer tell whether an output mel spectrum is a sample mel spectrum or a noise mel spectrum, the generative network is deemed trained, and the trained generative network is taken as the speech conversion model.
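The alternating schedule and the stopping rule can be shown as a skeleton loop. The update functions below are stubs that only mimic the trend (the discriminator improving, the generator closing the gap, the type probability drifting toward the preset value 0.5); real updates would be gradient steps on the loss functions given later in the text, and the 0.8 decay factor is an invented placeholder.

```python
def update_discriminator(state):
    # Stub: one optimization step on the discriminant network's loss.
    state["d_skill"] += 0.05
    return state

def update_generator(state):
    # Stub: one optimization step on the generative network's loss,
    # pulling generated spectra toward the sample distribution.
    state["gap"] *= 0.8
    return state

def type_probability(state):
    # How confidently the discriminator separates real from generated:
    # drifts toward 0.5 as the generator closes the gap.
    return 0.5 + 0.5 * state["gap"]

state = {"d_skill": 0.0, "gap": 1.0}
for step in range(200):
    state = update_discriminator(state)   # optimize D first ...
    state = update_generator(state)       # ... then G, alternately
    if abs(type_probability(state) - 0.5) < 1e-3:  # preset value: 0.5
        break                             # D can no longer tell -> G is trained
print(step, round(type_probability(state), 4))
```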
In one embodiment, the method further comprises: when the accuracy of the predicted labels that the discriminant network outputs for the output mel spectra reaches a preset value, inputting the sample mel spectra of the unlabeled audio into the discriminant network and taking the resulting predicted labels as the labels of the unlabeled audio.
Since both the noise audio and the sample audio carry labels, each resulting output mel spectrum likewise carries the label of its corresponding audio.
When the accuracy of the predicted labels output by the discriminant network reaches the preset value, the discriminant network is considered able to judge the label corresponding to a mel spectrum accurately.
Accordingly, the sample mel spectra of the unlabeled audio are input into the discriminant network, which predicts the labels corresponding to those spectra, and the predicted labels are taken as the labels of the unlabeled audio.
At this point the unlabeled audio becomes labeled audio according to the predicted label, its label being the predicted label. Once converted into labeled audio, it can rejoin the training of the discriminant network, and this cycle repeats, so that the discriminant network can predict label classes even for sample audio with few labels.
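The pseudo-labeling loop described above reduces to a simple gate: trust the discriminant network's label head only once its measured accuracy passes the preset value, then fold the predicted labels back into the labeled pool. The classifier, features, and 0.95 threshold below are all hypothetical placeholders for illustration.

```python
import numpy as np

def pseudo_label(unlabeled_feats, classify, accuracy, threshold=0.95):
    # Only trust the label head once its accuracy on held-out labeled
    # data reaches the preset value; otherwise keep the audio unlabeled.
    if accuracy < threshold:
        return []                          # model not trusted yet
    return [classify(f) for f in unlabeled_feats]

# Toy classifier: label = sign of the feature mean (hypothetical rule)
classify = lambda f: int(np.mean(f) > 0)
feats = [np.array([0.2, 0.4]), np.array([-0.3, -0.1])]
print(pseudo_label(feats, classify, accuracy=0.90))  # not trusted -> []
print(pseudo_label(feats, classify, accuracy=0.97))  # labels assigned
```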
In one embodiment, the method comprises: adjusting the speaking rate of the sample audio to obtain rate-adjusted sample audio and converting the rate-adjusted sample audio into rate-adjusted mel spectra; and training the discriminant network on the rate-adjusted mel spectra so that it outputs the speaking rate corresponding to each rate-adjusted mel spectrum.
The speaking rate of the sample audio is adjusted, for example to 0.9x, 1.0x, and 1.1x speed, to obtain rate-adjusted sample audio, which is then converted into rate-adjusted mel spectra with a mel filterbank. The discriminant network is trained on the rate-adjusted mel spectra so that it outputs the corresponding speaking rate.
Training the discriminant network to recognize speaking rate improves the training stability of the generative adversarial network and reduces training errors caused by differing speaking rates in the sample audio.
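A naive way to produce the 0.9x/1.0x/1.1x variants is resampling with linear interpolation, sketched below. Note the hedge in the comments: this also shifts pitch, so a production pipeline would more likely use WSOLA or a phase vocoder (e.g. a library time-stretch routine) to change tempo while preserving pitch.

```python
import numpy as np

def stretch(wave, rate):
    # Naive speed change by resampling with linear interpolation.
    # Caveat: this also shifts pitch; pitch-preserving tempo change
    # needs a phase vocoder or WSOLA instead.
    n_out = int(len(wave) / rate)
    src = np.linspace(0, len(wave) - 1, n_out)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, len(wave) - 1)
    frac = src - lo
    return wave[lo] * (1 - frac) + wave[hi] * frac

wave = np.sin(2 * np.pi * np.arange(1000) / 100.0)
for rate in (0.9, 1.0, 1.1):   # the three speeds used as training variants
    print(rate, len(stretch(wave, rate)))
```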
In one embodiment, alternately and iteratively training the generative and discriminant networks according to the type probabilities of the output mel spectra comprises: computing the value of the generative network's type loss function and the value of the discriminant network's type loss function from the type probabilities of the output mel spectra; alternately and iteratively training the generative network and the discriminant network according to those two values; and completing the training of the generative network when the type probability output by the discriminant network reaches a preset value.
The values of the generative network's and the discriminant network's type loss functions are computed from the type probabilities output by the discriminant network; the network parameters of the two networks are then adjusted according to those values, the networks are trained iteratively, and the generative network's type loss gradually decreases.
Setting a preset value fixes the binary-classification ability required of the discriminant network, thereby ensuring that the noise mel spectra the generative network produces from noise audio resemble the sample mel spectra. The preset value may be 0.5: when it is reached, the discriminant network can no longer tell whether a generated mel spectrum is a noise mel spectrum or a sample mel spectrum, indicating that the generative network is fully trained.
It should be noted that when the type probability output by the discriminant network reaches the preset value, the values of the loss functions of both networks approach stability.
For example, the generative network's type loss function may be written as:
L_G1 = -E_{x~p(x), c~p(c)}[ log D(G(x, c), c) ]
and the discriminant network's type loss function as:
L_D1 = -E_{(y, c)~p(y, c)}[ log D(y, c) ] - E_{x~p(x), c~p(c)}[ log(1 - D(G(x, c), c)) ]
where L_G1 denotes the generative network's type loss function and L_D1 the discriminant network's type loss function; D(G(x, c), c) is the probability that the discriminant network judges the mel spectrum generated from noise x with label c to be a sample mel spectrum, and D(y, c) is the probability that it judges the sample mel spectrum y with label c to be a sample mel spectrum.
In one embodiment, alternately and iteratively training the generative and discriminant networks according to the type probabilities and predicted labels of the output mel spectra comprises: if the type probability of an output mel spectrum indicates that its corresponding audio is sample audio and its predicted label differs from the label of that sample audio, counting the error into the discriminant network's label loss function; if the type probability indicates that the corresponding audio is noise audio and the predicted label differs from the label of that noise audio, counting the error into the generative network's label loss function; and iteratively training the generative network with its label loss function and the discriminant network with its label loss function.
Because the discriminant network's output also includes the predicted labels of the output mel spectra, the label loss functions of the generative and discriminant networks are determined from those predicted labels; the two networks are then optimized so that the generative network can generate audio with a specified label.
While the discriminant network predicts labels for the output mel spectra, if its prediction for a sample mel spectrum differs from that spectrum's label, the prediction is deemed wrong and the error is counted into the discriminant network's label loss function.
If its prediction for a noise mel spectrum differs from that spectrum's label, the prediction is deemed wrong and the error is counted into the generative network's label loss function.
For example, the generative network's label loss function may be written as:
L_G2 = -E_{x~p(x), c~p(c)}[ log p(c | G(x, c)) ]
and the discriminant network's label loss function as:
L_D2 = -E_{(y, c)~p(y, c)}[ log p(c | y) ]
where L_G2 denotes the generative network's label loss function and L_D2 the discriminant network's label loss function; p(c | G(x, c)) is the probability the discriminant network assigns to label c for the mel spectrum generated from noise x with label c, and p(c | y) is the probability it assigns to label c for the sample mel spectrum y.
After the values of the two label loss functions are computed from these formulas, the generative and discriminant networks are trained alternately and iteratively so that both label losses gradually decrease, indicating that the generative network can generate audio with a specified label.
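The label losses are ordinary cross-entropies over the predicted label distributions, with errors on sample spectra charged to the discriminator and errors on generated spectra charged to the generator; the distributions and labels below are invented for illustration.

```python
import numpy as np

def label_losses(pred_real, labels_real, pred_fake, labels_fake):
    # pred_*: predicted label distributions (one row per spectrum, rows sum to 1)
    # Errors on sample spectra feed L_D2; errors on generated spectra feed L_G2.
    pred_real, pred_fake = np.asarray(pred_real), np.asarray(pred_fake)
    L_D2 = -np.mean(np.log(pred_real[np.arange(len(labels_real)), labels_real]))
    L_G2 = -np.mean(np.log(pred_fake[np.arange(len(labels_fake)), labels_fake]))
    return L_G2, L_D2

pred_real = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]   # predictions on sample spectra
pred_fake = [[0.6, 0.3, 0.1]]                    # predictions on generated spectra
print(label_losses(pred_real, [0, 1], pred_fake, [0]))
```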
In the training method provided by the above embodiments, sample audio comprising labeled and unlabeled audio is acquired and converted into sample mel spectra; noise audio is collected and input together with the sample mel spectra into the generative network to obtain output mel spectra; the output mel spectra are fed into the discriminant network to obtain their type probabilities and labels; finally, the generative and discriminant networks are trained alternately and iteratively from those type probabilities and labels, and the trained generative network serves as the speech conversion model, completing model training. Because the discriminant network supplies labels for the output mel spectra, only a small amount of labeled audio is needed when training the generative and discriminant networks, which lowers the audio-corpus requirements for training the speech conversion model and reduces the complexity of model construction.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a speech conversion method provided by an embodiment of this application.
As shown in FIG. 2, the speech conversion method comprises steps S201 to S202.
S201: Acquire the user's audio data to be converted and a target conversion label.
The audio to be converted is the audio the user needs converted; the target conversion label is the label to which the audio to be converted is to be converted.
For example, the audio to be converted may have a woman's timbre, and the target conversion label may be "girl".
S202: Input the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data.
The pre-trained speech conversion model is a generative network trained with any of the training methods for a speech conversion model provided by the above embodiments.
The audio data to be converted and the target conversion label are input into the pre-trained speech conversion model, which synthesizes audio from them and outputs the converted audio data, thereby achieving speech conversion and improving the user experience.
Referring to FIG. 3, FIG. 3 is a schematic block diagram of a training apparatus for a speech conversion model provided by an embodiment of this application; the apparatus performs the aforementioned training method and may be deployed in a server or a terminal.
The server may be a standalone server or a server cluster. The terminal may be an electronic device such as a mobile phone, tablet, laptop, desktop computer, personal digital assistant, or wearable device.
As shown in FIG. 3, the training apparatus 300 comprises a sample acquisition module 301, a noise collection module 302, a discriminant output module 303, and a model training module 304.
The sample acquisition module 301 acquires sample audio and converts it into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio.
The noise collection module 302 collects noise audio, which is unlabeled, and inputs it together with the sample mel spectra into the generative network to obtain output mel spectra.
The discriminant output module 303 inputs the output mel spectra into the discriminant network to obtain a type probability and a label for each output mel spectrum.
The model training module 304 alternately and iteratively trains the generative and discriminant networks according to the type probabilities and labels of the output mel spectra, and takes the trained generative network as the speech conversion model, completing model training.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the training apparatus and its modules described above may refer to the corresponding processes in the foregoing embodiments of the training method and are not repeated here.
Referring to FIG. 4, FIG. 4 is a schematic block diagram of a speech conversion apparatus provided by an embodiment of this application; the apparatus performs the aforementioned speech conversion method and may be deployed in a server or a terminal.
The server may be a standalone server or a server cluster. The terminal may be an electronic device such as a mobile phone, tablet, laptop, desktop computer, personal digital assistant, or wearable device.
As shown in FIG. 4, the speech conversion apparatus 400 comprises a data acquisition module 401 and an audio conversion module 402.
The data acquisition module 401 acquires the user's audio data to be converted and a target conversion label.
The audio conversion module 402 inputs the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data, wherein the pre-trained speech conversion model is a generative network trained with the above training method for a speech conversion model.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the speech conversion apparatus and its modules described above may refer to the corresponding processes in the foregoing embodiments of the speech conversion method and are not repeated here.
Both the training apparatus for a speech conversion model and the speech conversion apparatus described above may be implemented as computer-readable instructions, which can run on a computer device as shown in FIG. 5.
Referring to FIG. 5, FIG. 5 is a schematic structural block diagram of a computer device provided by an embodiment of this application. The computer device may be a server or a terminal.
As shown in FIG. 5, the computer device comprises a processor, a memory, and a network interface connected by a system bus, where the memory may include a computer-readable storage medium and an internal memory.
The computer-readable storage medium may be non-volatile or volatile and may store an operating system and computer-readable instructions which, when executed, can cause the processor to perform any of the training methods for a speech conversion model and speech conversion methods.
The processor provides computing and control capabilities and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer-readable instructions in the computer-readable storage medium; when executed by the processor, those instructions can cause the processor to perform any of the training methods for a speech conversion model and speech conversion methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in FIG. 5 is only a block diagram of the parts relevant to this application's solution and does not limit the computer devices to which the solution applies; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange components differently.
It should be understood that the processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like; the general-purpose processor may be a microprocessor or any conventional processor.
In one embodiment, the processor runs the computer-readable instructions stored in the memory to implement the following steps:
acquiring sample audio and converting the sample audio into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio; collecting noise audio, which is unlabeled, and inputting the noise audio together with the sample mel spectra into a generative network to obtain output mel spectra; inputting the output mel spectra into a discriminant network to obtain a type probability and a predicted label for each output mel spectrum; and alternately and iteratively training the generative network and the discriminant network according to the type probabilities and predicted labels, and taking the trained generative network as the speech conversion model, completing model training.
In one embodiment, the processor further implements:
when the accuracy of the predicted labels that the discriminant network outputs for the output mel spectra reaches a preset value, inputting the sample mel spectra of the unlabeled audio into the discriminant network and taking the resulting predicted labels as the labels of the unlabeled audio.
In one embodiment, the processor implements:
adjusting the speaking rate of the sample audio to obtain rate-adjusted sample audio and converting it into rate-adjusted mel spectra; and training the discriminant network on the rate-adjusted mel spectra so that it outputs the speaking rate corresponding to each rate-adjusted mel spectrum.
In one embodiment, when alternately and iteratively training the generative and discriminant networks according to the type probabilities of the output mel spectra, the processor implements:
computing the values of the type loss functions of the generative and discriminant networks from the type probabilities of the output mel spectra; alternately and iteratively training the two networks according to those values; and completing the training of the generative network when the type probability output by the discriminant network reaches a preset value.
In one embodiment, when alternately and iteratively training the generative and discriminant networks according to the type probabilities and predicted labels of the output mel spectra, the processor implements:
if the type probability of an output mel spectrum indicates that its corresponding audio is sample audio and its predicted label differs from the label of that sample audio, counting the error into the discriminant network's label loss function; if the type probability indicates that the corresponding audio is noise audio and the predicted label differs from the label of that noise audio, counting the error into the generative network's label loss function; and iteratively training the generative network with its label loss function and the discriminant network with its label loss function.
In one embodiment, the processor runs the computer-readable instructions stored in the memory to implement the following steps:
acquiring the user's audio data to be converted and a target conversion label; and inputting the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data, wherein the pre-trained speech conversion model is a generative network trained with the above training method for a speech conversion model.
The embodiments of this application further provide a computer-readable storage medium storing computer-readable instructions; a processor executes the computer-readable instructions to implement any of the training methods for a speech conversion model and speech conversion methods provided by the embodiments of this application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. It may also be an external storage device of the computer device, such as a plug-in hard disk equipped on the computer device, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card.
The above are only specific implementations of this application, but its protection scope is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed herein, and all such modifications or substitutions shall fall within the protection scope of this application. The protection scope of this application shall therefore be subject to the protection scope of the claims.
Claims (20)
- A training method for a speech conversion model, comprising: acquiring sample audio and converting the sample audio into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio; collecting noise audio, which is unlabeled, and inputting the noise audio together with the sample mel spectra into a generative network to obtain output mel spectra; inputting the output mel spectra into a discriminant network to obtain a type probability and a predicted label for each output mel spectrum; and alternately and iteratively training the generative network and the discriminant network according to the type probabilities and predicted labels of the output mel spectra, and taking the trained generative network as the speech conversion model, completing model training.
- The training method for a speech conversion model of claim 1, further comprising: when the accuracy of the predicted labels that the discriminant network outputs for the output mel spectra reaches a preset value, inputting the sample mel spectra of the unlabeled audio into the discriminant network and taking the resulting predicted labels as the labels of the unlabeled audio.
- The training method for a speech conversion model of claim 1, comprising: adjusting the speaking rate of the sample audio to obtain rate-adjusted sample audio and converting it into rate-adjusted mel spectra; and training the discriminant network on the rate-adjusted mel spectra so that it outputs the speaking rate corresponding to each rate-adjusted mel spectrum.
- The training method for a speech conversion model of claim 1, wherein alternately and iteratively training the generative and discriminant networks according to the type probabilities of the output mel spectra comprises: computing the values of the type loss functions of the generative and discriminant networks from the type probabilities; alternately and iteratively training the two networks according to those values; and completing the training of the generative network when the type probability output by the discriminant network reaches a preset value.
- The training method for a speech conversion model of claim 1, wherein alternately and iteratively training the generative and discriminant networks according to the type probabilities and predicted labels comprises: if the type probability of an output mel spectrum indicates that its corresponding audio is sample audio and its predicted label differs from the label of that sample audio, counting the error into the discriminant network's label loss function; if the type probability indicates that the corresponding audio is noise audio and the predicted label differs from the label of that noise audio, counting the error into the generative network's label loss function; and iteratively training the generative network with its label loss function and the discriminant network with its label loss function.
- A speech conversion method, comprising: acquiring a user's audio data to be converted and a target conversion label; and inputting the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data, wherein the pre-trained speech conversion model is a generative network trained with the training method for a speech conversion model of any one of claims 1 to 5.
- A training apparatus for a speech conversion model, comprising: a sample acquisition module for acquiring sample audio and converting it into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio; a noise collection module for collecting noise audio, which is unlabeled, and inputting it together with the sample mel spectra into a generative network to obtain output mel spectra; a discriminant output module for inputting the output mel spectra into a discriminant network to obtain a type probability and a label for each output mel spectrum; and a model training module for alternately and iteratively training the generative network and the discriminant network according to the type probabilities and labels of the output mel spectra, and taking the trained generative network as the speech conversion model, completing model training.
- A speech conversion apparatus, comprising: a data acquisition module for acquiring a user's audio data to be converted and a target conversion label; and an audio conversion module for inputting the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data, wherein the pre-trained speech conversion model is a generative network trained with the training method for a speech conversion model of any one of claims 1 to 5.
- A computer device comprising a memory and a processor; the memory is configured to store computer-readable instructions, and the processor is configured to execute the computer-readable instructions and, when doing so, implement the following steps: acquiring sample audio and converting the sample audio into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio; collecting noise audio, which is unlabeled, and inputting the noise audio together with the sample mel spectra into a generative network to obtain output mel spectra; inputting the output mel spectra into a discriminant network to obtain a type probability and a predicted label for each output mel spectrum; and alternately and iteratively training the generative network and the discriminant network according to the type probabilities and predicted labels, and taking the trained generative network as the speech conversion model, completing model training.
- The computer device of claim 9, wherein the processor, when executing the computer-readable instructions, further implements the following steps: when the accuracy of the predicted labels that the discriminant network outputs for the output mel spectra reaches a preset value, inputting the sample mel spectra of the unlabeled audio into the discriminant network and taking the resulting predicted labels as the labels of the unlabeled audio.
- The computer device of claim 9, wherein the processor, when executing the computer-readable instructions, further implements the following steps: adjusting the speaking rate of the sample audio to obtain rate-adjusted sample audio and converting it into rate-adjusted mel spectra; and training the discriminant network on the rate-adjusted mel spectra so that it outputs the speaking rate corresponding to each rate-adjusted mel spectrum.
- The computer device of claim 9, wherein alternately and iteratively training the generative and discriminant networks according to the type probabilities of the output mel spectra comprises: computing the values of the type loss functions of the generative and discriminant networks from the type probabilities; alternately and iteratively training the two networks according to those values; and completing the training of the generative network when the type probability output by the discriminant network reaches a preset value.
- The computer device of claim 9, wherein alternately and iteratively training the generative and discriminant networks according to the type probabilities and predicted labels comprises: if the type probability of an output mel spectrum indicates that its corresponding audio is sample audio and its predicted label differs from the label of that sample audio, counting the error into the discriminant network's label loss function; if the type probability indicates that the corresponding audio is noise audio and the predicted label differs from the label of that noise audio, counting the error into the generative network's label loss function; and iteratively training the generative network with its label loss function and the discriminant network with its label loss function.
- A computer device comprising a memory and a processor; the memory is configured to store computer-readable instructions, and the processor is configured to execute the computer-readable instructions and, when doing so, implement the following steps: acquiring a user's audio data to be converted and a target conversion label; and inputting the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data.
- A computer-readable storage medium storing computer-readable instructions that, when executed by a processor, cause the processor to implement the following steps: acquiring sample audio and converting the sample audio into sample mel spectra, the sample audio comprising unlabeled audio and labeled audio; collecting noise audio, which is unlabeled, and inputting the noise audio together with the sample mel spectra into a generative network to obtain output mel spectra; inputting the output mel spectra into a discriminant network to obtain a type probability and a predicted label for each output mel spectrum; and alternately and iteratively training the generative network and the discriminant network according to the type probabilities and predicted labels, and taking the trained generative network as the speech conversion model, completing model training.
- The computer-readable storage medium of claim 15, wherein the computer-readable instructions, when executed by the processor, further cause the processor to implement the following steps: when the accuracy of the predicted labels that the discriminant network outputs for the output mel spectra reaches a preset value, inputting the sample mel spectra of the unlabeled audio into the discriminant network and taking the resulting predicted labels as the labels of the unlabeled audio.
- The computer-readable storage medium of claim 15, wherein the computer-readable instructions, when executed by the processor, further cause the processor to implement the following steps: adjusting the speaking rate of the sample audio to obtain rate-adjusted sample audio and converting it into rate-adjusted mel spectra; and training the discriminant network on the rate-adjusted mel spectra so that it outputs the speaking rate corresponding to each rate-adjusted mel spectrum.
- The computer-readable storage medium of claim 15, wherein alternately and iteratively training the generative and discriminant networks according to the type probabilities of the output mel spectra comprises: computing the values of the type loss functions of the generative and discriminant networks from the type probabilities; alternately and iteratively training the two networks according to those values; and completing the training of the generative network when the type probability output by the discriminant network reaches a preset value.
- The computer-readable storage medium of claim 15, wherein alternately and iteratively training the generative and discriminant networks according to the type probabilities and predicted labels comprises: if the type probability of an output mel spectrum indicates that its corresponding audio is sample audio and its predicted label differs from the label of that sample audio, counting the error into the discriminant network's label loss function; if the type probability indicates that the corresponding audio is noise audio and the predicted label differs from the label of that noise audio, counting the error into the generative network's label loss function; and iteratively training the generative network with its label loss function and the discriminant network with its label loss function.
- A computer-readable storage medium storing computer-readable instructions that, when executed by a processor, cause the processor to implement the following steps: acquiring a user's audio data to be converted and a target conversion label; and inputting the audio data to be converted and the target conversion label into a pre-trained speech conversion model to obtain converted audio data.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011446585.1A CN112509600A (zh) | 2020-12-11 | 2020-12-11 | 模型的训练方法、装置、语音转换方法、设备及存储介质 |
CN202011446585.1 | 2020-12-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022121180A1 true WO2022121180A1 (zh) | 2022-06-16 |
Family
ID=74971318
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/084219 WO2022121180A1 (zh) | 2020-12-11 | 2021-03-31 | 模型的训练方法、装置、语音转换方法、设备及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112509600A (zh) |
WO (1) | WO2022121180A1 (zh) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112509600A (zh) * | 2020-12-11 | 2021-03-16 | 平安科技(深圳)有限公司 | 模型的训练方法、装置、语音转换方法、设备及存储介质 |
CN113257283B (zh) * | 2021-03-29 | 2023-09-26 | 北京字节跳动网络技术有限公司 | 音频信号的处理方法、装置、电子设备和存储介质 |
CN113241054B (zh) * | 2021-05-10 | 2023-03-21 | 北京声智科技有限公司 | 语音平滑处理模型生成方法、语音平滑处理方法及装置 |
CN113780454B (zh) * | 2021-09-17 | 2023-10-24 | 平安科技(深圳)有限公司 | 模型训练及调用方法、装置、计算机设备、存储介质 |
CN115240708A (zh) * | 2021-09-30 | 2022-10-25 | 达闼科技(北京)有限公司 | 模型训练方法、装置、电子设备和计算机可读存储介质 |
CN115065482B (zh) * | 2022-06-16 | 2024-05-17 | 平安银行股份有限公司 | 一种声音识别方法、装置、终端设备及存储介质 |
CN114999447B (zh) * | 2022-07-20 | 2022-10-25 | 南京硅基智能科技有限公司 | 一种基于对抗生成网络的语音合成模型及语音合成方法 |
CN115240680A (zh) * | 2022-08-05 | 2022-10-25 | 安徽大学 | 一种模糊耳语音的转换方法、系统及其装置 |
CN116013272A (zh) * | 2022-12-29 | 2023-04-25 | 北京天玛智控科技股份有限公司 | 声音识别模型的训练方法、装置、电子设备及存储介质 |
CN116705055B (zh) * | 2023-08-01 | 2023-10-17 | 国网福建省电力有限公司 | 一种变电站噪声监测方法、系统、设备和存储介质 |
CN118380008B (zh) * | 2024-06-25 | 2024-09-06 | 武汉天虹环保产业股份有限公司 | 一种环境噪声污染智能实时识别与定位监测系统及方法 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109741736A (zh) * | 2017-10-27 | 2019-05-10 | 百度(美国)有限责任公司 | 使用生成对抗网络进行鲁棒语音识别的系统和方法 |
WO2019092931A1 (ja) * | 2017-11-07 | 2019-05-16 | 日本電気株式会社 | 判別モデル生成装置、判別モデル生成方法および判別モデル生成プログラム |
CN110136686A (zh) * | 2019-05-14 | 2019-08-16 | 南京邮电大学 | 基于STARGAN与i向量的多对多说话人转换方法 |
CN110136690A (zh) * | 2019-05-22 | 2019-08-16 | 平安科技(深圳)有限公司 | 语音合成方法、装置及计算机可读存储介质 |
CN110706692A (zh) * | 2019-10-21 | 2020-01-17 | 上海交通大学 | 儿童语音识别模型的训练方法及系统 |
CN112509600A (zh) * | 2020-12-11 | 2021-03-16 | 平安科技(深圳)有限公司 | 模型的训练方法、装置、语音转换方法、设备及存储介质 |
- 2020-12-11: CN application CN202011446585.1A filed (status: active, Pending)
- 2021-03-31: WO application PCT/CN2021/084219 filed (status: active, Application Filing)
Also Published As
Publication number | Publication date |
---|---|
CN112509600A (zh) | 2021-03-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21901896; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21901896; Country of ref document: EP; Kind code of ref document: A1 |