CN113689868B - Training method and device of voice conversion model, electronic equipment and medium - Google Patents

Training method and device of voice conversion model, electronic equipment and medium

Info

Publication number: CN113689868B
Application number: CN202110950488.4A (priority application; granted as CN113689868B)
Authority: CN (China)
Prior art keywords: content, voice, encoder, sequence, output
Other languages: Chinese (zh)
Other versions: CN113689868A
Inventors: 王俊超, 陈怿翔, 康永国
Assignee (current and original): Beijing Baidu Netcom Science and Technology Co Ltd (Google has not performed a legal analysis; listed statuses and assignees are assumptions, not legal conclusions)
Legal status: Active (application granted)

Classifications

    • G10L21/003 — Changing voice quality, e.g. pitch or formants
    • G10L21/007 — Changing voice quality, characterised by the process used
    • G06N3/02, G06N3/08 — Neural networks; learning methods
    • G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/26 — Speech-to-text systems
    • G10L19/04 — Speech or audio analysis-synthesis using predictive techniques
    • G10L25/30 — Speech or voice analysis techniques using neural networks
    • G10L2015/025 — Phonemes, fenemes or fenones as recognition units


Abstract



The present disclosure provides a training method, apparatus, electronic device, and medium for a speech conversion model, relating to the field of artificial intelligence and, in particular, to speech and deep learning technologies. The specific implementation scheme is as follows: input the original acoustic features of a speech sample into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder; input the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network; train the content encoder based on the supervision sequence and the content sequence; input the content sequence and the timbre vector into a decoder to obtain predicted acoustic features output by the decoder; and train the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features. In the embodiments of the present application, the content encoder can be trained specifically through the content supervision network, so that the speech conversion model can implement speech conversion more accurately.


Description

Training method and device of voice conversion model, electronic equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and further relates to speech and deep learning technologies, and in particular, to a method and an apparatus for training a speech conversion model, an electronic device, and a medium.
Background
Speech conversion, whose purpose is to convert a source speaker's voice into the timbre of a target speaker while keeping the spoken content unchanged, is attracting growing interest in the market. Depending on the corpus the model requires, speech conversion can be divided into parallel-corpus and non-parallel-corpus conversion. In parallel-corpus conversion, the source speaker and the target speaker must record audio of the same text when the required corpus is collected. Non-parallel-corpus conversion requires only multiple recordings of the target speaker and does not need the source speaker's voice during training.
An existing self-reconstruction many-to-many voice conversion system mainly comprises a content encoder, a timbre encoder, and a decoder. When such a system is trained, the original acoustic features are input into the timbre encoder to obtain a sentence-level timbre vector representing the speaker's timbre information. The original acoustic features are also input into the content encoder, which contains a module capable of encoding content information (such as a down-sampling module or a vector quantization encoder), to obtain a sentence-level content sequence representing the content information of the speech. The timbre vector and the content sequence are then input into the decoder to obtain predicted acoustic features, and finally the system is trained based on the predicted acoustic features and the original acoustic features.
When the traditional self-reconstruction many-to-many voice conversion system is trained in this way, the content encoder can remove speaker information, and information decoupling is completed by adding timbre information to the decoder input. However, when the content encoder encodes the speaker's content, part of the content information may also be removed, resulting in more errors in the converted content.
Disclosure of Invention
The present disclosure provides a method and an apparatus for training a speech conversion model, an electronic device, and a medium.
In a first aspect, the present application provides a method for training a speech conversion model, the method including:
inputting original acoustic features of a speech sample into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder;
inputting the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network, and training the content encoder based on the supervision sequence and the content sequence;
inputting the content sequence and the timbre vector into a decoder to obtain predicted acoustic features output by the decoder;
and training a speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
In a second aspect, the present application provides an apparatus for training a speech conversion model, the apparatus comprising an encoding module, a supervision module, a decoding module, and a training module; wherein,
the encoding module is used for inputting the original acoustic features of the speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder;
the supervision module is used for inputting the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network, and for training the content encoder based on the supervision sequence and the content sequence;
the decoding module is used for inputting the content sequence and the timbre vector into a decoder to obtain predicted acoustic features output by the decoder;
and the training module is used for training a speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
In a third aspect, an embodiment of the present application provides an electronic device, including:
one or more processors;
a memory for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the method for training a speech conversion model according to any embodiment of the present application.
In a fourth aspect, the present application provides a storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the method for training a speech conversion model according to any embodiment of the present application.
In a fifth aspect, a computer program product is provided which, when executed by a computer device, implements the method for training a speech conversion model according to any embodiment of the present application.
According to the technical scheme of the present disclosure, the prior-art problem that the content encoder may remove part of the content information when encoding the speaker's content, causing more errors in the converted content, is solved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a first flowchart of a method for training a speech conversion model according to an embodiment of the present disclosure;
FIG. 2 is a second flowchart of a method for training a speech conversion model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a training system for a speech conversion model according to an embodiment of the present application;
FIG. 4 is a third flowchart of a training method of a speech conversion model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a prediction system of a speech conversion model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for training a speech conversion model according to an embodiment of the present application;
FIG. 7 is a block diagram of an electronic device for implementing a method for training a speech conversion model according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Example one
Fig. 1 is a first flowchart of a method for training a speech conversion model according to an embodiment of the present disclosure, where the method may be performed by an apparatus or an electronic device for training a speech conversion model, where the apparatus or the electronic device may be implemented by software and/or hardware, and the apparatus or the electronic device may be integrated in any intelligent device with a network communication function. As shown in fig. 1, the training method of the speech conversion model may include the following steps:
S101, inputting the original acoustic features of the speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder.
In this step, the electronic device may input the original acoustic features of the speech into the content encoder and the timbre encoder, respectively, to obtain the content sequence output by the content encoder and the timbre vector output by the timbre encoder. Specifically, the electronic device performs this step when the speech conversion model to be trained does not yet satisfy a preset convergence condition. Through the content encoder, the electronic device extracts content-related information from the original acoustic features to obtain the content sequence corresponding to those features; through the timbre encoder, it extracts timbre-related information from the original acoustic features to obtain the corresponding timbre vector.
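As an illustration only (the patent does not disclose the encoders' internal architecture), the two encoder roles can be sketched as follows: a content encoder that down-samples the per-frame acoustic features into a frame-level content sequence, and a timbre encoder that pools the whole utterance into a single sentence-level timbre vector. The stride-2 down-sampling, mean pooling, and all shapes are assumptions made for this sketch.

```python
import numpy as np

def content_encoder(acoustic_feats: np.ndarray, stride: int = 2) -> np.ndarray:
    """Hypothetical content encoder: a stride-`stride` down-sampling stands in
    for the patent's 'module capable of encoding content information'
    (e.g. a down-sampling module or a vector quantization encoder).
    Input:  (frames, feat_dim) acoustic features, e.g. a mel spectrogram.
    Output: (frames // stride, feat_dim) content sequence."""
    return acoustic_feats[::stride]

def timbre_encoder(acoustic_feats: np.ndarray) -> np.ndarray:
    """Hypothetical timbre encoder: mean-pool over time to obtain one
    sentence-level timbre vector of shape (feat_dim,)."""
    return acoustic_feats.mean(axis=0)

# Example: 100 frames of 80-dimensional mel features.
feats = np.random.randn(100, 80)
content_seq = content_encoder(feats)   # frame-level content sequence
timbre_vec = timbre_encoder(feats)     # single sentence-level vector
```

Note the asymmetry the patent relies on: the content sequence keeps the time axis (one entry per down-sampled frame), while the timbre vector deliberately collapses it, so time-varying content and time-invariant speaker identity end up in different representations.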
S102, inputting the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network; the content encoder is trained based on the supervision sequence and the content sequence.
In this step, the electronic device may input the content sequence into the content supervision network to obtain the supervision sequence output by the content supervision network, and then train the content encoder based on the supervision sequence and the content sequence. Specifically, the electronic device may extract text information from the content sequence through the content supervision network and obtain the supervision sequence based on that text information; alternatively, it may input the content sequence into a speech recognition acoustic model of the content supervision network, output a phoneme probability sequence through that model, and obtain the supervision sequence based on the phoneme probability sequence.
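The second alternative above (a speech recognition acoustic model producing a phoneme probability sequence) can be sketched minimally. Here a single linear projection followed by a softmax stands in for the acoustic model; the projection matrix, phoneme inventory size, and shapes are all assumptions for illustration, not the patent's actual network.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def content_supervision_network(content_seq: np.ndarray,
                                proj: np.ndarray) -> np.ndarray:
    """Hypothetical content supervision network: a linear projection into
    phoneme space plus softmax stands in for the patent's 'speech
    recognition acoustic model'. The returned per-frame phoneme
    probability sequence serves as the supervision sequence."""
    return softmax(content_seq @ proj)

n_phonemes = 40                       # assumed phoneme inventory size
content_seq = np.random.randn(50, 80) # content sequence from the encoder
proj = np.random.randn(80, n_phonemes) * 0.1
supervision_seq = content_supervision_network(content_seq, proj)
# Each frame of supervision_seq is a probability distribution over phonemes.
```

The key idea is that the supervision sequence is phonetically grounded: it tells the content encoder, frame by frame, which linguistic unit its output should still encode.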
S103, inputting the content sequence and the timbre vector into a decoder to obtain the predicted acoustic features output by the decoder.
In this step, the electronic device may input the content sequence and the timbre vector into the decoder to obtain the predicted acoustic features output by the decoder. Specifically, the decoder may fuse the content-related information and the timbre-related information extracted from the original acoustic features to obtain the predicted acoustic features.
S104, training the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
In this step, the electronic device may train the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features, and may then reselect another speech sample to continue training until the model satisfies the preset convergence condition. The reselected speech may or may not be adjacent to the previous speech, which is not limited herein. Further, the electronic device may calculate a loss value between the predicted and original acoustic features through a pre-constructed loss function, and train the speech conversion model to be trained based on that loss value.
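The patent only specifies that a "pre-constructed loss function" compares predicted and original acoustic features. A common choice for spectrogram reconstruction, assumed here purely for illustration, is the mean absolute error (L1):

```python
import numpy as np

def reconstruction_loss(predicted: np.ndarray, original: np.ndarray) -> float:
    """Hypothetical self-reconstruction loss for the whole model:
    mean absolute error between predicted and original acoustic
    features. The patent does not name the loss; L1 is a common
    choice for spectrograms and is assumed here."""
    return float(np.abs(predicted - original).mean())

original = np.zeros((100, 80))
predicted = np.full((100, 80), 0.5)      # an imperfect reconstruction
loss = reconstruction_loss(predicted, original)
```

Because the model reconstructs the same utterance it encodes, minimizing this loss forces the content sequence and the timbre vector to jointly preserve everything needed to rebuild the original features.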
In the training method provided by this embodiment, the original acoustic features of the speech are first input into a content encoder and a timbre encoder to obtain a content sequence and a timbre vector; the content sequence is then input into a content supervision network to obtain a supervision sequence, and the content encoder is trained based on the supervision sequence and the content sequence; the content sequence and the timbre vector are input into a decoder to obtain predicted acoustic features; and finally the speech conversion model to be trained is trained based on the predicted and original acoustic features. That is, during training the content sequence output by the content encoder is fed not only to the decoder but also to the content supervision network, which trains the content encoder specifically. Existing training methods feed the content sequence only to the decoder and attach no auxiliary network to the content encoder. By adding a content supervision network that specifically trains the content encoder, this scheme solves the prior-art problem that the content encoder may remove part of the content information during encoding, causing more errors in the converted content; moreover, the scheme is simple to implement, easy to popularize, and widely applicable.
Example two
Fig. 2 is a second flowchart of a method for training a speech conversion model according to an embodiment of the present application. This embodiment further optimizes and expands the above technical scheme and may be combined with the optional embodiments described above. As shown in fig. 2, the training method of the speech conversion model may include the following steps:
s201, respectively inputting the original acoustic features of the voice to a content encoder and a tone encoder to obtain a content sequence output by the content encoder and a tone vector output by the tone encoder.
S202, inputting the content sequence into a content monitoring network to obtain a monitoring sequence output by the content monitoring network.
In this step, the electronic device may input the content sequence to the content surveillance network, resulting in a surveillance sequence output by the content surveillance network. Specifically, the electronic device may extract text information in the content sequence through the content surveillance network; then, a supervision sequence is obtained based on the text information; or, the electronic device may further input the content sequence into a speech recognition acoustic model of the content surveillance network, and output a phoneme probability sequence through the speech recognition acoustic model; a supervised sequence is then derived based on the phoneme probability sequence. In addition, the content monitoring network may also use other monitoring methods to obtain the monitoring sequence, which is not limited herein.
And S203, calculating a loss value of the content encoder for the original acoustic features based on the supervision sequence and the content sequence.
In this step, the electronic device may calculate a loss value of the content encoder for the original acoustic feature based on the supervision sequence and the content sequence. Specifically, the electronic device may calculate a loss value of the content encoder for the original acoustic feature through a pre-constructed loss function.
And S204, adjusting model parameters in the content encoder according to the loss value of the content encoder aiming at the original acoustic features.
In this step, the electronic device may adjust the model parameters in the content encoder according to the loss value of the content encoder for the original acoustic features. Specifically, the content encoder may be a neural network, and the model parameters in the content encoder may be adjusted according to the loss value of the content encoder for the original acoustic features.
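Steps S203 and S204 can be sketched together. The patent does not specify the loss function or the optimizer; here, as assumptions, the loss is the cross-entropy between the phoneme distribution predicted from the content sequence and the supervision sequence treated as soft targets, and the parameter update is one step of plain gradient descent.

```python
import numpy as np

def content_encoder_loss(content_seq: np.ndarray,
                         supervision_seq: np.ndarray,
                         proj: np.ndarray) -> float:
    """Hypothetical content-encoder loss (S203): cross-entropy between
    the phoneme distribution predicted from the content sequence and
    the supervision sequence used as soft targets. The patent only
    requires 'a pre-constructed loss function'; this choice is assumed."""
    logits = content_seq @ proj
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(supervision_seq * log_probs).sum(axis=1).mean())

def sgd_step(param: np.ndarray, grad: np.ndarray, lr: float = 1e-3) -> np.ndarray:
    """Hypothetical parameter adjustment (S204): one plain
    gradient-descent update of a content-encoder parameter."""
    return param - lr * grad

# Deterministic example: zero features and a uniform target over 3 phonemes
# give exactly the entropy of the uniform distribution, log(3).
content_seq = np.zeros((4, 5))
proj = np.zeros((5, 3))
supervision_seq = np.full((4, 3), 1.0 / 3.0)
loss = content_encoder_loss(content_seq, supervision_seq, proj)
updated = sgd_step(np.ones(3), np.ones(3), lr=0.1)
```

In a real system the gradient would be obtained by backpropagation through the encoder; the closed-form update above only illustrates the direction of the adjustment in S204.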
S205, inputting the content sequence and the timbre vector into a decoder to obtain the predicted acoustic features output by the decoder.
S206, training the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
Fig. 3 is a schematic structural diagram of a training system for a speech conversion model according to an embodiment of the present application. As shown in fig. 3, the training system may include a content supervision network, a content encoder, a timbre encoder, and a decoder. When the speech conversion model is trained, the original acoustic features of the speech are first input into the content encoder and the timbre encoder, respectively, to obtain the content sequence and the timbre vector; meanwhile, the content sequence is input into the content supervision network to obtain the supervision sequence, and the content encoder is trained based on the supervision sequence and the content sequence. The content sequence and the timbre vector are then input into the decoder to obtain the predicted acoustic features, and the speech conversion model is trained based on the predicted and original acoustic features.
Compared with the first embodiment, this embodiment makes the training of the content encoder explicit: a loss value is calculated from the supervision sequence and the content sequence, and the model parameters of the content encoder are adjusted according to that loss value. The benefits are the same as those described above: because the content supervision network specifically trains the content encoder, less content information is removed during encoding and the converted content contains fewer errors, and the scheme remains simple to implement and widely applicable.
EXAMPLE III
Fig. 4 is a third flowchart of a training method of a speech conversion model according to an embodiment of the present application. This embodiment further optimizes and expands the above technical scheme and may be combined with the optional embodiments described above. As shown in fig. 4, the training method of the speech conversion model may include the following steps:
S401, inputting the original acoustic features of the speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder.
S402, inputting the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network; the content encoder is trained based on the supervision sequence and the content sequence.
S403, inputting the content sequence and the timbre vector into a decoder to obtain the predicted acoustic features output by the decoder.
S404, training the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
S405, inputting the original acoustic features of a first user for a first speech and the original acoustic features of a second user for a second speech into the trained speech conversion model, respectively, and obtaining, through the speech conversion model, a target speech converted from the first speech and the second speech; wherein the target speech includes the content information of the first speech and the timbre information of the second speech.
After the trained speech conversion model is obtained through the above steps, the electronic device may, in this step, input the original acoustic features of the first user for the first speech into the trained content encoder to obtain the content sequence of the first speech, and input the original acoustic features of the second user for the second speech into the trained timbre encoder to obtain the timbre vector of the second speech. It may then input the content sequence of the first speech and the timbre vector of the second speech into the trained decoder, which outputs predicted fused acoustic features, and finally input the predicted fused acoustic features into the trained vocoder to obtain the target speech output by the vocoder.
Fig. 5 is a schematic structural diagram of a prediction system of a speech conversion model according to an embodiment of the present application. As shown in fig. 5, the prediction system may include a content encoder, a timbre encoder, a decoder, and a vocoder. Assuming that user A's voice is to be converted into user B's timbre, user A's original acoustic features for the first speech are first input into the content encoder to obtain the content sequence of the first speech; meanwhile, user B's original acoustic features for the second speech are input into the trained timbre encoder to obtain the timbre vector of the second speech. The content sequence of the first speech and the timbre vector of the second speech are then input into the trained decoder, which outputs predicted fused acoustic features, and these are input into the trained vocoder to obtain the target speech output by the vocoder.
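The prediction pipeline of fig. 5 can be sketched end to end under the same assumptions as the earlier encoder sketch. The decoder here simply broadcasts the sentence-level timbre vector over the content sequence, and the vocoder is a stub; the patent does not specify the internals of either component.

```python
import numpy as np

def content_encoder(feats: np.ndarray, stride: int = 2) -> np.ndarray:
    """Hypothetical content encoder (assumed stride-2 down-sampling)."""
    return feats[::stride]

def timbre_encoder(feats: np.ndarray) -> np.ndarray:
    """Hypothetical timbre encoder (assumed mean pooling over time)."""
    return feats.mean(axis=0)

def decoder(content_seq: np.ndarray, timbre_vec: np.ndarray) -> np.ndarray:
    """Hypothetical decoder: broadcast-add the sentence-level timbre
    vector onto every frame of the content sequence to produce the
    predicted fused acoustic features."""
    return content_seq + timbre_vec

def vocoder(fused_feats: np.ndarray) -> np.ndarray:
    """Stub vocoder: a real system would synthesize a waveform here
    (e.g. with a neural vocoder); this stub just returns a dummy signal."""
    return np.tanh(fused_feats).ravel()

feats_a = np.random.randn(100, 80)   # user A's features: supply the content
feats_b = np.random.randn(120, 80)   # user B's features: supply the timbre
fused = decoder(content_encoder(feats_a), timbre_encoder(feats_b))
target_speech = vocoder(fused)       # content of A carried in the timbre of B
```

Note that the two input utterances need not have the same length: the timbre vector has no time axis, so the decoder output follows the content sequence of the first speech, which is exactly the decoupling the training scheme is designed to enforce.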
In the training method of the voice conversion model provided by the embodiments of the present application, the original acoustic features of a voice are first input into a content encoder and a timbre encoder respectively, to obtain the content sequence output by the content encoder and the timbre vector output by the timbre encoder; the content sequence is then input into a content supervision network to obtain the supervision sequence output by the content supervision network, and the content encoder is trained based on the supervision sequence and the content sequence; the content sequence and the timbre vector are input into a decoder respectively, to obtain the predicted acoustic features output by the decoder; finally, the voice conversion model to be trained is trained based on the predicted acoustic features and the original acoustic features. That is, when the voice conversion model is trained, the content sequence output by the content encoder is input not only to the decoder but also to the content supervision network, which trains the content encoder specifically. In existing training methods for voice conversion models, the content sequence output by the content encoder is only input to the decoder, and no additional auxiliary network is trained specifically for the content encoder. Because a content supervision network is added to the voice conversion model to train the content encoder specifically, the technical problem in the prior art that the content encoder removes part of the content information when encoding the speaker's content, leading to more errors in the converted content, is solved; moreover, the technical solution of the embodiments of the present application is simple to implement, easy to popularize, and widely applicable.
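The two training signals just described can be illustrated with a deliberately tiny numerical sketch. Every "network" below is a single made-up scalar weight, and both losses are plain mean squared error; only the wiring — the content encoder's output feeding both the decoder and a separate supervision loss — mirrors the method above. The weights, learning rate, and hand-derived gradients are all assumptions for the example, not details from the patent.

```python
# Tiny numerical sketch of the two training signals: a supervision loss
# on the content encoder's output and a reconstruction loss on the
# decoder's output. Each "network" is a single made-up scalar weight;
# the gradients are hand-derived for these scalars and illustrative only.

def mse(a, b):
    # Mean squared error between two equal-length sequences.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

features = [1.0, 2.0, 3.0]        # original acoustic features (made up)
supervision = [1.0, 2.0, 3.0]     # supervision sequence (made up)
w_content, w_decoder = 0.5, 0.5   # scalar stand-ins for whole networks
lr = 0.1
n = len(features)

for _ in range(200):
    content = [w_content * x for x in features]    # content encoder output
    predicted = [w_decoder * c for c in content]   # decoder output
    # Supervision loss trains the content encoder on its own;
    # reconstruction loss trains the rest of the model.
    sup_loss = mse(content, supervision)
    rec_loss = mse(predicted, features)
    g_content = sum(2 * (w_content * x - s) * x
                    for x, s in zip(features, supervision)) / n
    g_decoder = sum(2 * (w_decoder * w_content * x - x) * w_content * x
                    for x in features) / n
    w_content -= lr * g_content
    w_decoder -= lr * g_decoder

# Both weights converge to 1.0: the content encoder matches the
# supervision sequence while the decoder reconstructs the features.
```

The point of the sketch is the separation of gradients: the supervision loss reaches only the content encoder's weight, so the encoder cannot "cheat" by discarding content, while the reconstruction loss drives the model as a whole.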
Example four
Fig. 6 is a schematic structural diagram of a training apparatus for a speech conversion model according to an embodiment of the present application. As shown in fig. 6, the apparatus 600 includes: an encoding module 601, a supervision module 602, a decoding module 603 and a training module 604; wherein,
the encoding module 601 is configured to input the original acoustic features of a voice into a content encoder and a timbre encoder respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder;
the monitoring module 602 is configured to input the content sequence into a content monitoring network, so as to obtain a monitoring sequence output by the content monitoring network; training the content encoder based on the supervisory sequence and the content sequence;
the decoding module 603 is configured to input the content sequence and the timbre vector into a decoder respectively, to obtain predicted acoustic features output by the decoder;
the training module 604 is configured to train the voice conversion model to be trained based on the predicted acoustic features and the original acoustic features.
Further, the supervision module 602 is specifically configured to extract text information from the content sequence through the content supervision network, and to derive the supervision sequence based on the text information.
Further, the supervision module 602 is specifically configured to input the content sequence into a speech recognition acoustic model of the content supervision network, output a phoneme probability sequence through the speech recognition acoustic model, and obtain the supervision sequence based on the phoneme probability sequence.
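As a hedged illustration of the second step — deriving a supervision sequence from the phoneme probability sequence — the sketch below softmaxes made-up per-frame logits and takes the per-frame argmax as the supervision label. Both the logit values and the argmax choice are assumptions for the example, not details taken from the patent.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one frame's phoneme logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# One row of made-up logits per frame, one column per phoneme class.
frame_logits = [[2.0, 0.5, 0.1],
                [0.2, 3.0, 0.3],
                [0.1, 0.4, 2.5]]

# Phoneme probability sequence, as the speech recognition acoustic
# model would emit per frame.
phoneme_probs = [softmax(row) for row in frame_logits]

# Supervision sequence: here simply the most probable phoneme per frame.
supervision = [row.index(max(row)) for row in phoneme_probs]
print(supervision)
```

A soft variant could instead keep the full probability rows as the supervision targets; the hard argmax is chosen here only to keep the example short.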
Further, the supervision module 602 is specifically configured to calculate a loss value of the content encoder for the original acoustic features based on the supervision sequence and the content sequence, and to adjust the model parameters in the content encoder according to that loss value.
Further, the apparatus further includes a prediction module 605 (not shown in the figure), configured to input the original acoustic features of a first user for a first voice and the original acoustic features of a second user for a second voice into a trained voice conversion model respectively, and to obtain, through the voice conversion model, a target voice converted from the first voice and the second voice; wherein the target voice includes the content information of the first voice and the timbre information of the second voice.
Further, the prediction module 605 is specifically configured to input the original acoustic features of the first user for the first voice into the content encoder, to obtain the content sequence of the first voice output by the content encoder; input the original acoustic features of the second user for the second voice into a trained timbre encoder, to obtain the timbre vector of the second voice output by the timbre encoder; input the content sequence of the first voice and the timbre vector of the second voice into a trained decoder respectively, and output predicted fused acoustic features through the decoder; and input the predicted fused acoustic features into a trained vocoder, to obtain the target voice output by the vocoder.
The training apparatus for the voice conversion model described above can execute the method provided by any embodiment of the present application, and has the corresponding functional modules and beneficial effects for executing that method. For technical details not described in detail in this embodiment, reference may be made to the training method of the voice conversion model provided in any embodiment of the present application.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
Example five
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 includes a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. The RAM 703 may also store various programs and data required for the operation of the device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above, such as the training method of the voice conversion model. For example, in some embodiments, the training method of the voice conversion model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method of the voice conversion model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the voice conversion model.
Various implementations of the systems and techniques described here can be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.
It should be understood that the various forms of flows shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (6)

1. A training method for a voice conversion model, the method comprising:
inputting original acoustic features of a voice into a content encoder and a timbre encoder respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder; wherein the voice is a voice adjacent to the previous voice used for training;
inputting the content sequence into a content supervision network, and extracting text information from the content sequence through the content supervision network; obtaining a supervision sequence based on the text information;
calculating a loss value of the content encoder for the original acoustic features based on the supervision sequence and the content sequence;
adjusting model parameters in the content encoder according to the loss value of the content encoder for the original acoustic features;
inputting the content sequence and the timbre vector into a decoder respectively, to obtain predicted acoustic features output by the decoder;
calculating a loss value between the predicted acoustic features and the original acoustic features through a pre-built loss function, and training the voice conversion model to be trained based on the loss value; and
inputting original acoustic features of a first user for a first voice and original acoustic features of a second user for a second voice into the trained voice conversion model respectively, and obtaining, through the voice conversion model, a target voice converted from the first voice and the second voice; wherein the target voice includes content information of the first voice and timbre information of the second voice.
2. The method according to claim 1, wherein the inputting the original acoustic features of the first user for the first voice and the original acoustic features of the second user for the second voice into the trained voice conversion model respectively, and obtaining, through the voice conversion model, the target voice converted from the first voice and the second voice comprises:
inputting the original acoustic features of the first user for the first voice into the content encoder, to obtain a content sequence of the first voice output by the content encoder;
inputting the original acoustic features of the second user for the second voice into a trained timbre encoder, to obtain a timbre vector of the second voice output by the timbre encoder;
inputting the content sequence of the first voice and the timbre vector of the second voice into a trained decoder respectively, and outputting predicted fused acoustic features through the decoder; and
inputting the predicted fused acoustic features into a trained vocoder, to obtain the target voice output by the vocoder.
3. A training apparatus for a voice conversion model, the apparatus comprising: an encoding module, a supervision module, a decoding module, and a training module; wherein
the encoding module is configured to input original acoustic features of a voice into a content encoder and a timbre encoder respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder; wherein the voice is a voice adjacent to the previous voice used for training;
the supervision module is configured to input the content sequence into a content supervision network, extract text information from the content sequence through the content supervision network, and obtain a supervision sequence based on the text information; and to calculate a loss value of the content encoder for the original acoustic features based on the supervision sequence and the content sequence, and adjust model parameters in the content encoder according to the loss value of the content encoder for the original acoustic features;
the decoding module is configured to input the content sequence and the timbre vector into a decoder respectively, to obtain predicted acoustic features output by the decoder;
the training module is configured to calculate a loss value between the predicted acoustic features and the original acoustic features through a pre-built loss function, and train the voice conversion model to be trained based on the loss value; and
a prediction module is configured to input original acoustic features of a first user for a first voice and original acoustic features of a second user for a second voice into the trained voice conversion model respectively, and obtain, through the voice conversion model, a target voice converted from the first voice and the second voice; wherein the target voice includes content information of the first voice and timbre information of the second voice.
4. The apparatus according to claim 3, wherein the prediction module is specifically configured to input the original acoustic features of the first user for the first voice into the content encoder, to obtain a content sequence of the first voice output by the content encoder; input the original acoustic features of the second user for the second voice into a trained timbre encoder, to obtain a timbre vector of the second voice output by the timbre encoder; input the content sequence of the first voice and the timbre vector of the second voice into a trained decoder respectively, and output predicted fused acoustic features through the decoder; and input the predicted fused acoustic features into a trained vocoder, to obtain the target voice output by the vocoder.
5. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-2.
6. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-2.
CN202110950488.4A 2021-08-18 2021-08-18 Training method and device of voice conversion model, electronic equipment and medium Active CN113689868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110950488.4A CN113689868B (en) 2021-08-18 2021-08-18 Training method and device of voice conversion model, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110950488.4A CN113689868B (en) 2021-08-18 2021-08-18 Training method and device of voice conversion model, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN113689868A (en) 2021-11-23
CN113689868B (en) 2022-09-13

Family

ID=78580470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110950488.4A Active CN113689868B (en) 2021-08-18 2021-08-18 Training method and device of voice conversion model, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN113689868B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114007075B (en) * 2021-11-30 2024-11-29 沈阳雅译网络技术有限公司 Gradual compression method for acoustic coding
CN114360557B (en) * 2021-12-22 2022-11-01 北京百度网讯科技有限公司 Voice tone conversion method, model training method, device, equipment and medium
CN114333865B (en) * 2021-12-22 2024-07-19 广州市百果园网络科技有限公司 Model training and tone conversion method, device, equipment and medium
CN116246642A (en) * 2022-12-30 2023-06-09 广州趣丸网络科技有限公司 Voice changing method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883149A (en) * 2020-07-30 2020-11-03 四川长虹电器股份有限公司 Voice conversion method and device with emotion and rhythm
CN113178201A (en) * 2021-04-30 2021-07-27 平安科技(深圳)有限公司 Unsupervised voice conversion method, unsupervised voice conversion device, unsupervised voice conversion equipment and unsupervised voice conversion medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11854562B2 (en) * 2019-05-14 2023-12-26 International Business Machines Corporation High-quality non-parallel many-to-many voice conversion
US11410667B2 (en) * 2019-06-28 2022-08-09 Ford Global Technologies, Llc Hierarchical encoder for speech conversion system
CN111599368B (en) * 2020-05-18 2022-10-18 杭州电子科技大学 Adaptive instance normalized voice conversion method based on histogram matching
CN111785261B (en) * 2020-05-18 2023-07-21 南京邮电大学 Method and system for cross-lingual speech conversion based on disentanglement and interpretive representation
CN112037754B (en) * 2020-09-09 2024-02-09 广州方硅信息技术有限公司 Method for generating speech synthesis training data and related equipment
CN112382308A (en) * 2020-11-02 2021-02-19 天津大学 Zero-order voice conversion system and method based on deep learning and simple acoustic features
CN112365881A (en) * 2020-11-11 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, and training method, device, equipment and medium of corresponding model
CN112365874B (en) * 2020-11-17 2021-10-26 北京百度网讯科技有限公司 Attribute registration of speech synthesis model, apparatus, electronic device, and medium
CN112382271B (en) * 2020-11-30 2024-03-26 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN112365882B (en) * 2020-11-30 2023-09-22 北京百度网讯科技有限公司 Speech synthesis method and model training method, device, equipment and storage medium
CN112634920B (en) * 2020-12-18 2024-01-02 平安科技(深圳)有限公司 Training method and device for speech conversion model based on domain separation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883149A (en) * 2020-07-30 2020-11-03 四川长虹电器股份有限公司 Voice conversion method and device with emotion and rhythm
CN113178201A (en) * 2021-04-30 2021-07-27 平安科技(深圳)有限公司 Unsupervised voice conversion method, unsupervised voice conversion device, unsupervised voice conversion equipment and unsupervised voice conversion medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data; Seung-won Park et al.; arXiv:2005.03295v2 [eess.AS]; 20200814; full text *
Design and implementation of an arbitrary-speaker voice conversion algorithm based on feature separation; Chen Ying; China Masters' Theses Full-text Database; 20210215; full text *

Also Published As

Publication number Publication date
CN113689868A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN113689868B (en) Training method and device of voice conversion model, electronic equipment and medium
CN112466288B (en) Voice recognition method and device, electronic equipment and storage medium
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
US20240028841A1 (en) Speech translation method, device, and storage medium
KR102925589B1 (en) Voice processing method, encoding and decoding method and device, equipment and computer storage medium
US20230178067A1 (en) Method of training speech synthesis model and method of synthesizing speech
CN113051894B (en) Text error correction method and device
CN113674732B (en) Speech confidence detection method, device, electronic device and storage medium
CN114023342B (en) Voice conversion method, device, storage medium and electronic equipment
JP7264951B2 (en) Offline speech recognition method, device, electronic device, storage medium and computer program
CN113689866B (en) Training method and device of voice conversion model, electronic equipment and medium
CN114937478B (en) Method for training models, method and apparatus for generating molecules
CN111862961A (en) Method and device for recognizing voice
JP2021117989A (en) Language generation method, device and electronic apparatus
CN113129869B (en) Method and device for training and recognizing voice recognition model
US20230410794A1 (en) Audio recognition method, method of training audio recognition model, and electronic device
CN115662397B (en) Voice signal processing method and device, electronic equipment and storage medium
CN113793598A (en) Training method and data enhancement method, device and equipment for speech processing model
CN113553413A (en) Method, device, electronic device and storage medium for generating dialog state
CN113689867B (en) A training method, device, electronic device and medium for a speech conversion model
CN113468857B (en) Training method and device for style conversion model, electronic equipment and storage medium
CN113793599B (en) Speech recognition model training method and speech recognition method and device
CN115512682A (en) Polyphone pronunciation prediction method, device, electronic equipment and storage medium
CN113255332A (en) Training and text error correction method and device for text error correction model
CN114023310B (en) Method, apparatus and computer program product for speech data processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant