CN113380231B - Voice conversion method and device and electronic equipment - Google Patents

Voice conversion method and device and electronic equipment

Info

Publication number
CN113380231B
Authority
CN
China
Prior art keywords
audio data
model
target user
acoustic
feature
Prior art date
Legal status
Active
Application number
CN202110660033.9A
Other languages
Chinese (zh)
Other versions
CN113380231A
Inventor
王旭
衷奕
饶丰
魏萌
Current Assignee
Beijing Yiyi Education Technology Co ltd
Original Assignee
Beijing Yiyi Education Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yiyi Education Technology Co ltd
Priority to CN202110660033.9A
Publication of CN113380231A
Application granted
Publication of CN113380231B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16: Vocoder architecture

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Artificial Intelligence
  • Computer Vision & Pattern Recognition
  • Signal Processing
  • Telephonic Communication Services

Abstract

The invention provides a voice conversion method and apparatus and an electronic device. The method includes: determining a speech recognition model, a voice-changing model of a target user, and a vocoder model; extracting a feature vector of source audio data based on the speech recognition model, the feature vector carrying no tone labels; converting the feature vector of the source audio data into acoustic features of the target user; and converting the acoustic features of the target user into an audio signal of the target user. Because the speech recognition model is trained on audio data whose text labels carry no tone marks, the feature vectors it extracts from source audio data contain no tone information. This weakens the tone difference between the training stage and the conversion stage, so the source audio data can be converted into acoustic features closer to those of the target user, improving the similarity between the converted audio and the desired audio.

Description

Voice conversion method and device and electronic equipment
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for speech conversion, an electronic device, and a computer-readable storage medium.
Background
Voice-changing technology converts one person's speech into the characteristic voice of another person: the semantic content of the original speech signal is retained, but the speaker's vocal characteristics are altered so that the speech sounds as if it were produced by someone else. Voice changing may convert between male and female voices or between different age groups, or it may convert the voice of speaker A into the voice of speaker B.
Traditional voice-changing techniques first align parallel corpora and then perform timbre conversion. This approach requires collecting audio corpora with identical content and training a conversion model on the aligned spectral features; the resulting voice-changing quality is poor, and the approach cannot satisfy application scenarios with real-time requirements.
In addition, some schemes build a voice-changing model between the hidden-layer features of speech recognition and speech-synthesis features. However, the hidden-layer features used in these schemes still contain some information about the source speaker, so the converted speech retains source-speaker characteristics, which lowers its similarity to the target voice.
Disclosure of Invention
To solve these problems in the prior art, embodiments of the present invention provide a voice conversion method and apparatus, an electronic device, and a computer-readable storage medium.
In a first aspect, an embodiment of the present invention provides a voice conversion method, including:
determining a speech recognition model, a voice-changing model of a target user, and a vocoder model, wherein the speech recognition model is trained on audio data whose text labels carry no tone marks, and the voice-changing model is trained on feature vectors of audio data extracted by the speech recognition model;
acquiring source audio data of a source user, and extracting a feature vector of the source audio data based on the speech recognition model, wherein the feature vector of the source audio data carries no tone labels;
converting the feature vector of the source audio data into acoustic features of the target user based on the voice-changing model; and
inputting the acoustic features of the target user into the vocoder model, and converting the acoustic features of the target user into an audio signal of the target user.
In a second aspect, an embodiment of the present invention further provides an apparatus for voice conversion, including:
a determining module, configured to determine a speech recognition model, a voice-changing model of a target user, and a vocoder model, wherein the speech recognition model is trained on audio data whose text labels carry no tone marks, and the voice-changing model is trained on feature vectors of audio data extracted by the speech recognition model;
a feature extraction module, configured to acquire source audio data of a source user and extract a feature vector of the source audio data based on the speech recognition model, wherein the feature vector of the source audio data carries no tone labels;
a conversion module, configured to convert the feature vector of the source audio data into acoustic features of the target user based on the voice-changing model; and
a vocoder module, configured to input the acoustic features of the target user into the vocoder model and convert them into an audio signal of the target user.
In a third aspect, an embodiment of the present invention provides an electronic device including a bus, a transceiver, a memory, a processor, and a computer program stored on the memory and executable on the processor, where the transceiver, the memory, and the processor are connected via the bus; when executed by the processor, the computer program implements the steps of any of the voice conversion methods described above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of any of the voice conversion methods described above.
Because the method, apparatus, electronic device, and computer-readable storage medium provided by embodiments of the present invention train the speech recognition model on audio data without tone labels, the feature vectors extracted from the source audio data contain no tone information. This weakens the tone difference between the training stage and the conversion stage, so the source audio data can be converted into acoustic features closer to those of the target user, improving both the similarity between the converted audio and the desired audio and the overall conversion quality.
Drawings
To describe the technical solutions in the embodiments of the present invention or the background art more clearly, the drawings needed for them are briefly introduced below.
FIG. 1 is a flowchart illustrating a voice conversion method provided by an embodiment of the present invention;
FIG. 2 is a diagram illustrating the model processing procedure in the voice conversion method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the architecture of the voice-changing model provided by an embodiment of the present invention;
FIG. 4 is a detailed schematic diagram of the voice conversion method provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for voice conversion according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device for performing the voice conversion method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings.
Fig. 1 is a flowchart illustrating a method for voice conversion according to an embodiment of the present invention. As shown in fig. 1, the method includes:
step 101: determining a voice recognition model, determining a voice variation model of a target user, and determining a vocoder model; the voice recognition model is obtained by training on the basis of audio data without tone marks in text marks, and the sound changing model is obtained by training on the basis of feature vectors of the audio data extracted by the voice recognition model.
In this embodiment of the invention, a speech recognition model, a voice-changing model, and a vocoder model are determined in advance. The speech recognition model extracts feature vectors from audio data; the voice-changing model converts feature vectors in audio data into corresponding acoustic features, such as Mel-spectrum features; and the vocoder model converts acoustic features into a corresponding audio signal. The voice-changing model corresponds to the target user, that is, it converts feature vectors into acoustic features bearing the target user's characteristics, so that the audio signal generated by the vocoder model also carries those characteristics. In this way, audio data of other users (such as a source user) can be converted into an audio signal with the characteristics of the target user.
Features extracted by a conventional speech recognition model still retain source-user characteristics, and the input a conventional voice-changing model sees at training time differs from the input it sees at conversion time, so the converted audio has low similarity to the target audio. Specifically, the voice-changing model is trained on audio data of the target user but fed audio data of the source user at conversion time, and such different inputs can hardly produce the same output. The inventors observed experimentally that when the source user deliberately imitates the target user's characteristic pronunciation, the converted audio becomes highly similar to the target user and the overall listening quality improves markedly. Requiring every source user to mimic the target user's speaking style, however, raises the difficulty of use; and in many cases pre-recorded audio is being converted, which cannot imitate the target user at all. By comparison, the inventors further found that the main difference between the converted audio and the desired audio of the target user lies in tone. Embodiments of the present invention therefore remove tone information from the feature vectors to minimize the difference between the training stage and the conversion stage.
Specifically, the speech recognition model in this embodiment is trained on audio data that has text labels but no tone labels, and the voice-changing model is correspondingly trained on feature vectors of audio data extracted by that speech recognition model. In other words, the annotation text of the training audio contains no tones. In Chinese, for example, conventional annotation text usually marks one of five tones (first, second, third, fourth, or neutral) on each syllable; in this embodiment, those tone marks are omitted from the annotation text. A speech recognition model trained on such data outputs no tone information, so even when the training-stage and conversion-stage inputs differ, it can output similar or even identical feature vectors, weakening the tone difference between the two stages. Moreover, because the voice-changing model is trained on tone-free feature vectors, it can convert the source user's audio data into acoustic features bearing the target user's characteristics.
The speech recognition model may be an ASR (Automatic Speech Recognition) acoustic model. The feature vector may be a hidden-layer feature output by the speech recognition model, or the feature output by its last layer. Alternatively, the probability distribution output by the last layer may serve as the feature vector; in particular, it may be a phonetic posteriorgram (PPG) vector.
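As an illustration only (the patent publishes no code), the following sketch shows one way such PPG vectors could be taken from the softmax posteriors of an ASR acoustic model's last layer. PyTorch, the function name extract_ppg, and the model interface are assumptions, not part of the disclosure.

```python
import torch
import torch.nn.functional as F

def extract_ppg(asr_model: torch.nn.Module, speech_features: torch.Tensor) -> torch.Tensor:
    """Return frame-level PPG vectors for one utterance.

    speech_features: (frames, feat_dim) tensor of speech features (e.g. MFCCs).
    The ASR acoustic model is assumed to map a (batch, frames, feat_dim)
    tensor to per-frame logits over a toneless phone set, so the softmax
    posteriors carry no tone information.
    """
    asr_model.eval()
    with torch.no_grad():
        logits = asr_model(speech_features.unsqueeze(0))   # (1, frames, n_phones)
    return F.softmax(logits, dim=-1).squeeze(0)            # (frames, n_phones)
```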
Step 102: acquire source audio data of a source user, and extract a feature vector of the source audio data based on the speech recognition model; the feature vector of the source audio data carries no tone labels.
Step 103: convert the feature vector of the source audio data into acoustic features of the target user based on the voice-changing model.
Step 104: input the acoustic features of the target user into the vocoder model, and convert the acoustic features of the target user into an audio signal of the target user.
In this embodiment of the invention, the source user is the user whose audio needs to be converted, and the collected audio produced by the source user is the source audio data; the target user is the target of the voice conversion. For example, if audio data a of user A needs to be converted into audio with the characteristics of user B, then user A is the source user, audio data a is the source audio data, and user B is the target user. After the source audio data to be converted is obtained, it can be converted into an audio signal of the target user based on the predetermined speech recognition model, voice-changing model, and vocoder model.
As shown in fig. 2, after the source audio data of the source user is acquired, a feature vector of the source audio data, such as a PPG, is extracted based on the speech recognition model; as described above, this feature vector carries no tone information. The feature vector is then input into the voice-changing model, which converts it into the corresponding acoustic features; as described above, these are acoustic features of the target user, bearing the target user's characteristics. Finally, the acoustic features are input into the vocoder model, yielding an audio signal with the characteristics of the target user, that is, the audio signal of the target user; when this audio signal is played back, it sounds like the target user.
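Putting the three models together, a minimal sketch of this fig. 2 pipeline follows; it reuses the illustrative extract_ppg helper above, and every interface here is an assumption rather than a disclosed implementation.

```python
import torch

def convert_voice(src_features: torch.Tensor,
                  asr_model: torch.nn.Module,
                  voice_changing_model: torch.nn.Module,
                  vocoder: torch.nn.Module) -> torch.Tensor:
    """End-to-end conversion of one utterance, following fig. 2."""
    ppg = extract_ppg(asr_model, src_features)        # toneless feature vectors
    with torch.no_grad():
        mel = voice_changing_model(ppg.unsqueeze(0))  # target-user acoustic features
        audio = vocoder(mel)                          # target-user audio signal
    return audio.squeeze()
```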
In the voice conversion method provided by this embodiment, the speech recognition model is trained on audio data without tone labels, so the feature vectors it extracts from the source audio data contain no tone information. This weakens the tone difference between the training stage and the conversion stage, allows the source audio data to be converted into acoustic features closer to those of the target user, and improves both the similarity between the converted audio and the desired audio and the overall conversion quality.
On the basis of the above embodiment, "determining a speech recognition model" in step 101 may include:
Step A1: acquire sample audio data, and remove the tone labels from the text labels of the sample audio data.
Step A2: train with the sample audio data as input and the corresponding tone-free text labels as output, generating the speech recognition model.
In this embodiment of the invention, the speech recognition model is trained on sample audio data. Existing training audio can serve as the sample audio data; only the tone marks in its text labels need to be removed. A preset model is then trained with the sample audio data as input and the tone-free text labels as output, producing a speech recognition model that extracts feature vectors free of tone information.
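For illustration, if the text labels transcribe tones as trailing digits on each pinyin syllable (the convention used in the examples later in this description, e.g. "nian2 qing1 ren2"), removing them could be as simple as the following sketch; the digit-suffix convention itself is an assumption.

```python
import re

def strip_tone_marks(pinyin_label: str) -> str:
    """Remove numeric tone marks from a pinyin transcription.

    Assumes tones are written as a digit 0-5 at the end of each syllable,
    e.g. "nian2 qing1 ren2" -> "nian qing ren".
    """
    return re.sub(r"(?<=[a-zA-Z])[0-5]", "", pinyin_label)

assert strip_tone_marks("nian2 qing1 ren2") == "nian qing ren"
assert strip_tone_marks("hao4 zi0 wei3 zhi1") == "hao zi wei zhi"
```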
Optionally, the speech recognition model specifically converts speech features of audio data into feature vectors containing no tone information. In this case, the training in step A2, with the sample audio data as input and the corresponding tone-free text labels as output, includes:
Step A21: extract speech features of the sample audio data.
Step A22: train with the speech features of the sample audio data as input and the corresponding tone-free text labels as output.
In this embodiment of the invention, speech features are first extracted from the sample audio data and the model is trained on those features, so feature vectors can be extracted from audio data more accurately. The speech features may include MFCCs (Mel-Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction) parameters, and the like; mature existing techniques can be used for the extraction, and this embodiment does not limit how the speech features are extracted.
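As one concrete, non-limiting possibility, MFCC speech features could be extracted with an off-the-shelf library such as librosa; the sampling rate and coefficient count below are illustrative assumptions.

```python
import librosa

def extract_mfcc(wav_path: str, sample_rate: int = 16000, n_mfcc: int = 13):
    """Extract MFCC speech features from an audio file.

    Returns a (frames, n_mfcc) array; 16 kHz audio and 13 coefficients
    are illustrative choices, not values fixed by the patent.
    """
    audio, sr = librosa.load(wav_path, sr=sample_rate)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T
```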
Optionally, "determining the voice-changing model of the target user" in step 101 may specifically include:
Step B1: acquire first audio data of sample users, and extract acoustic features of the first audio data; acquire second audio data of the target user, and extract acoustic features of the second audio data.
In this embodiment of the invention, the voice-changing model is trained on two kinds of audio data, namely audio data of sample users and audio data of the target user; for convenience, the former is called first audio data and the latter second audio data. The acoustic features of each kind, i.e., the acoustic features of the first audio data and those of the second audio data, can be extracted with standard acoustic-feature extraction techniques. The sample users are ordinary users, so the first audio data can be plentiful and may simply be the sample audio data described above; the target user is a specific user, so the amount of second audio data is generally small.
Note that the "acoustic features" here and the "speech features" described above are both features extractable from audio data, and mature extraction techniques exist for both. Acoustic features are deeper-level features than speech features and generally must be extracted on top of the speech features.
Step B2: extract a feature vector of the first audio data and a feature vector of the second audio data with the speech recognition model; neither feature vector carries tone labels.
Step B3: train with the feature vector of the first audio data as input and the acoustic features of the first audio data as output, generating a voice-changing baseline model.
Step B4: fine-tune the voice-changing baseline model with the feature vector of the second audio data as input and the acoustic features of the second audio data as output, generating the voice-changing model of the target user.
In this embodiment of the invention, the voice-changing model is trained after the speech recognition model; that is, the speech recognition model is trained first. Once it is available, the feature vectors of the first audio data and of the second audio data are extracted with it, and by construction neither feature vector carries tone labels, i.e., neither contains tone information.
The voice-changing model converts feature vectors into corresponding acoustic features. It is trained in two stages: first, a baseline model, the voice-changing baseline model, is trained on a large amount of first audio data; then the baseline is fine-tuned on a small amount of second audio data. Because the second audio data comes from the target user, the fine-tuned voice-changing model generates acoustic features with the target user's characteristics. When source audio data is later converted, its feature vector can likewise be converted into the acoustic features of this specific target user, realizing the audio conversion.
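A minimal sketch of this two-stage procedure follows; the batch format, optimizer, loss, and hyperparameters are illustrative assumptions, not details disclosed by the patent.

```python
import torch

def train_voice_changing_model(model, loader, epochs, lr):
    """One training stage for the PPG -> acoustic-feature model.

    The same loop serves both stages: run it once over a large
    multi-speaker corpus to obtain the baseline, then again over the
    small target-user set with a lower learning rate to fine-tune.
    Batches are assumed to be (ppg, mel) tensor pairs; the L1 spectral
    loss is an illustrative choice.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()
    model.train()
    for _ in range(epochs):
        for ppg, mel in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(ppg), mel)
            loss.backward()
            optimizer.step()
    return model

# Baseline on many sample users, then fine-tuning on the target user:
# model = train_voice_changing_model(model, sample_user_loader, epochs=50, lr=1e-3)
# model = train_voice_changing_model(model, target_user_loader, epochs=5, lr=1e-4)
```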
Optionally, referring to fig. 3, the voice-changing model includes an encoder, a self-attention layer, a two-layer bidirectional long short-term memory (BiLSTM) layer, and a decoder, where the encoder and the decoder each include a plurality of deep neural network (DNN) layers. The encoder encodes the feature vector of the audio data into a first hidden-layer feature; the first hidden-layer feature passes through the self-attention layer and the two-layer BiLSTM in turn to generate a second hidden-layer feature; and the decoder converts the second hidden-layer feature into the corresponding acoustic features.
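The fig. 3 structure could be realized, for example, as the following PyTorch sketch; the layer widths, head count, and the PPG and Mel dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VoiceChangingModel(nn.Module):
    """Sketch of fig. 3: DNN encoder, self-attention, two-layer BiLSTM,
    DNN decoder. All sizes are illustrative, not disclosed values."""

    def __init__(self, ppg_dim=218, hidden=256, mel_dim=80):
        super().__init__()
        self.encoder = nn.Sequential(            # several DNN layers
            nn.Linear(ppg_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.attention = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.bilstm = nn.LSTM(hidden, hidden // 2, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.decoder = nn.Sequential(            # several DNN layers
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, ppg):                      # ppg: (batch, frames, ppg_dim)
        h1 = self.encoder(ppg)                   # first hidden-layer feature
        attn, _ = self.attention(h1, h1, h1)     # self-attention over frames
        h2, _ = self.bilstm(attn)                # second hidden-layer feature
        return self.decoder(h2)                  # (batch, frames, mel_dim)
```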
The following example explains in detail how an embodiment of the present invention converts source audio data into acoustic features of the target user. As shown in fig. 4, the ASR acoustic model (i.e., the speech recognition model) extracts a feature vector PPG that contains no tones; for example, the feature vector for 年轻人 ("young people") corresponds to the toneless "nian qing ren" rather than the tonal "nian2 qing1 ren2".
Suppose that when the voice-changing model is trained, the target user's second audio data is the utterance 耗子尾汁, whose tonal pinyin is "hao4 zi0 wei3 zhi1", and its acoustic features (i.e., Mel spectrum) are extracted from that recording. Because the ASR acoustic model extracts no tone information, the PPG it produces corresponds to the toneless "hao zi wei zhi". The voice-changing model is therefore trained with the feature vector for "hao zi wei zhi" as input and the acoustic features of the "hao4 zi0 wei3 zhi1" utterance as output.
After training, suppose the source user's input at conversion time is the normally pronounced 好自为之 ("conduct yourself well"), with tonal pinyin "hao3 zi4 wei2 zhi1". Although this differs from the target user's audio data 耗子尾汁, the feature vector contains no tone information, so the ASR acoustic model still extracts the same or a very similar feature vector, "hao zi wei zhi", which the voice-changing model can readily convert into acoustic features with the target user's characteristics, corresponding to "hao4 zi0 wei3 zhi1". Thus, even though the training input (耗子尾汁) and the conversion input (好自为之) differ, the same or a similar feature vector "hao zi wei zhi" is extracted in both stages, allowing the voice-changing model to produce acoustic features with the target user's characteristics and improving the conversion quality.
On the basis of the above embodiments, the vocoder model may be a general-purpose vocoder or a vocoder adapted to the target user; this embodiment adopts a vocoder adapted to the target user to further improve the conversion quality. Here, "determining a vocoder model" in step 101 specifically includes:
Step C1: acquire third audio data of sample users, and extract acoustic features and audio signals of the third audio data; acquire fourth audio data of the target user, and extract acoustic features and audio signals of the fourth audio data.
As with the voice-changing model, determining the vocoder model requires a large amount of audio data from sample users (the third audio data) and a small amount from the target user (the fourth audio data). The acoustic features of the third and fourth audio data can be extracted with mature existing techniques, as can their audio signals. In this embodiment, an audio signal is data that can be played back; "audio data" is essentially no different from an audio signal, i.e., audio data can generally serve directly as the corresponding audio signal.
Training the vocoder model is independent of training the speech recognition model and the voice-changing model; for example, the vocoder model may be trained first or while the speech recognition model is being trained, and this embodiment does not limit when the vocoder model is trained. The third audio data may be the same as or different from the first audio data, and the fourth audio data may be the same as or different from the second audio data.
Step C2: train with the acoustic features of the third audio data as input and the audio signal of the third audio data as output, generating a vocoder baseline model.
Step C3: fine-tune the vocoder baseline model with the acoustic features of the fourth audio data as input and the audio signal of the fourth audio data as output, generating the vocoder model of the target user.
Similar to the training of the voice-changing model, the vocoder model is trained in two stages: a vocoder baseline model is first trained on a large amount of third audio data and is then fine-tuned on a small amount of fourth audio data, yielding a vocoder model better adapted to the target user. In subsequent processing, this vocoder model can better synthesize the audio signal of the target user from the converted acoustic features.
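For completeness, a sketch of using the fine-tuned vocoder at conversion time follows; the generator interface is an assumption loosely modeled on common neural vocoders, and fine-tuning the vocoder itself would follow the same baseline-then-fine-tune loop sketched for the voice-changing model, with (acoustic feature, waveform) batches.

```python
import torch

def synthesize(vocoder: torch.nn.Module, mel: torch.Tensor) -> torch.Tensor:
    """Synthesize a waveform from target-user acoustic features.

    Assumed interface: (batch, frames, mel_dim) Mel features in,
    raw waveform samples out.
    """
    vocoder.eval()
    with torch.no_grad():
        return vocoder(mel).squeeze()
```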
Those skilled in the art will understand that the voice-changing model and the vocoder model may be trained on audio data of multiple sample users, i.e., there may be more than one sample user, whereas the target user is one specific user. A separate voice-changing model and vocoder model must be prepared for each different target user.
The voice conversion method provided by the embodiments of the present invention is described in detail above; the method can also be implemented by a corresponding apparatus.
FIG. 5 is a schematic structural diagram of an apparatus for voice conversion according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes:
a determining module 51, configured to determine a speech recognition model, a voice-changing model of a target user, and a vocoder model, wherein the speech recognition model is trained on audio data whose text labels carry no tone marks, and the voice-changing model is trained on feature vectors of audio data extracted by the speech recognition model;
a feature extraction module 52, configured to acquire source audio data of a source user and extract a feature vector of the source audio data based on the speech recognition model, wherein the feature vector of the source audio data carries no tone labels;
a conversion module 53, configured to convert the feature vector of the source audio data into acoustic features of the target user based on the voice-changing model; and
a vocoder module 54, configured to input the acoustic features of the target user into the vocoder model and convert them into an audio signal of the target user.
On the basis of the above embodiment, the determining module 51 determines the speech recognition model by:
acquiring sample audio data, and removing the tone labels from the text labels of the sample audio data; and
training with the sample audio data as input and the corresponding tone-free text labels as output to generate the speech recognition model.
On the basis of the above embodiment, the determining module 51 trains with the sample audio data as input and the corresponding tone-free text labels as output by:
extracting speech features of the sample audio data; and
training with the speech features of the sample audio data as input and the corresponding tone-free text labels as output.
On the basis of the above embodiment, the determining module 51 determines the voice-changing model of the target user by:
acquiring first audio data of sample users, and extracting acoustic features of the first audio data; acquiring second audio data of the target user, and extracting acoustic features of the second audio data;
extracting a feature vector of the first audio data and a feature vector of the second audio data according to the speech recognition model, wherein neither feature vector carries tone labels;
training with the feature vector of the first audio data as input and the acoustic features of the first audio data as output to generate a voice-changing baseline model; and
fine-tuning the voice-changing baseline model with the feature vector of the second audio data as input and the acoustic features of the second audio data as output to generate the voice-changing model of the target user.
On the basis of the above embodiment, the voice-changing model includes an encoder, a self-attention layer, a two-layer bidirectional long short-term memory layer, and a decoder, where the encoder and the decoder each include a plurality of deep neural network layers;
the encoder is configured to encode the feature vector of the audio data into a first hidden-layer feature; the first hidden-layer feature passes through the self-attention layer and the two-layer bidirectional long short-term memory layer in turn to generate a second hidden-layer feature; and the decoder is configured to convert the second hidden-layer feature into corresponding acoustic features.
On the basis of the above embodiment, the determining module 51 determines the vocoder model by:
acquiring third audio data of sample users, and extracting acoustic features and audio signals of the third audio data; acquiring fourth audio data of the target user, and extracting acoustic features and audio signals of the fourth audio data;
training with the acoustic features of the third audio data as input and the audio signals of the third audio data as output to generate a vocoder baseline model; and
fine-tuning the vocoder baseline model with the acoustic features of the fourth audio data as input and the audio signals of the fourth audio data as output to generate the vocoder model of the target user.
On the basis of the above embodiment, the feature vector is a phonetic posteriorgram (PPG) vector.
In addition, an embodiment of the present invention further provides an electronic device including a bus, a transceiver, a memory, a processor, and a computer program stored in the memory and executable on the processor, where the transceiver, the memory, and the processor are each connected via the bus. When executed by the processor, the computer program implements the processes of the voice conversion method embodiments above and achieves the same technical effects, which are not repeated here to avoid repetition.
Specifically, referring to fig. 6, an embodiment of the present invention further provides an electronic device, which includes a bus 1110, a processor 1120, a transceiver 1130, a bus interface 1140, a memory 1150, and a user interface 1160.
In an embodiment of the present invention, the electronic device further includes: a computer program stored on the memory 1150 and executable on the processor 1120, the computer program, when executed by the processor 1120, implementing the various processes of the method embodiments of speech conversion described above.
A transceiver 1130 for receiving and transmitting data under the control of the processor 1120.
In embodiments of the invention in which a bus architecture (represented by bus 1110) is used, bus 1110 may include any number of interconnected buses and bridges, with bus 1110 connecting various circuits including one or more processors, represented by processor 1120, and memory, represented by memory 1150.
Bus 1110 represents one or more of any of several types of bus structures, including a memory bus and memory controller, a peripheral bus, an Accelerated Graphics Port (AGP), a processor, or a local bus using any of a variety of bus architectures. By way of example and not limitation, such architectures include: the Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Processor 1120 may be an integrated circuit chip with signal-processing capability. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits in hardware or by software instructions in the processor. Such a processor includes: a general-purpose processor, a Central Processing Unit (CPU), a Network Processor (NP), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Complex Programmable Logic Device (CPLD), a Programmable Logic Array (PLA), a Micro Control Unit (MCU), or another programmable logic device, discrete gate, transistor logic device, or discrete hardware component, and can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. For example, the processor may be a single-core or multi-core processor, and may be integrated on a single chip or distributed across multiple chips.
Processor 1120 may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or may be performed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), a register, and other readable storage media known in the art. The readable storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor.
The bus 1110 may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, and the bus interface 1140 provides an interface between the bus 1110 and the transceiver 1130; as these are well known in the art, they are not described further in the embodiments of the present invention.
The transceiver 1130 may be one element or may be multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. For example: the transceiver 1130 receives external data from other devices, and the transceiver 1130 transmits data processed by the processor 1120 to other devices. Depending on the nature of the computer system, a user interface 1160 may also be provided, such as: touch screen, physical keyboard, display, mouse, speaker, microphone, trackball, joystick, stylus.
It is to be appreciated that in an embodiment of the invention, the memory 1150 may further include memory located remotely from the processor 1120, and such remote memory may be connected to the server via a network. One or more portions of such a network may be an ad hoc network, an intranet, an extranet, a Virtual Private Network (VPN), a Local Area Network (LAN), a Wireless Local Area Network (WLAN), a Wide Area Network (WAN), a Wireless Wide Area Network (WWAN), a Metropolitan Area Network (MAN), the Internet, a Public Switched Telephone Network (PSTN), a Plain Old Telephone Service (POTS) network, a cellular telephone network, a wireless fidelity (Wi-Fi) network, or a combination of two or more of the above. For example, the cellular telephone network or wireless network may be a Global System for Mobile Communications (GSM) system, a Code Division Multiple Access (CDMA) system, a Worldwide Interoperability for Microwave Access (WiMAX) system, a General Packet Radio Service (GPRS) system, a Wideband Code Division Multiple Access (WCDMA) system, a Long Term Evolution (LTE) system, an LTE Frequency Division Duplex (FDD) system, an LTE Time Division Duplex (TDD) system, a Long Term Evolution-Advanced (LTE-A) system, a Universal Mobile Telecommunications System (UMTS), an enhanced Mobile Broadband (eMBB) system, a massive Machine Type Communication (mMTC) system, an Ultra-Reliable Low-Latency Communication (URLLC) system, or the like.
It is to be understood that the memory 1150 in embodiments of the present invention can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. Wherein the nonvolatile memory includes: read-Only Memory (ROM), programmable Read-Only Memory (PROM), erasable Programmable Read-Only Memory (EPROM), electrically Erasable Programmable Read-Only Memory (EEPROM), or Flash Memory (Flash Memory).
The volatile memory includes: Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as: Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 1150 of the electronic device described in the embodiments of the invention includes, but is not limited to, the above and any other suitable types of memory.
In an embodiment of the present invention, memory 1150 stores the following elements of operating system 1151 and application programs 1152: an executable module, a data structure, or a subset thereof, or an expanded set thereof.
Specifically, the operating system 1151 includes various system programs such as: a framework layer, a core library layer, a driver layer, etc. for implementing various basic services and processing hardware-based tasks. Applications 1152 include various applications such as: media Player (Media Player), browser (Browser), used to implement various application services. Programs that implement methods in accordance with embodiments of the present invention can be included in application programs 1152. The application programs 1152 include: applets, objects, components, logic, data structures, and other computer system executable instructions that perform particular tasks or implement particular abstract data types.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements each process of the foregoing method for voice conversion, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The computer-readable storage medium includes: permanent and non-permanent, removable and non-removable media may be tangible devices that retain and store instructions for use by an instruction execution apparatus. The computer-readable storage medium includes: electronic memory devices, magnetic memory devices, optical memory devices, electromagnetic memory devices, semiconductor memory devices, and any suitable combination of the foregoing. The computer-readable storage medium includes: phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), non-volatile random access memory (NVRAM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic tape cartridge storage, magnetic tape disk storage or other magnetic storage devices, memory sticks, mechanically encoded devices (e.g., punched cards or raised structures in a groove having instructions recorded thereon), or any other non-transmission medium useful for storing information that may be accessed by a computing device. As defined in embodiments of the present invention, the computer-readable storage medium does not include transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses traveling through a fiber optic cable), or electrical signals transmitted through a wire.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, electronic device and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to solve the problem to be solved by the embodiment of the invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing beyond the prior art, or in whole or in part, may be embodied in a software product stored in a storage medium and including instructions for causing a computer device (such as a personal computer, a server, a data center, or another network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes the various media listed above that can store program code.
In the description of the embodiments of the present invention, it should be apparent to those skilled in the art that the embodiments of the present invention can be embodied as methods, apparatuses, electronic devices, and computer-readable storage media. Thus, embodiments of the invention may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), a combination of hardware and software. Furthermore, in some embodiments, embodiments of the invention may also be embodied in the form of a computer program product in one or more computer-readable storage media having computer program code embodied in the medium.
The computer-readable storage media described above may take any combination of one or more computer-readable storage media. The computer-readable storage medium includes: an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium include: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only Memory (ROM), an erasable programmable read-only Memory (EPROM), a Flash Memory (Flash Memory), an optical fiber, a compact disc read-only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any combination thereof. In embodiments of the invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, device, or apparatus.
The computer program code embodied on the computer readable storage medium may be transmitted using any appropriate medium, including: wireless, wire, fiber optic cable, radio Frequency (RF), or any suitable combination thereof.
Computer program code for carrying out operations for embodiments of the present invention may be written in assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, integrated circuit configuration data, or in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as C or similar languages. The computer program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer.
The method, the apparatus, and the electronic device are described above with reference to flowcharts and/or block diagrams.
It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions. These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing apparatus to function in a particular manner. Thus, the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The above description is only a specific implementation of the embodiments of the present invention, but the scope of the embodiments of the present invention is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present invention, and all such changes or substitutions should be covered by the scope of the embodiments of the present invention. Therefore, the protection scope of the embodiments of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech conversion, comprising:
determining a speech recognition model, determining a voice-changing model of a target user, and determining a vocoder model; wherein the speech recognition model is trained on audio data whose text labels carry no tone marks, the voice-changing model is trained on feature vectors of audio data extracted by the speech recognition model, and the feature vectors of audio data extracted by the speech recognition model contain no tone information;
acquiring source audio data of a source user, and extracting a feature vector of the source audio data based on the speech recognition model, wherein the feature vector of the source audio data carries no tone labels;
converting the feature vector of the source audio data into acoustic features of the target user based on the voice-changing model; and
inputting the acoustic features of the target user into the vocoder model, and converting the acoustic features of the target user into an audio signal of the target user.
2. The method of claim 1, wherein the determining a speech recognition model comprises:
acquiring sample audio data, and removing the tone labels from the text labels of the sample audio data; and
training with the sample audio data as input and the corresponding tone-free text labels as output to generate the speech recognition model.
3. The method of claim 2, wherein the training with the sample audio data as input and the corresponding tone-free text labels as output comprises:
extracting speech features of the sample audio data; and
training with the speech features of the sample audio data as input and the corresponding tone-free text labels as output.
4. The method of claim 1, wherein the determining a voice-changing model of a target user comprises:
acquiring first audio data of sample users, and extracting acoustic features of the first audio data; acquiring second audio data of the target user, and extracting acoustic features of the second audio data;
extracting a feature vector of the first audio data and a feature vector of the second audio data according to the speech recognition model, wherein neither feature vector carries tone labels;
training with the feature vector of the first audio data as input and the acoustic features of the first audio data as output to generate a voice-changing baseline model; and
fine-tuning the voice-changing baseline model with the feature vector of the second audio data as input and the acoustic features of the second audio data as output to generate the voice-changing model of the target user.
5. The method of claim 4, wherein the voice-changing model comprises an encoder, a self-attention layer, a two-layer bidirectional long short-term memory layer, and a decoder, the encoder and the decoder each comprising a plurality of deep neural network layers;
the encoder is configured to encode the feature vector of the audio data into a first hidden-layer feature; the first hidden-layer feature passes through the self-attention layer and the two-layer bidirectional long short-term memory layer in turn to generate a second hidden-layer feature; and the decoder is configured to convert the second hidden-layer feature into corresponding acoustic features.
6. The method of claim 1, wherein determining the vocoder model comprises:
acquiring third audio data of a sample user, and extracting acoustic features and an audio signal of the third audio data; acquiring fourth audio data of the target user, and extracting acoustic features and an audio signal of the fourth audio data;
training with the acoustic features of the third audio data as input and the audio signal of the third audio data as output to generate a vocoder baseline model;
and fine-tuning the vocoder baseline model with the acoustic features of the fourth audio data as input and the audio signal of the fourth audio data as output to generate the vocoder model of the target user.
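
(Informative example, not part of the claimed subject matter: one assumed way to prepare the (acoustic feature, audio signal) training pairs of claim 6, taking the mel spectrogram as the acoustic feature; the claims do not fix a specific representation.)

    import torchaudio

    def acoustic_and_signal(wav_path, sample_rate=22050, n_mels=80):
        """Build one (acoustic feature, audio signal) vocoder training pair."""
        waveform, sr = torchaudio.load(wav_path)
        if sr != sample_rate:
            waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=n_mels)(waveform)
        return mel, waveform   # vocoder training input and target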
7. The method according to any one of claims 1-6, wherein the feature vector is a speech posterior probability vector (phonetic posteriorgram).
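
(Informative example, not part of the claimed subject matter: a speech posterior probability vector can be read off as the frame-wise softmax over the recognizer's phone classes, assuming the acoustic model exposes per-frame logits.)

    import torch

    def extract_ppg(asr_acoustic_model, speech_features):
        """Read a phonetic posteriorgram off the recognizer.

        speech_features: (1, T, D) frames; the recognizer's acoustic model
        is assumed to return per-frame logits over tone-free phone classes,
        whose softmax rows are the posterior probability vectors.
        """
        with torch.no_grad():
            logits = asr_acoustic_model(speech_features)   # (1, T, n_phones)
        return torch.softmax(logits, dim=-1)               # each row sums to 1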
8. An apparatus for speech conversion, comprising:
a determining module, configured to determine a speech recognition model, determine a voice-change model of a target user, and determine a vocoder model; wherein the speech recognition model is trained on audio data whose text labels have had their tone labels removed, the voice-change model is trained on feature vectors of audio data extracted by the speech recognition model, and those feature vectors contain no tone information;
a feature extraction module, configured to acquire source audio data of a source user and extract a feature vector of the source audio data based on the speech recognition model, wherein the feature vector of the source audio data carries no tone labels;
a conversion module, configured to convert the feature vector of the source audio data into acoustic features of the target user based on the voice-change model;
and a vocoder module, configured to input the acoustic features of the target user into the vocoder model, and convert the acoustic features of the target user into an audio signal of the target user.
9. An electronic device, comprising a bus, a transceiver, a memory, a processor, and a computer program stored on the memory and executable on the processor, the transceiver, the memory, and the processor being connected via the bus, characterized in that the computer program, when executed by the processor, performs the steps of the speech conversion method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, performs the steps of the speech conversion method according to any one of claims 1 to 7.
CN202110660033.9A 2021-06-15 2021-06-15 Voice conversion method and device and electronic equipment Active CN113380231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110660033.9A CN113380231B (en) 2021-06-15 2021-06-15 Voice conversion method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN113380231A (en) 2021-09-10
CN113380231B (en) 2023-01-24

Family

ID=77574380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110660033.9A Active CN113380231B (en) 2021-06-15 2021-06-15 Voice conversion method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113380231B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8224648B2 (en) * 2007-12-28 2012-07-17 Nokia Corporation Hybrid approach in voice conversion
CN109754778B (en) * 2019-01-17 2023-05-30 平安科技(深圳)有限公司 Text speech synthesis method and device and computer equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661675A (en) * 2009-09-29 2010-03-03 苏州思必驰信息科技有限公司 Self-sensing error tone pronunciation learning method and system
CN111583944A (en) * 2019-01-30 2020-08-25 北京搜狗科技发展有限公司 Sound changing method and device
CN110544470A (en) * 2019-09-11 2019-12-06 拉扎斯网络科技(上海)有限公司 voice recognition method and device, readable storage medium and electronic equipment
CN110853629A (en) * 2019-11-21 2020-02-28 中科智云科技有限公司 Speech recognition digital method based on deep learning
CN112331207A (en) * 2020-09-30 2021-02-05 音数汇元(上海)智能科技有限公司 Service content monitoring method and device, electronic equipment and storage medium
CN112750446A (en) * 2020-12-30 2021-05-04 标贝(北京)科技有限公司 Voice conversion method, device and system and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Source-to-target voice conversion based on RBF neural networks; Wang Haixiang; Electronic Measurement Technology; 2006-12-22 (Issue 06), pp. 65-68 *

Also Published As

Publication number Publication date
CN113380231A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN110223705B (en) Voice conversion method, device, equipment and readable storage medium
CN111667814B (en) Multilingual speech synthesis method and device
CN111276120B (en) Speech synthesis method, apparatus and computer-readable storage medium
CN110335587B (en) Speech synthesis method, system, terminal device and readable storage medium
JP2023542685A (en) Speech recognition method, speech recognition device, computer equipment, and computer program
CN112185363B (en) Audio processing method and device
WO2023030235A1 (en) Target audio output method and system, readable storage medium, and electronic apparatus
CN112786004A (en) Speech synthesis method, electronic device, and storage device
CN111508469A (en) Text-to-speech conversion method and device
CN111161695A (en) Song generation method and device
CN112530400A (en) Method, system, device and medium for generating voice based on text of deep learning
CN113450760A (en) Method and device for converting text into voice and electronic equipment
CN114678032A (en) Training method, voice conversion method and device and electronic equipment
CN113327576B (en) Speech synthesis method, device, equipment and storage medium
CN112908293B (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN113782042A (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN116092475B (en) Stuttering voice editing method and system based on context-aware diffusion model
CN113380231B (en) Voice conversion method and device and electronic equipment
WO2023116243A1 (en) Data conversion method and computer storage medium
KR102198598B1 (en) Method for generating synthesized speech signal, neural vocoder, and training method thereof
CN111105781A (en) Voice processing method, device, electronic equipment and medium
CN112863486B (en) Voice-based spoken language evaluation method and device and electronic equipment
CN116913244A (en) Speech synthesis method, equipment and medium
CN113506563A (en) Pronunciation recognition method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant